Brendan's Sample Work

employee_dataset

Exploring and Running a Decision Tree Model on an Employee Attrition Data Set



Sections:

  • Importing Libraries
  • Importing Data
  • Exploratory Data Analysis
  • Decision Tree Model Used to Predict Whether Each Employee Will Be Active Each Year
    • Decision Tree Results

Importing Libraries


In [137]:
import pandas as pd
import plotly
import cufflinks as cf

# Use cufflinks in offline mode so the plotly charts render in the notebook
# without a plot.ly account.
cf.go_offline()

Importing Data


In [138]:
df = pd.read_csv('MFG10YearTerminationData.csv')

Below are statistics describing the age and length of service of the employees in this data set:

In [139]:
df.describe().drop(['EmployeeID','store_name', 'STATUS_YEAR'], axis=1)
Out[139]:
       age           length_of_service
count  49653.000000  49653.000000
mean   42.077035     10.434596
std    12.427257     6.325286
min    19.000000     0.000000
25%    31.000000     5.000000
50%    42.000000     10.000000
75%    53.000000     15.000000
max    65.000000     26.000000

Here is the 'head' of the dataset:

In [140]:
df.head(10)
Out[140]:
EmployeeID recorddate_key birthdate_key orighiredate_key terminationdate_key age length_of_service city_name department_name job_title store_name gender_short gender_full termreason_desc termtype_desc STATUS_YEAR STATUS BUSINESS_UNIT
0 1318 12/31/2006 0:00 1/3/1954 8/28/1989 1/1/1900 52 17 Vancouver Executive CEO 35 M Male Not Applicable Not Applicable 2006 ACTIVE HEADOFFICE
1 1318 12/31/2007 0:00 1/3/1954 8/28/1989 1/1/1900 53 18 Vancouver Executive CEO 35 M Male Not Applicable Not Applicable 2007 ACTIVE HEADOFFICE
2 1318 12/31/2008 0:00 1/3/1954 8/28/1989 1/1/1900 54 19 Vancouver Executive CEO 35 M Male Not Applicable Not Applicable 2008 ACTIVE HEADOFFICE
3 1318 12/31/2009 0:00 1/3/1954 8/28/1989 1/1/1900 55 20 Vancouver Executive CEO 35 M Male Not Applicable Not Applicable 2009 ACTIVE HEADOFFICE
4 1318 12/31/2010 0:00 1/3/1954 8/28/1989 1/1/1900 56 21 Vancouver Executive CEO 35 M Male Not Applicable Not Applicable 2010 ACTIVE HEADOFFICE
5 1318 12/31/2011 0:00 1/3/1954 8/28/1989 1/1/1900 57 22 Vancouver Executive CEO 35 M Male Not Applicable Not Applicable 2011 ACTIVE HEADOFFICE
6 1318 12/31/2012 0:00 1/3/1954 8/28/1989 1/1/1900 58 23 Vancouver Executive CEO 35 M Male Not Applicable Not Applicable 2012 ACTIVE HEADOFFICE
7 1318 12/31/2013 0:00 1/3/1954 8/28/1989 1/1/1900 59 24 Vancouver Executive CEO 35 M Male Not Applicable Not Applicable 2013 ACTIVE HEADOFFICE
8 1318 12/31/2014 0:00 1/3/1954 8/28/1989 1/1/1900 60 25 Vancouver Executive CEO 35 M Male Not Applicable Not Applicable 2014 ACTIVE HEADOFFICE
9 1318 12/31/2015 0:00 1/3/1954 8/28/1989 1/1/1900 61 26 Vancouver Executive CEO 35 M Male Not Applicable Not Applicable 2015 ACTIVE HEADOFFICE

Exploratory Data Analysis


The bar graph below displays the proportion of employee records in each 'department_name' that have an 'ACTIVE' status over the ten-year period, sorted from low to high:

In [141]:
# Share of each department's records that are ACTIVE, sorted from low to high.
active_by_dept = df[df.STATUS == 'ACTIVE'].groupby('department_name')['STATUS'].count()
total_by_dept = df.groupby('department_name')['STATUS'].count()
(active_by_dept / total_by_dept).sort_values().iplot(kind='bar', barmode='group')
[Bar chart: proportion of ACTIVE records by department, ranging from Information Technology (lowest) to Executive (highest); y-axis 0 to 1.]

This histogram shows the distribution of age in the dataset:

In [142]:
df['age'].iplot(kind='histogram')
[Histogram of 'age': x-axis roughly 20 to 65, y-axis counts up to about 1,200 per bin.]

The bar graph below displays the proportion of employees that remain active at each age:

In [143]:
(df[df.STATUS == 'ACTIVE'].groupby('age')['STATUS'].count() 
 / df.groupby('age')['STATUS'].count()).sort_values().iplot(kind='bar')
[Bar chart: proportion of ACTIVE records at each age from 20 to 65; y-axis 0 to 1.]

According to this graph, employees at most ages are likely to remain active in their jobs. However, younger employees and those aged 60 or 65 are more likely not to remain active.
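
A rough way to put numbers on this observation is to reuse the same active-rate series and look at the extremes of the age range (a sketch, not part of the original notebook; the under-25 cut-off is my own choice):

# Active rate per age, computed the same way as in the chart above.
active_rate = (df[df.STATUS == 'ACTIVE'].groupby('age')['STATUS'].count()
               / df.groupby('age')['STATUS'].count())

# Compare younger employees (under 25, an arbitrary cut-off) with ages 60 and 65
# (assuming both ages appear in the data, as the histogram above suggests).
print(active_rate[active_rate.index < 25].mean())
print(active_rate.loc[[60, 65]])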


Decision Tree Model Used to Predict Whether Each Employee Will Be Active Each Year


Dropping the columns that cannot be used in the decision tree:

In [144]:
df_drop_columns = df.drop(['EmployeeID','gender_full','terminationdate_key','birthdate_key','orighiredate_key', 'recorddate_key'], axis=1)

Converting the categorical variables to 0/1 indicator columns so that they can be used in the decision tree:

In [145]:
cat_feats = ['city_name', 'department_name', 'job_title', 'store_name', 'gender_short', 'termreason_desc', 'termtype_desc', 'STATUS_YEAR', 'STATUS', 'BUSINESS_UNIT']
final_data = pd.get_dummies(df_drop_columns,columns=cat_feats,drop_first=True)
final_data.head()
Out[145]:
age length_of_service city_name_Aldergrove city_name_Bella Bella city_name_Blue River city_name_Burnaby city_name_Chilliwack city_name_Cortes Island city_name_Cranbrook city_name_Dawson Creek ... STATUS_YEAR_2008 STATUS_YEAR_2009 STATUS_YEAR_2010 STATUS_YEAR_2011 STATUS_YEAR_2012 STATUS_YEAR_2013 STATUS_YEAR_2014 STATUS_YEAR_2015 STATUS_TERMINATED BUSINESS_UNIT_STORES
0 52 17 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 53 18 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 54 19 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
3 55 20 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
4 56 21 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0

5 rows × 169 columns
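
A quick sanity check, not part of the original notebook: after get_dummies the frame should contain only numeric columns, which is what scikit-learn's DecisionTreeClassifier requires.

# Sketch: confirm the shape and that no 'object' (string) columns remain.
print(final_data.shape)
print(final_data.dtypes.value_counts())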

Import the decision tree model, split the data into a training set and a test set, train the model on the training set, and then evaluate it on the test set:

In [146]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
In [147]:
X = final_data.drop('STATUS_TERMINATED',axis=1)
y = final_data['STATUS_TERMINATED']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
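
Since terminated records are a small minority of the data, it can be worth checking that the split preserves roughly the same class proportions (a sketch, not part of the original notebook):

# Sketch: class proportions in the training and test targets.
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))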
In [148]:
dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)
Out[148]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
In [149]:
predictions = dtree.predict(X_test)
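
Before looking at the scores, it can be informative to see which columns the fitted tree actually relies on (a sketch, not part of the original notebook; it reuses dtree and X from above):

# Sketch: rank features by the fitted tree's impurity-based importances.
feat_imp = pd.Series(dtree.feature_importances_, index=X.columns)
print(feat_imp.sort_values(ascending=False).head(10))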

Decision Tree Results

In [150]:
from sklearn.metrics import classification_report
In [151]:
print(classification_report(y_test,predictions))
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     14465
          1       1.00      1.00      1.00       431

avg / total       1.00      1.00      1.00     14896

The Decision Tree produced remarkable results, correctly predicting whether each employee would be active in each year 100% of the time.
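
As a further check on this result, one could also print the confusion matrix, which shows exactly how many active and terminated records were classified correctly (a sketch, not part of the original notebook; it reuses y_test and predictions from above):

# Sketch: confusion matrix of true vs. predicted STATUS_TERMINATED labels.
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, predictions))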