Brendan's Sample Work

employee_dataset

Exploring and Running a Decision Tree Model on an Employee Attrition Data Set



Sections:

  • Importing Libraries
  • Importing Data
  • Exploratory Data Analysis
  • Decision Tree Model Used to Predict Whether Each Employee Will Be Active Each Year
    • Decision Tree Results

Importing Libraries


In [137]:
import pandas as pd
import plotly
import cufflinks as cf

# Use cufflinks in offline mode so the plotly charts render in the notebook
# without a plot.ly account.
cf.go_offline()

Importing Data


In [138]:
df = pd.read_csv('MFG10YearTerminationData.csv')

Below are statistics describing the age and length of service of the employees in this data set:

In [139]:
df.describe().drop(['EmployeeID','store_name', 'STATUS_YEAR'], axis=1)
Out[139]:
       age           length_of_service
count  49653.000000  49653.000000
mean   42.077035     10.434596
std    12.427257     6.325286
min    19.000000     0.000000
25%    31.000000     5.000000
50%    42.000000     10.000000
75%    53.000000     15.000000
max    65.000000     26.000000

Here is the 'head' of the dataset:

In [140]:
df.head(10)
Out[140]:
EmployeeID recorddate_key birthdate_key orighiredate_key terminationdate_key age length_of_service city_name department_name job_title store_name gender_short gender_full termreason_desc termtype_desc STATUS_YEAR STATUS BUSINESS_UNIT
0 1318 12/31/2006 0:00 1/3/1954 8/28/1989 1/1/1900 52 17 Vancouver Executive CEO 35 M Male Not Applicable Not Applicable 2006 ACTIVE HEADOFFICE
1 1318 12/31/2007 0:00 1/3/1954 8/28/1989 1/1/1900 53 18 Vancouver Executive CEO 35 M Male Not Applicable Not Applicable 2007 ACTIVE HEADOFFICE
2 1318 12/31/2008 0:00 1/3/1954 8/28/1989 1/1/1900 54 19 Vancouver Executive CEO 35 M Male Not Applicable Not Applicable 2008 ACTIVE HEADOFFICE
3 1318 12/31/2009 0:00 1/3/1954 8/28/1989 1/1/1900 55 20 Vancouver Executive CEO 35 M Male Not Applicable Not Applicable 2009 ACTIVE HEADOFFICE
4 1318 12/31/2010 0:00 1/3/1954 8/28/1989 1/1/1900 56 21 Vancouver Executive CEO 35 M Male Not Applicable Not Applicable 2010 ACTIVE HEADOFFICE
5 1318 12/31/2011 0:00 1/3/1954 8/28/1989 1/1/1900 57 22 Vancouver Executive CEO 35 M Male Not Applicable Not Applicable 2011 ACTIVE HEADOFFICE
6 1318 12/31/2012 0:00 1/3/1954 8/28/1989 1/1/1900 58 23 Vancouver Executive CEO 35 M Male Not Applicable Not Applicable 2012 ACTIVE HEADOFFICE
7 1318 12/31/2013 0:00 1/3/1954 8/28/1989 1/1/1900 59 24 Vancouver Executive CEO 35 M Male Not Applicable Not Applicable 2013 ACTIVE HEADOFFICE
8 1318 12/31/2014 0:00 1/3/1954 8/28/1989 1/1/1900 60 25 Vancouver Executive CEO 35 M Male Not Applicable Not Applicable 2014 ACTIVE HEADOFFICE
9 1318 12/31/2015 0:00 1/3/1954 8/28/1989 1/1/1900 61 26 Vancouver Executive CEO 35 M Male Not Applicable Not Applicable 2015 ACTIVE HEADOFFICE

Exploratory Data Analysis


The bar graph below displays the proportion of employee records in each 'department_name' that have an 'ACTIVE' status over the ten-year period, sorted from low to high:

In [141]:
# Share of each department's records that are ACTIVE, sorted from low to high.
active_by_dept = df[df.STATUS == 'ACTIVE'].groupby('department_name')['STATUS'].count()
total_by_dept = df.groupby('department_name')['STATUS'].count()
(active_by_dept / total_by_dept).sort_values().iplot(kind='bar', barmode='group')
[Bar chart: proportion of ACTIVE records by department, ranging from Information Technology (lowest) to Executive (highest); y-axis 0 to 1.]

This histogram shows the distribution of age in the dataset:

In [142]:
df['age'].iplot(kind='histogram')
[Histogram of 'age': x-axis roughly 20 to 65, y-axis counts up to about 1,200 per bin.]

The bar graph below displays the proportion of employees that remain active at each age:

In [143]:
(df[df.STATUS == 'ACTIVE'].groupby('age')['STATUS'].count() 
 / df.groupby('age')['STATUS'].count()).sort_values().iplot(kind='bar')
[Bar chart: proportion of ACTIVE records at each age from 20 to 65; y-axis 0 to 1.]

According to this graph, employees at most ages are likely to remain active in their jobs. However, younger employees and those aged 60 or 65 are more likely not to remain active.
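
A rough way to put numbers on this observation is to reuse the same active-rate series and look at the extremes of the age range (a sketch, not part of the original notebook; the under-25 cut-off is my own choice):

# Active rate per age, computed the same way as in the chart above.
active_rate = (df[df.STATUS == 'ACTIVE'].groupby('age')['STATUS'].count()
               / df.groupby('age')['STATUS'].count())

# Compare younger employees (under 25, an arbitrary cut-off) with ages 60 and 65
# (assuming both ages appear in the data, as the histogram above suggests).
print(active_rate[active_rate.index < 25].mean())
print(active_rate.loc[[60, 65]])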


Decision Tree Model Used to Predict Whether Each Employee Will Be Active Each Year


Dropping the columns that cannot be used in the decision tree:

In [144]:
df_drop_columns = df.drop(['EmployeeID','gender_full','terminationdate_key','birthdate_key','orighiredate_key', 'recorddate_key'], axis=1)

Converting the categorical variables to 0/1 indicator columns so that they can be used in the decision tree:

In [145]:
cat_feats = ['city_name', 'department_name', 'job_title', 'store_name', 'gender_short', 'termreason_desc', 'termtype_desc', 'STATUS_YEAR', 'STATUS', 'BUSINESS_UNIT']
final_data = pd.get_dummies(df_drop_columns,columns=cat_feats,drop_first=True)
final_data.head()
Out[145]:
age length_of_service city_name_Aldergrove city_name_Bella Bella city_name_Blue River city_name_Burnaby city_name_Chilliwack city_name_Cortes Island city_name_Cranbrook city_name_Dawson Creek ... STATUS_YEAR_2008 STATUS_YEAR_2009 STATUS_YEAR_2010 STATUS_YEAR_2011 STATUS_YEAR_2012 STATUS_YEAR_2013 STATUS_YEAR_2014 STATUS_YEAR_2015 STATUS_TERMINATED BUSINESS_UNIT_STORES
0 52 17 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 53 18 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 54 19 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
3 55 20 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
4 56 21 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0

5 rows × 169 columns
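
A quick sanity check, not part of the original notebook: after get_dummies the frame should contain only numeric columns, which is what scikit-learn's DecisionTreeClassifier requires.

# Sketch: confirm the shape and that no 'object' (string) columns remain.
print(final_data.shape)
print(final_data.dtypes.value_counts())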

Import the decision tree model, split the data into a training set and a test set, train the model on the training set, and then evaluate it on the test set:

In [146]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
In [147]:
X = final_data.drop('STATUS_TERMINATED',axis=1)
y = final_data['STATUS_TERMINATED']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
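
Since terminated records are a small minority of the data, it can be worth checking that the split preserves roughly the same class proportions (a sketch, not part of the original notebook):

# Sketch: class proportions in the training and test targets.
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))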
In [148]:
dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)
Out[148]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
In [149]:
predictions = dtree.predict(X_test)
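
Before looking at the scores, it can be informative to see which columns the fitted tree actually relies on (a sketch, not part of the original notebook; it reuses dtree and X from above):

# Sketch: rank features by the fitted tree's impurity-based importances.
feat_imp = pd.Series(dtree.feature_importances_, index=X.columns)
print(feat_imp.sort_values(ascending=False).head(10))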

Decision Tree Results

In [150]:
from sklearn.metrics import classification_report
In [151]:
print(classification_report(y_test,predictions))
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     14465
          1       1.00      1.00      1.00       431

avg / total       1.00      1.00      1.00     14896

The Decision Tree produced remarkable results, correctly predicting whether each employee would be active in each year 100% of the time.
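
As a further check on this result, one could also print the confusion matrix, which shows exactly how many active and terminated records were classified correctly (a sketch, not part of the original notebook; it reuses y_test and predictions from above):

# Sketch: confusion matrix of true vs. predicted STATUS_TERMINATED labels.
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, predictions))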