A classification approach to strategizing projected growth in user subscriptions
Authors: Moorissa Tjokro, Jager Hartman
A banking institution ran a direct marketing campaign based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed to or not. Our task is to predict whether a client will subscribe to the term deposit based on the given information.
The dataset, data.csv, contains the following fields:
job: type of job (categorical: "admin.","blue-collar","entrepreneur","housemaid","management","retired","self-employed","services","student","technician","unemployed","unknown")
marital_status: marital status (categorical: "divorced","married","single","unknown"; note: "divorced" means divorced or widowed)
credit_default: has credit in default? (categorical: "no","yes","unknown")
housing: has housing loan? (categorical: "no","yes","unknown")
loan: has personal loan? (categorical: "no","yes","unknown")
contact: contact communication type (categorical: "cellular","telephone")
month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
day_of_week: last contact day of the week (categorical: "mon","tue","wed","thu","fri")
duration: last contact duration, in seconds (numeric).
campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
prev_days: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
prev_contacts: number of contacts performed before this campaign and for this client (numeric)
prev_outcomes: outcome of the previous marketing campaign (categorical: "failure","nonexistent","success")
emp_var_rate: employment variation rate - quarterly indicator (numeric)
cons_price_idx: consumer price index - monthly indicator (numeric)
cons_conf_idx: consumer confidence index - monthly indicator (numeric)
euribor3m: euribor 3 month rate - daily indicator (numeric)
nr_employed: number of employees - quarterly indicator (numeric)
subscribed (target variable): has the client subscribed to a term deposit? (binary: "yes","no")
Our original approach was to create a simple poor man's stacking ensemble from relatively simple models. We first tried stacking Naive Bayes, random forest, ExtraTreesClassifier, SVM, and logistic regression together. These models performed well with regard to cross-validation and on the test set, with ROC-AUC scores around 0.78-0.8. However, when submitting to Kaggle, the score dropped to 0.76. Gaussian Naive Bayes was one of the strongest individual performers, though it did not add anything to the ensemble methods. As an aside, we also tried NearestCentroid and KNN, but had issues with consistent predict_proba calls, so we dropped these models as well.
*Note that all model hyperparameters were tuned using GridSearchCV with cv=5. Not all grid searches are included in the notebook due to their long execution times.
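As a sketch of the tuning procedure: the parameter grid below is illustrative only (the actual grids varied per model and are not all shown), and the synthetic data stands in for the campaign data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy data in place of the campaign features; grid values are illustrative.
X, y = make_classification(n_samples=300, random_state=0)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="roc_auc",  # the competition metric
    cv=5,               # the cv=5 setting noted above
)
grid.fit(X, y)
print(grid.best_params_)
```

The same pattern (estimator, grid, `scoring="roc_auc"`, `cv=5`) applies to each of the models discussed above.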
Moving forward, we decided to drop the SVM altogether due to time constraints, and to drop Naive Bayes since it was behaving strangely. We were then left with gradient boosting, AdaBoost, easy ensembles, random forests, and logistic regression with feature selection. The ExtraTrees classifier was left out since the random forests are more powerful and pick up on similar trends. AdaBoost overfit too much and dominated the stacking ensemble. Gradient boosting also overfit quite a lot, but there seemed to be an improvement in the Kaggle scores when using it. Cross-validation and the test set showed ROC-AUC scores around 0.8-0.82 for the ensemble methods incorporating the gradient-boosted trees. This left us with gradient-boosted trees and different implementations of logistic regression, such as AdaBoost, easy ensembles, and other sampling techniques.
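The poor man's stacking described above can be sketched as a soft voting classifier that averages the base models' predicted probabilities; the two estimators and their settings below are illustrative, not our final configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy imbalanced data in place of the campaign data.
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)

# "Poor man's stacking": soft voting averages predict_proba across models.
stack = VotingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)
scores = cross_val_score(stack, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```

Soft voting requires every base model to expose a consistent `predict_proba`, which is why models with unreliable probability estimates were dropped.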
Feature selection showed little improvement when applied to all of the data prior to the voting classifier used for poor man's stacking, largely because of the tree-based models in the ensemble. Instead, applying an RFE(RandomForest()) selector before logistic regression alone seemed to perform best.
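A minimal sketch of that selector-plus-logistic-regression setup, assuming synthetic data and an illustrative number of selected features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# RFE ranks features by random-forest importance and keeps a subset;
# logistic regression is then fit on the selected features only.
pipe = Pipeline([
    ("select", RFE(RandomForestClassifier(n_estimators=50, random_state=0),
                   n_features_to_select=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
print(pipe.named_steps["select"].support_.sum())
```

Wrapping the selector in a Pipeline keeps the selection inside cross-validation folds, avoiding leakage from the test split.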
Standard scaling and min-max scaling performed similarly, with standard scaling having a slight edge. This was contrary to our expectation, since the binary dummy variables lie between 0 and 1, and the standard scaler shifts the 0's to negative values.
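The point about the standard scaler and dummy columns can be seen on a toy binary column (toy values, not the campaign data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

dummy = np.array([[0.], [0.], [0.], [1.]])  # a binary dummy column

minmax = MinMaxScaler().fit_transform(dummy)
standard = StandardScaler().fit_transform(dummy)

print(minmax.ravel())    # stays in [0, 1]
print(standard.ravel())  # centering pushes the 0's below zero
```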
Again contrary to our expectation, both SMOTE and omitting any imbalance-handling technique performed much better than using RandomUnderSampler. Using SMOTE(ratio = 0.5) followed by RandomUnderSampler gave a compromise between the performance of SMOTE alone and RandomUnderSampler alone, though it showed no improvement over SMOTE alone. This was gauged with respect to logistic regression and the poor man's stacking classifier.
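The SMOTE and RandomUnderSampler names above come from the imbalanced-learn library; the combined idea can be sketched in plain NumPy as a simplified stand-in: interpolate synthetic minority points between minority neighbors up to a 0.5 minority/majority ratio, then randomly drop majority points to balance the classes.

```python
import numpy as np

rng = np.random.RandomState(0)
X_maj = rng.normal(0, 1, size=(100, 2))  # majority class (label 0)
X_min = rng.normal(2, 1, size=(10, 2))   # minority class (label 1)

# SMOTE-style oversampling: new minority points interpolated between
# random pairs of existing minority points, up to a 0.5 ratio.
target_min = len(X_maj) // 2
synth = []
while len(X_min) + len(synth) < target_min:
    a, b = X_min[rng.choice(len(X_min), 2, replace=False)]
    synth.append(a + rng.rand() * (b - a))
X_min_over = np.vstack([X_min, np.array(synth)])

# Random undersampling: drop majority points until the classes balance.
keep = rng.choice(len(X_maj), size=len(X_min_over), replace=False)
X_maj_under = X_maj[keep]

print(len(X_min_over), len(X_maj_under))
```

The real SMOTE interpolates between a point and its k nearest minority neighbors rather than arbitrary pairs; this sketch only conveys the shape of the pipeline.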
Another approach we tried was to build an easy ensemble out of the voting classifiers. It overfit, though not as badly as AdaBoost, and the results stayed consistent regardless of the number of classifiers used. A further analysis of the effect of the number of classifiers on the easy ensemble is included at the very end, in the analysis of resampling techniques.
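An easy ensemble in this sense trains each base model on all minority samples plus an equally sized random draw of majority samples, then averages predicted probabilities. A simplified sketch, using logistic regression as an illustrative base learner on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
rng = np.random.RandomState(0)

min_idx = np.where(y == 1)[0]
maj_idx = np.where(y == 0)[0]

# Each member sees a balanced subsample; probabilities are averaged.
probas = []
for _ in range(10):
    sub = np.concatenate([
        min_idx,
        rng.choice(maj_idx, size=len(min_idx), replace=False),
    ])
    clf = LogisticRegression(max_iter=1000).fit(X[sub], y[sub])
    probas.append(clf.predict_proba(X)[:, 1])
avg_proba = np.mean(probas, axis=0)
print(avg_proba.shape)
```

Because every member sees all minority points but a different majority draw, adding more members mostly reduces variance, which is consistent with the flat performance we observed as the number of classifiers grew.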
0.1. Load data and convert unknowns to nulls
0.2. Categorize features based on types
0.3. Define response variable
0.4. Split data into training and test set
1.1. Identify possible associations between dependent and independent variables (includes a scatterplot matrix, density plots, and histograms)
1.2. Convert yes/no values to 1/0
1.3. Create dummies for categorical variables
1.4. Create binary prev_days indicator
1.5. Impute missing values
1.6. Select significant variables
1.7. Deal with imbalanced data (includes oversampling and undersampling techniques)
2.1. Logistic Regression
2.2. Linear SVM
2.3. Kernelized SVM (RBF)
2.4. Naive Bayes
2.5. Stochastic Gradient Descent Classifier
2.6. Nearest Centroid Classifier
2.7. Logistic Regression with Resampled Ensemble
2.8. Logistic Regression with RFE
2.9. Logistic Regression with RFE Lasso
2.10. Model Evaluation (ROC-AUC & F-1 Scores)
3.1. Decision Tree
3.2. Random Forest
3.3. Bagging
3.4. Gradient Boosting
3.5. Adaboost
3.6. Extra Tree Classifier
3.7. Model Evaluation (ROC-AUC & F-1 Scores)
4.1. Poor Man's Stacking using Gradient Boosted Classifier and an Easy Ensemble of Logistic Regressions
4.2. Poor Man's Stacking using Random Forest Classifier and Easy Ensemble of Logistic Regressions
4.3. Model Evaluation
5.1. Sampling Transformation and Analysis
5.2. Sampling Techniques Evaluation on Poor Man's Stacking Algorithm
5.3. Easy Ensembles
5.4. AdaBoost Resampled Ensembles
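Steps 1.2-1.4 of the outline above can be sketched on a toy frame (toy values, though the real prev_days does use 999 for "not previously contacted"):

```python
import pandas as pd

toy = pd.DataFrame({
    "prev_days": [999, 3, 999, 10],
    "contact": ["cellular", "telephone", "cellular", "cellular"],
})

# Binary indicator: was the client previously contacted at all?
toy["prev_contacted"] = (toy["prev_days"] != 999).astype(int)

# One-hot dummies for a categorical column.
toy = pd.get_dummies(toy, columns=["contact"])
print(toy.columns.tolist())
```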
This is the basic step where we load the data and create train and test sets for internal validation.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time

plt.rcParams["figure.dpi"] = 100
np.set_printoptions(precision=3, suppress=True)
Unknown values in the dataset are clean and consistent, encoded as unknown. We can therefore convert them to null values while importing the data.
data = pd.read_csv('data/data.csv', delimiter=',', na_values='unknown')
data.head()
5 rows × 21 columns
Based on these observations, we categorize the independent variables by type below. Note that subscribed is not included because it is the response variable.
age               float64
job                object
marital_status     object
education          object
credit_default     object
housing            object
loan               object
contact            object
month              object
day_of_week        object
duration          float64
campaign          float64
prev_days           int64
prev_contacts       int64
prev_outcomes      object
emp_var_rate      float64
cons_price_idx    float64
cons_conf_idx     float64
euribor3m         float64
nr_employed       float64
subscribed         object
dtype: object
categorical = ['job', 'marital_status', 'education', 'credit_default', 'housing',
               'loan', 'contact', 'month', 'day_of_week', 'prev_outcomes']  # duration removed
continuous = ['age', 'campaign', 'prev_days', 'prev_contacts', 'nr_employed',
              'emp_var_rate', 'cons_price_idx', 'cons_conf_idx', 'euribor3m']

print("Total number of categorical predictors:", str(len(data[categorical].columns.values)))
print("All categorical data as object:", str(data[categorical].dtypes.all() == 'object'), '\n')
print("Total number of continuous predictors:", str(len(data[continuous].columns.values)))
print("All continuous data as float64 or int64:", str(data[continuous].dtypes.all() in ['float64', 'int64']))
Total number of categorical predictors: 10
All categorical data as object: True

Total number of continuous predictors: 9
All continuous data as float64 or int64: True
Since our goal is to predict whether someone will subscribe to the term deposit based on the given information, we define the subscribed variable as our response variable.
no     29238
yes     3712
Name: subscribed, dtype: int64
Note the imbalance between no and yes here. We would like to change no to 0 and yes to 1 as classification values so the response is easier to work with during modeling, but let's explore this more in the next step.
It is also good to see that there are no unknown values in the response, so we don't need to drop any data points or rows.
Note below that we also drop the duration variable, since its use is prohibited in the assignment.
from sklearn.model_selection import train_test_split

subscribed = data.subscribed
data_ = data.drop(["duration", "subscribed"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(
    data_, subscribed == "yes", random_state=0, stratify=subscribed)

print("Size for X_train:", X_train.shape)
print("Size for X_test:", X_test.shape)
print("Size for y_train:", y_train.shape)
print("Size for y_test:", y_test.shape)
Size for X_train: (24712, 19)
Size for X_test: (8238, 19)
Size for y_train: (24712,)
Size for y_test: (8238,)
In this step, we look into the data to understand it before modeling. This understanding leads to some basic data preparation steps that are common across the two model sets required.
pd.plotting.scatter_matrix(X_train[continuous], c=y_train, alpha=.2, figsize=(10, 10));
A few observations we see from the scatter matrix above:
- prev_contacts: clients observed when nr_employed was lower tend to have received a higher number of contacts before a given campaign
- age: older clients tend to coincide with higher values of nr_employed
We also use density plots below to visualize the distribution of those who subscribed and those who did not (y-axis) for each continuous variable (x-axis). A Gaussian kernel density is used to draw inferences about the populations of subscribers vs. non-subscribers.
from scipy.stats import gaussian_kde

def density_calc(array, N=500, bw=0.2):
    """
    Parameters
    ----------
    array : array-like, data to be plotted as density
    N : int, number of points to use to generate density curve
    bw : float, corresponds to bandwidth; smaller results in skinnier bands,
         larger results in wider bands

    Outputs
    -------
    x : array created from np.linspace
    density : points of the density curve created with scipy.stats.gaussian_kde
    """
    density = gaussian_kde(array)
    x = np.linspace(np.min(array), np.max(array), N)
    density.covariance_factor = lambda: bw
    density._compute_covariance()
    return x, density(x)

def plot_density(values, bw=0.2, N=500):
    """
    Parameters
    ----------
    values : list, corresponds to columns in data table to plot
    bw : float, bandwidth parameter to be given to density_calc
    N : int, number of points used to generate density in density_calc

    Output
    ------
    Array of plots for density functions
    """
    df = data.copy()
    fig = plt.figure(figsize=(2 * len(values), 10))
    for i, value in enumerate(values):
        # integer division so the subplot grid size is an int
        axes = fig.add_subplot(2, len(values) // 2 + 1, i + 1)
        x, y = density_calc(df[str(value)][df["subscribed"] == "yes"], bw=bw)
        axes.plot(x, y, 'r', label='subscribed')
        x, y = density_calc(df[str(value)][df["subscribed"] == "no"], bw=bw)
        axes.plot(x, y, 'b', label='did not subscribe')
        axes.legend()
        axes.set_title(value)
        axes.title.set_fontsize(20)
    plt.show()

plot_density(continuous, bw=1)