Banks routinely lose money on loans that eventually default. Per the Federal Reserve, at the height of the financial crisis in 2009-2010, the amount lost approached 500 billion U.S. dollars. More recently, losses each quarter tend to approach 150 billion, and delinquency rates have hovered around 1.5%. Because of this, it is vitally important for banks to keep their delinquencies as low as possible.
- Can we accurately predict loan approval based on historical data?
- How can we confidently determine whether a loan can be approved?
Rationale and Objective:
- If a loan is current, the company is making money and should approve such future loans based on the model.
- If a loan is late or in default, the company is not making money and should reject similar future loans based on the model.
- What factors predict loan approval?
- Which variables best predict if a loan will be a loss and how much is the average loss?
Data¶
The data was retrieved from Kaggle. Lending Club is a peer-to-peer financial company. Essentially, people can request an unsecured loan between 1,000 and 40,000 dollars while other individuals can visit the site and choose to invest in the loans. So, people are lending directly to other people, with Lending Club acting as the facilitator.
Preprocessing¶
The code that was used for preprocessing and EDA can be found in the LoanApproval_LendingClub GitHub repository. First the environment is set up with the dependencies, library options, the seed for reproducibility and the location of the project directory. Then the data is read, duplicate observations are dropped and empty cells are replaced with NA.
import os
import random
import numpy as np
import warnings
from numpy import sort
import pandas as pd
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
seed_value = 42
os.environ['LoanStatus_PreprocessEDA'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
df = pd.read_csv('loan_Master.csv', index_col=False, low_memory=False)
print('- Dimensions of initial data:', df.shape)
df = df.replace(r'^\s*$', np.nan, regex=True)
df1 = df.loc[:, df.isnull().mean() < 0.05]
print('- Dimensions when columns with >= 5% missing values removed:', df1.shape)
s = set(df1)
varDiff = [x for x in df if x not in s]
print('- Number of features removed due to high missingness:'
+ str(len(varDiff)))
df = df1
del df1
Let's now use a function to examine the data types present, the percentage of missing data, the number of unique values, and the number of rows and columns. This can be used to determine which features should remain in the set, as well as the ones that might result in longer runtimes and more difficulty explaining the downstream models.
def data_quality_table(df):
    """Returns the characteristics of variables in a Pandas dataframe."""
    var_type = df.dtypes
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    unique_count = df.nunique()
    mis_val_table = pd.concat([var_type, mis_val_percent, unique_count], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Data Type', 1: 'Percent Missing', 2: 'Number Unique'})
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:,1] >= 0].sort_values(
        'Percent Missing', ascending=False).round(1)
    print('- There are ' + str(df.shape[0]) + ' rows and '
          + str(df.shape[1]) + ' columns.\n')
    return mis_val_table_ren_columns
Categorical Variables (Features)¶
Dimensionality is a problem that frequently occurs when working with categorical features. Variables with a large number of groups or levels will increase the size of the data when dummy variables or one hot encoding is used during modeling. After examining the quality of the data, the features with a high percentage of missing values and a large number of groups were removed from the set.
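As a minimal sketch of why this matters (toy data with hypothetical column names, not the loan set), a feature with many levels expands into one indicator column per level after one hot encoding:
import pandas as pd

toy = pd.DataFrame({'grade': list('ABCDEFG') * 3,
                    'zip_code': [f'{i:03d}xx' for i in range(21)]})
# 7 grade levels -> 7 indicator columns; 21 zip levels -> 21 indicator columns
print(pd.get_dummies(toy, columns=['grade']).shape[1])
print(pd.get_dummies(toy, columns=['zip_code']).shape[1])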
df1 = df.select_dtypes(include = 'object')
print('\n Data Quality: Qualitative Variables')
display(data_quality_table(df1))
print('\n')
print('\nSample observations of qualitative variables:')
display(df1.head())
df = df.drop(['title', 'last_pymnt_d', 'zip_code', 'earliest_cr_line',
'last_credit_pull_d', 'issue_d', 'addr_state', 'sub_grade'],
axis=1)
del df1
Quantitative Variables (Features)¶
Now we can remove the rows with NA/null values in any of the variables identified with the data_quality_table function so that only complete cases without missing data remain in the set.
df1 = df.select_dtypes(exclude = 'object')
print('\n Data Quality: Quantitative Variables')
display(data_quality_table(df1))
print('\n')
print('\nSample observations of quantitative variables:')
display(df1.head())
df = df[df.bc_util.notna() & df.percent_bc_gt_75.notna()
& df.pct_tl_nvr_dlq.notna() & df.mths_since_recent_bc.notna()
& df.dti.notna() & df.inq_last_6mths.notna() & df.num_rev_accts.notna()]
del df1
Dependent Variable (Target): Status of Loan¶
The majority of the individuals in this set are Fully Paid or Current on their loan payments for the loan_status feature; only a minority are not.
print(df.loan_status.value_counts(normalize=True).mul(100).round(2).astype(str) + '%')
Now let's convert loan_status to a binary variable (current = 0 and default = 1) so it can be used as the target (label) for classification modeling. After recoding to binary, a clear class imbalance exists, with 87.35% of the observations on track for completing payments.
status_map = {'Fully Paid': 0,
              'In Grace Period': 0,
              'Current': 0,
              'Charged Off': 1,
              'Late (31-120 days)': 1,
              'Late (16-30 days)': 1,
              'Does not meet the credit policy. Status:Fully Paid': 1,
              'Does not meet the credit policy. Status:Charged Off': 1,
              'Default': 1}
df['loan_status'] = df['loan_status'].replace(status_map)
print('\nExamine Binary Loan Status for Class Imbalance:')
print(df.loan_status.value_counts(normalize=True).mul(100).round(2).astype(str) + '%')
Variable Selection¶
SelectFromModel using XGBoost¶
To prepare the data for feature selection using this approach, let's separate the input features as X and the target as y. Then we can create dummy variables for the categorical features using pandas.get_dummies, dropping the initial features. Now we can fit a baseline model on all of the data using the evaluation metric logloss, specifying the parameters to run on a GPU and the random_state as the seed_value. We can then fit the XGBClassifier model, make predictions and determine the accuracy of the baseline model using accuracy_score.
from xgboost import XGBClassifier, plot_importance
from sklearn.metrics import accuracy_score
X = df.drop('loan_status', axis=1)
y = df.loan_status
X = pd.get_dummies(X, drop_first=True)
model = XGBClassifier(eval_metric='logloss',
use_label_encoder=False,
tree_method='gpu_hist',
gpu_id=0,
random_state=seed_value)
model.fit(X, y)
y_pred = model.predict(X)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y, predictions)
print('Accuracy: %.3f%%' % (accuracy * 100))
The feature importance of the model can be examined using plot_importance from the XGBoost module and the plot parameters can be defined using matplotlib.pyplot.
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 10)
plt.rcParams.update({'font.size': 8.5})
plot_importance(model)
plt.tight_layout()
plt.show();
The ten most important features using the baseline model are total_rec_prncp, out_prncp, last_pymnt_amnt, int_rate, loan_amnt, total_rec_int, installment, total_rec_late_fee, funded_amnt_inv and mo_sin_old_rev_tl_op.
Another method to examine how features affect a model is permutation feature importance. This is defined as the reduction in the model score when a single feature value is shuffled randomly, indicating the extent to which the model relies on the feature. Different numbers of permutations can be tested.
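As a rough sketch of what this computes (assuming the fitted model, X and y from above; the helper name is made up for illustration), the score drop from shuffling one column can be calculated manually:
import numpy as np
from sklearn.metrics import accuracy_score

def manual_permutation_importance(model, X, y, feature, n_repeats=5, seed=42):
    """Average drop in accuracy after randomly shuffling a single feature."""
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y, model.predict(X))
    drops = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        X_perm[feature] = rng.permutation(X_perm[feature].values)
        drops.append(baseline - accuracy_score(y, model.predict(X_perm)))
    return np.mean(drops)

# Example call for a single feature (column name assumed to exist in X):
# print(manual_permutation_importance(model, X, y, 'int_rate'))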
from sklearn.inspection import permutation_importance
perm_importance = permutation_importance(model, X, y)
plt.rcParams.update({'font.size': 7})
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(X.columns[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel('Permutation Importance')
plt.tight_layout()
plt.show();
SHAP values estimate the impact of the individual features on the final result, separately, while conserving the total impact on the final result. This can be considered in conjunction with the previously used feature importance measures when modeling. Let's use SHAP (SHapley Additive exPlanations), published as A Unified Approach to Interpreting Model Predictions, with the TreeExplainer since we are using XGBoost, and generate shap_values for the features. The summary_plot shows the global importance of each feature and the distribution of the effect sizes over the set.
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
plt.rcParams.update({'font.size': 7})
shap.summary_plot(shap_values, X, show=False)
plt.show();
As total_rec_prncp, out_prncp_inv and the last_pymnt_amnt increase, there is a lower probability for a loan to default while as recoveries increases, there is a higher probability the loan will default.
Now, we can define the maximum number of features to test as the current number of features, and the minimum number of features as two. We need to define a metric to test, so let's use the model accuracy. The thresholds can then be defined by sorting the model feature importances in ascending order. A for loop can then iterate through the list of thresholds to fit a model, predict and calculate the model accuracy. The accuracy of each model tested is appended to a list as a new model is evaluated.
from sklearn.feature_selection import SelectFromModel
import time
print('Time for feature selection using XGBoost...')
search_time_start = time.time()
feat_max = X.shape[1]
feat_min = 2
acc_max = accuracy * 100  # keep units consistent with the per-threshold accuracy below
thresholds = sort(model.feature_importances_)
thresh_goal = thresholds[0]
accuracy_list = []
for thresh in thresholds:
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X = selection.transform(X)
    selection_model = XGBClassifier(eval_metric='logloss',
                                    use_label_encoder=False,
                                    tree_method='gpu_hist',
                                    gpu_id=0,
                                    random_state=seed_value)
    selection_model.fit(select_X, y)
    selection_model_pred = selection_model.predict(select_X)
    selection_predictions = [round(value) for value in selection_model_pred]
    accuracy = accuracy_score(y_true=y, y_pred=selection_predictions)
    accuracy = accuracy * 100
    print('Thresh= %.6f, n= %d, Accuracy: %.3f%%' % (thresh, select_X.shape[1],
                                                     accuracy))
    accuracy_list.append(accuracy)
    if (select_X.shape[1] < feat_max) and (select_X.shape[1] >= feat_min) \
            and (accuracy >= acc_max):
        n_min = select_X.shape[1]
        acc_max = accuracy
        thresh_goal = thresh
print('\n')
print('Finished feature selection using XGBoost in:',
time.time() - search_time_start)
print('\n')
print('\nThe optimal threshold is:')
print(thresh_goal)
Now we can subset the features where the optimal threshold occurred into a new set and examine which features were chosen. We can then fit a model with these features as the input to the model, and subsequently generate predictions using all of the available data and evaluate the accuracy of the model.
selection = SelectFromModel(model, threshold=thresh_goal, prefit=True)
feature_names = X.columns[selection.get_support(indices=True)]
print('\n- Feature selection using XGBoost resulted in '
+ str(len(feature_names)) + ' features.')
print('\n- Features selected using optimal threshold for accuracy:')
print(X.columns[selection.get_support()])
print('\n')
X = pd.DataFrame(data=X, columns=feature_names)
model = XGBClassifier(eval_metric='logloss',
use_label_encoder=False,
tree_method='gpu_hist',
gpu_id=0,
random_state=seed_value)
model.fit(X, y)
model.save_model('xgb_featureSelection.model')
y_pred = model.predict(X)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y, predictions)
print('Accuracy: %.3f%%' % (accuracy * 100.0))
45 features were selected and 98.946% was the resulting accuracy. Let's now plot the feature importance graph again and compare to the baseline model.
plt.rcParams.update({'font.size': 10})
plot_importance(model)
plt.tight_layout()
plt.show();
The most important features after SelectFromModel are still led by total_rec_prncp and out_prncp, but int_rate increased in rank relative to last_pymnt_amnt and total_rec_int, and installment increased relative to loan_amnt and total_rec_late_fee. Also, funded_amnt_inv and revol_bal increased in rank relative to mo_sin_old_rev_tl_op.
Then we can examine the permutation based feature importance again and plot the results.
perm_importance = permutation_importance(model, X, y)
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(X.columns[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel('Permutation Importance')
plt.show();
Then we can visualize the feature importance computed with the SHAP values.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
plt.rcParams.update({'font.size': 7})
shap.summary_plot(shap_values, X, show=False)
plt.show();
Compared to the default model, the majority of the features demonstrate the same properties, but loan_amnt is higher than out_prncp and int_rate is higher than total_rec_late_fee.
Group Lasso for Variable Selection¶
Multicollinearity using Variance Inflation Factor (VIF)¶
To check for multicollinearity, we can use the Variance Inflation Factor (VIF) to quantify how much the estimated coefficients are inflated when multicollinearity exists. Let's first separate the input features and target and then select the quantitative features for inspection. We can define a function to calculate VIF with a specified threshold, which iteratively removes the feature with the largest VIF while any VIF exceeds the threshold. The typical threshold is 4-5 to limit multicollinearity, while 10 is considered high multicollinearity. So let's use threshold=5.0 and examine the number of remaining numerical variables using this defined threshold.
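For reference, the VIF of feature j comes from regressing that feature on all of the other features and taking

VIF_j = 1 / (1 - R_j^2)

so the threshold=5.0 used here flags any feature whose variance is more than 80% explained by the remaining features (R_j^2 >= 0.8).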
from joblib import Parallel, delayed
from statsmodels.stats.outliers_influence import variance_inflation_factor
X1 = df.drop('loan_status', axis=1)
y = df.loan_status
df_num = X1.select_dtypes(include = ['float64', 'int64'])
def calculate_vif(X, threshold=5.0):
    features = [X.columns[i] for i in range(X.shape[1])]
    dropped = True
    while dropped:
        dropped = False
        print('\nThe starting number of quantitative features is: '
              + str(len(features)))
        vif = Parallel(n_jobs=-1, verbose=5)(
            delayed(variance_inflation_factor)(X[features].values, ix)
            for ix in range(len(features)))
        maxloc = vif.index(max(vif))
        if max(vif) > threshold:
            print(time.ctime() + ' dropping \'' + X[features].columns[maxloc]
                  + '\' at index: ' + str(maxloc))
            features.pop(maxloc)
            dropped = True
    print('Features Remaining:')
    print([features])
    return X[[i for i in features]]
print('Time for calculating VIF on numerical data using threshold = 5...')
search_time_start = time.time()
X1 = calculate_vif(df_num, 5)
print('\nNumber of quant features after VIF:', X1.shape[1])
print('Finished calculating VIF on numerical data using threshold = 5 in:',
time.time() - search_time_start)
print ('- There are ' + str(X1.shape[0]) + ' rows and '
+ str(X1.shape[1]) + ' columns.\n')
print('\nQuant features remaining after VIF:')
print(X1.columns)
In the initial set, there were 61 quantitative features, and after using VIF with a threshold=5.0, 31 variables remain. Almost half of the variables were removed using this approach!
To note, when SelectFromModel using XGBoost was completed without performing VIF, the resulting score was 98.946%, and when retested after completing VIF, the score was 96.854%. Ensemble-based algorithms can handle multicollinearity, while linear models, depending on the data distribution, cannot.
Now let's concatenate these selected variables with the non-numerical features, including the dependent variable, loan_status. Before a group lasso can be tested for variable selection, the quantitative and categorical variables need to be preprocessed, which consists of scaling the numerical features with the MinMaxScaler, one hot encoding the categorical variables, and then stacking them together column-wise in a sparse matrix. The groups are then extracted from the one hot encoded features since they need to be specified for the group lasso.
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
import scipy.sparse
from group_lasso.utils import extract_ohe_groups
df1 = df.select_dtypes(include = 'object')
df1 = pd.concat([X1, df1], axis=1)
df1 = pd.concat([y, df1], axis=1)
df_num = df1.select_dtypes(include = ['float64', 'int64'])
df_num = df_num.drop(['loan_status'], axis=1)
num_columns = df_num.columns.tolist()
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df_num)
df_cat = df1.select_dtypes(include = 'object')
cat_columns = df_cat.columns.tolist()
ohe = OneHotEncoder()
onehot_data = ohe.fit_transform(df1[cat_columns])
X2 = scipy.sparse.hstack([onehot_data, scipy.sparse.csr_matrix(scaled)])
y = df1['loan_status']
groups = extract_ohe_groups(ohe)
groups = np.hstack([groups, len(cat_columns) + np.arange(len(num_columns))+1])
print('The groups consist of ' + str(groups) + ' for the group lasso.')
To perform the group lasso, let's first define the parameter grid to test, where the same number of iterations is used with differing values for the convergence tolerance (default=1e-5).
Then we can define the parameters for the estimator:
- groups: Iterable that specifies which group each column corresponds to. The categorical variables are the labeled groups.
- group_reg: The regularization coefficient(s) for the group sparsity penalty. Default=0.05.
- l1_reg: The regularization coefficient for the coefficient sparsity penalty. Default=0.05.
- scale_reg: How to scale the group-wise regularisation coefficients. Default='group_size'.
- random_state: The random state used for initialisation of parameters. Default=None. Let's use the same seed for reproducibility (seed_value).
Then the GridSearchCV parameters can be defined:
- estimator: The defined LogisticGroupLasso.
- scoring: Score the model based on accuracy.
- cv: The number of folds for cross validation as 5-fold.
- param_grid: The hyperparameters in params.
Now the models can be fit given the defined parameters in the grid search space to find the best estimator parameters and accuracy.
from group_lasso import LogisticGroupLasso
from sklearn.model_selection import GridSearchCV
from joblib import parallel_backend
LogisticGroupLasso.LOG_LOSSES = True
params = {
'n_iter': [3000],
'tol': [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6]
}
grid = GridSearchCV(estimator=LogisticGroupLasso(
groups=groups, group_reg=0.05, l1_reg=0, scale_reg=None,
supress_warning=True, random_state=seed_value),
scoring='accuracy', cv=5, param_grid=params)
print('Time for feature selection using GroupLasso GridSearchCV...')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
    grid.fit(X2, y)
print('Finished feature selection using GroupLasso GridSearchCV in:',
time.time() - search_time_start)
print('\nGroup Lasso GridSearchCV Feature selection')
print('\nBest Estimator:')
print(grid.best_estimator_)
print('\nBest Parameters:')
print(grid.best_params_)
print('\nBest Accuracy:')
print(grid.best_score_)
print('\nResults from GridSearch CV:')
print(grid.cv_results_)
The best parameters from the grid search are {'n_iter': 3000, 'tol': 0.1} with an accuracy of 0.8735407759559557. However, the rank_test_score using tol=1e-1 and tol=1e-2 generated comparable results.
Now, using these parameters, let's fit the model, predict and extract the results from the performance metrics. We can use the defined model to extract the features selected and then count the number selected. Then a list of the chosen_groups_ in the model as a pandas.Series can be transposed and the values converted to a list to match with the features in the original set. Then the loss over the iterations can be plotted to visualize the results.
gl = LogisticGroupLasso(
groups=groups,
group_reg=0.05,
n_iter=3000,
tol=0.1,
l1_reg=0,
scale_reg=None,
supress_warning=True,
random_state=seed_value,
)
with parallel_backend('threading', n_jobs=-1):
    gl.fit(X2, y)
pred_y = gl.predict(X2)
sparsity_mask = gl.sparsity_mask_
accuracy = (pred_y == y).mean()
print(f'Number of total variables: {len(sparsity_mask)}')
print(f'Number of chosen variables: {sparsity_mask.sum()}')
print(f'Accuracy: {accuracy}')
tdf = pd.Series(list(gl.chosen_groups_)).T
tdf = tdf.values.tolist()
X2 = df1.drop('loan_status', axis=1)
X2 = X2.iloc[:,tdf]
variables = X2.columns.tolist()
print(f'Selected variables from group lasso: {variables}')
plt.rcParams['figure.figsize'] = (7, 5)
plt.rcParams.update({'font.size': 15})
plt.plot(gl.losses_)
plt.tight_layout()
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Group Lasso: Loss over the Number of Iterations')
plt.show()
del X2, tdf, variables
Given that the starting number of features after VIF equaled 75, 19 were selected using the parameters tested in the grid search, which included dti, delinq_2yrs, inq_last_6mths, revol_bal, out_prncp and collections_12_mths_ex_med. The resulting accuracy when fitting this model was 87.354% compared to 98.946% using 45 features when evaluating SelectFromModel with XGBoost. Therefore, this linear-based feature selection method might not be the best approach, but an alternative like the glmnet package, available in both Python and R, might perform better. However, let's test more variable selection methods.
Model-Free Sure Independence Screening (MV-SIS)¶
When I first began exploring different feature selection approaches in an academic research setting, a biostatistician recommended Model-Free Sure Independence Screening (MV-SIS) given the robustness in high dimensional sets where p (predictors) >> n (number of observations). The publication of this approach can be found here.
This method builds on approaches that were proposed beyond correlation ranking for non-linear problems. The MV-based variable screening is model-free because it is defined with conditional and unconditional distribution functions and is able to address linear and nonlinear relationships between the given response and features. MV-SIS is robust to heavy-tailed distributions of features and to outliers because it inherits the robustness property of the conditional distribution function. The sure screening property contained within this approach can also handle multi-class response targets.
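In the notation of the publication, for a feature X_k and a categorical response Y taking levels y_1, ..., y_R, the screening utility is

MV(X_k | Y) = sum over r of P(Y = y_r) * E[ (F(X_k | Y = y_r) - F(X_k))^2 ]

where F(x) = P(X_k <= x) and F(x | Y = y_r) = P(X_k <= x | Y = y_r). Features are ranked by their estimated MV and the top-ranked ones are retained; this is the quantity that the MV function defined below estimates empirically.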
Let's now migrate to R to utilize this approach. The script that was used to process the data utilizing similar approaches can be found here. In summary, the libraries that were used for preprocessing include tidyr, plyr, dplyr and fastDummies.
The preprocessing code before testing this approach can be found here. Let's convert loan_status to binary given the same aforementioned conditions and sample 10% of the data using the caret library. Now, we can write the column names to a .csv file and the row level data without the column names to a .txt file to prepare the data.
data_sampling_vector <- createDataPartition(df$loan_status, p=0.90, list=FALSE)
data_train <- df[data_sampling_vector,]
data_test <- df[-data_sampling_vector,]
a <- data_test
dim(a)
write.table(a, 'names.csv', sep=',', col.names=TRUE, row.names=FALSE)
write(t(a), 'LoanStatus_screen.txt', ncol=ncol(a))
I was provided the source code to utilize this approach for feature selection, but a package called VariableScreening is available on CRAN. We can use source("*.R") to delegate the script that we want R to use, in this case, source("MVSIS.R"). Let's now examine what is contained within the MVSIS.R file.
MVSIS Source Code¶
To compute the criteria in the simulation, multiple functions are required. Let's first start by creating a function M to compute the minimum model size to ensure the inclusion of all active predictors.
Input:
- true.v: the true variable index.
- rank.mtx: the ranked index matrix by screening method for the 1,000 replications; each column corresponds to the ranked index in one replication.
Output:
- M: a vector of the minimum model sizes to ensure the inclusion of all active predictors.
M <- function(true.v, rank.mtx){
  r <- min(dim(rank.mtx)[2], length(rank.mtx))
  M <- c()
  for (j in 1:r){
    M[j] <- max(match(true.v, rank.mtx[,j]))
  }
  return(M)
}
Another function mqtl can be used to compute the 5%, 25%, 50%, 75% and 95% quantiles of the minimum model size out of 1,000 replications.
Input:
- M: a vector of the minimum model sizes to ensure the inclusion of all active predictors.
Output:
- 5%, 25%, 50%, 75%, 95% quantiles of the minimum model sizes out of 1,000 replications.
mqtl <- function(M){
quantile(M, probs=c(0.05,0.25,0.5,0.75,0.95))
}
Another function Sel.rate can be used to compute the proportion that every single active predictor is selected for a given model size, which defaults to c[n/log(n)], in the 1,000 replications.
Input:
- n: The sample size.
- c: Coefficient of the cutoff, i.e. when c=2, cutoff = 2[n/log(n)].
- true.v: The true variable index.
- rank.mtx: The ranked index matrix by screening method for the 1,000 replications; each column corresponds to the ranked index in one replication.
Output:
- rate: The proportions that every single active predictor is selected for a given model size, which defaults to c[n/log(n)], in the 1,000 replications.
Sel.rate <- function(n, c, true.v, rank.mtx){
  d <- c * floor(n / log(n))
  rank.mtx.sel <- rank.mtx[1:d,]
  r <- min(dim(rank.mtx)[2], length(rank.mtx))
  p0 <- length(true.v)
  R <- matrix(0, p0, r)
  rate <- c()
  for (i in 1:p0){
    for (j in 1:r){
      R[i,j] <- (min(abs(rank.mtx.sel[,j] - true.v[i])) == 0)
    }
    rate[i] <- mean(R[i,])
  }
  return(rate)
}
Now, let's define the functions that will be used to compute MV(X,Y), where Y is a discrete value.
Fk <- function(X0, x){
  Fk = c()
  for (i in 1:length(x)){
    Fk[i] = sum(X0 <= x[i]) / length(X0)
  }
  return(Fk)
}

Fkr <- function(X0, Y, yr, x){
  Fkr = c()
  ind_yr = (Y == yr)
  for (i in 1:length(x)){
    Fkr[i] = sum((X0 <= x[i]) * ind_yr) / sum(ind_yr)
  }
  return(Fkr)
}

MV <- function(Xk, Y){
  Fk0 <- Fk(Xk, Xk)
  Yr <- unique(Y)
  MVr <- c()
  for (r in 1:length(Yr)){
    MVr[r] <- (sum(Y == Yr[r]) / length(Y)) * mean((Fkr(Xk, Y, Yr[r], Xk) - Fk0)^2)
  }
  MV <- sum(MVr)
  return(MV)
}
Then the row-level .txt file can be read with byrow=T as a matrix, with the number of columns defined with ncol and loan_status as the first column defining the response. The selection criterion mu can then be computed for each feature. Quantiles of mu can then be used to determine the variable rank. Different values of quantile(mu, A) can be evaluated, where a higher value of A retains fewer features while a lower value selects more variables. Let's use A=0.7 and determine how many features result in the pruned set.
F0 <- matrix(scan('LoanStatus_screen.txt'), ncol=112, byrow=T)
response <- F0[,1]
mu <- NULL
for (i in 2:ncol(F0)){
  u <- MV(F0[,i], response)
  mu <- c(mu, u)
  if (i %% 100 == 0) print(i)
}
q3 <- quantile(mu, 0.7)
mu <- cbind(1:(ncol(F0)-1), mu)
name <- read.table('names.csv', sep=',')
name <- name[mu[,2] > q3]
write(t(name), 'names2.csv', sep=',', ncol=length(name))
su <- mu[mu[,2] > q3,]
dim(su)
F0 <- F0[,-1]
F1 <- F0[,su[,1]]
B <- cbind(response,F1)
V <- B
B[15:41,] <- V[19:45,]
B[42:45,] <- V[15:18,]
write(t(B), ncol=ncol(B), file='MVSIS_0.7.txt')
ncol(B) - 1
28 features were selected using this threshold. Now we can evaluate the performance of a model given the input features from MV-SIS.
a1 <- read.table('MVSIS_0.7.txt')
a1 <- as.matrix(a1)
dim(a1)
b <- read.table('names2.csv', sep=',')
b <- as.character.numeric_version(b)
k <- 'status'
b <- c(k,b)
write.table(a1, file='MVSIS_0.7_rf.txt', col.names=b)
Let's first load the required libraries (randomForest and doParallel), and set up a cluster to run the jobs in parallel using the 8 cores available on the local machine.
library(randomForest)
library(doParallel)
cl <- makePSOCKcluster(8)
registerDoParallel(cl)
getDoParWorkers()
Then the set containing the selected features from MV-SIS can be read, shuffled and examined.
variable <- read.table('MVSIS_0.7_rf.txt')
variable = variable[sample(1:nrow(variable)),]
colnames(variable)[colnames(variable) == 'status'] = 'loan_status'
str(variable)
We can now partition the data into training and test sets using 90% for training and 10% for testing to run a Random Forest model with the default parameters to evaluate the predictive performance of using this variable selection approach.
k1 <- 194612
k2 <- 21624
train <- variable[1:k1,]
test <- variable[(k1+1):(k1+k2),]
rf <- randomForest(factor(loan_status) ~ .,
data=variable[1:k1,],
ntree=100,
importance=T,
keep.forest=T)
rf
stopCluster(cl)
Then, we can use the fitted model to predict and find the accuracy of the train and test sets.
m <- rf$predicted
n <- predict(rf, variable[(k1+1):(k1+k2),])
k <- 0
for (i in 1:k1){
if (variable[i,1]==m[i]){
k=k+1
}
}
Accuracy.fit <- k / k1
cat(sprintf('Train accuracy = %s\n', Accuracy.fit))
k <- 0
for (i in 1:k2){
if (variable[(k1+i),1]==n[i]){
k=k+1
}
}
Accuracy.test <- k / k2
cat(sprintf('Test accuracy = %s\n', Accuracy.test))
Using the default parameters, the accuracy is lower than optimal, but the dimensionality of the number of features has significantly decreased. Let's now evaluate for specificity and sensitivity.
a <- 0
for (i in 1:k2){
if (variable[(k1+i),1]==1 & n[i]==1){
a = a+1
}
}
b <- 0
for (i in 1:k2){
if (variable[(k1+i),1]==0 & n[i]==1){
b = b+1
}
}
c <- 0
for (i in 1:k2){
if (variable[(k1+i),1]==1 & n[i]==0){
c = c+1
}
}
d <- 0
for (i in 1:k2){
if (variable[(k1+i),1]==0 & n[i]==0){
d = d+1
}
}
sensitivity = a / (a + c)  # TP / (TP + FN) for the default (loan_status = 1) class
cat(sprintf('Sensitivity = %s\n', sensitivity))
specificity = d / (b + d)  # TN / (TN + FP)
cat(sprintf('Specificity = %s\n', specificity))
The specificity is great, but the sensitivity is quite low, so we should keep this in mind when comparing the various variable selection methods evaluated.
Next, let's use the model to predict and generate a confusion matrix for the training and the test set.
train$loan_status <- as.factor(train$loan_status)
pred = predict(rf, newdata=train[-1])
pred1 <- as.factor(pred)
confusionMatrix(data=pred1, reference=train$loan_status)
test$loan_status <- as.factor(test$loan_status)
pred = predict(rf, newdata=test[-1])
pred1 <- as.factor(pred)
confusionMatrix(data=pred1, reference=test$loan_status)
Boruta¶
Boruta is a wrapper around the Random Forest algorithm which contains an improvement with regard to the feature importance measure since it considers all of the features that are relevant to the target variable. It can also handle interactions between features while maintaining the varying degree of randomness that might exist within a set.
The reference manual states:
Boruta iteratively compares importances of attributes with importances of shadow attributes, created by shuffling original ones. Attributes that have significantly worst importance than shadow ones are being consecutively dropped. On the other hand, attributes that are significantly better than shadows are admitted to be Confirmed. Shadows are re-created in each iteration. Algorithm stops when only Confirmed attributes are left, or when it reaches maxRuns importance source runs. If the second scenario occurs, some attributes may be left without a decision. They are claimed Tentative. You may try to extend maxRuns or lower pValue to clarify them, but in some cases their importances do fluctuate too much for Boruta to converge. Instead, you can use TentativeRoughFix function, which will perform other, weaker test to make a final decision, or simply treat them as undecided in further analysis.
The input set for Boruta cannot contain missing observations, so either the complete cases for a set must be determined or a suitable imputation method needs to be applied beforehand, given the reason(s) for the missing data (missing completely at random (MCAR) or missing at random (MAR)). The currently available versions are slow at performing feature selection compared to other traditional feature selection algorithms, so runtime should always be considered. In addition, after determining the important features, collinearity still needs to be addressed, which is another limitation of this approach.
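As a rough, single-iteration sketch of the shadow-attribute idea in Python (for illustration only, assuming a fully numeric X and binary y; the actual run below uses the R Boruta package, which repeats this many times and applies statistical tests before confirming or rejecting a feature):
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def one_boruta_style_iteration(X, y, seed=42):
    """Shuffle each feature into a 'shadow' copy and flag real features that
    beat the best shadow importance in a Random Forest."""
    rng = np.random.default_rng(seed)
    shadows = X.apply(lambda col: rng.permutation(col.values))
    shadows.columns = ['shadow_' + c for c in X.columns]
    both = pd.concat([X, shadows], axis=1)
    rf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(both, y)
    imp = pd.Series(rf.feature_importances_, index=both.columns)
    return imp[X.columns] > imp[shadows.columns].max()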
Let's first set the working directory to FeatureSelection where the results can be saved. Then load the Boruta library and convert loan_status to a factor since this is the target variable. Due to the lengthy computation time and required CPU resources, we will use the test set and doTrace=0 for the verbosity level.
data_test$loan_status <- as.factor(data_test$loan_status)
boruta.df <- Boruta(loan_status~., data=data_test, doTrace=0)
Results from Boruta¶
We can now examine what was saved in the boruta.df list to inspect the results from utilizing this approach.
print(boruta.df)
After more than 81 hours, 79 out of the 111 features were selected. This is a large number of variables though. This package offers a function called TentativeRoughFix that "claims as Confirmed those attributes that have median importance higher than the median importance of maximal shadow attribute, and the rest as Rejected". So, let's now explore the tentative variables.
final.boruta <- TentativeRoughFix(boruta.df)
print(final.boruta)
2 more features were confirmed as important after using the available TentativeRoughFix, while 5 more were confirmed as unimportant. We can then use the variable importance chart from the native package to display the important features as green, unimportant as red, and yellow indicating the ones that are considered tentative. Let's first change the width and height of the plot window by setting the options with the repr library. Then we can plot the results from Boruta after using the TentativeRoughFix. lapply can be applied to the generated ImpHistory where the names of the columns are defined, and sapply can then be used with the output from lapply to sort the feature importance by the median, called Labels. This generates a plot where the feature importance is arranged in ascending order from left to right.
library(repr)
options(repr.plot.width=30, repr.plot.height=30)
par(mar = c(20,5,1,1.5))
plot(final.boruta, xlab='', xaxt='n')
lz<-lapply(1:ncol(final.boruta$ImpHistory),function(i)
final.boruta$ImpHistory[is.finite(final.boruta$ImpHistory[,i]),i])
names(lz) <- colnames(final.boruta$ImpHistory)
Labels <- sort(sapply(lz,median))
axis(side=1,las=2, labels=names(Labels),
at=1:ncol(final.boruta$ImpHistory), cex.axis=1.4)
Additional statistical measures can be generated using attStats, which contains the defined meanImp, medianImp and minImp. Let's now subset the features designated as Confirmed using boruta_stats$decision == "Confirmed".
boruta_stats <- attStats(final.boruta)
boruta_confirmed = subset(boruta_stats,
subset=boruta_stats$decision == 'Confirmed')
boruta_confirmed
This can also be completed with the features designated as Rejected on another subset using boruta_stats$decision == "Rejected".
boruta_rejected = subset(boruta_stats,
subset=boruta_stats$decision == 'Rejected')
boruta_rejected
Utilizing Boruta resulted in 81 features compared to 28 from MV-SIS, suggesting the methods contained within Boruta probably select the features that might generate the best model metrics, but result in high dimensionality and subsequently higher computational costs in downstream modeling tasks.
Results from Variable Selection Methods¶
Various approaches can be utilized for feature selection/importance, as demonstrated here using both Python and R. Therefore, let's now compare the results with regard to which variables were selected from the tested methods using set differences. To begin, we can create dummy variables using pd.get_dummies where the initial features are removed. Then create a separate dataframe containing the features from the methods that did not generate too few (Group Lasso) or too many features (Boruta). We can denote them as X_xgb for the ones from SelectFromModel using XGBoost, X_vif containing the ones remaining after a threshold=5 for VIF, and X_mfs containing the variables from MV-SIS in R.
df = df.drop('loan_status', axis=1)
df = pd.get_dummies(df, drop_first=True)
X_xgb = df[['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'int_rate',
'installment', 'annual_inc', 'inq_last_6mths', 'pub_rec',
'revol_bal', 'out_prncp', 'out_prncp_inv', 'total_pymnt',
'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
'total_rec_late_fee', 'recoveries', 'last_pymnt_amnt',
'acc_open_past_24mths', 'bc_open_to_buy',
'chargeoff_within_12_mths', 'delinq_amnt', 'mths_since_recent_bc',
'num_rev_tl_bal_gt_0', 'tot_hi_cred_lim', 'total_bc_limit',
'term_ 60 months', 'grade_B', 'grade_C', 'grade_D', 'grade_E',
'home_ownership_MORTGAGE', 'home_ownership_RENT',
'verification_status_Source Verified', 'verification_status_Verified',
'pymnt_plan_y', 'purpose_major_purchase', 'purpose_moving',
'purpose_small_business', 'purpose_vacation', 'initial_list_status_w',
'application_type_Joint App', 'hardship_flag_Y',
'disbursement_method_DirectPay', 'debt_settlement_flag_Y']]
X_xgb = X_xgb.drop_duplicates()
print('- Dimensions using variables selected from XGB:', X_xgb.shape)
X_vif = df[['annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'revol_bal',
'out_prncp', 'total_rec_int', 'total_rec_late_fee',
'collection_recovery_fee', 'last_pymnt_amnt',
'collections_12_mths_ex_med', 'acc_now_delinq', 'tot_coll_amt',
'avg_cur_bal', 'bc_open_to_buy', 'chargeoff_within_12_mths',
'delinq_amnt', 'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op',
'mo_sin_rcnt_tl', 'mort_acc', 'mths_since_recent_bc',
'num_accts_ever_120_pd', 'num_il_tl', 'num_tl_30dpd',
'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'percent_bc_gt_75',
'pub_rec_bankruptcies', 'tax_liens', 'total_il_high_credit_limit']]
X_vif = X_vif.drop_duplicates()
print('- Dimensions using variables selected from VIF:', X_vif.shape)
X_mfs = df[['num_bc_tl', 'num_il_tl', 'num_op_rev_tl', 'pymnt_plan_y',
'num_accts_ever_120_pd', 'mo_sin_old_rev_tl_op',
'last_pymnt_amnt', 'percent_bc_gt_75', 'revol_util',
'tot_hi_cred_lim', 'num_actv_rev_tl', 'disbursement_method_DirectPay',
'tot_coll_amt', 'term_ 60 months', 'mort_acc', 'funded_amnt_inv',
'int_rate', 'inq_last_6mths', 'delinq_2yrs', 'installment',
'collections_12_mths_ex_med', 'open_acc', 'loan_amnt',
'funded_amnt', 'annual_inc', 'num_tl_op_past_12m',
'home_ownership_OTHER', 'total_bc_limit']]
X_mfs = X_mfs.drop_duplicates()
print('- Dimensions using variables selected from MVSIS:', X_mfs.shape)
Each of these can be considered as a set with a unique identifier specifying the features from the various feature selection approaches examined. Let's start with the features that were selected from SelectFromModel using XGBoost in a set denoted as s. Subsequently, the others can be compared to this set to determine which features do not exist in s but exist in the other set. This results in a list of the features where the number can be quantified using the length of the list. Other combinations of the initial set with the various permutations can be utilized to compare the features from the methodologies examined.
s = set(X_xgb)
varDiff_vif = [x for x in X_vif if x not in s]
print('\nFeatures using VIF but not in XGB:')
print(varDiff_vif)
print('- Number of different features: ' + str(len(varDiff_vif)))
s1 = set(X_vif)
varDiff_xgb = [x for x in X_xgb if x not in s1]
print('\nFeatures in XGB but not in VIF:')
print(varDiff_xgb)
print('- Number of different features: ' + str(len(varDiff_xgb)))
varDiff_mvsisAll = [x for x in X_mfs if x not in s]
print('\nFeatures in MVSIS but not in XGB:')
print(varDiff_mvsisAll)
print('- Number of different features: ' + str(len(varDiff_mvsisAll)))
varDiff_mvsisVIF = [x for x in X_mfs if x not in s1]
print('\nFeatures in MVSIS but not in VIF:')
print(varDiff_mvsisVIF)
print('- Number of different features: ' + str(len(varDiff_mvsisVIF)))
s1 = set(X_mfs)
varDiff_mvsisAll1 = [x for x in X_xgb if x not in s1]
print('\nFeatures in XGB but not in MV-SIS:')
print(varDiff_mvsisAll1)
print('- Number of different features: ' + str(len(varDiff_mvsisAll1)))
varDiff_mvsisVIF1 = [x for x in X_vif if x not in s1]
print('\nFeatures in VIF but not in MV-SIS:')
print(varDiff_mvsisVIF1)
print('- Number of different features: ' + str(len(varDiff_mvsisVIF1)))
From the comparisons made, the features selected with MV-SIS that are not present in the set from SelectFromModel using XGBoost, which contained 45 features, can be added to a temporary dataframe that is concatenated column-wise with X_xgb and loan_status for further examination.
df_tmp = df[['num_bc_tl', 'num_il_tl', 'num_op_rev_tl', 'num_accts_ever_120_pd',
'mo_sin_old_rev_tl_op', 'percent_bc_gt_75', 'revol_util',
'num_actv_rev_tl', 'tot_coll_amt', 'mort_acc', 'delinq_2yrs',
'collections_12_mths_ex_med', 'open_acc',
'num_tl_op_past_12m', 'home_ownership_OTHER']]
df = pd.concat([X_xgb, df_tmp, y], axis=1)
df = df.drop_duplicates()
print('- Dimensions of data using for further EDA:', df.shape)
del X_xgb, X_vif, X_mfs
del df_tmp, y, s, s1, varDiff_vif, varDiff_xgb, varDiff_mvsisAll
del varDiff_mvsisVIF, varDiff_mvsisAll1, varDiff_mvsisVIF1
Exploratory Data Analysis (EDA)¶
Quantitative Variables¶
We can now select the float64 and int64 quantitative features as a subset of the data. To evaluate if correlated features are present and to what extent, we can define a function to find the non-repetitive correlations by only considering the pairs of features on the diagonal and the lower part of the triangle of the correlation matrix. Then another function can be defined to find the features with the highest absolute rank correlation, which reduces the granularity from the presence of both positive and negative correlation coefficients. The second function can then call the first function to remove the repetitive pairs of features.
df_num = df.select_dtypes(include = ['float64', 'int64'])
df_num = df_num.drop(['loan_status'], axis=1)
def find_repetitive_pairs(df):
    """Returns the pairs of features on the diagonal and lower triangle of the correlation matrix."""
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def find_top_correlations(df, n):
    """Returns the highest correlations without duplicates."""
    au_corr = df.corr(method='spearman').abs().unstack()
    labels_to_drop = find_repetitive_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]
Let's first examine the number of features present in this subset and then apply the find_top_correlations function to determine the 20 feature pairs with the highest correlation. Then we can create a correlation heatmap using sns.heatmap which only contains the features with a correlation coefficient >= 0.7 or <= -0.7, signifying a strong or very strong relationship.
import seaborn as sns
print('- The selected dataframe has ' + str(df_num.shape[1]) + ' columns that are quantitative variables.')
print('- The 20 features with the highest correlations:')
print(find_top_correlations(df_num, 20))
corr = df_num.corr(method='spearman')
fig = plt.figure(figsize=(21, 18))
plt.rcParams.update({'font.size': 13})
ax = sns.heatmap(corr[(corr >= 0.7) | (corr <= -0.7)],
cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
linecolor='black', annot=True, annot_kws={'size': 9},
square=True)
plt.title('Correlation Matrix with Spearman rho')
plt.show()
Now we can examine the distributions of these features using different visualizations. Let's first start by plotting histograms using histplot from seaborn where a for loop iterates through each of the selected features.
df_num = df.select_dtypes(include = ['float64', 'int64'])
df_num = df_num.drop(['loan_status'], axis=1)
plt.rcParams.update({'font.size': 16})
fig, ax = plt.subplots(15,3, figsize=(21,35))
fig.suptitle('Quantitative Features: Histograms', y=1.01, fontsize=30)
for variable, subplot in zip(df_num, ax.flatten()):
    a = sns.histplot(df_num[variable], ax=subplot)
    a.set_yticklabels(a.get_yticks(), size=10)
    a.set_xticklabels(a.get_xticks(), size=9)
fig.tight_layout()
plt.show();
Then we can increase the granularity of this visualization by adding hue=df.loan_status to map the dependent variable as a binary variable, as well as kde=True to smooth the distribution using a kernel density estimate.
plt.rcParams.update({'font.size': 14})
fig, ax = plt.subplots(15,3, figsize=(25,35))
fig.suptitle('Quantitative Features: Histograms Grouped by Loan Status', y=1.01,
fontsize=30)
for variable, subplot in zip(df_num, ax.flatten()):
    a = sns.histplot(x=df_num[variable], data=df_num, hue=df.loan_status,
                     kde=True, ax=subplot)
    a.set_yticklabels(a.get_yticks(), size=10)
    a.set_xticklabels(a.get_xticks(), size=10)
fig.tight_layout()
plt.show();
From the previous figure, some features contain a larger range of values, so let's extract these as a subset of the quantitative features to inspect at a more granular level.
df_num1 = df_num[['loan_amnt', 'funded_amnt', 'funded_amnt_inv',
'int_rate', 'installment', 'total_pymnt', 'total_pymnt_inv',
'total_rec_prncp', 'total_rec_int', 'mo_sin_old_rev_tl_op',
'percent_bc_gt_75', 'revol_util']]
plt.rcParams.update({'font.size': 16})
fig, ax = plt.subplots(4,3, figsize=(21,30))
fig.suptitle('Subset of Quantitative Features: Histograms Grouped by Loan Status',
             y=1.01, fontsize=30)
for variable, subplot in zip(df_num1, ax.flatten()):
    a = sns.histplot(x=df_num1[variable], data=df_num1, hue=df.loan_status,
                     kde=True, ax=subplot)
    a.set_yticklabels(a.get_yticks(), size=10)
    a.set_xticklabels(a.get_xticks(), size=9)
fig.tight_layout()
plt.show();
Box-and-whisker plots are another way to visualize both quantitative and qualitative features. Let's use seaborn.boxplot where the target loan_status resides on the x-axis and the quantitative feature is on the y-axis. This approach summarizes the distribution of the feature using five key numbers: the minimum, 1st quartile (25th percentile), 2nd quartile (median), 3rd quartile (75th percentile) and the maximum of the data. The interquartile range (IQR) is the difference between the 75th and 25th percentiles (Q3 - Q1). The whiskers extend over the remaining range of the data.
Boxplots can be useful to visualize if outlier values are present within the data. If values are greater than Q3 + 1.5 x IQR or less than Q1 - 1.5 x IQR, then they can be considered as outliers.
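As a minimal sketch of that rule for a single column (the column name is just an example assumed to be in df_num):
q1 = df_num['last_pymnt_amnt'].quantile(0.25)
q3 = df_num['last_pymnt_amnt'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
n_outliers = ((df_num['last_pymnt_amnt'] < lower)
              | (df_num['last_pymnt_amnt'] > upper)).sum()
print('Potential outliers flagged in last_pymnt_amnt:', n_outliers)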
plt.rcParams.update({'font.size': 15})
fig, ax = plt.subplots(9,5, figsize=(21,30))
fig.suptitle('Quantitative Features: Boxplots', y=1.01, fontsize=30)
for var, subplot in zip(df_num, ax.flatten()):
    a = sns.boxplot(x=df.loan_status, y=df_num[var], data=df_num, ax=subplot)
    a.set_yticklabels(a.get_yticks(), size=12)
    a.set_xticklabels(a.get_xticks(), size=12)
fig.tight_layout()
plt.show();
When considering the quantitative features stratified by loan_status, differences are observable, especially for recoveries, last_pymnt_amnt and percent_bc_gt_75. These potentially will contribute to explaining models once they are fit.
Now, we can examine the qualitative features in the set including the uint8 ones using a countplot to visualize the number of observations in each categorical bin.
df_cat = df.select_dtypes(include = 'uint8')
print('The selected dataframe has ' + str(df_cat.shape[1])
+ ' columns that are qualitative variables.')
print('\n')
fig, ax = plt.subplots(5, 4, figsize=(21,21))
fig.suptitle('Qualitative Features: Count Plots', y=1.01, fontsize=30)
for variable, subplot in zip(df_cat, ax.flatten()):
    a = sns.countplot(df_cat[variable], ax=subplot)
    a.set_yticklabels(a.get_yticks(), size=10)
fig.tight_layout()
plt.show();
Automated EDA with Sweetviz¶
Overview: Sweetviz is an open-source Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with just two lines of code. The output is a fully self-contained HTML application. We can install it using pip install sweetviz. There are many options to profile the data, but let's utilize the default settings to examine the data quality (data types, missing values, duplicate rows, unique values) and summary statistics (min/max/range, quartiles, mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness).
import sweetviz as sv
sweet_report = sv.analyze(df)
sweet_report.show_html('Loan_Status_AutomatedEDA.html')
sweet_report.show_notebook(layout='widescreen', w=1500, h=1000, scale=0.8)
Automated EDA using Pandas Profiling¶
We can install using pip install ydata-profiling.
Key features:
- Type inference: automatic detection of columns' data types (Categorical, Numerical, Date, etc.)
- Warnings: a summary of the problems/challenges in the data that you might need to work on (missing data, inaccuracies, skewness, etc.)
- Univariate analysis: including descriptive statistics (mean, median, mode, etc.) and informative visualizations such as distribution histograms
- Multivariate analysis: including correlations, a detailed analysis of missing data, duplicate rows, and visual support for pairwise variable interactions
- Also includes the ability to examine Time-Series, Text analysis, File and Image analysis, and Compare datasets
from ydata_profiling import ProfileReport
profile = ProfileReport(df, title='Loan Status_EDA')
profile.to_notebook_iframe()
The output is on the GitHub repository due to size constraints here.
Now we can remove the features with high correlation coefficients and the categorical variables containing imbalances, and examine the final set using the data_quality_table function.
df = df.drop(['out_prncp_inv', 'funded_amnt', 'funded_amnt_inv',
'total_pymnt_inv', 'open_acc', 'tot_cur_bal', 'total_rec_prncp',
'num_op_rev_tl', 'home_ownership_OTHER', 'hardship_flag_Y',
'pymnt_plan_y', 'purpose_house', 'purpose_medical',
'debt_settlement_flag_Y', 'purpose_small_business'], axis=1)
df = df.drop_duplicates()
print('\n Data Quality: Final Set')
display(data_quality_table(df))
Class Imbalance Methods¶
Resampling data is one of the most commonly accepted approaches for handling an imbalanced dataset. There are broadly two types of methods: oversampling and undersampling. Oversampling is generally preferred compared to undersampling techniques because the small percentage of instances potentially containing important information tend to be removed from the set when undersampling based methods are utilized.
Resampling Techniques¶
Let's first import the packages and define the conditional options to set up the environment with a random and numpy seed for reproducibility. Then the data can be read into a pandas.DataFrame and duplicates removed, if present.
import os
import random
import numpy as np
import warnings
import pandas as pd
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
seed_value = 42
os.environ['LoanStatus_Linear'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
df = pd.read_csv('LendingTree_LoanStatus_final.csv', low_memory=False)
df = df.drop_duplicates()
Upsample Minority Class¶
Let's first separate the input features and the target into X and y. Then train_test_split from sklearn.model_selection can be utilized to set up the training/test sets using 80% of the data for the training set and 20% for the test set. Then the features and target from the training data can be concatenated by column, followed by splitting this set into the majority class, where loan_status==0 as current, and the minority class, where loan_status==1 as default.
Next, resample from sklearn.utils can be leveraged to oversample the minority class with replacement, specifying the number of samples (the number of observations contained within the current set) and the defined random_state=seed_value. Then, the majority and upsampled minority sets can be concatenated together and the counts of each of the classes in the balanced set examined.
Then the input features and target of the Upsampled training set can be defined as X_train and y_train to match the format of the test set, and the train/test targets defined as a pandas.DataFrame so the full train/test sets can be concatenated and saved to .csv files for later use.
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
X = df.drop('loan_status', axis=1)
y = df['loan_status']
del df
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
random_state=seed_value)
df1 = pd.concat([X_train, y_train], axis=1)
current = df1[df1.loan_status==0]
default = df1[df1.loan_status==1]
del df1
default_upsampled = resample(default,
replace=True,
n_samples=len(current),
random_state=seed_value)
upsampled = pd.concat([current, default_upsampled])
del default_upsampled, current, default
print('\nExamine Loan Status after oversampling minority class')
print(upsampled.loan_status.value_counts())
X_train = upsampled.drop('loan_status', axis=1)
y_train = upsampled.loan_status
del upsampled
cols = ['loan_status']
y_train = pd.DataFrame(data=y_train, columns=cols)
y_test = pd.DataFrame(data=y_test, columns=cols)
train_US = pd.concat([X_train, y_train], axis=1)
train_US.to_csv('trainDF_US.csv', index=False)
test_US = pd.concat([X_test, y_test], axis=1)
test_US.to_csv('testDF_US.csv', index=False)
del train_US, test_US
Synthetic Minority Oversampling Technique (SMOTE)¶
SMOTE (Synthetic Minority Oversampling Technique) generates new synthetic samples for the minority class, which is a type of data augmentation. New synthetic data is formed along the small distances between a minority observation and its closest minority neighbors in the feature space. For a random example within the minority class, a specified k of the nearest neighbors are determined, one is randomly selected, and a synthetic example is generated at a random point between the initial example and the randomly selected neighbor. Since this uses the small distances between the aforementioned points, it results in more realistic examples from the minority class.
Since this approach relies on the minority class and not the majority class, potential negative effects include decreasing the decision space as well as increasing the amount of noise in the feature space, resulting in more difficulty classifying groups, whether binary or multi-level outcomes.
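In equation form, each synthetic example is an interpolation between a minority-class observation x_i and one of its k nearest minority-class neighbors x_nn:

x_new = x_i + lambda * (x_nn - x_i), with lambda drawn uniformly from [0, 1]

so every synthetic point lies on the line segment connecting two existing minority observations in the feature space.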
Let's first set up the training/testing sets using the same parameters that were utilized for the Upsampled set. The SMOTE class from the Imbalanced-learn library can then be imported from imblearn.over_sampling, specifying sampling_strategy='minority' and random_state=seed_value as parameters, to fit_sample on the training set. The default number of nearest neighbors is k_neighbors=5. Then the counts of both classes can be examined to compare with the Upsampled set, and similar methods used to save the train/test sets for later use.
from imblearn.over_sampling import SMOTE
X1_train, X1_test, y1_train, y1_test = train_test_split(X, y, test_size=0.20,
random_state=seed_value)
smote = SMOTE(sampling_strategy='minority', random_state=seed_value)
X1_train, y1_train = smote.fit_sample(X1_train, y1_train)
print('\nExamine Loan Status after upsampling with SMOTE')
print(y1_train.value_counts())
cols = ['loan_status']
y1_train = pd.DataFrame(data=y1_train, columns=cols)
y1_test = pd.DataFrame(data=y1_test, columns=cols)
train_SMOTE = pd.concat([X1_train, y1_train], axis=1)
train_SMOTE.to_csv('trainDF_SMOTE.csv', index=False)
test_SMOTE = pd.concat([X1_test, y1_test], axis=1)
test_SMOTE.to_csv('testDF_SMOTE.csv', index=False)
del train_SMOTE, test_SMOTE
Classification: Linear¶
Before machine learning models are optimized for the associated parameters of different algorithms, baseline models are important to establish which parameters should be more closely targeted during the tuning process. The notebook containing the Linear baseline models and hyperparameter tuning using GridSearchCV is located here.
Upsampling: Lasso Baseline Model¶
Baseline Default Parameters¶
- penalty: Specifies the norm of the penalty. default='l2'.
- tol: Tolerance for stopping criteria. default=1e-4.
- C: Inverse of regularization strength. default=1.0.
- fit_intercept: Specifies if a bias or intercept should be added to the decision function. default=True.
- intercept_scaling: default=1.
- class_weight: Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. default=None.
- random_state: RandomState instance. Used when solver == 'sag', 'saga' or 'liblinear' to shuffle the data. default=None.
- solver: Algorithm to use in the optimization problem. default='lbfgs'.
- max_iter: Maximum number of iterations taken for the solvers to converge. default=100.
- verbose: For the liblinear and lbfgs solvers, set verbose to any positive number for verbosity. default=0.
- warm_start: When set to True, reuse the solution of the previous call to fit as initialization; otherwise, just erase the previous solution. Useless for the liblinear solver. default=False.
- n_jobs: Number of CPU cores used when parallelizing over classes if multi_class='ovr'. default=None.
- l1_ratio: The Elastic-Net mixing parameter. Only used if penalty='elasticnet'. Setting l1_ratio=0 is equivalent to using penalty='l2', while setting l1_ratio=1 is equivalent to using penalty='l1'. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2. default=None.
Let's first set the path where the results from training a model will be stored, in particular where the model .pkl is saved so it can be reloaded for comparisons to the baseline model or if more training is required to find even better hyperparameters than the first iteration of the tuning process. Then we can scale the data between zero and one using the MinMaxScaler, calling fit_transform on the training set and applying the fitted scaler with transform to the test set. For the Upsampled Lasso baseline model, we can fit the model using the default parameters from the LogisticRegression model in scikit-learn and save the model. Then we predict with the trained model on the scaled test set and evaluate the performance with various sklearn.metrics functions, using the input format metric(y_true=y_test, y_pred=model.predict(X_test)). This can be applied to both the training and the test sets, independently, to evaluate whether overfitting is occurring as well.
The classification_report outputs granular information in table format, with rows for loan_status==0, loan_status==1, accuracy, macro avg, and weighted avg, and columns for precision, recall, f1-score, and support.
The confusion_matrix contains even more granular information with the count the model evaluated:
For scikit-learn's confusion_matrix with the default label ordering [0, 1], the layout is:
- True Negatives (top left): the model correctly predicts the negative class (loan_status==0)
- False Positives (top right): the model incorrectly predicts the positive class
- False Negatives (lower left): the model incorrectly predicts the negative class
- True Positives (lower right): the model correctly predicts the positive class (loan_status==1)
This information is needed to calculate the following metrics:
- Accuracy: (True Positives + True Negatives) / Total Count of Predictions
- Precision: True Positives / (True Positives + False Positives)
- Recall: True Positives / (True Positives + False Negatives)
- F1-Score: 2 × (Precision × Recall) / (Precision + Recall)
These basic metrics for classification problems can be imported from sklearn.metrics, so we do not need to manually calculate the accuracy_score, precision_score, recall_score and f1_score.
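Purely to illustrate the formulas above with made-up counts (they are not results from any model in this project), the same values could be computed by hand:
# Hypothetical confusion matrix counts for illustration only
tn, fp, fn, tp = 900, 50, 40, 910
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)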
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from joblib import parallel_backend
import pickle
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score
from sklearn.metrics import recall_score, f1_score
scaler = MinMaxScaler()
X_trainS = scaler.fit_transform(X_train)
X_testS = scaler.transform(X_test)
lasso = LogisticRegression(penalty='l1', solver='saga',
random_state=seed_value)
with parallel_backend('threading', n_jobs=-1):
lasso.fit(X_trainS, y_train)
Pkl_Filename = 'Lasso_US_baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(lasso, file)
# =============================================================================
# To load saved model
#lasso_US = joblib.load('Lasso_US_baseline.pkl')
#print(lasso_US)
# =============================================================================
y_pred_US = lasso.predict(X_testS)
print('Results from Lasso baseline model on Upsampled data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test, y_pred_US)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred_US))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test, y_pred_US))
print('Precision score : %.3f' % precision_score(y_test, y_pred_US))
print('Recall score : %.3f' % recall_score(y_test, y_pred_US))
print('F1 score : %.3f' % f1_score(y_test, y_pred_US))
Upsampling: Elastic Net Baseline Model¶
Now we can set the baseline for an elastic net model, specifying l1_ratio=0.5 with the saga solver.
elnet = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5)
with parallel_backend('threading', n_jobs=-1):
elnet.fit(X_trainS, y_train)
Pkl_Filename = 'Elnet_US_baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(elnet, file)
y_pred_US = elnet.predict(X_testS)
print('Results from Elastic Net baseline model on Upsampled data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test, y_pred_US)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred_US))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test, y_pred_US))
print('Precision score : %.3f' % precision_score(y_test, y_pred_US))
print('Recall score : %.3f' % recall_score(y_test, y_pred_US))
print('F1 score : %.3f' % f1_score(y_test, y_pred_US))
When comparing the model metrics to the Lasso baseline model, they are quite comparable, with a slightly higher precision and slightly lower recall and F1 score. They are all above 90%, which suggests the baseline models perform well.
SMOTE: Lasso Baseline Model¶
Using the same processing that was utilized for the Upsampled data, we can fit the baseline linear models for the Lasso and Elastic Net using the SMOTE data.
scaler = MinMaxScaler()
X1_trainS = scaler.fit_transform(X1_train)
X1_testS = scaler.transform(X1_test)
with parallel_backend('threading', n_jobs=-1):
lasso.fit(X1_trainS, y1_train)
Pkl_Filename = 'Lasso_SMOTE_baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(lasso, file)
y_pred_SMOTE = lasso.predict(X1_testS)
print('Results from Lasso baseline model on SMOTE data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y1_test, y_pred_SMOTE)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y1_test, y_pred_SMOTE))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y1_test, y_pred_SMOTE))
print('Precision score : %.3f' % precision_score(y1_test, y_pred_SMOTE))
print('Recall score : %.3f' % recall_score(y1_test, y_pred_SMOTE))
print('F1 score : %.3f' % f1_score(y1_test, y_pred_SMOTE))
Compared to the model metrics of the baseline linear models using the Upsampled set, the metrics are very similar; in both cases the L1 penalty used by the Lasso shrinks some of the model coefficients to zero.
SMOTE: Elastic Net Baseline Model¶
with parallel_backend('threading', n_jobs=-1):
elnet.fit(X1_trainS, y1_train)
Pkl_Filename = 'Elnet_SMOTE_baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(elnet, file)
y_pred_SMOTE = elnet.predict(X1_testS)
print('Results from Elastic Net baseline model on SMOTE data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y1_test, y_pred_SMOTE)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y1_test, y_pred_SMOTE))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y1_test, y_pred_SMOTE))
print('Precision score : %.3f' % precision_score(y1_test, y_pred_SMOTE))
print('Recall score : %.3f' % recall_score(y1_test, y_pred_SMOTE))
print('F1 score : %.3f' % f1_score(y1_test, y_pred_SMOTE))
Compared to the model metrics of the baseline linear models using the Upsampled set, the accuracy and precision scores are comparable, but the recall score is about 2% lower and the F1 score slightly lower, suggesting these metrics warrant further inspection.
When first performing GridSearchCV on both the Upsampled and SMOTE sets, the same parameters were selected. Expanding the range of the parameters used in the search, such as increasing/decreasing C and the number of iterations (max_iter), as well as changing the scoring metric evaluated during the search, might provide better model performance.
Grid Search using Pipelines: Weighted Recall¶
We can design a sklearn pipeline by first using a list to select the numerical features to be scaled with MinMaxScaler() in the defined numeric_transformer. This pipeline method makes it possible to transform various data types in different data contexts. Since this is a classification approach where the important categorical features have already been reduced to their most important groups during the variable selection process, the numeric_transformer is all that is required for the preprocessing before testing different combinations of model parameters. To test two different modeling approaches, the numeric_transformer can be specified in two different pipelines where the model component of each pipeline contains differing parameters for the Lasso and the Elastic Net. In scikit-learn, a LogisticRegression with penalty='l1' is analogous to a Lasso, while penalty='elasticnet' is comparable to using Elastic Net for classification problems. Given the specified parameters for the two model types to be evaluated, these two pipelines can be placed in a list and ordered into a dictionary.
Grid Search: Upsampled¶
Let's first start with the Upsampled set using the parameters that define a model with an L1 penalty and another that resides between the L1 and L2 regularization penalties, first described in 2005 by Zou and Hastie as the Elastic Net. With differing penalties and l1_ratio values, but using the same solver, the same maximum iteration number and the same random_state, various other parameters within a defined grid can then be evaluated to compare how model performance differs within the Upsampled set as well as compared with the SMOTE set.
from sklearn.pipeline import Pipeline
numeric_features = list(X.columns[X.dtypes != 'object'])
numeric_transformer = Pipeline(steps=[('mms', MinMaxScaler())])
pipe_lasso = Pipeline(steps=[('preprocessor', numeric_transformer),
('model', LogisticRegression(penalty='l1',
solver='saga',
max_iter=30000,
random_state=seed_value))])
pipe_elnet = Pipeline(steps=[('preprocessor', numeric_transformer),
('model', LogisticRegression(penalty='elasticnet',
solver='saga',
l1_ratio=0.5,
max_iter=30000,
random_state=seed_value))])
pipelines = [pipe_lasso, pipe_elnet]
pipe_dict = {0: 'Lasso', 1: 'Elastic Net'}
Given the specified method for preprocessing the numerical features and the two differing model architectures, we can then define a function called make_param_grids which uses itertools to loop through the pipeline_steps with either the Lasso or Elastic Net as well as the parameters in the grid for each defined hyperparameter space contained within the all_param_grids dictionary. The pipeline object is then initialized and the estimators are contained within param_grids_list. For the GridSearchCV component, we can specify which metric should be utilized for scoring; given the initial difference in the recall score between the two sets, a weighted recall score might provide some insight. The number of folds used in cross validation can also be specified, and given the size of the sets and the computational power available, let's use 3-fold cross validation, cv=3, and n_jobs=-3, which uses all but two of the available cores. This also suggests the runtime environment is probably not optimally configured, since specifying joblib with n_jobs=-1 did not fully utilize the 8 cores and 64GB RAM available for this job.
import itertools
from sklearn.model_selection import GridSearchCV
def make_param_grids(steps, param_grids):
final_params = []
for estimator_names in itertools.product(*steps.values()):
current_grid = {}
for step_name, estimator_name in zip(steps.keys(), estimator_names):
for param, value in param_grids.get(estimator_name).items():
if param == 'object':
current_grid[step_name] = [value]
else:
current_grid[step_name + '__' + param] = value
final_params.append(current_grid)
return final_params
pipeline_steps = {'classifier': ['lasso', 'elnet']}
all_param_grids = {'lasso': {'object': LogisticRegression(),
'penalty': ['l1'],
'solver': ['saga'],
'max_iter': [30000, 50000, 100000],
'C': [1, 0.5, 0.01],  # C must be strictly positive in scikit-learn
'tol': [1e-6, 1e-5, 1e-4, 1e-3]},
'elnet': {'object': LogisticRegression(),
'penalty': ['elasticnet'],
'solver': ['saga'],
'l1_ratio': [0.0, 0.5, 1.0],
'max_iter': [30000, 50000, 100000],
'C': [1, 0.5, 0.01],  # C must be strictly positive in scikit-learn
'tol': [1e-6, 1e-5, 1e-4, 1e-3]}
}
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
pipe = Pipeline(steps=[('classifier', LogisticRegression())])
grid = GridSearchCV(pipe, param_grid=param_grids_list,
scoring='recall_weighted',
cv=3, verbose=4, n_jobs=-3)
Now, the pipeline with the defined parameters for GridSearchCV can be run using the Upsampled data utilizing dask.delayed for parallel jobs where the start/end time can be monitored.
import time
import dask.delayed
print('Start Upsampling - Grid Search..')
search_time_start = time.time()
pipelines_ = [dask.delayed(grid).fit(X_train, y_train.values.ravel())]
fit_pipelines = dask.compute(*pipelines_, scheduler='processes')
print('Finished Upsampling - Grid Search :', time.time() - search_time_start)
Then the accuracies of the models tested during the grid search can be compared to determine which model resulted in the highest accuracy when evaluated using the test set.
for idx, val in enumerate(fit_pipelines):
print('%s pipeline test accuracy: %.3f' % (pipe_dict[idx],
val.score(X_test, y_test)))
best_acc = 0.0
best_clf = 0
best_pipe = ''
for idx, val in enumerate(fit_pipelines):
if val.score(X_test, y_test) > best_acc:
best_acc = val.score(X_test, y_test)
best_pipe = val
best_clf = idx
print('Classification with highest accuracy: %s' % pipe_dict[best_clf])
print('Lasso/Elastic Net using Upsampling - Best Estimator')
print(best_pipe.best_params_)
For the Upsampled data, the model that utilized the Lasso with C=1, max_iter=100000, penalty='l1', solver='saga' and tol=1e-06 resulted in the highest accuracy. Given these parameters, we can fit the model with the best accuracy for the Upsampled set and save the model in a pickle file for later use.
LassoMod_US_HPO = LogisticRegression(penalty='l1',
C=1,
solver='saga',
max_iter=100000,
tol=1e-06,
n_jobs=-1,
random_state=seed_value)
print('Start fit the best hyperparameters from Upsampling grid search to the data..')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
LassoMod_US_HPO.fit(X_trainS, y_train)
print('Finished fit the best hyperparameters from Upsampling grid search to the data :',
time.time() - search_time_start)
Pkl_Filename = 'Linear_HPO_US_mmsRecall_mostParams.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(LassoMod_US_HPO, file)
Now that the model has been fit using the training set, we can predict using the scaled test set to determine the model performance/metrics.
y_pred_US_HPO = LassoMod_US_HPO.predict(X_testS)
print('Results from LASSO using Upsampling HPO:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test, y_pred_US_HPO)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred_US_HPO))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test, y_pred_US_HPO))
print('Precision score : %.3f' % precision_score(y_test, y_pred_US_HPO))
print('Recall score : %.3f' % recall_score(y_test, y_pred_US_HPO))
print('F1 score : %.3f' % f1_score(y_test, y_pred_US_HPO))
Compared to the baseline model, the model determined using GridSearchCV resulted in practically the same metrics. Given the time required to search through the various models in the given parameter space, the baseline model would probably be sufficient unless an even higher recall score, i.e. a larger share of true positives correctly classified as positive, was needed for the task at hand.
Model Metrics with ELI5¶
Let's now utilize PermutationImportance from eli5.sklearn, a method for the global interpretation of a model that outputs the magnitude of a feature's effect but not its direction. For each feature, the values in the set are shuffled, predictions are generated on the shuffled data, and the decrease in the specified metric relative to the unshuffled baseline is calculated. The more important a feature is, the greater the model error when that feature is shuffled.
class PermutationImportance(estimator, scoring=None, n_iter=5, random_state=None, cv='prefit', refit=True)
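Under the hood, the idea can be sketched in a few lines: shuffle one feature at a time and record how much the score drops relative to the unshuffled baseline. The function below is a simplified illustration of that procedure (using accuracy as the score and assuming X is a NumPy array), not the eli5 implementation itself.
import numpy as np
from sklearn.metrics import accuracy_score
def simple_permutation_importance(model, X, y, n_iter=5, seed=42):
    """Mean drop in accuracy when each column of X is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_iter):
            X_perm = X.copy()
            X_perm[:, j] = X[rng.permutation(X.shape[0]), j]  # shuffle only column j
            drops.append(baseline - accuracy_score(y, model.predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances
For example, simple_permutation_importance(LassoMod_US_HPO, X_testS, y_test) would approximate the weights eli5 reports below, although eli5 scores with the estimator's own scorer by default.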
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
import eli5
from eli5.sklearn import PermutationImportance
import webbrowser
from eli5.formatters import format_as_dataframe
perm_importance = PermutationImportance(LassoMod_US_HPO,
random_state=seed_value).fit(X_testS,
y_test)
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test1.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
The total_pymnt feature contains the highest weight (0.466812), followed by loan_amnt (0.429280), total_rec_int (0.392205) and out_prncp (0.338332).
Now we can use ELI5 to show the prediction, which reveals the contribution of the various features in explaining whether a loan will default or stay current. This can also be stored in an HTML file that is incorporated into a dashboard, whether in a Jupyter notebook or in a script, with the environment and package dependencies loaded for batch or real-time inference, given the task at hand.
html_obj2 = eli5.show_prediction(LassoMod_US_HPO, X_test1.iloc[1],
show_feature_values=True)
html_obj2
For the loans to be Paid Off or Current, the features out_prncp, total_pymnt, annual_inc, total_hi_credit_lim, revol_bal and last_pymnt_amnt with higher values are beneficial while the features loan_amnt, total_rec_int, total_bal_ex_mort and installment with higher values are not beneficial.
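As mentioned above, these explanations can also be persisted for a dashboard. One possible approach, assuming the objects returned by eli5 are IPython HTML objects and using an arbitrary file name, is to write the underlying HTML to disk and open it with the webbrowser module imported earlier:
# Assumes html_obj2 is the IPython HTML object returned by eli5.show_prediction
with open('eli5_prediction_US.html', 'w', encoding='utf-8') as f:
    f.write(html_obj2.data)
webbrowser.open('file://' + os.path.realpath('eli5_prediction_US.html'))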
Create an Azure Blob Datastore¶
Now that the features and potential methods for addressing class imbalance have been selected, let's save these sets in the cloud, using an Azure Blob Datastore as the storage option. We first need to connect to the Azure Machine Learning workspace with MLClient using the credential, subscription_id, resource_group_name and workspace_name.
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()
ml_client = MLClient(
credential=credential,
subscription_id='a134465f-eb15-4297-b902-3c97d4c81838',
resource_group_name='aschultzdata',
workspace_name='ds-ml-env',
)
Then we can create the datastore by specifying the name of the datastore, the description, the account, the name of the container, the protocol and the account_key.
from azure.ai.ml.entities import AzureBlobDatastore
from azure.ai.ml.entities import AccountKeyConfiguration
from azure.ai.ml import MLClient
store = AzureBlobDatastore(
name='loanstatus_datastore',
description='Datastore for Loan Status',
account_name='dsmlenv8898281366',
container_name='loanstatus',
protocol='https',
credentials=AccountKeyConfiguration(
account_key='XXXxxxXXXxXXXXxxXXXXXxXXXXXxXxxXxXXXxXXXxXXxxxXXxxXXXxXxXXXxxXxxXXXXxxxxxXXxxxxxxXXXxXXX'
),
)
ml_client.create_or_update(store)
Then we can upload both the train/test sets for the Upsampled and SMOTE sets to the loanstatus container.
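The upload step itself is not shown here; one possible approach, sketched below, uses the azure-storage-blob SDK directly. The account URL is assumed to follow the standard blob endpoint pattern for the account defined above, the credential is a placeholder for the same storage account key, and the file names are the CSVs saved earlier.
from azure.storage.blob import BlobServiceClient
account_url = 'https://dsmlenv8898281366.blob.core.windows.net'
# Placeholder key; use the same storage account key as in the datastore definition
blob_service_client = BlobServiceClient(account_url=account_url,
                                        credential='XXXxxxXXXxXXXXxxXXXXX')
container_client = blob_service_client.get_container_client('loanstatus')
for csv_file in ['trainDF_US.csv', 'testDF_US.csv',
                 'trainDF_SMOTE.csv', 'testDF_SMOTE.csv']:
    with open(csv_file, 'rb') as data:
        container_client.upload_blob(name=csv_file, data=data, overwrite=True)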
Train a model using Microsoft Azure¶
Before we train the model, we need to create a compute instance/cluster with a defined name, so let's create one with a CPU if it has not already been created. Then we can specify the cluster components containing the on-demand virtual machine (VM) service, the type of VM, the minimum/maximum nodes in the cluster, the time a node will remain running after a job finishes/terminates, and the type of tier. Finally, we can pass the object to create_or_update from MLClient.
from azure.ai.ml.entities import AmlCompute
cpu_compute_target = 'cpu-cluster-E8s-v3'
try:
cpu_cluster = ml_client.compute.get(cpu_compute_target)
print(
f"You already have a cluster named {cpu_compute_target}, we'll reuse it as is."
)
except Exception:
print('Creating a new cpu compute target...')
cpu_cluster = AmlCompute(
name=cpu_compute_target,
type='amlcompute',
size='Standard_E8s_v3',
min_instances=0,
max_instances=1,
idle_time_before_scale_down=180,
tier='Dedicated',
)
print(
f"AMLCompute with name {cpu_cluster.name} will be created, with compute size {cpu_cluster.size}"
)
cpu_cluster = ml_client.compute.begin_create_or_update(cpu_cluster)
Next, let's create the environment by making a directory for the dependencies file, which lists the runtime components and the libraries installed on the compute for training the model.
import os
dependencies_dir = './dependencies'
os.makedirs(dependencies_dir, exist_ok=True)
Now we can write the conda.yaml file into the dependencies directory.
%%writefile {dependencies_dir}/conda.yaml
name: model-env
channels:
- conda-forge
dependencies:
- python=3.8.5
- numpy=1.21.6
- pip=23.1.2
- scikit-learn==1.1.2
- scipy
- pandas>=1.1,<1.2
- pip:
- inference-schema[numpy-support]==1.3.0
- mlflow
- azureml-mlflow
- psutil==5.9.0
- tqdm
- ipykernel
- matplotlib
- seaborn
- eli5==0.13.0
- shap==0.41.0
- lime
The created conda.yaml file allows for the environment to be created and registered in the workspace.
from azure.ai.ml.entities import Environment
custom_env_name = 'aml-loanstatus-cpu'
custom_job_env = Environment(
name=custom_env_name,
description='Custom environment for Loan Status Linear job',
tags={'scikit-learn': '1.1.2'},
conda_file=os.path.join(dependencies_dir, 'conda.yaml'),
image='mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest',
)
custom_job_env = ml_client.environments.create_or_update(custom_job_env)
print(
f"Environment with name {custom_job_env.name} is registered to workspace, the environment version is {custom_job_env.version}"
)
Next, we can create the training script by first creating the source folder where the training script, main.py, will be stored.
import os
train_src_dir = './src'
os.makedirs(train_src_dir, exist_ok=True)
The training script consists of preparing the environment, reading the data, data preparation, model training, evaluating the model and saving/registering the model. This includes specifying the dependencies to import and utilize, setting the seed, defining the input/output arguments with argparse, starting logging with mlflow.sklearn.autolog, reading the train/test sets, defining the features/target and preprocessing the data by scaling the features with the MinMaxScaler. Then the number of samples and features are logged with MLflow. A Linear model is then trained using the best parameters from GridSearchCV, and the classification_report and confusion_matrix as well as the accuracy, precision, recall and f1_score for the train/test sets are logged as MLflow artifacts and metrics. Then the model can be saved and registered.
import os
import random
import numpy as np
import warnings
import argparse
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from joblib import parallel_backend
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, precision_score
from sklearn.metrics import recall_score, f1_score
import matplotlib.pyplot as plt
import seaborn as sns
warnings.filterwarnings('ignore')
seed_value = 42
os.environ['LoanStatus_Linear'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
def main():
"""Main function of the script."""
parser = argparse.ArgumentParser()
parser.add_argument('--train_data', type=str,
help='path to input train data')
parser.add_argument('--test_data', type=str, help='path to input test data')
parser.add_argument('--penalty', required=False, default='l2', type=str)
parser.add_argument('--solver', required=False, default='lbfgs', type=str)
parser.add_argument('--max_iter', required=False, default=100, type=int)
parser.add_argument('--C', required=False, default=1.0, type=float)
parser.add_argument('--tol', required=False, default=1e-4, type=float)
parser.add_argument('--n_jobs', required=False, default=1, type=int)
parser.add_argument('--registered_model_name', type=str, help='model name')
args = parser.parse_args()
mlflow.start_run()
mlflow.sklearn.autolog()
###################
#<prepare the data>
###################
print(" ".join(f"{k}={v}" for k, v in vars(args).items()))
print('Input Train Data:', args.train_data)
print('Input Test Data:', args.test_data)
trainDF = pd.read_csv(args.train_data, low_memory=False)
testDF = pd.read_csv(args.test_data, low_memory=False)
train_label = trainDF[['loan_status']]
test_label = testDF[['loan_status']]
train_features = trainDF.drop(columns = ['loan_status'])
test_features = testDF.drop(columns = ['loan_status'])
print(f"Training with data of shape {train_features.shape}")
scaler = MinMaxScaler()
train_features = scaler.fit_transform(train_features)
test_features = scaler.transform(test_features)
mlflow.log_metric('num_samples', train_features.shape[0])
mlflow.log_metric('num_features', train_features.shape[1])
####################
#</prepare the data>
####################
##################
#<train the model>
##################
model = LogisticRegression(penalty=args.penalty,
solver=args.solver,
max_iter=args.max_iter,
C=args.C,
tol=args.tol,
random_state=seed_value)
with parallel_backend('threading', n_jobs=args.n_jobs):
model.fit(train_features, train_label)
##################
#</train the model>
##################
#####################
#<evaluate the model>
#####################
train_label_pred = model.predict(train_features)
test_label_pred = model.predict(test_features)
clr_train = classification_report(train_label, train_label_pred,
output_dict=True)
sns.heatmap(pd.DataFrame(clr_train).iloc[:-1,:].T, annot=True)
plt.savefig('clr_train.png')
mlflow.log_artifact('clr_train.png')
plt.close()
clr_test = classification_report(test_label, test_label_pred,
output_dict=True)
sns.heatmap(pd.DataFrame(clr_test).iloc[:-1,:].T, annot=True)
plt.savefig('clr_test.png')
mlflow.log_artifact('clr_test.png')
plt.close()
cm_train = confusion_matrix(train_label, train_label_pred)
cm_train = ConfusionMatrixDisplay(confusion_matrix=cm_train)
cm_train.plot()
plt.savefig('cm_train.png')
mlflow.log_artifact('cm_train.png')
plt.close()
cm_test = confusion_matrix(test_label, test_label_pred)
cm_test = ConfusionMatrixDisplay(confusion_matrix=cm_test)
cm_test.plot()
plt.savefig('cm_test.png')
mlflow.log_artifact('cm_test.png')
plt.close()
train_accuracy = accuracy_score(train_label, train_label_pred)
train_precision = precision_score(train_label, train_label_pred)
train_recall = recall_score(train_label, train_label_pred)
train_f1 = f1_score(train_label, train_label_pred)
test_accuracy = accuracy_score(test_label, test_label_pred)
test_precision = precision_score(test_label, test_label_pred)
test_recall = recall_score(test_label, test_label_pred)
test_f1 = f1_score(test_label, test_label_pred)
mlflow.log_metric('train_accuracy', train_accuracy)
mlflow.log_metric('train_precision', train_precision)
mlflow.log_metric('train_recall', train_recall)
mlflow.log_metric('train_f1', train_f1)
mlflow.log_metric('test_accuracy', test_accuracy)
mlflow.log_metric('test_precision', test_precision)
mlflow.log_metric('test_recall', test_recall)
mlflow.log_metric('test_f1', test_f1)
#####################
#</evaluate the model>
#####################
##########################
#<save and register model>
##########################
print('Registering the model via MLFlow')
mlflow.sklearn.log_model(
sk_model=model,
registered_model_name=args.registered_model_name,
artifact_path=args.registered_model_name,
)
mlflow.sklearn.save_model(
sk_model=model,
path=os.path.join(args.registered_model_name, 'trained_model'),
)
###########################
#</save and register model>
###########################
mlflow.end_run()
if __name__ == "__main__":
main()
To train the model, a command job needs to be configured and submitted; it specifies the input data and the model hyperparameters as inputs, then runs the training script using the specified compute resource, environment, and the parameters to be logged.
from azure.ai.ml import command
from azure.ai.ml import Input
registered_model_name = 'loanstatus_us_linear_model'
job = command(
inputs=dict(
train_data=Input(
type='uri_file',
path='azureml://datastores/loanstatus_datastore/paths/trainDF_US.csv',
),
test_data=Input(
type='uri_file',
path = 'azureml://datastores/loanstatus_datastore/paths/testDF_US.csv',
),
penalty='l1',
solver='saga',
max_iter=100000,
C=1,
tol=1e-06,
n_jobs=-1,
registered_model_name=registered_model_name,
),
code='./src/',
command='python main.py --train_data ${{inputs.train_data}} --test_data ${{inputs.test_data}} --penalty ${{inputs.penalty}} --solver ${{inputs.solver}} --max_iter ${{inputs.max_iter}} --C ${{inputs.C}} --tol ${{inputs.tol}} --n_jobs ${{inputs.n_jobs}} --registered_model_name ${{inputs.registered_model_name}}',
environment='aml-loanstatus-cpu@latest',
compute='cpu-cluster-E8s-v3',
display_name='loanstatus_us_linear_best_prediction',
)
Finally, this job can be submitted to run in Azure Machine Learning Studio using the create_or_update command with ml_client.
ml_client.create_or_update(job)
The submitted job can then be viewed by selecting the link in the output of the previous cell. The logged information with MLFlow including the model metrics and saved graphs can then be viewed/downloaded when the job completes.
Test Model on SMOTE Set¶
Now, we can fit the best model from iterating through the parameters contained within the grid search using the Upsampled data to the SMOTE set to compare how this model performs using another approach for handling class imbalance problems. After fitting the model, we can predict with the test set that was not used for training and evaluate the model performance.
print('Start Fit best model using gridsearch results on Upsampling to SMOTE data..')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
LassoMod_US_HPO.fit(X1_trainS, y1_train)
print('Finished Fit best model using gridsearch results on Upsampling to SMOTE data :',
time.time() - search_time_start)
Pkl_Filename = 'LassoMod_SMOTEusingUSHPO_mms_moreParams.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(LassoMod_US_HPO, file)
y_pred_US_HPO = LassoMod_US_HPO.predict(X1_testS)
print('Results from LASSO using Upsampling HPO on SMOTE Data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y1_test, y_pred_US_HPO)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y1_test, y_pred_US_HPO))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y1_test, y_pred_US_HPO))
print('Precision score : %.3f' % precision_score(y1_test, y_pred_US_HPO))
print('Recall score : %.3f' % recall_score(y1_test, y_pred_US_HPO))
print('F1 score : %.3f' % f1_score(y1_test, y_pred_US_HPO))
When testing the best model from the GridSearchCV using the Upsampled set with the SMOTE data, this resulted in the same accuracy, higher precision and F1 scores but lower recall.
Model Explanations using ELI5¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
perm_importance = PermutationImportance(LassoMod_US_HPO,
random_state=seed_value).fit(X1_testS,
y1_test)
html_obj = eli5.show_weights(perm_importance,
feature_names=X1_test1.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X1_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
There is a lower total_rec_int (0.377133 vs. 0.392205) compared to the Upsampled test set, while total_pymnt, out_prncp and loan_amnt are similar. Interestingly, home_ownership_RENT (0.129386 vs. 0.000037) and home_ownership_MORTGAGE (0.123567 vs. 0.000069) are considerably higher compared to the Upsampled test set.
Now we can use ELI5 to show the prediction.
html_obj2 = eli5.show_prediction(LassoMod_US_HPO, X1_test1.iloc[1],
show_feature_values=True)
html_obj2
For the loans to be Paid Off or Current, the features annual_inc, out_prncp, total_pymnt, total_bc_limit, total_bal_ex_mort, bc_open_to_buy, revol_bal, total_hi_credit_lim and last_pymnt_amnt with higher values are beneficial while the features loan_amnt and total_rec_int with higher values are not beneficial.
Grid Search: SMOTE¶
To follow the methods used for evaluating the Upsampled set, let's fit the SMOTE data using dask.delayed for parallel jobs with time being monitored and then compare the model accuracies to determine which parameters led to this metric being achieved.
print('Start SMOTE - Grid Search..')
search_time_start = time.time()
pipelines1_ = [dask.delayed(grid).fit(X1_train, y1_train.values.ravel())]
fit_pipelines1 = dask.compute(*pipelines1_, scheduler='processes')
print('Finished SMOTE - Grid Search :', time.time() - search_time_start)
for idx, val in enumerate(fit_pipelines1):
print('%s pipeline test accuracy: %.3f' % (pipe_dict[idx],
val.score(X1_test, y1_test)))
best_acc = 0.0
best_clf = 0
best_pipe = ''
for idx, val in enumerate(fit_pipelines1):
if val.score(X1_test, y1_test) > best_acc:
best_acc = val.score(X1_test, y1_test)
best_pipe = val
best_clf = idx
print('Classification with highest accuracy: %s' % pipe_dict[best_clf])
print('Lasso/Elastic Net using SMOTE - Best Estimator')
print(best_pipe.best_params_)
Now that we have determined the parameters that led to the highest accuracy during the search, we can fit a model with those parameters, save the model and predict with the trained model using the test set.
LassoMod_SMOTE = LogisticRegression(penalty='l1',
C=1,
solver='saga',
max_iter=100000,
tol=1e-06,
n_jobs=-1,
random_state=seed_value)
print('Start fit the best hyperparameters from SMOTE grid search to the data..')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
LassoMod_SMOTE.fit(X1_trainS, y1_train)
print('Finished fit the best hyperparameters from SMOTE grid search to the data:',
time.time() - search_time_start)
Pkl_Filename = 'Linear_HPO_SMOTE_mmsRecall_mostParams.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(LassoMod_SMOTE, file)
y_pred_SMOTE_HPO = LassoMod_SMOTE.predict(X1_testS)
print('Results from LASSO using SMOTE HPO:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y1_test, y_pred_SMOTE_HPO)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y1_test, y_pred_SMOTE_HPO))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y1_test, y_pred_SMOTE_HPO))
print('Precision score : %.3f' % precision_score(y1_test, y_pred_SMOTE_HPO))
print('Recall score : %.3f' % recall_score(y1_test, y_pred_SMOTE_HPO))
print('F1 score : %.3f' % f1_score(y1_test, y_pred_SMOTE_HPO))
Compared to the baseline model, the model determined using GridSearchCV resulted in practically the same metrics besides a higher precision score. Given the time required to search through the various models in the given parameter space, the baseline model would probably be sufficient unless an even higher recall score, i.e. a larger share of true positives correctly classified as positive, was needed for the task at hand.
Compared to the model after GridSearchCV using the Upsampled set, there was a higher precision score, but a lower recall score.
Model Explanations using ELI5¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
perm_importance = PermutationImportance(LassoMod_SMOTE,
random_state=seed_value).fit(X1_testS,
y1_test)
html_obj = eli5.show_weights(perm_importance,
feature_names=X1_test1.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X1_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
There is a lower total_rec_int (0.377133 vs. 0.392205) compared to the Upsampled test set, while total_pymnt, out_prncp and loan_amnt are similar. Interestingly, home_ownership_RENT (0.129386 vs. 0.000037) and home_ownership_MORTGAGE (0.123567 vs. 0.000069) are considerably higher compared to the Upsampled test set.
Now we can use ELI5 to show the prediction.
html_obj2 = eli5.show_prediction(LassoMod_SMOTE, X1_test1.iloc[1],
show_feature_values=True)
html_obj2
For the loans to be Paid Off or Current, the features annual_inc, out_prncp, total_pymnt, total_bc_limit, total_bal_ex_mort, bc_open_to_buy, revol_bal, total_hi_credit_lim and last_pymnt_amnt with higher values are beneficial while the features loan_amnt, and total_rec_int with higher values are not beneficial.
Test Model on Upsampled Set¶
Now, we can fit the model determined from the search using the SMOTE set, save the .pkl file, then predict with the trained model on the scaled test set and evaluate the model metrics.
print('Start fit best model using gridsearch results on SMOTE to Upsampling data..')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
LassoMod_SMOTE.fit(X_trainS, y_train)
print('Finished fit best model using gridsearch results on SMOTE to Upsampling data :',
time.time() - search_time_start)
Pkl_Filename = 'LassMod_UpsamplingusingSMOTEHPO_mmsRecall_mostParams.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(LassoMod_SMOTE, file)
y_pred_SMOTE_HPO = LassoMod_SMOTE.predict(X_testS)
print('Results from LASSO using SMOTE HPO on Upsampling data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test, y_pred_SMOTE_HPO)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred_SMOTE_HPO))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test, y_pred_SMOTE_HPO))
print('Precision score : %.3f' % precision_score(y_test, y_pred_SMOTE_HPO))
print('Recall score : %.3f' % recall_score(y_test, y_pred_SMOTE_HPO))
print('F1 score : %.3f' % f1_score(y_test, y_pred_SMOTE_HPO))
When testing the best model from the GridSearchCV using the SMOTE set with the Upsampled data, this resulted in the same accuracy but lower precision and F1 scores and a higher recall score.
Model Explanations using ELI5¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
perm_importance = PermutationImportance(LassoMod_SMOTE,
random_state=seed_value).fit(X_testS,
y_test)
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test1.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
The feature weights are the same values from the best model from the GridSearchCV with scoring='recall_weighted' for the Upsampled set.
Now we can use ELI5 to show the prediction.
html_obj2 = eli5.show_prediction(LassoMod_SMOTE, X_test1.iloc[1],
show_feature_values=True)
html_obj2
For the loans to be Paid Off or Current, the features out_prncp, total_pymnt, annual_inc, total_hi_credit_lim, total_bc_limit, revol_bal and last_pymnt_amnt with higher values are beneficial while the features loan_amnt, total_rec_int, total_bal_ex_mort and installment with higher values are not beneficial.
Naive Bayes¶
Naive Bayes classifiers are a family of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of conditional independence between every pair of features given the value of the class variable. Bayes' theorem is a formula that gives the conditional probability of an event A happening given that another event B has already occurred:
$$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $$
- $P(A)$ is called the prior probability
- $P(A|B)$ is called the posterior probability (the probability after taking into account the observation of $B$).
The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome. In Gaussian Naive Bayes, the continuous values of each feature are assumed to be distributed according to a Normal distribution and the likelihood of the features is assumed to be Gaussian.
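Concretely, the Gaussian likelihood of a feature value $x_i$ given class $y$ uses the per-class mean $\mu_y$ and variance $\sigma_y^2$ estimated from the training data:
$$ P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right) $$
In scikit-learn's GaussianNB, the var_smoothing parameter tuned later adds a small fraction of the largest feature variance to each $\sigma_y^2$ for numerical stability.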
In The Optimality of Naive Bayes, it was suggested that Naive Bayes can still perform well even if the features contain strong dependences as long as the dependences are evenly distributed in the target variable or if they cancel each other out.
Baseline Models¶
The notebook containing the baseline models and hyperparameter tuning using GridSearchCV is located here.
Baseline Default Parameters¶
- priors: Prior probabilities of the classes. If specified, the priors are not adjusted according to the data. default=None.
- var_smoothing: Portion of the largest variance of all features that is added to variances for calculation stability. default=1e-09.
Upsampled: Baseline Model¶
As with the baseline models for the Linear models, we first set the path where the ML results are saved; this is always needed and will be assumed for all subsequent workflows. Let's start with the Upsampled set by fitting the model, saving it as a .pkl file, then making predictions and evaluating on the test set.
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
with parallel_backend('threading', n_jobs=-1):
nb.fit(X_train, y_train)
Pkl_Filename = 'NB_Upsampling_baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(nb, file)
y_pred_US = nb.predict(X_test)
print('Results from Naives Bayes baseline model on Upsampled Data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test, y_pred_US)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred_US))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test, y_pred_US))
print('Precision score : %.3f' % precision_score(y_test, y_pred_US))
print('Recall score : %.3f' % recall_score(y_test, y_pred_US))
print('F1 score : %.3f' % f1_score(y_test, y_pred_US))
All of the model metrics for this baseline model are lower than both of the previous baseline linear models for the Upsampled set. Although accuracy=0.948 and precision=0.911, the lowest metric is recall_score=0.651 followed by f1_score=0.759. Naive Bayes is non-linear and might require too many search iterations to match the metrics achieved with the linear approaches, so let's try some different scoring metrics during GridSearchCV, since the runtime to completion is not too computationally demanding.
SMOTE: Baseline Model¶
We can now fit the baseline model for the SMOTE set using the default model parameters.
with parallel_backend('threading', n_jobs=-1):
nb.fit(X1_train, y1_train)
Pkl_Filename = 'NB_SMOTE_Baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(nb, file)
y_pred_SMOTE = nb.predict(X1_test)
print('Results from Naives Bayes baseline model on SMOTE Data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y1_test, y_pred_SMOTE)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y1_test, y_pred_SMOTE))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y1_test, y_pred_SMOTE))
print('Precision score : %.3f' % precision_score(y1_test, y_pred_SMOTE))
print('Recall score : %.3f' % recall_score(y1_test, y_pred_SMOTE))
print('F1 score : %.3f' % f1_score(y1_test, y_pred_SMOTE))
All of the model metrics for this baseline model are even lower than both of the previous baseline linear models for the SMOTE and the Upsampled sets. In particular, precision=0.465 is the lowest model metric observed up to this point, so precision is the metric we will have to use for scoring if the Naive Bayes algorithm is to have any chance of being considered with the class-rebalanced sets.
Grid Search using Pipelines: Precision¶
Grid Search: Upsampled¶
Using a similar approach to the one utilized for the Linear models, let's define the estimator as the default GaussianNB, the grid search parameters as param_grid, and the scoring as precision using 3-fold cross validation, and use all available resources with joblib in both the scikit-learn model and GridSearchCV. We can monitor the completion time as well to compare with the other hyperparameter searches. However, this still runs under CPU and RAM constraints with suboptimal runtimes; the literature has demonstrated significant runtime differences when using code optimized to run on a GPU, even with tabular data. Let's print the search's best score and estimator parameters after the search completes.
param_grid = {'var_smoothing': np.logspace(0,-9, num=1000)}
nb_grid_US = GridSearchCV(estimator=GaussianNB(),
param_grid=param_grid,
scoring='precision',
cv=3, verbose=1, n_jobs=-1)
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
nb_grid_US.fit(X_train, y_train)
print('Time to finish Grid Search using Upsampling:',
time.time() - search_time_start)
print('======================================================================')
print('Naive Bayes: Upsampling')
print('- Best Score:', nb_grid_US.best_score_)
print('- Best Estimator:', nb_grid_US.best_estimator_)
Using 3-fold cross validation monitoring precision resulted in the baseline model as the best model so far. Let's fit the best model from the grid search using the Upsampled data, and examine the classification metrics.
nb_US_HPO = nb_grid_US.best_estimator_
print('Start fit the best hyperparameters from Upsampling grid search to the data..')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
nb_US_HPO.fit(X_train, y_train)
print('Finished fit the best hyperparameters from Upsampling grid search to the data:',
time.time() - search_time_start)
Pkl_Filename = 'NB_HPO_US_precision.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(nb_US_HPO, file)
y_pred_US = nb_US_HPO.predict(X_test)
print('Results from Naives Bayes using Best HPO from GridSearchCV on Upsampled Data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test, y_pred_US)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred_US))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test, y_pred_US))
print('Precision score : %.3f' % precision_score(y_test, y_pred_US))
print('Recall score : %.3f' % recall_score(y_test, y_pred_US))
print('F1 score : %.3f' % f1_score(y_test, y_pred_US))
Compared to the baseline model, the model determined using GridSearchCV resulted in the same metrics. This suggests that this algorithm is going to prove to be difficult in this context.
Model Metrics with ELI5¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
perm_importance = PermutationImportance(nb_US_HPO,
random_state=seed_value).fit(X_test,
y_test)
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test.columns.tolist())
exp = format_as_dataframe(explanation)
exp
The feature recoveries is the only feature with contributing weights (0.115911).
Test Model on SMOTE Set¶
Now, we can fit the best model from iterating through the parameters contained within the grid search using the Upsampled data to the SMOTE set to compare how this model performs using another approach for handling class imbalance problems. After fitting the model, we can predict with the test set that was not used for training and evaluate the model performance.
print('Start Fit best model using gridsearch results on Upsampling to SMOTE data..')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
nb_US_HPO.fit(X1_train, y1_train)
print('Finished Fit best model using gridsearch results on Upsampling to SMOTE data :',
time.time() - search_time_start)
Pkl_Filename = 'NB_SMOTEusingUpsamplingHPO_precision.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(nb_US_HPO, file)
y_pred_SMOTE_US = nb_US_HPO.predict(X1_test)
print('Results from Naives Bayes using Upsampling Best HPO from GridSearchCV on SMOTE data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y1_test, y_pred_SMOTE_US)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y1_test, y_pred_SMOTE_US))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y1_test, y_pred_SMOTE_US))
print('Precision score : %.3f' % precision_score(y1_test, y_pred_SMOTE_US))
print('Recall score : %.3f' % recall_score(y1_test, y_pred_SMOTE_US))
print('F1 score : %.3f' % f1_score(y1_test, y_pred_SMOTE_US))
This resulted in the same model metrics as the baseline model.
Model Explanations using ELI5¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
perm_importance = PermutationImportance(nb_US_HPO,
random_state=seed_value).fit(X1_test,
y1_test)
html_obj = eli5.show_weights(perm_importance,
feature_names=X1_test.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X1_test.columns.tolist())
exp = format_as_dataframe(explanation)
exp
The recoveries feature contains less weight compared to the Upsampled model (0.074780 vs. 0.115911), while out_prncp (0.056124) and last_pymnt_amnt (0.042546) contain more weight than what was determined with the Upsampled model.
Grid Search using Pipelines: Weighted Precision¶
Given the low precision score obtained for the SMOTE set when scoring='precision' was used (0.465), which occurred because the parameters of the baseline model were selected, let's expand the parameter space by shifting the var_smoothing range to logspace(-7, -13), increasing num to 5000, and testing scoring='precision_weighted' with 10-fold cross validation.
param_grid = {'var_smoothing': np.logspace(-7,-13, num=5000)}
nb_grid = GridSearchCV(estimator=GaussianNB(),
param_grid=param_grid,
scoring='precision_weighted',
cv=10, verbose=4, n_jobs=-3)
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
nb_grid.fit(X1_train, y1_train)
print('Finished SMOTE - Grid Search :', time.time() - search_time_start)
print('======================================================================')
print('Naive Bayes using SMOTE ')
print('- Best Score:', nb_grid.best_score_)
print('- Best Estimator:', nb_grid.best_estimator_)
Let's fit the best model from the grid search using the SMOTE data, and examine the classification metrics.
nb_SMOTE_HPO = nb_grid.best_estimator_
print('Start fit the best hyperparameters from SMOTE grid search to the data..')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
nb_SMOTE_HPO.fit(X1_train, y1_train)
print('Finished fit the best hyperparameters from SMOTE grid search to the data:',
time.time() - search_time_start)
Pkl_Filename = 'NB_HPO_SMOTE_-7to-13_precisionWeighted_num5000_10cv.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(nb_SMOTE_HPO, file)
y_pred_SMOTE_HPO = nb_SMOTE_HPO.predict(X1_test)
print('Results from Naives Bayes using Best HPO from GridSearchCV on SMOTE data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y1_test, y_pred_SMOTE_HPO)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y1_test, y_pred_SMOTE_HPO))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y1_test, y_pred_SMOTE_HPO))
print('Precision score : %.3f' % precision_score(y1_test, y_pred_SMOTE_HPO))
print('Recall score : %.3f' % recall_score(y1_test, y_pred_SMOTE_HPO))
print('F1 score : %.3f' % f1_score(y1_test, y_pred_SMOTE_HPO))
Compared to the baseline NB model using the SMOTE set, all of the model metrics increased except for a decrease in the recall_score. However, given the low recall, F1 and precision scores, this model, even after testing various searches, does not seem worthwhile unless the data is preprocessed further by engineering new features from the current ones.
Model Explanations using ELI5¶
Although the model metrics are not ideal nor worth pursuing for further optimization, let's compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
perm_importance = PermutationImportance(nb_SMOTE_HPO,
random_state=seed_value).fit(X1_test,
y1_test)
html_obj = eli5.show_weights(perm_importance,
feature_names=X1_test.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X1_test.columns.tolist())
exp = format_as_dataframe(explanation)
exp
Test Model on Upsampled Set¶
Using the best model from grid search using the SMOTE set, let's compare the results to the ones from using the Upsampled set by fitting a model with the best parameters and evaluating the model performance.
print('Start fit best model using gridsearch results on SMOTE to Upsampling data..')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
nb_SMOTE_HPO.fit(X_train, y_train)
print('Finished fit best model using gridsearch results on SMOTE to Upsampling data:',
time.time() - search_time_start)
Pkl_Filename = 'NB_UpsamplingUsingSMOTEHPO_-7to-13_precisionWeighted_num5000_10cv.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(nb_SMOTE_HPO, file)
y_pred_US_SMOTE = nb_SMOTE_HPO.predict(X_test)
print('Results from Naives Bayes using SMOTE Best HPO from GridSearchCV on Upsampling data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test, y_pred_US_SMOTE)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred_US_SMOTE))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test, y_pred_US_SMOTE))
print('Precision score : %.3f' % precision_score(y_test, y_pred_US_SMOTE))
print('Recall score : %.3f' % recall_score(y_test, y_pred_US_SMOTE))
print('F1 score : %.3f' % f1_score(y_test, y_pred_US_SMOTE))
Compared to the baseline model using the default parameters, this resulted in lower model metrics for the Upsampled set. The accuracy, precision and recall scores were better than the SMOTE baseline model metrics though.
Model Explanations using ELI5¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
perm_importance = PermutationImportance(nb_SMOTE_HPO,
random_state=seed_value).fit(X_test,
y_test)
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test.columns.tolist())
exp = format_as_dataframe(explanation)
exp
The weight of recoveries is slightly higher for the SMOTE model applied to the Upsampled data, but effectively comparable (0.123842 vs. 0.115911).
When the increased parameter space with 10-fold cross validation was tested with the Upsampled set, the model resulted in worse model metrics than the initial hyperparameter search. So, let's move on to more model evaluation.
SparkML¶
Apache Spark is an open-source analytics engine that was developed to improve upon the limitations of MapReduce in the Hadoop Distributed File System by processing data in memory and distributing work over multiple operations in parallel. There are various ways to use it, ranging from real-time big data ingestion to ML/DL tasks. Let's first set up the environment by updating the packages and installing the Java JRE/JDK with:
1. apt update
2. apt install default-jre
3. apt install default-jdk
We can then install pyspark==3.3 and findspark using pip and set up the SparkSession. Both Paperspace and Colab were utilized, so the amount of RAM allocated to the SparkSession differed; Paperspace provides more RAM and CPU cores.
import findspark
from pyspark.sql import SparkSession
findspark.init()
spark = SparkSession.builder\
.master('local')\
.appName('Paperspace')\
.config('spark.driver.memory', '38g')\
.config('spark.executor.pyspark.memory', '32g')\
.config('spark.executor.cores', '4')\
.config('spark.python.worker.memory', '32g')\
.config('spark.sql.execution.arrow.pyspark.enabled', 'True')\
.config('spark.sql.debug.maxToStringFields', '1000')\
.config('spark.sql.autoBroadcastJoinThreshold', '-1')\
.config('spark.ui.port', '4050')\
.getOrCreate()
spark
Then we can remove the warnings from the environment by setting the log to only ERROR. After the SparkSession is configured, we can install & import the necessary packages and then set the seed for reproducibility.
spark.sparkContext.setLogLevel('ERROR')
Upsampled Minority Class¶
Let's first read the train and the test sets, which are stored as .csv files, cache the data into memory for better performance and then inspect the schema of the features to determine if the schema needs to be updated to reflect the proper data type(s).
import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')
seed_value = 42
os.environ['SparkML_HPO'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
trainDF = spark.read.csv('/notebooks/LoanStatus/Data/trainDF_US.csv',
header=True, inferSchema=True).cache()
print('\nTrain Schema')
trainDF.printSchema()
testDF = spark.read.csv('/notebooks/LoanStatus/Data/testDF_US.csv',
header=True, inferSchema=True).cache()
print('\nTest Schema')
testDF.printSchema()
To prepare the data for modeling, the features and label (loan_status) need to be defined. Let's start with the training set. The VectorAssembler is set up with the features as inputCols and the outputCol='unscaledFeatures'. Then the train feature columns can be transformed into a vector column.
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler
features = trainDF.columns[0: len(trainDF.columns) - 1]
trainDF = trainDF.select(col('loan_status').alias('label'), *features)
vecAssembler = VectorAssembler(inputCols=features,
outputCol='unscaledFeatures',
handleInvalid='skip')
trainDF = vecAssembler.transform(trainDF)
The same steps that were applied to the training set should also be applied to the test set.
features = testDF.columns[0: len(testDF.columns) - 1]
testDF = testDF.select(col('loan_status').alias('label'), *features)
testDF = vecAssembler.transform(testDF)
Some models perform better when the data is scaled. The MinMaxScaler rescales the data to a common range from the minimum to maximum value linearly using the summary statistics from the variables. The StandardScaler standardizes the variables by eliminating the mean value and scales by dividing each value by the standard deviation. Given this is a classification problem, let's define a few metrics, areaUnderROC and accuracy, to evaluate the performance from training different model types.
from pyspark.ml.feature import MinMaxScaler, StandardScaler
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
mmScaler = MinMaxScaler(inputCol='unscaledFeatures',
outputCol='scaledFeatures',
min=0, max=1)
stdScaler = StandardScaler(inputCol='unscaledFeatures',
outputCol='scaledFeatures',
withStd=True,
withMean=False)
evaluator_auroc = BinaryClassificationEvaluator(labelCol='label',
metricName='areaUnderROC')
evaluator_acc = MulticlassClassificationEvaluator(labelCol='label',
metricName='accuracy')
The models examined using Spark's Machine Learning Library (MLlib) are:
- Logistic Regression (LR)
- Linear SVM Classifier (LinearSVC)
- Decision Trees (DT)
- Random Forest (RF)
- Gradient Boosted Trees (GBT)
To set the baseline models, a Pipeline was utilized with the input of the unscaledFeatures from the VectorAssembler, then potentially the scaledFeatures, if a scaler was being used, to the featuresCol for the model.
Logistic Regression¶
First, the dependencies are imported from pyspark.ml, time and mlflow. The pipeline was set up by first defining the model to use, which utilizes the default parameters. The steps in the pipeline consist of scaling the data with the MinMaxScaler, fitting the model (timed) and saving it for later use, if needed.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
import time
try:
import mlflow.pyspark.ml
mlflow.pyspark.ml.autolog()
except:
print(f'Your version of MLflow ({mlflow.__version__}) does not support pyspark.ml for autologging. To use autologging, upgrade your MLflow client version or use Databricks Runtime for ML 8.3 or above.')
lr = LogisticRegression(family='binomial',
labelCol='label',
featuresCol='scaledFeatures',
regParam=0.0,
elasticNetParam=0.0,
maxIter=100)
search_time_start = time.time()
pipeline_lr = Pipeline(stages=[mmScaler, lr])
pipelineModel_lr = pipeline_lr.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)
Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_lr_us'
pipelineModel_lr.save(Path)
Then predictions can be made using the test set. The default parameters of regParam=0.0 and elasticNetParam=0.0 provide an AUROC = 0.987 and an Accuracy = 98.583%, so optimizing this model might prove to be difficult. Let's now examine some other model types and assess whether hyperparameter tuning can improve performance beyond the fitted LogisticRegression model.
prediction_lr = pipelineModel_lr.transform(testDF)
lr_auroc = evaluator_auroc.evaluate(prediction_lr)
print('Baseline: LogisticRegression')
print('Area under ROC curve: %g' % (lr_auroc))
print('Test Error: %g' % (1.0 - lr_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_lr))
LinearSVC¶
from pyspark.ml.classification import LinearSVC
lsvc = LinearSVC(labelCol='label',
featuresCol='scaledFeatures',
regParam=0.0,
tol=1e-5,
maxIter=100)
search_time_start = time.time()
pipeline_lsvc = Pipeline(stages=[stdScaler, lsvc])
pipelineModel_lsvc = pipeline_lsvc.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)
Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_lsvc_us'
pipelineModel_lsvc.save(Path)
prediction_lsvc = pipelineModel_lsvc.transform(testDF)
lsvc_auroc = evaluator_auroc.evaluate(prediction_lsvc)
print('Baseline: LinearSVC')
print('Area under ROC curve: %g' % (lsvc_auroc))
print('Test Error: %g' % (1.0 - lsvc_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_lsvc))
Decision Tree¶
from pyspark.ml.classification import DecisionTreeClassifier
dt = DecisionTreeClassifier(labelCol='label',
featuresCol='unscaledFeatures',
maxDepth=5,
maxBins=16,
impurity='gini',
seed=seed_value)
search_time_start = time.time()
pipeline_dt = Pipeline(stages=[dt])
pipelineModel_dt = pipeline_dt.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)
Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_dt_us'
pipelineModel_dt.save(Path)
prediction_dt = pipelineModel_dt.transform(testDF)
dt_auroc = evaluator_auroc.evaluate(prediction_dt)
print('Baseline: Decision Tree')
print('Area under ROC curve: %g' % (dt_auroc))
print('Test Error: %g' % (1.0 - dt_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_dt))
Random Forest¶
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(labelCol='label',
featuresCol='unscaledFeatures',
impurity='gini',
maxDepth=5,
maxBins=32,
numTrees=10,
seed=seed_value)
search_time_start = time.time()
pipeline_rf = Pipeline(stages=[rf])
pipelineModel_rf = pipeline_rf.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)
Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_rf_us'
pipelineModel_rf.save(Path)
prediction_rf = pipelineModel_rf.transform(testDF)
rf_auroc = evaluator_auroc.evaluate(prediction_rf)
print('Baseline: Random Forest')
print('Area under ROC curve: %g' % (rf_auroc))
print('Test Error: %g' % (1.0 - rf_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_rf))
Gradient Boosted Trees¶
from pyspark.ml.classification import GBTClassifier
gbt = GBTClassifier(labelCol='label',
featuresCol='unscaledFeatures',
maxDepth=5,
maxBins=32,
maxIter=10,
seed=seed_value)
search_time_start = time.time()
pipeline_gbt = Pipeline(stages=[gbt])
pipelineModel_gbt = pipeline_gbt.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)
Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_gbt_us'
pipelineModel_gbt.save(Path)
prediction_gbt = pipelineModel_gbt.transform(testDF)
gbt_auroc = evaluator_auroc.evaluate(prediction_gbt)
print('Baseline: Gradient Boosted Trees')
print('Area under ROC curve: %g' % (gbt_auroc))
print('Test Error: %g' % (1.0 - gbt_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_gbt))
Of the models fit, only LR and LinearSVC used scaled data prior to modeling. The scaling for LinearSVC used the StandardScaler since SVMs tend to perform better with scaled data and the computation time is faster compared with unscaled data. Given all models used the same runtime environment, the DT model completed the fastest at 59.3 seconds while LR took the longest to finish at 233.4 seconds. Given the lower accuracies for GBT, RF and DT, these will require more hyperparameter tuning than LR and LinearSVC. However, accuracy is not a great metric, especially if the original data is imbalanced. Therefore, let's add some more metrics and then compare all the models at a greater level of granularity.
Model Metrics¶
We can utilize a for loop which iterates through the models tested and assesses various metrics, both at a granular level, such as the number of true positives and false positives, and at a more macroscopic level, such as recall, precision, and F1 score.
print('Upsampling Baseline Models:')
for model in ['prediction_lr', 'prediction_lsvc', 'prediction_dt',
'prediction_rf', 'prediction_gbt']:
df = globals()[model]
tp = df[(df.label == 1) & (df.prediction == 1)].count()
tn = df[(df.label == 0) & (df.prediction == 0)].count()
fp = df[(df.label == 0) & (df.prediction == 1)].count()
fn = df[(df.label == 1) & (df.prediction == 0)].count()
a = ((tp + tn)/df.count())
if(tp + fn == 0.0):
r = 0.0
p = float(tp) / (tp + fp)
elif(tp + fp == 0.0):
r = float(tp) / (tp + fn)
p = 0.0
else:
r = float(tp) / (tp + fn)
p = float(tp) / (tp + fp)
if(p + r == 0):
f1 = 0
else:
f1 = 2 * ((p * r)/(p + r))
print('\nModel:', model)
print('True Positives:', tp)
print('True Negatives:', tn)
print('False Positives:', fp)
print('False Negatives:', fn)
print('Total:', df.count())
print('Accuracy:', a)
print('Recall:', r)
print('Precision: ', p)
print('F1 score:', f1)
print('\n')
LR performed the best on all of the macro metrics. Perhaps using a linear approach might be worthwhile in this scenario since the metrics from the baseline model are all over 90%. However, the overall recall score for each of the models is relatively low, with the highest being 91.0%. Also, DT, RF and GBT have much lower precision scores for the Upsampled baseline models.
SMOTE¶
Upon the initial inspection of the schema of the train and test sets when the data was loaded, there were some differences when inferSchema=True was used. The differences can be noted below. This was mostly due to the creation of uint8 dummy variables during the variable selection process and the creation of FloatType with inferSchema=True. To keep the train/test sets consistent with the Upsampled sets, this was adjusted.
Then, the train and the test sets were processed using the same approaches described for the Upsampled data.
from pyspark.sql.types import IntegerType, FloatType
trainDF = spark.read.csv('/content/drive/MyDrive/LoanStatus/Data/trainDF_SMOTE.csv',
header=True, inferSchema=True).cache()
trainDF = trainDF \
.withColumn('loan_amnt', trainDF['loan_amnt'].cast(IntegerType())) \
.withColumn('revol_bal', trainDF['revol_bal'].cast(IntegerType())) \
.withColumn('term_ 60 months', trainDF['term_ 60 months'].cast(IntegerType())) \
.withColumn('grade_B', trainDF['grade_B'].cast(IntegerType())) \
.withColumn('grade_C', trainDF['grade_C'].cast(IntegerType())) \
.withColumn('grade_D', trainDF['grade_D'].cast(IntegerType())) \
.withColumn('home_ownership_MORTGAGE', trainDF['home_ownership_MORTGAGE'].cast(IntegerType())) \
.withColumn('home_ownership_OWN', trainDF['home_ownership_OWN'].cast(IntegerType())) \
.withColumn('home_ownership_RENT', trainDF['home_ownership_RENT'].cast(IntegerType())) \
.withColumn('verification_status_Source Verified', trainDF['verification_status_Source Verified'].cast(IntegerType())) \
.withColumn('verification_status_Verified', trainDF['verification_status_Verified'].cast(IntegerType())) \
.withColumn('purpose_credit_card', trainDF['purpose_credit_card'].cast(IntegerType())) \
.withColumn('initial_list_status_w', trainDF['initial_list_status_w'].cast(IntegerType())) \
.withColumn('application_type_Joint App', trainDF['application_type_Joint App'].cast(IntegerType())) \
.withColumn('disbursement_method_DirectPay', trainDF['disbursement_method_DirectPay'].cast(IntegerType()))
print('\nTrain Schema')
trainDF.printSchema()
testDF = spark.read.csv('/content/drive/MyDrive/LoanStatus/Data/testDF_SMOTE.csv',
header=True, inferSchema=True).cache()
print('\nTest Schema')
testDF.printSchema()
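As a reference, a brief sketch of applying the same label/feature preparation and VectorAssembler transformation used for the Upsampled sets to the SMOTE train and test sets (assuming loan_status is again the last column):
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler

# Feature columns are everything except the last column (loan_status)
features = trainDF.columns[0: len(trainDF.columns) - 1]
# Alias the label and assemble the features into a single vector column
trainDF = trainDF.select(col('loan_status').alias('label'), *features)
vecAssembler = VectorAssembler(inputCols=features,
                               outputCol='unscaledFeatures',
                               handleInvalid='skip')
trainDF = vecAssembler.transform(trainDF)

# Repeat the same steps for the test set
features = testDF.columns[0: len(testDF.columns) - 1]
testDF = testDF.select(col('loan_status').alias('label'), *features)
testDF = vecAssembler.transform(testDF)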
Logistic Regression¶
Let's now run the baseline models for the SMOTE set and compare the different model results within the SMOTE data as well as the Upsampled sets.
lr = LogisticRegression(family='binomial',
labelCol='label',
featuresCol='scaledFeatures',
regParam=0.0,
elasticNetParam=0.0,
maxIter=100)
search_time_start = time.time()
pipeline_lr = Pipeline(stages=[mmScaler, lr])
pipelineModel_lr = pipeline_lr.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)
Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_lr_smote'
pipelineModel_lr.save(Path)
prediction_lr = pipelineModel_lr.transform(testDF)
lr_auroc = evaluator_auroc.evaluate(prediction_lr)
print('Baseline: Logistic Regression')
print('Area under ROC curve: %g' % (lr_auroc))
print('Test Error: %g' % (1.0 - lr_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_lr))
LinearSVC¶
lsvc = LinearSVC(labelCol='label',
featuresCol='scaledFeatures',
regParam=0.0,
tol=1e-5,
maxIter=100)
search_time_start = time.time()
pipeline_lsvc = Pipeline(stages=[stdScaler, lsvc])
pipelineModel_lsvc = pipeline_lsvc.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)
Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_lsvc_smote'
pipelineModel_lsvc.save(Path)
prediction_lsvc = pipelineModel_lsvc.transform(testDF)
lsvc_auroc = evaluator_auroc.evaluate(prediction_lsvc)
print('Baseline: LinearSVC')
print('Area under ROC curve: %g' % (lsvc_auroc))
print('Test Error: %g' % (1.0 - lsvc_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_lsvc))
Decision Tree¶
dt = DecisionTreeClassifier(labelCol='label',
featuresCol='unscaledFeatures',
maxDepth=5,
maxBins=16,
impurity='gini',
seed=seed_value)
search_time_start = time.time()
pipeline_dt = Pipeline(stages=[dt])
pipelineModel_dt = pipeline_dt.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)
Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_dt_smote'
pipelineModel_dt.save(Path)
prediction_dt = pipelineModel_dt.transform(testDF)
dt_auroc = evaluator_auroc.evaluate(prediction_dt)
print('Baseline: Decision Tree')
print('Area under ROC curve: %g' % (dt_auroc))
print('Test Error: %g' % (1.0 - dt_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_dt))
Random Forest¶
rf = RandomForestClassifier(labelCol='label',
featuresCol='unscaledFeatures',
maxDepth=5,
maxBins=32,
numTrees=10,
impurity='gini',
seed=seed_value)
search_time_start = time.time()
pipeline_rf = Pipeline(stages=[rf])
pipelineModel_rf = pipeline_rf.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)
Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_rf_smote'
pipelineModel_rf.save(Path)
prediction_rf = pipelineModel_rf.transform(testDF)
rf_auroc = evaluator_auroc.evaluate(prediction_rf)
print('Baseline: Random Forest')
print('Area under ROC curve: %g' % (rf_auroc))
print('Test Error: %g' % (1.0 - rf_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_rf))
Gradient Boosted Trees¶
gbt = GBTClassifier(labelCol='label',
featuresCol='unscaledFeatures',
maxDepth=5,
maxBins=32,
maxIter=10,
seed=seed_value)
search_time_start = time.time()
pipeline_gbt = Pipeline(stages=[gbt])
pipelineModel_gbt = pipeline_gbt.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)
Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_gbt_smote'
pipelineModel_gbt.save(Path)
prediction_gbt = pipelineModel_gbt.transform(testDF)
gbt_auroc = evaluator_auroc.evaluate(prediction_gbt)
print('Baseline: Gradient Boosted Trees')
print('Area under ROC curve: %g' % (gbt_auroc))
print('Test Error: %g' % (1.0 - gbt_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_gbt))
Model Metrics¶
print('SMOTE Baseline Models:')
for model in ['prediction_lr', 'prediction_lsvc', 'prediction_dt',
'prediction_rf', 'prediction_gbt']:
df = globals()[model]
tp = df[(df.label == 1) & (df.prediction == 1)].count()
tn = df[(df.label == 0) & (df.prediction == 0)].count()
fp = df[(df.label == 0) & (df.prediction == 1)].count()
fn = df[(df.label == 1) & (df.prediction == 0)].count()
a = ((tp + tn)/df.count())
if(tp + fn == 0.0):
r = 0.0
p = float(tp) / (tp + fp)
elif(tp + fp == 0.0):
r = float(tp) / (tp + fn)
p = 0.0
else:
r = float(tp) / (tp + fn)
p = float(tp) / (tp + fp)
if(p + r == 0):
f1 = 0
else:
f1 = 2 * ((p * r)/(p + r))
print('\nModel:', model)
print('True Positives:', tp)
print('True Negatives:', tn)
print('False Positives:', fp)
print('False Negatives:', fn)
print('Total:', df.count())
print('Accuracy:', a)
print('Recall:', r)
print('Precision: ', p)
print('F1 score:', f1)
print('\n')
LR had the best score for three of the four metrics; LinearSVC had a higher precision score. The precision of LinearSVC was higher for the SMOTE data compared to the Upsampled (US) results. DT was the fastest to complete at 61.8 seconds while LinearSVC took the longest at 134.3 seconds. Overall, the US models had higher recall and F1 scores, while the SMOTE models had higher accuracy and precision.
GridSearchCV¶
The GridSearchCV models for both the Upsampled and SMOTE sets are located here.
Upsampling: Gradient Boosted Trees¶
From the models evaluated using the defined GridSearchCV parameters, GBT performed the best. So, let's evaluate the performance by first setting up the pipeline components with the different stages, although only the model stage is needed here since scaling the data isn't required. Then we define the space in the parameter grid, followed by the components within CrossValidator, using AUROC as the metric to evaluate with numFolds=5 and parallelism=5, to assess which combination of parameters results in the highest AUROC.
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
model = GBTClassifier(labelCol='label',
featuresCol='unscaledFeatures',
seed=seed_value)
pipeline_gbt_hpo = Pipeline(stages=[model])
paramGrid = (ParamGridBuilder()
.addGrid(model.maxDepth, [5, 10, 15])
.addGrid(model.maxBins, [32, 64])
.addGrid(model.maxIter, [10])
.build())
cv = CrossValidator(estimator=pipeline_gbt_hpo,
estimatorParamMaps=paramGrid,
evaluator=evaluator_auroc,
numFolds=5, parallelism=5)
search_time_start = time.time()
pipelineModel_gbt_hpo = cv.fit(trainDF)
print('Time to perform GridSearchCV:', time.time() - search_time_start)
path = '/content/drive/MyDrive/LoanStatus/Python/Models/ML/SparkML/Models/GridSearchCV/'
gbt_grid = pipelineModel_gbt_hpo.bestModel
gbt_grid.save(path + '/pipelineModel_gbt_us_hpo_grid')
From the evaluated pipeline, we can extract the parameter combination with the best result and then view all of the results (AUROC) for each parameter combination.
print(pipelineModel_gbt_hpo.getEstimatorParamMaps()[np.argmax(pipelineModel_gbt_hpo.avgMetrics)])
list(zip(pipelineModel_gbt_hpo.avgMetrics,
pipelineModel_gbt_hpo.getEstimatorParamMaps()))
Given these parameters resulted in the best objective, the highest AUROC, we can use the trained model to predict with the test set and evaluate the AUROC, test error and accuracy on data that was not used for training.
prediction_gbt = pipelineModel_gbt_hpo.transform(testDF)
gbt_auroc = evaluator_auroc.evaluate(prediction_gbt)
print('GridSearchCV: Gradient Boosted Trees')
print('Area under ROC curve: %g' % (gbt_auroc))
print('Test Error: %g ' % (1.0 - gbt_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_gbt))
Then we can extract the information from the pipeline.
pipelineModel_gbt_hpo.bestModel.stages[-1].extractParamMap()
best_mod = pipelineModel_gbt_hpo.bestModel
param_dict = best_mod.stages[-1].extractParamMap()
sane_dict = {}
for k, v in param_dict.items():
sane_dict[k.name] = v
sane_dict
SMOTE: Gradient Boosted Trees¶
As with the Upsampled set, GBT performed the best out of the models evaluated using the defined grid. We can use the same methods to evaluate this model and compare it with the results from the Upsampled search. To complete this, we need to define the stages of the pipeline, the parameters in the parameter space, as well as the estimator, the number of folds for cross validation and the parallelism.
model = GBTClassifier(labelCol='label',
featuresCol='unscaledFeatures',
seed=seed_value)
pipeline_gbt_hpo = Pipeline(stages=[model])
paramGrid = (ParamGridBuilder()
.addGrid(model.maxDepth, [5, 10, 15])
.addGrid(model.maxBins, [32, 64])
.addGrid(model.maxIter, [10])
.build())
cv = CrossValidator(estimator=pipeline_gbt_hpo,
estimatorParamMaps=paramGrid,
evaluator=evaluator_auroc,
numFolds=5, parallelism=5)
search_time_start = time.time()
pipelineModel_gbt_hpo = cv.fit(trainDF)
print('Time to perform GridSearchCV:', time.time() - search_time_start)
gbt_grid = pipelineModel_gbt_hpo.bestModel
gbt_grid.save('/content/drive/MyDrive/LoanStatus/Python/Models/ML/SparkML/Models/GridSearchCV/pipelineModel_gbt_smote_hpo_grid')
From the evaluated pipeline, we can extract the parameter combination with the best result and then view all of the results (AUROC) for each parameter combination.
print(pipelineModel_gbt_hpo.getEstimatorParamMaps()[np.argmax(pipelineModel_gbt_hpo.avgMetrics)])
list(zip(pipelineModel_gbt_hpo.avgMetrics,
pipelineModel_gbt_hpo.getEstimatorParamMaps()))
Let's now predict with the pipeline, determine the AUROC, test error and accuracy and extract the information from the pipeline.
prediction_gbt = pipelineModel_gbt_hpo.transform(testDF)
gbt_auroc = evaluator_auroc.evaluate(prediction_gbt)
print('GridSearchCV: Gradient Boosted Trees')
print('Area under ROC curve: %g' % (gbt_auroc))
print('Test Error: %g ' % (1.0 - gbt_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_gbt))
pipelineModel_gbt_hpo.bestModel.stages[-1].extractParamMap()
From the search, we can determine the best model and the associated parameters.
bestPipeline = pipelineModel_gbt_hpo.bestModel
bestModel = bestPipeline.stages[-1]
bestParams = bestModel.extractParamMap()
print(bestPipeline)
print('\n')
print(bestModel)
print('\n')
print(bestParams)
best_mod = pipelineModel_gbt_hpo.bestModel
param_dict = best_mod.stages[-1].extractParamMap()
sane_dict = {}
for k, v in param_dict.items():
sane_dict[k.name] = v
sane_dict
GridSearchCV Best Models¶
Let's first load the saved best models from the GridSearchCV searches for the Upsampled set and predict using the respective test set.
from pyspark.ml import PipelineModel
path = '/content/drive/MyDrive/LoanStatus/Python/Models/ML/SparkML/Models/GridSearchCV/'
pipelineModel_lr_hpo_US = PipelineModel.load(path + '/pipelineModel_lr_us_hpo_grid/')
pipelineModel_lsvc_hpo_US = PipelineModel.load(path + '/pipelineModel_lsvc_us_hpo_grid/')
pipelineModel_dt_hpo_US = PipelineModel.load(path + '/pipelineModel_dt_us_hpo_grid/')
pipelineModel_rf_hpo_US = PipelineModel.load(path + '/pipelineModel_rf_us_hpo_grid/')
pipelineModel_gbt_hpo_US = PipelineModel.load(path + '/pipelineModel_gbt_us_hpo_grid/')
prediction_lr = pipelineModel_lr_hpo_US.transform(testDF_US)
prediction_lsvc = pipelineModel_lsvc_hpo_US.transform(testDF_US)
prediction_dt = pipelineModel_dt_hpo_US.transform(testDF_US)
prediction_rf = pipelineModel_rf_hpo_US.transform(testDF_US)
prediction_gbt = pipelineModel_gbt_hpo_US.transform(testDF_US)
print('GridSearchCV Best Models Metrics: Upsampling')
print('\n')
print('Area Under ROC Curve:')
print('Logistic Regression:', evaluator_auroc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_auroc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_auroc.evaluate(prediction_dt))
print('Random Forest:', evaluator_auroc.evaluate(prediction_rf))
print('Gradient Boosted Trees:', evaluator_auroc.evaluate(prediction_gbt))
print('\n')
print('Accuracy:')
print('Logistic Regression:', evaluator_acc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_acc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_acc.evaluate(prediction_dt))
print('Random Forest:', evaluator_acc.evaluate(prediction_rf))
print('Gradient Boosted Trees:', evaluator_acc.evaluate(prediction_gbt))
print('GridSearchCV Best Models Metrics: Upsampling')
for model in ['prediction_lr', 'prediction_lsvc', 'prediction_dt',
'prediction_rf', 'prediction_gbt']:
df = globals()[model]
tp = df[(df.label == 1) & (df.prediction == 1)].count()
tn = df[(df.label == 0) & (df.prediction == 0)].count()
fp = df[(df.label == 0) & (df.prediction == 1)].count()
fn = df[(df.label == 1) & (df.prediction == 0)].count()
a = ((tp + tn)/df.count())
if(tp + fn == 0.0):
r = 0.0
p = float(tp) / (tp + fp)
elif(tp + fp == 0.0):
r = float(tp) / (tp + fn)
p = 0.0
else:
r = float(tp) / (tp + fn)
p = float(tp) / (tp + fp)
if(p + r == 0):
f1 = 0
else:
f1 = 2 * ((p * r)/(p + r))
print('\nModel:', model)
print('True Positives:', tp)
print('True Negatives:', tn)
print('False Positives:', fp)
print('False Negatives:', fn)
print('Total:', df.count())
print('Accuracy:', a)
print('Recall:', r)
print('Precision: ', p)
print('F1 score:', f1)
print('\n')
print('\n')
The best AUROC and accuracy were 0.9849 and 98.24%, respectively, using GBT. The best precision was 95.54% using LinearSVC, the best recall was 92.30% using DT, and the best F1 score was 92.95% with GBT.
Load Saved Models - SMOTE¶
Now we can use the same methods mentioned above to evaluate the SMOTE models using the respective testDF.
pipelineModel_lr_hpo_SMOTE = PipelineModel.load(path + '/pipelineModel_lr_smote_hpo_grid/')
pipelineModel_lsvc_hpo_SMOTE = PipelineModel.load(path + '/pipelineModel_lsvc_smote_hpo_grid/')
pipelineModel_rf_hpo_SMOTE = PipelineModel.load(path + '/pipelineModel_rf_smote_hpo_grid/')
pipelineModel_gbt_hpo_SMOTE = PipelineModel.load(path + '/pipelineModel_gbt_smote_hpo_grid/')
pipelineModel_dt_hpo_SMOTE = PipelineModel.load(path + '/pipelineModel_dt_smote_hpo_grid/')
prediction_lr = pipelineModel_lr_hpo_SMOTE.transform(testDF_SMOTE)
prediction_lsvc = pipelineModel_lsvc_hpo_SMOTE.transform(testDF_SMOTE)
prediction_dt = pipelineModel_dt_hpo_SMOTE.transform(testDF_SMOTE)
prediction_rf = pipelineModel_rf_hpo_SMOTE.transform(testDF_SMOTE)
prediction_gbt = pipelineModel_gbt_hpo_SMOTE.transform(testDF_SMOTE)
print('GridSearchCV Best Models Metrics: SMOTE')
print('\n')
print('Area Under ROC Curve:')
print('Logistic Regression:', evaluator_auroc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_auroc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_auroc.evaluate(prediction_dt))
print('Random Forest:', evaluator_auroc.evaluate(prediction_rf))
print('Gradient Boosted Trees:', evaluator_auroc.evaluate(prediction_gbt))
print('\n')
print('Accuracy:')
print('Logistic Regression:', evaluator_acc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_acc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_acc.evaluate(prediction_dt))
print('Random Forest:', evaluator_acc.evaluate(prediction_rf))
print('Gradient Boosted Trees:', evaluator_acc.evaluate(prediction_gbt))
print('GridSearchCV Best Models Metrics: SMOTE')
for model in ['prediction_lr', 'prediction_lsvc', 'prediction_dt',
'prediction_rf', 'prediction_gbt']:
df = globals()[model]
tp = df[(df.label == 1) & (df.prediction == 1)].count()
tn = df[(df.label == 0) & (df.prediction == 0)].count()
fp = df[(df.label == 0) & (df.prediction == 1)].count()
fn = df[(df.label == 1) & (df.prediction == 0)].count()
a = ((tp + tn)/df.count())
if(tp + fn == 0.0):
r = 0.0
p = float(tp) / (tp + fp)
elif(tp + fp == 0.0):
r = float(tp) / (tp + fn)
p = 0.0
else:
r = float(tp) / (tp + fn)
p = float(tp) / (tp + fp)
if(p + r == 0):
f1 = 0
else:
f1 = 2 * ((p * r)/(p + r))
print('\nModel:', model)
print('True Positives:', tp)
print('True Negatives:', tn)
print('False Positives:', fp)
print('False Negatives:', fn)
print('Total:', df.count())
print('Accuracy:', a)
print('Recall:', r)
print('Precision: ', p)
print('F1 score:', f1)
print('\n')
For the best GridSearchCV SMOTE models, the best AUROC and accuracy were 0.9865 and 98.59%, respectively, using GBT. The best precision was 95.54% using LinearSVC, the best recall was 92.30% using DT, and the best F1 score was 92.95% with GBT.
For the best GridSearchCV Upsampled models, the best AUROC and accuracy were 0.9849 and 98.24%, respectively, using GBT. The best precision was 95.54% using LinearSVC, the best recall was 92.30% using DT, and the best F1 score was 92.95% with GBT.
Load Saved Models - Upsampled¶
After loading the models saved from the Hyperopt searches, let's predict using the test set of the Upsampled data and evaluate the model metrics from the models where AUROC was monitored.
from pyspark.ml import PipelineModel
path = '/content/drive/MyDrive/LoanStatus/Python/Models/ML/SparkML/Models/Hyperopt/'
pipelineModel_lr_hyperopt_auroc_US = PipelineModel.load(path
+ '/pipelineModel_lr_hyperopt_us_auroc_100trials/')
pipelineModel_lsvc_hyperopt_auroc_US = PipelineModel.load(path
+ '/pipelineModel_lsvc_hyperopt_us_auroc/')
pipelineModel_dt_hyperopt_auroc_US = PipelineModel.load(path
+ '/pipelineModel_dt_hyperopt_us_auroc2/')
pipelineModel_rf_hyperopt_auroc_US = PipelineModel.load(path
+ '/pipelineModel_rf_hyperopt_us_auroc_30trials/')
pipelineModel_rf_hyperopt_auroc1_US = PipelineModel.load(path
+ '/pipelineModel_rf_hyperopt_us_auroc_moreParams/')
pipelineModel_gbt_hyperopt_auroc_US = PipelineModel.load(path
+ '/pipelineModel_gbt_hyperopt_us_auroc/')
pipelineModel_lr_hyperopt_f1_US = PipelineModel.load(path
+ '/pipelineModel_lr_hyperopt_us_f1_100trials/')
pipelineModel_lsvc_hyperopt_f1_US = PipelineModel.load(path
+ '/pipelineModel_lsvc_hyperopt_us_f1/')
pipelineModel_dt_hyperopt_f1_US = PipelineModel.load(path
+ '/pipelineModel_dt_hyperopt_us_f1/')
pipelineModel_rf_hyperopt_f1_US = PipelineModel.load(path
+ '/pipelineModel_rf_hyperopt_us_f1_30trials/')
pipelineModel_rf_hyperopt_f1_1_US = PipelineModel.load(path
+ '/pipelineModel_rf_hyperopt_us_f1_moreParams/')
pipelineModel_gbt_hyperopt_f1_US = PipelineModel.load(path
+ '/pipelineModel_gbt_hyperopt_us_f1/')
prediction_lr = pipelineModel_lr_hyperopt_auroc_US.transform(testDF_US)
prediction_lsvc = pipelineModel_lsvc_hyperopt_auroc_US.transform(testDF_US)
prediction_dt = pipelineModel_dt_hyperopt_auroc_US.transform(testDF_US)
prediction_rf = pipelineModel_rf_hyperopt_auroc_US.transform(testDF_US)
prediction_rf1 = pipelineModel_rf_hyperopt_auroc1_US.transform(testDF_US)
prediction_gbt = pipelineModel_gbt_hyperopt_auroc_US.transform(testDF_US)
print('Hyperopt Best Models AUROC Metrics: Upsampling')
print('\n')
print('Area Under ROC Curve:')
print('Logistic Regression:', evaluator_auroc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_auroc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_auroc.evaluate(prediction_dt))
print('Random Forest:', evaluator_auroc.evaluate(prediction_rf))
print('Random Forest - More Params:', evaluator_auroc.evaluate(prediction_rf1))
print('Gradient Boosted Trees:', evaluator_auroc.evaluate(prediction_gbt))
print('\n')
print('Accuracy:')
print('Logistic Regression:', evaluator_acc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_acc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_acc.evaluate(prediction_dt))
print('Random Forest:', evaluator_acc.evaluate(prediction_rf))
print('Random Forest - More Params:', evaluator_acc.evaluate(prediction_rf1))
print('Gradient Boosted Trees:', evaluator_acc.evaluate(prediction_gbt))
print('Hyperopt Best Models AUROC Metrics: Upsampling')
for model in ['prediction_lr', 'prediction_lsvc', 'prediction_dt',
'prediction_rf', 'prediction_rf1', 'prediction_gbt']:
df = globals()[model]
tp = df[(df.label == 1) & (df.prediction == 1)].count()
tn = df[(df.label == 0) & (df.prediction == 0)].count()
fp = df[(df.label == 0) & (df.prediction == 1)].count()
fn = df[(df.label == 1) & (df.prediction == 0)].count()
a = ((tp + tn)/df.count())
if(tp + fn == 0.0):
r = 0.0
p = float(tp) / (tp + fp)
elif(tp + fp == 0.0):
r = float(tp) / (tp + fn)
p = 0.0
else:
r = float(tp) / (tp + fn)
p = float(tp) / (tp + fp)
if(p + r == 0):
f1 = 0
else:
f1 = 2 * ((p * r)/(p + r))
print('\nModel:', model)
print('True Positives:', tp)
print('True Negatives:', tn)
print('False Positives:', fp)
print('False Negatives:', fn)
print('Total:', df.count())
print('Accuracy:', a)
print('Recall:', r)
print('Precision: ', p)
print('F1 score:', f1)
print('\n')
print('\n')
For the best Hyperopt Upsampled models where AUROC was monitored, the best AUROC and accuracy were 0.9870 and 98.58%, respectively, using LR. The best precision was 97.55% using RF, the best recall was 92.23% using GBT, and the best F1 score was 94.25% with LR.
Predict and F1 Model Metrics using testDF of Upsampled Set¶
Let's now predict using the test set of the Upsampled data and evaluate the model metrics from the models where F1 score was monitored.
prediction_lr = pipelineModel_lr_hyperopt_f1_US.transform(testDF_US)
prediction_lsvc = pipelineModel_lsvc_hyperopt_f1_US.transform(testDF_US)
prediction_dt = pipelineModel_dt_hyperopt_f1_US.transform(testDF_US)
prediction_rf = pipelineModel_rf_hyperopt_f1_US.transform(testDF_US)
prediction_rf1 = pipelineModel_rf_hyperopt_f1_1_US.transform(testDF_US)
prediction_gbt = pipelineModel_gbt_hyperopt_f1_US.transform(testDF_US)
print('Hyperopt Best Models F1 Metrics: Upsampling')
print('\n')
print('Area Under ROC Curve:')
print('Logistic Regression:', evaluator_auroc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_auroc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_auroc.evaluate(prediction_dt))
print('Random Forest:', evaluator_auroc.evaluate(prediction_rf))
print('Random Forest - More Params:', evaluator_auroc.evaluate(prediction_rf1))
print('Gradient Boosted Trees:', evaluator_auroc.evaluate(prediction_gbt))
print('\n')
print('Accuracy:')
print('Logistic Regression:', evaluator_acc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_acc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_acc.evaluate(prediction_dt))
print('Random Forest:', evaluator_acc.evaluate(prediction_rf))
print('Random Forest - More Params:', evaluator_acc.evaluate(prediction_rf1))
print('Gradient Boosted Trees:', evaluator_acc.evaluate(prediction_gbt))
print('Hyperopt Best Models F1 Metrics: Upsampling')
for model in ['prediction_lr', 'prediction_lsvc', 'prediction_dt',
'prediction_rf', 'prediction_rf1', 'prediction_gbt']:
df = globals()[model]
tp = df[(df.label == 1) & (df.prediction == 1)].count()
tn = df[(df.label == 0) & (df.prediction == 0)].count()
fp = df[(df.label == 0) & (df.prediction == 1)].count()
fn = df[(df.label == 1) & (df.prediction == 0)].count()
a = ((tp + tn)/df.count())
if(tp + fn == 0.0):
r = 0.0
p = float(tp) / (tp + fp)
elif(tp + fp == 0.0):
r = float(tp) / (tp + fn)
p = 0.0
else:
r = float(tp) / (tp + fn)
p = float(tp) / (tp + fp)
if(p + r == 0):
f1 = 0
else:
f1 = 2 * ((p * r)/(p + r))
print('\nModel:', model)
print('True Positives:', tp)
print('True Negatives:', tn)
print('False Positives:', fp)
print('False Negatives:', fn)
print('Total:', df.count())
print('Accuracy:', a)
print('Recall:', r)
print('Precision: ', p)
print('F1 score:', f1)
print('\n')
print('\n')
For the best Hyperopt Upsampled models where the F1 score was monitored, the best AUROC was 0.9870 using LR and the best accuracy was 98.62% using RF. The best precision was 99.51% using RF, the best recall was 89.48% using RF, and the best F1 score was 94.23% with RF.
Load Saved Models - SMOTE¶
After loading the models, let's predict using the test set of the SMOTE data and evaluate the model metrics from the models where AUROC was monitored.
pipelineModel_lr_hyperopt_auroc_SMOTE = PipelineModel.load(path
+ '/pipelineModel_lr_hyperopt_smote_auroc_100trials/')
pipelineModel_lsvc_hyperopt_auroc_SMOTE = PipelineModel.load(path
+ '/pipelineModel_lsvc_hyperopt_smote_auroc/')
pipelineModel_dt_hyperopt_auroc_SMOTE = PipelineModel.load(path
+ '/pipelineModel_dt_hyperopt_smote_auroc/')
pipelineModel_rf_hyperopt_auroc_SMOTE = PipelineModel.load(path
+ '/pipelineModel_rf_hyperopt_smote_auroc_30trials/')
pipelineModel_rf_hyperopt_auroc1_SMOTE = PipelineModel.load(path
+ '/pipelineModel_rf_hyperopt_smote_auroc_moreParams/')
pipelineModel_gbt_hyperopt_auroc_SMOTE = PipelineModel.load(path
+ '/pipelineModel_gbt_hyperopt_smote_auroc/')
pipelineModel_lr_hyperopt_f1_SMOTE = PipelineModel.load(path
+ '/pipelineModel_lr_hyperopt_smote_f1_100trials/')
pipelineModel_lsvc_hyperopt_f1_SMOTE = PipelineModel.load(path
+ '/pipelineModel_lsvc_hyperopt_smote_f1/')
pipelineModel_dt_hyperopt_f1_SMOTE = PipelineModel.load(path
+ '/pipelineModel_dt_hyperopt_smote_f1/')
pipelineModel_rf_hyperopt_f1_SMOTE = PipelineModel.load(path
+ '/pipelineModel_rf_hyperopt_smote_f1_29trials/')
pipelineModel_rf_hyperopt_f1_1_SMOTE = PipelineModel.load(path
+ '/pipelineModel_rf_hyperopt_smote_f1_moreParams/')
pipelineModel_gbt_hyperopt_f1_SMOTE = PipelineModel.load(path
+ '/pipelineModel_gbt_hyperopt_smote_f1/')
prediction_lr = pipelineModel_lr_hyperopt_auroc_SMOTE.transform(testDF_SMOTE)
prediction_lsvc = pipelineModel_lsvc_hyperopt_auroc_SMOTE.transform(testDF_SMOTE)
prediction_dt = pipelineModel_dt_hyperopt_auroc_SMOTE.transform(testDF_SMOTE)
prediction_rf = pipelineModel_rf_hyperopt_auroc_SMOTE.transform(testDF_SMOTE)
prediction_rf1 = pipelineModel_rf_hyperopt_auroc1_SMOTE.transform(testDF_SMOTE)
prediction_gbt = pipelineModel_gbt_hyperopt_auroc_SMOTE.transform(testDF_SMOTE)
print('Hyperopt Best Models AUROC Metrics: SMOTE')
print('\n')
print('Area Under ROC Curve:')
print('Logistic Regression:', evaluator_auroc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_auroc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_auroc.evaluate(prediction_dt))
print('Random Forest:', evaluator_auroc.evaluate(prediction_rf))
print('Random Forest - More Params:', evaluator_auroc.evaluate(prediction_rf1))
print('Gradient Boosted Trees:', evaluator_auroc.evaluate(prediction_gbt))
print('\n')
print('Accuracy:')
print('Logistic Regression:', evaluator_acc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_acc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_acc.evaluate(prediction_dt))
print('Random Forest:', evaluator_acc.evaluate(prediction_rf))
print('Random Forest - More Params:', evaluator_acc.evaluate(prediction_rf1))
print('Gradient Boosted Trees:', evaluator_acc.evaluate(prediction_gbt))
print('Hyperopt Best Models AUROC Metrics: SMOTE')
for model in ['prediction_lr', 'prediction_lsvc', 'prediction_dt',
'prediction_rf', 'prediction_rf1', 'prediction_gbt']:
df = globals()[model]
tp = df[(df.label == 1) & (df.prediction == 1)].count()
tn = df[(df.label == 0) & (df.prediction == 0)].count()
fp = df[(df.label == 0) & (df.prediction == 1)].count()
fn = df[(df.label == 1) & (df.prediction == 0)].count()
a = ((tp + tn)/df.count())
if(tp + fn == 0.0):
r = 0.0
p = float(tp) / (tp + fp)
elif(tp + fp == 0.0):
r = float(tp) / (tp + fn)
p = 0.0
else:
r = float(tp) / (tp + fn)
p = float(tp) / (tp + fp)
if(p + r == 0):
f1 = 0
else:
f1 = 2 * ((p * r)/(p + r))
print('\nModel:', model)
print('True Positives:', tp)
print('True Negatives:', tn)
print('False Positives:', fp)
print('False Negatives:', fn)
print('Total:', df.count())
print('Accuracy:', a)
print('Recall:', r)
print('Precision: ', p)
print('F1 score:', f1)
print('\n')
For the best Hyperopt SMOTE models where AUROC was monitored, the best AUROC and accuracy were 0.98712 and 98.66%, respectively, using GBT. The best precision was 99.69% using RF, the best recall was 91.56% using LR, and the best F1 score was 94.23% with LR.
For the best Hyperopt Upsampled models where AUROC was monitored, the best AUROC and accuracy were 0.9870 and 98.58%, respectively, using LR. The best precision was 97.55% using RF, the best recall was 92.23% using GBT, and the best F1 score was 94.25% with LR.
Predict and F1 Model Metrics using testDF of SMOTE Set¶
Let's now predict using the test set of the SMOTE data and evaluate the model metrics from the models where F1 score was monitored.
prediction_lr = pipelineModel_lr_hyperopt_f1_SMOTE.transform(testDF_SMOTE)
prediction_lsvc = pipelineModel_lsvc_hyperopt_f1_SMOTE.transform(testDF_SMOTE)
prediction_dt = pipelineModel_dt_hyperopt_f1_SMOTE.transform(testDF_SMOTE)
prediction_rf = pipelineModel_rf_hyperopt_f1_SMOTE.transform(testDF_SMOTE)
prediction_rf1 = pipelineModel_rf_hyperopt_f1_1_SMOTE.transform(testDF_SMOTE)
prediction_gbt = pipelineModel_gbt_hyperopt_f1_SMOTE.transform(testDF_SMOTE)
print('Hyperopt Best Models F1 Metrics: SMOTE')
print('\n')
print('Area Under ROC Curve:')
print('Logistic Regression:', evaluator_auroc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_auroc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_auroc.evaluate(prediction_dt))
print('Random Forest:', evaluator_auroc.evaluate(prediction_rf))
print('Random Forest - More Params:', evaluator_auroc.evaluate(prediction_rf1))
print('Gradient Boosted Trees:', evaluator_auroc.evaluate(prediction_gbt))
print('\n')
print('Accuracy:')
print('Logistic Regression:', evaluator_acc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_acc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_acc.evaluate(prediction_dt))
print('Random Forest:', evaluator_acc.evaluate(prediction_rf))
print('Random Forest - More Params:', evaluator_acc.evaluate(prediction_rf1))
print('Gradient Boosted Trees:', evaluator_acc.evaluate(prediction_gbt))
print('Hyperopt Best Models F1 Metrics: SMOTE')
for model in ['prediction_lr', 'prediction_lsvc', 'prediction_dt',
'prediction_rf', 'prediction_rf1', 'prediction_gbt']:
df = globals()[model]
tp = df[(df.label == 1) & (df.prediction == 1)].count()
tn = df[(df.label == 0) & (df.prediction == 0)].count()
fp = df[(df.label == 0) & (df.prediction == 1)].count()
fn = df[(df.label == 1) & (df.prediction == 0)].count()
a = ((tp + tn)/df.count())
if(tp + fn == 0.0):
r = 0.0
p = float(tp) / (tp + fp)
elif(tp + fp == 0.0):
r = float(tp) / (tp + fn)
p = 0.0
else:
r = float(tp) / (tp + fn)
p = float(tp) / (tp + fp)
if(p + r == 0):
f1 = 0
else:
f1 = 2 * ((p * r)/(p + r))
print('\nModel:', model)
print('True Positives:', tp)
print('True Negatives:', tn)
print('False Positives:', fp)
print('False Negatives:', fn)
print('Total:', df.count())
print('Accuracy:', a)
print('Recall:', r)
print('Precision: ', p)
print('F1 score:', f1)
print('\n')
For the best Hyperopt SMOTE models where the F1 score was monitored, the best AUROC and accuracy were 0.9870 and 98.67%, respectively, using GBT. The best precision was 99.60% using RF, the best recall was 91.56% using LR, and the best F1 score was 94.53% with LR.
For the best Hyperopt Upsampled models where the F1 score was monitored, the best AUROC was 0.9870 using LR and the best accuracy was 98.62% using RF. The best precision was 99.51% using RF, the best recall was 89.48% using RF, and the best F1 score was 94.23% with RF.
LightGBM¶
To train LightGBM models on a GPU, the library needs to be compiled from source with GPU support enabled, so let's first clone the LightGBM repository.
%cd /content/drive/MyDrive/
!git clone --recursive https://github.com/Microsoft/LightGBM
Then we can navigate to the cloned repository, create a build directory within it, and run cmake from inside the build directory to configure and compile with GPU support.
%cd /content/drive/MyDrive/LightGBM
!mkdir build
%cd build
!cmake -DUSE_GPU=1 ..
!make -j$(nproc)
Next, we can install pip, setuptools and the required dependencies to set up the environment for utilizing LightGBM with GPU capabilities.
!sudo apt-get -y install python-pip
!sudo -H pip install setuptools optuna plotly eli5 shap lime -U
Now, we can move to the python-package subdirectory and use setup.py to compile the dependencies needed so that data will be processed on GPUs when using the LightGBM package.
%cd /content/drive/MyDrive/LightGBM/python-package
!sudo python3 setup.py install --precompile --gpu
The notebook containing the baseline models and hyperparameter searches using Optuna for the Upsampled and SMOTE sets is located here.
Let's first set up the environment by importing the dependencies/declaring their options, setting the name of the environment, the random and numpy seed, followed by examining which CUDA compiler and GPU is available.
import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')
seed_value = 42
os.environ['lgb_GPU'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
Upsampled: Baseline Model¶
The train/test sets can be read into separate pandas.DataFrames, then the features (X_train, X_test) and labels (y_train, y_test) can be separated for both sets. From the original training set, the label can be defined as loan_status and the remaining columns as the features, with the target removed and the feature column names extracted and converted to a numpy.array. Let's fit the baseline model using the default parameters and evaluate the model performance.
import pandas as pd
trainDF = pd.read_csv('trainDF_US.csv', low_memory=False)
testDF = pd.read_csv('testDF_US.csv', low_memory=False)
X_train = trainDF.drop('loan_status', axis=1)
y_train = trainDF.loan_status
X_test = testDF.drop('loan_status', axis=1)
y_test = testDF.loan_status
label = trainDF[['loan_status']]
features = trainDF.drop(columns = ['loan_status'])
features = features.columns
features = np.array(features)
import lightgbm as lgb
import pickle
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score
from sklearn.metrics import recall_score, f1_score
lgbc = lgb.LGBMClassifier(boosting_type='gbdt',
device='gpu',
gpu_platform_id=0,
gpu_device_id=0,
verbosity=-1)
lgbc.fit(X_train, y_train)
Pkl_Filename = 'lightGBM_Upsampling_Baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(lgbc, file)
print('\nModel Metrics for LightGBM Baseline Upsampling')
y_train_pred = lgbc.predict(X_train)
y_train_pred = y_train_pred.round(2)
y_train_pred = np.where(y_train_pred > 0.5, 1, 0)
y_test_pred = lgbc.predict(X_test)
y_test_pred = y_test_pred.round(2)
y_test_pred = np.where(y_test_pred > 0.5, 1, 0)
print('Training Set')
print('Classification Report:')
clf_rpt = classification_report(y_train, y_train_pred)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_train, y_train_pred))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_train, y_train_pred))
print('Precision score : %.3f' % precision_score(y_train, y_train_pred))
print('Recall score : %.3f' % recall_score(y_train, y_train_pred))
print('F1 score : %.3f' % f1_score(y_train, y_train_pred))
print('\n')
print('Test Set')
print('Classification Report:')
clf_rpt = classification_report(y_test, y_test_pred)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test, y_test_pred))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test, y_test_pred))
print('Precision score : %.3f' % precision_score(y_test, y_test_pred))
print('Recall score : %.3f' % recall_score(y_test, y_test_pred))
print('F1 score : %.3f' % f1_score(y_test, y_test_pred))
Overall, the model metrics are over 90%, so the baseline model performs quite well. The lowest metrics are the precision, recall and the F1 score, so let's try monitoring the log_loss to see if hyperparameter tuning results in better model performance.
1000 Trials 5-fold Cross Validation¶
For hyperparameter tuning using Optuna, let's define a multi-faceted objective function for the optimization of the hyperparameters, which uses joblib to save each trial's parameters and loss in a pickle file that can be reloaded to continue optimizing. Then we can define the parameter space of the hyperparameters to be tested:
- random_state: seed_value designated for reproducibility. default=None.
- device_type: Device for the tree learning. default=cpu.
- objective: Specify the learning task and the corresponding learning objective or a custom objective function to be used. default=None or default=regression.
- metric: Metric(s) to be evaluated on the evaluation set(s). default="".
- boosting_type: default='gbdt'.
  - gbdt: traditional Gradient Boosting Decision Tree.
  - dart: Dropouts meet Multiple Additive Regression Trees.
  - rf: Random Forest.
- n_estimators: Number of boosted trees to fit. default=100.
- learning_rate: Boosting learning rate. default=0.1.
- num_leaves: Maximum tree leaves for base learners. default=31.
- bagging_freq: Frequency for bagging. default=0.
- subsample: Randomly select part of the data without resampling. default=1.
- colsample_bytree: Subsample ratio of columns when constructing each tree. default=1.
- max_depth: Maximum tree depth for base learners, <=0 means no limit. default=-1.
- lambda_l1: L1 regularization. default=0.
- lambda_l2: L2 regularization. default=0.
- min_gain_to_split: The minimal gain to perform split. default=0.
- min_child_samples: Minimal number of data in one leaf. Can be used to deal with over-fitting. default=20.
- verbosity: Controls the level of LightGBM's verbosity. default=1 (Info).
We need to utilize the same k-folds for reproducibility, so let's use 5 folds to split the training set into train and validation sets with shuffle=True and the initially defined seed value. The LGBMClassifier model is specified to use the parameters within the params_lgb_optuna dictionary as well as early_stopping_rounds=150. Each trial is timed, and the binary log loss is computed from the validation labels and the predicted probabilities. The mean score is tracked with the objective of obtaining the lowest loss, using callbacks=[LightGBMPruningCallback(trial, 'binary_logloss')].
import optuna
from optuna import Trial
optuna.logging.set_verbosity(optuna.logging.WARNING)
from optuna.integration import LightGBMPruningCallback
import joblib
from sklearn.model_selection import KFold, cross_val_score
from timeit import default_timer as timer
import lightgbm as lgb
from sklearn.metrics import log_loss
def objective(trial):
"""
Objective function to tune a LightGBMClassifier model.
"""
joblib.dump(study, 'lightGBM_US_Optuna_1000_GPU.pkl')
params_lgb_optuna = {
'random_state': seed_value,
'device_type': 'gpu',
'objective': 'binary',
'metric': 'binary_logloss',
'boosting_type': 'gbdt',
'n_estimators': trial.suggest_int('n_estimators', 100, 500, step=10),
'learning_rate': trial.suggest_loguniform('learning_rate', 1e-6, 1e-1),
'num_leaves': trial.suggest_int('num_leaves', 100, 1000, step=20),
'bagging_freq': trial.suggest_int('bagging_freq', 3, 10),
'subsample': trial.suggest_float('subsample', 0.5, 0.95),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.4, 0.9),
'max_depth': trial.suggest_int('max_depth', 3, 12),
'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 1e-1, log=True),
'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 1e-1, log=True),
'min_gain_to_split': trial.suggest_float('min_gain_to_split', 1, 15),
'min_child_samples': trial.suggest_int('min_child_samples', 100, 500,
step=10),
'verbosity': -1}
kf = KFold(n_splits=5, shuffle=True, random_state=seed_value)
for idx, (trn_idx, val_idx) in enumerate(kf.split(trainDF[features],
label)):
train_features, train_label = trainDF[features].iloc[trn_idx], label.iloc[trn_idx]
val_features, val_label = trainDF[features].iloc[val_idx], label.iloc[val_idx]
start = timer()
model = lgb.LGBMClassifier(**params_lgb_optuna,
early_stopping_rounds=150)
cv_scores = np.empty(5)
model.fit(train_features, train_label.values.ravel(),
eval_set = [(val_features, val_label.values.ravel())],
eval_metric='binary_logloss',
callbacks=[LightGBMPruningCallback(trial, 'binary_logloss')])
y_pred_val = model.predict_proba(val_features)
cv_scores[idx] = log_loss(val_label, y_pred_val)
run_time = timer() - start
return np.mean(cv_scores)
We can leverage if/else conditional statements to load the .pkl file if it already exists in the directory; if it doesn't, then a new optuna.study is created with the objective of minimizing the log_loss using 1000 trials.
import time
search_time_start = time.time()
if os.path.isfile('lightGBM_US_Optuna_1000_GPU.pkl'):
study = joblib.load('lightGBM_US_Optuna_1000_GPU.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=1000)
print('Time to run HPO:', time.time() - search_time_start)
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest LogLoss', study.best_value)
Results from the Hyperparameter Search¶
Let's now extract the trial number, logloss and hyperparameter values into a dataframe and sort with the lowest loss first.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'logloss'}, inplace=True)
trials_df.rename(columns={'params_bagging_freq': 'bagging_freq'}, inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
inplace=True)
trials_df.rename(columns={'params_lambda_l1': 'lambda_l1'}, inplace=True)
trials_df.rename(columns={'params_lambda_l2': 'lambda_l2'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_samples': 'min_child_samples'},
inplace=True)
trials_df.rename(columns={'params_min_gain_to_split': 'min_gain_to_split'},
inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_num_leaves': 'num_leaves'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('logloss', ascending=True)
print(trials_df)
Optuna contains many ways to visualize the parameters tested during the search using the visualization module, including plot_parallel_coordinate, plot_slice, plot_contour, plot_param_importances, plot_optimization_history and plot_edf. Let's utilize some of these components to examine the hyperparameters tested, starting with plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search and plot_param_importances to plot the importance of the hyperparameters.
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()
Darker blue signifies lower values for the log_loss and shows where a higher number of trials gravitated during the study.
To note, the scatterplot of the regularization hyperparameters (lambda_l1, lambda_l2) did not reveal trends over the trial iterations, which might have been affected by the wide range (0-0.1).
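As a reference, such a scatterplot of the regularization hyperparameters can be generated with the visualization module's plot_slice; a brief sketch using the study defined above:
# Slice plots of the two regularization terms across the completed trials
fig = optuna.visualization.plot_slice(study, params=['lambda_l1', 'lambda_l2'])
fig.show()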
Let's now visualize the parameter importances by utilizing the plot_param_importances feature from the Optuna package.
fig = optuna.visualization.plot_param_importances(study)
fig.show()
The most important hyperparameters for the search using the Upsampled set are n_estimators = 0.31, min_child_samples = 0.21 and lambda_l2 = 0.20. We will need to compare this order and magnitude with the results from the search using the SMOTE set.
Next, let's arrange the best parameters to re-create the best model, fit it using the training data, and save it as a .pkl file. Then, we can evaluate the model metrics for both the training and the test sets using the classification report, confusion matrix as well as accuracy, precision, recall and F1 score from sklearn.metrics.
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'binary_error'
params
import pickle
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score
from sklearn.metrics import recall_score, f1_score
train_label = trainDF[['loan_status']]
train_features = trainDF.drop(columns = ['loan_status'])
test_features = testDF.drop(columns = ['loan_status'])
best_model = lgb.LGBMClassifier(boosting_type='gbdt',
device='gpu',
gpu_platform_id=0,
gpu_device_id=0,
verbosity=-1,
**params)
best_model.fit(train_features, train_label.values.ravel())
Pkl_Filename = 'lightGBM_Upsampling_Optuna_trials1000_GPU.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for lightGBM HPO Upsampling 1000 GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('\n')
print('Classification Report: Train')
clf_rpt = classification_report(y_train, y_train_pred)
print(clf_rpt)
print('\n')
print('Confusion matrix: Train')
print(confusion_matrix(y_train, y_train_pred))
print('\n')
print('Classification Report: Test')
clf_rpt = classification_report(y_test, y_test_pred)
print(clf_rpt)
print('\n')
print('Confusion matrix: Test')
print(confusion_matrix(y_test, y_test_pred))
print('\n')
print('Accuracy score: train: %.3f, test: %.3f' % (
accuracy_score(y_train, y_train_pred),
accuracy_score(y_test, y_test_pred)))
print('Precision score: train: %.3f, test: %.3f' % (
precision_score(y_train, y_train_pred),
precision_score(y_test, y_test_pred)))
print('Recall score: train: %.3f, test: %.3f' % (
recall_score(y_train, y_train_pred),
recall_score(y_test, y_test_pred)))
print('F1 score: train: %.3f, test: %.3f' % (
f1_score(y_train, y_train_pred),
f1_score(y_test, y_test_pred)))
All of the model metrics besides the recall score increased compared to the baseline model. The reduction in recall is 0.03, so it is potentially negligible.
Let's now evaluate the best model's accuracy on the test set and the trial conditions that achieved it.
from sklearn.metrics import roc_auc_score
print('The best model from optimization scores {:.5f} Accuracy on the test set.'.format(accuracy_score(y_test,
y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
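The score reported above is accuracy computed from the hard class predictions. Since roc_auc_score was imported, here is a minimal sketch of scoring the predicted probabilities themselves, assuming best_model and test_features from the cells above:
# Probability of the positive class for each test observation
y_test_proba = best_model.predict_proba(test_features)[:, 1]
print('Test set ROC AUC from predicted probabilities: {:.5f}'.format(
    roc_auc_score(y_test, y_test_proba)))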
Model Explanations¶
SHAP (SHapley Additive exPlanations)¶
Let's use shap.TreeExplainer with the LightGBM model and generate shap_values for the features. The summary_plot shows the global importance of each feature and the distribution of the effect sizes over the set. Let's use the training set and summarize the effects of all the features.
import shap
shap.initjs()
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(train_features)
shap.summary_plot(shap_values, train_features, show=False);
Let's use the test set and summarize the effects of all the features.
shap.initjs()
shap_values = explainer.shap_values(test_features)
shap.summary_plot(shap_values, test_features, show=False);
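Beyond the summary plots, a dependence plot can show how the value of a single feature, such as out_prncp, drives its SHAP contribution. A minimal sketch; note that, depending on the SHAP version, shap_values for a binary LightGBM model may be returned as a list with one array per class:
# Use the class-1 array if the explainer returned per-class SHAP values
vals = shap_values[1] if isinstance(shap_values, list) else shap_values
shap.dependence_plot('out_prncp', vals, test_features, show=False);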
Model Metrics with Eli5¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
import eli5
from eli5.sklearn import PermutationImportance
X_test1 = pd.DataFrame(test_features, columns=test_features.columns)
perm_importance = PermutationImportance(best_model,
random_state=seed_value).fit(test_features,
y_test)
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test1.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).
from eli5.formatters import format_as_dataframe
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
The features with the highest weights are out_prncp (0.248844), total_pymnt (0.239985) and recoveries (0.066753).
Now we can use ELI5 to show the prediction.
html_obj2 = eli5.show_prediction(best_model, test_features.iloc[1],
show_feature_values=True)
html_obj2
For the loans to be Paid Off or Current, the features out_prncp and recoveries with higher values are beneficial, while last_pymnt_amnt and total_pymnt with lower values are not.
SMOTE: Baseline Model¶
Let's now proceed with similar methods for the SMOTE set by first reading the data, defining the label and features for the train/test sets, removing the target, extracting the feature names and fitting the baseline model with the default parameters.
trainDF = pd.read_csv('trainDF_SMOTE.csv', low_memory=False)
testDF = pd.read_csv('testDF_SMOTE.csv', low_memory=False)
X_train = trainDF.drop('loan_status', axis=1)
y_train = trainDF.loan_status
X_test = testDF.drop('loan_status', axis=1)
y_test = testDF.loan_status
label = trainDF[['loan_status']]
features = trainDF.drop(columns = ['loan_status'])
feature_names = list(features.columns)
features = features.columns
features = np.array(features)
lgbc = lgb.LGBMClassifier(boosting_type='gbdt',
device='gpu',
gpu_platform_id=0,
gpu_device_id=0,
verbosity=-1)
lgbc.fit(X_train, y_train)
Pkl_Filename = 'lightGBM_SMOTE_Baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(lgbc, file)
print('\nModel Metrics for LightGBM Baseline SMOTE')
y_train_pred = lgbc.predict(X_train)
y_train_pred = y_train_pred.round(2)
y_train_pred = np.where(y_train_pred > 0.5, 1, 0)
y_test_pred = lgbc.predict(X_test)
y_test_pred = y_test_pred.round(2)
y_test_pred = np.where(y_test_pred > 0.5, 1, 0)
print('Training Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_train), asnumpy(y_train_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Test Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))
Compared to the baseline model metrics for the Upsampled set using the default LightGBM parameters with GPU support, the accuracy, precision and F1 scores are higher but the recall score is lower.
1000 Trials 5-Fold Cross Validation¶
Let's define a new objective function for the hyperparameter optimization that dumps the study to a new pickle file with joblib so it can be reloaded, and that uses the same parameter space, the same k-folds for reproducibility and the same structure that was utilized for the Upsampled LightGBM search using Optuna.
def objective(trial):
"""
Objective function to tune a LightGBMClassifier model.
"""
joblib.dump(study, 'lightGBM_GPU_HPO_SMOTE_Optuna_1000.pkl')
params_lgb_optuna = {
'random_state': seed_value,
'device_type': 'gpu',
'objective': 'binary',
'metric': 'binary_logloss',
'boosting_type': 'gbdt',
'n_estimators': trial.suggest_int('n_estimators', 100, 500, step=10),
'learning_rate': trial.suggest_loguniform('learning_rate', 1e-6, 1e-1),
'num_leaves': trial.suggest_int('num_leaves', 100, 1000, step=20),
'bagging_freq': trial.suggest_int('bagging_freq', 3, 10),
'subsample': trial.suggest_float('subsample', 0.5, 0.95),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.4, 0.9),
'max_depth': trial.suggest_int('max_depth', 3, 12),
'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 1e-1, log=True),
'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 1e-1, log=True),
'min_gain_to_split': trial.suggest_float('min_gain_to_split', 1, 15),
'min_child_samples': trial.suggest_int('min_child_samples', 100, 500,
step=10),
'verbosity': -1}
    kf = KFold(n_splits=5, shuffle=True, random_state=seed_value)
    # Hold one validation logloss per fold so the mean over all folds is returned
    cv_scores = np.empty(5)
    for idx, (trn_idx, val_idx) in enumerate(kf.split(trainDF[features],
                                                      label)):
        train_features, train_label = trainDF[features].iloc[trn_idx], label.iloc[trn_idx]
        val_features, val_label = trainDF[features].iloc[val_idx], label.iloc[val_idx]
        start = timer()
        model = lgb.LGBMClassifier(**params_lgb_optuna,
                                   early_stopping_rounds=150)
        model.fit(train_features, train_label.values.ravel(),
                  eval_set=[(val_features, val_label.values.ravel())],
                  eval_metric='binary_logloss',
                  callbacks=[LightGBMPruningCallback(trial, 'binary_logloss')])
        y_pred_val = model.predict_proba(val_features)
        cv_scores[idx] = log_loss(val_label, y_pred_val)
        run_time = timer() - start
    return np.mean(cv_scores)
Let's now begin the HPO trials, creating a new Optuna study whose results can be compared with those from the search using the Upsampled set.
from datetime import datetime, timedelta
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('lightGBM_GPU_HPO_SMOTE_Optuna_1000.pkl'):
study = joblib.load('lightGBM_GPU_HPO_SMOTE_Optuna_1000.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=1000)
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest LogLoss', study.best_value)
Results from the Hyperparameter Search¶
Now, we can extract the trial number, logloss and hyperparameter value into a dataframe and sort with the lowest loss first.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'logloss'}, inplace=True)
trials_df.rename(columns={'params_bagging_freq': 'bagging_freq'}, inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
inplace=True)
trials_df.rename(columns={'params_lambda_l1': 'lambda_l1'}, inplace=True)
trials_df.rename(columns={'params_lambda_l2': 'lambda_l2'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_samples': 'min_child_samples'},
inplace=True)
trials_df.rename(columns={'params_min_gain_to_split': 'min_gain_to_split'},
inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_num_leaves': 'num_leaves'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('logloss', ascending=True)
print(trials_df)
Let's utilize plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search and plot_contour to plot the parameter relationships as contours.
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()
The slice plot was not informative, and the scatterplot of the regularization hyperparameters (lambda_l1, lambda_l2) did not reveal trends over the trial iterations, which might have been affected by the wide range (0-0.1).
Let's now visualize the parameter importances by utilizing the plot_param_importances feature from the Optuna package.
fig = optuna.visualization.plot_param_importances(study)
fig.show()
The most important hyperparameters are colsample_bytree = 0.32, max_depth = 0.18, min_child_samples = 0.12 and learning_rate = 0.11. These differ from the Upsampled results, where the most important were n_estimators = 0.31, min_child_samples = 0.21 and lambda_l2 = 0.20. For the Upsampled set, the importances of this study's top hyperparameters are colsample_bytree = 0.03, max_depth = 0.01, min_child_samples = 0.21 and learning_rate = 0.09; conversely, in this study the importances of the Upsampled search's top hyperparameters are n_estimators = 0.06, min_child_samples = 0.12 and lambda_l2 = 0.06.
Next, let's arrange the best parameters to re-create the best model and fit using the training data and save it as a .pkl file. Then, we can evaluate the model metrics for both the training and the test sets using the classification report, confusion matrix as well as accuracy, precision, recall and F1 score from sklearn.metrics.
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'binary_error'
params
train_label = trainDF[['loan_status']]
train_features = trainDF.drop(columns = ['loan_status'])
test_features = testDF.drop(columns = ['loan_status'])
best_model = lgb.LGBMClassifier(boosting_type='gbdt',
device='gpu',
gpu_platform_id=0,
gpu_device_id=0,
verbosity=-1,
**params)
best_model.fit(train_features, train_label.values.ravel())
Pkl_Filename = 'lightGBM_HPO_Optuna_SMOTE_trials1000_GPU.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for lightGBM HPO SMOTE 1000 GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('\n')
print('Classification Report: Train')
clf_rpt = classification_report(y_train, y_train_pred)
print(clf_rpt)
print('\n')
print('Confusion matrix: Train')
print(confusion_matrix(y_train, y_train_pred))
print('\n')
print('Classification Report: Test')
clf_rpt = classification_report(y_test, y_test_pred)
print(clf_rpt)
print('\n')
print('Confusion matrix: Test')
print(confusion_matrix(y_test, y_test_pred))
print('\n')
print('Accuracy score: train: %.3f, test: %.3f' % (
accuracy_score(y_train, y_train_pred),
accuracy_score(y_test, y_test_pred)))
print('Precision score: train: %.3f, test: %.3f' % (
precision_score(y_train, y_train_pred),
precision_score(y_test, y_test_pred)))
print('Recall score: train: %.3f, test: %.3f' % (
recall_score(y_train, y_train_pred),
recall_score(y_test, y_test_pred)))
print('F1 score: train: %.3f, test: %.3f' % (
f1_score(y_train, y_train_pred),
f1_score(y_test, y_test_pred)))
Compared to the baseline LightGBM model for the SMOTE set, there was not a large change in the model metrics, and the observed changes are potentially negligible. Compared to the tuned LightGBM model using the Upsampled set, precision might be slightly better, but recall, and possibly the F1 score, are slightly worse.
Let's now evaluate the best model's accuracy on the test set and the trial conditions that achieved it.
print('The best model from optimization scores {:.5f} Accuracy on the test set.'.format(accuracy_score(y_test,
y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
Model Explanations¶
SHAP (SHapley Additive exPlanations)¶
Let's use the training set and summarize the effects of all the features.
shap.initjs()
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(train_features)
shap.summary_plot(shap_values, train_features, show=False);
Now, we can use the test set and summarize the effects of all the features.
shap.initjs()
shap_values = explainer.shap_values(test_features)
shap.summary_plot(shap_values, test_features, show=False);
Model Metrics with Eli5¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
X_test1 = pd.DataFrame(test_features, columns=test_features.columns)
perm_importance = PermutationImportance(best_model,
random_state=seed_value).fit(test_features,
y_test)
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test1.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
The out_prncp feature has the highest weight, followed by total_pymnt at 0.0802. Compared to the Upsampled set, out_prncp has less weight (0.223551 vs. 0.248844), and there is a large difference in the weights for total_pymnt (0.0802 vs. 0.239985).
Now we can use ELI5 to show the prediction.
html_obj2 = eli5.show_prediction(best_model, test_features.iloc[1],
show_feature_values=True)
html_obj2
For the loans to be Paid Off or Current, the features recoveries and out_prncp with higher values are beneficial while the features last_pymnt_amnt, total_pymnt, loan_amnt and annual_inc with lower values are not beneficial.
Environment to Utilize RAPIDS¶
We can utilize the resources Paperspace offers to run a RAPIDS Docker container and leverage GPUs with more memory than is available on most local desktops, without having to request quota increases and pay a cloud provider for every second a VM is running.
First, let's upgrade pip, install the dependencies and import the required packages. Then we can set the os environment as well as the random, cupy and numpy seed. We can also define a function timed to time the code blocks and examine the components of the GPU which are being utilized in the runtime.
!pip install --upgrade pip
import os
import warnings
import random
import cupy
import numpy as np
import urllib.request
from contextlib import contextmanager
import time
warnings.filterwarnings('ignore')
seed_value = 42
os.environ['xgbRAPIDS_GPU'] = str(seed_value)
random.seed(seed_value)
cupy.random.seed(seed_value)
np.random.seed(seed_value)
@contextmanager
def timed(name):
t0 = time.time()
yield
t1 = time.time()
print('..%-24s: %8.4f' % (name, t1 - t0))
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
print('\n')
!nvidia-smi
There is an NVIDIA RTX A4000 with 16GB of GPU memory, running NVIDIA-SMI version 510.73.05 and CUDA version 11.6.
Now, we can set up a LocalCUDACluster for Dask with defined limits for memory and the device memory as well as an ip that is assigned to the dask.distributed.Client for all of the connected workers (1).
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait
from dask.diagnostics import ProgressBar
from dask.utils import parse_bytes
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES='0', memory_limit='10GB',
device_memory_limit='5GB', ip='0.0.0.0')
client = Client(cluster)
client
XGBoost¶
The notebooks evaluating different metrics utilizing train/validation/test sets with RAPIDS and Optuna are located here, and the initial notebooks using Hyperopt and XGBoost with GPU runtimes are located here.
Let's now read the training and the test sets for the Upsampled set into separate cudf.DataFrames. For the XGBoost models, splitting the training set into train and validation sets resulted in better model performance, so let's use train_size=0.8 with the training set to create the validation set. Then we can define the test features/target and convert the features to float32 and the targets to int32 for modeling.
import cudf
from cuml.model_selection import train_test_split
trainDF = cudf.read_csv('trainDF_US.csv', low_memory=False)
print('Train set: Number of rows and columns:', trainDF.shape)
testDF = cudf.read_csv('testDF_US.csv', low_memory=False)
print('Test set: Number of rows and columns:', testDF.shape)
X_train, X_val, y_train, y_val = train_test_split(trainDF, 'loan_status',
train_size=0.8)
X_train = X_train.astype('float32')
y_train = y_train.astype('int32')
X_val = X_val.astype('float32')
y_val = y_val.astype('int32')
# Define the test features/target from the test set before casting
# (added here, since X_test and y_test are used below but not created above)
X_test = testDF.drop('loan_status', axis=1)
y_test = testDF['loan_status']
X_test = X_test.astype('float32')
y_test = y_test.astype('int32')
Upsampled: Baseline Model¶
Let's first define the baseline model components for classification: gbtree for the booster, tree_method='gpu_hist' to run on the GPU, scale_pos_weight=1 to balance the positive and negative weights, use_label_encoder=False to skip label encoding, the defined seed_value for the random number seed and silent output from the model. Now we can fit the model using the training set, save it as a .pkl file and evaluate the model using classification metrics.
import pickle
from cupy import asnumpy
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
xgb = XGBClassifier(objective='binary:logistic',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0)
xgb.fit(X_train, y_train)
Pkl_Filename = 'XGB_Baseline_US.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(xgb, file)
print('\nModel Metrics for XGBoost Baseline Upsampling')
y_train_pred = xgb.predict(X_train)
y_train_pred = y_train_pred.round(2)
y_train_pred = np.where(y_train_pred > 0.5, 1, 0)
y_test_pred = xgb.predict(X_test)
y_test_pred = y_test_pred.round(2)
y_test_pred = np.where(y_test_pred > 0.5, 1, 0)
print('Training Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_train), asnumpy(y_train_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Test Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))
Overall, the model metrics are over 90%, so the baseline model performs quite well. The lowest metrics are the recall and F1 score, so let's monitor one of those to see if hyperparameter tuning results in better performance.
300 Trials: Weighted F1¶
Now, let's set up an Optuna.study by creating a unique name for the study. We can leverage optuna.integration.wandb to set up callbacks that will be saved to Weights & Biases. First, we log in and set up the arguments, which include the name of the project, the person saving the results, the group the study belongs to, whether code is saved, and notes about the study for future reference.
import wandb
from optuna.integration.wandb import WeightsAndBiasesCallback
study_name = 'dask_optuna_us_xgbRapids_weightedF1_300_tpe'
wandb.login()
wandb_kwargs = {'project': 'loanStatus_hpo', 'entity': 'aschultz',
'group': 'optuna_us_xgb300gpu_trainvaltest_f1weighted',
'save_code': 'False',
'notes': 'us_xgb300gpu_trainvaltest_f1weighted'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
Then, we can define a function objective for the optimization of various hyperparameters which uses joblib to save each of the trial parameters and the loss in a pickle file that can be reloaded to continue optimizing. Then we can denote the parameter space of the hyperparameters to be tested.
Then the model type, XGBClassifier, needs to be defined with the parameters that will be included in all of the trials during the search, which are:
- objective='binary:logistic': Specify the learning task and the corresponding learning objective or a custom objective function to be used.
- booster='gbtree': Specify which booster to use: gbtree, gblinear or dart.
- tree_method='gpu_hist': Specify which tree method to use. Default=auto.
- scale_pos_weight=1: Balancing of positive and negative weights.
- use_label_encoder=False: Do not use the scikit-learn label encoder.
- random_state=seed_value: Random number seed.
- verbosity=0: The degree of verbosity. Valid values are 0 (silent) to 3 (debug).
Then we can fit the model, which uses the parameters within the params_xgboost_optuna dictionary, on the training set with the eval_set as the validation set, followed by computing the weighted f1_score between the validation labels and the predicted labels.
import optuna
from optuna import Trial
optuna.logging.set_verbosity(optuna.logging.WARNING)
import joblib
from xgboost import XGBClassifier
from timeit import default_timer as timer
from sklearn.metrics import f1_score
from cupy import asnumpy
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a XGBoostClassifier model.
"""
joblib.dump(study, 'xgbRAPIDS_Optuna_US_300_GPU_f1weighted.pkl')
params_xgboost_optuna = {
'n_estimators': trial.suggest_int('n_estimators', 400, 700),
'max_depth': trial.suggest_int('max_depth', 8, 15),
'subsample': trial.suggest_float('subsample', 0.6, 0.9),
'gamma': trial.suggest_float('gamma', 0.02, 1),
'learning_rate': trial.suggest_float('learning_rate', 0.13, 0.3),
'reg_alpha': trial.suggest_float('reg_alpha', 0.5, 2.5),
'reg_lambda': trial.suggest_float('reg_lambda', 0.3, 2),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.8, 0.99),
'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.25, 0.6),
'min_child_weight': trial.suggest_int('min_child_weight', 1, 5)
}
xgb = XGBClassifier(objective='binary:logistic',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0,
**params_xgboost_optuna)
start = timer()
xgb.fit(X_train, y_train,
eval_set=[(X_val, y_val)],
verbose=0)
y_pred_val = xgb.predict(X_val)
score = f1_score(asnumpy(y_val), asnumpy(y_pred_val), average='weighted')
run_time = timer() - start
return score
Now, we can begin the Optuna study containing 300 trials that are optimized to run in parallel on a Dask cluster. The goal is to find the parameters that maximize the weighted F1 score for predicting loan_status.
import time
import dask_optuna
import dask
import dask_cudf
from joblib import parallel_backend
with timed('dask_optuna'):
search_time_start = time.time()
if os.path.isfile('xgbRAPIDS_Optuna_US_300_GPU_f1weighted.pkl'):
study = joblib.load('xgbRAPIDS_Optuna_US_300_GPU_f1weighted.pkl')
else:
study = optuna.create_study(sampler=optuna.samplers.TPESampler(),
study_name=study_name,
direction='maximize')
with parallel_backend('dask'):
study.optimize(lambda trial: objective(trial),
n_trials=300)
print('Time to run HPO:', time.time() - search_time_start)
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Highest F1 Score', study.best_value)
wandb.finish()
Results from the Hyperparameter Search¶
Let's now extract the trial number, f1_weighted and hyperparameter value into a dataframe and sort with the highest score first.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'f1_weighted'}, inplace=True)
trials_df.rename(columns={'params_colsample_bylevel': 'colsample_bylevel'},
inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
inplace=True)
trials_df.rename(columns={'params_gamma': 'gamma'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_weight': 'min_child_weight'},
inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_reg_alpha': 'reg_alpha'}, inplace=True)
trials_df.rename(columns={'params_reg_lambda': 'reg_lambda'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('f1_weighted', ascending=False)
print(trials_df)
We can now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 4, figsize=(20,5))
i = 0
axs = axs.flatten()
for i, hpo in enumerate(['learning_rate', 'gamma', 'colsample_bylevel',
'colsample_bytree']):
sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
learning_rate, gamma and colsample_bytree decreased over the trials, while colsample_bylevel increased.
Next, we can examine if there were any trends regarding the regularization parameters over the trials.
fig, axs = plt.subplots(1, 2, figsize=(20,5))
i = 0
for i, hpo in enumerate(['reg_alpha', 'reg_lambda']):
sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
reg_alpha decreased while reg_lambda did not show any trend over the trial iterations.
From using plot_param_importances to visualize parameter importances, the most important hyperparameter is max_depth (0.89) followed by reg_lambda (0.04) and gamma (0.02).
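The importance values quoted here come from the same Optuna call used in the LightGBM sections; a minimal sketch:
fig = optuna.visualization.plot_param_importances(study)
fig.show()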
Next, let's arrange the best parameters to re-create the best model and fit using the training data and save it as a .pkl file. Then, we can evaluate the model metrics for both the training and the test sets using the classification report, confusion matrix as well as accuracy, precision, recall and F1 score from sklearn.metrics.
params = study.best_params
params
import pickle
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score
X_train, y_train = trainDF.drop('loan_status',
axis=1), trainDF['loan_status'].astype('int32')
X_train = X_train.astype('float32')
X_test, y_test = testDF.drop('loan_status',
axis=1), testDF['loan_status'].astype('int32')
X_test = X_test.astype('float32')
best_model = XGBClassifier(objective='binary:logistic',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0,
**params)
best_model.fit(X_train, y_train)
Pkl_Filename = 'xgbRAPIDS_Optuna_US_300trials_GPU_f1weighted.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for XGBoost RAPIDS HPO Upsampling 300 trials GPU F1 Weighted')
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
print('\n')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))
Compared to the baseline model, there is a higher accuracy, precision and F1 score but a lower recall score. Let's now evaluate the predictive probability on the test set.
from sklearn.metrics import roc_auc_score
print('The best model from Upsampling 300 GPU trials optimization scores {:.5f} AUC ROC on the test set.'.format(roc_auc_score(asnumpy(y_test),
asnumpy(y_test_pred))))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
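The AUC above is computed from the 0/1 class predictions. A sketch of scoring the predicted probabilities instead, assuming the same best_model and cudf test set from above:
# Positive-class probabilities; cupy.asnumpy passes NumPy arrays through unchanged
y_test_proba = best_model.predict_proba(X_test)[:, 1]
print('ROC AUC from predicted probabilities: {:.5f}'.format(
    roc_auc_score(asnumpy(y_test), asnumpy(y_test_proba))))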
Model Explanations¶
SHAP (SHapley Additive exPlanations)¶
Let's use the training set and summarize the effects of all the features.
import shap
shap.initjs()
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_train)
plt.rcParams.update({'font.size': 7})
shap.summary_plot(shap_values, X_train.to_pandas(), show=False);
Now, we can use the test set and summarize the effects of all the features.
shap.initjs()
shap_values = explainer.shap_values(X_test)
plt.rcParams.update({'font.size': 7})
shap.summary_plot(shap_values, X_test.to_pandas(), show=False);
Model Metrics with Eli5¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
import eli5
from eli5.sklearn import PermutationImportance
perm_importance = PermutationImportance(best_model,
random_state=seed_value).fit(asnumpy(X_test),
asnumpy(y_test))
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).
from eli5.formatters import format_as_dataframe
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test.columns.tolist())
exp = format_as_dataframe(explanation)
exp
The out_prncp feature had the highest weight (0.2415) followed by total_pymnt (0.2345) and recoveries (0.0670).
Let's now utilize the plot_importance from the XGBoost package to examine the feature importance.
from xgboost import plot_importance
plot_importance(best_model, max_num_features=15);
The most important features are:
total_pymnt, mo_sin_old_rev_tl_op, total_rec_int, annual_inc, total_bal_ex_mort, int_rate, total_hi_cred_lim, bc_open_to_buy, revol_util, revol_bal, total_bc_limit, installment, mths_since_recent_bc, out_prncp and last_pymnt_amnt.
Next, we can calculate the Permutation importance:
from sklearn.inspection import permutation_importance
perm_importance = permutation_importance(best_model, X_test.to_pandas(),
                                         asnumpy(y_test))
sorted_idx = perm_importance.importances_mean.argsort()
plt.rcParams.update({'font.size': 10})
fig = plt.figure(figsize=(8, 15))
plt.barh(testDF.columns[sorted_idx],
perm_importance.importances_mean[sorted_idx])
plt.xlabel('Permutation Importance')
plt.title('XGBoost Permutation Importance')
plt.show();
From the permutation importance, the most important features are:
out_prncp, total_pymnt, recoveries, loan_amnt, installment, last_pymnt_amnt and total_rec_int.
SMOTE: Baseline Model¶
The SMOTE train and test sets were processed using the same methods as the Upsampled data, and the same parameters that were used for the Upsampled baseline model were maintained. Now we can fit the model using the training set, save it as .pkl file and evaluate the model using the same classification metrics.
xgb = XGBClassifier(objective='binary:logistic',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0)
xgb.fit(X_train, y_train)
Pkl_Filename = 'XGB_Baseline_SMOTE.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(xgb, file)
print('\nModel Metrics for XGBoost Baseline SMOTE')
y_train_pred = xgb.predict(X_train)
y_train_pred = y_train_pred.round(2)
y_train_pred = np.where(y_train_pred > 0.5, 1, 0)
y_test_pred = xgb.predict(X_test)
y_test_pred = y_test_pred.round(2)
y_test_pred = np.where(y_test_pred > 0.5, 1, 0)
print('Training Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_train), asnumpy(y_train_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Test Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))
Overall, the model metrics are over 90%, so the baseline model performs quite well. The lowest metrics are the recall and F1 score. Compared to the Upsampled baseline model, the accuracy, precision and F1 score are better, but the recall is lower (0.907 vs. 0.928).
300 Trials: Weighted F1¶
The same methods for reading the data, splitting the train set into train and validation sets and converting data types for modeling were used for trainDF_SMOTE and testDF_SMOTE, as well as defining a name for the study and the information for the W & B callbacks.
study_name = 'dask_optuna_smote_xgbRapids_weightedF1_300_tpe'
wandb.login()
wandb_kwargs = {'project': 'loanStatus_hpo', 'entity': 'aschultz',
'group': 'optuna_SMOTE_xgb300gpu_trainvaltest',
'save_code': 'False',
'notes': 'smote_xgb300gpu_trainvaltest_f1weighted'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a XGBoostClassifier model.
"""
joblib.dump(study, 'xgbRAPIDS_Optuna_SMOTE_300_GPU_f1weighted.pkl')
params_xgboost_optuna = {
'n_estimators': trial.suggest_int('n_estimators', 400, 700),
'max_depth': trial.suggest_int('max_depth', 7, 15),
'subsample': trial.suggest_float('subsample', 0.55, 0.8),
'gamma': trial.suggest_float('gamma', 0.6, 2.5),
'learning_rate': trial.suggest_float('learning_rate', 0.13, 0.3),
'reg_alpha': trial.suggest_float('reg_alpha', 0.2, 2.65),
'reg_lambda': trial.suggest_float('reg_lambda', 0.6, 2.3),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.8),
'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.35, 0.6),
'min_child_weight': trial.suggest_int('min_child_weight', 1, 8)
}
xgb = XGBClassifier(objective='binary:logistic',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0,
**params_xgboost_optuna)
start = timer()
xgb.fit(X_train, y_train,
eval_set=[(X_val, y_val)],
verbose=0)
y_pred_val = xgb.predict(X_val)
score = f1_score(asnumpy(y_val), asnumpy(y_pred_val), average='weighted')
run_time = timer() - start
return score
We can load the saved .pkl file to continue training for 265 more trials to reach the intended 300 for the study.
study = joblib.load('xgbRAPIDS_Optuna_SMOTE_300_GPU_f1weighted.pkl')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Highest F1 Score', study.best_value)
with timed('dask_optuna'):
search_time_start = time.time()
if os.path.isfile('xgbRAPIDS_Optuna_SMOTE_300_GPU_f1weighted.pkl'):
study = joblib.load('xgbRAPIDS_Optuna_SMOTE_300_GPU_f1weighted.pkl')
else:
study = optuna.create_study(sampler=optuna.samplers.TPESampler(),
study_name=study_name,
direction='maximize')
with parallel_backend('dask'):
study.optimize(lambda trial: objective(trial),
n_trials=265)
print('Time to run HPO:', time.time() - search_time_start)
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Highest F1 Score', study.best_value)
wandb.finish()
Results from the Hyperparameter Search¶
Let's now extract the trial number, f1_weighted and hyperparameter value into a dataframe and sort with the highest score first.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'f1_weighted'}, inplace=True)
trials_df.rename(columns={'params_colsample_bylevel': 'colsample_bylevel'},
inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
inplace=True)
trials_df.rename(columns={'params_gamma': 'gamma'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_weight': 'min_child_weight'},
inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_reg_alpha': 'reg_alpha'}, inplace=True)
trials_df.rename(columns={'params_reg_lambda': 'reg_lambda'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('f1_weighted', ascending=False)
print(trials_df)
We can now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 4, figsize=(20,5))
i = 0
axs = axs.flatten()
for i, hpo in enumerate(['learning_rate', 'gamma', 'colsample_bylevel',
'colsample_bytree']):
sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
learning_rate and gamma decreased over the trials, as with the Upsampled search, while colsample_bylevel and colsample_bytree increased.
Next, we can examine if there were any trends regarding the regularization parameters over the trials.
fig, axs = plt.subplots(1, 2, figsize=(20,5))
i = 0
for i, hpo in enumerate(['reg_alpha', 'reg_lambda']):
sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
reg_alpha decreased as in the Upsampled trials, but reg_lambda decreased here while it showed no trend over the Upsampled trial iterations.
From using plot_param_importances to visualize the parameter importances for the SMOTE set, the most important hyperparameters are subsample = 0.25, min_child_weight = 0.21, colsample_bylevel = 0.17 and colsample_bytree = 0.16. max_depth, the most important hyperparameter for the Upsampled model, drops to 0.04 here (vs. 0.89), reg_lambda to 0.02 (vs. 0.04) and gamma to < 0.01 (vs. 0.02).
Next, let's arrange the best parameters to re-create the best model and fit using the training data and save it as a .pkl file. Then, we can evaluate the model metrics for both the training and the test sets using the classification report, confusion matrix as well as accuracy, precision, recall and F1 score from sklearn.metrics.
params = study.best_params
params
best_model = XGBClassifier(objective='binary:logistic',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0,
**params)
best_model.fit(X_train, y_train)
Pkl_Filename = 'xgbRAPIDS_Optuna_SMOTE_300trials_GPU_f1weighted.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for XGBoost RAPIDS HPO SMOTE 300 trials GPU F1 Weighted')
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
print('\n')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))
Compared to the baseline model, there is a higher recall and F1 score and a comparable accuracy and precision score. Let's now evaluate the predictive probability on the test set.
print('The best model from SMOTE 300 GPU trials optimization scores {:.5f} AUC ROC on the test set.'.format(roc_auc_score(asnumpy(y_test),
asnumpy(y_test_pred))))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
Model Explanations¶
SHAP (SHapley Additive exPlanations)¶
Let's use the training set and summarize the effects of all the features.
shap.initjs()
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_train)
plt.rcParams.update({'font.size': 7})
shap.summary_plot(shap_values, X_train.to_pandas(), show=False);
Now, we can use the test set and summarize the effects of all the features.
shap.initjs()
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test)
plt.rcParams.update({'font.size': 7})
shap.summary_plot(shap_values, X_test.to_pandas(), show=False);
Model Metrics with Eli5¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
perm_importance = PermutationImportance(best_model,
random_state=seed_value).fit(asnumpy(X_test),
asnumpy(y_test))
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test.columns.tolist())
exp = format_as_dataframe(explanation)
exp
The total_pymnt feature has the highest weight for this model (0.284731), whereas for the Upsampled model out_prncp had the highest weight (0.2415) followed by total_pymnt (0.2345).
Let's now utilize the plot_importance from the XGBoost package to examine the feature importance.
plot_importance(best_model, max_num_features=15);
The most important features are:
total_pymnt, total_rec_int, int_rate, last_pymnt_amnt, out_prncp, installment, total_bal_ex_mort, mo_sin_old_rev_tl_op, annual_inc, loan_amnt, bc_open_to_buy, total_bc_limit, revol_bal and revol_util.
Next, we can calculate the Permutation importance:
perm_importance = permutation_importance(best_model, X_test.to_pandas(),
asnumpy(y_test))
sorted_idx = perm_importance.importances_mean.argsort()
plt.rcParams.update({'font.size': 10})
fig = plt.figure(figsize=(8, 15))
plt.barh(testDF.columns[sorted_idx],
perm_importance.importances_mean[sorted_idx])
plt.xlabel('Permutation Importance')
plt.title('XGBoost Permutation Importance')
plt.show();
From the permutation importance, the most important features are:
total_pymnt, out_prncp, loan_amnt, recoveries, installment, last_pymnt_amnt, total_rec_int and int_rate.
Linear¶
The notebooks utilized for evaluating the weighted F1, weighted ROC, precision or recall score as the metric to monitor during the hyperparameter tuning search using Logistic Regression, Ridge Regression, Lasso Regression or Elastic Net Regression are located here.
The methods used to set up the RAPIDS environment in Colab can be found here. For the subsequent algorithms evaluated, the RAPIDS environment was set up by installing/importing packages, setting the seed and creating the Dask cluster as described in the previous XGBoost section. For the following sections, the data was read and the features/target were set up as demonstrated below.
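A minimal sketch of that setup, assuming the same Upsampled CSVs and the cudf pattern used in the XGBoost section (the SMOTE sets follow the same steps with their respective files):
import cudf
trainDF = cudf.read_csv('trainDF_US.csv', low_memory=False)
testDF = cudf.read_csv('testDF_US.csv', low_memory=False)
# Features as float32 and the target as int32, matching the earlier sections
X_train = trainDF.drop('loan_status', axis=1).astype('float32')
y_train = trainDF['loan_status'].astype('int32')
X_test = testDF.drop('loan_status', axis=1).astype('float32')
y_test = testDF['loan_status'].astype('int32')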
LinearSVC¶
Support Vector Machine (SVM) is a supervised algorithm where the objective is to determine the hyperplane, a line that separates the features for the target variable, with the highest possible distance between the hyperplane and the nearest data point from either target group (the margin). The points closest to the hyperplane are the support vectors; removing these would affect the position of the dividing hyperplane.
However, finding the optimal hyperplane for non-linear problems is time consuming and computationally expensive. The 2005 paper A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs developed a method for solving linear SVMs with an L2 loss function, and others have since proposed new methods for utilizing linear SVMs.
The notebooks utilized for evaluating the weighted F1, weighted ROC, precision or recall score as the score metric to monitor during the hyperparameter tuning search are located
here.
Upsampled: Baseline Model¶
Let's first define the baseline model components for classification. Now we can fit the model using the training set from the Upsampled set, save it as .pkl file and evaluate the model using classification metrics.
from cuml.svm import LinearSVC
lsvc = LinearSVC()
lsvc.fit(X_train, y_train)
Pkl_Filename = 'LinearSVC_Baseline_US.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(lsvc, file)
print('\nModel Metrics for LinearSVC Baseline Upsampling')
y_train_pred = lsvc.predict(X_train)
y_train_pred = y_train_pred.round(2)
y_train_pred = cupy.where(y_train_pred > 0.5, 1, 0)
y_test_pred = lsvc.predict(X_test)
y_test_pred = y_test_pred.round(2)
y_test_pred = cupy.where(y_test_pred > 0.5, 1, 0)
print('Training Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_train), asnumpy(y_train_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Test Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))
Overall, the model metrics are horrible, so the model will definitely need to be tuned. In the SparkML section, the StandardScaler was utilized before fitting the model. This might be needed to obtain better model performance because using the default parameters resulted in the worst baseline model up to this point.
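A minimal sketch of what that scaling step could look like, assuming cuml.preprocessing.StandardScaler is available in the installed RAPIDS version:
from cuml.preprocessing import StandardScaler
# Fit the scaler on the training features only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
lsvc_scaled = LinearSVC()
lsvc_scaled.fit(X_train_scaled, y_train)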
100 Trials: Weighted F1¶
Now, we can define a function train_and_eval to set up the train/test sets and define the model/parameters that will be tested during the search:
- penalty: The regularization term of the target function. Default='l2'.
- loss: The loss term of the target function. Default='squared_hinge'.
- penalized_intercept: When true, the bias term is treated the same way as other features; i.e. it's penalized by the regularization term of the target function. Default=False.
- max_iter: Maximum number of iterations for the underlying solver. Default=1000.
- linesearch_max_iter: Maximum number of linesearch (inner loop) iterations for the underlying (QN) solver. Default=100.
- lbfgs_memory: Number of vectors approximating the hessian for the underlying QN solver (l-bfgs). Default=5.
- C: The constant scaling factor of the loss term in the target formula. Default=1.0.
- grad_tol: The threshold on the gradient for the underlying QN solver. Default=0.0001.
- change_tol: The threshold on the function change for the underlying QN solver. Default=1e-05.
Then the model will be trained using the training set, predictions made using the test set, and the model evaluated on the weighted f1_score between the actual loan_status in the test set and the predicted one.
def train_and_eval(X_param, y_param, penalty='l2',
loss='squared_hinge',
penalized_intercept='False',
max_iter=10000,
linesearch_max_iter=100,
lbfgs_memory=5, C=1,
grad_tol=0.0001, change_tol=1e-5,
verbose=False):
"""
Partition data into train/test sets, train and evaluate the model
for the given parameters.
Params
______
X_param: DataFrame.
The data to use for training and testing.
y_param: Series.
The label for training
Returns
score: Weighted F1 of the fitted model
"""
X_train, y_train = trainDF.drop('loan_status',
axis=1), trainDF['loan_status'].astype('int32')
X_train = X_train.astype('float32')
X_test, y_test= testDF.drop('loan_status',
axis=1), testDF['loan_status'].astype('int32')
X_test = X_test.astype('float32')
model = LinearSVC(penalty=penalty,
loss=loss,
penalized_intercept=penalized_intercept,
max_iter=max_iter,
linesearch_max_iter=linesearch_max_iter,
lbfgs_memory=lbfgs_memory,
C=C,
grad_tol=grad_tol,
change_tol=change_tol,
verbose=False)
start = timer()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
score = f1_score(y_test.to_numpy(), y_pred.to_numpy(), average='weighted')
run_time = timer() - start
return score
print('Score with default parameters : ', train_and_eval(X_train, y_train))
As with the other hyperparameter optimizations, an objective function needs to be defined that dumps the study to a .pkl file and specifies the parameter space to be tested.
def objective(trial, X_param, y_param):
joblib.dump(study, 'LinearSVC_Optuna_US_100_GPU_weightedF1.pkl')
penalty = trial.suggest_categorical('penalty', ['l1', 'l2'])
loss = trial.suggest_categorical('loss', ['squared_hinge', 'hinge'])
penalized_intercept = trial.suggest_categorical('penalized_intercept',
['True', 'False'])
    max_iter = trial.suggest_int('max_iter', -1, 10000000)
linesearch_max_iter = trial.suggest_int('linesearch_max_iter', 100, 10000)
lbfgs_memory = trial.suggest_int('lbfgs_memory', 5, 20)
C = trial.suggest_float('C', 0.1, 10, step=0.1)
grad_tol = trial.suggest_float('grad_tol', 1e-3, 1e-1)
change_tol = trial.suggest_float('change_tol', 1e-5, 1e-3)
score = train_and_eval(X_param,
y_param,
penalty=penalty,
loss=loss,
penalized_intercept=penalized_intercept,
max_iter=max_iter,
linesearch_max_iter=linesearch_max_iter,
lbfgs_memory=lbfgs_memory,
C=C,
grad_tol=grad_tol,
change_tol=change_tol,
verbose=False)
return score
Now, let's set up an Optuna.study by creating a unique name for the study and beginning the hyperparameter search with the TPESampler to maximize the weighted_f1 score for 100 trials using the Dask parallel_backend.
from datetime import datetime, timedelta
study_name = 'dask_linearSVC_optuna_US_100_weightedF1_tpe'
with timed('dask_optuna'):
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('LinearSVC_Optuna_US_100_GPU_weightedF1.pkl'):
study = joblib.load('LinearSVC_Optuna_US_100_GPU_weightedF1.pkl')
else:
study = optuna.create_study(sampler=optuna.samplers.TPESampler(),
study_name=study_name,
direction='maximize')
with parallel_backend('dask'):
study.optimize(lambda trial: objective(trial, X_train, y_train),
n_trials=100, n_jobs=n_workers)
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Highest F1 Score', study.best_value)
Results from the Hyperparameter Search¶
Let's now extract the trial number, f1_weighted and hyperparameter value into a dataframe and sort with the highest score first.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'f1_weighted'}, inplace=True)
trials_df.rename(columns={'params_C': 'C'}, inplace=True)
trials_df.rename(columns={'params_change_tol': 'change_tol'}, inplace=True)
trials_df.rename(columns={'params_grad_tol': 'grad_tol'}, inplace=True)
trials_df.rename(columns={'params_lbfgs_memory': 'lbfgs_memory'}, inplace=True)
trials_df.rename(columns={'params_linesearch_max_iter': 'linesearch_max_iter'},
inplace=True)
trials_df.rename(columns={'params_loss': 'loss'}, inplace=True)
trials_df.rename(columns={'params_max_iter': 'max_iter'}, inplace=True)
trials_df.rename(columns={'params_penalized_intercept': 'penalized_intercept'},
inplace=True)
trials_df.rename(columns={'params_penalty': 'penalty'}, inplace=True)
trials_df = trials_df.sort_values('f1_weighted', ascending=False)
print(trials_df)
Let's utilize plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search and plot_slice to compare the objective value and the individual parameters.
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()
fig = optuna.visualization.plot_slice(study)
fig.show()
The grad_tol parameter with lower values and the lbfgs_memory parameter with higher values resulted in higher objective values.
We can now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 6, figsize=(20,5))
i = 0
axs = axs.flatten()
for i, hpo in enumerate(['C', 'change_tol', 'grad_tol',
'lbfgs_memory', 'linesearch_max_iter', 'max_iter']):
sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
Over the study, the parameters C, grad_tol, linesearch_max_iter and max_iter decreased, while the parameter change_tol increased. There was not really a trend for the lbfgs_memory parameter.
Now, let's use plot_param_importances to visualize parameter importances:
fig = optuna.visualization.plot_param_importances(study)
fig.show()
The penalty hyperparameter is the most important (0.97) followed by loss (0.03).
Let's now utilize plot_edf to visualize the empirical distribution of the study.
fig = optuna.visualization.plot_edf(study)
fig.show()
Next, let's arrange the best parameters to re-create the best model and fit using the training data and save it as a .pkl file. Then, we can evaluate the model metrics for both the training and the test sets using the classification report, confusion matrix as well as accuracy, precision, recall and F1 score from sklearn.metrics.
params = study.best_params
params
best_model = LinearSVC(**params)
best_model.fit(X_train, y_train)
Pkl_Filename = 'LinearSVC_Optuna_US_trials100_GPU_weightedF1.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for LinearSVC HPO Upsampling 100trials GPU')
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test.to_numpy(), y_test_pred.to_numpy())
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test.to_numpy(), y_test_pred.to_numpy()))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test.to_numpy(),
y_test_pred.to_numpy()))
print('Precision score : %.3f' % precision_score(y_test.to_numpy(),
y_test_pred.to_numpy()))
print('Recall score : %.3f' % recall_score(y_test.to_numpy(),
y_test_pred.to_numpy()))
print('F1 score : %.3f' % f1_score(y_test.to_numpy(), y_test_pred.to_numpy()))
It's nice to see model metrics that are not all zero values, unlike the baseline model where everything except accuracy was zero. This is an improvement over the baseline model, but the metrics are still lower than what the boosting algorithms generated. Let's now evaluate the ROC AUC on the test set predictions.
print('The best model from Upsampling 100 GPU trials optimization scores {:.5f} AUC ROC on the test set.'.format(roc_auc_score(y_test.to_numpy(),
y_test_pred.to_numpy())))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
SMOTE: Baseline Model¶
Let's first define the baseline model components for classification. Now we can fit the model using the training set from the SMOTE set, save it as a .pkl file and evaluate the model using classification metrics.
lsvc = LinearSVC()
lsvc.fit(X_train, y_train)
Pkl_Filename = 'LinearSVC_Baseline_SMOTE.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(lsvc, file)
print('\nModel Metrics for LinearSVC Baseline SMOTE')
y_train_pred = lsvc.predict(X_train)
y_train_pred = y_train_pred.round(2)
y_train_pred = cupy.where(y_train_pred > 0.5, 1, 0)
y_test_pred = lsvc.predict(X_test)
y_test_pred = y_test_pred.round(2)
y_test_pred = cupy.where(y_test_pred > 0.5, 1, 0)
print('Training Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_train), asnumpy(y_train_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Test Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))
Once again, the model metrics are horrible, so the model will definitely need to be tuned. Using the default parameters resulted in the worst baseline models up to this point for both the Upsampled and SMOTE sets.
100 Trials: Weighted F1¶
The same train_and_eval and objective functions and search parameter space described in the Upsampled section were utilized for the SMOTE study. A new .pkl file was generated for each search.
print('Score with default parameters : ', train_and_eval(X_train, y_train))
study_name = 'dask_linearSVC_optuna_SMOTE_100_weightedF1_tpe'
with timed('dask_optuna'):
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('LinearSVC_Optuna_SMOTE_100_GPU_weightedF1.pkl'):
study = joblib.load('LinearSVC_Optuna_SMOTE_100_GPU_weightedF1.pkl')
else:
study = optuna.create_study(sampler=optuna.samplers.TPESampler(),
study_name=study_name,
direction='maximize')
with parallel_backend('dask'):
study.optimize(lambda trial: objective(trial, X_train, y_train),
n_trials=100, n_jobs=n_workers)
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Highest F1 Score', study.best_value)
Results from the Hyperparameter Search¶
Let's now extract the trial number, f1_weighted and hyperparameter values into a dataframe and sort with the highest score first.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration',
                          'value': 'f1_weighted',
                          'params_C': 'C',
                          'params_change_tol': 'change_tol',
                          'params_grad_tol': 'grad_tol',
                          'params_lbfgs_memory': 'lbfgs_memory',
                          'params_linesearch_max_iter': 'linesearch_max_iter',
                          'params_loss': 'loss',
                          'params_max_iter': 'max_iter',
                          'params_penalized_intercept': 'penalized_intercept',
                          'params_penalty': 'penalty'}, inplace=True)
trials_df = trials_df.sort_values('f1_weighted', ascending=False)
print(trials_df)
Let's utilize plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search and plot_slice to compare the objective value and the individual parameters.
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()
fig = optuna.visualization.plot_slice(study)
fig.show()
Lower C, change_tol and grad_tol, together with higher lbfgs_memory, resulted in higher objective values.
We can now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 6, figsize=(20,5))
axs = axs.flatten()
for i, hpo in enumerate(['C', 'change_tol', 'grad_tol',
                         'lbfgs_memory', 'linesearch_max_iter', 'max_iter']):
    sns.regplot(x='iteration', y=hpo, data=trials_df, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
Over the trials, C, change_tol and grad_tol decreased while lbfgs_memory and max_iter increased. There was no clear trend for linesearch_max_iter. In contrast, the search using the Upsampled set showed change_tol increasing over the trials while max_iter decreased.
Now, let's use plot_param_importances to visualize parameter importances:
fig = optuna.visualization.plot_param_importances(study)
fig.show()
The most important hyperparameter for the SMOTE set is loss (0.55), followed by penalty (0.44); penalty was the most important hyperparameter for the Upsampled study at 0.97.
Let's now utilize plot_edf to visualize the empirical distribution of the study.
fig = optuna.visualization.plot_edf(study)
fig.show()
Next, let's use the best parameters to re-create the best model, fit it on the training data and save it as a .pkl file. Then, we can evaluate the model metrics on the test set using the classification report, confusion matrix as well as accuracy, precision, recall and F1 score from sklearn.metrics.
params = study.best_params
params
best_model = LinearSVC(**params)
best_model.fit(X_train, y_train)
Pkl_Filename = 'LinearSVC_Optuna_SMOTE_trials100_GPU_weightedF1.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for LinearSVC HPO SMOTE 100trials GPU')
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test.to_numpy(), y_test_pred.to_numpy())
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test.to_numpy(), y_test_pred.to_numpy()))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test.to_numpy(),
y_test_pred.to_numpy()))
print('Precision score : %.3f' % precision_score(y_test.to_numpy(),
y_test_pred.to_numpy()))
print('Recall score : %.3f' % recall_score(y_test.to_numpy(),
y_test_pred.to_numpy()))
print('F1 score : %.3f' % f1_score(y_test.to_numpy(), y_test_pred.to_numpy()))
When comparing the model metrics after the SMOTE search to those after the search using the Upsampled set, there were higher accuracy, precision and F1 scores, but a lower recall score. The precision and recall scores differed by about 4% in opposite directions. Let's now evaluate the ROC AUC on the test set predictions.
print('The best model from SMOTE 100 GPU trials optimization scores {:.5f} AUC ROC on the test set.'.format(roc_auc_score(y_test.to_numpy(),
y_test_pred.to_numpy())))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
K-Nearest Neighbors (KNN) Classifier¶
KNN was first proposed in this 1951 publication Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties and later expounded on in this 1967 publication Nearest Neighbor Pattern Classification.
The notebooks utilized for evaluating the weighted F1, weighted ROC, precision or recall score as the score metric to monitor during the hyperparameter tuning search are located here.
Upsampled: Baseline Model¶
Let's first define the baseline model components for classification. Now we can fit the model using the training set from the Upsampled set, save it as a .pkl file and evaluate the model using classification metrics.
from cuml.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
Pkl_Filename = 'KNN_Baseline_US.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(knn, file)
y_train_pred = knn.predict(X_train)
y_train_pred = y_train_pred.round(2)
y_train_pred = cupy.where(y_train_pred > 0.5, 1, 0)
y_test_pred = knn.predict(X_test)
y_test_pred = y_test_pred.round(2)
y_test_pred = cupy.where(y_test_pred > 0.5, 1, 0)
print('\nModel Metrics for KNN Baseline Upsampling')
print('Training Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_train), asnumpy(y_train_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Test Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))
The model is clearly overfitting the training set, since the training performance is not balanced with the test set performance. The lowest metrics are the precision and F1 score, so let's monitor precision to see if hyperparameter tuning results in better performance.
100 Trials: Precision¶
Now, we can define a function train_and_eval to set up the train/test sets and define the model/parameters that will be tested during the search:
- n_neighbors: Number of neighbors to query. Default = 5.
- metric: Distance metric to use. Options are 'euclidean', 'manhattan', 'chebyshev', 'minkowski'.
Then the model will be trained using the training set, predictions made using the test set, and the model evaluated for the precision between the actual loan_status in the test set and the predicted values. A minimal sketch of these two functions is shown below.
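The following sketch shows what train_and_eval and objective might look like for the cuML KNeighborsClassifier with precision as the monitored metric. The hold-out split inside train_and_eval, the n_neighbors range and the use of cuml.model_selection.train_test_split are illustrative assumptions; the exact definitions are in the linked notebooks.
from cuml.neighbors import KNeighborsClassifier
from cuml.model_selection import train_test_split
from sklearn.metrics import precision_score

def train_and_eval(X, y, n_neighbors=5, metric='euclidean'):
    """Split the data, fit a cuML KNN classifier and return the precision score."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                                random_state=42)
    model = KNeighborsClassifier(n_neighbors=n_neighbors, metric=metric)
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_val)
    return precision_score(y_val.to_numpy(), y_pred.to_numpy())

def objective(trial, X, y):
    """Optuna objective: sample the KNN hyperparameters and score them."""
    n_neighbors = trial.suggest_int('n_neighbors', 2, 50)  # assumed range
    metric = trial.suggest_categorical('metric', ['euclidean', 'manhattan',
                                                  'chebyshev', 'minkowski'])
    return train_and_eval(X, y, n_neighbors=n_neighbors, metric=metric)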
Results from the Hyperparameter Search¶
Let's utilize plot_slice to compare the objective value and individal parameters.
fig = optuna.visualization.plot_slice(study)
fig.show()
Lower values of n_neighbors resulted in higher objective values.
Now, let's use plot_param_importances to visualize parameter importances:
fig = optuna.visualization.plot_param_importances(study)
fig.show()
The most important hyperparameter is n_neighbors (0.98) compared to metric (0.02).
Next, let's use the best parameters to re-create the best model, fit it on the training data and save it as a .pkl file. Then, we can evaluate the model metrics on the test set using the classification report, confusion matrix as well as accuracy, precision, recall and F1 score from sklearn.metrics.
params = study.best_params
params
best_model = KNeighborsClassifier(n_neighbors=4, metric='manhattan')
best_model.fit(X_train, y_train)
Pkl_Filename = 'KNN_Optuna_US_trials100_GPU_Precision.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for KNN HPO US 100trials GPU Precision')
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test.to_numpy(), y_test_pred.to_numpy())
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test.to_numpy(), y_test_pred.to_numpy()))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test.to_numpy(),
y_test_pred.to_numpy()))
print('Precision score : %.3f' % precision_score(y_test.to_numpy(),
y_test_pred.to_numpy()))
print('Recall score : %.3f' % recall_score(y_test.to_numpy(),
y_test_pred.to_numpy()))
print('F1 score : %.3f' % f1_score(y_test.to_numpy(), y_test_pred.to_numpy()))
Compared to the baseline model using the Upsampled set, all metrics increased besides a small decrease in the recall score. It might be worthwhile to consider monitoring the F1 score instead or further preprocessing the data before modeling.
Let's now evaluate the predictive probability on the test set.
print('The best model from US 100 Precision GPU trials optimization scores {:.5f} AUC ROC on the test set.'.format(roc_auc_score(y_test.to_numpy(),
y_test_pred.to_numpy())))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
SMOTE: Baseline Model¶
Let's first define the baseline model components for classification, fit the model using the training set from the SMOTE set, save it as a .pkl file and evaluate the model using classification metrics.
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
Pkl_Filename = 'KNN_Baseline_SMOTE.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(knn, file)
y_train_pred = knn.predict(X_train)
y_train_pred = y_train_pred.round(2)
y_train_pred = cupy.where(y_train_pred > 0.5, 1, 0)
y_test_pred = knn.predict(X_test)
y_test_pred = y_test_pred.round(2)
y_test_pred = cupy.where(y_test_pred > 0.5, 1, 0)
print('\nModel Metrics for KNN Baseline SMOTE')
print('Training Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_train), asnumpy(y_train_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Test Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))
The model is clearly overfitting the training set for the SMOTE set, like it was for the Upsampled baseline KNN model. The lowest metrics are the precision and F1 score again, and they are even lower than what the Upsampled baseline model revealed.
100 Trials: Precision¶
The same train_and_eval and objective functions and search parameter space described in the Upsampled section were utilized for the SMOTE study. A new .pkl file was generated for each search.
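For completeness, launching and persisting this search presumably follows the same pattern used for the LinearSVC studies above, reusing the timed helper, the Dask backend and the n_workers setting from earlier; the study name and the study .pkl file name below are assumptions for illustration.
study_name = 'dask_knn_optuna_SMOTE_100_Precision_tpe'  # assumed study name
with timed('dask_optuna'):
    # Reload a finished study if one exists; otherwise run a fresh 100-trial search
    if os.path.isfile('KNN_Optuna_SMOTE_100_GPU_Precision_study.pkl'):
        study = joblib.load('KNN_Optuna_SMOTE_100_GPU_Precision_study.pkl')
    else:
        study = optuna.create_study(sampler=optuna.samplers.TPESampler(),
                                    study_name=study_name,
                                    direction='maximize')
        with parallel_backend('dask'):
            study.optimize(lambda trial: objective(trial, X_train, y_train),
                           n_trials=100, n_jobs=n_workers)
        # Persist the finished study so later runs can skip the search
        joblib.dump(study, 'KNN_Optuna_SMOTE_100_GPU_Precision_study.pkl')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)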
Results from the Hyperparameter Search¶
Let's utilize plot_slice to compare the objective value and the individual parameters.
fig = optuna.visualization.plot_slice(study)
fig.show()
Lower values of n_neighbors resulted in higher objective values, as in the Upsampled trials.
Now, let's use plot_param_importances to visualize parameter importances:
fig = optuna.visualization.plot_param_importances(study)
fig.show()
The n_neighbors hyperparameter is the most important (0.99), while it was 0.98 for the Upsampled set.
Next, let's use the best parameters to re-create the best model, fit it on the training data and save it as a .pkl file. Then, we can evaluate the model metrics on the test set using the classification report, confusion matrix as well as accuracy, precision, recall and F1 score from sklearn.metrics.
params = study.best_params
params
best_model = KNeighborsClassifier(n_neighbors=4, metric='euclidean')
best_model.fit(X_train, y_train)
Pkl_Filename = 'KNN_Optuna_SMOTE_trials100_GPU_Precision.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for KNN HPO SMOTE 100trials GPU Precision')
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test.to_numpy(), y_test_pred.to_numpy())
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test.to_numpy(), y_test_pred.to_numpy()))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test.to_numpy(),
y_test_pred.to_numpy()))
print('Precision score : %.3f' % precision_score(y_test.to_numpy(),
y_test_pred.to_numpy()))
print('Recall score : %.3f' % recall_score(y_test.to_numpy(),
y_test_pred.to_numpy()))
print('F1 score : %.3f' % f1_score(y_test.to_numpy(), y_test_pred.to_numpy()))
Compared to the baseline model using the SMOTE set, all metrics increased besides a small decrease in the recall score, similar to what was observed when comparing the tuned Upsampled model to its baseline. Again, it might be worthwhile to consider monitoring the F1 score instead or further preprocessing the data before modeling. The tuned model using the Upsampled set performed better than the model from the SMOTE search.
Let's now evaluate the predictive probability on the test set.
print('The best model from SMOTE 100 Precision GPU trials optimization scores {:.5f} AUC ROC on the test set.'.format(roc_auc_score(y_test.to_numpy(),
y_test_pred.to_numpy())))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])