Banks routinely lose money on loans that eventually default. According to the Federal Reserve, at the height of the financial crisis in 2009-2010, the amount lost approached 500 billion U.S. dollars. More recently, quarterly losses have tended to approach 150 billion, and delinquency rates have hovered around 1.5%. Because of this, it is vitally important for banks to keep their delinquencies as low as possible.
- Can we accurately predict loan approval based on historical data?
- How can we confidently determine whether a loan can be approved?
Rationale and Objective:
- If a loan is current, the company is making money and should approve such future loans based on the model.
- If a loan is late or in default, the company is not making money and should reject such future loans based on the model.
- What factors predict loan approval?
- Which variables best predict if a loan will be a loss and how much is the average loss?
Data
The data was retrieved from Kaggle. Lending Club is a peer-to-peer financial company: people can request an unsecured loan between 1,000 and 40,000 dollars, while other individuals can visit the site and choose to invest in those loans. So, people are essentially lending to other people directly, with Lending Club as the facilitator.
Preprocessing
The code that was used for preprocessing and EDA can be found in the LoanApproval_LendingClub Github repository. First, the environment is set up with the dependencies, library options, the seed for reproducibility, and the location of the project directory. Then the data is read, duplicate observations are dropped and empty strings are replaced with NA.
import os
import random
import numpy as np
import warnings
from numpy import sort
import pandas as pd
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
seed_value = 42
os.environ['LoanStatus_PreprocessEDA'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
df = pd.read_csv('loan_Master.csv', index_col=False, low_memory=False)
print('- Dimensions of initial data:', df.shape)
df = df.replace(r'^\s*$', np.nan, regex=True)
df1 = df.loc[:, df.isnull().mean() < 0.05]
print('- Dimensions after removing columns with >= 5% missing values:', df1.shape)
s = set(df1)
varDiff = [x for x in df if x not in s]
print('- Number of features removed due to high missingness:'
+ str(len(varDiff)))
df = df1
del df1
Let's now use a function to examine the data types present, the amount of data missing as a percentage, the number of unique values and the number of rows and columns. This can be utilized to determine which features should remain in the set as well as the ones that might result in longer runtimes and more difficulty in explaining the downstream models.
def data_quality_table(df):
"""Returns the characteristics of variables in a Pandas dataframe."""
var_type = df.dtypes
mis_val_percent = 100 * df.isnull().sum() / len(df)
unique_count = df.nunique()
mis_val_table = pd.concat([var_type, mis_val_percent, unique_count], axis=1)
mis_val_table_ren_columns = mis_val_table.rename(
columns = {0: 'Data Type', 1: 'Percent Missing', 2: 'Number Unique'})
mis_val_table_ren_columns = mis_val_table_ren_columns[
mis_val_table_ren_columns.iloc[:,1] >= 0].sort_values(
'Percent Missing', ascending=False).round(1)
    print('- There are ' + str(df.shape[0]) + ' rows and '
          + str(df.shape[1]) + ' columns.\n')
return mis_val_table_ren_columns
Categorical Variables (Features)
Dimensionality is a problem that frequently occurs when working with categorical features. Variables with a large number of groups or levels will increase the size of the data when dummy variables or one hot encoding is used during modeling. After examining the quality of the data, the features with a high percentage of missing values and a large number of groups were removed from the set.
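As a quick illustration of how rapidly one hot encoding can inflate the column count (a toy, made-up column rather than data from this set):

import pandas as pd
# a single categorical column with 500 distinct levels becomes 500 dummy columns
toy = pd.DataFrame({'zip3': [f'{i % 500:03d}' for i in range(1000)]})
print('- Before encoding:', toy.shape)
print('- After one hot encoding:', pd.get_dummies(toy, columns=['zip3']).shape)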
df1 = df.select_dtypes(include = 'object')
print('\n Data Quality: Qualitative Variables')
display(data_quality_table(df1))
print('\n')
print('\nSample observations of qualitative variables:')
display(df1.head())
df = df.drop(['title', 'last_pymnt_d', 'zip_code', 'earliest_cr_line',
'last_credit_pull_d', 'issue_d', 'addr_state', 'sub_grade'],
axis=1)
del df1
Quantitative Variables (Features)
Now we can remove the rows with NA/null values in some of the variables identified with the data_quality_table function, so that complete cases without missing data can be used for the set.
df1 = df.select_dtypes(exclude = 'object')
print('\n Data Quality: Quantitative Variables')
display(data_quality_table(df1))
print('\n')
print('\nSample observations of quantitative variables:')
display(df1.head())
df = df[df.bc_util.notna() & df.percent_bc_gt_75.notna()
& df.pct_tl_nvr_dlq.notna() & df.mths_since_recent_bc.notna()
& df.dti.notna() & df.inq_last_6mths.notna() & df.num_rev_accts.notna()]
del df1
Dependent Variable (Target): Status of Loan
The majority of the individuals in this set are Fully Paid
or Current
for the loan_status
feature regarding their loan payments. However, the minority of the samples are not.
print(df.loan_status.value_counts(normalize=True).mul(100).round(2).astype(str) + '%')
Now let's convert loan_status
to a binary variable current = 0
and default = 1
so this can be used as the target (label) for classification modeling. After recoding to binary, a clear class imbalance exists with 87.35%
of the observations on track for completing payments.
status_map = {'Fully Paid': 0, 'In Grace Period': 0, 'Current': 0,
              'Charged Off': 1, 'Late (31-120 days)': 1, 'Late (16-30 days)': 1,
              'Does not meet the credit policy. Status:Fully Paid': 1,
              'Does not meet the credit policy. Status:Charged Off': 1,
              'Default': 1}
df['loan_status'] = df['loan_status'].replace(status_map)
print('\nExamine Binary Loan Status for Class Imbalance:')
print(df.loan_status.value_counts(normalize=True).mul(100).round(2).astype(str) + '%')
Variable Selection
SelectFromModel using XGBoost
To prepare the data for feature selection using this approach, let's separate the input features as X
and the target as y
. Then we can create dummy variables for the categorical features using pandas.get_dummies
and drop the initial features. Now we can fit a baseline model on all data using the evaluation metric logloss
, specifying the parameters to run on a GPU and the random_state
as the seed_value
. We can then fit the XGBClassifier
model, make predictions and determine the accuracy of the baseline model using accuracy_score
.
from xgboost import XGBClassifier, plot_importance
from sklearn.metrics import accuracy_score
X = df.drop('loan_status', axis=1)
y = df.loan_status
X = pd.get_dummies(X, drop_first=True)
model = XGBClassifier(eval_metric='logloss',
use_label_encoder=False,
tree_method='gpu_hist',
gpu_id=0,
random_state=seed_value)
model.fit(X, y)
y_pred = model.predict(X)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y, predictions)
print('Accuracy: %.3f%%' % (accuracy * 100))
The feature importance of the model can be examined using plot_importance
from the XGBoost
module and the plot parameters can be defined using matplotlib.pyplot
.
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 10)
plt.rcParams.update({'font.size': 8.5})
plot_importance(model)
plt.tight_layout()
plt.show();
The ten most important features using the baseline model are total_rec_prncp
, out_prncp
, last_pymnt_amnt
, int_rate
, loan_amnt
, total_rec_int, installment
, total_rec_late_fee
, funded_amnt_inv
and mo_sin_old_rev_tl_op
.
Another method to examine how features affect a model is permutation feature importance. This is defined as the reduction in the model score when a single feature value is shuffled randomly, indicating the extent to which the model relies on the feature. Different numbers of permutations can be tested.
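To make that definition concrete, here is a hand-rolled version for a single column, reusing the model, X, y and seed_value defined above (int_rate is just an illustrative choice; the permutation_importance function used below repeats this for every feature and averages over repeated shuffles):

rng = np.random.default_rng(seed_value)
baseline = accuracy_score(y, model.predict(X))
X_shuffled = X.copy()
# break the relationship between one feature and the target by shuffling its values
X_shuffled['int_rate'] = rng.permutation(X_shuffled['int_rate'].values)
drop = baseline - accuracy_score(y, model.predict(X_shuffled))
print('Accuracy drop after shuffling int_rate: %.4f' % drop)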
from sklearn.inspection import permutation_importance
perm_importance = permutation_importance(model, X, y)
plt.rcParams.update({'font.size': 7})
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(X.columns[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel('Permutation Importance')
plt.tight_layout()
plt.show();
The Shapley value estimates the impact of the individual features on the final result, separately, while conserving the total impact on the final result. This can be considered in conjunction with the previously used feature importance measures when modeling. Let's use SHAP (SHapley Additive exPlanations), published as A Unified Approach to Interpreting Model Predictions, with the TreeExplainer since we are using XGBoost, and generate shap_values for the features. The summary_plot shows the global importance of each feature and the distribution of the effect sizes over the set.
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
plt.rcParams.update({'font.size': 7})
shap.summary_plot(shap_values, X, show=False)
plt.show();
As total_rec_prncp
, out_prncp_inv
and the last_pymnt_amnt
increase, there is a lower probability for a loan to default while as recoveries
increases, there is a higher probability the loan will default.
Now, we can define the maximum number of features to test as the current number of features, and the minimum number of features, which is two. We need to define a metric to test, so let's use the model accuracy. The thresholds can then be defined by sorting the model feature importances in ascending order. Then a for
loop can be used to iterate through the list of thresholds to fit a model, predict and calculate the model accuracy. The model accuracy can then be saved in a list where the new values are added as a new model is tested.
from sklearn.feature_selection import SelectFromModel
import time
print('Time for feature selection using XGBoost...')
search_time_start = time.time()
feat_max = X.shape[1]
feat_min = 2
acc_max = accuracy
thresholds = sort(model.feature_importances_)
thresh_goal = thresholds[0]
accuracy_list = []
for thresh in thresholds:
selection = SelectFromModel(model, threshold=thresh, prefit=True)
select_X = selection.transform(X)
selection_model = XGBClassifier(eval_metric='logloss',
use_label_encoder=False,
tree_method='gpu_hist',
gpu_id=0,
random_state=seed_value)
selection_model.fit(select_X, y)
selection_model_pred = selection_model.predict(select_X)
selection_predictions = [round(value) for value in selection_model_pred]
accuracy = accuracy_score(y_true=y, y_pred=selection_predictions)
accuracy = accuracy * 100
print('Thresh= %.6f, n= %d, Accuracy: %.3f%%' % (thresh, select_X.shape[1],
accuracy))
accuracy_list.append(accuracy)
if(select_X.shape[1] < feat_max) and (select_X.shape[1] >= feat_min) and (accuracy >= acc_max):
n_min = select_X.shape[1]
acc_max = accuracy
thresh_goal = thresh
print('\n')
print('Finished feature selection using XGBoost in:',
time.time() - search_time_start)
print('\n')
print('\nThe optimal threshold is:')
print(thresh_goal)
Now we can subset the features where the optimal threshold occurred into a new set and examine which features were chosen. We can then fit a model with these features as the input to the model, and subsequently generate predictions using all of the available data and evaluate the accuracy of the model.
selection = SelectFromModel(model, threshold=thresh_goal, prefit=True)
feature_names = X.columns[selection.get_support(indices=True)]
print('\n- Feature selection using XGBoost resulted in '
+ str(len(feature_names)) + ' features.')
print('\n- Features selected using optimal threshold for accuracy:')
print(X.columns[selection.get_support()])
print('\n')
X = pd.DataFrame(data=X, columns=feature_names)
model = XGBClassifier(eval_metric='logloss',
use_label_encoder=False,
tree_method='gpu_hist',
gpu_id=0,
random_state=seed_value)
model.fit(X, y)
model.save_model('xgb_featureSelection.model')
y_pred = model.predict(X)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y, predictions)
print('Accuracy: %.3f%%' % (accuracy * 100.0))
45 features were selected and 98.946%
was the resulting accuracy. Let's now plot the feature importance graph again and compare to the baseline model.
plt.rcParams.update({'font.size': 10})
plot_importance(model)
plt.tight_layout()
plt.show();
The ten most important features using the SelectFromModel
are still total_rec_prncp
and out_prncp
, but int_rate
increased in rank instead of last_pymnt_amnt
and total_rec_int
as well as installment
instead of loan_amnt
and total_rec_late_fee
. Also, funded_amnt_inv
and revol_bal
increased in rank instead of mo_sin_old_rev_tl_op
.
Then we can examine the permutation based feature importance again and plot the results.
perm_importance = permutation_importance(model, X, y)
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(X.columns[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel('Permutation Importance')
plt.show();
Then we can visualize the feature importance computed with the SHAP
values.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
plt.rcParams.update({'font.size': 7})
shap.summary_plot(shap_values, X, show=False)
plt.show();
Compared to the default model, the majority of the features demonstrate the same properties, but loan_amnt is higher than out_prncp and int_rate is higher than total_rec_late_fee.
Group Lasso for Variable Selection
Multicollinearity using Variance Inflation Factor (VIF)
To check for multicollinearity, we can use the Variance Inflation Factor (VIF)
to quantify the amount that the estimated coefficients are inflated when multicollinearity exists. Let's first separate the input features and target and then select the quantitative features for inspection. We can define a function to calculate VIF
with a specified threshold which removes the feature if the VIF
is greater than a set threshold. The typical threshold is 4-5 to remove multicollinearity while 10 is considered as high multicollinearity. So let's use threshold=5.0
and examine the number of remaining numerical variables using this defined threshold.
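For reference, the VIF for a given feature j is VIF_j = 1 / (1 - R_j^2), where R_j^2 is the coefficient of determination from regressing that feature on all of the remaining predictors; this is what statsmodels' variance_inflation_factor computes for each column. A threshold of 5 therefore corresponds to R_j^2 = 0.8.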
from joblib import Parallel, delayed
from statsmodels.stats.outliers_influence import variance_inflation_factor
X1 = df.drop('loan_status', axis=1)
y = df.loan_status
df_num = X1.select_dtypes(include = ['float64', 'int64'])
def calculate_vif(X, threshold=5.0):
features = [X.columns[i] for i in range(X.shape[1])]
dropped = True
while dropped:
dropped = False
print('\nThe starting number of quantitative features is: '
+ str(len(features)))
vif = Parallel(n_jobs=-1,
verbose=5)(delayed(variance_inflation_factor)(X[features].values,
ix) for ix in range(len(features)))
maxloc = vif.index(max(vif))
if max(vif) > threshold:
print(time.ctime() + ' dropping \'' + X[features].columns[maxloc]
+ '\' at index: ' + str(maxloc))
features.pop(maxloc)
dropped = True
print('Features Remaining:')
print([features])
return X[[i for i in features]]
print('Time for calculating VIF on numerical data using threshold = 5...')
search_time_start = time.time()
X1 = calculate_vif(df_num, 5)
print('\nNumber of quant features after VIF:', X1.shape[1])
print('Finished calculating VIF on numerical data using threshold = 5 in:',
time.time() - search_time_start)
print ('- There are ' + str(X1.shape[0]) + ' rows and '
+ str(X1.shape[1]) + ' columns.\n')
print('\nQuant features remaining after VIF:')
print(X1.columns)
In the initial set, there were 61 quantitative features, and after using VIF
with a threshold=5.0
, 31 variables remain. Almost half of the variables were removed using this approach!
To note, when SelectFromModel using XGBoost
was completed without performing VIF
, the resulting score was 98.946%
and when retested after completing VIF
, the score was 96.854%
. Ensemble-based algorithms can handle multicollinearity, while linear models, depending on the data distribution, cannot.
Now let's concatenate these selected variables with the non-numerical features including the dependent variable, loan_status
. Before a group lasso
can be tested for variable selection, the quantitative and categorical variables need to be preprocessed consisting of scaling using the MinMaxScaler
for the numerical features, one hot encoding the categorical variables followed by stacking together column wise in a sparse matrix. The groups are then extracted from the one hot encoded features since they need to be specified for the group lasso
.
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
import scipy.sparse
from group_lasso.utils import extract_ohe_groups
df1 = df.select_dtypes(include = 'object')
df1 = pd.concat([X1, df1], axis=1)
df1 = pd.concat([y, df1], axis=1)
df_num = df1.select_dtypes(include = ['float64', 'int64'])
df_num = df_num.drop(['loan_status'], axis=1)
num_columns = df_num.columns.tolist()
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df_num)
df_cat = df1.select_dtypes(include = 'object')
cat_columns = df_cat.columns.tolist()
ohe = OneHotEncoder()
onehot_data = ohe.fit_transform(df1[cat_columns])
X2 = scipy.sparse.hstack([onehot_data, scipy.sparse.csr_matrix(scaled)])
y = df1['loan_status']
groups = extract_ohe_groups(ohe)
groups = np.hstack([groups, len(cat_columns) + np.arange(len(num_columns))+1])
print('The groups consist of ' + str(groups) + ' for the group lasso.')
To perform the group lasso, let's first define the parameter grid to test where the same number of iterations is used and a differing number for the convergence tolerance where the Default=1e-5
.
Then we can define the parameters for the estimator:
- groups: Iterable that specifies which group each column corresponds to. The categorical variables are the labeled groups.
- group_reg: The regularization coefficient(s) for the group sparsity penalty. Default=0.05.
- l1_reg: The regularization coefficient for the coefficient sparsity penalty. Default=0.05.
- scale_reg: How to scale the group-wise regularisation coefficients. Default='group_size'.
- random_state: The random state used for initialisation of parameters. Default=None. Let's use the same seed for reproducibility (seed_value).
Then the GridSearchCV
parameters can be defined:
- estimator: The defined LogisticGroupLasso estimator.
- scoring: Score the model based on accuracy.
- cv: The number of folds for cross validation as 5-fold.
- param_grid: The hyperparameters in params.
Now the models can be fit given the defined parameters in the grid search space to find the best estimator parameters and accuracy.
from group_lasso import LogisticGroupLasso
from sklearn.model_selection import GridSearchCV
from joblib import parallel_backend
LogisticGroupLasso.LOG_LOSSES = True
params = {
'n_iter': [3000],
'tol': [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6]
}
grid = GridSearchCV(estimator=LogisticGroupLasso(
groups=groups, group_reg=0.05, l1_reg=0, scale_reg=None,
supress_warning=True, random_state=seed_value),
scoring='accuracy', cv=5, param_grid=params)
print('Time for feature selection using GroupLasso GridSearchCV...')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
grid.fit(X2, y)
print('Finished feature selection using GroupLasso GridSearchCV in:',
time.time() - search_time_start)
print('\nGroup Lasso GridSearchCV Feature selection')
print('\nBest Estimator:')
print(grid.best_estimator_)
print('\nBest Parameters:')
print(grid.best_params_)
print('\nBest Accuracy:')
print(grid.best_score_)
print('\nResults from GridSearch CV:')
print(grid.cv_results_)
The best parameters from the grid search are {'n_iter': 3000, 'tol': 0.1}
with an accuracy of 0.8735407759559557
. However, the rank_test_score
using tol=1e-1
and tol=1e-2
generated comparable results.
Now using these parameters, let's fit the model, predict and extract the results from the performance metrics. We can use the defined model to extract the features selected and then count the number selected. Then a list of the chosen_groups_
in the model as a pandas.Series
can be transposed and the values converted to a list to match with the features in the original set. Then the loss over the iterations can be plotted to visualize the results.
gl = LogisticGroupLasso(
groups=groups,
group_reg=0.05,
n_iter=3000,
tol=0.1,
l1_reg=0,
scale_reg=None,
supress_warning=True,
random_state=seed_value,
)
with parallel_backend('threading', n_jobs=-1):
gl.fit(X2, y)
pred_y = gl.predict(X2)
sparsity_mask = gl.sparsity_mask_
accuracy = (pred_y == y).mean()
print(f'Number of total variables: {len(sparsity_mask)}')
print(f'Number of chosen variables: {sparsity_mask.sum()}')
print(f'Accuracy: {accuracy}')
tdf = pd.Series(list(gl.chosen_groups_)).T
tdf = tdf.values.tolist()
X2 = df1.drop('loan_status', axis=1)
X2 = X2.iloc[:,tdf]
variables = X2.columns.tolist()
print(f'Selected variables from group lasso: {variables}')
plt.rcParams['figure.figsize'] = (7, 5)
plt.rcParams.update({'font.size': 15})
plt.plot(gl.losses_)
plt.tight_layout()
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Group Lasso: Loss over the Number of Iterations')
plt.show()
del X2, tdf, variables
Given that the starting number of features after VIF
equaled 75, 19 were selected using the parameters tested in the grid search which included dti
, delinq_2yrs
, inq_last_6mths
, revol_bal
, out_prncp
, and collections_12_mths_ex_med. The resulting accuracy when fitting this model was 87.354% compared to 98.942% using 45 features when evaluating SelectFromModel with XGBoost. Therefore, this linear-based feature selection method might not be the best approach, but an alternative like the glmnet package, available in both Python and R, might perform better. However, let's test more variable selection methods.
Model-Free Sure Independence Screening (MV-SIS)
When I first began exploring different feature selection approaches in an academic research setting, a biostatistician recommended Model-Free Sure Independence Screening (MV-SIS)
given the robustness in high dimensional sets where p (predictors) >> n (number of observations). The publication of this approach can be found here.
This method builds on the approaches that were proposed beyond correlation ranking for non-linear problems. It also implies that the MV-based variable screening is model-free because it is defined with conditional and unconditional distribution functions and is able to address linear and nonlinear relationships between the given response and features. MV-SIS
is robust to heavy-tailed distributions of features and outliers because it inherits the robustness property of the conditional distribution function. The sure screening property contained within this approach can also handle multi-class response targets.
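For reference, the screening statistic estimated by the source functions shown below can be written as (notation adapted from the publication; this is a sketch of the empirical form rather than a verbatim reproduction):

$$MV(X_k \mid Y) = \sum_{r=1}^{R} P(Y = y_r)\, E_{X_k}\!\left[\left(F_k(X_k \mid Y = y_r) - F_k(X_k)\right)^2\right]$$

where $F_k(x) = P(X_k \le x)$ and $F_k(x \mid Y = y_r) = P(X_k \le x \mid Y = y_r)$. Each feature is ranked by its estimated MV value and the top-ranked features are retained.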
Let's now migrate to R
to utilize this approach. The script that was used to process the data utilizing similar approaches can be found here. In summary, the libraries that were used for preprocessing include tidyr
, plyr
, dplyr
and fastDummies
.
The preprocessing code before testing this approach can be found here. Let's convert loan_status
to binary given the same aforementioned conditions and sample 10% of the data using the caret
library. Now, we can write the column names to a .csv
file and the row level data without the column names to a .txt
file to prepare the data.
data_sampling_vector <- createDataPartition(df$loan_status, p=0.90, list=FALSE)
data_train <- df[data_sampling_vector,]
data_test <- df[-data_sampling_vector,]
a <- data_test
dim(a)
write.table(a, 'names.csv', sep=',', col.names=TRUE, row.names=FALSE)
write(t(a), 'LoanStatus_screen.txt', ncol=ncol(a))
I was provided the source code to utilize this approach for feature selection, but a package called VariableScreening is available on CRAN
We can use source("*.R") to point to the script that we want R
to use, in this case, source("MVSIS.R")
. Let's now examine what is contained within the MVSIS.R
file.
MVSIS Source Code
To compute the criteria in the simulation, multiple functions are required. Let's first start by creating a function M
to compute the minimum model size to ensure the inclusion of all active predictors.
Input:
- true.v: the true variable index.
- rank.mtx: the ranked index matrix from the screening method for the 1,000 replications; each column corresponds to the ranked index in one replication.

Output:
- M: a vector of the minimum model sizes to ensure the inclusion of all active predictors.
M <- function(true.v,rank.mtx){
r <- min(dim(rank.mtx)[2], length(rank.mtx))
M <- c()
for (j in 1:r){
M[j] <- max(match(true.v, rank.mtx[,j]))
}
return(M)
}
Another function mqtl
can be used to compute the 5%, 25%, 50%, 75% and 95% quantiles of the minimum model size out of 1,000 replications.
Input:
- M: a vector of the minimum model sizes to ensure the inclusion of all active predictors.

Output:
- The 5%, 25%, 50%, 75% and 95% quantiles of the minimum model sizes out of 1,000 replications.
mqtl <- function(M){
quantile(M, probs=c(0.05,0.25,0.5,0.75,0.95))
}
Another function Sel.rate can be used to compute the proportion that every single active predictor is selected for a given model size, which defaults to c[n/log(n)], in the 1,000 replications.

Input:
- n: The sample size.
- c: Coefficient of the cutoff, i.e. when c=2, cutoff = 2[n/log(n)].
- true.v: The true variable index.
- rank.mtx: The ranked index matrix from the screening method for the 1,000 replications; each column corresponds to the ranked index in one replication.

Output:
- rate: The proportions that every single active predictor is selected for a given model size, which defaults to c[n/log(n)], in the 1,000 replications.
Sel.rate <- function(n,c,true.v,rank.mtx){
d <- c * floor(n / log(n))
rank.mtx.sel <- rank.mtx[1:d,]
r <- min(dim(rank.mtx)[2], length(rank.mtx))
p0 <- length(true.v)
R <- matrix(0,p0,r)
rate <- c()
for (i in 1:p0){
for (j in 1:r){
R[i,j] <- (min(abs(rank.mtx.sel[,j] - true.v[i]))==0)
}
rate[i] <- mean(R[i,])
}
return(rate)
}
Now, let's define the functions that will be used to compute the MV(X,Y)
where Y is a discrete value.
Fk <- function(X0,x){
Fk = c()
for (i in 1:length(x)){
Fk[i] = sum(X0 <= x[i]) / length(X0)
}
return(Fk)
}
Fkr <- function(X0,Y,yr,x){
Fkr = c()
ind_yr = (Y==yr)
for (i in 1:length(x)){
Fkr[i] = sum((X0 <= x[i]) * ind_yr) / sum(ind_yr)
}
return(Fkr)
}
MV <- function(Xk,Y){
Fk0 <- Fk(Xk,Xk)
Yr <- unique(Y)
MVr <- c()
for (r in 1:length(Yr)){
MVr[r] <- (sum(Y==Yr[r]) / length(Y)) * mean((Fkr(Xk,Y,Yr[r],Xk) - Fk0)^2)
}
MV <- sum(MVr)
return(MV)
}
Then the row-level .txt
file can be read byrow=T
as a matrix with the number of columns defined with ncol
and loan_status
as the first feature to define the response
. The selection criteria can be defined as mu
for computing the criteria in the simulation. Quantiles can then be used with mu
to determine the variable rank. Different values of quantile(mu, A)
can be evaluated, where a higher value of A
results in fewer features while a lower value selects more variables. Let's use A=0.7 and determine how many features remain in the pruned set.
F0 <- matrix(scan('LoanStatus_screen.txt'), ncol=112, byrow=T)
response <- F0[,1]
mu <- NULL
for(i in 2:ncol(F0)){
u <- MV(F0[,i], response)
mu <- c(mu,u)
if(i%%100==0) print(i)
}
q3 <- quantile(mu, 0.7)
mu <- cbind(1:(ncol(F0)-1), mu)
name <- read.table('names.csv', sep=',')
name <- name[mu[,2] > q3]
write(t(name), 'names2.csv', sep=',', ncol=length(name))
su <- mu[mu[,2] > q3,]
dim(su)
F0 <- F0[,-1]
F1 <- F0[,su[,1]]
B <- cbind(response,F1)
V <- B
B[15:41,] <- V[19:45,]
B[42:45,] <- V[15:18,]
write(t(B), ncol=ncol(B), file='MVSIS_0.7.txt')
ncol(B) - 1
28 features were selected using this threshold. Now we can evaluate the performance of a model given the input features from MV-SIS
.
a1 <- read.table('MVSIS_0.7.txt')
a1 <- as.matrix(a1)
dim(a1)
b <- read.table('names2.csv', sep=',')
b <- as.character.numeric_version(b)
k <- 'status'
b <- c(k,b)
write.table(a1, file='MVSIS_0.7_rf.txt', col.names=b)
Let's first load the required libraries
(randomForest
and doParallel
), and set up a cluster to run the jobs in parallel using the 8 cores available on the local machine.
library(randomForest)
library(doParallel)
cl <- makePSOCKcluster(8)
registerDoParallel(cl)
getDoParWorkers()
Then the set containing the selected features from MV-SIS
can be read, shuffled and examined.
variable <- read.table('MVSIS_0.7_rf.txt')
variable = variable[sample(1:nrow(variable)),]
colnames(variable)[colnames(variable) == 'status'] = 'loan_status'
str(variable)
We can now partition the data into training and test sets using 90% for training and 10% for testing to run a Random Forest
model with the default parameters to evaluate the predictive performance of using this variable selection approach.
k1 <- 194612
k2 <- 21624
# keep the train/test partitions as data frames for the confusion matrices below
train <- variable[1:k1,]
test <- variable[(k1+1):(k1+k2),]
rf <- randomForest(factor(loan_status) ~ .,
data=variable[1:k1,],
ntree=100,
importance=T,
keep.forest=T)
rf
stopCluster(cl)
Then, we can use the fitted model to predict and find the accuracy of the train and test sets.
m <- rf$predicted
n <- predict(rf, variable[(k1+1):(k1+k2),])
k <- 0
for (i in 1:k1){
if (variable[i,1]==m[i]){
k=k+1
}
}
Accuracy.fit <- k / k1
cat(sprintf('Train accuracy = %s\n', Accuracy.fit))
k <- 0
for (i in 1:k2){
if (variable[(k1+i),1]==n[i]){
k=k+1
}
}
Accuracy.test <- k / k2
cat(sprintf('Test accuracy = %s\n', Accuracy.test))
Using the default parameters, the accuracy is lower than optimal, but the number of features has been reduced substantially. Let's now evaluate the sensitivity and specificity.
a <- 0
for (i in 1:k2){
if (variable[(k1+i),1]==1 & n[i]==1){
a = a+1
}
}
b <- 0
for (i in 1:k2){
if (variable[(k1+i),1]==0 & n[i]==1){
b = b+1
}
}
c <- 0
for (i in 1:k2){
if (variable[(k1+i),1]==1 & n[i]==0){
c = c+1
}
}
d <- 0
for (i in 1:k2){
if (variable[(k1+i),1]==0 & n[i]==0){
d = d+1
}
}
sensitivity = d / (b + d)
cat(sprintf('Sensitivity = %s\n', sensitivity))
specificity = a / (a + c)
cat(sprintf('Specificity = %s\n', specificity))
The sensitivity (the proportion of current loans identified correctly) is great, but the specificity (the proportion of defaults identified correctly) is quite low, so we should keep this in mind when comparing the various variable selection methods evaluated.
Next, let's use the model to predict and generate a confusion matrix for the training and the test set.
train$loan_status <- as.factor(train$loan_status)
pred = predict(rf, newdata=train[-1])
pred1 <- as.factor(pred)
confusionMatrix(data=pred1, reference=train$loan_status)
test$loan_status <- as.factor(test$loan_status)
pred = predict(rf, newdata=test[-1])
pred1 <- as.factor(pred)
confusionMatrix(data=pred1, reference=test$loan_status)
Boruta
Boruta
is a wrapper around the Random Forest
algorithm which contains an improvement in regards to the feature importance measure since it considers all of the features that are relevant to the target variable. It can handle interactions between features as well while maintaining the varying degree of randomness that might exist within a set.
The reference manual states:
Boruta iteratively compares importances of attributes with importances of shadow attributes, created by shuffling original ones. Attributes that have significantly worst importance than shadow ones are being consecutively dropped. On the other hand, attributes that are significantly better than shadows are admitted to be Confirmed. Shadows are re-created in each iteration. Algorithm stops when only Confirmed attributes are left, or when it reaches maxRuns importance source runs. If the second scenario occurs, some attributes may be left without a decision. They are claimed Tentative. You may try to extend maxRuns or lower pValue to clarify them, but in some cases their importances do fluctuate too much for Boruta to converge. Instead, you can use TentativeRoughFix function, which will perform other, weaker test to make a final decision, or simply treat them as undecided in further analysis.
The input set for Boruta
cannot contain missing observations, so either the complete cases for a set must be determined or a suitable imputation method needs to be applied beforehand, given the reason(s) for the missing data (missing completely at random (MCAR) or missing at random (MAR)). The currently available versions are slow in performing feature selection compared to other traditional feature selection algorithms, so runtime should always be considered. In addition, after determining the important features, collinearity needs to be addressed, which is another limitation of using this approach.
Let's first set the working directory to FeatureSelection
where the results can be saved. Then load the Boruta
library and convert loan_status
to a factor since this is the target variable. Due to the lengthy computation time and required CPU resources, we will use the test set and doTrace=0
for the verbosity level.
library(Boruta)
data_test$loan_status <- as.factor(data_test$loan_status)
boruta.df <- Boruta(loan_status~., data=data_test, doTrace=0)
Results from Boruta
We can now examine what was saved in the boruta.df
list to inspect the results from utilizing this approach.
print(boruta.df)
After more than 81 hours, 79 out of the 111 features were selected. This is a large number of variables though. This package offers a function called TentativeRoughFix
that "claims as Confirmed those attributes that have median importance higher than the median importance of maximal shadow attribute, and the rest as Rejected". So, let's now explore the tentative variables.
final.boruta <- TentativeRoughFix(boruta.df)
print(final.boruta)
2 more features were confirmed as important after using the available TentativeRoughFix, while 5 more were confirmed as unimportant. We can then use the variable importance chart from the native package to display the important features as green, unimportant as red, and yellow indicating the ones that are considered tentative. Let's first change the plot width and height of the plot window by setting the options with the repr library. Then we can plot the results from Boruta after using the TentativeRoughFix. lapply can be applied as a function to the generated ImpHistory, where the names of the columns are defined, and sapply can then be used with the output from the lapply call, where the feature importance of the features is sorted by the median and stored as Labels. This generates a plot where the feature importance is arranged in ascending order from left to right.
library(repr)
options(repr.plot.width=30, repr.plot.height=30)
par(mar = c(20,5,1,1.5))
plot(final.boruta, xlab='', xaxt='n')
lz<-lapply(1:ncol(final.boruta$ImpHistory),function(i)
final.boruta$ImpHistory[is.finite(final.boruta$ImpHistory[,i]),i])
names(lz) <- colnames(final.boruta$ImpHistory)
Labels <- sort(sapply(lz,median))
axis(side=1,las=2, labels=names(Labels),
at=1:ncol(final.boruta$ImpHistory), cex.axis=1.4)
Additional statistical measures can be generated using attStats
, which contains the defined meanImp
, medianImp
and minImp
. Let's now filter out features which are designated as Confirmed
on this subset using boruta_stats$decision == "Confirmed"
.
boruta_stats <- attStats(final.boruta)
boruta_confirmed = subset(boruta_stats,
subset=boruta_stats$decision == 'Confirmed')
boruta_confirmed
This can also be completed with the features designated as Rejected
on another subset using boruta_stats$decision == "Rejected"
.
boruta_rejected = subset(boruta_stats,
subset=boruta_stats$decision == 'Rejected')
boruta_rejected
Utilizing Boruta
resulted in 81 features compared to 27 from MV-SIS
, suggesting the methods contained within Boruta
probably select the features that might generate the best model metrics, but result in high dimensionality and subsequently higher computational costs in downstream modeling tasks.
Results from Variable Selection Methods
Various approaches can be utilized for feature selection/importance as demonstrated here using both Python
and R
. Therefore, let's now compare the results in regards to which variables were selected from the test methods using set differences. To begin, we can create dummy variables using pd.get_dummies
where the initial features are removed. Then create a separate dataframe containing the features from the methods that did not generate too few (Group Lasso
) or too many features (Boruta
). We can denote them as X_xgb
for the ones from SelectFromModel using XGBoost
, X_vif
containing the remaining after a threshold=5
for VIF
and X_mfs
containing the variables from MV-SIS
in R
.
df = df.drop('loan_status', axis=1)
df = pd.get_dummies(df, drop_first=True)
X_xgb = df[['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'int_rate',
'installment', 'annual_inc', 'inq_last_6mths', 'pub_rec',
'revol_bal', 'out_prncp', 'out_prncp_inv', 'total_pymnt',
'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
'total_rec_late_fee', 'recoveries', 'last_pymnt_amnt',
'acc_open_past_24mths', 'bc_open_to_buy',
'chargeoff_within_12_mths', 'delinq_amnt', 'mths_since_recent_bc',
'num_rev_tl_bal_gt_0', 'tot_hi_cred_lim', 'total_bc_limit',
'term_ 60 months', 'grade_B', 'grade_C', 'grade_D', 'grade_E',
'home_ownership_MORTGAGE', 'home_ownership_RENT',
'verification_status_Source Verified', 'verification_status_Verified',
'pymnt_plan_y', 'purpose_major_purchase', 'purpose_moving',
'purpose_small_business', 'purpose_vacation', 'initial_list_status_w',
'application_type_Joint App', 'hardship_flag_Y',
            'disbursement_method_DirectPay', 'debt_settlement_flag_Y']]
X_xgb = X_xgb.drop_duplicates()
print('- Dimensions using variables selected from XGB:', X_xgb.shape)
X_vif = df[['annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'revol_bal',
'out_prncp', 'total_rec_int', 'total_rec_late_fee',
'collection_recovery_fee', 'last_pymnt_amnt',
'collections_12_mths_ex_med', 'acc_now_delinq', 'tot_coll_amt',
'avg_cur_bal', 'bc_open_to_buy', 'chargeoff_within_12_mths',
'delinq_amnt', 'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op',
'mo_sin_rcnt_tl', 'mort_acc', 'mths_since_recent_bc',
'num_accts_ever_120_pd', 'num_il_tl', 'num_tl_30dpd',
'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'percent_bc_gt_75',
'pub_rec_bankruptcies', 'tax_liens', 'total_il_high_credit_limit']]
X_vif = X_vif.drop_duplicates()
print('- Dimensions using variables selected from VIF:', X_vif.shape)
X_mfs = df[['num_bc_tl', 'num_il_tl', 'num_op_rev_tl', 'pymnt_plan_y',
'num_accts_ever_120_pd', 'mo_sin_old_rev_tl_op',
'last_pymnt_amnt', 'percent_bc_gt_75', 'revol_util',
'tot_hi_cred_lim', 'num_actv_rev_tl', 'disbursement_method_DirectPay',
'tot_coll_amt', 'term_ 60 months', 'mort_acc', 'funded_amnt_inv',
'int_rate', 'inq_last_6mths', 'delinq_2yrs', 'installment',
'collections_12_mths_ex_med', 'open_acc', 'loan_amnt',
'funded_amnt', 'annual_inc', 'num_tl_op_past_12m',
'home_ownership_OTHER', 'total_bc_limit']]
X_mfs = X_mfs.drop_duplicates()
print('- Dimensions using variables selected from MVSIS:', X_mfs.shape)
Each of these can be considered as a set
with a unique identifier specifying the features from the various feature selection approaches examined. Let's start with the features that were selected from SelectFromModel using XGBoost
in a set denoted as s
. Subsequently, the others can be compared to this set to determine which features do not exist in s
but exist in the other set. This results in a list of the features where the number can be quantified using the length of the list. Other combinations of the initial set with the various permutations can be utilized to compare the features from the methodologies examined.
s = set(X_xgb)
varDiff_vif = [x for x in X_vif if x not in s]
print('\nFeatures using VIF but not in XGB:')
print(varDiff_vif)
print('- Number of different features: ' + str(len(varDiff_vif)))
s1 = set(X_vif)
varDiff_xgb = [x for x in X_xgb if x not in s1]
print('\nFeatures in XGB but not in VIF:')
print(varDiff_xgb)
print('- Number of different features: ' + str(len(varDiff_xgb)))
varDiff_mvsisAll = [x for x in X_mfs if x not in s]
print('\nFeatures in MVSIS but not in XGB:')
print(varDiff_mvsisAll)
print('- Number of different features: ' + str(len(varDiff_mvsisAll)))
varDiff_mvsisVIF = [x for x in X_mfs if x not in s1]
print('\nFeatures in MVSIS but not in VIF:')
print(varDiff_mvsisVIF)
print('- Number of different features: ' + str(len(varDiff_mvsisVIF)))
s1 = set(X_mfs)
varDiff_mvsisAll1 = [x for x in X_xgb if x not in s1]
print('\nFeatures in XGB but not in MV-SIS:')
print(varDiff_mvsisAll1)
print('- Number of different features: ' + str(len(varDiff_mvsisAll1)))
varDiff_mvsisVIF1 = [x for x in X_vif if x not in s1]
print('\nFeatures in VIF but not in MV-SIS:')
print(varDiff_mvsisVIF1)
print('- Number of different features: ' + str(len(varDiff_mvsisVIF1)))
From the comparisons made, the number of features selected with MV-SIS
that are not present in the set using SelectFromModel using XGBoost
, which contained 45 features, can be added to a temporary dataframe that is concatenated by column with X_xgb
and loan_status
for further examination.
df_tmp = df[['num_bc_tl', 'num_il_tl', 'num_op_rev_tl', 'num_accts_ever_120_pd',
'mo_sin_old_rev_tl_op', 'percent_bc_gt_75', 'revol_util',
'num_actv_rev_tl', 'tot_coll_amt', 'mort_acc', 'delinq_2yrs',
'collections_12_mths_ex_med', 'open_acc',
'num_tl_op_past_12m', 'home_ownership_OTHER']]
df = pd.concat([X_xgb, df_tmp, y], axis=1)
df = df.drop_duplicates()
print('- Dimensions of data using for further EDA:', df.shape)
del X_xgb, X_vif, X_mfs
del df_tmp, y, s, s1, varDiff_vif, varDiff_xgb, varDiff_mvsisAll
del varDiff_mvsisVIF, varDiff_mvsisAll1, varDiff_mvsisVIF1
Exploratory Data Analysis (EDA)
Quantitative Variables
We can now select the float64
and int64
quantitative features as a subset of the data. To evaluate if correlated features are present and to what extent, we can define a function to find the non-repetitive correlations by only considering the pairs of features on the diagonal and the lower part of the triangle of the correlation matrix. Then another function can be defined to find the features with the highest absolute rank correlation, which reduces the granularity from the presence of both positive and negative correlation coefficients. The second function can then call the first function to remove the repetitive pairs of features.
df_num = df.select_dtypes(include = ['float64', 'int64'])
df_num = df_num.drop(['loan_status'], axis=1)
def find_repetitive_pairs(df):
    """Returns the pairs of features on the diagonal and lower triangle of the correlation matrix."""
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i + 1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop
def find_top_correlations(df, n):
"""Returns the highest correlations without duplicates."""
au_corr = df.corr(method='spearman').abs().unstack()
labels_to_drop = find_repetitive_pairs(df)
au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
return au_corr[0:n]
Let's first examine the number of features present in this subset and then apply the find_top_correlations
function to determine the 20 features with the highest correlations. Then we can create a correlation heatmap using sns.heatmap
which only contains the features with a correlation coefficient >= 0.7
or <= -0.7
, signifying a strong and very strong relationship.
import seaborn as sns
print('- The selected dataframe has ' + str(df_num.shape[1]) + ' columns that are quantitative variables.')
print('- The 20 features with the highest correlations:')
print(find_top_correlations(df_num, 20))
corr = df_num.corr(method='spearman')
fig = plt.figure(figsize=(21, 18))
plt.rcParams.update({'font.size': 13})
ax = sns.heatmap(corr[(corr >= 0.7) | (corr <= -0.7)],
cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
linecolor='black', annot=True, annot_kws={'size': 9},
square=True)
plt.title('Correlation Matrix with Spearman rho')
plt.show()
Now we can examine the distributions of these features using different visualizations. Let's first start by plotting histograms using histplot
from seaborn
where a for
loop iterates through each of the selected features.
df_num = df.select_dtypes(include = ['float64', 'int64'])
df_num = df_num.drop(['loan_status'], axis=1)
plt.rcParams.update({'font.size': 16})
fig, ax = plt.subplots(15,3, figsize=(21,35))
fig.suptitle('Quantitative Features: Histograms', y=1.01, fontsize=30)
for variable, subplot in zip(df_num, ax.flatten()):
a = sns.histplot(df_num[variable], ax=subplot)
a.set_yticklabels(a.get_yticks(), size=10)
a.set_xticklabels(a.get_xticks(), size=9)
fig.tight_layout()
plt.show();
Then we can increase the granularity of this visualization by adding hue=df.loan_status
to map the dependent variable as a binary variable as well as kde=True
to smooth the distribution using a kernel density estimate.
plt.rcParams.update({'font.size': 14})
fig, ax = plt.subplots(15,3, figsize=(25,35))
fig.suptitle('Quantitative Features: Histograms Grouped by Loan Status', y=1.01,
fontsize=30)
for variable, subplot in zip(df_num, ax.flatten()):
a = sns.histplot(x=df_num[variable], data=df_num, hue=df.loan_status,
kde=True, ax=subplot)
a.set_yticklabels(a.get_yticks(), size=10)
a.set_xticklabels(a.get_xticks(), size=10)
fig.tight_layout()
plt.show();
From the previous figure, some features contain a larger range of values, so let's extract these as a subset of the quantitative features to inspect at a more granular level.
df_num1 = df_num[['loan_amnt', 'funded_amnt', 'funded_amnt_inv',
'int_rate', 'installment', 'total_pymnt', 'total_pymnt_inv',
'total_rec_prncp', 'total_rec_int', 'mo_sin_old_rev_tl_op',
'percent_bc_gt_75', 'revol_util']]
plt.rcParams.update({'font.size': 16})
fig, ax = plt.subplots(4,3, figsize=(21,30))
fig.suptitle('Subset of Quantitative Features: Histograms Grouped by Loan Status',
             y=1.01, fontsize=30)
for variable, subplot in zip(df_num1, ax.flatten()):
    a = sns.histplot(x=df_num1[variable], data=df_num1, hue=df.loan_status,
                     kde=True, ax=subplot)
    a.set_yticklabels(a.get_yticks(), size=10)
    a.set_xticklabels(a.get_xticks(), size=9)
fig.tight_layout()
plt.show();
Box-and-whisker plots are another way to visualize both quantitative and qualitative features. Let's use seaborn.boxplot
where the target loan_status
resides on the x-axis and the quantitative feature on the y-axis. This approach summarizes the distribution of the feature using five key numbers: the minimum, 1st quartile (25th percentile), 2nd quartile (median), 3rd quartile (75th percentile) and the maximum of the data. The interquartile range (IQR) is the difference between the 75th and 25th percentiles (Q3 - Q1). The whiskers extend to the most extreme observations within 1.5 x IQR of the quartiles.
Boxplots can be useful to visualize if outlier values are present within the data. If values are greater than Q3 + 1.5 x IQR
or less than Q1 - 1.5 x IQR
, then they can be considered as outliers.
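As a small sketch of that rule applied to a single column (loan_amnt is only an illustrative choice):

q1, q3 = df_num['loan_amnt'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
# observations outside the fences are flagged as potential outliers
outliers = df_num[(df_num['loan_amnt'] < lower) | (df_num['loan_amnt'] > upper)]
print('Fences: [%.0f, %.0f], potential outliers: %d' % (lower, upper, len(outliers)))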
plt.rcParams.update({'font.size': 15})
fig, ax = plt.subplots(9,5, figsize=(21,30))
fig.suptitle('Quantitative Features: Boxplots', y=1.01, fontsize=30)
for var, subplot in zip(df_num, ax.flatten()):
a = sns.boxplot(x=df.loan_status, y=df_num[var], data=df_num, ax=subplot)
a.set_yticklabels(a.get_yticks(), size=12)
a.set_xticklabels(a.get_xticks(), size=12)
fig.tight_layout()
plt.show();
When considering the quantitative features stratified by loan_status
, differences are observable, especially for recoveries
, last_pymnt_amnt
and percent_bc_gt_75
. These potentially will contribute to explaining models once they are fit.
Now, we can examine the qualitative features in the set including the uint8
ones using a countplot
to visualize the number of observations in each categorical bin.
df_cat = df.select_dtypes(include = 'uint8')
print('The selected dataframe has ' + str(df_cat.shape[1])
+ ' columns that are qualitative variables.')
print('\n')
fig, ax = plt.subplots(5, 4, figsize=(21,21))
fig.suptitle('Qualitative Features: Count Plots', y=1.01, fontsize=30)
for variable, subplot in zip(df_cat, ax.flatten()):
a = sns.countplot(df_cat[variable], ax=subplot)
a.set_yticklabels(a.get_yticks(), size=10)
fig.tight_layout()
plt.show();
Automated EDA with Sweetviz
Overview: Sweetviz is an open-source Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with just two lines of code. The output is a fully self-contained HTML application. We can install it using pip install sweetviz. There are many options to profile the data, but let's utilize the default settings to examine the data quality (data types, missing values, duplicate rows, unique values) and summary statistics (min/max/range, quartiles, mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness).
import sweetviz as sv
sweet_report = sv.analyze(df)
sweet_report.show_html('Loan_Status_AutomatedEDA.html')
sweet_report.show_notebook(layout='widescreen', w=1500, h=1000, scale=0.8)
Automated EDA using Pandas Profiling
We can install it using pip install ydata-profiling.

Key features:
- Type inference: automatic detection of columns' data types (Categorical, Numerical, Date, etc.)
- Warnings: a summary of the problems/challenges in the data that you might need to work on (missing data, inaccuracies, skewness, etc.)
- Univariate analysis: including descriptive statistics (mean, median, mode, etc.) and informative visualizations such as distribution histograms
- Multivariate analysis: including correlations, a detailed analysis of missing data, duplicate rows, and visual support for pairwise variable interactions

It also includes the ability to examine Time-Series, Text analysis, File and Image analysis, and Compare datasets.
from ydata_profiling import ProfileReport
profile = ProfileReport(df, title='Loan Status_EDA')
profile.to_notebook_iframe()
The output is on the Github repository due to size constraints here.
Now we can remove the features with high correlation coefficients and the categorical variables containing imbalances, and examine the final set using the data_quality_table
function.
df = df.drop(['out_prncp_inv', 'funded_amnt', 'funded_amnt_inv',
'total_pymnt_inv', 'open_acc', 'tot_cur_bal', 'total_rec_prncp',
'num_op_rev_tl', 'home_ownership_OTHER', 'hardship_flag_Y',
'pymnt_plan_y', 'purpose_house', 'purpose_medical',
'debt_settlement_flag_Y', 'purpose_small_business'], axis=1)
df = df.drop_duplicates()
print('\n Data Quality: Final Set')
display(data_quality_table(df))
Class Imbalance Methods
Resampling the data is one of the most commonly accepted approaches for handling an imbalanced dataset. There are broadly two types of methods: oversampling and undersampling. Oversampling is generally preferred over undersampling because undersampling removes observations from the set, and the small percentage of instances that potentially contain important information can be lost in the process.
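To make the two families concrete before applying them to this data, here is a toy sketch of each (made-up values only; the actual upsampling and SMOTE steps used in this project follow below):

import pandas as pd
from sklearn.utils import resample

toy = pd.DataFrame({'x': range(8), 'loan_status': [0, 0, 0, 0, 0, 0, 1, 1]})
majority = toy[toy.loan_status == 0]
minority = toy[toy.loan_status == 1]

# undersampling: shrink the majority class down to the size of the minority class
undersampled = pd.concat([resample(majority, replace=False,
                                   n_samples=len(minority), random_state=42),
                          minority])
print(undersampled.loan_status.value_counts())

# oversampling: grow the minority class with replacement up to the majority size
oversampled = pd.concat([majority,
                         resample(minority, replace=True,
                                  n_samples=len(majority), random_state=42)])
print(oversampled.loan_status.value_counts())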
Resampling Techniques
Let's first import the packages and define the conditional options to set up the environment with a random
and numpy
seed for reproducibility. Then the data can be read into a pandas.Dataframe
and duplicates removed, if present.
import os
import random
import numpy as np
import warnings
import pandas as pd
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
seed_value = 42
os.environ['LoanStatus_Linear'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
df = pd.read_csv('LendingTree_LoanStatus_final.csv', low_memory=False)
df = df.drop_duplicates()
Upsample Minority Class
Let's first separate the input features and the target into X
and y
. Then train_test_split
from sklearn.model_selection
can be utilized to set up the training/test sets using 80% of the data for the training set and 20% for the test set. Then the features and target from the training data can be concatenated by column, followed by splitting this set into the majority class, where loan_status==0
as current
, and the minority class, where loan_status==1
as default
.
Next, resample
from sklearn.utils
can be leveraged to oversample the minority class with replacement specifying the number of samples (the number of observations contained within the current
set) and the defined random_state=seed_value
. Then, the majority and upsampled minority sets can be concatenated together and the counts of each of the classes in the balanced set examined.
Then the input features and target of the Upsampled
training set can be defined as X_train
and y_train
to match the format of the test set, with the train/test targets defined as a pandas.DataFrame so the full train/test sets can be concatenated and saved in .csv files for later use.
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
X = df.drop('loan_status', axis=1)
y = df['loan_status']
del df
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
random_state=seed_value)
df1 = pd.concat([X_train, y_train], axis=1)
current = df1[df1.loan_status==0]
default = df1[df1.loan_status==1]
del df1
default_upsampled = resample(default,
replace=True,
n_samples=len(current),
random_state=seed_value)
upsampled = pd.concat([current, default_upsampled])
del default_upsampled, current, default
print('\nExamine Loan Status after oversampling minority class')
print(upsampled.loan_status.value_counts())
X_train = upsampled.drop('loan_status', axis=1)
y_train = upsampled.loan_status
del upsampled
cols = ['loan_status']
y_train = pd.DataFrame(data=y_train, columns=cols)
y_test = pd.DataFrame(data=y_test, columns=cols)
train_US = pd.concat([X_train, y_train], axis=1)
train_US.to_csv('trainDF_US.csv', index=False)
test_US = pd.concat([X_test, y_test], axis=1)
test_US.to_csv('testDF_US.csv', index=False)
del train_US, test_US
Synthetic Minority Oversampling Technique (SMOTE)
SMOTE (Synthetic Minority Oversampling Technique)
generates new synthetic samples for the minority class, which is a type of data augmentation. New synthetic data is formed in the feature space close to existing minority observations: for a random example within the minority class, a specified k of its nearest minority neighbors are determined, one is randomly selected, and a synthetic example is generated at a random point between the initial example and the selected neighbor. Because the synthetic points lie between nearby minority observations, this tends to produce more realistic examples of the minority class.

Since this approach relies on the minority class and not the majority class, potential negative effects include shrinking the decision space as well as increasing the amount of noise in the feature space, resulting in more difficulty classifying groups, whether binary or multi-level outcomes.
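In other words, for a minority example x_i and one of its k nearest minority neighbors x_nn, the synthetic point is x_new = x_i + lambda * (x_nn - x_i), where lambda is drawn uniformly from [0, 1].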
Let's first set up the training/testing sets using the same parameters that were utilized for the Upsampled
set. The SMOTE
class from the Imbalanced-learn
library can then be imported from imblearn.over_sampling
specifying the sampling_strategy='minority'
and random_state=seed_value
as parameters to fit_sample
on the training set. The default specified k nearest neighbor is k_neighbors=5
. Then the counts of both classes can be examined to compare with the Upsampled set, and similar methods used to save the train/test sets for later use.
from imblearn.over_sampling import SMOTE
X1_train, X1_test, y1_train, y1_test = train_test_split(X, y, test_size=0.20,
random_state=seed_value)
smote = SMOTE(sampling_strategy='minority', random_state=seed_value)
X1_train, y1_train = smote.fit_sample(X1_train, y1_train)
print('\nExamine Loan Status after upsampling with SMOTE')
print(y1_train.value_counts())
cols = ['loan_status']
y1_train = pd.DataFrame(data=y1_train, columns=cols)
y1_test = pd.DataFrame(data=y1_test, columns=cols)
train_SMOTE = pd.concat([X1_train, y1_train], axis=1)
train_SMOTE.to_csv('trainDF_SMOTE.csv', index=False)
test_SMOTE = pd.concat([X1_test, y1_test], axis=1)
test_SMOTE.to_csv('testDF_SMOTE.csv', index=False)
del train_SMOTE, test_SMOTE
Classification: Linear
Before machine learning models are optimized for the associated parameters of different algorithms, baseline models are important to establish which parameters should be more closely targeted during the tuning process. The notebook containing the Linear
baseline models and hyperparameter tuning using GridSearchCV
is located here.
Upsampling: Lasso Baseline Model
Baseline Default Parameters
- penalty: Specifies the norm of the penalty. default='l2'.
- tol: Tolerance for stopping criteria. default=1e-4.
- C: Inverse of regularization strength. default=1.0.
- fit_intercept: Specifies if a bias or intercept should be added to the decision function. default=True.
- intercept_scaling: default=1.
- class_weight: Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. default=None.
- random_state: RandomState instance. Used when solver == 'sag', 'saga' or 'liblinear' to shuffle the data. default=None.
- solver: Algorithm to use in the optimization problem. default='lbfgs'.
- max_iter: Maximum number of iterations taken for the solvers to converge. default=100.
- verbose: For the liblinear and lbfgs solvers, set verbose to any positive number for verbosity. default=0.
- warm_start: When set to True, reuse the solution of the previous call to fit as initialization; otherwise, just erase the previous solution. Useless for the liblinear solver. default=False.
- n_jobs: Number of CPU cores used when parallelizing over classes if multi_class='ovr'. default=None.
- l1_ratio: The Elastic-Net mixing parameter. Only used if penalty='elasticnet'. Setting l1_ratio=0 is equivalent to using penalty='l2', while setting l1_ratio=1 is equivalent to using penalty='l1'. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2. default=None.
Let's first set the path where the results from training a model will be stored, in particular where the model .pkl is saved so it can be reloaded for comparisons with the baseline model or if more training is required to find even better hyperparameters than the first iteration of the tuning process. Then we can scale the data between zero and one using the MinMaxScaler, which consists of fit_transform on the training set and applying transform to the test set. For the Upsampled Lasso baseline model, we can fit the model using the default parameters from the LogisticRegression model in scikit-learn and save the model. Then we predict with the trained model on the scaled test set to evaluate the model performance/metrics using various sklearn.metrics functions with the input format metric(y_true=y_test, y_pred=model.predict(X_test)), where metric is the chosen scorer and model was fit with the training set. This can be applied to both the training and the test sets, independently, to evaluate whether overfitting is occurring as well.
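For example, a small sketch of that overfitting check, assuming a fitted estimator named model and the scaled train/test splits defined below:
from sklearn.metrics import accuracy_score
# A large gap between train and test accuracy would indicate overfitting
train_acc = accuracy_score(y_true=y_train, y_pred=model.predict(X_trainS))
test_acc = accuracy_score(y_true=y_test, y_pred=model.predict(X_testS))
print('Train accuracy: %.3f  Test accuracy: %.3f' % (train_acc, test_acc))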
The classification_report outputs granular information in table format, with rows for loan_status==0, loan_status==1, accuracy, macro avg and weighted avg, and columns for precision, recall, f1-score and support.
The confusion_matrix contains even more granular information with the counts the model evaluated. Using scikit-learn's convention, where rows correspond to the actual classes and columns to the predicted classes, the layout is:
- True Negatives (top left): the model correctly predicts the negative class (loan_status==0)
- False Positives (top right): the model incorrectly predicts the positive class
- False Negatives (lower left): the model incorrectly predicts the negative class
- True Positives (lower right): the model correctly predicts the positive class (loan_status==1)
This information is needed to calculate the following metrics:
- Accuracy: (True Positives + True Negatives) / Total Count of Predictions
- Precision: True Positives / (True Positives + False Positives)
- Recall: True Positives / (True Positives + False Negatives)
- F1-Score: 2 * (Precision * Recall) / (Precision + Recall)
These basic metrics
for classification problems can be imported from sklearn.metrics
, so we do not need to manually calculate the accuracy_score
, precision_score
, recall_score
and f1_score
.
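As a quick sanity check of the layout and formulas above, here is a toy example (not part of the modeling pipeline):
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]] -> [[1, 1], [0, 2]]
print(confusion_matrix(y_true, y_pred))
print('Accuracy : %.3f' % accuracy_score(y_true, y_pred))    # (2 + 1) / 4 = 0.750
print('Precision: %.3f' % precision_score(y_true, y_pred))   # 2 / (2 + 1) = 0.667
print('Recall   : %.3f' % recall_score(y_true, y_pred))      # 2 / (2 + 0) = 1.000
print('F1       : %.3f' % f1_score(y_true, y_pred))          # 2 * (0.667 * 1.0) / 1.667 = 0.800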
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from joblib import parallel_backend
import pickle
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score
from sklearn.metrics import recall_score, f1_score
scaler = MinMaxScaler()
X_trainS = scaler.fit_transform(X_train)
X_testS = scaler.transform(X_test)
lasso = LogisticRegression(penalty='l1', solver='saga',
random_state=seed_value)
with parallel_backend('threading', n_jobs=-1):
lasso.fit(X_trainS, y_train)
Pkl_Filename = 'Lasso_US_baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(lasso, file)
# =============================================================================
# To load saved model
#lasso_US = joblib.load('Lasso_US_baseline.pkl')
#print(lasso_US)
# =============================================================================
y_pred_US = lasso.predict(X_testS)
print('Results from Lasso baseline model on Upsampled data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test, y_pred_US)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred_US))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test, y_pred_US))
print('Precision score : %.3f' % precision_score(y_test, y_pred_US))
print('Recall score : %.3f' % recall_score(y_test, y_pred_US))
print('F1 score : %.3f' % f1_score(y_test, y_pred_US))
Upsampling: Elastic Net Baseline Model
¶
Now we can set the baseline using the parameters needed for an elastic net
model using an l1_ratio=0.5
with the saga
solver.
elnet = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5)
with parallel_backend('threading', n_jobs=-1):
elnet.fit(X_trainS, y_train)
Pkl_Filename = 'Elnet_US_baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(elnet, file)
y_pred_US = elnet.predict(X_testS)
print('Results from Elastic Net baseline model on Upsampled data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test, y_pred_US)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred_US))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test, y_pred_US))
print('Precision score : %.3f' % precision_score(y_test, y_pred_US))
print('Recall score : %.3f' % recall_score(y_test, y_pred_US))
print('F1 score : %.3f' % f1_score(y_test, y_pred_US))
When comparing the model metrics to the Lasso
baseline model, they are quite comparable, with a slightly higher precision and slightly lower recall and F1 score. They are all above 90%, which suggests the baseline models perform well.
SMOTE: Lasso Baseline Model
¶
Using the same processing that was utilized for the Upsampled
data, we can fit the baseline linear
models for the Lasso
and Elastic Net
using the SMOTE
data.
scaler = MinMaxScaler()
X1_trainS = scaler.fit_transform(X1_train)
X1_testS = scaler.transform(X1_test)
with parallel_backend('threading', n_jobs=-1):
lasso.fit(X1_trainS, y1_train)
Pkl_Filename = 'Lasso_SMOTE_baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(lasso, file)
y_pred_SMOTE = lasso.predict(X1_testS)
print('Results from Lasso baseline model on SMOTE data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y1_test, y_pred_SMOTE)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y1_test, y_pred_SMOTE))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y1_test, y_pred_SMOTE))
print('Precision score : %.3f' % precision_score(y1_test, y_pred_SMOTE))
print('Recall score : %.3f' % recall_score(y1_test, y_pred_SMOTE))
print('F1 score : %.3f' % f1_score(y1_test, y_pred_SMOTE))
Compared to the model metrics of the baseline linear models using the Upsampled set, the model metrics are very similar, likely due to the L1 regularization in the Lasso shrinking some of the model coefficients to zero.
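A quick way to inspect this is to count how many coefficients the fitted Lasso model actually shrunk to zero; a small sketch, assuming the lasso estimator fit earlier:
import numpy as np
n_zero = int(np.sum(lasso.coef_ == 0))
print('%d of %d coefficients were shrunk to zero by the L1 penalty' % (n_zero, lasso.coef_.size))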
SMOTE: Elastic Net Baseline Model
¶
with parallel_backend('threading', n_jobs=-1):
elnet.fit(X1_trainS, y1_train)
Pkl_Filename = 'Elnet_SMOTE_baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(elnet, file)
y_pred_SMOTE = elnet.predict(X1_testS)
print('Results from Elastic Net baseline model on SMOTE data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y1_test, y_pred_SMOTE)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y1_test, y_pred_SMOTE))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y1_test, y_pred_SMOTE))
print('Precision score : %.3f' % precision_score(y1_test, y_pred_SMOTE))
print('Recall score : %.3f' % recall_score(y1_test, y_pred_SMOTE))
print('F1 score : %.3f' % f1_score(y1_test, y_pred_SMOTE))
Compared to the model metrics of the baseline linear models using the Upsampled set, the accuracy and the precision score are comparable, but the recall score is 2% lower and the F1 score is slightly lower, suggesting these metrics warrant further inspection.
When first performing GridSearchCV using both the Upsampled and the SMOTE sets, the same parameters were selected. Expanding the range of the parameters used in the search, such as increasing/decreasing C and the number of iterations (max_iter), as well as changing the scoring metric evaluated during the search, might provide better model performance.
Grid Search using Pipelines: Weighted Recall
¶
We can design a sklearn pipeline by first using a list to select the numerical features to be scaled with the MinMaxScaler() in the defined numeric_transformer. This pipeline method yields the potential to transform various data types in different data contexts. Since this is a classification approach where the important categorical features have already been reduced down to their most important groups during the variable selection process, the numeric_transformer is all that is required for the preprocessing before testing different combinations of model parameters. To test two different modeling approaches, the numeric_transformer can be specified in two different pipelines where the model component of the pipelines contains differing parameters for the Lasso and the Elastic Net. When utilizing scikit-learn, a LogisticRegression model with penalty='l1' is similar to a Lasso, while penalty='elasticnet' is comparable to using Elastic Net for classification problems. Given the specified parameters for the two model types to be evaluated, these two pipelines can be specified in a list and ordered into a dictionary.
Grid Search: Upsampled
¶
Let's first start with the Upsampled
set using the parameters that define a model with an L1 penalty
and another that resides between the L1
and L2
regularization parameters first described in 2005 by Zou and Hastie as the Elastic Net
. With differing penalties and l1_ratio
values, but using the same solver
, the same maximum iteration number as well as the same random_state
, various other parameters within a defined grid can then be evaluated to compare how the model performance differs within the Upsampled
set as well as compared with the SMOTE
set.
from sklearn.pipeline import Pipeline
numeric_features = list(X.columns[X.dtypes != 'object'])
numeric_transformer = Pipeline(steps=[('mms', MinMaxScaler())])
pipe_lasso = Pipeline(steps=[('preprocessor', numeric_transformer),
('model', LogisticRegression(penalty='l1',
solver='saga',
max_iter=30000,
random_state=seed_value))])
pipe_elnet = Pipeline(steps=[('preprocessor', numeric_transformer),
('model', LogisticRegression(penalty='elasticnet',
solver='saga',
l1_ratio=0.5,
max_iter=30000,
random_state=seed_value))])
pipelines = [pipe_lasso, pipe_elnet]
pipe_dict = {0: 'Lasso', 1: 'Elastic Net'}
Given the specified method for preprocessing the numerical features and the two differing model architectures, we can then define a function called make_param_grids which utilizes itertools to loop through the pipeline_steps with either the Lasso or Elastic Net as well as the parameters in the grid for each defined hyperparameter space contained within the all_param_grids dictionary. The pipeline object is then initialized and the estimators are contained within param_grids_list. The GridSearchCV components can then specify which metric should be utilized for scoring, and given the initial difference in the recall score between the two sets, a weighted recall score might provide some insight. The number of folds to be used in cross validation can be denoted, and given the size of the sets and the computational power available, let's use 3-fold cross validation, cv=3, and n_jobs=-3 to use all but two of the available cores. This also suggests this runtime environment is probably not optimally configured, given that specifying joblib with n_jobs=-1 did not result in the utilization of all 8 cores and 64GB RAM for this job.
import itertools
from sklearn.model_selection import GridSearchCV
def make_param_grids(steps, param_grids):
final_params = []
for estimator_names in itertools.product(*steps.values()):
current_grid = {}
for step_name, estimator_name in zip(steps.keys(), estimator_names):
for param, value in param_grids.get(estimator_name).items():
if param == 'object':
current_grid[step_name] = [value]
else:
current_grid[step_name + '__' + param] = value
final_params.append(current_grid)
return final_params
pipeline_steps = {'classifier': ['lasso', 'elnet']}
all_param_grids = {'lasso': {'object': LogisticRegression(),
'penalty': ['l1'],
'solver': ['saga'],
'max_iter': [30000, 50000, 100000],
'C': [1, 0.5, 0],
'tol': [1e-6, 1e-5, 1e-4, 1e-3]},
'elnet': {'object': LogisticRegression(),
'penalty': ['elasticnet'],
'solver': ['saga'],
'l1_ratio': [0.0, 0.5, 1.0],
'max_iter': [30000, 50000, 100000],
'C': [1, 0.5, 0],
'tol': [1e-6, 1e-5, 1e-4, 1e-3]}
}
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
pipe = Pipeline(steps=[('classifier', LogisticRegression())])
grid = GridSearchCV(pipe, param_grid=param_grids_list,
scoring='recall_weighted',
cv=3, verbose=4, n_jobs=-3)
Now, the pipeline with the defined parameters for GridSearchCV
can be run using the Upsampled
data utilizing dask.delayed
for parallel jobs where the start/end time
can be monitored.
import time
import dask.delayed
print('Start Upsampling - Grid Search..')
search_time_start = time.time()
pipelines_ = [dask.delayed(grid).fit(X_train, y_train.values.ravel())]
fit_pipelines = dask.compute(*pipelines_, scheduler='processes')
print('Finished Upsampling - Grid Search :', time.time() - search_time_start)
Then the accuracies of the models tested during the grid search can be compared to determine which model resulted in the highest accuracy when evaluated using the test set.
for idx, val in enumerate(fit_pipelines):
print('%s pipeline test accuracy: %.3f' % (pipe_dict[idx],
val.score(X_test, y_test)))
best_acc = 0.0
best_clf = 0
best_pipe = ''
for idx, val in enumerate(fit_pipelines):
if val.score(X_test, y_test) > best_acc:
best_acc = val.score(X_test, y_test)
best_pipe = val
best_clf = idx
print('Classification with highest accuracy: %s' % pipe_dict[best_clf])
print('Lasso/Elastic Net using Upsampling - Best Estimator')
print(best_pipe.best_params_)
For the Upsampled
data, the model that utilized the Lasso
with C=1
, max_iter=100000
, penalty='l1'
, solver='saga'
and tol=1e-06
resulted in the highest accuracy. Given these parameters, we can fit the model with the best accuracy for the Upsampled
set and save the model in a pickle
file for later use.
LassoMod_US_HPO = LogisticRegression(penalty='l1',
C=1,
solver='saga',
max_iter=100000,
tol=1e-06,
n_jobs=-1,
random_state=seed_value)
print('Start fit the best hyperparameters from Upsampling grid search to the data..')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
LassoMod_US_HPO.fit(X_trainS, y_train)
print('Finished fit the best hyperparameters from Upsampling grid search to the data :',
time.time() - search_time_start)
Pkl_Filename = 'Linear_HPO_US_mmsRecall_mostParams.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(LassoMod_US_HPO, file)
Now that the model has been fit using the training set, we can predict using the scaled test set to determine the model performance/metrics.
y_pred_US_HPO = LassoMod_US_HPO.predict(X_testS)
print('Results from LASSO using Upsampling HPO:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test, y_pred_US_HPO)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred_US_HPO))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test, y_pred_US_HPO))
print('Precision score : %.3f' % precision_score(y_test, y_pred_US_HPO))
print('Recall score : %.3f' % recall_score(y_test, y_pred_US_HPO))
print('F1 score : %.3f' % f1_score(y_test, y_pred_US_HPO))
Compared to the baseline model, the model determined using GridSearchCV resulted in practically the same metrics. Given the time that was required to search through the various models in the given parameter space, the baseline model would probably be sufficient unless an even higher recall score, the proportion of actual positives classified as positive, was needed for the task at hand.
Model Metrics with ELI5
¶
Let's now utilize the PermutationImportance from eli5.sklearn, which is a method for the global interpretation of a model that outputs the amplitude of a feature's effect but not its direction. It shuffles a feature in the set, generates predictions on the shuffled set, and then calculates the decrease in the specified metric relative to the score on the unshuffled set. The more important a feature is, the greater the model error when that feature is shuffled.
class PermutationImportance(estimator, scoring=None, n_iter=5, random_state=None, cv='prefit', refit=True)
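As a rough illustration of what permutation importance computes (a sketch of the idea, not eli5's implementation), assuming a fitted estimator and the scaled test split from above:
import numpy as np
from sklearn.metrics import accuracy_score
def manual_permutation_importance(model, X, y, n_iter=5, seed=42):
    """Mean drop in accuracy after shuffling each feature column independently."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    baseline = accuracy_score(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_iter):
            X_perm = X.copy()
            # Shuffling one column breaks its relationship with the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - accuracy_score(y, model.predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances
# e.g. manual_permutation_importance(LassoMod_US_HPO, X_testS, y_test)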
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
import eli5
from eli5.sklearn import PermutationImportance
import webbrowser
from eli5.formatters import format_as_dataframe
perm_importance = PermutationImportance(LassoMod_US_HPO,
random_state=seed_value).fit(X_testS,
y_test)
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test1.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn
which returns an explanation of the estimator parameters (weights).
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
The total_pymnt feature contains the highest weight (0.466812), followed by loan_amnt (0.429280), total_rec_int (0.392205) and out_prncp (0.338332).
Now we can use ELI5 to show the prediction, which reveals the contribution of the various features in explaining whether a loan will default or stay current. This can also be stored in an HTML file and incorporated into a dashboard, whether in a Jupyter notebook or in a script with the environment and package dependencies loaded for batch or real-time inference, given the task at hand.
html_obj2 = eli5.show_prediction(LassoMod_US_HPO, X_test1.iloc[1],
show_feature_values=True)
html_obj2
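To store the rendered explanation outside the notebook, for example in the dashboard mentioned above, the underlying HTML string can be written to disk; a small sketch, assuming the html_obj2 object from the previous cell (the file name is illustrative):
# show_prediction returns an IPython HTML display object; its .data attribute
# holds the raw markup that can be saved and embedded elsewhere
with open('lasso_us_hpo_prediction.html', 'w', encoding='utf-8') as f:
    f.write(html_obj2.data)
# Optionally open it in a browser for a quick check
# webbrowser.open('lasso_us_hpo_prediction.html')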
For the loans to be Paid Off
or Current
, the features out_prncp
, total_pymnt
, annual_inc
, total_hi_credit_lim
, revol_bal
and last_pymnt_amnt
with higher values are beneficial while the features loan_amnt
, total_rec_int
, total_bal_ex_mort
and installment
with higher values are not beneficial.
Create an Azure Blob Datastore
¶
Now that the features and potential methods for addressing class imbalance have been selected, let's save these sets in a Datastore in the cloud, using an Azure Blob Datastore as the option. We need to first connect to the Azure Machine Learning
workspace with MLClient
using the credential
, subscription_id
, resource_group_name
and the workspace_name
.
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()
ml_client = MLClient(
credential=credential,
subscription_id='a134465f-eb15-4297-b902-3c97d4c81838',
resource_group_name='aschultzdata',
workspace_name='ds-ml-env',
)
Then we can create the datastore by specifying the name
of the datastore, the description
, the account
, the name of the container
, the protocol
and the account_key
.
from azure.ai.ml.entities import AzureBlobDatastore
from azure.ai.ml.entities import AccountKeyConfiguration
from azure.ai.ml import MLClient
store = AzureBlobDatastore(
name='loanstatus_datastore',
description='Datastore for Loan Status',
account_name='dsmlenv8898281366',
container_name='loanstatus',
protocol='https',
credentials=AccountKeyConfiguration(
account_key='XXXxxxXXXxXXXXxxXXXXXxXXXXXxXxxXxXXXxXXXxXXxxxXXxxXXXxXxXXXxxXxxXXXXxxxxxXXxxxxxxXXXxXXX'
),
)
ml_client.create_or_update(store)
Then we can upload both the train/test sets for the Upsampled
and SMOTE
sets to the loanstatus
container.
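There are several ways to perform that upload (the Studio UI, Azure Storage Explorer, or the SDK); a small sketch using the azure-storage-blob client against the same storage account and container, assuming the account key used above and the four CSV files written earlier:
from azure.storage.blob import BlobServiceClient
account_url = 'https://dsmlenv8898281366.blob.core.windows.net'
blob_service = BlobServiceClient(account_url=account_url,
                                 credential='XXXxxxXXXxXXXXxx...')  # storage account key
container_client = blob_service.get_container_client('loanstatus')
for name in ['trainDF_US.csv', 'testDF_US.csv', 'trainDF_SMOTE.csv', 'testDF_SMOTE.csv']:
    with open(name, 'rb') as data:
        container_client.upload_blob(name=name, data=data, overwrite=True)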
Train a model using Microsoft Azure
¶
Before we train the model, we need to create a compute instance/cluster with a defined name, so let's create one with a CPU, if it has not already been created. Then we can specify the cluster components containing the on-demand virtual machine (VM) service, the type of VM, the minimum/maximum nodes in the cluster, the time the node will run after the job finishes/terminates, and the type of tier. Finally, we can pass the object to create_or_update
from MLClient
.
from azure.ai.ml.entities import AmlCompute
cpu_compute_target = 'cpu-cluster-E8s-v3'
try:
cpu_cluster = ml_client.compute.get(cpu_compute_target)
print(
f"You already have a cluster named {cpu_compute_target}, we'll reuse it as is."
)
except Exception:
print('Creating a new cpu compute target...')
cpu_cluster = AmlCompute(
name=cpu_compute_target,
type='amlcompute',
size='Standard_E8s_v3',
min_instances=0,
max_instances=1,
idle_time_before_scale_down=180,
tier='Dedicated',
)
print(
f"AMLCompute with name {cpu_cluster.name} will be created, with compute size {cpu_cluster.size}"
)
cpu_cluster = ml_client.compute.begin_create_or_update(cpu_cluster)
Next, let's create the environment by making a directory for the dependencies
which lists the components for the runtime and the libraries installed on the compute for training the model.
import os
dependencies_dir = './dependencies'
os.makedirs(dependencies_dir, exist_ok=True)
Now we can write the conda.yaml
file into the dependencies
directory.
%%writefile {dependencies_dir}/conda.yaml
name: model-env
channels:
- conda-forge
dependencies:
- python=3.8.5
- numpy=1.21.6
- pip=23.1.2
- scikit-learn==1.1.2
- scipy
- pandas>=1.1,<1.2
- pip:
- inference-schema[numpy-support]==1.3.0
- mlflow
- azureml-mlflow
- psutil==5.9.0
- tqdm
- ipykernel
- matplotlib
- seaborn
- eli5==0.13.0
- shap==0.41.0
- lime
The created conda.yaml
file allows for the environment to be created and registered in the workspace.
from azure.ai.ml.entities import Environment
custom_env_name = 'aml-loanstatus-cpu'
custom_job_env = Environment(
name=custom_env_name,
description='Custom environment for Loan Status Linear job',
tags={'scikit-learn': '1.1.2'},
conda_file=os.path.join(dependencies_dir, 'conda.yaml'),
image='mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest',
)
custom_job_env = ml_client.environments.create_or_update(custom_job_env)
print(
f"Environment with name {custom_job_env.name} is registered to workspace, the environment version is {custom_job_env.version}"
)
Next, we can create the training script by first creating the source folder where the training script, main.py
, will be stored.
import os
train_src_dir = './src'
os.makedirs(train_src_dir, exist_ok=True)
The training script consists of preparing the environment, reading the data, data preparation, model training, evaluating the model and saving/registering the model. This includes specifying the dependencies to import and utilize, setting the seed, defining the input/output arguments with argparse, starting logging with mlflow.sklearn.autolog, reading the train/test sets, defining the features/target and preprocessing the data by scaling the features with the MinMaxScaler. The number of samples and features are then logged with MLFlow. Next, a Linear model is trained using the best parameters from GridSearchCV, and the classification_report and confusion_matrix as well as the metrics accuracy, precision, recall and f1_score for the train/test sets are logged as MLFlow artifacts and metrics. Then the model can be saved and registered.
import os
import random
import numpy as np
import warnings
import argparse
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from joblib import parallel_backend
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, precision_score
from sklearn.metrics import recall_score, f1_score
import matplotlib.pyplot as plt
import seaborn as sns
warnings.filterwarnings('ignore')
seed_value = 42
os.environ['LoanStatus_Linear'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
def main():
"""Main function of the script."""
parser = argparse.ArgumentParser()
parser.add_argument('--train_data', type=str,
help='path to input train data')
parser.add_argument('--test_data', type=str, help='path to input test data')
parser.add_argument('--penalty', required=False, default='l2', type=str)
parser.add_argument('--solver', required=False, default='lbfgs', type=str)
parser.add_argument('--max_iter', required=False, default=100, type=int)
parser.add_argument('--C', required=False, default=1, type=int)
parser.add_argument('--tol', required=False, default=1e-4, type=float)
parser.add_argument('--n_jobs', required=False, default=1, type=int)
parser.add_argument('--registered_model_name', type=str, help='model name')
args = parser.parse_args()
mlflow.start_run()
mlflow.sklearn.autolog()
###################
#<prepare the data>
###################
print(" ".join(f"{k}={v}" for k, v in vars(args).items()))
print('Input Train Data:', args.train_data)
print('Input Test Data:', args.test_data)
trainDF = pd.read_csv(args.train_data, low_memory=False)
testDF = pd.read_csv(args.test_data, low_memory=False)
train_label = trainDF[['loan_status']]
test_label = testDF[['loan_status']]
train_features = trainDF.drop(columns = ['loan_status'])
test_features = testDF.drop(columns = ['loan_status'])
print(f"Training with data of shape {train_features.shape}")
scaler = MinMaxScaler()
train_features = scaler.fit_transform(train_features)
test_features = scaler.transform(test_features)
mlflow.log_metric('num_samples', train_features.shape[0])
mlflow.log_metric('num_features', train_features.shape[1])
####################
#</prepare the data>
####################
##################
#<train the model>
##################
model = LogisticRegression(penalty=args.penalty,
solver=args.solver,
max_iter=args.max_iter,
C=args.C,
tol=args.tol,
random_state=seed_value)
with parallel_backend('threading', n_jobs=args.n_jobs):
model.fit(train_features, train_label)
##################
#</train the model>
##################
#####################
#<evaluate the model>
#####################
train_label_pred = model.predict(train_features)
test_label_pred = model.predict(test_features)
clr_train = classification_report(train_label, train_label_pred,
output_dict=True)
sns.heatmap(pd.DataFrame(clr_train).iloc[:-1,:].T, annot=True)
plt.savefig('clr_train.png')
mlflow.log_artifact('clr_train.png')
plt.close()
clr_test = classification_report(test_label, test_label_pred,
output_dict=True)
sns.heatmap(pd.DataFrame(clr_test).iloc[:-1,:].T, annot=True)
plt.savefig('clr_test.png')
mlflow.log_artifact('clr_test.png')
plt.close()
cm_train = confusion_matrix(train_label, train_label_pred)
cm_train = ConfusionMatrixDisplay(confusion_matrix=cm_train)
cm_train.plot()
plt.savefig('cm_train.png')
mlflow.log_artifact('cm_train.png')
plt.close()
cm_test = confusion_matrix(test_label, test_label_pred)
cm_test = ConfusionMatrixDisplay(confusion_matrix=cm_test)
cm_test.plot()
plt.savefig('cm_test.png')
mlflow.log_artifact('cm_test.png')
plt.close()
train_accuracy = accuracy_score(train_label, train_label_pred)
train_precision = precision_score(train_label, train_label_pred)
train_recall = recall_score(train_label, train_label_pred)
train_f1 = f1_score(train_label, train_label_pred)
test_accuracy = accuracy_score(test_label, test_label_pred)
test_precision = precision_score(test_label, test_label_pred)
test_recall = recall_score(test_label, test_label_pred)
test_f1 = f1_score(test_label, test_label_pred)
mlflow.log_metric('train_accuracy', train_accuracy)
mlflow.log_metric('train_precision', train_precision)
mlflow.log_metric('train_recall', train_recall)
mlflow.log_metric('train_f1', train_f1)
mlflow.log_metric('test_accuracy', test_accuracy)
mlflow.log_metric('test_precision', test_precision)
mlflow.log_metric('test_recall', test_recall)
mlflow.log_metric('test_f1', test_f1)
#####################
#</evaluate the model>
#####################
##########################
#<save and register model>
##########################
print('Registering the model via MLFlow')
mlflow.sklearn.log_model(
sk_model=model,
registered_model_name=args.registered_model_name,
artifact_path=args.registered_model_name,
)
mlflow.sklearn.save_model(
sk_model=model,
path=os.path.join(args.registered_model_name, 'trained_model'),
)
###########################
#</save and register model>
###########################
mlflow.end_run()
if __name__ == "__main__":
main()
To train the model, a command job needs to be submitted: it is configured with inputs specifying the train/test data and the model hyperparameters, and it then runs the training script using the specified compute resource and environment, with the specified parameters logged.
from azure.ai.ml import command
from azure.ai.ml import Input
registered_model_name = 'loanstatus_us_linear_model'
job = command(
inputs=dict(
train_data=Input(
type='uri_file',
path='azureml://datastores/loanstatus_datastore/paths/trainDF_US.csv',
),
test_data=Input(
type='uri_file',
path = 'azureml://datastores/loanstatus_datastore/paths/testDF_US.csv',
),
penalty='l1',
solver='saga',
max_iter=100000,
C=1,
tol=1e-06,
n_jobs=-1,
registered_model_name=registered_model_name,
),
code='./src/',
command='python main.py --train_data ${{inputs.train_data}} --test_data ${{inputs.test_data}} --penalty ${{inputs.penalty}} --solver ${{inputs.solver}} --max_iter ${{inputs.max_iter}} --C ${{inputs.C}} --tol ${{inputs.tol}} --n_jobs ${{inputs.n_jobs}} --registered_model_name ${{inputs.registered_model_name}}',
environment='aml-loanstatus-cpu@latest',
compute='cpu-cluster-E8s-v3',
display_name='loanstatus_us_linear_best_prediction',
)
Finally, this job can be submitted to run in Azure Machine Learning Studio
using the create_or_update
command with ml_client
.
ml_client.create_or_update(job)
The submitted job can then be viewed by selecting the link in the output of the previous cell. The logged information with MLFlow
including the model metrics and saved graphs can then be viewed/downloaded when the job completes.
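If you capture the handle returned by create_or_update, you can also follow the run from the notebook rather than the link; a small sketch, re-using the job object defined above:
returned_job = ml_client.create_or_update(job)
# Direct link to the run in Azure Machine Learning Studio
print(returned_job.studio_url)
# Stream the job logs into the notebook until the run completes
ml_client.jobs.stream(returned_job.name)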
Test Model on SMOTE Set
¶
Now, we can fit the best model from iterating through the parameters contained within the grid search using the Upsampled
data to the SMOTE
set to compare how this model performs using another approach for handling class imbalance problems. After fitting the model, we can predict with the test set that was not utilized for training, and evaluate the model performance.
print('Start Fit best model using gridsearch results on Upsampling to SMOTE data..')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
LassoMod_US_HPO.fit(X1_trainS, y1_train)
print('Finished Fit best model using gridsearch results on Upsampling to SMOTE data :',
time.time() - search_time_start)
Pkl_Filename = 'LassoMod_SMOTEusingUSHPO_mms_moreParams.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(LassoMod_US_HPO, file)
y_pred_US_HPO = LassoMod_US_HPO.predict(X1_testS)
print('Results from LASSO using Upsampling HPO on SMOTE Data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y1_test, y_pred_US_HPO)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y1_test, y_pred_US_HPO))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y1_test, y_pred_US_HPO))
print('Precision score : %.3f' % precision_score(y1_test, y_pred_US_HPO))
print('Recall score : %.3f' % recall_score(y1_test, y_pred_US_HPO))
print('F1 score : %.3f' % f1_score(y1_test, y_pred_US_HPO))
When testing the best model from the GridSearchCV
using the Upsampled
set with the SMOTE
data, this resulted in the same accuracy, higher precision and F1 scores but lower recall.
Model Explanations using ELI5
¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
perm_importance = PermutationImportance(LassoMod_US_HPO,
random_state=seed_value).fit(X1_testS,
y1_test)
html_obj = eli5.show_weights(perm_importance,
feature_names=X1_test1.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn
which returns an explanation of the estimator parameters (weights).
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X1_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
There is a lower weight for total_rec_int (0.377133 vs. 0.392205) compared to the Upsampled test set, while total_pymnt, out_prncp and loan_amnt are similar. Interestingly, home_ownership_RENT (0.129386 vs. 0.000037) and home_ownership_MORTGAGE (0.123567 vs. 0.000069) are considerably higher compared to the Upsampled test set.
Now we can use ELI5
to show the prediction.
html_obj2 = eli5.show_prediction(LassoMod_US_HPO, X1_test1.iloc[1],
show_feature_values=True)
html_obj2
For the loans to be Paid Off
or Current
, the features annual_inc
, out_prncp
, total_pymnt
, total_bc_limit
, total_bal_ex_mort
, bc_open_to_buy
, revol_bal
, total_hi_credit_lim
and last_pymnt_amnt
with higher values are beneficial while the features loan_amnt
, and total_rec_int
with higher values are not beneficial.
Grid Search: SMOTE
¶
To follow the methods used for evaluating the Upsampled
set, let's fit the SMOTE
data using dask.delayed
for parallel jobs with time
being monitored and then compare the model accuracies to determine which parameters led to this metric being achieved.
print('Start SMOTE - Grid Search..')
search_time_start = time.time()
pipelines1_ = [dask.delayed(grid).fit(X1_train, y1_train.values.ravel())]
fit_pipelines1 = dask.compute(*pipelines1_, scheduler='processes')
print('Finished SMOTE - Grid Search :', time.time() - search_time_start)
for idx, val in enumerate(fit_pipelines1):
print('%s pipeline test accuracy: %.3f' % (pipe_dict[idx],
val.score(X1_test, y1_test)))
best_acc = 0.0
best_clf = 0
best_pipe = ''
for idx, val in enumerate(fit_pipelines1):
if val.score(X1_test, y1_test) > best_acc:
best_acc = val.score(X1_test, y1_test)
best_pipe = val
best_clf = idx
print('Classification with highest accuracy: %s' % pipe_dict[best_clf])
print('Lasso/Elastic Net using SMOTE - Best Estimator')
print(best_pipe.best_params_)
Now that we determined the parameters that led to the highest accuracy during the search, we can fit a model with those parameters, save the model and predict with the trained model using the test set.
LassoMod_SMOTE = LogisticRegression(penalty='l1',
C=1,
solver='saga',
max_iter=100000,
tol=1e-06,
n_jobs=-1,
random_state=seed_value)
print('Start fit the best hyperparameters from SMOTE grid search to the data..')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
LassoMod_SMOTE.fit(X1_trainS, y1_train)
print('Finished fit the best hyperparameters from SMOTE grid search to the data:',
time.time() - search_time_start)
Pkl_Filename = 'Linear_HPO_SMOTE_mmsRecall_mostParams.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(LassoMod_SMOTE, file)
y_pred_SMOTE_HPO = LassoMod_SMOTE.predict(X1_testS)
print('Results from LASSO using SMOTE HPO:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y1_test, y_pred_SMOTE_HPO)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y1_test, y_pred_SMOTE_HPO))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y1_test, y_pred_SMOTE_HPO))
print('Precision score : %.3f' % precision_score(y1_test, y_pred_SMOTE_HPO))
print('Recall score : %.3f' % recall_score(y1_test, y_pred_SMOTE_HPO))
print('F1 score : %.3f' % f1_score(y1_test, y_pred_SMOTE_HPO))
Compared to the baseline model, the model determined using GridSearchCV resulted in practically the same metrics besides a higher precision score. Given the time that was required to search through the various models in the given parameter space, the baseline model would probably be sufficient unless an even higher recall score, the proportion of actual positives classified as positive, was needed for the task at hand.
Compared to the model after GridSearchCV
using the Upsampled
set, there was a higher precision score, but a lower recall score.
Model Explanations using ELI5
¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
perm_importance = PermutationImportance(LassoMod_SMOTE,
random_state=seed_value).fit(X1_testS,
y1_test)
html_obj = eli5.show_weights(perm_importance,
feature_names=X1_test1.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn
which returns an explanation of the estimator parameters (weights).
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X1_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
There is a lower weight for total_rec_int (0.377133 vs. 0.392205) compared to the Upsampled test set, while total_pymnt, out_prncp and loan_amnt are similar. Interestingly, home_ownership_RENT (0.129386 vs. 0.000037) and home_ownership_MORTGAGE (0.123567 vs. 0.000069) are considerably higher compared to the Upsampled test set.
Now we can use ELI5
to show the prediction.
html_obj2 = eli5.show_prediction(LassoMod_SMOTE, X1_test1.iloc[1],
show_feature_values=True)
html_obj2
For the loans to be Paid Off
or Current
, the features annual_inc
, out_prncp
, total_pymnt
, total_bc_limit
, total_bal_ex_mort
, bc_open_to_buy
, revol_bal
, total_hi_credit_lim
and last_pymnt_amnt
with higher values are beneficial while the features loan_amnt
, and total_rec_int
with higher values are not beneficial.
Test Model on Upsampled Set
¶
Now, we can fit the determined model from the search using the SMOTE
set, save the .pkl
file, then predict based on trained model with the scaled test set and evaluate the model metrics.
print('Start fit best model using gridsearch results on SMOTE to Upsampling data..')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
LassoMod_SMOTE.fit(X_trainS, y_train)
print('Finished fit best model using gridsearch results on SMOTE to Upsampling data :',
time.time() - search_time_start)
Pkl_Filename = 'LassMod_UpsamplingusingSMOTEHPO_mmsRecall_mostParams.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(LassoMod_SMOTE, file)
y_pred_SMOTE_HPO = LassoMod_SMOTE.predict(X_testS)
print('Results from LASSO using SMOTE HPO on Upsampling data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test, y_pred_SMOTE_HPO)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred_SMOTE_HPO))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test, y_pred_SMOTE_HPO))
print('Precision score : %.3f' % precision_score(y_test, y_pred_SMOTE_HPO))
print('Recall score : %.3f' % recall_score(y_test, y_pred_SMOTE_HPO))
print('F1 score : %.3f' % f1_score(y_test, y_pred_SMOTE_HPO))
When testing the best model from the GridSearchCV using the SMOTE set with the Upsampled data, this resulted in the same accuracy but lower precision and F1 scores and a higher recall score.
Model Explanations using ELI5
¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
perm_importance = PermutationImportance(LassoMod_SMOTE,
random_state=seed_value).fit(X_testS,
y_test)
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test1.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn
which returns an explanation of the estimator parameters (weights).
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
The feature weights are the same values from the best model from the GridSearchCV
with scoring='recall_weighted'
for the Upsampled
set.
Now we can use ELI5
to show the prediction.
html_obj2 = eli5.show_prediction(LassoMod_SMOTE, X_test1.iloc[1],
show_feature_values=True)
html_obj2
For the loans to be Paid Off
or Current
, the features out_prncp
, total_pymnt
, annual_inc
, total_hi_credit_lim
, total_bc_limit
, revol_bal
and last_pymnt_amnt
with higher values are beneficial while the features loan_amnt
, total_rec_int
, total_bal_ex_mort
and installment
with higher values are not beneficial.
Naive Bayes
¶
Naive Bayes
classifiers are a family of supervised learning algorithms based on applying Bayes’ theorem with the "naive" assumption that conditional independence exists between every pair of features given the value of the class variable. Bayes’ theorem is a formula that offers a conditional probability of an event A
happening given another event B
has already occurred:
$$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $$
- $P(A)$ is called the prior probability
- $P(A|B)$ is called the posterior probability (the probability after taking into account the observation of $B$).
The fundamental Naive Bayes
assumption is that each feature makes an independent and equal contribution to the outcome. In Gaussian Naive Bayes
, the continuous values of each feature are assumed to be distributed according to a Normal distribution and the likelihood of the features is assumed to be Gaussian.
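Concretely, the per-feature likelihood used by Gaussian Naive Bayes is
$$ P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right) $$
where $\mu_y$ and $\sigma_y^2$ are the mean and variance of feature $x_i$ estimated from the training samples belonging to class $y$.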
In The Optimality of Naive Bayes, it was suggested that Naive Bayes
can still perform well even if the features contain strong dependences as long as the dependences are evenly distributed in the target variable or if they cancel each other out.
Baseline Models
¶
The notebook containing the baseline models and hyperparameter tuning using GridSearchCV
is located here.
Baseline Default Parameters
¶
- priors: Prior probabilities of the classes. If specified, the priors are not adjusted according to the data. default=None.
- var_smoothing: Portion of the largest variance of all features that is added to variances for calculation stability. default=1e-09.
Upsampled: Baseline Model
¶
As with the baseline Linear models, we can first set the path where the ML results are saved; this is always needed and will be assumed for all subsequent workflows. Let's start with the Upsampled set, fitting the model, saving it as a .pkl file, and then making predictions and evaluating with the test set.
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
with parallel_backend('threading', n_jobs=-1):
nb.fit(X_train, y_train)
Pkl_Filename = 'NB_Upsampling_baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(nb, file)
y_pred_US = nb.predict(X_test)
print('Results from Naives Bayes baseline model on Upsampled Data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test, y_pred_US)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred_US))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test, y_pred_US))
print('Precision score : %.3f' % precision_score(y_test, y_pred_US))
print('Recall score : %.3f' % recall_score(y_test, y_pred_US))
print('F1 score : %.3f' % f1_score(y_test, y_pred_US))
All of the model metrics for this baseline model are lower than both previous baseline linear models for the Upsampled set. Although the accuracy=0.948 and precision=0.911, the lowest metric is recall_score=0.651 followed by the f1_score=0.759. Naive Bayes is non-linear and might require too many simulations to match the metrics achieved with the linear approaches, so let's try some different scoring metrics during GridSearchCV since the runtime for completion is not too computationally demanding.
SMOTE: Baseline Model
¶
We can now fit the baseline model for the SMOTE
set using the default model parameters.
with parallel_backend('threading', n_jobs=-1):
nb.fit(X1_train, y1_train)
Pkl_Filename = 'NB_SMOTE_Baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(nb, file)
y_pred_SMOTE = nb.predict(X1_test)
print('Results from Naives Bayes baseline model on SMOTE Data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y1_test, y_pred_SMOTE)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y1_test, y_pred_SMOTE))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y1_test, y_pred_SMOTE))
print('Precision score : %.3f' % precision_score(y1_test, y_pred_SMOTE))
print('Recall score : %.3f' % recall_score(y1_test, y_pred_SMOTE))
print('F1 score : %.3f' % f1_score(y1_test, y_pred_SMOTE))
All of the model metrics for this baseline model are even lower than both of the previous baseline linear models for the SMOTE and the Upsampled sets. However, precision=0.465 is the lowest model metric observed up to this point, so this is definitely the metric we will have to use for scoring if the Naive Bayes algorithm has any likelihood of being considered with the class-balance-adjusted sets.
Grid Search using Pipelines: Precision
¶
Grid Search: Upsampled
¶
Using a similar approach to the one utilized for the Linear models, let's define the estimator as the default GaussianNB, the grid search parameters as param_grid, and scoring='precision' with 3-fold cross validation, and use all available resources with joblib in both the scikit-learn model and GridSearchCV. We can monitor the completion time as well to compare with the other hyperparameter searches. However, this still runs under CPU and RAM constraints with suboptimal runtimes; the literature has demonstrated significant differences in runtimes when using code optimized to run on a GPU, even with tabular data. Let's print the search's best score and estimator parameters after the search completes.
param_grid = {'var_smoothing': np.logspace(0,-9, num=1000)}
nb_grid_US = GridSearchCV(estimator=GaussianNB(),
param_grid=param_grid,
scoring='precision',
cv=3, verbose=1, n_jobs=-1)
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
nb_grid_US.fit(X_train, y_train)
print('Time to finish Grid Search using Upsampling:',
time.time() - search_time_start)
print('======================================================================')
print('Naive Bayes: Upsampling')
print('- Best Score:', nb_grid_US.best_score_)
print('- Best Estimator:', nb_grid_US.best_estimator_)
Using 3-fold cross validation with precision as the scoring metric selected the same parameters as the baseline model. Let's fit the best model from the grid search using the Upsampled data, and examine the classification metrics.
nb_US_HPO = nb_grid_US.best_estimator_
print('Start fit the best hyperparameters from Upsampling grid search to the data..')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
nb_US_HPO.fit(X_train, y_train)
print('Finished fit the best hyperparameters from Upsampling grid search to the data:',
time.time() - search_time_start)
Pkl_Filename = 'NB_HPO_US_precision.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(nb_US_HPO, file)
y_pred_US = nb_US_HPO.predict(X_test)
print('Results from Naives Bayes using Best HPO from GridSearchCV on Upsampled Data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test, y_pred_US)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred_US))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test, y_pred_US))
print('Precision score : %.3f' % precision_score(y_test, y_pred_US))
print('Recall score : %.3f' % recall_score(y_test, y_pred_US))
print('F1 score : %.3f' % f1_score(y_test, y_pred_US))
Compared to the baseline model, the model determined using GridSearchCV
resulted in the same metrics. This suggests that this algorithm is going to prove to be difficult in this context.
Model Metrics with ELI5
¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
perm_importance = PermutationImportance(nb_US_HPO,
random_state=seed_value).fit(X_test,
y_test)
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn
which returns an explanation of the estimator parameters (weights).
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test.columns.tolist())
exp = format_as_dataframe(explanation)
exp
The feature recoveries is the only feature with a contributing weight (0.115911).
Test Model on SMOTE Set
¶
Now, we can fit the best model from iterating through the parameters contained within the grid search using the Upsampled
data to the SMOTE
set to compare how this model performs using another approach for handling class imbalance problems. After fitting the model, we can predict with the test set that was not utilized for training, and evaluate the model performance.
print('Start Fit best model using gridsearch results on Upsampling to SMOTE data..')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
nb_US_HPO.fit(X1_train, y1_train)
print('Finished Fit best model using gridsearch results on Upsampling to SMOTE data :',
time.time() - search_time_start)
Pkl_Filename = 'NB_SMOTEusingUpsamplingHPO_precision.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(nb_US_HPO, file)
y_pred_SMOTE_US = nb_US_HPO.predict(X1_test)
print('Results from Naives Bayes using Upsampling Best HPO from GridSearchCV on SMOTE data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y1_test, y_pred_SMOTE_US)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y1_test, y_pred_SMOTE_US))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y1_test, y_pred_SMOTE_US))
print('Precision score : %.3f' % precision_score(y1_test, y_pred_SMOTE_US))
print('Recall score : %.3f' % recall_score(y1_test, y_pred_SMOTE_US))
print('F1 score : %.3f' % f1_score(y1_test, y_pred_SMOTE_US))
This resulted in the same model metrics as the baseline model.
Model Explanations using ELI5
¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
perm_importance = PermutationImportance(nb_US_HPO,
random_state=seed_value).fit(X1_test,
y1_test)
html_obj = eli5.show_weights(perm_importance,
feature_names=X1_test.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn
which returns an explanation of the estimator parameters (weights).
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X1_test.columns.tolist())
exp = format_as_dataframe(explanation)
exp
The recoveries feature contains less weight compared to the Upsampled model (0.074780 vs. 0.11), while out_prncp (0.056124) and last_pymnt_amnt (0.042546) contain more weight than determined with the Upsampled model.
Grid Search using Pipelines: Weighted Precision
¶
Given the low precision score obtained for the SMOTE set when scoring='precision' was used (0.465), which resulted because the baseline model's parameters were selected, let's expand the parameter space by widening the var_smoothing range and increasing num to 5000, and test scoring='precision_weighted' with 10-fold cross validation.
param_grid = {'var_smoothing': np.logspace(-7,-13, num=5000)}
nb_grid = GridSearchCV(estimator=GaussianNB(),
param_grid=param_grid,
scoring='precision_weighted',
cv=10, verbose=4, n_jobs=-3)
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
nb_grid.fit(X1_train, y1_train)
print('Finished SMOTE - Grid Search :', time.time() - search_time_start)
print('======================================================================')
print('Naive Bayes using SMOTE ')
print('- Best Score:', nb_grid.best_score_)
print('- Best Estimator:', nb_grid.best_estimator_)
Let's fit the best model from the grid search using the SMOTE
data, and examine the classification metrics.
nb_SMOTE_HPO = nb_grid.best_estimator_
print('Start fit the best hyperparameters from SMOTE grid search to the data..')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
nb_SMOTE_HPO.fit(X1_train, y1_train)
print('Finished fit the best hyperparameters from SMOTE grid search to the data:',
time.time() - search_time_start)
Pkl_Filename = 'NB_HPO_SMOTE_-7to-13_precisionWeighted_num5000_10cv.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(nb_SMOTE_HPO, file)
y_pred_SMOTE_HPO = nb_SMOTE_HPO.predict(X1_test)
print('Results from Naives Bayes using Best HPO from GridSearchCV on SMOTE data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y1_test, y_pred_SMOTE_HPO)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y1_test, y_pred_SMOTE_HPO))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y1_test, y_pred_SMOTE_HPO))
print('Precision score : %.3f' % precision_score(y1_test, y_pred_SMOTE_HPO))
print('Recall score : %.3f' % recall_score(y1_test, y_pred_SMOTE_HPO))
print('F1 score : %.3f' % f1_score(y1_test, y_pred_SMOTE_HPO))
Compared to the baseline NB model using the SMOTE set, all of the model metrics increased besides a decrease in the recall_score. However, given the low recall, F1 and precision scores, this model, even after testing various searches, does not seem worthwhile unless the data is preprocessed further by engineering new features from the current ones.
Model Explanations using ELI5
¶
Although the model metrics are not ideal nor worth pursuing for further optimization, let's compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
perm_importance = PermutationImportance(nb_SMOTE_HPO,
random_state=seed_value).fit(X1_test,
y1_test)
html_obj = eli5.show_weights(perm_importance,
feature_names=X1_test.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn
which returns an explanation of the estimator parameters (weights).
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X1_test.columns.tolist())
exp = format_as_dataframe(explanation)
exp
Test Model on Upsampled Set
¶
Using the best model from grid search using the SMOTE
set, let's compare the results to the ones from using the Upsampled
set by fitting a model with the best parameters and evaluating the model performance.
print('Start fit best model using gridsearch results on SMOTE to Upsampling data..')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
nb_SMOTE_HPO.fit(X_train, y_train)
print('Finished fit best model using gridsearch results on SMOTE to Upsampling data:',
time.time() - search_time_start)
Pkl_Filename = 'NB_UpsamplingUsingSMOTEHPO_-7to-13_precisionWeighted_num5000_10cv.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(nb_SMOTE_HPO, file)
y_pred_US_SMOTE = nb_SMOTE_HPO.predict(X_test)
print('Results from Naives Bayes using SMOTE Best HPO from GridSearchCV on Upsampling data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test, y_pred_US_SMOTE)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred_US_SMOTE))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test, y_pred_US_SMOTE))
print('Precision score : %.3f' % precision_score(y_test, y_pred_US_SMOTE))
print('Recall score : %.3f' % recall_score(y_test, y_pred_US_SMOTE))
print('F1 score : %.3f' % f1_score(y_test, y_pred_US_SMOTE))
Compared to the baseline model using the default parameters, this resulted in lower model metrics for the Upsampled
set. The accuracy, precision and recall scores were better than the SMOTE
baseline model metrics though.
Model Explanations using ELI5
¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
perm_importance = PermutationImportance(nb_SMOTE_HPO,
random_state=seed_value).fit(X_test,
y_test)
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn
which returns an explanation of the estimator parameters (weights).
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test.columns.tolist())
exp = format_as_dataframe(explanation)
exp
The weight of recoveries is slightly higher for the SMOTE model applied to the Upsampled data, but effectively comparable (0.123842 vs. 0.115911).
When the increased parameter space with 10-fold cross validation was tested with the Upsampled set, the model resulted in worse model metrics compared to the initial hyperparameter search. So, let's move on to more model evaluation.
SparkML
¶
Apache Spark is an open-source analytics engine that was developed to improve upon the limitations of MapReduce in the Hadoop Distributed File System by processing data in memory and distributing work over multiple operations in parallel. There are various ways to use it, ranging from real-time big data ingestion to ML/DL tasks. Let's first set up the environment by updating the packages and installing the Java JRE/JDK with:
1. apt update
2. apt install default-jre
3. apt install default-jdk
We can then install pyspark==3.3 and findspark using pip (a minimal install cell is sketched below) and set up the SparkSession. Both Paperspace and Colab were utilized, so the amount of RAM allocated to the SparkSession differed; Paperspace provides more RAM and CPU cores.
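A minimal install cell along these lines might look as follows (the pinned patch version and the use of notebook shell magic are assumptions; adjust to the actual environment):
!pip install pyspark==3.3.0 findspark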
import findspark
from pyspark.sql import SparkSession
findspark.init()
spark = SparkSession.builder\
.master('local')\
.appName('Paperspace')\
.config('spark.driver.memory', '38g')\
.config('spark.executor.pyspark.memory', '32g')\
.config('spark.executor.cores', '4')\
.config('spark.python.worker.memory', '32g')\
.config('spark.sql.execution.arrow.pyspark.enabled', 'True')\
.config('spark.sql.debug.maxToStringFields', '1000')\
.config('spark.sql.autoBroadcastJoinThreshold', '-1')\
.config('spark.ui.port', '4050')\
.getOrCreate()
spark
Then we can remove the warnings from the environment by setting the log to only ERROR
. After the SparkSession
is configured, we can install & import the necessary packages and then set the seed for reproducibility.
spark.sparkContext.setLogLevel('ERROR')
Upsampled Minority Class
¶
Let's first read the train and the test sets, which are stored as .csv
files, cache
the data into memory for better performance and then inspect the schema of the features to determine if the schema needs to be updated to reflect the proper data type(s).
import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')
seed_value = 42
os.environ['SparkML_HPO'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
trainDF = spark.read.csv('/notebooks/LoanStatus/Data/trainDF_US.csv',
header=True, inferSchema=True).cache()
print('\nTrain Schema')
trainDF.printSchema()
testDF = spark.read.csv('/notebooks/LoanStatus/Data/testDF_US.csv',
header=True, inferSchema=True).cache()
print('\nTest Schema')
testDF.printSchema()
To prepare the data for modeling, the features and label (loan_status
) need to be defined. Let's start with the training set. The VectorAssembler
is set up with the features as inputCols
and the outputCol='unscaledFeatures'
. Then the train feature columns can be transformed into a vector column.
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler
features = trainDF.columns[0: len(trainDF.columns) - 1]
trainDF = trainDF.select(col('loan_status').alias('label'), *features)
vecAssembler = VectorAssembler(inputCols=features,
outputCol='unscaledFeatures',
handleInvalid='skip')
trainDF = vecAssembler.transform(trainDF)
The same steps that were applied to the training set should also be applied to the test set.
features = testDF.columns[0: len(testDF.columns) - 1]
testDF = testDF.select(col('loan_status').alias('label'), *features)
testDF = vecAssembler.transform(testDF)
Some models perform better when the data is scaled. The MinMaxScaler rescales the data linearly to a common range between the minimum and maximum values using the summary statistics of the variables. The StandardScaler standardizes the variables by removing the mean and dividing each value by the standard deviation (a small numeric illustration of both transforms follows the next cell). Given this is a classification problem, let's define a few metrics, areaUnderROC and accuracy, to evaluate the performance from training different model types.
from pyspark.ml.feature import MinMaxScaler, StandardScaler
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
mmScaler = MinMaxScaler(inputCol='unscaledFeatures',
outputCol='scaledFeatures',
min=0, max=1)
stdScaler = StandardScaler(inputCol='unscaledFeatures',
outputCol='scaledFeatures',
withStd=True,
withMean=False)
evaluator_auroc = BinaryClassificationEvaluator(labelCol='label',
metricName='areaUnderROC')
evaluator_acc = MulticlassClassificationEvaluator(labelCol='label',
metricName='accuracy')
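As a quick numeric illustration of what the two scalers compute, here is a minimal NumPy sketch on a toy vector (separate from the Spark pipeline; note that the StandardScaler above is configured with withMean=False, so only the division by the standard deviation applies):
import numpy as np
x = np.array([2.0, 4.0, 6.0, 8.0])
# MinMaxScaler: (x - min) / (max - min), mapping the values onto [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())
print(x_minmax)
# StandardScaler with withMean=False: divide by the (corrected sample)
# standard deviation only, leaving the mean in place
x_std = x / x.std(ddof=1)
print(x_std)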
The models examined using Spark's Machine Learning Library (MLlib) are:
- Logistic Regression (LR)
- Linear SVM Classifier (LinearSVC)
- Decision Trees (DT)
- Random Forest (RF)
- Gradient Boosted Trees (GBT)
To set the baseline models, a Pipeline was utilized that passes the unscaledFeatures from the VectorAssembler (or the scaledFeatures, if a scaler was being used) to the featuresCol of the model.
Logistic Regression
¶
First the dependencies are imported from pyspark.ml, time and mlflow. The pipeline was set up by first defining the model to use, which utilizes the default parameters. The steps in the pipeline consist of scaling the data with the MinMaxScaler, fitting the timed model and saving it for later use, if needed.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
import time
try:
import mlflow.pyspark.ml
mlflow.pyspark.ml.autolog()
except:
print(f'Your version of MLflow ({mlflow.__version__}) does not support pyspark.ml for autologging. To use autologging, upgrade your MLflow client version or use Databricks Runtime for ML 8.3 or above.')
lr = LogisticRegression(family='binomial',
labelCol='label',
featuresCol='scaledFeatures',
regParam=0.0,
elasticNetParam=0.0,
maxIter=100)
search_time_start = time.time()
pipeline_lr = Pipeline(stages=[mmScaler, lr])
pipelineModel_lr = pipeline_lr.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)
Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_lr_us'
pipelineModel_lr.save(Path)
Then predictions can be made using the test set. The default parameters of regParam=0.0
and elasticNetParam=0.0
provide an AUROC = 0.987
and an Accuracy = 98.583%
, so optimizing this model might prove to be difficult. Let's now examine some other model types, and assess if hyperparameter tuning could improve performance beyond the LogisticRegression model that was fit.
prediction_lr = pipelineModel_lr.transform(testDF)
lr_auroc = evaluator_auroc.evaluate(prediction_lr)
print('Baseline: LogisticRegression')
print('Area under ROC curve: %g' % (lr_auroc))
print('Test Error: %g' % (1.0 - lr_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_lr))
LinearSVC
¶
from pyspark.ml.classification import LinearSVC
lsvc = LinearSVC(labelCol='label',
featuresCol='scaledFeatures',
regParam=0.0,
tol=1e-5,
maxIter=100)
search_time_start = time.time()
pipeline_lsvc = Pipeline(stages=[stdScaler, lsvc])
pipelineModel_lsvc = pipeline_lsvc.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)
Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_lsvc_us'
pipelineModel_lsvc.save(Path)
prediction_lsvc = pipelineModel_lsvc.transform(testDF)
lsvc_auroc = evaluator_auroc.evaluate(prediction_lsvc)
print('Baseline: LinearSVC')
print('Area under ROC curve: %g' % (lsvc_auroc))
print('Test Error: %g' % (1.0 - lsvc_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_lsvc))
Decision Tree
¶
from pyspark.ml.classification import DecisionTreeClassifier
dt = DecisionTreeClassifier(labelCol='label',
featuresCol='unscaledFeatures',
maxDepth=5,
maxBins=16,
impurity='gini',
seed=seed_value)
search_time_start = time.time()
pipeline_dt = Pipeline(stages=[dt])
pipelineModel_dt = pipeline_dt.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)
Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_dt_us'
pipelineModel_dt.save(Path)
prediction_dt = pipelineModel_dt.transform(testDF)
dt_auroc = evaluator_auroc.evaluate(prediction_dt)
print('Baseline: Decision Tree')
print('Area under ROC curve: %g' % (dt_auroc))
print('Test Error: %g' % (1.0 - dt_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_dt))
Random Forest
¶
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(labelCol='label',
featuresCol='unscaledFeatures',
impurity='gini',
maxDepth=5,
maxBins=32,
numTrees=10,
seed=seed_value)
search_time_start = time.time()
pipeline_rf = Pipeline(stages=[rf])
pipelineModel_rf = pipeline_rf.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)
Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_rf_us'
pipelineModel_rf.save(Path)
prediction_rf = pipelineModel_rf.transform(testDF)
rf_auroc = evaluator_auroc.evaluate(prediction_rf)
print('Baseline: Random Forest')
print('Area under ROC curve: %g' % (rf_auroc))
print('Test Error: %g' % (1.0 - rf_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_rf))
Gradient Boosted Trees
¶
from pyspark.ml.classification import GBTClassifier
gbt = GBTClassifier(labelCol='label',
featuresCol='unscaledFeatures',
maxDepth=5,
maxBins=32,
maxIter=10,
seed=seed_value)
search_time_start = time.time()
pipeline_gbt = Pipeline(stages=[gbt])
pipelineModel_gbt = pipeline_gbt.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)
Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_gbt_us'
pipelineModel_gbt.save(Path)
prediction_gbt = pipelineModel_gbt.transform(testDF)
gbt_auroc = evaluator_auroc.evaluate(prediction_gbt)
print('Baseline: Gradient Boosted Trees')
print('Area under ROC curve: %g' % (gbt_auroc))
print('Test Error: %g' % (1.0 - gbt_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_gbt))
Of the models fit, only LR and LinearSVC had their data scaled prior to modeling. The scaling for LinearSVC used the StandardScaler since SVMs tend to perform better with scaled data and the computation time is faster compared with unscaled data. Given all models were using the same runtime environment, the DT model completed the fastest at 59.3 seconds while LR took the longest to finish at 233.4 seconds. Given the lower accuracies for GBT, RF and DT, these will require more hyperparameter tuning than LR and LinearSVC. However, accuracy is not a great metric, especially if the original data is imbalanced. Therefore, let's add some more metrics and then compare all the models at a greater depth or granularity.
Model Metrics
¶
We can utilize a for loop which iterates through the models tested and assesses various metrics, both at a more granular level, like the number of true positives and false positives, and at a more macroscopic level, like recall, precision and F1 score.
print('Upsampling Baseline Models:')
for model in ['prediction_lr', 'prediction_lsvc', 'prediction_dt',
'prediction_rf', 'prediction_gbt']:
df = globals()[model]
tp = df[(df.label == 1) & (df.prediction == 1)].count()
tn = df[(df.label == 0) & (df.prediction == 0)].count()
fp = df[(df.label == 0) & (df.prediction == 1)].count()
fn = df[(df.label == 1) & (df.prediction == 0)].count()
a = ((tp + tn)/df.count())
if(tp + fn == 0.0):
r = 0.0
p = float(tp) / (tp + fp)
elif(tp + fp == 0.0):
r = float(tp) / (tp + fn)
p = 0.0
else:
r = float(tp) / (tp + fn)
p = float(tp) / (tp + fp)
if(p + r == 0):
f1 = 0
else:
f1 = 2 * ((p * r)/(p + r))
print('\nModel:', model)
print('True Positives:', tp)
print('True Negatives:', tn)
print('False Positives:', fp)
print('False Negatives:', fn)
print('Total:', df.count())
print('Accuracy:', a)
print('Recall:', r)
print('Precision: ', p)
print('F1 score:', f1)
print('\n')
LR performed the best on all of the macro metrics. Using a linear approach might be worthwhile in this scenario since the metrics from the baseline model are all over 90%. However, the overall recall score for each of the models is relatively low, with the highest being 91.0%. Also, DT, RF and GBT have much lower precision scores for the Upsampled baseline models.
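As an aside, the same per-label metrics can be obtained from Spark's RDD-based MulticlassMetrics evaluator rather than the manual confusion-matrix arithmetic above; a small sketch, using prediction_lr purely for illustration:
from pyspark.mllib.evaluation import MulticlassMetrics
# Map the (prediction, label) columns to an RDD of float pairs
preds_and_labels = prediction_lr.select('prediction', 'label') \
    .rdd.map(lambda row: (float(row.prediction), float(row.label)))
metrics = MulticlassMetrics(preds_and_labels)
print('Accuracy:', metrics.accuracy)
print('Recall (label=1):', metrics.recall(1.0))
print('Precision (label=1):', metrics.precision(1.0))
print('F1 score (label=1):', metrics.fMeasure(1.0))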
SMOTE
¶
Upon the initial inspection of the schema of the train and test sets when the data was loaded, there were some differences when inferSchema=True
was used. The differences can be noted below. This was mostly due to the creation of uint8
dummy variables during the variable selection process and the creation of FloatType
with inferSchema=True
. To keep the train/test sets consistent with the Upsampled
sets, this was adjusted.
Then, the train and the test sets were processed using the same approaches described for the Upsampled data.
from pyspark.sql.types import IntegerType, FloatType
trainDF = spark.read.csv('/content/drive/MyDrive/LoanStatus/Data/trainDF_SMOTE.csv',
header=True, inferSchema=True).cache()
trainDF = trainDF \
.withColumn('loan_amnt', trainDF['loan_amnt'].cast(IntegerType())) \
.withColumn('revol_bal', trainDF['revol_bal'].cast(IntegerType())) \
.withColumn('term_ 60 months', trainDF['term_ 60 months'].cast(IntegerType())) \
.withColumn('grade_B', trainDF['grade_B'].cast(IntegerType())) \
.withColumn('grade_C', trainDF['grade_C'].cast(IntegerType())) \
.withColumn('grade_D', trainDF['grade_D'].cast(IntegerType())) \
.withColumn('home_ownership_MORTGAGE', trainDF['home_ownership_MORTGAGE'].cast(IntegerType())) \
.withColumn('home_ownership_OWN', trainDF['home_ownership_OWN'].cast(IntegerType())) \
.withColumn('home_ownership_RENT', trainDF['home_ownership_RENT'].cast(IntegerType())) \
.withColumn('verification_status_Source Verified', trainDF['verification_status_Source Verified'].cast(IntegerType())) \
.withColumn('verification_status_Verified', trainDF['verification_status_Verified'].cast(IntegerType())) \
.withColumn('purpose_credit_card', trainDF['purpose_credit_card'].cast(IntegerType())) \
.withColumn('initial_list_status_w', trainDF['initial_list_status_w'].cast(IntegerType())) \
.withColumn('application_type_Joint App', trainDF['application_type_Joint App'].cast(IntegerType())) \
.withColumn('disbursement_method_DirectPay', trainDF['disbursement_method_DirectPay'].cast(IntegerType()))
print('\nTrain Schema')
trainDF.printSchema()
testDF = spark.read.csv('/content/drive/MyDrive/LoanStatus/Data/testDF_SMOTE.csv',
header=True, inferSchema=True).cache()
print('\nTest Schema')
testDF.printSchema()
Logistic Regression
¶
Let's now run the baseline models for the SMOTE
set and compare the different model results within the SMOTE
data as well as the Upsampled
sets.
lr = LogisticRegression(family='binomial',
labelCol='label',
featuresCol='scaledFeatures',
regParam=0.0,
elasticNetParam=0.0,
maxIter=100)
search_time_start = time.time()
pipeline_lr = Pipeline(stages=[mmScaler, lr])
pipelineModel_lr = pipeline_lr.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)
Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_lr_smote'
pipelineModel_lr.save(Path)
prediction_lr = pipelineModel_lr.transform(testDF)
lr_auroc = evaluator_auroc.evaluate(prediction_lr)
print('Baseline: Logistic Regression')
print('Area under ROC curve: %g' % (lr_auroc))
print('Test Error: %g' % (1.0 - lr_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_lr))
LinearSVC
¶
lsvc = LinearSVC(labelCol='label',
featuresCol='scaledFeatures',
regParam=0.0,
tol=1e-5,
maxIter=100)
search_time_start = time.time()
pipeline_lsvc = Pipeline(stages=[stdScaler, lsvc])
pipelineModel_lsvc = pipeline_lsvc.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)
Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_lsvc_smote'
pipelineModel_lsvc.save(Path)
prediction_lsvc = pipelineModel_lsvc.transform(testDF)
lsvc_auroc = evaluator_auroc.evaluate(prediction_lsvc)
print('Baseline: LinearSVC')
print('Area under ROC curve: %g' % (lsvc_auroc))
print('Test Error: %g' % (1.0 - lsvc_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_lsvc))
Decision Tree
¶
dt = DecisionTreeClassifier(labelCol='label',
featuresCol='unscaledFeatures',
maxDepth=5,
maxBins=16,
impurity='gini',
seed=seed_value)
search_time_start = time.time()
pipeline_dt = Pipeline(stages=[dt])
pipelineModel_dt = pipeline_dt.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)
Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_dt_smote'
pipelineModel_dt.save(Path)
prediction_dt = pipelineModel_dt.transform(testDF)
dt_auroc = evaluator_auroc.evaluate(prediction_dt)
print('Baseline: Decision Tree')
print('Area under ROC curve: %g' % (dt_auroc))
print('Test Error: %g' % (1.0 - dt_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_dt))
Random Forest
¶
rf = RandomForestClassifier(labelCol='label',
featuresCol='unscaledFeatures',
maxDepth=5,
maxBins=32,
numTrees=10,
impurity='gini',
seed=seed_value)
search_time_start = time.time()
pipeline_rf = Pipeline(stages=[rf])
pipelineModel_rf = pipeline_rf.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)
Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_rf_smote'
pipelineModel_rf.save(Path)
prediction_rf = pipelineModel_rf.transform(testDF)
rf_auroc = evaluator_auroc.evaluate(prediction_rf)
print('Baseline: Random Forest')
print('Area under ROC curve: %g' % (rf_auroc))
print('Test Error: %g' % (1.0 - rf_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_rf))
Gradient Boosted Trees
¶
gbt = GBTClassifier(labelCol='label',
featuresCol='unscaledFeatures',
maxDepth=5,
maxBins=32,
maxIter=10,
seed=seed_value)
search_time_start = time.time()
pipeline_gbt = Pipeline(stages=[gbt])
pipelineModel_gbt = pipeline_gbt.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)
Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_gbt_smote'
pipelineModel_gbt.save(Path)
prediction_gbt = pipelineModel_gbt.transform(testDF)
gbt_auroc = evaluator_auroc.evaluate(prediction_gbt)
print('Baseline: Gradient Boosted Trees')
print('Area under ROC curve: %g' % (gbt_auroc))
print('Test Error: %g' % (1.0 - gbt_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_gbt))
Model Metrics
¶
print('SMOTE Baseline Models:')
for model in ['prediction_lr', 'prediction_lsvc', 'prediction_dt',
'prediction_rf', 'prediction_gbt']:
df = globals()[model]
tp = df[(df.label == 1) & (df.prediction == 1)].count()
tn = df[(df.label == 0) & (df.prediction == 0)].count()
fp = df[(df.label == 0) & (df.prediction == 1)].count()
fn = df[(df.label == 1) & (df.prediction == 0)].count()
a = ((tp + tn)/df.count())
if(tp + fn == 0.0):
r = 0.0
p = float(tp) / (tp + fp)
elif(tp + fp == 0.0):
r = float(tp) / (tp + fn)
p = 0.0
else:
r = float(tp) / (tp + fn)
p = float(tp) / (tp + fp)
if(p + r == 0):
f1 = 0
else:
f1 = 2 * ((p * r)/(p + r))
print('\nModel:', model)
print('True Positives:', tp)
print('True Negatives:', tn)
print('False Positives:', fp)
print('False Negatives:', fn)
print('Total:', df.count())
print('Accuracy:', a)
print('Recall:', r)
print('Precision: ', p)
print('F1 score:', f1)
print('\n')
LR had the best results for three of the four macro metrics; only LinearSVC had a higher precision score. The precision of LinearSVC was higher for the SMOTE data compared to the US results. DT was the fastest to complete at 61.8 seconds while LinearSVC took the longest at 134.3 seconds. Overall, there were higher recall and F1 scores for the US models, while accuracy and precision were higher for the SMOTE models.
GridSearchCV
¶
The GridSearchCV
models for both the Upsampled
and SMOTE
sets are located here.
Upsampling: Gradient Boosted Trees
¶
From the models evaluated using the defined GridSearchCV parameters, GBT performed the best. So, let's evaluate the performance by first setting up the pipeline components with the different stages (although only the model stage is needed here since scaling the data isn't required), then define the space in the parameter grid, followed by the components within CrossValidator with AUROC as the metric to evaluate, using numFolds=5 and parallelism=5 to assess which combination of parameters results in the highest AUROC.
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
model = GBTClassifier(labelCol='label',
featuresCol='unscaledFeatures',
seed=seed_value)
pipeline_gbt_hpo = Pipeline(stages=[model])
paramGrid = (ParamGridBuilder()
.addGrid(model.maxDepth, [5, 10, 15])
.addGrid(model.maxBins, [32, 64])
.addGrid(model.maxIter, [10])
.build())
cv = CrossValidator(estimator=pipeline_gbt_hpo,
estimatorParamMaps=paramGrid,
evaluator=evaluator_auroc,
numFolds=5, parallelism=5)
search_time_start = time.time()
pipelineModel_gbt_hpo = cv.fit(trainDF)
print('Time to perform GridSearchCV:', time.time() - search_time_start)
path = '/content/drive/MyDrive/LoanStatus/Python/Models/ML/SparkML/Models/GridSearchCV/'
gbt_grid = pipelineModel_gbt_hpo.bestModel
gbt_grid.save(path + '/pipelineModel_gbt_us_hpo_grid')
From the pipeline evaluated, we can now extract the model with the best parameters and then view all of the results (AUROC) for each parameter combination.
print(pipelineModel_gbt_hpo.getEstimatorParamMaps()[np.argmax(pipelineModel_gbt_hpo.avgMetrics)])
list(zip(pipelineModel_gbt_hpo.avgMetrics,
pipelineModel_gbt_hpo.getEstimatorParamMaps()))
Given these parameters resulted in the best objective (highest AUROC), we can use the trained model to predict on the test set and evaluate the AUROC, test error and accuracy on data that was not used for training.
prediction_gbt = pipelineModel_gbt_hpo.transform(testDF)
gbt_auroc = evaluator_auroc.evaluate(prediction_gbt)
print('GridSearchCV: Gradient Boosted Trees')
print('Area under ROC curve: %g' % (gbt_auroc))
print('Test Error: %g ' % (1.0 - gbt_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_gbt))
Then we can extract the information from the pipeline.
pipelineModel_gbt_hpo.bestModel.stages[-1].extractParamMap()
best_mod = pipelineModel_gbt_hpo.bestModel
param_dict = best_mod.stages[-1].extractParamMap()
sane_dict = {}
for k, v in param_dict.items():
sane_dict[k.name] = v
sane_dict
SMOTE: Gradient Boosted Trees
¶
As with the Upsampled set, GBT performed the best out of the models evaluated using the defined grid. We can use the same methods to evaluate this model and compare it with the results from the Upsampled search. To complete this, we need to define the stages of the pipeline, the parameters in the search space, as well as the estimator, the number of folds for cross validation and the parallelism.
model = GBTClassifier(labelCol='label',
featuresCol='unscaledFeatures',
seed=seed_value)
pipeline_gbt_hpo = Pipeline(stages=[model])
paramGrid = (ParamGridBuilder()
.addGrid(model.maxDepth, [5, 10, 15])
.addGrid(model.maxBins, [32, 64])
.addGrid(model.maxIter, [10])
.build())
cv = CrossValidator(estimator=pipeline_gbt_hpo,
estimatorParamMaps=paramGrid,
evaluator=evaluator_auroc,
numFolds=5, parallelism=5)
search_time_start = time.time()
pipelineModel_gbt_hpo = cv.fit(trainDF)
print('Time to perform GridSearchCV:', time.time() - search_time_start)
gbt_grid = pipelineModel_gbt_hpo.bestModel
gbt_grid.save('/content/drive/MyDrive/LoanStatus/Python/Models/ML/SparkML/Models/GridSearchCV/pipelineModel_gbt_smote_hpo_grid')
From the pipeline evaluated, we can now extract the model with the best parameters and then view all of the results (AUROC) for each parameter combination.
print(pipelineModel_gbt_hpo.getEstimatorParamMaps()[np.argmax(pipelineModel_gbt_hpo.avgMetrics)])
list(zip(pipelineModel_gbt_hpo.avgMetrics,
pipelineModel_gbt_hpo.getEstimatorParamMaps()))
Let's now predict with the pipeline, determine the AUROC
, test error and accuracy and extract the information from the pipeline.
prediction_gbt = pipelineModel_gbt_hpo.transform(testDF)
gbt_auroc = evaluator_auroc.evaluate(prediction_gbt)
print('GridSearchCV: Gradient Boosted Trees')
print('Area under ROC curve: %g' % (gbt_auroc))
print('Test Error: %g ' % (1.0 - gbt_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_gbt))
pipelineModel_gbt_hpo.bestModel.stages[-1].extractParamMap()
From the search, we can determine the best model and the associated parameters.
bestPipeline = pipelineModel_gbt_hpo.bestModel
bestModel = bestPipeline.stages[-1]
bestParams = bestModel.extractParamMap()
print(bestPipeline)
print('\n')
print(bestModel)
print('\n')
print(bestParams)
best_mod = pipelineModel_gbt_hpo.bestModel
param_dict = best_mod.stages[-1].extractParamMap()
sane_dict = {}
for k, v in param_dict.items():
sane_dict[k.name] = v
sane_dict
GridSearchCV Best Models
¶
from pyspark.ml import PipelineModel
path = '/content/drive/MyDrive/LoanStatus/Python/Models/ML/SparkML/Models/GridSearchCV/'
pipelineModel_lr_hpo_US = PipelineModel.load(path + '/pipelineModel_lr_us_hpo_grid/')
pipelineModel_lsvc_hpo_US = PipelineModel.load(path + '/pipelineModel_lsvc_us_hpo_grid/')
pipelineModel_dt_hpo_US = PipelineModel.load(path + '/pipelineModel_dt_us_hpo_grid/')
pipelineModel_rf_hpo_US = PipelineModel.load(path + '/pipelineModel_rf_us_hpo_grid/')
pipelineModel_gbt_hpo_US = PipelineModel.load(path + '/pipelineModel_gbt_us_hpo_grid/')
prediction_lr = pipelineModel_lr_hpo_US.transform(testDF_US)
prediction_lsvc = pipelineModel_lsvc_hpo_US.transform(testDF_US)
prediction_dt = pipelineModel_dt_hpo_US.transform(testDF_US)
prediction_rf = pipelineModel_rf_hpo_US.transform(testDF_US)
prediction_gbt = pipelineModel_gbt_hpo_US.transform(testDF_US)
print('GridSearchCV Best Models Metrics: Upsampling')
print('\n')
print('Area Under ROC Curve:')
print('Logistic Regression:', evaluator_auroc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_auroc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_auroc.evaluate(prediction_dt))
print('Random Forest:', evaluator_auroc.evaluate(prediction_rf))
print('Gradient Boosted Trees:', evaluator_auroc.evaluate(prediction_gbt))
print('\n')
print('Accuracy:')
print('Logistic Regression:', evaluator_acc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_acc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_acc.evaluate(prediction_dt))
print('Random Forest:', evaluator_acc.evaluate(prediction_rf))
print('Gradient Boosted Trees:', evaluator_acc.evaluate(prediction_gbt))
print('GridSearchCV Best Models Metrics: Upsampling')
for model in ['prediction_lr', 'prediction_lsvc', 'prediction_dt',
'prediction_rf', 'prediction_gbt']:
df = globals()[model]
tp = df[(df.label == 1) & (df.prediction == 1)].count()
tn = df[(df.label == 0) & (df.prediction == 0)].count()
fp = df[(df.label == 0) & (df.prediction == 1)].count()
fn = df[(df.label == 1) & (df.prediction == 0)].count()
a = ((tp + tn)/df.count())
if(tp + fn == 0.0):
r = 0.0
p = float(tp) / (tp + fp)
elif(tp + fp == 0.0):
r = float(tp) / (tp + fn)
p = 0.0
else:
r = float(tp) / (tp + fn)
p = float(tp) / (tp + fp)
if(p + r == 0):
f1 = 0
else:
f1 = 2 * ((p * r)/(p + r))
print('\nModel:', model)
print('True Positives:', tp)
print('True Negatives:', tn)
print('False Positives:', fp)
print('False Negatives:', fn)
print('Total:', df.count())
print('Accuracy:', a)
print('Recall:', r)
print('Precision: ', p)
print('F1 score:', f1)
print('\n')
print('\n')
The best AUROC and Accuracy were 0.9849 and 98.24%, respectively, using GBT. The best precision was 95.54% using LinearSVC, the best recall was 92.30% using DT and the best F1 score was 92.95% with GBT.
Load Saved Models - SMOTE
¶
Now we can use the same methods mentioned above to evaluate the SMOTE
models using the respective testDF
.
pipelineModel_lr_hpo_SMOTE = PipelineModel.load(path + '/pipelineModel_lr_smote_hpo_grid/')
pipelineModel_lsvc_hpo_SMOTE = PipelineModel.load(path + '/pipelineModel_lsvc_smote_hpo_grid/')
pipelineModel_rf_hpo_SMOTE = PipelineModel.load(path + '/pipelineModel_rf_smote_hpo_grid/')
pipelineModel_gbt_hpo_SMOTE = PipelineModel.load(path + '/pipelineModel_gbt_smote_hpo_grid/')
pipelineModel_dt_hpo_SMOTE = PipelineModel.load(path + '/pipelineModel_dt_smote_hpo_grid/')
prediction_lr = pipelineModel_lr_hpo_SMOTE.transform(testDF_SMOTE)
prediction_lsvc = pipelineModel_lsvc_hpo_SMOTE.transform(testDF_SMOTE)
prediction_dt = pipelineModel_dt_hpo_SMOTE.transform(testDF_SMOTE)
prediction_rf = pipelineModel_rf_hpo_SMOTE.transform(testDF_SMOTE)
prediction_gbt = pipelineModel_gbt_hpo_SMOTE.transform(testDF_SMOTE)
print('GridSearchCV Best Models Metrics: SMOTE')
print('\n')
print('Area Under ROC Curve:')
print('Logistic Regression:', evaluator_auroc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_auroc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_auroc.evaluate(prediction_dt))
print('Random Forest:', evaluator_auroc.evaluate(prediction_rf))
print('Gradient Boosted Trees:', evaluator_auroc.evaluate(prediction_gbt))
print('\n')
print('Accuracy:')
print('Logistic Regression:', evaluator_acc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_acc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_acc.evaluate(prediction_dt))
print('Random Forest:', evaluator_acc.evaluate(prediction_rf))
print('Gradient Boosted Trees:', evaluator_acc.evaluate(prediction_gbt))
print('GridSearchCV Best Models Metrics: SMOTE')
for model in ['prediction_lr', 'prediction_lsvc', 'prediction_dt',
'prediction_rf', 'prediction_gbt']:
df = globals()[model]
tp = df[(df.label == 1) & (df.prediction == 1)].count()
tn = df[(df.label == 0) & (df.prediction == 0)].count()
fp = df[(df.label == 0) & (df.prediction == 1)].count()
fn = df[(df.label == 1) & (df.prediction == 0)].count()
a = ((tp + tn)/df.count())
if(tp + fn == 0.0):
r = 0.0
p = float(tp) / (tp + fp)
elif(tp + fp == 0.0):
r = float(tp) / (tp + fn)
p = 0.0
else:
r = float(tp) / (tp + fn)
p = float(tp) / (tp + fp)
if(p + r == 0):
f1 = 0
else:
f1 = 2 * ((p * r)/(p + r))
print('\nModel:', model)
print('True Positives:', tp)
print('True Negatives:', tn)
print('False Positives:', fp)
print('False Negatives:', fn)
print('Total:', df.count())
print('Accuracy:', a)
print('Recall:', r)
print('Precision: ', p)
print('F1 score:', f1)
print('\n')
For the best GridSearchCV SMOTE models, the best AUROC and Accuracy were 0.9865 and 98.59%, respectively, using GBT. The best precision was 95.54% using LinearSVC, the best recall was 92.30% using DT and the best F1 score was 92.95% with GBT.
For the best GridSearchCV Upsampled models, the best AUROC and Accuracy were 0.9849 and 98.24%, respectively, using GBT. The best precision was 95.54% using LinearSVC, the best recall was 92.30% using DT and the best F1 score was 92.95% with GBT.
Load Saved Models - Upsampled
¶
After loading the models, let's predict using the test set of the Upsampled
data and evaluate the model metrics from the models where AUROC
was monitored.
from pyspark.ml import PipelineModel
path = '/content/drive/MyDrive/LoanStatus/Python/Models/ML/SparkML/Models/Hyperopt/'
pipelineModel_lr_hyperopt_auroc_US = PipelineModel.load(path
+ '/pipelineModel_lr_hyperopt_us_auroc_100trials/')
pipelineModel_lsvc_hyperopt_auroc_US = PipelineModel.load(path
+ '/pipelineModel_lsvc_hyperopt_us_auroc/')
pipelineModel_dt_hyperopt_auroc_US = PipelineModel.load(path
+ '/pipelineModel_dt_hyperopt_us_auroc2/')
pipelineModel_rf_hyperopt_auroc_US = PipelineModel.load(path
+ '/pipelineModel_rf_hyperopt_us_auroc_30trials/')
pipelineModel_rf_hyperopt_auroc1_US = PipelineModel.load(path
+ '/pipelineModel_rf_hyperopt_us_auroc_moreParams/')
pipelineModel_gbt_hyperopt_auroc_US = PipelineModel.load(path
+ '/pipelineModel_gbt_hyperopt_us_auroc/')
pipelineModel_lr_hyperopt_f1_US = PipelineModel.load(path
+ '/pipelineModel_lr_hyperopt_us_f1_100trials/')
pipelineModel_lsvc_hyperopt_f1_US = PipelineModel.load(path
+ '/pipelineModel_lsvc_hyperopt_us_f1/')
pipelineModel_dt_hyperopt_f1_US = PipelineModel.load(path
+ '/pipelineModel_dt_hyperopt_us_f1/')
pipelineModel_rf_hyperopt_f1_US = PipelineModel.load(path
+ '/pipelineModel_rf_hyperopt_us_f1_30trials/')
pipelineModel_rf_hyperopt_f1_1_US = PipelineModel.load(path
+ '/pipelineModel_rf_hyperopt_us_f1_moreParams/')
pipelineModel_gbt_hyperopt_f1_US = PipelineModel.load(path
+ '/pipelineModel_gbt_hyperopt_us_f1/')
prediction_lr = pipelineModel_lr_hyperopt_auroc_US.transform(testDF_US)
prediction_lsvc = pipelineModel_lsvc_hyperopt_auroc_US.transform(testDF_US)
prediction_dt = pipelineModel_dt_hyperopt_auroc_US.transform(testDF_US)
prediction_rf = pipelineModel_rf_hyperopt_auroc_US.transform(testDF_US)
prediction_rf1 = pipelineModel_rf_hyperopt_auroc1_US.transform(testDF_US)
prediction_gbt = pipelineModel_gbt_hyperopt_auroc_US.transform(testDF_US)
print('Hyperopt Best Models AUROC Metrics: Upsampling')
print('\n')
print('Area Under ROC Curve:')
print('Logistic Regression:', evaluator_auroc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_auroc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_auroc.evaluate(prediction_dt))
print('Random Forest:', evaluator_auroc.evaluate(prediction_rf))
print('Random Forest - More Params:', evaluator_auroc.evaluate(prediction_rf1))
print('Gradient Boosted Trees:', evaluator_auroc.evaluate(prediction_gbt))
print('\n')
print('Accuracy:')
print('Logistic Regression:', evaluator_acc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_acc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_acc.evaluate(prediction_dt))
print('Random Forest:', evaluator_acc.evaluate(prediction_rf))
print('Random Forest - More Params:', evaluator_acc.evaluate(prediction_rf1))
print('Gradient Boosted Trees:', evaluator_acc.evaluate(prediction_gbt))
print('Hyperopt Best Models AUROC Metrics: Upsampling')
for model in ['prediction_lr', 'prediction_lsvc', 'prediction_dt',
'prediction_rf', 'prediction_rf1', 'prediction_gbt']:
df = globals()[model]
tp = df[(df.label == 1) & (df.prediction == 1)].count()
tn = df[(df.label == 0) & (df.prediction == 0)].count()
fp = df[(df.label == 0) & (df.prediction == 1)].count()
fn = df[(df.label == 1) & (df.prediction == 0)].count()
a = ((tp + tn)/df.count())
if(tp + fn == 0.0):
r = 0.0
p = float(tp) / (tp + fp)
elif(tp + fp == 0.0):
r = float(tp) / (tp + fn)
p = 0.0
else:
r = float(tp) / (tp + fn)
p = float(tp) / (tp + fp)
if(p + r == 0):
f1 = 0
else:
f1 = 2 * ((p * r)/(p + r))
print('\nModel:', model)
print('True Positives:', tp)
print('True Negatives:', tn)
print('False Positives:', fp)
print('False Negatives:', fn)
print('Total:', df.count())
print('Accuracy:', a)
print('Recall:', r)
print('Precision: ', p)
print('F1 score:', f1)
print('\n')
print('\n')
For the best Hyperopt Upsampled models where AUROC was monitored, the best AUROC and Accuracy were 0.9870 and 98.58%, respectively, using LR. The best precision was 97.55% using RF, the best recall was 92.23% using GBT and the best F1 score was 94.25% with LR.
Predict and F1 Model Metrics using testDF of Upsampled Set
¶
Let's now predict using the test set of the Upsampled
data and evaluate the model metrics from the models where F1 score
was monitored.
prediction_lr = pipelineModel_lr_hyperopt_f1_US.transform(testDF_US)
prediction_lsvc = pipelineModel_lsvc_hyperopt_f1_US.transform(testDF_US)
prediction_dt = pipelineModel_dt_hyperopt_f1_US.transform(testDF_US)
prediction_rf = pipelineModel_rf_hyperopt_f1_US.transform(testDF_US)
prediction_rf1 = pipelineModel_rf_hyperopt_f1_1_US.transform(testDF_US)
prediction_gbt = pipelineModel_gbt_hyperopt_f1_US.transform(testDF_US)
print('Hyperopt Best Models F1 Metrics: Upsampling')
print('\n')
print('Area Under ROC Curve:')
print('Logistic Regression:', evaluator_auroc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_auroc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_auroc.evaluate(prediction_dt))
print('Random Forest:', evaluator_auroc.evaluate(prediction_rf))
print('Random Forest - More Params:', evaluator_auroc.evaluate(prediction_rf1))
print('Gradient Boosted Trees:', evaluator_auroc.evaluate(prediction_gbt))
print('\n')
print('Accuracy:')
print('Logistic Regression:', evaluator_acc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_acc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_acc.evaluate(prediction_dt))
print('Random Forest:', evaluator_acc.evaluate(prediction_rf))
print('Random Forest - More Params:', evaluator_acc.evaluate(prediction_rf1))
print('Gradient Boosted Trees:', evaluator_acc.evaluate(prediction_gbt))
print('Hyperopt Best Models F1 Metrics: Upsampling')
for model in ['prediction_lr', 'prediction_lsvc', 'prediction_dt',
'prediction_rf', 'prediction_rf1', 'prediction_gbt']:
df = globals()[model]
tp = df[(df.label == 1) & (df.prediction == 1)].count()
tn = df[(df.label == 0) & (df.prediction == 0)].count()
fp = df[(df.label == 0) & (df.prediction == 1)].count()
fn = df[(df.label == 1) & (df.prediction == 0)].count()
a = ((tp + tn)/df.count())
if(tp + fn == 0.0):
r = 0.0
p = float(tp) / (tp + fp)
elif(tp + fp == 0.0):
r = float(tp) / (tp + fn)
p = 0.0
else:
r = float(tp) / (tp + fn)
p = float(tp) / (tp + fp)
if(p + r == 0):
f1 = 0
else:
f1 = 2 * ((p * r)/(p + r))
print('\nModel:', model)
print('True Positives:', tp)
print('True Negatives:', tn)
print('False Positives:', fp)
print('False Negatives:', fn)
print('Total:', df.count())
print('Accuracy:', a)
print('Recall:', r)
print('Precision: ', p)
print('F1 score:', f1)
print('\n')
print('\n')
For the best Hyperopt Upsampled models where F1 score was monitored, the best AUROC was 0.9870 using LR and the best Accuracy was 98.62% using RF. The best precision was 99.51% using RF, the best recall was 89.48% using RF and the best F1 score was 94.23% with RF.
Load Saved Models - SMOTE
¶
After loading the models, let's predict using the test set of the SMOTE
data and evaluate the model metrics from the models where AUROC
was monitored.
pipelineModel_lr_hyperopt_auroc_SMOTE = PipelineModel.load(path
+ '/pipelineModel_lr_hyperopt_smote_auroc_100trials/')
pipelineModel_lsvc_hyperopt_auroc_SMOTE = PipelineModel.load(path
+ '/pipelineModel_lsvc_hyperopt_smote_auroc/')
pipelineModel_dt_hyperopt_auroc_SMOTE = PipelineModel.load(path
+ '/pipelineModel_dt_hyperopt_smote_auroc/')
pipelineModel_rf_hyperopt_auroc_SMOTE = PipelineModel.load(path
+ '/pipelineModel_rf_hyperopt_smote_auroc_30trials/')
pipelineModel_rf_hyperopt_auroc1_SMOTE = PipelineModel.load(path
+ '/pipelineModel_rf_hyperopt_smote_auroc_moreParams/')
pipelineModel_gbt_hyperopt_auroc_SMOTE = PipelineModel.load(path
+ '/pipelineModel_gbt_hyperopt_smote_auroc/')
pipelineModel_lr_hyperopt_f1_SMOTE = PipelineModel.load(path
+ '/pipelineModel_lr_hyperopt_smote_f1_100trials/')
pipelineModel_lsvc_hyperopt_f1_SMOTE = PipelineModel.load(path
+ '/pipelineModel_lsvc_hyperopt_smote_f1/')
pipelineModel_dt_hyperopt_f1_SMOTE = PipelineModel.load(path
+ '/pipelineModel_dt_hyperopt_smote_f1/')
pipelineModel_rf_hyperopt_f1_SMOTE = PipelineModel.load(path
+ '/pipelineModel_rf_hyperopt_smote_f1_29trials/')
pipelineModel_rf_hyperopt_f1_1_SMOTE = PipelineModel.load(path
+ '/pipelineModel_rf_hyperopt_smote_f1_moreParams/')
pipelineModel_gbt_hyperopt_f1_SMOTE = PipelineModel.load(path
+ '/pipelineModel_gbt_hyperopt_smote_f1/')
prediction_lr = pipelineModel_lr_hyperopt_auroc_SMOTE.transform(testDF_SMOTE)
prediction_lsvc = pipelineModel_lsvc_hyperopt_auroc_SMOTE.transform(testDF_SMOTE)
prediction_dt = pipelineModel_dt_hyperopt_auroc_SMOTE.transform(testDF_SMOTE)
prediction_rf = pipelineModel_rf_hyperopt_auroc_SMOTE.transform(testDF_SMOTE)
prediction_rf1 = pipelineModel_rf_hyperopt_auroc1_SMOTE.transform(testDF_SMOTE)
prediction_gbt = pipelineModel_gbt_hyperopt_auroc_SMOTE.transform(testDF_SMOTE)
print('Hyperopt Best Models AUROC Metrics: SMOTE')
print('\n')
print('Area Under ROC Curve:')
print('Logistic Regression:', evaluator_auroc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_auroc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_auroc.evaluate(prediction_dt))
print('Random Forest:', evaluator_auroc.evaluate(prediction_rf))
print('Random Forest - More Params:', evaluator_auroc.evaluate(prediction_rf1))
print('Gradient Boosted Trees:', evaluator_auroc.evaluate(prediction_gbt))
print('\n')
print('Accuracy:')
print('Logistic Regression:', evaluator_acc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_acc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_acc.evaluate(prediction_dt))
print('Random Forest:', evaluator_acc.evaluate(prediction_rf))
print('Random Forest - More Params:', evaluator_acc.evaluate(prediction_rf1))
print('Gradient Boosted Trees:', evaluator_acc.evaluate(prediction_gbt))
print('Hyperopt Best Models AUROC Metrics: SMOTE')
for model in ['prediction_lr', 'prediction_lsvc', 'prediction_dt',
'prediction_rf', 'prediction_rf1', 'prediction_gbt']:
df = globals()[model]
tp = df[(df.label == 1) & (df.prediction == 1)].count()
tn = df[(df.label == 0) & (df.prediction == 0)].count()
fp = df[(df.label == 0) & (df.prediction == 1)].count()
fn = df[(df.label == 1) & (df.prediction == 0)].count()
a = ((tp + tn)/df.count())
if(tp + fn == 0.0):
r = 0.0
p = float(tp) / (tp + fp)
elif(tp + fp == 0.0):
r = float(tp) / (tp + fn)
p = 0.0
else:
r = float(tp) / (tp + fn)
p = float(tp) / (tp + fp)
if(p + r == 0):
f1 = 0
else:
f1 = 2 * ((p * r)/(p + r))
print('\nModel:', model)
print('True Positives:', tp)
print('True Negatives:', tn)
print('False Positives:', fp)
print('False Negatives:', fn)
print('Total:', df.count())
print('Accuracy:', a)
print('Recall:', r)
print('Precision: ', p)
print('F1 score:', f1)
print('\n')
For the best Hyperopt SMOTE models where AUROC was monitored, the best AUROC and Accuracy were 0.98712 and 98.66%, respectively, using GBT. The best precision was 99.69% using RF, the best recall was 91.56% using LR and the best F1 score was 94.23% with LR.
For the best Hyperopt Upsampled models where AUROC was monitored, the best AUROC and Accuracy were 0.9870 and 98.58%, respectively, using LR. The best precision was 97.55% using RF, the best recall was 92.23% using GBT and the best F1 score was 94.25% with LR.
Predict and F1 Model Metrics using testDF of SMOTE Set
¶
Let's now predict using the test set of the SMOTE
data and evaluate the model metrics from the models where F1 score
was monitored.
prediction_lr = pipelineModel_lr_hyperopt_f1_SMOTE.transform(testDF_SMOTE)
prediction_lsvc = pipelineModel_lsvc_hyperopt_f1_SMOTE.transform(testDF_SMOTE)
prediction_dt = pipelineModel_dt_hyperopt_f1_SMOTE.transform(testDF_SMOTE)
prediction_rf = pipelineModel_rf_hyperopt_f1_SMOTE.transform(testDF_SMOTE)
prediction_rf1 = pipelineModel_rf_hyperopt_f1_1_SMOTE.transform(testDF_SMOTE)
prediction_gbt = pipelineModel_gbt_hyperopt_f1_SMOTE.transform(testDF_SMOTE)
print('Hyperopt Best Models F1 Metrics: SMOTE')
print('\n')
print('Area Under ROC Curve:')
print('Logistic Regression:', evaluator_auroc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_auroc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_auroc.evaluate(prediction_dt))
print('Random Forest:', evaluator_auroc.evaluate(prediction_rf))
print('Random Forest - More Params:', evaluator_auroc.evaluate(prediction_rf1))
print('Gradient Boosted Trees:', evaluator_auroc.evaluate(prediction_gbt))
print('\n')
print('Accuracy:')
print('Logistic Regression:', evaluator_acc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_acc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_acc.evaluate(prediction_dt))
print('Random Forest:', evaluator_acc.evaluate(prediction_rf))
print('Random Forest - More Params:', evaluator_acc.evaluate(prediction_rf1))
print('Gradient Boosted Trees:', evaluator_acc.evaluate(prediction_gbt))
print('Hyperopt Best Models F1 Metrics: SMOTE')
for model in ['prediction_lr', 'prediction_lsvc', 'prediction_dt',
'prediction_rf', 'prediction_rf1', 'prediction_gbt']:
df = globals()[model]
tp = df[(df.label == 1) & (df.prediction == 1)].count()
tn = df[(df.label == 0) & (df.prediction == 0)].count()
fp = df[(df.label == 0) & (df.prediction == 1)].count()
fn = df[(df.label == 1) & (df.prediction == 0)].count()
a = ((tp + tn)/df.count())
if(tp + fn == 0.0):
r = 0.0
p = float(tp) / (tp + fp)
elif(tp + fp == 0.0):
r = float(tp) / (tp + fn)
p = 0.0
else:
r = float(tp) / (tp + fn)
p = float(tp) / (tp + fp)
if(p + r == 0):
f1 = 0
else:
f1 = 2 * ((p * r)/(p + r))
print('\nModel:', model)
print('True Positives:', tp)
print('True Negatives:', tn)
print('False Positives:', fp)
print('False Negatives:', fn)
print('Total:', df.count())
print('Accuracy:', a)
print('Recall:', r)
print('Precision: ', p)
print('F1 score:', f1)
print('\n')
For the best Hyperopt SMOTE models where F1 score was monitored, the best AUROC and Accuracy were 0.9870 and 98.67%, respectively, using GBT. The best precision was 99.60% using RF, the best recall was 91.56% using LR and the best F1 score was 94.53% with LR.
For the best Hyperopt Upsampled models where F1 score was monitored, the best AUROC was 0.9870 using LR and the best Accuracy was 98.62% using RF. The best precision was 99.51% using RF, the best recall was 89.48% using RF and the best F1 score was 94.23% with RF.
To set up LightGBM with GPU support, the repository first needs to be cloned so it can be compiled with the GPU option enabled.
%cd /content/drive/MyDrive/
!git clone --recursive https://github.com/Microsoft/LightGBM
Then we can navigate to the cloned repository, create a build directory within it, and use cmake to compile and build with the GPU option.
%cd /content/drive/MyDrive/LightGBM
!mkdir build
%cd build
!cmake -DUSE_GPU=1 ..
!make -j$(nproc)
Next, we can install pip
, setuptools
and the required dependencies to set up the environment for utilizing LightGBM
with GPU capabilities.
!sudo apt-get -y install python-pip
!sudo -H pip install setuptools optuna plotly eli5 shap lime -U
Now, we can move to the python-package
subdirectory and use setup.py
to compile the dependencies needed so that data will be processed on GPUs when using the LightGBM
package.
%cd /content/drive/MyDrive/LightGBM/python-package
!sudo python3 setup.py install --precompile --gpu
The notebook containing the baseline models and hyperparameter searches using Optuna
for the Upsampled
and SMOTE
sets is located here.
Let's first set up the environment by importing the dependencies/declaring their options, setting the name of the environment, the random
and numpy
seed, followed by examining which CUDA
compiler and GPU
is available.
import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')
seed_value = 42
os.environ['lgb_GPU'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
Upsampled: Baseline Model
¶
The train/test sets can be read into separate pandas.DataFrames
, then the features as X_train
or X_test
and the labels as y_train
or y_test
can be separated for both sets. From the original training set, the label can be defined as loan_status
, and the remaining as features for the train/test sets with the target removed and the feature column names extracted and converted to a numpy.array
. Let's fit the baseline model using the default parameters and evaluate the model performance.
import pandas as pd
trainDF = pd.read_csv('trainDF_US.csv', low_memory=False)
testDF = pd.read_csv('testDF_US.csv', low_memory=False)
X_train = trainDF.drop('loan_status', axis=1)
y_train = trainDF.loan_status
X_test = testDF.drop('loan_status', axis=1)
y_test = testDF.loan_status
label = trainDF[['loan_status']]
features = trainDF.drop(columns = ['loan_status'])
features = features.columns
features = np.array(features)
import lightgbm as lgb
import pickle
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score
from sklearn.metrics import recall_score, f1_score
lgbc = lgb.LGBMClassifier(boosting_type='gbdt',
device='gpu',
gpu_platform_id=0,
gpu_device_id=0,
verbosity=-1)
lgbc.fit(X_train, y_train)
Pkl_Filename = 'lightGBM_Upsampling_Baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(lgbc, file)
print('\nModel Metrics for LightGBM Baseline Upsampling')
y_train_pred = lgbc.predict(X_train)
y_train_pred = y_train_pred.round(2)
y_train_pred = np.where(y_train_pred > 0.5, 1, 0)
y_test_pred = lgbc.predict(X_test)
y_test_pred = y_test_pred.round(2)
y_test_pred = np.where(y_test_pred > 0.5, 1, 0)
print('Training Set')
print('Classification Report:')
clf_rpt = classification_report(y_train, y_train_pred)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_train, y_train_pred))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_train, y_train_pred))
print('Precision score : %.3f' % precision_score(y_train, y_train_pred))
print('Recall score : %.3f' % recall_score(y_train, y_train_pred))
print('F1 score : %.3f' % f1_score(y_train, y_train_pred))
print('\n')
print('Test Set')
print('Classification Report:')
clf_rpt = classification_report(y_test, y_test_pred)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test, y_test_pred))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test, y_test_pred))
print('Precision score : %.3f' % precision_score(y_test, y_test_pred))
print('Recall score : %.3f' % recall_score(y_test, y_test_pred))
print('F1 score : %.3f' % f1_score(y_test, y_test_pred))
Overall, the model metrics are over 90%, so the baseline model performs quite well. The lowest metrics are the precision, recall and the F1 score, so let's try monitoring the log_loss
to see if hyperparameter tuning results in better model performance.
1000 Trials 5-fold Cross Validation
¶
For hyperparameter tuning using Optuna
, let's define a multi-faceted objective
function for the optimization of the hyperparameters which uses joblib
to save each trial parameters and the loss in a pickle
file that can be reloaded to continue optimizing. Then we can define the parameter space of the hyperparameters to be tested:
- random_state: seed_value designated for reproducibility. default=None.
- device_type: device for the tree learning. default=cpu.
- objective: specify the learning task and the corresponding learning objective, or a custom objective function to be used. default=None (regression).
- metric: metric(s) to be evaluated on the evaluation set(s). default="".
- boosting_type: gbdt, traditional Gradient Boosting Decision Tree; dart, Dropouts meet Multiple Additive Regression Trees; rf, Random Forest. default='gbdt'.
- n_estimators: number of boosted trees to fit. default=100.
- learning_rate: boosting learning rate. default=0.1.
- num_leaves: maximum tree leaves for base learners. default=31.
- bagging_freq: frequency for bagging. default=0.
- subsample: randomly select part of the data without resampling. default=1.
- colsample_bytree: subsample ratio of columns when constructing each tree. default=1.
- max_depth: maximum tree depth for base learners, <=0 means no limit. default=-1.
- lambda_l1: L1 regularization. default=0.
- lambda_l2: L2 regularization. default=0.
- min_gain_to_split: the minimal gain to perform a split. default=0.
- min_child_samples: minimal number of data in one leaf; can be used to deal with over-fitting. default=20.
- verbosity: controls the level of LightGBM's verbosity. default=1 (Info).
We need to utilize the same k-folds for reproducibility, so let's use 5 as the number of folds with which to split the training set into train and validation sets, with shuffle=True and the initially defined seed value. The LGBMClassifier model is specified to use the parameters within the params_lgb_optuna dictionary as well as early_stopping_rounds=150. Each trial is timed, and the binary_logloss is computed from the validation labels and the predicted labels. The mean score across folds is tracked with the objective of obtaining the lowest loss, and unpromising trials are pruned using callbacks=[LightGBMPruningCallback(trial, 'binary_logloss')].
import optuna
from optuna import Trial
optuna.logging.set_verbosity(optuna.logging.WARNING)
from optuna.integration import LightGBMPruningCallback
import joblib
from sklearn.model_selection import KFold, cross_val_score
from timeit import default_timer as timer
import lightgbm as lgb
from sklearn.metrics import log_loss
def objective(trial):
"""
Objective function to tune a LightGBMClassifier model.
"""
joblib.dump(study, 'lightGBM_US_Optuna_1000_GPU.pkl')
params_lgb_optuna = {
'random_state': seed_value,
'device_type': 'gpu',
'objective': 'binary',
'metric': 'binary_logloss',
'boosting_type': 'gbdt',
'n_estimators': trial.suggest_int('n_estimators', 100, 500, step=10),
'learning_rate': trial.suggest_loguniform('learning_rate', 1e-6, 1e-1),
'num_leaves': trial.suggest_int('num_leaves', 100, 1000, step=20),
'bagging_freq': trial.suggest_int('bagging_freq', 3, 10),
'subsample': trial.suggest_float('subsample', 0.5, 0.95),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.4, 0.9),
'max_depth': trial.suggest_int('max_depth', 3, 12),
'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 1e-1, log=True),
'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 1e-1, log=True),
'min_gain_to_split': trial.suggest_float('min_gain_to_split', 1, 15),
'min_child_samples': trial.suggest_int('min_child_samples', 100, 500,
step=10),
'verbosity': -1}
kf = KFold(n_splits=5, shuffle=True, random_state=seed_value)
# Preallocate the per-fold scores once, outside the loop, so every fold's
# log loss is retained when the mean is computed at the end
cv_scores = np.empty(5)
for idx, (trn_idx, val_idx) in enumerate(kf.split(trainDF[features],
label)):
train_features, train_label = trainDF[features].iloc[trn_idx], label.iloc[trn_idx]
val_features, val_label = trainDF[features].iloc[val_idx], label.iloc[val_idx]
start = timer()
model = lgb.LGBMClassifier(**params_lgb_optuna,
early_stopping_rounds=150)
model.fit(train_features, train_label.values.ravel(),
eval_set = [(val_features, val_label.values.ravel())],
eval_metric='binary_logloss',
callbacks=[LightGBMPruningCallback(trial, 'binary_logloss')])
y_pred_val = model.predict_proba(val_features)
cv_scores[idx] = log_loss(val_label, y_pred_val)
run_time = timer() - start
return np.mean(cv_scores)
We can leverage if/else conditional statements to load the .pkl file if it already exists in the directory and, if it doesn't, create a new optuna.study with the objective of minimizing the log_loss over 1000 trials.
import time
search_time_start = time.time()
if os.path.isfile('lightGBM_US_Optuna_1000_GPU.pkl'):
study = joblib.load('lightGBM_US_Optuna_1000_GPU.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=1000)
print('Time to run HPO:', time.time() - search_time_start)
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest LogLoss', study.best_value)
Results from the Hyperparameter Search
¶
Let's now extract the trial number, logloss
and hyperparameter value into a dataframe and sort with the lowest loss first.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'logloss'}, inplace=True)
trials_df.rename(columns={'params_bagging_freq': 'bagging_freq'}, inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
inplace=True)
trials_df.rename(columns={'params_lambda_l1': 'lambda_l1'}, inplace=True)
trials_df.rename(columns={'params_lambda_l2': 'lambda_l2'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_samples': 'min_child_samples'},
inplace=True)
trials_df.rename(columns={'params_min_gain_to_split': 'min_gain_to_split'},
inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_num_leaves': 'num_leaves'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('logloss', ascending=True)
print(trials_df)
Optuna contains many ways to visualize the parameters tested during the search using the visualization module, including plot_parallel_coordinate, plot_slice, plot_contour, plot_param_importances, plot_optimization_history and plot_edf. Let's utilize some of these components to examine the hyperparameters tested, starting with plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search and plot_param_importances to plot the importance of the hyperparameters.
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()
The darker blue signifies lower values for the log_loss and shows where a higher number of trials gravitated during the study. To note, the scatterplot of the regularization hyperparameters (lambda_l1, lambda_l2) did not reveal trends over the trial iterations, which might have been affected by the wide range (0-0.1).
Let's now visualize the parameter importances by utilizing the plot_param_importances
feature from the Optuna
package.
fig = optuna.visualization.plot_param_importances(study)
fig.show()
The most important hyperparameters for the search using the Upsampled
set are n_estimators
= 0.31, min_child_samples
= 0.21 and lambda_l2
= 0.20. We will need to compare this order and magnitude with the results from the search using the SMOTE
set.
Next, let's arrange the best parameters to re-create the best model and fit using the training data and save it as a .pkl
file. Then, we can evaluate the model metrics for both the training and the test sets using the classification report, confusion matrix as well as accuracy, precision, recall and F1 score from sklearn.metrics
.
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'binary_error'
params
import pickle
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score
from sklearn.metrics import recall_score, f1_score
train_label = trainDF[['loan_status']]
train_features = trainDF.drop(columns = ['loan_status'])
test_features = testDF.drop(columns = ['loan_status'])
best_model = lgb.LGBMClassifier(boosting_type='gbdt',
device='gpu',
gpu_platform_id=0,
gpu_device_id=0,
verbosity=-1,
**params)
best_model.fit(train_features, train_label.values.ravel())
Pkl_Filename = 'lightGBM_Upsampling_Optuna_trials1000_GPU.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for lightGBM HPO Upsampling 1000 GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('\n')
print('Classification Report: Train')
clf_rpt = classification_report(y_train, y_train_pred)
print(clf_rpt)
print('\n')
print('Confusion matrix: Train')
print(confusion_matrix(y_train, y_train_pred))
print('\n')
print('Classification Report: Test')
clf_rpt = classification_report(y_test, y_test_pred)
print(clf_rpt)
print('\n')
print('Confusion matrix: Test')
print(confusion_matrix(y_test, y_test_pred))
print('\n')
print('Accuracy score: train: %.3f, test: %.3f' % (
accuracy_score(y_train, y_train_pred),
accuracy_score(y_test, y_test_pred)))
print('Precision score: train: %.3f, test: %.3f' % (
precision_score(y_train, y_train_pred),
precision_score(y_test, y_test_pred)))
print('Recall score: train: %.3f, test: %.3f' % (
recall_score(y_train, y_train_pred),
recall_score(y_test, y_test_pred)))
print('F1 score: train: %.3f, test: %.3f' % (
f1_score(y_train, y_train_pred),
f1_score(y_test, y_test_pred)))
All of the model metrics, besides the recall score, increased compared to the baseline model. The reduction in the recall score is 0.03, which is potentially negligible.
Let's now examine the best model's accuracy on the test set and the trial conditions that achieved it.
from sklearn.metrics import roc_auc_score
print('The best model from optimization scores {:.5f} Accuracy on the test set.'.format(accuracy_score(y_test,
y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
Model Explanations
¶
SHAP (SHapley Additive exPlanations)
¶
Let's use the shap.TreeExplainer
using LightGBM
and generate shap_values
for the features. The summary_plot
shows the global importance of each feature and the distribution of the effect sizes over the set. Let's use the training set and summarize the effects of all the features.
import shap
shap.initjs()
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(train_features)
shap.summary_plot(shap_values, train_features, show=False);
Let's use the test set and summarize the effects of all the features.
shap.initjs()
shap_values = explainer.shap_values(test_features)
shap.summary_plot(shap_values, test_features, show=False);
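Beyond the global summary, a hedged sketch of a single-feature dependence plot (using out_prncp purely as an illustrative feature from the discussion below) could look like the following:
# Sketch: SHAP dependence plot for one feature against its SHAP values.
# For a binary LightGBM model, shap_values may be a per-class list, so the
# positive class is selected when that is the case.
vals = shap_values[1] if isinstance(shap_values, list) else shap_values
shap.dependence_plot('out_prncp', vals, test_features, show=False);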
Model Metrics with Eli5
¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
import eli5
from eli5.sklearn import PermutationImportance
X_test1 = pd.DataFrame(test_features, columns=test_features.columns)
perm_importance = PermutationImportance(best_model,
random_state=seed_value).fit(test_features,
y_test)
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test1.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn
which returns an explanation of the estimator parameters (weights).
from eli5.formatters import format_as_dataframe
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
The features with the highest weights are out_prncp
(0.248844), total_pymnt
(0.239985) and recoveries
(0.066753).
Now we can use ELI5
to show the prediction.
html_obj2 = eli5.show_prediction(best_model, test_features.iloc[1],
show_feature_values=True)
html_obj2
For the loans to be Paid Off or Current, the features out_prncp and recoveries with higher values are beneficial, while last_pymnt_amnt and total_pymnt with lower values are not beneficial.
SMOTE: Baseline Model
¶
Let's now proceed with using similar methods
for the SMOTE
set by first reading the data, defining the label and features for the train/test sets, removing the target, extracting the feature names and fitting the baseline model with the default parameters.
trainDF = pd.read_csv('trainDF_SMOTE.csv', low_memory=False)
testDF = pd.read_csv('testDF_SMOTE.csv', low_memory=False)
X_train = trainDF.drop('loan_status', axis=1)
y_train = trainDF.loan_status
X_test = testDF.drop('loan_status', axis=1)
y_test = testDF.loan_status
label = trainDF[['loan_status']]
features = trainDF.drop(columns = ['loan_status'])
feature_names = list(features.columns)
features = features.columns
features = np.array(features)
lgbc = lgb.LGBMClassifier(boosting_type='gbdt',
device='gpu',
gpu_platform_id=0,
gpu_device_id=0,
verbosity=-1)
lgbc.fit(X_train, y_train)
Pkl_Filename = 'lightGBM_SMOTE_Baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(lgbc, file)
print('\nModel Metrics for LightGBM Baseline SMOTE')
y_train_pred = lgbc.predict(X_train)
y_train_pred = y_train_pred.round(2)
y_train_pred = np.where(y_train_pred > 0.5, 1, 0)
y_test_pred = lgbc.predict(X_test)
y_test_pred = y_test_pred.round(2)
y_test_pred = np.where(y_test_pred > 0.5, 1, 0)
print('Training Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_train), asnumpy(y_train_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Test Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))
Compared to the baseline model metrics for the Upsampled
set using the default LightGBM
parameters with GPU
support, the accuracy, precision and F1 scores are higher but the recall score is lower.
1000 Trials 5-Fold Cross Validation
¶
Let's define a new objective function for the hyperparameter optimization that joblib dumps to a new pickle file, and that uses the same parameter space, the same k-folds for reproducibility and the same structure that was utilized for the Upsampled LightGBM search using Optuna.
def objective(trial):
"""
Objective function to tune a LightGBMClassifier model.
"""
joblib.dump(study, 'lightGBM_GPU_HPO_SMOTE_Optuna_1000.pkl')
params_lgb_optuna = {
'random_state': seed_value,
'device_type': 'gpu',
'objective': 'binary',
'metric': 'binary_logloss',
'boosting_type': 'gbdt',
'n_estimators': trial.suggest_int('n_estimators', 100, 500, step=10),
'learning_rate': trial.suggest_loguniform('learning_rate', 1e-6, 1e-1),
'num_leaves': trial.suggest_int('num_leaves', 100, 1000, step=20),
'bagging_freq': trial.suggest_int('bagging_freq', 3, 10),
'subsample': trial.suggest_float('subsample', 0.5, 0.95),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.4, 0.9),
'max_depth': trial.suggest_int('max_depth', 3, 12),
'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 1e-1, log=True),
'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 1e-1, log=True),
'min_gain_to_split': trial.suggest_float('min_gain_to_split', 1, 15),
'min_child_samples': trial.suggest_int('min_child_samples', 100, 500,
step=10),
'verbosity': -1}
kf = KFold(n_splits=5, shuffle=True, random_state=seed_value)
for idx, (trn_idx, val_idx) in enumerate(kf.split(trainDF[features],
label)):
train_features, train_label = trainDF[features].iloc[trn_idx], label.iloc[trn_idx]
val_features, val_label = trainDF[features].iloc[val_idx], label.iloc[val_idx]
start = timer()
model = lgb.LGBMClassifier(**params_lgb_optuna,
early_stopping_rounds=150)
cv_scores = np.empty(5)
model.fit(train_features, train_label.values.ravel(),
eval_set = [(val_features, val_label.values.ravel())],
eval_metric='binary_logloss',
callbacks=[LightGBMPruningCallback(trial, 'binary_logloss')])
y_pred_val = model.predict_proba(val_features)
cv_scores[idx] = log_loss(val_label, y_pred_val)
run_time = timer() - start
return np.mean(cv_scores)
Let's now begin the HPO
trials creating a new Optuna.study
that can be compared with the results from the search using the Upsampled
set.
from datetime import datetime, timedelta
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('lightGBM_GPU_HPO_SMOTE_Optuna_1000.pkl'):
study = joblib.load('lightGBM_GPU_HPO_SMOTE_Optuna_1000.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=1000)
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest LogLoss', study.best_value)
Results from the Hyperparameter Search
¶
Now, we can extract the trial number, logloss
and hyperparameter value into a dataframe and sort with the lowest loss first.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'logloss'}, inplace=True)
trials_df.rename(columns={'params_bagging_freq': 'bagging_freq'}, inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
inplace=True)
trials_df.rename(columns={'params_lambda_l1': 'lambda_l1'}, inplace=True)
trials_df.rename(columns={'params_lambda_l2': 'lambda_l2'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_samples': 'min_child_samples'},
inplace=True)
trials_df.rename(columns={'params_min_gain_to_split': 'min_gain_to_split'},
inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_num_leaves': 'num_leaves'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('logloss', ascending=True)
print(trials_df)
Let's utilize plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search and plot_contour to plot the parameter relationships as contour plots.
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()
The slice plot was not informative, and the scatterplot of the regularization hyperparameters (lambda_l1, lambda_l2) did not reveal trends over the trial iterations, which might have been affected by the wide range searched (0-0.1).
Let's now visualize the parameter importances by utilizing the plot_param_importances
feature from the Optuna
package.
fig = optuna.visualization.plot_param_importances(study)
fig.show()
The most important hyperparameters for the SMOTE search are colsample_bytree = 0.32, max_depth = 0.18, min_child_samples = 0.12 and learning_rate = 0.11. These differ from the Upsampled results, where the most important were n_estimators = 0.31, min_child_samples = 0.21 and lambda_l2 = 0.20. For the Upsampled set, the importances of the SMOTE top hyperparameters are colsample_bytree = 0.03, max_depth = 0.01, min_child_samples = 0.21 and learning_rate = 0.09. Conversely, in the SMOTE study, the importances of the hyperparameters that ranked highest for the Upsampled set are n_estimators = 0.06, min_child_samples = 0.12 and lambda_l2 = 0.06.
Next, let's arrange the best parameters to re-create the best model and fit using the training data and save it as a .pkl
file. Then, we can evaluate the model metrics for both the training and the test sets using the classification report, confusion matrix as well as accuracy, precision, recall and F1 score from sklearn.metrics
.
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'binary_error'
params
train_label = trainDF[['loan_status']]
train_features = trainDF.drop(columns = ['loan_status'])
test_features = testDF.drop(columns = ['loan_status'])
best_model = lgb.LGBMClassifier(boosting_type='gbdt',
device='gpu',
gpu_platform_id=0,
gpu_device_id=0,
verbosity=-1,
**params)
best_model.fit(train_features, train_label.values.ravel())
Pkl_Filename = 'lightGBM_HPO_Optuna_SMOTE_trials1000_GPU.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for lightGBM HPO SMOTE 1000 GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('\n')
print('Classification Report: Train')
clf_rpt = classification_report(y_train, y_train_pred)
print(clf_rpt)
print('\n')
print('Confusion matrix: Train')
print(confusion_matrix(y_train, y_train_pred))
print('\n')
print('Classification Report: Test')
clf_rpt = classification_report(y_test, y_test_pred)
print(clf_rpt)
print('\n')
print('Confusion matrix: Test')
print(confusion_matrix(y_test, y_test_pred))
print('\n')
print('Accuracy score: train: %.3f, test: %.3f' % (
accuracy_score(y_train, y_train_pred),
accuracy_score(y_test, y_test_pred)))
print('Precision score: train: %.3f, test: %.3f' % (
precision_score(y_train, y_train_pred),
precision_score(y_test, y_test_pred)))
print('Recall score: train: %.3f, test: %.3f' % (
recall_score(y_train, y_train_pred),
recall_score(y_test, y_test_pred)))
print('F1 score: train: %.3f, test: %.3f' % (
f1_score(y_train, y_train_pred),
f1_score(y_test, y_test_pred)))
Compared to the baseline LightGBM model for the SMOTE set, there was not a large change in the model metrics, and the observed changes are potentially negligible. Compared to the tuned LightGBM model using the Upsampled set, precision might be slightly better, but recall, and perhaps the F1 score, are worse.
Let's now examine the accuracy achieved on the test set and the conditions of the best trial.
print('The best model from optimization scores {:.5f} Accuracy on the test set.'.format(accuracy_score(y_test,
y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
Model Explanations
¶
SHAP (SHapley Additive exPlanations)
¶
Let's use the training set and summarize the effects of all the features.
shap.initjs()
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(train_features)
shap.summary_plot(shap_values, train_features, show=False);
Now, we can use the test set and summarize the effects of all the features.
shap.initjs()
shap_values = explainer.shap_values(test_features)
shap.summary_plot(shap_values, test_features, show=False);
Model Metrics with Eli5
¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
X_test1 = pd.DataFrame(test_features, columns=test_features.columns)
perm_importance = PermutationImportance(best_model,
random_state=seed_value).fit(test_features,
y_test)
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test1.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn
which returns an explanation of the estimator parameters (weights).
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
The out_prncp feature has the highest weight of all the features, followed by total_pymnt at 0.0802. Compared to the Upsampled set, out_prncp has slightly less weight (0.223551 vs. 0.248844), but there is a large difference in the weights for total_pymnt (0.0802 vs. 0.239985).
Now we can use ELI5
to show the prediction.
html_obj2 = eli5.show_prediction(best_model, test_features.iloc[1],
show_feature_values=True)
html_obj2
For the loans to be Paid Off
or Current
, the features recoveries
and out_prncp
with higher values are beneficial while the features last_pymnt_amnt
, total_pymnt
, loan_amnt
and annual_inc
with lower values are not beneficial.
Environment to Utilize RAPIDS
¶
We can utilize the resources Paperspace offers to run a RAPIDS Docker container and leverage GPUs with more memory than is available on most local desktops, as an alternative to requesting quota increases and paying a cloud provider for every second a VM is running.
First, let's upgrade pip
, install the dependencies and import the required packages. Then we can set the os
environment as well as the random
, cupy
and numpy
seed. We can also define a function timed
to time the code blocks and examine the components of the GPU
which are being utilized in the runtime.
!pip install --upgrade pip
import os
import warnings
import random
import cupy
import numpy as np
import urllib.request
from contextlib import contextmanager
import time
warnings.filterwarnings('ignore')
seed_value = 42
os.environ['xgbRAPIDS_GPU'] = str(seed_value)
random.seed(seed_value)
cupy.random.seed(seed_value)
np.random.seed(seed_value)
@contextmanager
def timed(name):
t0 = time.time()
yield
t1 = time.time()
print('..%-24s: %8.4f' % (name, t1 - t0))
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
print('\n')
!nvidia-smi
There is an NVIDIA RTX A4000 with 16GB of GPU memory, NVIDIA-SMI version 510.73.05 and CUDA version 11.6.
Now, we can set up a LocalCUDACluster for Dask with defined limits for host memory and device memory, as well as an ip that is assigned to the dask.distributed.Client for all of the connected workers (1).
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait
from dask.diagnostics import ProgressBar
from dask.utils import parse_bytes
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES='0', memory_limit='10GB',
device_memory_limit='5GB', ip='0.0.0.0')
client = Client(cluster)
client
XGBoost
¶
The notebooks evaluating different metrics utilizing train/validation/test sets with RAPIDS
and Optuna
are located here, and the initial notebooks using Hyperopt
and XGBoost
with GPU
runtimes are located here.
Let's now read the training and the test sets for the Upsampled set into separate cudf.DataFrames. For the XGBoost models, splitting the training set into train and validation sets resulted in better model performance, so let's use train_size=0.8 with the training set to create the validation set. Then we can convert the features to float32 and the target to int32 for modeling, and define the test features/target from the test set.
import cudf
from cuml.model_selection import train_test_split
trainDF = cudf.read_csv('trainDF_US.csv', low_memory=False)
print('Train set: Number of rows and columns:', trainDF.shape)
testDF = cudf.read_csv('testDF_US.csv', low_memory=False)
print('Test set: Number of rows and columns:', testDF.shape)
X_train, X_val, y_train, y_val = train_test_split(trainDF, 'loan_status',
train_size=0.8)
X_train = X_train.astype('float32')
y_train = y_train.astype('int32')
X_val = X_val.astype('float32')
y_val = y_val.astype('int32')
X_test = testDF.drop('loan_status', axis=1).astype('float32')
y_test = testDF['loan_status'].astype('int32')
Upsampled: Baseline Model
¶
Let's first define the baseline model components for classification using gbtree
for the booster
to run on the GPU
using tree_method=gpu_hist
with scale_pos_weight=1
to balance the positive and negative weights without label encoding. Then use the defined seed_value
for the random number seed with silent output from the model. Now we can fit the model using the training set, save it as .pkl
file and evaluate the model using classification metrics.
# Imports needed for this cell in the fresh RAPIDS environment
import pickle
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import f1_score
from cupy import asnumpy
xgb = XGBClassifier(objective='binary:logistic',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0)
xgb.fit(X_train, y_train)
Pkl_Filename = 'XGB_Baseline_US.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(xgb, file)
print('\nModel Metrics for XGBoost Baseline Upsampling')
y_train_pred = xgb.predict(X_train)
y_train_pred = y_train_pred.round(2)
y_train_pred = np.where(y_train_pred > 0.5, 1, 0)
y_test_pred = xgb.predict(X_test)
y_test_pred = y_test_pred.round(2)
y_test_pred = np.where(y_test_pred > 0.5, 1, 0)
print('Training Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_train), asnumpy(y_train_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Test Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))
Overall, the model metrics are over 90%, so the baseline model performs quite well. The lowest metrics are the recall and F1 score, so let's monitor one of those to see if hyperparameter tuning results in better performance.
300 Trials: Weighted F1
¶
Now, let's set up an Optuna.study
by creating a unique name for the study
. We can leverage the optuna.integration.wandb
to set up callbacks that will be saved to Weights & Biases
. First, we login
and set up the arguments that include the name of the project
, the person saving the results, the group
the study
belongs to, if code is saved or not, and notes
about the study for future reference.
import wandb
from optuna.integration.wandb import WeightsAndBiasesCallback
study_name = 'dask_optuna_us_xgbRapids_weightedF1_300_tpe'
wandb.login()
wandb_kwargs = {'project': 'loanStatus_hpo', 'entity': 'aschultz',
'group': 'optuna_us_xgb300gpu_trainvaltest_f1weighted',
'save_code': 'False',
'notes': 'us_xgb300gpu_trainvaltest_f1weighted'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
Then, we can define a function objective
for the optimization of various hyperparameters which uses joblib
to save each of the trial parameters and the loss in a pickle
file that can be reloaded to continue optimizing. Then we can denote the parameter space of the hyperparameters to be tested.
Then the model type, XGBClassifier
, needs to be defined with the parameters that will be included in all of the trials during the search, which are:
- objective='binary:logistic': Specify the learning task and the corresponding learning objective or a custom objective function to be used.
- booster='gbtree': Specify which booster to use: gbtree, gblinear or dart.
- tree_method='gpu_hist': Specify which tree method to use. Default=auto.
- scale_pos_weight=1: Balancing of positive and negative weights.
- use_label_encoder=False: Disables the deprecated sklearn label encoder.
- random_state=seed_value: Random number seed.
- verbosity=0: The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
Then we can fit the model, specified to use the parameters within the params_xgboost_optuna dictionary, utilizing the training set with the eval_set as the validation set, followed by computing the weighted f1_score between the validation label and the predicted label.
import optuna
from optuna import Trial
optuna.logging.set_verbosity(optuna.logging.WARNING)
import joblib
from xgboost import XGBClassifier
from timeit import default_timer as timer
from sklearn.metrics import f1_score
from cupy import asnumpy
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a XGBoostClassifier model.
"""
joblib.dump(study, 'xgbRAPIDS_Optuna_US_300_GPU_f1weighted.pkl')
params_xgboost_optuna = {
'n_estimators': trial.suggest_int('n_estimators', 400, 700),
'max_depth': trial.suggest_int('max_depth', 8, 15),
'subsample': trial.suggest_float('subsample', 0.6, 0.9),
'gamma': trial.suggest_float('gamma', 0.02, 1),
'learning_rate': trial.suggest_float('learning_rate', 0.13, 0.3),
'reg_alpha': trial.suggest_float('reg_alpha', 0.5, 2.5),
'reg_lambda': trial.suggest_float('reg_lambda', 0.3, 2),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.8, 0.99),
'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.25, 0.6),
'min_child_weight': trial.suggest_int('min_child_weight', 1, 5)
}
xgb = XGBClassifier(objective='binary:logistic',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0,
**params_xgboost_optuna)
start = timer()
xgb.fit(X_train, y_train,
eval_set=[(X_val, y_val)],
verbose=0)
y_pred_val = xgb.predict(X_val)
score = f1_score(asnumpy(y_val), asnumpy(y_pred_val), average='weighted')
run_time = timer() - start
return score
Now, we can begin the Optuna
study containing 300 trials that are optimized to run in parallel on a Dask
cluster. The goal is to find the parameters that maximize the weighted F1 score for predicting loan_status
.
import time
import dask_optuna
import dask
import dask_cudf
from joblib import parallel_backend
with timed('dask_optuna'):
search_time_start = time.time()
if os.path.isfile('xgbRAPIDS_Optuna_US_300_GPU_f1weighted.pkl'):
study = joblib.load('xgbRAPIDS_Optuna_US_300_GPU_f1weighted.pkl')
else:
study = optuna.create_study(sampler=optuna.samplers.TPESampler(),
study_name=study_name,
direction='maximize')
with parallel_backend('dask'):
study.optimize(lambda trial: objective(trial),
n_trials=300)
print('Time to run HPO:', time.time() - search_time_start)
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Highest F1 Score', study.best_value)
wandb.finish()
Results from the Hyperparameter Search
¶
Let's now extract the trial number, f1_weighted
and hyperparameter value into a dataframe and sort with the highest score first.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'f1_weighted'}, inplace=True)
trials_df.rename(columns={'params_colsample_bylevel': 'colsample_bylevel'},
inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
inplace=True)
trials_df.rename(columns={'params_gamma': 'gamma'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_weight': 'min_child_weight'},
inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_reg_alpha': 'reg_alpha'}, inplace=True)
trials_df.rename(columns={'params_reg_lambda': 'reg_lambda'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('f1_weighted', ascending=False)
print(trials_df)
We can now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
import matplotlib.pyplot as plt
import seaborn as sns
fig, axs = plt.subplots(1, 4, figsize=(20,5))
i = 0
axs = axs.flatten()
for i, hpo in enumerate(['learning_rate', 'gamma', 'colsample_bylevel',
'colsample_bytree']):
sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
learning_rate
, gamma
, colsample_bytree
decreased over the trials while colsample_bylevel
increased.
Next, we can examine if there were any trends regarding the regularization parameters over the trials.
fig, axs = plt.subplots(1, 2, figsize=(20,5))
i = 0
for i, hpo in enumerate(['reg_alpha', 'reg_lambda']):
sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
reg_alpha
decreased while reg_lambda
did not show any trend over the trial iterations.
From using plot_param_importances
to visualize parameter importances, the most important hyperparameter is max_depth
(0.89) followed by reg_lambda
(0.04) and gamma
(0.02).
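The code cell for this figure is not shown; as a hedged sketch, it would be the same plot_param_importances call used in the LightGBM sections:
fig = optuna.visualization.plot_param_importances(study)
fig.show()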
Next, let's arrange the best parameters to re-create the best model and fit using the training data and save it as a .pkl
file. Then, we can evaluate the model metrics for both the training and the test sets using the classification report, confusion matrix as well as accuracy, precision, recall and F1 score from sklearn.metrics
.
params = study.best_params
params
import pickle
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score
X_train, y_train = trainDF.drop('loan_status',
axis=1), trainDF['loan_status'].astype('int32')
X_train = X_train.astype('float32')
X_test, y_test = testDF.drop('loan_status',
axis=1), testDF['loan_status'].astype('int32')
X_test = X_test.astype('float32')
best_model = XGBClassifier(objective='binary:logistic',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0,
**params)
best_model.fit(X_train, y_train)
Pkl_Filename = 'xgbRAPIDS_Optuna_US_300trials_GPU_f1weighted.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for XGBoost RAPIDS HPO Upsampling 300 trials GPU F1 Weighted')
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
print('\n')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))
Compared to the baseline model, there is a higher accuracy, precision and F1 score but a lower recall score. Let's now evaluate the ROC AUC on the test set using the predicted labels.
from sklearn.metrics import roc_auc_score
print('The best model from Upsampling 300 GPU trials optimization scores {:.5f} AUC ROC on the test set.'.format(roc_auc_score(asnumpy(y_test),
asnumpy(y_test_pred))))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
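Since roc_auc_score above is computed from the hard class labels, a hedged sketch using the class-1 probabilities from predict_proba would look like the following:
# Sketch: ROC AUC from predicted probabilities rather than hard labels.
y_test_proba = best_model.predict_proba(X_test)[:, 1]
print('ROC AUC from probabilities: {:.5f}'.format(
    roc_auc_score(asnumpy(y_test), asnumpy(y_test_proba))))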
Model Explanations
¶
SHAP (SHapley Additive exPlanations)
¶
Let's use the training set and summarize the effects of all the features.
import shap
shap.initjs()
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_train)
plt.rcParams.update({'font.size': 7})
shap.summary_plot(shap_values, X_train.to_pandas(), show=False);
Now, we can use the test set and summarize the effects of all the features.
shap.initjs()
shap_values = explainer.shap_values(X_test)
plt.rcParams.update({'font.size': 7})
shap.summary_plot(shap_values, X_test.to_pandas(), show=False);
Model Metrics with Eli5
¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
import eli5
from eli5.sklearn import PermutationImportance
perm_importance = PermutationImportance(best_model,
random_state=seed_value).fit(asnumpy(X_test),
asnumpy(y_test))
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn
which returns an explanation of the estimator parameters (weights).
from eli5.formatters import format_as_dataframe
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test.columns.tolist())
exp = format_as_dataframe(explanation)
exp
The out_prncp
feature had the highest weight (0.2415) followed by total_pymnt
(0.2345) and recoveries
(0.0670).
Let's now utilize the plot_importance
from the XGBoost
package to examine the feature importance.
from xgboost import plot_importance
plot_importance(best_model, max_num_features=15);
The most important features are:
total_pymnt
mo_sin_old_rev_tl_op
total_rec_int
annual_inc
total_bal_ex_mort
int_rate
total_hi_cred_lim
bc_open_to_buy
revol_util
revol_bal
total_bc_limit
installment
mths_since_recent_bc
out_prncp
last_pymnt_amnt
Next, we can calculate the Permutation importance:
from sklearn.inspection import permutation_importance
perm_importance = permutation_importance(best_model, X_test.to_pandas(),
                                          asnumpy(y_test))
sorted_idx = perm_importance.importances_mean.argsort()
plt.rcParams.update({'font.size': 10})
fig = plt.figure(figsize=(8, 15))
plt.barh(X_test.columns[sorted_idx],
         perm_importance.importances_mean[sorted_idx])
plt.xlabel('Permutation Importance')
plt.title('XGBoost Permutation Importance')
plt.show();
From the permutation importance, the most important features are:
out_prncp
total_pymnt
recoveries
loan_amnt
installment
last_pymnt_amnt
total_rec_int
SMOTE: Baseline Model
¶
The SMOTE
train and test sets were processed using the same methods as the Upsampled
data, and the same parameters that were used for the Upsampled
baseline model were maintained. Now we can fit the model using the training set, save it as .pkl
file and evaluate the model using the same classification metrics.
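The read/split cell is not shown here; as a hedged sketch mirroring the Upsampled workflow above (and assuming the trainDF_SMOTE.csv and testDF_SMOTE.csv file names used earlier in this document), it could look like:
# Sketch: read the SMOTE sets with cudf and set up the train/validation/test
# splits as in the Upsampled section (cudf and train_test_split imported above).
trainDF = cudf.read_csv('trainDF_SMOTE.csv', low_memory=False)
testDF = cudf.read_csv('testDF_SMOTE.csv', low_memory=False)
X_train, X_val, y_train, y_val = train_test_split(trainDF, 'loan_status',
                                                  train_size=0.8)
X_train = X_train.astype('float32')
y_train = y_train.astype('int32')
X_val = X_val.astype('float32')
y_val = y_val.astype('int32')
X_test = testDF.drop('loan_status', axis=1).astype('float32')
y_test = testDF['loan_status'].astype('int32')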
xgb = XGBClassifier(objective='binary:logistic',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0)
xgb.fit(X_train, y_train)
Pkl_Filename = 'XGB_Baseline_SMOTE.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(xgb, file)
print('\nModel Metrics for XGBoost Baseline SMOTE')
y_train_pred = xgb.predict(X_train)
y_train_pred = y_train_pred.round(2)
y_train_pred = np.where(y_train_pred > 0.5, 1, 0)
y_test_pred = xgb.predict(X_test)
y_test_pred = y_test_pred.round(2)
y_test_pred = np.where(y_test_pred > 0.5, 1, 0)
print('Training Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_train), asnumpy(y_train_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Test Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))
Overall, the model metrics are over 90%, so the baseline model performs quite well. The lowest metrics are the recall and F1 score. Compared to the Upsampled
baseline model, the accuracy, precision and F1 score are better, but the recall is lower (0.907 vs. 0.928).
300 Trials: Weighted F1
¶
The same methods for reading the data, splitting the training set into train and validation sets, and converting the data types for modeling were used for trainDF_SMOTE and testDF_SMOTE, as was defining a name for the study and the information for the W & B callbacks.
study_name = 'dask_optuna_smote_xgbRapids_weightedF1_300_tpe'
wandb.login()
wandb_kwargs = {'project': 'loanStatus_hpo', 'entity': 'aschultz',
'group': 'optuna_SMOTE_xgb300gpu_trainvaltest',
'save_code': 'False',
'notes': 'smote_xgb300gpu_trainvaltest_f1weighted'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a XGBoostClassifier model.
"""
joblib.dump(study, 'xgbRAPIDS_Optuna_SMOTE_300_GPU_f1weighted.pkl')
params_xgboost_optuna = {
'n_estimators': trial.suggest_int('n_estimators', 400, 700),
'max_depth': trial.suggest_int('max_depth', 7, 15),
'subsample': trial.suggest_float('subsample', 0.55, 0.8),
'gamma': trial.suggest_float('gamma', 0.6, 2.5),
'learning_rate': trial.suggest_float('learning_rate', 0.13, 0.3),
'reg_alpha': trial.suggest_float('reg_alpha', 0.2, 2.65),
'reg_lambda': trial.suggest_float('reg_lambda', 0.6, 2.3),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.8),
'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.35, 0.6),
'min_child_weight': trial.suggest_int('min_child_weight', 1, 8)
}
xgb = XGBClassifier(objective='binary:logistic',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0,
**params_xgboost_optuna)
start = timer()
xgb.fit(X_train, y_train,
eval_set=[(X_val, y_val)],
verbose=0)
y_pred_val = xgb.predict(X_val)
score = f1_score(asnumpy(y_val), asnumpy(y_pred_val), average='weighted')
run_time = timer() - start
return score
We can load the saved .pkl
file to continue training for 265 more trials to reach the intended 300 for the study
.
study = joblib.load('xgbRAPIDS_Optuna_SMOTE_300_GPU_f1weighted.pkl')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Highest F1 Score', study.best_value)
with timed('dask_optuna'):
search_time_start = time.time()
if os.path.isfile('xgbRAPIDS_Optuna_SMOTE_300_GPU_f1weighted.pkl'):
study = joblib.load('xgbRAPIDS_Optuna_SMOTE_300_GPU_f1weighted.pkl')
else:
study = optuna.create_study(sampler=optuna.samplers.TPESampler(),
study_name=study_name,
direction='maximize')
with parallel_backend('dask'):
study.optimize(lambda trial: objective(trial),
n_trials=265)
print('Time to run HPO:', time.time() - search_time_start)
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Highest F1 Score', study.best_value)
wandb.finish()
Results from the Hyperparameter Search
¶
Let's now extract the trial number, f1_weighted
and hyperparameter value into a dataframe and sort with the highest score first.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'f1_weighted'}, inplace=True)
trials_df.rename(columns={'params_colsample_bylevel': 'colsample_bylevel'},
inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
inplace=True)
trials_df.rename(columns={'params_gamma': 'gamma'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_weight': 'min_child_weight'},
inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_reg_alpha': 'reg_alpha'}, inplace=True)
trials_df.rename(columns={'params_reg_lambda': 'reg_lambda'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('f1_weighted', ascending=False)
print(trials_df)
We can now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 4, figsize=(20,5))
i = 0
axs = axs.flatten()
for i, hpo in enumerate(['learning_rate', 'gamma', 'colsample_bylevel',
'colsample_bytree']):
sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
learning_rate and gamma decreased over the trials, as in the Upsampled search, while colsample_bylevel and colsample_bytree increased.
Next, we can examine if there were any trends regarding the regularization parameters over the trials.
fig, axs = plt.subplots(1, 2, figsize=(20,5))
i = 0
for i, hpo in enumerate(['reg_alpha', 'reg_lambda']):
sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
reg_alpha decreased, as in the Upsampled trials, and reg_lambda also decreased here, whereas it showed no trend over the Upsampled trial iterations.
From using the same plot_param_importances call sketched above to visualize the parameter importances for the SMOTE set, the most important hyperparameters are subsample = 0.25, min_child_weight = 0.21, colsample_bylevel = 0.17 and colsample_bytree = 0.16. max_depth, the most important hyperparameter for the Upsampled model, is far less important here (0.04 vs. 0.89), while reg_lambda (0.02 vs. 0.04) and gamma (< 0.01 vs. 0.02) are also lower.
Next, let's arrange the best parameters to re-create the best model and fit using the training data and save it as a .pkl
file. Then, we can evaluate the model metrics for both the training and the test sets using the classification report, confusion matrix as well as accuracy, precision, recall and F1 score from sklearn.metrics
.
params = study.best_params
params
best_model = XGBClassifier(objective='binary:logistic',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0,
**params)
best_model.fit(X_train, y_train)
Pkl_Filename = 'xgbRAPIDS_Optuna_SMOTE_300trials_GPU_f1weighted.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for XGBoost RAPIDS HPO SMOTE 300 trials GPU F1 Weighted')
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
print('\n')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))
Compared to the baseline model, there is a higher recall and F1 score and a comparable accuracy and precision score. Let's now evaluate the ROC AUC on the test set using the predicted labels.
print('The best model from SMOTE 300 GPU trials optimization scores {:.5f} AUC ROC on the test set.'.format(roc_auc_score(asnumpy(y_test),
asnumpy(y_test_pred))))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
Model Explanations
¶
SHAP (SHapley Additive exPlanations)
¶
Let's use the training set and summarize the effects of all the features.
shap.initjs()
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_train)
plt.rcParams.update({'font.size': 7})
shap.summary_plot(shap_values, X_train.to_pandas(), show=False);
Now, we can use the test set and summarize the effects of all the features.
shap.initjs()
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test)
plt.rcParams.update({'font.size': 7})
shap.summary_plot(shap_values, X_test.to_pandas(), show=False);
Model Metrics with Eli5
¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
perm_importance = PermutationImportance(best_model,
random_state=seed_value).fit(asnumpy(X_test),
asnumpy(y_test))
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn
which returns an explanation of the estimator parameters (weights).
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test.columns.tolist())
exp = format_as_dataframe(explanation)
exp
The total_pymnt feature has the highest weight for this model (0.284731). For the Upsampled model, the weights were lower, with out_prncp at 0.2415 and total_pymnt at 0.2345.
Let's now utilize the plot_importance
from the XGBoost
package to examine the feature importance.
plot_importance(best_model, max_num_features=15);
The most important features are:
total_pymnt
total_rec_int
int_rate
last_pymnt_amnt
out_prncp
installment
total_bal_ex_mort
mo_sin_old_rev_tl_op
annual_inc
loan_amnt
bc_open_to_buy
total_bc_limit
revol_bal
revol_util
Next, we can calculate the Permutation importance:
perm_importance = permutation_importance(best_model, X_test.to_pandas(),
asnumpy(y_test))
sorted_idx = perm_importance.importances_mean.argsort()
plt.rcParams.update({'font.size': 10})
fig = plt.figure(figsize=(8, 15))
plt.barh(X_test.columns[sorted_idx],
         perm_importance.importances_mean[sorted_idx])
plt.xlabel('Permutation Importance')
plt.title('XGBoost Permutation Importance')
plt.show();
From the permutation importance, the most important features are:
total_pymnt
out_prncp
loan_amnt
recoveries
installment
last_pymnt_amnt
total_rec_int
int_rate
Linear
¶
The notebooks utilized for evaluating the weighted F1, weighted ROC, precision or recall score as the score
metric to monitor during the hyperparameter tuning search using (Logistic Regression
, Ridge Regression
, Lasso Regression
or Elastic Net Regression
) are located here.
The methods used to set up the RAPIDS
environment in Colab
can be found here. For the subsequent algorithms evaluated, the RAPIDS
environment was set up with installing/importing packages, setting the seed and Dask
cluster as described in the previous XGBoost
section. For the following sections, the data was read and the features/target were set up as demonstrated below.
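As a hedged sketch (mirroring the cudf workflow from the XGBoost section and assuming the same Upsampled CSV files), that read and feature/target set-up could look like:
import cudf
# Sketch mirroring the earlier cudf workflow; the file names assume the
# Upsampled CSVs used above.
trainDF = cudf.read_csv('trainDF_US.csv', low_memory=False)
testDF = cudf.read_csv('testDF_US.csv', low_memory=False)
X_train = trainDF.drop('loan_status', axis=1).astype('float32')
y_train = trainDF['loan_status'].astype('int32')
X_test = testDF.drop('loan_status', axis=1).astype('float32')
y_test = testDF['loan_status'].astype('int32')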
LinearSVC
¶
Support Vector Machine (SVM) is a supervised algorithm where the objective is to determine the hyperplane, the boundary that separates the feature space by target class, with the highest possible distance (margin) between the hyperplane and the nearest data point from either target group. The points closest to the hyperplane are the support vectors; removing these would affect the position of the dividing hyperplane.
However, finding the optimal hyperplane for non-linear problems is time-consuming and computationally expensive. The 2005 paper A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs developed a method for solving linear SVMs with an L2 loss function, and others have since proposed new methods for utilizing LinearSVMs.
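As a compact statement of the idea above (a generic soft-margin formulation, not a claim about the exact objective cuML solves), the linear SVM minimizes a regularized hinge loss, where C is the loss-scaling constant described in the parameter list below and the squared-hinge variant squares the max term:
$$\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^2 \;+\; C \sum_{i=1}^{n} \max\bigl(0,\ 1 - y_i(w^\top x_i + b)\bigr)$$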
The notebooks utilized for evaluating the weighted F1, weighted ROC, precision or recall score as the score
metric to monitor during the hyperparameter tuning search are located
here.
Upsampled: Baseline Model
¶
Let's first define the baseline model components for classification. Now we can fit the model using the training set from the Upsampled
set, save it as .pkl
file and evaluate the model using classification metrics.
from cuml.svm import LinearSVC
lsvc = LinearSVC()
lsvc.fit(X_train, y_train)
Pkl_Filename = 'LinearSVC_Baseline_US.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(lsvc, file)
print('\nModel Metrics for LinearSVC Baseline Upsampling')
y_train_pred = lsvc.predict(X_train)
y_train_pred = y_train_pred.round(2)
y_train_pred = cupy.where(y_train_pred > 0.5, 1, 0)
y_test_pred = lsvc.predict(X_test)
y_test_pred = y_test_pred.round(2)
y_test_pred = cupy.where(y_test_pred > 0.5, 1, 0)
print('Training Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_train), asnumpy(y_train_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Test Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))
Overall, the model metrics are quite poor, so the model will definitely need to be tuned. In the SparkML section, the StandardScaler was utilized before fitting the model; scaling might be needed here to obtain better performance, because using the default parameters resulted in the worst baseline model up to this point.
100 Trials: Weighted F1
¶
Now, we can define a function train_and_eval
to set up the train/test sets and define the model/parameters that will be tested during the search:
- penalty: The regularization term of the target function. Default='l2'.
- loss: The loss term of the target function. Default='squared_hinge'.
- penalized_intercept: When true, the bias term is treated the same way as other features; i.e., it is penalized by the regularization term of the target function. Default=False.
- max_iter: Maximum number of iterations for the underlying solver. Default=1000.
- linesearch_max_iter: Maximum number of linesearch (inner loop) iterations for the underlying (QN) solver. Default=100.
- lbfgs_memory: Number of vectors approximating the hessian for the underlying QN solver (l-bfgs). Default=5.
- C: The constant scaling factor of the loss term in the target formula. Default=1.0.
- grad_tol: The threshold on the gradient for the underlying QN solver. Default=0.0001.
- change_tol: The threshold on the function change for the underlying QN solver. Default=1e-05.
Then the model will be trained using the training set, predictions made on the test set, and the model evaluated using the weighted f1_score between the actual loan_status in the test set and the predicted one.
def train_and_eval(X_param, y_param, penalty='l2',
loss='squared_hinge',
penalized_intercept='False',
max_iter=10000,
linesearch_max_iter=100,
lbfgs_memory=5, C=1,
grad_tol=0.0001, change_tol=1e-5,
verbose=False):
"""
Partition data into train/test sets, train and evaluate the model
for the given parameters.
Params
______
X_param: DataFrame.
The data to use for training and testing.
y_param: Series.
The label for training
Returns
score: Weighted F1 of the fitted model
"""
X_train, y_train = trainDF.drop('loan_status',
axis=1), trainDF['loan_status'].astype('int32')
X_train = X_train.astype('float32')
X_test, y_test= testDF.drop('loan_status',
axis=1), testDF['loan_status'].astype('int32')
X_test = X_test.astype('float32')
model = LinearSVC(penalty=penalty,
loss=loss,
penalized_intercept=penalized_intercept,
max_iter=max_iter,
linesearch_max_iter=linesearch_max_iter,
lbfgs_memory=lbfgs_memory,
C=C,
grad_tol=grad_tol,
change_tol=change_tol,
verbose=False)
start = timer()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
score = f1_score(y_test.to_numpy(), y_pred.to_numpy(), average='weighted')
run_time = timer() - start
return score
print('Score with default parameters : ', train_and_eval(X_train, y_train))
As with other hyperparameter optimizations, an objective
function containing the .pkl
file for the study
and the parameter space to be tested needs to be defined.
def objective(trial, X_param, y_param):
joblib.dump(study, 'LinearSVC_Optuna_US_100_GPU_weightedF1.pkl')
penalty = trial.suggest_categorical('penalty', ['l1', 'l2'])
loss = trial.suggest_categorical('loss', ['squared_hinge', 'hinge'])
penalized_intercept = trial.suggest_categorical('penalized_intercept',
['True', 'False'])
max_iter = trial.suggest_int('max_iter', -1, 10e6)
linesearch_max_iter = trial.suggest_int('linesearch_max_iter', 100, 10000)
lbfgs_memory = trial.suggest_int('lbfgs_memory', 5, 20)
C = trial.suggest_float('C', 0.1, 10, step=0.1)
grad_tol = trial.suggest_float('grad_tol', 1e-3, 1e-1)
change_tol = trial.suggest_float('change_tol', 1e-5, 1e-3)
score = train_and_eval(X_param,
y_param,
penalty=penalty,
loss=loss,
penalized_intercept=penalized_intercept,
max_iter=max_iter,
linesearch_max_iter=linesearch_max_iter,
lbfgs_memory=lbfgs_memory,
C=C,
grad_tol=grad_tol,
change_tol=change_tol,
verbose=False)
return score
Now, let's set up an Optuna.study by creating a unique name for the study and beginning the hyperparameter search with the TPESampler to maximize the weighted_f1 score over 100 trials, using joblib's parallel_backend with Dask.
from datetime import datetime, timedelta
study_name = 'dask_linearSVC_optuna_US_100_weightedF1_tpe'
with timed('dask_optuna'):
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('LinearSVC_Optuna_US_100_GPU_weightedF1.pkl'):
study = joblib.load('LinearSVC_Optuna_US_100_GPU_weightedF1.pkl')
else:
study = optuna.create_study(sampler=optuna.samplers.TPESampler(),
study_name=study_name,
direction='maximize')
with parallel_backend('dask'):
study.optimize(lambda trial: objective(trial, X_train, y_train),
n_trials=100, n_jobs=n_workers)
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Highest F1 Score', study.best_value)
Results from the Hyperparameter Search
¶
Let's now extract the trial number, f1_weighted
and hyperparameter value into a dataframe and sort with the highest score first.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'f1_weighted'}, inplace=True)
trials_df.rename(columns={'params_C': 'C'}, inplace=True)
trials_df.rename(columns={'params_change_tol': 'change_tol'}, inplace=True)
trials_df.rename(columns={'params_grad_tol': 'grad_tol'}, inplace=True)
trials_df.rename(columns={'params_lbfgs_memory': 'lbfgs_memory'}, inplace=True)
trials_df.rename(columns={'params_linesearch_max_iter': 'linesearch_max_iter'},
inplace=True)
trials_df.rename(columns={'params_loss': 'loss'}, inplace=True)
trials_df.rename(columns={'params_max_iter': 'max_iter'}, inplace=True)
trials_df.rename(columns={'params_penalized_intercept': 'penalized_intercept'},
inplace=True)
trials_df.rename(columns={'params_penalty': 'penalty'}, inplace=True)
trials_df = trials_df.sort_values('f1_weighted', ascending=False)
print(trials_df)
Let's utilize plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search and plot_slice to compare the objective value against individual parameters.
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()
fig = optuna.visualization.plot_slice(study)
fig.show()
The grad_tol
parameter with lower values and the lbfgs_memory
parameter with higher values resulted in higher objective values.
We can now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 6, figsize=(20,5))
i = 0
axs = axs.flatten()
for i, hpo in enumerate(['C', 'change_tol', 'grad_tol',
'lbfgs_memory', 'linesearch_max_iter', 'max_iter']):
sns.regplot(x='iteration', y=hpo, data=trials_df, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
Over the study, the parameters C, grad_tol, linesearch_max_iter and max_iter decreased, while the parameter change_tol increased. There was not really a trend for the lbfgs_memory parameter.
Now, let's use plot_param_importances to visualize parameter importances:
fig = optuna.visualization.plot_param_importances(study)
fig.show()
The penalty hyperparameter is the most important (0.97) followed by loss (0.03).
Let's now utilize plot_edf to visualize the empirical distribution function (EDF) of the objective values for the study.
fig = optuna.visualization.plot_edf(study)
fig.show()
Next, let's arrange the best parameters to re-create the best model, fit it using the training data and save it as a .pkl file. Then, we can evaluate the model metrics on the test set using the classification report, confusion matrix as well as accuracy, precision, recall and F1 score from sklearn.metrics.
params = study.best_params
params
best_model = LinearSVC(**params)
best_model.fit(X_train, y_train)
Pkl_Filename = 'LinearSVC_Optuna_US_trials100_GPU_weightedF1.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for LinearSVC HPO Upsampling 100trials GPU')
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test.to_numpy(), y_test_pred.to_numpy())
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test.to_numpy(), y_test_pred.to_numpy()))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test.to_numpy(),
y_test_pred.to_numpy()))
print('Precision score : %.3f' % precision_score(y_test.to_numpy(),
y_test_pred.to_numpy()))
print('Recall score : %.3f' % recall_score(y_test.to_numpy(),
y_test_pred.to_numpy()))
print('F1 score : %.3f' % f1_score(y_test.to_numpy(), y_test_pred.to_numpy()))
It's definitely nice to see model metrics that are not all zero like in the baseline model, where everything except accuracy was zero. This is definitely an improvement over the baseline model, but the metrics are still lower than what the boosting algorithms generated. Let's now evaluate the AUC ROC on the test set.
print('The best model from Upsampling 100 GPU trials optimization scores {:.5f} AUC ROC on the test set.'.format(roc_auc_score(y_test.to_numpy(),
y_test_pred.to_numpy())))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
SMOTE: Baseline Model
¶
Let's first define the baseline model components for classification. Now we can fit the model using the training set from the SMOTE set, save it as a .pkl file and evaluate the model using classification metrics.
lsvc = LinearSVC()
lsvc.fit(X_train, y_train)
Pkl_Filename = 'LinearSVC_Baseline_SMOTE.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(lsvc, file)
print('\nModel Metrics for LinearSVC Baseline SMOTE')
y_train_pred = lsvc.predict(X_train)
y_train_pred = y_train_pred.round(2)
y_train_pred = cupy.where(y_train_pred > 0.5, 1, 0)
y_test_pred = lsvc.predict(X_test)
y_test_pred = y_test_pred.round(2)
y_test_pred = cupy.where(y_test_pred > 0.5, 1, 0)
print('Training Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_train), asnumpy(y_train_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Test Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))
Once again, the model metrics are horrible, so the model will definitely need to be tuned. Using the default parameters resulted in the worst baseline models up to this point for both the Upsampled and SMOTE sets.
100 Trials: Weighted F1
¶
The same train_and_eval and objective functions and search parameter space described in the Upsampled section were utilized for the SMOTE study. A new .pkl file was generated for each search.
print('Score with default parameters : ', train_and_eval(X_train, y_train))
study_name = 'dask_linearSVC_optuna_SMOTE_100_weightedF1_tpe'
with timed('dask_optuna'):
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('LinearSVC_Optuna_SMOTE_100_GPU_weightedF1.pkl'):
study = joblib.load('LinearSVC_Optuna_SMOTE_100_GPU_weightedF1.pkl')
else:
study = optuna.create_study(sampler=optuna.samplers.TPESampler(),
study_name=study_name,
direction='maximize')
with parallel_backend('dask'):
study.optimize(lambda trial: objective(trial, X_train, y_train),
n_trials=100, n_jobs=n_workers)
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Highest F1 Score', study.best_value)
Results from the Hyperparameter Search
¶
Let's now extract the trial number, f1_weighted and the hyperparameter values into a dataframe and sort with the highest score first.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'f1_weighted'}, inplace=True)
trials_df.rename(columns={'params_C': 'C'}, inplace=True)
trials_df.rename(columns={'params_change_tol': 'change_tol'}, inplace=True)
trials_df.rename(columns={'params_grad_tol': 'grad_tol'}, inplace=True)
trials_df.rename(columns={'params_lbfgs_memory': 'lbfgs_memory'}, inplace=True)
trials_df.rename(columns={'params_linesearch_max_iter': 'linesearch_max_iter'},
inplace=True)
trials_df.rename(columns={'params_loss': 'loss'}, inplace=True)
trials_df.rename(columns={'params_max_iter': 'max_iter'}, inplace=True)
trials_df.rename(columns={'params_penalized_intercept': 'penalized_intercept'},
inplace=True)
trials_df.rename(columns={'params_penalty': 'penalty'}, inplace=True)
trials_df = trials_df.sort_values('f1_weighted', ascending=False)
print(trials_df)
Let's utilize plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search and plot_slice to compare the objective value and individual parameters.
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()
fig = optuna.visualization.plot_slice(study)
fig.show()
Lower C, change_tol and grad_tol and higher lbfgs_memory resulted in higher objective values.
We can now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 6, figsize=(20,5))
i = 0
axs = axs.flatten()
for i, hpo in enumerate(['C', 'change_tol', 'grad_tol',
'lbfgs_memory', 'linesearch_max_iter', 'max_iter']):
sns.regplot(x='iteration', y=hpo, data=trials_df, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
Over the trials, C, change_tol and grad_tol decreased while lbfgs_memory and max_iter increased. There was not really a trend for linesearch_max_iter. In contrast, the results from the search using the Upsampled set showed change_tol increase over the trials while max_iter decreased.
Now, let's use plot_param_importances to visualize parameter importances:
fig = optuna.visualization.plot_param_importances(study)
fig.show()
The most important hyperparameter for the SMOTE set is loss (0.55) followed by penalty (0.44); penalty was the most important hyperparameter for the Upsampled study at 0.97.
Let's now utilize plot_edf to visualize the empirical distribution function (EDF) of the objective values for the study.
fig = optuna.visualization.plot_edf(study)
fig.show()
Next, let's arrange the best parameters to re-create the best model, fit it using the training data and save it as a .pkl file. Then, we can evaluate the model metrics on the test set using the classification report, confusion matrix as well as accuracy, precision, recall and F1 score from sklearn.metrics.
params = study.best_params
params
best_model = LinearSVC(**params)
best_model.fit(X_train, y_train)
Pkl_Filename = 'LinearSVC_Optuna_SMOTE_trials100_GPU_weightedF1.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for LinearSVC HPO SMOTE 100trials GPU')
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test.to_numpy(), y_test_pred.to_numpy())
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test.to_numpy(), y_test_pred.to_numpy()))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test.to_numpy(),
y_test_pred.to_numpy()))
print('Precision score : %.3f' % precision_score(y_test.to_numpy(),
y_test_pred.to_numpy()))
print('Recall score : %.3f' % recall_score(y_test.to_numpy(),
y_test_pred.to_numpy()))
print('F1 score : %.3f' % f1_score(y_test.to_numpy(), y_test_pred.to_numpy()))
When comparing the model metrics after the SMOTE search to those after the search using the Upsampled set, there were higher accuracy, precision and F1 scores, but a lower recall score. The precision and recall scores differed by roughly 4% in opposite directions. Let's now evaluate the AUC ROC on the test set.
print('The best model from SMOTE 100 GPU trials optimization scores {:.5f} AUC ROC on the test set.'.format(roc_auc_score(y_test.to_numpy(),
y_test_pred.to_numpy())))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
K-Nearest Neighbors (KNN) Classifier
¶
KNN was first proposed in this 1951 publication Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties and later expounded on in this 1967 publication Nearest Neighbor Pattern Classification.
The notebooks utilized for evaluating the weighted F1, weighted ROC, precision or recall score as the score metric to monitor during the hyperparameter tuning search are located here.
Upsampled: Baseline Model
¶
Let's first define the baseline model components for classification. Now we can fit the model using the training set from the Upsampled set, save it as a .pkl file and evaluate the model using classification metrics.
from cuml.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
Pkl_Filename = 'KNN_Baseline_US.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(knn, file)
y_train_pred = knn.predict(X_train)
y_train_pred = y_train_pred.round(2)
y_train_pred = cupy.where(y_train_pred > 0.5, 1, 0)
y_test_pred = knn.predict(X_test)
y_test_pred = y_test_pred.round(2)
y_test_pred = cupy.where(y_test_pred > 0.5, 1, 0)
print('\nModel Metrics for KNN Baseline Upsampling')
print('Training Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_train), asnumpy(y_train_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Test Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))
The model is definitely overfitting the training set since the performance is not balanced with the test set. The lowest metrics are the precision and F1 score, so let's monitor precision to see if hyperparameter tuning results in better performance.
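As a quick reminder of what these two metrics capture, here is a small illustrative sketch using made-up confusion-matrix counts (not results from this study):
# Illustrative only: made-up confusion-matrix counts, not outputs from this study
tn, fp, fn, tp = 9000, 1500, 500, 4000
precision = tp / (tp + fp)   # penalizes false positives
recall = tp / (tp + fn)      # penalizes false negatives
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
print('Precision: %.3f  Recall: %.3f  F1: %.3f' % (precision, recall, f1))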
100 Trials: Precision
¶
Now, we can define a function train_and_eval to set up the train/test sets and define the model/parameters that will be tested during the search (a sketch of these functions follows the list):
- n_neighbors: Number of neighbors to query. Default=5.
- metric: Distance metric to use. Options are 'euclidean', 'manhattan', 'chebyshev', 'minkowski'.
Then the model will be trained using the training set, predictions made using the test set, and the model evaluated for the precision between the actual loan_status in the test set and the predicted one.
Results from the Hyperparameter Search
¶
Let's utilize plot_slice to compare the objective value and the individual parameters.
fig = optuna.visualization.plot_slice(study)
fig.show()
Fewer n_neighbors resulted in higher objective values.
Now, let's use plot_param_importances to visualize parameter importances:
fig = optuna.visualization.plot_param_importances(study)
fig.show()
The most important hyperparameter is n_neighbors (0.98) compared to metric (0.02).
Next, let's arrange the best parameters to re-create the best model, fit it using the training data and save it as a .pkl file. Then, we can evaluate the model metrics on the test set using the classification report, confusion matrix as well as accuracy, precision, recall and F1 score from sklearn.metrics.
params = study.best_params
params
best_model = KNeighborsClassifier(n_neighbors=4, metric='manhattan')
best_model.fit(X_train, y_train)
Pkl_Filename = 'KNN_Optuna_US_trials100_GPU_Precision.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for KNN HPO US 100trials GPU Precision')
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test.to_numpy(), y_test_pred.to_numpy())
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test.to_numpy(), y_test_pred.to_numpy()))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test.to_numpy(),
y_test_pred.to_numpy()))
print('Precision score : %.3f' % precision_score(y_test.to_numpy(),
y_test_pred.to_numpy()))
print('Recall score : %.3f' % recall_score(y_test.to_numpy(),
y_test_pred.to_numpy()))
print('F1 score : %.3f' % f1_score(y_test.to_numpy(), y_test_pred.to_numpy()))
Compared to the baseline model using the Upsampled set, all metrics increased besides a small decrease in the recall score. It might be worthwhile to consider monitoring the F1 score instead or preprocessing the data before modeling. Let's now evaluate the AUC ROC on the test set.
print('The best model from US 100 Precision GPU trials optimization scores {:.5f} AUC ROC on the test set.'.format(roc_auc_score(y_test.to_numpy(),
y_test_pred.to_numpy())))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
SMOTE: Baseline Model
¶
Let's first define the baseline model components for classification, fit the model using the training set from the SMOTE set, save it as a .pkl file and evaluate the model using classification metrics.
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
Pkl_Filename = 'KNN_Baseline_SMOTE.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(knn, file)
y_train_pred = knn.predict(X_train)
y_train_pred = y_train_pred.round(2)
y_train_pred = cupy.where(y_train_pred > 0.5, 1, 0)
y_test_pred = knn.predict(X_test)
y_test_pred = y_test_pred.round(2)
y_test_pred = cupy.where(y_test_pred > 0.5, 1, 0)
print('\nModel Metrics for KNN Baseline SMOTE')
print('Training Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_train), asnumpy(y_train_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_train),
asnumpy(y_train_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Test Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))
The model is definitely overfitting the training set for the SMOTE set like it was for the Upsampled baseline KNN model. The lowest metrics are the precision and F1 score again, which are even lower than what the Upsampled baseline model revealed.
100 Trials: Precision
¶
The same train_and_eval and objective functions and search parameter space described in the Upsampled section were utilized for the SMOTE study. A new .pkl file was generated for each search (a sketch of the study run is shown below).
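For reference, the study run follows the same Optuna/Dask pattern as the LinearSVC searches shown earlier; a sketch is below, where the study name and the study .pkl file name are illustrative placeholders rather than the author's exact ones.
# Sketch mirroring the earlier Optuna/Dask pattern; names below are placeholders
study_name = 'dask_knn_optuna_SMOTE_100_precision_tpe'
with timed('dask_optuna'):
    start_time = datetime.now()
    if os.path.isfile('KNN_Optuna_SMOTE_100_GPU_Precision_study.pkl'):
        study = joblib.load('KNN_Optuna_SMOTE_100_GPU_Precision_study.pkl')
    else:
        study = optuna.create_study(sampler=optuna.samplers.TPESampler(),
                                    study_name=study_name,
                                    direction='maximize')
    with parallel_backend('dask'):
        study.optimize(lambda trial: objective(trial, X_train, y_train),
                       n_trials=100, n_jobs=n_workers)
    end_time = datetime.now()
    print(str(timedelta(seconds=(end_time - start_time).seconds)))
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Highest precision:', study.best_value)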
Results from the Hyperparameter Search
¶
Let's utilize plot_slice to compare the objective value and the individual parameters.
fig = optuna.visualization.plot_slice(study)
fig.show()
Fewer n_neighbors resulted in higher objective values, as in the Upsampled trials.
Now, let's use plot_param_importances to visualize parameter importances:
fig = optuna.visualization.plot_param_importances(study)
fig.show()
The n_neighbors hyperparameter is the most important (0.99), while it was 0.98 for the Upsampled set.
Next, let's arrange the best parameters to re-create the best model, fit it using the training data and save it as a .pkl file. Then, we can evaluate the model metrics on the test set using the classification report, confusion matrix as well as accuracy, precision, recall and F1 score from sklearn.metrics.
params = study.best_params
params
best_model = KNeighborsClassifier(n_neighbors=4, metric='euclidean')
best_model.fit(X_train, y_train)
Pkl_Filename = 'KNN_Optuna_SMOTE_trials100_GPU_Precision.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for KNN HPO SMOTE 100trials GPU Precision')
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test.to_numpy(), y_test_pred.to_numpy())
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test.to_numpy(), y_test_pred.to_numpy()))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test.to_numpy(),
y_test_pred.to_numpy()))
print('Precision score : %.3f' % precision_score(y_test.to_numpy(),
y_test_pred.to_numpy()))
print('Recall score : %.3f' % recall_score(y_test.to_numpy(),
y_test_pred.to_numpy()))
print('F1 score : %.3f' % f1_score(y_test.to_numpy(), y_test_pred.to_numpy()))
Compared to the baseline model using the SMOTE set, all metrics increased besides a small decrease in the recall score, like what was observed when comparing the tuned Upsampled model to the baseline model. To restate, it might be worthwhile to consider monitoring the F1 score instead or preprocessing the data before modeling. Compared to the tuned model using the Upsampled set, the Upsampled model metrics performed better than the metrics from the SMOTE search. Let's now evaluate the AUC ROC on the test set.
print('The best model from SMOTE 100 Precision GPU trials optimization scores {:.5f} AUC ROC on the test set.'.format(roc_auc_score(y_test.to_numpy(),
y_test_pred.to_numpy())))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])