Banks routinely lose money based on loans that eventually default. Per the Federal Reserve, at the height of the financial crisis in 2009-2010, the amount lost approached 500 billion U.S dollars. More recently, losses each quarter tend to approach 150 billion. Delinquency rates tend to be around 1.5% most recently. Because of this, it is vitally important for banks to ensure that they keep their delinquencies as low as possible.

  • Can we accurately predict loan approval based on historical data?
  • How can we confidentially determine whether a loan can be approved?

Rationale and Objective:

  • If a loan is current, the company is making money and should approve such future loans based on the model.
  • If a loan is late or in default, the company is not making money and should reject future loans based the model.
  • What factors predict loan approval?
  • Which variables best predict if a loan will be a loss and how much is the average loss?

Data

   The data was retrieved from Kaggle. Lending Club is a peer to peer financial company. Essentially, people can request an unsecured loan between 1,000 and 40,000 dollars while other individuals can visit the site to choose to invest in the loans. So, people are essentially lending to other people directly with Lending Club as a facilitator.

Preprocessing

   The code that was used for preprocessing and EDA can be found LoanApproval_LendingClub Github repository. First the environment is set up with the dependencies, library options, the seed for reproducibility, and setting the location of the project directory. Then the data is read, duplicate observations dropped and empty rows replace with NA.

In [ ]:
import os
import random
import numpy as np
import warnings
from numpy import sort
import pandas as pd
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

seed_value = 42
os.environ['LoanStatus_PreprocessEDA'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)

df = pd.read_csv('loan_Master.csv', index_col=False, low_memory=False)
print('- Dimensions of initial data:', df.shape)

df = df.replace(r'^\s*$', np.nan, regex=True)
df1 = df.loc[:, df.isnull().mean() < 0.05]
print('- Dimensions when columns > 95% missing removed:', df1.shape)

s = set(df1)
varDiff = [x for x in df if x not in s]
print('- Number of features removed due to high missingness:'
      + str(len(varDiff)))

df = df1
del df1
- Dimensions of initial data: (2260668, 145)
- Dimensions when columns > 95% missing removed: (2260668, 82)
- Number of features removed due to high missingness: 63

   Let's now use a function to examine the data types present, the amount of data missing as a percentage, the number of unique values and the number of rows and columns. This can be utilized to determine which features should remain in the set as well as the ones that might result in longer runtimes and more difficulty in explaining the downstream models.

In [ ]:
def data_quality_table(df):
    """Returns the characteristics of variables in a Pandas dataframe."""
    var_type = df.dtypes
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    unique_count = df.nunique()
    mis_val_table = pd.concat([var_type, mis_val_percent, unique_count], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0: 'Data Type', 1: 'Percent Missing', 2: 'Number Unique'})
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:,1] >= 0].sort_values(
            'Percent Missing', ascending=False).round(1)
    print ('- There are ' df.shape[0]) + ' rows and '
           + str(df.shape[1]) + ' columns.\n')
    return mis_val_table_ren_columns

Categorical Variables (Features)

   Dimensionality is a problem that frequently occurs when working with categorical features. Variables with a large number of groups or levels will increase the size of the data when dummy variables or one hot encoding is used during modeling. After examining the quality of the data, the features with a high percentage of missing values and a large number of groups were removed from the set.

In [ ]:
df1 = df.select_dtypes(include = 'object')
print('\n              Data Quality: Qualitative Variables')
display(data_quality_table(df1))
print('\n')
print('\nSample observations of qualitative variables:')
display(df1.head())

df = df.drop(['title', 'last_pymnt_d', 'zip_code', 'earliest_cr_line',
              'last_credit_pull_d', 'issue_d', 'addr_state', 'sub_grade'],
             axis=1)

del df1

              Data Quality: Qualitative Variables
- There are 2260668 rows and 20 columns.

Data Type Percent Missing Number Unique
title object 1.0 63155
last_pymnt_d object 0.1 135
last_credit_pull_d object 0.0 140
earliest_cr_line object 0.0 754
zip_code object 0.0 956
addr_state object 0.0 51
disbursement_method object 0.0 2
hardship_flag object 0.0 2
application_type object 0.0 2
initial_list_status object 0.0 2
term object 0.0 2
grade object 0.0 7
purpose object 0.0 14
pymnt_plan object 0.0 2
loan_status object 0.0 9
issue_d object 0.0 139
verification_status object 0.0 3
home_ownership object 0.0 6
sub_grade object 0.0 35
debt_settlement_flag object 0.0 2



Sample observations of qualitative variables:
term grade sub_grade home_ownership verification_status issue_d loan_status pymnt_plan purpose title zip_code addr_state earliest_cr_line initial_list_status last_pymnt_d last_credit_pull_d application_type hardship_flag disbursement_method debt_settlement_flag
0 36 months C C1 RENT Not Verified Dec-2018 Current n debt_consolidation Debt consolidation 109xx NY Apr-2001 w Feb-2019 Feb-2019 Individual N Cash N
1 60 months D D2 MORTGAGE Source Verified Dec-2018 Current n debt_consolidation Debt consolidation 713xx LA Jun-1987 w Feb-2019 Feb-2019 Individual N Cash N
2 36 months D D1 MORTGAGE Source Verified Dec-2018 Current n debt_consolidation Debt consolidation 490xx MI Apr-2011 w Feb-2019 Feb-2019 Individual N Cash N
3 36 months D D2 MORTGAGE Source Verified Dec-2018 Current n debt_consolidation Debt consolidation 985xx WA Feb-2006 w Feb-2019 Feb-2019 Individual N Cash N
4 60 months C C4 MORTGAGE Not Verified Dec-2018 Current n debt_consolidation Debt consolidation 212xx MD Dec-2000 w Feb-2019 Feb-2019 Individual N Cash N

Quantitative Variables (Features)

   Now we can remove the rows with any column having NA/null for some of the variables determined with the data_quality_table function so the complete cases without missing data can be used for the set.

In [ ]:
df1 = df.select_dtypes(exclude = 'object')
print('\n              Data Quality: Quantitative Variables')
display(data_quality_table(df1))
print('\n')
print('\nSample observations of quantitative variables:')
display(df1.head())

df = df[df.bc_util.notna() & df.percent_bc_gt_75.notna()
        & df.pct_tl_nvr_dlq.notna() & df.mths_since_recent_bc.notna()
        & df.dti.notna() & df.inq_last_6mths.notna() & df.num_rev_accts.notna()]

del df1

              Data Quality: Quantitative Variables
- There are 2260668 rows and 62 columns.

Data Type Percent Missing Number Unique
bc_util float64 3.4 1494
percent_bc_gt_75 float64 3.3 284
bc_open_to_buy float64 3.3 91500
mths_since_recent_bc float64 3.2 546
pct_tl_nvr_dlq float64 3.1 690
avg_cur_bal float64 3.1 88597
mo_sin_old_rev_tl_op float64 3.1 787
mo_sin_rcnt_rev_tl_op float64 3.1 333
num_rev_accts float64 3.1 117
num_actv_bc_tl float64 3.1 42
total_rev_hi_lim float64 3.1 34220
mo_sin_rcnt_tl float64 3.1 232
num_accts_ever_120_pd float64 3.1 44
num_bc_tl float64 3.1 76
num_actv_rev_tl float64 3.1 57
tot_coll_amt float64 3.1 15574
num_il_tl float64 3.1 122
num_op_rev_tl float64 3.1 81
num_rev_tl_bal_gt_0 float64 3.1 50
num_tl_30dpd float64 3.1 5
num_tl_90g_dpd_24m float64 3.1 34
num_tl_op_past_12m float64 3.1 33
tot_hi_cred_lim float64 3.1 529972
tot_cur_bal float64 3.1 487688
total_il_high_credit_limit float64 3.1 194137
num_bc_sats float64 2.6 60
num_sats float64 2.6 91
total_bc_limit float64 2.2 20309
total_bal_ex_mort float64 2.2 212777
acc_open_past_24mths float64 2.2 57
mort_acc float64 2.2 47
revol_util float64 0.1 1430
dti float64 0.1 10845
pub_rec_bankruptcies float64 0.1 12
collections_12_mths_ex_med float64 0.0 16
chargeoff_within_12_mths float64 0.0 11
tax_liens float64 0.0 42
inq_last_6mths float64 0.0 28
total_acc float64 0.0 152
delinq_2yrs float64 0.0 37
acc_now_delinq float64 0.0 9
pub_rec float64 0.0 43
open_acc float64 0.0 91
delinq_amnt float64 0.0 2617
annual_inc float64 0.0 89368
installment float64 0.0 93296
funded_amnt_inv float64 0.0 10057
int_rate float64 0.0 673
policy_code int64 0.0 1
revol_bal int64 0.0 102251
out_prncp float64 0.0 364399
out_prncp_inv float64 0.0 377353
total_pymnt float64 0.0 1608307
total_pymnt_inv float64 0.0 1299089
total_rec_prncp float64 0.0 487427
total_rec_int float64 0.0 629835
total_rec_late_fee float64 0.0 17991
recoveries float64 0.0 127920
funded_amnt int64 0.0 1572
collection_recovery_fee float64 0.0 140449
last_pymnt_amnt float64 0.0 692560
loan_amnt int64 0.0 1572



Sample observations of quantitative variables:
loan_amnt funded_amnt funded_amnt_inv int_rate installment annual_inc dti delinq_2yrs inq_last_6mths open_acc pub_rec revol_bal revol_util total_acc out_prncp out_prncp_inv total_pymnt total_pymnt_inv total_rec_prncp total_rec_int total_rec_late_fee recoveries collection_recovery_fee last_pymnt_amnt collections_12_mths_ex_med policy_code acc_now_delinq tot_coll_amt tot_cur_bal total_rev_hi_lim acc_open_past_24mths avg_cur_bal bc_open_to_buy bc_util chargeoff_within_12_mths delinq_amnt mo_sin_old_rev_tl_op mo_sin_rcnt_rev_tl_op mo_sin_rcnt_tl mort_acc mths_since_recent_bc num_accts_ever_120_pd num_actv_bc_tl num_actv_rev_tl num_bc_sats num_bc_tl num_il_tl num_op_rev_tl num_rev_accts num_rev_tl_bal_gt_0 num_sats num_tl_30dpd num_tl_90g_dpd_24m num_tl_op_past_12m pct_tl_nvr_dlq percent_bc_gt_75 pub_rec_bankruptcies tax_liens tot_hi_cred_lim total_bal_ex_mort total_bc_limit total_il_high_credit_limit
0 2500 2500 2500.0 13.56 84.92 55000.0 18.24 0.0 1.0 9.0 1.0 4341 10.3 34.0 2386.02 2386.02 167.02 167.02 113.98 53.04 0.0 0.0 0.0 84.92 0.0 1 0.0 0.0 16901.0 42000.0 9.0 1878.0 34360.0 5.9 0.0 0.0 212.0 1.0 1.0 0.0 1.0 0.0 2.0 5.0 3.0 3.0 16.0 7.0 18.0 5.0 9.0 0.0 0.0 3.0 100.0 0.0 1.0 0.0 60124.0 16901.0 36500.0 18124.0
1 30000 30000 30000.0 18.94 777.23 90000.0 26.52 0.0 0.0 13.0 1.0 12315 24.2 44.0 29387.75 29387.75 1507.11 1507.11 612.25 894.86 0.0 0.0 0.0 777.23 0.0 1 0.0 1208.0 321915.0 50800.0 10.0 24763.0 13761.0 8.3 0.0 0.0 378.0 4.0 3.0 3.0 4.0 0.0 2.0 4.0 4.0 9.0 27.0 8.0 14.0 4.0 13.0 0.0 0.0 6.0 95.0 0.0 1.0 0.0 372872.0 99468.0 15000.0 94072.0
2 5000 5000 5000.0 17.97 180.69 59280.0 10.51 0.0 0.0 8.0 0.0 4599 19.1 13.0 4787.21 4787.21 353.89 353.89 212.79 141.10 0.0 0.0 0.0 180.69 0.0 1 0.0 0.0 110299.0 24100.0 4.0 18383.0 13800.0 0.0 0.0 0.0 92.0 15.0 14.0 2.0 77.0 0.0 0.0 3.0 3.0 3.0 4.0 6.0 7.0 3.0 8.0 0.0 0.0 0.0 100.0 0.0 0.0 0.0 136927.0 11749.0 13800.0 10000.0
3 4000 4000 4000.0 18.94 146.51 92000.0 16.74 0.0 0.0 10.0 0.0 5468 78.1 13.0 3831.93 3831.93 286.71 286.71 168.07 118.64 0.0 0.0 0.0 146.51 0.0 1 0.0 686.0 305049.0 7000.0 5.0 30505.0 1239.0 75.2 0.0 0.0 154.0 64.0 5.0 3.0 64.0 0.0 1.0 2.0 1.0 2.0 7.0 2.0 3.0 2.0 10.0 0.0 0.0 3.0 100.0 100.0 0.0 0.0 385183.0 36151.0 5000.0 44984.0
4 30000 30000 30000.0 16.14 731.78 57250.0 26.35 0.0 0.0 12.0 0.0 829 3.6 26.0 29339.02 29339.02 1423.21 1423.21 660.98 762.23 0.0 0.0 0.0 731.78 0.0 1 0.0 0.0 116007.0 23100.0 9.0 9667.0 8471.0 8.9 0.0 0.0 216.0 2.0 2.0 2.0 2.0 0.0 2.0 2.0 3.0 8.0 9.0 6.0 15.0 2.0 12.0 0.0 0.0 5.0 92.3 0.0 0.0 0.0 157548.0 29674.0 9300.0 32332.0

Dependent Variable (Target): Status of Loan

   The majority of the individuals in this set are Fully Paid or Current for the loan_status feature regarding their loan payments. However, the minority of the samples are not.

In [ ]:
print(df.loan_status.value_counts(normalize=True).mul(100).round(2).astype(str) + '%')
Fully Paid             45.0%
Current               41.94%
Charged Off           11.48%
Late (31-120 days)      1.0%
In Grace Period        0.41%
Late (16-30 days)      0.17%
Default                 0.0%
Name: loan_status, dtype: object

   Now let's convert loan_status to a binary variable current = 0 and default = 1 so this can be used as the target (label) for classification modeling. After recoding to binary, a clear class imbalance exists with 87.35% of the observations on track for completing payments.

In [ ]:
df['loan_status'] = df['loan_status'].replace(['Fully Paid'], 0)
df['loan_status'] = df['loan_status'].replace(['In Grace Period'], 0)
df['loan_status'] = df['loan_status'].replace(['Current'], 0)

df['loan_status'] = df['loan_status'].replace(['Charged Off'], 1)
df['loan_status'] = df['loan_status'].replace(['Late (31-120 days)'], 1)
df['loan_status'] = df['loan_status'].replace(['Late (16-30 days)'], 1)
df['loan_status'] = df['loan_status'].replace(['Does not meet the credit policy. Status:Fully Paid'], 1)
df['loan_status'] = df['loan_status'].replace(['Does not meet the credit policy. Status:Charged Off'], 1)
df['loan_status'] = df['loan_status'].replace(['Default'], 1)

print('\nExamine Binary Loan Status for Class Imbalance:')
print(df.loan_status.value_counts(normalize=True).mul(100).round(2).astype(str) + '%')

Examine Binary Loan Status for Class Imbalance:
0    87.35%
1    12.65%
Name: loan_status, dtype: object

Variable Selection

SelectFromModel using XGBoost

   To prepare the data for feature selection using this approach, let'separate the input features as X and the target as y. Then we can create dummy variables for the categorical features using pandas.get_dummies and drop the initial features. Now we can fit a baseline model on all data using the evaluation metric logloss, specifying the parameters to run on a GPU and the random_state as the seed_value. We can then fit the XGBClassifier model, make predictions and determine the accuracy of the baseline model using accuracy_score.

In [ ]:
from xgboost import XGBClassifier, plot_importance
from sklearn.metrics import accuracy_score

X = df.drop('loan_status', axis=1)
y = df.loan_status

X = pd.get_dummies(X, drop_first=True)

model = XGBClassifier(eval_metric='logloss',
                      use_label_encoder=False,
                      tree_method='gpu_hist',
                      gpu_id=0,
                      random_state=seed_value)

model.fit(X, y)

y_pred = model.predict(X)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y, predictions)
print('Accuracy: %.3f%%' % (accuracy * 100))
Accuracy: 98.942%

   The feature importance of the model can be examined using plot_importance from the XGBoost module and the plot parameters can be defined using matplotlib.pyplot.

In [ ]:
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (10, 10)
plt.rcParams.update({'font.size': 8.5})
plot_importance(model)
plt.tight_layout()
plt.show();

   The ten most important features using the baseline model are total_rec_prncp, out_prncp, last_pymnt_amnt, int_rate, loan_amnt, total_rec_int installment, total_rec_late_fee, funded_amnt_inv and mo_sin_old_rev_tl_op.

   Another method to examine how features affect a model is permutation feature importance. This is defined as the reduction in the model score when a single feature value is shuffled randomly, indicating the extent to which the model relies on the feature. Different numbers of permutations can be tested.

In [ ]:
from sklearn.inspection import permutation_importance

perm_importance = permutation_importance(model, X, y)

plt.rcParams.update({'font.size': 7})
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(X.columns[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel('Permutation Importance')
plt.tight_layout()
plt.show();

   SHAP (SHapley Additive exPlanations) published as A Unified Approach to Interpreting Model Predictions estimates the impact of the individual components on the final result, separately, while conserving the total impact on the final result. This can be considered in conjunction with the previously used feature importance measures when modeling. Let's use the TreeExplainer since using XGBoost and generate shap_values for the features. The summary_plot shows the global importance of each feature and the distribution of the effect sizes over the set.

In [ ]:
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

plt.rcParams.update({'font.size': 7})
shap.summary_plot(shap_values, X, show=False)
plt.show();

   As total_rec_prncp, out_prncp_inv and the last_pymnt_amnt increase, there is a lower probability for a loan to default while as recoveries increases, there is a higher probability the loan will default.

   Now, we can define the maximum number of features to test as the current number of features, and the minimum number of features, which is two. We need to define a metric to test, so let's use the model accuracy. The thresholds can then defined by sorting the model feature importances in descending order. Then a for loop can be used to iterate through the list of thresholds to fit a model, predict and calculate the model accuracy. The model accuracy can then be saved in a list where the new values are added as a new model is tested.

In [ ]:
from sklearn.feature_selection import SelectFromModel
import time

print('Time for feature selection using XGBoost...')
search_time_start = time.time()
feat_max = X.shape[1]
feat_min = 2
acc_max = accuracy
thresholds = sort(model.feature_importances_)
thresh_goal = thresholds[0]
accuracy_list = []
for thresh in thresholds:
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X = selection.transform(X)
    selection_model = XGBClassifier(eval_metric='logloss',
                                    use_label_encoder=False,
                                    tree_method='gpu_hist',
                                    gpu_id=0,
                                    random_state=seed_value)
    selection_model.fit(select_X, y)

  selection_model_pred = selection_model.predict(select_X)
  selection_predictions = [round(value) for value in selection_model_pred]
  accuracy = accuracy_score(y_true=y, y_pred=selection_predictions)
  accuracy = accuracy * 100
  print('Thresh= %.6f, n= %d, Accuracy: %.3f%%' % (thresh, select_X.shape[1],
                                                   accuracy))
  accuracy_list.append(accuracy)
  if(select_X.shape[1] < feat_max) and (select_X.shape[1] >= feat_min) and (accuracy >= acc_max):
    n_min = select_X.shape[1]
    acc_max = accuracy
    thresh_goal = thresh

print('\n')
print('Finished feature selection using XGBoost in:',
      time.time() - search_time_start)
print('\n')
print('\nThe optimal threshold is:')
print(thresh_goal)
Time for feature selection using XGBoost...
Thresh= 0.000000, n= 94, Accuracy: 98.942%
Thresh= 0.000000, n= 94, Accuracy: 98.942%
Thresh= 0.000000, n= 94, Accuracy: 98.942%
Thresh= 0.000000, n= 94, Accuracy: 98.942%
Thresh= 0.000000, n= 94, Accuracy: 98.942%
Thresh= 0.000000, n= 94, Accuracy: 98.942%
Thresh= 0.000000, n= 94, Accuracy: 98.942%
Thresh= 0.000000, n= 94, Accuracy: 98.942%
Thresh= 0.000044, n= 86, Accuracy: 98.942%
Thresh= 0.000098, n= 85, Accuracy: 98.943%
Thresh= 0.000099, n= 84, Accuracy: 98.936%
Thresh= 0.000103, n= 83, Accuracy: 98.938%
Thresh= 0.000104, n= 82, Accuracy: 98.939%
Thresh= 0.000106, n= 81, Accuracy: 98.938%
Thresh= 0.000108, n= 80, Accuracy: 98.944%
Thresh= 0.000114, n= 79, Accuracy: 98.935%
Thresh= 0.000116, n= 78, Accuracy: 98.938%
Thresh= 0.000118, n= 77, Accuracy: 98.941%
Thresh= 0.000121, n= 76, Accuracy: 98.941%
Thresh= 0.000121, n= 75, Accuracy: 98.940%
Thresh= 0.000123, n= 74, Accuracy: 98.937%
Thresh= 0.000124, n= 73, Accuracy: 98.937%
Thresh= 0.000124, n= 72, Accuracy: 98.939%
Thresh= 0.000125, n= 71, Accuracy: 98.941%
Thresh= 0.000126, n= 70, Accuracy: 98.936%
Thresh= 0.000129, n= 69, Accuracy: 98.939%
Thresh= 0.000129, n= 68, Accuracy: 98.939%
Thresh= 0.000130, n= 67, Accuracy: 98.941%
Thresh= 0.000130, n= 66, Accuracy: 98.941%
Thresh= 0.000131, n= 65, Accuracy: 98.943%
Thresh= 0.000131, n= 64, Accuracy: 98.941%
Thresh= 0.000132, n= 63, Accuracy: 98.939%
Thresh= 0.000133, n= 62, Accuracy: 98.939%
Thresh= 0.000134, n= 61, Accuracy: 98.939%
Thresh= 0.000135, n= 60, Accuracy: 98.942%
Thresh= 0.000135, n= 59, Accuracy: 98.942%
Thresh= 0.000137, n= 58, Accuracy: 98.943%
Thresh= 0.000139, n= 57, Accuracy: 98.943%
Thresh= 0.000140, n= 56, Accuracy: 98.939%
Thresh= 0.000141, n= 55, Accuracy: 98.940%
Thresh= 0.000143, n= 54, Accuracy: 98.938%
Thresh= 0.000143, n= 53, Accuracy: 98.934%
Thresh= 0.000146, n= 52, Accuracy: 98.938%
Thresh= 0.000150, n= 51, Accuracy: 98.938%
Thresh= 0.000154, n= 50, Accuracy: 98.939%
Thresh= 0.000155, n= 49, Accuracy: 98.939%
Thresh= 0.000158, n= 48, Accuracy: 98.937%
Thresh= 0.000161, n= 47, Accuracy: 98.945%
Thresh= 0.000166, n= 46, Accuracy: 98.941%
Thresh= 0.000167, n= 45, Accuracy: 98.946%
Thresh= 0.000167, n= 44, Accuracy: 98.938%
Thresh= 0.000169, n= 43, Accuracy: 98.938%
Thresh= 0.000172, n= 42, Accuracy: 98.943%
Thresh= 0.000173, n= 41, Accuracy: 98.942%
Thresh= 0.000179, n= 40, Accuracy: 98.942%
Thresh= 0.000189, n= 39, Accuracy: 98.943%
Thresh= 0.000196, n= 38, Accuracy: 98.942%
Thresh= 0.000197, n= 37, Accuracy: 98.944%
Thresh= 0.000199, n= 36, Accuracy: 98.943%
Thresh= 0.000207, n= 35, Accuracy: 98.942%
Thresh= 0.000207, n= 34, Accuracy: 98.940%
Thresh= 0.000210, n= 33, Accuracy: 98.941%
Thresh= 0.000230, n= 32, Accuracy: 98.941%
Thresh= 0.000236, n= 31, Accuracy: 98.941%
Thresh= 0.000250, n= 30, Accuracy: 98.944%
Thresh= 0.000293, n= 29, Accuracy: 98.940%
Thresh= 0.000322, n= 28, Accuracy: 98.939%
Thresh= 0.000357, n= 27, Accuracy: 98.943%
Thresh= 0.000367, n= 26, Accuracy: 98.940%
Thresh= 0.000376, n= 25, Accuracy: 98.942%
Thresh= 0.000376, n= 24, Accuracy: 98.941%
Thresh= 0.000414, n= 23, Accuracy: 98.940%
Thresh= 0.000455, n= 22, Accuracy: 98.941%
Thresh= 0.000482, n= 21, Accuracy: 98.941%
Thresh= 0.000482, n= 20, Accuracy: 98.942%
Thresh= 0.000543, n= 19, Accuracy: 98.941%
Thresh= 0.000658, n= 18, Accuracy: 98.943%
Thresh= 0.000671, n= 17, Accuracy: 98.942%
Thresh= 0.000698, n= 16, Accuracy: 98.938%
Thresh= 0.000910, n= 15, Accuracy: 98.941%
Thresh= 0.000942, n= 14, Accuracy: 98.943%
Thresh= 0.001166, n= 13, Accuracy: 98.929%
Thresh= 0.001632, n= 12, Accuracy: 98.930%
Thresh= 0.001872, n= 11, Accuracy: 98.929%
Thresh= 0.002695, n= 10, Accuracy: 98.927%
Thresh= 0.004031, n= 9, Accuracy: 98.927%
Thresh= 0.006384, n= 8, Accuracy: 98.921%
Thresh= 0.006503, n= 7, Accuracy: 98.910%
Thresh= 0.007374, n= 6, Accuracy: 98.872%
Thresh= 0.009866, n= 5, Accuracy: 95.287%
Thresh= 0.011903, n= 4, Accuracy: 95.209%
Thresh= 0.025941, n= 3, Accuracy: 94.960%
Thresh= 0.029138, n= 2, Accuracy: 94.960%
Thresh= 0.874743, n= 1, Accuracy: 94.960%


Finished feature selection using XGBoost in: 691.7376308441162



The optimal threshold is:
0.00016690888

   Now we can subset the features where the optimal threshold occurred into a new set and examine which features were chosen. We can then fit a model with these features as the input to the model, and subsequently generate predictions using all of the available data and evaluate the accuracy of the model.

In [ ]:
selection = SelectFromModel(model, threshold=thresh_goal, prefit=True)

feature_names = X.columns[selection.get_support(indices=True)]
print('\n- Feature selection using XGBoost resulted in '
      + str(len(feature_names)) + ' features.')
print('\n- Features selected using optimal threshold for accuracy:')
print(X.columns[selection.get_support()])
print('\n')

X = pd.DataFrame(data=X, columns=feature_names)

model = XGBClassifier(eval_metric='logloss',
                      use_label_encoder=False,
                      tree_method='gpu_hist',
                      gpu_id=0,
                      random_state=seed_value)
model.fit(X, y)
model.save_model('xgb_featureSelection.model')

y_pred = model.predict(X)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y, predictions)
print('Accuracy: %.3f%%' % (accuracy * 100.0))

- Feature selection using XGBoost resulted in 45 features.

- Features selected using optimal threshold for accuracy:
Index(['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'int_rate', 
       'installment', 'annual_inc', 'inq_last_6mths', 'pub_rec', 'revol_bal',
       'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv',
       'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries',
       'last_pymnt_amnt', 'acc_open_past_24mths', 'bc_open_to_buy',
       'chargeoff_within_12_mths', 'delinq_amnt', 'mths_since_recent_bc',
       'num_rev_tl_bal_gt_0', 'tot_hi_cred_lim', 'total_bc_limit',
       'term_ 60 months', 'grade_B', 'grade_C', 'grade_D', 'grade_E',
       'home_ownership_MORTGAGE', 'home_ownership_RENT',
       'verification_status_Source Verified', 'verification_status_Verified',
       'pymnt_plan_y', 'purpose_major_purchase', 'purpose_moving',
       'purpose_small_business', 'purpose_vacation', 'initial_list_status_w',
       'application_type_Joint App', 'hardship_flag_Y',
       'disbursement_method_DirectPay', 'debt_settlement_flag_Y'],
      dtype='object')


Accuracy: 98.946%

   45 features were selected and 98.946% was the resulting accuracy. Let's now plot the feature importance graph again and compare to the baseline model.

In [ ]:
plt.rcParams.update({'font.size': 10})
plot_importance(model)
plt.tight_layout()
plt.show();

   The ten most important features using the SelectFromModel are still total_rec_prncp and out_prncp, but int_rate increased in rank instead of last_pymnt_amnt and total_rec_int as well as installment instead of loan_amnt and total_rec_late_fee. Also, funded_amnt_inv and revol_bal increased in rank instead of mo_sin_old_rev_tl_op.

   Then we can examine the permutation based feature importance again and plot the results.

In [ ]:
perm_importance = permutation_importance(model, X, y)

sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(X.columns[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel('Permutation Importance')
plt.show();

   Then we can visualize the feature importance computed with the SHAP values.

In [ ]:
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

plt.rcParams.update({'font.size': 7})
shap.summary_plot(shap_values, X, show=False)
plt.show();

   Compared to default model, the majority of the features demonstrate the same properties, but loan_amnt is higher than out_prncp and int_rate is higher than tota_rec_late_fee.

Group Lasso for Variable Selection

Multicollinearity using Variance Inflation Factor (VIF)

   To check for multicollinearity, we can use the Variance Inflation Factor (VIF) to quantify the amount that the estimated coefficients are inflated when multicollinearity exists. Let's first separate the input features and target and then select the quantitative features for inspection. We can define a function to calculate VIF with a specified threshold which removes the feature if the VIF is greater than a set threshold. The typical threshold is 4-5 to remove multicollinearity while 10 is considered as high multicollinearity. So let's use threshold=5.0 and examine the number of remaining numerical variables using this defined threshold.

In [ ]:
from joblib import Parallel, delayed
from statsmodels.stats.outliers_influence import variance_inflation_factor

X1 = df.drop('loan_status', axis=1)
y = df.loan_status

df_num = X1.select_dtypes(include = ['float64', 'int64'])

def calculate_vif(X, threshold=5.0):
    features = [X.columns[i] for i in range(X.shape[1])]
    dropped = True
    while dropped:
        dropped = False
        print('\nThe starting number of quantitative features is: '
              + str(len(features)))
        vif = Parallel(n_jobs=-1,
                       verbose=5)(delayed(variance_inflation_factor)(X[features].values,
                                                                     ix) for ix in range(len(features)))
        maxloc = vif.index(max(vif))
        if max(vif) > threshold:
            print(time.ctime() + ' dropping \'' + X[features].columns[maxloc]
                  + '\' at index: ' + str(maxloc))
          features.pop(maxloc)
          dropped = True
  print('Features Remaining:')
  print([features])
  return X[[i for i in features]]

print('Time for calculating VIF on numerical data using threshold = 5...')
search_time_start = time.time()

X1 = calculate_vif(df_num, 5)
print('\nNumber of quant features after VIF:', X1.shape[1])

print('Finished calculating VIF on numerical data using threshold = 5 in:',
      time.time() - search_time_start)
print ('- There are ' + str(X1.shape[0]) + ' rows and '
       + str(X1.shape[1]) + ' columns.\n')
print('\nQuant features remaining after VIF:')
print(X1.columns)
Time for calculating VIF on numerical data using threshold = 5...

The starting number of quantitative features is: 61
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  43 out of  61 | elapsed:  5.2min remaining:  2.2min
[Parallel(n_jobs=-1)]: Done  56 out of  61 | elapsed:  6.4min remaining:   34.0s
[Parallel(n_jobs=-1)]: Done  61 out of  61 | elapsed:  6.5min finished
Sun Aug 28 22:16:47 2022 dropping 'total_pymnt' at index: 16

The starting number of quantitative features is: 60
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 out of  60 | elapsed:  4.8min remaining:  2.1min
[Parallel(n_jobs=-1)]: Done  55 out of  60 | elapsed:  5.9min remaining:   32.0s
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:  6.0min finished
Sun Aug 28 22:22:49 2022 dropping 'funded_amnt' at index: 1

The starting number of quantitative features is: 59
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  59 | elapsed:  4.3min remaining:  2.1min
[Parallel(n_jobs=-1)]: Done  52 out of  59 | elapsed:  5.4min remaining:   43.8s
[Parallel(n_jobs=-1)]: Done  59 out of  59 | elapsed:  5.7min finished
Sun Aug 28 22:28:35 2022 dropping 'funded_amnt_inv' at index: 1

The starting number of quantitative features is: 58
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  39 out of  58 | elapsed:  4.3min remaining:  2.1min
[Parallel(n_jobs=-1)]: Done  51 out of  58 | elapsed:  5.4min remaining:   44.7s
[Parallel(n_jobs=-1)]: Done  58 out of  58 | elapsed:  5.7min finished
Sun Aug 28 22:34:20 2022 dropping 'out_prncp_inv' at index: 13

The starting number of quantitative features is: 57
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  38 out of  57 | elapsed:  4.3min remaining:  2.1min
[Parallel(n_jobs=-1)]: Done  50 out of  57 | elapsed:  5.3min remaining:   44.6s
[Parallel(n_jobs=-1)]: Done  57 out of  57 | elapsed:  5.6min finished
Sun Aug 28 22:39:58 2022 dropping 'total_pymnt_inv' at index: 13

The starting number of quantitative features is: 56
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 out of  56 | elapsed:  4.1min remaining:  2.1min
[Parallel(n_jobs=-1)]: Done  49 out of  56 | elapsed:  5.2min remaining:   44.5s
[Parallel(n_jobs=-1)]: Done  56 out of  56 | elapsed:  5.4min finished
Sun Aug 28 22:45:22 2022 dropping 'open_acc' at index: 7

The starting number of quantitative features is: 55
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  36 out of  55 | elapsed:  3.9min remaining:  2.1min
[Parallel(n_jobs=-1)]: Done  48 out of  55 | elapsed:  4.9min remaining:   43.0s
[Parallel(n_jobs=-1)]: Done  55 out of  55 | elapsed:  5.1min finished
Sun Aug 28 22:50:29 2022 dropping 'total_acc' at index: 10

The starting number of quantitative features is: 54
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 out of  54 | elapsed:  3.6min remaining:  2.1min
[Parallel(n_jobs=-1)]: Done  45 out of  54 | elapsed:  4.6min remaining:   55.3s
[Parallel(n_jobs=-1)]: Done  54 out of  54 | elapsed:  5.0min finished
Sun Aug 28 22:55:30 2022 dropping 'num_rev_tl_bal_gt_0' at index: 41

The starting number of quantitative features is: 53
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 out of  53 | elapsed:  3.4min remaining:  2.0min
[Parallel(n_jobs=-1)]: Done  44 out of  53 | elapsed:  4.4min remaining:   54.5s
[Parallel(n_jobs=-1)]: Done  53 out of  53 | elapsed:  4.7min finished
Sun Aug 28 23:00:15 2022 dropping 'tot_hi_cred_lim' at index: 49

The starting number of quantitative features is: 52
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  32 out of  52 | elapsed:  3.3min remaining:  2.1min
[Parallel(n_jobs=-1)]: Done  43 out of  52 | elapsed:  4.1min remaining:   51.1s
[Parallel(n_jobs=-1)]: Done  52 out of  52 | elapsed:  4.4min finished
Sun Aug 28 23:04:43 2022 dropping 'loan_amnt' at index: 0

The starting number of quantitative features is: 51
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  31 out of  51 | elapsed:  2.8min remaining:  1.8min
[Parallel(n_jobs=-1)]: Done  42 out of  51 | elapsed:  3.7min remaining:   47.7s
[Parallel(n_jobs=-1)]: Done  51 out of  51 | elapsed:  4.0min finished
Sun Aug 28 23:08:44 2022 dropping 'num_op_rev_tl' at index: 38

The starting number of quantitative features is: 50
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  50 | elapsed:  3.3min remaining:  2.2min
[Parallel(n_jobs=-1)]: Done  41 out of  50 | elapsed:  3.3min remaining:   43.6s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  3.9min finished
Sun Aug 28 23:12:41 2022 dropping 'bc_util' at index: 24

The starting number of quantitative features is: 49
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  28 out of  49 | elapsed:  2.9min remaining:  2.2min
[Parallel(n_jobs=-1)]: Done  38 out of  49 | elapsed:  3.5min remaining:  1.0min
[Parallel(n_jobs=-1)]: Done  49 out of  49 | elapsed:  3.9min finished
Sun Aug 28 23:16:40 2022 dropping 'num_rev_accts' at index: 37

The starting number of quantitative features is: 48
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  27 out of  48 | elapsed:  2.8min remaining:  2.2min
[Parallel(n_jobs=-1)]: Done  37 out of  48 | elapsed:  3.4min remaining:  1.0min
[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:  3.8min finished
Sun Aug 28 23:20:30 2022 dropping 'pct_tl_nvr_dlq' at index: 41

The starting number of quantitative features is: 47
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 out of  47 | elapsed:  2.1min remaining:  1.7min
[Parallel(n_jobs=-1)]: Done  36 out of  47 | elapsed:  3.3min remaining:   59.6s
[Parallel(n_jobs=-1)]: Done  47 out of  47 | elapsed:  3.5min finished
Sun Aug 28 23:24:04 2022 dropping 'total_bal_ex_mort' at index: 44

The starting number of quantitative features is: 46
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 out of  46 | elapsed:  2.3min remaining:  1.9min
[Parallel(n_jobs=-1)]: Done  35 out of  46 | elapsed:  3.1min remaining:   58.8s
[Parallel(n_jobs=-1)]: Done  46 out of  46 | elapsed:  3.3min finished
Sun Aug 28 23:27:26 2022 dropping 'num_actv_bc_tl' at index: 32

The starting number of quantitative features is: 45
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  24 out of  45 | elapsed:  2.2min remaining:  1.9min
[Parallel(n_jobs=-1)]: Done  34 out of  45 | elapsed:  2.8min remaining:   53.4s
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:  3.1min finished
Sun Aug 28 23:30:36 2022 dropping 'recoveries' at index: 13

The starting number of quantitative features is: 44
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  22 out of  44 | elapsed:  2.1min remaining:  2.1min
[Parallel(n_jobs=-1)]: Done  31 out of  44 | elapsed:  2.5min remaining:  1.0min
[Parallel(n_jobs=-1)]: Done  40 out of  44 | elapsed:  2.9min remaining:   17.5s
[Parallel(n_jobs=-1)]: Done  44 out of  44 | elapsed:  3.0min finished
Sun Aug 28 23:33:36 2022 dropping 'num_sats' at index: 35

The starting number of quantitative features is: 43
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  21 out of  43 | elapsed:  2.0min remaining:  2.1min
[Parallel(n_jobs=-1)]: Done  30 out of  43 | elapsed:  2.4min remaining:  1.0min
[Parallel(n_jobs=-1)]: Done  39 out of  43 | elapsed:  2.8min remaining:   16.9s
[Parallel(n_jobs=-1)]: Done  43 out of  43 | elapsed:  2.8min finished
Sun Aug 28 23:36:25 2022 dropping 'total_rev_hi_lim' at index: 19

The starting number of quantitative features is: 42
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  20 out of  42 | elapsed:  1.9min remaining:  2.0min
[Parallel(n_jobs=-1)]: Done  29 out of  42 | elapsed:  2.2min remaining:   60.0s
[Parallel(n_jobs=-1)]: Done  38 out of  42 | elapsed:  2.6min remaining:   16.3s
[Parallel(n_jobs=-1)]: Done  42 out of  42 | elapsed:  2.6min finished
Sun Aug 28 23:39:03 2022 dropping 'installment' at index: 1

The starting number of quantitative features is: 41
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  19 out of  41 | elapsed:  2.3min remaining:  2.6min
[Parallel(n_jobs=-1)]: Done  28 out of  41 | elapsed:  2.8min remaining:  1.3min
[Parallel(n_jobs=-1)]: Done  37 out of  41 | elapsed:  3.2min remaining:   20.9s
[Parallel(n_jobs=-1)]: Done  41 out of  41 | elapsed:  3.3min finished
Sun Aug 28 23:42:22 2022 dropping 'total_bc_limit' at index: 39

The starting number of quantitative features is: 40
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 out of  40 | elapsed:  2.4min remaining:  3.0min
[Parallel(n_jobs=-1)]: Done  27 out of  40 | elapsed:  3.1min remaining:  1.5min
[Parallel(n_jobs=-1)]: Done  36 out of  40 | elapsed:  3.6min remaining:   24.1s
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:  3.7min finished
Sun Aug 28 23:46:04 2022 dropping 'num_bc_sats' at index: 30

The starting number of quantitative features is: 39
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  16 out of  39 | elapsed:  1.7min remaining:  2.5min
[Parallel(n_jobs=-1)]: Done  24 out of  39 | elapsed:  3.0min remaining:  1.9min
[Parallel(n_jobs=-1)]: Done  32 out of  39 | elapsed:  3.3min remaining:   43.6s
[Parallel(n_jobs=-1)]: Done  39 out of  39 | elapsed:  3.5min finished
Sun Aug 28 23:49:37 2022 dropping 'revol_util' at index: 7

The starting number of quantitative features is: 38
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  38 | elapsed:  1.8min remaining:  2.7min
[Parallel(n_jobs=-1)]: Done  23 out of  38 | elapsed:  2.9min remaining:  1.9min
[Parallel(n_jobs=-1)]: Done  31 out of  38 | elapsed:  3.1min remaining:   41.4s
[Parallel(n_jobs=-1)]: Done  38 out of  38 | elapsed:  3.3min finished
Sun Aug 28 23:53:00 2022 dropping 'tot_cur_bal' at index: 16

The starting number of quantitative features is: 37
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  14 out of  37 | elapsed:  1.7min remaining:  2.8min
[Parallel(n_jobs=-1)]: Done  22 out of  37 | elapsed:  2.7min remaining:  1.9min
[Parallel(n_jobs=-1)]: Done  30 out of  37 | elapsed:  2.9min remaining:   41.1s
[Parallel(n_jobs=-1)]: Done  37 out of  37 | elapsed:  3.2min finished
Sun Aug 28 23:56:13 2022 dropping 'pub_rec' at index: 5

The starting number of quantitative features is: 36
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  13 out of  36 | elapsed:  1.6min remaining:  2.9min
[Parallel(n_jobs=-1)]: Done  21 out of  36 | elapsed:  2.7min remaining:  1.9min
[Parallel(n_jobs=-1)]: Done  29 out of  36 | elapsed:  2.9min remaining:   41.4s
[Parallel(n_jobs=-1)]: Done  36 out of  36 | elapsed:  3.1min finished
Sun Aug 28 23:59:19 2022 dropping 'int_rate' at index: 0

The starting number of quantitative features is: 35
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  35 | elapsed:  1.5min remaining:  3.0min
[Parallel(n_jobs=-1)]: Done  20 out of  35 | elapsed:  2.5min remaining:  1.9min
[Parallel(n_jobs=-1)]: Done  28 out of  35 | elapsed:  2.8min remaining:   41.5s
[Parallel(n_jobs=-1)]: Done  35 out of  35 | elapsed:  3.0min finished
Mon Aug 29 00:02:18 2022 dropping 'acc_open_past_24mths' at index: 14

The starting number of quantitative features is: 34
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  34 | elapsed:  1.5min remaining:  3.7min
[Parallel(n_jobs=-1)]: Done  17 out of  34 | elapsed:  2.3min remaining:  2.3min
[Parallel(n_jobs=-1)]: Done  24 out of  34 | elapsed:  2.7min remaining:  1.1min
[Parallel(n_jobs=-1)]: Done  31 out of  34 | elapsed:  2.8min remaining:   16.0s
[Parallel(n_jobs=-1)]: Done  34 out of  34 | elapsed:  2.9min finished
Mon Aug 29 00:05:12 2022 dropping 'total_rec_prncp' at index: 6

The starting number of quantitative features is: 33
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   9 out of  33 | elapsed:  1.4min remaining:  3.6min
[Parallel(n_jobs=-1)]: Done  16 out of  33 | elapsed:  1.6min remaining:  1.7min
[Parallel(n_jobs=-1)]: Done  23 out of  33 | elapsed:  2.4min remaining:  1.0min
[Parallel(n_jobs=-1)]: Done  30 out of  33 | elapsed:  2.5min remaining:   14.9s
[Parallel(n_jobs=-1)]: Done  33 out of  33 | elapsed:  2.6min finished
Mon Aug 29 00:07:46 2022 dropping 'num_bc_tl' at index: 24

The starting number of quantitative features is: 32
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 out of  32 | elapsed:  1.3min remaining:  3.8min
[Parallel(n_jobs=-1)]: Done  15 out of  32 | elapsed:  1.5min remaining:  1.7min
[Parallel(n_jobs=-1)]: Done  22 out of  32 | elapsed:  2.2min remaining:  1.0min
[Parallel(n_jobs=-1)]: Done  29 out of  32 | elapsed:  2.3min remaining:   14.4s
[Parallel(n_jobs=-1)]: Done  32 out of  32 | elapsed:  2.4min finished
Mon Aug 29 00:10:09 2022 dropping 'num_actv_rev_tl' at index: 23

The starting number of quantitative features is: 31
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   7 out of  31 | elapsed:  1.1min remaining:  3.9min
[Parallel(n_jobs=-1)]: Done  14 out of  31 | elapsed:  1.4min remaining:  1.7min
[Parallel(n_jobs=-1)]: Done  21 out of  31 | elapsed:  2.0min remaining:   58.1s
[Parallel(n_jobs=-1)]: Done  28 out of  31 | elapsed:  2.2min remaining:   13.7s
[Parallel(n_jobs=-1)]: Done  31 out of  31 | elapsed:  2.2min finished
Features Remaining:
[['annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'revol_bal', 'out_prncp', 'total_rec_int', 'total_rec_late_fee', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'acc_now_delinq', 'tot_coll_amt', 'avg_cur_bal', 'bc_open_to_buy', 'chargeoff_within_12_mths', 'delinq_amnt', 'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl', 'mort_acc', 'mths_since_recent_bc', 'num_accts_ever_120_pd', 'num_il_tl', 'num_tl_30dpd', 'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'percent_bc_gt_75', 'pub_rec_bankruptcies', 'tax_liens', 'total_il_high_credit_limit']]

Number of quant features after VIF: 31
Finished calculating VIF on numerical data using threshold = 5 in: 7326.294751167297
- There are 2162365 rows and 31 columns.


Quant features remaining after VIF:
Index(['annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'revol_bal',
       'out_prncp', 'total_rec_int', 'total_rec_late_fee',
       'collection_recovery_fee', 'last_pymnt_amnt',
       'collections_12_mths_ex_med', 'acc_now_delinq', 'tot_coll_amt',
       'avg_cur_bal', 'bc_open_to_buy', 'chargeoff_within_12_mths',
       'delinq_amnt', 'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op',
       'mo_sin_rcnt_tl', 'mort_acc', 'mths_since_recent_bc',
       'num_accts_ever_120_pd', 'num_il_tl', 'num_tl_30dpd',
       'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'percent_bc_gt_75',
       'pub_rec_bankruptcies', 'tax_liens', 'total_il_high_credit_limit'],
      dtype='object')

   In the initial set, there were 61 quantitative features, and after using VIF with a threshold=5.0, 31 variables remain. Almost half of the variables were removed using this approach!

   To note, when SelectFromModel using XGBoost was completed without performing VIF, the resulting score was 98.946% and when retested after completing VIF, the score was 96.854%. Ensemble-based algorithms can handle multicollinearity while linear given the data distribution cannot.

    Now let's concatenate these selected variables with the non-numerical features including the dependent variable, loan_status. Before a group lasso can be tested for variable selection, the quantitative and categorical variables need to be preprocessed consisting of scaling using the MinMaxScaler for the numerical features, one hot encoding the categorical variables followed by stacking together column wise in a sparse matrix. The groups are then extracted from the one hot encoded features since they need to be specified for the group lasso.

In [ ]:
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
import scipy.sparse
from group_lasso.utils import extract_ohe_groups

df1 = df.select_dtypes(include = 'object')
df1 = pd.concat([X1, df1], axis=1)
df1 = pd.concat([y, df1], axis=1)

df_num = df1.select_dtypes(include = ['float64', 'int64'])
df_num = df_num.drop(['loan_status'], axis=1)
num_columns = df_num.columns.tolist()

scaler = MinMaxScaler()
scaled = scaler.fit_transform(df_num)

df_cat = df1.select_dtypes(include = 'object')
cat_columns = df_cat.columns.tolist()

ohe = OneHotEncoder()
onehot_data = ohe.fit_transform(df1[cat_columns])

X2 = scipy.sparse.hstack([onehot_data, scipy.sparse.csr_matrix(scaled)])
y = df1['loan_status']

groups = extract_ohe_groups(ohe)
groups = np.hstack([groups, len(cat_columns) + np.arange(len(num_columns))+1])
print('The groups consist of ' + str(groups) + ' for the group lasso.')
The groups consist of [0 0 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 7
 7 8 8 9 9 10 10 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
 31 32 33 34 35 36 37 38 39 40 41 42] for the group lasso.

   To perform the group lasso, let's first define the parameter grid to test where the same number of iterations is used and a differing number for the convergence tolerance where the Default=1e-5.

   Then we can define the parameters for the estimator:

  • groups: Iterable that specifies which group each column corresponds. The categorical variables as labeled groups.
  • group_reg: The regularization coefficient(s) for the group sparsity penalty. Default=0.05
  • l1_reg: The regularization coefficient for the coefficient sparsity penalty. Default=0.05
  • scale_reg: How to scale the group-wise regularisation coefficients. Default='group_size'
  • random_state: The random state used for initialisation of parameters. Default='None'. Let's use the same seed for reproducibility (seed_value).

   Then the GridSearchCV parameters can be defined:

  • The defined estimator as LogisticGroupLasso
  • scoring: Score the model based on accuracy
  • cv: The number of folds for cross validation as 5-fold.
  • param_grid: The hyperparameters in params.

   Now the models can be fit given the defined parameters in the grid search space to find the best estimator parameters and accuracy.

In [ ]:
from group_lasso import LogisticGroupLasso
from sklearn.model_selection import GridSearchCV

LogisticGroupLasso.LOG_LOSSES = True

params = {
    'n_iter': [3000],
    'tol': [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6]
    }

grid = GridSearchCV(estimator=LogisticGroupLasso(
                    groups=groups, group_reg=0.05, l1_reg=0, scale_reg=None,
                    supress_warning=True, random_state=seed_value),
                    scoring='accuracy', cv=5, param_grid=params)

print('Time for feature selection using GroupLasso GridSearchCV...')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
    grid.fit(X2, y)
print('Finished feature selection using GroupLasso GridSearchCV in:',
      time.time() - search_time_start)

print('\nGroup Lasso GridSearchCV Feature selection')
print('\nBest Estimator:')
print(grid.best_estimator_)
print('\nBest Parameters:')
print(grid.best_params_)
print('\nBest Accuracy:')
print(grid.best_score_)
print('\nResults from GridSearch CV:')
print(grid.cv_results_)
Time for feature selection using GroupLasso GridSearchCV...
Finished feature selection using GroupLasso GridSearchCV in: 14369.552435874939

Group Lasso GridSearchCV Feature selection

Best Estimator:
LogisticGroupLasso(groups=array([0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10,
       12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
       29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42],
      dtype=object),
                   l1_reg=0, n_iter=3000, random_state=42, scale_reg=None,
                   supress_warning=True, tol=0.1)

Best Parameters:
{'n_iter': 3000, 'tol': 0.1}

Best Accuracy:
0.8735407759559557

Results from GridSearch CV:
{'mean_fit_time': array([ 518.28504934, 4770.94411964, 5226.64562335, 5811.78841519,
       4953.19865055, 7256.40037341]), 'std_fit_time': array([  12.44344049,   87.36115474,  239.33446816, 1763.87135773,
       3138.6469911 , 3622.15695553]), 'mean_score_time': array([12.19659243, 21.37654667,  9.95250163,  1.08136353,  1.99505672,
        0.82633104]), 'std_score_time': array([1.89190727, 4.97357209, 8.13540789, 2.16272707, 1.75450612,
       0.62814339]), 'param_n_iter': masked_array(data=[3000, 3000, 3000, 3000, 3000, 3000],
             mask=[False, False, False, False, False, False],
       fill_value='?',
            dtype=object), 'param_tol': masked_array(data=[0.1, 0.01, 0.001, 0.0001, 1e-05, 1e-06],
             mask=[False, False, False, False, False, False],
       fill_value='?',
            dtype=object), 'params': [{'n_iter': 3000, 'tol': 0.1}, {'n_iter': 3000, 'tol': 0.01}, {'n_iter': 3000, 'tol': 0.001}, {'n_iter': 3000, 'tol': 0.0001}, {'n_iter': 3000, 'tol': 1e-05}, {'n_iter': 3000, 'tol': 1e-06}], 'split0_test_score': array([0.87354124, 0.87354124, 0.87354124,        nan,        nan,
       0.87354124]), 'split1_test_score': array([0.87354124, 0.87354124,        nan, 0.87354124, 0.87354124,
       0.87354124]), 'split2_test_score': array([0.87354124, 0.87354124,        nan,        nan,        nan,
              nan]), 'split3_test_score': array([0.87354124, 0.87354124, 0.87354124,        nan, 0.87354124,
       0.87354124]), 'split4_test_score': array([0.87353893, 0.87353893, 0.87353893,        nan, 0.87353893,
       0.87353893]), 'mean_test_score': array([0.87354078, 0.87354078,        nan,        nan,        nan,
              nan]), 'std_test_score': array([9.24913232e-07, 9.24913232e-07,            nan,            nan,
                  nan,            nan]), 'rank_test_score': array([1, 1, 3, 4, 5, 6])}

   The best parameters from the grid search are {'n_iter': 3000, 'tol': 0.1} with an accuracy of 0.8735407759559557. However, the rank_test_score using tol=1e-1 and tol=1e-2 generated comparable results.

   Now using these parameters, let's fit the model, predict and extract the results from the performance metrics. We can use the defined model to extract the features selected and then count the number selected. Then a list of the chosen_groups_ in the model as a pandas.Series can be transposed and the values converted to a list to match with the features in the original set. Then the loss over the iterations can plotted to visualize the results.

In [ ]:
gl = LogisticGroupLasso(
    groups=groups,
    group_reg=0.05,
    n_iter=3000,
    tol=0.1,
    l1_reg=0,
    scale_reg=None,
    supress_warning=True,
    random_state=seed_value,
)

with parallel_backend('threading', n_jobs=-1):
    gl.fit(X2, y)

pred_y = gl.predict(X2)
sparsity_mask = gl.sparsity_mask_
accuracy = (pred_y == y).mean()

print(f'Number of total variables: {len(sparsity_mask)}')
print(f'Number of chosen variables: {sparsity_mask.sum()}')
print(f'Accuracy: {accuracy}')

tdf = pd.Series(list(gl.chosen_groups_)).T
tdf = tdf.values.tolist()

X2 = df1.drop('loan_status', axis=1)
X2 = X2.iloc[:,tdf]
variables = X2.columns.tolist()
print(f'Selected variables from group lasso: {variables}')

plt.rcParams['figure.figsize'] = (7, 5)
plt.rcParams.update({'font.size': 15})
plt.plot(gl.losses_)
plt.tight_layout()
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Group Lasso: Loss over the Number of Iterations')
plt.show()

del X2, tdf, variables
Number of total variables: 75
Number of chosen variables: 19
Accuracy: 0.8735407759559556
Selected variables from group lasso: ['dti', 'delinq_2yrs', 'inq_last_6mths', 'revol_bal', 'out_prncp', 'collections_12_mths_ex_med']

   Given that the starting number of features after VIF equaled 75, 19 were selected using the parameters tested in the grid search which included dti, delinq_2yrs, inq_last_6mths, revol_bal, out_prncp, collections_12_mths_ex_med. The resulting accuracy when fitting this model was 87.354 compared to 98.942 using 45 features when evaluating SelectFromModel with XGBoost. Therefore, this method for utilizing a linear based feature selection method might not be the best approach, but an alternative like the glmnet package in both python and R might perform better. However, let's test more variable selection methods.

Model-Free Sure Independence Screening (MV-SIS)

   When I first began exploring different feature selection approaches in an academic research setting, a biostatistician recommended Model-Free Sure Independence Screening (MV-SIS) given the robustness in high dimensional sets where p (predictors) >> n (number of observations). The publication of this approach can be found here.

   This method builds on the approaches that were proposed beyond correlation ranking for non-linear problems. It also implies that the MV-based variable screening is model-free because it is defined with conditional and unconditional distribution functions and is able to address linear and nonlinear relationships between the given response and features. MV-SIS is robust to heavy-tailed distributions of features and outliers because it inherits the robustness property of the conditional distribution function. The sure screening property contained within this approach can also handle multi-class response targets.

   Let's now migrate to R to utilize this approach. The script that was used to process the data utilizing similar approaches can be found here. In summary, the libraries that were used for preprocessing include tidyr, plyr, dplyr and fastDummies.

   The preprocessing code before testing this approach can be found here. Let's convert loan_status to binary given the same aforementioned conditions and sample 10% of the data using the caret library. Now, we can write the column names to a .csv file and the row level data without the column names to a .txt file to prepare the data.

In [ ]:
data_sampling_vector <- createDataPartition(df$loan_status, p=0.90, list=FALSE)

data_train <- df[data_sampling_vector,]
data_test <- df[-data_sampling_vector,]

a <- data_test
dim(a)

write.table(a, 'names.csv', sep=',', col.names=TRUE, row.names=FALSE)
write(t(a), 'LoanStatus_screen.txt', ncol=ncol(a))
  1. 216236
  2. 112

   I was provided the source code to utilize this approach for feature selection, but a package called VariableScreening is available on CRAN. We can use source("*.R) to delegate the script that we want R to use, in this case, source("MVSIS.R"). Let's now examine what is contained within the MVSIS.R file.

MVSIS Source Code

   To compute the criteria in the simulation, multiple functions are required. Let's first start by creating a function M to compute the minimum model size to ensure the inclusion of all active predictors.

Input:

  • true.v: the true variables index.
  • rank.mtx: the ranked index matrix by screening method for the 1000 replications each column corresponds the ranked index in one replication.

Output:

  • M: a vector of the minimum model sizes to ensure the inclusion of all active predictors
In [ ]:
M <- function(true.v,rank.mtx){
    r <- min(dim(rank.mtx)[2], length(rank.mtx))
	  M <- c()
	  for (j in 1:r){
        M[j] <- max(match(true.v, rank.mtx[,j]))
        }
    return(M)
    }

   Another function mqtl can be used to compute the 5%, 25%, 50%, 75% and 95% quantiles of the minimum model size out of 1,000 replications.

Input:

  • M: a vector of the minimum model sizes to ensure the inclusion of all active predictors.

Output:

  • 5%, 25%, 50%, 75%, 95% quantiles of minimum model sizes out of 1000 replications
In [ ]:
mqtl <- function(M){
    quantile(M, probs=c(0.05,0.25,0.5,0.75,0.95))
}

   Another function Sel.rate can be used to compute the proportion that every single active predictor is selected for a given model size, which is defaults to c[n/log(n)] in the 1,000 replications.

Input:

  • n: The sample size.
  • c: Coeficient of cutoffs. i.e. when hen c=2, cutoff = 2[n/log(n)].
  • true.v: The true variable index.
  • rank.mtx: The ranked index matrix by screening method for the 1,000 replications each column corresponds the ranked index in one replication.

Output:

  • rate: The proportions that every single active predictor is selected for a given model size, which the default c[n/log(n)], in the 1,000 replications.
In [ ]:
Sel.rate <- function(n,c,true.v,rank.mtx){
    d <- c * floor(n / log(n))
    rank.mtx.sel <- rank.mtx[1:d,]
    r <- min(dim(rank.mtx)[2], length(rank.mtx))
    p0 <- length(true.v)
	  R <- matrix(0,p0,r)
	  rate <- c()
	  for (i in 1:p0){
	      for (j in 1:r){
            R[i,j] <- (min(abs(rank.mtx.sel[,j] - true.v[i]))==0)
						}
		        rate[i] <- mean(R[i,])
					}
	  return(rate)
	 }

   Now, let's define the functions that will used to compute the MV(X,Y) where Y is a discrete value.

In [ ]:
Fk <- function(X0,x){
    Fk = c()
    for (i in 1:length(x)){
        Fk[i] = sum(X0 <= x[i]) / length(X0)
        }
    return(Fk)
    }

Fkr <- function(X0,Y,yr,x){
    Fkr = c()
	  ind_yr = (Y==yr)
    for (i in 1:length(x)){
        Fkr[i] = sum((X0 <= x[i]) * ind_yr) / sum(ind_yr)
        }
    return(Fkr)
    }

MV <- function(Xk,Y){
    Fk0 <- Fk(Xk,Xk)
    Yr <- unique(Y)
	  MVr <- c()
    for (r in 1:length(Yr)){
	      MVr[r] <- (sum(Y==Yr[r]) / length(Y)) * mean((Fkr(Xk,Y,Yr[r],Xk) - Fk0)^2)
       }
	  MV <- sum(MVr)
	  return(MV)
   }

   Then the row-level .txt file can be read byrow=T as a matrix with the number of columns defined with ncol and loan_status as the first feature to define the response. The selection criteria can be defined as mu for computing the criteria in the simulation. Quantiles can then be used with mu to determine the variable rank. Different values of quantile(mu, A) can be evaluated, where a higher value of A results in less features while a lower value selects more variables. Let's use A=0.7 and determine how many features result in the pruned set.

In [ ]:
F0 <- matrix(scan('LoanStatus_screen.txt'), ncol=112, byrow=T)
response <- F0[,1]

mu <- NULL
for(i in 2:ncol(F0)){
    u <- MV(F0[,i], response)
    mu <- c(mu,u)
    if(i%%100==0) print(i)
}

q3 <- quantile(mu, 0.7)
mu <- cbind(1:(ncol(F0)-1), mu)
name <- read.table('names.csv', sep=',')
name <- name[mu[,2] > q3]
write(t(name), 'names2.csv', sep=',', ncol=length(name))
[1] 100
In [ ]:
su <- mu[mu[,2] > q3,]
dim(su)

F0 <- F0[,-1]
F1 <- F0[,su[,1]]
B <- cbind(response,F1)

V <- B
B[15:41,] <- V[19:45,]
B[42:45,] <- V[15:18,]

write(t(B), ncol=ncol(B), file='MVSIS_0.7.txt')

ncol(B) - 1
  1. 28
  2. 2
28

   28 features were selected using this threshold. Now we can evaluate the performance of a model given the input features from MV-SIS.

In [ ]:
a1 <- read.table('MVSIS_0.7.txt')
a1 <- as.matrix(a1)
dim(a1)

b <- read.table('names2.csv', sep=',')
b <- as.character.numeric_version(b)
k <- 'status'
b <- c(k,b)

write.table(a1, file='MVSIS_0.7_rf.txt', col.names=b)
  1. 216236
  2. 28
  1. 216236
  2. 28

   Let's first load the required libraries (randomForest and doParallel), and set up a cluster to run the jobs in parallel using the 8 cores available on the local machine.

In [ ]:
library(randomForest)
library(doParallel)
cl <- makePSOCKcluster(8)
registerDoParallel(cl)
getDoParWorkers()
randomForest 4.7-1.1

Type rfNews() to see new features/changes/bug fixes.

Loading required package: foreach

Loading required package: iterators

Loading required package: parallel

8

   Then the set containing the selected features from MV-SIS can be read, shuffled and examined.

In [ ]:
variable <- read.table('MVSIS_0.7_rf.txt')
variable = variable[sample(1:nrow(variable)),]
colnames(variable)[colnames(variable) == 'status'] = 'loan_status'
str(variable)
'data.frame':	216236 obs. of  28 variables:
 $ loan_status                  : num  0 0 0 0 1 0 0 0 1 0 ...
 $ loan_amnt                    : int  13125 10000 13200 3600 8000 11000 2500 21000 8600 12000 ...
 $ funded_amnt                  : int  13125 10000 13200 3600 8000 11000 2500 21000 8600 12000 ...
 $ funded_amnt_inv              : num  13000 10000 13200 3600 8000 11000 2500 21000 8600 12000 ...
 $ int_rate                     : num  24.99 9.49 14.33 15.31 16.99 ...
 $ installment                  : num  385 320 309 125 285 ...
 $ annual_inc                   : num  65221 68928 75000 49800 22883 ...
 $ delinq_2yrs                  : int  0 1 0 1 0 0 0 1 4 0 ...
 $ inq_last_6mths               : int  1 0 1 1 0 1 0 0 1 0 ...
 $ open_acc                     : int  30 14 10 11 9 8 11 12 12 5 ...
 $ revol_util                   : num  22 43.2 89.7 52.7 38.3 54.9 26.3 43.2 67 88.4 ...
 $ last_pymnt_amnt              : num  11940 320 9362 2622 285 ...
 $ collections_12_mths_ex_med   : int  0 0 0 0 0 0 0 1 1 0 ...
 $ tot_coll_amt                 : int  0 0 0 0 0 1667 0 598 364 0 ...
 $ mo_sin_old_rev_tl_op         : int  219 146 138 190 74 130 325 245 105 61 ...
 $ mort_acc                     : int  0 3 5 1 0 2 1 2 1 0 ...
 $ num_accts_ever_120_pd        : int  8 0 0 2 0 14 0 5 0 0 ...
 $ num_actv_rev_tl              : int  3 6 4 6 2 5 5 4 7 3 ...
 $ num_bc_tl                    : int  6 9 4 9 2 6 9 5 8 3 ...
 $ num_il_tl                    : int  31 20 3 3 4 19 4 25 29 2 ...
 $ num_op_rev_tl                : int  11 10 6 9 5 5 9 7 7 3 ...
 $ num_tl_op_past_12m           : int  7 1 1 3 2 4 2 3 3 0 ...
 $ percent_bc_gt_75             : num  25 33.3 50 33.3 0 20 50 50 100 66.7 ...
 $ tot_hi_cred_lim              : int  141633 275761 324342 181219 36651 351716 62621 48133 278962 26800 ...
 $ total_bc_limit               : int  16200 28800 9800 11900 7300 19100 38700 3100 5500 10800 ...
 $ home_ownership_OTHER         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pymnt_plan_y                 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ disbursement_method_DirectPay: int  0 0 0 0 0 0 0 0 0 0 ...

   We can now partition the data into training and test sets using 90% for training and 10% for testing to run a Random Forest model with the default parameters to evaluate the predictive performance of using this variable selection approach.

In [ ]:
k1 <- 194612
k2 <- 21624

train <- k1
test <- k2

rf <- randomForest(factor(loan_status) ~ .,
                   data=variable[1:k1,],
                   ntree=100,
                   importance=T,
                   keep.forest=T)
rf
stopCluster(cl)

Call:
 randomForest(formula = factor(loan_status) ~ ., data = variable[1:k1,      ], ntree = 100, importance = T, keep.forest = T) 
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 5

        OOB estimate of  error rate: 12.59%
Confusion matrix:
       0    1 class.error
0 168086 1766  0.01039729
1  22736 2024  0.91825525

   Then, we can use the fitted model to predict and find the accuracy of the train and test sets.

In [ ]:
m <- rf$predicted
n <- predict(rf, variable[(k1+1):(k1+k2),])

k <- 0
for (i in 1:k1){
    if (variable[i,1]==m[i]){
        k=k+1
    }
}

Accuracy.fit <- k / k1

cat(sprintf('Train accuracy = %s\n', Accuracy.fit))

k <- 0
for (i in 1:k2){
    if (variable[(k1+i),1]==n[i]){
        k=k+1
    }
}

Accuracy.test <- k / k2
cat(sprintf('Test accuracy = %s\n', Accuracy.test))
Train accuracy = 0.87359463959057
Test accuracy = 0.880965593784684

   Using the default parameters, the accuracy is lower than optimal, but the dimensionality of the number of features has significantly decreased. Let's now evaluate for specificity and sensitivity.

In [ ]:
a <- 0
for (i in 1:k2){
    if (variable[(k1+i),1]==1 & n[i]==1){
        a = a+1
    }
}

b <- 0
for (i in 1:k2){
    if (variable[(k1+i),1]==0 & n[i]==1){
        b = b+1
    }
}

c <- 0
for (i in 1:k2){
    if (variable[(k1+i),1]==1 & n[i]==0){
        c = c+1
    }
}

d <- 0
for (i in 1:k2){
    if (variable[(k1+i),1]==0 & n[i]==0){
        d = d+1
    }
}

sensitivity = d / (b + d)
cat(sprintf('Sensitivity = %s\n', sensitivity))

specificity = a / (a + c)
cat(sprintf('Specificity = %s\n', specificity))
Sensitivity = 0.993800996079262
Specificity = 0.0687272727272727

   The sensitivity is great, but the specificity is quite low, so we should keep this in mind when comparing the various variable selection methods evaluated.

   Next, let's use the model to predict and generate a confusion matrix for the training and the test set.

In [ ]:
train$loan_status <- as.factor(train$loan_status)

pred = predict(rf, newdata=train[-1])

pred1 <- as.factor(pred)

confusionMatrix(data=pred1, reference=train$loan_status)
Confusion Matrix and Statistics

          Reference
Prediction      0      1
         0 169710      0
         1      0  24902
                                   
               Accuracy : 1        
                 95% CI : (1, 1)   
    No Information Rate : 0.872    
    P-Value [Acc > NIR] : < 2.2e-16
                                   
                  Kappa : 1        
                                   
 Mcnemar's Test P-Value : NA       
                                   
            Sensitivity : 1.000    
            Specificity : 1.000    
         Pos Pred Value : 1.000    
         Neg Pred Value : 1.000    
             Prevalence : 0.872    
         Detection Rate : 0.872    
   Detection Prevalence : 0.872    
      Balanced Accuracy : 1.000    
                                   
       'Positive' Class : 0        
                                   
In [ ]:
test$loan_status <- as.factor(test$loan_status)

pred = predict(rf, newdata=test[-1])

pred1 <- as.factor(pred)

confusionMatrix(data=pred1, reference=test$loan_status)
Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 18867  2423
         1   149   185
                                          
               Accuracy : 0.8811          
                 95% CI : (0.8767, 0.8853)
    No Information Rate : 0.8794          
    P-Value [Acc > NIR] : 0.2296          
                                          
                  Kappa : 0.1012          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.99216         
            Specificity : 0.07094         
         Pos Pred Value : 0.88619         
         Neg Pred Value : 0.55389         
             Prevalence : 0.87939         
         Detection Rate : 0.87250         
   Detection Prevalence : 0.98455         
      Balanced Accuracy : 0.53155         
                                          
       'Positive' Class : 0               
                                          

Boruta

   Boruta is a wrapper around the Random Forest algorithm which contains an improvement in regards to the feature importance measure since it considers all of the features that are relevant to the target variable. It can handle interactions between features as well while maintaining the varying degree of randomness that might exist within a set.

   The reference manual states:

   Boruta iteratively compares importances of attributes with importances of shadow attributes, created by shuffling original ones. Attributes that have significantly worst importance than shadow ones are being consecutively dropped. On the other hand, attributes that are significantly better than shadows are admitted to be Confirmed. Shadows are re-created in each iteration. Algorithm stops when only Confirmed attributes are left, or when it reaches maxRuns importance source runs. If the second scenario occurs, some attributes may be left without a decision. They are claimed Tentative. You may try to extend maxRuns or lower pValue to clarify them, but in some cases their importances do fluctuate too much for Boruta to converge. Instead, you can use TentativeRoughFix function, which will perform other, weaker test to make a final decision, or simply treat them as undecided in further analysis.

   The input set for Boruta cannot contain missing observations, so either the complete cases for a set must be determined or a selected imputation method needs to be utilized prior given the reason(s) for the presence of data missing (missing completely at random (MCAR) or missing at random (MAR). The current available versions are slow in performing feature selection compared to other traditional feature selection algorithms, and performance should always be considered. In conjunction, after determining the important features, collinearity needs to be addressed which is another limitation of using this approach.

   Let's first set the working directory to FeatureSelection where the results can be saved. Then load the Boruta library and convert loan_status to a factor since this is the target variable. Due to the lengthy computation time and required CPU resources, we will use the test set and doTrace=0 for the verbosity level.

In [ ]:
data_test$loan_status <- as.factor(data_test$loan_status)

boruta.df <- Boruta(loan_status~., data=data_test, doTrace=0)
Growing trees.. Progress: 25%. Estimated remaining time: 1 minute, 32 seconds.
Growing trees.. Progress: 51%. Estimated remaining time: 1 minute, 0 seconds.
Growing trees.. Progress: 77%. Estimated remaining time: 28 seconds.
Computing permutation importance.. Progress: 7%. Estimated remaining time: 8 minutes, 15 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 6 minutes, 12 seconds.
Computing permutation importance.. Progress: 26%. Estimated remaining time: 4 minutes, 59 seconds.
Computing permutation importance.. Progress: 33%. Estimated remaining time: 4 minutes, 31 seconds.
Computing permutation importance.. Progress: 42%. Estimated remaining time: 3 minutes, 51 seconds.
Computing permutation importance.. Progress: 50%. Estimated remaining time: 3 minutes, 18 seconds.
Computing permutation importance.. Progress: 57%. Estimated remaining time: 2 minutes, 54 seconds.
Computing permutation importance.. Progress: 65%. Estimated remaining time: 2 minutes, 20 seconds.
Computing permutation importance.. Progress: 73%. Estimated remaining time: 1 minute, 50 seconds.
Computing permutation importance.. Progress: 82%. Estimated remaining time: 1 minute, 12 seconds.
Computing permutation importance.. Progress: 89%. Estimated remaining time: 43 seconds.
Computing permutation importance.. Progress: 98%. Estimated remaining time: 7 seconds.
Growing trees.. Progress: 25%. Estimated remaining time: 1 minute, 31 seconds.
Growing trees.. Progress: 51%. Estimated remaining time: 59 seconds.
Growing trees.. Progress: 78%. Estimated remaining time: 26 seconds.
Computing permutation importance.. Progress: 7%. Estimated remaining time: 8 minutes, 15 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 6 minutes, 2 seconds.
Computing permutation importance.. Progress: 26%. Estimated remaining time: 4 minutes, 59 seconds.
Computing permutation importance.. Progress: 34%. Estimated remaining time: 4 minutes, 17 seconds.
Computing permutation importance.. Progress: 42%. Estimated remaining time: 3 minutes, 47 seconds.
Computing permutation importance.. Progress: 51%. Estimated remaining time: 3 minutes, 14 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 2 minutes, 38 seconds.
Computing permutation importance.. Progress: 68%. Estimated remaining time: 2 minutes, 4 seconds.
Computing permutation importance.. Progress: 76%. Estimated remaining time: 1 minute, 34 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 1 seconds.
Computing permutation importance.. Progress: 92%. Estimated remaining time: 29 seconds.
Computing permutation importance.. Progress: 100%. Estimated remaining time: 0 seconds.
Growing trees.. Progress: 26%. Estimated remaining time: 1 minute, 30 seconds.
Growing trees.. Progress: 52%. Estimated remaining time: 57 seconds.
Growing trees.. Progress: 79%. Estimated remaining time: 25 seconds.
Computing permutation importance.. Progress: 7%. Estimated remaining time: 7 minutes, 46 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 6 minutes, 2 seconds.
Computing permutation importance.. Progress: 26%. Estimated remaining time: 5 minutes, 2 seconds.
Computing permutation importance.. Progress: 33%. Estimated remaining time: 4 minutes, 29 seconds.
Computing permutation importance.. Progress: 42%. Estimated remaining time: 3 minutes, 55 seconds.
Computing permutation importance.. Progress: 49%. Estimated remaining time: 3 minutes, 26 seconds.
Computing permutation importance.. Progress: 57%. Estimated remaining time: 2 minutes, 52 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 2 minutes, 16 seconds.
Computing permutation importance.. Progress: 74%. Estimated remaining time: 1 minute, 41 seconds.
Computing permutation importance.. Progress: 83%. Estimated remaining time: 1 minute, 9 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 35 seconds.
Computing permutation importance.. Progress: 99%. Estimated remaining time: 3 seconds.
Growing trees.. Progress: 24%. Estimated remaining time: 1 minute, 38 seconds.
Growing trees.. Progress: 49%. Estimated remaining time: 1 minute, 3 seconds.
Growing trees.. Progress: 75%. Estimated remaining time: 31 seconds.
Computing permutation importance.. Progress: 7%. Estimated remaining time: 8 minutes, 1 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 6 minutes, 2 seconds.
Computing permutation importance.. Progress: 25%. Estimated remaining time: 4 minutes, 56 seconds.
Computing permutation importance.. Progress: 32%. Estimated remaining time: 4 minutes, 37 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 3 minutes, 52 seconds.
Computing permutation importance.. Progress: 49%. Estimated remaining time: 3 minutes, 26 seconds.
Computing permutation importance.. Progress: 57%. Estimated remaining time: 2 minutes, 49 seconds.
Computing permutation importance.. Progress: 65%. Estimated remaining time: 2 minutes, 20 seconds.
Computing permutation importance.. Progress: 73%. Estimated remaining time: 1 minute, 46 seconds.
Computing permutation importance.. Progress: 81%. Estimated remaining time: 1 minute, 14 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 41 seconds.
Computing permutation importance.. Progress: 98%. Estimated remaining time: 7 seconds.
Growing trees.. Progress: 24%. Estimated remaining time: 1 minute, 37 seconds.
Growing trees.. Progress: 50%. Estimated remaining time: 1 minute, 1 seconds.
Growing trees.. Progress: 76%. Estimated remaining time: 28 seconds.
Computing permutation importance.. Progress: 7%. Estimated remaining time: 8 minutes, 29 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 6 minutes, 17 seconds.
Computing permutation importance.. Progress: 26%. Estimated remaining time: 5 minutes, 16 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 4 minutes, 19 seconds.
Computing permutation importance.. Progress: 42%. Estimated remaining time: 3 minutes, 55 seconds.
Computing permutation importance.. Progress: 51%. Estimated remaining time: 3 minutes, 16 seconds.
Computing permutation importance.. Progress: 58%. Estimated remaining time: 2 minutes, 48 seconds.
Computing permutation importance.. Progress: 67%. Estimated remaining time: 2 minutes, 12 seconds.
Computing permutation importance.. Progress: 74%. Estimated remaining time: 1 minute, 42 seconds.
Computing permutation importance.. Progress: 83%. Estimated remaining time: 1 minute, 7 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 38 seconds.
Computing permutation importance.. Progress: 99%. Estimated remaining time: 3 seconds.
Growing trees.. Progress: 26%. Estimated remaining time: 1 minute, 30 seconds.
Growing trees.. Progress: 51%. Estimated remaining time: 58 seconds.
Growing trees.. Progress: 78%. Estimated remaining time: 26 seconds.
Computing permutation importance.. Progress: 7%. Estimated remaining time: 7 minutes, 46 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 5 minutes, 56 seconds.
Computing permutation importance.. Progress: 26%. Estimated remaining time: 4 minutes, 59 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 4 minutes, 17 seconds.
Computing permutation importance.. Progress: 42%. Estimated remaining time: 3 minutes, 52 seconds.
Computing permutation importance.. Progress: 50%. Estimated remaining time: 3 minutes, 15 seconds.
Computing permutation importance.. Progress: 58%. Estimated remaining time: 2 minutes, 43 seconds.
Computing permutation importance.. Progress: 67%. Estimated remaining time: 2 minutes, 9 seconds.
Computing permutation importance.. Progress: 75%. Estimated remaining time: 1 minute, 39 seconds.
Computing permutation importance.. Progress: 82%. Estimated remaining time: 1 minute, 11 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 39 seconds.
Computing permutation importance.. Progress: 98%. Estimated remaining time: 6 seconds.
Growing trees.. Progress: 25%. Estimated remaining time: 1 minute, 33 seconds.
Growing trees.. Progress: 51%. Estimated remaining time: 59 seconds.
Growing trees.. Progress: 77%. Estimated remaining time: 27 seconds.
Computing permutation importance.. Progress: 7%. Estimated remaining time: 8 minutes, 15 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 5 minutes, 46 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 5 minutes, 20 seconds.
Computing permutation importance.. Progress: 32%. Estimated remaining time: 4 minutes, 34 seconds.
Computing permutation importance.. Progress: 40%. Estimated remaining time: 4 minutes, 7 seconds.
Computing permutation importance.. Progress: 48%. Estimated remaining time: 3 minutes, 27 seconds.
Computing permutation importance.. Progress: 56%. Estimated remaining time: 2 minutes, 58 seconds.
Computing permutation importance.. Progress: 65%. Estimated remaining time: 2 minutes, 21 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 1 minute, 50 seconds.
Computing permutation importance.. Progress: 82%. Estimated remaining time: 1 minute, 11 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 35 seconds.
Computing permutation importance.. Progress: 99%. Estimated remaining time: 5 seconds.
Growing trees.. Progress: 24%. Estimated remaining time: 1 minute, 36 seconds.
Growing trees.. Progress: 51%. Estimated remaining time: 1 minute, 0 seconds.
Growing trees.. Progress: 77%. Estimated remaining time: 27 seconds.
Computing permutation importance.. Progress: 7%. Estimated remaining time: 7 minutes, 46 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 5 minutes, 46 seconds.
Computing permutation importance.. Progress: 26%. Estimated remaining time: 4 minutes, 47 seconds.
Computing permutation importance.. Progress: 33%. Estimated remaining time: 4 minutes, 28 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 3 minutes, 53 seconds.
Computing permutation importance.. Progress: 49%. Estimated remaining time: 3 minutes, 20 seconds.
Computing permutation importance.. Progress: 58%. Estimated remaining time: 2 minutes, 44 seconds.
Computing permutation importance.. Progress: 65%. Estimated remaining time: 2 minutes, 15 seconds.
Computing permutation importance.. Progress: 73%. Estimated remaining time: 1 minute, 46 seconds.
Computing permutation importance.. Progress: 82%. Estimated remaining time: 1 minute, 12 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 40 seconds.
Computing permutation importance.. Progress: 98%. Estimated remaining time: 7 seconds.
Growing trees.. Progress: 25%. Estimated remaining time: 1 minute, 32 seconds.
Growing trees.. Progress: 51%. Estimated remaining time: 59 seconds.
Growing trees.. Progress: 77%. Estimated remaining time: 27 seconds.
Computing permutation importance.. Progress: 7%. Estimated remaining time: 8 minutes, 15 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 6 minutes, 2 seconds.
Computing permutation importance.. Progress: 26%. Estimated remaining time: 4 minutes, 56 seconds.
Computing permutation importance.. Progress: 34%. Estimated remaining time: 4 minutes, 15 seconds.
Computing permutation importance.. Progress: 42%. Estimated remaining time: 3 minutes, 45 seconds.
Computing permutation importance.. Progress: 50%. Estimated remaining time: 3 minutes, 17 seconds.
Computing permutation importance.. Progress: 58%. Estimated remaining time: 2 minutes, 45 seconds.
Computing permutation importance.. Progress: 67%. Estimated remaining time: 2 minutes, 11 seconds.
Computing permutation importance.. Progress: 75%. Estimated remaining time: 1 minute, 39 seconds.
Computing permutation importance.. Progress: 83%. Estimated remaining time: 1 minute, 6 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 38 seconds.
Computing permutation importance.. Progress: 98%. Estimated remaining time: 6 seconds.
Growing trees.. Progress: 25%. Estimated remaining time: 1 minute, 35 seconds.
Growing trees.. Progress: 50%. Estimated remaining time: 1 minute, 1 seconds.
Growing trees.. Progress: 77%. Estimated remaining time: 27 seconds.
Computing permutation importance.. Progress: 7%. Estimated remaining time: 8 minutes, 57 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 6 minutes, 22 seconds.
Computing permutation importance.. Progress: 25%. Estimated remaining time: 5 minutes, 17 seconds.
Computing permutation importance.. Progress: 32%. Estimated remaining time: 4 minutes, 52 seconds.
Computing permutation importance.. Progress: 42%. Estimated remaining time: 4 minutes, 0 seconds.
Computing permutation importance.. Progress: 49%. Estimated remaining time: 3 minutes, 30 seconds.
Computing permutation importance.. Progress: 58%. Estimated remaining time: 2 minutes, 52 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 2 minutes, 18 seconds.
Computing permutation importance.. Progress: 75%. Estimated remaining time: 1 minute, 40 seconds.
Computing permutation importance.. Progress: 82%. Estimated remaining time: 1 minute, 10 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 34 seconds.
Computing permutation importance.. Progress: 100%. Estimated remaining time: 1 seconds.
Growing trees.. Progress: 25%. Estimated remaining time: 1 minute, 31 seconds.
Growing trees.. Progress: 52%. Estimated remaining time: 59 seconds.
Growing trees.. Progress: 78%. Estimated remaining time: 25 seconds.
Computing permutation importance.. Progress: 7%. Estimated remaining time: 8 minutes, 15 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 6 minutes, 7 seconds.
Computing permutation importance.. Progress: 25%. Estimated remaining time: 5 minutes, 8 seconds.
Computing permutation importance.. Progress: 33%. Estimated remaining time: 4 minutes, 29 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 3 minutes, 58 seconds.
Computing permutation importance.. Progress: 50%. Estimated remaining time: 3 minutes, 19 seconds.
Computing permutation importance.. Progress: 57%. Estimated remaining time: 2 minutes, 51 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 2 minutes, 14 seconds.
Computing permutation importance.. Progress: 74%. Estimated remaining time: 1 minute, 42 seconds.
Computing permutation importance.. Progress: 82%. Estimated remaining time: 1 minute, 10 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 39 seconds.
Computing permutation importance.. Progress: 98%. Estimated remaining time: 7 seconds.
Growing trees.. Progress: 25%. Estimated remaining time: 1 minute, 31 seconds.
Growing trees.. Progress: 51%. Estimated remaining time: 1 minute, 0 seconds.
Growing trees.. Progress: 77%. Estimated remaining time: 27 seconds.
Computing permutation importance.. Progress: 7%. Estimated remaining time: 8 minutes, 1 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 5 minutes, 41 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 5 minutes, 25 seconds.
Computing permutation importance.. Progress: 32%. Estimated remaining time: 4 minutes, 34 seconds.
Computing permutation importance.. Progress: 39%. Estimated remaining time: 4 minutes, 4 seconds.
Computing permutation importance.. Progress: 48%. Estimated remaining time: 3 minutes, 29 seconds.
Computing permutation importance.. Progress: 56%. Estimated remaining time: 2 minutes, 55 seconds.
Computing permutation importance.. Progress: 63%. Estimated remaining time: 2 minutes, 26 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 1 minute, 52 seconds.
Computing permutation importance.. Progress: 80%. Estimated remaining time: 1 minute, 20 seconds.
Computing permutation importance.. Progress: 88%. Estimated remaining time: 46 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 13 seconds.
Growing trees.. Progress: 25%. Estimated remaining time: 1 minute, 34 seconds.
Growing trees.. Progress: 51%. Estimated remaining time: 1 minute, 0 seconds.
Growing trees.. Progress: 76%. Estimated remaining time: 28 seconds.
Computing permutation importance.. Progress: 7%. Estimated remaining time: 8 minutes, 15 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 6 minutes, 12 seconds.
Computing permutation importance.. Progress: 26%. Estimated remaining time: 5 minutes, 2 seconds.
Computing permutation importance.. Progress: 34%. Estimated remaining time: 4 minutes, 22 seconds.
Computing permutation importance.. Progress: 42%. Estimated remaining time: 3 minutes, 54 seconds.
Computing permutation importance.. Progress: 49%. Estimated remaining time: 3 minutes, 22 seconds.
Computing permutation importance.. Progress: 57%. Estimated remaining time: 2 minutes, 55 seconds.
Computing permutation importance.. Progress: 65%. Estimated remaining time: 2 minutes, 21 seconds.
Computing permutation importance.. Progress: 73%. Estimated remaining time: 1 minute, 48 seconds.
Computing permutation importance.. Progress: 82%. Estimated remaining time: 1 minute, 13 seconds.
Computing permutation importance.. Progress: 89%. Estimated remaining time: 43 seconds.
Computing permutation importance.. Progress: 98%. Estimated remaining time: 8 seconds.
Growing trees.. Progress: 26%. Estimated remaining time: 1 minute, 30 seconds.
Growing trees.. Progress: 52%. Estimated remaining time: 57 seconds.
Growing trees.. Progress: 78%. Estimated remaining time: 26 seconds.
Computing permutation importance.. Progress: 7%. Estimated remaining time: 8 minutes, 15 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 6 minutes, 2 seconds.
Computing permutation importance.. Progress: 26%. Estimated remaining time: 4 minutes, 59 seconds.
Computing permutation importance.. Progress: 34%. Estimated remaining time: 4 minutes, 19 seconds.
Computing permutation importance.. Progress: 42%. Estimated remaining time: 3 minutes, 50 seconds.
Computing permutation importance.. Progress: 50%. Estimated remaining time: 3 minutes, 17 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 2 minutes, 42 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 2 minutes, 12 seconds.
Computing permutation importance.. Progress: 74%. Estimated remaining time: 1 minute, 40 seconds.
Computing permutation importance.. Progress: 82%. Estimated remaining time: 1 minute, 10 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 40 seconds.
Computing permutation importance.. Progress: 98%. Estimated remaining time: 6 seconds.
Growing trees.. Progress: 23%. Estimated remaining time: 1 minute, 42 seconds.
Growing trees.. Progress: 49%. Estimated remaining time: 1 minute, 5 seconds.
Growing trees.. Progress: 75%. Estimated remaining time: 32 seconds.
Computing permutation importance.. Progress: 8%. Estimated remaining time: 6 minutes, 16 seconds.
Computing permutation importance.. Progress: 17%. Estimated remaining time: 5 minutes, 11 seconds.
Computing permutation importance.. Progress: 27%. Estimated remaining time: 4 minutes, 16 seconds.
Computing permutation importance.. Progress: 37%. Estimated remaining time: 3 minutes, 38 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 3 minutes, 2 seconds.
Computing permutation importance.. Progress: 56%. Estimated remaining time: 2 minutes, 28 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 1 minute, 54 seconds.
Computing permutation importance.. Progress: 75%. Estimated remaining time: 1 minute, 22 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 50 seconds.
Computing permutation importance.. Progress: 95%. Estimated remaining time: 17 seconds.
Growing trees.. Progress: 23%. Estimated remaining time: 1 minute, 44 seconds.
Growing trees.. Progress: 48%. Estimated remaining time: 1 minute, 6 seconds.
Growing trees.. Progress: 74%. Estimated remaining time: 32 seconds.
Computing permutation importance.. Progress: 7%. Estimated remaining time: 7 minutes, 4 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 5 minutes, 16 seconds.
Computing permutation importance.. Progress: 26%. Estimated remaining time: 4 minutes, 21 seconds.
Computing permutation importance.. Progress: 36%. Estimated remaining time: 3 minutes, 40 seconds.
Computing permutation importance.. Progress: 46%. Estimated remaining time: 3 minutes, 1 seconds.
Computing permutation importance.. Progress: 56%. Estimated remaining time: 2 minutes, 28 seconds.
Computing permutation importance.. Progress: 65%. Estimated remaining time: 1 minute, 56 seconds.
Computing permutation importance.. Progress: 74%. Estimated remaining time: 1 minute, 26 seconds.
Computing permutation importance.. Progress: 83%. Estimated remaining time: 55 seconds.
Computing permutation importance.. Progress: 93%. Estimated remaining time: 23 seconds.
Growing trees.. Progress: 23%. Estimated remaining time: 1 minute, 43 seconds.
Growing trees.. Progress: 49%. Estimated remaining time: 1 minute, 5 seconds.
Growing trees.. Progress: 74%. Estimated remaining time: 31 seconds.
Computing permutation importance.. Progress: 7%. Estimated remaining time: 7 minutes, 18 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 5 minutes, 25 seconds.
Computing permutation importance.. Progress: 26%. Estimated remaining time: 4 minutes, 36 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 3 minutes, 51 seconds.
Computing permutation importance.. Progress: 45%. Estimated remaining time: 3 minutes, 12 seconds.
Computing permutation importance.. Progress: 55%. Estimated remaining time: 2 minutes, 37 seconds.
Computing permutation importance.. Progress: 64%. Estimated remaining time: 2 minutes, 4 seconds.
Computing permutation importance.. Progress: 74%. Estimated remaining time: 1 minute, 31 seconds.
Computing permutation importance.. Progress: 83%. Estimated remaining time: 58 seconds.
Computing permutation importance.. Progress: 93%. Estimated remaining time: 25 seconds.
Computing permutation importance.. Progress: 100%. Estimated remaining time: 0 seconds.
Growing trees.. Progress: 23%. Estimated remaining time: 1 minute, 42 seconds.
Growing trees.. Progress: 49%. Estimated remaining time: 1 minute, 5 seconds.
Growing trees.. Progress: 74%. Estimated remaining time: 32 seconds.
Growing trees.. Progress: 100%. Estimated remaining time: 0 seconds.
Computing permutation importance.. Progress: 8%. Estimated remaining time: 6 minutes, 6 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 5 minutes, 20 seconds.
Computing permutation importance.. Progress: 26%. Estimated remaining time: 4 minutes, 33 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 3 minutes, 51 seconds.
Computing permutation importance.. Progress: 44%. Estimated remaining time: 3 minutes, 25 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 2 minutes, 49 seconds.
Computing permutation importance.. Progress: 63%. Estimated remaining time: 2 minutes, 10 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 1 minute, 41 seconds.
Computing permutation importance.. Progress: 81%. Estimated remaining time: 1 minute, 7 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 34 seconds.
Computing permutation importance.. Progress: 99%. Estimated remaining time: 4 seconds.
Growing trees.. Progress: 23%. Estimated remaining time: 1 minute, 46 seconds.
Growing trees.. Progress: 46%. Estimated remaining time: 1 minute, 11 seconds.
Growing trees.. Progress: 71%. Estimated remaining time: 38 seconds.
Growing trees.. Progress: 94%. Estimated remaining time: 7 seconds.
Computing permutation importance.. Progress: 7%. Estimated remaining time: 7 minutes, 32 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 5 minutes, 25 seconds.
Computing permutation importance.. Progress: 26%. Estimated remaining time: 4 minutes, 30 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 3 minutes, 51 seconds.
Computing permutation importance.. Progress: 45%. Estimated remaining time: 3 minutes, 16 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 2 minutes, 42 seconds.
Computing permutation importance.. Progress: 62%. Estimated remaining time: 2 minutes, 14 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 1 minute, 40 seconds.
Computing permutation importance.. Progress: 81%. Estimated remaining time: 1 minute, 5 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 31 seconds.
Computing permutation importance.. Progress: 99%. Estimated remaining time: 2 seconds.
Growing trees.. Progress: 23%. Estimated remaining time: 1 minute, 43 seconds.
Growing trees.. Progress: 48%. Estimated remaining time: 1 minute, 8 seconds.
Growing trees.. Progress: 72%. Estimated remaining time: 36 seconds.
Growing trees.. Progress: 96%. Estimated remaining time: 5 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 4 minutes, 51 seconds.
Computing permutation importance.. Progress: 19%. Estimated remaining time: 4 minutes, 46 seconds.
Computing permutation importance.. Progress: 30%. Estimated remaining time: 3 minutes, 53 seconds.
Computing permutation importance.. Progress: 39%. Estimated remaining time: 3 minutes, 21 seconds.
Computing permutation importance.. Progress: 49%. Estimated remaining time: 2 minutes, 48 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 2 minutes, 14 seconds.
Computing permutation importance.. Progress: 69%. Estimated remaining time: 1 minute, 42 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 13 seconds.
Computing permutation importance.. Progress: 87%. Estimated remaining time: 41 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 10 seconds.
Growing trees.. Progress: 23%. Estimated remaining time: 1 minute, 46 seconds.
Growing trees.. Progress: 48%. Estimated remaining time: 1 minute, 7 seconds.
Growing trees.. Progress: 72%. Estimated remaining time: 36 seconds.
Growing trees.. Progress: 96%. Estimated remaining time: 5 seconds.
Computing permutation importance.. Progress: 8%. Estimated remaining time: 6 minutes, 16 seconds.
Computing permutation importance.. Progress: 19%. Estimated remaining time: 4 minutes, 28 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 3 minutes, 57 seconds.
Computing permutation importance.. Progress: 39%. Estimated remaining time: 3 minutes, 18 seconds.
Computing permutation importance.. Progress: 49%. Estimated remaining time: 2 minutes, 43 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 2 minutes, 12 seconds.
Computing permutation importance.. Progress: 68%. Estimated remaining time: 1 minute, 43 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 11 seconds.
Computing permutation importance.. Progress: 88%. Estimated remaining time: 40 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 11 seconds.
Growing trees.. Progress: 23%. Estimated remaining time: 1 minute, 44 seconds.
Growing trees.. Progress: 47%. Estimated remaining time: 1 minute, 8 seconds.
Growing trees.. Progress: 72%. Estimated remaining time: 36 seconds.
Growing trees.. Progress: 97%. Estimated remaining time: 4 seconds.
Computing permutation importance.. Progress: 8%. Estimated remaining time: 5 minutes, 38 seconds.
Computing permutation importance.. Progress: 19%. Estimated remaining time: 4 minutes, 24 seconds.
Computing permutation importance.. Progress: 28%. Estimated remaining time: 4 minutes, 1 seconds.
Computing permutation importance.. Progress: 37%. Estimated remaining time: 3 minutes, 32 seconds.
Computing permutation importance.. Progress: 46%. Estimated remaining time: 3 minutes, 0 seconds.
Computing permutation importance.. Progress: 56%. Estimated remaining time: 2 minutes, 25 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 1 minute, 53 seconds.
Computing permutation importance.. Progress: 75%. Estimated remaining time: 1 minute, 21 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 50 seconds.
Computing permutation importance.. Progress: 94%. Estimated remaining time: 18 seconds.
Growing trees.. Progress: 23%. Estimated remaining time: 1 minute, 49 seconds.
Growing trees.. Progress: 47%. Estimated remaining time: 1 minute, 9 seconds.
Growing trees.. Progress: 71%. Estimated remaining time: 37 seconds.
Growing trees.. Progress: 95%. Estimated remaining time: 6 seconds.
Computing permutation importance.. Progress: 7%. Estimated remaining time: 7 minutes, 4 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 5 minutes, 35 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 5 minutes, 25 seconds.
Computing permutation importance.. Progress: 30%. Estimated remaining time: 4 minutes, 59 seconds.
Computing permutation importance.. Progress: 36%. Estimated remaining time: 4 minutes, 37 seconds.
Computing permutation importance.. Progress: 43%. Estimated remaining time: 4 minutes, 11 seconds.
Computing permutation importance.. Progress: 49%. Estimated remaining time: 3 minutes, 46 seconds.
Computing permutation importance.. Progress: 55%. Estimated remaining time: 3 minutes, 26 seconds.
Computing permutation importance.. Progress: 61%. Estimated remaining time: 2 minutes, 58 seconds.
Computing permutation importance.. Progress: 68%. Estimated remaining time: 2 minutes, 31 seconds.
Computing permutation importance.. Progress: 74%. Estimated remaining time: 2 minutes, 3 seconds.
Computing permutation importance.. Progress: 80%. Estimated remaining time: 1 minute, 35 seconds.
Computing permutation importance.. Progress: 86%. Estimated remaining time: 1 minute, 5 seconds.
Computing permutation importance.. Progress: 93%. Estimated remaining time: 35 seconds.
Computing permutation importance.. Progress: 99%. Estimated remaining time: 5 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 8 seconds.
Growing trees.. Progress: 42%. Estimated remaining time: 1 minute, 25 seconds.
Growing trees.. Progress: 64%. Estimated remaining time: 52 seconds.
Growing trees.. Progress: 85%. Estimated remaining time: 21 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 4 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 39 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 21 seconds.
Computing permutation importance.. Progress: 22%. Estimated remaining time: 7 minutes, 26 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 42 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 58 seconds.
Computing permutation importance.. Progress: 42%. Estimated remaining time: 5 minutes, 20 seconds.
Computing permutation importance.. Progress: 48%. Estimated remaining time: 4 minutes, 41 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 8 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 35 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 4 seconds.
Computing permutation importance.. Progress: 71%. Estimated remaining time: 2 minutes, 37 seconds.
Computing permutation importance.. Progress: 77%. Estimated remaining time: 2 minutes, 2 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 27 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 54 seconds.
Computing permutation importance.. Progress: 96%. Estimated remaining time: 22 seconds.
Growing trees.. Progress: 20%. Estimated remaining time: 2 minutes, 5 seconds.
Growing trees.. Progress: 40%. Estimated remaining time: 1 minute, 31 seconds.
Growing trees.. Progress: 63%. Estimated remaining time: 55 seconds.
Growing trees.. Progress: 83%. Estimated remaining time: 25 seconds.
Computing permutation importance.. Progress: 3%. Estimated remaining time: 15 minutes, 9 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 58 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 21 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 28 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 41 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 6 minutes, 1 seconds.
Computing permutation importance.. Progress: 42%. Estimated remaining time: 5 minutes, 22 seconds.
Computing permutation importance.. Progress: 48%. Estimated remaining time: 4 minutes, 48 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 11 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 38 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 5 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 34 seconds.
Computing permutation importance.. Progress: 77%. Estimated remaining time: 2 minutes, 4 seconds.
Computing permutation importance.. Progress: 83%. Estimated remaining time: 1 minute, 31 seconds.
Computing permutation importance.. Progress: 89%. Estimated remaining time: 59 seconds.
Computing permutation importance.. Progress: 95%. Estimated remaining time: 24 seconds.
Computing permutation importance.. Progress: 100%. Estimated remaining time: 0 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 12 seconds.
Growing trees.. Progress: 41%. Estimated remaining time: 1 minute, 30 seconds.
Growing trees.. Progress: 62%. Estimated remaining time: 58 seconds.
Growing trees.. Progress: 82%. Estimated remaining time: 27 seconds.
Computing permutation importance.. Progress: 3%. Estimated remaining time: 14 minutes, 40 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 39 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 11 seconds.
Computing permutation importance.. Progress: 22%. Estimated remaining time: 7 minutes, 26 seconds.
Computing permutation importance.. Progress: 28%. Estimated remaining time: 6 minutes, 45 seconds.
Computing permutation importance.. Progress: 34%. Estimated remaining time: 6 minutes, 9 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 30 seconds.
Computing permutation importance.. Progress: 46%. Estimated remaining time: 4 minutes, 56 seconds.
Computing permutation importance.. Progress: 52%. Estimated remaining time: 4 minutes, 25 seconds.
Computing permutation importance.. Progress: 58%. Estimated remaining time: 3 minutes, 54 seconds.
Computing permutation importance.. Progress: 64%. Estimated remaining time: 3 minutes, 17 seconds.
Computing permutation importance.. Progress: 71%. Estimated remaining time: 2 minutes, 41 seconds.
Computing permutation importance.. Progress: 77%. Estimated remaining time: 2 minutes, 6 seconds.
Computing permutation importance.. Progress: 83%. Estimated remaining time: 1 minute, 31 seconds.
Computing permutation importance.. Progress: 89%. Estimated remaining time: 1 minute, 1 seconds.
Computing permutation importance.. Progress: 94%. Estimated remaining time: 30 seconds.
Computing permutation importance.. Progress: 99%. Estimated remaining time: 3 seconds.
Growing trees.. Progress: 20%. Estimated remaining time: 2 minutes, 7 seconds.
Growing trees.. Progress: 40%. Estimated remaining time: 1 minute, 34 seconds.
Growing trees.. Progress: 60%. Estimated remaining time: 1 minute, 0 seconds.
Growing trees.. Progress: 81%. Estimated remaining time: 28 seconds.
Computing permutation importance.. Progress: 3%. Estimated remaining time: 14 minutes, 40 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 27 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 6 seconds.
Computing permutation importance.. Progress: 22%. Estimated remaining time: 7 minutes, 31 seconds.
Computing permutation importance.. Progress: 28%. Estimated remaining time: 6 minutes, 51 seconds.
Computing permutation importance.. Progress: 33%. Estimated remaining time: 6 minutes, 16 seconds.
Computing permutation importance.. Progress: 39%. Estimated remaining time: 5 minutes, 49 seconds.
Computing permutation importance.. Progress: 45%. Estimated remaining time: 5 minutes, 12 seconds.
Computing permutation importance.. Progress: 51%. Estimated remaining time: 4 minutes, 36 seconds.
Computing permutation importance.. Progress: 57%. Estimated remaining time: 4 minutes, 2 seconds.
Computing permutation importance.. Progress: 63%. Estimated remaining time: 3 minutes, 25 seconds.
Computing permutation importance.. Progress: 69%. Estimated remaining time: 2 minutes, 52 seconds.
Computing permutation importance.. Progress: 74%. Estimated remaining time: 2 minutes, 23 seconds.
Computing permutation importance.. Progress: 80%. Estimated remaining time: 1 minute, 49 seconds.
Computing permutation importance.. Progress: 86%. Estimated remaining time: 1 minute, 15 seconds.
Computing permutation importance.. Progress: 92%. Estimated remaining time: 43 seconds.
Computing permutation importance.. Progress: 98%. Estimated remaining time: 9 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 8 seconds.
Growing trees.. Progress: 40%. Estimated remaining time: 1 minute, 33 seconds.
Growing trees.. Progress: 61%. Estimated remaining time: 59 seconds.
Growing trees.. Progress: 80%. Estimated remaining time: 31 seconds.
Growing trees.. Progress: 99%. Estimated remaining time: 1 seconds.
Computing permutation importance.. Progress: 3%. Estimated remaining time: 15 minutes, 37 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 58 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 37 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 42 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 49 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 6 minutes, 13 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 33 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 56 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 25 seconds.
Computing permutation importance.. Progress: 58%. Estimated remaining time: 3 minutes, 56 seconds.
Computing permutation importance.. Progress: 64%. Estimated remaining time: 3 minutes, 24 seconds.
Computing permutation importance.. Progress: 70%. Estimated remaining time: 2 minutes, 48 seconds.
Computing permutation importance.. Progress: 75%. Estimated remaining time: 2 minutes, 19 seconds.
Computing permutation importance.. Progress: 81%. Estimated remaining time: 1 minute, 45 seconds.
Computing permutation importance.. Progress: 87%. Estimated remaining time: 1 minute, 12 seconds.
Computing permutation importance.. Progress: 92%. Estimated remaining time: 45 seconds.
Computing permutation importance.. Progress: 98%. Estimated remaining time: 13 seconds.
Growing trees.. Progress: 18%. Estimated remaining time: 2 minutes, 19 seconds.
Growing trees.. Progress: 37%. Estimated remaining time: 1 minute, 44 seconds.
Growing trees.. Progress: 56%. Estimated remaining time: 1 minute, 12 seconds.
Growing trees.. Progress: 77%. Estimated remaining time: 36 seconds.
Growing trees.. Progress: 97%. Estimated remaining time: 4 seconds.
Computing permutation importance.. Progress: 3%. Estimated remaining time: 15 minutes, 37 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 10 minutes, 16 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 57 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 52 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 7 minutes, 1 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 6 minutes, 15 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 34 seconds.
Computing permutation importance.. Progress: 48%. Estimated remaining time: 4 minutes, 57 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 18 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 41 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 5 seconds.
Computing permutation importance.. Progress: 73%. Estimated remaining time: 2 minutes, 30 seconds.
Computing permutation importance.. Progress: 79%. Estimated remaining time: 1 minute, 57 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 22 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 48 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 15 seconds.
Growing trees.. Progress: 20%. Estimated remaining time: 2 minutes, 7 seconds.
Growing trees.. Progress: 41%. Estimated remaining time: 1 minute, 29 seconds.
Growing trees.. Progress: 62%. Estimated remaining time: 57 seconds.
Growing trees.. Progress: 83%. Estimated remaining time: 25 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 4 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 39 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 16 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 18 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 35 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 56 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 22 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 49 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 18 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 3 minutes, 42 seconds.
Computing permutation importance.. Progress: 65%. Estimated remaining time: 3 minutes, 11 seconds.
Computing permutation importance.. Progress: 71%. Estimated remaining time: 2 minutes, 36 seconds.
Computing permutation importance.. Progress: 77%. Estimated remaining time: 2 minutes, 3 seconds.
Computing permutation importance.. Progress: 83%. Estimated remaining time: 1 minute, 28 seconds.
Computing permutation importance.. Progress: 89%. Estimated remaining time: 57 seconds.
Computing permutation importance.. Progress: 95%. Estimated remaining time: 29 seconds.
Computing permutation importance.. Progress: 100%. Estimated remaining time: 1 seconds.
Growing trees.. Progress: 20%. Estimated remaining time: 2 minutes, 5 seconds.
Growing trees.. Progress: 41%. Estimated remaining time: 1 minute, 30 seconds.
Growing trees.. Progress: 62%. Estimated remaining time: 58 seconds.
Growing trees.. Progress: 81%. Estimated remaining time: 29 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 27 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 7 minutes, 59 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 18 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 27 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 48 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 20 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 44 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 12 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 36 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 2 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 30 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 58 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 25 seconds.
Computing permutation importance.. Progress: 89%. Estimated remaining time: 56 seconds.
Computing permutation importance.. Progress: 95%. Estimated remaining time: 24 seconds.
Growing trees.. Progress: 18%. Estimated remaining time: 2 minutes, 19 seconds.
Growing trees.. Progress: 39%. Estimated remaining time: 1 minute, 38 seconds.
Growing trees.. Progress: 59%. Estimated remaining time: 1 minute, 4 seconds.
Growing trees.. Progress: 81%. Estimated remaining time: 30 seconds.
Growing trees.. Progress: 100%. Estimated remaining time: 0 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 27 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 6 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 6 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 23 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 52 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 16 seconds.
Computing permutation importance.. Progress: 48%. Estimated remaining time: 4 minutes, 41 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 6 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 31 seconds.
Computing permutation importance.. Progress: 67%. Estimated remaining time: 2 minutes, 55 seconds.
Computing permutation importance.. Progress: 73%. Estimated remaining time: 2 minutes, 23 seconds.
Computing permutation importance.. Progress: 79%. Estimated remaining time: 1 minute, 50 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 20 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 46 seconds.
Computing permutation importance.. Progress: 98%. Estimated remaining time: 9 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 13 seconds.
Growing trees.. Progress: 39%. Estimated remaining time: 1 minute, 37 seconds.
Growing trees.. Progress: 57%. Estimated remaining time: 1 minute, 10 seconds.
Growing trees.. Progress: 75%. Estimated remaining time: 40 seconds.
Growing trees.. Progress: 95%. Estimated remaining time: 7 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 11 minutes, 47 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 8 minutes, 54 seconds.
Computing permutation importance.. Progress: 17%. Estimated remaining time: 7 minutes, 52 seconds.
Computing permutation importance.. Progress: 22%. Estimated remaining time: 7 minutes, 13 seconds.
Computing permutation importance.. Progress: 28%. Estimated remaining time: 6 minutes, 33 seconds.
Computing permutation importance.. Progress: 34%. Estimated remaining time: 6 minutes, 1 seconds.
Computing permutation importance.. Progress: 40%. Estimated remaining time: 5 minutes, 24 seconds.
Computing permutation importance.. Progress: 46%. Estimated remaining time: 4 minutes, 49 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 16 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 3 minutes, 40 seconds.
Computing permutation importance.. Progress: 65%. Estimated remaining time: 3 minutes, 8 seconds.
Computing permutation importance.. Progress: 71%. Estimated remaining time: 2 minutes, 33 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 59 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 27 seconds.
Computing permutation importance.. Progress: 89%. Estimated remaining time: 56 seconds.
Computing permutation importance.. Progress: 96%. Estimated remaining time: 23 seconds.
Computing permutation importance.. Progress: 100%. Estimated remaining time: 0 seconds.
Growing trees.. Progress: 20%. Estimated remaining time: 2 minutes, 7 seconds.
Growing trees.. Progress: 41%. Estimated remaining time: 1 minute, 29 seconds.
Growing trees.. Progress: 62%. Estimated remaining time: 57 seconds.
Growing trees.. Progress: 81%. Estimated remaining time: 28 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 45 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 19 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 30 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 44 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 6 minutes, 4 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 26 seconds.
Computing permutation importance.. Progress: 48%. Estimated remaining time: 4 minutes, 52 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 14 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 38 seconds.
Computing permutation importance.. Progress: 67%. Estimated remaining time: 3 minutes, 0 seconds.
Computing permutation importance.. Progress: 73%. Estimated remaining time: 2 minutes, 29 seconds.
Computing permutation importance.. Progress: 79%. Estimated remaining time: 1 minute, 55 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 20 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 49 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 18 seconds.
Growing trees.. Progress: 20%. Estimated remaining time: 2 minutes, 5 seconds.
Growing trees.. Progress: 40%. Estimated remaining time: 1 minute, 34 seconds.
Growing trees.. Progress: 60%. Estimated remaining time: 1 minute, 0 seconds.
Growing trees.. Progress: 82%. Estimated remaining time: 27 seconds.
Computing permutation importance.. Progress: 3%. Estimated remaining time: 15 minutes, 9 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 58 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 21 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 21 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 38 seconds.
Computing permutation importance.. Progress: 34%. Estimated remaining time: 6 minutes, 9 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 29 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 48 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 14 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 3 minutes, 43 seconds.
Computing permutation importance.. Progress: 65%. Estimated remaining time: 3 minutes, 10 seconds.
Computing permutation importance.. Progress: 71%. Estimated remaining time: 2 minutes, 39 seconds.
Computing permutation importance.. Progress: 77%. Estimated remaining time: 2 minutes, 5 seconds.
Computing permutation importance.. Progress: 83%. Estimated remaining time: 1 minute, 30 seconds.
Computing permutation importance.. Progress: 89%. Estimated remaining time: 58 seconds.
Computing permutation importance.. Progress: 95%. Estimated remaining time: 29 seconds.
Computing permutation importance.. Progress: 100%. Estimated remaining time: 0 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 13 seconds.
Growing trees.. Progress: 38%. Estimated remaining time: 1 minute, 39 seconds.
Growing trees.. Progress: 59%. Estimated remaining time: 1 minute, 5 seconds.
Growing trees.. Progress: 79%. Estimated remaining time: 32 seconds.
Growing trees.. Progress: 99%. Estimated remaining time: 2 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 58 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 26 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 23 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 33 seconds.
Computing permutation importance.. Progress: 36%. Estimated remaining time: 5 minutes, 54 seconds.
Computing permutation importance.. Progress: 42%. Estimated remaining time: 5 minutes, 14 seconds.
Computing permutation importance.. Progress: 48%. Estimated remaining time: 4 minutes, 36 seconds.
Computing permutation importance.. Progress: 55%. Estimated remaining time: 4 minutes, 0 seconds.
Computing permutation importance.. Progress: 61%. Estimated remaining time: 3 minutes, 27 seconds.
Computing permutation importance.. Progress: 67%. Estimated remaining time: 2 minutes, 52 seconds.
Computing permutation importance.. Progress: 73%. Estimated remaining time: 2 minutes, 21 seconds.
Computing permutation importance.. Progress: 79%. Estimated remaining time: 1 minute, 53 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 22 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 50 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 13 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 8 seconds.
Growing trees.. Progress: 39%. Estimated remaining time: 1 minute, 36 seconds.
Growing trees.. Progress: 57%. Estimated remaining time: 1 minute, 10 seconds.
Growing trees.. Progress: 78%. Estimated remaining time: 35 seconds.
Growing trees.. Progress: 99%. Estimated remaining time: 1 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 18 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 7 minutes, 59 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 6 minutes, 58 seconds.
Computing permutation importance.. Progress: 30%. Estimated remaining time: 6 minutes, 13 seconds.
Computing permutation importance.. Progress: 36%. Estimated remaining time: 5 minutes, 34 seconds.
Computing permutation importance.. Progress: 42%. Estimated remaining time: 5 minutes, 3 seconds.
Computing permutation importance.. Progress: 48%. Estimated remaining time: 4 minutes, 35 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 3 minutes, 59 seconds.
Computing permutation importance.. Progress: 61%. Estimated remaining time: 3 minutes, 24 seconds.
Computing permutation importance.. Progress: 67%. Estimated remaining time: 2 minutes, 49 seconds.
Computing permutation importance.. Progress: 74%. Estimated remaining time: 2 minutes, 15 seconds.
Computing permutation importance.. Progress: 80%. Estimated remaining time: 1 minute, 41 seconds.
Computing permutation importance.. Progress: 87%. Estimated remaining time: 1 minute, 8 seconds.
Computing permutation importance.. Progress: 93%. Estimated remaining time: 36 seconds.
Computing permutation importance.. Progress: 99%. Estimated remaining time: 4 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 10 seconds.
Growing trees.. Progress: 39%. Estimated remaining time: 1 minute, 35 seconds.
Growing trees.. Progress: 59%. Estimated remaining time: 1 minute, 5 seconds.
Growing trees.. Progress: 77%. Estimated remaining time: 37 seconds.
Growing trees.. Progress: 97%. Estimated remaining time: 5 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 39 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 13 seconds.
Computing permutation importance.. Progress: 22%. Estimated remaining time: 7 minutes, 13 seconds.
Computing permutation importance.. Progress: 28%. Estimated remaining time: 6 minutes, 38 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 6 minutes, 1 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 20 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 45 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 16 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 3 minutes, 39 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 4 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 31 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 57 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 23 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 48 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 14 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 10 seconds.
Growing trees.. Progress: 38%. Estimated remaining time: 1 minute, 39 seconds.
Growing trees.. Progress: 59%. Estimated remaining time: 1 minute, 3 seconds.
Growing trees.. Progress: 79%. Estimated remaining time: 33 seconds.
Growing trees.. Progress: 99%. Estimated remaining time: 1 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 12 minutes, 48 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 28 seconds.
Computing permutation importance.. Progress: 17%. Estimated remaining time: 8 minutes, 5 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 10 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 29 seconds.
Computing permutation importance.. Progress: 36%. Estimated remaining time: 5 minutes, 47 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 23 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 54 seconds.
Computing permutation importance.. Progress: 52%. Estimated remaining time: 4 minutes, 22 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 3 minutes, 44 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 5 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 32 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 2 minutes, 0 seconds.
Computing permutation importance.. Progress: 83%. Estimated remaining time: 1 minute, 29 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 54 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 18 seconds.
Growing trees.. Progress: 20%. Estimated remaining time: 2 minutes, 7 seconds.
Growing trees.. Progress: 39%. Estimated remaining time: 1 minute, 38 seconds.
Growing trees.. Progress: 60%. Estimated remaining time: 1 minute, 2 seconds.
Growing trees.. Progress: 80%. Estimated remaining time: 31 seconds.
Growing trees.. Progress: 100%. Estimated remaining time: 0 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 4 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 39 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 6 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 11 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 23 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 55 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 20 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 47 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 9 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 3 minutes, 36 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 3 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 29 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 54 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 21 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 48 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 14 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 8 seconds.
Growing trees.. Progress: 41%. Estimated remaining time: 1 minute, 30 seconds.
Growing trees.. Progress: 62%. Estimated remaining time: 57 seconds.
Growing trees.. Progress: 83%. Estimated remaining time: 25 seconds.
Computing permutation importance.. Progress: 3%. Estimated remaining time: 14 minutes, 40 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 30 seconds.
Computing permutation importance.. Progress: 17%. Estimated remaining time: 7 minutes, 45 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 3 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 25 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 55 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 19 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 45 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 12 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 3 minutes, 35 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 0 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 28 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 56 seconds.
Computing permutation importance.. Progress: 83%. Estimated remaining time: 1 minute, 28 seconds.
Computing permutation importance.. Progress: 89%. Estimated remaining time: 56 seconds.
Computing permutation importance.. Progress: 96%. Estimated remaining time: 23 seconds.
Growing trees.. Progress: 17%. Estimated remaining time: 2 minutes, 29 seconds.
Growing trees.. Progress: 36%. Estimated remaining time: 1 minute, 50 seconds.
Growing trees.. Progress: 55%. Estimated remaining time: 1 minute, 18 seconds.
Growing trees.. Progress: 75%. Estimated remaining time: 42 seconds.
Growing trees.. Progress: 94%. Estimated remaining time: 10 seconds.
Computing permutation importance.. Progress: 3%. Estimated remaining time: 15 minutes, 9 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 49 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 21 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 25 seconds.
Computing permutation importance.. Progress: 28%. Estimated remaining time: 6 minutes, 58 seconds.
Computing permutation importance.. Progress: 34%. Estimated remaining time: 6 minutes, 16 seconds.
Computing permutation importance.. Progress: 40%. Estimated remaining time: 5 minutes, 33 seconds.
Computing permutation importance.. Progress: 46%. Estimated remaining time: 4 minutes, 56 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 19 seconds.
Computing permutation importance.. Progress: 58%. Estimated remaining time: 3 minutes, 47 seconds.
Computing permutation importance.. Progress: 64%. Estimated remaining time: 3 minutes, 14 seconds.
Computing permutation importance.. Progress: 71%. Estimated remaining time: 2 minutes, 38 seconds.
Computing permutation importance.. Progress: 77%. Estimated remaining time: 2 minutes, 4 seconds.
Computing permutation importance.. Progress: 83%. Estimated remaining time: 1 minute, 29 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 54 seconds.
Computing permutation importance.. Progress: 96%. Estimated remaining time: 20 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 10 seconds.
Growing trees.. Progress: 40%. Estimated remaining time: 1 minute, 34 seconds.
Growing trees.. Progress: 58%. Estimated remaining time: 1 minute, 6 seconds.
Growing trees.. Progress: 80%. Estimated remaining time: 30 seconds.
Growing trees.. Progress: 100%. Estimated remaining time: 0 seconds.
Computing permutation importance.. Progress: 3%. Estimated remaining time: 15 minutes, 9 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 27 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 11 seconds.
Computing permutation importance.. Progress: 22%. Estimated remaining time: 7 minutes, 26 seconds.
Computing permutation importance.. Progress: 28%. Estimated remaining time: 6 minutes, 43 seconds.
Computing permutation importance.. Progress: 34%. Estimated remaining time: 6 minutes, 5 seconds.
Computing permutation importance.. Progress: 40%. Estimated remaining time: 5 minutes, 28 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 51 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 14 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 3 minutes, 40 seconds.
Computing permutation importance.. Progress: 65%. Estimated remaining time: 3 minutes, 10 seconds.
Computing permutation importance.. Progress: 71%. Estimated remaining time: 2 minutes, 35 seconds.
Computing permutation importance.. Progress: 77%. Estimated remaining time: 2 minutes, 1 seconds.
Computing permutation importance.. Progress: 83%. Estimated remaining time: 1 minute, 29 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 55 seconds.
Computing permutation importance.. Progress: 96%. Estimated remaining time: 23 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 13 seconds.
Growing trees.. Progress: 40%. Estimated remaining time: 1 minute, 36 seconds.
Growing trees.. Progress: 59%. Estimated remaining time: 1 minute, 5 seconds.
Growing trees.. Progress: 78%. Estimated remaining time: 34 seconds.
Growing trees.. Progress: 98%. Estimated remaining time: 3 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 4 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 36 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 11 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 11 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 30 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 57 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 20 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 51 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 17 seconds.
Computing permutation importance.. Progress: 58%. Estimated remaining time: 3 minutes, 47 seconds.
Computing permutation importance.. Progress: 64%. Estimated remaining time: 3 minutes, 18 seconds.
Computing permutation importance.. Progress: 70%. Estimated remaining time: 2 minutes, 43 seconds.
Computing permutation importance.. Progress: 76%. Estimated remaining time: 2 minutes, 8 seconds.
Computing permutation importance.. Progress: 82%. Estimated remaining time: 1 minute, 38 seconds.
Computing permutation importance.. Progress: 88%. Estimated remaining time: 1 minute, 7 seconds.
Computing permutation importance.. Progress: 94%. Estimated remaining time: 34 seconds.
Computing permutation importance.. Progress: 99%. Estimated remaining time: 4 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 12 seconds.
Growing trees.. Progress: 39%. Estimated remaining time: 1 minute, 35 seconds.
Growing trees.. Progress: 59%. Estimated remaining time: 1 minute, 5 seconds.
Growing trees.. Progress: 75%. Estimated remaining time: 40 seconds.
Growing trees.. Progress: 95%. Estimated remaining time: 8 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 14 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 7 minutes, 59 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 10 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 30 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 59 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 20 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 45 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 10 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 34 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 2 minutes, 59 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 26 seconds.
Computing permutation importance.. Progress: 79%. Estimated remaining time: 1 minute, 52 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 21 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 50 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 13 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 12 seconds.
Growing trees.. Progress: 40%. Estimated remaining time: 1 minute, 34 seconds.
Growing trees.. Progress: 60%. Estimated remaining time: 1 minute, 3 seconds.
Growing trees.. Progress: 79%. Estimated remaining time: 32 seconds.
Growing trees.. Progress: 97%. Estimated remaining time: 4 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 23 seconds.
Computing permutation importance.. Progress: 17%. Estimated remaining time: 7 minutes, 57 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 6 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 28 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 54 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 13 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 41 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 12 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 3 minutes, 43 seconds.
Computing permutation importance.. Progress: 65%. Estimated remaining time: 3 minutes, 7 seconds.
Computing permutation importance.. Progress: 71%. Estimated remaining time: 2 minutes, 31 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 58 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 26 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 55 seconds.
Computing permutation importance.. Progress: 95%. Estimated remaining time: 25 seconds.
Computing permutation importance.. Progress: 100%. Estimated remaining time: 1 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 15 seconds.
Growing trees.. Progress: 40%. Estimated remaining time: 1 minute, 32 seconds.
Growing trees.. Progress: 61%. Estimated remaining time: 59 seconds.
Growing trees.. Progress: 82%. Estimated remaining time: 27 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 39 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 16 seconds.
Computing permutation importance.. Progress: 22%. Estimated remaining time: 7 minutes, 23 seconds.
Computing permutation importance.. Progress: 28%. Estimated remaining time: 6 minutes, 47 seconds.
Computing permutation importance.. Progress: 34%. Estimated remaining time: 6 minutes, 7 seconds.
Computing permutation importance.. Progress: 40%. Estimated remaining time: 5 minutes, 31 seconds.
Computing permutation importance.. Progress: 46%. Estimated remaining time: 4 minutes, 53 seconds.
Computing permutation importance.. Progress: 52%. Estimated remaining time: 4 minutes, 23 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 3 minutes, 46 seconds.
Computing permutation importance.. Progress: 65%. Estimated remaining time: 3 minutes, 12 seconds.
Computing permutation importance.. Progress: 71%. Estimated remaining time: 2 minutes, 36 seconds.
Computing permutation importance.. Progress: 77%. Estimated remaining time: 2 minutes, 3 seconds.
Computing permutation importance.. Progress: 83%. Estimated remaining time: 1 minute, 32 seconds.
Computing permutation importance.. Progress: 88%. Estimated remaining time: 1 minute, 3 seconds.
Computing permutation importance.. Progress: 94%. Estimated remaining time: 31 seconds.
Computing permutation importance.. Progress: 100%. Estimated remaining time: 2 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 10 seconds.
Growing trees.. Progress: 41%. Estimated remaining time: 1 minute, 29 seconds.
Growing trees.. Progress: 61%. Estimated remaining time: 58 seconds.
Growing trees.. Progress: 82%. Estimated remaining time: 28 seconds.
Growing trees.. Progress: 100%. Estimated remaining time: 0 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 4 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 5 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 4 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 6 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 29 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 49 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 20 seconds.
Computing permutation importance.. Progress: 46%. Estimated remaining time: 4 minutes, 52 seconds.
Computing permutation importance.. Progress: 52%. Estimated remaining time: 4 minutes, 18 seconds.
Computing permutation importance.. Progress: 58%. Estimated remaining time: 3 minutes, 48 seconds.
Computing permutation importance.. Progress: 64%. Estimated remaining time: 3 minutes, 17 seconds.
Computing permutation importance.. Progress: 70%. Estimated remaining time: 2 minutes, 41 seconds.
Computing permutation importance.. Progress: 77%. Estimated remaining time: 2 minutes, 7 seconds.
Computing permutation importance.. Progress: 83%. Estimated remaining time: 1 minute, 30 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 55 seconds.
Computing permutation importance.. Progress: 96%. Estimated remaining time: 21 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 15 seconds.
Growing trees.. Progress: 40%. Estimated remaining time: 1 minute, 34 seconds.
Growing trees.. Progress: 61%. Estimated remaining time: 59 seconds.
Growing trees.. Progress: 81%. Estimated remaining time: 28 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 30 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 6 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 14 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 26 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 54 seconds.
Computing permutation importance.. Progress: 40%. Estimated remaining time: 5 minutes, 28 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 51 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 13 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 3 minutes, 37 seconds.
Computing permutation importance.. Progress: 65%. Estimated remaining time: 3 minutes, 6 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 32 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 55 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 23 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 50 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 17 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 8 seconds.
Growing trees.. Progress: 41%. Estimated remaining time: 1 minute, 30 seconds.
Growing trees.. Progress: 62%. Estimated remaining time: 57 seconds.
Growing trees.. Progress: 82%. Estimated remaining time: 27 seconds.
Computing permutation importance.. Progress: 3%. Estimated remaining time: 14 minutes, 40 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 39 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 16 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 21 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 41 seconds.
Computing permutation importance.. Progress: 34%. Estimated remaining time: 6 minutes, 8 seconds.
Computing permutation importance.. Progress: 40%. Estimated remaining time: 5 minutes, 31 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 55 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 18 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 3 minutes, 42 seconds.
Computing permutation importance.. Progress: 65%. Estimated remaining time: 3 minutes, 9 seconds.
Computing permutation importance.. Progress: 71%. Estimated remaining time: 2 minutes, 36 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 2 minutes, 1 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 26 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 52 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 18 seconds.
Growing trees.. Progress: 18%. Estimated remaining time: 2 minutes, 17 seconds.
Growing trees.. Progress: 38%. Estimated remaining time: 1 minute, 43 seconds.
Growing trees.. Progress: 57%. Estimated remaining time: 1 minute, 9 seconds.
Growing trees.. Progress: 78%. Estimated remaining time: 35 seconds.
Growing trees.. Progress: 99%. Estimated remaining time: 1 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 4 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 30 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 6 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 10 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 26 seconds.
Computing permutation importance.. Progress: 36%. Estimated remaining time: 5 minutes, 44 seconds.
Computing permutation importance.. Progress: 42%. Estimated remaining time: 5 minutes, 6 seconds.
Computing permutation importance.. Progress: 48%. Estimated remaining time: 4 minutes, 31 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 2 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 28 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 2 minutes, 58 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 29 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 54 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 23 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 50 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 15 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 10 seconds.
Growing trees.. Progress: 39%. Estimated remaining time: 1 minute, 38 seconds.
Growing trees.. Progress: 60%. Estimated remaining time: 1 minute, 2 seconds.
Growing trees.. Progress: 81%. Estimated remaining time: 28 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 30 seconds.
Computing permutation importance.. Progress: 11%. Estimated remaining time: 8 minutes, 37 seconds.
Computing permutation importance.. Progress: 17%. Estimated remaining time: 7 minutes, 50 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 5 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 27 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 50 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 20 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 44 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 10 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 35 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 0 seconds.
Computing permutation importance.. Progress: 73%. Estimated remaining time: 2 minutes, 22 seconds.
Computing permutation importance.. Progress: 80%. Estimated remaining time: 1 minute, 46 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 18 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 46 seconds.
Computing permutation importance.. Progress: 98%. Estimated remaining time: 12 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 8 seconds.
Growing trees.. Progress: 40%. Estimated remaining time: 1 minute, 33 seconds.
Growing trees.. Progress: 59%. Estimated remaining time: 1 minute, 4 seconds.
Growing trees.. Progress: 80%. Estimated remaining time: 30 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 4 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 14 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 4 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 14 seconds.
Computing permutation importance.. Progress: 28%. Estimated remaining time: 6 minutes, 42 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 6 minutes, 1 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 20 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 45 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 9 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 36 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 1 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 27 seconds.
Computing permutation importance.. Progress: 79%. Estimated remaining time: 1 minute, 52 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 16 seconds.
Computing permutation importance.. Progress: 92%. Estimated remaining time: 44 seconds.
Computing permutation importance.. Progress: 98%. Estimated remaining time: 12 seconds.
Growing trees.. Progress: 20%. Estimated remaining time: 2 minutes, 7 seconds.
Growing trees.. Progress: 40%. Estimated remaining time: 1 minute, 31 seconds.
Growing trees.. Progress: 60%. Estimated remaining time: 1 minute, 1 seconds.
Growing trees.. Progress: 81%. Estimated remaining time: 29 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 4 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 36 seconds.
Computing permutation importance.. Progress: 17%. Estimated remaining time: 8 minutes, 2 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 10 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 29 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 52 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 18 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 43 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 8 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 33 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 2 minutes, 59 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 30 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 58 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 23 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 50 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 15 seconds.
Growing trees.. Progress: 20%. Estimated remaining time: 2 minutes, 5 seconds.
Growing trees.. Progress: 42%. Estimated remaining time: 1 minute, 28 seconds.
Growing trees.. Progress: 62%. Estimated remaining time: 58 seconds.
Growing trees.. Progress: 82%. Estimated remaining time: 26 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 11 minutes, 47 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 5 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 11 seconds.
Computing permutation importance.. Progress: 22%. Estimated remaining time: 7 minutes, 30 seconds.
Computing permutation importance.. Progress: 28%. Estimated remaining time: 6 minutes, 44 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 6 minutes, 2 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 25 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 49 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 13 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 3 minutes, 41 seconds.
Computing permutation importance.. Progress: 64%. Estimated remaining time: 3 minutes, 14 seconds.
Computing permutation importance.. Progress: 70%. Estimated remaining time: 2 minutes, 41 seconds.
Computing permutation importance.. Progress: 77%. Estimated remaining time: 2 minutes, 5 seconds.
Computing permutation importance.. Progress: 83%. Estimated remaining time: 1 minute, 32 seconds.
Computing permutation importance.. Progress: 89%. Estimated remaining time: 1 minute, 1 seconds.
Computing permutation importance.. Progress: 94%. Estimated remaining time: 30 seconds.
Computing permutation importance.. Progress: 100%. Estimated remaining time: 0 seconds.
Growing trees.. Progress: 20%. Estimated remaining time: 2 minutes, 5 seconds.
Growing trees.. Progress: 41%. Estimated remaining time: 1 minute, 29 seconds.
Growing trees.. Progress: 62%. Estimated remaining time: 57 seconds.
Growing trees.. Progress: 83%. Estimated remaining time: 25 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 4 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 5 seconds.
Computing permutation importance.. Progress: 17%. Estimated remaining time: 7 minutes, 43 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 6 minutes, 55 seconds.
Computing permutation importance.. Progress: 30%. Estimated remaining time: 6 minutes, 15 seconds.
Computing permutation importance.. Progress: 36%. Estimated remaining time: 5 minutes, 37 seconds.
Computing permutation importance.. Progress: 42%. Estimated remaining time: 5 minutes, 2 seconds.
Computing permutation importance.. Progress: 48%. Estimated remaining time: 4 minutes, 34 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 4 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 30 seconds.
Computing permutation importance.. Progress: 67%. Estimated remaining time: 2 minutes, 55 seconds.
Computing permutation importance.. Progress: 73%. Estimated remaining time: 2 minutes, 21 seconds.
Computing permutation importance.. Progress: 79%. Estimated remaining time: 1 minute, 52 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 18 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 44 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 13 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 12 seconds.
Growing trees.. Progress: 40%. Estimated remaining time: 1 minute, 33 seconds.
Growing trees.. Progress: 61%. Estimated remaining time: 1 minute, 0 seconds.
Growing trees.. Progress: 80%. Estimated remaining time: 31 seconds.
Growing trees.. Progress: 100%. Estimated remaining time: 0 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 18 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 9 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 18 seconds.
Computing permutation importance.. Progress: 28%. Estimated remaining time: 6 minutes, 40 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 6 minutes, 1 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 23 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 51 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 13 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 37 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 0 seconds.
Computing permutation importance.. Progress: 73%. Estimated remaining time: 2 minutes, 24 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 54 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 21 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 47 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 16 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 12 seconds.
Growing trees.. Progress: 39%. Estimated remaining time: 1 minute, 35 seconds.
Growing trees.. Progress: 60%. Estimated remaining time: 1 minute, 2 seconds.
Growing trees.. Progress: 81%. Estimated remaining time: 29 seconds.
Growing trees.. Progress: 100%. Estimated remaining time: 0 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 5 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 7 minutes, 59 seconds.
Computing permutation importance.. Progress: 22%. Estimated remaining time: 7 minutes, 16 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 30 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 51 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 16 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 40 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 3 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 3 minutes, 34 seconds.
Computing permutation importance.. Progress: 65%. Estimated remaining time: 3 minutes, 7 seconds.
Computing permutation importance.. Progress: 71%. Estimated remaining time: 2 minutes, 33 seconds.
Computing permutation importance.. Progress: 77%. Estimated remaining time: 1 minute, 59 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 25 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 51 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 16 seconds.
Computing permutation importance.. Progress: 100%. Estimated remaining time: 0 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 10 seconds.
Growing trees.. Progress: 39%. Estimated remaining time: 1 minute, 38 seconds.
Growing trees.. Progress: 58%. Estimated remaining time: 1 minute, 6 seconds.
Growing trees.. Progress: 80%. Estimated remaining time: 31 seconds.
Growing trees.. Progress: 100%. Estimated remaining time: 0 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 18 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 4 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 0 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 19 seconds.
Computing permutation importance.. Progress: 36%. Estimated remaining time: 5 minutes, 41 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 11 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 41 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 5 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 33 seconds.
Computing permutation importance.. Progress: 65%. Estimated remaining time: 3 minutes, 4 seconds.
Computing permutation importance.. Progress: 71%. Estimated remaining time: 2 minutes, 31 seconds.
Computing permutation importance.. Progress: 77%. Estimated remaining time: 2 minutes, 1 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 26 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 55 seconds.
Computing permutation importance.. Progress: 96%. Estimated remaining time: 22 seconds.
Growing trees.. Progress: 20%. Estimated remaining time: 2 minutes, 5 seconds.
Growing trees.. Progress: 41%. Estimated remaining time: 1 minute, 30 seconds.
Growing trees.. Progress: 62%. Estimated remaining time: 57 seconds.
Growing trees.. Progress: 83%. Estimated remaining time: 26 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 12 minutes, 24 seconds.
Computing permutation importance.. Progress: 11%. Estimated remaining time: 8 minutes, 42 seconds.
Computing permutation importance.. Progress: 17%. Estimated remaining time: 7 minutes, 38 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 6 minutes, 58 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 28 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 51 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 16 seconds.
Computing permutation importance.. Progress: 48%. Estimated remaining time: 4 minutes, 38 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 4 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 34 seconds.
Computing permutation importance.. Progress: 65%. Estimated remaining time: 3 minutes, 5 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 30 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 55 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 19 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 48 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 15 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 8 seconds.
Growing trees.. Progress: 40%. Estimated remaining time: 1 minute, 34 seconds.
Growing trees.. Progress: 58%. Estimated remaining time: 1 minute, 8 seconds.
Growing trees.. Progress: 78%. Estimated remaining time: 34 seconds.
Growing trees.. Progress: 99%. Estimated remaining time: 2 seconds.
Computing permutation importance.. Progress: 3%. Estimated remaining time: 14 minutes, 40 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 49 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 11 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 10 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 29 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 52 seconds.
Computing permutation importance.. Progress: 42%. Estimated remaining time: 5 minutes, 14 seconds.
Computing permutation importance.. Progress: 48%. Estimated remaining time: 4 minutes, 41 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 6 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 32 seconds.
Computing permutation importance.. Progress: 67%. Estimated remaining time: 2 minutes, 56 seconds.
Computing permutation importance.. Progress: 73%. Estimated remaining time: 2 minutes, 21 seconds.
Computing permutation importance.. Progress: 80%. Estimated remaining time: 1 minute, 46 seconds.
Computing permutation importance.. Progress: 86%. Estimated remaining time: 1 minute, 13 seconds.
Computing permutation importance.. Progress: 92%. Estimated remaining time: 39 seconds.
Computing permutation importance.. Progress: 99%. Estimated remaining time: 7 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 12 seconds.
Growing trees.. Progress: 40%. Estimated remaining time: 1 minute, 32 seconds.
Growing trees.. Progress: 61%. Estimated remaining time: 1 minute, 0 seconds.
Growing trees.. Progress: 80%. Estimated remaining time: 30 seconds.
Growing trees.. Progress: 100%. Estimated remaining time: 0 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 18 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 7 minutes, 59 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 14 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 26 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 57 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 20 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 42 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 7 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 31 seconds.
Computing permutation importance.. Progress: 67%. Estimated remaining time: 2 minutes, 57 seconds.
Computing permutation importance.. Progress: 73%. Estimated remaining time: 2 minutes, 22 seconds.
Computing permutation importance.. Progress: 80%. Estimated remaining time: 1 minute, 47 seconds.
Computing permutation importance.. Progress: 86%. Estimated remaining time: 1 minute, 13 seconds.
Computing permutation importance.. Progress: 92%. Estimated remaining time: 40 seconds.
Computing permutation importance.. Progress: 99%. Estimated remaining time: 6 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 8 seconds.
Growing trees.. Progress: 39%. Estimated remaining time: 1 minute, 37 seconds.
Growing trees.. Progress: 60%. Estimated remaining time: 1 minute, 3 seconds.
Growing trees.. Progress: 81%. Estimated remaining time: 30 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 27 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 14 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 20 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 35 seconds.
Computing permutation importance.. Progress: 36%. Estimated remaining time: 5 minutes, 58 seconds.
Computing permutation importance.. Progress: 42%. Estimated remaining time: 5 minutes, 19 seconds.
Computing permutation importance.. Progress: 48%. Estimated remaining time: 4 minutes, 40 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 9 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 34 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 1 seconds.
Computing permutation importance.. Progress: 73%. Estimated remaining time: 2 minutes, 27 seconds.
Computing permutation importance.. Progress: 79%. Estimated remaining time: 1 minute, 54 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 20 seconds.
Computing permutation importance.. Progress: 92%. Estimated remaining time: 44 seconds.
Computing permutation importance.. Progress: 98%. Estimated remaining time: 11 seconds.
Growing trees.. Progress: 20%. Estimated remaining time: 2 minutes, 7 seconds.
Growing trees.. Progress: 41%. Estimated remaining time: 1 minute, 29 seconds.
Growing trees.. Progress: 61%. Estimated remaining time: 59 seconds.
Growing trees.. Progress: 83%. Estimated remaining time: 26 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 12 minutes, 48 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 23 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 4 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 11 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 36 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 6 minutes, 1 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 22 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 46 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 9 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 34 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 2 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 31 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 57 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 24 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 50 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 14 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 13 seconds.
Growing trees.. Progress: 39%. Estimated remaining time: 1 minute, 36 seconds.
Growing trees.. Progress: 61%. Estimated remaining time: 59 seconds.
Growing trees.. Progress: 80%. Estimated remaining time: 30 seconds.
Growing trees.. Progress: 99%. Estimated remaining time: 0 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 23 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 4 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 13 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 25 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 49 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 16 seconds.
Computing permutation importance.. Progress: 48%. Estimated remaining time: 4 minutes, 39 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 3 seconds.
Computing permutation importance.. Progress: 61%. Estimated remaining time: 3 minutes, 26 seconds.
Computing permutation importance.. Progress: 67%. Estimated remaining time: 2 minutes, 53 seconds.
Computing permutation importance.. Progress: 73%. Estimated remaining time: 2 minutes, 21 seconds.
Computing permutation importance.. Progress: 79%. Estimated remaining time: 1 minute, 48 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 17 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 44 seconds.
Computing permutation importance.. Progress: 98%. Estimated remaining time: 11 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 12 seconds.
Growing trees.. Progress: 38%. Estimated remaining time: 1 minute, 40 seconds.
Growing trees.. Progress: 58%. Estimated remaining time: 1 minute, 7 seconds.
Growing trees.. Progress: 78%. Estimated remaining time: 36 seconds.
Growing trees.. Progress: 98%. Estimated remaining time: 2 seconds.
Computing permutation importance.. Progress: 3%. Estimated remaining time: 15 minutes, 9 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 49 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 9 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 21 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 34 seconds.
Computing permutation importance.. Progress: 36%. Estimated remaining time: 5 minutes, 50 seconds.
Computing permutation importance.. Progress: 42%. Estimated remaining time: 5 minutes, 13 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 48 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 14 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 36 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 2 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 27 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 57 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 24 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 53 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 17 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 15 seconds.
Growing trees.. Progress: 39%. Estimated remaining time: 1 minute, 36 seconds.
Growing trees.. Progress: 59%. Estimated remaining time: 1 minute, 4 seconds.
Growing trees.. Progress: 79%. Estimated remaining time: 33 seconds.
Growing trees.. Progress: 99%. Estimated remaining time: 2 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 4 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 49 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 21 seconds.
Computing permutation importance.. Progress: 22%. Estimated remaining time: 7 minutes, 26 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 41 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 6 minutes, 3 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 23 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 47 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 9 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 35 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 3 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 29 seconds.
Computing permutation importance.. Progress: 79%. Estimated remaining time: 1 minute, 52 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 21 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 47 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 13 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 13 seconds.
Growing trees.. Progress: 40%. Estimated remaining time: 1 minute, 34 seconds.
Growing trees.. Progress: 61%. Estimated remaining time: 1 minute, 0 seconds.
Growing trees.. Progress: 82%. Estimated remaining time: 27 seconds.
Computing permutation importance.. Progress: 3%. Estimated remaining time: 15 minutes, 9 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 58 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 21 seconds.
Computing permutation importance.. Progress: 22%. Estimated remaining time: 7 minutes, 28 seconds.
Computing permutation importance.. Progress: 28%. Estimated remaining time: 6 minutes, 43 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 6 minutes, 1 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 26 seconds.
Computing permutation importance.. Progress: 46%. Estimated remaining time: 4 minutes, 53 seconds.
Computing permutation importance.. Progress: 51%. Estimated remaining time: 4 minutes, 29 seconds.
Computing permutation importance.. Progress: 58%. Estimated remaining time: 3 minutes, 52 seconds.
Computing permutation importance.. Progress: 64%. Estimated remaining time: 3 minutes, 14 seconds.
Computing permutation importance.. Progress: 71%. Estimated remaining time: 2 minutes, 39 seconds.
Computing permutation importance.. Progress: 77%. Estimated remaining time: 2 minutes, 3 seconds.
Computing permutation importance.. Progress: 83%. Estimated remaining time: 1 minute, 31 seconds.
Computing permutation importance.. Progress: 89%. Estimated remaining time: 59 seconds.
Computing permutation importance.. Progress: 95%. Estimated remaining time: 26 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 20 seconds.
Growing trees.. Progress: 39%. Estimated remaining time: 1 minute, 40 seconds.
Growing trees.. Progress: 59%. Estimated remaining time: 1 minute, 6 seconds.
Growing trees.. Progress: 80%. Estimated remaining time: 32 seconds.
Growing trees.. Progress: 99%. Estimated remaining time: 0 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 18 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 11 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 18 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 31 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 53 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 15 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 45 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 9 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 36 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 2 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 27 seconds.
Computing permutation importance.. Progress: 79%. Estimated remaining time: 1 minute, 52 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 22 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 48 seconds.
Computing permutation importance.. Progress: 98%. Estimated remaining time: 12 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 13 seconds.
Growing trees.. Progress: 39%. Estimated remaining time: 1 minute, 38 seconds.
Growing trees.. Progress: 58%. Estimated remaining time: 1 minute, 6 seconds.
Growing trees.. Progress: 78%. Estimated remaining time: 34 seconds.
Growing trees.. Progress: 98%. Estimated remaining time: 3 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 4 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 30 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 7 minutes, 59 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 5 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 21 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 49 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 15 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 41 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 6 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 32 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 2 minutes, 58 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 30 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 57 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 23 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 50 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 15 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 10 seconds.
Growing trees.. Progress: 38%. Estimated remaining time: 1 minute, 39 seconds.
Growing trees.. Progress: 58%. Estimated remaining time: 1 minute, 6 seconds.
Growing trees.. Progress: 78%. Estimated remaining time: 34 seconds.
Growing trees.. Progress: 98%. Estimated remaining time: 2 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 11 minutes, 47 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 5 seconds.
Computing permutation importance.. Progress: 17%. Estimated remaining time: 7 minutes, 52 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 6 minutes, 58 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 31 seconds.
Computing permutation importance.. Progress: 34%. Estimated remaining time: 5 minutes, 58 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 19 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 43 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 9 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 3 minutes, 39 seconds.
Computing permutation importance.. Progress: 65%. Estimated remaining time: 3 minutes, 7 seconds.
Computing permutation importance.. Progress: 71%. Estimated remaining time: 2 minutes, 33 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 58 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 23 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 49 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 14 seconds.
Computing permutation importance.. Progress: 100%. Estimated remaining time: 0 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 10 seconds.
Growing trees.. Progress: 39%. Estimated remaining time: 1 minute, 36 seconds.
Growing trees.. Progress: 59%. Estimated remaining time: 1 minute, 4 seconds.
Growing trees.. Progress: 79%. Estimated remaining time: 33 seconds.
Growing trees.. Progress: 99%. Estimated remaining time: 1 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 4 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 14 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 7 minutes, 59 seconds.
Computing permutation importance.. Progress: 22%. Estimated remaining time: 7 minutes, 13 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 31 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 57 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 20 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 43 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 5 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 3 minutes, 37 seconds.
Computing permutation importance.. Progress: 65%. Estimated remaining time: 3 minutes, 5 seconds.
Computing permutation importance.. Progress: 71%. Estimated remaining time: 2 minutes, 32 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 58 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 24 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 52 seconds.
Computing permutation importance.. Progress: 96%. Estimated remaining time: 19 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 10 seconds.
Growing trees.. Progress: 40%. Estimated remaining time: 1 minute, 31 seconds.
Growing trees.. Progress: 60%. Estimated remaining time: 1 minute, 1 seconds.
Growing trees.. Progress: 82%. Estimated remaining time: 27 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 11 minutes, 47 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 2 seconds.
Computing permutation importance.. Progress: 17%. Estimated remaining time: 7 minutes, 38 seconds.
Computing permutation importance.. Progress: 24%. Estimated remaining time: 6 minutes, 47 seconds.
Computing permutation importance.. Progress: 31%. Estimated remaining time: 5 minutes, 43 seconds.
Computing permutation importance.. Progress: 38%. Estimated remaining time: 5 minutes, 13 seconds.
Computing permutation importance.. Progress: 44%. Estimated remaining time: 4 minutes, 46 seconds.
Computing permutation importance.. Progress: 50%. Estimated remaining time: 4 minutes, 13 seconds.
Computing permutation importance.. Progress: 57%. Estimated remaining time: 3 minutes, 41 seconds.
Computing permutation importance.. Progress: 63%. Estimated remaining time: 3 minutes, 9 seconds.
Computing permutation importance.. Progress: 69%. Estimated remaining time: 2 minutes, 38 seconds.
Computing permutation importance.. Progress: 75%. Estimated remaining time: 2 minutes, 7 seconds.
Computing permutation importance.. Progress: 81%. Estimated remaining time: 1 minute, 37 seconds.
Computing permutation importance.. Progress: 87%. Estimated remaining time: 1 minute, 5 seconds.
Computing permutation importance.. Progress: 93%. Estimated remaining time: 37 seconds.
Computing permutation importance.. Progress: 99%. Estimated remaining time: 5 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 8 seconds.
Growing trees.. Progress: 40%. Estimated remaining time: 1 minute, 32 seconds.
Growing trees.. Progress: 61%. Estimated remaining time: 59 seconds.
Growing trees.. Progress: 81%. Estimated remaining time: 28 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 12 minutes, 24 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 49 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 4 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 14 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 31 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 52 seconds.
Computing permutation importance.. Progress: 42%. Estimated remaining time: 5 minutes, 14 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 44 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 8 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 3 minutes, 37 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 3 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 29 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 54 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 22 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 50 seconds.
Computing permutation importance.. Progress: 96%. Estimated remaining time: 20 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 12 seconds.
Growing trees.. Progress: 38%. Estimated remaining time: 1 minute, 42 seconds.
Growing trees.. Progress: 57%. Estimated remaining time: 1 minute, 10 seconds.
Growing trees.. Progress: 78%. Estimated remaining time: 35 seconds.
Growing trees.. Progress: 98%. Estimated remaining time: 2 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 12 minutes, 24 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 5 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 7 minutes, 59 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 10 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 30 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 59 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 20 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 43 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 12 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 3 minutes, 40 seconds.
Computing permutation importance.. Progress: 65%. Estimated remaining time: 3 minutes, 6 seconds.
Computing permutation importance.. Progress: 71%. Estimated remaining time: 2 minutes, 32 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 58 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 24 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 48 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 15 seconds.
Computing permutation importance.. Progress: 100%. Estimated remaining time: 0 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 8 seconds.
Growing trees.. Progress: 40%. Estimated remaining time: 1 minute, 34 seconds.
Growing trees.. Progress: 59%. Estimated remaining time: 1 minute, 5 seconds.
Growing trees.. Progress: 78%. Estimated remaining time: 35 seconds.
Growing trees.. Progress: 98%. Estimated remaining time: 3 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 4 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 39 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 11 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 10 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 25 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 51 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 18 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 42 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 8 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 36 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 1 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 27 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 54 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 20 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 47 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 14 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 8 seconds.
Growing trees.. Progress: 40%. Estimated remaining time: 1 minute, 31 seconds.
Growing trees.. Progress: 61%. Estimated remaining time: 59 seconds.
Growing trees.. Progress: 82%. Estimated remaining time: 27 seconds.
Computing permutation importance.. Progress: 3%. Estimated remaining time: 14 minutes, 40 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 10 minutes, 7 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 14 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 13 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 29 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 54 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 18 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 44 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 9 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 34 seconds.
Computing permutation importance.. Progress: 67%. Estimated remaining time: 2 minutes, 59 seconds.
Computing permutation importance.. Progress: 73%. Estimated remaining time: 2 minutes, 22 seconds.
Computing permutation importance.. Progress: 79%. Estimated remaining time: 1 minute, 49 seconds.
Computing permutation importance.. Progress: 86%. Estimated remaining time: 1 minute, 15 seconds.
Computing permutation importance.. Progress: 92%. Estimated remaining time: 42 seconds.
Computing permutation importance.. Progress: 98%. Estimated remaining time: 11 seconds.
Growing trees.. Progress: 20%. Estimated remaining time: 2 minutes, 0 seconds.
Growing trees.. Progress: 42%. Estimated remaining time: 1 minute, 24 seconds.
Growing trees.. Progress: 63%. Estimated remaining time: 55 seconds.
Growing trees.. Progress: 84%. Estimated remaining time: 23 seconds.
Computing permutation importance.. Progress: 3%. Estimated remaining time: 14 minutes, 40 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 18 seconds.
Computing permutation importance.. Progress: 17%. Estimated remaining time: 7 minutes, 52 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 6 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 28 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 49 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 12 seconds.
Computing permutation importance.. Progress: 48%. Estimated remaining time: 4 minutes, 36 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 0 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 26 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 0 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 29 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 56 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 22 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 49 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 15 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 15 seconds.
Growing trees.. Progress: 39%. Estimated remaining time: 1 minute, 38 seconds.
Growing trees.. Progress: 59%. Estimated remaining time: 1 minute, 4 seconds.
Growing trees.. Progress: 80%. Estimated remaining time: 31 seconds.
Growing trees.. Progress: 99%. Estimated remaining time: 1 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 5 seconds.
Computing permutation importance.. Progress: 17%. Estimated remaining time: 7 minutes, 45 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 8 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 28 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 49 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 18 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 41 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 7 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 32 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 1 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 31 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 57 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 22 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 49 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 15 seconds.
Computing permutation importance.. Progress: 100%. Estimated remaining time: 0 seconds.
Growing trees.. Progress: 20%. Estimated remaining time: 2 minutes, 7 seconds.
Growing trees.. Progress: 41%. Estimated remaining time: 1 minute, 29 seconds.
Growing trees.. Progress: 60%. Estimated remaining time: 1 minute, 2 seconds.
Growing trees.. Progress: 79%. Estimated remaining time: 32 seconds.
Growing trees.. Progress: 98%. Estimated remaining time: 3 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 30 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 26 seconds.
Computing permutation importance.. Progress: 22%. Estimated remaining time: 7 minutes, 32 seconds.
Computing permutation importance.. Progress: 28%. Estimated remaining time: 6 minutes, 48 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 6 minutes, 1 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 21 seconds.
Computing permutation importance.. Progress: 48%. Estimated remaining time: 4 minutes, 45 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 11 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 33 seconds.
Computing permutation importance.. Progress: 67%. Estimated remaining time: 2 minutes, 55 seconds.
Computing permutation importance.. Progress: 74%. Estimated remaining time: 2 minutes, 20 seconds.
Computing permutation importance.. Progress: 80%. Estimated remaining time: 1 minute, 44 seconds.
Computing permutation importance.. Progress: 86%. Estimated remaining time: 1 minute, 13 seconds.
Computing permutation importance.. Progress: 93%. Estimated remaining time: 38 seconds.
Computing permutation importance.. Progress: 99%. Estimated remaining time: 7 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 10 seconds.
Growing trees.. Progress: 40%. Estimated remaining time: 1 minute, 33 seconds.
Growing trees.. Progress: 61%. Estimated remaining time: 59 seconds.
Growing trees.. Progress: 83%. Estimated remaining time: 26 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 39 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 11 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 14 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 29 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 52 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 19 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 41 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 8 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 33 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 2 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 27 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 54 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 24 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 50 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 15 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 15 seconds.
Growing trees.. Progress: 39%. Estimated remaining time: 1 minute, 38 seconds.
Growing trees.. Progress: 59%. Estimated remaining time: 1 minute, 5 seconds.
Growing trees.. Progress: 79%. Estimated remaining time: 33 seconds.
Growing trees.. Progress: 99%. Estimated remaining time: 0 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 4 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 30 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 11 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 6 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 26 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 59 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 20 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 46 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 10 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 3 minutes, 39 seconds.
Computing permutation importance.. Progress: 65%. Estimated remaining time: 3 minutes, 6 seconds.
Computing permutation importance.. Progress: 71%. Estimated remaining time: 2 minutes, 35 seconds.
Computing permutation importance.. Progress: 77%. Estimated remaining time: 2 minutes, 0 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 26 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 53 seconds.
Computing permutation importance.. Progress: 96%. Estimated remaining time: 20 seconds.
Computing permutation importance.. Progress: 100%. Estimated remaining time: 0 seconds.
Growing trees.. Progress: 20%. Estimated remaining time: 2 minutes, 5 seconds.
Growing trees.. Progress: 41%. Estimated remaining time: 1 minute, 29 seconds.
Growing trees.. Progress: 62%. Estimated remaining time: 57 seconds.
Growing trees.. Progress: 83%. Estimated remaining time: 25 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 12 minutes, 24 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 18 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 1 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 8 seconds.
Computing permutation importance.. Progress: 28%. Estimated remaining time: 6 minutes, 35 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 51 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 13 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 40 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 4 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 27 seconds.
Computing permutation importance.. Progress: 67%. Estimated remaining time: 2 minutes, 55 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 25 seconds.
Computing permutation importance.. Progress: 79%. Estimated remaining time: 1 minute, 52 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 19 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 46 seconds.
Computing permutation importance.. Progress: 98%. Estimated remaining time: 11 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 13 seconds.
Growing trees.. Progress: 37%. Estimated remaining time: 1 minute, 47 seconds.
Growing trees.. Progress: 54%. Estimated remaining time: 1 minute, 19 seconds.
Growing trees.. Progress: 74%. Estimated remaining time: 43 seconds.
Growing trees.. Progress: 95%. Estimated remaining time: 8 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 18 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 6 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 8 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 31 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 54 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 15 seconds.
Computing permutation importance.. Progress: 48%. Estimated remaining time: 4 minutes, 40 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 6 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 30 seconds.
Computing permutation importance.. Progress: 67%. Estimated remaining time: 2 minutes, 53 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 26 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 54 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 20 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 46 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 14 seconds.
Computing permutation importance.. Progress: 100%. Estimated remaining time: 0 seconds.
Growing trees.. Progress: 20%. Estimated remaining time: 2 minutes, 7 seconds.
Growing trees.. Progress: 40%. Estimated remaining time: 1 minute, 33 seconds.
Growing trees.. Progress: 60%. Estimated remaining time: 1 minute, 1 seconds.
Growing trees.. Progress: 81%. Estimated remaining time: 29 seconds.
Growing trees.. Progress: 100%. Estimated remaining time: 0 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 39 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 6 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 6 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 33 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 49 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 19 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 44 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 7 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 33 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 1 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 28 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 55 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 22 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 48 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 13 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 12 seconds.
Growing trees.. Progress: 39%. Estimated remaining time: 1 minute, 38 seconds.
Growing trees.. Progress: 60%. Estimated remaining time: 1 minute, 3 seconds.
Growing trees.. Progress: 80%. Estimated remaining time: 31 seconds.
Growing trees.. Progress: 100%. Estimated remaining time: 0 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 4 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 8 minutes, 54 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 7 minutes, 59 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 11 seconds.
Computing permutation importance.. Progress: 30%. Estimated remaining time: 6 minutes, 13 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 43 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 16 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 41 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 7 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 33 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 0 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 29 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 58 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 26 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 54 seconds.
Computing permutation importance.. Progress: 96%. Estimated remaining time: 22 seconds.
Computing permutation importance.. Progress: 100%. Estimated remaining time: 1 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 15 seconds.
Growing trees.. Progress: 39%. Estimated remaining time: 1 minute, 35 seconds.
Growing trees.. Progress: 60%. Estimated remaining time: 1 minute, 2 seconds.
Growing trees.. Progress: 81%. Estimated remaining time: 29 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 18 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 7 minutes, 59 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 11 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 24 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 51 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 20 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 46 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 13 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 3 minutes, 35 seconds.
Computing permutation importance.. Progress: 65%. Estimated remaining time: 3 minutes, 8 seconds.
Computing permutation importance.. Progress: 71%. Estimated remaining time: 2 minutes, 34 seconds.
Computing permutation importance.. Progress: 77%. Estimated remaining time: 2 minutes, 2 seconds.
Computing permutation importance.. Progress: 83%. Estimated remaining time: 1 minute, 27 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 55 seconds.
Computing permutation importance.. Progress: 96%. Estimated remaining time: 22 seconds.
Computing permutation importance.. Progress: 100%. Estimated remaining time: 1 seconds.
Growing trees.. Progress: 20%. Estimated remaining time: 2 minutes, 7 seconds.
Growing trees.. Progress: 41%. Estimated remaining time: 1 minute, 30 seconds.
Growing trees.. Progress: 60%. Estimated remaining time: 1 minute, 2 seconds.
Growing trees.. Progress: 81%. Estimated remaining time: 29 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 49 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 4 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 6 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 26 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 57 seconds.
Computing permutation importance.. Progress: 40%. Estimated remaining time: 5 minutes, 28 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 49 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 12 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 3 minutes, 37 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 2 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 27 seconds.
Computing permutation importance.. Progress: 79%. Estimated remaining time: 1 minute, 53 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 21 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 48 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 14 seconds.
Growing trees.. Progress: 20%. Estimated remaining time: 2 minutes, 7 seconds.
Growing trees.. Progress: 40%. Estimated remaining time: 1 minute, 33 seconds.
Growing trees.. Progress: 61%. Estimated remaining time: 1 minute, 1 seconds.
Growing trees.. Progress: 81%. Estimated remaining time: 28 seconds.
Growing trees.. Progress: 100%. Estimated remaining time: 0 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 11 minutes, 47 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 5 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 7 minutes, 59 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 3 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 28 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 51 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 11 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 39 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 5 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 28 seconds.
Computing permutation importance.. Progress: 67%. Estimated remaining time: 2 minutes, 54 seconds.
Computing permutation importance.. Progress: 73%. Estimated remaining time: 2 minutes, 20 seconds.
Computing permutation importance.. Progress: 79%. Estimated remaining time: 1 minute, 50 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 18 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 46 seconds.
Computing permutation importance.. Progress: 98%. Estimated remaining time: 12 seconds.
Growing trees.. Progress: 18%. Estimated remaining time: 2 minutes, 21 seconds.
Growing trees.. Progress: 39%. Estimated remaining time: 1 minute, 36 seconds.
Growing trees.. Progress: 59%. Estimated remaining time: 1 minute, 4 seconds.
Growing trees.. Progress: 79%. Estimated remaining time: 32 seconds.
Growing trees.. Progress: 99%. Estimated remaining time: 1 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 30 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 11 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 14 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 26 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 51 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 18 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 44 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 9 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 34 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 0 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 27 seconds.
Computing permutation importance.. Progress: 79%. Estimated remaining time: 1 minute, 52 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 19 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 45 seconds.
Computing permutation importance.. Progress: 98%. Estimated remaining time: 11 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 8 seconds.
Growing trees.. Progress: 41%. Estimated remaining time: 1 minute, 29 seconds.
Growing trees.. Progress: 62%. Estimated remaining time: 57 seconds.
Growing trees.. Progress: 83%. Estimated remaining time: 26 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 30 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 49 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 21 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 13 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 24 seconds.
Computing permutation importance.. Progress: 36%. Estimated remaining time: 5 minutes, 46 seconds.
Computing permutation importance.. Progress: 42%. Estimated remaining time: 5 minutes, 10 seconds.
Computing permutation importance.. Progress: 48%. Estimated remaining time: 4 minutes, 32 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 1 seconds.
Computing permutation importance.. Progress: 61%. Estimated remaining time: 3 minutes, 26 seconds.
Computing permutation importance.. Progress: 67%. Estimated remaining time: 2 minutes, 53 seconds.
Computing permutation importance.. Progress: 73%. Estimated remaining time: 2 minutes, 23 seconds.
Computing permutation importance.. Progress: 79%. Estimated remaining time: 1 minute, 47 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 16 seconds.
Computing permutation importance.. Progress: 92%. Estimated remaining time: 43 seconds.
Computing permutation importance.. Progress: 98%. Estimated remaining time: 11 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 12 seconds.
Growing trees.. Progress: 41%. Estimated remaining time: 1 minute, 27 seconds.
Growing trees.. Progress: 61%. Estimated remaining time: 58 seconds.
Growing trees.. Progress: 82%. Estimated remaining time: 26 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 14 seconds.
Computing permutation importance.. Progress: 17%. Estimated remaining time: 8 minutes, 7 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 7 seconds.
Computing permutation importance.. Progress: 30%. Estimated remaining time: 6 minutes, 22 seconds.
Computing permutation importance.. Progress: 36%. Estimated remaining time: 5 minutes, 43 seconds.
Computing permutation importance.. Progress: 42%. Estimated remaining time: 5 minutes, 7 seconds.
Computing permutation importance.. Progress: 48%. Estimated remaining time: 4 minutes, 40 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 10 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 37 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 2 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 29 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 56 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 22 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 47 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 14 seconds.
Growing trees.. Progress: 20%. Estimated remaining time: 2 minutes, 4 seconds.
Growing trees.. Progress: 42%. Estimated remaining time: 1 minute, 26 seconds.
Growing trees.. Progress: 63%. Estimated remaining time: 55 seconds.
Growing trees.. Progress: 83%. Estimated remaining time: 25 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 12 minutes, 24 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 8 minutes, 54 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 11 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 11 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 24 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 54 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 20 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 46 seconds.
Computing permutation importance.. Progress: 53%. Estimated remaining time: 4 minutes, 12 seconds.
Computing permutation importance.. Progress: 59%. Estimated remaining time: 3 minutes, 37 seconds.
Computing permutation importance.. Progress: 65%. Estimated remaining time: 3 minutes, 3 seconds.
Computing permutation importance.. Progress: 71%. Estimated remaining time: 2 minutes, 32 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 58 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 24 seconds.
Computing permutation importance.. Progress: 90%. Estimated remaining time: 50 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 15 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 12 seconds.
Growing trees.. Progress: 38%. Estimated remaining time: 1 minute, 40 seconds.
Growing trees.. Progress: 59%. Estimated remaining time: 1 minute, 4 seconds.
Growing trees.. Progress: 80%. Estimated remaining time: 30 seconds.
Growing trees.. Progress: 100%. Estimated remaining time: 0 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 27 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 9 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 13 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 35 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 56 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 15 seconds.
Computing permutation importance.. Progress: 48%. Estimated remaining time: 4 minutes, 38 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 5 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 31 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 2 minutes, 57 seconds.
Computing permutation importance.. Progress: 73%. Estimated remaining time: 2 minutes, 24 seconds.
Computing permutation importance.. Progress: 79%. Estimated remaining time: 1 minute, 48 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 17 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 44 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 13 seconds.
Growing trees.. Progress: 20%. Estimated remaining time: 2 minutes, 7 seconds.
Growing trees.. Progress: 41%. Estimated remaining time: 1 minute, 29 seconds.
Growing trees.. Progress: 61%. Estimated remaining time: 58 seconds.
Growing trees.. Progress: 82%. Estimated remaining time: 27 seconds.
Growing trees.. Progress: 100%. Estimated remaining time: 0 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 30 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 49 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 11 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 14 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 29 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 49 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 12 seconds.
Computing permutation importance.. Progress: 47%. Estimated remaining time: 4 minutes, 39 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 3 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 29 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 2 minutes, 56 seconds.
Computing permutation importance.. Progress: 73%. Estimated remaining time: 2 minutes, 22 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 53 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 18 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 45 seconds.
Computing permutation importance.. Progress: 98%. Estimated remaining time: 12 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 13 seconds.
Growing trees.. Progress: 40%. Estimated remaining time: 1 minute, 32 seconds.
Growing trees.. Progress: 61%. Estimated remaining time: 1 minute, 0 seconds.
Growing trees.. Progress: 81%. Estimated remaining time: 28 seconds.
Growing trees.. Progress: 100%. Estimated remaining time: 0 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 4 seconds.
Computing permutation importance.. Progress: 11%. Estimated remaining time: 8 minutes, 40 seconds.
Computing permutation importance.. Progress: 17%. Estimated remaining time: 7 minutes, 52 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 11 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 36 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 52 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 18 seconds.
Computing permutation importance.. Progress: 48%. Estimated remaining time: 4 minutes, 41 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 8 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 34 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 3 minutes, 2 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 29 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 55 seconds.
Computing permutation importance.. Progress: 84%. Estimated remaining time: 1 minute, 22 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 49 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 15 seconds.
Computing permutation importance.. Progress: 100%. Estimated remaining time: 0 seconds.
Growing trees.. Progress: 20%. Estimated remaining time: 2 minutes, 8 seconds.
Growing trees.. Progress: 41%. Estimated remaining time: 1 minute, 29 seconds.
Growing trees.. Progress: 62%. Estimated remaining time: 56 seconds.
Growing trees.. Progress: 82%. Estimated remaining time: 26 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 4 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 18 seconds.
Computing permutation importance.. Progress: 16%. Estimated remaining time: 8 minutes, 4 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 5 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 23 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 48 seconds.
Computing permutation importance.. Progress: 42%. Estimated remaining time: 5 minutes, 13 seconds.
Computing permutation importance.. Progress: 48%. Estimated remaining time: 4 minutes, 39 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 3 seconds.
Computing permutation importance.. Progress: 60%. Estimated remaining time: 3 minutes, 33 seconds.
Computing permutation importance.. Progress: 67%. Estimated remaining time: 2 minutes, 56 seconds.
Computing permutation importance.. Progress: 73%. Estimated remaining time: 2 minutes, 22 seconds.
Computing permutation importance.. Progress: 79%. Estimated remaining time: 1 minute, 47 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 17 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 44 seconds.
Computing permutation importance.. Progress: 98%. Estimated remaining time: 10 seconds.
Growing trees.. Progress: 20%. Estimated remaining time: 2 minutes, 7 seconds.
Growing trees.. Progress: 41%. Estimated remaining time: 1 minute, 28 seconds.
Growing trees.. Progress: 62%. Estimated remaining time: 57 seconds.
Growing trees.. Progress: 83%. Estimated remaining time: 25 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 18 seconds.
Computing permutation importance.. Progress: 17%. Estimated remaining time: 7 minutes, 52 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 0 seconds.
Computing permutation importance.. Progress: 30%. Estimated remaining time: 6 minutes, 16 seconds.
Computing permutation importance.. Progress: 36%. Estimated remaining time: 5 minutes, 39 seconds.
Computing permutation importance.. Progress: 42%. Estimated remaining time: 5 minutes, 11 seconds.
Computing permutation importance.. Progress: 48%. Estimated remaining time: 4 minutes, 37 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 1 seconds.
Computing permutation importance.. Progress: 61%. Estimated remaining time: 3 minutes, 25 seconds.
Computing permutation importance.. Progress: 66%. Estimated remaining time: 2 minutes, 57 seconds.
Computing permutation importance.. Progress: 72%. Estimated remaining time: 2 minutes, 28 seconds.
Computing permutation importance.. Progress: 78%. Estimated remaining time: 1 minute, 54 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 21 seconds.
Computing permutation importance.. Progress: 91%. Estimated remaining time: 48 seconds.
Computing permutation importance.. Progress: 97%. Estimated remaining time: 14 seconds.
Computing permutation importance.. Progress: 100%. Estimated remaining time: 0 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 12 seconds.
Growing trees.. Progress: 39%. Estimated remaining time: 1 minute, 40 seconds.
Growing trees.. Progress: 60%. Estimated remaining time: 1 minute, 3 seconds.
Growing trees.. Progress: 81%. Estimated remaining time: 30 seconds.
Growing trees.. Progress: 100%. Estimated remaining time: 0 seconds.
Computing permutation importance.. Progress: 4%. Estimated remaining time: 13 minutes, 50 seconds.
Computing permutation importance.. Progress: 10%. Estimated remaining time: 9 minutes, 27 seconds.
Computing permutation importance.. Progress: 17%. Estimated remaining time: 8 minutes, 2 seconds.
Computing permutation importance.. Progress: 23%. Estimated remaining time: 7 minutes, 11 seconds.
Computing permutation importance.. Progress: 29%. Estimated remaining time: 6 minutes, 24 seconds.
Computing permutation importance.. Progress: 35%. Estimated remaining time: 5 minutes, 51 seconds.
Computing permutation importance.. Progress: 41%. Estimated remaining time: 5 minutes, 15 seconds.
Computing permutation importance.. Progress: 48%. Estimated remaining time: 4 minutes, 39 seconds.
Computing permutation importance.. Progress: 54%. Estimated remaining time: 4 minutes, 4 seconds.
Computing permutation importance.. Progress: 61%. Estimated remaining time: 3 minutes, 29 seconds.
Computing permutation importance.. Progress: 67%. Estimated remaining time: 2 minutes, 54 seconds.
Computing permutation importance.. Progress: 74%. Estimated remaining time: 2 minutes, 16 seconds.
Computing permutation importance.. Progress: 80%. Estimated remaining time: 1 minute, 45 seconds.
Computing permutation importance.. Progress: 85%. Estimated remaining time: 1 minute, 16 seconds.
Computing permutation importance.. Progress: 92%. Estimated remaining time: 42 seconds.
Computing permutation importance.. Progress: 98%. Estimated remaining time: 8 seconds.

Results from Boruta

   We can now examine what was saved in the boruta.df list to inspect the results from utilizing this approach.

In [ ]:
print(boruta.df)
Boruta performed 99 iterations in 18.34501 hours.
 79 attributes confirmed important: `term_36 months`, `term_60 months`,
`verification_status_Not Verified`, `verification_status_Source
Verified`, acc_now_delinq and 74 more;
 25 attributes confirmed unimportant: `emp_length_< 1 year`,
`emp_length_1 year`, `emp_length_2 years`, `emp_length_3 years`,
`emp_length_4 years` and 20 more;
 7 tentative attributes left: `emp_length_10+ years`, `emp_length_8
years`, delinq_amnt, purpose_home_improvement, purpose_major_purchase
and 2 more;

   After more than 81 hours, 79 out of the 111 features were selected. This is a large number of variables though. This package offers a function called TentativeRoughFix that "claims as Confirmed those attributes that have median importance higher than the median importance of maximal shadow attribute, and the rest as Rejected". So, let's now explore the tentative variables.

In [ ]:
final.boruta <- TentativeRoughFix(boruta.df)
print(final.boruta)
Boruta performed 99 iterations in 18.34501 hours.
Tentatives roughfixed over the last 99 iterations.
 81 attributes confirmed important: `emp_length_10+ years`, `term_36
months`, `term_60 months`, `verification_status_Not Verified`,
`verification_status_Source Verified` and 76 more;
 30 attributes confirmed unimportant: `emp_length_< 1 year`,
`emp_length_1 year`, `emp_length_2 years`, `emp_length_3 years`,
`emp_length_4 years` and 25 more;

   2 more features were confirmed as important after using the available TenativeRougFix, while 5 more were confirmed as unimportant. We can then used the variable importance chart from the native package to display the important features as green, unimportant as red and yellow indicating the ones that are considered as tentative. Let's first change the plot width and height of the plot window by setting the options with the repr library. Then we can plot the results from Boruta after using the TenativeRoughFix. lapply as a function can be applied to the generated ImpHistory where the names of the columns are defined and sapply can then be used with the output from the lapply function where the feature importance of the features is sorted by the median called as Labels. This generates a plot where the feature importance is arranged in an ascending order from left to right.

In [ ]:
library(repr)

options(repr.plot.width=30, repr.plot.height=30)

par(mar = c(20,5,1,1.5))
plot(final.boruta, xlab='', xaxt='n')

lz<-lapply(1:ncol(final.boruta$ImpHistory),function(i)
  final.boruta$ImpHistory[is.finite(final.boruta$ImpHistory[,i]),i])
names(lz) <- colnames(final.boruta$ImpHistory)
Labels <- sort(sapply(lz,median))
axis(side=1,las=2, labels=names(Labels),
     at=1:ncol(final.boruta$ImpHistory), cex.axis=1.4)

   Additional statistical measures can be generated using attStats, which contains the defined meanImp, medianImp and minImp. Let's now filter out features which are designated as Confirmed on this subset using boruta_stats$decision == "Confirmed".

In [ ]:
boruta_stats <- attStats(final.boruta)
boruta_confirmed = subset(boruta_stats,
                          subset=boruta_stats$decision == 'Confirmed')
boruta_confirmed
A data.frame: 81 × 6
meanImpmedianImpminImpmaxImpnormHitsdecision
<dbl><dbl><dbl><dbl><dbl><fct>
loan_amnt31.15083531.12666328.266385934.0148401.0000000Confirmed
funded_amnt31.03017930.86040828.306355334.1712101.0000000Confirmed
funded_amnt_inv31.95570331.92810228.023580235.6324721.0000000Confirmed
int_rate24.03881224.08480119.086370227.7572271.0000000Confirmed
installment34.16964834.22533230.063385238.4304621.0000000Confirmed
annual_inc20.64517520.57072917.286132823.6919851.0000000Confirmed
pymnt_plan36.01439636.38500229.816166139.4499881.0000000Confirmed
dti20.83709620.86523917.400958825.0218581.0000000Confirmed
delinq_2yrs 6.842601 6.870843 4.8862545 8.8089211.0000000Confirmed
inq_last_6mths12.78014512.80526010.240829016.7191361.0000000Confirmed
open_acc19.89571619.86394614.239235124.6191321.0000000Confirmed
pub_rec 5.818764 5.923536 2.2464243 8.7858260.9898990Confirmed
revol_bal25.13981525.14485619.752629330.3372801.0000000Confirmed
revol_util22.59737122.48374818.285816926.5650571.0000000Confirmed
total_acc26.36785326.85460615.875366831.4735091.0000000Confirmed
initial_list_status13.88647313.85870511.999069615.8641691.0000000Confirmed
out_prncp33.45268333.39327931.103381835.6962101.0000000Confirmed
out_prncp_inv33.31579733.23812931.040554236.6282241.0000000Confirmed
total_pymnt34.85832834.74925232.507666937.5383271.0000000Confirmed
total_pymnt_inv34.25730634.21949032.154978036.7430221.0000000Confirmed
total_rec_prncp48.96479648.97045645.682305052.5676961.0000000Confirmed
total_rec_int34.04453334.02511830.517885237.8750991.0000000Confirmed
total_rec_late_fee60.16205860.76317049.801125065.0766921.0000000Confirmed
recoveries42.63098942.63533738.838893145.6715181.0000000Confirmed
collection_recovery_fee28.54810728.56678325.622783631.3942211.0000000Confirmed
last_pymnt_amnt51.47920851.43919147.493043956.7748251.0000000Confirmed
application_type12.00834611.93064910.018382313.8392021.0000000Confirmed
acc_now_delinq 3.892868 3.987302 0.4627547 6.0098340.8888889Confirmed
tot_coll_amt 4.528058 4.463284 2.1039163 7.0420680.9494949Confirmed
tot_cur_bal18.57176918.93724714.182938621.9301931.0000000Confirmed
num_tl_90g_dpd_24m 3.722676 3.657719 1.2787911 7.1663960.8787879Confirmed
num_tl_op_past_12m16.48291316.49477412.114458218.8689611.0000000Confirmed
pct_tl_nvr_dlq11.80477311.796209 9.052745114.4838261.0000000Confirmed
percent_bc_gt_7514.68571814.62419111.508067018.6913551.0000000Confirmed
pub_rec_bankruptcies 6.446292 6.432009 3.2376621 9.2603001.0000000Confirmed
tax_liens 6.125352 6.134996 4.0366175 8.0243711.0000000Confirmed
tot_hi_cred_lim19.32965219.47947315.009610023.0586001.0000000Confirmed
total_bal_ex_mort17.88519117.96428611.335814222.7352631.0000000Confirmed
total_bc_limit18.37748518.72230712.016846822.8015221.0000000Confirmed
total_il_high_credit_limit19.59853919.72745913.966070423.5157951.0000000Confirmed
disbursement_method11.39753911.348889 9.989231413.7032821.0000000Confirmed
`term_36 months`17.11961917.10949915.534053918.5500101.0000000Confirmed
`term_60 months`17.12879017.12729715.618479218.7115601.0000000Confirmed
grade_A14.73214614.86591211.561791118.0816801.0000000Confirmed
grade_B 8.644539 8.513724 6.881504310.7736251.0000000Confirmed
grade_C 8.409557 8.530408 5.633049810.4772291.0000000Confirmed
grade_D11.08793011.117210 9.439953213.8806041.0000000Confirmed
grade_E11.39298311.358284 9.897323613.5995921.0000000Confirmed
grade_F10.40176110.321775 8.582338411.9030031.0000000Confirmed
grade_G 6.308212 6.318082 4.0351071 7.7913791.0000000Confirmed
`emp_length_10+ years` 2.905031 2.940490-0.2636700 4.8561140.6363636Confirmed
home_ownership_MORTGAGE 8.143434 8.168095 5.921004510.1466851.0000000Confirmed
home_ownership_RENT 8.274686 8.219050 6.602774710.1440711.0000000Confirmed
`verification_status_Not Verified`11.55292011.578453 9.541964513.1547001.0000000Confirmed
`verification_status_Source Verified` 3.231367 3.179140 0.7338770 5.4916920.7575758Confirmed
verification_status_Verified 6.633842 6.573560 3.8792528 8.0306701.0000000Confirmed
purpose_credit_card 4.451185 4.426374 2.5750356 6.8659010.9696970Confirmed
purpose_debt_consolidation 5.021257 5.080437 2.7169266 7.0899450.9898990Confirmed
purpose_moving 4.182588 4.158052 1.5566777 6.4709210.9494949Confirmed
purpose_other 2.237874 2.235340-0.3460263 4.8255310.3838384Confirmed

   This can also be completed with the features designated as Rejected on another subset using boruta_stats$decision == "Rejected".

In [ ]:
boruta_rejected = subset(boruta_stats,
                         subset=boruta_stats$decision == 'Rejected')
boruta_rejected
A data.frame: 30 × 6
meanImpmedianImpminImpmaxImpnormHitsdecision
<dbl><dbl><dbl><dbl><dbl><fct>
collections_12_mths_ex_med-0.1478136-0.179837983-1.67815311.66637690.00000000Rejected
chargeoff_within_12_mths 0.6631156 0.725582284-1.87920762.39841900.00000000Rejected
delinq_amnt 2.5976992 2.605932248 0.38563675.40541430.49494949Rejected
`emp_length_< 1 year`-0.3363462-0.471679520-2.06841391.22764980.00000000Rejected
`emp_length_1 year` 0.3480110 0.358736902-0.70953941.51827730.00000000Rejected
`emp_length_2 years`-0.1098134-0.280429927-2.21418771.53178920.00000000Rejected
`emp_length_3 years` 0.3408140 0.384973900-0.97172832.20122010.00000000Rejected
`emp_length_4 years`-0.6611490-0.589088871-2.03221790.49214390.00000000Rejected
`emp_length_5 years`-0.7739029-0.761179073-1.81782620.95017220.00000000Rejected
`emp_length_6 years` 0.6001474 0.447498291-1.00760812.85983670.01010101Rejected
`emp_length_7 years` 0.3893559 0.518813208-1.24797821.71865860.00000000Rejected
`emp_length_8 years` 2.4312260 2.450489064 0.22538994.79064860.47474747Rejected
`emp_length_9 years` 1.4935792 1.536933337-0.73235063.41755310.06060606Rejected
`emp_length_n/a` 0.6688853 0.825380830-0.26794031.70314110.00000000Rejected
home_ownership_ANY 0.1864711 0.001422159-1.34397181.60716360.00000000Rejected
home_ownership_NONE 0.0000000 0.000000000 0.00000000.00000000.00000000Rejected
home_ownership_OTHER 0.0000000 0.000000000 0.00000000.00000000.00000000Rejected
home_ownership_OWN-0.2318905-0.541990464-1.65269523.11475610.01010101Rejected
purpose_car-0.7276462-0.710510210-1.75162470.26241740.00000000Rejected
purpose_educational 0.0000000 0.000000000 0.00000000.00000000.00000000Rejected
purpose_home_improvement 2.6117850 2.614040696 0.79821584.66400050.51515152Rejected
purpose_house 0.8086646 0.777966995-1.93741412.77984170.01010101Rejected
purpose_major_purchase 2.2011871 2.250334090-0.79513484.36678830.37373737Rejected
purpose_medical 2.7326248 2.829318175 0.23291304.93612440.62626263Rejected
purpose_renewable_energy-0.1314194-0.071820597-1.01747860.95380050.00000000Rejected
purpose_small_business 1.7870050 1.817728639-0.23620793.49785560.04040404Rejected
purpose_vacation 0.5542481 0.512812741-0.55462622.31929020.00000000Rejected
purpose_wedding 0.7084514 0.711024734-0.57686631.71672390.00000000Rejected
hardship_flag_N 0.0000000 0.000000000 0.00000000.00000000.00000000Rejected
debt_settlement_flag_N 0.0000000 0.000000000 0.00000000.00000000.00000000Rejected

   Utilizing Boruta resulted in 81 features compared to 27 from MV-SIS, suggesting the methods contained within Boruta probably select the features that might generate the best model metrics, but results in high dimensionality and subsequently higher computational costs in downstream modeling tasks.

Results from Variable Selection Methods

   Various approaches can be utilized for feature selection/importance as demonstrated here using both Python and R. Therefore, let's now compare the results in regards to which variables were selected from the test methods using set differences. To begin, we can create dummy variables using pd.get_dummies where the initial features are removed. Then create a separate dataframe containing the features from the methods that did not generate too few (Group Lasso) or too many features (Boruta). We can denote them as X_xgb for the ones from SelectFromModel using XGBoost, X_vif containing the remaining after a threshold=5 for VIF and X_mfs containing the variables from MV-SIS in R.

In [ ]:
df = df.drop('loan_status', axis=1)
df = pd.get_dummies(df, drop_first=True)

X_xgb = df[['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'int_rate',
            'installment', 'annual_inc', 'inq_last_6mths', 'pub_rec',
            'revol_bal', 'out_prncp', 'out_prncp_inv', 'total_pymnt',
            'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
            'total_rec_late_fee', 'recoveries', 'last_pymnt_amnt',
            'acc_open_past_24mths', 'bc_open_to_buy',
            'chargeoff_within_12_mths', 'delinq_amnt', 'mths_since_recent_bc',
            'num_rev_tl_bal_gt_0', 'tot_hi_cred_lim', 'total_bc_limit',
            'term_ 60 months', 'grade_B', 'grade_C', 'grade_D', 'grade_E',
            'home_ownership_MORTGAGE', 'home_ownership_RENT',
            'verification_status_Source Verified', 'verification_status_Verified',
            'pymnt_plan_y', 'purpose_major_purchase', 'purpose_moving',
            'purpose_small_business', 'purpose_vacation', 'initial_list_status_w',
            'application_type_Joint App', 'hardship_flag_Y',
            'disbursement_method_DirectPay', 'debt_settlement_flag_Y']
X_xgb = X_xgb.drop_duplicates()
print('- Dimensions using variables selected from XGB:', X_xgb.shape)
- Dimensions using variables selected from XGB: (2162365, 45)
In [ ]:
X_vif = df[['annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'revol_bal',
            'out_prncp', 'total_rec_int', 'total_rec_late_fee',
            'collection_recovery_fee', 'last_pymnt_amnt',
            'collections_12_mths_ex_med', 'acc_now_delinq', 'tot_coll_amt',
            'avg_cur_bal', 'bc_open_to_buy', 'chargeoff_within_12_mths',
            'delinq_amnt', 'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op',
            'mo_sin_rcnt_tl', 'mort_acc', 'mths_since_recent_bc',
            'num_accts_ever_120_pd', 'num_il_tl', 'num_tl_30dpd',
            'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'percent_bc_gt_75',
            'pub_rec_bankruptcies', 'tax_liens', 'total_il_high_credit_limit']]
X_vif = X_vif.drop_duplicates()
print('- Dimensions using variables selected from VIF:', X_vif.shape)
- Dimensions using variables selected from VIF: (2162365, 31)
In [ ]:
X_mfs = df[['num_bc_tl', 'num_il_tl', 'num_op_rev_tl', 'pymnt_plan_y',
            'num_accts_ever_120_pd', 'mo_sin_old_rev_tl_op',
            'last_pymnt_amnt', 'percent_bc_gt_75', 'revol_util',
            'tot_hi_cred_lim', 'num_actv_rev_tl', 'disbursement_method_DirectPay',
            'tot_coll_amt', 'term_ 60 months', 'mort_acc', 'funded_amnt_inv',
            'int_rate', 'inq_last_6mths', 'delinq_2yrs', 'installment',
            'collections_12_mths_ex_med', 'open_acc', 'loan_amnt',
            'funded_amnt', 'annual_inc', 'num_tl_op_past_12m',
            'home_ownership_OTHER', 'total_bc_limit']]
X_mfs = X_mfs.drop_duplicates()
print('- Dimensions using variables selected from MVSIS:', X_mfs.shape)
- Dimensions using variables selected from MVSIS: (2162365, 28)

   Each of these can be considered as a set with a unique identifier specifying the features from the various feature selection approaches examined. Let's start with the features that were selected from SelectFromModel using XGBoost in a set denoted as s. Subsequently, the others can be compared to this set to determine which features do not exist in s but exist in the other set. This results in a list of the features where the number can be quantified using the length of the list. Other combinations of the initial set with the various permutations can be utilized to compare the features from the methodologies examined.

In [ ]:
s = set(X_xgb)
varDiff_vif = [x for x in X_vif if x not in s]
print('\nFeatures using VIF but not in XGB:')
print(varDiff_vif)
print('- Number of different features: ' + str(len(varDiff_vif)))

s1 = set(X_vif)
varDiff_xgb = [x for x in X_xgb if x not in s1]
print('\nFeatures in XGB but not in VIF:')
print(varDiff_xgb)
print('- Number of different features: ' + str(len(varDiff_xgb)))

varDiff_mvsisAll = [x for x in X_mfs if x not in s]
print('\nFeatures in MVSIS but not in XGB:')
print(varDiff_mvsisAll)
print('- Number of different features: ' + str(len(varDiff_mvsisAll)))

varDiff_mvsisVIF = [x for x in X_mfs if x not in s1]
print('\nFeatures in MVSIS but not in VIF:')
print(varDiff_mvsisVIF)
print('- Number of different features: ' + str(len(varDiff_mvsisVIF)))

s1 = set(X_mfs)
varDiff_mvsisAll1 = [x for x in X_xgb if x not in s1]
print('\nFeatures in XGB but not in MV-SIS:')
print(varDiff_mvsisAll1)
print('- Number of different features: ' + str(len(varDiff_mvsisAll1)))

varDiff_mvsisVIF1 = [x for x in X_vif if x not in s1]
print('\nFeatures in VIF but not in MV-SIS:')
print(varDiff_mvsisVIF1)
print('- Number of different features: ' + str(len(varDiff_mvsisVIF1)))

Features using VIF but not in XGB:
['dti', 'delinq_2yrs', 'collection_recovery_fee', 'collections_12_mths_ex_med', 'acc_now_delinq', 'tot_coll_amt', 'avg_cur_bal', 'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl', 'mort_acc', 'num_accts_ever_120_pd', 'num_il_tl', 'num_tl_30dpd', 'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'percent_bc_gt_75', 'pub_rec_bankruptcies', 'tax_liens', 'total_il_high_credit_limit']
- Number of different features: 20

Features in XGB but not in VIF:
['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'int_rate', 'installment', 'pub_rec', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'recoveries', 'acc_open_past_24mths', 'num_rev_tl_bal_gt_0', 'tot_hi_cred_lim', 'total_bc_limit', 'term_ 60 months', 'grade_B', 'grade_C', 'grade_D', 'grade_E', 'home_ownership_MORTGAGE', 'home_ownership_RENT', 'verification_status_Source Verified', 'verification_status_Verified', 'pymnt_plan_y', 'purpose_major_purchase', 'purpose_moving', 'purpose_small_business', 'purpose_vacation', 'initial_list_status_w', 'application_type_Joint App', 'hardship_flag_Y', 'disbursement_method_DirectPay', 'debt_settlement_flag_Y']
- Number of different features: 34

Features in MVSIS but not in XGB:
['num_bc_tl', 'num_il_tl', 'num_op_rev_tl', 'num_accts_ever_120_pd', 'mo_sin_old_rev_tl_op', 'percent_bc_gt_75', 'revol_util', 'num_actv_rev_tl', 'tot_coll_amt', 'mort_acc', 'delinq_2yrs', 'collections_12_mths_ex_med', 'open_acc', 'num_tl_op_past_12m', 'home_ownership_OTHER']
- Number of different features: 15

Features in MVSIS but not in VIF:
['num_bc_tl', 'num_op_rev_tl', 'pymnt_plan_y', 'revol_util', 'tot_hi_cred_lim', 'num_actv_rev_tl', 'disbursement_method_DirectPay', 'term_ 60 months', 'funded_amnt_inv', 'int_rate', 'installment', 'open_acc', 'loan_amnt', 'funded_amnt', 'home_ownership_OTHER', 'total_bc_limit']
- Number of different features: 16

Features in XGB but not in MV-SIS:
['pub_rec', 'revol_bal', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'acc_open_past_24mths', 'bc_open_to_buy', 'chargeoff_within_12_mths', 'delinq_amnt', 'mths_since_recent_bc', 'num_rev_tl_bal_gt_0', 'grade_B', 'grade_C', 'grade_D', 'grade_E', 'home_ownership_MORTGAGE', 'home_ownership_RENT', 'verification_status_Source Verified', 'verification_status_Verified', 'purpose_major_purchase', 'purpose_moving', 'purpose_small_business', 'purpose_vacation', 'initial_list_status_w', 'application_type_Joint App', 'hardship_flag_Y', 'debt_settlement_flag_Y']
- Number of different features: 32

Features in VIF but not in MV-SIS:
['dti', 'revol_bal', 'out_prncp', 'total_rec_int', 'total_rec_late_fee', 'collection_recovery_fee', 'acc_now_delinq', 'avg_cur_bal', 'bc_open_to_buy', 'chargeoff_within_12_mths', 'delinq_amnt', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl', 'mths_since_recent_bc', 'num_tl_30dpd', 'num_tl_90g_dpd_24m', 'pub_rec_bankruptcies', 'tax_liens', 'total_il_high_credit_limit']
- Number of different features: 19

   From the comparisons made, the number of features selected with MV-SIS that are not present in the set using SelectFromModel using XGBoost, which contained 45 features, can be added to a temporary dataframe that is concatenated by column with X_xgb and loan_status for further examination.

In [ ]:
df_tmp = df[['num_bc_tl', 'num_il_tl', 'num_op_rev_tl', 'num_accts_ever_120_pd',
             'mo_sin_old_rev_tl_op', 'percent_bc_gt_75', 'revol_util',
             'num_actv_rev_tl', 'tot_coll_amt', 'mort_acc', 'delinq_2yrs',
             'collections_12_mths_ex_med', 'open_acc',
             'num_tl_op_past_12m', 'home_ownership_OTHER']]

df = pd.concat([X_xgb, df_tmp, y], axis=1)
df = df.drop_duplicates()
print('- Dimensions of data using for further EDA:', df.shape)

del X_xgb, X_vif, X_mfs
del df_tmp, y, s, s1, varDiff_vif, varDiff_xgb, varDiff_mvsisAll
del varDiff_mvsisVIF, varDiff_mvsisAll1, varDiff_mvsisVIF1
- Dimensions of data using for further EDA: (2162365, 64)

Exploratory Data Analysis (EDA)

Quantitative Variables

   We can now select the float64 and int64 quantitative features as a subset of the data. To evaluate if correlated features are present and to what extent, we can define a function to find the non-repetitive correlations by only considering the pairs of features on the diagonal and the lower part of the triangle of the correlation matrix. Then another function can be defined to find the features with the highest absolute rank correlation, which reduces the granularity from the presence of both positive and negative correlation coefficients. The second function can then call the first function to remove the repetitive pairs of features.

In [ ]:
df_num = df.select_dtypes(include = ['float64', 'int64'])
df_num = df_num.drop(['loan_status'], axis=1)

def find_repetitive_pairs(df):
    """Returns the pairs of features on the diagonal and lower triangle of the correlation matrix."""
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))

def find_top_correlations(df, n):
    """Returns the highest correlations without duplicates."""
    au_corr = df.corr(method='spearman').abs().unstack()
    labels_to_drop = find_repetitive_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]

   Let's first examine the number of features present in this subset and then apply the find_top_correlations function to determine the 20 features with the the highest correlation. Then we can create a correlation heatmap using sns.heatmap which only contains the features with a correlation coefficient >= 0.7 or <= -0.7, signifying a strong and very strong relationship.

In [ ]:
import seaborn as sns

print('- The selected dataframe has ' + str(df_num.shape[1]) + ' columns that are quantitative variables.')
print('- The 20 features with the highest correlations:')
print(find_top_correlations(df_num, 20))

corr = df_num.corr(method='spearman')

fig = plt.figure(21, 18)
plt.rcParams.update({'font.size': 13})
ax = sns.heatmap(corr[(corr >= 0.7) | (corr <= -0.7)],
                 cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
                 linecolor='black', annot=True, annot_kws={'size': 9},
                 square=True)
plt.title('Correlation Matrix with Spearman rho')
plt.show()
- The selected dataframe has 45 columns that are quantitative variables.
- The 20 features with the highest correlations:
out_prncp             out_prncp_inv         1.000000
loan_amnt             funded_amnt           0.999999
total_pymnt           total_pymnt_inv       0.999996
funded_amnt           funded_amnt_inv       0.999886
loan_amnt             funded_amnt_inv       0.999885
num_sats              open_acc              0.998736
tot_cur_bal           tot_hi_cred_lim       0.972169
total_pymnt           total_rec_prncp       0.967473
total_pymnt_inv       total_rec_prncp       0.967457
funded_amnt           installment           0.964027
loan_amnt             installment           0.964025
funded_amnt_inv       installment           0.963728
num_op_rev_tl         open_acc              0.829206
num_sats              num_op_rev_tl         0.828544
num_op_rev_tl         num_actv_rev_tl       0.786085
bc_open_to_buy        total_bc_limit        0.776871
percent_bc_gt_75      revol_util            0.745717
acc_open_past_24mths  num_tl_op_past_12m    0.744758
num_bc_sats           num_op_rev_tl         0.742778
                      num_bc_tl             0.732805
dtype: float64

   Now we can examine the distributions of these features using different visualizations. Let's first start by plotting histograms using histplot from seaborn where a for loop iterates through each of the selected features.

In [ ]:
df_num = df.select_dtypes(include = ['float64', 'int64'])
df_num = df_num.drop(['loan_status'], axis=1)

plt.rcParams.update({'font.size': 16})
fig, ax = plt.subplots(15,3, figsize=(21,35))
fig.suptitle('Quantitative Features: Histograms', y=1.01, fontsize=30)
for variable, subplot in zip(df_num, ax.flatten()):
    a = sns.histplot(df_num[variable], ax=subplot)
    a.set_yticklabels(a.get_yticks(), size=10)
    a.set_xticklabels(a.get_xticks(), size=9)
fig.tight_layout()
plt.show();

   Then we can increase the grain of this visualization by adding hue=df.loan_status to map the dependent variable as a binary variable as well as kde=True to smooth the distribution using a kernel density estimate.

In [ ]:
plt.rcParams.update({'font.size': 14})
fig, ax = plt.subplots(15,3, figsize=(25,35))
fig.suptitle('Quantitative Features: Histograms Grouped by Loan Status', y=1.01,
             fontsize=30)
for variable, subplot in zip(df_num, ax.flatten()):
    a = sns.histplot(x=df_num[variable], data=df_num, hue=df.loan_status,
                     kde=True, ax=subplot)
    a.set_yticklabels(a.get_yticks(), size=10)
    a.set_xticklabels(a.get_xticks(), size=10)
fig.tight_layout()
plt.show();

   From the previous figure, some features contain a larger range of values, so let's extract these as a subset of the quantitative features to inspect at a more granular level.

In [ ]:
df_num1 = df_num[['loan_amnt', 'funded_amnt', 'funded_amnt_inv',
                  'int_rate', 'installment', 'total_pymnt', 'total_pymnt_inv',
                  'total_rec_prncp', 'total_rec_int', 'mo_sin_old_rev_tl_op',
                  'percent_bc_gt_75', 'revol_util']]
In [ ]:
plt.rcParams.update({'font.size': 16})
fig, ax = plt.subplots(4,3, figsize=(21,30))
fig.suptitle('Subset of Quantitative Features: Histograms with Grouped by Loan Status',
             y=1.01, fontsize=30)
for variable, subplot in zip(df_num1, ax.flatten()):
    sns.histplot(x=df_num1[variable], data=df_num1, hue=df.loan_status,
                 kde=True, ax=subplot)
    a.set_yticklabels(a.get_yticks(), size=10)
    a.set_xticklabels(a.get_xticks(), size=9)
fig.tight_layout()
plt.show();

   Box-and-whisker plots are another way to visualize both quantitative and qualitative features. Let's use seaborn.boxplot where the target loan_status resides on the x-axis and the quantitative feature is on the y-axis and . This approach summarizes the distribution of the feature using five key numbers: the minimum, 1st quartile (25th percentile), 2nd quartile (median), 3rd quartile (75%) and the maximum of the data. The interquartile range (IQR) is the difference between the 75% and 25% percentile (Q3 - Q1). The whisker sections extend for the whole range of the data.

   Boxplots can be useful to visualize if outlier values are present within the data. If values are greater than Q3 + 1.5 x IQR or less than Q1 - 1.5 x IQR, then they can be considered as outliers.

In [ ]:
plt.rcParams.update({'font.size': 15})
fig, ax = plt.subplots(9,5, figsize=(21,30))
fig.suptitle('Quantitative Features: Boxplots', y=1.01, fontsize=30)
for var, subplot in zip(df_num, ax.flatten()):
    a = sns.boxplot(x=df.loan_status, y=df_num[var], data=df_num, ax=subplot)
    a.set_yticklabels(a.get_yticks(), size=12)
    a.set_xticklabels(a.get_xticks(), size=12)
fig.tight_layout()
plt.show();

   When considering the quantitative features stratified by loan_status, differences are observable, especially for recoveries, last_pymnt_amnt and percent_bc_gt_75. These potentially will contribute to explaining models once they are fit.

   Now, we can examine the qualitative features in the set including the uint8 ones using a countplot to visualize the number of observations in each categorical bin.

In [ ]:
df_cat = df.select_dtypes(include = 'uint8')

print('The selected dataframe has ' + str(df_cat.shape[1])
      + ' columns that are qualitative variables.')
print('\n')
fig, ax = plt.subplots(5, 4, figsize=(21,21))
fig.suptitle('Qualitative Features: Count Plots', y=1.01, fontsize=30)
for variable, subplot in zip(df_cat, ax.flatten()):
    a = sns.countplot(df_cat[variable], ax=subplot)
    a.set_yticklabels(a.get_yticks(), size=10)
fig.tight_layout()
plt.show();
The selected dataframe has 20 columns that are qualitative variables.


Automated EDA with Sweetviz

   Overview: Sweetviz is an open-source Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with just two lines of code. Output is a fully self-contained HTML application We can install using pip install sweetviz. There are many options to profile the data, but let's utilize the the default settings to examine the data quality (data types, missing values, duplicate rows, unique values) and summary statistics (min/max/range, quartiles, mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness.)

In [ ]:
import sweetviz as sv

sweet_report = sv.analyze(df)
sweet_report.show_html('Loan_Status_AutomatedEDA.html')
sweet_report.show_notebook(layout='widescreen', w=1500, h=1000, scale=0.8)
Report Loan_Status_AutomatedEDA_h1000.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.

Automated EDA using Pandas Profiling

   We can install using pip install ydata-profiling.

Key features:

  • Type inference: automatic detection of columns' data types (Categorical, Numerical, Date, etc.)
  • Warnings: A summary of the problems/challenges in the data that you might need to work on (missing data, inaccuracies, skewness, etc.)
  • Univariate analysis: including descriptive statistics (mean, median, mode, etc) and informative visualizations such as distribution histograms
  • Multivariate analysis: including correlations, a detailed analysis of missing data, duplicate rows, and visual support for variables pairwise interaction
  • Also includes the abilty to examine Time-Series, Text analysis, File and Image analysis and Compare datasets

  • Source: Github and PyPi

In [ ]:
from ydata_profiling import ProfileReport

profile = ProfileReport(df, title='Loan Status_EDA')
profile.to_notebook_iframe()

   The output is on the Github repository due to size contraints here.

   Now we can remove the features with high correlation coefficients and the categorical variables containing imbalances, and examine the final set using the data_quality_table function.

In [ ]:
df = df.drop(['out_prncp_inv', 'funded_amnt', 'funded_amnt_inv',
              'total_pymnt_inv', 'open_acc', 'tot_cur_bal', 'total_rec_prncp',
              'num_op_rev_tl', 'home_ownership_OTHER', 'hardship_flag_Y',
              'pymnt_plan_y', 'purpose_house', 'purpose_medical',
              'debt_settlement_flag_Y', 'purpose_small_business'], axis=1)
df = df.drop_duplicates()

print('\n              Data Quality: Final Set')
display(data_quality_table(df))

              Data Quality: Final Set
- There are 2162365 rows and 51 columns.

Data Type Percent Missing Number Unique
loan_amnt int64 0.0 1561
application_type_Joint App uint8 0.0 2
grade_B uint8 0.0 2
grade_C uint8 0.0 2
grade_D uint8 0.0 2
home_ownership_MORTGAGE uint8 0.0 2
home_ownership_OWN uint8 0.0 2
home_ownership_RENT uint8 0.0 2
verification_status_Source Verified uint8 0.0 2
verification_status_Verified uint8 0.0 2
purpose_credit_card uint8 0.0 2
initial_list_status_w uint8 0.0 2
disbursement_method_DirectPay uint8 0.0 2
total_bc_limit float64 0.0 19224
num_il_tl float64 0.0 122
num_accts_ever_120_pd float64 0.0 44
mo_sin_old_rev_tl_op float64 0.0 787
percent_bc_gt_75 float64 0.0 284
revol_util float64 0.0 1323
num_actv_rev_tl float64 0.0 57
tot_coll_amt float64 0.0 15475
mort_acc float64 0.0 46
delinq_2yrs float64 0.0 36
num_tl_op_past_12m float64 0.0 33
term_ 60 months uint8 0.0 2
total_bal_ex_mort float64 0.0 212018
int_rate float64 0.0 357
last_pymnt_amnt float64 0.0 668006
installment float64 0.0 90848
annual_inc float64 0.0 86511
inq_last_6mths float64 0.0 9
pub_rec float64 0.0 43
revol_bal int64 0.0 101135
out_prncp float64 0.0 360171
total_pymnt float64 0.0 1526102
total_rec_int float64 0.0 617804
total_rec_late_fee float64 0.0 15025
recoveries float64 0.0 121614
collections_12_mths_ex_med float64 0.0 16
tot_hi_cred_lim float64 0.0 528060
acc_open_past_24mths float64 0.0 57
bc_open_to_buy float64 0.0 91335
chargeoff_within_12_mths float64 0.0 11
delinq_amnt float64 0.0 2583
mths_since_recent_bc float64 0.0 542
num_bc_sats float64 0.0 60
num_bc_tl float64 0.0 75
num_sats float64 0.0 90
num_tl_30dpd float64 0.0 5
tax_liens float64 0.0 42
loan_status int64 0.0 2

Class Imbalance Methods

   Resampling data is one of the most commonly accepted approaches for handling an imbalanced dataset. There are broadly two types of methods: oversampling and undersampling. Oversampling is generally preferred compared to undersampling techniques because the small percentage of instances potentially containing important information tend to be removed from the set when undersampling based methods are utilized.

Resampling Techniques

   Let's first import the packages and define the conditional options to set up the environment with a random and numpy seed for reproducibility. Then the data can be read into a pandas.Dataframe and duplicates removed, if present.

In [ ]:
import os
import random
import numpy as np
import warnings
import pandas as pd
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

seed_value = 42
os.environ['LoanStatus_Linear'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)

df = pd.read_csv('LendingTree_LoanStatus_final.csv', low_memory=False)
df = df.drop_duplicates()

Upsample Minority Class

   Let's first separate the input features and the target into X and y. Then train_test_split from sklearn.model_selection can be utilized to set up the training/test sets using 80% of the data for the training set and 20% for the test set. Then the features and target from the training data can be concatenated by column, followed by splitting this set into the majority class, where loan_status==0 as current, and the minority class, where loan_status==1 as default.

   Next, resample from sklearn.utils can be leveraged to oversample the minority class with replacement specifying the number of samples (the number of observations contained within the current set) and the defined random_state=seed_value. Then, the majority and upsampled minority sets be concatenated together and the counts of each of the classes in the balanced set examined.

   Then the input features and target of the Upsampled training set can be defined as X_train and y_train to match the format of the test set, the train/test target defined as a pandas.dataframe to be concatenated for full train/test sets to be saved for later use in .csv files.

In [ ]:
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X = df.drop('loan_status', axis=1)
y = df['loan_status']

del df

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
                                                    random_state=seed_value)

df1 = pd.concat([X_train, y_train], axis=1)

current = df1[df1.loan_status==0]
default = df1[df1.loan_status==1]

del df1

default_upsampled = resample(default,
                             replace=True,
                             n_samples=len(current),
                             random_state=seed_value)

upsampled = pd.concat([current, default_upsampled])

del default_upsampled, current, default

print('\nExamine Loan Status after oversampling minority class')
print(upsampled.loan_status.value_counts())

X_train = upsampled.drop('loan_status', axis=1)
y_train = upsampled.loan_status

del upsampled

cols = ['loan_status']

y_train = pd.DataFrame(data=y_train, columns=cols)
y_test = pd.DataFrame(data=y_test, columns=cols)

train_US = pd.concat([X_train, y_train], axis=1)
train_US.to_csv('trainDF_US.csv', index=False)

test_US = pd.concat([X_test, y_test], axis=1)
test_US.to_csv('testDF_US.csv', index=False)

del train_US, test_US

Examine Loan Status after oversampling minority class
0    1511066
1    1511066
Name: loan_status, dtype: int64

Synthetic Minority Oversampling Technique (SMOTE)

   SMOTE (Synthetic Minority Oversampling Technique) generates new synthetic samples for the minority class, which is a type of data augmentation. The smaller the distance between the minority data and the closest minority data in the feature space is where new synthetic data is formed. For a random example within the minority class, a specified k of the nearest neighbors are determined, randomly selected and a synthetic example is generated located at a random point between the initial and the randomly selected neighbor. Since this uses the smaller distance between the aformentioned, this results in more realistic examples from the minority class.

   Since this approaches relies on the minority class and not the majority class, potential negative events include decreasing the decision space as well as increasing the amount of noise in the feature space, resulting in more difficulty classifying groups, whether binary or multi-level outcomes.

   Let's first set up the training/testing sets using the same parameters that were utilized for the Upsampled set. The SMOTE class from the Imbalanced-learn library can then be imported from imblearn.over_sampling specifying the sampling_strategy='minority' and random_state=seed_value as parameters to fit_sample on the training set. The default specified k nearest neighbor is k_neighbors=5. Then the counts of both classes examined to compare with the Upsampled set, and similar methods used to same the train/test sets for later use.

In [ ]:
from imblearn.over_sampling import SMOTE

X1_train, X1_test, y1_train, y1_test = train_test_split(X, y, test_size=0.20,
                                                        random_state=seed_value)

smote = SMOTE(sampling_strategy='minority', random_state=seed_value)
X1_train, y1_train = smote.fit_sample(X1_train, y1_train)

print('\nExamine Loan Status after upsampling with SMOTE')
print(y1_train.value_counts())

cols = ['loan_status']

y1_train = pd.DataFrame(data=y1_train, columns=cols)
y1_test = pd.DataFrame(data=y1_test, columns=cols)

train_SMOTE = pd.concat([X1_train, y1_train], axis=1)
train_SMOTE.to_csv('trainDF_SMOTE.csv', index=False)

test_SMOTE = pd.concat([X1_test, y1_test], axis=1)
test_SMOTE.to_csv('testDF_SMOTE.csv', index=False)

del train_SMOTE, test_SMOTE

Examine Loan Status after upsampling with SMOTE
0    1511066
1    1511066
Name: loan_status, dtype: int64

Classification: Linear

   Before machine learning models are optimized for the associated parameters of different algorithms, baseline models are important to establish which parameters should be more closely targeted during the tuning process. The notebook containing the Linear baseline models and hyperparameter tuning using GridSearchCV is located here.

Upsampling: Lasso Baseline Model

Baseline Default Parameters

  • penalty: Specifies the norm of the penalty. default=’l2’.
  • tol: Tolerance for stopping criteria. default=1e-4.
  • C: Inverse of regularization strength. default=1.0.
  • fit_intercept: Specifies if a bias or intercept should be added to the decision function. default=True.
  • intercept_scaling: default=1
  • class_weight: Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. default=None.
  • random_state: RandomState instance. Used when solver == ‘sag’, saga or liblinear to shuffle the data. default=None.
  • solver: Algorithm to use in the optimization problem. default=’lbfgs’.
  • max_iter: Maximum number of iterations taken for the solvers to converge. default=100.
  • verbose: For the liblinear and lbfgs solvers set verbose to any positive number for verbosity. default=0.
  • warm_start: When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. Useless for liblinear solver. default=False.
  • n_jobs: Number of CPU cores used when parallelizing over classes if multi_class='ovr'. default=None.
  • l1_ratio: The Elastic-Net mixing parameter, Only used if penalty='elasticnet'. Setting l1_ratio=0 is equivalent to using penalty='l2', while setting l1_ratio=1 is equivalent to using penalty='l1'. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2. default=None.

   Let's first set the path where the results from training a model will be stored. In particular where the model .pkl is saved and can be reloaded for model comparisons to the baseline model or if more training is required to find even more optimal hyperparameters than the first iteration of the tuning process. Then, we can scale the data between zero and one using the MinMaxScaler which consists of fit_transform on the training set and applying that with tranform to the test set. For the Upsampled Lasso baseline model, we can fit the model using the default parameters from the LogisticRegression model in scikit-learn and save the model. Then predict based on trained model using the scaled test set to evaluate the model performance/metrics using various sklearn.metrics utilizing the input format of metric(y_true = y_test, y_pred = model.predict(X_test)) where the metric is specified and the defined model was fit with the training set and predictions evaluated with the test set. This can be applied to both the training and the test sets, independently, to evaluate if overfitting is occurring as well.

   The classification_report outputs granular information in table format with the rows containing loan_status==0, loan_status==1, accuracy, macro avg, and weighted avg and paired columns (precision, recall, f1-score, support) with the rows.

   The confusion_matrix contains even more granular information with the count the model evaluated:

  • True Positives (Top left): Model successfully predicts the positive class (loan_status==1)
  • False Positives (Top right): Model incorrectly predicts the positive class
  • False Negatives (Lower left): Model incorrectly predicts the negative class
  • True Negatives (Lower right): Model successfully predicts the negative class (loan_status==0)

   This information is needed to calculate the following metrics:

  • Accuracy: (True Positives + True Negatives) / Total Count of Predicted
  • Precision: True Positives / (True Positives + False Positives)
  • Recall: True Positives / (True Positives + False Negatives)
  • F1-Score: 2 (Precision Recall) / (Precision + Recall)

   These basic metrics for classification problems can be imported from sklearn.metrics, so we do not need to manually calculate the accuracy_score, precision_score, recall_score and f1_score.

In [ ]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from joblib import parallel_backend
import pickle
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score
from sklearn.metrics import recall_score, f1_score

scaler = MinMaxScaler()
X_trainS = scaler.fit_transform(X_train)
X_testS = scaler.transform(X_test)

lasso = LogisticRegression(penalty='l1', solver='saga',
                           random_state=seed_value)

with parallel_backend('threading', n_jobs=-1):
    lasso.fit(X_trainS, y_train)

Pkl_Filename = 'Lasso_US_baseline.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(lasso, file)

# =============================================================================
# To load saved model
#lasso_US = joblib.load('Lasso_US_baseline.pkl')
#print(Lasso_US)
# =============================================================================

y_pred_US = lasso.predict(X_testS)

print('Results from Lasso baseline model on Upsampled data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test, y_pred_US)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred_US))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test, y_pred_US))
print('Precision score : %.3f' % precision_score(y_test, y_pred_US))
print('Recall score : %.3f' % recall_score(y_test, y_pred_US))
print('F1 score : %.3f' % f1_score(y_test, y_pred_US))
Results from Lasso baseline model on Upsampled data:


Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99    377848
           1       0.97      0.92      0.94     54625

    accuracy                           0.99    432473
   macro avg       0.98      0.96      0.97    432473
weighted avg       0.99      0.99      0.99    432473



Confusion matrix:
[[376134   1714]
 [  4524  50101]]


Accuracy score : 0.986
Precision score : 0.967
Recall score : 0.917
F1 score : 0.941

Upsampling: Elastic Net Baseline Model

   Now we can set the baseline using the parameters needed for an elastic net model using an l1_ratio=0.5 with the saga solver.

In [ ]:
elnet = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5)

with parallel_backend('threading', n_jobs=-1):
    elnet.fit(X_trainS, y_train)

Pkl_Filename = 'Elnet_US_baseline.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(elnet, file)

y_pred_US = elnet.predict(X_testS)

print('Results from Elastic Net baseline model on Upsampled data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test, y_pred_US)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred_US))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test, y_pred_US))
print('Precision score : %.3f' % precision_score(y_test, y_pred_US))
print('Recall score : %.3f' % recall_score(y_test, y_pred_US))
print('F1 score : %.3f' % f1_score(y_test, y_pred_US))
Results from Elastic Net baseline model on Upsampled data:


Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99    377848
           1       0.97      0.91      0.94     54625

    accuracy                           0.99    432473
   macro avg       0.98      0.95      0.97    432473
weighted avg       0.99      0.99      0.98    432473



Confusion matrix:
[[376304   1544]
 [  4883  49742]]


Accuracy score : 0.985
Precision score : 0.970
Recall score : 0.911
F1 score : 0.939

   When comparing the model metrics to the Lasso baseline model, they are quite comparable, with a slightly higher precision and slightly lower recall and F1 score. They are all above 90%, which suggests the baseline models perform well.

SMOTE: Lasso Baseline Model

   Using the same processing that was utilized for the Upsampled data, we can fit the baseline linear models for the Lasso and Elastic Net using the SMOTE data.

In [ ]:
scaler = MinMaxScaler()
X1_trainS = scaler.fit_transform(X1_train)
X1_testS = scaler.transform(X1_test)

with parallel_backend('threading', n_jobs=-1):
    lasso.fit(X1_trainS, y1_train)

Pkl_Filename = 'Lasso_SMOTE_baseline.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(lasso, file)

y_pred_SMOTE = lasso.predict(X1_testS)

print('Results from Lasso baseline model on SMOTE data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y1_test, y_pred_SMOTE)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y1_test, y_pred_SMOTE))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y1_test, y_pred_SMOTE))
print('Precision score : %.3f' % precision_score(y1_test, y_pred_SMOTE))
print('Recall score : %.3f' % recall_score(y1_test, y_pred_SMOTE))
print('F1 score : %.3f' % f1_score(y1_test, y_pred_SMOTE))
Results from Lasso baseline model on SMOTE data:


Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99    377848
           1       0.97      0.92      0.94     54625

    accuracy                           0.99    432473
   macro avg       0.98      0.96      0.97    432473
weighted avg       0.99      0.99      0.99    432473



Confusion matrix:
[[376134   1714]
 [  4524  50101]]


Accuracy score : 0.986
Precision score : 0.967
Recall score : 0.917
F1 score : 0.941

   Compared to the model metrics of the baseline linear models using the Upsampled set, the model metrics are very similar due to the L1 regularization using the Lasso to shrink the model coefficients to zero.

SMOTE: Elastic Net Baseline Model

In [ ]:
with parallel_backend('threading', n_jobs=-1):
    elnet.fit(X1_trainS, y1_train)

Pkl_Filename = 'Elnet_SMOTE_baseline.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(elnet, file)

y_pred_SMOTE = elnet.predict(X1_testS)

print('Results from Elastic Net baseline model on SMOTE data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y1_test, y_pred_SMOTE)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y1_test, y_pred_SMOTE))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y1_test, y_pred_SMOTE))
print('Precision score : %.3f' % precision_score(y1_test, y_pred_SMOTE))
print('Recall score : %.3f' % recall_score(y1_test, y_pred_SMOTE))
print('F1 score : %.3f' % f1_score(y1_test, y_pred_SMOTE))
Results from Elastic Net baseline model on SMOTE data:


Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99    377848
           1       0.97      0.89      0.93     54625

    accuracy                           0.98    432473
   macro avg       0.98      0.95      0.96    432473
weighted avg       0.98      0.98      0.98    432473



Confusion matrix:
[[376548   1300]
 [  5801  48824]]


Accuracy score : 0.984
Precision score : 0.974
Recall score : 0.894
F1 score : 0.932

   Compared to the model metrics of the baseline linear models using the Upsampled set, the accuracy and the Precision score are comparable but the Recall score is 2% lower while the F1 score is potentially lower suggesting further inspection for these metrics.

   When first performing GridSearchCV using both the Upsampled and the SMOTE sets, initially, the same parameters were selected. Expanding the range of the parameters used in the search, such as increasing/decreasing C and the number of iterations (max_iter) as well as changing the scoring metric evaluated during the search might provide better model performance.

Grid Search using Pipelines: Weighted Recall

   We can design a sklearn pipeline by first using a list to select the numerical features to be scaled using the MinMaxScaler() in the defined numeric_transformer. This pipeline method yields the potential to transform various data types in different data contexts. Since this is a classification approach where the important categorical features have all ready been reduced down to the most important groups from the initial categorical features during the variable selection process, the numeric_transformer is all what is required for the preprocessing before testing different combinations of model parameters. To test two different modeling approaches, the numeric_transformer can be specified in two different pipelines where the model component of the pipelines contains differing parameters for the Lasso and the Elastic Net. When utilizing scikit-learn, the model type specified as LogisticRegression containing the penalty='l1' is similar to a Lasso while utilizing the penalty='elasticnet' is comparable to using Elastic Net for classification problems. Given the specified parameters for the two model types to be evaluated, these two pipelines can be specified in a list and ordered into a dictionary.

Grid Search: Upsampled

   Let's first start with the Upsampled set using the parameters that define a model with an L1 penalty and another that resides between the L1 and L2 regularization parameters first described in 2005 by Zou and Hastie as the Elastic Net. With differing penalties and l1_ratio values, but using the same solver, the same maxmimum iteration number as well as the same random_state, various other parameters within a defined grid can then be evaluated to compare how the model performance differs within the Upsampled set as well as compared with the SMOTE set.

In [ ]:
from sklearn.pipeline import Pipeline

numeric_features = list(X.columns[X.dtypes != 'object'])
numeric_transformer = Pipeline(steps=[('mms', MinMaxScaler())])

pipe_lasso = Pipeline(steps=[('preprocessor', numeric_transformer),
                             ('model', LogisticRegression(penalty='l1',
                                                          solver='saga',
                                                          max_iter=30000,
                                                          random_state=seed_value))])

pipe_elnet = Pipeline(steps=[('preprocessor', numeric_transformer),
                             ('model', LogisticRegression(penalty='elasticnet',
                                                          solver='saga',
                                                          l1_ratio=0.5,
                                                          max_iter=30000,
                                                          random_state=seed_value))])

pipelines = [pipe_lasso, pipe_elnet]

pipe_dict = {0: 'Lasso', 1: 'Elastic Net'}

   Given the specified method for preprocessing the numerical features and the two differing model architectures, we can then define a function called make_param_grids which utilizes itertools to loop through the pipeline_steps with either the Lasso or Elastic Net as well as the parameters in the grid for each defined hyperparameter space contained within the all_param_grids dictionary. The pipeline object is then initialized and the estimators are contained within the param_grids_list. The GridSearchCV components can then be specified which metric should be utilized for scoring, and given the initial difference in the recall score between the two sets, a weighted recall score might provide some insight. The fold number to be used in cross validation can be denoted, and given the size of the sets and computation power available, let's use 3 fold, cv=3, and n_jobs=-3 to utilize all of the number of cores available minus three. This also suggests this runtime environment is probably not the best optimized given the inability to utilize all of the avaiable resources when specifying joblib with n_jobs=-1 did not result in the utilization of 8 cores and 64GB RAM for this job.

In [ ]:
import itertools
from sklearn.model_selection import GridSearchCV

def make_param_grids(steps, param_grids):
    final_params = []
    for estimator_names in itertools.product(*steps.values()):
        current_grid = {}
        for step_name, estimator_name in zip(steps.keys(), estimator_names):
            for param, value in param_grids.get(estimator_name).items():
                if param == 'object':
                    current_grid[step_name] = [value]
                else:
                    current_grid[step_name + '__' + param] = value
        final_params.append(current_grid)

        return final_params

pipeline_steps = {'classifier': ['lasso', 'elnet']}

all_param_grids = {'lasso': {'object': LogisticRegression(),
                             'penalty': ['l1'],
                             'solver': ['saga'],
                             'max_iter': [30000, 50000, 100000],
                             'C': [1, 0.5, 0],
                             'tol': [1e-6, 1e-5, 1e-4, 1e-3]},

                   'elnet': {'object': LogisticRegression(),
                             'penalty': ['elasticnet'],
                             'solver': ['saga'],
                             'l1_ratio': [0.0, 0.5, 1.0],
                             'max_iter': [30000, 50000, 100000],
                             'C': [1, 0.5, 0],
                             'tol': [1e-6, 1e-5, 1e-4, 1e-3]}
                   }

param_grids_list = make_param_grids(pipeline_steps, all_param_grids)

pipe = Pipeline(steps=[('classifier', LogisticRegression())])

grid = GridSearchCV(pipe, param_grid=param_grids_list,
                    scoring='recall_weighted',
                    cv=3, verbose=4, n_jobs=-3)

   Now, the pipeline with the defined parameters for GridSearchCV can be run using the Upsampled data utilizing dask.delayed for parallel jobs where the start/end time can be monitored.

In [ ]:
import time
import dask.delayed

print('Start Upsampling - Grid Search..')
search_time_start = time.time()
pipelines_ = [dask.delayed(grid).fit(X_train, y_train.values.ravel())]
fit_pipelines = dask.compute(*pipelines_, scheduler='processes')
print('Finished Upsampling - Grid Search :', time.time() - search_time_start)
Start Upsampling - Grid Search..
Finished Upsampling - Grid Search : 923824.5640733242

   Then the accuracies of the models tested during the grid search can be compared to determine which model resulted in the highest accuracy when evaluated using the test set.

In [ ]:
for idx, val in enumerate(fit_pipelines):
    print('%s pipeline test accuracy: %.3f' % (pipe_dict[idx],
                                               val.score(X_test, y_test)))

best_acc = 0.0
best_clf = 0
best_pipe = ''
for idx, val in enumerate(fit_pipelines):
    if val.score(X_test, y_test) > best_acc:
        best_acc = val.score(X_test, y_test)
        best_pipe = val
        best_clf = idx

print('Classification with highest accuracy: %s' % pipe_dict[best_clf])
print('Lasso/Elastic Net using Upsampling - Best Estimator')
print(best_pipe.best_params_)
Lasso pipeline test accuracy: 0.986
Classification with highest accuracy: Lasso
Lasso/Elastic Net using Upsampling - Best Estimator
{'classifier': LogisticRegression(C=1, max_iter=100000, penalty='l1', solver='saga', tol=1e-06), 'classifier__C': 1, 'classifier__max_iter': 100000, 'classifier__penalty': 'l1', 'classifier__solver': 'saga', 'classifier__tol': 1e-06}

   For the Upsampled data, the model that utitilized the Lasso with C=1, max_iter=100000, penalty='l1', solver='saga' and tol=1e-06 resulted in the highest accuracy. Given these parameters, we can fit the model with the best accuracy for the Upsampled set and save the model in a pickle file for later use.

In [ ]:
LassoMod_US_HPO = LogisticRegression(penalty='l1',
                                     C=1,
                                     solver='saga',
                                     max_iter=100000,
                                     tol=1e-06,
                                     n_jobs=-1,
                                     random_state=seed_value)

print('Start fit the best hyperparameters from Upsampling grid search to the data..')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
    LassoMod_US_HPO.fit(X_trainS, y_train)
print('Finished fit the best hyperparameters from Upsampling grid search to the data :',
      time.time() - search_time_start)

Pkl_Filename = 'Linear_HPO_US_mmsRecall_mostParams.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(LassoMod_US_HPO, file)
Start fit the best hyperparameters from Upsampling grid search to the data..
Finished fit the best hyperparameters from Upsampling grid search to the data : 64808.703051805496

   Now that the model has been fit using the training set, we can predict using the scaled test set to determine the model performance/metrics.

In [ ]:
y_pred_US_HPO = LassoMod_US_HPO.predict(X_testS)

print('Results from LASSO using Upsampling HPO:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test, y_pred_US_HPO)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred_US_HPO))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test, y_pred_US_HPO))
print('Precision score : %.3f' % precision_score(y_test, y_pred_US_HPO))
print('Recall score : %.3f' % recall_score(y_test, y_pred_US_HPO))
print('F1 score : %.3f' % f1_score(y_test, y_pred_US_HPO))
Results from LASSO using Upsampling HPO:


Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99    377848
           1       0.97      0.92      0.94     54625

    accuracy                           0.99    432473
   macro avg       0.98      0.96      0.97    432473
weighted avg       0.99      0.99      0.99    432473



Confusion matrix:
[[376102   1746]
 [  4439  50186]]


Accuracy score : 0.986
Precision score : 0.966
Recall score : 0.919
F1 score : 0.942

   Compared to the baseline model, the model determined using GridSearchCV resulted in practically the same metrics. Given the time that was required to search through the various models in the given parameter space, the baseline model would probably be sufficient unless an even higher recall score where the true positives classified as positive was needed for the task at hand.

Model Metrics with ELI5

   Let's now utilize the PermutationImportance from eli5.sklearn which is a method for the global interpretation of a model that outputs the amplitude of the feature's effect but not the direction. This shuffles the set, generates predictions on the shuffled set, and then calculates the decrease in the specified metric, this case the model's metric, before shuffling the set. The more important a feature is, the greater the model error is when the set is shuffled.

class PermutationImportance(estimator, scoring=None, n_iter=5, random_state=None, cv='prefit', refit=True)

   Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.

In [ ]:
import eli5
from eli5.sklearn import PermutationImportance
import webbrowser
from eli5.formatters import format_as_dataframe

perm_importance = PermutationImportance(LassoMod_US_HPO,
                                        random_state=seed_value).fit(X_testS,
                                                                     y_test)

html_obj = eli5.show_weights(perm_importance,
                             feature_names=X_test1.columns.tolist())
html_obj
Out[ ]:
Weight Feature
0.4668 ± 0.0005 total_pymnt
0.4293 ± 0.0009 loan_amnt
0.3922 ± 0.0005 total_rec_int
0.3383 ± 0.0005 out_prncp
0.0596 ± 0.0002 recoveries
0.0247 ± 0.0002 installment
0.0048 ± 0.0002 total_rec_late_fee
0.0008 ± 0.0001 grade_C
0.0008 ± 0.0001 grade_B
0.0006 ± 0.0001 last_pymnt_amnt
0.0003 ± 0.0000 num_sats
0.0001 ± 0.0000 grade_D
0.0001 ± 0.0000 home_ownership_MORTGAGE
0.0001 ± 0.0000 verification_status_Source Verified
0.0000 ± 0.0000 tot_hi_cred_lim
0.0000 ± 0.0000 num_tl_op_past_12m
0.0000 ± 0.0000 home_ownership_RENT
0.0000 ± 0.0000 home_ownership_OWN
0.0000 ± 0.0000 total_bc_limit
0.0000 ± 0.0000 revol_util
… 30 more …

   We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).

In [ ]:
explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
Out[ ]:
feature weight std
0 total_pymnt 0.466812 0.000232
1 loan_amnt 0.429280 0.000429
2 total_rec_int 0.392205 0.000249
3 out_prncp 0.338332 0.000275
4 recoveries 0.059570 0.000091
5 installment 0.024710 0.000117
6 total_rec_late_fee 0.004786 0.000095
7 grade_C 0.000844 0.000039
8 grade_B 0.000757 0.000038
9 last_pymnt_amnt 0.000597 0.000040
10 num_sats 0.000336 0.000014
11 grade_D 0.000093 0.000014
12 home_ownership_MORTGAGE 0.000069 0.000022
13 verification_status_Source Verified 0.000065 0.000020
14 tot_hi_cred_lim 0.000041 0.000012
15 num_tl_op_past_12m 0.000039 0.000008
16 home_ownership_RENT 0.000037 0.000016
17 home_ownership_OWN 0.000018 0.000010
18 total_bc_limit 0.000016 0.000011
19 revol_util 0.000013 0.000009

   The total_pymnt feature contains the highest weight (0.466812), followed by loan_amnt (0.429280), total_rec_int (0.392205 and out_prncp (0.338332).

   Now we can use ELI5 to show the prediction, which reveals the contribution of the various features in explaining whether a loan will default or stay current. This can also be stored in an HTML file that is incorporated into a dashboard, whether in a Jupyter notebook or specified in a script with the environment and package dependencies that is loaded for batch or real time infence, given the task at hand.

In [ ]:
html_obj2 = eli5.show_prediction(LassoMod_US_HPO, X_test1.iloc[1],
                                 show_feature_values=True)
html_obj2
Out[ ]:

y=0 (probability 1.000, score -2418518.724) top features

Contribution? Feature Value
+9378493.908 out_prncp 13241.960
+2567513.392 total_pymnt 2274.930
+425246.132 annual_inc 35000.000
+375279.987 tot_hi_cred_lim 162071.000
+185286.465 total_bc_limit 30300.000
+8977.232 revol_bal 20454.000
+7781.646 last_pymnt_amnt 327.920
+25.504 num_sats 12.000
+20.087 revol_util 56.500
+9.567 mths_since_recent_bc 9.000
+4.443 mo_sin_old_rev_tl_op 164.000
+0.879 disbursement_method_DirectPay 1.000
+0.820 num_tl_op_past_12m 3.000
+0.419 mort_acc 1.000
+0.411 num_actv_rev_tl 5.000
+0.336 num_il_tl 10.000
+0.275 home_ownership_MORTGAGE 1.000
+0.065 purpose_credit_card 1.000
-0.091 initial_list_status_w 1.000
-0.216 verification_status_Verified 1.000
-0.233 term_ 60 months 1.000
-0.421 grade_B 1.000
-0.966 percent_bc_gt_75 40.000
-1.003 inq_last_6mths 2.000
-1.818 num_bc_tl 6.000
-8.526 num_bc_sats 5.000
-10.166 int_rate 12.730
-11.053 acc_open_past_24mths 3.000
-14.919 <BIAS> 1.000
-1812.468 installment 327.920
-29194.144 total_bal_ex_mort 73529.000
-515189.983 total_rec_int 1016.890
-9983876.839 loan_amnt 14500.000

   For the loans to be Paid Off or Current, the features out_prncp, total_pymnt, annual_inc, total_hi_credit_lim, revol_bal and last_pymnt_amnt with higher values are beneficial while the features loan_amnt, total_rec_int, total_bal_ex_mort and installment with higher values are not beneficial.

Create an Azure FIles Datastore

   Now that the features and potential methods for addressing class imbalance have been selected, let's save these sets in a Datastore in the cloud with Azure Files Datastore as option. We need to first connect to the Azure Machine Learning workspace with MLClient using the credential, subscription_id, resource_group_name and the workspace_name.

In [ ]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

ml_client = MLClient(
    credential=credential,
    subscription_id='a134465f-eb15-4297-b902-3c97d4c81838',
    resource_group_name='aschultzdata',
    workspace_name='ds-ml-env',
)

   Then we can create the datastore by specifying the name of the datastore, the description, the account, the name of the container, the protocol and the account_key.

In [ ]:
from azure.ai.ml.entities import AzureBlobDatastore
from azure.ai.ml.entities import AccountKeyConfiguration
from azure.ai.ml import MLClient

store = AzureBlobDatastore(
    name='loanstatus_datastore',
    description='Datastore for Loan Status',
    account_name='dsmlenv8898281366',
    container_name='loanstatus',
    protocol='https',
    credentials=AccountKeyConfiguration(
        account_key='XXXxxxXXXxXXXXxxXXXXXxXXXXXxXxxXxXXXxXXXxXXxxxXXxxXXXxXxXXXxxXxxXXXXxxxxxXXxxxxxxXXXxXXX'
    ),
)

ml_client.create_or_update(store)
Out[ ]:
AzureBlobDatastore({'type': <DatastoreType.AZURE_BLOB: 'AzureBlob'>, 'name': 'loanstatus_datastore', 'description': 'Datastore for Loan Status', 'tags': {}, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/a134465f-eb15-4297-b902-3c97d4c81838/resourceGroups/aschultzdata/providers/Microsoft.MachineLearningServices/workspaces/ds-ml-env/datastores/loanstatus_datastore', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/cpu-standardds12v2/code/Users/aschultz.data/UsedCarsCarGurus/Models/DL/MLP', 'creation_context': None, 'serialize': <msrest.serialization.Serializer object at 0x7f5450284f70>, 'credentials': {'type': 'account_key'}, 'container_name': 'loanstatus', 'account_name': 'dsmlenv8898281366', 'endpoint': 'core.windows.net', 'protocol': 'https'})

   Then we can upload both the train/test sets for the Upsampled and SMOTE sets to the loanstatus container.

Train a model using Microsoft Azure

   Before we train the model, we need to create a compute instance/cluster with a defined name, so let's create one with a CPU, if it has not all ready been created. Then we can specify the cluster components containing the on-demand virtual machine (VM) service, the type of VM, the minimum/maximum nodes in the cluster, the time the node will run after the job finishes/terminates and the type of tier. Finally, we can input the object to create_or_update from MLClient.

In [ ]:
from azure.ai.ml.entities import AmlCompute

cpu_compute_target = 'cpu-cluster-E8s-v3'

try:
    cpu_cluster = ml_client.compute.get(cpu_compute_target)
    print(
        f"You already have a cluster named {cpu_compute_target}, we'll reuse it as is."
    )

except Exception:
    print('Creating a new cpu compute target...')

    cpu_cluster = AmlCompute(
        name=cpu_compute_target,
        type='amlcompute',
        size='Standard_E8s_v3',
        min_instances=0,
        max_instances=1,
        idle_time_before_scale_down=180,
        tier='Dedicated',
    )
    print(
        f"AMLCompute with name {cpu_cluster.name} will be created, with compute size {cpu_cluster.size}"
    )
    cpu_cluster = ml_client.compute.begin_create_or_update(cpu_cluster)
Creating a new cpu compute target...
AMLCompute with name cpu-cluster-E8s-v3 will be created, with compute size Standard_E8s_v3

   Next, let's create the environment by making a directory for the dependencies which lists the components for the runtime and the libraries installed on the compute for training the model.

In [ ]:
import os

dependencies_dir = './dependencies'
os.makedirs(dependencies_dir, exist_ok=True)

   Now we can write the conda.yaml file into the dependencies directory.

In [ ]:
%%writefile {dependencies_dir}/conda.yaml
name: model-env
channels:
  - conda-forge
dependencies:
  - python=3.8.5
  - numpy=1.21.6
  - pip=23.1.2
  - scikit-learn==1.1.2
  - scipy
  - pandas>=1.1,<1.2
  - pip:
    - inference-schema[numpy-support]==1.3.0
    - mlflow
    - azureml-mlflow
    - psutil==5.9.0
    - tqdm
    - ipykernel
    - matplotlib
    - seaborn
    - eli5==0.13.0
    - shap==0.41.0
    - lime
Writing ./dependencies/conda.yaml

   The created conda.yaml file allows for the environment to be created and registered in the workspace.

In [ ]:
from azure.ai.ml.entities import Environment

custom_env_name = 'aml-loanstatus-cpu'

custom_job_env = Environment(
    name=custom_env_name,
    description='Custom environment for Loast Status Linear job',
    tags={'scikit-learn': '1.1.2'},
    conda_file=os.path.join(dependencies_dir, 'conda.yaml'),
    image='mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest',
)
custom_job_env = ml_client.environments.create_or_update(custom_job_env)

print(
    f"Environment with name {custom_job_env.name} is registered to workspace, the environment version is {custom_job_env.version}"
)
Environment with name aml-loanstatus-cpu is registered to workspace, the environment version is 2

   Next, we can create the training script by first creating the source folder where the training script, main.py, will be stored.

In [ ]:
import os

train_src_dir = './src'
os.makedirs(train_src_dir, exist_ok=True)

   The training script consists of preparing the environment, reading the data, data preparation, model training, evaluating the model and saving/registering the model. This includes specifying the dependencies to import and utilize, setting the seed, defining the input/output arguments of argparse, start logging with sklearn.autolog, reading the train/test sets, defining the features/target and preprocessing the data by scaling the features with the MinMaxScaler. Then the number of samples and features are logged with MLFlow. It uses this to then train a Linear model using the best parameters from GridSearchCV where the classification_report and confusion_matrix as well as the metrics accuracy, precision, recall and f1_score for the train/test sets are logged as MLFlow artifacts and metrics. Then the model can be saved and registered.

In [ ]:
import os
import random
import numpy as np
import warnings
import argparse
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from joblib import parallel_backend
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, precision_score
from sklearn.metrics import recall_score, f1_score
import matplotlib.pyplot as plt
import seaborn as sns
warnings.filterwarnings('ignore')

seed_value = 42
os.environ['LoanStatus_Linear'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)

def main():
    """Main function of the script."""

    parser = argparse.ArgumentParser()
    parser.add_argument('--train_data', type=str,
                        help='path to input train data')
    parser.add_argument('--test_data', type=str, help='path to input test data')
    parser.add_argument('--penalty', required=False, default='l2', type=str)
    parser.add_argument('--solver', required=False, default='lbfgs', type=str)
    parser.add_argument('--max_iter', required=False, default=100, type=int)
    parser.add_argument('--C', required=False, default=1, type=int)
    parser.add_argument('--tol', required=False, default=1e-4, type=float)
    parser.add_argument('--n_jobs', required=False, default=1, type=int)
    parser.add_argument('--registered_model_name', type=str, help='model name')
    args = parser.parse_args()

    mlflow.start_run()

    mlflow.sklearn.autolog()

    ###################
    #<prepare the data>
    ###################
    print(" ".join(f"{k}={v}" for k, v in vars(args).items()))

    print('Input Train Data:', args.train_data)
    print('Input Test Data:', args.test_data)

    trainDF = pd.read_csv(args.train_data, low_memory=False)
    testDF = pd.read_csv(args.test_data, low_memory=False)

    train_label = trainDF[['loan_status']]
    test_label = testDF[['loan_status']]

    train_features = trainDF.drop(columns = ['loan_status'])
    test_features = testDF.drop(columns = ['loan_status'])

    print(f"Training with data of shape {train_features.shape}")

    scaler = MinMaxScaler()
    train_features = scaler.fit_transform(train_features)
    test_features = scaler.transform(test_features)

    mlflow.log_metric('num_samples', train_features.shape[0])
    mlflow.log_metric('num_features', train_features.shape[1])

    ####################
    #</prepare the data>
    ####################

    ##################
    #<train the model>
    ##################
    model = LogisticRegression(penalty=args.penalty,
                               solver=args.solver,
                               max_iter=args.max_iter,
                               C=args.C,
                               tol=args.tol,
                               random_state=seed_value)

    with parallel_backend('threading', n_jobs=args.n_jobs):
        model.fit(train_features, train_label)

    ##################
    #</train the model>
    ##################

    #####################
    #<evaluate the model>
    #####################
    train_label_pred = model.predict(train_features)
    test_label_pred = model.predict(test_features)

    clr_train = classification_report(train_label, train_label_pred,
                                      output_dict=True)
    sns.heatmap(pd.DataFrame(clr_train).iloc[:-1,:].T, annot=True)
    plt.savefig('clr_train.png')
    mlflow.log_artifact('clr_train.png')
    plt.close()

    clr_test = classification_report(test_label, test_label_pred,
                                     output_dict=True)
    sns.heatmap(pd.DataFrame(clr_test).iloc[:-1,:].T, annot=True)
    plt.savefig('clr_test.png')
    mlflow.log_artifact('clr_test.png')
    plt.close()

    cm_train = confusion_matrix(train_label, train_label_pred)
    cm_train = ConfusionMatrixDisplay(confusion_matrix=cm_train)
    cm_train.plot()
    plt.savefig('cm_train.png')
    mlflow.log_artifact('cm_train.png')
    plt.close()

    cm_test = confusion_matrix(test_label, test_label_pred)
    cm_test = ConfusionMatrixDisplay(confusion_matrix=cm_test)
    cm_test.plot()
    plt.savefig('cm_test.png')
    mlflow.log_artifact('cm_test.png')
    plt.close()

    train_accuracy = accuracy_score(train_label, train_label_pred)
    train_precision = precision_score(train_label, train_label_pred)
    train_recall = recall_score(train_label, train_label_pred)
    train_f1 = f1_score(train_label, train_label_pred)

    test_accuracy = accuracy_score(test_label, test_label_pred)
    test_precision = precision_score(test_label, test_label_pred)
    test_recall = recall_score(test_label, test_label_pred)
    test_f1 = f1_score(test_label, test_label_pred)

    mlflow.log_metric('train_accuracy', train_accuracy)
    mlflow.log_metric('train_precision', train_precision)
    mlflow.log_metric('train_recall', train_recall)
    mlflow.log_metric('train_f1', train_f1)
    mlflow.log_metric('test_accuracy', test_accuracy)
    mlflow.log_metric('test_precision', test_precision)
    mlflow.log_metric('test_recall', test_recall)
    mlflow.log_metric('test_f1', test_f1)

    #####################
    #</evaluate the model>
    #####################

    ##########################
    #<save and register model>
    ##########################
    print('Registering the model via MLFlow')
    mlflow.sklearn.log_model(
        sk_model=model,
        registered_model_name=args.registered_model_name,
        artifact_path=args.registered_model_name,
    )

    mlflow.sklearn.save_model(
        sk_model=model,
        path=os.path.join(args.registered_model_name, 'trained_model'),
    )

    ###########################
    #</save and register model>
    ###########################

    mlflow.end_run()

if __name__ == "__main__":
    main()
Writing ./src/main.py

   To train the model, a command job configured with the input specifying the input data, the number of epochs and the batch size, which then runs the training script using the specified compute resource, environment, and the parameters specified to be logged needs to be submitted as a job.

In [ ]:
from azure.ai.ml import command
from azure.ai.ml import Input

registered_model_name = 'loanstatus_us_linear_model'

job = command(
    inputs=dict(
        train_data=Input(
            type='uri_file',
            path='azureml://datastores/loanstatus_datastore/paths/trainDF_US.csv',
        ),
        test_data=Input(
            type='uri_file',
            path = 'azureml://datastores/loanstatus_datastore/paths/testDF_US.csv',
        ),
        penalty='l1',
        solver='saga',
        max_iter=100000,
        C=1,
        tol=1e-06,
        n_jobs=-1,
        registered_model_name=registered_model_name,
    ),

    code='./src/',
    command='python main.py --train_data ${{inputs.train_data}} --test_data ${{inputs.test_data}} --penalty ${{inputs.penalty}} --solver ${{inputs.solver}} --max_iter ${{inputs.max_iter}} --n_jobs ${{inputs.n_jobs}} --registered_model_name ${{inputs.registered_model_name}}',
    environment='aml-loanstatus-cpu@latest',
    compute='cpu-cluster-E8s-v3',
    display_name='loanstatus_us_linear_best_prediction',
)

   Finally, this job can be submitted to run in Azure Machine Learning Studio using the create_or_update command with ml_client.

In [ ]:
ml_client.create_or_update(job)
Uploading src (0.01 MBs): 100%|██████████| 6105/6105 [00:00<00:00, 395731.86it/s]


Out[ ]:
ExperimentNameTypeStatusDetails Page
Linearplacid_calypso_gckg34lwndcommandStartingLink to Azure Machine Learning studio

   The submitted job can then be viewed by selecting the link in the output of the previous cell. The logged information with MLFlow including the model metrics and saved graphs can then be viewed/downloaded when the job completes.

Test Model on SMOTE Set

   Now, we can fit the best model from iterating through the parameters contained within the grid search using the Upsampled data to the SMOTE set to compare how this model peforms using another approach for handling class imbalance problems. After fitting the model, we can predict with the test set that was not utilzing for training, and evaluate the model performance.

In [ ]:
print('Start Fit best model using gridsearch results on Upsampling to SMOTE data..')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
    LassoMod_US_HPO.fit(X1_trainS, y1_train)
print('Finished Fit best model using gridsearch results on Upsampling to SMOTE data :',
      time.time() - search_time_start)

Pkl_Filename = 'LassoMod_SMOTEusingUSHPO_mms_moreParams.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(LassoMod_US_HPO, file)
Start Fit best model using gridsearch results on Upsampling to SMOTE data..
Finished Fit best model using gridsearch results on Upsampling to SMOTE data : 32627.21972966194
In [ ]:
y_pred_US_HPO = LassoMod_US_HPO.predict(X1_testS)

print('Results from LASSO using Upsampling HPO on SMOTE Data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y1_test, y_pred_US_HPO)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y1_test, y_pred_US_HPO))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y1_test, y_pred_US_HPO))
print('Precision score : %.3f' % precision_score(y1_test, y_pred_US_HPO))
print('Recall score : %.3f' % recall_score(y1_test, y_pred_US_HPO))
print('F1 score : %.3f' % f1_score(y1_test, y_pred_US_HPO))
Results from LASSO using Upsampling HPO on SMOTE Data:


Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99    377848
           1       0.98      0.91      0.94     54625

    accuracy                           0.99    432473
   macro avg       0.98      0.95      0.97    432473
weighted avg       0.99      0.99      0.99    432473



Confusion matrix:
[[376742   1106]
 [  4759  49866]]


Accuracy score : 0.986
Precision score : 0.978
Recall score : 0.913
F1 score : 0.944

   When testing the best model from the GridSearchCV using the Upsampled set with the SMOTE data, this resulted in the same accuracy, higher precision and F1 scores but lower recall.

Model Explanations using ELI5

   Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.

In [ ]:
perm_importance = PermutationImportance(LassoMod_US_HPO,
                                        random_state=seed_value).fit(X1_testS,
                                                                     y1_test)

html_obj = eli5.show_weights(perm_importance,
                             feature_names=X1_test1.columns.tolist())
html_obj
Out[ ]:
Weight Feature
0.4657 ± 0.0005 total_pymnt
0.4311 ± 0.0009 loan_amnt
0.3771 ± 0.0004 total_rec_int
0.3385 ± 0.0006 out_prncp
0.1294 ± 0.0005 home_ownership_RENT
0.1236 ± 0.0005 home_ownership_MORTGAGE
0.0535 ± 0.0002 recoveries
0.0533 ± 0.0002 home_ownership_OWN
0.0327 ± 0.0002 installment
0.0094 ± 0.0002 term_ 60 months
0.0020 ± 0.0001 total_rec_late_fee
0.0013 ± 0.0001 verification_status_Verified
0.0008 ± 0.0000 grade_D
0.0008 ± 0.0000 grade_C
0.0007 ± 0.0001 verification_status_Source Verified
0.0004 ± 0.0001 last_pymnt_amnt
0.0003 ± 0.0000 application_type_Joint App
0.0002 ± 0.0000 annual_inc
0.0002 ± 0.0001 num_bc_sats
0.0002 ± 0.0001 acc_open_past_24mths
… 30 more …

   We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).

In [ ]:
explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X1_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
Out[ ]:
feature weight std
0 total_pymnt 0.465679 0.000227
1 loan_amnt 0.431091 0.000439
2 total_rec_int 0.377133 0.000178
3 out_prncp 0.338522 0.000284
4 home_ownership_RENT 0.129386 0.000237
5 home_ownership_MORTGAGE 0.123567 0.000251
6 recoveries 0.053518 0.000099
7 home_ownership_OWN 0.053262 0.000108
8 installment 0.032734 0.000090
9 term_ 60 months 0.009414 0.000103
10 total_rec_late_fee 0.002045 0.000068
11 verification_status_Verified 0.001274 0.000032
12 grade_D 0.000781 0.000022
13 grade_C 0.000769 0.000022
14 verification_status_Source Verified 0.000708 0.000033
15 last_pymnt_amnt 0.000425 0.000059
16 application_type_Joint App 0.000261 0.000012
17 annual_inc 0.000188 0.000005
18 num_bc_sats 0.000167 0.000033
19 acc_open_past_24mths 0.000154 0.000044

   There is a lower total_rec_int (0.377133 vs. 0.392205) compared to the Upsampled test while total_pymnt, out_prncp loan_amnt are similar. Interestingly, home_ownership_RENT (0.129386 vs. 0.000037) and home_ownership_MORTGAGE (0.123567 vs. 0.000069) are considerably higher compared to the Upsampled test set.

   Now we can use ELI5 to show the prediction.

In [ ]:
html_obj2 = eli5.show_prediction(LassoMod_US_HPO, X1_test1.iloc[1],
                                 show_feature_values=True)
html_obj2
Out[ ]:

y=0 (probability 1.000, score -15317056.267) top features

Contribution? Feature Value
+13167125.933 annual_inc 35000.000
+9628265.504 out_prncp 13241.960
+2640157.990 total_pymnt 2274.930
+337731.195 total_bc_limit 30300.000
+255215.327 total_bal_ex_mort 73529.000
+138321.563 bc_open_to_buy 10970.000
+88920.973 revol_bal 20454.000
+14185.942 tot_hi_cred_lim 162071.000
+9774.006 last_pymnt_amnt 327.920
+3114.828 installment 327.920
+90.240 mo_sin_old_rev_tl_op 164.000
+33.664 num_sats 12.000
+15.462 mths_since_recent_bc 9.000
+8.976 num_il_tl 10.000
+5.730 home_ownership_MORTGAGE 1.000
+2.537 disbursement_method_DirectPay 1.000
+1.835 term_ 60 months 1.000
+1.560 grade_B 1.000
+1.134 num_bc_tl 6.000
+1.120 num_actv_rev_tl 5.000
+1.094 verification_status_Verified 1.000
+0.821 purpose_credit_card 1.000
+0.405 mort_acc 1.000
+0.243 initial_list_status_w 1.000
-1.242 inq_last_6mths 2.000
-1.731 num_tl_op_past_12m 3.000
-6.382 percent_bc_gt_75 40.000
-16.603 acc_open_past_24mths 3.000
-20.140 num_bc_sats 5.000
-21.966 <BIAS> 1.000
-26.061 int_rate 12.730
-38.356 revol_util 56.500
-528762.378 total_rec_int 1016.890
-10437026.955 loan_amnt 14500.000

   For the loans to be Paid Off or Current, the features annual_inc, out_prncp, total_pymnt, total_bc_limit, total_bal_ex_mort, bc_open_to_buy,revol_bal, total_hi_credit_lim and last_pymnt_amnt with higher values are beneficial while the features loan_amnt, and total_rec_int with higher values are not beneficial.

Grid Search: SMOTE

   To follow the methods used for evaluating the Upsampled set, let's fit the SMOTE data using dask.delayed for parallel jobs with time being monitored and then compare the model accuracies to determine which parameters led to this metric being achieved.

In [ ]:
print('Start SMOTE - Grid Search..')
search_time_start = time.time()
pipelines1_ = [dask.delayed(grid).fit(X1_train, y1_train.values.ravel())]
fit_pipelines1 = dask.compute(*pipelines1_, scheduler='processes')
print('Finished SMOTE - Grid Search :', time.time() - search_time_start)
Start SMOTE - Grid Search..
Finished SMOTE - Grid Search : 915874.5640733242
In [ ]:
for idx, val in enumerate(fit_pipelines1):
  print('%s pipeline test accuracy: %.3f' % (pipe_dict[idx],
                                             val.score(X1_test, y1_test)))

best_acc = 0.0
best_clf = 0
best_pipe = ''
for idx, val in enumerate(fit_pipelines1):
    if val.score(X1_test, y1_test) > best_acc:
        best_acc = val.score(X1_test, y1_test)
        best_pipe = val
        best_clf = idx

print('Classification with highest accuracy: %s' % pipe_dict[best_clf])
print('Lasso/Elastic Net using SMOTE - Best Estimator')
print(best_pipe.best_params_)
Lasso pipeline test accuracy: 0.986
Classification with highest accuracy: Lasso
Lasso/Elastic Net using SMOTE - Best Estimator
{'classifier': LogisticRegression(C=1, max_iter=100000, penalty='l1', solver='saga', tol=1e-06), 'classifier__C': 1, 'classifier__max_iter': 100000, 'classifier__penalty': 'l1', 'classifier__solver': 'saga', 'classifier__tol': 1e-06}

   Now that we determined the parameters that led to the highest accuracy during the search, we can fit a model with the parameters, save the model and predict based on the trained using the test set.

In [ ]:
LassoMod_SMOTE = LogisticRegression(penalty='l1',
                                    C=1,
                                    solver='saga',
                                    max_iter=100000,
                                    tol=1e-06,
                                    n_jobs=-1,
                                    random_state=seed_value)

print('Start fit the best hyperparameters from SMOTE grid search to the data..')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
    LassoMod_SMOTE.fit(X1_trainS, y1_train)
print('Finished fit the best hyperparameters from SMOTE grid search to the data:',
      time.time() - search_time_start)

Pkl_Filename = 'Linear_HPO_SMOTE_mmsRecall_mostParams.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(LassoMod_SMOTE, file)
Start fit the best hyperparameters from SMOTE grid search to the data..
Finished fit the best hyperparameters from SMOTE grid search to the data: 30523.45330929756
In [ ]:
y_pred_SMOTE_HPO = LassoMod_SMOTE.predict(X1_testS)

print('Results from LASSO using SMOTE HPO:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y1_test, y_pred_SMOTE_HPO)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y1_test, y_pred_SMOTE_HPO))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y1_test, y_pred_SMOTE_HPO))
print('Precision score : %.3f' % precision_score(y1_test, y_pred_SMOTE_HPO))
print('Recall score : %.3f' % recall_score(y1_test, y_pred_SMOTE_HPO))
print('F1 score : %.3f' % f1_score(y1_test, y_pred_SMOTE_HPO))
Results from LASSO using SMOTE HPO:


Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99    377848
           1       0.98      0.91      0.94     54625

    accuracy                           0.99    432473
   macro avg       0.98      0.95      0.97    432473
weighted avg       0.99      0.99      0.99    432473



Confusion matrix:
[[376742   1106]
 [  4759  49866]]


Accuracy score : 0.986
Precision score : 0.978
Recall score : 0.913
F1 score : 0.944

   Compared to the baseline model, the model determined using GridSearchCV resulted in practically the same metrics besides a higher precision score. Given the time that was required to search through the various models in the given parameter space, the baseline model would probably be sufficient unless an even higher recall score where the true positives classified as positive was needed for the task at hand.

   Compared to the model after GridSearchCV using the Upsampled set, there was a higher precision score, but a lower recall score.

Model Explanations using ELI5

   Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.

In [ ]:
perm_importance = PermutationImportance(LassoMod_SMOTE,
                                        random_state=seed_value).fit(X1_testS,
                                                                     y1_test)

html_obj = eli5.show_weights(perm_importance,
                             feature_names=X1_test1.columns.tolist())
html_obj
Out[ ]:
Weight Feature
0.4657 ± 0.0005 total_pymnt
0.4311 ± 0.0009 loan_amnt
0.3771 ± 0.0004 total_rec_int
0.3385 ± 0.0006 out_prncp
0.1294 ± 0.0005 home_ownership_RENT
0.1236 ± 0.0005 home_ownership_MORTGAGE
0.0535 ± 0.0002 recoveries
0.0533 ± 0.0002 home_ownership_OWN
0.0327 ± 0.0002 installment
0.0094 ± 0.0002 term_ 60 months
0.0020 ± 0.0001 total_rec_late_fee
0.0013 ± 0.0001 verification_status_Verified
0.0008 ± 0.0000 grade_D
0.0008 ± 0.0000 grade_C
0.0007 ± 0.0001 verification_status_Source Verified
0.0004 ± 0.0001 last_pymnt_amnt
0.0003 ± 0.0000 application_type_Joint App
0.0002 ± 0.0000 annual_inc
0.0002 ± 0.0001 num_bc_sats
0.0002 ± 0.0001 acc_open_past_24mths
… 30 more …

   We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).

In [ ]:
explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X1_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
Out[ ]:
feature weight std
0 total_pymnt 0.465679 0.000227
1 loan_amnt 0.431091 0.000439
2 total_rec_int 0.377133 0.000178
3 out_prncp 0.338522 0.000284
4 home_ownership_RENT 0.129386 0.000237
5 home_ownership_MORTGAGE 0.123567 0.000251
6 recoveries 0.053518 0.000099
7 home_ownership_OWN 0.053262 0.000108
8 installment 0.032734 0.000090
9 term_ 60 months 0.009414 0.000103
10 total_rec_late_fee 0.002045 0.000068
11 verification_status_Verified 0.001274 0.000032
12 grade_D 0.000781 0.000022
13 grade_C 0.000769 0.000022
14 verification_status_Source Verified 0.000708 0.000033
15 last_pymnt_amnt 0.000425 0.000059
16 application_type_Joint App 0.000261 0.000012
17 annual_inc 0.000188 0.000005
18 num_bc_sats 0.000167 0.000033
19 acc_open_past_24mths 0.000154 0.000044

   There is a lower total_rec_int (0.377133 vs. 0.392205) compared to the Upsampled test while total_pymnt, out_prncp loan_amnt are similar. Interestingly, home_ownership_RENT (0.129386 vs. 0.000037) and home_ownership_MORTGAGE (0.123567 vs. 0.000069) are considerably higher compared to the Upsampled test set.

   Now we can use ELI5 to show the prediction.

In [ ]:
html_obj2 = eli5.show_prediction(LassoMod_SMOTE, X1_test1.iloc[1],
                                 show_feature_values=True)
html_obj2
Out[ ]:

y=0 (probability 1.000, score -15317056.267) top features

Contribution? Feature Value
+13167125.933 annual_inc 35000.000
+9628265.504 out_prncp 13241.960
+2640157.990 total_pymnt 2274.930
+337731.195 total_bc_limit 30300.000
+255215.327 total_bal_ex_mort 73529.000
+138321.563 bc_open_to_buy 10970.000
+88920.973 revol_bal 20454.000
+14185.942 tot_hi_cred_lim 162071.000
+9774.006 last_pymnt_amnt 327.920
+3114.828 installment 327.920
+90.240 mo_sin_old_rev_tl_op 164.000
+33.664 num_sats 12.000
+15.462 mths_since_recent_bc 9.000
+8.976 num_il_tl 10.000
+5.730 home_ownership_MORTGAGE 1.000
+2.537 disbursement_method_DirectPay 1.000
+1.835 term_ 60 months 1.000
+1.560 grade_B 1.000
+1.134 num_bc_tl 6.000
+1.120 num_actv_rev_tl 5.000
+1.094 verification_status_Verified 1.000
+0.821 purpose_credit_card 1.000
+0.405 mort_acc 1.000
+0.243 initial_list_status_w 1.000
-1.242 inq_last_6mths 2.000
-1.731 num_tl_op_past_12m 3.000
-6.382 percent_bc_gt_75 40.000
-16.603 acc_open_past_24mths 3.000
-20.140 num_bc_sats 5.000
-21.966 <BIAS> 1.000
-26.061 int_rate 12.730
-38.356 revol_util 56.500
-528762.378 total_rec_int 1016.890
-10437026.955 loan_amnt 14500.000

   For the loans to be Paid Off or Current, the features annual_inc, out_prncp, total_pymnt, total_bc_limit, total_bal_ex_mort, bc_open_to_buy, revol_bal, total_hi_credit_lim and last_pymnt_amnt with higher values are beneficial while the features loan_amnt, and total_rec_int with higher values are not beneficial.

Test Model on Upsampled Set

   Now, we can fit the determined model from the search using the SMOTE set, save the .pkl file, then predict based on trained model with the scaled test set and evaluate the model metrics.

In [ ]:
print('Start fit best model using gridsearch results on SMOTE to Upsampling data..')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
    LassoMod_SMOTE.fit(X_trainS, y_train)
print('Finished fit best model using gridsearch results on SMOTE to Upsampling data :',
      time.time() - search_time_start)
Pkl_Filename = 'LassMod_UpsamplingusingSMOTEHPO_mmsRecall_mostParams.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(LassoMod_SMOTE, file)
Start fit best model using gridsearch results on SMOTE to Upsampling data..
Finished fit best model using gridsearch results on SMOTE to Upsampling data : 62429.723383426666
In [ ]:
y_pred_SMOTE_HPO = LassoMod_SMOTE.predict(X_testS)

print('Results from LASSO using SMOTE HPO on Upsampling data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test, y_pred_SMOTE_HPO)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y1_test, y_pred_SMOTE_HPO))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test, y_pred_SMOTE_HPO))
print('Precision score : %.3f' % precision_score(y_test, y_pred_SMOTE_HPO))
print('Recall score : %.3f' % recall_score(y_test, y_pred_SMOTE_HPO))
print('F1 score : %.3f' % f1_score(y_test, y_pred_SMOTE_HPO))
Results from LASSO using SMOTE HPO on Upsampling data:


Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99    377848
           1       0.97      0.92      0.94     54625

    accuracy                           0.99    432473
   macro avg       0.98      0.96      0.97    432473
weighted avg       0.99      0.99      0.99    432473



Confusion matrix:
[[376102   1746]
 [  4439  50186]]


Accuracy score : 0.986
Precision score : 0.966
Recall score : 0.919
F1 score : 0.942

   When testing the best model from the GridSearchCV using the SMOTE set with the Upsampled data, this resulted in the same accuracy but lower precision and F1 scores a higher recall score.

Model Explanations using ELI5

   Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.

In [ ]:
perm_importance = PermutationImportance(LassoMod_SMOTE,
                                        random_state=seed_value).fit(X_testS,
                                                                     y_test)

html_obj = eli5.show_weights(perm_importance,
                             feature_names=X_test1.columns.tolist())
html_obj
Out[ ]:
Weight Feature
0.4668 ± 0.0005 total_pymnt
0.4293 ± 0.0009 loan_amnt
0.3922 ± 0.0005 total_rec_int
0.3383 ± 0.0005 out_prncp
0.0596 ± 0.0002 recoveries
0.0247 ± 0.0002 installment
0.0048 ± 0.0002 total_rec_late_fee
0.0008 ± 0.0001 grade_C
0.0008 ± 0.0001 grade_B
0.0006 ± 0.0001 last_pymnt_amnt
0.0003 ± 0.0000 num_sats
0.0001 ± 0.0000 grade_D
0.0001 ± 0.0000 home_ownership_MORTGAGE
0.0001 ± 0.0000 verification_status_Source Verified
0.0000 ± 0.0000 tot_hi_cred_lim
0.0000 ± 0.0000 num_tl_op_past_12m
0.0000 ± 0.0000 home_ownership_RENT
0.0000 ± 0.0000 home_ownership_OWN
0.0000 ± 0.0000 total_bc_limit
0.0000 ± 0.0000 revol_util
… 30 more …

   We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).

In [ ]:
explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
Out[ ]:
feature weight std
0 total_pymnt 0.466812 0.000232
1 loan_amnt 0.429280 0.000429
2 total_rec_int 0.392205 0.000249
3 out_prncp 0.338332 0.000275
4 recoveries 0.059570 0.000091
5 installment 0.024710 0.000117
6 total_rec_late_fee 0.004786 0.000095
7 grade_C 0.000844 0.000039
8 grade_B 0.000757 0.000038
9 last_pymnt_amnt 0.000597 0.000040
10 num_sats 0.000336 0.000014
11 grade_D 0.000093 0.000014
12 home_ownership_MORTGAGE 0.000069 0.000022
13 verification_status_Source Verified 0.000065 0.000020
14 tot_hi_cred_lim 0.000041 0.000012
15 num_tl_op_past_12m 0.000039 0.000008
16 home_ownership_RENT 0.000037 0.000016
17 home_ownership_OWN 0.000018 0.000010
18 total_bc_limit 0.000016 0.000011
19 revol_util 0.000013 0.000009

   The feature weights are the same values from the best model from the GridSearchCV with scoring='recall_weighted' for the Upsampled set.

   Now we can use ELI5 to show the prediction.

In [ ]:
html_obj2 = eli5.show_prediction(LassoMod_SMOTE, X_test1.iloc[1],
                                 show_feature_values=True)
html_obj2
Out[ ]:

y=0 (probability 1.000, score -2418518.724) top features

Contribution? Feature Value
+9378493.908 out_prncp 13241.960
+2567513.392 total_pymnt 2274.930
+425246.132 annual_inc 35000.000
+375279.987 tot_hi_cred_lim 162071.000
+185286.465 total_bc_limit 30300.000
+8977.232 revol_bal 20454.000
+7781.646 last_pymnt_amnt 327.920
+25.504 num_sats 12.000
+20.087 revol_util 56.500
+9.567 mths_since_recent_bc 9.000
+4.443 mo_sin_old_rev_tl_op 164.000
+0.879 disbursement_method_DirectPay 1.000
+0.820 num_tl_op_past_12m 3.000
+0.419 mort_acc 1.000
+0.411 num_actv_rev_tl 5.000
+0.336 num_il_tl 10.000
+0.275 home_ownership_MORTGAGE 1.000
+0.065 purpose_credit_card 1.000
-0.091 initial_list_status_w 1.000
-0.216 verification_status_Verified 1.000
-0.233 term_ 60 months 1.000
-0.421 grade_B 1.000
-0.966 percent_bc_gt_75 40.000
-1.003 inq_last_6mths 2.000
-1.818 num_bc_tl 6.000
-8.526 num_bc_sats 5.000
-10.166 int_rate 12.730
-11.053 acc_open_past_24mths 3.000
-14.919 <BIAS> 1.000
-1812.468 installment 327.920
-29194.144 total_bal_ex_mort 73529.000
-515189.983 total_rec_int 1016.890
-9983876.839 loan_amnt 14500.000

   For the loans to be Paid Off or Current, the features out_prncp, total_pymnt, annual_inc, total_hi_credit_lim, total_bc_limit, revol_bal and last_pymnt_amnt with higher values are beneficial while the features loan_amnt, total_rec_int, total_bal_ex_mort and installment with higher values are not beneficial.

Naive Bayes

   Naive Bayes classifiers are a family of supervised learning algorithms based on applying Bayes’ theorem with the "naive" assumption that conditional independence exists between every pair of features given the value of the class variable. Bayes’ theorem is a formula that offers a conditional probability of an event A happening given another event B has all ready occurred:

$$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $$

  • $P(A)$ is called the prior probability
  • $P(A|B)$ is called the posterior probability (the probability after taking into account the observation of $B$).

   The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome. In Gaussian Naive Bayes, the continuous values of each feature are assumed to be distributed according to a Normal distribution and the likelihood of the features is assumed to be Gaussian.

   In The Optimality of Naive Bayes, it was suggested that Naive Bayes can still perform well even if the features contain strong dependences as long as the dependences are evenly distributed in the target variable or if they cancel each other out.

Baseline Models

   The notebook containing the baseline models and hyperparameter tuning using GridSearchCV is located here.

Baseline Default Parameters

  • priors: Prior probabilities of the classes. If specified, the priors are not adjusted according to the data. default=None.
  • var_smoothing: Portion of the largest variance of all features that is added to variances for calculation stability. default=1e-09.

Upsampled: Baseline Model

   As with the baseline models for the Linear models, we can first set the path where the ML results are saved as this is always needed and will be assumed for all workflows. Let's first start with the Upsampled set with fitting the model, saving it as .pkl file followed by making predictions and evaluating with the test set.

In [ ]:
nb = GaussianNB()

with parallel_backend('threading', n_jobs=-1):
    nb.fit(X_train, y_train)

Pkl_Filename = 'NB_Upsampling_baseline.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(nb, file)

y_pred_US = nb.predict(X_test)

print('Results from Naives Bayes baseline model on Upsampled Data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test, y_pred_US)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred_US))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test, y_pred_US))
print('Precision score : %.3f' % precision_score(y_test, y_pred_US))
print('Recall score : %.3f' % recall_score(y_test, y_pred_US))
print('F1 score : %.3f' % f1_score(y_test, y_pred_US))
Results from Naives Bayes baseline model on Upsampled Data:


Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.99      0.97    377848
           1       0.91      0.65      0.76     54625

    accuracy                           0.95    432473
   macro avg       0.93      0.82      0.86    432473
weighted avg       0.95      0.95      0.94    432473



Confusion matrix:
[[374366   3482]
 [ 19073  35552]]


Accuracy score : 0.948
Precision score : 0.911
Recall score : 0.651
F1 score : 0.759

   All of the model metrics for this baseline model are lower than both previous baseline linear models for the Upsampled set. Although the accuracy=0.948 and precision=0.911, the lowest metric is recall_score=0.651 followed by the f1_score=0.759. Naive Bayes is non-linear and might require too many simulations to find match the metrics achieved from the linear approaches, so let's try some different scoring metrics during GridSearchCV since the time for runtime completion is not too computational demanding.

SMOTE: Baseline Model

   We can now fit the baseline model for the SMOTE set using the default model parameters.

In [ ]:
with parallel_backend('threading', n_jobs=-1):
    nb.fit(X1_train, y1_train)

Pkl_Filename = 'NB_SMOTE_Baseline.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(nb, file)

y_pred_SMOTE = nb.predict(X_test)

print('Results from Naives Bayes baseline model on SMOTE Data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y1_test, y_pred_SMOTE)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y1_test, y_pred_SMOTE))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y1_test, y_pred_SMOTE))
print('Precision score : %.3f' % precision_score(y1_test, y_pred_SMOTE))
print('Recall score : %.3f' % recall_score(y1_test, y_pred_SMOTE))
print('F1 score : %.3f' % f1_score(y1_test, y_pred_SMOTE))
Results from Naives Bayes baseline model on SMOTE Data:


Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.86      0.91    377848
           1       0.46      0.82      0.59     54625

    accuracy                           0.86    432473
   macro avg       0.72      0.84      0.75    432473
weighted avg       0.91      0.86      0.87    432473



Confusion matrix:
[[326068  51780]
 [  9625  45000]]


Accuracy score : 0.858
Precision score : 0.465
Recall score : 0.824
F1 score : 0.594

   All of the model metrics for this baseline model are even lower than both of previous baseline linear models for the SMOTE and the Upsampled sets. However, precision=0.465 is the lowest model metric observed up to this point, so this is definitely the metric we will have to use for scoring if the Naive Bayes algorithm has any likelihood of being considered with the adjusted class balance generated sets.

Grid Search using Pipelines: Precision

Grid Search: Upsampled

   Using a similar approach that was utilized for the Linear models, let's define the estimator as the default GaussianNB, the grid search parameters as param_grid, the scoring = precision using 3-fold cross validation and to use all available resouces with joblib in both the scikit-learn model and GridSearchCV. We can monitor the completion time as well to compare to any of the hyperparameter searches. However, this is still using CPU and RAM constraints with suboptimal runtimes which the literature has demonstrated significant differences in runtimes when using code optimized to run on a GPU even with tabular data. Let's print the searches best score and estimator parameters after the search completes.

In [ ]:
param_grid = {'var_smoothing': np.logspace(0,-9, num=1000)}

nb_grid_US = GridSearchCV(estimator=GaussianNB(),
                          param_grid=param_grid,
                          scoring='precision',
                          cv=3, verbose=1, n_jobs=-1)

search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
    nb_grid_US.fit(X_train, y_train)
print('Time to finish Grid Search using Upsampling:',
      time.time() - search_time_start)
print('======================================================================')
print('Naive Bayes: Upsampling')
print('- Best Score:', nb_grid_US.best_score_)
print('- Best Estimator:', nb_grid_US.best_estimator_)
Fitting 3 folds for each of 1000 candidates, totalling 3000 fits
Time to finish Grid Search using Upsampling: 4652.506390810013
======================================================================
Naive Bayes: Upsampling
- Best Score: 0.9864703739117203
- Best Estimator: GaussianNB()

    Using 3-fold cross validation monitoring precision resulted in the baseline model as the best model so far. Let's fit the best model from the grid search using the Upsampled data, and examine the classification metrics.

In [ ]:
nb_US_HPO = nb_grid_US.best_estimator_

print('Start fit the best hyperparameters from Upsampling grid search to the data..')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
    nb_US_HPO.fit(X_train, y_train)
print('Finished fit the best hyperparameters from Upsampling grid search to the data:',
      time.time() - search_time_start)

Pkl_Filename = 'NB_HPO_US_precision.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(nb_US_HPO, file)
Start fit the best hyperparameters from Upsampling grid search to the data..
Finished fit the best hyperparameters from Upsampling grid search to the data: 2.649251937866211
In [ ]:
y_pred_US = nb_US_HPO.predict(X_test)

print('Results from Naives Bayes using Best HPO from GridSearchCV on Upsampled Data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test, y_pred_US)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred_US))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test, y_pred_US))
print('Precision score : %.3f' % precision_score(y_test, y_pred_US))
print('Recall score : %.3f' % recall_score(y_test, y_pred_US))
print('F1 score : %.3f' % f1_score(y_test, y_pred_US))
Results from Naives Bayes using Best HPO from GridSearchCV on Upsampled Data:


Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.99      0.97    377848
           1       0.91      0.65      0.76     54625

    accuracy                           0.95    432473
   macro avg       0.93      0.82      0.86    432473
weighted avg       0.95      0.95      0.94    432473



Confusion matrix:
[[374366   3482]
 [ 19073  35552]]


Accuracy score : 0.948
Precision score : 0.911
Recall score : 0.651
F1 score : 0.759

   Compared to the baseline model, the model determined using GridSearchCV resulted in the same metrics. This suggests that this algorithm is going to prove to be difficult in this context.

Model Metrics with ELI5

   Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.

In [ ]:
perm_importance = PermutationImportance(nb_US_HPO,
                                        random_state=seed_value).fit(X_test,
                                                                     y_test)

html_obj = eli5.show_weights(perm_importance,
                             feature_names=X_test.columns.tolist())
html_obj
Out[ ]:
Weight Feature
0.1159 ± 0.0002 recoveries
0.0093 ± 0.0002 out_prncp
0.0057 ± 0.0002 total_pymnt
0.0055 ± 0.0003 total_rec_late_fee
0.0036 ± 0.0001 last_pymnt_amnt
0.0026 ± 0.0001 total_rec_int
0.0022 ± 0.0001 int_rate
0.0007 ± 0.0001 loan_amnt
0.0006 ± 0.0001 installment
0.0003 ± 0.0001 annual_inc
0.0003 ± 0.0001 bc_open_to_buy
0.0003 ± 0.0001 tot_hi_cred_lim
0.0002 ± 0.0001 total_bc_limit
0.0002 ± 0.0000 mths_since_recent_bc
0.0001 ± 0.0001 total_bal_ex_mort
0.0001 ± 0.0000 revol_bal
0.0001 ± 0.0001 acc_open_past_24mths
0.0001 ± 0.0001 num_actv_rev_tl
0.0001 ± 0.0001 percent_bc_gt_75
0.0001 ± 0.0001 revol_util
… 30 more …

   We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).

In [ ]:
explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X_test.columns.tolist())
exp = format_as_dataframe(explanation)
exp
Out[ ]:
feature weight std
0 recoveries 0.115911 0.000124
1 out_prncp 0.009270 0.000090
2 total_pymnt 0.005725 0.000095
3 total_rec_late_fee 0.005496 0.000133
4 last_pymnt_amnt 0.003591 0.000058
5 total_rec_int 0.002624 0.000037
6 int_rate 0.002165 0.000071
7 loan_amnt 0.000692 0.000054
8 installment 0.000610 0.000044
9 annual_inc 0.000304 0.000050
10 bc_open_to_buy 0.000278 0.000046
11 tot_hi_cred_lim 0.000276 0.000072
12 total_bc_limit 0.000214 0.000034
13 mths_since_recent_bc 0.000160 0.000024
14 total_bal_ex_mort 0.000108 0.000035
15 revol_bal 0.000094 0.000019
16 acc_open_past_24mths 0.000089 0.000055
17 num_actv_rev_tl 0.000087 0.000029
18 percent_bc_gt_75 0.000069 0.000051
19 revol_util 0.000067 0.000032

   The feature recoveries is the only feature with contributing weights (0.115911).

Test Model on SMOTE Set

   Now, we can fit the best model from iterating through the parameters contained within the grid search using the Upsampled data to the SMOTE set to compare how this model peforms using another approach for handling class imbalance problems. After fitting the model, we can predict with the test set that was not utilzing for training, and evaluate the model performance.

In [ ]:
print('Start Fit best model using gridsearch results on Upsampling to SMOTE data..')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
    nb_US_HPO.fit(X1_train, y1_train)
print('Finished Fit best model using gridsearch results on Upsampling to SMOTE data :',
      time.time() - search_time_start)

Pkl_Filename = 'NB_SMOTEusingUpsamplingHPO_precision.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(nb_US_HPO, file)
Start Fit best model using gridsearch results on Upsampling to SMOTE data..
Finished Fit best model using gridsearch results on Upsampling to SMOTE data : 5.161005973815918
In [ ]:
y_pred_SMOTE_US = nb_US_HPO.predict(X1_test)

print('Results from Naives Bayes using Upsampling Best HPO from GridSearchCV on SMOTE data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y1_test, y_pred_SMOTE_US)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y1_test, y_pred_SMOTE_US))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y1_test, y_pred_SMOTE_US))
print('Precision score : %.3f' % precision_score(y1_test, y_pred_SMOTE_US))
print('Recall score : %.3f' % recall_score(y1_test, y_pred_SMOTE_US))
print('F1 score : %.3f' % f1_score(y1_test, y_pred_SMOTE_US))
Results from Naives Bayes using Upsampling Best HPO from GridSearchCV on SMOTE data:


Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.86      0.91    377848
           1       0.46      0.82      0.59     54625

    accuracy                           0.86    432473
   macro avg       0.72      0.84      0.75    432473
weighted avg       0.91      0.86      0.87    432473



Confusion matrix:
[[326068  51780]
 [  9625  45000]]


Accuracy score : 0.858
Precision score : 0.465
Recall score : 0.824
F1 score : 0.594

   This resulted in the same model metrics as the baseline model.

Model Explanations using ELI5

   Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.

In [ ]:
perm_importance = PermutationImportance(nb_US_HPO,
                                        random_state=seed_value).fit(X1_test,
                                                                     y1_test)

html_obj = eli5.show_weights(perm_importance,
                             feature_names=X1_test.columns.tolist())
html_obj
Out[ ]:
Weight Feature
0.0748 ± 0.0003 recoveries
0.0561 ± 0.0010 out_prncp
0.0425 ± 0.0008 last_pymnt_amnt
0.0041 ± 0.0001 total_rec_late_fee
0.0030 ± 0.0001 installment
0.0023 ± 0.0003 loan_amnt
0.0023 ± 0.0003 total_rec_int
0.0018 ± 0.0003 mths_since_recent_bc
0.0016 ± 0.0002 int_rate
0.0008 ± 0.0001 acc_open_past_24mths
0.0004 ± 0.0001 num_actv_rev_tl
0.0003 ± 0.0001 num_sats
0.0003 ± 0.0002 tot_coll_amt
0.0001 ± 0.0001 num_tl_op_past_12m
0.0001 ± 0.0001 num_bc_tl
0.0000 ± 0.0001 num_il_tl
0.0000 ± 0.0000 delinq_amnt
0.0000 ± 0.0000 grade_B
0.0000 ± 0.0000 home_ownership_RENT
0.0000 ± 0.0000 purpose_credit_card
… 30 more …

   We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).

In [ ]:
explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X1_test.columns.tolist())
exp = format_as_dataframe(explanation)
exp
Out[ ]:
feature weight std
0 recoveries 0.074780 0.000162
1 out_prncp 0.056124 0.000491
2 last_pymnt_amnt 0.042546 0.000385
3 total_rec_late_fee 0.004069 0.000070
4 installment 0.003048 0.000068
5 loan_amnt 0.002277 0.000127
6 total_rec_int 0.002260 0.000131
7 mths_since_recent_bc 0.001846 0.000148
8 int_rate 0.001564 0.000102
9 acc_open_past_24mths 0.000779 0.000056
10 num_actv_rev_tl 0.000356 0.000033
11 num_sats 0.000339 0.000067
12 tot_coll_amt 0.000327 0.000097
13 num_tl_op_past_12m 0.000135 0.000040
14 num_bc_tl 0.000117 0.000036
15 num_il_tl 0.000045 0.000068
16 delinq_amnt 0.000037 0.000018
17 grade_B 0.000030 0.000006
18 home_ownership_RENT 0.000022 0.000005
19 purpose_credit_card 0.000013 0.000014

    The recoveries feature contains less weight compared to the Upsampled model (0.074780 vs. 0.11) while the out_prncp (0.056124) and last_pymnt_amnt (0.042546) contain more weight than the determined with the Upsampled model.

Grid Search using Pipelines: Weighted Precision

   Given the low precision score obtained for the SMOTE set when scoring='precision' was used (0.465), which resulted because the parameters used for the baseline model were selected, let's increase the parameter space by increasing var_smoothing and num to 5000 and testing scoring='precision_weighted' with 10-fold cross validation.

In [ ]:
param_grid = {'var_smoothing': np.logspace(-7,-13, num=5000)}

nb_grid = GridSearchCV(estimator=GaussianNB(),
                       param_grid=param_grid,
                       scoring='precision_weighted',
                       cv=10, verbose=4, n_jobs=-3)

search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
    nb_grid.fit(X1_train, y1_train)
print('Finished SMOTE - Grid Search :', time.time() - search_time_start)
print('======================================================================')
print('Naive Bayes using SMOTE ')
print('- Best Score:', nb_grid.best_score_)
print('- Best Estimator:', nb_grid.best_estimator_)
Fitting 10 folds for each of 5000 candidates, totalling 50000 fits
Finished SMOTE - Grid Search : 71216.84461021423
======================================================================
Naive Bayes using SMOTE
- Best Score: 0.9297076935182048
- Best Estimator: GaussianNB(var_smoothing=1.045210684942588e-13)

   Let's fit the best model from the grid search using the SMOTE data, and examine the classification metrics.

In [ ]:
nb_SMOTE_HPO = nb_grid.best_estimator_

print('Start fit the best hyperparameters from SMOTE grid search to the data..')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
    nb_SMOTE_HPO.fit(X1_train, y1_train)
print('Finished fit the best hyperparameters from SMOTE grid search to the data:',
      time.time() - search_time_start)

Pkl_Filename = 'NB_HPO_SMOTE_-7to-13_precisionWeighted_num5000_10cv.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(nb_SMOTE_HPO, file)
Start fit the best hyperparameters from SMOTE grid search to the data..
Finished fit the best hyperparameters from SMOTE grid search to the data: 2.3943393230438232
In [ ]:
y_pred_SMOTE_HPO = nb_SMOTE_HPO.predict(X1_test)

print('Results from Naives Bayes using Best HPO from GridSearchCV on SMOTE data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y1_test, y_pred_SMOTE_HPO)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y1_test, y_pred_SMOTE_HPO))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y1_test, y_pred_SMOTE_HPO))
print('Precision score : %.3f' % precision_score(y1_test, y_pred_SMOTE_HPO))
print('Recall score : %.3f' % recall_score(y1_test, y_pred_SMOTE_HPO))
print('F1 score : %.3f' % f1_score(y1_test, y_pred_SMOTE_HPO))
Results from Naives Bayes using Best HPO from GridSearchCV on SMOTE data:


Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.98      0.96    377848
           1       0.80      0.67      0.73     54625

    accuracy                           0.94    432473
   macro avg       0.87      0.82      0.85    432473
weighted avg       0.93      0.94      0.93    432473



Confusion matrix:
[[368405   9443]
 [ 17934  36691]]


Accuracy score : 0.937
Precision score : 0.795
Recall score : 0.672
F1 score : 0.728

   Compared to the baseline NB model using the SMOTE, all of the model metrics increased besides a decrease in the recall_score. However, the low recall, F1 and precision score, this model even after testing various searches, does not seem worthwhile unless the data is preprocessed further with feature engineering new features given the current ones.

Model Explanations using ELI5

   Although the model metrics are not ideal nor worth pursuing for further optimization, let's compute the permutation feature importance using the test set, and then get the weights with the defined feature names.

In [ ]:
perm_importance = PermutationImportance(nb_SMOTE_HPO,
                                        random_state=seed_value).fit(X1_test,
                                                                     y1_test)

html_obj = eli5.show_weights(perm_importance,
                             feature_names=X1_test.columns.tolist())
html_obj
Out[ ]:
Weight Feature
0.1151 ± 0.0003 recoveries
0.0096 ± 0.0003 last_pymnt_amnt
0.0075 ± 0.0002 grade_C
0.0054 ± 0.0002 grade_D
0.0052 ± 0.0002 grade_B
0.0047 ± 0.0002 out_prncp
0.0033 ± 0.0003 total_rec_late_fee
0.0014 ± 0.0002 total_pymnt
0.0009 ± 0.0002 home_ownership_OWN
0.0008 ± 0.0003 verification_status_Verified
0.0008 ± 0.0001 home_ownership_RENT
0.0007 ± 0.0002 int_rate
0.0006 ± 0.0001 installment
0.0006 ± 0.0000 loan_amnt
0.0005 ± 0.0003 verification_status_Source Verified
0.0002 ± 0.0000 total_rec_int
0.0001 ± 0.0001 pub_rec
0.0001 ± 0.0001 tax_liens
0.0001 ± 0.0000 delinq_amnt
0.0000 ± 0.0001 tot_coll_amt
… 30 more …

   We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).

In [ ]:
explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X1_test.columns.tolist())
exp = format_as_dataframe(explanation)
exp
Out[ ]:
feature weight std
0 recoveries 0.115140 0.000140
1 last_pymnt_amnt 0.009617 0.000165
2 grade_C 0.007534 0.000120
3 grade_D 0.005374 0.000113
4 grade_B 0.005167 0.000123
5 out_prncp 0.004717 0.000107
6 total_rec_late_fee 0.003345 0.000147
7 total_pymnt 0.001371 0.000094
8 home_ownership_OWN 0.000877 0.000122
9 verification_status_Verified 0.000799 0.000135
10 home_ownership_RENT 0.000771 0.000033
11 int_rate 0.000662 0.000111
12 installment 0.000641 0.000040
13 loan_amnt 0.000564 0.000020
14 verification_status_Source Verified 0.000467 0.000151
15 total_rec_int 0.000244 0.000002
16 pub_rec 0.000083 0.000047
17 tax_liens 0.000058 0.000049
18 delinq_amnt 0.000051 0.000016
19 tot_coll_amt 0.000043 0.000035

Test Model on Upsampled Set

   Using the best model from grid search using the SMOTE set, let's compare the results to the ones from using the Upsampled set by fitting a model with the best parameters and evaluating the model performance.

In [ ]:
print('Start fit best model using gridsearch results on SMOTE to Upsampling data..')
search_time_start = time.time()
with parallel_backend('threading', n_jobs=-1):
    nb_SMOTE_HPO.fit(X_train, y_train)
print('Finished fit best model using gridsearch results on SMOTE to Upsampling data:',
      time.time() - search_time_start)

Pkl_Filename = 'NB_UpsamplingUsingSMOTEHPO_-7to-13_precisionWeighted_num5000_10cv.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(nb_SMOTE_HPO, file)
Start fit best model using gridsearch results on SMOTE to Upsampling data..
Finished fit best model using gridsearch results on SMOTE to Upsampling data: 4.957686185836792
In [ ]:
y_pred_US_SMOTE = nb_SMOTE_HPO.predict(X_test)

print('Results from Naives Bayes using SMOTE Best HPO from GridSearchCV on Upsampling data:')
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test, y_pred_US_SMOTE)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred_US_SMOTE))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test, y_pred_US_SMOTE))
print('Precision score : %.3f' % precision_score(y_test, y_pred_US_SMOTE))
print('Recall score : %.3f' % recall_score(y_test, y_pred_US_SMOTE))
print('F1 score : %.3f' % f1_score(y_test, y_pred_US_SMOTE))
Results from Naives Bayes using SMOTE Best HPO from GridSearchCV on Upsampling data:


Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.98      0.97    377848
           1       0.85      0.64      0.73     54625

    accuracy                           0.94    432473
   macro avg       0.90      0.81      0.85    432473
weighted avg       0.94      0.94      0.94    432473



Confusion matrix:
[[371517   6331]
 [ 19718  34907]]


Accuracy score : 0.940
Precision score : 0.846
Recall score : 0.639
F1 score : 0.728

   Compared to the baseline model using the default parameters, this resulted in lower model metrics for the Upsampled set. The accuracy, precision and recall scores were better than the SMOTE baseline model metrics though.

Model Explanations using ELI5

   Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.

In [ ]:
perm_importance = PermutationImportance(nb_SMOTE_HPO,
                                        random_state=seed_value).fit(X_test,
                                                                     y_test)

html_obj = eli5.show_weights(perm_importance,
                             feature_names=X_test.columns.tolist())
html_obj
Out[ ]:
Weight Feature
0.1238 ± 0.0003 recoveries
0.0037 ± 0.0002 total_rec_late_fee
0.0034 ± 0.0003 last_pymnt_amnt
0.0016 ± 0.0001 out_prncp
0.0008 ± 0.0001 total_pymnt
0.0004 ± 0.0001 term_ 60 months
0.0001 ± 0.0001 total_rec_int
0.0001 ± 0.0000 installment
0.0001 ± 0.0000 loan_amnt
0.0001 ± 0.0001 num_tl_30dpd
0.0001 ± 0.0001 tot_coll_amt
0.0001 ± 0.0000 delinq_amnt
0.0000 ± 0.0000 revol_util
0.0000 ± 0.0000 verification_status_Source Verified
0.0000 ± 0.0000 percent_bc_gt_75
0.0000 ± 0.0001 mo_sin_old_rev_tl_op
-0.0000 ± 0.0000 num_bc_sats
-0.0000 ± 0.0000 home_ownership_OWN
-0.0000 ± 0.0000 grade_B
-0.0000 ± 0.0001 tax_liens
… 30 more …

   We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).

In [ ]:
explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X_test.columns.tolist())
exp = format_as_dataframe(explanation)
exp
Out[ ]:
feature weight std
0 recoveries 0.123842 0.000129
1 total_rec_late_fee 0.003727 0.000087
2 last_pymnt_amnt 0.003449 0.000162
3 out_prncp 0.001637 0.000051
4 total_pymnt 0.000768 0.000053
5 term_ 60 months 0.000379 0.000037
6 total_rec_int 0.000090 0.000028
7 installment 0.000085 0.000017
8 loan_amnt 0.000082 0.000018
9 num_tl_30dpd 0.000064 0.000038
10 tot_coll_amt 0.000064 0.000027
11 delinq_amnt 0.000059 0.000013
12 revol_util 0.000040 0.000018
13 verification_status_Source Verified 0.000014 0.000015
14 percent_bc_gt_75 0.000008 0.000020
15 mo_sin_old_rev_tl_op 0.000003 0.000028
16 num_bc_sats -0.000002 0.000006
17 home_ownership_OWN -0.000002 0.000002
18 grade_B -0.000017 0.000018
19 tax_liens -0.000026 0.000044

   The weight of the recoveries is potentially higher for the SMOTE model using the Upsampled data, but potentially the same (0.123842 vs. 0.115911).

   When the increased parameter space with 10 fold cross validation was tested with the `Upsampled` set, the the model resulted in wose model metrics than compared to the initial hyperparameter search. So, let's move on to more model evaluation.

SparkML

   Apache Spark is an open-source analytics engine that was developed to improve upon the limitations of the MapReduce in the the Hadoop Distributed File System by processing data in memory and distributed over multiple operations in parallel. There are various ways to use this, ranging from real time big data ingestion to ML/DL tasks. Let's first set up the environment by updating the packages and installing Java JRE/JDK with:

1. apt update

2. apt install default-jre

3. apt install default-jdk

   We can then install pyspark==3.3 and findspark using pip and set up the SparkSession. Both Paperspace and Colab were utilized so the amount of RAM allocated was different for the SparkSession for Paperspace since more RAM and CPU cores are available.

In [ ]:
import findspark
from pyspark.sql import SparkSession
findspark.init()

spark = SparkSession.builder\
        .master('local')\
        .appName('Paperspace')\
        .config('spark.driver.memory', '38g')\
        .config('spark.executor.pyspark.memory', '32g')\
        .config('spark.executor.cores', '4')\
        .config('spark.python.worker.memory', '32g')\
        .config('spark.sql.execution.arrow.pyspark.enabled', 'True')\
        .config('spark.sql.debug.maxToStringFields', '1000')\
        .config('spark.sql.autoBroadcastJoinThreshold', '-1')\
        .config('spark.ui.port', '4050')\
        .getOrCreate()

spark
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/09/14 22:41:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Out[ ]:

SparkSession - in-memory

SparkContext

Spark UI

Version
v3.3.0
Master
local
AppName
Paperspace

   Then we can remove the warnings from the environment by setting the log to only ERROR. After the SparkSession is configured, we can install & import the necessary packages and then set the seed for reproducibility.

In [ ]:
spark.sparkContext.setLogLevel('ERROR')

Baseline Models

   The baseline models for both the Upsampled and the SMOTE sets are located here.

Upsampled Minority Class

   Let's first read the train and the test sets, which are stored as .csv files, cache the data into memory for better performance and then inspect the schema of the features to determine if the schema needs to be updated to reflect the proper data type(s).

In [ ]:
import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')

seed_value = 42
os.environ['SparkML_HPO'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)

trainDF = spark.read.csv('/notebooks/LoanStatus/Data/trainDF_US.csv',
                         header=True, inferSchema=True).cache()
print('\nTrain Schema')
trainDF.printSchema()

testDF = spark.read.csv('/notebooks/LoanStatus/Data/testDF_US.csv',
                        header=True, inferSchema=True).cache()
print('\nTest Schema')
testDF.printSchema()

Train Schema
root
 |-- loan_amnt: integer (nullable = true)
 |-- int_rate: double (nullable = true)
 |-- installment: double (nullable = true)
 |-- annual_inc: double (nullable = true)
 |-- inq_last_6mths: double (nullable = true)
 |-- pub_rec: double (nullable = true)
 |-- revol_bal: integer (nullable = true)
 |-- out_prncp: double (nullable = true)
 |-- total_pymnt: double (nullable = true)
 |-- total_rec_int: double (nullable = true)
 |-- total_rec_late_fee: double (nullable = true)
 |-- recoveries: double (nullable = true)
 |-- last_pymnt_amnt: double (nullable = true)
 |-- collections_12_mths_ex_med: double (nullable = true)
 |-- acc_open_past_24mths: double (nullable = true)
 |-- bc_open_to_buy: double (nullable = true)
 |-- chargeoff_within_12_mths: double (nullable = true)
 |-- delinq_amnt: double (nullable = true)
 |-- mths_since_recent_bc: double (nullable = true)
 |-- num_bc_sats: double (nullable = true)
 |-- num_bc_tl: double (nullable = true)
 |-- num_sats: double (nullable = true)
 |-- num_tl_30dpd: double (nullable = true)
 |-- tax_liens: double (nullable = true)
 |-- tot_hi_cred_lim: double (nullable = true)
 |-- total_bal_ex_mort: double (nullable = true)
 |-- total_bc_limit: double (nullable = true)
 |-- term_ 60 months: integer (nullable = true)
 |-- grade_B: integer (nullable = true)
 |-- grade_C: integer (nullable = true)
 |-- grade_D: integer (nullable = true)
 |-- home_ownership_MORTGAGE: integer (nullable = true)
 |-- home_ownership_OWN: integer (nullable = true)
 |-- home_ownership_RENT: integer (nullable = true)
 |-- verification_status_Source Verified: integer (nullable = true)
 |-- verification_status_Verified: integer (nullable = true)
 |-- purpose_credit_card: integer (nullable = true)
 |-- initial_list_status_w: integer (nullable = true)
 |-- application_type_Joint App: integer (nullable = true)
 |-- disbursement_method_DirectPay: integer (nullable = true)
 |-- num_il_tl: double (nullable = true)
 |-- num_accts_ever_120_pd: double (nullable = true)
 |-- mo_sin_old_rev_tl_op: double (nullable = true)
 |-- percent_bc_gt_75: double (nullable = true)
 |-- revol_util: double (nullable = true)
 |-- num_actv_rev_tl: double (nullable = true)
 |-- tot_coll_amt: double (nullable = true)
 |-- mort_acc: double (nullable = true)
 |-- delinq_2yrs: double (nullable = true)
 |-- num_tl_op_past_12m: double (nullable = true)
 |-- loan_status: integer (nullable = true)


Test Schema
root
 |-- loan_amnt: integer (nullable = true)
 |-- int_rate: double (nullable = true)
 |-- installment: double (nullable = true)
 |-- annual_inc: double (nullable = true)
 |-- inq_last_6mths: double (nullable = true)
 |-- pub_rec: double (nullable = true)
 |-- revol_bal: integer (nullable = true)
 |-- out_prncp: double (nullable = true)
 |-- total_pymnt: double (nullable = true)
 |-- total_rec_int: double (nullable = true)
 |-- total_rec_late_fee: double (nullable = true)
 |-- recoveries: double (nullable = true)
 |-- last_pymnt_amnt: double (nullable = true)
 |-- collections_12_mths_ex_med: double (nullable = true)
 |-- acc_open_past_24mths: double (nullable = true)
 |-- bc_open_to_buy: double (nullable = true)
 |-- chargeoff_within_12_mths: double (nullable = true)
 |-- delinq_amnt: double (nullable = true)
 |-- mths_since_recent_bc: double (nullable = true)
 |-- num_bc_sats: double (nullable = true)
 |-- num_bc_tl: double (nullable = true)
 |-- num_sats: double (nullable = true)
 |-- num_tl_30dpd: double (nullable = true)
 |-- tax_liens: double (nullable = true)
 |-- tot_hi_cred_lim: double (nullable = true)
 |-- total_bal_ex_mort: double (nullable = true)
 |-- total_bc_limit: double (nullable = true)
 |-- term_ 60 months: integer (nullable = true)
 |-- grade_B: integer (nullable = true)
 |-- grade_C: integer (nullable = true)
 |-- grade_D: integer (nullable = true)
 |-- home_ownership_MORTGAGE: integer (nullable = true)
 |-- home_ownership_OWN: integer (nullable = true)
 |-- home_ownership_RENT: integer (nullable = true)
 |-- verification_status_Source Verified: integer (nullable = true)
 |-- verification_status_Verified: integer (nullable = true)
 |-- purpose_credit_card: integer (nullable = true)
 |-- initial_list_status_w: integer (nullable = true)
 |-- application_type_Joint App: integer (nullable = true)
 |-- disbursement_method_DirectPay: integer (nullable = true)
 |-- num_il_tl: double (nullable = true)
 |-- num_accts_ever_120_pd: double (nullable = true)
 |-- mo_sin_old_rev_tl_op: double (nullable = true)
 |-- percent_bc_gt_75: double (nullable = true)
 |-- revol_util: double (nullable = true)
 |-- num_actv_rev_tl: double (nullable = true)
 |-- tot_coll_amt: double (nullable = true)
 |-- mort_acc: double (nullable = true)
 |-- delinq_2yrs: double (nullable = true)
 |-- num_tl_op_past_12m: double (nullable = true)
 |-- loan_status: integer (nullable = true)

   To prepare the data for modeling, the features and label (loan_status) need to be defined. Let's start with the training set. The VectorAssembler is set up with the features as inputCols and the outputCol='unscaledFeatures'. Then the train feature columns can be transformed into a vector column.

In [ ]:
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler

features = trainDF.columns[0: len(trainDF.columns) - 1]
trainDF = trainDF.select(col('loan_status').alias('label'), *features)

vecAssembler = VectorAssembler(inputCols=features,
                               outputCol='unscaledFeatures',
                               handleInvalid='skip')

trainDF = vecAssembler.transform(trainDF)

   The same steps that were applied to the training set should also be applied to the test set.

In [ ]:
features = testDF.columns[0: len(testDF.columns) - 1]
testDF = testDF.select(col('loan_status').alias('label'), *features)

testDF = vecAssembler.transform(testDF)

   Some models perform better when the data is scaled. The MinMaxScaler rescales the data to a common range from the minimum to maximum value linearly using the summary statistics from the variables. The StandardScaler standardizes the variables by eliminating the mean value and scales by dividing each value by the standard deviation. Given this is a classification problem, let's define a few metrics, areaUnderROC and accuracy, to evaluate the performance from training different model types.

In [ ]:
from pyspark.ml.feature import MinMaxScaler, StandardScaler
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

mmScaler = MinMaxScaler(inputCol='unscaledFeatures',
                        outputCol='scaledFeatures',
                        min=0, max=1)

stdScaler = StandardScaler(inputCol='unscaledFeatures',
                           outputCol='scaledFeatures',
                           withStd=True,
                           withMean=False)

evaluator_auroc = BinaryClassificationEvaluator(labelCol='label',
                                                metricName='areaUnderROC')

evaluator_acc = MulticlassClassificationEvaluator(labelCol='label',
                                                  metricName='accuracy')

   The models examined using Spark's Machine Learning Library (MLlib) are:

  • Logistic Regression (LR)
  • Linear SVM Classifier (LinearSVC)
  • Decision Trees (DT)
  • Random Forest (RF)
  • Gradient Boosted Trees (GBT)

   To set the baseline models, a Pipeline was utilized with the input of the unscaledFeatures from the VectorAssembler, then potentially the scaledFeatures, if a scaler was being used, to the featuresCol for the model.

Logistic Regression

   First the dependencies are imported from pyspark.ml, time and mlflow. The pipeline was set up by first defining the model to use, which utilizes the default parameters. The steps in the pipline consist of scaling the data with the MinMaxScaler, fitting the timed model and saving it for later use, if needed.

In [ ]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
import time
try:
    import mlflow.pyspark.ml
    mlflow.pyspark.ml.autolog()
except:
    print(f'Your version of MLflow ({mlflow.__version__}) does not support pyspark.ml for autologging. To use autologging, upgrade your MLflow client version or use Databricks Runtime for ML 8.3 or above.')

lr = LogisticRegression(family='binomial',
                        labelCol='label',
                        featuresCol='scaledFeatures',
                        regParam=0.0,
                        elasticNetParam=0.0,
                        maxIter=100)

search_time_start = time.time()
pipeline_lr = Pipeline(stages=[mmScaler, lr])
pipelineModel_lr = pipeline_lr.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)

Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_lr_us'
pipelineModel_lr.save(Path)
2022/10/10 01:25:45 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '4f23ebbab68a4183b8e16c8ad7a6cd2f', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current pyspark.ml workflow
2022/10/10 01:27:42 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '50959865b44949f3bae0aebc0bfea305', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current pyspark.ml workflow
                                                                                
Time to fit baseline model: 233.42002511024475
                                                                                

   Then predictions can be made using the test set. The default parameters of regParam=0.0 and elasticNetParam=0.0 provide an AUROC = 0.987 and an Accuracy = 98.583%, so optimizing this model might prove to be difficult. Let's now examine some other model types, and assess if hyperparameter tuning could increase the model performance better than the LogisticRegression model that was fit.

In [ ]:
prediction_lr = pipelineModel_lr.transform(testDF)

lr_auroc = evaluator_auroc.evaluate(prediction_lr)
print('Baseline: LogisticRegression')
print('Area under ROC curve: %g' % (lr_auroc))
print('Test Error: %g' % (1.0 - lr_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_lr))
Baseline: Logistic Regression
Area under ROC curve: 0.98701
Test Error: 0.0129899
Accuracy: 0.9858303292922332

LinearSVC

In [ ]:
from pyspark.ml.classification import LinearSVC

lsvc = LinearSVC(labelCol='label',
                 featuresCol='scaledFeatures',
                 regParam=0.0,
                 tol=1e-5,
                 maxIter=100)

search_time_start = time.time()
pipeline_lsvc = Pipeline(stages=[stdScaler, lsvc])
pipelineModel_lsvc = pipeline_lsvc.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)

Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_lsvc_us'
pipelineModel_lsvc.save(Path)
2022/10/10 01:32:16 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '6244b6c7087d40c9ba3bcecb84c96be2', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current pyspark.ml workflow
                                                                                
Time to fit baseline model: 154.46316194534302
                                                                                
In [ ]:
prediction_lsvc = pipelineModel_lsvc.transform(testDF)

lsvc_auroc = evaluator_auroc.evaluate(prediction_lsvc)
print('Baseline: LinearSVC')
print('Area under ROC curve: %g' % (lsvc_auroc))
print('Test Error: %g' % (1.0 - lsvc_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_lsvc))
                                                                                
Baseline: LinearSVC
Area under ROC curve: 0.978874
Test Error: 0.0211257
                                                                                
Accuracy: 0.9799987513671373

Decision Tree

In [ ]:
from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier(labelCol='label',
                            featuresCol='unscaledFeatures',
                            maxDepth=5,
                            maxBins=16,
                            impurity='gini',
                            seed=seed_value)

search_time_start = time.time()
pipeline_dt = Pipeline(stages=[dt])
pipelineModel_dt = pipeline_dt.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)

Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_dt_us'
pipelineModel_dt.save(Path)
2022/10/10 01:35:09 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '3d9d3090f60f46149b270867004547a7', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current pyspark.ml workflow
                                                                                
Time to fit baseline model: 59.340999603271484
In [ ]:
prediction_dt = pipelineModel_dt.transform(testDF)

dt_auroc = evaluator_auroc.evaluate(prediction_dt)
print('Baseline: Decision Tree')
print('Area under ROC curve: %g' % (dt_auroc))
print('Test Error: %g' % (1.0 - dt_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_dt))
                                                                                
Baseline: Decision Tree
Area under ROC curve: 0.913289
Test Error: 0.086711
                                                                                
Accuracy: 0.910900796119064

Random Forest

In [ ]:
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(labelCol='label',
                            featuresCol='unscaledFeatures',
                            impurity='gini',
                            maxDepth=5,
                            maxBins=32,
                            numTrees=10,
                            seed=seed_value)

search_time_start = time.time()
pipeline_rf = Pipeline(stages=[rf])
pipelineModel_rf = pipeline_rf.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)

Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_rf_us'
pipelineModel_rf.save(Path)
2022/10/10 01:36:18 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '56518e71367042c88e4508ccf0cdf517', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current pyspark.ml workflow
                                                                                
Time to fit baseline model: 81.84045767784119
                                                                                
In [ ]:
prediction_rf = pipelineModel_rf.transform(testDF)

rf_auroc = evaluator_auroc.evaluate(prediction_rf)
print('Baseline: Random Forest')
print('Area under ROC curve: %g' % (rf_auroc))
print('Test Error: %g' % (1.0 - rf_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_rf))
                                                                                
Baseline: Random Forest
Area under ROC curve: 0.960471
Test Error: 0.0395285
                                                                                
Accuracy: 0.9159831943265823

Gradient Boosted Trees

In [ ]:
from pyspark.ml.classification import GBTClassifier

gbt = GBTClassifier(labelCol='label',
                    featuresCol='unscaledFeatures',
                    maxDepth=5,
                    maxBins=32,
                    maxIter=10,
                    seed=seed_value)

search_time_start = time.time()
pipeline_gbt = Pipeline(stages=[gbt])
pipelineModel_gbt = pipeline_gbt.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)

Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_gbt_us'
pipelineModel_gbt.save(Path)
2022/10/10 01:38:25 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '51428177e006441b8bc7be07be8f4d09', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current pyspark.ml workflow
                                                                                
Time to fit baseline model: 124.68918800354004
                                                                                
In [ ]:
prediction_gbt = pipelineModel_gbt.transform(testDF)

gbt_auroc = evaluator_auroc.evaluate(prediction_gbt)
print('Baseline: Gradient Boosted Trees')
print('Area under ROC curve: %g' % (gbt_auroc))
print('Test Error: %g' % (1.0 - gbt_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_gbt))
                                                                                
Baseline: Gradient Boosted Trees
Area under ROC curve: 0.972706
Test Error: 0.0272944
                                                                                
Accuracy: 0.9430554046148546

   Of the models fit, only LR and LinearSVC were scaled prior to modeling. The scaling for LinearSVC used the StandardScaler since SVMs tend to perform better with scaled data and the computation time is faster when compared with unscaled data. Given all models were using the same runtime environment, the DT model completed the fastest at 59.3 seconds while the LR took the most amount to finish at 233.4 seconds. Given the lower accuracies for GBT, RF and DT, these will require tuning the hyperparameters more than LR and LinearSVC. However, accuracy is not a great metric, especially if the original data is imbalanced. Therefore, let's add some more metrics and then compare all the models at a greater depth or granularity.

Model Metrics

   We can utilize a for loop which iterates through the models tested and asesses various metrics on a more granular level like the amount of true positives and false positives as well as a more macroscopic level like recall, precision, and F1 score.

In [ ]:
print('Upsampling Baseline Models:')
for model in ['prediction_lr', 'prediction_lsvc', 'prediction_dt',
              'prediction_rf', 'prediction_gbt']:
    df = globals()[model]

    tp = df[(df.label == 1) & (df.prediction == 1)].count()
    tn = df[(df.label == 0) & (df.prediction == 0)].count()
    fp = df[(df.label == 0) & (df.prediction == 1)].count()
    fn = df[(df.label == 1) & (df.prediction == 0)].count()
    a = ((tp + tn)/df.count())

    if(tp + fn == 0.0):
        r = 0.0
        p = float(tp) / (tp + fp)
    elif(tp + fp == 0.0):
        r = float(tp) / (tp + fn)
        p = 0.0
    else:
        r = float(tp) / (tp + fn)
        p = float(tp) / (tp + fp)

    if(p + r == 0):
        f1 = 0
    else:
        f1 = 2 * ((p * r)/(p + r))

    print('\nModel:', model)
    print('True Positives:', tp)
    print('True Negatives:', tn)
    print('False Positives:', fp)
    print('False Negatives:', fn)
    print('Total:', df.count())
    print('Accuracy:', a)
    print('Recall:', r)
    print('Precision: ', p)
    print('F1 score:', f1)
    print('\n')
Upsampling Baseline Models:
                                                                                

Model: prediction_lr
True Positives: 50199
True Negatives: 376146
False Positives: 1702
False Negatives: 4426
Total: 432473
Accuracy: 0.9858303292922332
Recall: 0.918974828375286
Precision:  0.9672067975568871
F1 score: 0.9424741377691831


                                                                                

Model: prediction_lsvc
True Positives: 48472
True Negatives: 375351
False Positives: 2497
False Negatives: 6153
Total: 432473
Accuracy: 0.9799987513671373
Recall: 0.8873592677345538
Precision:  0.9510094371088309
F1 score: 0.9180824668068263


                                                                                

Model: prediction_dt
True Positives: 49091
True Negatives: 344849
False Positives: 32999
False Negatives: 5534
Total: 432473
Accuracy: 0.910900796119064
Recall: 0.8986910755148741
Precision:  0.5980143744670484
F1 score: 0.7181508978531983


                                                                                

Model: prediction_rf
True Positives: 47783
True Negatives: 348355
False Positives: 29493
False Negatives: 6842
Total: 432473
Accuracy: 0.9159831943265823
Recall: 0.874745995423341
Precision:  0.6183420466897872
F1 score: 0.7245282446683496


                                                                                

Model: prediction_gbt
True Positives: 49134
True Negatives: 358712
False Positives: 19136
False Negatives: 5491
Total: 432473
Accuracy: 0.9430554046148546
Recall: 0.8994782608695652
Precision:  0.7197011864655046
F1 score: 0.7996094226778957


   LR performed the best on all of the macro metrics. Perhaps, using a linear approach might be worthwhile in this scenario since the metrics from the baseline model are all over 90%. However, the overall recall score for each of the models is relatively low with the highest being 91.0%. Also, DT, RF and GBT have very lower precision scores for the Upsampled baseline models.

SMOTE

   Upon the initial inspection of the schema of the train and test sets when the data was loaded, there were some differences when inferSchema=True was used. The differences can be noted below. This was mostly due to the creation of uint8 dummy variables during the variable selection process and the creation of FloatType with inferSchema=True. To keep the train/test sets consistent with the Upsampled sets, this was adjusted.

   Then, the train and the test sets were procesed using the same approaches described for the Upsampled data.

In [ ]:
from pyspark.sql.types import IntegerType, FloatType

trainDF = spark.read.csv('/content/drive/MyDrive/LoanStatus/Data/trainDF_SMOTE.csv',
                         header=True, inferSchema=True).cache()
trainDF = trainDF \
  .withColumn('loan_amnt', trainDF['loan_amnt'].cast(IntegerType())) \
  .withColumn('revol_bal', trainDF['revol_bal'].cast(IntegerType())) \
  .withColumn('term_ 60 months', trainDF['term_ 60 months'].cast(IntegerType())) \
  .withColumn('grade_B', trainDF['grade_B'].cast(IntegerType())) \
  .withColumn('grade_C', trainDF['grade_C'].cast(IntegerType())) \
  .withColumn('grade_D', trainDF['grade_D'].cast(IntegerType())) \
  .withColumn('home_ownership_MORTGAGE', trainDF['home_ownership_MORTGAGE'].cast(IntegerType())) \
  .withColumn('home_ownership_OWN', trainDF['home_ownership_OWN'].cast(IntegerType())) \
  .withColumn('home_ownership_RENT', trainDF['home_ownership_RENT'].cast(IntegerType())) \
  .withColumn('verification_status_Source Verified', trainDF['verification_status_Source Verified'].cast(IntegerType())) \
  .withColumn('verification_status_Verified', trainDF['verification_status_Verified'].cast(IntegerType())) \
  .withColumn('purpose_credit_card', trainDF['purpose_credit_card'].cast(IntegerType())) \
  .withColumn('initial_list_status_w', trainDF['initial_list_status_w'].cast(IntegerType())) \
  .withColumn('application_type_Joint App', trainDF['application_type_Joint App'].cast(IntegerType())) \
  .withColumn('disbursement_method_DirectPay', trainDF['disbursement_method_DirectPay'].cast(IntegerType()))
print('\nTrain Schema')
trainDF.printSchema()

testDF = spark.read.csv('/content/drive/MyDrive/LoanStatus/Data/testDF_SMOTE.csv',
                        header=True, inferSchema=True).cache()
print('\nTest Schema')
testDF.printSchema()

Train Schema
root
 |-- loan_amnt: integer (nullable = true)
 |-- int_rate: double (nullable = true)
 |-- installment: double (nullable = true)
 |-- annual_inc: double (nullable = true)
 |-- inq_last_6mths: double (nullable = true)
 |-- pub_rec: double (nullable = true)
 |-- revol_bal: integer (nullable = true)
 |-- out_prncp: double (nullable = true)
 |-- total_pymnt: double (nullable = true)
 |-- total_rec_int: double (nullable = true)
 |-- total_rec_late_fee: double (nullable = true)
 |-- recoveries: double (nullable = true)
 |-- last_pymnt_amnt: double (nullable = true)
 |-- collections_12_mths_ex_med: double (nullable = true)
 |-- acc_open_past_24mths: double (nullable = true)
 |-- bc_open_to_buy: double (nullable = true)
 |-- chargeoff_within_12_mths: double (nullable = true)
 |-- delinq_amnt: double (nullable = true)
 |-- mths_since_recent_bc: double (nullable = true)
 |-- num_bc_sats: double (nullable = true)
 |-- num_bc_tl: double (nullable = true)
 |-- num_sats: double (nullable = true)
 |-- num_tl_30dpd: double (nullable = true)
 |-- tax_liens: double (nullable = true)
 |-- tot_hi_cred_lim: double (nullable = true)
 |-- total_bal_ex_mort: double (nullable = true)
 |-- total_bc_limit: double (nullable = true)
 |-- term_ 60 months: integer (nullable = true)
 |-- grade_B: integer (nullable = true)
 |-- grade_C: integer (nullable = true)
 |-- grade_D: integer (nullable = true)
 |-- home_ownership_MORTGAGE: integer (nullable = true)
 |-- home_ownership_OWN: integer (nullable = true)
 |-- home_ownership_RENT: integer (nullable = true)
 |-- verification_status_Source Verified: integer (nullable = true)
 |-- verification_status_Verified: integer (nullable = true)
 |-- purpose_credit_card: integer (nullable = true)
 |-- initial_list_status_w: integer (nullable = true)
 |-- application_type_Joint App: integer (nullable = true)
 |-- disbursement_method_DirectPay: integer (nullable = true)
 |-- num_il_tl: double (nullable = true)
 |-- num_accts_ever_120_pd: double (nullable = true)
 |-- mo_sin_old_rev_tl_op: double (nullable = true)
 |-- percent_bc_gt_75: double (nullable = true)
 |-- revol_util: double (nullable = true)
 |-- num_actv_rev_tl: double (nullable = true)
 |-- tot_coll_amt: double (nullable = true)
 |-- mort_acc: double (nullable = true)
 |-- delinq_2yrs: double (nullable = true)
 |-- num_tl_op_past_12m: double (nullable = true)
 |-- loan_status: integer (nullable = true)


Test Schema
root
 |-- loan_amnt: integer (nullable = true)
 |-- int_rate: double (nullable = true)
 |-- installment: double (nullable = true)
 |-- annual_inc: double (nullable = true)
 |-- inq_last_6mths: double (nullable = true)
 |-- pub_rec: double (nullable = true)
 |-- revol_bal: integer (nullable = true)
 |-- out_prncp: double (nullable = true)
 |-- total_pymnt: double (nullable = true)
 |-- total_rec_int: double (nullable = true)
 |-- total_rec_late_fee: double (nullable = true)
 |-- recoveries: double (nullable = true)
 |-- last_pymnt_amnt: double (nullable = true)
 |-- collections_12_mths_ex_med: double (nullable = true)
 |-- acc_open_past_24mths: double (nullable = true)
 |-- bc_open_to_buy: double (nullable = true)
 |-- chargeoff_within_12_mths: double (nullable = true)
 |-- delinq_amnt: double (nullable = true)
 |-- mths_since_recent_bc: double (nullable = true)
 |-- num_bc_sats: double (nullable = true)
 |-- num_bc_tl: double (nullable = true)
 |-- num_sats: double (nullable = true)
 |-- num_tl_30dpd: double (nullable = true)
 |-- tax_liens: double (nullable = true)
 |-- tot_hi_cred_lim: double (nullable = true)
 |-- total_bal_ex_mort: double (nullable = true)
 |-- total_bc_limit: double (nullable = true)
 |-- term_ 60 months: integer (nullable = true)
 |-- grade_B: integer (nullable = true)
 |-- grade_C: integer (nullable = true)
 |-- grade_D: integer (nullable = true)
 |-- home_ownership_MORTGAGE: integer (nullable = true)
 |-- home_ownership_OWN: integer (nullable = true)
 |-- home_ownership_RENT: integer (nullable = true)
 |-- verification_status_Source Verified: integer (nullable = true)
 |-- verification_status_Verified: integer (nullable = true)
 |-- purpose_credit_card: integer (nullable = true)
 |-- initial_list_status_w: integer (nullable = true)
 |-- application_type_Joint App: integer (nullable = true)
 |-- disbursement_method_DirectPay: integer (nullable = true)
 |-- num_il_tl: double (nullable = true)
 |-- num_accts_ever_120_pd: double (nullable = true)
 |-- mo_sin_old_rev_tl_op: double (nullable = true)
 |-- percent_bc_gt_75: double (nullable = true)
 |-- revol_util: double (nullable = true)
 |-- num_actv_rev_tl: double (nullable = true)
 |-- tot_coll_amt: double (nullable = true)
 |-- mort_acc: double (nullable = true)
 |-- delinq_2yrs: double (nullable = true)
 |-- num_tl_op_past_12m: double (nullable = true)
 |-- loan_status: integer (nullable = true)

Logistic Regression

   Let's now run the baseline models for the SMOTE set and compare the different model results within the SMOTE data as well as the Upsampled sets.

In [ ]:
lr = LogisticRegression(family='binomial',
                        labelCol='label',
                        featuresCol='scaledFeatures',
                        regParam=0.0,
                        elasticNetParam=0.0,
                        maxIter=100)

search_time_start = time.time()
pipeline_lr = Pipeline(stages=[mmScaler, lr])
pipelineModel_lr = pipeline_lr.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)

Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_lr_smote'
pipelineModel_lr.save(Path)
2022/10/10 01:41:55 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'ee55f19131ad4adcae453703c5f3daec', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current pyspark.ml workflow
                                                                                
Time to fit baseline model: 121.01508712768555
                                                                                
In [ ]:
prediction_lr = pipelineModel_lr.transform(testDF)

lr_auroc = evaluator_auroc.evaluate(prediction_lr)
print('Baseline: Logistic Regression')
print('Area under ROC curve: %g' % (lr_auroc))
print('Test Error: %g' % (1.0 - lr_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_lr))
Baseline: Logistic Regression
Area under ROC curve: 0.979804
Test Error: 0.0201956
Accuracy: 0.9857540239506282

LinearSVC

In [ ]:
lsvc = LinearSVC(labelCol='label',
                 featuresCol='scaledFeatures',
                 regParam=0.0,
                 tol=1e-5,
                 maxIter=100)

search_time_start = time.time()
pipeline_lsvc = Pipeline(stages=[stdScaler, lsvc])
pipelineModel_lsvc = pipeline_lsvc.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)

Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_lsvc_smote'
pipelineModel_lsvc.save(Path)
2022/10/10 01:45:14 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '0cbc28d50afe499087fe00a0d33b4d94', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current pyspark.ml workflow
                                                                                
Time to fit baseline model: 134.26036667823792
                                                                                
In [ ]:
prediction_lsvc = pipelineModel_lsvc.transform(testDF)

lsvc_auroc = evaluator_auroc.evaluate(prediction_lsvc)
print('Baseline: LinearSVC')
print('Area under ROC curve: %g' % (lsvc_auroc))
print('Test Error: %g' % (1.0 - lsvc_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_lsvc))
                                                                                
Baseline: LinearSVC
Area under ROC curve: 0.977104
Test Error: 0.0228963
                                                                                
Accuracy: 0.9829214771789222

Decision Tree

In [ ]:
dt = DecisionTreeClassifier(labelCol='label',
                            featuresCol='unscaledFeatures',
                            maxDepth=5,
                            maxBins=16,
                            impurity='gini',
                            seed=seed_value)

search_time_start = time.time()
pipeline_dt = Pipeline(stages=[dt])
pipelineModel_dt = pipeline_dt.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)

Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_dt_smote'
pipelineModel_dt.save(Path)
2022/10/10 01:47:47 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '1a66c94a0f2e4043af3203374cf7b774', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current pyspark.ml workflow
                                                                                
Time to fit baseline model: 61.8033344745636
                                                                                
In [ ]:
prediction_dt = pipelineModel_dt.transform(testDF)

dt_auroc = evaluator_auroc.evaluate(prediction_dt)
print('Baseline: Decision Tree')
print('Area under ROC curve: %g' % (dt_auroc))
print('Test Error: %g' % (1.0 - dt_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_dt))
                                                                                
Baseline: Decision Tree
Area under ROC curve: 0.92133
Test Error: 0.0786704
                                                                                
Accuracy: 0.9362596046458391

Random Forest

In [ ]:
rf = RandomForestClassifier(labelCol='label',
                            featuresCol='unscaledFeatures',
                            maxDepth=5,
                            maxBins=32,
                            numTrees=10,
                            impurity='gini',
                            seed=seed_value)

search_time_start = time.time()
pipeline_rf = Pipeline(stages=[rf])
pipelineModel_rf = pipeline_rf.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)

Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_rf_smote'
pipelineModel_rf.save(Path)
2022/10/10 01:48:58 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '78f94c87b037438195ad4e4064490fca', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current pyspark.ml workflow
                                                                                
Time to fit baseline model: 82.65176653862
                                                                                
In [ ]:
prediction_rf = pipelineModel_rf.transform(testDF)

rf_auroc = evaluator_auroc.evaluate(prediction_rf)
print('Baseline: Random Forest')
print('Area under ROC curve: %g' % (rf_auroc))
print('Test Error: %g' % (1.0 - rf_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_rf))
                                                                                
Baseline: Random Forest
Area under ROC curve: 0.941545
Test Error: 0.0584552
                                                                                
Accuracy: 0.9482372309947673

Gradient Boosted Trees

In [ ]:
gbt = GBTClassifier(labelCol='label',
                    featuresCol='unscaledFeatures',
                    maxDepth=5,
                    maxBins=32,
                    maxIter=10,
                    seed=seed_value)

search_time_start = time.time()
pipeline_gbt = Pipeline(stages=[gbt])
pipelineModel_gbt = pipeline_gbt.fit(trainDF)
print('Time to fit baseline model:', time.time() - search_time_start)

Path = '/notebooks/LoanStatus/Python/ML/SparkML/Models/Baseline/baselineModel_gbt_smote'
pipelineModel_gbt.save(Path)
2022/10/10 01:51:06 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '3cd40588f9e145b394b199b2611df978', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current pyspark.ml workflow
                                                                                
Time to fit baseline model: 126.09147500991821
                                                                                
In [ ]:
prediction_gbt = pipelineModel_gbt.transform(testDF)

gbt_auroc = evaluator_auroc.evaluate(prediction_gbt)
print('Baseline: Gradient Boosted Trees')
print('Area under ROC curve: %g' % (gbt_auroc))
print('Test Error: %g' % (1.0 - gbt_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_gbt))
                                                                                
Baseline: Gradient Boosted Trees
Area under ROC curve: 0.970601
Test Error: 0.0293991
                                                                                
Accuracy: 0.9605848226363264

Model Metrics

In [ ]:
print('SMOTE Baseline Models:')
for model in ['prediction_lr', 'prediction_lsvc', 'prediction_dt',
              'prediction_rf', 'prediction_gbt']:
    df = globals()[model]

    tp = df[(df.label == 1) & (df.prediction == 1)].count()
    tn = df[(df.label == 0) & (df.prediction == 0)].count()
    fp = df[(df.label == 0) & (df.prediction == 1)].count()
    fn = df[(df.label == 1) & (df.prediction == 0)].count()
    a = ((tp + tn)/df.count())

    if(tp + fn == 0.0):
        r = 0.0
        p = float(tp) / (tp + fp)
    elif(tp + fp == 0.0):
        r = float(tp) / (tp + fn)
        p = 0.0
    else:
        r = float(tp) / (tp + fn)
        p = float(tp) / (tp + fp)

    if(p + r == 0):
        f1 = 0
    else:
        f1 = 2 * ((p * r)/(p + r))

    print('\nModel:', model)
    print('True Positives:', tp)
    print('True Negatives:', tn)
    print('False Positives:', fp)
    print('False Negatives:', fn)
    print('Total:', df.count())
    print('Accuracy:', a)
    print('Recall:', r)
    print('Precision: ', p)
    print('F1 score:', f1)
    print('\n')
SMOTE Baseline Models:
                                                                                

Model: prediction_lr
True Positives: 49671
True Negatives: 376641
False Positives: 1207
False Negatives: 4954
Total: 432473
Accuracy: 0.9857540239506282
Recall: 0.9093089244851259
Precision:  0.9762765831990251
F1 score: 0.9416035562969773


                                                                                

Model: prediction_lsvc
True Positives: 48070
True Negatives: 377017
False Positives: 831
False Negatives: 6555
Total: 432473
Accuracy: 0.9829214771789222
Recall: 0.88
Precision:  0.9830064824850208
F1 score: 0.9286556034232947


                                                                                

Model: prediction_dt
True Positives: 43557
True Negatives: 361350
False Positives: 16498
False Negatives: 11068
Total: 432473
Accuracy: 0.9362596046458391
Recall: 0.7973821510297483
Precision:  0.7252851552743319
F1 score: 0.7596267875828393


                                                                                

Model: prediction_rf
True Positives: 34722
True Negatives: 375365
False Positives: 2483
False Negatives: 19903
Total: 432473
Accuracy: 0.9482372309947673
Recall: 0.6356430205949657
Precision:  0.9332616583792501
F1 score: 0.7562234563868017


                                                                                

Model: prediction_gbt
True Positives: 43402
True Negatives: 372025
False Positives: 5823
False Negatives: 11223
Total: 432473
Accuracy: 0.9605848226363264
Recall: 0.7945446224256293
Precision:  0.8817064499746065
F1 score: 0.8358594126143477


   LR 3/4 best metrics besides LinearSVC contained a higher precision score. The precision of LinearSVC was higher for the SMOTE data compared to US results. DT was the fastest to complete at 61.8 while LinearSVC took the longest at 134.3. Overall, there was higher recall and F1 scores for the US models while higher accuracy and precision for the SMOTE models.

GridSearchCV

   The GridSearchCV models for both the Upsampled and SMOTE sets are located here.

Upsampling: Gradient Boosted Trees

   From the models evaluated using the defined GridSearchCV parameters, GBT performed the best. So, let's evaluate the performance by first setting up the pipeline components with the different stages, although only the model stage is needed here since scaling the data isn't required, then define the space in the parameter grid, followed by the components within CrossValidator with the AUROC as the metric to evalaute, using numFolds=5 and parallelism=5 to assess which combination of parameters results in the highest AUROC.

In [ ]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

model = GBTClassifier(labelCol='label',
                      featuresCol='unscaledFeatures',
                      seed=seed_value)

pipeline_gbt_hpo = Pipeline(stages=[model])

paramGrid = (ParamGridBuilder()
            .addGrid(model.maxDepth, [5, 10, 15])
            .addGrid(model.maxBins, [32, 64])
            .addGrid(model.maxIter, [10])
            .build())

cv = CrossValidator(estimator=pipeline_gbt_hpo,
                    estimatorParamMaps=paramGrid,
                    evaluator=evaluator_auroc,
                    numFolds=5, parallelism=5)

search_time_start = time.time()
pipelineModel_gbt_hpo = cv.fit(trainDF)
print('Time to perform GridSearchCV:', time.time() - search_time_start)

path = '/content/drive/MyDrive/LoanStatus/Python/Models/ML/SparkML/Models/GridSearchCV/'

gbt_grid = pipelineModel_gbt_hpo.bestModel
gbt_grid.save(path + '/pipelineModel_gbt_us_hpo_grid')
2022/10/10 21:36:49 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '7c04b608003541febf14cb801d6090ee', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current pyspark.ml workflow
2022/10/10 23:59:39 WARNING mlflow.pyspark.ml: Model inputs contain unsupported Spark data types: [StructField(unscaledFeatures,VectorUDT,true)]. Model signature is not logged.
2022/10/11 00:00:33 WARNING mlflow.utils.environment: Encountered an unexpected error while inferring pip requirements (model URI: /tmp/tmpzh1dpkge/model, flavor: spark), fall back to return ['pyspark==3.3.0']. Set logging level to DEBUG to see the full traceback.
2022/10/11 00:01:15 WARNING mlflow.utils.environment: Encountered an unexpected error while inferring pip requirements (model URI: /tmp/tmpvl4elzfl/model, flavor: spark), fall back to return ['pyspark==3.3.0']. Set logging level to DEBUG to see the full traceback.
Time to perform GridSearchCV: 8675.648339033127

   From the specified pipeline evaluated, the various models with the best parameters and then view all results (AUROC) by each parameters.

In [ ]:
print(pipelineModel_gbt_hpo.getEstimatorParamMaps()[np.argmax(pipelineModel_gbt_hpo.avgMetrics)])
{Param(parent='GBTClassifier_41665e5b1949', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 15, Param(parent='GBTClassifier_41665e5b1949', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 64, Param(parent='GBTClassifier_41665e5b1949', name='maxIter', doc='max number of iterations (>= 0).'): 10}
In [ ]:
list(zip(pipelineModel_gbt_hpo.avgMetrics,
         pipelineModel_gbt_hpo.getEstimatorParamMaps()))
Out[ ]:
[(0.9719265974945634,
  {Param(parent='GBTClassifier_41665e5b1949', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 5,
   Param(parent='GBTClassifier_41665e5b1949', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 32,
   Param(parent='GBTClassifier_41665e5b1949', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9719671645400161,
  {Param(parent='GBTClassifier_41665e5b1949', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 5,
   Param(parent='GBTClassifier_41665e5b1949', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 64,
   Param(parent='GBTClassifier_41665e5b1949', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9868639558736061,
  {Param(parent='GBTClassifier_41665e5b1949', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 10,
   Param(parent='GBTClassifier_41665e5b1949', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 32,
   Param(parent='GBTClassifier_41665e5b1949', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9873657713205404,
  {Param(parent='GBTClassifier_41665e5b1949', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 10,
   Param(parent='GBTClassifier_41665e5b1949', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 64,
   Param(parent='GBTClassifier_41665e5b1949', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9931349499974691,
  {Param(parent='GBTClassifier_41665e5b1949', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 15,
   Param(parent='GBTClassifier_41665e5b1949', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 32,
   Param(parent='GBTClassifier_41665e5b1949', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9937295507264521,
  {Param(parent='GBTClassifier_41665e5b1949', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 15,
   Param(parent='GBTClassifier_41665e5b1949', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 64,
   Param(parent='GBTClassifier_41665e5b1949', name='maxIter', doc='max number of iterations (>= 0).'): 10})]

   Given these parameters resulted in the best objective, highest AUROC, we can use the trained model to predict using the test set evaluate the AUROC, test error and accuracy on data that was not used for training.

In [ ]:
prediction_gbt = pipelineModel_gbt_hpo.transform(testDF)

gbt_auroc = evaluator_auroc.evaluate(prediction_gbt)
print('GridSearchCV: Gradient Boosted Trees')
print('Area under ROC curve: %g' % (gbt_auroc))
print('Test Error: %g ' % (1.0 - gbt_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_gbt))
GridSearchCV: Gradient Boosted Trees
Area under ROC curve: 0.984932
Test Error: 0.0150683 
Accuracy: 0.9823503432584231

   Then we can extract the information from the pipeline.

In [ ]:
pipelineModel_gbt_hpo.bestModel.stages[-1].extractParamMap()
Out[ ]:
{Param(parent='GBTClassifier_41665e5b1949', name='cacheNodeIds', doc='If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval.'): False,
 Param(parent='GBTClassifier_41665e5b1949', name='checkpointInterval', doc='set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext.'): 10,
 Param(parent='GBTClassifier_41665e5b1949', name='featureSubsetStrategy', doc="The number of features to consider for splits at each tree node. Supported options: 'auto' (choose automatically for task: If numTrees == 1, set to 'all'. If numTrees > 1 (forest), set to 'sqrt' for classification and to 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the features), 'sqrt' (use sqrt(number of features)), 'log2' (use log2(number of features)), 'n' (when n is in the range (0, 1.0], use n * number of features. When n is in the range (1, number of features), use n features). default = 'auto'"): 'all',
 Param(parent='GBTClassifier_41665e5b1949', name='featuresCol', doc='features column name.'): 'unscaledFeatures',
 Param(parent='GBTClassifier_41665e5b1949', name='impurity', doc='Criterion used for information gain calculation (case-insensitive). Supported options: variance'): 'variance',
 Param(parent='GBTClassifier_41665e5b1949', name='labelCol', doc='label column name.'): 'label',
 Param(parent='GBTClassifier_41665e5b1949', name='leafCol', doc='Leaf indices column name. Predicted leaf index of each instance in each tree by preorder.'): '',
 Param(parent='GBTClassifier_41665e5b1949', name='lossType', doc='Loss function which GBT tries to minimize (case-insensitive). Supported options: logistic'): 'logistic',
 Param(parent='GBTClassifier_41665e5b1949', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 64,
 Param(parent='GBTClassifier_41665e5b1949', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 15,
 Param(parent='GBTClassifier_41665e5b1949', name='maxIter', doc='max number of iterations (>= 0).'): 10,
 Param(parent='GBTClassifier_41665e5b1949', name='maxMemoryInMB', doc='Maximum memory in MB allocated to histogram aggregation. If too small, then 1 node will be split per iteration, and its aggregates may exceed this size.'): 256,
 Param(parent='GBTClassifier_41665e5b1949', name='minInfoGain', doc='Minimum information gain for a split to be considered at a tree node.'): 0.0,
 Param(parent='GBTClassifier_41665e5b1949', name='minInstancesPerNode', doc='Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1.'): 1,
 Param(parent='GBTClassifier_41665e5b1949', name='minWeightFractionPerNode', doc='Minimum fraction of the weighted sample count that each child must have after split. If a split causes the fraction of the total weight in the left or right child to be less than minWeightFractionPerNode, the split will be discarded as invalid. Should be in interval [0.0, 0.5).'): 0.0,
 Param(parent='GBTClassifier_41665e5b1949', name='predictionCol', doc='prediction column name.'): 'prediction',
 Param(parent='GBTClassifier_41665e5b1949', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.'): 'probability',
 Param(parent='GBTClassifier_41665e5b1949', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name.'): 'rawPrediction',
 Param(parent='GBTClassifier_41665e5b1949', name='seed', doc='random seed.'): 42,
 Param(parent='GBTClassifier_41665e5b1949', name='stepSize', doc='Step size (a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of each estimator.'): 0.1,
 Param(parent='GBTClassifier_41665e5b1949', name='subsamplingRate', doc='Fraction of the training data used for learning each decision tree, in range (0, 1].'): 1.0,
 Param(parent='GBTClassifier_41665e5b1949', name='validationTol', doc='Threshold for stopping early when fit with validation is used. If the error rate on the validation input changes by less than the validationTol, then learning will stop early (before `maxIter`). This parameter is ignored when fit without validation is used.'): 0.01}
In [ ]:
best_mod = pipelineModel_gbt_hpo.bestModel
param_dict = best_mod.stages[-1].extractParamMap()
sane_dict = {}
for k, v in param_dict.items():
    sane_dict[k.name] = v
sane_dict
Out[ ]:
{'cacheNodeIds': False,
 'checkpointInterval': 10,
 'featureSubsetStrategy': 'all',
 'featuresCol': 'unscaledFeatures',
 'impurity': 'variance',
 'labelCol': 'label',
 'leafCol': '',
 'lossType': 'logistic',
 'maxBins': 64,
 'maxDepth': 15,
 'maxIter': 10,
 'maxMemoryInMB': 256,
 'minInfoGain': 0.0,
 'minInstancesPerNode': 1,
 'minWeightFractionPerNode': 0.0,
 'predictionCol': 'prediction',
 'probabilityCol': 'probability',
 'rawPredictionCol': 'rawPrediction',
 'seed': 42,
 'stepSize': 0.1,
 'subsamplingRate': 1.0,
 'validationTol': 0.01}

SMOTE: Gradient Boosted Trees

   In conjunction with the Upsampled set, GBT performed the best out of the models evaluated using the defined grid. We can use the sample methods to evaluate this model and compare it with the results from the Upsampled search. To complete this, we need to define stages of the pipeline, the parameters in the parameter space as well as the estimator, number of folds for cross validation for parallelism.

In [ ]:
model = GBTClassifier(labelCol='label',
                      featuresCol='unscaledFeatures',
                      seed=seed_value)

pipeline_gbt_hpo = Pipeline(stages=[model])

paramGrid = (ParamGridBuilder()
            .addGrid(model.maxDepth, [5, 10, 15])
            .addGrid(model.maxBins, [32, 64])
            .addGrid(model.maxIter, [10])
            .build())

cv = CrossValidator(estimator=pipeline_gbt_hpo,
                    estimatorParamMaps=paramGrid,
                    evaluator=evaluator_auroc,
                    numFolds=5, parallelism=5)

search_time_start = time.time()
pipelineModel_gbt_hpo = cv.fit(trainDF)
print('Time to perform GridSearchCV:', time.time() - search_time_start)

gbt_grid = pipelineModel_gbt_hpo.bestModel
gbt_grid.save('/content/drive/MyDrive/LoanStatus/Python/Models/ML/SparkML/Models/GridSearchCV/pipelineModel_gbt_smote_hpo_grid')
2022/10/11 16:02:42 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '632d76fa07cf4f7aba9967984272045f', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current pyspark.ml workflow
2022/10/11 18:39:32 WARNING mlflow.pyspark.ml: Model inputs contain unsupported Spark data types: [StructField(unscaledFeatures,VectorUDT,true)]. Model signature is not logged.
022/10/11 18:40:26 WARNING mlflow.utils.environment: Encountered an unexpected error while inferring pip requirements (model URI: /tmp/tmpsspix5l1/model, flavor: spark), fall back to return ['pyspark==3.3.0']. Set logging level to DEBUG to see the full traceback.
2022/10/11 18:41:08 WARNING mlflow.utils.environment: Encountered an unexpected error while inferring pip requirements (model URI: /tmp/tmpkhylcbdh/model, flavor: spark), fall back to return ['pyspark==3.3.0']. Set logging level to DEBUG to see the full traceback.
Time to perform GridSearchCV: 9517.265165090561

   From the specified pipeline evaluated, the various models with the best parameters and then view all results (AUROC) by each parameters.

In [ ]:
print(pipelineModel_gbt_hpo.getEstimatorParamMaps()[np.argmax(pipelineModel_gbt_hpo.avgMetrics)])
{Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 15, Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 64, Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxIter', doc='max number of iterations (>= 0).'): 10}
In [ ]:
list(zip(pipelineModel_gbt_hpo.avgMetrics,
         pipelineModel_gbt_hpo.getEstimatorParamMaps()))
Out[ ]:
[(0.991749422965293,
  {Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 5,
   Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 32,
   Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9915485855418229,
  {Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 5,
   Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 64,
   Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9972144428158407,
  {Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 10,
   Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 32,
   Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9972290885240692,
  {Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 10,
   Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 64,
   Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9976729746290087,
  {Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 15,
   Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 32,
   Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxIter', doc='max number of iterations (>= 0).'): 10}),
 (0.9978078088357267,
  {Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 15,
   Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 64,
   Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxIter', doc='max number of iterations (>= 0).'): 10})]

   Let's now predict with the pipeline, determine the AUROC, test error and accuracy and extract the information from the pipeline.

In [ ]:
prediction_gbt = pipelineModel_gbt_hpo.transform(testDF)

gbt_auroc = evaluator_auroc.evaluate(prediction_gbt)
print('GridSearchCV: Gradient Boosted Trees')
print('Area under ROC curve: %g' % (gbt_auroc))
print('Test Error: %g ' % (1.0 - gbt_auroc))
print('Accuracy:', evaluator_acc.evaluate(prediction_gbt))
GridSearchCV: Gradient Boosted Trees
Area under ROC curve: 0.986468
Test Error: 0.013532 
Accuracy: 0.9858603889722596
In [ ]:
pipelineModel_gbt_hpo.bestModel.stages[-1].extractParamMap()
Out[ ]:
{Param(parent='GBTClassifier_8c9c74b6ce2d', name='cacheNodeIds', doc='If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval.'): False,
 Param(parent='GBTClassifier_8c9c74b6ce2d', name='checkpointInterval', doc='set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext.'): 10,
 Param(parent='GBTClassifier_8c9c74b6ce2d', name='featureSubsetStrategy', doc="The number of features to consider for splits at each tree node. Supported options: 'auto' (choose automatically for task: If numTrees == 1, set to 'all'. If numTrees > 1 (forest), set to 'sqrt' for classification and to 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the features), 'sqrt' (use sqrt(number of features)), 'log2' (use log2(number of features)), 'n' (when n is in the range (0, 1.0], use n * number of features. When n is in the range (1, number of features), use n features). default = 'auto'"): 'all',
 Param(parent='GBTClassifier_8c9c74b6ce2d', name='featuresCol', doc='features column name.'): 'unscaledFeatures',
 Param(parent='GBTClassifier_8c9c74b6ce2d', name='impurity', doc='Criterion used for information gain calculation (case-insensitive). Supported options: variance'): 'variance',
 Param(parent='GBTClassifier_8c9c74b6ce2d', name='labelCol', doc='label column name.'): 'label',
 Param(parent='GBTClassifier_8c9c74b6ce2d', name='leafCol', doc='Leaf indices column name. Predicted leaf index of each instance in each tree by preorder.'): '',
 Param(parent='GBTClassifier_8c9c74b6ce2d', name='lossType', doc='Loss function which GBT tries to minimize (case-insensitive). Supported options: logistic'): 'logistic',
 Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 64,
 Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 15,
 Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxIter', doc='max number of iterations (>= 0).'): 10,
 Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxMemoryInMB', doc='Maximum memory in MB allocated to histogram aggregation. If too small, then 1 node will be split per iteration, and its aggregates may exceed this size.'): 256,
 Param(parent='GBTClassifier_8c9c74b6ce2d', name='minInfoGain', doc='Minimum information gain for a split to be considered at a tree node.'): 0.0,
 Param(parent='GBTClassifier_8c9c74b6ce2d', name='minInstancesPerNode', doc='Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1.'): 1,
 Param(parent='GBTClassifier_8c9c74b6ce2d', name='minWeightFractionPerNode', doc='Minimum fraction of the weighted sample count that each child must have after split. If a split causes the fraction of the total weight in the left or right child to be less than minWeightFractionPerNode, the split will be discarded as invalid. Should be in interval [0.0, 0.5).'): 0.0,
 Param(parent='GBTClassifier_8c9c74b6ce2d', name='predictionCol', doc='prediction column name.'): 'prediction',
 Param(parent='GBTClassifier_8c9c74b6ce2d', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.'): 'probability',
 Param(parent='GBTClassifier_8c9c74b6ce2d', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name.'): 'rawPrediction',
 Param(parent='GBTClassifier_8c9c74b6ce2d', name='seed', doc='random seed.'): 42,
 Param(parent='GBTClassifier_8c9c74b6ce2d', name='stepSize', doc='Step size (a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of each estimator.'): 0.1,
 Param(parent='GBTClassifier_8c9c74b6ce2d', name='subsamplingRate', doc='Fraction of the training data used for learning each decision tree, in range (0, 1].'): 1.0,
 Param(parent='GBTClassifier_8c9c74b6ce2d', name='validationTol', doc='Threshold for stopping early when fit with validation is used. If the error rate on the validation input changes by less than the validationTol, then learning will stop early (before `maxIter`). This parameter is ignored when fit without validation is used.'): 0.01}

   From the search, we can determine the best model and the associates parameters.

In [ ]:
bestPipeline = pipelineModel_gbt_hpo.bestModel
bestModel = bestPipeline.stages[-1]
bestParams = bestModel.extractParamMap()
print(bestPipeline)
print('\n')
print(bestModel)
print('\n')
print(bestParams)
PipelineModel_7aed6edf9218


GBTClassificationModel: uid = GBTClassifier_8c9c74b6ce2d, numTrees=10, numClasses=2, numFeatures=50


{Param(parent='GBTClassifier_8c9c74b6ce2d', name='cacheNodeIds', doc='If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval.'): False, Param(parent='GBTClassifier_8c9c74b6ce2d', name='checkpointInterval', doc='set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext.'): 10, Param(parent='GBTClassifier_8c9c74b6ce2d', name='featureSubsetStrategy', doc="The number of features to consider for splits at each tree node. Supported options: 'auto' (choose automatically for task: If numTrees == 1, set to 'all'. If numTrees > 1 (forest), set to 'sqrt' for classification and to 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the features), 'sqrt' (use sqrt(number of features)), 'log2' (use log2(number of features)), 'n' (when n is in the range (0, 1.0], use n * number of features. When n is in the range (1, number of features), use n features). default = 'auto'"): 'all', Param(parent='GBTClassifier_8c9c74b6ce2d', name='featuresCol', doc='features column name.'): 'unscaledFeatures', Param(parent='GBTClassifier_8c9c74b6ce2d', name='impurity', doc='Criterion used for information gain calculation (case-insensitive). Supported options: variance'): 'variance', Param(parent='GBTClassifier_8c9c74b6ce2d', name='labelCol', doc='label column name.'): 'label', Param(parent='GBTClassifier_8c9c74b6ce2d', name='leafCol', doc='Leaf indices column name. Predicted leaf index of each instance in each tree by preorder.'): '', Param(parent='GBTClassifier_8c9c74b6ce2d', name='lossType', doc='Loss function which GBT tries to minimize (case-insensitive). Supported options: logistic'): 'logistic', Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 64, Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 15, Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxIter', doc='max number of iterations (>= 0).'): 10, Param(parent='GBTClassifier_8c9c74b6ce2d', name='maxMemoryInMB', doc='Maximum memory in MB allocated to histogram aggregation. If too small, then 1 node will be split per iteration, and its aggregates may exceed this size.'): 256, Param(parent='GBTClassifier_8c9c74b6ce2d', name='minInfoGain', doc='Minimum information gain for a split to be considered at a tree node.'): 0.0, Param(parent='GBTClassifier_8c9c74b6ce2d', name='minInstancesPerNode', doc='Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1.'): 1, Param(parent='GBTClassifier_8c9c74b6ce2d', name='minWeightFractionPerNode', doc='Minimum fraction of the weighted sample count that each child must have after split. If a split causes the fraction of the total weight in the left or right child to be less than minWeightFractionPerNode, the split will be discarded as invalid. Should be in interval [0.0, 0.5).'): 0.0, Param(parent='GBTClassifier_8c9c74b6ce2d', name='predictionCol', doc='prediction column name.'): 'prediction', Param(parent='GBTClassifier_8c9c74b6ce2d', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.'): 'probability', Param(parent='GBTClassifier_8c9c74b6ce2d', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name.'): 'rawPrediction', Param(parent='GBTClassifier_8c9c74b6ce2d', name='seed', doc='random seed.'): 42, Param(parent='GBTClassifier_8c9c74b6ce2d', name='stepSize', doc='Step size (a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of each estimator.'): 0.1, Param(parent='GBTClassifier_8c9c74b6ce2d', name='subsamplingRate', doc='Fraction of the training data used for learning each decision tree, in range (0, 1].'): 1.0, Param(parent='GBTClassifier_8c9c74b6ce2d', name='validationTol', doc='Threshold for stopping early when fit with validation is used. If the error rate on the validation input changes by less than the validationTol, then learning will stop early (before `maxIter`). This parameter is ignored when fit without validation is used.'): 0.01}
In [ ]:
best_mod = pipelineModel_gbt_hpo.bestModel
param_dict = best_mod.stages[-1].extractParamMap()
sane_dict = {}
for k, v in param_dict.items():
    sane_dict[k.name] = v
sane_dict
Out[ ]:
{'cacheNodeIds': False,
 'checkpointInterval': 10,
 'featureSubsetStrategy': 'all',
 'featuresCol': 'unscaledFeatures',
 'impurity': 'variance',
 'labelCol': 'label',
 'leafCol': '',
 'lossType': 'logistic',
 'maxBins': 64,
 'maxDepth': 15,
 'maxIter': 10,
 'maxMemoryInMB': 256,
 'minInfoGain': 0.0,
 'minInstancesPerNode': 1,
 'minWeightFractionPerNode': 0.0,
 'predictionCol': 'prediction',
 'probabilityCol': 'probability',
 'rawPredictionCol': 'rawPrediction',
 'seed': 42,
 'stepSize': 0.1,
 'subsamplingRate': 1.0,
 'validationTol': 0.01}

GridSearchCV Best Models

Load Saved Models - Upsampled

   The notebook containing the best models is located here. Now we can read data, set up the vector assembler, scalers and evaluators. Then load the saved models, predict and evaluate the model metrics using the testDF of the Upsampled Set.

In [ ]:
from pyspark.ml import PipelineModel

path = '/content/drive/MyDrive/LoanStatus/Python/Models/ML/SparkML/Models/GridSearchCV/'

pipelineModel_lr_hpo_US = PipelineModel.load(path + '/pipelineModel_lr_us_hpo_grid/')
pipelineModel_lsvc_hpo_US = PipelineModel.load(path + '/pipelineModel_lsvc_us_hpo_grid/')
pipelineModel_dt_hpo_US = PipelineModel.load(path + '/pipelineModel_dt_us_hpo_grid/')
pipelineModel_rf_hpo_US = PipelineModel.load(path + '/pipelineModel_rf_us_hpo_grid/')
pipelineModel_gbt_hpo_US = PipelineModel.load(path + '/pipelineModel_gbt_us_hpo_grid/')

prediction_lr = pipelineModel_lr_hpo_US.transform(testDF_US)
prediction_lsvc = pipelineModel_lsvc_hpo_US.transform(testDF_US)
prediction_dt = pipelineModel_dt_hpo_US.transform(testDF_US)
prediction_rf = pipelineModel_rf_hpo_US.transform(testDF_US)
prediction_gbt = pipelineModel_gbt_hpo_US.transform(testDF_US)

print('GridSearchCV Best Models Metrics: Upsampling')
print('\n')
print('Area Under ROC Curve:')
print('Logistic Regression:', evaluator_auroc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_auroc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_auroc.evaluate(prediction_dt))
print('Random Forest:', evaluator_auroc.evaluate(prediction_rf))
print('Gradient Boosted Trees:', evaluator_auroc.evaluate(prediction_gbt))
print('\n')
print('Accuracy:')
print('Logistic Regression:', evaluator_acc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_acc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_acc.evaluate(prediction_dt))
print('Random Forest:', evaluator_acc.evaluate(prediction_rf))
print('Gradient Boosted Trees:', evaluator_acc.evaluate(prediction_gbt))
GridSearchCV Best Models Metrics: Upsampling


Area Under ROC Curve:
Logistic Regression: 0.9716420492988681
LinearSVC: 0.9802039596564845
Decision Trees: 0.9602220402746184
Random Forest: 0.9811805363162989
Gradient Boosted Trees: 0.9849316561714048


Accuracy:
Logistic Regression: 0.9740423101557785
LinearSVC: 0.9816335355039505
Decision Trees: 0.9810508401680568
Random Forest: 0.9791455189110072
Gradient Boosted Trees: 0.9823503432584231
In [ ]:
print('GridSearchCV Best Models Metrics: Upsampling')
for model in ['prediction_lr', 'prediction_lsvc', 'prediction_dt',
			        'prediction_rf', 'prediction_gbt']:
    df = globals()[model]

    tp = df[(df.label == 1) & (df.prediction == 1)].count()
    tn = df[(df.label == 0) & (df.prediction == 0)].count()
    fp = df[(df.label == 0) & (df.prediction == 1)].count()
    fn = df[(df.label == 1) & (df.prediction == 0)].count()
    a = ((tp + tn)/df.count())

    if(tp + fn == 0.0):
        r = 0.0
        p = float(tp) / (tp + fp)
    elif(tp + fp == 0.0):
        r = float(tp) / (tp + fn)
        p = 0.0
    else:
        r = float(tp) / (tp + fn)
        p = float(tp) / (tp + fp)

    if(p + r == 0):
        f1 = 0
    else:
        f1 = 2 * ((p * r)/(p + r))

    print('\nModel:', model)
    print('True Positives:', tp)
    print('True Negatives:', tn)
    print('False Positives:', fp)
    print('False Negatives:', fn)
    print('Total:', df.count())
    print('Accuracy:', a)
    print('Recall:', r)
    print('Precision: ', p)
    print('F1 score:', f1)
    print('\n')
print('\n')
GridSearchCV Best Models Metrics: Upsampling

Model: prediction_lr
True Positives: 47818
True Negatives: 373429
False Positives: 4419
False Negatives: 6807
Total: 432473
Accuracy: 0.9740423101557785
Recall: 0.8753867276887872
Precision:  0.9154047897084442
F1 score: 0.8949486253298647



Model: prediction_lsvc
True Positives: 48966
True Negatives: 375564
False Positives: 2284
False Negatives: 5659
Total: 432473
Accuracy: 0.9816335355039505
Recall: 0.8964027459954234
Precision:  0.9554341463414634
F1 score: 0.9249775678866589



Model: prediction_dt
True Positives: 50419
True Negatives: 373859
False Positives: 3989
False Negatives: 4206
Total: 432473
Accuracy: 0.9810508401680568
Recall: 0.9230022883295195
Precision:  0.9266835759447141
F1 score: 0.924839268845212



Model: prediction_rf
True Positives: 49771
True Negatives: 373683
False Positives: 4165
False Negatives: 4854
Total: 432473
Accuracy: 0.9791455189110072
Recall: 0.9111395881006865
Precision:  0.9227788490062296
F1 score: 0.9169222833245826



Model: prediction_gbt
True Positives: 50339
True Negatives: 374501
False Positives: 3347
False Negatives: 4286
Total: 432473
Accuracy: 0.9823503432584231
Recall: 0.9215377574370709
Precision:  0.9376559997019707
F1 score: 0.9295270101836378




   The best AUROC and Accuracy determined was 0.9849 and 98.24%, respectively, using GBT. The best precision was 95.54% using LinearSVC, the best recall was 92.30% using DT and the best F1 score was 92.95% with GBT.

Load Saved Models - SMOTE

   Now we can use the same methods mentioned above to evaluate the SMOTE models using the respective testDF.

In [ ]:
pipelineModel_lr_hpo_SMOTE = PipelineModel.load(path + '/pipelineModel_lr_smote_hpo_grid/')
pipelineModel_lsvc_hpo_SMOTE = PipelineModel.load(path + '/pipelineModel_lsvc_smote_hpo_grid/')
pipelineModel_rf_hpo_SMOTE = PipelineModel.load(path + '/pipelineModel_rf_smote_hpo_grid/')
pipelineModel_gbt_hpo_SMOTE = PipelineModel.load(path + '/pipelineModel_gbt_smote_hpo_grid/')
pipelineModel_dt_hpo_SMOTE = PipelineModel.load(path + '/pipelineModel_dt_smote_hpo_grid/')

prediction_lr = pipelineModel_lr_hpo_SMOTE.transform(testDF_SMOTE)
prediction_lsvc = pipelineModel_lsvc_hpo_SMOTE.transform(testDF_SMOTE)
prediction_dt = pipelineModel_dt_hpo_SMOTE.transform(testDF_SMOTE)
prediction_rf = pipelineModel_rf_hpo_SMOTE.transform(testDF_SMOTE)
prediction_gbt = pipelineModel_gbt_hpo_SMOTE.transform(testDF_SMOTE)

print('GridSearchCV Best Models Metrics: SMOTE')
print('\n')
print('Area Under ROC Curve:')
print('Logistic Regression:', evaluator_auroc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_auroc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_auroc.evaluate(prediction_dt))
print('Random Forest:', evaluator_auroc.evaluate(prediction_rf))
print('Gradient Boosted Trees:', evaluator_auroc.evaluate(prediction_gbt))
print('\n')
print('Accuracy:')
print('Logistic Regression:', evaluator_acc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_acc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_acc.evaluate(prediction_dt))
print('Random Forest:', evaluator_acc.evaluate(prediction_rf))
print('Gradient Boosted Trees:', evaluator_acc.evaluate(prediction_gbt))
GridSearchCV Best Models Metrics: SMOTE


Area Under ROC Curve:
Logistic Regression: 0.9705261720148816
LinearSVC: 0.979112214992606
Decision Trees: 0.9493941365983161
Random Forest: 0.9780902581532774
Gradient Boosted Trees: 0.9864679978587151


Accuracy:
Logistic Regression: 0.9745810721131724
LinearSVC: 0.9819711288334763
Decision Trees: 0.9851320198023923
Random Forest: 0.975866701505065
Gradient Boosted Trees: 0.9858603889722596
In [ ]:
print('GridSearchCV Best Models Metrics: SMOTE')
for model in ['prediction_lr', 'prediction_lsvc', 'prediction_dt',
              'prediction_rf', 'prediction_gbt']:
    df = globals()[model]

    tp = df[(df.label == 1) & (df.prediction == 1)].count()
    tn = df[(df.label == 0) & (df.prediction == 0)].count()
    fp = df[(df.label == 0) & (df.prediction == 1)].count()
    fn = df[(df.label == 1) & (df.prediction == 0)].count()
    a = ((tp + tn)/df.count())

    if(tp + fn == 0.0):
        r = 0.0
        p = float(tp) / (tp + fp)
    elif(tp + fp == 0.0):
        r = float(tp) / (tp + fn)
        p = 0.0
    else:
        r = float(tp) / (tp + fn)
        p = float(tp) / (tp + fp)

    if(p + r == 0):
        f1 = 0
    else:
        f1 = 2 * ((p * r)/(p + r))

    print('\nModel:', model)
    print('True Positives:', tp)
    print('True Negatives:', tn)
    print('False Positives:', fp)
    print('False Negatives:', fn)
    print('Total:', df.count())
    print('Accuracy:', a)
    print('Recall:', r)
    print('Precision: ', p)
    print('F1 score:', f1)
    print('\n')
GridSearchCV Best Models Metrics: SMOTE

Model: prediction_lr
True Positives: 46901
True Negatives: 374579
False Positives: 3269
False Negatives: 7724
Total: 432473
Accuracy: 0.9745810721131724
Recall: 0.8585995423340961
Precision:  0.9348415387681882
F1 score: 0.89509995705902



Model: prediction_lsvc
True Positives: 48314
True Negatives: 376362
False Positives: 1486
False Negatives: 6311
Total: 432473
Accuracy: 0.9819711288334763
Recall: 0.884466819221968
Precision:  0.9701606425702811
F1 score: 0.9253339717500598



Model: prediction_dt
True Positives: 49542
True Negatives: 376501
False Positives: 1347
False Negatives: 5083
Total: 432473
Accuracy: 0.9851320198023923
Recall: 0.9069473684210526
Precision:  0.9735306254789837
F1 score: 0.9390602194969387



Model: prediction_rf
True Positives: 44961
True Negatives: 377075
False Positives: 773
False Negatives: 9664
Total: 432473
Accuracy: 0.975866701505065
Recall: 0.8230846681922197
Precision:  0.9830979140245769
F1 score: 0.896003347980749



Model: prediction_gbt
True Positives: 49576
True Negatives: 376782
False Positives: 1066
False Negatives: 5049
Total: 432473
Accuracy: 0.9858603889722596
Recall: 0.9075697940503432
Precision:  0.9789502784250227
F1 score: 0.941909620298859


   For the best GridSearchCV SMOTE models, The best AUROC and Accuracy determined was 0.9865 and 98.59%, respectively, using GBT. The best precision was 95.54% using LinearSVC, the best recall was 92.30% using DT and the best F1 score was 92.95% with GBT.

   For the best GridSearchCV Upsampled models, the best AUROC and Accuracy determined was 0.9849 and 98.245, respectively, using GBT. The best precision was 95.54% using LinearSVC, the best recall was 92.30% using DT and the best F1 score was 92.95% with GBT.

Hyperopt Best Models

   The Hyperopt models for both the Upsampled and SMOTE sets are located here.

Load Saved Models - Upsampled

   After loading the models, let's predict using the test set of the Upsampled data and evaluate the model metrics from the models where AUROC was monitored.

In [ ]:
from pyspark.ml import PipelineModel

path = '/content/drive/MyDrive/LoanStatus/Python/Models/ML/SparkML/Models/Hyperopt/'

pipelineModel_lr_hyperopt_auroc_US = PipelineModel.load(path
                                                        + '/pipelineModel_lr_hyperopt_us_auroc_100trials/')
pipelineModel_lsvc_hyperopt_auroc_US = PipelineModel.load(path
                                                          + '/pipelineModel_lsvc_hyperopt_us_auroc/')
pipelineModel_dt_hyperopt_auroc_US = PipelineModel.load(path
                                                        + '/pipelineModel_dt_hyperopt_us_auroc2/')
pipelineModel_rf_hyperopt_auroc_US = PipelineModel.load(path
                                                        + '/pipelineModel_rf_hyperopt_us_auroc_30trials/')
pipelineModel_rf_hyperopt_auroc1_US = PipelineModel.load(path
                                                         + '/pipelineModel_rf_hyperopt_us_auroc_moreParams/')
pipelineModel_gbt_hyperopt_auroc_US = PipelineModel.load(path
                                                         + '/pipelineModel_gbt_hyperopt_us_auroc/')

pipelineModel_lr_hyperopt_f1_US = PipelineModel.load(path
                                                     + '/pipelineModel_lr_hyperopt_us_f1_100trials/')
pipelineModel_lsvc_hyperopt_f1_US = PipelineModel.load(path
                                                       + '/pipelineModel_lsvc_hyperopt_us_f1/')
pipelineModel_dt_hyperopt_f1_US = PipelineModel.load(path
                                                     + '/pipelineModel_dt_hyperopt_us_f1/')
pipelineModel_rf_hyperopt_f1_US = PipelineModel.load(path
                                                     + '/pipelineModel_rf_hyperopt_us_f1_30trials/')
pipelineModel_rf_hyperopt_f1_1_US = PipelineModel.load(path
                                                       + '/pipelineModel_rf_hyperopt_us_f1_moreParams/')
pipelineModel_gbt_hyperopt_f1_US = PipelineModel.load(path
                                                      + '/pipelineModel_gbt_hyperopt_us_f1/')

prediction_lr = pipelineModel_lr_hyperopt_auroc_US.transform(testDF_US)
prediction_lsvc = pipelineModel_lsvc_hyperopt_auroc_US.transform(testDF_US)
prediction_dt = pipelineModel_dt_hyperopt_auroc_US.transform(testDF_US)
prediction_rf = pipelineModel_rf_hyperopt_auroc_US.transform(testDF_US)
prediction_rf1 = pipelineModel_rf_hyperopt_auroc1_US.transform(testDF_US)
prediction_gbt = pipelineModel_gbt_hyperopt_auroc_US.transform(testDF_US)

print('Hyperopt Best Models AUROC Metrics: Upsampling')
print('\n')
print('Area Under ROC Curve:')
print('Logistic Regression:', evaluator_auroc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_auroc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_auroc.evaluate(prediction_dt))
print('Random Forest:', evaluator_auroc.evaluate(prediction_rf))
print('Random Forest - More Params:', evaluator_auroc.evaluate(prediction_rf1))
print('Gradient Boosted Trees:', evaluator_auroc.evaluate(prediction_gbt))
print('\n')
print('Accuracy:')
print('Logistic Regression:', evaluator_acc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_acc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_acc.evaluate(prediction_dt))
print('Random Forest:', evaluator_acc.evaluate(prediction_rf))
print('Random Forest - More Params:', evaluator_acc.evaluate(prediction_rf1))
print('Gradient Boosted Trees:', evaluator_acc.evaluate(prediction_gbt))
Hyperopt Best Models AUROC Metrics: Upsampling


Area Under ROC Curve:
Logistic Regression: 0.9870101538778137
LinearSVC: 0.980184661932512
Decision Trees: 0.9625074740986497
Random Forest: 0.9845460058836394
Random Forest - More Params: 0.9849490161239277
Gradient Boosted Trees: 0.9866776626412849


Accuracy:
Logistic Regression: 0.9858303292922332
LinearSVC: 0.9812543210790038
Decision Trees: 0.974555636999304
Random Forest: 0.9845169525033933
Random Forest - More Params: 0.9847504930943666
Gradient Boosted Trees: 0.9810554647342146
In [ ]:
print('Hyperopt Best Models AUROC Metrics: Upsampling')
for model in ['prediction_lr', 'prediction_lsvc', 'prediction_dt',
			        'prediction_rf', 'prediction_rf1', 'prediction_gbt']:
    df = globals()[model]

    tp = df[(df.label == 1) & (df.prediction == 1)].count()
    tn = df[(df.label == 0) & (df.prediction == 0)].count()
    fp = df[(df.label == 0) & (df.prediction == 1)].count()
    fn = df[(df.label == 1) & (df.prediction == 0)].count()
    a = ((tp + tn)/df.count())

    if(tp + fn == 0.0):
        r = 0.0
        p = float(tp) / (tp + fp)
    elif(tp + fp == 0.0):
        r = float(tp) / (tp + fn)
        p = 0.0
    else:
        r = float(tp) / (tp + fn)
        p = float(tp) / (tp + fp)

    if(p + r == 0):
        f1 = 0
    else:
        f1 = 2 * ((p * r)/(p + r))

    print('\nModel:', model)
    print('True Positives:', tp)
    print('True Negatives:', tn)
    print('False Positives:', fp)
    print('False Negatives:', fn)
    print('Total:', df.count())
    print('Accuracy:', a)
    print('Recall:', r)
    print('Precision: ', p)
    print('F1 score:', f1)
    print('\n')
print('\n')
Hyperopt Best Models AUROC Metrics: Upsampling

Model: prediction_lr
True Positives: 50199
True Negatives: 376146
False Positives: 1702
False Negatives: 4426
Total: 432473
Accuracy: 0.9858303292922332
Recall: 0.918974828375286
Precision:  0.9672067975568871
F1 score: 0.9424741377691831



Model: prediction_lsvc
True Positives: 48829
True Negatives: 375537
False Positives: 2311
False Negatives: 5796
Total: 432473
Accuracy: 0.9812543210790038
Recall: 0.8938947368421053
Precision:  0.9548103245991396
F1 score: 0.9233489339573584



Model: prediction_dt
True Positives: 50116
True Negatives: 371353
False Positives: 6495
False Negatives: 4509
Total: 432473
Accuracy: 0.974555636999304
Recall: 0.9174553775743707
Precision:  0.8852696472417021
F1 score: 0.9010751914847711



Model: prediction_rf
True Positives: 49231
True Negatives: 376546
False Positives: 1302
False Negatives: 5394
Total: 432473
Accuracy: 0.9845169525033933
Recall: 0.901254004576659
Precision:  0.9742346585399639
F1 score: 0.9363243880636756



Model: prediction_rf1
True Positives: 49268
True Negatives: 376610
False Positives: 1238
False Negatives: 5357
Total: 432473
Accuracy: 0.9847504930943666
Recall: 0.9019313501144165
Precision:  0.9754880608244565
F1 score: 0.9372687409042053



Model: prediction_gbt
True Positives: 50383
True Negatives: 373897
False Positives: 3951
False Negatives: 4242
Total: 432473
Accuracy: 0.9810554647342146
Recall: 0.9223432494279176
Precision:  0.9272831008208489
F1 score: 0.9248065786213163




   For the best Hyperopt Upsampled models where AUROC was monitored, The best AUROC and Accuracy determined was 0.9870 and 98.58%, respectively, using LR. The best precision was 97.55% using RF, the best recall was 92.23% using GBT and the best F1 score was 94.25% with LR.

Predict and F1 Model Metrics using testDF of Upsampled Set

   Let's now predict using the test set of the Upsampled data and evaluate the model metrics from the models where F1 score was monitored.

In [ ]:
prediction_lr = pipelineModel_lr_hyperopt_f1_US.transform(testDF_US)
prediction_lsvc = pipelineModel_lsvc_hyperopt_f1_US.transform(testDF_US)
prediction_dt = pipelineModel_dt_hyperopt_f1_US.transform(testDF_US)
prediction_rf = pipelineModel_rf_hyperopt_f1_US.transform(testDF_US)
prediction_rf1 = pipelineModel_rf_hyperopt_f1_1_US.transform(testDF_US)
prediction_gbt = pipelineModel_gbt_hyperopt_f1_US.transform(testDF_US)

print('Hyperopt Best Models F1 Metrics: Upsampling')
print('\n')
print('Area Under ROC Curve:')
print('Logistic Regression:', evaluator_auroc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_auroc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_auroc.evaluate(prediction_dt))
print('Random Forest:', evaluator_auroc.evaluate(prediction_rf))
print('Random Forest - More Params:', evaluator_auroc.evaluate(prediction_rf1))
print('Gradient Boosted Trees:', evaluator_auroc.evaluate(prediction_gbt))
print('\n')
print('Accuracy:')
print('Logistic Regression:', evaluator_acc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_acc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_acc.evaluate(prediction_dt))
print('Random Forest:', evaluator_acc.evaluate(prediction_rf))
print('Random Forest - More Params:', evaluator_acc.evaluate(prediction_rf1))
print('Gradient Boosted Trees:', evaluator_acc.evaluate(prediction_gbt))
Hyperopt Best Models F1 Metrics: Upsampling


Area Under ROC Curve:
Logistic Regression: 0.9870101538778137
LinearSVC: 0.9801636688553507
Decision Trees: 0.9564704240277363
Random Forest: 0.9845886468846079
Random Forest - More Params: 0.9832680256640207
Gradient Boosted Trees: 0.9848670786073231


Accuracy:
Logistic Regression: 0.9858303292922332
LinearSVC: 0.9814716756884245
Decision Trees: 0.9809121031833201
Random Forest: 0.9849354757406821
Random Forest - More Params: 0.9861586734894433
Gradient Boosted Trees: 0.983296067037711
In [ ]:
print('Hyperopt Best Models F1 Metrics: Upsampling')
for model in ['prediction_lr', 'prediction_lsvc', 'prediction_dt',
			        'prediction_rf', 'prediction_rf1', 'prediction_gbt']:
    df = globals()[model]

    tp = df[(df.label == 1) & (df.prediction == 1)].count()
    tn = df[(df.label == 0) & (df.prediction == 0)].count()
    fp = df[(df.label == 0) & (df.prediction == 1)].count()
    fn = df[(df.label == 1) & (df.prediction == 0)].count()
    a = ((tp + tn)/df.count())

    if(tp + fn == 0.0):
        r = 0.0
        p = float(tp) / (tp + fp)
    elif(tp + fp == 0.0):
        r = float(tp) / (tp + fn)
        p = 0.0
    else:
        r = float(tp) / (tp + fn)
        p = float(tp) / (tp + fp)

    if(p + r == 0):
        f1 = 0
    else:
        f1 = 2 * ((p * r)/(p + r))

    print('\nModel:', model)
    print('True Positives:', tp)
    print('True Negatives:', tn)
    print('False Positives:', fp)
    print('False Negatives:', fn)
    print('Total:', df.count())
    print('Accuracy:', a)
    print('Recall:', r)
    print('Precision: ', p)
    print('F1 score:', f1)
    print('\n')
print('\n')
Hyperopt Best Models F1 Metrics: Upsampling

Model: prediction_lr
True Positives: 50199
True Negatives: 376146
False Positives: 1702
False Negatives: 4426
Total: 432473
Accuracy: 0.9858303292922332
Recall: 0.918974828375286
Precision:  0.9672067975568871
F1 score: 0.9424741377691831



Model: prediction_lsvc
True Positives: 48978
True Negatives: 375482
False Positives: 2366
False Negatives: 5647
Total: 432473
Accuracy: 0.9814716756884245
Recall: 0.8966224256292906
Precision:  0.9539186662511686
F1 score: 0.9243835461314158



Model: prediction_dt
True Positives: 50417
True Negatives: 373801
False Positives: 4047
False Negatives: 4208
Total: 432473
Accuracy: 0.9809121031833201
Recall: 0.9229656750572083
Precision:  0.9256940364277321
F1 score: 0.9243278424039088



Model: prediction_rf
True Positives: 49291
True Negatives: 376667
False Positives: 1181
False Negatives: 5334
Total: 432473
Accuracy: 0.9849354757406821
Recall: 0.9023524027459954
Precision:  0.9766008876208591
F1 score: 0.9380096482297305



Model: prediction_rf1
True Positives: 48881
True Negatives: 377606
False Positives: 242
False Negatives: 5744
Total: 432473
Accuracy: 0.9861586734894433
Recall: 0.8948466819221969
Precision:  0.9950735907823219
F1 score: 0.9423025022169103



Model: prediction_gbt
True Positives: 50218
True Negatives: 375031
False Positives: 2817
False Negatives: 4407
Total: 432473
Accuracy: 0.983296067037711
Recall: 0.9193226544622426
Precision:  0.9468841331196379
F1 score: 0.9328998699609883




   For the best Hyperopt Upsampled models where F1 score was monitored, The best AUROC determined was 0.9870 using LR and the best Accuracy was 98.62% using RF. The best precision was 99.51% using RF, the best recall was 89.48% using rf and the best F1 score was 94.23% with RF.

Load Saved Models - SMOTE

   After loading the models, let's predict using the test set of the SMOTE data and evaluate the model metrics from the models where AUROC was monitored.

In [ ]:
pipelineModel_lr_hyperopt_auroc_SMOTE = PipelineModel.load(path
                                                           + '/pipelineModel_lr_hyperopt_smote_auroc_100trials/')
pipelineModel_lsvc_hyperopt_auroc_SMOTE = PipelineModel.load(path
                                                             + '/pipelineModel_lsvc_hyperopt_smote_auroc/')
pipelineModel_dt_hyperopt_auroc_SMOTE = PipelineModel.load(path
                                                           + '/pipelineModel_dt_hyperopt_smote_auroc/')
pipelineModel_rf_hyperopt_auroc_SMOTE = PipelineModel.load(path
                                                           + '/pipelineModel_rf_hyperopt_smote_auroc_30trials/')
pipelineModel_rf_hyperopt_auroc1_SMOTE = PipelineModel.load(path
                                                            + '/pipelineModel_rf_hyperopt_smote_auroc_moreParams/')
pipelineModel_gbt_hyperopt_auroc_SMOTE = PipelineModel.load(path
                                                            + '/pipelineModel_gbt_hyperopt_smote_auroc/')

pipelineModel_lr_hyperopt_f1_SMOTE = PipelineModel.load(path
                                                        + '/pipelineModel_lr_hyperopt_smote_f1_100trials/')
pipelineModel_lsvc_hyperopt_f1_SMOTE = PipelineModel.load(path
                                                          + '/pipelineModel_lsvc_hyperopt_smote_f1/')
pipelineModel_dt_hyperopt_f1_SMOTE = PipelineModel.load(path
                                                        + '/pipelineModel_dt_hyperopt_smote_f1/')
pipelineModel_rf_hyperopt_f1_SMOTE = PipelineModel.load(path
                                                        + '/pipelineModel_rf_hyperopt_smote_f1_29trials/')
pipelineModel_rf_hyperopt_f1_1_SMOTE = PipelineModel.load(path
                                                          + '/pipelineModel_rf_hyperopt_smote_f1_moreParams/')
pipelineModel_gbt_hyperopt_f1_SMOTE = PipelineModel.load(path
                                                         + '/pipelineModel_gbt_hyperopt_smote_f1/')

prediction_lr = pipelineModel_lr_hyperopt_auroc_SMOTE.transform(testDF_SMOTE)
prediction_lsvc = pipelineModel_lsvc_hyperopt_auroc_SMOTE.transform(testDF_SMOTE)
prediction_dt = pipelineModel_dt_hyperopt_auroc_SMOTE.transform(testDF_SMOTE)
prediction_rf = pipelineModel_rf_hyperopt_auroc_SMOTE.transform(testDF_SMOTE)
prediction_rf1 = pipelineModel_rf_hyperopt_auroc1_SMOTE.transform(testDF_SMOTE)
prediction_gbt = pipelineModel_gbt_hyperopt_auroc_SMOTE.transform(testDF_SMOTE)

print('Hyperopt Best Models AUROC Metrics: SMOTE')
print('\n')
print('Area Under ROC Curve:')
print('Logistic Regression:', evaluator_auroc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_auroc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_auroc.evaluate(prediction_dt))
print('Random Forest:', evaluator_auroc.evaluate(prediction_rf))
print('Random Forest - More Params:', evaluator_auroc.evaluate(prediction_rf1))
print('Gradient Boosted Trees:', evaluator_auroc.evaluate(prediction_gbt))
print('\n')
print('Accuracy:')
print('Logistic Regression:', evaluator_acc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_acc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_acc.evaluate(prediction_dt))
print('Random Forest:', evaluator_acc.evaluate(prediction_rf))
print('Random Forest - More Params:', evaluator_acc.evaluate(prediction_rf1))
print('Gradient Boosted Trees:', evaluator_acc.evaluate(prediction_gbt))
Hyperopt Best Models AUROC Metrics: SMOTE


Area Under ROC Curve:
Logistic Regression: 0.9860501025753575
LinearSVC: 0.9792529293074257
Decision Trees: 0.9619760183492718
Random Forest: 0.9807582878240927
Random Forest - More Params: 0.9809293771442331
Gradient Boosted Trees: 0.9871254350362428


Accuracy:
Logistic Regression: 0.9866072564067584
LinearSVC: 0.9824312731661861
Decision Trees: 0.980546762456847
Random Forest: 0.9816127249562401
Random Forest - More Params: 0.9837770219181313
Gradient Boosted Trees: 0.9866095686898373
In [ ]:
print('Hyperopt Best Models AUROC Metrics: SMOTE')
for model in ['prediction_lr', 'prediction_lsvc', 'prediction_dt',
              'prediction_rf', 'prediction_rf1', 'prediction_gbt']:
    df = globals()[model]

    tp = df[(df.label == 1) & (df.prediction == 1)].count()
    tn = df[(df.label == 0) & (df.prediction == 0)].count()
    fp = df[(df.label == 0) & (df.prediction == 1)].count()
    fn = df[(df.label == 1) & (df.prediction == 0)].count()
    a = ((tp + tn)/df.count())

    if(tp + fn == 0.0):
        r = 0.0
        p = float(tp) / (tp + fp)
    elif(tp + fp == 0.0):
        r = float(tp) / (tp + fn)
        p = 0.0
    else:
        r = float(tp) / (tp + fn)
        p = float(tp) / (tp + fp)

    if(p + r == 0):
        f1 = 0
    else:
        f1 = 2 * ((p * r)/(p + r))

    print('\nModel:', model)
    print('True Positives:', tp)
    print('True Negatives:', tn)
    print('False Positives:', fp)
    print('False Negatives:', fn)
    print('Total:', df.count())
    print('Accuracy:', a)
    print('Recall:', r)
    print('Precision: ', p)
    print('F1 score:', f1)
    print('\n')
Hyperopt Best Models AUROC Metrics: SMOTE

Model: prediction_lr
True Positives: 50014
True Negatives: 376667
False Positives: 1181
False Negatives: 4611
Total: 432473
Accuracy: 0.9866072564067584
Recall: 0.9155881006864989
Precision:  0.9769313409512648
F1 score: 0.9452655452655452



Model: prediction_lsvc
True Positives: 48353
True Negatives: 376522
False Positives: 1326
False Negatives: 6272
Total: 432473
Accuracy: 0.9824312731661861
Recall: 0.8851807780320367
Precision:  0.9733086414782907
F1 score: 0.9271552385335173



Model: prediction_dt
True Positives: 49885
True Negatives: 374175
False Positives: 3673
False Negatives: 4740
Total: 432473
Accuracy: 0.980546762456847
Recall: 0.9132265446224256
Precision:  0.9314201426490907
F1 score: 0.9222336226579038



Model: prediction_rf
True Positives: 46904
True Negatives: 377617
False Positives: 231
False Negatives: 7721
Total: 432473
Accuracy: 0.9816127249562401
Recall: 0.8586544622425629
Precision:  0.9950991831971996
F1 score: 0.9218553459119497



Model: prediction_rf1
True Positives: 47756
True Negatives: 377701
False Positives: 147
False Negatives: 6869
Total: 432473
Accuracy: 0.9837770219181313
Recall: 0.8742517162471396
Precision:  0.9969312986660543
F1 score: 0.9315699126092384



Model: prediction_gbt
True Positives: 49358
True Negatives: 377324
False Positives: 524
False Negatives: 5267
Total: 432473
Accuracy: 0.9866095686898373
Recall: 0.903578947368421
Precision:  0.9894952086925143
F1 score: 0.9445874439032792


   For the best Hyperopt SMOTE models where AUROC was monitored, the best AUROC and Accuracy determined was 0.98712 and 98.66%, respectively, using GBT. The best precision was 99.69% using RF, the best recall was 91.56% using LR and the best F1 score was 94.23% with LR.

   For the best Hyperopt Upsampled models where AUROC was monitored, The best AUROC and Accuracy determined was 0.9870 and 98.58%, respectively, using LR. The best precision was 97.55% using RF, the best recall was 92.23% using GBT and the best F1 score was 94.25% with LR.

Predict and F1 Model Metrics using testDF of SMOTE Set

   Let's now predict using the test set of the SMOTE data and evaluate the model metrics from the models where F1 score was monitored.

In [ ]:
prediction_lr = pipelineModel_lr_hyperopt_f1_SMOTE.transform(testDF_SMOTE)
prediction_lsvc = pipelineModel_lsvc_hyperopt_f1_SMOTE.transform(testDF_SMOTE)
prediction_dt = pipelineModel_dt_hyperopt_f1_SMOTE.transform(testDF_SMOTE)
prediction_rf = pipelineModel_rf_hyperopt_f1_SMOTE.transform(testDF_SMOTE)
prediction_rf1 = pipelineModel_rf_hyperopt_f1_1_SMOTE.transform(testDF_SMOTE)
prediction_gbt = pipelineModel_gbt_hyperopt_f1_SMOTE.transform(testDF_SMOTE)

print('Hyperopt Best Models F1 Metrics: SMOTE')
print('\n')
print('Area Under ROC Curve:')
print('Logistic Regression:', evaluator_auroc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_auroc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_auroc.evaluate(prediction_dt))
print('Random Forest:', evaluator_auroc.evaluate(prediction_rf))
print('Random Forest - More Params:', evaluator_auroc.evaluate(prediction_rf1))
print('Gradient Boosted Trees:', evaluator_auroc.evaluate(prediction_gbt))
print('\n')
print('Accuracy:')
print('Logistic Regression:', evaluator_acc.evaluate(prediction_lr))
print('LinearSVC:', evaluator_acc.evaluate(prediction_lsvc))
print('Decision Trees:', evaluator_acc.evaluate(prediction_dt))
print('Random Forest:', evaluator_acc.evaluate(prediction_rf))
print('Random Forest - More Params:', evaluator_acc.evaluate(prediction_rf1))
print('Gradient Boosted Trees:', evaluator_acc.evaluate(prediction_gbt))
Hyperopt Best Models F1 Metrics: SMOTE


Area Under ROC Curve:
Logistic Regression: 0.9860501025753575
LinearSVC: 0.981959571141341
Decision Trees: 0.9494460525988754
Random Forest: 0.9805702355001202
Random Forest - More Params: 0.9804896037281486
Gradient Boosted Trees: 0.9869668327152193


Accuracy:
Logistic Regression: 0.9866072564067584
LinearSVC: 0.9841793591738675
Decision Trees: 0.9862858490587851
Random Forest: 0.9815433564638717
Random Forest - More Params: 0.9834856742501844
Gradient Boosted Trees: 0.9867483056745739
In [ ]:
print('Hyperopt Best Models F1 Metrics: SMOTE')
for model in ['prediction_lr', 'prediction_lsvc', 'prediction_dt',
              'prediction_rf', 'prediction_rf1', 'prediction_gbt']:
    df = globals()[model]

    tp = df[(df.label == 1) & (df.prediction == 1)].count()
    tn = df[(df.label == 0) & (df.prediction == 0)].count()
    fp = df[(df.label == 0) & (df.prediction == 1)].count()
    fn = df[(df.label == 1) & (df.prediction == 0)].count()
    a = ((tp + tn)/df.count())

    if(tp + fn == 0.0):
        r = 0.0
        p = float(tp) / (tp + fp)
    elif(tp + fp == 0.0):
        r = float(tp) / (tp + fn)
        p = 0.0
    else:
        r = float(tp) / (tp + fn)
        p = float(tp) / (tp + fp)

    if(p + r == 0):
        f1 = 0
    else:
        f1 = 2 * ((p * r)/(p + r))

    print('\nModel:', model)
    print('True Positives:', tp)
    print('True Negatives:', tn)
    print('False Positives:', fp)
    print('False Negatives:', fn)
    print('Total:', df.count())
    print('Accuracy:', a)
    print('Recall:', r)
    print('Precision: ', p)
    print('F1 score:', f1)
    print('\n')
Hyperopt Best Models F1 Metrics: SMOTE

Model: prediction_lr
True Positives: 50014
True Negatives: 376667
False Positives: 1181
False Negatives: 4611
Total: 432473
Accuracy: 0.9866072564067584
Recall: 0.9155881006864989
Precision:  0.9769313409512648
F1 score: 0.9452655452655452



Model: prediction_lsvc
True Positives: 48896
True Negatives: 376735
False Positives: 1113
False Negatives: 5729
Total: 432473
Accuracy: 0.9841793591738675
Recall: 0.8951212814645308
Precision:  0.9777440060789058
F1 score: 0.9346101649559416



Model: prediction_dt
True Positives: 49561
True Negatives: 376981
False Positives: 867
False Negatives: 5064
Total: 432473
Accuracy: 0.9862858490587851
Recall: 0.9072951945080091
Precision:  0.9828071706194971
F1 score: 0.943542783166592



Model: prediction_rf
True Positives: 46869
True Negatives: 377622
False Positives: 226
False Negatives: 7756
Total: 432473
Accuracy: 0.9815433564638717
Recall: 0.8580137299771167
Precision:  0.9952011890858902
F1 score: 0.9215296893432953



Model: prediction_rf1
True Positives: 47673
True Negatives: 377658
False Positives: 190
False Negatives: 6952
Total: 432473
Accuracy: 0.9834856742501844
Recall: 0.8727322654462243
Precision:  0.9960303365856716
F1 score: 0.9303137928342832



Model: prediction_gbt
True Positives: 49354
True Negatives: 377388
False Positives: 460
False Negatives: 5271
Total: 432473
Accuracy: 0.9867483056745739
Recall: 0.9035057208237987
Precision:  0.9907656482113462
F1 score: 0.9451258629439193


   For the best Hyperopt SMOTE models where AUROC was monitored, the best AUROC and Accuracy determined was 0.9870 and 98.67%, respectively, using GBT. The best precision was 99.60% using RF, the best recall was 91.56% using LR and the best F1 score was 94.53% with LR.

   For the best Hyperopt Upsampled models where F1 score was monitored, The best AUROC determined was 0.9870 using LR and the best Accuracy was 98.62% using RF. The best precision was 99.51% using RF, the best recall was 89.48% using RF and the best F1 score was 94.23% with RF.

LightGBM

   Let's first set up the environment by mounting the drive on Google Colaboratory, cloning the LightGBM repository from Github.

In [ ]:
%cd /content/drive/MyDrive/
/content/drive/MyDrive
In [ ]:
!git clone --recursive https://github.com/Microsoft/LightGBM
Cloning into 'LightGBM'...
remote: Enumerating objects: 28579, done.
remote: Counting objects: 100% (28578/28578), done.
remote: Compressing objects: 100% (6535/6535), done.
remote: Total 28579 (delta 21247), reused 28354 (delta 21111), pack-reused 1
Receiving objects: 100% (28579/28579), 19.87 MiB | 9.87 MiB/s, done.
Resolving deltas: 100% (21247/21247), done.
Checking out files: 100% (535/535), done.
Submodule 'include/boost/compute' (https://github.com/boostorg/compute) registered for path 'external_libs/compute'
Submodule 'eigen' (https://gitlab.com/libeigen/eigen.git) registered for path 'external_libs/eigen'
Submodule 'external_libs/fast_double_parser' (https://github.com/lemire/fast_double_parser.git) registered for path 'external_libs/fast_double_parser'
Submodule 'external_libs/fmt' (https://github.com/fmtlib/fmt.git) registered for path 'external_libs/fmt'
Cloning into '/content/drive/MyDrive/LightGBM/external_libs/compute'...
remote: Enumerating objects: 21733, done.        
remote: Counting objects: 100% (5/5), done.        
remote: Compressing objects: 100% (4/4), done.        
remote: Total 21733 (delta 1), reused 3 (delta 1), pack-reused 21728        
Receiving objects: 100% (21733/21733), 8.51 MiB | 5.87 MiB/s, done.
Resolving deltas: 100% (17567/17567), done.
Cloning into '/content/drive/MyDrive/LightGBM/external_libs/eigen'...
remote: Enumerating objects: 117713, done.        
remote: Counting objects: 100% (800/800), done.        
remote: Compressing objects: 100% (282/282), done.        
remote: Total 117713 (delta 530), reused 771 (delta 515), pack-reused 116913        
Receiving objects: 100% (117713/117713), 103.14 MiB | 10.23 MiB/s, done.
Resolving deltas: 100% (97111/97111), done.
Cloning into '/content/drive/MyDrive/LightGBM/external_libs/fast_double_parser'...
remote: Enumerating objects: 781, done.        
remote: Counting objects: 100% (180/180), done.        
remote: Compressing objects: 100% (66/66), done.        
remote: Total 781 (delta 124), reused 131 (delta 103), pack-reused 601        
Receiving objects: 100% (781/781), 833.45 KiB | 1.61 MiB/s, done.
Resolving deltas: 100% (395/395), done.
Cloning into '/content/drive/MyDrive/LightGBM/external_libs/fmt'...
remote: Enumerating objects: 31417, done.        
remote: Counting objects: 100% (1040/1040), done.        
remote: Compressing objects: 100% (85/85), done.        
remote: Total 31417 (delta 964), reused 979 (delta 940), pack-reused 30377        
Receiving objects: 100% (31417/31417), 13.60 MiB | 4.95 MiB/s, done.
Resolving deltas: 100% (21284/21284), done.
Submodule path 'external_libs/compute': checked out '36350b7de849300bd3d72a05d8bf890ca405a014'
Submodule path 'external_libs/eigen': checked out '3147391d946bb4b6c68edd901f2add6ac1f31f8c'
Submodule path 'external_libs/fast_double_parser': checked out 'ace60646c02dc54c57f19d644e49a61e7e7758ec'
Submodule 'benchmark/dependencies/abseil-cpp' (https://github.com/abseil/abseil-cpp.git) registered for path 'external_libs/fast_double_parser/benchmarks/dependencies/abseil-cpp'
Submodule 'benchmark/dependencies/double-conversion' (https://github.com/google/double-conversion.git) registered for path 'external_libs/fast_double_parser/benchmarks/dependencies/double-conversion'
Cloning into '/content/drive/MyDrive/LightGBM/external_libs/fast_double_parser/benchmarks/dependencies/abseil-cpp'...
remote: Enumerating objects: 20079, done.        
remote: Counting objects: 100% (2719/2719), done.        
remote: Compressing objects: 100% (711/711), done.        
remote: Total 20079 (delta 2077), reused 2063 (delta 2008), pack-reused 17360        
Receiving objects: 100% (20079/20079), 12.07 MiB | 6.00 MiB/s, done.
Resolving deltas: 100% (15721/15721), done.
Cloning into '/content/drive/MyDrive/LightGBM/external_libs/fast_double_parser/benchmarks/dependencies/double-conversion'...
remote: Enumerating objects: 1352, done.        
remote: Counting objects: 100% (196/196), done.        
remote: Compressing objects: 100% (105/105), done.        
remote: Total 1352 (delta 109), reused 157 (delta 84), pack-reused 1156        
Receiving objects: 100% (1352/1352), 7.15 MiB | 10.65 MiB/s, done.
Resolving deltas: 100% (881/881), done.
Submodule path 'external_libs/fast_double_parser/benchmarks/dependencies/abseil-cpp': checked out 'd936052d32a5b7ca08b0199a6724724aea432309'
Submodule path 'external_libs/fast_double_parser/benchmarks/dependencies/double-conversion': checked out 'f4cb2384efa55dee0e6652f8674b05763441ab09'
Submodule path 'external_libs/fmt': checked out 'b6f4ceaed0a0a24ccf575fab6c56dd50ccf6f1a9'

   Then we can navigate to the directory and create a build directory within the cloned repository where we can use cmake to compile and build.

In [ ]:
%cd /content/drive/MyDrive/LightGBM

!mkdir build
/content/drive/MyDrive/LightGBM
In [ ]:
!cmake -DUSE_GPU=1
!make -j$(nproc)
CMake Warning:
  No source or binary directory provided.  Both will be assumed to be the
  same as the current working directory, but note that this warning will
  become a fatal error in future CMake releases.


-- OpenCL include directory: /usr/include
-- Using _mm_prefetch
-- Using _mm_malloc
-- Configuring done
-- Generating done
-- Build files have been written to: /content/drive/MyDrive/LightGBM
Consolidate compiler generated dependencies of target lightgbm_capi_objs
[ -1%] Built target lightgbm_capi_objs
Consolidate compiler generated dependencies of target lightgbm_objs
[ 89%] Built target lightgbm_objs
[ 90%] Built target _lightgbm
Consolidate compiler generated dependencies of target lightgbm
[ 96%] Built target lightgbm

   Next, we can install pip, setuptools and the required dependencies to set up the environment for utilizing LightGBM with GPU capabilities.

In [ ]:
!sudo apt-get -y install python-pip
!sudo -H pip install setuptools optuna plotly eli5 shap lime -U
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
The following additional packages will be installed:
  libpython-all-dev python-all python-all-dev python-asn1crypto
  python-cffi-backend python-crypto python-cryptography python-dbus
  python-enum34 python-gi python-idna python-ipaddress python-keyring
  python-keyrings.alt python-pip-whl python-pkg-resources python-secretstorage
  python-setuptools python-six python-wheel python-xdg
Suggested packages:
  python-crypto-doc python-cryptography-doc python-cryptography-vectors
  python-dbus-dbg python-dbus-doc python-enum34-doc python-gi-cairo
  gnome-keyring libkf5wallet-bin gir1.2-gnomekeyring-1.0 python-fs
  python-gdata python-keyczar python-secretstorage-doc python-setuptools-doc
The following NEW packages will be installed:
  libpython-all-dev python-all python-all-dev python-asn1crypto
  python-cffi-backend python-crypto python-cryptography python-dbus
  python-enum34 python-gi python-idna python-ipaddress python-keyring
  python-keyrings.alt python-pip python-pip-whl python-pkg-resources
  python-secretstorage python-setuptools python-six python-wheel python-xdg
0 upgraded, 22 newly installed, 0 to remove and 5 not upgraded.
Need to get 3,430 kB of archives.
After this operation, 10.8 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/main amd64 libpython-all-dev amd64 2.7.15~rc1-1 [1,092 B]
Get:2 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-all amd64 2.7.15~rc1-1 [1,076 B]
Get:3 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-all-dev amd64 2.7.15~rc1-1 [1,100 B]
Get:4 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-asn1crypto all 0.24.0-1 [72.7 kB]
Get:5 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-cffi-backend amd64 1.11.5-1 [63.4 kB]
Get:6 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-crypto amd64 2.6.1-8ubuntu2 [244 kB]
Get:7 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-enum34 all 1.1.6-2 [34.8 kB]
Get:8 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-idna all 2.6-1 [32.4 kB]
Get:9 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-ipaddress all 1.0.17-1 [18.2 kB]
Get:10 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-six all 1.11.0-2 [11.3 kB]
Get:11 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 python-cryptography amd64 2.1.4-1ubuntu1.4 [276 kB]
Get:12 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-dbus amd64 1.2.6-1 [90.2 kB]
Get:13 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 python-gi amd64 3.26.1-2ubuntu1 [197 kB]
Get:14 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-secretstorage all 2.3.1-2 [11.8 kB]
Get:15 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-keyring all 10.6.0-1 [30.6 kB]
Get:16 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-keyrings.alt all 3.0-1 [16.7 kB]
Get:17 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 python-pip-whl all 9.0.1-2.3~ubuntu1.18.04.5 [1,653 kB]
Get:18 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 python-pip all 9.0.1-2.3~ubuntu1.18.04.5 [151 kB]
Get:19 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-pkg-resources all 39.0.1-2 [128 kB]
Get:20 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-setuptools all 39.0.1-2 [329 kB]
Get:21 http://archive.ubuntu.com/ubuntu bionic/universe amd64 python-wheel all 0.30.0-0.2 [36.4 kB]
Get:22 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 python-xdg all 0.25-4ubuntu1.1 [31.2 kB]
Fetched 3,430 kB in 3s (1,219 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 22.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package libpython-all-dev:amd64.
(Reading database ... 123941 files and directories currently installed.)
Preparing to unpack .../00-libpython-all-dev_2.7.15~rc1-1_amd64.deb ...
Unpacking libpython-all-dev:amd64 (2.7.15~rc1-1) ...
Selecting previously unselected package python-all.
Preparing to unpack .../01-python-all_2.7.15~rc1-1_amd64.deb ...
Unpacking python-all (2.7.15~rc1-1) ...
Selecting previously unselected package python-all-dev.
Preparing to unpack .../02-python-all-dev_2.7.15~rc1-1_amd64.deb ...
Unpacking python-all-dev (2.7.15~rc1-1) ...
Selecting previously unselected package python-asn1crypto.
Preparing to unpack .../03-python-asn1crypto_0.24.0-1_all.deb ...
Unpacking python-asn1crypto (0.24.0-1) ...
Selecting previously unselected package python-cffi-backend.
Preparing to unpack .../04-python-cffi-backend_1.11.5-1_amd64.deb ...
Unpacking python-cffi-backend (1.11.5-1) ...
Selecting previously unselected package python-crypto.
Preparing to unpack .../05-python-crypto_2.6.1-8ubuntu2_amd64.deb ...
Unpacking python-crypto (2.6.1-8ubuntu2) ...
Selecting previously unselected package python-enum34.
Preparing to unpack .../06-python-enum34_1.1.6-2_all.deb ...
Unpacking python-enum34 (1.1.6-2) ...
Selecting previously unselected package python-idna.
Preparing to unpack .../07-python-idna_2.6-1_all.deb ...
Unpacking python-idna (2.6-1) ...
Selecting previously unselected package python-ipaddress.
Preparing to unpack .../08-python-ipaddress_1.0.17-1_all.deb ...
Unpacking python-ipaddress (1.0.17-1) ...
Selecting previously unselected package python-six.
Preparing to unpack .../09-python-six_1.11.0-2_all.deb ...
Unpacking python-six (1.11.0-2) ...
Selecting previously unselected package python-cryptography.
Preparing to unpack .../10-python-cryptography_2.1.4-1ubuntu1.4_amd64.deb ...
Unpacking python-cryptography (2.1.4-1ubuntu1.4) ...
Selecting previously unselected package python-dbus.
Preparing to unpack .../11-python-dbus_1.2.6-1_amd64.deb ...
Unpacking python-dbus (1.2.6-1) ...
Selecting previously unselected package python-gi.
Preparing to unpack .../12-python-gi_3.26.1-2ubuntu1_amd64.deb ...
Unpacking python-gi (3.26.1-2ubuntu1) ...
Selecting previously unselected package python-secretstorage.
Preparing to unpack .../13-python-secretstorage_2.3.1-2_all.deb ...
Unpacking python-secretstorage (2.3.1-2) ...
Selecting previously unselected package python-keyring.
Preparing to unpack .../14-python-keyring_10.6.0-1_all.deb ...
Unpacking python-keyring (10.6.0-1) ...
Selecting previously unselected package python-keyrings.alt.
Preparing to unpack .../15-python-keyrings.alt_3.0-1_all.deb ...
Unpacking python-keyrings.alt (3.0-1) ...
Selecting previously unselected package python-pip-whl.
Preparing to unpack .../16-python-pip-whl_9.0.1-2.3~ubuntu1.18.04.5_all.deb ...
Unpacking python-pip-whl (9.0.1-2.3~ubuntu1.18.04.5) ...
Selecting previously unselected package python-pip.
Preparing to unpack .../17-python-pip_9.0.1-2.3~ubuntu1.18.04.5_all.deb ...
Unpacking python-pip (9.0.1-2.3~ubuntu1.18.04.5) ...
Selecting previously unselected package python-pkg-resources.
Preparing to unpack .../18-python-pkg-resources_39.0.1-2_all.deb ...
Unpacking python-pkg-resources (39.0.1-2) ...
Selecting previously unselected package python-setuptools.
Preparing to unpack .../19-python-setuptools_39.0.1-2_all.deb ...
Unpacking python-setuptools (39.0.1-2) ...
Selecting previously unselected package python-wheel.
Preparing to unpack .../20-python-wheel_0.30.0-0.2_all.deb ...
Unpacking python-wheel (0.30.0-0.2) ...
Selecting previously unselected package python-xdg.
Preparing to unpack .../21-python-xdg_0.25-4ubuntu1.1_all.deb ...
Unpacking python-xdg (0.25-4ubuntu1.1) ...
Setting up python-idna (2.6-1) ...
Setting up python-pip-whl (9.0.1-2.3~ubuntu1.18.04.5) ...
Setting up python-asn1crypto (0.24.0-1) ...
Setting up python-crypto (2.6.1-8ubuntu2) ...
Setting up python-wheel (0.30.0-0.2) ...
Setting up libpython-all-dev:amd64 (2.7.15~rc1-1) ...
Setting up python-pkg-resources (39.0.1-2) ...
Setting up python-cffi-backend (1.11.5-1) ...
Setting up python-gi (3.26.1-2ubuntu1) ...
Setting up python-six (1.11.0-2) ...
Setting up python-enum34 (1.1.6-2) ...
Setting up python-dbus (1.2.6-1) ...
Setting up python-ipaddress (1.0.17-1) ...
Setting up python-pip (9.0.1-2.3~ubuntu1.18.04.5) ...
Setting up python-all (2.7.15~rc1-1) ...
Setting up python-xdg (0.25-4ubuntu1.1) ...
Setting up python-setuptools (39.0.1-2) ...
Setting up python-keyrings.alt (3.0-1) ...
Setting up python-all-dev (2.7.15~rc1-1) ...
Setting up python-cryptography (2.1.4-1ubuntu1.4) ...
Setting up python-secretstorage (2.3.1-2) ...
Setting up python-keyring (10.6.0-1) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (57.4.0)
Collecting setuptools
  Downloading setuptools-65.5.1-py3-none-any.whl (1.2 MB)
     |████████████████████████████████| 1.2 MB 28.3 MB/s 
Collecting optuna
  Downloading optuna-3.0.3-py3-none-any.whl (348 kB)
     |████████████████████████████████| 348 kB 82.7 MB/s 
Requirement already satisfied: plotly in /usr/local/lib/python3.7/dist-packages (5.5.0)
Collecting plotly
  Downloading plotly-5.11.0-py2.py3-none-any.whl (15.3 MB)
     |████████████████████████████████| 15.3 MB 72.5 MB/s 
Collecting eli5
  Downloading eli5-0.13.0.tar.gz (216 kB)
     |████████████████████████████████| 216 kB 73.3 MB/s 
Collecting shap
  Downloading shap-0.41.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (569 kB)
     |████████████████████████████████| 569 kB 77.8 MB/s 
Collecting lime
  Downloading lime-0.2.0.1.tar.gz (275 kB)
     |████████████████████████████████| 275 kB 73.5 MB/s 
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from optuna) (4.64.1)
Collecting alembic>=1.5.0
  Downloading alembic-1.8.1-py3-none-any.whl (209 kB)
     |████████████████████████████████| 209 kB 82.0 MB/s 
Collecting cliff
  Downloading cliff-3.10.1-py3-none-any.whl (81 kB)
     |████████████████████████████████| 81 kB 10.5 MB/s 
Requirement already satisfied: scipy<1.9.0,>=1.7.0 in /usr/local/lib/python3.7/dist-packages (from optuna) (1.7.3)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from optuna) (1.21.6)
Collecting cmaes>=0.8.2
  Downloading cmaes-0.9.0-py3-none-any.whl (23 kB)
Requirement already satisfied: importlib-metadata<5.0.0 in /usr/local/lib/python3.7/dist-packages (from optuna) (4.13.0)
Requirement already satisfied: sqlalchemy>=1.3.0 in /usr/local/lib/python3.7/dist-packages (from optuna) (1.4.42)
Requirement already satisfied: PyYAML in /usr/local/lib/python3.7/dist-packages (from optuna) (6.0)
Collecting colorlog
  Downloading colorlog-6.7.0-py2.py3-none-any.whl (11 kB)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.7/dist-packages (from optuna) (21.3)
Collecting Mako
  Downloading Mako-1.2.3-py3-none-any.whl (78 kB)
     |████████████████████████████████| 78 kB 7.6 MB/s 
Requirement already satisfied: importlib-resources in /usr/local/lib/python3.7/dist-packages (from alembic>=1.5.0->optuna) (5.10.0)
Requirement already satisfied: typing-extensions>=3.6.4 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata<5.0.0->optuna) (4.1.1)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata<5.0.0->optuna) (3.10.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging>=20.0->optuna) (3.0.9)
Requirement already satisfied: greenlet!=0.4.17 in /usr/local/lib/python3.7/dist-packages (from sqlalchemy>=1.3.0->optuna) (2.0.0.post0)
Requirement already satisfied: tenacity>=6.2.0 in /usr/local/lib/python3.7/dist-packages (from plotly) (8.1.0)
Requirement already satisfied: attrs>17.1.0 in /usr/local/lib/python3.7/dist-packages (from eli5) (22.1.0)
Collecting jinja2>=3.0.0
  Downloading Jinja2-3.1.2-py3-none-any.whl (133 kB)
     |████████████████████████████████| 133 kB 79.0 MB/s 
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from eli5) (1.15.0)
Requirement already satisfied: scikit-learn>=0.20 in /usr/local/lib/python3.7/dist-packages (from eli5) (1.0.2)
Requirement already satisfied: graphviz in /usr/local/lib/python3.7/dist-packages (from eli5) (0.10.1)
Requirement already satisfied: tabulate>=0.7.7 in /usr/local/lib/python3.7/dist-packages (from eli5) (0.8.10)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.7/dist-packages (from jinja2>=3.0.0->eli5) (2.0.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>=0.20->eli5) (3.1.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>=0.20->eli5) (1.2.0)
Collecting slicer==0.0.7
  Downloading slicer-0.0.7-py3-none-any.whl (14 kB)
Requirement already satisfied: numba in /usr/local/lib/python3.7/dist-packages (from shap) (0.56.4)
Requirement already satisfied: cloudpickle in /usr/local/lib/python3.7/dist-packages (from shap) (1.5.0)
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from shap) (1.3.5)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.7/dist-packages (from lime) (3.2.2)
Requirement already satisfied: scikit-image>=0.12 in /usr/local/lib/python3.7/dist-packages (from lime) (0.18.3)
Requirement already satisfied: pillow!=7.1.0,!=7.1.1,>=4.3.0 in /usr/local/lib/python3.7/dist-packages (from scikit-image>=0.12->lime) (7.1.2)
Requirement already satisfied: networkx>=2.0 in /usr/local/lib/python3.7/dist-packages (from scikit-image>=0.12->lime) (2.6.3)
Requirement already satisfied: tifffile>=2019.7.26 in /usr/local/lib/python3.7/dist-packages (from scikit-image>=0.12->lime) (2021.11.2)
Requirement already satisfied: PyWavelets>=1.1.1 in /usr/local/lib/python3.7/dist-packages (from scikit-image>=0.12->lime) (1.3.0)
Requirement already satisfied: imageio>=2.3.0 in /usr/local/lib/python3.7/dist-packages (from scikit-image>=0.12->lime) (2.9.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->lime) (1.4.4)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib->lime) (0.11.0)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->lime) (2.8.2)
Collecting pbr!=2.1.0,>=2.0.0
  Downloading pbr-5.11.0-py2.py3-none-any.whl (112 kB)
     |████████████████████████████████| 112 kB 68.9 MB/s 
Collecting autopage>=0.4.0
  Downloading autopage-0.5.1-py3-none-any.whl (29 kB)
Requirement already satisfied: PrettyTable>=0.7.2 in /usr/local/lib/python3.7/dist-packages (from cliff->optuna) (3.5.0)
Collecting stevedore>=2.0.1
  Downloading stevedore-3.5.2-py3-none-any.whl (50 kB)
     |████████████████████████████████| 50 kB 6.6 MB/s 
Collecting cmd2>=1.0.0
  Downloading cmd2-2.4.2-py3-none-any.whl (147 kB)
     |████████████████████████████████| 147 kB 77.1 MB/s 
Requirement already satisfied: wcwidth>=0.1.7 in /usr/local/lib/python3.7/dist-packages (from cmd2>=1.0.0->cliff->optuna) (0.2.5)
Collecting pyperclip>=1.6
  Downloading pyperclip-1.8.2.tar.gz (20 kB)
Requirement already satisfied: llvmlite<0.40,>=0.39.0dev0 in /usr/local/lib/python3.7/dist-packages (from numba->shap) (0.39.1)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas->shap) (2022.6)
Building wheels for collected packages: eli5, lime, pyperclip
  Building wheel for eli5 (setup.py) ... done
  Created wheel for eli5: filename=eli5-0.13.0-py2.py3-none-any.whl size=107748 sha256=53ec9507155788c5ef2efa597a3fd94d4109d778b5e6fea28eb40b814230d212
  Stored in directory: /root/.cache/pip/wheels/cc/3c/96/3ead31a8e6c20fc0f1a707fde2e05d49a80b1b4b30096573be
  Building wheel for lime (setup.py) ... done
  Created wheel for lime: filename=lime-0.2.0.1-py3-none-any.whl size=283857 sha256=a3a1cbe14bcf2df333b36d2cc42ed97a7ce8029a37d580902011cb04ce660dc0
  Stored in directory: /root/.cache/pip/wheels/ca/cb/e5/ac701e12d365a08917bf4c6171c0961bc880a8181359c66aa7
  Building wheel for pyperclip (setup.py) ... done
  Created wheel for pyperclip: filename=pyperclip-1.8.2-py3-none-any.whl size=11136 sha256=05a28b28ead19d2e27c09bb2d74c114cf8fd9267b007da5a789fe9e6985f5d3a
  Stored in directory: /root/.cache/pip/wheels/9f/18/84/8f69f8b08169c7bae2dde6bd7daf0c19fca8c8e500ee620a28
Successfully built eli5 lime pyperclip
Installing collected packages: pyperclip, pbr, stevedore, setuptools, Mako, cmd2, autopage, slicer, jinja2, colorlog, cmaes, cliff, alembic, shap, plotly, optuna, lime, eli5
  Attempting uninstall: setuptools
    Found existing installation: setuptools 57.4.0
    Uninstalling setuptools-57.4.0:
      Successfully uninstalled setuptools-57.4.0
  Attempting uninstall: jinja2
    Found existing installation: Jinja2 2.11.3
    Uninstalling Jinja2-2.11.3:
      Successfully uninstalled Jinja2-2.11.3
  Attempting uninstall: plotly
    Found existing installation: plotly 5.5.0
    Uninstalling plotly-5.5.0:
      Successfully uninstalled plotly-5.5.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ipython 7.9.0 requires jedi>=0.10, which is not installed.
notebook 5.7.16 requires jinja2<=3.0.0, but you have jinja2 3.1.2 which is incompatible.
flask 1.1.4 requires Jinja2<3.0,>=2.10.1, but you have jinja2 3.1.2 which is incompatible.
Successfully installed Mako-1.2.3 alembic-1.8.1 autopage-0.5.1 cliff-3.10.1 cmaes-0.9.0 cmd2-2.4.2 colorlog-6.7.0 eli5-0.13.0 jinja2-3.1.2 lime-0.2.0.1 optuna-3.0.3 pbr-5.11.0 plotly-5.11.0 pyperclip-1.8.2 setuptools-65.5.1 shap-0.41.0 slicer-0.0.7 stevedore-3.5.2

   Now, we can move to the python-package subdirectory and use setup.py to compile the dependencies needed so that data will be processed on GPUs when using the LightGBM package.

In [ ]:
%cd /content/drive/MyDrive/LightGBM/python-package

!sudo python3 setup.py install --precompile --gpu
/content/drive/MyDrive/LightGBM/python-package
INFO:root:running install
/usr/local/lib/python3.8/dist-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
INFO:root:running build
INFO:root:running build_py
INFO:root:running egg_info
INFO:root:writing lightgbm.egg-info/PKG-INFO
INFO:root:writing dependency_links to lightgbm.egg-info/dependency_links.txt
INFO:root:writing requirements to lightgbm.egg-info/requires.txt
INFO:root:writing top-level names to lightgbm.egg-info/top_level.txt
INFO:root:reading manifest template 'MANIFEST.in'
WARNING:root:no previously-included directories found matching 'build'
WARNING:root:warning: no files found matching 'LICENSE'
WARNING:root:warning: no files found matching '*.txt'
WARNING:root:warning: no files found matching '*.so' under directory 'lightgbm'
WARNING:root:warning: no files found matching 'compile/CMakeLists.txt'
WARNING:root:warning: no files found matching 'compile/cmake/IntegratedOpenCL.cmake'
WARNING:root:warning: no files found matching '*.so' under directory 'compile'
WARNING:root:warning: no files found matching '*.dll' under directory 'compile/Release'
WARNING:root:warning: no files found matching 'compile/external_libs/compute/CMakeLists.txt'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/compute/cmake'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/compute/include'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/compute/meta'
WARNING:root:warning: no files found matching 'compile/external_libs/eigen/CMakeLists.txt'
WARNING:root:warning: no files found matching 'compile/external_libs/eigen/Eigen/Cholesky'
WARNING:root:warning: no files found matching 'compile/external_libs/eigen/Eigen/Core'
WARNING:root:warning: no files found matching 'compile/external_libs/eigen/Eigen/Dense'
WARNING:root:warning: no files found matching 'compile/external_libs/eigen/Eigen/Eigenvalues'
WARNING:root:warning: no files found matching 'compile/external_libs/eigen/Eigen/Geometry'
WARNING:root:warning: no files found matching 'compile/external_libs/eigen/Eigen/Householder'
WARNING:root:warning: no files found matching 'compile/external_libs/eigen/Eigen/Jacobi'
WARNING:root:warning: no files found matching 'compile/external_libs/eigen/Eigen/LU'
WARNING:root:warning: no files found matching 'compile/external_libs/eigen/Eigen/QR'
WARNING:root:warning: no files found matching 'compile/external_libs/eigen/Eigen/SVD'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/eigen/Eigen/src/Cholesky'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/eigen/Eigen/src/Core'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/eigen/Eigen/src/Eigenvalues'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/eigen/Eigen/src/Geometry'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/eigen/Eigen/src/Householder'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/eigen/Eigen/src/Jacobi'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/eigen/Eigen/src/LU'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/eigen/Eigen/src/misc'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/eigen/Eigen/src/plugins'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/eigen/Eigen/src/QR'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/eigen/Eigen/src/SVD'
WARNING:root:warning: no files found matching 'compile/external_libs/fast_double_parser/CMakeLists.txt'
WARNING:root:warning: no files found matching 'compile/external_libs/fast_double_parser/LICENSE'
WARNING:root:warning: no files found matching 'compile/external_libs/fast_double_parser/LICENSE.BSL'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/fast_double_parser/include'
WARNING:root:warning: no files found matching 'compile/external_libs/fmt/CMakeLists.txt'
WARNING:root:warning: no files found matching 'compile/external_libs/fmt/LICENSE.rst'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/fmt/include'
WARNING:root:warning: no files found matching '*' under directory 'compile/include'
WARNING:root:warning: no files found matching '*' under directory 'compile/src'
WARNING:root:warning: no files found matching 'LightGBM.sln' under directory 'compile/windows'
WARNING:root:warning: no files found matching 'LightGBM.vcxproj' under directory 'compile/windows'
WARNING:root:warning: no files found matching '*.dll' under directory 'compile/windows/x64/DLL'
WARNING:root:warning: no previously-included files matching '*.py[co]' found anywhere in distribution
WARNING:root:warning: no previously-included files found matching 'compile/external_libs/compute/.git'
INFO:root:writing manifest file 'lightgbm.egg-info/SOURCES.txt'
INFO:root:copying lightgbm/VERSION.txt -> build/lib/lightgbm
INFO:root:running install_lib
INFO:root:creating /usr/lib/python3.8/site-packages
INFO:root:creating /usr/lib/python3.8/site-packages/lightgbm
INFO:root:copying build/lib/lightgbm/dask.py -> /usr/lib/python3.8/site-packages/lightgbm
INFO:root:copying build/lib/lightgbm/basic.py -> /usr/lib/python3.8/site-packages/lightgbm
INFO:root:copying build/lib/lightgbm/libpath.py -> /usr/lib/python3.8/site-packages/lightgbm
INFO:root:copying build/lib/lightgbm/callback.py -> /usr/lib/python3.8/site-packages/lightgbm
INFO:root:copying build/lib/lightgbm/sklearn.py -> /usr/lib/python3.8/site-packages/lightgbm
INFO:root:copying build/lib/lightgbm/plotting.py -> /usr/lib/python3.8/site-packages/lightgbm
INFO:root:copying build/lib/lightgbm/compat.py -> /usr/lib/python3.8/site-packages/lightgbm
INFO:root:copying build/lib/lightgbm/engine.py -> /usr/lib/python3.8/site-packages/lightgbm
INFO:root:copying build/lib/lightgbm/__init__.py -> /usr/lib/python3.8/site-packages/lightgbm
INFO:root:copying build/lib/lightgbm/VERSION.txt -> /usr/lib/python3.8/site-packages/lightgbm
INFO:LightGBM:Installing lib_lightgbm from: ['/content/drive/MyDrive/LightGBM/lib_lightgbm.so']
INFO:root:copying /content/drive/MyDrive/LightGBM/lib_lightgbm.so -> /usr/lib/python3.8/site-packages/lightgbm
INFO:root:byte-compiling /usr/lib/python3.8/site-packages/lightgbm/dask.py to dask.cpython-38.pyc
INFO:root:byte-compiling /usr/lib/python3.8/site-packages/lightgbm/basic.py to basic.cpython-38.pyc
INFO:root:byte-compiling /usr/lib/python3.8/site-packages/lightgbm/libpath.py to libpath.cpython-38.pyc
INFO:root:byte-compiling /usr/lib/python3.8/site-packages/lightgbm/callback.py to callback.cpython-38.pyc
INFO:root:byte-compiling /usr/lib/python3.8/site-packages/lightgbm/sklearn.py to sklearn.cpython-38.pyc
INFO:root:byte-compiling /usr/lib/python3.8/site-packages/lightgbm/plotting.py to plotting.cpython-38.pyc
INFO:root:byte-compiling /usr/lib/python3.8/site-packages/lightgbm/compat.py to compat.cpython-38.pyc
INFO:root:byte-compiling /usr/lib/python3.8/site-packages/lightgbm/engine.py to engine.cpython-38.pyc
INFO:root:byte-compiling /usr/lib/python3.8/site-packages/lightgbm/__init__.py to __init__.cpython-38.pyc
INFO:root:running install_egg_info
INFO:root:Copying lightgbm.egg-info to /usr/lib/python3.8/site-packages/lightgbm-3.3.3.99-py3.8.egg-info
INFO:root:running install_scripts

   The notebook containing the baseline models and hyperparameter searches using Optuna for the Upsampled and SMOTE sets is located here.

   Let's first set up the environment by importing the dependencies/declaring their options, setting the name of the environment, the random and numpy seed, followed by examining which CUDA compiler and GPU is available.

In [ ]:
import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')

seed_value = 42
os.environ['lgb_GPU'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)

!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
Wed Nov  9 04:35:09 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P8    11W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Upsampled: Baseline Model

   The train/test sets can be read into separate pandas.Dataframes, then the features as X_train or X_test and the labels as y_train or y_test can be separated for both sets. From the original training set, the label can be defined as loan_status, and the remaining as features for the train/test sets with the target removed and the feature column names extracted and converted to a numpy.array. Let's fit the baseline model using the default parameters and evaluate the model performance.

In [ ]:
import pandas as pd

trainDF = pd.read_csv('trainDF_US.csv', low_memory=False)
testDF = pd.read_csv('testDF_US.csv', low_memory=False)

X_train = trainDF.drop('loan_status', axis=1)
y_train = trainDF.loan_status
X_test = testDF.drop('loan_status', axis=1)
y_test = testDF.loan_status

label = trainDF[['loan_status']]
features = trainDF.drop(columns = ['loan_status'])
features = features.columns
features = np.array(features)
In [ ]:
import lightgbm as lgb
import pickle
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score
from sklearn.metrics import recall_score, f1_score

lgbc = lgb.LGBMClassifier(boosting_type='gbdt',
                          device='gpu',
                          gpu_platform_id=0,
                          gpu_device_id=0,
                          verbosity=-1)

lgbc.fit(X_train, y_train)

Pkl_Filename = 'lightGBM_Upsampling_Baseline.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(lgbc, file)

print('\nModel Metrics for LightGBM Baseline Upsampling')

y_train_pred = lgbc.predict(X_train)
y_train_pred = y_train_pred.round(2)
y_train_pred = np.where(y_train_pred > 0.5, 1, 0)

y_test_pred = lgbc.predict(X_test)
y_test_pred = y_test_pred.round(2)
y_test_pred = np.where(y_test_pred > 0.5, 1, 0)

print('Training Set')
print('Classification Report:')
clf_rpt = classification_report(y_train, y_train_pred)
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_train, y_train_pred)
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_train, y_train_pred))
print('Precision score : %.3f' % precision_score(y_train, y_train_pred))
print('Recall score : %.3f' % recall_score(y_train, y_train_pred))
print('F1 score : %.3f' % f1_score(y_train, y_train_pred))

print('\n')
print('Test Set')
print('Classification Report:')
clf_rpt = classification_report(y_test, y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test, y_test_pred))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test, y_test_pred))
print('Precision score : %.3f' % precision_score(y_test, y_test_pred))
print('Recall score : %.3f' % recall_score(y_test, y_test_pred))
print('F1 score : %.3f' % f1_score(y_test, y_test_pred))

Model Metrics for LightGBM Baseline Upsampling
Training Set
Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.99      0.96   1511066
           1       0.99      0.93      0.96   1511066

    accuracy                           0.96   3022132
   macro avg       0.96      0.96      0.96   3022132
weighted avg       0.96      0.96      0.96   3022132



Confusion matrix:
[[1495328   15738]
 [ 111369 1399697]]


Accuracy score : 0.958
Precision score : 0.989
Recall score : 0.926
F1 score : 0.957


Test Set
Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.99      0.99    377848
           1       0.93      0.93      0.93     54625

    accuracy                           0.98    432473
   macro avg       0.96      0.96      0.96    432473
weighted avg       0.98      0.98      0.98    432473



Confusion matrix:
[[373836   4012]
 [  3986  50639]]


Accuracy score : 0.982
Precision score : 0.927
Recall score : 0.927
F1 score : 0.927

   Overall, the model metrics are over 90%, so the baseline model performs quite well. The lowest metrics are the precision, recall and the F1 score, so let's try monitoring the log_loss to see if hyperparameter tuning results in better model performance.

1000 Trials 5-fold Cross Validation

   For hyperparameter tuning using Optuna, let's define a multi-faceted objective function for the optimization of the hyperparameters which uses joblib to save each trial parameters and the loss in a pickle file that can be reloaded to continue optimizing. Then we can define the parameter space of the hyperparameters to be tested:

  • random_state: seed_value designated for reproducibility. default=None.
  • device_type: Device for the tree learning. default=cpu
  • objective: Specify the learning task and the corresponding learning. objective or a custom objective function to be used. default=None or default=regression.
  • metric: Metric(s) to be evaluated on the evaluation set(s). default="".
  • boosting_type: (gbdt, gdbt_subsample, 0.5. default='gbdt') - traditional Gradient Boosting Decision Tree. dart, Dropouts meet Multiple Additive Regression Trees. rf, Random Forest.
  • n_estimators: Number of boosted trees to fit. default=100.
  • learning_rate: Boosting learning rate. default=0.1.
  • num_leaves: Maximum tree leaves for base learners. default=31.
  • bagging_freq: Frequency for bagging. default=0.
  • subsample: Randomly select part of data without resampling. default=1.
  • colsample_bytree: Subsample ratio of columns when constructing each tree. default=1.
  • max_depth: Maximum tree depth for base learners, <=0 means no limit. default=-1.
  • lambda_l1: L1 regularization. default=0.
  • lambda_l2: L2 regularization. default=0.
  • min_gain_to_split: The minimal gain to perform split. default=0.
  • min_child_samples: Minimal number of data in one leaf. Can be used to deal with over-fitting. default=20.
  • verbosity: Controls the level of LightGBM’s verbosity. default=1 (Info).

    We need to utilize the same k-folds for reproducibility, so let's use 5-fold as the number of folds which to split the training set into the train and validation sets with shuffle=True and the initial defined seed value. The LGBMClassifier model is specified to use the parameters within the params_lgb_optuna dictionary as well as with early_stopping_rounds=150. Each trial is timed and a prediction of the binary_log_loss is made using the created validation label and the predicted label. The mean score is tracked with the objective of obtaining the lowest loss using the callbacks=[LightGBMPruningCallback(trial, 'binary_logloss')].

In [ ]:
import optuna
from optuna import Trial
optuna.logging.set_verbosity(optuna.logging.WARNING)
from optuna.integration import LightGBMPruningCallback
import joblib
from sklearn.model_selection import KFold, cross_val_score
from timeit import default_timer as timer
import lightgbm as lgb
from sklearn.metrics import log_loss

def objective(trial):
    """
    Objective function to tune a LightGBMClassifier model.
    """
    joblib.dump(study, 'lightGBM_US_Optuna_1000_GPU.pkl')

    params_lgb_optuna = {
        'random_state': seed_value,
        'device_type': 'gpu',
        'objective': 'binary',
        'metric': 'binary_logloss',
        'boosting_type': 'gbdt',
        'n_estimators': trial.suggest_int('n_estimators', 100, 500, step=10),
        'learning_rate': trial.suggest_loguniform('learning_rate', 1e-6, 1e-1),
        'num_leaves': trial.suggest_int('num_leaves', 100, 1000, step=20),
        'bagging_freq': trial.suggest_int('bagging_freq', 3, 10),
        'subsample': trial.suggest_float('subsample', 0.5, 0.95),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.4, 0.9),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 1e-1, log=True),
        'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 1e-1,  log=True),
        'min_gain_to_split': trial.suggest_float('min_gain_to_split', 1, 15),
        'min_child_samples': trial.suggest_int('min_child_samples', 100, 500,
                                               step=10),
        'verbosity': -1}

    kf = KFold(n_splits=5, shuffle=True, random_state=seed_value)

    for idx, (trn_idx, val_idx) in enumerate(kf.split(trainDF[features],
                                                      label)):
        train_features, train_label = trainDF[features].iloc[trn_idx], label.iloc[trn_idx]
        val_features, val_label = trainDF[features].iloc[val_idx], label.iloc[val_idx]

        start = timer()
        model = lgb.LGBMClassifier(**params_lgb_optuna,
                                   early_stopping_rounds=150)

        cv_scores = np.empty(5)

        model.fit(train_features, train_label.values.ravel(),
                  eval_set = [(val_features, val_label.values.ravel())],
                  eval_metric='binary_logloss',
                  callbacks=[LightGBMPruningCallback(trial, 'binary_logloss')])

        y_pred_val = model.predict_proba(val_features)
        cv_scores[idx] = log_loss(val_label, y_pred_val)
        run_time = timer() - start

        return np.mean(cv_scores)

   We can leverage the use of if/else conditional statements to load the .pkl file if it all ready exists in the directory and if it doesn't, then create a new one for the optuna.study with the objective of minimizing the log_loss using 1000 trials.

In [ ]:
import time

search_time_start = time.time()
if os.path.isfile('lightGBM_US_Optuna_1000_GPU.pkl'):
    study = joblib.load('lightGBM_US_Optuna_1000_GPU.pkl')
else:
    study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=1000)

print('Time to run HPO:', time.time() - search_time_start)
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest LogLoss', study.best_value)
Time to run HPO: 30753.569622278214


Number of finished trials: 1000
Best trial: {'n_estimators': 400, 'learning_rate': 0.09968311861639294, 'num_leaves': 500, 'bagging_freq': 9, 'subsample': 0.8992734856248441, 'colsample_bytree': 0.878277507693008, 'max_depth': 12, 'lambda_l1': 5.255903820573677e-08, 'lambda_l2': 0.005095957109154312, 'min_gain_to_split': 1.0062350795323771, 'min_child_samples': 250}
Lowest LogLoss 0.008495173050732639

   Let's now extract the trial number, logloss and hyperparameter value into a dataframe and sort with the lowest loss first.

In [ ]:
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'logloss'}, inplace=True)
trials_df.rename(columns={'params_bagging_freq': 'bagging_freq'}, inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
                 inplace=True)
trials_df.rename(columns={'params_lambda_l1': 'lambda_l1'}, inplace=True)
trials_df.rename(columns={'params_lambda_l2': 'lambda_l2'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
                 inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_samples': 'min_child_samples'},
                 inplace=True)
trials_df.rename(columns={'params_min_gain_to_split': 'min_gain_to_split'},
                 inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_num_leaves': 'num_leaves'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('logloss', ascending=True)
print(trials_df)
     iteration     logloss             datetime_start  \
843        843    0.008495 2022-11-09 11:56:59.005979   
908        908    0.008504 2022-11-09 12:24:34.113709   
979        979    0.008766 2022-11-09 12:57:39.993860   
568        568    0.008830 2022-11-09 09:51:20.994162   
563        563    0.008995 2022-11-09 09:47:17.094289   
..         ...         ...                        ...   
67          67    0.693146 2022-11-09 05:23:54.452957   
349        349    0.693146 2022-11-09 08:15:35.739043   
669        669    0.693147 2022-11-09 10:41:29.925729   
18          18    0.693147 2022-11-09 04:57:09.308489   
16          16  311.015362 2022-11-09 04:54:33.587588   

             datetime_complete               duration  bagging_freq  \
843 2022-11-09 12:00:17.672499 0 days 00:03:18.666520             9   
908 2022-11-09 12:27:51.321102 0 days 00:03:17.207393             9   
979 2022-11-09 13:00:58.062928 0 days 00:03:18.069068             9   
568 2022-11-09 09:54:46.840009 0 days 00:03:25.845847             9   
563 2022-11-09 09:50:42.668510 0 days 00:03:25.574221             9   
..                         ...                    ...           ...   
67  2022-11-09 05:24:03.274637 0 days 00:00:08.821680             7   
349 2022-11-09 08:15:44.623111 0 days 00:00:08.884068            10   
669 2022-11-09 10:41:39.399734 0 days 00:00:09.474005            10   
18  2022-11-09 04:57:18.155107 0 days 00:00:08.846618             7   
16  2022-11-09 04:57:00.651299 0 days 00:02:27.063711             7   

     colsample_bytree     lambda_l1  lambda_l2  learning_rate  max_depth  \
843          0.878278  5.255904e-08   0.005096       0.099683         12   
908          0.890479  1.712589e-08   0.000031       0.099854         12   
979          0.881896  3.897024e-08   0.000042       0.098987         12   
568          0.875514  2.201301e-08   0.002067       0.099572         12   
563          0.876359  3.982179e-08   0.001305       0.099805         12   
..                ...           ...        ...            ...        ...   
67           0.778684  7.249932e-06   0.027874       0.000002         10   
349          0.899390  3.644665e-06   0.000668       0.000001         10   
669          0.863951  3.113517e-08   0.003268       0.000001         11   
18           0.486363  2.904571e-07   0.002417       0.000001         11   
16           0.821854  5.291917e-06   0.093657       0.090656         10   

     min_child_samples  min_gain_to_split  n_estimators  num_leaves  \
843                250           1.006235           400         500   
908                240           1.007996           400         580   
979                250           1.012956           400         520   
568                240           1.256333           410         520   
563                240           1.283104           420         500   
..                 ...                ...           ...         ...   
67                 160           3.671721           150         680   
349                310           1.255420           340         500   
669                230           7.265473           390         540   
18                 300           3.412177           400        1000   
16                 180           3.437757           500         300   

     subsample     state  
843   0.899273  COMPLETE  
908   0.797280  COMPLETE  
979   0.880821  COMPLETE  
568   0.852568  COMPLETE  
563   0.852764  COMPLETE  
..         ...       ...  
67    0.576261    PRUNED  
349   0.722635    PRUNED  
669   0.823530    PRUNED  
18    0.724683    PRUNED  
16    0.592605  COMPLETE  

[1000 rows x 17 columns]

   Optuna contains many ways to visualize the parameters tested during the search using the visualization module including plot_parallel_coordinate, plot_slice, plot_contour, plot_param_importances, plot_optimization_history, plot_slice and plot_edf. Let's utilize some of these components to examine the hyperparameters tested starting with plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search and plot_param_importances to plot the importance of the hyperparameters.

In [ ]:
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()

   The darker the blue signifies lower values for the log_loss and where a higher number of trials gravitated during the study

   To note, the scatterplot of the regularization hyperparameters (lambda_l1, lambda_l2) did not reveal trends over the trials iterations, which might have been affected by the wide range (0-0.1).

   Let's now visualize the parameter importances by utilizing the plot_param_importances feature from the Optuna package.

In [ ]:
fig = optuna.visualization.plot_param_importances(study)
fig.show()

   The most important hyperparameters for the search using the Upsampled set are n_estimators = 0.31, min_child_samples = 0.21 and lambda_l2 = 0.20. We will need to compare this order and magnitude with the results from the search using the SMOTE set.

   Next, let's arrange the best parameters to re-create the best model and fit using the training data and save it as a .pkl file. Then, we can evaluate the model metrics for both the training and the test sets using the classification report, confusion matrix as well as accuracy, precision, recall and F1 score from sklearn.metrics.

In [ ]:
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'binary_error'
params
Out[ ]:
{'n_estimators': 400,
 'learning_rate': 0.09968311861639294,
 'num_leaves': 500,
 'bagging_freq': 9,
 'subsample': 0.8992734856248441,
 'colsample_bytree': 0.878277507693008,
 'max_depth': 12,
 'lambda_l1': 5.255903820573677e-08,
 'lambda_l2': 0.005095957109154312,
 'min_gain_to_split': 1.0062350795323771,
 'min_child_samples': 250,
 'random_state': 42,
 'metric': 'binary_error'}
In [ ]:
import pickle
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score
from sklearn.metrics import recall_score, f1_score

train_label = trainDF[['loan_status']]
train_features = trainDF.drop(columns = ['loan_status'])

test_features = testDF.drop(columns = ['loan_status'])

best_model = lgb.LGBMClassifier(boosting_type='gbdt',
                                device='gpu',
                                gpu_platform_id=0,
                                gpu_device_id=0,
                                verbosity=-1,
                                **params)

best_model.fit(train_features, train_label.values.ravel())

Pkl_Filename = 'lightGBM_Upsampling_Optuna_trials1000_GPU.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_model, file)

print('\nModel Metrics for lightGBM HPO Upsampling 1000 GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('\n')
print('Classification Report: Train')
clf_rpt = classification_report(y_train, y_train_pred)
print(clf_rpt)
print('\n')
print('Confusion matrix: Train')
print(confusion_matrix(y_train, y_train_pred))
print('\n')
print('Classification Report: Test')
clf_rpt = classification_report(y_test, y_test_pred)
print(clf_rpt)
print('\n')
print('Confusion matrix: Test')
print(confusion_matrix(y_test, y_test_pred))
print('\n')
print('Accuracy score: train: %.3f, test: %.3f' % (
        accuracy_score(y_train, y_train_pred),
        accuracy_score(y_test, y_test_pred)))
print('Precision score: train: %.3f, test: %.3f' % (
        precision_score(y_train, y_train_pred),
        precision_score(y_test, y_test_pred)))
print('Recall score: train: %.3f, test: %.3f' % (
        recall_score(y_train, y_train_pred),
        recall_score(y_test, y_test_pred)))
print('F1 score: train: %.3f, test: %.3f' % (
        f1_score(y_train, y_train_pred),
        f1_score(y_test, y_test_pred)))

Model Metrics for lightGBM HPO Upsampling 1000 GPU trials


Classification Report: Train
              precision    recall  f1-score   support

           0       0.98      1.00      0.99   1511066
           1       1.00      0.98      0.99   1511066

    accuracy                           0.99   3022132
   macro avg       0.99      0.99      0.99   3022132
weighted avg       0.99      0.99      0.99   3022132



Confusion matrix: Train
[[1509224    1842]
 [  25073 1485993]]


Classification Report: Test
              precision    recall  f1-score   support

           0       0.99      1.00      0.99    377848
           1       0.98      0.92      0.95     54625

    accuracy                           0.99    432473
   macro avg       0.98      0.96      0.97    432473
weighted avg       0.99      0.99      0.99    432473



Confusion matrix: Test
[[376602   1246]
 [  4166  50459]]


Accuracy score: train: 0.991, test: 0.987
Precision score: train: 0.999, test: 0.976
Recall score: train: 0.983, test: 0.924
F1 score: train: 0.991, test: 0.949

   All of the model metrics, besides the recall score, increased when comparing to the baseline model. The reduction in recall score is 0.03, so it might be potentially neglible.

   Let's now evaluate the predictive probability on the test set.

In [ ]:
from sklearn.metrics import roc_auc_score

print('The best model from optimization scores {:.5f} Accuracy on the test set.'.format(accuracy_score(y_test,
                                                                                                       y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
The best model from optimization scores 0.98749 Accuracy on the test set.
This was achieved using these conditions:
iteration                                   843
logloss                                0.008495
datetime_start       2022-11-09 11:56:59.005979
datetime_complete    2022-11-09 12:00:17.672499
duration                 0 days 00:03:18.666520
bagging_freq                                  9
colsample_bytree                       0.878278
lambda_l1                                   0.0
lambda_l2                              0.005096
learning_rate                          0.099683
max_depth                                    12
min_child_samples                           250
min_gain_to_split                      1.006235
n_estimators                                400
num_leaves                                  500
subsample                              0.899273
state                                  COMPLETE
Name: 0, dtype: object

Model Explanations

SHAP (SHapley Additive exPlanations)

   Let's use the shap.TreeExplainer using LightGBM and generate shap_values for the features. The summary_plot shows the global importance of each feature and the distribution of the effect sizes over the set. Let's use the training set and summarize the effects of all the features.

In [ ]:
import shap

shap.initjs()
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(train_features)
In [ ]:
shap.summary_plot(shap_values, train_features, show=False);

   Let's use the test set and summarize the effects of all the features.

In [ ]:
sshap.initjs()
shap_values = explainer.shap_values(test_features)

shap.summary_plot(shap_values, test_features, show=False);

Model Metrics with Eli5

   Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.

In [ ]:
import eli5
from eli5.sklearn import PermutationImportance

X_test1 = pd.DataFrame(test_features, columns=test_features.columns)

perm_importance = PermutationImportance(best_model,
                                        random_state=seed_value).fit(test_features,
                                                                     y_test)
html_obj = eli5.show_weights(perm_importance,
                             feature_names=X_test1.columns.tolist())
html_obj
Out[ ]:
Weight Feature
0.2488 ± 0.0010 out_prncp
0.2400 ± 0.0006 total_pymnt
0.0668 ± 0.0002 recoveries
0.0664 ± 0.0005 loan_amnt
0.0536 ± 0.0002 installment
0.0472 ± 0.0004 last_pymnt_amnt
0.0232 ± 0.0003 total_rec_int
0.0158 ± 0.0003 int_rate
0.0027 ± 0.0002 total_rec_late_fee
0.0003 ± 0.0001 term_ 60 months
0.0003 ± 0.0000 grade_D
0.0002 ± 0.0001 num_bc_tl
0.0002 ± 0.0001 total_bc_limit
0.0001 ± 0.0000 num_bc_sats
0.0001 ± 0.0001 bc_open_to_buy
0.0001 ± 0.0001 num_il_tl
0.0001 ± 0.0000 application_type_Joint App
0.0000 ± 0.0000 mths_since_recent_bc
0.0000 ± 0.0000 home_ownership_RENT
0.0000 ± 0.0000 percent_bc_gt_75
… 30 more …

   We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).

In [ ]:
from eli5.formatters import format_as_dataframe

explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
Out[ ]:
feature weight std
0 out_prncp 0.248844 0.000493
1 total_pymnt 0.239985 0.000317
2 recoveries 0.066753 0.000106
3 loan_amnt 0.066444 0.000246
4 installment 0.053564 0.000083
5 last_pymnt_amnt 0.047178 0.000216
6 total_rec_int 0.023249 0.000131
7 int_rate 0.015755 0.000130
8 total_rec_late_fee 0.002742 0.000090
9 term_ 60 months 0.000297 0.000039
10 grade_D 0.000250 0.000010
11 num_bc_tl 0.000154 0.000051
12 total_bc_limit 0.000151 0.000050
13 num_bc_sats 0.000066 0.000024
14 bc_open_to_buy 0.000062 0.000033
15 num_il_tl 0.000059 0.000034
16 application_type_Joint App 0.000050 0.000016
17 mths_since_recent_bc 0.000043 0.000013
18 home_ownership_RENT 0.000038 0.000013
19 percent_bc_gt_75 0.000030 0.000022

   The features with the highest weights are out_prncp (0.248844), total_pymnt (0.239985) and recoveries (0.066753).

   Now we can use ELI5 to show the prediction.

In [ ]:
html_obj2 = eli5.show_prediction(best_model, test_features.iloc[1],
                                 show_feature_values=True)
html_obj2
Out[ ]:

y=0 (probability 0.966, score -3.347) top features

Contribution? Feature Value
+4.700 out_prncp 13241.960
+3.567 recoveries 0.000
+0.870 disbursement_method_DirectPay 1.000
+0.202 grade_B 1.000
+0.181 int_rate 12.730
+0.178 total_rec_late_fee 0.000
+0.173 acc_open_past_24mths 3.000
+0.099 installment 327.920
+0.075 grade_C 0.000
+0.051 home_ownership_RENT 0.000
+0.045 total_rec_int 1016.890
+0.034 purpose_credit_card 1.000
+0.027 num_bc_sats 5.000
+0.025 num_bc_tl 6.000
+0.022 home_ownership_MORTGAGE 1.000
+0.017 pub_rec 0.000
+0.012 delinq_2yrs 0.000
+0.011 num_actv_rev_tl 5.000
+0.010 initial_list_status_w 1.000
+0.010 grade_D 0.000
+0.009 tot_hi_cred_lim 162071.000
+0.007 mort_acc 1.000
+0.003 verification_status_Source Verified 0.000
+0.000 num_tl_30dpd 0.000
-0.000 collections_12_mths_ex_med 0.000
-0.000 delinq_amnt 0.000
-0.000 revol_util 56.500
-0.001 tax_liens 0.000
-0.014 application_type_Joint App 0.000
-0.025 loan_amnt 14500.000
-0.027 tot_coll_amt 0.000
-0.037 num_il_tl 10.000
-0.046 num_accts_ever_120_pd 0.000
-0.051 percent_bc_gt_75 40.000
-0.058 num_tl_op_past_12m 3.000
-0.060 revol_bal 20454.000
-0.070 mths_since_recent_bc 9.000
-0.073 num_sats 12.000
-0.082 verification_status_Verified 1.000
-0.119 total_bc_limit 30300.000
-0.121 total_bal_ex_mort 73529.000
-0.128 bc_open_to_buy 10970.000
-0.144 term_ 60 months 1.000
-0.188 annual_inc 35000.000
-0.221 mo_sin_old_rev_tl_op 164.000
-0.386 inq_last_6mths 2.000
-0.444 total_pymnt 2274.930
-1.602 last_pymnt_amnt 327.920
-3.083 <BIAS> 1.000

   For the loans to be Paid Off or Current, the features out_prncp and recoveries while last_pymnt_amnt and total_pymnt with lower values are not beneficial.

SMOTE: Baseline Model

   Let's now proceed with using similar methods for the SMOTE set by first reading the data, defining the label and features for the train/test sets, removing the target, extracting the feature names and fitting the baseline model with the default parameters.

In [ ]:
trainDF = pd.read_csv('trainDF_SMOTE.csv', low_memory=False)
testDF = pd.read_csv('testDF_SMOTE.csv', low_memory=False)

X_train = trainDF.drop('loan_status', axis=1)
y_train = trainDF.loan_status
X_test = testDF.drop('loan_status', axis=1)
y_test = testDF.loan_status

label = trainDF[['loan_status']]
features = trainDF.drop(columns = ['loan_status'])
feature_names = list(features.columns)
features = features.columns
features = np.array(features)
In [ ]:
lgbc = lgb.LGBMClassifier(boosting_type='gbdt',
                          device='gpu',
                          gpu_platform_id=0,
                          gpu_device_id=0,
                          verbosity=-1)

lgbc.fit(X_train, y_train)

Pkl_Filename = 'lightGBM_SMOTE_Baseline.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(lgbc, file)

print('\nModel Metrics for LightGBM Baseline SMOTE')
y_train_pred = lgbc.predict(X_train)
y_train_pred = y_train_pred.round(2)
y_train_pred = np.where(y_train_pred > 0.5, 1, 0)

y_test_pred = lgbc.predict(X_test)
y_test_pred = y_test_pred.round(2)
y_test_pred = np.where(y_test_pred > 0.5, 1, 0)

print('Training Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_train), asnumpy(y_train_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_train),
                                               asnumpy(y_train_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_train),
                                                 asnumpy(y_train_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_train),
                                           asnumpy(y_train_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_train), asnumpy(y_train_pred)))

print('\n')
print('Test Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
                                               asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
                                                 asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
                                           asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))

Model Metrics for LightGBM Baseline SMOTE
Training Set
Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99   1511066
           1       1.00      0.98      0.99   1511066

    accuracy                           0.99   3022132
   macro avg       0.99      0.99      0.99   3022132
weighted avg       0.99      0.99      0.99   3022132



Confusion matrix:
[[1510899     167]
 [  24667 1486399]]


Accuracy score : 0.992
Precision score : 1.000
Recall score : 0.984
F1 score : 0.992


Test Set
Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99    377848
           1       1.00      0.89      0.94     54625

    accuracy                           0.99    432473
   macro avg       0.99      0.95      0.97    432473
weighted avg       0.99      0.99      0.99    432473



Confusion matrix:
[[377806     42]
 [  5741  48884]]


Accuracy score : 0.987
Precision score : 0.999
Recall score : 0.895
F1 score : 0.944

   Compared to the baseline model metrics for the Upsampled set using the default LightGBM parameters with GPU support, the accuracy, precision and F1 scores are higher but the recall score is lower.

1000 Trials 5-Fold Cross Validation

   Let's define a new objective function for the optimization of hyperparameters with a new pickle file that joblib dumps if it is present, then uses the same parameter space, the same k-folds for reproducibility and the same structure that was utilized for the Upsampled LightGBM search using Optuna.

In [ ]:
def objective(trial):
    """
    Objective function to tune a LightGBMClassifier model.
    """
    joblib.dump(study, 'lightGBM_GPU_HPO_SMOTE_Optuna_1000.pkl')

    params_lgb_optuna = {
        'random_state': seed_value,
        'device_type': 'gpu',
        'objective': 'binary',
        'metric': 'binary_logloss',
        'boosting_type': 'gbdt',
        'n_estimators': trial.suggest_int('n_estimators', 100, 500, step=10),
        'learning_rate': trial.suggest_loguniform('learning_rate', 1e-6, 1e-1),
        'num_leaves': trial.suggest_int('num_leaves', 100, 1000, step=20),
        'bagging_freq': trial.suggest_int('bagging_freq', 3, 10),
        'subsample': trial.suggest_float('subsample', 0.5, 0.95),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.4, 0.9),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 1e-1, log=True),
        'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 1e-1,  log=True),
        'min_gain_to_split': trial.suggest_float('min_gain_to_split', 1, 15),
        'min_child_samples': trial.suggest_int('min_child_samples', 100, 500,
                                               step=10),
        'verbosity': -1}

    kf = KFold(n_splits=5, shuffle=True, random_state=seed_value)

    for idx, (trn_idx, val_idx) in enumerate(kf.split(trainDF[features],
                                                      label)):
        train_features, train_label = trainDF[features].iloc[trn_idx], label.iloc[trn_idx]
        val_features, val_label = trainDF[features].iloc[val_idx], label.iloc[val_idx]

        start = timer()
        model = lgb.LGBMClassifier(**params_lgb_optuna,
                                   early_stopping_rounds=150)

        cv_scores = np.empty(5)

        model.fit(train_features, train_label.values.ravel(),
                  eval_set = [(val_features, val_label.values.ravel())],
                  eval_metric='binary_logloss',
                  callbacks=[LightGBMPruningCallback(trial, 'binary_logloss')])

        y_pred_val = model.predict_proba(val_features)
        cv_scores[idx] = log_loss(val_label, y_pred_val)
        run_time = timer() - start

        return np.mean(cv_scores)

   Let's now begin the HPO trials creating a new Optuna.study that can be compared with the results from the search using the Upsampled set.

In [ ]:
from datetime import datetime, timedelta

start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('lightGBM_GPU_HPO_SMOTE_Optuna_1000.pkl'):
    study = joblib.load('lightGBM_GPU_HPO_SMOTE_Optuna_1000.pkl')
else:
    study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=1000)
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest LogLoss', study.best_value)
Start Time           2022-04-08 04:59:40.067174
End Time             2022-04-08 09:59:50.009017
5:00:09


Number of finished trials: 1000
Best trial: {'n_estimators': 170, 'learning_rate': 0.09981693737153056, 'num_leaves': 320, 'bagging_freq': 7, 'subsample': 0.6623669608159025, 'colsample_bytree': 0.40828008982810615, 'max_depth': 12, 'lambda_l1': 6.296534376034453e-07, 'lambda_l2': 2.223229202863544e-07, 'min_gain_to_split': 13.413959754649742, 'min_child_samples': 230}
Lowest LogLoss 0.09239519027116873

   Now, we can extract the trial number, logloss and hyperparameter value into a dataframe and sort with the lowest loss first.

In [ ]:
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'logloss'}, inplace=True)
trials_df.rename(columns={'params_bagging_freq': 'bagging_freq'}, inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
                 inplace=True)
trials_df.rename(columns={'params_lambda_l1': 'lambda_l1'}, inplace=True)
trials_df.rename(columns={'params_lambda_l2': 'lambda_l2'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
                 inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_samples': 'min_child_samples'},
                 inplace=True)
trials_df.rename(columns={'params_min_gain_to_split': 'min_gain_to_split'},
                 inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_num_leaves': 'num_leaves'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('logloss', ascending=True)
print(trials_df)
     iteration     logloss             datetime_start  \
598        598    0.092395 2022-04-08 08:24:19.762493   
889        889    0.092416 2022-04-08 09:30:48.338655   
651        651    0.092425 2022-04-08 08:35:18.331681   
535        535    0.092430 2022-04-08 08:10:59.355493   
472        472    0.092433 2022-04-08 07:53:53.453169   
..         ...         ...                        ...   
11          11  229.110890 2022-04-08 05:07:47.872529   
13          13  229.114920 2022-04-08 05:09:02.613496   
41          41  278.096703 2022-04-08 05:22:25.536570   
46          46  278.099300 2022-04-08 05:24:19.341644   
44          44  278.099655 2022-04-08 05:23:31.061596   

             datetime_complete               duration  bagging_freq  \
598 2022-04-08 08:24:58.737481 0 days 00:00:38.974988             7   
889 2022-04-08 09:31:31.652790 0 days 00:00:43.314135             5   
651 2022-04-08 08:36:00.010968 0 days 00:00:41.679287             7   
535 2022-04-08 08:11:33.161178 0 days 00:00:33.805685             6   
472 2022-04-08 07:54:36.714005 0 days 00:00:43.260836             5   
..                         ...                    ...           ...   
11  2022-04-08 05:08:21.559008 0 days 00:00:33.686479             5   
13  2022-04-08 05:09:44.563987 0 days 00:00:41.950491             6   
41  2022-04-08 05:23:16.353031 0 days 00:00:50.816461             8   
46  2022-04-08 05:24:46.472107 0 days 00:00:27.130463             8   
44  2022-04-08 05:24:12.231764 0 days 00:00:41.170168             9   

     colsample_bytree     lambda_l1     lambda_l2  learning_rate  max_depth  \
598          0.408280  6.296534e-07  2.223229e-07       0.099817         12   
889          0.408859  3.088437e-07  7.881006e-07       0.099912         12   
651          0.408101  9.794762e-07  3.097203e-07       0.099811         12   
535          0.400124  4.964825e-07  2.494315e-07       0.099781         12   
472          0.408015  1.384088e-06  1.546964e-06       0.099896         12   
..                ...           ...           ...            ...        ...   
11           0.885977  6.906449e-07  2.554995e-07       0.097856          5   
13           0.898451  9.284701e-05  3.325110e-07       0.010459          3   
41           0.448133  1.009473e-08  1.095265e-04       0.035481          9   
46           0.451546  2.165970e-08  5.747619e-06       0.058348          7   
44           0.522288  2.038278e-08  1.083951e-02       0.055620         10   

     min_child_samples  min_gain_to_split  n_estimators  num_leaves  \
598                230          13.413960           170         320   
889                180          11.248189           340         420   
651                210          12.381289           180         560   
535                240          12.470946           100         560   
472                290          12.852333           190         540   
..                 ...                ...           ...         ...   
11                 100          13.918849           490         580   
13                 250           4.086732           390         440   
41                 420           9.842782           420         200   
46                 320          12.119606           100         620   
44                 360           8.515117           470         260   

     subsample     state  
598   0.662367  COMPLETE  
889   0.741477  COMPLETE  
651   0.759621  COMPLETE  
535   0.830529  COMPLETE  
472   0.764925  COMPLETE  
..         ...       ...  
11    0.697973  COMPLETE  
13    0.642601  COMPLETE  
41    0.624655  COMPLETE  
46    0.626739  COMPLETE  
44    0.501282  COMPLETE  

[1000 rows x 17 columns]

   Let's utilize plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search and plot_contour to plot the parameter relationship with a contours.

In [ ]:
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()

   The slice plot was not informative and the scatterplot of the regularization hyperparameters (lambda_l1, lambda_l2) did not reveal trends over the trials iterations, which might have been affected by the widerange (0-0.1).

   Let's now visualize the parameter importances by utilizing the plot_param_importances feature from the Optuna package.

In [ ]:
fig = optuna.visualization.plot_param_importances(study)
fig.show()

   The most important hyperparameters are colsample_bytree = 0.32, max_depth - 0.18, min_child_samples = 0.12, learning_rate = 0.11. These are different than the Upsampled results which are n_estimators = 0.31, min_child_samples = 0.21, lambda_l2 = 0.20. For the Upsampled set, the importance for these are colsample_bytree = 0.03, max_depth - 0.01, min_child_samples = 0.21, learning_rate = 0.09. For the converse, the importance of the most important hyperparameters for the Upsampled study are n_estimators = 0.06, min_child_samples = 0.12, lambda_l2 = 0.06.

   Next, let's arrange the best parameters to re-create the best model and fit using the training data and save it as a .pkl file. Then, we can evaluate the model metrics for both the training and the test sets using the classification report, confusion matrix as well as accuracy, precision, recall and F1 score from sklearn.metrics.

In [ ]:
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'binary_error'
params
Out[ ]:
{'n_estimators': 170,
 'learning_rate': 0.09981693737153056,
 'num_leaves': 320,
 'bagging_freq': 7,
 'subsample': 0.6623669608159025,
 'colsample_bytree': 0.40828008982810615,
 'max_depth': 12,
 'lambda_l1': 6.296534376034453e-07,
 'lambda_l2': 2.223229202863544e-07,
 'min_gain_to_split': 13.413959754649742,
 'min_child_samples': 230,
 'random_state': 42,
 'metric': 'binary_error'}
In [ ]:
train_label = trainDF[['loan_status']]
train_features = trainDF.drop(columns = ['loan_status'])
test_features = testDF.drop(columns = ['loan_status'])

best_model = lgb.LGBMClassifier(boosting_type='gbdt',
                                device='gpu',
                                gpu_platform_id=0,
                                gpu_device_id=0,
                                verbosity=-1,
                                **params)

best_model.fit(train_features, train_label.values.ravel())

Pkl_Filename = 'lightGBM_HPO_Optuna_SMOTE_trials1000_GPU.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_model, file)

print('\nModel Metrics for lightGBM HPO SMOTE 1000 GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('\n')
print('Classification Report: Train')
clf_rpt = classification_report(y_train, y_train_pred)
print(clf_rpt)
print('\n')
print('Confusion matrix: Train')
print(confusion_matrix(y_train, y_train_pred))
print('\n')
print('Classification Report: Test')
clf_rpt = classification_report(y_test, y_test_pred)
print(clf_rpt)
print('\n')
print('Confusion matrix: Test')
print(confusion_matrix(y_test, y_test_pred))
print('\n')
print('Accuracy score: train: %.3f, test: %.3f' % (
        accuracy_score(y_train, y_train_pred),
        accuracy_score(y_test, y_test_pred)))
print('Precision score: train: %.3f, test: %.3f' % (
        precision_score(y_train, y_train_pred),
        precision_score(y_test, y_test_pred)))
print('Recall score: train: %.3f, test: %.3f' % (
        recall_score(y_train, y_train_pred),
        recall_score(y_test, y_test_pred)))
print('F1 score: train: %.3f, test: %.3f' % (
        f1_score(y_train, y_train_pred),
        f1_score(y_test, y_test_pred)))

Model Metrics for lightGBM HPO SMOTE 1000 GPU trials


Classification Report: Train
              precision    recall  f1-score   support

           0       0.98      1.00      0.99   1511066
           1       1.00      0.98      0.99   1511066

    accuracy                           0.99   3022132
   macro avg       0.99      0.99      0.99   3022132
weighted avg       0.99      0.99      0.99   3022132



Confusion matrix: Train
[[1510709     357]
 [  23283 1487783]]


Classification Report: Test
              precision    recall  f1-score   support

           0       0.99      1.00      0.99    377848
           1       1.00      0.90      0.95     54625

    accuracy                           0.99    432473
   macro avg       0.99      0.95      0.97    432473
weighted avg       0.99      0.99      0.99    432473



Confusion matrix: Test
[[377731    117]
 [  5563  49062]]


Accuracy score: train: 0.992, test: 0.987
Precision score: train: 1.000, test: 0.998
Recall score: train: 0.985, test: 0.898
F1 score: train: 0.992, test: 0.945

   Compared to the baseline LightGBM model for the SMOTE set, there was not a large change in the model metrics, and the observed changed potentially are neglible. Compared to the tuned LightGBM model using the Upsampled set, there might be a better precision but a worse recall and maybe F1 score.

   Let's now evaluate the predictive probability on the test set.

In [ ]:
print('The best model from optimization scores {:.5f} Accuracy on the test set.'.format(accuracy_score(y_test,
                                                                                                       y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
The best model from optimization scores 0.98687 Accuracy on the test set.
This was achieved using these conditions:
iteration                                   598
logloss                                0.092395
datetime_start       2022-04-08 08:24:19.762493
datetime_complete    2022-04-08 08:24:58.737481
duration                 0 days 00:00:38.974988
bagging_freq                                  7
colsample_bytree                        0.40828
lambda_l1                              0.000001
lambda_l2                                   0.0
learning_rate                          0.099817
max_depth                                    12
min_child_samples                           230
min_gain_to_split                      13.41396
n_estimators                                170
num_leaves                                  320
subsample                              0.662367
state                                  COMPLETE
Name: 0, dtype: object

Model Explanations

SHAP (SHapley Additive exPlanations)

   Let's use the training set and summarize the effects of all the features.

In [ ]:
shap.initjs()
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(train_features)
LightGBM binary classifier with TreeExplainer shap values output has changed to a list of ndarray
In [ ]:
shap.summary_plot(shap_values, train_features, show=False);

   Now, we can use the test set and summarize the effects of all the features.

In [ ]:
shap.initjs()
shap_values = explainer.shap_values(test_features)
LightGBM binary classifier with TreeExplainer shap values output has changed to a list of ndarray
In [ ]:
shap.summary_plot(shap_values, test_features, show=False);

Model Metrics with Eli5

   Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.

In [ ]:
X_test1 = pd.DataFrame(test_features, columns=test_features.columns)

perm_importance = PermutationImportance(best_model,
                                        random_state=seed_value).fit(test_features,
                                                                     y_test)
html_obj = eli5.show_weights(perm_importance,
                             feature_names=X_test1.columns.tolist())
html_obj
Out[ ]:
Weight Feature
0.2236 ± 0.0010 out_prncp
0.0802 ± 0.0002 total_pymnt
0.0700 ± 0.0002 recoveries
0.0165 ± 0.0001 loan_amnt
0.0162 ± 0.0003 last_pymnt_amnt
0.0028 ± 0.0001 installment
0.0025 ± 0.0001 total_rec_int
0.0007 ± 0.0000 total_rec_late_fee
0.0003 ± 0.0000 term_ 60 months
0.0000 ± 0.0000 grade_B
0.0000 ± 0.0000 verification_status_Source Verified
0.0000 ± 0.0000 int_rate
0.0000 ± 0.0000 annual_inc
0.0000 ± 0.0000 disbursement_method_DirectPay
0.0000 ± 0.0000 num_actv_rev_tl
0.0000 ± 0.0000 pub_rec
0.0000 ± 0.0000 num_bc_tl
0.0000 ± 0.0000 revol_util
0.0000 ± 0.0000 bc_open_to_buy
0.0000 ± 0.0000 mort_acc
… 30 more …

   We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).

In [ ]:
explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
Out[ ]:
feature weight std
0 out_prncp 2.235506e-01 0.000501
1 total_pymnt 8.015483e-02 0.000104
2 recoveries 6.999142e-02 0.000118
3 loan_amnt 1.653514e-02 0.000047
4 last_pymnt_amnt 1.620309e-02 0.000165
5 installment 2.760866e-03 0.000071
6 total_rec_int 2.526400e-03 0.000063
7 total_rec_late_fee 7.403930e-04 0.000015
8 term_ 60 months 3.112333e-04 0.000021
9 grade_B 3.283442e-05 0.000010
10 verification_status_Source Verified 1.988563e-05 0.000012
11 int_rate 1.757335e-05 0.000023
12 annual_inc 1.479861e-05 0.000007
13 disbursement_method_DirectPay 1.341124e-05 0.000005
14 num_actv_rev_tl 6.011936e-06 0.000008
15 pub_rec 2.774740e-06 0.000005
16 num_bc_tl 2.312283e-06 0.000007
17 revol_util 2.312283e-06 0.000003
18 bc_open_to_buy 9.249132e-07 0.000009
19 mort_acc 4.440892e-17 0.000002

   The out_prncp feature has the highest weight compared to the other features with the next being total_pymnt at 0.0802. Compared to the Upsampled set, out_prncp has less weight (0.223551 vs. 0.248844) but there is a large difference in the weights for total_pymnt (0.0802 vs. 0.239985).

   Now we can use ELI5 to show the prediction.

In [ ]:
html_obj2 = eli5.show_prediction(best_model, test_features.iloc[1],
                                 show_feature_values=True)
html_obj2
Out[ ]:

y=0 (probability 0.992, score -4.772) top features

Contribution? Feature Value
+3.016 recoveries 0.000
+2.902 out_prncp 13241.960
+1.113 disbursement_method_DirectPay 1.000
+0.392 grade_B 1.000
+0.348 inq_last_6mths 2.000
+0.342 initial_list_status_w 1.000
+0.295 total_rec_late_fee 0.000
+0.169 purpose_credit_card 1.000
+0.143 verification_status_Verified 1.000
+0.135 int_rate 12.730
+0.110 total_rec_int 1016.890
+0.080 pub_rec 0.000
+0.072 verification_status_Source Verified 0.000
+0.068 grade_D 0.000
+0.063 home_ownership_RENT 0.000
+0.043 acc_open_past_24mths 3.000
+0.042 total_bc_limit 30300.000
+0.031 home_ownership_MORTGAGE 1.000
+0.030 delinq_2yrs 0.000
+0.029 total_bal_ex_mort 73529.000
+0.023 grade_C 0.000
+0.018 term_ 60 months 1.000
+0.011 num_bc_tl 6.000
+0.009 revol_bal 20454.000
+0.008 num_bc_sats 5.000
+0.007 tot_hi_cred_lim 162071.000
+0.006 num_accts_ever_120_pd 0.000
+0.005 tot_coll_amt 0.000
+0.003 revol_util 56.500
+0.000 application_type_Joint App 0.000
+0.000 mo_sin_old_rev_tl_op 164.000
+0.000 delinq_amnt 0.000
-0.003 num_tl_op_past_12m 3.000
-0.011 num_sats 12.000
-0.021 percent_bc_gt_75 40.000
-0.032 mths_since_recent_bc 9.000
-0.048 installment 327.920
-0.091 annual_inc 35000.000
-0.208 loan_amnt 14500.000
-0.728 total_pymnt 2274.930
-0.772 last_pymnt_amnt 327.920
-2.825 <BIAS> 1.000

   For the loans to be Paid Off or Current, the features recoveries and out_prncp with higher values are beneficial while the features last_pymnt_amnt, total_pymnt, loan_amnt and annual_inc with lower values are not beneficial.

Environment to Utilize RAPIDS

   We can utilize the resources Paperspace offers to use a RAPIDS Docker container to leverage GPUs with higher memory than what is available on most local desktops or request quota increases and pay the cloud providers for every second a VM is running.

   First, let's upgrade pip, install the dependencies and import the required packages. Then we can set the os environment as well as the random, cupy and numpy seed. We can also define a function timed to time the code blocks and examine the components of the GPU which are being utilized in the runtime.

In [ ]:
!pip install --upgrade pip
Requirement already satisfied: pip in /opt/conda/envs/rapids/lib/python3.9/site-packages (22.0.3)
Collecting pip
  Downloading pip-22.2.2-py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 65.3 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.3
    Uninstalling pip-22.0.3:
      Successfully uninstalled pip-22.0.3
Successfully installed pip-22.2.2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
In [ ]:
import os
import warnings
import random
import cupy
import numpy as np
import urllib.request
from contextlib import contextmanager
import time
warnings.filterwarnings('ignore')

seed_value = 42
os.environ['xgbRAPIDS_GPU'] = str(seed_value)
random.seed(seed_value)
cupy.random.seed(seed_value)
np.random.seed(seed_value)

@contextmanager
def timed(name):
    t0 = time.time()
    yield
    t1 = time.time()
    print('..%-24s:  %8.4f' % (name, t1 - t0))

os.environ['CUDA_VISIBLE_DEVICES'] = '0'
print('\n')
!nvidia-smi
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Requirement already satisfied: xgboost==1.5.2 in /opt/conda/envs/rapids/lib/python3.9/site-packages (1.5.2)
Requirement already satisfied: scipy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from xgboost==1.5.2) (1.6.0)
Requirement already satisfied: numpy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from xgboost==1.5.2) (1.21.5)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Collecting optuna
  Downloading optuna-3.0.2-py3-none-any.whl (348 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 348.3/348.3 kB 34.8 MB/s eta 0:00:00
Collecting alembic>=1.5.0
  Downloading alembic-1.8.1-py3-none-any.whl (209 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.8/209.8 kB 47.9 MB/s eta 0:00:00
Requirement already satisfied: packaging>=20.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna) (21.3)
Collecting cmaes>=0.8.2
  Downloading cmaes-0.8.2-py3-none-any.whl (15 kB)
Collecting sqlalchemy>=1.3.0
  Downloading SQLAlchemy-1.4.41-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 85.8 MB/s eta 0:00:00
Requirement already satisfied: tqdm in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna) (4.62.3)
Requirement already satisfied: PyYAML in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna) (6.0)
Collecting scipy<1.9.0,>=1.7.0
  Downloading scipy-1.8.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (42.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.2/42.2 MB 41.4 MB/s eta 0:00:0000:0100:01
Requirement already satisfied: numpy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna) (1.21.5)
Collecting colorlog
  Downloading colorlog-6.7.0-py2.py3-none-any.whl (11 kB)
Collecting cliff
  Downloading cliff-4.0.0-py3-none-any.whl (80 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81.0/81.0 kB 26.5 MB/s eta 0:00:00
Collecting Mako
  Downloading Mako-1.2.2-py3-none-any.whl (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.7/78.7 kB 26.7 MB/s eta 0:00:00
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from packaging>=20.0->optuna) (3.0.7)
Collecting greenlet!=0.4.17
  Downloading greenlet-1.1.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (154 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 154.1/154.1 kB 40.5 MB/s eta 0:00:00
Collecting cmd2>=1.0.0
  Downloading cmd2-2.4.2-py3-none-any.whl (147 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 147.1/147.1 kB 38.4 MB/s eta 0:00:00
Requirement already satisfied: importlib-metadata>=4.4 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from cliff->optuna) (4.10.1)
Collecting stevedore>=2.0.1
  Downloading stevedore-4.0.0-py3-none-any.whl (49 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 49.5/49.5 kB 17.1 MB/s eta 0:00:00
Collecting PrettyTable>=0.7.2
  Downloading prettytable-3.4.1-py3-none-any.whl (26 kB)
Collecting autopage>=0.4.0
  Downloading autopage-0.5.1-py3-none-any.whl (29 kB)
Collecting pyperclip>=1.6
  Downloading pyperclip-1.8.2.tar.gz (20 kB)
  Preparing metadata (setup.py) ... done
Requirement already satisfied: attrs>=16.3.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from cmd2>=1.0.0->cliff->optuna) (21.4.0)
Requirement already satisfied: wcwidth>=0.1.7 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from cmd2>=1.0.0->cliff->optuna) (0.2.5)
Requirement already satisfied: zipp>=0.5 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from importlib-metadata>=4.4->cliff->optuna) (3.7.0)
Collecting pbr!=2.1.0,>=2.0.0
  Downloading pbr-5.10.0-py2.py3-none-any.whl (112 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 112.4/112.4 kB 33.3 MB/s eta 0:00:00
Requirement already satisfied: MarkupSafe>=0.9.2 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from Mako->alembic>=1.5.0->optuna) (2.0.1)
Building wheels for collected packages: pyperclip
  Building wheel for pyperclip (setup.py) ... done
  Created wheel for pyperclip: filename=pyperclip-1.8.2-py3-none-any.whl size=11137 sha256=784e4c674b4c51f4414ffb2fc7cf1e6831e5ea7f1bbbda490a16bec257bd166a
  Stored in directory: /root/.cache/pip/wheels/0c/09/9e/49e21a6840ef7955b06d47394afef0058f0378c0914e48b8b8
Successfully built pyperclip
Installing collected packages: pyperclip, scipy, PrettyTable, pbr, Mako, greenlet, colorlog, cmd2, cmaes, autopage, stevedore, sqlalchemy, cliff, alembic, optuna
  Attempting uninstall: scipy
    Found existing installation: scipy 1.6.0
    Uninstalling scipy-1.6.0:
      Successfully uninstalled scipy-1.6.0
Successfully installed Mako-1.2.2 PrettyTable-3.4.1 alembic-1.8.1 autopage-0.5.1 cliff-4.0.0 cmaes-0.8.2 cmd2-2.4.2 colorlog-6.7.0 greenlet-1.1.3 optuna-3.0.2 pbr-5.10.0 pyperclip-1.8.2 scipy-1.8.1 sqlalchemy-1.4.41 stevedore-4.0.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Collecting dask_optuna
  Downloading dask_optuna-0.0.2-py3-none-any.whl (9.1 kB)
Requirement already satisfied: dask in /opt/conda/envs/rapids/lib/python3.9/site-packages (from dask_optuna) (2022.1.0)
Requirement already satisfied: optuna>=2.1.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from dask_optuna) (3.0.2)
Requirement already satisfied: distributed in /opt/conda/envs/rapids/lib/python3.9/site-packages (from dask_optuna) (2022.1.0)
Requirement already satisfied: alembic>=1.5.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna>=2.1.0->dask_optuna) (1.8.1)
Requirement already satisfied: PyYAML in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna>=2.1.0->dask_optuna) (6.0)
Requirement already satisfied: tqdm in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna>=2.1.0->dask_optuna) (4.62.3)
Requirement already satisfied: colorlog in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna>=2.1.0->dask_optuna) (6.7.0)
Requirement already satisfied: scipy<1.9.0,>=1.7.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna>=2.1.0->dask_optuna) (1.8.1)
Requirement already satisfied: sqlalchemy>=1.3.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna>=2.1.0->dask_optuna) (1.4.41)
Requirement already satisfied: cliff in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna>=2.1.0->dask_optuna) (4.0.0)
Requirement already satisfied: packaging>=20.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna>=2.1.0->dask_optuna) (21.3)
Requirement already satisfied: cmaes>=0.8.2 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna>=2.1.0->dask_optuna) (0.8.2)
Requirement already satisfied: numpy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna>=2.1.0->dask_optuna) (1.21.5)
Requirement already satisfied: fsspec>=0.6.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from dask->dask_optuna) (2022.1.0)
Requirement already satisfied: cloudpickle>=1.1.1 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from dask->dask_optuna) (2.0.0)
Requirement already satisfied: toolz>=0.8.2 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from dask->dask_optuna) (0.11.2)
Requirement already satisfied: partd>=0.3.10 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from dask->dask_optuna) (1.2.0)
Requirement already satisfied: jinja2 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from distributed->dask_optuna) (3.0.3)
Requirement already satisfied: msgpack>=0.6.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from distributed->dask_optuna) (1.0.3)
Requirement already satisfied: tornado>=6.0.3 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from distributed->dask_optuna) (6.1)
Requirement already satisfied: tblib>=1.6.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from distributed->dask_optuna) (1.7.0)
Requirement already satisfied: click>=6.6 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from distributed->dask_optuna) (8.0.3)
Requirement already satisfied: zict>=0.1.3 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from distributed->dask_optuna) (2.0.0)
Requirement already satisfied: sortedcontainers!=2.0.0,!=2.0.1 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from distributed->dask_optuna) (2.4.0)
Requirement already satisfied: psutil>=5.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from distributed->dask_optuna) (5.9.0)
Requirement already satisfied: setuptools in /opt/conda/envs/rapids/lib/python3.9/site-packages (from distributed->dask_optuna) (59.8.0)
Requirement already satisfied: Mako in /opt/conda/envs/rapids/lib/python3.9/site-packages (from alembic>=1.5.0->optuna>=2.1.0->dask_optuna) (1.2.2)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from packaging>=20.0->optuna>=2.1.0->dask_optuna) (3.0.7)
Requirement already satisfied: locket in /opt/conda/envs/rapids/lib/python3.9/site-packages (from partd>=0.3.10->dask->dask_optuna) (0.2.0)
Requirement already satisfied: greenlet!=0.4.17 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from sqlalchemy>=1.3.0->optuna>=2.1.0->dask_optuna) (1.1.3)
Requirement already satisfied: heapdict in /opt/conda/envs/rapids/lib/python3.9/site-packages (from zict>=0.1.3->distributed->dask_optuna) (1.0.1)
Requirement already satisfied: PrettyTable>=0.7.2 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from cliff->optuna>=2.1.0->dask_optuna) (3.4.1)
Requirement already satisfied: cmd2>=1.0.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from cliff->optuna>=2.1.0->dask_optuna) (2.4.2)
Requirement already satisfied: autopage>=0.4.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from cliff->optuna>=2.1.0->dask_optuna) (0.5.1)
Requirement already satisfied: stevedore>=2.0.1 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from cliff->optuna>=2.1.0->dask_optuna) (4.0.0)
Requirement already satisfied: importlib-metadata>=4.4 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from cliff->optuna>=2.1.0->dask_optuna) (4.10.1)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from jinja2->distributed->dask_optuna) (2.0.1)
Requirement already satisfied: wcwidth>=0.1.7 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from cmd2>=1.0.0->cliff->optuna>=2.1.0->dask_optuna) (0.2.5)
Requirement already satisfied: attrs>=16.3.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from cmd2>=1.0.0->cliff->optuna>=2.1.0->dask_optuna) (21.4.0)
Requirement already satisfied: pyperclip>=1.6 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from cmd2>=1.0.0->cliff->optuna>=2.1.0->dask_optuna) (1.8.2)
Requirement already satisfied: zipp>=0.5 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from importlib-metadata>=4.4->cliff->optuna>=2.1.0->dask_optuna) (3.7.0)
Requirement already satisfied: pbr!=2.1.0,>=2.0.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from stevedore>=2.0.1->cliff->optuna>=2.1.0->dask_optuna) (5.10.0)
Installing collected packages: dask_optuna
Successfully installed dask_optuna-0.0.2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Collecting eli5
  Downloading eli5-0.13.0.tar.gz (216 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 216.2/216.2 kB 26.2 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Requirement already satisfied: attrs>17.1.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (21.4.0)
Requirement already satisfied: jinja2>=3.0.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (3.0.3)
Requirement already satisfied: numpy>=1.9.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (1.21.5)
Requirement already satisfied: scipy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (1.8.1)
Requirement already satisfied: six in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (1.16.0)
Requirement already satisfied: scikit-learn>=0.20 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (0.24.2)
Collecting graphviz
  Downloading graphviz-0.20.1-py3-none-any.whl (47 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47.0/47.0 kB 17.1 MB/s eta 0:00:00
Collecting tabulate>=0.7.7
  Downloading tabulate-0.8.10-py3-none-any.whl (29 kB)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from jinja2>=3.0.0->eli5) (2.0.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from scikit-learn>=0.20->eli5) (3.1.0)
Requirement already satisfied: joblib>=0.11 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from scikit-learn>=0.20->eli5) (1.1.0)
Building wheels for collected packages: eli5
  Building wheel for eli5 (setup.py) ... done
  Created wheel for eli5: filename=eli5-0.13.0-py2.py3-none-any.whl size=107748 sha256=4e2beae90f1097cbf5b1df15ff9909a233cfc99cba7bc939e4de5111fe57be55
  Stored in directory: /root/.cache/pip/wheels/7b/26/a5/8460416695a992a2966b41caa5338e5e7fcea98c9d032d055c
Successfully built eli5
Installing collected packages: tabulate, graphviz, eli5
Successfully installed eli5-0.13.0 graphviz-0.20.1 tabulate-0.8.10
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Collecting shap
  Downloading shap-0.41.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (572 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 572.4/572.4 kB 48.0 MB/s eta 0:00:00
Requirement already satisfied: numba in /opt/conda/envs/rapids/lib/python3.9/site-packages (from shap) (0.55.0)
Requirement already satisfied: packaging>20.9 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from shap) (21.3)
Requirement already satisfied: cloudpickle in /opt/conda/envs/rapids/lib/python3.9/site-packages (from shap) (2.0.0)
Requirement already satisfied: scikit-learn in /opt/conda/envs/rapids/lib/python3.9/site-packages (from shap) (0.24.2)
Requirement already satisfied: tqdm>4.25.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from shap) (4.62.3)
Requirement already satisfied: numpy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from shap) (1.21.5)
Requirement already satisfied: scipy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from shap) (1.8.1)
Requirement already satisfied: pandas in /opt/conda/envs/rapids/lib/python3.9/site-packages (from shap) (1.3.5)
Collecting slicer==0.0.7
  Downloading slicer-0.0.7-py3-none-any.whl (14 kB)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from packaging>20.9->shap) (3.0.7)
Requirement already satisfied: setuptools in /opt/conda/envs/rapids/lib/python3.9/site-packages (from numba->shap) (59.8.0)
Requirement already satisfied: llvmlite<0.39,>=0.38.0rc1 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from numba->shap) (0.38.0)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from pandas->shap) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from pandas->shap) (2021.3)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from scikit-learn->shap) (3.1.0)
Requirement already satisfied: joblib>=0.11 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from scikit-learn->shap) (1.1.0)
Requirement already satisfied: six>=1.5 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas->shap) (1.16.0)
Installing collected packages: slicer, shap
Successfully installed shap-0.41.0 slicer-0.0.7
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Collecting plotly
  Downloading plotly-5.10.0-py2.py3-none-any.whl (15.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15.2/15.2 MB 68.9 MB/s eta 0:00:0000:0100:01
Collecting tenacity>=6.2.0
  Downloading tenacity-8.0.1-py3-none-any.whl (24 kB)
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.10.0 tenacity-8.0.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv


Mon Sep 19 23:16:14 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    Off  | 00000000:00:05.0 Off |                  Off |
| 41%   31C    P2    37W / 140W |    157MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

   There is a NVIDIA RTX A4000 with 16GB GPU memory with NVIDIA-SMI version 510.73.05 and CUDA version 11.6.

   Now, we can set up a LocalCUDACluster for Dask with defined limits for memory and the device memory as well as an ip that is assigned to the dask.distributed.Client for all of the connected workers (1).

In [ ]:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait
from dask.diagnostics import ProgressBar
from dask.utils import parse_bytes

cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES='0', memory_limit='10GB',
                           device_memory_limit='5GB', ip='0.0.0.0')

client = Client(cluster)
client
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Out[ ]:

Client

Client-bb464191-3883-11ed-8028-be12f8a4cbd6

Connection method: Cluster object Cluster type: dask_cuda.LocalCUDACluster
Dashboard: http://10.42.102.181:8787/status

Cluster Info

XGBoost

   The notebooks evaluating different metrics utilizing train/validation/test sets with RAPIDS and Optuna are located here, and the initial notebooks using Hyperopt and XGBoost with GPU runtimes are located here.

   Let's now read the training and the test sets for the Upsampled set into separate cudf.Dataframe. For the XGBoost models, splitting the training set into train and validation sets resulted in better model performance, so let's use train_size=0.8 with the training set to create the validation set. Then we can convert the features to float32 and the target to int32 for modeling.

In [ ]:
import cudf
from cuml.model_selection import train_test_split

trainDF = cudf.read_csv('trainDF_US.csv', low_memory=False)
print('Train set: Number of rows and columns:', trainDF.shape)

testDF = cudf.read_csv('testDF_US.csv', low_memory=False)
print('Test set: Number of rows and columns:', testDF.shape)

X_train, X_val, y_train, y_val = train_test_split(trainDF, 'loan_status',
                                                  train_size=0.8)

X_train = X_train.astype('float32')
y_train = y_train.astype('int32')
X_val = X_val.astype('float32')
y_val = y_val.astype('int32')
X_test = X_test.astype('float32')
y_test = y_test.astype('int32')
Train set: Number of rows and columns: (3022132, 51)
Test set: Number of rows and columns: (432473, 51)

Upsampled: Baseline Model

   Let's first define the baseline model components for classification using gbtree for the booster to run on the GPU using tree_method=gpu_hist with scale_pos_weight=1to balance the positive and negative weights without label encoding. Then use the defined seed_value for the random number seed with silent output from the model. Now we can fit the model using the training set, save it as .pkl file and evaluate the model using classification metrics.

In [ ]:
xgb = XGBClassifier(objective='binary:logistic',
                    booster='gbtree',
                    tree_method='gpu_hist',
                    scale_pos_weight=1,
                    use_label_encoder=False,
                    random_state=seed_value,
                    verbosity=0)

xgb.fit(X_train, y_train)

Pkl_Filename = 'XGB_Baseline_US.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(xgb, file)

print('\nModel Metrics for XGBoost Baseline Upsampling')

y_train_pred = xgb.predict(X_train)
y_train_pred = y_train_pred.round(2)
y_train_pred = np.where(y_train_pred > 0.5, 1, 0)

y_test_pred = xgb.predict(X_test)
y_test_pred = y_test_pred.round(2)
y_test_pred = np.where(y_test_pred > 0.5, 1, 0)

print('Training Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_train), asnumpy(y_train_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_train),
                                               asnumpy(y_train_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_train),
                                                 asnumpy(y_train_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_train),
                                           asnumpy(y_train_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_train), asnumpy(y_train_pred)))

print('\n')
print('Test Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
                                               asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
                                                 asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
                                           asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))

Model Metrics for XGBoost Baseline Upsampling
Training Set
Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.99      0.96   1511066
           1       0.99      0.93      0.96   1511066

    accuracy                           0.96   3022132
   macro avg       0.96      0.96      0.96   3022132
weighted avg       0.96      0.96      0.96   3022132



Confusion matrix:
[[1498673   12393]
 [ 102969 1408097]]


Accuracy score : 0.962
Precision score : 0.991
Recall score : 0.932
F1 score : 0.961


Test Set
Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.99      0.99    377848
           1       0.94      0.93      0.93     54625

    accuracy                           0.98    432473
   macro avg       0.96      0.96      0.96    432473
weighted avg       0.98      0.98      0.98    432473



Confusion matrix:
[[374541   3307]
 [  3910  50715]]


Accuracy score : 0.983
Precision score : 0.939
Recall score : 0.928
F1 score : 0.934

   Overall, the model metrics are over 90%, so the baseline model performs quite well. The lowest metrics are the recall and F1 score, so let's monitor one of those to see if hyperparameter tuning results in better performance.

300 Trial: Weighted F1

   Now, let's set up an Optuna.study by creating a unique name for the study. We can leverage the optuna.integration.wandb to set up callbacks that will be saved to Weights & Biases. First, we login and set up the arguments that include the name of the project, the person saving the results, the group the study belongs to, if code is saved or not, and notes about the study for future reference.

In [ ]:
import wandb
from optuna.integration.wandb import WeightsAndBiasesCallback

study_name = 'dask_optuna_us_xgbRapids_weightedF1_300_tpe'

wandb.login()

wandb_kwargs = {'project': 'loanStatus_hpo', 'entity': 'aschultz',
                'group': 'optuna_us_xgb300gpu_trainvaltest_f1weighted',
                'save_code': 'False',
                'notes': 'us_xgb300gpu_trainvaltest_f1weighted'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
wandb: WARNING Calling wandb.login() after wandb.init() has no effect.

   Then, we can define a function objective for the optimization of various hyperparameters which uses joblib to save each of the trial parameters and the loss in a pickle file that can be reloaded to continue optimizing. Then we can denote the parameter space of the hyperparameters to be tested.

   Then the model type, XGBClassifier, needs to be defined with the parameters that will be included in all of the trials during the search, which are:

  • objective='binary:logistic: Specify the learning task and the corresponding learning objective or a custom objective function to be used.
  • booster='gbtree': Specify which booster to use: gbtree, gblinear or dart.
  • tree_method='gpu_hist': Specify which tree method to use. Default=auto.
  • scale_pos_weight=1: Balancing of positive and negative weights.
  • use_label_encoder=False: Encodes labels.
  • random_state=seed_value: Random number seed.
  • verbosity=0: The degree of verbosity. Valid values are 0 (silent) - 3 (debug).

   Then we can fit the model specified to use the parameters within the params_xgboost_optuna dictionary utilizing the training set and the eval_set as the validation set, followed by the prediction of the weighted f1_score using the created validation label and the predicted label.

In [ ]:
import optuna
from optuna import Trial
optuna.logging.set_verbosity(optuna.logging.WARNING)
import joblib
from xgboost import XGBClassifier
from timeit import default_timer as timer
from sklearn.metrics import f1_score
from cupy import asnumpy

@wandbc.track_in_wandb()

def objective(trial):
    """
    Objective function to tune a XGBoostClassifier model.
    """
    joblib.dump(study, 'xgbRAPIDS_Optuna_US_300_GPU_f1weighted.pkl')

    params_xgboost_optuna = {
        'n_estimators': trial.suggest_int('n_estimators', 400, 700),
        'max_depth': trial.suggest_int('max_depth', 8, 15),
        'subsample': trial.suggest_float('subsample', 0.6, 0.9),
        'gamma': trial.suggest_float('gamma', 0.02, 1),
        'learning_rate': trial.suggest_float('learning_rate', 0.13, 0.3),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.5, 2.5),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.3, 2),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.8, 0.99),
        'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.25, 0.6),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 5)
        }

    xgb = XGBClassifier(objective='binary:logistic',
                        booster='gbtree',
                        tree_method='gpu_hist',
                        scale_pos_weight=1,
                        use_label_encoder=False,
                        random_state=seed_value,
                        verbosity=0,
                        **params_xgboost_optuna)

    start = timer()
    xgb.fit(X_train, y_train,
            eval_set=[(X_val, y_val)],
            verbose=0)

    y_pred_val = xgb.predict(X_val)
    score = f1_score(asnumpy(y_val), asnumpy(y_pred_val), average='weighted')
    run_time = timer() - start

    return score

   Now, we can begin the Optuna study containing 300 trials that are optimized to run in parallel on a Dask cluster. The goal is to find the parameters that maximize the weighted F1 score for predicting loan_status.

In [ ]:
import time
import dask_optuna
import dask
import dask_cudf

with timed('dask_optuna'):
    search_time_start = time.time()
    if os.path.isfile('xgbRAPIDS_Optuna_US_300_GPU_f1weighted.pkl'):
        study = joblib.load('xgbRAPIDS_Optuna_US_300_GPU_f1weighted.pkl')
    else:
        study = optuna.create_study(sampler=optuna.samplers.TPESampler(),
                                    study_name=study_name,
                                    direction='maximize')
    with parallel_backend('dask'):
        study.optimize(lambda trial: objective(trial),
                       n_trials=300)
print('Time to run HPO:', time.time() - search_time_start)
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Highest F1 Score', study.best_value)
wandb.finish()
Time to run HPO: 13905.847610235214


Number of finished trials: 300
Best trial: {'n_estimators': 491, 'max_depth': 15, 'subsample': 0.7604888319527146, 'gamma': 0.04087059120394826, 'learning_rate': 0.15906588784061876, 'reg_alpha': 0.5297264561831198, 'reg_lambda': 1.4839744523445213, 'colsample_bytree': 0.8416926592684433, 'colsample_bylevel': 0.5774439932010007, 'min_child_weight': 1}
Highest F1 Score 0.999247220735666
Waiting for W&B process to finish... (success).
VBox(children=(Label(value='0.001 MB of 0.001 MB uploaded (0.000 MB deduped)\r'), FloatProgress(value=1.0, max…
Synced dulcet-galaxy-15: https://wandb.ai/aschultz/loanStatus_hpo/runs/3pnmuttd
Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
Find logs at: ./wandb/run-20220917_211340-3pnmuttd/logs

   Let's now extract the trial number, f1_weighted and hyperparameter value into a dataframe and sort with the highest score first.

In [ ]:
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'f1_weighted'}, inplace=True)
trials_df.rename(columns={'params_colsample_bylevel': 'colsample_bylevel'},
                 inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
                 inplace=True)
trials_df.rename(columns={'params_gamma': 'gamma'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
                 inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_weight': 'min_child_weight'},
                 inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_reg_alpha': 'reg_alpha'}, inplace=True)
trials_df.rename(columns={'params_reg_lambda': 'reg_lambda'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('f1_weighted', ascending=False)
print(trials_df)
     iteration  f1_weighted             datetime_start  \
191        191     0.999247 2022-09-17 23:42:18.359414   
258        258     0.999242 2022-09-18 00:45:11.098517   
242        242     0.999237 2022-09-18 00:28:52.299674   
197        197     0.999236 2022-09-17 23:47:29.801421   
278        278     0.999231 2022-09-18 01:04:14.356666   
..         ...          ...                        ...   
148        148     0.990621 2022-09-17 23:07:45.365055   
3            3     0.989366 2022-09-17 21:32:25.410750   
287        287     0.988660 2022-09-18 01:12:22.001476   
95          95     0.983905 2022-09-17 22:29:08.662198   
260        260     0.979139 2022-09-18 00:47:19.155823   

             datetime_complete               duration  colsample_bylevel  \
191 2022-09-17 23:43:16.500515 0 days 00:00:58.141101           0.577444   
258 2022-09-18 00:46:14.684968 0 days 00:01:03.586451           0.488759   
242 2022-09-18 00:29:56.528316 0 days 00:01:04.228642           0.518016   
197 2022-09-17 23:48:24.287870 0 days 00:00:54.486449           0.541171   
278 2022-09-18 01:05:18.247899 0 days 00:01:03.891233           0.507059   
..                         ...                    ...                ...   
148 2022-09-17 23:08:02.016132 0 days 00:00:16.651077           0.560999   
3   2022-09-17 21:32:44.150926 0 days 00:00:18.740176           0.410638   
287 2022-09-18 01:12:42.593446 0 days 00:00:20.591970           0.531795   
95  2022-09-17 22:29:22.974740 0 days 00:00:14.312542           0.556336   
260 2022-09-18 00:47:35.481398 0 days 00:00:16.325575           0.486564   

     colsample_bytree     gamma  learning_rate  max_depth  min_child_weight  \
191          0.841693  0.040871       0.159066         15                 1   
258          0.828521  0.056368       0.130137         15                 1   
242          0.821736  0.022923       0.143276         15                 1   
197          0.844044  0.059744       0.163222         15                 1   
278          0.836984  0.020443       0.146207         15                 1   
..                ...       ...            ...        ...               ...   
148          0.875270  0.096369       0.193190          9                 1   
3            0.970038  0.997113       0.159829          9                 4   
287          0.839160  0.064727       0.137840          9                 1   
95           0.874941  0.142290       0.198427          8                 1   
260          0.824896  0.065039       0.145268          8                 1   

     n_estimators  reg_alpha  reg_lambda  subsample     state  
191           491   0.529726    1.483974   0.760489  COMPLETE  
258           526   0.832003    1.600654   0.795352  COMPLETE  
242           525   0.765251    1.539347   0.812440  COMPLETE  
197           489   0.557064    1.512291   0.811702  COMPLETE  
278           538   0.919857    0.825318   0.827206  COMPLETE  
..            ...        ...         ...        ...       ...  
148           424   0.633854    1.391717   0.780691  COMPLETE  
3             534   1.060199    0.521002   0.835662  COMPLETE  
287           529   0.907747    0.811851   0.817776  COMPLETE  
95            452   0.906030    1.580921   0.707402  COMPLETE  
260           529   0.744753    1.642626   0.797077  COMPLETE  

[300 rows x 16 columns]

   We can now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.

In [ ]:
fig, axs = plt.subplots(1, 4, figsize=(20,5))
i = 0
axs = axs.flatten()
for i, hpo in enumerate(['learning_rate', 'gamma', 'colsample_bylevel',
                         'colsample_bytree']):
    sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

   learning_rate, gamma, colsample_bytree decreased over the trials while colsample_bylevel increased.

   Next, we can examine if there were any trends regarding the regularization parameters over the trials.

In [ ]:
fig, axs = plt.subplots(1, 2, figsize=(20,5))
i = 0
for i, hpo in enumerate(['reg_alpha', 'reg_lambda']):
    sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

   reg_alpha decreased while reg_lambda did not show any trend over the trial iterations.

   From using plot_param_importances to visualize parameter importances, the most important hyperparameter is max_depth (0.89) followed by reg_lambda (0.04) and gamma (0.02).

   Next, let's arrange the best parameters to re-create the best model and fit using the training data and save it as a .pkl file. Then, we can evaluate the model metrics for both the training and the test sets using the classification report, confusion matrix as well as accuracy, precision, recall and F1 score from sklearn.metrics.

In [ ]:
params = study.best_params
params
Out[ ]:
{'n_estimators': 491,
 'max_depth': 15,
 'subsample': 0.7604888319527146,
 'gamma': 0.04087059120394826,
 'learning_rate': 0.15906588784061876,
 'reg_alpha': 0.5297264561831198,
 'reg_lambda': 1.4839744523445213,
 'colsample_bytree': 0.8416926592684433,
 'colsample_bylevel': 0.5774439932010007,
 'min_child_weight': 1}
In [ ]:
import pickle
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score

X_train, y_train = trainDF.drop('loan_status',
                                axis=1), trainDF['loan_status'].astype('int32')
X_train = X_train.astype('float32')

X_test, y_test = testDF.drop('loan_status',
                             axis=1), testDF['loan_status'].astype('int32')
X_test = X_test.astype('float32')

best_model = XGBClassifier(objective='binary:logistic',
                           booster='gbtree',
                           tree_method='gpu_hist',
                           scale_pos_weight=1,
                           use_label_encoder=False,
                           random_state=seed_value,
                           verbosity=0,
                           **params)

best_model.fit(X_train, y_train)

Pkl_Filename = 'xgbRAPIDS_Optuna_US_300trials_GPU_f1weighted.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_model, file)

print('\nModel Metrics for XGBoost RAPIDS HPO Upsampling 300 trials GPU F1 Weighted')
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
print('\n')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
                                               asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
                                                 asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
                                           asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))

Model Metrics for XGBoost RAPIDS HPO Upsampling 300 trials GPU F1 Weighted


Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99    377848
           1       0.99      0.91      0.95     54625

    accuracy                           0.99    432473
   macro avg       0.99      0.96      0.97    432473
weighted avg       0.99      0.99      0.99    432473



Confusion matrix:
[[377480    368]
 [  4727  49898]]


Accuracy score : 0.988
Precision score : 0.993
Recall score : 0.913
F1 score : 0.951

   Compared to the baseline model, there is a higher accuracy, precision and F1 score but a lower recall score. Let's now evaluate the predictive probability on the test set.

In [ ]:
from sklearn.metrics import roc_auc_score

print('The best model from Upsampling 300 GPU trials optimization scores {:.5f} AUC ROC on the test set.'.format(roc_auc_score(asnumpy(y_test),
                                                                                                                               asnumpy(y_test_pred))))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
The best model from Upsampling 300 GPU trials optimization scores 0.95625 AUC ROC on the test set.
This was achieved using these conditions:
iteration                                   191
f1_weighted                            0.999247
datetime_start       2022-09-17 23:42:18.359414
datetime_complete    2022-09-17 23:43:16.500515
duration                 0 days 00:00:58.141101
colsample_bylevel                      0.577444
colsample_bytree                       0.841693
gamma                                  0.040871
learning_rate                          0.159066
max_depth                                    15
min_child_weight                              1
n_estimators                                491
reg_alpha                              0.529726
reg_lambda                             1.483974
subsample                              0.760489
state                                  COMPLETE
Name: 0, dtype: object

Model Explanations

SHAP (SHapley Additive exPlanations)

   Let's use the training set and summarize the effects of all the features.

In [ ]:
import shap

shap.initjs()
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_train)
In [ ]:
plt.rcParams.update({'font.size': 7})
shap.summary_plot(shap_values, X_train.to_pandas(), show=False);

   Now, we can use the test set and summarize the effects of all the features.

In [ ]:
shap.initjs()
shap_values = explainer.shap_values(X_test)
In [ ]:
plt.rcParams.update({'font.size': 7})
shap.summary_plot(shap_values, X_test.to_pandas(), show=False);

Model Metrics with Eli5

   Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.

In [ ]:
import eli5
from eli5.sklearn import PermutationImportance

perm_importance = PermutationImportance(best_model,
                                        random_state=seed_value).fit(asnumpy(X_test),
                                                                     asnumpy(y_test))
html_obj = eli5.show_weights(perm_importance,
                             feature_names=X_test.columns.tolist())
html_obj
Out[ ]:
Weight Feature
0.2415 ± 0.0011 out_prncp
0.2346 ± 0.0005 total_pymnt
0.0670 ± 0.0002 recoveries
0.0537 ± 0.0004 loan_amnt
0.0421 ± 0.0002 installment
0.0276 ± 0.0005 last_pymnt_amnt
0.0114 ± 0.0002 total_rec_int
0.0011 ± 0.0001 int_rate
0.0009 ± 0.0000 total_rec_late_fee
0.0001 ± 0.0000 term_ 60 months
0.0001 ± 0.0000 grade_C
0.0000 ± 0.0000 grade_D
0.0000 ± 0.0000 delinq_2yrs
0.0000 ± 0.0000 home_ownership_OWN
0.0000 ± 0.0000 num_tl_30dpd
-0.0000 ± 0.0000 collections_12_mths_ex_med
-0.0000 ± 0.0000 chargeoff_within_12_mths
-0.0000 ± 0.0000 delinq_amnt
-0.0000 ± 0.0000 num_bc_tl
-0.0000 ± 0.0000 disbursement_method_DirectPay
… 30 more …

   We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).

In [ ]:
from eli5.formatters import format_as_dataframe

explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X_test.columns.tolist())
exp = format_as_dataframe(explanation)
exp
Out[ ]:
feature weight std
0 out_prncp 2.415055e-01 5.564289e-04
1 total_pymnt 2.345598e-01 2.616965e-04
2 recoveries 6.699517e-02 8.409338e-05
3 loan_amnt 5.371942e-02 2.024874e-04
4 installment 4.208263e-02 1.226565e-04
5 last_pymnt_amnt 2.759895e-02 2.469769e-04
6 total_rec_int 1.137828e-02 1.139127e-04
7 int_rate 1.071512e-03 4.494641e-05
8 total_rec_late_fee 8.930037e-04 2.332542e-05
9 term_ 60 months 1.156142e-04 1.151508e-05
10 grade_C 5.826953e-05 9.853691e-06
11 grade_D 8.786676e-06 5.353322e-06
12 delinq_2yrs 3.699653e-06 8.476970e-06
13 home_ownership_OWN 3.699653e-06 9.766488e-06
14 num_tl_30dpd 4.624566e-07 9.249132e-07
15 collections_12_mths_ex_med -2.220446e-17 1.462416e-06
16 chargeoff_within_12_mths -4.440892e-17 2.532979e-06
17 delinq_amnt -9.249132e-07 1.132783e-06
18 num_bc_tl -1.387370e-06 1.800614e-05
19 disbursement_method_DirectPay -2.312283e-06 1.232254e-05

   The out_prncp feature had the highest weight (0.2415) followed by total_pymnt (0.2345) and recoveries (0.0670).

   Let's now utilize the plot_importance from the XGBoost package to examine the feature importance.

In [ ]:
from xgboost import plot_importance

plot_importance(best_model, max_num_features=15);
Out[ ]:
<AxesSubplot:title={'center':'Feature importance'}, xlabel='F score', ylabel='Features'>

   The most important features are:

  • total_pymnt
  • mo_sin_old_rev_tl_op
  • total_rec_int
  • annual_inc
  • total_bal_ex_mort
  • int_rate
  • total_hi_cred_lim
  • bc_open_to_buy
  • revol_util
  • revol_bal
  • total_bc_limit
  • installment
  • mths_since_recent_bc
  • out_prncp
  • last_pymnt_amnt

   Next, we can calculate the Permutation importance:

In [ ]:
perm_importance = permutation_importance(best_model, X_test.to_pandas(),
                                         asnumpy(y_test))

sorted_idx = perm_importance.importances_mean.argsort()
plt.rcParams.update({'font.size': 10})
fig = plt.figure(figsize=(8, 15))
plt.barh(testDF.columns[sorted_idx],
         perm_importance.importances_mean[sorted_idx])
plt.xlabel('Permutation Importance')
plt.title('XGBoost Permutation Importance')
plt.show();

   From the permutation importance, the most important features are:

  • out_prncp
  • total_pymnt
  • recoveries
  • loan_amnt
  • installment
  • last_pymnt_amnt
  • total_rec_int

SMOTE: Baseline Model

   The SMOTE train and test sets were processed using the same methods as the Upsampled data, and the same parameters that were used for the Upsampled baseline model were maintained. Now we can fit the model using the training set, save it as .pkl file and evaluate the model using the same classification metrics.

In [ ]:
xgb = XGBClassifier(objective='binary:logistic',
                    booster='gbtree',
                    tree_method='gpu_hist',
                    scale_pos_weight=1,
                    use_label_encoder=False,
                    random_state=seed_value,
                    verbosity=0)

xgb.fit(X_train, y_train)

Pkl_Filename = 'XGB_Baseline_SMOTE.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(xgb, file)

print('\nModel Metrics for XGBoost Baseline SMOTE')

y_train_pred = xgb.predict(X_train)
y_train_pred = y_train_pred.round(2)
y_train_pred = np.where(y_train_pred > 0.5, 1, 0)

y_test_pred = xgb.predict(X_test)
y_test_pred = y_test_pred.round(2)
y_test_pred = np.where(y_test_pred > 0.5, 1, 0)

print('Training Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_train), asnumpy(y_train_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_train),
                                               asnumpy(y_train_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_train),
                                                 asnumpy(y_train_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_train),
                                           asnumpy(y_train_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_train), asnumpy(y_train_pred)))

print('\n')
print('Test Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
                                               asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
                                                 asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
                                           asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))

Model Metrics for XGBoost Baseline SMOTE
Training Set
Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99   1511066
           1       1.00      0.99      0.99   1511066

    accuracy                           0.99   3022132
   macro avg       0.99      0.99      0.99   3022132
weighted avg       0.99      0.99      0.99   3022132



Confusion matrix:
[[1510705     361]
 [  20620 1490446]]


Accuracy score : 0.993
Precision score : 1.000
Recall score : 0.986
F1 score : 0.993


Test Set
Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99    377848
           1       1.00      0.91      0.95     54625

    accuracy                           0.99    432473
   macro avg       0.99      0.95      0.97    432473
weighted avg       0.99      0.99      0.99    432473



Confusion matrix:
[[377676    172]
 [  5069  49556]]


Accuracy score : 0.988
Precision score : 0.997
Recall score : 0.907
F1 score : 0.950

   Overall, the model metrics are over 90%, so the baseline model performs quite well. The lowest metrics are the recall and F1 score. Compared to the Upsampled baseline model, the accuracy, precision and F1 score are better, but the recall is lower (0.907 vs. 0.928).

300 Trials: Weighted F1

   The same methods for reading the data, splitting the train set into train and validation set and convert data types for modeling were used for trainDF_SMOTE and testDF_SMOTE as well as defining a name for the study and the information for W & B callbacks.

In [ ]:
study_name = 'dask_optuna_smote_xgbRapids_weightedF1_300_tpe'

wandb.login()

wandb_kwargs = {'project': 'loanStatus_hpo', 'entity': 'aschultz',
                'group': 'optuna_SMOTE_xgb300gpu_trainvaltest',
                'save_code': 'False',
                'notes': 'smote_xgb300gpu_trainvaltest_f1weighted'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
In [ ]:
@wandbc.track_in_wandb()

def objective(trial):
    """
    Objective function to tune a XGBoostClassifier model.
    """
    joblib.dump(study, 'xgbRAPIDS_Optuna_SMOTE_300_GPU_f1weighted.pkl')

    params_xgboost_optuna = {
        'n_estimators': trial.suggest_int('n_estimators', 400, 700),
        'max_depth': trial.suggest_int('max_depth', 7, 15),
        'subsample': trial.suggest_float('subsample', 0.55, 0.8),
        'gamma': trial.suggest_float('gamma', 0.6, 2.5),
        'learning_rate': trial.suggest_float('learning_rate', 0.13, 0.3),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.2, 2.65),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.6, 2.3),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.8),
        'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.35, 0.6),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 8)
        }

    xgb = XGBClassifier(objective='binary:logistic',
                        booster='gbtree',
                        tree_method='gpu_hist',
                        scale_pos_weight=1,
                        use_label_encoder=False,
                        random_state=seed_value,
                        verbosity=0,
                        **params_xgboost_optuna)

    start = timer()
    xgb.fit(X_train, y_train,
            eval_set=[(X_val, y_val)],
            verbose=0)

    y_pred_val = xgb.predict(X_val)
    score = f1_score(asnumpy(y_val), asnumpy(y_pred_val), average='weighted')
    run_time = timer() - start

    return score

   We can load the saved .pkl file to continue training for 265 more trials to reach the intended 300 for the study.

In [ ]:
study = joblib.load('xgbRAPIDS_Optuna_SMOTE_300_GPU_f1weighted.pkl')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Highest F1 Score', study.best_value)
Number of finished trials: 35
Best trial: {'n_estimators': 504, 'max_depth': 10, 'subsample': 0.7884109129309878, 'gamma': 2.4917228543486036, 'learning_rate': 0.13369246192923853, 'reg_alpha': 1.6781034649937816, 'reg_lambda': 2.2841174167066725, 'colsample_bytree': 0.7321652864845699, 'colsample_bylevel': 0.5108159983287197, 'min_child_weight': 3}
Highest F1 Score 0.9930741838743147
In [ ]:
with timed('dask_optuna'):
    search_time_start = time.time()
    if os.path.isfile('xgbRAPIDS_Optuna_SMOTE_300_GPU_f1weighted.pkl'):
        study = joblib.load('xgbRAPIDS_Optuna_SMOTE_300_GPU_f1weighted.pkl')
    else:
        study = optuna.create_study(sampler=optuna.samplers.TPESampler(),
                                    study_name=study_name,
                                    direction='maximize')
    with parallel_backend('dask'):
        study.optimize(lambda trial: objective(trial),
                       n_trials=265)
print('Time to run HPO:', time.time() - search_time_start)
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Highest F1 Score', study.best_value)
wandb.finish()
Time to run HPO: 4813.409572601318


Number of finished trials: 300
Best trial: {'n_estimators': 691, 'max_depth': 8, 'subsample': 0.796466106984622, 'gamma': 1.547512878374652, 'learning_rate': 0.13800983815031365, 'reg_alpha': 0.2841773131264607, 'reg_lambda': 1.4119808147660278, 'colsample_bytree': 0.7839284986307313, 'colsample_bylevel': 0.5449365714550495, 'min_child_weight': 1}
Highest F1 Score 0.9931784373321696
Waiting for W&B process to finish... (success).
VBox(children=(Label(value='0.001 MB of 0.001 MB uploaded (0.000 MB deduped)\r'), FloatProgress(value=1.0, max…
Synced silvery-planet-16: https://wandb.ai/aschultz/loanStatus_hpo/runs/2vb8texw
Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
Find logs at: ./wandb/run-20220918_023952-2vb8texw/logs

   Let's now extract the trial number, f1_weighted and hyperparameter value into a dataframe and sort with the highest score first.

In [ ]:
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'f1_weighted'}, inplace=True)
trials_df.rename(columns={'params_colsample_bylevel': 'colsample_bylevel'},
                 inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
                 inplace=True)
trials_df.rename(columns={'params_gamma': 'gamma'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
                 inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_weight': 'min_child_weight'},
                 inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_reg_alpha': 'reg_alpha'}, inplace=True)
trials_df.rename(columns={'params_reg_lambda': 'reg_lambda'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('f1_weighted', ascending=False)
print(trials_df)
     iteration  f1_weighted             datetime_start  \
271        271     0.993178 2022-09-18 04:15:41.163464   
122        122     0.993170 2022-09-18 03:34:13.524632   
182        182     0.993165 2022-09-18 03:50:32.282648   
268        268     0.993162 2022-09-18 04:14:47.364919   
231        231     0.993155 2022-09-18 04:04:23.402626   
..         ...          ...                        ...   
8            8     0.992809 2022-09-18 02:52:30.182629   
51          51     0.992809 2022-09-18 03:10:46.524193   
4            4     0.992801 2022-09-18 02:51:50.103090   
19          19     0.992753 2022-09-18 02:54:31.774280   
34          34          NaN 2022-09-18 02:58:48.130980   

             datetime_complete               duration  colsample_bylevel  \
271 2022-09-18 04:15:59.718844 0 days 00:00:18.555380           0.544937   
122 2022-09-18 03:34:31.023521 0 days 00:00:17.498889           0.592178   
182 2022-09-18 03:50:49.120409 0 days 00:00:16.837761           0.563410   
268 2022-09-18 04:15:04.562021 0 days 00:00:17.197102           0.563690   
231 2022-09-18 04:04:39.589980 0 days 00:00:16.187354           0.561954   
..                         ...                    ...                ...   
8   2022-09-18 02:52:37.571025 0 days 00:00:07.388396           0.439248   
51  2022-09-18 03:11:05.894341 0 days 00:00:19.370148           0.352828   
4   2022-09-18 02:52:00.059896 0 days 00:00:09.956806           0.446002   
19  2022-09-18 02:54:38.074525 0 days 00:00:06.300245           0.473968   
34                         NaT                    NaT                NaN   

     colsample_bytree     gamma  learning_rate  max_depth  min_child_weight  \
271          0.783928  1.547513       0.138010        8.0               1.0   
122          0.789262  1.737375       0.157400        8.0               1.0   
182          0.794970  1.538634       0.147207        7.0               1.0   
268          0.795061  1.456625       0.149467        7.0               1.0   
231          0.786213  1.679453       0.154538        7.0               1.0   
..                ...       ...            ...        ...               ...   
8            0.605694  2.202516       0.254594        9.0               8.0   
51           0.545644  1.850384       0.170407       12.0               5.0   
4            0.531367  1.557027       0.187707       11.0               6.0   
19           0.758870  2.300779       0.299497       11.0               6.0   
34                NaN       NaN            NaN        NaN               NaN   

     n_estimators  reg_alpha  reg_lambda  subsample     state  
271         691.0   0.284177    1.411981   0.796466  COMPLETE  
122         660.0   0.992328    1.515634   0.759540  COMPLETE  
182         667.0   0.726457    1.556874   0.786536  COMPLETE  
268         676.0   0.452447    1.446416   0.789540  COMPLETE  
231         650.0   0.914711    1.368587   0.771501  COMPLETE  
..            ...        ...         ...        ...       ...  
8           602.0   0.455088    2.018047   0.747924  COMPLETE  
51          415.0   1.662494    1.918067   0.750623  COMPLETE  
4           574.0   0.325456    1.708349   0.634853  COMPLETE  
19          449.0   0.837412    1.229321   0.579175  COMPLETE  
34            NaN        NaN         NaN        NaN   RUNNING  

[300 rows x 16 columns]

   We can now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.

In [ ]:
fig, axs = plt.subplots(1, 4, figsize=(20,5))
i = 0
axs = axs.flatten()
for i, hpo in enumerate(['learning_rate', 'gamma', 'colsample_bylevel',
                         'colsample_bytree']):
    sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

   learning_rate, gamma, decreased like the Upsampled trials over the trials while colsample_bylevel and colsample_bytree increased.

   Next, we can examine if there were any trends regarding the regularization parameters over the trials.

In [ ]:
fig, axs = plt.subplots(1, 2, figsize=(20,5))
i = 0
for i, hpo in enumerate(['reg_alpha', 'reg_lambda']):
    sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

   reg_alpha decreased like for the Upsampled trials but reg_lambda decreased while there was not a trend for the Upsampled trials over the trial iterations.

   From using plot_param_importances to visualize the parameter importances for the SMOTE set, the most important hyperparameters are subsample = 0.25, min_child_weight = 0.21, colsample_by_level = 0.17 and colsample_by_tree = 0.16. max_depth, the most important hyperparameter for the Upsampled model is (0.04 vs. 0.89) and reg_lambda is (0.02 vs 0.04) and gamma is (< 0.01 vs 0.02).

   Next, let's arrange the best parameters to re-create the best model and fit using the training data and save it as a .pkl file. Then, we can evaluate the model metrics for both the training and the test sets using the classification report, confusion matrix as well as accuracy, precision, recall and F1 score from sklearn.metrics.

In [ ]:
params = study.best_params
params
Out[ ]:
{'n_estimators': 691,
 'max_depth': 8,
 'subsample': 0.796466106984622,
 'gamma': 1.547512878374652,
 'learning_rate': 0.13800983815031365,
 'reg_alpha': 0.2841773131264607,
 'reg_lambda': 1.4119808147660278,
 'colsample_bytree': 0.7839284986307313,
 'colsample_bylevel': 0.5449365714550495,
 'min_child_weight': 1}
In [ ]:
best_model = XGBClassifier(objective='binary:logistic',
                           booster='gbtree',
                           tree_method='gpu_hist',
                           scale_pos_weight=1,
                           use_label_encoder=False,
                           random_state=seed_value,
                           verbosity=0,
                           **params)

best_model.fit(X_train, y_train)

Pkl_Filename = 'xgbRAPIDS_Optuna_SMOTE_300trials_GPU_f1weighted.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_model, file)

print('\nModel Metrics for XGBoost RAPIDS HPO SMOTE 300 trials GPU F1 Weighted')
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
print('\n')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
                                               asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
                                                 asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
                                           asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))

Model Metrics for XGBoost RAPIDS HPO SMOTE 300 trials GPU F1 Weighted


Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99    377848
           1       1.00      0.91      0.95     54625

    accuracy                           0.99    432473
   macro avg       0.99      0.96      0.97    432473
weighted avg       0.99      0.99      0.99    432473



Confusion matrix:
[[377629    219]
 [  4839  49786]]


Accuracy score : 0.988
Precision score : 0.996
Recall score : 0.911
F1 score : 0.952

   Compared to the baseline model, there is a higher recall and F1 score and comparable accuacy and precision score. Let's now evaluate the predictive probability on the test set.

In [ ]:
print('The best model from SMOTE 300 GPU trials optimization scores {:.5f} AUC ROC on the test set.'.format(roc_auc_score(asnumpy(y_test),
                                                                                                                          asnumpy(y_test_pred))))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
The best model from SMOTE 300 GPU trials optimization scores 0.95542 AUC ROC on the test set.
This was achieved using these conditions:
iteration                                   271
f1_weighted                            0.993178
datetime_start       2022-09-18 04:15:41.163464
datetime_complete    2022-09-18 04:15:59.718844
duration                 0 days 00:00:18.555380
colsample_bylevel                      0.544937
colsample_bytree                       0.783928
gamma                                  1.547513
learning_rate                           0.13801
max_depth                                   8.0
min_child_weight                            1.0
n_estimators                              691.0
reg_alpha                              0.284177
reg_lambda                             1.411981
subsample                              0.796466
state                                  COMPLETE
Name: 271, dtype: object

Model Explanations

SHAP (SHapley Additive exPlanations)

   Let's use the training set and summarize the effects of all the features.

In [ ]:
shap.initjs()
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_train)
In [ ]:
plt.rcParams.update({'font.size': 7})
shap.summary_plot(shap_values, X_train.to_pandas(), show=False);

   Now, we can use the test set and summarize the effects of all the features.

In [ ]:
shap.initjs()
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test)
In [ ]:
plt.rcParams.update({'font.size': 7})
shap.summary_plot(shap_values, X_test.to_pandas(), show=False);

Model Metrics with Eli5

   Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.

In [ ]:
perm_importance = PermutationImportance(best_model,
                                        random_state=seed_value).fit(asnumpy(X_test),
                                                                     asnumpy(y_test))

html_obj = eli5.show_weights(perm_importance,
                             feature_names=X_test.columns.tolist())
html_obj
Out[ ]:
Weight Feature
0.2847 ± 0.0006 total_pymnt
0.2404 ± 0.0008 out_prncp
0.0829 ± 0.0007 loan_amnt
0.0743 ± 0.0002 installment
0.0675 ± 0.0002 recoveries
0.0296 ± 0.0007 last_pymnt_amnt
0.0279 ± 0.0003 total_rec_int
0.0065 ± 0.0002 int_rate
0.0008 ± 0.0001 total_rec_late_fee
0.0003 ± 0.0000 grade_C
0.0002 ± 0.0000 term_ 60 months
0.0001 ± 0.0001 grade_D
0.0001 ± 0.0000 revol_util
0.0001 ± 0.0001 home_ownership_RENT
0.0001 ± 0.0000 annual_inc
0.0000 ± 0.0000 tot_hi_cred_lim
0.0000 ± 0.0000 bc_open_to_buy
0.0000 ± 0.0001 total_bc_limit
0.0000 ± 0.0000 num_bc_tl
0.0000 ± 0.0000 num_bc_sats
… 30 more …

   We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).

In [ ]:
explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X_test.columns.tolist())
exp = format_as_dataframe(explanation)
exp
Out[ ]:
feature weight std
0 total_pymnt 0.284731 0.000322
1 out_prncp 0.240418 0.000402
2 loan_amnt 0.082900 0.000331
3 installment 0.074277 0.000123
4 recoveries 0.067454 0.000094
5 last_pymnt_amnt 0.029555 0.000332
6 total_rec_int 0.027872 0.000129
7 int_rate 0.006538 0.000083
8 total_rec_late_fee 0.000816 0.000030
9 grade_C 0.000321 0.000025
10 term_ 60 months 0.000154 0.000010
11 grade_D 0.000123 0.000027
12 revol_util 0.000078 0.000009
13 home_ownership_RENT 0.000064 0.000028
14 annual_inc 0.000062 0.000010
15 tot_hi_cred_lim 0.000050 0.000012
16 bc_open_to_buy 0.000049 0.000021
17 total_bc_limit 0.000047 0.000033
18 num_bc_tl 0.000047 0.000023
19 num_bc_sats 0.000044 0.000004

   The total_pymnt feature has the highest weight for this model (0.284731) Upsampled lower out_prncp 0.2415 total_pymnt 0.2345

   Let's now utilize the plot_importance from the XGBoost package to examine the feature importance.

In [ ]:
plot_importance(best_model, max_num_features=15);

   The most important features are:

  • total_pymnt
  • total_rec_int
  • int_rate
  • last_pymnt_amnt
  • out_prncp
  • installment
  • total_bal_ex_mort
  • mo_sin_old_rev_tl_op
  • annual_inc
  • loan_amnt
  • bc_open_to_buy
  • total_bc_limit
  • revol_bal
  • revol_util

   Next, we can calculate the Permutation importance:

In [ ]:
perm_importance = permutation_importance(best_model, X_test.to_pandas(),
                                         asnumpy(y_test))

sorted_idx = perm_importance.importances_mean.argsort()
plt.rcParams.update({'font.size': 10})
fig = plt.figure(figsize=(8, 15))
plt.barh(testDF.columns[sorted_idx],
         perm_importance.importances_mean[sorted_idx])
plt.xlabel('Permutation Importance')
plt.title('XGBoost Permutation Importance')
plt.show();

   From the permutation importance, the most important features are:

  • total_pymnt
  • out_prncp
  • loan_amnt
  • recoveries
  • installment
  • last_pymnt_amnt
  • total_rec_int
  • int_rate

Linear

   The notebooks utilized for evaluating the weighted F1, weighted ROC, precision or recall score as the score metric to monitor during the hyperparameter tuning search using (Logistic Regression, Ridge Regression, Lasso Regression or Elastic Net Regression) are located here.

   The methods used to set up the RAPIDS environment in Colab can be found here. For the subsequent algorithms evaluated, the RAPIDS environment was set up with installing/importing packages, setting the seed and Dask cluster as described in the previous XGBoost section. For the following sections, the data was read and the features/target were set up as demonstrated below.

LinearSVC

   Support Vector Machine (SVM) is a supervised algorithm where the objective is to determine the hyperplane, a line that separates the features for the target variable, with the highest possible distance between the hyperplane and the nearest data point from either of the target group (margin). The points closest to the hyperplane are the support vectors. Removing these would affect the the position of the dividing hyperplane.

   However, finding the optimal hyperplane for non-linear problems is time and computationally expensive. This 2005 paper A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs developed a method for solving linear SVMs with an L2 loss function and other have proposed new methods for utilizing LinearSVMs.

   The notebooks utilized for evaluating the weighted F1, weighted ROC, precision or recall score as the score metric to monitor during the hyperparameter tuning search are located here.

Upsampled: Baseline Model

   Let's first define the baseline model components for classification. Now we can fit the model using the training set from the Upsampled set, save it as .pkl file and evaluate the model using classification metrics.

In [ ]:
from cuml.svm import LinearSVC

lsvc = LinearSVC()

lsvc.fit(X_train, y_train)

Pkl_Filename = 'LinearSVC_Baseline_US.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(lsvc, file)
[W] [02:20:52.829795] L-BFGS line search failed (code 4); stopping at the last valid step
In [ ]:
print('\nModel Metrics for LinearSVC Baseline Upsampling')

y_train_pred = lsvc.predict(X_train)
y_train_pred = y_train_pred.round(2)
y_train_pred = cupy.where(y_train_pred > 0.5, 1, 0)

y_test_pred = lsvc.predict(X_test)
y_test_pred = y_test_pred.round(2)
y_test_pred = cupy.where(y_test_pred > 0.5, 1, 0)

print('Training Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_train), asnumpy(y_train_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_train),
                                               asnumpy(y_train_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_train),
                                                 asnumpy(y_train_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_train),
                                           asnumpy(y_train_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_train), asnumpy(y_train_pred)))

print('\n')
print('Test Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
                                               asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
                                                 asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
                                           asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))

Model Metrics for LinearSVC Baseline Upsampling
Training Set
Classification Report:
              precision    recall  f1-score   support

           0       0.50      1.00      0.67   1511066
           1       0.00      0.00      0.00   1511066

    accuracy                           0.50   3022132
   macro avg       0.25      0.50      0.33   3022132
weighted avg       0.25      0.50      0.33   3022132



Confusion matrix:
[[1511066       0]
 [1511066       0]]


Accuracy score : 0.500
Precision score : 0.000
Recall score : 0.000
F1 score : 0.000


Test Set
Classification Report:
              precision    recall  f1-score   support

           0       0.87      1.00      0.93    377848
           1       0.00      0.00      0.00     54625

    accuracy                           0.87    432473
   macro avg       0.44      0.50      0.47    432473
weighted avg       0.76      0.87      0.81    432473



Confusion matrix:
[[377848      0]
 [ 54625      0]]


Accuracy score : 0.874
Precision score : 0.000
Recall score : 0.000
F1 score : 0.000

   Overall, the model metrics are horrible, so the model will definitely need to be tuned. In the SparkML section, the StandardScaler was utilized before fitting the model. This might be needed to obtain better model performance because using the default parameters resulted in the worst baseline model up to this point.

100 Trials: Weighted F1

   Now, we can define a function train_and_eval to set up the train/test sets and define the model/parameters that will be tested during the search:

  • penalty: The regularization term of the target function. Default='l2'.
  • loss: The loss term of the target function. Default = 'squared_hinge'
  • penalized_intercept: When true, the bias term is treated the same way as other features; i.e. it’s penalized by the regularization term of the target function. Default=False.
  • max_iter: Maximum number of iterations for the underlying solver. Default=1000.
  • linesearch_max_iter: Maximum number of linesearch (inner loop) iterations for the underlying (QN) solver. Default=100.
  • lbfgs_memory: Number of vectors approximating the hessian for the underlying QN solver (l-bfgs). Default=5.
  • C: The constant scaling factor of the loss term in the target formula. Default=1.0.
  • grad_tol: The threshold on the gradient for the underlying QN solver. Default=0.0001.
  • change_tol: The threshold on the function change for the underlying QN solver. Default=1e-05.

   Then model will be trained using the training set, predictions made using the test set and then the model evaluated for the weighted f1_score between the actual loan_status in the test set versus the predicted one.

In [ ]:
def train_and_eval(X_param, y_param, penalty='l2',
                   loss='squared_hinge',
                   penalized_intercept='False',
                   max_iter=10000,
                   linesearch_max_iter=100,
                   lbfgs_memory=5, C=1,
                   grad_tol=0.0001, change_tol=1e-5,
                   verbose=False):
    """
    Partition data into train/test sets, train and evaluate the model
    for the given parameters.

    Params
    ______

    X_param:  DataFrame.
              The data to use for training and testing.
    y_param:  Series.
              The label for training

    Returns
    score: Weighted F1 of the fitted model
    """
    X_train, y_train = trainDF.drop('loan_status',
                                    axis=1), trainDF['loan_status'].astype('int32')
    X_train = X_train.astype('float32')

    X_test, y_test= testDF.drop('loan_status',
                                axis=1), testDF['loan_status'].astype('int32')
    X_test = X_test.astype('float32')

    model = LinearSVC(penalty=penalty,
                      loss=loss,
                      penalized_intercept=penalized_intercept,
                      max_iter=max_iter,
                      linesearch_max_iter=linesearch_max_iter,
                      lbfgs_memory=lbfgs_memory,
                      C=C,
                      grad_tol=grad_tol,
                      change_tol=change_tol,
                      verbose=False)

    start = timer()
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    score = f1_score(y_test.to_numpy(), y_pred.to_numpy(), average='weighted')
    run_time = timer() - start

    return score
In [ ]:
print('Score with default parameters : ', train_and_eval(X_train, y_train))
[W] [15:35:01.693102] L-BFGS line search failed (code 4); stopping at the last valid step
Score with default parameters :  0.8147946302801139

   As with other hyperparameter optimizations, an objective function containing the .pkl file for the study and the parameter space to be tested needs to be defined.

In [ ]:
def objective(trial, X_param, y_param):

    joblib.dump(study, 'LinearSVC_Optuna_US_100_GPU_weightedF1.pkl')

    penalty = trial.suggest_categorical('penalty', ['l1', 'l2'])
    loss = trial.suggest_categorical('loss', ['squared_hinge', 'hinge'])
    penalized_intercept = trial.suggest_categorical('penalized_intercept',
                                                    ['True', 'False'])
    max_iter = trial.suggest_int('max_iter', -1, 10e6)
    linesearch_max_iter = trial.suggest_int('linesearch_max_iter', 100, 10000)
    lbfgs_memory = trial.suggest_int('lbfgs_memory', 5, 20)
    C = trial.suggest_float('C', 0.1, 10, step=0.1)
    grad_tol = trial.suggest_float('grad_tol', 1e-3, 1e-1)
    change_tol = trial.suggest_float('change_tol', 1e-5, 1e-3)

    score = train_and_eval(X_param,
                           y_param,
                           penalty=penalty,
                           loss=loss,
                           penalized_intercept=penalized_intercept,
                           max_iter=max_iter,
                           linesearch_max_iter=linesearch_max_iter,
                           lbfgs_memory=lbfgs_memory,
                           C=C,
                           grad_tol=grad_tol,
                           change_tol=change_tol,
                           verbose=False)

    return score

   Now, let's set up an Optuna.study by creating a unique name for the study and beginning time hyperparameter search with the TPESampler to maximize the weighted_f1 score for 100 trials using the parallel_backend from Dask.

In [ ]:
from datetime import datetime, timedelta

study_name = 'dask_linearSVC_optuna_US_100_weightedF1_tpe'

with timed('dask_optuna'):
    start_time = datetime.now()
    print('%-20s %s' % ('Start Time', start_time))
    if os.path.isfile('LinearSVC_Optuna_US_100_GPU_weightedF1.pkl'):
        study = joblib.load('LinearSVC_Optuna_US_100_GPU_weightedF1.pkl')
    else:
        study = optuna.create_study(sampler=optuna.samplers.TPESampler(),
                                    study_name=study_name,
                                    direction='maximize')
    with parallel_backend('dask'):
        study.optimize(lambda trial: objective(trial, X_train, y_train),
                       n_trials=100, n_jobs=n_workers)
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Highest F1 Score', study.best_value)
Start Time           2022-06-28 15:35:02.028988
End Time             2022-06-28 15:37:29.835427
0:02:27


Number of finished trials: 100
Best trial: {'penalty': 'l1', 'loss': 'squared_hinge', 'penalized_intercept': 'True', 'max_iter': 2564428, 'linesearch_max_iter': 4076, 'lbfgs_memory': 14, 'C': 7.9, 'grad_tol': 0.010976199943976632, 'change_tol': 0.0002542098250104354}
Highest F1 Score 0.9774040519671148

   Let's now extract the trial number, f1_weighted and hyperparameter value into a dataframe and sort with the highest score first.

In [ ]:
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'f1_weighted'}, inplace=True)
trials_df.rename(columns={'params_C': 'C'}, inplace=True)
trials_df.rename(columns={'params_change_tol': 'change_tol'}, inplace=True)
trials_df.rename(columns={'params_grad_tol': 'grad_tol'}, inplace=True)
trials_df.rename(columns={'params_lbfgs_memory': 'lbfgs_memory'}, inplace=True)
trials_df.rename(columns={'params_linesearch_max_iter': 'linesearch_max_iter'},
                 inplace=True)
trials_df.rename(columns={'params_loss': 'loss'}, inplace=True)
trials_df.rename(columns={'params_max_iter': 'max_iter'}, inplace=True)
trials_df.rename(columns={'params_penalized_intercept': 'penalized_intercept'},
                 inplace=True)
trials_df.rename(columns={'params_penalty': 'penatly'}, inplace=True)
trials_df = trials_df.sort_values('f1_weighted', ascending=False)
print(trials_df)
    iteration  f1_weighted             datetime_start  \
27         27     0.977404 2022-06-28 15:35:39.456973   
63         63     0.977391 2022-06-28 15:36:30.618930   
25         25     0.975667 2022-06-28 15:35:37.299182   
22         22     0.975491 2022-06-28 15:35:32.425700   
92         92     0.975444 2022-06-28 15:37:17.089781   
..        ...          ...                        ...   
26         26     0.814795 2022-06-28 15:35:38.883446   
3           3     0.814795 2022-06-28 15:35:04.330834   
2           2     0.814795 2022-06-28 15:35:03.798425   
59         59     0.814795 2022-06-28 15:36:25.560220   
99         99     0.814795 2022-06-28 15:37:29.177828   

            datetime_complete               duration    C  change_tol  \
27 2022-06-28 15:35:41.332008 0 days 00:00:01.875035  7.9    0.000254   
63 2022-06-28 15:36:32.436435 0 days 00:00:01.817505  3.2    0.000679   
25 2022-06-28 15:35:38.883297 0 days 00:00:01.584115  7.8    0.000236   
22 2022-06-28 15:35:33.993519 0 days 00:00:01.567819  7.7    0.000613   
92 2022-06-28 15:37:18.659766 0 days 00:00:01.569985  0.9    0.000691   
..                        ...                    ...  ...         ...   
26 2022-06-28 15:35:39.456854 0 days 00:00:00.573408  5.5    0.000114   
3  2022-06-28 15:35:04.914463 0 days 00:00:00.583629  2.9    0.000796   
2  2022-06-28 15:35:04.330718 0 days 00:00:00.532293  1.1    0.000695   
59 2022-06-28 15:36:26.188211 0 days 00:00:00.627991  4.7    0.000976   
99 2022-06-28 15:37:29.835118 0 days 00:00:00.657290  1.1    0.000086   

    grad_tol  lbfgs_memory  linesearch_max_iter           loss  max_iter  \
27  0.010976            14                 4076  squared_hinge   2564428   
63  0.026112            19                  193  squared_hinge   2957593   
25  0.031513            17                 5485  squared_hinge   2843803   
22  0.044719            17                 5721  squared_hinge   4245521   
92  0.037195            14                 4180  squared_hinge   2085421   
..       ...           ...                  ...            ...       ...   
26  0.031084            18                 5560  squared_hinge   2061770   
3   0.004298            19                 4795          hinge   1052041   
2   0.021821            12                 7909          hinge   4342153   
59  0.070978            17                 7448  squared_hinge    543564   
99  0.005016            13                 3188  squared_hinge   1636237   

   penalized_intercept penatly     state  
27                True      l1  COMPLETE  
63                True      l1  COMPLETE  
25                True      l1  COMPLETE  
22                True      l1  COMPLETE  
92                True      l1  COMPLETE  
..                 ...     ...       ...  
26                True      l2  COMPLETE  
3                False      l2  COMPLETE  
2                False      l2  COMPLETE  
59                True      l2  COMPLETE  
99                True      l2  COMPLETE  

[100 rows x 15 columns]

   Let's utilize plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search and plot_slice to compare the objective value and individal parameters.

In [ ]:
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()
In [ ]:
fig = optuna.visualization.plot_slice(study)
fig.show()

   The grad_tol parameter with lower values and the lbfgs_memory parameter with higher values resulted in higher objective values.

   We can now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.

In [ ]:
fig, axs = plt.subplots(1, 6, figsize=(20,5))
i = 0
axs = axs.flatten()
for i, hpo in enumerate(['C', 'change_tol', 'grad_tol',
                         'lbfgs_memory', 'linesearch_max_iter', 'max_iter']):
    sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

   Over the study, the parameters C, grad_tol, linesearch_max_iter and max_iter decreased, while the parameter change_tol increased. There was not really a trend for the lbfgs_memory parameter.

   Now, let's use plot_param_importances to visualize parameter importances:

In [ ]:
fig = optuna.visualization.plot_param_importances(study)
fig.show()

   The penalty hyperparameter is the most important (0.97) followed by loss (0.03).

   Let's now utilize plot_edf to visualize the empirical distribution of the study.

In [ ]:
fig = optuna.visualization.plot_edf(study)
fig.show()

   Next, let's arrange the best parameters to re-create the best model and fit using the training data and save it as a .pkl file. Then, we can evaluate the model metrics for both the training and the test sets using the classification report, confusion matrix as well as accuracy, precision, recall and F1 score from sklearn.metrics.

In [ ]:
params = study.best_params
params
Out[ ]:
{'C': 7.9,
 'change_tol': 0.0002542098250104354,
 'grad_tol': 0.010976199943976632,
 'lbfgs_memory': 14,
 'linesearch_max_iter': 4076,
 'loss': 'squared_hinge',
 'max_iter': 2564428,
 'penalized_intercept': 'True',
 'penalty': 'l1'}
In [ ]:
best_model = LinearSVC(**params)

best_model.fit(X_train, y_train)

Pkl_Filename = 'LinearSVC_Optuna_US_trials100_GPU_weightedF1.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_model, file)

print('\nModel Metrics for LinearSVC HPO Upsampling 100trials GPU')
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test.to_numpy(), y_test_pred.to_numpy())
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test.to_numpy(), y_test_pred.to_numpy()))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test.to_numpy(),
                                               y_test_pred.to_numpy()))
print('Precision score : %.3f' % precision_score(y_test.to_numpy(),
                                                 y_test_pred.to_numpy()))
print('Recall score : %.3f' % recall_score(y_test.to_numpy(),
                                           y_test_pred.to_numpy()))
print('F1 score : %.3f' % f1_score(y_test.to_numpy(), y_test_pred.to_numpy()))

Model Metrics for LinearSVC HPO Upsampling 100trials GPU


Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.97      0.98    377848
           1       0.80      0.88      0.84     54625

    accuracy                           0.96    432473
   macro avg       0.89      0.92      0.91    432473
weighted avg       0.96      0.96      0.96    432473



Confusion matrix:
[[365934  11914]
 [  6582  48043]]


Accuracy score : 0.957
Precision score : 0.801
Recall score : 0.880
F1 score : 0.839

   It's definitely nice to see model metrics which are not all zero values like what occurred in the baseline model except for accuracy. This is definitely an improvement in the baseline model, but the metrics are still lower than what the boosting algorithms generated. Let's now evaluate the predictive probability on the test set.

In [ ]:
print('The best model from Upsampling 100 GPU trials optimization scores {:.5f} AUC ROC on the test set.'.format(roc_auc_score(y_test.to_numpy(),
                                                                                                                               y_test_pred.to_numpy())))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
The best model from Upsampling 100 GPU trials optimization scores 0.92399 AUC ROC on the test set.
This was achieved using these conditions:
iteration                                      27
f1_weighted                              0.977404
datetime_start         2022-06-28 15:35:39.456973
datetime_complete      2022-06-28 15:35:41.332008
duration                   0 days 00:00:01.875035
C                                             7.9
change_tol                               0.000254
grad_tol                                 0.010976
lbfgs_memory                                   14
linesearch_max_iter                          4076
loss                                squared_hinge
max_iter                                  2564428
penalized_intercept                          True
penatly                                        l1
state                                    COMPLETE
Name: 27, dtype: object

SMOTE: Baseline Model

   Let's first define the baseline model components for classification. Now we can fit the model using the training set from the SMOTE set, save it as .pkl file and evaluate the model using classification metrics.

In [ ]:
lsvc = LinearSVC()

lsvc.fit(X_train, y_train)

Pkl_Filename = 'LinearSVC_Baseline_SMOTE.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(lsvc, file)
[W] [01:41:28.892500] L-BFGS line search failed (code 4); stopping at the last valid step
In [ ]:
print('\nModel Metrics for LinearSVC Baseline SMOTE')

y_train_pred = lsvc.predict(X_train)
y_train_pred = y_train_pred.round(2)
y_train_pred = cupy.where(y_train_pred > 0.5, 1, 0)

y_test_pred = lsvc.predict(X_test)
y_test_pred = y_test_pred.round(2)
y_test_pred = cupy.where(y_test_pred > 0.5, 1, 0)

print('Training Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_train), asnumpy(y_train_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_train),
                                               asnumpy(y_train_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_train),
                                                 asnumpy(y_train_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_train),
                                           asnumpy(y_train_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_train), asnumpy(y_train_pred)))

print('\n')
print('Test Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
                                               asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
                                                 asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
                                           asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))

Model Metrics for LinearSVC Baseline SMOTE
Training Set
Classification Report:
              precision    recall  f1-score   support

           0       0.50      1.00      0.67   1511066
           1       0.00      0.00      0.00   1511066

    accuracy                           0.50   3022132
   macro avg       0.25      0.50      0.33   3022132
weighted avg       0.25      0.50      0.33   3022132



Confusion matrix:
[[1511066       0]
 [1511066       0]]


Accuracy score : 0.500
Precision score : 0.000
Recall score : 0.000
F1 score : 0.000


Test Set
Classification Report:
              precision    recall  f1-score   support

           0       0.87      1.00      0.93    377848
           1       0.00      0.00      0.00     54625

    accuracy                           0.87    432473
   macro avg       0.44      0.50      0.47    432473
weighted avg       0.76      0.87      0.81    432473



Confusion matrix:
[[377848      0]
 [ 54625      0]]


Accuracy score : 0.874
Precision score : 0.000
Recall score : 0.000
F1 score : 0.000

   Once again, the model metrics are horrible, so the model will definitely need to be tuned. Using the default parameters resulted in the worst baseline models up to this point for both the Upsampled and SMOTE sets.

100 Trials: Weighted F1

   The same train_and_eval and objective functions and search parameter space described in the Upsampled section were utilized for the SMOTE study. A new .pkl file was generated for each search.

In [ ]:
print('Score with default parameters : ', train_and_eval(X_train, y_train))
[W] [21:42:35.660712] L-BFGS line search failed (code 4); stopping at the last valid step
Score with default parameters :  0.8147946302801139
In [ ]:
study_name = 'dask_linearSVC_optuna_SMOTE_100_weightedF1_tpe'

with timed('dask_optuna'):
    start_time = datetime.now()
    print('%-20s %s' % ('Start Time', start_time))
    if os.path.isfile('LinearSVC_Optuna_SMOTE_100_GPU_weightedF1.pkl'):
        study = joblib.load('LinearSVC_Optuna_SMOTE_100_GPU_weightedF1.pkl')
    else:
        study = optuna.create_study(sampler=optuna.samplers.TPESampler(),
                                    study_name=study_name,
                                    direction='maximize')
    with parallel_backend('dask'):
        study.optimize(lambda trial: objective(trial, X_train, y_train),
                       n_trials=100, n_jobs=n_workers)
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Highest F1 Score', study.best_value)
Start Time           2022-06-28 21:42:35.904221
End Time             2022-06-28 21:47:49.929670
0:05:14


Number of finished trials: 100
Best trial: {'penalty': 'l1', 'loss': 'hinge', 'penalized_intercept': 'False', 'max_iter': 4809613, 'linesearch_max_iter': 7740, 'lbfgs_memory': 19, 'C': 0.30000000000000004, 'grad_tol': 0.012580609689127427, 'change_tol': 1.5571253843605563e-05}
Highest F1 Score 0.9600449303161168

   Let's now extract the trial number, f1_weighted and hyperparameter value into a dataframe and sort with the highest score first.

In [ ]:
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'f1_weighted'}, inplace=True)
trials_df.rename(columns={'params_C': 'C'}, inplace=True)
trials_df.rename(columns={'params_change_tol': 'change_tol'}, inplace=True)
trials_df.rename(columns={'params_grad_tol': 'grad_tol'}, inplace=True)
trials_df.rename(columns={'params_lbfgs_memory': 'lbfgs_memory'}, inplace=True)
trials_df.rename(columns={'params_linesearch_max_iter': 'linesearch_max_iter'},
                 inplace=True)
trials_df.rename(columns={'params_loss': 'loss'}, inplace=True)
trials_df.rename(columns={'params_max_iter': 'max_iter'}, inplace=True)
trials_df.rename(columns={'params_penalized_intercept': 'penalized_intercept'},
                 inplace=True)
trials_df.rename(columns={'params_penalty': 'penatly'}, inplace=True)
trials_df = trials_df.sort_values('f1_weighted', ascending=False)
print(trials_df)
    iteration  f1_weighted             datetime_start  \
98         98     0.960045 2022-06-28 21:47:11.747797   
97         97     0.960045 2022-06-28 21:46:51.248450   
99         99     0.958921 2022-06-28 21:47:32.045391   
94         94     0.958039 2022-06-28 21:45:35.252464   
96         96     0.957740 2022-06-28 21:46:34.860978   
..        ...          ...                        ...   
69         69     0.814795 2022-06-28 21:44:15.668381   
14         14     0.809820 2022-06-28 21:42:52.821627   
36         36     0.809820 2022-06-28 21:43:24.572736   
2           2     0.809820 2022-06-28 21:42:37.853009   
59         59     0.809820 2022-06-28 21:44:00.031310   

            datetime_complete               duration    C  change_tol  \
98 2022-06-28 21:47:32.045225 0 days 00:00:20.297428  0.3    0.000011   
97 2022-06-28 21:47:11.747599 0 days 00:00:20.499149  0.3    0.000016   
99 2022-06-28 21:47:49.929412 0 days 00:00:17.884021  0.3    0.000039   
94 2022-06-28 21:45:52.041181 0 days 00:00:16.788717  0.3    0.000047   
96 2022-06-28 21:46:51.248292 0 days 00:00:16.387314  0.3    0.000050   
..                        ...                    ...  ...         ...   
69 2022-06-28 21:44:16.242281 0 days 00:00:00.573900  7.7    0.000429   
14 2022-06-28 21:42:54.401860 0 days 00:00:01.580233  1.6    0.000699   
36 2022-06-28 21:43:26.009162 0 days 00:00:01.436426  0.2    0.000665   
2  2022-06-28 21:42:39.162458 0 days 00:00:01.309449  9.9    0.000789   
59 2022-06-28 21:44:01.398318 0 days 00:00:01.367008  0.8    0.000483   

    grad_tol  lbfgs_memory  linesearch_max_iter           loss  max_iter  \
98  0.013727            20                 7718          hinge   4700874   
97  0.012581            19                 7740          hinge   4809613   
99  0.011307            20                 7232          hinge   4655111   
94  0.013507            18                 6990          hinge   6264747   
96  0.013203            18                 7081          hinge   4821225   
..       ...           ...                  ...            ...       ...   
69  0.040747            13                 2408  squared_hinge   7193990   
14  0.097564             9                 7966  squared_hinge   7561925   
36  0.043296            10                 3480  squared_hinge   4851178   
2   0.034849            10                 4182  squared_hinge   1914182   
59  0.074967            18                 5545  squared_hinge   8639446   

   penalized_intercept penatly     state  
98               False      l1  COMPLETE  
97               False      l1  COMPLETE  
99               False      l1  COMPLETE  
94               False      l1  COMPLETE  
96               False      l1  COMPLETE  
..                 ...     ...       ...  
69               False      l2  COMPLETE  
14                True      l1  COMPLETE  
36               False      l1  COMPLETE  
2                False      l1  COMPLETE  
59               False      l1  COMPLETE  

[100 rows x 15 columns]

   Let's utilize plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search and plot_slice to compare the objective value and individal parameters.

In [ ]:
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()
In [ ]:
fig = optuna.visualization.plot_slice(study)
fig.show()

   Lower C, change_tol grad_tol and higher lbfgs_memory resulted in higher objective value.

   We can now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.

In [ ]:
fig, axs = plt.subplots(1, 6, figsize=(20,5))
i = 0
axs = axs.flatten()
for i, hpo in enumerate(['C', 'change_tol', 'grad_tol',
                         'lbfgs_memory', 'linesearch_max_iter', 'max_iter']):
    sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

   Over the trials, C, change_tol, grad_tol decreased while lbfgs_memory and max_iter increased. There was not really a trend for linesearch_max_iter. In contrast, the results from the search using the Upsampled set showed change_tol increase over the trials while max_iter decreased.

   Now, let's use plot_param_importances to visualize parameter importances:

In [ ]:
fig = optuna.visualization.plot_param_importances(study)
fig.show()

   The most important hyperparameters for the SMOTE set is loss (0.55) followed by penalty (0.44), which was the most important hyperparameter for the Upsampled study at 0.97.

   Let's now utilize plot_edf to visualize the empirical distribution of the study.

In [ ]:
fig = optuna.visualization.plot_edf(study)
fig.show()

   Next, let's arrange the best parameters to re-create the best model and fit using the training data and save it as a .pkl file. Then, we can evaluate the model metrics for both the training and the test sets using the classification report, confusion matrix as well as accuracy, precision, recall and F1 score from sklearn.metrics.

In [ ]:
params = study.best_params
params
Out[ ]:
{'C': 0.30000000000000004,
 'change_tol': 1.5571253843605563e-05,
 'grad_tol': 0.012580609689127427,
 'lbfgs_memory': 19,
 'linesearch_max_iter': 7740,
 'loss': 'hinge',
 'max_iter': 4809613,
 'penalized_intercept': 'False',
 'penalty': 'l1'}
In [ ]:
best_model = LinearSVC(**params)

best_model.fit(X_train, y_train)

Pkl_Filename = 'LinearSVC_Optuna_SMOTE_trials100_GPU_weightedF1.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_model, file)

print('\nModel Metrics for LinearSVC HPO SMOTE 100trials GPU')
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test.to_numpy(), y_test_pred.to_numpy())
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test.to_numpy(), y_test_pred.to_numpy()))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test.to_numpy(),
                                               y_test_pred.to_numpy()))
print('Precision score : %.3f' % precision_score(y_test.to_numpy(),
                                                 y_test_pred.to_numpy()))
print('Recall score : %.3f' % recall_score(y_test.to_numpy(),
                                           y_test_pred.to_numpy()))
print('F1 score : %.3f' % f1_score(y_test.to_numpy(), y_test_pred.to_numpy()))

Model Metrics for LinearSVC HPO SMOTE 100trials GPU


Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.98      0.98    377848
           1       0.85      0.84      0.84     54625

    accuracy                           0.96    432473
   macro avg       0.91      0.91      0.91    432473
weighted avg       0.96      0.96      0.96    432473



Confusion matrix:
[[369403   8445]
 [  8559  46066]]


Accuracy score : 0.961
Precision score : 0.845
Recall score : 0.843
F1 score : 0.844

   When comparing the model metrics after the SMOTE search to close after the search using the Upsampled set, there was a higher accuracy, precision and F1 scores, but a lower recall score. The precision and recall scores seemed to differ by 4% in opposite directions. Let's now evaluate the predictive probability on the test set.

In [ ]:
print('The best model from SMOTE 100 GPU trials optimization scores {:.5f} AUC ROC on the test set.'.format(roc_auc_score(y_test.to_numpy(),
                                                                                                                          y_test_pred.to_numpy())))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
The best model from SMOTE 100 GPU trials optimization scores 0.91048 AUC ROC on the test set.
This was achieved using these conditions:
iteration                                      98
f1_weighted                              0.960045
datetime_start         2022-06-28 21:47:11.747797
datetime_complete      2022-06-28 21:47:32.045225
duration                   0 days 00:00:20.297428
C                                             0.3
change_tol                               0.000011
grad_tol                                 0.013727
lbfgs_memory                                   20
linesearch_max_iter                          7718
loss                                        hinge
max_iter                                  4700874
penalized_intercept                         False
penatly                                        l1
state                                    COMPLETE
Name: 98, dtype: object

K-Nearest Neighbors (KNN) Classifier

   KNN was first proposed in this 1951 publication Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties and later expounded on in this 1967 publication Nearest Neighbor Pattern Classification.

   The notebooks utilized for evaluating the weighted F1, weighted ROC, precision or recall score as the score metric to monitor during the hyperparameter tuning search are located here.

Upsampled: Baseline Model

   Let's first define the baseline model components for classification. Now we can fit the model using the training set from the Upsampled set, save it as .pkl file and evaluate the model using classification metrics.

In [ ]:
from cuml.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

knn.fit(X_train, y_train)

Pkl_Filename = 'KNN_Baseline_US.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(knn, file)

y_train_pred = knn.predict(X_train)
y_train_pred = y_train_pred.round(2)
y_train_pred = cupy.where(y_train_pred > 0.5, 1, 0)

y_test_pred = knn.predict(X_test)
y_test_pred = y_test_pred.round(2)
y_test_pred = cupy.where(y_test_pred > 0.5, 1, 0)

print('\nModel Metrics for KNN Baseline Upsampling')
print('Training Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_train), asnumpy(y_train_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_train),
                                               asnumpy(y_train_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_train),
                                                 asnumpy(y_train_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_train),
                                           asnumpy(y_train_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_train), asnumpy(y_train_pred)))

print('\n')
print('Test Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
                                               asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
                                                 asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
                                           asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))

Model Metrics for KNN Baseline Upsampling
Training Set
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.91      0.95   1511066
           1       0.92      1.00      0.96   1511066

    accuracy                           0.96   3022132
   macro avg       0.96      0.96      0.95   3022132
weighted avg       0.96      0.96      0.95   3022132



Confusion matrix:
[[1378332  132734]
 [   3244 1507822]]


Accuracy score : 0.955
Precision score : 0.919
Recall score : 0.998
F1 score : 0.957


Test Set
Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.87      0.91    377848
           1       0.45      0.73      0.55     54625

    accuracy                           0.85    432473
   macro avg       0.70      0.80      0.73    432473
weighted avg       0.89      0.85      0.87    432473



Confusion matrix:
[[328356  49492]
 [ 14638  39987]]


Accuracy score : 0.852
Precision score : 0.447
Recall score : 0.732
F1 score : 0.555

   The training set is definitely overfitting when the model performance is not balanced with the test set. The lowest metrics are the precision and F1 score, so let's monitor precision to see if hyperparameter tuning results in better performance.

100 Trials: Precision

   Now, we can define a function train_and_eval to set up the train/test sets and define the model/parameters that will be tested during the search:

  • n_neighbors: Default number of neighbors to query. Default=5.
  • metric: Distance metric to use. Options are 'euclidean', 'manhattan', 'chebyshev', 'minkowski'.

   Then model will be trained using the training set, predictions made using the test set and then the model evaluated for the precision between the actual loan_status in the test set versus the predicted one.

   Let's utilize plot_slice to compare the objective value and individal parameters.

In [ ]:
fig = optuna.visualization.plot_slice(study)
fig.show()

   Less n_neighbors resulted in higher objective values.

   Now, let's use plot_param_importances to visualize parameter importances:

In [ ]:
fig = optuna.visualization.plot_param_importances(study)
fig.show()

   The most important hyperparameter is n_neighbors (0.98) compared to metric (0.02).

   Next, let's arrange the best parameters to re-create the best model and fit using the training data and save it as a .pkl file. Then, we can evaluate the model metrics for both the training and the test sets using the classification report, confusion matrix as well as accuracy, precision, recall and F1 score from sklearn.metrics.

In [ ]:
params = study.best_params
params
Out[ ]:
{'metric': 'manhattan', 'n_neighbors': 4}
In [ ]:
best_model = KNeighborsClassifier(n_neighbors=4, metric='manhattan')

best_model.fit(X_train, y_train)

Pkl_Filename = 'KNN_Optuna_US_trials100_GPU_Precision.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_model, file)

print('\nModel Metrics for KNN HPO US 100trials GPU Precision')
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test.to_numpy(), y_test_pred.to_numpy())
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test.to_numpy(), y_test_pred.to_numpy()))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test.to_numpy(),
                                               y_test_pred.to_numpy()))
print('Precision score : %.3f' % precision_score(y_test.to_numpy(),
                                                 y_test_pred.to_numpy()))
print('Recall score : %.3f' % recall_score(y_test.to_numpy(),
                                           y_test_pred.to_numpy()))
print('F1 score : %.3f' % f1_score(y_test.to_numpy(), y_test_pred.to_numpy()))

Model Metrics for KNN HPO US 100trials GPU Precision


Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.92      0.94    377848
           1       0.56      0.70      0.62     54625

    accuracy                           0.89    432473
   macro avg       0.76      0.81      0.78    432473
weighted avg       0.91      0.89      0.90    432473



Confusion matrix:
[[347439  30409]
 [ 16216  38409]]


Accuracy score : 0.892
Precision score : 0.558
Recall score : 0.703
F1 score : 0.622

   Compared to the baseline model using the Upsampled set, all metrics increased besides a small decrease in the recall score. It might be worthwhile to consider monitoring F1 score more or preprocessing the data befoe modeling.

   Let's now evaluate the predictive probability on the test set.

In [ ]:
print('The best model from US 100 Precision GPU trials optimization scores {:.5f} AUC ROC on the test set.'.format(roc_auc_score(y_test.to_numpy(),
                                                                                                                                 y_test_pred.to_numpy())))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
The best model from US 100 Precision GPU trials optimization scores 0.81133 AUC ROC on the test set.
This was achieved using these conditions:
iteration                                    49
precision                              0.529366
datetime_start       2022-06-15 23:10:41.997432
datetime_complete    2022-06-15 23:11:18.324032
duration                 0 days 00:00:36.326600
metric                                euclidean
n_neighbors                                 4.0
state                                  COMPLETE
Name: 49, dtype: object

SMOTE: Baseline Model

   Let's first define the baseline model components for classification, fit the model using the training set from the SMOTE set, save it as .pkl file and evaluate the model using classification metrics.

In [ ]:
knn = KNeighborsClassifier()

knn.fit(X_train, y_train)

Pkl_Filename = 'KNN_Baseline_SMOTE.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(knn, file)

y_train_pred = knn.predict(X_train)
y_train_pred = y_train_pred.round(2)
y_train_pred = cupy.where(y_train_pred > 0.5, 1, 0)

y_test_pred = knn.predict(X_test)
y_test_pred = y_test_pred.round(2)
y_test_pred = cupy.where(y_test_pred > 0.5, 1, 0)

print('\nModel Metrics for KNN Baseline SMOTE')
print('Training Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_train), asnumpy(y_train_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_train), asnumpy(y_train_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_train),
                                               asnumpy(y_train_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_train),
                                                 asnumpy(y_train_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_train),
                                           asnumpy(y_train_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_train), asnumpy(y_train_pred)))

print('\n')
print('Test Set')
print('Classification Report:')
clf_rpt = classification_report(asnumpy(y_test), asnumpy(y_test_pred))
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(asnumpy(y_test), asnumpy(y_test_pred)))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(asnumpy(y_test),
                                               asnumpy(y_test_pred)))
print('Precision score : %.3f' % precision_score(asnumpy(y_test),
                                                 asnumpy(y_test_pred)))
print('Recall score : %.3f' % recall_score(asnumpy(y_test),
                                           asnumpy(y_test_pred)))
print('F1 score : %.3f' % f1_score(asnumpy(y_test), asnumpy(y_test_pred)))

Model Metrics for KNN Baseline SMOTE
Training Set
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.91      0.95   1511066
           1       0.91      1.00      0.95   1511066

    accuracy                           0.95   3022132
   macro avg       0.96      0.95      0.95   3022132
weighted avg       0.96      0.95      0.95   3022132



Confusion matrix:
[[1367615  143451]
 [   1682 1509384]]


Accuracy score : 0.952
Precision score : 0.913
Recall score : 0.999
F1 score : 0.954


Test Set
Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.86      0.91    377848
           1       0.43      0.75      0.55     54625

    accuracy                           0.85    432473
   macro avg       0.70      0.81      0.73    432473
weighted avg       0.89      0.85      0.86    432473



Confusion matrix:
[[324306  53542]
 [ 13448  41177]]


Accuracy score : 0.845
Precision score : 0.435
Recall score : 0.754
F1 score : 0.551

   The training set is definitely overfitting for the SMOTE like it was for the Upsampled baseline KNN model. The lowest metrics are the precision and F1 score again, which are even lower than what the Upsampled baseline model revealed.

100 Trials: Precision

   The same train_and_eval and objective functions and search parameter space described in the Upsampled section were utilized for the SMOTE study. A new .pkl file was generated for each search.

   Let's utilize plot_slice to compare the objective value and the individal parameters.

In [ ]:
fig = optuna.visualization.plot_slice(study)
fig.show()

   Less n_neighbors resulted in higher objective values like in Upsampled trials.

   Now, let's use plot_param_importances to visualize parameter importances:

In [ ]:
fig = optuna.visualization.plot_param_importances(study)
fig.show()

   The n_neighbors hyperparameter is the most important (0.99), while it was 0.98 for the Upsampled set.

   Next, let's arrange the best parameters to re-create the best model and fit using the training data and save it as a .pkl file. Then, we can evaluate the model metrics for both the training and the test sets using the classification report, confusion matrix as well as accuracy, precision, recall and F1 score from sklearn.metrics.

In [ ]:
params = study.best_params
params
Out[ ]:
{'metric': 'euclidean', 'n_neighbors': 4}
In [ ]:
best_model = KNeighborsClassifier(n_neighbors=4, metric='euclidean')

best_model.fit(X_train, y_train)

Pkl_Filename = 'KNN_Optuna_SMOTE_trials100_GPU_Precision.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_model, file)

print('\nModel Metrics for KNN HPO SMOTE 100trials GPU Precision')
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
print('\n')
print('Classification Report:')
clf_rpt = classification_report(y_test.to_numpy(), y_test_pred.to_numpy())
print(clf_rpt)
print('\n')
print('Confusion matrix:')
print(confusion_matrix(y_test.to_numpy(), y_test_pred.to_numpy()))
print('\n')
print('Accuracy score : %.3f' % accuracy_score(y_test.to_numpy(),
                                               y_test_pred.to_numpy()))
print('Precision score : %.3f' % precision_score(y_test.to_numpy(),
                                                 y_test_pred.to_numpy()))
print('Recall score : %.3f' % recall_score(y_test.to_numpy(),
                                           y_test_pred.to_numpy()))
print('F1 score : %.3f' % f1_score(y_test.to_numpy(), y_test_pred.to_numpy()))

Model Metrics for KNN HPO SMOTE 100trials GPU Precision


Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.91      0.93    377848
           1       0.52      0.70      0.60     54625

    accuracy                           0.88    432473
   macro avg       0.74      0.80      0.76    432473
weighted avg       0.90      0.88      0.89    432473



Confusion matrix:
[[342198  35650]
 [ 16384  38241]]


Accuracy score : 0.880
Precision score : 0.518
Recall score : 0.700
F1 score : 0.595

   Compared to the baseline model using the SMOTE set, all metrics increased besides a small decrease in the recall score like what was observed when comparing the tuned Upsampled model to the baseline model. To restate, it might be worthwhile to consider monitoring F1 score more or preprocessing the data befoe modeling. Compared to the tuned model using the Upsampled set, the Upsampled model metrics performed better than the metrics from the SMOTE search.

   Let's now evaluate the predictive probability on the test set.

In [ ]:
print('The best model from SMOTE 100 Precision GPU trials optimization scores {:.5f} AUC ROC on the test set.'.format(roc_auc_score(y_test.to_numpy(),
                                                                                                                                    y_test_pred.to_numpy())))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
The best model from SMOTE 100 Precision GPU trials optimization scores 0.80286 AUC ROC on the test set.
This was achieved using these conditions:
iteration                                    50
precision                              0.517533
datetime_start       2022-06-15 20:32:34.661307
datetime_complete    2022-06-15 20:33:10.961472
duration                 0 days 00:00:36.300165
metric                                euclidean
n_neighbors                                   4
state                                  COMPLETE
Name: 50, dtype: object

Comments

comments powered by Disqus