The price of a used car has always been an intriguing subject. Once a brand-new car leaves the dealership, its price instantly drops well below what the buyer paid. During the 2022 car market, used-vehicle prices rose significantly due to the shortage of semiconductors and microchips needed to manufacture new vehicles, and to rising inflation.
Data
The data was retrieved from Kaggle. The individual who posted the data used a web crawler on the CarGurus inventory in September 2020. This data contains a vast amount of information about the physical and mechanical components of cars, and this feature-rich set can provide insight into what contributes to differences in the value of a vehicle. Does the location where a vehicle is being sold and the age of the vehicle affect the price? How could these factors contribute to building models to predict the price of vehicles? Which features of used vehicles should be used in predictive modeling?
Preprocessing
The code that was used for preprocessing and EDA can be found in the Used Cars GitHub repository. First, the environment is set up with the dependencies, library options, the seed for reproducibility, and the location of the project directory. Then the data is read, duplicate observations are dropped, and empty cells are replaced with NaN.
import os
import random
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
plt.rcParams['figure.figsize'] = (20, 10)
plt.rcParams['font.size'] = 25
sns.set_context('paper', rc={'font.size': 25, 'axes.titlesize': 35,
'axes.labelsize': 30})
seed_value = 42
os.environ['UsedCars_PreprocessEDA'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
df = pd.read_csv('used_cars_data.csv', low_memory=False)
df = df.drop_duplicates()
df = df.replace(r'^\s*$', np.nan, regex=True)
print('\nSample observations:')
display(df.head())
The initial data contains various data types, missing values and complex strings that mix text and numbers. Since string variables can inflate dimensionality, extracting their numerical information might allow for more granularity and insight.
Data Quality
Let's define a function to examine the quality of the data by assessing the absolute number/percentage of missing observations in the data, the types of data present, as well as the number of unique values for the quantitative and qualitative features. The table can then be sorted by missingness to determine which variables should be dropped and which categorical variables contain a large number of groups.
def data_type_quality_table(df):
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    var_type = df.dtypes
    unique_count = df.nunique()
    val_table = pd.concat([mis_val, mis_val_percent, var_type, unique_count],
                          axis=1)
    val_table_ren_columns = val_table.rename(
        columns={0: 'Missing Count', 1: '% Missing', 2: 'Data Type',
                 3: 'Number Unique'})
    val_table_ren_columns = val_table_ren_columns[
        val_table_ren_columns.iloc[:,1] >= 0].sort_values(
        '% Missing', ascending=False).round(1)
    print('The data has ' + str(df.shape[0]) + ' rows and '
          + str(df.shape[1]) + ' columns.\n')
    return val_table_ren_columns
print('\nData Quality Report:')
display(data_type_quality_table(df))
Data Relevancy and Missingness
Now we can drop the variables that will not be useful. A URL will not provide much insight, nor will the columns containing *id, which act as numerical identifiers. The information contained within description might reveal specifics about the vehicles, but with 2,519,325 different strings, it is likely not worth the preprocessing effort.
The initial data contained a high degree of missingness: 16 variables have >= 47% missing values, while the next highest proportion of missing values is under 20%. We can filter to the features with under 20% missing and then select the complete cases for each of the remaining features to maximize the number of observations retained in the set.
drop_columns = ['main_picture_url', 'vin', 'trimId', 'trim_name',
'sp_id', 'sp_name', 'listing_id', 'description']
df.drop(columns=drop_columns, inplace=True)
df['year'] = df['year'].astype('int')
df = df.loc[:, df.isnull().mean() < 0.20]
df = df[df.major_options.notna() & df.mileage.notna()
& df.engine_displacement.notna()
& df.transmission_display.notna() & df.seller_rating.notna()
& df.engine_cylinders.notna()
& df.back_legroom.notna() & df.wheel_system.notna()
& df.interior_color.notna()
& df.body_type.notna() & df.exterior_color.notna()
& df.franchise_make.notna() & df.torque.notna()
& df.highway_fuel_economy.notna() & df.city_fuel_economy.notna()
& df.power.notna()]
Time
Vehicle manufacturers release new models each year with newer technologies and accessories, so time is generally an important factor to consider when building predictive models from historical data. Let's examine when each vehicle was made by generating a graph of the number of vehicles from each year. Older vehicles probably cost less than newer vehicles, but this might not always hold, considering there are vintage cars.
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x='year', data=df).set_title('Count of Postings in Each Year')
plt.yticks(fontsize=20)
plt.xticks(fontsize=20)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show();
print('Average year: ' + str(round(df['year'].mean())))
The majority of the vehicles were manufactured in 2020, with the average year being 2019. The oldest vehicle is from 1984, and there are vehicles from model year 2021, even though the crawler pulled this data in September 2020. Let's filter the data to model years greater than or equal to 2016 due to the count and the distribution. Then, using a function with autopct arguments to convert count data to percentages, we can graph the percentage of each model year out of the total listings in this subset.
df = df.loc[df['year'] >= 2016]
def func(pct):
    return "{:1.1f}%".format(pct)
plt.figure(figsize=(20, 7))
plt.title('Percentage of Listings in Each Year', fontsize=25)
df['year'].value_counts().plot.pie(autopct=lambda pct: func(pct),
textprops={'fontsize': 15})
plt.tight_layout()
plt.show();
57% of the vehicle listings have a 2020 model year. Due to potential seasonality effects, let's examine the month of the year when a vehicle was listed by converting the listed_date to a monthly feature and examining the top 10 year-months by count.
pd.set_option('display.max_rows', 10)
df['listed_date'] = pd.to_datetime(df['listed_date'])
df['listed_date_yearMonth'] = df['listed_date'].dt.to_period('M')
print('\nNumber of listings \nin each Year-Month:')
print('\n')
print(df.listed_date_yearMonth.value_counts(sort=True).nlargest(10).to_string())
August contained the most listings, while April contained the fewest, with December, January and May close behind. Most of the listings occurred during June - September of 2020, so the set was filtered to this date range, which condenses the graph significantly. Given this time period falls within the first year of the COVID-19 pandemic, various potential confounding factors need to be considered, so let's utilize a narrow range of dates around the summer months.
df = df.loc[(df['listed_date_yearMonth'] >= '2020-06')]
sns.countplot(x='listed_date_yearMonth',
data=df,
order=df['listed_date_yearMonth'].value_counts().index).set_title('Number Postings in Each Year-Month')
plt.yticks(fontsize=25)
plt.xticks(fontsize=25)
plt.tight_layout()
plt.show();
Location of Listings
The initial data contains 4,687 cities and 8,237 dealer zip codes, so the listings potentially come from all over the United States, where the cost of living varies by location. Some states have higher taxes and a higher cost of living. We want to be sure that price is not guided by location and that it doesn't confound the modeling results. Therefore, let's examine the length of the dealer_zip because the data contains both shortened and complete zip codes.
df['dealer_zip_length'] = df['dealer_zip'].str.len()
print('\nNumber of different \nlengths of zipcodes:')
print('\n')
print(df.dealer_zip_length.value_counts(ascending=False).to_string())
The majority of the zipcodes are the shortened version, so let's subset to the shortened zipcodes and convert them to numerical values. Then we can create a temporary pandas.DataFrame with only the zipcode, find the unique zipcodes, and process them into a list and then a dataframe. There are 4,898 unique zipcodes, so it appears the data is not localized to a small section of the country.
df = df[df.dealer_zip_length == 5]
df = df.drop(['dealer_zip_length'], axis=1)
df['dealer_zip'] = df['dealer_zip'].astype('int64')
df1 = df.dealer_zip
df1 = df1.unique().tolist()
df1 = pd.DataFrame(df1)
df1.columns = ['dealer_zip_unique']
print('- Number of unique zipcodes:', df1.shape[0])
The initial set contains the city
but not the state where the vehicle was listed. We can use a temporary dataframe with unique zipcodes to find the state and city corresponding to the zipcode by defining two separate functions using a SearchEngine
that utilizes the zipcode to find state and city. This temporary dataframe now has three columns where the queried City
and dealer_zip_unique
can be used to merge the new features using a right join. Duplicates and unnecessary columns can be removed as well as the zipcodes that were not found within the SearchEngine
.
from uszipcode import SearchEngine
search = SearchEngine(simple_zipcode=False)
def zcode_state(x):
    """ Use the zipcode to find the state """
    return search.by_zipcode(x).state

def zcode_city(x):
    """ Use the zipcode to find the city """
    return search.by_zipcode(x).city
df1['State'] = df1['dealer_zip_unique'].fillna(0).astype(int).apply(zcode_state)
df1['City'] = df1['dealer_zip_unique'].fillna(0).astype(int).apply(zcode_city)
df = pd.merge(df, df1, how='right', left_on=['city', 'dealer_zip'],
right_on=['City', 'dealer_zip_unique'])
df = df.drop_duplicates()
df = df.drop(['dealer_zip_unique', 'city', 'City', 'dealer_zip'], axis=1)
df = df[df['mileage'].notna()]
With the addition of the State where the vehicles were listed, we can now examine the top ten locations and consider possible differences in the standard of living by location. Out of the top 10 states with the highest number of listings in this set, Texas has the most while Michigan has the fewest. Georgia, North Carolina and Michigan don't seem to fit with the northern states or the largest states like Texas, California and Florida, so let's filter to the states with the seven highest counts of listings. This results in a set with over 400,000 observations and 42 columns.
print('\nNumber of listings \n in each US state:')
print('\n')
print(df.State.value_counts(sort=True).nlargest(10).to_string())
print('\n')
df1 = df['State'].value_counts().index[:7]
df = df[df['State'].isin(df1)]
print('- Dimensions after filtering states with the 7 highest counts of listings:',
df.shape)
del df1
Dimensionality of Categorical Variables
The problem of high dimensionality is an important consideration when modeling because the more groups each categorical variable contains, the larger the set becomes after converting to dummy or one-hot encoded variables. This requires more memory (RAM and/or GPU memory) for the computations, resulting in longer runtimes and higher cost. Examining the current object and bool variables, major_options and the color-related features (interior_color and exterior_color) contain thousands of different groups, so let's drop these features. The transmission_display variable contains 44 groups while transmission has four levels, so we'll use the broader variable for the characteristics of the transmission. Next, engine_cylinders and engine_type appear to contain the same information and contain over 30 groups, so let's also remove them from the set. franchise_make contains the manufacturer name, but we can identify the components of the car rather than bias the analysis by using a provided name. Lastly, franchise_dealer is a boolean with True/False, so let's also remove this feature.
drop_columns = ['major_options', 'interior_color', 'exterior_color',
'transmission_display', 'engine_cylinders', 'engine_type',
'franchise_dealer', 'franchise_make']
df.drop(columns=drop_columns, inplace=True)
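As a quick toy illustration of why high-cardinality categoricals are costly after encoding (hypothetical values, not from the dataset):
import pandas as pd

# Hypothetical toy frame: one low-cardinality and one high-cardinality column
toy = pd.DataFrame({'transmission': ['A', 'M', 'CVT', 'A'],
                    'exterior_color': ['Midnight Black Metallic', 'Oxford White',
                                       'Deep Crystal Blue Mica', 'Platinum Gray']})
encoded = pd.get_dummies(toy)
# Each unique category becomes its own 0/1 column, so a feature with
# thousands of colors would add thousands of columns to the design matrix
print(encoded.shape)  # (4, 7): 3 transmission levels + 4 color levels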
Some of the categorical features contain information about the size (in), volume (gal) or number (seats) of various characteristics of the vehicles, so removing the repetitive unit text allows these features to become continuous variables rather than increasing the dimensionality of the set. Now we can examine a few observations of the variables and decide how to process them.
df[['back_legroom', 'wheelbase', 'width', 'length', 'height',
'fuel_tank_volume', 'front_legroom', 'maximum_seating']].head()
Let's use a lambda function on a series that splits the numbers from the text components, leaving only the numbers, and designate which columns to apply it to. This creates missing data, so we can then remove the rows containing missing values introduced by transforming the previous string variables to numeric. Now we can see that these columns contain only numerical data.
extract_num_from_catVar = lambda series: series.str.split().str[0].astype(float)
columns = ['back_legroom', 'wheelbase', 'width', 'length', 'height',
'fuel_tank_volume', 'front_legroom', 'maximum_seating']
df[columns] = df[columns].replace({',': '',
'--': np.nan}).apply(extract_num_from_catVar)
df = df[df.back_legroom.notna() & df.front_legroom.notna()
& df.fuel_tank_volume.notna() & df.maximum_seating.notna()
& df.height.notna() & df.length.notna() & df.wheelbase.notna()
& df.width.notna()]
df.reset_index(drop=True, inplace=True)
df[['back_legroom', 'wheelbase', 'width', 'length', 'height',
'fuel_tank_volume', 'front_legroom', 'maximum_seating']].head()
The torque and power features have similar structures: a number followed by a text string, @, a number and another text string. This means similar string splitting at the same index locations can be used, with the leftover string components converted to np.nan. Let's first examine a sample of both variables:
df[['torque', 'power']].head()
To process torque
, first remove the ','
, and split the string. Then create a pandas.DataFrame
containing only the float
information, name the columns and then reset the index. Now, concatenate the new feature columns created with the main table, drop the original torque
feature and lastly remove any missing data that was created during the string split. The same methods can be used to process power
which results in a numerical horsepower
. A few samples of the new features can then be viewed:
df1 = df.torque.str.replace(',', '').str.split().str[0:4:3]
df1 = pd.DataFrame([[np.nan, np.nan] if type(i).__name__ == 'float'
else np.asarray(i).astype('float') for i in df1])
df1.columns = ['torque_new', 'torque_rpm']
df1.reset_index(drop=True, inplace=True)
df = pd.concat([df, df1], axis=1)
df = df.drop(['torque'], axis=1)
df = df[df.torque_rpm.notna()]
df.rename(columns={'torque_new': 'torque'}, inplace=True)
del df1
df1 = df.power.str.replace(',', '').str.split().str[0:4:3]
df1 = pd.DataFrame([[np.nan, np.nan] if type(i).__name__ == 'float'
else np.asarray(i).astype('float') for i in df1])
df1.columns = ['horsepower_new', 'horsepower_rpm']
df1.reset_index(drop=True, inplace=True)
df = pd.concat([df, df1], axis=1)
df = df.drop(['horsepower', 'power'], axis=1)
df = df[df.horsepower_rpm.notna()]
df.rename(columns={'horsepower_new': 'horsepower'}, inplace=True)
del df1
df[['torque', 'horsepower']].head()
Examine Dependent Variable: Price
Using describe() can generate more than the typical five summary statistics by adding the count and minimum/maximum values. The average price is around 32,000 U.S. dollars. The standard deviation is around 15,000, so let's include prices up to 50,000 dollars in the set. When outliers are present, the mean can be affected, so we can examine the median as well, which is around 29,000. The proximity of the mean and median suggests the data is not highly skewed, but with a maximum price of 449,995 dollars, there definitely are outliers. Outlier testing methods could be applied here, such as the Z-score, which measures how many standard deviations a value lies from the mean, or the Local Outlier Factor, which compares how close a value is to the values of its neighbors (a local density), where a local density greater than one signifies the value differs from the rest.
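As a minimal sketch of how the Z-score check could look (the |z| > 3 cutoff is a common convention, an assumption here rather than something from the analysis):
from scipy import stats  # assumes scipy is available

# Z-score: standard deviations from the mean; flag |z| > 3 as potential outliers
z = np.abs(stats.zscore(df['price']))
print('Potential outliers by |z| > 3:', int((z > 3).sum()))

# The Local Outlier Factor alternative compares local densities instead:
# from sklearn.neighbors import LocalOutlierFactor
# flags = LocalOutlierFactor(n_neighbors=20).fit_predict(df[['price']])  # -1 = outlier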
print('\nSummary statistics: price' + ('\n') + str(round(df['price'].describe()).to_string()))
print('\n')
print('- Median price: ' + str(round(df['price'].median())))
Now, let's filter the observations with a price less than or equal to $50,000, and examine the price across the locations where the vehicles were posted using the latitude vs. longitude features.
df = df.loc[df['price'] <= 50000.0]
plt.figure(figsize=(20, 7))
sns.histplot(x=df['price'],
kde=True).set_title('Distribution of Price <= $50,000');
plt.yticks(fontsize=25)
plt.xticks(fontsize=25)
plt.show();
print('- There are now ' + str(df.shape[0]) + ' observations.')
df.plot(kind='scatter', x='latitude', y='longitude', c='price', cmap='RdBu_r',
colorbar=True, sharex=False);
plt.title('Price: Latitude vs Longitude', fontsize=35, y=1.02);
plt.yticks(fontsize=25)
plt.xticks(fontsize=25)
plt.show()
drop_columns = ['latitude', 'longitude']
df.drop(columns=drop_columns, inplace=True)
Visually, there doesn't appear to be a large amount of lower prices (blue) localized to any of the US states in this set, but there might be a greater proportion of higher prices (dark red) in New York compared to Pennsylvania.
Visualizations: Categorical Variables
Now we can filter to the categorical features that have been reduced to a lower number of groups, and see how price varies within the groups by using side-by-side boxplots.
df_cat = df.select_dtypes(include = 'object')
df_cat = df_cat.drop(['State', 'model_name', 'make_name', 'listing_color'],
axis=1)
plt.rcParams.update({'font.size': 20})
fig, ax = plt.subplots(3, 2, figsize=(20,17))
fig.suptitle('Categorical Variables vs. Price', fontsize=35, y=1.02)
for var, ax in zip(df_cat, ax.flatten()):
    sns.boxplot(x=var, y='price', data=df, ax=ax)
    ax.tick_params(axis='x', labelrotation=45, labelsize=15)
    ax.tick_params(axis='y', labelsize=15)
plt.tight_layout();
plt.show();
del df_cat, fig, ax
wheel_system and wheel_system_display contain the same information; the former uses acronyms while the latter spells out the full text. We will retain the more detailed feature (wheel_system_display) so its components are easier to distinguish.
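A quick sanity check of that claim is to inspect the unique pairings of the two columns (assuming a one-to-one mapping between acronym and label):
# Each acronym should map to exactly one spelled-out label, e.g. AWD -> All-Wheel Drive
print(df[['wheel_system', 'wheel_system_display']].drop_duplicates())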
The color of a vehicle tends to be an important factor when individuals purchase a car. Let's investigate whether the listing_color has any association with price.
sorted_nb = df.groupby(['listing_color'])['price'].median().sort_values()
plt.title('Listing Color vs. Price', fontsize=35, y=1.02)
sns.boxplot(x=df['listing_color'], y=df['price'], order=list(sorted_nb.index))
plt.tick_params(axis='x', labelrotation=45, labelsize=20)
plt.tick_params(axis='y', labelsize=20)
plt.tight_layout()
plt.show();
Now we can examine the number of listings for the different colors observed in the prior graph. Seven colors, including UNKNOWN, make up the majority of the set. We can filter to the observations with the seven highest counts for listing_color and then remove the listings with listing_color='UNKNOWN'.
print('\nNumber of listings \nwith vehicle color:')
print('\n')
print(df.listing_color.value_counts(ascending=False).to_string())
df1 = df['listing_color'].value_counts().index[:7]
df = df[df['listing_color'].isin(df1)]
df = df.loc[df['listing_color'] != 'UNKNOWN']
del df1
Since the name of the car manufacturer exists in the set, let's examine the median price of vehicles distinguished by the manufacturer, and compare that to the overall median price ($28,710
). This might provide some insight about the features and components of the vehicles with higher prices.
df1 = df.groupby('make_name')['price'].median().reset_index()
df1.rename(columns={'price': 'make_medianPrice'}, inplace=True)
print('\nMedian price of the manufacturer:')
print('\n')
df1 = df1.sort_values('make_medianPrice', ascending=False)
print(df1.iloc[0:10])
del df1
The vehicles with the highest median prices are mostly from luxury manufacturers, which makes logical sense. RAM makes trucks, which are larger vehicles with larger engines, so they cost more to produce and therefore sell at higher prices to make a profit.
After the exploratory data analysis (EDA) of the qualitative features, we can now drop variables due to similarity, high dimensionality, or limited usefulness for modeling.
df = df.drop(['make_name', 'model_name', 'wheel_system',
'seller_rating', 'listed_date'], axis=1)
df = df.drop_duplicates()
Visualizations: Quantitative Variables
Histograms can be used to examine the distributions of numerical data. Visually, left- or right-skewed data can be observed if the majority of the data is on the right or left side of the histogram, respectively.
df_num = df.select_dtypes(include = ['float64', 'int64'])
cols = df_num.columns.tolist()
cols = cols[-8:] + cols[:-8]
df_num = df_num[cols]
sns.set(font_scale=2)
df_num.hist(figsize=(30,30), bins=50, xlabelsize=16, ylabelsize=17)
plt.tight_layout()
plt.show();
Correlations
To determine which quantitative variables are correlated with each other, we can create a correlation matrix using method='spearman', a nonparametric approach that ranks the features; this suits the non-Gaussian distributions observed in the prior visualization. Then we can plot the correlation matrix without any thresholds.
To increase the granularity of which features are correlated with price, let's extract the features with a correlation coefficient greater than 0.5 or less than -0.5 and examine the magnitudes.
corr = df_num.corr(method='spearman')
sns.set(font_scale=1.2)
plt.rcParams['figure.figsize'] = (20, 10)
sns.heatmap(corr, cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
annot=False, square=True)
plt.title('Correlation Matrix with Spearman rho', fontsize=35, y=1.02)
plt.tight_layout()
plt.show();
df_num_corr = df_num.drop('price', axis=1).apply(lambda x: x.corr(df_num.price))
quant_features_list = df_num_corr[abs(df_num_corr) > 0.50].sort_values(ascending=False)
print('- There are {} moderately correlated values with Price:\n{}'.format(
len(quant_features_list), quant_features_list))
The horsepower
feature has the highest positive correlation with price
while highway_fuel_economy
has the highest negative.
Now let's filter the correlation matrix using thresholds of rho <= -0.5 or >= 0.5
to reduce the noise in the plot and to better decipher which features have higher correlation in the set.
sns.heatmap(corr[(corr >= 0.5) | (corr <= -0.5)],
cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
annot=True, annot_kws={'size': 9}, square=True);
plt.title('Correlation Matrix with Spearman rho >= 0.5 or <= -0.5',
fontsize=30, y=1.02)
plt.tight_layout()
plt.show();
When utilizing this threshold, the amount of content on the heatmap decreased. The most positively correlated features are highway_fuel_economy and city_fuel_economy with rho=0.94, followed by length and wheelbase with rho=0.93. The most negatively correlated features are city_fuel_economy and horsepower with rho=-0.85, as well as city_fuel_economy and fuel_tank_volume, also with rho=-0.85. These relationships might need to be considered regarding multicollinearity prior to modeling, depending on the method utilized.
Modeling
Now that the initial data has been processed to a final set, we can proceed with evaluating various modeling approaches. When testing linear models (lasso, ridge and elastic net regression) using the RAPIDS ecosystem, hyperparameter tuning with 1,000 trials each did not generate adequate model metrics, even after applying the variance inflation factor (VIF) to address multicollinearity. Different scaling methods (MinMaxScaler and StandardScaler) were tested as well. Similar results were encountered using LinearSVR. This might be attributed to the foundational solvers available in the version of cuml shipped with rapids:22.02. The notebook for the linear models is here and for LinearSVR here.
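For reference, a minimal sketch of how a VIF screen can be computed with statsmodels (assuming statsmodels is installed and using the numeric frame df_num from the EDA above; not the exact code from those notebooks):
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# VIF for each feature: how much its variance is inflated by collinearity
X_vif = add_constant(df_num.drop('price', axis=1))
vif = pd.Series([variance_inflation_factor(X_vif.values, i)
                 for i in range(X_vif.shape[1])], index=X_vif.columns)
# A common rule of thumb flags features with VIF above roughly 5-10
print(vif.sort_values(ascending=False))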
Multi-Layer Perceptron (MLP) Regression
The MLP can be utilized for regression problems, as demonstrated in Multilayer perceptrons for classification and regression by Fionn Murtagh in 1991. It is derived from the perceptron, conceived in the 1940s and implemented by Frank Rosenblatt in the 1950s with The Perceptron — A Perceiving and Recognizing Automaton. The MLP is an artificial neural network of connected neurons containing an input layer, the hidden layer(s) and an output layer with a single output neuron for regression. The data is fed as input to the network and passes through the hidden layers one layer at a time, applying weights (connections to the next neuron) and biases (the threshold value needed to obtain the output) until the output layer. The output is then compared to the expected output, and the difference is the error. This error is propagated back through the network, layer by layer, where the weights are updated depending on their effect on the error; this is the backpropagation algorithm. This is repeated for all of the samples in the training data, where one full cycle of updating the network is an epoch.
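To make these mechanics concrete, here is a minimal numpy sketch of a single backpropagation update for a one-hidden-layer regression network (purely illustrative; the Keras code below handles all of this automatically):
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(8, 3))            # 8 samples, 3 features
y = rng.normal(size=(8, 1))            # continuous target
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)  # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)  # hidden -> output
lr = 0.001

# Forward pass: ReLU hidden layer, linear output
h = np.maximum(0, X @ W1 + b1)
y_hat = h @ W2 + b2
error = y_hat - y                      # gradient of 0.5*MSE w.r.t. y_hat

# Backward pass: propagate the error back, layer by layer
dW2 = h.T @ error / len(X)
db2 = error.mean(axis=0)
dh = (error @ W2.T) * (h > 0)          # ReLU gradient
dW1 = X.T @ dh / len(X)
db1 = dh.mean(axis=0)

# Gradient descent step; one sweep over all training samples is an epoch
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2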
The notebook can be found here. Let's first set up the environment by installing keras-tuner for hyperparameter tuning the MLP, then import the dependencies and examine the versions of tensorflow and keras, the number of GPUs available, the CUDA version and which GPU will be used.
!pip install keras-tuner
import os
import random
import numpy as np
import warnings
import tensorflow as tf
warnings.filterwarnings('ignore')
print('\n')
print('TensorFlow version: {}'.format(tf.__version__))
print('Eager execution is: {}'.format(tf.executing_eagerly()))
print('Keras version: {}'.format(tf.keras.__version__))
print('Num GPUs Available: ', len(tf.config.list_physical_devices('GPU')))
print('\n')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
Now we can define a function to set the numpy, random and tensorflow seeds, as well as configure the session environment for reproducibility.
def init_seeds(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1,
                                            inter_op_parallelism_threads=1)
    os.environ['TF_CUDNN_DETERMINISTIC'] = 'True'
    os.environ['TF_DETERMINISTIC_OPS'] = 'True'
    sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(),
                                config=session_conf)
    tf.compat.v1.keras.backend.set_session(sess)
    return sess
init_seeds(seed=42)
The data can then be read from the data directory, partitioned into the features and the target, and the training and test sets created using test_size=0.2
. These processed sets can be saved for later use with other ML
algorithms so comparisons can be made. Next, we need to create dummy variables for the categorical variables since most algorithms require the model input to be numerical values. Neural networks tend to perform better when the input data is scaled, so let's use the StandardScaler
that is fit on the training set and then applied to the test set.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
pd.set_option('display.max_columns', None)
df = pd.read_csv('usedCars_final.csv', low_memory=False)
X = df.drop(['price'], axis=1)
y = df['price']
seed_value = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=seed_value)
del X, y
train = pd.concat([y_train, X_train], axis=1)
train.to_csv('usedCars_trainSet.csv', index=False)
test = pd.concat([y_test, X_test], axis=1)
test.to_csv('usedCars_testSet.csv', index=False)
del train, test
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
print('Dimensions of X_train for input:', X_train.shape[1])
sc = StandardScaler()
X_train = pd.DataFrame(sc.fit_transform(X_train))
X_test = pd.DataFrame(sc.transform(X_test))
Baseline Model
It is always good to use a baseline model which can function as a reference for downstream comparisons. This can also suggest which parameter(s) need more tuning to generate a model with better model metrics. Since the dependent variable price
is quantitative, the metrics for regression-based problems should be followed. Let's first set up where the logs and callbacks from the model are saved. We can specify the parameters for EarlyStopping
which monitors the val_loss
with a patience=5
. This says that if the validation loss from the model does not improve after five epochs, then the model will stop training. We can also specify the ModelCheckpoint
to monitor the mse
and save only the one with the lowest mse
.
import datetime
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
%load_ext tensorboard
log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
filepath = 'MLP_weights_only_baseline_b4_HPO.h5'
checkpoint_dir = os.path.dirname(filepath)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
histogram_freq=1)
callbacks_list = [EarlyStopping(monitor='val_loss', patience=5),
ModelCheckpoint(filepath, monitor='mse',
save_best_only=True, mode='min'),
tensorboard_callback]
Next, we can define the baseline model architecture. Since there are 53 features after creating the dummy variables, this needs to be specified as the input_dim. Given this is a neural network, the architecture of the baseline model could be as simple as model = Sequential() followed by model.compile(). Let's try using roughly double the number of features as the first layer, decreasing sequentially by 10 neurons per layer.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
model = Sequential()
model.add(Dense(105, input_dim=53, kernel_initializer='normal',
activation='relu'))
model.add(Dense(95, kernel_initializer='normal', activation='relu'))
model.add(Dense(85, kernel_initializer='normal', activation='relu'))
model.add(Dense(75, kernel_initializer='normal', activation='relu'))
model.add(Dense(65, kernel_initializer='normal', activation='relu'))
model.add(Dense(55, kernel_initializer='normal', activation='relu'))
model.add(Dense(45, kernel_initializer='normal', activation='relu'))
model.add(Dense(35, kernel_initializer='normal', activation='relu'))
model.add(Dense(25, kernel_initializer='normal', activation='relu'))
model.add(Dense(15, kernel_initializer='normal', activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1))
opt = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='mae', metrics=['mse'], optimizer=opt)
model.summary()
Now the model can be trained by calling fit on the training set for 50 epochs (the number of iterations over the entire X_train and y_train), using a batch_size=4 (the number of samples per gradient update), validation_split=0.2 (the fraction of the training data held out to evaluate the loss and the model metrics at the end of each epoch), and the specified callbacks from the callbacks_list.
history = model.fit(X_train, y_train, epochs=50, batch_size=4,
validation_split=0.2, callbacks=callbacks_list)
Now, let's save the model if we need to reload it in the future and plot the model loss
and val_loss
over the training epochs.
import matplotlib.pyplot as plt
model.save('./MLP_batch4_50epochs_baseline_tf.h5', save_format='tf')
#filepath = 'MLP_weights_only_baseline_b4.h5'
#model = tf.keras.models.load_model('./MLP_batch4_50epochs_baseline_tf.h5')
#model.load_weights(filepath)
plt.title('Model Error for Price')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='val_loss')
plt.ylabel('Error [Price]')
plt.xlabel('Epoch')
plt.legend()
plt.show();
We can save the model losses in a pandas.DataFrame, and plot both the mae and mse from the train and validation sets over the epochs as well.
losses = pd.DataFrame(model.history.history)
losses.plot()
plt.title('Model Error for Price')
plt.ylabel('Error [Price]')
plt.xlabel('Epoch')
plt.legend(loc='upper right')
plt.show();
We can use model.predict to generate output predictions for the specified input samples, so let's predict using the training set, convert the predicted prices to a pandas.DataFrame and examine the size to determine the chunk size for plotting. Then let's graph the predicted vs. actual price for the training set.
pred_train = model.predict(X_train)
y_pred = pd.DataFrame(pred_train)
y_pred.shape
plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Training Set: Predicted Price vs Actual Price',
fontsize=25)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Price', pad=15, fontsize=20)
ax1.set_ylabel('Price', fontsize=20)
ax2.plot(y_train, color='blue')
ax2.set_title('Actual Price', pad=15, fontsize=20)
plt.show();
We can use model.predict on the test set, convert the predicted prices to a pandas.DataFrame and graph the predicted vs. actual price for the test set.
pred_test = model.predict(X_test)
y_pred = pd.DataFrame(pred_test)
plt.rcParams['agg.path.chunksize'] = 1000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Price', pad=15, fontsize=20)
ax1.set_ylabel('Price', fontsize=20)
ax2.plot(y_test, color='blue')
ax2.set_title('Actual Price', pad=15, fontsize=20)
plt.show();
To evaluate whether this MLP is effective at predicting the price of used cars, metrics like the Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and the Coefficient of Determination (R²) can be utilized.
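For reference, these metrics reduce to simple numpy expressions; a sketch equivalent to the sklearn helpers used next, assuming the pred_test predictions from above:
# Illustrative numpy definitions of the regression metrics (equivalent to sklearn)
resid = y_test.to_numpy().ravel() - pred_test.ravel()
mae = np.abs(resid).mean()              # mean absolute error
mse = (resid ** 2).mean()               # mean squared error
rmse = np.sqrt(mse)                     # root mean squared error
r2 = 1 - mse / y_test.to_numpy().var()  # 1 - SS_res/SS_tot: variance explained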
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
print('Metrics: Train set')
print('MAE: %3f' % mean_absolute_error(y_train[:], pred_train[:]))
print('MSE: %3f' % mean_squared_error(y_train[:], pred_train[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_train[:], pred_train[:])))
print('R2: %3f' % r2_score(y_train[:], pred_train[:]))
print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y_test[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y_test[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_test[:], pred_test[:])))
print('R2: %3f' % r2_score(y_test[:], pred_test[:]))
The test set has higher values for each of the aforementioned metrics, but they are in close proximity to the training values, suggesting the model worked quite well and was not overfit. The baseline model with this architecture explains the variance quite well, but let's test if this can be improved.
Let's also examine how the predicted maximum, average and minimum prices compare to the actual maximum, average and minimum prices.
print('Maximum Price:', np.amax(y_test))
print('Predicted Max Price:', np.amax(pred_test))
print('\nAverage Price:', np.average(y_test))
print('Predicted Average Price:', np.average(pred_test))
print('\nMinimum Price:', np.amin(y_test))
print('Predicted Minimum Price:', np.amin(pred_test))
For the baseline model, the predicted maximum price is higher than the actual maximum, and the predicted minimum is higher than the actual minimum, while the predicted average is lower than the actual average price.
Hyperparameter Optimization
Let's first set up the callbacks for the TensorBoard
. Then we can define a function build_model
that will evaluate different model parameters during the hyperparameter tuning process. The parameters to test are:
- num_layers: 13 - 20 layers
- layer_size: 70 - 130 nodes using step=10
- learning_rate: 1 x 10⁻¹, 1 x 10⁻², 1 x 10⁻³
The same 30% Dropout
before the output layer as well as the same loss='mae'
and metrics=['mse']
can be utilized to compare the aforementioned hyperparameters.
filepath = 'MLP_weights_only_b4_HPO3.h5'
checkpoint_dir = os.path.dirname(filepath)
callbacks = [TensorBoard(log_dir=log_folder,
histogram_freq=1,
write_graph=True,
write_images=True,
update_freq='epoch',
profile_batch=1,
embeddings_freq=1)]
callbacks_list = callbacks
def build_model(hp):
    model = tf.keras.Sequential()
    for i in range(hp.Int('num_layers', 13, 20)):
        model.add(tf.keras.layers.Dense(units=hp.Int('layer_size' + str(i),
                                                     min_value=70,
                                                     max_value=130, step=10),
                                        activation='relu',
                                        kernel_initializer='normal'))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(1))
    model.compile(loss='mae', metrics=['mse'],
                  optimizer=tf.keras.optimizers.Adam(
                      hp.Choice('learning_rate', values=[1e-1, 1e-2, 1e-3])))
    return model
We can define the conditions for keras_tuner, which include the objective='val_loss', the number of trials to search through (max_trials=25), the number of executions per trial (executions_per_trial=1), and the directory and project_name. Then we can print a summary of the search space.
import keras_tuner
from keras_tuner import BayesianOptimization
tuner = BayesianOptimization(
build_model,
objective='val_loss',
max_trials=25,
executions_per_trial=1,
overwrite=True,
directory='MLP_HPO3',
project_name='MLP_HPO3')
tuner.search_space_summary()
Let's now begin the search for the best hyperparameters testing different parameters for one epoch using validation_split=0.2
and batch_size=4
.
tuner.search(X_train, y_train, epochs=1, validation_split=0.2, batch_size=4,
callbacks=callbacks_list)
Now that the search has completed, let's print a summary of the results from the trials.
tuner.results_summary()
The hyperparameters that resulted in the lowest val_loss (Score: 2655) contained 13 layers and a learning_rate of 0.001, with the initial layers being the largest (130 nodes). 9 of the 13 layers contain a layer_size of 130 while 4 contain a layer_size of 70.
Fit The Best Model from HPO
Let's now set the path where the model is saved and set up the callbacks: EarlyStopping, which monitors the val_loss and will stop training if it does not improve after 5 epochs, ModelCheckpoint, which monitors the mse and saves only the best one with a min loss, and TensorBoard as well.
%load_ext tensorboard
log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
filepath = 'MLP_weights_only_b4_HPO1_bestModel.h5'
checkpoint_dir = os.path.dirname(filepath)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
histogram_freq=1)
callbacks_list = [EarlyStopping(monitor='val_loss', patience=5),
ModelCheckpoint(filepath, monitor='mse',
save_best_only=True, mode='min'),
tensorboard_callback]
We can now define the model architecture using the results from the keras-tuner search, compile, and then examine the model.
model = Sequential()
model.add(Dense(130, input_dim=53, kernel_initializer='normal',
activation='relu'))
model.add(Dense(130, kernel_initializer='normal', activation='relu'))
model.add(Dense(130, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(130, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(130, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(130, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(130, kernel_initializer='normal', activation='relu'))
model.add(Dense(130, kernel_initializer='normal', activation='relu'))
model.add(Dense(130, kernel_initializer='normal', activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1))
opt = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='mae', metrics=['mse'], optimizer=opt)
model.summary()
The total parameters for this model have increased to 148,871 from the 44,356 utilized in the baseline model. Now the model can be trained by calling fit on the training set for 50 epochs using a batch_size=4, validation_split=0.2 and the specified callbacks from the callbacks_list.
history = model.fit(X_train, y_train, epochs=50, batch_size=4,
validation_split=0.2, callbacks=callbacks_list)
Now, let's save the model in case we need to reload it in the future, and plot the model loss and val_loss over the training epochs. We can also save the model losses in a pandas.DataFrame, and plot both the mae and mse from the train and validation sets over the epochs as well.
model.save('./MLP_batch4_50epochs_HPO1_bestModel_tf.h5', save_format='tf')
# Load model for more training or later use
#filepath = 'MLP_weights_only_b4_HPO1_bestModel.h5'
#model = tf.keras.models.load_model('./MLP_batch4_50epochs_HPO1_bestModel_tf.h5')
#model.load_weights(filepath)
plt.title('Model Error for Price')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='val_loss')
plt.ylabel('Error [Price]')
plt.xlabel('Epoch')
plt.legend()
plt.show()
losses = pd.DataFrame(model.history.history)
losses.plot()
plt.title('Model Error for Price')
plt.ylabel('Error [Price]')
plt.xlabel('Epoch')
plt.legend(loc='upper right')
plt.show()
We can use model.predict to generate output predictions for the specified input samples, so let's predict using the training set, convert the predicted prices to a pandas.DataFrame and examine the size to determine the chunk size for plotting. Then let's graph the predicted vs. actual price for the training set.
pred_train = model.predict(X_train)
y_pred = pd.DataFrame(pred_train)
y_pred.shape
plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Training Set: Predicted Price vs Actual Price',
fontsize=25)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Price', pad=15, fontsize=20)
ax1.set_ylabel('Price', fontsize=20)
ax2.plot(y_train, color='blue')
ax2.set_title('Actual Price', pad=15, fontsize=20)
plt.show()
We can use model.predict on the test set, convert the predicted prices to a pandas.DataFrame and graph the predicted vs. actual price for the test set.
pred_test = model.predict(X_test)
y_pred = pd.DataFrame(pred_test)
plt.rcParams['agg.path.chunksize'] = 1000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Price', pad=15, fontsize=20)
ax1.set_ylabel('Price', fontsize=20)
ax2.plot(y_test, color='blue')
ax2.set_title('Actual Price', pad=15, fontsize=20)
plt.show()
To evaluate if this MLP
is effective at predicting the price of used cars, metrics like the Mean Absolute Error (MAE)
, Mean Squared Error (MSE)
, Root Mean Squared Error (RMSE)
and the Coefficient of Determination
(R²) can be utilized for both the training and test sets.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
print('Metrics: Train set')
print('MAE: %3f' % mean_absolute_error(y_train[:], pred_train[:]))
print('MSE: %3f' % mean_squared_error(y_train[:], pred_train[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_train[:], pred_train[:])))
print('R2: %3f' % r2_score(y_train[:], pred_train[:]))
print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y_test[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y_test[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_test[:], pred_test[:])))
print('R2: %3f' % r2_score(y_test[:], pred_test[:]))
When comparing the model metrics from the hyperparameter tuning to the baseline model, the MAE, MSE and RMSE were all lower for the training and test sets, and the R² was higher for both sets.
Let's also examine how the predicted values for the maximum, average and minimum price compare to the actual maximum, average and minimum price.
print('Maximum Price:', np.amax(y_test))
print('Predicted Max Price:', np.amax(pred_test))
print('\nAverage Price:', np.average(y_test))
print('Predicted Average Price:', np.average(pred_test))
print('\nMinimum Price:', np.amin(y_test))
print('Predicted Minimum Price:', np.amin(pred_test))
When evaluating the model from the hyperparameter search on the test set, the predicted maximum price is lower and closer to the actual (49700.676 vs. 58631.125), but the predicted average price (27487.953 vs. 27052.695) and predicted minimum price (7409.093 vs. 7353.111) are higher and farther from the actual when compared to the baseline model.
Another hyperparameter search was tested using a more complex network architecture:
- num_layers: 20 - 30
- layer_size: 100 - 200 using step=10
- The same learning_rate
This utilized 50 trials vs. 25 trials for the aforementioned HPO. The completion time was 03h 54m 19s compared to 01h 38m 11s. The best val_loss = 547.799072265625 compared to the best val_loss = 2655.16650390625. The more complex search resulted in a network with dense_20, where 12 of the Dense layers had 200 nodes, resulting in total parameters of 568,051 compared to 148,871 for the aforementioned. This resulted in only a slightly better R² but a worse predicted minimum price. Another search tested stratify=X.listed_date_yearMonth with 50 trials; the completion time was 05h 40m 54s with the best val_loss = 2471.0478515625 and 63,236 total parameters.
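A sketch of what the expanded search space could look like, assuming the same tuner setup as before with only the bounds changed:
def build_model_complex(hp):
    # Same structure as build_model, with a wider and deeper search space
    model = tf.keras.Sequential()
    for i in range(hp.Int('num_layers', 20, 30)):
        model.add(tf.keras.layers.Dense(units=hp.Int('layer_size' + str(i),
                                                     min_value=100,
                                                     max_value=200, step=10),
                                        activation='relu',
                                        kernel_initializer='normal'))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(1))
    model.compile(loss='mae', metrics=['mse'],
                  optimizer=tf.keras.optimizers.Adam(
                      hp.Choice('learning_rate', values=[1e-1, 1e-2, 1e-3])))
    return model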
Create an Azure Files Datastore
Since we determined the previous model resulted in the lowest error, let's move this to the cloud with Azure Machine Learning Studio
. Let's first connect to the workspace by authenticating with DefaultAzureCredential
and then input the necessary components with MLClient
to get a handle to the workspace.
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()
ml_client = MLClient(
credential=credential,
subscription_id='a134465f-eb15-4297-b902-3c97d4c81838',
resource_group_name='aschultzdata',
workspace_name='ds-ml-env',
)
Then we can create the datastore by specifying the name of the datastore, the description, the account_name, the container_name, the protocol and the account_key.
from azure.ai.ml.entities import AzureBlobDatastore
from azure.ai.ml.entities import AccountKeyConfiguration
store = AzureBlobDatastore(
name='usedcars_datastore',
description='Datastore for Used Cars',
account_name='dsmlenv8898281366',
container_name='usedcarscargurus',
protocol='https',
credentials=AccountKeyConfiguration(
account_key='XXXxxxXXXxXXXXxxXXXXXxXXXXXxXxxXxXXXxXXXxXXxxxXXxxXXXxXxXXXxxXxxXXXXxxxxxXXxxxxxxXXXxXXX'
),
)
ml_client.create_or_update(store)
Finally, the train/test sets can be uploaded to the usedcarscargurus
container.
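One way this upload could be done is directly with the blob storage SDK (a sketch; the account and container names follow the datastore above, and the key placeholder stands in for the redacted account key):
from azure.storage.blob import BlobServiceClient

# Upload the train/test CSVs into the container backing the datastore
service = BlobServiceClient(
    account_url='https://dsmlenv8898281366.blob.core.windows.net',
    credential='<account_key>')  # the account key (redacted above)
container = service.get_container_client('usedcarscargurus')
for name in ['usedCars_trainSet.csv', 'usedCars_testSet.csv']:
    with open(name, 'rb') as f:
        container.upload_blob(name=name, data=f, overwrite=True)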
Train the MLP using Microsoft Azure
Before we train the model, we need to create a compute instance/cluster with a defined name, so let's create one with a GPU if it has not already been created. Then we can specify the cluster components: the on-demand virtual machine (VM) service, the type of VM, the minimum/maximum nodes in the cluster, the time a node will run after the job finishes/terminates, and the type of tier. Finally, we can pass the object to create_or_update from MLClient.
from azure.ai.ml.entities import AmlCompute
gpu_compute_target = 'gpu-cluster-NC4as-T4-v3'
try:
    gpu_cluster = ml_client.compute.get(gpu_compute_target)
    print(
        f"You already have a cluster named {gpu_compute_target}, we'll reuse it as is."
    )
except Exception:
    print('Creating a new gpu compute target...')
    gpu_cluster = AmlCompute(
        name=gpu_compute_target,
        type='amlcompute',
        size='Standard_NC4as_T4_v3',
        min_instances=0,
        max_instances=1,
        idle_time_before_scale_down=180,
        tier='Dedicated',
    )
    print(
        f"AMLCompute with name {gpu_cluster.name} will be created, with compute size {gpu_cluster.size}"
    )
    gpu_cluster = ml_client.compute.begin_create_or_update(gpu_cluster)
Next, let's create the environment by making a directory for the dependencies file, which lists the components for the runtime and the libraries installed on the compute for training the model.
import os
dependencies_dir = './dependencies'
os.makedirs(dependencies_dir, exist_ok=True)
Now we can write the conda.yaml
file into the dependencies
directory.
%%writefile {dependencies_dir}/conda.yaml
name: model-env
channels:
  - conda-forge
dependencies:
  - python=3.8.5
  - pip=23.1.2
  - numpy=1.21.6
  - scikit-learn==1.1.2
  - scipy
  - pandas>=1.1,<1.2
  - pip:
      - inference-schema[numpy-support]==1.3.0
      - mlflow
      - azureml-mlflow==1.50.0
      - psutil==5.9.0
      - tqdm
      - ipykernel
      - matplotlib
      - tensorflow==2.12
      - keras-tuner==1.1.3
The created conda.yaml
file allows for the environment to be created and registered in the workspace.
from azure.ai.ml.entities import Environment
custom_env_name = 'aml-usedcars-gpu-mlp'
custom_job_env = Environment(
name=custom_env_name,
description='Custom environment for Used Cars MLP job',
tags={'tensorflow': '2.12'},
conda_file=os.path.join(dependencies_dir, 'conda.yaml'),
image='mcr.microsoft.com/azureml/curated/tensorflow-2.12-cuda11:6',
)
custom_job_env = ml_client.environments.create_or_update(custom_job_env)
print(
f'Environment with name {custom_job_env.name} is registered to workspace, the environment version is {custom_job_env.version}'
)
Next, we can create the training script by first creating the source folder where the training script, main.py
, will be stored.
train_src_dir = './src'
os.makedirs(train_src_dir, exist_ok=True)
The training script includes preparing the environment, reading the data, data preparation, model training, saving the model and evaluating the model. It specifies the dependencies to import, sets the seed, defines the input arguments with argparse, starts logging, reads the train/test sets, and preprocesses the data with dummy variables and scaling using the StandardScaler. The numbers of samples and features are then logged with MLFlow. The script then trains an MLP model using the best parameters from keras-tuner, where the loss and val_loss are logged with MLFlow through defined callbacks. The model is then saved and evaluated for the model loss, the metrics of the train/test sets and the predicted vs. actual minimum/average/maximum price, which are logged as MLFlow artifacts and metrics.
%%writefile {train_src_dir}/main.py
import os
import random
import numpy as np
import warnings
import argparse
import pandas as pd
from sklearn.preprocessing import StandardScaler
import mlflow
import tensorflow as tf
import datetime
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from keras.callbacks import Callback
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')
def init_seeds(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1,
                                            inter_op_parallelism_threads=1)
    os.environ['TF_CUDNN_DETERMINISTIC'] = 'True'
    os.environ['TF_DETERMINISTIC_OPS'] = 'True'
    sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(),
                                config=session_conf)
    tf.compat.v1.keras.backend.set_session(sess)
    return sess

init_seeds(seed=42)

def main():
    """Main function of the script."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_data', type=str, help='path to input train data')
    parser.add_argument('--test_data', type=str, help='path to input test data')
    parser.add_argument('--epochs', required=False, default=30, type=int)
    parser.add_argument('--batch_size', required=False, default=1, type=int)
    args = parser.parse_args()

    mlflow.start_run()

    ###################
    #<prepare the data>
    ###################
    print(' '.join(f'{k}={v}' for k, v in vars(args).items()))
    print('input train data:', args.train_data)
    print('input test data:', args.test_data)

    trainDF = pd.read_csv(args.train_data, low_memory=False)
    testDF = pd.read_csv(args.test_data, low_memory=False)

    train_label = trainDF[['price']]
    test_label = testDF[['price']]
    train_features = trainDF.drop(columns=['price'])
    test_features = testDF.drop(columns=['price'])

    # Dummy variables for the categoricals; fit the scaler on train only
    train_features = pd.get_dummies(train_features, drop_first=True)
    test_features = pd.get_dummies(test_features, drop_first=True)

    sc = StandardScaler()
    train_features = pd.DataFrame(sc.fit_transform(train_features))
    test_features = pd.DataFrame(sc.transform(test_features))

    mlflow.log_metric('num_samples', train_features.shape[0])
    mlflow.log_metric('num_features', train_features.shape[1])

    print(f'Training with data of shape {train_features.shape}')
    ####################
    #</prepare the data>
    ####################

    ##################
    #<train the model>
    ##################
    model = Sequential()
    model.add(Dense(130, input_dim=53, kernel_initializer='normal',
                    activation='relu'))
    model.add(Dense(130, kernel_initializer='normal', activation='relu'))
    model.add(Dense(130, kernel_initializer='normal', activation='relu'))
    model.add(Dense(70, kernel_initializer='normal', activation='relu'))
    model.add(Dense(130, kernel_initializer='normal', activation='relu'))
    model.add(Dense(70, kernel_initializer='normal', activation='relu'))
    model.add(Dense(130, kernel_initializer='normal', activation='relu'))
    model.add(Dense(70, kernel_initializer='normal', activation='relu'))
    model.add(Dense(130, kernel_initializer='normal', activation='relu'))
    model.add(Dense(70, kernel_initializer='normal', activation='relu'))
    model.add(Dense(130, kernel_initializer='normal', activation='relu'))
    model.add(Dense(130, kernel_initializer='normal', activation='relu'))
    model.add(Dense(130, kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(1))

    opt = tf.keras.optimizers.Adam(learning_rate=0.001)
    model.compile(loss='mae', metrics=['mse'], optimizer=opt)
    model.summary()

    class LogRunMetrics(Callback):
        def on_epoch_end(self, epoch, log):
            # Log the training and validation loss at the end of each epoch
            mlflow.log_metric('Loss', log['loss'])
            mlflow.log_metric('Val_Loss', log['val_loss'])

    log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
    filepath = 'MLP_weights_only_HPO1_bestModel.tf'
    checkpoint_dir = os.path.dirname(filepath)

    callbacks_list = [EarlyStopping(monitor='val_loss', patience=5),
                      ModelCheckpoint(filepath, monitor='mse',
                                      save_best_only=True, mode='min'),
                      LogRunMetrics()]

    history = model.fit(train_features, train_label, epochs=args.epochs,
                        batch_size=args.batch_size, validation_split=0.2,
                        callbacks=callbacks_list)

    model.save('./MLP_HPO1_bestModel', save_format='tf')
    # Load the model for more training or later use
    #filepath = 'MLP_weights_only_b4_HPO1_bestModel.h5'
    #model = tf.keras.models.load_model('./MLP_HPO1_bestModel_tf.h5')
    #model.load_weights(filepath)
    ##################
    #</train the model>
    ##################

    #####################
    #<evaluate the model>
    #####################
    plt.title('Model Error for Price')
    plt.plot(history.history['loss'], label='train')
    plt.plot(history.history['val_loss'], label='val_loss')
    plt.ylabel('Error [Price]')
    plt.xlabel('Epoch')
    plt.legend()
    plt.grid(True)
    plt.savefig('Loss vs. Price.png')
    mlflow.log_artifact('Loss vs. Price.png')
    plt.close()

    losses = pd.DataFrame(model.history.history)
    losses.plot()
    plt.title('Model Error for Price')
    plt.ylabel('Error [Price]')
    plt.xlabel('Epoch')
    plt.legend(loc='upper right')
    plt.grid(True)
    plt.savefig('Error vs. Price.png')
    mlflow.log_artifact('Error vs. Price.png')
    plt.close()

    pred_train = model.predict(train_features)
    train_mae = mean_absolute_error(train_label[:], pred_train[:])
    train_mse = mean_squared_error(train_label[:], pred_train[:])
    train_rmse = np.sqrt(mean_squared_error(train_label[:], pred_train[:]))
    train_r2 = r2_score(train_label[:], pred_train[:])

    pred_test = model.predict(test_features)
    test_mae = mean_absolute_error(test_label[:], pred_test[:])
    test_mse = mean_squared_error(test_label[:], pred_test[:])
    test_rmse = np.sqrt(mean_squared_error(test_label[:], pred_test[:]))
    test_r2 = r2_score(test_label[:], pred_test[:])

    mlflow.log_metric('train_mae', train_mae)
    mlflow.log_metric('train_mse', train_mse)
    mlflow.log_metric('train_rmse', train_rmse)
    mlflow.log_metric('train_r2', train_r2)
    mlflow.log_metric('test_mae', test_mae)
    mlflow.log_metric('test_mse', test_mse)
    mlflow.log_metric('test_rmse', test_rmse)
    mlflow.log_metric('test_r2', test_r2)

    MaximumPrice = np.amax(test_label)
    PredictedMaxPrice = np.amax(pred_test)
    AveragePrice = np.average(test_label)
    PredictedAveragePrice = np.average(pred_test)
    MinimumPrice = np.amin(test_label)
    PredictedMinimumPrice = np.amin(pred_test)

    mlflow.log_metric('Maximum Price', MaximumPrice)
    mlflow.log_metric('Predicted Maximum Price', PredictedMaxPrice)
    mlflow.log_metric('Average Price', AveragePrice)
    mlflow.log_metric('Predicted Average Price', PredictedAveragePrice)
    mlflow.log_metric('Minimum Price', MinimumPrice)
    mlflow.log_metric('Predicted Minimum Price', PredictedMinimumPrice)
    #####################
    #</evaluate the model>
    #####################

    mlflow.end_run()

if __name__ == "__main__":
    main()
To train the model, a command job needs to be submitted. The job is configured with inputs specifying the input data, the number of epochs and the batch size; it then runs the training script on the specified compute resource and environment, logging the specified parameters.
from azure.ai.ml import command
from azure.ai.ml import Input
registered_model_name = 'usedcars_mlp_model'
job = command(
inputs=dict(
train_data=Input(
type='uri_file',
path='azureml://datastores/usedcars_datastore/paths/usedCars_trainSet.csv',
),
test_data=Input(
type='uri_file',
path = 'azureml://datastores/usedcars_datastore/paths/usedCars_testSet.csv',
),
epochs=50,
batch_size=4,
registered_model_name=registered_model_name,
),
code='./src/',
command='python main.py --train_data ${{inputs.train_data}} --test_data ${{inputs.test_data}} --epochs ${{inputs.epochs}} --batch_size ${{inputs.batch_size}}',
environment='aml-usedcars-gpu-mlp@latest',
compute='gpu-cluster-NC4as-T4-v3',
display_name='usedcars_mlp_prediction',
)
Finally, this job can be submitted to run in Azure Machine Learning Studio
using the create_or_update
command with ml_client
.
ml_client.create_or_update(job)
The submitted job can then be viewed by selecting the link in the output of the previous cell. The logged information with MLFlow
including the model metrics and saved graphs can then be viewed/downloaded when the job completes.
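As a hedged sketch (the run id below is a hypothetical placeholder), the same metrics and artifacts can also be retrieved programmatically with the MLflow client once the tracking URI points at the workspace:
import mlflow
from mlflow.tracking import MlflowClient

run_id = '<run_id_of_the_submitted_job>'  # hypothetical placeholder
client = MlflowClient()
run = client.get_run(run_id)
print(run.data.metrics)  # e.g. train_mae, test_rmse, ...
for artifact in client.list_artifacts(run_id):
    print(artifact.path)  # the saved loss/error plots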
CatBoost
¶
Gradient boosting utilizes the process of iteratively combining weaker models to build strong predictive models using gradient descent. The paper CatBoost: unbiased boosting with categorical features presented a new boosting algorithm, CatBoost
, or categorical boosting, where preprocessing the categorical features is not required because they are handled during the training step. Then the paper CatBoost: gradient boosting with categorical features support explained the methodologies in greater detail.
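To make the general boosting idea concrete, here is a minimal sketch (an illustration of squared-error gradient boosting with shallow sklearn trees, not CatBoost's ordered boosting) where each weak learner is fit to the residuals, the negative gradients, of the current ensemble:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X).ravel() * 5 + rng.normal(scale=0.5, size=500)

learning_rate = 0.1
pred = np.full_like(y, y.mean())  # start from the mean prediction
for _ in range(100):
    residuals = y - pred  # negative gradient of the squared error loss
    stump = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += learning_rate * stump.predict(X)  # shrink each tree's contribution

print('Train RMSE after boosting:', np.sqrt(np.mean((y - pred) ** 2)))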
The notebook containing the baseline model and models using Hyperopt
is here. Let's start by setting up the environment: installing the necessary dependencies, importing the required packages, setting the options and the random
and numpy
seed. We can also determine if the CUDA
compiler is present/which version of CUDA Toolkit
is available with nvcc --version
and if a GPU is present and the associated characteristics with nvidia-smi
.
!pip install catboost
!pip install eli5
!pip install shap
import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')
seed_value = 42
os.environ['usedCars_catGPU'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
There is a 70W Tesla T4 with 16GB GPU memory with CUDA Toolkit 11.2
available. Now, we can read the train/test sets into pandas
dataframes and set up the features with X_train
and X_test
and the target/label with y_train
and y_test
. CatBoost
does not need the categorical variables to be one hot encoded or dummy variables, so let's define the categorical variables in a list which will be used when modeling.
import pandas as pd
trainDF = pd.read_csv('usedCars_trainSet.csv', low_memory=False)
testDF = pd.read_csv('usedCars_testSet.csv', low_memory=False)
X_train = trainDF.drop(columns = ['price'])
y_train = trainDF[['price']]
X_test = testDF.drop(columns = ['price'])
y_test = testDF[['price']]
categorical_features_indices = ['body_type', 'fuel_type', 'listing_color',
'transmission', 'wheel_system_display','State',
'listed_date_yearMonth']
Baseline Model
¶
Let's now set a baseline model using the default parameters for CatBoost, save it as a .pkl file, and evaluate the model metrics for both the train and the test sets.
from catboost import CatBoostRegressor
import pickle
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
cat = CatBoostRegressor(loss_function='RMSE',
task_type='GPU',
cat_features=categorical_features_indices,
random_state=seed_value,
logging_level='Silent')
cat.fit(X_train, y_train)
Pkl_Filename = 'CatBoost_Baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(cat, file)
print('\nModel Metrics for Catboost Baseline Model')
y_train_pred = cat.predict(X_train)
y_test_pred = cat.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(y_train, y_train_pred),
mean_absolute_error(y_test, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred, squared=False),
mean_squared_error(y_test, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))
Hyperopt: 100 Trials 10-Fold Cross Validation
¶
To set up Hyperopt
, let's first define the number of trials with NUM_EVAL = 100
. We need to utilize the same k-folds for reproducibility, so let's use 10 as the number of folds with which to split the training set into train and validation sets with shuffle=True and the initially defined seed value. The hyperparameters can then be defined in a dictionary. To define integers, hp.choice with np.arange and dtype=int is used, while float types are defined using hp.uniform. The space consists of 6 parameters, 4 integer and 2 float types:
iterations: The number of trees that can be built.
depth: Depth of the tree.
l2_leaf_reg: Coefficient at the L2 regularization term of the cost function.
learning_rate: Used for reducing the gradient step.
min_data_in_leaf: The minimum number of training samples in a leaf.
one_hot_max_size: Use one-hot encoding for all categorical features with a number of different values less than or equal to the given parameter value.
from sklearn.model_selection import KFold
from hyperopt import hp
NUM_EVAL = 100
kfolds = KFold(n_splits=10, shuffle=True, random_state=seed_value)
catboost_tune_kwargs= {
'iterations': hp.choice('iterations', np.arange(100, 500, dtype=int)),
'depth': hp.choice('depth', np.arange(3, 10, dtype=int)),
'l2_leaf_reg': hp.uniform('l2_leaf_reg', 1e-2, 1e0),
'learning_rate': hp.uniform('learning_rate', 1e-4, 0.3),
'min_data_in_leaf': hp.choice('min_data_in_leaf', np.arange(2, 20,
dtype=int)),
'one_hot_max_size': hp.choice('one_hot_max_size', np.arange(2, 20,
dtype=int)),
}
Now, we can define a function cat_hpo for optimizing the hyperparameters. Within this function, we can use joblib to save a .pkl file that can be reloaded if more training is needed. We can define the trial number with ITERATION, which increases by 1 each trial to keep track of the parameters for each trial. The parameters which are integers need to be configured to remain as integers and not as float. This is applied to iterations and depth, which starts at depth=3. Then the model type, CatBoostRegressor, needs to be defined with the parameters that will be included in all of the trials during the search, which are:
loss_function: The metric to use in training.
task_type: The processing unit type to use for training.
cat_features: A one-dimensional array of categorical column indices.
early_stopping_rounds: Sets the overfitting detector type to Iter and stops the training after the specified number of iterations since the iteration with the optimal metric value.
rsm: Random subspace method. The percentage of features to use at each split selection, when features are selected over again at random.
random_state=seed_value: Random number seed.
logging_level: The logging level to output to stdout.
Then we can fit the model with the training set utilizing the 'neg_root_mean_squared_error'
for the scoring
metric and 10-fold cross validation to generate the negative cross_val_score
where the np.mean
of the scores
is saved as the rmse
for each trial. The trial loss=rmse
, params
= parameters tested in the trial from the catboost_tune_kwargs
space, the trial number (iteration
), the time to complete the trial (train_time
) and if the trial completed successfully or not (status
) are written to the defined .csv
file and appended by row for each trial.
import joblib
from timeit import default_timer as timer
from sklearn.model_selection import cross_val_score
import csv
from hyperopt import STATUS_OK
def cat_hpo(config):
"""
Objective function to tune a CatBoostRegressor model.
"""
joblib.dump(bayesOpt_trials, 'Catboost_Hyperopt_100_GPU.pkl')
global ITERATION
ITERATION += 1
config['iterations'] = int(config['iterations'])
config['depth'] = int(config['depth']) + 3
cat = CatBoostRegressor(loss_function='RMSE',
task_type='GPU',
cat_features=categorical_features_indices,
early_stopping_rounds=10,
rsm=1,
random_state=seed_value,
logging_level='Silent',
**config)
start = timer()
scores = -cross_val_score(cat, X_train, y_train,
scoring='neg_root_mean_squared_error',
cv=kfolds)
run_time = timer() - start
rmse = np.mean(scores)
of_connection = open(out_file, 'a')
writer = csv.writer(of_connection)
writer.writerow([rmse, config, ITERATION, run_time])
return {'loss': rmse, 'params': config, 'iteration': ITERATION,
'train_time': run_time, 'status': STATUS_OK}
The Tree-structured Parzen Estimator Approach (TPE)
algorithm is the default algorithm for Hyperopt
. This algorithm proposed in Algorithms for Hyper-Parameter Optimization uses a Bayesian optimization approach to develop a probabilistic model of a defined function (the minimum, in this case) by generating a random starting point, calculating the function, then choosing the next set of parameters that will probabilistically result in a lower minimum utilizing past calculations for the conditional probability model, then computing the real value, and continuing until the defined stopping criteria is met.
Let's now define an out_file to save the results, where the headers will be written to the file. Then we can set the global variable for ITERATION and define the Hyperopt Trials object as bayesOpt_trials.
We can utilize if/else conditional statements to load a .pkl file if it exists, and then utilizing fmin, we can specify the training function cat_hpo, the parameter space catboost_tune_kwargs, the algorithm for optimization tpe.suggest, the number of trials to evaluate NUM_EVAL, the name of the trial set bayesOpt_trials and the random state np.random.RandomState(42). We can now begin the hyperparameter optimization (HPO) trials.
from hyperopt import tpe, Trials
out_file = '/content/drive/MyDrive/UsedCarsCarGurus/Models/ML/Catboost/Hyperopt/trialOptions/Catboost_trials_100_GPU.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)
writer.writerow(['loss', 'params', 'iteration', 'train_time'])
of_connection.close()
global ITERATION
ITERATION = 0
bayesOpt_trials = Trials()
import time
from datetime import datetime, timedelta
from hyperopt import fmin

search_time_start = time.time()
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('Catboost_Hyperopt_100_GPU.pkl'):
bayesOpt_trials = joblib.load('Catboost_Hyperopt_100_GPU.pkl')
else:
best_param = fmin(cat_hpo, catboost_tune_kwargs, algo=tpe.suggest,
max_evals=NUM_EVAL, trials=bayesOpt_trials,
rstate=np.random.RandomState(42))
print('Time to run HPO:', time.time() - search_time_start)
This search completed in 2 hours and 57 minutes. Let's now sort the trials with lowest loss (lowest RMSE
) first and examine the top two trials:
bayesOpt_trials_results = sorted(bayesOpt_trials.results,
key=lambda x: x['loss'])
print('Top two trials with the lowest loss (lowest RMSE)')
print(bayesOpt_trials_results[:2])
Let's now access the results in the trialOptions directory by reading them into a pandas.DataFrame, sorting with the best scores on top and then resetting the index for slicing. Then we can convert the trial parameters from a string to a dictionary for later use. Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then save it as a .pkl file. Then we can evaluate the model metrics for both the train and the test sets using the MAE, MSE, RMSE and the coefficient of determination (R²).
import ast
import pickle
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
results = pd.read_csv('Catboost_trials_100_GPU.csv')
results.sort_values('loss', ascending=True, inplace=True)
results.reset_index(inplace=True, drop=True)
ast.literal_eval(results.loc[0, 'params'])
best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()
best_bayes_model = CatBoostRegressor(loss_function='RMSE',
task_type='GPU',
cat_features=categorical_features_indices,
early_stopping_rounds=10,
rsm=1,
random_state=seed_value,
logging_level='Silent',
**best_bayes_params)
best_bayes_model.fit(X_train, y_train)
Pkl_Filename = 'Catboost_HPO_trials100_GPU.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_bayes_model, file)
# =============================================================================
# # To load saved model
# best_bayes_model = joblib.load('Catboost_HPO_trials100_GPU.pkl')
# print(best_bayes_model)
# =============================================================================
print('\nModel Metrics for Catboost HPO 100 GPU trials')
y_train_pred = best_bayes_model.predict(X_train)
y_test_pred = best_bayes_model.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(y_train, y_train_pred),
mean_absolute_error(y_test, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred, squared=False),
mean_squared_error(y_test, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))
Compared to the baseline model, there are lower values for MAE
, MSE
and RMSE
and a higher R² for both train/test sets after hyperparameter tuning.
We can also evaluate the MSE
on the test set and determine when this was achieved:
print('The best model from Bayes optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(y_test,
y_test_pred)))
print('This was achieved after {} search iterations'.format(results.loc[0, 'iteration']))
Results from the Hyperparameter Search
¶
Let's now create a new pandas.Dataframe
where the parameters examined during the search are stored and the results of each parameter are stored as a different column. Then we can convert the data types for graphing. Let's now examine the distributions of the search hyperparameters by using a for
loop to iterate through the quantitative parameters and visualizing with seaborn.kdeplot
.
import matplotlib.pyplot as plt
import seaborn as sns
bayes_params = pd.DataFrame(columns=list(ast.literal_eval(results.loc[0,
'params']).keys()),
index=list(range(len(results))))
for i, params in enumerate(results['params']):
bayes_params.loc[i,:] = list(ast.literal_eval(params).values())
bayes_params['loss'] = results['loss']
bayes_params['iteration'] = results['iteration']
bayes_params['depth'] = bayes_params['depth'].astype('float64')
bayes_params['learning_rate'] = bayes_params['learning_rate'].astype('float64')
bayes_params['l2_leaf_reg'] = bayes_params['l2_leaf_reg'].astype('float64')
bayes_params['min_data_in_leaf'] = bayes_params['min_data_in_leaf'].astype('float64')
bayes_params['one_hot_max_size'] = bayes_params['one_hot_max_size'].astype('float64')
for i, hpo in enumerate(bayes_params.columns):
if hpo not in ['iteration', 'iterations']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(bayes_params[hpo], label='Bayes Optimization')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
Higher values of depth
, l2_leaf_reg
, min_data_in_leaf
and one_hot_max_size
in the given hyperparameter space performed better to generate a lower loss. A learning_rate
between 0.1 and 0.2 performed better and the lowest loss was achieved with ~ learning_rate=0.119
.
Let's now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 3, figsize=(20,5))
i = 0
for i, hpo in enumerate(['learning_rate', 'min_data_in_leaf',
'one_hot_max_size']):
sns.regplot(x='iteration', y=hpo, data=bayes_params, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
Over the trials, the learning_rate
and min_data_in_leaf
decreased while one_hot_max_size
slightly increased.
Next, we can examine if there were any trends regarding the regularization parameters over the trials.
plt.figure(figsize=(20,8))
plt.rcParams['font.size'] = 18
ax = sns.regplot(x='iteration', y='l2_leaf_reg', data=bayes_params,
                 label='Bayes Optimization: Hyperopt')
ax.set(xlabel='Iteration', ylabel='l2_leaf_reg')
plt.tight_layout()
plt.show()
Over the trials, there was a slight increase in l2_leaf_reg
.
Model Explanations
¶
Next, we can define a function to plot the feature importance, which can also be saved and exported for later use. The feature importance value and the associated feature name can be extracted into two separate arrays then converted to a pandas.DataFrame
using the feature names as the key
and the feature importance as the value
in a dictionary. The dataframe can then be sorted in decreasing order by the feature importance value. Then a seaborn.barplot
with a defined title
, xlabel
and ylabel
can be utilized to visualize the results.
def plot_feature_importance(importance, names, model_type):
feature_importance = np.array(importance)
feature_names = np.array(names)
data={'feature_names': feature_names,
'feature_importance': feature_importance}
fi_df = pd.DataFrame(data)
fi_df.sort_values(by=['feature_importance'], ascending=False, inplace=True)
plt.figure(figsize=(10,8))
sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'],
palette='cool')
plt.title(model_type + ' Feature Importance')
plt.xlabel('Feature Importance')
plt.ylabel('Feature Names')
plot_feature_importance(best_bayes_model.get_feature_importance(),
X_train.columns, 'Catboost')
The features horsepower
, year
and mileage
are the most important followed by wheel_system_display
, back_legroom
, horsepower_rpm
, width
, front_legroom
, height
and body_type.
SHAP (SHapley Additive exPlanations)
¶
SHAP (SHapley Additive exPlanations)
published as A Unified Approach to Interpreting Model Predictions estimates the impact of the individual components on the final result, separately, while conserving the total impact on the final result. This can be considered in conjunction with the previously used feature importance measures when modeling. Let's use the TreeExplainer
since we are using CatBoost
and generate shap_values
for the features. The summary_plot
shows the global importance of each feature and the distribution of the effect sizes over the set. Let's use the training set and summarize the effects of all the features.
import shap
shap.initjs()
explainer = shap.TreeExplainer(best_bayes_model)
shap_values = explainer.shap_values(X_train)
shap.summary_plot(shap_values, X_train);
Then we can use the test set and summarize the effects of all the features.
shap.initjs()
explainer = shap.TreeExplainer(best_bayes_model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test);
The train set shows a wider spread of both positive and negative SHAP values compared to the test set.
100 Trials Train/Validation/Test
¶
The training set can be partitioned into train/validation sets using a test_size=0.2
. This will allow for the original testDF
to be a true test set that the model has not encountered.
from sklearn.model_selection import train_test_split

train_Features, val_features, train_Label, val_label = train_test_split(X_train,
                                                                        y_train,
                                                                        test_size=0.2,
                                                                        random_state=seed_value)
We use the same number of trials and parameter space, but define a new objective function for cat_hpo with a new .pkl file, the same config for the parameters to be evaluated and the same base model. Now, we can specify for the model to fit on the training set with the validation set as the eval_set and to predict using the validation set. We can then define a new out_file to save the results, where the headers will be written to the file. Then we can set the global variable for ITERATION and define the Hyperopt Trials object as bayesOpt_trials.
def cat_hpo(config):
"""
Objective function to tune a CatBoostRegressor model.
"""
joblib.dump(bayesOpt_trials, 'Catboost_Hyperopt_100_GPU_TrainValTest.pkl')
global ITERATION
ITERATION += 1
config['iterations'] = int(config['iterations'])
config['depth'] = int(config['depth']) + 3
model = CatBoostRegressor(loss_function='RMSE',
cat_features=categorical_features_indices,
early_stopping_rounds=10,
rsm=1,
logging_level='Silent',
**config)
start = timer()
model.fit(train_Features, train_Label.values.ravel(),
eval_set=[(val_features, val_label.values.ravel())])
run_time = timer() - start
y_pred_val = model.predict(val_features)
rmse = mean_squared_error(val_label, y_pred_val, squared=False)
of_connection = open(out_file, 'a')
writer = csv.writer(of_connection)
writer.writerow([rmse, config, ITERATION, run_time])
return {'loss': rmse, 'params': config, 'iteration': ITERATION,
'train_time': run_time, 'status': STATUS_OK}
tpe_algorithm = tpe.suggest
out_file = '/content/drive/MyDrive/UsedCarsCarGurus/Models/ML/Catboost/Hyperopt/trialOptions/Catboost_trials_100_GPU_TrainValTest.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)
writer.writerow(['loss', 'params', 'iteration', 'train_time'])
of_connection.close()
global ITERATION
ITERATION = 0
bayesOpt_trials = Trials()
We can utilize if/else conditional statements to load a .pkl file if it exists, and then utilizing fmin, we can specify the training function cat_hpo, the parameter space catboost_tune_kwargs, the algorithm for optimization tpe.suggest, the number of trials to evaluate NUM_EVAL and the name of the trial set bayesOpt_trials. We can now begin the HPO trials.
if os.path.isfile('Catboost_Hyperopt_100_GPU_TrainValTest.pkl'):
    bayesOpt_trials = joblib.load('Catboost_Hyperopt_100_GPU_TrainValTest.pkl')
    best_param = fmin(cat_hpo, catboost_tune_kwargs, algo=tpe.suggest,
                      max_evals=NUM_EVAL, trials=bayesOpt_trials)
else:
    best_param = fmin(cat_hpo, catboost_tune_kwargs, algo=tpe.suggest,
                      max_evals=NUM_EVAL, trials=bayesOpt_trials)
Let's now sort the trials with lowest loss (lowest RMSE
) first and examine the top two trials:
bayesOpt_trials_results = sorted(bayesOpt_trials.results,
key=lambda x: x['loss'])
print('Top two trials with the lowest loss (lowest RMSE)')
print(bayesOpt_trials_results[:2])
We can access the results in the trialOptions directory by reading them into a pandas.DataFrame, sorting with the best scores on top and then resetting the index for slicing. Then we can convert the trial parameters from a string to a dictionary for later use. Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then save it as a .pkl file. Then we can evaluate the model metrics for both the train and the test sets using the MAE, MSE, RMSE and the coefficient of determination (R²).
results = pd.read_csv('Catboost_trials_100_GPU_TrainValTest.csv')
results.sort_values('loss', ascending=True, inplace=True)
results.reset_index(inplace=True, drop=True)
ast.literal_eval(results.loc[0, 'params'])
best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()
train_label = trainDF[['price']]
test_label = testDF[['price']]
train_features = trainDF.drop(columns = ['price'])
test_features = testDF.drop(columns = ['price'])
best_bayes_model = CatBoostRegressor(loss_function='RMSE',
cat_features=categorical_features_indices,
early_stopping_rounds=10,
rsm=1,
logging_level='Silent',
**best_bayes_params)
best_bayes_model.fit(train_features, train_label)
Pkl_Filename = 'Catboost_HPO_trials100_GPU_TrainValTest.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_bayes_model, file)
print('\nModel Metrics for Catboost HPO 100 GPU trials')
y_train_pred = best_bayes_model.predict(train_features)
y_test_pred = best_bayes_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(train_label, y_train_pred),
mean_absolute_error(test_label, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred),
mean_squared_error(test_label, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred, squared=False),
mean_squared_error(test_label, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(train_label, y_train_pred),
r2_score(test_label, y_test_pred)))
When comparing the model metrics for CatBoost utilizing 10-fold cross validation vs. train/validation/test, this model shows a lower MAE for both the train and the test set, and a lower MSE and RMSE for the train set but higher values for the test set. There was a higher R² for the train set but a lower R² for the test set, so this model overfit more than the model from the cross validation search.
print('The best model from Bayes optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
y_test_pred)))
print('This was achieved after {} search iterations'.format(results.loc[0, 'iteration']))
As stated, there was a higher MSE
on the test set using this model compared to the previous search using 10-fold cross validation (5878087.83144 vs. 5773043.66389), but this was achieved at a similar number of search iterations.
Results from the Hyperparameter Search
¶
To compare the results from this method for partitioning the data for training a model compared to the results from 10-fold cross validation, we can iterate through the quantitative parameters using a similar method and visualizing with seaborn.kdeplot
.
bayes_params = pd.DataFrame(columns=list(ast.literal_eval(results.loc[0,
'params']).keys()),
index=list(range(len(results))))
for i, params in enumerate(results['params']):
bayes_params.loc[i,:] = list(ast.literal_eval(params).values())
bayes_params['loss'] = results['loss']
bayes_params['iteration'] = results['iteration']
bayes_params['depth'] = bayes_params['depth'].astype('float64')
bayes_params['learning_rate'] = bayes_params['learning_rate'].astype('float64')
bayes_params['l2_leaf_reg'] = bayes_params['l2_leaf_reg'].astype('float64')
bayes_params['min_data_in_leaf'] = bayes_params['min_data_in_leaf'].astype('float64')
bayes_params['one_hot_max_size'] = bayes_params['one_hot_max_size'].astype('float64')
for i, hpo in enumerate(bayes_params.columns):
if hpo not in ['iteration', 'iterations', 'random_state', 'task_type']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(bayes_params[hpo], label='Bayes Optimization')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
Similar results of higher values of depth
, l2_leaf_reg
, min_data_in_leaf
and one_hot_max_size
in the given hyperparameter space were observed to generate a lower loss. A learning_rate
between 0.1 and 0.2 performed better, and the lowest loss was achieved with around learning_rate=0.136
compared to 0.119
using 10-fold cross validation.
Let's now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 3, figsize=(20,5))
i = 0
for i, hpo in enumerate(['learning_rate', 'min_data_in_leaf',
'one_hot_max_size']):
sns.regplot(x='iteration', y=hpo, data=bayes_params, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
Over the trials, the one_hot_max_size
increased while learning_rate
and min_data_in_leaf
did not reveal any trends.
Next, we can examine if there were any trends regarding the regularization parameters over the trials.
plt.figure(figsize=(10,5))
plt.rcParams['font.size'] = 18
ax = sns.regplot(x='iteration', y='l2_leaf_reg', data=bayes_params,
label='Bayes Optimization')
ax.set(xlabel='Iteration', ylabel='l2_leaf_reg')
plt.tight_layout()
plt.show()
There was not any observable trend for l2_leaf_reg
over the trials.
Model Explanations
¶
plot_feature_importance(best_bayes_model.get_feature_importance(),
train_features.columns, 'Catboost')
The features horsepower, year and mileage are the most important, followed by wheel_system_display, back_legroom, horsepower_rpm and front_legroom, then width, height, wheelbase and State before body_type, when comparing to the model from the 10-fold cross validation search.
SHAP (SHapley Additive exPlanations)
¶
Let's use the training set and summarize the effects of all the features.
shap.initjs()
explainer = shap.TreeExplainer(best_bayes_model)
shap_values = explainer.shap_values(train_features)
shap.summary_plot(shap_values, train_features);
Then we can use the test set and summarize the effects of all the features.
shap.initjs()
shap_values = explainer.shap_values(test_features)
shap.summary_plot(shap_values, test_features);
Optuna: 100 Trials 10-Fold Cross Validation
¶
The Optuna
notebooks are located here. Let's set up the environment by installing wandb
, CatBoost
, Optuna
and also shap
for model explanations. Then we can import the necessary dependencies, set the seed for reproducibility and set the options. We can also examine the GPU
characteristics with nvidia-smi
.
!pip install --upgrade -q wandb
!pip install catboost
!pip install optuna
!pip install shap
import os
import random
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
seed_value = 42
os.environ['usedCars_catboostGPU'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
Given the environment is now configured, the training and the test sets are read into separate pandas
dataframes. Then the features and the target are defined for the two sets as train_features
, train_label
, test_features
and test_label
. We can also set the categorical variables as was done for the Hyperopt
section. Then we can set the working directory where the .pkl
is stored and set up Weights & Biases
.
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)
import wandb
from optuna.integration.wandb import WeightsAndBiasesCallback
wandb.login()
wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
'group': 'optuna_cat100gpu_T4_cv_kfold10',
'save_code': 'False', 'notes': 'optuna_cat100gpu_T4_cv_kfold10'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
Let's utilize the same k-folds for reproducibility that were used for the Hyperopt search, using 10 as the number of folds with which to split the training set into train and validation sets with shuffle=True and the initially defined seed value. Then we can set up the W&B callbacks for tracking. Now, let's define a function objective for the optimization of hyperparameters using an Optuna study with a pickle file, testing different parameter combinations using the same space that was utilized during the Hyperopt trials. Then we can define the model type with the parameters that will be used for each trial and perform timed kfolds cross validation trials with the goal of finding the lowest averaged score.
from sklearn.model_selection import KFold, cross_val_score
import joblib
from catboost import CatBoostRegressor
from timeit import default_timer as timer
kfolds = KFold(n_splits=10, shuffle=True, random_state=seed_value)
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a CatBoostRegressor model.
"""
joblib.dump(study, 'Catboost_Optuna_100_GPU_T4_wb_CV.pkl')
catboost_tune_kwargs = {
'task_type': 'GPU',
'random_state': seed_value,
'n_estimators': trial.suggest_int('n_estimators', 100, 500),
'learning_rate': trial.suggest_float('learning_rate', 1e-4, 0.3),
'depth': trial.suggest_int('depth', 3, 10),
'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1e-2, 1e0),
'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 2, 20),
'one_hot_max_size': trial.suggest_int('one_hot_max_size', 2, 20)
}
model = CatBoostRegressor(loss_function='RMSE',
cat_features=categorical_features_indices,
early_stopping_rounds=10,
rsm=1,
logging_level='Silent',
**catboost_tune_kwargs)
start = timer()
scores = -cross_val_score(model, X_train, y_train,
scoring='neg_root_mean_squared_error',
cv=kfolds)
run_time = timer() - start
rmse = np.mean(scores)
return rmse
Now, we can begin the optimization, where the parameters and score for each run are stored in usedCars_HPO_Catboost on Weights & Biases.
from datetime import datetime, timedelta
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('Catboost_Optuna_100_GPU_T4_wb_CV.pkl'):
study = joblib.load('Catboost_Optuna_100_GPU_T4_wb_CV.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100, callbacks=[wandbc])
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
Results from the Hyperparameter Search
¶
Let's now extract the trial number, rmse and hyperparameter values into a pandas.DataFrame and sort with the lowest error first.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'rmse'}, inplace=True)
trials_df.rename(columns={'params_depth': 'depth'}, inplace=True)
trials_df.rename(columns={'params_l2_leaf_reg': 'l2_leaf_reg'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
inplace=True)
trials_df.rename(columns={'params_min_data_in_leaf': 'min_data_in_leaf'},
inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_one_hot_max_size': 'one_hot_max_size'},
inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
Optuna contains many ways to visualize the parameters tested during the search using the visualization module, including plot_parallel_coordinate, plot_slice, plot_contour, plot_param_importances, plot_optimization_history and plot_edf. Let's utilize some of these components to examine the hyperparameters tested, starting with plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search and plot_contour to plot the parameter relationships with a contour plot.
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()
fig = optuna.visualization.plot_contour(study, params=['n_estimators',
'min_data_in_leaf',
'depth',
'learning_rate'])
fig.show()
Next, we can use plot_slice to compare the objective value and individual parameters.
fig = optuna.visualization.plot_slice(study)
fig.show()
From this plot, lower rmse
occurred with a higher depth
and sometimes using more estimators in the model. The other parameters exhibited more complex relationships with the loss.
Let's now examine the distributions of the quantitative hyperparameters that were assessed during the search.
import matplotlib.pyplot as plt
import seaborn as sns
for i, hpo in enumerate(trials_df.columns):
if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
'duration', 'n_estimators', 'state']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
Higher values of depth
while lower values of l2_leaf_reg
, min_data_in_leaf
and one_hot_max_size
in the given hyperparameter space performed better to generate a lower loss. A learning_rate
between 0.17 and 0.22 performed better compared to 0.1 - 0.2 for the Hyperopt
trials and the lowest loss was achieved with ~ learning_rate=0.191
, which is higher than both Hyperopt
trials.
Let's now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 3, figsize=(20,5))
i = 0
axs = axs.flatten()
for i, hpo in enumerate(['learning_rate', 'min_data_in_leaf',
'depth']):
sns.regplot(x='iteration', y=hpo, data=trials_df, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
The learning_rate and min_data_in_leaf decreased over the trials, while the depth did not reveal any trends besides remaining high throughout the trials.
Next, we can examine if there were any trends regarding the regularization parameters over the trials.
ax = sns.regplot(x='iteration', y='l2_leaf_reg', data=trials_df)
plt.tight_layout()
plt.show()
A slight decrease in l2_leaf_reg
was observed over the trials.
Let's now visualize the parameter importances by utilizing the plot_param_importances
feature from the Optuna
package.
fig = optuna.visualization.plot_param_importances(study)
fig.show()
The learning_rate
is clearly the most important hyperparameter followed by depth
.
Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then save it as a .pkl file. Then we can evaluate the model metrics for both the train and the test sets using the MAE, MSE, RMSE and the coefficient of determination (R²).
params = study.best_params
params['random_state'] = seed_value
params
import pickle
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
best_bayes_model = CatBoostRegressor(loss_function='RMSE',
task_type='GPU',
cat_features=categorical_features_indices,
early_stopping_rounds=10,
rsm=1,
logging_level='Silent',
**params)
best_bayes_model.fit(X_train, y_train)
Pkl_Filename = 'Catboost_Optuna_trials100_GPU_T4_CV_wb.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_bayes_model, file)
print('\nModel Metrics for Catboost HPO 100 GPU trials')
y_train_pred = best_bayes_model.predict(X_train)
y_test_pred = best_bayes_model.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(y_train, y_train_pred),
mean_absolute_error(y_test, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred, squared=False),
mean_squared_error(y_test, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))
When comparing the model metrics to those for CatBoost utilizing 10-fold cross validation with Hyperopt, there was a higher MAE for the train set but a lower MAE for the test set. There was a higher MSE for both the train/test sets and a lower RMSE and R² for the train and test sets, so this model overfit less than the model from the Hyperopt cross validation search.
We can also evaluate the MSE
on the test set and determine when this was achieved:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(y_test,
y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
Model Explanations
¶
Using the defined function in the Hyperopt
section, we can also plot the feature importance from best model achieved through this hyperparameter search.
plot_feature_importance(best_bayes_model.get_feature_importance(),
X_train.columns, 'Catboost')
The features horsepower
, year
and mileage
are the most important followed by wheel_system_display
, back_legroom
, horsepower_rpm
, height
, width
, front_legroom
, wheelbase
, engine_displacement
and savings_amount
.
SHAP (SHapley Additive exPlanations)
¶
Let's use the training set and summarize the effects of all the features.
import shap
shap.initjs()
explainer = shap.TreeExplainer(best_bayes_model)
shap_values = explainer.shap_values(X_train)
shap.summary_plot(shap_values, X_train)
plt.gcf().axes[-1].set_aspect(1000)
plt.gcf().axes[-1].set_box_aspect(1000)
plt.show();
Then we can use the test set and summarize the effects of all the features.
shap.initjs()
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
plt.gcf().axes[-1].set_aspect(1000)
plt.gcf().axes[-1].set_box_aspect(1000)
plt.show();
100 Trials Train/Validation/Test
¶
To evaluate utilizing the train/validation/test set approach, let's split the training set into training and validation sets using train_test_split with a test_size=0.2, so the data is split into 60% for training, 20% for validation and 20% for testing the trained model. We can then set the working directory where the .pkl is stored and set up Weights & Biases.
train_Features, val_features, train_Label, val_label = train_test_split(X_train,
y_train,
test_size=0.2,
random_state=seed_value)
wandb.login()
wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
'group': 'optuna_cat100gpu_T4_trainvalTest',
'save_code': 'False',
'notes': 'optuna_cat100gpu_T4_trainvalTest'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
Let's set up the W&B callbacks and define a function for the optimization of hyperparameters for an Optuna study with a .pkl file and the parameter space to test the different combinations of parameters, using the same ones used during the Hyperopt trials.
Then, we can define the model type with the parameters that will be used for each trial and fit the model using the train set with the validation set as the eval_set, where the rmse is computed from predictions on the validation set.
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a CatBoostRegressor model.
"""
joblib.dump(study, 'Catboost_Optuna_100_GPU_T4_wb_trainValTest.pkl')
catboost_tune_kwargs = {
'random_state': seed_value,
'task_type': 'GPU',
'n_estimators': trial.suggest_int('n_estimators', 100, 500),
'learning_rate': trial.suggest_float('learning_rate', 1e-4, 0.3),
'depth': trial.suggest_int('depth', 3, 10),
'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1e-2, 1e0),
'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 2, 20),
'one_hot_max_size': trial.suggest_int('one_hot_max_size', 2, 20)
}
model = CatBoostRegressor(loss_function='RMSE',
cat_features=categorical_features_indices,
early_stopping_rounds=10,
rsm=1,
logging_level='Silent',
**catboost_tune_kwargs)
start = timer()
model.fit(train_Features, train_Label.values.ravel(),
eval_set=[(val_features, val_label.values.ravel())])
run_time = timer() - start
y_pred_val = model.predict(val_features)
rmse = mean_squared_error(val_label, y_pred_val, squared=False)
return rmse
Now, let's begin the optimization, where the parameters and score for each run are stored here.
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('Catboost_Optuna_100_GPU_T4_wb_trainValTest.pkl'):
study = joblib.load('Catboost_Optuna_100_GPU_T4_wb_trainValTest.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100, callbacks=[wandbc])
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
Results from the Hyperparameter Search
¶
Let's now extract the trial number, rmse and hyperparameter values into a pandas.DataFrame and sort with the lowest error first.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'rmse'}, inplace=True)
trials_df.rename(columns={'params_depth': 'depth'}, inplace=True)
trials_df.rename(columns={'params_l2_leaf_reg': 'l2_leaf_reg'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
inplace=True)
trials_df.rename(columns={'params_min_data_in_leaf': 'min_data_in_leaf'},
inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_one_hot_max_size': 'one_hot_max_size'},
inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
Let's utilize plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search and plot_contour to plot the parameter relationships with a contour plot.
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()
fig = optuna.visualization.plot_contour(study, params=['n_estimators',
'min_data_in_leaf',
'depth',
'learning_rate'])
fig.show()
Let's now examine the distributions of the quantitative hyperparameters that were assessed during the search.
for i, hpo in enumerate(trials_df.columns):
if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
'duration', 'n_estimators', 'state']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
Higher values of depth
, min_data_in_leaf
and one_hot_max_size
while lower values of l2_leaf_reg
in the given hyperparameter space performed better to generate a lower loss. A learning_rate
between 0.17 and 0.22 performed better compared to 0.1 - 0.2 for the Hyperopt
trials and the lowest loss was achieved with ~ learning_rate=0.178
, which is higher than both Hyperopt
trials.
Let's now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 3, figsize=(20,5))
i = 0
axs = axs.flatten()
for i, hpo in enumerate(['learning_rate', 'min_data_in_leaf', 'depth']):
sns.regplot(x='iteration', y=hpo, data=trials_df, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
Over the search trials, there were not any observable trends in these quantitative hyperparameters.
Next, we can examine if there were any trends regarding the regularization parameters over the trials.
ax = sns.regplot(x='iteration', y='l2_leaf_reg', data=trials_df)
plt.tight_layout()
plt.show()
Over the trials, there was a slight increase in l2_leaf_reg
.
Let's now visualize the parameter importances by utilizing the plot_param_importances
feature from the Optuna
package.
fig = optuna.visualization.plot_param_importances(study)
fig.show()
The learning_rate is clearly the most important hyperparameter, even more so than when using cross validation for the Optuna trials, followed by depth, which is less important in this search.
Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then save it as a .pkl file. Then we can evaluate the model metrics for both the train and the test sets using the MAE, MSE, RMSE and the coefficient of determination (R²).
params = study.best_params
params['random_state'] = seed_value
params
train_label = trainDF[['price']]
test_label = testDF[['price']]
train_features = trainDF.drop(columns = ['price'])
test_features = testDF.drop(columns = ['price'])
categorical_features_indices = ['body_type', 'fuel_type', 'listing_color',
'transmission', 'wheel_system_display', 'State',
'listed_date_yearMonth']
best_bayes_model = CatBoostRegressor(loss_function='RMSE',
task_type='GPU',
cat_features=categorical_features_indices,
early_stopping_rounds=10,
rsm=1,
logging_level='Silent',
**params)
best_bayes_model.fit(train_features, train_label)
Pkl_Filename = 'Catboost_Optuna_trials100_GPU_T4_wb_trainValTest.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_bayes_model, file)
print('\nModel Metrics for Catboost HPO 100 GPU trials')
y_train_pred = best_bayes_model.predict(train_features)
y_test_pred = best_bayes_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(train_label, y_train_pred),
mean_absolute_error(test_label, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred),
mean_squared_error(test_label, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred, squared=False),
mean_squared_error(test_label, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(train_label, y_train_pred),
r2_score(test_label, y_test_pred)))
When comparing the model metrics to the 10-fold cross validation trials using Hyperopt, the MAE, MSE and RMSE for both the train and the test sets are higher and the R² is lower for both of the sets. Therefore, the model with the best performance for CatBoost occurred with the Hyperopt 100 trials using 10-fold cross validation.
We can also evaluate the MSE
on the test set and determine when this was achieved:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
Model Explanations
¶
Using the defined function used in the Hyperopt
section, we can plot the feature importance from the best model result.
plt.rcParams.update({'font.size': 10})
plot_feature_importance(best_bayes_model.get_feature_importance(),
train_features.columns, 'Catboost')
The features horsepower
, year
and mileage
are the most important followed by wheel_system_display
, back_legroom
, horsepower_rpm
, height
, width
, front_legroom
, wheelbase
, engine_displacement
, highway_fuel_economy
and savings_amount.
After evaluating the two different packages for performing hyperparameter tuning, and cross validation vs. train/validation/test sets during the tuning process for CatBoost, the features horsepower, year and mileage are the most important, followed by slightly different orders of feature importance depending on the approach.
SHAP (SHapley Additive exPlanations)
¶
Let's use the training set and summarize the effects of all the features.
shap.initjs()
explainer = shap.TreeExplainer(best_bayes_model)
shap_values = explainer.shap_values(train_features)
shap.summary_plot(shap_values, train_features)
plt.gcf().axes[-1].set_aspect(1000)
plt.gcf().axes[-1].set_box_aspect(1000)
plt.show();
Then we can use the test set and summarize the effects of all the features.
shap.initjs()
shap_values = explainer.shap_values(test_features)
shap.summary_plot(shap_values, test_features)
plt.gcf().axes[-1].set_aspect(1000)
plt.gcf().axes[-1].set_box_aspect(1000)
plt.show();
LightGBM
¶
The paper LightGBM: A Highly Efficient Gradient Boosting Decision Tree introduced two new techniques called Gradient-based One-Side Sampling (GOSS
) and Exclusive Feature Bundling (EFB
).
GOSS
is a new gradient boosting decision tree (GBDT
) algorithm under the assumption that instances containing larger gradients (greater than a specified threshold or within the top percentiles) contribute more to information gain. Therefore, the instances with large gradients are retained and only the instances with small gradients are randomly dropped when down sampling the data. This results in more accurate gain estimation compared to uniformly random sampling with the same target sampling rate, especially when the information gain values have a large range, as sketched below.
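A rough numpy sketch of the GOSS sampling step (an illustrative assumption, not LightGBM's internal implementation): keep the top-a fraction of instances by absolute gradient, randomly sample a b fraction of the rest, and up-weight the sampled small-gradient instances by (1 - a) / b so the estimated gain stays approximately unbiased:
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, rng=np.random.default_rng(42)):
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))  # largest |gradient| first
    top_idx = order[:int(a * n)]  # always retained
    sampled = rng.choice(order[int(a * n):], size=int(b * n), replace=False)
    weights = np.ones(n)
    weights[sampled] = (1 - a) / b  # compensate for the down sampling
    keep = np.concatenate([top_idx, sampled])
    return keep, weights[keep]

grads = np.random.default_rng(0).normal(size=1000)
idx, w = goss_sample(grads)
print(len(idx), w.max())  # 300 instances kept; small-gradient weight = 8.0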
EFB
addresses the problem when there is a large number of features present but, in reality, the feature space is quite sparse. This utilizes a graph approach where the features are vertices and edges are added between every two features that are not mutually exclusive; a greedy algorithm with a constant approximation ratio is then leveraged to assign features to bundles, as sketched after this paragraph.
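A simplified sketch of the greedy bundling idea (illustrative only, with a hypothetical conflict budget K rather than LightGBM's exact procedure): features are visited densest first and placed into the first bundle whose existing nonzero rows conflict with theirs in at most K rows:
import numpy as np

def efb_bundles(X, K=0):
    nonzero = [set(np.nonzero(X[:, j])[0]) for j in range(X.shape[1])]
    bundles, bundle_rows = [], []
    # visit features by nonzero count (graph degree), densest first
    for j in sorted(range(X.shape[1]), key=lambda j: -len(nonzero[j])):
        for bundle, rows in zip(bundles, bundle_rows):
            if len(rows & nonzero[j]) <= K:  # conflicts within budget
                bundle.append(j)
                rows |= nonzero[j]
                break
        else:  # no compatible bundle found; open a new one
            bundles.append([j])
            bundle_rows.append(set(nonzero[j]))
    return bundles

X = np.array([[1., 0., 0., 2.],
              [0., 3., 0., 0.],
              [0., 0., 4., 0.],
              [5., 0., 0., 0.]])
print(efb_bundles(X))  # mutually exclusive features share a bundle: [[0, 1, 2], [3]]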
Since Google Colaboratory
was utilized to run LightGBM
with a GPU
, the environment needs to be set up by mounting the drive and cloning the LightGBM
repository from Github.
%cd /content/drive/MyDrive/
!git clone --recursive https://github.com/Microsoft/LightGBM
Then we can navigate to the directory and create a build
directory within the cloned repository where we can use cmake
to compile and build.
%cd /content/drive/MyDrive/LightGBM
!mkdir build
%cd build
!cmake -DUSE_GPU=1 ..
!make -j$(nproc)
Next, we can install pip
, setuptools
and the required dependencies to set up the environment for utilizing LightGBM
with GPU capabilities.
!sudo apt-get -y install python-pip
!sudo -H pip install setuptools optuna plotly eli5 shap lime -U
Now, we can move to the python-package
subdirectory and use setup.py
to compile the dependencies needed so that data will be processed on GPUs when using the LightGBM
package.
%cd /content/drive/MyDrive/LightGBM/python-package
!sudo python3 setup.py install --precompile --gpu
The Hyperopt notebook is here. Let's now set up the environment by importing the required packages, setting the options, setting the random and numpy seeds and determining the CUDA compiler and GPU characteristics.
import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')
seed_value = 42
os.environ['usedCars_lgbGPU'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
print('\n')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
import pandas as pd
trainDF = pd.read_csv('usedCars_trainSet.csv', low_memory=False)
X_train = trainDF.drop(['price'], axis=1)
y_train = trainDF['price']
testDF = pd.read_csv('usedCars_testSet.csv', low_memory=False)
X_test = testDF.drop(['price'], axis=1)
y_test = testDF['price']
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
Baseline Model
¶
Let's now fit the baseline model to the data, save as a .pkl
file and evaluate the model metrics for the baseline model.
from lightgbm import LGBMRegressor
import pickle
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
lgb = LGBMRegressor(device='gpu',
gpu_platform_id=0,
gpu_device_id=0,
random_state=seed_value,
objective='regression',
metric='rmse',
verbosity=-1)
lgb.fit(X_train, y_train)
Pkl_Filename = 'LightGBM_Baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(lgb, file)
print('\nModel Metrics for LightGBM Baseline')
y_train_pred = lgb.predict(X_train)
y_test_pred = lgb.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(y_train, y_train_pred),
mean_absolute_error(y_test, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred, squared=False),
mean_squared_error(y_test, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))
Hyperopt: 300 Trials 10-Fold Cross Validation
¶
We can now define the number of trials, the same k-folds using shuffled n_splits=10
for reproducibility and the parameter space:
random_state: seed_value designated for reproducibility. default=None.
device_type: Device for the tree learning. default=cpu.
objective: Specify the learning task and the corresponding learning objective or a custom objective function to be used. default=None (regression).
metric: Metric(s) to be evaluated on the evaluation set(s). default="".
verbosity: Controls the level of LightGBM's verbosity. default=1.
n_estimators: Number of boosted trees to fit. default=100.
learning_rate: Boosting learning rate. default=0.1.
max_depth: Maximum tree depth for base learners, <=0 means no limit. default=-1.
num_leaves: Maximum tree leaves for base learners. default=31.
boosting_type: gbdt, traditional Gradient Boosting Decision Tree; dart, Dropouts meet Multiple Additive Regression Trees; rf, Random Forest. default='gbdt'. A gdbt_subsample between 0.5 and 0.95 is searched when the boosting type is gbdt.
colsample_bytree: Subsample ratio of columns when constructing each tree. default=1.
reg_alpha: L1 regularization term on weights. default=0.
reg_lambda: L2 regularization term on weights. default=0.
from sklearn.model_selection import KFold
from hyperopt import hp
NUM_EVAL = 300
kfolds = KFold(n_splits=10, shuffle=True, random_state=seed_value)
lgb_tune_kwargs = {
'random_state': seed_value,
'device_type': 'gpu',
'objective': 'regression',
'metric': 'rmse',
'verbosity' : -1,
'n_estimators': hp.choice('n_estimators', np.arange(400, 700, dtype=int)),
'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(1)),
'max_depth': hp.choice('max_depth', np.arange(5, 12, dtype=int)),
'num_leaves': hp.choice('num_leaves', np.arange(70, 150, dtype=int)),
'boosting_type': hp.choice('boosting_type',
[{'boosting_type': 'gbdt',
'subsample': hp.uniform('gdbt_subsample', 0.5,
0.95)}]),
'colsample_bytree': hp.uniform('colsample_by_tree', 0.75, 1.0),
'reg_alpha': hp.uniform('reg_alpha', 0.0, 1.0),
'reg_lambda': hp.uniform('reg_lambda', 0.0, 1.0)
}
Now, we can define a function lgb_hpo
for optimizing the hyperparameters. Within this function, we can use joblib
to save a pkl
file that can be reloaded if more training is needed. We can define the trial number with ITERATION
that will increase by 1 each trial to keep track of the parameters for each trial. The hyperparameters sampled as integers need to be cast back to int so they are not passed as float; this applies to max_depth
and num_leaves
. Then the model type, LGBMRegressor
, needs to be defined.
Then we can fit the model with the training set utilizing the 'neg_root_mean_squared_error'
for the scoring
metric and 10-fold cross validation to generate the negative cross_val_score
where the np.mean
of the scores
is saved as the rmse
for each trial. The trial loss=rmse
, params
= parameters tested in the trial from the lgb_tune_kwargs
space, the trial number (ITERATION
), the time to complete the trial (run_time
) and if the trial completed successfully or not (STATUS_OK
) are written to the defined .csv
file and appended by row for each trial.
from lightgbm import LGBMRegressor
from timeit import default_timer as timer
from sklearn.model_selection import cross_val_score
import csv
from hyperopt import STATUS_OK
def lgb_hpo(config):
"""
Objective function to tune an LGBMRegressor model.
"""
global ITERATION
ITERATION += 1
# Checkpoint the trials object so the lightGBM_CV_Hyperopt_300_GPU.pkl
# file loaded before fmin below exists when the search is resumed
joblib.dump(bayesOpt_trials, 'lightGBM_CV_Hyperopt_300_GPU.pkl')
subsample = config['boosting_type'].get('subsample', 1.0)
config['boosting_type'] = config['boosting_type']['boosting_type']
config['subsample'] = subsample
for param_name in ['max_depth', 'num_leaves']:
config[param_name] = int(config[param_name])
lgb = LGBMRegressor(**config)
start = timer()
scores = -cross_val_score(lgb, X_train, y_train,
scoring='neg_root_mean_squared_error',
cv=kfolds)
run_time = timer() - start
rmse = np.mean(scores)
of_connection = open(out_file, 'a')
writer = csv.writer(of_connection)
writer.writerow([rmse, config, ITERATION, run_time])
return {'loss': rmse, 'params': config, 'iteration': ITERATION,
'train_time': run_time, 'status': STATUS_OK}
Let's now define an out_file
to save the results where the headers will be written to the file. Then we can set the global variable for the ITERATION
and define the Hyperopt Trials
as bayesOpt_trials
.
We can utilize if/else
conditional statements to load a .pkl
file if it exists, and then utilizing fmin
, we can specify the training function lgb_hpo
, the parameter space lgb_tune_kwargs
, the algorithm for optimization tpe.suggest
, the number of trials to evaluate NUM_EVAL
, the name of the trial set bayesOpt_trials
and the random state np.random.RandomState(42)
. We can now begin the hyperparameter optimization HPO
trials.
from hyperopt import tpe, Trials
out_file = '/content/drive/MyDrive/UsedCarsCarGurus/Models/ML/lightGBM/Hyperopt/trialOptions/lightGBM_CV_trials_300_GPU.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)
writer.writerow(['loss', 'params', 'iteration', 'train_time'])
of_connection.close()
tpe_algorithm = tpe.suggest
global ITERATION
ITERATION = 0
bayesOpt_trials = Trials()
from datetime import datetime, timedelta
import joblib
from hyperopt import fmin
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('lightGBM_CV_Hyperopt_300_GPU.pkl'):
bayesOpt_trials = joblib.load('lightGBM_CV_Hyperopt_300_GPU.pkl')
best_param = fmin(lgb_hpo, lgb_tune_kwargs, algo=tpe.suggest,
max_evals=NUM_EVAL, trials=bayesOpt_trials,
rstate=np.random.RandomState(42))
else:
best_param = fmin(lgb_hpo, lgb_tune_kwargs, algo=tpe.suggest,
max_evals=NUM_EVAL, trials=bayesOpt_trials,
rstate=np.random.RandomState(42))
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
Let's now sort the trials with lowest loss first and examine the lowest two losses:
bayesOpt_trials_results = sorted(bayesOpt_trials.results,
key=lambda x: x['loss'])
print('Top two trials with the lowest loss (lowest RMSE)')
print(bayesOpt_trials_results[:2])
Let's now access the results in the trialOptions
directory by reading into a pandas.DataFrame
sorted with the best scores on top and then resetting the index for slicing. Then we can convert the trial parameters from a string to a dictionary for later use. Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then saving as a .pkl
file. Then we can evaluate the model metrics for both the train and the test sets using the MAE
, MSE
, RMSE
and R².
import ast
import pickle
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
results = pd.read_csv('lightGBM_CV_trials_300_GPU.csv')
results.sort_values('loss', ascending=True, inplace=True)
results.reset_index(inplace=True, drop=True)
results.to_csv('lightGBM_CV_trials_300_GPU.csv')
ast.literal_eval(results.loc[0, 'params'])
best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()
best_bayes_model = LGBMRegressor(device='gpu',
gpu_platform_id=0,
gpu_device_id=0,
**best_bayes_params)
best_bayes_model.fit(X_train, y_train)
Pkl_Filename = 'lightGBM_CV_HPO_300_GPU.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_bayes_model, file)
print('\nModel Metrics for lightGBM HPO Hyperopt 300 trials')
y_train_pred = best_bayes_model.predict(X_train)
y_test_pred = best_bayes_model.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(y_train, y_train_pred),
mean_absolute_error(y_test, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred, squared=False),
mean_squared_error(y_test, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))
Compared to the baseline model, there are lower values for MAE
, MSE
and RMSE
and a higher R² for both train/test sets after hyperparameter tuning.
We can also evaluate the MSE
on the test set and determine when this was achieved:
print('The best model from Bayes optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(y_test,
y_test_pred)))
print('This was achieved after {} search iterations'.format(results.loc[0,
'iteration']))
Results from the Hyperparameter Search
¶
Let's now create a new pandas.DataFrame
where the parameters examined during the search are stored and the results of each parameter are stored as a different column. Then we can convert the data types for graphing. Let's now examine the distributions of the quantitative hyperparameters that were assessed during the search.
import matplotlib.pyplot as plt
import seaborn as sns
bayes_params = pd.DataFrame(columns=list(ast.literal_eval(results.loc[0,
'params']).keys()),
index=list(range(len(results))))
for i, params in enumerate(results['params']):
bayes_params.loc[i,:] = list(ast.literal_eval(params).values())
bayes_params['loss'] = results['loss']
bayes_params['iteration'] = results['iteration']
bayes_params['learning_rate'] = bayes_params['learning_rate'].astype('float64')
bayes_params['colsample_bytree'] = bayes_params['colsample_bytree'].astype('float64')
bayes_params['num_leaves'] = bayes_params['num_leaves'].astype('float64')
bayes_params['reg_alpha'] = bayes_params['reg_alpha'].astype('float64')
bayes_params['reg_lambda'] = bayes_params['reg_lambda'].astype('float64')
bayes_params['subsample'] = bayes_params['subsample'].astype('float64')
for i, hpo in enumerate(bayes_params.columns):
if hpo not in ['boosting_type', 'iteration', 'subsample', 'force_col_wise',
'max_depth', 'device_type', 'verbosity', 'random_state',
'objective', 'metric']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(bayes_params[hpo], label='Bayes Optimization')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
Higher values of colsample_bytree
, n_estimators
, num_leaves
and reg_lambda
in the given hyperparameter space performed better while a lower learning_rate
and a lower reg_alpha
performed better to generate a lower loss.
Let's now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 3, figsize=(20,5))
i = 0
for i, hpo in enumerate(['learning_rate', 'num_leaves', 'colsample_bytree']):
sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
colsample_bytree
increases over trials while learning_rate
and num_leaves
do not tend to follow a trend.
Next, we can examine if there were any trends regarding the regularization parameters over the trials.
fig, axs = plt.subplots(1, 2, figsize=(14,6))
i = 0
for i, hpo in enumerate(['reg_alpha', 'reg_lambda']):
sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
Over the trials, the reg_alpha
hyperparameter decreased while reg_lambda
increased.
Model Explanations
¶
Model Metrics with ELI5
¶
Let's now utilize the PermutationImportance
from eli5.sklearn
which is a method for the global interpretation of a model that outputs the magnitude of each feature's effect but not its direction. It shuffles one feature at a time in the set, generates predictions on the shuffled set, and measures how much the specified metric degrades relative to the unshuffled data. The more important a feature is, the greater the model error becomes when that feature is shuffled.
class PermutationImportance(estimator, scoring=None, n_iter=5, random_state=None, cv='prefit', refit=True)
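To make the mechanism concrete, here is a minimal hand-rolled sketch of the same idea (an illustration, not eli5's actual implementation); the feature name 'horsepower' in the usage comment is just an example:
from sklearn.metrics import mean_squared_error

def manual_permutation_importance(model, X, y, feature, n_iter=5, seed=42):
    """Mean increase in RMSE after shuffling one feature (higher = more important)."""
    rng = np.random.RandomState(seed)
    baseline = mean_squared_error(y, model.predict(X), squared=False)
    increases = []
    for _ in range(n_iter):
        X_perm = X.copy()
        # Shuffling breaks the feature's relationship with the target
        X_perm[feature] = rng.permutation(X_perm[feature].values)
        shuffled = mean_squared_error(y, model.predict(X_perm), squared=False)
        increases.append(shuffled - baseline)
    return np.mean(increases)

# e.g. manual_permutation_importance(best_bayes_model, X_test, y_test, 'horsepower')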
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
import eli5
from eli5.sklearn import PermutationImportance
X_test1 = pd.DataFrame(X_test, columns=X_test.columns)
perm_importance = PermutationImportance(best_bayes_model,
random_state=seed_value).fit(X_test,
y_test)
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test1.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn
which returns an explanation of the estimator parameters (weights).
from eli5.formatters import format_as_dataframe
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
The horsepower
feature contains the highest weight (0.533249). The year
feature has the next highest, but it is significantly lower (0.132819). This is followed by mileage
(0.073229), width
(0.070825) and height
(0.056370).
SHAP (SHapley Additive exPlanations)
¶
Let's use the training set and summarize the effects of all the features.
import shap
shap.initjs()
explainer = shap.TreeExplainer(best_bayes_model)
shap_values = explainer.shap_values(X_train)
shap.summary_plot(shap_values, X_train);
Then we can use the test set and summarize the effects of all the features.
shap.initjs()
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test);
100 Trials Train/Validation/Test
¶
Let's now prepare the data for modeling by creating the dummy variables for the categorical features. Then we can define the label
as price
and the features
as everything except price
by dropping price
to remove the target. Next, we can extract the names of the features
by using the columns and converting to a np.array
.
train = pd.get_dummies(trainDF, drop_first=True)
label = train[['price']]
features = train.drop(columns = ['price'])
features = features.columns
features = np.array(features)
We can now define the number of trials, the same k-folds using shuffled n_splits=10
for reproducibility and the parameter space. Compared to the 10-fold cross validation trials, let's use lower values for the n_estimators
(300-500 vs 400-700), max_depth
(5-6 vs. 5-12), num_leaves
(30-100 vs. 70-150) and colsample_bytree
(0.6-1 vs. 0.75-1) and higher values for gbdt_subsample
(0.5-1 vs 0.5-0.95). We can also test different kinds of boosting_type
, so let's evaluate these trials using gbdt
and dart
with the same ranges for subsampling and goss
with 'subsample'=1.0
. We can utilize the same range for the learning_rate
, reg_alpha
and reg_lambda
hyperparameters:
NUM_EVAL = 100
kf = KFold(n_splits=10, shuffle=True, random_state=seed_value)
lgb_tune_kwargs = {
'random_state': seed_value,
'device_type': 'gpu',
'objective': 'regression',
'metric': 'rmse',
'verbosity' : -1,
'n_estimators': hp.choice('n_estimators', np.arange(300, 500, dtype=int)),
'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(1)),
'max_depth': hp.choice('max_depth', np.arange(5, 6, dtype=int)),
'num_leaves': hp.choice('num_leaves', np.arange(30, 100, dtype=int)),
'boosting_type': hp.choice('boosting_type', [{'boosting_type': 'gbdt',
'subsample': hp.uniform('gdbt_subsample',
0.5,
1)},
{'boosting_type': 'dart',
'subsample': hp.uniform('dart_subsample',
0.5,
1)},
{'boosting_type': 'goss',
'subsample': 1.0}]),
'colsample_bytree': hp.uniform('colsample_by_tree', 0.6, 1.0),
'reg_alpha': hp.uniform('reg_alpha', 0.0, 1.0),
'reg_lambda': hp.uniform('reg_lambda', 0.0, 1.0)
}
Now, we can define a function lgb_hpo
for optimizing the hyperparameters. Within this function, we can use joblib
to save a pkl
file that can be reloaded if more training is needed. We can define the trial number with ITERATION
that will increase by 1 each trial to keep track of the parameters for each trial. Then we can retrieve the value used for subsample
, if it is present, otherwise it will be set to 1.0. Then we can extract the boosting type. Then the parameters which are integers need to configured to remain as integers and not as float
, so let's allocate them for the max_depth
and num_leaves
hyperparameters.
Next, we can split the training set into the train and validation sets by utilizing the defined variable kf
in a for
loop to split the training set 10-fold into the train and the validation sets with shuffling and the random_state=seed_value
. This will allow for different sets to be fit and evaluated using the LGBMRegressor
model. Then the validation set can be evaluated for the trial rmse
for each trial and
saved. The trial loss=rmse
, params
= parameters tested in the trial from the lgb_tune_kwargs
space, the trial number (ITERATION
), the time to complete the trial (run_time
) and if the trial completed successfully or not (STATUS_OK
) are written to the defined .csv
file and appended by row for each trial.
def lgb_hpo(config):
"""
Objective function to tune an LGBMRegressor model.
"""
global ITERATION
ITERATION += 1
# Checkpoint the trials object so the lightGBM_Hyperopt_100_GPU_val.pkl
# file loaded before fmin below exists when the search is resumed
joblib.dump(bayesOpt_trials, 'lightGBM_Hyperopt_100_GPU_val.pkl')
subsample = config['boosting_type'].get('subsample', 1.0)
config['boosting_type'] = config['boosting_type']['boosting_type']
config['subsample'] = subsample
for param_name in ['max_depth', 'num_leaves']:
config[param_name] = int(config[param_name])
for trn_idx, val_idx in kf.split(train[features], label):
train_features, train_label = train[features].iloc[trn_idx], label.iloc[trn_idx]
val_features, val_label = train[features].iloc[val_idx], label.iloc[val_idx]
lgb = LGBMRegressor(**config)
start = timer()
lgb.fit(train_features, train_label,
eval_set = [(val_features, val_label),
(train_features, train_label)])
run_time = timer() - start
y_pred_val = lgb.predict(val_features)
# Note: rmse is overwritten each fold, so the loss recorded below
# reflects only the final fold of the 10-fold split
rmse = mean_squared_error(val_label, y_pred_val, squared=False)
of_connection = open(out_file, 'a')
writer = csv.writer(of_connection)
writer.writerow([rmse, config, ITERATION, run_time])
return {'loss': rmse, 'params': config, 'iteration': ITERATION,
'train_time': run_time, 'status': STATUS_OK}
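As noted in the comment above, the loss reflects only the final fold. If the intent is for the loss to cover all ten folds, a small variant (a sketch of one possible adjustment, not the original code) accumulates the per-fold RMSEs and returns their mean; it assumes config has already been flattened as in lgb_hpo:
def lgb_hpo_mean_cv(config):
    """Variant whose loss is the mean validation RMSE across all 10 folds."""
    fold_rmses = []
    for trn_idx, val_idx in kf.split(train[features], label):
        X_tr, y_tr = train[features].iloc[trn_idx], label.iloc[trn_idx]
        X_val, y_val = train[features].iloc[val_idx], label.iloc[val_idx]
        model = LGBMRegressor(**config).fit(X_tr, y_tr)
        preds = model.predict(X_val)
        fold_rmses.append(mean_squared_error(y_val, preds, squared=False))
    return np.mean(fold_rmses)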
Let's now define an out_file
to save the results where the headers will be written to the file. Then we can set the global variable for the ITERATION
and define the Hyperopt Trials
as bayesOpt_trials
.
We can utilize if/else
conditional statements to load a .pkl
file if it exists, and then utilizing fmin
, we can specify the training function lgb_hpo
, the parameter space lgb_tune_kwargs
, the algorithm for optimization tpe.suggest
, the number of trials to evaluate NUM_EVAL
, the name of the trial set bayesOpt_trials
and the random state np.random.RandomState(42)
. We can now begin the hyperparameter optimization HPO
trials.
out_file = '/content/drive/MyDrive/UsedCarsCarGurus/Models/ML/lightGBM/Hyperopt/trialOptions/lightGBM_trials_100_GPU_val.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)
writer.writerow(['loss', 'params', 'iteration', 'train_time'])
of_connection.close()
global ITERATION
ITERATION = 0
bayesOpt_trials = Trials()
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('lightGBM_Hyperopt_100_GPU_val.pkl'):
bayesOpt_trials = joblib.load('lightGBM_Hyperopt_100_GPU_val.pkl')
best_param = fmin(lgb_hpo, lgb_tune_kwargs, algo=tpe.suggest,
max_evals=NUM_EVAL, trials=bayesOpt_trials,
rstate=np.random.RandomState(42))
else:
best_param = fmin(lgb_hpo, lgb_tune_kwargs, algo=tpe.suggest,
max_evals=NUM_EVAL, trials=bayesOpt_trials,
rstate=np.random.RandomState(42))
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
Compared to the completion time for 300 trials using 10-fold cross validation (6:46:24) on the same GPU, a Tesla P100-PCIE, this completed in little more than 1/6 of the time. We still need to evaluate the model metrics to compare the two approaches, though.
Let's now sort the trials with lowest loss first and examine the lowest two losses:
bayesOpt_trials_results = sorted(bayesOpt_trials.results,
key=lambda x: x['loss'])
print('Top two trials with the lowest loss (lowest RMSE)')
print(bayesOpt_trials_results[:2])
Let's now access the results in the trialOptions
directory by reading into a pandas.DataFrame
sorted with the best scores on top and then resetting the index for slicing. Then we can convert the trial parameters from a string to a dictionary for later use. Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then saving as a .pkl
file. Then we can evaluate the model metrics for both the train and the test sets using the MAE
, MSE
, RMSE
and R².
results = pd.read_csv('lightGBM_trials_100_GPU_val.csv')
results.sort_values('loss', ascending=True, inplace=True)
results.reset_index(inplace=True, drop=True)
results.to_csv('lightGBM_trials_100_GPU_val.csv')
ast.literal_eval(results.loc[0, 'params'])
best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()
train_label = train[['price']]
train_features = train.drop(columns=['price'])
test_label = testDF[['price']]
test_features = testDF.drop(columns=['price'])
test_features = pd.get_dummies(test_features, drop_first=True)
best_model = LGBMRegressor(device='gpu',
gpu_platform_id=0,
gpu_device_id=0,
**best_bayes_params)
best_model.fit(train_features, train_label)
Pkl_Filename = 'lightGBM_HPO_100_GPU_val.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for lightGBM HPO Hyperopt 100 trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(y_train, y_train_pred),
mean_absolute_error(y_test, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred, squared=False),
mean_squared_error(y_test, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))
Compared to the baseline model, there are lower values for MAE
, MSE
and RMSE
and a higher R² for both train/test sets after hyperparameter tuning. With 300 trials, the model overfit the training set more relative to the test set, while with 100 trials the metrics were more balanced between the two sets.
We can also evaluate the MSE
on the test set and determine when this was achieved:
print('The best model from Bayes optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
y_test_pred)))
print('This was achieved after {} search iterations'.format(results.loc[0,
'iteration']))
Results from the Hyperparameter Search
¶
Let's now create a new pandas.DataFrame
where the parameters examined during the search are stored and the results of each parameter are stored as a different column. Then we can convert the data types for graphing. Let's now examine the distributions of the quantitative hyperparameters that were assessed during the search.
bayes_params = pd.DataFrame(columns=list(ast.literal_eval(results.loc[0,
'params']).keys()),
index=list(range(len(results))))
for i, params in enumerate(results['params']):
bayes_params.loc[i,:] = list(ast.literal_eval(params).values())
bayes_params['loss'] = results['loss']
bayes_params['iteration'] = results['iteration']
bayes_params['learning_rate'] = bayes_params['learning_rate'].astype('float64')
bayes_params['colsample_bytree'] = bayes_params['colsample_bytree'].astype('float64')
bayes_params['num_leaves'] = bayes_params['num_leaves'].astype('float64')
bayes_params['reg_alpha'] = bayes_params['reg_alpha'].astype('float64')
bayes_params['reg_lambda'] = bayes_params['reg_lambda'].astype('float64')
bayes_params['subsample'] = bayes_params['subsample'].astype('float64')
for i, hpo in enumerate(bayes_params.columns):
if hpo not in ['boosting_type', 'iteration', 'subsample', 'force_col_wise',
'max_depth', 'device_type', 'verbosity', 'random_state',
'objective', 'metric']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(bayes_params[hpo], label='Bayes Optimization')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
Lower values of colsample_bytree
, learning_rate
and reg_lambda
while higher values of n_estimators
and num_leaves
in the given hyperparameter space performed better to generate a lower loss.
Compared to the 10-fold cross validation model, there were differences in the densities of the colsample_bytree
and the reg_lambda
hyperparameters.
We can map the boosting_type
to integer (essentially label encoding) and plot the boosting type over the search.
bayes_params['boosting_int'] = bayes_params['boosting_type'].replace({'goss': 0,
'dart': 1,
'gbdt': 2})
plt.plot(bayes_params['iteration'], bayes_params['boosting_int'], 'ro')
plt.yticks([0, 1, 2], ['goss', 'dart', 'gbdt']);
plt.xlabel('Iteration'); plt.title('Boosting Type over trials')
plt.show()
More of the trial iterations tested dart
for the boosting_type
compared to gbdt
and goss
.
Let's now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 3, figsize=(20,5))
i = 0
for i, hpo in enumerate(['learning_rate', 'num_leaves', 'colsample_bytree']):
sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
The colsample_bytree
parameter decreased over the trials, in contrast to the increase observed with cross-validation, while the learning_rate
may have increased slightly. Neither search showed a trend for the num_leaves
parameter over the trials.
Next, we can examine if there were any trends regarding the regularization parameters over the trials.
fig, axs = plt.subplots(1, 2, figsize=(14,6))
i = 0
for i, hpo in enumerate(['reg_alpha', 'reg_lambda']):
sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
Over the trials, the hyperparameter reg_lambda
decreased, while there was no trend for reg_alpha
. For cross-validation, reg_lambda
increased over the trials.
Model Explanations
¶
Model Metrics with ELI5
¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
X_test1 = pd.DataFrame(test_features, columns=test_features.columns)
perm_importance = PermutationImportance(best_model,
random_state=seed_value).fit(test_features,
test_label)
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test1.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn
which returns an explanation of the estimator parameters (weights).
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
When comparing the weights from the train/validation/test model to the ones generated from using 10-fold cross validation,
horsepower
was lower (0.426414 vs 0.533249), year
was lower (0.110301 vs. 0.132819), height
was higher (0.098178 vs. 0.056370), mileage
was higher (0.080769 vs. 0.073229) and width
was higher (0.097303 vs. 0.070825).
SHAP (SHapley Additive exPlanations)
¶
Let's use the training set and summarize the effects of all the features.
shap.initjs()
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(train_features)
shap.summary_plot(shap_values, train_features);
Then we can use the test set and summarize the effects of all the features.
shap.initjs()
shap_values = explainer.shap_values(test_features)
shap.summary_plot(shap_values, test_features);
Optuna: 100 Trials 10-Fold Cross Validation
¶
The Optuna notebook uses the same setup: import the dependencies, set the options and seed, then examine the CUDA and GPU characteristics.
import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')
seed_value = 42
os.environ['usedCars_lgbGPU'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
print('\n')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
Let's read the data and define the label/features with price
as the model target. Then create dummy variables for the categorical variables and use the column names to extract feature names.
import pandas as pd
trainDF = pd.read_csv('usedCars_trainSet.csv', low_memory=False)
testDF = pd.read_csv('usedCars_testSet.csv', low_memory=False)
train_label = trainDF[['price']]
test_label = testDF[['price']]
train_features = trainDF.drop(columns = ['price'])
test_features = testDF.drop(columns = ['price'])
train_features = pd.get_dummies(train_features, drop_first=True)
test_features = pd.get_dummies(test_features, drop_first=True)
features = train_features.columns
features = np.array(features)
We can leverage the optuna.integration.wandb module
to set up the callbacks that will be saved to Weights & Biases
. First, we login
and set up the arguments that include the name of the project
, the person saving the results, the group
the study
belongs to, if code is saved or not, and notes
about the study for future reference.
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)
import wandb
from optuna.integration.wandb import WeightsAndBiasesCallback
wandb.login()
wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
'group': 'optuna_lgbm100gpu_T4_cv_kfold10',
'save_code': 'False',
'notes': 'optuna_lgbm100gpu_T4_cv_kfold10'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
Now, let's set up the WB
callbacks using @wandbc.track_in_wandb()
. Then we can define a function objective
for the optimization of hyperparameters using an Optuna study
with a new pickle
file and the parameter space to evaluate different combinations. Then we can create an lgb.Dataset
using the training set and the hyperparameter space as a dictionary. Then we can define the model type as LGBMRegressor
specifying the parameters that will be used for each trial. We can then leverage cross validation from the lightgbm
package by using lgb.cv
with the params
, train_set
, 10-fold cross validation by specifying nfold=10
and stratified=False
. This can then perform timed cross validation trials with the goal of finding the lowest averaged rmse
in the validation set.
import joblib
from sklearn.model_selection import cross_val_score, KFold
import lightgbm as lgb
from lightgbm import LGBMRegressor
from timeit import default_timer as timer
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune an LGBMRegressor model.
"""
joblib.dump(study, 'lightGBM_Optuna_100_GPU_T4_wb_CV.pkl')
params = {'verbose': -1,
'device': 'gpu',
'gpu_platform_id': 0,
'gpu_device_id': 0,
'boosting_type': 'gbdt',
'objective':'regression',
'metric': 'rmse',
'seed': seed_value,
'early_stopping_rounds': 150,
'n_estimators': trial.suggest_int('n_estimators', 925, 935),
'learning_rate': trial.suggest_loguniform('learning_rate', 0.0075,
0.008),
'num_leaves': trial.suggest_int('num_leaves', 350, 365),
'bagging_freq': trial.suggest_int('bagging_freq', 8, 10),
'subsample': trial.suggest_float('subsample', 0.85, 0.95),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.77,
0.81),
'max_depth': trial.suggest_int('max_depth', 9, 11),
'lambda_l1': trial.suggest_float('lambda_l1', 0.02, 0.03,
log=True),
'lambda_l2': trial.suggest_float('lambda_l2', 0.001, 0.002,
log=True),
'min_child_samples': trial.suggest_int('min_child_samples',
540, 550)
}
train_set = lgb.Dataset(train_features, label=train_label, params=params)
model = LGBMRegressor(**params)
start = timer()
cv_results = lgb.cv(params, train_set, nfold=10, stratified=False)
run_time = timer() - start
rmse = round(cv_results['valid rmse-mean'][-1], 4)
return rmse
Now let's begin the optimization with Weights & Biases
tracking the loss and associated parameters. The optimization is tracked and the trial run components are located here.
from datetime import datetime, timedelta
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('lightGBM_Optuna_100_GPU_T4_wb_CV.pkl'):
study = joblib.load('lightGBM_Optuna_100_GPU_T4_wb_CV.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100, callbacks=[wandbc])
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
Let's now extract the trial number, rmse
and hyperparameter values into a pandas.DataFrame
and sort with the lowest error first.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration', 'value': 'rmse'}, inplace=True)
# Strip the 'params_' prefix Optuna prepends to each hyperparameter column
trials_df.columns = trials_df.columns.str.replace('params_', '', regex=False)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
Results from the Hyperparameter Search
¶
Let's utilize plot_optimization_history
, which shows the scores from all trials as well as the best score so far at each point. This search did not contain extreme outliers for the objective value so it can be useful for examining the study
output.
fig = optuna.visualization.plot_optimization_history(study)
fig.show()
Next, we can plot_parallel_coordinate
to plot the high-dimensional parameter relationships contained within the search and plot_slice
to compare the objective value and individual parameters.
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()
fig = optuna.visualization.plot_slice(study)
fig.show()
From this plot, lower rmse
occurred with lower colsample_bytree
, higher lambda_l1
, higher lambda_l2
, higher learning_rate
, higher max_depth
, higher n_estimators
and higher subsample.
Let's now examine the distributions of the quantitative hyperparameters that were assessed during the search.
for i, hpo in enumerate(trials_df.columns):
if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
'duration', 'n_estimators', 'state']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
Lower values of colsample_bytree
, min_child_samples
and num_leaves
while higher values of lambda_l1
, lambda_l2
, learning_rate
, max_depth
and subsample
in the given hyperparameter space performed better to generate a lower loss.
We can now examine if/how the quantitative parameters changed over the trials run to see if any trends exist.
fig, axs = plt.subplots(2, 3, figsize=(20,5))
i = 0
axs = axs.flatten()
for i, hpo in enumerate(['learning_rate', 'num_leaves', 'colsample_bytree',
'max_depth', 'subsample', 'bagging_freq']):
sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
Over the trials, the hyperparameters colsample_bytree
and num_leaves
decreased while the hyperparameters learning_rate
and subsample
increased. The results from max_depth
and bagging_freq
over the trials are not clear.
Next, we can examine if there were any trends regarding the regularization parameters over the trials.
fig, axs = plt.subplots(1, 2, figsize=(20,5))
i = 0
for i, hpo in enumerate(['lambda_l1', 'lambda_l2']):
sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
Over the trials, lambda_l1
potentially increased, while there was no trend for lambda_l2
. Increasing the number of trials might reveal more insight about this parameter.
Let's now visualize the parameter importances by utilizing the plot_param_importances
feature from the Optuna
package.
fig = optuna.visualization.plot_param_importances(study)
fig.show()
The max_depth
was the most important parameter for the loss utilized. CatBoost
demonstrated the learning_rate
was the most important parameter during the Optuna
trials.
We can use plot_edf
to plot the objective value empirical distribution function (EDF) for the study.
fig = optuna.visualization.plot_edf(study)
fig.show()
This plot demonstrates a narrow range of values for the loss from the models evaluated.
Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then saving as a .pkl
file. Then we can evaluate the model metrics for both the train and the test sets using the MAE
, MSE
, RMSE
and R².
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
train_label = trainDF[['price']]
train_features = trainDF.drop(columns=['price'])
test_label = testDF[['price']]
test_features = testDF.drop(columns=['price'])
test_features = pd.get_dummies(test_features, drop_first=True)
best_model = LGBMRegressor(boosting_type='gbdt',
device='gpu',
gpu_platform_id=0,
gpu_device_id=0,
verbosity=-1,
**params)
best_model.fit(train_features, train_label)
Pkl_Filename = 'lightGBM_Optuna_trials100_GPU_T4_wb_CV.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for lightGBM HPO 100 GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(train_label, y_train_pred),
mean_absolute_error(test_label, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred),
mean_squared_error(test_label, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred, squared=False),
mean_squared_error(test_label, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(train_label, y_train_pred),
r2_score(test_label, y_test_pred)))
Compared to the baseline model, there are lower values for MAE
, MSE
and RMSE
and a higher R² for both train/test sets after hyperparameter tuning.
We can also evaluate the MSE
on the test set and determine when this was achieved:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
Model Explanations
¶
SHAP (SHapley Additive exPlanations)
¶
Let's use the training set and summarize the effects of all the features.
shap.initjs()
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(train_features)
shap.summary_plot(shap_values, train_features);
Then we can use the test set and summarize the effects of all the features.
shap.initjs()
shap_values = explainer.shap_values(test_features)
shap.summary_plot(shap_values, test_features);
Model Metrics with ELI5
¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
X_test1 = pd.DataFrame(test_features, columns=test_features.columns)
perm_importance = PermutationImportance(best_model,
random_state=seed_value).fit(test_features,
test_label)
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test1.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn
which returns an explanation of the estimator parameters (weights).
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
The horsepower
feature contains the highest weight (0.390659). Then the mileage
has the next highest, but the second weight is once again way lower than horsepower
as in the previous models (0.132178). This is followed by width
(0.082871), year
(0.079460) and horsepower_rpm
(0.034325), which is a different order than the previous model weights using ELI5
.
XGBoost
¶
A decision tree is a hierarchical model that utilizes conditional statements which represent choices and the potential downstream outcomes. The tree structure is comprised of a root node, branches, internal nodes, and leaf nodes. Gradient Boosting Decision Trees (GBDT
) are comprised of multiple decision trees in an ensemble similar to what is used with Random Forest
models. However, the methodologies for building and combining the trees differ. Gradient boosting is derived from boosting, where weak models are combined with the objective of producing a more effective model, but here the weak models are generated additively by performing gradient descent on the objective function. Each new tree targets the error of the current ensemble's predictions, so subsequent models progressively minimize that error. The weighted sum of all of the tree predictions is then used for the final prediction.
Extreme Gradient Boosting (XGBoost) presented in XGBoost: A Scalable Tree Boosting System leverages trees built in parallel following a level-wise strategy that scans across the gradient values and uses these partial sums to evaluate the quality of splits at every possible split in the training set. XGBoost
is built for model performance and computational speed with CPU
and GPU
support for the ML library.
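To make the boosting procedure above concrete, here is a minimal from-scratch sketch for squared-error regression, where the gradient step reduces to fitting each new tree to the current residuals (an illustrative toy on synthetic data, not XGBoost's level-wise implementation):
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_predict(X, trees, base, lr):
    """Additive prediction: base value plus learning-rate-scaled tree outputs."""
    pred = np.full(len(X), base)
    for tree in trees:
        pred += lr * tree.predict(X)
    return pred

def gbdt_fit(X, y, n_trees=50, lr=0.1, max_depth=3):
    base = y.mean()  # initial prediction F_0
    trees = []
    for _ in range(n_trees):
        # For squared error, the negative gradient is the current residual
        residuals = y - gbdt_predict(X, trees, base, lr)
        trees.append(DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals))
    return trees, base

# Toy usage on synthetic data
rng = np.random.RandomState(42)
X_demo = rng.rand(200, 3)
y_demo = 5 * X_demo[:, 0] + rng.randn(200)
trees, base = gbdt_fit(X_demo, y_demo)
print(gbdt_predict(X_demo[:5], trees, base, 0.1))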
The Hyperopt
notebook is located here. Let's first set up the environment by installing/importing the dependencies, setting the options and seed followed by examining the CUDA
and GPU
characteristics.
!pip install xgboost==1.5.2
!pip install eli5
!pip install shap
import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')
seed_value = 42
os.environ['usedCars_xgbGPU'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
Then we can read the data and partition the train/test sets with price
as the target. Next, we can create dummy variables for the categorical variables.
import pandas as pd
trainDF = pd.read_csv('usedCars_trainSet.csv', low_memory=False)
testDF = pd.read_csv('usedCars_testSet.csv', low_memory=False)
X_train = trainDF.drop(columns = ['price'])
y_train = trainDF[['price']]
X_test = testDF.drop(columns = ['price'])
y_test = testDF[['price']]
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
Baseline Model
¶
Let's now fit the baseline model to the data, save as a .pkl
file and evaluate the model metrics for the baseline model.
from xgboost import XGBRegressor
import pickle
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
xgb = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
random_state=seed_value,
verbosity=0)
xgb.fit(X_train, y_train)
Pkl_Filename = 'XGB_Baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(xgb, file)
print('\nModel Metrics for XGBoost Baseline')
y_train_pred = xgb.predict(X_train)
y_test_pred = xgb.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(y_train, y_train_pred),
mean_absolute_error(y_test, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred, squared=False),
mean_squared_error(y_test, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))
Hyperopt: 300 Trials 10-Fold Cross Validation
¶
To set up Hyperopt
, let's first define the number of trials with NUM_EVAL = 300
. We need to utilize the same k-folds for reproducibility, so let's use 10 as the number of folds with which to split the training set into train and validation sets with shuffle=True
and the initially defined seed value. The hyperparameters can then be defined in a dictionary. To define integers, hp.choice
with np.arange
and dtype=int
is used while float types are defined using hp.uniform
. The space consists of 10 parameters, 4 integer and 6 float types:
- n_estimators: number of gradient boosted trees.
- max_depth: maximum tree depth for base learners.
- subsample: subsample ratio of the training instance.
- learning_rate: used for reducing the gradient step.
- gamma: minimum loss reduction required to make a further partition on a leaf node of the tree.
- reg_alpha: L1 regularization term on weights (xgb's alpha).
- reg_lambda: L2 regularization term on weights (xgb's lambda).
- colsample_bytree: subsample ratio of columns when constructing each tree.
- colsample_bylevel: subsample ratio of columns for each level.
- min_child_weight: minimum sum of instance weight (hessian) needed in a child.
from sklearn.model_selection import KFold
from hyperopt import hp
NUM_EVAL = 300
kfolds = KFold(n_splits=10, shuffle=True, random_state=seed_value)
xgb_tune_kwargs= {
'n_estimators': hp.choice('n_estimators', np.arange(100, 500, dtype=int)),
'max_depth': hp.choice('max_depth', np.arange(3, 10, dtype=int)),
'subsample': hp.uniform('subsample', 0.25, 0.75),
'gamma': hp.uniform('gamma', 0, 9),
'learning_rate': hp.uniform('learning_rate', 1e-4, 0.3),
'reg_alpha': hp.choice('reg_alpha', np.arange(0, 30, dtype=int)),
'reg_lambda': hp.uniform('reg_lambda', 0, 3),
'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1),
'colsample_bylevel': hp.uniform('colsample_bylevel', 0.05, 0.5),
'min_child_weight': hp.choice('min_child_weight', np.arange(0, 10,
dtype=int))
}
Now, we can define a function for the optimization of these hyperparameters. Within this function, we can use joblib
to save a pkl
file that can be reloaded if more training is needed. We can define the trial number with ITERATION
that will increase by 1 each trial to keep track of the parameters for each trial. The hyperparameters sampled as integers need to be cast back to int so they are not passed as float; this applies to n_estimators and max_depth, which starts at max_depth=3
. Then the model type, XGBRegressor
, needs to be defined with the parameters that will be included in all of the trials during the search, which are:
- objective='reg:squarederror': specify the learning task and the corresponding learning objective or a custom objective function to be used.
- booster='gbtree': specify which booster to use: gbtree, gblinear or dart.
- tree_method='gpu_hist': specify which tree method to use. default=auto.
- scale_pos_weight=1: balancing of positive and negative weights.
- use_label_encoder=False: disables the label encoder.
- random_state=seed_value: random number seed.
- verbosity=0: the degree of verbosity. Valid values are 0 (silent) - 3 (debug).
Then we can fit the model with the training set utilizing the 'neg_root_mean_squared_error'
for the scoring
metric and 10-fold cross validation to generate the negative cross_val_score
where the np.mean
of the scores
is saved as the rmse
for each trial. The trial loss=rmse
, params
= parameters tested in the trial from the xgb_tune_kwargs
space , the trial number (iteration
), the time to complete the trial (train_time
) and if the trial completed successfully or not (status
) are written to the defined .csv
file and appended by row for each trial.
import joblib
from timeit import default_timer as timer
from sklearn.model_selection import cross_val_score
import csv
from hyperopt import STATUS_OK
kfolds = KFold(n_splits=10, shuffle=True, random_state=seed_value)
def xgb_hpo(config):
"""
Objective function to tune an XGBRegressor model
"""
joblib.dump(bayesOpt_trials, 'xgb_Hyperopt_300_GPU.pkl')
global ITERATION
ITERATION += 1
config['n_estimators'] = int(config['n_estimators'])
config['max_depth'] = int(config['max_depth']) + 3
xgb = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
early_stopping_rounds=10,
random_state=seed_value,
verbosity=0,
**config)
start = timer()
scores = -cross_val_score(xgb, X_train, y_train,
scoring='neg_root_mean_squared_error',
cv=kfolds)
run_time = timer() - start
rmse = np.mean(scores)
of_connection = open(out_file, 'a')
writer = csv.writer(of_connection)
writer.writerow([rmse, config, ITERATION, run_time])
return {'loss': rmse, 'params': config, 'iteration': ITERATION,
'train_time': run_time, 'status': STATUS_OK}
Let's now define an out_file
to save the results where the headers will be written to the file. Then we can set the global variable for the ITERATION
and define the Hyperopt Trials
as bayesOpt_trials
.
We can utilize if/else
conditional statements to load a .pkl
file if it exists, and then utilizing fmin
, we can specify the training function xgb_hpo, the parameter space xgb_tune_kwargs
, the algorithm for optimization tpe.suggest
, the number of trials to evaluate NUM_EVAL
, the name of the trial set bayesOpt_trials
and the random state np.random.RandomState(42)
. We can now begin the HPO
trials.
from hyperopt import tpe, Trials
tpe_algorithm = tpe.suggest
out_file = '/content/drive/MyDrive/UsedCarsCarGurus/Models/ML/XGBoost/Hyperopt/trialOptions/XGB_trials_300_GPU.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)
writer.writerow(['loss', 'params', 'iteration', 'train_time'])
of_connection.close()
global ITERATION
ITERATION = 0
bayesOpt_trials = Trials()
from datetime import datetime, timedelta
from hyperopt import fmin
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('xgb_Hyperopt_300_GPU.pkl'):
bayesOpt_trials = joblib.load('xgb_Hyperopt_300_GPU.pkl')
best_param = fmin(xgb_hpo, xgb_tune_kwargs, algo=tpe.suggest,
max_evals=NUM_EVAL, trials=bayesOpt_trials,
rstate=np.random.RandomState(42))
else:
best_param = fmin(xgb_hpo, xgb_tune_kwargs, algo=tpe.suggest,
max_evals=NUM_EVAL, trials=bayesOpt_trials,
rstate=np.random.RandomState(42))
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
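One Hyperopt subtlety worth noting here (an added aside, not from the original notebook): for hp.choice dimensions, the best_param dictionary returned by fmin stores the index of the winning option rather than its value, which is one reason the actual parameters are reloaded from the trials .csv below. hyperopt.space_eval can map the indices back to the sampled values:
from hyperopt import space_eval

# Decode the index-encoded hp.choice entries in best_param back into
# the actual hyperparameter values from the search space.
best_values = space_eval(xgb_tune_kwargs, best_param)
print(best_values)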
Let's now sort the trials with lowest loss first and examine the lowest two losses:
bayesOpt_trials_results = sorted(bayesOpt_trials.results,
key=lambda x: x['loss'])
print('Top two trials with the lowest loss (lowest RMSE)')
print(bayesOpt_trials_results[:2])
Let's now access the results in the trialOptions
directory by reading into a pandas.DataFrame
sorted with the best scores on top and then resetting the index for slicing. Then we can convert the trial parameters from a string to a dictionary for later use. Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then saving as a .pkl
file. Then we can evaluate the model metrics for both the train and the test sets using the MAE
, MSE
, RMSE
and R².
import ast
results = pd.read_csv('XGB_trials_300_GPU.csv')
results.sort_values('loss', ascending=True, inplace=True)
results.reset_index(inplace=True, drop=True)
ast.literal_eval(results.loc[0, 'params'])
best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()
best_bayes_model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0,
**best_bayes_params)
best_bayes_model.fit(X_train, y_train)
Pkl_Filename = 'XGB_HPO_trials300_GPU.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_bayes_model, file)
print('\nModel Metrics for XGBoost HPO 300 GPU trials')
y_train_pred = best_bayes_model.predict(X_train)
y_test_pred = best_bayes_model.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(y_train, y_train_pred),
mean_absolute_error(y_test, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred, squared=False),
mean_squared_error(y_test, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))
Compared to the baseline model, there are lower values for MAE
, MSE
and RMSE
and a higher R² for both train/test sets after hyperparameter tuning.
We can also evaluate the MSE
on the test set and determine when this was achieved:
print('The best model from Bayes optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(y_test,
y_test_pred)))
print('This was achieved after {} search iterations'.format(results.loc[0, 'iteration']))
Results from the Hyperparameter Search
¶
Let's now create a new pandas.DataFrame
where the parameters examined during the search are stored and the results of each parameter are stored as a different column. Then we can convert the data types for graphing. Let's now examine the distributions of the quantitative hyperparameters that were assessed during the search.
import matplotlib.pyplot as plt
import seaborn as sns
bayes_params = pd.DataFrame(columns=list(ast.literal_eval(results.loc[0,
'params']).keys()),
index=list(range(len(results))))
for i, params in enumerate(results['params']):
bayes_params.loc[i,:] = list(ast.literal_eval(params).values())
bayes_params['loss'] = results['loss']
bayes_params['iteration'] = results['iteration']
bayes_params['colsample_bylevel'] = bayes_params['colsample_bylevel'].astype('float64')
bayes_params['colsample_bytree'] = bayes_params['colsample_bytree'].astype('float64')
bayes_params['gamma'] = bayes_params['gamma'].astype('float64')
bayes_params['learning_rate'] = bayes_params['learning_rate'].astype('float64')
bayes_params['reg_alpha'] = bayes_params['reg_alpha'].astype('float64')
bayes_params['reg_lambda'] = bayes_params['reg_lambda'].astype('float64')
bayes_params['subsample'] = bayes_params['subsample'].astype('float64')
for i, hpo in enumerate(bayes_params.columns):
if hpo not in ['iteration', 'subsample', 'force_col_wise',
'max_depth', 'min_child_weight', 'n_estimators']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(bayes_params[hpo], label='Bayes Optimization')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
The colsample_bylevel
and reg_lambda
parameters with higher values demonstrate lower loss for the objective value. Lower values for the colsample_bytree
, gamma
and learning_rate
parameters resulted in lower loss. The reg_alpha
parameter demonstrated a bimodal distribution.
Let's now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 4, figsize=(20,5))
i = 0
for i, hpo in enumerate(['learning_rate', 'gamma', 'colsample_bylevel',
'colsample_bytree']):
sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
The learning_rate
, gamma
and colsample_bytree
decreased over the trials while colsample_bylevel
increased. reg_alpha
and reg_lambda
did not demonstrate any trends over the trial iterations.
Model Explanations
¶
Now, we can plot the feature importance from the best model.
from xgboost import plot_importance
plot_importance(best_bayes_model, max_num_features=15)
Using this approach to feature importance, the most important features are daysonmarket
, mileage
, torque
, torque_rpm
, savings_amount
, back_legroom
, city_fuel_economy
, height
, length
, highway_fuel_economy
, front_legroom
, fuel_tank_volume
, wheelbase
, width
and engine_displacement
.
SHAP (SHapley Additive exPlanations)
¶
Let's use the training set and summarize the effects of all the features.
import shap
shap.initjs()
explainer = shap.TreeExplainer(best_bayes_model)
shap_values = explainer.shap_values(X_train)
shap.summary_plot(shap_values, X_train);
Then we can use the test set and summarize the effects of all the features.
shap.initjs()
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test);
Model Metrics with ELI5
¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
import eli5
from eli5.sklearn import PermutationImportance
X_test1 = pd.DataFrame(X_test, columns=X_test.columns)
perm_importance = PermutationImportance(best_bayes_model,
random_state=seed_value).fit(X_test,
y_test)
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test1.columns.tolist())
html_obj
The horsepower
feature has the highest weight (0.1222) followed by mileage
, width
and year
.
Optuna: 300 Trials 10-Fold Cross Validation
¶
The Optuna
notebooks can be found here. Let's first set up the environment by installing/importing the dependencies, setting the options and seed followed by examining the CUDA
and GPU
characteristics.
!pip install --upgrade -q wandb
!pip install xgboost==1.5.2
!pip install optuna
!pip install eli5
!pip install shap
import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')
seed_value = 42
os.environ['usedCars_xgbGPU'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
print('\n')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
Then the data can be read into a pandas.DataFrame, the label and features defined, followed by creating the dummy variables for the categorical features.
import pandas as pd
trainDF = pd.read_csv('usedCars_trainSet.csv', low_memory=False)
testDF = pd.read_csv('usedCars_testSet.csv', low_memory=False)
train_label = trainDF[['price']]
test_label = testDF[['price']]
train_features = trainDF.drop(columns = ['price'])
test_features = testDF.drop(columns = ['price'])
train_features = pd.get_dummies(train_features, drop_first=True)
test_features = pd.get_dummies(test_features, drop_first=True)
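One caveat worth noting: calling pd.get_dummies separately on the train and test sets can produce mismatched columns if a category appears in only one split. A small sketch that reconciles the two frames, assuming the objects above:
# Keep the training columns as the reference: indicator columns missing
# from the test split are added as zeros, and test-only columns are dropped.
train_features, test_features = train_features.align(test_features,
                                                     join='left', axis=1,
                                                     fill_value=0)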
Let's now set up Weights & Biases, specifying the project, entity, group and notes.
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)
import wandb
from optuna.integration.wandb import WeightsAndBiasesCallback
wandb.login()
wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
'group': 'optuna_xgb300gpu_cv_kfold10',
'save_code': 'False', 'notes': 'xgb300gpu_cv_kfold10'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
Using the same k-folds for reproducibility and setting up the W & B callbacks, let's now define a function for the optimization of hyperparameters using an Optuna.study saved to a pickle file, the parameters to test in different combinations during the search, and the model type with the parameters that will be used for each trial. Then the timer can be specified to begin when called, to perform the timed KFold cross validation trials with the goal of finding the lowest averaged score. The W & B parameters with loss for this study can be examined here.
from sklearn.model_selection import cross_val_score, KFold
import joblib
from xgboost import XGBRegressor, plot_importance
from datetime import datetime, timedelta
from timeit import default_timer as timer
import pickle
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
kfolds = KFold(n_splits=10, shuffle=True, random_state=seed_value)
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a XGBoostRegressor model.
"""
joblib.dump(study, 'XGB_Optuna_300_GPU_CV_wb.pkl')
params_xgb_optuna = {
'n_estimators': trial.suggest_int('n_estimators', 400, 1000),
'max_depth': trial.suggest_int('max_depth', 9, 15),
'subsample': trial.suggest_float('subsample', 0.6, 0.8),
'gamma': trial.suggest_float('gamma', 1e-8, 7e-5),
'learning_rate': trial.suggest_float('learning_rate', 0.04, 0.13),
'reg_alpha': trial.suggest_float('reg_alpha', 1e-4, 0.02),
'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 1e-5),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7),
'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.35, 0.6),
'min_child_weight': trial.suggest_int('min_child_weight', 7, 12)
}
model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0,
**params_xgb_optuna)
start = timer()
scores = -cross_val_score(model, train_features, train_label,
scoring='neg_root_mean_squared_error',
cv=kfolds)
run_time = timer() - start
rmse = np.mean(scores)
return rmse
Since the 300 trials did not complete, we can load the saved pickle file for the study using joblib and continue training to complete the 300 trials.
study = joblib.load('XGB_Optuna_300_GPU_CV_wb.pkl')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_300_GPU_CV_wb.pkl'):
study = joblib.load('XGB_Optuna_300_GPU_CV_wb.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=18, callbacks=[wandbc])
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
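As an alternative to pickling the study, Optuna can persist trials in a relational database and resume them with load_if_exists. A minimal sketch (the SQLite path and study name are hypothetical):
study = optuna.create_study(study_name='usedCars_xgb300',
                            storage='sqlite:///usedCars_xgb300.db',
                            direction='minimize',
                            load_if_exists=True)
study.optimize(objective, n_trials=18, callbacks=[wandbc])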
Results from the Hyperparameter Search
¶
Let's now extract the trial number, rmse and hyperparameter values into a pandas.DataFrame and sort with the lowest error first.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'rmse'}, inplace=True)
trials_df.rename(columns={'params_colsample_bylevel': 'colsample_bylevel'},
inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
inplace=True)
trials_df.rename(columns={'params_gamma': 'gamma'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_weight': 'min_child_weight'},
inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_reg_alpha': 'reg_alpha'}, inplace=True)
trials_df.rename(columns={'params_reg_lambda': 'reg_lambda'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
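Since this rename block recurs for every study, it can be condensed: Optuna prefixes tuned hyperparameters with params_, so stripping that prefix handles them all at once. A small helper sketch:
def tidy_trials(study):
    # Rename the bookkeeping columns, then strip the 'params_' prefix
    df = study.trials_dataframe().rename(columns={'number': 'iteration',
                                                  'value': 'rmse'})
    df.columns = [c.replace('params_', '') for c in df.columns]
    return df.sort_values('rmse', ascending=True)

trials_df = tidy_trials(study)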
Let's utilize plot_optimization_history, which shows the scores from all trials as well as the best score so far at each point. This search did not contain extreme outliers for the objective value, so it can be useful for examining the study output.
fig = optuna.visualization.plot_optimization_history(study)
fig.show()
Next, we can utilize plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search, and plot_contour to plot the parameter relationships as a contour plot.
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()
fig = optuna.visualization.plot_contour(study, params=['min_child_weight',
'max_depth',
'learning_rate',
'gamma'])
fig.show()
Next, we can use plot_slice to compare the objective value against the individual parameters.
fig = optuna.visualization.plot_slice(study)
fig.show()
Higher values of the colsample_bylevel, max_depth, reg_alpha and subsample parameters demonstrated lower loss for the objective value, while lower values of the learning_rate, n_estimators, reg_lambda and colsample_bytree parameters resulted in lower loss. It is difficult to determine whether lower or higher values of the gamma and min_child_weight parameters resulted in lower loss for this study.
Let's now visualize the parameter importances by utilizing the plot_param_importances feature from the Optuna package.
fig = optuna.visualization.plot_param_importances(study)
fig.show()
The learning_rate parameter is clearly the most important with a value of 0.88, while the next most important parameter, max_depth, has a value of 0.03. CatBoost demonstrated that learning_rate was the most important parameter during the Optuna trials, while LightGBM demonstrated max_depth. XGBoost agrees with CatBoost in suggesting learning_rate is the most important parameter.
Let's now use plot_edf to visualize the empirical distribution function.
fig = optuna.visualization.plot_edf(study)
fig.show()
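The EDF reports, for each error level x, the fraction of trials that achieved an RMSE of at most x, so a curve that rises steeply at low values indicates a well-behaved search. A hand-rolled equivalent, assuming numpy and matplotlib are imported as elsewhere in the notebook:
vals = np.sort(trials_df['rmse'].values)
edf = np.arange(1, len(vals) + 1) / len(vals)  # fraction of trials <= each RMSE
plt.step(vals, edf, where='post')
plt.xlabel('RMSE'); plt.ylabel('Fraction of trials')
plt.show()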
Let's now examine the distributions of the quantitative hyperparameters that were assessed during the search.
for i, hpo in enumerate(trials_df.columns):
if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
'duration', 'n_estimators', 'state']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
Higher values of the colsample_bylevel, max_depth, reg_alpha and subsample parameters demonstrated lower loss for the objective value, while lower values of the colsample_bytree, gamma, learning_rate, min_child_weight and reg_lambda parameters resulted in lower loss.
Let's now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 4, figsize=(20,5))
i = 0
axs = axs.flatten()
for i, hpo in enumerate(['learning_rate', 'gamma', 'colsample_bylevel',
'colsample_bytree']):
sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
Over the study duration, the colsample_bytree and learning_rate parameters decreased while the others did not show a clear trend.
Next, we can examine if there were any trends regarding the regularization parameters over the trials.
fig, axs = plt.subplots(1, 2, figsize = (20,5))
i = 0
for i, hpo in enumerate(['reg_alpha', 'reg_lambda']):
sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
Over the trials, the reg_alpha parameter increased while reg_lambda decreased.
Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit it using the training data and then save it as a .pkl file. Then we can evaluate the model metrics for both the train and test sets using MAE, MSE, RMSE and R².
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
best_model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
verbosity=0,
**params)
best_model.fit(train_features, train_label)
Pkl_Filename = 'XGB_Optuna_trials300_GPU_CV_wb.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for XGBoost HPO 300 GPU CV trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(train_label, y_train_pred),
mean_absolute_error(test_label, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred),
mean_squared_error(test_label, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred, squared=False),
mean_squared_error(test_label, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(train_label, y_train_pred),
r2_score(test_label, y_test_pred)))
Compared to the baseline model, there are lower values for MAE, MSE and RMSE and a higher R² for both the train and test sets after hyperparameter tuning. There were also lower MAE, MSE and RMSE and higher R² for both sets compared to the metrics from using 10-fold cross validation for the Hyperopt trials.
We can also evaluate the MSE on the test set and determine when this was achieved:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
Model Explanations
¶
SHAP (SHapley Additive exPlanations)
¶
Using the training set, let's summarize the effects of all the features.
shap.initjs()
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(train_features)
shap.summary_plot(shap_values, train_features);
Then we can use the test set and summarize the effects of all the features.
shap.initjs()
shap_values = explainer.shap_values(test_features)
shap.summary_plot(shap_values, test_features);
Model Metrics with ELI5
¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
X_test1 = pd.DataFrame(test_features, columns=test_features.columns)
perm_importance = PermutationImportance(best_model,
random_state=seed_value).fit(test_features,
test_label)
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test1.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn
which returns an explanation of the estimator parameters (weights).
from eli5.formatters.as_dataframe import format_as_dataframe
explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
The horsepower feature is again the most important weight (0.189968), followed by year (0.083167), mileage (0.077648), width (0.077522) and height (0.034736).
XGBoost Set Params Same GPU
¶
The notebook utilizing the GPU for XGBoost is here and the one using RAPIDS is here. All of the W & B parameters with loss for these experiments can be examined here.
To evaluate the difference in runtime between cross validation and a train/test split, let's utilize the same GPU, a Quadro RTX 4000. Let's first set up the environment, read the data, define the label and features, create dummy variables for the categorical variables and create a DMatrix for the GPU.
!pip install --upgrade pip
!pip install --upgrade -q wandb
!pip install xgboost==1.5.2
!pip install optuna
!pip install plotly
!pip install eli5
!pip install shap
import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')
seed_value = 42
os.environ['usedCars_xgbGPU'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
print('\n')
!nvidia-smi
import pandas as pd
import xgboost as xgb
trainDF = pd.read_csv('usedCars_trainSet.csv', low_memory=False)
testDF = pd.read_csv('usedCars_testSet.csv', low_memory=False)
train_label = trainDF[['price']]
test_label = testDF[['price']]
train_features = trainDF.drop(columns = ['price'])
test_features = testDF.drop(columns = ['price'])
train_features = pd.get_dummies(train_features, drop_first=True)
test_features = pd.get_dummies(test_features, drop_first=True)
dtrain = xgb.DMatrix(data=train_features, label=train_label)
dtest = xgb.DMatrix(data=test_features, label=test_label)
Cross Validation
¶
Let's start with 3-fold cross validation where the results are logged with W & B.
200 Trials 3-Fold Cross Validation
¶
import joblib
import optuna
from optuna import Trial
optuna.logging.set_verbosity(optuna.logging.WARNING)
from optuna.integration.wandb import WeightsAndBiasesCallback
from timeit import default_timer as timer
import wandb
wandb.login()
wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
                'group': 'optuna_xgbGPU_setParams',
                'save_code': 'False',
                'notes': 'xgbGPU_setParams_RTX4000_3CV_noCUML',
                'tags': 'xgbGPU_setParams_RTX4000_3CV_noCUML'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a XGBoostRegressor model.
"""
joblib.dump(study, 'XGB_Optuna_GPU_setParams_RTX4000_3CV_wb.pkl')
params_xgb_optuna = {
'booster': 'gbtree',
'gpu_id': 0,
'tree_method': 'gpu_hist',
'gamma': 4.5,
'learning_rate': 0.04,
'reg_alpha': 0.01,
'scale_pos_weight': 1,
'use_label_encoder': False,
'random_state': seed_value,
'verbosity': 0,
'n_estimators': trial.suggest_int('n_estimators', 600, 800, step=100),
'max_depth': trial.suggest_int('max_depth', 14, 15),
'subsample': trial.suggest_float('subsample', 0.78, 0.8, step=0.02),
'reg_lambda': trial.suggest_float('reg_lambda', 5.5, 7.5, step=1),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7,
step=0.1),
'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.4, 0.6,
step=0.1),
'min_child_weight': trial.suggest_int('min_child_weight', 7, 9, step=1)
}
start = timer()
cv_results = xgb.cv(params_xgb_optuna, dtrain, nfold=3,
metrics='rmse', as_pandas=True)
run_time = timer() - start
rmse = cv_results['test-rmse-mean'].values[-1]
return rmse
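One caveat: the native xgb.cv API reads the number of boosting rounds from its num_boost_round argument; an 'n_estimators' key in the params dict belongs to the sklearn wrapper and may be ignored here. A hypothetical variant of the call above that forwards it explicitly, with optional early stopping:
# Pop the sklearn-style key and pass it to the native API instead.
n_rounds = params_xgb_optuna.pop('n_estimators')
cv_results = xgb.cv(params_xgb_optuna, dtrain, nfold=3,
                    num_boost_round=n_rounds,
                    early_stopping_rounds=50,
                    metrics='rmse', as_pandas=True)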
from datetime import datetime, timedelta
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_GPU_setParams_RTX4000_3CV_wb.pkl'):
study = joblib.load('XGB_Optuna_GPU_setParams_RTX4000_3CV_wb.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200, callbacks=[wandbc])
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
This took a little more than one hour. Now we can examine the study components.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'rmse'}, inplace=True)
trials_df.rename(columns={'params_colsample_bylevel': 'colsample_bylevel'},
inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
inplace=True)
trials_df.rename(columns={'params_gamma': 'gamma'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_weight': 'min_child_weight'},
inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_reg_alpha': 'reg_alpha'}, inplace=True)
trials_df.rename(columns={'params_reg_lambda': 'reg_lambda'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
for i, hpo in enumerate(trials_df.columns):
if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
'duration', 'n_estimators', 'state']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
best_model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
gpu_id=0,
gamma=4.5,
learning_rate=0.04,
reg_alpha=0.01,
scale_pos_weight=1,
use_label_encoder=False,
verbosity=0,
**params)
best_model.fit(train_features, train_label)
Pkl_Filename = 'XGB_Optuna_trials_GPU_setParams_RTX4000_3CV_wb.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for XGBoost HPO GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(train_label, y_train_pred),
mean_absolute_error(test_label, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred),
mean_squared_error(test_label, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred, squared=False),
mean_squared_error(test_label, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(train_label, y_train_pred),
r2_score(test_label, y_test_pred)))
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
200 Trials 5-Fold Cross Validation
¶
Let's now try using 5-fold cross validation where the results are logged with W & B.
wandb.login()
wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
'group': 'optuna_xgbGPU_setParams',
'save_code': 'False',
'notes': 'xgbGPU_setParams_RTX4000_5CV_noCUML',
'tags': 'xgbGPU_setParams_RTX4000_5CV_noCUML'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a XGBoostRegressor model.
"""
joblib.dump(study, 'XGB_Optuna_GPU_setParams_RTX4000_5CV_wb.pkl')
params_xgb_optuna = {
'booster': 'gbtree',
'gpu_id': 0,
'tree_method': 'gpu_hist',
'gamma': 4.5,
'learning_rate': 0.04,
'reg_alpha': 0.01,
'scale_pos_weight': 1,
'use_label_encoder': False,
'random_state': seed_value,
'verbosity': 0,
'n_estimators': trial.suggest_int('n_estimators', 600, 800, step=100),
'max_depth': trial.suggest_int('max_depth', 14, 15),
'subsample': trial.suggest_float('subsample', 0.78, 0.8, step=0.02),
'reg_lambda': trial.suggest_float('reg_lambda', 5.5, 7.5, step=1),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7,
step=0.1),
'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.4, 0.6,
step=0.1),
'min_child_weight': trial.suggest_int('min_child_weight', 7, 9, step=1)
}
start = timer()
cv_results = xgb.cv(params_xgb_optuna, dtrain, nfold=5,
metrics='rmse', as_pandas=True)
run_time = timer() - start
rmse = cv_results['test-rmse-mean'].values[-1]
print('- Trial RMSE:', rmse)
return rmse
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_GPU_setParams_RTX4000_5CV_wb.pkl'):
study = joblib.load('XGB_Optuna_GPU_setParams_RTX4000_5CV_wb.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200, callbacks=[wandbc])
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
This took over nine minutes longer than 3-fold cross validation. Now we can examine the study components.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'rmse'}, inplace=True)
trials_df.rename(columns={'params_colsample_bylevel': 'colsample_bylevel'},
inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
inplace=True)
trials_df.rename(columns={'params_gamma': 'gamma'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_weight': 'min_child_weight'},
inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_reg_alpha': 'reg_alpha'}, inplace=True)
trials_df.rename(columns={'params_reg_lambda': 'reg_lambda'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
for i, hpo in enumerate(trials_df.columns):
if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
'duration', 'n_estimators', 'state']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
best_model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
gpu_id=0,
gamma=4.5,
learning_rate=0.04,
reg_alpha=0.01,
scale_pos_weight=1,
use_label_encoder=False,
verbosity=0,
**params)
best_model.fit(train_features, train_label)
Pkl_Filename = 'XGB_Optuna_trials_GPU_setParams_RTX4000_5CV_wb.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for XGBoost HPO GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(train_label, y_train_pred),
mean_absolute_error(test_label, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred),
mean_squared_error(test_label, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred, squared=False),
mean_squared_error(test_label, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(train_label, y_train_pred),
r2_score(test_label, y_test_pred)))
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
200 Trials 10-Fold Cross Validation
¶
Let's now try using 10-fold cross validation where the results are logged with W & B.
wandb.login()
wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
'group': 'optuna_xgbGPU_setParams',
'save_code': 'False',
'notes': 'xgbGPU_setParams_RTX4000_10CV_noCUML',
'tags': 'xgbGPU_setParams_RTX4000_10CV_noCUML'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a XGBoostRegressor model.
"""
joblib.dump(study, 'XGB_Optuna_GPU_setParams_RTX4000_10CV_wb.pkl')
params_xgb_optuna = {
'booster': 'gbtree',
'gpu_id': 0,
'tree_method': 'gpu_hist',
'gamma': 4.5,
'learning_rate': 0.04,
'reg_alpha': 0.01,
'scale_pos_weight': 1,
'use_label_encoder': False,
'random_state': seed_value,
'verbosity': 0,
'n_estimators': trial.suggest_int('n_estimators', 600, 800, step=100),
'max_depth': trial.suggest_int('max_depth', 14, 15),
'subsample': trial.suggest_float('subsample', 0.78, 0.8, step=0.02),
'reg_lambda': trial.suggest_float('reg_lambda', 5.5, 7.5, step=1),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7,
step=0.1),
'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.4, 0.6,
step=0.1),
'min_child_weight': trial.suggest_int('min_child_weight', 7, 9, step=1)
}
start = timer()
cv_results = xgb.cv(params_xgb_optuna, dtrain, nfold=10,
metrics='rmse', as_pandas=True)
run_time = timer() - start
rmse = cv_results['test-rmse-mean'].values[-1]
print('- Trial RMSE:', rmse)
return rmse
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_GPU_setParams_RTX4000_10CV_wb.pkl'):
study = joblib.load('XGB_Optuna_GPU_setParams_RTX4000_10CV_wb.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200, callbacks=[wandbc])
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
This took over 20 minutes longer than 3-fold cross validation and 10 minutes longer than 5-fold cross validation. Now we can examine the study components.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'rmse'}, inplace=True)
trials_df.rename(columns={'params_colsample_bylevel': 'colsample_bylevel'},
inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
inplace=True)
trials_df.rename(columns={'params_gamma': 'gamma'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_weight': 'min_child_weight'},
inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_reg_alpha': 'reg_alpha'}, inplace=True)
trials_df.rename(columns={'params_reg_lambda': 'reg_lambda'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
for i, hpo in enumerate(trials_df.columns):
if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
'duration', 'n_estimators', 'state']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
best_model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
gpu_id=0,
gamma=4.5,
learning_rate=0.04,
reg_alpha=0.01,
scale_pos_weight=1,
use_label_encoder=False,
verbosity=0,
**params)
best_model.fit(train_features, train_label)
Pkl_Filename = 'XGB_Optuna_trials_GPU_setParams_RTX4000_10CV_wb.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for XGBoost HPO GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(train_label, y_train_pred),
mean_absolute_error(test_label, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred),
mean_squared_error(test_label, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred, squared=False),
mean_squared_error(test_label, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(train_label, y_train_pred),
r2_score(test_label, y_test_pred)))
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
Train/Test
¶
Let's now try using a train/test split to compare with cross validation, where the results are logged with W & B.
200 Trials Train/Test DMatrix
¶
wandb.login()
wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
'group': 'optuna_xgbGPU_setParams',
'save_code': 'False',
'notes': 'xgbGPU_setParams_RTX4000_trainTest_DMatrix_noCUML',
'tags': 'xgbGPU_setParams_RTX4000_trainTest_DMatrix_noCUML'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a XGBoostRegressor model.
"""
joblib.dump(study,
'XGB_Optuna_GPU_setParams_RTX4000_trainTest_DMatrix_wb.pkl')
params_xgb_optuna = {
'objective': 'reg:squarederror',
'booster': 'gbtree',
'gpu_id': 0,
'tree_method': 'gpu_hist',
'gamma': 4.5,
'learning_rate': 0.04,
'reg_alpha': 0.01,
'scale_pos_weight': 1,
'use_label_encoder': False,
'random_state': seed_value,
'verbosity': 0,
'verbose': False,
'n_estimators': trial.suggest_int('n_estimators', 600, 800, step=100),
'max_depth': trial.suggest_int('max_depth', 14, 15),
'subsample': trial.suggest_float('subsample', 0.78, 0.8, step=0.02),
'reg_lambda': trial.suggest_float('reg_lambda', 5.5, 7.5, step=1),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7,
step=0.1),
'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.4, 0.6,
step=0.1),
'min_child_weight': trial.suggest_int('min_child_weight', 7, 9, step=1)
}
start = timer()
model = xgb.train(params_xgb_optuna, dtrain)
run_time = timer() - start
y_pred_val = model.predict(dtest)
rmse = mean_squared_error(test_label, y_pred_val,
squared=False)
print('Trial RMSE:', rmse)
return rmse
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_GPU_setParams_RTX4000_trainTest_DMatrix_wb.pkl'):
study = joblib.load('XGB_Optuna_GPU_setParams_RTX4000_trainTest_DMatrix_wb.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200, callbacks=[wandbc])
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
This took four minutes less than 3-fold cross validation, 10 minutes less than 5-fold cross validation and 20 minutes less than 10-fold cross validation. This is potential justification for not using cross validation with a smaller set size. Now we can examine the study components.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'rmse'}, inplace=True)
trials_df.rename(columns={'params_colsample_bylevel': 'colsample_bylevel'},
inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
inplace=True)
trials_df.rename(columns={'params_gamma': 'gamma'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_weight': 'min_child_weight'},
inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_reg_alpha': 'reg_alpha'}, inplace=True)
trials_df.rename(columns={'params_reg_lambda': 'reg_lambda'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
for i, hpo in enumerate(trials_df.columns):
if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
'duration', 'n_estimators', 'state']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
best_model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
gpu_id=0,
gamma=4.5,
learning_rate=0.04,
reg_alpha=0.01,
scale_pos_weight=1,
use_label_encoder=False,
verbosity=0,
**params)
best_model.fit(train_features, train_label)
Pkl_Filename = 'XGB_Optuna_trials_GPU_setParams_RTX4000_trainTest_DMatrix_wb.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for XGBoost HPO GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(train_label, y_train_pred),
mean_absolute_error(test_label, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred),
mean_squared_error(test_label, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred, squared=False),
mean_squared_error(test_label, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(train_label, y_train_pred),
r2_score(test_label, y_test_pred)))
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
Cross Validation - cuML
¶
Let's go back to Paperspace to utilize the RAPIDS Docker container. We can install and import the dependencies, set the options and seed, set up the local CUDA cluster for Dask, query the client for all connected workers and read the data.
!pip install --upgrade pip
!pip install xgboost==1.5.2
!pip install --upgrade -q wandb
!pip install optuna
!pip install plotly
!pip install shap
!pip install eli5
import os
import warnings
import random
import numpy as np
import cupy
from cupy import asnumpy
import urllib.request
from contextlib import contextmanager
import time
from timeit import default_timer as timer
warnings.filterwarnings('ignore')
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
seed_value = 42
os.environ['usedCars_xgbGPU'] = str(seed_value)
random.seed(seed_value)
cupy.random.seed(seed_value)
np.random.seed(seed_value)
@contextmanager
def timed(name):
t0 = time.time()
yield
t1 = time.time()
print('..%-24s: %8.4f' % (name, t1 - t0))
print('\n')
!nvidia-smi
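The timed context manager defined above is a small convenience for ad hoc wall-clock measurements; a trivial usage sketch:
# Hypothetical demo: any block wrapped in timed() reports its duration.
with timed('demo'):
    time.sleep(0.5)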
import dask
from dask.distributed import Client, wait
from dask.diagnostics import ProgressBar
from dask_cuda import LocalCUDACluster
import dask_cudf
cluster = LocalCUDACluster(threads_per_worker=1, ip='',
dashboard_address='8082')
c = Client(cluster)
workers = c.has_what().keys()
n_workers = len(workers)
c
import cudf
import cuml
import xgboost as xgb
trainDF = cudf.read_csv('usedCars_trainSet.csv', low_memory=False)
testDF = cudf.read_csv('usedCars_testSet.csv', low_memory=False)
train_label = trainDF[['price']]
test_label = testDF[['price']]
train_features = trainDF.drop(columns = ['price'])
test_features = testDF.drop(columns = ['price'])
train_features = cudf.get_dummies(train_features)
test_features = cudf.get_dummies(test_features)
train_features = train_features.to_cupy()
train_label = train_label.to_cupy()
test_features = test_features.to_cupy()
test_label = test_label.to_cupy()
train_features = train_features.astype('float32')
train_label = train_label.astype('int32')
test_features = test_features.astype('float32')
test_label = test_label.astype('int32')
dtrain = xgb.DMatrix(data=train_features, label=train_label)
dtest = xgb.DMatrix(data=test_features, label=test_label)
200 Trials 3-fold Cross Validation cuML
¶
Let's start with 3-fold cross validation where the results are logged with W & B.
import optuna
from optuna import Trial
optuna.logging.set_verbosity(optuna.logging.WARNING)
import wandb
from optuna.integration.wandb import WeightsAndBiasesCallback
wandb.login()
wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
'group': 'optuna_xgbGPU_setParams',
'save_code': 'False',
'notes': 'xgbGPU_setParams_RTX4000_3CV_cuml',
'tags': 'xgbGPU_setParams_RTX4000_3CV_cuml'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
from xgboost import XGBRegressor, plot_importance
import joblib
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a XGBoostRegressor model.
"""
joblib.dump(study, 'XGB_Optuna_GPU_setParams_RTX4000_3CV_cuml_wb.pkl')
params_xgb_optuna = {
'booster': 'gbtree',
'gpu_id': 0,
'tree_method': 'gpu_hist',
'gamma': 4.5,
'learning_rate': 0.04,
'reg_alpha': 0.01,
'scale_pos_weight': 1,
'use_label_encoder': False,
'random_state': seed_value,
'verbosity': 0,
'n_estimators': trial.suggest_int('num_boost_round', 600, 800,
step=100),
'max_depth': trial.suggest_int('max_depth', 14, 15),
'subsample': trial.suggest_float('subsample', 0.78, 0.8, step=0.02),
'reg_lambda': trial.suggest_float('reg_lambda', 5.5, 7.5, step=1),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7,
step=0.1),
'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.4, 0.6,
step=0.1),
'min_child_weight': trial.suggest_int('min_child_weight', 7, 9, step=1)
}
start = timer()
cv_results = xgb.cv(params_xgb_optuna, dtrain, nfold=3,
metrics='rmse')
run_time = timer() - start
rmse = asnumpy(cv_results['test-rmse-mean'].values[-1])
print('- Trial RMSE:', rmse)
return rmse
from datetime import datetime, timedelta
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_GPU_setParams_RTX4000_3CV_cuml_wb.pkl'):
study = joblib.load('XGB_Optuna_GPU_setParams_RTX4000_3CV_cuml_wb.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200, callbacks=[wandbc])
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
This took a little more than one hour, just like what was observed with 3-fold cross validation without cuML. Now we can examine the study components.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'rmse'}, inplace=True)
trials_df.rename(columns={'params_colsample_bylevel': 'colsample_bylevel'},
inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
inplace=True)
trials_df.rename(columns={'params_gamma': 'gamma'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_weight': 'min_child_weight'},
inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_reg_alpha': 'reg_alpha'}, inplace=True)
trials_df.rename(columns={'params_reg_lambda': 'reg_lambda'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
import matplotlib.pyplot as plt
import seaborn as sns
for i, hpo in enumerate(trials_df.columns):
if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
'duration', 'n_estimators', 'state']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
import pickle
best_model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
gpu_id=0,
gamma=4.5,
learning_rate=0.04,
reg_alpha=0.01,
scale_pos_weight=1,
use_label_encoder=False,
verbosity=0,
**params)
best_model.fit(train_features, train_label)
Pkl_Filename = 'XGB_Optuna_trials_GPU_setParams_RTX4000_3CV_cuml_wb.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for XGBoost HPO GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(asnumpy(train_label), asnumpy(y_train_pred)),
mean_absolute_error(asnumpy(test_label), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(train_label), asnumpy(y_train_pred)),
mean_squared_error(asnumpy(test_label), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(train_label), asnumpy(y_train_pred),
squared=False),
mean_squared_error(asnumpy(test_label), asnumpy(y_test_pred),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(asnumpy(train_label), asnumpy(y_train_pred)),
r2_score(asnumpy(test_label), asnumpy(y_test_pred))))
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(asnumpy(test_label),
asnumpy(y_test_pred))))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
200 Trials 5-fold Cross Validation cuML
¶
Let's now try using 5-fold cross validation where the results are logged with W & B.
wandb.login()
wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
'group': 'optuna_xgbGPU_setParams',
'save_code': 'False',
'notes': 'xgbGPU_setParams_RTX4000_5CV_cuml',
'tags': 'xgbGPU_setParams_RTX4000_5CV_cuml'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a XGBoostRegressor model.
"""
joblib.dump(study, 'XGB_Optuna_GPU_setParams_RTX4000_5CV_cuml_wb.pkl')
params_xgb_optuna = {
'booster': 'gbtree',
'gpu_id': 0,
'tree_method': 'gpu_hist',
'gamma': 4.5,
'learning_rate': 0.04,
'reg_alpha': 0.01,
'scale_pos_weight': 1,
'use_label_encoder': False,
'random_state': seed_value,
'verbosity': 0,
'n_estimators': trial.suggest_int('num_boost_round', 600, 800,
step=100),
'max_depth': trial.suggest_int('max_depth', 14, 15),
'subsample': trial.suggest_float('subsample', 0.78, 0.8, step=0.02),
'reg_lambda': trial.suggest_float('reg_lambda', 5.5, 7.5, step=1),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7,
step=0.1),
'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.4, 0.6,
step=0.1),
'min_child_weight': trial.suggest_int('min_child_weight', 7, 9, step=1)
}
start = timer()
cv_results = xgb.cv(params_xgb_optuna, dtrain, nfold=5,
metrics='rmse')
run_time = timer() - start
rmse = asnumpy(cv_results['test-rmse-mean'].values[-1])
print('- Trial RMSE:', rmse)
return rmse
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_GPU_setParams_RTX4000_5CV_cuml_wb.pkl'):
study = joblib.load('XGB_Optuna_GPU_setParams_RTX4000_5CV_cuml_wb.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200, callbacks=[wandbc])
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
This took over nine minutes longer than 3-fold cross validation, and a little longer than 5-fold cross validation without cuML. Now we can examine the study components.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'rmse'}, inplace=True)
trials_df.rename(columns={'params_colsample_bylevel': 'colsample_bylevel'},
inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
inplace=True)
trials_df.rename(columns={'params_gamma': 'gamma'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_weight': 'min_child_weight'},
inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_reg_alpha': 'reg_alpha'}, inplace=True)
trials_df.rename(columns={'params_reg_lambda': 'reg_lambda'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
for i, hpo in enumerate(trials_df.columns):
if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
'duration', 'n_estimators', 'state']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
best_model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
gpu_id=0,
gamma=4.5,
learning_rate=0.04,
reg_alpha=0.01,
scale_pos_weight=1,
use_label_encoder=False,
verbosity=0,
**params)
best_model.fit(train_features, train_label)
Pkl_Filename = 'XGB_Optuna_trials_GPU_setParams_RTX4000_5CV_cuml_wb.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for XGBoost HPO GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(asnumpy(train_label), asnumpy(y_train_pred)),
mean_absolute_error(asnumpy(test_label), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(train_label), asnumpy(y_train_pred)),
mean_squared_error(asnumpy(test_label), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(train_label), asnumpy(y_train_pred),
squared=False),
mean_squared_error(asnumpy(test_label), asnumpy(y_test_pred),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(asnumpy(train_label), asnumpy(y_train_pred)),
r2_score(asnumpy(test_label), asnumpy(y_test_pred))))
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(asnumpy(test_label),
asnumpy(y_test_pred))))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
200 Trials 10-fold Cross Validation cuML
¶
Let's now try using 10-fold cross validation where the results are logged with W & B.
wandb.login()
wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
'group': 'optuna_xgbGPU_setParams',
'save_code': 'False',
'notes': 'xgbGPU_setParams_RTX4000_10CV_cuml',
'tags': 'xgbGPU_setParams_RTX4000_10CV_cuml'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a XGBoostRegressor model.
"""
joblib.dump(study, 'XGB_Optuna_GPU_setParams_RTX4000_10CV_cuml_wb.pkl')
params_xgb_optuna = {
'booster': 'gbtree',
'gpu_id': 0,
'tree_method': 'gpu_hist',
'gamma': 4.5,
'learning_rate': 0.04,
'reg_alpha': 0.01,
'scale_pos_weight': 1,
'use_label_encoder': False,
'random_state': seed_value,
'verbosity': 0,
'n_estimators': trial.suggest_int('num_boost_round', 600, 800,
step=100),
'max_depth': trial.suggest_int('max_depth', 14, 15),
'subsample': trial.suggest_float('subsample', 0.78, 0.8, step=0.02),
'reg_lambda': trial.suggest_float('reg_lambda', 5.5, 7.5, step=1),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7,
step=0.1),
'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.4, 0.6,
step=0.1),
'min_child_weight': trial.suggest_int('min_child_weight', 7, 9, step=1)
}
start = timer()
cv_results = xgb.cv(params_xgb_optuna, dtrain, nfold=10,
metrics='rmse')
run_time = timer() - start
rmse = asnumpy(cv_results['test-rmse-mean'].values[-1])
print('- Trial RMSE:', rmse)
return rmse
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_GPU_setParams_RTX4000_10CV_cuml_wb.pkl'):
study = joblib.load('XGB_Optuna_GPU_setParams_RTX4000_10CV_cuml_wb.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200, callbacks=[wandbc])
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
This took over 30 minutes longer than 3-fold cross validation and 20 minutes longer than 5-fold cross validation. Now we can examine the study components.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'rmse'}, inplace=True)
trials_df.rename(columns={'params_colsample_bylevel': 'colsample_bylevel'},
inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
inplace=True)
trials_df.rename(columns={'params_gamma': 'gamma'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_weight': 'min_child_weight'},
inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_reg_alpha': 'reg_alpha'}, inplace=True)
trials_df.rename(columns={'params_reg_lambda': 'reg_lambda'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
for i, hpo in enumerate(trials_df.columns):
if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
'duration', 'n_estimators', 'state']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
best_model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
gpu_id=0,
gamma=4.5,
learning_rate=0.04,
reg_alpha=0.01,
scale_pos_weight=1,
use_label_encoder=False,
verbosity=0,
**params)
best_model.fit(train_features, train_label)
Pkl_Filename = 'XGB_Optuna_trials_GPU_setParams_RTX4000_10CV_cuml_wb.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for XGBoost HPO GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(asnumpy(train_label), asnumpy(y_train_pred)),
mean_absolute_error(asnumpy(test_label), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(train_label), asnumpy(y_train_pred)),
mean_squared_error(asnumpy(test_label), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(train_label), asnumpy(y_train_pred),
squared=False),
mean_squared_error(asnumpy(test_label), asnumpy(y_test_pred),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(asnumpy(train_label), asnumpy(y_train_pred)),
r2_score(asnumpy(test_label), asnumpy(y_test_pred))))
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(asnumpy(test_label),
asnumpy(y_test_pred))))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
Train/Test - cuML
¶
200 Trials Train/Test DMatrix
¶
Let's now try using a train/test split to compare with cross validation, where the results are logged with W & B.
wandb.login()
wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
'group': 'optuna_xgbGPU_setParams',
'save_code': 'False',
'notes': 'xgbGPU_setParams_RTX4000_trainTest_DMatrix_cuml',
'tags': 'xgbGPU_setParams_RTX4000_trainTest_DMatrix_cuml'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a XGBoostRegressor model.
"""
joblib.dump(study, 'XGB_Optuna_GPU_setParams_RTX4000_trainTest_DMatrix_cuml_wb.pkl')
params_xgb_optuna = {
'objective': 'reg:squarederror',
'booster': 'gbtree',
'gpu_id': 0,
'tree_method': 'gpu_hist',
'gamma': 4.5,
'learning_rate': 0.04,
'reg_alpha': 0.01,
'scale_pos_weight': 1,
'use_label_encoder': False,
'random_state': seed_value,
'verbosity': 0,
'verbose': False,
'n_estimators': trial.suggest_int('n_estimators', 600, 800, step=100),
'max_depth': trial.suggest_int('max_depth', 14, 15),
'subsample': trial.suggest_float('subsample', 0.78, 0.8, step=0.02),
'reg_lambda': trial.suggest_float('reg_lambda', 5.5, 7.5, step=1),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7,
step=0.1),
'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.4, 0.6,
step=0.1),
'min_child_weight': trial.suggest_int('min_child_weight', 7, 9, step=1)
}
start = timer()
model = xgb.train(params_xgb_optuna, dtrain)
run_time = timer() - start
y_pred_val = model.predict(dtest)
rmse = mean_squared_error(asnumpy(test_label), asnumpy(y_pred_val),
squared=False)
print('Trial RMSE:', rmse)
return rmse
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_GPU_setParams_RTX4000_trainTest_DMatrix_cuml_wb.pkl'):
study = joblib.load('XGB_Optuna_GPU_setParams_RTX4000_trainTest_DMatrix_cuml_wb.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200, callbacks=[wandbc])
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
This took seven minutes less than 3-fold cross validation, 17 minutes less than 5-fold, and 35 minutes less than 10-fold. This is potential justification for skipping cross validation when the set size is smaller and the GPU DMatrix can be leveraged. Now we can examine the study components.
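For context, here is a minimal sketch of how a GPU-resident DMatrix can be built from the cuDF/cuML data used above; the variable names follow the ones used in this section, but the construction itself is an assumption, not the notebook's exact code.
import xgboost as xgb
# Sketch: build DMatrix objects directly from GPU data so gpu_hist can
# train without host<->device copies.
dtrain = xgb.DMatrix(train_features, label=train_label)
dtest = xgb.DMatrix(test_features, label=test_label)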
trials_df = study.trials_dataframe()
trials_df = trials_df.rename(columns={'number': 'iteration',
                                      'value': 'rmse',
                                      'params_colsample_bylevel': 'colsample_bylevel',
                                      'params_colsample_bytree': 'colsample_bytree',
                                      'params_gamma': 'gamma',
                                      'params_learning_rate': 'learning_rate',
                                      'params_max_depth': 'max_depth',
                                      'params_min_child_weight': 'min_child_weight',
                                      'params_n_estimators': 'n_estimators',
                                      'params_reg_alpha': 'reg_alpha',
                                      'params_reg_lambda': 'reg_lambda',
                                      'params_subsample': 'subsample'})
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
for i, hpo in enumerate(trials_df.columns):
    if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
                   'duration', 'n_estimators', 'state']:
        plt.figure(figsize=(14,6))
        if hpo != 'loss':
            sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
            plt.legend(loc=0)
            plt.title('{} Distribution'.format(hpo))
            plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
            plt.tight_layout()
            plt.show()
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
best_model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
gpu_id=0,
gamma=4.5,
learning_rate=0.04,
reg_alpha=0.01,
scale_pos_weight=1,
use_label_encoder=False,
verbosity=0,
**params)
best_model.fit(train_features, train_label)
Pkl_Filename = 'XGB_Optuna_trials_GPU_setParams_RTX4000_trainTest_DMatrix_cuml_wb.pkl'
with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_model, file)
print('\nModel Metrics for XGBoost HPO GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(asnumpy(train_label), asnumpy(y_train_pred)),
mean_absolute_error(asnumpy(test_label), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(train_label), asnumpy(y_train_pred)),
mean_squared_error(asnumpy(test_label), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(train_label), asnumpy(y_train_pred),
squared=False),
mean_squared_error(asnumpy(test_label), asnumpy(y_test_pred),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(asnumpy(train_label), asnumpy(y_train_pred)),
r2_score(asnumpy(test_label), asnumpy(y_test_pred))))
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(asnumpy(test_label),
asnumpy(y_test_pred))))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
Set Params GPU Test - cuML: LightGBM 500 Trials
¶
To utilize the GPU for LightGBM with RAPIDS and a CUDA compiler on Paperspace, the notebook describing how the environment was set up is here. The steps to complete this are the following:
In a Linux terminal,
- sudo apt update && sudo apt upgrade
- sudo apt-get -y install cmake
- sudo apt-get install -y -qq libboost-all-dev libboost-system-dev libboost-filesystem-dev
- sudo apt-get install --no-install-recommends nvidia-opencl-dev opencl-headers
- sudo apt install glibc-source -y
In a notebook,
- git clone --recursive https://github.com/microsoft/LightGBM
- %cd /notebooks/LightGBM
- !sudo mkdir build
- !sudo mkdir -p /etc/OpenCL/vendors && echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd
- %cd /notebooks/LightGBM/build
- !sudo cmake -DUSE_GPU=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ ..
- !make -j4
- !sudo apt-get -y install python3-pip
- !sudo -H pip install setuptools joblib wandb optuna datetime plotly eli5 shap -U
- %cd /notebooks/LightGBM/python-package/
- !python setup.py install --gpu --opencl-include-dir=/usr/local/cuda/include/
- !pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
- !pip install cuml-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
Once completed, the dependencies were imported.
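As a rough sketch, the imports after this build might look something like the following; the exact dependency list is an assumption based on the packages installed above.
# Sketch of the post-build imports; assumes the pip installs above succeeded.
import os
import random
import warnings
import joblib
import numpy as np
import optuna
import wandb
import cudf
import lightgbm as lgb
from sklearn.metrics import mean_squared_error
warnings.filterwarnings('ignore')
print('LightGBM version:', lgb.__version__)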
The project results can be found here. The runtime lengths using different GPUs were the following:
- RTX A4000 - 2:21:28
- Quadro RTX 5000 - 2:50:45
- Quadro RTX 4000 - 3:00:01
- Quadro P5000 - 4:39:26
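The LightGBM objective itself is not reproduced here; under the assumption that it mirrors the XGBoost objective above, the GPU-specific parameters would look something like this sketch, not the notebook's exact configuration.
# Sketch: GPU-enabled LightGBM parameters; assumes the -DUSE_GPU=1 build above.
params_lgb = {
    'objective': 'regression',
    'metric': 'rmse',
    'device': 'gpu',
    'gpu_platform_id': 0,
    'gpu_device_id': 0,
    'verbosity': -1,
    'random_state': seed_value,
}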
Random Forest
¶
Random Forests utilize an ensemble of decision trees with a technique called bagging, or bootstrap aggregation, which consists of selecting random samples with replacement as the training data for each model, fitting a decision tree to each subset and then aggregating the predictions. This approach reduces overfitting and model variance since more randomness and combinations of features are evaluated, whereas a single decision tree uses a specified set of features and has a high likelihood of overfitting.
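To make the bagging idea concrete, here is a toy sketch (not from the notebook) that bootstraps the training rows, fits one decision tree per sample, and averages the predictions; the synthetic data is only meant to show the mechanics.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = X[:, 0] * 3 + rng.normal(scale=0.5, size=500)
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))   # sample rows with replacement
    tree = DecisionTreeRegressor(max_features='sqrt')
    tree.fit(X[idx], y[idx])
    trees.append(tree)
# Aggregate: average the per-tree predictions
y_pred = np.mean([t.predict(X) for t in trees], axis=0)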
Let's go back to Paperspace to utilize the RAPIDS Docker container. We can install/import the dependencies, set the options and seed, set up the local CUDA cluster for Dask, query the client for all connected workers, and read the data.
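Since that setup code is not reproduced here, a minimal sketch of the Dask-CUDA cluster just described might look like this; the cluster arguments and file names are placeholders, not the notebook's exact code.
import cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
cluster = LocalCUDACluster()            # one worker per visible GPU
client = Client(cluster)
n_workers = len(client.scheduler_info()['workers'])
print(client)                           # query the connected workers
trainDF = cudf.read_csv('usedcars_trainSet.csv')  # hypothetical file names
testDF = cudf.read_csv('usedcars_testSet.csv')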
from cuml.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_streams=1, random_state=seed_value)
rf.fit(X_train, y_train)
Pkl_Filename = 'RF_Baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
    pickle.dump(rf, file)
print('\nModel Metrics for RF Baseline')
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(y_train.to_numpy(), y_train_pred.to_numpy()),
mean_absolute_error(y_test.to_numpy(), y_test_pred.to_numpy())))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train.to_numpy(), y_train_pred.to_numpy()),
mean_squared_error(y_test.to_numpy(), y_test_pred.to_numpy())))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train.to_numpy(), y_train_pred.to_numpy(),
squared=False),
mean_squared_error(y_test.to_numpy(), y_test_pred.to_numpy(),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train.to_numpy(), y_train_pred.to_numpy()),
r2_score(y_test.to_numpy(), y_test_pred.to_numpy())))
Optuna 300 Trials Train/Test
¶
Now, we can define a function train_and_eval to set up the train/test sets and define the model/parameters that will be tested during the search:
- bootstrap: If True, each tree in the forest is built on a bootstrapped sample with replacement. Default=True.
- n_estimators: Number of trees in the forest. Default=100.
- max_depth: Maximum tree depth. Default=16.
- max_leaves: Maximum leaf nodes per tree. Default=-1.
- min_samples_leaf: The minimum number of samples (rows) in each leaf node. Default=1.
- min_samples_split: The minimum number of samples required to split an internal node. Default=2.
- n_bins: Maximum number of bins used by the split algorithm per feature. Default=128.
- n_streams: Number of parallel streams used for forest building. Default=4.
Then the model will be trained using the training set, predictions made with the test set, and the model evaluated for the rmse between the test labels and the predictions; a sketch of this function follows below.
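The Random Forest version of train_and_eval is not reproduced here, so this is a minimal sketch under the assumption that it mirrors the KNR version shown later (global trainDF/testDF frames, cuDF dummies, RMSE on the test set).
from timeit import default_timer as timer
import cudf
from cuml.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
def train_and_eval(X_param, y_param, n_estimators=100, max_depth=16,
                   max_leaves=-1, min_samples_leaf=1, min_samples_split=2,
                   n_bins=128, bootstrap=True):
    # Sketch: rebuild the train/test sets from the global trainDF/testDF
    X_train = cudf.get_dummies(trainDF.drop('price', axis=1)).astype('float32')
    y_train = trainDF['price'].astype('int32')
    X_test = cudf.get_dummies(testDF.drop('price', axis=1)).astype('float32')
    y_test = testDF['price'].astype('int32')
    model = RandomForestRegressor(n_estimators=n_estimators,
                                  max_depth=max_depth,
                                  max_leaves=max_leaves,
                                  min_samples_leaf=min_samples_leaf,
                                  min_samples_split=min_samples_split,
                                  n_bins=n_bins,
                                  bootstrap=bootstrap,
                                  n_streams=1,
                                  random_state=seed_value)
    start = timer()
    model.fit(X_train, y_train)
    run_time = timer() - start
    y_pred = model.predict(X_test)
    score = mean_squared_error(y_test.to_numpy(), y_pred.to_numpy(),
                               squared=False)
    return score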
We can now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 4, figsize=(20,5))
axs = axs.flatten()
for i, hpo in enumerate(['max_leaves', 'min_samples_leaf', 'min_samples_split',
                         'n_bins']):
    sns.regplot(x='iteration', y=hpo, data=trials_df, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
Over the study, the max_leaves, min_samples_split and n_bins parameters increased while min_samples_leaf decreased.
Let's now use plot_param_importances to visualize the parameter importances.
fig = optuna.visualization.plot_param_importances(study)
fig.show()
min_samples_leaf was the most important hyperparameter, followed by max_leaves.
We can also use plot_edf to examine the empirical distribution.
fig = optuna.visualization.plot_edf(study)
fig.show()
Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit it using the training data and then save it as a .pkl file. Then we can evaluate the model metrics for both the train and the test sets using the MAE, MSE, RMSE and R².
params = study.best_params
params['random_state'] = seed_value
params
import pickle
from sklearn.metrics import mean_absolute_error, r2_score
best_model = RandomForestRegressor(n_streams=1, **params)
best_model.fit(X_train, y_train)
Pkl_Filename = 'RF_Optuna_trials300_GPU_paramsHi2.pkl'
with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_model, file)
print('\nModel Metrics for RF HPO 300 GPU trials')
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(y_train.to_numpy(), y_train_pred.to_numpy()),
mean_absolute_error(y_test.to_numpy(), y_test_pred.to_numpy())))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train.to_numpy(), y_train_pred.to_numpy()),
mean_squared_error(y_test.to_numpy(), y_test_pred.to_numpy())))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train.to_numpy(), y_train_pred.to_numpy(),
squared=False),
mean_squared_error(y_test.to_numpy(), y_test_pred.to_numpy(),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train.to_numpy(), y_train_pred.to_numpy()),
r2_score(y_test.to_numpy(), y_test_pred.to_numpy())))
Tuning did not prove to be worthwhile, given the higher errors and lower R² compared to the baseline model.
We can also evaluate the MSE on the test set and determine when this was achieved:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(y_test.to_numpy(),
y_test_pred.to_numpy())))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
K-Nearest Neighbor Regression (KNR)
¶
K-Nearest Neighbor Regression predicts the target of a sample as the average of the targets of its k closest neighbors in feature space.
from cuml.neighbors import KNeighborsRegressor
knr = KNeighborsRegressor()
knr.fit(X_train, y_train)
Pkl_Filename = 'KNR_Baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
    pickle.dump(knr, file)
print('\nModel Metrics for KNR Baseline')
y_train_pred = knr.predict(X_train)
y_test_pred = knr.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(y_train.to_numpy(), y_train_pred.to_numpy()),
mean_absolute_error(y_test.to_numpy(), y_test_pred.to_numpy())))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train.to_numpy(), y_train_pred.to_numpy()),
mean_squared_error(y_test.to_numpy(), y_test_pred.to_numpy())))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train.to_numpy(), y_train_pred.to_numpy(),
squared=False),
mean_squared_error(y_test.to_numpy(), y_test_pred.to_numpy(),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train.to_numpy(), y_train_pred.to_numpy()),
r2_score(y_test.to_numpy(), y_test_pred.to_numpy())))
Optuna 1000 Train/Test Trials
¶
Now, we can define a function train_and_eval to set up the train/test sets and define the model/parameters that will be tested during the search:
- n_neighbors: Default number of neighbors to query. Default=5.
- metric: Distance metric to use. Options are 'euclidean', 'manhattan', 'chebyshev', 'minkowski'.
Then the model will be trained using the training set, predictions made with the test set, and the model evaluated for the rmse between the test labels and the predictions.
from timeit import default_timer as timer
from sklearn.metrics import mean_squared_error
def train_and_eval(X_param, y_param, n_neighbors=10,
                   metric='euclidean', verbose=False):
    """
    Partition data into train/test sets, train and evaluate the model
    for the given parameters.

    Params
    ______
    X_param: DataFrame.
        The data to use for training and testing.
    y_param: Series.
        The label for training.

    Returns
    _______
    score: RMSE of the fitted model.
    """
    # The train/test sets are rebuilt from the global trainDF/testDF frames
    X_train, y_train = trainDF.drop('price',
                                    axis=1), trainDF['price'].astype('int32')
    X_train = cudf.get_dummies(X_train)
    X_train = X_train.astype('float32')
    X_test, y_test = testDF.drop('price',
                                 axis=1), testDF['price'].astype('int32')
    X_test = cudf.get_dummies(X_test)
    X_test = X_test.astype('float32')
    model = KNeighborsRegressor(n_neighbors=n_neighbors,
                                metric=metric,
                                verbose=verbose)
    start = timer()
    model.fit(X_train, y_train)
    run_time = timer() - start
    y_pred = model.predict(X_test)
    score = mean_squared_error(y_test.to_numpy(), y_pred.to_numpy(),
                               squared=False)
    return score
print('Score with default parameters : ', train_and_eval(X_train, y_train))
As with other hyperparameter optimizations, an objective function with the study .pkl and the range of the parameters to be tested needs to be defined.
import optuna
from optuna import Trial
optuna.logging.set_verbosity(optuna.logging.WARNING)
import joblib
def objective(trial, X_param, y_param):
    joblib.dump(study, 'KNR_Optuna_1000_GPU.pkl')
    n_neighbors = trial.suggest_int('n_neighbors', 2, 100)
    metric = trial.suggest_categorical('metric', ['euclidean', 'manhattan',
                                                  'chebyshev', 'minkowski'])
    score = train_and_eval(X_param, y_param,
                           n_neighbors=n_neighbors,
                           metric=metric,  # pass the sampled metric through
                           verbose=False)
    return score
with timed('dask_optuna'):
    start_time = datetime.now()
    print('%-20s %s' % ('Start Time', start_time))
    if os.path.isfile('KNR_Optuna_1000_GPU.pkl'):
        study = joblib.load('KNR_Optuna_1000_GPU.pkl')
    else:
        study = optuna.create_study(sampler=optuna.samplers.TPESampler(),
                                    study_name=study_name,
                                    direction='minimize')
    with parallel_backend('dask'):
        study.optimize(lambda trial: objective(trial, X_train, y_train),
                       n_trials=1000,
                       n_jobs=n_workers)
    end_time = datetime.now()
    print('%-20s %s' % ('Start Time', start_time))
    print('%-20s %s' % ('End Time', end_time))
    print(str(timedelta(seconds=(end_time-start_time).seconds)))
    print('\n')
    print('Number of finished trials:', len(study.trials))
    print('Best trial:', study.best_trial.params)
    print('Lowest RMSE', study.best_value)
Let's now extract the trial number, rmse and hyperparameter values into a pandas.DataFrame and sort with the lowest error first.
trials_df = study.trials_dataframe()
trials_df = trials_df.rename(columns={'number': 'iteration',
                                      'value': 'rmse',
                                      'params_metric': 'metric',
                                      'params_n_neighbors': 'n_neighbors'})
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
Let's utilize plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search and plot_slice to compare the objective value and individual parameters.
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()
fig = optuna.visualization.plot_slice(study)
fig.show()
It is difficult to determine which metric performs the best from this slice plot, but a model that uses a lower number of n_neighbors results in lower objective values.
Let's now utilize plot_edf to visualize the empirical distribution of the study.
fig = optuna.visualization.plot_edf(study)
fig.show()
Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit it using the training data and then save it as a .pkl file. Then we can evaluate the model metrics for both the train and the test sets using the MAE, MSE, RMSE and R².
params = study.best_params
params
best_model = KNeighborsRegressor(n_neighbors=2, metric='manhattan')
best_model.fit(X_train, y_train)
Pkl_Filename = 'KNR_Optuna_trials1000_GPU_man.pkl'
with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_model, file)
print('\nModel Metrics for KNR HPO 1000 GPU trials - Manhattan')
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(y_train.to_numpy(), y_train_pred.to_numpy()),
mean_absolute_error(y_test.to_numpy(), y_test_pred.to_numpy())))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train.to_numpy(), y_train_pred.to_numpy()),
mean_squared_error(y_test.to_numpy(), y_test_pred.to_numpy())))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train.to_numpy(), y_train_pred.to_numpy(),
squared=False),
mean_squared_error(y_test.to_numpy(), y_test_pred.to_numpy(),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train.to_numpy(), y_train_pred.to_numpy()),
r2_score(y_test.to_numpy(), y_test_pred.to_numpy())))
Tuning proved to be worthwhile, given the lower errors and higher R² for both the train/test sets compared to the baseline model.
We can also evaluate the MSE on the test set and determine when this was achieved:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(y_test.to_numpy(),
y_test_pred.to_numpy())))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])