The price of a used car is an intriguing subject. Once a brand-new car leaves the dealership, its value instantly drops well below the price the buyer paid. During the 2022 car market, the price of a used vehicle increased significantly due to the shortage of semiconductors and microchips, which are needed to manufacture new vehicles, and the rising rate of inflation.
Data¶
The data was retrieved from Kaggle. The individual who posted the data used a web crawler on the CarGurus inventory in September 2020. The data contains a vast amount of information about the physical and mechanical components of cars, and this feature-rich set can provide insight into what might contribute to differences in the value of a vehicle. Do the location where a vehicle is being sold and the age of the vehicle affect the price? How could these factors contribute to building models to predict the price of vehicles? Which features of used vehicles should be used in predictive modeling?
Preprocessing¶
The code used for preprocessing and EDA can be found in the Used Cars GitHub repository. First, the environment is set up with the dependencies, library options, the seed for reproducibility, and the location of the project directory. Then the data is read, duplicate observations are dropped and empty strings are replaced with NA.
import os
import random
import numpy as np
import warnings
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
plt.rcParams['figure.figsize'] = (20, 10)
plt.rcParams['font.size'] = 25
sns.set_context('paper', rc={'font.size': 25, 'axes.titlesize': 35,
'axes.labelsize': 30})
seed_value = 42
os.environ['UsedCars_PreprocessEDA'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
df = pd.read_csv('used_cars_data.csv', low_memory=False)
df = df.drop_duplicates()
df = df.replace(r'^\s*$', np.nan, regex=True)
print('\nSample observations:')
display(df.head())
The initial data contains various data types, missing data and complex strings mixing text and numbers. String variables can contribute to high dimensionality, so extracting the numerical information from them might allow for more granularity and insight.
Data Quality¶
Let's define a function to examine the quality of the data by assessing the absolute number/percentage of missing observations, the types of data present, and the number of unique values for the quantitative and qualitative features. This can then be sorted by missingness to determine which variables should be dropped and which categorical variables contain a large number of groups.
def data_type_quality_table(df):
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    var_type = df.dtypes
    unique_count = df.nunique()
    val_table = pd.concat([mis_val, mis_val_percent, var_type, unique_count],
                          axis=1)
    val_table_ren_columns = val_table.rename(
        columns={0: 'Missing Count', 1: '% Missing', 2: 'Data Type',
                 3: 'Number Unique'})
    val_table_ren_columns = val_table_ren_columns[
        val_table_ren_columns.iloc[:,1] >= 0].sort_values(
        '% Missing', ascending=False).round(1)
    print('The data has ' + str(df.shape[0]) + ' rows and '
          + str(df.shape[1]) + ' columns.\n')
    return val_table_ren_columns
print('\nData Quality Report:')
display(data_type_quality_table(df))
Data Relevancy and Missingness¶
Now we can first drop the variables that will not be useful. A URL will not provide much insight, nor will the ones containing *id, which act as identification on a numerical level. The information contained within description might reveal specifics about the vehicles, but since there are 2,519,325 different strings, it might not be worth the amount of preprocessing.
The initial data contains a high degree of missing data: 16 variables have >= 47% missing values, and the next highest missing percentage is under 20%. We can filter to the features with under 20% missing and then select the complete cases for each of the remaining features to maximize the number of observations in the set.
drop_columns = ['main_picture_url', 'vin', 'trimId', 'trim_name',
'sp_id', 'sp_name', 'listing_id', 'description']
df.drop(columns=drop_columns, inplace=True)
df['year'] = df['year'].astype('int')
df = df.loc[:, df.isnull().mean() < 0.20]
df = df[df.major_options.notna() & df.mileage.notna()
& df.engine_displacement.notna()
& df.transmission_display.notna() & df.seller_rating.notna()
& df.engine_cylinders.notna()
& df.back_legroom.notna() & df.wheel_system.notna()
& df.interior_color.notna()
& df.body_type.notna() & df.exterior_color.notna()
& df.franchise_make.notna() & df.torque.notna()
& df.highway_fuel_economy.notna() & df.city_fuel_economy.notna()
& df.power.notna()]
Time¶
Vehicle manufacturers release new models each year with newer technologies and accessories, so time is generally an important aspect to consider when building predictive models from historical data. Let's examine when each vehicle was made by generating a graph of the number of vehicles in each year. Older vehicles probably cost less than newer vehicles, but this might not always be true considering there are vintage cars.
a = sns.countplot(x='year', data=df).set_title('Count of Postings in Each Year')
plt.yticks(fontsize=20)
plt.xticks(fontsize=20)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show();
print('Average year: ' + str(round(df['year'].mean())))
The majority of the vehicles were manufactured in 2020, with the average year being 2019. The oldest vehicle is from 1984, and there are vehicles from 2021, a model year ahead of when the crawler pulled this data. Let's filter the data to year greater than or equal to 2016 based on the counts and the distribution. Then, using a function with autopct arguments to convert counts to percentages, we can graph the percentage of each model year out of the total listings in this subset.
df = df.loc[df['year'] >= 2016]
def func(pct):
    return "{:1.1f}%".format(pct)
plt.figure(figsize=(20, 7))
plt.title('Percentage of Listings in Each Year', fontsize=25)
df['year'].value_counts().plot.pie(autopct=lambda pct: func(pct),
textprops={'fontsize': 15})
plt.tight_layout()
plt.show();
57% of the vehicle listings occurred in 2020. Due to potential seasonality effects, let's examine the month of the year when a vehicle was listed by converting listed_date to a monthly feature and examining the top 10 year-months by count.
pd.set_option('display.max_rows', 10)
df['listed_date'] = pd.to_datetime(df['listed_date'])
df['listed_date_yearMonth'] = df['listed_date'].dt.to_period('M')
print('\nNumber of listings \nin each Year-Month:')
print('\n')
print(df.listed_date_yearMonth.value_counts(sort=True).nlargest(10).to_string())
August contained the most listings, while April contained the least, in close proximity to December, January and May. Most of the listings occurred during June - September of 2020, so the set was filtered to this date range, which condenses the graph significantly. Given that this time period falls within the first year of the COVID-19 pandemic, various confounding factors need to be considered, so let's utilize a narrow range of dates around the summer months.
df = df.loc[(df['listed_date_yearMonth'] >= '2020-06')]
sns.countplot(x='listed_date_yearMonth',
data=df,
order=df['listed_date_yearMonth'].value_counts().index).set_title('Number Postings in Each Year-Month')
plt.yticks(fontsize=25)
plt.xticks(fontsize=25)
plt.tight_layout()
plt.show();
Location of Listings¶
The initial data contains 4,687 cities and 8,237 dealer zip codes, so this is potentially from all over the United States, where the cost of living varies by location. Some states have higher taxes and some cost more to live in. We want to be sure that price is not guided by location and that it doesn't confound the modeling results. Therefore, let's examine the length of dealer_zip, because both a shortened zip code and the complete zip code are present.
df['dealer_zip_length'] = df['dealer_zip'].str.len()
print('\nNumber of different \nlengths of zipcodes:')
print('\n')
print(df.dealer_zip_length.value_counts(ascending=False).to_string())
The majority of the zip codes are the shortened version, so let's subset to the shortened zip codes and convert them to numerical values. Then we can create a temporary pandas.DataFrame with only the zip code, find the unique zip codes, and process them to a list and then a dataframe. There are 4,898 unique zip codes, so the data does not appear to be localized to a small section of the country.
df = df[df.dealer_zip_length == 5]
df = df.drop(['dealer_zip_length'], axis=1)
df['dealer_zip'] = df['dealer_zip'].astype('int64')
df1 = df.dealer_zip
df1 = df1.unique().tolist()
df1 = pd.DataFrame(df1)
df1.columns = ['dealer_zip_unique']
print('- Number of unique zipcodes:', df1.shape[0])
The initial set contains the city but not the state where the vehicle was listed. We can use the temporary dataframe of unique zip codes to find the state and city corresponding to each zip code by defining two separate functions using a SearchEngine that looks up state and city by zip code. This temporary dataframe then has three columns, where the queried City and dealer_zip_unique can be used to merge the new features with a right join. Duplicates and unnecessary columns can be removed, as well as the zip codes that were not found by the SearchEngine.
from uszipcode import SearchEngine
search = SearchEngine(simple_zipcode=False)
def zcode_state(x):
    """ Use the zipcode to find the state """
    return search.by_zipcode(x).state

def zcode_city(x):
    """ Use the zipcode to find the city """
    return search.by_zipcode(x).city
df1['State'] = df1['dealer_zip_unique'].fillna(0).astype(int).apply(zcode_state)
df1['City'] = df1['dealer_zip_unique'].fillna(0).astype(int).apply(zcode_city)
df = pd.merge(df, df1, how='right', left_on=['city', 'dealer_zip'],
right_on=['City', 'dealer_zip_unique'])
df = df.drop_duplicates()
df = df.drop(['dealer_zip_unique', 'city', 'City', 'dealer_zip'], axis=1)
df = df[df['mileage'].notna()]
With the addition of the State where the vehicles were listed, we can now examine the top ten locations and consider the possibility of differences in the standard of living due to location. Out of the top 10 states with the highest number of listings in this set, Texas has the most while Michigan has the least. Georgia, North Carolina and Michigan don't seem to fit with the northern states or the largest states like Texas, California and Florida, so let's filter to the states with the seven highest counts of listings. This results in a set with over 400,000 observations and 42 columns.
print('\nNumber of listings \n in each US state:')
print('\n')
print(df.State.value_counts(sort=True).nlargest(10).to_string())
print('\n')
df1 = df['State'].value_counts().index[:7]
df = df[df['State'].isin(df1)]
print('- Dimensions after filtering states with the 7 highest counts of listings:',
df.shape)
del df1
Dimensionality of Categorical Variables¶
The problem of high dimensionality is an important consideration when modeling because the higher the number of groups per categorical variable, the larger the size of the set after converting to dummy or one hot encoded variables. This requires more memory (RAM and/or GPU memory) for the computations, resulting in longer runtimes and higher cost. Examining the current object and bool variables (a quick check is sketched below), major_options and the color related features (interior and exterior_color) contain thousands of different groups, so let's drop these features. The transmission_display variable contains 44 groups while transmission has four levels, so we'll use the broader variable for the characteristics of the transmission. Next, engine_cylinders and engine_type appear to contain the same information and contain over 30 groups, so let's also remove them from the set. franchise_make contains the manufacturer name, but we can identify the components of the car rather than bias the analysis by using a provided name. Lastly, franchise_dealer is boolean with True/False, so let's also remove this feature.
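As a minimal sketch of that check (assuming df is at its current state of processing), the group counts driving these decisions can be listed directly:
# Count the unique groups per qualitative feature; large counts flag the
# variables that would explode the set after one hot encoding
group_counts = df.select_dtypes(include=['object', 'bool']).nunique()
print(group_counts.sort_values(ascending=False))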
drop_columns = ['major_options', 'interior_color', 'exterior_color',
'transmission_display', 'engine_cylinders', 'engine_type',
'franchise_dealer', 'franchise_make']
df.drop(columns=drop_columns, inplace=True)
Some of the categorical features contain information about the size (in), volume (gal) or number (seats) of various vehicle characteristics, so removing the repetitive text parts of these variables will allow them to become continuous variables instead of increasing the dimensionality of the set. Now we can examine a few observations of the variables and decide how to process them.
df[['back_legroom', 'wheelbase', 'width', 'length', 'height',
'fuel_tank_volume', 'front_legroom', 'maximum_seating']].head()
Let's use a lambda function on a series that string splits the numbers from the text components, leaving only the numbers, and designate the columns this applies to. This creates missing data, so we can then remove the rows containing missing values that resulted from transforming the previous string variables to numeric. Now we can see that these columns contain only numerical data.
extract_num_from_catVar = lambda series: series.str.split().str[0].astype(float)
columns = ['back_legroom', 'wheelbase', 'width', 'length', 'height',
'fuel_tank_volume', 'front_legroom', 'maximum_seating']
df[columns] = df[columns].replace({',': '',
'--': np.nan}).apply(extract_num_from_catVar)
df = df[df.back_legroom.notna() & df.front_legroom.notna()
& df.fuel_tank_volume.notna() & df.maximum_seating.notna()
& df.height.notna() & df.length.notna() & df.wheelbase.notna()
& df.width.notna()]
df.reset_index(drop=True, inplace=True)
df[['back_legroom', 'wheelbase', 'width', 'length', 'height',
'fuel_tank_volume', 'front_legroom', 'maximum_seating']].head()
The torque and power features share a similar structure: a number followed by a text string, @, a number and another text string. This means similar string splitting at the same index locations can be used, with the remaining string components converted to np.nan. Let's first examine a sample of both variables:
df[['torque', 'power']].head()
To process torque, first remove the ',' and split the string. Then create a pandas.DataFrame containing only the float information, name the columns and reset the index. Now, concatenate the new feature columns with the main table, drop the original torque feature and lastly remove any missing data created during the string split. The same methods can be used to process power, which results in a numerical horsepower. A few samples of the new features can then be viewed:
df1 = df.torque.str.replace(',', '').str.split().str[0:4:3]
df1 = pd.DataFrame([[np.nan, np.nan] if type(i).__name__ == 'float'
else np.asarray(i).astype('float') for i in df1])
df1.columns = ['torque_new', 'torque_rpm']
df1.reset_index(drop=True, inplace=True)
df = pd.concat([df, df1], axis=1)
df = df.drop(['torque'], axis=1)
df = df[df.torque_rpm.notna()]
df.rename(columns={'torque_new': 'torque'}, inplace=True)
del df1
df1 = df.power.str.replace(',', '').str.split().str[0:4:3]
df1 = pd.DataFrame([[np.nan, np.nan] if type(i).__name__ == 'float'
else np.asarray(i).astype('float') for i in df1])
df1.columns = ['horsepower_new', 'horsepower_rpm']
df1.reset_index(drop=True, inplace=True)
df = pd.concat([df, df1], axis=1)
df = df.drop(['horsepower', 'power'], axis=1)
df = df[df.horsepower_rpm.notna()]
df.rename(columns={'horsepower_new': 'horsepower'}, inplace=True)
del df1
df[['torque', 'horsepower']].head()
Examine Dependent Variable: Price¶
Using describe() generates more than the typical five summary statistics by adding the count and minimum/maximum values. The average is around 32,000 U.S. dollars. The standard deviation is around 15,000, so let's include up to 50,000 dollars in the set. When outliers are present, the mean can be affected, so we can examine the median as well, which is around 29,000. The proximity of the mean and median suggests the data is not highly skewed, but with a maximum price of 449,995 dollars, there definitely are outliers. Outlier testing methods could be applied here (a short sketch follows the summary statistics below), like the Z-score, which measures how many standard deviations a value is from the mean, or the Local Outlier Factor, which compares how close values are to the values of their neighbors (a local density), where a score greater than one signifies a lower density than the neighbors.
print('\nSummary statistics: price' + ('\n') + str(round(df['price'].describe()).to_string()))
print('\n')
print('- Median price: ' + str(round(df['price'].median())))
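As a hedged sketch of those outlier tests (not part of the original pipeline), the Z-score and Local Outlier Factor could be applied to price as follows:
from scipy import stats
from sklearn.neighbors import LocalOutlierFactor

# Z-score: flag prices more than three standard deviations from the mean
z = np.abs(stats.zscore(df['price']))
print('Z-score outliers:', (z > 3).sum())

# Local Outlier Factor: compares each observation's local density to that of
# its neighbors; fit_predict returns -1 for observations flagged as outliers
lof = LocalOutlierFactor(n_neighbors=20)
flags = lof.fit_predict(df[['price']])
print('LOF outliers:', (flags == -1).sum())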
Now, let's filter to the observations with a price less than or equal to $50,000, and examine the price across the locations where the vehicles were posted using the latitude vs. longitude features.
df = df.loc[df['price'] <= 50000.0]
plt.figure(figsize=(20, 7))
sns.histplot(x=df['price'],
kde=True).set_title('Distribution of Price <= $50,000');
plt.yticks(fontsize=25)
plt.xticks(fontsize=25)
plt.show();
print('- There are now ' + str(df.shape[0]) + ' observations.')
df.plot(kind='scatter', x='latitude', y='longitude', c='price', cmap='RdBu_r',
colorbar=True, sharex=False);
plt.title('Price: Latitude vs Longitude', fontsize=35, y=1.02);
plt.yticks(fontsize=25)
plt.xticks(fontsize=25)
plt.show()
drop_columns = ['latitude', 'longitude']
df.drop(columns=drop_columns, inplace=True)
Visually, there doesn't appear to be a large amount of lower prices (blue) localized to any of the US states in this set, but there might be a greater proportion of higher prices (dark red) in New York compared to Pennsylvania.
Visualizations: Categorical Variables¶
Now we can take the categorical features that have been reduced to a lower number of groups and see how price varies within the groups by using side-by-side boxplots.
df_cat = df.select_dtypes(include = 'object')
df_cat = df_cat.drop(['State', 'model_name', 'make_name', 'listing_color'],
axis=1)
plt.rcParams.update({'font.size': 20})
fig, ax = plt.subplots(3, 2, figsize=(20,17))
fig.suptitle('Categorical Variables vs. Price', fontsize=35, y=1.02)
for var, ax in zip(df_cat, ax.flatten()):
    sns.boxplot(x=var, y='price', data=df, ax=ax)
    ax.tick_params(axis='x', labelrotation=45, labelsize=15)
    ax.tick_params(axis='y', labelsize=15)
plt.tight_layout();
plt.show();
del df_cat, fig, ax
wheel_system and wheel_system_display contain the same information; the former uses acronyms and the latter has the full text spelled out. We will retain the feature with the most detail (wheel_system_display) so the components of this feature can be distinguished more easily.
The color of a vehicle tends to be an important factor when individuals purchase a different car. Let's investigate whether listing_color has any association with price.
sorted_nb = df.groupby(['listing_color'])['price'].median().sort_values()
plt.title('Listing Color vs. Price', fontsize=35, y=1.02)
sns.boxplot(x=df['listing_color'], y=df['price'], order=list(sorted_nb.index))
plt.tick_params(axis='x', labelrotation=45, labelsize=20)
plt.tick_params(axis='y', labelsize=20)
plt.tight_layout()
plt.show();
Now we can examine the number of listings with the different colors observed in the prior graph. Seven colors, including UNKNOWN, seem to make up the majority of the set. We can then filter to the observations with the seven highest counts for listing_color and then remove the listings with listing_color='UNKNOWN'.
print('\nNumber of listings \nwith vehicle color:')
print('\n')
print(df.listing_color.value_counts(ascending=False).to_string())
df1 = df['listing_color'].value_counts().index[:7]
df = df[df['listing_color'].isin(df1)]
df = df.loc[df['listing_color'] != 'UNKNOWN']
del df1
Since the name of the car manufacturer exists in the set, let's examine the median price of vehicles distinguished by the manufacturer, and compare that to the overall median price ($28,710). This might provide some insight about the features and components of the vehicles with higher prices.
df1 = df.groupby('make_name')['price'].median().reset_index()
df1.rename(columns={'price': 'make_medianPrice'}, inplace=True)
print('\nMedian price of the manufacturer:')
print('\n')
df1 = df1.sort_values('make_medianPrice', ascending=False)
print(df1.iloc[0:10])
del df1
The vehicles with the highest median price are mostly from luxury manufacturers, which makes logical sense. RAM makes trucks, which are larger vehicles containing larger engines; they cost more to produce, so they must sell for more to make a profit.
After the exploratory data analysis (EDA) of the qualitative features, we can now drop variables due to similarity, high dimensionality or lack of usefulness for modeling.
df = df.drop(['make_name', 'model_name', 'wheel_system',
'seller_rating', 'listed_date'], axis=1)
df = df.drop_duplicates()
Visualizations: Quantitative Variables¶
Histograms can be used to examine the distributions of numerical data. Visually, left or right skewed data can be observed when the majority of the data is on the right or left side of the histogram, respectively.
df_num = df.select_dtypes(include = ['float64', 'int64'])
cols = df_num.columns.tolist()
cols = cols[-8:] + cols[:-8]
df_num = df_num[cols]
sns.set(font_scale=2)
df_num.hist(figsize=(30,30), bins=50, xlabelsize=16, ylabelsize=17)
plt.tight_layout()
plt.show();
Correlations¶
To determine which quantitative variables are correlated with each other, we can create a correlation matrix with method='spearman', a nonparametric approach that ranks the features, since the prior visualization showed some features with non-Gaussian distributions. Then we can plot the correlation matrix without any thresholds.
To increase the granularity of which features are correlated with price, let's extract the features that contain a correlation coefficient greater than 0.5 or less than -0.5 and examine the magnitude.
corr = df_num.corr(method='spearman')
sns.set(font_scale=1.2)
plt.rcParams['figure.figsize'] = (20, 10)
sns.heatmap(corr, cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
annot=False, square=True)
plt.title('Correlation Matrix with Spearman rho', fontsize=35, y=1.02)
plt.tight_layout()
plt.show();
df_num_corr = df_num.drop('price', axis=1).apply(
    lambda x: x.corr(df_num.price, method='spearman'))
quant_features_list = df_num_corr[abs(df_num_corr) > 0.50].sort_values(ascending=False)
print('- There are {} moderately correlated values with Price:\n{}'.format(
len(quant_features_list), quant_features_list))
The horsepower feature has the highest positive correlation with price while highway_fuel_economy has the highest negative.
Now let's filter the correlation matrix using thresholds of rho <= -0.5 or >= 0.5 to reduce the noise in the plot and to better decipher which features have higher correlation in the set.
sns.heatmap(corr[(corr >= 0.5) | (corr <= -0.5)],
cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
annot=True, annot_kws={'size': 9}, square=True);
plt.title('Correlation Matrix with Spearman rho >= 0.5 or <= -0.5',
fontsize=30, y=1.02)
plt.tight_layout()
plt.show();
When utilizing this threshold, the amount of content on the heatmap decreased. The most positively correlated features are highway_fuel_economy and city_fuel_economy with rho=0.94, followed by length and wheelbase with rho=0.93. The most negatively correlated features are city_fuel_economy and horsepower with rho=-0.85, as well as city_fuel_economy and fuel_tank_volume with rho=-0.85. These relationships might need to be considered regarding multicollinearity prior to modeling, depending on the method utilized.
Modeling¶
Now that the initial data has been processed to a final set, we can proceed with evaluating various modeling approaches. When testing linear models (lasso, ridge and elastic net regression) using the RAPIDS ecosystem, hyperparameter tuning with 1,000 trials each did not generate adequate model metrics, even after applying the variance inflation factor (VIF) to address multicollinearity (see the sketch below). Different scaling methods (MinMaxScaler and StandardScaler) were tested as well, and similar results were encountered using LinearSVR. This might be attributed to the solvers available in the version of cuml shipped with rapids:22.02. The notebook for the linear models is here and for LinearSVR here.
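A minimal sketch of that VIF screen, assuming df_num still holds the quantitative features from the EDA:
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute the VIF for each quantitative predictor; values above ~10 are a
# common rule of thumb for problematic multicollinearity
X_vif = sm.add_constant(df_num.drop('price', axis=1))
vif = pd.Series([variance_inflation_factor(X_vif.values, i)
                 for i in range(X_vif.shape[1])],
                index=X_vif.columns)
print(vif.drop('const').sort_values(ascending=False))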
Multi-Layer Perceptron (MLP) Regression¶
The MLP can be utilized for regression problems, as demonstrated in Multilayer perceptrons for classification and regression by Fionn Murtagh in 1991. It is derived from the perceptron, invented in the 1940s and implemented by Frank Rosenblatt in the 1950s with The Perceptron — A Perceiving and Recognizing Automaton. The MLP is an artificial neural network of connected neurons containing an input layer, the hidden layer(s) and the output layer, with a single output neuron for regression. The data is used as the input to the network and fed through the hidden layers one layer at a time, applying weights (connections to the next neuron) and biases (the threshold value needed to obtain the output) until the output layer. The output is then compared to the expected output, and the difference is the error. This error is propagated back through the network, layer by layer, with the weights updated depending on their effect on the error; this is the backpropagation algorithm (a toy example follows). It is repeated for all of the samples in the training data, where one cycle of updating the network is an epoch.
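As a toy illustration of a single backpropagation update (a hypothetical one-neuron linear model, not the architecture used below), the weight moves against the gradient of the error:
# One gradient descent step for a single neuron: y_hat = w*x + b
x, y = 2.0, 10.0            # one training sample and its expected output
w, b, lr = 0.5, 0.0, 0.01   # initial weight, bias and learning rate
y_hat = w * x + b           # forward pass
error = y_hat - y           # difference from the expected output
w -= lr * 2 * error * x     # backward pass: d(error**2)/dw = 2*error*x
b -= lr * 2 * error         # backward pass: d(error**2)/db = 2*error
print(w, b)                 # parameters nudged toward a lower error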
The notebook can be found here. Let's first set up the environment by installing keras-tuner for hyperparameter tuning the MLP, then import the dependencies and examine the versions of tensorflow and keras, the number of GPUs available, the CUDA version and which GPU will be used.
!pip install keras-tuner
import os
import random
import numpy as np
import warnings
import tensorflow as tf
warnings.filterwarnings('ignore')
print('\n')
print('TensorFlow version: {}'.format(tf.__version__))
print('Eager execution is: {}'.format(tf.executing_eagerly()))
print('Keras version: {}'.format(tf.keras.__version__))
print('Num GPUs Available: ', len(tf.config.list_physical_devices('GPU')))
print('\n')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
Now we can define a function to set the numpy, random and tensorflow seed as well as configure the session environment for reproducibility.
def init_seeds(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1,
                                            inter_op_parallelism_threads=1)
    os.environ['TF_CUDNN_DETERMINISTIC'] = 'True'
    os.environ['TF_DETERMINISTIC_OPS'] = 'True'
    sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(),
                                config=session_conf)
    tf.compat.v1.keras.backend.set_session(sess)
    return sess
init_seeds(seed=42)
The data can then be read from the data directory, partitioned into the features and the target, and the training and test sets created using test_size=0.2. These processed sets can be saved for later use with other ML algorithms so comparisons can be made. Next, we need to create dummy variables for the categorical variables since most algorithms require the model input to be numerical values. Neural networks tend to perform better when the input data is scaled, so let's use the StandardScaler that is fit on the training set and then applied to the test set.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
pd.set_option('display.max_columns', None)
df = pd.read_csv('usedCars_final.csv', low_memory=False)
X = df.drop(['price'], axis=1)
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
del X, y
train = pd.concat([y_train, X_train], axis=1)
train.to_csv('usedCars_trainSet.csv', index=False)
test = pd.concat([y_test, X_test], axis=1)
test.to_csv('usedCars_testSet.csv', index=False)
del train, test
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
print('Dimensions of X_train for input:', X_train.shape[1])
sc = StandardScaler()
X_train = pd.DataFrame(sc.fit_transform(X_train))
X_test = pd.DataFrame(sc.transform(X_test))
Baseline Model¶
It is always good to have a baseline model, which can function as a reference for downstream comparisons and suggest which parameter(s) need more tuning to generate better model metrics. Since the dependent variable price is quantitative, the metrics for regression-based problems should be used. Let's first set up where the logs and callbacks from the model are saved. We can specify the parameters for EarlyStopping, which monitors the val_loss with patience=5; if the validation loss does not improve after five epochs, training stops. We can also specify the ModelCheckpoint to monitor the mse and save only the weights with the lowest mse.
import datetime
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
%load_ext tensorboard
log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
filepath = 'MLP_weights_only_baseline_b4_HPO.h5'
checkpoint_dir = os.path.dirname(filepath)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
histogram_freq=1)
callbacks_list = [EarlyStopping(monitor='val_loss', patience=5),
ModelCheckpoint(filepath, monitor='mse',
save_best_only=True, mode='min'),
tensorboard_callback]
Next, we can define the baseline model architecture. Since there are 53 features after creating the dummy variables, this needs to be specified as the input_dim. Given this is a neural network, the architecture of the baseline model could be as simple as model = Sequential() then model.compile(). Let's try using roughly double the number of features for the first layer, decreasing sequentially by 10 neurons per layer.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
model = Sequential()
model.add(Dense(105, input_dim=53, kernel_initializer='normal',
activation='relu'))
model.add(Dense(95, kernel_initializer='normal', activation='relu'))
model.add(Dense(85, kernel_initializer='normal', activation='relu'))
model.add(Dense(75, kernel_initializer='normal', activation='relu'))
model.add(Dense(65, kernel_initializer='normal', activation='relu'))
model.add(Dense(55, kernel_initializer='normal', activation='relu'))
model.add(Dense(45, kernel_initializer='normal', activation='relu'))
model.add(Dense(35, kernel_initializer='normal', activation='relu'))
model.add(Dense(25, kernel_initializer='normal', activation='relu'))
model.add(Dense(15, kernel_initializer='normal', activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1))
opt = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='mae', metrics=['mse'], optimizer=opt)
model.summary()
Now the model can be trained by calling fit on the training set for 50 epochs (the iterations over the entire X_train and y_train), using batch_size=4 (the number of samples per gradient update), validation_split=0.2 (the fraction of the training data held out to evaluate the loss and the model metrics at the end of each epoch), and the specified callbacks from callbacks_list.
history = model.fit(X_train, y_train, epochs=50, batch_size=4,
validation_split=0.2, callbacks=callbacks_list)
Now, let's save the model if we need to reload it in the future and plot the model loss and val_loss over the training epochs.
import matplotlib.pyplot as plt
model.save('./MLP_batch4_50epochs_baseline_tf.h5', save_format='tf')
#filepath = 'MLP_weights_only_baseline_b4.h5'
#model = tf.keras.models.load_model('./MLP_batch4_50epochs_baseline_tf.h5')
#model.load_weights(filepath)
plt.title('Model Error for Price')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='val_loss')
plt.ylabel('Error [Price]')
plt.xlabel('Epoch')
plt.legend()
plt.show();
We can save the model losses in a pandas.DataFrame, and plot both the mae and mse from the train and validation sets over the epochs as well.
losses = pd.DataFrame(model.history.history)
losses.plot()
plt.title('Model Error for Price')
plt.ylabel('Error [Price]')
plt.xlabel('Epoch')
plt.legend(loc='upper right')
plt.show();
We can use model.predict to generate output predictions for the specified input samples, so let's predict using the training set, convert the predicted price to a pandas.DataFrame and examine the size to determine the chunk size for plotting. Let's graph the predicted vs. actual price for the training set.
pred_train = model.predict(X_train)
y_pred = pd.DataFrame(pred_train)
y_pred.shape
plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Training Set: Predicted Price vs Actual Price',
fontsize=25)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Price', pad=15, fontsize=20)
ax1.set_ylabel('Price', fontsize=20)
ax2.plot(y_train, color='blue')
ax2.set_title('Actual Price', pad=15, fontsize=20)
plt.show();
We can use model.predict on the test set, convert the predicted price to a pandas.DataFrame and graph the predicted vs. actual price for the test set.
pred_test = model.predict(X_test)
y_pred = pd.DataFrame(pred_test)
plt.rcParams['agg.path.chunksize'] = 1000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Price', pad=15, fontsize=20)
ax1.set_ylabel('Price', fontsize=20)
ax2.plot(y_test, color='blue')
ax2.set_title('Actual Price', pad=15, fontsize=20)
plt.show();
To evaluate if this MLP is effective at predicting the price of used car prices, metrics like the Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and the Coefficient of Determination (R²) can be utilized.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
print('Metrics: Train set')
print('MAE: %3f' % mean_absolute_error(y_train[:], pred_train[:]))
print('MSE: %3f' % mean_squared_error(y_train[:], pred_train[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_train[:], pred_train[:])))
print('R2: %3f' % r2_score(y_train[:], pred_train[:]))
print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y_test[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y_test[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_test[:], pred_test[:])))
print('R2: %3f' % r2_score(y_test[:], pred_test[:]))
The test set has higher values for each of the aforementioned metrics, but they are in close proximity to the training values, suggesting the model worked quite well and was not overfit. The baseline model with this architecture explains the variance quite well, but let's test if this can be improved.
Let's also examine how the predicted maximum, average and minimum prices compare to the actual maximum, average and minimum prices.
print('Maximum Price:', np.amax(y_test))
print('Predicted Max Price:', np.amax(pred_test))
print('\nAverage Price:', np.average(y_test))
print('Predicted Average Price:', np.average(pred_test))
print('\nMinimum Price:', np.amin(y_test))
print('Predicted Minimum Price:', np.amin(pred_test))
For the baseline model, the predicted maximum price is higher than the actual maximum price and the predicted minimum is higher than the actual minimum, while the predicted average is lower than the actual average price.
Hyperparameter Optimization¶
Let's first set up the callbacks for the TensorBoard. Then we can define a function build_model that will evaluate different model parameters during the hyperparameter tuning process. The parameters to test are:
- num_layers: 13 - 20 layers
- layer_size: 70 - 130 nodes using step=10
- learning_rate: 1 x 10^-1, 1 x 10^-2, 1 x 10^-3
The same 30% Dropout before the output layer as well as the same loss='mae' and metrics=['mse'] can be utilized to compare the aforementioned hyperparameters.
filepath = 'MLP_weights_only_b4_HPO3.h5'
checkpoint_dir = os.path.dirname(filepath)
callbacks = [TensorBoard(log_dir=log_folder,
histogram_freq=1,
write_graph=True,
write_images=True,
update_freq='epoch',
profile_batch=1,
embeddings_freq=1)]
callbacks_list = callbacks
def build_model(hp):
    model = tf.keras.Sequential()
    for i in range(hp.Int('num_layers', 13, 20)):
        model.add(tf.keras.layers.Dense(units=hp.Int('layer_size' + str(i),
                                                     min_value=70,
                                                     max_value=130, step=10),
                                        activation='relu',
                                        kernel_initializer='normal'))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(1))
    model.compile(loss='mae', metrics=['mse'],
                  optimizer=tf.keras.optimizers.Adam(
                      hp.Choice('learning_rate', values=[1e-1, 1e-2, 1e-3])))
    return model
We can define the conditions for keras_tuner, which include the objective='val_loss', the number of trials to search through (max_trials=25), the number of models built per trial (executions_per_trial=1), and the directory and project_name. Then we can print a summary of the search space.
import keras_tuner
from keras_tuner import BayesianOptimization
tuner = BayesianOptimization(
build_model,
objective='val_loss',
max_trials=25,
executions_per_trial=1,
overwrite=True,
directory='MLP_HPO3',
project_name='MLP_HPO3')
tuner.search_space_summary()
Let's now begin the search for the best hyperparameters testing different parameters for one epoch using validation_split=0.2 and batch_size=4.
tuner.search(X_train, y_train, epochs=1, validation_split=0.2, batch_size=4,
callbacks=callbacks_list)
Now that the search has completed, let's print a summary of the results from the trials.
tuner.results_summary()
The hyperparameters that resulted in the lowest val_loss (Score: 2655) contained 13 layers and a learning_rate of 0.001, with the initial layers being the largest (130). Nine of the 13 layers have a layer_size of 130 while four have a layer_size of 70.
Fit The Best Model from HPO¶
Let's now set the path where the model is saved and set up the callbacks: EarlyStopping, which monitors the val_loss and stops training if it does not improve after 5 epochs, the ModelCheckpoint, which monitors the mse and saves only the best model with the minimum loss, and the TensorBoard as well.
%load_ext tensorboard
log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
filepath = 'MLP_weights_only_b4_HPO1_bestModel.h5'
checkpoint_dir = os.path.dirname(filepath)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
histogram_freq=1)
callbacks_list = [EarlyStopping(monitor='val_loss', patience=5),
ModelCheckpoint(filepath, monitor='mse',
save_best_only=True, mode='min'),
tensorboard_callback]
We can now define the model architecture using the results from the keras-tuner search, compile and then examine it.
model = Sequential()
model.add(Dense(130, input_dim=53, kernel_initializer='normal',
activation='relu'))
model.add(Dense(130, kernel_initializer='normal', activation='relu'))
model.add(Dense(130, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(130, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(130, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(130, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(130, kernel_initializer='normal', activation='relu'))
model.add(Dense(130, kernel_initializer='normal', activation='relu'))
model.add(Dense(130, kernel_initializer='normal', activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1))
opt = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='mae', metrics=['mse'], optimizer=opt)
model.summary()
The total parameters for this model increased to 148,871 from the 44,356 utilized for the baseline model. Now the model can be trained by calling fit on the training set for 50 epochs using batch_size=4, validation_split=0.2 and the specified callbacks from callbacks_list.
history = model.fit(X_train, y_train, epochs=50, batch_size=4,
validation_split=0.2, callbacks=callbacks_list)
Now, let's save the model in case we need to reload it in the future, and plot the model loss and val_loss over the training epochs. We can also save the model losses in a pandas.DataFrame, and plot both the mae and mse from the train and validation sets over the epochs as well.
model.save('./MLP_batch4_50epochs_HPO1_bestModel_tf.h5', save_format='tf')
# Load model for more training or later use
#filepath = 'MLP_weights_only_b4_HPO1_bestModel.h5'
#model = tf.keras.models.load_model('./MLP_batch4_50epochs_HPO1_bestModel_tf.h5')
#model.load_weights(filepath)
plt.title('Model Error for Price')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='val_loss')
plt.ylabel('Error [Price]')
plt.xlabel('Epoch')
plt.legend()
plt.show()
losses = pd.DataFrame(model.history.history)
losses.plot()
plt.title('Model Error for Price')
plt.ylabel('Error [Price]')
plt.xlabel('Epoch')
plt.legend(loc='upper right')
plt.show()
We can use model.predict to generate output predictions for the specified input samples, so let's predict using the training set, convert the predicted price to a pandas.DataFrame and examine the size to determine the chunk size for plotting. Let's graph the predicted vs. actual price for the train set.
pred_train = model.predict(X_train)
y_pred = pd.DataFrame(pred_train)
y_pred.shape
plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Training Set: Predicted Price vs Actual Price',
fontsize=25)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Price', pad=15, fontsize=20)
ax1.set_ylabel('Price', fontsize=20)
ax2.plot(y_train, color='blue')
ax2.set_title('Actual Price', pad=15, fontsize=20)
plt.show()
We can use model.predict on the test set, convert the predicted price to a pandas.DataFrame and graph the predicted vs. actual price for the test set.
pred_test = model.predict(X_test)
y_pred = pd.DataFrame(pred_test)
plt.rcParams['agg.path.chunksize'] = 1000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Price', pad=15, fontsize=20)
ax1.set_ylabel('Price', fontsize=20)
ax2.plot(y_test, color='blue')
ax2.set_title('Actual Price', pad=15, fontsize=20)
plt.show()
To evaluate if this MLP is effective at predicting the price of used cars, metrics like the Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and the Coefficient of Determination (R²) can be utilized for both the training and test sets.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
print('Metrics: Train set')
print('MAE: %3f' % mean_absolute_error(y_train[:], pred_train[:]))
print('MSE: %3f' % mean_squared_error(y_train[:], pred_train[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_train[:], pred_train[:])))
print('R2: %3f' % r2_score(y_train[:], pred_train[:]))
print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y_test[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y_test[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_test[:], pred_test[:])))
print('R2: %3f' % r2_score(y_test[:], pred_test[:]))
When comparing the model metrics from the hyperparameter tuning to the baseline model, the MAE, MSE and RMSE were all lower for the training and test sets. The R² was higher for both sets compared to the baseline model.
Let's also examine how the predicted values for the maximum, average and minimum price compare to the actual maximum, average and minimum price.
print('Maximum Price:', np.amax(y_test))
print('Predicted Max Price:', np.amax(pred_test))
print('\nAverage Price:', np.average(y_test))
print('Predicted Average Price:', np.average(pred_test))
print('\nMinimum Price:', np.amin(y_test))
print('Predicted Minimum Price:', np.amin(pred_test))
When evaluating the model from the hyperparameter search with the test set, there were lower and closer values for the predicted vs. actual maximum price (49700.676 vs. 58631.125), but higher and farther-from-actual values for the predicted average price (27487.953 vs. 27052.695) and predicted minimum price (7409.093 vs. 7353.111) when compared to the baseline model.
Another hyperparameter search was tested using more complex network architecture:
- num_layers: 20 - 30 layers
- layer_size: 100 - 200 nodes using step=10
- The same learning rate
This utilized 50 trials vs. 25 trials for the aforementioned HPO. The completion time was 03h 54m 19s compared to 01h 38m 11s, with a best val_loss = 547.799072265625 compared to the best val_loss = 2655.16650390625. The more complex search resulted in a network where 12 of the Dense layers had 200 nodes, for 568,051 total parameters compared to 148,871 for the aforementioned model. This resulted in only a slightly better R² but a worse predicted minimum price. Another search tested stratify=X.listed_date_yearMonth (see the sketch below) with 50 trials; the completion time was 05h 40m 54s with the best val_loss = 2471.0478515625 and 63,236 total parameters.
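For reference, that stratified variant of the split could look like this minimal sketch (assuming listed_date_yearMonth was retained in X at split time):
from sklearn.model_selection import train_test_split

# Stratify on the listing year-month so the train and test partitions keep
# the same proportion of listings from each month
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42,
    stratify=X.listed_date_yearMonth)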
Create an Azure Blob Datastore¶
Since we determined the previous model resulted in the lowest error, let's move it to the cloud with Azure Machine Learning Studio. Let's first connect to the workspace by authenticating with DefaultAzureCredential and then provide the necessary components to MLClient to get a handle to the workspace.
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()
ml_client = MLClient(
credential=credential,
subscription_id='a134465f-eb15-4297-b902-3c97d4c81838',
resource_group_name='aschultzdata',
workspace_name='ds-ml-env',
)
Then we can create the datastore by specifying its name, the description, the account name, the name of the container, the protocol and the account_key.
from azure.ai.ml.entities import AzureBlobDatastore
from azure.ai.ml.entities import AccountKeyConfiguration
store = AzureBlobDatastore(
name='usedcars_datastore',
description='Datastore for Used Cars',
account_name='dsmlenv8898281366',
container_name='usedcarscargurus',
protocol='https',
credentials=AccountKeyConfiguration(
account_key='XXXxxxXXXxXXXXxxXXXXXxXXXXXxXxxXxXXXxXXXxXXxxxXXxxXXXxXxXXXxxXxxXXXXxxxxxXXxxxxxxXXXxXXX'
),
)
ml_client.create_or_update(store)
Finally, the train/test sets can be uploaded to the usedcarscargurus container.
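One way to perform that upload is with the azure-storage-blob client; a minimal sketch, where the connection string placeholder is hypothetical:
from azure.storage.blob import BlobServiceClient

# Upload the saved train/test sets to the usedcarscargurus container
blob_service = BlobServiceClient.from_connection_string('<connection-string>')
for name in ['usedCars_trainSet.csv', 'usedCars_testSet.csv']:
    blob_client = blob_service.get_blob_client(container='usedcarscargurus',
                                               blob=name)
    with open(name, 'rb') as data:
        blob_client.upload_blob(data, overwrite=True)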
Train the MLP using Microsoft Azure¶
Before we train the model, we need to create a compute instance/cluster with a defined name, so let's create one with a GPU if it has not already been created. Then we can specify the cluster components: the on-demand virtual machine (VM) service, the type of VM, the minimum/maximum nodes in the cluster, the time a node will run after the job finishes/terminates and the type of tier. Finally, we can pass the object to create_or_update from MLClient.
from azure.ai.ml.entities import AmlCompute
gpu_compute_target = 'gpu-cluster-NC4as-T4-v3'
try:
    gpu_cluster = ml_client.compute.get(gpu_compute_target)
    print(
        f"You already have a cluster named {gpu_compute_target}, we'll reuse it as is."
    )
except Exception:
    print('Creating a new gpu compute target...')
    gpu_cluster = AmlCompute(
        name=gpu_compute_target,
        type='amlcompute',
        size='Standard_NC4as_T4_v3',
        min_instances=0,
        max_instances=1,
        idle_time_before_scale_down=180,
        tier='Dedicated',
    )
    print(
        f"AMLCompute with name {gpu_cluster.name} will be created, with compute size {gpu_cluster.size}"
    )
    gpu_cluster = ml_client.compute.begin_create_or_update(gpu_cluster)
Next, let's create the environment by making a directory for the dependencies file, which lists the components for the runtime and the libraries installed on the compute for training the model.
import os
dependencies_dir = './dependencies'
os.makedirs(dependencies_dir, exist_ok=True)
Now we can write the conda.yaml file into the dependencies directory.
%%writefile {dependencies_dir}/conda.yaml
name: model-env
channels:
  - conda-forge
dependencies:
  - python=3.8.5
  - pip=23.1.2
  - numpy=1.21.6
  - scikit-learn==1.1.2
  - scipy
  - pandas>=1.1,<1.2
  - pip:
    - inference-schema[numpy-support]==1.3.0
    - mlflow
    - azureml-mlflow==1.50.0
    - psutil==5.9.0
    - tqdm
    - ipykernel
    - matplotlib
    - tensorflow==2.12
    - keras-tuner==1.1.3
The created conda.yaml file allows for the environment to be created and registered in the workspace.
from azure.ai.ml.entities import Environment
custom_env_name = 'aml-usedcars-gpu-mlp'
custom_job_env = Environment(
name=custom_env_name,
description='Custom environment for Used Cars MLP job',
tags={'tensorflow': '2.12'},
conda_file=os.path.join(dependencies_dir, 'conda.yaml'),
image='mcr.microsoft.com/azureml/curated/tensorflow-2.12-cuda11:6',
)
custom_job_env = ml_client.environments.create_or_update(custom_job_env)
print(
f'Environment with name {custom_job_env.name} is registered to workspace, the environment version is {custom_job_env.version}'
)
Next, we can create the training script by first creating the source folder where the training script, main.py, will be stored.
train_src_dir = './src'
os.makedirs(train_src_dir, exist_ok=True)
The training script includes preparing the environment, reading the data, data preparation, model training, saving the model and evaluating the model. This includes specifying the dependencies to import, setting the seed, defining the input/output arguments with argparse, starting logging, reading the train/test sets, and preprocessing the data for dummy variables and scaling with the StandardScaler. The number of samples and features are then logged with MLFlow. The script trains an MLP model using the best parameters from keras-tuner, where the loss and val_loss are logged with MLFlow through defined callbacks. The model is then saved and evaluated for the model loss, the metrics of the train/test sets and the predicted vs. actual minimum/average/maximum price, which are logged as MLFlow artifacts and metrics.
%%writefile {train_src_dir}/main.py
import os
import random
import numpy as np
import warnings
import argparse
import pandas as pd
from sklearn.preprocessing import StandardScaler
import mlflow
import tensorflow as tf
import datetime
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from keras.callbacks import Callback
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')
def init_seeds(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1,
                                            inter_op_parallelism_threads=1)
    os.environ['TF_CUDNN_DETERMINISTIC'] = 'True'
    os.environ['TF_DETERMINISTIC_OPS'] = 'True'
    sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(),
                                config=session_conf)
    tf.compat.v1.keras.backend.set_session(sess)
    return sess

init_seeds(seed=42)
def main():
    """Main function of the script."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_data', type=str, help='path to input train data')
    parser.add_argument('--test_data', type=str, help='path to input test data')
    parser.add_argument('--epochs', required=False, default=30, type=int)
    parser.add_argument('--batch_size', required=False, default=1, type=int)
    args = parser.parse_args()

    mlflow.start_run()

    ###################
    #<prepare the data>
    ###################
    print(' '.join(f'{k}={v}' for k, v in vars(args).items()))
    print('input train data:', args.train_data)
    print('input test data:', args.test_data)

    trainDF = pd.read_csv(args.train_data, low_memory=False)
    testDF = pd.read_csv(args.test_data, low_memory=False)

    train_label = trainDF[['price']]
    test_label = testDF[['price']]
    train_features = trainDF.drop(columns=['price'])
    test_features = testDF.drop(columns=['price'])

    train_features = pd.get_dummies(train_features, drop_first=True)
    test_features = pd.get_dummies(test_features, drop_first=True)

    sc = StandardScaler()
    train_features = pd.DataFrame(sc.fit_transform(train_features))
    test_features = pd.DataFrame(sc.transform(test_features))

    mlflow.log_metric('num_samples', train_features.shape[0])
    mlflow.log_metric('num_features', train_features.shape[1])

    print(f'Training with data of shape {train_features.shape}')
    ####################
    #</prepare the data>
    ####################

    ##################
    #<train the model>
    ##################
    model = Sequential()
    model.add(Dense(130, input_dim=53, kernel_initializer='normal',
                    activation='relu'))
    model.add(Dense(130, kernel_initializer='normal', activation='relu'))
    model.add(Dense(130, kernel_initializer='normal', activation='relu'))
    model.add(Dense(70, kernel_initializer='normal', activation='relu'))
    model.add(Dense(130, kernel_initializer='normal', activation='relu'))
    model.add(Dense(70, kernel_initializer='normal', activation='relu'))
    model.add(Dense(130, kernel_initializer='normal', activation='relu'))
    model.add(Dense(70, kernel_initializer='normal', activation='relu'))
    model.add(Dense(130, kernel_initializer='normal', activation='relu'))
    model.add(Dense(70, kernel_initializer='normal', activation='relu'))
    model.add(Dense(130, kernel_initializer='normal', activation='relu'))
    model.add(Dense(130, kernel_initializer='normal', activation='relu'))
    model.add(Dense(130, kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(1))

    opt = tf.keras.optimizers.Adam(learning_rate=0.001)
    model.compile(loss='mae', metrics=['mse'], optimizer=opt)
    model.summary()

    class LogRunMetrics(Callback):
        def on_epoch_end(self, epoch, log):
            # log the training and validation loss at the end of each epoch
            mlflow.log_metric('Loss', log['loss'])
            mlflow.log_metric('Val_Loss', log['val_loss'])

    log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
    filepath = 'MLP_weights_only_HPO1_bestModel.tf'
    checkpoint_dir = os.path.dirname(filepath)

    callbacks_list = [EarlyStopping(monitor='val_loss', patience=5),
                      ModelCheckpoint(filepath, monitor='mse',
                                      save_best_only=True, mode='min'),
                      LogRunMetrics()]

    history = model.fit(train_features, train_label, epochs=args.epochs,
                        batch_size=args.batch_size, validation_split=0.2,
                        callbacks=callbacks_list)

    model.save('./MLP_HPO1_bestModel', save_format='tf')
    #filepath = 'MLP_weights_only_b4_HPO1_bestModel.h5'
    #model = tf.keras.models.load_model('./MLP_HPO1_bestModel_tf.h5')
    #model.load_weights(filepath)
    ##################
    #</train the model>
    ##################

    #####################
    #<evaluate the model>
    #####################
    plt.title('Model Error for Price')
    plt.plot(history.history['loss'], label='train')
    plt.plot(history.history['val_loss'], label='val_loss')
    plt.ylabel('Error [Price]')
    plt.xlabel('Epoch')
    plt.legend()
    plt.grid(True)
    plt.savefig('Loss vs. Price.png')
    mlflow.log_artifact('Loss vs. Price.png')
    plt.close()

    losses = pd.DataFrame(model.history.history)
    losses.plot()
    plt.title('Model Error for Price')
    plt.ylabel('Error [Price]')
    plt.xlabel('Epoch')
    plt.legend(loc='upper right')
    plt.grid(True)
    plt.savefig('Error vs. Price.png')
    mlflow.log_artifact('Error vs. Price.png')
    plt.close()

    pred_train = model.predict(train_features)
    train_mae = mean_absolute_error(train_label[:], pred_train[:])
    train_mse = mean_squared_error(train_label[:], pred_train[:])
    train_rmse = np.sqrt(mean_squared_error(train_label[:], pred_train[:]))
    train_r2 = r2_score(train_label[:], pred_train[:])

    pred_test = model.predict(test_features)
    test_mae = mean_absolute_error(test_label[:], pred_test[:])
    test_mse = mean_squared_error(test_label[:], pred_test[:])
    test_rmse = np.sqrt(mean_squared_error(test_label[:], pred_test[:]))
    test_r2 = r2_score(test_label[:], pred_test[:])

    mlflow.log_metric('train_mae', train_mae)
    mlflow.log_metric('train_mse', train_mse)
    mlflow.log_metric('train_rmse', train_rmse)
    mlflow.log_metric('train_r2', train_r2)
    mlflow.log_metric('test_mae', test_mae)
    mlflow.log_metric('test_mse', test_mse)
    mlflow.log_metric('test_rmse', test_rmse)
    mlflow.log_metric('test_r2', test_r2)

    MaximumPrice = np.amax(test_label)
    PredictedMaxPrice = np.amax(pred_test)
    AveragePrice = np.average(test_label)
    PredictedAveragePrice = np.average(pred_test)
    MinimumPrice = np.amin(test_label)
    PredictedMinimumPrice = np.amin(pred_test)

    mlflow.log_metric('Maximum Price', MaximumPrice)
    mlflow.log_metric('Predicted Maximum Price', PredictedMaxPrice)
    mlflow.log_metric('Average Price', AveragePrice)
    mlflow.log_metric('Predicted Average Price', PredictedAveragePrice)
    mlflow.log_metric('Minimum Price', MinimumPrice)
    mlflow.log_metric('Predicted Minimum Price', PredictedMinimumPrice)
    ###################
    #</evaluate the model>
    ###################

    mlflow.end_run()

if __name__ == "__main__":
    main()
To train the model, a command job needs to be submitted: it specifies the inputs (the input data, the number of epochs and the batch size), runs the training script on the specified compute resource within the given environment, and logs the designated parameters.
from azure.ai.ml import command
from azure.ai.ml import Input
registered_model_name = 'usedcars_mlp_model'
job = command(
inputs=dict(
train_data=Input(
type='uri_file',
path='azureml://datastores/usedcars_datastore/paths/usedCars_trainSet.csv',
),
test_data=Input(
type='uri_file',
path='azureml://datastores/usedcars_datastore/paths/usedCars_testSet.csv',
),
epochs=50,
batch_size=4,
registered_model_name=registered_model_name,
),
code='./src/',
command='python main.py --train_data ${{inputs.train_data}} --test_data ${{inputs.test_data}} --epochs ${{inputs.epochs}} --batch_size ${{inputs.batch_size}}',
environment='aml-usedcars-gpu-mlp@latest',
compute='gpu-cluster-NC4as-T4-v3',
display_name='usedcars_mlp_prediction',
)
Finally, this job can be submitted to run in Azure Machine Learning Studio using the create_or_update command with ml_client.
ml_client.create_or_update(job)
The submitted job can then be viewed by selecting the link in the output of the previous cell. The information logged with MLflow, including the model metrics and saved graphs, can be viewed and downloaded once the job completes.
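As a sketch of how the logged results could also be retrieved programmatically (assuming MLflow is pointed at the workspace tracking URI; the run_id value below is a hypothetical placeholder copied from the completed job):
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()
run_id = '<run-id-from-the-completed-job>'  # hypothetical placeholder
run = client.get_run(run_id)
print(run.data.metrics)  # e.g. train_mae, test_rmse, Average Price, ...
# Download one of the figures saved with mlflow.log_artifact
# (in recent MLflow versions, mlflow.artifacts.download_artifacts)
client.download_artifacts(run_id, 'Loss vs. Price.png', '.')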
CatBoost¶
Gradient boosting builds strong predictive models by iteratively combining weaker models using gradient descent. The paper CatBoost: unbiased boosting with categorical features presented a new boosting algorithm, CatBoost, or categorical boosting, where preprocessing the categorical features is not required because they are handled during the training step. The paper CatBoost: gradient boosting with categorical features support then explained the methodologies in greater detail.
The notebook containing the baseline model and models using Hyperopt is here. Let's first start by setting up the environment: installing the necessary dependencies, importing the required packages, setting the options and the random and numpy seeds. We can also determine whether the CUDA compiler is present and which version of the CUDA Toolkit is available with nvcc --version, and whether a GPU is present, along with its characteristics, with nvidia-smi.
!pip install catboost
!pip install eli5
!pip install shap
import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')
seed_value = 42
os.environ['usedCars_catGPU'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
There is a 70W Tesla T4 with 16GB GPU memory and CUDA Toolkit 11.2 available. Now, we can read the train/test sets into pandas dataframes and set up the features with X_train and X_test and the target/label with y_train and y_test. CatBoost does not need the categorical variables to be one-hot encoded or converted to dummy variables, so let's define the categorical variables in a list that will be used when modeling.
import pandas as pd
trainDF = pd.read_csv('usedCars_trainSet.csv', low_memory=False)
testDF = pd.read_csv('usedCars_testSet.csv', low_memory=False)
X_train = trainDF.drop(columns = ['price'])
y_train = trainDF[['price']]
X_test = testDF.drop(columns = ['price'])
y_test = testDF[['price']]
categorical_features_indices = ['body_type', 'fuel_type', 'listing_color',
'transmission', 'wheel_system_display','State',
'listed_date_yearMonth']
Baseline Model¶
Let's now fit a baseline model using the default parameters for CatBoost, save it as a .pkl file, and evaluate the model metrics for both the train and the test sets.
from catboost import CatBoostRegressor
import pickle
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
cat = CatBoostRegressor(loss_function='RMSE',
task_type='GPU',
cat_features=categorical_features_indices,
random_state=seed_value,
logging_level='Silent')
cat.fit(X_train, y_train)
Pkl_Filename = 'CatBoost_Baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(cat, file)
print('\nModel Metrics for Catboost Baseline Model')
y_train_pred = cat.predict(X_train)
y_test_pred = cat.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(y_train, y_train_pred),
mean_absolute_error(y_test, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred, squared=False),
mean_squared_error(y_test, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))
Hyperopt: 100 Trials 10-Fold Cross Validation¶
To set up Hyperopt, let's first define the number of trials with NUM_EVAL = 100. We need to utilize the same k-folds for reproducibility, so let's use 10 as the number of folds with which to split the training set into train and validation sets, using shuffle=True and the initially defined seed value. The hyperparameters can then be defined in a dictionary. To define integers, hp.choice with np.arange and dtype=int is used, while float types are defined using hp.uniform. The space consists of 6 parameters: 4 integers and 2 floats:
iterations: The number of trees that can be built.
depth: Depth of the tree.
l2_leaf_reg: Coefficient at the L2 regularization term of the cost function.
learning_rate: Used for reducing the gradient step.
min_data_in_leaf: The minimum number of training samples in a leaf.
one_hot_max_size: Use one-hot encoding for all categorical features with a number of different values less than or equal to the given parameter value.
from sklearn.model_selection import KFold
from hyperopt import hp
NUM_EVAL = 100
kfolds = KFold(n_splits=10, shuffle=True, random_state=seed_value)
catboost_tune_kwargs= {
'iterations': hp.choice('iterations', np.arange(100, 500, dtype=int)),
'depth': hp.choice('depth', np.arange(3, 10, dtype=int)),
'l2_leaf_reg': hp.uniform('l2_leaf_reg', 1e-2, 1e0),
'learning_rate': hp.uniform('learning_rate', 1e-4, 0.3),
'min_data_in_leaf': hp.choice('min_data_in_leaf', np.arange(2, 20,
dtype=int)),
'one_hot_max_size': hp.choice('one_hot_max_size', np.arange(2, 20,
dtype=int)),
}
Now, we can define a function cat_hpo for optimizing the hyperparameters. Within this function, we can use joblib to save a .pkl file that can be reloaded if more training is needed. We can define the trial number with ITERATION, which increases by 1 each trial to keep track of the parameters for each trial. The parameters which are integers need to be configured to remain as integers and not become floats; this is handled for iterations and depth, where the search space starts at depth=3. Then the model type, CatBoostRegressor, needs to be defined with the parameters that will be included in all of the trials during the search, which are:
loss_function: The metric to use in training.
task_type: The processing unit type to use for training.
cat_features: A one-dimensional array of categorical column indices.
early_stopping_rounds: Sets the overfitting detector type to Iter and stops the training after the specified number of iterations since the iteration with the optimal metric value.
rsm: Random subspace method. The percentage of features to use at each split selection, when features are selected over again at random.
random_state=seed_value: Random number seed.
logging_level: The logging level to output to stdout.
Then we can fit the model with the training set utilizing 'neg_root_mean_squared_error' as the scoring metric and 10-fold cross validation to generate the negative cross_val_score, where the np.mean of the scores is saved as the rmse for each trial. The trial loss=rmse, the parameters tested in the trial from the catboost_tune_kwargs space (params), the trial number (iteration), the time to complete the trial (train_time) and whether the trial completed successfully or not (status) are written to the defined .csv file, appending a row for each trial.
import joblib
from timeit import default_timer as timer
from sklearn.model_selection import cross_val_score
import csv
from hyperopt import STATUS_OK
def cat_hpo(config):
"""
    Objective function to tune a CatBoostRegressor model.
"""
joblib.dump(bayesOpt_trials, 'Catboost_Hyperopt_100_GPU.pkl')
global ITERATION
ITERATION += 1
config['iterations'] = int(config['iterations'])
config['depth'] = int(config['depth']) + 3
cat = CatBoostRegressor(loss_function='RMSE',
task_type='GPU',
cat_features=categorical_features_indices,
early_stopping_rounds=10,
rsm=1,
random_state=seed_value,
logging_level='Silent',
**config)
start = timer()
scores = -cross_val_score(cat, X_train, y_train,
scoring='neg_root_mean_squared_error',
cv=kfolds)
run_time = timer() - start
rmse = np.mean(scores)
of_connection = open(out_file, 'a')
writer = csv.writer(of_connection)
writer.writerow([rmse, config, ITERATION, run_time])
return {'loss': rmse, 'params': config, 'iteration': ITERATION,
'train_time': run_time, 'status': STATUS_OK}
The Tree-structured Parzen Estimator (TPE) algorithm is the default algorithm for Hyperopt. This algorithm, proposed in Algorithms for Hyper-Parameter Optimization, uses a Bayesian optimization approach to build a probabilistic model of the objective function. It generates a random starting point, evaluates the function, and then chooses the next set of parameters that will probabilistically result in a lower minimum, using past evaluations to form the conditional probability model; the real value is then computed, and the process continues until the defined stopping criteria are met.
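As a toy illustration of TPE (separate from the project code), minimizing f(x) = x² over a uniform interval shows the fmin workflow on a function whose minimum is known:
import numpy as np
from hyperopt import fmin, tpe, hp, Trials

toy_trials = Trials()
best = fmin(fn=lambda x: x ** 2,          # objective with a known minimum at 0
            space=hp.uniform('x', -10, 10),
            algo=tpe.suggest,             # TPE proposes each new candidate
            max_evals=50,
            trials=toy_trials,
            rstate=np.random.RandomState(42))
print(best)  # a value of x close to the true minimum at 0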
Let's now define an out_file to save the results, where the headers will be written to the file. Then we can set the global variable for ITERATION and define the Hyperopt Trials as bayesOpt_trials.
We can utilize if/else conditional statements to load a .pkl file if it exists; otherwise, utilizing fmin, we can specify the training function cat_hpo, the parameter space catboost_tune_kwargs, the algorithm for optimization tpe.suggest, the number of trials to evaluate NUM_EVAL, the trial set bayesOpt_trials and the random state np.random.RandomState(42). We can now begin the hyperparameter optimization (HPO) trials.
from hyperopt import tpe, Trials
out_file = '/content/drive/MyDrive/UsedCarsCarGurus/Models/ML/Catboost/Hyperopt/trialOptions/Catboost_trials_100_GPU.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)
writer.writerow(['loss', 'params', 'iteration', 'train_time'])
of_connection.close()
global ITERATION
ITERATION = 0
bayesOpt_trials = Trials()
import time
from datetime import datetime, timedelta
from hyperopt import fmin
search_time_start = time.time()
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('Catboost_Hyperopt_100_GPU.pkl'):
bayesOpt_trials = joblib.load('Catboost_Hyperopt_100_GPU.pkl')
else:
best_param = fmin(cat_hpo, catboost_tune_kwargs, algo=tpe.suggest,
max_evals=NUM_EVAL, trials=bayesOpt_trials,
rstate=np.random.RandomState(42))
print('Time to run HPO:', time.time() - search_time_start)
This search completed in 2 hours and 57 minutes. Let's now sort the trials with lowest loss (lowest RMSE) first and examine the top two trials:
bayesOpt_trials_results = sorted(bayesOpt_trials.results,
key=lambda x: x['loss'])
print('Top two trials with the lowest loss (lowest RMSE)')
print(bayesOpt_trials_results[:2])
Let's now access the results in the trialOptions directory by reading them into a pandas.DataFrame sorted with the best scores on top and then resetting the index for slicing. Then we can convert the trial parameters from a string to a dictionary for later use. Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit it using the training data and then save it as a .pkl file. Then we can evaluate the model metrics for both the train and the test sets using MAE, MSE, RMSE and R².
import ast
import pickle
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
results = pd.read_csv('Catboost_trials_100_GPU.csv')
results.sort_values('loss', ascending=True, inplace=True)
results.reset_index(inplace=True, drop=True)
ast.literal_eval(results.loc[0, 'params'])
best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()
best_bayes_model = CatBoostRegressor(loss_function='RMSE',
task_type='GPU',
cat_features=categorical_features_indices,
early_stopping_rounds=10,
rsm=1,
random_state=seed_value,
logging_level='Silent',
**best_bayes_params)
best_bayes_model.fit(X_train, y_train)
Pkl_Filename = 'Catboost_HPO_trials100_GPU.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_bayes_model, file)
# =============================================================================
# # To load saved model
# best_bayes_model = joblib.load('Catboost_HPO_trials100_GPU.pkl')
# print(best_bayes_model)
# =============================================================================
print('\nModel Metrics for Catboost HPO 100 GPU trials')
y_train_pred = best_bayes_model.predict(X_train)
y_test_pred = best_bayes_model.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(y_train, y_train_pred),
mean_absolute_error(y_test, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred, squared=False),
mean_squared_error(y_test, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))
Compared to the baseline model, there are lower values for MAE, MSE and RMSE and a higher R² for both train/test sets after hyperparameter tuning.
We can also evaluate the MSE on the test set and determine when this was achieved:
print('The best model from Bayes optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(y_test,
y_test_pred)))
print('This was achieved after {} search iterations'.format(results.loc[0, 'iteration']))
Results from the Hyperparameter Search¶
Let's now create a new pandas.DataFrame where the parameters examined during the search are stored, with the results of each parameter stored in a different column. Then we can convert the data types for graphing. Let's now examine the distributions of the search hyperparameters by using a for loop to iterate through the quantitative parameters and visualizing with seaborn.kdeplot.
import matplotlib.pyplot as plt
import seaborn as sns
bayes_params = pd.DataFrame(columns=list(ast.literal_eval(results.loc[0,
'params']).keys()),
index=list(range(len(results))))
for i, params in enumerate(results['params']):
bayes_params.loc[i,:] = list(ast.literal_eval(params).values())
bayes_params['loss'] = results['loss']
bayes_params['iteration'] = results['iteration']
bayes_params['depth'] = bayes_params['depth'].astype('float64')
bayes_params['learning_rate'] = bayes_params['learning_rate'].astype('float64')
bayes_params['l2_leaf_reg'] = bayes_params['l2_leaf_reg'].astype('float64')
bayes_params['min_data_in_leaf'] = bayes_params['min_data_in_leaf'].astype('float64')
bayes_params['one_hot_max_size'] = bayes_params['one_hot_max_size'].astype('float64')
for i, hpo in enumerate(bayes_params.columns):
if hpo not in ['iteration', 'iterations']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(bayes_params[hpo], label='Bayes Optimization')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
Higher values of depth, l2_leaf_reg, min_data_in_leaf and one_hot_max_size in the given hyperparameter space were associated with a lower loss. A learning_rate between 0.1 and 0.2 performed better, and the lowest loss was achieved with approximately learning_rate=0.119.
Let's now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 3, figsize=(20,5))
i = 0
for i, hpo in enumerate(['learning_rate', 'min_data_in_leaf',
'one_hot_max_size']):
    sns.regplot(x='iteration', y=hpo, data=bayes_params, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
Over the trials, the learning_rate and min_data_in_leaf decreased while one_hot_max_size slightly increased.
Next, we can examine if there were any trends regarding the regularization parameters over the trials.
plt.figure(figsize=(20,8))
plt.rcParams['font.size'] = 18
ax = sns.regplot(x='iteration', y='l2_leaf_reg', data=bayes_params,
                 label='Bayes Optimization: Hyperopt')
ax.set(xlabel='Iteration', ylabel='l2_leaf_reg')
plt.tight_layout()
plt.show()
Over the trials, there was a slight increase in l2_leaf_reg.
Model Explanations¶
Next, we can define a function to plot the feature importance, which can also be saved and exported for later use. The feature importance values and the associated feature names can be extracted into two separate arrays and then converted to a pandas.DataFrame using the feature names as the key and the feature importance as the value in a dictionary. The dataframe can then be sorted in decreasing order by the feature importance value. Then a seaborn.barplot with a defined title, xlabel and ylabel can be utilized to visualize the results.
def plot_feature_importance(importance, names, model_type):
feature_importance = np.array(importance)
feature_names = np.array(names)
data={'feature_names': feature_names,
'feature_importance': feature_importance}
fi_df = pd.DataFrame(data)
fi_df.sort_values(by=['feature_importance'], ascending=False, inplace=True)
plt.figure(figsize=(10,8))
sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'],
palette='cool')
plt.title(model_type + ' Feature Importance')
plt.xlabel('Feature Importance')
plt.ylabel('Feature Names')
plot_feature_importance(best_bayes_model.get_feature_importance(),
X_train.columns, 'Catboost')
The features horsepower, year and mileage are the most important, followed by wheel_system_display, back_legroom, horsepower_rpm, width, front_legroom, height and body_type.
SHAP (SHapley Additive exPlanations)¶
SHAP (SHapley Additive exPlanations), published as A Unified Approach to Interpreting Model Predictions, estimates the impact of the individual components on the final result, separately, while conserving the total impact on the final result. This can be considered in conjunction with the previously used feature importance measures. Let's use the TreeExplainer, since CatBoost is a tree-based model, and generate shap_values for the features. The summary_plot shows the global importance of each feature and the distribution of the effect sizes over the set. Let's use the training set and summarize the effects of all the features.
import shap
shap.initjs()
explainer = shap.TreeExplainer(best_bayes_model)
shap_values = explainer.shap_values(X_train)
shap.summary_plot(shap_values, X_train);
Then we can use the test set and summarize the effects of all the features.
shap.initjs()
explainer = shap.TreeExplainer(best_bayes_model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test);
The train set shows a wider spread of both positive and negative SHAP values compared to the test set.
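To drill into a single feature, a dependence plot could complement the summary plots; a minimal sketch using the test set shap_values from above and horsepower, the most important feature (interaction_index=None skips the automatic interaction coloring, which may not handle the raw categorical columns):
shap.dependence_plot('horsepower', shap_values, X_test, interaction_index=None)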
100 Trials Train/Validation/Test¶
The training set can be partitioned into train/validation sets using a test_size=0.2. This will allow for the original testDF to be a true test set that the model has not encountered.
from sklearn.model_selection import train_test_split

train_Features, val_features, train_Label, val_label = train_test_split(X_train,
                                                                         y_train,
                                                                         test_size=0.2,
                                                                         random_state=seed_value)
We use the same number of trials and parameter space, but define a new objective function cat_hpo with a new .pkl file, the same config for the parameters to be evaluated and the same base model. Now, we can specify for the model to fit on the training set with the validation set as the eval_set and to predict using the validation set. We can then define a new out_file to save the results, where the headers will be written to the file. Then we can set the global variable for ITERATION and define the Hyperopt Trials as bayesOpt_trials.
def cat_hpo(config):
"""
Objective function to tune a CatBoostRegressor model.
"""
joblib.dump(bayesOpt_trials, 'Catboost_Hyperopt_100_GPU_TrainValTest.pkl')
global ITERATION
ITERATION += 1
config['iterations'] = int(config['iterations'])
config['depth'] = int(config['depth']) + 3
model = CatBoostRegressor(loss_function='RMSE',
cat_features=categorical_features_indices,
early_stopping_rounds=10,
rsm=1,
logging_level='Silent',
**config)
start = timer()
model.fit(train_Features, train_Label.values.ravel(),
eval_set=[(val_features, val_label.values.ravel())])
run_time = timer() - start
y_pred_val = model.predict(val_features)
rmse = mean_squared_error(val_label, y_pred_val, squared=False)
of_connection = open(out_file, 'a')
writer = csv.writer(of_connection)
writer.writerow([rmse, config, ITERATION, run_time])
return {'loss': rmse, 'params': config, 'iteration': ITERATION,
'train_time': run_time, 'status': STATUS_OK}
tpe_algorithm = tpe.suggest
out_file = '/content/drive/MyDrive/UsedCarsCarGurus/Models/ML/Catboost/Hyperopt/trialOptions/Catboost_trials_100_GPU_TrainValTest.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)
writer.writerow(['loss', 'params', 'iteration', 'train_time'])
of_connection.close()
global ITERATION
ITERATION = 0
bayesOpt_trials = Trials()
We can utilize if/else conditional statements to load a .pkl file if it exists, and then, utilizing fmin, we can specify the training function cat_hpo, the parameter space catboost_tune_kwargs, the algorithm for optimization tpe.suggest, the number of trials to evaluate NUM_EVAL and the trial set bayesOpt_trials. We can now begin the HPO trials.
if os.path.isfile('Catboost_Hyperopt_100_GPU_TrainValTest.pkl'):
    bayesOpt_trials = joblib.load('Catboost_Hyperopt_100_GPU_TrainValTest.pkl')
    best_param = fmin(cat_hpo, catboost_tune_kwargs, algo=tpe.suggest,
                      max_evals=NUM_EVAL, trials=bayesOpt_trials)
else:
    best_param = fmin(cat_hpo, catboost_tune_kwargs, algo=tpe.suggest,
                      max_evals=NUM_EVAL, trials=bayesOpt_trials)
Let's now sort the trials with lowest loss (lowest RMSE) first and examine the top two trials:
bayesOpt_trials_results = sorted(bayesOpt_trials.results,
key=lambda x: x['loss'])
print('Top two trials with the lowest loss (lowest RMSE)')
print(bayesOpt_trials_results[:2])
We can access the results in the trialOptions directory by reading them into a pandas.DataFrame sorted with the best scores on top and then resetting the index for slicing. Then we can convert the trial parameters from a string to a dictionary for later use. Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit it using the training data and then save it as a .pkl file. Then we can evaluate the model metrics for both the train and the test sets using MAE, MSE, RMSE and R².
results = pd.read_csv('Catboost_trials_100_GPU_TrainValTest.csv')
results.sort_values('loss', ascending=True, inplace=True)
results.reset_index(inplace=True, drop=True)
ast.literal_eval(results.loc[0, 'params'])
best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()
train_label = trainDF[['price']]
test_label = testDF[['price']]
train_features = trainDF.drop(columns = ['price'])
test_features = testDF.drop(columns = ['price'])
best_bayes_model = CatBoostRegressor(loss_function='RMSE',
cat_features=categorical_features_indices,
early_stopping_rounds=10,
rsm=1,
logging_level='Silent',
**best_bayes_params)
best_bayes_model.fit(train_features, train_label)
Pkl_Filename = 'Catboost_HPO_trials100_GPU_TrainValTest.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_bayes_model, file)
print('\nModel Metrics for Catboost HPO 100 GPU trials')
y_train_pred = best_bayes_model.predict(train_features)
y_test_pred = best_bayes_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(train_label, y_train_pred),
mean_absolute_error(test_label, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred),
mean_squared_error(test_label, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred, squared=False),
mean_squared_error(test_label, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(train_label, y_train_pred),
r2_score(test_label, y_test_pred)))
When comparing the model metrics for CatBoost utilizing 10-fold cross validation vs. train/validation/test, there was a lower MAE for both the train and the test set, and a lower MSE and RMSE for the train set but higher values for the test set. There was a higher R² for the train set but a lower R² for the test set, so this model overfit more than the model from the cross validation search.
print('The best model from Bayes optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
y_test_pred)))
print('This was achieved after {} search iterations'.format(results.loc[0, 'iteration']))
As stated, there was a higher MSE on the test set using this model compared to the previous search using 10-fold cross validation (5878087.83144 vs. 5773043.66389), but this was achieved at a similar number of search iterations.
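For scale, taking the square roots of these test set MSE values puts the difference in price units:
print(np.sqrt(5878087.83144))  # ~2424.48 RMSE (train/validation/test search)
print(np.sqrt(5773043.66389))  # ~2402.72 RMSE (10-fold cross validation search)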
Results from the Hyperparameter Search¶
To compare the results from this data partitioning method with the results from 10-fold cross validation, we can iterate through the quantitative parameters using a similar method and visualize with seaborn.kdeplot.
bayes_params = pd.DataFrame(columns=list(ast.literal_eval(results.loc[0,
'params']).keys()),
index=list(range(len(results))))
for i, params in enumerate(results['params']):
bayes_params.loc[i,:] = list(ast.literal_eval(params).values())
bayes_params['loss'] = results['loss']
bayes_params['iteration'] = results['iteration']
bayes_params['depth'] = bayes_params['depth'].astype('float64')
bayes_params['learning_rate'] = bayes_params['learning_rate'].astype('float64')
bayes_params['l2_leaf_reg'] = bayes_params['l2_leaf_reg'].astype('float64')
bayes_params['min_data_in_leaf'] = bayes_params['min_data_in_leaf'].astype('float64')
bayes_params['one_hot_max_size'] = bayes_params['one_hot_max_size'].astype('float64')
for i, hpo in enumerate(bayes_params.columns):
if hpo not in ['iteration', 'iterations', 'random_state', 'task_type']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(bayes_params[hpo], label='Bayes Optimization')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
Similar results were observed: higher values of depth, l2_leaf_reg, min_data_in_leaf and one_hot_max_size in the given hyperparameter space generated a lower loss. A learning_rate between 0.1 and 0.2 performed better, and the lowest loss was achieved with approximately learning_rate=0.136, compared to 0.119 using 10-fold cross validation.
Let's now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 3, figsize=(20,5))
i = 0
for i, hpo in enumerate(['learning_rate', 'min_data_in_leaf',
'one_hot_max_size']):
sns.regplot(x='iteration', y=hpo, data=bayes_params, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
Over the trials, the one_hot_max_size increased while learning_rate and min_data_in_leaf did not reveal any trends.
Next, we can examine if there were any trends regarding the regularization parameters over the trials.
plt.figure(figsize=(10,5))
plt.rcParams['font.size'] = 18
ax = sns.regplot(x='iteration', y='l2_leaf_reg', data=bayes_params,
label='Bayes Optimization')
ax.set(xlabel='Iteration', ylabel='l2_leaf_reg')
plt.tight_layout()
plt.show()
There was not any observable trend for l2_leaf_reg over the trials.
Model Explanations¶
plot_feature_importance(best_bayes_model.get_feature_importance(),
train_features.columns, 'Catboost')
The features horsepower, year and mileage are the most important, followed by wheel_system_display, back_legroom and horsepower_rpm, with front_legroom before width, height, wheelbase and State before body_type, when compared to the model from the 10-fold cross validation search.
SHAP (SHapley Additive exPlanations)¶
Let's use the training set and summarize the effects of all the features.
shap.initjs()
explainer = shap.TreeExplainer(best_bayes_model)
shap_values = explainer.shap_values(train_features)
shap.summary_plot(shap_values, train_features);
Then we can use the test set and summarize the effects of all the features.
shap.initjs()
shap_values = explainer.shap_values(test_features)
shap.summary_plot(shap_values, test_features);
Optuna: 100 Trials 10-Fold Cross Validation¶
The Optuna notebooks are located here. Let's set up the environment by installing wandb, CatBoost, Optuna and also shap for model explanations. Then import the necessary dependencies, set the seed for reproducibility and set the options. We can also examine the GPU characteristics with nvidia-smi.
!pip install --upgrade -q wandb
!pip install catboost
!pip install optuna
!pip install shap
import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
pd.set_option('display.max_columns', None)
seed_value = 42
os.environ['usedCars_catboostGPU'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
Given the environment is now configured, the training and the test sets are read into separate pandas dataframes. Then the features and the target are defined for the two sets as train_features, train_label, test_features and test_label. We can also set the categorical variables as was done for the Hyperopt section. Then we can set the working directory where the .pkl is stored and set up Weights & Biases.
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)
import wandb
from optuna.integration.wandb import WeightsAndBiasesCallback
wandb.login()
wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
'group': 'optuna_cat100gpu_T4_cv_kfold10',
'save_code': 'False', 'notes': 'optuna_cat100gpu_T4_cv_kfold10'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
Let's utilize the same k-folds for reproducibility that were utilized for the Hyperopt search: 10 folds with which to split the training set into train and validation sets, with shuffle=True and the initially defined seed value. Then we can set up the W&B callbacks for tracking. Now, let's define a function objective for the optimization of hyperparameters using an Optuna study with a pickle file, testing different parameter combinations from the same space that was utilized during the Hyperopt trials. Then we can define the model type with the parameters that will be used for each trial and perform timed k-fold cross validation trials with the goal of finding the lowest averaged score.
from sklearn.model_selection import KFold, cross_val_score
import joblib
from catboost import CatBoostRegressor
from timeit import default_timer as timer
kfolds = KFold(n_splits=10, shuffle=True, random_state=seed_value)
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a CatBoostRegressor model.
"""
joblib.dump(study, 'Catboost_Optuna_100_GPU_T4_wb_CV.pkl')
catboost_tune_kwargs = {
'task_type': 'GPU',
'random_state': seed_value,
'n_estimators': trial.suggest_int('n_estimators', 100, 500),
'learning_rate': trial.suggest_float('learning_rate', 1e-4, 0.3),
'depth': trial.suggest_int('depth', 3, 10),
'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1e-2, 1e0),
'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 2, 20),
'one_hot_max_size': trial.suggest_int('one_hot_max_size', 2, 20)
}
model = CatBoostRegressor(loss_function='RMSE',
cat_features=categorical_features_indices,
                              early_stopping_rounds=10,
rsm=1,
logging_level='Silent',
**catboost_tune_kwargs)
start = timer()
scores = -cross_val_score(model, X_train, y_train,
scoring='neg_root_mean_squared_error',
cv=kfolds)
run_time = timer() - start
rmse = np.mean(scores)
return rmse
Now, we can begin the optimization, where the parameters and score for each run are stored in usedCars_HPO_Catboost on Weights & Biases.
from datetime import datetime, timedelta
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('Catboost_Optuna_100_GPU_T4_wb_CV.pkl'):
study = joblib.load('Catboost_Optuna_100_GPU_T4_wb_CV.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100, callbacks=[wandbc])
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
Results from the Hyperparameter Search¶
Let's now extract the trial number, rmse and hyperparameter values into a pandas.DataFrame and sort with the lowest error first.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration', 'value': 'rmse',
                          'params_depth': 'depth',
                          'params_l2_leaf_reg': 'l2_leaf_reg',
                          'params_learning_rate': 'learning_rate',
                          'params_min_data_in_leaf': 'min_data_in_leaf',
                          'params_n_estimators': 'n_estimators',
                          'params_one_hot_max_size': 'one_hot_max_size'},
                inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
Optuna contains many ways to visualize the parameters tested during the search using the visualization module, including plot_parallel_coordinate, plot_slice, plot_contour, plot_param_importances, plot_optimization_history and plot_edf. Let's utilize some of these components to examine the hyperparameters tested, starting with plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search and plot_contour to plot the parameter relationships as a contour plot.
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()
fig = optuna.visualization.plot_contour(study, params=['n_estimators',
'min_data_in_leaf',
'depth',
'learning_rate'])
fig.show()
Next, we can use plot_slice to compare the objective value and individual parameters.
fig = optuna.visualization.plot_slice(study)
fig.show()
From this plot, lower rmse occurred with a higher depth and sometimes using more estimators in the model. The other parameters exhibited more complex relationships with the loss.
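The optimization history, mentioned above but not yet shown, plots the objective value of each trial alongside the best value found so far:
fig = optuna.visualization.plot_optimization_history(study)
fig.show()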
Let's now examine the distributions of the quantitative hyperparameters that were assessed during the search.
import matplotlib.pyplot as plt
import seaborn as sns
for i, hpo in enumerate(trials_df.columns):
if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
'duration', 'n_estimators', 'state']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
Higher values of depth and lower values of l2_leaf_reg, min_data_in_leaf and one_hot_max_size in the given hyperparameter space were associated with a lower loss. A learning_rate between 0.17 and 0.22 performed better, compared to 0.1 - 0.2 for the Hyperopt trials, and the lowest loss was achieved with approximately learning_rate=0.191, which is higher than both Hyperopt trials.
Let's now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 3, figsize=(20,5))
i = 0
axs = axs.flatten()
for i, hpo in enumerate(['learning_rate', 'min_data_in_leaf',
'depth']):
sns.regplot(x='iteration', y=hpo, data=trials_df, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
The learning_rate and min_data_in_leaf decreased over the trials, while depth did not reveal any trends other than remaining high throughout the search.
Next, we can examine if there were any trends regarding the regularization parameters over the trials.
ax = sns.regplot(x='iteration', y='l2_leaf_reg', data=trials_df)
plt.tight_layout()
plt.show()
A slight decrease in l2_leaf_reg was observed over the trials.
Let's now visualize the parameter importances by utilizing the plot_param_importances feature from the Optuna package.
fig = optuna.visualization.plot_param_importances(study)
fig.show()
The learning_rate is clearly the most important hyperparameter followed by depth.
Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit it using the training data and then save it as a .pkl file. Then we can evaluate the model metrics for both the train and the test sets using MAE, MSE, RMSE and R².
params = study.best_params
params['random_state'] = seed_value
params
import pickle
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
best_bayes_model = CatBoostRegressor(loss_function='RMSE',
task_type='GPU',
cat_features=categorical_features_indices,
early_stopping_rounds=10,
rsm=1,
logging_level='Silent',
**params)
best_bayes_model.fit(X_train, y_train)
Pkl_Filename = 'Catboost_Optuna_trials100_GPU_T4_CV_wb.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_bayes_model, file)
print('\nModel Metrics for Catboost HPO 100 GPU trials')
y_train_pred = best_bayes_model.predict(X_train)
y_test_pred = best_bayes_model.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(y_train, y_train_pred),
mean_absolute_error(y_test, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred, squared=False),
mean_squared_error(y_test, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))
When comparing the model metrics to those for CatBoost utilizing 10-fold cross validation from Hyperopt, there was a higher MAE for the train set but a lower MAE for the test set. The MSE and RMSE were higher for both the train and test sets, and the R² was lower for both sets, so this model overfit less than the model from the Hyperopt cross validation search.
We can also evaluate the MSE on the test set and determine when this was achieved:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(y_test,
y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
Model Explanations¶
Using the function defined in the Hyperopt section, we can also plot the feature importance from the best model achieved through this hyperparameter search.
plot_feature_importance(best_bayes_model.get_feature_importance(),
X_train.columns, 'Catboost')
The features horsepower, year and mileage are the most important followed by wheel_system_display, back_legroom, horsepower_rpm, height, width, front_legroom, wheelbase, engine_displacement and savings_amount.
SHAP (SHapley Additive exPlanations)¶
Let's use the training set and summarize the effects of all the features.
import shap
shap.initjs()
explainer = shap.TreeExplainer(best_bayes_model)
shap_values = explainer.shap_values(X_train)
shap.summary_plot(shap_values, X_train)
plt.gcf().axes[-1].set_aspect(1000)
plt.gcf().axes[-1].set_box_aspect(1000)
plt.show();
Then we can use the test set and summarize the effects of all the features.
shap.initjs()
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
plt.gcf().axes[-1].set_aspect(1000)
plt.gcf().axes[-1].set_box_aspect(1000)
plt.show();
100 Trials Train/Validation/Test¶
To evaluate the train/validation/test set approach, let's split the training set into training and validation sets using train_test_split with test_size=0.2, which keeps the original testDF as a true hold-out set for testing the trained model. We can then set the working directory where the .pkl is stored and set up Weights & Biases.
from sklearn.model_selection import train_test_split

train_Features, val_features, train_Label, val_label = train_test_split(X_train,
                                                                         y_train,
                                                                         test_size=0.2,
                                                                         random_state=seed_value)
wandb.login()
wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
'group': 'optuna_cat100gpu_T4_trainvalTest',
'save_code': 'False',
'notes': 'optuna_cat100gpu_T4_trainvalTest'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
Let's set up the W&B callbacks and define a function for the optimization of hyperparameters for an Optuna study, with a .pkl file and the same parameter space that was used during the Hyperopt trials.
Then, we can define the model type with the parameters that will be used for each trial and fit the model on the training set, with the validation set as the eval_set, where the rmse is computed from predictions on the validation set.
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a CatBoostRegressor model.
"""
joblib.dump(study, 'Catboost_Optuna_100_GPU_T4_wb_trainValTest.pkl')
catboost_tune_kwargs = {
'random_state': seed_value,
'task_type': 'GPU',
'n_estimators': trial.suggest_int('n_estimators', 100, 500),
'learning_rate': trial.suggest_float('learning_rate', 1e-4, 0.3),
'depth': trial.suggest_int('depth', 3, 10),
'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1e-2, 1e0),
'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 2, 20),
'one_hot_max_size': trial.suggest_int('one_hot_max_size', 2, 20)
}
model = CatBoostRegressor(loss_function='RMSE',
cat_features=categorical_features_indices,
early_stopping_rounds=10,
rsm=1,
logging_level='Silent',
**catboost_tune_kwargs)
start = timer()
model.fit(train_Features, train_Label.values.ravel(),
eval_set=[(val_features, val_label.values.ravel())])
run_time = timer() - start
y_pred_val = model.predict(val_features)
rmse = mean_squared_error(val_label, y_pred_val, squared=False)
return rmse
Now, let's begin the optimization where the parameters and score for each run is stored here.
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('Catboost_Optuna_100_GPU_T4_wb_trainValTest.pkl'):
study = joblib.load('Catboost_Optuna_100_GPU_T4_wb_trainValTest.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100, callbacks=[wandbc])
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
Results from the Hyperparameter Search¶
Let's now extract the trial number, rmse and hyperparameter value into a pandas.Dataframe and sort with the lowest error first.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration', 'value': 'rmse',
                          'params_depth': 'depth',
                          'params_l2_leaf_reg': 'l2_leaf_reg',
                          'params_learning_rate': 'learning_rate',
                          'params_min_data_in_leaf': 'min_data_in_leaf',
                          'params_n_estimators': 'n_estimators',
                          'params_one_hot_max_size': 'one_hot_max_size'},
                inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
Let's utilize plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search and plot_contour to plot the parameter relationships as contour plots.
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()
fig = optuna.visualization.plot_contour(study, params=['n_estimators',
'min_data_in_leaf',
'depth',
'learning_rate'])
fig.show()
Let's now examine the distributions of the quantitative hyperparameters that were assessed during the search.
for i, hpo in enumerate(trials_df.columns):
if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
'duration', 'n_estimators', 'state']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
Higher values of depth, min_data_in_leaf and one_hot_max_size and lower values of l2_leaf_reg in the given hyperparameter space were associated with a lower loss. A learning_rate between 0.17 and 0.22 performed better, compared to 0.1 - 0.2 for the Hyperopt trials, and the lowest loss was achieved with approximately learning_rate=0.178, which is higher than both Hyperopt trials.
Let's now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 3, figsize=(20,5))
i = 0
axs = axs.flatten()
for i, hpo in enumerate(['learning_rate', 'min_data_in_leaf', 'depth']):
sns.regplot(x='iteration', y=hpo, data=trials_df, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
Over the search trials, there were not any observable trends in these quantitative hyperparameters across the iterations.
Next, we can examine if there were any trends regarding the regularization parameters over the trials.
ax = sns.regplot(x='iteration', y='l2_leaf_reg', data=trials_df)
plt.tight_layout()
plt.show()
Over the trials, there was a slight increase in l2_leaf_reg.
Let's now visualize the parameter importances by utilizing the plot_param_importances feature from the Optuna package.
fig = optuna.visualization.plot_param_importances(study)
fig.show()
The learning_rate is clearly the most important hyperparameter, even more so than when using cross validation for the Optuna trials, followed by depth, which is less important in this search.
Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit it using the training data and then save it as a .pkl file. Then we can evaluate the model metrics for both the train and the test sets using MAE, MSE, RMSE and R².
params = study.best_params
params['random_state'] = seed_value
params
train_label = trainDF[['price']]
test_label = testDF[['price']]
train_features = trainDF.drop(columns = ['price'])
test_features = testDF.drop(columns = ['price'])
categorical_features_indices = ['body_type', 'fuel_type', 'listing_color',
'transmission', 'wheel_system_display', 'State',
'listed_date_yearMonth']
best_bayes_model = CatBoostRegressor(loss_function='RMSE',
task_type='GPU',
cat_features=categorical_features_indices,
early_stopping_rounds=10,
rsm=1,
logging_level='Silent',
**params)
best_bayes_model.fit(train_features, train_label)
Pkl_Filename = 'Catboost_Optuna_trials100_GPU_T4_wb_trainValTest.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_bayes_model, file)
print('\nModel Metrics for Catboost HPO 100 GPU trials')
y_train_pred = best_bayes_model.predict(train_features)
y_test_pred = best_bayes_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(train_label, y_train_pred),
mean_absolute_error(test_label, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred),
mean_squared_error(test_label, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred, squared=False),
mean_squared_error(test_label, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(train_label, y_train_pred),
r2_score(test_label, y_test_pred)))
When comparing the model metrics to the 10-fold cross validation trials using Hyperopt, the MAE, MSE and RMSE for both the train and the test sets are higher and the R² is lower for both sets. Therefore, the best performing CatBoost model came from the Hyperopt 100 trials using 10-fold cross validation.
We can also evaluate the MSE on the test set and determine when this was achieved:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
Model Explanations¶
Using the function defined in the Hyperopt section, we can plot the feature importance from the best model.
plt.rcParams.update({'font.size': 10})
plot_feature_importance(best_bayes_model.get_feature_importance(),
train_features.columns, 'Catboost')
The features horsepower, year and mileage are the most important, followed by wheel_system_display, back_legroom, horsepower_rpm, height, width, front_legroom, wheelbase, engine_displacement, highway_fuel_economy and savings_amount.
After evaluating the two different packages for hyperparameter tuning, and cross validation vs. train/validation/test sets during the tuning process for CatBoost, the features horsepower, year and mileage are the most important in every case, followed by slightly different orderings of feature importance depending on the approach.
SHAP (SHapley Additive exPlanations)¶
Let's use the training set and summarize the effects of all the features.
shap.initjs()
explainer = shap.TreeExplainer(best_bayes_model)
shap_values = explainer.shap_values(train_features)
shap.summary_plot(shap_values, train_features)
plt.gcf().axes[-1].set_aspect(1000)
plt.gcf().axes[-1].set_box_aspect(1000)
plt.show();
Then we can use the test set and summarize the effects of all the features.
shap.initjs()
shap_values = explainer.shap_values(test_features)
shap.summary_plot(shap_values, test_features)
plt.gcf().axes[-1].set_aspect(1000)
plt.gcf().axes[-1].set_box_aspect(1000)
plt.show();
LightGBM¶
The paper LightGBM: A Highly Efficient Gradient Boosting Decision Tree introduced two new techniques called Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).
GOSS is a new gradient boosting decision tree (GBDT) sampling algorithm built on the assumption that instances with larger gradients (greater than a specified threshold or within the top percentiles) contribute more to information gain. Therefore, the instances with large gradients are retained and only the instances with small gradients are randomly dropped when down-sampling the data. This results in more accurate gain estimation than uniformly random sampling at the same target sampling rate, especially when the information gain has a large range.
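A minimal sketch may help make the GOSS sampling idea concrete (illustrative only, not LightGBM's implementation; the rates and the helper name are assumptions):
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=42):
    # Keep the top_rate fraction with the largest |gradient|, randomly
    # sample other_rate of the remainder, and up-weight those samples
    # so the gradient statistics stay approximately unbiased.
    rng = np.random.default_rng(seed)
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top_idx = order[:n_top]
    other_idx = rng.choice(order[n_top:], size=n_other, replace=False)
    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1 - top_rate) / other_rate  # amplify small gradients
    return np.concatenate([top_idx, other_idx]), weights

idx, w = goss_sample(np.random.default_rng(0).normal(size=1000))
print(len(idx), w[:2], w[-2:])  # 300 instances kept; weights 1.0 and 8.0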
EFB addresses the problem where a large number of features are present but, in reality, the feature space is quite sparse. It utilizes a graph approach where the features are vertices, edges are added between every two features that are not mutually exclusive, and a greedy algorithm with a constant approximation ratio is leveraged to solve the bundling problem.
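Similarly, a small greedy-bundling sketch illustrates the EFB idea (again illustrative only; max_conflicts and the function name are assumptions):
import numpy as np

def greedy_bundles(X, max_conflicts=0):
    # Greedily place each feature into the first bundle it does not
    # conflict with; a conflict is a row where both are nonzero.
    bundles, bundle_nonzero = [], []
    for j in range(X.shape[1]):
        nz = X[:, j] != 0
        for k, used in enumerate(bundle_nonzero):
            if np.sum(used & nz) <= max_conflicts:
                bundles[k].append(j)
                bundle_nonzero[k] = used | nz
                break
        else:
            bundles.append([j])
            bundle_nonzero.append(nz)
    return bundles

X = np.array([[1, 0, 0], [0, 2, 0], [0, 0, 3], [4, 0, 0]])
print(greedy_bundles(X))  # [[0, 1, 2]] - mutually exclusive, one bundle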
Since Google Colaboratory was utilized to run LightGBM with a GPU, the environment needs to be set up by mounting the drive and cloning the LightGBM repository from Github.
%cd /content/drive/MyDrive/
!git clone --recursive https://github.com/Microsoft/LightGBM
Then we can navigate to the directory and create a build directory within the cloned repository where we can use cmake to compile and build.
%cd /content/drive/MyDrive/LightGBM
!mkdir build
%cd build
!cmake -DUSE_GPU=1 ..
!make -j$(nproc)
Next, we can install pip, setuptools and the required dependencies to set up the environment for utilizing LightGBM with GPU capabilities.
!sudo apt-get -y install python-pip
!sudo -H pip install setuptools optuna plotly eli5 shap lime -U
Now, we can move to the python-package subdirectory and use setup.py to compile the dependencies needed so that data will be processed on GPUs when using the LightGBM package.
%cd /content/drive/MyDrive/LightGBM/python-package
!sudo python3 setup.py install --precompile --gpu
The Hyperopt notebook is here. Let's now set up the environment by importing the required packages, setting the options, setting the random and numpy seeds and determining the CUDA compiler and GPU characteristics.
import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')
seed_value = 42
os.environ['usedCars_lgbGPU'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
print('\n')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
import pandas as pd
trainDF = pd.read_csv('usedCars_trainSet.csv', low_memory=False)
X_train = trainDF.drop(['price'], axis=1)
y_train = trainDF['price']
testDF = pd.read_csv('usedCars_testSet.csv', low_memory=False)
X_test = testDF.drop(['price'], axis=1)
y_test = testDF['price']
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
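One caveat with encoding the train and test sets separately: pd.get_dummies can produce mismatched columns if a category appears in only one set. A hedged safeguard (an assumption; it may be unnecessary if every category occurs in both sets) is to align the two frames:
# Keep the training columns and fill categories absent from the test set with 0
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)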
Baseline Model¶
Let's now fit the baseline model to the data, save as a .pkl file and evaluate the model metrics for the baseline model.
from lightgbm import LGBMRegressor
import pickle
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
lgb = LGBMRegressor(device='gpu',
gpu_platform_id=0,
gpu_device_id=0,
random_state=seed_value,
objective='regression',
metric='rmse',
verbosity=-1)
lgb.fit(X_train, y_train)
Pkl_Filename = 'LightGBM_Baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(lgb, file)
print('\nModel Metrics for LightGBM Baseline')
y_train_pred = lgb.predict(X_train)
y_test_pred = lgb.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(y_train, y_train_pred),
mean_absolute_error(y_test, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred, squared=False),
mean_squared_error(y_test, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))
Hyperopt: 300 Trials 10-Fold Cross Validation¶
We can now define the number of trials, the same k-folds using shuffled n_splits=10 for reproducibility and the parameter space:
random_state: seed_value designated for reproducibility. default=None.
device_type: Device for the tree learning. default=cpu.
objective: Specify the learning task and the corresponding learning objective or a custom objective function to be used. default=None (regression).
metric: Metric(s) to be evaluated on the evaluation set(s). default="".
verbosity: Controls the level of LightGBM's verbosity. default=1.
n_estimators: Number of boosted trees to fit. default=100.
learning_rate: Boosting learning rate. default=0.1.
max_depth: Maximum tree depth for base learners, <=0 means no limit. default=-1.
num_leaves: Maximum tree leaves for base learners. default=31.
boosting_type: default='gbdt'. gbdt: traditional Gradient Boosting Decision Tree. dart: Dropouts meet Multiple Additive Regression Trees. rf: Random Forest. Here a nested space also tunes the gbdt subsample ratio ('gdbt_subsample', 0.5 to 0.95).
colsample_bytree: Subsample ratio of columns when constructing each tree. default=1.
reg_alpha: L1 regularization term on weights. default=0.
reg_lambda: L2 regularization term on weights. default=0.
from sklearn.model_selection import KFold
from hyperopt import hp
NUM_EVAL = 300
kfolds = KFold(n_splits=10, shuffle=True, random_state=seed_value)
lgb_tune_kwargs = {
'random_state': seed_value,
'device_type': 'gpu',
'objective': 'regression',
'metric': 'rmse',
'verbosity' : -1,
'n_estimators': hp.choice('n_estimators', np.arange(400, 700, dtype=int)),
'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(1)),
'max_depth': hp.choice('max_depth', np.arange(5, 12, dtype=int)),
'num_leaves': hp.choice('num_leaves', np.arange(70, 150, dtype=int)),
'boosting_type': hp.choice('boosting_type',
[{'boosting_type': 'gbdt',
'subsample': hp.uniform('gdbt_subsample', 0.5,
0.95)}]),
'colsample_bytree': hp.uniform('colsample_by_tree', 0.75, 1.0),
'reg_alpha': hp.uniform('reg_alpha', 0.0, 1.0),
'reg_lambda': hp.uniform('reg_lambda', 0.0, 1.0)
}
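Before starting the search, it can help to sanity-check the space by drawing one random configuration from it. hyperopt.pyll.stochastic.sample returns a concrete sample, which makes it easy to verify the ranges and types (a quick check, not part of the tuning itself):
from hyperopt.pyll.stochastic import sample
# Draw one random configuration from the space to inspect ranges and types
print(sample(lgb_tune_kwargs))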
Now, we can define a function lgb_hpo for optimizing the hyperparameters. Within this function, we can use joblib to save a .pkl file that can be reloaded if more training is needed. We can define the trial number with ITERATION, which increases by 1 each trial to keep track of the parameters tested. The parameters which are integers need to be configured to remain integers rather than floats; this applies to max_depth and num_leaves. Then the model type, LGBMRegressor, needs to be defined.
Then we can fit the model on the training set using 'neg_root_mean_squared_error' as the scoring metric with 10-fold cross validation, where the np.mean of the negated cross_val_score scores is saved as the rmse for each trial. The trial loss=rmse, the params tested from the lgb_tune_kwargs space, the trial number (ITERATION), the time to complete the trial (run_time) and whether the trial completed successfully (STATUS_OK) are then appended as a row to the defined .csv file.
from lightgbm import LGBMRegressor
from timeit import default_timer as timer
from sklearn.model_selection import cross_val_score
import csv
from hyperopt import STATUS_OK
def lgb_hpo(config):
"""
Objective function to tune a LightGBMRegressor model.
"""
global ITERATION
ITERATION += 1
subsample = config['boosting_type'].get('subsample', 1.0)
config['boosting_type'] = config['boosting_type']['boosting_type']
config['subsample'] = subsample
for param_name in ['max_depth', 'num_leaves']:
config[param_name] = int(config[param_name])
lgb = LGBMRegressor(**config)
start = timer()
scores = -cross_val_score(lgb, X_train, y_train,
scoring='neg_root_mean_squared_error',
cv=kfolds)
run_time = timer() - start
rmse = np.mean(scores)
    of_connection = open(out_file, 'a')
    writer = csv.writer(of_connection)
    writer.writerow([rmse, config, ITERATION, run_time])
    of_connection.close()
return {'loss': rmse, 'params': config, 'iteration': ITERATION,
'train_time': run_time, 'status': STATUS_OK}
Let's now define an out_file to save the results, where the headers will be written to the file. Then we can set the global variable ITERATION and define the Hyperopt Trials object as bayesOpt_trials.
We can utilize if/else conditional statements to load a .pkl file if it exists, and then, utilizing fmin, specify the training function lgb_hpo, the parameter space lgb_tune_kwargs, the algorithm for optimization tpe.suggest, the number of trials to evaluate NUM_EVAL, the trials object bayesOpt_trials and the random state np.random.RandomState(42). We can now begin the hyperparameter optimization (HPO) trials.
from hyperopt import tpe, Trials
out_file = '/content/drive/MyDrive/UsedCarsCarGurus/Models/ML/lightGBM/Hyperopt/trialOptions/lightGBM_CV_trials_300_GPU.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)
writer.writerow(['loss', 'params', 'iteration', 'train_time'])
of_connection.close()
tpe_algorithm = tpe.suggest
global ITERATION
ITERATION = 0
bayesOpt_trials = Trials()
from datetime import datetime, timedelta
import joblib
from hyperopt import fmin
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('lightGBM_CV_Hyperopt_300_GPU.pkl'):
bayesOpt_trials = joblib.load('lightGBM_CV_Hyperopt_300_GPU.pkl')
best_param = fmin(lgb_hpo, lgb_tune_kwargs, algo=tpe.suggest,
max_evals=NUM_EVAL, trials=bayesOpt_trials,
rstate=np.random.RandomState(42))
else:
best_param = fmin(lgb_hpo, lgb_tune_kwargs, algo=tpe.suggest,
max_evals=NUM_EVAL, trials=bayesOpt_trials,
rstate=np.random.RandomState(42))
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
Let's now sort the trials with lowest loss first and examine the lowest two losses:
bayesOpt_trials_results = sorted(bayesOpt_trials.results,
key=lambda x: x['loss'])
print('Top two trials with the lowest loss (lowest RMSE)')
print(bayesOpt_trials_results[:2])
Let's now access the results in the trialOptions directory by reading them into a pandas.DataFrame sorted with the best scores on top and then resetting the index for slicing. Then we can convert the trial parameters from a string to a dictionary for later use. Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then save it as a .pkl file. Then we can evaluate the model metrics for both the train and test sets using MAE, MSE, RMSE and R².
import ast
import pickle
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
results = pd.read_csv('lightGBM_CV_trials_300_GPU.csv')
results.sort_values('loss', ascending=True, inplace=True)
results.reset_index(inplace=True, drop=True)
results.to_csv('lightGBM_CV_trials_300_GPU.csv')
ast.literal_eval(results.loc[0, 'params'])
best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()
best_bayes_model = LGBMRegressor(device='gpu',
gpu_platform_id=0,
gpu_device_id=0,
**best_bayes_params)
best_bayes_model.fit(X_train, y_train)
Pkl_Filename = 'lightGBM_CV_HPO_300_GPU.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_bayes_model, file)
print('\nModel Metrics for lightGBM HPO Hyperopt 300 trials')
y_train_pred = best_bayes_model.predict(X_train)
y_test_pred = best_bayes_model.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(y_train, y_train_pred),
mean_absolute_error(y_test, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred, squared=False),
mean_squared_error(y_test, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))
Compared to the baseline model, there are lower values for MAE, MSE and RMSE and a higher R² for both train/test sets after hyperparameter tuning.
We can also evaluate the MSE on the test set and determine when this was achieved:
print('The best model from Bayes optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(y_test,
y_test_pred)))
print('This was achieved after {} search iterations'.format(results.loc[0,
'iteration']))
Results from the Hyperparameter Search¶
Let's now create a new pandas.DataFrame where the parameters examined during the search are stored, with the results of each parameter as a different column. Then we can convert the data types for graphing and examine the distributions of the quantitative hyperparameters that were assessed during the search.
import matplotlib.pyplot as plt
import seaborn as sns
bayes_params = pd.DataFrame(columns=list(ast.literal_eval(results.loc[0,
'params']).keys()),
index=list(range(len(results))))
for i, params in enumerate(results['params']):
bayes_params.loc[i,:] = list(ast.literal_eval(params).values())
bayes_params['loss'] = results['loss']
bayes_params['iteration'] = results['iteration']
bayes_params['learning_rate'] = bayes_params['learning_rate'].astype('float64')
bayes_params['colsample_bytree'] = bayes_params['colsample_bytree'].astype('float64')
bayes_params['num_leaves'] = bayes_params['num_leaves'].astype('float64')
bayes_params['reg_alpha'] = bayes_params['reg_alpha'].astype('float64')
bayes_params['reg_lambda'] = bayes_params['reg_lambda'].astype('float64')
bayes_params['subsample'] = bayes_params['subsample'].astype('float64')
for i, hpo in enumerate(bayes_params.columns):
if hpo not in ['boosting_type', 'iteration', 'subsample', 'force_col_wise',
'max_depth', 'device_type', 'verbosity', 'random_state',
'objective', 'metric']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(bayes_params[hpo], label='Bayes Optimization')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
Higher values of colsample_bytree, n_estimators, num_leaves and reg_lambda in the given hyperparameter space performed better, while lower values of learning_rate and reg_alpha generated a lower loss.
Let's now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 3, figsize=(20,5))
i = 0
for i, hpo in enumerate(['learning_rate', 'num_leaves', 'colsample_bytree']):
    sns.regplot(x='iteration', y=hpo, data=bayes_params, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
colsample_bytree increases over trials while learning_rate and num_leaves do not tend to follow a trend.
Next, we can examine if there were any trends regarding the regularization parameters over the trials.
fig, axs = plt.subplots(1, 2, figsize=(14,6))
i = 0
for i, hpo in enumerate(['reg_alpha', 'reg_lambda']):
    sns.regplot(x='iteration', y=hpo, data=bayes_params, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
Over the trials, the reg_alpha hyperparameter decreased while reg_lambda increased.
Model Explanations¶
Model Metrics with ELI5¶
Let's now utilize PermutationImportance from eli5.sklearn, a method for the global interpretation of a model that outputs the magnitude of a feature's effect but not its direction. It shuffles a feature, generates predictions on the shuffled set, and then calculates the decrease in the specified metric, in this case the model's metric, relative to the unshuffled set. The more important a feature is, the greater the model error becomes when that feature is shuffled.
class PermutationImportance(estimator, scoring=None, n_iter=5, random_state=None, cv='prefit', refit=True)
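The idea is simple enough to sketch by hand. The following is a minimal illustration of the mechanism, not the eli5 implementation; permutation_rmse_importance and n_repeats are hypothetical names chosen here, and it assumes the best_bayes_model, X_test and y_test objects from above.
# Sketch of permutation importance: shuffle one column at a time and
# measure how much the test RMSE degrades relative to the baseline
import numpy as np
from sklearn.metrics import mean_squared_error
def permutation_rmse_importance(model, X, y, n_repeats=5, seed=42):
    rng = np.random.RandomState(seed)
    baseline = mean_squared_error(y, model.predict(X), squared=False)
    importances = {}
    for col in X.columns:
        losses = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the column's relationship with the target
            X_perm[col] = rng.permutation(X_perm[col].values)
            losses.append(mean_squared_error(y, model.predict(X_perm),
                                             squared=False))
        # Larger increases in RMSE indicate more important features
        importances[col] = np.mean(losses) - baseline
    return importances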
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
import eli5
from eli5.sklearn import PermutationImportance
X_test1 = pd.DataFrame(X_test, columns=X_test.columns)
perm_importance = PermutationImportance(best_bayes_model,
random_state=seed_value).fit(X_test,
y_test)
html_obj = eli5.show_weights(perm_importance,
                             feature_names=X_test1.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).
from eli5.formatters import format_as_dataframe
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
The horsepower feature contains the highest weight (0.533249). The year feature has the next highest, but it is significantly lower (0.132819). This is followed by mileage (0.073229), width (0.070825) and height (0.056370).
SHAP (SHapley Additive exPlanations)¶
Let's use the training set and summarize the effects of all the features.
import shap
shap.initjs()
explainer = shap.TreeExplainer(best_bayes_model)
shap_values = explainer.shap_values(X_train)
shap.summary_plot(shap_values, X_train);
Then we can use the test set and summarize the effects of all the features.
shap.initjs()
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test);
100 Trials Train/Validation/Test¶
Let's now prepare the data for modeling by creating the dummy variables for the categorical features of the original training set. Then we can define the label as price and the features as everything except price by dropping the target. Next, we can extract the names of the features from the columns and convert them to a np.array.
train = pd.get_dummies(trainDF, drop_first=True)
label = train[['price']]
features = train.drop(columns = ['price'])
features = features.columns
features = np.array(features)
We can now define the number of trials, the same shuffled 10-fold split (n_splits=10) for reproducibility, and the parameter space. Compared to the 10-fold cross validation trials, let's use lower values for n_estimators (300-500 vs. 400-700), max_depth (5-6 vs. 5-12), num_leaves (30-100 vs. 70-150) and colsample_bytree (0.6-1 vs. 0.75-1), and higher values for gdbt_subsample (0.5-1 vs. 0.5-0.95). We can also test different kinds of boosting_type, so let's evaluate these trials using gbdt and dart with the same ranges for subsampling, and goss with subsample=1.0. We can utilize the same ranges for the learning_rate, reg_alpha and reg_lambda hyperparameters:
NUM_EVAL = 100
kf = KFold(n_splits=10, shuffle=True, random_state=seed_value)
lgb_tune_kwargs = {
'random_state': seed_value,
'device_type': 'gpu',
'objective': 'regression',
'metric': 'rmse',
'verbosity' : -1,
'n_estimators': hp.choice('n_estimators', np.arange(300, 500, dtype=int)),
'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(1)),
'max_depth': hp.choice('max_depth', np.arange(5, 6, dtype=int)),
'num_leaves': hp.choice('num_leaves', np.arange(30, 100, dtype=int)),
'boosting_type': hp.choice('boosting_type', [{'boosting_type': 'gbdt',
'subsample': hp.uniform('gdbt_subsample',
0.5,
1)},
{'boosting_type': 'dart',
'subsample': hp.uniform('dart_subsample',
0.5,
1)},
{'boosting_type': 'goss',
'subsample': 1.0}]),
'colsample_bytree': hp.uniform('colsample_by_tree', 0.6, 1.0),
'reg_alpha': hp.uniform('reg_alpha', 0.0, 1.0),
'reg_lambda': hp.uniform('reg_lambda', 0.0, 1.0)
}
Now, we can define a function lgb_hpo for optimizing the hyperparameters. Within this function, we can use joblib to save a .pkl file that can be reloaded if more training is needed. We can define the trial number with ITERATION, which increases by 1 each trial to keep track of the parameters tested. Then we can retrieve the value used for subsample if it is present, otherwise setting it to 1.0, and extract the boosting type. The parameters which are integers need to be configured to remain integers rather than floats, so let's allocate them for the max_depth and num_leaves hyperparameters.
Next, we can split the training set into train and validation sets by utilizing the defined variable kf in a for loop, splitting the training set 10-fold with shuffling and random_state=seed_value. This allows different splits to be fit and evaluated using the LGBMRegressor model. The validation set is then scored to obtain the trial rmse, which is saved. The trial loss=rmse, the params tested from the lgb_tune_kwargs space, the trial number (ITERATION), the time to complete the trial (run_time) and whether the trial completed successfully (STATUS_OK) are written to the defined .csv file, appended as a row for each trial.
def lgb_hpo(config):
"""
Objective function to tune a LightGBMRegressor model.
"""
global ITERATION
ITERATION += 1
subsample = config['boosting_type'].get('subsample', 1.0)
config['boosting_type'] = config['boosting_type']['boosting_type']
config['subsample'] = subsample
for param_name in ['max_depth', 'num_leaves']:
config[param_name] = int(config[param_name])
for trn_idx, val_idx in kf.split(train[features], label):
train_features, train_label = train[features].iloc[trn_idx], label.iloc[trn_idx]
val_features, val_label = train[features].iloc[val_idx], label.iloc[val_idx]
lgb = LGBMRegressor(**config)
start = timer()
lgb.fit(train_features, train_label,
eval_set = [(val_features, val_label),
(train_features, train_label)])
run_time = timer() - start
y_pred_val = lgb.predict(val_features)
rmse = mean_squared_error(val_label, y_pred_val, squared=False)
    of_connection = open(out_file, 'a')
    writer = csv.writer(of_connection)
    writer.writerow([rmse, config, ITERATION, run_time])
    of_connection.close()
return {'loss': rmse, 'params': config, 'iteration': ITERATION,
'train_time': run_time, 'status': STATUS_OK}
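Note that rmse is reassigned on every fold, so the loss returned to Hyperopt reflects only the final split; each trial is effectively scored on a single train/validation split, which is consistent with the much shorter runtime reported below. If a fold-averaged validation error were preferred, a minimal variation could accumulate the per-fold scores. This is a sketch, assuming the same kf, train, label and features objects, taking config as a flat dictionary of LGBMRegressor keyword arguments; fold_averaged_rmse is a hypothetical name.
# Sketch: average the validation RMSE across all folds instead of
# keeping only the last fold's score
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error
def fold_averaged_rmse(config):
    fold_scores = []
    for trn_idx, val_idx in kf.split(train[features], label):
        model = LGBMRegressor(**config)
        # Flatten the single-column label frame to a 1-D array for fitting
        model.fit(train[features].iloc[trn_idx],
                  label.iloc[trn_idx].values.ravel())
        y_pred_val = model.predict(train[features].iloc[val_idx])
        fold_scores.append(mean_squared_error(label.iloc[val_idx],
                                              y_pred_val, squared=False))
    # The trial loss becomes the mean validation RMSE over the folds
    return np.mean(fold_scores)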
Let's now define an out_file to save the results, where the headers will be written to the file. Then we can set the global variable ITERATION and define the Hyperopt Trials object as bayesOpt_trials.
We can utilize if/else conditional statements to load a .pkl file if it exists, and then, utilizing fmin, specify the training function lgb_hpo, the parameter space lgb_tune_kwargs, the algorithm for optimization tpe.suggest, the number of trials to evaluate NUM_EVAL, the trials object bayesOpt_trials and the random state np.random.RandomState(42). We can now begin the HPO trials.
out_file = '/content/drive/MyDrive/UsedCarsCarGurus/Models/ML/lightGBM/Hyperopt/trialOptions/lightGBM_trials_100_GPU_val.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)
writer.writerow(['loss', 'params', 'iteration', 'train_time'])
of_connection.close()
global ITERATION
ITERATION = 0
bayesOpt_trials = Trials()
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('lightGBM_Hyperopt_100_GPU_val.pkl'):
    bayesOpt_trials = joblib.load('lightGBM_Hyperopt_100_GPU_val.pkl')
best_param = fmin(lgb_hpo, lgb_tune_kwargs, algo=tpe.suggest,
max_evals=NUM_EVAL, trials=bayesOpt_trials,
rstate=np.random.RandomState(42))
else:
best_param = fmin(lgb_hpo, lgb_tune_kwargs, algo=tpe.suggest,
max_evals=NUM_EVAL, trials=bayesOpt_trials,
rstate=np.random.RandomState(42))
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
Compared to the completion time for 300 trials using 10-fold cross validation (6:46:24) on the same GPU, a Tesla P100-PCIE, this completed in roughly one-sixth of the time. We still need to evaluate the model metrics for comparison, though.
Let's now sort the trials with lowest loss first and examine the lowest two losses:
bayesOpt_trials_results = sorted(bayesOpt_trials.results,
key=lambda x: x['loss'])
print('Top two trials with the lowest loss (lowest RMSE)')
print(bayesOpt_trials_results[:2])
Let's now access the results in the trialOptions directory by reading them into a pandas.DataFrame sorted with the best scores on top and then resetting the index for slicing. Then we can convert the trial parameters from a string to a dictionary for later use. Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then save it as a .pkl file. Then we can evaluate the model metrics for both the train and test sets using MAE, MSE, RMSE and R².
results = pd.read_csv('lightGBM_trials_100_GPU_val.csv')
results.sort_values('loss', ascending=True, inplace=True)
results.reset_index(inplace=True, drop=True)
results.to_csv('lightGBM_trials_100_GPU_val.csv')
ast.literal_eval(results.loc[0, 'params'])
best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()
train_label = train[['price']]
train_features = train.drop(columns=['price'])
test_label = testDF[['price']]
test_features = testDF.drop(columns=['price'])
test_features = pd.get_dummies(test_features, drop_first=True)
best_model = LGBMRegressor(device='gpu',
                           gpu_platform_id=0,
                           gpu_device_id=0,
                           **best_bayes_params)
best_model.fit(train_features, train_label)
Pkl_Filename = 'lightGBM_HPO_100_GPU_val.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for lightGBM HPO Hyperopt 100 trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(train_label, y_train_pred),
        mean_absolute_error(test_label, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(train_label, y_train_pred),
        mean_squared_error(test_label, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(train_label, y_train_pred, squared=False),
        mean_squared_error(test_label, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(train_label, y_train_pred),
        r2_score(test_label, y_test_pred)))
Compared to the baseline model, there are lower values for MAE, MSE and RMSE and a higher R² for both the train and test sets after hyperparameter tuning. With 300 trials, the model overfit the training set more relative to the test set, while with 100 trials the train and test metrics were more balanced.
We can also evaluate the MSE on the test set and determine when this was achieved:
print('The best model from Bayes optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
y_test_pred)))
print('This was achieved after {} search iterations'.format(results.loc[0,
'iteration']))
Results from the Hyperparameter Search¶
Let's now create a new pandas.DataFrame where the parameters examined during the search are stored, with the results of each parameter as a different column. Then we can convert the data types for graphing and examine the distributions of the quantitative hyperparameters that were assessed during the search.
bayes_params = pd.DataFrame(columns=list(ast.literal_eval(results.loc[0,
'params']).keys()),
index=list(range(len(results))))
for i, params in enumerate(results['params']):
bayes_params.loc[i,:] = list(ast.literal_eval(params).values())
bayes_params['loss'] = results['loss']
bayes_params['iteration'] = results['iteration']
bayes_params['learning_rate'] = bayes_params['learning_rate'].astype('float64')
bayes_params['colsample_bytree'] = bayes_params['colsample_bytree'].astype('float64')
bayes_params['num_leaves'] = bayes_params['num_leaves'].astype('float64')
bayes_params['reg_alpha'] = bayes_params['reg_alpha'].astype('float64')
bayes_params['reg_lambda'] = bayes_params['reg_lambda'].astype('float64')
bayes_params['subsample'] = bayes_params['subsample'].astype('float64')
for i, hpo in enumerate(bayes_params.columns):
if hpo not in ['boosting_type', 'iteration', 'subsample', 'force_col_wise',
'max_depth', 'device_type', 'verbosity', 'random_state',
'objective', 'metric']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(bayes_params[hpo], label='Bayes Optimization')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
Lower values of colsample_bytree, learning_rate and reg_lambda, together with higher values of n_estimators and num_leaves, performed better in the given hyperparameter space to generate a lower loss.
Compared to the 10-fold cross validation model, there were differences in the densities of the colsample_bytree and the reg_lambda hyperparameters.
We can map the boosting_type to integer (essentially label encoding) and plot the boosting type over the search.
bayes_params['boosting_int'] = bayes_params['boosting_type'].replace({'goss': 0,
'dart': 1,
'gbdt': 2})
plt.plot(bayes_params['iteration'], bayes_params['boosting_int'], 'ro')
plt.yticks([0, 1, 2], ['goss', 'dart', 'gbdt']);
plt.xlabel('Iteration'); plt.title('Boosting Type over trials')
plt.show()
More of the trial iterations tested dart for the boosting_type compared to gbdt and goss.
Let's now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 3, figsize=(20,5))
i = 0
for i, hpo in enumerate(['learning_rate', 'num_leaves', 'colsample_bytree']):
    sns.regplot(x='iteration', y=hpo, data=bayes_params, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
The colsample_bytree parameter decreased over the trials, compared to increasing when using cross validation, while the learning_rate might have slightly increased. There was no clear trend for the num_leaves parameter over the trials.
Next, we can examine if there were any trends regarding the regularization parameters over the trials.
fig, axs = plt.subplots(1, 2, figsize=(14,6))
i = 0
for i, hpo in enumerate(['reg_alpha', 'reg_lambda']):
    sns.regplot(x='iteration', y=hpo, data=bayes_params, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
Over the trials, the reg_lambda hyperparameter decreased, while there was no trend for reg_alpha. For cross validation, reg_lambda increased over the trials.
Model Explanations¶
Model Metrics with ELI5¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
X_test1 = pd.DataFrame(test_features, columns=test_features.columns)
perm_importance = PermutationImportance(best_model,
random_state=seed_value).fit(test_features,
test_label)
html_obj = eli5.show_weights(perm_importance,
                             feature_names=X_test1.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
When comparing the weights from the train/validation/test model to those generated using 10-fold cross validation, horsepower was lower (0.426414 vs. 0.533249) and year was lower (0.110301 vs. 0.132819), while height (0.098178 vs. 0.056370), mileage (0.080769 vs. 0.073229) and width (0.097303 vs. 0.070825) were all higher.
SHAP (SHapley Additive exPlanations)¶
Let's use the training set and summarize the effects of all the features.
shap.initjs()
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(train_features)
shap.summary_plot(shap_values, train_features);
Then we can use the test set and summarize the effects of all the features.
shap.initjs()
shap_values = explainer.shap_values(test_features)
shap.summary_plot(shap_values, test_features);
Optuna: 100 Trials 10-Fold Cross Validation¶
Let's now tune LightGBM with Optuna, first setting up the environment by importing the dependencies, setting the options and seed, and examining the CUDA and GPU characteristics.
import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')
seed_value = 42
os.environ['usedCars_lgbGPU'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
print('\n')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
Let's read the data and define the label/features with price as the model target. Then create dummy variables for the categorical variables and use the column names to extract feature names.
import pandas as pd
trainDF = pd.read_csv('usedCars_trainSet.csv', low_memory=False)
testDF = pd.read_csv('usedCars_testSet.csv', low_memory=False)
train_label = trainDF[['price']]
test_label = testDF[['price']]
train_features = trainDF.drop(columns = ['price'])
test_features = testDF.drop(columns = ['price'])
train_features = pd.get_dummies(train_features, drop_first=True)
test_features = pd.get_dummies(test_features, drop_first=True)
features = train_features.columns
features = np.array(features)
We can leverage optuna.integration.wandb to set up the callbacks that will be saved to Weights & Biases. First, we log in and set up the arguments, which include the name of the project, the entity saving the results, the group the study belongs to, whether code is saved, and notes about the study for future reference.
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)
import wandb
from optuna.integration.wandb import WeightsAndBiasesCallback
wandb.login()
wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
'group': 'optuna_lgbm100gpu_T4_cv_kfold10',
'save_code': 'False',
'notes': 'optuna_lgbm100gpu_T4_cv_kfold10'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
Now, let's set up the W & B callbacks using @wandbc.track_in_wandb(). Then we can define a function objective for the optimization of hyperparameters using an Optuna study with a new pickle file and the parameter space to evaluate different combinations. Then we can create a lgb.Dataset using the training set and the hyperparameter space as a dictionary, and define the model type as LGBMRegressor specifying the parameters that will be used for each trial. We can then leverage cross validation from the lightgbm package by using lgb.cv with the params, the train_set and 10-fold cross validation by specifying nfold=10 and stratified=False. This performs timed cross validation trials with the goal of finding the lowest averaged rmse on the validation folds.
import joblib
from sklearn.model_selection import cross_val_score, KFold
import lightgbm as lgb
from lightgbm import LGBMRegressor
from timeit import default_timer as timer
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a LightGBMRegressor model.
"""
joblib.dump(study, 'lightGBM_Optuna_100_GPU_T4_wb_CV.pkl')
params = {'verbose': -1,
'device': 'gpu',
'gpu_platform_id': 0,
'gpu_device_id': 0,
'boosting_type': 'gbdt',
'objective':'regression',
'metric': 'rmse',
'seed': seed_value,
'early_stopping_rounds': 150,
'n_estimators': trial.suggest_int('n_estimators', 925, 935),
'learning_rate': trial.suggest_loguniform('learning_rate', 0.0075,
0.008),
'num_leaves': trial.suggest_int('num_leaves', 350, 365),
'bagging_freq': trial.suggest_int('bagging_freq', 8, 10),
'subsample': trial.suggest_float('subsample', 0.85, 0.95),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.77,
0.81),
'max_depth': trial.suggest_int('max_depth', 9, 11),
'lambda_l1': trial.suggest_float('lambda_l1', 0.02, 0.03,
log=True),
'lambda_l2': trial.suggest_float('lambda_l2', 0.001, 0.002,
log=True),
              'min_child_samples': trial.suggest_int('min_child_samples',
                                                     540, 550)
              }
    train_set = lgb.Dataset(train_features, label=train_label, params=params)
model = LGBMRegressor(**params)
start = timer()
cv_results = lgb.cv(params, train_set, nfold=10, stratified=False)
run_time = timer() - start
rmse = round(cv_results['valid rmse-mean'][-1], 4)
return rmse
Now let's begin the optimization with Weights & Biases tracking the loss and associated parameters. The optimization is tracked and the trial run components are located here.
from datetime import datetime, timedelta
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('lightGBM_Optuna_100_GPU_T4_wb_CV.pkl'):
study = joblib.load('lightGBM_Optuna_100_GPU_T4_wb_CV.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100, callbacks=[wandbc])
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
Let's now extract the trial number, rmse and hyperparameter values into a pandas.DataFrame and sort with the lowest error first.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration',
                          'value': 'rmse',
                          'params_bagging_freq': 'bagging_freq',
                          'params_colsample_bytree': 'colsample_bytree',
                          'params_lambda_l1': 'lambda_l1',
                          'params_lambda_l2': 'lambda_l2',
                          'params_learning_rate': 'learning_rate',
                          'params_max_depth': 'max_depth',
                          'params_min_child_samples': 'min_child_samples',
                          'params_n_estimators': 'n_estimators',
                          'params_num_leaves': 'num_leaves',
                          'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
Results from the Hyperparameter Search¶
Let's utilize plot_optimization_history, which shows the scores from all trials as well as the best score so far at each point. This search did not contain extreme outliers for the objective value, so the plot is useful for examining the study output.
fig = optuna.visualization.plot_optimization_history(study)
fig.show()
Next, we can use plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search and plot_slice to compare the objective value and individual parameters.
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()
fig = optuna.visualization.plot_slice(study)
fig.show()
From these plots, lower rmse occurred with lower colsample_bytree together with higher lambda_l1, lambda_l2, learning_rate, max_depth, n_estimators and subsample.
Let's now examine the distributions of the quantitative hyperparameters that were assessed during the search.
import matplotlib.pyplot as plt
import seaborn as sns
for i, hpo in enumerate(trials_df.columns):
if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
'duration', 'n_estimators', 'state']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
Lower values of colsample_bytree, min_child_samples and num_leaves, together with higher values of lambda_l1, lambda_l2, learning_rate, max_depth and subsample, performed better in the given hyperparameter space to generate a lower loss.
We can now examine if/how the quantitative parameters changed over the trials run to see if any trends exist.
fig, axs = plt.subplots(2, 3, figsize=(20,5))
i = 0
axs = axs.flatten()
for i, hpo in enumerate(['learning_rate', 'num_leaves', 'colsample_bytree',
'max_depth', 'subsample', 'bagging_freq']):
    sns.regplot(x='iteration', y=hpo, data=trials_df, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
Over the trials, the hyperparameters colsample_bytree and num_leaves decreased while the hyperparameters learning_rate and subsample increased. The results from max_depth and bagging_freq over the trials are not clear.
Next, we can examine if there were any trends regarding the regularization parameters over the trials.
fig, axs = plt.subplots(1, 2, figsize=(20,5))
i = 0
for i, hpo in enumerate(['lambda_l1', 'lambda_l2']):
    sns.regplot(x='iteration', y=hpo, data=trials_df, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
Over the trials, lambda_l1 potentially increased, while there was no trend for lambda_l2. Increasing the number of trials might reveal more insight about this parameter.
Let's now visualize the parameter importances by utilizing the plot_param_importances feature from the Optuna package.
fig = optuna.visualization.plot_param_importances(study)
fig.show()
The max_depth was the most important parameter for the loss utilized, whereas for CatBoost the learning_rate was the most important parameter during the Optuna trials.
We can use plot_edf to plot the objective value empirical distribution function (EDF) for the study.
fig = optuna.visualization.plot_edf(study)
fig.show()
This plot demonstrates a narrow range of values for the loss from the models evaluated.
Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then save it as a .pkl file. Then we can evaluate the model metrics for both the train and test sets using MAE, MSE, RMSE and R².
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
train_label = trainDF[['price']]
train_features = trainDF.drop(columns=['price'])
test_label = testDF[['price']]
test_features = testDF.drop(columns=['price'])
test_features = pd.get_dummies(test_features, drop_first=True)
best_model = LGBMRegressor(boosting_type='gbdt',
device='gpu',
gpu_platform_id=0,
gpu_device_id=0,
verbosity=-1,
**params)
best_model.fit(train_features, train_label)
import pickle
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
Pkl_Filename = 'lightGBM_Optuna_trials100_GPU_T4_wb_CV.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for lightGBM HPO 100 GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(train_label, y_train_pred),
mean_absolute_error(test_label, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred),
mean_squared_error(test_label, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred, squared=False),
mean_squared_error(test_label, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(train_label, y_train_pred),
r2_score(test_label, y_test_pred)))
Compared to the baseline model, there are lower values for MAE, MSE and RMSE and a higher R² for both train/test sets after hyperparameter tuning.
We can also evaluate the MSE on the test set and determine when this was achieved:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
Model Explanations¶
SHAP (SHapley Additive exPlanations)¶
Let's use the training set and summarize the effects of all the features.
import shap
shap.initjs()
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(train_features)
shap.summary_plot(shap_values, train_features);
Then we can use the test set and summarize the effects of all the features.
shap.initjs()
shap_values = explainer.shap_values(test_features)
shap.summary_plot(shap_values, test_features);
Model Metrics with ELI5¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
import eli5
from eli5.sklearn import PermutationImportance
from eli5.formatters import format_as_dataframe
X_test1 = pd.DataFrame(test_features, columns=test_features.columns)
perm_importance = PermutationImportance(best_model,
random_state=seed_value).fit(test_features,
test_label)
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test1.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
The horsepower feature contains the highest weight (0.390659). mileage has the next highest, but the second weight is once again substantially lower than horsepower's, as in the previous models (0.132178). This is followed by width (0.082871), year (0.079460) and horsepower_rpm (0.034325), which is a different order than the previous model weights using ELI5.
XGBoost¶
A decision tree is a hierarchical model that utilizes conditional statements representing choices and their potential downstream outcomes. The tree structure is comprised of a root node, branches, internal nodes and leaf nodes. Gradient Boosted Decision Trees (GBDT) combine multiple decision trees in an ensemble, similar to Random Forest models; however, the methodologies for building and combining the trees differ. Gradient boosting is derived from boosting, where a single weak model is combined with other weak models with the objective of generating a more effective model. In gradient boosting, the weak models are generated additively by performing gradient descent on the objective function: each new model targets the error of the current prediction, which subsequent models then minimize. The weighted sum of all of the tree predictions is used for the final prediction.
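To make the additive idea concrete, the following is a toy sketch of gradient boosting for squared error, illustrative only and not taken from the project; the tree depth, number of trees and learning rate are arbitrary choices. Each tree fits the residuals, which are the negative gradient of the squared-error loss, and the final prediction is the base value plus the learning-rate-weighted sum of the tree predictions.
# Toy gradient boosting for squared error using shallow regression trees
import numpy as np
from sklearn.tree import DecisionTreeRegressor
def gbdt_fit(X, y, n_trees=50, learning_rate=0.1, max_depth=3):
    # A constant prediction (the mean) minimizes squared error at the start
    pred = np.full(len(y), y.mean())
    trees = []
    for _ in range(n_trees):
        # For squared error, the negative gradient is simply the residual
        residual = y - pred
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return y.mean(), trees
def gbdt_predict(base, trees, X, learning_rate=0.1):
    # Base value plus the weighted sum of all tree predictions
    return base + learning_rate * np.sum([t.predict(X) for t in trees],
                                         axis=0)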
Extreme Gradient Boosting (XGBoost) presented in XGBoost: A Scalable Tree Boosting System leverages trees built in parallel following a level-wise strategy that scans across the gradient values and uses these partial sums to evaluate the quality of splits at every possible split in the training set. XGBoost is built for model performance and computational speed with CPU and GPU support for the ML library.
The Hyperopt notebook is located here. Let's first set up the environment by installing/importing the dependencies, setting the options and seed followed by examining the CUDA and GPU characteristics.
!pip install xgboost==1.5.2
!pip install eli5
!pip install shap
import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')
seed_value = 42
os.environ['usedCars_xgbGPU'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
Then we can read the data and partition the train/test sets with price as the target. Next, we can create dummy variables for the categorical variables.
import pandas as pd
trainDF = pd.read_csv('usedCars_trainSet.csv', low_memory=False)
testDF = pd.read_csv('usedCars_testSet.csv', low_memory=False)
X_train = trainDF.drop(columns = ['price'])
y_train = trainDF[['price']]
X_test = testDF.drop(columns = ['price'])
y_test = testDF[['price']]
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
Baseline Model¶
Let's now fit the baseline model to the data, save as a .pkl file and evaluate the model metrics for the baseline model.
from xgboost import XGBRegressor
import pickle
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
xgb = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
random_state=seed_value,
verbosity=0)
xgb.fit(X_train, y_train)
Pkl_Filename = 'XGB_Baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(xgb, file)
print('\nModel Metrics for XGBoost Baseline')
y_train_pred = xgb.predict(X_train)
y_test_pred = xgb.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(y_train, y_train_pred),
mean_absolute_error(y_test, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred, squared=False),
mean_squared_error(y_test, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))
Hyperopt: 300 Trials 10-Fold Cross Validation¶
To set up Hyperopt, let's first define the number of trials with NUM_EVAL = 300. We need to utilize the same k-folds for reproducibility, so let's use 10 folds to split the training set into train and validation sets with shuffle=True and the initially defined seed value. The hyperparameters can then be defined in a dictionary. To define integers, hp.choice with np.arange and dtype=int is used, while float types are defined using hp.uniform. The space consists of 10 parameters, 4 integers and 6 floats:
n_estimators: Number of gradient boosted trees.
max_depth: Maximum tree depth for base learners.
subsample: Subsample ratio of the training instance.
learning_rate: Used for reducing the gradient step.
gamma: Minimum loss reduction required to make a further partition on a leaf node of the tree.
reg_alpha: L1 regularization term on weights (xgb's alpha).
reg_lambda: L2 regularization term on weights (xgb's lambda).
colsample_bytree: Subsample ratio of columns when constructing each tree.
colsample_bylevel: Subsample ratio of columns for each level.
min_child_weight: Minimum sum of instance weight (hessian) needed in a child.
from sklearn.model_selection import KFold
from hyperopt import hp
NUM_EVAL = 300
kfolds = KFold(n_splits=10, shuffle=True, random_state=seed_value)
xgb_tune_kwargs= {
'n_estimators': hp.choice('n_estimators', np.arange(100, 500, dtype=int)),
'max_depth': hp.choice('max_depth', np.arange(3, 10, dtype=int)),
'subsample': hp.uniform('subsample', 0.25, 0.75),
'gamma': hp.uniform('gamma', 0, 9),
'learning_rate': hp.uniform('learning_rate', 1e-4, 0.3),
'reg_alpha': hp.choice('reg_alpha', np.arange(0, 30, dtype=int)),
'reg_lambda': hp.uniform('reg_lambda', 0, 3),
'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1),
'colsample_bylevel': hp.uniform('colsample_bylevel', 0.05, 0.5),
'min_child_weight': hp.choice('min_child_weight', np.arange(0, 10,
dtype=int))
}
Now, we can define a function for the optimization of these hyperparameters. Within this function, we can use joblib to save a .pkl file that can be reloaded if more training is needed. We can define the trial number with ITERATION, which increases by 1 each trial to keep track of the parameters tested. The parameters which are integers need to be configured to remain integers rather than floats; this applies to n_estimators and max_depth, whose search range starts at max_depth=3. Then the model type, XGBRegressor, needs to be defined with the parameters that will be included in all of the trials during the search, which are:
objective='reg:squarederror': Specify the learning task and the corresponding learning objective or a custom objective function to be used.
booster='gbtree': Specify which booster to use: gbtree, gblinear or dart.
tree_method='gpu_hist': Specify which tree method to use. default=auto.
scale_pos_weight=1: Balancing of positive and negative weights.
use_label_encoder=False: Encodes labels.
random_state=seed_value: Random number seed.
verbosity=0: The degree of verbosity. Valid values are 0 (silent) to 3 (debug).
Then we can fit the model on the training set using 'neg_root_mean_squared_error' as the scoring metric with 10-fold cross validation, where the np.mean of the negated cross_val_score scores is saved as the rmse for each trial. The trial loss=rmse, the params tested from the xgb_tune_kwargs space, the trial number (iteration), the time to complete the trial (train_time) and whether the trial completed successfully (status) are written to the defined .csv file, appended as a row for each trial.
import joblib
from timeit import default_timer as timer
from sklearn.model_selection import cross_val_score
import csv
from hyperopt import STATUS_OK
kfolds = KFold(n_splits=10, shuffle=True, random_state=seed_value)
def xgb_hpo(config):
"""
Objective function to tune a XGBoostRegressor model
"""
joblib.dump(bayesOpt_trials, 'xgb_Hyperopt_300_GPU.pkl')
global ITERATION
ITERATION += 1
config['n_estimators'] = int(config['n_estimators'])
config['max_depth'] = int(config['max_depth']) + 3
xgb = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
early_stopping_rounds=10,
random_state=seed_value,
verbosity=0,
**config)
start = timer()
scores = -cross_val_score(xgb, X_train, y_train,
scoring='neg_root_mean_squared_error',
cv=kfolds)
run_time = timer() - start
rmse = np.mean(scores)
    of_connection = open(out_file, 'a')
    writer = csv.writer(of_connection)
    writer.writerow([rmse, config, ITERATION, run_time])
    of_connection.close()
return {'loss': rmse, 'params': config, 'iteration': ITERATION,
'train_time': run_time, 'status': STATUS_OK}
Let's now define an out_file to save the results, where the headers will be written to the file. Then we can set the global variable ITERATION and define the Hyperopt Trials object as bayesOpt_trials.
We can utilize if/else conditional statements to load a .pkl file if it exists, and then, utilizing fmin, specify the training function xgb_hpo, the parameter space xgb_tune_kwargs, the algorithm for optimization tpe.suggest, the number of trials to evaluate NUM_EVAL, the trials object bayesOpt_trials and the random state np.random.RandomState(42). We can now begin the HPO trials.
from hyperopt import tpe, Trials
tpe_algorithm = tpe.suggest
out_file = '/content/drive/MyDrive/UsedCarsCarGurus/Models/ML/XGBoost/Hyperopt/trialOptions/XGB_trials_300_GPU.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)
writer.writerow(['loss', 'params', 'iteration', 'train_time'])
of_connection.close()
global ITERATION
ITERATION = 0
bayesOpt_trials = Trials()
from datetime import datetime, timedelta
from hyperopt import fmin
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('xgb_Hyperopt_300_GPU.pkl'):
bayesOpt_trials = joblib.load('xgb_Hyperopt_300_GPU.pkl')
best_param = fmin(xgb_hpo, xgb_tune_kwargs, algo=tpe.suggest,
max_evals=NUM_EVAL, trials=bayesOpt_trials,
rstate=np.random.RandomState(42))
else:
best_param = fmin(xgb_hpo, xgb_tune_kwargs, algo=tpe.suggest,
max_evals=NUM_EVAL, trials=bayesOpt_trials,
rstate=np.random.RandomState(42))
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
Let's now sort the trials with lowest loss first and examine the lowest two losses:
bayesOpt_trials_results = sorted(bayesOpt_trials.results,
key=lambda x: x['loss'])
print('Top two trials with the lowest loss (lowest RMSE)')
print(bayesOpt_trials_results[:2])
Let's now access the results in the trialOptions directory by reading them into a pandas.DataFrame sorted with the best scores on top and then resetting the index for slicing. Then we can convert the trial parameters from a string to a dictionary for later use. Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then save it as a .pkl file. Then we can evaluate the model metrics for both the train and test sets using MAE, MSE, RMSE and R².
import ast
results = pd.read_csv('XGB_trials_300_GPU.csv')
results.sort_values('loss', ascending=True, inplace=True)
results.reset_index(inplace=True, drop=True)
ast.literal_eval(results.loc[0, 'params'])
best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()
best_bayes_model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0,
**best_bayes_params)
best_bayes_model.fit(X_train, y_train)
Pkl_Filename = 'XGB_HPO_trials300_GPU.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_bayes_model, file)
print('\nModel Metrics for XGBoost HPO 300 GPU trials')
y_train_pred = best_bayes_model.predict(X_train)
y_test_pred = best_bayes_model.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(y_train, y_train_pred),
mean_absolute_error(y_test, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train, y_train_pred, squared=False),
mean_squared_error(y_test, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))
Compared to the baseline model, there are lower values for MAE, MSE and RMSE and a higher R² for both train/test sets after hyperparameter tuning.
We can also evaluate the MSE on the test set and determine when this was achieved:
print('The best model from Bayes optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(y_test,
y_test_pred)))
print('This was achieved after {} search iterations'.format(results.loc[0, 'iteration']))
Results from the Hyperparameter Search¶
Let's now create a new pandas.DataFrame where the parameters examined during the search are stored, with the results of each parameter as a different column. Then we can convert the data types for graphing and examine the distributions of the quantitative hyperparameters that were assessed during the search.
import matplotlib.pyplot as plt
import seaborn as sns
bayes_params = pd.DataFrame(columns=list(ast.literal_eval(results.loc[0,
'params']).keys()),
index=list(range(len(results))))
for i, params in enumerate(results['params']):
bayes_params.loc[i,:] = list(ast.literal_eval(params).values())
bayes_params['loss'] = results['loss']
bayes_params['iteration'] = results['iteration']
bayes_params['colsample_bylevel'] = bayes_params['colsample_bylevel'].astype('float64')
bayes_params['colsample_bytree'] = bayes_params['colsample_bytree'].astype('float64')
bayes_params['gamma'] = bayes_params['gamma'].astype('float64')
bayes_params['learning_rate'] = bayes_params['learning_rate'].astype('float64')
bayes_params['reg_alpha'] = bayes_params['reg_alpha'].astype('float64')
bayes_params['reg_lambda'] = bayes_params['reg_lambda'].astype('float64')
bayes_params['subsample'] = bayes_params['subsample'].astype('float64')
for i, hpo in enumerate(bayes_params.columns):
if hpo not in ['iteration', 'subsample', 'force_col_wise',
'max_depth', 'min_child_weight', 'n_estimators']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(bayes_params[hpo], label='Bayes Optimization')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
The colsample_bylevel and reg_lambda parameters demonstrate lower loss at higher values, while lower values of the colsample_bytree, gamma and learning_rate parameters resulted in lower loss. The reg_alpha parameter demonstrated a bimodal distribution.
Let's now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 4, figsize=(20,5))
i = 0
for i, hpo in enumerate(['learning_rate', 'gamma', 'colsample_bylevel',
'colsample_bytree']):
    sns.regplot(x='iteration', y=hpo, data=bayes_params, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
The learning_rate, gamma and colsample_bytree decreased over the trials while colsample_bylevel increased. reg_alpha and reg_lambda did not demonstrate any trends over the trial iterations.
Model Explanations¶
Now, we can plot the feature importance from the best model.
from xgboost import plot_importance
plot_importance(best_bayes_model, max_num_features=15)
plt.show()
Using this approach, the most important features are daysonmarket, mileage, torque, torque_rpm, savings_amount, back_legroom, city_fuel_economy, height, length, highway_fuel_economy, front_legroom, fuel_tank_volume, wheelbase, width and engine_displacement.
SHAP (SHapley Additive exPlanations)¶
Let's use the training set and summarize the effects of all the features.
import shap
shap.initjs()
explainer = shap.TreeExplainer(best_bayes_model)
shap_values = explainer.shap_values(X_train)
shap.summary_plot(shap_values, X_train);
Then we can use the test set and summarize the effects of all the features.
shap.initjs()
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test);
Model Metrics with ELI5¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
import eli5
from eli5.sklearn import PermutationImportance
X_test1 = pd.DataFrame(X_test, columns=X_test.columns)
perm_importance = PermutationImportance(best_bayes_model,
random_state=seed_value).fit(X_test,
y_test)
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test1.columns.tolist())
html_obj
The horsepower feature has the highest weight (0.1222) followed by mileage, width and year.
Optuna: 300 Trials 10-Fold Cross Validation¶
The Optuna notebooks can be found here. Let's first set up the environment by installing/importing the dependencies, setting the options and seed followed by examining the CUDA and GPU characteristics.
!pip install --upgrade -q wandb
!pip install xgboost==1.5.2
!pip install optuna
!pip install eli5
!pip install shap
import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')
seed_value = 42
os.environ['usedCars_xgbGPU'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
print('\n')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
Then the data can be read into a pandas.DataFrame, the label and features defined, followed by creating the dummy variables for the categorical features.
import pandas as pd
trainDF = pd.read_csv('usedCars_trainSet.csv', low_memory=False)
testDF = pd.read_csv('usedCars_testSet.csv', low_memory=False)
train_label = trainDF[['price']]
test_label = testDF[['price']]
train_features = trainDF.drop(columns = ['price'])
test_features = testDF.drop(columns = ['price'])
train_features = pd.get_dummies(train_features, drop_first=True)
test_features = pd.get_dummies(test_features, drop_first=True)
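One caveat worth noting: creating the dummy variables separately for the train and test sets can yield mismatched columns if a category appears in only one split. A minimal guard, assuming the frames defined above:
# Align the test columns to the train columns, filling absent dummies with 0
train_features, test_features = train_features.align(test_features,
                                                     join='left', axis=1,
                                                     fill_value=0)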
Let's now set up Weights & Biases specifying the project, entity, group and notes.
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)
import wandb
from optuna.integration.wandb import WeightsAndBiasesCallback
wandb.login()
wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
'group': 'optuna_xgb300gpu_cv_kfold10',
'save_code': 'False', 'notes': 'xgb300gpu_cv_kfold10'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
Using the same k-folds for reproducibility and setting up the W & B callbacks, let's now define a function for the optimization of hyperparameters using an optuna.study saved to a pickle file, the parameter ranges to test with different combinations during the search, and the model type with the parameters that will be used for each trial. Then the timer can be specified to begin when called to perform the timed k-fold cross validation trials with the goal of finding the lowest averaged score. The W & B parameters with loss for this study can be examined here.
from sklearn.model_selection import cross_val_score, KFold
import joblib
from xgboost import XGBRegressor, plot_importance
from datetime import datetime, timedelta
from timeit import default_timer as timer
import pickle
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
kfolds = KFold(n_splits=10, shuffle=True, random_state=seed_value)
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a XGBoostRegressor model.
"""
joblib.dump(study, 'XGB_Optuna_300_GPU_CV_wb.pkl')
params_xgb_optuna = {
'n_estimators': trial.suggest_int('n_estimators', 400, 1000),
'max_depth': trial.suggest_int('max_depth', 9, 15),
'subsample': trial.suggest_float('subsample', 0.6, 0.8),
'gamma': trial.suggest_float('gamma', 1e-8, 7e-5),
'learning_rate': trial.suggest_float('learning_rate', 0.04, 0.13),
'reg_alpha': trial.suggest_float('reg_alpha', 1e-4, 0.02),
'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 1e-5),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7),
'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.35, 0.6),
'min_child_weight': trial.suggest_int('min_child_weight', 7, 12)
}
model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0,
**params_xgb_optuna)
start = timer()
scores = -cross_val_score(model, train_features, train_label,
scoring='neg_root_mean_squared_error',
cv=kfolds)
run_time = timer() - start
rmse = np.mean(scores)
return rmse
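As a possible refinement that is not used in this study, Optuna can prune unpromising trials mid-training when the native xgb.train API is used instead of cross_val_score. A hedged sketch, where dtrain, dval and y_val are assumed names for a training DMatrix, a held-out validation DMatrix and its labels, and params_xgb_optuna is a parameter dict like the one above:
import xgboost as xgb
from optuna.integration import XGBoostPruningCallback

def objective_with_pruning(trial):
    # Report the validation RMSE to Optuna after each boosting round so
    # hopeless trials can be stopped early
    pruning_cb = XGBoostPruningCallback(trial, 'validation-rmse')
    booster = xgb.train(params_xgb_optuna, dtrain, num_boost_round=500,
                        evals=[(dval, 'validation')],
                        callbacks=[pruning_cb], verbose_eval=False)
    preds = booster.predict(dval)
    return mean_squared_error(y_val, preds, squared=False)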
Since the 300 trials did not complete, we can load the saved pickle file for the study using joblib and continue training until the 300 trials are complete.
study = joblib.load('XGB_Optuna_300_GPU_CV_wb.pkl')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_300_GPU_CV_wb.pkl'):
study = joblib.load('XGB_Optuna_300_GPU_CV_wb.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=18, callbacks=[wandbc])
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
Results from the Hyperparameter Search¶
Let's now extract the trial number, rmse and hyperparameter values into a pandas.DataFrame and sort with the lowest error first.
trials_df = study.trials_dataframe()
trials_df = trials_df.rename(columns={
    'number': 'iteration',
    'value': 'rmse',
    'params_colsample_bylevel': 'colsample_bylevel',
    'params_colsample_bytree': 'colsample_bytree',
    'params_gamma': 'gamma',
    'params_learning_rate': 'learning_rate',
    'params_max_depth': 'max_depth',
    'params_min_child_weight': 'min_child_weight',
    'params_n_estimators': 'n_estimators',
    'params_reg_alpha': 'reg_alpha',
    'params_reg_lambda': 'reg_lambda',
    'params_subsample': 'subsample'})
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
Let's utilize plot_optimization_history, which shows the scores from all trials as well as the best score so far at each point. This search did not contain extreme outliers for the objective value so it can be useful for examining the study output.
fig = optuna.visualization.plot_optimization_history(study)
fig.show()
Next, we can utilize plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search and plot_contour to plot the parameter relationship with a contour plot.
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()
fig = optuna.visualization.plot_contour(study, params=['min_child_weight',
'max_depth',
'learning_rate',
'gamma'])
fig.show()
Next, we can use plot_slice to compare the objective value and individual parameters.
fig = optuna.visualization.plot_slice(study)
fig.show()
The colsample_bylevel, max_depth, reg_alpha, and subsample parameters with higher values demonstrate lower loss for the objective value. Lower values for the learning_rate, n_estimators, reg_lambda, and colsample_bytree parameters resulted in lower loss. It is difficult to determine if lower/higher values for the gamma and min_child_weight parameters resulted in lower loss for this study.
Let's now visualize the parameter importances by utilizing the plot_param_importances feature from the Optuna package.
fig = optuna.visualization.plot_param_importances(study)
fig.show()
The learning_rate parameter is clearly the most important with a value of 0.88, while the next most important parameter, max_depth, has a value of 0.03. CatBoost demonstrated that learning_rate was the most important parameter during the Optuna trials while LightGBM demonstrated max_depth. XGBoost agrees with CatBoost in suggesting that learning_rate is the most important parameter.
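For reference, the same importances can be retrieved programmatically rather than plotted (Optuna's default evaluator is fANOVA-based):
# Ordered mapping of hyperparameter name to relative importance
importances = optuna.importance.get_param_importances(study)
for name, value in importances.items():
    print('{}: {:.3f}'.format(name, value))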
Let's now use plot_edf to visualize the empirical distribution function.
fig = optuna.visualization.plot_edf(study)
fig.show()
Let's now examine the distributions of the quantitative hyperparameters that were assessed during the search.
for i, hpo in enumerate(trials_df.columns):
if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
'duration', 'n_estimators', 'state']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
The colsample_bylevel, max_depth, reg_alpha and subsample parameters with higher values demonstrate lower loss for the objective value. Lower values for the colsample_bytree, gamma, learning_rate, min_child_weight and reg_lambda parameters resulted in lower loss.
Let's now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 4, figsize=(20,5))
i = 0
axs = axs.flatten()
for i, hpo in enumerate(['learning_rate', 'gamma', 'colsample_bylevel',
'colsample_bytree']):
sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
Over the study duration, the colsample_bytree and learning_rate parameters decreased while the others did not show a clear trend.
Next, we can examine if there were any trends regarding the regularization parameters over the trials.
fig, axs = plt.subplots(1, 2, figsize = (20,5))
i = 0
for i, hpo in enumerate(['reg_alpha', 'reg_lambda']):
sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
Over the trials, the reg_alpha parameter increased while reg_lambda decreased.
Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit it using the training data, and save it as a .pkl file. Then we can evaluate the model metrics for both the train and test sets using MAE, MSE, RMSE and R².
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
best_model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
verbosity=0,
**params)
best_model.fit(train_features, train_label)
Pkl_Filename = 'XGB_Optuna_trials300_GPU_CV_wb.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for XGBoost HPO 300 GPU CV trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(train_label, y_train_pred),
mean_absolute_error(test_label, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred),
mean_squared_error(test_label, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred, squared=False),
mean_squared_error(test_label, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(train_label, y_train_pred),
r2_score(test_label, y_test_pred)))
Compared to the baseline model, there are lower values for MAE, MSE and RMSE and a higher R² for both the train and test sets after hyperparameter tuning. There were also lower MAE, MSE and RMSE values and a higher R² for both sets compared to the metrics from the 10-fold cross validation Hyperopt trials.
We can also evaluate the MSE on the test set and determine when this was achieved:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
Model Explanations¶
SHAP (SHapley Additive exPlanations)¶
Using the training set, let's summarize the effects of all the features.
shap.initjs()
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(train_features)
shap.summary_plot(shap_values, train_features);
Then we can use the test set and summarize the effects of all the features.
shap.initjs()
shap_values = explainer.shap_values(test_features)
shap.summary_plot(shap_values, test_features);
Model Metrics with ELI5¶
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
X_test1 = pd.DataFrame(test_features, columns=test_features.columns)
perm_importance = PermutationImportance(best_model,
random_state=seed_value).fit(test_features,
test_label)
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test1.columns.tolist())
html_obj
We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).
from eli5 import format_as_dataframe
explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
The horsepower feature is again the most important weight (0.189968), followed by year (0.083167), mileage (0.077648), width (0.077522) and height (0.034736).
XGBoost Set Params Same GPU¶
The notebook utilizing the GPU for XGBoost is here and the one using RAPIDS is here. All of the W & B parameters with loss for these experiments can be examined here.
To evaluate the runtime differences between cross validation and a train/test split, let's utilize the same GPU, a Quadro RTX 4000. Let's first set up the environment, read the data, define the label and features, create dummy variables for the categorical variables and create the DMatrix objects for the GPU.
!pip install --upgrade pip
!pip install --upgrade -q wandb
!pip install xgboost==1.5.2
!pip install optuna
!pip install plotly
!pip install eli5
!pip install shap
import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')
seed_value = 42
os.environ['usedCars_xgbGPU'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
print('\n')
!nvidia-smi
import pandas as pd
import xgboost as xgb
trainDF = pd.read_csv('usedCars_trainSet.csv', low_memory=False)
testDF = pd.read_csv('usedCars_testSet.csv', low_memory=False)
train_label = trainDF[['price']]
test_label = testDF[['price']]
train_features = trainDF.drop(columns = ['price'])
test_features = testDF.drop(columns = ['price'])
train_features = pd.get_dummies(train_features, drop_first=True)
test_features = pd.get_dummies(test_features, drop_first=True)
dtrain = xgb.DMatrix(data=train_features, label=train_label)
dtest = xgb.DMatrix(data=test_features, label=test_label)
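The DMatrix is XGBoost's optimized internal data container, which avoids repeated conversion overhead across the trials. A quick sanity check of the resulting dimensions:
# Confirm the rows/columns survived the dummy encoding
print(dtrain.num_row(), dtrain.num_col())
print(dtest.num_row(), dtest.num_col())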
Cross Validation¶
Let's start with 3-fold cross validation where the results are logged with W & B.
200 Trials 3-Fold Cross Validation¶
import joblib
import optuna
from optuna import Trial
optuna.logging.set_verbosity(optuna.logging.WARNING)
from optuna.integration.wandb import WeightsAndBiasesCallback
from timeit import default_timer as timer
import wandb
wandb.login()
wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
                'group': 'optuna_xgbGPU_setParams',
                'save_code': 'False',
                'notes': 'xgbGPU_setParams_RTX4000_3CV_noCUML',
                'tags': 'xgbGPU_setParams_RTX4000_3CV_noCUML'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a XGBoostRegressor model.
"""
joblib.dump(study, 'XGB_Optuna_GPU_setParams_RTX4000_3CV_wb.pkl')
params_xgb_optuna = {
'booster': 'gbtree',
'gpu_id': 0,
'tree_method': 'gpu_hist',
'gamma': 4.5,
'learning_rate': 0.04,
'reg_alpha': 0.01,
'scale_pos_weight': 1,
'use_label_encoder': False,
'random_state': seed_value,
'verbosity': 0,
'n_estimators': trial.suggest_int('n_estimators', 600, 800, step=100),
'max_depth': trial.suggest_int('max_depth', 14, 15),
'subsample': trial.suggest_float('subsample', 0.78, 0.8, step=0.02),
'reg_lambda': trial.suggest_float('reg_lambda', 5.5, 7.5, step=1),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7,
step=0.1),
'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.4, 0.6,
step=0.1),
'min_child_weight': trial.suggest_int('min_child_weight', 7, 9, step=1)
}
start = timer()
cv_results = xgb.cv(params_xgb_optuna, dtrain, nfold=3,
metrics='rmse', as_pandas=True)
run_time = timer() - start
rmse = cv_results['test-rmse-mean'].values[-1]
return rmse
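Note that xgb.cv controls boosting rounds through its num_boost_round argument (default 10) rather than the n_estimators key in the params dict, so specifying it explicitly may matter here. A hedged variant that is not used in this study: xgb.cv also supports early stopping, so each trial stops boosting once the mean test RMSE stalls; the round counts below are illustrative.
# Inside the objective, the call could instead read:
cv_results = xgb.cv(params_xgb_optuna, dtrain, nfold=3,
                    num_boost_round=500, early_stopping_rounds=10,
                    metrics='rmse', as_pandas=True)
rmse = cv_results['test-rmse-mean'].min()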
from datetime import datetime, timedelta
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_GPU_setParams_RTX4000_3CV_wb.pkl'):
study = joblib.load('XGB_Optuna_GPU_setParams_RTX4000_3CV_wb.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200, callbacks=[wandbc])
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
This took a little more than one hour. Now we can examine the study components.
trials_df = study.trials_dataframe()
trials_df = trials_df.rename(columns={
    'number': 'iteration',
    'value': 'rmse',
    'params_colsample_bylevel': 'colsample_bylevel',
    'params_colsample_bytree': 'colsample_bytree',
    'params_gamma': 'gamma',
    'params_learning_rate': 'learning_rate',
    'params_max_depth': 'max_depth',
    'params_min_child_weight': 'min_child_weight',
    'params_n_estimators': 'n_estimators',
    'params_reg_alpha': 'reg_alpha',
    'params_reg_lambda': 'reg_lambda',
    'params_subsample': 'subsample'})
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
for i, hpo in enumerate(trials_df.columns):
if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
'duration', 'n_estimators', 'state']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
best_model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
gpu_id=0,
gamma=4.5,
learning_rate=0.04,
reg_alpha=0.01,
scale_pos_weight=1,
use_label_encoder=False,
verbosity=0,
**params)
best_model.fit(train_features, train_label)
Pkl_Filename = 'XGB_Optuna_trials_GPU_setParams_RTX4000_3CV_wb.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for XGBoost HPO GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(train_label, y_train_pred),
mean_absolute_error(test_label, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred),
mean_squared_error(test_label, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred, squared=False),
mean_squared_error(test_label, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(train_label, y_train_pred),
r2_score(test_label, y_test_pred)))
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
200 Trials 5-Fold Cross Validation¶
Let's now try using 5-fold cross validation where the results are logged with W & B.
wandb.login()
wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
'group': 'optuna_xgbGPU_setParams',
'save_code': 'False',
'notes': 'xgbGPU_setParams_RTX4000_5CV_noCUML',
'tags': 'xgbGPU_setParams_RTX4000_5CV_noCUML'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a XGBoostRegressor model.
"""
joblib.dump(study, 'XGB_Optuna_GPU_setParams_RTX4000_5CV_wb.pkl')
params_xgb_optuna = {
'booster': 'gbtree',
'gpu_id': 0,
'tree_method': 'gpu_hist',
'gamma': 4.5,
'learning_rate': 0.04,
'reg_alpha': 0.01,
'scale_pos_weight': 1,
'use_label_encoder': False,
'random_state': seed_value,
'verbosity': 0,
'n_estimators': trial.suggest_int('n_estimators', 600, 800, step=100),
'max_depth': trial.suggest_int('max_depth', 14, 15),
'subsample': trial.suggest_float('subsample', 0.78, 0.8, step=0.02),
'reg_lambda': trial.suggest_float('reg_lambda', 5.5, 7.5, step=1),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7,
step=0.1),
'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.4, 0.6,
step=0.1),
'min_child_weight': trial.suggest_int('min_child_weight', 7, 9, step=1)
}
start = timer()
cv_results = xgb.cv(params_xgb_optuna, dtrain, nfold=5,
metrics='rmse', as_pandas=True)
run_time = timer() - start
rmse = cv_results['test-rmse-mean'].values[-1]
print('- Trial RMSE:', rmse)
return rmse
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_GPU_setParams_RTX4000_5CV_wb.pkl'):
study = joblib.load('XGB_Optuna_GPU_setParams_RTX4000_5CV_wb.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200, callbacks=[wandbc])
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
This took over nine minutes longer than 3-fold cross validation. Now we can examine the study components.
trials_df = study.trials_dataframe()
trials_df = trials_df.rename(columns={
    'number': 'iteration',
    'value': 'rmse',
    'params_colsample_bylevel': 'colsample_bylevel',
    'params_colsample_bytree': 'colsample_bytree',
    'params_gamma': 'gamma',
    'params_learning_rate': 'learning_rate',
    'params_max_depth': 'max_depth',
    'params_min_child_weight': 'min_child_weight',
    'params_n_estimators': 'n_estimators',
    'params_reg_alpha': 'reg_alpha',
    'params_reg_lambda': 'reg_lambda',
    'params_subsample': 'subsample'})
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
for i, hpo in enumerate(trials_df.columns):
if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
'duration', 'n_estimators', 'state']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
best_model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
gpu_id=0,
gamma=4.5,
learning_rate=0.04,
reg_alpha=0.01,
scale_pos_weight=1,
use_label_encoder=False,
verbosity=0,
**params)
best_model.fit(train_features, train_label)
Pkl_Filename = 'XGB_Optuna_trials_GPU_setParams_RTX4000_5CV_wb.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for XGBoost HPO GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(train_label, y_train_pred),
mean_absolute_error(test_label, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred),
mean_squared_error(test_label, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred, squared=False),
mean_squared_error(test_label, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(train_label, y_train_pred),
r2_score(test_label, y_test_pred)))
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
200 Trials 10-Fold Cross Validation¶
Let's now try using 10-fold cross validation where the results are logged with W & B.
wandb.login()
wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
'group': 'optuna_xgbGPU_setParams',
'save_code': 'False',
'notes': 'xgbGPU_setParams_RTX4000_10CV_noCUML',
'tags': 'xgbGPU_setParams_RTX4000_10CV_noCUML'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a XGBoostRegressor model.
"""
joblib.dump(study, 'XGB_Optuna_GPU_setParams_RTX4000_10CV_wb.pkl')
params_xgb_optuna = {
'booster': 'gbtree',
'gpu_id': 0,
'tree_method': 'gpu_hist',
'gamma': 4.5,
'learning_rate': 0.04,
'reg_alpha': 0.01,
'scale_pos_weight': 1,
'use_label_encoder': False,
'random_state': seed_value,
'verbosity': 0,
'n_estimators': trial.suggest_int('n_estimators', 600, 800, step=100),
'max_depth': trial.suggest_int('max_depth', 14, 15),
'subsample': trial.suggest_float('subsample', 0.78, 0.8, step=0.02),
'reg_lambda': trial.suggest_float('reg_lambda', 5.5, 7.5, step=1),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7,
step=0.1),
'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.4, 0.6,
step=0.1),
'min_child_weight': trial.suggest_int('min_child_weight', 7, 9, step=1)
}
start = timer()
cv_results = xgb.cv(params_xgb_optuna, dtrain, nfold=10,
metrics='rmse', as_pandas=True)
run_time = timer() - start
rmse = cv_results['test-rmse-mean'].values[-1]
print('- Trial RMSE:', rmse)
return rmse
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_GPU_setParams_RTX4000_10CV_wb.pkl'):
study = joblib.load('XGB_Optuna_GPU_setParams_RTX4000_10CV_wb.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200, callbacks=[wandbc])
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
This took over 20 minutes longer than 3-fold cross validation and 10 minutes longer than 5-fold cross validation. Now we can examine the study components.
trials_df = study.trials_dataframe()
trials_df = trials_df.rename(columns={
    'number': 'iteration',
    'value': 'rmse',
    'params_colsample_bylevel': 'colsample_bylevel',
    'params_colsample_bytree': 'colsample_bytree',
    'params_gamma': 'gamma',
    'params_learning_rate': 'learning_rate',
    'params_max_depth': 'max_depth',
    'params_min_child_weight': 'min_child_weight',
    'params_n_estimators': 'n_estimators',
    'params_reg_alpha': 'reg_alpha',
    'params_reg_lambda': 'reg_lambda',
    'params_subsample': 'subsample'})
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
for i, hpo in enumerate(trials_df.columns):
if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
'duration', 'n_estimators', 'state']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
best_model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
gpu_id=0,
gamma=4.5,
learning_rate=0.04,
reg_alpha=0.01,
scale_pos_weight=1,
use_label_encoder=False,
verbosity=0,
**params)
best_model.fit(train_features, train_label)
Pkl_Filename = 'XGB_Optuna_trials_GPU_setParams_RTX4000_10CV_wb.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for XGBoost HPO GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(train_label, y_train_pred),
mean_absolute_error(test_label, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred),
mean_squared_error(test_label, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred, squared=False),
mean_squared_error(test_label, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(train_label, y_train_pred),
r2_score(test_label, y_test_pred)))
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
Train/Test¶
Let's now try using a train/test split to compare to cross validation, where the results are logged with W & B.
200 Trials Train/Test DMatrix¶
wandb.login()
wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
'group': 'optuna_xgbGPU_setParams',
'save_code': 'False',
'notes': 'xgbGPU_setParams_RTX4000_trainTest_DMatrix_noCUML',
'tags': 'xgbGPU_setParams_RTX4000_trainTest_DMatrix_noCUML'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a XGBoostRegressor model.
"""
joblib.dump(study,
'XGB_Optuna_GPU_setParams_RTX4000_trainTest_DMatrix_wb.pkl')
params_xgb_optuna = {
'objective': 'reg:squarederror',
'booster': 'gbtree',
'gpu_id': 0,
'tree_method': 'gpu_hist',
'gamma': 4.5,
'learning_rate': 0.04,
'reg_alpha': 0.01,
'scale_pos_weight': 1,
'use_label_encoder': False,
'random_state': seed_value,
'verbosity': 0,
'verbose': False,
'n_estimators': trial.suggest_int('n_estimators', 600, 800, step=100),
'max_depth': trial.suggest_int('max_depth', 14, 15),
'subsample': trial.suggest_float('subsample', 0.78, 0.8, step=0.02),
'reg_lambda': trial.suggest_float('reg_lambda', 5.5, 7.5, step=1),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7,
step=0.1),
'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.4, 0.6,
step=0.1),
'min_child_weight': trial.suggest_int('min_child_weight', 7, 9, step=1)
}
start = timer()
model = xgb.train(params_xgb_optuna, dtrain)
run_time = timer() - start
y_pred_val = model.predict(dtest)
rmse = mean_squared_error(test_label, y_pred_val,
squared=False)
print('Trial RMSE:', rmse)
return rmse
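One caution with this setup: tuning hyperparameters directly against the test set risks an optimistic estimate of generalization. A hedged sketch that carves a validation split out of the training data instead, using the names defined above:
from sklearn.model_selection import train_test_split

# Hold out 20% of the training data for tuning so the test set stays untouched
X_tr, X_val, y_tr, y_val = train_test_split(train_features, train_label,
                                            test_size=0.2,
                                            random_state=seed_value)
dtrain_tune = xgb.DMatrix(data=X_tr, label=y_tr)
dval = xgb.DMatrix(data=X_val, label=y_val)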
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_GPU_setParams_RTX4000_trainTest_DMatrix_wb.pkl'):
study = joblib.load('XGB_Optuna_GPU_setParams_RTX4000_trainTest_DMatrix_wb.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200, callbacks=[wandbc])
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
This took four minutes less than 3-fold cross validation, 10 minutes less than 5-fold cross validation and 20 minutes less than 10-fold cross validation. This is potential justification for not using cross validation when working with a smaller set size. Now we can examine the study components.
trials_df = study.trials_dataframe()
trials_df = trials_df.rename(columns={
    'number': 'iteration',
    'value': 'rmse',
    'params_colsample_bylevel': 'colsample_bylevel',
    'params_colsample_bytree': 'colsample_bytree',
    'params_gamma': 'gamma',
    'params_learning_rate': 'learning_rate',
    'params_max_depth': 'max_depth',
    'params_min_child_weight': 'min_child_weight',
    'params_n_estimators': 'n_estimators',
    'params_reg_alpha': 'reg_alpha',
    'params_reg_lambda': 'reg_lambda',
    'params_subsample': 'subsample'})
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
for i, hpo in enumerate(trials_df.columns):
if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
'duration', 'n_estimators', 'state']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
best_model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
gpu_id=0,
gamma=4.5,
learning_rate=0.04,
reg_alpha=0.01,
scale_pos_weight=1,
use_label_encoder=False,
verbosity=0,
**params)
best_model.fit(train_features, train_label)
Pkl_Filename = 'XGB_Optuna_trials_GPU_setParams_RTX4000_trainTest_DMatrix_wb.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for XGBoost HPO GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(train_label, y_train_pred),
mean_absolute_error(test_label, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred),
mean_squared_error(test_label, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(train_label, y_train_pred, squared=False),
mean_squared_error(test_label, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(train_label, y_train_pred),
r2_score(test_label, y_test_pred)))
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
Cross Validation - cuML¶
Let's go back to Paperspace to utilize the RAPIDS Docker container. We can install/import the dependencies, set the options and seed, set up the local CUDA cluster for Dask, query the client for all connected workers, and then read the data.
!pip install --upgrade pip
!pip install xgboost==1.5.2
!pip install --upgrade -q wandb
!pip install optuna
!pip install plotly
!pip install shap
!pip install eli5
import os
import warnings
import random
import numpy as np
import cupy
from cupy import asnumpy
import urllib.request
from contextlib import contextmanager
import time
from timeit import default_timer as timer
warnings.filterwarnings('ignore')
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
seed_value = 42
os.environ['usedCars_xgbGPU'] = str(seed_value)
random.seed(seed_value)
cupy.random.seed(seed_value)
np.random.seed(seed_value)
@contextmanager
def timed(name):
t0 = time.time()
yield
t1 = time.time()
print('..%-24s: %8.4f' % (name, t1 - t0))
print('\n')
!nvidia-smi
import dask
from dask.distributed import Client, wait
from dask.diagnostics import ProgressBar
from dask_cuda import LocalCUDACluster
import dask_cudf
cluster = LocalCUDACluster(threads_per_worker=1, ip='',
dashboard_address='8082')
c = Client(cluster)
workers = c.has_what().keys()
n_workers = len(workers)
c
import cudf
import cuml
import xgboost as xgb
trainDF = cudf.read_csv('usedCars_trainSet.csv', low_memory=False)
testDF = cudf.read_csv('usedCars_testSet.csv', low_memory=False)
train_label = trainDF[['price']]
test_label = testDF[['price']]
train_features = trainDF.drop(columns = ['price'])
test_features = testDF.drop(columns = ['price'])
train_features = cudf.get_dummies(train_features)
test_features = cudf.get_dummies(test_features)
train_features = train_features.to_cupy()
train_label = train_label.to_cupy()
test_features = test_features.to_cupy()
test_label = test_label.to_cupy()
train_features = train_features.astype('float32')
train_label = train_label.astype('int32')
test_features = test_features.astype('float32')
test_label = test_label.astype('int32')
dtrain = xgb.DMatrix(data=train_features, label=train_label)
dtest = xgb.DMatrix(data=test_features, label=test_label)
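The timed context manager defined earlier can wrap any of these steps to report wall-clock runtime; a minimal usage sketch:
# Time rebuilding the training DMatrix with the helper defined above
with timed('build train DMatrix'):
    dtrain = xgb.DMatrix(data=train_features, label=train_label)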
200 Trials 3-fold Cross Validation cuML¶
Let's start with 3-fold cross validation where the results are logged with W & B.
import optuna
from optuna import Trial
optuna.logging.set_verbosity(optuna.logging.WARNING)
import wandb
from optuna.integration.wandb import WeightsAndBiasesCallback
wandb.login()
wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
'group': 'optuna_xgbGPU_setParams',
'save_code': 'False',
'notes': 'xgbGPU_setParams_RTX4000_3CV_cuml',
'tags': 'xgbGPU_setParams_RTX4000_3CV_cuml'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
from xgboost import XGBRegressor, plot_importance
import joblib
from datetime import datetime, timedelta
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a XGBoostRegressor model.
"""
joblib.dump(study, 'XGB_Optuna_GPU_setParams_RTX4000_3CV_cuml_wb.pkl')
params_xgb_optuna = {
'booster': 'gbtree',
'gpu_id': 0,
'tree_method': 'gpu_hist',
'gamma': 4.5,
'learning_rate': 0.04,
'reg_alpha': 0.01,
'scale_pos_weight': 1,
'use_label_encoder': False,
'random_state': seed_value,
'verbosity': 0,
'n_estimators': trial.suggest_int('n_estimators', 600, 800, step=100),
'max_depth': trial.suggest_int('max_depth', 14, 15),
'subsample': trial.suggest_float('subsample', 0.78, 0.8, step=0.02),
'reg_lambda': trial.suggest_float('reg_lambda', 5.5, 7.5, step=1),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7,
step=0.1),
'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.4, 0.6,
step=0.1),
'min_child_weight': trial.suggest_int('min_child_weight', 7, 9, step=1)
}
start = timer()
cv_results = xgb.cv(params_xgb_optuna, dtrain, nfold=3,
metrics='rmse')
run_time = timer() - start
rmse = asnumpy(cv_results['test-rmse-mean'].values[-1])
print('- Trial RMSE:', rmse)
return rmse
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_GPU_setParams_RTX4000_3CV_cuml_wb.pkl'):
study = joblib.load('XGB_Optuna_GPU_setParams_RTX4000_3CV_cuml_wb.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200, callbacks=[wandbc])
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
This took a little more than one hour, just like what was observed with 3-fold cross validation without cuML. Now we can examine the study components.
trials_df = study.trials_dataframe()
trials_df = trials_df.rename(columns={
    'number': 'iteration',
    'value': 'rmse',
    'params_colsample_bylevel': 'colsample_bylevel',
    'params_colsample_bytree': 'colsample_bytree',
    'params_gamma': 'gamma',
    'params_learning_rate': 'learning_rate',
    'params_max_depth': 'max_depth',
    'params_min_child_weight': 'min_child_weight',
    'params_n_estimators': 'n_estimators',
    'params_reg_alpha': 'reg_alpha',
    'params_reg_lambda': 'reg_lambda',
    'params_subsample': 'subsample'})
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
import matplotlib.pyplot as plt
import seaborn as sns
for i, hpo in enumerate(trials_df.columns):
if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
'duration', 'n_estimators', 'state']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
import pickle
best_model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
gpu_id=0,
gamma=4.5,
learning_rate=0.04,
reg_alpha=0.01,
scale_pos_weight=1,
use_label_encoder=False,
verbosity=0,
**params)
best_model.fit(train_features, train_label)
Pkl_Filename = 'XGB_Optuna_trials_GPU_setParams_RTX4000_3CV_cuml_wb.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for XGBoost HPO GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(asnumpy(train_label), asnumpy(y_train_pred)),
mean_absolute_error(asnumpy(test_label), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(train_label), asnumpy(y_train_pred)),
mean_squared_error(asnumpy(test_label), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(train_label), asnumpy(y_train_pred),
squared=False),
mean_squared_error(asnumpy(test_label), asnumpy(y_test_pred),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(asnumpy(train_label), asnumpy(y_train_pred)),
r2_score(asnumpy(test_label), asnumpy(y_test_pred))))
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(asnumpy(test_label),
asnumpy(y_test_pred))))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
200 Trials 5-fold Cross Validation cuML¶
Let's now try using 5-fold cross validation where the results are logged with W & B.
wandb.login()
wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
'group': 'optuna_xgbGPU_setParams',
'save_code': 'False',
'notes': 'xgbGPU_setParams_RTX4000_5CV_cuml',
'tags': 'xgbGPU_setParams_RTX4000_5CV_cuml'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a XGBoostRegressor model.
"""
joblib.dump(study, 'XGB_Optuna_GPU_setParams_RTX4000_5CV_cuml_wb.pkl')
params_xgb_optuna = {
'booster': 'gbtree',
'gpu_id': 0,
'tree_method': 'gpu_hist',
'gamma': 4.5,
'learning_rate': 0.04,
'reg_alpha': 0.01,
'scale_pos_weight': 1,
'use_label_encoder': False,
'random_state': seed_value,
'verbosity': 0,
'n_estimators': trial.suggest_int('n_estimators', 600, 800, step=100),
'max_depth': trial.suggest_int('max_depth', 14, 15),
'subsample': trial.suggest_float('subsample', 0.78, 0.8, step=0.02),
'reg_lambda': trial.suggest_float('reg_lambda', 5.5, 7.5, step=1),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7,
step=0.1),
'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.4, 0.6,
step=0.1),
'min_child_weight': trial.suggest_int('min_child_weight', 7, 9, step=1)
}
start = timer()
cv_results = xgb.cv(params_xgb_optuna, dtrain, nfold=5,
metrics='rmse')
run_time = timer() - start
rmse = asnumpy(cv_results['test-rmse-mean'].values[-1])
print('- Trial RMSE:', rmse)
return rmse
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_GPU_setParams_RTX4000_5CV_cuml_wb.pkl'):
study = joblib.load('XGB_Optuna_GPU_setParams_RTX4000_5CV_cuml_wb.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200, callbacks=[wandbc])
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
This took over nine minutes longer than 3-fold cross validation, and a little longer than 5-fold cross validation without cuML. Now we can examine the study components.
trials_df = study.trials_dataframe()
trials_df = trials_df.rename(columns={
    'number': 'iteration',
    'value': 'rmse',
    'params_colsample_bylevel': 'colsample_bylevel',
    'params_colsample_bytree': 'colsample_bytree',
    'params_gamma': 'gamma',
    'params_learning_rate': 'learning_rate',
    'params_max_depth': 'max_depth',
    'params_min_child_weight': 'min_child_weight',
    'params_n_estimators': 'n_estimators',
    'params_reg_alpha': 'reg_alpha',
    'params_reg_lambda': 'reg_lambda',
    'params_subsample': 'subsample'})
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
for i, hpo in enumerate(trials_df.columns):
if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
'duration', 'n_estimators', 'state']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
best_model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
gpu_id=0,
gamma=4.5,
learning_rate=0.04,
reg_alpha=0.01,
scale_pos_weight=1,
use_label_encoder=False,
verbosity=0,
**params)
best_model.fit(train_features, train_label)
Pkl_Filename = 'XGB_Optuna_trials_GPU_setParams_RTX4000_5CV_cuml_wb.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for XGBoost HPO GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(asnumpy(train_label), asnumpy(y_train_pred)),
mean_absolute_error(asnumpy(test_label), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(train_label), asnumpy(y_train_pred)),
mean_squared_error(asnumpy(test_label), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(train_label), asnumpy(y_train_pred),
squared=False),
mean_squared_error(asnumpy(test_label), asnumpy(y_test_pred),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(asnumpy(train_label), asnumpy(y_train_pred)),
r2_score(asnumpy(test_label), asnumpy(y_test_pred))))
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(asnumpy(test_label),
asnumpy(y_test_pred))))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
200 Trials 10-fold Cross Validation cuML¶
Let's now try using 10-fold cross validation where the results are logged with W & B.
wandb.login()
wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
'group': 'optuna_xgbGPU_setParams',
'save_code': 'False',
'notes': 'xgbGPU_setParams_RTX4000_10CV_cuml',
'tags': 'xgbGPU_setParams_RTX4000_10CV_cuml'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a XGBoostRegressor model.
"""
joblib.dump(study, 'XGB_Optuna_GPU_setParams_RTX4000_10CV_cuml_wb.pkl')
params_xgb_optuna = {
'booster': 'gbtree',
'gpu_id': 0,
'tree_method': 'gpu_hist',
'gamma': 4.5,
'learning_rate': 0.04,
'reg_alpha': 0.01,
'scale_pos_weight': 1,
'use_label_encoder': False,
'random_state': seed_value,
'verbosity': 0,
'n_estimators': trial.suggest_int('n_estimators', 600, 800, step=100),
'max_depth': trial.suggest_int('max_depth', 14, 15),
'subsample': trial.suggest_float('subsample', 0.78, 0.8, step=0.02),
'reg_lambda': trial.suggest_float('reg_lambda', 5.5, 7.5, step=1),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7,
step=0.1),
'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.4, 0.6,
step=0.1),
'min_child_weight': trial.suggest_int('min_child_weight', 7, 9, step=1)
}
start = timer()
cv_results = xgb.cv(params_xgb_optuna, dtrain, nfold=10,
metrics='rmse')
run_time = timer() - start
rmse = asnumpy(cv_results['test-rmse-mean'].values[-1])
print('- Trial RMSE:', rmse)
return rmse
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_GPU_setParams_RTX4000_10CV_cuml_wb.pkl'):
study = joblib.load('XGB_Optuna_GPU_setParams_RTX4000_10CV_cuml_wb.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200, callbacks=[wandbc])
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
This took over 30 minutes longer than 3-fold cross validation and 20 minutes longer than 5-fold cross validation. Now we can examine the study components.
trials_df = study.trials_dataframe()
trials_df = trials_df.rename(columns={
    'number': 'iteration',
    'value': 'rmse',
    'params_colsample_bylevel': 'colsample_bylevel',
    'params_colsample_bytree': 'colsample_bytree',
    'params_gamma': 'gamma',
    'params_learning_rate': 'learning_rate',
    'params_max_depth': 'max_depth',
    'params_min_child_weight': 'min_child_weight',
    'params_n_estimators': 'n_estimators',
    'params_reg_alpha': 'reg_alpha',
    'params_reg_lambda': 'reg_lambda',
    'params_subsample': 'subsample'})
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
for i, hpo in enumerate(trials_df.columns):
if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
'duration', 'n_estimators', 'state']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
best_model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
gpu_id=0,
gamma=4.5,
learning_rate=0.04,
reg_alpha=0.01,
scale_pos_weight=1,
use_label_encoder=False,
verbosity=0,
**params)
best_model.fit(train_features, train_label)
Pkl_Filename = 'XGB_Optuna_trials_GPU_setParams_RTX4000_10CV_cuml_wb.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for XGBoost HPO GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(asnumpy(train_label), asnumpy(y_train_pred)),
mean_absolute_error(asnumpy(test_label), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(train_label), asnumpy(y_train_pred)),
mean_squared_error(asnumpy(test_label), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(train_label), asnumpy(y_train_pred),
squared=False),
mean_squared_error(asnumpy(test_label), asnumpy(y_test_pred),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(asnumpy(train_label), asnumpy(y_train_pred)),
r2_score(asnumpy(test_label), asnumpy(y_test_pred))))
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(asnumpy(test_label),
asnumpy(y_test_pred))))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
Train/Test - cuML¶
200 Trials Train/Test DMatrix¶
Let's now try using a train/test split to compare to cross validation, where the results are logged with W & B.
wandb.login()
wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
'group': 'optuna_xgbGPU_setParams',
'save_code': 'False',
'notes': 'xgbGPU_setParams_RTX4000_trainTest_DMatrix_cuml',
'tags': 'xgbGPU_setParams_RTX4000_trainTest_DMatrix_cuml'}
wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
@wandbc.track_in_wandb()
def objective(trial):
"""
Objective function to tune a XGBoostRegressor model.
"""
joblib.dump(study, 'XGB_Optuna_GPU_setParams_RTX4000_trainTest_DMatrix_cuml_wb.pkl')
params_xgb_optuna = {
'objective': 'reg:squarederror',
'booster': 'gbtree',
'gpu_id': 0,
'tree_method': 'gpu_hist',
'gamma': 4.5,
'learning_rate': 0.04,
'reg_alpha': 0.01,
'scale_pos_weight': 1,
'use_label_encoder': False,
'random_state': seed_value,
'verbosity': 0,
'verbose': False,
'n_estimators': trial.suggest_int('n_estimators', 600, 800, step=100),
'max_depth': trial.suggest_int('max_depth', 14, 15),
'subsample': trial.suggest_float('subsample', 0.78, 0.8, step=0.02),
'reg_lambda': trial.suggest_float('reg_lambda', 5.5, 7.5, step=1),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7,
step=0.1),
'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.4, 0.6,
step=0.1),
'min_child_weight': trial.suggest_int('min_child_weight', 7, 9, step=1)
}
start = timer()
model = xgb.train(params_xgb_optuna, dtrain)
run_time = timer() - start
y_pred_val = model.predict(dtest)
rmse = mean_squared_error(asnumpy(test_label), asnumpy(y_pred_val),
squared=False)
print('Trial RMSE:', rmse)
return rmse
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_GPU_setParams_RTX4000_trainTest_DMatrix_cuml_wb.pkl'):
study = joblib.load('XGB_Optuna_GPU_setParams_RTX4000_trainTest_DMatrix_cuml_wb.pkl')
else:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200, callbacks=[wandbc])
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
This took seven minutes less than 3-fold cross validation, 17 minutes less than 5-fold cross validation and 35 minutes less than 10-fold cross validation. This is potential justification for not using cross validation when working with a smaller set size, and it demonstrates the strength of the GPU DMatrix. Now we can examine the study components.
trials_df = study.trials_dataframe()
trials_df = trials_df.rename(columns={
    'number': 'iteration',
    'value': 'rmse',
    'params_colsample_bylevel': 'colsample_bylevel',
    'params_colsample_bytree': 'colsample_bytree',
    'params_gamma': 'gamma',
    'params_learning_rate': 'learning_rate',
    'params_max_depth': 'max_depth',
    'params_min_child_weight': 'min_child_weight',
    'params_n_estimators': 'n_estimators',
    'params_reg_alpha': 'reg_alpha',
    'params_reg_lambda': 'reg_lambda',
    'params_subsample': 'subsample'})
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
for i, hpo in enumerate(trials_df.columns):
if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
'duration', 'n_estimators', 'state']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
best_model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
gpu_id=0,
gamma=4.5,
learning_rate=0.04,
reg_alpha=0.01,
scale_pos_weight=1,
use_label_encoder=False,
verbosity=0,
**params)
best_model.fit(train_features, train_label)
Pkl_Filename = 'XGB_Optuna_trials_GPU_setParams_RTX4000_trainTest_DMatrix_cuml_wb.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for XGBoost HPO GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(asnumpy(train_label), asnumpy(y_train_pred)),
mean_absolute_error(asnumpy(test_label), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(train_label), asnumpy(y_train_pred)),
mean_squared_error(asnumpy(test_label), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(train_label), asnumpy(y_train_pred),
squared=False),
mean_squared_error(asnumpy(test_label), asnumpy(y_test_pred),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(asnumpy(train_label), asnumpy(y_train_pred)),
r2_score(asnumpy(test_label), asnumpy(y_test_pred))))
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(asnumpy(test_label),
asnumpy(y_test_pred))))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
Set Params GPU Test - cuML: LightGBM 500 Trials¶
To utilize the GPU for LightGBM with RAPIDS and a CUDA compiler on Paperspace, the notebook describing how the environment was set up is here.
The steps to complete this are the following:
In a Linux terminal,
- sudo apt update && sudo apt upgrade
- sudo apt-get -y install cmake
- sudo apt-get install -y -qq libboost-all-dev libboost-system-dev libboost-filesystem-dev
- sudo apt-get install --no-install-recommends nvidia-opencl-dev opencl-headers
- sudo apt install glibc-source -y
In a notebook,
- git clone --recursive https://github.com/microsoft/LightGBM
- %cd /notebooks/LightGBM
- !sudo mkdir build
- !sudo mkdir -p /etc/OpenCL/vendors && echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd
- %cd /notebooks/LightGBM/build
- !sudo cmake -DUSE_GPU=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ ..
- !make -j4
- !sudo apt-get -y install python3-pip
- !sudo -H pip install setuptools joblib wandb optuna datetime plotly eli5 shap -U
- %cd /notebooks/LightGBM/python-package/
- !python setup.py install --gpu --opencl-include-dir=/usr/local/cuda/include/
- !pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
- !pip install cuml-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
Once completed, the dependencies were imported.
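Before launching a long study, it can be worth confirming that the compiled library actually runs on the GPU. A minimal smoke test on random data (the dataset here is a placeholder, not part of the study):
import numpy as np
import lightgbm as lgb

# Tiny random regression problem purely to exercise the GPU build
X = np.random.rand(1000, 10)
y = np.random.rand(1000)
dtrain = lgb.Dataset(X, label=y)
params = {'objective': 'regression', 'device': 'gpu', 'verbosity': -1}
bst = lgb.train(params, dtrain, num_boost_round=10)
print('GPU LightGBM trained', bst.num_trees(), 'trees')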
The project results can be found here. The runtime lengths using different GPUs were the following:
- RTX A4000 - 2:21:28
- Quadro RTX 5000 - 2:50:45
- Quadro RTX 4000 - 3:00:01
- Quadro P5000 - 4:39:26
Random Forest¶
Random Forests utilize an ensemble of decision trees together with a technique called bagging, or bootstrap aggregation, which consists of selecting random samples with replacement as the training data for each model, fitting a decision tree to each subset and then aggregating the predictions. This approach reduces overfitting and model variance since more randomness and more combinations of features are evaluated, whereas a single decision tree uses a specified set of features and has a high likelihood of overfitting.
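To make the bagging idea concrete, here is a small illustrative sketch (using sklearn trees on synthetic data, not the cuML internals) that fits each tree on a bootstrap sample and averages the predictions:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.random((500, 5))
y = X @ rng.random(5) + rng.normal(0, 0.1, 500)

# Each tree sees a bootstrap sample (rows drawn with replacement);
# the forest prediction is the average over all trees.
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor(max_depth=8).fit(X[idx], y[idx]))
y_pred = np.mean([t.predict(X) for t in trees], axis=0)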
Let's go back to Paperspace to utilize the RAPIDS Docker container. We can install and import the dependencies, set the options and seed, set up the local CUDA cluster for Dask, query the client for all connected workers, and then read the data.
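A minimal sketch of that cluster setup, assuming the dask_cuda package shipped with the RAPIDS container:
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One Dask worker per visible GPU; the client reports the workers
# that successfully connected.
cluster = LocalCUDACluster()
client = Client(cluster)
n_workers = len(client.scheduler_info()['workers'])
print(n_workers, 'GPU worker(s) connected')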
from cuml.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_streams=1, random_state=seed_value)
rf.fit(X_train, y_train)
Pkl_Filename = 'RF_Baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(rf, file)
print('\nModel Metrics for RF Baseline')
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(y_train.to_numpy(), y_train_pred.to_numpy()),
mean_absolute_error(y_test.to_numpy(), y_test_pred.to_numpy())))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train.to_numpy(), y_train_pred.to_numpy()),
mean_squared_error(y_test.to_numpy(), y_test_pred.to_numpy())))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train.to_numpy(), y_train_pred.to_numpy(),
squared=False),
mean_squared_error(y_test.to_numpy(), y_test_pred.to_numpy(),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train.to_numpy(), y_train_pred.to_numpy()),
r2_score(y_test.to_numpy(), y_test_pred.to_numpy())))
Optuna 300 Trials Train/Test¶
Now, we can define a function train_and_eval to set up the train/test sets and define the model/parameters that will be tested during the search:
- bootstrap: If True, each tree in the forest is built on a bootstrapped sample with replacement. Default=True.
- n_estimators: Number of trees in the forest. Default=100.
- max_depth: Maximum tree depth. Default=16.
- max_leaves: Maximum leaf nodes per tree. Default=-1.
- min_samples_leaf: The minimum number of samples (rows) in each leaf node. Default=1.
- min_samples_split: The minimum number of samples required to split an internal node. Default=2.
- n_bins: Maximum number of bins used by the split algorithm per feature. Default=128.
- n_streams: Number of parallel streams used for forest building. Default=4.
Then the model is trained on the training set, predictions are made on the test set, and the RMSE between the observed and predicted test labels is computed; a minimal sketch follows.
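This is a sketch of what that train_and_eval could look like for the cuML forest, assuming the X_train/y_train and X_test/y_test splits from the baseline above and the seed_value set during preprocessing (the ranges searched are defined in the objective, not here):
from cuml.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def train_and_eval(n_estimators=100, max_depth=16, max_leaves=-1,
                   min_samples_leaf=1, min_samples_split=2, n_bins=128,
                   n_streams=4, bootstrap=True):
    """Fit a cuML random forest with the given hyperparameters and
    return the RMSE on the held-out test set."""
    model = RandomForestRegressor(n_estimators=n_estimators,
                                  max_depth=max_depth,
                                  max_leaves=max_leaves,
                                  min_samples_leaf=min_samples_leaf,
                                  min_samples_split=min_samples_split,
                                  n_bins=n_bins,
                                  n_streams=n_streams,
                                  bootstrap=bootstrap,
                                  random_state=seed_value)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return mean_squared_error(y_test.to_numpy(), y_pred.to_numpy(),
                              squared=False)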
We can now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 4, figsize=(20,5))
axs = axs.flatten()
for i, hpo in enumerate(['max_leaves', 'min_samples_leaf', 'min_samples_split',
                         'n_bins']):
    sns.regplot(x='iteration', y=hpo, data=trials_df, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
Over the study, the max_leaves, min_samples_split and n_bins parameters increased while min_samples_leaf decreased.
Let's now use plot_param_importances to visualize the parameter importances.
fig = optuna.visualization.plot_param_importances(study)
fig.show()
min_samples_leaf was the most important hyperparameter followed by max_leaves.
We can also use plot_edf to examine the empirical distribution.
fig = optuna.visualization.plot_edf(study)
fig.show()
Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit it using the training data and then save it as a .pkl file. Then we can evaluate the model metrics for both the train and test sets using the MAE, MSE, RMSE and R².
params = study.best_params
params['random_state'] = seed_value
params
import pickle
from sklearn.metrics import mean_absolute_error, r2_score
best_model = RandomForestRegressor(n_streams=1, **params)
best_model.fit(X_train, y_train)
Pkl_Filename = 'RF_Optuna_trials300_GPU_paramsHi2.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for RF HPO 300 GPU trials')
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(y_train.to_numpy(), y_train_pred.to_numpy()),
mean_absolute_error(y_test.to_numpy(), y_test_pred.to_numpy())))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train.to_numpy(), y_train_pred.to_numpy()),
mean_squared_error(y_test.to_numpy(), y_test_pred.to_numpy())))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train.to_numpy(), y_train_pred.to_numpy(),
squared=False),
mean_squared_error(y_test.to_numpy(), y_test_pred.to_numpy(),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train.to_numpy(), y_train_pred.to_numpy()),
r2_score(y_test.to_numpy(), y_test_pred.to_numpy())))
Tuning did not prove to be worthwhile here, given the higher errors and lower R² compared to the baseline.
We can also evaluate the MSE on the test set and determine when this was achieved:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(y_test.to_numpy(),
y_test_pred.to_numpy())))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
K-Nearest Neighbor Regression (KNR)¶
K-nearest neighbors regression predicts the target of a sample as the average of the targets of its k closest training samples under a chosen distance metric. Let's first fit a baseline model with the default parameters.
from cuml.neighbors import KNeighborsRegressor
knr = KNeighborsRegressor()
knr.fit(X_train, y_train)
Pkl_Filename = 'KNR_Baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(knr, file)
print('\nModel Metrics for KNR Baseline')
y_train_pred = knr.predict(X_train)
y_test_pred = knr.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(y_train.to_numpy(), y_train_pred.to_numpy()),
mean_absolute_error(y_test.to_numpy(), y_test_pred.to_numpy())))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train.to_numpy(), y_train_pred.to_numpy()),
mean_squared_error(y_test.to_numpy(), y_test_pred.to_numpy())))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train.to_numpy(), y_train_pred.to_numpy(),
squared=False),
mean_squared_error(y_test.to_numpy(), y_test_pred.to_numpy(),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train.to_numpy(), y_train_pred.to_numpy()),
r2_score(y_test.to_numpy(), y_test_pred.to_numpy())))
Optuna 1000 Train/Test Trials¶
Now, we can define a function train_and_eval to set up the train/test sets and define the model/parameters that will be tested during the search:
- n_neighbors: Default number of neighbors to query. Default=5.
- metric: Distance metric to use. Options are 'euclidean', 'manhattan', 'chebyshev', 'minkowski'.
Then the model is trained on the training set, predictions are made on the test set, and the RMSE between the observed and predicted test labels is computed.
from timeit import default_timer as timer
from sklearn.metrics import mean_squared_error
def train_and_eval(X_param, y_param, n_neighbors=10,
metric='euclidean', verbose=False):
"""
Partition data into train/test sets, train and evaluate the model
for the given parameters.
Params
______
X_param: DataFrame.
The data to use for training and testing.
y_param: Series.
The label for training
Returns
score: RMSE of the fitted model
"""
X_train, y_train = trainDF.drop('price',
axis=1), trainDF['price'].astype('int32')
X_train = cudf.get_dummies(X_train)
X_train = X_train.astype('float32')
X_test, y_test = testDF.drop('price',
axis=1), testDF['price'].astype('int32')
X_test = cudf.get_dummies(X_test)
X_test = X_test.astype('float32')
model = KNeighborsRegressor(n_neighbors=n_neighbors,
metric=metric,
verbose=verbose)
start = timer()
model.fit(X_train, y_train)
run_time = timer() - start
y_pred = model.predict(X_test)
score = mean_squared_error(y_test.to_numpy(), y_pred.to_numpy(),
squared=False)
return score
print('Score with default parameters : ', train_and_eval(X_train, y_train))
As with the other hyperparameter optimizations, an objective function that saves the study .pkl and defines the ranges of the parameters to be tested needs to be defined.
import optuna
from optuna import Trial
optuna.logging.set_verbosity(optuna.logging.WARNING)
import joblib
def objective(trial, X_param, y_param):
    joblib.dump(study, 'KNR_Optuna_1000_GPU.pkl')
    n_neighbors = trial.suggest_int('n_neighbors', 2, 100)
    metric = trial.suggest_categorical('metric', ['euclidean', 'manhattan',
                                                  'chebyshev', 'minkowski'])
    # Pass both suggested hyperparameters through to train_and_eval
    score = train_and_eval(X_param, y_param,
                           n_neighbors=n_neighbors,
                           metric=metric,
                           verbose=False)
    return score
with timed('dask_optuna'):
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('KNR_Optuna_1000_GPU.pkl'):
study = joblib.load('KNR_Optuna_1000_GPU.pkl')
else:
study = optuna.create_study(sampler=optuna.samplers.TPESampler(),
study_name=study_name,
direction='minimize')
with parallel_backend('dask'):
study.optimize(lambda trial: objective(trial, X_train, y_train),
n_trials=1000,
n_jobs=n_workers)
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
Let's now extract the trial number, rmse and hyperparameter values into a pandas.DataFrame and sort with the lowest error first.
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'rmse'}, inplace=True)
trials_df.rename(columns={'params_metric': 'metric'}, inplace=True)
trials_df.rename(columns={'params_n_neighbors': 'n_neighbors'}, inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
Let's utilize plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search and plot_slice to compare the objective value and individual parameters.
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()
fig = optuna.visualization.plot_slice(study)
fig.show()
It is difficult to determine which metric performs the best from this slice plot, but a model that uses a lower number of n_neighbors results in lower objective values.
Let's now utilize plot_edf to visualize the empirical distribution of the study.
fig = optuna.visualization.plot_edf(study)
fig.show()
Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit it using the training data and then save it as a .pkl file. Then we can evaluate the model metrics for both the train and test sets using the MAE, MSE, RMSE and R².
params = study.best_params
params
best_model = KNeighborsRegressor(n_neighbors=2, metric='manhattan')
best_model.fit(X_train, y_train)
Pkl_Filename = 'KNR_Optuna_trials1000_GPU_man.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_model, file)
print('\nModel Metrics for KNR HPO 1000 GPU trials - Manhattan')
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(y_train.to_numpy(), y_train_pred.to_numpy()),
mean_absolute_error(y_test.to_numpy(), y_test_pred.to_numpy())))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train.to_numpy(), y_train_pred.to_numpy()),
mean_squared_error(y_test.to_numpy(), y_test_pred.to_numpy())))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train.to_numpy(), y_train_pred.to_numpy(),
squared=False),
mean_squared_error(y_test.to_numpy(), y_test_pred.to_numpy(),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(y_train.to_numpy(), y_train_pred.to_numpy()),
r2_score(y_test.to_numpy(), y_test_pred.to_numpy())))
Tuning proved to be worthwhile, given the lower errors and higher R² for both the train and test sets compared to the baseline model.
We can also evaluate the MSE on the test set and determine when this was achieved:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(y_test.to_numpy(),
y_test_pred.to_numpy())))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])