The price of a used car has always been an intriguing thought. Once a brand-new car leaves the dealership, the price instantly decreases way below the price an individual paid. During the 2022 car market, the price of a used vehicle increased significantly due to the shortage of semiconductors and microchips, which are needed for the manufacturing of new vehicles, and the rising rate of inflation.

Data

   The data was retrieved from Kaggle. The individual who posted the data used a webcrawler on the Cargurus inventory in September 2020. This data contains a vast amount of information about the physical and mechanical components of cars, and this feature rich set can provide insight into what might contribute to the price differences in the value of a vehicle. Does the location where a vehicle is being sold and the age of the vehicle affect the price? How could these factors contribute to building models to predict the price of vehicles? Which features of used vehicles should be used in predictive modeling?

Preprocessing

   The code that was used for preprocessing and EDA can be found in the Used Cars Github repository. First the environment is set up with the dependencies, library options, the seed for reproducibility, and setting the location of the project directory. Then the data is read, duplicate observations dropped and empty rows replace with NA.

In [ ]:
import os
import random
import numpy as np
import warnings
import pandas as pd
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
plt.rcParams['figure.figsize'] = (20, 10)
plt.rcParams['font.size'] = 25
sns.set_context('paper', rc={'font.size': 25, 'axes.titlesize': 35,
                             'axes.labelsize': 30})

seed_value = 42
os.environ['UsedCars_PreprocessEDA'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)

df = pd.read_csv('used_cars_data.csv', low_memory=False)
df = df.drop_duplicates()
df = df.replace(r'^\s*$', np.nan, regex=True)
print('\nSample observations:')
display(df.head())

Sample observations:
vin back_legroom bed bed_height bed_length body_type cabin city city_fuel_economy combine_fuel_economy daysonmarket dealer_zip description engine_cylinders engine_displacement engine_type exterior_color fleet frame_damaged franchise_dealer franchise_make front_legroom fuel_tank_volume fuel_type has_accidents height highway_fuel_economy horsepower interior_color isCab is_certified is_cpo is_new is_oemcpo latitude length listed_date listing_color listing_id longitude main_picture_url major_options make_name maximum_seating mileage model_name owner_count power price salvage savings_amount seller_rating sp_id sp_name theft_title torque transmission transmission_display trimId trim_name vehicle_damage_category wheel_system wheel_system_display wheelbase width year
0 ZACNJABB5KPJ92081 35.1 in NaN NaN NaN SUV / Crossover NaN Bayamon NaN NaN 522 00960 [!@@Additional Info@@!]Engine: 2.4L I4 ZERO EV... I4 1300.0 I4 Solar Yellow NaN NaN True Jeep 41.2 in 12.7 gal Gasoline NaN 66.5 in NaN 177.0 Black NaN NaN NaN True NaN 18.3988 166.6 in 2019-04-06 YELLOW 237132766 -66.1582 https://static.cargurus.com/images/forsale/202... ['Quick Order Package'] Jeep 5 seats 7.0 Renegade NaN 177 hp @ 5,750 RPM 23141.0 NaN 0 2.8 370599.0 Flagship Chrysler NaN 200 lb-ft @ 1,750 RPM A 9-Speed Automatic Overdrive t83804 Latitude FWD NaN FWD Front-Wheel Drive 101.2 in 79.6 in 2019
1 SALCJ2FX1LH858117 38.1 in NaN NaN NaN SUV / Crossover NaN San Juan NaN NaN 207 00922 [!@@Additional Info@@!]Keyless Entry,Ebony Mor... I4 2000.0 I4 Narvik Black NaN NaN True Land Rover 39.1 in 17.7 gal Gasoline NaN 68 in NaN 246.0 Black (Ebony) NaN NaN NaN True NaN 18.4439 181 in 2020-02-15 BLACK 265946296 -66.0785 https://static.cargurus.com/images/forsale/202... ['Adaptive Cruise Control'] Land Rover 7 seats 8.0 Discovery Sport NaN 246 hp @ 5,500 RPM 46500.0 NaN 0 3.0 389227.0 Land Rover San Juan NaN 269 lb-ft @ 1,400 RPM A 9-Speed Automatic Overdrive t86759 S AWD NaN AWD All-Wheel Drive 107.9 in 85.6 in 2020
2 JF1VA2M67G9829723 35.4 in NaN NaN NaN Sedan NaN Guaynabo 17.0 NaN 1233 00969 NaN H4 2500.0 H4 None False False True FIAT 43.3 in 15.9 gal Gasoline False 58.1 in 23.0 305.0 None False NaN NaN False NaN 18.3467 180.9 in 2017-04-25 UNKNOWN 173473508 -66.1098 NaN ['Alloy Wheels', 'Bluetooth', 'Backup Camera',... Subaru 5 seats NaN WRX STI 3.0 305 hp @ 6,000 RPM 46995.0 False 0 NaN 370467.0 FIAT de San Juan False 290 lb-ft @ 4,000 RPM M 6-Speed Manual t58994 Base NaN AWD All-Wheel Drive 104.3 in 78.9 in 2016
3 SALRR2RV0L2433391 37.6 in NaN NaN NaN SUV / Crossover NaN San Juan NaN NaN 196 00922 [!@@Additional Info@@!]Fog Lights,7 Seat Packa... V6 3000.0 V6 Eiger Gray NaN NaN True Land Rover 39 in 23.5 gal Gasoline NaN 73 in NaN 340.0 Gray (Ebony/Ebony/Ebony) NaN NaN NaN True NaN 18.4439 195.1 in 2020-02-26 GRAY 266911050 -66.0785 https://static.cargurus.com/images/forsale/202... NaN Land Rover 7 seats 11.0 Discovery NaN 340 hp @ 6,500 RPM 67430.0 NaN 0 3.0 389227.0 Land Rover San Juan NaN 332 lb-ft @ 3,500 RPM A 8-Speed Automatic Overdrive t86074 V6 HSE AWD NaN AWD All-Wheel Drive 115 in 87.4 in 2020
4 SALCJ2FXXLH862327 38.1 in NaN NaN NaN SUV / Crossover NaN San Juan NaN NaN 137 00922 [!@@Additional Info@@!]Keyless Entry,Ebony Mor... I4 2000.0 I4 Narvik Black NaN NaN True Land Rover 39.1 in 17.7 gal Gasoline NaN 68 in NaN 246.0 Black (Ebony) NaN NaN NaN True NaN 18.4439 181 in 2020-04-25 BLACK 270957414 -66.0785 https://static.cargurus.com/images/forsale/202... ['Adaptive Cruise Control'] Land Rover 7 seats 7.0 Discovery Sport NaN 246 hp @ 5,500 RPM 48880.0 NaN 0 3.0 389227.0 Land Rover San Juan NaN 269 lb-ft @ 1,400 RPM A 9-Speed Automatic Overdrive t86759 S AWD NaN AWD All-Wheel Drive 107.9 in 85.6 in 2020

    The initial data appears to contain various data types, missing data and complex strings with text and numbers. To reduce the high dimensionality, which string variables can contribute to, extracting the numerical information might allow for more granularity and insight.

Data Quality

    Let's define a function to examine the quality of the data by assessing the absolute number/percentage of missing observations in the data, the types of data present as well as the number unique values for the quantitative and qualitative features. This can then be sorted by the missingness to determine which variables should be dropped and which categorical variables contain a large number of groups.

In [ ]:
def data_type_quality_table(df):
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    var_type = df.dtypes
    unique_count = df.nunique()
    val_table = pd.concat([mis_val, mis_val_percent, var_type, unique_count],
                          axis=1)
    val_table_ren_columns = val_table.rename(
        columns = {0: 'Missing Count', 1: '% Missing', 2: 'Data Type',
                   3: 'Number Unique'})
    val_table_ren_columns = val_table_ren_columns[
        val_table_ren_columns.iloc[:,1] >= 0].sort_values(
            '% Missing', ascending=False).round(1)
    print('The data has ' + str(df.shape[0]) + ' rows and '
          + str(df.shape[1]) + ' columns.\n')
    return val_table_ren_columns

print('\nData Quality Report:')
display(data_type_quality_table(df))

Data Quality Report:
The data has 3000000 rows and 66 columns.

Missing Count % Missing Data Type Number Unique
vehicle_damage_category 3000000 100.0 float64 0
combine_fuel_economy 3000000 100.0 float64 0
is_certified 3000000 100.0 float64 0
bed 2980432 99.3 object 3
cabin 2936468 97.9 object 4
is_oemcpo 2864640 95.5 object 1
is_cpo 2817104 93.9 object 1
bed_height 2570909 85.7 object 1
bed_length 2570909 85.7 object 83
owner_count 1516996 50.6 float64 18
fleet 1426579 47.6 object 2
theft_title 1426579 47.6 object 2
isCab 1426579 47.6 object 2
has_accidents 1426579 47.6 object 2
frame_damaged 1426579 47.6 object 2
salvage 1426579 47.6 object 2
franchise_make 572617 19.1 object 48
torque 517788 17.3 object 2060
highway_fuel_economy 491278 16.4 float64 99
city_fuel_economy 491278 16.4 float64 100
power 481422 16.0 object 2037
main_picture_url 369086 12.3 object 2415855
major_options 200046 6.7 object 279972
engine_displacement 172383 5.7 float64 67
horsepower 172383 5.7 float64 455
back_legroom 159266 5.3 object 219
wheelbase 159266 5.3 object 483
maximum_seating 159266 5.3 object 12
width 159266 5.3 object 285
length 159266 5.3 object 836
height 159266 5.3 object 472
fuel_tank_volume 159266 5.3 object 182
front_legroom 159266 5.3 object 101
wheel_system_display 146730 4.9 object 5
wheel_system 146730 4.9 object 5
mileage 144387 4.8 float64 197577
trim_name 116292 3.9 object 9062
trimId 115825 3.9 object 41329
engine_cylinders 100580 3.4 object 39
engine_type 100580 3.4 object 39
fuel_type 82724 2.8 object 8
description 77900 2.6 object 2519325
transmission 64184 2.1 object 4
transmission_display 64184 2.1 object 44
seller_rating 40872 1.4 float64 1817
body_type 13542 0.5 object 9
interior_color 165 0.0 object 45726
sp_id 96 0.0 float64 27097
exterior_color 26 0.0 object 28665
sp_name 0 0.0 object 26148
vin 0 0.0 object 3000000
savings_amount 0 0.0 int64 10838
price 0 0.0 float64 88861
model_name 0 0.0 object 1429
make_name 0 0.0 object 100
longitude 0 0.0 float64 23241
listing_id 0 0.0 int64 3000000
listing_color 0 0.0 object 15
listed_date 0 0.0 object 1749
latitude 0 0.0 float64 23443
is_new 0 0.0 bool 2
franchise_dealer 0 0.0 bool 2
dealer_zip 0 0.0 object 8237
daysonmarket 0 0.0 int64 1754
city 0 0.0 object 4687
year 0 0.0 int64 98

Data Relevancy and Missingness

   Now we can first drop the variables that will not be usedful. A url will not provide much insight nor will the ones containing *id which act as an identification on a numerical level. The information contained within description might reveal specifics about the vehicles, but since there are 2,519,325 different strings, this might not be worth the amount of preprocessing.

   The initial data consisted of a high degree of missing data. 16 variables have over >= 47% missing values and the next lowest missing value is under 20%. We can filter the features with under 20% missing and then select the complete cases for each of the remaining features to maximize the amount of observations contained in the set.

In [ ]:
drop_columns = ['main_picture_url', 'vin', 'trimId', 'trim_name',
                'sp_id', 'sp_name', 'listing_id', 'description']
df.drop(columns=drop_columns, inplace=True)

df['year'] = df['year'].astype('int')

df = df.loc[:, df.isnull().mean() < 0.20]

df = df[df.major_options.notna() & df.mileage.notna()
        & df.engine_displacement.notna()
        & df.transmission_display.notna() & df.seller_rating.notna()
        & df.engine_cylinders.notna()
        & df.back_legroom.notna() & df.wheel_system.notna()
        & df.interior_color.notna()
        & df.body_type.notna() & df.exterior_color.notna()
        & df.franchise_make.notna() & df.torque.notna()
        & df.highway_fuel_economy.notna() & df.city_fuel_economy
        & df.power.notna()]

Time

   Vehicle manufacturers release new models each year with newer technologies and accessories. The time aspect generally is an important aspect to consider when building predictive models from historical data. Let's examine when the vehicle was made by generating a graph that contains the amount of vehicles in each year. Older vehicles probably cost less than newer vehicles, but this might not always be true considering there are vintage cars.

In [ ]:
import seaborn as sns
import matplotlib.pyplot as plt

a = sns.countplot(x='year', data=df).set_title('Count of Postings in Each Year')
plt.yticks(fontsize=20)
plt.xticks(fontsize=20)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show();

print('Average year: ' + str(round(df['year'].mean())))
Average year: 2019

   The majority of the vehicles were manufacturered in 2020 with the average year being 2019. The oldest vehicle is from 1984, and there are vehicles from 2021, the year in which the crawler pulled this data. Let's filter the data by year greater than or equal 2016 due to the count and the distribution. Then using a function with autocpt arguments to convert count data to percentages, we can graph the percentage of each model year out of the total listings in this subset.

In [ ]:
df = df.loc[df['year'] >= 2016]

def func(pct):
    return "{:1.1f}%".format(pct)

plt.figure(figsize=(20, 7))
plt.title('Percentage of Listings in Each Year', fontsize=25)
df['year'].value_counts().plot.pie(autopct=lambda pct: func(pct),
                                   textprops={'fontsize': 15})
plt.tight_layout()
plt.show();

   57% of the vehicle listings occur in 2020. Due to the potential seasonality effects, let's examine the month of the year when a vehicle was listed by converting the listed_date to a monthly feature and examining the top 10 year-month by count.

In [ ]:
pd.set_option('display.max_rows', 10)

df['listed_date'] = pd.to_datetime(df['listed_date'])
df['listed_date_yearMonth'] = df['listed_date'].dt.to_period('M')

print('\nNumber of listings \nin each Year-Month:')
print('\n')
print(df.listed_date_yearMonth.value_counts(sort=True).nlargest(10).to_string())

Number of listings 
in each Year-Month:


2020-08    500657
2020-07    257448
2020-09    211995
2020-06    105737
2020-03     60981
2020-02     49363
2020-05     38595
2020-01     36009
2019-12     31130
2020-04     30733
Freq: M

   August contained the most listings, while April contained the least in close proximity to December, January and May. Most of the listings occurred during June - September of 2020, so the set was filtered to this date range. Now we can create a graph that has been condensed significantly. Given this time period is during the first year of the COVID-19 pandemic, the potential for various confounding factors is something that needs to be considered, so let's utilize a narrow range of dates around the summer months.

In [ ]:
df = df.loc[(df['listed_date_yearMonth'] >= '2020-06')]

sns.countplot(x='listed_date_yearMonth',
              data=df,
              order=df['listed_date_yearMonth'].value_counts().index).set_title('Number Postings in Each Year-Month')
plt.yticks(fontsize=25)
plt.xticks(fontsize=25)
plt.tight_layout()
plt.show();

Location of Listings

   The initial data contains 4,687 cities and 8,237 dealer zip codes, so ths is potentially from all over the United States where the cost of living varies depending on location. Some states have higher/more taxes and some cost more to live. We want to be sure that price is not guided by location and that it doesn't confound the modeling results. Therefore, let's examine the length of the dealer_zip because there is a shortened zip code and the complete zip code.

In [ ]:
df['dealer_zip_length'] = df['dealer_zip'].str.len()
print('\nNumber of different \nlengths of zipcodes:')
print('\n')
print(df.dealer_zip_length.value_counts(ascending=False).to_string())

Number of different 
lengths of zipcodes:


5     1075706
10        131

   The majority of the zipcodes are the shortened version, so let's subset the shortened zipcode and convert it to numerical values. Now we can create a temporary pandas.Dataframe with only the zipcode, find the unique zipcodes, process to a list and then a dataframe. There are 4,898 unique zipcodes, so it appears the data is not localized to a small section of the country.

In [ ]:
df = df[df.dealer_zip_length == 5]
df = df.drop(['dealer_zip_length'], axis=1)
df['dealer_zip'] = df['dealer_zip'].astype('int64')

df1 = df.dealer_zip
df1 = df1.unique().tolist()
df1 = pd.DataFrame(df1)
df1.columns = ['dealer_zip_unique']
print('- Number of unique zipcodes:', df1.shape[0])
- Number of unique zipcodes: 4898

   The initial set contains the city but not the state where the vehicle was listed. We can use a temporary dataframe with unique zipcodes to find the state and city corresponding to the zipcode by defining two separate functions using a SearchEngine that utilizes the zipcode to find state and city. This temporary dataframe now has three columns where the queried City and dealer_zip_unique can be used to merge the new features using a right join. Duplicates and unnecessary columns can be removed as well as the zipcodes that were not found within the SearchEngine.

In [ ]:
from uszipcode import SearchEngine

search = SearchEngine(simple_zipcode=False)

def zcode_state(x):
    """ Use the zipcode to find the state """
    return search.by_zipcode(x).state

def zcode_city(x):
    """ Use the zipcode to find city """
    return search.by_zipcode(x).city

df1['State'] = df1['dealer_zip_unique'].fillna(0).astype(int).apply(zcode_state)
df1['City'] = df1['dealer_zip_unique'].fillna(0).astype(int).apply(zcode_city)

df = pd.merge(df, df1, how='right', left_on=['city', 'dealer_zip'],
              right_on=['City', 'dealer_zip_unique'])
df.drop_duplicates()
df = df.drop(['dealer_zip_unique', 'city', 'City', 'dealer_zip'], axis=1)
df = df[df['mileage'].notna()]

   With the addition of the State where the vehicles were listed, we can now examine the top ten locations and consider the possibility of there being differences in the standard of living due to location. Out of the top 10 states with the highest amount of listings in this set, Texas has the most while Michigan has the least. Georgia, North Carolina and Michigan don't seem to fit with the northern states or the largest states like Texas, California and Florida, so let's filter the states with the seven highest counts of listings. This results in a set with over 400,000 observations and 42 columns.

In [ ]:
print('\nNumber of listings \n in each US state:')
print('\n')
print(df.State.value_counts(sort=True).nlargest(10).to_string())
print('\n')
df1 = df['State'].value_counts().index[:7]
df = df[df['State'].isin(df1)]

print('- Dimensions after filtering states with the 7 highest counts of listings:',
      df.shape)

del df1

Number of listings 
 in each US state:


TX    113236
CA     96564
FL     79627
OH     46888
IL     44292
NY     38670
PA     38359
GA     35814
NC     34902
MI     32820


- Dimensions after filtering states with the 7 highest counts of listings: (457636, 42)

Dimensionality of Categorical Variables

   The problem of high dimensionality is an important consideration when modeling because the higher number of different groups per categorical variables, the larger the size of the set after converting to dummy or one hot encoded variables. This will result in more memory requirements (RAM and/or GPU memory) needed for the computations, resulting in longer runtimes and more cost utilization. Examining the current object and bool variables, major_options and color related features (interior and exterior_color) contain over thousands of different groups, so let's drop these features. The transmission_diplay variable contains 44 groups while transmission has four levels, so we'll use a broader variable for the characteristics of the transmission. Next, engine_cylinders and engine_type appear to contain the same information and contain over 30 groups, so let's also remove from the set. franchise_make contains the manufacturer name, but we can identify the components of the car rather than bias the analysis by using a provided name. Lastly, franchise_dealer is boolean with True/False, so let's also remove this feature.

In [ ]:
drop_columns = ['major_options', 'interior_color', 'exterior_color',
                'transmission_display', 'engine_cylinders', 'engine_type',
                'franchise_dealer', 'franchise_make']
df.drop(columns=drop_columns, inplace=True)

   Some of the categorical features contain information about the size (in) / volume (gal) / number (seats) of various characteristics about vehicles, so removing the repetitive parts of the variables will allow these features to become continuous variables and not increase the dimensionality of the set. Now we can examine a few observations of the variables and decide how to process them.

In [ ]:
df[['back_legroom', 'wheelbase', 'width', 'length', 'height',
    'fuel_tank_volume', 'front_legroom', 'maximum_seating']].head()
Out[ ]:
back_legroom wheelbase width length height fuel_tank_volume front_legroom maximum_seating
0 38.1 in 111.4 in 73 in 193.8 in 57.6 in 15.8 gal 42 in 5 seats
1 38.4 in 120.9 in 78.6 in 204.3 in 70.7 in 19.4 gal 41 in 8 seats
2 36.8 in 118.9 in 78.5 in 203.7 in 69.9 in 22 gal 41.3 in 8 seats
3 38.6 in 114.8 in 84.8 in 189.8 in 69.3 in 24.6 gal 40.3 in 5 seats
4 38.4 in 120.9 in 78.6 in 204.3 in 70.7 in 19.4 gal 41 in 8 seats

   Let's use a lambda function on a series that will string split the numbers from the text components leaving only the numbers. Then we designate which columns to apply this action. This creates missing data, so we can then remove the rows containing missing values due to transforming the previous string variable to numeric. Now we can see that these columns contain only numerical data.

In [ ]:
extract_num_from_catVar = lambda series: series.str.split().str[0].astype(np.float)

columns = ['back_legroom', 'wheelbase', 'width', 'length', 'height',
           'fuel_tank_volume', 'front_legroom', 'maximum_seating']

df[columns] = df[columns].replace({',': '',
                                   '--': np.nan}).apply(extract_num_from_catVar)

df = df[df.back_legroom.notna() & df.front_legroom.notna()
        & df.fuel_tank_volume.notna() & df.maximum_seating.notna()
        & df.height.notna() & df.length.notna() & df.wheelbase.notna()
        & df.width.notna()]
df.reset_index(drop=True, inplace=True)

df[['back_legroom', 'wheelbase', 'width', 'length', 'height',
    'fuel_tank_volume', 'front_legroom', 'maximum_seating']].head()
Out[ ]:
back_legroom wheelbase width length height fuel_tank_volume front_legroom maximum_seating
0 38.1 111.4 73.0 193.8 57.6 15.8 42.0 5.0
1 38.4 120.9 78.6 204.3 70.7 19.4 41.0 8.0
2 36.8 118.9 78.5 203.7 69.9 22.0 41.3 8.0
3 38.6 114.8 84.8 189.8 69.3 24.6 40.3 5.0
4 38.4 120.9 78.6 204.3 70.7 19.4 41.0 8.0

   The torque and power features contain similar structures to one another. The sequence containing a number followed by a text string, @, number and another text string signifies that similar string splitting at the same index location as well as the string components being converted to np.nan can be used. Let's first examine a sample of both variables:

In [ ]:
df[['torque', 'power']].head()
Out[ ]:
torque power
0 184 lb-ft @ 2,500 RPM 160 hp @ 5,700 RPM
1 266 lb-ft @ 2,800 RPM 310 hp @ 6,800 RPM
2 266 lb-ft @ 3,400 RPM 281 hp @ 6,300 RPM
3 260 lb-ft @ 4,800 RPM 295 hp @ 6,400 RPM
4 266 lb-ft @ 2,800 RPM 310 hp @ 6,800 RPM

   To process torque, first remove the ',', and split the string. Then create a pandas.DataFrame containing only the float information, name the columns and then reset the index. Now, concatenate the new feature columns created with the main table, drop the original torque feature and lastly remove any missing data that was created during the string split. The same methods can be used to process power which results in a numerical horsepower. A few samples of the new features can then be viewed:

In [ ]:
df1 = df.torque.str.replace(',', '').str.split().str[0:4:3]
df1 = pd.DataFrame([[np.nan, np.nan] if type(i).__name__ == 'float'
                    else np.asarray(i).astype('float') for i in df1])
df1.columns = ['torque_new', 'torque_rpm']
df1.reset_index(drop=True, inplace=True)

df = pd.concat([df, df1], axis=1)
df = df.drop(['torque'], axis=1)
df = df[df.torque_rpm.notna()]
df.rename(columns={'torque_new': 'torque'}, inplace=True)

del df1

df1 = df.power.str.replace(',', '').str.split().str[0:4:3]
df1 = pd.DataFrame([[np.nan, np.nan] if type(i).__name__ == 'float'
                    else np.asarray(i).astype('float') for i in df1])
df1.columns = ['horsepower_new', 'horsepower_rpm']
df1.reset_index(drop=True, inplace=True)

df = pd.concat([df, df1], axis=1)
df = df.drop(['horsepower', 'power'], axis=1)
df = df[df.horsepower_rpm.notna()]
df.rename(columns={'horsepower_new': 'horsepower'}, inplace=True)

del df1

df[['torque', 'horsepower']].head()
Out[ ]:
torque horsepower
0 184.0 160.0
1 266.0 310.0
2 266.0 281.0
3 260.0 295.0
4 266.0 310.0

Examine Dependent Variable: Price

   Using describe() can generate more than the typical five summary statistics by adding the count and minimum/maximum values. The average is around 32,000 U.S dollars. The standard deviation is around 15,000, so let's include up to 50,000 dollars in the set. When outliers are present, the mean can be affected, so we can examine the median as well, which is around 29,000. The proximity of the mean and median suggest the data is not highly skewed, but with a maxmimum price of 449,995 dollars, there definitely are outliers. Outlier testing methods like Z-score, which measures how many standard deviations values are from the mean, or the Local Outlier Factor, which uses how close values are to the values of their neighbors (a local density), where greater than one local density signifies it is different than the rest.

In [ ]:
print('\nSummary statistics: price' + ('\n') + str(round(df['price'].describe()).to_string()))
print('\n')
print('- Median price: ' + str(round(df['price'].median())))

Summary statistics: price
count    451217.0
mean      31833.0
std       14546.0
min        4490.0
25%       21569.0
50%       28710.0
75%       39365.0
max      449995.0


- Median price: 28710

   Now, let's filter the observations with the price less than or equal $50,000, and examine the price across the locations where they were posted using the latitude vs. longitude features.

In [ ]:
df = df.loc[df['price'] <= 50000.0]

plt.figure(figsize=(20, 7))
sns.histplot(x=df['price'],
             kde=True).set_title('Distribution of Price <= $50,000');
plt.yticks(fontsize=25)
plt.xticks(fontsize=25)
plt.show();

print('- There are now ' + str(df.shape[0]) + ' observations.')

df.plot(kind='scatter', x='latitude', y='longitude', c='price', cmap='RdBu_r',
        colorbar=True, sharex=False);
plt.title('Price: Latitude vs Longitude', fontsize=35, y=1.02);
plt.yticks(fontsize=25)
plt.xticks(fontsize=25)
plt.show()

drop_columns = ['latitude', 'longitude']
df.drop(columns=drop_columns, inplace=True)
- There are now 406840 observations.

   Visually, there doesn't appear to be a large amount of lower prices (blue) localized to any of the US states in this set, but there might a greater proportion of higher prices (dark red) in New York compared to Pennsylvania.

Visualizations: Categorical Variables

   Now we can filter the categorical features that have been reduced to ones with a lower number of groups, and see how price varies within the groups by using side by side boxplots.

In [ ]:
df_cat = df.select_dtypes(include = 'object')
df_cat = df_cat.drop(['State', 'model_name', 'make_name', 'listing_color'],
                     axis=1)

plt.rcParams.update({'font.size': 20})
fig, ax = plt.subplots(3, 2, figsize=(20,17))
fig.suptitle('Categorical Variables vs. Price', fontsize=35, y=1.02)
for var, ax in zip(df_cat, ax.flatten()):
    sns.boxplot(x=var, y='price', data=df, ax=ax)
    ax.tick_params(axis='x', labelrotation=45, labelsize=15)
    ax.tick_params(axis='y', labelsize=15)
    plt.tight_layout();
plt.show();

del df_cat, fig, ax

   wheel_system and wheel_system_display contain the same information, one uses acronomyns and the later with the full text spelled out. We will retain the feature with most detail (wheel_system_display) so the components of this feature are able to be distinguished easier.

   The color of a vehicle tends to be an important factor when individuals purchase a different car. Let's investigate if the listing_colorhas any association with price.

In [ ]:
sorted_nb = df.groupby(['listing_color'])['price'].median().sort_values()
plt.title('Listing Color vs. Price', fontsize=35, y=1.02)
sns.boxplot(x=df['listing_color'], y=df['price'], order=list(sorted_nb.index))
plt.tick_params(axis='x', labelrotation=45, labelsize=20)
plt.tick_params(axis='y', labelsize=20)
plt.tight_layout()
plt.show();

   Now we can examine the amount of listings with different colors that we observed in the prior graph. Seven colors including UNKNOWNseem to make up the majority of the set. We can then filter the observations with the seven highest number counts for listing_colorand then remove the listings with listing_color='UNKNOWN'.

In [ ]:
print('\nNumber of listings \nwith vehicle color:')
print('\n')
print(df.listing_color.value_counts(ascending=False).to_string())

df1 = df['listing_color'].value_counts().index[:7]
df = df[df['listing_color'].isin(df1)]
df = df.loc[df['listing_color'] != 'UNKNOWN']

del df1

Number of listings 
with vehicle color:


WHITE      81695
BLACK      80019
UNKNOWN    59578
GRAY       58852
SILVER     52343
BLUE       35391
RED        31743
GREEN       2461
ORANGE      1656
BROWN       1456
TEAL         710
YELLOW       462
GOLD         337
PURPLE       128
PINK           9

   Since the name of the car manufacturer exists in the set, let's examine the median price of vehicles distinguished by the manufacturer, and compare that to the overall median price ($28,710). This might provide some insight about the features and components of the vehicles with higher prices.

In [ ]:
df1 = df.groupby('make_name')['price'].median().reset_index()
df1.rename(columns={'price': 'make_medianPrice'}, inplace=True)
print('\nMedian price of the manufacturer:')
print('\n')
df1 = df1.sort_values('make_medianPrice', ascending=False)
print(df1.iloc[0:10])

del df1

Median price of the manufacturer:


        make_name  make_medianPrice
28        Porsche           48995.0
1      Alfa Romeo           41095.0
34          Volvo           40236.0
25  Mercedes-Benz           38888.0
29            RAM           38842.0
23       Maserati           38388.0
2            Audi           38190.0
19     Land Rover           37998.5
16         Jaguar           37888.0
20          Lexus           35790.0

   The vehicles with the highest median price are mostly luxury manufacturers. This makes logical sense. RAM makes trucks, which are larger vehicles containing larger engines. The cost would have to be more to produce and then make a profit.

   After the exploratory data analysis (EDA) of the qualitative features, we can now drop variables due to similarity, high dimensionality and not useful for modeling.

In [ ]:
df = df.drop(['make_name', 'model_name', 'wheel_system',
              'seller_rating', 'listed_date'], axis=1)
df = df.drop_duplicates()

Visualizations: Quantitative Variables

   Histograms can be used to examine the distributions of numerical data. Visually, left or right skewed data can be observed if the majority of data is on the right side of the histogram or left, respectfully.

In [ ]:
df_num = df.select_dtypes(include = ['float64', 'int64'])
cols = df_num.columns.tolist()
cols = cols[-8:] + cols[:-8]
df_num = df_num[cols]

sns.set(font_scale=2)
df_num.hist(figsize=(30,30), bins=50, xlabelsize=16, ylabelsize=17)
plt.tight_layout()
plt.show();

Correlations

   To determine which quantitative variables are correlated with each other, we can create a correlation matrix using the method='spearman', which is a nonparametric approach that ranks the features. From the prior visualization, some of the features contain non-Gaussian distributions. Then we can plot the correlation matrix without any thresholds.

   To increase the granularity of which features are correlated with price, let's extract the features that contain a correlation coefficient greater than 0.5 or less than -0.5 and examine the magnitude.

In [ ]:
corr = df_num.corr(method='spearman')

sns.set(font_scale=1.2)
plt.rcParams['figure.figsize'] = (20, 10)
sns.heatmap(corr, cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=False, square=True)
plt.title('Correlation Matrix with Spearman rho', fontsize=35, y=1.02)
plt.tight_layout()
plt.show();

df_num_corr = df_num.drop('price', axis=1).apply(lambda x: x.corr(df_num.price))
quant_features_list = df_num_corr[abs(df_num_corr) > 0.50].sort_values(ascending=False)
print('- There are {} moderately correlated values with Price:\n{}'.format(
    len(quant_features_list), quant_features_list))
- There are 9 moderately correlated values with Price:
horsepower              0.672491
fuel_tank_volume        0.598516
wheelbase               0.541819
engine_displacement     0.540709
length                  0.528355
width                   0.514155
height                  0.510698
city_fuel_economy      -0.563623
highway_fuel_economy   -0.579206
dtype: float64

   The horsepower feature has the highest positive correlation with price while highway_fuel_economy has the highest negative.

   Now let's filter the correlation matrix using thresholds of rho <= -0.5 or >= 0.5 to reduce the noise in the plot and to better decipher which features have higher correlation in the set.

In [ ]:
sns.heatmap(corr[(corr >= 0.5) | (corr <= -0.5)],
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={'size': 9}, square=True);
plt.title('Correlation Matrix with Spearman rho >= 0.5 or <= -0.5',
          fontsize=30, y=1.02)
plt.tight_layout()
plt.show();

   When utilizing this threshold, the amount of content on the heatmap decreased. The most positively correlated features are highway_fuel_economy and city_fuel_economy with a rho=0.94 followed by length and wheelbase with a rho=0.93. The most negatively correlated features city_fuel_economy and horsepower with a rho=-0.85 as well as city_fuel_economy and fuel_tank_volume with a rho=-0.85. These relationships might need to be considered regarding multicollinearity prior to modeling depending on the method utilized.

Modeling

   Now that the initial data has been processed to a final set, we can proceed with evaluating various modeling approaches. When testing Linear models (lasso, ridge and elastic net regression) using the RAPIDS ecosystem, hyperparameter tuning with 1000 trials each did not generate adequate model metrics even after applying variance inflation factor (VIF) for multicollinearity. Different scaling methods (MinMaxScaler and StandardScaler) were tested as well. Similar results were encountered using LinearSVR as well. This might be attributed to the foundational solvers available in the version of cuml that were available in rapids:22.02. The notebook for the Linear models is here and LinearSVR here.

Multi-Layer Perceptron (MLP) Regression

   The MLP can be utilized for regression problems as demonstrated in Multilayer perceptrons for classification and regression] by Fionn Murtagh in 1991. It is derived from the perceptron invented in the 1940s and implemented by Frank Rosenblatt in the 1950s with The Perceptron — A Perceiving and Recognizing Automaton. The MLP is a artifical neural network connected by neurons containing an input layer, the hidden layer(s) and the output layer with a single output neuron for regression. The data is used as the input to the network which is fed to the hidden layers one layer at a time generating weights (connections to the next neuron) and biases (threshod value needed to obtain the output) till the output layer. The output is then compared to the expected output, and the difference is the error. This error is then propagated back through the network, layer by layer where the weights are updated depending on the effect to the error called the backpropagation algorithm. This is repreated for all of the samples in the training data where one cycle of updating the network is an epoch.

   The notebook can be found here. Let's first set up the environment by installing kerar-tuner for hyperparameter tuning the MLP, then import the dependencies and examine the versions of tensorflow, keras, the amount of GPUs available, the CUDA version and which GPU will be used.

In [ ]:
!pip install keras-tuner
import os
import random
import numpy as np
import warnings
import tensorflow as tf
warnings.filterwarnings('ignore')
print('\n')
print('TensorFlow version: {}'.format(tf.__version__))
print('Eager execution is: {}'.format(tf.executing_eagerly()))
print('Keras version: {}'.format(tf.keras.__version__))
print('Num GPUs Available: ', len(tf.config.list_physical_devices('GPU')))
print('\n')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting keras-tuner
  Downloading keras_tuner-1.1.3-py3-none-any.whl (135 kB)
     |████████████████████████████████| 135 kB 26.8 MB/s 
Collecting kt-legacy
  Downloading kt_legacy-1.0.4-py3-none-any.whl (9.6 kB)
Requirement already satisfied: numpy in /usr/local/lib/python3.8/dist-packages (from keras-tuner) (1.21.6)
Requirement already satisfied: requests in /usr/local/lib/python3.8/dist-packages (from keras-tuner) (2.23.0)
Requirement already satisfied: tensorboard in /usr/local/lib/python3.8/dist-packages (from keras-tuner) (2.9.1)
Requirement already satisfied: ipython in /usr/local/lib/python3.8/dist-packages (from keras-tuner) (7.9.0)
Requirement already satisfied: packaging in /usr/local/lib/python3.8/dist-packages (from keras-tuner) (21.3)
Requirement already satisfied: prompt-toolkit<2.1.0,>=2.0.0 in /usr/local/lib/python3.8/dist-packages (from ipython->keras-tuner) (2.0.10)
Requirement already satisfied: pickleshare in /usr/local/lib/python3.8/dist-packages (from ipython->keras-tuner) (0.7.5)
Requirement already satisfied: traitlets>=4.2 in /usr/local/lib/python3.8/dist-packages (from ipython->keras-tuner) (5.6.0)
Collecting jedi>=0.10
  Downloading jedi-0.18.2-py2.py3-none-any.whl (1.6 MB)
     |████████████████████████████████| 1.6 MB 65.5 MB/s 
Requirement already satisfied: pygments in /usr/local/lib/python3.8/dist-packages (from ipython->keras-tuner) (2.6.1)
Requirement already satisfied: backcall in /usr/local/lib/python3.8/dist-packages (from ipython->keras-tuner) (0.2.0)
Requirement already satisfied: decorator in /usr/local/lib/python3.8/dist-packages (from ipython->keras-tuner) (4.4.2)
Requirement already satisfied: setuptools>=18.5 in /usr/local/lib/python3.8/dist-packages (from ipython->keras-tuner) (57.4.0)
Requirement already satisfied: pexpect in /usr/local/lib/python3.8/dist-packages (from ipython->keras-tuner) (4.8.0)
Requirement already satisfied: parso<0.9.0,>=0.8.0 in /usr/local/lib/python3.8/dist-packages (from jedi>=0.10->ipython->keras-tuner) (0.8.3)
Requirement already satisfied: wcwidth in /usr/local/lib/python3.8/dist-packages (from prompt-toolkit<2.1.0,>=2.0.0->ipython->keras-tuner) (0.2.5)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.8/dist-packages (from prompt-toolkit<2.1.0,>=2.0.0->ipython->keras-tuner) (1.15.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.8/dist-packages (from packaging->keras-tuner) (3.0.9)
Requirement already satisfied: ptyprocess>=0.5 in /usr/local/lib/python3.8/dist-packages (from pexpect->ipython->keras-tuner) (0.7.0)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.8/dist-packages (from requests->keras-tuner) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.8/dist-packages (from requests->keras-tuner) (2022.9.24)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.8/dist-packages (from requests->keras-tuner) (1.24.3)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.8/dist-packages (from requests->keras-tuner) (3.0.4)
Requirement already satisfied: google-auth<3,>=1.6.3 in /usr/local/lib/python3.8/dist-packages (from tensorboard->keras-tuner) (2.15.0)
Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.8/dist-packages (from tensorboard->keras-tuner) (3.4.1)
Requirement already satisfied: werkzeug>=1.0.1 in /usr/local/lib/python3.8/dist-packages (from tensorboard->keras-tuner) (1.0.1)
Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /usr/local/lib/python3.8/dist-packages (from tensorboard->keras-tuner) (0.4.6)
Requirement already satisfied: tensorboard-data-server<0.7.0,>=0.6.0 in /usr/local/lib/python3.8/dist-packages (from tensorboard->keras-tuner) (0.6.1)
Requirement already satisfied: wheel>=0.26 in /usr/local/lib/python3.8/dist-packages (from tensorboard->keras-tuner) (0.38.4)
Requirement already satisfied: protobuf<3.20,>=3.9.2 in /usr/local/lib/python3.8/dist-packages (from tensorboard->keras-tuner) (3.19.6)
Requirement already satisfied: absl-py>=0.4 in /usr/local/lib/python3.8/dist-packages (from tensorboard->keras-tuner) (1.3.0)
Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in /usr/local/lib/python3.8/dist-packages (from tensorboard->keras-tuner) (1.8.1)
Requirement already satisfied: grpcio>=1.24.3 in /usr/local/lib/python3.8/dist-packages (from tensorboard->keras-tuner) (1.51.1)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.8/dist-packages (from google-auth<3,>=1.6.3->tensorboard->keras-tuner) (0.2.8)
Requirement already satisfied: cachetools<6.0,>=2.0.0 in /usr/local/lib/python3.8/dist-packages (from google-auth<3,>=1.6.3->tensorboard->keras-tuner) (5.2.0)
Requirement already satisfied: rsa<5,>=3.1.4 in /usr/local/lib/python3.8/dist-packages (from google-auth<3,>=1.6.3->tensorboard->keras-tuner) (4.9)
Requirement already satisfied: requests-oauthlib>=0.7.0 in /usr/local/lib/python3.8/dist-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard->keras-tuner) (1.3.1)
Requirement already satisfied: importlib-metadata>=4.4 in /usr/local/lib/python3.8/dist-packages (from markdown>=2.6.8->tensorboard->keras-tuner) (4.13.0)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.8/dist-packages (from importlib-metadata>=4.4->markdown>=2.6.8->tensorboard->keras-tuner) (3.11.0)
Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /usr/local/lib/python3.8/dist-packages (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard->keras-tuner) (0.4.8)
Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.8/dist-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard->keras-tuner) (3.2.2)
Installing collected packages: jedi, kt-legacy, keras-tuner
Successfully installed jedi-0.18.2 keras-tuner-1.1.3 kt-legacy-1.0.4


TensorFlow version: 2.9.2
Eager execution is: True
Keras version: 2.9.0
Num GPUs Available:  1


nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
Thu Dec  8 23:32:32 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P0    26W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

   Now we can define a function to set the numpy, random and tensorflow seed as well as configure the session environment for reproducibility.

In [ ]:
def init_seeds(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    session_conf = tf.compat.v1.ConfigProto()
    session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1,
                                            inter_op_parallelism_threads=1)
    os.environ['TF_CUDNN_DETERMINISTIC'] = 'True'
    os.environ['TF_DETERMINISTIC_OPS'] = 'True'
    sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(),
                                config=session_conf)
    tf.compat.v1.keras.backend.set_session(sess)

    return sess

init_seeds(seed=42)
Out[ ]:
<tensorflow.python.client.session.Session at 0x7fc1b2ae9e20>

   The data can then be read from the data directory, partitioned into the features and the target, and the training and test sets created using test_size=0.2. These processed sets can be saved for later use with other ML algorithms so comparisons can be made. Next, we need to create dummy variables for the categorical variables since most algorithms require the model input to be numerical values. Neural networks tend to perform better when the input data is scaled, so let's use the StandardScaler that is fit on the training set and then applied to the test set.

In [ ]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
pd.set_option('display.max_columns', None)

df = pd.read_csv('usedCars_final.csv', low_memory=False)

X = df.drop(['price'], axis=1)
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=seed_value)

del X, y

train = pd.concat([y_train, X_train], axis=1)
train.to_csv('usedCars_trainSet.csv', index=False)

test = pd.concat([y_test, X_test], axis=1)
test.to_csv('usedCars_testSet.csv', index=False)

del, train, test

X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
print('Dimensions of X_train for input:', X_train.shape[1])

sc = StandardScaler()
X_train = pd.DataFrame(sc.fit_transform(X_train))
X_test = pd.DataFrame(sc.transform(X_test))
Dimensions of X_train for input: 53

Baseline Model

   It is always good to use a baseline model which can function as a reference for downstream comparisons. This can also suggest which parameter(s) need more tuning to generate a model with better model metrics. Since the dependent variable price is quantitative, the metrics for regression-based problems should be followed. Let's first set up where the logs and callbacks from the model are saved. We can specify the parameters for EarlyStopping which monitors the val_loss with a patience=5. This says that if the validation loss from the model does not improve after five epochs, then the model will stop training. We can also specify the ModelCheckpoint to monitor the mse and save only the one with the lowest mse.

In [ ]:
import datetime
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

%load_ext tensorboard

log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')

filepath = 'MLP_weights_only_baseline_b4_HPO.h5'

checkpoint_dir = os.path.dirname(filepath)

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
                                                      histogram_freq=1)

callbacks_list = [EarlyStopping(monitor='val_loss', patience=5),
                  ModelCheckpoint(filepath, monitor='mse',
                                  save_best_only=True, mode='min'),
                  tensorboard_callback]

   Next, we can define the baseline model architecture. Since the number of features after creating the dummy variables is 53 features, this needs to be specified as the input_dim. Given this is a neural network, the architecture of the baseline model could be as simple as model = Sequential() then model.compile(). Let's try using double the number of features as the first layer that decreases sequentially by 10 neurons.

In [ ]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(105, input_dim=53, kernel_initializer='normal',
                activation='relu'))
model.add(Dense(95, kernel_initializer='normal', activation='relu'))
model.add(Dense(85, kernel_initializer='normal', activation='relu'))
model.add(Dense(75, kernel_initializer='normal', activation='relu'))
model.add(Dense(65, kernel_initializer='normal', activation='relu'))
model.add(Dense(55, kernel_initializer='normal', activation='relu'))
model.add(Dense(45, kernel_initializer='normal', activation='relu'))
model.add(Dense(35, kernel_initializer='normal', activation='relu'))
model.add(Dense(25, kernel_initializer='normal', activation='relu'))
model.add(Dense(15, kernel_initializer='normal', activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1))

opt = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='mae', metrics=['mse'], optimizer=opt)
model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 105)               5670      
                                                                 
 dense_1 (Dense)             (None, 95)                10070     
                                                                 
 dense_2 (Dense)             (None, 85)                8160      
                                                                 
 dense_3 (Dense)             (None, 75)                6450      
                                                                 
 dense_4 (Dense)             (None, 65)                4940      
                                                                 
 dense_5 (Dense)             (None, 55)                3630      
                                                                 
 dense_6 (Dense)             (None, 45)                2520      
                                                                 
 dense_7 (Dense)             (None, 35)                1610      
                                                                 
 dense_8 (Dense)             (None, 25)                900       
                                                                 
 dense_9 (Dense)             (None, 15)                390       
                                                                 
 dropout (Dropout)           (None, 15)                0         
                                                                 
 dense_10 (Dense)            (None, 1)                 16        
                                                                 
=================================================================
Total params: 44,356
Trainable params: 44,356
Non-trainable params: 0
_________________________________________________________________

   Now the model can be trained by calling fit utilizing the train dataset for 30 epochs, which is the iterations over the entire X_train and y_train, using a batch_size=4, which is the number of samples per gradient update, validation_split=0.2, which is the fraction of the training data not used for training but used for evaluating the loss and the model metrics on this data at the end of each epoch., and the specified callbacks from the callbacks_list.

In [ ]:
history = model.fit(X_train, y_train, epochs=50, batch_size=4,
                    validation_split=0.2, callbacks=callbacks_list)
Epoch 1/50
52466/52466 [==============================] - 186s 4ms/step - loss: 6424.8110 - mse: 71283680.0000 - val_loss: 3043.6062 - val_mse: 16371439.0000
Epoch 2/50
52466/52466 [==============================] - 184s 4ms/step - loss: 6110.4463 - mse: 64388920.0000 - val_loss: 2983.4651 - val_mse: 15934323.0000
Epoch 3/50
52466/52466 [==============================] - 184s 4ms/step - loss: 5973.0586 - mse: 61591424.0000 - val_loss: 2975.0652 - val_mse: 15772930.0000
Epoch 4/50
52466/52466 [==============================] - 185s 4ms/step - loss: 5904.8315 - mse: 60126464.0000 - val_loss: 2742.1104 - val_mse: 13201408.0000
Epoch 5/50
52466/52466 [==============================] - 186s 4ms/step - loss: 5844.7090 - mse: 59102028.0000 - val_loss: 2617.6660 - val_mse: 12030312.0000
Epoch 6/50
52466/52466 [==============================] - 185s 4ms/step - loss: 5815.3027 - mse: 58469804.0000 - val_loss: 3065.7864 - val_mse: 15636744.0000
Epoch 7/50
52466/52466 [==============================] - 186s 4ms/step - loss: 5762.4121 - mse: 57490388.0000 - val_loss: 3213.5110 - val_mse: 17416752.0000
Epoch 8/50
52466/52466 [==============================] - 188s 4ms/step - loss: 5728.6587 - mse: 57012884.0000 - val_loss: 3119.8669 - val_mse: 16833234.0000
Epoch 9/50
52466/52466 [==============================] - 185s 4ms/step - loss: 5710.1909 - mse: 56395396.0000 - val_loss: 2438.7202 - val_mse: 10453662.0000
Epoch 10/50
52466/52466 [==============================] - 187s 4ms/step - loss: 5698.0781 - mse: 56380748.0000 - val_loss: 2442.0486 - val_mse: 10457923.0000
Epoch 11/50
52466/52466 [==============================] - 185s 4ms/step - loss: 5669.3521 - mse: 55788676.0000 - val_loss: 2973.5798 - val_mse: 15162363.0000
Epoch 12/50
52466/52466 [==============================] - 184s 4ms/step - loss: 5659.7017 - mse: 55614156.0000 - val_loss: 2436.3445 - val_mse: 10579475.0000
Epoch 13/50
52466/52466 [==============================] - 186s 4ms/step - loss: 5663.7378 - mse: 55683672.0000 - val_loss: 3060.1448 - val_mse: 15467129.0000
Epoch 14/50
52466/52466 [==============================] - 185s 4ms/step - loss: 5637.0171 - mse: 55219068.0000 - val_loss: 2830.3340 - val_mse: 13513086.0000
Epoch 15/50
52466/52466 [==============================] - 186s 4ms/step - loss: 5620.7773 - mse: 54758668.0000 - val_loss: 2617.1287 - val_mse: 12440698.0000
Epoch 16/50
52466/52466 [==============================] - 185s 4ms/step - loss: 5615.3555 - mse: 54817928.0000 - val_loss: 2338.7542 - val_mse: 9724950.0000
Epoch 17/50
52466/52466 [==============================] - 186s 4ms/step - loss: 5594.9219 - mse: 54382848.0000 - val_loss: 3252.2935 - val_mse: 17827002.0000
Epoch 18/50
52466/52466 [==============================] - 185s 4ms/step - loss: 5614.4482 - mse: 54750008.0000 - val_loss: 2351.7717 - val_mse: 9942457.0000
Epoch 19/50
52466/52466 [==============================] - 206s 4ms/step - loss: 5603.6733 - mse: 54690256.0000 - val_loss: 3087.2949 - val_mse: 15938286.0000
Epoch 20/50
52466/52466 [==============================] - 185s 4ms/step - loss: 5588.5186 - mse: 54347516.0000 - val_loss: 2574.5393 - val_mse: 11933205.0000
Epoch 21/50
52466/52466 [==============================] - 185s 4ms/step - loss: 5563.9771 - mse: 53897560.0000 - val_loss: 2429.4167 - val_mse: 10480536.0000

   Now, let's save the model if we need to reload it in the future and plot the model loss and val_loss over the training epochs.

In [ ]:
import matplotlib.pyplot as plt

model.save('./MLP_batch4_50epochs_baseline_tf.h5', save_format='tf')

#filepath = 'MLP_weights_only_baseline_b4.h5'
#model = tf.keras.models.load_model('./MLP_batch4_50epochs_baseline_tf.h5')
#model.load_weights(filepath)

plt.title('Model Error for Price')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='val_loss')
plt.ylabel('Error [Price]')
plt.xlabel('Epoch')
plt.legend()
plt.show();

   We can save the model losses in a pandas.Dataframe, and plot both the mae and mse from the train and validation sets over the epochs as well.

In [ ]:
losses = pd.DataFrame(model.history.history)
losses.plot()
plt.title('Model Error for Price')
plt.ylabel('Error [Price]')
plt.xlabel('Epoch')
plt.legend(loc='upper right')
plt.show();

   We can use model.predict to generate output predictions for the specified input samples so let's predict using the training set, convert the predicted price to a pandas.Dataframe and examine the size to determine chunk size for plotting. Let's graph the predicted vs. actual price for the training set.

In [ ]:
pred_train = model.predict(X_train)

y_pred = pd.DataFrame(pred_train)
y_pred.shape
8198/8198 [==============================] - 12s 1ms/step
Out[ ]:
(262329, 1)
In [ ]:
plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Training Set: Predicted Price vs Actual Price',
           fontsize=25)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Price', pad=15, fontsize=20)
ax1.set_ylabel('Price', fontsize=20)
ax2.plot(y_train, color='blue')
ax2.set_title('Actual Price', pad=15, fontsize=20)
plt.show();

   We can use model.predict using the test set, convert the predicted price to a pandas.Dataframe and graph the predicted vs. actual price for the test set.

In [ ]:
pred_test = model.predict(X_test)

y_pred = pd.DataFrame(pred_test)
2050/2050 [==============================] - 3s 1ms/step
In [ ]:
plt.rcParams['agg.path.chunksize'] = 1000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Price', pad=15, fontsize=20)
ax1.set_ylabel('Price', fontsize=20)
ax2.plot(y_test, color='blue')
ax2.set_title('Actual Price', pad=15, fontsize=20)
plt.show();

   To evaluate if this MLP is effective at predicting the price of used car prices, metrics like the Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and the Coefficient of Determination (R²) can be utilized.

In [ ]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print('Metrics: Train set')
print('MAE: %3f' % mean_absolute_error(y_train[:], pred_train[:]))
print('MSE: %3f' % mean_squared_error(y_train[:], pred_train[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_train[:], pred_train[:])))
print('R2: %3f' % r2_score(y_train[:], pred_train[:]))
Metrics: Train set
MAE: 2420.233420
MSE: 10379212.365806
RMSE: 3221.678501
R2: 0.886650
In [ ]:
print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y_test[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y_test[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_test[:], pred_test[:])))
print('R2: %3f' % r2_score(y_test[:], pred_test[:]))
Metrics: Test set
MAE: 2425.595299
MSE: 10448375.661413
RMSE: 3232.394725
R2: 0.885705

   The test set has higher values for each of the aforementioned by are in close proximity suggesting the model is worked quite well and was not overfit. The baseline model with this architecture explains the variance quite well, but let's test if this can be improved.

   Let's also examine how the predicted values for the maximum, average and minimum metric price and compare to the actual maximum, average and minimum price.

In [ ]:
print('Maximum Price:', np.amax(y_test))
print('Predicted Max Price:', np.amax(pred_test))
print('\nAverage Price:', np.average(y_test))
print('Predicted Average Price:', np.average(pred_test))
print('\nMinimum Price:', np.amin(y_test))
print('Predicted Minimum Price:', np.amin(pred_test))
Maximum Price: 50000.0
Predicted Max Price: 58631.125

Average Price: 28304.875086379092
Predicted Average Price: 27052.695

Minimum Price: 4490.0
Predicted Minimum Price: 7353.111

   For the baseline model, there is a higher predicted maximum price than the actual maximum price and a higher predicted minimum compared to the actual while there is a lower predicted average vs actual average price.

Hyperparameter Optimization

   Let's first set up the callbacks for the TensorBoard. Then we can define a function build_model that will evaluate different model parameters during the hyperparameter tuning process. The parameters to test are:

  • num_layers: 13 - 20 layers
  • layer_size: 70 - 130 nodes using step=10
  • learning_rate: 1 x 10-1, 1 x 10-2, 1 x 10-3

   The same 30% Dropout before the output layer as well as the same loss='mae' and metrics=['mse'] can be utilized to compare the aforementioned hyperparameters.

In [ ]:
filepath = 'MLP_weights_only_b4_HPO3.h5'
checkpoint_dir = os.path.dirname(filepath)

callbacks = [TensorBoard(log_dir=log_folder,
                         histogram_freq=1,
                         write_graph=True,
                         write_images=True,
                         update_freq='epoch',
                         profile_batch=1,
                         embeddings_freq=1)]

callbacks_list = [callbacks]

def build_model(hp):
    model = keras.Sequential()
    for i in range(hp.Int('num_layers', 13, 20)):
        model.add(tf.keras.layers.Dense(units=hp.Int('layer_size' + str(i),
                                                     min_value=70,
                                                     max_value=130, step=10),
                                        activation='relu',
                                        kernel_initializer='normal'))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(1))
    model.compile(loss='mae', metrics=['mse'],
                  optimizer=tf.keras.optimizers.Adam(
                      hp.Choice('learning_rate', values=[1e-1, 1e-2, 1e-3])))

    return model

   We can define the conditions for keras_tuner which include the objective='val_loss', the number of trials to search through (max_trials=25), the length of each trial (executions_per_trial=1), and the directory and project_name. Then we can now print a summary of the search space.

In [ ]:
import keras_tuner
from keras_tuner import BayesianOptimization

tuner = BayesianOptimization(
    build_model,
    objective='val_loss',
    max_trials=25,
    executions_per_trial=1,
    overwrite=True,
    directory='MLP_HPO3',
    project_name='MLP_HPO3')

tuner.search_space_summary()
Search space summary
Default search space size: 15
num_layers (Int)
{'default': None, 'conditions': [], 'min_value': 13, 'max_value': 20, 'step': 1, 'sampling': None}
layer_size0 (Int)
{'default': None, 'conditions': [], 'min_value': 70, 'max_value': 130, 'step': 10, 'sampling': None}
layer_size1 (Int)
{'default': None, 'conditions': [], 'min_value': 70, 'max_value': 130, 'step': 10, 'sampling': None}
layer_size2 (Int)
{'default': None, 'conditions': [], 'min_value': 70, 'max_value': 130, 'step': 10, 'sampling': None}
layer_size3 (Int)
{'default': None, 'conditions': [], 'min_value': 70, 'max_value': 130, 'step': 10, 'sampling': None}
layer_size4 (Int)
{'default': None, 'conditions': [], 'min_value': 70, 'max_value': 130, 'step': 10, 'sampling': None}
layer_size5 (Int)
{'default': None, 'conditions': [], 'min_value': 70, 'max_value': 130, 'step': 10, 'sampling': None}
layer_size6 (Int)
{'default': None, 'conditions': [], 'min_value': 70, 'max_value': 130, 'step': 10, 'sampling': None}
layer_size7 (Int)
{'default': None, 'conditions': [], 'min_value': 70, 'max_value': 130, 'step': 10, 'sampling': None}
layer_size8 (Int)
{'default': None, 'conditions': [], 'min_value': 70, 'max_value': 130, 'step': 10, 'sampling': None}
layer_size9 (Int)
{'default': None, 'conditions': [], 'min_value': 70, 'max_value': 130, 'step': 10, 'sampling': None}
layer_size10 (Int)
{'default': None, 'conditions': [], 'min_value': 70, 'max_value': 130, 'step': 10, 'sampling': None}
layer_size11 (Int)
{'default': None, 'conditions': [], 'min_value': 70, 'max_value': 130, 'step': 10, 'sampling': None}
layer_size12 (Int)
{'default': None, 'conditions': [], 'min_value': 70, 'max_value': 130, 'step': 10, 'sampling': None}
learning_rate (Choice)
{'default': 0.1, 'conditions': [], 'values': [0.1, 0.01, 0.001], 'ordered': True}

   Let's now begin the search for the best hyperparameters testing different parameters for one epoch using validation_split=0.2 and batch_size=4.

In [ ]:
tuner.search(X_train, y_train, epochs=1, validation_split=0.2, batch_size=4,
             callbacks=callbacks_list)
Trial 25 Complete [00h 03m 17s]
val_loss: 3085.916259765625

Best val_loss So Far: 2655.16650390625
Total elapsed time: 01h 38m 11s

   Now that the search has completed, let's print a summary of the results from the trials.

In [ ]:
tuner.results_summary()
Results summary
Results in MLP_HPO3/MLP_HPO3
Showing 10 best trials
<keras_tuner.engine.objective.Objective object at 0x7f3d40671580>
Trial summary
Hyperparameters:
num_layers: 13
layer_size0: 130
layer_size1: 130
layer_size2: 130
layer_size3: 70
layer_size4: 130
layer_size5: 70
layer_size6: 130
layer_size7: 70
layer_size8: 130
layer_size9: 70
layer_size10: 130
layer_size11: 130
layer_size12: 130
learning_rate: 0.001
layer_size13: 130
layer_size14: 90
layer_size15: 130
layer_size16: 130
layer_size17: 70
layer_size18: 130
layer_size19: 70
Score: 2655.16650390625
Trial summary
Hyperparameters:
num_layers: 13
layer_size0: 130
layer_size1: 130
layer_size2: 130
layer_size3: 70
layer_size4: 130
layer_size5: 70
layer_size6: 70
layer_size7: 110
layer_size8: 130
layer_size9: 70
layer_size10: 130
layer_size11: 130
layer_size12: 130
learning_rate: 0.001
layer_size13: 80
layer_size14: 70
layer_size15: 130
layer_size16: 130
layer_size17: 130
layer_size18: 70
layer_size19: 70
Score: 2671.783447265625
Trial summary
Hyperparameters:
num_layers: 13
layer_size0: 130
layer_size1: 70
layer_size2: 130
layer_size3: 70
layer_size4: 130
layer_size5: 70
layer_size6: 130
layer_size7: 130
layer_size8: 130
layer_size9: 70
layer_size10: 130
layer_size11: 70
layer_size12: 70
learning_rate: 0.001
layer_size13: 120
layer_size14: 70
layer_size15: 110
Score: 2680.460205078125
Trial summary
Hyperparameters:
num_layers: 14
layer_size0: 80
layer_size1: 90
layer_size2: 120
layer_size3: 100
layer_size4: 120
layer_size5: 90
layer_size6: 130
layer_size7: 90
layer_size8: 110
layer_size9: 70
layer_size10: 110
layer_size11: 90
layer_size12: 90
learning_rate: 0.001
layer_size13: 70
Score: 2700.14501953125
Trial summary
Hyperparameters:
num_layers: 13
layer_size0: 90
layer_size1: 70
layer_size2: 130
layer_size3: 130
layer_size4: 130
layer_size5: 70
layer_size6: 130
layer_size7: 130
layer_size8: 130
layer_size9: 70
layer_size10: 130
layer_size11: 70
layer_size12: 130
learning_rate: 0.001
layer_size13: 130
layer_size14: 110
layer_size15: 70
Score: 2703.669921875
Trial summary
Hyperparameters:
num_layers: 13
layer_size0: 130
layer_size1: 120
layer_size2: 130
layer_size3: 130
layer_size4: 130
layer_size5: 70
layer_size6: 130
layer_size7: 130
layer_size8: 130
layer_size9: 70
layer_size10: 130
layer_size11: 130
layer_size12: 130
learning_rate: 0.001
layer_size13: 130
layer_size14: 130
layer_size15: 130
layer_size16: 130
layer_size17: 130
layer_size18: 70
layer_size19: 70
Score: 2710.989013671875
Trial summary
Hyperparameters:
num_layers: 13
layer_size0: 80
layer_size1: 120
layer_size2: 130
layer_size3: 70
layer_size4: 130
layer_size5: 70
layer_size6: 130
layer_size7: 120
layer_size8: 130
layer_size9: 70
layer_size10: 130
layer_size11: 70
layer_size12: 120
learning_rate: 0.001
layer_size13: 70
layer_size14: 70
layer_size15: 70
Score: 2715.1142578125
Trial summary
Hyperparameters:
num_layers: 13
layer_size0: 70
layer_size1: 70
layer_size2: 130
layer_size3: 70
layer_size4: 130
layer_size5: 70
layer_size6: 130
layer_size7: 70
layer_size8: 130
layer_size9: 70
layer_size10: 130
layer_size11: 70
layer_size12: 130
learning_rate: 0.001
layer_size13: 130
layer_size14: 70
layer_size15: 130
layer_size16: 130
layer_size17: 110
layer_size18: 70
layer_size19: 70
Score: 2730.990966796875
Trial summary
Hyperparameters:
num_layers: 13
layer_size0: 70
layer_size1: 100
layer_size2: 130
layer_size3: 100
layer_size4: 130
layer_size5: 70
layer_size6: 130
layer_size7: 100
layer_size8: 130
layer_size9: 70
layer_size10: 130
layer_size11: 70
layer_size12: 90
learning_rate: 0.001
layer_size13: 70
layer_size14: 120
layer_size15: 120
Score: 2735.333984375
Trial summary
Hyperparameters:
num_layers: 13
layer_size0: 80
layer_size1: 130
layer_size2: 130
layer_size3: 100
layer_size4: 130
layer_size5: 70
layer_size6: 130
layer_size7: 130
layer_size8: 130
layer_size9: 70
layer_size10: 130
layer_size11: 90
layer_size12: 110
learning_rate: 0.001
layer_size13: 130
layer_size14: 70
layer_size15: 130
layer_size16: 130
layer_size17: 70
layer_size18: 70
layer_size19: 120
Score: 2746.524658203125

   The hyperparameters that resulted in the lowest val_loss (Score: 2655) contained 13 layers, a learning_rate: 0.001 with the initial layers being the largest (130). 9/13 layers contain a layer_size of 130 while 4 contain a layer_size of 70.

Fit The Best Model from HPO

   Let's now set the path where model is saved and set up the callbacks with EarlyStopping that monitors the val_loss and will stop training if it does not improve after 5 epochs, the ModelCheckpoint that monitors the mse saving only the best one with a min loss and the TensorBoard as well.

In [ ]:
%load_ext tensorboard

log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')

filepath = 'MLP_weights_only_b4_HPO1_bestModel.h5'
checkpoint_dir = os.path.dirname(filepath)

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
                                                      histogram_freq=1)

callbacks_list = [EarlyStopping(monitor='val_loss', patience=5),
                  ModelCheckpoint(filepath, monitor='mse',
                                  save_best_only=True, mode='min'),
                  tensorboard_callback]
/content/drive/MyDrive/UsedCarsCarGurus/Models/DL/MLP/Models
The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard

   We can now define the model architecture using the results from kerar-tuner search, compile and then examine.

In [ ]:
model = Sequential()
model.add(Dense(130, input_dim=53, kernel_initializer='normal',
                activation='relu'))
model.add(Dense(130, kernel_initializer='normal', activation='relu'))
model.add(Dense(130, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(130, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(130, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(130, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(130, kernel_initializer='normal', activation='relu'))
model.add(Dense(130, kernel_initializer='normal', activation='relu'))
model.add(Dense(130, kernel_initializer='normal', activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1))

opt = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='mae', metrics=['mse'], optimizer=opt)
model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 130)               7020      
                                                                 
 dense_1 (Dense)             (None, 130)               17030     
                                                                 
 dense_2 (Dense)             (None, 130)               17030     
                                                                 
 dense_3 (Dense)             (None, 70)                9170      
                                                                 
 dense_4 (Dense)             (None, 130)               9230      
                                                                 
 dense_5 (Dense)             (None, 70)                9170      
                                                                 
 dense_6 (Dense)             (None, 130)               9230      
                                                                 
 dense_7 (Dense)             (None, 70)                9170      
                                                                 
 dense_8 (Dense)             (None, 130)               9230      
                                                                 
 dense_9 (Dense)             (None, 70)                9170      
                                                                 
 dense_10 (Dense)            (None, 130)               9230      
                                                                 
 dense_11 (Dense)            (None, 130)               17030     
                                                                 
 dense_12 (Dense)            (None, 130)               17030     
                                                                 
 dropout (Dropout)           (None, 130)               0         
                                                                 
 dense_13 (Dense)            (None, 1)                 131       
                                                                 
=================================================================
Total params: 148,871
Trainable params: 148,871
Non-trainable params: 0
_________________________________________________________________

   The total parameters for this model have increased to 148,871 from 44,356 that were utilized for the baseline model. Now the model can be trained by calling fit utilizing the train dataset for 50 epochs using a batch_size=4, validation_split=0.2 and the specified callbacks from the callbacks_list.

In [ ]:
history = model.fit(X_train, y_train, epochs=50, batch_size=4,
                    validation_split=0.2, callbacks=callbacks_list)
Epoch 1/50
52466/52466 [==============================] - 202s 4ms/step - loss: 4140.3911 - mse: 30599140.0000 - val_loss: 2661.3477 - val_mse: 12385773.0000
Epoch 2/50
52466/52466 [==============================] - 193s 4ms/step - loss: 3468.0640 - mse: 20902856.0000 - val_loss: 3197.4976 - val_mse: 17011764.0000
Epoch 3/50
52466/52466 [==============================] - 194s 4ms/step - loss: 3314.2002 - mse: 19116372.0000 - val_loss: 2340.7529 - val_mse: 9500505.0000
Epoch 4/50
52466/52466 [==============================] - 191s 4ms/step - loss: 3228.4912 - mse: 18131914.0000 - val_loss: 2417.0994 - val_mse: 10551332.0000
Epoch 5/50
52466/52466 [==============================] - 193s 4ms/step - loss: 3197.0317 - mse: 17736014.0000 - val_loss: 2302.4119 - val_mse: 9574762.0000
Epoch 6/50
52466/52466 [==============================] - 192s 4ms/step - loss: 3160.1089 - mse: 17387694.0000 - val_loss: 2382.0562 - val_mse: 10013133.0000
Epoch 7/50
52466/52466 [==============================] - 191s 4ms/step - loss: 3135.6331 - mse: 17054686.0000 - val_loss: 2268.5950 - val_mse: 9096997.0000
Epoch 8/50
52466/52466 [==============================] - 191s 4ms/step - loss: 3123.6187 - mse: 16885074.0000 - val_loss: 2150.8435 - val_mse: 8281727.5000
Epoch 9/50
52466/52466 [==============================] - 191s 4ms/step - loss: 3110.8313 - mse: 16784448.0000 - val_loss: 2361.2278 - val_mse: 9920246.0000
Epoch 10/50
52466/52466 [==============================] - 191s 4ms/step - loss: 3090.8972 - mse: 16610272.0000 - val_loss: 2206.7603 - val_mse: 8460629.0000
Epoch 11/50
52466/52466 [==============================] - 192s 4ms/step - loss: 3083.0459 - mse: 16492069.0000 - val_loss: 2113.8247 - val_mse: 8068696.0000
Epoch 12/50
52466/52466 [==============================] - 189s 4ms/step - loss: 3082.2766 - mse: 16499040.0000 - val_loss: 2135.5435 - val_mse: 8162606.0000
Epoch 13/50
52466/52466 [==============================] - 189s 4ms/step - loss: 3060.6533 - mse: 16269251.0000 - val_loss: 2107.6643 - val_mse: 7989292.0000
Epoch 14/50
52466/52466 [==============================] - 190s 4ms/step - loss: 3033.2791 - mse: 16014636.0000 - val_loss: 2114.0759 - val_mse: 7949660.5000
Epoch 15/50
52466/52466 [==============================] - 188s 4ms/step - loss: 3033.4514 - mse: 16015924.0000 - val_loss: 2251.8296 - val_mse: 9173284.0000
Epoch 16/50
52466/52466 [==============================] - 189s 4ms/step - loss: 3029.2383 - mse: 15905849.0000 - val_loss: 2110.7188 - val_mse: 7993517.0000
Epoch 17/50
52466/52466 [==============================] - 190s 4ms/step - loss: 3003.9844 - mse: 15698794.0000 - val_loss: 2105.1919 - val_mse: 7956177.5000
Epoch 18/50
52466/52466 [==============================] - 188s 4ms/step - loss: 3012.2371 - mse: 15774513.0000 - val_loss: 2126.6902 - val_mse: 8016110.5000
Epoch 19/50
52466/52466 [==============================] - 209s 4ms/step - loss: 3006.0662 - mse: 15701212.0000 - val_loss: 2058.7239 - val_mse: 7657570.5000
Epoch 20/50
52466/52466 [==============================] - 189s 4ms/step - loss: 2989.7861 - mse: 15561457.0000 - val_loss: 2096.7629 - val_mse: 7893595.5000
Epoch 21/50
52466/52466 [==============================] - 190s 4ms/step - loss: 2988.3225 - mse: 15555527.0000 - val_loss: 2086.4062 - val_mse: 7908783.0000
Epoch 22/50
52466/52466 [==============================] - 189s 4ms/step - loss: 2990.1040 - mse: 15579956.0000 - val_loss: 2087.6375 - val_mse: 7747406.0000
Epoch 23/50
52466/52466 [==============================] - 189s 4ms/step - loss: 2977.5259 - mse: 15448607.0000 - val_loss: 2084.3987 - val_mse: 7846724.5000
Epoch 24/50
52466/52466 [==============================] - 190s 4ms/step - loss: 2978.4917 - mse: 15420618.0000 - val_loss: 2127.2407 - val_mse: 8254562.5000

   Now, let's save the model if we need to reload it in the future and plot the model loss and val_loss over the training epochs.We can also save the model losses in a pandas.Dataframe, and plot both the mae and mse from the train and validation sets over the epochs as well.

In [ ]:
model.save('./MLP_batch4_50epochs_HPO1_bestModel_tf.h5', save_format='tf')

# Load model for more training or later use
#filepath = 'MLP_weights_only_b4_HPO1_bestModel.h5'
#model = tf.keras.models.load_model('./MLP_batch4_50epochs_HPO1_bestModel_tf.h5')
#model.load_weights(filepath)

plt.title('Model Error for Price')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='val_loss')
plt.ylabel('Error [Price]')
plt.xlabel('Epoch')
plt.legend()
plt.show()
In [ ]:
losses = pd.DataFrame(model.history.history)
losses.plot()
plt.title('Model Error for Price')
plt.ylabel('Error [Price]')
plt.xlabel('Epoch')
plt.legend(loc='upper right')
plt.show()

   We can use model.predict to generate output predictions for the specified input samples so let's predict using the training set, convert the predicted price to a pandas.Dataframe and examine the size to determine chunk size for plotting. Let's graph the predicted vs. actual price for the train set.

In [ ]:
pred_train = model.predict(X_train)

y_pred = pd.DataFrame(pred_train)
y_pred.shape
8198/8198 [==============================] - 13s 2ms/step
Out[ ]:
(262329, 1)
In [ ]:
plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Training Set: Predicted Price vs Actual Price',
           fontsize=25)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Price', pad=15, fontsize=20)
ax1.set_ylabel('Price', fontsize=20)
ax2.plot(y_train, color='blue')
ax2.set_title('Actual Price', pad=15, fontsize=20)
plt.show()

   We can use model.predict using the test set, convert the predicted price to a pandas.Dataframe and graph the predicted vs. actual price for the test set.

In [ ]:
pred_test = model.predict(X_test)
y_pred = pd.DataFrame(pred_test)
2050/2050 [==============================] - 3s 1ms/step
In [ ]:
plt.rcParams['agg.path.chunksize'] = 1000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Price', pad=15, fontsize=20)
ax1.set_ylabel('Price', fontsize=20)
ax2.plot(y_test, color='blue')
ax2.set_title('Actual Price', pad=15, fontsize=20)
plt.show()

   To evaluate if this MLP is effective at predicting the price of used cars, metrics like the Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and the Coefficient of Determination (R²) can be utilized for both the training and test sets.

In [ ]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print('Metrics: Train set')
print('MAE: %3f' % mean_absolute_error(y_train[:], pred_train[:]))
print('MSE: %3f' % mean_squared_error(y_train[:], pred_train[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_train[:], pred_train[:])))
print('R2: %3f' % r2_score(y_train[:], pred_train[:]))
Metrics: Train set
MAE: 2097.008287
MSE: 8024392.726339
RMSE: 2832.735908
R2: 0.912366
In [ ]:
print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y_test[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y_test[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_test[:], pred_test[:])))
print('R2: %3f' % r2_score(y_test[:], pred_test[:]))
Metrics: Test set
MAE: 2123.528194
MSE: 8231294.883357
RMSE: 2869.023333
R2: 0.909957

   When comparing the model metrics from the hyperparameter tuning to the baseline model, the MAE, MSE and RMSE were all lower for the training and test sets. The (R²) was higher for both sets compared to the baseline model.

   Let's also examine how the predicted values for the maximum, average and minimum price compare to the actual maximum, average and minimum price.

In [ ]:
print('Maximum Price:', np.amax(y_test))
print('Predicted Max Price:', np.amax(pred_test))
print('\nAverage Price:', np.average(y_test))
print('Predicted Average Price:', np.average(pred_test))
print('\nMinimum Price:', np.amin(y_test))
print('Predicted Minimum Price:', np.amin(pred_test))
Maximum Price: 50000.0
Predicted Max Price: 49700.676

Average Price: 28304.875086379092
Predicted Average Price: 27487.953

Minimum Price: 4490.0
Predicted Minimum Price: 7409.093

   When evaluating the model after the hyperparameter search with the test set, lower and closer values for the predicted vs. actual maximum price (49700.676 vs. 58631.125), but higher and farther away from actual for the predicted average price (27487.953 vs. 27052.695) and predicted minimum price (7409.093 vs. 7353.111) when compared to the baseline model.

   Another hyperparameter search was tested using more complex network architecture:

  • num_layers: 20 - 30
  • layer_size: 100 - 200 using step=10
  • The same learning rate

   This utilized 50 trials vs. 25 trials for the aforementioned HPO. The time for the completion of the this was 03h 54m 19s compared to the aforementioned of 01h 38m 11s. The best val_loss = 547.799072265625 compared to the best val_loss = 2655.16650390625. The more complex search resulted in a network with dense_20 where 12 of the Dense layers had 200 nodes resulting the total parameters of 568,051 compared to 148,871 parameters for the aforementioned. This resulted in only a slightly better R² but a worse predicted minimum price. Another search tested stratify=X.listed_date_yearMonth with 50 trials and the completion time was 05h 40m 54s with the best val_loss = 2471.0478515625 and 63,236 total parameters.

Create an Azure Files Datastore

   Since we determined the previous model resulted in the lowest error, let's move this to the cloud with Azure Machine Learning Studio. Let's first connect to the workspace by authenticating with DefaultAzureCredential and then input the necessary components with MLClient to get a handle to the workspace.

In [ ]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

ml_client = MLClient(
    credential=credential,
    subscription_id='a134465f-eb15-4297-b902-3c97d4c81838',
    resource_group_name='aschultzdata',
    workspace_name='ds-ml-env',
)

   Then we can create the datastore by specifying the name of the datastore, the description, the account, the name of the container, the protocol and the account_key.

In [ ]:
from azure.ai.ml.entities import AzureBlobDatastore
from azure.ai.ml.entities import AccountKeyConfiguration

store = AzureBlobDatastore(
    name='usedcars_datastore',
    description='Datastore for Used Cars',
    account_name='dsmlenv8898281366',
    container_name='usedcarscargurus',
    protocol='https',
    credentials=AccountKeyConfiguration(
        account_key='XXXxxxXXXxXXXXxxXXXXXxXXXXXxXxxXxXXXxXXXxXXxxxXXxxXXXxXxXXXxxXxxXXXXxxxxxXXxxxxxxXXXxXXX'
    ),
)

ml_client.create_or_update(store)
Out[ ]:
AzureBlobDatastore({'type': <DatastoreType.AZURE_BLOB: 'AzureBlob'>, 'name': 'usedcars_datastore', 'description': 'Datastore for Used Cars', 'tags': {}, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/a134465f-eb15-4297-b902-3c97d4c81838/resourceGroups/aschultzdata/providers/Microsoft.MachineLearningServices/workspaces/ds-ml-env/datastores/usedcars_datastore', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/cpu-standardds12v2/code/Users/aschultz.data/UsedCarsCarGurus/Models/DL/MLP', 'creation_context': None, 'serialize': <msrest.serialization.Serializer object at 0x7f543aae2740>, 'credentials': {'type': 'account_key'}, 'container_name': 'usedcarscargurus', 'account_name': 'dsmlenv8898281366', 'endpoint': 'core.windows.net', 'protocol': 'https'})

   Finally, the train/test sets can be uploaded to the usedcarscargurus container.

Train the MLP using Microsoft Azure

   Before we train the model, we need to create a compute instance/cluster with a defined name, so let's create one with a GPU, if it has not all ready been created. Then we can specify the cluster components containing the on-demand virtual machine (VM) service, the type of VM, the minimum/maximum nodes in the cluster, the time the node will run after the job finishes/terminates and the type of tier. Finally, we can input the object to create_or_update from MLClient.

In [ ]:
from azure.ai.ml.entities import AmlCompute

gpu_compute_target = 'gpu-cluster-NC4as-T4-v3'

try:
    gpu_cluster = ml_client.compute.get(gpu_compute_target)
    print(
        f"You already have a cluster named {gpu_compute_target}, we'll reuse it as is."
    )

except Exception:
    print('Creating a new gpu compute target...')

    gpu_cluster = AmlCompute(
        name=gpu_compute_target,
        type='amlcompute',
        size='Standard_NC4as_T4_v3',
        min_instances=0,
        max_instances=1,
        idle_time_before_scale_down=180,
        tier='Dedicated',
    )
    print(
        f"AMLCompute with name {gpu_cluster.name} will be created, with compute size {gpu_cluster.size}"
    )

    gpu_cluster = ml_client.compute.begin_create_or_update(gpu_cluster)
Creating a new gpu compute target...
AMLCompute with name gpu-cluster-NC4as-T4-v3 will be created, with compute size Standard_NC4as_T4_v3

   Next, let's create the environment by making a directory for the dependencies which lists the components for the runtime and the libraries installed on the compute for training the model.

In [ ]:
import os

dependencies_dir = './dependencies'
os.makedirs(dependencies_dir, exist_ok=True)

   Now we can write the conda.yaml file into the dependencies directory.

In [ ]:
%%writefile {dependencies_dir}/conda.yaml
name: model-env
channels:
  - conda-forge
dependencies:
  - python=3.8.5
  - pip=23.1.2
  - numpy=1.21.6
  - scikit-learn==1.1.2
  - scipy
  - pandas>=1.1,<1.2
  - pip:
    - inference-schema[numpy-support]==1.3.0
    - mlflow
    - azureml-mlflow==1.50.0
    - psutil==5.9.0
    - tqdm
    - ipykernel
    - matplotlib
    - tensorflow==2.12
    - keras-tuner==1.1.3
Writing ./dependencies/conda.yaml

   The created conda.yaml file allows for the environment to be created and registered in the workspace.

In [ ]:
from azure.ai.ml.entities import Environment

custom_env_name = 'aml-usedcars-gpu-mlp'

custom_job_env = Environment(
    name=custom_env_name,
    description='Custom environment for Used Cars MLP job',
    tags={'tensorflow': '2.12'},
    conda_file=os.path.join(dependencies_dir, 'conda.yaml'),
    image='mcr.microsoft.com/azureml/curated/tensorflow-2.12-cuda11:6',
)
custom_job_env = ml_client.environments.create_or_update(custom_job_env)

print(
    f'Environment with name {custom_job_env.name} is registered to workspace, the environment version is {custom_job_env.version}'
)
Environment with name aml-usedcars-gpu-mlp is registered to workspace, the environment version is 5

   Next, we can create the training script by first creating the source folder where the training script, main.py, will be stored.

In [ ]:
train_src_dir = './src'
os.makedirs(train_src_dir, exist_ok=True)

   The training script includes preparing the environment, reading the data, data preparation, model training, saving the model and evaluating the model. This includes specifying the dependencies to import and utilize, setting the seed, defining the input/output arguments of argparse, start logging, reading the train/test sets, preprocessing the data for dummy variables and scaling using the StandardScaler. Then the number of samples and features are logged with MLFlow. It uses this to then train a MLP model using the best parameters from keras-tuner where the loss and val_Loss are logged with MLFlow with defined callbacks. The model is then saved and evaluated for the model loss, the metrics of the train/test sets and the predicted vs. actual minimum/average/maximum price, which are logged as MLFlow artifacts and metrics.

In [ ]:
%%writefile {train_src_dir}/main.py
import os
import random
import numpy as np
import warnings
import argparse
import pandas as pd
from sklearn.preprocessing import StandardScaler
import mlflow
import tensorflow as tf
import datetime
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from keras.callbacks import Callback
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')

def init_seeds(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    session_conf = tf.compat.v1.ConfigProto()
    session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1,
                                            inter_op_parallelism_threads=1)
    os.environ['TF_CUDNN_DETERMINISTIC'] = 'True'
    os.environ['TF_DETERMINISTIC_OPS'] = 'True'
    sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(),
                                config=session_conf)
    tf.compat.v1.keras.backend.set_session(sess)

    return sess

init_seeds(seed=42)

def main():
    """Main function of the script."""

    parser = argparse.ArgumentParser()
    parser.add_argument('--train_data', type=str, help='path to input train data')
    parser.add_argument('--test_data', type=str, help='path to input test data')
    parser.add_argument('--epochs', required=False, default=30, type=int)
    parser.add_argument('--batch_size', required=False, default=1, type=int)
    args = parser.parse_args()

    mlflow.start_run()

    ###################
    #<prepare the data>
    ###################
    print(' '.join(f'{k}={v}' for k, v in vars(args).items()))

    print('input train data:', args.train_data)
    print('input test data:', args.test_data)

    trainDF = pd.read_csv(args.train_data, low_memory=False)
    testDF = pd.read_csv(args.test_data, low_memory=False)

    train_label = trainDF[['price']]
    test_label = testDF[['price']]

    train_features = trainDF.drop(columns = ['price'])
    test_features = testDF.drop(columns = ['price'])

    train_features = pd.get_dummies(train_features, drop_first=True)
    test_features = pd.get_dummies(test_features, drop_first=True)

    sc = StandardScaler()
    train_features = pd.DataFrame(sc.fit_transform(train_features))
    test_features = pd.DataFrame(sc.transform(test_features))

    mlflow.log_metric('num_samples', train_features.shape[0])
    mlflow.log_metric('num_features', train_features.shape[1])

    print(f'Training with data of shape {train_features.shape}')

    ####################
    #</prepare the data>
    ####################

    ##################
    #<train the model>
    ##################
    model = Sequential()
    model.add(Dense(130, input_dim=53, kernel_initializer='normal',
                    activation='relu'))
    model.add(Dense(130, kernel_initializer='normal', activation='relu'))
    model.add(Dense(130, kernel_initializer='normal', activation='relu'))
    model.add(Dense(70, kernel_initializer='normal', activation='relu'))
    model.add(Dense(130, kernel_initializer='normal', activation='relu'))
    model.add(Dense(70, kernel_initializer='normal', activation='relu'))
    model.add(Dense(130, kernel_initializer='normal', activation='relu'))
    model.add(Dense(70, kernel_initializer='normal', activation='relu'))
    model.add(Dense(130, kernel_initializer='normal', activation='relu'))
    model.add(Dense(70, kernel_initializer='normal', activation='relu'))
    model.add(Dense(130, kernel_initializer='normal', activation='relu'))
    model.add(Dense(130, kernel_initializer='normal', activation='relu'))
    model.add(Dense(130, kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(1))

    opt = tf.keras.optimizers.Adam(learning_rate=0.001)
    model.compile(loss='mae', metrics=['mse'], optimizer=opt)
    model.summary()

    class LogRunMetrics(Callback):
        def on_epoch_end(self, epoch, log):
            # log a value repeated which creates a list
            mlflow.log_metric('Loss', log['loss'])
            mlflow.log_metric('Val_Loss', log['val_loss'])

    log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')

    filepath = 'MLP_weights_only_HPO1_bestModel.tf'
    checkpoint_dir = os.path.dirname(filepath)

    callbacks_list = [EarlyStopping(monitor='val_loss', patience=5),
                      ModelCheckpoint(filepath, monitor='mse',
                                      save_best_only=True, mode='min'),
                      LogRunMetrics()]

    history = model.fit(train_features, train_label, epochs=args.epochs,
                        batch_size=args.batch_size, validation_split=0.2,
                        callbacks=callbacks_list)

    model.save('./MLP_HPO1_bestModel', save_format='tf')

    #filepath = 'MLP_weights_only_b4_HPO1_bestModel.h5'
    #model = tf.keras.models.load_model('./MLP_HPO1_bestModel_tf.h5')
    #model.load_weights(filepath)

    ##################
    #</train the model>
    ##################

    #####################
    #<evaluate the model>
    #####################
    plt.title('Model Error for Price')
    plt.plot(history.history['loss'], label='train')
    plt.plot(history.history['val_loss'], label='val_loss')
    plt.ylabel('Error [Price]')
    plt.xlabel('Epoch')
    plt.legend()
    plt.grid(True)
    plt.savefig('Loss vs. Price.png')
    mlflow.log_artifact('Loss vs. Price.png')
    plt.close()

    losses = pd.DataFrame(model.history.history)
    losses.plot()
    plt.title('Model Error for Price')
    plt.ylabel('Error [Price]')
    plt.xlabel('Epoch')
    plt.legend(loc='upper right')
    plt.grid(True)
    plt.savefig('Error vs. Price.png')
    mlflow.log_artifact('Error vs. Price.png')
    plt.close()

    pred_train = model.predict(train_features)

    train_mae = mean_absolute_error(train_label[:], pred_train[:])
    train_mse = mean_squared_error(train_label[:], pred_train[:])
    train_rmse = np.sqrt(mean_squared_error(train_label[:], pred_train[:]))
    train_r2 = r2_score(train_label[:], pred_train[:])

    pred_test = model.predict(test_features)

    test_mae = mean_absolute_error(test_label[:], pred_test[:])
    test_mse = mean_squared_error(test_label[:], pred_test[:])
    test_rmse = np.sqrt(mean_squared_error(test_label[:], pred_test[:]))
    test_r2 = r2_score(test_label[:], pred_test[:])

    mlflow.log_metric('train_mae', train_mae)
    mlflow.log_metric('train_mse', train_mse)
    mlflow.log_metric('train_rmse', train_rmse)
    mlflow.log_metric('train_r2', train_r2)
    mlflow.log_metric('test_mae', test_mae)
    mlflow.log_metric('test_mse', test_mse)
    mlflow.log_metric('test_rmse', test_rmse)
    mlflow.log_metric('test_r2', test_r2)

    MaximumPrice = np.amax(test_label)
    PredictedMaxPrice = np.amax(pred_test)
    AveragePrice = np.average(test_label)
    PredictedAveragePrice = np.average(pred_test)
    MinimumPrice = np.amin(test_label)
    PredictedMinimumPrice = np.amin(pred_test)

    mlflow.log_metric('Maximum Price', MaximumPrice)
    mlflow.log_metric('Predicted Maximum Price', PredictedMaxPrice)
    mlflow.log_metric('Average Price', AveragePrice)
    mlflow.log_metric('Predicted Average Price', PredictedAveragePrice)
    mlflow.log_metric('Minimum Price', MinimumPrice)
    mlflow.log_metric('Predicted Minimum Price', PredictedMinimumPrice)

    ###################
    #</evaluate the model>
    ###################

    mlflow.end_run()

if __name__ == "__main__":
    main()
Writing ./src/main.py

   To train the model, a command job configured with the input specifying the input data, the number of epochs and the batch size, which then runs the training script using the specified compute resource, environment, and the parameters specified to be logged needs to be submitted as a job.

In [ ]:
from azure.ai.ml import command
from azure.ai.ml import Input

registered_model_name = 'usedcars_mlp_model'

job = command(
    inputs=dict(
        train_data=Input(
            type='uri_file',
            path='azureml://datastores/usedcars_datastore/paths/usedCars_trainSet.csv',
        ),
        test_data=Input(
            type='uri_file',
            path = 'azureml://datastores/usedcars_datastore/paths/usedCars_testSet.csv',
        ),
        epochs=50,
        batch_size=4,
        registered_model_name=registered_model_name,
    ),

    code='./src/',
    command='python main.py --train_data ${{inputs.train_data}} --test_data ${{inputs.test_data}} --epochs ${{inputs.epochs}} --batch_size ${{inputs.batch_size}}',
    environment='aml-usedcars-gpu-mlp@latest',
    compute='gpu-cluster-NC4as-T4-v3',
    display_name='usedcars_mlp_prediction',
)

   Finally, this job can be submitted to run in Azure Machine Learning Studio using the create_or_update command with ml_client.

In [ ]:
ml_client.create_or_update(job)
Class AutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class AutoDeleteConditionSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseAutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class IntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class ProtectionLevelSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseIntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Uploading src (0.01 MBs): 100%|██████████| 9073/9073 [00:00<00:00, 963415.70it/s]


Out[ ]:
ExperimentNameTypeStatusDetails Page
MLPplucky_frame_0ncbbdml4fcommandStartingLink to Azure Machine Learning studio

   The submitted job can then be viewed by selecting the link in the output of the previous cell. The logged information with MLFlow including the model metrics and saved graphs can then be viewed/downloaded when the job completes.

CatBoost

   Gradient boosting utilizes the process of iteratively combining weaker models to build strong predictive models using gradient descent. The paper CatBoost: unbiased boosting with categorical features presented a new boosting algorithm, CatBoost, or categorical boosting, where preprocessing the categorical features is not required because they are handled during the training step. Then the paper CatBoost: gradient boosting with categorical features support> explained the methodologies in greater detail.

   The notebook containing the baseline model and models using Hyperopt is here. Let's first start by setting up the environment by installing the necessary dependencies, importing the required packages, setting the options and the random and numpy seed. We can also determine if the CUDA compiler is present/which version of CUDA Toolkit is available with nvcc --version and if a GPU is present and the associated characteristics with nvidia-smi.

In [ ]:
!pip install catboost
!pip install eli5
!pip install shap
import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')

seed_value = 42
os.environ['usedCars_catGPU'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)

!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting catboost
  Downloading catboost-1.1.1-cp38-none-manylinux1_x86_64.whl (76.6 MB)
     |████████████████████████████████| 76.6 MB 34 kB/s 
Requirement already satisfied: graphviz in /usr/local/lib/python3.8/dist-packages (from catboost) (0.10.1)
Requirement already satisfied: scipy in /usr/local/lib/python3.8/dist-packages (from catboost) (1.7.3)
Requirement already satisfied: numpy>=1.16.0 in /usr/local/lib/python3.8/dist-packages (from catboost) (1.21.6)
Requirement already satisfied: six in /usr/local/lib/python3.8/dist-packages (from catboost) (1.15.0)
Requirement already satisfied: pandas>=0.24.0 in /usr/local/lib/python3.8/dist-packages (from catboost) (1.3.5)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.8/dist-packages (from catboost) (3.2.2)
Requirement already satisfied: plotly in /usr/local/lib/python3.8/dist-packages (from catboost) (5.5.0)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.8/dist-packages (from pandas>=0.24.0->catboost) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.8/dist-packages (from pandas>=0.24.0->catboost) (2022.6)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.8/dist-packages (from matplotlib->catboost) (1.4.4)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.8/dist-packages (from matplotlib->catboost) (0.11.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.8/dist-packages (from matplotlib->catboost) (3.0.9)
Requirement already satisfied: tenacity>=6.2.0 in /usr/local/lib/python3.8/dist-packages (from plotly->catboost) (8.1.0)
Installing collected packages: catboost
Successfully installed catboost-1.1.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting eli5
  Downloading eli5-0.13.0.tar.gz (216 kB)
     |████████████████████████████████| 216 kB 4.1 MB/s 
Requirement already satisfied: attrs>17.1.0 in /usr/local/lib/python3.8/dist-packages (from eli5) (22.1.0)
Collecting jinja2>=3.0.0
  Downloading Jinja2-3.1.2-py3-none-any.whl (133 kB)
     |████████████████████████████████| 133 kB 71.1 MB/s 
Requirement already satisfied: numpy>=1.9.0 in /usr/local/lib/python3.8/dist-packages (from eli5) (1.21.6)
Requirement already satisfied: scipy in /usr/local/lib/python3.8/dist-packages (from eli5) (1.7.3)
Requirement already satisfied: six in /usr/local/lib/python3.8/dist-packages (from eli5) (1.15.0)
Requirement already satisfied: scikit-learn>=0.20 in /usr/local/lib/python3.8/dist-packages (from eli5) (1.0.2)
Requirement already satisfied: graphviz in /usr/local/lib/python3.8/dist-packages (from eli5) (0.10.1)
Requirement already satisfied: tabulate>=0.7.7 in /usr/local/lib/python3.8/dist-packages (from eli5) (0.8.10)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.8/dist-packages (from jinja2>=3.0.0->eli5) (2.0.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.8/dist-packages (from scikit-learn>=0.20->eli5) (3.1.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.8/dist-packages (from scikit-learn>=0.20->eli5) (1.2.0)
Building wheels for collected packages: eli5
  Building wheel for eli5 (setup.py) ... done
  Created wheel for eli5: filename=eli5-0.13.0-py2.py3-none-any.whl size=107748 sha256=bae0dff898b1f0866f726dd83fdf8aeb8c7baec28908f3ca33a3db4c550330fd
  Stored in directory: /root/.cache/pip/wheels/85/ac/25/ffcd87ef8f9b1eec324fdf339359be71f22612459d8c75d89c
Successfully built eli5
Installing collected packages: jinja2, eli5
  Attempting uninstall: jinja2
    Found existing installation: Jinja2 2.11.3
    Uninstalling Jinja2-2.11.3:
      Successfully uninstalled Jinja2-2.11.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
notebook 5.7.16 requires jinja2<=3.0.0, but you have jinja2 3.1.2 which is incompatible.
flask 1.1.4 requires Jinja2<3.0,>=2.10.1, but you have jinja2 3.1.2 which is incompatible.
Successfully installed eli5-0.13.0 jinja2-3.1.2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting shap
  Downloading shap-0.41.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (575 kB)
     |████████████████████████████████| 575 kB 4.3 MB/s 
Requirement already satisfied: numpy in /usr/local/lib/python3.8/dist-packages (from shap) (1.21.6)
Requirement already satisfied: packaging>20.9 in /usr/local/lib/python3.8/dist-packages (from shap) (21.3)
Requirement already satisfied: tqdm>4.25.0 in /usr/local/lib/python3.8/dist-packages (from shap) (4.64.1)
Collecting slicer==0.0.7
  Downloading slicer-0.0.7-py3-none-any.whl (14 kB)
Requirement already satisfied: numba in /usr/local/lib/python3.8/dist-packages (from shap) (0.56.4)
Requirement already satisfied: cloudpickle in /usr/local/lib/python3.8/dist-packages (from shap) (1.5.0)
Requirement already satisfied: pandas in /usr/local/lib/python3.8/dist-packages (from shap) (1.3.5)
Requirement already satisfied: scipy in /usr/local/lib/python3.8/dist-packages (from shap) (1.7.3)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.8/dist-packages (from shap) (1.0.2)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.8/dist-packages (from packaging>20.9->shap) (3.0.9)
Requirement already satisfied: setuptools in /usr/local/lib/python3.8/dist-packages (from numba->shap) (57.4.0)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.8/dist-packages (from numba->shap) (5.1.0)
Requirement already satisfied: llvmlite<0.40,>=0.39.0dev0 in /usr/local/lib/python3.8/dist-packages (from numba->shap) (0.39.1)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.8/dist-packages (from importlib-metadata->numba->shap) (3.11.0)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.8/dist-packages (from pandas->shap) (2022.6)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.8/dist-packages (from pandas->shap) (2.8.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.8/dist-packages (from python-dateutil>=2.7.3->pandas->shap) (1.15.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.8/dist-packages (from scikit-learn->shap) (3.1.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.8/dist-packages (from scikit-learn->shap) (1.2.0)
Installing collected packages: slicer, shap
Successfully installed shap-0.41.0 slicer-0.0.7
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
Sun Dec 25 15:33:31 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    26W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

   There is a 70W Tesla T4 with 16GB GPU memory with CUDA Toolkit 11.2 available. Now, we can read the train/test sets into pandas dataframes and set up the features with X_train and X_test and the target/label with y_train and y_test. CatBoost does not need the categorical variables to be one hot encoded or dummy variables, so let's define the categorical variables in a list which will be used when modeling.

In [ ]:
import pandas as pd

trainDF = pd.read_csv('usedCars_trainSet.csv', low_memory=False)
testDF = pd.read_csv('usedCars_testSet.csv', low_memory=False)

X_train = trainDF.drop(columns = ['price'])
y_train = trainDF[['price']]

X_test = testDF.drop(columns = ['price'])
y_test = testDF[['price']]

categorical_features_indices = ['body_type', 'fuel_type', 'listing_color',
                                'transmission', 'wheel_system_display','State',
                                'listed_date_yearMonth']

Baseline Model

   Let's now set a baseline model using the default parameters for CatBoost, save as a.pkl file, and evaluate the model metrics for both the train and the test sets.

In [ ]:
from catboost import CatBoostRegressor
import pickle
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

cat = CatBoostRegressor(loss_function='RMSE',
                        task_type='GPU',
                        cat_features=categorical_features_indices,
                        random_state=seed_value,
                        logging_level='Silent')

cat.fit(X_train, y_train)

Pkl_Filename = 'CatBoost_Baseline.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(cat, file)

print('\nModel Metrics for Catboost Baseline Model')
y_train_pred = cat.predict(X_train)
y_test_pred = cat.predict(X_test)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(y_train, y_train_pred),
        mean_absolute_error(y_test, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train, y_train_pred),
        mean_squared_error(y_test, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train, y_train_pred, squared=False),
        mean_squared_error(y_test, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(y_train, y_train_pred),
        r2_score(y_test, y_test_pred)))

Model Metrics for Catboost Baseline Model
MAE train: 1909.980, test: 1934.409
MSE train: 6472227.870, test: 6660546.410
RMSE train: 2544.057, test: 2580.803
R^2 train: 0.930, test: 0.927

Hyperopt: 100 Trials 10-Fold Cross Validation

   To set up Hyperopt, let's first define the number of trials with NUM_EVAL = 100. We need to utilize the same k-folds for reproducibility, so let's use 10-fold as the number of folds which to split the training set into train and validation with shuffle=True and the initial defined seed value. Then hyperparameters can then be defined in a dictionary. To define integers, hp.choice with np.arange and dtype=int is used while float types are defined using hp.uniform. The space consists of 6 parameters with 4 integers and 2 float type:

  • iterations: The number of trees that can be built.
  • depth: Depth of the tree.
  • l2_leaf_reg: Coefficient at the L2 regularization term of the cost function.
  • learning_rate: Used for reducing the gradient step.
  • min_data_in_leaf: The minimum number of training samples in a leaf.
  • one_hot_max_size: Use one-hot encoding for all categorical features with a number of different values less than or equal to the given parameter value.
In [ ]:
from sklearn.model_selection import KFold
from hyperopt import hp

NUM_EVAL = 100

kfolds = KFold(n_splits=10, shuffle=True, random_state=seed_value)

catboost_tune_kwargs= {
    'iterations': hp.choice('iterations', np.arange(100, 500, dtype=int)),
    'depth': hp.choice('depth', np.arange(3, 10, dtype=int)),
    'l2_leaf_reg': hp.uniform('l2_leaf_reg', 1e-2, 1e0),
    'learning_rate': hp.uniform('learning_rate', 1e-4, 0.3),
    'min_data_in_leaf': hp.choice('min_data_in_leaf', np.arange(2, 20,
                                                                dtype=int)),
    'one_hot_max_size': hp.choice('one_hot_max_size', np.arange(2, 20,
                                                                dtype=int)),
    }

   Now, we can define a function cat_hpo for optimizing the hyperparameters. Within this function, we can use joblib to save a .pkl file that can be reloaded if more training is needed. We can define the trial number with ITERATION that will increase by 1 each trial to keep track of the parameters for each trial. The parameters which are integers need to configured to remain as integers and not as float. This is then allocated for n_estimators and max_depth, which starts at max_depth=3. Then the model type, CatBoostRegressor, needs to be defined with the parameters that will be included in all of the trials during the search, which are:

  • loss_function: The metric to use in training.
  • task_type: The processing unit type to use for training.
  • cat_features: A one-dimensional array of categorical columns indices.
  • early_stopping_rounds: Sets the overfitting detector type to Iter and stops the training after the specified number of iterations since the iteration with the optimal metric value.
  • rsm: Random subspace method. The percentage of features to use at each split selection, when features are selected over again at random.
  • random_state=seed_value: Random number seed.
  • logging_level: The logging level to output to stdout.

   Then we can fit the model with the training set utilizing the 'neg_root_mean_squared_error' for the scoring metric and 10-fold cross validation to generate the negative cross_val_score where the np.mean of the scores is saved as the rmse for each trial. The trial loss=rmse, params = parameters tested in the trial from the catboost_tune_kwargs space , the trial number (iteration), the time to complete the trial (train_time) and if the trial completed successfully or not (status) are written to the defined .csv file and appended by row for each trial.

In [ ]:
import joblib
from timeit import default_timer as timer
from sklearn.model_selection import cross_val_score
import csv
from hyperopt import STATUS_OK

def cat_hpo(config):
    """
    Objective function to tune a CatoostRegressor model
    """
    joblib.dump(bayesOpt_trials, 'Catboost_Hyperopt_100_GPU.pkl')

    global ITERATION
    ITERATION += 1

    config['iterations'] = int(config['iterations'])
    config['depth'] = int(config['depth']) + 3

    cat = CatBoostRegressor(loss_function='RMSE',
                            task_type='GPU',
                            cat_features=categorical_features_indices,
                            early_stopping_rounds=10,
                            rsm=1,
                            random_state=seed_value,
                            logging_level='Silent',
                            **config)

    start = timer()
    scores = -cross_val_score(cat, X_train, y_train,
                              scoring='neg_root_mean_squared_error',
                              cv=kfolds)
    run_time = timer() - start

    rmse = np.mean(scores)

    of_connection = open(out_file, 'a')
    writer = csv.writer(of_connection)
    writer.writerow([rmse, config, ITERATION, run_time])

    return {'loss': rmse, 'params': config, 'iteration': ITERATION,
            'train_time': run_time, 'status': STATUS_OK}

   The Tree-structured Parzen Estimator Approach (TPE) algorithm is the default algorithm for Hyperopt. This algorithm proposed in Algorithms for Hyper-Parameter Optimization uses a Bayesian optimization approach to develop a probabilistic model of a defined function (the minimum, in this case) by generating a random starting point, calculating the function, then choosing the next set of parameters that will probabilistically result in a lower minimum utilizing past calculations for the conditional probability model, then computing the real value, and continuing until the defined stopping criteria is met.

   Let's now define an out_file to save the results where the headers will be written to the file. Then we can set the global variable for the ITERATION and defined the Hyperopt Trials as bayesOpt_trials.

   We can utilize if/else condtional statements to load a .pkl file if it exists, and then utilizing fmin, we can specify the training function cat_hpo, the parameter space cat_tune_kwargs, the algorithm for optimization tpe.suggest, the number of trials to evaluate NUM_EVAL, the name of the trial set bayesOpt_trials and the random state np.random.RandomState(42). We can now begin the hyperparameter optimization (HPO) trials.

In [ ]:
from hyperopt import tpe, Trials

out_file = '/content/drive/MyDrive/UsedCarsCarGurus/Models/ML/Catboost/Hyperopt/trialOptions/Catboost_trials_100_GPU.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)

writer.writerow(['loss', 'params', 'iteration', 'train_time'])
of_connection.close()

global ITERATION
ITERATION = 0
bayesOpt_trials = Trials()
In [ ]:
from datetime import datetime, timedelta
from hyperopt import fmin

search_time_start = time.time()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('Catboost_Hyperopt_100_GPU.pkl'):
    bayesOpt_trials = joblib.load('Catboost_Hyperopt_100_GPU.pkl')
else:
    best_param = fmin(cat_hpo, catboost_tune_kwargs, algo=tpe.suggest,
                      max_evals=NUM_EVAL, trials=bayesOpt_trials,
                      rstate=np.random.RandomState(42))
print('Time to run HPO:', time.time() - search_time_start)
Start Time           2022-02-12 03:17:49.130634
100%|██████████| 100/100 [2:57:51<00:00, 106.71s/it, best loss: 2438.658380210235]
Start Time           2022-02-12 03:17:49.130634
End Time             2022-02-12 06:15:40.468643
2:57:51

   This search completed in 2 hours and 57 minutes. Let's now sort the trials with lowest loss (lowest RMSE) first and examine the top two trials:

In [ ]:
bayesOpt_trials_results = sorted(bayesOpt_trials.results,
                                 key=lambda x: x['loss'])
print('Top two trials with the lowest loss (lowest RMSE)')
print(bayesOpt_trials_results[:2])
Top two trials with the lowest loss (lowest RMSE)
[{'loss': 2438.658380210235, 'params': {'depth': 12, 'iterations': 485, 'l2_leaf_reg': 0.8954123752717814, 'learning_rate': 0.1185871010597653, 'min_data_in_leaf': 16, 'one_hot_max_size': 7}, 'iteration': 28, 'train_time': 330.391107579002, 'status': 'ok'}, {'loss': 2439.422889044304, 'params': {'depth': 12, 'iterations': 480, 'l2_leaf_reg': 0.8265094329364929, 'learning_rate': 0.10949456447966546, 'min_data_in_leaf': 14, 'one_hot_max_size': 10}, 'iteration': 99, 'train_time': 127.529981271, 'status': 'ok'}]

   Let's now access the results in the trialOptions directory by reading into a pandas.Dataframe sorted with the best scores on top and then resetting the index for slicing. Then we can convert the trial parameters from a string to a dictionary for later use. Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then saving as a .pkl file. Then we can evaluate the model metrics for both the train and the test sets using the MAE, MSE, RMSE and the (R²).

In [ ]:
import ast
import pickle
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

results = pd.read_csv('Catboost_trials_100_GPU.csv')
results.sort_values('loss', ascending=True, inplace=True)
results.reset_index(inplace=True, drop=True)

ast.literal_eval(results.loc[0, 'params'])
best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()

best_bayes_model = CatBoostRegressor(loss_function='RMSE',
                                     task_type='GPU',
                                     cat_features=categorical_features_indices,
                                     early_stopping_rounds=10,
                                     rsm=1,
                                     random_state=seed_value,
                                     logging_level='Silent',
                                     **best_bayes_params)

best_bayes_model.fit(X_train, y_train)

Pkl_Filename = 'Catboost_HPO_trials100_GPU.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_bayes_model, file)

# =============================================================================
# # To load saved model
# best_bayes_model = joblib.load('Catboost_HPO_trials100_GPU.pkl')
# print(best_bayes_model)
# =============================================================================

print('\nModel Metrics for Catboost HPO 100 GPU trials')
y_train_pred = best_bayes_model.predict(X_train)
y_test_pred = best_bayes_model.predict(X_test)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(y_train, y_train_pred),
        mean_absolute_error(y_test, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train, y_train_pred),
        mean_squared_error(y_test, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train, y_train_pred, squared=False),
        mean_squared_error(y_test, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(y_train, y_train_pred),
        r2_score(y_test, y_test_pred)))

Model Metrics for Catboost HPO UsedCars_CarGurus 100 GPU trials
MAE train: 1435.262, test: 1775.705
MSE train: 3695003.598, test: 5773043.664
RMSE train: 1922.239, test: 2402.716
R^2 train: 0.960, test: 0.937

   Compared to the baseline model, there are lower values for MAE, MSE and RMSE and a higher R² for both train/test sets after hyperparameter tuning.

   We can also evaluate the MSE on the test set and determine when this was achieved:

In [ ]:
print('The best model from Bayes optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(y_test,
                                                                                                            y_test_pred)))
print('This was achieved after {} search iterations'.format(results.loc[0, 'iteration']))
The best model from Bayes optimization scores 5773043.66389 MSE on the test set.
This was achieved after 28 search iterations

   Let's now create a new pandas.Dataframe where the parameters examined during the search are stored and the results of each parameter are stored as a different column. Then we can convert the data types for graphing. Let's now examine the distributions of the search hyperparameters by using a for loop to iterate through the quantitative parameters and visualizing with seaborn.kdeplot.

In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns

bayes_params = pd.DataFrame(columns=list(ast.literal_eval(results.loc[0,
                                                                      'params']).keys()),
                            index=list(range(len(results))))

for i, params in enumerate(results['params']):
    bayes_params.loc[i,:] = list(ast.literal_eval(params).values())

bayes_params['loss'] = results['loss']
bayes_params['iteration'] = results['iteration']
bayes_params['depth'] = bayes_params['depth'].astype('float64')
bayes_params['learning_rate'] = bayes_params['learning_rate'].astype('float64')
bayes_params['l2_leaf_reg'] = bayes_params['l2_leaf_reg'].astype('float64')
bayes_params['min_data_in_leaf'] = bayes_params['min_data_in_leaf'].astype('float64')
bayes_params['one_hot_max_size'] = bayes_params['one_hot_max_size'].astype('float64')

for i, hpo in enumerate(bayes_params.columns):
    if hpo not in ['iteration', 'iterations']:
        plt.figure(figsize=(14,6))
        if hpo != 'loss':
            sns.kdeplot(bayes_params[hpo], label='Bayes Optimization')
            plt.legend(loc=0)
            plt.title('{} Distribution'.format(hpo))
            plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
            plt.tight_layout()
            plt.show()
<Figure size 1008x432 with 0 Axes>

   Higher values of depth, l2_leaf_reg, min_data_in_leaf and one_hot_max_size in the given hyperparameter space performed better to generate a lower loss. A learning_rate between 0.1 and 0.2 performed better and the lowest loss was achieved with ~ learning_rate=0.119.

   Let's now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.

In [ ]:
fig, axs = plt.subplots(1, 3, figsize=(20,5))
i = 0
for i, hpo in enumerate(['learning_rate', 'min_data_in_leaf',
                         'one_hot_max_size']):
    sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

   Over the trials, the learning_rate and min_data_in_leaf decreased while one_hot_max_size slightly increased.

   Next, we can examine if there were any trends regarding the regularization parameters over the trials.

In [ ]:
plt.figure(figsize=(20,8))
plt.rcParams['font.size'] = 18
ax = sns.regplot('iteration', 'l2_leaf_reg', data=bayes_params,
                 label='Bayes Optimization: Hyperopt')
ax.set(xlabel='Iteration', ylabel='l2_leaf_reg')
plt.tight_layout()
plt.show()
Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.

   Over the trials, there was a slight increase in l2_leaf_reg.

Model Explanations

   Next, we can define a function to plot the feature importance, which can also be saved and exported for later use. The feature importance value and the associated feature name can be extracted into two separate arrays then converted to a pandas.DataFrame using the feature names as the key and the feature importance as the value in a dictionary. The dataframe can then be sorted in decreasing order by the feature importance value. Then a searborn.barplot with a defined title, xlabel and ylabel can be utilized to visualize the results.

In [ ]:
def plot_feature_importance(importance, names, model_type):

    feature_importance = np.array(importance)
    feature_names = np.array(names)
    data={'feature_names': feature_names,
          'feature_importance': feature_importance}
    fi_df = pd.DataFrame(data)
    fi_df.sort_values(by=['feature_importance'], ascending=False, inplace=True)

    plt.figure(figsize=(10,8))
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'],
                palette='cool')
    plt.title(model_type + ' Feature Importance')
    plt.xlabel('Feature Importance')
    plt.ylabel('Feature Names')
In [ ]:
plot_feature_importance(best_bayes_model.get_feature_importance(),
                        X_train.columns, 'Catboost')

   The features horsepower, year and mileage are the most important followed by wheel_system_display, back_legroom, horsepower_rpm, width, front_legroom, height, body_type.

SHAP (SHapley Additive exPlanations)

   SHAP (SHapley Additive exPlanations) published as A Unified Approach to Interpreting Model Predictions estimates the impact of the individual components on the final result, separately, while conserving the total impact on the final result. This can be considered in conjunction with the previously used feature importance measures when modeling. Let's use the TreeExplainer since using CatBoost and generate shap_values for the features. The summary_plot shows the global importance of each feature and the distribution of the effect sizes over the set. Let's use the training set and summarize the effects of all the features.

In [ ]:
import shap

shap.initjs()
explainer = shap.TreeExplainer(best_bayes_model)
shap_values = explainer.shap_values(X_train)
In [ ]:
shap.summary_plot(shap_values, X_train);

   Then we can use the test set and summarize the effects of all the features.

In [ ]:
shap.initjs()
explainer = shap.TreeExplainer(best_bayes_model)
shap_values = explainer.shap_values(X_test)
In [ ]:
shap.summary_plot(shap_values, X_test);

   Train set has more positive and more negative compared to test set.

100 Trials Train/Validation/Test

   The training set can be partitioned into train/validation sets using a test_size=0.2. This will allow for the original testDF to be a true test set that the model has not encountered.

In [ ]:
train_Features, val_features, train_Label, val_label = train_test_split(X_train,
                                                                        y_train,
                                                                        test_size=0.2,
                                                                        random_state=seed_value)

   We use the same number of trials and parameters space, but defined a new objective function for cat_hpo with a new .pkl file, the same config for the parameters to be evaluated and the same base model. Now, we can specify for the model to fit with the training set with the validation set as the eval_set and to predict using the validation set. We can now define a new out_file to save the results where the headers will be written to the file. Then we can set the global variable for the ITERATION and defined the Hyperopt Trials as bayesOpt_trials.

In [ ]:
def cat_hpo(config):
    """
    Objective function to tune a CatBoostRegressor model.
    """
    joblib.dump(bayesOpt_trials, 'Catboost_Hyperopt_100_GPU_TrainValTest.pkl')

    global ITERATION
    ITERATION += 1

    config['iterations'] = int(config['iterations'])
    config['depth'] = int(config['depth']) + 3

    model = CatBoostRegressor(loss_function='RMSE',
                              cat_features=categorical_features_indices,
                              early_stopping_rounds=10,
                              rsm=1,
                              logging_level='Silent',
                              **config)

    start = timer()
    model.fit(train_Features, train_Label.values.ravel(),
              eval_set=[(val_features, val_label.values.ravel())])
    run_time = timer() - start

    y_pred_val = model.predict(val_features)
    rmse = mean_squared_error(val_label, y_pred_val, squared=False)

    of_connection = open(out_file, 'a')
    writer = csv.writer(of_connection)
    writer.writerow([rmse, config, ITERATION, run_time])

    return {'loss': rmse, 'params': config, 'iteration': ITERATION,
            'train_time': run_time, 'status': STATUS_OK}

tpe_algorithm = tpe.suggest

out_file = '/content/drive/MyDrive/UsedCarsCarGurus/Models/ML/Catboost/Hyperopt/trialOptions/Catboost_trials_100_GPU_TrainValTest.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)

writer.writerow(['loss', 'params', 'iteration', 'train_time'])
of_connection.close()

global ITERATION
ITERATION = 0
bayesOpt_trials = Trials()

   We can utilize if/else condtional statements to load a .pkl file if it exists, and then utilizing fmin, we can specify the training function cat_hpo, the parameter space cat_tune_kwargs, the algorithm for optimization tpe.suggest, the number of trials to evaluate NUM_EVAL and the name of the trial set bayesOpt_trials. We can now begin the HPO trials.

In [ ]:
if os.path.isfile('Catboost_Hyperopt_100_GPU_TrainValTest.pkl'):
    bayesOpt_trials = joblib.load('Catboost_Hyperopt_100_GPU_TrainValTest.pkl')
    best_param = fmin(cat_hpo, cat_tune_kwargs, algo=tpe.suggest,
                      max_evals=NUM_EVAL, trials=bayesOpt_trials)
else:
    best_param = fmin(cat_hpo, cat_tune_kwargs, algo=tpe.suggest,
                      max_evals=NUM_EVAL, trials=bayesOpt_trials)
100%|██████████| 100/100 [12:19<00:00,  7.39s/trial, best loss: 2471.5869659402438]

   Let's now sort the trials with lowest loss (lowest RMSE) first and examine the top two trials:

In [ ]:
bayesOpt_trials_results = sorted(bayesOpt_trials.results,
                                 key=lambda x: x['loss'])
print('Top two trials with the lowest loss (lowest RMSE)')
print(bayesOpt_trials_results[:2])
Top two trials with the lowest loss (lowest RMSE)
[{'loss': 2471.5869659402438, 'params': {'depth': 12, 'iterations': 469, 'l2_leaf_reg': 0.9875868900318402, 'learning_rate': 0.13564041870603102, 'min_data_in_leaf': 19, 'one_hot_max_size': 16, 'random_state': 42, 'task_type': 'GPU'}, 'iteration': 29, 'train_time': 11.28348632000052, 'status': 'ok'}, {'loss': 2476.6703143975124, 'params': {'depth': 12, 'iterations': 424, 'l2_leaf_reg': 0.33040719519574396, 'learning_rate': 0.13592242425890189, 'min_data_in_leaf': 13, 'one_hot_max_size': 19, 'random_state': 42, 'task_type': 'GPU'}, 'iteration': 95, 'train_time': 10.193129091999253, 'status': 'ok'}]

   We can access the results in the trialOptions directory by reading into a pandas.Dataframe sorted with the best scores on top and then resetting the index for slicing. Then we can convert the trial parameters from a string to a dictionary for later use. Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then saving as a .pkl file. Then we can evaluate the model metrics for both the train and the test sets using the MAE, MSE, RMSE and the (R²).

In [ ]:
results = pd.read_csv('Catboost_trials_100_GPU_TrainValTest.csv')
results.sort_values('loss', ascending=True, inplace=True)
results.reset_index(inplace=True, drop=True)

ast.literal_eval(results.loc[0, 'params'])

best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()

train_label = trainDF[['price']]
test_label = testDF[['price']]

train_features = trainDF.drop(columns = ['price'])
test_features = testDF.drop(columns = ['price'])

best_bayes_model = CatBoostRegressor(loss_function='RMSE',
                                     cat_features=categorical_features_indices,
                                     early_stopping_rounds=10,
                                     rsm=1,
                                     logging_level='Silent',
                                     **best_bayes_params)

best_bayes_model.fit(train_features, train_label)

Pkl_Filename = 'Catboost_HPO_trials100_GPU_TrainValTest.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_bayes_model, file)

print('\nModel Metrics for Catboost HPO 100 GPU trials')
y_train_pred = best_bayes_model.predict(train_features)
y_test_pred = best_bayes_model.predict(test_features)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(train_label, y_train_pred),
        mean_absolute_error(test_label, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(train_label, y_train_pred),
        mean_squared_error(test_label, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(train_label, y_train_pred, squared=False),
        mean_squared_error(test_label, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(train_label, y_train_pred),
        r2_score(test_label, y_test_pred)))

Model Metrics for Catboost HPO 100 GPU trials
MAE train: 1400.647, test: 1791.601
MSE train: 3514223.052, test: 5878087.831
RMSE train: 1874.626, test: 2424.477
R^2 train: 0.962, test: 0.936

   When comparing the model metrics for CatBoost utilizing 10-fold cross validation vs train/validation/test, a lower MAE for the train and the test set, a lower MSE and RMSE for the train set but higher for the test set. There was a higher R² for the train set but a lower R² for the test set, so this model overfit more than the model from the cross validatio search.

In [ ]:
print('The best model from Bayes optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
                                                                                                            y_test_pred)))
print('This was achieved after {} search iterations'.format(results.loc[0, 'iteration']))
The best model from Bayes optimization scores 5878087.83144 MSE on the test set.
This was achieved after 29 search iterations

   As stated, there was a higher MSE on the test set using this model compared to the previous search using 10-fold cross validation (5878087.83144 vs. 5773043.66389) but this was achieved at similar number of search iterations.

   To compare the results from this method for partitioning the data for training a model compared to the results from 10-fold cross validation, we can iterate through the quantitative parameters using a similar method and visualizing with seaborn.kdeplot.

In [ ]:
bayes_params = pd.DataFrame(columns=list(ast.literal_eval(results.loc[0,
                                                                      'params']).keys()),
                            index=list(range(len(results))))

for i, params in enumerate(results['params']):
    bayes_params.loc[i,:] = list(ast.literal_eval(params).values())

bayes_params['loss'] = results['loss']
bayes_params['iteration'] = results['iteration']
bayes_params['depth'] = bayes_params['depth'].astype('float64')
bayes_params['learning_rate'] = bayes_params['learning_rate'].astype('float64')
bayes_params['l2_leaf_reg'] = bayes_params['l2_leaf_reg'].astype('float64')
bayes_params['min_data_in_leaf'] = bayes_params['min_data_in_leaf'].astype('float64')
bayes_params['one_hot_max_size'] = bayes_params['one_hot_max_size'].astype('float64')

for i, hpo in enumerate(bayes_params.columns):
    if hpo not in ['iteration', 'iterations', 'random_state', 'task_type']:
        plt.figure(figsize=(14,6))
        if hpo != 'loss':
            sns.kdeplot(bayes_params[hpo], label='Bayes Optimization')
            plt.legend(loc=0)
            plt.title('{} Distribution'.format(hpo))
            plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
            plt.tight_layout()
            plt.show()
<Figure size 1400x600 with 0 Axes>

   Similar results of higher values of depth, l2_leaf_reg, min_data_in_leaf and one_hot_max_size in the given hyperparameter space were observed to generate a lower loss. A learning_rate between 0.1 and 0.2 performed better, the lowest loss was achieved with around a learning_rate=0.136 compared to 0.119 using 10-fold cross validation.

   Let's now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.

In [ ]:
fig, axs = plt.subplots(1, 3, figsize=(20,5))
i = 0
for i, hpo in enumerate(['learning_rate', 'min_data_in_leaf',
                         'one_hot_max_size']):
    sns.regplot(x='iteration', y=hpo, data=bayes_params, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

   Over the trials, the one_hot_max_size increased while learning_rate and min_data_in_leaf did not reveal any trends.

   Next, we can examine if there were any trends regarding the regularization parameters over the trials.

In [ ]:
plt.figure(figsize=(10,5))
plt.rcParams['font.size'] = 18
ax = sns.regplot(x='iteration', y='l2_leaf_reg', data=bayes_params,
                 label='Bayes Optimization')
ax.set(xlabel='Iteration', ylabel='l2_leaf_reg')
plt.tight_layout()
plt.show()

   There was not any observable trend for l2_leaf_reg over the trials.

Model Explanations

In [ ]:
plot_feature_importance(best_bayes_model.get_feature_importance(),
                        train_features.columns, 'Catboost')

   The features horsepower, year and mileage are the most important followed by wheel_system_display, back_legroom, horsepower_rpm, front_legroombefore width, height, wheelbase and State before body_type when comparing to the model from the 10-fold cross validation search.

SHAP (SHapley Additive exPlanations)

   Let's use the training set and summarize the effects of all the features.

In [ ]:
shap.initjs()
explainer = shap.TreeExplainer(best_bayes_model)
shap_values = explainer.shap_values(train_features)
In [ ]:
shap.summary_plot(shap_values, train_features);

   Then we can use the test set and summarize the effects of all the features.

In [ ]:
shap.summary_plot(shap_values, test_features);

Optuna: 100 Trials 10-Fold Cross Validation

   The Optuna notebooks are located here. Let's set up the environment by installing wandb, CatBoost, Optuna and also shap for model explanations. The import the necessary dependencies to set the seed for reproducibility and set the options. We can also examine the GPU characteristics with nvidia-smi.

In [ ]:
!pip install --upgrade -q wandb
!pip install catboost
!pip install optuna
!pip install shap
import os
import random
import numpy as np
import warnings
pd.set_option('display.max_columns', None)

seed_value = 42
os.environ['usedCars_catboostGPU'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)

!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 23.5 MB/s eta 0:00:00a 0:00:01
  Preparing metadata (setup.py) ... done
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 194.8/194.8 KB 25.9 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 184.3/184.3 KB 23.9 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.7/62.7 KB 7.9 MB/s eta 0:00:00
  Building wheel for pathtools (setup.py) ... done
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting catboost
  Downloading catboost-1.1.1-cp39-none-manylinux1_x86_64.whl (76.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76.6/76.6 MB 18.8 MB/s eta 0:00:00
Requirement already satisfied: pandas>=0.24.0 in /usr/local/lib/python3.9/dist-packages (from catboost) (1.4.4)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.9/dist-packages (from catboost) (3.7.1)
Requirement already satisfied: scipy in /usr/local/lib/python3.9/dist-packages (from catboost) (1.10.1)
Requirement already satisfied: plotly in /usr/local/lib/python3.9/dist-packages (from catboost) (5.13.1)
Requirement already satisfied: six in /usr/local/lib/python3.9/dist-packages (from catboost) (1.16.0)
Requirement already satisfied: graphviz in /usr/local/lib/python3.9/dist-packages (from catboost) (0.20.1)
Requirement already satisfied: numpy>=1.16.0 in /usr/local/lib/python3.9/dist-packages (from catboost) (1.22.4)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.9/dist-packages (from pandas>=0.24.0->catboost) (2022.7.1)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.9/dist-packages (from pandas>=0.24.0->catboost) (2.8.2)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.9/dist-packages (from matplotlib->catboost) (0.11.0)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.9/dist-packages (from matplotlib->catboost) (1.0.7)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.9/dist-packages (from matplotlib->catboost) (4.39.3)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.9/dist-packages (from matplotlib->catboost) (23.0)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.9/dist-packages (from matplotlib->catboost) (8.4.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.9/dist-packages (from matplotlib->catboost) (1.4.4)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.9/dist-packages (from matplotlib->catboost) (3.0.9)
Requirement already satisfied: importlib-resources>=3.2.0 in /usr/local/lib/python3.9/dist-packages (from matplotlib->catboost) (5.12.0)
Requirement already satisfied: tenacity>=6.2.0 in /usr/local/lib/python3.9/dist-packages (from plotly->catboost) (8.2.2)
Requirement already satisfied: zipp>=3.1.0 in /usr/local/lib/python3.9/dist-packages (from importlib-resources>=3.2.0->matplotlib->catboost) (3.15.0)
Installing collected packages: catboost
Successfully installed catboost-1.1.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting optuna
  Downloading optuna-3.1.0-py3-none-any.whl (365 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 365.3/365.3 KB 7.5 MB/s eta 0:00:00
Collecting cmaes>=0.9.1
  Downloading cmaes-0.9.1-py3-none-any.whl (21 kB)
Collecting alembic>=1.5.0
  Downloading alembic-1.10.2-py3-none-any.whl (212 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 212.2/212.2 KB 28.1 MB/s eta 0:00:00
Requirement already satisfied: numpy in /usr/local/lib/python3.9/dist-packages (from optuna) (1.22.4)
Collecting colorlog
  Downloading colorlog-6.7.0-py2.py3-none-any.whl (11 kB)
Requirement already satisfied: tqdm in /usr/local/lib/python3.9/dist-packages (from optuna) (4.65.0)
Requirement already satisfied: PyYAML in /usr/local/lib/python3.9/dist-packages (from optuna) (6.0)
Requirement already satisfied: sqlalchemy>=1.3.0 in /usr/local/lib/python3.9/dist-packages (from optuna) (1.4.47)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.9/dist-packages (from optuna) (23.0)
Requirement already satisfied: typing-extensions>=4 in /usr/local/lib/python3.9/dist-packages (from alembic>=1.5.0->optuna) (4.5.0)
Collecting Mako
  Downloading Mako-1.2.4-py3-none-any.whl (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.7/78.7 KB 11.3 MB/s eta 0:00:00
Requirement already satisfied: greenlet!=0.4.17 in /usr/local/lib/python3.9/dist-packages (from sqlalchemy>=1.3.0->optuna) (2.0.2)
Requirement already satisfied: MarkupSafe>=0.9.2 in /usr/local/lib/python3.9/dist-packages (from Mako->alembic>=1.5.0->optuna) (2.1.2)
Installing collected packages: Mako, colorlog, cmaes, alembic, optuna
Successfully installed Mako-1.2.4 alembic-1.10.2 cmaes-0.9.1 colorlog-6.7.0 optuna-3.1.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting shap
  Downloading shap-0.41.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (572 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 572.4/572.4 KB 9.6 MB/s eta 0:00:00
Requirement already satisfied: numpy in /usr/local/lib/python3.9/dist-packages (from shap) (1.22.4)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.9/dist-packages (from shap) (1.2.2)
Requirement already satisfied: cloudpickle in /usr/local/lib/python3.9/dist-packages (from shap) (2.2.1)
Requirement already satisfied: pandas in /usr/local/lib/python3.9/dist-packages (from shap) (1.4.4)
Requirement already satisfied: packaging>20.9 in /usr/local/lib/python3.9/dist-packages (from shap) (23.0)
Requirement already satisfied: numba in /usr/local/lib/python3.9/dist-packages (from shap) (0.56.4)
Collecting slicer==0.0.7
  Downloading slicer-0.0.7-py3-none-any.whl (14 kB)
Requirement already satisfied: scipy in /usr/local/lib/python3.9/dist-packages (from shap) (1.10.1)
Requirement already satisfied: tqdm>4.25.0 in /usr/local/lib/python3.9/dist-packages (from shap) (4.65.0)
Requirement already satisfied: setuptools in /usr/local/lib/python3.9/dist-packages (from numba->shap) (67.6.1)
Requirement already satisfied: llvmlite<0.40,>=0.39.0dev0 in /usr/local/lib/python3.9/dist-packages (from numba->shap) (0.39.1)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.9/dist-packages (from pandas->shap) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.9/dist-packages (from pandas->shap) (2022.7.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.9/dist-packages (from scikit-learn->shap) (3.1.0)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.9/dist-packages (from scikit-learn->shap) (1.1.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.9/dist-packages (from python-dateutil>=2.8.1->pandas->shap) (1.16.0)
Installing collected packages: slicer, shap
Successfully installed shap-0.41.0 slicer-0.0.7
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
Sat Apr  1 04:53:57 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

   Given the environment is now configured, the training and the test sets are read into separate pandas dataframes. Then the features and the target are defined for the two sets as train_features, train_label, test_features and test_label. We can also set the categorical variables like was completed for the Hyperopt section. Then we can set the working directory where the .pkl is stored and set up Weights & Biases.

In [ ]:
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)
import wandb
from optuna.integration.wandb import WeightsAndBiasesCallback
wandb.login()

wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
                'group': 'optuna_cat100gpu_T4_cv_kfold10',
                'save_code': 'False', 'notes': 'optuna_cat100gpu_T4_cv_kfold10'}

wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
 ··········
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
WeightsAndBiasesCallback is experimental (supported from v2.9.0). The interface can change in the future.

   Let's utilize the same k-folds for reproducibility that was utilized for the Hyperopt search, so let's use 10-fold as the number of folds which to split the training set into train and validation with shuffle=True and the initial defined seed value. Then we can set the up WB callbacks for tracking. Now, let's define a function objective for the optimization of hyperparameters using an Optuna study with a pickle file, parameters to test different combinations using the same ones that were utilized during the Hyperopt trials. Then we can define the model type with the parameters that will be used for each trial and perform timed kfolds cross validation trials with the goal of finding the lowest averaged score.

In [ ]:
from sklearn.model_selection import KFold, cross_val_score
import joblib
from catboost import CatBoostRegressor
from timeit import default_timer as timer

kfolds = KFold(n_splits=10, shuffle=True, random_state=seed_value)

@wandbc.track_in_wandb()

def objective(trial):
    """
    Objective function to tune a CatBoostRegressor model.
    """
    joblib.dump(study, 'Catboost_Optuna_100_GPU_T4_wb_CV.pkl')

    catboost_tune_kwargs = {
        'task_type': 'GPU',
        'random_state': seed_value,
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'learning_rate': trial.suggest_float('learning_rate', 1e-4, 0.3),
        'depth': trial.suggest_int('depth', 3, 10),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1e-2, 1e0),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 2, 20),
        'one_hot_max_size': trial.suggest_int('one_hot_max_size', 2, 20)
        }

    model = CatBoostRegressor(loss_function='RMSE',
                              cat_features=categorical_features_indices,
                              early_stopping_rounds=10
                              rsm=1,
                              logging_level='Silent',
                              **catboost_tune_kwargs)

    start = timer()
    scores = -cross_val_score(model, X_train, y_train,
                              scoring='neg_root_mean_squared_error',
                              cv=kfolds)
    run_time = timer() - start

    rmse = np.mean(scores)

    return rmse
track_in_wandb is experimental (supported from v3.0.0). The interface can change in the future.

   Now, we can begin the optimization where the parameters and score for each run is stored usedCars_HPO_Catboost on Weights & Biases.

In [ ]:
from datetime import datetime, timedelta

start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('Catboost_Optuna_100_GPU_T4_wb_CV.pkl'):
    study = joblib.load('Catboost_Optuna_100_GPU_T4_wb_CV.pkl')
else:
    study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100, callbacks=[wandbc])

end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
View run expert-water-15548 at: https://wandb.ai/aschultz/usedCars_hpo/runs/jdv98gkx
Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
Find logs at: ./wandb/run-20230401_035021-jdv98gkx/logs
Start Time           2023-04-01 01:32:26.275246
End Time             2023-04-01 03:51:29.288430
2:19:03


Number of finished trials: 100
Best trial: {'n_estimators': 489, 'learning_rate': 0.1909485376914913, 'depth': 10, 'l2_leaf_reg': 0.48024940002581745, 'min_data_in_leaf': 9, 'one_hot_max_size': 9}
Lowest RMSE 2443.423780647336

   Let's now extract the trial number, rmse and hyperparameter value into a pandas.Dataframe and sort with the lowest error first.

In [ ]:
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'rmse'}, inplace=True)
trials_df.rename(columns={'params_depth': 'depth'}, inplace=True)
trials_df.rename(columns={'params_l2_leaf_reg': 'l2_leaf_reg'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
                 inplace=True)
trials_df.rename(columns={'params_min_data_in_leaf': 'min_data_in_leaf'},
                 inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_one_hot_max_size': 'one_hot_max_size'},
                 inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
    iteration         rmse             datetime_start  \
95         95  2443.423781 2023-04-01 03:44:40.089283   
94         94  2444.474903 2023-04-01 03:43:27.138133   
73         73  2444.573691 2023-04-01 03:18:58.498137   
88         88  2444.752842 2023-04-01 03:35:02.104110   
81         81  2445.089009 2023-04-01 03:27:13.503830   
..        ...          ...                        ...   
70         70  2909.265366 2023-04-01 03:14:18.656474   
6           6  2951.409847 2023-04-01 01:38:26.692276   
2           2  3040.138873 2023-04-01 01:34:01.525727   
9           9  3651.170082 2023-04-01 01:40:13.674127   
10         10  4385.753379 2023-04-01 01:40:42.278426   

            datetime_complete               duration  depth  l2_leaf_reg  \
95 2023-04-01 03:45:47.023765 0 days 00:01:06.934482     10     0.480249   
94 2023-04-01 03:44:34.207549 0 days 00:01:07.069416     10     0.484870   
73 2023-04-01 03:20:08.272172 0 days 00:01:09.774035     10     0.373652   
88 2023-04-01 03:36:09.133491 0 days 00:01:07.029381     10     0.383041   
81 2023-04-01 03:28:21.858831 0 days 00:01:08.355001     10     0.364498   
..                        ...                    ...    ...          ...   
70 2023-04-01 03:15:08.555937 0 days 00:00:49.899463      3     0.399588   
6  2023-04-01 01:38:54.478978 0 days 00:00:27.786702      3     0.482500   
2  2023-04-01 01:34:27.651943 0 days 00:00:26.126216      4     0.817557   
9  2023-04-01 01:40:36.440688 0 days 00:00:22.766561      3     0.455103   
10 2023-04-01 01:45:06.603452 0 days 00:04:24.325026     10     0.261942   

    learning_rate  min_data_in_leaf  n_estimators  one_hot_max_size     state  
95       0.190949                 9           489                 9  COMPLETE  
94       0.189552                10           481                 9  COMPLETE  
73       0.191139                11           500                10  COMPLETE  
88       0.198501                 9           487                 9  COMPLETE  
81       0.194912                11           492                10  COMPLETE  
..            ...               ...           ...               ...       ...  
70       0.182034                10           488                 6  COMPLETE  
6        0.221716                 9           350                16  COMPLETE  
2        0.255420                12           132                 6  COMPLETE  
9        0.099190                16           167                14  COMPLETE  
10       0.002889                 6           484                 3  COMPLETE  

[100 rows x 12 columns]

   Optuna contains many ways to visualize the parameters tested during the search using the visualization module including plot_parallel_coordinate, plot_slice, plot_contour, plot_param_importances, plot_optimization_history, plot_slice and plot_edf. Let's utilize some of these components to examine the hyperparameters tested starting with plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search and plot_contour to plot the parameter relationship with a contour plot

In [ ]:
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()
In [ ]:
fig = optuna.visualization.plot_contour(study, params=['n_estimators',
                                                       'min_data_in_leaf',
                                                       'depth',
                                                       'learning_rate'])
fig.show()

   Next, we can plot_slice to compare the objective value and individal parameters.

In [ ]:
fig = optuna.visualization.plot_slice(study)
fig.show()

   From this plot, lower rmse occurred with a higher depth and sometimes using more estimators in the model. The other parameters exhibited more complex relationships with the loss.

   Let's now examine the distributions of the quantitative hyperparameters that were assessed during the search.

In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns

for i, hpo in enumerate(trials_df.columns):
    if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
                   'duration', 'n_estimators', 'state']:
        plt.figure(figsize=(14,6))
        if hpo != 'loss':
            sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
            plt.legend(loc=0)
            plt.title('{} Distribution'.format(hpo))
            plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
            plt.tight_layout()
            plt.show()

   Higher values of depth while lower values of l2_leaf_reg, min_data_in_leaf and one_hot_max_size in the given hyperparameter space performed better to generate a lower loss. A learning_rate between 0.17 and 0.22 performed better compared to 0.1 - 0.2 for the Hyperopt trials and the lowest loss was achieved with ~ learning_rate=0.191, which is higher than both Hyperopt trials.

   Let's now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.

In [ ]:
fig, axs = plt.subplots(1, 3, figsize=(20,5))
i = 0
axs = axs.flatten()
for i, hpo in enumerate(['learning_rate', 'min_data_in_leaf',
                         'depth']):
    sns.regplot(x='iteration', y=hpo, data=trials_df, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

   The learning_rate and min_data_in_leaf decreased over the trials while the depth did not reveal any trends besides the trials have a higher depth throughtout the trials.

   Next, we can examine if there were any trends regarding the regularization parameters over the trials.

In [ ]:
ax = sns.regplot(x='iteration', y='l2_leaf_reg', data=trials_df)
plt.tight_layout()
plt.show()

   A slight decrease in l2_leaf_reg was observed over the trials.

   Let's now visualize the parameter importances by utilizing the plot_param_importances feature from the Optuna package.

In [ ]:
fig = optuna.visualization.plot_param_importances(study)
fig.show()

    The learning_rate is clearly the most important hyperparameter followed by depth.

   Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then saving as a .pkl file. Then we can evaluate the model metrics for both the train and the test sets using the MAE, MSE, RMSE and the (R²).

In [ ]:
params = study.best_params
params['random_state'] = seed_value
params
Out[ ]:
{'n_estimators': 489,
 'learning_rate': 0.1909485376914913,
 'depth': 10,
 'l2_leaf_reg': 0.48024940002581745,
 'min_data_in_leaf': 9,
 'one_hot_max_size': 9,
 'random_state': 42}
In [ ]:
import pickle
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

best_bayes_model = CatBoostRegressor(loss_function='RMSE',
                                     task_type='GPU',
                                     cat_features=categorical_features_indices,
                                     early_stopping_rounds=10,
                                     rsm=1,
                                     logging_level='Silent',
                                     **params)

best_bayes_model.fit(X_train, y_train)

Pkl_Filename = 'Catboost_Optuna_trials100_GPU_T4_CV_wb.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_bayes_model, file)

print('\nModel Metrics for Catboost HPO 100 GPU trials')
y_train_pred = best_bayes_model.predict(X_train)
y_test_pred = best_bayes_model.predict(X_test)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(y_train, y_train_pred),
        mean_absolute_error(y_test, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train, y_train_pred),
        mean_squared_error(y_test, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train, y_train_pred, squared=False),
        mean_squared_error(y_test, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(y_train, y_train_pred),
        r2_score(y_test, y_test_pred)))

Model Metrics for Catboost HPO 100 GPU trials
MAE train: 1544.340, test: 1796.505
MSE train: 4271275.678, test: 5905262.833
RMSE train: 2066.706, test: 2430.075
R^2 train: 0.953, test: 0.935

   When comparing the model metrics for CatBoost utilizing 10-fold cross validation from Hyperopt, there was a higher MAE for the train set while a lower MAE for the test set. There was higher MSE for both the train/test sets and a lower RMSE and R² for the train and test sets, so this model overfit less than the model from the Hyperopt cross validation search.

   We can also evaluate the MSE on the test set and determine when this was achieved:

In [ ]:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(y_test,
                                                                                                      y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
The best model from optimization scores 5905262.83338 MSE on the test set.
This was achieved using these conditions:
iteration                                    95
rmse                                2443.423781
datetime_start       2023-04-01 03:44:40.089283
datetime_complete    2023-04-01 03:45:47.023765
duration                 0 days 00:01:06.934482
depth                                        10
l2_leaf_reg                            0.480249
learning_rate                          0.190949
min_data_in_leaf                              9
n_estimators                                489
one_hot_max_size                              9
state                                  COMPLETE
Name: 95, dtype: object

Model Explanations

   Using the defined function in the Hyperopt section, we can also plot the feature importance from best model achieved through this hyperparameter search.

In [ ]:
plot_feature_importance(best_bayes_model.get_feature_importance(),
                        X_train.columns, 'Catboost')

   The features horsepower, year and mileage are the most important followed by wheel_system_display, back_legroom, horsepower_rpm, height, width, front_legroom, wheelbase, engine_displacement and savings_amount.

SHAP (SHapley Additive exPlanations)

   Let's use the training set and summarize the effects of all the features.

In [ ]:
import shap

shap.initjs()
explainer = shap.TreeExplainer(best_bayes_model)
shap_values = explainer.shap_values(X_train)
In [ ]:
shap.summary_plot(shap_values, X_train)
plt.gcf().axes[-1].set_aspect(1000)
plt.gcf().axes[-1].set_box_aspect(1000)
plt.show();

   Then we can use the test set and summarize the effects of all the features.

In [ ]:
shap.initjs()
shap_values = explainer.shap_values(X_test)
In [ ]:
shap.summary_plot(shap_values, X_test)
plt.gcf().axes[-1].set_aspect(1000)
plt.gcf().axes[-1].set_box_aspect(1000)
plt.show();

100 Trials Train/Validation/Test

   To evaluate utilizing the train/validation/test set approach, let's use the train set that this is split into the training and validation sets using train_test_split with a test_size=0.2, so the set is split into 60% for training, 20% validation and 20% for testing the trained model. We can then set the working directory where the .pkl is stored and set up Weights & Biases.

In [ ]:
train_Features, val_features, train_Label, val_label = train_test_split(X_train,
                                                                        y_train,
                                                                        test_size=0.2,
                                                                        random_state=seed_value)
In [ ]:
wandb.login()

wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
                'group': 'optuna_cat100gpu_T4_trainvalTest',
                'save_code': 'False',
                'notes': 'optuna_cat100gpu_T4_trainvalTest'}

wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
 ··········
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
WeightsAndBiasesCallback is experimental (supported from v2.9.0). The interface can change in the future.

   Let's set up the WB callbacks and define a function for the optimization of hyperparameters for an Optuna study with a .pkl file and the parameter space to test the different combinations of parameters using the same ones used during the Hyperopt trials.

   Then, we can define the model type with parameters that will be used for each trial and fit the model using the train set the validation set for the eval_set where the rmse is predicted using the validation set.

In [ ]:
@wandbc.track_in_wandb()

def objective(trial):
    """
    Objective function to tune a CatBoostRegressor model.
    """
    joblib.dump(study, 'Catboost_Optuna_100_GPU_T4_wb_trainValTest.pkl')

    catboost_tune_kwargs = {
        'random_state': seed_value,
        'task_type': 'GPU',
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'learning_rate': trial.suggest_float('learning_rate', 1e-4, 0.3),
        'depth': trial.suggest_int('depth', 3, 10),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1e-2, 1e0),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 2, 20),
        'one_hot_max_size': trial.suggest_int('one_hot_max_size', 2, 20)
        }

    model = CatBoostRegressor(loss_function='RMSE',
                              cat_features=categorical_features_indices,
                              early_stopping_rounds=10,
                              rsm=1,
                              logging_level='Silent',
                              **catboost_tune_kwargs)

    start = timer()
    model.fit(train_Features, train_Label.values.ravel(),
              eval_set=[(val_features, val_label.values.ravel())])
    run_time = timer() - start

    y_pred_val = model.predict(val_features)
    rmse = mean_squared_error(val_label, y_pred_val, squared=False)

    return rmse
track_in_wandb is experimental (supported from v3.0.0). The interface can change in the future.

   Now, let's begin the optimization where the parameters and score for each run is stored here.

In [ ]:
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('Catboost_Optuna_100_GPU_T4_wb_trainValTest.pkl'):
    study = joblib.load('Catboost_Optuna_100_GPU_T4_wb_trainValTest.pkl')
else:
    study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100, callbacks=[wandbc])

end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()

   Let's now extract the trial number, rmse and hyperparameter value into a pandas.Dataframe and sort with the lowest error first.

In [ ]:
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'rmse'}, inplace=True)
trials_df.rename(columns={'params_depth': 'depth'}, inplace=True)
trials_df.rename(columns={'params_l2_leaf_reg': 'l2_leaf_reg'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
                 inplace=True)
trials_df.rename(columns={'params_min_data_in_leaf': 'min_data_in_leaf'},
                 inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_one_hot_max_size': 'one_hot_max_size'},
                 inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
    iteration         rmse             datetime_start  \
98         98  2469.939903 2023-04-01 05:22:31.040482   
60         60  2473.611914 2023-04-01 05:11:47.267861   
85         85  2474.389422 2023-04-01 05:18:45.191904   
94         94  2475.327747 2023-04-01 05:21:19.339844   
95         95  2475.719446 2023-04-01 05:21:37.512336   
..        ...          ...                        ...   
5           5  2918.970312 2023-04-01 04:56:38.927127   
58         58  2919.173619 2023-04-01 05:11:20.499905   
2           2  3151.714432 2023-04-01 04:55:56.639188   
3           3  3194.153170 2023-04-01 04:56:10.003919   
40         40  6119.775323 2023-04-01 05:06:12.115566   

            datetime_complete               duration  depth  l2_leaf_reg  \
98 2023-04-01 05:22:43.429855 0 days 00:00:12.389373     10     0.417217   
60 2023-04-01 05:11:59.028128 0 days 00:00:11.760267     10     0.563963   
85 2023-04-01 05:18:57.667853 0 days 00:00:12.475949     10     0.535712   
94 2023-04-01 05:21:32.198264 0 days 00:00:12.858420     10     0.524992   
95 2023-04-01 05:21:49.906790 0 days 00:00:12.394454     10     0.521898   
..                        ...                    ...    ...          ...   
5  2023-04-01 04:56:50.323133 0 days 00:00:11.396006      8     0.295064   
58 2023-04-01 05:11:28.850642 0 days 00:00:08.350737      3     0.555729   
2  2023-04-01 04:56:04.597571 0 days 00:00:07.958383      6     0.210505   
3  2023-04-01 04:56:18.169075 0 days 00:00:08.165156      7     0.027474   
40 2023-04-01 05:06:26.222230 0 days 00:00:14.106664      6     0.669139   

    learning_rate  min_data_in_leaf  n_estimators  one_hot_max_size     state  
98       0.177532                11           461                13  COMPLETE  
60       0.224096                 7           418                14  COMPLETE  
85       0.161132                11           461                15  COMPLETE  
94       0.162199                12           488                14  COMPLETE  
95       0.157959                12           478                13  COMPLETE  
..            ...               ...           ...               ...       ...  
5        0.047496                17           237                 8  COMPLETE  
58       0.191195                 7           453                19  COMPLETE  
2        0.076996                 3           158                 7  COMPLETE  
3        0.039194                15           206                17  COMPLETE  
40       0.001582                 2           498                 6  COMPLETE  

[100 rows x 12 columns]

   Let's utilize plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search and plot_contour to plot the parameter relationship with a contours.

In [ ]:
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()
In [ ]:
fig = optuna.visualization.plot_contour(study, params=['n_estimators',
                                                       'min_data_in_leaf',
                                                       'depth',
                                                       'learning_rate'])
fig.show()

   Let's now examine the distributions of the quantitative hyperparameters that were assessed during the search.

In [ ]:
for i, hpo in enumerate(trials_df.columns):
    if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
                   'duration', 'n_estimators', 'state']:
        plt.figure(figsize=(14,6))
        if hpo != 'loss':
            sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
            plt.legend(loc=0)
            plt.title('{} Distribution'.format(hpo))
            plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
            plt.tight_layout()
            plt.show()

   Higher values of depth, min_data_in_leaf and one_hot_max_size while lower values of l2_leaf_reg in the given hyperparameter space performed better to generate a lower loss. A learning_rate between 0.17 and 0.22 performed better compared to 0.1 - 0.2 for the Hyperopt trials and the lowest loss was achieved with ~ learning_rate=0.178, which is higher than both Hyperopt trials.

   Let's now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.

In [ ]:
fig, axs = plt.subplots(1, 3, figsize=(20,5))
i = 0
axs = axs.flatten()
for i, hpo in enumerate(['learning_rate', 'min_data_in_leaf', 'depth']):
    sns.regplot(x='iteration', y=hpo, data=trials_df, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

   Over the search trials, there were not any trends with the quantitative features over the iterations.

   Next, we can examine if there were any trends regarding the regularization parameters over the trials.

In [ ]:
ax = sns.regplot(x='iteration', y='l2_leaf_reg', data=trials_df)
plt.tight_layout()
plt.show()

   Over the trials, there was a slight increase in l2_leaf_reg.

   Let's now visualize the parameter importances by utilizing the plot_param_importances feature from the Optuna package.

In [ ]:
fig = optuna.visualization.plot_param_importances(study)
fig.show()

   The learning_rate is clearly the most important hyperparameter, more when using cross-validation for the Optuna trials followed by depth, which is less important in this search.

   Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then saving as a .pkl file. Then we can evaluate the model metrics for both the train and the test sets using the MAE, MSE, RMSE and the (R²).

In [ ]:
params = study.best_params
params['random_state'] = seed_value
params
Out[ ]:
{'n_estimators': 461,
 'learning_rate': 0.17753236478366918,
 'depth': 10,
 'l2_leaf_reg': 0.41721695797669994,
 'min_data_in_leaf': 11,
 'one_hot_max_size': 13,
 'random_state': 42}
In [ ]:
train_label = trainDF[['price']]
test_label = testDF[['price']]

train_features = trainDF.drop(columns = ['price'])
test_features = testDF.drop(columns = ['price'])

categorical_features_indices = ['body_type', 'fuel_type', 'listing_color',
                                'transmission', 'wheel_system_display', 'State',
                                'listed_date_yearMonth']

best_bayes_model = CatBoostRegressor(loss_function='RMSE',
                                     task_type='GPU',
                                     cat_features=categorical_features_indices,
                                     early_stopping_rounds=10,
                                     rsm=1,
                                     logging_level='Silent',
                                     **params)

best_bayes_model.fit(train_features, train_label)

Pkl_Filename = 'Catboost_Optuna_trials100_GPU_T4_wb_trainValTest.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_bayes_model, file)

print('\nModel Metrics for Catboost HPO 100 GPU trials')
y_train_pred = best_bayes_model.predict(train_features)
y_test_pred = best_bayes_model.predict(test_features)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(train_label, y_train_pred),
        mean_absolute_error(test_label, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(train_label, y_train_pred),
        mean_squared_error(test_label, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(train_label, y_train_pred, squared=False),
        mean_squared_error(test_label, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(train_label, y_train_pred),
        r2_score(test_label, y_test_pred)))

Model Metrics for Catboost HPO 100 GPU trials
MAE train: 1577.754, test: 1800.081
MSE train: 4452674.607, test: 5906404.943
RMSE train: 2110.136, test: 2430.310
R^2 train: 0.952, test: 0.935

   When comparing the model metrics to the 10-fold cross validation trials using Hyperopt, the MAE, MSE, and RMSE for the both the train and the test is higher and the R² is lower for both of the sets. Therefore, the model with the best performance for CatBoost occurred with the Hyperopt 100 trials using 10-fold cross validation.

   We can also evaluate the MSE on the test set and determine when this was achieved:

In [ ]:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
                                                                                                      y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
The best model from optimization scores 5906404.94257 MSE on the test set.
This was achieved using these conditions:
iteration                                    98
rmse                                2469.939903
datetime_start       2023-04-01 05:22:31.040482
datetime_complete    2023-04-01 05:22:43.429855
duration                 0 days 00:00:12.389373
depth                                        10
l2_leaf_reg                            0.417217
learning_rate                          0.177532
min_data_in_leaf                             11
n_estimators                                461
one_hot_max_size                             13
state                                  COMPLETE
Name: 98, dtype: object

Model Explanations

   Using the defined function used in the Hyperopt section, plot the feature importance from best model result.

In [ ]:
plt.rcParams.update({'font.size': 10})
plot_feature_importance(best_bayes_model.get_feature_importance(),
                        train_features.columns, 'Catboost')

   The features horsepower, year and mileage are the most important followed by wheel_system_display, back_legroom, horsepower_rpm, height, width, front_legroom, wheelbase, engine_displacement, highway_fuel_economy, savings_amount.

   After evaluating the two different packages for performing hyperparanmeter tuning and cross validation vs. train/validation/test sets utilized during the tuning the process for CatBoost, the features horsepower, year and mileage are the most important followed slightly different orders of of feature importance using this approach.

SHAP (SHapley Additive exPlanations)

   Let's use the training set and summarize the effects of all the features.

In [ ]:
shap.initjs()
explainer = shap.TreeExplainer(best_bayes_model)
shap_values = explainer.shap_values(train_features)
In [ ]:
shap.summary_plot(shap_values, train_features)
plt.gcf().axes[-1].set_aspect(1000)
plt.gcf().axes[-1].set_box_aspect(1000)
plt.show();

   Then we can use the test set and summarize the effects of all the features.

In [ ]:
shap.initjs()
shap_values = explainer.shap_values(test_features)
In [ ]:
shap.summary_plot(shap_values, test_features)
plt.gcf().axes[-1].set_aspect(1000)
plt.gcf().axes[-1].set_box_aspect(1000)
plt.show();

LightGBM

   The paper LightGBM: A Highly Efficient Gradient Boosting Decision Tree introduced two new techniques called Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).

   GOSS is a new gradient boosting decision tree (GBDT) algorithm under the assumption that instances containing larger gradients (greater than a specified threshold or within the top percentiles) contribute more to information gain. Therefore, the instances with large gradients are retained and only the instances with small gradients are randomly dropped when down sampling the data. This results in more accurate gain estimation compared to uniformly random sampling, with the same target sampling rate, to a higher extent when there is a large range to benefit.

   EFB addresses the problem when there is a large number of features present, but in reality, the feature space is quite sparse. This utilizes a graph approach where the features are vertices, edges are added for every two features, if they are not mutually exclusive, and then leveraging a greedy algorithm with a constant approximation ratio to solve.

   Since Google Colaboratory was utilized to run LightGBM with a GPU, the environment needs to be set up by mounting the drive and cloning the LightGBM repository from Github.

In [ ]:
%cd /content/drive/MyDrive/
/content/drive/MyDrive
In [ ]:
!git clone --recursive https://github.com/Microsoft/LightGBM
Cloning into 'LightGBM'...
remote: Enumerating objects: 28579, done.
remote: Counting objects: 100% (28578/28578), done.
remote: Compressing objects: 100% (6535/6535), done.
remote: Total 28579 (delta 21247), reused 28354 (delta 21111), pack-reused 1
Receiving objects: 100% (28579/28579), 19.87 MiB | 9.87 MiB/s, done.
Resolving deltas: 100% (21247/21247), done.
Checking out files: 100% (535/535), done.
Submodule 'include/boost/compute' (https://github.com/boostorg/compute) registered for path 'external_libs/compute'
Submodule 'eigen' (https://gitlab.com/libeigen/eigen.git) registered for path 'external_libs/eigen'
Submodule 'external_libs/fast_double_parser' (https://github.com/lemire/fast_double_parser.git) registered for path 'external_libs/fast_double_parser'
Submodule 'external_libs/fmt' (https://github.com/fmtlib/fmt.git) registered for path 'external_libs/fmt'
Cloning into '/content/drive/MyDrive/LightGBM/external_libs/compute'...
remote: Enumerating objects: 21733, done.        
remote: Counting objects: 100% (5/5), done.        
remote: Compressing objects: 100% (4/4), done.        
remote: Total 21733 (delta 1), reused 3 (delta 1), pack-reused 21728        
Receiving objects: 100% (21733/21733), 8.51 MiB | 5.87 MiB/s, done.
Resolving deltas: 100% (17567/17567), done.
Cloning into '/content/drive/MyDrive/LightGBM/external_libs/eigen'...
remote: Enumerating objects: 117713, done.        
remote: Counting objects: 100% (800/800), done.        
remote: Compressing objects: 100% (282/282), done.        
remote: Total 117713 (delta 530), reused 771 (delta 515), pack-reused 116913        
Receiving objects: 100% (117713/117713), 103.14 MiB | 10.23 MiB/s, done.
Resolving deltas: 100% (97111/97111), done.
Cloning into '/content/drive/MyDrive/LightGBM/external_libs/fast_double_parser'...
remote: Enumerating objects: 781, done.        
remote: Counting objects: 100% (180/180), done.        
remote: Compressing objects: 100% (66/66), done.        
remote: Total 781 (delta 124), reused 131 (delta 103), pack-reused 601        
Receiving objects: 100% (781/781), 833.45 KiB | 1.61 MiB/s, done.
Resolving deltas: 100% (395/395), done.
Cloning into '/content/drive/MyDrive/LightGBM/external_libs/fmt'...
remote: Enumerating objects: 31417, done.        
remote: Counting objects: 100% (1040/1040), done.        
remote: Compressing objects: 100% (85/85), done.        
remote: Total 31417 (delta 964), reused 979 (delta 940), pack-reused 30377        
Receiving objects: 100% (31417/31417), 13.60 MiB | 4.95 MiB/s, done.
Resolving deltas: 100% (21284/21284), done.
Submodule path 'external_libs/compute': checked out '36350b7de849300bd3d72a05d8bf890ca405a014'
Submodule path 'external_libs/eigen': checked out '3147391d946bb4b6c68edd901f2add6ac1f31f8c'
Submodule path 'external_libs/fast_double_parser': checked out 'ace60646c02dc54c57f19d644e49a61e7e7758ec'
Submodule 'benchmark/dependencies/abseil-cpp' (https://github.com/abseil/abseil-cpp.git) registered for path 'external_libs/fast_double_parser/benchmarks/dependencies/abseil-cpp'
Submodule 'benchmark/dependencies/double-conversion' (https://github.com/google/double-conversion.git) registered for path 'external_libs/fast_double_parser/benchmarks/dependencies/double-conversion'
Cloning into '/content/drive/MyDrive/LightGBM/external_libs/fast_double_parser/benchmarks/dependencies/abseil-cpp'...
remote: Enumerating objects: 20079, done.        
remote: Counting objects: 100% (2719/2719), done.        
remote: Compressing objects: 100% (711/711), done.        
remote: Total 20079 (delta 2077), reused 2063 (delta 2008), pack-reused 17360        
Receiving objects: 100% (20079/20079), 12.07 MiB | 6.00 MiB/s, done.
Resolving deltas: 100% (15721/15721), done.
Cloning into '/content/drive/MyDrive/LightGBM/external_libs/fast_double_parser/benchmarks/dependencies/double-conversion'...
remote: Enumerating objects: 1352, done.        
remote: Counting objects: 100% (196/196), done.        
remote: Compressing objects: 100% (105/105), done.        
remote: Total 1352 (delta 109), reused 157 (delta 84), pack-reused 1156        
Receiving objects: 100% (1352/1352), 7.15 MiB | 10.65 MiB/s, done.
Resolving deltas: 100% (881/881), done.
Submodule path 'external_libs/fast_double_parser/benchmarks/dependencies/abseil-cpp': checked out 'd936052d32a5b7ca08b0199a6724724aea432309'
Submodule path 'external_libs/fast_double_parser/benchmarks/dependencies/double-conversion': checked out 'f4cb2384efa55dee0e6652f8674b05763441ab09'
Submodule path 'external_libs/fmt': checked out 'b6f4ceaed0a0a24ccf575fab6c56dd50ccf6f1a9'

   Then we can navigate to the directory and create a build directory within the cloned repository where we can use cmake to compile and build.

In [ ]:
%cd /content/drive/MyDrive/LightGBM

!mkdir build
/content/drive/MyDrive/LightGBM
In [ ]:
!cmake -DUSE_GPU=1
!make -j$(nproc)
CMake Warning:
  No source or binary directory provided.  Both will be assumed to be the
  same as the current working directory, but note that this warning will
  become a fatal error in future CMake releases.


-- OpenCL include directory: /usr/include
-- Using _mm_prefetch
-- Using _mm_malloc
-- Configuring done
-- Generating done
-- Build files have been written to: /content/drive/MyDrive/LightGBM
Consolidate compiler generated dependencies of target lightgbm_capi_objs
[ -1%] Built target lightgbm_capi_objs
Consolidate compiler generated dependencies of target lightgbm_objs
[ 89%] Built target lightgbm_objs
[ 90%] Built target _lightgbm
Consolidate compiler generated dependencies of target lightgbm
[ 96%] Built target lightgbm

   Next, we can install pip, setuptools and the required dependencies to set up the environment for utilizing LightGBM with GPU capabilities.

In [ ]:
!sudo apt-get -y install python-pip
!sudo -H pip install setuptools optuna plotly eli5 shap lime -U
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
The following additional packages will be installed:
  libpython-all-dev python-all python-all-dev python-asn1crypto
  python-cffi-backend python-crypto python-cryptography python-dbus
  python-enum34 python-gi python-idna python-ipaddress python-keyring
  python-keyrings.alt python-pip-whl python-pkg-resources python-secretstorage
  python-setuptools python-six python-wheel python-xdg
Suggested packages:
  python-crypto-doc python-cryptography-doc python-cryptography-vectors
  python-dbus-dbg python-dbus-doc python-enum34-doc python-gi-cairo
  gnome-keyring libkf5wallet-bin gir1.2-gnomekeyring-1.0 python-fs
  python-gdata python-keyczar python-secretstorage-doc python-setuptools-doc
The following NEW packages will be installed:
  libpython-all-dev python-all python-all-dev python-asn1crypto
  python-cffi-backend python-crypto python-cryptography python-dbus
  python-enum34 python-gi python-idna python-ipaddress python-keyring
  python-keyrings.alt python-pip python-pip-whl python-pkg-resources
  python-secretstorage python-setuptools python-six python-wheel python-xdg
0 upgraded, 22 newly installed, 0 to remove and 5 not upgraded.
Need to get 3,430 kB of archives.
After this operation, 10.8 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/main amd64 libpython-all-dev amd64 2.7.15~rc1-1 [1,092 B]
Get:2 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-all amd64 2.7.15~rc1-1 [1,076 B]
Get:3 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-all-dev amd64 2.7.15~rc1-1 [1,100 B]
Get:4 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-asn1crypto all 0.24.0-1 [72.7 kB]
Get:5 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-cffi-backend amd64 1.11.5-1 [63.4 kB]
Get:6 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-crypto amd64 2.6.1-8ubuntu2 [244 kB]
Get:7 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-enum34 all 1.1.6-2 [34.8 kB]
Get:8 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-idna all 2.6-1 [32.4 kB]
Get:9 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-ipaddress all 1.0.17-1 [18.2 kB]
Get:10 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-six all 1.11.0-2 [11.3 kB]
Get:11 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 python-cryptography amd64 2.1.4-1ubuntu1.4 [276 kB]
Get:12 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-dbus amd64 1.2.6-1 [90.2 kB]
Get:13 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 python-gi amd64 3.26.1-2ubuntu1 [197 kB]
Get:14 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-secretstorage all 2.3.1-2 [11.8 kB]
Get:15 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-keyring all 10.6.0-1 [30.6 kB]
Get:16 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-keyrings.alt all 3.0-1 [16.7 kB]
Get:17 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 python-pip-whl all 9.0.1-2.3~ubuntu1.18.04.5 [1,653 kB]
Get:18 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 python-pip all 9.0.1-2.3~ubuntu1.18.04.5 [151 kB]
Get:19 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-pkg-resources all 39.0.1-2 [128 kB]
Get:20 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-setuptools all 39.0.1-2 [329 kB]
Get:21 http://archive.ubuntu.com/ubuntu bionic/universe amd64 python-wheel all 0.30.0-0.2 [36.4 kB]
Get:22 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 python-xdg all 0.25-4ubuntu1.1 [31.2 kB]
Fetched 3,430 kB in 3s (1,219 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 22.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package libpython-all-dev:amd64.
(Reading database ... 123941 files and directories currently installed.)
Preparing to unpack .../00-libpython-all-dev_2.7.15~rc1-1_amd64.deb ...
Unpacking libpython-all-dev:amd64 (2.7.15~rc1-1) ...
Selecting previously unselected package python-all.
Preparing to unpack .../01-python-all_2.7.15~rc1-1_amd64.deb ...
Unpacking python-all (2.7.15~rc1-1) ...
Selecting previously unselected package python-all-dev.
Preparing to unpack .../02-python-all-dev_2.7.15~rc1-1_amd64.deb ...
Unpacking python-all-dev (2.7.15~rc1-1) ...
Selecting previously unselected package python-asn1crypto.
Preparing to unpack .../03-python-asn1crypto_0.24.0-1_all.deb ...
Unpacking python-asn1crypto (0.24.0-1) ...
Selecting previously unselected package python-cffi-backend.
Preparing to unpack .../04-python-cffi-backend_1.11.5-1_amd64.deb ...
Unpacking python-cffi-backend (1.11.5-1) ...
Selecting previously unselected package python-crypto.
Preparing to unpack .../05-python-crypto_2.6.1-8ubuntu2_amd64.deb ...
Unpacking python-crypto (2.6.1-8ubuntu2) ...
Selecting previously unselected package python-enum34.
Preparing to unpack .../06-python-enum34_1.1.6-2_all.deb ...
Unpacking python-enum34 (1.1.6-2) ...
Selecting previously unselected package python-idna.
Preparing to unpack .../07-python-idna_2.6-1_all.deb ...
Unpacking python-idna (2.6-1) ...
Selecting previously unselected package python-ipaddress.
Preparing to unpack .../08-python-ipaddress_1.0.17-1_all.deb ...
Unpacking python-ipaddress (1.0.17-1) ...
Selecting previously unselected package python-six.
Preparing to unpack .../09-python-six_1.11.0-2_all.deb ...
Unpacking python-six (1.11.0-2) ...
Selecting previously unselected package python-cryptography.
Preparing to unpack .../10-python-cryptography_2.1.4-1ubuntu1.4_amd64.deb ...
Unpacking python-cryptography (2.1.4-1ubuntu1.4) ...
Selecting previously unselected package python-dbus.
Preparing to unpack .../11-python-dbus_1.2.6-1_amd64.deb ...
Unpacking python-dbus (1.2.6-1) ...
Selecting previously unselected package python-gi.
Preparing to unpack .../12-python-gi_3.26.1-2ubuntu1_amd64.deb ...
Unpacking python-gi (3.26.1-2ubuntu1) ...
Selecting previously unselected package python-secretstorage.
Preparing to unpack .../13-python-secretstorage_2.3.1-2_all.deb ...
Unpacking python-secretstorage (2.3.1-2) ...
Selecting previously unselected package python-keyring.
Preparing to unpack .../14-python-keyring_10.6.0-1_all.deb ...
Unpacking python-keyring (10.6.0-1) ...
Selecting previously unselected package python-keyrings.alt.
Preparing to unpack .../15-python-keyrings.alt_3.0-1_all.deb ...
Unpacking python-keyrings.alt (3.0-1) ...
Selecting previously unselected package python-pip-whl.
Preparing to unpack .../16-python-pip-whl_9.0.1-2.3~ubuntu1.18.04.5_all.deb ...
Unpacking python-pip-whl (9.0.1-2.3~ubuntu1.18.04.5) ...
Selecting previously unselected package python-pip.
Preparing to unpack .../17-python-pip_9.0.1-2.3~ubuntu1.18.04.5_all.deb ...
Unpacking python-pip (9.0.1-2.3~ubuntu1.18.04.5) ...
Selecting previously unselected package python-pkg-resources.
Preparing to unpack .../18-python-pkg-resources_39.0.1-2_all.deb ...
Unpacking python-pkg-resources (39.0.1-2) ...
Selecting previously unselected package python-setuptools.
Preparing to unpack .../19-python-setuptools_39.0.1-2_all.deb ...
Unpacking python-setuptools (39.0.1-2) ...
Selecting previously unselected package python-wheel.
Preparing to unpack .../20-python-wheel_0.30.0-0.2_all.deb ...
Unpacking python-wheel (0.30.0-0.2) ...
Selecting previously unselected package python-xdg.
Preparing to unpack .../21-python-xdg_0.25-4ubuntu1.1_all.deb ...
Unpacking python-xdg (0.25-4ubuntu1.1) ...
Setting up python-idna (2.6-1) ...
Setting up python-pip-whl (9.0.1-2.3~ubuntu1.18.04.5) ...
Setting up python-asn1crypto (0.24.0-1) ...
Setting up python-crypto (2.6.1-8ubuntu2) ...
Setting up python-wheel (0.30.0-0.2) ...
Setting up libpython-all-dev:amd64 (2.7.15~rc1-1) ...
Setting up python-pkg-resources (39.0.1-2) ...
Setting up python-cffi-backend (1.11.5-1) ...
Setting up python-gi (3.26.1-2ubuntu1) ...
Setting up python-six (1.11.0-2) ...
Setting up python-enum34 (1.1.6-2) ...
Setting up python-dbus (1.2.6-1) ...
Setting up python-ipaddress (1.0.17-1) ...
Setting up python-pip (9.0.1-2.3~ubuntu1.18.04.5) ...
Setting up python-all (2.7.15~rc1-1) ...
Setting up python-xdg (0.25-4ubuntu1.1) ...
Setting up python-setuptools (39.0.1-2) ...
Setting up python-keyrings.alt (3.0-1) ...
Setting up python-all-dev (2.7.15~rc1-1) ...
Setting up python-cryptography (2.1.4-1ubuntu1.4) ...
Setting up python-secretstorage (2.3.1-2) ...
Setting up python-keyring (10.6.0-1) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (57.4.0)
Collecting setuptools
  Downloading setuptools-65.5.1-py3-none-any.whl (1.2 MB)
     |████████████████████████████████| 1.2 MB 28.3 MB/s 
Collecting optuna
  Downloading optuna-3.0.3-py3-none-any.whl (348 kB)
     |████████████████████████████████| 348 kB 82.7 MB/s 
Requirement already satisfied: plotly in /usr/local/lib/python3.7/dist-packages (5.5.0)
Collecting plotly
  Downloading plotly-5.11.0-py2.py3-none-any.whl (15.3 MB)
     |████████████████████████████████| 15.3 MB 72.5 MB/s 
Collecting eli5
  Downloading eli5-0.13.0.tar.gz (216 kB)
     |████████████████████████████████| 216 kB 73.3 MB/s 
Collecting shap
  Downloading shap-0.41.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (569 kB)
     |████████████████████████████████| 569 kB 77.8 MB/s 
Collecting lime
  Downloading lime-0.2.0.1.tar.gz (275 kB)
     |████████████████████████████████| 275 kB 73.5 MB/s 
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from optuna) (4.64.1)
Collecting alembic>=1.5.0
  Downloading alembic-1.8.1-py3-none-any.whl (209 kB)
     |████████████████████████████████| 209 kB 82.0 MB/s 
Collecting cliff
  Downloading cliff-3.10.1-py3-none-any.whl (81 kB)
     |████████████████████████████████| 81 kB 10.5 MB/s 
Requirement already satisfied: scipy<1.9.0,>=1.7.0 in /usr/local/lib/python3.7/dist-packages (from optuna) (1.7.3)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from optuna) (1.21.6)
Collecting cmaes>=0.8.2
  Downloading cmaes-0.9.0-py3-none-any.whl (23 kB)
Requirement already satisfied: importlib-metadata<5.0.0 in /usr/local/lib/python3.7/dist-packages (from optuna) (4.13.0)
Requirement already satisfied: sqlalchemy>=1.3.0 in /usr/local/lib/python3.7/dist-packages (from optuna) (1.4.42)
Requirement already satisfied: PyYAML in /usr/local/lib/python3.7/dist-packages (from optuna) (6.0)
Collecting colorlog
  Downloading colorlog-6.7.0-py2.py3-none-any.whl (11 kB)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.7/dist-packages (from optuna) (21.3)
Collecting Mako
  Downloading Mako-1.2.3-py3-none-any.whl (78 kB)
     |████████████████████████████████| 78 kB 7.6 MB/s 
Requirement already satisfied: importlib-resources in /usr/local/lib/python3.7/dist-packages (from alembic>=1.5.0->optuna) (5.10.0)
Requirement already satisfied: typing-extensions>=3.6.4 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata<5.0.0->optuna) (4.1.1)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata<5.0.0->optuna) (3.10.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging>=20.0->optuna) (3.0.9)
Requirement already satisfied: greenlet!=0.4.17 in /usr/local/lib/python3.7/dist-packages (from sqlalchemy>=1.3.0->optuna) (2.0.0.post0)
Requirement already satisfied: tenacity>=6.2.0 in /usr/local/lib/python3.7/dist-packages (from plotly) (8.1.0)
Requirement already satisfied: attrs>17.1.0 in /usr/local/lib/python3.7/dist-packages (from eli5) (22.1.0)
Collecting jinja2>=3.0.0
  Downloading Jinja2-3.1.2-py3-none-any.whl (133 kB)
     |████████████████████████████████| 133 kB 79.0 MB/s 
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from eli5) (1.15.0)
Requirement already satisfied: scikit-learn>=0.20 in /usr/local/lib/python3.7/dist-packages (from eli5) (1.0.2)
Requirement already satisfied: graphviz in /usr/local/lib/python3.7/dist-packages (from eli5) (0.10.1)
Requirement already satisfied: tabulate>=0.7.7 in /usr/local/lib/python3.7/dist-packages (from eli5) (0.8.10)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.7/dist-packages (from jinja2>=3.0.0->eli5) (2.0.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>=0.20->eli5) (3.1.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>=0.20->eli5) (1.2.0)
Collecting slicer==0.0.7
  Downloading slicer-0.0.7-py3-none-any.whl (14 kB)
Requirement already satisfied: numba in /usr/local/lib/python3.7/dist-packages (from shap) (0.56.4)
Requirement already satisfied: cloudpickle in /usr/local/lib/python3.7/dist-packages (from shap) (1.5.0)
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from shap) (1.3.5)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.7/dist-packages (from lime) (3.2.2)
Requirement already satisfied: scikit-image>=0.12 in /usr/local/lib/python3.7/dist-packages (from lime) (0.18.3)
Requirement already satisfied: pillow!=7.1.0,!=7.1.1,>=4.3.0 in /usr/local/lib/python3.7/dist-packages (from scikit-image>=0.12->lime) (7.1.2)
Requirement already satisfied: networkx>=2.0 in /usr/local/lib/python3.7/dist-packages (from scikit-image>=0.12->lime) (2.6.3)
Requirement already satisfied: tifffile>=2019.7.26 in /usr/local/lib/python3.7/dist-packages (from scikit-image>=0.12->lime) (2021.11.2)
Requirement already satisfied: PyWavelets>=1.1.1 in /usr/local/lib/python3.7/dist-packages (from scikit-image>=0.12->lime) (1.3.0)
Requirement already satisfied: imageio>=2.3.0 in /usr/local/lib/python3.7/dist-packages (from scikit-image>=0.12->lime) (2.9.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->lime) (1.4.4)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib->lime) (0.11.0)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->lime) (2.8.2)
Collecting pbr!=2.1.0,>=2.0.0
  Downloading pbr-5.11.0-py2.py3-none-any.whl (112 kB)
     |████████████████████████████████| 112 kB 68.9 MB/s 
Collecting autopage>=0.4.0
  Downloading autopage-0.5.1-py3-none-any.whl (29 kB)
Requirement already satisfied: PrettyTable>=0.7.2 in /usr/local/lib/python3.7/dist-packages (from cliff->optuna) (3.5.0)
Collecting stevedore>=2.0.1
  Downloading stevedore-3.5.2-py3-none-any.whl (50 kB)
     |████████████████████████████████| 50 kB 6.6 MB/s 
Collecting cmd2>=1.0.0
  Downloading cmd2-2.4.2-py3-none-any.whl (147 kB)
     |████████████████████████████████| 147 kB 77.1 MB/s 
Requirement already satisfied: wcwidth>=0.1.7 in /usr/local/lib/python3.7/dist-packages (from cmd2>=1.0.0->cliff->optuna) (0.2.5)
Collecting pyperclip>=1.6
  Downloading pyperclip-1.8.2.tar.gz (20 kB)
Requirement already satisfied: llvmlite<0.40,>=0.39.0dev0 in /usr/local/lib/python3.7/dist-packages (from numba->shap) (0.39.1)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas->shap) (2022.6)
Building wheels for collected packages: eli5, lime, pyperclip
  Building wheel for eli5 (setup.py) ... done
  Created wheel for eli5: filename=eli5-0.13.0-py2.py3-none-any.whl size=107748 sha256=53ec9507155788c5ef2efa597a3fd94d4109d778b5e6fea28eb40b814230d212
  Stored in directory: /root/.cache/pip/wheels/cc/3c/96/3ead31a8e6c20fc0f1a707fde2e05d49a80b1b4b30096573be
  Building wheel for lime (setup.py) ... done
  Created wheel for lime: filename=lime-0.2.0.1-py3-none-any.whl size=283857 sha256=a3a1cbe14bcf2df333b36d2cc42ed97a7ce8029a37d580902011cb04ce660dc0
  Stored in directory: /root/.cache/pip/wheels/ca/cb/e5/ac701e12d365a08917bf4c6171c0961bc880a8181359c66aa7
  Building wheel for pyperclip (setup.py) ... done
  Created wheel for pyperclip: filename=pyperclip-1.8.2-py3-none-any.whl size=11136 sha256=05a28b28ead19d2e27c09bb2d74c114cf8fd9267b007da5a789fe9e6985f5d3a
  Stored in directory: /root/.cache/pip/wheels/9f/18/84/8f69f8b08169c7bae2dde6bd7daf0c19fca8c8e500ee620a28
Successfully built eli5 lime pyperclip
Installing collected packages: pyperclip, pbr, stevedore, setuptools, Mako, cmd2, autopage, slicer, jinja2, colorlog, cmaes, cliff, alembic, shap, plotly, optuna, lime, eli5
  Attempting uninstall: setuptools
    Found existing installation: setuptools 57.4.0
    Uninstalling setuptools-57.4.0:
      Successfully uninstalled setuptools-57.4.0
  Attempting uninstall: jinja2
    Found existing installation: Jinja2 2.11.3
    Uninstalling Jinja2-2.11.3:
      Successfully uninstalled Jinja2-2.11.3
  Attempting uninstall: plotly
    Found existing installation: plotly 5.5.0
    Uninstalling plotly-5.5.0:
      Successfully uninstalled plotly-5.5.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ipython 7.9.0 requires jedi>=0.10, which is not installed.
notebook 5.7.16 requires jinja2<=3.0.0, but you have jinja2 3.1.2 which is incompatible.
flask 1.1.4 requires Jinja2<3.0,>=2.10.1, but you have jinja2 3.1.2 which is incompatible.
Successfully installed Mako-1.2.3 alembic-1.8.1 autopage-0.5.1 cliff-3.10.1 cmaes-0.9.0 cmd2-2.4.2 colorlog-6.7.0 eli5-0.13.0 jinja2-3.1.2 lime-0.2.0.1 optuna-3.0.3 pbr-5.11.0 plotly-5.11.0 pyperclip-1.8.2 setuptools-65.5.1 shap-0.41.0 slicer-0.0.7 stevedore-3.5.2

   Now, we can move to the python-package subdirectory and use setup.py to compile the dependencies needed so that data will be processed on GPUs when using the LightGBM package.

In [ ]:
%cd /content/drive/MyDrive/LightGBM/python-package

!sudo python3 setup.py install --precompile --gpu
/content/drive/MyDrive/LightGBM/python-package
INFO:root:running install
/usr/local/lib/python3.8/dist-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
INFO:root:running build
INFO:root:running build_py
INFO:root:running egg_info
INFO:root:writing lightgbm.egg-info/PKG-INFO
INFO:root:writing dependency_links to lightgbm.egg-info/dependency_links.txt
INFO:root:writing requirements to lightgbm.egg-info/requires.txt
INFO:root:writing top-level names to lightgbm.egg-info/top_level.txt
INFO:root:reading manifest template 'MANIFEST.in'
WARNING:root:no previously-included directories found matching 'build'
WARNING:root:warning: no files found matching 'LICENSE'
WARNING:root:warning: no files found matching '*.txt'
WARNING:root:warning: no files found matching '*.so' under directory 'lightgbm'
WARNING:root:warning: no files found matching 'compile/CMakeLists.txt'
WARNING:root:warning: no files found matching 'compile/cmake/IntegratedOpenCL.cmake'
WARNING:root:warning: no files found matching '*.so' under directory 'compile'
WARNING:root:warning: no files found matching '*.dll' under directory 'compile/Release'
WARNING:root:warning: no files found matching 'compile/external_libs/compute/CMakeLists.txt'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/compute/cmake'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/compute/include'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/compute/meta'
WARNING:root:warning: no files found matching 'compile/external_libs/eigen/CMakeLists.txt'
WARNING:root:warning: no files found matching 'compile/external_libs/eigen/Eigen/Cholesky'
WARNING:root:warning: no files found matching 'compile/external_libs/eigen/Eigen/Core'
WARNING:root:warning: no files found matching 'compile/external_libs/eigen/Eigen/Dense'
WARNING:root:warning: no files found matching 'compile/external_libs/eigen/Eigen/Eigenvalues'
WARNING:root:warning: no files found matching 'compile/external_libs/eigen/Eigen/Geometry'
WARNING:root:warning: no files found matching 'compile/external_libs/eigen/Eigen/Householder'
WARNING:root:warning: no files found matching 'compile/external_libs/eigen/Eigen/Jacobi'
WARNING:root:warning: no files found matching 'compile/external_libs/eigen/Eigen/LU'
WARNING:root:warning: no files found matching 'compile/external_libs/eigen/Eigen/QR'
WARNING:root:warning: no files found matching 'compile/external_libs/eigen/Eigen/SVD'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/eigen/Eigen/src/Cholesky'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/eigen/Eigen/src/Core'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/eigen/Eigen/src/Eigenvalues'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/eigen/Eigen/src/Geometry'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/eigen/Eigen/src/Householder'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/eigen/Eigen/src/Jacobi'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/eigen/Eigen/src/LU'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/eigen/Eigen/src/misc'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/eigen/Eigen/src/plugins'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/eigen/Eigen/src/QR'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/eigen/Eigen/src/SVD'
WARNING:root:warning: no files found matching 'compile/external_libs/fast_double_parser/CMakeLists.txt'
WARNING:root:warning: no files found matching 'compile/external_libs/fast_double_parser/LICENSE'
WARNING:root:warning: no files found matching 'compile/external_libs/fast_double_parser/LICENSE.BSL'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/fast_double_parser/include'
WARNING:root:warning: no files found matching 'compile/external_libs/fmt/CMakeLists.txt'
WARNING:root:warning: no files found matching 'compile/external_libs/fmt/LICENSE.rst'
WARNING:root:warning: no files found matching '*' under directory 'compile/external_libs/fmt/include'
WARNING:root:warning: no files found matching '*' under directory 'compile/include'
WARNING:root:warning: no files found matching '*' under directory 'compile/src'
WARNING:root:warning: no files found matching 'LightGBM.sln' under directory 'compile/windows'
WARNING:root:warning: no files found matching 'LightGBM.vcxproj' under directory 'compile/windows'
WARNING:root:warning: no files found matching '*.dll' under directory 'compile/windows/x64/DLL'
WARNING:root:warning: no previously-included files matching '*.py[co]' found anywhere in distribution
WARNING:root:warning: no previously-included files found matching 'compile/external_libs/compute/.git'
INFO:root:writing manifest file 'lightgbm.egg-info/SOURCES.txt'
INFO:root:copying lightgbm/VERSION.txt -> build/lib/lightgbm
INFO:root:running install_lib
INFO:root:creating /usr/lib/python3.8/site-packages
INFO:root:creating /usr/lib/python3.8/site-packages/lightgbm
INFO:root:copying build/lib/lightgbm/dask.py -> /usr/lib/python3.8/site-packages/lightgbm
INFO:root:copying build/lib/lightgbm/basic.py -> /usr/lib/python3.8/site-packages/lightgbm
INFO:root:copying build/lib/lightgbm/libpath.py -> /usr/lib/python3.8/site-packages/lightgbm
INFO:root:copying build/lib/lightgbm/callback.py -> /usr/lib/python3.8/site-packages/lightgbm
INFO:root:copying build/lib/lightgbm/sklearn.py -> /usr/lib/python3.8/site-packages/lightgbm
INFO:root:copying build/lib/lightgbm/plotting.py -> /usr/lib/python3.8/site-packages/lightgbm
INFO:root:copying build/lib/lightgbm/compat.py -> /usr/lib/python3.8/site-packages/lightgbm
INFO:root:copying build/lib/lightgbm/engine.py -> /usr/lib/python3.8/site-packages/lightgbm
INFO:root:copying build/lib/lightgbm/__init__.py -> /usr/lib/python3.8/site-packages/lightgbm
INFO:root:copying build/lib/lightgbm/VERSION.txt -> /usr/lib/python3.8/site-packages/lightgbm
INFO:LightGBM:Installing lib_lightgbm from: ['/content/drive/MyDrive/LightGBM/lib_lightgbm.so']
INFO:root:copying /content/drive/MyDrive/LightGBM/lib_lightgbm.so -> /usr/lib/python3.8/site-packages/lightgbm
INFO:root:byte-compiling /usr/lib/python3.8/site-packages/lightgbm/dask.py to dask.cpython-38.pyc
INFO:root:byte-compiling /usr/lib/python3.8/site-packages/lightgbm/basic.py to basic.cpython-38.pyc
INFO:root:byte-compiling /usr/lib/python3.8/site-packages/lightgbm/libpath.py to libpath.cpython-38.pyc
INFO:root:byte-compiling /usr/lib/python3.8/site-packages/lightgbm/callback.py to callback.cpython-38.pyc
INFO:root:byte-compiling /usr/lib/python3.8/site-packages/lightgbm/sklearn.py to sklearn.cpython-38.pyc
INFO:root:byte-compiling /usr/lib/python3.8/site-packages/lightgbm/plotting.py to plotting.cpython-38.pyc
INFO:root:byte-compiling /usr/lib/python3.8/site-packages/lightgbm/compat.py to compat.cpython-38.pyc
INFO:root:byte-compiling /usr/lib/python3.8/site-packages/lightgbm/engine.py to engine.cpython-38.pyc
INFO:root:byte-compiling /usr/lib/python3.8/site-packages/lightgbm/__init__.py to __init__.cpython-38.pyc
INFO:root:running install_egg_info
INFO:root:Copying lightgbm.egg-info to /usr/lib/python3.8/site-packages/lightgbm-3.3.3.99-py3.8.egg-info
INFO:root:running install_scripts

   The Hyperopt notebook is here. Let's now set up the environment by importing the required packages, setting the the options, setting the random and numpy seed and determining the CUDA compiler and GPU characteristics.

In [ ]:
import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')

seed_value = 42
os.environ['usedCars_lgbGPU'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)

print('\n')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
Wed May 25 00:43:08 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
In [ ]:
import pandas as pd

trainDF = pd.read_csv('usedCars_trainSet.csv', low_memory=False)
X_train = trainDF.drop(['price'], axis=1)
y_train = trainDF['price']

testDF = pd.read_csv('usedCars_testSet.csv', low_memory=False)
X_test = testDF.drop(['price'], axis=1)
y_test = testDF['price']

X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)

Baseline Model

   Let's now fit the baseline model to the data, save as a .pkl file and evaluate the model metrics for the baseline model.

In [ ]:
from lightgbm import LGBMRegressor
import pickle
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

lgb = LGBMRegressor(device='gpu',
                    gpu_platform_id=0,
                    gpu_device_id=0,
                    random_state=seed_value,
                    objective='regression',
                    metric='rmse',
                    verbosity=-1)

lgb.fit(X_train, y_train)

Pkl_Filename = 'LightGBM_Baseline.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(lgb, file)

print('\nModel Metrics for LightGBM Baseline')
y_train_pred = lgb.predict(X_train)
y_test_pred = lgb.predict(X_test)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(y_train, y_train_pred),
        mean_absolute_error(y_test, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train, y_train_pred),
        mean_squared_error(y_test, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train, y_train_pred, squared=False),
        mean_squared_error(y_test, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(y_train, y_train_pred),
        r2_score(y_test, y_test_pred)))

Model Metrics for LightGBM Baseline
MAE train: 2142.980, test: 2160.124
MSE train: 7963695.766, test: 8120763.484
RMSE train: 2822.002, test: 2849.695
R^2 train: 0.913, test: 0.911

Hyperopt: 300 Trials 10-Fold Cross Validation

   We can now define the number of trials, the same k-folds using shuffled n_splits=10 for reproducibility and the parameter space:

  • random_state: seed_value designated for reproducibility. default=None.
  • device_type: Device for the tree learning. default=cpu.
  • objective: Specify the learning task and the corresponding learning objective or a custom objective function to be used. default=None or default=regression.
  • metric: Metric(s) to be evaluated on the evaluation set(s). default="".
  • verbosity: Controls the level of LightGBM’s verbosity. default=1.
  • n_estimators: Number of boosted trees to fit. default=100.
  • learning_rate: Boosting learning rate default=0.1.
  • max_depth: Maximum tree depth for base learners, <=0 means no limit. default=-1.
  • num_leaves: Maximum tree leaves for base learners. default=31.
  • boosting_type: gbdt, gdbt_subsample', 0.5. default='gbdt')) – gbdt, traditional Gradient Boosting Decision Tree. dart, Dropouts meet Multiple Additive Regression Trees. rf, Random Forest.
  • colsample_bytree: Subsample ratio of columns when constructing each tree. default=1.
  • reg_alpha: L1 regularization term on weights. default=0.
  • reg_lambda: L2 regularization term on weights. default=0.
In [ ]:
from sklearn.model_selection import KFold
from hyperopt import hp

NUM_EVAL = 300

kfolds = KFold(n_splits=10, shuffle=True, random_state=seed_value)

lgb_tune_kwargs = {
    'random_state': seed_value,
    'device_type': 'gpu',
    'objective': 'regression',
    'metric': 'rmse',
    'verbosity' : -1,
    'n_estimators': hp.choice('n_estimators', np.arange(400, 700, dtype=int)),
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(1)),
    'max_depth': hp.choice('max_depth', np.arange(5, 12, dtype=int)),
    'num_leaves': hp.choice('num_leaves', np.arange(70, 150, dtype=int)),
    'boosting_type': hp.choice('boosting_type',
                               [{'boosting_type': 'gbdt',
                                 'subsample': hp.uniform('gdbt_subsample', 0.5,
                                                                           0.95)}]),
    'colsample_bytree': hp.uniform('colsample_by_tree', 0.75, 1.0),
    'reg_alpha': hp.uniform('reg_alpha', 0.0, 1.0),
    'reg_lambda': hp.uniform('reg_lambda', 0.0, 1.0)
    }

   Now, we can define a function lgb_hpo for optimizing the hyperparameters. Within this function, we can use joblib to save a pkl file that can be reloaded if more training is needed. We can define the trial number with ITERATION that will increase by 1 each trial to keep track of the parameters for each trial. The parameters which are integers need to configured to remain as integers and not as float. This is then allocated for max_depth and num_leaves. Then the model type, LGBMRegressor, needs to be defined.

   Then we can fit the model with the training set utilizing the 'neg_root_mean_squared_error' for the scoring metric and 10-fold cross validation to generate the negative cross_val_score where the np.mean of the scores is saved as the rmse for each trial. The trial loss=rmse, params = parameters tested in the trial from the lgb_tune_kwargs space, the trial number (ITERATION), the time to complete the trial (run_time) and if the trial completed successfully or not (STATUS_OK) are written to the defined .csv file and appended by row for each trial.

In [ ]:
from lightgbm import LGBMRegressor
from timeit import default_timer as timer
from sklearn.model_selection import cross_val_score
import csv
from hyperopt import STATUS_OK

def lgb_hpo(config):
    """
    Objective function to tune a LightGBMRegressor model.
    """
    global ITERATION
    ITERATION += 1

    subsample = config['boosting_type'].get('subsample', 1.0)
    config['boosting_type'] = config['boosting_type']['boosting_type']
    config['subsample'] = subsample

    for param_name in ['max_depth', 'num_leaves']:
        config[param_name] = int(config[param_name])

    lgb = LGBMRegressor(**config)

    start = timer()
    scores = -cross_val_score(lgb, X_train, y_train,
                              scoring='neg_root_mean_squared_error',
                              cv=kfolds)
    run_time = timer() - start

    rmse = np.mean(scores)

    of_connection = open(out_file, 'a')
    writer = csv.writer(of_connection)
    writer.writerow([rmse, config, ITERATION, run_time])

    return {'loss': rmse, 'params': config, 'iteration': ITERATION,
            'train_time': run_time, 'status': STATUS_OK}

   Let's now define an out_file to save the results where the headers will be written to the file. Then we can set the global variable for the ITERATION and defined the Hyperopt Trials as bayesOpt_trials.

   We can utilize if/else condtional statements to load a .pkl file if it exists, and then utilizing fmin, we can specify the training function lgb_hpo, the parameter space lgb_tune_kwargs, the algorithm for optimization tpe.suggest, the number of trials to evaluate NUM_EVAL, the name of the trial set bayesOpt_trials and the random state np.random.RandomState(42). We can now begin the hyperparameter optimization HPO trials.

In [ ]:
from hyperopt import tpe, Trials

out_file = '/content/drive/MyDrive/UsedCarsCarGurus/Models/ML/lightGBM/Hyperopt/trialOptions/lightGBM_CV_trials_300_GPU.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)

writer.writerow(['loss', 'params', 'iteration', 'train_time'])
of_connection.close()

tpe_algorithm = tpe.suggest

global ITERATION
ITERATION = 0
bayesOpt_trials = Trials()
In [ ]:
from datetime import datetime, timedelta
import joblib
from hyperopt import fmin

start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('lightGBM_CV_Hyperopt_300_GPU.pkl'):
    bayesOpt_trials = joblib.load('lightGBM_CV_Hyperopt_300_GPU.pkl')
    best_param = fmin(lgb_hpo, lgb_tune_kwargs, algo=tpe.suggest,
                      max_evals=NUM_EVAL, trials=bayesOpt_trials,
                      rstate=np.random.RandomState(42))
else:
    best_param = fmin(lgb_hpo, lgb_tune_kwargs, algo=tpe.suggest,
                      max_evals=NUM_EVAL, trials=bayesOpt_trials,
                      rstate=np.random.RandomState(42))

end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
Start Time           2022-05-25 06:15:38.742065
100%|██████████| 300/300 [6:46:24<00:00, 81.28s/it, best loss: 2396.705123053008]
Start Time           2022-05-25 06:15:38.742065
End Time             2022-05-25 13:02:03.360201
6:46:24

   Let's now sort the trials with lowest loss first and examine the lowest two losses:

In [ ]:
bayesOpt_trials_results = sorted(bayesOpt_trials.results,
                                 key=lambda x: x['loss'])
print('Top two trials with the lowest loss (lowest RMSE)')
print(bayesOpt_trials_results[:2])
Top two trials with the lowest loss (lowest RMSE)
[{'loss': 2396.705123053008, 'params': {'boosting_type': 'gbdt', 'colsample_bytree': 0.990710247020252, 'device_type': 'gpu', 'force_col_wise': '+', 'learning_rate': 0.1532468845606874, 'max_depth': 11, 'metric': 'rmse', 'n_estimators': 664, 'num_leaves': 149, 'objective': 'regression', 'random_state': 42, 'reg_alpha': 0.21368623311905155, 'reg_lambda': 0.6081935133906204, 'verbosity': -1, 'subsample': 0.7521090621184197}, 'iteration': 160, 'train_time': 110.59749732300043, 'status': 'ok'}, {'loss': 2397.3821386755544, 'params': {'boosting_type': 'gbdt', 'colsample_bytree': 0.9990197319105564, 'device_type': 'gpu', 'force_col_wise': '+', 'learning_rate': 0.13443271095173615, 'max_depth': 11, 'metric': 'rmse', 'n_estimators': 650, 'num_leaves': 143, 'objective': 'regression', 'random_state': 42, 'reg_alpha': 0.13676177893122635, 'reg_lambda': 0.6783165993977724, 'verbosity': -1, 'subsample': 0.6583372823931356}, 'iteration': 158, 'train_time': 107.32305251999969, 'status': 'ok'}]

   Let's now access the results in the trialOptions directory by reading into a pandas.Dataframe sorted with the best scores on top and then resetting the index for slicing. Then we can convert the trial parameters from a string to a dictionary for later use. Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then saving as a .pkl file. Then we can evaluate the model metrics for both the train and the test sets using the MAE, MSE, RMSE and the (R²).

In [ ]:
import ast
import pickle
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

results = pd.read_csv('lightGBM_CV_trials_300_GPU.csv')
results.sort_values('loss', ascending=True, inplace=True)
results.reset_index(inplace=True, drop=True)
results.to_csv('lightGBM_CV_trials_300_GPU.csv')

ast.literal_eval(results.loc[0, 'params'])
best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()

best_bayes_model = LGBMRegressor(device='gpu',
                                 gpu_platform_id=0,
                                 gpu_device_id=0,
                                 **best_bayes_params)

best_bayes_model.fit(X_train, y_train)

Pkl_Filename = 'lightGBM_CV_HPO_300_GPU.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_bayes_model, file)

print('\nModel Metrics for lightGBM HPO Hyperopt 300 trials')
y_train_pred = best_bayes_model.predict(X_train)
y_test_pred = best_bayes_model.predict(X_test)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(y_train, y_train_pred),
        mean_absolute_error(y_test, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train, y_train_pred),
        mean_squared_error(y_test, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train, y_train_pred, squared=False),
        mean_squared_error(y_test, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(y_train, y_train_pred),
        r2_score(y_test, y_test_pred)))

Model Metrics for lightGBM HPO Hyperopt 300 trials
MAE train: 1341.534, test: 1735.201
MSE train: 3295813.869, test: 5595563.594
RMSE train: 1815.438, test: 2365.494
R^2 train: 0.964, test: 0.939

   Compared to the baseline model, there are lower values for MAE, MSE and RMSE and a higher R² for both train/test sets after hyperparameter tuning.

   We can also evaluate the MSE on the test set and determine when this was achieved:

In [ ]:
print('The best model from Bayes optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(y_test,
                                                                                                            y_test_pred)))
print('This was achieved after {} search iterations'.format(results.loc[0,
                                                                        'iteration']))
The best model from Bayes optimization scores 5595563.59354 MSE on the test set.
This was achieved after 160 search iterations

   Let's now create a new pandas.Dataframe where the parameters examined during the search are stored and the results of each parameter are stored as a different column. Then we can convert the data types for graphing. Let's now examine the distributions of the quantitative hyperparameters that were assessed during the search.

In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns

bayes_params = pd.DataFrame(columns=list(ast.literal_eval(results.loc[0,
                                                                      'params']).keys()),
                            index=list(range(len(results))))

for i, params in enumerate(results['params']):
    bayes_params.loc[i,:] = list(ast.literal_eval(params).values())

bayes_params['loss'] = results['loss']
bayes_params['iteration'] = results['iteration']
bayes_params['learning_rate'] = bayes_params['learning_rate'].astype('float64')
bayes_params['colsample_bytree'] = bayes_params['colsample_bytree'].astype('float64')
bayes_params['num_leaves'] = bayes_params['num_leaves'].astype('float64')
bayes_params['reg_alpha'] = bayes_params['reg_alpha'].astype('float64')
bayes_params['reg_lambda'] = bayes_params['reg_lambda'].astype('float64')
bayes_params['subsample'] = bayes_params['subsample'].astype('float64')

for i, hpo in enumerate(bayes_params.columns):
    if hpo not in ['boosting_type', 'iteration', 'subsample', 'force_col_wise',
                   'max_depth', 'device_type', 'verbosity', 'random_state',
                   'objective', 'metric']:
        plt.figure(figsize=(14,6))
        if hpo != 'loss':
            sns.kdeplot(bayes_params[hpo], label='Bayes Optimization')
            plt.legend(loc=0)
            plt.title('{} Distribution'.format(hpo))
            plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
            plt.tight_layout()
            plt.show()
<Figure size 1008x432 with 0 Axes>

   Higher values of colsample_bytree, n_estimators, num_leaves and reg_lambda in the given hyperparameter space performed better while a lower learning_rate and a lower reg_alpha performed better to generate a lower loss.

   Let's now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.

In [ ]:
fig, axs = plt.subplots(1, 3, figsize=(20,5))
i = 0
for i, hpo in enumerate(['learning_rate', 'num_leaves', 'colsample_bytree']):
    sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

   colsample_bytree increases over trials while learning_rate and num_leaves do not tend to follow a trend.

   Next, we can examine if there were any trends regarding the regularization parameters over the trials.

In [ ]:
fig, axs = plt.subplots(1, 2, figsize=(14,6))
i = 0
for i, hpo in enumerate(['reg_alpha', 'reg_lambda']):
    sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

   Over the trials, the reg_alpha hyperparameter decreased while reg_lambda increased.

Model Explanations

Model Metrics with ELI5

   Let's now utilize the PermutationImportance from eli5.sklearn which is a method for the global interpretation of a model that outputs the amplitude of the feature's effect but not the direction. This shuffles the set, generates predictions on the shuffled set, and then calculates the decrease in the specified metric, this case the model's metric, before shuffling the set. The more important a feature is, the greater the model error is when the set is shuffled.

class PermutationImportance(estimator, scoring=None, n_iter=5, random_state=None, cv='prefit', refit=True)

   Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.

In [ ]:
import eli5
from eli5.sklearn import PermutationImportance

X_test1 = pd.DataFrame(X_test, columns=X_test.columns)

perm_importance = PermutationImportance(best_bayes_model,
                                        random_state=seed_value).fit(X_test,
                                                                     y_test)
html_obj = eli5.show_weights(perm_importance,
                             feature_name=X_test1.columns.tolist())
html_obj
Out[ ]:
Weight Feature
0.5332 ± 0.0048 horsepower
0.1328 ± 0.0030 year
0.0732 ± 0.0012 mileage
0.0708 ± 0.0011 width
0.0564 ± 0.0003 height
0.0403 ± 0.0005 wheel_system_display_Front-Wheel Drive
0.0379 ± 0.0009 horsepower_rpm
0.0369 ± 0.0005 fuel_tank_volume
0.0348 ± 0.0006 length
0.0330 ± 0.0009 highway_fuel_economy
0.0293 ± 0.0005 back_legroom
0.0290 ± 0.0004 maximum_seating
0.0275 ± 0.0004 wheelbase
0.0217 ± 0.0003 engine_displacement
0.0184 ± 0.0002 savings_amount
0.0141 ± 0.0004 front_legroom
0.0137 ± 0.0005 wheel_system_display_Four-Wheel Drive
0.0119 ± 0.0002 city_fuel_economy
0.0108 ± 0.0004 is_new
0.0078 ± 0.0003 daysonmarket
… 33 more …

   We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).

In [ ]:
from eli5.formatters import format_as_dataframe

explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
Out[ ]:
feature weight std
0 horsepower 0.533249 0.002383
1 year 0.132819 0.001479
2 mileage 0.073229 0.000607
3 width 0.070825 0.000543
4 height 0.056370 0.000156
5 wheel_system_display_Front-Wheel Drive 0.040348 0.000272
6 horsepower_rpm 0.037852 0.000457
7 fuel_tank_volume 0.036877 0.000265
8 length 0.034843 0.000321
9 highway_fuel_economy 0.033026 0.000463
10 back_legroom 0.029301 0.000242
11 maximum_seating 0.029037 0.000202
12 wheelbase 0.027520 0.000181
13 engine_displacement 0.021729 0.000151
14 savings_amount 0.018410 0.000109
15 front_legroom 0.014075 0.000206
16 wheel_system_display_Four-Wheel Drive 0.013690 0.000243
17 city_fuel_economy 0.011933 0.000112
18 is_new 0.010768 0.000219
19 daysonmarket 0.007803 0.000148

   The horsepower feature contains the highest weight (0.533249). The year feature has the next highest, but it is significantly lower (0.132819). This is followed by mileage (0.073229), width (0.070825) and height (0.056370).

SHAP (SHapley Additive exPlanations)

   Let's use the training set and summarize the effects of all the features.

In [ ]:
import shap

shap.initjs()
explainer = shap.TreeExplainer(best_bayes_model)
shap_values = explainer.shap_values(X_train)
In [ ]:
shap.summary_plot(shap_values, X_train);

   Then we can use the test set and summarize the effects of all the features.

In [ ]:
shap.initjs()
shap_values = explainer.shap_values(X_test)

shap.summary_plot(shap_values, X_test);

100 Trials Train/Validation/Test

   Let's now prepare the data for modeling by creating the dummy variables for the categorical features. Then we can define the label as price and the features as everything except price by dropping price to remove the target. Next, we can extract the names of the features by using the columns and converting to a np.array.

In [ ]:
train = pd.get_dummies(X_train, drop_first=True)

label = train[['price']]

features = train.drop(columns = ['price'])
features = features.columns
features = np.array(features)

   We can now define the number of trials, the same k-folds using shuffled n_splits=10 for reproducibility and the parameter space. Compared to the 10-fold cross validation trials, let's use lower values for the n_estimators (300-500 vs 400-700), max_depth (5-6 vs. 5-12), num_leaves (30-100 vs. 70-150) and colsample_bytree (0.6-1 vs. 0.75-1) and higher values for gbdt_subsample (0.5-1 vs 0.5-0.95). We can also test different kinds of boosting_type, so let's evaluate these trials using gdbt and dart with the same ranges for subsampling and goss with 'subsample'=1.0. We can utilize the same range for the learning_rate, reg_alpha and reg_lambda hyperparameters:

In [ ]:
NUM_EVAL = 100

kf = KFold(n_splits=10, shuffle=True, random_state=seed_value)

lgb_tune_kwargs = {
    'random_state': seed_value,
    'device_type': 'gpu',
    'objective': 'regression',
    'metric': 'rmse',
    'verbosity' : -1,
    'n_estimators': hp.choice('n_estimators', np.arange(300, 500, dtype=int)),
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(1)),
    'max_depth': hp.choice('max_depth', np.arange(5, 6, dtype=int)),
    'num_leaves': hp.choice('num_leaves', np.arange(30, 100, dtype=int)),
    'boosting_type': hp.choice('boosting_type', [{'boosting_type': 'gbdt',
                                                  'subsample': hp.uniform('gdbt_subsample',
                                                                          0.5,
                                                                          1)},
                                                 {'boosting_type': 'dart',
                                                  'subsample': hp.uniform('dart_subsample',
                                                                          0.5,
                                                                          1)},
                                                 {'boosting_type': 'goss',
                                                  'subsample': 1.0}]),
    'colsample_bytree': hp.uniform('colsample_by_tree', 0.6, 1.0),
    'reg_alpha': hp.uniform('reg_alpha', 0.0, 1.0),
    'reg_lambda': hp.uniform('reg_lambda', 0.0, 1.0)
    }

   Now, we can define a function lgb_hpo for optimizing the hyperparameters. Within this function, we can use joblib to save a pkl file that can be reloaded if more training is needed. We can define the trial number with ITERATION that will increase by 1 each trial to keep track of the parameters for each trial. Then we can retrieve the value used for subsample, if it is present, otherwise it will be set to 1.0. Then we can extract the boosting type. Then the parameters which are integers need to configured to remain as integers and not as float, so let's allocate them for the max_depth and num_leaves hyperparameters.

   Next, we can split the training set into the train and validation sets by utilizing the defined variable kf in a for loop to split the training set 10-fold into the train and the validation sets with shuffling and the random_state=seed_value. This will allow for different sets to be fit and evaluated using the LGBMRegressor model. Then the validation set can be evaluated for the trial rmse for each trial and saved. The trial loss=rmse, params = parameters tested in the trial from the lgb_tune_kwargs space, the trial number (ITERATION), the time to complete the trial (run_time) and if the trial completed successfully or not (STATUS_OK) are written to the defined .csv file and appended by row for each trial.

In [ ]:
def lgb_hpo(config):
    """
    Objective function to tune a LightGBMRegressor model.
    """
    global ITERATION
    ITERATION += 1

    subsample = config['boosting_type'].get('subsample', 1.0)
    config['boosting_type'] = config['boosting_type']['boosting_type']
    config['subsample'] = subsample

    for param_name in ['max_depth', 'num_leaves']:
        config[param_name] = int(config[param_name])

    for trn_idx, val_idx in kf.split(train[features], label):
        train_features, train_label = train[features].iloc[trn_idx], label.iloc[trn_idx]
        val_features, val_label = train[features].iloc[val_idx], label.iloc[val_idx]

    lgb = LGBMRegressor(**config)

    start = timer()
    lgb.fit(train_features, train_label,
            eval_set = [(val_features, val_label),
                        (train_features, train_label)])
    run_time = timer() - start

    y_pred_val = lgb.predict(val_features)
    rmse = mean_squared_error(val_label, y_pred_val, squared=False)

    of_connection = open(out_file, 'a')
    writer = csv.writer(of_connection)
    writer.writerow([rmse, config, ITERATION, run_time])

    return {'loss': rmse, 'params': config, 'iteration': ITERATION,
            'train_time': run_time, 'status': STATUS_OK}

   Let's now define an out_file to save the results where the headers will be written to the file. Then we can set the global variable for the ITERATION and defined the Hyperopt Trials as bayesOpt_trials.

   We can utilize if/else condtional statements to load a .pkl file if it exists, and then utilizing fmin, we can specify the training function lgb_hpo, the parameter space lgb_tune_kwargs, the algorithm for optimization tpe.suggest, the number of trials to evaluate NUM_EVAL, the name of the trial set bayesOpt_trials and the random state np.random.RandomState(42). We can now begin the hyperparameter optimization HPO trials.

In [ ]:
out_file = '/content/drive/MyDrive/UsedCarsCarGurus/Models/ML/lightGBM/Hyperopt/trialOptions/lightGBM_trials_100_GPU_val.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)

writer.writerow(['loss', 'params', 'iteration', 'train_time'])
of_connection.close()

global ITERATION
ITERATION = 0
bayesOpt_trials = Trials()
In [ ]:
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('lightGBM_Hyperopt_100_GPU_val.pkl'):
    bayesOpt_trials = joblib.load('Hyperopt_100_GPU_val.pkl')
    best_param = fmin(lgb_hpo, lgb_tune_kwargs, algo=tpe.suggest,
                      max_evals=NUM_EVAL, trials=bayesOpt_trials,
                      rstate=np.random.RandomState(42))
else:
    best_param = fmin(lgb_hpo, lgb_tune_kwargs, algo=tpe.suggest,
                      max_evals=NUM_EVAL, trials=bayesOpt_trials,
                      rstate=np.random.RandomState(42))
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
Start Time           2022-09-01 10:47:35.115444
100%|██████████| 100/100 [45:29<00:00, 27.29s/it, best loss: 2491.643880064388]
Start Time           2022-09-01 10:47:35.115444
End Time             2022-09-01 11:33:04.734886
0:45:29

   Compared to the completion time for 300 trials using 10-fold cross validation (6:46:24) using the same GPU, a Tesla P100-PCIE, this completed in more than 1/6 the time. We need to evaluate the the model metrics for comparisons, though.

   Let's now sort the trials with lowest loss first and examine the lowest two losses:

In [ ]:
bayesOpt_trials_results = sorted(bayesOpt_trials.results,
                                 key=lambda x: x['loss'])
print('Top two trials with the lowest loss (lowest RMSE)')
print(bayesOpt_trials_results[:2])
Top two trials with the lowest loss (lowest RMSE)
[{'loss': 2491.643880064388, 'params': {'boosting_type': 'gbdt', 'colsample_bytree': 0.7019751666596669, 'device_type': 'gpu', 'force_col_wise': '+', 'learning_rate': 0.3070211316315261, 'max_depth': 5, 'metric': 'rmse', 'n_estimators': 498, 'num_leaves': 83, 'objective': 'regression', 'random_state': 42, 'reg_alpha': 0.5025906565345484, 'reg_lambda': 0.1880482199332495, 'verbosity': -1, 'subsample': 0.9533649868642637}, 'iteration': 85, 'train_time': 4.619575178000105, 'status': 'ok'}, {'loss': 2494.775957282981, 'params': {'boosting_type': 'dart', 'colsample_bytree': 0.6754723701477817, 'device_type': 'gpu', 'force_col_wise': '+', 'learning_rate': 0.9984982979328447, 'max_depth': 5, 'metric': 'rmse', 'n_estimators': 483, 'num_leaves': 83, 'objective': 'regression', 'random_state': 42, 'reg_alpha': 0.44767945483995375, 'reg_lambda': 0.08411821913955789, 'verbosity': -1, 'subsample': 0.916422012826982}, 'iteration': 78, 'train_time': 61.83099851800034, 'status': 'ok'}]

   Let's now access the results in the trialOptions directory by reading into a pandas.Dataframe sorted with the best scores on top and then resetting the index for slicing. Then we can convert the trial parameters from a string to a dictionary for later use. Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then saving as a .pkl file. Then we can evaluate the model metrics for both the train and the test sets using the MAE, MSE, RMSE and the (R²).

In [ ]:
results = pd.read_csv('lightGBM_trials_100_GPU_val.csv')
results.sort_values('loss', ascending=True, inplace=True)
results.reset_index(inplace=True, drop=True)
results.to_csv('lightGBM_CV_trials_300_GPU.csv')

ast.literal_eval(results.loc[0, 'params'])
best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()

train_label = train[['price']]
train_features = train.drop(columns=['price'])

test_label = testDF[['price']]
test_features = testDF.drop(columns=['price'])
test_features = pd.get_dummies(test_features, drop_first=True)

best_bayes_model = LGBMRegressor(device='gpu',
                                 gpu_platform_id=0,
                                 gpu_device_id=0,
                                 **best_bayes_params)

best_model.fit(train_features, train_label)

Pkl_Filename = 'lightGBM_HPO_100_GPU_val.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_model, file)

print('\nModel Metrics for lightGBM HPO Hyperopt 100 trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(y_train, y_train_pred),
        mean_absolute_error(y_test, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train, y_train_pred),
        mean_squared_error(y_test, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train, y_train_pred, squared=False),
        mean_squared_error(y_test, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(y_train, y_train_pred),
        r2_score(y_test, y_test_pred)))

Model Metrics for lightGBM HPO Hyperopt 100 trials
MAE train: 1761.696, test: 1840.712
MSE train: 5605800.296, test: 6145106.449
RMSE train: 2367.657, test: 2478.933
R^2 train: 0.939, test: 0.933

   Compared to the baseline model, there are lower values for MAE, MSE and RMSE and a higher R² for both train/test sets after hyperparameter tuning. With 300 trials, the model overfit more with the training set compared to the test, while with 100 trials the model performed, more balanced regarding the model metrics.

   We can also evaluate the MSE on the test set and determine when this was achieved:

In [ ]:
print('The best model from Bayes optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
                                                                                                            y_test_pred)))
print('This was achieved after {} search iterations'.format(results.loc[0,
                                                                        'iteration']))
The best model from Bayes optimization scores 6145106.44850 MSE on the test set.
This was achieved after 85 search iterations

   Let's now create a new pandas.Dataframe where the parameters examined during the search are stored and the results of each parameter are stored as a different column. Then we can convert the data types for graphing. Let's now examine the distributions of the quantitative hyperparameters that were assessed during the search.

In [ ]:
bayes_params = pd.DataFrame(columns=list(ast.literal_eval(results.loc[0,
                                                                      'params']).keys()),
                            index=list(range(len(results))))

for i, params in enumerate(results['params']):
    bayes_params.loc[i,:] = list(ast.literal_eval(params).values())

bayes_params['loss'] = results['loss']
bayes_params['iteration'] = results['iteration']
bayes_params['learning_rate'] = bayes_params['learning_rate'].astype('float64')
bayes_params['colsample_bytree'] = bayes_params['colsample_bytree'].astype('float64')
bayes_params['num_leaves'] = bayes_params['num_leaves'].astype('float64')
bayes_params['reg_alpha'] = bayes_params['reg_alpha'].astype('float64')
bayes_params['reg_lambda'] = bayes_params['reg_lambda'].astype('float64')
bayes_params['subsample'] = bayes_params['subsample'].astype('float64')

for i, hpo in enumerate(bayes_params.columns):
    if hpo not in ['boosting_type', 'iteration', 'subsample', 'force_col_wise',
                   'max_depth', 'device_type', 'verbosity', 'random_state',
                   'objective', 'metric']:
        plt.figure(figsize=(14,6))
        if hpo != 'loss':
            sns.kdeplot(bayes_params[hpo], label='Bayes Optimization')
            plt.legend(loc=0)
            plt.title('{} Distribution'.format(hpo))
            plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
            plt.tight_layout()
            plt.show()
<Figure size 1008x432 with 0 Axes>

   Lower values of colsample_bytree, learning_rate and reg_lambda while higher values of n_estimators and num_leaves in the given hyperparameter space performed better to generate a lower loss.

   Compared to the 10-fold cross validation model, there were differences in the densities of the colsample_bytree and the reg_lambda hyperparameters.

   We can map the boosting_type to integer (essentially label encoding) and plot the boosting type over the search.

In [ ]:
bayes_params['boosting_int'] = bayes_params['boosting_type'].replace({'goss': 0,
                                                                      'dart': 1,
                                                                      'gbdt': 2})

plt.plot(bayes_params['iteration'], bayes_params['boosting_int'], 'ro')
plt.yticks([0, 1, 2], ['goss', 'dart', 'gbdt']);
plt.xlabel('Iteration'); plt.title('Boosting Type over trials')
plt.show()

   More of the trial iterations tested dart for the boosting_type compared to gbdt and goss.

   Let's now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.

In [ ]:
fig, axs = plt.subplots(1, 3, figsize=(20,5))
i = 0
for i, hpo in enumerate(['learning_rate', 'num_leaves', 'colsample_bytree']):
    sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

   The colsample_bytree parameter decreased, compared to increasing when using cross-validation over trials while the learning_rate might have slightly increased. Both did not show a trend for the num_leaves parameter over the trials.

   Next, we can examine if there were any trends regarding the regularization parameters over the trials.

In [ ]:
fig, axs = plt.subplots(1, 2, figsize=(14,6))
i = 0
for i, hpo in enumerate(['reg_alpha', 'reg_lambda']):
    sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

   Over the trials, the hyperparameter reg_lambda decreased when there was not a trend for reg_alpha. For cross-validation, reg_lambda increased over the trials.

Model Explanations

Model Metrics with ELI5

   Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.

In [ ]:
X_test1 = pd.DataFrame(test_features, columns=test_features.columns)

perm_importance = PermutationImportance(best_model,
                                        random_state=seed_value).fit(test_features,
                                                                     test_label)
html_obj = eli5.show_weights(perm_importance,
                             feature_name=X_test1.columns.tolist())
html_obj
Out[ ]:
Weight Feature
0.4264 ± 0.0039 horsepower
0.1103 ± 0.0020 year
0.0982 ± 0.0011 height
0.0973 ± 0.0015 width
0.0808 ± 0.0015 mileage
0.0493 ± 0.0012 length
0.0479 ± 0.0009 fuel_tank_volume
0.0466 ± 0.0006 wheel_system_display_Front-Wheel Drive
0.0461 ± 0.0011 wheelbase
0.0381 ± 0.0004 engine_displacement
0.0351 ± 0.0006 back_legroom
0.0334 ± 0.0005 maximum_seating
0.0312 ± 0.0007 horsepower_rpm
0.0240 ± 0.0007 front_legroom
0.0215 ± 0.0007 highway_fuel_economy
0.0206 ± 0.0003 city_fuel_economy
0.0159 ± 0.0002 savings_amount
0.0116 ± 0.0002 is_new
0.0110 ± 0.0005 wheel_system_display_Four-Wheel Drive
0.0073 ± 0.0001 fuel_type_Flex Fuel Vehicle
… 33 more …

   We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).

In [ ]:
explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
Out[ ]:
feature weight std
0 horsepower 0.426414 0.001935
1 year 0.110301 0.000988
2 height 0.098178 0.000526
3 width 0.097303 0.000748
4 mileage 0.080769 0.000730
5 length 0.049251 0.000612
6 fuel_tank_volume 0.047857 0.000431
7 wheel_system_display_Front-Wheel Drive 0.046572 0.000311
8 wheelbase 0.046110 0.000530
9 engine_displacement 0.038070 0.000223
10 back_legroom 0.035122 0.000314
11 maximum_seating 0.033421 0.000239
12 horsepower_rpm 0.031211 0.000359
13 front_legroom 0.023997 0.000329
14 highway_fuel_economy 0.021470 0.000343
15 city_fuel_economy 0.020604 0.000126
16 savings_amount 0.015875 0.000088
17 is_new 0.011636 0.000116
18 wheel_system_display_Four-Wheel Drive 0.010982 0.000226
19 fuel_type_Flex Fuel Vehicle 0.007252 0.000053

   When comparing the weights from the train/validation/test model to the ones generated from using 10-fold cross validation, horsepower was (0.426414 vs 0.533249), year was lower (0.110301 vs. 0.132819), height was higher (0.098178 vs. 0.056370), mileage was higher (0.080769 vs. 0.073229) and width was higher (0.097303 vs. 0.070825).

SHAP (SHapley Additive exPlanations)

   Let's use the training set and summarize the effects of all the features.

In [ ]:
shap.initjs()
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(train_features)
In [ ]:
shap.summary_plot(shap_values, train_features);

   Then we can use the test set and summarize the effects of all the features.

In [ ]:
shap.initjs()
shap_values = explainer.shap_values(test_features)
In [ ]:
shap.summary_plot(shap_values, test_features);

Optuna: 100 Trials 10-Fold Cross Validation

   The Optuna notebooks are located here. Let's set up the environment by importing the dependencies/setting the options and the seed, and examining the CUDA and GPU characteristics.

In [ ]:
import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')

seed_value = 42
os.environ['usedCars_lgbGPU'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)

print('\n')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
Wed Apr  6 00:50:02 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

   Let's read the data and define the label/features with price as the model target. Then create dummy variables for the categorical variables and use the column names to extract feature names.

In [ ]:
trainDF = pd.read_csv('usedCars_trainSet.csv', low_memory=False)
testDF = pd.read_csv('usedCars_testSet.csv', low_memory=False)

train_label = trainDF[['price']]
test_label = testDF[['price']]
train_features = trainDF.drop(columns = ['price'])
test_features = testDF.drop(columns = ['price'])

train_features = pd.get_dummies(train_features, drop_first=True)
test_features = pd.get_dummies(test_features, drop_first=True)

features = train_features.columns
features = np.array(features)

   We can leverage the optuna.integration.wandb to set up the callbacks that will be saved to Weights & Biases. First, we login and set up the arguments that include the name of the project, the person saving the results, the group the study belongs to, if code is saved or not, and notes about the study for future reference.

In [ ]:
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)
import wandb
from optuna.integration.wandb import WeightsAndBiasesCallback

wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
                'group': 'optuna_lgbm100gpu_T4_cv_kfold10',
                'save_code': 'False',
                'notes': 'optuna_lgbm100gpu_T4_cv_kfold10'}

wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)

    Now, let's set up the WB callbacks using @wandbc.track_in_wandb(). Then we can define a function objective for the optimization of hyperparameters using an Optuna study with a new pickle file and the parameter space to evaluate different combinations. Then we can create a lgb dataset using the training set and the hyperparameter space as a dictionary. Then we can define the model type as LGBMRegressor specifying the parameters that will be used for each trial. We can then leverage cross validation from the lightgbm package by using lgb.cv with the params, train_set, 10-fold cross validation by specifying nfold=10 and stratified=False. This can then perform timed cross validation trials with the goal of finding the lowest averaged rmse in the validation set.

In [ ]:
import joblib
from sklearn.model_selection import cross_val_score, KFold
import lightgbm as lgb
from lightgbm import LGBMRegressor
from timeit import default_timer as timer

@wandbc.track_in_wandb()

def objective(trial):
    """
    Objective function to tune a LightGBMRegressor model.
    """
    joblib.dump(study, 'lightGBM_Optuna_100_GPU_T4_wb_CV.pkl')
    params = {'verbose': -1,
              'device': 'gpu',
              'gpu_platform_id': 0,
              'gpu_device_id': 0,
              'boosting_type': 'gbdt',
              'objective':'regression',
              'metric': 'rmse',
              'seed': seed_value,
              'early_stopping_rounds': 150,
              'n_estimators': trial.suggest_int('n_estimators', 925, 935),
              'learning_rate': trial.suggest_loguniform('learning_rate', 0.0075,
                                                        0.008),
              'num_leaves': trial.suggest_int('num_leaves', 350, 365),
              'bagging_freq': trial.suggest_int('bagging_freq', 8, 10),
              'subsample': trial.suggest_float('subsample', 0.85, 0.95),
              'colsample_bytree': trial.suggest_float('colsample_bytree', 0.77,
                                                      0.81),
              'max_depth': trial.suggest_int('max_depth', 9, 11),
              'lambda_l1': trial.suggest_float('lambda_l1', 0.02, 0.03,
                                               log=True),
              'lambda_l2': trial.suggest_float('lambda_l2', 0.001, 0.002,
                                               log=True),
              'min_child_samples': trial.suggest_int('min_child_samples',
                                                     540, 550)
              'verbosity': -1
              }

    train_set = lgb.Dataset(X_train, label=y_train, params=params)

    model = LGBMRegressor(**params)

    start = timer()
    cv_results = lgb.cv(params, train_set, nfold=10, stratified=False)
    run_time = timer() - start

    rmse = round(cv_results['valid rmse-mean'][-1], 4)

    return rmse

   Now let's begin the optimization with Weights & Biases tracking the loss and associated parameters. The optimization is tracked and the trial run components are located here.

In [ ]:
from datetime import datetime, timedelta

start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('lightGBM_Optuna_100_GPU_T4_wb_CV.pkl'):
    study = joblib.load('lightGBM_Optuna_100_GPU_T4_wb_CV.pkl')
else:
    study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100, callbacks=[wandbc])
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
Synced daily-energy-2831: https://wandb.ai/aschultz/usedCars_hpo/runs/10d31lpc
Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
Find logs at: ./wandb/run-20230106_010843-10d31lpc/logs
Start Time           2023-01-05 15:33:40.120878
End Time             2023-01-06 01:14:46.947398
9:41:06


Number of finished trials: 100
Best trial: {'n_estimators': 935, 'learning_rate': 0.007999749230495415, 'num_leaves': 352, 'bagging_freq': 9, 'subsample': 0.9499157527286387, 'colsample_bytree': 0.776813542010267, 'max_depth': 11, 'lambda_l1': 0.026324983395628405, 'lambda_l2': 0.001549095388756512, 'min_child_samples': 541}
Lowest RMSE 2667.5806

   Let's now extract the trial number, rmse and hyperparameter value into a pandas.Dataframe and sort with the lowest error first.

In [ ]:
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'rmse'}, inplace=True)
trials_df.rename(columns={'params_bagging_freq': 'bagging_freq'}, inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
                 inplace=True)
trials_df.rename(columns={'params_lambda_l1': 'lambda_l1'}, inplace=True)
trials_df.rename(columns={'params_lambda_l2': 'lambda_l2'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
                 inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_samples': 'min_child_samples'},
                 inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_num_leaves': 'num_leaves'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
    iteration       rmse             datetime_start  \
75         75  2667.5806 2023-01-05 22:45:42.252337   
83         83  2667.6170 2023-01-05 23:33:57.861181   
89         89  2667.8434 2023-01-06 00:09:40.256251   
82         82  2668.0128 2023-01-05 23:27:42.492958   
84         84  2668.0997 2023-01-05 23:40:04.234647   
..        ...        ...                        ...   
58         58  2727.9808 2023-01-05 21:06:36.116181   
38         38  2728.4181 2023-01-05 19:10:09.163385   
3           3  2737.5249 2023-01-05 15:50:23.089273   
0           0  2742.1722 2023-01-05 15:33:40.123880   
7           7  2743.7182 2023-01-05 16:10:56.150853   

            datetime_complete               duration  bagging_freq  \
75 2023-01-05 22:51:34.854589 0 days 00:05:52.602252             9   
83 2023-01-05 23:39:59.658275 0 days 00:06:01.797094             9   
89 2023-01-06 00:15:32.062667 0 days 00:05:51.806416             9   
82 2023-01-05 23:33:52.961615 0 days 00:06:10.468657             9   
84 2023-01-05 23:45:57.303182 0 days 00:05:53.068535             9   
..                        ...                    ...           ...   
58 2023-01-05 21:11:21.640650 0 days 00:04:45.524469             9   
38 2023-01-05 19:14:58.302748 0 days 00:04:49.139363             9   
3  2023-01-05 15:54:59.216601 0 days 00:04:36.127328             9   
0  2023-01-05 15:38:37.733301 0 days 00:04:57.609421             8   
7  2023-01-05 16:15:42.717010 0 days 00:04:46.566157             8   

    colsample_bytree  lambda_l1  lambda_l2  learning_rate  max_depth  \
75          0.776814   0.026325   0.001549       0.008000         11   
83          0.774550   0.027148   0.001663       0.007987         11   
89          0.779117   0.025841   0.001720       0.007970         11   
82          0.774761   0.027077   0.001654       0.007985         11   
84          0.774380   0.027100   0.001639       0.007966         11   
..               ...        ...        ...            ...        ...   
58          0.776759   0.029926   0.001304       0.007963          9   
38          0.792896   0.027423   0.001516       0.007960          9   
3           0.784250   0.027340   0.001056       0.007979          9   
0           0.802569   0.021986   0.001117       0.007681          9   
7           0.780345   0.023226   0.001220       0.007636          9   

    min_child_samples  n_estimators  num_leaves  subsample     state  
75                541           935         352   0.949916  COMPLETE  
83                541           935         351   0.949831  COMPLETE  
89                540           935         350   0.947598  COMPLETE  
82                542           935         351   0.949835  COMPLETE  
84                542           935         351   0.949612  COMPLETE  
..                ...           ...         ...        ...       ...  
58                544           932         356   0.944064  COMPLETE  
38                546           935         363   0.933440  COMPLETE  
3                 541           925         362   0.860727  COMPLETE  
0                 546           935         351   0.866372  COMPLETE  
7                 549           926         359   0.876418  COMPLETE  

[100 rows x 16 columns]

   Let's utilize plot_optimization_history, which shows the scores from all trials as well as the best score so far at each point. This search did not contain extreme outliers for the objective value so it can be useful for examining the study output.

In [ ]:
fig = optuna.visualization.plot_optimization_history(study)
fig.show()

   Next, we can plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search and plot_slice to compare the objective value and individal parameters.

In [ ]:
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()
In [ ]:
fig = optuna.visualization.plot_slice(study)
fig.show()

   From this plot, lower rmse occurred with lower colsample_bytree, higher lambda_l1, higher lambda_l2, higher learning_rate, higher max_depth, higher n_estimators, higher subsample.

   Let's now examine the distributions of the quantitative hyperparameters that were assessed during the search.

In [ ]:
for i, hpo in enumerate(trials_df.columns):
    if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
                   'duration', 'n_estimators', 'state']:
        plt.figure(figsize=(14,6))
        if hpo != 'loss':
            sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
            plt.legend(loc=0)
            plt.title('{} Distribution'.format(hpo))
            plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
            plt.tight_layout()
            plt.show()

   Lower values of colsample_bytree, min_child_samples and num_leaves while higher values of lambda_l1, lambda_l2, learning_rate, max_depth and subsample in the given hyperparameter space performed better to generate a lower loss.

   We can now examine if/how the quantitative parameters changed over the trials run to see if any trends exist.

In [ ]:
fig, axs = plt.subplots(2, 3, figsize=(20,5))
i = 0
axs = axs.flatten()
for i, hpo in enumerate(['learning_rate', 'num_leaves', 'colsample_bytree',
                         'max_depth', 'subsample', 'bagging_freq']):
    sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

   Over the trials, the hyperparameters colsample_bytree and num_leaves decreased while the hyperparameters learning_rate and subsample increased. The results from max_depth and bagging_freq over the trials are not clear.

   Next, we can examine if there were any trends regarding the regularization parameters over the trials.

In [ ]:
fig, axs = plt.subplots(1, 2, figsize=(20,5))
i = 0
for i, hpo in enumerate(['lambda_l1', 'lambda_l2']):
    sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

   Over the trials, lambda_l1 potentially increased where there was not a trend for lambda_l2. Increasing the number of trials might reveal more insight about this parameter.

   Let's now visualize the parameter importances by utilizing the plot_param_importances feature from the Optuna package.

In [ ]:
fig = optuna.visualization.plot_param_importances(study)
fig.show()

   The max_depth was the most important parameter for the loss utilized. CatBoost demonstrated the learning_rate was the most important parameter during the Optuna trials.

   We can use plot_edf to plot the objective value empirical distribution function (EDF) for the study.

In [ ]:
fig = optuna.visualization.plot_edf(study)
fig.show()

   This plot demonstates a narrow range of values for the loss from the models evalauted.

   Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then saving as a .pkl file. Then we can evaluate the model metrics for both the train and the test sets using the MAE, MSE, RMSE and the (R²).

In [ ]:
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
Out[ ]:
{'n_estimators': 935,
 'learning_rate': 0.007999749230495415,
 'num_leaves': 352,
 'bagging_freq': 9,
 'subsample': 0.9499157527286387,
 'colsample_bytree': 0.776813542010267,
 'max_depth': 11,
 'lambda_l1': 0.026324983395628405,
 'lambda_l2': 0.001549095388756512,
 'min_child_samples': 541,
 'random_state': 42,
 'metric': 'rmse'}
In [ ]:
train_label = trainDF[['price']]
train_features = trainDF.drop(columns=['price'])

test_label = testDF[['price']]
test_features = testDF.drop(columns=['price'])
test_features = pd.get_dummies(test_features, drop_first=True)

best_model = LGBMRegressor(boosting_type='gbdt',
                           device='gpu',
                           gpu_platform_id=0,
                           gpu_device_id=0,
                           verbosity=-1,
                           **params)

best_model.fit(train_features, train_label)

Pkl_Filename = 'lightGBM_Optuna_trials100_GPU_T4_wb_CV.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_model, file)

print('\nModel Metrics for lightGBM HPO 100 GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(train_label, y_train_pred),
        mean_absolute_error(test_label, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(train_label, y_train_pred),
        mean_squared_error(test_label, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(train_label, y_train_pred, squared=False),
        mean_squared_error(test_label, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(train_label, y_train_pred),
        r2_score(test_label, y_test_pred)))

Model Metrics for lightGBM HPO 100 GPU trials
MAE train: 1946.235, test: 1969.517
MSE train: 6820845.134, test: 6980757.209
RMSE train: 2611.675, test: 2642.112
R^2 train: 0.926, test: 0.924

   Compared to the baseline model, there are lower values for MAE, MSE and RMSE and a higher R² for both train/test sets after hyperparameter tuning.

   We can also evaluate the MSE on the test set and determine when this was achieved:

In [ ]:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
                                                                                                      y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
The best model from optimization scores 6980757.20900 MSE on the test set.
This was achieved using these conditions:
iteration                                    75
rmse                                  2667.5806
datetime_start       2023-01-05 22:45:42.252337
datetime_complete    2023-01-05 22:51:34.854589
duration                 0 days 00:05:52.602252
bagging_freq                                  9
colsample_bytree                       0.776814
lambda_l1                              0.026325
lambda_l2                              0.001549
learning_rate                             0.008
max_depth                                    11
min_child_samples                           541
n_estimators                                935
num_leaves                                  352
subsample                              0.949916
state                                  COMPLETE
Name: 75, dtype: object

Model Explanations

SHAP (SHapley Additive exPlanations)

   Let's use the training set and summarize the effects of all the features.

In [ ]:
shap.initjs()
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(train_features)
In [ ]:
shap.summary_plot(shap_values, train_features);

   Then we can use the test set and summarize the effects of all the features.

In [ ]:
shap.initjs()
shap_values = explainer.shap_values(test_features)
In [ ]:
shap.summary_plot(shap_values, test_features);

Model Metrics with ELI5

   Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.

In [ ]:
X_test1 = pd.DataFrame(test_features, columns=test_features.columns)

perm_importance = PermutationImportance(best_model,
                                        random_state=seed_value).fit(test_features,
                                                                     test_label)
html_obj = eli5.show_weights(perm_importance,
                             feature_names=X_test1.columns.tolist())
html_obj
Out[ ]:
Weight Feature
0.3907 ± 0.0027 horsepower
0.1322 ± 0.0021 mileage
0.0829 ± 0.0016 width
0.0795 ± 0.0020 year
0.0343 ± 0.0007 horsepower_rpm
0.0310 ± 0.0004 wheel_system_display_Front-Wheel Drive
0.0263 ± 0.0003 fuel_tank_volume
0.0215 ± 0.0004 engine_displacement
0.0183 ± 0.0003 back_legroom
0.0181 ± 0.0001 height
0.0172 ± 0.0006 highway_fuel_economy
0.0171 ± 0.0005 wheelbase
0.0135 ± 0.0003 maximum_seating
0.0131 ± 0.0003 savings_amount
0.0112 ± 0.0003 length
0.0094 ± 0.0004 wheel_system_display_Four-Wheel Drive
0.0087 ± 0.0003 front_legroom
0.0074 ± 0.0003 wheel_system_display_All-Wheel Drive
0.0064 ± 0.0002 city_fuel_economy
0.0037 ± 0.0001 daysonmarket
… 33 more …

   We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).

In [ ]:
explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
Out[ ]:
feature weight std
0 horsepower 0.390659 0.001331
1 mileage 0.132178 0.001073
2 width 0.082871 0.000801
3 year 0.079460 0.001014
4 horsepower_rpm 0.034325 0.000372
5 wheel_system_display_Front-Wheel Drive 0.031041 0.000179
6 fuel_tank_volume 0.026284 0.000150
7 engine_displacement 0.021532 0.000178
8 back_legroom 0.018262 0.000170
9 height 0.018117 0.000046
10 highway_fuel_economy 0.017241 0.000295
11 wheelbase 0.017071 0.000246
12 maximum_seating 0.013532 0.000162
13 savings_amount 0.013138 0.000156
14 length 0.011180 0.000131
15 wheel_system_display_Four-Wheel Drive 0.009423 0.000193
16 front_legroom 0.008720 0.000145
17 wheel_system_display_All-Wheel Drive 0.007443 0.000136
18 city_fuel_economy 0.006367 0.000096
19 daysonmarket 0.003723 0.000044

   The horsepower feature contains the highest weight (0.390659). Then the mileage has the next highest, but the second weight is once again way lower than horsepower as in the previous models (0.132178). This is followed by width (0.082871), year (0.079460) and horsepower_rpm (0.034325), which is a different order than the previous model weights using ELI5.

XGBoost

   A decision tree is a hierarchical model that utilizes conditional statements which represent choices and the potential downstream outcomes. The tree structure is comprised of a root node, branches, internal nodes, and leaf nodes. Gradient Boosting Decision Trees (GBDT) are comprised of multiple decision trees in an ensemble similar to what is used with Random Forest models. However, the methodologies for building and combining the trees differ. Gradient boosting is derived from boosting where a single weak model is combined with other weak models leveraging the objective of improving and generating a more effective model. However, the process of additively generating weak models is characterized with gradient descent for the objective function. Gradient boosting delineates targeted outcomes based on the error of the prediction which is then utilized in subsequent models to minimize the error. The weighted sum of all of the tree predictions is then used for the final prediction.

   Extreme Gradient Boosting (XGBoost) presented in XGBoost: A Scalable Tree Boosting System leverages trees built in parallel following a level-wise strategy that scans across the gradient values and uses these partial sums to evaluate the quality of splits at every possible split in the training set. XGBoost is built for model performance and computational speed with CPU and GPU support for the ML library.

   The Hyperopt notebook is located here. Let's first set up the environment by installing/importing the dependencies, setting the options and seed followed by examining the CUDA and GPU characteristics.

In [ ]:
!pip install xgboost==1.5.2
!pip install eli5
!pip install shap
import os
import random
import numpy as np
import warnings
warnings('ignore')

seed_value = 42
os.environ['usedCars_xgbGPU'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)

!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting xgboost==1.5.2
  Downloading xgboost-1.5.2-py3-none-manylinux2014_x86_64.whl (173.6 MB)
     |████████████████████████████████| 173.6 MB 9.6 kB/s 
Requirement already satisfied: numpy in /usr/local/lib/python3.8/dist-packages (from xgboost==1.5.2) (1.21.6)
Requirement already satisfied: scipy in /usr/local/lib/python3.8/dist-packages (from xgboost==1.5.2) (1.7.3)
Installing collected packages: xgboost
  Attempting uninstall: xgboost
    Found existing installation: xgboost 0.90
    Uninstalling xgboost-0.90:
      Successfully uninstalled xgboost-0.90
Successfully installed xgboost-1.5.2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting eli5
  Downloading eli5-0.13.0.tar.gz (216 kB)
     |████████████████████████████████| 216 kB 32.2 MB/s 
Requirement already satisfied: attrs>17.1.0 in /usr/local/lib/python3.8/dist-packages (from eli5) (22.1.0)
Collecting jinja2>=3.0.0
  Downloading Jinja2-3.1.2-py3-none-any.whl (133 kB)
     |████████████████████████████████| 133 kB 74.8 MB/s 
Requirement already satisfied: numpy>=1.9.0 in /usr/local/lib/python3.8/dist-packages (from eli5) (1.21.6)
Requirement already satisfied: scipy in /usr/local/lib/python3.8/dist-packages (from eli5) (1.7.3)
Requirement already satisfied: six in /usr/local/lib/python3.8/dist-packages (from eli5) (1.15.0)
Requirement already satisfied: scikit-learn>=0.20 in /usr/local/lib/python3.8/dist-packages (from eli5) (1.0.2)
Requirement already satisfied: graphviz in /usr/local/lib/python3.8/dist-packages (from eli5) (0.10.1)
Requirement already satisfied: tabulate>=0.7.7 in /usr/local/lib/python3.8/dist-packages (from eli5) (0.8.10)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.8/dist-packages (from jinja2>=3.0.0->eli5) (2.0.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.8/dist-packages (from scikit-learn>=0.20->eli5) (3.1.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.8/dist-packages (from scikit-learn>=0.20->eli5) (1.2.0)
Building wheels for collected packages: eli5
  Building wheel for eli5 (setup.py) ... done
  Created wheel for eli5: filename=eli5-0.13.0-py2.py3-none-any.whl size=107748 sha256=e0052c2555531faa8dbdfd63bd3c7755a1fa9dbb1cb15bf2a82aac6086c12561
  Stored in directory: /root/.cache/pip/wheels/85/ac/25/ffcd87ef8f9b1eec324fdf339359be71f22612459d8c75d89c
Successfully built eli5
Installing collected packages: jinja2, eli5
  Attempting uninstall: jinja2
    Found existing installation: Jinja2 2.11.3
    Uninstalling Jinja2-2.11.3:
      Successfully uninstalled Jinja2-2.11.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
notebook 5.7.16 requires jinja2<=3.0.0, but you have jinja2 3.1.2 which is incompatible.
flask 1.1.4 requires Jinja2<3.0,>=2.10.1, but you have jinja2 3.1.2 which is incompatible.
Successfully installed eli5-0.13.0 jinja2-3.1.2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting shap
  Downloading shap-0.41.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (575 kB)
     |████████████████████████████████| 575 kB 35.5 MB/s 
Requirement already satisfied: tqdm>4.25.0 in /usr/local/lib/python3.8/dist-packages (from shap) (4.64.1)
Requirement already satisfied: numpy in /usr/local/lib/python3.8/dist-packages (from shap) (1.21.6)
Requirement already satisfied: pandas in /usr/local/lib/python3.8/dist-packages (from shap) (1.3.5)
Requirement already satisfied: scipy in /usr/local/lib/python3.8/dist-packages (from shap) (1.7.3)
Collecting slicer==0.0.7
  Downloading slicer-0.0.7-py3-none-any.whl (14 kB)
Requirement already satisfied: cloudpickle in /usr/local/lib/python3.8/dist-packages (from shap) (1.5.0)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.8/dist-packages (from shap) (1.0.2)
Requirement already satisfied: numba in /usr/local/lib/python3.8/dist-packages (from shap) (0.56.4)
Requirement already satisfied: packaging>20.9 in /usr/local/lib/python3.8/dist-packages (from shap) (21.3)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.8/dist-packages (from packaging>20.9->shap) (3.0.9)
Requirement already satisfied: llvmlite<0.40,>=0.39.0dev0 in /usr/local/lib/python3.8/dist-packages (from numba->shap) (0.39.1)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.8/dist-packages (from numba->shap) (5.1.0)
Requirement already satisfied: setuptools in /usr/local/lib/python3.8/dist-packages (from numba->shap) (57.4.0)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.8/dist-packages (from importlib-metadata->numba->shap) (3.11.0)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.8/dist-packages (from pandas->shap) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.8/dist-packages (from pandas->shap) (2022.6)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.8/dist-packages (from python-dateutil>=2.7.3->pandas->shap) (1.15.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.8/dist-packages (from scikit-learn->shap) (3.1.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.8/dist-packages (from scikit-learn->shap) (1.2.0)
Installing collected packages: slicer, shap
Successfully installed shap-0.41.0 slicer-0.0.7
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
Tue Dec 27 01:38:18 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   55C    P0    27W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

   Then we can read the data and partition the train/test sets with price as the target. Next, we can create dummy variables for the categorical variables.

In [ ]:
import pandas as pd

trainDF = pd.read_csv('usedCars_trainSet.csv', low_memory=False)
testDF = pd.read_csv('usedCars_testSet.csv', low_memory=False)

X_train = trainDF.drop(columns = ['price'])
y_train = trainDF[['price']]

X_test = testDF.drop(columns = ['price'])
y_test = testDF[['price']]

X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)

Baseline Model

   Let's now fit the baseline model to the data, save as a .pkl file and evaluate the model metrics for the baseline model.

In [ ]:
from xgboost import XGBRegressor
import pickle
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

xgb = XGBRegressor(objective='reg:squarederror',
                   booster='gbtree',
                   tree_method='gpu_hist',
                   random_state=seed_value,
                   verbosity=0)

xgb.fit(X_train, y_train)

Pkl_Filename = 'XGB_Baseline.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(xgb, file)

print('\nModel Metrics for XGBoost Baseline')
y_train_pred = xgb.predict(X_train)
y_test_pred = xgb.predict(X_test)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(y_train, y_train_pred),
        mean_absolute_error(y_test, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train, y_train_pred),
        mean_squared_error(y_test, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train, y_train_pred, squared=False),
        mean_squared_error(y_test, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(y_train, y_train_pred),
        r2_score(y_test, y_test_pred)))

Model Metrics for XGBoost Baseline
MAE train: 1908.565, test: 1957.089
MSE train: 6461770.101, test: 6807736.138
RMSE train: 2542.001, test: 2609.164
R^2 train: 0.930, test: 0.925

Hyperopt: 300 Trials 10-Fold Cross Validation

   To set up Hyperopt, let's first define the number of trials with NUM_EVAL = 300. We need to utilize the same k-folds for reproducibility, so let's use 10-fold as the number of folds which to split the training set into train and validation with shuffle=True and the initial defined seed value. Then hyperparameters can then be defined in a dictionary. To define integers, hp.choice with np.arange and dtype=int is used while float types are defined using hp.uniform. The space consists of 6 parameters with 4 integers and 6 float type:

  • n_estimators: Number of gradient boosted trees.
  • max_depth: Maximum tree depth for base learners.
  • subsample: Subsample ratio of the training instance.
  • learning_rate: Used for reducing the gradient step.
  • gamma: Minimum loss reduction required to make a further partition on a leaf node of the tree.
  • reg_alpha: L1 regularization term on weights (xgb’s alpha).
  • reg_lambda: L2 regularization term on weights (xgb’s lambda).
  • colsample_bytree: Subsample ratio of columns when constructing each tree.
  • colsample_bylevel: Subsample ratio of columns for each level.
  • min_child_weight: Minimum sum of instance weight(hessian) needed in a child.
In [ ]:
from sklearn.model_selection import KFold
from hyperopt import hp

NUM_EVAL = 300

kfolds = KFold(n_splits=10, shuffle=True, random_state=seed_value)

xgb_tune_kwargs= {
    'n_estimators': hp.choice('n_estimators', np.arange(100, 500, dtype=int)),
    'max_depth': hp.choice('max_depth', np.arange(3, 10, dtype=int)),
    'subsample': hp.uniform('subsample', 0.25, 0.75),
    'gamma': hp.uniform('gamma', 0, 9),
    'learning_rate': hp.uniform('learning_rate', 1e-4, 0.3),
    'reg_alpha': hp.choice('reg_alpha', np.arange(0, 30, dtype=int)),
    'reg_lambda': hp.uniform('reg_lambda', 0, 3),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1),
    'colsample_bylevel': hp.uniform('colsample_bylevel', 0.05, 0.5),
    'min_child_weight': hp.choice('min_child_weight', np.arange(0, 10,
                                                                dtype=int))
    }

   Now, we can define a function for the optimization of these hyperparameters. Within this function, we can use joblib to save a pkl file that can be reloaded if more training is needed. We can define the trial number with ITERATION that will increase by 1 each trial to keep track of the parameters for each trial. The parameters which are integers need to configured to remain as integers and not as float. This is then allocated for n_estimators and max_depth, which starts at max_depth=3. Then the model type, XGBRegressor, needs to be defined with the parameters that will be included in all of the trials during the search, which are:

  • objective='reg:squarederror': Specify the learning task and the corresponding learning objective or a custom objective function to be used
  • booster='gbtree': Specify which booster to use: gbtree, gblinear or dart.
  • tree_method='gpu_hist': Specify which tree method to use. Default=auto.
  • scale_pos_weight=1: Balancing of positive and negative weights.
  • use_label_encoder=False: Encodes labels
  • random_state=seed_value: Random number seed.
  • verbosity=0: The degree of verbosity. Valid values are 0 (silent) - 3 (debug)

   Then we can fit the model with the training set utilizing the 'neg_root_mean_squared_error' for the scoring metric and 10-fold cross validation to generate the negative cross_val_score where the np.mean of the scores is saved as the rmse for each trial. The trial loss=rmse, params = parameters tested in the trial from the xgb_tune_kwargs space , the trial number (iteration), the time to complete the trial (train_time) and if the trial completed successfully or not (status) are written to the defined .csv file and appended by row for each trial.

In [ ]:
import joblib
from timeit import default_timer as timer
from sklearn.model_selection import cross_val_score
import csv
from hyperopt import STATUS_OK

kfolds = KFold(n_splits=10, shuffle=True, random_state=seed_value)

def xgb_hpo(config):
    """
    Objective function to tune a XGBoostRegressor model
    """
    joblib.dump(bayesOpt_trials, 'xgb_Hyperopt_300_GPU.pkl')

    global ITERATION
    ITERATION += 1

    config['n_estimators'] = int(config['n_estimators'])

    config['max_depth'] = int(config['max_depth']) + 3

    xgb = XGBRegressor(objective='reg:squarederror',
                       booster='gbtree',
                       tree_method='gpu_hist',
                       scale_pos_weight=1,
                       use_label_encoder=False,
                       early_stopping_rounds=10,
                       random_state=seed_value,
                       verbosity=0,
                       **config)

    start = timer()
    scores = -cross_val_score(xgb, X_train, y_train,
                              scoring='neg_root_mean_squared_error',
                              cv=kfolds)
    run_time = timer() - start

    rmse = np.mean(scores)

    of_connection = open(out_file, 'a')
    writer = csv.writer(of_connection)
    writer.writerow([rmse, config, ITERATION, run_time])

    return {'loss': rmse, 'params': config, 'iteration': ITERATION,
            'train_time': run_time, 'status': STATUS_OK}

   Let's now define an out_file to save the results where the headers will be written to the file. Then we can set the global variable for the ITERATION and defined the Hyperopt Trials as bayesOpt_trials.

   We can utilize if/else condtional statements to load a .pkl file if it exists, and then utilizing fmin, we can specify the training function xgb_hpo, the parameter space xgb_tune_kwargs, the algorithm for optimization tpe.suggest, the number of trials to evaluate NUM_EVAL, the name of the trial set bayesOpt_trials and the random state np.random.RandomState(42). We can now begin the HPO trials.

In [ ]:
from hyperopt import tpe, Trials

tpe_algorithm = tpe.suggest

out_file = '/content/drive/MyDrive/UsedCarsCarGurus/Models/ML/XGBoost/Hyperopt/trialOptions/XGB_trials_300_GPU.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)

writer.writerow(['loss', 'params', 'iteration', 'train_time'])
of_connection.close()

global ITERATION
ITERATION = 0
bayesOpt_trials = Trials()
In [ ]:
from datetime import datetime, timedelta
from hyperopt import fmin

start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('xgb_Hyperopt_300_GPU.pkl'):
    bayesOpt_trials = joblib.load('xgb_Hyperopt_300_GPU.pkl')
    best_param = fmin(xgb_hpo, xgb_tune_kwargs, algo=tpe.suggest,
                      max_evals=NUM_EVAL, trials=bayesOpt_trials,
                      rstate=np.random.RandomState(42))
else:
    best_param = fmin(xgb_hpo, xgb_tune_kwargs, algo=tpe.suggest,
                      max_evals=NUM_EVAL, trials=bayesOpt_trials,
                      rstate=np.random.RandomState(42))

end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
Start Time           2022-02-13 01:26:44.089604
100%|██████████| 300/300 [13:50:32<00:00, 166.11s/it, best loss: 2360.9404778388325]
Start Time           2022-02-13 01:26:44.089604
End Time             2022-02-13 15:17:16.532228
13:50:32

   Let's now sort the trials with lowest loss first and examine the lowest two losses:

In [ ]:
bayesOpt_trials_results = sorted(bayesOpt_trials.results,
                                 key=lambda x: x['loss'])
print('Top two trials with the lowest loss (lowest RMSE)')
print(bayesOpt_trials_results[:2])
Top two trials with the lowest loss (lowest RMSE)
[{'loss': 2360.9404778388325, 'params': {'colsample_bylevel': 0.46810548016285225, 'colsample_bytree': 0.5308143433747501, 'gamma': 1.6729089829733979, 'learning_rate': 0.08523827415243382, 'max_depth': 12, 'min_child_weight': 0, 'n_estimators': 443, 'reg_alpha': 18, 'reg_lambda': 2.530445524613025, 'subsample': 0.7315224452498899}, 'iteration': 263, 'train_time': 427.44341490900115, 'status': 'ok'}, {'loss': 2361.7093181948267, 'params': {'colsample_bylevel': 0.49442563554429647, 'colsample_bytree': 0.5222615437047392, 'gamma': 1.5084072489168792, 'learning_rate': 0.06475205416794028, 'max_depth': 12, 'min_child_weight': 0, 'n_estimators': 414, 'reg_alpha': 18, 'reg_lambda': 2.2944311863133606, 'subsample': 0.7053963773567413}, 'iteration': 261, 'train_time': 383.9923600809998, 'status': 'ok'}]

   Let's now access the results in the trialOptions directory by reading into a pandas.Dataframe sorted with the best scores on top and then resetting the index for slicing. Then we can convert the trial parameters from a string to a dictionary for later use. Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then saving as a .pkl file. Then we can evaluate the model metrics for both the train and the test sets using the MAE, MSE, RMSE and the (R²).

In [ ]:
import ast

results = pd.read_csv('XGB_300_GPU.csv')
results.sort_values('loss', ascending=True, inplace=True)
results.reset_index(inplace=True, drop=True)

ast.literal_eval(results.loc[0, 'params'])
best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()

best_bayes_model = XGBRegressor(objective='reg:squarederror',
                                booster='gbtree',
                                tree_method='gpu_hist',
                                scale_pos_weight=1,
                                use_label_encoder=False,
                                random_state=seed_value,
                                verbosity=0,
                                **best_bayes_params)

best_bayes_model.fit(X_train, y_train)

Pkl_Filename = 'XGB_HPO_trials300_GPU.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_bayes_model, file)

print('\nModel Metrics for XGBoost HPO 300 GPU trials')
y_train_pred = best_bayes_model.predict(X_train)
y_test_pred = best_bayes_model.predict(X_test)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(y_train, y_train_pred),
        mean_absolute_error(y_test, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train, y_train_pred),
        mean_squared_error(y_test, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train, y_train_pred, squared=False),
        mean_squared_error(y_test, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(y_train, y_train_pred),
        r2_score(y_test, y_test_pred)))

Model Metrics for XGBoost HPO 300 GPU trials
MAE train: 1055.897, test: 1698.187
MSE train: 2156346.027, test: 5453324.338
RMSE train: 1468.450, test: 2335.235
R^2 train: 0.976, test: 0.940

   Compared to the baseline model, there are lower values for MAE, MSE and RMSE and a higher R² for both train/test sets after hyperparameter tuning.

   We can also evaluate the MSE on the test set and determine when this was achieved:

In [ ]:
print('The best model from Bayes optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(y_test,
                                                                                                            y_test_pred)))
print('This was achieved after {} search iterations'.format(results.loc[0, 'iteration']))
The best model from Bayes optimization scores 5453324.33829 MSE on the test set.
This was achieved after 263 search iterations

   Let's now create a new pandas.Dataframe where the parameters examined during the search are stored and the results of each parameter are stored as a different column. Then we can convert the data types for graphing. Let's now examine the distributions of the quantitative hyperparameters that were assessed during the search.

In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns

bayes_params = pd.DataFrame(columns=list(ast.literal_eval(results.loc[0,
                                                                      'params']).keys()),
                            index=list(range(len(results))))

for i, params in enumerate(results['params']):
    bayes_params.loc[i,:] = list(ast.literal_eval(params).values())

bayes_params['loss'] = results['loss']
bayes_params['iteration'] = results['iteration']
bayes_params['colsample_bylevel'] = bayes_params['colsample_bylevel'].astype('float64')
bayes_params['colsample_bytree'] = bayes_params['colsample_bytree'].astype('float64')
bayes_params['gamma'] = bayes_params['gamma'].astype('float64')
bayes_params['learning_rate'] = bayes_params['learning_rate'].astype('float64')
bayes_params['reg_alpha'] = bayes_params['reg_alpha'].astype('float64')
bayes_params['reg_lambda'] = bayes_params['reg_lambda'].astype('float64')
bayes_params['subsample'] = bayes_params['subsample'].astype('float64')

for i, hpo in enumerate(bayes_params.columns):
    if hpo not in ['iteration', 'subsample', 'force_col_wise',
                   'max_depth', 'min_child_weight', 'n_estimators']:
        plt.figure(figsize=(14,6))
        if hpo != 'loss':
            sns.kdeplot(bayes_params[hpo], label='Bayes Optimization')
            plt.legend(loc=0)
            plt.title('{} Distribution'.format(hpo))
            plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
            plt.tight_layout()
            plt.show()
<Figure size 1008x432 with 0 Axes>

   The colsample_bylevel and reg_lambda parameters with higher values demonstrate lower loss for the objective value. Lower values for the colsample_bytree, gamma and learning_rate parameters resulted in lower loss. The reg_alpha parameter demonstrated bimodal distribution.

   Let's now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.

In [ ]:
fig, axs = plt.subplots(1, 4, figsize=(20,5))
i = 0
for i, hpo in enumerate(['learning_rate', 'gamma', 'colsample_bylevel',
                         'colsample_bytree']):
    sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

   The learning_rate, gamma, colsample_bytree decreased over the trials while colsample_bylevel increased. reg_alpha and reg_lambda did not demonstrate any trends over the trial iterations.

Model Explanations

   Now, we can plot the feature importance from the best model.

In [ ]:
plot_importance(best_bayes_model, max_num_features=15)

   Using the approach for feature importance, the most important features are daysonmarket, mileage, torque, torque_rpm, savings_amount, back_legroom, city_fuel_economy, height, length, highway_fuel_economy, front_legroom, fuel_tank_volume, wheelbase, width and engine_displacement.

SHAP (SHapley Additive exPlanations)

   Let's use the training set and summarize the effects of all the features.

In [ ]:
import shap

shap.initjs()
explainer = shap.TreeExplainer(best_bayes_model)
shap_values = explainer.shap_values(X_train)


shap.summary_plot(shap_values, X_train);

   Then we can use the test set and summarize the effects of all the features.

In [ ]:
shap.initjs()
shap_values = explainer.shap_values(X_test)

shap.summary_plot(shap_values, X_test);

Model Metrics with ELI5

   Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.

In [ ]:
import eli5
from eli5.sklearn import PermutationImportance

X_test1 = pd.DataFrame(X_test, columns=X_test.columns)

perm_importance = PermutationImportance(best_bayes_model,
                                        random_state=seed_value).fit(X_test,
                                                                     y_test)
html_obj = eli5.show_weights(perm_importance,
                             feature_names=X_test1.columns.tolist())
html_obj
Out[ ]:
Weight Feature
0.1222 ± 0.0015 horsepower
0.0760 ± 0.0012 mileage
0.0744 ± 0.0014 width
0.0700 ± 0.0016 year
0.0523 ± 0.0005 fuel_tank_volume
0.0506 ± 0.0005 height
0.0285 ± 0.0006 engine_displacement
0.0282 ± 0.0005 horsepower_rpm
0.0267 ± 0.0004 wheel_system_display_Front-Wheel Drive
0.0262 ± 0.0002 city_fuel_economy
0.0246 ± 0.0006 maximum_seating
0.0240 ± 0.0004 savings_amount
0.0237 ± 0.0006 highway_fuel_economy
0.0211 ± 0.0003 back_legroom
0.0209 ± 0.0002 length
0.0183 ± 0.0001 wheelbase
0.0089 ± 0.0003 wheel_system_display_Four-Wheel Drive
0.0080 ± 0.0000 front_legroom
0.0072 ± 0.0003 is_new
0.0066 ± 0.0002 daysonmarket
… 33 more …

   The horsepower feature has the highest weight (0.1222) followed by mileage, width and year.

Optuna: 300 Trials 10-Fold Cross Validation

   The Optuna notebooks can be found here. Let's first set up the environment by installing/importing the dependencies, setting the options and seed followed by examining the CUDA and GPU characteristics.

In [ ]:
!pip install --upgrade -q wandb
!pip install xgboost==1.5.2
!pip install optuna
!pip install eli5
!pip install shap
import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')

seed_value = 42
os.environ['usedCars_xgbGPU'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)

print('\n')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
     |████████████████████████████████| 1.8 MB 4.2 MB/s 
     |████████████████████████████████| 181 kB 69.9 MB/s 
     |████████████████████████████████| 157 kB 81.8 MB/s 
     |████████████████████████████████| 63 kB 2.0 MB/s 
     |████████████████████████████████| 157 kB 91.3 MB/s 
     |████████████████████████████████| 157 kB 76.4 MB/s 
     |████████████████████████████████| 157 kB 76.9 MB/s 
     |████████████████████████████████| 157 kB 69.7 MB/s 
     |████████████████████████████████| 157 kB 79.5 MB/s 
     |████████████████████████████████| 157 kB 76.0 MB/s 
     |████████████████████████████████| 156 kB 70.9 MB/s 
  Building wheel for pathtools (setup.py) ... done
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting xgboost==1.5.2
  Downloading xgboost-1.5.2-py3-none-manylinux2014_x86_64.whl (173.6 MB)
     |████████████████████████████████| 173.6 MB 6.6 kB/s 
Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (from xgboost==1.5.2) (1.7.3)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from xgboost==1.5.2) (1.21.6)
Installing collected packages: xgboost
  Attempting uninstall: xgboost
    Found existing installation: xgboost 0.90
    Uninstalling xgboost-0.90:
      Successfully uninstalled xgboost-0.90
Successfully installed xgboost-1.5.2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting optuna
  Downloading optuna-3.0.0-py3-none-any.whl (348 kB)
     |████████████████████████████████| 348 kB 4.3 MB/s 
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.7/dist-packages (from optuna) (21.3)
Requirement already satisfied: sqlalchemy>=1.1.0 in /usr/local/lib/python3.7/dist-packages (from optuna) (1.4.40)
Collecting alembic
  Downloading alembic-1.8.1-py3-none-any.whl (209 kB)
     |████████████████████████████████| 209 kB 86.9 MB/s 
Collecting cmaes>=0.8.2
  Downloading cmaes-0.8.2-py3-none-any.whl (15 kB)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from optuna) (1.21.6)
Collecting cliff
  Downloading cliff-3.10.1-py3-none-any.whl (81 kB)
     |████████████████████████████████| 81 kB 10.0 MB/s 
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from optuna) (4.64.0)
Collecting colorlog
  Downloading colorlog-6.7.0-py2.py3-none-any.whl (11 kB)
Requirement already satisfied: PyYAML in /usr/local/lib/python3.7/dist-packages (from optuna) (6.0)
Requirement already satisfied: typing-extensions>=3.10.0.0 in /usr/local/lib/python3.7/dist-packages (from optuna) (4.1.1)
Requirement already satisfied: scipy<1.9.0,>=1.7.0 in /usr/local/lib/python3.7/dist-packages (from optuna) (1.7.3)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging>=20.0->optuna) (3.0.9)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/dist-packages (from sqlalchemy>=1.1.0->optuna) (4.12.0)
Requirement already satisfied: greenlet!=0.4.17 in /usr/local/lib/python3.7/dist-packages (from sqlalchemy>=1.1.0->optuna) (1.1.3)
Collecting Mako
  Downloading Mako-1.2.2-py3-none-any.whl (78 kB)
     |████████████████████████████████| 78 kB 9.3 MB/s 
Requirement already satisfied: importlib-resources in /usr/local/lib/python3.7/dist-packages (from alembic->optuna) (5.9.0)
Collecting stevedore>=2.0.1
  Downloading stevedore-3.5.0-py3-none-any.whl (49 kB)
     |████████████████████████████████| 49 kB 7.5 MB/s 
Collecting cmd2>=1.0.0
  Downloading cmd2-2.4.2-py3-none-any.whl (147 kB)
     |████████████████████████████████| 147 kB 73.0 MB/s 
Requirement already satisfied: PrettyTable>=0.7.2 in /usr/local/lib/python3.7/dist-packages (from cliff->optuna) (3.4.0)
Collecting autopage>=0.4.0
  Downloading autopage-0.5.1-py3-none-any.whl (29 kB)
Collecting pbr!=2.1.0,>=2.0.0
  Downloading pbr-5.10.0-py2.py3-none-any.whl (112 kB)
     |████████████████████████████████| 112 kB 75.7 MB/s 
Requirement already satisfied: attrs>=16.3.0 in /usr/local/lib/python3.7/dist-packages (from cmd2>=1.0.0->cliff->optuna) (22.1.0)
Requirement already satisfied: wcwidth>=0.1.7 in /usr/local/lib/python3.7/dist-packages (from cmd2>=1.0.0->cliff->optuna) (0.2.5)
Collecting pyperclip>=1.6
  Downloading pyperclip-1.8.2.tar.gz (20 kB)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->sqlalchemy>=1.1.0->optuna) (3.8.1)
Requirement already satisfied: MarkupSafe>=0.9.2 in /usr/local/lib/python3.7/dist-packages (from Mako->alembic->optuna) (2.0.1)
Building wheels for collected packages: pyperclip
  Building wheel for pyperclip (setup.py) ... done
  Created wheel for pyperclip: filename=pyperclip-1.8.2-py3-none-any.whl size=11137 sha256=f515a8c729edff96ade807cfcf2be48c3cf4de1905df74880879f14a5a9625e4
  Stored in directory: /root/.cache/pip/wheels/9f/18/84/8f69f8b08169c7bae2dde6bd7daf0c19fca8c8e500ee620a28
Successfully built pyperclip
Installing collected packages: pyperclip, pbr, stevedore, Mako, cmd2, autopage, colorlog, cmaes, cliff, alembic, optuna
Successfully installed Mako-1.2.2 alembic-1.8.1 autopage-0.5.1 cliff-3.10.1 cmaes-0.8.2 cmd2-2.4.2 colorlog-6.7.0 optuna-3.0.0 pbr-5.10.0 pyperclip-1.8.2 stevedore-3.5.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting eli5
  Downloading eli5-0.13.0.tar.gz (216 kB)
     |████████████████████████████████| 216 kB 4.0 MB/s 
Requirement already satisfied: attrs>17.1.0 in /usr/local/lib/python3.7/dist-packages (from eli5) (22.1.0)
Collecting jinja2>=3.0.0
  Downloading Jinja2-3.1.2-py3-none-any.whl (133 kB)
     |████████████████████████████████| 133 kB 67.9 MB/s 
Requirement already satisfied: numpy>=1.9.0 in /usr/local/lib/python3.7/dist-packages (from eli5) (1.21.6)
Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (from eli5) (1.7.3)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from eli5) (1.15.0)
Requirement already satisfied: scikit-learn>=0.20 in /usr/local/lib/python3.7/dist-packages (from eli5) (1.0.2)
Requirement already satisfied: graphviz in /usr/local/lib/python3.7/dist-packages (from eli5) (0.10.1)
Requirement already satisfied: tabulate>=0.7.7 in /usr/local/lib/python3.7/dist-packages (from eli5) (0.8.10)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.7/dist-packages (from jinja2>=3.0.0->eli5) (2.0.1)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>=0.20->eli5) (1.1.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>=0.20->eli5) (3.1.0)
Building wheels for collected packages: eli5
  Building wheel for eli5 (setup.py) ... done
  Created wheel for eli5: filename=eli5-0.13.0-py2.py3-none-any.whl size=107748 sha256=fe87208a420bf4e0dc8d4e8322e824981f1e4ab3ccd8372a79d588f166fbeb87
  Stored in directory: /root/.cache/pip/wheels/cc/3c/96/3ead31a8e6c20fc0f1a707fde2e05d49a80b1b4b30096573be
Successfully built eli5
Installing collected packages: jinja2, eli5
  Attempting uninstall: jinja2
    Found existing installation: Jinja2 2.11.3
    Uninstalling Jinja2-2.11.3:
      Successfully uninstalled Jinja2-2.11.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
flask 1.1.4 requires Jinja2<3.0,>=2.10.1, but you have jinja2 3.1.2 which is incompatible.
Successfully installed eli5-0.13.0 jinja2-3.1.2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting shap
  Downloading shap-0.41.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (569 kB)
     |████████████████████████████████| 569 kB 4.2 MB/s 
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from shap) (1.3.5)
Requirement already satisfied: numba in /usr/local/lib/python3.7/dist-packages (from shap) (0.56.0)
Collecting slicer==0.0.7
  Downloading slicer-0.0.7-py3-none-any.whl (14 kB)
Requirement already satisfied: tqdm>4.25.0 in /usr/local/lib/python3.7/dist-packages (from shap) (4.64.0)
Requirement already satisfied: cloudpickle in /usr/local/lib/python3.7/dist-packages (from shap) (1.5.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from shap) (1.21.6)
Requirement already satisfied: packaging>20.9 in /usr/local/lib/python3.7/dist-packages (from shap) (21.3)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (from shap) (1.0.2)
Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (from shap) (1.7.3)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging>20.9->shap) (3.0.9)
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from numba->shap) (57.4.0)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/dist-packages (from numba->shap) (4.12.0)
Requirement already satisfied: llvmlite<0.40,>=0.39.0dev0 in /usr/local/lib/python3.7/dist-packages (from numba->shap) (0.39.0)
Requirement already satisfied: typing-extensions>=3.6.4 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->numba->shap) (4.1.1)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->numba->shap) (3.8.1)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas->shap) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas->shap) (2022.2.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas->shap) (1.15.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn->shap) (1.1.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn->shap) (3.1.0)
Installing collected packages: slicer, shap
Successfully installed shap-0.41.0 slicer-0.0.7


nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
Fri Sep  2 21:38:08 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

   Then the data can be read into a pandas.Dataframe, the label and features defined followed by creating the dummy variables for the categorical features.

In [ ]:
import pandas as pd

trainDF = pd.read_csv('usedCars_trainSet.csv', low_memory=False)
testDF = pd.read_csv('usedCars_testSet.csv', low_memory=False)

train_label = trainDF[['price']]
test_label = testDF[['price']]

train_features = trainDF.drop(columns = ['price'])
test_features = testDF.drop(columns = ['price'])

train_features = pd.get_dummies(train_features, drop_first=True)
test_features = pd.get_dummies(test_features, drop_first=True)

   Let's now set up Weights & Biases specifying the project, entity, group and notes.

In [ ]:
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)
import wandb
from optuna.integration.wandb import WeightsAndBiasesCallback

wandb.login()

wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
                'group': 'optuna_xgb300gpu_cv_kfold10',
                'save_code': 'False', 'notes': 'xgb300gpu_cv_kfold10'}

wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
··········
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
WeightsAndBiasesCallback is experimental (supported from v2.9.0). The interface can change in the future.

   Using the same k-folds for reproducibility and setting up the W & B callbacks, let's now define a function for the optimization of hyperparameters using an Optuna.study with a pickle file, the parameters to test with different combinations during the search, and the the model type with the parameters that will be used for each trial. Then the timer can be specified to begin when called to perform the timed kFold cross validation trials with the goal of finding the lowest averaged score. The W & B parameters with loss for this study can be examined here.

In [ ]:
from sklearn.model_selection import cross_val_score, KFold
import joblib
from xgboost import XGBRegressor, plot_importance
from datetime import datetime, timedelta
from timeit import default_timer as timer
import pickle
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

kfolds = KFold(n_splits=10, shuffle=True, random_state=seed_value)

@wandbc.track_in_wandb()

def objective(trial):
    """
    Objective function to tune a XGBoostRegressor model.
    """
    joblib.dump(study, 'XGB_Optuna_300_GPU_CV_wb.pkl')

    params_xgb_optuna = {
        'n_estimators': trial.suggest_int('n_estimators', 400, 1000),
        'max_depth': trial.suggest_int('max_depth', 9, 15),
        'subsample': trial.suggest_float('subsample', 0.6, 0.8),
        'gamma': trial.suggest_float('gamma', 1e-8, 7e-5),
        'learning_rate': trial.suggest_float('learning_rate', 0.04, 0.13),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-4, 0.02),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 1e-5),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7),
        'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.35, 0.6),
        'min_child_weight': trial.suggest_int('min_child_weight', 7, 12)
        }

    model = XGBRegressor(objective='reg:squarederror',
                         booster='gbtree',
                         tree_method='gpu_hist',
                         scale_pos_weight=1,
                         use_label_encoder=False,
                         random_state=seed_value,
                         verbosity=0,
                         **params_xgb_optuna)

    start = timer()
    scores = -cross_val_score(model, train_features, train_label,
                              scoring='neg_root_mean_squared_error',
                              cv=kfolds)
    run_time = timer() - start

    rmse = np.mean(scores)

    return rmse
track_in_wandb is experimental (supported from v3.0.0). The interface can change in the future.

   Since the 300 trials did not complete, we can load the saved pickle file for the study using joblib to continue training to complete the 300 trials.

In [ ]:
study = joblib.load('XGB_Optuna_300_GPU_CV_wb.pkl')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
/content/drive/MyDrive/UsedCarsCarGurus/Models/ML/Xgboost/Optuna/Model_PKL
Number of finished trials: 282
Best trial: {'n_estimators': 549, 'max_depth': 14, 'subsample': 0.7997070496461064, 'gamma': 2.953865805049196e-05, 'learning_rate': 0.04001808814037916, 'reg_alpha': 0.018852758055925938, 'reg_lambda': 1.8216639376033342e-06, 'colsample_bytree': 0.56819205236003, 'colsample_bylevel': 0.5683397007952175, 'min_child_weight': 7}
Lowest RMSE 2333.9101198688268
In [ ]:
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_300_GPU_CV_wb.pkl'):
    study = joblib.load('XGB_Optuna_300_GPU_CV_wb.pkl')
else:
    study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=18, callbacks=[wandbc])

end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
Synced neat-sunset-800: https://wandb.ai/aschultz/usedCars_hpo/runs/2hq42hkl
Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
Find logs at: ./wandb/run-20220910_212824-2hq42hkl/logs
Start Time           2022-09-10 17:02:11.950382
End Time             2022-09-10 21:36:15.811311
4:34:03


Number of finished trials: 300
Best trial: {'n_estimators': 549, 'max_depth': 14, 'subsample': 0.7997070496461064, 'gamma': 2.953865805049196e-05, 'learning_rate': 0.04001808814037916, 'reg_alpha': 0.018852758055925938, 'reg_lambda': 1.8216639376033342e-06, 'colsample_bytree': 0.56819205236003, 'colsample_bylevel': 0.5683397007952175, 'min_child_weight': 7}
Lowest RMSE 2333.9101198688268

   Let's now extract the trial number, rmse and hyperparameter value into a pandas.Dataframe and sort with the lowest error first.

In [ ]:
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'rmse'}, inplace=True)
trials_df.rename(columns={'params_colsample_bylevel': 'colsample_bylevel'},
                 inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
                 inplace=True)
trials_df.rename(columns={'params_gamma': 'gamma'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
                 inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_weight': 'min_child_weight'},
                 inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_reg_alpha': 'reg_alpha'}, inplace=True)
trials_df.rename(columns={'params_reg_lambda': 'reg_lambda'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
     iteration         rmse             datetime_start  \
204        204  2333.910120 2022-09-09 14:10:41.129639   
218        218  2333.955932 2022-09-09 23:19:27.518212   
209        209  2334.279615 2022-09-09 20:54:49.480562   
227        227  2334.285232 2022-09-10 01:55:39.223212   
207        207  2334.308772 2022-09-09 14:48:13.542118   
..         ...          ...                        ...   
181        181          NaN 2022-09-07 09:27:23.674764   
192        192          NaN 2022-09-07 19:04:45.356653   
201        201          NaN 2022-09-07 22:50:35.876731   
208        208          NaN 2022-09-09 15:04:58.917727   
281        281          NaN 2022-09-10 16:49:03.943189   

             datetime_complete               duration  colsample_bylevel  \
204 2022-09-09 14:27:27.802886 0 days 00:16:46.673247           0.568340   
218 2022-09-09 23:36:01.846257 0 days 00:16:34.328045           0.552188   
209 2022-09-09 21:10:51.774531 0 days 00:16:02.293969           0.545055   
227 2022-09-10 02:12:50.648771 0 days 00:17:11.425559           0.562369   
207 2022-09-09 15:04:53.179767 0 days 00:16:39.637649           0.559504   
..                         ...                    ...                ...   
181                        NaT                    NaT                NaN   
192                        NaT                    NaT                NaN   
201                        NaT                    NaT                NaN   
208                        NaT                    NaT                NaN   
281                        NaT                    NaT                NaN   

     colsample_bytree     gamma  learning_rate  max_depth  min_child_weight  \
204          0.568192  0.000030       0.040018       14.0               7.0   
218          0.567282  0.000027       0.040005       14.0               7.0   
209          0.566973  0.000028       0.040141       14.0               7.0   
227          0.571154  0.000026       0.040029       14.0               7.0   
207          0.566542  0.000028       0.040150       14.0               7.0   
..                ...       ...            ...        ...               ...   
181               NaN       NaN            NaN        NaN               NaN   
192               NaN       NaN            NaN        NaN               NaN   
201               NaN       NaN            NaN        NaN               NaN   
208               NaN       NaN            NaN        NaN               NaN   
281               NaN       NaN            NaN        NaN               NaN   

     n_estimators  reg_alpha    reg_lambda  subsample     state  
204         549.0   0.018853  1.821664e-06   0.799707  COMPLETE  
218         549.0   0.019058  1.604312e-06   0.799626  COMPLETE  
209         526.0   0.019976  9.446005e-07   0.799289  COMPLETE  
227         571.0   0.018502  8.370246e-07   0.791953  COMPLETE  
207         552.0   0.019947  1.110613e-06   0.799204  COMPLETE  
..            ...        ...           ...        ...       ...  
181           NaN        NaN           NaN        NaN   RUNNING  
192           NaN        NaN           NaN        NaN   RUNNING  
201           NaN        NaN           NaN        NaN   RUNNING  
208           NaN        NaN           NaN        NaN   RUNNING  
281           NaN        NaN           NaN        NaN   RUNNING  

[300 rows x 16 columns]

   Let's utilize plot_optimization_history, which shows the scores from all trials as well as the best score so far at each point. This search did not contain extreme outliers for the objective value so it can be useful for examining the study output.

In [ ]:
fig = optuna.visualization.plot_optimization_history(study)
fig.show()

   Next, we can utilize plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search and plot_contour to plot the parameter relationship with a contour plot.

In [ ]:
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()
In [ ]:
fig = optuna.visualization.plot_contour(study, params=['min_child_weight',
                                                       'max_depth',
                                                       'learning_rate',
                                                       'gamma'])
fig.show()

   Next, we can plot_slice to compare the objective value and individal parameters.

In [ ]:
fig = optuna.visualization.plot_slice(study)
fig.show()

   The colsample_bylevel, max_depth, reg_alpha, and subsample parameters with higher values demonstrate lower loss for the objective value. Lower values for the learning_rate, n_estimators, reg_lambda, and the colsample_bytree parameters resulted in lower loss. It is difficult to to determine if lower/higher values for the gamma and min_child_weight parameters resulted in lower loss for this study.

   Let's now visualize the parameter importances by utilizing the plot_param_importances feature from the Optuna package.

In [ ]:
fig = optuna.visualization.plot_param_importances(study)
fig.show()

   The learning_rate parameter is clearly the most important with a value of 0.88 when the next important parameter, max_depth, contains a value of 0.03 . CatBoost demonstrated the learning_rate was the most important parameter during the Optuna trials while LightGBM demonstated max_depth. XGBoost is suggesting the learning_rate is the most important parameter in conjunction with CatBoost.

   Let's now use plot_edf to visualize the empirical distribution function.

In [ ]:
fig = optuna.visualization.plot_edf(study)
fig.show()

   Let's now examine the distributions of the quantitative hyperparameters that were assessed during the search.

In [ ]:
for i, hpo in enumerate(trials_df.columns):
    if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
                   'duration', 'n_estimators', 'state']:
        plt.figure(figsize=(14,6))
        if hpo != 'loss':
            sns.kdeplot(bayes_params[hpo], label='Bayes Optimization: Optuna')
            plt.legend(loc=0)
            plt.title('{} Distribution'.format(hpo))
            plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
            plt.tight_layout()
            plt.show()

   The colsample_bylevel, max_depth, reg_alpha and subsample parameters with higher values demonstrate lower loss for the objective value. Lower values for the colsample_bytree, gamma learning_rate, min_child_weight and reg_lambda resulted in lower loss.

   Let's now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.

In [ ]:
fig, axs = plt.subplots(1, 4, figsize=(20,5))
i = 0
axs = axs.flatten()
for i, hpo in enumerate(['learning_rate', 'gamma', 'colsample_bylevel',
                         'colsample_bytree']):
    sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

   Over the study duration, the colsample_bytree and learning_rate parameters decreased while the others did not show a clear trend.

   Next, we can examine if there were any trends regarding the regularization parameters over the trials.

In [ ]:
fig, axs = plt.subplots(1, 2, figsize = (20,5))
i = 0
for i, hpo in enumerate(['reg_alpha', 'reg_lambda']):
    sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

   Over the trials, the reg_alpha parameter increased while reg_lambda decreased.

   Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then saving as a .pkl file. Then we can evaluate the model metrics for both the train and the test sets using the MAE, MSE, RMSE and the (R²).

In [ ]:
param = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
Out[ ]:
{'n_estimators': 549,
 'max_depth': 14,
 'subsample': 0.7997070496461064,
 'gamma': 2.953865805049196e-05,
 'learning_rate': 0.04001808814037916,
 'reg_alpha': 0.018852758055925938,
 'reg_lambda': 1.8216639376033342e-06,
 'colsample_bytree': 0.56819205236003,
 'colsample_bylevel': 0.5683397007952175,
 'min_child_weight': 7,
 'random_state': 42,
 'metric': 'rmse'}
In [ ]:
best_model = XGBRegressor(objective='reg:squarederror',
                          booster='gbtree',
                          tree_method='gpu_hist',
                          scale_pos_weight=1,
                          use_label_encoder=False,
                          verbosity=0,
                          **params)

best_model.fit(train_features, train_label)

Pkl_Filename = 'XGB_Optuna_trials300_GPU_CV_wb.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_model, file)

print('\nModel Metrics for XGBoost HPO 300 GPU CV trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(train_label, y_train_pred),
        mean_absolute_error(test_label, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(train_label, y_train_pred),
        mean_squared_error(test_label, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(train_label, y_train_pred, squared=False),
        mean_squared_error(test_label, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(train_label, y_train_pred),
        r2_score(test_label, y_test_pred)))

Model Metrics for XGBoost HPO 300 GPU CV trials
MAE train: 969.631, test: 1665.061
MSE train: 1835218.836, test: 5288447.270
RMSE train: 1354.702, test: 2299.662
R^2 train: 0.980, test: 0.942

   Compared to the baseline model, there are lower values for MAE, MSE and RMSE and a higher R² for both train/test sets after hyperparameter tuning. Also, there were lower MAE, MSE, and RMSE and higher R² both for the train/test sets compared to the metrics from using 10-fold cross validation for the Hyperopt trials

   We can also evaluate the MSE on the test set and determine when this was achieved:

In [ ]:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
                                                                                                      y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
The best model from optimization scores 5288447.26991 MSE on the test set.
This was achieved using these conditions:
iteration                                   204
rmse                                 2333.91012
datetime_start       2022-09-09 14:10:41.129639
datetime_complete    2022-09-09 14:27:27.802886
duration                 0 days 00:16:46.673247
colsample_bylevel                       0.56834
colsample_bytree                       0.568192
gamma                                   0.00003
learning_rate                          0.040018
max_depth                                  14.0
min_child_weight                            7.0
n_estimators                              549.0
reg_alpha                              0.018853
reg_lambda                             0.000002
subsample                              0.799707
state                                  COMPLETE
Name: 204, dtype: object

Model Explanations

SHAP (SHapley Additive exPlanations)

   Using the training set, let's summarize the effects of all the features.

In [ ]:
shap.initjs()
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(train_features)

shap.summary_plot(shap_values, train_features);

   Then we can use the test set and summarize the effects of all the features.

In [ ]:
shap.initjs()
shap_values = explainer.shap_values(test_features)

shap.summary_plot(shap_values, test_features);

Model Metrics with ELI5

   Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.

In [ ]:
X_test1 = pd.DataFrame(test_features, columns=test_features.columns)

perm_importance = PermutationImportance(best_model,
                                        random_state=seed_value).fit(test_features,
                                                                     test_label)
html_obj = eli5.show_weights(perm_importance,
                            feature_names=X_test1.columns.tolist())
html_obj
Out[ ]:
Weight Feature
0.1900 ± 0.0018 horsepower
0.0832 ± 0.0019 year
0.0776 ± 0.0014 mileage
0.0775 ± 0.0009 width
0.0347 ± 0.0003 height
0.0314 ± 0.0003 engine_displacement
0.0257 ± 0.0005 horsepower_rpm
0.0249 ± 0.0004 wheelbase
0.0236 ± 0.0002 fuel_tank_volume
0.0219 ± 0.0002 city_fuel_economy
0.0201 ± 0.0005 maximum_seating
0.0163 ± 0.0002 back_legroom
0.0161 ± 0.0002 wheel_system_display_Front-Wheel Drive
0.0154 ± 0.0002 savings_amount
0.0152 ± 0.0003 is_new
0.0150 ± 0.0002 length
0.0109 ± 0.0003 highway_fuel_economy
0.0096 ± 0.0002 front_legroom
0.0075 ± 0.0004 wheel_system_display_Four-Wheel Drive
0.0059 ± 0.0003 daysonmarket
… 33 more …

   We can also utilize explain_weights_sklearn which returns an explanation of the estimator parameters (weights).

In [ ]:
explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
Out[ ]:
feature weight std
0 horsepower 0.189968 0.000917
1 year 0.083167 0.000940
2 mileage 0.077648 0.000694
3 width 0.077522 0.000439
4 height 0.034736 0.000152
5 engine_displacement 0.031430 0.000150
6 horsepower_rpm 0.025704 0.000269
7 wheelbase 0.024892 0.000208
8 fuel_tank_volume 0.023631 0.000101
9 city_fuel_economy 0.021869 0.000080
10 maximum_seating 0.020138 0.000248
11 back_legroom 0.016298 0.000111
12 wheel_system_display_Front-Wheel Drive 0.016102 0.000098
13 savings_amount 0.015372 0.000077
14 is_new 0.015173 0.000174
15 length 0.015018 0.000077
16 highway_fuel_economy 0.010851 0.000152
17 front_legroom 0.009646 0.000115
18 wheel_system_display_Four-Wheel Drive 0.007489 0.000178
19 daysonmarket 0.005926 0.000134

   The horsepower feature is again the most important weight (0.189968), followed by year (0.083167), mileage (0.077648), width (0.077522) and height (0.034736).

XGBoost Set Params Same GPU

   The notebook utilizing the GPU for XGBoost is here and using RAPIDS is here All of the W & B parameters with loss for this experiments can be examined here.

   To evaluate the runtime length between cross validation and train/test split, let's utilize the same GPU, a Quadro RTX 4000. Let's first set up the environment, read the data, define the label and features, create dummy variables for categorical variables and create a matrix for GPU.

In [ ]:
!pip install --upgrade pip
!pip install --upgrade -q wandb
!pip install xgboost==1.5.2
!pip install optuna
!pip install plotly
!pip install eli5
!pip install shap
import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')

seed_value = 42
os.environ['usedCars_xgbGPU'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)

print('\n')
!nvidia-smi
Requirement already satisfied: pip in /opt/conda/envs/rapids/lib/python3.9/site-packages (23.0.1)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Requirement already satisfied: xgboost==1.5.2 in /opt/conda/envs/rapids/lib/python3.9/site-packages (1.5.2)
Requirement already satisfied: numpy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from xgboost==1.5.2) (1.21.5)
Requirement already satisfied: scipy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from xgboost==1.5.2) (1.6.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Requirement already satisfied: optuna in /opt/conda/envs/rapids/lib/python3.9/site-packages (3.1.0)
Requirement already satisfied: cmaes>=0.9.1 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna) (0.9.1)
Requirement already satisfied: packaging>=20.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna) (21.3)
Requirement already satisfied: tqdm in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna) (4.62.3)
Requirement already satisfied: PyYAML in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna) (6.0)
Requirement already satisfied: alembic>=1.5.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna) (1.9.4)
Requirement already satisfied: sqlalchemy>=1.3.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna) (2.0.4)
Requirement already satisfied: colorlog in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna) (6.7.0)
Requirement already satisfied: numpy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna) (1.21.5)
Requirement already satisfied: Mako in /opt/conda/envs/rapids/lib/python3.9/site-packages (from alembic>=1.5.0->optuna) (1.2.4)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from packaging>=20.0->optuna) (3.0.7)
Requirement already satisfied: greenlet!=0.4.17 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from sqlalchemy>=1.3.0->optuna) (2.0.2)
Requirement already satisfied: typing-extensions>=4.2.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from sqlalchemy>=1.3.0->optuna) (4.5.0)
Requirement already satisfied: MarkupSafe>=0.9.2 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from Mako->alembic>=1.5.0->optuna) (2.0.1)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Requirement already satisfied: plotly in /opt/conda/envs/rapids/lib/python3.9/site-packages (5.13.1)
Requirement already satisfied: tenacity>=6.2.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from plotly) (8.2.2)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Requirement already satisfied: eli5 in /opt/conda/envs/rapids/lib/python3.9/site-packages (0.13.0)
Requirement already satisfied: scipy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (1.6.0)
Requirement already satisfied: six in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (1.16.0)
Requirement already satisfied: jinja2>=3.0.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (3.0.3)
Requirement already satisfied: graphviz in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (0.20.1)
Requirement already satisfied: attrs>17.1.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (21.4.0)
Requirement already satisfied: numpy>=1.9.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (1.21.5)
Requirement already satisfied: tabulate>=0.7.7 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (0.9.0)
Requirement already satisfied: scikit-learn>=0.20 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (0.24.2)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from jinja2>=3.0.0->eli5) (2.0.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from scikit-learn>=0.20->eli5) (3.1.0)
Requirement already satisfied: joblib>=0.11 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from scikit-learn>=0.20->eli5) (1.1.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Requirement already satisfied: shap in /opt/conda/envs/rapids/lib/python3.9/site-packages (0.41.0)
Requirement already satisfied: scipy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from shap) (1.6.0)
Requirement already satisfied: slicer==0.0.7 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from shap) (0.0.7)
Requirement already satisfied: packaging>20.9 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from shap) (21.3)
Requirement already satisfied: pandas in /opt/conda/envs/rapids/lib/python3.9/site-packages (from shap) (1.3.5)
Requirement already satisfied: numba in /opt/conda/envs/rapids/lib/python3.9/site-packages (from shap) (0.55.0)
Requirement already satisfied: tqdm>4.25.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from shap) (4.62.3)
Requirement already satisfied: numpy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from shap) (1.21.5)
Requirement already satisfied: cloudpickle in /opt/conda/envs/rapids/lib/python3.9/site-packages (from shap) (2.0.0)
Requirement already satisfied: scikit-learn in /opt/conda/envs/rapids/lib/python3.9/site-packages (from shap) (0.24.2)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from packaging>20.9->shap) (3.0.7)
Requirement already satisfied: llvmlite<0.39,>=0.38.0rc1 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from numba->shap) (0.38.0)
Requirement already satisfied: setuptools in /opt/conda/envs/rapids/lib/python3.9/site-packages (from numba->shap) (59.8.0)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from pandas->shap) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from pandas->shap) (2021.3)
Requirement already satisfied: joblib>=0.11 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from scikit-learn->shap) (1.1.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from scikit-learn->shap) (3.1.0)
Requirement already satisfied: six>=1.5 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas->shap) (1.16.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv


Thu Mar  2 01:21:54 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 4000     Off  | 00000000:00:05.0 Off |                  N/A |
| 30%   27C    P8     9W / 125W |   1467MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
In [ ]:
import pandas as pd
import xgboost as xgb

trainDF = pd.read_csv('usedCars_trainSet.csv', low_memory=False)
testDF = pd.read_csv('usedCars_testSet.csv', low_memory=False)

train_label = trainDF[['price']]
test_label = testDF[['price']]

train_features = trainDF.drop(columns = ['price'])
test_features = testDF.drop(columns = ['price'])

train_features = pd.get_dummies(train_features, drop_first=True)
test_features = pd.get_dummies(test_features, drop_first=True)

dtrain = xgb.DMatrix(data=train_features, label=train_label)
dtest = xgb.DMatrix(data=test_features, label=test_label)

Cross Validation

   Let's start with 3-fold cross validation where the results are logged with W & B.

200 Trials 3-Fold Cross Validation

In [ ]:
import wandb

wandb.login()

wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
                'group': 'optuna_xgbGPU_setParams',
                'save_code': 'False',
                'notes': 'xgbGPU_setParams_RTX4000_3CV_noCUML',
                'tags': 'xgbGPU_setParams_RTX4000_3CV_noCUML'}

wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
wandb: Currently logged in as: aschultz. Use `wandb login --relogin` to force relogin
In [ ]:
import joblib
import optuna
from optuna import Trial
optuna.logging.set_verbosity(optuna.logging.WARNING)
from optuna.integration.wandb import WeightsAndBiasesCallback
from timeit import default_timer as timer


@wandbc.track_in_wandb()

def objective(trial):
    """
    Objective function to tune a XGBoostRegressor model.
    """
    joblib.dump(study, 'XGB_Optuna_GPU_setParams_RTX4000_3CV_wb.pkl')

    params_xgb_optuna = {
        'booster': 'gbtree',
        'gpu_id': 0,
        'tree_method': 'gpu_hist',
        'gamma': 4.5,
        'learning_rate': 0.04,
        'reg_alpha': 0.01,
        'scale_pos_weight': 1,
        'use_label_encoder': False,
        'random_state': seed_value,
        'verbosity': 0,
        'n_estimators': trial.suggest_int('n_estimators', 600, 800, step=100),
        'max_depth': trial.suggest_int('max_depth', 14, 15),
        'subsample': trial.suggest_float('subsample', 0.78, 0.8, step=0.02),
        'reg_lambda': trial.suggest_float('reg_lambda', 5.5, 7.5, step=1),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7,
                                                step=0.1),
        'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.4, 0.6,
                                                 step=0.1),
        'min_child_weight': trial.suggest_int('min_child_weight', 7, 9, step=1)
        }

    start = timer()
    cv_results = xgb.cv(params_xgb_optuna, dtrain, nfold=3,
                        metrics='rmse', as_pandas=True)
    run_time = timer() - start

    rmse = cv_results['test-rmse-mean'].values[-1]

    return rmse
In [ ]:
from datetime import datetime, timedelta

start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_GPU_setParams_RTX4000_3CV_wb.pkl'):
    study = joblib.load('XGB_Optuna_GPU_setParams_RTX4000_3CV_wb.pkl')
else:
    study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200, callbacks=[wandbc])

end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
View run unique-darkness-2333 at: https://wandb.ai/aschultz/usedCars_hpo/runs/2c8ixid9
Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
Find logs at: ./wandb/run-20230302_022407-2c8ixid9/logs
Start Time           2023-03-02 01:22:38.927309
End Time             2023-03-02 02:24:25.610914
1:01:46


Number of finished trials: 200
Best trial: {'n_estimators': 800, 'max_depth': 15, 'subsample': 0.8, 'reg_lambda': 5.5, 'colsample_bytree': 0.7, 'colsample_bylevel': 0.5, 'min_child_weight': 9}
owest RMSE 2836.6699219999996

   This took over a litte more than one hour. Now we can examine the study components.

In [ ]:
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'rmse'}, inplace=True)
trials_df.rename(columns={'params_colsample_bylevel': 'colsample_bylevel'},
                 inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
                 inplace=True)
trials_df.rename(columns={'params_gamma': 'gamma'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
                 inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_weight': 'min_child_weight'},
                 inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_reg_alpha': 'reg_alpha'}, inplace=True)
trials_df.rename(columns={'params_reg_lambda': 'reg_lambda'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
     iteration         rmse             datetime_start  \
199        199  2836.669922 2023-03-02 02:24:07.581266   
127        127  2836.669922 2023-03-02 02:02:13.160332   
128        128  2836.669922 2023-03-02 02:02:31.541297   
129        129  2836.669922 2023-03-02 02:02:50.340703   
131        131  2836.669922 2023-03-02 02:03:26.812853   
..         ...          ...                        ...   
2            2  2908.110515 2023-03-02 01:23:15.295948   
7            7  2909.427002 2023-03-02 01:24:44.710299   
88          88  2911.004069 2023-03-02 01:50:16.697325   
5            5  2921.752116 2023-03-02 01:24:08.984637   
4            4  2937.090576 2023-03-02 01:23:51.420254   

             datetime_complete               duration  colsample_bylevel  \
199 2023-03-02 02:24:21.413371 0 days 00:00:13.832105                0.5   
127 2023-03-02 02:02:27.007972 0 days 00:00:13.847640                0.5   
128 2023-03-02 02:02:45.595758 0 days 00:00:14.054461                0.5   
129 2023-03-02 02:03:04.179688 0 days 00:00:13.838985                0.5   
131 2023-03-02 02:03:40.508086 0 days 00:00:13.695233                0.5   
..                         ...                    ...                ...   
2   2023-03-02 01:23:28.893965 0 days 00:00:13.598017                0.5   
7   2023-03-02 01:24:58.058156 0 days 00:00:13.347857                0.4   
88  2023-03-02 01:50:30.508126 0 days 00:00:13.810801                0.5   
5   2023-03-02 01:24:22.163635 0 days 00:00:13.178998                0.4   
4   2023-03-02 01:24:04.849909 0 days 00:00:13.429655                0.5   

     colsample_bytree  max_depth  min_child_weight  n_estimators  reg_lambda  \
199               0.7         15                 9           800         5.5   
127               0.7         15                 9           800         5.5   
128               0.7         15                 9           800         5.5   
129               0.7         15                 9           800         5.5   
131               0.7         15                 9           800         5.5   
..                ...        ...               ...           ...         ...   
2                 0.6         15                 7           700         7.5   
7                 0.5         15                 7           800         6.5   
88                0.5         15                 7           800         5.5   
5                 0.5         15                 8           600         7.5   
4                 0.5         14                 8           700         7.5   

     subsample     state  
199       0.80  COMPLETE  
127       0.80  COMPLETE  
128       0.80  COMPLETE  
129       0.80  COMPLETE  
131       0.80  COMPLETE  
..         ...       ...  
2         0.78  COMPLETE  
7         0.80  COMPLETE  
88        0.80  COMPLETE  
5         0.80  COMPLETE  
4         0.78  COMPLETE  

[200 rows x 13 columns]
In [ ]:
for i, hpo in enumerate(trials_df.columns):
    if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
                   'duration', 'n_estimators', 'state']:
        plt.figure(figsize=(14,6))
        if hpo != 'loss':
            sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
            plt.legend(loc=0)
            plt.title('{} Distribution'.format(hpo))
            plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
            plt.tight_layout()
            plt.show()
In [ ]:
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
Out[ ]:
{'n_estimators': 800,
 'max_depth': 15,
 'subsample': 0.8,
 'reg_lambda': 5.5,
 'colsample_bytree': 0.7,
 'colsample_bylevel': 0.5,
 'min_child_weight': 9,
 'random_state': 42,
 'metric': 'rmse'}
In [ ]:
best_model = XGBRegressor(objective='reg:squarederror',
                          booster='gbtree',
                          tree_method='gpu_hist',
                          gpu_id=0,
                          gamma=4.5,
                          learning_rate=0.04,
                          reg_alpha=0.01,
                          scale_pos_weight=1,
                          use_label_encoder=False,
                          verbosity=0,
                          **params)

best_model.fit(train_features, train_label)

Pkl_Filename = 'XGB_Optuna_trials_GPU_setParams_RTX4000_3CV_wb.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_model, file)

print('\nModel Metrics for XGBoost HPO GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(train_label, y_train_pred),
        mean_absolute_error(test_label, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(train_label, y_train_pred),
        mean_squared_error(test_label, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(train_label, y_train_pred, squared=False),
        mean_squared_error(test_label, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(train_label, y_train_pred),
        r2_score(test_label, y_test_pred)))

Model Metrics for XGBoost HPO GPU trials
MAE train: 919.201, test: 1684.660
MSE train: 1633389.920, test: 5393519.908
RMSE train: 1278.041, test: 2322.395
R^2 train: 0.982, test: 0.941
In [ ]:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
                                                                                                      y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
The best model from optimization scores 5393519.90815 MSE on the test set.
This was achieved using these conditions:
iteration                                   199
rmse                                2836.669922
datetime_start       2023-03-02 02:24:07.581266
datetime_complete    2023-03-02 02:24:21.413371
duration                 0 days 00:00:13.832105
colsample_bylevel                           0.5
colsample_bytree                            0.7
max_depth                                    15
min_child_weight                              9
n_estimators                                800
reg_lambda                                  5.5
subsample                                   0.8
state                                  COMPLETE
Name: 199, dtype: object

200 Trials 5-Fold Cross Validation

   Let's know try using 5-fold cross validation where the results are logged with W & B.

In [ ]:
wandb.login()

wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
                'group': 'optuna_xgbGPU_setParams',
                'save_code': 'False',
                'notes': 'xgbGPU_setParams_RTX4000_5CV_noCUML',
                'tags': 'xgbGPU_setParams_RTX4000_5CV_noCUML'}

wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
In [ ]:
@wandbc.track_in_wandb()

def objective(trial):
    """
    Objective function to tune a XGBoostRegressor model.
    """
    joblib.dump(study, 'XGB_Optuna_GPU_setParams_RTX4000_5CV_wb.pkl')

    params_xgb_optuna = {
        'booster': 'gbtree',
        'gpu_id': 0,
        'tree_method': 'gpu_hist',
        'gamma': 4.5,
        'learning_rate': 0.04,
        'reg_alpha': 0.01,
        'scale_pos_weight': 1,
        'use_label_encoder': False,
        'random_state': seed_value,
        'verbosity': 0,
        'n_estimators': trial.suggest_int('n_estimators', 600, 800, step=100),
        'max_depth': trial.suggest_int('max_depth', 14, 15),
        'subsample': trial.suggest_float('subsample', 0.78, 0.8, step=0.02),
        'reg_lambda': trial.suggest_float('reg_lambda', 5.5, 7.5, step=1),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7,
                                                step=0.1),
        'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.4, 0.6,
                                                 step=0.1),
        'min_child_weight': trial.suggest_int('min_child_weight', 7, 9, step=1)
        }

    start = timer()
    cv_results = xgb.cv(params_xgb_optuna, dtrain, nfold=5,
                        metrics='rmse', as_pandas=True)
    run_time = timer() - start

    rmse = cv_results['test-rmse-mean'].values[-1]
    print('- Trial RMSE:', rmse)

    return rmse
In [ ]:
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_GPU_setParams_RTX4000_5CV_wb.pkl'):
    study = joblib.load('XGB_Optuna_GPU_setParams_RTX4000_5CV_wb.pkl')
else:
    study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200, callbacks=[wandbc])

end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
View run polished-glitter-1005 at: https://wandb.ai/aschultz/usedCars_hpo/runs/dyt358s7
Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
Find logs at: ./wandb/run-20230228_134045-dyt358s7/logs
Start Time           2023-02-28 12:30:13.565950
End Time             2023-02-28 13:41:06.961348
1:10:53
/n
Number of finished trials: 200
Best trial: {'n_estimators': 700, 'max_depth': 14, 'subsample': 0.8, 'reg_lambda': 5.5, 'colsample_bytree': 0.7, 'colsample_bylevel': 0.6, 'min_child_weight': 7}
Lowest RMSE 2810.4725099999996

   This took over nine minutes longer than 3-fold cross validation. Now we can examine the study components.

In [ ]:
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'rmse'}, inplace=True)
trials_df.rename(columns={'params_colsample_bylevel': 'colsample_bylevel'},
                 inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
                 inplace=True)
trials_df.rename(columns={'params_gamma': 'gamma'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
                 inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_weight': 'min_child_weight'},
                 inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_reg_alpha': 'reg_alpha'}, inplace=True)
trials_df.rename(columns={'params_reg_lambda': 'reg_lambda'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
     iteration         rmse             datetime_start  \
99          99  2810.472510 2023-02-28 13:05:06.352142   
121        121  2810.472510 2023-02-28 13:13:01.154200   
122        122  2810.472510 2023-02-28 13:13:22.201484   
123        123  2810.472510 2023-02-28 13:13:43.315001   
124        124  2810.472510 2023-02-28 13:14:04.377982   
..         ...          ...                        ...   
88          88  2880.894190 2023-02-28 13:01:12.405263   
48          48  2887.327295 2023-02-28 12:46:56.194690   
17          17  2890.262939 2023-02-28 12:35:58.164065   
1            1  2896.017432 2023-02-28 12:30:24.373923   
6            6  2909.011865 2023-02-28 12:32:07.024612   

             datetime_complete               duration  colsample_bylevel  \
99  2023-02-28 13:05:23.057664 0 days 00:00:16.705522                0.6   
121 2023-02-28 13:13:18.031413 0 days 00:00:16.877213                0.6   
122 2023-02-28 13:13:39.364476 0 days 00:00:17.162992                0.6   
123 2023-02-28 13:13:59.972708 0 days 00:00:16.657707                0.6   
124 2023-02-28 13:14:21.178490 0 days 00:00:16.800508                0.6   
..                         ...                    ...                ...   
88  2023-02-28 13:01:28.151382 0 days 00:00:15.746119                0.6   
48  2023-02-28 12:47:11.849090 0 days 00:00:15.654400                0.4   
17  2023-02-28 12:36:14.389482 0 days 00:00:16.225417                0.6   
1   2023-02-28 12:30:40.179053 0 days 00:00:15.805130                0.5   
6   2023-02-28 12:32:22.486129 0 days 00:00:15.461517                0.6   

     colsample_bytree  max_depth  min_child_weight  n_estimators  reg_lambda  \
99                0.7         14                 7           700         5.5   
121               0.7         14                 7           700         5.5   
122               0.7         14                 7           700         5.5   
123               0.7         14                 7           700         5.5   
124               0.7         14                 7           700         5.5   
..                ...        ...               ...           ...         ...   
88                0.5         14                 7           600         5.5   
48                0.7         14                 7           700         7.5   
17                0.5         15                 7           700         6.5   
1                 0.5         15                 9           700         7.5   
6                 0.5         14                 9           700         7.5   

     subsample     state  
99        0.80  COMPLETE  
121       0.80  COMPLETE  
122       0.80  COMPLETE  
123       0.80  COMPLETE  
124       0.80  COMPLETE  
..         ...       ...  
88        0.80  COMPLETE  
48        0.80  COMPLETE  
17        0.80  COMPLETE  
1         0.78  COMPLETE  
6         0.78  COMPLETE  

[200 rows x 13 columns]
In [ ]:
for i, hpo in enumerate(trials_df.columns):
    if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
                   'duration', 'n_estimators', 'state']:
        plt.figure(figsize=(14,6))
        if hpo != 'loss':
            sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
            plt.legend(loc=0)
            plt.title('{} Distribution'.format(hpo))
            plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
            plt.tight_layout()
            plt.show()
In [ ]:
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
Out[ ]:
{'n_estimators': 700,
 'max_depth': 14,
 'subsample': 0.8,
 'reg_lambda': 5.5,
 'colsample_bytree': 0.7,
 'colsample_bylevel': 0.6,
 'min_child_weight': 7,
 'random_state': 42,
 'metric': 'rmse'}
In [ ]:
best_model = XGBRegressor(objective='reg:squarederror',
                          booster='gbtree',
                          tree_method='gpu_hist',
                          gpu_id=0,
                          gamma=4.5,
                          learning_rate=0.04,
                          reg_alpha=0.01,
                          scale_pos_weight=1,
                          use_label_encoder=False,
                          verbosity=0,
                          **params)

best_model.fit(train_features, train_label)

Pkl_Filename = 'XGB_Optuna_trials_GPU_setParams_RTX4000_5CV_wb.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_model, file)

print('\nModel Metrics for XGBoost HPO GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(train_label, y_train_pred),
        mean_absolute_error(test_label, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(train_label, y_train_pred),
        mean_squared_error(test_label, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(train_label, y_train_pred, squared=False),
        mean_squared_error(test_label, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(train_label, y_train_pred),
        r2_score(test_label, y_test_pred)))

Model Metrics for XGBoost HPO GPU trials
MAE train: 1032.235, test: 1691.241
MSE train: 2029759.329, test: 5401147.095
RMSE train: 1424.696, test: 2324.037
R^2 train: 0.978, test: 0.941
In [ ]:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
                                                                                                      y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
The best model from optimization scores 5401147.09528 MSE on the test set.
This was achieved using these conditions:
iteration                                    99
rmse                                 2810.47251
datetime_start       2023-02-28 13:05:06.352142
datetime_complete    2023-02-28 13:05:23.057664
duration                 0 days 00:00:16.705522
colsample_bylevel                           0.6
colsample_bytree                            0.7
max_depth                                    14
min_child_weight                              7
n_estimators                                700
reg_lambda                                  5.5
subsample                                   0.8
state                                  COMPLETE
Name: 99, dtype: object

200 Trials 10-Fold Cross Validation

   Let's know try using 10-fold cross validation where the results are logged with W & B.

In [ ]:
wandb.login()

wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
                'group': 'optuna_xgbGPU_setParams',
                'save_code': 'False',
                'notes': 'xgbGPU_setParams_RTX4000_10CV_noCUML',
                'tags': 'xgbGPU_setParams_RTX4000_10CV_noCUML'}

wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
In [ ]:
@wandbc.track_in_wandb()

def objective(trial):
    """
    Objective function to tune a XGBoostRegressor model.
    """
    joblib.dump(study, 'XGB_Optuna_GPU_setParams_RTX4000_10CV_wb.pkl')

    params_xgb_optuna = {
        'booster': 'gbtree',
        'gpu_id': 0,
        'tree_method': 'gpu_hist',
        'gamma': 4.5,
        'learning_rate': 0.04,
        'reg_alpha': 0.01,
        'scale_pos_weight': 1,
        'use_label_encoder': False,
        'random_state': seed_value,
        'verbosity': 0,
        'n_estimators': trial.suggest_int('n_estimators', 600, 800, step=100),
        'max_depth': trial.suggest_int('max_depth', 14, 15),
        'subsample': trial.suggest_float('subsample', 0.78, 0.8, step=0.02),
        'reg_lambda': trial.suggest_float('reg_lambda', 5.5, 7.5, step=1),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7,
                                                step=0.1),
        'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.4, 0.6,
                                                 step=0.1),
        'min_child_weight': trial.suggest_int('min_child_weight', 7, 9, step=1)
        }

    start = timer()
    cv_results = xgb.cv(params_xgb_optuna, dtrain, nfold=10,
                        metrics='rmse', as_pandas=True)
    run_time = timer() - start

    rmse = cv_results['test-rmse-mean'].values[-1]
    print('- Trial RMSE:', rmse)

    return rmse
In [ ]:
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_GPU_setParams_RTX4000_10CV_wb.pkl'):
    study = joblib.load('XGB_Optuna_GPU_setParams_RTX4000_10CV_wb.pkl')
else:
    study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200, callbacks=[wandbc])

end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
View run glowing-night-2533 at: https://wandb.ai/aschultz/usedCars_hpo/runs/fwozla04
Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
Find logs at: ./wandb/run-20230302_040413-fwozla04/logs
Start Time           2023-03-02 02:42:46.231671
End Time             2023-03-02 04:04:37.891068
1:21:51
Number of finished trials: 200
Best trial: {'n_estimators': 700, 'max_depth': 14, 'subsample': 0.8, 'reg_lambda': 5.5, 'colsample_bytree': 0.7, 'colsample_bylevel': 0.5, 'min_child_weight': 8}
Lowest RMSE 2798.5772704

   This took over 20 minutes longer than 3-fold cross validation and 10 minutes longer than 5-fold cross validation. Now we can examine the study components.

In [ ]:
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'rmse'}, inplace=True)
trials_df.rename(columns={'params_colsample_bylevel': 'colsample_bylevel'},
                 inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
                 inplace=True)
trials_df.rename(columns={'params_gamma': 'gamma'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
                 inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_weight': 'min_child_weight'},
                 inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_reg_alpha': 'reg_alpha'}, inplace=True)
trials_df.rename(columns={'params_reg_lambda': 'reg_lambda'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
     iteration         rmse             datetime_start  \
199        199  2798.577270 2023-03-02 04:04:13.187285   
177        177  2798.577270 2023-03-02 03:55:10.653231   
178        178  2798.577270 2023-03-02 03:55:35.578400   
179        179  2798.577270 2023-03-02 03:56:00.720457   
181        181  2798.577270 2023-03-02 03:56:48.957300   
..         ...          ...                        ...   
90          90  2909.585156 2023-03-02 03:19:45.406138   
5            5  2911.163281 2023-03-02 02:44:48.484531   
2            2  2913.772925 2023-03-02 02:43:36.911438   
4            4  2914.687500 2023-03-02 02:44:25.457163   
39          39  2915.390747 2023-03-02 02:59:09.091423   

             datetime_complete               duration  colsample_bylevel  \
199 2023-03-02 04:04:32.951922 0 days 00:00:19.764637                0.5   
177 2023-03-02 03:55:30.601609 0 days 00:00:19.948378                0.5   
178 2023-03-02 03:55:55.864629 0 days 00:00:20.286229                0.5   
179 2023-03-02 03:56:20.585950 0 days 00:00:19.865493                0.5   
181 2023-03-02 03:57:09.124214 0 days 00:00:20.166914                0.5   
..                         ...                    ...                ...   
90  2023-03-02 03:20:03.479326 0 days 00:00:18.073188                0.5   
5   2023-03-02 02:45:06.761658 0 days 00:00:18.277127                0.4   
2   2023-03-02 02:43:55.494662 0 days 00:00:18.583224                0.5   
4   2023-03-02 02:44:43.945176 0 days 00:00:18.488013                0.5   
39  2023-03-02 02:59:27.740884 0 days 00:00:18.649461                0.4   

     colsample_bytree  max_depth  min_child_weight  n_estimators  reg_lambda  \
199               0.7         14                 8           700         5.5   
177               0.7         14                 8           700         5.5   
178               0.7         14                 8           700         5.5   
179               0.7         14                 8           700         5.5   
181               0.7         14                 8           700         5.5   
..                ...        ...               ...           ...         ...   
90                0.5         14                 9           700         7.5   
5                 0.5         14                 8           700         6.5   
2                 0.5         14                 9           600         7.5   
4                 0.5         14                 8           800         7.5   
39                0.5         14                 7           700         6.5   

     subsample     state  
199       0.80  COMPLETE  
177       0.80  COMPLETE  
178       0.80  COMPLETE  
179       0.80  COMPLETE  
181       0.80  COMPLETE  
..         ...       ...  
90        0.80  COMPLETE  
5         0.80  COMPLETE  
2         0.78  COMPLETE  
4         0.78  COMPLETE  
39        0.78  COMPLETE  

[200 rows x 13 columns]
In [ ]:
for i, hpo in enumerate(trials_df.columns):
    if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
                   'duration', 'n_estimators', 'state']:
        plt.figure(figsize=(14,6))
        if hpo != 'loss':
            sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
            plt.legend(loc=0)
            plt.title('{} Distribution'.format(hpo))
            plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
            plt.tight_layout()
            plt.show()
In [ ]:
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
Out[ ]:
{'n_estimators': 700,
 'max_depth': 14,
 'subsample': 0.8,
 'reg_lambda': 5.5,
 'colsample_bytree': 0.7,
 'colsample_bylevel': 0.5,
 'min_child_weight': 8,
 'random_state': 42,
 'metric': 'rmse'}
In [ ]:
best_model = XGBRegressor(objective='reg:squarederror',
                          booster='gbtree',
                          tree_method='gpu_hist',
                          gpu_id=0,
                          gamma=4.5,
                          learning_rate=0.04,
                          reg_alpha=0.01,
                          scale_pos_weight=1,
                          use_label_encoder=False,
                          verbosity=0,
                          **params)

best_model.fit(train_features, train_label)

Pkl_Filename = 'XGB_Optuna_trials_GPU_setParams_RTX4000_10CV_wb.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_model, file)

print('\nModel Metrics for XGBoost HPO GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(train_label, y_train_pred),
        mean_absolute_error(test_label, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(train_label, y_train_pred),
        mean_squared_error(test_label, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(train_label, y_train_pred, squared=False),
        mean_squared_error(test_label, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(train_label, y_train_pred),
        r2_score(test_label, y_test_pred)))

Model Metrics for XGBoost HPO GPU trials
MAE train: 1084.934, test: 1693.636
MSE train: 2226121.053, test: 5413187.740
RMSE train: 1492.019, test: 2326.626
R^2 train: 0.976, test: 0.941
In [ ]:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
                                                                                                      y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
The best model from optimization scores 5413187.73955 MSE on the test set.
This was achieved using these conditions:
iteration                                   199
rmse                                 2798.57727
datetime_start       2023-03-02 04:04:13.187285
datetime_complete    2023-03-02 04:04:32.951922
duration                 0 days 00:00:19.764637
colsample_bylevel                           0.5
colsample_bytree                            0.7
max_depth                                    14
min_child_weight                              8
n_estimators                                700
reg_lambda                                  5.5
subsample                                   0.8
state                                  COMPLETE
Name: 199, dtype: object

Train/Test

   Let's know try using train/test split to compare to cross validation where the results are logged with W & B.

200 Trials Train/Test DMatrix

In [ ]:
wandb.login()

wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
                'group': 'optuna_xgbGPU_setParams',
                'save_code': 'False',
                'notes': 'xgbGPU_setParams_RTX4000_trainTest_DMatrix_noCUML',
                'tags': 'xgbGPU_setParams_RTX4000_trainTest_DMatrix_noCUML'}

wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
In [ ]:
@wandbc.track_in_wandb()

def objective(trial):
    """
    Objective function to tune a XGBoostRegressor model.
    """
    joblib.dump(study,
                'XGB_Optuna_GPU_setParams_RTX4000_trainTest_DMatrix_wb.pkl')

    params_xgb_optuna = {
        'objective': 'reg:squarederror',
        'booster': 'gbtree',
        'gpu_id': 0,
        'tree_method': 'gpu_hist',
        'gamma': 4.5,
        'learning_ratel': 0.04,
        'reg_alpha': 0.01,
        'scale_pos_weight': 1,
        'use_label_encoder': False,
        'random_state': seed_value,
        'verbosity': 0,
        'verbose': False,
        'n_estimators': trial.suggest_int('n_estimators', 600, 800, step=100),
        'max_depth': trial.suggest_int('max_depth', 14, 15),
        'subsample': trial.suggest_float('subsample', 0.78, 0.8, step=0.02),
        'reg_lambda': trial.suggest_float('reg_lambda', 5.5, 7.5, step=1),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7,
                                                step=0.1),
        'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.4, 0.6,
                                                 step=0.1),
        'min_child_weight': trial.suggest_int('min_child_weight', 7, 9, step=1)
        }

    start = timer()

    model = xgb.train(params_xgb_optuna, dtrain)
    run_time = timer() - start

    y_pred_val = model.predict(dtest)
    rmse = mean_squared_error(test_label, y_pred_val,
                              squared=False)
    print('Trial RMSE:', rmse)

    return rmse
In [ ]:
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_GPU_setParams_RTX4000_trainTest_DMatrix_wb.pkl'):
    study = joblib.load('XGB_Optuna_GPU_setParams_RTX4000_trainTest_DMatrix_wb.pkl')
else:
    study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200, callbacks=[wandbc])

end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
View run floral-blaze-200 at: https://wandb.ai/aschultz/usedCars_hpo/runs/mnjhk4cc
Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
Find logs at: ./wandb/run-20230304_173607-mnjhk4cc/logs
Start Time           2023-03-04 16:38:36.754862
End Time             2023-03-04 17:36:24.634476
0:57:47


Number of finished trials: 200
Best trial: {'n_estimators': 700, 'max_depth': 14, 'subsample': 0.8, 'reg_lambda': 5.5, 'colsample_bytree': 0.7, 'colsample_bylevel': 0.5, 'min_child_weight': 8}
Lowest RMSE 2784.2744922463503

   This took four minutes less than 3-fold cross validation and 10 minutes shorter than 5-fold cross validation and 20 minutes shorter than 10-fold cross validationn. This is potential justification for not using cross validation when using a smaller set size. Now we can examine the study components.

In [ ]:
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'rmse'}, inplace=True)
trials_df.rename(columns={'params_colsample_bylevel': 'colsample_bylevel'},
                 inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
                 inplace=True)
trials_df.rename(columns={'params_gamma': 'gamma'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
                 inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_weight': 'min_child_weight'},
                 inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_reg_alpha': 'reg_alpha'}, inplace=True)
trials_df.rename(columns={'params_reg_lambda': 'reg_lambda'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
     iteration         rmse             datetime_start  \
99          99  2784.274492 2023-03-04 17:07:23.012980   
125        125  2784.274492 2023-03-04 17:14:55.441055   
124        124  2784.274492 2023-03-04 17:14:38.445110   
123        123  2784.274492 2023-03-04 17:14:21.551584   
122        122  2784.274492 2023-03-04 17:14:04.138397   
..         ...          ...                        ...   
147        147  2874.615291 2023-03-04 17:21:09.348967   
3            3  2876.914988 2023-03-04 16:39:31.143347   
17          17  2889.995485 2023-03-04 16:43:38.066785   
1            1  2892.732993 2023-03-04 16:38:58.121425   
35          35  2893.625011 2023-03-04 16:48:50.314883   

             datetime_complete               duration  colsample_bylevel  \
99  2023-03-04 17:07:35.609591 0 days 00:00:12.596611                0.5   
125 2023-03-04 17:15:08.103043 0 days 00:00:12.661988                0.5   
124 2023-03-04 17:14:51.173372 0 days 00:00:12.728262                0.5   
123 2023-03-04 17:14:33.875686 0 days 00:00:12.324102                0.5   
122 2023-03-04 17:14:16.544386 0 days 00:00:12.405989                0.5   
..                         ...                    ...                ...   
147 2023-03-04 17:21:21.733014 0 days 00:00:12.384047                0.5   
3   2023-03-04 16:39:43.602827 0 days 00:00:12.459480                0.5   
17  2023-03-04 16:43:49.798413 0 days 00:00:11.731628                0.5   
1   2023-03-04 16:39:09.900301 0 days 00:00:11.778876                0.4   
35  2023-03-04 16:49:02.460795 0 days 00:00:12.145912                0.4   

     colsample_bytree  max_depth  min_child_weight  n_estimators  reg_lambda  \
99                0.7         14                 8           700         5.5   
125               0.7         14                 8           700         5.5   
124               0.7         14                 8           700         5.5   
123               0.7         14                 8           700         5.5   
122               0.7         14                 8           700         5.5   
..                ...        ...               ...           ...         ...   
147               0.5         14                 8           700         5.5   
3                 0.5         15                 7           700         6.5   
17                0.5         14                 9           700         6.5   
1                 0.5         14                 8           800         5.5   
35                0.5         14                 7           800         5.5   

     subsample     state  
99        0.80  COMPLETE  
125       0.80  COMPLETE  
124       0.80  COMPLETE  
123       0.80  COMPLETE  
122       0.80  COMPLETE  
..         ...       ...  
147       0.78  COMPLETE  
3         0.80  COMPLETE  
17        0.80  COMPLETE  
1         0.78  COMPLETE  
35        0.80  COMPLETE  

[200 rows x 13 columns]
In [ ]:
for i, hpo in enumerate(trials_df.columns):
    if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
                   'duration', 'n_estimators', 'state']:
        plt.figure(figsize=(14,6))
        if hpo != 'loss':
            sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
            plt.legend(loc=0)
            plt.title('{} Distribution'.format(hpo))
            plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
            plt.tight_layout()
            plt.show()
In [ ]:
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
Out[ ]:
{'n_estimators': 700,
 'max_depth': 14,
 'subsample': 0.8,
 'reg_lambda': 5.5,
 'colsample_bytree': 0.7,
 'colsample_bylevel': 0.5,
 'min_child_weight': 8,
 'random_state': 42,
 'metric': 'rmse'}
In [ ]:
best_model = XGBRegressor(objective='reg:squarederror',
                          booster='gbtree',
                          tree_method='gpu_hist',
                          gpu_id=0,
                          gamma=4.5,
                          learning_rate=0.04,
                          reg_alpha=0.01,
                          scale_pos_weight=1,
                          use_label_encoder=False,
                          verbosity=0,
                          **params)

best_model.fit(train_features, train_label)


Pkl_Filename = 'XGB_Optuna_trials_GPU_setParams_RTX4000_trainTest_DMatrix_wb.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_model, file)

print('\nModel Metrics for XGBoost HPO GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(train_label, y_train_pred),
        mean_absolute_error(test_label, y_test_pred)))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(train_label, y_train_pred),
        mean_squared_error(test_label, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(train_label, y_train_pred, squared=False),
        mean_squared_error(test_label, y_test_pred, squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(train_label, y_train_pred),
        r2_score(test_label, y_test_pred)))

Model Metrics for XGBoost HPO GPU trials
MAE train: 1084.934, test: 1693.636
MSE train: 2226121.053, test: 5413187.740
RMSE train: 1492.019, test: 2326.626
R^2 train: 0.976, test: 0.941
In [ ]:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(test_label,
                                                                                                      y_test_pred)))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
The best model from optimization scores 5413187.73955 MSE on the test set.
This was achieved using these conditions:
iteration                                    99
rmse                                2784.274492
datetime_start       2023-03-04 17:07:23.012980
datetime_complete    2023-03-04 17:07:35.609591
duration                 0 days 00:00:12.596611
colsample_bylevel                           0.5
colsample_bytree                            0.7
max_depth                                    14
min_child_weight                              8
n_estimators                                700
reg_lambda                                  5.5
subsample                                   0.8
state                                  COMPLETE
Name: 99, dtype: object

Cross Validation - cuML

   Let's go back to Paperspace to utilize the RAPIDS Docker container. We can installing/importing the dependencies, setting the options and seed, set up the local CUDA cluster for Dask, query the client for all connected workers. and the read data.

In [ ]:
!pip install --upgrade pip
!pip install xgboost==1.5.2
!pip install --upgrade -q wandb
!pip install optuna
!pip install plotly
!pip install shap
!pip install eli5
import os
import warnings
import random
import numpy as np
import cupy
from cupy import asnumpy
import urllib.request
from contextlib import contextmanager
import time
from timeit import default_timer as timer
warnings.filterwarnings('ignore')
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

seed_value = 42
os.environ['usedCars_xgbGPU'] = str(seed_value)
random.seed(seed_value)
cupy.random.seed(seed_value)
np.random.seed(seed_value)

@contextmanager
def timed(name):
    t0 = time.time()
    yield
    t1 = time.time()
    print('..%-24s:  %8.4f' % (name, t1 - t0))

print('\n')
!nvidia-smi
Requirement already satisfied: pip in /opt/conda/envs/rapids/lib/python3.9/site-packages (22.0.3)
Collecting pip
  Downloading pip-23.0.1-py3-none-any.whl (2.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 57.1 MB/s eta 0:00:00:00:01
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.3
    Uninstalling pip-22.0.3:
      Successfully uninstalled pip-22.0.3
Successfully installed pip-23.0.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Requirement already satisfied: xgboost==1.5.2 in /opt/conda/envs/rapids/lib/python3.9/site-packages (1.5.2)
Requirement already satisfied: scipy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from xgboost==1.5.2) (1.6.0)
Requirement already satisfied: numpy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from xgboost==1.5.2) (1.21.5)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Collecting optuna
  Downloading optuna-3.1.0-py3-none-any.whl (365 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 365.3/365.3 kB 17.7 MB/s eta 0:00:00
Collecting alembic>=1.5.0
  Downloading alembic-1.9.4-py3-none-any.whl (210 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 210.5/210.5 kB 27.7 MB/s eta 0:00:00
Requirement already satisfied: numpy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna) (1.21.5)
Requirement already satisfied: tqdm in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna) (4.62.3)
Requirement already satisfied: PyYAML in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna) (6.0)
Collecting colorlog
  Downloading colorlog-6.7.0-py2.py3-none-any.whl (11 kB)
Collecting sqlalchemy>=1.3.0
  Downloading SQLAlchemy-2.0.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.8/2.8 MB 23.2 MB/s eta 0:00:0000:0100:01
Collecting cmaes>=0.9.1
  Downloading cmaes-0.9.1-py3-none-any.whl (21 kB)
Requirement already satisfied: packaging>=20.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from optuna) (21.3)
Collecting Mako
  Downloading Mako-1.2.4-py3-none-any.whl (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.7/78.7 kB 23.9 MB/s eta 0:00:00
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from packaging>=20.0->optuna) (3.0.7)
Collecting greenlet!=0.4.17
  Downloading greenlet-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (610 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 610.9/610.9 kB 24.1 MB/s eta 0:00:00
Collecting typing-extensions>=4.2.0
  Downloading typing_extensions-4.5.0-py3-none-any.whl (27 kB)
Requirement already satisfied: MarkupSafe>=0.9.2 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from Mako->alembic>=1.5.0->optuna) (2.0.1)
Installing collected packages: typing-extensions, Mako, greenlet, colorlog, cmaes, sqlalchemy, alembic, optuna
  Attempting uninstall: typing-extensions
    Found existing installation: typing_extensions 4.0.1
    Uninstalling typing_extensions-4.0.1:
      Successfully uninstalled typing_extensions-4.0.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 22.2.0 requires cupy-cuda115, which is not installed.
Successfully installed Mako-1.2.4 alembic-1.9.4 cmaes-0.9.1 colorlog-6.7.0 greenlet-2.0.2 optuna-3.1.0 sqlalchemy-2.0.4 typing-extensions-4.5.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Collecting plotly
  Downloading plotly-5.13.1-py2.py3-none-any.whl (15.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15.2/15.2 MB 66.7 MB/s eta 0:00:0000:0100:01
Collecting tenacity>=6.2.0
  Downloading tenacity-8.2.2-py3-none-any.whl (24 kB)
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.13.1 tenacity-8.2.2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Collecting shap
  Downloading shap-0.41.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (572 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 572.4/572.4 kB 46.1 MB/s eta 0:00:00
Requirement already satisfied: scikit-learn in /opt/conda/envs/rapids/lib/python3.9/site-packages (from shap) (0.24.2)
Requirement already satisfied: numba in /opt/conda/envs/rapids/lib/python3.9/site-packages (from shap) (0.55.0)
Requirement already satisfied: packaging>20.9 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from shap) (21.3)
Requirement already satisfied: scipy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from shap) (1.6.0)
Requirement already satisfied: numpy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from shap) (1.21.5)
Requirement already satisfied: tqdm>4.25.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from shap) (4.62.3)
Requirement already satisfied: cloudpickle in /opt/conda/envs/rapids/lib/python3.9/site-packages (from shap) (2.0.0)
Collecting slicer==0.0.7
  Downloading slicer-0.0.7-py3-none-any.whl (14 kB)
Requirement already satisfied: pandas in /opt/conda/envs/rapids/lib/python3.9/site-packages (from shap) (1.3.5)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from packaging>20.9->shap) (3.0.7)
Requirement already satisfied: setuptools in /opt/conda/envs/rapids/lib/python3.9/site-packages (from numba->shap) (59.8.0)
Requirement already satisfied: llvmlite<0.39,>=0.38.0rc1 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from numba->shap) (0.38.0)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from pandas->shap) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from pandas->shap) (2021.3)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from scikit-learn->shap) (3.1.0)
Requirement already satisfied: joblib>=0.11 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from scikit-learn->shap) (1.1.0)
Requirement already satisfied: six>=1.5 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas->shap) (1.16.0)
Installing collected packages: slicer, shap
Successfully installed shap-0.41.0 slicer-0.0.7
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Collecting eli5
  Downloading eli5-0.13.0.tar.gz (216 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 216.2/216.2 kB 23.2 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Requirement already satisfied: attrs>17.1.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (21.4.0)
Requirement already satisfied: jinja2>=3.0.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (3.0.3)
Requirement already satisfied: numpy>=1.9.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (1.21.5)
Requirement already satisfied: scipy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (1.6.0)
Requirement already satisfied: six in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (1.16.0)
Requirement already satisfied: scikit-learn>=0.20 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (0.24.2)
Collecting graphviz
  Downloading graphviz-0.20.1-py3-none-any.whl (47 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47.0/47.0 kB 13.0 MB/s eta 0:00:00
Collecting tabulate>=0.7.7
  Downloading tabulate-0.9.0-py3-none-any.whl (35 kB)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from jinja2>=3.0.0->eli5) (2.0.1)
Requirement already satisfied: joblib>=0.11 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from scikit-learn>=0.20->eli5) (1.1.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from scikit-learn>=0.20->eli5) (3.1.0)
Building wheels for collected packages: eli5
  Building wheel for eli5 (setup.py) ... done
  Created wheel for eli5: filename=eli5-0.13.0-py2.py3-none-any.whl size=107748 sha256=64812a2fcae13425386ddb9f3843c5ef66f7aa200d9365d3d0b3b657a132b567
  Stored in directory: /root/.cache/pip/wheels/7b/26/a5/8460416695a992a2966b41caa5338e5e7fcea98c9d032d055c
Successfully built eli5
Installing collected packages: tabulate, graphviz, eli5
Successfully installed eli5-0.13.0 graphviz-0.20.1 tabulate-0.9.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv


Thu Mar  2 00:10:46 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 4000     Off  | 00000000:00:05.0 Off |                  N/A |
| 30%   32C    P0    36W / 125W |     94MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
In [ ]:
import dask
from dask.distributed import Client, wait
from dask.diagnostics import ProgressBar
from dask_cuda import LocalCUDACluster
import dask_cudf

cluster = LocalCUDACluster(threads_per_worker=1, ip='',
                           dashboard_address='8082')
c = Client(cluster)

workers = c.has_what().keys()
n_workers = len(workers)
c
distributed.diskutils - INFO - Found stale lock file and directory '/notebooks/UsedCarsCarGurus/Models/ML/XGBoost/Optuna/Notebooks_Scripts/dask-worker-space/worker-e6t8yiiv', purging
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Out[ ]:

Client

Client-af9be4d6-b88e-11ed-8048-668a37aed768

Connection method: Cluster object Cluster type: dask_cuda.LocalCUDACluster
Dashboard: http://10.42.7.83:8082/status

Cluster Info

In [ ]:
import cudf
import cuml
import xgboost as xgb

trainDF = cudf.read_csv('usedCars_trainSet.csv', low_memory=False)
testDF = cudf.read_csv('usedCars_testSet.csv', low_memory=False)

train_label = trainDF[['price']]
test_label = testDF[['price']]

train_features = trainDF.drop(columns = ['price'])
test_features = testDF.drop(columns = ['price'])

train_features = cudf.get_dummies(train_features)
test_features = cudf.get_dummies(test_features)

train_features = train_features.to_cupy()
train_label = train_label.to_cupy()
test_features = test_features.to_cupy()
test_label = test_label.to_cupy()

train_features = train_features.astype('float32')
train_label = train_label.astype('int32')

test_features = test_features.astype('float32')
test_label = test_label.astype('int32')

dtrain = xgb.DMatrix(data=train_features, label=train_label)
dtest = xgb.DMatrix(data=test_features, label=test_label)

200 Trials 3-fold Cross Validation cuML

   Let's start with 3-fold cross validation where the results are logged with W & B.

In [ ]:
import optuna
from optuna import Trial
optuna.logging.set_verbosity(optuna.logging.WARNING)
import wandb
from optuna.integration.wandb import WeightsAndBiasesCallback

wandb.login()

wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
                'group': 'optuna_xgbGPU_setParams',
                'save_code': 'False',
                'notes': 'xgbGPU_setParams_RTX4000_3CV_cuml',
                'tags': 'xgbGPU_setParams_RTX4000_3CV_cuml'}

wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
In [ ]:
from xgboost import XGBRegressor, plot_importance
import joblib
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

@wandbc.track_in_wandb()

def objective(trial):
    """
    Objective function to tune a XGBoostRegressor model.
    """
    joblib.dump(study, 'XGB_Optuna_GPU_setParams_RTX4000_3CV_cuml_wb.pkl')

    params_xgb_optuna = {
        'booster': 'gbtree',
        'gpu_id': 0,
        'tree_method': 'gpu_hist',
        'gamma': 4.5,
        'learning_rate': 0.04,
        'reg_alpha': 0.01,
        'scale_pos_weight': 1,
        'use_label_encoder': False,
        'random_state': seed_value,
        'verbosity': 0,
        'n_estimators': trial.suggest_int('num_boost_round', 600, 800,
                                          step=100),
        'max_depth': trial.suggest_int('max_depth', 14, 15),
        'subsample': trial.suggest_float('subsample', 0.78, 0.8, step=0.02),
        'reg_lambda': trial.suggest_float('reg_lambda', 5.5, 7.5, step=1),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7,
                                                step=0.1),
        'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.4, 0.6,
                                                 step=0.1),
        'min_child_weight': trial.suggest_int('min_child_weight', 7, 9, step=1)
        }

    start = timer()

    cv_results = xgb.cv(params_xgb_optuna, dtrain, nfold=3,
                        metrics='rmse')
    run_time = timer() - start

    rmse = asnumpy(cv_results['test-rmse-mean'].values[-1])
    print('- Trial RMSE:', rmse)

    return rmse
In [ ]:
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_GPU_setParams_RTX4000_3CV_cuml_wb.pkl'):
    study = joblib.load('XGB_Optuna_GPU_setParams_RTX4000_3CV_cuml_wb.pkl')
else:
    study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200, callbacks=[wandbc])

end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
View run sunny-sky-2133 at: https://wandb.ai/aschultz/usedCars_hpo/runs/2pvozyum
Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
Find logs at: ./wandb/run-20230302_011406-2pvozyum/logs
Start Time           2023-03-02 00:11:35.915923
End Time             2023-03-02 01:14:24.804205
1:02:48


Number of finished trials: 200
Best trial: {'num_boost_round': 800, 'max_depth': 15, 'subsample': 0.8, 'reg_lambda': 6.5, 'colsample_bytree': 0.7, 'colsample_bylevel': 0.6, 'min_child_weight': 8}
Lowest RMSE 2830.049642

   This took over a litte more than one hour just like what was observed with 3-fold cross validation without cuML. Now we can examine the study components.

In [ ]:
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'rmse'}, inplace=True)
trials_df.rename(columns={'params_colsample_bylevel': 'colsample_bylevel'},
                 inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
                 inplace=True)
trials_df.rename(columns={'params_gamma': 'gamma'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
                 inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_weight': 'min_child_weight'},
                 inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_reg_alpha': 'reg_alpha'}, inplace=True)
trials_df.rename(columns={'params_reg_lambda': 'reg_lambda'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
     iteration         rmse             datetime_start  \
99          99  2830.049642 2023-03-02 00:42:33.333198   
123        123  2830.049642 2023-03-02 00:50:00.607334   
125        125  2830.049642 2023-03-02 00:50:37.049158   
126        126  2830.049642 2023-03-02 00:51:00.316314   
127        127  2830.049642 2023-03-02 00:51:19.077201   
..         ...          ...                        ...   
10          10  2890.070556 2023-03-02 00:14:39.832162   
40          40  2892.721924 2023-03-02 00:23:49.430490   
1            1  2898.562419 2023-03-02 00:11:54.354140   
8            8  2936.870117 2023-03-02 00:14:04.593965   
2            2  2939.049316 2023-03-02 00:12:12.751657   

             datetime_complete               duration  colsample_bylevel  \
99  2023-03-02 00:42:47.560224 0 days 00:00:14.227026                0.6   
123 2023-03-02 00:50:14.924396 0 days 00:00:14.317062                0.6   
125 2023-03-02 00:50:51.364466 0 days 00:00:14.315308                0.6   
126 2023-03-02 00:51:14.476553 0 days 00:00:14.160239                0.6   
127 2023-03-02 00:51:33.277087 0 days 00:00:14.199886                0.6   
..                         ...                    ...                ...   
10  2023-03-02 00:14:53.650033 0 days 00:00:13.817871                0.6   
40  2023-03-02 00:24:03.126657 0 days 00:00:13.696167                0.6   
1   2023-03-02 00:12:07.707962 0 days 00:00:13.353822                0.4   
8   2023-03-02 00:14:17.781516 0 days 00:00:13.187551                0.5   
2   2023-03-02 00:12:29.357697 0 days 00:00:16.606040                0.4   

     colsample_bytree  max_depth  min_child_weight  params_num_boost_round  \
99                0.7         15                 8                     800   
123               0.7         15                 8                     800   
125               0.7         15                 8                     800   
126               0.7         15                 8                     800   
127               0.7         15                 8                     800   
..                ...        ...               ...                     ...   
10                0.6         14                 7                     800   
40                0.5         15                 8                     800   
1                 0.5         15                 9                     700   
8                 0.5         14                 9                     600   
2                 0.5         14                 8                     600   

     reg_lambda  subsample     state  
99          6.5       0.80  COMPLETE  
123         6.5       0.80  COMPLETE  
125         6.5       0.80  COMPLETE  
126         6.5       0.80  COMPLETE  
127         6.5       0.80  COMPLETE  
..          ...        ...       ...  
10          6.5       0.80  COMPLETE  
40          7.5       0.80  COMPLETE  
1           5.5       0.78  COMPLETE  
8           7.5       0.80  COMPLETE  
2           5.5       0.80  COMPLETE  

[200 rows x 13 columns]
In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns

for i, hpo in enumerate(trials_df.columns):
    if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
                   'duration', 'n_estimators', 'state']:
        plt.figure(figsize=(14,6))
        if hpo != 'loss':
            sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
            plt.legend(loc=0)
            plt.title('{} Distribution'.format(hpo))
            plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
            plt.tight_layout()
            plt.show()
In [ ]:
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
Out[ ]:
{'num_boost_round': 800,
 'max_depth': 15,
 'subsample': 0.8,
 'reg_lambda': 6.5,
 'colsample_bytree': 0.7,
 'colsample_bylevel': 0.6,
 'min_child_weight': 8,
 'random_state': 42,
 'metric': 'rmse'}
In [ ]:
import pickle

best_model = XGBRegressor(objective='reg:squarederror',
                          booster='gbtree',
                          tree_method='gpu_hist',
                          gpu_id=0,
                          gamma=4.5,
                          learning_rate=0.04,
                          reg_alpha=0.01,
                          scale_pos_weight=1,
                          use_label_encoder=False,
                          verbosity=0,
                          **params)

best_model.fit(train_features, train_label)

Pkl_Filename = 'XGB_Optuna_trials_GPU_setParams_RTX4000_3CV_cuml_wb.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_model, file)

print('\nModel Metrics for XGBoost HPO GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(asnumpy(train_label), asnumpy(y_train_pred)),
        mean_absolute_error(asnumpy(test_label), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(train_label), asnumpy(y_train_pred)),
        mean_squared_error(asnumpy(test_label), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(train_label), asnumpy(y_train_pred),
                           squared=False),
        mean_squared_error(asnumpy(test_label), asnumpy(y_test_pred),
                           squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(asnumpy(train_label), asnumpy(y_train_pred)),
        r2_score(asnumpy(test_label), asnumpy(y_test_pred))))

Model Metrics for XGBoost HPO GPU trials
MAE train: 1762.822, test: 1886.988
MSE train: 5735179.124, test: 6538285.171
RMSE train: 2394.823, test: 2557.007
R^2 train: 0.938, test: 0.928
In [ ]:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(asnumpy(test_label),
                                                                                                      asnumpy(y_test_pred))))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
The best model from optimization scores 6538285.17141 MSE on the test set.
This was achieved using these conditions:
iteration                                         99
rmse                                     2830.049642
datetime_start            2023-03-02 00:42:33.333198
datetime_complete         2023-03-02 00:42:47.560224
duration                      0 days 00:00:14.227026
colsample_bylevel                                0.6
colsample_bytree                                 0.7
max_depth                                         15
min_child_weight                                   8
params_num_boost_round                           800
reg_lambda                                       6.5
subsample                                        0.8
state                                       COMPLETE
Name: 99, dtype: object

200 Trials 5-fold Cross Validation cuML

   Let's know try using 5-fold cross validation where the results are logged with W & B.

In [ ]:
wandb.login()

wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
                'group': 'optuna_xgbGPU_setParams',
                'save_code': 'False',
                'notes': 'xgbGPU_setParams_RTX4000_5CV_cuml',
                'tags': 'xgbGPU_setParams_RTX4000_5CV_cuml'}

wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
In [ ]:
@wandbc.track_in_wandb()

def objective(trial):
    """
    Objective function to tune a XGBoostRegressor model.
    """
    joblib.dump(study, 'XGB_Optuna_GPU_setParams_RTX4000_5CV_cuml_wb.pkl')

    params_xgb_optuna = {
        'booster': 'gbtree',
        'gpu_id': 0,
        'tree_method': 'gpu_hist',
        'gamma': 4.5,
        'learning_rate': 0.04,
        'reg_alpha': 0.01,
        'scale_pos_weight': 1,
        'use_label_encoder': False,
        'random_state': seed_value,
        'verbosity': 0,
        'n_estimators': trial.suggest_int('num_boost_round', 600, 800,
                                          step=100),
        'max_depth': trial.suggest_int('max_depth', 14, 15),
        'subsample': trial.suggest_float('subsample', 0.78, 0.8, step=0.02),
        'reg_lambda': trial.suggest_float('reg_lambda', 5.5, 7.5, step=1),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7,
                                                step=0.1),
        'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.4, 0.6,
                                                 step=0.1),
        'min_child_weight': trial.suggest_int('min_child_weight', 7, 9, step=1)
        }

    start = timer()
    cv_results = xgb.cv(params_xgb_optuna, dtrain, nfold=5,
                        metrics='rmse')
    run_time = timer() - start

    rmse = asnumpy(cv_results['test-rmse-mean'].values[-1])
    print('- Trial RMSE:', rmse)

    return rmse
In [ ]:
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_GPU_setParams_RTX4000_5CV_cuml_wb.pkl'):
    study = joblib.load('XGB_Optuna_GPU_setParams_RTX4000_5CV_cuml_wb.pkl')
else:
    study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200, callbacks=[wandbc])

end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
View run glowing-frost-1205 at: https://wandb.ai/aschultz/usedCars_hpo/runs/pdncbxf5
Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
Find logs at: ./wandb/run-20230228_155628-pdncbxf5/logs
Start Time           2023-02-28 14:44:48.093429
End Time             2023-02-28 15:56:49.418113
1:12:01


Number of finished trials: 200
Best trial: {'num_boost_round': 700, 'max_depth': 15, 'subsample': 0.8, 'reg_lambda': 5.5, 'colsample_bytree': 0.7, 'colsample_bylevel': 0.6, 'min_child_weight': 8}
Lowest RMSE 2795.958252

   This took over nine minutes longer than 3-fold cross validation, and a little longer than 5-fold cross validation without cuML. Now we can examine the study components.

In [ ]:
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'rmse'}, inplace=True)
trials_df.rename(columns={'params_colsample_bylevel': 'colsample_bylevel'},
                 inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
                 inplace=True)
trials_df.rename(columns={'params_gamma': 'gamma'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
                 inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_weight': 'min_child_weight'},
                 inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_reg_alpha': 'reg_alpha'}, inplace=True)
trials_df.rename(columns={'params_reg_lambda': 'reg_lambda'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
     iteration         rmse             datetime_start  \
99          99  2795.958252 2023-02-28 15:20:21.171642   
124        124  2795.958252 2023-02-28 15:29:31.269482   
125        125  2795.958252 2023-02-28 15:29:52.339900   
126        126  2795.958252 2023-02-28 15:30:18.475476   
127        127  2795.958252 2023-02-28 15:30:39.647347   
..         ...          ...                        ...   
9            9  2862.575244 2023-02-28 14:48:00.203020   
8            8  2872.083496 2023-02-28 14:47:40.127103   
5            5  2872.506933 2023-02-28 14:46:39.633133   
7            7  2891.287256 2023-02-28 14:47:19.982933   
59          59  2909.172119 2023-02-28 15:05:49.470613   

             datetime_complete               duration  colsample_bylevel  \
99  2023-02-28 15:20:38.003057 0 days 00:00:16.831415                0.6   
124 2023-02-28 15:29:48.118041 0 days 00:00:16.848559                0.6   
125 2023-02-28 15:30:09.532971 0 days 00:00:17.193071                0.6   
126 2023-02-28 15:30:35.419476 0 days 00:00:16.944000                0.6   
127 2023-02-28 15:30:56.491398 0 days 00:00:16.844051                0.6   
..                         ...                    ...                ...   
9   2023-02-28 14:48:16.337362 0 days 00:00:16.134342                0.4   
8   2023-02-28 14:47:55.800471 0 days 00:00:15.673368                0.5   
5   2023-02-28 14:46:55.703559 0 days 00:00:16.070426                0.4   
7   2023-02-28 14:47:35.827749 0 days 00:00:15.844816                0.4   
59  2023-02-28 15:06:05.230512 0 days 00:00:15.759899                0.6   

     colsample_bytree  max_depth  min_child_weight  params_num_boost_round  \
99                0.7         15                 8                     700   
124               0.7         15                 8                     700   
125               0.7         15                 8                     700   
126               0.7         15                 8                     700   
127               0.7         15                 8                     700   
..                ...        ...               ...                     ...   
9                 0.7         15                 8                     600   
8                 0.5         14                 7                     700   
5                 0.5         15                 7                     600   
7                 0.5         15                 7                     800   
59                0.5         14                 8                     700   

     reg_lambda  subsample     state  
99          5.5       0.80  COMPLETE  
124         5.5       0.80  COMPLETE  
125         5.5       0.80  COMPLETE  
126         5.5       0.80  COMPLETE  
127         5.5       0.80  COMPLETE  
..          ...        ...       ...  
9           7.5       0.78  COMPLETE  
8           5.5       0.78  COMPLETE  
5           5.5       0.78  COMPLETE  
7           6.5       0.78  COMPLETE  
59          7.5       0.80  COMPLETE  

[200 rows x 13 columns]
In [ ]:
for i, hpo in enumerate(trials_df.columns):
    if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
                   'duration', 'n_estimators', 'state']:
        plt.figure(figsize=(14,6))
        if hpo != 'loss':
            sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
            plt.legend(loc=0)
            plt.title('{} Distribution'.format(hpo))
            plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
            plt.tight_layout()
            plt.show()
In [ ]:
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
Out[ ]:
{'num_boost_round': 700,
 'max_depth': 15,
 'subsample': 0.8,
 'reg_lambda': 5.5,
 'colsample_bytree': 0.7,
 'colsample_bylevel': 0.6,
 'min_child_weight': 8,
 'random_state': 42,
 'metric': 'rmse'}
In [ ]:
best_model = XGBRegressor(objective='reg:squarederror',
                          booster='gbtree',
                          tree_method='gpu_hist',
                          gpu_id=0,
                          gamma=4.5,
                          learning_rate=0.04,
                          reg_alpha=0.01,
                          scale_pos_weight=1,
                          use_label_encoder=False,
                          verbosity=0,
                          **params)

best_model.fit(train_features, train_label)

Pkl_Filename = 'XGB_Optuna_trials_GPU_setParams_RTX4000_5CV_cuml_wb.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_model, file)

print('\nModel Metrics for XGBoost HPO GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(asnumpy(train_label), asnumpy(y_train_pred)),
        mean_absolute_error(asnumpy(test_label), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(train_label), asnumpy(y_train_pred)),
        mean_squared_error(asnumpy(test_label), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(train_label), asnumpy(y_train_pred),
                           squared=False),
        mean_squared_error(asnumpy(test_label), asnumpy(y_test_pred),
                           squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(asnumpy(train_label), asnumpy(y_train_pred)),
        r2_score(asnumpy(test_label), asnumpy(y_test_pred))))

Model Metrics for XGBoost HPO GPU trials
MAE train: 1745.962, test: 1879.259
MSE train: 5633745.232, test: 6482116.496
RMSE train: 2373.551, test: 2546.000
R^2 train: 0.939, test: 0.929
In [ ]:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(asnumpy(test_label),
                                                                                                      asnumpy(y_test_pred))))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
The best model from optimization scores 6482116.49609 MSE on the test set.
This was achieved using these conditions:
iteration                                         99
rmse                                     2795.958252
datetime_start            2023-02-28 15:20:21.171642
datetime_complete         2023-02-28 15:20:38.003057
duration                      0 days 00:00:16.831415
colsample_bylevel                                0.6
colsample_bytree                                 0.7
max_depth                                         15
min_child_weight                                   8
params_num_boost_round                           700
reg_lambda                                       5.5
subsample                                        0.8
state                                       COMPLETE
Name: 99, dtype: object

200 Trials 10-fold Cross Validation cuML

   Let's know try using 10-fold cross validation where the results are logged with W & B.

In [ ]:
wandb.login()

wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
                'group': 'optuna_xgbGPU_setParams',
                'save_code': 'False',
                'notes': 'xgbGPU_setParams_RTX4000_10CV_cuml',
                'tags': 'xgbGPU_setParams_RTX4000_10CV_cuml'}

wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
In [ ]:
@wandbc.track_in_wandb()

def objective(trial):
    """
    Objective function to tune a XGBoostRegressor model.
    """
    joblib.dump(study, 'XGB_Optuna_GPU_setParams_RTX4000_10CV_cuml_wb.pkl')

    params_xgb_optuna = {
        'booster': 'gbtree',
        'gpu_id': 0,
        'tree_method': 'gpu_hist',
        'gamma': 4.5,
        'learning_rate': 0.04,
        'reg_alpha': 0.01,
        'scale_pos_weight': 1,
        'use_label_encoder': False,
        'random_state': seed_value,
        'verbosity': 0,
        'n_estimators': trial.suggest_int('num_boost_round', 600, 800,
                                          step=100),
        'max_depth': trial.suggest_int('max_depth', 14, 15),
        'subsample': trial.suggest_float('subsample', 0.78, 0.8, step=0.02),
        'reg_lambda': trial.suggest_float('reg_lambda', 5.5, 7.5, step=1),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7,
                                                step=0.1),
        'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.4, 0.6,
                                                 step=0.1),
        'min_child_weight': trial.suggest_int('min_child_weight', 7, 9, step=1)
        }

    start = timer()
    cv_results = xgb.cv(params_xgb_optuna, dtrain, nfold=10,
                        metrics='rmse')
    run_time = timer() - start

    rmse = asnumpy(cv_results['test-rmse-mean'].values[-1])
    print('- Trial RMSE:', rmse)

    return rmse
In [ ]:
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_GPU_setParams_RTX4000_10CV_cuml_wb.pkl'):
    study = joblib.load('XGB_Optuna_GPU_setParams_RTX4000_10CV_cuml_wb.pkl')
else:
    study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200, callbacks=[wandbc])

end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
View run serene-darkness-1933 at: https://wandb.ai/aschultz/usedCars_hpo/runs/sui2q48e
Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
Find logs at: ./wandb/run-20230301_050634-sui2q48e/logs
Start Time           2023-03-01 03:36:22.256324
End Time             2023-03-01 05:07:01.810225
1:30:39


Number of finished trials: 200
Best trial: {'num_boost_round': 800, 'max_depth': 15, 'subsample': 0.8, 'reg_lambda': 5.5, 'colsample_bytree': 0.7, 'colsample_bylevel': 0.6, 'min_child_weight': 8}
Lowest RMSE 2779.5969726

   This took over 30 minutes longer than 3-fold cross validation and 20 minutes longer than 5-fold cross validation. Now we can examine the study components.

In [ ]:
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'rmse'}, inplace=True)
trials_df.rename(columns={'params_colsample_bylevel': 'colsample_bylevel'},
                 inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
                 inplace=True)
trials_df.rename(columns={'params_gamma': 'gamma'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
                 inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_weight': 'min_child_weight'},
                 inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_reg_alpha': 'reg_alpha'}, inplace=True)
trials_df.rename(columns={'params_reg_lambda': 'reg_lambda'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
     iteration         rmse             datetime_start  \
99          99  2779.596973 2023-03-01 04:21:05.837371   
123        123  2779.596973 2023-03-01 04:32:04.899252   
124        124  2779.596973 2023-03-01 04:32:31.905454   
125        125  2779.596973 2023-03-01 04:32:58.923757   
126        126  2779.596973 2023-03-01 04:33:25.987929   
..         ...          ...                        ...   
4            4  2845.318994 2023-03-01 03:38:02.492300   
50          50  2856.144531 2023-03-01 03:58:26.023370   
17          17  2870.213916 2023-03-01 03:43:40.809645   
1            1  2875.819263 2023-03-01 03:36:48.090838   
5            5  2918.277612 2023-03-01 03:38:26.544612   

             datetime_complete               duration  colsample_bylevel  \
99  2023-03-01 04:21:29.309840 0 days 00:00:23.472469                0.6   
123 2023-03-01 04:32:27.672937 0 days 00:00:22.773685                0.6   
124 2023-03-01 04:32:54.467968 0 days 00:00:22.562514                0.6   
125 2023-03-01 04:33:21.451939 0 days 00:00:22.528182                0.6   
126 2023-03-01 04:33:48.586988 0 days 00:00:22.599059                0.6   
..                         ...                    ...                ...   
4   2023-03-01 03:38:22.451948 0 days 00:00:19.959648                0.6   
50  2023-03-01 03:58:47.492613 0 days 00:00:21.469243                0.5   
17  2023-03-01 03:43:59.645893 0 days 00:00:18.836248                0.5   
1   2023-03-01 03:37:08.220477 0 days 00:00:20.129639                0.4   
5   2023-03-01 03:38:45.299783 0 days 00:00:18.755171                0.4   

     colsample_bytree  max_depth  min_child_weight  params_num_boost_round  \
99                0.7         15                 8                     800   
123               0.7         15                 8                     800   
124               0.7         15                 8                     800   
125               0.7         15                 8                     800   
126               0.7         15                 8                     800   
..                ...        ...               ...                     ...   
4                 0.6         14                 9                     800   
50                0.5         15                 8                     700   
17                0.5         14                 9                     700   
1                 0.5         15                 8                     600   
5                 0.5         14                 7                     700   

     reg_lambda  subsample     state  
99          5.5       0.80  COMPLETE  
123         5.5       0.80  COMPLETE  
124         5.5       0.80  COMPLETE  
125         5.5       0.80  COMPLETE  
126         5.5       0.80  COMPLETE  
..          ...        ...       ...  
4           7.5       0.80  COMPLETE  
50          6.5       0.78  COMPLETE  
17          6.5       0.80  COMPLETE  
1           6.5       0.78  COMPLETE  
5           7.5       0.80  COMPLETE  

[200 rows x 13 columns]
In [ ]:
for i, hpo in enumerate(trials_df.columns):
    if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
                   'duration', 'n_estimators', 'state']:
        plt.figure(figsize=(14,6))
        if hpo != 'loss':
            sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
            plt.legend(loc=0)
            plt.title('{} Distribution'.format(hpo))
            plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
            plt.tight_layout()
            plt.show()
In [ ]:
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
Out[ ]:
{'num_boost_round': 800,
 'max_depth': 15,
 'subsample': 0.8,
 'reg_lambda': 5.5,
 'colsample_bytree': 0.7,
 'colsample_bylevel': 0.6,
 'min_child_weight': 8,
 'random_state': 42,
 'metric': 'rmse'}
In [ ]:
best_model = XGBRegressor(objective='reg:squarederror',
                          booster='gbtree',
                          tree_method='gpu_hist',
                          gpu_id=0,
                          gamma=4.5,
                          learning_rate=0.04,
                          reg_alpha=0.01,
                          scale_pos_weight=1,
                          use_label_encoder=False,
                          verbosity=0,
                          **params)

best_model.fit(train_features, train_label)

Pkl_Filename = 'XGB_Optuna_trials_GPU_setParams_RTX4000_10CV_cuml_wb.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_model, file)

print('\nModel Metrics for XGBoost HPO GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(asnumpy(train_label), asnumpy(y_train_pred)),
        mean_absolute_error(asnumpy(test_label), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(train_label), asnumpy(y_train_pred)),
        mean_squared_error(asnumpy(test_label), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(train_label), asnumpy(y_train_pred),
                           squared=False),
        mean_squared_error(asnumpy(test_label), asnumpy(y_test_pred),
                           squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(asnumpy(train_label), asnumpy(y_train_pred)),
        r2_score(asnumpy(test_label), asnumpy(y_test_pred))))

Model Metrics for XGBoost HPO GPU trials
MAE train: 1745.962, test: 1879.259
MSE train: 5633745.232, test: 6482116.496
RMSE train: 2373.551, test: 2546.000
R^2 train: 0.939, test: 0.929
In [ ]:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(asnumpy(test_label),
                                                                                                      asnumpy(y_test_pred))))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
The best model from optimization scores 6482116.49609 MSE on the test set.
This was achieved using these conditions:
iteration                                         99
rmse                                     2779.596973
datetime_start            2023-03-01 04:21:05.837371
datetime_complete         2023-03-01 04:21:29.309840
duration                      0 days 00:00:23.472469
colsample_bylevel                                0.6
colsample_bytree                                 0.7
max_depth                                         15
min_child_weight                                   8
params_num_boost_round                           800
reg_lambda                                       5.5
subsample                                        0.8
state                                       COMPLETE
Name: 99, dtype: object

Train/Test - cuML

200 Trials Train/Test DMatrix

   Let's know try using train/test split to compare to cross validation where the results are logged with W & B.

In [ ]:
wandb.login()

wandb_kwargs = {'project': 'usedCars_hpo', 'entity': 'aschultz',
                'group': 'optuna_xgbGPU_setParams',
                'save_code': 'False',
                'notes': 'xgbGPU_setParams_RTX4000_trainTest_DMatrix_cuml',
                'tags': 'xgbGPU_setParams_RTX4000_trainTest_DMatrix_cuml'}

wandbc = WeightsAndBiasesCallback(wandb_kwargs=wandb_kwargs, as_multirun=True)
In [ ]:
@wandbc.track_in_wandb()

def objective(trial):
    """
    Objective function to tune a XGBoostRegressor model.
    """
    joblib.dump(study, 'XGB_Optuna_GPU_setParams_RTX4000_trainTest_DMatrix_cuml_wb.pkl')

    params_xgb_optuna = {
        'objective': 'reg:squarederror',
        'booster': 'gbtree',
        'gpu_id': 0,
        'tree_method': 'gpu_hist',
        'gamma': 4.5,
        'learning_rate': 0.04,
        'reg_alpha': 0.01,
        'scale_pos_weight': 1,
        'use_label_encoder': False,
        'random_state': seed_value,
        'verbosity': 0,
        'verbose': False,
        'n_estimators': trial.suggest_int('n_estimators', 600, 800, step=100),
        'max_depth': trial.suggest_int('max_depth', 14, 15),
        'subsample': trial.suggest_float('subsample', 0.78, 0.8, step=0.02),
        'reg_lambda': trial.suggest_float('reg_lambda', 5.5, 7.5, step=1),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.7,
                                                step=0.1),
        'colsample_bylevel': trial.suggest_float('colsample_bylevel', 0.4, 0.6,
                                                 step=0.1),
        'min_child_weight': trial.suggest_int('min_child_weight', 7, 9, step=1)
        }

    start = timer()
    model = xgb.train(params_xgb_optuna, dtrain)
    run_time = timer() - start

    y_pred_val = model.predict(dtest)
    rmse = mean_squared_error(asnumpy(test_label), asnumpy(y_pred_val),
                              squared=False)
    print('Trial RMSE:', rmse)

    return rmse
In [ ]:
start_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
if os.path.isfile('XGB_Optuna_GPU_setParams_RTX4000_trainTest_DMatrix_cuml_wb.pkl'):
    study = joblib.load('XGB_Optuna_GPU_setParams_RTX4000_trainTest_DMatrix_cuml_wb.pkl')
else:
    study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200, callbacks=[wandbc])

end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
wandb.finish()
View run fragrant-sound-1733 at: https://wandb.ai/aschultz/usedCars_hpo/runs/tvk5rytm
Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
Find logs at: ./wandb/run-20230301_033137-tvk5rytm/logs
Start Time           2023-03-01 02:36:12.779900
End Time             2023-03-01 03:31:53.950927
0:55:41


Number of finished trials: 200
Best trial: {'n_estimators': 600, 'max_depth': 15, 'subsample': 0.8, 'reg_lambda': 5.5, 'colsample_bytree': 0.7, 'colsample_bylevel': 0.5, 'min_child_weight': 8}
Lowest RMSE 2762.6793216888777

   This took seven minutes less than 3-fold cross validation and 17 minutes shorter than 5-fold cross validation and 35 minutes shorter than 10-fold cross validation. This is potential justification for not using cross validation when using a smaller set size and the strength of GPU matrix. Now we can examine the study components.

In [ ]:
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'rmse'}, inplace=True)
trials_df.rename(columns={'params_colsample_bylevel': 'colsample_bylevel'},
                 inplace=True)
trials_df.rename(columns={'params_colsample_bytree': 'colsample_bytree'},
                 inplace=True)
trials_df.rename(columns={'params_gamma': 'gamma'}, inplace=True)
trials_df.rename(columns={'params_learning_rate': 'learning_rate'},
                 inplace=True)
trials_df.rename(columns={'params_max_depth': 'max_depth'}, inplace=True)
trials_df.rename(columns={'params_min_child_weight': 'min_child_weight'},
                 inplace=True)
trials_df.rename(columns={'params_n_estimators': 'n_estimators'}, inplace=True)
trials_df.rename(columns={'params_reg_alpha': 'reg_alpha'}, inplace=True)
trials_df.rename(columns={'params_reg_lambda': 'reg_lambda'}, inplace=True)
trials_df.rename(columns={'params_subsample': 'subsample'}, inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
     iteration         rmse             datetime_start  \
99          99  2762.679322 2023-03-01 03:03:07.239986   
198        198  2762.679322 2023-03-01 03:31:21.313166   
100        100  2762.679322 2023-03-01 03:03:23.299652   
101        101  2762.679322 2023-03-01 03:03:39.315749   
102        102  2762.679322 2023-03-01 03:03:55.433045   
..         ...          ...                        ...   
38          38  2827.040851 2023-03-01 02:46:30.861878   
20          20  2837.609561 2023-03-01 02:41:35.725402   
39          39  2841.753977 2023-03-01 02:46:46.955180   
5            5  2847.695776 2023-03-01 02:37:31.219537   
2            2  2871.813494 2023-03-01 02:36:43.302704   

             datetime_complete               duration  colsample_bylevel  \
99  2023-03-01 03:03:18.993118 0 days 00:00:11.753132                0.5   
198 2023-03-01 03:31:33.189076 0 days 00:00:11.875910                0.5   
100 2023-03-01 03:03:35.054647 0 days 00:00:11.754995                0.5   
101 2023-03-01 03:03:51.139175 0 days 00:00:11.823426                0.5   
102 2023-03-01 03:04:07.115263 0 days 00:00:11.682218                0.5   
..                         ...                    ...                ...   
38  2023-03-01 02:46:42.165483 0 days 00:00:11.303605                0.4   
20  2023-03-01 02:41:47.043198 0 days 00:00:11.317796                0.5   
39  2023-03-01 02:46:58.525654 0 days 00:00:11.570474                0.4   
5   2023-03-01 02:37:42.552390 0 days 00:00:11.332853                0.4   
2   2023-03-01 02:36:54.444195 0 days 00:00:11.141491                0.4   

     colsample_bytree  max_depth  min_child_weight  n_estimators  reg_lambda  \
99                0.7         15                 8           600         5.5   
198               0.7         15                 8           700         5.5   
100               0.7         15                 8           800         5.5   
101               0.7         15                 8           800         5.5   
102               0.7         15                 8           800         5.5   
..                ...        ...               ...           ...         ...   
38                0.6         15                 9           700         6.5   
20                0.6         14                 8           600         6.5   
39                0.5         15                 7           800         5.5   
5                 0.6         15                 8           800         7.5   
2                 0.5         14                 7           700         5.5   

     subsample     state  
99        0.80  COMPLETE  
198       0.80  COMPLETE  
100       0.80  COMPLETE  
101       0.80  COMPLETE  
102       0.80  COMPLETE  
..         ...       ...  
38        0.78  COMPLETE  
20        0.78  COMPLETE  
39        0.80  COMPLETE  
5         0.78  COMPLETE  
2         0.78  COMPLETE  

[200 rows x 13 columns]
In [ ]:
for i, hpo in enumerate(trials_df.columns):
    if hpo not in ['iteration', 'rmse', 'datetime_start', 'datetime_complete',
                   'duration', 'n_estimators', 'state']:
        plt.figure(figsize=(14,6))
        if hpo != 'loss':
            sns.kdeplot(trials_df[hpo], label='Bayes Optimization: Optuna')
            plt.legend(loc=0)
            plt.title('{} Distribution'.format(hpo))
            plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
            plt.tight_layout()
            plt.show()
In [ ]:
params = study.best_params
params['random_state'] = seed_value
params['metric'] = 'rmse'
params
Out[ ]:
{'n_estimators': 600,
 'max_depth': 15,
 'subsample': 0.8,
 'reg_lambda': 5.5,
 'colsample_bytree': 0.7,
 'colsample_bylevel': 0.5,
 'min_child_weight': 8,
 'random_state': 42,
 'metric': 'rmse'}
In [ ]:
best_model = XGBRegressor(objective='reg:squarederror',
                          booster='gbtree',
                          tree_method='gpu_hist',
                          gpu_id=0,
                          gamma=4.5,
                          learning_rate=0.04,
                          reg_alpha=0.01,
                          scale_pos_weight=1,
                          use_label_encoder=False,
                          verbosity=0,
                          **params)

best_model.fit(train_features, train_label)

Pkl_Filename = 'XGB_Optuna_trials_GPU_setParams_RTX4000_trainTest_DMatrix_cuml_wb.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_model, file)

print('\nModel Metrics for XGBoost HPO GPU trials')
y_train_pred = best_model.predict(train_features)
y_test_pred = best_model.predict(test_features)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(asnumpy(train_label), asnumpy(y_train_pred)),
        mean_absolute_error(asnumpy(test_label), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(train_label), asnumpy(y_train_pred)),
        mean_squared_error(asnumpy(test_label), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(train_label), asnumpy(y_train_pred),
                           squared=False),
        mean_squared_error(asnumpy(test_label), asnumpy(y_test_pred),
                           squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(asnumpy(train_label), asnumpy(y_train_pred)),
        r2_score(asnumpy(test_label), asnumpy(y_test_pred))))

Model Metrics for XGBoost HPO GPU trials
MAE train: 1056.376, test: 1690.577
MSE train: 2117904.858, test: 5389903.354
RMSE train: 1455.302, test: 2321.617
R^2 train: 0.977, test: 0.941
In [ ]:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(asnumpy(test_label),
                                                                                                      asnumpy(y_test_pred))))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
The best model from optimization scores 5389903.35376 MSE on the test set.
This was achieved using these conditions:
iteration                                    99
rmse                                2762.679322
datetime_start       2023-03-01 03:03:07.239986
datetime_complete    2023-03-01 03:03:18.993118
duration                 0 days 00:00:11.753132
colsample_bylevel                           0.5
colsample_bytree                            0.7
max_depth                                    15
min_child_weight                              8
n_estimators                                600
reg_lambda                                  5.5
subsample                                   0.8
state                                  COMPLETE
Name: 99, dtype: object

Set Params GPU Test - cuML: LightGBM 500 Trials

   To utilize the GPU for LightGBM with RAPIDS and a CUDA compiler on Paperspace, the notebook containing how the environment set up was completed is here.

The steps to complete this are the following:

In a Linux terminal,

  1. sudo apt update && sudo apt upgradesudo apt-get -y install cmake
  2. sudo apt-get install -y -qq libboost-all-dev libboost-system-dev libboost-filesystem-dev
  3. sudo apt-get install --no-install-recommends nvidia-opencl-dev opencl-headers
  4. sudo apt install glibc-source -y

In a notebook,

  1. git clone --recursive https://github.com/microsoft/LightGBM
  2. %cd /notebooks/LightGBM
  3. !sudo mkdir build
  4. !sudo mkdir -p /etc/OpenCL/vendors && echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd

  5. %cd /notebooks/LightGBM/build

  6. !sudo cmake -DUSE_GPU=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ ..
  7. !make -j4
  8. !sudo apt-get -y install python3-pip
  9. !sudo -H pip install setuptools joblib wandb optuna datetime plotly eli5 shap -U
  10. %cd /notebooks/LightGBM/python-package/
  11. !python setup.py install --gpu --opencl-include-dir=/usr/local/cuda/include/
  12. !pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
  13. !pip install cuml-cu11 --extra-index-url=https://pypi.ngc.nvidia.com

Once completed, the dependencies were imported.

   The project results can be found here. The runtime lengths using different GPUswere the following:

  • RTX A4000 - 2:21:28
  • Quadro RTX 5000 - 2:50:45
  • Quadro RTX 4000 - 3:00:01
  • Quadro P5000 - 4:39:26

Set Params GPU Test - cuML: CatBoost 500 Trials

   The project results can be found here. The runtime lengths using different GPUswere the following:

  • RTX A4000 - 3:21:19
  • Quadro RTX 5000 - 3:49:51
  • Quadro RTX 4000 - 4:01:33
  • Quadro P5000 - 4:14:14

Random Forest

   Random Forests utilize an ensemble of decision trees and with a technique called bagging, or bootstrap aggregation, which consists of selecting random samples with replacement as the training data for a model, fitting decisions trees for each subset and then aggregating the predictions. This approach minimizes overfitting and model variance since more randomness and combinations of features are evaluated. Decision trees use specified set of features and have a high likelihood of overfitting.

   Let's go back to Paperspace to utilize the RAPIDS Docker container. We can installing/importing the dependencies, setting the options and seed, set up the local CUDA cluster for Dask, query the client for all connected workers. and the read data.

Baseline Model

   The notebooks are here.

   Let's now fit the baseline model to the data, save as a .pkl file and evaluate the model metrics for the baseline model.

In [ ]:
from cuml.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_streams=1, random_state=seed_value)

rf.fit(X_train, y_train)

Pkl_Filename = 'RF_Baseline.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(rf, file)

print('\nModel Metrics for RF Baseline')
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(y_train.to_numpy(), y_train_pred.to_numpy()),
        mean_absolute_error(y_test.to_numpy(), y_test_pred.to_numpy())))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train.to_numpy(), y_train_pred.to_numpy()),
        mean_squared_error(y_test.to_numpy(), y_test_pred.to_numpy())))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train.to_numpy(), y_train_pred.to_numpy(),
                           squared=False),
        mean_squared_error(y_test.to_numpy(), y_test_pred.to_numpy(),
                           squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(y_train.to_numpy(), y_train_pred.to_numpy()),
        r2_score(y_test.to_numpy(), y_test_pred.to_numpy())))

Model Metrics for RF Baseline
MAE train: 1558.643, test: 1863.607
MSE train: 4666522.856, test: 6391927.032
RMSE train: 2160.214, test: 2528.226
R^2 train: 0.949, test: 0.930

Optuna 300 Trials Train/Test

   Now, we can define a function train_and_eval to set up the train/test sets and define the model/parameters that will be tested during the search:

  • bootstrap: If True, each tree in the forest is built on a bootstrapped sample with replacement. Default=True.
  • n_estimators: Number of trees in the forest. Default=100.
  • max_depth: Maximum tree depth. Default=16.
  • max_leaves: Maximum leaf nodes per tree. Default=-1.
  • min_samples_leaf: The minimum number of samples (rows) in each leaf node. Default=1.
  • min_samples_split: The minimum number of samples required to split an internal node. Default=2.
  • n_bins: Maximum number of bins used by the split algorithm per feature. Default=128.
  • n_streams: Number of parallel streams used for forest building. Default=4.

   Then the model will be trained using the training set, predictions made using the test set and then the model evaluated for the rmse with the test set versus the predicted one.

   We can now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.

In [ ]:
fig, axs = plt.subplots(1, 4, figsize=(20,5))
i = 0
axs = axs.flatten()
for i, hpo in enumerate(['max_leaves', 'min_samples_leaf', 'min_samples_split',
                         'n_bins']):
    sns.regplot('iteration', hpo, data=trials_df, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

   Over the study, the max_leaves, min_samples_split and n_bins parameters increased while min_samples_leaf decreased.

   Let's now use plot_param_importances to visualize the parameter importances.

In [ ]:
fig = optuna.visualization.plot_param_importances(study)
fig.show()

   min_samples_leaf was the most important hyperparameter followed by max_leaves.

   We can also use plot_edf to examine the empirical distribution.

In [ ]:
fig = optuna.visualization.plot_edf(study)
fig.show()

   Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then saving as a .pkl file. Then we can evaluate the model metrics for both the train and the test sets using the MAE, MSE, RMSE and the (R²).

In [ ]:
params = study.best_params
params['random_state'] = seed_value
params
Out[ ]:
{'n_estimators': 1812,
 'max_depth': 143,
 'max_leaves': 1914,
 'min_samples_leaf': 120,
 'min_samples_split': 122,
 'n_bins': 936,
 'random_state': 42}
In [ ]:
import pickle
from sklearn.metrics import mean_absolute_error, r2_score

best_model = RandomForestRegressor(n_streams=1, **params)

best_model.fit(X_train, y_train)

Pkl_Filename = 'RF_Optuna_trials300_GPU_paramsHi2.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_model, file)

print('\nModel Metrics for RF HPO 300 GPU trials')
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(y_train.to_numpy(), y_train_pred.to_numpy()),
        mean_absolute_error(y_test.to_numpy(), y_test_pred.to_numpy())))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train.to_numpy(), y_train_pred.to_numpy()),
        mean_squared_error(y_test.to_numpy(), y_test_pred.to_numpy())))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train.to_numpy(), y_train_pred.to_numpy(),
                           squared=False),
        mean_squared_error(y_test.to_numpy(), y_test_pred.to_numpy(),
                           squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(y_train.to_numpy(), y_train_pred.to_numpy()),
        r2_score(y_test.to_numpy(), y_test_pred.to_numpy())))

Model Metrics for RF HPO 300 GPU trials
MAE train: 2046.867, test: 2046.867
MSE train: 7527952.387, test: 7527952.387
RMSE train: 2743.711, test: 2743.711
R^2 train: 0.918, test: 0.918

   Tuning did not prove to be worthwhile when higher errors and lower R² compared to baseline.

   We can also evaluate the MSE on the test set and determine when this was achieved:

In [ ]:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(y_test.to_numpy(),
                                                                                                      y_test_pred.to_numpy())))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
The best model from optimization scores 7527952.38666 MSE on the test set.
This was achieved using these conditions:
iteration                                   209
rmse                                2743.711426
datetime_start       2023-02-25 23:57:01.935184
datetime_complete    2023-02-26 00:05:42.675719
duration                 0 days 00:08:40.740535
max_depth                                 143.0
max_leaves                               1914.0
min_samples_leaf                          120.0
min_samples_split                         122.0
n_bins                                    936.0
n_estimators                             1812.0
state                                  COMPLETE
Name: 209, dtype: object

K-Nearest Neighbor Regression (KNR)

Baseline Model

   The notebook is located here. Let's now fit the baseline model to the data, save as a .pkl file and evaluate the model metrics for the baseline model.

In [ ]:
from cuml.neighbors import KNeighborsRegressor

knr = KNeighborsRegressor()

knr.fit(X_train, y_train)

Pkl_Filename = 'KNR_Baseline.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(knr, file)

print('\nModel Metrics for KNR Baseline')
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(y_train.to_numpy(), y_train_pred.to_numpy()),
        mean_absolute_error(y_test.to_numpy(), y_test_pred.to_numpy())))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train.to_numpy(), y_train_pred.to_numpy()),
        mean_squared_error(y_test.to_numpy(), y_test_pred.to_numpy())))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train.to_numpy(), y_train_pred.to_numpy(),
                           squared=False),
        mean_squared_error(y_test.to_numpy(), y_test_pred.to_numpy(),
                           squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(y_train.to_numpy(), y_train_pred.to_numpy()),
        r2_score(y_test.to_numpy(), y_test_pred.to_numpy())))

Model Metrics for KNR Baseline
MAE train: 2969.153, test: 3675.465
MSE train: 16193626.343, test: 24731241.269
RMSE train: 4024.131, test: 4973.052
R^2 train: 0.824, test: 0.729

Optuna 1000 Train/Test Trials

   Now, we can define a function train_and_eval to set up the train/test sets and define the model/parameters that will be tested during the search:

  • n_neighbors: Default number of neighbors to query. Default=5.
  • metric: Distance metric to use. Options are 'euclidean', 'manhattan', 'chebyshev', 'minkowski'.

   Then the model will be trained using the training set, predictions made using the test set and then the model evaluated for the rmse with the test set versus the predicted one.

In [ ]:
from timeit import default_timer as timer
from sklearn.metrics import mean_squared_error

def train_and_eval(X_param, y_param, n_neighbors=10,
                   metric='euclidean', verbose=False):
    """
    Partition data into train/test sets, train and evaluate the model
    for the given parameters.

    Params
    ______

    X_param:  DataFrame.
              The data to use for training and testing.
    y_param:  Series.
              The label for training

    Returns
    score: RMSE of the fitted model
    """
    X_train, y_train = trainDF.drop('price',
                                    axis=1), trainDF['price'].astype('int32')
    X_train = cudf.get_dummies(X_train)
    X_train = X_train.astype('float32')

    X_test, y_test = testDF.drop('price',
                                 axis=1), testDF['price'].astype('int32')
    X_test = cudf.get_dummies(X_test)
    X_test = X_test.astype('float32')

    model = KNeighborsRegressor(n_neighbors=n_neighbors,
                                metric=metric,
                                verbose=verbose)

    start = timer()
    model.fit(X_train, y_train)
    run_time = timer() - start

    y_pred = model.predict(X_test)
    score = mean_squared_error(y_test.to_numpy(), y_pred.to_numpy(),
                               squared=False)

    return score
In [ ]:
print('Score with default parameters : ', train_and_eval(X_train, y_train))
Score with default parameters :  4384.845258618256

   As with other hyperparameter optimizations, an objective function with the study .pkl and the range of the parameters to be tested needs to be defined.

In [ ]:
import optuna
from optuna import Trial
optuna.logging.set_verbosity(optuna.logging.WARNING)
import joblib

def objective(trial, X_param, y_param):

    joblib.dump(study, 'KNR_Optuna_1000_GPU.pkl')

    n_neighbors = trial.suggest_int('n_neighbors', 2, 100)
    metric = trial.suggest_categorical('metric', ['euclidean', 'manhattan',
                                                  'chebyshev', 'minkowski'])

    score = train_and_eval(X_param, y_param,
                           n_neighbors=n_neighbors,
                           verbose=False)

    return score
In [ ]:
with timed('dask_optuna'):
    start_time = datetime.now()
    print('%-20s %s' % ('Start Time', start_time))
    if os.path.isfile('KNR_Optuna_1000_GPU.pkl'):
        study = joblib.load('KNR_Optuna_1000_GPU.pkl')
    else:
        study = optuna.create_study(sampler=optuna.samplers.TPESampler(),
                                    study_name=study_name,
                                    direction='minimize')
    with parallel_backend('dask'):
        study.optimize(lambda trial: objective(trial, X_train, y_train),
                       n_trials=1000,
                       n_jobs=n_workers)
end_time = datetime.now()
print('%-20s %s' % ('Start Time', start_time))
print('%-20s %s' % ('End Time', end_time))
print(str(timedelta(seconds=(end_time-start_time).seconds)))
print('\n')
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
print('Lowest RMSE', study.best_value)
Start Time           2023-02-25 05:09:57.615216
End Time             2023-02-25 06:30:19.216141
1:20:21


Number of finished trials: 1000
Best trial: {'n_neighbors': 2, 'metric': 'manhattan'}
Lowest RMSE 3070.8987010700253

   Let's now extract the trial number, rmse and hyperparameter values into a pandas.Dataframe and sort with the lowest error first.

In [ ]:
trials_df = study.trials_dataframe()
trials_df.rename(columns={'number': 'iteration'}, inplace=True)
trials_df.rename(columns={'value': 'rmse'}, inplace=True)
trials_df.rename(columns={'params_metric': 'metric'}, inplace=True)
trials_df.rename(columns={'params_n_neighbors': 'n_neighbors'}, inplace=True)
trials_df = trials_df.sort_values('rmse', ascending=True)
print(trials_df)
     iteration         rmse             datetime_start  \
706        706  3070.898701 2023-02-25 06:06:16.039692   
654        654  3070.898701 2023-02-25 06:02:04.757080   
657        657  3070.898701 2023-02-25 06:02:19.218560   
264        264  3070.898701 2023-02-25 05:30:46.428542   
660        660  3070.898701 2023-02-25 06:02:33.634623   
..         ...          ...                        ...   
8            8  5168.263875 2023-02-25 05:10:34.630532   
715        715  5168.263875 2023-02-25 06:06:59.793529   
109        109  5175.166948 2023-02-25 05:18:30.367756   
397        397  5175.166948 2023-02-25 05:41:21.148965   
963        963  5178.840228 2023-02-25 06:27:15.950894   

             datetime_complete               duration     metric  n_neighbors  \
706 2023-02-25 06:06:20.842922 0 days 00:00:04.803230  chebyshev            2   
654 2023-02-25 06:02:09.519697 0 days 00:00:04.762617  manhattan            2   
657 2023-02-25 06:02:24.009119 0 days 00:00:04.790559  chebyshev            2   
264 2023-02-25 05:30:51.115612 0 days 00:00:04.687070  manhattan            2   
660 2023-02-25 06:02:38.407588 0 days 00:00:04.772965  manhattan            2   
..                         ...                    ...        ...          ...   
8   2023-02-25 05:10:39.541621 0 days 00:00:04.911089  euclidean           97   
715 2023-02-25 06:07:05.447919 0 days 00:00:05.654390  manhattan           97   
109 2023-02-25 05:18:35.776791 0 days 00:00:05.409035  chebyshev           99   
397 2023-02-25 05:41:26.569814 0 days 00:00:05.420849  manhattan           99   
963 2023-02-25 06:27:21.492224 0 days 00:00:05.541330  minkowski          100   

        state  
706  COMPLETE  
654  COMPLETE  
657  COMPLETE  
264  COMPLETE  
660  COMPLETE  
..        ...  
8    COMPLETE  
715  COMPLETE  
109  COMPLETE  
397  COMPLETE  
963  COMPLETE  

[1000 rows x 8 columns]

   Let's utilize plot_parallel_coordinate to plot the high-dimensional parameter relationships contained within the search and plot_slice to compare the objective value and individal parameters.

In [ ]:
fig = optuna.visualization.plot_parallel_coordinate(study)
fig.show()
In [ ]:
fig = optuna.visualization.plot_slice(study)
fig.show()

   It is difficult to determine which metric performs the best off this slice plot, but a model that uses a lower number of n_neighbors results in lower objective values.

   Let's now utilize plot_edf to visualize the empirical distribution of the study.

In [ ]:
fig = optuna.visualization.plot_edf(study)
fig.show()

   Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then saving as a .pkl file. Then we can evaluate the model metrics for both the train and the test sets using the MAE, MSE, RMSE and the (R²).

In [ ]:
params = study.best_params
params
Out[ ]:
{'n_neighbors': 2, 'metric': 'manhattan'}
In [ ]:
best_model = KNeighborsRegressor(n_neighbors=2, metric='manhattan')

best_model.fit(X_train, y_train)

Pkl_Filename = 'KNR_Optuna_trials1000_GPU_man.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_model, file)

print('\nModel Metrics for KNR HPO 1000 GPU trials - Manhattan')
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(y_train.to_numpy(), y_train_pred.to_numpy()),
        mean_absolute_error(y_test.to_numpy(), y_test_pred.to_numpy())))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train.to_numpy(), y_train_pred.to_numpy()),
        mean_squared_error(y_test.to_numpy(), y_test_pred.to_numpy())))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train.to_numpy(), y_train_pred.to_numpy(),
                           squared=False),
        mean_squared_error(y_test.to_numpy(), y_test_pred.to_numpy(),
                           squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(y_train.to_numpy(), y_train_pred.to_numpy()),
        r2_score(y_test.to_numpy(), y_test_pred.to_numpy())))

Model Metrics for KNR HPO 1000 GPU trials - Manhattan
MAE train: 1984.275, test: 1984.275
MSE train: 8089934.424, test: 8089934.424
RMSE train: 2844.281, test: 2844.281
R^2 train: 0.912, test: 0.912

   Tuning proved to be worthwhile when there are lower errors and higher R² for both the train/test sets compared to the baseline model.

   We can also evaluate the MSE on the test set and determine when this was achieved:

In [ ]:
print('The best model from optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(y_test.to_numpy(),
                                                                                                      y_test_pred.to_numpy())))
print('This was achieved using these conditions:')
print(trials_df.iloc[0])
The best model from optimization scores 9849685.25368 MSE on the test set.
This was achieved using these conditions:
iteration                                   706
rmse                                3070.898701
datetime_start       2023-02-25 06:06:16.039692
datetime_complete    2023-02-25 06:06:20.842922
duration                 0 days 00:00:04.803230
metric                                chebyshev
n_neighbors                                   2
state                                  COMPLETE
Name: 706, dtype: object

Comments

comments powered by Disqus