The United States (U.S.) has been one of the greatest advocates for free trade since the end of World War II, and the importance of it to the U.S. economy cannot be underestimated. As many as 40 million jobs in the U.S. are directly supported by international trade as of 2018 (Trade Partnership Worldwide LLC), and as much as half of the entire U.S. manufacturing industry is supported by exports (International Trade Administration). Imports introduce new goods and boost the annual earning power of the average American by over $18,000 annually (Hufbauer and Lu, 2017).
COVID-19 disrupted the global supply chain. The first effect was a dramatic and rapid reduction in the labor force in affected locations as workers fell ill or were forced to quarantine (Gruszczynski, L. (2020). International trade was impacted further as borders were closed to commerce and travel (Kerr, W. (2020)). This further exacerbated supply chain disruptions within and among countries as goods themselves were prioritized differently by those in quarantine. Most international trade reduced dramatically especially for manufactured goods and automobiles. However, food and agriculture supply lines were relatively unaffected in the short term due to narrow time windows between harvest and consumption, and the fact that even during a pandemic, the population still needs to eat for survival (Kerr, W. (2020)).
It is still too early to determine if COVID-19 has permanently disrupted the global supply chain or if it has just hastened trends already underway related to the geography of trading partners and the basket of goods traded among them. These long-term trends will need to be discerned over years, isolating the effect of any prolonged economic recession, especially among countries that are not employing economic stimulus measures. The applied tariff rates of countries is one change that may or may not prove transitory or affected due to the emergence of this virus. These rates may have a confounding effect with the disruptions caused by the direct or indirect measures of COVID-19. It is still too early to tell if the pandemic will lead to a more decentralized global economic system, but this is a possible long term scenario (Gruszczynski, L. (2020).
Proposed Questions
¶
● How has the composition and/or volume of maritime imports and exports from the US changed over time?
● Are there any confounding effects impacting the volume of imports and exports of the targeted commodities?
● How did COVID-19 impact the volume and composition of international maritime trade?
Data Collection and Processing
¶
The questions presented in the analysis are complex, interconnected, and require a variety of data sources to answer sufficiently. Data must establish historical import and export volumes and composition . There also must be data that demonstrate COVID-19’s effects on the supply chains and workforce. Finally, data must be included that establishes potential confounding effects on trade volumes, including historical rates for applied tariffs, currency exchange, and unemployment.
Maritime Imports & Exports
The U.S. Customs and Border Protection collects information about each shipment. Descartes Datamyne, as well as other industry data sources such as IHS Markit’s PIERS and Panjiva, compile this information and enhance it with additional data fields. The data was queried from Descartes Datamyne for U.S. imports and exports separately, from January 1, 2010 through December 31, 2020. The geographic scope includes all maritime trade shipments through the United States’ ports, to and from all trading partners. The data was downloaded in 70 queries of 1 million records or less, separately for imports and exports.
Both the imports and exports sets contain features import arrival, business name and location, product identifying features, port of arrival and departure, source country, and other characteristics relating to the shipment. Datasets will be concatenated into one table. Data consists of all bills of lading (shipment receipts) into or out of a U.S. maritime port from January 1, 2010 through December 31, 2020. Maritime trade is best measured in volumes, such as metric tonnage, TEUs, or total shipments, however the last two are relevant to maritime trade only. Trade in metric tonnage is a major indicator of the economic vitality of the national and global economy.
Preprocessing
¶
The code that was used for preprocessing and EDA can be found here. First the environment is set up with the dependencies, library options, the seed for reproducibility, and setting the location of the project directory.
import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')
seed_value = 42
os.environ['MaritimeTrade_Preprocessing'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
Preprocess Imports
¶
Since the initial trade data had to queried in batches separately for the imports and exports, this resulted in many separate files generated. The structure of the Data
directory resulted in subdirectories for Imports
and Exports
. Let's set the path to where the Imports
data is stored. There are 55 separate .csv
files, so these need to be concatenated together to make a single set. Using glob
, we use the .csv
commonality to read and concatenate all files matching this extension. Then the variables that are extraneous are dropped and the columns are renamed to more relatable terminology without spaces. Since there queries were completed separately, this generated missing values in the Date
features so those are removed. This variable was imported as a float
so a lambda
function was used to remove the extra .0
resulting in a numerical value containing the year month day without hyphens. Since this is the Imports
set, let's add a column denoting a characteristic about this data. Since Container_Type_Refrigerated
and Container_Type_Dry
have many levels, let's create a binary features by using a mask
for all the observations >=1 to be recoded as 1. After this processing, the duplicates are dropped and there over 35 million rows and 17 features. Let's now view the first 5 observations. There appear to be differences in the spacing between words different features so string processing definitely need to be utilized.
import sys
import glob
import pandas as pd
pd.set_option('display.max_columns', None)
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
imports = pd.concat([pd.read_csv(f) for f in all_filenames], ignore_index=True)
imports = imports.drop(['In bond entry type', 'Vessel Country', 'Unnamed: 18'],
axis=1)
imports.rename(columns={'Consignee (Unified)': 'US_Company',
'HS Description': 'HS_Description',
'Country of Origin': 'Foreign_Country',
'Port of Departure': 'Foreign_Port',
'Port of Arrival': 'US_Port',
'Shipper (Unified)': 'Foreign_Company',
'Container LCL/FCL': 'Container_LCL/FCL',
'Metric Tons': 'Metric_Tons',
'Container Type Refrigerated': 'Container_Type_Refrigerated',
'Container Type Dry': 'Container_Type_Dry',
'VIN Quantity': 'VIN_Quantity',
'Total calculated value (US$)': 'TCVUSD'}, inplace=True)
imports = imports[imports['Date'].notna()]
imports['Date'] = imports['Date'].astype(str).apply(lambda x: x.replace('.0',''))
imports['Trade_Direction'] = 'Import'
mask = imports['Container_Type_Refrigerated'] >= 1
imports.loc[mask, 'Container_Type_Refrigerated'] = 1
mask = imports['Container_Type_Dry'] >= 1
imports.loc[mask, 'Container_Type_Dry'] = 1
imports = imports.drop_duplicates()
print('- The Imports set contains ' + str(imports.shape[0]) + ' rows and '
+ str(imports.shape[1]) + ' columns.')
print('\nSample observations of Imports:')
print(imports.head())
Preprocess Exports
¶
There are 36 separate files for the initial Exports
set. So let's move to that directory and perform the same type of concatenation as was used for the Imports
set, drop features that are not needed and rename the columns. Since Short Container Description
will be used to create features for this set, let's select the non-missing for this and the non missing from US_Port
. The Date
variable was imported the same type as in the previous processing, so let's use the same aforementioned approach. Then create the Trade_Direction
feature denoting these are exports. Since these are exports, they do not have a foreign company so let's add Not Applicable to Exports
for the Foreign_Company
variable. Drop any duplicate rows, now totaling over 23 million observations with the same amount of columns as the Imports
set.
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
exports = pd.concat([pd.read_csv(f) for f in all_filenames])
exports = exports.drop(['Vessel Country', 'Unnamed: 16'], axis=1)
exports.rename(columns={'Exporter (Unified)': 'US_Company',
'HS Description': 'HS_Description',
'Country of Final destination': 'Foreign_Country',
'Foreign Port': 'Foreign_Port',
'US Port': 'US_Port',
'Container LCL/FCL': 'Container_LCL/FCL',
'Metric Tons': 'Metric_Tons',
'VIN Quantity': 'VIN_Quantity',
'Total calculated value (US$)': 'TCVUSD'}, inplace=True)
exports = exports[exports['Short Container Description'].notna()
& exports.US_Port.notna()]
exports['Date'] = exports['Date'].astype(str).apply(lambda x: x.replace('.0',''))
exports['Trade_Direction'] = 'Export'
exports['Foreign_Company'] = 'Not Applicable to Exports'
exports = exports.drop_duplicates()
print('- The Exports set contains ' + str(exports.shape[0]) + ' rows and '
+ str(exports.shape[1]) + ' columns.')
print('\nSample observations of Exports:')
print(exports.head())
First, let's filter the Containerized
exports. The text in Short Container Description
can be utilized to create a binary feature by using the condtions that if the word *REEF*
, *Cold*
, or *Deg*
exists in the Short Container Description
feature, then we can create a new variable Container_Type_Refrigerated
and assign the value equal to one. Since these exports are refrigerated, let's assign them as Container_Type_Dry
equal to zero. Then, if the text in the Short Container Description
feature does not include the aforementioned words, we can create a new dataframe where we assign the value equal to one to Container_Type_Dry
and Container_Type_Refrigerated
equal to zero. These two filtered dataframes are then concatenated together by row. For the observations which are not Containerized
, we can create a new dataframe where we assign both Container_Type_Refrigerated
and Container_Type_Dry
equal to zero. Then this can be concatenated by row with the Containerized
filtered data. Now the features Short Container Description
and Containerized
can be removed from the Exports
set and any duplicates dropped.
df = exports.loc[exports['Containerized'] == 1]
search_values = ['REEF', 'Cold', 'Deg']
df2 = df[df['Short Container Description'].str.contains(
'|'.join(search_values))]
df2['Container_Type_Refrigerated'] = 1
df2['Container_Type_Dry'] = 0
df3 = df[~df['Short Container Description'].str.contains('|'.join(search_values))]
df3['Container_Type_Refrigerated'] = 0
df3['Container_Type_Dry'] = 1
df = pd.concat([df2, df3])
del df2, df3
df1 = exports.loc[exports['Containerized'] == 0]
df1['Container_Type_Refrigerated'] = 0
df1['Container_Type_Dry'] = 0
exports = pd.concat([df, df1])
del df, df1
exports = exports.drop(['Short Container Description', 'Containerized'], axis=1)
exports = exports.drop_duplicates()
Now, the Imports
and Exports
sets can be concatenated, and the quality of the data in regards to data types, missingness and uniqueness can be examined.
df = pd.concat([imports, exports])
del imports, exports
def data_quality_table(df):
"""
Returns data types, the count of missing data per feature and the number of unique.
"""
var_type = df.dtypes
mis_val = df.isnull().sum()
unique_count = df.nunique()
mis_val_table = pd.concat([var_type, mis_val, unique_count], axis=1)
mis_val_table_ren_columns = mis_val_table.rename(
columns = {0: 'Data Type', 1: 'Count Missing', 2: 'Number Unique'})
mis_val_table_ren_columns = mis_val_table_ren_columns[
mis_val_table_ren_columns.iloc[:,1] >= 0].sort_values(
'Count Missing', ascending=False).round(1)
print ('- There are ' + str(df.shape[0]) + ' rows and '
+ str(df.shape[1]) + ' columns.\n')
return mis_val_table_ren_columns
print(data_quality_table(df))
Most of the data is complete but two variables, US_Company
and Foreign_Company
have missing values. TCVUSD
has the most unique, which makes sense because each trade transaction mostl likely contains a different value of money. The features HS_Description
and HS
are strings so any small deviation in the wording, spaces or punctuation could result in differences. It's rationale there would be more foreign ports and US ports as well. Since the Imports
set is larger than the Exports
set, more U.S. companies than foreign makes makes logical sense.
Preprocess Concatenated Imports/Exports
¶
Let's now create a new variable called HS_Class
by pulling the first two digits from the HS Description
feature to make this new feature with cleaner and more homogenized labels for the HS family classes. Then we can drop HS_description
, filter HS_Class
for the 23 selected groups to examine for the proposed questions and convert to a string since it is a categorical variable. We can also create a new variable called HS_Mixed
as a Boolean value denoting TRUE
if there are more than 6 digits in HS
and FALSE
if there are 6 or less digits in HS.
Then HS
can be removed from the set.
df['HS_Class'] = [x[:2] for x in df['HS_Description']]
df = df.drop(['HS_Description'], axis=1)
df['HS_Class'] = pd.to_numeric(df['HS_Class'], errors='coerce')
df = df[df['HS_Class'].notna()]
list = [2,7,8,9,10,11,14,15,16,17,18,19,20,22,24,30,40,47,72,73,87,94,95]
df = df[df['HS_Class'].isin(list)]
print('- Number of unique HS classes:', str(df.HS_Class.nunique()))
df['HS_Class'] = df['HS_Class'].astype('str')
mask = df['HS'].astype(str).str.len() <= 7
df.loc[mask, 'HS_Mixed'] = 0
mask = df['HS'].astype(str).str.len() > 7
df.loc[mask, 'HS_Mixed'] = 1
df = df.drop(['HS'], axis=1)
Now, let's set the path to the Warehouse_Construction
directory where the various sets are stored, read the HS_Groups
file and examine the first five observations. We can remove Description
and convert HS_Class
to type int
. This can be used as a key to join the HS_Groups
set to the main table. This new feature HS_Group_Name
, which groups similar HS_Class
items together, will reduce dimensionality.
HS_Groups = pd.read_csv('HS_Groups.csv', index_col=False)
print('\nSample observations of HS groups data:')
print(HS_Groups.head())
HS_Groups = HS_Groups.drop(['Description'], axis=1)
HS_Groups['HS_Class'] = HS_Groups['HS_Class'].astype(int)
df['HS_Class'] = df['HS_Class'].astype(int)
df = pd.merge(df, HS_Groups, how='left', left_on=['HS_Class'],
right_on=['HS_Class'])
df = df.drop_duplicates()
del HS_Groups
Since some of the observations contain missing information for US_Company
, let's filter the sets into different datafames to process this variable to obtain the U.S. state where company is located. Using rsplit
to split the text where the last comma exists results in 75 unique U.S. company states, so this method is not 100% effective. For the observations that do not contain information for the US_Company
, we can allocate Not Provided
to this feature and the created US_Company_State
feature. The two sets are then concatenated to complete the set. Since the presence of a comma separates the companies which have the same name but exist in different states, we can utilize rsplit
again, but now to remove everything after the last comma. This results in aggregrating the same companies together by removing the state information.
dat = df[df['US_Company'].notna()]
dat['US_Company_State'] = [x.rsplit(',', 1)[-1] for x in dat['US_Company']]
print('- Number of unique US company states:',
str(dat.US_Company_State.nunique()))
dat1 = df[df['US_Company'].isna()]
dat1['US_Company'] = 'Not Provided'
dat1['US_Company_State'] = 'Not Provided'
df = pd.concat([dat, dat1])
del dat, dat1
df.loc[:,'US_Company_Agg'] = [x.rsplit(',', 1)[:-1] for x in df['US_Company']]
df['US_Company_Agg'] = df['US_Company_Agg'].str[0]
print('- Number of unique aggregrated US companies:',
str(df.US_Company_Agg.nunique()))
Next, the Foreign_Company
feature is organized in a similar pattern as US_Company
, but this contains the country which the company is located. Also, the parentheses exist in this string so they need to be removed first. The observations that contain information for this variable can be selected into a set, then all of the parentheses removed, and the last two characters extracted from the string creating Foreign_Company_Country
. Then the observations without foreign companies can be assigned Not Provided
for Foreign_Company
and Foreign_Company_Country
. Lastly, the sets can be concatenated by row returning back to a single table.
dat = df[df['Foreign_Company'].notna()]
dat.loc[:,'Foreign_Company_Country'] = dat['Foreign_Company'].str.replace(r'[\d]+', '')
dat['Foreign_Company_Country'] = [x[-2:] for x in dat['Foreign_Company_Country']]
dat1 = df[df['Foreign_Company'].isna()]
dat1['Foreign_Company'] = 'Not Provided'
dat1['Foreign_Company_Country'] = 'Not Provided'
df = pd.concat([dat, dat1])
del dat, dat1
Extract Names for Clustering using OpenRefine and Binning from Original Data
¶
Data dictionaries stored in .csv
files are generated, which map each of the unique values from the raw fact data to standardized values using the clustering capabilities of Google OpenRefine
. These dictionaries can then be joined to the raw fact data to replace entire fields with the standardized data for that field. This will reduce the occurrences of erroneous data, reduce the number of factor levels, and provide a set of key values for joining the datasets together after some preprocessing.
To create one of these data dictionary .csv
files, the unique values from one of the columns in the fact set will be extracted into a file named [field_name].csv
and stored in the Keys_and_Dictionaries
directory. This is then loaded into Google OpenRefine
, the field containing the row number will be turned into a numeric value and then faceted. There are memory constraints of OpenRefine
, so this controls the number of records fed into the clustering algorithms by selecting a range of the row numeric facets. The column containing the unique field values will be duplicated and renamed [field_name]_Clustered
containing the smaller list of clustered values.
print('- Number of unique US companies:', str(df.US_Company.nunique()))
us_company = pd.DataFrame({'US_Company': df['US_Company'].unique()})
us_company.to_csv('us_company.csv', index=False)
print('- Number of unique foreign companies:',
str(df.Foreign_Company.nunique()))
foreign_company = pd.DataFrame({'Foreign_Company': df['Foreign_Company'].unique()})
foreign_company.to_csv('foreign_company.csv', index=False)
print('- Number of unique foreign company countries:',
str(df.Foreign_Company_Country.nunique()))
foreign_company_country = pd.DataFrame({'Foreign_Company_Country': df['Foreign_Company_Country'].unique()})
foreign_company_country.to_csv('foreign_company_country.csv', index=False)
print('- Number of unique foreign countries:',
str(df.Foreign_Country.nunique()))
foreign_country = pd.DataFrame({'Foreign_Country': df['Foreign_Country'].unique()})
foreign_country.to_csv('foreign_country.csv', index=False)
print('- Number of unique US ports:', str(df.US_Port.nunique()))
US_port = pd.DataFrame({'US_Port': df['US_Port'].unique()})
US_port.to_csv('us_port.csv', index=False)
print('- Number of unique foreign ports:', str(df.Foreign_Port.nunique()))
foreign_port = pd.DataFrame({'Foreign_Port': df['Foreign_Port'].unique()})
foreign_port.to_csv('foreign_port.csv', index=False)
print('- Number of unique carriers:', str(df.Carrier.nunique()))
carrier = pd.DataFrame({'Carrier': df['Carrier'].unique()})
carrier.to_csv('carrier.csv', index=False)
del us_company, foreign_company, foreign_company_country, foreign_country
del US_port, foreign_port, carrier
In the current format, there is a large number of unique strings contained within the companies, countries, ports and carriers features. This will result in high dimensionality if these features are utilized in modeling techniques. There is the potential that there are spelling and formatting errors which might be contributing to this.
In addition, ISO standard country names and abbreviations are essential for joining sets of data because mismatched names can result in the improper alignment and loss of observations. The datapackage
is utilized to generate a file containing the country name and a two digit country code (ISO 3166-1).
import datapackage
data_url = 'https://datahub.io/core/country-list/datapackage.json'
package = datapackage.Package(data_url)
resources = package.resources
for resource in resources:
if resource.tabular:
data = pd.read_csv(resource.descriptor['path'])
data.to_csv('country_map.csv', index=False)
print('- There are ' + str(data.shape[0]) + ' countries.')
print('\nSample observations of ISO standard \n country names and abbreviations:')
data.head()
Perform Clustering using OpenRefine
¶
Note that for this field, there are over 1.2 million unique values for US_Company
, so the numeric facet for the row number column is a necessity to avoid memory issues. A range of these numeric row values will be chosen spanning 400,000 records at a time to be sent through the clustering algorithm.
In this way, the clustering can be done in a few batches on large datasets running through several of the key collision algorithms until each batch does not produce clustering suggestions. US_Country_Clustered
Diagram given below is an example of applying the Beider-Morse
function to the first 400,000 records in the US_Country_Clustered
column.
The interface clearly shows which values it believes should be clustered together and which value should replace them. In this way fields are combined which probably refer to the same entity, as they vary by ordering of words or abbreviations (ex: Smith, Thomas vs Thomas Smith), by the spacing between letters and numbers (ex: Thomas Smith vs Th omasSmith), and by variations in numeric prefixes or suffixes as seen in the example above. After making the reasonable selections, merge selected & close is chosen before repeating the process with a different function like fingerprint or n-gram fingerprint until no clustering suggestions are left for that batch. Then the next batch of records is selected before repeating the process. After all the batches have been processed in this way, all of the records in the entire column can be fed into the clustering functions now that the number of unique values has been reduced. The effectiveness of this method will vary by both the amount of unique values in a field and by the quality of the data in the field itself. For the US_Company
field, the number of unique values went from 1,210,791 down to 880,829, a 27.3% reduction in values.
At this point the column containing the row numbers is dropped and the dataset can be exported to a .csv
file named [Field Name]_Clustered.csv
to become one of the data dictionary files. The dictionary files are stored in a directory called Keys_and_Dictionaries
so let's navigate there to read the data containing the clustered U.S. company names generated by OpenRefine
and perform an inner join using US_Company
as the key. This original feature can then be removed from the set and any duplicate observations removed. Next, read the data containing the clustered foreign companies and perform similar processing.
us_company_clustered = pd.read_csv('us_companies_clustered.csv')
df = pd.merge(df, us_company_clustered, left_on='US_Company',
right_on='US_Company')
df = df.drop(['US_Company'], axis=1)
df = df.drop_duplicates()
foreign_company_clustered = pd.read_csv('foreign_company_clustered.csv')
df = pd.merge(df, foreign_company_clustered, left_on='Foreign_Company',
right_on='Foreign_Company')
df = df.drop(['Foreign_Company'], axis=1)
df = df.drop_duplicates()
Now we can finish merging the rest of the sets with the generated clustered names using the original feature as the key starting with the trade carrier, then U.S. port, foreign company country and foreign country. The sets are merged after the foreign_country_clustered
set because this requires the standard names as a key. Then Country_continent_region_key
is then used to obtain the regions for Foreign_Country_Region
and Foreign_Company_Country_Region
fields one at a time. Then the sets can be removed to clear memory.
carrier = pd.read_csv('carrier_clustered_binned.csv')
df = pd.merge(df, carrier, left_on='Carrier', right_on='Carrier')
df = df.drop(['Carrier_Clustered', 'Carrier'], axis=1)
df = df.drop_duplicates()
us_port_and_state_code_clustered_binned = pd.read_csv('us_port_and_state_code_clustered_binned.csv')
df = pd.merge(df, us_port_and_state_code_clustered_binned,
left_on='US_Port', right_on='US_Port')
df = df.drop_duplicates()
df.rename(columns={'STATE_CODE': 'US_Company_State'}, inplace=True)
foreign_company_country_clustered = pd.read_csv('foreign_company_country_clustered.csv')
df = pd.merge(df, foreign_company_country_clustered,
left_on='Foreign_Company_Country',
right_on='Foreign_Company_Country')
df = df.drop_duplicates()
df.rename(columns={'Name': 'Foreign_Company_Country_Clustered',
'Continent': 'Foreign_Company_Country_Continent',
'Code': 'Foreign_Country_Code'}, inplace=True)
foreign_country_clustered = pd.read_csv('foreign_country_clustered.csv')
df = pd.merge(df, foreign_country_clustered, left_on='Foreign_Country',
right_on='Foreign_Country')
df = df.drop(['Code'], axis=1)
df = df.drop_duplicates()
df.rename(columns={'Name': 'Foreign_Country_Name_Clustered',
'Continent': 'Foreign_Country_Continent'}, inplace=True)
country_continent_region_key = pd.read_csv('country_continent_region_key.csv',
index_col=False)
print('\nSample observations of country/continent/region key:')
print(country_continent_region_key.head())
df = pd.merge(df, country_continent_region_key,
left_on=['Foreign_Country_Name_Clustered'],
right_on=['Name'])
df = df.drop(['Name', 'Country Code', 'Continent'], axis=1)
df = df.drop_duplicates()
df.rename(columns={'Region': 'Foreign_Country_Region'}, inplace=True)
country_continent_region_key = country_continent_region_key.drop(
['Country Code', 'Continent'], axis=1)
df = pd.merge(df, country_continent_region_key, how='left',
left_on=['Foreign_Company_Country_Clustered'], right_on=['Name'])
df = df.drop(['Name'], axis=1)
df = df.drop_duplicates()
df.rename(columns={'Region': 'Foreign_Company_Country_Region'}, inplace=True)
del us_company_clustered, foreign_company_clustered, carrier
del us_port_and_state_code_clustered_binned, foreign_company_country_clustered
del foreign_country_clustered, country_continent_region_key
Postprocessing after merging clustered features by OpenRefine
¶
The results from the clustered data added to the set reveal some inconsistencies within some of the features. Therefore, let's use another str.rsplit
at the last comma to isolate the US_Port_State
from US_Port_Clustered
so the state abbreviation can be useful as key to match with the full state names. There are also inconistencies within US_Port_State
with spacing/formatting, so let's standardize the state and territory codes. Another change that is needed is to recode groups that are not actual states to demarcate these observations from the rest, so let's use NOT DECLARED.
Lastly, some of the information contained within the Foreign_Country_Name_Clustered
and Foreign_Company_Country_Clustered
features contain verbose and inconsistent names, so let's standardize these to match the names in the tariff set, which will be used later.
df['US_Port_State'] = df['US_Port_Clustered'].str.rsplit(',').str[-1]
df['US_Port_State'] = df['US_Port_State'].replace(['NEW YORK',' NY', ' FL',
' TX', 'OH (DHL COURIER)',
' VA', ' NJ', ' UT',' HI',
' MD', ' DC', ' AK',' WA',
' VI',' MI',' MT', ' FL)',
' ND', ' NM', ' PA',
'VIRGIN ISLANDS', ' CO',
' NE', ' CA'],['NY', 'NY',
'FL', 'TX',
'OH', 'VA',
'NJ', 'UT',
'HI', 'MD',
'DC', 'AK',
'WA', 'VI',
'MI', 'MI',
'FL','ND',
'NM', 'PA',
'VI', 'CO',
'NE','CA'])
df['US_Port_State'].mask(df['US_Port_State'] == ' WI', 'WI', inplace=True)
df['US_Port_State'].mask(df['US_Port_State'] == ' MS)', 'MS', inplace=True)
df['US_Port_State'].mask(df['US_Port_State'] == ' IA', 'IA', inplace=True)
df['US_Port_State'].mask(df['US_Port_State'] == ' INC.', 'NOT DECLARED',
inplace=True)
df['US_Port_State'].mask(df['US_Port_State'] == 'ND', 'NOT DECLARED',
inplace=True)
df['US_Company_State'].mask(df['US_Company_State'] == 'ND', 'NOT DECLARED',
inplace=True)
df['US_Company_State'].mask(df['US_Company_State'] == 'ED', 'NOT DECLARED',
inplace=True)
df['Foreign_Country_Name_Clustered'] = df['Foreign_Country_Name_Clustered'].replace('Viet Nam', 'Vietnam')
df['Foreign_Country_Name_Clustered'] = df['Foreign_Country_Name_Clustered'].replace('Taiwan, Province of China', 'Taiwan')
df['Foreign_Country_Name_Clustered'] = df['Foreign_Country_Name_Clustered'].replace('Russian Federation', 'Russia')
df['Foreign_Country_Name_Clustered'] = df['Foreign_Country_Name_Clustered'].replace('Tanzania, United Republic of', 'Tanzania')
df['Foreign_Country_Name_Clustered'] = df['Foreign_Country_Name_Clustered'].replace('Bolivia, Plurinational State of', 'Bolivia')
df['Foreign_Company_Country_Clustered'] = df['Foreign_Company_Country_Clustered'].replace('Viet Nam', 'Vietnam')
df['Foreign_Company_Country_Clustered'] = df['Foreign_Company_Country_Clustered'].replace('Taiwan, Province of China', 'Taiwan')
df['Foreign_Company_Country_Clustered'] = df['Foreign_Company_Country_Clustered'].replace('Russian Federation', 'Russia')
df['Foreign_Company_Country_Clustered'] = df['Foreign_Company_Country_Clustered'].replace('Tanzania, United Republic of', 'Tanzania')
df['Foreign_Company_Country_Clustered'] = df['Foreign_Company_Country_Clustered'].replace('Bolivia, Plurinational State of', 'Bolivia')
Given the large number of unique U.S. companies, let's set a threshold to retain only the U.S. companies with at least 100 total metric tons over the time period (2010-2020). We can groupby
the sum and then filter the observations meeting this threshold. This creates a new feature we can name as Metric_Tons_Totals
. Now, the number of U.S. companies is significantly less compared to the initial amount of companies.
us_company = df.loc[:,['US_company_Clustered', 'Metric_Tons']]
us_company = us_company.groupby('US_company_Clustered')['Metric_Tons'].sum().reset_index()
us_company = us_company.loc[us_company['Metric_Tons'] > 100]
us_company.rename(columns = {'Metric_Tons': 'Metric_Tons_Totals'}, inplace=True)
print('- There are ' + str(us_company.shape[0]) + ' US companies with over 100 metric tons in the set.')
Conditional Binning of U.S. and Foreign Company by Metric Tons for Mapping
¶
Let's now define a function that uses the created Metric_Tons_Totals
feature to bin metric tonnage into specified groups using the following criteria:
- a. micro = 100-1000
- b. small = 1001-10000
- c. medium = 10001-100000
- d. large = 100001-1000000
- e. huge = 1000000+
These groups can be specified in a list to create the feature Company_size
where np.select
assigns the conditions to the respective group. We can also use this function for processing the foreign companies, so let's rename the clustered company feature to Company
. Now we can create the bins with the function, drop Metric_Tons_Totals
and then rename the feature back to the original name. This processed binned data can then be merged with the main set.
def bin_tonnage(df):
"""
Returns a created categorical feature grouping by the total metric tons.
"""
conditions = [
(df['Metric_Tons_Totals'] <= 1000),
(df['Metric_Tons_Totals'] > 1000) & (df['Metric_Tons_Totals'] <= 10000),
(df['Metric_Tons_Totals'] > 10000) & (df['Metric_Tons_Totals'] <= 100000),
(df['Metric_Tons_Totals'] > 100000) & (df['Metric_Tons_Totals'] <= 1000000),
(df['Metric_Tons_Totals'] > 1000000),
(df['Company'].str.strip() == 'NOT AVAILABLE')
]
values = ['micro', 'small', 'medium', 'large', 'huge', 'unknown']
df['company_size'] = np.select(conditions, values)
return(df)
us_company = us_company.rename(columns={'US_company_Clustered': 'Company'})
us_company = bin_tonnage(us_company)
us_company = us_company.drop(['Metric_Tons_Totals'], axis=1)
us_company = us_company.rename(columns={'Company': 'US_company_Clustered'})
df = pd.merge(df, us_company, how='right', left_on=['US_company_Clustered'],
right_on=['US_company_Clustered'])
df = df.drop_duplicates()
df.rename(columns={'company_size': 'US_company_size'}, inplace=True)
del us_company
Similar methods that were used to filter out the U.S. companies who were not involved in trading over 100 metric tons during the 10 year time period were used to process the foreign companies. Conditional binning was performed and then the binned data was joined with the main set.
foreign_company = df.loc[:,['Foreign_Company_Clustered', 'Metric_Tons']]
foreign_company = foreign_company.groupby('Foreign_Company_Clustered')['Metric_Tons'].sum().reset_index()
foreign_company = foreign_company.loc[foreign_company['Metric_Tons'] > 100]
print('- There are ' + str(foreign_company.shape[0]) + ' foreign companies with over 100 metric tons in the set.')
foreign_company.rename(columns = {'Metric_Tons': 'Metric_Tons_Totals',
'Foreign_Company_Clustered': 'Company'},
inplace=True)
foreign_company = bin_tonnage(foreign_company)
foreign_company = foreign_company.drop(['Metric_Tons_Totals'], axis=1)
foreign_company = foreign_company.rename(columns={'Company':
'Foreign_Company_Clustered'})
df = pd.merge(df, foreign_company, how='right',
left_on=['Foreign_Company_Clustered'],
right_on=['Foreign_Company_Clustered'])
df = df.drop_duplicates()
df.rename(columns={'company_size': 'foreign_company_size'}, inplace=True)
del foreign_company
Outlier Testing of Quantitative Variables
¶
Since Metric_Tons
will be utilized as the dependent variable or target when modeling, let's filter the observations within three standard deviations from the mean and use a boxplot to examine the data. This results in Metric_Tons
less than 2,000, but the data is quite skewed. We can then filter the set to less than 250 Metric_Tons
to retain a large amount of data. However, this still does contain data that is in long right tail.
import seaborn as sns
import matplotlib.pyplot as plt
df = df[((df.Metric_Tons - df.Metric_Tons.mean())
/ df.Metric_Tons.std()).abs() < 3.5]
sns.boxplot(x=df['Metric_Tons']).set_title('Distribution of Metric Tons')
plt.show();
df = df.loc[df['Metric_Tons'] < 250]
sns.boxplot(x=df['Metric_Tons']).set_title('Distribution of Metric Tons After Filtering >= 250')
plt.show();
Now, we can filter the data where Teus
and TCVUSD
are within three standard deviations and replot the distributions.
df = df[((df.Teus - df.Teus.mean()) / df.Teus.std()).abs() < 3.5]
df = df[((df['TCVUSD'] - df['TCVUSD'].mean()) / df['TCVUSD'].std()).abs() < 3.5]
df = df.drop_duplicates()
sns.boxplot(x=df['Metric_Tons']).set_title('Distribution of Metric Tons')
plt.show();
sns.boxplot(x=df['Teus']).set_title('Distribution of Teus')
plt.show();
sns.boxplot(x=df['TCVUSD']).set_title('Distribution of Total calculated value (US$)')
plt.show();
The original Date
feature needs be processed into the format comprising of year-month-day. The original string is treated as an int
type with the specified format and pandas.to_datetime
can be utilized to accomplish this task. A function can then be utilized to extract the year, year-week, year-month all in one block of code.
def timeFeatures(df):
"""
Returns datetime (year-month-day), year, year-week, year-month.
"""
df['DateTime'] = pd.to_datetime(df['Date'].astype(int), format='%Y%m%d')
df['Year'] = df.DateTime.dt.year
df['DateTime_YearWeek'] = df['DateTime'].dt.strftime('%Y-w%U')
df['DateTime_YearMonth'] = df['DateTime'].dt.to_period('M')
df = df.drop(['Date'], axis=1)
return df
df = timeFeatures(df)
Merge Trade Data with Other Data Sources
¶
Unemployment
¶
The Bureau of Labor Statistics conducts a Current Population Survey of households every month in the United States. The survey is carried out to create datasets encompassing the labor force, employment, unemployment, people not included in the labor force, and other labor force statistics. The data was collected by workers conducting the survey by telephone while the Bureau of Labor Statistics encouraged businesses to submit their data electronically. The dataset shows the raw count of national civilian unemployment by month (in thousands of people) as well as the national civilian unemployment rate also by month. The collection of the survey data was impacted by the COVID-19 pandemic.
Source: United States, BLS. “Charts Related to the Latest ‘The Employment Situation’ News Release | More Chart Packages.” U.S. Bureau of Labor Statistics on February, 9, 2021.
Macroscopic Details
The data consists of 3 attributes and 133 records including a date for the national civilian unemployment data for the United States only from January 2010 through December 2020. With the onset of the COVID-19 pandemic and stay at home orders across the United States, the pandemic increased unemployment rates far above where they had been throughout much of the previous decade.
Now we can navigate to the Warehouse_Construction
directory and examine the Unemployment
set. The Year-Month
feature contains a -
, so it can be treated as a string and converted to the proper format. An inner merge can then be completed using DateTime_YearMonth
as the key for the main set and Year-Month
for the Unemployment
set. The unemployment rate can act as a better surrogate variable than the numerical amount without adding additional ways to normalize to represent this potential confounding factor in trade imports/exports so this and the monthly time features can be removed. Lastly, to match the format of the variables without spaces present, we can rename Unemployment Rate Total
as Unemployment Rate Total
.
unemployment = pd.read_csv('Unemployment Data 2010-2020.csv', index_col=False)
print('\nSample observations of Unemployment data:')
print(unemployment.head())
unemployment['Year-Month'] = pd.to_datetime(unemployment['Year-Month'].astype(
str), format='%Y-%m')
unemployment['Year-Month'] = unemployment['Year-Month'].dt.to_period('M')
df = pd.merge(df, unemployment, left_on='DateTime_YearMonth',
right_on='Year-Month')
df = df.drop(['Month', 'Year-Month', 'Total in Thousands'], axis=1)
df = df.drop_duplicates()
df.rename(columns={'Unemployment Rate Total': 'US_Unemployment_Rate'},
inplace=True)
del unemployment
Tariffs
¶
The World Trade Organization compiles the annual applied tariff rates that member countries set towards the world, on the harmonized service code level. The dataset includes the number of tariff line items, annual averages, and minimum and maximum bounds. Countries were selected based on the United States’ top trading partners, measured in metric tonnage, during the year 2019. Tariff rates were collected for these countries for the years 2010 through 2020. This dataset also notes if the U.S. had a free trade agreement with these top trading partners for each of the 11 years. This qualitative variable will compliment our average applied tariff rate by HS chapter class.
Source: The World Trade Organization via the data query tool downloaded on February 8, 2021.
Source: Office of the United States Trade Representative downloaded on February 11, 2021.
Macroscopic Details
The original datasource included only the primary countries and 18 attributes. Subsequent additions reduced the working attributes to six, but expanded the scope of countries involved, increasing total records from 4,853 to 11,892. The United States and 62 other countries are included. These are the major trading partners of the U.S., as of 2019, measured in metric tonnage. All European countries are included, regardless of their individual trade numbers with the U.S. Data was collected from the WTO and the Office of the United States Trade Representative to create a new table. Each year includes 23 records for average applied tariff rates for the HS chapter classes included in this project. There are additional attributes to indicate whether a free trade agreement exists between the country and the U.S., and whether that country is a member of the European Union. There is missing annual data if a country was not a member of the World Trade Organization (“WTO”) for any of the time period, if it was a member of the European Union (they are reported as simply the European Union), and if they have not yet submitted their 2019 or 2020 data to the WTO. This was later adjusted by assuming that the applied tariff rates were held constant for these countries if no updates were published.
Examining the Tariffs
set, Country
, HS_Family
, Tariff_Year
can be utilized as the keys to add this information to main set using a left
merge to the main table. Then European_Union
, Free_Trade_Agreement_with_US
can be recoded to True/False
and Average_Tariff
can be reformatted.
tariffs = pd.read_csv('Tariffs.csv')
print('\nSample observations of Tariffs data:')
print(tariffs.head())
df = pd.merge(df, tariffs, how='left',
left_on=['Foreign_Country_Name_Clustered', 'HS_Class', 'Year'],
right_on=['Country', 'HS_Family', 'Tariff_Year'])
df = df.drop(['Country', 'Tariff_Year', 'HS_Family', 'HS_Class'], axis=1)
df = df.drop_duplicates()
df['European_Union'] = df['European_Union'].astype('str')
df['European_Union'] = df['European_Union'].replace('No', 'False')
df['European_Union'] = df['European_Union'].replace('Yes', 'True')
df['Free_Trade_Agreement_with_US'] = df['Free_Trade_Agreement_with_US'].astype('str')
df['Free_Trade_Agreement_with_US'] = df['Free_Trade_Agreement_with_US'].replace('No',
'False')
df['Free_Trade_Agreement_with_US'] = df['Free_Trade_Agreement_with_US'].replace('Yes',
'True')
df['Average_Tariff'] = df['Average_Tariff'].str.replace(',','')
df['Average_Tariff'] = df['Average_Tariff'].astype('float64')
del tariffs
Currency
¶
Currency rates are extracted from here, which collects live rates for currencies, currency codes, financial news, commodities and crypto currencies. The data set contains 8 columns and 2,145 rows. The timeframe is from January 2010 to February 2021. All countries that are included in the tariff table are included in the currency dataset.
Source: Stock Market Quotes & Financial News retrieved on February 12, 2021.
Macroscopic Details
The data queried has 8,175 records and 10 attributes using a time frame from January 2010 to Feb 2021 The currency rate is an important variable to address our research questions as it has a great impact on imports and exports. If a currency rate increases, importing and exporting volumes may fall for a country as the cost of goods is too great and may look elsewhere to source. As we build our model, we would like to understand if currencies have any correlation with the other variables.
After examining the information contained in the Currency
set, unnecesary columns can be dropped. The Month_Year
feature, which is actually year-month, can be processed using the same approach as what was used for Year-Month
in the Unemployment
. Month_Year
and Country
can be used as keys for the left
merge to the main set and then features that initially were retained in the initial EDA
can be dropped from the set. This was decided because the weekly information can provide more insight and grain than monthly trade. Also Continent
level trade was not specific enough, Foreign_Country_Name_Clustered
and US_Company_State
contained high dimensionality and the keys to create the warehouse were not needed for modeling.
currencies = pd.read_csv('2010 - 2021 Exchange Rates.csv')
print('\nSample observations of Currencies data:')
print(currencies.head())
currencies = currencies.drop(['Month', 'Exchange_Year', 'Open', 'High', 'Low'],
axis=1)
currencies['Month_Year'] = pd.to_datetime(currencies['Month_Year'].astype(str),
format='%Y-%m')
currencies['Month_Year'] = currencies['Month_Year'].dt.to_period('M')
df = pd.merge(df, currencies, how='left',
left_on=['DateTime_YearMonth', 'Foreign_Country_Name_Clustered'],
right_on=['Month_Year', 'Country'])
df = df.drop(['DateTime_YearMonth', 'Month_Year', 'Country',
'European_Union_Member', 'Foreign_Country_Name_Clustered',
'Foreign_Country_Code', 'Name', 'Country Code',
'Continent', 'US_Company_State'], axis=1)
df = df.drop_duplicates()
del currencies
COVID-19 State Mandated Closures
¶
The U.S. government granted the state government to the abiltiy declare if/when there was an official closure of public areas / services with the goal of preventing/hindering the increased likelihood of new COVID-19 cases reaching an exponential distribution with other distributions considered given past events. This would reduce the amount of person-to-person contact since the transmission route of COVID-19 includes being able to be aerosolized via droplets, which are not visible by the human eyes when an individual is considered infectious. This included working-from-home options to maintain daily operations by working remotely.
Source: Kaiser Family Foundation, initially compiled on April 5, 2020. The data consists of 51 rows and 3 columns.
Now we can navigate to the Warehouse_Construction
directory, and read/examine the KFF_Statewide_Stay_at_Home_Orders
set. Then navigate back to Keys_and_Dictionaries
directory, join the state_abbreviation_key
with the main set so that the KFF_Statewide_Stay_at_Home_Orders
can be joined using State
as the key.
KFF_Statewide_Stay_at_Home_Orders = pd.read_csv('KFF_Statewide-Stay-at-Home-Orders.csv',
index_col=False)
print('\nSample observations of KFF_Statewide_Stay_at_Home_Orders data:')
print(KFF_Statewide_Stay_at_Home_Orders.head())
print('\n')
state_abbreviation_key = pd.read_csv('State_Region_key.csv', index_col=False)
print('\nSample observations of State and Region Key:')
print(state_abbreviation_key.head())
state_abbreviation_key.rename(columns={'Region': 'State_Region'}, inplace=True)
df = pd.merge(df, state_abbreviation_key, how='left',
left_on=['US_Port_State'], right_on=['State Code'])
df = df.drop(['Region', 'State Code'], axis=1)
df = df.drop_duplicates()
df = pd.merge(df, KFF_Statewide_Stay_at_Home_Orders, how='right',
left_on='State', right_on='State')
df = df.drop_duplicates()
df.rename(columns={'Date Announced': 'Date_Announced',
'Effective Date': 'Effective_Date'}, inplace=True)
del KFF_Statewide_Stay_at_Home_Orders
COVID-19 Cases and Deaths in the United States
¶
The New York Times has released, and is continuing to release, a series of data related to COVID-19 from state and local governments as well as health departments. This information is obtained by journalists employed by the newspaper company across the U.S. who are actively tracking the development of the pandemic. This source compiles information related tof COVID-19 across the county and state level in the United States over time. The data starts with the first reported COVID-19 case in the state of Washington on January 21, 2020. The total number of cases and deaths is reported as either “confirmed” or “probable” COVID-19. The number of cases includes all reported cases, including instances of individuals who have recovered or died. There are geographic exceptions in the states of New York, Missouri, Alaska, California, Nebraska, Illinois, Guam and Puerto Rico. Data was pulled from the New York Times Github on February 5, 2021.
Macroscopic Details
The data ranges from January 21, 2020 to February 5, 2021 with the dimensions of 6 columns and 1,001,656 rows including time, place, fips (geographic indicator) and the number of COVID-19 cases and deaths attributed to COVID-19.
Let's now convert the provided date
using pandas.to_datetime
and utilize timedelta(7, unit='d')
to subtract a week. Then the weekly sum of the cases and deaths separately for each U.S. state can be calculated, generating the cases_weekly
and deaths_weekly
features. From the renamed Date_Weekly_COVID
, the Year
can be extracted and filtered to 2020 to match the extent of the trade information. Since the US_Port_State
feature in the main table contains the abbreviated state name, we can merge the COVID-19 set with the state_abbreviation_key
, so this can be used as a key, remove the unnecessary columns and select the non-missing information.
COVID_cases = pd.read_csv('us-counties - NYTimes.csv', index_col=False)
print('\nSample observations of COVID-19 data:')
print(COVID_cases.head())
COVID_cases['Date'] = pd.to_datetime(COVID_cases['date']) - pd.to_timedelta(7,
unit='d')
df1 = (COVID_cases.groupby(['state', pd.Grouper(key='Date', freq='W-MON')])
.agg({'cases': 'sum', 'deaths': 'sum'}).reset_index())
df1.rename(columns={'Date': 'Date_Weekly_COVID', 'cases': 'cases_weekly',
'deaths': 'deaths_weekly', 'state': 'State'}, inplace=True)
df1['DateTime_YearWeek'] = pd.to_datetime(df1['Date_Weekly_COVID'])
df1['DateTime_YearWeek'] = df1['DateTime_YearWeek'].dt.strftime('%Y-w%U')
df1['Year'] = df1['Date_Weekly_COVID'].dt.year
df1 = df1[df1.Year < 2021]
df1 = df1.drop(['Year'], axis=1)
del COVID_cases
df1 = pd.merge(df1, state_abbreviation_key,
how='left', left_on='State', right_on='State')
df1 = df1.drop(['State', 'State_Region'], axis=1)
df1 = df1.drop_duplicates()
df1 = df1[df1['State_Code'].notna()]
Then the aggregrated weekly COVID-19 information can then be outer
merged with the main table using the the keys defined as the abbreviated state name and DateTime_YearWeek
. Then the State_Code
feature can be dropped.
df = pd.merge(df, df1, how='outer', left_on=['US_Port_State',
'DateTime_YearWeek'],
right_on=['State_Code', 'DateTime_YearWeek'])
df = df.drop(['State_Code'], axis=1)
df = df.drop_duplicates()
del state_abbreviation_key, df1
Since COVID-19 became medically known late 2019 and early 2020 for the general public, the prior time periods contain missing data for the weekly cases and deaths given it was flu season and a rapid diagnostic test had not been developed given the introduction of a novel virua.. So let's filter the trade related column, HS_Mixed
, which contains a small percentage missing observations, and fill the missing values with 0s.
However, there are observations that do contain missing values for HS_Mixed
, which are mostly likely related to trade not occurring during these weeks at a location. To retain the grain without losing all off these observations, the quantitative variables can be filled with 0s since trade did not occur in the location at the time. This subset can then be concatenated via rows to the main data.
df1 = df[df['HS_Mixed'].notna()]
df1 = df1.copy()
df1['cases_weekly'] = df1['cases_weekly'].fillna(0)
df1['deaths_weekly'] = df1['deaths_weekly'].fillna(0)
df2 = df[df['HS_Mixed'].isna()]
print('- Number of observations where no trade occurred:', df2.shape)
df2['Year'] = df2['Date_Weekly_COVID'].dt.year
df2['Year'] = df2['Year'].astype('Int64')
df2['US_Port_State'] = df2['State']
quant = ['Teus', 'Metric_Tons', 'VIN_Quantity', 'TCVUSD',
'US_Unemployment_Rate', 'Average_Tariff', 'Price']
df2.loc[:, quant] = df2.loc[:, quant].fillna('')
df2.loc[:, quant] = df2.loc[:, quant].replace('', 0)
df = pd.concat([df1, df2])
del df1, df2
Feature Engineering COVID-19 associated vars
¶
Next, the number of days between when a state mandated closure was announced and was made effective can be calculated by first coverting both to proper datetime format and then substracting Date_Announced
from Effective_Date
, creating State_Closure_EA_Diff
. The original features can then be dropped.
df['Date_Announced'] = pd.to_datetime(df['Date_Announced'].astype(object),
format='%m/%d/%Y')
df['Effective_Date'] = pd.to_datetime(df['Effective_Date'].astype(object),
format='%m/%d/%Y')
df['State_Closure_EA_Diff'] = (df.Effective_Date - df.Date_Announced).dt.days
df = df.drop(['Date_Announced', 'Effective_Date'], axis=1)
Next, we can examine the number of U.S. states in the set for a sanity check. Then filter the set into subsets by the weekly times when there were and were not COVID-19 cases. A pivot table with the weekly COVID date as the index can be generated that utilizes the cases_weekly
feature for the different states. Then the missing case data can be filled with zeros and the first nonzero date for COVID cases can be leveraged using the pandas DataFrame.ne()
method. This can then be sorted with the earliest first occurence to create the feature Time0_StateCase
, or the first weekly date of COVID-19 cases in each state.
print('- Number of unique US states:',
len(pd.unique(df['State'])))
df1 = df[df['Date_Weekly_COVID'].isna()]
df2 = df[df['Date_Weekly_COVID'].notna()]
df3 = df2.pivot_table(index='Date_Weekly_COVID', columns='State',
values='cases_weekly', aggfunc=np.min)
df3.fillna(df3.fillna(0), inplace=True)
df3 = df3.ne(0).idxmax()
df3 = df3.to_frame()
df3 = df3.reset_index()
df3.rename(columns={0: 'Time0_StateCase'}, inplace=True)
df3.sort_values(by='Time0_StateCase', ascending=True, inplace=True)
df3[:10]
This created set can then be left
merged using place and time so that the date can be paired with the number of cases per this date. Selecting these features, which decreases the dimensionality, and the year-week can be used to create the temporary Year_M
that functions as a key with US_Port_State
to merge back with the complete set.
df3 = pd.merge(df3, df2, how='left', left_on=['US_Port_State',
'Time0_StateCase'],
right_on=['US_Port_State', 'Date_Weekly_COVID'])
df3 = df3.drop_duplicates()
df3 = df3.copy()
df3['cases_state_firstweek'] = df3['cases_weekly']
df3 = df3.loc[:, ['US_Port_State', 'Time0_StateCase', 'Date_Weekly_COVID',
'cases_state_firstweek']]
df3 = df3.drop_duplicates()
df3['Year_M'] = df3['Date_Weekly_COVID'].dt.year
df2['Year_M'] = df2['Date_Weekly_COVID'].dt.year
df3 = df3.drop(['Date_Weekly_COVID'], axis=1)
df3 = df3.drop_duplicates()
df2 = pd.merge(df2, df3, how='left', left_on=['US_Port_State', 'Year_M'],
right_on=['US_Port_State', 'Year_M'])
del df3
df2 = df2.drop(['Year_M'], axis=1)
df2 = df2.drop_duplicates()
Then, the cases from each of the weeks where COVID-19 occurred can be subtracted from the the initial amount of cases to generate a percent change feature called cases_pctdelta
. For the set that contains missing values for Date_Weekly_COVID
, the created features generated in the other set can be labeled with Not Applicable
and zeros so the sets can be concatenated by row.
df2['cases_pctdelta'] = df2.apply(lambda x: (x['cases_weekly']
- x['cases_state_firstweek']
/ x['cases_state_firstweek']) * 100,
axis=1)
df1['Time0_StateCase'] = 'Not Applicable'
df1['cases_state_firstweek'] = 0
df1['cases_pctdelta'] = 0
df = pd.concat([df2, df1])
del df1, df2
Similar methods can also be utilized for the deaths_weekly
feature that were used for cases_weekly
as demonstrated above.
df1 = df[df['Date_Weekly_COVID'].isna()]
df2 = df[df['Date_Weekly_COVID'].notna()]
df3 = df2.pivot_table(index='Date_Weekly_COVID', columns='State',
values='deaths_weekly', aggfunc=np.min)
df3.fillna(df3.fillna(0), inplace=True)
df3 = df3.ne(0).idxmax()
df3 = df3.to_frame()
df3 = df3.reset_index()
df3.rename(columns={0: 'Time0_StateDeath'}, inplace=True)
df3.sort_values(by='Time0_StateDeath', ascending=True, inplace=True)
df3[:10]
df3 = pd.merge(df3, df2, how='left', left_on=['US_Port_State',
'Time0_StateDeath'],
right_on=['US_Port_State', 'Date_Weekly_COVID'])
df3 = df3.drop_duplicates()
df3 = df3.copy()
df3['deaths_state_firstweek'] = df3['deaths_weekly']
df3 = df3.loc[:, ['US_Port_State', 'Time0_StateDeath', 'Date_Weekly_COVID',
'deaths_state_firstweek']]
df3['Year_M'] = df3['Date_Weekly_COVID'].dt.year
df2['Year_M'] = df2['Date_Weekly_COVID'].dt.year
df3 = df3.drop(['Date_Weekly_COVID'], axis=1)
df3 = df3.drop_duplicates()
df2 = pd.merge(df2, df3, how='left', left_on=['US_Port_State', 'Year_M'],
right_on=['US_Port_State', 'Year_M'])
del df3
df2 = df2.drop(['Year_M'], axis=1)
df2 = df2.drop_duplicates()
df2['deaths_pctdelta'] = df2.apply(lambda x: (x['deaths_weekly']
- x['deaths_state_firstweek']
/ x['deaths_state_firstweek']) * 100,
axis=1)
df1['Time0_StateDeath'] = 'Not Applicable'
df1['deaths_state_firstweek'] = 0
df1['deaths_pctdelta'] = 0
df = pd.concat([df2, df1])
del df1, df2
Let's now drop the variables not worth considering during the next steps for variable selection.
df = df.drop(['VIN_Quantity', 'US_company_Clustered',
'Foreign_Company_Clustered', 'Foreign_Company_Country_Clustered',
'US_Port_State'], axis=1)
df = df.drop_duplicates()
print('- Dimensions of Data Warehouse for EDA:', df.shape)
Exploratory Data Analysis (EDA)
¶
Feature Selection with Random Forest and XGBoost
¶
Question: How has the composition and/or volume of maritime imports and exports from the U.S. changed from 2010-2015
¶
Random Forest - Feature Importance
¶
To approach this question, let's first subset the observations containing the years 2010 - 2015. Then drop time related variables, including COVID-19 features, and the ones not related to the question. Then remove the rows with the columns having missing values to maximize the completeness of the set.
df1 = df.loc[df['Year'] < 2016]
df1 = df1.drop(['DateTime', 'Year', 'DateTime_YearWeek', 'Date_Weekly_COVID',
'Trade_Direction', 'State_Closure_EA_Diff',
'cases_weekly', 'deaths_weekly', 'Time0_StateCase',
'cases_state_firstweek', 'cases_pctdelta', 'Time0_StateDeath',
'deaths_state_firstweek', 'deaths_pctdelta'
'Free_Trade_Agreement_with_US', 'European_Union', 'Price',
'Currency', 'Foreign_Company_Country_Region',
'US_Port_Clustered', 'Foreign_Country',
'Foreign_Company_Country', 'Foreign_Port', 'US_Port'], axis=1)
df1 = df1[df1.Foreign_Country_Continent.notna()
& df1.Foreign_Country_Region.notna() & df1.Average_Tariff.notna()]
df1 = df1.drop_duplicates()
print('- Dimensions of Question 1 EDA:', df1.shape)
To set up the RandomForestRegressor
, let's create dummy variables for the categorical variables and allocate the features and the target variable, Metric_Tons
. Now train the regressor using the default parameters using the parallel_backend
from joblib
to maximize the amount of CPU cores and then save the model as a pickle
file. From the fitted model, the features can then be sorted by gini importance to identify which are the most important features.
from sklearn.ensemble import RandomForestRegressor
from joblib import parallel_backend
import pickle
df1 = pd.get_dummies(df1, drop_first=True)
X = df1.drop('Metric_Tons', axis=1)
y = df1.Metric_Tons
rf = RandomForestRegressor(n_estimators=100, random_state=seed_value)
with parallel_backend('threading', n_jobs=-1):
rf.fit(X, y)
Pkl_Filename = 'CompositionVolume_2010_2015_RF.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(rf, file)
plt.rcParams['figure.figsize'] = (10,10)
plt.rcParams.update({'font.size': 8.5})
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
sorted_idx = rf.feature_importances_.argsort()
ax.barh(X.columns[sorted_idx], rf.feature_importances_[sorted_idx])
ax.set_title('Feature Importance: Examining Features for Trade Composition/Volume during 2010-2015',
fontsize=20)
ax.set_xlabel('Random Forest Feature Importance', fontsize=15)
plt.tight_layout()
plt.show();
The most important features are:
Teus
TCVUSD
HS_Group_Name_Finished_Goods
Average_Tariff
US_Unemployment_Rate
Container_LCL/FCL_LCl
Foreign_Company_Country_Continent_Asia
HS_Group_Name_Edible_Processing
HS_Mixed
US_Port_Coastal_Region_Southeast
US_Port_Coastal_Region_Northeast
The least important features are the smallest carrier sizes and the locationas where trade most likely would not be probable of importing from or exporting to from the U.S.
XGBoost - Feature Importance
¶
XGBoost
can also be utilized to run on the GPU for faster runtimes using similar methods that were used for the RFRegressor.
from xgboost import XGBRegressor, plot_importance
xgb = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
gpu_id=0,
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0)
xgb.fit(X, y)
Pkl_Filename = 'CompositionVolume_2010_2015_XGB.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(xgb, file)
ax = plot_importance(xgb)
fig = ax.figure
ax.set_title('Feature Importance: Examining Features for Trade Composition/Volume during 2010-2015',
fontsize=20)
ax.set_xlabel('XGBoost Feature Importance', fontsize=15)
plt.tight_layout()
plt.show();
When comparing the feature importances from XGBoost
to Random Forest
, Teus
is the most important while Average_Tariff
is second and TCVUSD
is third. Fourth is US_Unemployment_Rate
, HS_Group_Name_Finished_Goods
is fifth, HS_Group_Name_Raw_Input
is sixth, foreign_company_size_large
is seventh, Container_LCL/FCL_LCl
is eighth, Foreign_Company_Country_Continent_Asia
is ninth and HS_Group_Name_Edible_Processing
is 10th. The ports did not show up till 17th and 18th and they reversed in importance.
Random Forest without Teus & TCVUSD - Feature Importance
Let's now drop the features, Teus
and TCVUSD
, when creating X
. Using the same approaches, let's examine if the feature importance is conserved for both the default models using Random Forest
and XGBoost
.
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
sorted_idx = rf.feature_importances_.argsort()
ax.barh(X.columns[sorted_idx], rf.feature_importances_[sorted_idx])
ax.set_title('Feature Importance: Examining Features without Teus & TCVUSD for Trade Composition/Volume during 2010-2015',
fontsize=20)
ax.set_xlabel('Random Forest Feature Importance', fontsize=15)
plt.tight_layout()
plt.show();
The most important features without Teus
and TCVUSD
are:
Average_Tariff
US_Unemployment_Rate
HS_Group_Name_Finished_Goods
Container_LCL/FCL_LCl
Container_Type_Dry
HS_Group_Name_Raw_Input
foreign_company_size_large
HS_Mixed
US_company_size_large
US_Port_Coastal_Region_Northeast
US_Port_Coastal_Region_Southeast
Carrier_size_large
The order of important features changed without the two most important features.
XGBoost without Teus & TCVUSD - Feature Importance
ax = plot_importance(xgb)
fig = ax.figure
ax.set_title('Feature Importance: Examining Features without Teus & TCVUSD for Trade Composition/Volume during 2010-2015',
fontsize=20)
ax.set_xlabel('XGBoost Feature Importance', fontsize=15)
plt.tight_layout()
plt.show();
The most important features without Teus
and TCVUSD
are:
Average_Tariff
US_Unemployment_Rate
Container_Type_Dry
US_company_size_large
foreign_company_size_large
HS_Group_Name_Raw_Input
Container_LCL/FCL_LCl
US_company_size_medium
HS_Group_Name_Edible_Processing
HS_Mixed
foreign_company_size_medium
Container_Type_Refrigerated
HS_Group_Name_Finished_Goods
US_Port_Coastal_Region_Northeast
The order of important features changed without the two most important features when using XGBoost
.
Question: How did COVID-19 impact the volume and composition of international maritime trade?
¶
Since this set of trade covers only through 2020, let's subset only the observations from 2020, drop the time related variables, like DateTime
, Year
, Date_Weekly_Covid
and DateTime_YearWeek
, as well as the first time of cases/death in a state and also the ones not related to the question. Then remove the rows with the columns having missing values to maximize the completeness of the set as before.
df2 = df.loc[df['Year'] == 2020]
df2 = df2.drop(['DateTime', 'Year', 'DateTime_YearWeek', 'Date_Weekly_COVID',
'Time0_StateCase', 'Time0_StateDeath'
'Foreign_Company_Country_Region', 'US_Port_Clustered',
'Foreign_Country', 'Foreign_Company_Country',
'Foreign_Port', 'US_Port'], axis=1)
df2 = df2[df2.Foreign_Country_Continent.notna()
& df2.Foreign_Country_Region.notna() & df2.Average_Tariff.notna()
& df2.Price.notna() & df2.Currency.notna()
& df2.deaths_pctdelta.notna() & df2.State_Closure_EA_Diff.notna()]
df2 = df2.drop_duplicates()
print('- Dimensions of Question 2 EDA:', df2.shape)
Random Forest without Teus & TCVUSD
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
sorted_idx = rf.feature_importances_.argsort()
ax.barh(X.columns[sorted_idx], rf.feature_importances_[sorted_idx])
ax.set_title('Feature Importance: Examining Features without Teus & TCVUSD for COVID-19 and Trade Composition/Volume',
fontsize=20)
ax.set_xlabel('Random Forest Feature Importance', fontsize=15)
plt.tight_layout()
plt.show();
The most important features without Teus
and TCVUSD
are:
Average_Tariff
HS_Group_Name_Finished_Goods
Price
Container_LCL/FCL_LCl
cases_weekly
cases_pctdelta
death_weekly
deaths_pctdelta
Container_Type_Dry
US_Unemployment_Rate
cases_state_firstweek
US_company_size_large
death_state_firstweek
HS_Mixed
US_company_size_medium
foreign_company_size_medium
Carrier_size_large
foreign_company_size_large
HS_Group_Name_Raw_Input
US_company_size_small
Trade_Direction_Import
XGBoost without Teus & TCVUSD
ax = plot_importance(xgb)
fig = ax.figure
ax.set_title('Feature Importance: Examining Features without Teus & TCVUSD for COVID-19 and Trade Composition/Volume',
fontsize=20)
ax.set_xlabel('XGBoost Feature Importance', fontsize=10)
plt.tight_layout()
plt.show();
The most important features without Teus
and TCVUSD
are:
Average_Tariff
cases_state_firstweek
Price
deaths_state_firstweek
cases_weekly
US_company_size_large
Container_Type_Dry
foreign_company_size_large
Trade_Direction_Import
US_company_size_medium
HS_Mixed
foreign_company_size_medium
State_Closure_EA_Diff
US_company_size_small
The order of important features changed when compared to the most important features from RF
.
Question: Are there any confounding effects impacting the volume of imports and exports of the targeted commodities?
¶
To address this questions, let's drop the time associated variables and the ones not related to question. Then we can remove the rows with any column having NA/null
for some of the important variables.
df3 = df.drop(['DateTime', 'Year', 'DateTime_YearWeek', 'Date_Weekly_COVID',
'Time0_StateCase', 'Time0_StateDeath',
'Foreign_Company_Country_Region', 'US_Port_Clustered',
'Foreign_Country', 'Foreign_Company_Country',
'Foreign_Port', 'US_Port'], axis=1)
df3 = df3[df3.Foreign_Country_Continent.notna()
& df3.Foreign_Country_Region.notna() & df3.Average_Tariff.notna()
& df3.Price.notna() & df3.Currency.notna()
& df3.Container_Type_Refrigerated.notna()
& df3.deaths_pctdelta.notna() & df3.State_Closure_EA_Diff.notna()]
df3 = df3.drop_duplicates()
print('- Dimensions of Question 3 EDA:', df3.shape)
Random Forest without Teus & TCVUSD
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.barh(X.columns[sorted_idx], rf.feature_importances_[sorted_idx])
ax.set_title('Feature Importance: Examining Features without Teus & TCVUSD for Potential Confounding Effects',
fontsize=20)
ax.set_xlabel('Random Forest Feature Importance', fontsize=15)
plt.tight_layout()
plt.show();
The most important features without Teus
and TCVUSD
are:
Price
HS_Group_Name_Finished_Goods
Average_Tariff
US_Unemployent_Rate
Container_LCL/FCL_LCl
State_Closure_EA_Diff
Container_Type_Dry
HS_Group_Name_Raw_Input
foreign_company_size_large
Trade_Direction_Import
HS_Mixed
US_company_size_large
Currency_CNY
foreign_company_size_medium
US_Port_Coastal_Region_Northeast
US_company_size_medium
Carrier_size_large
US_Port_Coastal_Region_Southeast
XGBoost without Teus & TCVUSD
ax = plot_importance(xgb)
fig = ax.figure
ax.set_title('Feature Importance: Examining Features without Teus & TCVUSD for Potential Confounding Effects',
fontsize=20)
ax.set_xlabel('XGBoost Feature Importance', fontsize=15)
plt.tight_layout()
plt.show();
The most important features without Teus
and TCVUSD
are:
Average_Tariff
Price
State_Closure_EA_Diff
Container_Type_Dry
US_company_size_large
foreign_company_size_large
US_Unemployent_Rate
Trade_Direction_Import
foreign_company_size_medium
HS_Group_Name_Raw_Input
Container_LCL/FCL_LCl
HS_Mixed
The order of important features changed when compared to the most important features from RF
.
Create Final Set
¶
Let's now drop the features that will not be used for modeling and examine the quality of the data using the data_quality_table
function.
df = df.drop(['Foreign_Country', 'Foreign_Port', 'US_Port', 'Teus',
'Container_Type_Refrigerated', 'HS_Mixed', 'US_Company',
'US_Company_Agg', 'Foreign_Company_Country', 'carrier_size',
'US_Port_Clustered', 'Foreign_Company_Country_Continent',
'Foreign_Country_Continent', 'Foreign_Company_Country_Region',
'Free_Trade_Agreement_with_US', 'European_Union', 'Currency',
'Price', 'Time0_StateCase', 'cases_state_firstweek',
'Time0_StateDeath', 'deaths_state_firstweek'], axis=1)
df = df.drop_duplicates()
print(data_quality_table(df))
XGBoost RAPIDS: Train 2020, 2019, 2018 - 2019 and 2010 - 2019
The notebooks for XGBoost
are located here. We can utilize Paperspace and use a RAPIDS
Docker
container to utilize GPUs
with higher memory than what is available on the local desktop. First, let's upgrade pip
, install the dependencies and import the required packages. Then we can set the os
environment as well as the random
, cupy
and numpy
seed. We can also define a function timed
to time the code blocks and examine the components of the GPU
which is being utilized. Given the number of observations within the 2010 - 2019 set, let's use a NVIDIA RTXA400
to set the baseline XGBoost
models.
!pip install --upgrade pip
!pip install category_encoders
!pip install xgboost==1.5.2
!pip install hyperopt
!pip install eli5
import os
import warnings
import random
import cupy
import numpy as np
import urllib.request
from contextlib import contextmanager
import time
warnings.filterwarnings('ignore')
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
seed_value = 42
os.environ['xgbRAPIDS_GPU'] = str(seed_value)
random.seed(seed_value)
cupy.random.seed(seed_value)
np.random.seed(seed_value)
@contextmanager
def timed(name):
t0 = time.time()
yield
t1 = time.time()
print('..%-24s: %8.4f' % (name, t1 - t0))
print('\n')
!nvidia-smi
Now we can set up a LocalCUDACluster
for Dask
with one threads_per_worker
assigned to Client
for all of the connected workers.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait
from dask.diagnostics import ProgressBar
cluster = LocalCUDACluster(threads_per_worker=1, ip='',
dashboard_address='8081')
c = Client(cluster)
workers = c.has_what().keys()
n_workers = len(workers)
c
Baseline Models
¶
Train 2020
Let's first read the data using cuDF
and drop any duplicate observations. Then use the DateTime
feature to create the year-week feature as DateTime_YearWeek
that will be used for stratifying the train/test sets. Then we can filter the data to only observations in 2019 and 2020, followed by dropping the Year
variable and converting to a pandas.Dataframe
for preprocessing. For 2020, we can use df2
for setting up the features and target.
import cudf
import pandas as pd
df = cudf.read_csv('combined_trade_final_LSTM.csv', low_memory=False)
df = df.drop_duplicates()
print('Number of rows and columns for 2019-2020:', df.shape)
df['DateTime']= cudf.to_datetime(df['DateTime'])
df['DateTime_YearWeek'] = df['DateTime'].dt.strftime('%Y-w%U')
df = df.drop(['DateTime'], axis=1)
df1 = df[df['Year'] == 2019]
df2 = df[df['Year'] == 2020]
del df
df1 = df1.drop(['Year'], axis=1)
df1 = df1.to_pandas()
df2 = df2.drop(['Year'], axis=1)
df2 = df2.to_pandas()
print('Number of rows and columns for the 2019 set:', df1.shape)
print('Number of rows and columns for the 2020 set:', df2.shape)
X = df2.drop(['Metric_Tons'], axis=1)
y = df2['Metric_Tons']
Now, the train/test sets can be set up by using a test_size=0.2
and stratified by DateTime_YearWeek.
Since the size of the companies, both U.S. and foreign, demonstrated differences in the feature importance, ordinal encoding using ranking can be utilized to contain this level of information. Then dummy variables can be created for the categorical features followed by scaling the features using the MinMaxScaler
. Subsequently, the set can be converted back to cuDF
, the features for both the train and sets can be converted to float32
data types and the targets converted to int32.
from sklearn.model_selection import train_test_split
import category_encoders as ce
from sklearn.preprocessing import MinMaxScaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
stratify=X.DateTime_YearWeek,
random_state=seed_value)
X_train = X_train.drop(['DateTime_YearWeek'], axis=1)
X_test = X_test.drop(['DateTime_YearWeek'], axis=1)
ce_ord = ce.OrdinalEncoder(cols = ['foreign_company_size', 'us_company_size'])
X_train = ce_ord.fit_transform(X_train)
X_test = ce_ord.transform(X_test)
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
mn = MinMaxScaler()
X_train = pd.DataFrame(mn.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(mn.transform(X_test), columns=X_test.columns)
X_train = cudf.DataFrame.from_pandas(X_train)
X_test = cudf.DataFrame.from_pandas(X_test)
X_train = X_train.astype('float32')
y_train = y_train.astype('int32')
X_test = X_test.astype('float32')
y_test = y_test.astype('int32')
The path where the .pkl
file will be stored can then be designated. This is always needed to reload a model for model comparison to the baseline model or after finding the optimal hyperparameters during the tuning process. The baseline model for 2020 can then be fit to the train set, saved, and then the prediction based on the fit model can be tested on both the train and test sets to determine if the error is close or not, suggesting overfitting.
from cupy import asnumpy
xgb = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0)
xgb.fit(X_train, y_train)
Pkl_Filename = 'XGB_Train20_Baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(xgb, file)
# =============================================================================
# # To load saved model
# model = joblib.load('XGB_Train20_Baseline.pkl')
# print(model)
# =============================================================================
print('\nModel Metrics for XGBoost Baseline Train 2020 Test 2020')
y_train_pred = xgb.predict(X_train)
y_test_pred = xgb.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_absolute_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
squared=False),
mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
r2_score(asnumpy(y_test), asnumpy(y_test_pred))))
All of the regression metrics evaluated are similar between the train and test sets, so this does not seem to be overfit. However, the R² demonstrates that this model only explains around 71% of the variance in the data, so hyperparameter tuning will need to be completed to acheive better performance.
Train 2019
For the 2019 baseline model, the same preprocessing, modeling and model metrics that were used for the 2020 set can be applied to df1
.
xgb = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0)
xgb.fit(X_train, y_train)
Pkl_Filename = 'XGB_Train19_Baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(xgb, file)
print('\nModel Metrics for XGBoost Baseline Train 2019 Test 2019')
y_train_pred = xgb.predict(X_train)
y_test_pred = xgb.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_absolute_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
squared=False),
mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
r2_score(asnumpy(y_test), asnumpy(y_test_pred))))
The model metrics reveal higher MAE
, MSE and RMSE compared to the 2020 train/test sets, but the R² is close.
Train 2018 - 2019
Let's now read the complete trade set into a cuDF
dataframe, remove any duplicates, convert to a pandas
dataframe and select the non-missing rows to maximize the amount of complete observations. Then we can filter the set to only the observations containing only 2018 and 2019 and then drop the Year
feature. To process similarily to the 2020 and 2019 sets, we will create year-week as DateTime_YearWeek
from DateTime
to stratify the train/test sets. Now we prepare this set for partitioning the data by dropping DateTime
and allocating Metric_Tons
as the target.
df = cudf.read_csv('combined_trade_final.csv', low_memory=False)
df = df.drop_duplicates()
df = df.to_pandas()
df = df[df.Foreign_Country_Region.notna() & df.Average_Tariff.notna()
& df.State_Closure_EA_Diff.notna()]
df3 = df.loc[(df['Year'] >= 2018) & (df['Year'] < 2020)]
df3 = df3.drop(['Year'], axis=1)
df3['DateTime']= pd.to_datetime(df3['DateTime'])
df3['DateTime_YearWeek'] = df3['DateTime'].dt.strftime('%Y-w%U')
X = df3.drop(['Metric_Tons', 'DateTime'], axis=1)
y = df3['Metric_Tons']
print('Number of rows and columns for 2018 - 2019:', df3.shape)
Then, we can use the same preprocessing, modeling and model metrics that were used for the 2020 and 2019 sets for the 2018 - 2019 set.
xgb = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0)
xgb.fit(X_train, y_train)
Pkl_Filename = 'XGB_Train1819_Baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(xgb, file)
print('\nModel Metrics for XGBoost Baseline Train 2018-19 Test 2018-19')
y_train_pred = xgb.predict(X_train)
y_test_pred = xgb.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_absolute_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
squared=False),
mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
r2_score(asnumpy(y_test), asnumpy(y_test_pred))))
The 2018 - 2019 baseline metrics reveal the lowest MAE
, MSE and RMSE but the R² is slightly lower than both the 2020 and 2019 baseline metrics.
Train 2010 - 2019
The same methods used for processing the 2018 - 2019 set were utilized up to where the Year
was filtered. Then DateTime
was dropped and because of memory constraints, a sample of 15,000,000 observations was taken. The features and target were then defined, and the train/test sets were partitioned using test_size=0.3
stratified by Year
. The subsequent preprocessing before modeling then utilized the same steps.
df = df[df['Year'] < 2020]
df = df.drop(['DateTime'], axis=1)
print('Number of rows and columns in 2010 - 2019:', df.shape)
df_sample = df.sample(n=15000000)
del df
X = df_sample.drop(['Metric_Tons'], axis=1)
y = df_sample['Metric_Tons']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
stratify=X.Year,
random_state=seed_value)
del X, y
ce_ord = ce.OrdinalEncoder(cols = ['foreign_company_size', 'US_company_size'])
X_train = ce_ord.fit_transform(X_train)
X_test = ce_ord.transform(X_test)
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
mn = MinMaxScaler()
X_train = pd.DataFrame(mn.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(mn.transform(X_test), columns=X_test.columns)
X_train = cudf.DataFrame.from_pandas(X_train)
X_test = cudf.DataFrame.from_pandas(X_test)
X_train = X_train.astype('float32')
y_train = y_train.astype('int32')
X_test = X_test.astype('float32')
y_test = y_test.astype('int32')
Now after preprocessing the data, we can fit the baseline model for the set containing observations from 2010-2019 and examine the metrics.
xgb = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0)
xgb.fit(X_train, y_train)
Pkl_Filename = 'XGB_Train1019_Baseline.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(xgb, file)
print('\nModel Metrics for XGBoost Baseline Train 2010-19 Test 2010-19')
y_train_pred = xgb.predict(X_train)
y_test_pred = xgb.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_absolute_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
squared=False),
mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
r2_score(asnumpy(y_test), asnumpy(y_test_pred))))
The model metrics for the 2010 - 2019 set are significantly lower compared to the baseline models for 2020, 2019 and 2018 - 2019. The default parameters reveal this model is not fitting well and is overfitting. This is probably due to the wide range of years for trade as well as unemployment, tariffs and currency.
Hyperopt Hyperparameter Optimization
¶
The baseline model metrics reveal that hyperparameter tuning is necessary since the model metrics do not explain the variance as well as they could by testing out different parameters to explain metric tonnage given the considered features. Hyperopt
is a widely utilized package for hyperparameter tuning, so let's now explore different parameters for XGBoost
regression to see if better models can then subsequently result in better predictions with lower errors and explain more of the variance.
Train 2020: 100 Trials Train/Test
¶
The method used to set up the RAPIDS
environment in Colab
can be found here. The code without the extensive output is provided below. The package pynvml
is first installed with pip
and then the rapidsai
repository is cloned to the RAPIDS
directory. The provided code examines if a compatible GPU is present, which a T4 works for this. Both Colab
and Paperspace
were utilized given runtime length and different GPU
and RAM
allocations.
%cd /content/drive/MyDrive/RAPIDS/
!pip install pynvml
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/env-check.py
The provided .sh
script run in bash
updates the Colab
environment and restarts the kernel.
!bash rapidsai-csp-utils/colab/update_gcc.sh
import os
os._exit(00)
Then we can install Condacolab
and restart the kernel.
import condacolab
condacolab.install()
Now we can see if the environment is ready to install RAPIDS
.
import condacolab
condacolab.check()
We can now install RAPIDS
using the stable
release and then install/import the dependencies to set up the environment. As shown previously, the seed was set for reproducibility, a function was defined to time the code blocks and a CUDA
cluster for Dask
was established.
!python rapidsai-csp-utils/colab/install_rapids.py stable
!pip install category_encoders
!pip install xgboost==1.5.2
!pip install eli5
import os
import warnings
import random
import cupy
import numpy as np
import urllib.request
from contextlib import contextmanager
import time
warnings.filterwarnings('ignore')
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'
os.environ['CONDA_PREFIX'] = '/usr/local'
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
seed_value = 42
os.environ['xgbRAPIDS_GPU'] = str(seed_value)
random.seed(seed_value)
cupy.random.seed(seed_value)
np.random.seed(seed_value)
@contextmanager
def timed(name):
t0 = time.time()
yield
t1 = time.time()
print('..%-24s: %8.4f' % (name, t1 - t0))
print('\n')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
To set up Hyperopt
, let's first define the number of trials with NUM_EVAL = 100
. Then hyperparameters can then be defined in that a dictionary. To define integers, hp.choice
with np.arange
and dtype=int
is used while float types are defined using hp.uniform
. The space consists of 10 parameters with 4 integers and 6 float type.
from hyperopt import hp
NUM_EVAL = 100
xgb_tune_kwargs= {
'n_estimators': hp.choice('n_estimators', np.arange(50, 700, dtype=int)),
'max_depth': hp.choice('max_depth', np.arange(3, 25, dtype=int)),
'subsample': hp.uniform('subsample', 0.25, 0.95),
'gamma': hp.uniform('gamma', 0, 15),
'learning_rate': hp.uniform('learning_rate', 1e-3, 0.3),
'reg_alpha': hp.choice('reg_alpha', np.arange(0, 30, dtype=int)),
'reg_lambda': hp.uniform('reg_lambda', 0, 100),
'colsample_bytree': hp.uniform('colsample_bytree', 0.3, 1),
'colsample_bylevel': hp.uniform('colsample_bylevel', 0.05, 0.95),
'min_child_weight': hp.choice('min_child_weight', np.arange(1, 30,
dtype=int))
}
Now, we can define a function for the optimization of these hyperparameters. Within this function, we can use joblib
to save a pkl
file that can be reloaded if more training is needed. We can define the trial number with ITERATION
that will increase by 1 each trial to keep track of the parameters for each trial. The parameters which are integers need to configured to remain as integers and not as float
. This is then allocated for n_estimators
and max_depth
, which starts at max_depth=3
. Then the model type, XGBRegressor
, needs to be defined with the parameters that will be included in all of the trials during the search, which are:
objective='reg:squarederror'
: Specify the learning task and the corresponding learning objective or a custom objective function to be usedbooster='gbtree'
: Specify which booster to use: gbtree, gblinear or dart.tree_method='gpu_hist'
: Specify which tree method to use. Default to auto.scale_pos_weight=1
: Balancing of positive and negative weights.use_label_encoder=False
: Encodes labelsrandom_state=seed_value
: Random number seed.verbosity=0
: The degree of verbosity. Valid values are 0 (silent) - 3 (debug)
Then we can fit the model with the training set and the test set as the eval_set
. Then predict using the test set and evaluate the mean_absolute_error
with the true test set and the predicted from the model. The trial loss=mae
, params
= parameters tested in the trial from the xgb_tune_kwargs
space , the trial number (iteration
), the time to complete the trial (train_time
) and if the trial completed successfully or not (status
) are written to the defined .csv
file and appended by row for each trial.
import joblib
from xgboost import XGBRegressor
from timeit import default_timer as timer
from cupy import asnumpy
from sklearn.metrics import mean_absolute_error
import csv
from hyperopt import STATUS_OK
def xgb_hpo(config):
"""
Objective function to tune a <code>XGBoostRegressor</code> model.
"""
joblib.dump(bayesOpt_trials, 'xgbRapids_Hyperopt_100_GPU_Train20.pkl')
global ITERATION
ITERATION += 1
config['n_estimators'] = int(config['n_estimators'])
config['max_depth'] = int(config['max_depth']) + 3
xgb = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0,
**config)
start = timer()
xgb.fit(X_train, y_train,
eval_set=[(X_test, y_test)],
verbose=0)
run_time = timer() - start
y_pred_val = xgb.predict(X_test)
mae = mean_absolute_error(asnumpy(y_test), asnumpy(y_pred_val))
of_connection = open(out_file, 'a')
writer = csv.writer(of_connection)
writer.writerow([mae, config, ITERATION, run_time])
return {'loss': mae, 'params': config, 'iteration': ITERATION,
'train_time': run_time, 'status': STATUS_OK}
The Tree-structured Parzen Estimator Approach (TPE)
algorithm is the default algorithm for Hyperopt
. This algorithm proposed in Algorithms for Hyper-Parameter Optimization uses a Bayesian optimization approach to develop a probabilistic model of a defined function (the minimum, in this case) by generating a random starting point, calculating the function, then choosing the next set of parameters that will probabilistically result in a lower minimum utilizing past calculations for the conditional probability model, then computing the real value, and continuing until the defined stopping criteria is met.
Let's now define an out_file
to save the results where the headers will be written to the file. Then we can set the global variable for the ITERATION
and define the Hyperopt Trials
as bayesOpt_trials
. We can utilize if/else
condtional statements to load a .pkl
file if it exists, and then utilizing fmin
, we can specify the training function xgb_hpo
, the parameter space xgb_tune_kwargs
, the algorithm for optimization tpe.suggest
, the number of trials to evaluate NUM_EVAl
and the name of the trial set bayesOpt_trials
. We can now begin the hyperparameter optimization HPO
trials.
from hyperopt import tpe, Trials, fmin
tpe_algorithm = tpe.suggest
out_file = '/content/drive/MyDrive/MaritimeTrade/Models/ML/XGBoost/Hyperopt/trialOptions/xgbRapids_HPO_Train20_100_GPU.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)
writer.writerow(['loss', 'params', 'iteration', 'train_time'])
of_connection.close()
global ITERATION
ITERATION = 0
bayesOpt_trials = Trials()
if os.path.isfile('xgbRapids_Hyperopt_100_GPU_Train20.pkl'):
bayesOpt_trials = joblib.load('xgbRapids_Hyperopt_100_GPU_Train20.pkl')
best_param = fmin(xgb_hpo, xgb_tune_kwargs, algo=tpe.suggest,
max_evals=NUM_EVAL, trials=bayesOpt_trials)
else:
best_param = fmin(xgb_hpo, xgb_tune_kwargs, algo=tpe.suggest,
max_evals=NUM_EVAL, trials=bayesOpt_trials)
Let's now access the results in the trialOptions
directory by reading into a pandas.Dataframe
sorted with the best scores on top and then resetting the index for slicing. Then we can convert the trial parameters from a string to a dictionary for later use. Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then saving as a .pkl
file. Then we can evaluate the model metrics for both the train and the test sets using the Mean Absolute Error (MAE)
, Mean Squared Error (MSE)
, Root Mean Squared Error (RMSE)
and the Coefficient of Determination
(R²)
import ast
import pickle
from sklearn.metrics import mean_squared_error, r2_score
results = pd.read_csv('xgbRapids_HPO_Train20_100_GPU.csv')
results.sort_values('loss', ascending=True, inplace=True)
results.reset_index(inplace=True, drop=True)
results.to_csv('xgbRapids_HPO_Train20_100_GPU.csv')
ast.literal_eval(results.loc[0, 'params'])
best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()
best_bayes_model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0,
**best_bayes_params)
best_bayes_model.fit(X_train, y_train)
Pkl_Filename = 'xgbRapids_HPO_Train20_100trials_GPU.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_bayes_model, file)
print('\nModel Metrics for XGBoost HPO Train 2020 Test 2020 100 GPU trials')
y_train_pred = best_bayes_model.predict(X_train)
y_test_pred = best_bayes_model.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_absolute_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
squared=False),
mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
r2_score(asnumpy(y_test), asnumpy(y_test_pred))))
When comparing the model metrics from the hyperparameter search to the baseline model metrics, all of the MAE
, MSE
and RMSE
for the train and test sets are considerably lower and the R² is considerably higher.
We can also evaluate the MSE
on the test set and determine when this was achieved:
print('The best model from Bayes optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(asnumpy(y_test),
asnumpy(y_test_pred))))
print('This was achieved after {} search iterations'.format(results.loc[0, 'iteration']))
Results from the Hyperparameter Search
Let's now create a new pandas.Dataframe
where the parameters examined during the search are stored and the results of each parameter are stored as a different column. Then we can convert the data types for graphing. Now, we can examine the distributions of the quantitative hyperparameters that were assessed during the search.
import matplotlib.pyplot as plt
import seaborn as sns
bayes_params = pd.DataFrame(columns=list(ast.literal_eval(results.loc[0,
'params']).keys()),
index=list(range(len(results))))
for i, params in enumerate(results['params']):
bayes_params.loc[i,:] = list(ast.literal_eval(params).values())
bayes_params['loss'] = results['loss']
bayes_params['iteration'] = results['iteration']
bayes_params['colsample_bylevel'] = bayes_params['colsample_bylevel'].astype('float64')
bayes_params['colsample_bytree'] = bayes_params['colsample_bytree'].astype('float64')
bayes_params['gamma'] = bayes_params['gamma'].astype('float64')
bayes_params['learning_rate'] = bayes_params['learning_rate'].astype('float64')
bayes_params['reg_alpha'] = bayes_params['reg_alpha'].astype('float64')
bayes_params['reg_lambda'] = bayes_params['reg_lambda'].astype('float64')
bayes_params['subsample'] = bayes_params['subsample'].astype('float64')
for i, hpo in enumerate(bayes_params.columns):
if hpo not in ['iteration', 'subsample', 'force_col_wise',
'max_depth', 'min_child_weight', 'n_estimators']:
plt.figure(figsize=(14,6))
if hpo != 'loss':
sns.kdeplot(bayes_params[hpo], label='Bayes Optimization')
plt.legend(loc=0)
plt.title('{} Distribution'.format(hpo))
plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
plt.tight_layout()
plt.show()
Higher values of colsample_bylevel
, colsample_bytree
, gamma
and reg_lambda
in the given hyperparameter space performed better to generate a lower loss. A learning_rate
around 0.1 performed better while reg_alpha
was bimodal with values around 11-12 and 25-28.
We can now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
fig, axs = plt.subplots(1, 4, figsize=(20,5))
i = 0
for i, hpo in enumerate(['learning_rate', 'gamma', 'colsample_bylevel',
'colsample_bytree']):
sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
The learning_rate
, and colsample_bylevel
parameters decreased over the trials while the gamma
and colsample_bytree
parameters increased.
Next, we can examine if there were any trends regarding the regularization parameters over the trials.
fig, axs = plt.subplots(1, 2, figsize=(14,6))
i = 0
for i, hpo in enumerate(['reg_alpha', 'reg_lambda']):
sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
The reg_lambda
parameter increased while reg_alpha
did not show any trend over the trial iterations.
Model Explanations
Now, we can plot the feature importance for the best model.
from xgboost import plot_importance
plot_importance(best_bayes_model, max_num_features=15)
These features are the most important for the Train 2020 Test 2020 model:
TCVUSD
= 2,349,275Average_Tariff
= 1,596,998cases_weekly
= 1,518,343us_company_size
= 965,347deaths_weekly
= 916,115foreign_company_size
= 882,163US_Unemployment_Rate
= 512,346
Model Metrics with ELI5
Let's now utilize the PermutationImportance
from eli5.sklearn
which is a method for the global interpretation of a model that outputs the amplitude of the feature's effect but not the direction. This shuffles the set, generates predictions on the shuffled set, and then calculates the decrease in the specified metric, this case the model's metric, before shuffling the set. The more important a feature is, the greater the model error is when the set is shuffled.
Class PermutationImportance(estimator, scoring=None, n_iter=5, random_state=None, cv='prefit', refit=True)
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
import eli5
from eli5.sklearn import PermutationImportance
X_test1 = pd.DataFrame(X_test.to_pandas(), columns=X_test.columns)
perm_importance = PermutationImportance(best_bayes_model,
random_state=seed_value).fit(asnumpy(X_test),
asnumpy(y_test))
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test1.columns.tolist())
html_obj
from eli5.formatters import format_as_dataframe
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
For the Train 2020 Test 2020 XGBoost
model after tuning, the TCVUSD
feature definitely has the most weight (1.131809). Then Average_Tariff
(0.413096), HS_Group_Name_Finished Goods
(0.276361), Trade_Direction_Import
(0.174613), HS_Group_Name_Raw Input
(0.173788), foreign_company_size
(0.105692), us_company_size
(0.103508) and Delta_Case0_Effective
(0.062671).
Test on 2019
¶
We can now prepare the 2019 set to fit the model trained on 2020 data. This pipeline conists of encoding the foreign_company_size
and US_company_size
features using ordinal (ranking), creating dummy variables for the categorical variables, scaling the data using the MinMaxScaler
, converting back to a cuDF
dataframe and then converting the data types before modeling this test set.
X_test1 = df1.drop(['Metric_Tons', 'DateTime_YearWeek'], axis=1)
y_test1 = df1['Metric_Tons']
ce_ord = ce.OrdinalEncoder(cols = ['foreign_company_size', 'us_company_size'])
X_test1 = ce_ord.fit_transform(X_test1)
X_test1 = pd.get_dummies(X_test1, drop_first=True)
X_test1 = pd.DataFrame(mn.transform(X_test1), columns=X_test1.columns)
X_test1 = cudf.DataFrame.from_pandas(X_test1)
X_test1 = X_test1.astype('float32')
y_test1 = y_test1.astype('int32')
best_bayes_model.fit(X_test1, y_test1)
print('\nModel Metrics for XGBoost HPO Train 2020 Test 2019')
y_test_pred = best_bayes_model.predict(X_test1)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_absolute_error(asnumpy(y_test1), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_squared_error(asnumpy(y_test1), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
squared=False),
mean_squared_error(asnumpy(y_test1), asnumpy(y_test_pred),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
r2_score(asnumpy(y_test1), asnumpy(y_test_pred))))
When comparing the metrics from the 2020 training set to the 2019 test set, there are higher MAE
, MSE
, RMSE
but a higher R² with the 2019 set compared to Train 2020 Test 2020
Model Explanations
Now, we can plot the feature importance from the model.
plot_importance(best_bayes_model, max_num_features=15)
These features are the most important for the Train 2020 Test 2019 model:
TCVUSD
= 3,061,152 higher than 2,349,275 for 2020Average_Tariff
= 1,965,434 higher than 1,596,998 for 2020- Then
us_company_size
= 1,286,641 higher 965,347 for 2020 - Then
US_Unemployment_Rate
= 1,241,843 higher than 512,346 for 2020 - Then
foreign_company_size
= 1,105,097 higher than 882,163 for 2020 - Then
Delta_Case0_Effective
= 781,520 - Then
State_Closure_EA_Diff
= 622,017 - Then
Container_Type_Dry
= 371,643 - Not used
cases_weekly
= 1,518,343,us_company_size
is lower with 965,347,deaths_weekly
= 916,115 is not on the chart.
Model Metrics with ELI5
Now, we can compute the permutation feature importance using the 2019 set, and then get the weights with the defined feature names.
X1_test1 = pd.DataFrame(X_test1.to_pandas(), columns=X_test1.columns)
perm_importance = PermutationImportance(best_bayes_model,
random_state=seed_value).fit(asnumpy(X_test1),
asnumpy(y_test1))
html_obj = eli5.show_weights(perm_importance,
feature_names=X1_test1.columns.tolist())
html_obj
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X1_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
For the Train 2020 Test 2019 XGBoost
model after tuning, the TCVUSD
feature definitely has a higher weight (1.239769 vs. 1.131809). Then higher Average_Tariff
(0.481857 vs. 0.413096), similar HS_Group_Name_Finished Goods
(0.274777 vs. 0.276361), lower Trade_Direction_Import
(0.144351 vs. 0.174613), higher HS_Group_Name_Raw Input
(0.211019 vs. 0.173788), higher foreign_company_size
(0.244605 vs. 0.105692), higher us_company_size
(0.156423 vs. 0.103508) and higher Delta_Case0_Effective
(0.101251 vs. 0.062671).
Test on 2010 - 2019
The same methods used for processing the 2019 data up to filtering the Year
feature were used. Then, the DateTime
and Year
features were dropped, the data sampled to 15 million observations, ordinal encoding of the size of the US and foreign companies, dummy variables for the categorical ones, MinMaxScaler
for the test set, which were then converted back to a cuDF.Dataframe
and the data types converted.
X_test1 = df_sample.drop(['Metric_Tons', 'DateTime', 'Year'], axis=1)
y_test1 = df_sample['Metric_Tons']
ce_ord = ce.OrdinalEncoder(cols = ['foreign_company_size', 'us_company_size'])
X_test1 = ce_ord.fit_transform(X_test1)
X_test1 = pd.get_dummies(X_test1, drop_first=True)
mn = MinMaxScaler()
X_test1 = pd.DataFrame(mn.fit_transform(X_test1), columns=X_test1.columns)
best_bayes_model.fit(X_test1, y_test1)
print('\nModel Metrics for XGBoost HPO Train 2020 Test 2010-19')
y_test_pred = best_bayes_model.predict(X_test1)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_absolute_error(asnumpy(y_test1), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_squared_error(asnumpy(y_test1), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
squared=False),
mean_squared_error(asnumpy(y_test1), asnumpy(y_test_pred),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
r2_score(asnumpy(y_test1), asnumpy(y_test_pred))))
When comparing the metrics from the 2020 training set to the 2010 - 2019 test set, there are even higher MAE
, MSE
, RMSE
compared to 2020 and 2019, but a higher R² with the 2010-19 set compared to the 2020 test set.
Model Explanations
Now, we can plot the feature importance from the model.
plot_importance(best_bayes_model, max_num_features=15)
These features are the most important for the Train 2020 Test 2010 - 2019 model:
- First
US_Unemployment_Rate
= 8,619,313 significantly higher than than 512,346 for 2020 TCVUSD
= 7,413,080 significantly higher than 2,349,275 for 2020Average_Tariff
= 6,128,827 significantly higher than 1,596,998 for 2020- Then
us_company_size
= 4,194,619 higher than 965,347 for 2020 - Then
foreign_company_size
= 3,415,736 higher than 882,163 for 2020 - Then
State_Closure_EA_Diff
= 2,300,086 higher than 316,770 for 2020 - Then
US_Port_Coastal_Region_Northeast
= 1,115,509, not listed for 2020 - Then
Container_Type_Dry
= 1,096,746 higher than 292,538 for 2020
Model Metrics with ELI5
Now, we can compute the permutation feature importance using the 2010 - 2019 set, and then get the weights with the defined feature names.
X1_test1 = pd.DataFrame(X_test1.to_pandas(), columns=X_test1.columns)
perm_importance = PermutationImportance(best_bayes_model,
random_state=seed_value).fit(asnumpy(X_test1),
asnumpy(y_test1))
html_obj = eli5.show_weights(perm_importance,
feature_names=X1_test1.columns.tolist())
html_obj
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X1_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
For the Train 2020 Test 2010 - 2019 XGBoost
model after tuning, the TCVUSD
feature definitely has a lower weight (1.037852 vs. 1.131809). Then higher Average_Tariff
(0.480369 vs. 0.413096), similar HS_Group_Name_Finished Goods
(0.291674 vs. 0.276361), higher Trade_Direction_Import
(0.193941 vs. 0.174613), lower HS_Group_Name_Raw Input
(0.146352 vs. 0.173788), higher foreign_company_size
(0.123941 vs. 0.105692), higher us_company_size
(0.156423 vs. 0.103508) and higher Delta_Case0_Effective
(0.164763 vs. 0.062671).
Train 2019: 100 Trials Train/Test
Using the same number of trials, hyperparameters and optimization algorithm that was utilized for the 2020 tuning, let's now define a different out_file
and .pkl
file and run the search for the 2019 data.
out_file = '/notebooks/MaritimeTrade/Models/ML/XGBoost/Hyperopt/trialOptions/xgbRapids_HPO_Train19_100_GPU.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)
writer.writerow(['loss', 'params', 'iteration', 'train_time'])
of_connection.close()
global ITERATION
ITERATION = 0
bayesOpt_trials = Trials()
if os.path.isfile('xgbRapids_Hyperopt_100_GPU_Train19.pkl'):
bayesOpt_trials = joblib.load('xgbRapids_Hyperopt_100_GPU_Train19.pkl')
best_param = fmin(xgb_hpo, xgb_tune_kwargs, algo=tpe.suggest,
max_evals=NUM_EVAL, trials=bayesOpt_trials)
else:
best_param = fmin(xgb_hpo, xgb_tune_kwargs, algo=tpe.suggest,
max_evals=NUM_EVAL, trials=bayesOpt_trials)
Let's now access the results in the trialOptions
directory by reading into a pandas.Dataframe
sorted with the best scores on top and then resetting the index for slicing. Then we can convert the trial parameters from a string to a dictionary for later use. Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then saving as a .pkl
file. Then we can evaluate the model metrics for both the train and the test sets using the MAE
, MSE
, RMSE
and the (R²).
results = pd.read_csv('xgbRapids_HPO_Train19_100_GPU.csv')
results.sort_values('loss', ascending=True, inplace=True)
results.reset_index(inplace=True, drop=True)
results.to_csv('xgbRapids_HPO_Train19_100_GPU.csv')
ast.literal_eval(results.loc[0, 'params'])
best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()
best_bayes_model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0,
**best_bayes_params)
best_bayes_model.fit(X_train, y_train)
Pkl_Filename = 'xgbRapids_Hyperopt_100_GPU_Train19trials.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_bayes_model, file)
print('\nModel Metrics for XGBoost HPO Train 2019 100 GPU trials')
y_train_pred = best_bayes_model.predict(X_train)
y_test_pred = best_bayes_model.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_absolute_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
squared=False),
mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
r2_score(asnumpy(y_test), asnumpy(y_test_pred))))
When comparing the model metrics from the hyperparameter search to the baseline model metrics, all of the MAE
, MSE
and RMSE
for the train and test sets are considerably lower and the R² is considerably higher.
We can also evaluate the MSE
on the test set and determine when this was achieved:
print('The best model from Bayes optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(asnumpy(y_test),
asnumpy(y_test_pred))))
print('This was achieved after {} search iterations'.format(results.loc[0, 'iteration']))
Results from the Hyperparameter Search
Let's now create a new pandas.Dataframe
where the parameters examined during the search are stored and the results of each parameter are stored as a different column. Then we can convert the data types for graphing. Now, we can now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
bayes_params = pd.DataFrame(columns=list(ast.literal_eval(results.loc[0,
'params']).keys()),
index=list(range(len(results))))
for i, params in enumerate(results['params']):
bayes_params.loc[i,:] = list(ast.literal_eval(params).values())
bayes_params['loss'] = results['loss']
bayes_params['iteration'] = results['iteration']
bayes_params['colsample_bylevel'] = bayes_params['colsample_bylevel'].astype('float64')
bayes_params['colsample_bytree'] = bayes_params['colsample_bytree'].astype('float64')
bayes_params['gamma'] = bayes_params['gamma'].astype('float64')
bayes_params['learning_rate'] = bayes_params['learning_rate'].astype('float64')
bayes_params['reg_alpha'] = bayes_params['reg_alpha'].astype('float64')
bayes_params['reg_lambda'] = bayes_params['reg_lambda'].astype('float64')
bayes_params['subsample'] = bayes_params['subsample'].astype('float64')
fig, axs = plt.subplots(1, 4, figsize=(20,5))
i = 0
for i, hpo in enumerate(['learning_rate', 'gamma', 'colsample_bylevel',
'colsample_bytree']):
sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
learning_rate
, and decreased over the trials while colsample_bylevel
and colsample_bytree
increased while gamma
did not show any trend. For 2020, learning_rate
, and colsample_bylevel
decreased over the trials while gamma
and colsample_bytree
increased.
Next, we can examine if there were any trends regarding the regularization parameters over the trials.
fig, axs = plt.subplots(1, 2, figsize=(14,6))
i = 0
for i, hpo in enumerate(['reg_alpha', 'reg_lambda']):
sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
reg_lambda
decreased while reg_alpha
did not show any trend over the trial iterations.For 2020, reg_lambda
increased while reg_alpha
did not show any trend over the trial iterations.
Model Explanations
Now, we can plot the feature importance for the best model.
plot_importance(best_bayes_model, max_num_features=15)
Compared to Train 2020, higher TCVUSD
, Average_Tariff
, us_company_size
. There is different order without cases_weekly
, double US_Unemployment_Rate
, higher foreign_company_size
Model Metrics with ELI5
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
X1_test1 = pd.DataFrame(X_test1.to_pandas(), columns=X_test1.columns)
perm_importance = PermutationImportance(best_bayes_model,
random_state=seed_value).fit(asnumpy(X_test),
asnumpy(y_test))
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test1.columns.tolist())
html_obj
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
For the Train 2019 Test 2019 XGBoost
model after tuning, the TCVUSD
feature definitely has the most weight (1.147259). Then Average_Tariff
(0.411251), HS_Group_Name_Finished Goods
(0.237294), Trade_Direction_Import
(0.102769), HS_Group_Name_Raw Input
(0.186702), foreign_company_size
(0.233340), us_company_size
(0.127631) and Delta_Case0_Effective
(0.067358).
Test on 2020
Let's now prepare the 2020 set to fit the model train using the 2019 set by ordinal encoding the company size variables, create dummy variables for the categorical variables, MinMax
scaling, converting back to cuDF
, converting the data types for modeling and fitting the model.
X_test1 = df2.drop(['Metric_Tons', 'DateTime_YearWeek'], axis=1)
y_test1 = df2['Metric_Tons']
ce_ord = ce.OrdinalEncoder(cols = ['foreign_company_size', 'us_company_size'])
X_test1 = ce_ord.fit_transform(X_test1)
X_test1 = pd.get_dummies(X_test1, drop_first=True)
X_test1 = pd.DataFrame(mn.transform(X_test1), columns=X_test1.columns)
X_test1 = cudf.DataFrame.from_pandas(X_test1)
X_test1 = X_test1.astype('float32')
y_test1 = y_test1.astype('int32')
best_bayes_model.fit(X_test1, y_test1)
print('\nModel Metrics for XGBoost HPO Train 2019 Test 2020')
y_test_pred = best_bayes_model.predict(X_test1)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_absolute_error(asnumpy(y_test1), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_squared_error(asnumpy(y_test1), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
squared=False),
mean_squared_error(asnumpy(y_test1), asnumpy(y_test_pred),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
r2_score(asnumpy(y_test1), asnumpy(y_test_pred))))
When comparing the metrics from the 2019 training set to the 2020 test set, there are higher MAE
, MSE
, RMSE
but a higher R² with the 2019 set compared to Train 2020 Test 2020.
Model Explanations
Now, we can plot the feature importance from the model.
plot_importance(best_bayes_modeL, max_num_features=15)
Model Metrics with ELI5
Now, we can compute the permutation feature importance using the 2020 set, and then get the weights with the defined feature names.
X1_test1 = pd.DataFrame(X_test1.to_pandas(), columns=X_test1.columns)
perm_importance = PermutationImportance(best_bayes_model,
random_state=seed_value).fit(asnumpy(X_test1),
asnumpy(y_test1))
html_obj = eli5.show_weights(perm_importance,
feature_names=X1_test1.columns.tolist())
html_obj
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X1_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
For the Train 2019 Test 2020 XGBoost
model after tuning, the TCVUSD
feature definitely has the most weight (1.212947 vs. 1.147259). Then a higher Average_Tariff
(0.472409 vs. 0.411251), HS_Group_Name_Finished Goods
(0.259453 vs. 0.237294), Trade_Direction_Import
(0.209795 vs. 0.102769), similar HS_Group_Name_Raw Input
(0.198903 vs. 0.186702), lower foreign_company_size
(0.116391 vs. 0.233340), higher us_company_size
(0.158599 vs. 0.127631) and similar Delta_Case0_Effective
(0.075206 vs. 0.067358).
Train 2018-19: 100 Trials Train/Test
Using the same number of trials, hyperparameters and optimization algorithm that was utilized for the 2020 & 2019 tuning, let's now define a different out_file
and .pkl
file and run the search for the 2018 - 2019 data.
out_file = '/content/drive/MyDrive/MaritimeTrade/Models/ML/XGBoost/Hyperopt/trialOptions/xgbRapids_HPO_Train1819_100_GPU.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)
writer.writerow(['loss', 'params', 'iteration', 'train_time'])
of_connection.close()
global ITERATION
ITERATION = 0
bayesOpt_trials = Trials()
if os.path.isfile('xgbRapids_Hyperopt_100_GPU_Train1819.pkl'):
bayesOpt_trials = joblib.load('xgbRapids_Hyperopt_100_GPU_Train1819.pkl')
best_param = fmin(xgb_hpo, xgb_tune_kwargs, algo=tpe.suggest,
max_evals=NUM_EVAL, trials=bayesOpt_trials)
else:
best_param = fmin(xgb_hpo, xgb_tune_kwargs, algo=tpe.suggest,
max_evals=NUM_EVAL, trials=bayesOpt_trials)
Let's now access the results in the trialOptions
directory by reading into a pandas.Dataframe
sorted with the best scores on top and then resetting the index for slicing. Then we can convert the trial parameters from a string to a dictionary for later use. Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then saving as a .pkl
file. Then we can evaluate the model metrics for both the train and the test sets using the MAE
, MSE
, RMSE
and the (R²).
results = pd.read_csv('xgbRapids_HPO_Train1819_100_GPU.csv')
results.sort_values('loss', ascending=True, inplace=True)
results.reset_index(inplace=True, drop=True)
results.to_csv('xgbRapids_HPO_Train1819_100_GPU.csv')
ast.literal_eval(results.loc[0, 'params'])
best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()
best_bayes_model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0,
**best_bayes_params)
best_bayes_model.fit(X_train, y_train)
Pkl_Filename = 'xgbRapids_Hyperopt_100_GPU_Train1819trials.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_bayes_model, file)
print('\nModel Metrics for XGBoost HPO Train 2018-19 Test 2018-19 100 GPU trials')
y_train_pred = best_bayes_model.predict(X_train)
y_test_pred = best_bayes_model.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_absolute_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
squared=False),
mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
r2_score(asnumpy(y_test), asnumpy(y_test_pred))))
When comparing the model metrics from the hyperparameter search to the baseline model metrics, all of the MAE
, MSE
and RMSE
for the train and test sets are considerably lower and the R² is considerably higher.
We can also evaluate the MSE
on the test set and determine when this was achieved:
print('The best model from Bayes optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(asnumpy(y_test),
asnumpy(y_test_pred))))
print('This was achieved after {} search iterations'.format(results.loc[0, 'iteration']))
Results from the Hyperparameter Search
Let's now create a new pandas.Dataframe
where the parameters examined during the search are stored and the results of each parameter are stored as a different column. Then we can convert the data types for graphing. Now, we can now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
bayes_params = pd.DataFrame(columns=list(ast.literal_eval(results.loc[0,
'params']).keys()),
index=list(range(len(results))))
for i, params in enumerate(results['params']):
bayes_params.loc[i,:] = list(ast.literal_eval(params).values())
bayes_params['loss'] = results['loss']
bayes_params['iteration'] = results['iteration']
bayes_params['colsample_bylevel'] = bayes_params['colsample_bylevel'].astype('float64')
bayes_params['colsample_bytree'] = bayes_params['colsample_bytree'].astype('float64')
bayes_params['gamma'] = bayes_params['gamma'].astype('float64')
bayes_params['learning_rate'] = bayes_params['learning_rate'].astype('float64')
bayes_params['reg_alpha'] = bayes_params['reg_alpha'].astype('float64')
bayes_params['reg_lambda'] = bayes_params['reg_lambda'].astype('float64')
bayes_params['subsample'] = bayes_params['subsample'].astype('float64')
fig, axs = plt.subplots(1, 4, figsize=(20,5))
i = 0
for i, hpo in enumerate(['learning_rate', 'gamma', 'colsample_bylevel',
'colsample_bytree']):
sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
learning_rate
decreased over the trials while gamma
, colsample_bylevel
and colsample_bytree
increased. For 2019, learning_rate
decreased over the trials while colsample_bylevel
and colsample_bytree
increased while gamma
did not show any trend. For 2020, learning_rate
, and colsample_bylevel
decreased over the trials while gamma
and colsample_bytree
increased.
Next, we can examine if there were any trends regarding the regularization parameters over the trials.
fig, axs = plt.subplots(1, 2, figsize=(14,6))
i = 0
for i, hpo in enumerate(['reg_alpha', 'reg_lambda']):
sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
For 2018 - 2019 and only 2019, reg_lambda
decreased while reg_alpha
did not show any trend over the trial iterations.For 2020, reg_lambda
increased while reg_alpha
did not show any trend over the trial iterations.
Model Explanations
Now, we can plot the feature importance fro the best model.
plot_importance(best_bayes_model, max_num_features=15)
Compared to Train 2020 Test 2020, lower TCVUSD
, Average_Tariff
, us_company_size
. There is different order without cases_weekly
and deaths_weekly
, higher US_Unemployment_Rate
and lower foreign_company_size
.
Model Metrics with ELI5
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
X_test1 = pd.DataFrame(X_test.to_pandas(), columns=X_test.columns)
perm_importance = PermutationImportance(best_bayes_model,
random_state=seed_value).fit(asnumpy(X_test),
asnumpy(y_test))
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test1.columns.tolist())
html_obj
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
For the Train 2018 - 2019 Test 2018 - 2019 XGBoost
model after tuning, the TCVUSD
feature definitely has the most weight (1.116930). Then Average_Tariff
(0.416370), HS_Group_Name_Finished Goods
(0.273364), Trade_Direction_Import
(0.173100), HS_Group_Name_Raw Input
(0.140252), foreign_company_size
(0.112855) and us_company_size
(0.126541).
Test on 2020
We can now prepare the 2020 set to fit the model trained on 2018 - 2019 data. This follows the same process of encoding the foreign_company_size
and US_company_size
features using ordinal (ranking), create dummy variables for the categorical variables, scaling the data using the MinMaxScaler
, converting back to a cuDF
dataframe and converting the data types before modeling this test set.
X_test1 = df2.drop(['Metric_Tons', 'DateTime_YearWeek'], axis=1)
y_test1 = df2['Metric_Tons']
ce_ord = ce.OrdinalEncoder(cols = ['foreign_company_size', 'US_company_size'])
X_test1 = ce_ord.fit_transform(X_test1)
X_test1 = pd.get_dummies(X_test1, drop_first=True)
X_test1 = pd.DataFrame(mn.transform(X_test1), columns=X_test1.columns)
X_test1 = cudf.DataFrame.from_pandas(X_test1)
X_test1 = X_test1.astype('float32')
y_test1 = y_test1.astype('int32')
best_bayes_model.fit(X_test1, y_test1)
print('\nModel Metrics for XGBoost HPO Train 2018-19 Test 2020')
y_test_pred = best_bayes_model.predict(X_test1)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_absolute_error(asnumpy(y_test1), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_squared_error(asnumpy(y_test1), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
squared=False),
mean_squared_error(asnumpy(y_test1), asnumpy(y_test_pred),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
r2_score(asnumpy(y_test1), asnumpy(y_test_pred)))))
When comparing the metrics from the 2018 - 2019 training set to the 2020 test set, there are even lower MAE
, MSE
, RMSE
compared to 2018 - 2019, but a higher R² with the 2020 set compared to the 2018 - 2019 test set.
Model Explanations
Now, we can plot the feature importance from the model.
plot_importance(best_bayes_model, max_num_features=15)
Compared to Train 2018 - 2019 Test 2018 - 2019, there is a lower TCVUSD
, lower Average_Tariff
, lower us_company_size
. There is different order without cases_weekly
and deaths_weekly
, lower US_Unemployment_Rate
and lower foreign_company_size
.
Model Metrics with ELI5
Now, we can compute the permutation feature importance using the 2020 set, and then get the weights with the defined feature names.
X1_test1 = pd.DataFrame(X_test1.to_pandas(), columns=X_test1.columns)
perm_importance = PermutationImportance(best_bayes_model,
random_state=seed_value).fit(asnumpy(X_test1),
asnumpy(y_test1))
html_obj = eli5.show_weights(perm_importance,
feature_names=X1_test1.columns.tolist())
html_obj
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X1_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
For the Train 2020 Test 2020 XGBoost
model after tuning, the TCVUSD
feature definitely has the most weight (1.210602 vs. 1.131809). Then higher Average_Tariff
(0.465814 vs. 0.413096), similar HS_Group_Name_Finished Goods
(0.273364 vs. 0.276361), higher Trade_Direction_Import
(0.193907 vs. 0.174613), lower HS_Group_Name_Raw Input
(0.149271 vs. 0.173788), higher foreign_company_size
(0.137879 vs. 0.105692) and higher us_company_size
(0.157700 vs. 0.103508).
Train 2010 - 2019: 100 Trials Train/Test
¶
Using the same number of trials, hyperparameters and optimization algorithm that was utilized for the 2020, 2019 & 2018 - 2019 tuning, let's now define a different out_file
and .pkl
file and run the search for the 2010 - 2019 data.
out_file = '/notebooks/MaritimeTrade/Models/ML/XGBoost/Hyperopt/trialOptions/xgbRapids_HPO_Train1019_100_GPU.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)
writer.writerow(['loss', 'params', 'iteration', 'train_time'])
of_connection.close()
global ITERATION
ITERATION = 0
bayesOpt_trials = Trials()
if os.path.isfile('xgbRapids_Hyperopt_100_GPU_Train1019.pkl'):
bayesOpt_trials = joblib.load('xgbRapids_Hyperopt_100_GPU_Train1019.pkl')
best_param = fmin(xgb_hpo, xgb_tune_kwargs, algo=tpe.suggest,
max_evals=NUM_EVAL, trials=bayesOpt_trials)
else:
best_param = fmin(xgb_hpo, xgb_tune_kwargs, algo=tpe.suggest,
max_evals=NUM_EVAL, trials=bayesOpt_trials)
Let's now access the results in the trialOptions
directory by reading into a pandas.Dataframe
sorted with the best scores on top and then resetting the index for slicing. Then we can convert the trial parameters from a string to a dictionary for later use. Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then saving as a .pkl
file. Then we can evaluate the model metrics for both the train and the test sets using the MAE
, MSE
, RMSE
and the (R²).
results = pd.read_csv('xgbRapids_HPO_Train1019_100_GPU.csv')
results.sort_values('loss', ascending=True, inplace=True)
results.reset_index(inplace=True, drop=True)
results.to_csv('xgbRapids_HPO_Train1019_100_GPU.csv')
ast.literal_eval(results.loc[0, 'params'])
best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()
best_bayes_model = XGBRegressor(objective='reg:squarederror',
booster='gbtree',
tree_method='gpu_hist',
scale_pos_weight=1,
use_label_encoder=False,
random_state=seed_value,
verbosity=0,
**best_bayes_params)
best_bayes_model.fit(X_train, y_train)
Pkl_Filename = 'xgbRapids_Hyperopt_100_GPU_Train1019trials.pkl'
with open(Pkl_Filename, 'wb') as file:
pickle.dump(best_bayes_model, file)
print('\nModel Metrics for XGBoost HPO Train 2010-19 100 GPU trials')
y_train_pred = best_bayes_model.predict(X_train)
y_test_pred = best_bayes_model.predict(X_test)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_absolute_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
squared=False),
mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
r2_score(asnumpy(y_test), asnumpy(y_test_pred))))
When comparing the model metrics from the hyperparameter search to the baseline model metrics, all of the MAE
, MSE
and RMSE
for the train set are considerably lower, but this is not the same for the test set. The R² did improve for both the train and the test sets compared to the baseline model, but the model from the search is overfit if the R² between the train and the test sets is 21% different.
We can also evaluate the MSE
on the test set and determine when this was achieved:
print('The best model from Bayes optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(asnumpy(y_test),
asnumpy(y_test_pred))))
print('This was achieved after {} search iterations'.format(results.loc[0, 'iteration']))
Results from the Hyperparameter Search
Let's now create a new pandas.Dataframe
where the parameters examined during the search are stored and the results of each parameter are stored as a different column. Then we can convert the data types for graphing. Now, we can now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.
bayes_params = pd.DataFrame(columns=list(ast.literal_eval(results.loc[0,
'params']).keys()),
index=list(range(len(results))))
for i, params in enumerate(results['params']):
bayes_params.loc[i,:] = list(ast.literal_eval(params).values())
bayes_params['loss'] = results['loss']
bayes_params['iteration'] = results['iteration']
bayes_params['colsample_bylevel'] = bayes_params['colsample_bylevel'].astype('float64')
bayes_params['colsample_bytree'] = bayes_params['colsample_bytree'].astype('float64')
bayes_params['gamma'] = bayes_params['gamma'].astype('float64')
bayes_params['learning_rate'] = bayes_params['learning_rate'].astype('float64')
bayes_params['reg_alpha'] = bayes_params['reg_alpha'].astype('float64')
bayes_params['reg_lambda'] = bayes_params['reg_lambda'].astype('float64')
bayes_params['subsample'] = bayes_params['subsample'].astype('float64')
fig, axs = plt.subplots(1, 4, figsize=(20,5))
i = 0
for i, hpo in enumerate(['learning_rate', 'gamma', 'colsample_bylevel',
'colsample_bytree']):
sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
For 2010 - 2019 and 2018 - 2019,learning_rate
decreased over the trials while gamma
, colsample_bylevel
and colsample_bytree
increased. For 2019, learning_rate
decreased over the trials while colsample_bylevel
and colsample_bytree
increased while gamma
did not show any trend. For 2020, learning_rate
, and colsample_bylevel
decreased over the trials while gamma
and colsample_bytree
increased.
Next, we can examine if there were any trends regarding the regularization parameters over the trials.
fig, axs = plt.subplots(1, 2, figsize=(14,6))
i = 0
for i, hpo in enumerate(['reg_alpha', 'reg_lambda']):
sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()
For 2010 - 2019, 2018 - 2019 and only 2019, reg_lambda
decreased while reg_alpha
did not show any trend over the trial iterations. For 2020, reg_lambda
increased while reg_alpha
did not show any trend over the trial iterations.
Model Explanations
Now, we can plot the feature importance for the best model.
plot_importance(best_bayes_model, max_num_features=15)
Compared to Train 2020 Test 2020, there is a lower value for US_Unemployment_Rate
, but it is the most important feature for this set. There is a lower Average_Tariff
, lower TCVUSD
, lower us_company_size
. There is different order without cases_weekly
an lower foreign_company_size
.
Model Metrics with ELI5
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
X_test1 = pd.DataFrame(X_test.to_pandas(), columns=X_test.columns)
perm_importance = PermutationImportance(best_bayes_model,
random_state=seed_value).fit(asnumpy(X_test),
asnumpy(y_test))
html_obj = eli5.show_weights(perm_importance,
feature_names=X_test1.columns.tolist())
html_obj
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
For the Train 2010 - 2019 Test 2010 - 2019 XGBoost
model after tuning, the TCVUSD
feature definitely has the most weight (0.877037). Then Average_Tariff
(0.361002), HS_Group_Name_Finished Goods
(0.237089), Trade_Direction_Import
(0.126989), HS_Group_Name_Raw Input
(0.108915), foreign_company_size
(0.081719) and us_company_size
(0.095954).
Test on 2020
Let's now prepare the 2020 data to fit the model trained using the 2010 - 2019 data by ordinal encoding the US and foreign company size variables, creatinng dummy variables for the categorical variables, using the MinMaxScaler
for the features, converting the data back to cuDF.Dataframe
, convert the data types for modeling and fitting the model.
X_test1 = df2.drop(['Metric_Tons', 'DateTime_YearWeek'], axis=1)
y_test1 = df2['Metric_Tons']
ce_ord = ce.OrdinalEncoder(cols = ['foreign_company_size', 'us_company_size'])
X_test1 = ce_ord.fit_transform(X_test1)
X_test1 = pd.get_dummies(X_test1, drop_first=True)
X_test1 = pd.DataFrame(mn.transform(X_test1), columns=X_test1.columns)
X_test1 = cudf.DataFrame.from_pandas(X_test1)
X_test1 = X_test1.astype('float32')
y_test1 = y_test1.astype('int32')
best_bayes_model.fit(X_test1, y_test1)
print('\nModel Metrics for XGBoost HPO Train 2010-19 Test 2020')
y_test_pred = best_bayes_model.predict(X_test1)
print('MAE train: %.3f, test: %.3f' % (
mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_absolute_error(asnumpy(y_test1), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
mean_squared_error(asnumpy(y_test1), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
squared=False),
mean_squared_error(asnumpy(y_test1), asnumpy(y_test_pred),
squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
r2_score(asnumpy(y_test1), asnumpy(y_test_pred))))
When comparing the metrics from the 2010 - 2019 training set to the 2020 test set, there are even lower MAE
, MSE
, RMSE
compared to 2018 - 2019, but a higher R² with the 2010 - 2019 set compared to the 2010 - 2019 test set.
Model Explanations
Now, we can plot the feature importance from the model.
plot_importance(best_bayes_model, max_num_features=15)
plot_importance(best_bayes_model, max_num_features=15)
Compared to Train 2010 - 2019 Test 2010 - 2019, there is a significantly lower value for US_Unemployment_Rate
, but TCVUSD
is the most important feature for this set but with a lower value. There is a lower Average_Tariff
, lower us_company_size
. There is different order without cases_weekly
and lower foreign_company_size
.
Model Metrics with ELI5
Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.
X_test1 = pd.DataFrame(X_test.to_pandas(), columns=X_test.columns)
perm_importance = PermutationImportance(best_bayes_model,
random_state=seed_value).fit(asnumpy(X_test1),
asnumpy(y_test1))
html_obj = eli5.show_weights(perm_importance,
feature_names=X1_test1.columns.tolist())
html_obj
explanation = eli5.explain_weights_sklearn(perm_importance,
feature_names=X1_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp
For the Train 2010 - 2019 Test 2010 - 2019 XGBoost
model after tuning, the TCVUSD
feature definitely has the most weight (1.116358 vs. 0.877037). Then higher Average_Tariff
(0.415390 vs. 0.361002), similar HS_Group_Name_Finished Goods
(0.227251 vs .0.237089), higher Trade_Direction_Import
(0.168872 vs. 0.126989), higher HS_Group_Name_Raw Input
(0.191584 vs. 0.108915), higher foreign_company_size
(0.103731 vs. 0.081719) and us_company_size
(0.118198 vs. 0.095954).
Multi-Layer Perceptron (MLP) Regression
The MLP
can be utilized for regression problems as demonstrated in Multilayer perceptrons for classification and regression] by Fionn Murtagh in 1991. It is derived from the perceptron invented in the 1940s and implemented by Frank Rosenblatt in the 1950s with The Perceptron — A Perceiving and Recognizing Automaton. The MLP
is a artifical neural network connected by neurons containing an input layer, the hidden layer(s) and the output layer with a single output neuron for regression. The data is used as the input to the network which is fed to the hidden layers one layer at a time generating weights (connections to the next neuron) and biases (threshod value needed to obtain the output) till the output layer. The output is then compared to the expected output, and the difference is the error. This error is then propagated back through the network, layer by layer where the weights are updated depending on the effect to the error called the backpropagation
algorithm. This is repreated for all of the samples in the training data where one cycle of updating the network is an epoch
.
The notebooks can be found here. Let's first install the needed packages and examine the versions of tensorflow
, keras
as well as the CUDA
and GPU
components available.
!pip install category_encoders
!pip install tensorflow
import tensorflow as tf
print('TensorFlow version: {}'.format(tf.__version__))
print('Eager execution is: {}'.format(tf.executing_eagerly()))
print('Keras version: {}'.format(tf.keras.__version__))
print('Num GPUs Available:', len(tf.config.list_physical_devices('GPU')))
print('\n')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
Now we can import the packages and set the seed for the environment.
import os
import random
import warnings
import numpy as np
warnings.filterwarnings('ignore')
def init_seeds(seed=101920):
random.seed(seed)
np.random.seed(seed)
tf.random.set_seed(seed)
session_conf = tf.compat.v1.ConfigProto()
session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1,
inter_op_parallelism_threads=1)
os.environ['TF_CUDNN_DETERMINISTIC'] = 'True'
os.environ['TF_DETERMINISTIC_OPS'] = 'True'
sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(),
config=session_conf)
tf.compat.v1.keras.backend.set_session(sess)
return sess
init_seeds(seed=101920)
Baseline Models
¶
Train 2020: Batch Size = 16
¶
Like in preparing the year range data for the XGBoost
models, the data can be read, the train/test sets can be set up by using a test_size=0.2
and stratified by DateTime_YearWeek
. Since the size of the companies, both U.S. and foreign, demonstrated differences in feature importance, ordinal encoding using ranking can be utilized to contain this level of information. Then dummy variables can be created for the categorical features and the number of features examined since this information is needed for the input dimensions to the MLP
. Next, the training features can be scaled using the StandardScaler
, which removes the mean and scales to unit variance, that uses fit_transform
with the training set and applies what is fit with transform
to the test set.
import pandas as pd
from sklearn.model_selection import train_test_split
import category_encoders as ce
from sklearn.preprocessing import StandardScaler
pd.set_option('display.max_columns', None)
X = df2.drop(['Metric_Tons'], axis=1)
y = df2['Metric_Tons']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
stratify=X.DateTime_YearWeek,
random_state=1920)
X_train = X_train.drop(['DateTime_YearWeek'], axis=1)
X_test = X_test.drop(['DateTime_YearWeek'], axis=1)
ce_ord = ce.OrdinalEncoder(cols = ['foreign_company_size', 'US_company_size'])
X_train = ce_ord.fit_transform(X_train)
X_test = ce_ord.transform(X_test)
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
print('Dimensions of X_train for input:', X_train.shape[1])
sc = StandardScaler()
X_train = pd.DataFrame(sc.fit_transform(X_train))
X_test = pd.DataFrame(sc.transform(X_test))
Let's now set the path where model is saved and set up the callbacks containing the EarlyStopping
class that monitors the loss
and will stop training if it does not improve after 5 epochs, the ModelCheckpoint
class that monitors the mse
saving only the best one with a min
loss as well as the TensorBoard
class that enables visualizations using the TensorBoard
.
import datetime
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
filepath = 'MLP_weights_only_train20_baseline_sc_b16_epochs30.h5'
checkpoint_dir = os.path.dirname(filepath)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
histogram_freq=1)
callbacks_list = [EarlyStopping(monitor='loss', patience=5),
ModelCheckpoint(filepath, monitor='mse',
best_model_only=True, mode='min'),
tensorboard_callback]
We can define a baseline model to compare the results after tuning. Since there are 34 features after preprocessing, let's use three Dense
layers that incrementally decrease in node size with each containing kernel_initializer='normal'
to use a normal distribution to initialize the weights and activation='relu'
to use the positive part of each of the vector components with a 30% Dropout
before the output layer. Then we can configure the model for training using compile
specifying the loss function (mae
), the metric to evaluate (mse
) and which optimizer to utilize (adam
with the default learning_rate=0.001
).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
model = Sequential()
model.add(Dense(30, input_dim=34, kernel_initializer='normal',
activation='relu'))
model.add(Dense(20, kernel_initializer='normal', activation='relu'))
model.add(Dense(10, kernel_initializer='normal', activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1))
model.compile(loss='mae', metrics=['mse'], optimizer='adam')
model.summary()
Now the model can be trained by calling fit
utilizing the train dataset for 30 epochs, which is the iterations over the entire X_train
and y_train
, using a batch_size=16
, which is the number of samples per gradient update, validation_split=0.2
, which is the fraction of the training data not used for training but used for evaluating the loss and the model metrics on this data at the end of each epoch, and the specified callbacks from the callbacks_list
.
history = model.fit(X_train, y_train, epochs=30, batch_size=16,
validation_split=0.2, callbacks=callbacks_list)
Now, let's save the model if we need to reload it in the future and plot the model loss
and val_loss
over the training epochs.
import matplotlib.pyplot as plt
model.save('./MLP_Baseline_sc_batch16_30epochs_train20_tf.h5', save_format='tf')
# Load model for more training or later use
#filepath = 'MLP_weights_only_train20_baseline_sc_b16_epochs30.h5'
#model = tf.keras.models.load_model('./MLP_Baseline_sc_batch16_30epochs_train20_tf.h5')
#model.load_weights(filepath)
plt.title('Model Error for Metric Tonnage')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='val_loss')
plt.ylabel('Error [Metric Tonnage]')
plt.xlabel('Epoch')
plt.legend()
plt.show()
We can save the model losses in a pandas.Dataframe
, and plot both the mae
and mse
from the train and validation sets over the epochs as well.
losses = pd.DataFrame(model.history.history)
losses.plot()
plt.title('Model Error for Metric Tonnage')
plt.ylabel('Error [Metric Tonnage]')
plt.xlabel('Epoch')
plt.show()
We can use model.predict
to generate output predictions for the specified input samples so let's predict using the training set, convert the predicted metric tons to a pandas.Dataframe
and examine the size to determine chunksize
for plotting. Let's graph the predicted vs. actual metric tonnage for the train set.
pred_train = model.predict(X_train)
y_pred = pd.DataFrame(pred_train)
y_pred.shape
plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Baseline: Train 2020 Test 2020: Predicted Metric Tons vs Actual Metric Tons',
fontsize=30)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()
To evaluate if this MLP
is effective at predicting metric tonnage, metrics like the Mean Absolute Error (MAE)
, Mean Squared Error (MSE)
, Root Mean Squared Error (RMSE)
and the Coefficient of Determination
(R²) can be utilized for both the training and test sets.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
print('Metrics: Train set')
print('MAE: %3f' % mean_absolute_error(y_train[:], pred_train[:]))
print('MSE: %3f' % mean_squared_error(y_train[:], pred_train[:]))
print('RMSE: %3f'% np.sqrt(mean_squared_error(y_train[:], pred_train[:])))
print('R2: %3f' % r2_score(y_train[:], pred_train[:]))
Then predict on test set and convert the predicted metric tons to pandas.Dataframe
to evaluate the metrics from the test set.
pred_test = model.predict(X_test)
y_pred = pd.DataFrame(pred_test)
print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y_test[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y_test[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_test[:], pred_test[:])))
print('R2: %3f' % r2_score(y_test[:], pred_test[:]))
When comparing the metrics from the training and test sets, the errors are higher in the test set, but they are close in proximity. The R² is slightly lower in the test set compared to the R² from the training set.
Let's also examine how the predicted values for the maximum, average and minimum metric tonnage compares to the actual maximum, average and minimum metric tonnage.
print('Maximum Metric Tons', np.amax(y_test))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons', np.average(y_test))
print('Predicted Average Metric Tons', np.average(pred_test))
print('\nMinimum Metric Tons', np.amin(y_test))
print('Predicted Minimum Metric Tons', np.amin(pred_test))
For the baseline model using the 2020 data, there is a higher predicted maximum metric tonnage than the actual maximum metric tons while a lower predicted average and a lower predicted minimum compared to the actual.
Test on 2019
Let's now prepare the 2019 set by partitioning the data and processing the features. We can use model.predict
using the test set, convert the predicted metric tons to a pandas.Dataframe
and graph the predicted vs. actual metric tonnage for the 2019 set.
X = df1.drop(['Metric_Tons', 'DateTime_YearWeek'], axis=1)
y = df1['Metric_Tons']
X = ce_ord.fit_transform(X)
X = pd.get_dummies(X, drop_first=True)
X = pd.DataFrame(sc.fit_transform(X))
pred_test = model.predict(X)
y_pred = pd.DataFrame(pred_test)
plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Baseline: Train 2020 Test 2019: Predicted Metric Tons vs Actual Metric Tons',
fontsize=30)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()
To evaluate if this MLP
is effective at predicting metric tonnage, let's evaluate the MAE
, MSE
, RMSE
and the R² for the 2020 model using the 2019 set
print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y[:], pred_test[:])))
print('R2: %3f' % r2_score(y[:], pred_test[:]))
Compared to the 2020 test set metrics, there are higher errors for MAE
(9.556285 vs 8.552116), MSE
(516.556536 vs 418.956668), RMSE
(22.727880 vs 20.468431) and lower R² (0.486803 vs 0.555586).
Let's also examine how the predicted values for the maximum, average and minimum metric tonnage compares to the actual maximum, average and minimum metric tonnage.
print('Maximum Metric Tons:', np.amax(y))
print('Predicted Max Metric Tons:', np.amax(pred_test))
('\n')
print('\nAverage Metric Tons:', np.average(y))
print('Predicted Average Metric Tons:', np.average(pred_test))
('\n')
print('\nMinimum Metric Tons:', np.amin(y))
print('Predicted Minimum Metric Tons:', np.amin(pred_test))
For the baseline model using the 2020 data tested using the 2019 data, there is an even higher predicted maximum metric tonnage than the actual maximum metric tons (417.72134 vs 286.75543) while an even lower predicted average (17.587381 vs. 18.042938) and a lower predicted minimum (-7.155898 vs. -6.1492796) compared to the actual. The average metric tons for 2020 is 21.525 and 2019 is 21.898, which are comparable.
Train 2019: Batch Size = 16
¶
Like in preparing the year range data for the XGBoost
models, the data can be read, the train/test sets can be set up by using a test_size=0.2
and stratified by DateTime_YearWeek
. Since the size of the companies, both U.S. and foreign, demonstrated differences in the feature importance, ordinal encoding using ranking can be utilized to contain this level of information. Then dummy variables can be created for the categorical features and the number of features examined since this information is needed for the input dimensions to the MLP
. Next, the training features can be scaled using the StandardScaler
, which removes the mean and scales to unit variance, that uses fit_transform
with the training set and applies what is fit with transform
to the test set.
X = df1.drop(['Metric_Tons'],axis=1)
y = df1['Metric_Tons']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
stratify=X.DateTime_YearWeek,
random_state=1920)
X_train = X_train.drop(['DateTime_YearWeek'], axis=1)
X_test = X_test.drop(['DateTime_YearWeek'], axis=1)
X_train = ce_ord.fit_transform(X_train)
X_test = ce_ord.transform(X_test)
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
print('Dimensions of X_train for input:', X_train.shape[1])
sc = StandardScaler()
X_train = pd.DataFrame(sc.fit_transform(X_train))
X_test = pd.DataFrame(sc.transform(X_test))
Let's now set the path where model is saved and set up the callbacks containing the EarlyStopping
class that monitors the loss
and will stop training if it does not improve after 5 epochs, the ModelCheckpoint
class that monitors the mse
saving only the best one with a min
loss as well as the TensorBoard
class that enables visualizations using the TensorBoard
.
log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
%load_ext tensorboard
filepath = 'MLP_weights_only_train19_baseline_sc_b16_epochs30.h5'
checkpoint_dir = os.path.dirname(filepath)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
histogram_freq=1)
callbacks_list = [EarlyStopping(monitor='loss', patience=5),
ModelCheckpoint(filepath, monitor='mse',
best_model_only=True, mode='min'),
tensorboard_callback]
Utilizing the same model architecture as the one for training 2020, lets train the baseline MLP
using the 2019 training set.
history = model.fit(X_train, y_train, epochs=30, batch_size=16,
validation_split=0.2, callbacks=callbacks_list)
Now, let's save the model if we need to reload it in the future and plot the model loss
and val_loss
over the training epochs.
model.save('./MLP_Baseline_sc_batch16_30epochs_train19_tf.h5', save_format='tf')
plt.title('Model Error for Metric Tonnage')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='val_loss')
plt.ylabel('Error [Metric Tonnage]')
plt.xlabel('Epoch')
plt.legend()
plt.show()
As with the 2020 baseline model, the val_loss
is lower than the training loss
over the duration of the training epochs. Then, we can save the model losses in a pandas.Dataframe
, and plot both the mae
and mse
from the train and validation sets over the epochs as well.
losses = pd.DataFrame(model.history.history)
losses.plot()
plt.title('Model Error for Metric Tonnage')
plt.ylabel('Error [Metric Tonnage]')
plt.xlabel('Epoch')
plt.show()
We can use model.predict
to generate output predictions for the specified input samples so let's predict using the training set, convert the predicted metric tons to a pandas.Dataframe
and examine the size to determine chunksize
for plotting. Let's graph the predicted vs. actual metric tonnage for the train set.
pred_train = model.predict(X_train)
y_pred = pd.DataFrame(pred_train)
plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Training Set: Predicted Metric Tons vs Actual Metric Tons',
fontsize=25)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y_train, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()
We can use model.predict
using the test set, convert the predicted metric tons to a pandas.Dataframe
and graph the predicted vs. actual metric tonnage for the test set.
pred_test = model.predict(X_test)
y_pred = pd.DataFrame(pred_test)
plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Test Set: Predicted Metric Tons vs Actual Metric Tons', fontsize=30)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y_test, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()
To evaluate if this MLP
is effective at predicting metric tonnage, let's evaluate the MAE
, MSE
, RMSE
and the R² for both the training and test sets.
print('Metrics: Train set')
print('MAE: %3f' % mean_absolute_error(y_train[:], pred_train[:]))
print('MSE: %3f' % mean_squared_error(y_train[:], pred_train[:]))
print('RMSE: %3f'% np.sqrt(mean_squared_error(y_train[:], pred_train[:])))
print('R2: %3f' % r2_score(y_train[:], pred_train[:]))
print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y_test[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y_test[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_test[:], pred_test[:])))
print('R2: %3f' % r2_score(y_test[:], pred_test[:]))
When comparing the metrics from the training and test sets, the errors are higher in the test set, but they are close in proximity. The R² is slightly lower in the test set compared to the R² from the training set.
Let's also examine how the predicted values for the maximum, average and minimum metric tonnage compares to the actual maximum, average and minimum metric tonnage.
print('Maximum Metric Tons', np.amax(y_test))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons', np.average(y_test))
print('Predicted Average Metric Tons', np.average(pred_test))
print('\nMinimum Metric Tons', np.amin(y_test))
print('Predicted Minimum Metric Tons', np.amin(pred_test))
For the baseline model using the 2019 data, there is also a higher predicted maximum metric tonnage than the actual maximum metric tons, but it is farther away from the actual maximum than the predicted vs. actual maximum for the 2020 baseline model. On the other hand, there is an even lower predicted average and predicted minimum compared to the actual than what was observed for the baseline model for the 2020 set.
Test on 2020
Let's now prepare the 2020 set by partitioning the data and processing the features. We can use model.predict
using the test set, convert the predicted metric tons to a pandas.Dataframe
and graph the predicted vs. actual metric tonnage for the 2020 set.
X = df2.drop(['Metric_Tons', 'DateTime_YearWeek'], axis=1)
y = df2['Metric_Tons']
X = ce_ord.fit_transform(X)
X = pd.get_dummies(X, drop_first=True)
X = pd.DataFrame(sc.fit_transform(X))
pred_test = model.predict(X)
y_pred = pd.DataFrame(pred_test)
plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Baseline: Train 2019 Test 2020: Predicted Metric Tons vs Actual Metric Tons',
fontsize=30)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()
To evaluate if this MLP
is effective at predicting metric tonnage, let's evaluate the MAE
, MSE
, RMSE
and the R² for the 2019 model using the 2020 set
print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y[:], pred_test[:])))
print('R2: %3f' % r2_score(y[:], pred_test[:]))
Compared to the 2019 test set metrics, there are lower errors for MAE
(8.862149 vs. 9.251954), MSE
(431.787379 vs. 459.566662), RMSE
(20.779494 vs. 21.437506) and comparable R² (0.542457 vs. 0.543552).
Let's also examine how the predicted values for the maximum, average and minimum metric tonnage compares to the actual maximum, average and minimum metric tonnage.
print('Maximum Metric Tons:', np.amax(y))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons:', np.average(y))
print('Predicted Average Metric Tons:', np.average(pred_test))
print('\nMinimum Metric Tons:', np.amin(y))
print('Predicted Minimum Metric Tons:', np.amin(pred_test))
For the 2019 baseline model tested using the 2020 data, there is an even higher predicted maximum metric tonnage than the actual maximum metric tons (398.72498 vs. 315.0854) while an even lower predicted average (18.037947 vs. 17.043827) and a lower predicted minimum (-8.755297 vs. -7.53874) compared to the actual. The average metric tons for 2020 is 21.525 and 2019 is 21.898, so they are comparable.
Train 2010 - 2019: Batch Size = 32
¶
Like in preparing the year range data for XGBoost
models, let's read data, remove the missing data for the most complete cases and filter the set to 2010 - 2019 observations. Then set up the data to create the train/test sets stratified by Year
. Next, ordinal encode both the US and foreign company size using ranking, create the dummy variables for the categorical variables and utilize the StandardScaler
rather the MinMaxScaler
.
Let's now set the path where model is saved and set up the callbacks with the EarlyStopping
that monitors the loss
and will stop training if it does not improve after 5 epochs, the ModelCheckpoint
that monitors the mse
saving only the best one with a min
loss and the TensorBoard
as well .
log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
%load_ext tensorboard
filepath = 'MLP_weights_only_train1019_baseline_sc_b32_epochs30.h5'
checkpoint_dir = os.path.dirname(filepath)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
histogram_freq=1)
callbacks_list = [EarlyStopping(monitor='loss', patience=5),
ModelCheckpoint(filepath, monitor='mse',
best_model_only=True, mode='min'),
tensorboard_callback]
Given the large amount of observations even with sampling, let's use batch_size=32
and the same other parameters to train the MLP
using this train dataset.
history = model.fit(X_train, y_train, epochs=30, batch_size=32,
validation_split=0.2, callbacks=callbacks_list)
Now, let's save the model if we need to reload it in the future and plot the model loss
and val_loss
over the training epochs.
model.save('./MLP_Baseline_sc_batch32_30epochs_train1019_tf.h5',
save_format='tf')
plt.title('Model Error for Metric Tonnage')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='val_loss')
plt.ylabel('Error [Metric Tonnage]')
plt.xlabel('Epoch')
plt.legend()
plt.show()
As with the 2020 and 2019 baseline models, the val_loss
is lower than the training loss
over the duration of the training epochs, and this baseline model appears to have lower error values. Then, we can save the model losses in a pandas.Dataframe
, and plot both the mae
and mse
from the train and validation sets over the epochs as well.
losses = pd.DataFrame(model.history.history)
losses.plot()
plt.title('Model Error for Metric Tonnage')
plt.ylabel('Error [Metric Tonnage]')
plt.xlabel('Epoch')
plt.show()
We can use model.predict
to generate output predictions for the specified input samples so let's predict using the training set, convert the predicted metric tons to a pandas.Dataframe
and examine the size to determine chunksize
for plotting. Let's graph the predicted vs. actual metric tonnage for the train set.
pred_train = model.predict(X_train)
y_pred = pd.DataFrame(pred_train)
plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Training Set: Predicted Metric Tons vs Actual Metric Tons',
fontsize=25)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y_train, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()
We can use model.predict
using the test set, convert the predicted metric tons to a pandas.Dataframe
and graph the predicted vs. actual metric tonnage for the test set.
pred_test = model.predict(X_test)
y_pred = pd.DataFrame(pred_test)
plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Test Set: Predicted Metric Tons vs Actual Metric Tons', fontsize=30)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y_test, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()
To evaluate if this MLP
is effective at predicting metric tonnage, let's evaluate the MAE
, MSE
, RMSE
and the R² for both the training and test sets.
print('Metrics: Train set')
print('MAE: %3f' % mean_absolute_error(y_train[:], pred_train[:]))
print('MSE: %3f' % mean_squared_error(y_train[:], pred_train[:]))
print('RMSE: %3f'% np.sqrt(mean_squared_error(y_train[:], pred_train[:])))
print('R2: %3f' % r2_score(y_train[:], pred_train[:]))
print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y_test[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y_test[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_test[:], pred_test[:])))
print('R2: %3f' % r2_score(y_test[:], pred_test[:]))
When comparing the metrics from the training and test sets, the errors are higher in the test set, but they are close in proximity. The R² is slightly lower in the test set compared to the R² from the training set.
Let's also examine how the predicted values for the maximum, average and minimum metric tonnage compares to the actual maximum, average and minimum metric tonnage.
print('Maximum Metric Tons:', np.amax(y_test))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons:', np.average(y_test))
print('Predicted Average Metric Tons:', np.average(pred_test))
print('\nMinimum Metric Tons:', np.amin(y_test))
print('Predicted Minimum Metric Tons:', np.amin(pred_test))
Out of all of the baseline models, the model using the 2010 - 2019 data was the highest for the predicted maximum metric tonnage and the lowest for the predicted average and a loweest for the predicted minimum compared to the actual.
Test on 2020
Let's now prepare the 2020 set by partitioning the data and processing the features. We can use model.predict
using the test set, convert the predicted metric tons to a pandas.Dataframe
and graph the predicted vs. actual metric tonnage for the 2020 set.
X = df2.drop(['Metric_Tons', 'DateTime_YearWeek'], axis=1)
y = df2['Metric_Tons']
X = ce_ord.fit_transform(X)
X = pd.get_dummies(X, drop_first=True)
X = pd.DataFrame(sc.fit_transform(X))
pred_test = model.predict(X)
y_pred = pd.DataFrame(pred_test)
plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Baseline: Train 2010-19 Test 2020: Predicted Metric Tons vs Actual Metric Tons',
fontsize=30)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()
To evaluate if this MLP
is effective at predicting metric tonnage, let's evaluate the MAE
, MSE
, RMSE
and the R² for the 2010 - 2019 model using the 2020 set.
print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y[:], pred_test[:])))
print('R2: %3f' % r2_score(y[:], pred_test[:]))
Compared to the 2010 - 2019 test set metrics, there are higher errors for MAE
(23.538832 vs. 8.637050), MSE
(1373.334543 vs. 391.928194),RMSE
(37.058529 vs. 19.797176) and a negative R² (-0.455251 vs. 0.446796).
Let's also examine how the predicted values for the maximum, average and minimum metric tonnage compares to the actual maximum, average and minimum metric tonnage.
print('Maximum Metric Tons:', np.amax(y))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons:', np.average(y))
print('\nMinimum Metric Tons:', np.amin(y))
print('Predicted Minimum Metric Tons:', np.amin(pred_test))
For the 2010 - 2019 baseline model using the using the 2020 data, there is an even lower predicted maximum metric tonnage than the actual maximum metric tons (308.2283 vs. 388.61658), a lower predicted minimum (-8.377576 vs. -7.649522) compared to the actual and an even higher predicted average (29.537125 vs. 16.301502) and . The average metric tons for 2020 is 21.525 and 2010 - 2019 is 20.173, so they are comparable.
Hyperparameter Tuning
¶
Train 2020 HPO5
¶
Let's first set up the callbacks for the TensorBoard
. Then we can define a function build_model
that will evaluate different model parameters during the hyperparameter tuning process. The parameters to test are:
num_layers
: 7 - 13 layerslayer_size
: 40 - 70 nodes usingstep=5
learning_rate
: 1 x 10-1, 1 x 10-2, 1 x 10-3
The same 30% Dropout
before the output layer as well as the same loss='mae'
and metrics=['mse']
can be utilized to compare the aforementioned hyperparameters.
from tensorflow.keras.callbacks import TensorBoard
from tensorflow import keras
filepath = 'MLP_weights_only_b16_20_HPO5.h5'
checkpoint_dir = os.path.dirname(filepath)
callbacks = [TensorBoard(log_dir=log_folder,
histogram_freq=1,
write_graph=True,
write_images=True,
update_freq='epoch',
profile_batch=1,
embeddings_freq=1)]
callbacks_list = [callbacks]
def build_model(hp):
model = keras.Sequential()
for i in range(hp.Int('num_layers', 7, 13)):
model.add(tf.keras.layers.Dense(units=hp.Int('layer_size' + str(i),
min_value=40, max_value=70,
step=5),
activation='relu',
kernel_initializer='normal'))
model.add(tf.keras.layers.Dropout(0.3))
model.add(tf.keras.layers.Dense(1))
model.compile(loss='mae', metrics=['mse'],
optimizer=tf.keras.optimizers.Adam(
hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])))
return model
We can define the conditions for keras_tuner
which include the objective='val_loss'
, the number of trials to search through (max_trials=25
), the length of each trial (executions_per_trial=1
), and the directory
and project_name
. Then we can now print a summary of the search space.
import keras_tuner
from keras_tuner import BayesianOptimization
tuner = BayesianOptimization(
build_model,
objective='val_loss',
max_trials=20,
executions_per_trial=1,
overwrite=True,
directory='MLP_20_HPO5sc',
project_name='MLP_20_HPO5')
tuner.search_space_summary()
Let's now begin the search for the best hyperparameters testing different parameters for one epoch using validation_split=0.2
and batch_size=16
.
tuner.search(X_train, y_train, epochs=1, validation_split=0.2, batch_size=16,
callbacks=callbacks)
Now that the search has completed, let's print a summary of the results from the trials.
tuner.results_summary()
From the search, the model/hyperparameters that resulted in the lowest val_loss
utilizes a network containing 11 layers with 5 of the layers containing 65 nodes, 2 with 60, 1 with 50, 3 with 45 nodes and a learning_rate: 0.001
. However, when fitting this model and the second best from the search using batch_size=32
, there was a high discrepancy between the train and test metrics. Therefore, the third selected from the search with a more complex model architecture was utilized. This potentially suggests that multiple epochs
should be used during the HPO
trials.
Fit The Best Model from HPO - Batch Size = 32
Let's now set the path where model is saved and set up the callbacks with EarlyStopping
that monitors the loss
and will stop training if it does not improve after 5 epochs
, the ModelCheckpoint
that monitors the mse
saving only the best one with a min
loss and the TensorBoard
as well.
import datetime
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
%load_ext tensorboard
filepath = 'MLP_weights_only_train20_b32_sc_epochs30_HPO5batch16.h5'
checkpoint_dir = os.path.dirname(filepath)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
histogram_freq=1)
callbacks_list = [EarlyStopping(monitor='loss', patience=5),
ModelCheckpoint(filepath, monitor='mse',
save_freq='epoch'),
tensorboard_callback]
We can now define the model architecture using the results from keras-tuner
search, compile
and then examine.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
model = Sequential()
model.add(Dense(70, input_dim=34, kernel_initializer='normal',
activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(40, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(40, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(40, kernel_initializer='normal', activation='relu'))
model.add(Dense(60, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1))
opt = tf.keras.optimizers.Adam(learning_rate=0.0001)
model.compile(loss='mae', metrics=['mse'], optimizer=opt)
model.summary()
Compared to the total model parameters from the baseline model, there are 48,361 vs. 1,891 trainable parameters. Now, the model can be trained by calling fit
utilizing the train dataset for 30 epochs
using a batch_size=32
, validation_split=0.2
and the specified callbacks from the callbacks_list
.
history = model.fit(X_train, y_train, epochs=30, batch_size=32,
validation_split=0.2, callbacks=callbacks_list)
Now, let's save the model if we need to reload it in the future and plot the model loss
and val_loss
over the training epochs.
model.save('./MLP_batch32_sc_30Epochs_HPO5batch32_train20_tf.h5',
save_format='tf')
# Load model for more training or later use
#filepath = 'MLP_weights_only_train20_b32_sc_epochs30_HPO5batch16.h5'
#model = tf.keras.models.load_model('./MLP_batch32_30Epochs_HPO5batch16_train20_tf.h5')
#model.load_weights(filepath)
plt.title('Model Error for Metric Tonnage')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='val_loss')
plt.ylabel('Error [Metric Tonnage]')
plt.xlabel('Epoch')
plt.legend()
plt.show()
We can save the model losses in a pandas.Dataframe
, and plot both the mae
and mse
from the train and validation sets over the epochs as well.
losses = pd.DataFrame(model.history.history)
losses.plot()
plt.title('Model Error for Metric Tonnage')
plt.ylabel('Error [Metric Tonnage]')
plt.xlabel('Epoch')
plt.show()
We can use model.predict
to generate output predictions for the specified input samples so let's predict using the training set, convert the predicted metric tons to a pandas.Dataframe
and examine the size to determine chunksize
for plotting. Let's graph the predicted vs. actual metric tonnage for the train set.
pred_train = model.predict(X_train)
y_pred = pd.DataFrame(pred_train)
plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Training Set: Predicted Metric Tons vs Actual Metric Tons',
fontsize=25)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y_train, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()
We can use model.predict
using the test set, convert the predicted metric tons to a pandas.Dataframe
and graph the predicted vs. actual metric tonnage for the test set.
pred_test = model.predict(X_test)
y_pred = pd.DataFrame(pred_test)
plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Test Set: Predicted Metric Tons vs Actual Metric Tons', fontsize=30)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y_test, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()
To evaluate if this MLP
is effective at predicting metric tonnage, let's evaluate the MAE
, MSE
, RMSE
and the R² for both the training and test sets.
print('Metrics: Train set')
print('MAE: %3f' % mean_absolute_error(y_train[:], pred_train[:]))
print('MSE: %3f' % mean_squared_error(y_train[:], pred_train[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_train[:], pred_train[:])))
print('R2: %3f' % r2_score(y_train[:], pred_train[:]))
print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y_test[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y_test[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_test[:], pred_test[:])))
print('R2: %3f' % r2_score(y_test[:], pred_test[:]))
When comparing the metrics from the training and test sets, the errors are higher in the test set, but they are close in proximity. The R² is slightly lower in the test set compared to the R² from the training set. When comparing the metrics from the model using HPO
to the baseline model, all of the MAE
, MSE
, and RMSE
metrics are lower and the $^{2}$ is higher for both the train and test sets.
Let's also examine how the predicted values for the maximum, average and minimum metric tonnages compares to the actual maximum, average and minimum metric tonnage.
print('Maximum Metric Tons:', np.amax(y_test))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons:', np.average(y_test))
print('Predicted Average Metric Tons:', np.average(pred_test))
print('\nMinimum Metric Tons:', np.amin(y_test))
print('Predicted Minimum Metric Tons:', np.amin(pred_test))
Test on 2019
Let's now prepare the 2019 set by partitioning the data and processing the features. We can use model.predict
using the test set, convert the predicted metric tons to a pandas.Dataframe
and graph the predicted vs. actual metric tonnage for the 2019 set.
X = df1.drop(['Metric_Tons', 'DateTime_YearWeek'], axis=1)
y = df1['Metric_Tons']
X = ce_ord.fit_transform(X)
X = pd.get_dummies(X, drop_first=True)
X = pd.DataFrame(sc.transform(X))
pred_test = model.predict(X)
y_pred = pd.DataFrame(pred_test)
plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Train 2020 Test 2019: Predicted Metric Tons vs Actual Metric Tons',
fontsize=30)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()
To evaluate if this MLP
is effective at predicting metric tonnage, let's evaluate the MAE
, MSE
, RMSE
and the R² for the 2020 model using the 2019 set. We can also examine how the predicted values for the maximum, average and minimum metric tonnage compare to the actual maximum, average and minimum metric tonnage.
print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y[:], pred_test[:])))
print('R2: %3f' % r2_score(y[:], pred_test[:]))
print('Maximum Metric Tons:', np.amax(y))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons:', np.average(y))
print('Predicted Average Metric Tons:', np.average(pred_test))
print('\nMinimum Metric Tons:', np.amin(y))
print('Predicted Minimum Metric Tons:', np.amin(pred_test))
For the 2020 HPO model using the 2019 data, there is an even higher predicted maximum metric tonnage than the actual maximum metric tons (258.91766 vs. 246.31972), a lower predicted minimum (-0.4527645 vs. -0.21363783) compared to the actual and an even lower predicted average (18.422684 vs. 19.69039).
Train 2019 HPO3
¶
Let's first set up the callbacks for the TensorBoard
. Then we can define a function build_model
that will evaluate different model parameters during the hyperparameter tuning process. The parameters to test are:
num_layers
: 4 - 10 layerslayer_size
: 20 - 70 nodes usingstep=5
learning_rate
: 1 x 10-1, 1 x 10-2, 1 x 10-3
The same 30% Dropout
before the output layer as well as the same loss='mae'
and metrics=['mse']
can be utilized to compare the aforementioned hyperparameters.
callbacks = [TensorBoard(log_dir=log_folder,
histogram_freq=1,
write_graph=True,
write_images=True,
update_freq='epoch',
profile_batch=1,
embeddings_freq=1)]
def build_model(hp):
model = keras.Sequential()
for i in range(hp.Int('num_layers', 4, 10)):
model.add(tf.keras.layers.Dense(units=hp.Int('layer_size' + str(i),
min_value=20, max_value=70,
step=5),
activation='relu',
kernel_initializer='normal'))
model.add(tf.keras.layers.Dropout(0.3))
model.add(tf.keras.layers.Dense(1))
model.compile(loss='mae', metrics=['mse'], optimizer=keras.optimizers.Adam(
hp.Choice('learning_rate', values=[1e-1, 1e-2, 1e-3])))
return model
We can define the conditions for keras_tuner
which include the objective='val_loss'
, the number of trials to search through (max_trials=25
), the length of each trial (executions_per_trial=1
), and the directory
and project_name
. Then we can now print a summary of the search space.
tuner = BayesianOptimization(
build_model,
objective='val_loss',
max_trials=20,
executions_per_trial=1,
overwrite=True,
directory='MLP_19_HPO3sc',
project_name='MLP_19_HPO3')
tuner.search_space_summary()
Let's now begin the search for the best hyperparameters testing different parameters for one epoch using validation_split=0.2
and batch_size=16
.
tuner.search(X_train, y_train, epochs=1, validation_split=0.2, batch_size=16,
callbacks=callbacks)
Now that the search has completed, let's print a summary of the results from the trials.
tuner.results_summary()
From the search, the model/hyperparameters that resulted in the lowest val_loss
utilizes a network containing 4 layers with all the layers containing layer_size=70
and a learning_rate: 0.001
. The top 4 networks with the lowest val_loss
examined contained 4 layers as well.
Fit The Best Model from HPO - Batch Size = 32
Let's now set the path where model is saved and set up the callbacks with EarlyStopping
that monitors the loss
and will stop training if it does not improve after 5 epochs
, the ModelCheckpoint
that monitors the mse
saving only the best one with a min
loss and the TensorBoard
as well.
log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
%load_ext tensorboard
filepath = 'MLP_weights_only_train19_b32_sc_epochs30_HPO3batch16.h5'
checkpoint_dir = os.path.dirname(filepath)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
histogram_freq=1)
callbacks_list = [EarlyStopping(monitor='loss', patience=5),
ModelCheckpoint(filepath, monitor='mse',
save_freq='epoch'),
tensorboard_callback]
We can now define the model architecture using the results from keras-tuner
search, compile
and then examine.
model = Sequential()
model.add(Dense(70, input_dim=34, kernel_initializer='normal',
activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1))
opt = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='mae', metrics=['mse'], optimizer=opt)
model.summary()
Compared to the total model parameters from the baseline model, there are 17,431 vs. 1,891 trainable parameters. Now, the model can be trained by calling fit
utilizing the train dataset for 30 epochs
using a batch_size=32
, validation_split=0.2
and the specified callbacks from the callbacks_list
.
history = model.fit(X_train, y_train, epochs=30, batch_size=32,
validation_split=0.2, callbacks=callbacks_list)
Now, let's save the model if we need to reload it in the future and plot the model loss
and val_loss
over the training epochs.
model.save('./MLP_batch32_sc_30Epochs_HPO3batch16_train19_tf.h5',
save_format='tf')
plt.title('Model Error for Metric Tonnage')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='val_loss')
plt.ylabel('Error [Metric Tonnage]')
plt.xlabel('Epoch')
plt.legend()
plt.show()
We can save the model losses in a pandas.Dataframe
, and plot both the mae
and mse
from the train and validation sets over the epochs as well.
losses = pd.DataFrame(model.history.history)
losses.plot()
plt.title('Model Error for Metric Tonnage')
plt.ylabel('Error [Metric Tonnage]')
plt.xlabel('Epoch')
plt.show()
We can use model.predict
to generate output predictions for the specified input samples so let's predict using the training set, convert the predicted metric tons to a pandas.Dataframe
and examine the size to determine chunksize
for plotting. Let's graph the predicted vs. actual metric tonnage for the train set.
pred_train = model.predict(X_train)
y_pred = pd.DataFrame(pred_train)
plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Training Set: Predicted Metric Tons vs Actual Metric Tons',
fontsize=25)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y_train, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()
We can use model.predict
using the test set, convert the predicted metric tons to a pandas.Dataframe
and graph the predicted vs. actual metric tonnage for the test set.
pred_test = model.predict(X_test)
y_pred = pd.DataFrame(pred_test)
plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Test Set: Predicted Metric Tons vs Actual Metric Tons', fontsize=30)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y_test, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()
To evaluate if this MLP
is effective at predicting metric tonnage, let's evaluate the MAE
, MSE
, RMSE
and the R² for both the training and test sets.
print('Metrics: Train set')
print('MAE: %3f' % mean_absolute_error(y_train[:], pred_train[:]))
print('MSE: %3f' % mean_squared_error(y_train[:], pred_train[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_train[:], pred_train[:])))
print('R2: %3f' % r2_score(y_train[:], pred_train[:]))
print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y_test[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y_test[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_test[:], pred_test[:])))
print('R2: %3f' % r2_score(y_test[:], pred_test[:]))
When comparing the metrics from the training and test sets, the errors are higher in the test set, but the distance between the errors for the HPO
models for the train and test sets is greater than the distance for the 2020 HPO
model. The R² for the train set is higher than the R² from the train/test sets using the 2020 HPO
model while the test R² from the 2019 HPO
model is lower than both of the R² from the 2020 HPO
model. When comparing the metrics from the model using 2019 HPO
model to the baseline model, all of the MAE
, MSE
, and RMSE
metrics are lower and the $^{2}$ is higher for both the train and test sets.
Let's also examine how the predicted values for the maximum, average and minimum metric tonnage compares to the actual maximum, average and minimum metric tonnage.
print('Maximum Metric Tons:', np.amax(y_test))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons:', np.average(y_test))
print('Predicted Average Metric Tons:', np.average(pred_test))
print('\nMinimum Metric Tons:', np.amin(y_test))
print('Predicted Minimum Metric Tons:', np.amin(pred_test))
Test on 2020
Let's now prepare the 2020 set by partitioning the data and processing the features.
X = df2.drop(['Metric_Tons', 'DateTime_YearWeek'], axis=1)
y = df2['Metric_Tons']
X = ce_ord.fit_transform(X)
X = pd.get_dummies(X, drop_first=True)
X = pd.DataFrame(sc.fit_transform(X))
We can use model.predict
using the test set, convert the predicted metric tons to a pandas.Dataframe
and graph the predicted vs. actual metric tonnage for the 2020 set.
pred_test = model.predict(X)
y_pred = pd.DataFrame(pred_test)
plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Train 2019 Test 2020: Predicted Metric Tons vs Actual Metric Tons',
fontsize=30)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()
To evaluate if this MLP
is effective at predicting metric tonnage, let's evaluate the MAE
, MSE
, RMSE
and the R² for the 2019 model using the 2020 set and also examine how the predicted values for the maximum, average and minimum metric tonnage compare to the actual maximum, average and minimum metric tonnage.
print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y[:], pred_test[:])))
print('R2: %3f' % r2_score(y[:], pred_test[:]))
print('Maximum Metric Tons:', np.amax(y))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons:', np.average(y))
print('Predicted Average Metric Tons:', np.average(pred_test))
print('\nMinimum Metric Tons:', np.amin(y))
print('Predicted Minimum Metric Tons:', np.amin(pred_test))
For the 2019 HPO model using the 2020 data, there is an even higher predicted maximum metric tonnage than the actual maximum metric tons (481.06577 vs. 341.8011), a lower predicted minimum (-14.073943 vs. -13.173259) compared to the actual and an even lower predicted average (17.65845 vs. 18.240992).
Train 2010 - 2019 HPO5
¶
Let's first set up the callbacks for the TensorBoard
. Then we can define a function build_model
that will evaluate different model parameters during the hyperparameter tuning process. The parameters to test are:
num_layers
: 7 - 13 layerslayer_size
: 40 - 70 nodes usingstep=5
learning_rate
: 1 x 10-1, 1 x 10-2, 1 x 10-3
The same 30% Dropout
before the output layer as well as the same loss='mae'
and metrics=['mse']
can be utilized to compare the aforementioned hyperparameters.
callbacks = [TensorBoard(log_dir=log_folder,
histogram_freq=1,
write_graph=True,
write_images=True,
update_freq='epoch',
profile_batch=1,
embeddings_freq=1)]
def build_model(hp):
model = keras.Sequential()
for i in range(hp.Int('num_layers', 7, 13)):
model.add(tf.keras.layers.Dense(units=hp.Int('layer_size' + str(i),
min_value=40, max_value=70,
step=5),
activation='relu',
kernel_initializer='normal'))
model.add(tf.keras.layers.Dropout(0.3))
model.add(tf.keras.layers.Dense(1))
model.compile(loss='mae', metrics=['mse'], optimizer=keras.optimizers.Adam(
hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])))
return model
We can define the conditions for keras_tuner
which include the objective='val_loss'
, the number of trials to search through (max_trials=20
), the length of each trial (executions_per_trial=1
), and the directory
and project_name
. Then we can now print a summary of the search space.
tuner = BayesianOptimization(
build_model,
objective='val_loss',
max_trials=20,
executions_per_trial=1,
overwrite=True,
directory='MLP_1019_HPO5sc',
project_name='MLP_1019_HPO5')
tuner.search_space_summary()
Let's now begin the search for the best hyperparameters testing different parameters for one epoch using validation_split=0.2
and batch_size=64
.
tuner.search(X_train, y_train, epochs=1, validation_split=0.2, batch_size=64,
callbacks=callbacks)
Now that the search has completed, let's print a summary of the results from the trials.
tuner.results_summary()
From the search, the model/hyperparameters that resulted in the lowest val_loss
utilizes a network containing 13 layers with 3 of the layers containing 70 nodes, 3 with 65, 1 with 60, 1 with 55 nodes, 3 with 45 nodes, 1 with 40 nodes and a learning_rate: 0.001
. The top 6 networks with the lowest val_loss
examined contained 13 layers as well.
Fit The Best Model from HPO - Batch Size = 64
Let's now set the path where model is saved and set up the callbacks with EarlyStopping
that monitors the loss
and will stop training if it does not improve after 5 epochs
, the ModelCheckpoint
that monitors the mse
saving only the best one with a min
loss and the TensorBoard
as well.
log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
%load_ext tensorboard
filepath = 'MLP_weights_only_train1019_b64_sc_epochs30_HPO5batch64.h5'
checkpoint_dir = os.path.dirname(filepath)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
histogram_freq=1)
callbacks_list = [EarlyStopping(monitor='loss', patience=5),
ModelCheckpoint(filepath, monitor='mse',
save_freq='epoch'),
tensorboard_callback]
We can now define the model architecture using the results from keras-tuner
search, compile
and then examine.
model = Sequential()
model.add(Dense(70, input_dim=34, kernel_initializer='normal',
activation='relu'))
model.add(Dense(65, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(45, kernel_initializer='normal', activation='relu'))
model.add(Dense(40, kernel_initializer='normal', activation='relu'))
model.add(Dense(65, kernel_initializer='normal', activation='relu'))
model.add(Dense(55, kernel_initializer='normal', activation='relu'))
model.add(Dense(65, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(45, kernel_initializer='normal', activation='relu'))
model.add(Dense(60, kernel_initializer='normal', activation='relu'))
model.add(Dense(45, kernel_initializer='normal', activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1))
opt = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='mae', metrics=['mse'], optimizer=opt)
model.summary()
The total parameters for this model have increased to 44,991 from 1,891 that were utilized for the baseline model. This model contains the higher number of parameters out of all searches completed (Baseline: 1,891
, Train 20: 37,886
, Train 19: 17,431
)
Now the model can be trained by calling fit
utilizing the train dataset for 30 epochs
using a batch_size=64
, validation_split=0.2
and the specified callbacks from the callbacks_list
.
history = model.fit(X_train, y_train, epochs=30, batch_size=64,
validation_split=0.2, callbacks=callbacks_list)
Now, let's save the model if we need to reload it in the future and plot the model loss
and val_loss
over the training epochs.
model.save('./MLP_batch64_sc_30Epochs_HPO5batch64_train1019_tf.h5',
save_format='tf')
plt.title('Model Error for Metric Tonnage')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='val_loss')
plt.ylabel('Error [Metric Tonnage]')
plt.xlabel('Epoch')
plt.legend()
plt.show()
We can save the model losses in a pandas.Dataframe
, and plot both the mae
and mse
from the train and validation sets over the epochs as well.
losses = pd.DataFrame(model.history.history)
losses.plot()
plt.title('Model Error for Metric Tonnage')
plt.ylabel('Error [Metric Tonnage]')
plt.xlabel('Epoch')
plt.show()
We can use model.predict
to generate output predictions for the specified input samples so let's predict using the training set, convert the predicted metric tons to a pandas.Dataframe
and examine the size to determine chunksize
for plotting. Let's graph the predicted vs. actual metric tonnage for the train set.
pred_train = model.predict(X_train)
y_pred = pd.DataFrame(pred_train)
plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Training Set: Predicted Metric Tons vs Actual Metric Tons',
fontsize=25)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y_train, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()
We can use model.predict
using the test set, convert the predicted metric tons to a pandas.Dataframe
and graph the predicted vs. actual metric tonnage for the test set.
pred_test = model.predict(X_test)
y_pred = pd.DataFrame(pred_test)
plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Test Set: Predicted Metric Tons vs Actual Metric Tons', fontsize=30)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y_test, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()
To evaluate if this MLP
is effective at predicting metric tonnage, let's evaluate the MAE
, MSE
, RMSE
and the R² for both the training and test sets.
print('Metrics: Train set')
print('MAE: %3f' % mean_absolute_error(y_train[:], pred_train[:]))
print('MSE: %3f' % mean_squared_error(y_train[:], pred_train[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_train[:], pred_train[:])))
print('R2: %3f' % r2_score(y_train[:], pred_train[:]))
print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y_test[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y_test[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_test[:], pred_test[:])))
print('R2: %3f' % r2_score(y_test[:], pred_test[:]))
When comparing the metrics from the training and test sets, the errors are higher in the test set, but the distance between the errors for the HPO
models for the train and test sets is greater than the distance for the 2020 HPO
model. The R² for the train set is higher than the R² from the train/test sets using the 2020 HPO
model while the test R² from the 2019 HPO
model is lower than both of the R² from the 2020 HPO
model. When comparing the metrics from the model using 2019 HPO
model to the baseline model, all of the MAE
, MSE
, and RMSE
metrics are lower and the $^{2}$ is higher for both the train and test sets.
Let's also examine how the predicted values for the maximum, average and minimum metric tonnage compares to the actual maximum, average and minimum metric tonnage.
print('Maximum Metric Tons:', np.amax(y_test))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons:', np.average(y_test))
print('Predicted Average Metric Tons:', np.average(pred_test))
print('\nMinimum Metric Tons:', np.amin(y_test))
print('Predicted Minimum Metric Tons:', np.amin(pred_test))
There are now closer values for the predicted vs. actual maximum (280.35706 vs. 249.99) and minimum (0.54962707 vs. 0.0), but around the same distance compared to the baseline model for the predicted vs. actual average metric tonnage (16.623627 vs. 20.17341674121816).
Test on 2020
Let's now prepare the 2020 set by partitioning the data and processing the features. We can use model.predict
using the test set, convert the predicted metric tons to a pandas.Dataframe
and graph the predicted vs. actual metric tonnage for the 2020 set.
X = df2.drop(['Metric_Tons', 'DateTime_YearWeek'], axis=1)
y = df2['Metric_Tons']
X = ce_ord.fit_transform(X)
X = pd.get_dummies(X, drop_first=True)
X = pd.DataFrame(sc.fit_transform(X))
pred_test = model.predict(X)
y_pred = pd.DataFrame(pred_test)
y_pred.shape
plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Train 2010-19 Test 2020: Predicted Metric Tons vs Actual Metric Tons',
fontsize=30)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()
To evaluate if this MLP
is effective at predicting metric tonnage, let's evaluate the MAE
, MSE
, RMSE
and the R² for the 2010 - 2019 model using the 2020 set and also examine how the predicted values for the maximum, average and minimum metric tonnage compare to the actual maximum, average and minimum metric tonnage.
print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y[:], pred_test[:])))
print('R2: %3f' % r2_score(y[:], pred_test[:]))
print('Maximum Metric Tons:', np.amax(y))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons:', np.average(y))
print('Predicted Average Metric Tons:', np.average(pred_test))
print('\nMinimum Metric Tons:', np.amin(y))
print('Predicted Minimum Metric Tons:', np.amin(pred_test))
There are closer values for the predicted vs. actual maximum (266.59167 vs. 249.99) and minimum (0.46689343 vs. 0.0), but farther away from the actual average metric tonnage (12.666425 vs. 19.548126725460893).
Long Short Term Memory Networks (LSTM)
The LSTM
model learns from the most relevant of the multivariate input sequences over time to the output where the sequences generate another dimension to the function being estimated. The information learned from connecting the input to output can dynamically change, which MLPs do not utilize. This method uses a fixed size of time windows that do not need to be specified and can applied to forecasting future events. The notebooks can be found here.
Train 2020
¶
Let's first set up the environment by importing the necessary packages, setting the warnings and seed with a similar seed function that used for the MLP
. Then the 2019 - 2020 data can be read in a pandas.Dataframe
, the DateTime
feature converted using pandas.to_datetime
and the DateTime_YearWeek
feature created using df['DateTime'].dt.strftime('%Y-w%U')
. Then DateTime
can be dropped from the set and the data prepared by sorting the set chronologically with DateTime_YearWeek
set as the index. Then the data set up for encoding the foreign_company_size
and us_company_size
using the ordinal encoder from category_encoders
. After the remaining qualitative variables can be converted to dummy variables using a specified prefix
and then concatenated with the target variable.
import pandas as pd
import category_encoders as ce
pd.set_option('display.max_columns', None)
df = pd.read_csv('combined_trade_final_LSTM.csv', low_memory=False)
print('Number of rows and columns:', df.shape)
df['DateTime']= pd.to_datetime(df['DateTime'])
df['DateTime_YearWeek'] = df['DateTime'].dt.strftime('%Y-w%U')
df = df.drop(['DateTime'], axis=1)
X = df.drop(['Metric_Tons'], axis=1)
y = df['Metric_Tons']
df = pd.concat([X, y], axis=1)
df = df.sort_values('DateTime_YearWeek')
df = df.set_index('DateTime_YearWeek')
X = df.drop(['Metric_Tons'], axis=1)
y = df['Metric_Tons']
ce_ord = ce.OrdinalEncoder(cols = ['foreign_company_size', 'us_company_size'])
X = ce_ord.fit_transform(X)
X = pd.get_dummies(X, prefix=['HS_Group_Name', 'Container_LCL/FCL',
'Foreign_Country_Region',
'US_Port_Coastal_Region',
'Trade_Direction', 'Container_Type_Dry'],
columns=['HS_Group_Name', 'Container_LCL/FCL',
'Foreign_Country_Region',
'US_Port_Coastal_Region',
'Trade_Direction', 'Container_Type_Dry'])
df = pd.concat([X, y], axis=1)
print(df.shape)
Let's now examine the Year
feature to determine the number of observations per year to be used to define the train/test sets.
df[['Year']].value_counts()
Now, we can drop the Year
feature and determine the number of columns to prepare the data for utilizing an LSTM
.
df = df.drop(['Year'], axis=1)
df.shape
To prepare the data for evaluating an LSTM
, let's define a function that converts the series to supervised learning with the input sequence (t-n, ... t-1)
and forecasts the sequence (t, t+1, ... t+n)
. Then, this can be concatenated and the rows with NaN
values dropped. We can apply this function by first loading the set, converting all to variables to float
, normalize all of the features with the MinMaxScaler
and then applying the series_to_supervised
function.
from sklearn.preprocessing import MinMaxScaler
def series_to_supervised(dat, n_in=1, n_out=1, dropnan=True):
n_vars = 1 if type(dat) is list else dat.shape[1]
df = pd.DataFrame(dat)
cols, names = list(), list()
for i in range(n_in, 0, -1):
cols.append(df.shift(i))
names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
for i in range(0, n_out):
cols.append(df.shift(-i))
if i == 0:
names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
else:
names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
agg = pd.concat(cols, axis=1)
agg.columns = names
if dropnan:
agg.dropna(inplace=True)
return agg
dataset = df
values = dataset.values
values = values.astype('float32')
scaler = MinMaxScaler(feature_range=(0,1))
scaled = scaler.fit_transform(values)
reframed = series_to_supervised(scaled, 1, 1)
reframed.drop(reframed.columns[[42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,
58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,
74,75,76,77,78,79,80,81]], axis=1, inplace=True)
print(reframed.head())
Now, let's define the model architecture using Sequential
with 50 neurons for the LSTM
and the output Dense
layer followed by compiling the model with the loss='mae'
, metrics=['mse']
and Adam
as the optimizer.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
model = Sequential()
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))
model.compile(loss='mae', metrics=['mse'], optimizer='adam')
model.summary()
Batch Size = 2
¶
To define the train
and the test
sets, we can use the yearly count of the data sorted by the DateTime_YearWeek
feature to specify the training set as 2020 and the test set as 2019. Then, the sets can be split into the input and output and the input reshaped to be 3D as [samples, timesteps, features]
.
values = reframed.values
n_train_hours = 3368492
test = values[:n_train_hours,:]
train = values[n_train_hours:,:]
train_X, train_y = train[:,:-1], train[:,-1]
test_X, test_y = test[:,:-1], test[:,-1]
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
Let's now set the path where model is saved and set up the callbacks with EarlyStopping
that monitors the loss
and will stop training if it does not improve after 3 epochs
, the ModelCheckpoint
that monitors the mse
saving only the best one with a min
loss and the TensorBoard
as well.
import datetime
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
%load_ext tensorboard
filepath = 'weights_only_train2020_test2019_n50_b2_epochs30.h5'
checkpoint_dir = os.path.dirname(filepath)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
histogram_freq=1)
callbacks_list = [EarlyStopping(monitor='loss', patience=3),
ModelCheckpoint(filepath, monitor='mse',
save_best_only=True, mode='min'),
tensorboard_callback]
Now the model can be trained by calling fit
utilizing the train dataset for 30 epochs
using a batch_size=2
without shuffling the data and the specified callbacks from the callbacks_list
.
history = model.fit(train_X, train_y, epochs=30, batch_size=2, shuffle=False,
callbacks=callbacks_list)
Now, let's save the model if we need to reload it in the future and plot the model loss
over the training epochs.
import matplotlib.pyplot as plt
model.save('./Model_50neuron_batch2_30epochs_train2020_tf.h5',
save_format='tf')
# Load model for more training or later use
#filepath = 'weights_only_train2020_test2019_n50_b2_epochs30.h5'
#loaded_model = tf.keras.models.load_model('./Model_50neuron_batch2_30epochs_train2020_tf')
#model.load_weights(filepath)
plt.title('Model Loss')
plt.plot(history.history['loss'], label='train')
plt.ylabel('Loss')
plt.xlabel('Epochs')
plt.legend()
plt.show()
We can now generate predictions for the training and the test sets by utilizing model.predict
, then create an empty table with 41 fields, fill the table with the predicted values, inverse transform and then select the correct field to determine the MAE
and RMSE
for the train and test sets.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
train_predict = model.predict(train_X)
train_predict_dataset_like = np.zeros(shape=(len(train_predict), 41))
train_predict_dataset_like[:,0] = train_predict[:,0]
train_predict = scaler.inverse_transform(train_predict_dataset_like)[:,0]
print('Train Mean Absolute Error:', mean_absolute_error(train_y[:],
train_predict[:]))
print('Train Root Mean Squared Error:', np.sqrt(mean_squared_error(train_y[:],
train_predict[:])))
test_predict = model.predict(test_X)
test_predict_dataset_like = np.zeros(shape=(len(test_predict), 41))
test_predict_dataset_like[:,0] = test_predict[:,0]
test_predict = scaler.inverse_transform(test_predict_dataset_like)[:,0]
print('Test Mean Absolute Error:', mean_absolute_error(test_y[:],
test_predict[:]))
print('Test Root Mean Squared Error:', np.sqrt(mean_squared_error(test_y[:],
test_predict[:])))
The test set metrics are higher for the test set compared to the training set.
Let's now make a prediction with the test set, invert the scaling for forecast and the actual test data to examine how the predicted values for the maximum, average and minimum metric tonnage compares to the actual maximum, average and minimum metric tonnage.
from numpy import concatenate
yhat = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))
inv_yhat = concatenate((yhat, test_X[:,1:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]
test_y = test_y.reshape((len(test_y), 1))
inv_y = concatenate((test_y, test_X[:,1:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]
print('Maximum Metric Tons:', np.amax(y_test))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons:', np.average(y_test))
print('Predicted Average Metric Tons:', np.average(pred_test))
print('\nMinimum Metric Tons:', np.amin(y_test))
print('Predicted Minimum Metric Tons:', np.amin(pred_test))
Now, let's calculate the MAE
and MSE
and plot the actual vs predicted metric tonnage.
print('Mean Absolute Error (MAE): %3f' % mean_absolute_error(inv_y, inv_yhat))
print('Mean Square Error (MSE): %3f' % mean_squared_error(inv_y, inv_yhat))
The 2019 test set metrics are higher for the test set compared to the training set, but not significantly higher.
plt.plot(inv_y, label='2019')
plt.plot(inv_yhat, label='2019 - predicted')
plt.legend()
plt.show()
As completed for all models using different batch sizes, we can extract the model metrics into a pandas.Dataframe
that can be utilized to plot the results.
data = [[2020, 2, np.amax(y_test), np.amax(pred_test), np.average(y_test),
np.average(pred_test), np.amin(y_test), np.amin(pred_test)]]
df1 = pd.DataFrame(data, columns=['Year', 'BatchSize', 'Maximum Metric Tons',
'Predicted Maximum Metric Tons',
'Average Metric Tons',
'Predicted Average Metric Tons',
'Minimum Metric Tons',
'Predicted Minimum Metric Tons'])
df = pd.concat([df, df1])
df.to_csv('LSTM_19_20_Results.csv')
Train 2019
¶
We can then swap the train
and test
sets from values
to define 2019 as the training set and 2020 as the test set, followed by splitting the set into the input and output and reshaping the input to be 3D as [samples, timesteps, features]
.
values = reframed.values
n_train_hours = 3368492
train = values[:n_train_hours,:]
test = values[n_train_hours:,:]
train_X, train_y = train[:,:-1], train[:,-1]
test_X, test_y = test[:,:-1], test[:,-1]
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
Batch Size = 2
¶
Let's now set the path where model is saved and set up the callbacks with EarlyStopping
that monitors the loss
and will stop training if it does not improve after 3 epochs
, the ModelCheckpoint
that monitors the mse
saving only the best one with a min
loss and the TensorBoard
as well.
log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
filepath = 'weights_only_train2019_test2020_n50_b2_epochs30.h5'
checkpoint_dir = os.path.dirname(filepath)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
histogram_freq=1)
callbacks_list = [EarlyStopping(monitor='loss', patience=3),
ModelCheckpoint(filepath, monitor='mse',
save_best_only=True, mode='min'),
tensorboard_callback]
Now the model can be trained by calling fit
utilizing the train dataset for 30 epochs
using a batch_size=2
without shuffling the data and the specified callbacks from the callbacks_list
.
history = model.fit(train_X, train_y, epochs=30, batch_size=2, shuffle=False,
callbacks=callbacks_list)
Now, let's save the model if we need to reload it in the future and plot the model loss
over the training epochs.
model.save('./Model_50neuron_batch2_30epochs_train2019_tf.h5',
save_format='tf')
# Load model for more training or later use
#filepath = 'weights_only_train2019_test2020_n50_b2_epochs30.h5'
#loaded_model = tf.keras.models.load_model('./Model_50neuron_batch2_30epochs_train2019_tf')
#model.load_weights(filepath)
plt.title('Model Loss')
plt.plot(history.history['loss'], label='train')
plt.ylabel('Loss')
plt.xlabel('Epochs')
plt.legend()
plt.show()
We can now generate predictions for the training and the test sets by utilizing model.predict
, then create an empty table with 41 fields, fill the table with the predicted values, inverse transform and then select the correct field to determine the MAE
and RMSE
for the train and test sets.
train_predict = model.predict(train_X)
train_predict_dataset_like = np.zeros(shape=(len(train_predict), 41))
train_predict_dataset_like[:,0] = train_predict[:,0]
train_predict = scaler.inverse_transform(train_predict_dataset_like)[:,0]
print('Train Mean Absolute Error:', mean_absolute_error(train_y[:],
train_predict[:]))
print('Train Root Mean Squared Error:', np.sqrt(mean_squared_error(train_y[:],
train_predict[:])))
Let's now make a prediction with the test set, invert the scaling for forecast and the actual test data to examine how the predicted values for the maximum, average and minimum metric tonnage compare to the actual maximum, average and minimum metric tonnage.
yhat = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))
inv_yhat = concatenate((yhat, test_X[:,1:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]
test_y = test_y.reshape((len(test_y), 1))
inv_y = concatenate((test_y, test_X[:,1:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]
print('Maximum Metric Tons:', np.amax(y_test))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons:', np.average(y_test))
print('Predicted Average Metric Tons:', np.average(pred_test))
print('\nMinimum Metric Tons:', np.amin(y_test))
print('Predicted Minimum Metric Tons:', np.amin(pred_test))
Now, let's calculate the MAE
and MSE
and plot the actual vs predicted metric tonnage.
print('Mean Absolute Error (MAE): %3f' % mean_absolute_error(inv_y, inv_yhat))
print('Mean Square Error (MSE): %3f' % mean_squared_error(inv_y, inv_yhat))
The 2020 test set metrics are significantly higher for the test set compared to the training set, and this is significantly higher that was observed for the 2020 model and 2019 as the test set as demonstrated clearly in the following plot.
plt.plot(inv_y, label='2020')
plt.plot(inv_yhat, label='2020 - predicted')
plt.legend()
plt.show()
As completed for all models using different batch sizes, we can extract the model metrics into a pandas.Dataframe
that can be utilized to plot the results.
data = [[2019, 2, np.amax(y_test), np.amax(pred_test), np.average(y_test),
np.average(pred_test), np.amin(y_test), np.amin(pred_test)]]
df = pd.DataFrame(data, columns=['Year', 'BatchSize', 'Maximum Metric Tons',
'Predicted Maximum Metric Tons',
'Average Metric Tons',
'Predicted Average Metric Tons',
'Minimum Metric Tons',
'Predicted Minimum Metric Tons'])
df = pd.concat([df, df1])
df.to_csv('LSTM_19_20_Results.csv')
We can now examine the results from training a model using the 2019 data and the 2020 data utilizing different batch sizes and compare the predicted results.
df = pd.read_csv('LSTM_19_20_Results.csv')
df
df_num = df[['Predicted Maximum Metric Tons', 'Predicted Average Metric Tons',
'Predicted Minimum Metric Tons']]
plt.rcParams.update({'font.size': 14})
fig, ax = plt.subplots(1,3, figsize=(15,5))
fig.suptitle('Train 2019 vs Train 2020: Predicted Maximum, Average and Minimum Metric Tons',
y=1.01, fontsize=20)
for variable, subplot in zip(df_num, ax.flatten()):
a = sns.lineplot(data=df, x=df['Batch Size'], y=df_num[variable],
hue=df.Year, palette=['blue', 'orange'], ax=subplot)
fig.tight_layout()
plt.show();
As the batch size decreased (more observations used to train the model), the predicted results diverged to a larger extent when comparing 2019 to 2020. This suggests that a more complicated model is needed to train an LSTM
utilizing 2019 trade and potentially confounding information compared to the 2020 information.
Comments
comments powered by Disqus