The United States (U.S.) has been one of the greatest advocates for free trade since the end of World War II, and the importance of it to the U.S. economy cannot be underestimated. As many as 40 million jobs in the U.S. are directly supported by international trade as of 2018 (Trade Partnership Worldwide LLC), and as much as half of the entire U.S. manufacturing industry is supported by exports (International Trade Administration). Imports introduce new goods and boost the annual earning power of the average American by over $18,000 annually (Hufbauer and Lu, 2017).

COVID-19 disrupted the global supply chain. The first effect was a dramatic and rapid reduction in the labor force in affected locations as workers fell ill or were forced to quarantine (Gruszczynski, L. (2020). International trade was impacted further as borders were closed to commerce and travel (Kerr, W. (2020)). This further exacerbated supply chain disruptions within and among countries as goods themselves were prioritized differently by those in quarantine. Most international trade reduced dramatically especially for manufactured goods and automobiles. However, food and agriculture supply lines were relatively unaffected in the short term due to narrow time windows between harvest and consumption, and the fact that even during a pandemic, the population still needs to eat for survival (Kerr, W. (2020)).

It is still too early to determine if COVID-19 has permanently disrupted the global supply chain or if it has just hastened trends already underway related to the geography of trading partners and the basket of goods traded among them. These long-term trends will need to be discerned over years, isolating the effect of any prolonged economic recession, especially among countries that are not employing economic stimulus measures. The applied tariff rates of countries is one change that may or may not prove transitory or affected due to the emergence of this virus. These rates may have a confounding effect with the disruptions caused by the direct or indirect measures of COVID-19. It is still too early to tell if the pandemic will lead to a more decentralized global economic system, but this is a possible long term scenario (Gruszczynski, L. (2020).

`Proposed Questions`¶

● How has the composition and/or volume of maritime imports and exports from the US changed over time?

● Are there any confounding effects impacting the volume of imports and exports of the targeted commodities?

● How did COVID-19 impact the volume and composition of international maritime trade?

`Data Collection and Processing`¶

The questions presented in the analysis are complex, interconnected, and require a variety of data sources to answer sufficiently. Data must establish historical import and export volumes and composition . There also must be data that demonstrate COVID-19’s effects on the supply chains and workforce. Finally, data must be included that establishes potential confounding effects on trade volumes, including historical rates for applied tariffs, currency exchange, and unemployment.

`Maritime Imports & Exports`

The U.S. Customs and Border Protection collects information about each shipment. Descartes Datamyne, as well as other industry data sources such as IHS Markit’s PIERS and Panjiva, compile this information and enhance it with additional data fields. The data was queried from Descartes Datamyne for U.S. imports and exports separately, from January 1, 2010 through December 31, 2020. The geographic scope includes all maritime trade shipments through the United States’ ports, to and from all trading partners. The data was downloaded in 70 queries of 1 million records or less, separately for imports and exports.

Both the imports and exports sets contain features import arrival, business name and location, product identifying features, port of arrival and departure, source country, and other characteristics relating to the shipment. Datasets will be concatenated into one table. Data consists of all bills of lading (shipment receipts) into or out of a U.S. maritime port from January 1, 2010 through December 31, 2020. Maritime trade is best measured in volumes, such as metric tonnage, TEUs, or total shipments, however the last two are relevant to maritime trade only. Trade in metric tonnage is a major indicator of the economic vitality of the national and global economy.

`Preprocessing`¶

The code that was used for preprocessing and EDA can be found here. First the environment is set up with the dependencies, library options, the seed for reproducibility, and setting the location of the project directory.

In [ ]:

import os
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')

seed_value = 42
os.environ['MaritimeTrade_Preprocessing'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)

`Preprocess Imports`¶

Since the initial trade data had to queried in batches separately for the imports and exports, this resulted in many separate files generated. The structure of the Data directory resulted in subdirectories for Imports and Exports. Let's set the path to where the Imports data is stored. There are 55 separate .csv files, so these need to be concatenated together to make a single set. Using glob, we use the .csv commonality to read and concatenate all files matching this extension. Then the variables that are extraneous are dropped and the columns are renamed to more relatable terminology without spaces. Since there queries were completed separately, this generated missing values in the Date features so those are removed. This variable was imported as a float so a lambda function was used to remove the extra .0 resulting in a numerical value containing the year month day without hyphens. Since this is the Imports set, let's add a column denoting a characteristic about this data. Since Container_Type_Refrigerated and Container_Type_Dry have many levels, let's create a binary features by using a mask for all the observations >=1 to be recoded as 1. After this processing, the duplicates are dropped and there over 35 million rows and 17 features. Let's now view the first 5 observations. There appear to be differences in the spacing between words different features so string processing definitely need to be utilized.

In [ ]:

import sys
import glob
import pandas as pd
pd.set_option('display.max_columns', None)

extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

imports = pd.concat([pd.read_csv(f) for f in all_filenames], ignore_index=True)

imports = imports.drop(['In bond entry type', 'Vessel Country', 'Unnamed: 18'],
                       axis=1)
imports.rename(columns={'Consignee (Unified)': 'US_Company',
                        'HS Description': 'HS_Description',
                        'Country of Origin': 'Foreign_Country',
                        'Port of Departure': 'Foreign_Port',
                        'Port of Arrival': 'US_Port',
                        'Shipper (Unified)': 'Foreign_Company',
                        'Container LCL/FCL': 'Container_LCL/FCL',
                        'Metric Tons': 'Metric_Tons',
                        'Container Type Refrigerated': 'Container_Type_Refrigerated',
                        'Container Type Dry': 'Container_Type_Dry',
                        'VIN Quantity': 'VIN_Quantity',
                        'Total calculated value (US$)': 'TCVUSD'}, inplace=True)

imports = imports[imports['Date'].notna()]

imports['Date'] = imports['Date'].astype(str).apply(lambda x: x.replace('.0',''))

imports['Trade_Direction'] = 'Import'

mask = imports['Container_Type_Refrigerated'] >= 1
imports.loc[mask, 'Container_Type_Refrigerated'] = 1

mask = imports['Container_Type_Dry'] >= 1
imports.loc[mask, 'Container_Type_Dry'] = 1

imports = imports.drop_duplicates()
print('- The Imports set contains ' + str(imports.shape[0]) + ' rows and '
      + str(imports.shape[1]) + ' columns.')
print('\nSample observations of Imports:')
print(imports.head())

- The Imports set contains 35394230 rows and 17 columns.

Sample observations of Imports:
       Date                        US_Company     HS  \
0  20100331  TOKYO MOTORS SR URIEL GARCIA, WO  8708    
1  20100331          THE BRITISH SHOP INC, NC    94    
2  20100331                 PF STORES INC, PR  9401    
5  20100331                 PF STORES INC, PR  9401    
9  20100331     SCOTIABANK DE PUERTO RICO, PR  9503    

                                      HS_Description Foreign_Country  \
0  8708 - PARTS & ACCESS FOR MOTOR VEHICLES (HEAD...           JAPAN   
1  94 - FURNITURE; BEDDING ETC; LAMPS NESOI ETC; ...  UNITED KINGDOM   
2  9401 - SEATS (EXCEPT BARBER  DENTAL  ETC)  AND...           CHINA   
5  9401 - SEATS (EXCEPT BARBER  DENTAL  ETC)  AND...           CHINA   
9  9503 - TOYS NESOI; SCALE MODELS ETC; PUZZLES; ...           CHINA   

                         Foreign_Port      US_Port  \
0    PORT BUSTAMANTE,KINGSTON,JAMAICA  SAN JUAN,PR   
1  HAMBLE,SOUTHHAMPTON,UNITED KINGDOM   NORFOLK,VA   
2    PORT BUSTAMANTE,KINGSTON,JAMAICA  SAN JUAN,PR   
5    PORT BUSTAMANTE,KINGSTON,JAMAICA  SAN JUAN,PR   
9    PORT BUSTAMANTE,KINGSTON,JAMAICA  SAN JUAN,PR   

                                Carrier                      Foreign_Company  \
0  ZIM INTEGRATED SHIPPING SERVICES LTD          MORALES ENTERPRISE LTD (JP)   
1  ZIM INTEGRATED SHIPPING SERVICES LTD                     NOT UNIFIED (XX)   
2  ZIM INTEGRATED SHIPPING SERVICES LTD  ZHEJIANG ANJI ENOUGH FURNITURE (CN)   
5  ZIM INTEGRATED SHIPPING SERVICES LTD  ZHEJIANG ANJI ENOUGH FURNITURE (CN)   
9  ZIM INTEGRATED SHIPPING SERVICES LTD  SHANTOU QITONG TRADING CO  LTD (CN)   

  Container_LCL/FCL  Teus  Metric_Tons  Container_Type_Refrigerated  \
0               FCL   1.0        15.47                            0   
1               FCL   2.0        12.00                            0   
2               FCL   2.0         4.21                            0   
5               FCL   2.0         4.27                            0   
9               FCL   2.0         3.26                            0   

   Container_Type_Dry  VIN_Quantity  TCVUSD Trade_Direction  
0                   1             0     0.0          Import  
1                   1             0     0.0          Import  
2                   1             0     0.0          Import  
5                   1             0     0.0          Import  
9                   1             0     0.0          Import

`Preprocess Exports`¶

There are 36 separate files for the initial Exports set. So let's move to that directory and perform the same type of concatenation as was used for the Imports set, drop features that are not needed and rename the columns. Since Short Container Description will be used to create features for this set, let's select the non-missing for this and the non missing from US_Port. The Date variable was imported the same type as in the previous processing, so let's use the same aforementioned approach. Then create the Trade_Direction feature denoting these are exports. Since these are exports, they do not have a foreign company so let's add Not Applicable to Exports for the Foreign_Company variable. Drop any duplicate rows, now totaling over 23 million observations with the same amount of columns as the Imports set.

In [ ]:

extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

exports = pd.concat([pd.read_csv(f) for f in all_filenames])

exports = exports.drop(['Vessel Country', 'Unnamed: 16'], axis=1)
exports.rename(columns={'Exporter (Unified)': 'US_Company',
                        'HS Description': 'HS_Description',
                        'Country of Final destination': 'Foreign_Country',
                        'Foreign Port': 'Foreign_Port',
                        'US Port': 'US_Port',
                        'Container LCL/FCL': 'Container_LCL/FCL',
                        'Metric Tons': 'Metric_Tons',
                        'VIN Quantity': 'VIN_Quantity',
                        'Total calculated value (US$)': 'TCVUSD'}, inplace=True)

exports = exports[exports['Short Container Description'].notna()
                  & exports.US_Port.notna()]

exports['Date'] = exports['Date'].astype(str).apply(lambda x: x.replace('.0',''))

exports['Trade_Direction'] = 'Export'
exports['Foreign_Company'] = 'Not Applicable to Exports'
exports = exports.drop_duplicates()
print('- The Exports set contains ' + str(exports.shape[0]) + ' rows and '
      + str(exports.shape[1]) + ' columns.')
print('\nSample observations of Exports:')
print(exports.head())

- The Exports set contains 23777640 rows and 17 columns.

Sample observations of Exports:
       Date                     US_Company   HS HS_Description  \
0  20101231           CU TRANSPORT INC, CA  00    00 - OTHERS    
2  20101231           CU TRANSPORT INC, CA  00    00 - OTHERS    
5  20101231  MAXLAND INTERNATIONAL INC, CA  00    00 - OTHERS    
6  20101231          LUCKY FREIGHT INC, CA  00    00 - OTHERS    
7  20101231   HAV GLOBAL SOLUTIONS LLC, IL  00    00 - OTHERS    

  Foreign_Country Foreign_Port        US_Port  \
0          TAIWAN    KAOHSIUNG  LONG BEACH,CA   
2          TAIWAN    KAOHSIUNG  LONG BEACH,CA   
5          TAIWAN    KAOHSIUNG  LONG BEACH,CA   
6          TAIWAN    KAOHSIUNG  LONG BEACH,CA   
7       HONG KONG    HONG KONG     NORFOLK,VA   

                              Carrier Container_LCL/FCL  \
0  ORIENT OVERSEAS CONTAINER LINE LTD               LCL   
2  ORIENT OVERSEAS CONTAINER LINE LTD               FCL   
5  ORIENT OVERSEAS CONTAINER LINE LTD               FCL   
6  ORIENT OVERSEAS CONTAINER LINE LTD               FCL   
7  ORIENT OVERSEAS CONTAINER LINE LTD               FCL   

  Short Container Description  Containerized  Teus  Metric_Tons  VIN_Quantity  \
0                  AUTOMOBILE              1   1.0         2.20             0   
2                  AUTOMOBILE              1   2.0         4.40             0   
5                       AUTOS              1   2.0         3.32             0   
6                       AUTOS              1   1.0         2.27             0   
7         RESTAURANT SUPPLIES              1   1.0         8.99             0   

   TCVUSD Trade_Direction            Foreign_Company  
0     0.0          Export  Not Applicable to Exports  
2     0.0          Export  Not Applicable to Exports  
5     0.0          Export  Not Applicable to Exports  
6     0.0          Export  Not Applicable to Exports  
7     0.0          Export  Not Applicable to Exports

First, let's filter the Containerized exports. The text in Short Container Description can be utilized to create a binary feature by using the condtions that if the word *REEF*, *Cold*, or *Deg* exists in the Short Container Description feature, then we can create a new variable Container_Type_Refrigerated and assign the value equal to one. Since these exports are refrigerated, let's assign them as Container_Type_Dry equal to zero. Then, if the text in the Short Container Description feature does not include the aforementioned words, we can create a new dataframe where we assign the value equal to one to Container_Type_Dry and Container_Type_Refrigerated equal to zero. These two filtered dataframes are then concatenated together by row. For the observations which are not Containerized, we can create a new dataframe where we assign both Container_Type_Refrigerated and Container_Type_Dry equal to zero. Then this can be concatenated by row with the Containerized filtered data. Now the features Short Container Description and Containerized can be removed from the Exports set and any duplicates dropped.

In [ ]:

df = exports.loc[exports['Containerized'] == 1]

search_values = ['REEF', 'Cold', 'Deg']
df2 = df[df['Short Container Description'].str.contains(
    '|'.join(search_values))]
df2['Container_Type_Refrigerated'] = 1
df2['Container_Type_Dry'] = 0

df3 = df[~df['Short Container Description'].str.contains('|'.join(search_values))]
df3['Container_Type_Refrigerated'] = 0
df3['Container_Type_Dry'] = 1

df = pd.concat([df2, df3])

del df2, df3

df1 = exports.loc[exports['Containerized'] == 0]
df1['Container_Type_Refrigerated'] = 0
df1['Container_Type_Dry'] = 0

exports = pd.concat([df, df1])

del df, df1

exports = exports.drop(['Short Container Description', 'Containerized'], axis=1)
exports = exports.drop_duplicates()

Now, the Imports and Exports sets can be concatenated, and the quality of the data in regards to data types, missingness and uniqueness can be examined.

In [ ]:

df = pd.concat([imports, exports])

del imports, exports

def data_quality_table(df):
    """
    Returns data types, the count of missing data per feature and the number of unique.
    """
    var_type = df.dtypes
    mis_val = df.isnull().sum()
    unique_count = df.nunique()
    mis_val_table = pd.concat([var_type, mis_val, unique_count], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
    columns = {0: 'Data Type', 1: 'Count Missing', 2: 'Number Unique'})
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:,1] >= 0].sort_values(
            'Count Missing', ascending=False).round(1)
    print ('- There are ' + str(df.shape[0]) + ' rows and '
           + str(df.shape[1]) + ' columns.\n')

    return mis_val_table_ren_columns

print(data_quality_table(df))

- There are 57408871 rows and 17 columns.

                            Data Type  Count Missing  Number Unique
US_Company                     object        5944727        1356805
Foreign_Company                object          11768         808269
Container_LCL/FCL              object              0              2
TCVUSD                        float64              0       16013472
VIN_Quantity                    int64              0            922
Container_Type_Dry              int64              0              2
Container_Type_Refrigerated     int64              0              2
Metric_Tons                   float64              0         180407
Teus                          float64              0           5554
Date                           object              0           4018
Carrier                        object              0           3421
US_Port                        object              0            288
Foreign_Port                   object              0           4000
Foreign_Country                object              0            236
HS_Description                 object              0        1987063
HS                             object              0        1991294
Trade_Direction                object              0              2

Most of the data is complete but two variables, US_Company and Foreign_Company have missing values. TCVUSD has the most unique, which makes sense because each trade transaction mostl likely contains a different value of money. The features HS_Description and HS are strings so any small deviation in the wording, spaces or punctuation could result in differences. It's rationale there would be more foreign ports and US ports as well. Since the Imports set is larger than the Exports set, more U.S. companies than foreign makes makes logical sense.

`Preprocess Concatenated Imports/Exports`¶

Let's now create a new variable called HS_Class by pulling the first two digits from the HS Description feature to make this new feature with cleaner and more homogenized labels for the HS family classes. Then we can drop HS_description, filter HS_Class for the 23 selected groups to examine for the proposed questions and convert to a string since it is a categorical variable. We can also create a new variable called HS_Mixed as a Boolean value denoting TRUE if there are more than 6 digits in HS and FALSE if there are 6 or less digits in HS. Then HS can be removed from the set.

In [ ]:

df['HS_Class'] = [x[:2] for x in df['HS_Description']]
df = df.drop(['HS_Description'], axis=1)

df['HS_Class'] = pd.to_numeric(df['HS_Class'], errors='coerce')
df = df[df['HS_Class'].notna()]

list = [2,7,8,9,10,11,14,15,16,17,18,19,20,22,24,30,40,47,72,73,87,94,95]
df = df[df['HS_Class'].isin(list)]
print('- Number of unique HS classes:', str(df.HS_Class.nunique()))
df['HS_Class'] = df['HS_Class'].astype('str')

mask = df['HS'].astype(str).str.len() <= 7
df.loc[mask, 'HS_Mixed'] = 0

mask = df['HS'].astype(str).str.len() > 7
df.loc[mask, 'HS_Mixed'] = 1

df = df.drop(['HS'], axis=1)

- Number of unique HS classes: 23

Now, let's set the path to the Warehouse_Construction directory where the various sets are stored, read the HS_Groups file and examine the first five observations. We can remove Description and convert HS_Class to type int. This can be used as a key to join the HS_Groups set to the main table. This new feature HS_Group_Name, which groups similar HS_Class items together, will reduce dimensionality.

In [ ]:

HS_Groups = pd.read_csv('HS_Groups.csv', index_col=False)
print('\nSample observations of HS groups data:')
print(HS_Groups.head())
HS_Groups = HS_Groups.drop(['Description'], axis=1)

HS_Groups['HS_Class'] = HS_Groups['HS_Class'].astype(int)
df['HS_Class'] = df['HS_Class'].astype(int)

df = pd.merge(df, HS_Groups, how='left', left_on=['HS_Class'],
              right_on=['HS_Class'])
df = df.drop_duplicates()

del HS_Groups


Sample observations of HS groups data:
   HS_Class                                        Description  \
0         2                         Meat and edible meat offal   
1         7     Edible vegetables and certain roots and tubers   
2         8  Edible fruit and nuts; peel of citrus fruit or...   
3         9                       Coffee, tea, maté and spices   
4        10                                            Cereals   

            HS_Group_Name  
0                  Edible  
1                  Edible  
2                  Edible  
3  Edible with Processing  
4  Edible with Processing

Since some of the observations contain missing information for US_Company, let's filter the sets into different datafames to process this variable to obtain the U.S. state where company is located. Using rsplit to split the text where the last comma exists results in 75 unique U.S. company states, so this method is not 100% effective. For the observations that do not contain information for the US_Company, we can allocate Not Provided to this feature and the created US_Company_State feature. The two sets are then concatenated to complete the set. Since the presence of a comma separates the companies which have the same name but exist in different states, we can utilize rsplit again, but now to remove everything after the last comma. This results in aggregrating the same companies together by removing the state information.

In [ ]:

dat = df[df['US_Company'].notna()]
dat['US_Company_State'] = [x.rsplit(',', 1)[-1] for x in dat['US_Company']]
print('- Number of unique US company states:',
      str(dat.US_Company_State.nunique()))

dat1 = df[df['US_Company'].isna()]
dat1['US_Company'] = 'Not Provided'
dat1['US_Company_State'] = 'Not Provided'

df = pd.concat([dat, dat1])

del dat, dat1

df.loc[:,'US_Company_Agg'] = [x.rsplit(',', 1)[:-1] for x in df['US_Company']]
df['US_Company_Agg'] = df['US_Company_Agg'].str[0]
print('- Number of unique aggregrated US companies:',
      str(df.US_Company_Agg.nunique()))

- Number of unique US company states: 75
- Number of unique aggregrated US companies: 881137

Next, the Foreign_Company feature is organized in a similar pattern as US_Company, but this contains the country which the company is located. Also, the parentheses exist in this string so they need to be removed first. The observations that contain information for this variable can be selected into a set, then all of the parentheses removed, and the last two characters extracted from the string creating Foreign_Company_Country. Then the observations without foreign companies can be assigned Not Provided for Foreign_Company and Foreign_Company_Country. Lastly, the sets can be concatenated by row returning back to a single table.

In [ ]:

dat = df[df['Foreign_Company'].notna()]
dat.loc[:,'Foreign_Company_Country'] =  dat['Foreign_Company'].str.replace(r'[\d]+', '')
dat['Foreign_Company_Country'] = [x[-2:] for x in dat['Foreign_Company_Country']]

dat1 = df[df['Foreign_Company'].isna()]
dat1['Foreign_Company'] = 'Not Provided'
dat1['Foreign_Company_Country'] = 'Not Provided'

df = pd.concat([dat, dat1])

del dat, dat1

`Extract Names for Clustering using OpenRefine and Binning from Original Data`¶

Data dictionaries stored in .csv files are generated, which map each of the unique values from the raw fact data to standardized values using the clustering capabilities of Google OpenRefine. These dictionaries can then be joined to the raw fact data to replace entire fields with the standardized data for that field. This will reduce the occurrences of erroneous data, reduce the number of factor levels, and provide a set of key values for joining the datasets together after some preprocessing.

To create one of these data dictionary .csv files, the unique values from one of the columns in the fact set will be extracted into a file named [field_name].csv and stored in the Keys_and_Dictionaries directory. This is then loaded into Google OpenRefine, the field containing the row number will be turned into a numeric value and then faceted. There are memory constraints of OpenRefine, so this controls the number of records fed into the clustering algorithms by selecting a range of the row numeric facets. The column containing the unique field values will be duplicated and renamed [field_name]_Clustered containing the smaller list of clustered values.

In [ ]:

print('- Number of unique US companies:', str(df.US_Company.nunique()))
us_company = pd.DataFrame({'US_Company': df['US_Company'].unique()})
us_company.to_csv('us_company.csv', index=False)

print('- Number of unique foreign companies:',
      str(df.Foreign_Company.nunique()))
foreign_company = pd.DataFrame({'Foreign_Company': df['Foreign_Company'].unique()})
foreign_company.to_csv('foreign_company.csv', index=False)

print('- Number of unique foreign company countries:',
      str(df.Foreign_Company_Country.nunique()))
foreign_company_country = pd.DataFrame({'Foreign_Company_Country': df['Foreign_Company_Country'].unique()})
foreign_company_country.to_csv('foreign_company_country.csv', index=False)

print('- Number of unique foreign countries:',
      str(df.Foreign_Country.nunique()))
foreign_country = pd.DataFrame({'Foreign_Country': df['Foreign_Country'].unique()})
foreign_country.to_csv('foreign_country.csv', index=False)

print('- Number of unique US ports:', str(df.US_Port.nunique()))
US_port = pd.DataFrame({'US_Port': df['US_Port'].unique()})
US_port.to_csv('us_port.csv', index=False)

print('- Number of unique foreign ports:', str(df.Foreign_Port.nunique()))
foreign_port = pd.DataFrame({'Foreign_Port': df['Foreign_Port'].unique()})
foreign_port.to_csv('foreign_port.csv', index=False)

print('- Number of unique carriers:', str(df.Carrier.nunique()))
carrier = pd.DataFrame({'Carrier': df['Carrier'].unique()})
carrier.to_csv('carrier.csv', index=False)

del us_company, foreign_company, foreign_company_country, foreign_country
del US_port, foreign_port, carrier

- Number of unique US companies: 1165998
- Number of unique foreign companies: 772977
- Number of unique foreign company countries: 250
- Number of unique foreign countries: 233
- Number of unique US ports: 285
- Number of unique foreign ports: 2976
- Number of unique carriers: 3383

In the current format, there is a large number of unique strings contained within the companies, countries, ports and carriers features. This will result in high dimensionality if these features are utilized in modeling techniques. There is the potential that there are spelling and formatting errors which might be contributing to this.

In addition, ISO standard country names and abbreviations are essential for joining sets of data because mismatched names can result in the improper alignment and loss of observations. The datapackage is utilized to generate a file containing the country name and a two digit country code (ISO 3166-1).

In [ ]:

import datapackage

data_url = 'https://datahub.io/core/country-list/datapackage.json'

package = datapackage.Package(data_url)

resources = package.resources
for resource in resources:
    if resource.tabular:
        data = pd.read_csv(resource.descriptor['path'])
data.to_csv('country_map.csv', index=False)

print('- There are ' + str(data.shape[0]) + ' countries.')
print('\nSample observations of ISO standard \n  country names and abbreviations:')
data.head()

- There are 249 countries.

Sample observations of ISO standard 
  country names and abbreviations:

Out[ ]:

	Name	Code
0	Afghanistan	AF
1	Åland Islands	AX
2	Albania	AL
3	Algeria	DZ
4	American Samoa	AS

`Perform Clustering using OpenRefine`¶

Note that for this field, there are over 1.2 million unique values for US_Company, so the numeric facet for the row number column is a necessity to avoid memory issues. A range of these numeric row values will be chosen spanning 400,000 records at a time to be sent through the clustering algorithm.

In this way, the clustering can be done in a few batches on large datasets running through several of the key collision algorithms until each batch does not produce clustering suggestions. US_Country_Clustered Diagram given below is an example of applying the Beider-Morse function to the first 400,000 records in the US_Country_Clustered column.

The interface clearly shows which values it believes should be clustered together and which value should replace them. In this way fields are combined which probably refer to the same entity, as they vary by ordering of words or abbreviations (ex: Smith, Thomas vs Thomas Smith), by the spacing between letters and numbers (ex: Thomas Smith vs Th omasSmith), and by variations in numeric prefixes or suffixes as seen in the example above. After making the reasonable selections, merge selected & close is chosen before repeating the process with a different function like fingerprint or n-gram fingerprint until no clustering suggestions are left for that batch. Then the next batch of records is selected before repeating the process. After all the batches have been processed in this way, all of the records in the entire column can be fed into the clustering functions now that the number of unique values has been reduced. The effectiveness of this method will vary by both the amount of unique values in a field and by the quality of the data in the field itself. For the US_Company field, the number of unique values went from 1,210,791 down to 880,829, a 27.3% reduction in values.

At this point the column containing the row numbers is dropped and the dataset can be exported to a .csv file named [Field Name]_Clustered.csv to become one of the data dictionary files. The dictionary files are stored in a directory called Keys_and_Dictionaries so let's navigate there to read the data containing the clustered U.S. company names generated by OpenRefine and perform an inner join using US_Company as the key. This original feature can then be removed from the set and any duplicate observations removed. Next, read the data containing the clustered foreign companies and perform similar processing.

In [ ]:

us_company_clustered = pd.read_csv('us_companies_clustered.csv')

df = pd.merge(df, us_company_clustered, left_on='US_Company',
              right_on='US_Company')
df = df.drop(['US_Company'], axis=1)
df = df.drop_duplicates()

foreign_company_clustered = pd.read_csv('foreign_company_clustered.csv')

df = pd.merge(df, foreign_company_clustered, left_on='Foreign_Company',
              right_on='Foreign_Company')
df = df.drop(['Foreign_Company'], axis=1)
df = df.drop_duplicates()

Now we can finish merging the rest of the sets with the generated clustered names using the original feature as the key starting with the trade carrier, then U.S. port, foreign company country and foreign country. The sets are merged after the foreign_country_clustered set because this requires the standard names as a key. Then Country_continent_region_key is then used to obtain the regions for Foreign_Country_Region and Foreign_Company_Country_Region fields one at a time. Then the sets can be removed to clear memory.

In [ ]:

carrier = pd.read_csv('carrier_clustered_binned.csv')

df = pd.merge(df, carrier, left_on='Carrier', right_on='Carrier')
df = df.drop(['Carrier_Clustered', 'Carrier'], axis=1)
df = df.drop_duplicates()

us_port_and_state_code_clustered_binned = pd.read_csv('us_port_and_state_code_clustered_binned.csv')

df = pd.merge(df, us_port_and_state_code_clustered_binned,
              left_on='US_Port', right_on='US_Port')
df = df.drop_duplicates()
df.rename(columns={'STATE_CODE': 'US_Company_State'}, inplace=True)

foreign_company_country_clustered = pd.read_csv('foreign_company_country_clustered.csv')

df = pd.merge(df, foreign_company_country_clustered,
              left_on='Foreign_Company_Country',
              right_on='Foreign_Company_Country')
df = df.drop_duplicates()
df.rename(columns={'Name': 'Foreign_Company_Country_Clustered',
                   'Continent': 'Foreign_Company_Country_Continent',
                   'Code': 'Foreign_Country_Code'}, inplace=True)

foreign_country_clustered = pd.read_csv('foreign_country_clustered.csv')

df = pd.merge(df, foreign_country_clustered, left_on='Foreign_Country',
              right_on='Foreign_Country')
df = df.drop(['Code'], axis=1)
df = df.drop_duplicates()
df.rename(columns={'Name': 'Foreign_Country_Name_Clustered',
                   'Continent': 'Foreign_Country_Continent'}, inplace=True)

country_continent_region_key = pd.read_csv('country_continent_region_key.csv',
                                           index_col=False)
print('\nSample observations of country/continent/region key:')
print(country_continent_region_key.head())

df = pd.merge(df, country_continent_region_key,
              left_on=['Foreign_Country_Name_Clustered'],
              right_on=['Name'])
df = df.drop(['Name', 'Country Code', 'Continent'], axis=1)
df = df.drop_duplicates()
df.rename(columns={'Region': 'Foreign_Country_Region'}, inplace=True)

country_continent_region_key = country_continent_region_key.drop(
    ['Country Code', 'Continent'], axis=1)

df = pd.merge(df, country_continent_region_key, how='left',
              left_on=['Foreign_Company_Country_Clustered'], right_on=['Name'])
df = df.drop(['Name'], axis=1)
df = df.drop_duplicates()
df.rename(columns={'Region': 'Foreign_Company_Country_Region'}, inplace=True)

del us_company_clustered, foreign_company_clustered, carrier
del us_port_and_state_code_clustered_binned, foreign_company_country_clustered
del foreign_country_clustered, country_continent_region_key


Sample observations of country/continent/region key:
             Name Country Code Continent                             Region
0     Afghanistan           AF      Asia                         South Asia
1         Albania           AL    Europe  Other Europe (not European Union)
2         Algeria           DZ    Africa         Middle East & North Africa
3  American Samoa           AS   Oceania                            Oceania
4         Andorra           AD    Europe  Other Europe (not European Union)

`Postprocessing after merging clustered features by OpenRefine`¶

The results from the clustered data added to the set reveal some inconsistencies within some of the features. Therefore, let's use another str.rsplit at the last comma to isolate the US_Port_State from US_Port_Clustered so the state abbreviation can be useful as key to match with the full state names. There are also inconistencies within US_Port_State with spacing/formatting, so let's standardize the state and territory codes. Another change that is needed is to recode groups that are not actual states to demarcate these observations from the rest, so let's use NOT DECLARED. Lastly, some of the information contained within the Foreign_Country_Name_Clustered and Foreign_Company_Country_Clustered features contain verbose and inconsistent names, so let's standardize these to match the names in the tariff set, which will be used later.

In [ ]:

df['US_Port_State'] = df['US_Port_Clustered'].str.rsplit(',').str[-1]

df['US_Port_State'] = df['US_Port_State'].replace(['NEW YORK',' NY', ' FL',
                                                   ' TX', 'OH (DHL COURIER)',
                                                   ' VA', ' NJ', ' UT',' HI',
                                                   ' MD', ' DC', ' AK',' WA',
                                                   ' VI',' MI',' MT', ' FL)',
                                                   ' ND', ' NM', ' PA',
                                                   'VIRGIN ISLANDS', ' CO',
                                                   ' NE', ' CA'],['NY', 'NY',
                                                                  'FL', 'TX',
                                                                  'OH', 'VA',
                                                                  'NJ', 'UT',
                                                                  'HI', 'MD',
                                                                  'DC', 'AK',
                                                                  'WA', 'VI',
                                                                  'MI', 'MI',
                                                                  'FL','ND',
                                                                  'NM', 'PA',
                                                                  'VI', 'CO',
                                                                  'NE','CA'])

df['US_Port_State'].mask(df['US_Port_State'] == ' WI', 'WI', inplace=True)
df['US_Port_State'].mask(df['US_Port_State'] == ' MS)', 'MS', inplace=True)
df['US_Port_State'].mask(df['US_Port_State'] == ' IA', 'IA', inplace=True)
df['US_Port_State'].mask(df['US_Port_State'] == ' INC.', 'NOT DECLARED',
                         inplace=True)
df['US_Port_State'].mask(df['US_Port_State'] == 'ND', 'NOT DECLARED',
                         inplace=True)

df['US_Company_State'].mask(df['US_Company_State'] == 'ND', 'NOT DECLARED',
                            inplace=True)
df['US_Company_State'].mask(df['US_Company_State'] == 'ED', 'NOT DECLARED',
                            inplace=True)

df['Foreign_Country_Name_Clustered'] = df['Foreign_Country_Name_Clustered'].replace('Viet Nam', 'Vietnam')
df['Foreign_Country_Name_Clustered'] = df['Foreign_Country_Name_Clustered'].replace('Taiwan, Province of China', 'Taiwan')
df['Foreign_Country_Name_Clustered'] = df['Foreign_Country_Name_Clustered'].replace('Russian Federation', 'Russia')
df['Foreign_Country_Name_Clustered'] = df['Foreign_Country_Name_Clustered'].replace('Tanzania, United Republic of', 'Tanzania')
df['Foreign_Country_Name_Clustered'] = df['Foreign_Country_Name_Clustered'].replace('Bolivia, Plurinational State of', 'Bolivia')
df['Foreign_Company_Country_Clustered'] = df['Foreign_Company_Country_Clustered'].replace('Viet Nam', 'Vietnam')
df['Foreign_Company_Country_Clustered'] = df['Foreign_Company_Country_Clustered'].replace('Taiwan, Province of China', 'Taiwan')
df['Foreign_Company_Country_Clustered'] = df['Foreign_Company_Country_Clustered'].replace('Russian Federation', 'Russia')
df['Foreign_Company_Country_Clustered'] = df['Foreign_Company_Country_Clustered'].replace('Tanzania, United Republic of', 'Tanzania')
df['Foreign_Company_Country_Clustered'] = df['Foreign_Company_Country_Clustered'].replace('Bolivia, Plurinational State of', 'Bolivia')

Given the large number of unique U.S. companies, let's set a threshold to retain only the U.S. companies with at least 100 total metric tons over the time period (2010-2020). We can groupby the sum and then filter the observations meeting this threshold. This creates a new feature we can name as Metric_Tons_Totals. Now, the number of U.S. companies is significantly less compared to the initial amount of companies.

In [ ]:

us_company = df.loc[:,['US_company_Clustered', 'Metric_Tons']]

us_company = us_company.groupby('US_company_Clustered')['Metric_Tons'].sum().reset_index()
us_company = us_company.loc[us_company['Metric_Tons'] > 100]
us_company.rename(columns = {'Metric_Tons': 'Metric_Tons_Totals'}, inplace=True)
print('- There are ' + str(us_company.shape[0]) + ' US companies with over 100 metric tons in the set.')

- There are 119199 US companies with over 100 metric tons in the set.

`Conditional Binning of U.S. and Foreign Company by Metric Tons for Mapping`¶

Let's now define a function that uses the created Metric_Tons_Totals feature to bin metric tonnage into specified groups using the following criteria:

a. micro = 100-1000
b. small = 1001-10000
c. medium = 10001-100000
d. large = 100001-1000000
e. huge = 1000000+

These groups can be specified in a list to create the feature Company_size where np.select assigns the conditions to the respective group. We can also use this function for processing the foreign companies, so let's rename the clustered company feature to Company. Now we can create the bins with the function, drop Metric_Tons_Totals and then rename the feature back to the original name. This processed binned data can then be merged with the main set.

In [ ]:

def bin_tonnage(df):
    """
    Returns a created categorical feature grouping by the total metric tons.
    """
    conditions = [
        (df['Metric_Tons_Totals'] <= 1000),
        (df['Metric_Tons_Totals'] > 1000) & (df['Metric_Tons_Totals'] <= 10000),
        (df['Metric_Tons_Totals'] > 10000) & (df['Metric_Tons_Totals'] <= 100000),
        (df['Metric_Tons_Totals'] > 100000) & (df['Metric_Tons_Totals'] <= 1000000),
        (df['Metric_Tons_Totals'] > 1000000),
        (df['Company'].str.strip() == 'NOT AVAILABLE')
        ]

    values = ['micro', 'small', 'medium', 'large', 'huge', 'unknown']
    df['company_size'] = np.select(conditions, values)

    return(df)

us_company = us_company.rename(columns={'US_company_Clustered': 'Company'})

us_company = bin_tonnage(us_company)
us_company = us_company.drop(['Metric_Tons_Totals'], axis=1)
us_company = us_company.rename(columns={'Company': 'US_company_Clustered'})

df = pd.merge(df, us_company, how='right', left_on=['US_company_Clustered'],
              right_on=['US_company_Clustered'])
df = df.drop_duplicates()
df.rename(columns={'company_size': 'US_company_size'}, inplace=True)

del us_company

Similar methods that were used to filter out the U.S. companies who were not involved in trading over 100 metric tons during the 10 year time period were used to process the foreign companies. Conditional binning was performed and then the binned data was joined with the main set.

In [ ]:

foreign_company = df.loc[:,['Foreign_Company_Clustered', 'Metric_Tons']]

foreign_company = foreign_company.groupby('Foreign_Company_Clustered')['Metric_Tons'].sum().reset_index()
foreign_company = foreign_company.loc[foreign_company['Metric_Tons'] > 100]
print('- There are ' + str(foreign_company.shape[0]) + ' foreign companies with over 100 metric tons in the set.')
foreign_company.rename(columns = {'Metric_Tons': 'Metric_Tons_Totals',
                                  'Foreign_Company_Clustered': 'Company'},
                       inplace=True)

foreign_company = bin_tonnage(foreign_company)
foreign_company = foreign_company.drop(['Metric_Tons_Totals'], axis=1)
foreign_company = foreign_company.rename(columns={'Company':
                                                  'Foreign_Company_Clustered'})

df = pd.merge(df, foreign_company, how='right',
              left_on=['Foreign_Company_Clustered'],
              right_on=['Foreign_Company_Clustered'])
df = df.drop_duplicates()
df.rename(columns={'company_size': 'foreign_company_size'}, inplace=True)

del foreign_company

- There are 123568 foreign companies with over 100 metric tons in the set.

`Outlier Testing of Quantitative Variables`¶

Since Metric_Tons will be utilized as the dependent variable or target when modeling, let's filter the observations within three standard deviations from the mean and use a boxplot to examine the data. This results in Metric_Tons less than 2,000, but the data is quite skewed. We can then filter the set to less than 250 Metric_Tons to retain a large amount of data. However, this still does contain data that is in long right tail.

In [ ]:

import seaborn as sns
import matplotlib.pyplot as plt

df = df[((df.Metric_Tons - df.Metric_Tons.mean())
         / df.Metric_Tons.std()).abs() < 3.5]

sns.boxplot(x=df['Metric_Tons']).set_title('Distribution of Metric Tons')
plt.show();

df = df.loc[df['Metric_Tons'] < 250]
sns.boxplot(x=df['Metric_Tons']).set_title('Distribution of Metric Tons After Filtering >= 250')
plt.show();

Now, we can filter the data where Teus and TCVUSD are within three standard deviations and replot the distributions.

In [ ]:

df = df[((df.Teus - df.Teus.mean()) / df.Teus.std()).abs() < 3.5]
df = df[((df['TCVUSD'] - df['TCVUSD'].mean()) / df['TCVUSD'].std()).abs() < 3.5]
df = df.drop_duplicates()

sns.boxplot(x=df['Metric_Tons']).set_title('Distribution of Metric Tons')
plt.show();

sns.boxplot(x=df['Teus']).set_title('Distribution of Teus')
plt.show();

sns.boxplot(x=df['TCVUSD']).set_title('Distribution of Total calculated value (US$)')
plt.show();

The original Date feature needs be processed into the format comprising of year-month-day. The original string is treated as an int type with the specified format and pandas.to_datetime can be utilized to accomplish this task. A function can then be utilized to extract the year, year-week, year-month all in one block of code.

In [ ]:

def timeFeatures(df):
    """
    Returns datetime (year-month-day), year, year-week, year-month.
    """
    df['DateTime'] = pd.to_datetime(df['Date'].astype(int), format='%Y%m%d')
    df['Year'] = df.DateTime.dt.year
    df['DateTime_YearWeek'] = df['DateTime'].dt.strftime('%Y-w%U')
    df['DateTime_YearMonth'] = df['DateTime'].dt.to_period('M')
    df = df.drop(['Date'], axis=1)

    return df

df = timeFeatures(df)

`Merge Trade Data with Other Data Sources`¶

`Unemployment`¶

The Bureau of Labor Statistics conducts a Current Population Survey of households every month in the United States. The survey is carried out to create datasets encompassing the labor force, employment, unemployment, people not included in the labor force, and other labor force statistics. The data was collected by workers conducting the survey by telephone while the Bureau of Labor Statistics encouraged businesses to submit their data electronically. The dataset shows the raw count of national civilian unemployment by month (in thousands of people) as well as the national civilian unemployment rate also by month. The collection of the survey data was impacted by the COVID-19 pandemic.

Source: United States, BLS. “Charts Related to the Latest ‘The Employment Situation’ News Release | More Chart Packages.” U.S. Bureau of Labor Statistics on February, 9, 2021.

Macroscopic Details

The data consists of 3 attributes and 133 records including a date for the national civilian unemployment data for the United States only from January 2010 through December 2020. With the onset of the COVID-19 pandemic and stay at home orders across the United States, the pandemic increased unemployment rates far above where they had been throughout much of the previous decade.

Now we can navigate to the Warehouse_Construction directory and examine the Unemployment set. The Year-Month feature contains a -, so it can be treated as a string and converted to the proper format. An inner merge can then be completed using DateTime_YearMonth as the key for the main set and Year-Month for the Unemployment set. The unemployment rate can act as a better surrogate variable than the numerical amount without adding additional ways to normalize to represent this potential confounding factor in trade imports/exports so this and the monthly time features can be removed. Lastly, to match the format of the variables without spaces present, we can rename Unemployment Rate Total as Unemployment Rate Total.

In [ ]:

unemployment = pd.read_csv('Unemployment Data 2010-2020.csv', index_col=False)
print('\nSample observations of Unemployment data:')
print(unemployment.head())

unemployment['Year-Month'] = pd.to_datetime(unemployment['Year-Month'].astype(
        str), format='%Y-%m')
unemployment['Year-Month'] = unemployment['Year-Month'].dt.to_period('M')

df = pd.merge(df, unemployment, left_on='DateTime_YearMonth',
              right_on='Year-Month')

df = df.drop(['Month', 'Year-Month', 'Total in Thousands'], axis=1)
df = df.drop_duplicates()
df.rename(columns={'Unemployment Rate Total': 'US_Unemployment_Rate'},
          inplace=True)

del unemployment


Sample observations of Unemployment data:
    Month Year-Month Total in Thousands  Unemployment Rate Total
0  Jan-10    2010-01             15,046                      9.8
1  Feb-10    2010-02             15,113                      9.8
2  Mar-10    2010-03             15,202                      9.9
3  Apr-10    2010-04             15,325                      9.9
4  May-10    2010-05             14,849                      9.6

`Tariffs`¶

The World Trade Organization compiles the annual applied tariff rates that member countries set towards the world, on the harmonized service code level. The dataset includes the number of tariff line items, annual averages, and minimum and maximum bounds. Countries were selected based on the United States’ top trading partners, measured in metric tonnage, during the year 2019. Tariff rates were collected for these countries for the years 2010 through 2020. This dataset also notes if the U.S. had a free trade agreement with these top trading partners for each of the 11 years. This qualitative variable will compliment our average applied tariff rate by HS chapter class.

Source: The World Trade Organization via the data query tool downloaded on February 8, 2021.

Source: Office of the United States Trade Representative downloaded on February 11, 2021.

Macroscopic Details

The original datasource included only the primary countries and 18 attributes. Subsequent additions reduced the working attributes to six, but expanded the scope of countries involved, increasing total records from 4,853 to 11,892. The United States and 62 other countries are included. These are the major trading partners of the U.S., as of 2019, measured in metric tonnage. All European countries are included, regardless of their individual trade numbers with the U.S. Data was collected from the WTO and the Office of the United States Trade Representative to create a new table. Each year includes 23 records for average applied tariff rates for the HS chapter classes included in this project. There are additional attributes to indicate whether a free trade agreement exists between the country and the U.S., and whether that country is a member of the European Union. There is missing annual data if a country was not a member of the World Trade Organization (“WTO”) for any of the time period, if it was a member of the European Union (they are reported as simply the European Union), and if they have not yet submitted their 2019 or 2020 data to the WTO. This was later adjusted by assuming that the applied tariff rates were held constant for these countries if no updates were published.

Examining the Tariffs set, Country, HS_Family, Tariff_Year can be utilized as the keys to add this information to main set using a left merge to the main table. Then European_Union, Free_Trade_Agreement_with_US can be recoded to True/False and Average_Tariff can be reformatted.

In [ ]:

tariffs = pd.read_csv('Tariffs.csv')
print('\nSample observations of Tariffs data:')
print(tariffs.head())

df = pd.merge(df, tariffs, how='left',
              left_on=['Foreign_Country_Name_Clustered', 'HS_Class', 'Year'],
              right_on=['Country', 'HS_Family', 'Tariff_Year'])
df = df.drop(['Country', 'Tariff_Year', 'HS_Family', 'HS_Class'], axis=1)
df = df.drop_duplicates()

df['European_Union'] = df['European_Union'].astype('str')
df['European_Union'] = df['European_Union'].replace('No', 'False')
df['European_Union'] = df['European_Union'].replace('Yes', 'True')
df['Free_Trade_Agreement_with_US'] = df['Free_Trade_Agreement_with_US'].astype('str')
df['Free_Trade_Agreement_with_US'] = df['Free_Trade_Agreement_with_US'].replace('No',
                                                                                'False')
df['Free_Trade_Agreement_with_US'] = df['Free_Trade_Agreement_with_US'].replace('Yes',
                                                                                'True')
df['Average_Tariff'] = df['Average_Tariff'].str.replace(',','')
df['Average_Tariff'] = df['Average_Tariff'].astype('float64')

del tariffs


Sample observations of Tariffs data:
   Country  Tariff_Year  HS_Family Average_Tariff  \
0  Vietnam         2020          2          17.45   
1  Vietnam         2020          7          16.14   
2  Vietnam         2020          8          23.46   
3  Vietnam         2020          9          21.41   
4  Vietnam         2020         10           8.56   

  Free_Trade_Agreement_with_US European_Union  
0                           No             No  
1                           No             No  
2                           No             No  
3                           No             No  
4                           No             No

`Currency`¶

Currency rates are extracted from here, which collects live rates for currencies, currency codes, financial news, commodities and crypto currencies. The data set contains 8 columns and 2,145 rows. The timeframe is from January 2010 to February 2021. All countries that are included in the tariff table are included in the currency dataset.

Source: Stock Market Quotes & Financial News retrieved on February 12, 2021.

Macroscopic Details

The data queried has 8,175 records and 10 attributes using a time frame from January 2010 to Feb 2021 The currency rate is an important variable to address our research questions as it has a great impact on imports and exports. If a currency rate increases, importing and exporting volumes may fall for a country as the cost of goods is too great and may look elsewhere to source. As we build our model, we would like to understand if currencies have any correlation with the other variables.

After examining the information contained in the Currency set, unnecesary columns can be dropped. The Month_Year feature, which is actually year-month, can be processed using the same approach as what was used for Year-Month in the Unemployment. Month_Year and Country can be used as keys for the left merge to the main set and then features that initially were retained in the initial EDA can be dropped from the set. This was decided because the weekly information can provide more insight and grain than monthly trade. Also Continent level trade was not specific enough, Foreign_Country_Name_Clustered and US_Company_State contained high dimensionality and the keys to create the warehouse were not needed for modeling.

In [ ]:

currencies = pd.read_csv('2010 - 2021 Exchange Rates.csv')
print('\nSample observations of Currencies data:')
print(currencies.head())

currencies = currencies.drop(['Month', 'Exchange_Year', 'Open', 'High', 'Low'],
                             axis=1)
currencies['Month_Year'] = pd.to_datetime(currencies['Month_Year'].astype(str),
                                          format='%Y-%m')
currencies['Month_Year'] = currencies['Month_Year'].dt.to_period('M')

df = pd.merge(df, currencies, how='left',
              left_on=['DateTime_YearMonth', 'Foreign_Country_Name_Clustered'],
              right_on=['Month_Year', 'Country'])

df = df.drop(['DateTime_YearMonth', 'Month_Year', 'Country',
              'European_Union_Member', 'Foreign_Country_Name_Clustered',
              'Foreign_Country_Code', 'Name', 'Country Code',
              'Continent', 'US_Company_State'], axis=1)
df = df.drop_duplicates()

del currencies


Sample observations of Currencies data:
      Month  Exchange_Year Month_Year    Price     Open     High      Low  \
0  February           2021    2021-02  13970.0  14035.0  14052.5  13965.0   
1   January           2021    2021-01  14020.0  13910.0  14217.5  13862.5   
2  December           2020    2020-12  14040.0  14117.5  14274.0  14050.0   
3  November           2020    2020-11  14090.0  14650.0  14717.5  13992.5   
4   October           2020    2020-10  14620.0  14810.0  14907.0  14602.5   

  Currency    Country European_Union_Member  
0      IDR  Indonesia                    No  
1      IDR  Indonesia                    No  
2      IDR  Indonesia                    No  
3      IDR  Indonesia                    No  
4      IDR  Indonesia                    No

`COVID-19 State Mandated Closures`¶

The U.S. government granted the state government to the abiltiy declare if/when there was an official closure of public areas / services with the goal of preventing/hindering the increased likelihood of new COVID-19 cases reaching an exponential distribution with other distributions considered given past events. This would reduce the amount of person-to-person contact since the transmission route of COVID-19 includes being able to be aerosolized via droplets, which are not visible by the human eyes when an individual is considered infectious. This included working-from-home options to maintain daily operations by working remotely.

Source: Kaiser Family Foundation, initially compiled on April 5, 2020. The data consists of 51 rows and 3 columns.

Now we can navigate to the Warehouse_Construction directory, and read/examine the KFF_Statewide_Stay_at_Home_Orders set. Then navigate back to Keys_and_Dictionaries directory, join the state_abbreviation_key with the main set so that the KFF_Statewide_Stay_at_Home_Orders can be joined using State as the key.

In [ ]:

KFF_Statewide_Stay_at_Home_Orders = pd.read_csv('KFF_Statewide-Stay-at-Home-Orders.csv',
                                                index_col=False)
print('\nSample observations of KFF_Statewide_Stay_at_Home_Orders data:')
print(KFF_Statewide_Stay_at_Home_Orders.head())
print('\n')

state_abbreviation_key = pd.read_csv('State_Region_key.csv', index_col=False)
print('\nSample observations of State and Region Key:')
print(state_abbreviation_key.head())
state_abbreviation_key.rename(columns={'Region': 'State_Region'}, inplace=True)

df = pd.merge(df, state_abbreviation_key, how='left',
              left_on=['US_Port_State'], right_on=['State Code'])
df = df.drop(['Region', 'State Code'], axis=1)
df = df.drop_duplicates()

df = pd.merge(df, KFF_Statewide_Stay_at_Home_Orders, how='right',
              left_on='State', right_on='State')
df = df.drop_duplicates()
df.rename(columns={'Date Announced': 'Date_Announced',
                   'Effective Date': 'Effective_Date'}, inplace=True)

del KFF_Statewide_Stay_at_Home_Orders


Sample observations of KFF_Statewide_Stay_at_Home_Orders data:
        State Date Announced Effective Date
0     Alabama       4/3/2020       4/4/2020
1      Alaska      3/27/2020      3/28/2020
2     Arizona      3/30/2020      3/31/2020
3    Arkansas            NaN            NaN
4  California      3/19/2020      3/19/2020



Sample observations of State and Region Key:
        State State Code Region
0      Alaska         AK   West
1     Alabama         AL  South
2    Arkansas         AR  South
3     Arizona         AZ   West
4  California         CA   West

`COVID-19 Cases and Deaths in the United States`¶

The New York Times has released, and is continuing to release, a series of data related to COVID-19 from state and local governments as well as health departments. This information is obtained by journalists employed by the newspaper company across the U.S. who are actively tracking the development of the pandemic. This source compiles information related tof COVID-19 across the county and state level in the United States over time. The data starts with the first reported COVID-19 case in the state of Washington on January 21, 2020. The total number of cases and deaths is reported as either “confirmed” or “probable” COVID-19. The number of cases includes all reported cases, including instances of individuals who have recovered or died. There are geographic exceptions in the states of New York, Missouri, Alaska, California, Nebraska, Illinois, Guam and Puerto Rico. Data was pulled from the New York Times Github on February 5, 2021.

Macroscopic Details

The data ranges from January 21, 2020 to February 5, 2021 with the dimensions of 6 columns and 1,001,656 rows including time, place, fips (geographic indicator) and the number of COVID-19 cases and deaths attributed to COVID-19.

Let's now convert the provided date using pandas.to_datetime and utilize timedelta(7, unit='d') to subtract a week. Then the weekly sum of the cases and deaths separately for each U.S. state can be calculated, generating the cases_weekly and deaths_weekly features. From the renamed Date_Weekly_COVID, the Year can be extracted and filtered to 2020 to match the extent of the trade information. Since the US_Port_State feature in the main table contains the abbreviated state name, we can merge the COVID-19 set with the state_abbreviation_key, so this can be used as a key, remove the unnecessary columns and select the non-missing information.

In [ ]:

COVID_cases = pd.read_csv('us-counties - NYTimes.csv', index_col=False)
print('\nSample observations of COVID-19 data:')
print(COVID_cases.head())

COVID_cases['Date'] = pd.to_datetime(COVID_cases['date']) - pd.to_timedelta(7,
                                                                            unit='d')

df1 = (COVID_cases.groupby(['state', pd.Grouper(key='Date', freq='W-MON')])
          .agg({'cases': 'sum', 'deaths': 'sum'}).reset_index())

df1.rename(columns={'Date': 'Date_Weekly_COVID', 'cases': 'cases_weekly',
                    'deaths': 'deaths_weekly', 'state': 'State'}, inplace=True)

df1['DateTime_YearWeek'] = pd.to_datetime(df1['Date_Weekly_COVID'])
df1['DateTime_YearWeek'] = df1['DateTime_YearWeek'].dt.strftime('%Y-w%U')
df1['Year'] = df1['Date_Weekly_COVID'].dt.year

df1 = df1[df1.Year < 2021]
df1 = df1.drop(['Year'], axis=1)

del COVID_cases

df1 = pd.merge(df1, state_abbreviation_key,
               how='left', left_on='State', right_on='State')

df1 = df1.drop(['State', 'State_Region'], axis=1)
df1 = df1.drop_duplicates()
df1 = df1[df1['State_Code'].notna()]


Sample observations of COVID-19 data: 
         date     county       state     fips  cases  deaths
0  2020-01-21  Snohomish  Washington  53061.0      1     0.0
1  2020-01-22  Snohomish  Washington  53061.0      1     0.0
2  2020-01-23  Snohomish  Washington  53061.0      1     0.0
3  2020-01-24       Cook    Illinois  17031.0      1     0.0
4  2020-01-24  Snohomish  Washington  53061.0      1     0.0

Then the aggregrated weekly COVID-19 information can then be outer merged with the main table using the the keys defined as the abbreviated state name and DateTime_YearWeek. Then the State_Code feature can be dropped.

In [ ]:

df = pd.merge(df, df1, how='outer', left_on=['US_Port_State',
                                             'DateTime_YearWeek'],
              right_on=['State_Code', 'DateTime_YearWeek'])
df = df.drop(['State_Code'], axis=1)
df = df.drop_duplicates()

del state_abbreviation_key, df1

Since COVID-19 became medically known late 2019 and early 2020 for the general public, the prior time periods contain missing data for the weekly cases and deaths given it was flu season and a rapid diagnostic test had not been developed given the introduction of a novel virua.. So let's filter the trade related column, HS_Mixed, which contains a small percentage missing observations, and fill the missing values with 0s.

However, there are observations that do contain missing values for HS_Mixed, which are mostly likely related to trade not occurring during these weeks at a location. To retain the grain without losing all off these observations, the quantitative variables can be filled with 0s since trade did not occur in the location at the time. This subset can then be concatenated via rows to the main data.

In [ ]:

df1 = df[df['HS_Mixed'].notna()]

df1 = df1.copy()
df1['cases_weekly'] = df1['cases_weekly'].fillna(0)
df1['deaths_weekly'] = df1['deaths_weekly'].fillna(0)

df2 = df[df['HS_Mixed'].isna()]
print('- Number of observations where no trade occurred:', df2.shape)

df2['Year'] = df2['Date_Weekly_COVID'].dt.year
df2['Year'] = df2['Year'].astype('Int64')
df2['US_Port_State'] = df2['State']

quant = ['Teus', 'Metric_Tons', 'VIN_Quantity', 'TCVUSD',
         'US_Unemployment_Rate', 'Average_Tariff', 'Price']

df2.loc[:, quant] = df2.loc[:, quant].fillna('')
df2.loc[:, quant] = df2.loc[:, quant].replace('', 0)

df = pd.concat([df1, df2])

del df1, df2

- Number of observations where no trade occurred: (1295, 45)

`Feature Engineering COVID-19 associated vars`¶

Next, the number of days between when a state mandated closure was announced and was made effective can be calculated by first coverting both to proper datetime format and then substracting Date_Announced from Effective_Date, creating State_Closure_EA_Diff. The original features can then be dropped.

In [ ]:

df['Date_Announced'] = pd.to_datetime(df['Date_Announced'].astype(object),
                                      format='%m/%d/%Y')

df['Effective_Date'] = pd.to_datetime(df['Effective_Date'].astype(object),
                                      format='%m/%d/%Y')

df['State_Closure_EA_Diff'] = (df.Effective_Date - df.Date_Announced).dt.days

df = df.drop(['Date_Announced', 'Effective_Date'], axis=1)

Next, we can examine the number of U.S. states in the set for a sanity check. Then filter the set into subsets by the weekly times when there were and were not COVID-19 cases. A pivot table with the weekly COVID date as the index can be generated that utilizes the cases_weekly feature for the different states. Then the missing case data can be filled with zeros and the first nonzero date for COVID cases can be leveraged using the pandas DataFrame.ne() method. This can then be sorted with the earliest first occurence to create the feature Time0_StateCase, or the first weekly date of COVID-19 cases in each state.

In [ ]:

print('- Number of unique US states:',
      len(pd.unique(df['State'])))

df1 = df[df['Date_Weekly_COVID'].isna()]

df2 = df[df['Date_Weekly_COVID'].notna()]

- Number of unique US states: 55

In [ ]:

df3 = df2.pivot_table(index='Date_Weekly_COVID', columns='State',
                      values='cases_weekly', aggfunc=np.min)
df3.fillna(df3.fillna(0), inplace=True)
df3 = df3.ne(0).idxmax()
df3 = df3.to_frame()
df3 = df3.reset_index()
df3.rename(columns={0: 'Time0_StateCase'}, inplace=True)
df3.sort_values(by='Time0_StateCase', ascending=True, inplace=True)
df3[:10]

Out[ ]:

	State	Time0_StateCase
30	Washington	2020-01-21
3	California	2020-01-25
10	Illinois	2020-01-27
15	Massachusetts	2020-02-01
2	Arizona	2020-02-05
28	Texas	2020-02-12
23	Oregon	2020-03-01
20	New York	2020-03-01
6	Florida	2020-03-01
7	Georgia	2020-03-02

This created set can then be left merged using place and time so that the date can be paired with the number of cases per this date. Selecting these features, which decreases the dimensionality, and the year-week can be used to create the temporary Year_M that functions as a key with US_Port_State to merge back with the complete set.

In [ ]:

df3 = pd.merge(df3, df2, how='left', left_on=['US_Port_State',
                                              'Time0_StateCase'],
               right_on=['US_Port_State', 'Date_Weekly_COVID'])
df3 = df3.drop_duplicates()

df3 = df3.copy()
df3['cases_state_firstweek'] = df3['cases_weekly']

df3 = df3.loc[:, ['US_Port_State', 'Time0_StateCase', 'Date_Weekly_COVID',
                  'cases_state_firstweek']]
df3 = df3.drop_duplicates()

df3['Year_M'] = df3['Date_Weekly_COVID'].dt.year
df2['Year_M'] = df2['Date_Weekly_COVID'].dt.year

df3 = df3.drop(['Date_Weekly_COVID'], axis=1)
df3 = df3.drop_duplicates()

df2 = pd.merge(df2, df3, how='left', left_on=['US_Port_State', 'Year_M'],
               right_on=['US_Port_State', 'Year_M'])

del df3

df2 = df2.drop(['Year_M'], axis=1)
df2 = df2.drop_duplicates()

Then, the cases from each of the weeks where COVID-19 occurred can be subtracted from the the initial amount of cases to generate a percent change feature called cases_pctdelta. For the set that contains missing values for Date_Weekly_COVID, the created features generated in the other set can be labeled with Not Applicable and zeros so the sets can be concatenated by row.

In [ ]:

df2['cases_pctdelta'] = df2.apply(lambda x: (x['cases_weekly']
                                             - x['cases_state_firstweek']
                                             / x['cases_state_firstweek']) * 100,
                                  axis=1)

df1['Time0_StateCase'] = 'Not Applicable'
df1['cases_state_firstweek'] = 0
df1['cases_pctdelta'] = 0

df = pd.concat([df2, df1])

del df1, df2

Similar methods can also be utilized for the deaths_weekly feature that were used for cases_weekly as demonstrated above.

In [ ]:

df1 = df[df['Date_Weekly_COVID'].isna()]

df2 = df[df['Date_Weekly_COVID'].notna()]

df3 = df2.pivot_table(index='Date_Weekly_COVID', columns='State',
                      values='deaths_weekly', aggfunc=np.min)
df3.fillna(df3.fillna(0), inplace=True)
df3 = df3.ne(0).idxmax()
df3 = df3.to_frame()
df3 = df3.reset_index()
df3.rename(columns={0: 'Time0_StateDeath'}, inplace=True)
df3.sort_values(by='Time0_StateDeath', ascending=True, inplace=True)
df3[:10]

Out[ ]:

	State	Time0_StateDeath
51	Washington	2020-03-01
4	California	2020-03-08
9	Florida	2020-03-08
19	Louisiana	2020-03-15
46	Texas	2020-03-15
10	Georgia	2020-03-15
33	New York	2020-03-15
43	South Carolina	2020-03-15
50	Virginia	2020-03-15
5	Colorado	2020-03-16

In [ ]:

df3 = pd.merge(df3, df2, how='left', left_on=['US_Port_State',
                                              'Time0_StateDeath'],
               right_on=['US_Port_State', 'Date_Weekly_COVID'])
df3 = df3.drop_duplicates()

df3 = df3.copy()
df3['deaths_state_firstweek'] = df3['deaths_weekly']
df3 = df3.loc[:, ['US_Port_State', 'Time0_StateDeath', 'Date_Weekly_COVID',
                  'deaths_state_firstweek']]

df3['Year_M'] = df3['Date_Weekly_COVID'].dt.year
df2['Year_M'] = df2['Date_Weekly_COVID'].dt.year

df3 = df3.drop(['Date_Weekly_COVID'], axis=1)
df3 = df3.drop_duplicates()

df2 = pd.merge(df2, df3, how='left', left_on=['US_Port_State', 'Year_M'],
               right_on=['US_Port_State', 'Year_M'])

del df3

df2 = df2.drop(['Year_M'], axis=1)
df2 = df2.drop_duplicates()

In [ ]:

df2['deaths_pctdelta'] = df2.apply(lambda x: (x['deaths_weekly']
                                              - x['deaths_state_firstweek']
                                              / x['deaths_state_firstweek']) * 100,
                                   axis=1)

df1['Time0_StateDeath'] = 'Not Applicable'
df1['deaths_state_firstweek'] = 0
df1['deaths_pctdelta'] = 0

df = pd.concat([df2, df1])

del df1, df2

Let's now drop the variables not worth considering during the next steps for variable selection.

In [ ]:

df = df.drop(['VIN_Quantity', 'US_company_Clustered',
              'Foreign_Company_Clustered', 'Foreign_Company_Country_Clustered',
              'US_Port_State'], axis=1)
df = df.drop_duplicates()

print('- Dimensions of Data Warehouse for EDA:', df.shape)

- Dimensions of Data Warehouse for EDA: (29152058, 42)

`Exploratory Data Analysis (EDA)`¶

`Feature Selection with Random Forest and XGBoost`¶

`Question: How has the composition and/or volume of maritime imports and exports from the U.S. changed from 2010-2015`¶

`Random Forest - Feature Importance`¶

To approach this question, let's first subset the observations containing the years 2010 - 2015. Then drop time related variables, including COVID-19 features, and the ones not related to the question. Then remove the rows with the columns having missing values to maximize the completeness of the set.

In [ ]:

df1 = df.loc[df['Year'] < 2016]

df1 = df1.drop(['DateTime', 'Year', 'DateTime_YearWeek', 'Date_Weekly_COVID',
                'Trade_Direction', 'State_Closure_EA_Diff',
                'cases_weekly', 'deaths_weekly', 'Time0_StateCase',
                'cases_state_firstweek', 'cases_pctdelta', 'Time0_StateDeath',
                'deaths_state_firstweek', 'deaths_pctdelta'
                'Free_Trade_Agreement_with_US', 'European_Union', 'Price',
                'Currency', 'Foreign_Company_Country_Region',
                'US_Port_Clustered', 'Foreign_Country',
                'Foreign_Company_Country', 'Foreign_Port', 'US_Port'], axis=1)

df1 = df1[df1.Foreign_Country_Continent.notna()
          & df1.Foreign_Country_Region.notna() & df1.Average_Tariff.notna()]
df1 = df1.drop_duplicates()
print('- Dimensions of Question 1 EDA:', df1.shape)

- Dimensions of Question 1 EDA: (10897482, 17)

To set up the RandomForestRegressor, let's create dummy variables for the categorical variables and allocate the features and the target variable, Metric_Tons. Now train the regressor using the default parameters using the parallel_backend from joblib to maximize the amount of CPU cores and then save the model as a pickle file. From the fitted model, the features can then be sorted by gini importance to identify which are the most important features.

In [ ]:

from sklearn.ensemble import RandomForestRegressor
from joblib import parallel_backend
import pickle

df1 = pd.get_dummies(df1, drop_first=True)

X = df1.drop('Metric_Tons', axis=1)
y = df1.Metric_Tons

rf = RandomForestRegressor(n_estimators=100, random_state=seed_value)

with parallel_backend('threading', n_jobs=-1):
    rf.fit(X, y)

Pkl_Filename = 'CompositionVolume_2010_2015_RF.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(rf, file)

plt.rcParams['figure.figsize'] = (10,10)
plt.rcParams.update({'font.size': 8.5})
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
sorted_idx = rf.feature_importances_.argsort()
ax.barh(X.columns[sorted_idx], rf.feature_importances_[sorted_idx])
ax.set_title('Feature Importance: Examining Features for Trade Composition/Volume during 2010-2015',
             fontsize=20)
ax.set_xlabel('Random Forest Feature Importance', fontsize=15)
plt.tight_layout()
plt.show();

The most important features are:

Teus
TCVUSD
HS_Group_Name_Finished_Goods
Average_Tariff
US_Unemployment_Rate
Container_LCL/FCL_LCl
Foreign_Company_Country_Continent_Asia
HS_Group_Name_Edible_Processing
HS_Mixed
US_Port_Coastal_Region_Southeast
US_Port_Coastal_Region_Northeast

The least important features are the smallest carrier sizes and the locationas where trade most likely would not be probable of importing from or exporting to from the U.S.

`XGBoost - Feature Importance`¶

XGBoost can also be utilized to run on the GPU for faster runtimes using similar methods that were used for the RFRegressor.

In [ ]:

from xgboost import XGBRegressor, plot_importance

xgb = XGBRegressor(objective='reg:squarederror',
                   booster='gbtree',
                   tree_method='gpu_hist',
                   gpu_id=0,
                   scale_pos_weight=1,
                   use_label_encoder=False,
                   random_state=seed_value,
                   verbosity=0)

xgb.fit(X, y)

Pkl_Filename = 'CompositionVolume_2010_2015_XGB.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(xgb, file)

ax = plot_importance(xgb)
fig = ax.figure
ax.set_title('Feature Importance: Examining Features for Trade Composition/Volume during 2010-2015',
             fontsize=20)
ax.set_xlabel('XGBoost Feature Importance', fontsize=15)
plt.tight_layout()
plt.show();

When comparing the feature importances from XGBoost to Random Forest, Teus is the most important while Average_Tariff is second and TCVUSD is third. Fourth is US_Unemployment_Rate, HS_Group_Name_Finished_Goods is fifth, HS_Group_Name_Raw_Input is sixth, foreign_company_size_large is seventh, Container_LCL/FCL_LCl is eighth, Foreign_Company_Country_Continent_Asia is ninth and HS_Group_Name_Edible_Processing is 10th. The ports did not show up till 17th and 18th and they reversed in importance.

`Random Forest without Teus & TCVUSD - Feature Importance`

Let's now drop the features, Teus and TCVUSD, when creating X. Using the same approaches, let's examine if the feature importance is conserved for both the default models using Random Forest and XGBoost.

In [ ]:

fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
sorted_idx = rf.feature_importances_.argsort()
ax.barh(X.columns[sorted_idx], rf.feature_importances_[sorted_idx])
ax.set_title('Feature Importance: Examining Features without Teus & TCVUSD for Trade Composition/Volume during 2010-2015',
             fontsize=20)
ax.set_xlabel('Random Forest Feature Importance', fontsize=15)
plt.tight_layout()
plt.show();

The most important features without Teus and TCVUSD are:

Average_Tariff
US_Unemployment_Rate
HS_Group_Name_Finished_Goods
Container_LCL/FCL_LCl
Container_Type_Dry
HS_Group_Name_Raw_Input
foreign_company_size_large
HS_Mixed
US_company_size_large
US_Port_Coastal_Region_Northeast
US_Port_Coastal_Region_Southeast
Carrier_size_large

The order of important features changed without the two most important features.

`XGBoost without Teus & TCVUSD - Feature Importance`

In [ ]:

ax = plot_importance(xgb)
fig = ax.figure
ax.set_title('Feature Importance: Examining Features without Teus & TCVUSD for Trade Composition/Volume during 2010-2015',
             fontsize=20)
ax.set_xlabel('XGBoost Feature Importance', fontsize=15)
plt.tight_layout()
plt.show();

The most important features without Teus and TCVUSD are:

Average_Tariff
US_Unemployment_Rate
Container_Type_Dry
US_company_size_large
foreign_company_size_large
HS_Group_Name_Raw_Input
Container_LCL/FCL_LCl
US_company_size_medium
HS_Group_Name_Edible_Processing
HS_Mixed
foreign_company_size_medium
Container_Type_Refrigerated
HS_Group_Name_Finished_Goods
US_Port_Coastal_Region_Northeast

The order of important features changed without the two most important features when using XGBoost.

`Question: How did COVID-19 impact the volume and composition of international maritime trade?`¶

Since this set of trade covers only through 2020, let's subset only the observations from 2020, drop the time related variables, like DateTime, Year, Date_Weekly_Covid and DateTime_YearWeek, as well as the first time of cases/death in a state and also the ones not related to the question. Then remove the rows with the columns having missing values to maximize the completeness of the set as before.

In [ ]:

df2 = df.loc[df['Year'] == 2020]

df2 = df2.drop(['DateTime', 'Year', 'DateTime_YearWeek', 'Date_Weekly_COVID',
                'Time0_StateCase', 'Time0_StateDeath'
                'Foreign_Company_Country_Region', 'US_Port_Clustered',
                'Foreign_Country', 'Foreign_Company_Country',
                'Foreign_Port', 'US_Port'], axis=1)

df2 = df2[df2.Foreign_Country_Continent.notna()
          & df2.Foreign_Country_Region.notna() & df2.Average_Tariff.notna()
          & df2.Price.notna() & df2.Currency.notna()
          & df2.deaths_pctdelta.notna() & df2.State_Closure_EA_Diff.notna()]
df2 = df2.drop_duplicates()
print('- Dimensions of Question 2 EDA:', df2.shape)

- Dimensions of Question 2 EDA: (2871883, 29)

`Random Forest without Teus & TCVUSD`

In [ ]:

fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
sorted_idx = rf.feature_importances_.argsort()
ax.barh(X.columns[sorted_idx], rf.feature_importances_[sorted_idx])
ax.set_title('Feature Importance: Examining Features without Teus & TCVUSD for COVID-19 and Trade Composition/Volume',
             fontsize=20)
ax.set_xlabel('Random Forest Feature Importance', fontsize=15)
plt.tight_layout()
plt.show();

The most important features without Teus and TCVUSD are:

Average_Tariff
HS_Group_Name_Finished_Goods
Price
Container_LCL/FCL_LCl
cases_weekly
cases_pctdelta
death_weekly
deaths_pctdelta
Container_Type_Dry
US_Unemployment_Rate
cases_state_firstweek
US_company_size_large
death_state_firstweek
HS_Mixed
US_company_size_medium
foreign_company_size_medium
Carrier_size_large
foreign_company_size_large
HS_Group_Name_Raw_Input
US_company_size_small
Trade_Direction_Import

`XGBoost without Teus & TCVUSD`

In [ ]:

ax = plot_importance(xgb)
fig = ax.figure
ax.set_title('Feature Importance: Examining Features without Teus & TCVUSD for COVID-19 and Trade Composition/Volume',
             fontsize=20)
ax.set_xlabel('XGBoost Feature Importance', fontsize=10)
plt.tight_layout()
plt.show();

The most important features without Teus and TCVUSD are:

Average_Tariff
cases_state_firstweek
Price
deaths_state_firstweek
cases_weekly
US_company_size_large
Container_Type_Dry
foreign_company_size_large
Trade_Direction_Import
US_company_size_medium
HS_Mixed
foreign_company_size_medium
State_Closure_EA_Diff
US_company_size_small

The order of important features changed when compared to the most important features from RF.

`Question: Are there any confounding effects impacting the volume of imports and exports of the targeted commodities?`¶

To address this questions, let's drop the time associated variables and the ones not related to question. Then we can remove the rows with any column having NA/null for some of the important variables.

In [ ]:

df3 = df.drop(['DateTime', 'Year', 'DateTime_YearWeek', 'Date_Weekly_COVID',
               'Time0_StateCase', 'Time0_StateDeath',
               'Foreign_Company_Country_Region', 'US_Port_Clustered',
               'Foreign_Country', 'Foreign_Company_Country',
               'Foreign_Port', 'US_Port'], axis=1)

df3 = df3[df3.Foreign_Country_Continent.notna()
          & df3.Foreign_Country_Region.notna() & df3.Average_Tariff.notna()
          & df3.Price.notna() & df3.Currency.notna()
          & df3.Container_Type_Refrigerated.notna()
          & df3.deaths_pctdelta.notna() & df3.State_Closure_EA_Diff.notna()]
df3 = df3.drop_duplicates()
print('- Dimensions of Question 3 EDA:', df3.shape)

- Dimensions of Question 3 EDA: (25319274, 29)

`Random Forest without Teus & TCVUSD`

In [ ]:

fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.barh(X.columns[sorted_idx], rf.feature_importances_[sorted_idx])
ax.set_title('Feature Importance: Examining Features without Teus & TCVUSD for Potential Confounding Effects',
             fontsize=20)
ax.set_xlabel('Random Forest Feature Importance', fontsize=15)
plt.tight_layout()
plt.show();

The most important features without Teus and TCVUSD are:

Price
HS_Group_Name_Finished_Goods
Average_Tariff
US_Unemployent_Rate
Container_LCL/FCL_LCl
State_Closure_EA_Diff
Container_Type_Dry
HS_Group_Name_Raw_Input
foreign_company_size_large
Trade_Direction_Import
HS_Mixed
US_company_size_large
Currency_CNY
foreign_company_size_medium
US_Port_Coastal_Region_Northeast
US_company_size_medium
Carrier_size_large
US_Port_Coastal_Region_Southeast

`XGBoost without Teus & TCVUSD`

In [ ]:

ax = plot_importance(xgb)
fig = ax.figure
ax.set_title('Feature Importance: Examining Features without Teus & TCVUSD for Potential Confounding Effects',
             fontsize=20)
ax.set_xlabel('XGBoost Feature Importance', fontsize=15)
plt.tight_layout()
plt.show();

The most important features without Teus and TCVUSD are:

Average_Tariff
Price
State_Closure_EA_Diff
Container_Type_Dry
US_company_size_large
foreign_company_size_large
US_Unemployent_Rate
Trade_Direction_Import
foreign_company_size_medium
HS_Group_Name_Raw_Input
Container_LCL/FCL_LCl
HS_Mixed

The order of important features changed when compared to the most important features from RF.

`Create Final Set`¶

Let's now drop the features that will not be used for modeling and examine the quality of the data using the data_quality_table function.

In [ ]:

df = df.drop(['Foreign_Country', 'Foreign_Port', 'US_Port', 'Teus',
              'Container_Type_Refrigerated', 'HS_Mixed', 'US_Company',
              'US_Company_Agg', 'Foreign_Company_Country', 'carrier_size',
              'US_Port_Clustered', 'Foreign_Company_Country_Continent',
              'Foreign_Country_Continent', 'Foreign_Company_Country_Region',
              'Free_Trade_Agreement_with_US', 'European_Union', 'Currency',
              'Price', 'Time0_StateCase', 'cases_state_firstweek',
              'Time0_StateDeath', 'deaths_state_firstweek'], axis=1)
df = df.drop_duplicates()
print(data_quality_table(df))

- There are 29152058 rows and 20 columns.


                       Data Type  Count Missing  Number Unique
Average_Tariff           float64        2022503           1482
Foreign_Country_Region    object         114561             11
State_Closure_EA_Diff    float64           1295              7
Container_LCL/FCL         object           1295              2
DateTime                  object           1295           4018
US_company_size           object           1295              5
foreign_company_size      object           1295              5
US_Port_Coastal_Region    object           1295              6
HS_Group_Name             object           1295              6
Trade_Direction           object           1295              2
Container_Type_Dry        object           1295              2
Year                     float64              0             11
cases_weekly             float64              0           2366
deaths_weekly            float64              0           2025
Metric_Tons              float64              0          25000
US_Unemployment_Rate     float64              0             59
TCVUSD                   float64              0        9907875
cases_pctdelta           float64              0           2365
deaths_pctdelta          float64              0           2025
Delta_Case0_Effective    float64              0             25

`XGBoost RAPIDS: Train 2020, 2019, 2018 - 2019 and 2010 - 2019`

The notebooks for XGBoost are located here. We can utilize Paperspace and use a RAPIDS Docker container to utilize GPUs with higher memory than what is available on the local desktop. First, let's upgrade pip, install the dependencies and import the required packages. Then we can set the os environment as well as the random, cupy and numpy seed. We can also define a function timed to time the code blocks and examine the components of the GPU which is being utilized. Given the number of observations within the 2010 - 2019 set, let's use a NVIDIA RTXA400 to set the baseline XGBoost models.

In [ ]:

!pip install --upgrade pip
!pip install category_encoders
!pip install xgboost==1.5.2
!pip install hyperopt
!pip install eli5
import os
import warnings
import random
import cupy
import numpy as np
import urllib.request
from contextlib import contextmanager
import time
warnings.filterwarnings('ignore')
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

seed_value = 42
os.environ['xgbRAPIDS_GPU'] = str(seed_value)
random.seed(seed_value)
cupy.random.seed(seed_value)
np.random.seed(seed_value)

@contextmanager
def timed(name):
    t0 = time.time()
    yield
    t1 = time.time()
    print('..%-24s:  %8.4f' % (name, t1 - t0))

print('\n')
!nvidia-smi

Requirement already satisfied: pip in /opt/conda/envs/rapids/lib/python3.9/site-packages (22.0.3)
Collecting pip
  Downloading pip-22.3.1-py3-none-any.whl (2.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 7.1 MB/s eta 0:00:0000:0100:01
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.3
    Uninstalling pip-22.0.3:
      Successfully uninstalled pip-22.0.3
Successfully installed pip-22.3.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Collecting category_encoders
  Downloading category_encoders-2.5.1.post0-py2.py3-none-any.whl (72 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72.4/72.4 kB 20.7 MB/s eta 0:00:00
Requirement already satisfied: scikit-learn>=0.20.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from category_encoders) (0.24.2)
Requirement already satisfied: pandas>=1.0.5 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from category_encoders) (1.3.5)
Requirement already satisfied: statsmodels>=0.9.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from category_encoders) (0.13.1)
Requirement already satisfied: patsy>=0.5.1 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from category_encoders) (0.5.2)
Requirement already satisfied: numpy>=1.14.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from category_encoders) (1.21.5)
Requirement already satisfied: scipy>=1.0.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from category_encoders) (1.6.0)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from pandas>=1.0.5->category_encoders) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from pandas>=1.0.5->category_encoders) (2021.3)
Requirement already satisfied: six in /opt/conda/envs/rapids/lib/python3.9/site-packages (from patsy>=0.5.1->category_encoders) (1.16.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from scikit-learn>=0.20.0->category_encoders) (3.1.0)
Requirement already satisfied: joblib>=0.11 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from scikit-learn>=0.20.0->category_encoders) (1.1.0)
Installing collected packages: category_encoders
Successfully installed category_encoders-2.5.1.post0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Requirement already satisfied: xgboost==1.5.2 in /opt/conda/envs/rapids/lib/python3.9/site-packages (1.5.2)
Requirement already satisfied: scipy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from xgboost==1.5.2) (1.6.0)
Requirement already satisfied: numpy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from xgboost==1.5.2) (1.21.5)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Collecting hyperopt
  Downloading hyperopt-0.2.7-py2.py3-none-any.whl (1.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 84.7 MB/s eta 0:00:00
Requirement already satisfied: scipy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from hyperopt) (1.6.0)
Requirement already satisfied: tqdm in /opt/conda/envs/rapids/lib/python3.9/site-packages (from hyperopt) (4.62.3)
Requirement already satisfied: networkx>=2.2 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from hyperopt) (2.6.3)
Requirement already satisfied: numpy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from hyperopt) (1.21.5)
Requirement already satisfied: six in /opt/conda/envs/rapids/lib/python3.9/site-packages (from hyperopt) (1.16.0)
Requirement already satisfied: cloudpickle in /opt/conda/envs/rapids/lib/python3.9/site-packages (from hyperopt) (2.0.0)
Collecting py4j
  Downloading py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.5/200.5 kB 56.3 MB/s eta 0:00:00
Collecting future
  Downloading future-0.18.2.tar.gz (829 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 829.2/829.2 kB 97.9 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: future
  Building wheel for future (setup.py) ... done
  Created wheel for future: filename=future-0.18.2-py3-none-any.whl size=491070 sha256=c2c3be805635ae50058bf5f0937515a541edc78adc321367838fd6e166adc6bf
  Stored in directory: /root/.cache/pip/wheels/96/66/19/2de75120f5d0bc185e9d16cf0fd223d8471ed025de08e45867
Successfully built future
Installing collected packages: py4j, future, hyperopt
Successfully installed future-0.18.2 hyperopt-0.2.7 py4j-0.10.9.7
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Collecting eli5
  Downloading eli5-0.13.0.tar.gz (216 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 216.2/216.2 kB 4.9 MB/s eta 0:00:00a 0:00:01
  Preparing metadata (setup.py) ... done
Requirement already satisfied: attrs>17.1.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (21.4.0)
Requirement already satisfied: jinja2>=3.0.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (3.0.3)
Requirement already satisfied: numpy>=1.9.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (1.21.5)
Requirement already satisfied: scipy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (1.6.0)
Requirement already satisfied: six in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (1.16.0)
Requirement already satisfied: scikit-learn>=0.20 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from eli5) (0.24.2)
Collecting graphviz
  Downloading graphviz-0.20.1-py3-none-any.whl (47 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47.0/47.0 kB 17.4 MB/s eta 0:00:00
Collecting tabulate>=0.7.7
  Downloading tabulate-0.9.0-py3-none-any.whl (35 kB)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from jinja2>=3.0.0->eli5) (2.0.1)
Requirement already satisfied: joblib>=0.11 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from scikit-learn>=0.20->eli5) (1.1.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from scikit-learn>=0.20->eli5) (3.1.0)
Building wheels for collected packages: eli5
  Building wheel for eli5 (setup.py) ... done
  Created wheel for eli5: filename=eli5-0.13.0-py2.py3-none-any.whl size=107748 sha256=b02531a0c5cc21be78aa233f80583d13f33f2bf4da6491fdef17feff716226df
  Stored in directory: /root/.cache/pip/wheels/d9/90/a5/4f680c43e8687c0a9aa3e6890ec517deabfda3c07735d997ba
Successfully built eli5
Installing collected packages: tabulate, graphviz, eli5
Successfully installed eli5-0.13.0 graphviz-0.20.1 tabulate-0.9.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv


Mon Dec 19 20:49:01 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    Off  | 00000000:00:05.0 Off |                  Off |
| 41%   41C    P2    34W / 140W |    157MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Now we can set up a LocalCUDACluster for Dask with one threads_per_worker assigned to Client for all of the connected workers.

In [ ]:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait
from dask.diagnostics import ProgressBar

cluster = LocalCUDACluster(threads_per_worker=1, ip='',
                           dashboard_address='8081')
c = Client(cluster)

workers = c.has_what().keys()
n_workers = len(workers)
c

distributed.preloading - INFO - Import preload module: dask_cuda.initialize

Out[ ]:

Client

Client-0bb74626-b246-11ed-8028-2e7762d3237a

Connection method: Cluster object	Cluster type: dask_cuda.LocalCUDACluster
Dashboard: http://10.42.72.72:8081/status

Cluster Info

LocalCUDACluster

05256cc3

Dashboard: http://10.42.72.72:8081/status	Workers: 1
Total threads: 1	Total memory: 39.94 GiB
Status: running	Using processes: True

Scheduler Info

Scheduler

Scheduler-1fdf72ee-93ff-4fbf-ab01-a2a86d8aa396

Comm: tcp://10.42.72.72:42913	Workers: 1
Dashboard: http://10.42.72.72:8081/status	Total threads: 1
Started: Just now	Total memory: 39.94 GiB

Workers

Worker: 0

Comm: tcp://10.42.72.72:36139	Total threads: 1
Dashboard: http://10.42.72.72:35199/status	Memory: 39.94 GiB
Nanny: tcp://10.42.72.72:35745
Local directory: /notebooks/MaritimeTrade/Models/ML/XGBoost/Hyperopt/Model_PKL/dask-worker-space/worker-djaaonaj
GPU: NVIDIA RTX A4000	GPU memory: 15.99 GiB

`Baseline Models`¶

`Train 2020`

Let's first read the data using cuDF and drop any duplicate observations. Then use the DateTime feature to create the year-week feature as DateTime_YearWeek that will be used for stratifying the train/test sets. Then we can filter the data to only observations in 2019 and 2020, followed by dropping the Year variable and converting to a pandas.Dataframe for preprocessing. For 2020, we can use df2 for setting up the features and target.

In [ ]:

import cudf
import pandas as pd

df = cudf.read_csv('combined_trade_final_LSTM.csv', low_memory=False)
df = df.drop_duplicates()
print('Number of rows and columns for 2019-2020:', df.shape)

df['DateTime']= cudf.to_datetime(df['DateTime'])
df['DateTime_YearWeek'] = df['DateTime'].dt.strftime('%Y-w%U')
df = df.drop(['DateTime'], axis=1)

df1 = df[df['Year'] == 2019]
df2 = df[df['Year'] == 2020]

del df

df1 = df1.drop(['Year'], axis=1)
df1 = df1.to_pandas()
df2 = df2.drop(['Year'], axis=1)
df2 = df2.to_pandas()

print('Number of rows and columns for the 2019 set:', df1.shape)
print('Number of rows and columns for the 2020 set:', df2.shape)

X = df2.drop(['Metric_Tons'], axis=1)
y = df2['Metric_Tons']

Number of rows and columns for 2019-2020: (6633785, 20)
Number of rows and columns for the 2019 set: (3368492, 19)
Number of rows and columns for the 2020 set: (3265293, 19)

Now, the train/test sets can be set up by using a test_size=0.2 and stratified by DateTime_YearWeek. Since the size of the companies, both U.S. and foreign, demonstrated differences in the feature importance, ordinal encoding using ranking can be utilized to contain this level of information. Then dummy variables can be created for the categorical features followed by scaling the features using the MinMaxScaler. Subsequently, the set can be converted back to cuDF, the features for both the train and sets can be converted to float32 data types and the targets converted to int32.

In [ ]:

from sklearn.model_selection import train_test_split
import category_encoders as ce
from sklearn.preprocessing import MinMaxScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=X.DateTime_YearWeek,
                                                    random_state=seed_value)

X_train = X_train.drop(['DateTime_YearWeek'], axis=1)
X_test = X_test.drop(['DateTime_YearWeek'], axis=1)

ce_ord = ce.OrdinalEncoder(cols = ['foreign_company_size', 'us_company_size'])
X_train = ce_ord.fit_transform(X_train)
X_test = ce_ord.transform(X_test)

X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)

mn = MinMaxScaler()
X_train = pd.DataFrame(mn.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(mn.transform(X_test), columns=X_test.columns)

X_train = cudf.DataFrame.from_pandas(X_train)
X_test = cudf.DataFrame.from_pandas(X_test)

X_train = X_train.astype('float32')
y_train = y_train.astype('int32')
X_test = X_test.astype('float32')
y_test = y_test.astype('int32')

The path where the .pkl file will be stored can then be designated. This is always needed to reload a model for model comparison to the baseline model or after finding the optimal hyperparameters during the tuning process. The baseline model for 2020 can then be fit to the train set, saved, and then the prediction based on the fit model can be tested on both the train and test sets to determine if the error is close or not, suggesting overfitting.

In [ ]:

from cupy import asnumpy

xgb = XGBRegressor(objective='reg:squarederror',
                   booster='gbtree',
                   tree_method='gpu_hist',
                   scale_pos_weight=1,
                   use_label_encoder=False,
                   random_state=seed_value,
                   verbosity=0)

xgb.fit(X_train, y_train)

Pkl_Filename = 'XGB_Train20_Baseline.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(xgb, file)

# =============================================================================
# # To load saved model
# model = joblib.load('XGB_Train20_Baseline.pkl')
# print(model)
# =============================================================================

print('\nModel Metrics for XGBoost Baseline Train 2020 Test 2020')
y_train_pred = xgb.predict(X_train)
y_test_pred = xgb.predict(X_test)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_absolute_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
                           squared=False),
        mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred),
                           squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
        r2_score(asnumpy(y_test), asnumpy(y_test_pred))))


Model Metrics for XGBoost Baseline Train 2020 Test 2020
MAE train: 7.823, test: 7.871
MSE train: 270.671, test: 276.149
RMSE train: 16.452, test: 16.618
R^2 train: 0.713, test: 0.710

All of the regression metrics evaluated are similar between the train and test sets, so this does not seem to be overfit. However, the R² demonstrates that this model only explains around 71% of the variance in the data, so hyperparameter tuning will need to be completed to acheive better performance.

`Train 2019`

For the 2019 baseline model, the same preprocessing, modeling and model metrics that were used for the 2020 set can be applied to df1.

In [ ]:

xgb = XGBRegressor(objective='reg:squarederror',
                   booster='gbtree',
                   tree_method='gpu_hist',
                   scale_pos_weight=1,
                   use_label_encoder=False,
                   random_state=seed_value,
                   verbosity=0)

xgb.fit(X_train, y_train)

Pkl_Filename = 'XGB_Train19_Baseline.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(xgb, file)

print('\nModel Metrics for XGBoost Baseline Train 2019 Test 2019')
y_train_pred = xgb.predict(X_train)
y_test_pred = xgb.predict(X_test)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_absolute_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
                           squared=False),
        mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred),
                           squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
        r2_score(asnumpy(y_test), asnumpy(y_test_pred))))


Model Metrics for XGBoost Baseline Train 2019 Test 2019
MAE train: 8.066, test: 8.111
MSE train: 288.531, test: 292.210
RMSE train: 16.986, test: 17.094
R^2 train: 0.713, test: 0.709

The model metrics reveal higher MAE, MSE and RMSE compared to the 2020 train/test sets, but the R² is close.

`Train 2018 - 2019`

Let's now read the complete trade set into a cuDF dataframe, remove any duplicates, convert to a pandas dataframe and select the non-missing rows to maximize the amount of complete observations. Then we can filter the set to only the observations containing only 2018 and 2019 and then drop the Year feature. To process similarily to the 2020 and 2019 sets, we will create year-week as DateTime_YearWeek from DateTime to stratify the train/test sets. Now we prepare this set for partitioning the data by dropping DateTime and allocating Metric_Tons as the target.

In [ ]:

df = cudf.read_csv('combined_trade_final.csv', low_memory=False)
df = df.drop_duplicates()
df = df.to_pandas()
df = df[df.Foreign_Country_Region.notna() & df.Average_Tariff.notna()
        & df.State_Closure_EA_Diff.notna()]

df3 = df.loc[(df['Year'] >= 2018) & (df['Year'] < 2020)]
df3 = df3.drop(['Year'], axis=1)

df3['DateTime']= pd.to_datetime(df3['DateTime'])
df3['DateTime_YearWeek'] = df3['DateTime'].dt.strftime('%Y-w%U')

X = df3.drop(['Metric_Tons', 'DateTime'], axis=1)
y = df3['Metric_Tons']
print('Number of rows and columns for 2018 - 2019:', df3.shape)

Number of rows and columns for 2018 - 2019: (6258690, 19)

Then, we can use the same preprocessing, modeling and model metrics that were used for the 2020 and 2019 sets for the 2018 - 2019 set.

In [ ]:

xgb = XGBRegressor(objective='reg:squarederror',
                   booster='gbtree',
                   tree_method='gpu_hist',
                   scale_pos_weight=1,
                   use_label_encoder=False,
                   random_state=seed_value,
                   verbosity=0)

xgb.fit(X_train, y_train)

Pkl_Filename = 'XGB_Train1819_Baseline.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(xgb, file)

print('\nModel Metrics for XGBoost Baseline Train 2018-19 Test 2018-19')
y_train_pred = xgb.predict(X_train)
y_test_pred = xgb.predict(X_test)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_absolute_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
                           squared=False),
        mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred),
                           squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
        r2_score(asnumpy(y_test), asnumpy(y_test_pred))))


Model Metrics for XGBoost Baseline Train 2018-19 Test 2018-19
MAE train: 7.403, test: 7.416
MSE train: 224.956, test: 226.949
RMSE train: 14.999, test: 15.065
R^2 train: 0.665, test: 0.663

The 2018 - 2019 baseline metrics reveal the lowest MAE, MSE and RMSE but the R² is slightly lower than both the 2020 and 2019 baseline metrics.

`Train 2010 - 2019`

The same methods used for processing the 2018 - 2019 set were utilized up to where the Year was filtered. Then DateTime was dropped and because of memory constraints, a sample of 15,000,000 observations was taken. The features and target were then defined, and the train/test sets were partitioned using test_size=0.3 stratified by Year. The subsequent preprocessing before modeling then utilized the same steps.

In [ ]:

df = df[df['Year'] < 2020]
df = df.drop(['DateTime'], axis=1)
print('Number of rows and columns in 2010 - 2019:', df.shape)

df_sample = df.sample(n=15000000)

del df

X = df_sample.drop(['Metric_Tons'], axis=1)
y = df_sample['Metric_Tons']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=X.Year,
                                                    random_state=seed_value)

del X, y

ce_ord = ce.OrdinalEncoder(cols = ['foreign_company_size', 'US_company_size'])
X_train = ce_ord.fit_transform(X_train)
X_test = ce_ord.transform(X_test)

X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)

mn = MinMaxScaler()
X_train = pd.DataFrame(mn.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(mn.transform(X_test), columns=X_test.columns)

X_train = cudf.DataFrame.from_pandas(X_train)
X_test = cudf.DataFrame.from_pandas(X_test)

X_train = X_train.astype('float32')
y_train = y_train.astype('int32')
X_test = X_test.astype('float32')
y_test = y_test.astype('int32')

Number of rows and columns in 2010 - 2019: (24133210, 18)

Now after preprocessing the data, we can fit the baseline model for the set containing observations from 2010-2019 and examine the metrics.

In [ ]:

xgb = XGBRegressor(objective='reg:squarederror',
                   booster='gbtree',
                   tree_method='gpu_hist',
                   scale_pos_weight=1,
                   use_label_encoder=False,
                   random_state=seed_value,
                   verbosity=0)

xgb.fit(X_train, y_train)

Pkl_Filename = 'XGB_Train1019_Baseline.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(xgb, file)

print('\nModel Metrics for XGBoost Baseline Train 2010-19 Test 2010-19')
y_train_pred = xgb.predict(X_train)
y_test_pred = xgb.predict(X_test)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_absolute_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
                           squared=False),
        mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred),
                           squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
        r2_score(asnumpy(y_test), asnumpy(y_test_pred))))


Model Metrics for XGBoost Baseline Train 2010-19 Test 2010-19
MAE train: 8.396, test: 9.104
MSE train: 281.391, test: 350.187
RMSE train: 16.775, test: 18.713
R^2 train: 0.603, test: 0.506

The model metrics for the 2010 - 2019 set are significantly lower compared to the baseline models for 2020, 2019 and 2018 - 2019. The default parameters reveal this model is not fitting well and is overfitting. This is probably due to the wide range of years for trade as well as unemployment, tariffs and currency.

`Hyperopt Hyperparameter Optimization`¶

The baseline model metrics reveal that hyperparameter tuning is necessary since the model metrics do not explain the variance as well as they could by testing out different parameters to explain metric tonnage given the considered features. Hyperopt is a widely utilized package for hyperparameter tuning, so let's now explore different parameters for XGBoost regression to see if better models can then subsequently result in better predictions with lower errors and explain more of the variance.

`Train 2020: 100 Trials Train/Test`¶

The method used to set up the RAPIDS environment in Colab can be found here. The code without the extensive output is provided below. The package pynvml is first installed with pip and then the rapidsai repository is cloned to the RAPIDS directory. The provided code examines if a compatible GPU is present, which a T4 works for this. Both Colab and Paperspace were utilized given runtime length and different GPU and RAM allocations.

In [ ]:

%cd /content/drive/MyDrive/RAPIDS/

/content/drive/MyDrive/RAPIDS

In [ ]:

!pip install pynvml
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/env-check.py

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pynvml
  Downloading pynvml-11.4.1-py3-none-any.whl (46 kB)
     |████████████████████████████████| 46 kB 4.9 MB/s 
Installing collected packages: pynvml
Successfully installed pynvml-11.4.1
Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 312, done.
remote: Counting objects: 100% (141/141), done.
remote: Compressing objects: 100% (86/86), done.
remote: Total 312 (delta 82), reused 98 (delta 55), pack-reused 171
Receiving objects: 100% (312/312), 90.55 KiB | 1.68 MiB/s, done.
Resolving deltas: 100% (144/144), done.
***********************************************************************
Woo! Your instance has the right kind of GPU, a Tesla T4!
***********************************************************************

The provided .sh script run in bash updates the Colab environment and restarts the kernel.

In [ ]:

!bash rapidsai-csp-utils/colab/update_gcc.sh
import os
os._exit(00)

Then we can install Condacolab and restart the kernel.

In [ ]:

import condacolab
condacolab.install()

⏬ Downloading https://github.com/jaimergp/miniforge/releases/latest/download/Mambaforge-colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:23
🔁 Restarting kernel...

Now we can see if the environment is ready to install RAPIDS.

In [ ]:

import condacolab
condacolab.check()

✨🍰✨ Everything looks OK!

We can now install RAPIDS using the stable release and then install/import the dependencies to set up the environment. As shown previously, the seed was set for reproducibility, a function was defined to time the code blocks and a CUDA cluster for Dask was established.

In [ ]:

!python rapidsai-csp-utils/colab/install_rapids.py stable
!pip install category_encoders
!pip install xgboost==1.5.2
!pip install eli5
import os
import warnings
import random
import cupy
import numpy as np
import urllib.request
from contextlib import contextmanager
import time
warnings.filterwarnings('ignore')
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'
os.environ['CONDA_PREFIX'] = '/usr/local'
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

seed_value = 42
os.environ['xgbRAPIDS_GPU'] = str(seed_value)
random.seed(seed_value)
cupy.random.seed(seed_value)
np.random.seed(seed_value)

@contextmanager
def timed(name):
    t0 = time.time()
    yield
    t1 = time.time()
    print('..%-24s:  %8.4f' % (name, t1 - t0))

print('\n')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi

WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: category_encoders in /usr/local/lib/python3.8/site-packages (2.5.1.post0)
Requirement already satisfied: scikit-learn>=0.20.0 in /usr/local/lib/python3.8/site-packages (from category_encoders) (1.2.0)
Requirement already satisfied: scipy>=1.0.0 in /usr/local/lib/python3.8/site-packages (from category_encoders) (1.8.1)
Requirement already satisfied: patsy>=0.5.1 in /usr/local/lib/python3.8/site-packages (from category_encoders) (0.5.3)
Requirement already satisfied: statsmodels>=0.9.0 in /usr/local/lib/python3.8/site-packages (from category_encoders) (0.13.5)
Requirement already satisfied: pandas>=1.0.5 in /usr/local/lib/python3.8/site-packages (from category_encoders) (1.3.5)
Requirement already satisfied: numpy>=1.14.0 in /usr/local/lib/python3.8/site-packages (from category_encoders) (1.23.5)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.8/site-packages (from pandas>=1.0.5->category_encoders) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.8/site-packages (from pandas>=1.0.5->category_encoders) (2022.6)
Requirement already satisfied: six in /usr/local/lib/python3.8/site-packages (from patsy>=0.5.1->category_encoders) (1.16.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.8/site-packages (from scikit-learn>=0.20.0->category_encoders) (3.1.0)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.8/site-packages (from scikit-learn>=0.20.0->category_encoders) (1.2.0)
Requirement already satisfied: packaging>=21.3 in /usr/local/lib/python3.8/site-packages (from statsmodels>=0.9.0->category_encoders) (22.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: xgboost==1.5.2 in /usr/local/lib/python3.8/site-packages (1.5.2)
Requirement already satisfied: scipy in /usr/local/lib/python3.8/site-packages (from xgboost==1.5.2) (1.8.1)
Requirement already satisfied: numpy in /usr/local/lib/python3.8/site-packages (from xgboost==1.5.2) (1.23.5)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: optuna in /usr/local/lib/python3.8/site-packages (3.0.4)
Requirement already satisfied: importlib-metadata<5.0.0 in /usr/local/lib/python3.8/site-packages (from optuna) (4.13.0)
Requirement already satisfied: alembic>=1.5.0 in /usr/local/lib/python3.8/site-packages (from optuna) (1.8.1)
Requirement already satisfied: numpy in /usr/local/lib/python3.8/site-packages (from optuna) (1.23.5)
Requirement already satisfied: sqlalchemy>=1.3.0 in /usr/local/lib/python3.8/site-packages (from optuna) (1.4.45)
Requirement already satisfied: cmaes>=0.8.2 in /usr/local/lib/python3.8/site-packages (from optuna) (0.9.0)
Requirement already satisfied: tqdm in /usr/local/lib/python3.8/site-packages (from optuna) (4.64.1)
Requirement already satisfied: scipy<1.9.0,>=1.7.0 in /usr/local/lib/python3.8/site-packages (from optuna) (1.8.1)
Requirement already satisfied: colorlog in /usr/local/lib/python3.8/site-packages (from optuna) (6.7.0)
Requirement already satisfied: PyYAML in /usr/local/lib/python3.8/site-packages (from optuna) (6.0)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.8/site-packages (from optuna) (22.0)
Requirement already satisfied: cliff in /usr/local/lib/python3.8/site-packages (from optuna) (4.1.0)
Requirement already satisfied: Mako in /usr/local/lib/python3.8/site-packages (from alembic>=1.5.0->optuna) (1.2.4)
Requirement already satisfied: importlib-resources in /usr/local/lib/python3.8/site-packages (from alembic>=1.5.0->optuna) (5.10.1)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.8/site-packages (from importlib-metadata<5.0.0->optuna) (3.11.0)
Requirement already satisfied: greenlet!=0.4.17 in /usr/local/lib/python3.8/site-packages (from sqlalchemy>=1.3.0->optuna) (2.0.1)
Requirement already satisfied: stevedore>=2.0.1 in /usr/local/lib/python3.8/site-packages (from cliff->optuna) (4.1.1)
Requirement already satisfied: cmd2>=1.0.0 in /usr/local/lib/python3.8/site-packages (from cliff->optuna) (2.4.2)
Requirement already satisfied: PrettyTable>=0.7.2 in /usr/local/lib/python3.8/site-packages (from cliff->optuna) (3.5.0)
Requirement already satisfied: autopage>=0.4.0 in /usr/local/lib/python3.8/site-packages (from cliff->optuna) (0.5.1)
Requirement already satisfied: attrs>=16.3.0 in /usr/local/lib/python3.8/site-packages (from cmd2>=1.0.0->cliff->optuna) (22.1.0)
Requirement already satisfied: wcwidth>=0.1.7 in /usr/local/lib/python3.8/site-packages (from cmd2>=1.0.0->cliff->optuna) (0.2.5)
Requirement already satisfied: pyperclip>=1.6 in /usr/local/lib/python3.8/site-packages (from cmd2>=1.0.0->cliff->optuna) (1.8.2)
Requirement already satisfied: pbr!=2.1.0,>=2.0.0 in /usr/local/lib/python3.8/site-packages (from stevedore>=2.0.1->cliff->optuna) (5.11.0)
Requirement already satisfied: MarkupSafe>=0.9.2 in /usr/local/lib/python3.8/site-packages (from Mako->alembic>=1.5.0->optuna) (2.1.1)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: dask_optuna in /usr/local/lib/python3.8/site-packages (0.0.2)
Requirement already satisfied: distributed in /usr/local/lib/python3.8/site-packages (from dask_optuna) (2021.11.2)
Requirement already satisfied: optuna>=2.1.0 in /usr/local/lib/python3.8/site-packages (from dask_optuna) (3.0.4)
Requirement already satisfied: dask in /usr/local/lib/python3.8/site-packages (from dask_optuna) (2021.11.2)
Requirement already satisfied: sqlalchemy>=1.3.0 in /usr/local/lib/python3.8/site-packages (from optuna>=2.1.0->dask_optuna) (1.4.45)
Requirement already satisfied: tqdm in /usr/local/lib/python3.8/site-packages (from optuna>=2.1.0->dask_optuna) (4.64.1)
Requirement already satisfied: importlib-metadata<5.0.0 in /usr/local/lib/python3.8/site-packages (from optuna>=2.1.0->dask_optuna) (4.13.0)
Requirement already satisfied: alembic>=1.5.0 in /usr/local/lib/python3.8/site-packages (from optuna>=2.1.0->dask_optuna) (1.8.1)
Requirement already satisfied: cliff in /usr/local/lib/python3.8/site-packages (from optuna>=2.1.0->dask_optuna) (4.1.0)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.8/site-packages (from optuna>=2.1.0->dask_optuna) (22.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.8/site-packages (from optuna>=2.1.0->dask_optuna) (1.23.5)
Requirement already satisfied: PyYAML in /usr/local/lib/python3.8/site-packages (from optuna>=2.1.0->dask_optuna) (6.0)
Requirement already satisfied: scipy<1.9.0,>=1.7.0 in /usr/local/lib/python3.8/site-packages (from optuna>=2.1.0->dask_optuna) (1.8.1)
Requirement already satisfied: colorlog in /usr/local/lib/python3.8/site-packages (from optuna>=2.1.0->dask_optuna) (6.7.0)
Requirement already satisfied: cmaes>=0.8.2 in /usr/local/lib/python3.8/site-packages (from optuna>=2.1.0->dask_optuna) (0.9.0)
Requirement already satisfied: fsspec>=0.6.0 in /usr/local/lib/python3.8/site-packages (from dask->dask_optuna) (2022.11.0)
Requirement already satisfied: toolz>=0.8.2 in /usr/local/lib/python3.8/site-packages (from dask->dask_optuna) (0.12.0)
Requirement already satisfied: cloudpickle>=1.1.1 in /usr/local/lib/python3.8/site-packages (from dask->dask_optuna) (2.2.0)
Requirement already satisfied: partd>=0.3.10 in /usr/local/lib/python3.8/site-packages (from dask->dask_optuna) (1.3.0)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.8/site-packages (from distributed->dask_optuna) (3.1.2)
Requirement already satisfied: tornado>=6.0.3 in /usr/local/lib/python3.8/site-packages (from distributed->dask_optuna) (6.1)
Requirement already satisfied: psutil>=5.0 in /usr/local/lib/python3.8/site-packages (from distributed->dask_optuna) (5.9.4)
Requirement already satisfied: msgpack>=0.6.0 in /usr/local/lib/python3.8/site-packages (from distributed->dask_optuna) (1.0.4)
Requirement already satisfied: sortedcontainers!=2.0.0,!=2.0.1 in /usr/local/lib/python3.8/site-packages (from distributed->dask_optuna) (2.4.0)
Requirement already satisfied: tblib>=1.6.0 in /usr/local/lib/python3.8/site-packages (from distributed->dask_optuna) (1.7.0)
Requirement already satisfied: zict>=0.1.3 in /usr/local/lib/python3.8/site-packages (from distributed->dask_optuna) (2.2.0)
Requirement already satisfied: click>=6.6 in /usr/local/lib/python3.8/site-packages (from distributed->dask_optuna) (8.0.4)
Requirement already satisfied: setuptools in /usr/local/lib/python3.8/site-packages (from distributed->dask_optuna) (59.8.0)
Requirement already satisfied: importlib-resources in /usr/local/lib/python3.8/site-packages (from alembic>=1.5.0->optuna>=2.1.0->dask_optuna) (5.10.1)
Requirement already satisfied: Mako in /usr/local/lib/python3.8/site-packages (from alembic>=1.5.0->optuna>=2.1.0->dask_optuna) (1.2.4)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.8/site-packages (from importlib-metadata<5.0.0->optuna>=2.1.0->dask_optuna) (3.11.0)
Requirement already satisfied: locket in /usr/local/lib/python3.8/site-packages (from partd>=0.3.10->dask->dask_optuna) (1.0.0)
Requirement already satisfied: greenlet!=0.4.17 in /usr/local/lib/python3.8/site-packages (from sqlalchemy>=1.3.0->optuna>=2.1.0->dask_optuna) (2.0.1)
Requirement already satisfied: heapdict in /usr/local/lib/python3.8/site-packages (from zict>=0.1.3->distributed->dask_optuna) (1.0.1)
Requirement already satisfied: autopage>=0.4.0 in /usr/local/lib/python3.8/site-packages (from cliff->optuna>=2.1.0->dask_optuna) (0.5.1)
Requirement already satisfied: PrettyTable>=0.7.2 in /usr/local/lib/python3.8/site-packages (from cliff->optuna>=2.1.0->dask_optuna) (3.5.0)
Requirement already satisfied: stevedore>=2.0.1 in /usr/local/lib/python3.8/site-packages (from cliff->optuna>=2.1.0->dask_optuna) (4.1.1)
Requirement already satisfied: cmd2>=1.0.0 in /usr/local/lib/python3.8/site-packages (from cliff->optuna>=2.1.0->dask_optuna) (2.4.2)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.8/site-packages (from jinja2->distributed->dask_optuna) (2.1.1)
Requirement already satisfied: pyperclip>=1.6 in /usr/local/lib/python3.8/site-packages (from cmd2>=1.0.0->cliff->optuna>=2.1.0->dask_optuna) (1.8.2)
Requirement already satisfied: wcwidth>=0.1.7 in /usr/local/lib/python3.8/site-packages (from cmd2>=1.0.0->cliff->optuna>=2.1.0->dask_optuna) (0.2.5)
Requirement already satisfied: attrs>=16.3.0 in /usr/local/lib/python3.8/site-packages (from cmd2>=1.0.0->cliff->optuna>=2.1.0->dask_optuna) (22.1.0)
Requirement already satisfied: pbr!=2.1.0,>=2.0.0 in /usr/local/lib/python3.8/site-packages (from stevedore>=2.0.1->cliff->optuna>=2.1.0->dask_optuna) (5.11.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: eli5 in /usr/local/lib/python3.8/site-packages (0.13.0)
Requirement already satisfied: numpy>=1.9.0 in /usr/local/lib/python3.8/site-packages (from eli5) (1.23.5)
Requirement already satisfied: attrs>17.1.0 in /usr/local/lib/python3.8/site-packages (from eli5) (22.1.0)
Requirement already satisfied: jinja2>=3.0.0 in /usr/local/lib/python3.8/site-packages (from eli5) (3.1.2)
Requirement already satisfied: tabulate>=0.7.7 in /usr/local/lib/python3.8/site-packages (from eli5) (0.9.0)
Requirement already satisfied: graphviz in /usr/local/lib/python3.8/site-packages (from eli5) (0.20.1)
Requirement already satisfied: scipy in /usr/local/lib/python3.8/site-packages (from eli5) (1.8.1)
Requirement already satisfied: six in /usr/local/lib/python3.8/site-packages (from eli5) (1.16.0)
Requirement already satisfied: scikit-learn>=0.20 in /usr/local/lib/python3.8/site-packages (from eli5) (1.2.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.8/site-packages (from jinja2>=3.0.0->eli5) (2.1.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.8/site-packages (from scikit-learn>=0.20->eli5) (3.1.0)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.8/site-packages (from scikit-learn>=0.20->eli5) (1.2.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: shap in /usr/local/lib/python3.8/site-packages (0.41.0)
Requirement already satisfied: pandas in /usr/local/lib/python3.8/site-packages (from shap) (1.3.5)
Requirement already satisfied: scipy in /usr/local/lib/python3.8/site-packages (from shap) (1.8.1)
Requirement already satisfied: cloudpickle in /usr/local/lib/python3.8/site-packages (from shap) (2.2.0)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.8/site-packages (from shap) (1.2.0)
Requirement already satisfied: slicer==0.0.7 in /usr/local/lib/python3.8/site-packages (from shap) (0.0.7)
Requirement already satisfied: numba in /usr/local/lib/python3.8/site-packages (from shap) (0.56.4)
Requirement already satisfied: packaging>20.9 in /usr/local/lib/python3.8/site-packages (from shap) (22.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.8/site-packages (from shap) (1.23.5)
Requirement already satisfied: tqdm>4.25.0 in /usr/local/lib/python3.8/site-packages (from shap) (4.64.1)
Requirement already satisfied: llvmlite<0.40,>=0.39.0dev0 in /usr/local/lib/python3.8/site-packages (from numba->shap) (0.39.1)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.8/site-packages (from numba->shap) (4.13.0)
Requirement already satisfied: setuptools in /usr/local/lib/python3.8/site-packages (from numba->shap) (59.8.0)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.8/site-packages (from pandas->shap) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.8/site-packages (from pandas->shap) (2022.6)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.8/site-packages (from scikit-learn->shap) (3.1.0)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.8/site-packages (from scikit-learn->shap) (1.2.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas->shap) (1.16.0)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.8/site-packages (from importlib-metadata->numba->shap) (3.11.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: plotly in /usr/local/lib/python3.8/site-packages (5.11.0)
Requirement already satisfied: tenacity>=6.2.0 in /usr/local/lib/python3.8/site-packages (from plotly) (8.1.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv


nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
Mon Dec 12 01:02:37 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P0    25W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

To set up Hyperopt, let's first define the number of trials with NUM_EVAL = 100. Then hyperparameters can then be defined in that a dictionary. To define integers, hp.choice with np.arange and dtype=int is used while float types are defined using hp.uniform. The space consists of 10 parameters with 4 integers and 6 float type.

In [ ]:

from hyperopt import hp

NUM_EVAL = 100

xgb_tune_kwargs= {
    'n_estimators': hp.choice('n_estimators', np.arange(50, 700, dtype=int)),
    'max_depth': hp.choice('max_depth', np.arange(3, 25, dtype=int)),
    'subsample': hp.uniform('subsample', 0.25, 0.95),
    'gamma': hp.uniform('gamma', 0, 15),
    'learning_rate': hp.uniform('learning_rate', 1e-3, 0.3),
    'reg_alpha': hp.choice('reg_alpha', np.arange(0, 30, dtype=int)),
    'reg_lambda': hp.uniform('reg_lambda', 0, 100),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.3, 1),
    'colsample_bylevel': hp.uniform('colsample_bylevel', 0.05, 0.95),
    'min_child_weight': hp.choice('min_child_weight', np.arange(1, 30,
                                                                dtype=int))
    }

Now, we can define a function for the optimization of these hyperparameters. Within this function, we can use joblib to save a pkl file that can be reloaded if more training is needed. We can define the trial number with ITERATION that will increase by 1 each trial to keep track of the parameters for each trial. The parameters which are integers need to configured to remain as integers and not as float. This is then allocated for n_estimators and max_depth, which starts at max_depth=3. Then the model type, XGBRegressor, needs to be defined with the parameters that will be included in all of the trials during the search, which are:

objective='reg:squarederror': Specify the learning task and the corresponding learning objective or a custom objective function to be used
booster='gbtree': Specify which booster to use: gbtree, gblinear or dart.
tree_method='gpu_hist': Specify which tree method to use. Default to auto.
scale_pos_weight=1: Balancing of positive and negative weights.
use_label_encoder=False: Encodes labels
random_state=seed_value: Random number seed.
verbosity=0: The degree of verbosity. Valid values are 0 (silent) - 3 (debug)

Then we can fit the model with the training set and the test set as the eval_set. Then predict using the test set and evaluate the mean_absolute_error with the true test set and the predicted from the model. The trial loss=mae, params = parameters tested in the trial from the xgb_tune_kwargs space , the trial number (iteration), the time to complete the trial (train_time) and if the trial completed successfully or not (status) are written to the defined .csv file and appended by row for each trial.

In [ ]:

import joblib
from xgboost import XGBRegressor
from timeit import default_timer as timer
from cupy import asnumpy
from sklearn.metrics import mean_absolute_error
import csv
from hyperopt import STATUS_OK

def xgb_hpo(config):
    """
    Objective function to tune a <code>XGBoostRegressor</code> model.
    """
    joblib.dump(bayesOpt_trials, 'xgbRapids_Hyperopt_100_GPU_Train20.pkl')

    global ITERATION
    ITERATION += 1

    config['n_estimators'] = int(config['n_estimators'])
    config['max_depth'] = int(config['max_depth']) + 3

    xgb = XGBRegressor(objective='reg:squarederror',
                       booster='gbtree',
                       tree_method='gpu_hist',
                       scale_pos_weight=1,
                       use_label_encoder=False,
                       random_state=seed_value,
                       verbosity=0,
                       **config)

    start = timer()
    xgb.fit(X_train, y_train,
            eval_set=[(X_test, y_test)],
            verbose=0)
    run_time = timer() - start

    y_pred_val = xgb.predict(X_test)
    mae = mean_absolute_error(asnumpy(y_test), asnumpy(y_pred_val))

    of_connection = open(out_file, 'a')
    writer = csv.writer(of_connection)
    writer.writerow([mae, config, ITERATION, run_time])

    return {'loss': mae, 'params': config, 'iteration': ITERATION,
            'train_time': run_time, 'status': STATUS_OK}

The Tree-structured Parzen Estimator Approach (TPE) algorithm is the default algorithm for Hyperopt. This algorithm proposed in Algorithms for Hyper-Parameter Optimization uses a Bayesian optimization approach to develop a probabilistic model of a defined function (the minimum, in this case) by generating a random starting point, calculating the function, then choosing the next set of parameters that will probabilistically result in a lower minimum utilizing past calculations for the conditional probability model, then computing the real value, and continuing until the defined stopping criteria is met.

Let's now define an out_file to save the results where the headers will be written to the file. Then we can set the global variable for the ITERATION and define the Hyperopt Trials as bayesOpt_trials. We can utilize if/else condtional statements to load a .pkl file if it exists, and then utilizing fmin, we can specify the training function xgb_hpo, the parameter space xgb_tune_kwargs, the algorithm for optimization tpe.suggest, the number of trials to evaluate NUM_EVAl and the name of the trial set bayesOpt_trials. We can now begin the hyperparameter optimization HPO trials.

In [ ]:

from hyperopt import tpe, Trials, fmin

tpe_algorithm = tpe.suggest

out_file = '/content/drive/MyDrive/MaritimeTrade/Models/ML/XGBoost/Hyperopt/trialOptions/xgbRapids_HPO_Train20_100_GPU.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)

writer.writerow(['loss', 'params', 'iteration', 'train_time'])
of_connection.close()

global ITERATION
ITERATION = 0
bayesOpt_trials = Trials()

In [ ]:

if os.path.isfile('xgbRapids_Hyperopt_100_GPU_Train20.pkl'):
    bayesOpt_trials = joblib.load('xgbRapids_Hyperopt_100_GPU_Train20.pkl')
    best_param = fmin(xgb_hpo, xgb_tune_kwargs, algo=tpe.suggest,
                      max_evals=NUM_EVAL, trials=bayesOpt_trials)
else:
    best_param = fmin(xgb_hpo, xgb_tune_kwargs, algo=tpe.suggest,
                      max_evals=NUM_EVAL, trials=bayesOpt_trials)

98%|██████████| 98/100 [6:36:44, 300.97s/it, best loss: 5.002324578196533]

Let's now access the results in the trialOptions directory by reading into a pandas.Dataframe sorted with the best scores on top and then resetting the index for slicing. Then we can convert the trial parameters from a string to a dictionary for later use. Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then saving as a .pkl file. Then we can evaluate the model metrics for both the train and the test sets using the Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and the Coefficient of Determination (R²)

In [ ]:

import ast
import pickle
from sklearn.metrics import mean_squared_error, r2_score

results = pd.read_csv('xgbRapids_HPO_Train20_100_GPU.csv')
results.sort_values('loss', ascending=True, inplace=True)
results.reset_index(inplace=True, drop=True)
results.to_csv('xgbRapids_HPO_Train20_100_GPU.csv')

ast.literal_eval(results.loc[0, 'params'])
best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()

best_bayes_model = XGBRegressor(objective='reg:squarederror',
                                booster='gbtree',
                                tree_method='gpu_hist',
                                scale_pos_weight=1,
                                use_label_encoder=False,
                                random_state=seed_value,
                                verbosity=0,
                                **best_bayes_params)

best_bayes_model.fit(X_train, y_train)

Pkl_Filename = 'xgbRapids_HPO_Train20_100trials_GPU.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_bayes_model, file)

print('\nModel Metrics for XGBoost HPO Train 2020 Test 2020 100 GPU trials')
y_train_pred = best_bayes_model.predict(X_train)
y_test_pred = best_bayes_model.predict(X_test)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_absolute_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
                           squared=False),
        mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred),
                           squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
        r2_score(asnumpy(y_test), asnumpy(y_test_pred))))


Model Metrics for XGBoost HPO Train 2020 Test 2020 100 GPU trials
MAE train: 2.999, test: 5.065
MSE train: 47.560, test: 155.987
RMSE train: 6.896, test: 12.489
R^2 train: 0.950, test: 0.836

When comparing the model metrics from the hyperparameter search to the baseline model metrics, all of the MAE, MSE and RMSE for the train and test sets are considerably lower and the R² is considerably higher.

We can also evaluate the MSE on the test set and determine when this was achieved:

In [ ]:

print('The best model from Bayes optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(asnumpy(y_test),
                                                                                                            asnumpy(y_test_pred))))
print('This was achieved after {} search iterations'.format(results.loc[0, 'iteration']))

The best model from Bayes optimization scores 155.98724 MSE on the test set.
This was achieved after 3 search iterations

`Results from the Hyperparameter Search`

Let's now create a new pandas.Dataframe where the parameters examined during the search are stored and the results of each parameter are stored as a different column. Then we can convert the data types for graphing. Now, we can examine the distributions of the quantitative hyperparameters that were assessed during the search.

In [ ]:

import matplotlib.pyplot as plt
import seaborn as sns

bayes_params = pd.DataFrame(columns=list(ast.literal_eval(results.loc[0,
                                                                      'params']).keys()),
                            index=list(range(len(results))))

for i, params in enumerate(results['params']):
    bayes_params.loc[i,:] = list(ast.literal_eval(params).values())

bayes_params['loss'] = results['loss']
bayes_params['iteration'] = results['iteration']
bayes_params['colsample_bylevel'] = bayes_params['colsample_bylevel'].astype('float64')
bayes_params['colsample_bytree'] = bayes_params['colsample_bytree'].astype('float64')
bayes_params['gamma'] = bayes_params['gamma'].astype('float64')
bayes_params['learning_rate'] = bayes_params['learning_rate'].astype('float64')
bayes_params['reg_alpha'] = bayes_params['reg_alpha'].astype('float64')
bayes_params['reg_lambda'] = bayes_params['reg_lambda'].astype('float64')
bayes_params['subsample'] = bayes_params['subsample'].astype('float64')

for i, hpo in enumerate(bayes_params.columns):
    if hpo not in ['iteration', 'subsample', 'force_col_wise',
                   'max_depth', 'min_child_weight', 'n_estimators']:
        plt.figure(figsize=(14,6))
        if hpo != 'loss':
            sns.kdeplot(bayes_params[hpo], label='Bayes Optimization')
            plt.legend(loc=0)
            plt.title('{} Distribution'.format(hpo))
            plt.xlabel('{}'.format(hpo)); plt.ylabel('Density')
            plt.tight_layout()
            plt.show()

<Figure size 1400x600 with 0 Axes>

Higher values of colsample_bylevel, colsample_bytree, gamma and reg_lambda in the given hyperparameter space performed better to generate a lower loss. A learning_rate around 0.1 performed better while reg_alpha was bimodal with values around 11-12 and 25-28.

We can now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.

In [ ]:

fig, axs = plt.subplots(1, 4, figsize=(20,5))
i = 0
for i, hpo in enumerate(['learning_rate', 'gamma', 'colsample_bylevel',
                         'colsample_bytree']):
    sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

The learning_rate, and colsample_bylevel parameters decreased over the trials while the gamma and colsample_bytree parameters increased.

Next, we can examine if there were any trends regarding the regularization parameters over the trials.

In [ ]:

fig, axs = plt.subplots(1, 2, figsize=(14,6))
i = 0
for i, hpo in enumerate(['reg_alpha', 'reg_lambda']):
    sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

The reg_lambda parameter increased while reg_alpha did not show any trend over the trial iterations.

`Model Explanations`

Now, we can plot the feature importance for the best model.

In [ ]:

from xgboost import plot_importance

plot_importance(best_bayes_model, max_num_features=15)

These features are the most important for the Train 2020 Test 2020 model:

TCVUSD = 2,349,275
Average_Tariff = 1,596,998
cases_weekly = 1,518,343
us_company_size = 965,347
deaths_weekly = 916,115
foreign_company_size = 882,163
US_Unemployment_Rate = 512,346

`Model Metrics with ELI5`

Let's now utilize the PermutationImportance from eli5.sklearn which is a method for the global interpretation of a model that outputs the amplitude of the feature's effect but not the direction. This shuffles the set, generates predictions on the shuffled set, and then calculates the decrease in the specified metric, this case the model's metric, before shuffling the set. The more important a feature is, the greater the model error is when the set is shuffled.

Class PermutationImportance(estimator, scoring=None, n_iter=5, random_state=None, cv='prefit', refit=True)

Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.

In [ ]:

import eli5
from eli5.sklearn import PermutationImportance

X_test1 = pd.DataFrame(X_test.to_pandas(), columns=X_test.columns)

perm_importance = PermutationImportance(best_bayes_model,
                                        random_state=seed_value).fit(asnumpy(X_test),
                                                                     asnumpy(y_test))
html_obj = eli5.show_weights(perm_importance,
                             feature_names=X_test1.columns.tolist())
html_obj

Out[ ]:

Weight	Feature
1.1318 ± 0.0053	TCVUSD
0.4131 ± 0.0012	Average_Tariff
0.2764 ± 0.0055	HS_Group_Name_Finished Goods
0.1746 ± 0.0024	Trade_Direction_Import
0.1738 ± 0.0009	HS_Group_Name_Raw Input
0.1057 ± 0.0008	foreign_company_size
0.1035 ± 0.0011	us_company_size
0.0627 ± 0.0011	Delta_Case0_Effective
0.0490 ± 0.0011	Foreign_Country_Region_South Asia
0.0448 ± 0.0014	Foreign_Country_Region_European Union
0.0402 ± 0.0006	Container_Type_Dry
0.0390 ± 0.0007	HS_Group_Name_Edible with Processing
0.0352 ± 0.0004	Foreign_Country_Region_Other East Asia (not China)
0.0274 ± 0.0003	State_Closure_EA_Diff
0.0253 ± 0.0003	HS_Group_Name_Pharma
0.0237 ± 0.0006	HS_Group_Name_Vices
0.0221 ± 0.0003	Foreign_Country_Region_North America
0.0211 ± 0.0005	Foreign_Country_Region_Southeast Asia
0.0202 ± 0.0011	Container_LCL/FCL_LCL
0.0164 ± 0.0003	Foreign_Country_Region_South America
… 14 more …

In [ ]:

from eli5.formatters import format_as_dataframe

explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp

Out[ ]:

	feature	weight	std
0	TCVUSD	1.131809	0.002670
1	Average_Tariff	0.413096	0.000620
2	HS_Group_Name_Finished Goods	0.276361	0.002752
3	Trade_Direction_Import	0.174613	0.001179
4	HS_Group_Name_Raw Input	0.173788	0.000453
5	foreign_company_size	0.105692	0.000394
6	us_company_size	0.103508	0.000532
7	Delta_Case0_Effective	0.062671	0.000569
8	Foreign_Country_Region_South Asia	0.049041	0.000552
9	Foreign_Country_Region_European Union	0.044832	0.000685
10	Container_Type_Dry	0.040172	0.000321
11	HS_Group_Name_Edible with Processing	0.039007	0.000367
12	Foreign_Country_Region_Other East Asia (not Ch...	0.035229	0.000194
13	State_Closure_EA_Diff	0.027365	0.000162
14	HS_Group_Name_Pharma	0.025318	0.000159
15	HS_Group_Name_Vices	0.023705	0.000292
16	Foreign_Country_Region_North America	0.022083	0.000126
17	Foreign_Country_Region_Southeast Asia	0.021134	0.000260
18	Container_LCL/FCL_LCL	0.020249	0.000558
19	Foreign_Country_Region_South America	0.016394	0.000147

For the Train 2020 Test 2020 XGBoost model after tuning, the TCVUSD feature definitely has the most weight (1.131809). Then Average_Tariff (0.413096), HS_Group_Name_Finished Goods (0.276361), Trade_Direction_Import (0.174613), HS_Group_Name_Raw Input (0.173788), foreign_company_size (0.105692), us_company_size (0.103508) and Delta_Case0_Effective (0.062671).

`Test on 2019`¶

We can now prepare the 2019 set to fit the model trained on 2020 data. This pipeline conists of encoding the foreign_company_size and US_company_size features using ordinal (ranking), creating dummy variables for the categorical variables, scaling the data using the MinMaxScaler, converting back to a cuDF dataframe and then converting the data types before modeling this test set.

In [ ]:

X_test1 = df1.drop(['Metric_Tons', 'DateTime_YearWeek'], axis=1)
y_test1 = df1['Metric_Tons']
ce_ord = ce.OrdinalEncoder(cols = ['foreign_company_size', 'us_company_size'])
X_test1 = ce_ord.fit_transform(X_test1)
X_test1 = pd.get_dummies(X_test1, drop_first=True)
X_test1 = pd.DataFrame(mn.transform(X_test1), columns=X_test1.columns)
X_test1 = cudf.DataFrame.from_pandas(X_test1)
X_test1 = X_test1.astype('float32')
y_test1 = y_test1.astype('int32')

best_bayes_model.fit(X_test1, y_test1)

print('\nModel Metrics for XGBoost HPO Train 2020 Test 2019')
y_test_pred = best_bayes_model.predict(X_test1)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_absolute_error(asnumpy(y_test1), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_squared_error(asnumpy(y_test1), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
                           squared=False),
        mean_squared_error(asnumpy(y_test1), asnumpy(y_test_pred),
                           squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
        r2_score(asnumpy(y_test1), asnumpy(y_test_pred))))


Model Metrics for XGBoost HPO Train 2020 Test 2019
MAE train: 2.999, test: 3.519
MSE train: 47.560, test: 78.984
RMSE train: 6.896, test: 8.887
R^2 train: 0.950, test: 0.922

When comparing the metrics from the 2020 training set to the 2019 test set, there are higher MAE, MSE, RMSE but a higher R² with the 2019 set compared to Train 2020 Test 2020

`Model Explanations`

Now, we can plot the feature importance from the model.

In [ ]:

plot_importance(best_bayes_model, max_num_features=15)

These features are the most important for the Train 2020 Test 2019 model:

TCVUSD = 3,061,152 higher than 2,349,275 for 2020
Average_Tariff = 1,965,434 higher than 1,596,998 for 2020
Then us_company_size = 1,286,641 higher 965,347 for 2020
Then US_Unemployment_Rate = 1,241,843 higher than 512,346 for 2020
Then foreign_company_size = 1,105,097 higher than 882,163 for 2020
Then Delta_Case0_Effective = 781,520
Then State_Closure_EA_Diff = 622,017
Then Container_Type_Dry = 371,643
Not used cases_weekly = 1,518,343, us_company_size is lower with 965,347, deaths_weekly = 916,115 is not on the chart.

`Model Metrics with ELI5`

Now, we can compute the permutation feature importance using the 2019 set, and then get the weights with the defined feature names.

In [ ]:

X1_test1 = pd.DataFrame(X_test1.to_pandas(), columns=X_test1.columns)

perm_importance = PermutationImportance(best_bayes_model,
                                        random_state=seed_value).fit(asnumpy(X_test1),
                                                                     asnumpy(y_test1))
html_obj = eli5.show_weights(perm_importance,
                             feature_names=X1_test1.columns.tolist())
html_obj

Out[ ]:

Weight	Feature
1.2398 ± 0.0020	TCVUSD
0.4819 ± 0.0015	Average_Tariff
0.2748 ± 0.0006	HS_Group_Name_Finished Goods
0.2446 ± 0.0002	foreign_company_size
0.2110 ± 0.0009	HS_Group_Name_Raw Input
0.1564 ± 0.0007	us_company_size
0.1444 ± 0.0002	Trade_Direction_Import
0.1013 ± 0.0007	Delta_Case0_Effective
0.0609 ± 0.0003	Foreign_Country_Region_European Union
0.0547 ± 0.0004	State_Closure_EA_Diff
0.0543 ± 0.0002	Foreign_Country_Region_South Asia
0.0507 ± 0.0005	Container_Type_Dry
0.0455 ± 0.0002	Foreign_Country_Region_Other East Asia (not China)
0.0441 ± 0.0003	US_Unemployment_Rate
0.0437 ± 0.0002	HS_Group_Name_Edible with Processing
0.0314 ± 0.0002	Foreign_Country_Region_North America
0.0273 ± 0.0001	HS_Group_Name_Vices
0.0265 ± 0.0002	Foreign_Country_Region_Southeast Asia
0.0254 ± 0.0001	HS_Group_Name_Pharma
0.0237 ± 0.0002	Container_LCL/FCL_LCL
… 14 more …

In [ ]:

explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X1_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp

Out[ ]:

	feature	weight	std
0	TCVUSD	1.239769	0.001000
1	Average_Tariff	0.481857	0.000740
2	HS_Group_Name_Finished Goods	0.274777	0.000282
3	foreign_company_size	0.244605	0.000121
4	HS_Group_Name_Raw Input	0.211019	0.000443
5	us_company_size	0.156423	0.000327
6	Trade_Direction_Import	0.144351	0.000122
7	Delta_Case0_Effective	0.101251	0.000325
8	Foreign_Country_Region_European Union	0.060932	0.000170
9	State_Closure_EA_Diff	0.054713	0.000224
10	Foreign_Country_Region_South Asia	0.054319	0.000102
11	Container_Type_Dry	0.050703	0.000226
12	Foreign_Country_Region_Other East Asia (not Ch...	0.045532	0.000118
13	US_Unemployment_Rate	0.044145	0.000140
14	HS_Group_Name_Edible with Processing	0.043697	0.000109
15	Foreign_Country_Region_North America	0.031434	0.000094
16	HS_Group_Name_Vices	0.027297	0.000055
17	Foreign_Country_Region_Southeast Asia	0.026481	0.000094
18	HS_Group_Name_Pharma	0.025397	0.000035
19	Container_LCL/FCL_LCL	0.023662	0.000097

For the Train 2020 Test 2019 XGBoost model after tuning, the TCVUSD feature definitely has a higher weight (1.239769 vs. 1.131809). Then higher Average_Tariff (0.481857 vs. 0.413096), similar HS_Group_Name_Finished Goods (0.274777 vs. 0.276361), lower Trade_Direction_Import (0.144351 vs. 0.174613), higher HS_Group_Name_Raw Input (0.211019 vs. 0.173788), higher foreign_company_size (0.244605 vs. 0.105692), higher us_company_size (0.156423 vs. 0.103508) and higher Delta_Case0_Effective (0.101251 vs. 0.062671).

`Test on 2010 - 2019`

The same methods used for processing the 2019 data up to filtering the Year feature were used. Then, the DateTime and Year features were dropped, the data sampled to 15 million observations, ordinal encoding of the size of the US and foreign companies, dummy variables for the categorical ones, MinMaxScaler for the test set, which were then converted back to a cuDF.Dataframe and the data types converted.

In [ ]:

X_test1 = df_sample.drop(['Metric_Tons', 'DateTime', 'Year'], axis=1)
y_test1 = df_sample['Metric_Tons']

ce_ord = ce.OrdinalEncoder(cols = ['foreign_company_size', 'us_company_size'])
X_test1 = ce_ord.fit_transform(X_test1)

X_test1 = pd.get_dummies(X_test1, drop_first=True)

mn = MinMaxScaler()
X_test1 = pd.DataFrame(mn.fit_transform(X_test1), columns=X_test1.columns)

best_bayes_model.fit(X_test1, y_test1)

print('\nModel Metrics for XGBoost HPO Train 2020 Test 2010-19')
y_test_pred = best_bayes_model.predict(X_test1)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_absolute_error(asnumpy(y_test1), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_squared_error(asnumpy(y_test1), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
                           squared=False),
        mean_squared_error(asnumpy(y_test1), asnumpy(y_test_pred),
                           squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
        r2_score(asnumpy(y_test1), asnumpy(y_test_pred))))

Model Metrics for XGBoost HPO Train 2020 Test 2010-19
MAE train: 2.999, test: 4.594
MSE train: 47.560, test: 113.607
RMSE train: 6.896, test: 10.659
R^2 train: 0.950, test: 0.839

When comparing the metrics from the 2020 training set to the 2010 - 2019 test set, there are even higher MAE, MSE, RMSE compared to 2020 and 2019, but a higher R² with the 2010-19 set compared to the 2020 test set.

`Model Explanations`

Now, we can plot the feature importance from the model.

In [ ]:

plot_importance(best_bayes_model, max_num_features=15)

These features are the most important for the Train 2020 Test 2010 - 2019 model:

First US_Unemployment_Rate = 8,619,313 significantly higher than than 512,346 for 2020
TCVUSD = 7,413,080 significantly higher than 2,349,275 for 2020
Average_Tariff = 6,128,827 significantly higher than 1,596,998 for 2020
Then us_company_size = 4,194,619 higher than 965,347 for 2020
Then foreign_company_size = 3,415,736 higher than 882,163 for 2020
Then State_Closure_EA_Diff = 2,300,086 higher than 316,770 for 2020
Then US_Port_Coastal_Region_Northeast = 1,115,509, not listed for 2020
Then Container_Type_Dry = 1,096,746 higher than 292,538 for 2020

`Model Metrics with ELI5`

Now, we can compute the permutation feature importance using the 2010 - 2019 set, and then get the weights with the defined feature names.

In [ ]:

X1_test1 = pd.DataFrame(X_test1.to_pandas(), columns=X_test1.columns)

perm_importance = PermutationImportance(best_bayes_model,
                                        random_state=seed_value).fit(asnumpy(X_test1),
                                                                     asnumpy(y_test1))
html_obj = eli5.show_weights(perm_importance,
                             feature_names=X1_test1.columns.tolist())
html_obj

Out[ ]:

Weight	Feature
1.0379 ± 0.0011	TCVUSD
0.4804 ± 0.0004	Average_Tariff
0.2917 ± 0.0006	HS_Group_Name_Finished Goods
0.1946 ± 0.0005	US_Unemployment_Rate
0.1939 ± 0.0003	Trade_Direction_Import
0.1648 ± 0.0004	US_company_size
0.1464 ± 0.0001	HS_Group_Name_Raw Input
0.1239 ± 0.0004	foreign_company_size
0.1012 ± 0.0001	Container_Type_Dry_True
0.0922 ± 0.0004	State_Closure_EA_Diff
0.0681 ± 0.0001	Foreign_Country_Region_European Union
0.0625 ± 0.0002	HS_Group_Name_Edible with Processing
0.0585 ± 0.0001	Container_LCL/FCL_LCL
0.0558 ± 0.0001	Foreign_Country_Region_Other East Asia (not China)
0.0471 ± 0.0002	US_Port_Coastal_Region_Northeast
0.0373 ± 0.0001	Foreign_Country_Region_Southeast Asia
0.0356 ± 0.0001	Foreign_Country_Region_South Asia
0.0324 ± 0.0002	US_Port_Coastal_Region_Southwest
0.0320 ± 0.0001	US_Port_Coastal_Region_Southeast
0.0279 ± 0.0001	Foreign_Country_Region_North America
… 13 more …

In [ ]:

explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X1_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp

Out[ ]:

	feature	weight	std
0	TCVUSD	1.037852	0.000550
1	Average_Tariff	0.480369	0.000219
2	HS_Group_Name_Finished Goods	0.291674	0.000313
3	US_Unemployment_Rate	0.194572	0.000256
4	Trade_Direction_Import	0.193941	0.000153
5	US_company_size	0.164763	0.000214
6	HS_Group_Name_Raw Input	0.146352	0.000047
7	foreign_company_size	0.123941	0.000219
8	Container_Type_Dry_True	0.101223	0.000052
9	State_Closure_EA_Diff	0.092235	0.000207
10	Foreign_Country_Region_European Union	0.068141	0.000058
11	HS_Group_Name_Edible with Processing	0.062546	0.000084
12	Container_LCL/FCL_LCL	0.058488	0.000064
13	Foreign_Country_Region_Other East Asia (not Ch...	0.055848	0.000041
14	US_Port_Coastal_Region_Northeast	0.047070	0.000108
15	Foreign_Country_Region_Southeast Asia	0.037295	0.000034
16	Foreign_Country_Region_South Asia	0.035593	0.000039
17	US_Port_Coastal_Region_Southwest	0.032407	0.000111
18	US_Port_Coastal_Region_Southeast	0.031958	0.000036
19	Foreign_Country_Region_North America	0.027888	0.000035

For the Train 2020 Test 2010 - 2019 XGBoost model after tuning, the TCVUSD feature definitely has a lower weight (1.037852 vs. 1.131809). Then higher Average_Tariff (0.480369 vs. 0.413096), similar HS_Group_Name_Finished Goods (0.291674 vs. 0.276361), higher Trade_Direction_Import (0.193941 vs. 0.174613), lower HS_Group_Name_Raw Input (0.146352 vs. 0.173788), higher foreign_company_size (0.123941 vs. 0.105692), higher us_company_size (0.156423 vs. 0.103508) and higher Delta_Case0_Effective (0.164763 vs. 0.062671).

`Train 2019: 100 Trials Train/Test`

Using the same number of trials, hyperparameters and optimization algorithm that was utilized for the 2020 tuning, let's now define a different out_file and .pkl file and run the search for the 2019 data.

In [ ]:

out_file = '/notebooks/MaritimeTrade/Models/ML/XGBoost/Hyperopt/trialOptions/xgbRapids_HPO_Train19_100_GPU.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)

writer.writerow(['loss', 'params', 'iteration', 'train_time'])
of_connection.close()

global ITERATION
ITERATION = 0
bayesOpt_trials = Trials()

In [ ]:

if os.path.isfile('xgbRapids_Hyperopt_100_GPU_Train19.pkl'):
    bayesOpt_trials = joblib.load('xgbRapids_Hyperopt_100_GPU_Train19.pkl')
    best_param = fmin(xgb_hpo, xgb_tune_kwargs, algo=tpe.suggest,
                      max_evals=NUM_EVAL, trials=bayesOpt_trials)
else:
    best_param = fmin(xgb_hpo, xgb_tune_kwargs, algo=tpe.suggest,
                      max_evals=NUM_EVAL, trials=bayesOpt_trials)

 90%|█████████ | 90/100 [4:53:40<1:19:24, 476.48s/trial, best loss: 4.866573237785186]

Let's now access the results in the trialOptions directory by reading into a pandas.Dataframe sorted with the best scores on top and then resetting the index for slicing. Then we can convert the trial parameters from a string to a dictionary for later use. Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then saving as a .pkl file. Then we can evaluate the model metrics for both the train and the test sets using the MAE, MSE, RMSE and the (R²).

In [ ]:

results = pd.read_csv('xgbRapids_HPO_Train19_100_GPU.csv')
results.sort_values('loss', ascending=True, inplace=True)
results.reset_index(inplace=True, drop=True)
results.to_csv('xgbRapids_HPO_Train19_100_GPU.csv')

ast.literal_eval(results.loc[0, 'params'])
best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()

best_bayes_model = XGBRegressor(objective='reg:squarederror',
                                booster='gbtree',
                                tree_method='gpu_hist',
                                scale_pos_weight=1,
                                use_label_encoder=False,
                                random_state=seed_value,
                                verbosity=0,
                                **best_bayes_params)

best_bayes_model.fit(X_train, y_train)

Pkl_Filename = 'xgbRapids_Hyperopt_100_GPU_Train19trials.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_bayes_model, file)

print('\nModel Metrics for XGBoost HPO Train 2019 100 GPU trials')
y_train_pred = best_bayes_model.predict(X_train)
y_test_pred = best_bayes_model.predict(X_test)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_absolute_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
                           squared=False),
        mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred),
                           squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
        r2_score(asnumpy(y_test), asnumpy(y_test_pred))))


Model Metrics for XGBoost HPO Train 2019 100 GPU trials
MAE train: 3.648, test: 4.867
MSE train: 78.764, test: 159.370
RMSE train: 8.875, test: 12.624
R^2 train: 0.922, test: 0.842

When comparing the model metrics from the hyperparameter search to the baseline model metrics, all of the MAE, MSE and RMSE for the train and test sets are considerably lower and the R² is considerably higher.

We can also evaluate the MSE on the test set and determine when this was achieved:

In [ ]:

print('The best model from Bayes optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(asnumpy(y_test),
                                                                                                            asnumpy(y_test_pred))))
print('This was achieved after {} search iterations'.format(results.loc[0, 'iteration']))

The best model from Bayes optimization scores 159.36986 MSE on the test set.
This was achieved after 79 search iterations

`Results from the Hyperparameter Search`

Let's now create a new pandas.Dataframe where the parameters examined during the search are stored and the results of each parameter are stored as a different column. Then we can convert the data types for graphing. Now, we can now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.

In [ ]:

bayes_params = pd.DataFrame(columns=list(ast.literal_eval(results.loc[0,
                                                                      'params']).keys()),
                            index=list(range(len(results))))

for i, params in enumerate(results['params']):
    bayes_params.loc[i,:] = list(ast.literal_eval(params).values())

bayes_params['loss'] = results['loss']
bayes_params['iteration'] = results['iteration']
bayes_params['colsample_bylevel'] = bayes_params['colsample_bylevel'].astype('float64')
bayes_params['colsample_bytree'] = bayes_params['colsample_bytree'].astype('float64')
bayes_params['gamma'] = bayes_params['gamma'].astype('float64')
bayes_params['learning_rate'] = bayes_params['learning_rate'].astype('float64')
bayes_params['reg_alpha'] = bayes_params['reg_alpha'].astype('float64')
bayes_params['reg_lambda'] = bayes_params['reg_lambda'].astype('float64')
bayes_params['subsample'] = bayes_params['subsample'].astype('float64')

fig, axs = plt.subplots(1, 4, figsize=(20,5))
i = 0
for i, hpo in enumerate(['learning_rate', 'gamma', 'colsample_bylevel',
                         'colsample_bytree']):
    sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

learning_rate, and decreased over the trials while colsample_bylevel and colsample_bytree increased while gamma did not show any trend. For 2020, learning_rate, and colsample_bylevel decreased over the trials while gamma and colsample_bytree increased.

Next, we can examine if there were any trends regarding the regularization parameters over the trials.

In [ ]:

fig, axs = plt.subplots(1, 2, figsize=(14,6))
i = 0
for i, hpo in enumerate(['reg_alpha', 'reg_lambda']):
    sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

reg_lambda decreased while reg_alpha did not show any trend over the trial iterations.For 2020, reg_lambda increased while reg_alpha did not show any trend over the trial iterations.

`Model Explanations`

Now, we can plot the feature importance for the best model.

In [ ]:

plot_importance(best_bayes_model, max_num_features=15)

Compared to Train 2020, higher TCVUSD, Average_Tariff, us_company_size. There is different order without cases_weekly, double US_Unemployment_Rate, higher foreign_company_size

`Model Metrics with ELI5`

Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.

In [ ]:

X1_test1 = pd.DataFrame(X_test1.to_pandas(), columns=X_test1.columns)

perm_importance = PermutationImportance(best_bayes_model,
                                        random_state=seed_value).fit(asnumpy(X_test),
                                                                     asnumpy(y_test))
html_obj = eli5.show_weights(perm_importance,
                             feature_names=X_test1.columns.tolist())
html_obj

Out[ ]:

Weight	Feature
1.1473 ± 0.0044	TCVUSD
0.4113 ± 0.0021	Average_Tariff
0.2373 ± 0.0030	HS_Group_Name_Finished Goods
0.2333 ± 0.0019	foreign_company_size
0.1867 ± 0.0010	HS_Group_Name_Raw Input
0.1276 ± 0.0027	us_company_size
0.1028 ± 0.0012	Trade_Direction_Import
0.0674 ± 0.0004	Delta_Case0_Effective
0.0455 ± 0.0010	Foreign_Country_Region_European Union
0.0440 ± 0.0010	Container_Type_Dry
0.0423 ± 0.0004	Foreign_Country_Region_South Asia
0.0389 ± 0.0006	Foreign_Country_Region_Other East Asia (not China)
0.0382 ± 0.0010	State_Closure_EA_Diff
0.0358 ± 0.0009	HS_Group_Name_Edible with Processing
0.0276 ± 0.0002	HS_Group_Name_Pharma
0.0261 ± 0.0001	Foreign_Country_Region_North America
0.0209 ± 0.0005	HS_Group_Name_Vices
0.0207 ± 0.0006	Container_LCL/FCL_LCL
0.0206 ± 0.0006	Foreign_Country_Region_Southeast Asia
0.0136 ± 0.0003	Foreign_Country_Region_South America
… 14 more …

In [ ]:

explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp

Out[ ]:

	feature	weight	std
0	TCVUSD	1.147259	0.002200
1	Average_Tariff	0.411251	0.001070
2	HS_Group_Name_Finished Goods	0.237294	0.001476
3	foreign_company_size	0.233340	0.000927
4	HS_Group_Name_Raw Input	0.186702	0.000506
5	us_company_size	0.127631	0.001374
6	Trade_Direction_Import	0.102769	0.000616
7	Delta_Case0_Effective	0.067358	0.000208
8	Foreign_Country_Region_European Union	0.045470	0.000501
9	Container_Type_Dry	0.043954	0.000479
10	Foreign_Country_Region_South Asia	0.042261	0.000208
11	Foreign_Country_Region_Other East Asia (not Ch...	0.038893	0.000277
12	State_Closure_EA_Diff	0.038151	0.000488
13	HS_Group_Name_Edible with Processing	0.035820	0.000431
14	HS_Group_Name_Pharma	0.027618	0.000123
15	Foreign_Country_Region_North America	0.026064	0.000071
16	HS_Group_Name_Vices	0.020909	0.000266
17	Container_LCL/FCL_LCL	0.020724	0.000296
18	Foreign_Country_Region_Southeast Asia	0.020648	0.000291
19	Foreign_Country_Region_South America	0.013628	0.000129

For the Train 2019 Test 2019 XGBoost model after tuning, the TCVUSD feature definitely has the most weight (1.147259). Then Average_Tariff (0.411251), HS_Group_Name_Finished Goods (0.237294), Trade_Direction_Import (0.102769), HS_Group_Name_Raw Input (0.186702), foreign_company_size (0.233340), us_company_size (0.127631) and Delta_Case0_Effective (0.067358).

`Test on 2020`

Let's now prepare the 2020 set to fit the model train using the 2019 set by ordinal encoding the company size variables, create dummy variables for the categorical variables, MinMax scaling, converting back to cuDF, converting the data types for modeling and fitting the model.

In [ ]:

X_test1 = df2.drop(['Metric_Tons', 'DateTime_YearWeek'], axis=1)
y_test1 = df2['Metric_Tons']
ce_ord = ce.OrdinalEncoder(cols = ['foreign_company_size', 'us_company_size'])
X_test1 = ce_ord.fit_transform(X_test1)
X_test1 = pd.get_dummies(X_test1, drop_first=True)
X_test1 = pd.DataFrame(mn.transform(X_test1), columns=X_test1.columns)
X_test1 = cudf.DataFrame.from_pandas(X_test1)
X_test1 = X_test1.astype('float32')
y_test1 = y_test1.astype('int32')

best_bayes_model.fit(X_test1, y_test1)

print('\nModel Metrics for XGBoost HPO Train 2019 Test 2020')
y_test_pred = best_bayes_model.predict(X_test1)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_absolute_error(asnumpy(y_test1), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_squared_error(asnumpy(y_test1), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
                           squared=False),
        mean_squared_error(asnumpy(y_test1), asnumpy(y_test_pred),
                           squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
        r2_score(asnumpy(y_test1), asnumpy(y_test_pred))))


Model Metrics for XGBoost HPO Train 2019 Test 2020
MAE train: 3.648, test: 3.078
MSE train: 78.764, test: 47.848
RMSE train: 8.875, test: 6.917
R^2 train: 0.922, test: 0.949

When comparing the metrics from the 2019 training set to the 2020 test set, there are higher MAE, MSE, RMSE but a higher R² with the 2019 set compared to Train 2020 Test 2020.

`Model Explanations`

Now, we can plot the feature importance from the model.

In [ ]:

plot_importance(best_bayes_modeL, max_num_features=15)

`Model Metrics with ELI5`

Now, we can compute the permutation feature importance using the 2020 set, and then get the weights with the defined feature names.

In [ ]:

X1_test1 = pd.DataFrame(X_test1.to_pandas(), columns=X_test1.columns)

perm_importance = PermutationImportance(best_bayes_model,
                                        random_state=seed_value).fit(asnumpy(X_test1),
                                                                     asnumpy(y_test1))
html_obj = eli5.show_weights(perm_importance,
                             feature_names=X1_test1.columns.tolist())
html_obj

Out[ ]:

Weight	Feature
1.2129 ± 0.0016	TCVUSD
0.4724 ± 0.0010	Average_Tariff
0.2595 ± 0.0020	HS_Group_Name_Finished Goods
0.2098 ± 0.0006	Trade_Direction_Import
0.1989 ± 0.0005	HS_Group_Name_Raw Input
0.1586 ± 0.0007	us_company_size
0.1164 ± 0.0006	foreign_company_size
0.0752 ± 0.0005	Delta_Case0_Effective
0.0515 ± 0.0004	Foreign_Country_Region_European Union
0.0502 ± 0.0002	HS_Group_Name_Edible with Processing
0.0461 ± 0.0002	Container_Type_Dry
0.0429 ± 0.0001	Foreign_Country_Region_South Asia
0.0405 ± 0.0002	cases_weekly
0.0401 ± 0.0002	Foreign_Country_Region_Other East Asia (not China)
0.0385 ± 0.0002	State_Closure_EA_Diff
0.0350 ± 0.0002	deaths_weekly
0.0294 ± 0.0001	HS_Group_Name_Pharma
0.0281 ± 0.0002	HS_Group_Name_Vices
0.0272 ± 0.0002	Foreign_Country_Region_Southeast Asia
0.0238 ± 0.0001	Container_LCL/FCL_LCL
… 14 more …

In [ ]:

explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X1_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp

Out[ ]:

	feature	weight	std
0	TCVUSD	1.212947	0.000782
1	Average_Tariff	0.472409	0.000507
2	HS_Group_Name_Finished Goods	0.259453	0.001015
3	Trade_Direction_Import	0.209795	0.000313
4	HS_Group_Name_Raw Input	0.198903	0.000271
5	us_company_size	0.158599	0.000346
6	foreign_company_size	0.116391	0.000299
7	Delta_Case0_Effective	0.075206	0.000244
8	Foreign_Country_Region_European Union	0.051508	0.000195
9	HS_Group_Name_Edible with Processing	0.050237	0.000080
10	Container_Type_Dry	0.046055	0.000112
11	Foreign_Country_Region_South Asia	0.042932	0.000065
12	cases_weekly	0.040503	0.000090
13	Foreign_Country_Region_Other East Asia (not Ch...	0.040080	0.000101
14	State_Closure_EA_Diff	0.038508	0.000082
15	deaths_weekly	0.035038	0.000075
16	HS_Group_Name_Pharma	0.029410	0.000061
17	HS_Group_Name_Vices	0.028064	0.000079
18	Foreign_Country_Region_Southeast Asia	0.027215	0.000096
19	Container_LCL/FCL_LCL	0.023837	0.000061

For the Train 2019 Test 2020 XGBoost model after tuning, the TCVUSD feature definitely has the most weight (1.212947 vs. 1.147259). Then a higher Average_Tariff (0.472409 vs. 0.411251), HS_Group_Name_Finished Goods (0.259453 vs. 0.237294), Trade_Direction_Import (0.209795 vs. 0.102769), similar HS_Group_Name_Raw Input (0.198903 vs. 0.186702), lower foreign_company_size (0.116391 vs. 0.233340), higher us_company_size (0.158599 vs. 0.127631) and similar Delta_Case0_Effective (0.075206 vs. 0.067358).

`Train 2018-19: 100 Trials Train/Test`

Using the same number of trials, hyperparameters and optimization algorithm that was utilized for the 2020 & 2019 tuning, let's now define a different out_file and .pkl file and run the search for the 2018 - 2019 data.

In [ ]:

out_file = '/content/drive/MyDrive/MaritimeTrade/Models/ML/XGBoost/Hyperopt/trialOptions/xgbRapids_HPO_Train1819_100_GPU.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)

writer.writerow(['loss', 'params', 'iteration', 'train_time'])
of_connection.close()

global ITERATION
ITERATION = 0
bayesOpt_trials = Trials()

In [ ]:

if os.path.isfile('xgbRapids_Hyperopt_100_GPU_Train1819.pkl'):
    bayesOpt_trials = joblib.load('xgbRapids_Hyperopt_100_GPU_Train1819.pkl')
    best_param = fmin(xgb_hpo, xgb_tune_kwargs, algo=tpe.suggest,
                      max_evals=NUM_EVAL, trials=bayesOpt_trials)
else:
    best_param = fmin(xgb_hpo, xgb_tune_kwargs, algo=tpe.suggest,
                      max_evals=NUM_EVAL, trials=bayesOpt_trials)

100%|██████████| 100/100 [6:43:36<00:00, 290.77s/it, best loss: 4.711881537882251]

Let's now access the results in the trialOptions directory by reading into a pandas.Dataframe sorted with the best scores on top and then resetting the index for slicing. Then we can convert the trial parameters from a string to a dictionary for later use. Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then saving as a .pkl file. Then we can evaluate the model metrics for both the train and the test sets using the MAE, MSE, RMSE and the (R²).

In [ ]:

results = pd.read_csv('xgbRapids_HPO_Train1819_100_GPU.csv')
results.sort_values('loss', ascending=True, inplace=True)
results.reset_index(inplace=True, drop=True)
results.to_csv('xgbRapids_HPO_Train1819_100_GPU.csv')

ast.literal_eval(results.loc[0, 'params'])
best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()

best_bayes_model = XGBRegressor(objective='reg:squarederror',
                                booster='gbtree',
                                tree_method='gpu_hist',
                                scale_pos_weight=1,
                                use_label_encoder=False,
                                random_state=seed_value,
                                verbosity=0,
                                **best_bayes_params)

best_bayes_model.fit(X_train, y_train)

Pkl_Filename = 'xgbRapids_Hyperopt_100_GPU_Train1819trials.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_bayes_model, file)

print('\nModel Metrics for XGBoost HPO Train 2018-19 Test 2018-19 100 GPU trials')
y_train_pred = best_bayes_model.predict(X_train)
y_test_pred = best_bayes_model.predict(X_test)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_absolute_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
                           squared=False),
        mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred),
                           squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
        r2_score(asnumpy(y_test), asnumpy(y_test_pred))))


Model Metrics for XGBoost HPO Train 2018-19 Test 2018-19 100 GPU trials
MAE train: 3.575, test: 4.712
MSE train: 63.675, test: 126.834
RMSE train: 7.980, test: 11.262
R^2 train: 0.905, test: 0.812

When comparing the model metrics from the hyperparameter search to the baseline model metrics, all of the MAE, MSE and RMSE for the train and test sets are considerably lower and the R² is considerably higher.

We can also evaluate the MSE on the test set and determine when this was achieved:

In [ ]:

print('The best model from Bayes optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(asnumpy(y_test),
                                                                                                            asnumpy(y_test_pred))))
print('This was achieved after {} search iterations'.format(results.loc[0, 'iteration']))

The best model from Bayes optimization scores 126.83448 MSE on the test set.
This was achieved after 21 search iterations

`Results from the Hyperparameter Search`

Let's now create a new pandas.Dataframe where the parameters examined during the search are stored and the results of each parameter are stored as a different column. Then we can convert the data types for graphing. Now, we can now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.

In [ ]:

bayes_params = pd.DataFrame(columns=list(ast.literal_eval(results.loc[0,
                                                                      'params']).keys()),
                            index=list(range(len(results))))

for i, params in enumerate(results['params']):
    bayes_params.loc[i,:] = list(ast.literal_eval(params).values())

bayes_params['loss'] = results['loss']
bayes_params['iteration'] = results['iteration']
bayes_params['colsample_bylevel'] = bayes_params['colsample_bylevel'].astype('float64')
bayes_params['colsample_bytree'] = bayes_params['colsample_bytree'].astype('float64')
bayes_params['gamma'] = bayes_params['gamma'].astype('float64')
bayes_params['learning_rate'] = bayes_params['learning_rate'].astype('float64')
bayes_params['reg_alpha'] = bayes_params['reg_alpha'].astype('float64')
bayes_params['reg_lambda'] = bayes_params['reg_lambda'].astype('float64')
bayes_params['subsample'] = bayes_params['subsample'].astype('float64')

fig, axs = plt.subplots(1, 4, figsize=(20,5))
i = 0
for i, hpo in enumerate(['learning_rate', 'gamma', 'colsample_bylevel',
                         'colsample_bytree']):
    sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

learning_rate decreased over the trials while gamma, colsample_bylevel and colsample_bytree increased. For 2019, learning_rate decreased over the trials while colsample_bylevel and colsample_bytree increased while gamma did not show any trend. For 2020, learning_rate, and colsample_bylevel decreased over the trials while gamma and colsample_bytree increased.

Next, we can examine if there were any trends regarding the regularization parameters over the trials.

In [ ]:

fig, axs = plt.subplots(1, 2, figsize=(14,6))
i = 0
for i, hpo in enumerate(['reg_alpha', 'reg_lambda']):
    sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

For 2018 - 2019 and only 2019, reg_lambda decreased while reg_alpha did not show any trend over the trial iterations.For 2020, reg_lambda increased while reg_alpha did not show any trend over the trial iterations.

`Model Explanations`

Now, we can plot the feature importance fro the best model.

In [ ]:

plot_importance(best_bayes_model, max_num_features=15)

Compared to Train 2020 Test 2020, lower TCVUSD, Average_Tariff, us_company_size. There is different order without cases_weekly and deaths_weekly, higher US_Unemployment_Rate and lower foreign_company_size.

`Model Metrics with ELI5`

Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.

In [ ]:

X_test1 = pd.DataFrame(X_test.to_pandas(), columns=X_test.columns)

perm_importance = PermutationImportance(best_bayes_model,
                                        random_state=seed_value).fit(asnumpy(X_test),
                                                                     asnumpy(y_test))
html_obj = eli5.show_weights(perm_importance,
                             feature_names=X_test1.columns.tolist())
html_obj

Out[ ]:

Weight	Feature
1.1169 ± 0.0058	TCVUSD
0.4164 ± 0.0010	Average_Tariff
0.2734 ± 0.0012	HS_Group_Name_Finished Goods
0.1731 ± 0.0008	Trade_Direction_Import
0.1403 ± 0.0008	HS_Group_Name_Raw Input
0.1265 ± 0.0011	US_company_size
0.1129 ± 0.0013	foreign_company_size
0.0740 ± 0.0004	State_Closure_EA_Diff
0.0737 ± 0.0008	Container_Type_Dry_True
0.0490 ± 0.0003	Foreign_Country_Region_Other East Asia (not China)
0.0480 ± 0.0009	Foreign_Country_Region_European Union
0.0427 ± 0.0010	US_Port_Coastal_Region_Northeast
0.0417 ± 0.0004	HS_Group_Name_Edible with Processing
0.0369 ± 0.0003	Foreign_Country_Region_North America
0.0344 ± 0.0005	Foreign_Country_Region_South Asia
0.0314 ± 0.0007	Container_LCL/FCL_LCL
0.0302 ± 0.0003	Foreign_Country_Region_Southeast Asia
0.0263 ± 0.0002	HS_Group_Name_Pharma
0.0257 ± 0.0004	US_Port_Coastal_Region_Southeast
0.0244 ± 0.0008	US_Unemployment_Rate
… 13 more …

In [ ]:

explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)

exp

Out[ ]:

	feature	weight	std
0	TCVUSD	1.116930	0.002910
1	Average_Tariff	0.416370	0.000507
2	HS_Group_Name_Finished Goods	0.273364	0.000594
3	Trade_Direction_Import	0.173100	0.000393
4	HS_Group_Name_Raw Input	0.140252	0.000388
5	US_company_size	0.126541	0.000538
6	foreign_company_size	0.112855	0.000634
7	State_Closure_EA_Diff	0.074037	0.000201
8	Container_Type_Dry_True	0.073654	0.000416
9	Foreign_Country_Region_Other East Asia (not Ch...	0.049034	0.000137
10	Foreign_Country_Region_European Union	0.047968	0.000440
11	US_Port_Coastal_Region_Northeast	0.042732	0.000504
12	HS_Group_Name_Edible with Processing	0.041667	0.000199
13	Foreign_Country_Region_North America	0.036859	0.000136
14	Foreign_Country_Region_South Asia	0.034373	0.000249
15	Container_LCL/FCL_LCL	0.031374	0.000372
16	Foreign_Country_Region_Southeast Asia	0.030170	0.000149
17	HS_Group_Name_Pharma	0.026260	0.000078
18	US_Port_Coastal_Region_Southeast	0.025736	0.000196
19	US_Unemployment_Rate	0.024450	0.000403

For the Train 2018 - 2019 Test 2018 - 2019 XGBoost model after tuning, the TCVUSD feature definitely has the most weight (1.116930). Then Average_Tariff (0.416370), HS_Group_Name_Finished Goods (0.273364), Trade_Direction_Import (0.173100), HS_Group_Name_Raw Input (0.140252), foreign_company_size (0.112855) and us_company_size (0.126541).

`Test on 2020`

We can now prepare the 2020 set to fit the model trained on 2018 - 2019 data. This follows the same process of encoding the foreign_company_size and US_company_size features using ordinal (ranking), create dummy variables for the categorical variables, scaling the data using the MinMaxScaler, converting back to a cuDF dataframe and converting the data types before modeling this test set.

In [ ]:

X_test1 = df2.drop(['Metric_Tons', 'DateTime_YearWeek'], axis=1)
y_test1 = df2['Metric_Tons']
ce_ord = ce.OrdinalEncoder(cols = ['foreign_company_size', 'US_company_size'])
X_test1 = ce_ord.fit_transform(X_test1)
X_test1 = pd.get_dummies(X_test1, drop_first=True)
X_test1 = pd.DataFrame(mn.transform(X_test1), columns=X_test1.columns)
X_test1 = cudf.DataFrame.from_pandas(X_test1)
X_test1 = X_test1.astype('float32')
y_test1 = y_test1.astype('int32')

best_bayes_model.fit(X_test1, y_test1)

print('\nModel Metrics for XGBoost HPO Train 2018-19 Test 2020')
y_test_pred = best_bayes_model.predict(X_test1)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_absolute_error(asnumpy(y_test1), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_squared_error(asnumpy(y_test1), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
                           squared=False),
        mean_squared_error(asnumpy(y_test1), asnumpy(y_test_pred),
                           squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
        r2_score(asnumpy(y_test1), asnumpy(y_test_pred)))))


Model Metrics for XGBoost HPO Train 2018-19 Test 2020
MAE train: 3.575, test: 2.730
MSE train: 63.675, test: 31.477
RMSE train: 7.980, test: 5.610
R^2 train: 0.905, test: 0.949

When comparing the metrics from the 2018 - 2019 training set to the 2020 test set, there are even lower MAE, MSE, RMSE compared to 2018 - 2019, but a higher R² with the 2020 set compared to the 2018 - 2019 test set.

`Model Explanations`

Now, we can plot the feature importance from the model.

In [ ]:

plot_importance(best_bayes_model, max_num_features=15)

Compared to Train 2018 - 2019 Test 2018 - 2019, there is a lower TCVUSD, lower Average_Tariff, lower us_company_size. There is different order without cases_weekly and deaths_weekly, lower US_Unemployment_Rate and lower foreign_company_size.

`Model Metrics with ELI5`

Now, we can compute the permutation feature importance using the 2020 set, and then get the weights with the defined feature names.

In [ ]:

X1_test1 = pd.DataFrame(X_test1.to_pandas(), columns=X_test1.columns)

perm_importance = PermutationImportance(best_bayes_model,
                                        random_state=seed_value).fit(asnumpy(X_test1),
                                                                     asnumpy(y_test1))
html_obj = eli5.show_weights(perm_importance,
                             feature_names=X1_test1.columns.tolist())
html_obj

Out[ ]:

Weight	Feature
1.2106 ± 0.0007	TCVUSD
0.4658 ± 0.0013	Average_Tariff
0.3009 ± 0.0006	HS_Group_Name_Finished Goods
0.1939 ± 0.0005	Trade_Direction_Import
0.1577 ± 0.0004	US_company_size
0.1493 ± 0.0005	HS_Group_Name_Raw Input
0.1379 ± 0.0004	foreign_company_size
0.0737 ± 0.0005	Container_Type_Dry_True
0.0691 ± 0.0002	State_Closure_EA_Diff
0.0633 ± 0.0002	HS_Group_Name_Edible with Processing
0.0578 ± 0.0002	cases_weekly
0.0554 ± 0.0003	Foreign_Country_Region_European Union
0.0511 ± 0.0002	deaths_weekly
0.0453 ± 0.0003	Foreign_Country_Region_Other East Asia (not China)
0.0397 ± 0.0002	Foreign_Country_Region_South Asia
0.0341 ± 0.0002	Container_LCL/FCL_LCL
0.0331 ± 0.0002	Foreign_Country_Region_Southeast Asia
0.0323 ± 0.0001	HS_Group_Name_Pharma
0.0292 ± 0.0001	Foreign_Country_Region_North America
0.0278 ± 0.0003	US_Port_Coastal_Region_Southwest
… 13 more …

In [ ]:

explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X1_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp

Out[ ]:

	feature	weight	std
0	TCVUSD	1.210602	0.000359
1	Average_Tariff	0.465814	0.000629
2	HS_Group_Name_Finished Goods	0.300933	0.000300
3	Trade_Direction_Import	0.193907	0.000230
4	US_company_size	0.157700	0.000211
5	HS_Group_Name_Raw Input	0.149271	0.000252
6	foreign_company_size	0.137879	0.000209
7	Container_Type_Dry_True	0.073709	0.000245
8	State_Closure_EA_Diff	0.069101	0.000087
9	HS_Group_Name_Edible with Processing	0.063282	0.000121
10	cases_weekly	0.057753	0.000077
11	Foreign_Country_Region_European Union	0.055392	0.000164
12	deaths_weekly	0.051050	0.000079
13	Foreign_Country_Region_Other East Asia (not Ch...	0.045338	0.000127
14	Foreign_Country_Region_South Asia	0.039698	0.000083
15	Container_LCL/FCL_LCL	0.034075	0.000124
16	Foreign_Country_Region_Southeast Asia	0.033094	0.000105
17	HS_Group_Name_Pharma	0.032276	0.000071
18	Foreign_Country_Region_North America	0.029234	0.000067
19	US_Port_Coastal_Region_Southwest	0.027796	0.000143

For the Train 2020 Test 2020 XGBoost model after tuning, the TCVUSD feature definitely has the most weight (1.210602 vs. 1.131809). Then higher Average_Tariff (0.465814 vs. 0.413096), similar HS_Group_Name_Finished Goods (0.273364 vs. 0.276361), higher Trade_Direction_Import (0.193907 vs. 0.174613), lower HS_Group_Name_Raw Input (0.149271 vs. 0.173788), higher foreign_company_size (0.137879 vs. 0.105692) and higher us_company_size (0.157700 vs. 0.103508).

`Train 2010 - 2019: 100 Trials Train/Test`¶

Using the same number of trials, hyperparameters and optimization algorithm that was utilized for the 2020, 2019 & 2018 - 2019 tuning, let's now define a different out_file and .pkl file and run the search for the 2010 - 2019 data.

In [ ]:

out_file = '/notebooks/MaritimeTrade/Models/ML/XGBoost/Hyperopt/trialOptions/xgbRapids_HPO_Train1019_100_GPU.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)

writer.writerow(['loss', 'params', 'iteration', 'train_time'])
of_connection.close()

global ITERATION
ITERATION = 0
bayesOpt_trials = Trials()

In [ ]:

if os.path.isfile('xgbRapids_Hyperopt_100_GPU_Train1019.pkl'):
    bayesOpt_trials = joblib.load('xgbRapids_Hyperopt_100_GPU_Train1019.pkl')
    best_param = fmin(xgb_hpo, xgb_tune_kwargs, algo=tpe.suggest,
                      max_evals=NUM_EVAL, trials=bayesOpt_trials)
else:
    best_param = fmin(xgb_hpo, xgb_tune_kwargs, algo=tpe.suggest,
                      max_evals=NUM_EVAL, trials=bayesOpt_trials)

96%|█████████▌| 96/100 [7:21:18<13:37, 204.32s/trial, best loss: 7.793500478559527]

Let's now access the results in the trialOptions directory by reading into a pandas.Dataframe sorted with the best scores on top and then resetting the index for slicing. Then we can convert the trial parameters from a string to a dictionary for later use. Now, let's extract the hyperparameters from the model with the lowest loss during the search, recreate the best model, fit using the training data and then saving as a .pkl file. Then we can evaluate the model metrics for both the train and the test sets using the MAE, MSE, RMSE and the (R²).

In [ ]:

results = pd.read_csv('xgbRapids_HPO_Train1019_100_GPU.csv')
results.sort_values('loss', ascending=True, inplace=True)
results.reset_index(inplace=True, drop=True)
results.to_csv('xgbRapids_HPO_Train1019_100_GPU.csv')

ast.literal_eval(results.loc[0, 'params'])
best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()

best_bayes_model = XGBRegressor(objective='reg:squarederror',
                                booster='gbtree',
                                tree_method='gpu_hist',
                                scale_pos_weight=1,
                                use_label_encoder=False,
                                random_state=seed_value,
                                verbosity=0,
                                **best_bayes_params)

best_bayes_model.fit(X_train, y_train)

Pkl_Filename = 'xgbRapids_Hyperopt_100_GPU_Train1019trials.pkl'

with open(Pkl_Filename, 'wb') as file:
    pickle.dump(best_bayes_model, file)

print('\nModel Metrics for XGBoost HPO Train 2010-19 100 GPU trials')
y_train_pred = best_bayes_model.predict(X_train)
y_test_pred = best_bayes_model.predict(X_test)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_absolute_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
                           squared=False),
        mean_squared_error(asnumpy(y_test), asnumpy(y_test_pred),
                           squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
        r2_score(asnumpy(y_test), asnumpy(y_test_pred))))


Model Metrics for XGBoost HPO Train 2010-19 100 GPU trials
MAE train: 5.647, test: 8.032
MSE train: 155.893, test: 304.396
RMSE train: 12.486, test: 17.447
R^2 train: 0.780, test: 0.571

When comparing the model metrics from the hyperparameter search to the baseline model metrics, all of the MAE, MSE and RMSE for the train set are considerably lower, but this is not the same for the test set. The R² did improve for both the train and the test sets compared to the baseline model, but the model from the search is overfit if the R² between the train and the test sets is 21% different.

We can also evaluate the MSE on the test set and determine when this was achieved:

In [ ]:

print('The best model from Bayes optimization scores {:.5f} MSE on the test set.'.format(mean_squared_error(asnumpy(y_test),
                                                                                                            asnumpy(y_test_pred))))
print('This was achieved after {} search iterations'.format(results.loc[0, 'iteration']))

The best model from Bayes optimization scores 293.21167 MSE on the test set.
This was achieved after 72 search iterations

`Results from the Hyperparameter Search`

Let's now create a new pandas.Dataframe where the parameters examined during the search are stored and the results of each parameter are stored as a different column. Then we can convert the data types for graphing. Now, we can now examine some of these quantitative hyperparameters over the search duration and see if any trends can be observed.

In [ ]:

bayes_params = pd.DataFrame(columns=list(ast.literal_eval(results.loc[0,
                                                                      'params']).keys()),
                            index=list(range(len(results))))

for i, params in enumerate(results['params']):
    bayes_params.loc[i,:] = list(ast.literal_eval(params).values())

bayes_params['loss'] = results['loss']
bayes_params['iteration'] = results['iteration']
bayes_params['colsample_bylevel'] = bayes_params['colsample_bylevel'].astype('float64')
bayes_params['colsample_bytree'] = bayes_params['colsample_bytree'].astype('float64')
bayes_params['gamma'] = bayes_params['gamma'].astype('float64')
bayes_params['learning_rate'] = bayes_params['learning_rate'].astype('float64')
bayes_params['reg_alpha'] = bayes_params['reg_alpha'].astype('float64')
bayes_params['reg_lambda'] = bayes_params['reg_lambda'].astype('float64')
bayes_params['subsample'] = bayes_params['subsample'].astype('float64')

fig, axs = plt.subplots(1, 4, figsize=(20,5))
i = 0
for i, hpo in enumerate(['learning_rate', 'gamma', 'colsample_bylevel',
                         'colsample_bytree']):
    sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

For 2010 - 2019 and 2018 - 2019,learning_rate decreased over the trials while gamma, colsample_bylevel and colsample_bytree increased. For 2019, learning_rate decreased over the trials while colsample_bylevel and colsample_bytree increased while gamma did not show any trend. For 2020, learning_rate, and colsample_bylevel decreased over the trials while gamma and colsample_bytree increased.

Next, we can examine if there were any trends regarding the regularization parameters over the trials.

In [ ]:

fig, axs = plt.subplots(1, 2, figsize=(14,6))
i = 0
for i, hpo in enumerate(['reg_alpha', 'reg_lambda']):
    sns.regplot('iteration', hpo, data=bayes_params, ax=axs[i])
    axs[i].set(xlabel='Iteration', ylabel='{}'.format(hpo),
               title='{} over Trials'.format(hpo))
plt.tight_layout()
plt.show()

For 2010 - 2019, 2018 - 2019 and only 2019, reg_lambda decreased while reg_alpha did not show any trend over the trial iterations. For 2020, reg_lambda increased while reg_alpha did not show any trend over the trial iterations.

`Model Explanations`

Now, we can plot the feature importance for the best model.

In [ ]:

plot_importance(best_bayes_model, max_num_features=15)

Compared to Train 2020 Test 2020, there is a lower value for US_Unemployment_Rate, but it is the most important feature for this set. There is a lower Average_Tariff, lower TCVUSD, lower us_company_size. There is different order without cases_weekly an lower foreign_company_size.

`Model Metrics with ELI5`

Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.

In [ ]:

X_test1 = pd.DataFrame(X_test.to_pandas(), columns=X_test.columns)

perm_importance = PermutationImportance(best_bayes_model,
                                        random_state=seed_value).fit(asnumpy(X_test),
                                                                     asnumpy(y_test))
html_obj = eli5.show_weights(perm_importance,
                             feature_names=X_test1.columns.tolist())
html_obj

Out[ ]:

Weight	Feature
0.8770 ± 0.0014	TCVUSD
0.3610 ± 0.0019	Average_Tariff
0.2371 ± 0.0007	HS_Group_Name_Finished Goods
0.1270 ± 0.0005	Trade_Direction_Import
0.1089 ± 0.0004	HS_Group_Name_Raw Input
0.0960 ± 0.0007	US_company_size
0.0817 ± 0.0006	foreign_company_size
0.0753 ± 0.0006	Container_Type_Dry_True
0.0504 ± 0.0006	State_Closure_EA_Diff
0.0456 ± 0.0004	Foreign_Country_Region_Other East Asia (not China)
0.0441 ± 0.0001	Foreign_Country_Region_European Union
0.0441 ± 0.0005	Container_LCL/FCL_LCL
0.0417 ± 0.0003	Year
0.0383 ± 0.0002	HS_Group_Name_Edible with Processing
0.0353 ± 0.0003	US_Unemployment_Rate
0.0276 ± 0.0001	Foreign_Country_Region_Southeast Asia
0.0251 ± 0.0002	Foreign_Country_Region_South Asia
0.0243 ± 0.0002	US_Port_Coastal_Region_Northeast
0.0215 ± 0.0001	HS_Group_Name_Pharma
0.0201 ± 0.0002	Foreign_Country_Region_North America
… 14 more …

In [ ]:

explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp

Out[ ]:

	feature	weight	std
0	TCVUSD	0.877037	0.000702
1	Average_Tariff	0.361002	0.000955
2	HS_Group_Name_Finished Goods	0.237089	0.000363
3	Trade_Direction_Import	0.126989	0.000258
4	HS_Group_Name_Raw Input	0.108915	0.000177
5	US_company_size	0.095954	0.000370
6	foreign_company_size	0.081719	0.000288
7	Container_Type_Dry_True	0.075284	0.000292
8	State_Closure_EA_Diff	0.050433	0.000294
9	Foreign_Country_Region_Other East Asia (not Ch...	0.045577	0.000175
10	Foreign_Country_Region_European Union	0.044080	0.000074
11	Container_LCL/FCL_LCL	0.044076	0.000245
12	Year	0.041686	0.000125
13	HS_Group_Name_Edible with Processing	0.038306	0.000110
14	US_Unemployment_Rate	0.035306	0.000149
15	Foreign_Country_Region_Southeast Asia	0.027623	0.000050
16	Foreign_Country_Region_South Asia	0.025103	0.000119
17	US_Port_Coastal_Region_Northeast	0.024347	0.000112
18	HS_Group_Name_Pharma	0.021467	0.000054
19	Foreign_Country_Region_North America	0.020125	0.000077

For the Train 2010 - 2019 Test 2010 - 2019 XGBoost model after tuning, the TCVUSD feature definitely has the most weight (0.877037). Then Average_Tariff (0.361002), HS_Group_Name_Finished Goods (0.237089), Trade_Direction_Import (0.126989), HS_Group_Name_Raw Input (0.108915), foreign_company_size (0.081719) and us_company_size (0.095954).

`Test on 2020`

Let's now prepare the 2020 data to fit the model trained using the 2010 - 2019 data by ordinal encoding the US and foreign company size variables, creatinng dummy variables for the categorical variables, using the MinMaxScaler for the features, converting the data back to cuDF.Dataframe, convert the data types for modeling and fitting the model.

In [ ]:

X_test1 = df2.drop(['Metric_Tons', 'DateTime_YearWeek'], axis=1)
y_test1 = df2['Metric_Tons']
ce_ord = ce.OrdinalEncoder(cols = ['foreign_company_size', 'us_company_size'])
X_test1 = ce_ord.fit_transform(X_test1)
X_test1 = pd.get_dummies(X_test1, drop_first=True)
X_test1 = pd.DataFrame(mn.transform(X_test1), columns=X_test1.columns)
X_test1 = cudf.DataFrame.from_pandas(X_test1)
X_test1 = X_test1.astype('float32')
y_test1 = y_test1.astype('int32')

best_bayes_model.fit(X_test1, y_test1)

print('\nModel Metrics for XGBoost HPO Train 2010-19 Test 2020')
y_test_pred = best_bayes_model.predict(X_test1)

print('MAE train: %.3f, test: %.3f' % (
        mean_absolute_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_absolute_error(asnumpy(y_test1), asnumpy(y_test_pred))))
print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred)),
        mean_squared_error(asnumpy(y_test1), asnumpy(y_test_pred))))
print('RMSE train: %.3f, test: %.3f' % (
        mean_squared_error(asnumpy(y_train), asnumpy(y_train_pred),
                           squared=False),
        mean_squared_error(asnumpy(y_test1), asnumpy(y_test_pred),
                           squared=False)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(asnumpy(y_train), asnumpy(y_train_pred)),
        r2_score(asnumpy(y_test1), asnumpy(y_test_pred))))


Model Metrics for XGBoost HPO Train 2010-19 Test 2020
MAE train: 5.648, test: 4.384
MSE train: 155.898, test: 103.163
RMSE train: 12.486, test: 10.157
R^2 train: 0.780, test: 0.891

When comparing the metrics from the 2010 - 2019 training set to the 2020 test set, there are even lower MAE, MSE, RMSE compared to 2018 - 2019, but a higher R² with the 2010 - 2019 set compared to the 2010 - 2019 test set.

`Model Explanations`

Now, we can plot the feature importance from the model.

In [ ]:

plot_importance(best_bayes_model, max_num_features=15)

In [ ]:

plot_importance(best_bayes_model, max_num_features=15)

Compared to Train 2010 - 2019 Test 2010 - 2019, there is a significantly lower value for US_Unemployment_Rate, but TCVUSD is the most important feature for this set but with a lower value. There is a lower Average_Tariff, lower us_company_size. There is different order without cases_weekly and lower foreign_company_size.

`Model Metrics with ELI5`

Now, we can compute the permutation feature importance using the test set, and then get the weights with the defined feature names.

In [ ]:

X_test1 = pd.DataFrame(X_test.to_pandas(), columns=X_test.columns)

perm_importance = PermutationImportance(best_bayes_model,
                                        random_state=seed_value).fit(asnumpy(X_test1),
                                                                     asnumpy(y_test1))
html_obj = eli5.show_weights(perm_importance,
                             feature_names=X1_test1.columns.tolist())
html_obj

Out[ ]:

Weight	Feature
1.1164 ± 0.0015	TCVUSD
0.4154 ± 0.0011	Average_Tariff
0.2273 ± 0.0020	HS_Group_Name_Finished Goods
0.1916 ± 0.0005	HS_Group_Name_Raw Input
0.1689 ± 0.0005	Trade_Direction_Import
0.1182 ± 0.0005	us_company_size
0.1037 ± 0.0005	foreign_company_size
0.0681 ± 0.0004	Delta_Case0_Effective
0.0435 ± 0.0002	Container_Type_Dry
0.0409 ± 0.0002	HS_Group_Name_Edible with Processing
0.0395 ± 0.0001	Foreign_Country_Region_South Asia
0.0388 ± 0.0004	Foreign_Country_Region_European Union
0.0343 ± 0.0002	Foreign_Country_Region_Other East Asia (not China)
0.0317 ± 0.0001	State_Closure_EA_Diff
0.0255 ± 0.0002	Foreign_Country_Region_Southeast Asia
0.0253 ± 0.0001	HS_Group_Name_Vices
0.0241 ± 0.0001	HS_Group_Name_Pharma
0.0236 ± 0.0001	cases_weekly
0.0206 ± 0.0002	Foreign_Country_Region_North America
0.0192 ± 0.0001	Container_LCL/FCL_LCL
… 14 more …

In [ ]:

explanation = eli5.explain_weights_sklearn(perm_importance,
                                           feature_names=X1_test1.columns.tolist())
exp = format_as_dataframe(explanation)
exp

Out[ ]:

	feature	weight	std
0	TCVUSD	1.116358	0.000743
1	Average_Tariff	0.415390	0.000558
2	HS_Group_Name_Finished Goods	0.227251	0.000982
3	HS_Group_Name_Raw Input	0.191584	0.000268
4	Trade_Direction_Import	0.168872	0.000246
5	us_company_size	0.118198	0.000271
6	foreign_company_size	0.103731	0.000242
7	Delta_Case0_Effective	0.068130	0.000186
8	Container_Type_Dry	0.043484	0.000111
9	HS_Group_Name_Edible with Processing	0.040910	0.000113
10	Foreign_Country_Region_South Asia	0.039511	0.000065
11	Foreign_Country_Region_European Union	0.038780	0.000212
12	Foreign_Country_Region_Other East Asia (not Ch...	0.034263	0.000100
13	State_Closure_EA_Diff	0.031747	0.000065
14	Foreign_Country_Region_Southeast Asia	0.025488	0.000113
15	HS_Group_Name_Vices	0.025318	0.000063
16	HS_Group_Name_Pharma	0.024051	0.000044
17	cases_weekly	0.023588	0.000072
18	Foreign_Country_Region_North America	0.020591	0.000076
19	Container_LCL/FCL_LCL	0.019173	0.000069

For the Train 2010 - 2019 Test 2010 - 2019 XGBoost model after tuning, the TCVUSD feature definitely has the most weight (1.116358 vs. 0.877037). Then higher Average_Tariff (0.415390 vs. 0.361002), similar HS_Group_Name_Finished Goods (0.227251 vs .0.237089), higher Trade_Direction_Import (0.168872 vs. 0.126989), higher HS_Group_Name_Raw Input (0.191584 vs. 0.108915), higher foreign_company_size (0.103731 vs. 0.081719) and us_company_size (0.118198 vs. 0.095954).

`Multi-Layer Perceptron (MLP) Regression`

The MLP can be utilized for regression problems as demonstrated in Multilayer perceptrons for classification and regression] by Fionn Murtagh in 1991. It is derived from the perceptron invented in the 1940s and implemented by Frank Rosenblatt in the 1950s with The Perceptron — A Perceiving and Recognizing Automaton. The MLP is a artifical neural network connected by neurons containing an input layer, the hidden layer(s) and the output layer with a single output neuron for regression. The data is used as the input to the network which is fed to the hidden layers one layer at a time generating weights (connections to the next neuron) and biases (threshod value needed to obtain the output) till the output layer. The output is then compared to the expected output, and the difference is the error. This error is then propagated back through the network, layer by layer where the weights are updated depending on the effect to the error called the backpropagation algorithm. This is repreated for all of the samples in the training data where one cycle of updating the network is an epoch.

The notebooks can be found here. Let's first install the needed packages and examine the versions of tensorflow, keras as well as the CUDA and GPU components available.

In [ ]:

!pip install category_encoders
!pip install tensorflow
import tensorflow as tf
print('TensorFlow version: {}'.format(tf.__version__))
print('Eager execution is: {}'.format(tf.executing_eagerly()))
print('Keras version: {}'.format(tf.keras.__version__))
print('Num GPUs Available:', len(tf.config.list_physical_devices('GPU')))
print('\n')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting category_encoders
  Downloading category_encoders-2.5.1.post0-py2.py3-none-any.whl (72 kB)
     |████████████████████████████████| 72 kB 156 kB/s 
Requirement already satisfied: numpy>=1.14.0 in /usr/local/lib/python3.8/dist-packages (from category_encoders) (1.21.6)
Requirement already satisfied: scipy>=1.0.0 in /usr/local/lib/python3.8/dist-packages (from category_encoders) (1.7.3)
Requirement already satisfied: statsmodels>=0.9.0 in /usr/local/lib/python3.8/dist-packages (from category_encoders) (0.12.2)
Requirement already satisfied: pandas>=1.0.5 in /usr/local/lib/python3.8/dist-packages (from category_encoders) (1.3.5)
Requirement already satisfied: scikit-learn>=0.20.0 in /usr/local/lib/python3.8/dist-packages (from category_encoders) (1.0.2)
Requirement already satisfied: patsy>=0.5.1 in /usr/local/lib/python3.8/dist-packages (from category_encoders) (0.5.3)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.8/dist-packages (from pandas>=1.0.5->category_encoders) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.8/dist-packages (from pandas>=1.0.5->category_encoders) (2022.6)
Requirement already satisfied: six in /usr/local/lib/python3.8/dist-packages (from patsy>=0.5.1->category_encoders) (1.15.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.8/dist-packages (from scikit-learn>=0.20.0->category_encoders) (1.2.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.8/dist-packages (from scikit-learn>=0.20.0->category_encoders) (3.1.0)
Installing collected packages: category-encoders
Successfully installed category-encoders-2.5.1.post0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: Tensorflow in /usr/local/lib/python3.8/dist-packages (2.9.2)
Requirement already satisfied: setuptools in /usr/local/lib/python3.8/dist-packages (from Tensorflow) (57.4.0)
Requirement already satisfied: opt-einsum>=2.3.2 in /usr/local/lib/python3.8/dist-packages (from Tensorflow) (3.3.0)
Requirement already satisfied: typing-extensions>=3.6.6 in /usr/local/lib/python3.8/dist-packages (from Tensorflow) (4.4.0)
Requirement already satisfied: flatbuffers<2,>=1.12 in /usr/local/lib/python3.8/dist-packages (from Tensorflow) (1.12)
Requirement already satisfied: libclang>=13.0.0 in /usr/local/lib/python3.8/dist-packages (from Tensorflow) (14.0.6)
Requirement already satisfied: absl-py>=1.0.0 in /usr/local/lib/python3.8/dist-packages (from Tensorflow) (1.3.0)
Requirement already satisfied: keras-preprocessing>=1.1.1 in /usr/local/lib/python3.8/dist-packages (from Tensorflow) (1.1.2)
Requirement already satisfied: grpcio<2.0,>=1.24.3 in /usr/local/lib/python3.8/dist-packages (from Tensorflow) (1.51.1)
Requirement already satisfied: gast<=0.4.0,>=0.2.1 in /usr/local/lib/python3.8/dist-packages (from Tensorflow) (0.4.0)
Requirement already satisfied: tensorflow-io-gcs-filesystem>=0.23.1 in /usr/local/lib/python3.8/dist-packages (from Tensorflow) (0.28.0)
Requirement already satisfied: keras<2.10.0,>=2.9.0rc0 in /usr/local/lib/python3.8/dist-packages (from Tensorflow) (2.9.0)
Requirement already satisfied: google-pasta>=0.1.1 in /usr/local/lib/python3.8/dist-packages (from Tensorflow) (0.2.0)
Requirement already satisfied: termcolor>=1.1.0 in /usr/local/lib/python3.8/dist-packages (from Tensorflow) (2.1.1)
Requirement already satisfied: protobuf<3.20,>=3.9.2 in /usr/local/lib/python3.8/dist-packages (from Tensorflow) (3.19.6)
Requirement already satisfied: numpy>=1.20 in /usr/local/lib/python3.8/dist-packages (from Tensorflow) (1.21.6)
Requirement already satisfied: wrapt>=1.11.0 in /usr/local/lib/python3.8/dist-packages (from Tensorflow) (1.14.1)
Requirement already satisfied: tensorflow-estimator<2.10.0,>=2.9.0rc0 in /usr/local/lib/python3.8/dist-packages (from Tensorflow) (2.9.0)
Requirement already satisfied: tensorboard<2.10,>=2.9 in /usr/local/lib/python3.8/dist-packages (from Tensorflow) (2.9.1)
Requirement already satisfied: packaging in /usr/local/lib/python3.8/dist-packages (from Tensorflow) (21.3)
Requirement already satisfied: astunparse>=1.6.0 in /usr/local/lib/python3.8/dist-packages (from Tensorflow) (1.6.3)
Requirement already satisfied: h5py>=2.9.0 in /usr/local/lib/python3.8/dist-packages (from Tensorflow) (3.1.0)
Requirement already satisfied: six>=1.12.0 in /usr/local/lib/python3.8/dist-packages (from Tensorflow) (1.15.0)
Requirement already satisfied: wheel<1.0,>=0.23.0 in /usr/local/lib/python3.8/dist-packages (from astunparse>=1.6.0->Tensorflow) (0.38.4)
Requirement already satisfied: tensorboard-data-server<0.7.0,>=0.6.0 in /usr/local/lib/python3.8/dist-packages (from tensorboard<2.10,>=2.9->Tensorflow) (0.6.1)
Requirement already satisfied: werkzeug>=1.0.1 in /usr/local/lib/python3.8/dist-packages (from tensorboard<2.10,>=2.9->Tensorflow) (1.0.1)
Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.8/dist-packages (from tensorboard<2.10,>=2.9->Tensorflow) (3.4.1)
Requirement already satisfied: google-auth<3,>=1.6.3 in /usr/local/lib/python3.8/dist-packages (from tensorboard<2.10,>=2.9->Tensorflow) (2.15.0)
Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in /usr/local/lib/python3.8/dist-packages (from tensorboard<2.10,>=2.9->Tensorflow) (1.8.1)
Requirement already satisfied: requests<3,>=2.21.0 in /usr/local/lib/python3.8/dist-packages (from tensorboard<2.10,>=2.9->Tensorflow) (2.23.0)
Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /usr/local/lib/python3.8/dist-packages (from tensorboard<2.10,>=2.9->Tensorflow) (0.4.6)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.8/dist-packages (from google-auth<3,>=1.6.3->tensorboard<2.10,>=2.9->Tensorflow) (0.2.8)
Requirement already satisfied: rsa<5,>=3.1.4 in /usr/local/lib/python3.8/dist-packages (from google-auth<3,>=1.6.3->tensorboard<2.10,>=2.9->Tensorflow) (4.9)
Requirement already satisfied: cachetools<6.0,>=2.0.0 in /usr/local/lib/python3.8/dist-packages (from google-auth<3,>=1.6.3->tensorboard<2.10,>=2.9->Tensorflow) (5.2.0)
Requirement already satisfied: requests-oauthlib>=0.7.0 in /usr/local/lib/python3.8/dist-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.10,>=2.9->Tensorflow) (1.3.1)
Requirement already satisfied: importlib-metadata>=4.4 in /usr/local/lib/python3.8/dist-packages (from markdown>=2.6.8->tensorboard<2.10,>=2.9->Tensorflow) (4.13.0)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.8/dist-packages (from importlib-metadata>=4.4->markdown>=2.6.8->tensorboard<2.10,>=2.9->Tensorflow) (3.11.0)
Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /usr/local/lib/python3.8/dist-packages (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard<2.10,>=2.9->Tensorflow) (0.4.8)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.8/dist-packages (from requests<3,>=2.21.0->tensorboard<2.10,>=2.9->Tensorflow) (1.24.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.8/dist-packages (from requests<3,>=2.21.0->tensorboard<2.10,>=2.9->Tensorflow) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.8/dist-packages (from requests<3,>=2.21.0->tensorboard<2.10,>=2.9->Tensorflow) (2022.9.24)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.8/dist-packages (from requests<3,>=2.21.0->tensorboard<2.10,>=2.9->Tensorflow) (3.0.4)
Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.8/dist-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.10,>=2.9->Tensorflow) (3.2.2)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.8/dist-packages (from packaging->Tensorflow) (3.0.9)
TensorFlow version: 2.9.2
Eager execution is: True
Keras version: 2.9.0
Num GPUs Available: 1


nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
Sat Dec 10 22:45:59 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    27W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Now we can import the packages and set the seed for the environment.

In [ ]:

import os
import random
import warnings
import numpy as np
warnings.filterwarnings('ignore')

def init_seeds(seed=101920):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    session_conf = tf.compat.v1.ConfigProto()
    session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1,
                                            inter_op_parallelism_threads=1)
    os.environ['TF_CUDNN_DETERMINISTIC'] = 'True'
    os.environ['TF_DETERMINISTIC_OPS'] = 'True'

    sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(),
                                config=session_conf)
    tf.compat.v1.keras.backend.set_session(sess)

    return sess

init_seeds(seed=101920)

Out[ ]:

<tensorflow.python.client.session.Session at 0x7fcf86afe1d0>

`Baseline Models`¶

`Train 2020: Batch Size = 16`¶

Like in preparing the year range data for the XGBoost models, the data can be read, the train/test sets can be set up by using a test_size=0.2 and stratified by DateTime_YearWeek. Since the size of the companies, both U.S. and foreign, demonstrated differences in feature importance, ordinal encoding using ranking can be utilized to contain this level of information. Then dummy variables can be created for the categorical features and the number of features examined since this information is needed for the input dimensions to the MLP. Next, the training features can be scaled using the StandardScaler, which removes the mean and scales to unit variance, that uses fit_transform with the training set and applies what is fit with transform to the test set.

In [ ]:

import pandas as pd
from sklearn.model_selection import train_test_split
import category_encoders as ce
from sklearn.preprocessing import StandardScaler
pd.set_option('display.max_columns', None)

X = df2.drop(['Metric_Tons'], axis=1)
y = df2['Metric_Tons']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=X.DateTime_YearWeek,
                                                    random_state=1920)

X_train = X_train.drop(['DateTime_YearWeek'], axis=1)
X_test = X_test.drop(['DateTime_YearWeek'], axis=1)

ce_ord = ce.OrdinalEncoder(cols = ['foreign_company_size', 'US_company_size'])
X_train = ce_ord.fit_transform(X_train)
X_test = ce_ord.transform(X_test)

X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
print('Dimensions of X_train for input:', X_train.shape[1])

sc = StandardScaler()
X_train = pd.DataFrame(sc.fit_transform(X_train))
X_test = pd.DataFrame(sc.transform(X_test))

Dimensions of X_train for input: 34

Let's now set the path where model is saved and set up the callbacks containing the EarlyStopping class that monitors the loss and will stop training if it does not improve after 5 epochs, the ModelCheckpoint class that monitors the mse saving only the best one with a min loss as well as the TensorBoard class that enables visualizations using the TensorBoard.

In [ ]:

import datetime
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')

filepath = 'MLP_weights_only_train20_baseline_sc_b16_epochs30.h5'
checkpoint_dir = os.path.dirname(filepath)

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
                                                      histogram_freq=1)

callbacks_list = [EarlyStopping(monitor='loss', patience=5),
                  ModelCheckpoint(filepath, monitor='mse',
                                  best_model_only=True, mode='min'),
                  tensorboard_callback]

We can define a baseline model to compare the results after tuning. Since there are 34 features after preprocessing, let's use three Dense layers that incrementally decrease in node size with each containing kernel_initializer='normal' to use a normal distribution to initialize the weights and activation='relu' to use the positive part of each of the vector components with a 30% Dropout before the output layer. Then we can configure the model for training using compile specifying the loss function (mae), the metric to evaluate (mse) and which optimizer to utilize (adam with the default learning_rate=0.001).

In [ ]:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(30, input_dim=34, kernel_initializer='normal',
                activation='relu'))
model.add(Dense(20, kernel_initializer='normal', activation='relu'))
model.add(Dense(10, kernel_initializer='normal', activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1))

model.compile(loss='mae', metrics=['mse'], optimizer='adam')
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 30)                1050      
_________________________________________________________________
dense_1 (Dense)              (None, 20)                620       
_________________________________________________________________
dense_2 (Dense)              (None, 10)                210       
_________________________________________________________________
dropout (Dropout)            (None, 10)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 11        
=================================================================
Total params: 1,891
Trainable params: 1,891
Non-trainable params: 0
_________________________________________________________________

Now the model can be trained by calling fit utilizing the train dataset for 30 epochs, which is the iterations over the entire X_train and y_train, using a batch_size=16, which is the number of samples per gradient update, validation_split=0.2, which is the fraction of the training data not used for training but used for evaluating the loss and the model metrics on this data at the end of each epoch, and the specified callbacks from the callbacks_list.

In [ ]:

history = model.fit(X_train, y_train, epochs=30, batch_size=16,
                    validation_split=0.2, callbacks=callbacks_list)

Epoch 1/30
130612/130612 [==============================] - 287s 2ms/step - loss: 11.3536 - mse: 697.3706 - val_loss: 9.6628 - val_mse: 495.5418
Epoch 2/30
130612/130612 [==============================] - 285s 2ms/step - loss: 10.2092 - mse: 566.1661 - val_loss: 9.3898 - val_mse: 500.8206
Epoch 3/30
130612/130612 [==============================] - 284s 2ms/step - loss: 9.9300 - mse: 520.2528 - val_loss: 9.0747 - val_mse: 461.3257
Epoch 4/30
130612/130612 [==============================] - 284s 2ms/step - loss: 9.7343 - mse: 506.4827 - val_loss: 8.8463 - val_mse: 443.3396
Epoch 5/30
130612/130612 [==============================] - 284s 2ms/step - loss: 9.6057 - mse: 491.5209 - val_loss: 8.7620 - val_mse: 431.2053
Epoch 6/30
130612/130612 [==============================] - 282s 2ms/step - loss: 9.5248 - mse: 483.6649 - val_loss: 8.8105 - val_mse: 439.9854
Epoch 7/30
130612/130612 [==============================] - 281s 2ms/step - loss: 9.4757 - mse: 480.2172 - val_loss: 8.8542 - val_mse: 452.9666
Epoch 8/30
130612/130612 [==============================] - 280s 2ms/step - loss: 9.4242 - mse: 474.2610 - val_loss: 8.6531 - val_mse: 423.2589
Epoch 9/30
130612/130612 [==============================] - 279s 2ms/step - loss: 9.3866 - mse: 472.0242 - val_loss: 8.4450 - val_mse: 411.8867
Epoch 10/30
130612/130612 [==============================] - 281s 2ms/step - loss: 9.3548 - mse: 469.3746 - val_loss: 8.3888 - val_mse: 404.1636
Epoch 11/30
130612/130612 [==============================] - 276s 2ms/step - loss: 9.3206 - mse: 466.0229 - val_loss: 8.5108 - val_mse: 406.6072
Epoch 12/30
130612/130612 [==============================] - 278s 2ms/step - loss: 9.2988 - mse: 464.4471 - val_loss: 8.5278 - val_mse: 417.1172
Epoch 13/30
130612/130612 [==============================] - 277s 2ms/step - loss: 9.2878 - mse: 463.7484 - val_loss: 8.3800 - val_mse: 394.8915
Epoch 14/30
130612/130612 [==============================] - 277s 2ms/step - loss: 9.2684 - mse: 462.7498 - val_loss: 8.6128 - val_mse: 428.9174
Epoch 15/30
130612/130612 [==============================] - 277s 2ms/step - loss: 9.2408 - mse: 460.1641 - val_loss: 8.4059 - val_mse: 409.1604
Epoch 16/30
130612/130612 [==============================] - 276s 2ms/step - loss: 9.2390 - mse: 459.9371 - val_loss: 8.4664 - val_mse: 407.0621
Epoch 17/30
130612/130612 [==============================] - 276s 2ms/step - loss: 9.2152 - mse: 458.4763 - val_loss: 8.3967 - val_mse: 402.6576
Epoch 18/30
130612/130612 [==============================] - 276s 2ms/step - loss: 9.2136 - mse: 457.5231 - val_loss: 8.4423 - val_mse: 409.6518
Epoch 19/30
130612/130612 [==============================] - 277s 2ms/step - loss: 9.1986 - mse: 455.5930 - val_loss: 8.3374 - val_mse: 401.5810
Epoch 20/30
130612/130612 [==============================] - 276s 2ms/step - loss: 9.1924 - mse: 455.7097 - val_loss: 8.2731 - val_mse: 382.9103
Epoch 21/30
130612/130612 [==============================] - 277s 2ms/step - loss: 9.1769 - mse: 454.5361 - val_loss: 8.2680 - val_mse: 385.6853
Epoch 22/30
130612/130612 [==============================] - 276s 2ms/step - loss: 9.1675 - mse: 453.3613 - val_loss: 8.3437 - val_mse: 403.4636
Epoch 23/30
130612/130612 [==============================] - 277s 2ms/step - loss: 9.1664 - mse: 456.3998 - val_loss: 8.3524 - val_mse: 400.8748
Epoch 24/30
130612/130612 [==============================] - 276s 2ms/step - loss: 9.1593 - mse: 458.7133 - val_loss: 8.3548 - val_mse: 409.1398
Epoch 25/30
130612/130612 [==============================] - 278s 2ms/step - loss: 9.1542 - mse: 453.1279 - val_loss: 8.3186 - val_mse: 412.0538
Epoch 26/30
130612/130612 [==============================] - 276s 2ms/step - loss: 9.1564 - mse: 453.2894 - val_loss: 8.2430 - val_mse: 386.2402
Epoch 27/30
130612/130612 [==============================] - 277s 2ms/step - loss: 9.1457 - mse: 452.6359 - val_loss: 8.2581 - val_mse: 389.1211
Epoch 28/30
130612/130612 [==============================] - 276s 2ms/step - loss: 9.1336 - mse: 451.8768 - val_loss: 8.1985 - val_mse: 386.0039
Epoch 29/30
130612/130612 [==============================] - 277s 2ms/step - loss: 9.1347 - mse: 450.6846 - val_loss: 8.2873 - val_mse: 392.1515
Epoch 30/30
130612/130612 [==============================] - 276s 2ms/step - loss: 9.1269 - mse: 450.5259 - val_loss: 8.2605 - val_mse: 401.1345

Now, let's save the model if we need to reload it in the future and plot the model loss and val_loss over the training epochs.

In [ ]:

import matplotlib.pyplot as plt

model.save('./MLP_Baseline_sc_batch16_30epochs_train20_tf.h5', save_format='tf')

# Load model for more training or later use
#filepath = 'MLP_weights_only_train20_baseline_sc_b16_epochs30.h5'
#model = tf.keras.models.load_model('./MLP_Baseline_sc_batch16_30epochs_train20_tf.h5')
#model.load_weights(filepath)

plt.title('Model Error for Metric Tonnage')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='val_loss')
plt.ylabel('Error [Metric Tonnage]')
plt.xlabel('Epoch')
plt.legend()
plt.show()

We can save the model losses in a pandas.Dataframe, and plot both the mae and mse from the train and validation sets over the epochs as well.

In [ ]:

losses = pd.DataFrame(model.history.history)
losses.plot()
plt.title('Model Error for Metric Tonnage')
plt.ylabel('Error [Metric Tonnage]')
plt.xlabel('Epoch')
plt.show()

We can use model.predict to generate output predictions for the specified input samples so let's predict using the training set, convert the predicted metric tons to a pandas.Dataframe and examine the size to determine chunksize for plotting. Let's graph the predicted vs. actual metric tonnage for the train set.

In [ ]:

pred_train = model.predict(X_train)
y_pred = pd.DataFrame(pred_train)
y_pred.shape

Out[ ]:

(2612234, 1)

In [ ]:

plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Baseline: Train 2020 Test 2020: Predicted Metric Tons vs Actual Metric Tons',
           fontsize=30)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()

To evaluate if this MLP is effective at predicting metric tonnage, metrics like the Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and the Coefficient of Determination (R²) can be utilized for both the training and test sets.

In [ ]:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print('Metrics: Train set')
print('MAE: %3f' % mean_absolute_error(y_train[:], pred_train[:]))
print('MSE: %3f' % mean_squared_error(y_train[:], pred_train[:]))
print('RMSE: %3f'% np.sqrt(mean_squared_error(y_train[:], pred_train[:])))
print('R2: %3f' % r2_score(y_train[:], pred_train[:]))

Metrics: Train set
MAE: 8.292885
MSE: 404.850157
RMSE: 20.120889
R2: 0.571114

Then predict on test set and convert the predicted metric tons to pandas.Dataframe to evaluate the metrics from the test set.

In [ ]:

pred_test = model.predict(X_test)
y_pred = pd.DataFrame(pred_test)

print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y_test[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y_test[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_test[:], pred_test[:])))
print('R2: %3f' % r2_score(y_test[:], pred_test[:]))

Metrics: Test set
MAE: 8.552116
MSE: 418.956668
RMSE: 20.468431
R2: 0.555586

When comparing the metrics from the training and test sets, the errors are higher in the test set, but they are close in proximity. The R² is slightly lower in the test set compared to the R² from the training set.

Let's also examine how the predicted values for the maximum, average and minimum metric tonnage compares to the actual maximum, average and minimum metric tonnage.

In [ ]:

print('Maximum Metric Tons', np.amax(y_test))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons', np.average(y_test))
print('Predicted Average Metric Tons', np.average(pred_test))
print('\nMinimum Metric Tons', np.amin(y_test))
print('Predicted Minimum Metric Tons', np.amin(pred_test))

Maximum Metric Tons 249.99
Predicted Max Metric Tons: 286.75543

Average Metric Tons 21.525284239249448
Predicted Average Metric Tons 18.042938

Minimum Metric Tons 0.0
Predicted Minimum Metric Tons -6.1492796

For the baseline model using the 2020 data, there is a higher predicted maximum metric tonnage than the actual maximum metric tons while a lower predicted average and a lower predicted minimum compared to the actual.

`Test on 2019`

Let's now prepare the 2019 set by partitioning the data and processing the features. We can use model.predict using the test set, convert the predicted metric tons to a pandas.Dataframe and graph the predicted vs. actual metric tonnage for the 2019 set.

In [ ]:

X = df1.drop(['Metric_Tons', 'DateTime_YearWeek'], axis=1)
y = df1['Metric_Tons']

X = ce_ord.fit_transform(X)
X = pd.get_dummies(X, drop_first=True)
X = pd.DataFrame(sc.fit_transform(X))

pred_test = model.predict(X)
y_pred = pd.DataFrame(pred_test)

plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Baseline: Train 2020 Test 2019: Predicted Metric Tons vs Actual Metric Tons',
           fontsize=30)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()

To evaluate if this MLP is effective at predicting metric tonnage, let's evaluate the MAE, MSE, RMSE and the R² for the 2020 model using the 2019 set

In [ ]:

print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y[:], pred_test[:])))
print('R2: %3f' % r2_score(y[:], pred_test[:]))

Metrics: Test set
MAE: 9.556285
MSE: 516.556536
RMSE: 22.727880
R2: 0.486803

Compared to the 2020 test set metrics, there are higher errors for MAE (9.556285 vs 8.552116), MSE (516.556536 vs 418.956668), RMSE (22.727880 vs 20.468431) and lower R² (0.486803 vs 0.555586).

Let's also examine how the predicted values for the maximum, average and minimum metric tonnage compares to the actual maximum, average and minimum metric tonnage.

In [ ]:

print('Maximum Metric Tons:', np.amax(y))
print('Predicted Max Metric Tons:', np.amax(pred_test))
('\n')
print('\nAverage Metric Tons:', np.average(y))
print('Predicted Average Metric Tons:', np.average(pred_test))
('\n')
print('\nMinimum Metric Tons:', np.amin(y))
print('Predicted Minimum Metric Tons:', np.amin(pred_test))

Maximum Metric Tons: 249.99
Predicted Max Metric Tons: 417.72134

Average Metric Tons: 21.897868227087965
Predicted Average Metric Tons: 17.587381

Minimum Metric Tons: 0.0
Predicted Minimum Metric Tons: -7.155898

For the baseline model using the 2020 data tested using the 2019 data, there is an even higher predicted maximum metric tonnage than the actual maximum metric tons (417.72134 vs 286.75543) while an even lower predicted average (17.587381 vs. 18.042938) and a lower predicted minimum (-7.155898 vs. -6.1492796) compared to the actual. The average metric tons for 2020 is 21.525 and 2019 is 21.898, which are comparable.

`Train 2019: Batch Size = 16`¶

Like in preparing the year range data for the XGBoost models, the data can be read, the train/test sets can be set up by using a test_size=0.2 and stratified by DateTime_YearWeek. Since the size of the companies, both U.S. and foreign, demonstrated differences in the feature importance, ordinal encoding using ranking can be utilized to contain this level of information. Then dummy variables can be created for the categorical features and the number of features examined since this information is needed for the input dimensions to the MLP. Next, the training features can be scaled using the StandardScaler, which removes the mean and scales to unit variance, that uses fit_transform with the training set and applies what is fit with transform to the test set.

In [ ]:

X = df1.drop(['Metric_Tons'],axis=1)
y = df1['Metric_Tons']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=X.DateTime_YearWeek,
                                                    random_state=1920)

X_train = X_train.drop(['DateTime_YearWeek'], axis=1)
X_test = X_test.drop(['DateTime_YearWeek'], axis=1)

X_train = ce_ord.fit_transform(X_train)
X_test = ce_ord.transform(X_test)

X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
print('Dimensions of X_train for input:', X_train.shape[1])

sc = StandardScaler()
X_train = pd.DataFrame(sc.fit_transform(X_train))
X_test = pd.DataFrame(sc.transform(X_test))

Dimensions of X_train for input: 34

Let's now set the path where model is saved and set up the callbacks containing the EarlyStopping class that monitors the loss and will stop training if it does not improve after 5 epochs, the ModelCheckpoint class that monitors the mse saving only the best one with a min loss as well as the TensorBoard class that enables visualizations using the TensorBoard.

In [ ]:

log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')

%load_ext tensorboard

filepath = 'MLP_weights_only_train19_baseline_sc_b16_epochs30.h5'
checkpoint_dir = os.path.dirname(filepath)

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
                                                      histogram_freq=1)

callbacks_list = [EarlyStopping(monitor='loss', patience=5),
                  ModelCheckpoint(filepath, monitor='mse',
                                  best_model_only=True, mode='min'),
                  tensorboard_callback]

/content/drive/MyDrive/MaritimeTrade/Models/DL/MLP/Models

Utilizing the same model architecture as the one for training 2020, lets train the baseline MLP using the 2019 training set.

In [ ]:

history = model.fit(X_train, y_train, epochs=30, batch_size=16,
                    validation_split=0.2, callbacks=callbacks_list)

Epoch 1/30
134740/134740 [==============================] - 311s 2ms/step - loss: 11.6150 - mse: 880.4813 - val_loss: 10.0269 - val_mse: 588.8330
Epoch 2/30
134740/134740 [==============================] - 297s 2ms/step - loss: 10.3371 - mse: 564.0920 - val_loss: 9.3850 - val_mse: 499.2766
Epoch 3/30
134740/134740 [==============================] - 297s 2ms/step - loss: 10.0675 - mse: 539.0044 - val_loss: 9.1811 - val_mse: 482.0979
Epoch 4/30
134740/134740 [==============================] - 297s 2ms/step - loss: 9.9154 - mse: 526.2175 - val_loss: 8.9635 - val_mse: 450.3198
Epoch 5/30
134740/134740 [==============================] - 296s 2ms/step - loss: 9.8207 - mse: 521.3840 - val_loss: 8.7572 - val_mse: 441.7188
Epoch 6/30
134740/134740 [==============================] - 295s 2ms/step - loss: 9.7510 - mse: 515.8300 - val_loss: 8.7677 - val_mse: 432.5368
Epoch 7/30
134740/134740 [==============================] - 295s 2ms/step - loss: 9.7058 - mse: 507.7592 - val_loss: 8.9692 - val_mse: 453.5005
Epoch 8/30
134740/134740 [==============================] - 295s 2ms/step - loss: 9.6614 - mse: 501.5469 - val_loss: 8.8610 - val_mse: 492.3078
Epoch 9/30
134740/134740 [==============================] - 294s 2ms/step - loss: 9.6196 - mse: 499.5367 - val_loss: 8.7379 - val_mse: 439.2052
Epoch 10/30
134740/134740 [==============================] - 293s 2ms/step - loss: 9.5973 - mse: 512.9811 - val_loss: 8.6969 - val_mse: 431.0757
Epoch 11/30
134740/134740 [==============================] - 292s 2ms/step - loss: 9.5546 - mse: 491.0605 - val_loss: 8.7743 - val_mse: 441.4550
Epoch 12/30
134740/134740 [==============================] - 290s 2ms/step - loss: 9.5218 - mse: 487.4147 - val_loss: 8.9022 - val_mse: 463.1740
Epoch 13/30
134740/134740 [==============================] - 291s 2ms/step - loss: 9.5030 - mse: 485.0403 - val_loss: 8.6689 - val_mse: 433.5587
Epoch 14/30
134740/134740 [==============================] - 289s 2ms/step - loss: 9.4896 - mse: 485.3148 - val_loss: 8.5308 - val_mse: 424.2657
Epoch 15/30
134740/134740 [==============================] - 291s 2ms/step - loss: 9.4675 - mse: 483.2062 - val_loss: 8.5800 - val_mse: 421.7025
Epoch 16/30
134740/134740 [==============================] - 290s 2ms/step - loss: 9.4568 - mse: 482.1610 - val_loss: 8.5653 - val_mse: 425.0955
Epoch 17/30
134740/134740 [==============================] - 290s 2ms/step - loss: 9.4291 - mse: 478.4980 - val_loss: 8.6223 - val_mse: 424.4791
Epoch 18/30
134740/134740 [==============================] - 290s 2ms/step - loss: 9.4287 - mse: 479.2614 - val_loss: 8.5840 - val_mse: 420.1024
Epoch 19/30
134740/134740 [==============================] - 290s 2ms/step - loss: 9.4131 - mse: 478.0560 - val_loss: 8.5699 - val_mse: 422.5081
Epoch 20/30
134740/134740 [==============================] - 291s 2ms/step - loss: 9.4055 - mse: 479.2882 - val_loss: 8.4925 - val_mse: 404.9562
Epoch 21/30
134740/134740 [==============================] - 290s 2ms/step - loss: 9.3913 - mse: 476.8774 - val_loss: 8.5814 - val_mse: 418.0968
Epoch 22/30
134740/134740 [==============================] - 290s 2ms/step - loss: 9.3810 - mse: 474.9148 - val_loss: 8.4966 - val_mse: 407.0798
Epoch 23/30
134740/134740 [==============================] - 290s 2ms/step - loss: 9.3816 - mse: 476.5560 - val_loss: 8.5060 - val_mse: 418.5885
Epoch 24/30
134740/134740 [==============================] - 290s 2ms/step - loss: 9.3725 - mse: 475.5645 - val_loss: 8.5674 - val_mse: 420.4923
Epoch 25/30
134740/134740 [==============================] - 291s 2ms/step - loss: 9.3546 - mse: 474.3773 - val_loss: 8.5912 - val_mse: 420.8293
Epoch 26/30
134740/134740 [==============================] - 291s 2ms/step - loss: 9.3561 - mse: 473.3643 - val_loss: 8.3844 - val_mse: 409.3878
Epoch 27/30
134740/134740 [==============================] - 290s 2ms/step - loss: 9.3452 - mse: 473.2807 - val_loss: 8.4394 - val_mse: 398.6336
Epoch 28/30
134740/134740 [==============================] - 291s 2ms/step - loss: 9.3353 - mse: 471.7618 - val_loss: 8.4928 - val_mse: 425.4965
Epoch 29/30
134740/134740 [==============================] - 292s 2ms/step - loss: 9.3333 - mse: 472.5301 - val_loss: 8.4256 - val_mse: 409.6219
Epoch 30/30
134740/134740 [==============================] - 292s 2ms/step - loss: 9.3248 - mse: 471.8142 - val_loss: 8.5205 - val_mse: 420.6759

Now, let's save the model if we need to reload it in the future and plot the model loss and val_loss over the training epochs.

In [ ]:

model.save('./MLP_Baseline_sc_batch16_30epochs_train19_tf.h5', save_format='tf')

plt.title('Model Error for Metric Tonnage')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='val_loss')
plt.ylabel('Error [Metric Tonnage]')
plt.xlabel('Epoch')
plt.legend()
plt.show()

As with the 2020 baseline model, the val_loss is lower than the training loss over the duration of the training epochs. Then, we can save the model losses in a pandas.Dataframe, and plot both the mae and mse from the train and validation sets over the epochs as well.

In [ ]:

losses = pd.DataFrame(model.history.history)
losses.plot()
plt.title('Model Error for Metric Tonnage')
plt.ylabel('Error [Metric Tonnage]')
plt.xlabel('Epoch')
plt.show()

We can use model.predict to generate output predictions for the specified input samples so let's predict using the training set, convert the predicted metric tons to a pandas.Dataframe and examine the size to determine chunksize for plotting. Let's graph the predicted vs. actual metric tonnage for the train set.

In [ ]:

pred_train = model.predict(X_train)
y_pred = pd.DataFrame(pred_train)

plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Training Set: Predicted Metric Tons vs Actual Metric Tons',
           fontsize=25)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y_train, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()

We can use model.predict using the test set, convert the predicted metric tons to a pandas.Dataframe and graph the predicted vs. actual metric tonnage for the test set.

In [ ]:

pred_test = model.predict(X_test)
y_pred = pd.DataFrame(pred_test)

plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Test Set: Predicted Metric Tons vs Actual Metric Tons', fontsize=30)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y_test, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()

To evaluate if this MLP is effective at predicting metric tonnage, let's evaluate the MAE, MSE, RMSE and the R² for both the training and test sets.

In [ ]:

print('Metrics: Train set')
print('MAE: %3f' % mean_absolute_error(y_train[:], pred_train[:]))
print('MSE: %3f' % mean_squared_error(y_train[:], pred_train[:]))
print('RMSE: %3f'% np.sqrt(mean_squared_error(y_train[:], pred_train[:])))
print('R2: %3f' % r2_score(y_train[:], pred_train[:]))

Metrics: Train set
MAE: 8.529655
MSE: 422.753801
RMSE: 20.560978
R2: 0.579966

In [ ]:

print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y_test[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y_test[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_test[:], pred_test[:])))
print('R2: %3f' % r2_score(y_test[:], pred_test[:]))

Metrics: Test set
MAE: 9.251954
MSE: 459.566662
RMSE: 21.437506
R2: 0.543552

When comparing the metrics from the training and test sets, the errors are higher in the test set, but they are close in proximity. The R² is slightly lower in the test set compared to the R² from the training set.

Let's also examine how the predicted values for the maximum, average and minimum metric tonnage compares to the actual maximum, average and minimum metric tonnage.

In [ ]:

print('Maximum Metric Tons', np.amax(y_test))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons', np.average(y_test))
print('Predicted Average Metric Tons', np.average(pred_test))
print('\nMinimum Metric Tons', np.amin(y_test))
print('Predicted Minimum Metric Tons', np.amin(pred_test))

Maximum Metric Tons 249.99
Predicted Max Metric Tons: 315.0854

Average Metric Tons 21.895935662662403
Predicted Average Metric Tons 17.043827

Minimum Metric Tons 0.0
Predicted Minimum Metric Tons -7.53874

For the baseline model using the 2019 data, there is also a higher predicted maximum metric tonnage than the actual maximum metric tons, but it is farther away from the actual maximum than the predicted vs. actual maximum for the 2020 baseline model. On the other hand, there is an even lower predicted average and predicted minimum compared to the actual than what was observed for the baseline model for the 2020 set.

`Test on 2020`

Let's now prepare the 2020 set by partitioning the data and processing the features. We can use model.predict using the test set, convert the predicted metric tons to a pandas.Dataframe and graph the predicted vs. actual metric tonnage for the 2020 set.

In [ ]:

X = df2.drop(['Metric_Tons', 'DateTime_YearWeek'], axis=1)
y = df2['Metric_Tons']

X = ce_ord.fit_transform(X)
X = pd.get_dummies(X, drop_first=True)
X = pd.DataFrame(sc.fit_transform(X))

pred_test = model.predict(X)
y_pred = pd.DataFrame(pred_test)

plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Baseline: Train 2019 Test 2020: Predicted Metric Tons vs Actual Metric Tons',
           fontsize=30)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()

To evaluate if this MLP is effective at predicting metric tonnage, let's evaluate the MAE, MSE, RMSE and the R² for the 2019 model using the 2020 set

In [ ]:

print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y[:], pred_test[:])))
print('R2: %3f' % r2_score(y[:], pred_test[:]))

Metrics: Test set
MAE: 8.862149
MSE: 431.787379
RMSE: 20.779494
R2: 0.542457

Compared to the 2019 test set metrics, there are lower errors for MAE (8.862149 vs. 9.251954), MSE (431.787379 vs. 459.566662), RMSE (20.779494 vs. 21.437506) and comparable R² (0.542457 vs. 0.543552).

Let's also examine how the predicted values for the maximum, average and minimum metric tonnage compares to the actual maximum, average and minimum metric tonnage.

In [ ]:

print('Maximum Metric Tons:', np.amax(y))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons:', np.average(y))
print('Predicted Average Metric Tons:', np.average(pred_test))
print('\nMinimum Metric Tons:', np.amin(y))
print('Predicted Minimum Metric Tons:', np.amin(pred_test))

Maximum Metric Tons: 249.99
Predicted Max Metric Tons: 398.72498

Average Metric Tons: 21.53519470993873
Predicted Average Metric Tons: 18.037947

Minimum Metric Tons: 0.0
Predicted Minimum Metric Tons: -8.755297

For the 2019 baseline model tested using the 2020 data, there is an even higher predicted maximum metric tonnage than the actual maximum metric tons (398.72498 vs. 315.0854) while an even lower predicted average (18.037947 vs. 17.043827) and a lower predicted minimum (-8.755297 vs. -7.53874) compared to the actual. The average metric tons for 2020 is 21.525 and 2019 is 21.898, so they are comparable.

`Train 2010 - 2019: Batch Size = 32`¶

Like in preparing the year range data for XGBoost models, let's read data, remove the missing data for the most complete cases and filter the set to 2010 - 2019 observations. Then set up the data to create the train/test sets stratified by Year. Next, ordinal encode both the US and foreign company size using ranking, create the dummy variables for the categorical variables and utilize the StandardScaler rather the MinMaxScaler.

Let's now set the path where model is saved and set up the callbacks with the EarlyStopping that monitors the loss and will stop training if it does not improve after 5 epochs, the ModelCheckpoint that monitors the mse saving only the best one with a min loss and the TensorBoard as well .

In [ ]:

log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')

%load_ext tensorboard

filepath = 'MLP_weights_only_train1019_baseline_sc_b32_epochs30.h5'
checkpoint_dir = os.path.dirname(filepath)

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
                                                      histogram_freq=1)

callbacks_list = [EarlyStopping(monitor='loss', patience=5),
                  ModelCheckpoint(filepath, monitor='mse',
                                  best_model_only=True, mode='min'),
                  tensorboard_callback]

/content/drive/MyDrive/MaritimeTrade/Models/DL/MLP/Models
The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard

Given the large amount of observations even with sampling, let's use batch_size=32 and the same other parameters to train the MLP using this train dataset.

In [ ]:

history = model.fit(X_train, y_train, epochs=30, batch_size=32,
                    validation_split=0.2, callbacks=callbacks_list)

Epoch 1/30
422332/422332 [==============================] - 935s 2ms/step - loss: 9.6943 - mse: 446.1781 - val_loss: 8.7496 - val_mse: 390.8900
Epoch 2/30
422332/422332 [==============================] - 959s 2ms/step - loss: 9.2162 - mse: 412.9214 - val_loss: 8.4212 - val_mse: 375.7799
Epoch 3/30
422332/422332 [==============================] - 917s 2ms/step - loss: 9.0297 - mse: 402.0034 - val_loss: 8.3467 - val_mse: 373.9240
Epoch 4/30
422332/422332 [==============================] - 920s 2ms/step - loss: 8.9537 - mse: 396.0067 - val_loss: 8.3418 - val_mse: 369.8779
Epoch 5/30
422332/422332 [==============================] - 929s 2ms/step - loss: 8.9054 - mse: 392.1855 - val_loss: 8.2432 - val_mse: 362.5407
Epoch 6/30
422332/422332 [==============================] - 921s 2ms/step - loss: 8.8783 - mse: 390.2737 - val_loss: 8.1926 - val_mse: 356.0669
Epoch 7/30
422332/422332 [==============================] - 918s 2ms/step - loss: 8.8597 - mse: 388.4507 - val_loss: 8.1190 - val_mse: 357.1298
Epoch 8/30
422332/422332 [==============================] - 931s 2ms/step - loss: 8.8503 - mse: 387.9199 - val_loss: 8.1958 - val_mse: 360.5265
Epoch 9/30
422332/422332 [==============================] - 934s 2ms/step - loss: 8.8371 - mse: 386.9019 - val_loss: 8.2469 - val_mse: 363.9530
Epoch 10/30
422332/422332 [==============================] - 922s 2ms/step - loss: 8.8280 - mse: 386.0043 - val_loss: 8.2077 - val_mse: 359.1540
Epoch 11/30
422332/422332 [==============================] - 922s 2ms/step - loss: 8.8208 - mse: 385.3223 - val_loss: 8.2298 - val_mse: 358.2460
Epoch 12/30
422332/422332 [==============================] - 923s 2ms/step - loss: 8.8145 - mse: 385.1211 - val_loss: 8.1915 - val_mse: 364.5096
Epoch 13/30
422332/422332 [==============================] - 922s 2ms/step - loss: 8.8048 - mse: 384.3127 - val_loss: 8.2360 - val_mse: 369.3531
Epoch 14/30
422332/422332 [==============================] - 920s 2ms/step - loss: 8.7994 - mse: 383.7057 - val_loss: 8.2907 - val_mse: 364.3446
Epoch 15/30
422332/422332 [==============================] - 921s 2ms/step - loss: 8.7958 - mse: 383.4945 - val_loss: 8.2077 - val_mse: 365.8817
Epoch 16/30
422332/422332 [==============================] - 925s 2ms/step - loss: 8.7935 - mse: 383.2272 - val_loss: 8.1145 - val_mse: 353.8852
Epoch 17/30
422332/422332 [==============================] - 923s 2ms/step - loss: 8.7873 - mse: 382.7701 - val_loss: 8.1190 - val_mse: 349.5538
Epoch 18/30
422332/422332 [==============================] - 921s 2ms/step - loss: 8.7851 - mse: 382.6614 - val_loss: 8.1932 - val_mse: 358.3771
Epoch 19/30
422332/422332 [==============================] - 929s 2ms/step - loss: 8.7817 - mse: 382.6654 - val_loss: 8.1101 - val_mse: 354.5242
Epoch 20/30
422332/422332 [==============================] - 920s 2ms/step - loss: 8.7791 - mse: 382.2807 - val_loss: 8.1370 - val_mse: 358.4280
Epoch 21/30
422332/422332 [==============================] - 920s 2ms/step - loss: 8.7754 - mse: 382.1271 - val_loss: 8.1957 - val_mse: 368.1328
Epoch 22/30
422332/422332 [==============================] - 922s 2ms/step - loss: 8.7731 - mse: 381.8557 - val_loss: 8.1256 - val_mse: 356.0920
Epoch 23/30
422332/422332 [==============================] - 922s 2ms/step - loss: 8.7714 - mse: 381.5506 - val_loss: 8.2357 - val_mse: 364.0110
Epoch 24/30
422332/422332 [==============================] - 926s 2ms/step - loss: 8.7714 - mse: 381.3193 - val_loss: 8.2483 - val_mse: 364.8841
Epoch 25/30
422332/422332 [==============================] - 917s 2ms/step - loss: 8.7674 - mse: 380.9654 - val_loss: 8.1100 - val_mse: 354.3062
Epoch 26/30
422332/422332 [==============================] - 917s 2ms/step - loss: 8.7688 - mse: 381.3808 - val_loss: 8.1835 - val_mse: 359.1082
Epoch 27/30
422332/422332 [==============================] - 915s 2ms/step - loss: 8.7655 - mse: 381.1008 - val_loss: 8.2581 - val_mse: 365.0514
Epoch 28/30
422332/422332 [==============================] - 924s 2ms/step - loss: 8.7628 - mse: 380.8708 - val_loss: 8.0745 - val_mse: 347.5423
Epoch 29/30
422332/422332 [==============================] - 920s 2ms/step - loss: 8.7598 - mse: 380.5723 - val_loss: 8.2035 - val_mse: 364.9460
Epoch 30/30
422332/422332 [==============================] - 918s 2ms/step - loss: 8.7606 - mse: 380.6102 - val_loss: 8.1415 - val_mse: 353.8224

Now, let's save the model if we need to reload it in the future and plot the model loss and val_loss over the training epochs.

In [ ]:

model.save('./MLP_Baseline_sc_batch32_30epochs_train1019_tf.h5',
           save_format='tf')

plt.title('Model Error for Metric Tonnage')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='val_loss')
plt.ylabel('Error [Metric Tonnage]')
plt.xlabel('Epoch')
plt.legend()
plt.show()

As with the 2020 and 2019 baseline models, the val_loss is lower than the training loss over the duration of the training epochs, and this baseline model appears to have lower error values. Then, we can save the model losses in a pandas.Dataframe, and plot both the mae and mse from the train and validation sets over the epochs as well.

In [ ]:

losses = pd.DataFrame(model.history.history)
losses.plot()
plt.title('Model Error for Metric Tonnage')
plt.ylabel('Error [Metric Tonnage]')
plt.xlabel('Epoch')
plt.show()

We can use model.predict to generate output predictions for the specified input samples so let's predict using the training set, convert the predicted metric tons to a pandas.Dataframe and examine the size to determine chunksize for plotting. Let's graph the predicted vs. actual metric tonnage for the train set.

In [ ]:

pred_train = model.predict(X_train)
y_pred = pd.DataFrame(pred_train)

plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Training Set: Predicted Metric Tons vs Actual Metric Tons',
           fontsize=25)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y_train, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()

We can use model.predict using the test set, convert the predicted metric tons to a pandas.Dataframe and graph the predicted vs. actual metric tonnage for the test set.

In [ ]:

pred_test = model.predict(X_test)
y_pred = pd.DataFrame(pred_test)

plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Test Set: Predicted Metric Tons vs Actual Metric Tons', fontsize=30)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y_test, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()

To evaluate if this MLP is effective at predicting metric tonnage, let's evaluate the MAE, MSE, RMSE and the R² for both the training and test sets.

In [ ]:

print('Metrics: Train set')
print('MAE: %3f' % mean_absolute_error(y_train[:], pred_train[:]))
print('MSE: %3f' % mean_squared_error(y_train[:], pred_train[:]))
print('RMSE: %3f'% np.sqrt(mean_squared_error(y_train[:], pred_train[:])))
print('R2: %3f' % r2_score(y_train[:], pred_train[:]))

Metrics: Train set
MAE: 8.127828
MSE: 352.190451
RMSE: 18.766738
R2: 0.503125

In [ ]:

print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y_test[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y_test[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_test[:], pred_test[:])))
print('R2: %3f' % r2_score(y_test[:], pred_test[:]))

Metrics: Test set
MAE: 8.637050
MSE: 391.928194
RMSE: 19.797176
R2: 0.446796

When comparing the metrics from the training and test sets, the errors are higher in the test set, but they are close in proximity. The R² is slightly lower in the test set compared to the R² from the training set.

Let's also examine how the predicted values for the maximum, average and minimum metric tonnage compares to the actual maximum, average and minimum metric tonnage.

In [ ]:

print('Maximum Metric Tons:', np.amax(y_test))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons:', np.average(y_test))
print('Predicted Average Metric Tons:', np.average(pred_test))
print('\nMinimum Metric Tons:', np.amin(y_test))
print('Predicted Minimum Metric Tons:', np.amin(pred_test))

Maximum Metric Tons: 249.99
Predicted Max Metric Tons: 388.61658

Average Metric Tons: 20.17341674121816
Predicted Average Metric Tons: 16.301502

Minimum Metric Tons: 0.0
Predicted Minimum Metric Tons: -7.649522

Out of all of the baseline models, the model using the 2010 - 2019 data was the highest for the predicted maximum metric tonnage and the lowest for the predicted average and a loweest for the predicted minimum compared to the actual.

`Test on 2020`

Let's now prepare the 2020 set by partitioning the data and processing the features. We can use model.predict using the test set, convert the predicted metric tons to a pandas.Dataframe and graph the predicted vs. actual metric tonnage for the 2020 set.

In [ ]:

X = df2.drop(['Metric_Tons', 'DateTime_YearWeek'], axis=1)
y = df2['Metric_Tons']

X = ce_ord.fit_transform(X)
X = pd.get_dummies(X, drop_first=True)
X = pd.DataFrame(sc.fit_transform(X))

pred_test = model.predict(X)
y_pred = pd.DataFrame(pred_test)

plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Baseline: Train 2010-19 Test 2020: Predicted Metric Tons vs Actual Metric Tons',
           fontsize=30)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()

To evaluate if this MLP is effective at predicting metric tonnage, let's evaluate the MAE, MSE, RMSE and the R² for the 2010 - 2019 model using the 2020 set.

In [ ]:

print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y[:], pred_test[:])))
print('R2: %3f' % r2_score(y[:], pred_test[:]))

Metrics: Test set
MAE: 23.538832
MSE: 1373.334543
RMSE: 37.058529
R2: -0.455251

Compared to the 2010 - 2019 test set metrics, there are higher errors for MAE (23.538832 vs. 8.637050), MSE (1373.334543 vs. 391.928194),RMSE (37.058529 vs. 19.797176) and a negative R² (-0.455251 vs. 0.446796).

Let's also examine how the predicted values for the maximum, average and minimum metric tonnage compares to the actual maximum, average and minimum metric tonnage.

In [ ]:

print('Maximum Metric Tons:', np.amax(y))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons:', np.average(y))
print('\nMinimum Metric Tons:', np.amin(y))
print('Predicted Minimum Metric Tons:', np.amin(pred_test))

Maximum Metric Tons: 249.99
Predicted Max Metric Tons: 308.2283

Average Metric Tons: 21.53519470993873
Predicted Average Metric Tons: 29.537125

Minimum Metric Tons: 0.0
Predicted Minimum Metric Tons: -8.377576

For the 2010 - 2019 baseline model using the using the 2020 data, there is an even lower predicted maximum metric tonnage than the actual maximum metric tons (308.2283 vs. 388.61658), a lower predicted minimum (-8.377576 vs. -7.649522) compared to the actual and an even higher predicted average (29.537125 vs. 16.301502) and . The average metric tons for 2020 is 21.525 and 2010 - 2019 is 20.173, so they are comparable.

`Hyperparameter Tuning`¶

`Train 2020 HPO5`¶

Let's first set up the callbacks for the TensorBoard. Then we can define a function build_model that will evaluate different model parameters during the hyperparameter tuning process. The parameters to test are:

num_layers: 7 - 13 layers
layer_size: 40 - 70 nodes using step=5
learning_rate: 1 x 10^-1, 1 x 10^-2, 1 x 10^-3

The same 30% Dropout before the output layer as well as the same loss='mae' and metrics=['mse'] can be utilized to compare the aforementioned hyperparameters.

In [ ]:

from tensorflow.keras.callbacks import TensorBoard
from tensorflow import keras

filepath = 'MLP_weights_only_b16_20_HPO5.h5'
checkpoint_dir = os.path.dirname(filepath)

callbacks = [TensorBoard(log_dir=log_folder,
                         histogram_freq=1,
                         write_graph=True,
                         write_images=True,
                         update_freq='epoch',
                         profile_batch=1,
                         embeddings_freq=1)]

callbacks_list = [callbacks]

def build_model(hp):
    model = keras.Sequential()
    for i in range(hp.Int('num_layers', 7, 13)):
        model.add(tf.keras.layers.Dense(units=hp.Int('layer_size' + str(i),
                                                     min_value=40, max_value=70,
                                                     step=5),
                                        activation='relu',
                                        kernel_initializer='normal'))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(1))
    model.compile(loss='mae', metrics=['mse'],
                  optimizer=tf.keras.optimizers.Adam(
                      hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])))

    return model

We can define the conditions for keras_tuner which include the objective='val_loss', the number of trials to search through (max_trials=25), the length of each trial (executions_per_trial=1), and the directory and project_name. Then we can now print a summary of the search space.

In [ ]:

import keras_tuner
from keras_tuner import BayesianOptimization

tuner = BayesianOptimization(
    build_model,
    objective='val_loss',
    max_trials=20,
    executions_per_trial=1,
    overwrite=True,
    directory='MLP_20_HPO5sc',
    project_name='MLP_20_HPO5')

tuner.search_space_summary()

Search space summary
Default search space size: 9
num_layers (Int)
{'default': None, 'conditions': [], 'min_value': 7, 'max_value': 13, 'step': 1, 'sampling': None}
layer_size0 (Int)
{'default': None, 'conditions': [], 'min_value': 40, 'max_value': 70, 'step': 5, 'sampling': None}
layer_size1 (Int)
{'default': None, 'conditions': [], 'min_value': 40, 'max_value': 70, 'step': 5, 'sampling': None}
layer_size2 (Int)
{'default': None, 'conditions': [], 'min_value': 40, 'max_value': 70, 'step': 5, 'sampling': None}
layer_size3 (Int)
{'default': None, 'conditions': [], 'min_value': 40, 'max_value': 70, 'step': 5, 'sampling': None}
layer_size4 (Int)
{'default': None, 'conditions': [], 'min_value': 40, 'max_value': 70, 'step': 5, 'sampling': None}
layer_size5 (Int)
{'default': None, 'conditions': [], 'min_value': 40, 'max_value': 70, 'step': 5, 'sampling': None}
layer_size6 (Int)
{'default': None, 'conditions': [], 'min_value': 40, 'max_value': 70, 'step': 5, 'sampling': None}
learning_rate (Choice)
{'default': 0.01, 'conditions': [], 'values': [0.01, 0.001, 0.0001], 'ordered': True}

Let's now begin the search for the best hyperparameters testing different parameters for one epoch using validation_split=0.2 and batch_size=16.

In [ ]:

tuner.search(X_train, y_train, epochs=1, validation_split=0.2, batch_size=16,
             callbacks=callbacks)

Trial 20 Complete [00h 06m 04s]
val_loss: 9.262261390686035

Best val_loss So Far: 8.955842971801758
Total elapsed time: 02h 19m 46s
INFO:tensorflow:Oracle triggered exit

Now that the search has completed, let's print a summary of the results from the trials.

In [ ]:

tuner.results_summary()

Results summary
Results in MLP_20_HPO5sc/MLP_20_HPO5
Showing 10 best trials
<keras_tuner.engine.objective.Objective object at 0x7f0e49240a00>
Trial summary
Hyperparameters:
num_layers: 11
layer_size0: 65
layer_size1: 65
layer_size2: 60
layer_size3: 45
layer_size4: 60
layer_size5: 65
layer_size6: 45
learning_rate: 0.001
layer_size7: 65
layer_size8: 65
layer_size9: 50
layer_size10: 45
layer_size11: 50
layer_size12: 40
Score: 8.955842971801758
Trial summary
Hyperparameters:
num_layers: 9
layer_size0: 70
layer_size1: 70
layer_size2: 40
layer_size3: 40
layer_size4: 70
layer_size5: 70
layer_size6: 40
learning_rate: 0.001
layer_size7: 70
layer_size8: 55
layer_size9: 60
layer_size10: 40
layer_size11: 55
layer_size12: 60
Score: 9.026692390441895
Trial summary
Hyperparameters:
num_layers: 13
layer_size0: 70
layer_size1: 70
layer_size2: 45
layer_size3: 40
layer_size4: 70
layer_size5: 70
layer_size6: 50
learning_rate: 0.0001
layer_size7: 60
layer_size8: 70
layer_size9: 70
layer_size10: 40
layer_size11: 65
layer_size12: 50
Score: 9.05501937866211
Trial summary
Hyperparameters:
num_layers: 9
layer_size0: 70
layer_size1: 70
layer_size2: 40
layer_size3: 40
layer_size4: 60
layer_size5: 70
layer_size6: 70
learning_rate: 0.0001
layer_size7: 70
layer_size8: 70
layer_size9: 55
layer_size10: 40
layer_size11: 40
layer_size12: 70
Score: 9.123533248901367
Trial summary
Hyperparameters:
num_layers: 7
layer_size0: 70
layer_size1: 70
layer_size2: 40
layer_size3: 40
layer_size4: 70
layer_size5: 40
layer_size6: 40
learning_rate: 0.0001
layer_size7: 70
layer_size8: 70
layer_size9: 65
layer_size10: 40
layer_size11: 50
layer_size12: 55
Score: 9.15290355682373
Trial summary
Hyperparameters:
num_layers: 9
layer_size0: 70
layer_size1: 70
layer_size2: 60
layer_size3: 40
layer_size4: 65
layer_size5: 70
layer_size6: 45
learning_rate: 0.0001
layer_size7: 70
layer_size8: 70
layer_size9: 45
layer_size10: 40
layer_size11: 60
layer_size12: 40
Score: 9.155122756958008
Trial summary
Hyperparameters:
num_layers: 7
layer_size0: 70
layer_size1: 70
layer_size2: 40
layer_size3: 40
layer_size4: 70
layer_size5: 70
layer_size6: 40
learning_rate: 0.0001
layer_size7: 70
layer_size8: 70
layer_size9: 70
layer_size10: 70
layer_size11: 45
layer_size12: 70
Score: 9.19211196899414
Trial summary
Hyperparameters:
num_layers: 12
layer_size0: 70
layer_size1: 70
layer_size2: 60
layer_size3: 40
layer_size4: 70
layer_size5: 70
layer_size6: 40
learning_rate: 0.0001
layer_size7: 70
layer_size8: 40
layer_size9: 50
layer_size10: 40
layer_size11: 60
layer_size12: 40
Score: 9.195192337036133
Trial summary
Hyperparameters:
num_layers: 9
layer_size0: 70
layer_size1: 70
layer_size2: 60
layer_size3: 40
layer_size4: 70
layer_size5: 70
layer_size6: 50
learning_rate: 0.0001
layer_size7: 70
layer_size8: 55
layer_size9: 70
layer_size10: 40
layer_size11: 40
layer_size12: 45
Score: 9.203907012939453
Trial summary
Hyperparameters:
num_layers: 8
layer_size0: 70
layer_size1: 70
layer_size2: 40
layer_size3: 70
layer_size4: 70
layer_size5: 70
layer_size6: 40
learning_rate: 0.0001
layer_size7: 70
layer_size8: 70
layer_size9: 70
layer_size10: 40
layer_size11: 50
layer_size12: 70
Score: 9.207023620605469

From the search, the model/hyperparameters that resulted in the lowest val_loss utilizes a network containing 11 layers with 5 of the layers containing 65 nodes, 2 with 60, 1 with 50, 3 with 45 nodes and a learning_rate: 0.001. However, when fitting this model and the second best from the search using batch_size=32, there was a high discrepancy between the train and test metrics. Therefore, the third selected from the search with a more complex model architecture was utilized. This potentially suggests that multiple epochs should be used during the HPO trials.

`Fit The Best Model from HPO - Batch Size = 32`

Let's now set the path where model is saved and set up the callbacks with EarlyStopping that monitors the loss and will stop training if it does not improve after 5 epochs, the ModelCheckpoint that monitors the mse saving only the best one with a min loss and the TensorBoard as well.

In [ ]:

import datetime
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')

%load_ext tensorboard

filepath = 'MLP_weights_only_train20_b32_sc_epochs30_HPO5batch16.h5'
checkpoint_dir = os.path.dirname(filepath)

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
                                                      histogram_freq=1)

callbacks_list = [EarlyStopping(monitor='loss', patience=5),
                  ModelCheckpoint(filepath, monitor='mse',
                                  save_freq='epoch'),
                  tensorboard_callback]

We can now define the model architecture using the results from keras-tuner search, compile and then examine.

In [ ]:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(70, input_dim=34, kernel_initializer='normal',
                activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(40, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(40, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(40, kernel_initializer='normal', activation='relu'))
model.add(Dense(60, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1))

opt = tf.keras.optimizers.Adam(learning_rate=0.0001)
model.compile(loss='mae', metrics=['mse'], optimizer=opt)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 70)                2450      
                                                                 
 dense_1 (Dense)             (None, 70)                4970      
                                                                 
 dense_2 (Dense)             (None, 40)                2840      
                                                                 
 dense_3 (Dense)             (None, 70)                2870      
                                                                 
 dense_4 (Dense)             (None, 70)                4970      
                                                                 
 dense_5 (Dense)             (None, 40)                2840      
                                                                 
 dense_6 (Dense)             (None, 70)                2870      
                                                                 
 dense_7 (Dense)             (None, 70)                4970      
                                                                 
 dense_8 (Dense)             (None, 40)                2840      
                                                                 
 dense_9 (Dense)             (None, 60)                2460      
                                                                 
 dense_10 (Dense)            (None, 70)                4270      
                                                                 
 dense_11 (Dense)            (None, 70)                4970      
                                                                 
 dense_12 (Dense)            (None, 70)                4970      
                                                                 
 dropout (Dropout)           (None, 70)                0         
                                                                 
 dense_13 (Dense)            (None, 1)                 71        
                                                                 
=================================================================
Total params: 48,361
Trainable params: 48,361
Non-trainable params: 0
_________________________________________________________________

Compared to the total model parameters from the baseline model, there are 48,361 vs. 1,891 trainable parameters. Now, the model can be trained by calling fit utilizing the train dataset for 30 epochs using a batch_size=32, validation_split=0.2 and the specified callbacks from the callbacks_list.

In [ ]:

history = model.fit(X_train, y_train, epochs=30, batch_size=32,
                    validation_split=0.2, callbacks=callbacks_list)

Epoch 1/30
65306/65306 [==============================] - 202s 3ms/step - loss: 10.5622 - mse: 616.6284 - val_loss: 9.2018 - val_mse: 480.5653
Epoch 2/30
65306/65306 [==============================] - 197s 3ms/step - loss: 9.1723 - mse: 460.2661 - val_loss: 8.7241 - val_mse: 440.4626
Epoch 3/30
65306/65306 [==============================] - 195s 3ms/step - loss: 8.7572 - mse: 423.4736 - val_loss: 8.4677 - val_mse: 403.9790
Epoch 4/30
65306/65306 [==============================] - 196s 3ms/step - loss: 8.4687 - mse: 400.2460 - val_loss: 7.9495 - val_mse: 373.3796
Epoch 5/30
65306/65306 [==============================] - 197s 3ms/step - loss: 8.2474 - mse: 385.6289 - val_loss: 7.8316 - val_mse: 367.0729
Epoch 6/30
65306/65306 [==============================] - 202s 3ms/step - loss: 8.0691 - mse: 373.2658 - val_loss: 7.8150 - val_mse: 364.5799
Epoch 7/30
65306/65306 [==============================] - 199s 3ms/step - loss: 7.9329 - mse: 364.4851 - val_loss: 7.7372 - val_mse: 365.7344
Epoch 8/30
65306/65306 [==============================] - 216s 3ms/step - loss: 7.8281 - mse: 357.2533 - val_loss: 7.5337 - val_mse: 335.8542
Epoch 9/30
65306/65306 [==============================] - 197s 3ms/step - loss: 7.7374 - mse: 350.9850 - val_loss: 7.5800 - val_mse: 340.1299
Epoch 10/30
65306/65306 [==============================] - 198s 3ms/step - loss: 7.6584 - mse: 345.9384 - val_loss: 7.3902 - val_mse: 335.4782
Epoch 11/30
65306/65306 [==============================] - 197s 3ms/step - loss: 7.5804 - mse: 340.2671 - val_loss: 7.2800 - val_mse: 334.8559
Epoch 12/30
65306/65306 [==============================] - 197s 3ms/step - loss: 7.5276 - mse: 335.6842 - val_loss: 7.2888 - val_mse: 328.3452
Epoch 13/30
65306/65306 [==============================] - 200s 3ms/step - loss: 7.4707 - mse: 331.5752 - val_loss: 7.1061 - val_mse: 322.4780
Epoch 14/30
65306/65306 [==============================] - 201s 3ms/step - loss: 7.4224 - mse: 328.5333 - val_loss: 7.2232 - val_mse: 325.9990
Epoch 15/30
65306/65306 [==============================] - 196s 3ms/step - loss: 7.3809 - mse: 325.5475 - val_loss: 7.1130 - val_mse: 311.7838
Epoch 16/30
65306/65306 [==============================] - 197s 3ms/step - loss: 7.3448 - mse: 322.4320 - val_loss: 7.0916 - val_mse: 316.5891
Epoch 17/30
65306/65306 [==============================] - 219s 3ms/step - loss: 7.3019 - mse: 318.9111 - val_loss: 7.0431 - val_mse: 313.3865
Epoch 18/30
65306/65306 [==============================] - 199s 3ms/step - loss: 7.2739 - mse: 317.5409 - val_loss: 6.9815 - val_mse: 310.1332
Epoch 19/30
65306/65306 [==============================] - 218s 3ms/step - loss: 7.2383 - mse: 314.6888 - val_loss: 6.9866 - val_mse: 305.0808
Epoch 20/30
65306/65306 [==============================] - 200s 3ms/step - loss: 7.2087 - mse: 312.7014 - val_loss: 6.9089 - val_mse: 308.6000
Epoch 21/30
65306/65306 [==============================] - 200s 3ms/step - loss: 7.1797 - mse: 309.9180 - val_loss: 6.9731 - val_mse: 308.7126
Epoch 22/30
65306/65306 [==============================] - 199s 3ms/step - loss: 7.1490 - mse: 308.0104 - val_loss: 6.8680 - val_mse: 303.7058
Epoch 23/30
65306/65306 [==============================] - 201s 3ms/step - loss: 7.1248 - mse: 306.8813 - val_loss: 6.9280 - val_mse: 310.3242
Epoch 24/30
65306/65306 [==============================] - 199s 3ms/step - loss: 7.0930 - mse: 304.5747 - val_loss: 6.7913 - val_mse: 296.0530
Epoch 25/30
65306/65306 [==============================] - 204s 3ms/step - loss: 7.0778 - mse: 303.6190 - val_loss: 6.8033 - val_mse: 298.0113
Epoch 26/30
65306/65306 [==============================] - 216s 3ms/step - loss: 7.0583 - mse: 301.3407 - val_loss: 6.7710 - val_mse: 299.3239
Epoch 27/30
65306/65306 [==============================] - 216s 3ms/step - loss: 7.0341 - mse: 300.0639 - val_loss: 6.8074 - val_mse: 295.3144
Epoch 28/30
65306/65306 [==============================] - 216s 3ms/step - loss: 7.0188 - mse: 298.2793 - val_loss: 6.8011 - val_mse: 298.3897
Epoch 29/30
65306/65306 [==============================] - 198s 3ms/step - loss: 6.9904 - mse: 297.7886 - val_loss: 6.8819 - val_mse: 304.5460
Epoch 30/30
65306/65306 [==============================] - 200s 3ms/step - loss: 6.9677 - mse: 296.0454 - val_loss: 6.8377 - val_mse: 299.8877

Now, let's save the model if we need to reload it in the future and plot the model loss and val_loss over the training epochs.

In [ ]:

model.save('./MLP_batch32_sc_30Epochs_HPO5batch32_train20_tf.h5',
           save_format='tf')

# Load model for more training or later use
#filepath = 'MLP_weights_only_train20_b32_sc_epochs30_HPO5batch16.h5'
#model = tf.keras.models.load_model('./MLP_batch32_30Epochs_HPO5batch16_train20_tf.h5')
#model.load_weights(filepath)

plt.title('Model Error for Metric Tonnage')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='val_loss')
plt.ylabel('Error [Metric Tonnage]')
plt.xlabel('Epoch')
plt.legend()
plt.show()

We can save the model losses in a pandas.Dataframe, and plot both the mae and mse from the train and validation sets over the epochs as well.

In [ ]:

losses = pd.DataFrame(model.history.history)
losses.plot()
plt.title('Model Error for Metric Tonnage')
plt.ylabel('Error [Metric Tonnage]')
plt.xlabel('Epoch')
plt.show()

We can use model.predict to generate output predictions for the specified input samples so let's predict using the training set, convert the predicted metric tons to a pandas.Dataframe and examine the size to determine chunksize for plotting. Let's graph the predicted vs. actual metric tonnage for the train set.

In [ ]:

pred_train = model.predict(X_train)
y_pred = pd.DataFrame(pred_train)

81633/81633 [==============================] - 96s 1ms/step

In [ ]:

plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Training Set: Predicted Metric Tons vs Actual Metric Tons',
           fontsize=25)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y_train, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()

We can use model.predict using the test set, convert the predicted metric tons to a pandas.Dataframe and graph the predicted vs. actual metric tonnage for the test set.

In [ ]:

pred_test = model.predict(X_test)
y_pred = pd.DataFrame(pred_test)

20409/20409 [==============================] - 24s 1ms/step

In [ ]:

plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Test Set: Predicted Metric Tons vs Actual Metric Tons', fontsize=30)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y_test, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()

To evaluate if this MLP is effective at predicting metric tonnage, let's evaluate the MAE, MSE, RMSE and the R² for both the training and test sets.

In [ ]:

print('Metrics: Train set')
print('MAE: %3f' % mean_absolute_error(y_train[:], pred_train[:]))
print('MSE: %3f' % mean_squared_error(y_train[:], pred_train[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_train[:], pred_train[:])))
print('R2: %3f' % r2_score(y_train[:], pred_train[:]))

Metrics: Train set
MAE: 6.743100
MSE: 290.759245
RMSE: 17.051664
R2: 0.691973

In [ ]:

print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y_test[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y_test[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_test[:], pred_test[:])))
print('R2: %3f' % r2_score(y_test[:], pred_test[:]))

Metrics: Test set
MAE: 7.666669
MSE: 351.065706
RMSE: 18.736747
R2: 0.627629

When comparing the metrics from the training and test sets, the errors are higher in the test set, but they are close in proximity. The R² is slightly lower in the test set compared to the R² from the training set. When comparing the metrics from the model using HPO to the baseline model, all of the MAE, MSE, and RMSE metrics are lower and the $^{2}$ is higher for both the train and test sets.

Let's also examine how the predicted values for the maximum, average and minimum metric tonnages compares to the actual maximum, average and minimum metric tonnage.

In [ ]:

print('Maximum Metric Tons:', np.amax(y_test))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons:', np.average(y_test))
print('Predicted Average Metric Tons:', np.average(pred_test))
print('\nMinimum Metric Tons:', np.amin(y_test))
print('Predicted Minimum Metric Tons:', np.amin(pred_test))

Maximum Metric Tons: 249.99
Predicted Max Metric Tons: 246.31972

Average Metric Tons: 21.5239324438755
Predicted Average Metric Tons: 19.69039

Minimum Metric Tons: 0.0
Predicted Minimum Metric Tons: -0.21363783

`Test on 2019`

Let's now prepare the 2019 set by partitioning the data and processing the features. We can use model.predict using the test set, convert the predicted metric tons to a pandas.Dataframe and graph the predicted vs. actual metric tonnage for the 2019 set.

In [ ]:

X = df1.drop(['Metric_Tons', 'DateTime_YearWeek'], axis=1)
y = df1['Metric_Tons']

X = ce_ord.fit_transform(X)
X = pd.get_dummies(X, drop_first=True)
X = pd.DataFrame(sc.transform(X))

pred_test = model.predict(X)
y_pred = pd.DataFrame(pred_test)

105266/105266 [==============================] - 125s 1ms/step

In [ ]:

plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Train 2020 Test 2019: Predicted Metric Tons vs Actual Metric Tons',
           fontsize=30)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()

To evaluate if this MLP is effective at predicting metric tonnage, let's evaluate the MAE, MSE, RMSE and the R² for the 2020 model using the 2019 set. We can also examine how the predicted values for the maximum, average and minimum metric tonnage compare to the actual maximum, average and minimum metric tonnage.

In [ ]:

print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y[:], pred_test[:])))
print('R2: %3f' % r2_score(y[:], pred_test[:]))

Metrics: Test set
MAE: 9.274905
MSE: 501.066718
RMSE: 22.384520
R2: 0.502192

In [ ]:

print('Maximum Metric Tons:', np.amax(y))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons:', np.average(y))
print('Predicted Average Metric Tons:', np.average(pred_test))
print('\nMinimum Metric Tons:', np.amin(y))
print('Predicted Minimum Metric Tons:', np.amin(pred_test))

Maximum Metric Tons: 249.99
Predicted Max Metric Tons: 258.91766

Average Metric Tons: 21.897900731099753
Predicted Average Metric Tons: 18.422684

Minimum Metric Tons: 0.0
Predicted Minimum Metric Tons: -0.4527645

For the 2020 HPO model using the 2019 data, there is an even higher predicted maximum metric tonnage than the actual maximum metric tons (258.91766 vs. 246.31972), a lower predicted minimum (-0.4527645 vs. -0.21363783) compared to the actual and an even lower predicted average (18.422684 vs. 19.69039).

`Train 2019 HPO3`¶

Let's first set up the callbacks for the TensorBoard. Then we can define a function build_model that will evaluate different model parameters during the hyperparameter tuning process. The parameters to test are:

num_layers: 4 - 10 layers
layer_size: 20 - 70 nodes using step=5
learning_rate: 1 x 10^-1, 1 x 10^-2, 1 x 10^-3

The same 30% Dropout before the output layer as well as the same loss='mae' and metrics=['mse'] can be utilized to compare the aforementioned hyperparameters.

In [ ]:

callbacks = [TensorBoard(log_dir=log_folder,
                         histogram_freq=1,
                         write_graph=True,
                         write_images=True,
                         update_freq='epoch',
                         profile_batch=1,
                         embeddings_freq=1)]

def build_model(hp):
    model = keras.Sequential()
    for i in range(hp.Int('num_layers', 4, 10)):
        model.add(tf.keras.layers.Dense(units=hp.Int('layer_size' + str(i),
                                                     min_value=20, max_value=70,
                                                     step=5),
                                        activation='relu',
                                        kernel_initializer='normal'))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(1))
    model.compile(loss='mae', metrics=['mse'], optimizer=keras.optimizers.Adam(
        hp.Choice('learning_rate', values=[1e-1, 1e-2, 1e-3])))

    return model

We can define the conditions for keras_tuner which include the objective='val_loss', the number of trials to search through (max_trials=25), the length of each trial (executions_per_trial=1), and the directory and project_name. Then we can now print a summary of the search space.

In [ ]:

tuner = BayesianOptimization(
    build_model,
    objective='val_loss',
    max_trials=20,
    executions_per_trial=1,
    overwrite=True,
    directory='MLP_19_HPO3sc',
    project_name='MLP_19_HPO3')

tuner.search_space_summary()

Search space summary
Default search space size: 6
num_layers (Int)
{'default': None, 'conditions': [], 'min_value': 4, 'max_value': 10, 'step': 1, 'sampling': None}
layer_size0 (Int)
{'default': None, 'conditions': [], 'min_value': 20, 'max_value': 70, 'step': 5, 'sampling': None}
layer_size1 (Int)
{'default': None, 'conditions': [], 'min_value': 20, 'max_value': 70, 'step': 5, 'sampling': None}
layer_size2 (Int)
{'default': None, 'conditions': [], 'min_value': 20, 'max_value': 70, 'step': 5, 'sampling': None}
layer_size3 (Int)
{'default': None, 'conditions': [], 'min_value': 20, 'max_value': 70, 'step': 5, 'sampling': None}
learning_rate (Choice)
{'default': 0.1, 'conditions': [], 'values': [0.1, 0.01, 0.001], 'ordered': True}

Let's now begin the search for the best hyperparameters testing different parameters for one epoch using validation_split=0.2 and batch_size=16.

In [ ]:

tuner.search(X_train, y_train, epochs=1, validation_split=0.2, batch_size=16,
             callbacks=callbacks)

Trial 20 Complete [00h 05m 59s]
val_loss: 9.149958610534668

Best val_loss So Far: 8.78332805633545
Total elapsed time: 02h 01m 36s
INFO:tensorflow:Oracle triggered exit

Now that the search has completed, let's print a summary of the results from the trials.

In [ ]:

tuner.results_summary()

Results summary
Results in MLP_19_HPO3sc/MLP_19_HPO3
Showing 10 best trials
<keras_tuner.engine.objective.Objective object at 0x7fdd1d8178e0>
Trial summary
Hyperparameters:
num_layers: 4
layer_size0: 70
layer_size1: 70
layer_size2: 70
layer_size3: 70
learning_rate: 0.001
layer_size4: 70
layer_size5: 55
layer_size6: 70
layer_size7: 70
layer_size8: 70
layer_size9: 70
Score: 8.78332805633545
Trial summary
Hyperparameters:
num_layers: 4
layer_size0: 70
layer_size1: 65
layer_size2: 70
layer_size3: 20
learning_rate: 0.001
layer_size4: 70
layer_size5: 70
layer_size6: 70
layer_size7: 70
Score: 8.963759422302246
Trial summary
Hyperparameters:
num_layers: 4
layer_size0: 70
layer_size1: 70
layer_size2: 70
layer_size3: 20
learning_rate: 0.001
layer_size4: 20
layer_size5: 20
layer_size6: 55
layer_size7: 70
layer_size8: 55
layer_size9: 70
Score: 8.96766471862793
Trial summary
Hyperparameters:
num_layers: 4
layer_size0: 70
layer_size1: 45
layer_size2: 70
layer_size3: 50
learning_rate: 0.001
layer_size4: 20
layer_size5: 70
layer_size6: 70
layer_size7: 70
layer_size8: 70
layer_size9: 20
Score: 8.995427131652832
Trial summary
Hyperparameters:
num_layers: 8
layer_size0: 70
layer_size1: 70
layer_size2: 70
layer_size3: 20
learning_rate: 0.001
layer_size4: 70
layer_size5: 20
layer_size6: 70
layer_size7: 70
layer_size8: 70
layer_size9: 20
Score: 9.014375686645508
Trial summary
Hyperparameters:
num_layers: 10
layer_size0: 70
layer_size1: 70
layer_size2: 70
layer_size3: 20
learning_rate: 0.001
layer_size4: 70
layer_size5: 70
layer_size6: 20
layer_size7: 70
layer_size8: 30
layer_size9: 70
Score: 9.048048973083496
Trial summary
Hyperparameters:
num_layers: 10
layer_size0: 70
layer_size1: 70
layer_size2: 70
layer_size3: 20
learning_rate: 0.001
layer_size4: 20
layer_size5: 70
layer_size6: 70
layer_size7: 70
layer_size8: 70
layer_size9: 70
Score: 9.082844734191895
Trial summary
Hyperparameters:
num_layers: 4
layer_size0: 70
layer_size1: 70
layer_size2: 20
layer_size3: 20
learning_rate: 0.001
layer_size4: 70
layer_size5: 20
layer_size6: 70
layer_size7: 70
layer_size8: 70
layer_size9: 70
Score: 9.098869323730469
Trial summary
Hyperparameters:
num_layers: 4
layer_size0: 70
layer_size1: 25
layer_size2: 70
layer_size3: 20
learning_rate: 0.001
layer_size4: 70
layer_size5: 20
layer_size6: 20
layer_size7: 70
layer_size8: 70
layer_size9: 70
Score: 9.108147621154785
Trial summary
Hyperparameters:
num_layers: 4
layer_size0: 70
layer_size1: 70
layer_size2: 70
layer_size3: 20
learning_rate: 0.001
layer_size4: 70
layer_size5: 70
layer_size6: 20
layer_size7: 70
layer_size8: 70
layer_size9: 20
Score: 9.124605178833008

From the search, the model/hyperparameters that resulted in the lowest val_loss utilizes a network containing 4 layers with all the layers containing layer_size=70 and a learning_rate: 0.001. The top 4 networks with the lowest val_loss examined contained 4 layers as well.

`Fit The Best Model from HPO - Batch Size = 32`

Let's now set the path where model is saved and set up the callbacks with EarlyStopping that monitors the loss and will stop training if it does not improve after 5 epochs, the ModelCheckpoint that monitors the mse saving only the best one with a min loss and the TensorBoard as well.

In [ ]:

log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')

%load_ext tensorboard

filepath = 'MLP_weights_only_train19_b32_sc_epochs30_HPO3batch16.h5'
checkpoint_dir = os.path.dirname(filepath)

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
                                                      histogram_freq=1)

callbacks_list = [EarlyStopping(monitor='loss', patience=5),
                  ModelCheckpoint(filepath, monitor='mse',
                                  save_freq='epoch'),
                  tensorboard_callback]

We can now define the model architecture using the results from keras-tuner search, compile and then examine.

In [ ]:

model = Sequential()
model.add(Dense(70, input_dim=34, kernel_initializer='normal',
                activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1))

opt = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='mae', metrics=['mse'], optimizer=opt)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 70)                2450      
                                                                 
 dense_1 (Dense)             (None, 70)                4970      
                                                                 
 dense_2 (Dense)             (None, 70)                4970      
                                                                 
 dense_3 (Dense)             (None, 70)                4970      
                                                                 
 dropout (Dropout)           (None, 70)                0         
                                                                 
 dense_4 (Dense)             (None, 1)                 71        
                                                                 
=================================================================
Total params: 17,431
Trainable params: 17,431
Non-trainable params: 0
_________________________________________________________________

Compared to the total model parameters from the baseline model, there are 17,431 vs. 1,891 trainable parameters. Now, the model can be trained by calling fit utilizing the train dataset for 30 epochs using a batch_size=32, validation_split=0.2 and the specified callbacks from the callbacks_list.

In [ ]:

history = model.fit(X_train, y_train, epochs=30, batch_size=32,
                    validation_split=0.2, callbacks=callbacks_list)

Epoch 1/30
67370/67370 [==============================] - 157s 2ms/step - loss: 9.7954 - mse: 562.1500 - val_loss: 8.9230 - val_mse: 472.9009
Epoch 2/30
67370/67370 [==============================] - 155s 2ms/step - loss: 8.7101 - mse: 441.4602 - val_loss: 8.1464 - val_mse: 393.8015
Epoch 3/30
67370/67370 [==============================] - 155s 2ms/step - loss: 8.3865 - mse: 414.2169 - val_loss: 8.0485 - val_mse: 391.1299
Epoch 4/30
67370/67370 [==============================] - 156s 2ms/step - loss: 8.1894 - mse: 397.6087 - val_loss: 7.9009 - val_mse: 378.4498
Epoch 5/30
67370/67370 [==============================] - 157s 2ms/step - loss: 8.0403 - mse: 386.8266 - val_loss: 7.8773 - val_mse: 373.4080
Epoch 6/30
67370/67370 [==============================] - 156s 2ms/step - loss: 7.9302 - mse: 374.9509 - val_loss: 7.6887 - val_mse: 368.6131
Epoch 7/30
67370/67370 [==============================] - 155s 2ms/step - loss: 7.8311 - mse: 365.0150 - val_loss: 7.5164 - val_mse: 353.1789
Epoch 8/30
67370/67370 [==============================] - 157s 2ms/step - loss: 7.7483 - mse: 358.6936 - val_loss: 7.6674 - val_mse: 361.7786
Epoch 9/30
67370/67370 [==============================] - 154s 2ms/step - loss: 7.6859 - mse: 352.9072 - val_loss: 7.7565 - val_mse: 377.6983
Epoch 10/30
67370/67370 [==============================] - 155s 2ms/step - loss: 7.6230 - mse: 347.4040 - val_loss: 7.4072 - val_mse: 341.1722
Epoch 11/30
67370/67370 [==============================] - 155s 2ms/step - loss: 7.5806 - mse: 344.2314 - val_loss: 7.2431 - val_mse: 327.4524
Epoch 12/30
67370/67370 [==============================] - 155s 2ms/step - loss: 7.5302 - mse: 340.2437 - val_loss: 7.2277 - val_mse: 326.0204
Epoch 13/30
67370/67370 [==============================] - 156s 2ms/step - loss: 7.4966 - mse: 338.0010 - val_loss: 7.3049 - val_mse: 331.3467
Epoch 14/30
67370/67370 [==============================] - 157s 2ms/step - loss: 7.4587 - mse: 334.6683 - val_loss: 7.1597 - val_mse: 325.8457
Epoch 15/30
67370/67370 [==============================] - 155s 2ms/step - loss: 7.4318 - mse: 332.6233 - val_loss: 7.1605 - val_mse: 321.1693
Epoch 16/30
67370/67370 [==============================] - 156s 2ms/step - loss: 7.3963 - mse: 330.0112 - val_loss: 7.1089 - val_mse: 319.6627
Epoch 17/30
67370/67370 [==============================] - 156s 2ms/step - loss: 7.3679 - mse: 328.0836 - val_loss: 7.1556 - val_mse: 321.2764
Epoch 18/30
67370/67370 [==============================] - 156s 2ms/step - loss: 7.3435 - mse: 326.5274 - val_loss: 7.3173 - val_mse: 327.5416
Epoch 19/30
67370/67370 [==============================] - 156s 2ms/step - loss: 7.3265 - mse: 325.1121 - val_loss: 7.2172 - val_mse: 323.8383
Epoch 20/30
67370/67370 [==============================] - 157s 2ms/step - loss: 7.3046 - mse: 323.0576 - val_loss: 6.9875 - val_mse: 309.2887
Epoch 21/30
67370/67370 [==============================] - 150s 2ms/step - loss: 7.2891 - mse: 322.2154 - val_loss: 6.9608 - val_mse: 314.0514
Epoch 22/30
67370/67370 [==============================] - 151s 2ms/step - loss: 7.2642 - mse: 319.9480 - val_loss: 7.0681 - val_mse: 317.7403
Epoch 23/30
67370/67370 [==============================] - 155s 2ms/step - loss: 7.2547 - mse: 319.6288 - val_loss: 6.9507 - val_mse: 305.9771
Epoch 24/30
67370/67370 [==============================] - 152s 2ms/step - loss: 7.2358 - mse: 317.6536 - val_loss: 6.9533 - val_mse: 307.8425
Epoch 25/30
67370/67370 [==============================] - 153s 2ms/step - loss: 7.2212 - mse: 316.5201 - val_loss: 6.9612 - val_mse: 305.1148
Epoch 26/30
67370/67370 [==============================] - 153s 2ms/step - loss: 7.2095 - mse: 316.0476 - val_loss: 6.9824 - val_mse: 308.0117
Epoch 27/30
67370/67370 [==============================] - 154s 2ms/step - loss: 7.1944 - mse: 315.4917 - val_loss: 6.9603 - val_mse: 305.8435
Epoch 28/30
67370/67370 [==============================] - 154s 2ms/step - loss: 7.1812 - mse: 315.8129 - val_loss: 7.0363 - val_mse: 313.6278
Epoch 29/30
67370/67370 [==============================] - 158s 2ms/step - loss: 7.1701 - mse: 313.2158 - val_loss: 6.9753 - val_mse: 306.7173
Epoch 30/30
67370/67370 [==============================] - 155s 2ms/step - loss: 7.1660 - mse: 312.9197 - val_loss: 6.8919 - val_mse: 303.6030

Now, let's save the model if we need to reload it in the future and plot the model loss and val_loss over the training epochs.

In [ ]:

model.save('./MLP_batch32_sc_30Epochs_HPO3batch16_train19_tf.h5',
           save_format='tf')

plt.title('Model Error for Metric Tonnage')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='val_loss')
plt.ylabel('Error [Metric Tonnage]')
plt.xlabel('Epoch')
plt.legend()
plt.show()

We can save the model losses in a pandas.Dataframe, and plot both the mae and mse from the train and validation sets over the epochs as well.

In [ ]:

losses = pd.DataFrame(model.history.history)
losses.plot()
plt.title('Model Error for Metric Tonnage')
plt.ylabel('Error [Metric Tonnage]')
plt.xlabel('Epoch')
plt.show()

We can use model.predict to generate output predictions for the specified input samples so let's predict using the training set, convert the predicted metric tons to a pandas.Dataframe and examine the size to determine chunksize for plotting. Let's graph the predicted vs. actual metric tonnage for the train set.

In [ ]:

pred_train = model.predict(X_train)
y_pred = pd.DataFrame(pred_train)

84213/84213 [==============================] - 81s 955us/step

In [ ]:

plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Training Set: Predicted Metric Tons vs Actual Metric Tons',
           fontsize=25)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y_train, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()

We can use model.predict using the test set, convert the predicted metric tons to a pandas.Dataframe and graph the predicted vs. actual metric tonnage for the test set.

In [ ]:

pred_test = model.predict(X_test)
y_pred = pd.DataFrame(pred_test)

21054/21054 [==============================] - 21s 975us/step

In [ ]:

plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Test Set: Predicted Metric Tons vs Actual Metric Tons', fontsize=30)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y_test, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()

To evaluate if this MLP is effective at predicting metric tonnage, let's evaluate the MAE, MSE, RMSE and the R² for both the training and test sets.

In [ ]:

print('Metrics: Train set')
print('MAE: %3f' % mean_absolute_error(y_train[:], pred_train[:]))
print('MSE: %3f' % mean_squared_error(y_train[:], pred_train[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_train[:], pred_train[:])))
print('R2: %3f' % r2_score(y_train[:], pred_train[:]))

Metrics: Train set
MAE: 6.821114
MSE: 297.921297
RMSE: 17.260397
R2: 0.703971

In [ ]:

print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y_test[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y_test[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_test[:], pred_test[:])))
print('R2: %3f' % r2_score(y_test[:], pred_test[:]))

Metrics: Test set
MAE: 8.747238
MSE: 435.549051
RMSE: 20.869812
R2: 0.567549

When comparing the metrics from the training and test sets, the errors are higher in the test set, but the distance between the errors for the HPO models for the train and test sets is greater than the distance for the 2020 HPO model. The R² for the train set is higher than the R² from the train/test sets using the 2020 HPO model while the test R² from the 2019 HPO model is lower than both of the R² from the 2020 HPO model. When comparing the metrics from the model using 2019 HPO model to the baseline model, all of the MAE, MSE, and RMSE metrics are lower and the $^{2}$ is higher for both the train and test sets.

Let's also examine how the predicted values for the maximum, average and minimum metric tonnage compares to the actual maximum, average and minimum metric tonnage.

In [ ]:

print('Maximum Metric Tons:', np.amax(y_test))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons:', np.average(y_test))
print('Predicted Average Metric Tons:', np.average(pred_test))
print('\nMinimum Metric Tons:', np.amin(y_test))
print('Predicted Minimum Metric Tons:', np.amin(pred_test))

Maximum Metric Tons: 249.98
Predicted Max Metric Tons: 341.8011

Average Metric Tons: 21.895931381717027
Predicted Average Metric Tons: 18.240992

Minimum Metric Tons: 0.0
Predicted Minimum Metric Tons: -13.173259

`Test on 2020`

Let's now prepare the 2020 set by partitioning the data and processing the features.

In [ ]:

X = df2.drop(['Metric_Tons', 'DateTime_YearWeek'], axis=1)
y = df2['Metric_Tons']

X = ce_ord.fit_transform(X)
X = pd.get_dummies(X, drop_first=True)
X = pd.DataFrame(sc.fit_transform(X))

We can use model.predict using the test set, convert the predicted metric tons to a pandas.Dataframe and graph the predicted vs. actual metric tonnage for the 2020 set.

In [ ]:

pred_test = model.predict(X)
y_pred = pd.DataFrame(pred_test)

102041/102041 [==============================] - 101s 983us/step

In [ ]:

plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Train 2019 Test 2020: Predicted Metric Tons vs Actual Metric Tons',
           fontsize=30)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()

To evaluate if this MLP is effective at predicting metric tonnage, let's evaluate the MAE, MSE, RMSE and the R² for the 2019 model using the 2020 set and also examine how the predicted values for the maximum, average and minimum metric tonnage compare to the actual maximum, average and minimum metric tonnage.

In [ ]:

print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y[:], pred_test[:])))
print('R2: %3f' % r2_score(y[:], pred_test[:]))

Metrics: Test set
MAE: 10.047981
MSE: 524.608402
RMSE: 22.904332
R2: 0.444100

In [ ]:

print('Maximum Metric Tons:', np.amax(y))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons:', np.average(y))
print('Predicted Average Metric Tons:', np.average(pred_test))
print('\nMinimum Metric Tons:', np.amin(y))
print('Predicted Minimum Metric Tons:', np.amin(pred_test))

Maximum Metric Tons: 249.99
Predicted Max Metric Tons: 481.06577

Average Metric Tons: 21.535245101116747
Predicted Average Metric Tons: 17.65845

Minimum Metric Tons: 0.0
Predicted Minimum Metric Tons: -14.073943

For the 2019 HPO model using the 2020 data, there is an even higher predicted maximum metric tonnage than the actual maximum metric tons (481.06577 vs. 341.8011), a lower predicted minimum (-14.073943 vs. -13.173259) compared to the actual and an even lower predicted average (17.65845 vs. 18.240992).

`Train 2010 - 2019 HPO5`¶

Let's first set up the callbacks for the TensorBoard. Then we can define a function build_model that will evaluate different model parameters during the hyperparameter tuning process. The parameters to test are:

num_layers: 7 - 13 layers
layer_size: 40 - 70 nodes using step=5
learning_rate: 1 x 10^-1, 1 x 10^-2, 1 x 10^-3

The same 30% Dropout before the output layer as well as the same loss='mae' and metrics=['mse'] can be utilized to compare the aforementioned hyperparameters.

In [ ]:

callbacks = [TensorBoard(log_dir=log_folder,
                         histogram_freq=1,
                         write_graph=True,
                         write_images=True,
                         update_freq='epoch',
                         profile_batch=1,
                         embeddings_freq=1)]

def build_model(hp):
    model = keras.Sequential()
    for i in range(hp.Int('num_layers', 7, 13)):
        model.add(tf.keras.layers.Dense(units=hp.Int('layer_size' + str(i),
                                                     min_value=40, max_value=70,
                                                     step=5),
                                        activation='relu',
                                        kernel_initializer='normal'))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(1))
    model.compile(loss='mae', metrics=['mse'], optimizer=keras.optimizers.Adam(
        hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])))

    return model

We can define the conditions for keras_tuner which include the objective='val_loss', the number of trials to search through (max_trials=20), the length of each trial (executions_per_trial=1), and the directory and project_name. Then we can now print a summary of the search space.

In [ ]:

tuner = BayesianOptimization(
    build_model,
    objective='val_loss',
    max_trials=20,
    executions_per_trial=1,
    overwrite=True,
    directory='MLP_1019_HPO5sc',
    project_name='MLP_1019_HPO5')

tuner.search_space_summary()

Search space summary
Default search space size: 9
num_layers (Int)
{'default': None, 'conditions': [], 'min_value': 7, 'max_value': 13, 'step': 1, 'sampling': None}
layer_size0 (Int)
{'default': None, 'conditions': [], 'min_value': 40, 'max_value': 70, 'step': 5, 'sampling': None}
layer_size1 (Int)
{'default': None, 'conditions': [], 'min_value': 40, 'max_value': 70, 'step': 5, 'sampling': None}
layer_size2 (Int)
{'default': None, 'conditions': [], 'min_value': 40, 'max_value': 70, 'step': 5, 'sampling': None}
layer_size3 (Int)
{'default': None, 'conditions': [], 'min_value': 40, 'max_value': 70, 'step': 5, 'sampling': None}
layer_size4 (Int)
{'default': None, 'conditions': [], 'min_value': 40, 'max_value': 70, 'step': 5, 'sampling': None}
layer_size5 (Int)
{'default': None, 'conditions': [], 'min_value': 40, 'max_value': 70, 'step': 5, 'sampling': None}
layer_size6 (Int)
{'default': None, 'conditions': [], 'min_value': 40, 'max_value': 70, 'step': 5, 'sampling': None}
learning_rate (Choice)
{'default': 0.01, 'conditions': [], 'values': [0.01, 0.001, 0.0001], 'ordered': True}

Let's now begin the search for the best hyperparameters testing different parameters for one epoch using validation_split=0.2 and batch_size=64.

In [ ]:

tuner.search(X_train, y_train, epochs=1, validation_split=0.2, batch_size=64,
             callbacks=callbacks)

Trial 20 Complete [00h 15m 35s]
val_loss: 7.923251152038574

Best val_loss So Far: 7.6570143699646
Total elapsed time: 03h 52m 26s
INFO:tensorflow:Oracle triggered exit

Now that the search has completed, let's print a summary of the results from the trials.

In [ ]:

tuner.results_summary()

Results summary
Results in MLP_1019_HPO5sc/MLP_1019_HPO5
Showing 10 best trials
<keras_tuner.engine.objective.Objective object at 0x7f4068a78310>
Trial summary
Hyperparameters:
num_layers: 13
layer_size0: 70
layer_size1: 65
layer_size2: 70
layer_size3: 45
layer_size4: 40
layer_size5: 65
layer_size6: 55
learning_rate: 0.001
layer_size7: 65
layer_size8: 70
layer_size9: 70
layer_size10: 45
layer_size11: 60
layer_size12: 45
Score: 7.6570143699646
Trial summary
Hyperparameters:
num_layers: 13
layer_size0: 70
layer_size1: 70
layer_size2: 70
layer_size3: 70
layer_size4: 70
layer_size5: 70
layer_size6: 40
learning_rate: 0.0001
layer_size7: 50
layer_size8: 40
layer_size9: 40
layer_size10: 40
layer_size11: 70
layer_size12: 70
Score: 7.803870677947998
Trial summary
Hyperparameters:
num_layers: 13
layer_size0: 70
layer_size1: 70
layer_size2: 65
layer_size3: 40
layer_size4: 55
layer_size5: 55
layer_size6: 45
learning_rate: 0.0001
layer_size7: 50
layer_size8: 70
layer_size9: 70
layer_size10: 40
layer_size11: 40
layer_size12: 40
Score: 7.825712203979492
Trial summary
Hyperparameters:
num_layers: 13
layer_size0: 70
layer_size1: 70
layer_size2: 70
layer_size3: 40
layer_size4: 40
layer_size5: 70
layer_size6: 40
learning_rate: 0.0001
layer_size7: 45
layer_size8: 40
layer_size9: 70
layer_size10: 40
layer_size11: 70
layer_size12: 40
Score: 7.856606483459473
Trial summary
Hyperparameters:
num_layers: 13
layer_size0: 70
layer_size1: 55
layer_size2: 70
layer_size3: 60
layer_size4: 40
layer_size5: 50
layer_size6: 40
learning_rate: 0.0001
layer_size7: 50
layer_size8: 70
layer_size9: 70
layer_size10: 40
layer_size11: 70
layer_size12: 60
Score: 7.899537563323975
Trial summary
Hyperparameters:
num_layers: 13
layer_size0: 70
layer_size1: 70
layer_size2: 40
layer_size3: 70
layer_size4: 70
layer_size5: 40
layer_size6: 70
learning_rate: 0.0001
layer_size7: 40
layer_size8: 40
layer_size9: 40
layer_size10: 40
layer_size11: 70
layer_size12: 70
Score: 7.923251152038574
Trial summary
Hyperparameters:
num_layers: 9
layer_size0: 70
layer_size1: 70
layer_size2: 70
layer_size3: 70
layer_size4: 70
layer_size5: 45
layer_size6: 40
learning_rate: 0.0001
layer_size7: 70
layer_size8: 45
layer_size9: 70
layer_size10: 40
layer_size11: 70
layer_size12: 40
Score: 7.930766582489014
Trial summary
Hyperparameters:
num_layers: 13
layer_size0: 70
layer_size1: 70
layer_size2: 40
layer_size3: 70
layer_size4: 70
layer_size5: 70
layer_size6: 40
learning_rate: 0.0001
layer_size7: 40
layer_size8: 70
layer_size9: 40
layer_size10: 40
layer_size11: 70
layer_size12: 40
Score: 7.970594882965088
Trial summary
Hyperparameters:
num_layers: 12
layer_size0: 55
layer_size1: 55
layer_size2: 60
layer_size3: 45
layer_size4: 60
layer_size5: 50
layer_size6: 45
learning_rate: 0.0001
layer_size7: 45
layer_size8: 55
layer_size9: 55
layer_size10: 40
layer_size11: 40
Score: 7.981153964996338
Trial summary
Hyperparameters:
num_layers: 12
layer_size0: 70
layer_size1: 45
layer_size2: 65
layer_size3: 40
layer_size4: 60
layer_size5: 70
layer_size6: 40
learning_rate: 0.0001
layer_size7: 70
layer_size8: 70
layer_size9: 70
layer_size10: 40
layer_size11: 70
layer_size12: 40
Score: 8.030591011047363

From the search, the model/hyperparameters that resulted in the lowest val_loss utilizes a network containing 13 layers with 3 of the layers containing 70 nodes, 3 with 65, 1 with 60, 1 with 55 nodes, 3 with 45 nodes, 1 with 40 nodes and a learning_rate: 0.001. The top 6 networks with the lowest val_loss examined contained 13 layers as well.

`Fit The Best Model from HPO - Batch Size = 64`

Let's now set the path where model is saved and set up the callbacks with EarlyStopping that monitors the loss and will stop training if it does not improve after 5 epochs, the ModelCheckpoint that monitors the mse saving only the best one with a min loss and the TensorBoard as well.

In [ ]:

log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')

%load_ext tensorboard

filepath = 'MLP_weights_only_train1019_b64_sc_epochs30_HPO5batch64.h5'
checkpoint_dir = os.path.dirname(filepath)

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
                                                      histogram_freq=1)

callbacks_list = [EarlyStopping(monitor='loss', patience=5),
                  ModelCheckpoint(filepath, monitor='mse',
                                  save_freq='epoch'),
                  tensorboard_callback]

We can now define the model architecture using the results from keras-tuner search, compile and then examine.

In [ ]:

model = Sequential()
model.add(Dense(70, input_dim=34, kernel_initializer='normal',
                activation='relu'))
model.add(Dense(65, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(45, kernel_initializer='normal', activation='relu'))
model.add(Dense(40, kernel_initializer='normal', activation='relu'))
model.add(Dense(65, kernel_initializer='normal', activation='relu'))
model.add(Dense(55, kernel_initializer='normal', activation='relu'))

model.add(Dense(65, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(70, kernel_initializer='normal', activation='relu'))
model.add(Dense(45, kernel_initializer='normal', activation='relu'))
model.add(Dense(60, kernel_initializer='normal', activation='relu'))
model.add(Dense(45, kernel_initializer='normal', activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1))

opt = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='mae', metrics=['mse'], optimizer=opt)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 70)                2450      
                                                                 
 dense_1 (Dense)             (None, 65)                4615      
                                                                 
 dense_2 (Dense)             (None, 70)                4620      
                                                                 
 dense_3 (Dense)             (None, 45)                3195      
                                                                 
 dense_4 (Dense)             (None, 40)                1840      
                                                                 
 dense_5 (Dense)             (None, 65)                2665      
                                                                 
 dense_6 (Dense)             (None, 55)                3630      
                                                                 
 dense_7 (Dense)             (None, 65)                3640      
                                                                 
 dense_8 (Dense)             (None, 70)                4620      
                                                                 
 dense_9 (Dense)             (None, 70)                4970      
                                                                 
 dense_10 (Dense)            (None, 45)                3195      
                                                                 
 dense_11 (Dense)            (None, 60)                2760      
                                                                 
 dense_12 (Dense)            (None, 45)                2745      
                                                                 
 dropout (Dropout)           (None, 45)                0         
                                                                 
 dense_13 (Dense)            (None, 1)                 46        
                                                                 
=================================================================
Total params: 44,991
Trainable params: 44,991
Non-trainable params: 0
_________________________________________________________________

The total parameters for this model have increased to 44,991 from 1,891 that were utilized for the baseline model. This model contains the higher number of parameters out of all searches completed (Baseline: 1,891, Train 20: 37,886, Train 19: 17,431)

Now the model can be trained by calling fit utilizing the train dataset for 30 epochs using a batch_size=64, validation_split=0.2 and the specified callbacks from the callbacks_list.

In [ ]:

history = model.fit(X_train, y_train, epochs=30, batch_size=64,
                    validation_split=0.2, callbacks=callbacks_list)

Epoch 1/30
211166/211166 [==============================] - 799s 4ms/step - loss: 8.4668 - mse: 370.8006 - val_loss: 7.6819 - val_mse: 326.4359
Epoch 2/30
211166/211166 [==============================] - 790s 4ms/step - loss: 7.9550 - mse: 335.6031 - val_loss: 7.4567 - val_mse: 313.6351
Epoch 3/30
211166/211166 [==============================] - 786s 4ms/step - loss: 7.8063 - mse: 323.3239 - val_loss: 7.3470 - val_mse: 305.1347
Epoch 4/30
211166/211166 [==============================] - 786s 4ms/step - loss: 7.7147 - mse: 316.8416 - val_loss: 7.2930 - val_mse: 299.3869
Epoch 5/30
211166/211166 [==============================] - 786s 4ms/step - loss: 7.6586 - mse: 312.8250 - val_loss: 7.2685 - val_mse: 300.4239
Epoch 6/30
211166/211166 [==============================] - 788s 4ms/step - loss: 7.5796 - mse: 308.3514 - val_loss: 7.1673 - val_mse: 294.8160
Epoch 7/30
211166/211166 [==============================] - 785s 4ms/step - loss: 7.5308 - mse: 305.4005 - val_loss: 7.1690 - val_mse: 298.6452
Epoch 8/30
211166/211166 [==============================] - 795s 4ms/step - loss: 7.4984 - mse: 303.4142 - val_loss: 7.1205 - val_mse: 292.0550
Epoch 9/30
211166/211166 [==============================] - 793s 4ms/step - loss: 7.4566 - mse: 300.8481 - val_loss: 7.1058 - val_mse: 293.9989
Epoch 10/30
211166/211166 [==============================] - 789s 4ms/step - loss: 7.4321 - mse: 299.6730 - val_loss: 7.0553 - val_mse: 289.8491
Epoch 11/30
211166/211166 [==============================] - 793s 4ms/step - loss: 7.4048 - mse: 298.2980 - val_loss: 7.0454 - val_mse: 286.5066
Epoch 12/30
211166/211166 [==============================] - 792s 4ms/step - loss: 7.3868 - mse: 297.1003 - val_loss: 7.0051 - val_mse: 287.5703
Epoch 13/30
211166/211166 [==============================] - 787s 4ms/step - loss: 7.3830 - mse: 296.4053 - val_loss: 7.0797 - val_mse: 289.8224
Epoch 14/30
211166/211166 [==============================] - 786s 4ms/step - loss: 7.3723 - mse: 295.5655 - val_loss: 7.0885 - val_mse: 288.2786
Epoch 15/30
211166/211166 [==============================] - 787s 4ms/step - loss: 7.3790 - mse: 295.4136 - val_loss: 7.0674 - val_mse: 285.8739
Epoch 16/30
211166/211166 [==============================] - 783s 4ms/step - loss: 7.3719 - mse: 295.2356 - val_loss: 7.0283 - val_mse: 288.5091
Epoch 17/30
211166/211166 [==============================] - 780s 4ms/step - loss: 7.3760 - mse: 294.6553 - val_loss: 7.0269 - val_mse: 285.7284
Epoch 18/30
211166/211166 [==============================] - 781s 4ms/step - loss: 7.3574 - mse: 294.0126 - val_loss: 6.9841 - val_mse: 284.5068
Epoch 19/30
211166/211166 [==============================] - 778s 4ms/step - loss: 7.3893 - mse: 294.0899 - val_loss: 7.0608 - val_mse: 283.3430
Epoch 20/30
211166/211166 [==============================] - 780s 4ms/step - loss: 7.3673 - mse: 293.2057 - val_loss: 7.0221 - val_mse: 281.7839
Epoch 21/30
211166/211166 [==============================] - 786s 4ms/step - loss: 7.4120 - mse: 294.1027 - val_loss: 7.0414 - val_mse: 281.7632
Epoch 22/30
211166/211166 [==============================] - 796s 4ms/step - loss: 7.3990 - mse: 293.6035 - val_loss: 7.0679 - val_mse: 285.6644
Epoch 23/30
211166/211166 [==============================] - 779s 4ms/step - loss: 7.4069 - mse: 293.1695 - val_loss: 7.0633 - val_mse: 283.1910

Now, let's save the model if we need to reload it in the future and plot the model loss and val_loss over the training epochs.

In [ ]:

model.save('./MLP_batch64_sc_30Epochs_HPO5batch64_train1019_tf.h5',
           save_format='tf')

plt.title('Model Error for Metric Tonnage')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='val_loss')
plt.ylabel('Error [Metric Tonnage]')
plt.xlabel('Epoch')
plt.legend()
plt.show()

We can save the model losses in a pandas.Dataframe, and plot both the mae and mse from the train and validation sets over the epochs as well.

In [ ]:

losses = pd.DataFrame(model.history.history)
losses.plot()
plt.title('Model Error for Metric Tonnage')
plt.ylabel('Error [Metric Tonnage]')
plt.xlabel('Epoch')
plt.show()

We can use model.predict to generate output predictions for the specified input samples so let's predict using the training set, convert the predicted metric tons to a pandas.Dataframe and examine the size to determine chunksize for plotting. Let's graph the predicted vs. actual metric tonnage for the train set.

In [ ]:

pred_train = model.predict(X_train)
y_pred = pd.DataFrame(pred_train)

527914/527914 [==============================] - 812s 2ms/step

In [ ]:

plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Training Set: Predicted Metric Tons vs Actual Metric Tons',
           fontsize=25)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y_train, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()

We can use model.predict using the test set, convert the predicted metric tons to a pandas.Dataframe and graph the predicted vs. actual metric tonnage for the test set.

In [ ]:

pred_test = model.predict(X_test)
y_pred = pd.DataFrame(pred_test)

226249/226249 [==============================] - 350s 2ms/step

In [ ]:

plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Test Set: Predicted Metric Tons vs Actual Metric Tons', fontsize=30)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y_test, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()

To evaluate if this MLP is effective at predicting metric tonnage, let's evaluate the MAE, MSE, RMSE and the R² for both the training and test sets.

In [ ]:

print('Metrics: Train set')
print('MAE: %3f' % mean_absolute_error(y_train[:], pred_train[:]))
print('MSE: %3f' % mean_squared_error(y_train[:], pred_train[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_train[:], pred_train[:])))
print('R2: %3f' % r2_score(y_train[:], pred_train[:]))

Metrics: Train set
MAE: 7.040382
MSE: 280.990830
RMSE: 16.762781
R2: 0.603574

In [ ]:

print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y_test[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y_test[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y_test[:], pred_test[:])))
print('R2: %3f' % r2_score(y_test[:], pred_test[:]))

Metrics: Test set
MAE: 8.052270
MSE: 359.919602
RMSE: 18.971547
R2: 0.491976

When comparing the metrics from the training and test sets, the errors are higher in the test set, but the distance between the errors for the HPO models for the train and test sets is greater than the distance for the 2020 HPO model. The R² for the train set is higher than the R² from the train/test sets using the 2020 HPO model while the test R² from the 2019 HPO model is lower than both of the R² from the 2020 HPO model. When comparing the metrics from the model using 2019 HPO model to the baseline model, all of the MAE, MSE, and RMSE metrics are lower and the $^{2}$ is higher for both the train and test sets.

Let's also examine how the predicted values for the maximum, average and minimum metric tonnage compares to the actual maximum, average and minimum metric tonnage.

In [ ]:

print('Maximum Metric Tons:', np.amax(y_test))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons:', np.average(y_test))
print('Predicted Average Metric Tons:', np.average(pred_test))
print('\nMinimum Metric Tons:', np.amin(y_test))
print('Predicted Minimum Metric Tons:', np.amin(pred_test))

Maximum Metric Tons: 249.99
Predicted Max Metric Tons: 280.35706

Average Metric Tons: 20.17341674121816
Predicted Average Metric Tons: 16.623627

Minimum Metric Tons: 0.0
Predicted Minimum Metric Tons: 0.54962707

There are now closer values for the predicted vs. actual maximum (280.35706 vs. 249.99) and minimum (0.54962707 vs. 0.0), but around the same distance compared to the baseline model for the predicted vs. actual average metric tonnage (16.623627 vs. 20.17341674121816).

`Test on 2020`

Let's now prepare the 2020 set by partitioning the data and processing the features. We can use model.predict using the test set, convert the predicted metric tons to a pandas.Dataframe and graph the predicted vs. actual metric tonnage for the 2020 set.

In [ ]:

X = df2.drop(['Metric_Tons', 'DateTime_YearWeek'], axis=1)
y = df2['Metric_Tons']

X = ce_ord.fit_transform(X)
X = pd.get_dummies(X, drop_first=True)
X = pd.DataFrame(sc.fit_transform(X))

In [ ]:

pred_test = model.predict(X)
y_pred = pd.DataFrame(pred_test)
y_pred.shape

102041/102041 [==============================] - 159s 2ms/step

Out[ ]:

(3265293, 1)

In [ ]:

plt.rcParams['agg.path.chunksize'] = 10000
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(15,10), sharey=True)
f.suptitle('Train 2010-19 Test 2020: Predicted Metric Tons vs Actual Metric Tons',
           fontsize=30)
ax1.plot(y_pred, color='red')
ax1.set_title('Predicted Metric Tons', pad=15, fontsize=20)
ax1.set_ylabel('Metric Tons', fontsize=20)
ax2.plot(y, color='blue')
ax2.set_title('Actual Metric Tons', pad=15, fontsize=20)
plt.show()

To evaluate if this MLP is effective at predicting metric tonnage, let's evaluate the MAE, MSE, RMSE and the R² for the 2010 - 2019 model using the 2020 set and also examine how the predicted values for the maximum, average and minimum metric tonnage compare to the actual maximum, average and minimum metric tonnage.

In [ ]:

print('Metrics: Test set')
print('MAE: %3f' % mean_absolute_error(y[:], pred_test[:]))
print('MSE: %3f' % mean_squared_error(y[:], pred_test[:]))
print('RMSE: %3f' % np.sqrt(mean_squared_error(y[:], pred_test[:])))
print('R2: %3f' % r2_score(y[:], pred_test[:]))

Metrics: Test set
MAE: 9.734576
MSE: 448.732405
RMSE: 21.183305
R2: 0.277051

In [ ]:

print('Maximum Metric Tons:', np.amax(y))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons:', np.average(y))
print('Predicted Average Metric Tons:', np.average(pred_test))
print('\nMinimum Metric Tons:', np.amin(y))
print('Predicted Minimum Metric Tons:', np.amin(pred_test))

Maximum Metric Tons: 249.99
Predicted Max Metric Tons: 266.59167

Average Metric Tons: 19.548126725460893
Predicted Average Metric Tons: 12.666425

Minimum Metric Tons: 0.0
Predicted Minimum Metric Tons: 0.46689343

There are closer values for the predicted vs. actual maximum (266.59167 vs. 249.99) and minimum (0.46689343 vs. 0.0), but farther away from the actual average metric tonnage (12.666425 vs. 19.548126725460893).

`Long Short Term Memory Networks (LSTM)`

The LSTM model learns from the most relevant of the multivariate input sequences over time to the output where the sequences generate another dimension to the function being estimated. The information learned from connecting the input to output can dynamically change, which MLPs do not utilize. This method uses a fixed size of time windows that do not need to be specified and can applied to forecasting future events. The notebooks can be found here.

`Train 2020`¶

Let's first set up the environment by importing the necessary packages, setting the warnings and seed with a similar seed function that used for the MLP. Then the 2019 - 2020 data can be read in a pandas.Dataframe, the DateTime feature converted using pandas.to_datetime and the DateTime_YearWeek feature created using df['DateTime'].dt.strftime('%Y-w%U'). Then DateTime can be dropped from the set and the data prepared by sorting the set chronologically with DateTime_YearWeek set as the index. Then the data set up for encoding the foreign_company_size and us_company_size using the ordinal encoder from category_encoders. After the remaining qualitative variables can be converted to dummy variables using a specified prefix and then concatenated with the target variable.

In [ ]:

import pandas as pd
import category_encoders as ce
pd.set_option('display.max_columns', None)

df = pd.read_csv('combined_trade_final_LSTM.csv', low_memory=False)
print('Number of rows and columns:', df.shape)

df['DateTime']= pd.to_datetime(df['DateTime'])
df['DateTime_YearWeek'] = df['DateTime'].dt.strftime('%Y-w%U')
df = df.drop(['DateTime'], axis=1)

X = df.drop(['Metric_Tons'], axis=1)
y = df['Metric_Tons']

df = pd.concat([X, y], axis=1)

df = df.sort_values('DateTime_YearWeek')
df = df.set_index('DateTime_YearWeek')

X = df.drop(['Metric_Tons'], axis=1)
y = df['Metric_Tons']

ce_ord = ce.OrdinalEncoder(cols = ['foreign_company_size', 'us_company_size'])
X = ce_ord.fit_transform(X)

X = pd.get_dummies(X, prefix=['HS_Group_Name', 'Container_LCL/FCL',
                              'Foreign_Country_Region',
                              'US_Port_Coastal_Region',
                              'Trade_Direction', 'Container_Type_Dry'],
                   columns=['HS_Group_Name', 'Container_LCL/FCL',
                            'Foreign_Country_Region',
                            'US_Port_Coastal_Region',
                            'Trade_Direction', 'Container_Type_Dry'])

df = pd.concat([X, y], axis=1)
print(df.shape)

(6633785, 42)

Let's now examine the Year feature to determine the number of observations per year to be used to define the train/test sets.

In [ ]:

df[['Year']].value_counts()

Out[ ]:

Year
2019    3368492
2020    3265293
dtype: int64

Now, we can drop the Year feature and determine the number of columns to prepare the data for utilizing an LSTM.

In [ ]:

df = df.drop(['Year'], axis=1)
df.shape

Out[ ]:

(6633785, 41)

To prepare the data for evaluating an LSTM, let's define a function that converts the series to supervised learning with the input sequence (t-n, ... t-1) and forecasts the sequence (t, t+1, ... t+n). Then, this can be concatenated and the rows with NaN values dropped. We can apply this function by first loading the set, converting all to variables to float, normalize all of the features with the MinMaxScaler and then applying the series_to_supervised function.

In [ ]:

from sklearn.preprocessing import MinMaxScaler

def series_to_supervised(dat, n_in=1, n_out=1, dropnan=True):
		n_vars = 1 if type(dat) is list else dat.shape[1]
		df = pd.DataFrame(dat)
		cols, names = list(), list()

		for i in range(n_in, 0, -1):
				cols.append(df.shift(i))
				names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]

		for i in range(0, n_out):
				cols.append(df.shift(-i))
				if i == 0:
						names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
				else:
						names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]

		agg = pd.concat(cols, axis=1)
		agg.columns = names

		if dropnan:
				agg.dropna(inplace=True)
				return agg

dataset = df
values = dataset.values
values = values.astype('float32')

scaler = MinMaxScaler(feature_range=(0,1))
scaled = scaler.fit_transform(values)

reframed = series_to_supervised(scaled, 1, 1)
reframed.drop(reframed.columns[[42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,
                                58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,
																74,75,76,77,78,79,80,81]], axis=1, inplace=True)
print(reframed.head())

   var1(t-1)     var2(t-1)  var3(t-1)  var4(t-1)  var5(t-1)  var6(t-1)  \
1   0.000280  5.305452e-07        0.0        0.0   0.044248   0.003732   
2   0.046162  4.439152e-05        0.2        0.0   0.044248   0.007459   
3   0.025281  2.431148e-05        0.2        0.0   0.044248   0.007459   
4   0.015281  1.469459e-05        0.0        0.0   0.044248   0.007459   
5   0.016601  1.596402e-05        0.0        0.0   0.044248   0.007459   

   var7(t-1)  var8(t-1)  var9(t-1)  var10(t-1)  var11(t-1)  var12(t-1)  \
1        0.0        0.0     0.1250         0.0    0.000004    0.301587   
2        0.0        0.0     0.0625         0.0    0.000004    0.492064   
3        0.0        0.0     0.0625         0.0    0.000004    0.492064   
4        0.0        0.0     0.0625         0.0    0.000004    0.492064   
5        0.0        0.0     0.0625         0.0    0.000004    0.492064   

   var13(t-1)  var14(t-1)  var15(t-1)  var16(t-1)  var17(t-1)  var18(t-1)  \
1         0.0         0.0         1.0         0.0         0.0         0.0   
2         0.0         0.0         1.0         0.0         0.0         0.0   
3         0.0         0.0         1.0         0.0         0.0         0.0   
4         0.0         0.0         1.0         0.0         0.0         0.0   
5         0.0         0.0         1.0         0.0         0.0         0.0   

   var19(t-1)  var20(t-1)  var21(t-1)  var22(t-1)  var23(t-1)  var24(t-1)  \
1         0.0         1.0         0.0         1.0         0.0         0.0   
2         1.0         0.0         1.0         0.0         0.0         0.0   
3         1.0         0.0         1.0         0.0         0.0         0.0   
4         1.0         0.0         1.0         0.0         0.0         0.0   
5         1.0         0.0         1.0         0.0         0.0         0.0   

   var25(t-1)  var26(t-1)  var27(t-1)  var28(t-1)  var29(t-1)  var30(t-1)  \
1         0.0         0.0         0.0         0.0         0.0         0.0   
2         0.0         0.0         0.0         0.0         0.0         0.0   
3         0.0         0.0         0.0         0.0         0.0         0.0   
4         0.0         0.0         0.0         0.0         0.0         0.0   
5         0.0         0.0         0.0         0.0         0.0         0.0   

   var31(t-1)  var32(t-1)  var33(t-1)  var34(t-1)  var35(t-1)  var36(t-1)  \
1         0.0         0.0         1.0         0.0         0.0         0.0   
2         0.0         0.0         0.0         0.0         0.0         1.0   
3         0.0         0.0         0.0         0.0         0.0         1.0   
4         0.0         0.0         0.0         0.0         0.0         1.0   
5         0.0         0.0         0.0         0.0         0.0         1.0   

   var37(t-1)  var38(t-1)  var39(t-1)  var40(t-1)  var41(t-1)   var1(t)  
1         0.0         0.0         1.0         0.0         1.0  0.046162  
2         0.0         0.0         1.0         0.0         1.0  0.025281  
3         0.0         0.0         1.0         0.0         1.0  0.015281  
4         0.0         0.0         1.0         0.0         1.0  0.016601  
5         0.0         0.0         1.0         0.0         1.0  0.044442

Now, let's define the model architecture using Sequential with 50 neurons for the LSTM and the output Dense layer followed by compiling the model with the loss='mae', metrics=['mse'] and Adam as the optimizer.

In [ ]:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))
model.compile(loss='mae', metrics=['mse'], optimizer='adam')
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm (LSTM)                  (None, 50)                18400     
_________________________________________________________________
dense (Dense)                (None, 1)                 51        
=================================================================
Total params: 18,451
Trainable params: 18,451
Non-trainable params: 0
_________________________________________________________________

`Batch Size = 2`¶

To define the train and the test sets, we can use the yearly count of the data sorted by the DateTime_YearWeek feature to specify the training set as 2020 and the test set as 2019. Then, the sets can be split into the input and output and the input reshaped to be 3D as [samples, timesteps, features].

In [ ]:

values = reframed.values
n_train_hours = 3368492
test = values[:n_train_hours,:]
train = values[n_train_hours:,:]

train_X, train_y = train[:,:-1], train[:,-1]
test_X, test_y = test[:,:-1], test[:,-1]

train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))

Let's now set the path where model is saved and set up the callbacks with EarlyStopping that monitors the loss and will stop training if it does not improve after 3 epochs, the ModelCheckpoint that monitors the mse saving only the best one with a min loss and the TensorBoard as well.

In [ ]:

import datetime
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')

%load_ext tensorboard

filepath = 'weights_only_train2020_test2019_n50_b2_epochs30.h5'
checkpoint_dir = os.path.dirname(filepath)

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
                                                      histogram_freq=1)

callbacks_list = [EarlyStopping(monitor='loss', patience=3),
                  ModelCheckpoint(filepath, monitor='mse',
                                  save_best_only=True, mode='min'),
                  tensorboard_callback]

Now the model can be trained by calling fit utilizing the train dataset for 30 epochs using a batch_size=2 without shuffling the data and the specified callbacks from the callbacks_list.

In [ ]:

history = model.fit(train_X, train_y, epochs=30, batch_size=2, shuffle=False,
                    callbacks=callbacks_list)

Epoch 1/30
1632646/1632646 [==============================] - 2264s 1ms/step - loss: 0.0538 - mse: 0.0143
Epoch 2/30
1632646/1632646 [==============================] - 2234s 1ms/step - loss: 0.0535 - mse: 0.0142
Epoch 3/30
1632646/1632646 [==============================] - 2248s 1ms/step - loss: 0.0535 - mse: 0.0142
Epoch 4/30
1632646/1632646 [==============================] - 2317s 1ms/step - loss: 0.0535 - mse: 0.0142
Epoch 5/30
1632646/1632646 [==============================] - 2395s 1ms/step - loss: 0.0534 - mse: 0.0142
Epoch 6/30
1632646/1632646 [==============================] - 2391s 1ms/step - loss: 0.0534 - mse: 0.0142
Epoch 7/30
1632646/1632646 [==============================] - 2372s 1ms/step - loss: 0.0534 - mse: 0.0142
Epoch 8/30
1632646/1632646 [==============================] - 2266s 1ms/step - loss: 0.0534 - mse: 0.0142
Epoch 9/30
1632646/1632646 [==============================] - 2316s 1ms/step - loss: 0.0534 - mse: 0.0142
Epoch 10/30
1632646/1632646 [==============================] - 2371s 1ms/step - loss: 0.0534 - mse: 0.0142
Epoch 11/30
1632646/1632646 [==============================] - 2229s 1ms/step - loss: 0.0534 - mse: 0.0142
Epoch 12/30
1632646/1632646 [==============================] - 2338s 1ms/step - loss: 0.0534 - mse: 0.0142
Epoch 13/30
1632646/1632646 [==============================] - 2397s 1ms/step - loss: 0.0534 - mse: 0.0142
Epoch 14/30
1632646/1632646 [==============================] - 2396s 1ms/step - loss: 0.0534 - mse: 0.0142

Now, let's save the model if we need to reload it in the future and plot the model loss over the training epochs.

In [ ]:

import matplotlib.pyplot as plt

model.save('./Model_50neuron_batch2_30epochs_train2020_tf.h5',
           save_format='tf')

# Load model for more training or later use
#filepath = 'weights_only_train2020_test2019_n50_b2_epochs30.h5'
#loaded_model = tf.keras.models.load_model('./Model_50neuron_batch2_30epochs_train2020_tf')
#model.load_weights(filepath)

plt.title('Model Loss')
plt.plot(history.history['loss'], label='train')
plt.ylabel('Loss')
plt.xlabel('Epochs')
plt.legend()
plt.show()

<Figure size 432x288 with 0 Axes>

We can now generate predictions for the training and the test sets by utilizing model.predict, then create an empty table with 41 fields, fill the table with the predicted values, inverse transform and then select the correct field to determine the MAE and RMSE for the train and test sets.

In [ ]:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

train_predict = model.predict(train_X)
train_predict_dataset_like = np.zeros(shape=(len(train_predict), 41))

train_predict_dataset_like[:,0] = train_predict[:,0]

train_predict = scaler.inverse_transform(train_predict_dataset_like)[:,0]

print('Train Mean Absolute Error:', mean_absolute_error(train_y[:],
                                                        train_predict[:]))
print('Train Root Mean Squared Error:', np.sqrt(mean_squared_error(train_y[:],
                                                                   train_predict[:])))

Train Mean Absolute Error: 18.923359703648813
Train Root Mean Squared Error: 22.52489685135556

In [ ]:

test_predict = model.predict(test_X)
test_predict_dataset_like = np.zeros(shape=(len(test_predict), 41))

test_predict_dataset_like[:,0] = test_predict[:,0]

test_predict = scaler.inverse_transform(test_predict_dataset_like)[:,0]

print('Test Mean Absolute Error:', mean_absolute_error(test_y[:],
                                                       test_predict[:]))
print('Test Root Mean Squared Error:', np.sqrt(mean_squared_error(test_y[:],
                                                                  test_predict[:])))

Test Mean Absolute Error: 21.861991592185266
Test Root Mean Squared Error: 27.770258932423555

The test set metrics are higher for the test set compared to the training set.

Let's now make a prediction with the test set, invert the scaling for forecast and the actual test data to examine how the predicted values for the maximum, average and minimum metric tonnage compares to the actual maximum, average and minimum metric tonnage.

In [ ]:

from numpy import concatenate

yhat = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))

inv_yhat = concatenate((yhat, test_X[:,1:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]

test_y = test_y.reshape((len(test_y), 1))
inv_y = concatenate((test_y, test_X[:,1:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]

print('Maximum Metric Tons:', np.amax(y_test))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons:', np.average(y_test))
print('Predicted Average Metric Tons:', np.average(pred_test))
print('\nMinimum Metric Tons:', np.amin(y_test))
print('Predicted Minimum Metric Tons:', np.amin(pred_test))

Maximum Metric Tons: 249.99
Predicted Max Metric Tons: 326.9981

Average Metric Tons: 21.897873
Predicted Average Metric Tons: 21.541716

Minimum Metric Tons: 0.0
Predicted Minimum Metric Tons: -136.16667

Now, let's calculate the MAE and MSE and plot the actual vs predicted metric tonnage.

In [ ]:

print('Mean Absolute Error (MAE): %3f' % mean_absolute_error(inv_y, inv_yhat))
print('Mean Square Error (MSE): %3f' % mean_squared_error(inv_y, inv_yhat))

Mean Absolute Error (MAE): 19.149925
Mean Square Error (MSE): 1158.925537

The 2019 test set metrics are higher for the test set compared to the training set, but not significantly higher.

In [ ]:

plt.plot(inv_y, label='2019')
plt.plot(inv_yhat, label='2019 - predicted')
plt.legend()
plt.show()

/usr/local/lib/python3.7/dist-packages/IPython/core/pylabtools.py:125: UserWarning: Creating legend with loc="best" can be slow with large amounts of data.
  fig.canvas.print_figure(bytes_io, **kw)

<Figure size 432x288 with 0 Axes>

As completed for all models using different batch sizes, we can extract the model metrics into a pandas.Dataframe that can be utilized to plot the results.

In [ ]:

data = [[2020, 2, np.amax(y_test), np.amax(pred_test), np.average(y_test),
        np.average(pred_test), np.amin(y_test), np.amin(pred_test)]]

df1 = pd.DataFrame(data, columns=['Year', 'BatchSize', 'Maximum Metric Tons',
                                  'Predicted Maximum Metric Tons',
                                  'Average Metric Tons',
                                  'Predicted Average Metric Tons',
                                  'Minimum Metric Tons',
                                  'Predicted Minimum Metric Tons'])

df = pd.concat([df, df1])
df.to_csv('LSTM_19_20_Results.csv')

`Train 2019`¶

We can then swap the train and test sets from values to define 2019 as the training set and 2020 as the test set, followed by splitting the set into the input and output and reshaping the input to be 3D as [samples, timesteps, features].

In [ ]:

values = reframed.values
n_train_hours = 3368492
train = values[:n_train_hours,:]
test = values[n_train_hours:,:]

train_X, train_y = train[:,:-1], train[:,-1]
test_X, test_y = test[:,:-1], test[:,-1]

train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)

(3368492, 1, 41) (3368492,) (3265292, 1, 41) (3265292,)

`Batch Size = 2`¶

Let's now set the path where model is saved and set up the callbacks with EarlyStopping that monitors the loss and will stop training if it does not improve after 3 epochs, the ModelCheckpoint that monitors the mse saving only the best one with a min loss and the TensorBoard as well.

In [ ]:

log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')

filepath = 'weights_only_train2019_test2020_n50_b2_epochs30.h5'
checkpoint_dir = os.path.dirname(filepath)

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
                                                      histogram_freq=1)

callbacks_list = [EarlyStopping(monitor='loss', patience=3),
                  ModelCheckpoint(filepath, monitor='mse',
                                  save_best_only=True, mode='min'),
                  tensorboard_callback]

Now the model can be trained by calling fit utilizing the train dataset for 30 epochs using a batch_size=2 without shuffling the data and the specified callbacks from the callbacks_list.

In [ ]:

history = model.fit(train_X, train_y, epochs=30, batch_size=2, shuffle=False,
                    callbacks=callbacks_list)

Epoch 1/30
1684246/1684246 [==============================] - 2976s 2ms/step - loss: 0.0558 - mse: 0.0151
Epoch 2/30
1684246/1684246 [==============================] - 2933s 2ms/step - loss: 0.0555 - mse: 0.0151
Epoch 3/30
1684246/1684246 [==============================] - 2934s 2ms/step - loss: 0.0555 - mse: 0.0150
Epoch 4/30
1684246/1684246 [==============================] - 2929s 2ms/step - loss: 0.0554 - mse: 0.0150
Epoch 5/30
1684246/1684246 [==============================] - 2937s 2ms/step - loss: 0.0554 - mse: 0.0150
Epoch 6/30
1684246/1684246 [==============================] - 2938s 2ms/step - loss: 0.0554 - mse: 0.0150
Epoch 7/30
1684246/1684246 [==============================] - 2920s 2ms/step - loss: 0.0554 - mse: 0.0150
Epoch 8/30
1684246/1684246 [==============================] - 2931s 2ms/step - loss: 0.0554 - mse: 0.0150
Epoch 9/30
1684246/1684246 [==============================] - 2925s 2ms/step - loss: 0.0554 - mse: 0.0150
Epoch 10/30
1684246/1684246 [==============================] - 2916s 2ms/step - loss: 0.0554 - mse: 0.0150
Epoch 11/30
1684246/1684246 [==============================] - 2887s 2ms/step - loss: 0.0553 - mse: 0.0150
Epoch 12/30
1684246/1684246 [==============================] - 2931s 2ms/step - loss: 0.0554 - mse: 0.0150
Epoch 13/30
1684246/1684246 [==============================] - 2839s 2ms/step - loss: 0.0554 - mse: 0.0150
Epoch 14/30
1684246/1684246 [==============================] - 2798s 2ms/step - loss: 0.0554 - mse: 0.0150

Now, let's save the model if we need to reload it in the future and plot the model loss over the training epochs.

In [ ]:

model.save('./Model_50neuron_batch2_30epochs_train2019_tf.h5',
           save_format='tf')

# Load model for more training or later use
#filepath = 'weights_only_train2019_test2020_n50_b2_epochs30.h5'
#loaded_model = tf.keras.models.load_model('./Model_50neuron_batch2_30epochs_train2019_tf')
#model.load_weights(filepath)

plt.title('Model Loss')
plt.plot(history.history['loss'], label='train')
plt.ylabel('Loss')
plt.xlabel('Epochs')
plt.legend()
plt.show()

<Figure size 432x288 with 0 Axes>

We can now generate predictions for the training and the test sets by utilizing model.predict, then create an empty table with 41 fields, fill the table with the predicted values, inverse transform and then select the correct field to determine the MAE and RMSE for the train and test sets.

In [ ]:

train_predict = model.predict(train_X)
train_predict_dataset_like = np.zeros(shape=(len(train_predict), 41))

train_predict_dataset_like[:,0] = train_predict[:,0]

train_predict = scaler.inverse_transform(train_predict_dataset_like)[:,0]

print('Train Mean Absolute Error:', mean_absolute_error(train_y[:],
                                                        train_predict[:]))
print('Train Root Mean Squared Error:', np.sqrt(mean_squared_error(train_y[:],
                                                                   train_predict[:])))

Train Mean Absolute Error: 15.280872679572107
Train Root Mean Squared Error: 17.89666887148485

Let's now make a prediction with the test set, invert the scaling for forecast and the actual test data to examine how the predicted values for the maximum, average and minimum metric tonnage compare to the actual maximum, average and minimum metric tonnage.

In [ ]:

yhat = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))

inv_yhat = concatenate((yhat, test_X[:,1:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]

test_y = test_y.reshape((len(test_y), 1))
inv_y = concatenate((test_y, test_X[:,1:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]

print('Maximum Metric Tons:', np.amax(y_test))
print('Predicted Max Metric Tons:', np.amax(pred_test))
print('\nAverage Metric Tons:', np.average(y_test))
print('Predicted Average Metric Tons:', np.average(pred_test))
print('\nMinimum Metric Tons:', np.amin(y_test))
print('Predicted Minimum Metric Tons:', np.amin(pred_test))

Maximum Metric Tons: 249.99
Predicted Max Metric Tons: 5749.582

Average Metric Tons: 21.535204
Predicted Average Metric Tons: -102.33338

Minimum Metric Tons: 0.0
Predicted Minimum Metric Tons: -4716.5093

Now, let's calculate the MAE and MSE and plot the actual vs predicted metric tonnage.

In [ ]:

print('Mean Absolute Error (MAE): %3f' % mean_absolute_error(inv_y, inv_yhat))
print('Mean Square Error (MSE): %3f' % mean_squared_error(inv_y, inv_yhat))

Mean Absolute Error (MAE): 271.003937
Mean Square Error (MSE): 529086.687500

The 2020 test set metrics are significantly higher for the test set compared to the training set, and this is significantly higher that was observed for the 2020 model and 2019 as the test set as demonstrated clearly in the following plot.

In [ ]:

plt.plot(inv_y, label='2020')
plt.plot(inv_yhat, label='2020 - predicted')
plt.legend()
plt.show()

<Figure size 432x288 with 0 Axes>

As completed for all models using different batch sizes, we can extract the model metrics into a pandas.Dataframe that can be utilized to plot the results.

In [ ]:

data = [[2019, 2, np.amax(y_test), np.amax(pred_test), np.average(y_test),
        np.average(pred_test), np.amin(y_test), np.amin(pred_test)]]

df = pd.DataFrame(data, columns=['Year', 'BatchSize', 'Maximum Metric Tons',
                                 'Predicted Maximum Metric Tons',
                                 'Average Metric Tons',
                                 'Predicted Average Metric Tons',
                                 'Minimum Metric Tons',
                                 'Predicted Minimum Metric Tons'])

df = pd.concat([df, df1])
df.to_csv('LSTM_19_20_Results.csv')

We can now examine the results from training a model using the 2019 data and the 2020 data utilizing different batch sizes and compare the predicted results.

In [ ]:

df = pd.read_csv('LSTM_19_20_Results.csv')
df

Out[ ]:

	Year	Batch Size	Maximum Metric Tons	Predicted Maximum Metric Tons	Average Metric Tons	Predicted Average Metric Tons	Predicted Minimum Metric Tons
0	2020	16	249.99	247.87039	21.897873	15.626926	-254.637760
1	2020	8	249.99	180.05067	21.897873	16.335623	-116.713486
2	2020	4	249.99	518.87090	21.897873	18.322914	-459.725620
3	2020	2	249.99	326.99810	21.897873	21.541716	-136.166670
4	2019	16	249.99	637.04800	21.535204	10.872106	-153.270100
5	2019	8	249.99	1621.65440	21.535204	-24.676890	-2013.703200
6	2019	4	249.99	3583.09060	21.535204	-40.824720	-2778.392600
7	2019	2	249.99	5749.58200	21.535204	-102.333380	-4716.509300

In [ ]:

df_num = df[['Predicted Maximum Metric Tons', 'Predicted Average Metric Tons',
             'Predicted Minimum Metric Tons']]

plt.rcParams.update({'font.size': 14})
fig, ax = plt.subplots(1,3, figsize=(15,5))
fig.suptitle('Train 2019 vs Train 2020: Predicted Maximum, Average and Minimum Metric Tons',
             y=1.01, fontsize=20)
for variable, subplot in zip(df_num, ax.flatten()):
    a = sns.lineplot(data=df, x=df['Batch Size'], y=df_num[variable],
                     hue=df.Year, palette=['blue', 'orange'], ax=subplot)
fig.tight_layout()
plt.show();

As the batch size decreased (more observations used to train the model), the predicted results diverged to a larger extent when comparing 2019 to 2020. This suggests that a more complicated model is needed to train an LSTM utilizing 2019 trade and potentially confounding information compared to the 2020 information.

Proposed Questions¶

Data Collection and Processing¶

Maritime Imports & Exports

Preprocessing¶

Preprocess Imports¶

Preprocess Exports¶

Preprocess Concatenated Imports/Exports¶

Extract Names for Clustering using OpenRefine and Binning from Original Data¶

Perform Clustering using OpenRefine¶

Postprocessing after merging clustered features by OpenRefine¶

Conditional Binning of U.S. and Foreign Company by Metric Tons for Mapping¶

Outlier Testing of Quantitative Variables¶

Merge Trade Data with Other Data Sources¶

Unemployment¶

Tariffs¶

Currency¶

COVID-19 State Mandated Closures¶

COVID-19 Cases and Deaths in the United States¶

Feature Engineering COVID-19 associated vars¶

Exploratory Data Analysis (EDA)¶

Feature Selection with Random Forest and XGBoost¶

Question: How has the composition and/or volume of maritime imports and exports from the U.S. changed from 2010-2015¶

Random Forest - Feature Importance¶

XGBoost - Feature Importance¶

Random Forest without Teus & TCVUSD - Feature Importance

XGBoost without Teus & TCVUSD - Feature Importance

Question: How did COVID-19 impact the volume and composition of international maritime trade?¶

Random Forest without Teus & TCVUSD

XGBoost without Teus & TCVUSD

Question: Are there any confounding effects impacting the volume of imports and exports of the targeted commodities?¶

Random Forest without Teus & TCVUSD

XGBoost without Teus & TCVUSD

Create Final Set¶

XGBoost RAPIDS: Train 2020, 2019, 2018 - 2019 and 2010 - 2019

Client

Cluster Info

LocalCUDACluster

Scheduler Info

Scheduler

Workers

Worker: 0

Baseline Models¶

Train 2020

Train 2019

Train 2018 - 2019

Train 2010 - 2019

Hyperopt Hyperparameter Optimization¶

Train 2020: 100 Trials Train/Test¶

Results from the Hyperparameter Search

Model Explanations

Model Metrics with ELI5

Test on 2019¶

Model Explanations

Model Metrics with ELI5

Test on 2010 - 2019

Model Explanations

Model Metrics with ELI5

Train 2019: 100 Trials Train/Test

Results from the Hyperparameter Search

Model Explanations

Model Metrics with ELI5

Test on 2020

Model Explanations

Model Metrics with ELI5

Train 2018-19: 100 Trials Train/Test

Results from the Hyperparameter Search

Model Explanations

Model Metrics with ELI5

Test on 2020

Model Explanations

Model Metrics with ELI5

Train 2010 - 2019: 100 Trials Train/Test¶

Results from the Hyperparameter Search

Model Explanations

Model Metrics with ELI5

Test on 2020

Model Explanations

Model Metrics with ELI5

Multi-Layer Perceptron (MLP) Regression

Baseline Models¶

`Proposed Questions`¶

`Data Collection and Processing`¶

`Maritime Imports & Exports`

`Preprocessing`¶

`Preprocess Imports`¶

`Preprocess Exports`¶

`Preprocess Concatenated Imports/Exports`¶

`Extract Names for Clustering using OpenRefine and Binning from Original Data`¶

`Perform Clustering using OpenRefine`¶

`Postprocessing after merging clustered features by OpenRefine`¶

`Conditional Binning of U.S. and Foreign Company by Metric Tons for Mapping`¶

`Outlier Testing of Quantitative Variables`¶

`Merge Trade Data with Other Data Sources`¶

`Unemployment`¶

`Tariffs`¶

`Currency`¶

`COVID-19 State Mandated Closures`¶

`COVID-19 Cases and Deaths in the United States`¶

`Feature Engineering COVID-19 associated vars`¶

`Exploratory Data Analysis (EDA)`¶

`Feature Selection with Random Forest and XGBoost`¶

`Question: How has the composition and/or volume of maritime imports and exports from the U.S. changed from 2010-2015`¶

`Random Forest - Feature Importance`¶

`XGBoost - Feature Importance`¶

`Random Forest without Teus & TCVUSD - Feature Importance`

`XGBoost without Teus & TCVUSD - Feature Importance`

`Question: How did COVID-19 impact the volume and composition of international maritime trade?`¶

`Random Forest without Teus & TCVUSD`

`XGBoost without Teus & TCVUSD`

`Question: Are there any confounding effects impacting the volume of imports and exports of the targeted commodities?`¶

`Random Forest without Teus & TCVUSD`

`XGBoost without Teus & TCVUSD`

`Create Final Set`¶

`XGBoost RAPIDS: Train 2020, 2019, 2018 - 2019 and 2010 - 2019`

`Baseline Models`¶

`Train 2020`

`Train 2019`

`Train 2018 - 2019`

`Train 2010 - 2019`

`Hyperopt Hyperparameter Optimization`¶

`Train 2020: 100 Trials Train/Test`¶

`Results from the Hyperparameter Search`

`Model Explanations`

`Model Metrics with ELI5`

`Test on 2019`¶

`Model Explanations`

`Model Metrics with ELI5`

`Test on 2010 - 2019`

`Model Explanations`

`Model Metrics with ELI5`

`Train 2019: 100 Trials Train/Test`

`Results from the Hyperparameter Search`

`Model Explanations`

`Model Metrics with ELI5`

`Test on 2020`

`Model Explanations`

`Model Metrics with ELI5`

`Train 2018-19: 100 Trials Train/Test`

`Results from the Hyperparameter Search`

`Model Explanations`

`Model Metrics with ELI5`

`Test on 2020`

`Model Explanations`

`Model Metrics with ELI5`

`Train 2010 - 2019: 100 Trials Train/Test`¶

`Results from the Hyperparameter Search`

`Model Explanations`

`Model Metrics with ELI5`

`Test on 2020`

`Model Explanations`

`Model Metrics with ELI5`

`Multi-Layer Perceptron (MLP) Regression`

`Baseline Models`¶

`Train 2020: Batch Size = 16`¶

`Test on 2019`

`Train 2019: Batch Size = 16`¶

`Test on 2020`

`Train 2010 - 2019: Batch Size = 32`¶

`Test on 2020`

`Hyperparameter Tuning`¶