Customer reviews provide a vast amount of information that can be leveraged to increase business profitability via customer retention. Reviews can be utilized to generate business improvements or modifications by considering the overall response in regards to tone, diction and syntax of the words contained within the reviews from all of the individuals who created a review. In addition to the text contained within the reviews, the metadata from the time when the review was created as well the location of the business can provide more comprehensive business insight. This information can be used to generate better predictive models when considering the concept that businesses might have more than one branch at more than one location.

Objective

  • Which characteristics of businesses associate with different operational metrics from customer ratings and reviews?

Questions

  • Do the services provided by the business to the customer, hours of operation and time during the year of the business play any role in the amount of positive ratings or reviews?
  • Does the activity of the individual who reviews businesses demonstrate any patterns that could affect the increased number of positive ratings/reviews?
  • Is the text contained within customer reviews associated with any of the features provided and engineered?

Data and Preprocessing

   The data was retrieved from Yelp Open Dataset. These are separate json files for the reviews, businesses, users, tip and photo inforation. A data warehouse is needed to be constructed containing as much relevant information for downstream analyses that can be queried for more insight. The photo.json was not utilized in constructing a data warehouse.

   The code that was used for preprocessing and EDA can be found Yelp Reviews Github repository. First, the environment is needed to be set up by importing the dependencies, setting the options for viewing/examining the data, the resolution of the graphs/charts, setting the seed for processing/computations with the data and the directory structure where the data is read from/stored.

   For the initial exploratory data analysis (EDA), let's define a function to examine the dimensions of the rows and columns of each set, the unique and missing values as well as the data types of the separate json files containing the reviews, businesses, users, tip, and checkin information. This function can be implemented after reading the data and dropping duplicates. Since some of the information is stored as dictionaries, these need to be converted to strings for dropping duplicates after joining the various sets.

In [ ]:
import os
import random
import numpy as np
import warnings
import pandas as pd
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

seed_value = 42
os.environ['Yelp_Preprocess_EDA'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)

def data_summary(df):
    """Returns the characteristics of variables in a Pandas dataframe."""
    print('Number of Rows: {}, Columns: {}'.format(df.shape[0], df.shape[1]))
    a = pd.DataFrame()
    a['Number of Unique Values'] = df.nunique()
    a['Number of Missing Values'] = df.isnull().sum()
    a['Data type of variable'] = df.dtypes
    print(a)

print('\nSummary of Yelp Reviews Data')
print('\n')
print('\nReviews - Data Summary')
print('\n')
reviews_json_path = 'yelp_academic_dataset_review.json'
df_reviews = pd.read_json(reviews_json_path, lines=True)
df_reviews = df_reviews.drop_duplicates()

print(data_summary(df_reviews))
print('======================================================================')

print('\nBusiness - Data Summary')
print('\n')
business_json_path = 'yelp_academic_dataset_business.json'
df_business = pd.read_json(business_json_path, lines=True)

df_business['attributes'] = df_business['attributes'].astype(str)
df_business['hours'] = df_business['hours'].astype(str)
df_business =  df_business.drop_duplicates()

print(data_summary(df_business))
print('======================================================================')

print('\nUser - Data Summary')
print('\n')
user_json_path = 'yelp_academic_dataset_user.json'
df_user = pd.read_json(user_json_path, lines=True)
df_user = df_user.drop_duplicates()

print(data_summary(df_user))
print('======================================================================')

print('\nTip - Data Summary')
print('\n')
tip_json_path = 'yelp_academic_dataset_tip.json'
df_tip = pd.read_json(tip_json_path, lines=True)
df_tip = df_tip.drop_duplicates()

print(data_summary(df_tip))
print('======================================================================')

print('\nCheckin - Data Summary')
print('\n')
checkin_json_path = 'yelp_academic_dataset_Checkin.json'
df_checkin = pd.read_json(checkin_json_path, lines=True)
df_checkin = df_checkin.drop_duplicates()

print(data_summary(df_checkin))
print('======================================================================')

Summary of Yelp Reviews Data



Reviews - Data Summary


Number of Rows: 8635403, Columns: 9
             Number of Unique Values  Number of Missing Values  \
review_id                    8635403                         0   
user_id                      2189457                         0   
business_id                   160585                         0   
stars                              5                         0   
useful                           233                         0   
funny                            190                         0   
cool                             207                         0   
text                         8616410                         0   
date                         8485984                         0   

            Data type of variable  
review_id                  object  
user_id                    object  
business_id                object  
stars                       int64  
useful                      int64  
funny                       int64  
cool                        int64  
text                       object  
date               datetime64[ns]  
None
======================================================================

Business - Data Summary


Number of Rows: 160585, Columns: 14
              Number of Unique Values  Number of Missing Values  \
business_id                    160585                         0   
name                           125850                         0   
address                        123895                         0   
city                              836                         0   
state                              31                         0   
postal_code                      5779                         0   
latitude                       137397                         0   
longitude                      133643                         0   
stars                               9                         0   
review_count                     1281                         0   
is_open                             2                         0   
attributes                      90475                         0   
categories                      88115                       115   
hours                           50858                         0   

             Data type of variable  
business_id                 object  
name                        object  
address                     object  
city                        object  
state                       object  
postal_code                 object  
latitude                   float64  
longitude                  float64  
stars                      float64  
review_count                 int64  
is_open                      int64  
attributes                  object  
categories                  object  
hours                       object  
None
======================================================================

User - Data Summary


Number of Rows: 2189457, Columns: 22
                    Number of Unique Values  Number of Missing Values  \
user_id                             2189457                         0   
name                                 153611                         0   
review_count                           1917                         0   
yelping_since                       2179957                         0   
useful                                 5204                         0   
funny                                  3680                         0   
cool                                   4381                         0   
elite                                  1235                         0   
friends                             1212599                         0   
fans                                    680                         0   
average_stars                           401                         0   
compliment_hot                         1318                         0   
compliment_more                         350                         0   
compliment_profile                      347                         0   
compliment_cute                         317                         0   
compliment_list                         191                         0   
compliment_note                         965                         0   
compliment_plain                       1644                         0   
compliment_cool                        1581                         0   
compliment_funny                       1581                         0   
compliment_writer                       822                         0   
compliment_photos                       952                         0   

                   Data type of variable  
user_id                           object  
name                              object  
review_count                       int64  
yelping_since                     object  
useful                             int64  
funny                              int64  
cool                               int64  
elite                             object  
friends                           object  
fans                               int64  
average_stars                    float64  
compliment_hot                     int64  
compliment_more                    int64  
compliment_profile                 int64  
compliment_cute                    int64  
compliment_list                    int64  
compliment_note                    int64  
compliment_plain                   int64  
compliment_cool                    int64  
compliment_funny                   int64  
compliment_writer                  int64  
compliment_photos                  int64  
None
======================================================================

Tip - Data Summary


Number of Rows: 1161995, Columns: 5
                  Number of Unique Values  Number of Missing Values  \
user_id                            339244                         0   
business_id                        110915                         0   
text                              1090182                         0   
date                              1158473                         0   
compliment_count                       11                         0   

                 Data type of variable  
user_id                         object  
business_id                     object  
text                            object  
date                    datetime64[ns]  
compliment_count                 int64  
None
======================================================================

Checkin - Data Summary


Number of Rows: 138876, Columns: 2
             Number of Unique Values  Number of Missing Values  \
business_id                   138876                         0   
date                          138875                         0   

            Data type of variable  
business_id                object  
date                       object  
None
======================================================================

   The reviews set has the most number of rows (8,635,403) while the user set has the most columns (22). To leverage the highest number of reviews that can be paired with the other available information, the reviews should be used as the backbone of the warehouse and the starting place to determine which keys are needed to join the different sets together.

Reviews and Businesses

   Since multiple sets are being used to make a warehouse, renaming variables unqiue to each set is important for the establishment of potential keys. This is completed for the reviews and business initial sets besides the business_id. Since categories in the business set has missing data, the rows that do not contain missing values are selected to maximize the most complete data. The business_id contains characters and numbers so converting to a string is important for this key to be used to join two tables due to high dimensionality and memory constraints.

In [ ]:
df_reviews.rename(columns = {'text': 'text_reviews'}, inplace=True)
df_reviews.rename(columns = {'stars': 'stars_reviews'}, inplace=True)
df_reviews.rename(columns = {'date': 'date_reviews'}, inplace=True)
df_reviews.rename(columns = {'useful': 'useful_reviews'}, inplace=True)
df_reviews.rename(columns = {'funny': 'funny_reviews'}, inplace=True)
df_reviews.rename(columns = {'cool': 'cool_reviews'}, inplace=True)
print('Sample observations from Reviews:')
print(df_reviews.head())
print('\n')

df_business = df_business[df_business.categories.notna()]

df_business = df_business.copy()
df_business.rename(columns = {'stars': 'stars_business'}, inplace=True)
df_business.rename(columns = {'name': 'name_business'}, inplace=True)
df_business.rename(columns = {'review_count': 'review_countbusiness'},
                   inplace=True)
df_business.rename(columns = {'attributes': 'attributes_business'},
                   inplace=True)
df_business.rename(columns = {'categories': 'categories_business'},
                   inplace=True)
df_business.rename(columns = {'hours': 'hours_business'}, inplace=True)
print('Sample observations from Businesses:')
print(df_business.head())

df_reviews['business_id'] = df_reviews['business_id'].astype(str)
df_business['business_id'] = df_business['business_id'].astype(str)

df = pd.merge(df_reviews, df_business, how='right', left_on=['business_id'],
              right_on=['business_id'])
df = df.drop_duplicates()

del df_reviews
Sample observations from Reviews:
                review_id                 user_id             business_id  \
0  lWC-xP3rd6obsecCYsGZRg  ak0TdVmGKo4pwqdJSTLwWw  buF9druCkbuXLX526sGELQ   
1  8bFej1QE5LXp4O05qjGqXA  YoVfDbnISlW0f7abNQACIg  RA4V8pr014UyUbDvI-LW2A   
2  NDhkzczKjLshODbqDoNLSg  eC5evKn1TWDyHCyQAwguUw  _sS2LBIGNT5NQb6PD1Vtjw   
3  T5fAqjjFooT4V0OeZyuk1w  SFQ1jcnGguO0LYWnbbftAA  0AzLzHfOJgL7ROwhdww2ew   
4  sjm_uUcQVxab_EeLCqsYLg  0kA0PAJ8QFMeveQWHFqz2A  8zehGz9jnxPqXtOc7KaJxA   

   stars_reviews  useful_reviews  funny_reviews  cool_reviews  \
0              4               3              1             1   
1              4               1              0             0   
2              5               0              0             0   
3              2               1              1             1   
4              4               0              0             0   

                                        text_reviews        date_reviews  
0  Apparently Prides Osteria had a rough summer a... 2014-10-11 03:34:02  
1  This store is pretty good. Not as great as Wal... 2015-07-03 20:38:25  
2  I called WVM on the recommendation of a couple... 2013-05-28 20:38:06  
3  I've stayed at many Marriott and Renaissance M... 2010-01-08 02:29:15  
4  The food is always great here. The service fro... 2011-07-28 18:05:01  


Sample observations from Businesses:
              business_id            name_business              address  \
0  6iYb2HFDywm3zjuRg0shjw      Oskar Blues Taproom         921 Pearl St   
1  tCbdrRPZA0oiIYSmHG3J0w  Flying Elephants at PDX  7000 NE Airport Way   
2  bvN78flM8NLprQ1a1y5dRg           The Reclaimory   4720 Hawthorne Ave   
3  oaepsyvc0J17qwi8cfrOWg              Great Clips   2566 Enterprise Rd   
4  PE9uqAjdw0E4-8mjGl3wVA        Crossfit Terminus  1046 Memorial Dr SE   

          city state postal_code   latitude   longitude  stars_business  \
0      Boulder    CO       80302  40.017544 -105.283348             4.0   
1     Portland    OR       97218  45.588906 -122.593331             4.0   
2     Portland    OR       97214  45.511907 -122.613693             4.5   
3  Orange City    FL       32763  28.914482  -81.295979             3.0   
4      Atlanta    GA       30316  33.747027  -84.353424             4.0   

   review_countbusiness  is_open  \
0                    86        1   
1                   126        1   
2                    13        1   
3                     8        1   
4                    14        1   

                                 attributes_business  \
0  {'RestaurantsTableService': 'True', 'WiFi': "u...   
1  {'RestaurantsTakeOut': 'True', 'RestaurantsAtt...   
2  {'BusinessAcceptsCreditCards': 'True', 'Restau...   
3  {'RestaurantsPriceRange2': '1', 'BusinessAccep...   
4  {'GoodForKids': 'False', 'BusinessParking': "{...   

                                 categories_business  \
0  Gastropubs, Food, Beer Gardens, Restaurants, B...   
1  Salad, Soup, Sandwiches, Delis, Restaurants, C...   
2  Antiques, Fashion, Used, Vintage & Consignment...   
3                         Beauty & Spas, Hair Salons   
4  Gyms, Active Life, Interval Training Gyms, Fit...   

                                      hours_business  
0  {'Monday': '11:0-23:0', 'Tuesday': '11:0-23:0'...  
1  {'Monday': '5:0-18:0', 'Tuesday': '5:0-17:0', ...  
2  {'Thursday': '11:0-18:0', 'Friday': '11:0-18:0...  
3                                               None  
4  {'Monday': '16:0-19:0', 'Tuesday': '16:0-19:0'...  

User

   The user set contains variables that probably aren't usedful due to ethical constraints since they contain individual level information so these are removed. The columns are then renamed so the variables remain unique for the user set. The user_id is used as the key to join with the main table since this exists in reviews.

In [ ]:
df_user = df_user.drop(['name', 'friends', 'fans', 'compliment_photos'], axis=1)

df_user.rename(columns = {'review_count': 'review_count_user'},
               inplace=True)
df_user.rename(columns = {'yelping_since': 'yelping_since_user'},
               inplace=True)
df_user.rename(columns = {'useful': 'useful_user'}, inplace=True)
df_user.rename(columns = {'funny': 'funny_user'}, inplace=True)
df_user.rename(columns = {'cool': 'cool_user'}, inplace=True)
df_user.rename(columns = {'elite': 'elite_user'}, inplace=True)
df_user.rename(columns = {'average_stars': 'average_stars_user'},
               inplace=True)
df_user.rename(columns = {'compliment_hot': 'compliment_hot_user'},
               inplace=True)
df_user.rename(columns = {'compliment_more': 'compliment_more_user'},
               inplace=True)
df_user.rename(columns = {'compliment_profile': 'compliment_profile_user'},
               inplace=True)
df_user.rename(columns = {'compliment_cute': 'compliment_cute_user'},
               inplace=True)
df_user.rename(columns = {'compliment_list': 'compliment_list_user'},
               inplace=True)
df_user.rename(columns = {'compliment_note': 'compliment_note_user'},
               inplace=True)
df_user.rename(columns = {'compliment_plain': 'compliment_plain_user'},
               inplace=True)
df_user.rename(columns = {'compliment_cool': 'compliment_cool_user'},
               inplace=True)
df_user.rename(columns = {'compliment_funny': 'compliment_funny_user'},
               inplace=True)
df_user.rename(columns = {'compliment_writer': 'compliment_writer_user'},
               inplace=True)
print('Sample observations from Users:')
print(df_user.head())

df = pd.merge(df, df_user, how='left', left_on=['user_id'],
              right_on=['user_id'])
df = df.drop_duplicates()

del df_user
Sample observations from Users:
                  user_id  review_count_user   yelping_since_user  \
0  q_QQ5kBBwlCcbL1s4NVK3g               1220  2005-03-14 20:26:35   
1  dIIKEfOgo0KqUfGQvGikPg               2136  2007-08-10 19:01:51   
2  D6ErcUnFALnCQN4b1W_TlA                119  2007-02-07 15:47:53   
3  JnPIjvC0cmooNDfsa9BmXg                987  2009-02-09 16:14:29   
4  37Hc8hr3cw0iHLoPzLK6Ow                495  2008-03-03 04:57:05   

   useful_user  funny_user  cool_user  \
0        15038       10030      11291   
1        21272       10289      18046   
2          188         128        130   
3         7234        4722       4035   
4         1577         727       1124   

                                          elite_user  average_stars_user  \
0       2006,2007,2008,2009,2010,2011,2012,2013,2014                3.85   
1  2007,2008,2009,2010,2011,2012,2013,2014,2015,2...                4.09   
2                                          2010,2011                3.76   
3                      2009,2010,2011,2012,2013,2014                3.77   
4                                     2009,2010,2011                3.72   

   compliment_hot_user  compliment_more_user  compliment_profile_user  \
0                 1710                   163                      190   
1                 1632                    87                       94   
2                   22                     1                        3   
3                 1180                   129                       93   
4                  248                    19                       32   

   compliment_cute_user  compliment_list_user  compliment_note_user  \
0                   361                   147                  1212   
1                   232                    96                  1187   
2                     0                     0                     5   
3                   219                    90                  1120   
4                    16                    15                    77   

   compliment_plain_user  compliment_cool_user  compliment_funny_user  \
0                   5691                  2541                   2541   
1                   3293                  2205                   2205   
2                     20                    31                     31   
3                   4510                  1566                   1566   
4                    131                   310                    310   

   compliment_writer_user  
0                     815  
1                     472  
2                       3  
3                     391  
4                      98  

Tip

   Like with the user table, variables that are not useful are dropped and the variables are renamed. The business data was used to obtain the name of business in the tip set. This allowed for features to be engineered like the number of compliments. A feature was created to obtain sum of the compliment counts for each business by using groupby and then merging back for sum of the compliments for each id.

In [ ]:
df_tip = df_tip.drop(['user_id', 'text'], axis=1)

df_tip.rename(columns = {'date': 'date_tip'}, inplace=True)
df_tip.rename(columns = {'compliment_count': 'compliment_count_tip'},
              inplace=True)

df_tip['name_business'] = df_tip['business_id'].map(df_business.set_index('business_id')['name_business'])

del df_business

df_tip1 = df_tip.groupby('business_id')['compliment_count_tip'].sum().reset_index()
df_tip1.rename(columns = {'compliment_count_tip': 'compliment_count_tip_idSum'},
               inplace=True)

df_tip = pd.merge(df_tip, df_tip1, how='left', left_on=['business_id'],
                  right_on=['business_id'])
df_tip = df_tip.drop_duplicates()
df_tip = df_tip[df_tip.name_business.notna()]

del df_tip1

print(data_summary(df_tip))
Number of Rows: 1161932, Columns: 5
                            Number of Unique Values  Number of Missing Values  \
business_id                                  110886                         0   
date_tip                                    1158429                         0   
compliment_count_tip                             11                         0   
name_business                                 83810                         0   
compliment_count_tip_idSum                       23                         0   

                           Data type of variable  
business_id                               object  
date_tip                          datetime64[ns]  
compliment_count_tip                       int64  
name_business                             object  
compliment_count_tip_idSum                 int64  
None

   Next, the sum of the compliment count for each business name was calculated using a similar approach and mergd back for the sum of the compliments for each business name. The features used for feature engineering were dropped and duplicates removed using subset = ['business_id']. This was then joined with main table after converting the keys to the proper format for joining the different tables or dataframes.

In [ ]:
df_tip1 = df_tip.groupby('name_business')['compliment_count_tip'].sum().reset_index()
df_tip1.rename(columns = {'compliment_count_tip': 'compliment_count_tip_businessSum'},
               inplace=True)

df_tip = pd.merge(df_tip, df_tip1, how='left', left_on=['name_business'],
                  right_on=['name_business'])
df_tip = df_tip.drop_duplicates()

del df_tip1

df_tip = df_tip.drop(['date_tip', 'compliment_count_tip'], axis=1)
df_tip = df_tip.drop_duplicates(subset = ['business_id'])

print('\nSummary - Tip after Compliment Count Sums:')
print(data_summary(df_tip))
print('\n')
print('Sample observations from Tips:')
print(df_tip.head())

df_tip['business_id'] = df_tip['business_id'].astype(str)
df_tip['name_business'] = df_tip['name_business'].astype(str)
df['name_business'] = df['name_business'].astype(str)

df = pd.merge(df, df_tip, how='right', left_on=['business_id', 'name_business'],
              right_on=['business_id', 'name_business'])
df = df.drop_duplicates()

del df_tip

Summary - Tip after Compliment Count Sums:
Number of Rows: 110886, Columns: 4
                                  Number of Unique Values  \
business_id                                        110886   
name_business                                       83810   
compliment_count_tip_idSum                             23   
compliment_count_tip_businessSum                       27   

                                  Number of Missing Values  \
business_id                                              0   
name_business                                            0   
compliment_count_tip_idSum                               0   
compliment_count_tip_businessSum                         0   

                                 Data type of variable  
business_id                                     object  
name_business                                   object  
compliment_count_tip_idSum                       int64  
compliment_count_tip_businessSum                 int64  
None


Sample observations from Tips:
              business_id                        name_business  \
0  ENwBByjpoa5Gg7tKgxqwLg                   Javier's Taco Shop   
1  jKO4Og6ucdX2-YCTKQVYjg                     Cactus Club Cafe   
2  9Bto7mky640ocgezVKSfVg  MiniLuxe Back Bay at 296 Newbury St   
3  XWFjKtRGZ9khRGtGg2ZvaA                        The Goodnight   
4  mkrx0VhSMU3p3uhyJGCoWA              Gold's Gym Austin North   

   compliment_count_tip_idSum  compliment_count_tip_businessSum  
0                           2                                 2  
1                           0                                 1  
2                           1                                 1  
3                           3                                 3  
4                           0                                 0  

Checkins

   The checkin set allowed for features to be created based of the time information. Let's process the time variables by extracting each date as a row for each business id.

In [ ]:
print('Sample observations from Checkins:')
print(df_checkin.head())
print('\n')

df_checkin['business_id'] = df_checkin['business_id'].astype(str)
df_checkin.rename(columns = {'date': 'businessCheckin_date'},
                  inplace=True)

df_checkin1 = df_checkin.set_index(['business_id']).apply(lambda x: x.str.split(',').explode()).reset_index()
df_checkin1 = df_checkin1.drop_duplicates()
df_checkin1.head()
Sample observations from Checkins:
              business_id                                               date
0  --0r8K_AQ4FZfLsX3ZYRDA                                2017-09-03 17:13:59
1  --0zrn43LEaB4jUWTQH_Bg  2010-10-08 22:21:20, 2010-11-01 21:29:14, 2010...
2  --164t1nclzzmca7eDiJMw  2010-02-26 02:06:53, 2010-02-27 08:00:09, 2010...
3  --2aF9NhXnNVpDV0KS3xBQ  2014-11-03 16:35:35, 2015-01-30 18:16:03, 2015...
4  --2mEJ63SC_8_08_jGgVIg  2010-12-15 17:10:46, 2013-12-28 00:27:54, 2015...


Out[ ]:
business_id businessCheckin_date
0 --0r8K_AQ4FZfLsX3ZYRDA 2017-09-03 17:13:59
1 --0zrn43LEaB4jUWTQH_Bg 2010-10-08 22:21:20
2 --0zrn43LEaB4jUWTQH_Bg 2010-11-01 21:29:14
3 --0zrn43LEaB4jUWTQH_Bg 2010-12-23 22:55:45
4 --0zrn43LEaB4jUWTQH_Bg 2011-04-08 17:14:59

   To create various time features from the checkin time into the different businesses, a function timeFeatures was utitilzed to first convert businessCheckin_date to datetime followed by extracting the year, year-month, year-week, hourly variables as well as the morning and afternoon/night (AM/PM) by using a mask where if the businessCheckin_hourNumber >= 12, then PM, and if not that, then AM.

In [ ]:
def timeFeatures(df):
    """
    Returns the year, year-month, year-week, hourly variables and PM/AM from the date.
    """
    df['businessCheckin_date'] = pd.to_datetime(df['businessCheckin_date'],
                                                format='%Y-%m-%d %H:%M:%S',
                                                errors='ignore')
    df['businessCheckin_Year'] = df.businessCheckin_date.dt.year
    df['businessCheckin_YearMonth'] = df['businessCheckin_date'].dt.to_period('M')
    df['businessCheckin_YearWeek'] = df['businessCheckin_date'].dt.strftime('%Y-w%U')
    df['businessCheckin_hourNumber'] = df.businessCheckin_date.dt.hour

    mask = df['businessCheckin_hourNumber'] >= 12
    df.loc[mask, 'businessCheckin_hourNumber'] = 'PM'

    mask = df['businessCheckin_hourNumber'] != 'PM'
    df.loc[mask, 'businessCheckin_hourNumber'] = 'AM'
    df = df.drop_duplicates()

    return df

df_checkin1 = timeFeatures(df_checkin1)
df_checkin1.head()
Out[ ]:
business_id businessCheckin_date businessCheckin_Year businessCheckin_YearMonth businessCheckin_YearWeek businessCheckin_hourNumber
0 --0r8K_AQ4FZfLsX3ZYRDA 2017-09-03 17:13:59 2017 2017-09 2017-w36 PM
1 --0zrn43LEaB4jUWTQH_Bg 2010-10-08 22:21:20 2010 2010-10 2010-w40 PM
2 --0zrn43LEaB4jUWTQH_Bg 2010-11-01 21:29:14 2010 2010-11 2010-w44 PM
3 --0zrn43LEaB4jUWTQH_Bg 2010-12-23 22:55:45 2010 2010-12 2010-w51 PM
4 --0zrn43LEaB4jUWTQH_Bg 2011-04-08 17:14:59 2011 2011-04 2011-w14 PM

   Now we can get the count of each checkin date for each business id by grouping by the id and the date, followed by merging back with the main table by the business id. A similar process was completed for businessCheckin_Year, businessCheckin_YearMonth, businessCheckin_YearWeek. Lastly, select the rows that contain the most complete data.

In [ ]:
df_checkin2 = df_checkin1.groupby('business_id')['businessCheckin_date'].count().reset_index()
df_checkin2.rename(columns = {'businessCheckin_date': 'dateDay_checkinSum'},
                   inplace=True)
df_checkin2 = df_checkin2.drop_duplicates()

del df_checkin1

df_checkin = pd.merge(df_checkin, df_checkin2, how='left',
                      left_on=['business_id'], right_on=['business_id'])
df_checkin = df_checkin.drop_duplicates()

del df_checkin2

df = pd.merge(df, df_checkin, how='left', left_on=['business_id'],
              right_on=['business_id'])
df = df.drop_duplicates()

del df_checkin

df = df[df.businessCheckin_date.notna()]

Business Hours

   Initially, hours_business is in the format of a dictionary with the day and the hours that the business is operational. There is a lot of information contained within original feature so processing this into a non-dictionary feature will allow for more features to be created and thus potential insight. Let's create a copy of the original hours_business to compare the processing. A lambda function can be utilized that splits the original feature with splitting the current string by , and keeping the business_id with the information. This creates duplicates and missing data so they need to be removed. Some of the information within the original feature contained differences in the spacing so this needs to be normalized to maximize the grain of information. This includes the presence of brackets and spacing. The original hours_business was set as the index to monitor for accurate processing. After, the number of how many unique business day/hours can be calculated and then index reset to compare.

In [ ]:
df1 = df[['business_id', 'hours_business']]

df1['hours_businessOriginal'] = df1['hours_business']
df1 = df1.set_index(['business_id',
                     'hours_businessOriginal']).apply(lambda x: x.str.split(',').explode()).reset_index()
df1 = df1.drop_duplicates()
df1 = df1[df1['business_id'].notna()]
print('- Dimensions of exploded business open/closed days/hours:', df1.shape)

df1['hours_business'] = df1['hours_business'].astype(str)
df1.loc[:,'hours_business'] = df1['hours_business'].str.replace(r'{', '',
                                                                regex=True)
df1.loc[:,'hours_business'] = df1['hours_business'].str.replace(r'}', '',
                                                                regex=True)
df1 = df1.set_index(['hours_businessOriginal'])
df1 = df1.replace("'", "", regex=True)
df1['hours_business'] = df1['hours_business'].str.strip()
df1 = df1.drop_duplicates()

print('- Number of unique business hours after removing Regex characteristics:',
      df1[['hours_business']].nunique())
df1 = df1.reset_index()
print('\n')
print('Sample observations from Business Hours:')
df1.head()
- Dimensions of exploded business open/closed days/hours: (598201, 3)
- Number of unique business hours after removing Regex characteristics: hours_business    9183
dtype: int64


Sample observations from Business Hours:
Out[ ]:
hours_businessOriginal business_id hours_business
0 {'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W... ENwBByjpoa5Gg7tKgxqwLg Monday: 0:0-0:0
1 {'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W... ENwBByjpoa5Gg7tKgxqwLg Tuesday: 0:0-0:0
2 {'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W... ENwBByjpoa5Gg7tKgxqwLg Wednesday: 0:0-0:0
3 {'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W... ENwBByjpoa5Gg7tKgxqwLg Thursday: 0:0-0:0
4 {'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W... ENwBByjpoa5Gg7tKgxqwLg Friday: 0:0-0:0

   The businesses with provided hours can then be filtered to a subset with the objective of extracting the day from the business hours. To complete this, we can use split the string at the : using str.rsplit(':') at the first :, creating hoursBusiness_day. All of the string content after the y in the day of the week can be separated from the dictionary to extract the time when a business is open or closed called hours_working. To extract when the business opens, the hours_working string can be split at the -, which separates the numbers before the hyphen resulting in hours_businessOpen. Then, the numbers before the : can be extracted, resulting in hours_businessOpen1, which contains the hour when the business opens. Next, the numbers after the : can be extracted by splitting everything after the :, creating hours_businessOpen1:, which contains the minutes within an hour. Some businesses open on the half hour or various times other than the whole hour.

   Now, the businesses with whole hours for the opening time can be subset and midnight can be defined as 24. Next, the businesses that do not open on the whole hour can be filtered. The minutes can then be divided by the amount of minutes in an hour, 60, and this value can be added to the hours for a numerical value rather than a string, creating hours_businessOpen2. Lastly, the businesses without any open or closed hours can then be filtered and designated with NA for the created features, so they can be retained for the concentation with the businesses that contain open/closed hours. Finally, the sets without any open business hours and the processed ones can be concatenated together by row.

In [ ]:
df3 = df1[(df1['hours_business'] != 'None')]
print('- Dimensions of businesses with provided open/closed hours:', df3.shape)
df3['hoursBusiness_day'] = df3['hours_business'].str.rsplit(':').str[0]
df3['hours_working'] = df3.hours_business.str.extract('y:(.*)')
df3['hours_businessOpen'] = df3['hours_working'].str.rsplit('-').str[0]
df3['hours_businessOpen1'] = df3['hours_business'].str.rsplit(':').str[-3]
df3['hours_businessOpen1'] = df3['hours_businessOpen1'].astype(int)
df3['hours_businessOpen1:'] = df3['hours_businessOpen'].str.rsplit(':').str[-1]

df4 = df3.loc[(df3['hours_businessOpen1'] == 0)]
df4 = df4.drop(['hours_businessOpen1', 'hours_businessOpen1:'], axis=1)
df4['hours_businessOpen2'] = 24.0
print('- Dimensions of businesses with open hours at Midnight:', df4.shape)

df5 = df3.loc[(df3['hours_businessOpen1'] != 0)]
del df3

df5['hours_businessOpen1:'] = df5['hours_businessOpen1:'].astype(float)
df5['hours_businessOpen1:'] = df5['hours_businessOpen1:'] / 60
df5['hours_businessOpen1'] = df5['hours_businessOpen1'].astype(float)
df5['hours_businessOpen2'] = df5['hours_businessOpen1'] + df5['hours_businessOpen1:']
df5 = df5.drop(['hours_businessOpen1', 'hours_businessOpen1:'], axis=1)
print('- Dimensions of businesses with non whole open hours:', df5.shape)

df2 = df1[(df1['hours_business'] == 'None')]
df2['hoursBusiness_day'] = 'NA'
df2['hours_working'] = 'NA'
df2['hours_businessOpen'] = 'NA'
df2['hours_businessOpen2'] = 'NA'
print('- Dimensions of businesses with no open/closed hours:', df2.shape)

data = [df2, df4, df5]
df7 = pd.concat(data)
print('- Dimensions of businesses with no & business hours open modified:',
      df7.shape)

del data, df2, df4, df5

df7[['hours_business', 'hoursBusiness_day', 'hours_working',
     'hours_businessOpen', 'hours_businessOpen2']].tail()
- Dimensions of businesses with provided open/closed hours: (582121, 3)
- Dimensions of businesses with open hours at Midnight: (42394, 7)
- Dimensions of businesses with non whole open hours: (539727, 7)
- Dimensions of businesses with no open/closed hours: (16080, 7)
- Dimensions of businesses with no & business hours open modified: (598201, 7)
Out[ ]:
hours_business hoursBusiness_day hours_working hours_businessOpen hours_businessOpen2
598196 Tuesday: 10:0-20:0 Tuesday 10:0-20:0 10:0 10.0
598197 Wednesday: 10:0-20:0 Wednesday 10:0-20:0 10:0 10.0
598198 Thursday: 10:0-20:0 Thursday 10:0-20:0 10:0 10.0
598199 Friday: 10:0-20:0 Friday 10:0-20:0 10:0 10.0
598200 Saturday: 10:0-18:0 Saturday 10:0-18:0 10:0 10.0

   Let's now process the businesses closing time(s) using similar approaches. First, we can filter the businesses with provided hours into a new subset. The time when the business closes can be determined by using str.rsplit('-') to extract the content after the - from hours_business, creating hours_businessClosed. Then this string can be split at the : to extract the information before and after the : for the hours and minutes. Both then can be converted to float, and a subset of the observations examined:

In [ ]:
df2 = df7[(df7['hours_business'] != 'None')]

df2['hours_businessClosed'] = df2['hours_business'].str.rsplit('-').str[1]
df2['hours_businessClosed1'] = df2['hours_business'].str.rsplit('-').str[1]
df2['hours_businessClosed1'] = df2['hours_businessClosed1'].str.rsplit(':').str[0]
df2['hours_businessClosed1'] = df2['hours_businessClosed1'].astype(float)

df2['hours_businessClosed1:'] = df2['hours_businessClosed'].str.rsplit(':').str[-1]
df2['hours_businessClosed1:'] = df2['hours_businessClosed1:'].astype(float)
df2.tail()
Out[ ]:
hours_businessOriginal business_id hours_business hoursBusiness_day hours_working hours_businessOpen hours_businessOpen2 hours_businessClosed hours_businessClosed1 hours_businessClosed1:
598196 {'Monday': '10:0-20:0', 'Tuesday': '10:0-20:0'... IEHoGw0V5Lbdf4TJU58K0Q Tuesday: 10:0-20:0 Tuesday 10:0-20:0 10:0 10.0 20:0 20.0 0.0
598197 {'Monday': '10:0-20:0', 'Tuesday': '10:0-20:0'... IEHoGw0V5Lbdf4TJU58K0Q Wednesday: 10:0-20:0 Wednesday 10:0-20:0 10:0 10.0 20:0 20.0 0.0
598198 {'Monday': '10:0-20:0', 'Tuesday': '10:0-20:0'... IEHoGw0V5Lbdf4TJU58K0Q Thursday: 10:0-20:0 Thursday 10:0-20:0 10:0 10.0 20:0 20.0 0.0
598199 {'Monday': '10:0-20:0', 'Tuesday': '10:0-20:0'... IEHoGw0V5Lbdf4TJU58K0Q Friday: 10:0-20:0 Friday 10:0-20:0 10:0 10.0 20:0 20.0 0.0
598200 {'Monday': '10:0-20:0', 'Tuesday': '10:0-20:0'... IEHoGw0V5Lbdf4TJU58K0Q Saturday: 10:0-18:0 Saturday 10:0-18:0 10:0 10.0 18:0 18.0 0.0

   As with the various conditions the businesses might open, they also exist and need to be processed for the closing times.Let's first subset the businesses with closing times directly at midnight. The created features for the closing hour and minutes can be dropped, and 24 (midnight) can be assigned to hours_businessClosed2, which the other conditional groups will be processed to complete the set.

   Then the businesses with closing times which are also at midnight (0) but contain non-whole hours can be filtered to another set. Midnight as 24 can be designated for hours_businessClosed1 and hours_businessClosed1:, which contains the minutes of the hour can be divided by 60, and then added to the 24 whole hour, creating hours_businessClosed2 for this set.

   Then all of the businesses who close at non-midnight times on the whole hour can be filtered. Since they do not contain minutes, the hours_businessClosed1 can be treated as hours_businessClosed2, and then hours_businessClosed1 dropped.

   Next, the businesses with closing times not at midnight and not on the whole hour can be filtered to another set. The minutes can be processed to a fraction of the hour and added to the hour for a numerical time.

   Now, the businesses without any open or closed hours can then be filtered and designated with NA for the created features so they can be retained for the concentation with the businesses that contain open/closed hours.

   Finally, the various sets that have been processed for the different variations of business closing hours and the ones without any open/closed hours can concatenated together by row. Then the temporary features renamed and examined.

In [ ]:
df3 = df2.loc[(df2['hours_businessClosed1'] == 0) & (df2['hours_businessClosed1:'] == 0)]
df3 = df3.drop(['hours_businessClosed1', 'hours_businessClosed1:'], axis=1)
df3['hours_businessClosed2'] = 24.0
print('- Dimensions of businesses with closing times that are midnight and whole hours:',
      df3.shape)

df4 = df2.loc[(df2['hours_businessClosed1'] == 0) & (df2['hours_businessClosed1:'] != 0)]
df4['hours_businessClosed1'] = 24.0
df4['hours_businessClosed1'] = df4['hours_businessClosed1'].astype(float)
df4['hours_businessClosed1:'] = df4['hours_businessClosed1:'].astype(float)
df4['hours_businessClosed1:'] = df4['hours_businessClosed1:'] / 60
df4['hours_businessClosed2'] = df4['hours_businessClosed1'] + df4['hours_businessClosed1:']
df4 = df4.drop(['hours_businessClosed1', 'hours_businessClosed1:'], axis=1)
print('- Dimensions of businesses with closing times that are midnight and non whole hours:',
      df4.shape)

df5 = df2.loc[(df2['hours_businessClosed1'] != 0) & (df2['hours_businessClosed1:'] == 0)]
df5['hours_businessClosed2'] = df5['hours_businessClosed1']
df5['hours_businessClosed2'] = df5['hours_businessClosed2'].astype(float)
df5 = df5.drop(['hours_businessClosed1', 'hours_businessClosed1:'], axis=1)
print('- Dimensions of businesses with closing times that are non midnight and whole hours:',
      df5.shape)

df6 = df2.loc[(df2['hours_businessClosed1'] != 0) & (df2['hours_businessClosed1:'] != 0)]
df6['hours_businessClosed1'] = df6['hours_businessClosed1'].astype(float)
df6['hours_businessClosed1:'] = df6['hours_businessClosed1:'].astype(float)
df6['hours_businessClosed1:'] = df6['hours_businessClosed1:'] / 60
df6['hours_businessClosed2'] = df6['hours_businessClosed1'] + df6['hours_businessClosed1:']
df6 = df6.drop(['hours_businessClosed1', 'hours_businessClosed1:'], axis=1)
print('- Dimensions of businesses with closing times that are non midnight and non whole hours:',
      df6.shape)

df1 = df7[(df7['hours_business'] == 'None')]
df1['hours_businessClosed'] = 'NA'
df1['hours_businessClosed2'] = 'NA'

df1 = [df3, df4, df5, df6, df1]
df1 = pd.concat(df1)

df1 = df1.drop(['hours_businessOpen', 'hours_businessClosed'], axis=1)
df1 = df1.drop_duplicates()
print('- Dimensions of businesses with no & business hours closed modified:',
      df1.shape)

del df2, df3, df4, df5, df6, df7

df1.rename(columns={'hours_businessOpen2': 'hours_businessOpen',
                    'hours_businessClosed2': 'hours_businessClosed'},
           inplace=True)
df1.head()
- Dimensions of businesses with closing times that are midnight and whole hours: (63313, 9)
- Dimensions of businesses with closing times that are midnight and non whole hours: (995, 9)
- Dimensions of businesses with closing times that are non midnight and whole hours: (455785, 9)
- Dimensions of businesses with closing times that are non midnight and non whole hours: (62028, 9)
- Dimensions of businesses with no & business hours closed modified: (598201, 7)
Out[ ]:
hours_businessOriginal business_id hours_business hoursBusiness_day hours_working hours_businessOpen hours_businessClosed
0 {'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W... ENwBByjpoa5Gg7tKgxqwLg Monday: 0:0-0:0 Monday 0:0-0:0 24.0 24.0
1 {'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W... ENwBByjpoa5Gg7tKgxqwLg Tuesday: 0:0-0:0 Tuesday 0:0-0:0 24.0 24.0
2 {'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W... ENwBByjpoa5Gg7tKgxqwLg Wednesday: 0:0-0:0 Wednesday 0:0-0:0 24.0 24.0
3 {'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W... ENwBByjpoa5Gg7tKgxqwLg Thursday: 0:0-0:0 Thursday 0:0-0:0 24.0 24.0
4 {'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W... ENwBByjpoa5Gg7tKgxqwLg Friday: 0:0-0:0 Friday 0:0-0:0 24.0 24.0

   Given that now we have processed the initial dictionary down to the hours and minutes during the day and night, we can create various time features from the opening/closing business times like the morning and afternoon/night (AM/PM). Let's first filter the businesses with/without hours_working into different sets again. We can create a binary feature which demarcates when a business opens in the morning or night using AM/PM called hours_businessOpen_amPM by using a mask where if the hours_businessOpen' >= 12, then designate it as PM. If it does not meet this conditional statement, then designate it as AM. The same methods can be utilized to demarcate the closing period called hours_businessClosed_amPM.

   As before, feature engineering new variables typically requires a boolean approach to address the set that does not meet the criteria, so let's fill the new features with NA for the businesses who do not have open/closed hours. Finally, the two sets can be concatenated together by row and the processed business hours renamed as hours_businessRegex. Utilizing regex allowed for new features to be established from the initial dictionary containing the days of the week with the opening and closing times.

In [ ]:
df2 = df1[(df1['hours_working'] != 'NA')]
df2['hours_businessOpen'] = df2['hours_businessOpen'].astype('float64')
df2['hours_businessClosed'] = df2['hours_businessClosed'].astype('float64')

mask = df2['hours_businessOpen'] >= 12
df2.loc[mask, 'hours_businessOpen_amPM'] = 'PM'

mask = df2['hours_businessOpen_amPM'] != 'PM'
df2.loc[mask, 'hours_businessOpen_amPM'] = 'AM'

mask = df2['hours_businessClosed'] >= 12
df2.loc[mask, 'hours_businessClosed_amPM'] = 'PM'

mask = df2['hours_businessClosed_amPM'] != 'PM'
df2.loc[mask, 'hours_businessClosed_amPM'] = 'AM'

df1 = df1[(df1['hours_working'] == 'NA')]
df1['hours_businessOpen_amPM'] = 'NA'
df1['hours_businessClosed_amPM'] = 'NA'

df1 = pd.concat([df2, df1])
df1.rename(columns={'hours_business': 'hours_businessRegex'}, inplace=True)
df1.head()
Out[ ]:
hours_businessOriginal business_id hours_businessRegex hoursBusiness_day hours_working hours_businessOpen hours_businessClosed hours_businessOpen_amPM hours_businessClosed_amPM
0 {'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W... ENwBByjpoa5Gg7tKgxqwLg Monday: 0:0-0:0 Monday 0:0-0:0 24.0 24.0 AM AM
1 {'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W... ENwBByjpoa5Gg7tKgxqwLg Tuesday: 0:0-0:0 Tuesday 0:0-0:0 24.0 24.0 AM AM
2 {'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W... ENwBByjpoa5Gg7tKgxqwLg Wednesday: 0:0-0:0 Wednesday 0:0-0:0 24.0 24.0 AM AM
3 {'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W... ENwBByjpoa5Gg7tKgxqwLg Thursday: 0:0-0:0 Thursday 0:0-0:0 24.0 24.0 AM AM
4 {'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W... ENwBByjpoa5Gg7tKgxqwLg Friday: 0:0-0:0 Friday 0:0-0:0 24.0 24.0 AM AM

   Let's create a hours_businessOpenDaily feature that contains the total number of hours each day a business is operational by subtracting hours_businessOpen from hours_businessClosed.

In [ ]:
df2 = df1[(df1['hours_businessRegex'] != 'None')]
df2['hours_businessOpenDaily'] = df2['hours_businessClosed'] - df2['hours_businessOpen']

df3 = df2.loc[(df2['hours_businessOpenDaily'] == 0)]
df3['hours_businessOpenDaily'] = df3['hours_businessOpenDaily'] + 24.0
print('- Dimensions of businesses with open 24 hours:',
      df3.shape)

df2 = df2.loc[(df2['hours_businessOpenDaily'] != 0)]
print('- Dimensions of businesses with open 24 hours:',
      df2.shape)

df2 = pd.concat([df3,df2])

df3 = df1[(df1['hours_businessRegex'] == 'None')]
df3['hours_businessOpenDaily'] = 'NA'

df1 = pd.concat([df2,df3])

del df2, df3

df1.head()
- Dimensions of businesses with open 24 hours: (42609, 10)
- Dimensions of businesses with open 24 hours: (539512, 10)
Out[ ]:
hours_businessOriginal business_id hours_businessRegex hoursBusiness_day hours_working hours_businessOpen hours_businessClosed hours_businessOpen_amPM hours_businessClosed_amPM hours_businessOpenDaily
0 {'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W... ENwBByjpoa5Gg7tKgxqwLg Monday: 0:0-0:0 Monday 0:0-0:0 24.0 24.0 PM PM 24.0
1 {'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W... ENwBByjpoa5Gg7tKgxqwLg Tuesday: 0:0-0:0 Tuesday 0:0-0:0 24.0 24.0 PM PM 24.0
2 {'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W... ENwBByjpoa5Gg7tKgxqwLg Wednesday: 0:0-0:0 Wednesday 0:0-0:0 24.0 24.0 PM PM 24.0
3 {'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W... ENwBByjpoa5Gg7tKgxqwLg Thursday: 0:0-0:0 Thursday 0:0-0:0 24.0 24.0 PM PM 24.0
4 {'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W... ENwBByjpoa5Gg7tKgxqwLg Friday: 0:0-0:0 Friday 0:0-0:0 24.0 24.0 PM PM 24.0

   The table containing the processed business hours can now be outer joined with the main table utilizing the strings business_id and hours_business as the keys followed by dropping any duplicates with the subset=['review_id'].

In [ ]:
df['business_id'] = df['business_id'].astype(str)
df['hours_business'] = df['hours_business'].astype(str)

df1['business_id'] = df1['business_id'].astype(str)
df1['hours_businessOriginal'] = df1['hours_businessOriginal'].astype(str)

df = pd.merge(df, df1, how='outer', left_on=['business_id', 'hours_business'],
              right_on=['business_id', 'hours_businessOriginal'])
df = df.drop_duplicates(subset=['review_id'])

del df1

print('Number of Rows: {}, Columns: {}'.format(df.shape[0], df.shape[1]))
Number of Rows: 7952263, Columns: 53

Exploratory Data Analysis

   Let's now subset the quantitative features from the initial reviews and business tables, and examine the descriptive statistics using pandas.DataFrame.describe rounded to two decimal places.

In [ ]:
df_num = df[['stars_reviews', 'useful_reviews', 'funny_reviews', 'cool_reviews',
             'stars_business', 'review_countbusiness', 'is_open']]

print('Descriptive statistics of quant vars in Reviews + Business:')
print('\n')
print(df_num.describe(include=[np.number]).round(2))
Descriptive statistics of quant vars in Reviews + Business:


       stars_reviews  useful_reviews  funny_reviews  cool_reviews  \
count     7952263.00      7952263.00     7952263.00    7952263.00   
mean            3.73            1.21           0.43          0.51   
std             1.43            3.15           1.90          2.29   
min             1.00            0.00           0.00          0.00   
25%             3.00            0.00           0.00          0.00   
50%             4.00            0.00           0.00          0.00   
75%             5.00            1.00           0.00          0.00   
max             5.00          446.00         610.00        732.00   

       stars_business  review_countbusiness    is_open  
count      7952263.00            7952263.00  7952263.0  
mean             3.73                405.29        0.8  
std              0.69                741.12        0.4  
min              1.00                  5.00        0.0  
25%              3.50                 70.00        1.0  
50%              4.00                183.00        1.0  
75%              4.00                436.00        1.0  
max              5.00               9185.00        1.0  

   When examining the broad summary statistics of the selected features, stars_reviews has mean=3.73 with a standard deviation (std) of 1.43. The 75th percentile are 5 star reviews, so the set contains mostly highly rated reviews. When examining useful, funny and cool_reviews, the 75th percentile is one review or less with a maximum of 446, 610 and 732, respectively.

   For the stars_business, the mean=3.73, which is the same as the mean for stars_reviews, and the std=0.69, which is lower than the std from stars_reviews. The 75th percentile is 4 stars, which is lower when compared to stars_reviews. Regarding the review_countbusiness feature, the std is greater than the mean when the 50th percentile is 183 vs. mean=405.29. Surprisingly, the maximum is only 9,185 given 7,952,263 total reviews. A large number of businesses are open when the is_open feature is considered open at the 25th percentile. We can calculate the percentage though to examine with more granularity.

   Then, we can calculate the percentage of businesses that are open or closed using value_counts(normalize=True) with is_open which then is multiplied by 100.

In [ ]:
print('- Percentage of business are open or closed:')
print(df['is_open'].value_counts(normalize=True) * 100)
- Percentage of business are open or closed:
1    80.449414
0    19.550586
Name: is_open, dtype: float64

   In this set, most of the businesses are open, but the purpose of project is multifaceted consisting of exploring various business features for insight as well as using the text contained within business reviews to address questions from the business insights. So, let's drop the is_open feature since we do not know why the business is not open still. Now, we can examine the quantitative variables further using box-and-whisker and histogram plots using seaborn.boxplot and seaborn.histplot that loops through the numerical features and is presented together.

In [ ]:
df_num = df_num.drop(['is_open'], axis=1)

sns.set_style('whitegrid')
plt.rcParams.update({'font.size': 15})
fig, ax = plt.subplots(2, 3, figsize=(10,7))
fig.suptitle('Boxplots of Quantitative Variables in Reviews & Businesses',
             fontsize=25)
for var, subplot in zip(df_num, ax.flatten()):
    sns.boxplot(x=df_num[var], data=df_num, ax=subplot)
plt.tight_layout()
plt.show();

fig, ax = plt.subplots(2, 3, figsize=(10,7))
fig.suptitle('Histograms of Quantitative Variables in Yelp Reviews & Businesses',
             fontsize=25)
for variable, subplot in zip(df_num, ax.flatten()):
    sns.histplot(x=df_num[variable], kde=True, ax=subplot)
plt.tight_layout()
plt.show();

del df_num

   Utilizing these plots allows for the previously calculated summary statistics to be visualized graphically rather one dimensional numerical columns in table format. Boxplots contain all of the table information which summarize the distribution of the feature visualized easily besides the std. Histograms provide a more granular view of the distribution and utilizing 'kde=True' smooths the distribution using a kernel density estimate.

   Although some of selected variables are coded as numerical values when they could be represented as categorical features given the group structure, we can use seaborn.countplot in a for loop to examine the count of each group within the subset of features.

In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns

df_num = df_num.drop(['useful_reviews', 'funny_reviews', 'cool_reviews'],
                     axis=1)

plt.rcParams.update({'font.size': 15})
fig, ax = plt.subplots(1, 4, figsize=(20,10))
fig.suptitle('Countplots of Quantitative Variables in Yelp Reviews & Businesses',
             fontsize=35)
for var, subplot in zip(df_num, ax.flatten()):
    sns.countplot(x=df_num[var], ax=subplot)
plt.tight_layout()
plt.show();

del df_num

   The majority of the stars_reviews are rated as 5 star while the majority of the stars_business are 4 star. Most of the businesses do not contain a lot of reviews as this plot demonstates a right tailed distribution.

   We can also examine the reviews based on the variables in the original reviews set. Using various cutoffs like stars_reviews == 5.0 for the positive reviews, stars_reviews <= 2.5 for the negatives and the stars_reviews between 3 - 4.5 for the average reviews, we look at the percentage these reviews are out of the total set.

In [ ]:
total_reviews = len(df)
useful_reviews = len(df[df['useful_reviews'] > 0])
funny_reviews = len(df[df['funny_reviews'] > 0])
cool_reviews = len(df[df['cool_reviews' ] > 0])
positive_reviews  = len(df[df['stars_reviews'] == 5.0])
negative_reviews = len(df[df['stars_reviews'] <= 2.5])
ok_reviews = len(df[df['stars_reviews'].between(3, 4.5)])

print('- Total reviews: {}'.format(total_reviews))
print('- Useful reviews: ' +  str(round((useful_reviews / total_reviews) * 100)) + '%')
print('- Funny reviews: ' +  str(round((funny_reviews / total_reviews) * 100)) + '%')
print('- Cool reviews: ' +  str(round((cool_reviews / total_reviews) * 100)) + '%')
print('- Positive reviews: ' +  str(round((positive_reviews / total_reviews) * 100)) + '%')
print('- Negative reviews: ' +  str(round((negative_reviews / total_reviews) * 100)) + '%')
print('- OK reviews: ' +  str(round((ok_reviews / total_reviews) * 100)) + '%')
- Total reviews: 7952263
- Useful reviews: 44%
- Funny reviews: 19%
- Cool reviews: 23%
- Positive reviews: 43%
- Negative reviews: 22%
- OK reviews: 35%

   Less than half of the reviews are useful and positive and the minority are negative. This might suggest that balancing the sets is needed before any modeling is completed.

   Next, let's examine the features in the original business set. Based on the count of the business name (name_business), we can find the top 30 businesses, which are probably restaurants, and plot the average stars on reviews for the top 30 restaurants.

In [ ]:
top30_restaurants = df.name_business.value_counts().index[:30].tolist()
top30_restaurantsCount = len(top30_restaurants)
total_businesses = df['name_business'].nunique()
print('- Percentage of top 30 restaurants: ' +  str((top30_restaurantsCount / total_businesses) * 100))
top30_restaurants = df.loc[df['name_business'].isin(top30_restaurants)]
top30_restaurants.groupby(top30_restaurants.name_business)['stars_reviews'].mean().sort_values(ascending=True).plot(kind='barh',
                                                                                                                    figsize=(12,10))
plt.title('Average Review Rating of 30 Most Frequent Restaurants',
          fontsize=20)
plt.ylabel('Name of Restaurant', fontsize=18)
plt.xlabel('Average Review Rating', fontsize=18)
plt.yticks(fontsize=18)
plt.tight_layout()
plt.show();
- Percentage of top 30 restaurants: 0.03753753753753754


   Not surprisingly, fast food restaurants have a high number of reviews and a poor average rating.

   We can also plot the average number of useful, funny and cool reviews in the top 30 restaurants and then sort by the number of useful reviews.

In [ ]:
top30_restaurants.groupby(top30_restaurants.name_business)[['useful_reviews',
                                                            'funny_reviews',
                                                            'cool_reviews']].mean().sort_values('useful_reviews',
                                                                                                ascending=True).plot(kind='barh',
                                                                                                                     figsize=(15,14),
                                                                                                                     width=0.7)
plt.title('Average Useful, Funny & Cool Reviews in the 30 Most Frequent Restaurants',
          fontsize=28)
plt.ylabel('Name of Restaurant', fontsize=18)
plt.yticks(fontsize=20)
plt.legend(fontsize=22)
plt.tight_layout()
plt.show()

   Based on the graph, more of these selected reviews are considered cool reviews rather than funny. It is plausible that individuals would use diction expressing more enjoyment than humor if they are considered useful reviews.

   We can also find the top 10 states with the highest number of reviews by converting value_counts() to a dataframe and then use a barplot with the state as the index.

In [ ]:
x = df['state'].value_counts()[:10].to_frame()

plt.figure(figsize=(20,10))
sns.barplot(x=x['state'], y=x.index)
plt.title('States with the 10 highest number of business reviews listed in Yelp',
          fontsize=35)
plt.xlabel('Counts of Reviews', fontsize=25)
plt.ylabel('State', fontsize=25)
plt.tight_layout()
plt.show();

del x

   In this set and subset, Massachusetts has the largest abundance of reviews followed by Texas, Oregon, Georgia, Florida, British Columbia and Ohio. This contains multiple regions in the United States, ranging from the Northeast, Southeast and West.

   Building on this, we can find the top 30 cities with the highest number of reviews.

In [ ]:
x = df['city'].value_counts()
x = x.sort_values(ascending=False)
x = x.iloc[0:30]

plt.rcParams.update({'font.size': 14})
plt.figure(figsize=(20,10))
ax = sns.barplot(x=x.index, y=x.values, alpha=0.9)
plt.title('Cities with the Highest Number of Reviews on Yelp', fontsize=35)
locs, labels = plt.xticks()
plt.setp(labels, rotation=45)
plt.xlabel('Name of City', fontsize=25)
plt.ylabel('Number of Reviews', fontsize=25)

rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height+3, label, ha='center',
            va='bottom', fontsize=13)
plt.tight_layout()
plt.show();

del x

   This plot refines the granularity of state level down to the city level. Although Massachusetts contained the highest number of reviews, Austin in Texas contains the most per city followed by Portland (Oregon), Atlanta (Georgia), Boston (Massachusetts), Orlando (Florida) and Vancouver (British Columbia). Subsequently, the city with the next highest counts, Columbus, had half that of Vancouver.

   Building on this, we can find the top 30 cities with reviews in the set, and then examine which cities contain the highest number of average reviews.

In [ ]:
top30_city = df.city.value_counts().index[:30].tolist()
top30_city = df.loc[df['city'].isin(top30_city)]

top30_city.groupby(top30_city.city)['stars_reviews'].mean().sort_values(ascending=True).plot(kind='barh',
                                                                                             figsize=(12,10))
plt.yticks(fontsize=18)
plt.title('Average Review Rating of Cities with Highest Number of Reviews on Yelp',
          fontsize=20)
plt.ylabel('Name of City', fontsize=18)
plt.xlabel('Average Review Rating', fontsize=18)
plt.tight_layout()
plt.show();

   The cities with the highest number of average reviews are Winter Park, Portland, Austin, Somerville and Boulder. These cities are not all geographically in close proximity, which suggests that the location is not skewed. However, closer examination is warranted to control for a potentially confounding variable like location.

   Then we can convert the date_reviews feature to the proper format using pd.to_datetime with format='%Y-%m-%d %H:%M:%S' to extract the time related components like the year, year-month and year-week. Since we extracted the year of the review, we can utilize this with the business_id and stars_business features to examine the 5 star rated businesses containing the most reviews per year in the set.

In [ ]:
df['date_reviews'] = pd.to_datetime(df['date_reviews'],
                                    format='%Y-%m-%d %H:%M:%S', errors='ignore')
df['date_reviews_Year'] = df.date_reviews.dt.year
df['date_reviews_YearMonth'] = df['date_reviews'].dt.to_period('M')
df['date_reviews_YearWeek'] = df['date_reviews'].dt.strftime('%Y-w%U')

top = 5
temp = df[['business_id', 'date_reviews_Year', 'stars_business']]
five_star_reviews = temp[temp['stars_business'] == 5]
trendy = five_star_reviews.groupby(['business_id',
                                    'date_reviews_Year']).size().reset_index(name='counts')

trending = trendy.sort_values(['date_reviews_Year',
                               'counts'])[::-1][:top].business_id.values

plt.rcParams.update({'font.size': 10})
for business_id in trending:
    record = trendy.loc[trendy['business_id'] == business_id]
    business_name = df.loc[df['business_id'] == business_id].name_business.values[0]
    series = pd.Series(record['counts'].values,
                       index=record.date_reviews_Year.values,
                       name='Trending business')
    axes = series.plot(kind='bar', figsize=(5,5))
    plt.xlabel('Year', axes=axes)
    plt.ylabel('Total reviews', axes=axes)
    plt.title('Review trend of {}'.format(business_name), axes=axes)
    plt.show();

   Since we utilized the business_id and the yearly review count for the 5 star businesses, we are able to determine a specific business location given the id. This resulted in very different types of businesses ranging from yogurt to Asian food selected.

   Next, we can examine the user information using pandas.DataFrame.describe again to generate descriptive statistics rounded to two decimal places.

In [ ]:
df_num = df[['review_count_user', 'useful_user', 'funny_user', 'cool_user',
             'average_stars_user', 'compliment_hot_user',
             'compliment_more_user','compliment_profile_user',
             'compliment_cute_user', 'compliment_list_user',
             'compliment_note_user', 'compliment_plain_user',
             'compliment_cool_user', 'compliment_funny_user',
             'compliment_writer_user']]

print('- Descriptive statistics of quant vars in User:')
print(df_num.describe(include=[np.number]).round(2))
- Descriptive statistics of quant vars in User:
       review_count_user  useful_user  funny_user   cool_user  \
count         7952263.00   7952263.00  7952263.00  7952263.00   
mean              141.96       444.06      218.06      305.08   
std               517.46      3513.48     2340.36     3106.02   
min                 0.00         0.00        0.00        0.00   
25%                 8.00         5.00        1.00        1.00   
50%                28.00        24.00        5.00        5.00   
75%               110.00       131.00       35.00       44.00   
max             15686.00    204380.00   172041.00   198451.00   

       average_stars_user  compliment_hot_user  compliment_more_user  \
count          7952263.00           7952263.00            7952263.00   
mean                 3.74                20.26                  2.94   
std                  0.79               235.80                 52.56   
min                  1.00                 0.00                  0.00   
25%                  3.41                 0.00                  0.00   
50%                  3.83                 0.00                  0.00   
75%                  4.21                 1.00                  1.00   
max                  5.00             25304.00              13501.00   

       compliment_profile_user  compliment_cute_user  compliment_list_user  \
count               7952263.00            7952263.00            7952263.00   
mean                      2.20                  1.58                  1.11   
std                      63.73                 47.93                 42.73   
min                       0.00                  0.00                  0.00   
25%                       0.00                  0.00                  0.00   
50%                       0.00                  0.00                  0.00   
75%                       0.00                  0.00                  0.00   
max                   14180.00              13654.00              12669.00   

       compliment_note_user  compliment_plain_user  compliment_cool_user  \
count            7952263.00             7952263.00            7952263.00   
mean                  15.12                  42.93                 34.22   
std                  138.03                 610.92                391.56   
min                    0.00                   0.00                  0.00   
25%                    0.00                   0.00                  0.00   
50%                    0.00                   0.00                  0.00   
75%                    3.00                   3.00                  3.00   
max                38322.00               90858.00              46858.00   

       compliment_funny_user  compliment_writer_user  
count             7952263.00              7952263.00  
mean                   34.22                   12.94  
std                   391.56                  146.28  
min                     0.00                    0.00  
25%                     0.00                    0.00  
50%                     0.00                    0.00  
75%                     3.00                    2.00  
max                 46858.00                15446.00  

   For the basic summary statistics of these user related features, the review_count_user has a mean=141.96 and a maximum=15686.00. useful_user has the highest std=444.06 and the highest amount (204380.00) compared to the cool and funny users. The average_stars_user is 3.74 with a std=0.79. The compliment_plain_user feature has the highest standard deviation (610.92) and the highest amount (90858.00) for the compliment*user group when the majority of the contain zero.

   We can also examine the number of reviews completed by each user as well as the average rating by each user using histplot from seaborn.

In [ ]:
sns.histplot(x='review_count_user', data=df_num, kde=True)
plt.ylim(0, 100000)
plt.title('Count of reviews by Users in Yelp')
plt.tight_layout()
plt.show();

sns.histplot(x='average_stars_user', data=df_num, kde=True)
plt.ylim(0, 300000)
plt.title('Count of Average Stars by Users in Yelp')
plt.tight_layout()
plt.show();

   The majority of users have only written a small number of reviews demonstrated by the right skewed histogram while the the average stars by the users is more left skewed around 4 stars.

   Box-and-whisker plots are another method to visualize the quantitative variables from the User set. We can iterate through the features using a for loop.

In [ ]:
plt.rcParams.update({'font.size': 25})
fig, ax = plt.subplots(3, 5, figsize=(20,10))
fig.suptitle('Boxplots of Quantitative Variables in Users on Yelp Reviews',
             fontsize=35)
for var, subplot in zip(df_num, ax.flatten()):
    sns.boxplot(x=df_num[var], data=df_num, ax=subplot)
plt.tight_layout()
plt.show();

   The compliment*user group definitely contains outliers as demonstrated from plot above. There are a small number of users who are very active in this set! The majority of the users are not though.

Categories

   Let's examine some observations from the orginal categories variable to determine how to proceed with processing. The current categories_business feature is organized as a comma separated list of the different categorical information about the business.

In [ ]:
df_category_split = df[['business_id', 'categories_business']]
df_category_split = df_category_split.drop_duplicates()
df_category_split[:10]
Out[ ]:
business_id categories_business
0 ENwBByjpoa5Gg7tKgxqwLg Mexican, Restaurants
1904 jKO4Og6ucdX2-YCTKQVYjg Canadian (New), Cafes, Gastropubs, Lounges, Ni...
6748 9Bto7mky640ocgezVKSfVg Nail Salons, Massage, Waxing, Shopping, Hair R...
9555 XWFjKtRGZ9khRGtGg2ZvaA Event Planning & Services, Lounges, Venues & E...
13251 mkrx0VhSMU3p3uhyJGCoWA Active Life, Gyms, Fitness & Instruction, Trai...
13923 VQftVUvHfMQdDTmnO0iQqg Dance Studios, Active Life, Trainers, Fitness ...
14391 2PxZ-fICnd432NJHefXrcA Hotels & Travel, Airports
33354 oQyf1788YWsiDLupGva6sw Pizza, Grocery, Restaurants, Food, Delis, Sand...
33452 OQ2oHkcWA8KNC1Lsvj1SBA Caterers, Restaurants, Breakfast & Brunch, Sou...
85210 Wqetc51pFQzz04SXh_AORA Coffee & Tea, Food

   The various categories can be split into single components by setting the business_id as the index, splitting by the ,, followed by resetting the index. This column can then be renamed for a unique feature name. This resulted in some errors in the spacing that might have existed before processing, so we can remove the first space to align each observation to normalize the subsets for downstream comparisons. Then duplicate observations can be removed. Let's now examine the first ten observations after processing:

In [ ]:
df_category_split = df_category_split.set_index(['business_id'])
df_category_split = df_category_split.stack().str.split(',',
                                                        expand=True).stack().unstack(-2).reset_index(-1,
                                                                                                     drop=True).reset_index()
df_category_split.rename(columns={'categories_business':
                                  'categories_combined'}, inplace=True)
df_category_split['categories_combined'] = df_category_split['categories_combined'].str.strip()
df_category_split = df_category_split.drop_duplicates()
df_category_split[:10]
Out[ ]:
business_id categories_combined
0 ENwBByjpoa5Gg7tKgxqwLg Mexican
1 ENwBByjpoa5Gg7tKgxqwLg Restaurants
2 jKO4Og6ucdX2-YCTKQVYjg Canadian (New)
3 jKO4Og6ucdX2-YCTKQVYjg Cafes
4 jKO4Og6ucdX2-YCTKQVYjg Gastropubs
5 jKO4Og6ucdX2-YCTKQVYjg Lounges
6 jKO4Og6ucdX2-YCTKQVYjg Nightlife
7 jKO4Og6ucdX2-YCTKQVYjg Cocktail Bars
8 jKO4Og6ucdX2-YCTKQVYjg Bars
9 jKO4Og6ucdX2-YCTKQVYjg Food Delivery Services

   The table containing the processed categories_business can now be left joined with the main table utilizing business_id the keys for both sets and the original categories_business feature and duplicates removed from the set. Now, we can use the data_summary function again to examine the number of unique and missing values as well as the data types.

In [ ]:
df = pd.merge(df, df_category_split, how='left', left_on=['business_id'],
              right_on=['business_id'])
df = df.drop(['categories_business'], axis=1)
df = df.drop_duplicates(subset='review_id')

del df_category_split

print('\nSummary - Preprocessing Yelp Reviews for Category:')
print(data_summary(df))

Summary - Preprocessing Yelp Reviews for Category:
Number of Rows: 7952263, Columns: 55
                                  Number of Unique Values  \
review_id                                         7952263   
user_id                                           2032516   
business_id                                        106642   
stars_reviews                                           5   
useful_reviews                                        229   
funny_reviews                                         189   
cool_reviews                                          206   
text_reviews                                      7935287   
date_reviews                                      7825308   
name_business                                       79920   
address                                             86334   
city                                                  623   
state                                                  21   
postal_code                                          4667   
latitude                                            96789   
longitude                                           95020   
stars_business                                          9   
review_countbusiness                                 1281   
is_open                                                 2   
attributes_business                                 75159   
hours_business                                      38206   
review_count_user                                    1917   
yelping_since_user                                2024363   
useful_user                                          5184   
funny_user                                           3666   
cool_user                                            4363   
elite_user                                           1227   
average_stars_user                                    401   
compliment_hot_user                                  1311   
compliment_more_user                                  349   
compliment_profile_user                               345   
compliment_cute_user                                  315   
compliment_list_user                                  190   
compliment_note_user                                  959   
compliment_plain_user                                1638   
compliment_cool_user                                 1575   
compliment_funny_user                                1575   
compliment_writer_user                                818   
compliment_count_tip_idSum                             23   
compliment_count_tip_businessSum                       27   
businessCheckin_date                               106642   
dateDay_checkinSum                                   2683   
hours_businessOriginal                              38206   
hours_businessRegex                                  2883   
hoursBusiness_day                                       8   
hours_working                                        1238   
hours_businessOpen                                     86   
hours_businessClosed                                   94   
hours_businessOpen_amPM                                 3   
hours_businessClosed_amPM                               3   
hours_businessOpenDaily                               154   
date_reviews_Year                                      18   
date_reviews_YearMonth                                196   
date_reviews_YearWeek                                 853   
categories_combined                                  1067   

                                  Number of Missing Values  \
review_id                                                0   
user_id                                                  0   
business_id                                              0   
stars_reviews                                            0   
useful_reviews                                           0   
funny_reviews                                            0   
cool_reviews                                             0   
text_reviews                                             0   
date_reviews                                             0   
name_business                                            0   
address                                                  0   
city                                                     0   
state                                                    0   
postal_code                                              0   
latitude                                                 0   
longitude                                                0   
stars_business                                           0   
review_countbusiness                                     0   
is_open                                                  0   
attributes_business                                      0   
hours_business                                           0   
review_count_user                                        0   
yelping_since_user                                       0   
useful_user                                              0   
funny_user                                               0   
cool_user                                                0   
elite_user                                               0   
average_stars_user                                       0   
compliment_hot_user                                      0   
compliment_more_user                                     0   
compliment_profile_user                                  0   
compliment_cute_user                                     0   
compliment_list_user                                     0   
compliment_note_user                                     0   
compliment_plain_user                                    0   
compliment_cool_user                                     0   
compliment_funny_user                                    0   
compliment_writer_user                                   0   
compliment_count_tip_idSum                               0   
compliment_count_tip_businessSum                         0   
businessCheckin_date                                     0   
dateDay_checkinSum                                       0   
hours_businessOriginal                                   0   
hours_businessRegex                                      0   
hoursBusiness_day                                        0   
hours_working                                            0   
hours_businessOpen                                       0   
hours_businessClosed                                     0   
hours_businessOpen_amPM                                  0   
hours_businessClosed_amPM                                0   
hours_businessOpenDaily                                  0   
date_reviews_Year                                        0   
date_reviews_YearMonth                                   0   
date_reviews_YearWeek                                    0   
categories_combined                                      0   

                                 Data type of variable  
review_id                                       object  
user_id                                         object  
business_id                                     object  
stars_reviews                                    int64  
useful_reviews                                   int64  
funny_reviews                                    int64  
cool_reviews                                     int64  
text_reviews                                    object  
date_reviews                            datetime64[ns]  
name_business                                   object  
address                                         object  
city                                            object  
state                                           object  
postal_code                                     object  
latitude                                       float64  
longitude                                      float64  
stars_business                                 float64  
review_countbusiness                             int64  
is_open                                          int64  
attributes_business                             object  
hours_business                                  object  
review_count_user                                int64  
yelping_since_user                              object  
useful_user                                      int64  
funny_user                                       int64  
cool_user                                        int64  
elite_user                                      object  
average_stars_user                             float64  
compliment_hot_user                              int64  
compliment_more_user                             int64  
compliment_profile_user                          int64  
compliment_cute_user                             int64  
compliment_list_user                             int64  
compliment_note_user                             int64  
compliment_plain_user                            int64  
compliment_cool_user                             int64  
compliment_funny_user                            int64  
compliment_writer_user                           int64  
compliment_count_tip_idSum                       int64  
compliment_count_tip_businessSum                 int64  
businessCheckin_date                            object  
dateDay_checkinSum                             float64  
hours_businessOriginal                          object  
hours_businessRegex                             object  
hoursBusiness_day                               object  
hours_working                                   object  
hours_businessOpen                              object  
hours_businessClosed                            object  
hours_businessOpen_amPM                         object  
hours_businessClosed_amPM                       object  
hours_businessOpenDaily                         object  
date_reviews_Year                                int64  
date_reviews_YearMonth                       period[M]  
date_reviews_YearWeek                           object  
categories_combined                             object  
None

   Let's now examine the processed categories_combined feature to see what are the top 20 categories in the set using value_counts as well as visually with seaborn.barplot.

In [ ]:
plt.rcParams.update({'font.size': 15})

x = df.categories_combined.value_counts()
print('- There are', len(x), 'different categories of Businesses in Yelp')
print('\n')

x = x.sort_values(ascending=False)
x = x.iloc[0:20]
print('Top 20 categories in Yelp:')
print(x)
print('\n')
plt.figure(figsize=(16,10))
ax = sns.barplot(x=x.index, y=x.values, alpha=0.9)
plt.title('Top 20 Categories in Yelp Reviews', fontsize=18)
locs, labels = plt.xticks()
plt.setp(labels, rotation=70)
plt.ylabel('Number of Businesses', fontsize=14)
plt.xlabel('Type of Category', fontsize=14)
rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height+5, label, ha='center',
            va='bottom', fontsize=11)
plt.tight_layout()

del x
- There are 1067 different categories of Businesses in Yelp


Top 20 categories in Yelp:
Restaurants                  1354633
Food                          452393
Nightlife                     264904
Bars                          228471
American (New)                183332
American (Traditional)        173218
Pizza                         129175
Beauty & Spas                 129045
Sandwiches                    126985
Breakfast & Brunch            126874
Mexican                       126120
Shopping                      114499
Coffee & Tea                  114124
Italian                       110842
Seafood                       102028
Event Planning & Services      98319
Japanese                       94481
Hotels & Travel                81639
Burgers                        80198
Sushi Bars                     79739
Name: categories_combined, dtype: int64


   The top five categories are Restaurants, Food, Nightlife, Bars and American(New). This contains a high similarity between groups, so focusing on the food related businesses sounds like a worthwhile venture.

   Now, we can write the processed data to a .csv file and proceed to processing the text data from the text_reviews feature.

NLP Preprocessing

   Before preprocessing the reviews, let's download some word sets for cleaning the text in the reviews. We can utilize en_core_web_lg from the spaCy library as well as wordnet, punkt and stopwords from the NLTK library. To download en_core_web_lg from spaCy, run !python -m spacy download en_core_web_lg in the notebook environment or without the ! if utilizing terminal. Then this can be loaded into the notebook after importing the library.

   Now, we can subset the features the previous plots showed that impact the distribution of the reviews.

In [ ]:
import spacy
import nltk

nlp = spacy.load('en_core_web_lg')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')

df = df[['review_id', 'text_reviews', 'stars_reviews', 'name_business', 'city',
         'state', 'stars_business', 'review_countbusiness',
         'categories_combined']]

print('Number of rows and columns:', df.shape)
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\aschu\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\aschu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\aschu\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


- Number of rows and columns: (7952263, 9)

   Yelp offers the ability for individuals to review various types of businesses ranging from food to every day goods and services. For this analysis, let's focus on food restaurants. Using the processed categories feature, the data was filtered to the food categories with over 30k counts and the states with the seven highest counts of reviews.

In [ ]:
v = df.categories_combined.value_counts()
df = df[df.categories_combined.isin(v.index[v.gt(30000)])]

df = df.loc[(df['categories_combined']=='Restaurants')
             | (df['categories_combined']=='Food')
             | (df['categories_combined']=='American (New)')
             | (df['categories_combined']=='American (Traditional)')
             | (df['categories_combined']=='Pizza')
             | (df['categories_combined']=='Sandwiches')
             | (df['categories_combined']=='Breakfast & Brunch')
             | (df['categories_combined']=='Mexican')
             | (df['categories_combined']=='Italian')
             | (df['categories_combined']=='Seafood')
             | (df['categories_combined']=='Japanese')
             | (df['categories_combined']=='Burgers')
             | (df['categories_combined']=='Sushi Bars')
             | (df['categories_combined']=='Chinese')
             | (df['categories_combined']=='Desserts')
             | (df['categories_combined']=='Thai')
             | (df['categories_combined']=='Bakeries')
             | (df['categories_combined']=='Asian Fusian')
             | (df['categories_combined']=='Steakhouse')
             | (df['categories_combined']=='Salad')
             | (df['categories_combined']=='Cafes')
             | (df['categories_combined']=='Barbeque')
             | (df['categories_combined']=='Southern')
             | (df['categories_combined']=='Ice Cream & Frozen Yogurt')
             | (df['categories_combined']=='Vietnamese')
             | (df['categories_combined']=='Vegetarian')
             | (df['categories_combined']=='Specialty Food')
             | (df['categories_combined']=='Mediterranean ')
             | (df['categories_combined']=='Local Flavor')
             | (df['categories_combined']=='Indian')
             | (df['categories_combined']=='Tex-Mex')]

print('- Dimensions after filtering food categories with over 30k counts:',
      df.shape)

df1 = df['state'].value_counts().index[:7]
df = df[df['state'].isin(df1)]

del df1

print('- Dimensions after filtering US states with the 7 highest count of reviews:',
      df.shape)
- Dimensions after filtering food categories with over 30k counts: (3853083, 9)
- Dimensions after filtering US states with the 7 highest count of reviews: (3738688, 9)

   This significantly reduced the number of observations to less than four million compared to the almost eight million in the initial set.

Initial EDA of Reviews - Food with > 30k Counts

   Text data contains common words called stopwords which are frequent in sentences and will need to be processed before text classification. Before processing the reviews, we can utilize lambda functions again to determine the length of the words and characters in each initial review by splitting the strings and calculating the length with len(x). Then the resulting features can be plotted next to each other using different colors. After plotting, we can drop these temporary variables.

In [ ]:
df['review_wordCount'] = df['text_reviews'].apply(lambda x: len(str(x).split()))
df['review_charCount'] = df['text_reviews'].apply(lambda x: len(str(x)))

f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(11,7), sharex=True)
f.suptitle('Restaurants > 30k Counts in Top 7 States: Length of Words and Characters for Each Review',
           fontsize=15)
f.text(0.5, 0.04, 'Review Number', ha='center', fontsize=15)
ax1.plot(df['review_wordCount'], color='red')
ax1.set_ylabel('Word Count', fontsize=15)
ax2.plot(df['review_charCount'], color='blue')
ax2.set_ylabel('Character Count', fontsize=15)

df = df.drop(['review_wordCount', 'review_charCount'], axis=1)

    For the reviews in this food oriented set, the word count is slightly under 1000 words and around 5000 characters.

Text Preprocessing

   Text data often contains multiple languages, which is something to consider before any type of downstream modeling. For the initial text processing step, langid can be used to first detect the language present in the text strings. The pandas dataframe can be converted to a dask dataframe partitioning the data into ten sets for parallel processing and an apply function can be utilized on the text_reviews feature to detect the language. Then the dask dataframe can be converted back using compute.

In [ ]:
import dask.dataframe as dd
import langid
import time

ddata = dd.from_pandas(df, npartitions=10)

del df

print('Time for cleaning with Langid to find the language of the reviews..')
search_time_start = time.time()
ddata['language'] = ddata['text_reviews'].apply(langid.classify,
                                                meta=('text_reviews',
                                                      'object')).compute(scheduler='processes')
print('Finished cleaning with Langid in:', time.time() - search_time_start)

df = ddata.compute()

del ddata
Time for cleaning with Langid to find the language of the reviews..
Finished cleaning with Langid in: 141146.58668327332

   Then a lambda function can be used to extract the detected language in string format for the language feature. The number of estimated languages present, the percent which is English and the number of non-English/English reviews can now be calculated. Then the non-English reviews can be filtered out of the set.

In [ ]:
df['language'] = df['language'].apply(lambda tuple: tuple[0])

print('- Number of tagged languages (estimated):')
print(len(df['language'].unique()))
print('- Percent of data in English (estimated):')
print((sum(df['language'] == 'en') / len(df)) * 100)

df1 = df.loc[df['language'] != 'en']
print('- Number of non-English reviews:', df1.shape[0])

del df1

df = df.loc[df['language'] == 'en']
print('- Number of English reviews:', df.shape[0])

df = df.drop(['language'], axis=1)
- Number of tagged languages (estimated):
55
- Percent of data in English (estimated):
99.79757604806821
- Number of non-English reviews: 7568
- Number of English reviews: 3731120

   There are 55 unique languages present in the set, and 99.8% are estimated to be English. This retains most of the observations. The presence of other languages if processed using English based word sets probably would negatively impact, and hence skew any downstream modeling results since they would be considered as anomalies in the set.

   Now, let's define a class called cleantext to remove the non-words in the text strings before processing the words. The instance of this class (self) can be set to use text so it can be applied to multiple functions within the cleantext class. Using the re module, functions within cleantext can be utilized to remove square brackets ([]), numbers and special characters if they are present using re.sub. replace contractions. get_words tokenize. remove non-ASCII characters from list of tokenized words. remove_punctuation. join words.

In [ ]:
import re
import contractions
import string
import unicodedata

class cleantext():

    def __init__(self, text='text'):
        self.text = text

    def remove_between_square_brackets(self):
        self.text = re.sub('\[[^]]*\]', '', self.text)
        return self

    def remove_numbers(self):
        self.text = re.sub('[-+]?[0-9]+', '', self.text)
        return self

    def remove_special_characters(self, remove_digits=True):
        self.text = re.sub('[^a-zA-z0-9\s]','', self.text)
        return self

    def replace_contractions(self):
        self.text = contractions.fix(self.text)
        return self

    def get_words(self):
        self.words = nltk.word_tokenize(self.text)
        return self

    def remove_non_ascii(self):
        new_words = []
        for word in self.words:
            new_word = unicodedata.normalize('NFKD',
                                             word).encode('ascii',
                                                          'ignore').decode('utf-8',
                                                                           'ignore')
            new_words.append(new_word)
        self.words = new_words
        return self

    def remove_punctuation(self):
        new_words = []
        for word in self.words:
            new_word = re.sub(r'[^\w\s]', '', word)
            if new_word != '':
                new_words.append(new_word)
        self.words = new_words
        return self

    def to_lowercase(self):
        new_words = []
        for word in self.words:
            new_word = word.lower()
            new_words.append(new_word)
        self.words = new_words
        return self

    def join_words(self):
        self.words = ' '.join(self.words)
        return self

    def do_all(self, text):

        self.text = text
        self = self.remove_numbers()
        self = self.remove_special_characters()
        self = self.replace_contractions()
        self = self.get_words()
        self = self.remove_non_ascii()
        self = self.remove_punctuation()
        self = self.to_lowercase()

        return self.words

ct = cleantext()

def dask_this(df):
    res = df.apply(ct.do_all)
    return res

   The pandas.Dataframe can be converted to a Dask dataframe partitioning the data into ten sets for parallel processing and an apply function can be utilized on the text_reviews to remove the non-words within the text string defined in the clean_text class. Then the dask dataframe can be converted back using compute.

In [ ]:
ddata = dd.from_pandas(df, npartitions=10)

del df

print('Time for reviews to be cleaned for non-words...')
search_time_start = time.time()
ddata['cleanReview'] = ddata['text_reviews'].map_partitions(dask_this).compute(scheduler='processes')
print('Finished cleaning reviews for non-words in:',
      time.time() - search_time_start)

df = ddata.compute()

del ddata
Time for reviews to be cleaned for non-words...
Finished cleaning reviews for non-words in: 5481.95995259285

   Now, we can drop the original text_reviews and remove the comma from tokenizing to make the text as a single string.

In [ ]:
df = df.drop(['text_reviews'], axis=1)

df['cleanReview'] = df['cleanReview'].apply(lambda x: ','.join(map(str, x)))
df['cleanReview'] =  df['cleanReview'].str.replace(r',', ' ', regex=True)

   Building on the first class object that was utilized to process the non-word components, we can define a more condensed one to process the words in the text data. Upon first pass, there were non-UTF8 characters which existed in the set, which might have derived from exporting the data and not processing it to the full entirety in a single session after importing, so those few rows had to be located and removed. When using a library that specifies the words in the library are stopwords, it might be context driven given the text data that was used to derive that definition. Therefore, careful inspection of what is used as the reference word bank dictionary as well as fine tuning might be required to optimally solve the tasked problem. Potentially, adding more common words to the stopwords_list will allow for less prevalent words, which might reveal more insightful text information to not be masked.

   Let's start by tokenizing using nltk followed by removing the stop words from the reviews using nltk.corpus. Then the text can be broken down to the root words using the WordNetLemmatizer and the processed string can then be joined as one single text string.

In [ ]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stopwords_list = stopwords.words('english')
stopwords_list.extend(('thing', 'eat'))

class cleantext1():

    def __init__(self, text='test'):
        self.text = text

    def get_words(self):
        self.words = nltk.word_tokenize(self.text)
        return self

    def remove_stopwords(self):
        new_words = []
        for word in self.words:
            if word not in stopwords_list:
                new_words.append(word)
        self.words = new_words
        return self

    def lemmatize_words(self):
        lemmatizer = WordNetLemmatizer()
        lemmas = []
        for word in self.words:
            lemma = lemmatizer.lemmatize(word)
            lemmas.append(lemma)
        self.words = lemmas
        return self

    def join_words(self):
        self.words = ' '.join(self.words)
        return self

    def do_all(self, text):

        self.text = text
        self = self.get_words()
        self = self.remove_stopwords()
        self = self.lemmatize_words()

        return self.words

ct = cleantext1()

def dask_this(df):
    res = df.apply(ct.do_all)
    return res
In [ ]:
ddata = dd.from_pandas(df, npartitions=10)

del df

print('Time for reviews to be cleaned for stopwords and lemma...')
search_time_start = time.time()
ddata['cleanReview1'] = ddata['cleanReview'].map_partitions(dask_this).compute(scheduler='processes')
print('Finished cleaning reviews in:', time.time() - search_time_start)

df = ddata.compute()

del ddata

df['cleanReview1'] = df['cleanReview1'].apply(lambda x: ','.join(map(str, x)))
df['cleanReview1'] =  df['cleanReview1'].str.replace(r',', ' ', regex=True)
Time for reviews to be cleaned for stopwords and lemma...
Finished cleaning reviews in: 4092.073033809662

EDA of Cleaned Reviews

   We can then utilize the same lambda function structure that was used for the initial reviews to determine the length of the words in the processed reviews after removing non-words, stopwords and lemmatization and plot this next to the length from removing non-words. After plotting, we can drop these temporary variables.

In [ ]:
df['cleanReview_wordCount'] = df['cleanReview'].apply(lambda x: len(str(x).split()))
df['cleanReview_wordCount1'] = df['cleanReview1'].apply(lambda x: len(str(x).split()))

f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(11,7), sharex=True, sharey=True)
f.suptitle('Length of Words After Removing Non-Words and Stopwords/Lemmatization',
           fontsize=15)
f.text(0.5, 0.04, 'Review Number', ha='center', fontsize=15)
ax1.plot(df['cleanReview_wordCount'], color='red')
ax1.set_ylabel('Word Count', fontsize=15)
ax2.plot(df['cleanReview_wordCount1'], color='blue');

df = df.drop(['cleanReview_wordCount', 'cleanReview_wordCount1'], axis=1)

   When comparing the length of the reviews in regard to the word count, the processed reviews contain around 500 words compared to ~1000 in the initial set.

   Next, we can examine the length of the characters before/after removing non-words and stopwords/lemmatization.

In [ ]:
df['cleanReview_charCount'] = df['cleanReview'].apply(lambda x: len(str(x)))
df['cleanReview_charCount1'] = df['cleanReview1'].apply(lambda x: len(str(x)))

f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(11,7), sharex=True, sharey=True)
f.suptitle('Length of Characters After Removing Non-Words and Stopwords/Lemmatization',
           fontsize=15)
f.text(0.5, 0.04, 'Review Number', ha='center', fontsize=15)
ax1.plot(df['cleanReview_charCount'], color='red')
ax1.set_ylabel('Character Count', fontsize=15)
ax2.plot(df['cleanReview_charCount1'], color='blue');

df = df.drop(['cleanReview_charCount', 'cleanReview_charCount1'], axis=1)

   The initial character count was around 5000, and now it is around 3500.

   Let's now create a temporary dataframe containing the 1 & 2 star reviews and another with the 5 star reviews. Then subset the cleaned reviews to create a word cloud for the higher and the lower rated reviews.

In [ ]:
from wordcloud import WordCloud, STOPWORDS

df1 = df.loc[(df['stars_reviews'] == 1) | (df['stars_reviews'] == 2)]
df2 = df[df.stars_reviews == 5]

df1_clean = df1['cleanReview1']
df2_clean = df2['cleanReview1']

def wordcloud_draw(data, color='black'):
    words = ' '.join(data)
    cleaned_word = ' '.join([word for word in words.split()
                            if 'http' not in word
                             and not word.startswith('@')
                             and not word.startswith('#')
                             and word != 'RT'])
    wordcloud = WordCloud(stopwords=STOPWORDS,
                          background_color=color,
                          width=1500,
                          height=1000).generate(cleaned_word)
    plt.figure(1, figsize=(10,10))
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show()

print('- Higher Rated Reviews:')
wordcloud_draw(df2_clean, 'white')

print('- Lower Rated Reviews:')
wordcloud_draw(df1_clean, 'black')

del df1_clean, df2_clean
- Higher Rated Reviews:
- Lower Rated Reviews:

   Word clouds can be utilized to visualize the words present, but let's examine on a more granular level by defining a function get_all_words to find all of the words present in the set and then FreqDist can be used to generate the frequency of the words in the set. Let's first find the 100 most common words in 5 star reviews and write the list to a .csv file. Then perform the same operations with the lowest stars set.

   We can then find the top 10 words that are in both the 5 and 1 & 2 star review sets by converting the list to a dataframe using an inner merge with word as the key.

In [ ]:
from nltk import FreqDist
import csv

df1_clean = df1['cleanReview1']
df2_clean = df2['cleanReview1']

def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

ratings_higher_words = get_all_words(df2_clean)

freq_dist_higher = FreqDist(ratings_higher_words)
print(freq_dist_higher.most_common(100))

list0 = freq_dist_higher.most_common(100)
with open('topWords_5starReviews.csv','w') as f:
    writer = csv.writer(f)
    writer.writerow(['word', 'count'])
    writer.writerows(list0)
[('food', 907650), ('place', 857441), ('great', 807273), ('good', 671325), ('time', 461869), ('service', 447048), ('delicious', 415155), ('best', 407790), ('one', 393959), ('get', 376759), ('like', 374557), ('love', 359517), ('go', 357094), ('back', 353739), ('also', 332613), ('amazing', 331685), ('really', 318694), ('restaurant', 315956), ('always', 280241), ('would', 269888), ('friendly', 260852), ('well', 258879), ('definitely', 253100), ('try', 247519), ('chicken', 244488), ('staff', 234439), ('fresh', 233678), ('got', 229970), ('menu', 229908), ('nice', 225750), ('pizza', 224170), ('come', 219183), ('order', 216886), ('make', 213000), ('u', 212652), ('favorite', 210338), ('ordered', 204997), ('even', 193945), ('everything', 187831), ('wait', 177289), ('recommend', 176677), ('sauce', 175563), ('little', 175295), ('made', 174974), ('ever', 174782), ('flavor', 171883), ('first', 170353), ('drink', 164243), ('dish', 164075), ('meal', 160894), ('could', 160238), ('price', 159886), ('cheese', 155463), ('experience', 154804), ('came', 154719), ('day', 151564), ('excellent', 151163), ('every', 151055), ('awesome', 149399), ('friend', 147272), ('perfect', 145173), ('salad', 140429), ('people', 140423), ('never', 139179), ('much', 139124), ('super', 138831), ('went', 137605), ('table', 136285), ('right', 135117), ('coffee', 134028), ('lunch', 134004), ('sandwich', 133270), ('dinner', 132267), ('night', 130968), ('bar', 129355), ('new', 127717), ('lot', 126360), ('burger', 125658), ('spot', 124890), ('know', 124560), ('sweet', 123220), ('area', 122268), ('want', 121457), ('tried', 121142), ('worth', 120168), ('say', 118705), ('way', 116371), ('take', 116069), ('sure', 115413), ('atmosphere', 114831), ('going', 114626), ('taste', 114487), ('side', 114453), ('two', 113803), ('sushi', 111555), ('taco', 110752), ('roll', 110315), ('cream', 110096), ('better', 108976), ('around', 108057)]
In [ ]:
ratings_lower_words = get_all_words(df1_clean)

freq_dist_lower = FreqDist(ratings_lower_words)
print(freq_dist_lower.most_common(100))

list1 = freq_dist_lower.most_common(100)
with open('topWords_12starReviews.csv','w') as f:
    writer = csv.writer(f)
    writer.writerow(['word', 'count'])
    writer.writerows(list1)
[('food', 692310), ('place', 442306), ('time', 390432), ('like', 362354), ('good', 349906), ('order', 346469), ('would', 346416), ('service', 342685), ('one', 329872), ('get', 307063), ('u', 290058), ('back', 287307), ('ordered', 269553), ('go', 249807), ('restaurant', 244083), ('even', 220005), ('got', 216076), ('minute', 212874), ('table', 210185), ('could', 204647), ('came', 197829), ('really', 193732), ('never', 189108), ('said', 172019), ('asked', 162545), ('chicken', 159590), ('drink', 154294), ('come', 150244), ('went', 147995), ('customer', 144845), ('people', 144466), ('pizza', 142332), ('bad', 141194), ('better', 139051), ('first', 135819), ('also', 131484), ('told', 131418), ('great', 131147), ('experience', 130414), ('server', 128796), ('know', 128683), ('two', 128158), ('make', 127658), ('much', 126049), ('menu', 125043), ('going', 125024), ('say', 122213), ('way', 120401), ('wait', 119741), ('took', 119248), ('take', 115695), ('meal', 114689), ('well', 113341), ('another', 111256), ('give', 109397), ('want', 108862), ('hour', 106335), ('sauce', 106109), ('staff', 102939), ('manager', 101830), ('price', 100994), ('ever', 100537), ('star', 100447), ('review', 100262), ('friend', 99672), ('still', 99397), ('waitress', 98706), ('taste', 97187), ('made', 97181), ('location', 94592), ('think', 93381), ('try', 93164), ('cheese', 93058), ('dish', 92261), ('bar', 91850), ('pretty', 90993), ('salad', 89606), ('nice', 89484), ('burger', 88897), ('day', 87823), ('night', 87427), ('last', 87328), ('around', 86724), ('left', 85006), ('wanted', 84307), ('something', 83008), ('nothing', 82210), ('long', 82125), ('little', 81413), ('right', 81196), ('ok', 78301), ('since', 76041), ('flavor', 75957), ('cold', 75899), ('tasted', 75815), ('see', 75052), ('sure', 74714), ('small', 73807), ('dinner', 72699), ('worst', 72149)]
In [ ]:
df1 = pd.DataFrame(list0, columns=['word', 'count'])
df1 = df1[['word']]

df2 = pd.DataFrame(list1, columns=['word', 'count'])
df2 = df2[['word']]

df1 = df1.merge(df2, on='word')
print(df1.iloc[0:10])

del df1, df2, df1_clean, df2_clean
      word
0     food
1    place
2    great
3     good
4     time
5  service
6      one
7      get
8     like
9       go

   The most frequent words in both sets are food, place, great, good, time and service. This makes rational sense since we filtered the original set by food.

   Using the open-source textblob library, we can utilize the sentiment function within a lambda function to generate a polarity score where Negative = -1.0 and Positive = 1.0 with the possibility of intermediate classes since it is a float numerical score. Let's apply this on the set and create the polarity feature. Now we can examine the descriptive statistics using describe again.

In [ ]:
from textblob import TextBlob

df = df[['stars_reviews', 'cleanReview1']]
df.rename(columns={'cleanReview1': 'cleanReview'}, inplace=True)

df['polarity'] = df['cleanReview'].apply(lambda x: TextBlob(x).sentiment[0])

print(df['polarity'].describe().apply("{0:.3f}".format))
count    3722536.000
mean           0.249
std            0.229
min           -1.000
25%            0.121
50%            0.255
75%            0.389
max            1.000
Name: polarity, dtype: object

   Since this generated a numerical output, we can define set thresholds within a function getAnalysis to demarcate the levels of polarity using if and elif conditional statements. This can be applied to the polarity feature, labeled as sentiment, a new qualitative feature.

In [ ]:
def getAnalysis(score):
  if score >= 0.4:
      return 'Positive'
  elif score > 0.2 and score < 0.4:
      return 'Slightly Positive'
  elif score <= 0.2 and score >= 0.0:
      return 'Slightly Negative'
  else:
      return 'Negative'

df['sentiment'] = df['polarity'].apply(getAnalysis)

df[['sentiment']].value_counts()
sentiment     
Slightly Positive    1399223
Slightly Negative    1029281
Positive              872122
Negative              414937
dtype: int64

   Interestingly, discrepancies exist between the stars_reviews and polarity features. There are a higher number of 5 star reviews that are labeled as Slightly Positive compared to Positive, 4 and 5 star reviews labeled as Slightly Negative, and even 1 star reviews considered as Positive as well as 5 star reviews labeled Negative for the sentiment feature. This is dependent on the input data that was used to set the numerical range of values defining polarity for the textblob library. Since an apparent difference exists between the stars_reviews and polarity features, we can retain both of them and evaluate how classification performs using them separately as the target or label for the model.

   Given that textblob allocated different sentiment scores compared to the defined levels of stars_reviews, an equivalent size of each target needs to be sampled so the classes are balanced and comparisons can be made between the different approaches. Since the negative Sentiment contains the least amount of observations (n=414937), let's sample the same amount for the positive Sentiment by filtering and then shuffling before sampling. These can then be concatenated, shuffled and saved as a parquet file for later use.

In [ ]:
from sklearn.utils import shuffle

df1 = df[df.sentiment == 'Positive']
df1 = shuffle(df1)
df1 = df1.sample(n=414937)

df2 = df[df.sentiment == 'Negative']

sent = pd.concat([df1, df2])
del df1, df2

sent = shuffle(sent)

sent = sent[['cleanReview', 'sentiment', 'stars_reviews']]
sent.to_parquet('YelpReviews_NLP_sentimentNegPos.parquet')

   Let's now sample the equivalent numbers for the stars_reviews set by first creating a temporary feature stars where 1 & 2 star reviews are defined as zero and 5 star reviews as one. Then each subset is filtered, shuffled and sampled using the same number of observations as the negative Sentiment (n=414937) to the balance sets. Then the two sets are concatenate by row, shuffled, the temporary stars feature is dropped and then saved as a parquet file.

In [ ]:
df['stars'].mask(df['stars_reviews'] == 1, 0, inplace=True)
df['stars'].mask(df['stars_reviews'] == 2, 0, inplace=True)
df['stars'].mask(df['stars_reviews'] == 5, 1, inplace=True)

df1 = df[df.stars==1]
df1 = shuffle(df1)
df1 = df1.sample(n=414937)

df2 = df[df.stars==0]
df2 = shuffle(df2)
df2 = df2.sample(n=414937)

df = pd.concat([df1, df2])
df = shuffle(df)
df = df.drop(['stars'], axis=1)

del df1, df2

df.to_parquet('YelpReviews_NLP_125stars.parquet')

Classification

Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF)

   A bag-of-words (BoW) model is a method to extract the features from text information to use in modeling. It was first introduced in Zellig Harris's 1954 publication Distributional Structure. The words within the text are compiled into a bag of words where the order and the structure of the words in the text information is not considered. The text is preprocessed using the methods in the NLP Preprocessing section and a vocabulary is created by determining the frequency of each word within the text. The text is then converted into a vector since numerical values are needed for modeling

   One limitation of using the frequency of a word for the score is highly frequent words begin to generate larger scores when they do not contain insightful information compared to less utilized words, resulting in negative effects when modeling. Therefore, we can leverage rescaling the frequency that words appear so that the scores for the frequent words are penalized. This was suggested in A Statistical Interpretation of Term Specificity and Its Application in Retrieval. We can generate a score of the frequency of the word in each of the separate reviews defined as Term Frequency and the generate another score for how important this word is across all of the text data defined as Inverse Document Frequency. Then a weighted score can be generated where all of words are not considered as equally important or interesting. This score then allows for the words which are distinct (potentially containing useful information) to be highlighted. TF-IDF is a statistical measure that considers the importance of the words in the text which can be utilized in modeling after using text vectorization.

   The BoW and TF-IDF notebooks can be found here for the Sentiment set and herefor the Stars on Reviews group.

Sentiment

   Let's first set up the environment by importing the nltk package and downloading stopwords and wordnet to utilize for processing the text. Then we can import the necessary packages and setting the random, numpy and the torch seed followed by assigning the device to CUDA if is available to utilize the GPU. Then we can examine the PyTorch, CUDA and NVIDIA GPU information and empty the cache for the device.

In [ ]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
Out[ ]:
True
In [ ]:
import os
import random
import numpy as np
from tqdm import tqdm, tqdm_notebook
import torch
from subprocess import call
tqdm.pandas()

seed_value = 42
random.seed(seed_value)
np.random.seedseed_value)
torch.manual_seed(seed_value)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('CUDA and NVIDIA GPU Information')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
print('\n')
print('Torch version: {}'.format(torch.__version__))
print('pyTorch VERSION:', torch.__version__)
print('\n')
print('__CUDNN VERSION:', torch.backends.cudnn.version())
print('__Number CUDA Devices:', torch.cuda.device_count())
print('__Devices')
call(['nvidia-smi', '--format=csv',
      '--query-gpu=index,name,driver_version,memory.total,memory.used,memory.free'])
print('Active CUDA Device: GPU', torch.cuda.current_device())
print ('Available devices ', torch.cuda.device_count())
print ('Current cuda device ', torch.cuda.current_device())
torch.cuda.empty_cache()
CUDA and NVIDIA GPU Information
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
Sun May 22 16:18:37 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P8    10W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


Torch version: 1.11.0+cu113
pyTorch VERSION: 1.11.0+cu113


__CUDNN VERSION: 8200
__Number CUDA Devices: 1
__Devices
Active CUDA Device: GPU 0
Available devices  1
Current cuda device  0

   Now we can read the parquet file into a pandas.Dataframe, examine the dimensions and a few observations.

In [ ]:
import pandas as pd

df = pd.read_parquet('YelpReviews_NLP_sentimentNegPos.parquet')
print('Number of rows and columns:', df.shape)
df.head()
Number of rows and columns: (829874, 3)
Out[ ]:
cleanReview sentiment stars_reviews
index
1 order chicken finger sub honey mustard sauce p... Negative 3.0
3 dedicated loving memory gary feldman greatest ... Negative 5.0
12 absolutely horrible thought would order place ... Negative 1.0
15 found better chicken finger know inside crisp ... Negative 4.0
20 amazing everything tried disappoint chicken ca... Negative 5.0

   Let's examine how the original stars_reviews feature compares with the sentiment(polarity) feature using pandas.DataFrame.value_counts.

In [ ]:
print(df[['stars_reviews', 'sentiment']].value_counts())
stars_reviews  sentiment
5.0            Positive     278011
1.0            Negative     231245
4.0            Positive     104585
2.0            Negative      90320
3.0            Negative      44097
4.0            Negative      26276
5.0            Negative      22999
3.0            Positive      22096
2.0            Positive       6690
1.0            Positive       3555
dtype: int64

   Surprisingly, there are 1 star reviews that are positive, and 5 star reviews which are negative. To prepare the target feature for various methods for classification, we need to recode the string to binary numerical values, so let's define Negative as zero and Positive as one, and the 1 and 2 stars_reviews as zero and the five star reviews as one with the objective of predicting what contributes more based on the text within a review for a higher review rating and Positive review.

   Then we can select the reviews with negative sentiment, shuffle and then sample 20,000 reviews. For a balanced set, let's use the same methods for the positive sentiment reviews. Then convert the review to a string and sentiment to an integer.

In [ ]:
from sklearn.utils import shuffle

df['sentiment'].mask(df['sentiment'] == 'Negative', 0, inplace=True)
df['sentiment'].mask(df['sentiment'] == 'Positive', 1, inplace=True)

df['stars_reviews'].mask(df['stars_reviews'] == 1, 0, inplace=True)
df['stars_reviews'].mask(df['stars_reviews'] == 2, 0, inplace=True)
df['stars_reviews'].mask(df['stars_reviews'] == 5, 1, inplace=True)

df = df[['cleanReview', 'sentiment']]
df1 = df[df.sentiment==0]
df1 = shuffle(df1)
df1 = df1.sample(n=20000)

df2 = df[df.sentiment==1]
df2 = shuffle(df2)
df2 = df2.sample(n=20000)
df = pd.concat([df1, df2])
df = shuffle(df)

del df1, df2

df[['cleanReview']] = df[['cleanReview']].astype('str')
df['sentiment'] = df['sentiment'].astype('int')

print('Number of records:', len(df), '\n')
print('Number of positive reviews:', len(df[df.sentiment==1]))
print('Number of negative reviews:', len(df[df.sentiment==0]))
Number of records: 40000 

Number of positive reviews: 20000
Number of negative reviews: 20000 

Preprocess Text

   The reviews were previously processed by removing non-words, converting to lowercase, removal of stop words and lemmatized. The reviews need to be re-tokenized, so let's use wordpunct_tokenize for this task. We can also remove rare words if they are found only in a small frequency in the tokenized text. Then we can construct the BoW vector and TF-IDF vector which will be used when modeling.

In [ ]:
from nltk.tokenize import wordpunct_tokenize

def tokenize(text):
    tokens = wordpunct_tokenize(text)
    return tokens

def remove_rare_words(tokens, common_tokens, max_len):
    return [token if token in common_tokens else '<UNK>' for token in tokens][-max_len:]

def build_bow_vector(sequence, idx2token):
    vector = [0] * len(idx2token)
    for token_idx in sequence:
        if token_idx not in idx2token:
            raise ValueError('Wrong sequence index found!')
        else:
            vector[token_idx] += 1
        return vector

   To further preprocess the text data for modeling, let's define some parameters for the maximum length of the review, maximum vocab size, and batch size for loading the data.

   Now we can define a class YelpReviewsDataset which will process the data from raw string into two vectors, Bag of Words and Term Frequency Inverse Document Frequency, to be used for modeling. We can set the data to process as df, tokenize the text string and build the most common tokens bound by the defined maximum vocabulary size. Then replace the rare words with UNK and remove the sequences with only UNK. Next, build the vocabulary and convert the tokens to indexes. Now we can build the BoW vector and build the TF-IDF vector using the TfidfVectorizer with the input in list format. This then returns feature vectors and target.

In [ ]:
from functools import partial
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

MAX_LEN = 300
MAX_VOCAB = 10000
BATCH_SIZE = 1

class YelpReviewsDataset(Dataset):
    def __init__(self, df, max_vocab=10000, max_len=300):
        """
        Preprocess the text data for modeling.

        Parameters
        ----------
        df : Data to process.
        max_vocab : Maximum size of vocabulary.
        max_len : Maximum length of text in vocubulary.
        """
        df = df

        df['tokens'] = df.cleanReview.apply(partial(tokenize))
        all_tokens = [token for doc in list(df.tokens) for token in doc]

        common_tokens = set(
            list(zip(*Counter(all_tokens).most_common(max_vocab)))[0])

        df.loc[:, 'tokens'] = df.tokens.progress_apply(
            partial(remove_rare_words,
                    common_tokens=common_tokens,
                    max_len=max_len))

        df = df[df.tokens.progress_apply(
            lambda tokens: any(token != '<UNK>' for token in tokens))]

        vocab = sorted(set(token for doc in list(df.tokens) for token in doc))
        self.token2idx = {token: idx for idx, token in enumerate(vocab)}
        self.idx2token = {idx: token for token, idx in self.token2idx.items()}

        df['indexed_tokens'] = df.tokens.progress_apply(
            lambda doc: [self.token2idx[token] for token in doc])

        df['bow_vector'] = df.indexed_tokens.progress_apply(
            build_bow_vector, args=(self.idx2token,))

        vectorizer = TfidfVectorizer(
            analyzer='word',
            tokenizer=lambda doc: doc,
            preprocessor=lambda doc: doc,
            token_pattern=None)
        vectors = vectorizer.fit_transform(df.tokens).toarray()
        df['tfidf_vector'] = [vector.tolist() for vector in vectors]

        self.text = df.cleanReview.tolist()
        self.sequences = df.indexed_tokens.tolist()
        self.bow_vector = df.bow_vector.tolist()
        self.tfidf_vector = df.tfidf_vector.tolist()
        self.targets = df.sentiment.tolist()

    def __getitem__(self, i):
        return (self.sequences[i],
                self.bow_vector[i],
                self.tfidf_vector[i],
                self.targets[i],
                self.text[i])

    def __len__(self):
        return len(self.targets)

   Now, we can load the data using the described class with the specified MAX_VOCAB = 10000 and MAX_LEN = 300.

In [ ]:
dataset = YelpReviewsDataset(df, max_vocab=MAX_VOCAB, max_len=MAX_LEN)

del df
100%|██████████| 40000/40000 [00:00<00:00, 147768.84it/s]
100%|██████████| 40000/40000 [00:00<00:00, 561011.46it/s]
100%|██████████| 40000/40000 [00:00<00:00, 133120.50it/s]
100%|██████████| 40000/40000 [00:05<00:00, 6758.24it/s]

   Let's then examine a random sample of the processed set in regard to the size and some example text data.

In [ ]:
print('Number of records:', len(dataset), '\n')
random_idx = random.randint(0,len(dataset)-1)
print('index:', random_idx, '\n')
sample_seq, bow_vector, tfidf_vector, sample_target, sample_text=dataset[random_idx]
print(sample_text, '\n')
print(sample_seq, '\n')
print('BoW vector size:', len(bow_vector), '\n')
print('TF-IDF vector size:', len(tfidf_vector), '\n')
print('Sentiment:', sample_target, '\n')
Number of records: 40000 

index: 7296 

say warned tonight waited line qing mu woman would already ordered pulled aside tell would waiting minute soup item made table arrived cold thought simply unlucky wound watching entire party leave without served waiting half hour soup lukewarm tasteless alarmingly teenaged daughter refused one party finished meal saturday night tiny place never half full yet still dysfunctional night would like think qing mu going game without many noodle place town risk visiting qing mu 

[7617, 9680, 9056, 9633, 5066, 0, 5707, 9856, 9896, 246, 6088, 6891, 477, 8838, 9896, 9638, 5568, 8211, 4614, 5226, 8718, 450, 1715, 8943, 7958, 9381, 9897, 9697, 2955, 6295, 4974, 9845, 7774, 9638, 3968, 4257, 8211, 5198, 8792, 0, 0, 2250, 7148, 6051, 6295, 3304, 5402, 7596, 5870, 9007, 6525, 5841, 3968, 3592, 9951, 8442, 0, 5870, 9896, 5049, 8928, 0, 5707, 3767, 3639, 9845, 5298, 5906, 6525, 9112, 7384, 9586, 0, 5707] 

BoW vector size: 10001 

TF-IDF vector size: 10001 

Sentiment: 0 

   Now we can the split the data into training, validation, and test sets and determine the size of each set.

In [ ]:
from torch.utils.data.dataset import random_split

def split_train_valid_test(corpus, valid_ratio=0.1, test_ratio=0.1):
    test_length = int(len(corpus) * test_ratio)
    valid_length = int(len(corpus) * valid_ratio)
    train_length = len(corpus) - valid_length - test_length
    return random_split(
        corpus, lengths=[train_length, valid_length, test_length])

train_dataset, valid_dataset, test_dataset = split_train_valid_test(
    dataset, valid_ratio=0.1, test_ratio=0.1)
len(train_dataset), len(valid_dataset), len(test_dataset)
Out[ ]:
(32000, 4000, 4000)

   Now, let's use the DataLoader to load each of the sets by the specified BATCH_SIZE.

In [ ]:
from torch.utils.data import Dataset, DataLoader

def collate(batch):
    seq = [item[0] for item in batch]
    bow = [item[1] for item in batch]
    tfidf = [item[2] for item in batch]
    target = torch.LongTensor([item[3] for item in batch])
    text = [item[4] for item in batch]
    return seq, bow, tfidf, target, text

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE,
                          collate_fn=collate)
valid_loader = DataLoader(valid_dataset, batch_size=BATCH_SIZE,
                          collate_fn=collate)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                         collate_fn=collate)

print('Number of training batches:', len(train_loader), '\n')
batch_idx = random.randint(0, len(train_loader)-1)
example_idx = random.randint(0, BATCH_SIZE-1)

for i, fields in enumerate(train_loader):
    seq, bow, tfidf, target, text = fields
    if i == batch_idx:
        print('Training input sequence:', seq[example_idx], '\n')
        print('BoW vector size:', len(bow[example_idx]), '\n')
        print('TF-IDF vector size:', len(tfidf[example_idx]), '\n')
        print('Label: ', target[example_idx], '\n')
        print('Review text:', text[example_idx], '\n')
Number of training batches: 32000 

Training input sequence: [22, 5173, 3435, 7777, 9585, 7792, 8996, 257, 3324, 8375, 3097, 8809, 6011, 803, 519, 381, 2713, 6443, 5402, 7355, 7774, 3542, 9056, 3118, 265, 4141, 7092, 480, 7775] 

BoW vector size: 10001 

TF-IDF vector size: 10001 

Label:  tensor(1) 

Review text: absolutely love food service visited several time always five star experience taverna offer best atmosphere appetizer drink phenomenal meal richard served friend tonight express amazing highly recommend asking server 

Build BoW Model

   Now, we can build a class FeedfowardTextClassifier which will initialize the model with the defined architecture using a feed-forward fully connected network with 2 hidden layers with the input as the BoW Vector and the output as a vector size of two containing the probability of the input string being classified as positive or negative.

In [ ]:
import torch.nn as nn
import torch.nn.functional as F

class FeedfowardTextClassifier(nn.Module):
    def __init__(self, device, batch_size, vocab_size, hidden1, hidden2,
                 num_labels):
        """
        Initialize the model by setting up the layers.

        Parameters
        ----------
        device: Cuda device or CPU.
        batch_size: Batch size of dataloader.
        vocab_size: The vocabulary size.
        hidden1: Size of first hidden layer.
        hidden2: Size of second hidden layer.
        num_labels: Number of labels in target.
        """
        super(FeedfowardTextClassifier, self).__init__()
        self.device = device
        self.batch_size = batch_size
        self.fc1 = nn.Linear(vocab_size, hidden1)
        self.fc2 = nn.Linear(hidden1, hidden2)
        self.fc3 = nn.Linear(hidden2, num_labels)

    def forward(self, x):
        """
        Perform a forward pass of model and returns value between 0 and 1.
        """
        batch_size = len(x)
        if batch_size != self.batch_size:
            self.batch_size = batch_size
        x = torch.FloatTensor(x)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))

        return torch.sigmoid(self.fc3(x))

   Then we can define the size of the hidden layers and examine the model architecture.

In [ ]:
HIDDEN1 = 100
HIDDEN2 = 50

bow_model = FeedfowardTextClassifier(
    vocab_size=len(dataset.token2idx),
    hidden1=HIDDEN1,
    hidden2=HIDDEN2,
    num_labels=2,
    device=device,
    batch_size=BATCH_SIZE)

bow_model
Out[ ]:
FeedfowardTextClassifier(
  (fc1): Linear(in_features=10001, out_features=100, bias=True)
  (fc2): Linear(in_features=100, out_features=50, bias=True)
  (fc3): Linear(in_features=50, out_features=2, bias=True)
)

Train BoW Model

   First, let's define the initial learning rate, the loss function as CrossEntropyLoss, the gradient descent optimizer as Adam and CosineAnnealingLR for the scheduler.

   Then, we can define the train_epoch function with the input containing the model as bow_model, optimizer, train_loader, input_type as bow. This will train the model from the data from the train_loader. Then reset the gradient, perform a forward pass through the data and compute the loss. Then perform gradient descent followed by a backwards pass through the data and use step for the optimizer and scheduler to move in the correct direction. After this, metrics can be recorded. Another function validate_epoch can be used for the validation set that perform a forward pass, evaluates the model and records the model metrics.

In [ ]:
from torch import optim
from torch.optim.lr_scheduler import CosineAnnealingLR

LEARNING_RATE = 6e-5

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(
    filter(lambda p: p.requires_grad, bow_model.parameters()),
    lr=LEARNING_RATE)

scheduler = CosineAnnealingLR(optimizer, 1)

def train_epoch(model, optimizer, train_loader, input_type='bow'):
    model.train()
    total_loss, total = 0, 0
    for seq, bow, tfidf, target, text in train_loader:
        if input_type == 'bow':
            inputs = bow
        if input_type == 'tfidf':
            inputs = tfidf

        optimizer.zero_grad()
        output = model(inputs)
        loss = criterion(output, target)
        loss.backward()

        optimizer.step()
        scheduler.step()

        total_loss += loss.item()
        total += len(target)

    return total_loss / total

def validate_epoch(model, valid_loader, input_type='bow'):
    model.eval()
    total_loss, total = 0, 0
    with torch.no_grad():
        for seq, bow, tfidf, target, text in valid_loader:
            if input_type == 'bow':
                inputs = bow
            if input_type == 'tfidf':
                inputs = tfidf

            output = model(inputs)
            loss = criterion(output, target)
            total_loss += loss.item()
            total += len(target)

    return total_loss / total

   We need to specify the directory where the model is saved and when the model checkpoints because more training might be needed, or it is adequate enough, inference can occur, given the context. Let's now start training the model with the early stopping defined as the validation loss is greater than the previous two validation losses. For each epoch, we can append the loss for both the train and validation sets to monitor it.

In [ ]:
EPOCH = 1
OUT_DIR = './Models/40k_batch1_lr6e_5_10kRW_baseline.pt'
LOSS = 0.4

n_epochs = 0
train_losses, valid_losses = [], []
while True:
    train_loss = train_epoch(bow_model, optimizer, train_loader,
                             input_type='bow')
    valid_loss = validate_epoch(bow_model, valid_loader, input_type='bow')

    tqdm.write(
        f'epoch #{n_epochs + 1:3d}\ttrain_loss: {train_loss:.2e}\tvalid_loss: {valid_loss:.2e}\n',
    )

    if len(valid_losses) > 2 and all(valid_loss >= loss for loss in valid_losses[-3:]):
        print('Stopping early')
        break

    train_losses.append(train_loss)
    valid_losses.append(valid_loss)

    n_epochs += 1
epoch #  1	train_loss: 3.68e-01	valid_loss: 3.21e-01

epoch #  2	train_loss: 3.18e-01	valid_loss: 3.18e-01

epoch #  3	train_loss: 3.16e-01	valid_loss: 3.17e-01

epoch #  4	train_loss: 3.15e-01	valid_loss: 3.17e-01

epoch #  5	train_loss: 3.14e-01	valid_loss: 3.17e-01

epoch #  6	train_loss: 3.14e-01	valid_loss: 3.17e-01

epoch #  7	train_loss: 3.14e-01	valid_loss: 3.17e-01

epoch #  8	train_loss: 3.14e-01	valid_loss: 3.17e-01

epoch #  9	train_loss: 3.14e-01	valid_loss: 3.17e-01

epoch # 10	train_loss: 3.14e-01	valid_loss: 3.16e-01

epoch # 11	train_loss: 3.14e-01	valid_loss: 3.16e-01

epoch # 12	train_loss: 3.13e-01	valid_loss: 3.16e-01

epoch # 13	train_loss: 3.13e-01	valid_loss: 3.16e-01

epoch # 14	train_loss: 3.13e-01	valid_loss: 3.16e-01

epoch # 15	train_loss: 3.13e-01	valid_loss: 3.16e-01

Stopping early

BoW Performance - Model Loss and Metrics

   After 15 epochs, the model stopped training due to the specified critieria for early stopping. We should have specified the loss output to print out more decimal places because this granularity, at least in a notebook, does not reveal much.

   Now, we can plot the model loss for both the training and the validation sets.

In [ ]:
import matplotlib.pyplot as plt

epoch_ticks = range(1, n_epochs + 1)
plt.plot(epoch_ticks, train_losses)
plt.plot(epoch_ticks, valid_losses)
plt.legend(['Train Loss', 'Valid Loss'])
plt.title('Model Losses')
plt.xlabel('Number of Epochs')
plt.ylabel('Loss')
plt.show()

   The model seems to be an adequate fit using two hidden layers given the proximity of the train and validation losses. Next, we can evaluate the performance of the BoW model using the classification_report and confusion_matrix from sklearn.metrics.

In [ ]:
from sklearn.metrics import classification_report, confusion_matrix

bow_model.eval()
test_accuracy, n_examples = 0, 0
y_true, y_pred = [], []
input_type = 'bow'

with torch.no_grad():
    for seq, bow, tfidf, target, text in test_loader:
        inputs = bow
        probs = bow_model(inputs)
        if input_type == 'tdidf':
            inputs = tfidf
            probs = tfidf_model(inputs)

        probs = probs.detach().cpu().np()
        predictions = np.argmax(probs, axis=1)
        target = target.cpu().np()

        y_true.extend(predictions)
        y_pred.extend(target)

print('Classification Report:')
print(classification_report(y_true, y_pred))

f, (ax) = plt.subplots(1,1)
f.suptitle('Test Set: Confusion Matrix', fontsize=20)
cm = confusion_matrix(y_true, y_pred, labels=[1,0])
ax = plt.subplot()
sns.heatmap(cm, annot=True, ax=ax, cmap='Blues', fmt='d')
ax.set_xlabel('Predicted Review Stars', fontsize=17)
ax.set_ylabel('Actual Review Stars', fontsize=17)
ax.xaxis.set_ticklabels(['1/2', '5'], fontsize=13)
ax.yaxis.set_ticklabels(['1/2', '5'], fontsize=17)
Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      1993
           1       1.00      0.99      0.99      2007

    accuracy                           0.99      4000
   macro avg       0.99      0.99      0.99      4000
weighted avg       0.99      0.99      0.99      4000

   Overall, this model performed quite well for accuracy, precision, recall and the f1_score. Now we can examine a few of the observations and compare the predicted vs. actual sentiment.

In [ ]:
from IPython.core.display import display, HTML

flatten = lambda x: [sublst for lst in x for sublst in lst]
seq_lst, bow_lst, tfidf_lst, target_lst, text_lst = zip(*test_loader)
seq_lst, bow_lst, tfidf_lst, target_lst, text_lst = map(flatten, [seq_lst,
                                                                  bow_lst,
                                                                  tfidf_lst,
                                                                  target_lst,
                                                                  text_lst])

test_examples = list(zip(seq_lst, bow_lst, tfidf_lst, target_lst, text_lst))

def print_random_prediction(model, n=5, input_type='bow'):
    to_emoji = lambda x: '😄' if x else '😡'
    model.eval()
    rows = []
    for i in range(n):
        with torch.no_grad():
            seq, bow, tdidf, target, text = random.choice(test_examples)
            target = target.item()

            inputs = bow
            if input_type == 'tdidf':
                inputs = tfidf

            probs = model([inputs])
            probs = probs.detach().cpu().numpy()
            prediction = np.argmax(probs, axis=1)[0]

            predicted = to_emoji(prediction)
            actual = to_emoji(target)

            row = f'''
            <tr>
            <td>{i+1}&nbsp;</td>
            <td>{text}&nbsp;</td>
            <td>{predicted}&nbsp;</td>
            <td>{actual}&nbsp;</td>
            </tr>
            '''
            rows.append(row)

    rows_joined = '\n'.join(rows)
    table = f'''
    <table>
    <tbody>
    <tr>
    <td><b>Number</b>&nbsp;</td>
    <td><b>Review</b>&nbsp;</td>
    <td><b>Predicted</b>&nbsp;</td>
    <td><b>Actual</b>&nbsp;</td>
    </tr>{rows_joined}
    </tbody>
    </table>
    '''
    display(HTML(table))

print_random_prediction(bow_model, n=5, input_type='bow')
Number  Review  Predicted  Actual 
today time melt order basic grilled cheese ordered pork sandwich waited minute get food complaint melt course wait time get food fact pork several small bone fry tasted like sat box armed receive meal free due bone meat based experience probably return disappointed since saw food network  😡  😡 
went ihop pm waited hour order drink customer service bad drink wait minute order food bad customer service night  😡  😡 
food pretty solid far authentic looking standard americanized chinese food throw thai food standouts general tsos chicken sesame chicken mongolian beef last visit tried thai curry pretty mediocre nice kick spice overall place forgettable need chinese fix definitely job  😡  😡 
need change name place lowest quality seafood market talking high school cafeteria quality level price bad totally overpriced hard find seafood restaurant austin better getting frozen meal heb stuff  😡  😡 
impeccable service outstanding seafood enjoyed chef choice sashimi phenomenal cowboy roll interesting tasty dragon role world people great may want call ahead lunch get little busy  😄  😄 

Stars on Reviews

Train BoW Model

   Using the same methods for text preprocessing, train/validation/test set generation, learning rate, loss function, gradient descent optimizer, scheduler and train/validation training set up that was utilized for the Sentiment set, we can train the BoW model for the reviews_stars set.

In [ ]:
EPOCH = 1
OUT_DIR = './Models/40k_batch1_lr6e_5_10kRW_baseline.pt'
LOSS = 0.4

n_epochs = 0
train_losses, valid_losses = [], []
while True:
    train_loss = train_epoch(bow_model, optimizer, train_loader,
                             input_type='bow')
    valid_loss = validate_epoch(bow_model, valid_loader, input_type='bow')

    tqdm.write(
        f'epoch #{n_epochs + 1:3d}\ttrain_loss: {train_loss:.2e}\tvalid_loss: {valid_loss:.2e}\n',
    )

    if len(valid_losses) > 2 and all(valid_loss >= loss for loss in valid_losses[-3:]):
        print('Stopping early')
        break

    train_losses.append(train_loss)
    valid_losses.append(valid_loss)

    n_epochs += 1
epoch #  1	train_loss: 4.15e-01	valid_loss: 3.64e-01

epoch #  2	train_loss: 3.50e-01	valid_loss: 3.59e-01

epoch #  3	train_loss: 3.42e-01	valid_loss: 3.58e-01

epoch #  4	train_loss: 3.38e-01	valid_loss: 3.58e-01

epoch #  5	train_loss: 3.35e-01	valid_loss: 3.58e-01

epoch #  6	train_loss: 3.32e-01	valid_loss: 3.58e-01

epoch #  7	train_loss: 3.31e-01	valid_loss: 3.58e-01

Stopping early

BoW Performance - Model Loss and Metrics

   After 7 epochs, the model stopped training due to the specified critieria for early stopping. This is less than the 15 epochs that the model was trained using the Sentiment set. This model had a higher train_loss and valid_loss compared to the Sentiment set where the train_loss=3.13e-01 and valid_loss=3.16e-01. Even at the 7th epoch for the Sentiment set, the train and validation losses were less (train_loss: 3.14e-01, valid_loss: 3.17e-01).

   Now, we can plot the model loss for both the training and the validation sets.

In [ ]:
epoch_ticks = range(1, n_epochs + 1)
plt.plot(epoch_ticks, train_losses)
plt.plot(epoch_ticks, valid_losses)
plt.legend(['Train Loss', 'Valid Loss'])
plt.title('Model Losses')
plt.xlabel('Number of Epochs')
plt.ylabel('Loss')
plt.show()

   This model was not as a good fit compared to the Sentiment BoW model given the greater distance between the train and validation losses. Next, we can evaluate the performance of the BoW model using the classification_report and confusion_matrix from sklearn.metrics like what was utilized for the Sentiment model.

In [ ]:
bow_model.eval()
test_accuracy, n_examples = 0, 0
y_true, y_pred = [], []
input_type = 'bow'

with torch.no_grad():
    for seq, bow, tfidf, target, text in test_loader:
        inputs = bow
        probs = bow_model(inputs)
        if input_type == 'tdidf':
            inputs = tfidf
            probs = tfidf_model(inputs)

        probs = probs.detach().cpu().numpy()
        predictions = np.argmax(probs, axis=1)
        target = target.cpu().numpy()

        y_true.extend(predictions)
        y_pred.extend(target)

print('Classification Report:')
print(classification_report(y_true, y_pred))

f, (ax) = plt.subplots(1,1)
f.suptitle('Test Set: Confusion Matrix', fontsize=20)
cm = confusion_matrix(y_true, y_pred, labels=[1,0])
ax = plt.subplot()
sns.heatmap(cm, annot=True, ax=ax, cmap='Blues', fmt='d')
ax.set_xlabel('Predicted Review Stars', fontsize=17)
ax.set_ylabel('Actual Review Stars', fontsize=17)
ax.xaxis.set_ticklabels(['1/2', '5'], fontsize=13)
ax.yaxis.set_ticklabels(['1/2', '5'], fontsize=17)
Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.96      0.96      2053
           1       0.95      0.96      0.96      1947

    accuracy                           0.96      4000
   macro avg       0.96      0.96      0.96      4000
weighted avg       0.96      0.96      0.96      4000

   Although this model performed quite well for accuracy, precision, recall and the f1_score, all metrics are lower compared to the Sentiment model. Next, we can examine a few of the observations and compare the predicted vs. actual stars_reviews.

In [ ]:
print_random_prediction(bow_model, n=5, input_type='bow')
Number  Review  Predicted  Actual 
absolutely horrible place vegetable fruit dirty every time went mistake went four time never variety staff rude list go honestly understand people go save much awful grocery shopping experience  😡  😡 
went hank celebrate special occasion friend food tasty service horrible server kizzy j mia evening never cleared table brought water refill without asking night wore appeared get loaded know slurred speech bumping folk forgetful labored action something expect type restaurant recalled review chronicle sum experience beef hank though come service everyone friendly wellmeaning apart diffident host attention detail overall professionalism every time asked implement necessary dine without set food table otherwise compromise hygiene standard service staff seemed taken aback surprising request especially restaurant look feel higherend establishment enough said good food poor service sad way celebrate friend  😡  😡 
wish could give better review food bland overpriced service friendly weird slow back  😡  😡 
year living nearby promising getting sidetracked fish chip pirate pub door away finally made dinner last night surprise chilli house thai bistro added option surprise hostess assumed pigout big guy need decide whether going la carte ayce upfront seated different section opted good value ayce seated front section waitress server fast efficient blink ordered every dish appetizer soupsalad stir fry section food arrived quickly generally tasty although steam table cooked order tackled curry noodle rice section although best section relying waitress choice lot food dish better earlier course worth trying first make dessert glutinous rice tapioca two singha beer tax builtin tip bill grew rounded service star sometimes quality matter quantity pig nothing remotely nice food bangkok la likely thai restaurant vancouver however weather nice front terrace ace sharing bottle wine two mother day weekend early heatwave watching crowd walk cycle rollerblade drinking great weather would like lot low price feel free add another star two  😄  😡 
location past least time couple night stay working location done nice job renovating lobby common area welcoming feel walk door room nicely appointed rehabbed within last year usually plenty picture attach sadly occasion would able take picture flea bite arm actually caught one flea since easy kill flushed toilet course flea never travel alone second night stay bite wish known pet friendly hotel would asked pet room hoping praying bite carry lyme disease whatever yes kind asking another room conference day hour long changing late evening would feasible given little time sleep give hotel thumb little refrigerator able keep yogurt cold great override flea bite though burger restaurant hotel great different stay walked room find carpet soaking wet mean soaking wet would even assign room guest took little hour get room changed made short night sleep important conference good truly like write le positive review time let needing nearby airport stay know company arranged accommodation never return hotel make reservation elsewhere never  😡  😡 

   When examining the predicted vs. actual of 5 reviews, only 4/5 resulted in the correct match while the 5/5 was achieved for the Sentiment model.

Term Frequency-Inverse Document Frequency (TF-IDF)

Sentiment

   Using the same text pre-processing utilized for the BoW model and the same train/validation/test sets, we can build a simple feed-forward neural net classifier. >The hidden layer sizes have all ready been defined so we can initialize the TF-IDF model and examine it for comparison to the BoW model.

In [ ]:
tfidf_model = FeedfowardTextClassifier(
    vocab_size=len(dataset.token2idx),
    hidden1=HIDDEN1,
    hidden2=HIDDEN2,
    num_labels=2,
    device=device,
    batch_size=BATCH_SIZE,
)
tfidf_model
Out[ ]:
FeedfowardTextClassifier(
  (fc1): Linear(in_features=10001, out_features=100, bias=True)
  (fc2): Linear(in_features=100, out_features=50, bias=True)
  (fc3): Linear(in_features=50, out_features=2, bias=True)
)

Train TF-IDF Model

   We need to first specify the directory where the model is saved and when the model checkpoints. Let's now start training the model with the TF-IDF vectors as input and early stopping defined as the validation loss is greater than the previous two validation losses. For each epoch, we can append the loss for both the train and validation sets to monitor it.

In [ ]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(
    filter(lambda p: p.requires_grad, tfidf_model.parameters()),
    lr=LEARNING_RATE)

scheduler = CosineAnnealingLR(optimizer, 1)

EPOCH = 1
OUT_DIR = './Models/40k_batch1_lr6e_5_10kRW_baseline.pt'
LOSS = 0.4

n_epochs = 0
train_losses, valid_losses = [], []
while True:
    train_loss = train_epoch(tfidf_model, optimizer, train_loader,
                             input_type='tfidf')
    valid_loss = validate_epoch(tfidf_model, valid_loader, input_type='tfidf')

    tqdm.write(
        f'epoch #{n_epochs + 1:3d}\ttrain_loss: {train_loss:.2e}\tvalid_loss: {valid_loss:.2e}\n',
    )

    if len(valid_losses) > 2 and all(valid_loss >= loss for loss in valid_losses[-3:]):
        print('Stopping early')
        break

    train_losses.append(train_loss)
    valid_losses.append(valid_loss)

    n_epochs += 1
epoch #  1	train_loss: 4.12e-01	valid_loss: 3.26e-01

epoch #  2	train_loss: 3.21e-01	valid_loss: 3.20e-01

epoch #  3	train_loss: 3.17e-01	valid_loss: 3.18e-01

epoch #  4	train_loss: 3.15e-01	valid_loss: 3.18e-01

epoch #  5	train_loss: 3.15e-01	valid_loss: 3.18e-01

epoch #  6	train_loss: 3.14e-01	valid_loss: 3.18e-01

epoch #  7	train_loss: 3.14e-01	valid_loss: 3.18e-01

epoch #  8	train_loss: 3.14e-01	valid_loss: 3.17e-01

epoch #  9	train_loss: 3.14e-01	valid_loss: 3.17e-01

epoch # 10	train_loss: 3.14e-01	valid_loss: 3.17e-01

epoch # 11	train_loss: 3.14e-01	valid_loss: 3.17e-01

epoch # 12	train_loss: 3.14e-01	valid_loss: 3.17e-01

epoch # 13	train_loss: 3.14e-01	valid_loss: 3.17e-01

epoch # 14	train_loss: 3.14e-01	valid_loss: 3.17e-01

epoch # 15	train_loss: 3.14e-01	valid_loss: 3.17e-01

epoch # 16	train_loss: 3.13e-01	valid_loss: 3.17e-01

epoch # 17	train_loss: 3.13e-01	valid_loss: 3.17e-01

epoch # 18	train_loss: 3.13e-01	valid_loss: 3.17e-01

epoch # 19	train_loss: 3.13e-01	valid_loss: 3.17e-01

Stopping early

TF-IDF Performance - Model Loss and Metrics

   After 19 epochs, the model stopped training due to the specified critieria for early stopping. Now, we can plot the model loss for both the training and the validation sets.

In [ ]:
epoch_ticks = range(1, n_epochs + 1)
plt.plot(epoch_ticks, train_losses)
plt.plot(epoch_ticks, valid_losses)
plt.legend(['Train Loss', 'Valid Loss'])
plt.title('Model Losses')
plt.xlabel('Number of Epochs')
plt.ylabel('Loss')
plt.xticks(epoch_ticks,color='w')
plt.show()

   The TF-IDF model utilizing the Sentiment set also seems to be an adequate fit using two hidden layers given the proximity of the train and validation losses. Let's now evaluate the performance of the TF-IDF model with the test set using the classification_report and confusion_matrix from sklearn.metrics.

In [ ]:
tfidf_model.eval()
test_accuracy, n_examples = 0, 0
y_true, y_pred = [], []

with torch.no_grad():
    for seq, bow, tfidf, target, text in test_loader:
        inputs = tfidf
        probs = tfidf_model(inputs)

        probs = probs.detach().cpu().numpy()
        predictions = np.argmax(probs, axis=1)
        target = target.cpu().numpy()

        y_true.extend(predictions)
        y_pred.extend(target)

print('Classification Report:')
print(classification_report(y_true, y_pred))

f, (ax) = plt.subplots(1,1)
f.suptitle('Test Set: Confusion Matrix', fontsize=20)
cm = confusion_matrix(y_true, y_pred, labels=[1,0])
ax = plt.subplot()
sns.heatmap(cm, annot=True, ax=ax, cmap='Blues', fmt='d')
ax.set_xlabel('Predicted Review Stars', fontsize=17)
ax.set_ylabel('Actual Review Stars', fontsize=17)
ax.xaxis.set_ticklabels(['1/2', '5'], fontsize=13)
ax.yaxis.set_ticklabels(['1/2', '5'], fontsize=17)
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      0.99      2003
           1       0.99      0.99      0.99      1997

    accuracy                           0.99      4000
   macro avg       0.99      0.99      0.99      4000
weighted avg       0.99      0.99      0.99      4000

   Once again, this model performed quite well for accuracy, precision, recall and the f1_score. Now we can examine a few of the observations and compare the predicted vs. actual sentiment.

In [ ]:
print_random_prediction(tfidf_model, n=5, input_type='tfidf')
Number  Review  Predicted  Actual 
horrible way gate stopped togo side ordered burger started wait waiting one customer busy togo side minute yes minutes asked girl see much longer burger would get gate let know could wait couple minute would need get refund go girl counter care help communicate eta nothing staring blankly annoyed interrupting eating cheese fry hidden register like another planet one customer also perplexed give crap behavior get gate turn flight finally burger came couple minute later sloppily thrown box even closed bag grabbed burger ran gate plane open burger find pink middle fine hand made patty like frozen one come pack grocery store kind gross pink middle still cold want risk eating ate around edge threw rest away never go back wahlburgers really need evaluate staff working register girl bad business  😡  😡 
one word delicious try orange chicken brown rice disappointed  😡  😡 
minute walk know hit jackpot especially love desert red velvet perfect frosting ever bite yum  😄  😄 
one best italian restaurant portland menu many option unique authentic flavor combination atmosphere contemporary fun great age service also fantastic highlight pepper light appetizer chicken bolognese entree sauce delicious  😄  😄 
good curry fantastic sauce beef lamp tender juicy dish came small plate way finish service nice warm  😄  😄 

   5/5 match for the predicted vs. actual for this model

Stars on Reviews

   Let's now initialize the TF-IDF model for the reviews_stars set, then define the path where the model will be saved and train the TF-IDF model with the TF-IDF vectors as the input.

In [ ]:
EPOCH = 1
OUT_DIR = './Models/40k_batch1_lr6e_5_10kRW_baseline.pt'
LOSS = 0.4

n_epochs = 0
train_losses, valid_losses = [], []
while True:
    train_loss = train_epoch(tfidf_model, optimizer, train_loader,
                             input_type='tfidf')
    valid_loss = validate_epoch(tfidf_model, valid_loader, input_type='tfidf')

    tqdm.write(
        f'epoch #{n_epochs + 1:3d}\ttrain_loss: {train_loss:.2e}\tvalid_loss: {valid_loss:.2e}\n',
    )

    if len(valid_losses) > 2 and all(valid_loss >= loss for loss in valid_losses[-3:]):
        print('Stopping early')
        break

    train_losses.append(train_loss)
    valid_losses.append(valid_loss)

    n_epochs += 1
epoch #  1	train_loss: 4.55e-01	valid_loss: 3.67e-01

epoch #  2	train_loss: 3.52e-01	valid_loss: 3.57e-01

epoch #  3	train_loss: 3.43e-01	valid_loss: 3.55e-01

epoch #  4	train_loss: 3.38e-01	valid_loss: 3.55e-01

epoch #  5	train_loss: 3.35e-01	valid_loss: 3.55e-01

epoch #  6	train_loss: 3.33e-01	valid_loss: 3.56e-01

Stopping early

TF-IDF Performance - Model Loss and Metrics

   After 6 epochs, the model stopped training due to the specified critieria for early stopping. This is less than the 19 epochs that the model was trained using the Sentiment set. This model had a higher train_loss and valid_loss compared to the Sentiment set where the train_loss=3.13e-01 and valid_loss=3.17e-01. Even at the 6th epoch for the Sentiment set, the train and validation losses were less (train_loss: 3.14e-01, valid_loss: 3.18e-01).

   Now, we can plot the model loss for both the training and the validation sets.

In [ ]:
epoch_ticks = range(1, n_epochs + 1)
plt.plot(epoch_ticks, train_losses)
plt.plot(epoch_ticks, valid_losses)
plt.legend(['Train Loss', 'Valid Loss'])
plt.title('Model Losses')
plt.xlabel('Number of Epochs')
plt.ylabel('Loss')
plt.show()

   Once again, this model was not as a good fit compared to the Sentiment TF-IDF model given the greater distance between the train and validation losses. Next, we can evaluate the performance of the TF-IDF model using the classification_report and confusion_matrix from sklearn.metrics like what was utilized for the Sentiment model.

In [ ]:
tfidf_model.eval()
test_accuracy, n_examples = 0, 0
y_true, y_pred = [], []

with torch.no_grad():
    for seq, bow, tfidf, target, text in test_loader:
        inputs = tfidf
        probs = tfidf_model(inputs)

        probs = probs.detach().cpu().numpy()
        predictions = np.argmax(probs, axis=1)
        target = target.cpu().numpy()

        y_true.extend(predictions)
        y_pred.extend(target)

print('Classification Report:')
print(classification_report(y_true, y_pred))

f, (ax) = plt.subplots(1,1)
f.suptitle('Test Set: Confusion Matrix', fontsize=20)
cm = confusion_matrix(y_true, y_pred, labels=[1,0])
ax = plt.subplot()
sns.heatmap(cm, annot=True, ax=ax, cmap='Blues', fmt='d')
ax.set_xlabel('Predicted Review Stars', fontsize=17)
ax.set_ylabel('Actual Review Stars', fontsize=17)
ax.xaxis.set_ticklabels(['1/2', '5'], fontsize=13)
ax.yaxis.set_ticklabels(['1/2', '5'], fontsize=17)
Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.96      0.96      2052
           1       0.96      0.96      0.96      1948

    accuracy                           0.96      4000
   macro avg       0.96      0.96      0.96      4000
weighted avg       0.96      0.96      0.96      4000

   The model metrics using both the BoW and TF-IDF showed better performance for the Sentiment model compared to the Stars on Reviews model. Next, we can examine a few of the observations and compare the predicted vs. actual stars_reviews.

In [ ]:
print_random_prediction(tfidf_model, n=5, input_type='tfidf')
Number  Review  Predicted  Actual 
ordered bang bang shrimp chicken today lunch doordash shocked bad delivered burnt shrimp sauce bland nothing taste star food throw star get review show throw away paid food tip money drain  😡  😡 
experience august stoked come crust burnt really could enjoy flavor back  😡  😡 
amazing service amazing atmosphere amazing food diverse menu everything ordered good pizza burger pasta fish nacho amazing  😄  😄 
love love love noodle one place get every time go good tried different size noodle many trial believe fettuccini best noodle type stirfried noodle pappardelle little thick throw noodle meatveggie ratio cold sesame noodle good day hot warm food sushi happy hour pretty awesome also tempura shrimp pretty standard good really nice sit patio back weather nice get kind loud inside normal busy time  😄  😄 
place great big menu lot variety never gotten order wrong order online via website service great cheesy garlic breadstick amazing also love club sandwich  😄  😄 

   5/5 match for the predicted vs. actual for this model, but the model metrics are still lower for this model compared to the Sentiment TF-IDF model.

Bidirectional Long Short Term Memory Networks (LSTM)

   Bidirectional LSTMs are an extension of traditional LSTMs that can improve model performance on sequence classification problems by allowing the input data to leverage information from both the past and the future. The LSTM is trained simultaneously in the positive and negative time direction with the input sequence.

   The notebooks can be found here for Sentiment and here for Stars on Reviews.

Sentiment

   Let's first set up the environment by importing the necessary packages and examining the CUDA and NVIDIA GPU information as well as the Tensorflow and Keras versions for the runtime.

In [ ]:
import os
import random
import numpy as np
import warnings
import tensorflow as tf
warnings.filterwarnings('ignore')
print('CUDA and NVIDIA GPU Information')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
print('\n')
print('TensorFlow version: {}'.format(tf.__version__))
print('Eager execution is: {}'.format(tf.executing_eagerly()))
print('Keras version: {}'.format(tf.keras.__version__))
print('Num GPUs Available: ', len(tf.config.list_physical_devices('GPU')))
CUDA and NVIDIA GPU Information
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
Wed May 18 23:17:03 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


TensorFlow version: 2.8.0
Eager execution is: True
Keras version: 2.8.0
Num GPUs Available:  1

   To set the seed for reproducibility, we can use a function init_seeds that defines the random, numpy and tensorflow seed as well as the environment and session.

In [ ]:
def init_seeds(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    session_conf = tf.compat.v1.ConfigProto()
    session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1,
                                            inter_op_parallelism_threads=1)
    os.environ['TF_CUDNN_DETERMINISTIC'] = 'True'
    os.environ['TF_DETERMINISTIC_OPS'] = 'True'
    sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(),
                                config=session_conf)
    tf.compat.v1.keras.backend.set_session(sess)
    return sess

init_seeds(seed=42)
Out[ ]:
<tensorflow.python.client.session.Session at 0x7fedbe84fc50>

   As with the previous methods, let's recode the target to numerical (0,1), convert the data types, shuffle the data and then set up the label and the features. Then the data can be partitioned for the train/test sets using test_size=0.1, which stratifies the target (stratify=label).

In [ ]:
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

df['sentiment'].mask(df['sentiment'] == 'Negative', 0, inplace=True)
df['sentiment'].mask(df['sentiment'] == 'Positive', 1, inplace=True)

df['stars_reviews'].mask(df['stars_reviews'] == 1, 0, inplace=True)
df['stars_reviews'].mask(df['stars_reviews'] == 2, 0, inplace=True)
df['stars_reviews'].mask(df['stars_reviews'] == 5, 1, inplace=True)

df['sentiment'] = df['sentiment'].astype('int')
df['stars_reviews'] = df['stars_reviews'].astype('int')
df1 = df.drop(['stars_reviews'], axis=1)
df1 = shuffle(df1)

label = df1[['sentiment']]
features = df1.cleanReview

X_train, X_test, y_train, y_test = train_test_split(features, label,
                                                    test_size=0.1,
                                                    stratify=label,
                                                    random_state=42)

Length = 500 & Batch Size = 256

   We can now prepare the data by using the KerasTokenizer class with a specified number of words and the maximum length of the text to be tokenized. Then the tokenized text can padded if the length is less than the maxmimum defined length, and then vectorized, which can be applied to both the train and the test sets. We will create a vocabulary containing num_words=100000 with a maxlen=500.

In [ ]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import text, sequence
from sklearn.pipeline import Pipeline

class KerasTokenizer(object):
    """
    Fit and convert text to sequences for use in a Keras model.
    num_words = max number of words
    maxlen = max length of sequences
    """
    def __init__(self, num_words=100000, maxlen=500):
        self.tokenizer = text.Tokenizer(num_words=num_words)
        self.maxlen = maxlen

    def fit(self, X, y):
        self.tokenizer.fit_on_texts(X)
        return self

    def transform(self, X):
        return sequence.pad_sequences(self.tokenizer.texts_to_sequences(X),
                                      maxlen=self.maxlen)

km  = Pipeline([('Keras Tokenizer', KerasTokenizer(num_words=100000,
                                                   maxlen=500))])
X_trainT = km.fit_transform(X_train)
X_testT = km.fit_transform(X_test)

   Now, we can set up where the results will be saved as well as the callbacks with EarlyStopping monitoring the val_loss and stopping training if the val_loss does not improve after 5 epochs. This can also use a ModelCheckpoint with the specified filepath to save only the highest val_accuracy.

In [ ]:
import datetime
from tensorflow.keras.callbacks import TensorBoard, EarlyStopping, ModelCheckpoint

%cd /content/drive/MyDrive/Yelp_Reviews/Models/DL/LSTM/SentimentPolarity/Models/

!rm -rf /logs/

%load_ext tensorboard

log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')

filepath = 'Polarity_LSTM_weights_only_len500_b256.h5'
checkpoint_dir = os.path.dirname(filepath)

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
                                                      histogram_freq=1)

callbacks_list = [EarlyStopping(monitor='val_loss', patience=5),
                  ModelCheckpoint(filepath, monitor='val_accuracy',
                                  save_best_only=True, mode='max'),
                  tensorboard_callback]

Model Architecture

   Let's now define the model structure with an embedding_size=300 where the input is the maximum length of the review string (500 words) with 100,000 tokens as the defined maximum words in the vocabulary. Then 10% Dropout which then is fed to the Bidirectional LSTM with the size of 50 neurons and another 10% Dropout with 0% recurrent_dropout. Then a GlobalAveragePooling1D layer followed by another 10% Dropout with a final Dense layer containing a sigmoid activation function. Then the model can be compiled with the loss as binary_crossentropy, adam as the optimizer, the default learning rate and the metric as accuracy.

In [ ]:
from tensorflow.keras.layers import Input, Embedding, Dropout, Bidirectional
from tensorflow.keras.layers import LSTM, GlobalMaxPool1D, Dense, Activation
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

embedding_size = 300
input_ = Input(shape=(500,))
x = Embedding(100000, embedding_size)(input_)
x = Dropout(0.1)(x)
x = Bidirectional(LSTM(50, return_sequences=True, dropout=0.1,
                       recurrent_dropout=0.0))(x)
x = GlobalMaxPool1D()(x)
x = Dense(50, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(1, activation='sigmoid')(x)
model = Model(inputs=input_, outputs=x)

model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])

model.summary()
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_1 (InputLayer)        [(None, 500)]             0         
                                                                 
 embedding (Embedding)       (None, 500, 300)          30000000  
                                                                 
 dropout (Dropout)           (None, 500, 300)          0         
                                                                 
 bidirectional (Bidirectiona  (None, 500, 100)         140400    
 l)                                                              
                                                                 
 global_max_pooling1d (Globa  (None, 100)              0         
 lMaxPooling1D)                                                  
                                                                 
 dense (Dense)               (None, 50)                5050      
                                                                 
 dropout_1 (Dropout)         (None, 50)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 51        
                                                                 
=================================================================
Total params: 30,145,501
Trainable params: 30,145,501
Non-trainable params: 0
_________________________________________________________________

   The model can now be trained for 5 epochs using BATCH_SIZE=256 with the defined callbacks_list and then saved.

In [ ]:
history = model.fit(X_trainT, y_train, validation_data=(X_testT, y_test),
                    epochs=5, batch_size=256, callbacks=callbacks_list)

model.save('./Polarity_LSTM_len500_batch256_tf.h5', save_format='tf')

# Load model for more training or later use
#filepath = 'Polarity_LSTM_weights_only_len500_b256.h5'
#model = tf.keras.models.load_model('./Polarity_LSTM_len500_batch256_tf.h5')
#model.load(weights)
Epoch 1/5
2918/2918 [==============================] - 777s 266ms/step - loss: 8.3672e-04 - accuracy: 0.9997 - val_loss: 0.2801 - val_accuracy: 0.9480
Epoch 2/5
2918/2918 [==============================] - 774s 265ms/step - loss: 4.0997e-04 - accuracy: 0.9999 - val_loss: 0.3483 - val_accuracy: 0.9478
Epoch 3/5
2918/2918 [==============================] - 774s 265ms/step - loss: 1.8674e-04 - accuracy: 0.9999 - val_loss: 0.3875 - val_accuracy: 0.9493
Epoch 4/5
2918/2918 [==============================] - 773s 265ms/step - loss: 1.6661e-04 - accuracy: 1.0000 - val_loss: 0.3856 - val_accuracy: 0.9470
Epoch 5/5
2918/2918 [==============================] - 774s 265ms/step - loss: 5.6549e-05 - accuracy: 1.0000 - val_loss: 0.4816 - val_accuracy: 0.9481

LSTM Model Performance

   Now, we can evaluate the trained model for accuracy using the test set. Then plot the model loss and accuracy over the epochs.

In [ ]:
acc = model.evaluate(X_testT, y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(acc[0], acc[1]))
2594/2594 [==============================] - 52s 20ms/step - loss: 0.4816 - accuracy: 0.9481
Test set
  Loss: 0.482
  Accuracy: 0.948

   The accuracy is 94.8%, which is decent, but perhaps, training for more epochs might increase the accuracy.

In [ ]:
import matplotlib.pyplot as plt

plt.title('Model Loss')
plt.plot(history.history['loss'], label='Train')
plt.plot(history.history['val_loss'], label='Test')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()

   The model appears to be overfitting because the training loss is quite constant while the validation loss is increasing when it should be decreasing.

In [ ]:
plt.title('Model Accuracy')
plt.plot(history.history['accuracy'], label='Train')
plt.plot(history.history['val_accuracy'], label='Test')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.show()

   Let's try training for more epochs, so we need to first load the model.

In [ ]:
model = tf.keras.models.load_model('./Polarity_LSTM_len500_batch256_tf.h5')

model.summary()
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_1 (InputLayer)        [(None, 500)]             0         
                                                                 
 embedding (Embedding)       (None, 500, 300)          30000000  
                                                                 
 dropout (Dropout)           (None, 500, 300)          0         
                                                                 
 bidirectional (Bidirectiona  (None, 500, 100)         140400    
 l)                                                              
                                                                 
 global_max_pooling1d (Globa  (None, 100)              0         
 lMaxPooling1D)                                                  
                                                                 
 dense (Dense)               (None, 50)                5050      
                                                                 
 dropout_1 (Dropout)         (None, 50)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 51        
                                                                 
=================================================================
Total params: 30,145,501
Trainable params: 30,145,501
Non-trainable params: 0
_________________________________________________________________

   Let's define a new filepath and reuse the same callbacks_list to continue training the model and then save it.

In [ ]:
filepath = 'LSTM_weights_only_len500_b256_2.h5'
checkpoint_dir = os.path.dirname(filepath)

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
                                                      histogram_freq=1)

callbacks_list = [EarlyStopping(monitor='val_loss', patience=5),
                  ModelCheckpoint(filepath, monitor='val_accuracy',
                                  save_best_only=True, mode='max'),
                  tensorboard_callback]

history = model.fit(X_trainT, y_train, validation_data=(X_testT, y_test),
                    epochs=7, batch_size=256, callbacks=callbacks_list)

model.save('./Polarity_LSTM_len500_batch256_2_tf.h5', save_format='tf')

# Load model for more training or later use
#model = tf.keras.models.load_model('./Polarity_LSTM_len500_batch256_2_tf.h5')
Epoch 1/7
2918/2918 [==============================] - 780s 267ms/step - loss: 9.5815e-06 - accuracy: 1.0000 - val_loss: 0.6626 - val_accuracy: 0.9415
Epoch 2/7
2918/2918 [==============================] - 774s 265ms/step - loss: 6.4761e-05 - accuracy: 1.0000 - val_loss: 0.5291 - val_accuracy: 0.9478
Epoch 3/7
2918/2918 [==============================] - 774s 265ms/step - loss: 6.9872e-05 - accuracy: 1.0000 - val_loss: 0.5689 - val_accuracy: 0.9431
Epoch 4/7
2918/2918 [==============================] - 771s 264ms/step - loss: 2.7031e-05 - accuracy: 1.0000 - val_loss: 0.6000 - val_accuracy: 0.9467
Epoch 5/7
2918/2918 [==============================] - 770s 264ms/step - loss: 3.8727e-05 - accuracy: 1.0000 - val_loss: 0.6195 - val_accuracy: 0.9472
Epoch 6/7
2918/2918 [==============================] - 761s 261ms/step - loss: 2.2050e-07 - accuracy: 1.0000 - val_loss: 0.6589 - val_accuracy: 0.9464
Epoch 7/7
2918/2918 [==============================] - 762s 261ms/step - loss: 3.8500e-08 - accuracy: 1.0000 - val_loss: 0.6903 - val_accuracy: 0.9464

LSTM Model Performance

   Now, we can reevaluate the trained model for accuracy using the test set.

In [ ]:
acc = model.evaluate(X_testT, y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(acc[0],acc[1]))
2594/2594 [==============================] - 40s 14ms/step - loss: 1.0735 - accuracy: 0.9240
Test set
  Loss: 1.073
  Accuracy: 0.924

   The accuracy actually decreased by training for more epochs for this model. Let's now plot the model loss and accuracy over the epochs as we did before.

In [ ]:
plt.title('Model Loss')
plt.plot(history.history['loss'], label='Train')
plt.plot(history.history['val_loss'], label='Test')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()
In [ ]:
plt.title('Model Accuracy')
plt.plot(history.history['accuracy'], label='Train')
plt.plot(history.history['val_accuracy'], label='Test')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.show()

   Let's now use the trained model to predict on the test set and save as scores. This can then be utilized in a function decode_sentiment that outputs a binary sentiment/classification score if the predicted value is greater than 0.5, then it is positive and when it is less than 0.5, then it is negative.

   We can also define a function to construct the confusion matrix called plot_confusion_matrix and generate a classification_report using sklearn.metrics.

In [ ]:
scores = model.predict(X_testT, verbose=1)
2594/2594 [==============================] - 49s 18ms/step
In [ ]:
def decode_sentiment(score):
    return 1 if score > 0.5 else 0

y_pred_1d = [decode_sentiment(score) for score in scores]
In [ ]:
import itertools
from sklearn.metrics import classification_report

def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    Print and plot the confusion matrix
    """

    cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, fontsize=20)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, fontsize=13)
    plt.yticks(tick_marks, classes, fontsize=13)

    fmt = '.2f'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment='center',
                 color='white' if cm[i, j] > thresh else 'black')

    plt.ylabel('True label', fontsize=17)
    plt.xlabel('Predicted label', fontsize=17)

print('Classification Report:')
print(classification_report(y_test['sentiment'].tolist(), y_pred_1d))
print('\n')
cnf_matrix = confusion_matrix(y_test['sentiment'].tolist(), y_pred_1d)
plt.figure(figsize=(6,6))
plot_confusion_matrix(cnf_matrix, classes=y_test.sentiment.unique(),
                      title='Confusion matrix')
plt.show();
Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.93      0.92     41494
           1       0.93      0.92      0.92     41494

    accuracy                           0.92     82988
   macro avg       0.92      0.92      0.92     82988
weighted avg       0.92      0.92      0.92     82988



   Overall, this model performed quite well for accuracy, precision, recall and the f1_score, but all of the metrics are lower than the ones from the BoW and TF-IDF models. Perhaps, sampling 20,000 observations per each sentiment group might allow for comparable metrics, or hyperparameter tuning in regards to the embedding_size, the amount of neurons within the LSTM and Dense layers as well as the learning_rate.

Stars on Reviews

   As with the previous methods, let's recode the target to numerical (0,1), convert the data types, shuffle the data and then set up the label and the features. Then the data can be partitioned for the train/test sets using test_size=0.1, which stratifies the target (stratify=label).

In [ ]:
df1 = df.drop(['sentiment'], axis=1)
df1 = shuffle(df1)

label = df1[['stars_reviews']]
features = df1.cleanReview

X_train, X_test, y_train, y_test = train_test_split(features, label,
                                                    test_size=0.1,
                                                    stratify=label,
                                                    random_state=42)

Length = 500 & Batch Size = 256

   Let's now prepare the data by using the KerasTokenizer class with a specified number of words and the maximum length of the text to be tokenized. Then the tokenized text can padded if the length is less than the maxmimum defined length, and then vectorized, which can be applied to both the train and the test sets. We will create a vocabulary containing num_words=100000 with a maxlen=500.

In [ ]:
km  = Pipeline([('Keras Tokenizer', KerasTokenizer(num_words=100000,
                                                   maxlen=500))])
X_trainT = km.fit_transform(X_train)
X_testT = km.fit_transform(X_test)

%cd /content/drive/MyDrive/Yelp_Reviews/Models/DL/LSTM/ReviewStars/Models/

!rm -rf /logs/

%load_ext tensorboard

log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')

filepath = 'LSTM_weights_only_len500_b256_balancedSP.h5'
checkpoint_dir = os.path.dirname(filepath)

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
                                                      histogram_freq=1)

callbacks_list = [EarlyStopping(monitor='val_loss', patience=3),
                  ModelCheckpoint(filepath, monitor='val_accuracy',
                                  save_best_only=True, mode='max'),
                  tensorboard_callback]

   The model can now be trained for 10 epochs using BATCH_SIZE=256 with the defined callbacks_list and then saved.

In [ ]:
history = model.fit(X_trainT, y_train, validation_data=(X_testT, y_test),
                    epochs=10, batch_size=256, callbacks=callbacks_list)

model.save('./LSTM_len500_batch256_balancedSP_tf.h5', save_format='tf')

# Load model for more training or later use
#filepath = 'LSTM_weights_only_len500_b256_balancedSP.h5'
#model = tf.keras.models.load_model('./LSTM_len500_batch256_balancedSP_tf.h5')
#model.load(weights)
Epoch 1/10
2918/2918 [==============================] - 696s 235ms/step - loss: 0.0979 - accuracy: 0.9627 - val_loss: 0.4522 - val_accuracy: 0.8236
Epoch 2/10
2918/2918 [==============================] - 683s 234ms/step - loss: 0.0591 - accuracy: 0.9788 - val_loss: 0.4727 - val_accuracy: 0.8246
Epoch 3/10
2918/2918 [==============================] - 682s 234ms/step - loss: 0.0397 - accuracy: 0.9863 - val_loss: 0.6072 - val_accuracy: 0.8131
Epoch 4/10
2918/2918 [==============================] - 685s 235ms/step - loss: 0.0255 - accuracy: 0.9915 - val_loss: 0.6425 - val_accuracy: 0.8138

LSTM Model Performance

   Now, we can evaluate the trained model for accuracy using the test set. Then plot the model loss and accuracy over the epochs.

In [ ]:
acc = model.evaluate(X_testT, y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(acc[0], acc[1]))
2594/2594 [==============================] - 39s 15ms/step - loss: 0.6425 - accuracy: 0.8138
Test set
  Loss: 0.642
  Accuracy: 0.814
In [ ]:
plt.title('Model Loss')
plt.plot(history.history['loss'], label='Train')
plt.plot(history.history['val_loss'], label='Test')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()
In [ ]:
plt.title('Model Accuracy')
plt.plot(history.history['accuracy'], label='Train')
plt.plot(history.history['val_accuracy'], label='Test')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend())
plt.show()

   Let's try training for more epochs, so we need to first load the model and examine the model architecture.

In [ ]:
model = tf.keras.models.load_model('./LSTM_batch256_10epochs_balancedSP_tf.h5')

model.summary()
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_1 (InputLayer)        [(None, 500)]             0         
                                                                 
 embedding (Embedding)       (None, 500, 300)          30000000  
                                                                 
 dropout (Dropout)           (None, 500, 300)          0         
                                                                 
 bidirectional (Bidirectiona  (None, 500, 100)         140400    
 l)                                                              
                                                                 
 global_max_pooling1d (Globa  (None, 100)              0         
 lMaxPooling1D)                                                  
                                                                 
 dense (Dense)               (None, 50)                5050      
                                                                 
 dropout_1 (Dropout)         (None, 50)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 51        
                                                                 
=================================================================
Total params: 30,145,501
Trainable params: 30,145,501
Non-trainable params: 0
_________________________________________________________________

   Let's define a new filepath and reuse the same callbacks_list to continue training the model.

In [ ]:
filepath = 'LSTM_weights_only_len500_b256_2.h5'
checkpoint_dir = os.path.dirname(filepath)

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
                                                      histogram_freq=1)

callbacks_list = [EarlyStopping(monitor='val_loss', patience=5),
                  ModelCheckpoint(filepath, monitor='val_accuracy',
                                  save_best_only=True, mode='max'),
                  tensorboard_callback]

history = model.fit(X_trainT, y_train, validation_data=(X_testT, y_test),
                    epochs=7, batch_size=256, callbacks=callbacks_list)
Epoch 1/7
2918/2918 [==============================] - 655s 222ms/step - loss: 0.0146 - accuracy: 0.9954 - val_loss: 0.8546 - val_accuracy: 0.8052
Epoch 2/7
2918/2918 [==============================] - 650s 223ms/step - loss: 0.0098 - accuracy: 0.9969 - val_loss: 0.8785 - val_accuracy: 0.8061
Epoch 3/7
2918/2918 [==============================] - 649s 223ms/step - loss: 0.0069 - accuracy: 0.9978 - val_loss: 1.0098 - val_accuracy: 0.8082
Epoch 4/7
2918/2918 [==============================] - 646s 221ms/step - loss: 0.0056 - accuracy: 0.9981 - val_loss: 1.1526 - val_accuracy: 0.8015
Epoch 5/7
2918/2918 [==============================] - 646s 221ms/step - loss: 0.0069 - accuracy: 0.9976 - val_loss: 1.1619 - val_accuracy: 0.7971
Epoch 6/7
2918/2918 [==============================] - 646s 221ms/step - loss: 0.0051 - accuracy: 0.9983 - val_loss: 1.1971 - val_accuracy: 0.8025

   We can save the model and evaluate the accuracy using the test set.

In [ ]:
model.save('./LSTM_len500_batch256_2_tf.h5', save_format='tf')
#model = tf.keras.models.load_model('./LSTM_len500_batch256_2_tf.h5')

acc = model.evaluate(X_testT, y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(acc[0], acc[1]))
2594/2594 [==============================] - 35s 14ms/step - loss: 1.1971 - accuracy: 0.8025
Test set
  Loss: 1.197
  Accuracy: 0.803

   Once again, the accuracy actually decreased by training for more epochs for this model.

LSTM Model Performance

   We can then replot the model loss and accuracy over the new training epochs.

In [ ]:
plt.title('Model Loss')
plt.plot(history.history['loss'], label='Train')
plt.plot(history.history['val_loss'], label='Test')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()
In [ ]:
plt.title('Model Accuracy')
plt.plot(history.history['accuracy'], label='Train')
plt.plot(history.history['val_accuracy'], label='Test')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.show()

   Let's now predict using the test set, determine the sentiment score with the decode_sentiment that iterates through the test set and examine the classification_report and confusion_matrix.

In [ ]:
scores = model.predict(X_testT, verbose=1)
2594/2594 [==============================] - 32s 12ms/step
In [ ]:
y_pred_1d = [decode_sentiment(score) for score in scores]

print('Classification Report:')
print(classification_report(y_test['stars_reviews'].tolist(), y_pred_1d))
print('\n')

cnf_matrix = confusion_matrix(y_test['stars_reviews'].tolist(), y_pred_1d)
plt.figure(figsize=(6,6))
plot_confusion_matrix(cnf_matrix, classes=y_test.stars_reviews.unique(),
                      title='Confusion matrix')
plt.show()
Classification Report:
              precision    recall  f1-score   support

           0       0.77      0.86      0.81     41494
           1       0.84      0.75      0.79     41494

    accuracy                           0.80     82988
   macro avg       0.81      0.80      0.80     82988
weighted avg       0.81      0.80      0.80     82988



   The model metrics using LSTM demonstrate better performance for the Sentiment model compared to the Stars on Reviews model. However, the BoW and TF-IDF models performed better than the LSTM approach for both sets.

Transfer Learning using Bidirectional Encoder Representations from Transformers (BERT)

   Transfer learning is a ML technique that can be leveraged to save time and cost resources from having to train a model from scratch. The model weights can be loaded and fine tuned using different data than what was used to train it initially. The new weights can be saved and the model evaluated.

   Since we are evaluating different text classification approaches, let's utilize BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. BERT was trained using data from BooksCorpus (800 million words) and English Wikipedia (2,500 million words) using only the text passages. BERT’s model architecture uses a multi-layer bidirectional Transformer encoder based on the original implementation of the transformer in Attention Is All You Need. Let's utilize the BERT base model consisting of 12 layers of Transformer encoder, 12 attention heads, 768 hidden size, and 110M parameters. Using unlabeled text data, BERT was designed to pre-train deep bidirectional representations by jointly conditioning on both the left and the right context in all of the model's layers. We can leverage this and fine-tune using our set by adding an additional output layer.

   The notebooks can be found here for Sentiment and here for Stars on Reviews.

Sentiment

   Let's first install the transformers package, then set up the environment by importing necessary packages, setting the seed for reproducibility, and examining the CUDA, NVIDIA GPU and PyTorch information.

In [ ]:
!pip install transformers
import os
import random
import numpy as np
import torch
from subprocess import call

seed_value = 42
torch.manual_seed(42)
random.seed(42)
np.random.seed(42)
Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
     |████████████████████████████████| 4.2 MB 4.1 MB/s 
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
     |████████████████████████████████| 6.6 MB 73.6 MB/s 
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.7/dist-packages (from transformers) (1.21.6)
Requirement already satisfied: filelock in /usr/local/lib/python3.7/dist-packages (from transformers) (3.7.0)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.7/dist-packages (from transformers) (21.3)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
     |████████████████████████████████| 596 kB 48.9 MB/s 
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.7/dist-packages (from transformers) (2019.12.20)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.6.0-py3-none-any.whl (84 kB)
     |████████████████████████████████| 84 kB 3.4 MB/s 
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from transformers) (2.23.0)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/dist-packages (from transformers) (4.11.3)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.7/dist-packages (from transformers) (4.64.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.7/dist-packages (from huggingface-hub<1.0,>=0.1.0->transformers) (4.2.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging>=20.0->transformers) (3.0.9)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->transformers) (3.8.0)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (2.10)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (1.24.3)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (2021.10.8)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninstalled PyYAML-3.13
Successfully installed huggingface-hub-0.6.0 pyyaml-6.0 tokenizers-0.12.1 transformers-4.19.2
In [ ]:
print('CUDA and NVIDIA GPU Information')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
print('\n')
print('Torch version: {}'.format(torch.__version__))
print('pyTorch VERSION:', torch.__version__)
print('\n')
print('__CUDNN VERSION:', torch.backends.cudnn.version())
print('__Number CUDA Devices:', torch.cuda.device_count())
print('__Devices')
call(['nvidia-smi', '--format=csv,' '--query-gpu=index,name,driver_version,memory.total,memory.used,memory.free'])
print('Active CUDA Device: GPU', torch.cuda.current_device())
print ('Available devices ', torch.cuda.device_count())
print ('Current cuda device ', torch.cuda.current_device())
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(device)
CUDA and NVIDIA GPU Information
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
Sat May 21 16:39:37 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


Torch version: 1.11.0+cu113
pyTorch VERSION: 1.11.0+cu113


__CUDNN VERSION: 8200
__Number CUDA Devices: 1
__Devices
Active CUDA Device: GPU 0
Available devices  1
Current cuda device  0
cuda:0

Preprocessing

   After recoding both sentiment and stars_reviews to binary as previously completed, let's examine the word count of the reviews since BERT only takes up to 512 words. We can utilize a previously defined lambda function that counts the number of words in the text string to generate review_wordCount. Then pandas.Dataframe.describe can be used to evaluate the summary statistics for this temporary feature.

In [ ]:
df['review_wordCount'] = df['cleanReview'].apply(lambda x: len(str(x).split()))
df['review_wordCount'].describe().apply('{0:f}'.format)
Out[ ]:
count    829874.000000
mean         38.308772
std          33.833394
min           1.000000
25%          17.000000
50%          28.000000
75%          47.000000
max         511.000000
Name: review_wordCount, dtype: object

   We can now subset the reviews which contain less than or equal to 512 words and then drop the review_wordCount temporary variable. Then filter negative and positive sentiment reviews into two sets, shuffle and sample 20,000 observations, which are then concatenated together to generate a single set. Then the reviews can be converted to a string and the target as an integer for the data types. Next the data can be partitioned for the train, validation and test sets using 80% for training, 10% for validation and 10% for the test set.

In [ ]:
from sklearn.utils import shuffle

df = df[df['review_wordCount'] <= 512]
df1 = df.drop(['review_wordCount', 'stars_reviews'], axis=1)

df2 = df1[df1.sentiment==0]
df2 = shuffle(df2)
df2 = df2.sample(n=20000)

df3 = df1[df1.sentiment==1]
df3 = df3.sample(n=20000)
df3 = shuffle(df3)

df1 = pd.concat([df2, df3])
df1 = shuffle(df1)

del df2, df3

df1['sentiment'] = df1['sentiment'].astype('int')
df1['cleanReview'] = df1['cleanReview'].astype(str)

df_train, df_val, df_test = np.split(df1.sample(frac=1, random_state=seed_value),
                                     [int(.8*len(df)), int(.9*len(df))])

print(len(df_train), len(df_val), len(df_test))

del df1
32000 4000 4000

Tokenize

   Let's first define the tokenizer to use, BertTokenizer.from_pretrained('bert-base-uncased') and the components for the label. Then we can build a class Dataset that iterates through the data tokenizeing the text string with a maximum length of 300 tokens and then pads the ones that are shorter in length.

In [ ]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
label = {0, 1}

class Dataset(torch.utils.data.Dataset):

    def __init__(self, df):

        self.labels = [label for label in df['sentiment']]
        self.texts = [tokenizer(cleanReview,
                                add_special_tokens=True,
                                padding='max_length',
                                max_length=300,
                                return_tensors='pt',
                                truncation=True) for cleanReview in df['cleanReview']]

    def classes(self):
        return self.labels

    def __len__(self):
        return len(self.labels)

    def get_batch_labels(self, idx):
        return np.array(self.labels[idx])

    def get_batch_texts(self, idx):
        return self.texts[idx]

    def __getitem__(self, idx):

        batch_texts = self.get_batch_texts(idx)
        batch_y = self.get_batch_labels(idx)

        return batch_texts, batch_y

   Now, we can define another class BertClassifier utilizing BertModel.from_pretrained('bert-base-uncased') with a model architecture consisting of an initial 40% Dropout, followed by a Linear layer, a ReLU activation function, an equivalent dropout layer and a final Linear layer. The forward function can be defined containing the input as input_id and mask.

In [ ]:
from torch import nn
from transformers import BertModel

class BertClassifier(nn.Module):

    def __init__(self, dropout=0.4):

        super(BertClassifier, self).__init__()

        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.drop = nn.Dropout(dropout)
        self.out1 = nn.Linear(self.bert.config.hidden_size, 128)
        self.relu = nn.ReLU()
        self.drop1 = nn.Dropout(p=0.4)
        self.out = nn.Linear(128, 2)

    def forward(self, input_id, mask):

        _, pooled_output = self.bert(input_ids= input_id, attention_mask=mask,
                                     return_dict=False)
        output = self.drop(pooled_output)
        output = self.out1(output)
        output = self.relu(output)
        output = self.drop1(output)
        final_layer = self.out(output)

        return final_layer

   Next, let's define some functions to to save/load the model (save_checkpoint, load_checkpoint) and to evaluate the model (save_metrics, load_metrics).

In [ ]:
def save_checkpoint(save_path, model, valid_loss):

    if save_path == None:
        return
    state_dict = {'model_state_dict': model.state_dict(),
                  'valid_loss': valid_loss}
    torch.save(state_dict, save_path)
    print(f'Model saved to ==> {save_path}')

def load_checkpoint(load_path, model):

    if load_path==None:
        return
    state_dict = torch.load(load_path, map_location=device)
    print(f'Model loaded from <== {load_path}')
    model.load_state_dict(state_dict['model_state_dict'])
    return state_dict['valid_loss']

def save_metrics(save_path, train_loss_list, valid_loss_list,
                 global_steps_list):

    if save_path == None:
        return
    state_dict = {'train_loss_list': train_loss_list,
                  'valid_loss_list': valid_loss_list,
                  'global_steps_list': global_steps_list}
    torch.save(state_dict, save_path)
    print(f'Model saved to ==> {save_path}')

def load_metrics(load_path):

    if load_path==None:
        return
    state_dict = torch.load(load_path, map_location=device)
    print(f'Model loaded from <== {load_path}')

    return state_dict['train_loss_list'], state_dict['valid_loss_list'], state_dict['global_steps_list']

Train and Evaluate Using Batch Size = 8

   We can define a training function train which encompasses a training and validation loop that first initializes the running values for the training and validation loss and global step as 0, loads the training and validation data separately with torch.utils.data.DataLoader using a batch_size=8 that is shuffled and utilizes four workers. Then we can assign the device to CUDA if it is available to utilize the GPU. Next, we can define the model criterion as CrossEntropyLoss and the optimizer as Adam where the learning rate can be specified. Then we can define the training and validation loop to load the data, train for the training set, evaluate the loss and accuracy, reset the running values, print the progress and save the model/checkpoint.

In [ ]:
from torch.utils.data import Dataset, DataLoader
from torch.optim import Adam
from tqdm import tqdm

destination_folder = '/content/drive/MyDrive/Yelp_Reviews/Models/DL/BERT/SentimentPolarity/Models'

%cd /content/drive/MyDrive/Yelp_Reviews/Models/DL/BERT/SentimentPolarity/40k/

def train(model,
          df_train,
          df_val,
          learning_rate,
          num_epochs=3,
          file_path=destination_folder,
          best_valid_loss=float('Inf')):

    total_loss_train = 0.0
    total_loss_val = 0.0
    global_step = 0
    train_loss_list = []
    valid_loss_list = []
    global_steps_list = []

    train, val = Dataset(df_train), Dataset(df_val)
    train_dataloader = torch.utils.data.DataLoader(train, batch_size=8,
                                                   shuffle=True, pin_memory=True,
                                                   num_workers=4)
    val_dataloader = torch.utils.data.DataLoader(val, batch_size=8,
                                                 shuffle=True, pin_memory=True,
                                                 num_workers=4)

    use_cuda = torch.cuda.is_available()
    device = torch.device('cuda' if use_cuda else 'cpu')
    criterion = nn.CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr=learning_rate)

    if use_cuda:
        model = model.cuda()
        criterion = criterion.cuda()
    eval_every = len(train_dataloader) // 2

    model.train()
    for epoch in range(num_epochs):
        for train_input, train_label in tqdm(train_dataloader):
            train_label = train_label.to(device)
            mask = train_input['attention_mask'].to(device)
            input_id = train_input['input_ids'].squeeze(1).to(device)
            output = model(input_id, mask)

            batch_loss = criterion(output, train_label)
            total_loss_train += batch_loss.item()
            optimizer.zero_grad()
            batch_loss.backward()
            optimizer.step()
            global_step += 1

            if global_step % eval_every == 0:
                model.eval()
                with torch.no_grad():
                    for val_input, val_label in val_dataloader:
                        val_label = val_label.to(device)
                        mask = val_input['attention_mask'].to(device)
                        input_id = val_input['input_ids'].squeeze(1).to(device)
                        output = model(input_id, mask)
                        batch_loss = criterion(output, val_label)
                        total_loss_val += batch_loss.item()

                average_train_loss = total_loss_train / len(train_dataloader)
                average_valid_loss = total_loss_val / len(val_dataloader)
                train_loss_list.append(average_train_loss)
                valid_loss_list.append(average_valid_loss)
                global_steps_list.append(global_step)

                total_loss_train = 0.0
                total_loss_val = 0.0
                model.train()

                print('Epoch [{}/{}], Step [{}/{}], Train Loss: {:.4f}, Valid Loss: {:.4f}'
                      .format(epoch+1, num_epochs, global_step,
                              num_epochs * len(train_dataloader),
                              average_train_loss, average_valid_loss))

                if best_valid_loss > average_valid_loss:
                    best_valid_loss = average_valid_loss
                    save_checkpoint(file_path + '/' + 'model_40K_batch8.pt',
                                    model, best_valid_loss)
                    save_metrics(file_path + '/' + 'metrics_40K_batch8.pt',
                                 train_loss_list, valid_loss_list,
                                 global_steps_list)
    save_metrics(file_path + '/' + 'metrics_40K_batch8.pt', train_loss_list,
                 valid_loss_list, global_steps_list)
    print('Training finished!')

   Let's now define the parameters for the training loop where the model is BertClassifier, the optimizer is Adam with a learning rate=5e-6 and three epochs are used for training.

In [ ]:
model = BertClassifier()
optimizer = Adam(model.parameters(), lr=5e-6)
num_epochs = 3
LR = 5e-6
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

   Now, the model can be trained using the specified parameters.

In [ ]:
train(model, df_train, df_val, LR, num_epochs)
 50%|████▉     | 1999/4000 [14:14<14:23,  2.32it/s]
Epoch [1/3], Step [2000/12000], Train Loss: 0.0658, Valid Loss: 0.0394
 50%|█████     | 2000/4000 [15:36<13:51:49, 24.95s/it]
Model saved to ==> /content/drive/MyDrive/Yelp_Reviews/DL/BERT/SentimentPolarity/Models/model_40K_batch8.pt
Model saved to ==> /content/drive/MyDrive/Yelp_Reviews/DL/BERT/SentimentPolarity/Models/metrics_40K_batch8.pt
100%|█████████▉| 3999/4000 [29:59<00:00,  2.32it/s]
Epoch [1/3], Step [4000/12000], Train Loss: 0.0134, Valid Loss: 0.0154
100%|██████████| 4000/4000 [31:20<00:00,  2.13it/s]
Model saved to ==> /content/drive/MyDrive/Yelp_Reviews/DL/BERT/SentimentPolarity/Models/model_40K_batch8.pt
Model saved to ==> /content/drive/MyDrive/Yelp_Reviews/DL/BERT/SentimentPolarity/Models/metrics_40K_batch8.pt
 50%|████▉     | 1999/4000 [14:22<14:20,  2.33it/s]
Epoch [2/3], Step [6000/12000], Train Loss: 0.0074, Valid Loss: 0.0135
Model saved to ==> /content/drive/MyDrive/Yelp_Reviews/DL/BERT/SentimentPolarity/Models/model_40K_batch8.pt
 50%|█████     | 2000/4000 [15:44<13:47:32, 24.83s/it]
Model saved to ==> /content/drive/MyDrive/Yelp_Reviews/DL/BERT/SentimentPolarity/Models/metrics_40K_batch8.pt
100%|█████████▉| 3999/4000 [30:06<00:00,  2.32it/s]
Epoch [2/3], Step [8000/12000], Train Loss: 0.0061, Valid Loss: 0.0110
100%|██████████| 4000/4000 [31:28<00:00,  2.12it/s]
Model saved to ==> /content/drive/MyDrive/Yelp_Reviews/DL/BERT/SentimentPolarity/Models/model_40K_batch8.pt
Model saved to ==> /content/drive/MyDrive/Yelp_Reviews/DL/BERT/SentimentPolarity/Models/metrics_40K_batch8.pt
 50%|█████     | 2000/4000 [15:42<13:27:34, 24.23s/it]
Epoch [3/3], Step [10000/12000], Train Loss: 0.0027, Valid Loss: 0.0209
100%|██████████| 4000/4000 [31:25<00:00,  2.12it/s]
Epoch [3/3], Step [12000/12000], Train Loss: 0.0026, Valid Loss: 0.0133
Model saved to ==> /content/drive/MyDrive/Yelp_Reviews/DL/BERT/SentimentPolarity/Models/metrics_40K_batch8.pt
Training finished!

Model Metrics

   Let's now load the saved metrics, and plot the train and validation loss from the global_steps_list.

In [ ]:
import matplotlib.pyplot as plt

train_loss_list, valid_loss_list, global_steps_list = load_metrics(destination_folder
                                                                   + '/metrics_40K_batch8.pt')
plt.plot(global_steps_list, train_loss_list, label='Train')
plt.plot(global_steps_list, valid_loss_list, label='Valid')
plt.xlabel('Global Steps')
plt.ylabel('Loss')
plt.legend()
plt.show()
Model loaded from <== /content/drive/MyDrive/Yelp_Reviews/DL/BERT/SentimentPolarity/Models/metrics_40K_batch8.pt

   We can also define a function called evaluate, which loads the test set, assigns the device to CUDA and evaluates the trained model with the test set for accuracy, the classification_report and confusion_matrix utilizing sklearn.metrics.

In [ ]:
from sklearn.metrics import classification_report, confusion_matrix

def evaluate(model, test_data):
    y_pred = []
    y_true = []

    test = Dataset(test_data)
    test_dataloader = torch.utils.data.DataLoader(test, batch_size=1)

    use_cuda = torch.cuda.is_available()
    device = torch.device('cuda' if use_cuda else 'cpu')

    if use_cuda:
        model = model.cuda()
    total_acc_test = 0
    model.eval()
    with torch.no_grad():
        for test_input, test_label in test_dataloader:
            test_label = test_label.to(device)
            mask = test_input['attention_mask'].to(device)
            input_id = test_input['input_ids'].squeeze(1).to(device)
            output = model(input_id, mask)
            y_pred.extend(torch.argmax(output, 1).tolist())
            y_true.extend(test_label.tolist())
            acc = (output.argmax(dim=1) == test_label).sum().item()
            total_acc_test += acc
    print(f'Test Accuracy: {total_acc_test / len(test_data): .3f}')

    print('Classification Report:')
    print(classification_report(y_true, y_pred, labels=[1,0], digits=4))

    f, (ax) = plt.subplots(1,1)
    f.suptitle('Test Set: Confusion Matrix', fontsize=20)
    cm = confusion_matrix(y_true, y_pred, labels=[1,0])
    ax = plt.subplot()
    sns.heatmap(cm, annot=True, ax=ax, cmap='Blues', fmt='d')
    ax.set_xlabel('Predicted Sentiment', fontsize=17)
    ax.set_ylabel('Actual Sentiment', fontsize=17)
    ax.xaxis.set_ticklabels(['Negative', 'Positive'], fontsize=17)
    ax.yaxis.set_ticklabels(['Negative', 'Positive'], fontsize=17)

   Let's now use this defined function to evaluate the model metrics of the trained model with the test set.

In [ ]:
best_model = BertClassifier()
load_checkpoint(destination_folder + '/model_40K_batch8.pt', best_model)
evaluate(best_model, df_test)
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Model loaded from <== /content/drive/MyDrive/Yelp_Reviews/Models/DL/BERT/SentimentPolarity/Models/model_40K_batch8.pt
Test Accuracy:  0.997
Classification Report:
              precision    recall  f1-score   support

           1     0.9995    0.9951    0.9973      2029
           0     0.9949    0.9995    0.9972      1971

    accuracy                         0.9972      4000
   macro avg     0.9972    0.9973    0.9972      4000
weighted avg     0.9973    0.9972    0.9973      4000

   So far, this model performed the best with a Test Accuracy: 0.997 and the lowest metric of precision=0.9949 for the negative sentiment group.

Stars on Reviews

    Now, let's use the same preprocessing that was utilized for the sentiment set using the review_stars filtered set.

Train and Evaluate Using Batch Size = 8

   Using the same defined model parameters, we can train a model using the stars_reviews set.

In [ ]:
%cd /content/drive/MyDrive/Yelp_Reviews/Models/DL/BERT/ReviewStars/40k/

destination_folder = '/content/drive/MyDrive/Yelp_Reviews/Models/DL/BERT/ReviewStars/Models'

model = BertClassifier()
optimizer = Adam(model.parameters(), lr=5e-6)
num_epochs = 3
LR = 5e-6
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
In [ ]:
train(model, df_train, df_val, LR, num_epochs)
 50%|████▉     | 1999/4000 [09:18<09:18,  3.59it/s]
Epoch [1/3], Step [2000/12000], Train Loss: 0.1219, Valid Loss: 0.1283
 50%|█████     | 2000/4000 [10:05<8:04:24, 14.53s/it]
Model saved to ==> /content/drive/MyDrive/Yelp_Reviews/DL/BERT/ReviewStars/Models/model_40k_b8.pt
Model saved to ==> /content/drive/MyDrive/Yelp_Reviews/DL/BERT/ReviewStars/Models/metrics_40k_b8.pt
100%|█████████▉| 3999/4000 [19:23<00:00,  3.59it/s]
Epoch [1/3], Step [4000/12000], Train Loss: 0.0710, Valid Loss: 0.1232
100%|██████████| 4000/4000 [20:11<00:00,  3.30it/s]
Model saved to ==> /content/drive/MyDrive/Yelp_Reviews/DL/BERT/ReviewStars/Models/model_40k_b8.pt
Model saved to ==> /content/drive/MyDrive/Yelp_Reviews/DL/BERT/ReviewStars/Models/metrics_40k_b8.pt
 50%|████▉     | 1999/4000 [09:17<09:18,  3.58it/s]
Epoch [2/3], Step [6000/12000], Train Loss: 0.0477, Valid Loss: 0.1192
Model saved to ==> /content/drive/MyDrive/Yelp_Reviews/DL/BERT/ReviewStars/Models/model_40k_b8.pt
 50%|█████     | 2000/4000 [10:06<8:14:40, 14.84s/it]
Model saved to ==> /content/drive/MyDrive/Yelp_Reviews/DL/BERT/ReviewStars/Models/metrics_40k_b8.pt
100%|█████████▉| 3999/4000 [19:24<00:00,  3.58it/s]
Epoch [2/3], Step [8000/12000], Train Loss: 0.0477, Valid Loss: 0.0900
Model saved to ==> /content/drive/MyDrive/Yelp_Reviews/DL/BERT/ReviewStars/Models/model_40k_b8.pt
100%|██████████| 4000/4000 [20:13<00:00,  3.30it/s]
Model saved to ==> /content/drive/MyDrive/Yelp_Reviews/DL/BERT/ReviewStars/Models/metrics_40k_b8.pt
 50%|█████     | 2000/4000 [10:04<7:48:57, 14.07s/it]
Epoch [3/3], Step [10000/12000], Train Loss: 0.0299, Valid Loss: 0.0969
100%|██████████| 4000/4000 [20:08<00:00,  3.31it/s]
Epoch [3/3], Step [12000/12000], Train Loss: 0.0292, Valid Loss: 0.1026
Model saved to ==> /content/drive/MyDrive/Yelp_Reviews/DL/BERT/ReviewStars/Models/metrics_40k_b8.pt
Training finished!

Model Metrics

   Let's now load the saved metrics, and plot the train and validation loss from the global_steps_list.

In [ ]:
train_loss_list, valid_loss_list, global_steps_list = load_metrics(destination_folder
                                                                   + '/metrics_40k_b8.pt')
plt.plot(global_steps_list, train_loss_list, label='Train')
plt.plot(global_steps_list, valid_loss_list, label='Valid')
plt.xlabel('Global Steps')
plt.ylabel('Loss')
plt.legend()
plt.show()
Model loaded from <== /content/drive/MyDrive/Yelp_Reviews/DL/BERT/ReviewStars/Models/metrics_40k_b8.pt

   We can now evaluate the peformance of the trained model by utilizing the evaluate function with the test set.

In [ ]:
best_model = BertClassifier()
load_checkpoint(destination_folder + '/model_40k_b8.pt', best_model)
evaluate(best_model, df_test)
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Model loaded from <== /content/drive/MyDrive/Yelp_Reviews/Models/DL/BERT/ReviewStars/Models/model_40k_b8.pt
Test Accuracy:  0.970
Classification Report:
              precision    recall  f1-score   support

           1     0.9846    0.9550    0.9696      2002
           0     0.9563    0.9850    0.9704      1998

    accuracy                         0.9700      4000
   macro avg     0.9704    0.9700    0.9700      4000
weighted avg     0.9704    0.9700    0.9700      4000

   The model trained with the Sentiment set had better model metrics for everything where the lowest was precision=0.9949 for the negative sentiment group. Using the transfer learning approach with BERT performed better than the BoW, TF-IDF and LSTM methods.


Comments

comments powered by Disqus