Customer reviews provide a vast amount of information that can be leveraged to increase business profitability through customer retention. The overall tone, diction and syntax of the text that reviewers write can be used to generate business improvements or modifications. In addition to the text contained within the reviews, the metadata about when a review was created as well as the location of the business can provide more comprehensive business insight. This information can be used to build better predictive models, especially since a business might have more than one branch at more than one location.
Objective
- Which characteristics of a business are associated with different operational metrics derived from customer ratings and reviews?
Questions
- Do the services provided by the business, its hours of operation and the time of year play any role in the number of positive ratings or reviews?
- Does the activity of the individuals who review businesses show any patterns associated with a higher number of positive ratings/reviews?
- Is the text contained within customer reviews associated with any of the features provided and engineered?
Data and Preprocessing
The data was retrieved from the Yelp Open Dataset, which consists of separate json files for the reviews, businesses, users, tip and photo information. A data warehouse needs to be constructed containing as much relevant information as possible so it can be queried for more insight in downstream analyses. The photo.json file was not utilized in constructing the data warehouse.
The code used for preprocessing and EDA can be found in the Yelp Reviews GitHub repository. First, the environment needs to be set up by importing the dependencies, setting the options for viewing/examining the data and the resolution of the graphs/charts, setting the seed for processing/computations with the data, and defining the directory structure where the data is read from and stored.
For the initial exploratory data analysis (EDA), let's define a function to examine the dimensions (rows and columns), the number of unique and missing values, and the data types of each of the separate json files containing the reviews, businesses, users, tip and checkin information. This function can be applied after reading the data and dropping duplicates. Since some of the information is stored as dictionaries, these columns need to be converted to strings so duplicates can be dropped after joining the various sets.
import os
import random
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
seed_value = 42
os.environ['Yelp_Preprocess_EDA'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
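The prose above also mentions setting the resolution of the graphs/charts and the directory where the data is read from and stored; those settings are not shown in the snippet, so here is a minimal sketch (the DPI value and directory path are assumptions, not taken from the original):
# Hypothetical plotting resolution and data directory
plt.rcParams['figure.dpi'] = 300
data_dir = 'Data/Yelp'   # assumed path containing the Yelp json files
os.chdir(data_dir)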
def data_summary(df):
"""Returns the characteristics of variables in a Pandas dataframe."""
print('Number of Rows: {}, Columns: {}'.format(df.shape[0], df.shape[1]))
a = pd.DataFrame()
a['Number of Unique Values'] = df.nunique()
a['Number of Missing Values'] = df.isnull().sum()
a['Data type of variable'] = df.dtypes
print(a)
print('\nSummary of Yelp Reviews Data')
print('\n')
print('\nReviews - Data Summary')
print('\n')
reviews_json_path = 'yelp_academic_dataset_review.json'
df_reviews = pd.read_json(reviews_json_path, lines=True)
df_reviews = df_reviews.drop_duplicates()
print(data_summary(df_reviews))
print('======================================================================')
print('\nBusiness - Data Summary')
print('\n')
business_json_path = 'yelp_academic_dataset_business.json'
df_business = pd.read_json(business_json_path, lines=True)
df_business['attributes'] = df_business['attributes'].astype(str)
df_business['hours'] = df_business['hours'].astype(str)
df_business = df_business.drop_duplicates()
print(data_summary(df_business))
print('======================================================================')
print('\nUser - Data Summary')
print('\n')
user_json_path = 'yelp_academic_dataset_user.json'
df_user = pd.read_json(user_json_path, lines=True)
df_user = df_user.drop_duplicates()
print(data_summary(df_user))
print('======================================================================')
print('\nTip - Data Summary')
print('\n')
tip_json_path = 'yelp_academic_dataset_tip.json'
df_tip = pd.read_json(tip_json_path, lines=True)
df_tip = df_tip.drop_duplicates()
print(data_summary(df_tip))
print('======================================================================')
print('\nCheckin - Data Summary')
print('\n')
checkin_json_path = 'yelp_academic_dataset_Checkin.json'
df_checkin = pd.read_json(checkin_json_path, lines=True)
df_checkin = df_checkin.drop_duplicates()
print(data_summary(df_checkin))
print('======================================================================')
The reviews set has the largest number of rows (8,635,403) while the user set has the most columns (22). To leverage the highest number of reviews that can be paired with the other available information, the reviews set should be used as the backbone of the warehouse and the starting place for determining which keys are needed to join the different sets together.
Reviews and Businesses
Since multiple sets are being used to build the warehouse, renaming variables so they are unique to each set is important for establishing potential keys. This is completed for the initial reviews and business sets, leaving business_id unchanged since it serves as the join key. Since categories in the business set has missing data, only the rows without missing values are kept to maximize the completeness of the data. The business_id contains characters and numbers, so converting it to a string is important for this key to be used to join the two tables, given the high dimensionality and memory constraints.
df_reviews.rename(columns = {'text': 'text_reviews'}, inplace=True)
df_reviews.rename(columns = {'stars': 'stars_reviews'}, inplace=True)
df_reviews.rename(columns = {'date': 'date_reviews'}, inplace=True)
df_reviews.rename(columns = {'useful': 'useful_reviews'}, inplace=True)
df_reviews.rename(columns = {'funny': 'funny_reviews'}, inplace=True)
df_reviews.rename(columns = {'cool': 'cool_reviews'}, inplace=True)
print('Sample observations from Reviews:')
print(df_reviews.head())
print('\n')
df_business = df_business[df_business.categories.notna()]
df_business = df_business.copy()
df_business.rename(columns = {'stars': 'stars_business'}, inplace=True)
df_business.rename(columns = {'name': 'name_business'}, inplace=True)
df_business.rename(columns = {'review_count': 'review_countbusiness'},
inplace=True)
df_business.rename(columns = {'attributes': 'attributes_business'},
inplace=True)
df_business.rename(columns = {'categories': 'categories_business'},
inplace=True)
df_business.rename(columns = {'hours': 'hours_business'}, inplace=True)
print('Sample observations from Businesses:')
print(df_business.head())
df_reviews['business_id'] = df_reviews['business_id'].astype(str)
df_business['business_id'] = df_business['business_id'].astype(str)
df = pd.merge(df_reviews, df_business, how='right', left_on=['business_id'],
right_on=['business_id'])
df = df.drop_duplicates()
del df_reviews
User
The user set contains variables that probably aren't useful due to ethical constraints, since they contain individual-level information, so these are removed. The columns are then renamed so the variables remain unique to the user set. The user_id is used as the key to join with the main table since it also exists in reviews.
df_user = df_user.drop(['name', 'friends', 'fans', 'compliment_photos'], axis=1)
df_user.rename(columns = {'review_count': 'review_count_user'},
inplace=True)
df_user.rename(columns = {'yelping_since': 'yelping_since_user'},
inplace=True)
df_user.rename(columns = {'useful': 'useful_user'}, inplace=True)
df_user.rename(columns = {'funny': 'funny_user'}, inplace=True)
df_user.rename(columns = {'cool': 'cool_user'}, inplace=True)
df_user.rename(columns = {'elite': 'elite_user'}, inplace=True)
df_user.rename(columns = {'average_stars': 'average_stars_user'},
inplace=True)
df_user.rename(columns = {'compliment_hot': 'compliment_hot_user'},
inplace=True)
df_user.rename(columns = {'compliment_more': 'compliment_more_user'},
inplace=True)
df_user.rename(columns = {'compliment_profile': 'compliment_profile_user'},
inplace=True)
df_user.rename(columns = {'compliment_cute': 'compliment_cute_user'},
inplace=True)
df_user.rename(columns = {'compliment_list': 'compliment_list_user'},
inplace=True)
df_user.rename(columns = {'compliment_note': 'compliment_note_user'},
inplace=True)
df_user.rename(columns = {'compliment_plain': 'compliment_plain_user'},
inplace=True)
df_user.rename(columns = {'compliment_cool': 'compliment_cool_user'},
inplace=True)
df_user.rename(columns = {'compliment_funny': 'compliment_funny_user'},
inplace=True)
df_user.rename(columns = {'compliment_writer': 'compliment_writer_user'},
inplace=True)
print('Sample observations from Users:')
print(df_user.head())
df = pd.merge(df, df_user, how='left', left_on=['user_id'],
right_on=['user_id'])
df = df.drop_duplicates()
del df_user
Tip
As with the user table, variables that are not useful are dropped and the remaining variables are renamed. The business data was used to obtain the name of the business for each entry in the tip set, which allowed features such as the number of compliments to be engineered. A feature containing the sum of the compliment counts for each business id was created using groupby and then merged back onto the tip set.
df_tip = df_tip.drop(['user_id', 'text'], axis=1)
df_tip.rename(columns = {'date': 'date_tip'}, inplace=True)
df_tip.rename(columns = {'compliment_count': 'compliment_count_tip'},
inplace=True)
df_tip['name_business'] = df_tip['business_id'].map(df_business.set_index('business_id')['name_business'])
del df_business
df_tip1 = df_tip.groupby('business_id')['compliment_count_tip'].sum().reset_index()
df_tip1.rename(columns = {'compliment_count_tip': 'compliment_count_tip_idSum'},
inplace=True)
df_tip = pd.merge(df_tip, df_tip1, how='left', left_on=['business_id'],
right_on=['business_id'])
df_tip = df_tip.drop_duplicates()
df_tip = df_tip[df_tip.name_business.notna()]
del df_tip1
print(data_summary(df_tip))
Next, the sum of the compliment count for each business name was calculated using a similar approach and merged back so each row carries the sum of the compliments for that business name. The features used for feature engineering were then dropped and duplicates removed using subset=['business_id']. The result was joined with the main table after converting the keys to the proper format for joining the different tables or dataframes.
df_tip1 = df_tip.groupby('name_business')['compliment_count_tip'].sum().reset_index()
df_tip1.rename(columns = {'compliment_count_tip': 'compliment_count_tip_businessSum'},
inplace=True)
df_tip = pd.merge(df_tip, df_tip1, how='left', left_on=['name_business'],
right_on=['name_business'])
df_tip = df_tip.drop_duplicates()
del df_tip1
df_tip = df_tip.drop(['date_tip', 'compliment_count_tip'], axis=1)
df_tip = df_tip.drop_duplicates(subset = ['business_id'])
print('\nSummary - Tip after Compliment Count Sums:')
print(data_summary(df_tip))
print('\n')
print('Sample observations from Tips:')
print(df_tip.head())
df_tip['business_id'] = df_tip['business_id'].astype(str)
df_tip['name_business'] = df_tip['name_business'].astype(str)
df['name_business'] = df['name_business'].astype(str)
df = pd.merge(df, df_tip, how='right', left_on=['business_id', 'name_business'],
right_on=['business_id', 'name_business'])
df = df.drop_duplicates()
del df_tip
Checkins
The checkin set allowed features to be created based on the time information. Let's process the time variables by exploding the comma-separated dates so that each date becomes a row for each business id.
print('Sample observations from Checkins:')
print(df_checkin.head())
print('\n')
df_checkin['business_id'] = df_checkin['business_id'].astype(str)
df_checkin.rename(columns = {'date': 'businessCheckin_date'},
inplace=True)
df_checkin1 = df_checkin.set_index(['business_id']).apply(lambda x: x.str.split(',').explode()).reset_index()
df_checkin1 = df_checkin1.drop_duplicates()
df_checkin1.head()
To create various time features from the check-in times at the different businesses, a function timeFeatures was utilized to first convert businessCheckin_date to datetime, followed by extracting the year, year-month, year-week and hourly variables as well as morning vs. afternoon/night (AM/PM) by using a mask: if businessCheckin_hourNumber >= 12, then PM; otherwise AM.
def timeFeatures(df):
"""
Returns the year, year-month, year-week, hourly variables and PM/AM from the date.
"""
df['businessCheckin_date'] = pd.to_datetime(df['businessCheckin_date'],
format='%Y-%m-%d %H:%M:%S',
errors='ignore')
df['businessCheckin_Year'] = df.businessCheckin_date.dt.year
df['businessCheckin_YearMonth'] = df['businessCheckin_date'].dt.to_period('M')
df['businessCheckin_YearWeek'] = df['businessCheckin_date'].dt.strftime('%Y-w%U')
df['businessCheckin_hourNumber'] = df.businessCheckin_date.dt.hour
mask = df['businessCheckin_hourNumber'] >= 12
df.loc[mask, 'businessCheckin_hourNumber'] = 'PM'
mask = df['businessCheckin_hourNumber'] != 'PM'
df.loc[mask, 'businessCheckin_hourNumber'] = 'AM'
df = df.drop_duplicates()
return df
df_checkin1 = timeFeatures(df_checkin1)
df_checkin1.head()
Now we can get the count of check-in dates for each business id by grouping by the id and counting the dates, followed by merging back with the main table on the business id. A similar process was completed for businessCheckin_Year, businessCheckin_YearMonth and businessCheckin_YearWeek (a sketch of that pattern follows the code below). Lastly, select the rows that contain the most complete data.
df_checkin2 = df_checkin1.groupby('business_id')['businessCheckin_date'].count().reset_index()
df_checkin2.rename(columns = {'businessCheckin_date': 'dateDay_checkinSum'},
inplace=True)
df_checkin2 = df_checkin2.drop_duplicates()
del df_checkin1
df_checkin = pd.merge(df_checkin, df_checkin2, how='left',
left_on=['business_id'], right_on=['business_id'])
df_checkin = df_checkin.drop_duplicates()
del df_checkin2
df = pd.merge(df, df_checkin, how='left', left_on=['business_id'],
right_on=['business_id'])
df = df.drop_duplicates()
del df_checkin
df = df[df.businessCheckin_date.notna()]
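The analogous aggregations for businessCheckin_Year, businessCheckin_YearMonth and businessCheckin_YearWeek are not shown above; a minimal sketch of the yearly version, which would be run on df_checkin1 before it is deleted (the column name dateYear_checkinSum is an assumption), could look like:
# Sketch: check-in counts per business per year, merged back onto the exploded checkin table
df_checkinYear = (df_checkin1.groupby(['business_id', 'businessCheckin_Year'])
                  ['businessCheckin_date'].count()
                  .reset_index(name='dateYear_checkinSum'))
df_checkin1 = pd.merge(df_checkin1, df_checkinYear, how='left',
                       on=['business_id', 'businessCheckin_Year'])
del df_checkinYear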
Business Hours
Initially, hours_business is a dictionary of the days and hours that the business is operational. There is a lot of information contained within this original feature, so processing it out of the dictionary format will allow more features, and thus potential insight, to be created. Let's create a copy of the original hours_business to compare against the processing. A lambda function can be utilized that splits the current string by , while keeping the business_id paired with each piece of information. This creates duplicates and missing data, so they need to be removed. Some of the information within the original feature differed in spacing, brackets and quoting, so this needs to be normalized to maximize the grain of information. The original hours_business is set as the index to monitor for accurate processing. Afterwards, the number of unique business days/hours can be calculated and the index reset to compare.
df1 = df[['business_id', 'hours_business']]
df1['hours_businessOriginal'] = df1['hours_business']
df1 = df1.set_index(['business_id',
'hours_businessOriginal']).apply(lambda x: x.str.split(',').explode()).reset_index()
df1 = df1.drop_duplicates()
df1 = df1[df1['business_id'].notna()]
print('- Dimensions of exploded business open/closed days/hours:', df1.shape)
df1['hours_business'] = df1['hours_business'].astype(str)
df1.loc[:,'hours_business'] = df1['hours_business'].str.replace(r'{', '',
regex=True)
df1.loc[:,'hours_business'] = df1['hours_business'].str.replace(r'}', '',
regex=True)
df1 = df1.set_index(['hours_businessOriginal'])
df1 = df1.replace("'", "", regex=True)
df1['hours_business'] = df1['hours_business'].str.strip()
df1 = df1.drop_duplicates()
print('- Number of unique business hours after removing Regex characteristics:',
df1[['hours_business']].nunique())
df1 = df1.reset_index()
print('\n')
print('Sample observations from Business Hours:')
df1.head()
The businesses with provided hours can then be filtered to a subset, with the objective of extracting the day from the business hours. To complete this, we can split the string with str.rsplit(':') and keep the portion before the first :, creating hoursBusiness_day. All of the string content after the y ending the day of the week can then be extracted to obtain the times when a business opens and closes, called hours_working. To extract when the business opens, the hours_working string can be split at the -, keeping the numbers before the hyphen and resulting in hours_businessOpen. Then the numbers before the : can be extracted, resulting in hours_businessOpen1, which contains the hour when the business opens. Next, the numbers after the : can be extracted by splitting on the :, creating hours_businessOpen1:, which contains the minutes within the hour. Some businesses open on the half hour or at various times other than the whole hour.
Now, the businesses that open at midnight (hour 0) can be subset and midnight encoded as 24. Next, the businesses that do not open at midnight can be filtered. Their minutes can be divided by the number of minutes in an hour, 60, and this fraction added to the hour for a numerical value rather than a string, creating hours_businessOpen2; for example, an opening time of 9:30 becomes 9 + 30/60 = 9.5. Lastly, the businesses without any open or closed hours can be filtered and designated NA for the created features, so they can be retained for the concatenation with the businesses that do contain open/closed hours. Finally, the set without any open business hours and the processed ones can be concatenated together by row.
df3 = df1[(df1['hours_business'] != 'None')]
print('- Dimensions of businesses with provided open/closed hours:', df3.shape)
df3['hoursBusiness_day'] = df3['hours_business'].str.rsplit(':').str[0]
df3['hours_working'] = df3.hours_business.str.extract('y:(.*)')
df3['hours_businessOpen'] = df3['hours_working'].str.rsplit('-').str[0]
df3['hours_businessOpen1'] = df3['hours_business'].str.rsplit(':').str[-3]
df3['hours_businessOpen1'] = df3['hours_businessOpen1'].astype(int)
df3['hours_businessOpen1:'] = df3['hours_businessOpen'].str.rsplit(':').str[-1]
df4 = df3.loc[(df3['hours_businessOpen1'] == 0)]
df4 = df4.drop(['hours_businessOpen1', 'hours_businessOpen1:'], axis=1)
df4['hours_businessOpen2'] = 24.0
print('- Dimensions of businesses with open hours at Midnight:', df4.shape)
df5 = df3.loc[(df3['hours_businessOpen1'] != 0)]
del df3
df5['hours_businessOpen1:'] = df5['hours_businessOpen1:'].astype(float)
df5['hours_businessOpen1:'] = df5['hours_businessOpen1:'] / 60
df5['hours_businessOpen1'] = df5['hours_businessOpen1'].astype(float)
df5['hours_businessOpen2'] = df5['hours_businessOpen1'] + df5['hours_businessOpen1:']
df5 = df5.drop(['hours_businessOpen1', 'hours_businessOpen1:'], axis=1)
print('- Dimensions of businesses with non whole open hours:', df5.shape)
df2 = df1[(df1['hours_business'] == 'None')]
df2['hoursBusiness_day'] = 'NA'
df2['hours_working'] = 'NA'
df2['hours_businessOpen'] = 'NA'
df2['hours_businessOpen2'] = 'NA'
print('- Dimensions of businesses with no open/closed hours:', df2.shape)
data = [df2, df4, df5]
df7 = pd.concat(data)
print('- Dimensions of businesses with no & business hours open modified:',
df7.shape)
del data, df2, df4, df5
df7[['hours_business', 'hoursBusiness_day', 'hours_working',
'hours_businessOpen', 'hours_businessOpen2']].tail()
Let's now process the businesses' closing time(s) using similar approaches. First, we can filter the businesses with provided hours into a new subset. The time when the business closes can be determined by using str.rsplit('-') to extract the content after the - from hours_business, creating hours_businessClosed. This string can then be split at the : to extract the hours before and the minutes after it. Both can then be converted to float, and a subset of the observations examined:
df2 = df7[(df7['hours_business'] != 'None')]
df2['hours_businessClosed'] = df2['hours_business'].str.rsplit('-').str[1]
df2['hours_businessClosed1'] = df2['hours_business'].str.rsplit('-').str[1]
df2['hours_businessClosed1'] = df2['hours_businessClosed1'].str.rsplit(':').str[0]
df2['hours_businessClosed1'] = df2['hours_businessClosed1'].astype(float)
df2['hours_businessClosed1:'] = df2['hours_businessClosed'].str.rsplit(':').str[-1]
df2['hours_businessClosed1:'] = df2['hours_businessClosed1:'].astype(float)
df2.tail()
The same variety of conditions that exist for opening times also exists for closing times and needs to be processed. Let's first subset the businesses with closing times exactly at midnight on the whole hour. The created features for the closing hour and minutes can be dropped, and 24 (midnight) assigned to hours_businessClosed2; the other conditional groups will then be processed to complete the set.
Then the businesses with closing times that are also at midnight (0) but contain non-whole hours can be filtered to another set. Midnight can be designated as 24 for hours_businessClosed1, and hours_businessClosed1:, which contains the minutes of the hour, can be divided by 60 and added to the 24 whole hours, creating hours_businessClosed2 for this set.
Then all of the businesses that close at non-midnight times on the whole hour can be filtered. Since they do not contain minutes, hours_businessClosed1 can be used directly as hours_businessClosed2, and hours_businessClosed1 then dropped.
Next, the businesses with closing times not at midnight and not on the whole hour can be filtered to another set. The minutes can be converted to a fraction of the hour and added to the hour for a numerical time.
Now, the businesses without any open or closed hours can be filtered and designated NA for the created features so they can be retained for the concatenation with the businesses that contain open/closed hours.
Finally, the various sets that have been processed for the different variations of business closing hours and the one without any open/closed hours can be concatenated together by row. The temporary features are then renamed and examined.
df3 = df2.loc[(df2['hours_businessClosed1'] == 0) & (df2['hours_businessClosed1:'] == 0)]
df3 = df3.drop(['hours_businessClosed1', 'hours_businessClosed1:'], axis=1)
df3['hours_businessClosed2'] = 24.0
print('- Dimensions of businesses with closing times that are midnight and whole hours:',
df3.shape)
df4 = df2.loc[(df2['hours_businessClosed1'] == 0) & (df2['hours_businessClosed1:'] != 0)]
df4['hours_businessClosed1'] = 24.0
df4['hours_businessClosed1'] = df4['hours_businessClosed1'].astype(float)
df4['hours_businessClosed1:'] = df4['hours_businessClosed1:'].astype(float)
df4['hours_businessClosed1:'] = df4['hours_businessClosed1:'] / 60
df4['hours_businessClosed2'] = df4['hours_businessClosed1'] + df4['hours_businessClosed1:']
df4 = df4.drop(['hours_businessClosed1', 'hours_businessClosed1:'], axis=1)
print('- Dimensions of businesses with closing times that are midnight and non whole hours:',
df4.shape)
df5 = df2.loc[(df2['hours_businessClosed1'] != 0) & (df2['hours_businessClosed1:'] == 0)]
df5['hours_businessClosed2'] = df5['hours_businessClosed1']
df5['hours_businessClosed2'] = df5['hours_businessClosed2'].astype(float)
df5 = df5.drop(['hours_businessClosed1', 'hours_businessClosed1:'], axis=1)
print('- Dimensions of businesses with closing times that are non midnight and whole hours:',
df5.shape)
df6 = df2.loc[(df2['hours_businessClosed1'] != 0) & (df2['hours_businessClosed1:'] != 0)]
df6['hours_businessClosed1'] = df6['hours_businessClosed1'].astype(float)
df6['hours_businessClosed1:'] = df6['hours_businessClosed1:'].astype(float)
df6['hours_businessClosed1:'] = df6['hours_businessClosed1:'] / 60
df6['hours_businessClosed2'] = df6['hours_businessClosed1'] + df6['hours_businessClosed1:']
df6 = df6.drop(['hours_businessClosed1', 'hours_businessClosed1:'], axis=1)
print('- Dimensions of businesses with closing times that are non midnight and non whole hours:',
df6.shape)
df1 = df7[(df7['hours_business'] == 'None')]
df1['hours_businessClosed'] = 'NA'
df1['hours_businessClosed2'] = 'NA'
df1 = [df3, df4, df5, df6, df1]
df1 = pd.concat(df1)
df1 = df1.drop(['hours_businessOpen', 'hours_businessClosed'], axis=1)
df1 = df1.drop_duplicates()
print('- Dimensions of businesses with no & business hours closed modified:',
df1.shape)
del df2, df3, df4, df5, df6, df7
df1.rename(columns={'hours_businessOpen2': 'hours_businessOpen',
'hours_businessClosed2': 'hours_businessClosed'},
inplace=True)
df1.head()
Now that we have processed the initial dictionary down to the hours and minutes during the day and night, we can create various time features from the opening/closing business times, such as morning vs. afternoon/night (AM/PM). Let's first filter the businesses with and without hours_working into different sets again. We can create a binary feature that demarcates whether a business opens in the morning or at night, called hours_businessOpen_amPM, by using a mask: if hours_businessOpen >= 12, designate it as PM; otherwise designate it as AM. The same method can be utilized to demarcate the closing period, called hours_businessClosed_amPM.
As before, engineering new features requires handling the subset that does not meet the criteria, so let's fill the new features with NA for the businesses that do not have open/closed hours. Finally, the two sets can be concatenated together by row and the processed business hours renamed hours_businessRegex. Utilizing regex allowed new features to be established from the initial dictionary containing the days of the week with the opening and closing times.
df2 = df1[(df1['hours_working'] != 'NA')]
df2['hours_businessOpen'] = df2['hours_businessOpen'].astype('float64')
df2['hours_businessClosed'] = df2['hours_businessClosed'].astype('float64')
mask = df2['hours_businessOpen'] >= 12
df2.loc[mask, 'hours_businessOpen_amPM'] = 'PM'
mask = df2['hours_businessOpen_amPM'] != 'PM'
df2.loc[mask, 'hours_businessOpen_amPM'] = 'AM'
mask = df2['hours_businessClosed'] >= 12
df2.loc[mask, 'hours_businessClosed_amPM'] = 'PM'
mask = df2['hours_businessClosed_amPM'] != 'PM'
df2.loc[mask, 'hours_businessClosed_amPM'] = 'AM'
df1 = df1[(df1['hours_working'] == 'NA')]
df1['hours_businessOpen_amPM'] = 'NA'
df1['hours_businessClosed_amPM'] = 'NA'
df1 = pd.concat([df2, df1])
df1.rename(columns={'hours_business': 'hours_businessRegex'}, inplace=True)
df1.head()
Let's create an hours_businessOpenDaily feature that contains the total number of hours a business is operational each day by subtracting hours_businessOpen from hours_businessClosed.
df2 = df1[(df1['hours_businessRegex'] != 'None')]
df2['hours_businessOpenDaily'] = df2['hours_businessClosed'] - df2['hours_businessOpen']
df3 = df2.loc[(df2['hours_businessOpenDaily'] == 0)]
df3['hours_businessOpenDaily'] = df3['hours_businessOpenDaily'] + 24.0
print('- Dimensions of businesses with open 24 hours:',
df3.shape)
df2 = df2.loc[(df2['hours_businessOpenDaily'] != 0)]
print('- Dimensions of businesses not open 24 hours:',
df2.shape)
df2 = pd.concat([df3,df2])
df3 = df1[(df1['hours_businessRegex'] == 'None')]
df3['hours_businessOpenDaily'] = 'NA'
df1 = pd.concat([df2,df3])
del df2, df3
df1.head()
The table containing the processed business hours can now be outer joined with the main table, utilizing the string keys business_id and hours_business (matched against hours_businessOriginal on the processed side), followed by dropping any duplicates with subset=['review_id'].
df['business_id'] = df['business_id'].astype(str)
df['hours_business'] = df['hours_business'].astype(str)
df1['business_id'] = df1['business_id'].astype(str)
df1['hours_businessOriginal'] = df1['hours_businessOriginal'].astype(str)
df = pd.merge(df, df1, how='outer', left_on=['business_id', 'hours_business'],
right_on=['business_id', 'hours_businessOriginal'])
df = df.drop_duplicates(subset=['review_id'])
del df1
print('Number of Rows: {}, Columns: {}'.format(df.shape[0], df.shape[1]))
Exploratory Data Analysis
Let's now subset the quantitative features from the initial reviews
and business
tables, and examine the descriptive statistics using pandas.DataFrame.describe
rounded to two decimal places.
df_num = df[['stars_reviews', 'useful_reviews', 'funny_reviews', 'cool_reviews',
'stars_business', 'review_countbusiness', 'is_open']]
print('Descriptive statistics of quant vars in Reviews + Business:')
print('\n')
print(df_num.describe(include=[np.number]).round(2))
When examining the broad summary statistics of the selected features, stars_reviews has mean=3.73 with a standard deviation (std) of 1.43. The 75th percentile is 5 stars, so the set contains mostly highly rated reviews. When examining useful_reviews, funny_reviews and cool_reviews, the 75th percentile is one or fewer, with maximums of 446, 610 and 732, respectively.
For stars_business, the mean=3.73, which is the same as the mean for stars_reviews, while the std=0.69 is lower than the std of stars_reviews. The 75th percentile is 4 stars, which is lower compared to stars_reviews. Regarding the review_countbusiness feature, the std is greater than the mean (the 50th percentile is 183 vs. mean=405.29). Surprisingly, the maximum is only 9,185 given 7,952,263 total reviews. A large number of businesses are open, since is_open is already 1 at the 25th percentile.
To examine this with more granularity, we can calculate the percentage of businesses that are open or closed using value_counts(normalize=True) on is_open, multiplied by 100.
print('- Percentage of businesses that are open or closed:')
print(df['is_open'].value_counts(normalize=True) * 100)
In this set, most of the businesses are open, but the purpose of the project is multifaceted: exploring various business features for insight as well as using the text contained within business reviews to address questions raised by those insights. So, let's drop the is_open feature, since we do not know why a business is no longer open. Now, we can examine the quantitative variables further with box-and-whisker and histogram plots, using seaborn.boxplot and seaborn.histplot in loops over the numerical features so the plots are presented together.
df_num = df_num.drop(['is_open'], axis=1)
sns.set_style('whitegrid')
plt.rcParams.update({'font.size': 15})
fig, ax = plt.subplots(2, 3, figsize=(10,7))
fig.suptitle('Boxplots of Quantitative Variables in Reviews & Businesses',
fontsize=25)
for var, subplot in zip(df_num, ax.flatten()):
sns.boxplot(x=df_num[var], data=df_num, ax=subplot)
plt.tight_layout()
plt.show();
fig, ax = plt.subplots(2, 3, figsize=(10,7))
fig.suptitle('Histograms of Quantitative Variables in Yelp Reviews & Businesses',
fontsize=25)
for variable, subplot in zip(df_num, ax.flatten()):
sns.histplot(x=df_num[variable], kde=True, ax=subplot)
plt.tight_layout()
plt.show();
# keep df_num here; it is reused for the countplots below
Utilizing these plots allows the previously calculated summary statistics to be visualized graphically rather than as one-dimensional numerical columns in table format. Boxplots summarize the distribution of a feature in an easily visualized form, covering most of the information in the table besides the std. Histograms provide a more granular view of the distribution, and utilizing kde=True smooths the distribution using a kernel density estimate.
Although some of the selected variables are coded as numerical values, they could be represented as categorical features given their group structure, so we can use seaborn.countplot in a for loop to examine the count of each group within this subset of features.
import matplotlib.pyplot as plt
import seaborn as sns
df_num = df_num.drop(['useful_reviews', 'funny_reviews', 'cool_reviews'],
axis=1)
plt.rcParams.update({'font.size': 15})
fig, ax = plt.subplots(1, 4, figsize=(20,10))
fig.suptitle('Countplots of Quantitative Variables in Yelp Reviews & Businesses',
fontsize=35)
for var, subplot in zip(df_num, ax.flatten()):
sns.countplot(x=df_num[var], ax=subplot)
plt.tight_layout()
plt.show();
del df_num
The majority of stars_reviews are rated 5 stars while the majority of stars_business are 4 stars. Most of the businesses do not contain a lot of reviews, as the plot demonstrates a right-tailed distribution.
We can also examine the reviews based on the variables in the original reviews set. Using cutoffs of stars_reviews == 5.0 for positive reviews, stars_reviews <= 2.5 for negative reviews and stars_reviews between 3 and 4.5 for average reviews, we can look at the percentage each group represents out of the total set.
total_reviews = len(df)
useful_reviews = len(df[df['useful_reviews'] > 0])
funny_reviews = len(df[df['funny_reviews'] > 0])
cool_reviews = len(df[df['cool_reviews' ] > 0])
positive_reviews = len(df[df['stars_reviews'] == 5.0])
negative_reviews = len(df[df['stars_reviews'] <= 2.5])
ok_reviews = len(df[df['stars_reviews'].between(3, 4.5)])
print('- Total reviews: {}'.format(total_reviews))
print('- Useful reviews: ' + str(round((useful_reviews / total_reviews) * 100)) + '%')
print('- Funny reviews: ' + str(round((funny_reviews / total_reviews) * 100)) + '%')
print('- Cool reviews: ' + str(round((cool_reviews / total_reviews) * 100)) + '%')
print('- Positive reviews: ' + str(round((positive_reviews / total_reviews) * 100)) + '%')
print('- Negative reviews: ' + str(round((negative_reviews / total_reviews) * 100)) + '%')
print('- OK reviews: ' + str(round((ok_reviews / total_reviews) * 100)) + '%')
Less than half of the reviews are flagged as useful, less than half are positive, and only a minority are negative. This might suggest that balancing the classes is needed before any modeling is completed.
Next, let's examine the features in the original business set. Based on the count of the business name (name_business), we can find the 30 most frequently reviewed businesses, which are probably restaurants, and plot the average review stars for those top 30 restaurants.
top30_restaurants = df.name_business.value_counts().index[:30].tolist()
top30_restaurantsCount = len(top30_restaurants)
total_businesses = df['name_business'].nunique()
print('- Percentage of top 30 restaurants: ' + str((top30_restaurantsCount / total_businesses) * 100))
top30_restaurants = df.loc[df['name_business'].isin(top30_restaurants)]
top30_restaurants.groupby(top30_restaurants.name_business)['stars_reviews'].mean().sort_values(ascending=True).plot(kind='barh',
figsize=(12,10))
plt.title('Average Review Rating of 30 Most Frequent Restaurants',
fontsize=20)
plt.ylabel('Name of Restaurant', fontsize=18)
plt.xlabel('Average Review Rating', fontsize=18)
plt.yticks(fontsize=18)
plt.tight_layout()
plt.show();
Not surprisingly, fast food restaurants have a high number of reviews and a poor average rating.
We can also plot the average number of useful
, funny
and cool
reviews in the top 30 restaurants and then sort by the number of useful
reviews.
top30_restaurants.groupby(top30_restaurants.name_business)[['useful_reviews',
'funny_reviews',
'cool_reviews']].mean().sort_values('useful_reviews',
ascending=True).plot(kind='barh',
figsize=(15,14),
width=0.7)
plt.title('Average Useful, Funny & Cool Reviews in the 30 Most Frequent Restaurants',
fontsize=28)
plt.ylabel('Name of Restaurant', fontsize=18)
plt.yticks(fontsize=20)
plt.legend(fontsize=22)
plt.tight_layout()
plt.show()
Based on the graph, more of these selected reviews are considered cool rather than funny. It is plausible that individuals use diction expressing enjoyment more than humor when their reviews are considered useful.
We can also find the top 10 states with the highest number of reviews by converting value_counts()
to a dataframe and then use a barplot
with the state
as the index.
x = df['state'].value_counts()[:10].to_frame()
plt.figure(figsize=(20,10))
sns.barplot(x=x['state'], y=x.index)
plt.title('States with the 10 highest number of business reviews listed in Yelp',
fontsize=35)
plt.xlabel('Counts of Reviews', fontsize=25)
plt.ylabel('State', fontsize=25)
plt.tight_layout()
plt.show();
del x
In this set and subset, Massachusetts has the largest number of reviews, followed by Texas, Oregon, Georgia, Florida, British Columbia and Ohio. These locations span multiple regions, from the Northeast and Southeast of the United States to the West Coast and British Columbia.
Building on this, we can find the top 30 cities with the highest number of reviews.
x = df['city'].value_counts()
x = x.sort_values(ascending=False)
x = x.iloc[0:30]
plt.rcParams.update({'font.size': 14})
plt.figure(figsize=(20,10))
ax = sns.barplot(x=x.index, y=x.values, alpha=0.9)
plt.title('Cities with the Highest Number of Reviews on Yelp', fontsize=35)
locs, labels = plt.xticks()
plt.setp(labels, rotation=45)
plt.xlabel('Name of City', fontsize=25)
plt.ylabel('Number of Reviews', fontsize=25)
rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
height = rect.get_height()
ax.text(rect.get_x() + rect.get_width()/2, height+3, label, ha='center',
va='bottom', fontsize=13)
plt.tight_layout()
plt.show();
del x
This plot refines the granularity from the state level down to the city level. Although Massachusetts contained the highest number of reviews, Austin, Texas contains the most of any city, followed by Portland (Oregon), Atlanta (Georgia), Boston (Massachusetts), Orlando (Florida) and Vancouver (British Columbia). The city with the next highest count, Columbus, had roughly half as many as Vancouver.
Building on this, we can take the 30 cities with the most reviews in the set and examine which of them have the highest average review rating.
top30_city = df.city.value_counts().index[:30].tolist()
top30_city = df.loc[df['city'].isin(top30_city)]
top30_city.groupby(top30_city.city)['stars_reviews'].mean().sort_values(ascending=True).plot(kind='barh',
figsize=(12,10))
plt.yticks(fontsize=18)
plt.title('Average Review Rating of Cities with Highest Number of Reviews on Yelp',
fontsize=20)
plt.ylabel('Name of City', fontsize=18)
plt.xlabel('Average Review Rating', fontsize=18)
plt.tight_layout()
plt.show();
The cities with the highest average review ratings are Winter Park, Portland, Austin, Somerville and Boulder. These cities are not all geographically close to one another, which suggests the ratings are not dominated by a single region. However, closer examination is warranted to control for a potentially confounding variable like location.
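As a quick, rough check on a possible location effect, the average review rating can be grouped by state; a minimal sketch using the columns already in the merged table:
# Average review rating per state as a coarse check for a location effect
print(df.groupby('state')['stars_reviews'].mean().sort_values(ascending=False).round(2))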
Then we can convert the date_reviews
feature to the proper format using pd.to_datetime
with format='%Y-%m-%d %H:%M:%S'
to extract the time related components like the year, year-month and year-week. Since we extracted the year of the review, we can utilize this with the business_id
and stars_business
features to examine the 5 star rated businesses containing the most reviews per year in the set.
df['date_reviews'] = pd.to_datetime(df['date_reviews'],
format='%Y-%m-%d %H:%M:%S', errors='ignore')
df['date_reviews_Year'] = df.date_reviews.dt.year
df['date_reviews_YearMonth'] = df['date_reviews'].dt.to_period('M')
df['date_reviews_YearWeek'] = df['date_reviews'].dt.strftime('%Y-w%U')
top = 5
temp = df[['business_id', 'date_reviews_Year', 'stars_business']]
five_star_reviews = temp[temp['stars_business'] == 5]
trendy = five_star_reviews.groupby(['business_id',
'date_reviews_Year']).size().reset_index(name='counts')
trending = trendy.sort_values(['date_reviews_Year',
'counts'])[::-1][:top].business_id.values
plt.rcParams.update({'font.size': 10})
for business_id in trending:
record = trendy.loc[trendy['business_id'] == business_id]
business_name = df.loc[df['business_id'] == business_id].name_business.values[0]
series = pd.Series(record['counts'].values,
index=record.date_reviews_Year.values,
name='Trending business')
axes = series.plot(kind='bar', figsize=(5,5))
plt.xlabel('Year', axes=axes)
plt.ylabel('Total reviews', axes=axes)
plt.title('Review trend of {}'.format(business_name), axes=axes)
plt.show();
Since we utilized the business_id and the yearly review count for the 5-star businesses, we are able to determine a specific business location given the id. This resulted in very different types of businesses being selected, ranging from frozen yogurt to Asian food.
Next, we can examine the user
information using pandas.DataFrame.describe
again to generate descriptive statistics rounded to two decimal places.
df_num = df[['review_count_user', 'useful_user', 'funny_user', 'cool_user',
'average_stars_user', 'compliment_hot_user',
'compliment_more_user','compliment_profile_user',
'compliment_cute_user', 'compliment_list_user',
'compliment_note_user', 'compliment_plain_user',
'compliment_cool_user', 'compliment_funny_user',
'compliment_writer_user']]
print('- Descriptive statistics of quant vars in User:')
print(df_num.describe(include=[np.number]).round(2))
For the basic summary statistics of these user-related features, review_count_user has mean=141.96 and a maximum of 15,686. useful_user has the highest std=444.06 and the highest maximum (204,380) compared to the cool and funny user counts. The average_stars_user is 3.74 with std=0.79. The compliment_plain_user feature has the highest standard deviation (610.92) and the highest maximum (90,858) within the compliment*user group, even though the majority of these features are zero.
We can also examine the number of reviews completed by each user as well as the average rating by each user using histplot
from seaborn
.
sns.histplot(x='review_count_user', data=df_num, kde=True)
plt.ylim(0, 100000)
plt.title('Count of reviews by Users in Yelp')
plt.tight_layout()
plt.show();
sns.histplot(x='average_stars_user', data=df_num, kde=True)
plt.ylim(0, 300000)
plt.title('Count of Average Stars by Users in Yelp')
plt.tight_layout()
plt.show();
The majority of users have written only a small number of reviews, as demonstrated by the right-skewed histogram, while the average stars given by users is left-skewed around 4 stars.
Box-and-whisker plots are another method to visualize the quantitative variables from the User
set. We can iterate through the features using a for
loop.
plt.rcParams.update({'font.size': 25})
fig, ax = plt.subplots(3, 5, figsize=(20,10))
fig.suptitle('Boxplots of Quantitative Variables in Users on Yelp Reviews',
fontsize=35)
for var, subplot in zip(df_num, ax.flatten()):
sns.boxplot(x=df_num[var], data=df_num, ax=subplot)
plt.tight_layout()
plt.show();
The compliment*user group clearly contains outliers, as demonstrated by the plot above. A small number of users are very active in this set, but the majority of users are not.
Categories
Let's examine some observations from the original categories variable to determine how to proceed with processing. The current categories_business feature is organized as a comma-separated list of the different categorical information about the business.
df_category_split = df[['business_id', 'categories_business']]
df_category_split = df_category_split.drop_duplicates()
df_category_split[:10]
The various categories can be split into single components by setting the business_id as the index, splitting by the ,, and then resetting the index. The column can then be renamed to a unique feature name. Splitting leaves some leading spaces that may have existed before processing, so we strip the whitespace to normalize the observations for downstream comparisons. Then duplicate observations can be removed. Let's now examine the first ten observations after processing:
df_category_split = df_category_split.set_index(['business_id'])
df_category_split = df_category_split.stack().str.split(',',
expand=True).stack().unstack(-2).reset_index(-1,
drop=True).reset_index()
df_category_split.rename(columns={'categories_business':
'categories_combined'}, inplace=True)
df_category_split['categories_combined'] = df_category_split['categories_combined'].str.strip()
df_category_split = df_category_split.drop_duplicates()
df_category_split[:10]
The table containing the processed categories can now be left joined with the main table utilizing business_id as the key for both sets; the original categories_business feature is then dropped and duplicates removed from the set. Now, we can use the data_summary function again to examine the number of unique and missing values as well as the data types.
df = pd.merge(df, df_category_split, how='left', left_on=['business_id'],
right_on=['business_id'])
df = df.drop(['categories_business'], axis=1)
df = df.drop_duplicates(subset='review_id')
del df_category_split
print('\nSummary - Preprocessing Yelp Reviews for Category:')
print(data_summary(df))
Let's now examine the processed categories_combined feature to see what the top 20 categories in the set are, using value_counts as well as visually with seaborn.barplot.
plt.rcParams.update({'font.size': 15})
x = df.categories_combined.value_counts()
print('- There are', len(x), 'different categories of Businesses in Yelp')
print('\n')
x = x.sort_values(ascending=False)
x = x.iloc[0:20]
print('Top 20 categories in Yelp:')
print(x)
print('\n')
plt.figure(figsize=(16,10))
ax = sns.barplot(x=x.index, y=x.values, alpha=0.9)
plt.title('Top 20 Categories in Yelp Reviews', fontsize=18)
locs, labels = plt.xticks()
plt.setp(labels, rotation=70)
plt.ylabel('Number of Businesses', fontsize=14)
plt.xlabel('Type of Category', fontsize=14)
rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
height = rect.get_height()
ax.text(rect.get_x() + rect.get_width()/2, height+5, label, ha='center',
va='bottom', fontsize=11)
plt.tight_layout()
del x
The top five categories are Restaurants, Food, Nightlife, Bars and American (New). There is high similarity between several of these groups, so focusing on the food-related businesses sounds like a worthwhile venture.
Now, we can write the processed data to a .csv
file and proceed to processing the text data from the text_reviews
feature.
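The export itself is not shown here; a minimal sketch (the file name is an assumption) would be:
# Persist the processed warehouse before the NLP steps; the file name is hypothetical
df.to_csv('yelp_reviews_processed.csv', index=False)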
NLP Preprocessing
Before preprocessing the reviews, let's download some word sets for cleaning the text in the reviews. We can utilize en_core_web_lg
from the spaCy
library as well as wordnet
, punkt
and stopwords
from the NLTK
library. To download en_core_web_lg
from spaCy
, run !python -m spacy download en_core_web_lg
in the notebook environment or without the !
if utilizing terminal. Then this can be loaded into the notebook after importing the library.
Now, we can subset the features that the previous plots suggested impact the distribution of the reviews.
import spacy
import nltk
nlp = spacy.load('en_core_web_lg')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')
df = df[['review_id', 'text_reviews', 'stars_reviews', 'name_business', 'city',
'state', 'stars_business', 'review_countbusiness',
'categories_combined']]
print('Number of rows and columns:', df.shape)
Yelp offers the ability for individuals to review various types of businesses, ranging from food to everyday goods and services. For this analysis, let's focus on food restaurants. Using the processed categories feature, the data was filtered to the food categories with over 30k counts and to the seven states with the highest counts of reviews.
v = df.categories_combined.value_counts()
df = df[df.categories_combined.isin(v.index[v.gt(30000)])]
df = df.loc[(df['categories_combined']=='Restaurants')
| (df['categories_combined']=='Food')
| (df['categories_combined']=='American (New)')
| (df['categories_combined']=='American (Traditional)')
| (df['categories_combined']=='Pizza')
| (df['categories_combined']=='Sandwiches')
| (df['categories_combined']=='Breakfast & Brunch')
| (df['categories_combined']=='Mexican')
| (df['categories_combined']=='Italian')
| (df['categories_combined']=='Seafood')
| (df['categories_combined']=='Japanese')
| (df['categories_combined']=='Burgers')
| (df['categories_combined']=='Sushi Bars')
| (df['categories_combined']=='Chinese')
| (df['categories_combined']=='Desserts')
| (df['categories_combined']=='Thai')
| (df['categories_combined']=='Bakeries')
| (df['categories_combined']=='Asian Fusion')
| (df['categories_combined']=='Steakhouses')
| (df['categories_combined']=='Salad')
| (df['categories_combined']=='Cafes')
| (df['categories_combined']=='Barbeque')
| (df['categories_combined']=='Southern')
| (df['categories_combined']=='Ice Cream & Frozen Yogurt')
| (df['categories_combined']=='Vietnamese')
| (df['categories_combined']=='Vegetarian')
| (df['categories_combined']=='Specialty Food')
| (df['categories_combined']=='Mediterranean')
| (df['categories_combined']=='Local Flavor')
| (df['categories_combined']=='Indian')
| (df['categories_combined']=='Tex-Mex')]
print('- Dimensions after filtering food categories with over 30k counts:',
df.shape)
df1 = df['state'].value_counts().index[:7]
df = df[df['state'].isin(df1)]
del df1
print('- Dimensions after filtering US states with the 7 highest count of reviews:',
df.shape)
This significantly reduced the number of observations to less than four million compared to the almost eight million in the initial set.
Initial EDA of Reviews - Food with > 30k Counts
Text data contains common words called stopwords, which are frequent in sentences and need to be processed before text classification. Before processing the reviews, we can utilize lambda functions again to determine the number of words and characters in each initial review by splitting the strings and calculating the lengths with len(x). The resulting features can then be plotted next to each other using different colors, and afterwards these temporary variables can be dropped.
df['review_wordCount'] = df['text_reviews'].apply(lambda x: len(str(x).split()))
df['review_charCount'] = df['text_reviews'].apply(lambda x: len(str(x)))
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(11,7), sharex=True)
f.suptitle('Restaurants > 30k Counts in Top 7 States: Length of Words and Characters for Each Review',
fontsize=15)
f.text(0.5, 0.04, 'Review Number', ha='center', fontsize=15)
ax1.plot(df['review_wordCount'], color='red')
ax1.set_ylabel('Word Count', fontsize=15)
ax2.plot(df['review_charCount'], color='blue')
ax2.set_ylabel('Character Count', fontsize=15)
df = df.drop(['review_wordCount', 'review_charCount'], axis=1)
For the reviews in this food-oriented set, the longest reviews are slightly under 1,000 words and around 5,000 characters.
Text Preprocessing
Text data often contains multiple languages, which is something to consider before any type of downstream modeling. For the initial text processing step, langid
can be used to first detect the language present in the text strings. The pandas
dataframe can be converted to a dask
dataframe partitioning the data into ten sets for parallel processing and an apply
function can be utilized on the text_reviews
feature to detect the language. Then the dask
dataframe can be converted back using compute
.
import dask.dataframe as dd
import langid
import time
ddata = dd.from_pandas(df, npartitions=10)
del df
print('Time for cleaning with Langid to find the language of the reviews..')
search_time_start = time.time()
ddata['language'] = ddata['text_reviews'].apply(langid.classify,
meta=('text_reviews',
'object')).compute(scheduler='processes')
print('Finished cleaning with Langid in:', time.time() - search_time_start)
df = ddata.compute()
del ddata
Then a lambda
function can be used to extract the detected language in string format for the language
feature. The number of estimated languages present, the percent which is English and the number of non-English/English reviews can now be calculated. Then the non-English reviews can be filtered out of the set.
df['language'] = df['language'].apply(lambda tuple: tuple[0])
print('- Number of tagged languages (estimated):')
print(len(df['language'].unique()))
print('- Percent of data in English (estimated):')
print((sum(df['language'] == 'en') / len(df)) * 100)
df1 = df.loc[df['language'] != 'en']
print('- Number of non-English reviews:', df1.shape[0])
del df1
df = df.loc[df['language'] == 'en']
print('- Number of English reviews:', df.shape[0])
df = df.drop(['language'], axis=1)
There are 55 unique languages present in the set, and 99.8% of the reviews are estimated to be English, so filtering retains most of the observations. If reviews in other languages were processed using English-based word sets, they would likely act as anomalies and skew any downstream modeling results.
Now, let's define a class called cleantext to remove the non-word components of the text strings before processing the words. The instance of this class (self) stores the text so it can be passed between the methods of the cleantext class. Using the re module, the methods in cleantext remove square brackets ([]), numbers and special characters with re.sub, replace contractions, tokenize the text into words, remove non-ASCII characters from the list of tokenized words, remove punctuation, convert to lowercase and join the words back together.
import re
import contractions
import string
import unicodedata
class cleantext():
def __init__(self, text='text'):
self.text = text
    def remove_between_square_brackets(self):
        self.text = re.sub(r'\[[^]]*\]', '', self.text)
        return self
    def remove_numbers(self):
        self.text = re.sub(r'[-+]?[0-9]+', '', self.text)
        return self
    def remove_special_characters(self, remove_digits=True):
        # use A-Z (capital) so characters such as [ \ ] ^ _ ` are also removed
        self.text = re.sub(r'[^a-zA-Z0-9\s]', '', self.text)
        return self
def replace_contractions(self):
self.text = contractions.fix(self.text)
return self
def get_words(self):
self.words = nltk.word_tokenize(self.text)
return self
def remove_non_ascii(self):
new_words = []
for word in self.words:
new_word = unicodedata.normalize('NFKD',
word).encode('ascii',
'ignore').decode('utf-8',
'ignore')
new_words.append(new_word)
self.words = new_words
return self
def remove_punctuation(self):
new_words = []
for word in self.words:
new_word = re.sub(r'[^\w\s]', '', word)
if new_word != '':
new_words.append(new_word)
self.words = new_words
return self
def to_lowercase(self):
new_words = []
for word in self.words:
new_word = word.lower()
new_words.append(new_word)
self.words = new_words
return self
def join_words(self):
self.words = ' '.join(self.words)
return self
def do_all(self, text):
self.text = text
self = self.remove_numbers()
self = self.remove_special_characters()
self = self.replace_contractions()
self = self.get_words()
self = self.remove_non_ascii()
self = self.remove_punctuation()
self = self.to_lowercase()
return self.words
ct = cleantext()
def dask_this(df):
res = df.apply(ct.do_all)
return res
The pandas.DataFrame can be converted to a dask dataframe, partitioning the data into ten sets for parallel processing, and an apply function can be utilized on text_reviews to remove the non-words within the text strings as defined in the cleantext class. Then the dask dataframe can be converted back using compute.
ddata = dd.from_pandas(df, npartitions=10)
del df
print('Time for reviews to be cleaned for non-words...')
search_time_start = time.time()
ddata['cleanReview'] = ddata['text_reviews'].map_partitions(dask_this).compute(scheduler='processes')
print('Finished cleaning reviews for non-words in:',
time.time() - search_time_start)
df = ddata.compute()
del ddata
Now, we can drop the original text_reviews and join the tokens back into a single string, replacing the commas introduced by tokenizing with spaces.
df = df.drop(['text_reviews'], axis=1)
df['cleanReview'] = df['cleanReview'].apply(lambda x: ','.join(map(str, x)))
df['cleanReview'] = df['cleanReview'].str.replace(r',', ' ', regex=True)
Building on the first class that was utilized to process the non-word components, we can define a more condensed one to process the words in the text data. On a first pass, a few non-UTF-8 characters existed in the set, which might have derived from exporting the data rather than processing it in its entirety in a single session after importing, so those few rows had to be located and removed. A library's stopwords list is context driven, shaped by the text data used to derive it, so careful inspection of the reference word bank as well as fine tuning might be required to optimally solve the tasked problem. Adding more common words to the stopwords_list keeps highly prevalent words from masking less prevalent, potentially more insightful ones.
Let's start by tokenizing using nltk
followed by removing the stop words from the reviews using nltk.corpus
. Then the text can be broken down to the root words using the WordNetLemmatizer
and the processed string can then be joined as one single text string.
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
stopwords_list = stopwords.words('english')
stopwords_list.extend(('thing', 'eat'))
class cleantext1():
def __init__(self, text='test'):
self.text = text
def get_words(self):
self.words = nltk.word_tokenize(self.text)
return self
def remove_stopwords(self):
new_words = []
for word in self.words:
if word not in stopwords_list:
new_words.append(word)
self.words = new_words
return self
def lemmatize_words(self):
lemmatizer = WordNetLemmatizer()
lemmas = []
for word in self.words:
lemma = lemmatizer.lemmatize(word)
lemmas.append(lemma)
self.words = lemmas
return self
def join_words(self):
self.words = ' '.join(self.words)
return self
def do_all(self, text):
self.text = text
self = self.get_words()
self = self.remove_stopwords()
self = self.lemmatize_words()
return self.words
ct = cleantext1()
def dask_this(df):
res = df.apply(ct.do_all)
return res
ddata = dd.from_pandas(df, npartitions=10)
del df
print('Time for reviews to be cleaned for stopwords and lemma...')
search_time_start = time.time()
ddata['cleanReview1'] = ddata['cleanReview'].map_partitions(dask_this).compute(scheduler='processes')
print('Finished cleaning reviews in:', time.time() - search_time_start)
df = ddata.compute()
del ddata
df['cleanReview1'] = df['cleanReview1'].apply(lambda x: ','.join(map(str, x)))
df['cleanReview1'] = df['cleanReview1'].str.replace(r',', ' ', regex=True)
EDA of Cleaned Reviews
¶
We can then utilize the same lambda
function structure that was used for the initial reviews to determine the word count of the processed reviews after removing non-words, stopwords and lemmatization, and plot this next to the word count after removing only the non-words. After plotting, we can drop these temporary variables.
df['cleanReview_wordCount'] = df['cleanReview'].apply(lambda x: len(str(x).split()))
df['cleanReview_wordCount1'] = df['cleanReview1'].apply(lambda x: len(str(x).split()))
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(11,7), sharex=True, sharey=True)
f.suptitle('Length of Words After Removing Non-Words and Stopwords/Lemmatization',
fontsize=15)
f.text(0.5, 0.04, 'Review Number', ha='center', fontsize=15)
ax1.plot(df['cleanReview_wordCount'], color='red')
ax1.set_ylabel('Word Count', fontsize=15)
ax2.plot(df['cleanReview_wordCount1'], color='blue');
df = df.drop(['cleanReview_wordCount', 'cleanReview_wordCount1'], axis=1)
When comparing the length of the reviews in regard to the word count, the processed reviews contain around 500 words compared to ~1000 in the initial set.
Next, we can examine the length of the characters before/after removing non-words and stopwords/lemmatization.
df['cleanReview_charCount'] = df['cleanReview'].apply(lambda x: len(str(x)))
df['cleanReview_charCount1'] = df['cleanReview1'].apply(lambda x: len(str(x)))
f, ((ax1), (ax2)) = plt.subplots(1, 2, figsize=(11,7), sharex=True, sharey=True)
f.suptitle('Length of Characters After Removing Non-Words and Stopwords/Lemmatization',
fontsize=15)
f.text(0.5, 0.04, 'Review Number', ha='center', fontsize=15)
ax1.plot(df['cleanReview_charCount'], color='red')
ax1.set_ylabel('Character Count', fontsize=15)
ax2.plot(df['cleanReview_charCount1'], color='blue');
df = df.drop(['cleanReview_charCount', 'cleanReview_charCount1'], axis=1)
The initial character count was around 5000, and now it is around 3500.
Let's now create a temporary dataframe containing the 1 & 2 star reviews and another with the 5 star reviews. Then subset the cleaned reviews to create a word cloud for the higher and the lower rated reviews.
from wordcloud import WordCloud, STOPWORDS
df1 = df.loc[(df['stars_reviews'] == 1) | (df['stars_reviews'] == 2)]
df2 = df[df.stars_reviews == 5]
df1_clean = df1['cleanReview1']
df2_clean = df2['cleanReview1']
def wordcloud_draw(data, color='black'):
words = ' '.join(data)
cleaned_word = ' '.join([word for word in words.split()
if 'http' not in word
and not word.startswith('@')
and not word.startswith('#')
and word != 'RT'])
wordcloud = WordCloud(stopwords=STOPWORDS,
background_color=color,
width=1500,
height=1000).generate(cleaned_word)
plt.figure(1, figsize=(10,10))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
print('- Higher Rated Reviews:')
wordcloud_draw(df2_clean, 'white')
print('- Lower Rated Reviews:')
wordcloud_draw(df1_clean, 'black')
del df1_clean, df2_clean
Word clouds can be utilized to visualize the words present, but let's examine them at a more granular level by defining a function get_all_words
to find all of the words present in the set and then FreqDist
can be used to generate the frequency of the words in the set. Let's first find the 100 most common words in 5 star reviews and write the list to a .csv
file. Then perform the same operations with the lowest stars set.
We can then find the top 10 words that are in both the 5 and 1 & 2 star review sets by converting the list to a dataframe using an inner merge with word
as the key.
from nltk import FreqDist
import csv
df1_clean = df1['cleanReview1']
df2_clean = df2['cleanReview1']
def get_all_words(cleaned_tokens_list):
    # Each cleaned review is a single space-joined string at this point,
    # so split it back into words before counting frequencies
    for tokens in cleaned_tokens_list:
        for token in tokens.split():
            yield token
ratings_higher_words = get_all_words(df2_clean)
freq_dist_higher = FreqDist(ratings_higher_words)
print(freq_dist_higher.most_common(100))
list0 = freq_dist_higher.most_common(100)
with open('topWords_5starReviews.csv','w') as f:
writer = csv.writer(f)
writer.writerow(['word', 'count'])
writer.writerows(list0)
ratings_lower_words = get_all_words(df1_clean)
freq_dist_lower = FreqDist(ratings_lower_words)
print(freq_dist_lower.most_common(100))
list1 = freq_dist_lower.most_common(100)
with open('topWords_12starReviews.csv','w') as f:
writer = csv.writer(f)
writer.writerow(['word', 'count'])
writer.writerows(list1)
df1 = pd.DataFrame(list0, columns=['word', 'count'])
df1 = df1[['word']]
df2 = pd.DataFrame(list1, columns=['word', 'count'])
df2 = df2[['word']]
df1 = df1.merge(df2, on='word')
print(df1.iloc[0:10])
del df1, df2, df1_clean, df2_clean
The most frequent words in both sets are food
, place
, great
, good
, time
and service
. This makes rational sense since we filtered the original set by food
.
Using the open-source textblob
library, we can utilize the sentiment
function within a lambda
function to generate a polarity score where Negative = -1.0
and Positive = 1.0
with the possibility of intermediate classes since it is a float
numerical score. Let's apply this on the set and create the polarity
feature. Now we can examine the descriptive statistics using describe
again.
from textblob import TextBlob
df = df[['stars_reviews', 'cleanReview1']]
df.rename(columns={'cleanReview1': 'cleanReview'}, inplace=True)
df['polarity'] = df['cleanReview'].apply(lambda x: TextBlob(x).sentiment[0])
print(df['polarity'].describe().apply("{0:.3f}".format))
Since this generated a numerical output, we can define set thresholds within a function getAnalysis
to demarcate the levels of polarity using if
and elif
conditional statements. This can be applied to the polarity
feature, labeled as sentiment
, a new qualitative feature.
def getAnalysis(score):
if score >= 0.4:
return 'Positive'
elif score > 0.2 and score < 0.4:
return 'Slightly Positive'
elif score <= 0.2 and score >= 0.0:
return 'Slightly Negative'
else:
return 'Negative'
df['sentiment'] = df['polarity'].apply(getAnalysis)
df[['sentiment']].value_counts()
Interestingly, discrepancies exist between the stars_reviews
and polarity
features. There are a higher number of 5 star reviews that are labeled as Slightly Positive
compared to Positive
, 4 and 5 star reviews labeled as Slightly Negative
, and even 1 star reviews considered as Positive
as well as 5 star reviews labeled Negative
for the sentiment
feature. This is dependent on the input data that was used to set the numerical range of values defining polarity for the textblob
library. Since an apparent difference exists between the stars_reviews
and polarity
features, we can retain both of them and evaluate how classification performs using them separately as the target or label for the model.
Given that textblob
allocated different sentiment scores compared to the defined levels of stars_reviews
, an equivalent size of each target needs to be sampled so the classes are balanced and comparisons can be made between the different approaches. Since the negative Sentiment
contains the smallest number of observations (n=414937
), let's sample the same amount for the positive Sentiment
by filtering and then shuffling before sampling. These can then be concatenated, shuffled and saved as a parquet
file for later use.
from sklearn.utils import shuffle
df1 = df[df.sentiment == 'Positive']
df1 = shuffle(df1)
df1 = df1.sample(n=414937)
df2 = df[df.sentiment == 'Negative']
sent = pd.concat([df1, df2])
del df1, df2
sent = shuffle(sent)
sent = sent[['cleanReview', 'sentiment', 'stars_reviews']]
sent.to_parquet('YelpReviews_NLP_sentimentNegPos.parquet')
Let's now sample the equivalent numbers for the stars_reviews
set by first creating a temporary feature stars
where 1 & 2 star reviews are defined as zero and 5 star reviews as one. Then each subset is filtered, shuffled and sampled using the same number of observations as the negative Sentiment
(n=414937
) to balance the sets. Then the two sets are concatenated by row, shuffled, the temporary stars feature is dropped, and the result is saved as a parquet
file.
df['stars'] = df['stars_reviews']
df['stars'].mask(df['stars_reviews'] == 1, 0, inplace=True)
df['stars'].mask(df['stars_reviews'] == 2, 0, inplace=True)
df['stars'].mask(df['stars_reviews'] == 5, 1, inplace=True)
df1 = df[df.stars==1]
df1 = shuffle(df1)
df1 = df1.sample(n=414937)
df2 = df[df.stars==0]
df2 = shuffle(df2)
df2 = df2.sample(n=414937)
df = pd.concat([df1, df2])
df = shuffle(df)
df = df.drop(['stars'], axis=1)
del df1, df2
df.to_parquet('YelpReviews_NLP_125stars.parquet')
Classification
¶
Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF)
¶
A bag-of-words (BoW
) model is a method to extract the features from text information to use in modeling. It was first introduced in Zellig Harris's 1954 publication Distributional Structure. The words within the text are compiled into a bag
of words where the order and the structure of the words in the text information is not considered. The text is preprocessed using the methods in the NLP Preprocessing
section and a vocabulary is created by determining the frequency of each word within the text. The text is then converted into a vector since numerical values are needed for modeling.
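As a quick illustration of the idea (separate from the modeling pipeline used later, which builds its own vocabulary), scikit-learn's CountVectorizer produces exactly this kind of count vector; the toy reviews below are made up for the example.
# Minimal BoW sketch on a made-up toy corpus (illustration only)
from sklearn.feature_extraction.text import CountVectorizer

toy_reviews = ['great food great service', 'food was not good']
bow = CountVectorizer()
bow_vectors = bow.fit_transform(toy_reviews)

print(bow.get_feature_names_out())  # the learned vocabulary (scikit-learn >= 1.0)
print(bow_vectors.toarray())        # one count vector per review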
One limitation of using raw word frequency for the score is that highly frequent words generate large scores even when they do not contain insightful information compared to less common words, which can negatively affect modeling. Therefore, we can rescale the frequency with which words appear so that the scores for the most frequent words are penalized. This was suggested in A Statistical Interpretation of Term Specificity and Its Application in Retrieval. We can generate a score for the frequency of a word in each of the separate reviews, defined as the Term Frequency, and then generate another score for how important the word is across all of the text data, defined as the Inverse Document Frequency. A weighted score can then be generated in which not all words are considered equally important or interesting; this allows words that are distinct (and potentially contain useful information) to be highlighted. TF-IDF is a statistical measure that considers the importance of the words in the text and can be utilized in modeling after text vectorization.
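As a rough sketch of how this weighting behaves (using scikit-learn's default smoothed IDF, not necessarily the exact settings used later), a word such as 'food' that appears in every toy review receives a lower weight than a word that appears in only one review.
# Minimal TF-IDF sketch on the same made-up toy corpus (illustration only)
from sklearn.feature_extraction.text import TfidfVectorizer

toy_reviews = ['great food great service', 'food was not good']
tfidf = TfidfVectorizer()  # default smoothed idf: ln((1 + n) / (1 + df)) + 1
tfidf_vectors = tfidf.fit_transform(toy_reviews)

print(tfidf.get_feature_names_out())
print(tfidf_vectors.toarray().round(3))  # 'food' is down-weighted relative to rarer words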
The BoW
and TF-IDF
notebooks can be found
here
for the Sentiment
set and here for the Stars on Reviews
group.
Sentiment
¶
Let's first set up the environment by importing the nltk
package and downloading stopwords
and wordnet
to utilize for processing the text. Then we can import the necessary packages and setting the random
, numpy
and the torch
seed followed by assigning the device to CUDA
if it is available in order to utilize the GPU
. Then we can examine the PyTorch
, CUDA
and NVIDIA GPU
information and empty the cache for the device.
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
import os
import random
import numpy as np
from tqdm import tqdm, tqdm_notebook
import torch
from subprocess import call
tqdm.pandas()
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
torch.manual_seed(seed_value)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('CUDA and NVIDIA GPU Information')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
print('\n')
print('Torch version: {}'.format(torch.__version__))
print('pyTorch VERSION:', torch.__version__)
print('\n')
print('__CUDNN VERSION:', torch.backends.cudnn.version())
print('__Number CUDA Devices:', torch.cuda.device_count())
print('__Devices')
call(['nvidia-smi', '--format=csv',
'--query-gpu=index,name,driver_version,memory.total,memory.used,memory.free'])
print('Active CUDA Device: GPU', torch.cuda.current_device())
print ('Available devices ', torch.cuda.device_count())
print ('Current cuda device ', torch.cuda.current_device())
torch.cuda.empty_cache()
Now we can read the parquet
file into a pandas.DataFrame
, examine the dimensions and a few observations.
import pandas as pd
df = pd.read_parquet('YelpReviews_NLP_sentimentNegPos.parquet')
print('Number of rows and columns:', df.shape)
df.head()
Let's examine how the original stars_reviews
feature compares with the sentiment
(polarity) feature using pandas.DataFrame.value_counts
.
print(df[['stars_reviews', 'sentiment']].value_counts())
Surprisingly, there are 1 star reviews that are positive and 5 star reviews that are negative. To prepare the target feature for the various classification methods, we need to recode the strings to binary numerical values, so let's define Negative as zero and Positive as one, and the 1 and 2 star stars_reviews as zero and the five star reviews as one. The objective is to predict, based on the text within a review, what contributes to a higher review rating and a Positive review.
Then we can select the reviews with negative sentiment, shuffle and then sample 20,000 reviews. For a balanced set, let's use the same methods for the positive sentiment reviews. Then convert the review to a string
and sentiment
to an integer.
from sklearn.utils import shuffle
df['sentiment'].mask(df['sentiment'] == 'Negative', 0, inplace=True)
df['sentiment'].mask(df['sentiment'] == 'Positive', 1, inplace=True)
df['stars_reviews'].mask(df['stars_reviews'] == 1, 0, inplace=True)
df['stars_reviews'].mask(df['stars_reviews'] == 2, 0, inplace=True)
df['stars_reviews'].mask(df['stars_reviews'] == 5, 1, inplace=True)
df = df[['cleanReview', 'sentiment']]
df1 = df[df.sentiment==0]
df1 = shuffle(df1)
df1 = df1.sample(n=20000)
df2 = df[df.sentiment==1]
df2 = shuffle(df2)
df2 = df2.sample(n=20000)
df = pd.concat([df1, df2])
df = shuffle(df)
del df1, df2
df[['cleanReview']] = df[['cleanReview']].astype('str')
df['sentiment'] = df['sentiment'].astype('int')
print('Number of records:', len(df), '\n')
print('Number of positive reviews:', len(df[df.sentiment==1]))
print('Number of negative reviews:', len(df[df.sentiment==0]))
Preprocess Text
¶
The reviews were previously processed by removing non-words, converting to lowercase, removing stop words and lemmatizing. The reviews need to be re-tokenized, so let's use wordpunct_tokenize for this task. We can also replace rare words, those that occur only at a low frequency in the tokenized text, with an <UNK> token. Then we can construct the BoW vector
and TF-IDF vector
which will be used when modeling.
from nltk.tokenize import wordpunct_tokenize
def tokenize(text):
tokens = wordpunct_tokenize(text)
return tokens
def remove_rare_words(tokens, common_tokens, max_len):
return [token if token in common_tokens else '<UNK>' for token in tokens][-max_len:]
def build_bow_vector(sequence, idx2token):
vector = [0] * len(idx2token)
for token_idx in sequence:
if token_idx not in idx2token:
raise ValueError('Wrong sequence index found!')
else:
vector[token_idx] += 1
return vector
To further preprocess the text data for modeling, let's define some parameters for the maximum length of the review, maximum vocab size, and batch size for loading the data.
Now we can define a class YelpReviewsDataset
which will process the data from raw string into two vectors, Bag of Words
and Term Frequency Inverse Document Frequency
, to be used for modeling. We can set the data to process as df
, tokenize the text string and build the most common tokens bound by the defined maximum vocabulary size. Then replace the rare words with UNK
and remove the sequences with only UNK
. Next, build the vocabulary and convert the tokens to indexes. Now we can build the BoW
vector and build the TF-IDF
vector using the TfidfVectorizer
with the input in list
format. This then returns feature vectors and target.
from functools import partial
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from torch.utils.data import Dataset
MAX_LEN = 300
MAX_VOCAB = 10000
BATCH_SIZE = 1
class YelpReviewsDataset(Dataset):
def __init__(self, df, max_vocab=10000, max_len=300):
"""
Preprocess the text data for modeling.
Parameters
----------
df : Data to process.
max_vocab : Maximum size of vocabulary.
        max_len : Maximum length of a tokenized review.
"""
df = df
df['tokens'] = df.cleanReview.apply(partial(tokenize))
all_tokens = [token for doc in list(df.tokens) for token in doc]
common_tokens = set(
list(zip(*Counter(all_tokens).most_common(max_vocab)))[0])
df.loc[:, 'tokens'] = df.tokens.progress_apply(
partial(remove_rare_words,
common_tokens=common_tokens,
max_len=max_len))
df = df[df.tokens.progress_apply(
lambda tokens: any(token != '<UNK>' for token in tokens))]
vocab = sorted(set(token for doc in list(df.tokens) for token in doc))
self.token2idx = {token: idx for idx, token in enumerate(vocab)}
self.idx2token = {idx: token for token, idx in self.token2idx.items()}
df['indexed_tokens'] = df.tokens.progress_apply(
lambda doc: [self.token2idx[token] for token in doc])
df['bow_vector'] = df.indexed_tokens.progress_apply(
build_bow_vector, args=(self.idx2token,))
vectorizer = TfidfVectorizer(
analyzer='word',
tokenizer=lambda doc: doc,
preprocessor=lambda doc: doc,
token_pattern=None)
vectors = vectorizer.fit_transform(df.tokens).toarray()
df['tfidf_vector'] = [vector.tolist() for vector in vectors]
self.text = df.cleanReview.tolist()
self.sequences = df.indexed_tokens.tolist()
self.bow_vector = df.bow_vector.tolist()
self.tfidf_vector = df.tfidf_vector.tolist()
self.targets = df.sentiment.tolist()
def __getitem__(self, i):
return (self.sequences[i],
self.bow_vector[i],
self.tfidf_vector[i],
self.targets[i],
self.text[i])
def __len__(self):
return len(self.targets)
Now, we can load the data using the described class with the specified MAX_VOCAB = 10000
and MAX_LEN = 300
.
dataset = YelpReviewsDataset(df, max_vocab=MAX_VOCAB, max_len=MAX_LEN)
del df
Let's then examine a random sample of the processed set in regard to the size and some example text data.
print('Number of records:', len(dataset), '\n')
random_idx = random.randint(0,len(dataset)-1)
print('index:', random_idx, '\n')
sample_seq, bow_vector, tfidf_vector, sample_target, sample_text=dataset[random_idx]
print(sample_text, '\n')
print(sample_seq, '\n')
print('BoW vector size:', len(bow_vector), '\n')
print('TF-IDF vector size:', len(tfidf_vector), '\n')
print('Sentiment:', sample_target, '\n')
Now we can split the data into training, validation, and test sets and determine the size of each set.
from torch.utils.data.dataset import random_split
def split_train_valid_test(corpus, valid_ratio=0.1, test_ratio=0.1):
test_length = int(len(corpus) * test_ratio)
valid_length = int(len(corpus) * valid_ratio)
train_length = len(corpus) - valid_length - test_length
return random_split(
corpus, lengths=[train_length, valid_length, test_length])
train_dataset, valid_dataset, test_dataset = split_train_valid_test(
dataset, valid_ratio=0.1, test_ratio=0.1)
len(train_dataset), len(valid_dataset), len(test_dataset)
Now, let's use the DataLoader
to load each of the sets by the specified BATCH_SIZE
.
from torch.utils.data import Dataset, DataLoader
def collate(batch):
seq = [item[0] for item in batch]
bow = [item[1] for item in batch]
tfidf = [item[2] for item in batch]
target = torch.LongTensor([item[3] for item in batch])
text = [item[4] for item in batch]
return seq, bow, tfidf, target, text
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE,
collate_fn=collate)
valid_loader = DataLoader(valid_dataset, batch_size=BATCH_SIZE,
collate_fn=collate)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
collate_fn=collate)
print('Number of training batches:', len(train_loader), '\n')
batch_idx = random.randint(0, len(train_loader)-1)
example_idx = random.randint(0, BATCH_SIZE-1)
for i, fields in enumerate(train_loader):
seq, bow, tfidf, target, text = fields
if i == batch_idx:
print('Training input sequence:', seq[example_idx], '\n')
print('BoW vector size:', len(bow[example_idx]), '\n')
print('TF-IDF vector size:', len(tfidf[example_idx]), '\n')
print('Label: ', target[example_idx], '\n')
print('Review text:', text[example_idx], '\n')
Build BoW Model
¶
Now, we can build a class FeedfowardTextClassifier
which will initialize the model with the defined architecture: a feed-forward fully connected network with two hidden layers, the BoW vector as the input and a vector of size two as the output, containing the probability of the input string being classified as positive or negative.
import torch.nn as nn
import torch.nn.functional as F
class FeedfowardTextClassifier(nn.Module):
def __init__(self, device, batch_size, vocab_size, hidden1, hidden2,
num_labels):
"""
Initialize the model by setting up the layers.
Parameters
----------
device: Cuda device or CPU.
batch_size: Batch size of dataloader.
vocab_size: The vocabulary size.
hidden1: Size of first hidden layer.
hidden2: Size of second hidden layer.
num_labels: Number of labels in target.
"""
super(FeedfowardTextClassifier, self).__init__()
self.device = device
self.batch_size = batch_size
self.fc1 = nn.Linear(vocab_size, hidden1)
self.fc2 = nn.Linear(hidden1, hidden2)
self.fc3 = nn.Linear(hidden2, num_labels)
def forward(self, x):
"""
Perform a forward pass of model and returns value between 0 and 1.
"""
batch_size = len(x)
if batch_size != self.batch_size:
self.batch_size = batch_size
x = torch.FloatTensor(x)
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
return torch.sigmoid(self.fc3(x))
Then we can define the size of the hidden layers and examine the model architecture.
HIDDEN1 = 100
HIDDEN2 = 50
bow_model = FeedfowardTextClassifier(
vocab_size=len(dataset.token2idx),
hidden1=HIDDEN1,
hidden2=HIDDEN2,
num_labels=2,
device=device,
batch_size=BATCH_SIZE)
bow_model
Train BoW Model
¶
First, let's define the initial learning rate, the loss function as CrossEntropyLoss
, the gradient descent optimizer as Adam
and CosineAnnealingLR
for the scheduler.
Then, we can define the train_epoch
function with the input containing the model
as bow_model
, optimizer
, train_loader
, input_type
as bow
. This will train the model on the data from the train_loader: reset the gradients, perform a forward pass through the data and compute the loss, then perform a backward pass and call step on the optimizer and scheduler so that gradient descent moves in the correct direction. After this, the metrics can be recorded. Another function, validate_epoch, can be used for the validation set; it performs a forward pass, evaluates the model and records the model metrics.
from torch import optim
from torch.optim.lr_scheduler import CosineAnnealingLR
LEARNING_RATE = 6e-5
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(
filter(lambda p: p.requires_grad, bow_model.parameters()),
lr=LEARNING_RATE)
scheduler = CosineAnnealingLR(optimizer, 1)
def train_epoch(model, optimizer, train_loader, input_type='bow'):
model.train()
total_loss, total = 0, 0
for seq, bow, tfidf, target, text in train_loader:
if input_type == 'bow':
inputs = bow
if input_type == 'tfidf':
inputs = tfidf
optimizer.zero_grad()
output = model(inputs)
loss = criterion(output, target)
loss.backward()
optimizer.step()
scheduler.step()
total_loss += loss.item()
total += len(target)
return total_loss / total
def validate_epoch(model, valid_loader, input_type='bow'):
model.eval()
total_loss, total = 0, 0
with torch.no_grad():
for seq, bow, tfidf, target, text in valid_loader:
if input_type == 'bow':
inputs = bow
if input_type == 'tfidf':
inputs = tfidf
output = model(inputs)
loss = criterion(output, target)
total_loss += loss.item()
total += len(target)
return total_loss / total
We need to specify the directory where the model checkpoints are saved, because depending on the context more training might be needed or, if the model is adequate, inference can occur from the saved weights. Let's now start training the model with early stopping defined as the validation loss being no lower than the previous three validation losses. For each epoch, we can append the loss for both the train and validation sets to monitor them.
EPOCH = 1
OUT_DIR = './Models/40k_batch1_lr6e_5_10kRW_baseline.pt'
LOSS = 0.4
n_epochs = 0
train_losses, valid_losses = [], []
while True:
train_loss = train_epoch(bow_model, optimizer, train_loader,
input_type='bow')
valid_loss = validate_epoch(bow_model, valid_loader, input_type='bow')
tqdm.write(
f'epoch #{n_epochs + 1:3d}\ttrain_loss: {train_loss:.2e}\tvalid_loss: {valid_loss:.2e}\n',
)
if len(valid_losses) > 2 and all(valid_loss >= loss for loss in valid_losses[-3:]):
print('Stopping early')
break
train_losses.append(train_loss)
valid_losses.append(valid_loss)
n_epochs += 1
BoW Performance - Model Loss and Metrics
¶
After 15 epochs, the model stopped training due to the specified criteria for early stopping. The loss output should have been printed with more decimal places, because at this granularity, at least in a notebook, it does not reveal much.
Now, we can plot the model loss for both the training and the validation sets.
import matplotlib.pyplot as plt
epoch_ticks = range(1, n_epochs + 1)
plt.plot(epoch_ticks, train_losses)
plt.plot(epoch_ticks, valid_losses)
plt.legend(['Train Loss', 'Valid Loss'])
plt.title('Model Losses')
plt.xlabel('Number of Epochs')
plt.ylabel('Loss')
plt.show()
The model seems to be an adequate fit using two hidden layers given the proximity of the train and validation losses. Next, we can evaluate the performance of the BoW
model using the classification_report
and confusion_matrix
from sklearn.metrics
.
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix
bow_model.eval()
test_accuracy, n_examples = 0, 0
y_true, y_pred = [], []
input_type = 'bow'
with torch.no_grad():
for seq, bow, tfidf, target, text in test_loader:
inputs = bow
probs = bow_model(inputs)
        if input_type == 'tfidf':
            inputs = tfidf
            probs = tfidf_model(inputs)
        probs = probs.detach().cpu().numpy()
        predictions = np.argmax(probs, axis=1)
        target = target.cpu().numpy()
        y_true.extend(target)
        y_pred.extend(predictions)
print('Classification Report:')
print(classification_report(y_true, y_pred))
f, (ax) = plt.subplots(1,1)
f.suptitle('Test Set: Confusion Matrix', fontsize=20)
cm = confusion_matrix(y_true, y_pred, labels=[1,0])
ax = plt.subplot()
sns.heatmap(cm, annot=True, ax=ax, cmap='Blues', fmt='d')
ax.set_xlabel('Predicted Sentiment', fontsize=17)
ax.set_ylabel('Actual Sentiment', fontsize=17)
ax.xaxis.set_ticklabels(['Positive', 'Negative'], fontsize=13)
ax.yaxis.set_ticklabels(['Positive', 'Negative'], fontsize=13)
Overall, this model performed quite well for accuracy
, precision
, recall
and the f1_score
. Now we can examine a few of the observations and compare the predicted vs. actual sentiment
.
from IPython.core.display import display, HTML
flatten = lambda x: [sublst for lst in x for sublst in lst]
seq_lst, bow_lst, tfidf_lst, target_lst, text_lst = zip(*test_loader)
seq_lst, bow_lst, tfidf_lst, target_lst, text_lst = map(flatten, [seq_lst,
bow_lst,
tfidf_lst,
target_lst,
text_lst])
test_examples = list(zip(seq_lst, bow_lst, tfidf_lst, target_lst, text_lst))
def print_random_prediction(model, n=5, input_type='bow'):
to_emoji = lambda x: '😄' if x else '😡'
model.eval()
rows = []
for i in range(n):
with torch.no_grad():
            seq, bow, tfidf, target, text = random.choice(test_examples)
            target = target.item()
            inputs = bow
            if input_type == 'tfidf':
                inputs = tfidf
            probs = model([inputs])
probs = probs.detach().cpu().numpy()
prediction = np.argmax(probs, axis=1)[0]
predicted = to_emoji(prediction)
actual = to_emoji(target)
row = f'''
<tr>
<td>{i+1} </td>
<td>{text} </td>
<td>{predicted} </td>
<td>{actual} </td>
</tr>
'''
rows.append(row)
rows_joined = '\n'.join(rows)
table = f'''
<table>
<tbody>
<tr>
<td><b>Number</b> </td>
<td><b>Review</b> </td>
<td><b>Predicted</b> </td>
<td><b>Actual</b> </td>
</tr>{rows_joined}
</tbody>
</table>
'''
display(HTML(table))
print_random_prediction(bow_model, n=5, input_type='bow')
Stars on Reviews
¶
Train BoW Model
¶
Using the same methods for text preprocessing, train/validation/test set generation, learning rate, loss function, gradient descent optimizer, scheduler and train/validation training set up that was utilized for the Sentiment
set, we can train the BoW
model for the stars_reviews
set.
EPOCH = 1
OUT_DIR = './Models/40k_batch1_lr6e_5_10kRW_baseline.pt'
LOSS = 0.4
n_epochs = 0
train_losses, valid_losses = [], []
while True:
train_loss = train_epoch(bow_model, optimizer, train_loader,
input_type='bow')
valid_loss = validate_epoch(bow_model, valid_loader, input_type='bow')
tqdm.write(
f'epoch #{n_epochs + 1:3d}\ttrain_loss: {train_loss:.2e}\tvalid_loss: {valid_loss:.2e}\n',
)
if len(valid_losses) > 2 and all(valid_loss >= loss for loss in valid_losses[-3:]):
print('Stopping early')
break
train_losses.append(train_loss)
valid_losses.append(valid_loss)
n_epochs += 1
BoW Performance - Model Loss and Metrics
¶
After 7 epochs, the model stopped training due to the specified criteria for early stopping. This is fewer than the 15 epochs for which the model trained using the Sentiment
set. This model had a higher train_loss
and valid_loss
compared to the Sentiment
set where the train_loss=3.13e-01
and valid_loss=3.16e-01
. Even at the 7th epoch for the Sentiment
set, the train and validation losses were less (train_loss: 3.14e-01
, valid_loss: 3.17e-01
).
Now, we can plot the model loss for both the training and the validation sets.
epoch_ticks = range(1, n_epochs + 1)
plt.plot(epoch_ticks, train_losses)
plt.plot(epoch_ticks, valid_losses)
plt.legend(['Train Loss', 'Valid Loss'])
plt.title('Model Losses')
plt.xlabel('Number of Epochs')
plt.ylabel('Loss')
plt.show()
This model was not as good a fit as the Sentiment BoW model, given the greater distance between the train and validation losses. Next, we can evaluate the performance of the BoW
model using the classification_report
and confusion_matrix
from sklearn.metrics
like what was utilized for the Sentiment
model.
bow_model.eval()
test_accuracy, n_examples = 0, 0
y_true, y_pred = [], []
input_type = 'bow'
with torch.no_grad():
for seq, bow, tfidf, target, text in test_loader:
inputs = bow
probs = bow_model(inputs)
        if input_type == 'tfidf':
            inputs = tfidf
            probs = tfidf_model(inputs)
        probs = probs.detach().cpu().numpy()
        predictions = np.argmax(probs, axis=1)
        target = target.cpu().numpy()
        y_true.extend(target)
        y_pred.extend(predictions)
print('Classification Report:')
print(classification_report(y_true, y_pred))
f, (ax) = plt.subplots(1,1)
f.suptitle('Test Set: Confusion Matrix', fontsize=20)
cm = confusion_matrix(y_true, y_pred, labels=[1,0])
ax = plt.subplot()
sns.heatmap(cm, annot=True, ax=ax, cmap='Blues', fmt='d')
ax.set_xlabel('Predicted Review Stars', fontsize=17)
ax.set_ylabel('Actual Review Stars', fontsize=17)
ax.xaxis.set_ticklabels(['1/2', '5'], fontsize=13)
ax.yaxis.set_ticklabels(['1/2', '5'], fontsize=17)
Although this model performed quite well for accuracy
, precision
, recall
and the f1_score
, all metrics are lower compared to the Sentiment
model. Next, we can examine a few of the observations and compare the predicted vs. actual stars_reviews
.
print_random_prediction(bow_model, n=5, input_type='bow')
When examining the predicted vs. actual for 5 reviews, only 4/5 were a correct match, whereas 5/5 was achieved for the Sentiment
model.
Term Frequency-Inverse Document Frequency (TF-IDF)
¶
Sentiment
¶
Using the same text pre-processing utilized for the BoW
model and the same train/validation/test sets, we can build a simple feed-forward neural net classifier. The hidden layer sizes have already been defined, so we can initialize the TF-IDF
model and examine it for comparison to the BoW
model.
tfidf_model = FeedfowardTextClassifier(
vocab_size=len(dataset.token2idx),
hidden1=HIDDEN1,
hidden2=HIDDEN2,
num_labels=2,
device=device,
batch_size=BATCH_SIZE,
)
tfidf_model
Train TF-IDF Model
¶
We need to first specify the directory where the model checkpoints are saved. Let's now start training the model with the TF-IDF vectors as input and early stopping defined as the validation loss being no lower than the previous three validation losses. For each epoch, we can append the loss for both the train and validation sets to monitor them.
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(
filter(lambda p: p.requires_grad, tfidf_model.parameters()),
lr=LEARNING_RATE)
scheduler = CosineAnnealingLR(optimizer, 1)
EPOCH = 1
OUT_DIR = './Models/40k_batch1_lr6e_5_10kRW_baseline.pt'
LOSS = 0.4
n_epochs = 0
train_losses, valid_losses = [], []
while True:
train_loss = train_epoch(tfidf_model, optimizer, train_loader,
input_type='tfidf')
valid_loss = validate_epoch(tfidf_model, valid_loader, input_type='tfidf')
tqdm.write(
f'epoch #{n_epochs + 1:3d}\ttrain_loss: {train_loss:.2e}\tvalid_loss: {valid_loss:.2e}\n',
)
if len(valid_losses) > 2 and all(valid_loss >= loss for loss in valid_losses[-3:]):
print('Stopping early')
break
train_losses.append(train_loss)
valid_losses.append(valid_loss)
n_epochs += 1
TF-IDF Performance - Model Loss and Metrics
¶
After 19 epochs, the model stopped training due to the specified criteria for early stopping. Now, we can plot the model loss for both the training and the validation sets.
epoch_ticks = range(1, n_epochs + 1)
plt.plot(epoch_ticks, train_losses)
plt.plot(epoch_ticks, valid_losses)
plt.legend(['Train Loss', 'Valid Loss'])
plt.title('Model Losses')
plt.xlabel('Number of Epochs')
plt.ylabel('Loss')
plt.xticks(epoch_ticks,color='w')
plt.show()
The TF-IDF
model utilizing the Sentiment
set also seems to be an adequate fit using two hidden layers given the proximity of the train and validation losses. Let's now evaluate the performance of the TF-IDF
model with the test set using the classification_report
and confusion_matrix
from sklearn.metrics
.
tfidf_model.eval()
test_accuracy, n_examples = 0, 0
y_true, y_pred = [], []
with torch.no_grad():
for seq, bow, tfidf, target, text in test_loader:
inputs = tfidf
probs = tfidf_model(inputs)
probs = probs.detach().cpu().numpy()
predictions = np.argmax(probs, axis=1)
target = target.cpu().numpy()
        y_true.extend(target)
        y_pred.extend(predictions)
print('Classification Report:')
print(classification_report(y_true, y_pred))
f, (ax) = plt.subplots(1,1)
f.suptitle('Test Set: Confusion Matrix', fontsize=20)
cm = confusion_matrix(y_true, y_pred, labels=[1,0])
ax = plt.subplot()
sns.heatmap(cm, annot=True, ax=ax, cmap='Blues', fmt='d')
ax.set_xlabel('Predicted Sentiment', fontsize=17)
ax.set_ylabel('Actual Sentiment', fontsize=17)
ax.xaxis.set_ticklabels(['Positive', 'Negative'], fontsize=13)
ax.yaxis.set_ticklabels(['Positive', 'Negative'], fontsize=13)
Once again, this model performed quite well for accuracy
, precision
, recall
and the f1_score
. Now we can examine a few of the observations and compare the predicted vs. actual sentiment
.
print_random_prediction(tfidf_model, n=5, input_type='tfidf')
There was a 5/5 match between the predicted and actual sentiment for this model.
Stars on Reviews
¶
Let's now initialize the TF-IDF
model for the stars_reviews
set, then define the path where the model will be saved and train the TF-IDF
model with the TF-IDF
vectors as the input.
EPOCH = 1
OUT_DIR = './Models/40k_batch1_lr6e_5_10kRW_baseline.pt'
LOSS = 0.4
n_epochs = 0
train_losses, valid_losses = [], []
while True:
train_loss = train_epoch(tfidf_model, optimizer, train_loader,
input_type='tfidf')
valid_loss = validate_epoch(tfidf_model, valid_loader, input_type='tfidf')
tqdm.write(
f'epoch #{n_epochs + 1:3d}\ttrain_loss: {train_loss:.2e}\tvalid_loss: {valid_loss:.2e}\n',
)
if len(valid_losses) > 2 and all(valid_loss >= loss for loss in valid_losses[-3:]):
print('Stopping early')
break
train_losses.append(train_loss)
valid_losses.append(valid_loss)
n_epochs += 1
TF-IDF Performance - Model Loss and Metrics
¶
After 6 epochs, the model stopped training due to the specified criteria for early stopping. This is fewer than the 19 epochs for which the model trained using the Sentiment
set. This model had a higher train_loss
and valid_loss
compared to the Sentiment
set where the train_loss=3.13e-01
and valid_loss=3.17e-01
. Even at the 6th epoch for the Sentiment
set, the train and validation losses were less (train_loss: 3.14e-01
, valid_loss: 3.18e-01
).
Now, we can plot the model loss for both the training and the validation sets.
epoch_ticks = range(1, n_epochs + 1)
plt.plot(epoch_ticks, train_losses)
plt.plot(epoch_ticks, valid_losses)
plt.legend(['Train Loss', 'Valid Loss'])
plt.title('Model Losses')
plt.xlabel('Number of Epochs')
plt.ylabel('Loss')
plt.show()
Once again, this model was not as good a fit as the Sentiment TF-IDF model, given the greater distance between the train and validation losses. Next, we can evaluate the performance of the TF-IDF
model using the classification_report
and confusion_matrix
from sklearn.metrics
like what was utilized for the Sentiment
model.
tfidf_model.eval()
test_accuracy, n_examples = 0, 0
y_true, y_pred = [], []
with torch.no_grad():
for seq, bow, tfidf, target, text in test_loader:
inputs = tfidf
probs = tfidf_model(inputs)
probs = probs.detach().cpu().numpy()
predictions = np.argmax(probs, axis=1)
target = target.cpu().numpy()
        y_true.extend(target)
        y_pred.extend(predictions)
print('Classification Report:')
print(classification_report(y_true, y_pred))
f, (ax) = plt.subplots(1,1)
f.suptitle('Test Set: Confusion Matrix', fontsize=20)
cm = confusion_matrix(y_true, y_pred, labels=[1,0])
ax = plt.subplot()
sns.heatmap(cm, annot=True, ax=ax, cmap='Blues', fmt='d')
ax.set_xlabel('Predicted Review Stars', fontsize=17)
ax.set_ylabel('Actual Review Stars', fontsize=17)
ax.xaxis.set_ticklabels(['1/2', '5'], fontsize=13)
ax.yaxis.set_ticklabels(['1/2', '5'], fontsize=17)
The model metrics using both the BoW
and TF-IDF
showed better performance for the Sentiment
model compared to the Stars on Reviews
model. Next, we can examine a few of the observations and compare the predicted vs. actual stars_reviews
.
print_random_prediction(tfidf_model, n=5, input_type='tfidf')
There was a 5/5 match between the predicted and actual values for this model, but the model metrics are still lower for this model compared to the Sentiment
TF-IDF
model.
Bidirectional Long Short Term Memory Networks (LSTM)
¶
Bidirectional LSTMs are an extension of traditional LSTMs that can improve model performance on sequence classification problems by allowing the input data to leverage information from both the past and the future. The LSTM
is trained on the input sequence simultaneously in both the positive and negative time directions.
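As a minimal sketch of this behavior (independent of the full architecture defined later, which adds an embedding layer, dropout and pooling, and assuming TensorFlow/Keras as used in this section), wrapping a Keras LSTM in a Bidirectional layer runs one pass forward and one backward over the sequence and concatenates the two outputs, doubling the feature dimension.
# Minimal sketch: the Bidirectional wrapper concatenates the forward and
# backward LSTM outputs, so 50 units become 100 output features
import numpy as np
import tensorflow as tf

x = np.random.rand(2, 10, 8).astype('float32')  # (batch, timesteps, features)
bilstm = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(50, return_sequences=True))
print(bilstm(x).shape)  # (2, 10, 100)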
The notebooks can be found
here for Sentiment
and here for Stars on Reviews
.
Sentiment
¶
Let's first set up the environment by importing the necessary packages and examining the CUDA
and NVIDIA GPU
information as well as the Tensorflow
and Keras
versions for the runtime.
import os
import random
import numpy as np
import warnings
import tensorflow as tf
warnings.filterwarnings('ignore')
print('CUDA and NVIDIA GPU Information')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
print('\n')
print('TensorFlow version: {}'.format(tf.__version__))
print('Eager execution is: {}'.format(tf.executing_eagerly()))
print('Keras version: {}'.format(tf.keras.__version__))
print('Num GPUs Available: ', len(tf.config.list_physical_devices('GPU')))
To set the seed for reproducibility, we can use a function init_seeds
that defines the random
, numpy
and tensorflow
seed as well as the environment and session.
def init_seeds(seed=42):
random.seed(seed)
np.random.seed(seed)
tf.random.set_seed(seed)
    session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1,
                                            inter_op_parallelism_threads=1)
os.environ['TF_CUDNN_DETERMINISTIC'] = 'True'
os.environ['TF_DETERMINISTIC_OPS'] = 'True'
sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(),
config=session_conf)
tf.compat.v1.keras.backend.set_session(sess)
return sess
init_seeds(seed=42)
As with the previous methods, let's recode the target to numerical (0,1), convert the data types, shuffle the data and then set up the label and the features. Then the data can be partitioned for the train/test sets using test_size=0.1
, which stratifies the target (stratify=label
).
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
df['sentiment'].mask(df['sentiment'] == 'Negative', 0, inplace=True)
df['sentiment'].mask(df['sentiment'] == 'Positive', 1, inplace=True)
df['stars_reviews'].mask(df['stars_reviews'] == 1, 0, inplace=True)
df['stars_reviews'].mask(df['stars_reviews'] == 2, 0, inplace=True)
df['stars_reviews'].mask(df['stars_reviews'] == 5, 1, inplace=True)
df['sentiment'] = df['sentiment'].astype('int')
df['stars_reviews'] = df['stars_reviews'].astype('int')
df1 = df.drop(['stars_reviews'], axis=1)
df1 = shuffle(df1)
label = df1[['sentiment']]
features = df1.cleanReview
X_train, X_test, y_train, y_test = train_test_split(features, label,
test_size=0.1,
stratify=label,
random_state=42)
Length = 500 & Batch Size = 256
We can now prepare the data by using the KerasTokenizer
class with a specified number of words and the maximum length of the text to be tokenized. Then the tokenized text can be padded if its length is less than the maximum defined length, and then vectorized; this can be applied to both the train and the test sets. We will create a vocabulary containing num_words=100000
with a maxlen=500
.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import text, sequence
from sklearn.pipeline import Pipeline
class KerasTokenizer(object):
"""
Fit and convert text to sequences for use in a Keras model.
num_words = max number of words
maxlen = max length of sequences
"""
def __init__(self, num_words=100000, maxlen=500):
self.tokenizer = text.Tokenizer(num_words=num_words)
self.maxlen = maxlen
def fit(self, X, y):
self.tokenizer.fit_on_texts(X)
return self
def transform(self, X):
return sequence.pad_sequences(self.tokenizer.texts_to_sequences(X),
maxlen=self.maxlen)
km = Pipeline([('Keras Tokenizer', KerasTokenizer(num_words=100000,
maxlen=500))])
X_trainT = km.fit_transform(X_train)
# Transform the test set with the tokenizer already fit on the training data
X_testT = km.transform(X_test)
Now, we can set up where the results will be saved as well as the callbacks with EarlyStopping
monitoring the val_loss
and stopping training if the val_loss
does not improve after 5 epochs. This can also use a ModelCheckpoint with the specified filepath
to save only the highest val_accuracy
.
import datetime
from tensorflow.keras.callbacks import TensorBoard, EarlyStopping, ModelCheckpoint
%cd /content/drive/MyDrive/Yelp_Reviews/Models/DL/LSTM/SentimentPolarity/Models/
!rm -rf /logs/
%load_ext tensorboard
log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
filepath = 'Polarity_LSTM_weights_only_len500_b256.h5'
checkpoint_dir = os.path.dirname(filepath)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
histogram_freq=1)
callbacks_list = [EarlyStopping(monitor='val_loss', patience=5),
ModelCheckpoint(filepath, monitor='val_accuracy',
save_best_only=True, mode='max'),
tensorboard_callback]
Model Architecture
¶
Let's now define the model structure with an embedding_size=300
where the input is the maximum length of the review string (500 words) with 100,000 tokens as the defined maximum vocabulary size. This is followed by 10% Dropout, which is fed to a Bidirectional LSTM with 50 units, 10% dropout and 0% recurrent_dropout. Then a GlobalMaxPool1D layer is followed by a Dense layer with 50 relu units, another 10% Dropout, and a final Dense layer containing a sigmoid activation function. The model can then be compiled with binary_crossentropy as the loss, adam as the optimizer with the default learning rate, and accuracy as the metric.
from tensorflow.keras.layers import Input, Embedding, Dropout, Bidirectional
from tensorflow.keras.layers import LSTM, GlobalMaxPool1D, Dense, Activation
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
embedding_size = 300
input_ = Input(shape=(500,))
x = Embedding(100000, embedding_size)(input_)
x = Dropout(0.1)(x)
x = Bidirectional(LSTM(50, return_sequences=True, dropout=0.1,
recurrent_dropout=0.0))(x)
x = GlobalMaxPool1D()(x)
x = Dense(50, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(1, activation='sigmoid')(x)
model = Model(inputs=input_, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam',
metrics=['accuracy'])
model.summary()
The model can now be trained for 5 epochs using BATCH_SIZE=256
with the defined callbacks_list
and then saved.
history = model.fit(X_trainT, y_train, validation_data=(X_testT, y_test),
epochs=5, batch_size=256, callbacks=callbacks_list)
model.save('./Polarity_LSTM_len500_batch256_tf.h5', save_format='tf')
# Load model for more training or later use
#filepath = 'Polarity_LSTM_weights_only_len500_b256.h5'
#model = tf.keras.models.load_model('./Polarity_LSTM_len500_batch256_tf.h5')
#model.load(weights)
LSTM Model Performance
¶
Now, we can evaluate the trained model for accuracy using the test set. Then plot the model loss and accuracy over the epochs.
acc = model.evaluate(X_testT, y_test)
print('Test set\n Loss: {:0.3f}\n Accuracy: {:0.3f}'.format(acc[0], acc[1]))
The accuracy
is 94.8%, which is decent, but perhaps training for more epochs might increase it
.
import matplotlib.pyplot as plt
plt.title('Model Loss')
plt.plot(history.history['loss'], label='Train')
plt.plot(history.history['val_loss'], label='Test')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()
The model appears to be overfitting because the training loss is quite constant while the validation loss is increasing when it should be decreasing.
plt.title('Model Accuracy')
plt.plot(history.history['accuracy'], label='Train')
plt.plot(history.history['val_accuracy'], label='Test')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.show()
Let's try training for more epochs, so we need to first load the model.
model = tf.keras.models.load_model('./Polarity_LSTM_len500_batch256_tf.h5')
model.summary()
Let's define a new filepath
and reuse the same callbacks_list
to continue training the model and then save it.
filepath = 'LSTM_weights_only_len500_b256_2.h5'
checkpoint_dir = os.path.dirname(filepath)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
histogram_freq=1)
callbacks_list = [EarlyStopping(monitor='val_loss', patience=5),
ModelCheckpoint(filepath, monitor='val_accuracy',
save_best_only=True, mode='max'),
tensorboard_callback]
history = model.fit(X_trainT, y_train, validation_data=(X_testT, y_test),
epochs=7, batch_size=256, callbacks=callbacks_list)
model.save('./Polarity_LSTM_len500_batch256_2_tf.h5', save_format='tf')
# Load model for more training or later use
#model = tf.keras.models.load_model('./Polarity_LSTM_len500_batch256_2_tf.h5')
LSTM Model Performance
¶
Now, we can reevaluate the trained model for accuracy using the test set.
acc = model.evaluate(X_testT, y_test)
print('Test set\n Loss: {:0.3f}\n Accuracy: {:0.3f}'.format(acc[0],acc[1]))
The accuracy
actually decreased by training for more epochs for this model. Let's now plot the model loss
and accuracy
over the epochs as we did before.
plt.title('Model Loss')
plt.plot(history.history['loss'], label='Train')
plt.plot(history.history['val_loss'], label='Test')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()
plt.title('Model Accuracy')
plt.plot(history.history['accuracy'], label='Train')
plt.plot(history.history['val_accuracy'], label='Test')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.show()
Let's now use the trained model to predict on the test set and save as scores
. These can then be passed to a function decode_sentiment that outputs a binary sentiment classification: if the predicted value is greater than 0.5 it is positive, otherwise it is negative.
We can also define a function to construct the confusion matrix called plot_confusion_matrix
and generate a classification_report
using sklearn.metrics
.
scores = model.predict(X_testT, verbose=1)
def decode_sentiment(score):
return 1 if score > 0.5 else 0
y_pred_1d = [decode_sentiment(score) for score in scores]
import itertools
from sklearn.metrics import classification_report, confusion_matrix
def plot_confusion_matrix(cm, classes,
title='Confusion matrix',
cmap=plt.cm.Blues):
"""
Print and plot the confusion matrix
"""
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title, fontsize=20)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, fontsize=13)
plt.yticks(tick_marks, classes, fontsize=13)
fmt = '.2f'
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, format(cm[i, j], fmt),
horizontalalignment='center',
color='white' if cm[i, j] > thresh else 'black')
plt.ylabel('True label', fontsize=17)
plt.xlabel('Predicted label', fontsize=17)
print('Classification Report:')
print(classification_report(y_test['sentiment'].tolist(), y_pred_1d))
print('\n')
cnf_matrix = confusion_matrix(y_test['sentiment'].tolist(), y_pred_1d)
plt.figure(figsize=(6,6))
plot_confusion_matrix(cnf_matrix, classes=y_test.sentiment.unique(),
title='Confusion matrix')
plt.show();
Overall, this model performed quite well for accuracy
, precision
, recall
and the f1_score
, but all of the metrics are lower than the ones from the BoW
and TF-IDF
models. Perhaps sampling 20,000 observations per sentiment group might allow for more comparable metrics, or hyperparameter tuning of the embedding_size, the number of neurons within the LSTM and Dense layers, and the learning_rate might improve performance.
Stars on Reviews
¶
As with the previous methods, let's recode the target to numerical (0,1), convert the data types, shuffle the data and then set up the label and the features. Then the data can be partitioned for the train/test sets using test_size=0.1
, which stratifies the target (stratify=label
).
df1 = df.drop(['sentiment'], axis=1)
df1 = shuffle(df1)
label = df1[['stars_reviews']]
features = df1.cleanReview
X_train, X_test, y_train, y_test = train_test_split(features, label,
test_size=0.1,
stratify=label,
random_state=42)
Length = 500 & Batch Size = 256
Let's now prepare the data by using the KerasTokenizer
class with a specified number of words and the maximum length of the text to be tokenized. Then the tokenized text can be padded if its length is less than the maximum defined length, and then vectorized; this can be applied to both the train and the test sets. We will create a vocabulary containing num_words=100000
with a maxlen=500
.
km = Pipeline([('Keras Tokenizer', KerasTokenizer(num_words=100000,
maxlen=500))])
X_trainT = km.fit_transform(X_train)
# Transform the test set with the tokenizer already fit on the training data
X_testT = km.transform(X_test)
%cd /content/drive/MyDrive/Yelp_Reviews/Models/DL/LSTM/ReviewStars/Models/
!rm -rf /logs/
%load_ext tensorboard
log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
filepath = 'LSTM_weights_only_len500_b256_balancedSP.h5'
checkpoint_dir = os.path.dirname(filepath)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
histogram_freq=1)
callbacks_list = [EarlyStopping(monitor='val_loss', patience=3),
ModelCheckpoint(filepath, monitor='val_accuracy',
save_best_only=True, mode='max'),
tensorboard_callback]
The model can now be trained for 10 epochs using BATCH_SIZE=256
with the defined callbacks_list
and then saved.
history = model.fit(X_trainT, y_train, validation_data=(X_testT, y_test),
epochs=10, batch_size=256, callbacks=callbacks_list)
model.save('./LSTM_len500_batch256_balancedSP_tf.h5', save_format='tf')
# Load model for more training or later use
#filepath = 'LSTM_weights_only_len500_b256_balancedSP.h5'
#model = tf.keras.models.load_model('./LSTM_len500_batch256_balancedSP_tf.h5')
#model.load(weights)
LSTM Model Performance
¶
Now, we can evaluate the trained model for accuracy using the test set. Then plot the model loss and accuracy over the epochs.
acc = model.evaluate(X_testT, y_test)
print('Test set\n Loss: {:0.3f}\n Accuracy: {:0.3f}'.format(acc[0], acc[1]))
plt.title('Model Loss')
plt.plot(history.history['loss'], label='Train')
plt.plot(history.history['val_loss'], label='Test')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()
plt.title('Model Accuracy')
plt.plot(history.history['accuracy'], label='Train')
plt.plot(history.history['val_accuracy'], label='Test')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.show()
Let's try training for more epochs, so we need to first load the model and examine the model architecture.
model = tf.keras.models.load_model('./LSTM_len500_batch256_balancedSP_tf.h5')
model.summary()
Let's define a new filepath
and reuse the same callbacks_list
to continue training the model.
filepath = 'LSTM_weights_only_len500_b256_2.h5'
checkpoint_dir = os.path.dirname(filepath)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
histogram_freq=1)
callbacks_list = [EarlyStopping(monitor='val_loss', patience=5),
ModelCheckpoint(filepath, monitor='val_accuracy',
save_best_only=True, mode='max'),
tensorboard_callback]
history = model.fit(X_trainT, y_train, validation_data=(X_testT, y_test),
epochs=7, batch_size=256, callbacks=callbacks_list)
We can save the model and evaluate the accuracy
using the test set.
model.save('./LSTM_len500_batch256_2_tf.h5', save_format='tf')
#model = tf.keras.models.load_model('./LSTM_len500_batch256_2_tf.h5')
acc = model.evaluate(X_testT, y_test)
print('Test set\n Loss: {:0.3f}\n Accuracy: {:0.3f}'.format(acc[0], acc[1]))
Once again, the accuracy
actually decreased by training for more epochs for this model.
LSTM Model Performance
¶
We can then replot the model loss
and accuracy
over the new training epochs.
plt.title('Model Loss')
plt.plot(history.history['loss'], label='Train')
plt.plot(history.history['val_loss'], label='Test')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()
plt.title('Model Accuracy')
plt.plot(history.history['accuracy'], label='Train')
plt.plot(history.history['val_accuracy'], label='Test')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.show()
Let's now predict using the test set, determine the sentiment
score by applying decode_sentiment to each prediction, and examine the classification_report
and confusion_matrix
.
scores = model.predict(X_testT, verbose=1)
y_pred_1d = [decode_sentiment(score) for score in scores]
print('Classification Report:')
print(classification_report(y_test['stars_reviews'].tolist(), y_pred_1d))
print('\n')
cnf_matrix = confusion_matrix(y_test['stars_reviews'].tolist(), y_pred_1d)
plt.figure(figsize=(6,6))
plot_confusion_matrix(cnf_matrix, classes=y_test.stars_reviews.unique(),
title='Confusion matrix')
plt.show()
The model metrics using LSTM
demonstrate better performance for the Sentiment
model compared to the Stars on Reviews
model. However, the BoW
and TF-IDF
models performed better than the LSTM
approach for both sets.
Transfer Learning using Bidirectional Encoder Representations from Transformers (BERT)
¶
Transfer learning is an ML technique that can be leveraged to save the time and compute resources required to train a model from scratch. The model weights can be loaded and fine-tuned using different data than what was used to train it initially. The new weights can then be saved and the model evaluated.
Since we are evaluating different text classification approaches, let's utilize BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. BERT was trained on BooksCorpus (800 million words) and English Wikipedia (2,500 million words), using only the text passages. BERT's model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Attention Is All You Need. Let's utilize the BERT base model, which consists of 12 Transformer encoder layers, 12 attention heads, a hidden size of 768, and 110M parameters. Using unlabeled text, BERT was designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all of the model's layers. We can leverage this and fine-tune on our set by adding an additional output layer.
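These architecture details can be confirmed from the pretrained configuration itself; a quick check using the transformers package installed below:
from transformers import BertConfig

config = BertConfig.from_pretrained('bert-base-uncased')
print('Encoder layers:', config.num_hidden_layers)      # 12
print('Attention heads:', config.num_attention_heads)   # 12
print('Hidden size:', config.hidden_size)               # 768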
The notebooks can be found here for Sentiment and here for Stars on Reviews.
Sentiment
¶
Let's first install the transformers package, then set up the environment by importing the necessary packages, setting the seed for reproducibility, and examining the CUDA, NVIDIA GPU and PyTorch information.
!pip install transformers
import os
import random
import numpy as np
import torch
from subprocess import call
seed_value = 42
torch.manual_seed(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
print('CUDA and NVIDIA GPU Information')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
print('\n')
print('PyTorch version:', torch.__version__)
print('cuDNN version:', torch.backends.cudnn.version())
print('Number of CUDA devices:', torch.cuda.device_count())
print('Devices:')
call(['nvidia-smi', '--format=csv',
      '--query-gpu=index,name,driver_version,memory.total,memory.used,memory.free'])
print('Active CUDA device:', torch.cuda.current_device())
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(device)
Preprocessing
¶
After recoding both sentiment and stars_reviews to binary as previously completed, let's examine the word count of the reviews, since BERT accepts input sequences of at most 512 tokens (approximated here by word count). We can utilize a previously defined lambda function that counts the number of words in the text string to generate review_wordCount. Then pandas.DataFrame.describe can be used to evaluate the summary statistics for this temporary feature.
df['review_wordCount'] = df['cleanReview'].apply(lambda x: len(str(x).split()))
df['review_wordCount'].describe().apply('{0:f}'.format)
We can now subset the reviews containing 512 or fewer words and then drop the temporary review_wordCount variable. Then we filter the negative and positive sentiment reviews into two sets, shuffle each and sample 20,000 observations, and concatenate them into a single balanced set. The reviews can then be cast to strings and the target to an integer. Finally, the data can be partitioned into train, validation and test sets using 80% for training, 10% for validation and 10% for the test set.
from sklearn.utils import shuffle
df = df[df['review_wordCount'] <= 512]
df1 = df.drop(['review_wordCount', 'stars_reviews'], axis=1)
df2 = df1[df1.sentiment==0]
df2 = shuffle(df2)
df2 = df2.sample(n=20000)
df3 = df1[df1.sentiment==1]
df3 = df3.sample(n=20000)
df3 = shuffle(df3)
df1 = pd.concat([df2, df3])
df1 = shuffle(df1)
del df2, df3
df1['sentiment'] = df1['sentiment'].astype('int')
df1['cleanReview'] = df1['cleanReview'].astype(str)
df_train, df_val, df_test = np.split(df1.sample(frac=1, random_state=seed_value),
                                     [int(.8*len(df1)), int(.9*len(df1))])
print(len(df_train), len(df_val), len(df_test))
del df1
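As a quick sanity check (a hypothetical addition, not part of the original notebook), the class balance within each partition could be confirmed before tokenizing:
# Print the sentiment class counts for each partition
for name, split in [('train', df_train), ('validation', df_val), ('test', df_test)]:
    print(name, split['sentiment'].value_counts().to_dict())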
Tokenize
¶
Let's first define the tokenizer to use, BertTokenizer.from_pretrained('bert-base-uncased'), and the components for the label. Then we can build a Dataset class that iterates through the data, tokenizing each text string with a maximum length of 300 tokens and padding the ones that are shorter in length.
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
label = {0, 1}
class Dataset(torch.utils.data.Dataset):
def __init__(self, df):
self.labels = [label for label in df['sentiment']]
self.texts = [tokenizer(cleanReview,
add_special_tokens=True,
padding='max_length',
max_length=300,
return_tensors='pt',
truncation=True) for cleanReview in df['cleanReview']]
def classes(self):
return self.labels
def __len__(self):
return len(self.labels)
def get_batch_labels(self, idx):
return np.array(self.labels[idx])
def get_batch_texts(self, idx):
return self.texts[idx]
def __getitem__(self, idx):
batch_texts = self.get_batch_texts(idx)
batch_y = self.get_batch_labels(idx)
return batch_texts, batch_y
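As a brief illustration (hypothetical usage, not part of the original notebook), indexing a Dataset instance returns the padded token tensors and the label; the extra leading dimension of 1 produced by return_tensors='pt' is why the training loop below calls .squeeze(1) on input_ids:
# Inspect one tokenized item from the training set
train_data = Dataset(df_train)
encoding, y = train_data[0]
print(encoding['input_ids'].shape)       # torch.Size([1, 300])
print(encoding['attention_mask'].shape)  # torch.Size([1, 300])
print(y)                                 # 0 or 1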
Now, we can define another class, BertClassifier, utilizing BertModel.from_pretrained('bert-base-uncased') with a model architecture consisting of an initial 40% Dropout, followed by a Linear layer, a ReLU activation function, an equivalent dropout layer and a final Linear layer. The forward function can be defined with input_id and mask as the inputs.
from torch import nn
from transformers import BertModel
class BertClassifier(nn.Module):
def __init__(self, dropout=0.4):
super(BertClassifier, self).__init__()
self.bert = BertModel.from_pretrained('bert-base-uncased')
self.drop = nn.Dropout(dropout)
self.out1 = nn.Linear(self.bert.config.hidden_size, 128)
self.relu = nn.ReLU()
self.drop1 = nn.Dropout(p=0.4)
self.out = nn.Linear(128, 2)
def forward(self, input_id, mask):
_, pooled_output = self.bert(input_ids= input_id, attention_mask=mask,
return_dict=False)
output = self.drop(pooled_output)
output = self.out1(output)
output = self.relu(output)
output = self.drop1(output)
final_layer = self.out(output)
return final_layer
Next, let's define some functions to save/load the model (save_checkpoint, load_checkpoint) and to save/load the training metrics (save_metrics, load_metrics).
def save_checkpoint(save_path, model, valid_loss):
if save_path == None:
return
state_dict = {'model_state_dict': model.state_dict(),
'valid_loss': valid_loss}
torch.save(state_dict, save_path)
print(f'Model saved to ==> {save_path}')
def load_checkpoint(load_path, model):
if load_path==None:
return
state_dict = torch.load(load_path, map_location=device)
print(f'Model loaded from <== {load_path}')
model.load_state_dict(state_dict['model_state_dict'])
return state_dict['valid_loss']
def save_metrics(save_path, train_loss_list, valid_loss_list,
global_steps_list):
if save_path == None:
return
state_dict = {'train_loss_list': train_loss_list,
'valid_loss_list': valid_loss_list,
'global_steps_list': global_steps_list}
torch.save(state_dict, save_path)
print(f'Model saved to ==> {save_path}')
def load_metrics(load_path):
if load_path==None:
return
state_dict = torch.load(load_path, map_location=device)
print(f'Model loaded from <== {load_path}')
return state_dict['train_loss_list'], state_dict['valid_loss_list'], state_dict['global_steps_list']
Train and Evaluate Using Batch Size = 8
¶
We can define a training function, train, which encompasses the training and validation loops. It first initializes the running values for the training and validation loss and the global step to 0, then loads the training and validation data separately with torch.utils.data.DataLoader using batch_size=8, shuffling and four workers. The device is assigned to CUDA if it is available so the GPU can be utilized. Next, the criterion is defined as CrossEntropyLoss and the optimizer as Adam, where the learning rate can be specified. The training and validation loop then loads the data, trains on the training set, evaluates the validation loss, resets the running values, prints the progress and saves the model/checkpoint.
from torch.utils.data import Dataset, DataLoader
from torch.optim import Adam
from tqdm import tqdm
destination_folder = '/content/drive/MyDrive/Yelp_Reviews/Models/DL/BERT/SentimentPolarity/Models'
%cd /content/drive/MyDrive/Yelp_Reviews/Models/DL/BERT/SentimentPolarity/40k/
def train(model,
df_train,
df_val,
learning_rate,
num_epochs=3,
file_path=destination_folder,
best_valid_loss=float('Inf')):
total_loss_train = 0.0
total_loss_val = 0.0
global_step = 0
train_loss_list = []
valid_loss_list = []
global_steps_list = []
train, val = Dataset(df_train), Dataset(df_val)
train_dataloader = torch.utils.data.DataLoader(train, batch_size=8,
shuffle=True, pin_memory=True,
num_workers=4)
val_dataloader = torch.utils.data.DataLoader(val, batch_size=8,
shuffle=True, pin_memory=True,
num_workers=4)
use_cuda = torch.cuda.is_available()
device = torch.device('cuda' if use_cuda else 'cpu')
criterion = nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=learning_rate)
if use_cuda:
model = model.cuda()
criterion = criterion.cuda()
eval_every = len(train_dataloader) // 2
model.train()
for epoch in range(num_epochs):
for train_input, train_label in tqdm(train_dataloader):
train_label = train_label.to(device)
mask = train_input['attention_mask'].to(device)
input_id = train_input['input_ids'].squeeze(1).to(device)
output = model(input_id, mask)
batch_loss = criterion(output, train_label)
total_loss_train += batch_loss.item()
optimizer.zero_grad()
batch_loss.backward()
optimizer.step()
global_step += 1
if global_step % eval_every == 0:
model.eval()
with torch.no_grad():
for val_input, val_label in val_dataloader:
val_label = val_label.to(device)
mask = val_input['attention_mask'].to(device)
input_id = val_input['input_ids'].squeeze(1).to(device)
output = model(input_id, mask)
batch_loss = criterion(output, val_label)
total_loss_val += batch_loss.item()
average_train_loss = total_loss_train / len(train_dataloader)
average_valid_loss = total_loss_val / len(val_dataloader)
train_loss_list.append(average_train_loss)
valid_loss_list.append(average_valid_loss)
global_steps_list.append(global_step)
total_loss_train = 0.0
total_loss_val = 0.0
model.train()
print('Epoch [{}/{}], Step [{}/{}], Train Loss: {:.4f}, Valid Loss: {:.4f}'
.format(epoch+1, num_epochs, global_step,
num_epochs * len(train_dataloader),
average_train_loss, average_valid_loss))
if best_valid_loss > average_valid_loss:
best_valid_loss = average_valid_loss
save_checkpoint(file_path + '/' + 'model_40K_batch8.pt',
model, best_valid_loss)
save_metrics(file_path + '/' + 'metrics_40K_batch8.pt',
train_loss_list, valid_loss_list,
global_steps_list)
save_metrics(file_path + '/' + 'metrics_40K_batch8.pt', train_loss_list,
valid_loss_list, global_steps_list)
print('Training finished!')
Let's now define the parameters for the training loop, where the model is BertClassifier, the optimizer is Adam with a learning rate of 5e-6, and three epochs are used for training.
model = BertClassifier()
optimizer = Adam(model.parameters(), lr=5e-6)
num_epochs = 3
LR = 5e-6
Now, the model can be trained using the specified parameters.
train(model, df_train, df_val, LR, num_epochs)
Model Metrics
¶
Let's now load the saved metrics and plot the train and validation loss against the global_steps_list.
import matplotlib.pyplot as plt
train_loss_list, valid_loss_list, global_steps_list = load_metrics(destination_folder
+ '/metrics_40K_batch8.pt')
plt.plot(global_steps_list, train_loss_list, label='Train')
plt.plot(global_steps_list, valid_loss_list, label='Valid')
plt.xlabel('Global Steps')
plt.ylabel('Loss')
plt.legend()
plt.show()
We can also define a function called evaluate, which loads the test set, assigns the device to CUDA and evaluates the trained model on the test set for accuracy, the classification_report and the confusion_matrix utilizing sklearn.metrics.
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix
def evaluate(model, test_data):
y_pred = []
y_true = []
test = Dataset(test_data)
test_dataloader = torch.utils.data.DataLoader(test, batch_size=1)
use_cuda = torch.cuda.is_available()
device = torch.device('cuda' if use_cuda else 'cpu')
if use_cuda:
model = model.cuda()
total_acc_test = 0
model.eval()
with torch.no_grad():
for test_input, test_label in test_dataloader:
test_label = test_label.to(device)
mask = test_input['attention_mask'].to(device)
input_id = test_input['input_ids'].squeeze(1).to(device)
output = model(input_id, mask)
y_pred.extend(torch.argmax(output, 1).tolist())
y_true.extend(test_label.tolist())
acc = (output.argmax(dim=1) == test_label).sum().item()
total_acc_test += acc
print(f'Test Accuracy: {total_acc_test / len(test_data): .3f}')
print('Classification Report:')
print(classification_report(y_true, y_pred, labels=[1,0], digits=4))
f, (ax) = plt.subplots(1,1)
f.suptitle('Test Set: Confusion Matrix', fontsize=20)
cm = confusion_matrix(y_true, y_pred, labels=[1,0])
ax = plt.subplot()
sns.heatmap(cm, annot=True, ax=ax, cmap='Blues', fmt='d')
ax.set_xlabel('Predicted Sentiment', fontsize=17)
ax.set_ylabel('Actual Sentiment', fontsize=17)
    ax.xaxis.set_ticklabels(['Positive', 'Negative'], fontsize=17)
    ax.yaxis.set_ticklabels(['Positive', 'Negative'], fontsize=17)
Let's now use this defined function to evaluate the model metrics of the trained model with the test set.
best_model = BertClassifier()
load_checkpoint(destination_folder + '/model_40K_batch8.pt', best_model)
evaluate(best_model, df_test)
So far, this model performed the best, with a test accuracy of 0.997, and the lowest metric was a precision of 0.9949 for the negative sentiment group.
Stars on Reviews
¶
Now, let's use the same preprocessing that was utilized for the sentiment set, applied to the filtered stars_reviews set.
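That code is not repeated here, so a condensed sketch of the analogous steps is shown below, assuming the binary stars_reviews target and the cleaned reviews are prepared as before (note that the Dataset class above reads its labels from a sentiment column, so the target column name may need to be adjusted accordingly):
from sklearn.utils import shuffle

# Keep reviews within the 512-word limit and drop the unused sentiment target
df = df[df['review_wordCount'] <= 512]
df1 = df.drop(['review_wordCount', 'sentiment'], axis=1)

# Balance the classes by sampling 20,000 reviews from each stars_reviews group
df2 = shuffle(df1[df1.stars_reviews == 0]).sample(n=20000)
df3 = shuffle(df1[df1.stars_reviews == 1]).sample(n=20000)
df1 = shuffle(pd.concat([df2, df3]))
del df2, df3

# Set data types and partition into 80% train, 10% validation, 10% test
df1['stars_reviews'] = df1['stars_reviews'].astype('int')
df1['cleanReview'] = df1['cleanReview'].astype(str)
df_train, df_val, df_test = np.split(df1.sample(frac=1, random_state=seed_value),
                                     [int(.8*len(df1)), int(.9*len(df1))])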
Train and Evaluate Using Batch Size = 8
¶
Using the same defined model parameters, we can train a model using the stars_reviews set.
%cd /content/drive/MyDrive/Yelp_Reviews/Models/DL/BERT/ReviewStars/40k/
destination_folder = '/content/drive/MyDrive/Yelp_Reviews/Models/DL/BERT/ReviewStars/Models'
model = BertClassifier()
optimizer = Adam(model.parameters(), lr=5e-6)
num_epochs = 3
LR = 5e-6
train(model, df_train, df_val, LR, num_epochs)
Model Metrics
¶
Let's now load the saved metrics and plot the train and validation loss against the global_steps_list.
train_loss_list, valid_loss_list, global_steps_list = load_metrics(destination_folder
+ '/metrics_40k_b8.pt')
plt.plot(global_steps_list, train_loss_list, label='Train')
plt.plot(global_steps_list, valid_loss_list, label='Valid')
plt.xlabel('Global Steps')
plt.ylabel('Loss')
plt.legend()
plt.show()
We can now evaluate the performance of the trained model by utilizing the evaluate function with the test set.
best_model = BertClassifier()
load_checkpoint(destination_folder + '/model_40k_b8.pt', best_model)
evaluate(best_model, df_test)
The model trained with the Sentiment set had better model metrics across the board, where the lowest was a precision of 0.9949 for the negative sentiment group. Using the transfer learning approach with BERT performed better than the BoW, TF-IDF and LSTM methods.