Recommendation systems are ubiquitous in the digital age, whether suggesting what to purchase on a website or what to watch or listen to online. By leveraging information gathered from many individuals, they personalize the experience for each user, which can increase profitability. They can also be used to generate analytical reports on product costs, profits, and projected demand so that the needed supply is purchased.
In this project, user ratings from a subset of Amazon Reviews are used to train several recommendation systems and evaluate various algorithms. The models return a list of recommended items based on a reviewer's previous actions.
Data
¶
The Movies_and_TV ratings data was retrieved from the Recommender Systems and Personalization Datasets collection. It consists of (item, user, rating, timestamp) tuples. A subset of the data was used in the analysis because the full set is too sparse for the computations.
Preprocessing
¶
The code used for preprocessing and EDA can be found in the Recommender System GitHub repository. First, the environment is set up with the dependencies, library options, the seed for reproducibility, and the location of the project directory. Then the data is read, duplicate observations are dropped, and the columns are named.
import os
import random
import warnings
import numpy as np
import pandas as pd

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Seed everything for reproducibility
seed_value = 42
os.environ['Recommender'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)

# Read the ratings, drop duplicate observations and name the columns
df = pd.read_csv('Movies_and_TV.csv', header=None, skiprows=[0],
                 low_memory=False)
df = df.drop_duplicates()
df.columns = ['item', 'reviewerID', 'rating', 'timestamp']

print('Sample observations:')
df.head()
Then a function data_summary is defined to examine the number of missing observations, the data types, and the number of unique values in the initial set. The timestamp variable is dropped since it will not be used.
def data_summary(df):
    """Print the dimensions, missing values, dtypes and unique counts."""
    print('Number of Rows: {}, Columns: {}'.format(df.shape[0], df.shape[1]))
    a = pd.DataFrame()
    a['Number of Missing Values'] = df.isnull().sum()
    a['Data type of variable'] = df.dtypes
    a['Number of Unique Values'] = df.nunique()
    print(a)

print('Initial Data Summary:')
data_summary(df)

df = df.drop(['timestamp'], axis=1)
The top 10 items in the initial set show that the most rated item has 24,554 ratings, while the 10th most rated has 14,174 ratings.
items_top10 = df.groupby('item').size().sort_values(ascending=False)[:10]
print('Items with highest number of ratings in initial set:')
print(items_top10)
The top 10 reviewers with the most ratings in the initial set each have over 1,600 reviews.
reviewers_top10 = df.groupby('reviewerID').size().sort_values(ascending=False)[:10]
print('Reviewers with highest number of ratings in initial set:')
print(reviewers_top10)
Since the data is sparse, a new integer id is created for item rather than the initial string variable.
value_counts = df['item'].value_counts(dropna=True, sort=True)
df1 = pd.DataFrame(value_counts)
df1 = df1.reset_index()
df1.columns = ['item_unique', 'counts']
df1 = df1.reset_index()
df1.rename(columns={'index': 'item_id'}, inplace=True)
df1 = df1.drop(['counts'], axis=1)
df = pd.merge(df, df1, how='left', left_on=['item'],
right_on=['item_unique'])
df = df.drop_duplicates()
df = df.drop(['item_unique'], axis=1)
del value_counts, df1
The same process is used for reviewerID. A key is created for merging the new integer variables, which will later be used to join back to the original data. For this set, the unnecessary keys are then dropped.
value_counts = df['reviewerID'].value_counts(dropna=True, sort=True)
df1 = pd.DataFrame(value_counts)
df1 = df1.reset_index()
df1.columns = ['id_unique', 'counts']
df1 = df1.reset_index()
df1.rename(columns={'index': 'reviewer_id'}, inplace=True)
df1 = df1.drop(['counts'], axis=1)
df = pd.merge(df, df1, how='left', left_on=['reviewerID'],
right_on=['id_unique'])
df = df.drop_duplicates()
df = df.drop(['id_unique'], axis=1)
del value_counts, df1
df1 = df[['item', 'item_id', 'reviewerID', 'reviewer_id']]
df1.to_csv('Movies_and_TV_idMatch.csv', index=False)
del df1
The data is then filtered to reviewers who have at least 25 ratings/reviews, due to sparsity. This results in a set containing 1,114,877 ratings with 19,702 unique reviewers and 103,687 unique items. The majority of items are rated 5 stars.
reviewer_count = df.reviewerID.value_counts()
df = df[df.reviewerID.isin(reviewer_count[reviewer_count >= 25].index)]
df = df.drop_duplicates()
del reviewer_count
print('- Number of ratings after filtering: ', len(df))
print('- Number of unique reviewers: ', df['reviewerID'].nunique())
print('- Number of unique items: ', df['item'].nunique())
for i in range(1,6):
print('- Number of items with {0} rating = {1}'.format(i,
df[df['rating'] == i].shape[0]))
The top 10 items in the filtered set show a large reduction: the most rated item drops from 24,554 to 1,136 ratings, while the 10th most rated drops from 14,174 to 853 ratings.
items_top10 = df.groupby('item').size().sort_values(ascending=False)[:10]
print('Items with highest number of ratings in the filtered set:')
print(items_top10)
The top reviewer in the filtered set drops from 4,101 to 3,981 ratings, while the 10th highest reviewer drops from 1,699 to 1,634 ratings.
reviewers_top10 = df.groupby('reviewerID').size().sort_values(ascending=False)[:10]
print('Reviewers with highest number of ratings in the filtered set:')
print(reviewers_top10)
Item-item collaborative filtering
¶
Item-item collaborative filtering is a recommendation method that looks for items similar to the ones a user has already liked or positively interacted with. It takes the items the user has consumed, finds other items similar to them, and recommends accordingly.
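To make the mechanics concrete before the TensorFlow models, here is a minimal, self-contained sketch of item-item collaborative filtering on a toy ratings matrix (the matrix and the chosen user/item are made up for illustration): each item is represented by its column of user ratings, item-item similarity is the cosine between columns, and an unseen item is scored by a similarity-weighted average of the user's existing ratings.
import numpy as np

# Toy ratings matrix: rows are users, columns are items (0 = not rated)
R = np.array([[5., 4., 0., 1.],
              [4., 5., 1., 0.],
              [1., 0., 5., 4.],
              [0., 1., 4., 5.]])

# Item-item cosine similarity between the rating columns
unit = R / np.linalg.norm(R, axis=0, keepdims=True)
sim = unit.T @ unit

# Score item 2 for user 0 as a similarity-weighted average of their ratings
user, target = 0, 2
rated = R[user] > 0
score = sim[target, rated] @ R[user, rated] / sim[target, rated].sum()
print(f'Predicted rating for user {user} on item {target}: {score:.2f}')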
Collaborative Filtering in Tensorflow
¶
Let's first set up the environment by importing the necessary packages and examining the CUDA
and NVIDIA GPU
information as well as the Tensorflow
and Keras
versions for the runtime. To set the seed for reproducibility, we can use a function init_seeds
that defines the random
, numpy
and tensorflow
seed as well as the environment and session.
import tensorflow as tf
from tensorflow import keras
print('CUDA and NVIDIA GPU Information')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
print('\n')
print('TensorFlow version: {}'.format(tf.__version__))
print('Eager execution is: {}'.format(tf.executing_eagerly()))
print('Keras version: {}'.format(tf.keras.__version__))
print('Num GPUs Available: ', len(tf.config.list_physical_devices('GPU')))
print('\n')
def init_seeds(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    # Single-threaded ops keep the session deterministic
    session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1,
                                            inter_op_parallelism_threads=1)
    os.environ['AmazonReviews_RecSysDL'] = str(seed)
    os.environ['TF_CUDNN_DETERMINISTIC'] = 'True'
    os.environ['TF_DETERMINISTIC_OPS'] = 'True'
    sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(),
                                config=session_conf)
    tf.compat.v1.keras.backend.set_session(sess)
    return sess

init_seeds(seed=42)
Now, we can use the LabelEncoder
to encode reviewerID
and itemID
and create sets containing the unique users and items. Then, we can create minimum and maximum rating variables and set the number of factors (50) to be considered in the model. Let's examine the number of unique items and users as well as the minimum and maximum rating:
from sklearn.preprocessing import LabelEncoder
user_encode = LabelEncoder()
df['user'] = user_encode.fit_transform(df['reviewerID'].values)
n_users = df['user'].nunique()
item_encode = LabelEncoder()
df['Item'] = item_encode.fit_transform(df['itemID'].values)
n_items = df['Item'].nunique()
df['stars'] = df['rating'].values
min_rating = min(df['stars'])
max_rating = max(df['stars'])
n_factors = 50
print('Number of unique items:', n_items)
print('Number of unique users:', n_users)
print('Minimum rating:', min_rating)
print('Maximum rating:', max_rating)
There are more unique items than unique users, and the minimum rating is 1 while the maximum is 5. Let's now set up the train/test sets using test_size=0.2 and examine the number of observations in each set.
from sklearn.model_selection import train_test_split
X = df[['user', 'Item']].values
y = df['stars'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=seed_value)
del X, y
print('Train data split:', len(X_train))
print('Eval data split:', len(X_test))
X_train_array = [X_train[:,0], X_train[:,1]]
X_test_array = [X_test[:,0], X_test[:,1]]
Let's now set up the model architecture by creating the EmbeddingLayer class, which takes the number of items and the number of factors as input, builds an Embedding with embeddings_initializer='he_normal' and embeddings_regularizer=l2(1e-6), and then reshapes the output.
from keras.layers import Embedding, Reshape
from keras.regularizers import l2
class EmbeddingLayer:
def __init__(self, n_items, n_factors):
self.n_items = n_items
self.n_factors = n_factors
def __call__(self, x):
x = Embedding(self.n_items, self.n_factors,
embeddings_initializer='he_normal',
embeddings_regularizer=l2(1e-6))(x)
x = Reshape((self.n_factors,))(x)
return x
We can define the Recommender function to set up the model architecture, taking the unique n_users, unique n_items, the defined n_factors, min_rating and max_rating as input. The user and item embeddings are concatenated, followed by 5% Dropout, a Dense layer with 10 nodes using kernel_initializer='he_normal' (which draws samples from a truncated normal distribution), a relu activation function and 20% Dropout. This is fed into a Dense layer with a sigmoid activation function and a final Lambda layer that scales the output by the difference between the max_rating and the min_rating and then adds the min_rating. The model can then be compiled using loss='mean_squared_error' and optimizer=Adam(lr=0.001).
from keras.layers import Input, Concatenate, Dropout, Dense, Activation, Lambda
from keras.models import Model
from keras.optimizers import Adam
def Recommender(n_users, n_items, n_factors, min_rating, max_rating):
user = Input(shape=(1,))
u = EmbeddingLayer(n_users, n_factors)(user)
Item = Input(shape=(1,))
m = EmbeddingLayer(n_items, n_factors)(Item)
x = Concatenate()([u, m])
x = Dropout(0.05)(x)
x = Dense(10, kernel_initializer='he_normal')(x)
x = Activation('relu')(x)
x = Dropout(0.2)(x)
x = Dense(1, kernel_initializer='he_normal')(x)
x = Activation('sigmoid')(x)
x = Lambda(lambda x: x * (max_rating - min_rating) + min_rating)(x)
model = Model(inputs=[user, Item], outputs=x)
opt = Adam(lr=0.001)
model.compile(loss='mean_squared_error', optimizer=opt)
return model
Now, we can examine the model architecture by calling the Recommender function with n_users, n_items, n_factors, min_rating and max_rating, saving the result as keras_model and calling summary.
keras_model = Recommender(n_users, n_items, n_factors, min_rating, max_rating)
keras_model.summary()
Batch Size = 16
¶
Let's now set the path where the model is saved and set up the logs/callbacks: the EarlyStopping class, which monitors the val_loss and stops training if it does not improve after 3 epochs; the ModelCheckpoint class, which monitors the val_loss and saves only the best model with the minimum loss; and the TensorBoard class, which enables visualizations in TensorBoard.
import datetime
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
!rm -rf ./logs/
%load_ext tensorboard
log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
filepath = 'CF_Baseline_b16_dropout0.2_weights_only.h5'
checkpoint_dir = os.path.dirname(filepath)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
histogram_freq=1)
callbacks_list = [EarlyStopping(monitor='val_loss', patience=3),
ModelCheckpoint(filepath, monitor='val_loss',
save_best_only=True, mode='min'),
tensorboard_callback]
Now the model can be trained by calling fit on the training data for 10 epochs (an epoch is one iteration over the entire X_train and y_train), using batch_size=16 (the number of samples per gradient update) and the specified callbacks from callbacks_list.
history = keras_model.fit(x=X_train_array, y=y_train,
validation_data=(X_test_array, y_test),
batch_size=16, epochs=10, callbacks=callbacks_list)
Now, let's save the model if we need to reload it in the future and predict on the test set.
keras_model.save('./baselineCF_Model2_batch16_dropout0.2_tf.h5',
save_format='tf')
#filepath = 'CF_Baseline_b16_dropout0.2_weights_only.h5'
#model = tf.keras.models.load_model('./baselineCF_Model2_batch16_dropout0.2_tf.h5')
#model.load_weights(filepath)
predictions = keras_model.predict(X_test_array)
Then, we create a pandas.DataFrame containing the test set's reviewer id, item id, rating and prediction results.
df_test = pd.DataFrame(X_test[:,0])
df_test.rename(columns={0: 'user'}, inplace=True)
df_test['items'] = X_test[:,1]
df_test['stars'] = y_test
df_test['predictions'] = predictions
df_test.head()
We can utilize the actual ratings
from the original test set and the predicted ones to plot the distribution of the actual ratings using matplotlib.pyplot.bar
and the predicted ratings using matplotlib.pyplot.hist
.
import matplotlib.pyplot as plt
values, counts = np.unique(df_test['stars'], return_counts=True)
plt.figure(figsize=(8,6))
plt.bar(values, counts, tick_label=['1','2','3','4','5'], label='true value')
plt.hist(predictions, color='orange', label='predicted value')
plt.xlabel('Ratings')
plt.ylabel('Frequency')
plt.title('Ratings Histogram')
plt.legend()
plt.show()
Now, let's examine metrics from the model and plot the model loss
and val_loss
over the training epochs.
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Eval'], loc='upper right')
plt.show()
We can extract the weights of the embeddings and examine the shape and length.
emb = keras_model.get_layer('embedding_1')
emb_weights = emb.get_weights()[0]
print('The shape of embedded weights: ', emb_weights.shape)
print('The length of embedded weights: ', len(emb_weights))
Cosine Similarity
¶
This distance metric, also known as item-to-item similarity, calculates a similarity score between two items represented as vectors in a multidimensional inner product space. Vectorization makes this possible by converting items into numeric vectors, allowing their meaning to be encoded and processed mathematically. The cosine of the angle between the two item vectors, as projected in the multidimensional space, can then be determined.
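As a quick numeric illustration (the vectors here are toy values, not the model's embeddings), the cosine similarity is the dot product divided by the product of the vector norms; normalizing the vectors first reduces it to a plain dot product:
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.0])

# Cosine similarity: dot product over the product of the norms
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Equivalent: dot product of the unit-normalized vectors
cos_unit = (a / np.linalg.norm(a)) @ (b / np.linalg.norm(b))
print(round(cos, 4), round(cos_unit, 4))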
Each item is now represented as a 50-dimensional vector. The embeddings need to be normalized so that the dot product between two embeddings equals the Cosine Similarity, a numeric value representing the similarity between two items. The embedding weights from the model are needed to compute the Cosine Similarity via a dot product.
emb_weights = emb_weights / np.linalg.norm(emb_weights, axis=1).reshape((-1,1))
rest_id_emb = df['itemID'].unique()
print('Number of unique embedded weights:', len(rest_id_emb))
Next, let's create a pandas.DataFrame containing all the unique items with their embedded weights in the 50 dimensions.
rest_pd = pd.DataFrame(emb_weights)
rest_pd['itemID'] = rest_id_emb
rest_pd = rest_pd.set_index('itemID')
rest_pd
We can create a temp dataframe that contains the unique IDs of the items and use an inner pandas.merge to get the items, then create a copy of the itemID column as Item and remove inf and -inf from the data.
temp = df[['itemID']].drop_duplicates()
df_recommend = pd.merge(rest_pd, temp, on='itemID')
df_recommend['Item'] = df_recommend.loc[:, 'itemID']
# Drop any rows containing NaN, inf or -inf
df_recommend = df_recommend[~df_recommend.isin([np.nan, np.inf,
                                                -np.inf]).any(axis=1)]
df_recommend.shape
Recommendation
¶
Let's now use the trained model to recommend items that should be suggested for purchase alongside the most rated items in the set. We can define a function find_similarity_total that calculates the cosine similarity between the target item and all other items, storing the result in a table.
def find_similarity_total(item_name):
    """Recommend items based on the cosine similarity between items."""
    cosine_list_total = []
    result = []
    # Embedding vector of the target item (looked up once, outside the loop)
    sample_name = df_recommend[df_recommend['Item'] == item_name].iloc[:, 1:51]
    for i in range(0, df_recommend.shape[0]):
        row = df_recommend.iloc[i, 1:51]
        cosine_total = np.dot(sample_name, row)
        recommended_name = df_recommend.iloc[i, 51]
        cosine_list_total.append(cosine_total)
        result.append(recommended_name)
    cosine_df_total = pd.DataFrame({'similar_item': result,
                                    'cosine': cosine_list_total})
    return cosine_df_total
Let's now call the function with the most frequent item, B009934S5M, as the input. We can then apply a lambda function with the convert function to create a new column called cos in the result dataframe, drop the original cosine column (whose values were wrapped in np.array), and sort with the highest values first. The items with the highest cos are the most similar to the input item.
def convert(input):
""" Replace '[]' to empty strings & convert string to float """
return float(str(input).replace('[','').replace(']',''))
result = find_similarity_total('B009934S5M')
result['cos'] = result.apply(lambda x: convert(x['cosine']), axis=1)
result.drop('cosine', axis=1, inplace=True)
result = result.sort_values('cos', ascending=False)
result.head(10)
The model recommends items that should be suggested for purchase with the most rated item, which was Star Trek Into Darkness (Blu-ray) (B009934S5M).
B00KHW4XHM - Criterion Collection: Golden Age of Television
B0000JLLAO - Not found
B01F6EHOIK - Red Sonja: Queen Of Plagues
B00199PP6U - Adventures of the Galaxy Rangers Collection Vol. 2
B00005LZOF - Not found
B00009IAXL - Malibooty: Beach with 5:1 ratio females:male
B0006FFRKM - Stellvia: Foundation III - Anime sold of Amazon Italy
6304492952 - Disney Sing Along Songs - The Early Years, Collection of All-Time Favorites VHS
B009934S5M - Star Trek Into Darkness (Blu-ray)
B000HT3QBY - The Most Beautiful Wife - Italy, mafia, girl punished
Let's call the function again using another frequent item as input.
result = find_similarity_total('B00PY4Q9OS')
result['cos'] = result.apply(lambda x: convert(x['cosine']), axis=1)
result.drop('cosine', axis=1, inplace=True)
result = result.sort_values('cos', ascending=False)
result.head(10)
The model recommends items that should be suggested for purchase with the second most rated item, which was Guardians of the Galaxy 3D - Limited Edition Steelbook (B00PY4Q9OS).
B00062IZ2C - Star Hunter 01
B0157757NK - Fire City: End of Days - Battle over earth w demons
B004L1DB9G - Born to Fight [Blu-ray] - Train for the biggest boxing match with a bully
B00008973P - 101 Reykjavík [DVD] - Contemporary Icelandic version of American movies of the 1970s like Five Easy Pieces, in which antiheroic characters struggle
B00004I9YP - Not found
B000H4JH4O - Keeping Up With the Steins [DVD] - Heartwarming coming-of-age comedy when three generations collide in a crazy family reunion
B00ABCJQZW - Celebrate With Clifford (Clifford The Big Red Dog) [DVD]
B000FJMZ0O - Not found
B00005JM7T - Dr. Seuss' The Cat In The Hat (Widescreen Edition) [DVD]
Let's call the function again using another frequent item as input.
result = find_similarity_total('B00R8GUXPG')
result['cos'] = result.apply(lambda x: convert(x['cosine']), axis=1)
result.drop('cosine', axis=1, inplace=True)
result = result.sort_values('cos', ascending=False)
result.head(10)
This item ID (ASIN) could not be found.
Batch Size = 8
¶
Let's now set the path where the model is saved and set up the logs/callbacks: the EarlyStopping class, which monitors the val_loss and stops training if it does not improve after 3 epochs; the ModelCheckpoint class, which monitors the val_loss and saves only the best model with the minimum loss; and the TensorBoard class, which enables visualizations in TensorBoard.
Now the model can be trained by calling fit on the training data for 10 epochs (an epoch is one iteration over the entire X_train and y_train), using batch_size=8 (the number of samples per gradient update) and the specified callbacks from callbacks_list.
!rm -rf ./logs/
%load_ext tensorboard
log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
filepath = 'CF_Baseline_b8_dropout0.2_weights_only.h5'
checkpoint_dir = os.path.dirname(filepath)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
histogram_freq=1)
callbacks_list = [EarlyStopping(monitor='val_loss', patience=3),
ModelCheckpoint(filepath, monitor='val_loss',
save_best_only=True, mode='min'),
tensorboard_callback]
history = keras_model.fit(x=X_train_array, y=y_train,
validation_data=(X_test_array, y_test),
batch_size=8, epochs=10, callbacks=callbacks_list)
The model stopped at the 5th epoch because EarlyStopping with patience=3 was set. Let's now save the model and predict using the test set.
keras_model.save('./baselineCF_Model2_batch8_dropout0.2_tf.h5',
save_format='tf')
predictions = keras_model.predict(X_test_array)
Then, we create a pandas.DataFrame containing the test set's reviewer id, item id, rating and prediction results.
df_test = pd.DataFrame(X_test[:,0])
df_test.rename(columns={0: 'user'}, inplace=True)
df_test['items'] = X_test[:,1]
df_test['stars'] = y_test
df_test['predictions'] = predictions
df_test.head()
We can utilize the actual ratings
from the original test set and the predicted ones to plot the distribution of the actual ratings using matplotlib.pyplot.bar
and the predicted ratings using matplotlib.pyplot.hist
.
values, counts = np.unique(df_test['stars'], return_counts=True)
plt.figure(figsize=(8,6))
plt.bar(values, counts, tick_label=['1','2','3','4','5'], label='true value')
plt.hist(predictions, color='orange', label='predicted value')
plt.xlabel('Ratings')
plt.ylabel('Frequency')
plt.title('Ratings Histogram')
plt.legend()
plt.show()
Now, let's examine metrics from the model and plot the model loss
and val_loss
over the training epochs.
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Eval'], loc='upper right')
plt.show()
We can extract the weights of the embeddings and examine the shape and length.
emb = keras_model.get_layer('embedding_11')
emb_weights = emb.get_weights()[0]
print('The shape of embedded weights: ', emb_weights.shape)
print('The length of embedded weights: ', len(emb_weights))
Cosine Similarity
¶
Each item is now represented as a 50-dimensional vector. The embeddings need to be normalized so that the dot product between two embeddings equals the Cosine Similarity, a numeric value representing the similarity between two items. The embedding weights from the model are needed to compute the Cosine Similarity via a dot product.
emb_weights = emb_weights / np.linalg.norm(emb_weights, axis=1).reshape((-1,1))
rest_id_emb = df['itemID'].unique()
print('Number of unique embedded weights:', len(rest_id_emb))
Next, let's create a pandas.DataFrame containing all the unique items with their embedded weights in the 50 dimensions.
rest_pd = pd.DataFrame(emb_weights)
rest_pd['itemID'] = rest_id_emb
rest_pd = rest_pd.set_index('itemID')
rest_pd
We can create a temp dataframe that contains the unique IDs of the items and use an inner pandas.merge to get the items, then create a copy of the itemID column as Item and remove inf and -inf from the data.
temp = df[['itemID']].drop_duplicates()
df_recommend = pd.merge(rest_pd, temp, on='itemID')
df_recommend['Item'] = df_recommend.loc[:, 'itemID']
# Drop any rows containing NaN, inf or -inf
df_recommend = df_recommend[~df_recommend.isin([np.nan, np.inf,
                                                -np.inf]).any(axis=1)]
df_recommend.shape
The number of items is reduced compared to the batch_size=16 model.
Recommendation
¶
Let's now call the function with the most frequent item, B009934S5M, as the input. We can then apply a lambda function with the convert function to create a new column called cos in the result dataframe, drop the original cosine column (whose values were wrapped in np.array), and sort with the highest values first. The items with the highest cos are the most similar to the input item.
result = find_similarity_total('B009934S5M')
result['cos'] = result.apply(lambda x: convert(x['cosine']), axis=1)
result.drop('cosine', axis=1, inplace=True)
result = result.sort_values('cos', ascending=False)
result.head(10)
The model recommends items that should be suggested for purchase with the most rated item, which was Star Trek Into Darkness (Blu-ray) (B009934S5M).
B009934S5M - Star Trek Into Darkness (Blu-ray)
B00005U8QD - Witchouse 3: Demon Fire (Widescreen Special Edition)
B000P5FH5I - Dora The Explorer - Summer Explorer
B001JAHSIM - Not found
B000WYVUZS - Not found
B001TXVSPS - Not found
B007OCD1CG - Simply Red: Live at Montreux 2003 [Blu-ray] music
B000OZ2CTS - Vandread and Vandread the Second Stage: Complete Collection anime
B00JWS9IS6 - President Wolfman [Edizione: Stati Uniti] [Edizione: USA]
B000W8KY0G - Not found
Collaborative Filtering Model using Tensorflow
¶
We can utilize the previous approach to set up the environment, including setting the seed with an init_seeds function for reproducibility.
Let's now prepare the data using lambda functions, and examine the number of reviewers, items and ratings.
import pandas as pd
df['item_id'] = df['item_id'].apply(lambda x: f'item_{x}')
df['reviewer_id'] = df['reviewer_id'].apply(lambda x: f'reviewer_{x}')
df['rating'] = df['rating'].apply(lambda x: float(x))
print(f'Number of reviewers: {len(df.reviewer_id.unique())}')
print(f'Number of items: {len(df.item_id.unique())}')
print(f'Number of ratings: {len(df.index)}')
To prepare the data for modeling, let's create the training and evaluation sets, with 80% allocated to training and 20% to evaluation, and save them as .csv files as well.
random_selection = np.random.rand(len(df.index)) <= 0.80
train_data = df[random_selection]
eval_data = df[~random_selection]
train_data.to_csv('train_filtered.csv', index=False, sep='|', header=False)
eval_data.to_csv('eval_filtered.csv', index=False, sep='|', header=False)
print(f'Train data split: {len(train_data.index)}')
print(f'Eval data split: {len(eval_data.index)}')
print('Train and eval data files are saved.')
Then, we can define the dataset metadata including where the data is stored, the columns as csv_header
, the name of the target variable and the vocabularies for the unique reviewers and items.
The model hyperparameters can be specified, so let's utilize learning_rate = 0.0001
, batch_size = 1024
, num_epochs = 30
and a base_embedding_dim = 64
.
DATADIR = '/content/drive/MyDrive/AmazonReviews/Data/'
csv_header = list(df.columns)
target_feature_name = 'rating'
reviewer_vocabulary = list(df.reviewer_id.unique())
item_vocabulary = list(df.item_id.unique())
learning_rate = 0.0001
batch_size = 1024
num_epochs = 30
base_embedding_dim = 64
Let's use a function get_dataset_from_csv that loads the data with tf.data.experimental.make_csv_dataset, where each element of the dataset is a tuple containing a batch of rows. The features dictionary maps the feature column names to Tensors containing the corresponding feature data, and the label is a Tensor containing the batch's label data.
from tensorflow.python.lib.io import file_io
def get_dataset_from_csv(path, batch_size=1024, shuffle=True):
csv_file_path = tf.io.gfile.glob(path)
return tf.data.experimental.make_csv_dataset(
csv_file_path,
batch_size=batch_size,
column_names=csv_header,
label_name=target_feature_name,
num_epochs=1,
header=False,
field_delim='|',
shuffle=shuffle)
Now let's use another function run_experiment that calls the get_dataset_from_csv function to read the training and evaluation data from DATADIR and then compiles and fits the model with the specified callbacks.
def run_experiment(model):
train_dataset = get_dataset_from_csv(os.path.join(DATADIR,
'train_filtered.csv'),
batch_size)
eval_dataset = get_dataset_from_csv(os.path.join(DATADIR,
'eval_filtered.csv'),
batch_size, shuffle=False)
model.compile(optimizer=keras.optimizers.Adam(learning_rate),
loss=tf.keras.losses.MeanSquaredError(),
metrics=[keras.metrics.MeanAbsoluteError(name='mae')])
history = model.fit(train_dataset, epochs=num_epochs,
validation_data=eval_dataset,
callbacks=callbacks_list)
return history
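An embedding_encoder helper is also defined below: it chains a StringLookup layer, which maps string IDs to integer indices, with an Embedding layer, and it is reused by both the baseline and memory-efficient models.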
from tensorflow.keras import layers
def embedding_encoder(vocabulary, embedding_dim, num_oov_indices=0, name=None):
return keras.Sequential(
[layers.StringLookup(vocabulary=vocabulary, mask_token=None,
num_oov_indices=num_oov_indices),
layers.Embedding(input_dim=len(vocabulary) + num_oov_indices,
output_dim=embedding_dim),],
name=f'{name}_embedding' if name else None)
Implement the Baseline Model
¶
We can define another function create_baseline_model that passes the reviewer input through the embedding_encoder function for the reviewer embeddings and the item input through embedding_encoder for the item embeddings. The dot product similarity between the reviewer and item embeddings is then computed and converted to a rating scale. Lastly, the model is created.
def create_baseline_model():
reviewer_input = layers.Input(name='reviewer_id', shape=(), dtype=tf.string)
reviewer_embedding = embedding_encoder(vocabulary=reviewer_vocabulary,
embedding_dim=base_embedding_dim,
name='reviewer')(reviewer_input)
item_input = layers.Input(name='item_id', shape=(), dtype=tf.string)
item_embedding = embedding_encoder(vocabulary=item_vocabulary,
embedding_dim=base_embedding_dim,
name='item')(item_input)
logits = layers.Dot(axes=1, name='dot_similarity')([reviewer_embedding,
item_embedding])
prediction = keras.activations.sigmoid(logits) * 5
model = keras.Model(inputs=[reviewer_input, item_input], outputs=prediction,
name='baseline_model')
return model
Let's create the baseline model and examine the model summary.
baseline_model = create_baseline_model()
baseline_model.summary()
Notice that the number of trainable parameters is 7,892,864. Let's now set the path where the model is saved and set up the logs/callbacks: the EarlyStopping class, which monitors the val_loss and stops training if it does not improve after 3 epochs; the ModelCheckpoint class, which monitors the val_loss and saves only the best model with the minimum loss; and the TensorBoard class, which enables visualizations in TensorBoard.
Now the baseline_model can be trained by calling fit on the train dataset for 30 epochs (an epoch is one iteration over the entire training set), using batch_size=1024 (the number of samples per gradient update) and the specified callbacks from callbacks_list.
import datetime
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
!rm -rf ./logs/
%load_ext tensorboard
log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
filepath = 'baselineEmbed64_MoviesTV_weights_only_b1024.h5'
checkpoint_dir = os.path.dirname(filepath)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
histogram_freq=1)
callbacks_list = [EarlyStopping(monitor='val_loss', patience=3),
ModelCheckpoint(filepath, monitor='val_loss',
save_best_only=True, mode='min'),
tensorboard_callback]
history = run_experiment(baseline_model)
We can save the model and plot the model loss
and val_loss
over the epochs
.
import matplotlib.pyplot as plt
baseline_model.save('./baselineEmbed64_MoviesTV_batchb1024_tf.h5',
save_format='tf')
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Eval'], loc='upper right')
plt.show()
Let's now load the test set and the model so we can evaluate
the trained model for mae
using the test set.
eval_dataset = get_dataset_from_csv('eval_filtered.csv', batch_size,
shuffle=False)
model = tf.keras.models.load_model('./baselineEmbed64_MoviesTV_batchb1024_tf.h5')
result = model.evaluate(eval_dataset, return_dict=True, verbose=0)
print('\nEvaluation on the test set:')
display(result)
Memory-Efficient Model
¶
Implement Quotient-Remainder Embedding as a Layer
¶
The Quotient-Remainder technique creates two num_buckets X embedding_dim embedding tables for a vocabulary and embedding size embedding_dim, where num_buckets is much smaller than vocabulary_size, rather than a single vocabulary_size X embedding_dim embedding table.
An embedding for a given item index
is generated by:
- Computing the quotient_index as index // num_buckets.
- Computing the remainder_index as index % num_buckets.
- Looking up quotient_embedding from the first embedding table using quotient_index.
- Looking up remainder_embedding from the second embedding table using remainder_index.
- Returning quotient_embedding * remainder_embedding.
This technique reduces the number of embedding vectors that need to be stored and trained, while still generating a unique embedding vector for each item of size embedding_dim. The two embeddings, q_embedding and r_embedding, can also be combined using operations other than multiplication, such as Add and Concatenate.
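To see why this saves memory, here is a back-of-the-envelope sketch; the vocabulary size and bucket count are illustrative numbers, not this dataset's:
# Illustrative sizes: one full table vs. the quotient-remainder pair
vocab_size, embedding_dim, num_buckets = 1_000_000, 64, 1_000

full_table_params = vocab_size * embedding_dim  # 64,000,000
qr_params = 2 * num_buckets * embedding_dim     # 128,000
print(f'Full table: {full_table_params:,} parameters')
print(f'Quotient-remainder: {qr_params:,} parameters')

# Any index below num_buckets**2 maps to a unique (quotient, remainder) pair
index = 123_457
print(index // num_buckets, index % num_buckets)  # 123 457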
Let's now set up the config and get the item, quotient and remainder indices. Then look up the quotient_embedding using the quotient_index and the remainder_embedding using the remainder_index, and use multiplication as the combiner operation.
class QREmbedding(keras.layers.Layer):
def __init__(self, vocabulary, embedding_dim, num_buckets, name=None):
super(QREmbedding, self).__init__(name=name)
self.num_buckets = num_buckets
self.index_lookup = layers.StringLookup(
vocabulary=vocabulary, mask_token=None, num_oov_indices=0)
self.q_embeddings = layers.Embedding(num_buckets, embedding_dim,)
self.r_embeddings = layers.Embedding(num_buckets, embedding_dim,)
def get_config(self):
config = super().get_config()
return config
def call(self, inputs):
embedding_index = self.index_lookup(inputs)
quotient_index = tf.math.floordiv(embedding_index, self.num_buckets)
remainder_index = tf.math.floormod(embedding_index, self.num_buckets)
quotient_embedding = self.q_embeddings(quotient_index)
remainder_embedding = self.r_embeddings(remainder_index)
return quotient_embedding * remainder_embedding
Implement Mixed Dimension Embedding as a Layer
¶
Embedding vectors are trained with full dimensions for frequently queried items, while embedding vectors with reduced dimensions are trained for less frequent items. A projection weights matrix maps the reduced-dimension embeddings up to the full dimension.
More precisely, we define blocks of items of similar frequencies. For each block, a block_vocab_size X block_embedding_dim
embedding table and block_embedding_dim X full_embedding_dim
projection weights matrix are created. Note that, if block_embedding_dim
equals full_embedding_dim
, the projection weights matrix becomes an identity matrix. Embeddings for a given batch of item indices
are generated via the following steps:
- For each block, look up the block_embedding_dim embedding vectors using indices, and project them to the full_embedding_dim.
- If an item index does not belong to a given block, an out-of-vocabulary embedding is returned. Each block will return a batch_size X full_embedding_dim tensor.
- A mask is applied to the embeddings returned from each block in order to convert the out-of-vocabulary embeddings to vectors of zeros. That is, for each item in the batch, a single non-zero embedding vector is returned from all the block embeddings.
- Embeddings retrieved from the blocks are combined using a sum to produce the final batch_size X full_embedding_dim tensor.
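A rough parameter count shows where the savings come from; the block sizes and dimensions below are hypothetical, not the actual vocabulary splits used later:
# Hypothetical blocks of (block_vocab_size, block_embedding_dim)
full_vocab, full_dim = 100_000, 64
blocks = [(1_000, 64), (10_000, 32), (89_000, 16)]

baseline_params = full_vocab * full_dim
# Each block stores its own table plus a projection up to full_dim
md_params = sum(v * d + d * full_dim for v, d in blocks)
print(f'Single full table: {baseline_params:,} parameters')
print(f'Mixed-dimension blocks: {md_params:,} parameters')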
For the MDEmbedding class, let's create a vocabulary-to-block lookup along with block embedding encoders and projectors. In the call method, the block index for each input item is determined and the output embeddings are initialized to zeros. For each block, the embeddings are looked up and projected to base_embedding_dim, a mask zeroes out the embeddings of items that are not in the current block, and the block embeddings are added to the final embeddings.
class MDEmbedding(keras.layers.Layer):
def __init__(self, blocks_vocabulary, blocks_embedding_dims,
base_embedding_dim, name=None):
super(MDEmbedding, self).__init__(name=name)
self.num_blocks = len(blocks_vocabulary)
keys = []
values = []
for block_idx, block_vocab in enumerate(blocks_vocabulary):
keys.extend(block_vocab)
values.extend([block_idx] * len(block_vocab))
self.vocab_to_block = tf.lookup.StaticHashTable(
tf.lookup.KeyValueTensorInitializer(keys, values), default_value=-1)
self.block_embedding_encoders = []
self.block_embedding_projectors = []
for idx in range(self.num_blocks):
vocabulary = blocks_vocabulary[idx]
embedding_dim = blocks_embedding_dims[idx]
block_embedding_encoder = embedding_encoder(
vocabulary, embedding_dim, num_oov_indices=1)
self.block_embedding_encoders.append(block_embedding_encoder)
if embedding_dim == base_embedding_dim:
self.block_embedding_projectors.append(layers.Lambda(lambda x: x))
else:
self.block_embedding_projectors.append(
layers.Dense(units=base_embedding_dim))
def get_config(self):
config = super().get_config()
return config
    def call(self, inputs):
        block_indices = self.vocab_to_block.lookup(inputs)
        embeddings = tf.zeros(shape=(tf.shape(inputs)[0], base_embedding_dim))
        for idx in range(self.num_blocks):
            block_embeddings = self.block_embedding_encoders[idx](inputs)
            block_embeddings = self.block_embedding_projectors[idx](block_embeddings)
            # Zero out embeddings for items that do not belong to this block
            mask = tf.expand_dims(tf.cast(block_indices == idx,
                                          tf.dtypes.float32), 1)
            block_embeddings = block_embeddings * mask
            embeddings += block_embeddings
        return embeddings
Implement the Memory-Efficient Model
¶
The Quotient-Remainder method reduces the size of the reviewer embeddings, and the Mixed Dimension method reduces the size of the item embeddings. The number of blocks and the embedding dimensions of each block are determined from a histogram of item frequencies, a proxy for popularity. So let's use pandas.Series.value_counts() to generate item_frequencies and plot with bins=100.
item_frequencies = df['item_id'].value_counts()
item_frequencies.hist(bins=100)
We can also examine summary statistics using pandas.Series.describe.
item_frequencies.describe()
Let's examine the items with a rating count at or below the mean and plot the distribution.
dat = df[df.groupby('item_id')['item_id'].transform('count') <= 10]
dat.shape
item_frequencies1 = dat['item_id'].value_counts()
item_frequencies1.hist(bins=100)
Then, we can examine the items with a count above the mean and plot them; the same comparison is repeated around a count of 100.
dat = df[df.groupby('item_id')['item_id'].transform('count') >= 10]
dat.shape
item_frequencies1 = dat['item_id'].value_counts()
item_frequencies1.hist(bins=100)
dat = df[df.groupby('item_id')['item_id'].transform('count') <= 100]
dat.shape
item_frequencies1 = dat['item_id'].value_counts()
item_frequencies1.hist(bins=100)
dat = df[df.groupby('item_id')['item_id'].transform('count') >= 100]
dat.shape
item_frequencies1 = dat['item_id'].value_counts()
item_frequencies1.hist(bins=100)
Now, we can create item block vocabularies with different dimensions. The items are grouped into three blocks (high, normal and low popularity) and assigned the embedding dimensions [64, 32, 16], with the block sizes split according to the frequency of the items. Then, we define the number of embedding buckets for reviewers by dividing the length of the reviewer_vocabulary by 100.
sorted_item_vocabulary = list(item_frequencies.keys())
item_blocks_vocabulary = [
sorted_item_vocabulary[10:100],
sorted_item_vocabulary[100:],
sorted_item_vocabulary[:10]]
item_blocks_embedding_dims = [64, 32, 16]
reviewer_embedding_num_buckets = len(reviewer_vocabulary) // 100
print('Number of reviewer embedding buckets:', reviewer_embedding_num_buckets)
We can define another function create_memory_efficient_model that passes the reviewer input through the QREmbedding class for the reviewer embeddings and the item input through the MDEmbedding class for the item embeddings. The dot product similarity between the reviewer and item embeddings is then computed and converted to a rating scale. Lastly, the model is created.
def create_memory_efficient_model():
reviewer_input = layers.Input(name='reviewer_id', shape=(), dtype=tf.string)
reviewer_embedding = QREmbedding(
vocabulary=reviewer_vocabulary,
embedding_dim=base_embedding_dim,
num_buckets=reviewer_embedding_num_buckets,
name='reviewer_embedding')(reviewer_input)
item_input = layers.Input(name='item_id', shape=(), dtype=tf.string)
item_embedding = MDEmbedding(
blocks_vocabulary=item_blocks_vocabulary,
blocks_embedding_dims=item_blocks_embedding_dims,
base_embedding_dim=base_embedding_dim,
name='item_embedding')(item_input)
logits = layers.Dot(axes=1, name='dot_similarity')(
[reviewer_embedding, item_embedding])
prediction = keras.activations.sigmoid(logits) * 5
model = keras.Model(inputs=[reviewer_input, item_input], outputs=prediction,
name='memory_model')
return model
Let's now create the memory efficient model and examine the model summary.
memory_efficient_model = create_memory_efficient_model()
memory_efficient_model.summary()
The number of trainable parameters is now 3,349,104, down from 7,892,864, less than half the number of parameters in the baseline model.
Let's now set the path where the model is saved and set up the logs/callbacks: the EarlyStopping class, which monitors the val_loss and stops training if it does not improve after 3 epochs; the ModelCheckpoint class, which monitors the val_loss and saves only the best model with the minimum loss; and the TensorBoard class, which enables visualizations in TensorBoard.
Now, the memory_efficient_model can be trained by calling fit on the train dataset for 30 epochs using batch_size=1024 and the specified callbacks from callbacks_list.
filepath = 'baselineFiltered_weights_only_b1024_memOpt.h5'
checkpoint_dir = os.path.dirname(filepath)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
histogram_freq=1)
callbacks_list = [EarlyStopping(monitor='val_loss', patience=3),
ModelCheckpoint(filepath, monitor='val_loss',
save_best_only=True, mode='min'),
tensorboard_callback]
history = run_experiment(memory_efficient_model)
We can save the model and plot the loss
, val_loss
, mae
and val_mae
over the training epochs.
memory_efficient_model.save('./memoryEfficient_weights_only_b1024_tf.h5',
save_format='tf')
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Eval'], loc='upper right')
plt.show()
plt.plot(history.history['mae'])
plt.plot(history.history['val_mae'])
plt.title('Model MAE')
plt.ylabel('MAE')
plt.xlabel('Epoch')
plt.legend(['Train', 'Eval'], loc='upper right')
plt.show()
Let's now load the test set and the model so we can evaluate
the trained model for mae
using the test set.
eval_dataset = get_dataset_from_csv('eval_filtered.csv', batch_size,
shuffle=False)
model = tf.keras.models.load_model('./memoryEfficient_weights_only_b1024_tf.h5')
result = model.evaluate(eval_dataset, return_dict=True, verbose=0)
print('\nEvaluation on the test set:')
display(result)
Ranking System in Tensorflow
¶
We can utilize the previous approach to set up the environment, including setting the seed with an init_seeds function for reproducibility.
!pip install tensorflow-recommenders==0.7.2
!pip install tensorflow==2.9.2
import tensorflow as tf
print('CUDA and NVIDIA GPU Information')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
print('\n')
print('TensorFlow version: {}'.format(tf.__version__))
print('Eager execution is: {}'.format(tf.executing_eagerly()))
print('Keras version: {}'.format(tf.keras.__version__))
print('Num GPUs Available: ', len(tf.config.list_physical_devices('GPU')))
print('\n')
Let's now load the pandas.DataFrame as a dictionary into a tf.data.Dataset and map the variables with a lambda function. Then we can prepare the train/test sets by shuffling and sampling. Finally, we generate a vocabulary that maps the features to the embedding vectors in the model, determine the unique items and reviewer_ids for the vocabulary, and set the dimensions for the embeddings.
from sklearn.utils import shuffle
ratings = tf.data.Dataset.from_tensor_slices(dict(df))
ratings = ratings.map(lambda x: {'item': x['item'],
'reviewer_id': x['reviewerID'],
'rating': x['rating']})
shuffled = ratings.shuffle(1_000_000, seed=42, reshuffle_each_iteration=False)
train = shuffled.take(500_000)
test = shuffled.skip(500_000).take(100_000)
items = ratings.batch(3_000_000).map(lambda x: x['item'])
reviewer_ids = ratings.batch(3_000_000).map(lambda x: x['reviewer_id'])
unique_items = np.unique(np.concatenate(list(items)))
unique_reviewer_ids = np.unique(np.concatenate(list(reviewer_ids)))
embedding_dimension = 64
Model Architecture and Metrics
¶
We can use a class RankingModel to set up the ranking model, containing embeddings for the reviewers and items built from the created vocabulary for each. The model is then defined with multiple stacked Dense layers using activation='relu' and a decreasing number of nodes.
class RankingModel(tf.keras.Model):
def __init__(self):
"""
Initialize the model by setting up the layers.
"""
super().__init__()
self.reviewer_embeddings = tf.keras.Sequential([
tf.keras.layers.StringLookup(vocabulary=unique_reviewer_ids,
mask_token=None),
tf.keras.layers.Embedding(len(unique_reviewer_ids) + 1,
embedding_dimension)])
self.item_embeddings = tf.keras.Sequential([
tf.keras.layers.StringLookup(vocabulary=unique_items, mask_token=None),
tf.keras.layers.Embedding(len(unique_items) + 1, embedding_dimension)])
self.ratings = tf.keras.Sequential([
tf.keras.layers.Dense(256, activation='relu'),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(1)])
def call(self, inputs):
reviewer_id, item = inputs
reviewer_embedding = self.reviewer_embeddings(reviewer_id)
item_embedding = self.item_embeddings(item)
return self.ratings(tf.concat([reviewer_embedding, item_embedding], axis=1))
Let's now examine a sample:
RankingModel()((['31'], ['2726956']))
Next, we can define the loss and metrics using tensorflow_recommenders.
import tensorflow_recommenders as tfrs
task = tfrs.tasks.Ranking(loss=tf.keras.losses.MeanSquaredError(),
metrics=[tf.keras.metrics.RootMeanSquaredError()])
We can now define the complete model with another class, AmazonReviewsModel, where the RankingModel() is added in the __init__ method and the compute_loss method computes the loss and the metrics given the features as input; the tfrs.models.Model base class provides the training loop.
import pprint
from typing import Dict, Text
class AmazonReviewsModel(tfrs.models.Model):
def __init__(self):
super().__init__()
self.ranking_model: tf.keras.Model = RankingModel()
self.task: tf.keras.layers.Layer = tfrs.tasks.Ranking(
loss = tf.keras.losses.MeanSquaredError(),
metrics=[tf.keras.metrics.RootMeanSquaredError()])
def call(self, features: Dict[str, tf.Tensor]) -> tf.Tensor:
return self.ranking_model((features['reviewer_id'], features['item']))
def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
labels = features.pop('rating')
rating_predictions = self(features)
return self.task(labels=labels, predictions=rating_predictions)
Now, we can initiate the model and prepare the train/test sets.
model = AmazonReviewsModel()
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.01))
cached_train = train.shuffle(1_000_000).batch(1).cache()
cached_test = test.batch(1).cache()
Fit, Evaluate and Save Model
¶
Let's now set the path where the model is saved and set up the logs/callbacks: the EarlyStopping class, which monitors the root_mean_squared_error and stops training if it does not improve after 3 epochs; the ModelCheckpoint class, which monitors the root_mean_squared_error and saves the model weights with mode='min'; and the TensorBoard class, which enables visualizations in TensorBoard.
Now the model can be trained by calling fit on the cached_train dataset for 20 epochs (an epoch is one iteration over the entire cached_train set), using the batch size of 1 set when caching (the number of samples per gradient update) and the specified callbacks from callbacks_list.
import datetime
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
!rm -rf ./logs/
%load_ext tensorboard
log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
filepath = 'rankingBaseline_MoviesTV_weights_only.h5'
checkpoint_dir = os.path.dirname(filepath)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
histogram_freq=1)
callbacks_list = [EarlyStopping(monitor='root_mean_squared_error', patience=3),
ModelCheckpoint(filepath, monitor='root_mean_squared_error',
save_weights_only=True, mode='min'),
tensorboard_callback]
history = model.fit(cached_train, epochs=20, callbacks=callbacks_list)
Now, let's save the model if we need to reload it in the future and evaluate
on the test set.
tf.saved_model.save(model, 'rankingBaseline_MoviesTV_20epochs')
model.evaluate(cached_test, return_dict=True)
We can examine metrics from the model and plot the model loss
over the training epochs.
import matplotlib.pyplot as plt
plt.plot(history.history['loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train'], loc='upper right')
plt.show()
Let's now load the trained model and evaluate
using the cached test set.
loaded = tf.saved_model.load('rankingBaseline_MoviesTV_20epochs')
loaded({'reviewer_id': np.array(['AV6QDP8Q0ONK4']),
'item': ['B009934S5M']}).numpy()
model.evaluate(cached_test, return_dict=True)
Predictions
¶
We can test out the Ranking Model by computing predictions for a subset of the items and ranking them based on those predictions.
AV6QDP8Q0ONK4
¶
This reviewer had 3981 ratings in the filtered set.
test_ratings = {}
test_item = ['B009934S5M','B00PY4Q9OS', 'B00R8GUXPG','B00Q0G2VXM','B00NYC65M8']
for item in test_item:
test_ratings[item] = model({
'reviewer_id': np.array(['AV6QDP8Q0ONK4']),
'item': np.array([item])})
print('Ratings:')
for title, score in sorted(test_ratings.items(), key=lambda x: x[1],
reverse=True):
print(f'{title}: {score}')
ABO2ZI2Y5DQ9T
¶
This reviewer had 2068 ratings in the filtered set.
test_ratings = {}
test_item = ['B009934S5M','B00PY4Q9OS', 'B00R8GUXPG','B00Q0G2VXM','B00NYC65M8']
for item in test_item:
test_ratings[item] = model({
'reviewer_id': np.array(['ABO2ZI2Y5DQ9T']),
'item': np.array([item])})
print('Ratings:')
for title, score in sorted(test_ratings.items(), key=lambda x: x[1],
reverse=True):
print(f'{title}: {score}')
test_ratings = {}
test_item = ['B00KHW4XHM', 'B01F6EHOIK', 'B00199PP6U', 'B00009IAXL']
for item in test_item:
test_ratings[item] = model({
'reviewer_id': np.array(['ABO2ZI2Y5DQ9T']),
'item': np.array([item])})
print('Ratings:')
for title, score in sorted(test_ratings.items(), key=lambda x: x[1],
reverse=True):
print(f'{title}: {score}')
test_ratings = {}
test_item = ['B0006FFRKM', '6304492952', 'B009934S5M', 'B000HT3QBY']
for item in test_item:
test_ratings[item] = model({
'reviewer_id': np.array(['ABO2ZI2Y5DQ9T']),
'item': np.array([item])})
print('Ratings:')
for title, score in sorted(test_ratings.items(), key=lambda x: x[1],
reverse=True):
print(f'{title}: {score}')
A328S9RN3U5M68
¶
This reviewer had 1997 ratings in the filtered set.
test_ratings = {}
test_item = ['B009934S5M','B00PY4Q9OS', 'B00R8GUXPG','B00Q0G2VXM','B00NYC65M8']
for item in test_item:
test_ratings[item] = model({
'reviewer_id': np.array(['A328S9RN3U5M68']),
'item': np.array([item])})
print('Ratings:')
for title, score in sorted(test_ratings.items(), key=lambda x: x[1],
reverse=True):
print(f'{title}: {score}')
test_ratings = {}
test_item = ['B00KHW4XHM', 'B01F6EHOIK', 'B00199PP6U', 'B00009IAXL']
for item in test_item:
test_ratings[item] = model({
'reviewer_id': np.array(['A328S9RN3U5M68']),
'item': np.array([item])})
print('Ratings:')
for title, score in sorted(test_ratings.items(), key=lambda x: x[1],
reverse=True):
print(f'{title}: {score}')
test_ratings = {}
test_item = ['B0006FFRKM', '6304492952', 'B009934S5M', 'B000HT3QBY']
for item in test_item:
test_ratings[item] = model({
'reviewer_id': np.array(['A328S9RN3U5M68']),
'item': np.array([item])})
print('Ratings:')
for title, score in sorted(test_ratings.items(), key=lambda x: x[1],
reverse=True):
print(f'{title}: {score}')
A3MV1KKHX51FYT
¶
This reviewer had 1986 ratings in the filtered set.
test_ratings = {}
test_item = ['B009934S5M','B00PY4Q9OS', 'B00R8GUXPG','B00Q0G2VXM','B00NYC65M8']
for item in test_item:
test_ratings[item] = model({
'reviewer_id': np.array(['A3MV1KKHX51FYT']),
'item': np.array([item])})
print('Ratings:')
for title, score in sorted(test_ratings.items(), key=lambda x: x[1],
reverse=True):
print(f'{title}: {score}')
test_ratings = {}
test_item = ['B00KHW4XHM', 'B01F6EHOIK', 'B00199PP6U', 'B00009IAXL']
for item in test_item:
test_ratings[item] = model({
'reviewer_id': np.array(['A3MV1KKHX51FYT']),
'item': np.array([item])})
print('Ratings:')
for title, score in sorted(test_ratings.items(), key=lambda x: x[1],
reverse=True):
print(f'{title}: {score}')
test_ratings = {}
test_item = ['B0006FFRKM', '6304492952', 'B009934S5M', 'B000HT3QBY']
for item in test_item:
test_ratings[item] = model({
'reviewer_id': np.array(['A3MV1KKHX51FYT']),
'item': np.array([item])})
print('Ratings:')
for title, score in sorted(test_ratings.items(), key=lambda x: x[1],
reverse=True):
print(f'{title}: {score}')
We can also convert the TensorFlow model to a TensorFlow Lite model to run on-device.
converter = tf.lite.TFLiteConverter.from_saved_model('rankingBaseline_MoviesTV_20epochs')
tflite_model = converter.convert()
open('converted_model.tflite', 'wb').write(tflite_model)
Let's now get the input and output tensors and test the model for a few of the observations in the set.
interpreter = tf.lite.Interpreter(model_path='converted_model.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
if input_details[0]['name'] == 'serving_default_item:0':
interpreter.set_tensor(input_details[0]['index'], np.array(['B009934S5M']))
interpreter.set_tensor(input_details[1]['index'],
np.array(['AV6QDP8Q0ONK4']))
else:
interpreter.set_tensor(input_details[0]['index'],
np.array(['AV6QDP8Q0ONK4']))
interpreter.set_tensor(input_details[1]['index'], np.array(['B009934S5M']))
interpreter.invoke()
rating = interpreter.get_tensor(output_details[0]['index'])
print(rating)
if input_details[0]['name'] == 'serving_default_item:0':
interpreter.set_tensor(input_details[0]['index'], np.array(['B00PY4Q9OS']))
interpreter.set_tensor(input_details[1]['index'],
np.array(['ABO2ZI2Y5DQ9T']))
else:
interpreter.set_tensor(input_details[0]['index'],
np.array(['ABO2ZI2Y5DQ9T']))
interpreter.set_tensor(input_details[1]['index'], np.array(['B00PY4Q9OS']))
interpreter.invoke()
rating = interpreter.get_tensor(output_details[0]['index'])
print(rating)
if input_details[0]['name'] == 'serving_default_item:0':
interpreter.set_tensor(input_details[0]['index'], np.array(['B00R8GUXPG']))
interpreter.set_tensor(input_details[1]['index'],
np.array(['A328S9RN3U5M68']))
else:
interpreter.set_tensor(input_details[0]['index'],
np.array(['A328S9RN3U5M68']))
interpreter.set_tensor(input_details[1]['index'], np.array(['B00R8GUXPG']))
interpreter.invoke()
rating = interpreter.get_tensor(output_details[0]['index'])
print(rating)
Recommendation Systems using Surprise
¶
The data is loaded from a pandas.DataFrame using the Reader class prior to modeling. For the initial training of the models using surprise, the default parameters of BaselineOnly, KNNBaseline, KNNBasic, KNNWithMeans, KNNWithZScore, CoClustering, SVD, SVDpp (an extension of SVD that takes into account implicit ratings), NMF and NormalPredictor were evaluated with the cross_validate method using 3-fold cross validation, to determine which algorithm yielded the lowest RMSE.
Basic algorithms
¶
- NormalPredictor: predicts a random rating based on the distribution of the training set, which is assumed to be normal. This is one of the most basic algorithms and does very little work.
- BaselineOnly: predicts the baseline estimate for a given user and item.
k-NN algorithms
¶
KNNBasic: KNNBasic is a basic collaborative filtering algorithm
KNNWithMeans: KNNWithMeans is basic collaborative filtering algorithm, taking into account the mean ratings of each user
KNNWithZScore: KNNWithZScore is a basic collaborative filtering algorithm, taking into account the z-score normalization of each user
KNNBaseline: KNNBaseline is a basic collaborative filtering algorithm taking into account a baseline rating
Matrix Factorization-based algorithms
¶
-SVD: SVD algorithm is equivalent to Probabilistic Matrix Factorization
-SVDpp: The SVDpp algorithm is an extension of SVD that takes into account implicit ratings
-NMF: NMF is a collaborative filtering algorithm based on Non-negative Matrix Factorization. It is very similar with SVD
Co-clustering: Coclustering is a collaborative filtering algorithm based on co-clustering
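For reference, the prediction formulas from the Surprise documentation for the baseline estimate, SVD and SVDpp are:

$$\hat{r}_{ui} = \mu + b_u + b_i$$

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u$$

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top}\left(p_u + |I_u|^{-\frac{1}{2}} \sum_{j \in I_u} y_j\right)$$

where $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, $p_u$ and $q_i$ are the latent factor vectors, and the $y_j$ terms capture implicit feedback from the set $I_u$ of items rated by user $u$.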
This revealed that SVDpp generated the lowest RMSE, but it took the longest to fit and test. The default parameters for SVDpp use 20 epochs for fitting the model, so experimenting with fewer epochs and other model parameters will reduce the runtime while potentially maintaining a low RMSE. The results from KNNBaseline demonstrate a close loss with a significantly lower runtime, so hyperparameter tuning might make it a better choice, especially given larger sample sizes.
from surprise import Dataset, Reader
import time
from surprise import BaselineOnly, KNNBaseline, KNNBasic, KNNWithMeans
from surprise import KNNWithZScore, CoClustering, SVD, SVDpp, NMF, NormalPredictor
from surprise.model_selection import cross_validate
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['reviewer_id', 'item_id', 'rating']], reader)
print('Time for iterating through different algorithms..')
search_time_start = time.time()
benchmark = []
for algorithm in [BaselineOnly(), KNNBaseline(), KNNBasic(), KNNWithMeans(),
KNNWithZScore(), CoClustering(), SVD(), SVDpp(), NMF(),
NormalPredictor()]:
results = cross_validate(algorithm, data, measures=['RMSE'], cv=3,
verbose=False, n_jobs=-1)
tmp = pd.DataFrame.from_dict(results).mean(axis=0)
# Series.append was removed in newer pandas versions; use pd.concat instead
tmp = pd.concat([tmp,
                 pd.Series([str(algorithm).split(' ')[0].split('.')[-1]],
                           index=['Algorithm'])])
benchmark.append(tmp)
print('Finished iterating through different algorithms:',
time.time() - search_time_start)
surprise_results = pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')
print('Results from testing different algorithms:')
print(surprise_results)
del surprise_results
Let's read the train/test sets and load them using the reader from surprise. Since fitting in surprise requires a Trainset and testing requires a list of (user, item, rating) tuples, we build these from the loaded Dataset objects.
train = pd.read_csv('train_filtered.csv', sep='|')
train.columns = ['rating', 'item_id', 'reviewer_id']
train['reviewer_id'] = train['reviewer_id'].str.extract(pat='(\d+)',
expand=False)
train['item_id'] = train['item_id'].str.extract(pat='(\d+)', expand=False)
train['reviewer_id'] = train['reviewer_id'].astype(int)
train['item_id'] = train['item_id'].astype(int)
test = pd.read_csv('eval_filtered.csv', sep='|')
test.columns = ['rating', 'item_id', 'reviewer_id']
test['reviewer_id'] = test['reviewer_id'].str.extract(pat='(\d+)', expand=False)
test['item_id'] = test['item_id'].str.extract(pat='(\d+)', expand=False)
test['reviewer_id'] = test['reviewer_id'].astype(int)
test['item_id'] = test['item_id'].astype(int)
train = Dataset.load_from_df(train[['reviewer_id', 'item_id', 'rating']],
                             reader).build_full_trainset()
test = Dataset.load_from_df(test[['reviewer_id', 'item_id', 'rating']],
                            reader).build_full_trainset().build_testset()
SVDpp with Lowest RMSE
¶
Then, we can set the path to save the results, fit the model using 3-fold cross validation with the default parameters for 3 epochs, and examine the RMSE for the train/test sets.
from surprise import accuracy, dump
print('Train/predict using SVDpp default parameters for 3 epochs:')
print('\n')
print('Time for iterating through SVDpp default parameters..')
search_time_start = time.time()
algo = SVDpp(n_epochs=3, random_state=seed_value)
cv = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=False,
n_jobs=-1)
print('Finished iterating through SVDpp default parameters:',
time.time() - search_time_start)
print('\n')
print('Cross validation results:')
for key, value in cv.items():
print(key, ' : ', value)
print('\n')
predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./SVDpp_3epochs_DefaultParamModel_file', predictions, algo)
#predictions, algo = dump.load('./SVDpp_3epochs_DefaultParamModel_file')
Since we removed the original item and reviewerID strings and created numerical ids to handle the sparsity issue, let's read the Movies_and_TV_idMatch.csv set so the original features are present.
df = pd.read_csv('Movies_and_TV_idMatch.csv')
df = df.drop_duplicates()
print('Sample observations:')
df.head()
To examine the model prediction results, let's now define two functions: get_Ir, which determines the number of items rated by a given reviewer, and get_Ri, which determines the number of reviewers that rated a given item.
def get_Ir(reviewer_id):
"""
Determine the number of items rated by given reviewer
Args:
reviewer_id: the id of the reviewer
Returns:
Number of items rated by the reviewer
"""
try:
return len(train.ur[train.to_inner_uid(reviewer_id)])
except ValueError:
return 0
def get_Ri(item_id):
"""
Determine number of reviewers that rated given item
Args:
item_id: the id of the item
Returns:
Number of reviewers that have rated the item
"""
try:
return len(train.ir[train.to_inner_iid(item_id)])
except ValueError:
return 0
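As a quick sanity check, the helpers can be called directly; a minimal sketch, where the integer ids are arbitrary examples:
# Arbitrary ids purely for illustration; ids unseen in the trainset return 0
print('Items rated by reviewer 0:', get_Ir(0))
print('Reviewers who rated item 0:', get_Ri(0))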
We can convert the predictions from the model to a pandas.DataFrame, apply the functions and examine the top 10 best predictions:
df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)
df2 = pd.merge(df, df1, how='right',
left_on=['reviewer_id', 'item_id'],
right_on=['reviewer_id', 'item_id'])
df2 = df2.drop_duplicates()
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)
Let's examine the worst 10 predictions as well:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)
del predictions, df2, best_predictions, worst_predictions
Hyperparameter optimization (HPO) using Grid Search
¶
Hyperparameter optimization using GridSearchCV was performed to find the best parameters. Since this algorithm is computationally expensive with gradient descent, the number of epochs was fixed at 20. Numbers of factors smaller than the default n_factors=20 were tested, and values around the default lr_all=0.007 and reg_all=0.02 were included in the search. Let's now define the parameters for the grid search.
param_grid = {'n_epochs': [20],
'n_factors': [5, 10, 15],
'lr_all': [7e-4, 7e-3, 7e-2],
'reg_all': [7e-2, 5e-2, 2e-2, 7e-1],
'random_state': [seed_value]}
print('Grid search parameters:')
param_grid
Now we can run the grid search with rmse
and mae
as the metrics. Then examine the parameters that resulted in the lowest RMSE
.
from surprise.model_selection import GridSearchCV
gs = GridSearchCV(SVDpp, param_grid, measures=['rmse', 'mae'], cv=3,
joblib_verbose=-1, n_jobs=-1)
print('Time for iterating grid search parameters..')
search_time_start = time.time()
gs.fit(data)
print('Finished iterating grid search parameters:',
time.time() - search_time_start)
print('\n')
print('Lowest RMSE from Grid Search:')
print(gs.best_score['rmse'])
print('\n')
print('Parameters of Model with lowest RMSE from Grid Search:')
print(gs.best_params['rmse'])
results_df = pd.DataFrame.from_dict(gs.cv_results)
results_df = results_df.sort_values('mean_test_rmse', ascending=True)
print('SVDpp GridSearch HPO Cross Validation Results:')
print(results_df.head())
del results_df
Then use the parameters that resulted in the lowest rmse
on the train/test sets, predict, apply functions and save the prediction results.
algo = gs.best_estimator['rmse']
predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./SVDpp_bestGrid_Model_file', predictions, algo)
#predictions, algo = dump.load('./SVDpp_bestGrid_Model_file')
df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)
df2 = pd.merge(df, df1, how='right',
left_on=['reviewer_id', 'item_id'],
right_on=['reviewer_id', 'item_id'])
del df1
Let's now examine the top 10 best predictions:
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)
Let's examine the worst 10 predictions as well:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)
del predictions, df2, best_predictions, worst_predictions
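Before moving on, note that the saved dump can be reloaded later and reused for single predictions; a minimal sketch, assuming the file above exists (the uid/iid values are arbitrary integer ids from the filtered sets):
predictions, algo = dump.load('./SVDpp_bestGrid_Model_file')
# Estimate the rating one reviewer would give one item; .est holds the value
print(algo.predict(uid=0, iid=0).est)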
SVD
¶
Let's now fit the model using 3-fold cross validation with the default parameters for 3 epochs and examine the RMSE for the train/test sets.
print('Train/predict using SVD default parameters for 3 epochs:')
print('\n')
print('Time for iterating through SVD default parameters..')
search_time_start = time.time()
algo = SVD(n_epochs=3, random_state=seed_value)
cv = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=False,
n_jobs=-1)
print('Finished iterating through SVD default parameters:',
time.time() - search_time_start)
print('\n')
print('Cross validation results:')
for key, value in cv.items():
print(key, ' : ', value)
print('\n')
predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./SVD_3epochs_DefaultParamModel_file', predictions, algo)
#predictions, algo = dump.load('./SVD_3epochs_DefaultParamModel_file')
df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)
df2 = pd.merge(df, df1, how='right',
left_on=['reviewer_id', 'item_id'],
right_on=['reviewer_id', 'item_id'])
df2.to_csv('predictions_SVD_DefaultParamModel.csv', index=False)
del df1
Let's examine the best 10 predictions:
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)
Let's examine the worst 10 predictions as well:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)
del predictions, df2, best_predictions, worst_predictions
HPO using Grid Search
¶
Let's now define the parameters for the grid search. Note that this grid contains 9 × 7 × 6 × 6 = 2,268 parameter combinations, each evaluated with 3-fold cross validation, so it is computationally expensive.
param_grid = {'n_epochs': [30, 35, 40, 45, 50, 55, 60, 65, 70],
'n_factors': [20, 25, 30, 35, 40, 45, 50],
'lr_all': [0.002, 0.003, 0.004, 0.005, 0.006, 0.007],
'reg_all': [0.0001, 0.001, 0.01, 0.02, 0.03, 0.04],
'random_state': [seed_value]}
print('SVD HPO Grid search parameters:')
param_grid
Now we can run the grid search with rmse
and mae
as the metrics. Then examine the parameters that resulted in the lowest RMSE
.
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3,
joblib_verbose=-1, n_jobs=-1)
print('Time for iterating grid search parameters..')
search_time_start = time.time()
gs.fit(data)
print('Finished iterating grid search parameters:',
time.time() - search_time_start)
print('\n')
print('Lowest RMSE from Grid Search:')
print(gs.best_score['rmse'])
print('\n')
print('Parameters of Model with lowest RMSE from Grid Search:')
print(gs.best_params['rmse'])
results_df = pd.DataFrame.from_dict(gs.cv_results)
results_df = results_df.sort_values('mean_test_rmse', ascending=True)
print('SVD GridSearch HPO Cross Validation Results:')
print(results_df.head())
del results_df
Then use the parameters that resulted in the lowest RMSE
on the train/test sets, predict, apply functions and save the prediction results.
algo = gs.best_estimator['rmse']
predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
print('\n')
dump.dump('./SVD_bestGrid_Model_file', predictions, algo)
#predictions, algo = dump.load('./SVD_bestGrid_Model_file')
df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)
df2 = pd.merge(df, df1, how='right',
left_on=['reviewer_id', 'item_id'],
right_on=['reviewer_id', 'item_id'])
del df1
We can examine the top 10 best predictions:
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)
Let's examine the worst 10 predictions as well:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)
del predictions, df2, best_predictions, worst_predictions
BaselineOnly
¶
Now we can run the 3-fold cross validation using the default parameters for 3 epochs with method='als', using RMSE and MAE as the metrics. Then predict with the model, apply the functions and save the prediction results.
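For reference, the baseline estimates in the Surprise documentation are found by minimizing the following regularized squared error, which method='als' optimizes by alternating between the user and item biases:

$$\sum_{r_{ui} \in R_{train}} \left(r_{ui} - (\mu + b_u + b_i)\right)^2 + \lambda \left(b_u^2 + b_i^2\right)$$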
print('Train/predict using Baseline default parameters with Alternating Least Squares for 3 epochs:')
print('\n')
bsl_options = {'method': 'als',
'n_epochs': 3}
print('Baselines estimates configuration:')
print(bsl_options)
print('\n')
print('Model parameters:')
print(BaselineOnly(bsl_options=bsl_options))
print('Time for iterating through Baseline default parameters epochs=3 using ALS..')
search_time_start = time.time()
algo = BaselineOnly(bsl_options=bsl_options)
cv = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=False,
n_jobs=-1)
print('Finished iterating through Baseline default parameters epochs=3 using ALS:',
time.time() - search_time_start)
print('\n')
print('Cross validation results:')
for key, value in cv.items():
print(key, ' : ', value)
predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./Baseline_3epochs_DefaultParamModel_file', predictions, algo)
#predictions, algo = dump.load('./Baseline_3epochs_DefaultParamModel_file')
df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)
df2 = pd.merge(df, df1, how='right',
left_on=['reviewer_id', 'item_id'],
right_on=['reviewer_id', 'item_id'])
del df1
We can examine the top 10 best predictions:
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)
Let's examine the worst 10 predictions as well:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)
del predictions, df2, best_predictions, worst_predictions
HPO using Grid Search
¶
Let's now define the parameters for the grid search.
print('BaselineOnly HPO using Grid Search Minimized:')
print('\n')
param_grid = {'bsl_options': {'method': ['als', 'sgd'],
'n_epochs': [10, 15, 20, 25, 30, 35, 40, 45, 50],
'reg_u': [0, 1, 2, 3, 4, 5, 10, 15, 20],
'reg_i': [0, 1, 2, 3, 4, 5, 10, 15, 20]}}
print('Grid search parameters:')
param_grid
Now we can run the grid search with rmse
and mae
as the metrics. Then examine the parameters that resulted in the lowest rmse
.
gs = GridSearchCV(BaselineOnly, param_grid, measures=['rmse', 'mae'], cv=3,
joblib_verbose=-1, n_jobs=-1)
print('Time for iterating grid search parameters..')
search_time_start = time.time()
gs.fit(data)
print('Finished iterating grid search parameters:',
time.time() - search_time_start)
print('\n')
print('Lowest RMSE from Grid Search:')
print(gs.best_score['rmse'])
print('\n')
print('Parameters of Model with lowest RMSE from Grid Search:')
print(gs.best_params['rmse'])
Then use the parameters that resulted in the lowest rmse
on the train/test sets, predict, apply functions and save the prediction results.
algo = gs.best_estimator['rmse']
predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./BaselineOnly_moreParams_bestGrid_Model_file', predictions, algo)
#predictions, algo = dump.load('./BaselineOnly_moreParams_bestGrid_Model_file')
df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)
df2 = pd.merge(df, df1, how='right',
left_on=['reviewer_id', 'item_id'],
right_on=['reviewer_id', 'item_id'])
del df1
We can examine the top 10 best predictions:
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)
Let's examine the worst 10 predictions as well:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)
del predictions, df2, best_predictions, worst_predictions
KNNBaseline
¶
Now we can run the 3-fold cross validation using the default parameters for 3 epochs with method='als' for the baseline estimates, using RMSE and MAE as the metrics. Then predict with the model, apply the functions and save the prediction results.
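For reference, the user-based prediction in the Surprise documentation combines the baseline estimate with the similarity-weighted, baseline-adjusted ratings of the k nearest neighbors:

$$\hat{r}_{ui} = b_{ui} + \frac{\sum_{v \in N_i^k(u)} \text{sim}(u, v) \cdot (r_{vi} - b_{vi})}{\sum_{v \in N_i^k(u)} \text{sim}(u, v)}$$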
print('Train/predict using KNNBaseline default parameters with Alternating Least Squares for 3 epochs:')
print('\n')
bsl_options = {'method': 'als',
'n_epochs': 3}
print('Baselines estimates configuration:')
print(bsl_options)
print('\n')
print('Model parameters:')
print('KNNBaseline(k=40, min_k=1, bsl_options=bsl_options, verbose=False)')
print('Time for iterating through KNNBaseline default parameters epochs=3 using ALS..')
search_time_start = time.time()
algo = KNNBaseline(bsl_options=bsl_options)
cv = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=False,
n_jobs=-1)
print('Finished iterating through KNNBaseline default parameters epochs=3 using ALS:',
time.time() - search_time_start)
print('\n')
print('Cross validation results:')
for key, value in cv.items():
print(key, ' : ', value)
predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./KNNBaseline_3epochs_DefaultParamModel_file', predictions, algo)
#predictions, algo = dump.load('./KNNBaseline_3epochs_DefaultParamModel_file')
df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)
df2 = pd.merge(df, df1, how='right',
left_on=['reviewer_id', 'item_id'],
right_on=['reviewer_id', 'item_id'])
del df1
We can examine the top 10 best predictions:
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)
Let's examine the worst 10 predictions as well:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)
del predictions, df2, best_predictions, worst_predictions
HPO using Grid Search
¶
Let's now define the parameters for the grid search.
print('KNNBaseline HPO using Grid Search:')
print('\n')
param_grid = {'bsl_options': {'method': ['als', 'sgd']},
'k': [30, 35, 40, 45, 50],
'min_k': [5, 10],
'random_state': [seed_value],
'sim_options': {'name': ['pearson_baseline'],
'min_support': [5, 10],
'shrinkage': [0, 100]}}
print('Grid search parameters:')
param_grid
Now we can run the grid search with rmse
and mae
as the metrics. Then examine the parameters that resulted in the lowest rmse
.
gs = GridSearchCV(KNNBaseline, param_grid, measures=['rmse', 'mae'], cv=3,
joblib_verbose=-1, n_jobs=5)
print('Start time for iterating grid search parameters..')
search_time_start = time.time()
gs.fit(data)
print('Finished iterating grid search parameters:',
time.time() - search_time_start)
print('\n')
print('Model with lowest RMSE:')
print(gs.best_score['rmse'])
print('\n')
print('Parameters with the lowest RMSE:')
print(gs.best_params['rmse'])
Then use the parameters that resulted in the lowest rmse
on the train/test sets, predict, apply functions and save the prediction results.
algo = gs.best_estimator['rmse']
predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
print('\n')
dump.dump('./KNNBaseline_bestGrid_Model_file', predictions, algo)
#predictions, algo = dump.load('./KNNBaseline_bestGrid_Model_file')
df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)
df2 = pd.merge(df, df1, how='right',
left_on=['reviewer_id', 'item_id'],
right_on=['reviewer_id', 'item_id'])
del df1
We can examine the top 10 best predictions:
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)
Let's examine the worst 10 predictions as well:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)
del predictions, df2, best_predictions, worst_predictions