Recommendation systems are ubiquitous in the digital age, whether they are recommending what to purchase on a website or what to watch or listen to online. They can be leveraged to increase profitability by using information gathered from a large number of individuals to personalize the experience for each individual. They can also be used to generate analytical reports on products' costs, profits and projected demand so that the needed supply is purchased.

   In this project, user ratings from a subset of the Amazon Reviews data are used to train and evaluate recommendation systems built with different algorithms. The models return a list of recommended items based on the reviewer's previous actions.

Data

   The Movies_and_TV ratings data was retrieved from the Recommender Systems and Personalization Datasets. It consists of (item, user, rating, timestamp) tuples. A subset of the data was used in the analysis because of the sparsity of the ratings and the resulting computational constraints.

Preprocessing

   The code used for preprocessing and EDA can be found in the Recommender System GitHub repository. First, the environment is set up with the dependencies, library options, the seed for reproducibility and the location of the project directory. Then the data is read, duplicate observations are dropped and the columns are named.

In [ ]:
import os
import random
import numpy as np
import warnings
import pandas as pd
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

seed_value = 42
os.environ['Recommender'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)

df = pd.read_csv('Movies_and_TV.csv', header=None, skiprows=[0],
                 low_memory=False)
df = df.drop_duplicates()

df.columns = ['item', 'reviewerID', 'rating', 'timestamp']

print('Sample observations:')
df.head()
Sample observations:
Out[ ]:
item reviewerID rating timestamp
0 0001527665 A2VHSG6TZHU1OB 5.0 1361145600
1 0001527665 A23EJWOW1TLENE 5.0 1358380800
2 0001527665 A1KM9FNEJ8Q171 5.0 1357776000
3 0001527665 A38LY2SSHVHRYB 4.0 1356480000
4 0001527665 AHTYUW2H1276L 5.0 1353024000

   Then a function data_summary is defined to examine the data for the number of missing observations, the data types and the number of unique values in the initial set. The timestamp variable is dropped since it will not be used.

In [ ]:
def data_summary(df):
    print('Number of Rows: {}, Columns: {}'.format(df.shape[0], df.shape[1]))
    a = pd.DataFrame()
    a['Number of Missing Values'] = df.isnull().sum()
    a['Data type of variable'] = df.dtypes
    a['Number of Unique Values'] = df.nunique()
    print(a)

print('Initial Data Summary:')
data_summary(df)

df = df.drop(['timestamp'], axis=1)
Initial Data Summary:
Number of Rows: 8522125, Columns: 4
            Number of Missing Values Data type of variable  \
item                               0                object   
reviewerID                         0                object   
rating                             0               float64   
timestamp                          0                 int64   

            Number of Unique Values  
item                         182032  
reviewerID                  3826085  
rating                            5  
timestamp                      7476  

The top 10 items in the initial set show that the most-rated item has 24,554 ratings while the 10th highest item has 14,174 ratings.

In [ ]:
items_top10 = df.groupby('item').size().sort_values(ascending=False)[:10]
print('Items with highest number of ratings in initial set:')
print(items_top10)
Items with highest number of ratings in initial set:
item
B00YSG2ZPA    24554
B00006CXSS    24485
B00AQVMZKQ    21015
B01BHTSIOC    20889
B00NAQ3EOK    16857
6305837325    16671
B00WNBABVC    15205
B017S3OP7A    14795
B009934S5M    14481
B00FL31UF0    14174
dtype: int64

The top 10 reviewers with the highest number of ratings in the initial set each have over 1,600 reviews.

In [ ]:
reviewers_top10 = df.groupby('reviewerID').size().sort_values(ascending=False)[:10]
print('Reviewers with highest number of ratings in initial set:')
print(reviewers_top10)
Reviewers with highest number of ratings in initial set:
reviewerID
AV6QDP8Q0ONK4     4101
A1GGOC9PVDXW7Z    2114
ABO2ZI2Y5DQ9T     2073
A328S9RN3U5M68    2059
A3MV1KKHX51FYT    1989
A2EDZH51XHFA9B    1842
A3LZGLA88K0LA0    1814
A16CZRQL23NOIW    1808
AIMR915K4YCN      1719
A2NJO6YE954DBH    1699
dtype: int64

Since the data is sparse, a new integer id is created for each item rather than using the initial string variable.

In [ ]:
value_counts = df['item'].value_counts(dropna=True, sort=True)
df1 = pd.DataFrame(value_counts)
df1 = df1.reset_index()
df1.columns = ['item_unique', 'counts']
df1 = df1.reset_index()
df1.rename(columns={'index': 'item_id'}, inplace=True)
df1 = df1.drop(['counts'], axis=1)
df = pd.merge(df, df1, how='left', left_on=['item'],
              right_on=['item_unique'])
df = df.drop_duplicates()
df = df.drop(['item_unique'], axis=1)

del value_counts, df1

   The same process is used for reviewerID. A key containing the new integer variables is created and saved so it can later be used to join back to the original data. The unnecessary keys are then dropped from the working set.

In [ ]:
value_counts = df['reviewerID'].value_counts(dropna=True, sort=True)
df1 = pd.DataFrame(value_counts)
df1 = df1.reset_index()
df1.columns = ['id_unique', 'counts']
df1 = df1.reset_index()
df1.rename(columns={'index': 'reviewer_id'}, inplace=True)
df1 = df1.drop(['counts'], axis=1)
df = pd.merge(df, df1, how='left', left_on=['reviewerID'],
              right_on=['id_unique'])
df = df.drop_duplicates()
df = df.drop(['id_unique'], axis=1)

del value_counts, df1

df1 = df[['item', 'item_id', 'reviewerID', 'reviewer_id']]
df1.to_csv('Movies_and_TV_idMatch.csv', index=False)

del df1

   The data is then filtered to reviewers who have 25 or more ratings/reviews due to sparsity. This results in a set containing 1,114,877 ratings from 19,702 unique reviewers over 103,731 unique items. The majority of the ratings are 5 stars.

In [ ]:
reviewer_count = df.reviewerID.value_counts()
df = df[df.reviewerID.isin(reviewer_count[reviewer_count >= 25].index)]
df = df.drop_duplicates()

del reviewer_count

print('- Number of ratings after filtering: ', len(df))
print('- Number of unique reviewers: ', df['reviewerID'].nunique())
print('- Number of unique items: ', df['itemID'].nunique())
for i in range(1,6):
  print('- Number of items with {0} rating = {1}'.format(i,
                                                         df[df['rating'] == i].shape[0]))
- Number of ratings after filtering:  1114877
- Number of unique reviewers:  19702
- Number of unique items:  103731
- Number of items with 1 rating = 59576
- Number of items with 2 rating = 65630
- Number of items with 3 rating = 141579
- Number of items with 4 rating = 252845
- Number of items with 5 rating = 595247

   The top 10 items in the filtered set show a large reduction, with the highest item dropping from 24,554 to 1,136 ratings and the 10th highest item dropping from 14,174 to 853 ratings.

In [ ]:
items_top10 = df.groupby('item').size().sort_values(ascending=False)[:10]
print('Items with highest number of ratings in the filtered set:')
print(items_top10)
Items with highest number of ratings in the filtered set:
item
B009934S5M    1136
B00PY4Q9OS    1042
B00R8GUXPG    1040
B00Q0G2VXM    1040
B00NYC65M8     965
B00OV3VGP0     903
B00DY64A3U     895
B00D91GRA4     870
B0059XTU3G     860
B005LAIHY0     853
dtype: int64

   The top reviewer in the filtered set dropped from 4,101 to 3,981 ratings, while the 10th highest reviewer dropped from 1,699 to 1,634 ratings.

In [ ]:
reviewers_top10 = df.groupby('reviewerID').size().sort_values(ascending=False)[:10]
print('Reviewers with highest number of ratings in the filtered set:')
print(reviewers_top10)
Reviewers with highest number of ratings in the filtered set:
reviewerID
AV6QDP8Q0ONK4     3981
ABO2ZI2Y5DQ9T     2068
A328S9RN3U5M68    1997
A3MV1KKHX51FYT    1986
A2EDZH51XHFA9B    1838
A3LZGLA88K0LA0    1811
A16CZRQL23NOIW    1797
A1GGOC9PVDXW7Z    1733
AIMR915K4YCN      1706
A20EEWWSFMZ1PN    1634
dtype: int64

Item-item collaborative filtering

   Item-item collaborative filtering (item-based collaborative filtering, IBCF) is a recommendation method that looks for similar items based on the items users have already liked or positively interacted with. It takes the items a user has previously consumed, finds other items similar to them and recommends those accordingly.
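
   As a toy illustration of the idea (this small rating matrix is made up and is not part of the project code), item-item similarities can be computed directly from a user-item matrix; items that tend to be rated the same way by the same users receive a high similarity and can be recommended alongside each other:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# rows = users, columns = items; 0 means the user has not rated the item
ratings = np.array([[5, 3, 0],
                    [4, 0, 0],
                    [1, 1, 5],
                    [0, 1, 4]])

# pairwise cosine similarity between the item columns
item_similarity = cosine_similarity(ratings.T)
print(np.round(item_similarity, 2))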

Collaborative Filtering in TensorFlow

   Let's first set up the environment by importing the necessary packages and examining the CUDA and NVIDIA GPU information as well as the TensorFlow and Keras versions for the runtime. To set the seed for reproducibility, we can use a function init_seeds that sets the random, numpy and tensorflow seeds as well as the environment variables and session configuration.

In [ ]:
import tensorflow as tf
from tensorflow import keras
print('CUDA and NVIDIA GPU Information')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
print('\n')
print('TensorFlow version: {}'.format(tf.__version__))
print('Eager execution is: {}'.format(tf.executing_eagerly()))
print('Keras version: {}'.format(tf.keras.__version__))
print('Num GPUs Available: ', len(tf.config.list_physical_devices('GPU')))
print('\n')
CUDA and NVIDIA GPU Information
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
Mon Jan  9 22:34:20 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   44C    P0    25W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


TensorFlow version: 2.9.2
Eager execution is: True
Keras version: 2.9.0
Num GPUs Available:  1


In [ ]:
def init_seeds(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1,
                                            inter_op_parallelism_threads=1)
    os.environ['AmazonReviews_RecSysDL'] = str(seed)
    os.environ['TF_CUDNN_DETERMINISTIC'] = 'True'
    os.environ['TF_DETERMINISTIC_OPS'] = 'True'

    sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(),
                                config=session_conf)
    tf.compat.v1.keras.backend.set_session(sess)
    return sess

init_seeds(seed=42)
Out[ ]:
<tensorflow.python.client.session.Session at 0x7fab7f7b9e80>

   Now, we can use the LabelEncoder to encode reviewerID and itemID and count the unique users and items. Then, we can create minimum and maximum rating variables and set the number of factors (50) to be used in the model. Let's examine the number of unique items and users as well as the minimum and maximum rating:

In [ ]:
from sklearn.preprocessing import LabelEncoder

user_encode = LabelEncoder()
df['user'] = user_encode.fit_transform(df['reviewerID'].values)
n_users = df['user'].nunique()

item_encode = LabelEncoder()
df['Item'] = item_encode.fit_transform(df['itemID'].values)
n_items = df['Item'].nunique()

df['stars'] = df['rating'].values

min_rating = min(df['stars'])
max_rating = max(df['stars'])

n_factors = 50

print('Number of unique items:', n_items)
print('Number of unique users:', n_users)
print('Minimum rating:', min_rating)
print('Maximum rating:', max_rating)
Number of unique items: 103731
Number of unique users: 19702
Minimum rating: 1.0
Maximum rating: 5.0

   There are more unique items than unique users, and the minimum rating is 1 while the maximum rating is 5. Let's now set up the train/test sets using test_size=0.2 and examine the number of observations in each set.

In [ ]:
from sklearn.model_selection import train_test_split

X = df[['user', 'Item']].values
y = df['stars'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=seed_value)

del X, y

print('Train data split:', len(X_train))
print('Eval data split:', len(X_test))

X_train_array = [X_train[:,0], X_train[:,1]]
X_test_array = [X_test[:,0], X_test[:,1]]
Train data split: 891901
Eval data split: 222976

   Let's now set up the model architecture by creating an EmbeddingLayer class that takes the number of items and the number of factors as input, builds an Embedding layer with embeddings_initializer='he_normal' and embeddings_regularizer=l2(1e-6), and then reshapes the output.

In [ ]:
from keras.layers import Embedding, Reshape
from keras.regularizers import l2

class EmbeddingLayer:
    def __init__(self, n_items, n_factors):
        self.n_items = n_items
        self.n_factors = n_factors

    def __call__(self, x):
        x = Embedding(self.n_items, self.n_factors,
                      embeddings_initializer='he_normal',
                      embeddings_regularizer=l2(1e-6))(x)
        x = Reshape((self.n_factors,))(x)

        return x

   We can define the Recommender function to set up the model architecture with the created n_users and n_items, the defined n_factors, min_rating and max_rating as input. The user and item embeddings are concatenated, followed by 5% Dropout, a Dense layer with 10 nodes using kernel_initializer='he_normal' (which draws samples from a truncated normal distribution), a relu activation and 20% Dropout. This feeds into a Dense layer with a sigmoid activation and a final Lambda layer that scales the output by the difference between max_rating and min_rating and then adds min_rating. The model can then be compiled using loss='mean_squared_error' and optimizer=Adam(lr=0.001).

In [ ]:
from keras.layers import Input, Concatenate, Dropout, Dense, Activation, Lambda
from keras.models import Model
from keras.optimizers import Adam

def Recommender(n_users, n_items, n_factors, min_rating, max_rating):

    user = Input(shape=(1,))
    u = EmbeddingLayer(n_users, n_factors)(user)

    Item = Input(shape=(1,))
    m = EmbeddingLayer(n_items, n_factors)(Item)

    x = Concatenate()([u, m])
    x = Dropout(0.05)(x)
    x = Dense(10, kernel_initializer='he_normal')(x)
    x = Activation('relu')(x)
    x = Dropout(0.2)(x)
    x = Dense(1, kernel_initializer='he_normal')(x)
    x = Activation('sigmoid')(x)
    x = Lambda(lambda x: x * (max_rating - min_rating) + min_rating)(x)

    model = Model(inputs=[user, Item], outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='mean_squared_error', optimizer=opt)

    return model

   Now, we can examine the model architecture by calling the Recommender function with n_users, n_items, n_factors, min_rating and max_rating as input, saving the result as keras_model and calling summary.

In [ ]:
keras_model = Recommender(n_users, n_items, n_factors, min_rating, max_rating)
keras_model.summary()
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_2 (InputLayer)           [(None, 1)]          0           []                               
                                                                                                  
 input_3 (InputLayer)           [(None, 1)]          0           []                               
                                                                                                  
 embedding (Embedding)          (None, 1, 50)        985100      ['input_2[0][0]']                
                                                                                                  
 embedding_1 (Embedding)        (None, 1, 50)        5186550     ['input_3[0][0]']                
                                                                                                  
 reshape (Reshape)              (None, 50)           0           ['embedding[0][0]']              
                                                                                                  
 reshape_1 (Reshape)            (None, 50)           0           ['embedding_1[0][0]']            
                                                                                                  
 concatenate (Concatenate)      (None, 100)          0           ['reshape[0][0]',                
                                                                  'reshape_1[0][0]']              
                                                                                                  
 dropout (Dropout)              (None, 100)          0           ['concatenate[0][0]']            
                                                                                                  
 dense (Dense)                  (None, 10)           1010        ['dropout[0][0]']                
                                                                                                  
 activation (Activation)        (None, 10)           0           ['dense[0][0]']                  
                                                                                                  
 dropout_1 (Dropout)            (None, 10)           0           ['activation[0][0]']             
                                                                                                  
 dense_1 (Dense)                (None, 1)            11          ['dropout_1[0][0]']              
                                                                                                  
 activation_1 (Activation)      (None, 1)            0           ['dense_1[0][0]']                
                                                                                                  
 lambda (Lambda)                (None, 1)            0           ['activation_1[0][0]']           
                                                                                                  
==================================================================================================
Total params: 6,172,671
Trainable params: 6,172,671
Non-trainable params: 0
__________________________________________________________________________________________________

Batch Size = 16

   Let's now set the path where the model is saved and set up the logs/callbacks: the EarlyStopping class that monitors val_loss and stops training if it does not improve for 3 epochs, the ModelCheckpoint class that monitors val_loss and saves only the weights with the minimum loss, and the TensorBoard class that enables visualizations with TensorBoard.

In [ ]:
import datetime
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

!rm -rf /logs/

%load_ext tensorboard

log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')

filepath = 'CF_Baseline_b16_dropout0.2_weights_only.h5'
checkpoint_dir = os.path.dirname(filepath)

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
                                                      histogram_freq=1)

callbacks_list = [EarlyStopping(monitor='val_loss', patience=3),
                  ModelCheckpoint(filepath, monitor='val_loss',
                                  save_best_only=True, mode='min'),
                  tensorboard_callback]

   Now the model can be trained by calling fit on the training set for 10 epochs (the number of iterations over the entire X_train and y_train), using batch_size=16 (the number of samples per gradient update) and the callbacks specified in callbacks_list.

In [ ]:
history = keras_model.fit(x=X_train_array, y=y_train,
                          validation_data=(X_test_array, y_test),
                          batch_size=16, epochs=10, callbacks=callbacks_list)
Epoch 1/10
55744/55744 [==============================] - 289s 5ms/step - loss: 1.0271 - val_loss: 0.9597
Epoch 2/10
55744/55744 [==============================] - 291s 5ms/step - loss: 0.9599 - val_loss: 0.9527
Epoch 3/10
55744/55744 [==============================] - 281s 5ms/step - loss: 0.9551 - val_loss: 0.9528
Epoch 4/10
55744/55744 [==============================] - 281s 5ms/step - loss: 0.9517 - val_loss: 0.9474
Epoch 5/10
55744/55744 [==============================] - 279s 5ms/step - loss: 0.9516 - val_loss: 0.9447
Epoch 6/10
55744/55744 [==============================] - 277s 5ms/step - loss: 0.9535 - val_loss: 0.9471
Epoch 7/10
55744/55744 [==============================] - 277s 5ms/step - loss: 0.9508 - val_loss: 0.9534
Epoch 8/10
55744/55744 [==============================] - 277s 5ms/step - loss: 0.9506 - val_loss: 0.9445
Epoch 9/10
55744/55744 [==============================] - 275s 5ms/step - loss: 0.9485 - val_loss: 0.9482
Epoch 10/10
55744/55744 [==============================] - 277s 5ms/step - loss: 0.9490 - val_loss: 0.9456

   Now, let's save the model in case we need to reload it in the future, and predict on the test set.

In [ ]:
keras_model.save('./baselineCF_Model2_batch16_dropout0.2_tf.h5',
                 save_format='tf')

#filepath = 'CF_Baseline_b16_dropout0.2_weights_only.h5'
#model = tf.keras.models.load_model('./baselineCF_Model2_batch16_dropout0.2_tf.h5')
#model.load_weights(filepath)

predictions = keras_model.predict(X_test_array)
6968/6968 [==============================] - 9s 1ms/step

   Then, we create a pandas.DataFrame containing the test set's reviewer id, item id, rating and prediction results.

In [ ]:
df_test = pd.DataFrame(X_test[:,0])
df_test.rename(columns={0: 'user'}, inplace=True)
df_test['items'] = X_test[:,1]
df_test['stars'] = y_test
df_test['predictions'] = predictions
df_test.head()
Out[ ]:
user items stars predictions
0 14701 17937 3.0 1.903274
1 17387 97250 5.0 3.822067
2 17411 66343 4.0 4.709871
3 6792 10159 5.0 3.698924
4 2965 77733 5.0 4.655035

   We can utilize the actual ratings from the original test set and the predicted ones to plot the distribution of the actual ratings using matplotlib.pyplot.bar and the predicted ratings using matplotlib.pyplot.hist.

In [ ]:
import matplotlib.pyplot as plt

values, counts = np.unique(df_test['stars'], return_counts=True)

plt.figure(figsize=(8,6))
plt.bar(values, counts, tick_label=['1','2','3','4','5'], label='true value')
plt.hist(predictions, color='orange', label='predicted value')
plt.xlabel('Ratings')
plt.ylabel('Frequency')
plt.title('Ratings Histogram')
plt.legend()
plt.show()

   Now, let's examine metrics from the model and plot the model loss and val_loss over the training epochs.

In [ ]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Eval'], loc='upper right')
plt.show()

   We can extract the weights of the embeddings and examine the shape and length.

In [ ]:
emb = keras_model.get_layer('embedding_1')
emb_weights = emb.get_weights()[0]

print('The shape of embedded weights: ', emb_weights.shape)
print('The length of embedded weights: ', len(emb_weights))
The shape of embedded weights:  (103731, 50)
The length of embedded weights:  103731

Cosine Similarity

   This similarity metric, often used for item-to-item similarity, calculates a similarity score between two item vectors residing in a multidimensional inner product space. This is made possible by vectorization, which represents each item as a vector of numbers so its characteristics can be encoded and processed mathematically. The similarity is then the cosine of the angle between the two item vectors as projected in the multidimensional space.

   Each item is now represented as a 50-dimensional vector. The embeddings need to be normalized so that the dot product between two embeddings equals the cosine similarity, a numeric value representing how similar two items are. The embedding weights from the model are used to compute the cosine similarity via a dot product.
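
   As a quick illustration with toy vectors (not the learned embeddings), the cosine similarity of two vectors is their dot product divided by the product of their norms, so once each vector is scaled to unit length a plain dot product returns the same value:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.0])

# cosine similarity: dot product divided by the product of the norms
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# after unit-normalizing, the dot product alone gives the same value
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
print(round(cosine, 4), round(np.dot(a_unit, b_unit), 4))  # 0.996 0.996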

In [ ]:
emb_weights = emb_weights / np.linalg.norm(emb_weights, axis=1).reshape((-1,1))

rest_id_emb = df['itemID'].unique()
print('Number of unique embedded weights:', len(rest_id_emb))
Number of unique embedded weights: 103731

   Next, let's create a pandas.DataFrame containing all the unique items with their 50-dimensional embedding weights.

In [ ]:
rest_pd = pd.DataFrame(emb_weights)
rest_pd['itemID'] = rest_id_emb
rest_pd = rest_pd.set_index('itemID')
rest_pd
Out[ ]:
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 48 49
itemID
000503860X inf -inf -inf inf -inf -inf -inf inf -inf -inf ... inf inf -inf inf inf inf inf -inf inf -inf
0005092663 inf -inf inf inf inf inf inf -inf inf inf ... -inf inf -inf -inf -inf -inf -inf inf inf inf
0005019281 -inf -inf inf inf -inf inf -inf inf inf inf ... -inf -inf -inf inf -inf inf -inf -inf -inf -inf
0005119367 -0.141163 1.477967e-01 2.316365e-29 0.147855 -0.147876 1.178349e-01 0.082837 1.471831e-01 -1.494529e-01 1.230121e-01 ... -1.486145e-01 1.413215e-01 1.493893e-01 -0.149376 2.305986e-29 -0.147192 1.488855e-01 -0.149351 -1.472584e-01 0.150208
0005123968 inf -inf -inf -inf inf -inf -inf inf inf inf ... inf inf -inf -inf -inf -inf inf -inf -inf -inf
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
B01HHZKOG0 -0.008267 -2.236083e-01 -2.599621e-02 -0.223686 -0.144924 2.091652e-01 -0.120053 -1.055703e-01 2.093079e-01 -1.650761e-01 ... 5.156517e-02 -2.127886e-01 8.212486e-03 -0.005607 -2.685340e-02 -0.015225 -1.140203e-02 0.216453 5.390420e-02 -0.241080
B01HI8KC5E 0.000015 -1.177827e-28 1.772185e-08 -0.006270 0.011387 -1.183634e-28 -0.001693 1.198261e-28 1.190786e-28 -1.198461e-28 ... 1.178250e-28 -5.411250e-14 6.627290e-07 0.000355 1.078850e-03 -0.000003 -5.405612e-09 0.000071 1.029599e-18 -0.001684
B01HI87P4K -0.192161 9.981440e-04 -1.826643e-01 -0.178949 0.162945 1.688821e-01 -0.000266 -1.660917e-01 1.671402e-01 1.536722e-01 ... 1.100356e-01 -1.763748e-01 3.756321e-02 -0.114001 -1.473959e-02 -0.120297 1.098178e-01 0.197131 1.383827e-01 -0.150529
B01HJ6R77G 0.011110 -1.364973e-01 1.508010e-02 -0.047702 0.006104 1.077538e-01 -0.136790 4.593020e-04 2.395007e-01 -9.541614e-02 ... 1.375569e-01 5.134614e-06 -2.457947e-01 0.157121 -1.206576e-02 -0.000002 -1.642001e-01 0.144495 7.963887e-02 -0.340266
B01HJF79XO 0.149675 -1.460049e-01 1.408108e-01 0.130977 -0.136690 -1.266955e-01 -0.148206 1.487567e-01 1.501051e-01 -1.427349e-01 ... -1.487200e-01 1.455216e-01 -2.347758e-29 0.138634 -1.514007e-01 0.145058 1.487101e-01 -0.147483 1.478791e-01 0.153532

103731 rows × 50 columns

   We can create a temp dataframe that contains the unique item IDs and use an inner pandas.merge to join them, then create a copy of the itemID column as Item and remove the rows containing inf and -inf.

In [ ]:
temp = df[['itemID']].drop_duplicates()
df_recommend = pd.merge(rest_pd, temp, on='itemID')
df_recommend['Item'] = df_recommend.loc[:, 'itemID']
df_recommend = df_recommend[~df_recommend.isin([np.nan, np.inf,
                                                -np.inf]).any(1)]
df_recommend.shape
Out[ ]:
(63771, 52)

Recommendation

   Let's now use the trained model to recommend items to be purchased alongside the most-rated items in the set. We can define a function find_similarity_total that calculates the cosine similarity between the target item and every other item, storing the results in a table.

In [ ]:
def find_similarity_total(item_name):
    """ Recommends item based on the cosine similarity between items """
    cosine_list_total = []
    result = []
    for i in range(0, df_recommend.shape[0]):
        sample_name = df_recommend[df_recommend['Item'] == item_name].iloc[:,1:51]
        row = df_recommend.iloc[i,1:51]
        cosine_total = np.dot(sample_name, row)
        recommended_name = df_recommend.iloc[i,51]
        cosine_list_total.append(cosine_total)
        result.append(recommended_name)
    cosine_df_total = pd.DataFrame({'similar_item': result,
                                    'cosine': cosine_list_total})

    return cosine_df_total

   Let's now call the function with the most frequent item, B009934S5M, as the input. We can then apply a lambda function with the convert function to create a new column called cos in the result dataframe, drop the original cosine column (whose values are numpy arrays), and sort with the highest values first. The items with the highest cos are the most similar to the input item.

In [ ]:
def convert(input):
    """ Replace '[]' with empty strings & convert string to float """
    return float(str(input).replace('[','').replace(']',''))
In [ ]:
result = find_similarity_total('B009934S5M')
result['cos'] = result.apply(lambda x: convert(x['cosine']), axis=1)
result.drop('cosine', axis=1, inplace=True)
result = result.sort_values('cos', ascending=False)
result.head(10)
Out[ ]:
similar_item cos
60553 B00KHW4XHM 1.148089
37066 B0000JLLAO 1.115949
30263 B01F6EHOIK 1.112725
16538 B00199PP6U 1.043721
6666 B00005LZOF 1.038528
8372 B00009IAXL 1.018756
38499 B0006FFRKM 1.013303
32816 6304492952 1.005688
23900 B009934S5M 1.000000
42424 B000HT3QBY 0.999978

   The model recommends the following items to be purchased alongside the most-rated item, Star Trek Into Darkness (Blu-ray) (B009934S5M).

  • B00KHW4XHM - Criterion Collection: Golden Age of Television
  • B0000JLLAO - Not found
  • B01F6EHOIK - Red Sonja: Queen Of Plagues
  • B00199PP6U - Adventures of the Galaxy Rangers Collection Vol. 2
  • B00005LZOF - Not found
  • B00009IAXL - Malibooty: Beach with 5:1 ratio females:male
  • B0006FFRKM - Stellvia: Foundation III - Anime sold on Amazon Italy
  • 6304492952 - Disney Sing Along Songs - The Early Years, Collection of All-Time Favorites VHS
  • B009934S5M - Star Trek Into Darkness (Blu-ray)
  • B000HT3QBY - The Most Beautiful Wife - Italy, mafia, girl punished

   Let's call the function again using another frequent item as input.

In [ ]:
result = find_similarity_total('B00PY4Q9OS')
result['cos'] = result.apply(lambda x: convert(x['cosine']), axis=1)
result.drop('cosine', axis=1, inplace=True)
result = result.sort_values('cos', ascending=False)
result.head(10)
Out[ ]:
similar_item cos
28219 B00PY4Q9OS 1.000000
9836 B00062IZ2C 0.999615
29467 B0157757NK 0.999062
20906 B004L1DB9G 0.998818
7998 B00008973P 0.998594
34385 B00004I9YP 0.998380
12898 B000H4JH4O 0.997436
57305 B00ABCJQZW 0.997081
41731 B000FJMZ0O 0.997028
6437 B00005JM7T 0.996673

   The model recommends the following items to be purchased alongside the second most rated item, Guardians of the Galaxy 3D - Limited Edition Steelbook (B00PY4Q9OS).

  • B00062IZ2C - Star Hunter 01
  • B0157757NK - Fire City: End of Days - Battle over earth w demons
  • B004L1DB9G - Born to Fight [Blu-ray] - Train for the biggest boxing match with a bully
  • B00008973P - 101 Reykjavík [DVD] - Contemporary Icelandic version of American movies of the 1970s like Five Easy Pieces, in which antiheroic characters struggle
  • B00004I9YP - Not found
  • B000H4JH4O - Keeping Up With the Steins [DVD] - Heartwarming coming-of-age comedy when three generations collide in a crazy family reunion
  • B00ABCJQZW - Celebrate With Clifford (Clifford The Big Red Dog) [DVD]
  • B000FJMZ0O - Not found
  • B00005JM7T - Dr. Seuss' The Cat In The Hat (Widescreen Edition) [DVD]

   Let's call the function again using another frequent item as input.

In [ ]:
result = find_similarity_total('B00R8GUXPG')
result['cos'] = result.apply(lambda x: convert(x['cosine']), axis=1)
result.drop('cosine', axis=1, inplace=True)
result = result.sort_values('cos', ascending=False)
result.head(10)
Out[ ]:
similar_item cos
42262 B000H6SUYK 1.193777
9370 B00023GFUY 1.161545
2121 6303023355 1.151336
20807 B004I1CVV8 1.151042
49743 B001O17SW2 1.141012
47718 B0019BLYUO 1.076888
15472 B000ZBEOGK 1.072506
39396 B0009OUBUG 1.057287
52853 B00443FMK2 1.040808
36531 B00008PW2B 1.035146

   The titles for these item IDs (ASINs) could not be found.

Batch Size = 8

   Let's now set the path where the model is saved and set up the logs/callbacks: the EarlyStopping class that monitors val_loss and stops training if it does not improve for 3 epochs, the ModelCheckpoint class that monitors val_loss and saves only the weights with the minimum loss, and the TensorBoard class that enables visualizations with TensorBoard.

   Now the model can be trained by calling fit on the training set for 10 epochs, using batch_size=8 (the number of samples per gradient update) and the callbacks specified in callbacks_list.

In [ ]:
!rm -rf /logs/

%load_ext tensorboard

log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')

filepath = 'CF_Baseline_b8_dropout0.2_weights_only.h5'
checkpoint_dir = os.path.dirname(filepath)

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
                                                      histogram_freq=1)

callbacks_list = [EarlyStopping(monitor='val_loss', patience=3),
                  ModelCheckpoint(filepath, monitor='val_loss',
                                  save_best_only=True, mode='min'),
                  tensorboard_callback]

history = keras_model.fit(x=X_train_array, y=y_train,
                          validation_data=(X_test_array, y_test),
                          batch_size=8, epochs=10, callbacks=callbacks_list)
Epoch 1/10
111488/111488 [==============================] - 574s 5ms/step - loss: 1.0626 - val_loss: 1.0098
Epoch 2/10
111488/111488 [==============================] - 571s 5ms/step - loss: 1.0186 - val_loss: 1.0042
Epoch 3/10
111488/111488 [==============================] - 568s 5ms/step - loss: 1.0181 - val_loss: 1.0056
Epoch 4/10
111488/111488 [==============================] - 566s 5ms/step - loss: 1.0196 - val_loss: 1.0078
Epoch 5/10
111488/111488 [==============================] - 565s 5ms/step - loss: 1.0211 - val_loss: 1.0091

   The model stopped after the 5th epoch because val_loss did not improve for 3 epochs (patience=3). Let's now save the model and predict using the test set.

In [ ]:
keras_model.save('./baselineCF_Model2_batch8_dropout0.2_tf.h5',
                 save_format='tf')

predictions = keras_model.predict(X_test_array)
6968/6968 [==============================] - 9s 1ms/step

   Then, we create a pandas.DataFrame containing the test set's reviewer id, item id, rating and prediction results.

In [ ]:
df_test = pd.DataFrame(X_test[:,0])
df_test.rename(columns={0: 'user'}, inplace=True)
df_test['items'] = X_test[:,1]
df_test['stars'] = y_test
df_test['predictions'] = predictions
df_test.head()
Out[ ]:
user items stars predictions
0 14701 17937 3.0 2.267012
1 17387 97250 5.0 3.749072
2 17411 66343 4.0 4.731325
3 6792 10159 5.0 3.781336
4 2965 77733 5.0 4.521596

   We can utilize the actual ratings from the original test set and the predicted ones to plot the distribution of the actual ratings using matplotlib.pyplot.bar and the predicted ratings using matplotlib.pyplot.hist.

In [ ]:
values, counts = np.unique(df_test['stars'], return_counts=True)

plt.figure(figsize=(8,6))
plt.bar(values, counts, tick_label=['1','2','3','4','5'], label='true value')
plt.hist(predictions, color='orange', label='predicted value')
plt.xlabel('Ratings')
plt.ylabel('Frequency')
plt.title('Ratings Histogram')
plt.legend()
plt.show()

   Now, let's examine metrics from the model and plot the model loss and val_loss over the training epochs.

In [ ]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Eval'], loc='upper right')
plt.show()

   We can extract the weights of the embeddings and examine the shape and length.

In [ ]:
emb = keras_model.get_layer('embedding_11')
emb_weights = emb.get_weights()[0]

print('The shape of embedded weights: ', emb_weights.shape)
print('The length of embedded weights: ', len(emb_weights))
The shape of embedded weights:  (103731, 50)
The length of embedded weights:  103731

Cosine Similarity

   Each item is now represented as a 50-dimensional vector. The embeddings need to be normalized so that the dot product between two embeddings equals the cosine similarity, a numeric value representing how similar two items are. The embedding weights from the model are used to compute the cosine similarity via a dot product.

In [ ]:
emb_weights = emb_weights / np.linalg.norm(emb_weights, axis=1).reshape((-1,1))

rest_id_emb = df['itemID'].unique()
print('Number of unique embedded weights:', len(rest_id_emb))
Number of unique embedded weights: 103731

   Next, let's create a pandas.Dataframe containing all the unique items in the 50 dimensions with their embedded weights.

In [ ]:
rest_pd = pd.DataFrame(emb_weights)
rest_pd['itemID'] = rest_id_emb
rest_pd = rest_pd.set_index('itemID')
rest_pd
Out[ ]:
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 48 49
itemID
000503860X -inf -inf inf inf inf inf inf -inf inf inf ... -inf inf inf inf -inf inf inf -inf -inf inf
0005092663 -inf -inf inf -inf -inf inf inf -inf inf inf ... -inf inf inf -inf -inf inf inf -inf -inf inf
0005019281 -inf -inf -inf -inf -inf -inf -inf -inf inf -inf ... inf -inf -inf -inf inf -inf -inf -inf inf -inf
0005119367 -1.090690e-28 1.731628e-10 -1.086338e-28 1.232361e-10 -0.116705 -0.153975 -2.389029e-04 -5.150603e-02 0.013055 3.204659e-04 ... 0.000140 -1.095652e-28 -7.075178e-03 5.015315e-03 2.480933e-04 -4.935525e-12 0.013148 0.010717 -0.000146 0.107412
0005123968 -inf -inf -inf inf inf -inf inf -inf -inf inf ... inf -inf inf -inf -inf inf inf -inf -inf -inf
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
B01HHZKOG0 -3.712402e-08 -4.029023e-03 4.909287e-09 -1.923616e-07 -0.003230 -0.002738 2.696407e-07 3.736180e-07 -0.000176 -2.457352e-09 ... 0.000092 4.232260e-04 -2.973162e-07 3.344762e-19 -6.953103e-08 3.354215e-11 -0.000318 0.000018 0.002259 0.000222
B01HI8KC5E -inf inf inf inf inf -inf -inf -inf inf inf ... inf -inf -inf inf inf inf inf inf -inf inf
B01HI87P4K 1.339735e-01 -1.418108e-01 8.517871e-03 -1.393756e-01 -0.141205 -0.142142 1.445967e-01 -1.346857e-01 -0.141986 -1.300971e-01 ... 0.141561 -1.383641e-01 -1.366770e-01 1.268713e-01 -1.408127e-01 1.414165e-01 -0.137534 0.140210 0.141345 0.103656
B01HJ6R77G inf -inf inf -inf inf -inf -inf -inf -inf -inf ... inf inf -inf inf inf inf inf inf -inf inf
B01HJF79XO -inf inf -inf -inf -inf inf inf -inf inf inf ... inf inf inf -inf inf -inf -inf -inf -inf -inf

103731 rows × 50 columns

   We can create a temp dataframe that contains the unique item IDs and use an inner pandas.merge to join them, then create a copy of the itemID column as Item and remove the rows containing inf and -inf.

In [ ]:
temp = df[['itemID']].drop_duplicates()
df_recommend = pd.merge(rest_pd, temp, on='itemID')
df_recommend['Item'] = df_recommend.loc[:, 'itemID']
df_recommend = df_recommend[~df_recommend.isin([np.nan, np.inf,
                                                -np.inf]).any(1)]
df_recommend.shape
Out[ ]:
(47155, 52)

   There are fewer items remaining (47,155) compared to the batch_size=16 model (63,771).

Recommendation

   Let's now call the function with the most frequent item, B009934S5M, as the input. We can then apply a lambda function with the convert function to create a new column called cos in the result dataframe, drop the original cosine column (whose values are numpy arrays), and sort with the highest values first. The items with the highest cos are the most similar to the input item.

In [ ]:
result = find_similarity_total('B009934S5M')
result['cos'] = result.apply(lambda x: convert(x['cosine']), axis=1)
result.drop('cosine', axis=1, inplace=True)
result = result.sort_values('cos', ascending=False)
result.head(10)
Out[ ]:
similar_item cos
18010 B009934S5M 1.000000
5555 B00005U8QD 0.872185
11124 B000P5FH5I 0.828417
13565 B001JAHSIM 0.816748
33908 B000WYVUZS 0.805130
37096 B001TXVSPS 0.789085
17658 B007OCD1CG 0.788496
32716 B000OZ2CTS 0.782737
44674 B00JWS9IS6 0.780625
11750 B000W8KY0G 0.767757

   The model recommends the following items to be purchased alongside the most-rated item, Star Trek Into Darkness (Blu-ray) (B009934S5M).

  • B009934S5M - Star Trek Into Darkness (Blu-ray)
  • B00005U8QD - Witchouse 3: Demon Fire (Widescreen Special Edition)
  • B000P5FH5I - Dora The Explorer - Summer Explorer
  • B001JAHSIM - Not found
  • B000WYVUZS - Not found
  • B001TXVSPS - Not found
  • B007OCD1CG - Simply Red: Live at Montreux 2003 [Blu-ray] music
  • B000OZ2CTS - Vandread and Vandread the Second Stage: Complete Collection anime
  • B00JWS9IS6 - President Wolfman [Edizione: Stati Uniti] [Edizione: USA]
  • B000W8KY0G - Not found

Collaborative Filtering Model using TensorFlow

   We can utilize the previous approach to set up the environment, including setting the seed with an init_seeds function for reproducibility.

   Let's now prepare the data using lambda functions to format the id and rating columns, and examine the number of reviewers, items and ratings.

In [ ]:
import pandas as pd

df['item_id'] = df['item_id'].apply(lambda x: f'item_{x}')
df['reviewer_id'] = df['reviewer_id'].apply(lambda x: f'reviewer_{x}')
df['rating'] = df['rating'].apply(lambda x: float(x))

print(f'Number of reviewers: {len(df.reviewer_id.unique())}')
print(f'Number of items: {len(df.item_id.unique())}')
print(f'Number of ratings: {len(df.index)}')
Number of reviewers: 19639
Number of items: 103687
Number of ratings: 1113396

   To prepare the data for modeling, let's create the training and evaluation sets with 80% allocated to training and 20% to evaluation, and save them as .csv files as well.

In [ ]:
random_selection = np.random.rand(len(df.index)) <= 0.80
train_data = df[random_selection]
eval_data = df[~random_selection]

train_data.to_csv('train_filtered.csv', index=False, sep='|', header=False)
eval_data.to_csv('eval_filtered.csv', index=False, sep='|', header=False)
print(f'Train data split: {len(train_data.index)}')
print(f'Eval data split: {len(eval_data.index)}')
print('Train and eval data files are saved.')
Train data split: 890562
Eval data split: 222834
Train and eval data files are saved.

   Then, we can define the dataset metadata including where the data is stored, the columns as csv_header, the name of the target variable and the vocabularies for the unique reviewers and items.

   The model hyperparameters can be specified, so let's utilize learning_rate = 0.0001, batch_size = 1024, num_epochs = 30 and a base_embedding_dim = 64.

In [ ]:
DATADIR = '/content/drive/MyDrive/AmazonReviews/Data/'
csv_header = list(df.columns)
target_feature_name = 'rating'
reviewer_vocabulary = list(df.reviewer_id.unique())
item_vocabulary = list(df.item_id.unique())

learning_rate = 0.0001
batch_size = 1024
num_epochs = 30
base_embedding_dim = 64

   Let's use a function get_dataset_from_csv that loads the data with tf.data.experimental.make_csv_dataset, where each element of the dataset is a (features, labels) tuple corresponding to a batch of rows. The features dictionary maps the feature column names to Tensors containing the corresponding feature data, and labels is a Tensor containing the batch's label data.

In [ ]:
from tensorflow.python.lib.io import file_io

def get_dataset_from_csv(path, batch_size=1024, shuffle=True):

    csv_file_path = tf.io.gfile.glob(path)
    return tf.data.experimental.make_csv_dataset(
        csv_file_path,
        batch_size=batch_size,
        column_names=csv_header,
        label_name=target_feature_name,
        num_epochs=1,
        header=False,
        field_delim='|',
        shuffle=shuffle)

   Let's use another function, run_experiment, that calls get_dataset_from_csv to read the training and evaluation data from DATADIR and then compiles and fits the model with the specified callbacks.

In [ ]:
def run_experiment(model):

    train_dataset = get_dataset_from_csv(os.path.join(DATADIR,
                                                      'train_filtered.csv'),
                                         batch_size)
    eval_dataset = get_dataset_from_csv(os.path.join(DATADIR,
                                                     'eval_filtered.csv'),
                                        batch_size, shuffle=False)

    model.compile(optimizer=keras.optimizers.Adam(learning_rate),
                  loss=tf.keras.losses.MeanSquaredError(),
                  metrics=[keras.metrics.MeanAbsoluteError(name='mae')])

    history = model.fit(train_dataset, epochs=num_epochs,
                        validation_data=eval_dataset,
                        callbacks=callbacks_list)
    return history

Baseline Collaborative Filtering Model

Implement Embedding Encoder

   The embedding_encoder function, whose inputs are the vocabulary, the embedding dimension and the number of out-of-vocabulary indices, can be used to return the embeddings for the reviewers and items.

In [ ]:
from tensorflow.keras import layers

def embedding_encoder(vocabulary, embedding_dim, num_oov_indices=0, name=None):
    return keras.Sequential(
        [layers.StringLookup(vocabulary=vocabulary, mask_token=None,
                             num_oov_indices=num_oov_indices),
         layers.Embedding(input_dim=len(vocabulary) + num_oov_indices,
                          output_dim=embedding_dim),],
                          name=f'{name}_embedding' if name else None)

Implement the Baseline Model

   We can define another function create_baseline_model that passes the reviewer input through embedding_encoder for the reviewer embeddings and the item input through embedding_encoder for the item embeddings. Then the dot product similarity between the reviewer and item embeddings is computed and converted to the rating scale. Lastly, the model is created.

In [ ]:
def create_baseline_model():

    reviewer_input = layers.Input(name='reviewer_id', shape=(), dtype=tf.string)
    reviewer_embedding = embedding_encoder(vocabulary=reviewer_vocabulary,
                                           embedding_dim=base_embedding_dim,
                                           name='reviewer')(reviewer_input)

    item_input = layers.Input(name='item_id', shape=(), dtype=tf.string)
    item_embedding = embedding_encoder(vocabulary=item_vocabulary,
                                       embedding_dim=base_embedding_dim,
                                       name='item')(item_input)

    logits = layers.Dot(axes=1, name='dot_similarity')([reviewer_embedding,
                                                        item_embedding])

    prediction = keras.activations.sigmoid(logits) * 5

    model = keras.Model(inputs=[reviewer_input, item_input], outputs=prediction,
                        name='baseline_model')

    return model

   Let's create the baseline model and examine the model summary.

In [ ]:
baseline_model = create_baseline_model()

baseline_model.summary()
Model: "baseline_model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 reviewer_id (InputLayer)       [(None,)]            0           []                               
                                                                                                  
 item_id (InputLayer)           [(None,)]            0           []                               
                                                                                                  
 reviewer_embedding (Sequential  (None, 64)          1256896     ['reviewer_id[0][0]']            
 )                                                                                                
                                                                                                  
 item_embedding (Sequential)    (None, 64)           6635968     ['item_id[0][0]']                
                                                                                                  
 dot_similarity (Dot)           (None, 1)            0           ['reviewer_embedding[0][0]',     
                                                                  'item_embedding[0][0]']         
                                                                                                  
 tf.math.sigmoid_2 (TFOpLambda)  (None, 1)           0           ['dot_similarity[0][0]']         
                                                                                                  
 tf.math.multiply_2 (TFOpLambda  (None, 1)           0           ['tf.math.sigmoid_2[0][0]']      
 )                                                                                                
                                                                                                  
==================================================================================================
Total params: 7,892,864
Trainable params: 7,892,864
Non-trainable params: 0
__________________________________________________________________________________________________

   Notice that the number of trainable parameters is 7,892,864. Let's now set the path where the model is saved and set up the logs/callbacks: the EarlyStopping class that monitors val_loss and stops training if it does not improve for 3 epochs, the ModelCheckpoint class that monitors val_loss and saves only the weights with the minimum loss, and the TensorBoard class that enables visualizations with TensorBoard.

   Now the baseline_model can be trained by calling fit on the training dataset for 30 epochs, using batch_size=1024 (the number of samples per gradient update) and the callbacks specified in callbacks_list.

In [ ]:
import datetime
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

!rm -rf /logs/

%load_ext tensorboard

log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')

filepath = 'baselineEmbed64_MoviesTV_weights_only_b1024.h5'
checkpoint_dir = os.path.dirname(filepath)

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
                                                      histogram_freq=1)

callbacks_list = [EarlyStopping(monitor='val_loss', patience=3),
                  ModelCheckpoint(filepath, monitor='val_loss',
                                  save_best_only=True, mode='min'),
                  tensorboard_callback]

history = run_experiment(baseline_model)
Epoch 1/30
870/870 [==============================] - 141s 157ms/step - loss: 4.0122 - mae: 1.8476 - val_loss: 4.0184 - val_mae: 1.8491
Epoch 2/30
870/870 [==============================] - 131s 151ms/step - loss: 4.0040 - mae: 1.8456 - val_loss: 4.0182 - val_mae: 1.8491
Epoch 3/30
870/870 [==============================] - 132s 152ms/step - loss: 3.9942 - mae: 1.8432 - val_loss: 4.0175 - val_mae: 1.8489
Epoch 4/30
870/870 [==============================] - 136s 156ms/step - loss: 3.9822 - mae: 1.8403 - val_loss: 4.0155 - val_mae: 1.8484
Epoch 5/30
870/870 [==============================] - 136s 156ms/step - loss: 3.9661 - mae: 1.8363 - val_loss: 4.0103 - val_mae: 1.8471
Epoch 6/30
870/870 [==============================] - 140s 161ms/step - loss: 3.9430 - mae: 1.8306 - val_loss: 3.9983 - val_mae: 1.8441
Epoch 7/30
870/870 [==============================] - 138s 158ms/step - loss: 3.9082 - mae: 1.8219 - val_loss: 3.9740 - val_mae: 1.8380
Epoch 8/30
870/870 [==============================] - 139s 159ms/step - loss: 3.8559 - mae: 1.8086 - val_loss: 3.9313 - val_mae: 1.8272
Epoch 9/30
870/870 [==============================] - 132s 152ms/step - loss: 3.7799 - mae: 1.7890 - val_loss: 3.8641 - val_mae: 1.8098
Epoch 10/30
870/870 [==============================] - 129s 149ms/step - loss: 3.6758 - mae: 1.7616 - val_loss: 3.7689 - val_mae: 1.7846
Epoch 11/30
870/870 [==============================] - 132s 152ms/step - loss: 3.5419 - mae: 1.7253 - val_loss: 3.6454 - val_mae: 1.7510
Epoch 12/30
870/870 [==============================] - 133s 153ms/step - loss: 3.3799 - mae: 1.6802 - val_loss: 3.4966 - val_mae: 1.7097
Epoch 13/30
870/870 [==============================] - 130s 150ms/step - loss: 3.1947 - mae: 1.6274 - val_loss: 3.3280 - val_mae: 1.6619
Epoch 14/30
870/870 [==============================] - 131s 150ms/step - loss: 2.9932 - mae: 1.5685 - val_loss: 3.1466 - val_mae: 1.6090
Epoch 15/30
870/870 [==============================] - 131s 151ms/step - loss: 2.7831 - mae: 1.5052 - val_loss: 2.9596 - val_mae: 1.5530
Epoch 16/30
870/870 [==============================] - 135s 155ms/step - loss: 2.5716 - mae: 1.4395 - val_loss: 2.7733 - val_mae: 1.4955
Epoch 17/30
870/870 [==============================] - 134s 154ms/step - loss: 2.3649 - mae: 1.3728 - val_loss: 2.5929 - val_mae: 1.4379
Epoch 18/30
870/870 [==============================] - 129s 148ms/step - loss: 2.1678 - mae: 1.3067 - val_loss: 2.4224 - val_mae: 1.3814
Epoch 19/30
870/870 [==============================] - 128s 148ms/step - loss: 1.9836 - mae: 1.2423 - val_loss: 2.2644 - val_mae: 1.3268
Epoch 20/30
870/870 [==============================] - 130s 150ms/step - loss: 1.8145 - mae: 1.1803 - val_loss: 2.1205 - val_mae: 1.2749
Epoch 21/30
870/870 [==============================] - 130s 149ms/step - loss: 1.6615 - mae: 1.1216 - val_loss: 1.9912 - val_mae: 1.2262
Epoch 22/30
870/870 [==============================] - 123s 141ms/step - loss: 1.5244 - mae: 1.0665 - val_loss: 1.8765 - val_mae: 1.1811
Epoch 23/30
870/870 [==============================] - 129s 148ms/step - loss: 1.4030 - mae: 1.0155 - val_loss: 1.7755 - val_mae: 1.1396
Epoch 24/30
870/870 [==============================] - 130s 149ms/step - loss: 1.2959 - mae: 0.9686 - val_loss: 1.6874 - val_mae: 1.1020
Epoch 25/30
870/870 [==============================] - 128s 147ms/step - loss: 1.2019 - mae: 0.9255 - val_loss: 1.6109 - val_mae: 1.0680
Epoch 26/30
870/870 [==============================] - 129s 149ms/step - loss: 1.1196 - mae: 0.8863 - val_loss: 1.5447 - val_mae: 1.0374
Epoch 27/30
870/870 [==============================] - 128s 147ms/step - loss: 1.0475 - mae: 0.8506 - val_loss: 1.4875 - val_mae: 1.0101
Epoch 28/30
870/870 [==============================] - 125s 144ms/step - loss: 0.9843 - mae: 0.8181 - val_loss: 1.4383 - val_mae: 0.9857
Epoch 29/30
870/870 [==============================] - 129s 149ms/step - loss: 0.9287 - mae: 0.7886 - val_loss: 1.3958 - val_mae: 0.9639
Epoch 30/30
870/870 [==============================] - 129s 148ms/step - loss: 0.8797 - mae: 0.7617 - val_loss: 1.3592 - val_mae: 0.9446

   We can save the model and plot the model loss and val_loss over the epochs.

In [ ]:
import matplotlib.pyplot as plt

baseline_model.save('./baselineEmbed64_MoviesTV_batchb1024_tf.h5',
                    save_format='tf')

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Eval'], loc='upper right')
plt.show()

   Let's now load the test set and the saved model so we can evaluate the trained model's MAE on the test set.

In [ ]:
eval_dataset = get_dataset_from_csv('eval_filtered.csv', batch_size,
                                    shuffle=False)

model = tf.keras.models.load_model('./baselineEmbed64_MoviesTV_batchb1024_tf.h5')

result = model.evaluate(eval_dataset, return_dict=True, verbose=0)
print('\nEvaluation on the test set:')
display(result)

Evaluation on the test set:
{'loss': 1.359154462814331, 'mae': 0.9445576071739197}

Memory-Efficient Model

Implement Quotient-Remainder Embedding as a Layer

   Rather than creating a single vocabulary_size X embedding_dim embedding table, the Quotient-Remainder technique creates two num_buckets X embedding_dim embedding tables for a vocabulary of size vocabulary_size and embedding size embedding_dim, where num_buckets is much smaller than vocabulary_size.

   An embedding for a given item index is generated by:

  1. Computing the quotient_index as index // num_buckets.
  2. Computing the remainder_index as index % num_buckets.
  3. Looking up quotient_embedding in the first embedding table using quotient_index.
  4. Looking up remainder_embedding in the second embedding table using remainder_index.
  5. Returning quotient_embedding * remainder_embedding.

   This technique reduces the number of embedding vectors that need to be stored and trained, while still generating a unique embedding vector of size embedding_dim for each item. The two embeddings, q_embedding and r_embedding, can also be combined using other operations such as Add and Concatenate.
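
   As a small worked example (the numbers are purely illustrative, not from the dataset), with num_buckets = 3 an item whose lookup index is 7 uses row 2 of the quotient table and row 1 of the remainder table:

# Illustrative only: how a single lookup index maps onto the two smaller tables.
num_buckets = 3
index = 7
quotient_index = index // num_buckets    # 2 -> row of the first (quotient) table
remainder_index = index % num_buckets    # 1 -> row of the second (remainder) table
# The final embedding is the elementwise product of those two rows.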

   Let's now set up the config and get the item, quotient and remainder indices. Then look up the quotient_embedding using the quotient_index and the remainder_embedding using the remainder_index, and use multiplication as the combiner operation.

In [ ]:
class QREmbedding(keras.layers.Layer):
    def __init__(self, vocabulary, embedding_dim, num_buckets, name=None):

        super(QREmbedding, self).__init__(name=name)
        self.num_buckets = num_buckets
        self.index_lookup = layers.StringLookup(
            vocabulary=vocabulary, mask_token=None, num_oov_indices=0)
        self.q_embeddings = layers.Embedding(num_buckets, embedding_dim,)
        self.r_embeddings = layers.Embedding(num_buckets, embedding_dim,)

    def get_config(self):
        config = super().get_config()
        return config

    def call(self, inputs):
        embedding_index = self.index_lookup(inputs)
        quotient_index = tf.math.floordiv(embedding_index, self.num_buckets)
        remainder_index = tf.math.floormod(embedding_index, self.num_buckets)

        quotient_embedding = self.q_embeddings(quotient_index)
        remainder_embedding = self.r_embeddings(remainder_index)

        return quotient_embedding * remainder_embedding
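
   A minimal usage sketch of the layer (the toy vocabulary and sizes are illustrative only; the QREmbedding class above and its keras/layers imports are assumed to be in scope):

import tensorflow as tf

# Toy vocabulary of 9 items: a full table would need 9 x 4 = 36 weights,
# while the quotient and remainder tables together need 2 x (3 x 4) = 24.
toy_vocab = [f'item_{i}' for i in range(9)]
toy_qr = QREmbedding(vocabulary=toy_vocab, embedding_dim=4, num_buckets=3,
                     name='toy_qr_embedding')

print(toy_qr(tf.constant(['item_0', 'item_8'])).shape)  # (2, 4)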

Implement Mixed Dimension Embedding as a Layer

   In the Mixed Dimension embedding scheme, embedding vectors with the full dimension are trained for frequently queried items, while embedding vectors with reduced dimensions are trained for less frequent items. A projection weight matrix then maps the reduced-dimension embeddings up to the full dimension.

   More precisely, we define blocks of items of similar frequencies. For each block, a block_vocab_size X block_embedding_dim embedding table and block_embedding_dim X full_embedding_dim projection weights matrix are created. Note that, if block_embedding_dim equals full_embedding_dim, the projection weights matrix becomes an identity matrix. Embeddings for a given batch of item indices are generated via the following steps:

  1. For each block, lookup the block_embedding_dim embedding vectors using indices, and project them to the full_embedding_dim.
  2. If an item index does not belong to a given block, an out-of-vocabulary embedding is returned. Each block will return a batch_size X full_embedding_dim tensor.
  3. A mask is applied to the embeddings returned from each block in order to convert the out-of-vocabulary embeddings to vectors of zeros. That is, for each item in the batch, a single non-zero embedding vector is returned from all of the block embeddings.
  4. Embeddings retrieved from the blocks are combined using sum to produce the final batch_size X full_embedding_dim tensor.

   For the MDEmbedding class, let's create a vocabulary-to-block lookup along with block embedding encoders and projectors. In the call method, the block index for each input item is determined and the output embeddings are initialized to zeros. Then, for each block, the embeddings are looked up and projected to base_embedding_dim, a mask sets the embeddings of items that do not belong to the current block to zero, and the masked block embeddings are added to the final embeddings.

In [ ]:
class MDEmbedding(keras.layers.Layer):
    def __init__(self, blocks_vocabulary, blocks_embedding_dims,
                 base_embedding_dim, name=None):

        super(MDEmbedding, self).__init__(name=name)
        self.num_blocks = len(blocks_vocabulary)

        keys = []
        values = []
        for block_idx, block_vocab in enumerate(blocks_vocabulary):
            keys.extend(block_vocab)
            values.extend([block_idx] * len(block_vocab))
        self.vocab_to_block = tf.lookup.StaticHashTable(
            tf.lookup.KeyValueTensorInitializer(keys, values), default_value=-1)
        self.block_embedding_encoders = []
        self.block_embedding_projectors = []

        for idx in range(self.num_blocks):
            vocabulary = blocks_vocabulary[idx]
            embedding_dim = blocks_embedding_dims[idx]
            block_embedding_encoder = embedding_encoder(
                vocabulary, embedding_dim, num_oov_indices=1)
            self.block_embedding_encoders.append(block_embedding_encoder)
            if embedding_dim == base_embedding_dim:
                self.block_embedding_projectors.append(layers.Lambda(lambda x: x))
            else:
                self.block_embedding_projectors.append(
                    layers.Dense(units=base_embedding_dim))

        # Keep the full embedding dimension on the layer so call() does not
        # depend on a module-level base_embedding_dim.
        self.base_embedding_dim = base_embedding_dim

    def get_config(self):
        config = super().get_config()
        return config

    def call(self, inputs):
        # Get the block index for each input item.
        block_indices = self.vocab_to_block.lookup(inputs)
        # Initialize the output embeddings to zeros.
        embeddings = tf.zeros(shape=(tf.shape(inputs)[0], self.base_embedding_dim))

        for idx in range(self.num_blocks):
            # Look up the embeddings for the current block and project them
            # to base_embedding_dim.
            block_embeddings = self.block_embedding_encoders[idx](inputs)
            block_embeddings = self.block_embedding_projectors[idx](block_embeddings)

            # Zero out the embeddings of items that do not belong to this block.
            mask = tf.expand_dims(tf.cast(block_indices == idx,
                                          tf.dtypes.float32), 1)
            block_embeddings = block_embeddings * mask
            embeddings += block_embeddings

        return embeddings
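
   A minimal usage sketch of the layer (the block vocabularies below are illustrative only; it assumes tensorflow, the embedding_encoder helper and base_embedding_dim = 64 defined earlier in the notebook are in scope):

toy_blocks_vocabulary = [['hot_1', 'hot_2'],             # frequent items: full dim
                         ['mid_1', 'mid_2', 'mid_3'],    # mid-frequency items
                         ['rare_1', 'rare_2']]           # rare items: smallest dim
toy_blocks_embedding_dims = [64, 32, 16]

toy_md = MDEmbedding(blocks_vocabulary=toy_blocks_vocabulary,
                     blocks_embedding_dims=toy_blocks_embedding_dims,
                     base_embedding_dim=64,
                     name='toy_md_embedding')

# Each item receives a single non-zero 64-dim embedding taken from its own block.
print(toy_md(tf.constant(['hot_1', 'rare_2'])).shape)  # (2, 64)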

Implement the Memory-Efficient Model

   The Quotient-Remainder method is used to reduce the size of the reviewer embeddings, and the Mixed Dimension method to reduce the size of the item embeddings. The number of blocks and the embedding dimension of each block are chosen based on a histogram of item frequency, which serves as a proxy for popularity. So let's use pandas.DataFrame.value_counts() to generate item_frequencies and plot it with bins=100.

In [ ]:
item_frequencies = df['item_id'].value_counts()
item_frequencies.hist(bins=100)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f635fdf0f40>

   We can also examine the summary statistics using pandas.DataFrame.describe.

In [ ]:
item_frequencies.describe()
Out[ ]:
count    103687.000000
mean         10.738048
std          32.901640
min           1.000000
25%           1.000000
50%           3.000000
75%           8.000000
max        1136.000000
Name: item_id, dtype: float64

   Let's examine the items with a rating count less than or equal to the mean (about 10 ratings) and plot the distribution.

In [ ]:
dat = df[df.groupby('item_id')['item_id'].transform('count') <= 10]
dat.shape
Out[ ]:
(234484, 3)
In [ ]:
item_frequencies1 = dat['item_id'].value_counts()
item_frequencies1.hist(bins=100)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f678072bc10>

   Then, we can examine the items with a rating count greater than or equal to the mean and plot the distribution, repeating the same comparison with a threshold of 100 ratings.

In [ ]:
dat = df[df.groupby('item_id')['item_id'].transform('count') >= 10]
dat.shape
Out[ ]:
(895032, 3)
In [ ]:
item_frequencies1 = dat['item_id'].value_counts()
item_frequencies1.hist(bins=100)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6780a87610>
In [ ]:
dat = df[df.groupby('item_id')['item_id'].transform('count') <= 100]
dat.shape
Out[ ]:
(752937, 3)
In [ ]:
item_frequencies1 = dat['item_id'].value_counts()
item_frequencies1.hist(bins=100)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f678057b040>
In [ ]:
dat = df[df.groupby('item_id')['item_id'].transform('count') >= 100]
dat.shape
Out[ ]:
(363159, 3)
In [ ]:
item_frequencies1 = dat['item_id'].value_counts()
item_frequencies1.hist(bins=100)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6780959a60>

   Now, we can create item block vocabularies with different embedding dimensions. The items are grouped into three blocks by popularity and assigned the embedding dimensions 64, 32 and 16, with the block sizes following the item frequency split above. Then, we can define the number of embedding buckets for reviewers by dividing the length of the reviewer_vocabulary by 100.

In [ ]:
sorted_item_vocabulary = list(item_frequencies.keys())

item_blocks_vocabulary = [
    sorted_item_vocabulary[10:100],
    sorted_item_vocabulary[100:],
    sorted_item_vocabulary[:10]]

item_blocks_embedding_dims = [64, 32, 16]

reviewer_embedding_num_buckets = len(reviewer_vocabulary) // 100
print('Number of reviewer embedding buckets:', reviewer_embedding_num_buckets)
Number of reviewer embedding buckets: 196

   We can define another function create_memory_efficient_model that passes the reviewers to the QREmbedding class for the reviewer embeddings and the items to the MDEmbedding class for the item embeddings. Then the dot product similarity between the reviewer and item embeddings is computed and converted to a rating scale. Lastly, the model can be created.

In [ ]:
def create_memory_efficient_model():
    reviewer_input = layers.Input(name='reviewer_id', shape=(), dtype=tf.string)
    reviewer_embedding = QREmbedding(
        vocabulary=reviewer_vocabulary,
        embedding_dim=base_embedding_dim,
        num_buckets=reviewer_embedding_num_buckets,
        name='reviewer_embedding')(reviewer_input)

    item_input = layers.Input(name='item_id', shape=(), dtype=tf.string)
    item_embedding = MDEmbedding(
        blocks_vocabulary=item_blocks_vocabulary,
        blocks_embedding_dims=item_blocks_embedding_dims,
        base_embedding_dim=base_embedding_dim,
        name='item_embedding')(item_input)

    logits = layers.Dot(axes=1, name='dot_similarity')(
        [reviewer_embedding, item_embedding])

    prediction = keras.activations.sigmoid(logits) * 5

    model = keras.Model(inputs=[reviewer_input, item_input], outputs=prediction,
                        name='memory_model')
    return model

   Let's now create the memory-efficient model and examine the model summary.

In [ ]:
memory_efficient_model = create_memory_efficient_model()

memory_efficient_model.summary()
Model: "memory_model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 reviewer_id (InputLayer)       [(None,)]            0           []                               
                                                                                                  
 item_id (InputLayer)           [(None,)]            0           []                               
                                                                                                  
 reviewer_embedding (QREmbeddin  (None, 64)          25088       ['reviewer_id[0][0]']            
 g)                                                                                               
                                                                                                  
 item_embedding (MDEmbedding)   (None, 64)           3324016     ['item_id[0][0]']                
                                                                                                  
 dot_similarity (Dot)           (None, 1)            0           ['reviewer_embedding[0][0]',     
                                                                  'item_embedding[0][0]']         
                                                                                                  
 tf.math.sigmoid_3 (TFOpLambda)  (None, 1)           0           ['dot_similarity[0][0]']         
                                                                                                  
 tf.math.multiply_3 (TFOpLambda  (None, 1)           0           ['tf.math.sigmoid_3[0][0]']      
 )                                                                                                
                                                                                                  
==================================================================================================
Total params: 3,349,104
Trainable params: 3,349,104
Non-trainable params: 0
__________________________________________________________________________________________________

   The number of trainable parameters is now 3,349,104, down from 7,892,864 in the baseline model, which is less than half as many parameters.
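
   As a quick sanity check (a sketch only; it assumes the embedding_encoder helper defined earlier adds one out-of-vocabulary row per block, per its num_oov_indices=1 argument), the parameter counts in the summary can be reproduced from the embedding configuration:

# Reviewer QREmbedding: two (num_buckets x embedding_dim) tables.
qr_params = 2 * 196 * 64                      # 25,088 as in the summary

# Item MDEmbedding: per-block embedding tables (vocab size + 1 assumed OOV row)
# plus the Dense projectors for blocks whose dim differs from the base dim 64.
md_params = ((90 + 1) * 64                    # sorted_item_vocabulary[10:100]
             + (103_687 - 100 + 1) * 32       # sorted_item_vocabulary[100:]
             + (10 + 1) * 16                  # sorted_item_vocabulary[:10]
             + (32 * 64 + 64)                 # projector 32 -> 64
             + (16 * 64 + 64))                # projector 16 -> 64

print(qr_params + md_params)                  # 3,349,104 trainable parameters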

   Let's now set the path where the model is saved and set up the logs/callbacks: the EarlyStopping class that monitors the val_loss and stops training if it does not improve after 3 epochs, the ModelCheckpoint class that monitors the val_loss and saves only the best model with the minimum loss, and the TensorBoard class that enables visualizations in TensorBoard.

   Now, the memory_efficient_model can be trained by calling fit on the train dataset for 30 epochs using a batch_size=1024 and the specified callbacks from the callbacks_list.

In [ ]:
filepath = 'baselineFiltered_weights_only_b1024_memOpt.h5'
checkpoint_dir = os.path.dirname(filepath)

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
                                                      histogram_freq=1)

callbacks_list = [EarlyStopping(monitor='val_loss', patience=3),
                  ModelCheckpoint(filepath, monitor='val_loss',
                                  save_best_only=True, mode='min'),
                  tensorboard_callback]

history = run_experiment(memory_efficient_model)
Epoch 1/30
870/870 [==============================] - 139s 155ms/step - loss: 3.9922 - mae: 1.8427 - val_loss: 3.8602 - val_mae: 1.8107
Epoch 2/30
870/870 [==============================] - 129s 148ms/step - loss: 2.9061 - mae: 1.5446 - val_loss: 1.7800 - val_mae: 1.1903
Epoch 3/30
870/870 [==============================] - 127s 147ms/step - loss: 1.4770 - mae: 1.0167 - val_loss: 1.3738 - val_mae: 0.9444
Epoch 4/30
870/870 [==============================] - 125s 143ms/step - loss: 1.2883 - mae: 0.9073 - val_loss: 1.3152 - val_mae: 0.9091
Epoch 5/30
870/870 [==============================] - 133s 153ms/step - loss: 1.2071 - mae: 0.8670 - val_loss: 1.2915 - val_mae: 0.8895
Epoch 6/30
870/870 [==============================] - 132s 152ms/step - loss: 1.1569 - mae: 0.8380 - val_loss: 1.2851 - val_mae: 0.8772
Epoch 7/30
870/870 [==============================] - 133s 152ms/step - loss: 1.1255 - mae: 0.8177 - val_loss: 1.2823 - val_mae: 0.8693
Epoch 8/30
870/870 [==============================] - 133s 152ms/step - loss: 1.1033 - mae: 0.8035 - val_loss: 1.2786 - val_mae: 0.8639
Epoch 9/30
870/870 [==============================] - 127s 145ms/step - loss: 1.0853 - mae: 0.7932 - val_loss: 1.2739 - val_mae: 0.8594
Epoch 10/30
870/870 [==============================] - 132s 152ms/step - loss: 1.0701 - mae: 0.7851 - val_loss: 1.2676 - val_mae: 0.8560
Epoch 11/30
870/870 [==============================] - 134s 154ms/step - loss: 1.0565 - mae: 0.7787 - val_loss: 1.2610 - val_mae: 0.8531
Epoch 12/30
870/870 [==============================] - 133s 153ms/step - loss: 1.0443 - mae: 0.7733 - val_loss: 1.2533 - val_mae: 0.8503
Epoch 13/30
870/870 [==============================] - 130s 150ms/step - loss: 1.0330 - mae: 0.7683 - val_loss: 1.2458 - val_mae: 0.8477
Epoch 14/30
870/870 [==============================] - 130s 149ms/step - loss: 1.0226 - mae: 0.7638 - val_loss: 1.2382 - val_mae: 0.8450
Epoch 15/30
870/870 [==============================] - 130s 150ms/step - loss: 1.0129 - mae: 0.7596 - val_loss: 1.2308 - val_mae: 0.8424
Epoch 16/30
870/870 [==============================] - 132s 152ms/step - loss: 1.0041 - mae: 0.7556 - val_loss: 1.2241 - val_mae: 0.8398
Epoch 17/30
870/870 [==============================] - 139s 160ms/step - loss: 0.9960 - mae: 0.7518 - val_loss: 1.2172 - val_mae: 0.8369
Epoch 18/30
870/870 [==============================] - 134s 154ms/step - loss: 0.9886 - mae: 0.7482 - val_loss: 1.2112 - val_mae: 0.8343
Epoch 19/30
870/870 [==============================] - 128s 147ms/step - loss: 0.9818 - mae: 0.7446 - val_loss: 1.2054 - val_mae: 0.8318
Epoch 20/30
870/870 [==============================] - 132s 152ms/step - loss: 0.9757 - mae: 0.7414 - val_loss: 1.2001 - val_mae: 0.8291
Epoch 21/30
870/870 [==============================] - 132s 152ms/step - loss: 0.9700 - mae: 0.7384 - val_loss: 1.1952 - val_mae: 0.8268
Epoch 22/30
870/870 [==============================] - 123s 141ms/step - loss: 0.9648 - mae: 0.7356 - val_loss: 1.1906 - val_mae: 0.8248
Epoch 23/30
870/870 [==============================] - 123s 142ms/step - loss: 0.9601 - mae: 0.7331 - val_loss: 1.1866 - val_mae: 0.8230
Epoch 24/30
870/870 [==============================] - 123s 141ms/step - loss: 0.9557 - mae: 0.7307 - val_loss: 1.1826 - val_mae: 0.8213
Epoch 25/30
870/870 [==============================] - 122s 140ms/step - loss: 0.9516 - mae: 0.7286 - val_loss: 1.1793 - val_mae: 0.8197
Epoch 26/30
870/870 [==============================] - 121s 139ms/step - loss: 0.9478 - mae: 0.7265 - val_loss: 1.1759 - val_mae: 0.8180
Epoch 27/30
870/870 [==============================] - 121s 139ms/step - loss: 0.9443 - mae: 0.7247 - val_loss: 1.1729 - val_mae: 0.8165
Epoch 28/30
870/870 [==============================] - 120s 138ms/step - loss: 0.9409 - mae: 0.7228 - val_loss: 1.1702 - val_mae: 0.8154
Epoch 29/30
870/870 [==============================] - 119s 137ms/step - loss: 0.9376 - mae: 0.7210 - val_loss: 1.1675 - val_mae: 0.8141
Epoch 30/30
870/870 [==============================] - 121s 139ms/step - loss: 0.9345 - mae: 0.7193 - val_loss: 1.1650 - val_mae: 0.8126

   We can save the model and plot the loss, val_loss, mae and val_mae over the training epochs.

In [ ]:
memory_efficient_model.save('./memoryEfficient_weights_only_b1024_tf.h5',
                            save_format='tf')

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Eval'], loc='upper right')
plt.show()
In [ ]:
plt.plot(history.history['mae'])
plt.plot(history.history['val_mae'])
plt.title('Model MAE')
plt.ylabel('MAE')
plt.xlabel('Epoch')
plt.legend(['Train', 'Eval'], loc='upper right')
plt.show()

   Let's now load the test set and the saved model so we can evaluate the trained model's MAE on the test set.

In [ ]:
eval_dataset = get_dataset_from_csv('eval_filtered.csv', batch_size,
                                    shuffle=False)

model = tf.keras.models.load_model('./memoryEfficient_weights_only_b1024_tf.h5')

result = model.evaluate(eval_dataset, return_dict=True, verbose=0)
print('\nEvaluation on the test set:')
display(result)

Evaluation on the test set:
{'loss': 1.359154462814331, 'mae': 0.9445576071739197}

Ranking System in Tensorflow

   We can reuse the previous approach to set up the environment, including setting the seed with an init_seeds function for reproducibility.

In [ ]:
!pip install tensorflow-recommenders==0.7.2
!pip install tensorflow==2.9.2
import tensorflow as tf

print('CUDA and NVIDIA GPU Information')
!/usr/local/cuda/bin/nvcc --version
!nvidia-smi
print('\n')
print('TensorFlow version: {}'.format(tf.__version__))
print('Eager execution is: {}'.format(tf.executing_eagerly()))
print('Keras version: {}'.format(tf.keras.__version__))
print('Num GPUs Available: ', len(tf.config.list_physical_devices('GPU')))
print('\n')
Requirement already satisfied: tensorflow-recommenders==0.7.2 in /usr/local/lib/python3.10/dist-packages (0.7.2)
Requirement already satisfied: tensorflow>=2.9.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow-recommenders==0.7.2) (2.9.2)
Requirement already satisfied: absl-py>=0.1.6 in /usr/local/lib/python3.10/dist-packages (from tensorflow-recommenders==0.7.2) (1.4.0)
Requirement already satisfied: astunparse>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (1.6.3)
Requirement already satisfied: flatbuffers<2,>=1.12 in /usr/local/lib/python3.10/dist-packages (from tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (1.12)
Requirement already satisfied: gast<=0.4.0,>=0.2.1 in /usr/local/lib/python3.10/dist-packages (from tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (0.4.0)
Requirement already satisfied: google-pasta>=0.1.1 in /usr/local/lib/python3.10/dist-packages (from tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (0.2.0)
Requirement already satisfied: grpcio<2.0,>=1.24.3 in /usr/local/lib/python3.10/dist-packages (from tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (1.59.0)
Requirement already satisfied: h5py>=2.9.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (3.9.0)
Requirement already satisfied: keras<2.10.0,>=2.9.0rc0 in /usr/local/lib/python3.10/dist-packages (from tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (2.9.0)
Requirement already satisfied: keras-preprocessing>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (1.1.2)
Requirement already satisfied: libclang>=13.0.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (16.0.6)
Requirement already satisfied: numpy>=1.20 in /usr/local/lib/python3.10/dist-packages (from tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (1.23.5)
Requirement already satisfied: opt-einsum>=2.3.2 in /usr/local/lib/python3.10/dist-packages (from tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (3.3.0)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (23.2)
Requirement already satisfied: protobuf<3.20,>=3.9.2 in /usr/local/lib/python3.10/dist-packages (from tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (3.19.6)
Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (67.7.2)
Requirement already satisfied: six>=1.12.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (1.16.0)
Requirement already satisfied: tensorboard<2.10,>=2.9 in /usr/local/lib/python3.10/dist-packages (from tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (2.9.1)
Requirement already satisfied: tensorflow-io-gcs-filesystem>=0.23.1 in /usr/local/lib/python3.10/dist-packages (from tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (0.34.0)
Requirement already satisfied: tensorflow-estimator<2.10.0,>=2.9.0rc0 in /usr/local/lib/python3.10/dist-packages (from tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (2.9.0)
Requirement already satisfied: termcolor>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (2.3.0)
Requirement already satisfied: typing-extensions>=3.6.6 in /usr/local/lib/python3.10/dist-packages (from tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (4.5.0)
Requirement already satisfied: wrapt>=1.11.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (1.15.0)
Requirement already satisfied: wheel<1.0,>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from astunparse>=1.6.0->tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (0.41.2)
Requirement already satisfied: google-auth<3,>=1.6.3 in /usr/local/lib/python3.10/dist-packages (from tensorboard<2.10,>=2.9->tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (2.17.3)
Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from tensorboard<2.10,>=2.9->tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (0.4.6)
Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.10/dist-packages (from tensorboard<2.10,>=2.9->tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (3.5)
Requirement already satisfied: requests<3,>=2.21.0 in /usr/local/lib/python3.10/dist-packages (from tensorboard<2.10,>=2.9->tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (2.31.0)
Requirement already satisfied: tensorboard-data-server<0.7.0,>=0.6.0 in /usr/local/lib/python3.10/dist-packages (from tensorboard<2.10,>=2.9->tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (0.6.1)
Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from tensorboard<2.10,>=2.9->tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (1.8.1)
Requirement already satisfied: werkzeug>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from tensorboard<2.10,>=2.9->tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (3.0.0)
Requirement already satisfied: cachetools<6.0,>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from google-auth<3,>=1.6.3->tensorboard<2.10,>=2.9->tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (5.3.1)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.10/dist-packages (from google-auth<3,>=1.6.3->tensorboard<2.10,>=2.9->tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (0.3.0)
Requirement already satisfied: rsa<5,>=3.1.4 in /usr/local/lib/python3.10/dist-packages (from google-auth<3,>=1.6.3->tensorboard<2.10,>=2.9->tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (4.9)
Requirement already satisfied: requests-oauthlib>=0.7.0 in /usr/local/lib/python3.10/dist-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.10,>=2.9->tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (1.3.1)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.21.0->tensorboard<2.10,>=2.9->tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (3.3.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.21.0->tensorboard<2.10,>=2.9->tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.21.0->tensorboard<2.10,>=2.9->tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (2.0.6)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.21.0->tensorboard<2.10,>=2.9->tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (2023.7.22)
Requirement already satisfied: MarkupSafe>=2.1.1 in /usr/local/lib/python3.10/dist-packages (from werkzeug>=1.0.1->tensorboard<2.10,>=2.9->tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (2.1.3)
Requirement already satisfied: pyasn1<0.6.0,>=0.4.6 in /usr/local/lib/python3.10/dist-packages (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard<2.10,>=2.9->tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (0.5.0)
Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.10/dist-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.10,>=2.9->tensorflow>=2.9.0->tensorflow-recommenders==0.7.2) (3.2.2)
Requirement already satisfied: tensorflow==2.9.2 in /usr/local/lib/python3.10/dist-packages (2.9.2)
Requirement already satisfied: absl-py>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow==2.9.2) (1.4.0)
Requirement already satisfied: astunparse>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow==2.9.2) (1.6.3)
Requirement already satisfied: flatbuffers<2,>=1.12 in /usr/local/lib/python3.10/dist-packages (from tensorflow==2.9.2) (1.12)
Requirement already satisfied: gast<=0.4.0,>=0.2.1 in /usr/local/lib/python3.10/dist-packages (from tensorflow==2.9.2) (0.4.0)
Requirement already satisfied: google-pasta>=0.1.1 in /usr/local/lib/python3.10/dist-packages (from tensorflow==2.9.2) (0.2.0)
Requirement already satisfied: grpcio<2.0,>=1.24.3 in /usr/local/lib/python3.10/dist-packages (from tensorflow==2.9.2) (1.59.0)
Requirement already satisfied: h5py>=2.9.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow==2.9.2) (3.9.0)
Requirement already satisfied: keras<2.10.0,>=2.9.0rc0 in /usr/local/lib/python3.10/dist-packages (from tensorflow==2.9.2) (2.9.0)
Requirement already satisfied: keras-preprocessing>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from tensorflow==2.9.2) (1.1.2)
Requirement already satisfied: libclang>=13.0.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow==2.9.2) (16.0.6)
Requirement already satisfied: numpy>=1.20 in /usr/local/lib/python3.10/dist-packages (from tensorflow==2.9.2) (1.23.5)
Requirement already satisfied: opt-einsum>=2.3.2 in /usr/local/lib/python3.10/dist-packages (from tensorflow==2.9.2) (3.3.0)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from tensorflow==2.9.2) (23.2)
Requirement already satisfied: protobuf<3.20,>=3.9.2 in /usr/local/lib/python3.10/dist-packages (from tensorflow==2.9.2) (3.19.6)
Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from tensorflow==2.9.2) (67.7.2)
Requirement already satisfied: six>=1.12.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow==2.9.2) (1.16.0)
Requirement already satisfied: tensorboard<2.10,>=2.9 in /usr/local/lib/python3.10/dist-packages (from tensorflow==2.9.2) (2.9.1)
Requirement already satisfied: tensorflow-io-gcs-filesystem>=0.23.1 in /usr/local/lib/python3.10/dist-packages (from tensorflow==2.9.2) (0.34.0)
Requirement already satisfied: tensorflow-estimator<2.10.0,>=2.9.0rc0 in /usr/local/lib/python3.10/dist-packages (from tensorflow==2.9.2) (2.9.0)
Requirement already satisfied: termcolor>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow==2.9.2) (2.3.0)
Requirement already satisfied: typing-extensions>=3.6.6 in /usr/local/lib/python3.10/dist-packages (from tensorflow==2.9.2) (4.5.0)
Requirement already satisfied: wrapt>=1.11.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow==2.9.2) (1.15.0)
Requirement already satisfied: wheel<1.0,>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from astunparse>=1.6.0->tensorflow==2.9.2) (0.41.2)
Requirement already satisfied: google-auth<3,>=1.6.3 in /usr/local/lib/python3.10/dist-packages (from tensorboard<2.10,>=2.9->tensorflow==2.9.2) (2.17.3)
Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from tensorboard<2.10,>=2.9->tensorflow==2.9.2) (0.4.6)
Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.10/dist-packages (from tensorboard<2.10,>=2.9->tensorflow==2.9.2) (3.5)
Requirement already satisfied: requests<3,>=2.21.0 in /usr/local/lib/python3.10/dist-packages (from tensorboard<2.10,>=2.9->tensorflow==2.9.2) (2.31.0)
Requirement already satisfied: tensorboard-data-server<0.7.0,>=0.6.0 in /usr/local/lib/python3.10/dist-packages (from tensorboard<2.10,>=2.9->tensorflow==2.9.2) (0.6.1)
Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from tensorboard<2.10,>=2.9->tensorflow==2.9.2) (1.8.1)
Requirement already satisfied: werkzeug>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from tensorboard<2.10,>=2.9->tensorflow==2.9.2) (3.0.0)
Requirement already satisfied: cachetools<6.0,>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from google-auth<3,>=1.6.3->tensorboard<2.10,>=2.9->tensorflow==2.9.2) (5.3.1)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.10/dist-packages (from google-auth<3,>=1.6.3->tensorboard<2.10,>=2.9->tensorflow==2.9.2) (0.3.0)
Requirement already satisfied: rsa<5,>=3.1.4 in /usr/local/lib/python3.10/dist-packages (from google-auth<3,>=1.6.3->tensorboard<2.10,>=2.9->tensorflow==2.9.2) (4.9)
Requirement already satisfied: requests-oauthlib>=0.7.0 in /usr/local/lib/python3.10/dist-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.10,>=2.9->tensorflow==2.9.2) (1.3.1)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.21.0->tensorboard<2.10,>=2.9->tensorflow==2.9.2) (3.3.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.21.0->tensorboard<2.10,>=2.9->tensorflow==2.9.2) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.21.0->tensorboard<2.10,>=2.9->tensorflow==2.9.2) (2.0.6)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.21.0->tensorboard<2.10,>=2.9->tensorflow==2.9.2) (2023.7.22)
Requirement already satisfied: MarkupSafe>=2.1.1 in /usr/local/lib/python3.10/dist-packages (from werkzeug>=1.0.1->tensorboard<2.10,>=2.9->tensorflow==2.9.2) (2.1.3)
Requirement already satisfied: pyasn1<0.6.0,>=0.4.6 in /usr/local/lib/python3.10/dist-packages (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard<2.10,>=2.9->tensorflow==2.9.2) (0.5.0)
Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.10/dist-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.10,>=2.9->tensorflow==2.9.2) (3.2.2)
CUDA and NVIDIA GPU Information
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
Sun Oct 15 22:05:18 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8     9W /  70W |      3MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


TensorFlow version: 2.9.2
Eager execution is: True
Keras version: 2.9.0
Num GPUs Available:  1


   Let's now load the pandas.DataFrame as a dictionary into a tf.data.Dataset and map the variables with a lambda function. Then we can prepare the train/test sets by shuffling and sampling. Finally, we can build the vocabularies that map the raw feature values to embedding vectors in the model by determining the unique items and reviewer_ids, and set the dimension of the embeddings.

In [ ]:
from sklearn.utils import shuffle

ratings = tf.data.Dataset.from_tensor_slices(dict(df))

ratings = ratings.map(lambda x: {'item': x['item'],
                                 'reviewer_id': x['reviewerID'],
                                 'rating': x['rating']})

shuffled = ratings.shuffle(1_000_000, seed=42, reshuffle_each_iteration=False)
train = shuffled.take(500_000)
test = shuffled.skip(500_000).take(100_000)

items = ratings.batch(3_000_000).map(lambda x: x['item'])
reviewer_ids = ratings.batch(3_000_000).map(lambda x: x['reviewer_id'])

unique_items = np.unique(np.concatenate(list(items)))
unique_reviewer_ids = np.unique(np.concatenate(list(reviewer_ids)))

embedding_dimension = 64

Model Architecture and Metrics

   We can use a RankingModel class to set up the ranking model, which contains embeddings for the reviewers and items built from the vocabulary created for each. Ratings are then predicted with multiple stacked Dense layers using activation='relu' and a decreasing number of nodes.

In [ ]:
class RankingModel(tf.keras.Model):

  def __init__(self):

    """
    Initialize the model by setting up the layers.
    """
    super().__init__()

    self.reviewer_embeddings = tf.keras.Sequential([
        tf.keras.layers.StringLookup(vocabulary=unique_reviewer_ids,
                                     mask_token=None),
        tf.keras.layers.Embedding(len(unique_reviewer_ids) + 1,
                                  embedding_dimension)])

    self.item_embeddings = tf.keras.Sequential([
        tf.keras.layers.StringLookup(vocabulary=unique_items, mask_token=None),
        tf.keras.layers.Embedding(len(unique_items) + 1, embedding_dimension)])

    self.ratings = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1)])

  def call(self, inputs):
    reviewer_id, item = inputs
    reviewer_embedding = self.reviewer_embeddings(reviewer_id)
    item_embedding = self.item_embeddings(item)

    return self.ratings(tf.concat([reviewer_embedding, item_embedding], axis=1))

Let's now examine a sample:

In [ ]:
RankingModel()((['31'], ['2726956']))
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs=['31']. Consider rewriting this model with the Functional API.
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs=['2726956']. Consider rewriting this model with the Functional API.
Out[ ]:
<tf.Tensor: shape=(1, 1), dtype=float32, numpy=array([[0.00874423]], dtype=float32)>

   Next, we can define the loss and metrics using tensorflow_recommenders.

In [ ]:
import tensorflow_recommenders as tfrs

task = tfrs.tasks.Ranking(loss=tf.keras.losses.MeanSquaredError(),
                          metrics=[tf.keras.metrics.RootMeanSquaredError()])

   We can now define the complete model with another class AmazonReviewsModel, where RankingModel() is added in the __init__ method and the compute_loss method calculates the loss and metrics given the features as input. The tfrs.models.Model base class provides the training loop.

In [ ]:
import pprint
from typing import Dict, Text

class AmazonReviewsModel(tfrs.models.Model):

  def __init__(self):
    super().__init__()
    self.ranking_model: tf.keras.Model = RankingModel()
    self.task: tf.keras.layers.Layer = tfrs.tasks.Ranking(
        loss = tf.keras.losses.MeanSquaredError(),
        metrics=[tf.keras.metrics.RootMeanSquaredError()])

  def call(self, features: Dict[str, tf.Tensor]) -> tf.Tensor:
      return self.ranking_model((features['reviewer_id'], features['item']))

  def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
      labels = features.pop('rating')
      rating_predictions = self(features)

      return self.task(labels=labels, predictions=rating_predictions)

   Now, we can instantiate and compile the model and prepare the train/test sets.

In [ ]:
model = AmazonReviewsModel()
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.01))

cached_train = train.shuffle(1_000_000).batch(1).cache()
cached_test = test.batch(1).cache()

Fit, Evaluate and Save Model

   Let's now set the path where the model is saved and set up the logs/callbacks: the EarlyStopping class that monitors the root_mean_squared_error and stops training if it does not improve after 3 epochs, the ModelCheckpoint class that monitors the root_mean_squared_error and saves only the best weights with the minimum value, and the TensorBoard class that enables visualizations in TensorBoard.

   Now the model can be trained by calling fit on the cached_train dataset for 20 epochs (the number of iterations over the entire cached_train set), using a batch_size=1 (the number of samples per gradient update) and the specified callbacks from the callbacks_list.

In [ ]:
import datetime
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

!rm -rf /logs/

%load_ext tensorboard

log_folder = 'logs/fit/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')

filepath = 'rankingBaseline_MoviesTV_weights_only.h5'
checkpoint_dir = os.path.dirname(filepath)

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_folder,
                                                      histogram_freq=1)

callbacks_list = [EarlyStopping(monitor='root_mean_squared_error', patience=3),
                  ModelCheckpoint(filepath, monitor='root_mean_squared_error',
                                  save_weights_only=True, mode='min'),
                  tensorboard_callback]

history = model.fit(cached_train, epochs=20, callbacks=callbacks_list)
Epoch 1/20
500000/500000 [==============================] - 1359s 3ms/step - root_mean_squared_error: 0.8855 - loss: 0.7844 - regularization_loss: 0.0000e+00 - total_loss: 0.7844
Epoch 2/20
500000/500000 [==============================] - 1350s 3ms/step - root_mean_squared_error: 0.8323 - loss: 0.6929 - regularization_loss: 0.0000e+00 - total_loss: 0.6929
Epoch 3/20
500000/500000 [==============================] - 1351s 3ms/step - root_mean_squared_error: 0.7777 - loss: 0.6051 - regularization_loss: 0.0000e+00 - total_loss: 0.6051
Epoch 4/20
500000/500000 [==============================] - 1346s 3ms/step - root_mean_squared_error: 0.7191 - loss: 0.5173 - regularization_loss: 0.0000e+00 - total_loss: 0.5173
Epoch 5/20
500000/500000 [==============================] - 1344s 3ms/step - root_mean_squared_error: 0.6644 - loss: 0.4416 - regularization_loss: 0.0000e+00 - total_loss: 0.4416
Epoch 6/20
500000/500000 [==============================] - 1345s 3ms/step - root_mean_squared_error: 0.6109 - loss: 0.3734 - regularization_loss: 0.0000e+00 - total_loss: 0.3734
Epoch 7/20
500000/500000 [==============================] - 1345s 3ms/step - root_mean_squared_error: 0.5578 - loss: 0.3113 - regularization_loss: 0.0000e+00 - total_loss: 0.3113
Epoch 8/20
500000/500000 [==============================] - 1342s 3ms/step - root_mean_squared_error: 0.5080 - loss: 0.2582 - regularization_loss: 0.0000e+00 - total_loss: 0.2582
Epoch 9/20
500000/500000 [==============================] - 1352s 3ms/step - root_mean_squared_error: 0.4633 - loss: 0.2147 - regularization_loss: 0.0000e+00 - total_loss: 0.2147
Epoch 10/20
500000/500000 [==============================] - 1363s 3ms/step - root_mean_squared_error: 0.4246 - loss: 0.1804 - regularization_loss: 0.0000e+00 - total_loss: 0.1804
Epoch 11/20
500000/500000 [==============================] - 1347s 3ms/step - root_mean_squared_error: 0.3915 - loss: 0.1533 - regularization_loss: 0.0000e+00 - total_loss: 0.1533
Epoch 12/20
500000/500000 [==============================] - 1352s 3ms/step - root_mean_squared_error: 0.3629 - loss: 0.1318 - regularization_loss: 0.0000e+00 - total_loss: 0.1318
Epoch 13/20
500000/500000 [==============================] - 1347s 3ms/step - root_mean_squared_error: 0.3384 - loss: 0.1145 - regularization_loss: 0.0000e+00 - total_loss: 0.1145
Epoch 14/20
500000/500000 [==============================] - 1352s 3ms/step - root_mean_squared_error: 0.3171 - loss: 0.1006 - regularization_loss: 0.0000e+00 - total_loss: 0.1006
Epoch 15/20
500000/500000 [==============================] - 1367s 3ms/step - root_mean_squared_error: 0.2982 - loss: 0.0889 - regularization_loss: 0.0000e+00 - total_loss: 0.0889
Epoch 16/20
500000/500000 [==============================] - 1364s 3ms/step - root_mean_squared_error: 0.2819 - loss: 0.0795 - regularization_loss: 0.0000e+00 - total_loss: 0.0795
Epoch 17/20
500000/500000 [==============================] - 1357s 3ms/step - root_mean_squared_error: 0.2675 - loss: 0.0716 - regularization_loss: 0.0000e+00 - total_loss: 0.0716
Epoch 18/20
500000/500000 [==============================] - 1359s 3ms/step - root_mean_squared_error: 0.2548 - loss: 0.0649 - regularization_loss: 0.0000e+00 - total_loss: 0.0649
Epoch 19/20
500000/500000 [==============================] - 1356s 3ms/step - root_mean_squared_error: 0.2433 - loss: 0.0592 - regularization_loss: 0.0000e+00 - total_loss: 0.0592
Epoch 20/20
500000/500000 [==============================] - 1358s 3ms/step - root_mean_squared_error: 0.2330 - loss: 0.0543 - regularization_loss: 0.0000e+00 - total_loss: 0.0543

   Now, let's save the model so it can be reloaded in the future, and evaluate it on the test set.

In [ ]:
tf.saved_model.save(model, 'rankingBaseline_MoviesTV_20epochs')
WARNING:absl:Found untraced functions such as ranking_10_layer_call_fn, ranking_10_layer_call_and_return_conditional_losses while saving (showing 2 of 2). These functions will not be directly callable after loading.
In [ ]:
model.evaluate(cached_test, return_dict=True)
100000/100000 [==============================] - 197s 2ms/step - root_mean_squared_error: 1.2426 - loss: 1.5440 - regularization_loss: 0.0000e+00 - total_loss: 1.5440
Out[ ]:
{'root_mean_squared_error': 1.2425607442855835,
 'loss': 0.1961183398962021,
 'regularization_loss': 0,
 'total_loss': 0.1961183398962021}

   We can examine metrics from the model and plot the model loss over the training epochs.

In [ ]:
import matplotlib.pyplot as plt

plt.plot(history.history['loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train'], loc='upper right')
plt.show()

   Let's now load the saved model, generate a sample prediction with it, and evaluate on the cached test set.

In [ ]:
loaded = tf.saved_model.load('rankingBaseline_MoviesTV_20epochs')
loaded({'reviewer_id': np.array(['AV6QDP8Q0ONK4']),
        'item': ['B009934S5M']}).numpy()
Out[ ]:
array([[4.7614393]], dtype=float32)
In [ ]:
model.evaluate(cached_test, return_dict=True)
100000/100000 [==============================] - 204s 2ms/step - root_mean_squared_error: 1.2426 - loss: 1.5440 - regularization_loss: 0.0000e+00 - total_loss: 1.5440
Out[ ]:
{'root_mean_squared_error': 1.2425607442855835,
 'loss': 0.1961183398962021,
 'regularization_loss': 0,
 'total_loss': 0.1961183398962021}

Predictions

   We can test out the ranking model by computing predictions for a subset of the items and ranking them based on the predicted scores.

AV6QDP8Q0ONK4

   This reviewer had 3981 ratings in the filtered set.
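
   A count like this can be checked with a quick value_counts lookup (a sketch only; it assumes the filtered DataFrame loaded above, with its reviewerID column, is still in scope):

reviewer_counts = df['reviewerID'].value_counts()
print(reviewer_counts.loc['AV6QDP8Q0ONK4'])  # ~3,981 ratings in the filtered set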

In [ ]:
test_ratings = {}
test_item = ['B009934S5M','B00PY4Q9OS', 'B00R8GUXPG','B00Q0G2VXM','B00NYC65M8']
for item in test_item:
    test_ratings[item] = model({
        'reviewer_id': np.array(['AV6QDP8Q0ONK4']),
        'item': np.array([item])})

print('Ratings:')
for title, score in sorted(test_ratings.items(), key=lambda x: x[1],
                           reverse=True):
    print(f'{title}: {score}')
Ratings:
B00NYC65M8: [[5.2519903]]
B00R8GUXPG: [[4.836647]]
B009934S5M: [[4.7614393]]
B00PY4Q9OS: [[4.7345595]]
B00Q0G2VXM: [[4.623266]]

ABO2ZI2Y5DQ9T

   This reviewer had 2068 ratings in the filtered set.

In [ ]:
test_ratings = {}
test_item = ['B009934S5M','B00PY4Q9OS', 'B00R8GUXPG','B00Q0G2VXM','B00NYC65M8']
for item in test_item:
    test_ratings[item] = model({
       'reviewer_id': np.array(['ABO2ZI2Y5DQ9T']),
       'item': np.array([item])})

print('Ratings:')
for title, score in sorted(test_ratings.items(), key=lambda x: x[1],
                           reverse=True):
    print(f'{title}: {score}')
Ratings:
B00R8GUXPG: [[4.1731825]]
B00PY4Q9OS: [[3.9811945]]
B009934S5M: [[3.6320379]]
B00NYC65M8: [[3.6279793]]
B00Q0G2VXM: [[3.5430803]]
In [ ]:
test_ratings = {}
test_item = ['B00KHW4XHM', 'B01F6EHOIK', 'B00199PP6U', 'B00009IAXL']
for item in test_item:
    test_ratings[item] = model({
        'reviewer_id': np.array(['ABO2ZI2Y5DQ9T']),
        'item': np.array([item])})

print('Ratings:')
for title, score in sorted(test_ratings.items(), key=lambda x: x[1],
                           reverse=True):
    print(f'{title}: {score}')
Ratings:
B00KHW4XHM: [[4.799788]]
B00199PP6U: [[4.7554584]]
B00009IAXL: [[3.4191265]]
B01F6EHOIK: [[1.9375358]]
In [ ]:
test_ratings = {}
test_item = ['B0006FFRKM', '6304492952', 'B009934S5M', 'B000HT3QBY']
for item in test_item:
    test_ratings[item] = model({
        'reviewer_id': np.array(['ABO2ZI2Y5DQ9T']),
        'item': np.array([item])})

print('Ratings:')
for title, score in sorted(test_ratings.items(), key=lambda x: x[1],
                           reverse=True):
    print(f'{title}: {score}')
Ratings:
B0006FFRKM: [[5.2371516]]
B009934S5M: [[3.6320379]]
B000HT3QBY: [[3.4291909]]
6304492952: [[3.126309]]

A328S9RN3U5M68

   This reviewer had 1997 ratings in the filtered set.

In [ ]:
test_ratings = {}
test_item = ['B009934S5M','B00PY4Q9OS', 'B00R8GUXPG','B00Q0G2VXM','B00NYC65M8']
for item in test_item:
    test_ratings[item] = model({
        'reviewer_id': np.array(['A328S9RN3U5M68']),
        'item': np.array([item])})

print('Ratings:')
for title, score in sorted(test_ratings.items(), key=lambda x: x[1],
                           reverse=True):
    print(f'{title}: {score}')
Ratings:
B00R8GUXPG: [[4.9083266]]
B00PY4Q9OS: [[4.6880455]]
B009934S5M: [[4.497545]]
B00Q0G2VXM: [[4.416779]]
B00NYC65M8: [[3.2599416]]
In [ ]:
test_ratings = {}
test_item = ['B00KHW4XHM', 'B01F6EHOIK', 'B00199PP6U', 'B00009IAXL']
for item in test_item:
    test_ratings[item] = model({
        'reviewer_id': np.array(['A328S9RN3U5M68']),
        'item': np.array([item])})

print('Ratings:')
for title, score in sorted(test_ratings.items(), key=lambda x: x[1],
                           reverse=True):
    print(f'{title}: {score}')
Ratings:
B00KHW4XHM: [[5.38043]]
B00199PP6U: [[5.1638284]]
B00009IAXL: [[3.2683938]]
B01F6EHOIK: [[3.0241704]]
In [ ]:
test_ratings = {}
test_item = ['B0006FFRKM', '6304492952', 'B009934S5M', 'B000HT3QBY']
for item in test_item:
    test_ratings[item] = model({
        'reviewer_id': np.array(['A328S9RN3U5M68']),
        'item': np.array([item])})

print('Ratings:')
for title, score in sorted(test_ratings.items(), key=lambda x: x[1],
                           reverse=True):
    print(f'{title}: {score}')
Ratings:
B0006FFRKM: [[5.147983]]
B009934S5M: [[4.497545]]
B000HT3QBY: [[4.3494844]]
6304492952: [[3.8749783]]

A3MV1KKHX51FYT

   This reviewer had 1986 ratings in the filtered set.

In [ ]:
test_ratings = {}
test_item = ['B009934S5M','B00PY4Q9OS', 'B00R8GUXPG','B00Q0G2VXM','B00NYC65M8']
for item in test_item:
    test_ratings[item] = model({
        'reviewer_id': np.array(['A3MV1KKHX51FYT']),
        'item': np.array([item])})

print('Ratings:')
for title, score in sorted(test_ratings.items(), key=lambda x: x[1],
                           reverse=True):
    print(f'{title}: {score}')
Ratings:
B009934S5M: [[3.0159693]]
B00Q0G2VXM: [[2.8539734]]
B00R8GUXPG: [[2.4100082]]
B00PY4Q9OS: [[2.2210793]]
B00NYC65M8: [[2.1591673]]
In [ ]:
test_ratings = {}
test_item = ['B00KHW4XHM', 'B01F6EHOIK', 'B00199PP6U', 'B00009IAXL']
for item in test_item:
    test_ratings[item] = model({
        'reviewer_id': np.array(['A3MV1KKHX51FYT']),
        'item': np.array([item])})

print('Ratings:')
for title, score in sorted(test_ratings.items(), key=lambda x: x[1],
                           reverse=True):
    print(f'{title}: {score}')
Ratings:
B00009IAXL: [[4.1028647]]
B00KHW4XHM: [[3.1912699]]
B00199PP6U: [[2.5072653]]
B01F6EHOIK: [[1.9650021]]
In [ ]:
test_ratings = {}
test_item = ['B0006FFRKM', '6304492952', 'B009934S5M', 'B000HT3QBY']
for item in test_item:
    test_ratings[item] = model({
        'reviewer_id': np.array(['A3MV1KKHX51FYT']),
        'item': np.array([item])})

print('Ratings:')
for title, score in sorted(test_ratings.items(), key=lambda x: x[1],
                           reverse=True):
    print(f'{title}: {score}')
Ratings:
B000HT3QBY: [[5.27298]]
B0006FFRKM: [[3.945326]]
B009934S5M: [[3.0159693]]
6304492952: [[2.4662666]]

   We can also convert the TensorFlow model to a TensorFlow Lite model so it can run on-device.

In [ ]:
converter = tf.lite.TFLiteConverter.from_saved_model('rankingBaseline_MoviesTV_20epochs')
tflite_model = converter.convert()
open('converted_model.tflite', 'wb').write(tflite_model)
Out[ ]:
34591500

   Let's now get the input and output tensors and test the model for a few of the observations in the set.

In [ ]:
interpreter = tf.lite.Interpreter(model_path='converted_model.tflite')
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

if input_details[0]['name'] == 'serving_default_item:0':
    interpreter.set_tensor(input_details[0]['index'], np.array(['B009934S5M']))
    interpreter.set_tensor(input_details[1]['index'],
                           np.array(['AV6QDP8Q0ONK4']))
else:
    interpreter.set_tensor(input_details[0]['index'],
                           np.array(['AV6QDP8Q0ONK4']))
    interpreter.set_tensor(input_details[1]['index'], np.array(['B009934S5M']))

interpreter.invoke()

rating = interpreter.get_tensor(output_details[0]['index'])
print(rating)
[[4.7614403]]
In [ ]:
if input_details[0]['name'] == 'serving_default_item:0':
    interpreter.set_tensor(input_details[0]['index'], np.array(['B00PY4Q9OS']))
    interpreter.set_tensor(input_details[1]['index'],
                           np.array(['ABO2ZI2Y5DQ9T']))
else:
    interpreter.set_tensor(input_details[0]['index'],
                           np.array(['ABO2ZI2Y5DQ9T']))
    interpreter.set_tensor(input_details[1]['index'], np.array(['B00PY4Q9OS']))

interpreter.invoke()

rating = interpreter.get_tensor(output_details[0]['index'])
print(rating)
[[3.9811943]]
In [ ]:
if input_details[0]['name'] == 'serving_default_item:0':
    interpreter.set_tensor(input_details[0]['index'], np.array(['B00R8GUXPG']))
    interpreter.set_tensor(input_details[1]['index'],
                           np.array(['A328S9RN3U5M68']))
else:
    interpreter.set_tensor(input_details[0]['index'],
                           np.array(['A328S9RN3U5M68']))
    interpreter.set_tensor(input_details[1]['index'], np.array(['B00R8GUXPG']))

interpreter.invoke()

rating = interpreter.get_tensor(output_details[0]['index'])
print(rating)
[[4.9083266]]

Recommendation Systems using Surprise

   The data is loaded from a pandas.DataFrame using the Reader class prior to modeling. For the initial training of the models using surprise, the default parameters of NormalPredictor, BaselineOnly, KNNBaseline, KNNBasic, KNNWithMeans, KNNWithZScore, CoClustering, SVD, SVDpp (an extension of SVD that takes into account implicit ratings) and NMF were evaluated with the cross_validate method using 3-fold cross validation to determine which algorithm yielded the lowest RMSE.

Basic algorithms

  • NormalPredictor: The NormalPredictor algorithm predicts a random rating based on the distribution of the training set, which is assumed to be normal. This is one of the most basic algorithms and does not do much work

  • BaselineOnly: BaselineOnly algorithm predicts the baseline estimate for given user and item

k-NN algorithms

  • KNNBasic: KNNBasic is a basic collaborative filtering algorithm

  • KNNWithMeans: KNNWithMeans is basic collaborative filtering algorithm, taking into account the mean ratings of each user

  • KNNWithZScore: KNNWithZScore is a basic collaborative filtering algorithm, taking into account the z-score normalization of each user

  • KNNBaseline: KNNBaseline is a basic collaborative filtering algorithm taking into account a baseline rating

Matrix Factorization-based algorithms

-SVD: SVD algorithm is equivalent to Probabilistic Matrix Factorization

-SVDpp: The SVDpp algorithm is an extension of SVD that takes into account implicit ratings

-NMF: NMF is a collaborative filtering algorithm based on Non-negative Matrix Factorization. It is very similar with SVD

  • Co-clustering: Coclustering is a collaborative filtering algorithm based on co-clustering
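
   As a reference point for what these models estimate, the baseline prediction is the global mean adjusted by user and item biases, and the matrix factorization models add a dot product of latent factors on top of it. A minimal sketch with toy numbers (the values are illustrative, not taken from the data):

In [ ]:
import numpy as np

mu = 4.2                      # global mean rating
b_u, b_i = 0.3, -0.5          # reviewer and item biases
p_u = np.array([0.1, -0.2])   # reviewer latent factors (SVD/SVDpp)
q_i = np.array([0.4, 0.3])    # item latent factors

baseline_est = mu + b_u + b_i            # BaselineOnly estimate
svd_est = baseline_est + q_i.dot(p_u)    # SVD-style estimate
print(round(baseline_est, 2), round(svd_est, 2))   # 4.0 3.98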

   This revealed that SVDpp generated the lowest RMSE, but it took the longest to fit and test. The default parameters for SVDpp use 20 epochs for fitting the model, so experimenting with fewer epochs and other model parameters should reduce the runtime while potentially maintaining a low RMSE. The results from KNNBaseline demonstrate a comparably low error with a significantly lower runtime, so hyperparameter tuning might make it the better choice, especially given larger sample sizes.

In [ ]:
from surprise import Dataset, Reader
import time
from surprise import BaselineOnly, KNNBaseline, KNNBasic, KNNWithMeans
from surprise import KNNWithZScore, CoClustering, SVD, SVDpp, NMF, NormalPredictor
from surprise.model_selection import cross_validate

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['reviewer_id', 'item_id', 'rating']], reader)

print('Time for iterating through different algorithms..')
search_time_start = time.time()
benchmark = []
for algorithm in [BaselineOnly(), KNNBaseline(), KNNBasic(), KNNWithMeans(),
                  KNNWithZScore(), CoClustering(), SVD(), SVDpp(), NMF(),
                  NormalPredictor()]:

    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3,
                             verbose=False, n_jobs=-1)

    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = pd.concat([tmp, pd.Series([str(algorithm).split(' ')[0].split('.')[-1]],
                                    index=['Algorithm'])])
    benchmark.append(tmp)
print('Finished iterating through different algorithms:',
      time.time() - search_time_start)

surprise_results = pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')
print('Results from testing different algorithms:')
print(surprise_results)

del surprise_results
Time for iterating through different algorithms..
Finished iterating through different algorithms: 1877.5992422103882
Results from testing different algorithms:
                 test_rmse     fit_time  test_time
Algorithm                                         
SVDpp             0.951981  1449.569379  34.628475
SVD               0.956978    56.881580   3.054269
BaselineOnly      0.958829     0.541074   2.015631
KNNBaseline       0.986208    12.493270  27.932690
KNNWithMeans      0.996443    12.406364  24.509705
KNNWithZScore     1.002136    13.109460  26.066691
CoClustering      1.012724    18.731312   2.226075
NMF               1.051432    59.292935   2.782387
KNNBasic          1.106805    12.031695  23.257696
NormalPredictor   1.505521     0.629097   2.384375

   Let's read the train/test sets and load them using the reader from surprise.

In [ ]:
train = pd.read_csv('train_filtered.csv', sep='|')
train.columns = ['rating', 'item_id', 'reviewer_id']
train['reviewer_id'] = train['reviewer_id'].str.extract(pat='(\d+)',
                                                        expand=False)
train['item_id'] = train['item_id'].str.extract(pat='(\d+)', expand=False)
train['reviewer_id'] = train['reviewer_id'].astype(int)
train['item_id'] = train['item_id'].astype(int)

test = pd.read_csv('eval_filtered.csv', sep='|')
test.columns = ['rating', 'item_id', 'reviewer_id']
test['reviewer_id'] = test['reviewer_id'].str.extract(pat='(\d+)', expand=False)
test['item_id'] = test['item_id'].str.extract(pat='(\d+)', expand=False)
test['reviewer_id'] = test['reviewer_id'].astype(int)
test['item_id'] = test['item_id'].astype(int)

train = Dataset.load_from_df(train[['reviewer_id', 'item_id', 'rating']], reader)
test = Dataset.load_from_df(test[['reviewer_id', 'item_id', 'rating']], reader)

# The surprise algorithms are fit on a Trainset and tested on a list of
# (user, item, rating) tuples, so build these from the loaded Datasets.
train = train.build_full_trainset()
test = test.build_full_trainset().build_testset()

SVDpp with Lowest RMSE

   Then, we can set the path to save the results, fit the model using 3-fold cross validation with the default parameters for 3 epochs and examine the RMSE for the train/test sets.

In [ ]:
from surprise import accuracy, dump

print('Train/predict using SVDpp default parameters for 3 epochs:')
print('\n')
print('Time for iterating through SVDpp default parameters..')
search_time_start = time.time()
algo = SVDpp(n_epochs=3, random_state=seed_value)
cv = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=False,
                    n_jobs=-1)
print('Finished iterating through SVDpp default parameters:',
      time.time() - search_time_start)
print('\n')
print('Cross validation results:')
for key, value in cv.items():
    print(key, ' : ', value)
print('\n')

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./SVDpp_3epochs_DefaultParamModel_file', predictions, algo)
#predictions, algo = dump.load('./SVDpp_3epochs_DefaultParamModel_file')
Train/predict using SVDpp default parameters for 3 epochs:


Time for iterating through SVDpp default parameters..
Finished iterating through SVDpp default parameters: 259.01811718940735


Cross validation results:
test_rmse  :  [0.99186233 0.99067176 0.98925811]
test_mae  :  [0.75639429 0.75586697 0.75391987]
fit_time  :  (212.8136830329895, 213.71945691108704, 214.16111540794373)
test_time  :  (34.81285309791565, 35.151495695114136, 34.626320362091064)


RMSE from fit best parameters on train predict on test:
RMSE: 0.9811
0.9811266996736911

   Since we removed the original item and reviewerID columns and created numerical ids to handle the sparsity issue, let's read the Movies_and_TV_idMatch.csv set so the original features are present.

In [ ]:
df = pd.read_csv('Movies_and_TV_idMatch.csv')
df = df.drop_duplicates()

print('Sample observations:')
df.head()
Sample observations:
Out[ ]:
item item_id reviewerID reviewer_id
0 0001527665 45044 A2VHSG6TZHU1OB 1406814
1 0001527665 45044 A23EJWOW1TLENE 3722126
2 0001527665 45044 A1KM9FNEJ8Q171 842275
3 0001527665 45044 A38LY2SSHVHRYB 2915259
4 0001527665 45044 AHTYUW2H1276L 2915260

   To examine the results from fitting the models, let's now define two functions: get_Ir, which determines the number of items a given reviewer has rated, and get_Ri, which determines the number of reviewers that have rated a given item, based on the training set.

In [ ]:
def get_Ir(reviewer_id):
    """
    Determine the number of items rated by given reviewer
    Args:
      reviewerID: the id of the reviewer
    Returns:
      Number of items rated by the reviewer
    """
    try:
        return len(train.ur[train.to_inner_uid(reviewer_id)])
    except ValueError:
        return 0

def get_Ri(item_id):
    """
    Determine number of reviewers that rated given item
    Args:
      itemID: the id of the item
    Returns:
     Number of reviewers that have rated the item
    """
    try:
        return len(train.ir[train.to_inner_iid(item_id)])
    except ValueError:
        return 0
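
   For example, the functions can be called directly with the raw ids; the pair below appears in the first prediction table that follows, where reviewer_id 1901 has 81 ratings and item_id 286 has 247 ratings in the training set:

In [ ]:
print(get_Ir(1901))   # items rated by reviewer_id 1901 (81 per the table below)
print(get_Ri(286))    # reviewers that rated item_id 286 (247 per the table below)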

   We can convert the predictions from the model to a pandas.DataFrame, apply the functions and examine the top 10 best predictions:

In [ ]:
df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)

df2 = pd.merge(df, df1, how='right',
               left_on=['reviewer_id', 'item_id'],
               right_on=['reviewer_id', 'item_id'])
df2 = df2.drop_duplicates()

best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)
Best 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
2465    B00005JMFQ      286   ATOHBACMBJQQP         1901  5.0  5.0   
103694  B005LAJ23A      300   ALJ0ANSP3D93H         2922  5.0  5.0   
140212  B001UHOWX8      444  A1QE1C5NZL4AVN        11094  5.0  5.0   
196650  B0027VST2Q     1488   A246D56KFO5CG          198  5.0  5.0   
34119   6302428122     2501   AKD3NT9SO1VT6         2454  5.0  5.0   
14052   B00003CWT6      151  A3C96IWQ26Q46I         3719  5.0  5.0   
34137   6305800073     2600  A3RTSDHZFK344J         3916  5.0  5.0   
159982  B00PY4Q9OS       14  A2YW1GDT8MQB88         1066  5.0  5.0   
64899   B00DY64C8S      147  A12DWEETVVOX4G         5448  5.0  5.0   
103569  B00003CY6C     5672  A1E0ILXNP3LKAS          882  5.0  5.0   

                          details   Iu   Ui  err  
2465    {'was_impossible': False}   81  247  0.0  
103694  {'was_impossible': False}   59  190  0.0  
140212  {'was_impossible': False}   28  193  0.0  
196650  {'was_impossible': False}  284   47  0.0  
34119   {'was_impossible': False}   62   43  0.0  
14052   {'was_impossible': False}   51  440  0.0  
34137   {'was_impossible': False}   50   92  0.0  
159982  {'was_impossible': False}   97  837  0.0  
64899   {'was_impossible': False}   39  223  0.0  
103569  {'was_impossible': False}  121   79  0.0  

   Let's examine the worst 10 predictions as well:

In [ ]:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df2, best_predictions, worst_predictions
Worst 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui       est  \
26621   6303921248      269  A1C1T97KIJPGIX         4844  1.0  4.995164   
148629  B0002WYTWG      172   A7Y6AVS576M03           42  1.0  5.000000   
44833   B0000VCZK2      409  A3K89LA3W7YS1Q         5789  1.0  5.000000   
67997   6302168465      534  A2X5WF3FPJ8C9J         8107  1.0  5.000000   
29490   B01D9EUNBY     2691  A3FWD37OUCOQDL          778  1.0  5.000000   
110642  B00005JNJV      451  A1JU644REKXCNO         1311  1.0  5.000000   
119911  B00DMK6WN4     1915   ANHTCXAG77MBM         1331  1.0  5.000000   
109341  B01G06T00S     2752  A3NONVSAUJAFXC          835  1.0  5.000000   
33420   0800141709      752  A28YE81E63ZOZT         4933  1.0  5.000000   
118018  B00AMSLDW4     1834  A157SI9J9ECKYH         1666  1.0  5.000000   

                          details   Iu   Ui       err  
26621   {'was_impossible': False}   38  158  3.995164  
148629  {'was_impossible': False}  660  588  4.000000  
44833   {'was_impossible': False}   41  243  4.000000  
67997   {'was_impossible': False}   29  262  4.000000  
29490   {'was_impossible': False}  122   79  4.000000  
110642  {'was_impossible': False}   93  461  4.000000  
119911  {'was_impossible': False}   95   63  4.000000  
109341  {'was_impossible': False}  116    4  4.000000  
33420   {'was_impossible': False}   43  209  4.000000  
118018  {'was_impossible': False}   77   97  4.000000  

   Hyperparameter optimization using GridSearchCV was performed to find the best parameters. Since this algorithm is computationally expensive with gradient descent, the number of epochs was fixed at 20 rather than searched. Numbers of factors smaller than the default n_factors=20 (5, 10 and 15) were evaluated, and values around the defaults of lr_all=0.007 and reg_all=0.02 were included in the search. Let's now define the parameters for the grid search.

In [ ]:
param_grid = {'n_epochs': [20],
              'n_factors': [5, 10, 15],
              'lr_all': [7e-4, 7e-3, 7e-2],
              'reg_all': [7e-2, 5e-2, 2e-2, 7e-1],
              'random_state': [seed_value]}
print('Grid search parameters:')
param_grid
Grid search parameters:
Out[ ]:
{'n_epochs': [20],
 'n_factors': [5, 10, 15],
 'lr_all': [0.0007, 0.007, 0.07],
 'reg_all': [0.07, 0.05, 0.02, 0.7],
 'random_state': [42]}

   Now we can run the grid search with RMSE and MAE as the metrics; this grid contains 1 x 3 x 3 x 4 = 36 parameter combinations, so 3-fold cross validation fits 108 models. Then we can examine the parameters that resulted in the lowest RMSE.

In [ ]:
from surprise.model_selection import GridSearchCV

gs = GridSearchCV(SVDpp, param_grid, measures=['rmse', 'mae'], cv=3,
                  joblib_verbose=-1, n_jobs=-1)
print('Time for iterating grid search parameters..')
search_time_start = time.time()
gs.fit(data)
print('Finished iterating grid search parameters:',
      time.time() - search_time_start)
print('\n')
print('Lowest RMSE from Grid Search:')
print(gs.best_score['rmse'])
print('\n')
print('Parameters of Model with lowest RMSE from Grid Search:')
print(gs.best_params['rmse'])
Time for iterating grid search parameters..
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 tasks      | elapsed: 87.7min
Finished iterating grid search parameters: 13742.905663013458


Lowest RMSE from Grid Search:
0.9454947143378393


Parameters of Model with lowest RMSE from Grid Search:
{'n_epochs': 20, 'n_factors': 5, 'lr_all': 0.007, 'reg_all': 0.05, 'random_state': 42}
[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed: 229.0min finished
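
   If the grid search object is not kept around, the winning configuration reported above can also be rebuilt directly instead of pulling it from gs.best_estimator; a small sketch:

In [ ]:
algo = SVDpp(n_epochs=20, n_factors=5, lr_all=0.007, reg_all=0.05,
             random_state=seed_value)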
In [ ]:
results_df = pd.DataFrame.from_dict(gs.cv_results)
results_df = results_df.sort_values('mean_test_rmse', ascending=True)

print('SVDpp GridSearch HPO Cross Validation Results:')
print(results_df.head())

del results_df
SVDpp GridSearch HPO Cross Validation Results:
    split0_test_rmse  split1_test_rmse  split2_test_rmse  mean_test_rmse  \
5           0.948897          0.943507          0.944079        0.945495   
4           0.948880          0.943676          0.944297        0.945618   
16          0.948938          0.943844          0.944304        0.945695   
28          0.948962          0.943712          0.944634        0.945769   
17          0.949299          0.943932          0.944247        0.945826   

    std_test_rmse  rank_test_rmse  split0_test_mae  split1_test_mae  \
5        0.002417               1         0.684832         0.682043   
4        0.002321               2         0.686716         0.684125   
16       0.002301               3         0.686884         0.684269   
28       0.002289               4         0.686872         0.684129   
17       0.002459               5         0.685130         0.682168   

    split2_test_mae  mean_test_mae  std_test_mae  rank_test_mae  \
5          0.681793       0.682889      0.001377              4   
4          0.683896       0.684913      0.001279              7   
16         0.683806       0.684986      0.001355              8   
28         0.684206       0.685069      0.001276              9   
17         0.681556       0.682951      0.001561              5   

    mean_fit_time  std_fit_time  mean_test_time  std_test_time  \
5     1538.214133      9.366337       74.756616       0.963164   
4     1544.355476      3.865669       74.432318       1.427925   
16    1910.758532      9.697008       73.113926       0.483212   
28    2274.356767     17.361908       70.215237       1.224627   
17    1896.146014      8.418769       71.352642       0.601564   

                                               params  param_n_epochs  \
5   {'n_epochs': 20, 'n_factors': 5, 'lr_all': 0.0...              20   
4   {'n_epochs': 20, 'n_factors': 5, 'lr_all': 0.0...              20   
16  {'n_epochs': 20, 'n_factors': 10, 'lr_all': 0....              20   
28  {'n_epochs': 20, 'n_factors': 15, 'lr_all': 0....              20   
17  {'n_epochs': 20, 'n_factors': 10, 'lr_all': 0....              20   

    param_n_factors  param_lr_all  param_reg_all  param_random_state  
5                 5         0.007           0.05                  42  
4                 5         0.007           0.07                  42  
16               10         0.007           0.07                  42  
28               15         0.007           0.07                  42  
17               10         0.007           0.05                  42  


   Then we can use the parameters that resulted in the lowest RMSE to fit on the train set, predict on the test set, apply the functions and save the prediction results.

In [ ]:
algo = gs.best_estimator['rmse']

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./SVDpp_bestGrid_Model_file', predictions, algo)
#predictions, algo = dump.load('./SVDpp_bestGrid_Model_file')

df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)

df2 = pd.merge(df, df1, how='right',
               left_on=['reviewer_id', 'item_id'],
               right_on=['reviewer_id', 'item_id'])

del df1
RMSE from fit best parameters on train predict on test:
RMSE: 0.9466
0.9465595720332523

   Let's now examine the top 10 best predictions:

In [ ]:
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)
Best 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
66148   0790732300     3391  A26JGAM6GZMM4V         1115  5.0  5.0   
68178   B004AUP3F8     5195   A3B3182H8G4VF        12878  5.0  5.0   
68175   B000LXGXY8     9565  A3KOWOXEUEHJX8         3186  5.0  5.0   
175864  6304618352     2453  A1ADL615ZYH2ZZ         3773  5.0  5.0   
213457  B00364K7AU      420  A1E7VTRDMI4XMV         6532  5.0  5.0   
68171   0800141660      512  A37FC6SI13C7ZG         9096  5.0  5.0   
68168   6300213730     2956   APNWI1W2D3HGH        13356  5.0  5.0   
68158   6301933532     8075  A1TN8INXTQN040         3830  5.0  5.0   
68123   B0002F6BTM     2041   AI71P00BG70FQ         7200  5.0  5.0   
68113   B000BITV1A     7983   A4WNCSJJY011A         7108  5.0  5.0   

                          details   Iu   Ui  err  
66148   {'was_impossible': False}  101  112  0.0  
68178   {'was_impossible': False}   25   29  0.0  
68175   {'was_impossible': False}   51   28  0.0  
175864  {'was_impossible': False}   47  137  0.0  
213457  {'was_impossible': False}   40  185  0.0  
68171   {'was_impossible': False}   27  168  0.0  
68168   {'was_impossible': False}   25   75  0.0  
68158   {'was_impossible': False}   47   46  0.0  
68123   {'was_impossible': False}   31   52  0.0  
68113   {'was_impossible': False}   38   24  0.0  

   Let's examine the worst 10 predictions as well:

In [ ]:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df2, best_predictions, worst_predictions
Worst 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
110642  B00005JNJV      451  A1JU644REKXCNO         1311  1.0  5.0   
39171   B000E5LEXS      906  A3TY31K1VXAGSX         3522  1.0  5.0   
67997   6302168465      534  A2X5WF3FPJ8C9J         8107  1.0  5.0   
12513   B00007K027    23913  A192UPT6KST3E2          804  1.0  5.0   
11064   6304961693     2718   AK2FSJJ7GAN8B        10089  1.0  5.0   
132808  6300158764    11537  A3LRLY9YUNZAGS         4212  1.0  5.0   
53552   6302969794    10674  A29ND8RD38SEUZ         2548  1.0  5.0   
157725  B00005U0JX     5196  A19RHMNYUFAN4L        14167  1.0  5.0   
80945   6300216977     2118  A1RVXYN8QAXXNB        18177  1.0  5.0   
204198  6305492042     3565  A2N03TUL3EZZDD         6386  1.0  5.0   

                          details   Iu   Ui  err  
110642  {'was_impossible': False}   93  461  4.0  
39171   {'was_impossible': False}   51   46  4.0  
67997   {'was_impossible': False}   29  262  4.0  
12513   {'was_impossible': False}  117    4  4.0  
11064   {'was_impossible': False}   28   90  4.0  
132808  {'was_impossible': False}   43   35  4.0  
53552   {'was_impossible': False}   68   41  4.0  
157725  {'was_impossible': False}   25   29  4.0  
80945   {'was_impossible': False}   20   61  4.0  
204198  {'was_impossible': False}   32   34  4.0  

SVD

   Let's now fit the model with the default parameters for 3 epochs using 3-fold cross validation and examine the RMSE for the train/test sets.

In [ ]:
print('Train/predict using SVD default parameters for 3 epochs:')
print('\n')
print('Time for iterating through SVD default parameters..')
search_time_start = time.time()
algo = SVD(n_epochs=3, random_state=seed_value)
cv = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=False,
                    n_jobs=-1)
print('Finished iterating through SVD default parameters:',
      time.time() - search_time_start)
print('\n')
print('Cross validation results:')
for key, value in cv.items():
    print(key, ' : ', value)
print('\n')

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./SVD_3epochs_DefaultParamModel_file', predictions, algo)
#predictions, algo = dump.load('./SVD_3epochs_DefaultParamModel_file')

df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)

df2 = pd.merge(df, df1, how='right',
               left_on=['reviewer_id', 'item_id'],
               right_on=['reviewer_id', 'item_id'])
df2.to_csv('predictions_SVD_DefaultParamModel.csv', index=False)

del df1
Train/predict using SVD default parameters for 3 epochs:


Time for iterating through SVD default parameters..
Finished iterating through SVD default parameters: 21.718221426010132


Cross validation results:
test_rmse  :  [1.01170944 1.01306478 1.00997506]
test_mae  :  [0.77944209 0.78081871 0.77882791]
fit_time  :  (9.072836875915527, 8.729201793670654, 8.824720859527588)
test_time  :  (2.9705753326416016, 3.0572056770324707, 2.9552011489868164)


RMSE from fit best parameters on train predict on test:
RMSE: 0.9995
0.9995495492030135

   Let's examine the best 10 predictions:

In [ ]:
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)
Best 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
0       B00YSG2ZPA        0  A37PPLULHT6CFQ         4911  5.0  5.0   
163930  B017RR6YJE     1286   AUDXDMFM49NGY         1279  5.0  5.0   
210635  6303365752      959  A35ZK3M8L9JUPX           55  5.0  5.0   
21967   0790729342     1683   AWG2O9C42XW5G           10  5.0  5.0   
195113  B000QGDJGK    11155  A3FL6CIO8QIJ2F          120  5.0  5.0   
2540    B00BNAE6M4     4244   AS2SJP1G389FE         1761  5.0  5.0   
163781  6300215628     1893  A3DTNUZALZDZPR          313  5.0  5.0   
116666  B00006JE59      407  A3LZBOBV9H1HDV          192  5.0  5.0   
54574   B00004Z4U9     1238   AXOZ5BWOEDL76          142  5.0  5.0   
163763  6304500831      526   AIHBP1TQ3JA86         8633  5.0  5.0   

                          details    Iu   Ui  err  
0       {'was_impossible': False}    41  334  0.0  
163930  {'was_impossible': False}    92  102  0.0  
210635  {'was_impossible': False}   605  186  0.0  
21967   {'was_impossible': False}  1262  122  0.0  
195113  {'was_impossible': False}   392   15  0.0  
2540    {'was_impossible': False}    76   61  0.0  
163781  {'was_impossible': False}   216   71  0.0  
116666  {'was_impossible': False}   286  258  0.0  
54574   {'was_impossible': False}   336  212  0.0  
163763  {'was_impossible': False}    31  186  0.0  

   Let's examine the worst 10 predictions as well:

In [ ]:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df2, best_predictions, worst_predictions
Worst 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui       est  \
217222  B0009AK57Y       92  A1IWR4YH4ZA9BM          246  1.0  4.974257   
21188   6305622825     3674    ASRB35ZZQZGB          291  1.0  4.989281   
83087   B0002Z16HY     1672   AHW9HY6U7MV1Y        17415  1.0  5.000000   
185192  B005X5XIF6      833   AWG2O9C42XW5G           10  1.0  5.000000   
98948   B000E5LEXS      906  A3TY31K1VXAGSX         3522  1.0  5.000000   
184013  6304618352     2453  A3D2VIUT2HWP0Z          439  1.0  5.000000   
201875  1562550888      608  A3B1IPR94VFG1K         3590  1.0  5.000000   
71015   0788832492     1558  A335JCRHG40DUK         4132  1.0  5.000000   
135268  B000M341QE      536   AWG2O9C42XW5G           10  1.0  5.000000   
17450   B00003CXIU     3035  A2JQJ7S0LOXH1G          243  5.0  1.000000   

                          details    Iu   Ui       err  
217222  {'was_impossible': False}   248  151  3.974257  
21188   {'was_impossible': False}   224   92  3.989281  
83087   {'was_impossible': False}    21  169  4.000000  
185192  {'was_impossible': False}  1262  187  4.000000  
98948   {'was_impossible': False}    55   52  4.000000  
184013  {'was_impossible': False}   175  131  4.000000  
201875  {'was_impossible': False}    53  248  4.000000  
71015   {'was_impossible': False}    48  152  4.000000  
135268  {'was_impossible': False}  1262  335  4.000000  
17450   {'was_impossible': False}   261  104  4.000000  

   Let's now define the parameters for the grid search.

In [ ]:
param_grid = {'n_epochs': [30, 35, 40, 45, 50, 55, 60, 65, 70],
              'n_factors': [20, 25, 30, 35, 40, 45, 50],
              'lr_all': [0.002, 0.003, 0.004, 0.005, 0.006, 0.007],
              'reg_all': [0.0001, 0.001, 0.01, 0.02, 0.03, 0.04],
              'random_state': [seed_value]}
print('SVD HPO Grid search parameters:')
param_grid
SVD HPO Grid search parameters:
Out[ ]:
{'n_epochs': [30, 35, 40, 45, 50, 55, 60, 65, 70],
 'n_factors': [20, 25, 30, 35, 40, 45, 50],
 'lr_all': [0.002, 0.003, 0.004, 0.005, 0.006, 0.007],
 'reg_all': [0.0001, 0.001, 0.01, 0.02, 0.03, 0.04],
 'random_state': [42]}

   Now we can run the grid search with RMSE and MAE as the metrics; this grid contains 9 x 7 x 6 x 6 = 2,268 parameter combinations, so 3-fold cross validation fits 6,804 models. Then we can examine the parameters that resulted in the lowest RMSE.

In [ ]:
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3,
                  joblib_verbose=-1, n_jobs=-1)
print('Time for iterating grid search parameters..')
search_time_start = time.time()
gs.fit(data)
print('Finished iterating grid search parameters:',
      time.time() - search_time_start)
print('\n')
print('Lowest RMSE from Grid Search:')
print(gs.best_score['rmse'])
print('\n')
print('Parameters of Model with lowest RMSE from Grid Search:')
print(gs.best_params['rmse'])
Time for iterating grid search parameters..
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done 256 tasks      | elapsed: 18.0min
[Parallel(n_jobs=-1)]: Done 616 tasks      | elapsed: 48.1min
[Parallel(n_jobs=-1)]: Done 1120 tasks      | elapsed: 90.8min
[Parallel(n_jobs=-1)]: Done 1768 tasks      | elapsed: 153.8min
[Parallel(n_jobs=-1)]: Done 2560 tasks      | elapsed: 238.1min
[Parallel(n_jobs=-1)]: Done 3496 tasks      | elapsed: 350.8min
[Parallel(n_jobs=-1)]: Done 4576 tasks      | elapsed: 501.9min
[Parallel(n_jobs=-1)]: Done 5800 tasks      | elapsed: 687.2min
Finished iterating grid search parameters: 51872.37151479721


Lowest RMSE from Grid Search:
0.9470577657248332


Parameters of Model with lowest RMSE from Grid Search:
{'n_epochs': 65, 'n_factors': 20, 'lr_all': 0.002, 'reg_all': 0.04, 'random_state': 42}
[Parallel(n_jobs=-1)]: Done 6804 out of 6804 | elapsed: 864.5min finished
In [ ]:
results_df = pd.DataFrame.from_dict(gs.cv_results)
results_df = results_df.sort_values('mean_test_rmse', ascending=True)
print('SVD GridSearch HPO Cross Validation Results:')
print(results_df.head())

del results_df
SVD GridSearch HPO Cross Validation Results:
      split0_test_rmse  split1_test_rmse  split2_test_rmse  mean_test_rmse  \
1769          0.947837          0.947292          0.946044        0.947058   
515           0.947919          0.947282          0.946094        0.947098   
1517          0.947917          0.947291          0.946087        0.947098   
767           0.947869          0.947354          0.946098        0.947107   
17            0.947931          0.947284          0.946113        0.947109   

      std_test_rmse  rank_test_rmse  split0_test_mae  split1_test_mae  \
1769       0.000751               1         0.684882         0.684427   
515        0.000756               2         0.685845         0.685306   
1517       0.000760               3         0.685833         0.685298   
767        0.000744               4         0.684506         0.684084   
17         0.000752               5         0.685866         0.685321   

      split2_test_mae  mean_test_mae  std_test_mae  rank_test_mae  \
1769         0.683501       0.684270      0.000575             83   
515          0.684445       0.685199      0.000576            168   
1517         0.684431       0.685187      0.000578            164   
767          0.683143       0.683911      0.000570             49   
17           0.684465       0.685217      0.000577            172   

      mean_fit_time  std_fit_time  mean_test_time  std_test_time  \
1769     110.694426      0.793158        5.775399       0.235709   
515       73.020183      0.826249        6.204501       0.493265   
1517     105.295258      0.178649        5.506686       0.242048   
767       78.236890      0.808342        5.835091       0.203345   
17        55.468125      0.798048        5.672714       0.187081   

                                                 params  param_n_epochs  \
1769  {'n_epochs': 65, 'n_factors': 20, 'lr_all': 0....              65   
515   {'n_epochs': 40, 'n_factors': 20, 'lr_all': 0....              40   
1517  {'n_epochs': 60, 'n_factors': 20, 'lr_all': 0....              60   
767   {'n_epochs': 45, 'n_factors': 20, 'lr_all': 0....              45   
17    {'n_epochs': 30, 'n_factors': 20, 'lr_all': 0....              30   

      param_n_factors  param_lr_all  param_reg_all  param_random_state  
1769               20         0.002           0.04                  42  
515                20         0.003           0.04                  42  
1517               20         0.002           0.04                  42  
767                20         0.003           0.04                  42  
17                 20         0.004           0.04                  42  


   Then we can use the parameters that resulted in the lowest RMSE to fit on the train set, predict on the test set, apply the functions and save the prediction results.

In [ ]:
algo = gs.best_estimator['rmse']

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
print('\n')
dump.dump('./SVD_bestGrid_Model_file', predictions, algo)
#predictions, algo = dump.load('./SVD_bestGrid_Model_file')

df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)

df2 = pd.merge(df, df1, how='right',
               left_on=['reviewer_id', 'item_id'],
               right_on=['reviewer_id', 'item_id'])

del df1
RMSE from fit best parameters on train predict on test:
RMSE: 0.9426
0.9426279412758342


   We can examine the top 10 best predictions:

In [ ]:
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)
Best 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
176329  B000068TSI      806  A12BXTSI4P73PA        16449  5.0  5.0   
13930   0792846133     1456   AT07UZQQR7ZEH         1410  5.0  5.0   
38632   6305807655    21418  A3LKO630Z5MC5P        11133  5.0  5.0   
197661  B00005JO2V     6502  A2P5WV8GFN1HQO         4184  5.0  5.0   
38643   6303042503      362  A1VIZFPC5FDDI5         4880  5.0  5.0   
97976   6304401132      577   AL7HW45VDAWUZ        10433  5.0  5.0   
97971   6300215695      435  A32R3MTF6UVLGE        14696  5.0  5.0   
38631   B001KX50AQ    33032  A3QMO4Z0U2R8P1         3128  5.0  5.0   
97969   B00BEIYP1W      918  A3LZBOBV9H1HDV          192  5.0  5.0   
197652  B000095J2Q    22792  A3NONVSAUJAFXC          835  5.0  5.0   

                          details   Iu   Ui  err  
176329  {'was_impossible': False}   18  206  0.0  
13930   {'was_impossible': False}   84  122  0.0  
38632   {'was_impossible': False}   27   15  0.0  
197661  {'was_impossible': False}   47   23  0.0  
38643   {'was_impossible': False}   44  336  0.0  
97976   {'was_impossible': False}   25  245  0.0  
97971   {'was_impossible': False}   23  188  0.0  
38631   {'was_impossible': False}   56    6  0.0  
97969   {'was_impossible': False}  293  154  0.0  
197652  {'was_impossible': False}  131   14  0.0  

   Let's examine the worst 10 predictions as well:

In [ ]:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df2, best_predictions, worst_predictions
Worst 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
12813   B00OV3VGP0       22  A107BIW9F8FHZK         6703  1.0  5.0   
114206  6303832563     4884  A2UKC1X32PBQZ9         1053  1.0  5.0   
113120  B00ID3TPA2    22390  A2KASTU01ARCF9         6928  1.0  5.0   
77728   B00753M8U0    23406  A31W21E1FQ8Q76         1563  1.0  5.0   
80302   B002HEXVUI       50  A1JTTJ7M7EC7Q7          705  1.0  5.0   
57297   B004X0MHEU     1821  A3G16L6SE9WQJW        18154  1.0  5.0   
121154  B000069HYZ    45548   ASFCEIBZOZFHF         2098  1.0  5.0   
99084   B00E0KWBE4     4451   AH4USMNP4MV4I        12373  1.0  5.0   
907     B003Y7TJXA    24721   AH06UFDUCQLUI         5443  1.0  5.0   
221274  630365147X     1178  A35E0VGWQKCADS         5108  1.0  5.0   

                          details   Iu   Ui  err  
12813   {'was_impossible': False}   36  739  4.0  
114206  {'was_impossible': False}  109   25  4.0  
113120  {'was_impossible': False}   34   11  4.0  
77728   {'was_impossible': False}   85   11  4.0  
80302   {'was_impossible': False}  133  646  4.0  
57297   {'was_impossible': False}   21   47  4.0  
121154  {'was_impossible': False}   70    5  4.0  
99084   {'was_impossible': False}   24   39  4.0  
907     {'was_impossible': False}   40    4  4.0  
221274  {'was_impossible': False}   42  152  4.0  

BaselineOnly

   Now we can run 3-fold cross validation with the default parameters for 3 epochs using method='als', with RMSE and MAE as the metrics. Then we can predict with the model, apply the functions and save the prediction results.

In [ ]:
print('Train/predict using Baseline default parameters with Alternating Least Squares for 3 epochs:')
print('\n')
bsl_options = {'method': 'als',
               'n_epochs': 3}
print('Baselines estimates configuration:')
print(bsl_options)
print('\n')
print('Model parameters:')
print(BaselineOnly(bsl_options=bsl_options))
Train/predict using Baseline default parameters with Alternating Least Squares for 3 epochs:


Baselines estimates configuration:
{'method': 'als', 'n_epochs': 3}


Model parameters:
<surprise.prediction_algorithms.baseline_only.BaselineOnly object at 0x000002F2AB988C40>
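
   For context, method='als' estimates the biases by alternating least squares: with the user biases held fixed, each item bias becomes a regularized average of that item's residuals, and then the user biases are updated the same way. A minimal sketch of the update with toy ratings (the variable names are ours; reg_i=10 and reg_u=15 mirror the documented defaults):

In [ ]:
ratings = {(0, 0): 5.0, (0, 1): 4.0, (1, 0): 3.0}   # (reviewer, item) -> rating
mu = np.mean(list(ratings.values()))
b_u = {0: 0.0, 1: 0.0}
b_i = {0: 0.0, 1: 0.0}
reg_i, reg_u = 10, 15

for _ in range(3):             # a few ALS sweeps (n_epochs)
    for i in b_i:              # update item biases with user biases fixed
        rated = [(u, r) for (u, it), r in ratings.items() if it == i]
        b_i[i] = sum(r - mu - b_u[u] for u, r in rated) / (reg_i + len(rated))
    for u in b_u:              # update user biases with item biases fixed
        rated = [(it, r) for (us, it), r in ratings.items() if us == u]
        b_u[u] = sum(r - mu - b_i[it] for it, r in rated) / (reg_u + len(rated))

print('mu:', mu, 'user biases:', b_u, 'item biases:', b_i)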
In [ ]:
print('Time for iterating through Baseline default parameters epochs=3 using ALS..')
search_time_start = time.time()
algo = BaselineOnly(bsl_options=bsl_options)
cv = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=False,
                    n_jobs=-1)
print('Finished iterating through Baseline default parameters epochs=3 using ALS:',
      time.time() - search_time_start)
print('\n')
print('Cross validation results:')
for key, value in cv.items():
    print(key, ' : ', value)

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./Baseline_3epochs_DefaultParamModel_file', predictions, algo)
#predictions, algo = dump.load('./Baseline_3epochs_DefaultParamModel_file')

df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)

df2 = pd.merge(df, df1, how='right',
               left_on=['reviewer_id', 'item_id'],
               right_on=['reviewer_id', 'item_id'])

del df1
Time for iterating through Baseline default parameters epochs=3 using ALS..
Finished iterating through Baseline default parameters epochs=3 using ALS: 18.00507688522339


Cross validation results:
test_rmse  :  [0.96236564 0.9575281  0.95800474]
test_mae  :  [0.7192956  0.71624087 0.71744233]
fit_time  :  (0.45334315299987793, 0.5258574485778809, 0.4615175724029541)
test_time  :  (3.44765043258667, 3.460092782974243, 3.342712163925171)
Estimating biases using als...
RMSE from fit best parameters on train predict on test:
RMSE: 0.9532
0.9532457019820773

   We can examine the top 10 best predictions:

In [ ]:
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)
Best 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
22719   6303824358      652  A3UGURGJMNRSH5         6615  5.0  5.0   
154698  6303521231     3897   ARMESRCOD7ZRV         5721  5.0  5.0   
190758  B00IV3FLO8      504  A3CXEYHXHR5RSE         1293  5.0  5.0   
86407   7799128836      123  A3TGQ0Y8ILUGBP         9774  5.0  5.0   
154699  0792834917     3899   A8KI69T8XEYH8        11191  5.0  5.0   
215986  B001KVZ6HK      489  A2YUA3H1LLU53Z           13  5.0  5.0   
51302   6302799112     7144  A18EXIAW8NU3DP          764  5.0  5.0   
66610   B0013LRKRQ     8910  A1QNLOV3985VUP         2518  5.0  5.0   
51298   6301966554      107  A2BVCF2AKHHCDF         3942  5.0  5.0   
51296   B001ATWK2Q    14068  A12CN8FQSR18H9          452  5.0  5.0   

                          details    Iu   Ui  err  
22719   {'was_impossible': False}    32  124  0.0  
154698  {'was_impossible': False}    36   28  0.0  
190758  {'was_impossible': False}    92   72  0.0  
86407   {'was_impossible': False}    31  254  0.0  
154699  {'was_impossible': False}    28   59  0.0  
215986  {'was_impossible': False}  1171  222  0.0  
51302   {'was_impossible': False}   118   22  0.0  
66610   {'was_impossible': False}    63   32  0.0  
51298   {'was_impossible': False}    43  264  0.0  
51296   {'was_impossible': False}   170    8  0.0  

   Let's examine the worst 10 predictions as well:

In [ ]:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df2, best_predictions, worst_predictions
Worst 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
214948  B005BYBZCM    14976   APSOSLICIDIFX         4500  1.0  5.0   
118018  B00AMSLDW4     1834  A157SI9J9ECKYH         1666  1.0  5.0   
110642  B00005JNJV      451  A1JU644REKXCNO         1311  1.0  5.0   
108646  B0046ZT40W       20  A15XO0U10UDT8O         8302  1.0  5.0   
125152  B000BGQSEA     1999   AWG2O9C42XW5G           10  1.0  5.0   
33420   0800141709      752  A28YE81E63ZOZT         4933  1.0  5.0   
44833   B0000VCZK2      409  A3K89LA3W7YS1Q         5789  1.0  5.0   
29490   B01D9EUNBY     2691  A3FWD37OUCOQDL          778  1.0  5.0   
119911  B00DMK6WN4     1915   ANHTCXAG77MBM         1331  1.0  5.0   
39171   B000E5LEXS      906  A3TY31K1VXAGSX         3522  1.0  5.0   

                          details    Iu   Ui  err  
214948  {'was_impossible': False}    48   10  4.0  
118018  {'was_impossible': False}    77   97  4.0  
110642  {'was_impossible': False}    93  461  4.0  
108646  {'was_impossible': False}    32   53  4.0  
125152  {'was_impossible': False}  1270   25  4.0  
33420   {'was_impossible': False}    43  209  4.0  
44833   {'was_impossible': False}    41  243  4.0  
29490   {'was_impossible': False}   122   79  4.0  
119911  {'was_impossible': False}    95   63  4.0  
39171   {'was_impossible': False}    51   46  4.0  

   Let's now define the parameters for the grid search.

In [ ]:
print('BaselineOnly HPO using Grid Search Minimized:')
print('\n')
param_grid = {'bsl_options': {'method': ['als', 'sgd'],
                              'n_epochs': [10, 15, 20, 25, 30, 35, 40, 45, 50],
                              'reg_u': [0, 1, 2, 3, 4, 5, 10, 15, 20],
                              'reg_i': [0, 1, 2, 3, 4, 5, 10, 15, 20]}}
print('Grid search parameters:')
param_grid
BaselineOnly HPO using Grid Search Minimized:


Grid search parameters:
Out[ ]:
{'bsl_options': {'method': ['als', 'sgd'],
  'n_epochs': [10, 15, 20, 25, 30, 35, 40, 45, 50],
  'reg_u': [0, 1, 2, 3, 4, 5, 10, 15, 20],
  'reg_i': [0, 1, 2, 3, 4, 5, 10, 15, 20]}}

   Now we can run the grid search with RMSE and MAE as the metrics; this grid contains 2 x 9 x 9 x 9 = 1,458 parameter combinations, so 3-fold cross validation fits 4,374 models. Then we can examine the parameters that resulted in the lowest RMSE.

In [ ]:
gs = GridSearchCV(BaselineOnly, param_grid, measures=['rmse', 'mae'], cv=3,
                  joblib_verbose=-1, n_jobs=-1)
print('Time for iterating grid search parameters..')
search_time_start = time.time()
gs.fit(data)
print('Finished iterating grid search parameters:',
      time.time() - search_time_start)
print('\n')
print('Lowest RMSE from Grid Search:')
print(gs.best_score['rmse'])
print('\n')
print('Parameters of Model with lowest RMSE from Grid Search:')
print(gs.best_params['rmse'])
Time for iterating grid search parameters..
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 tasks      | elapsed:   45.6s
[Parallel(n_jobs=-1)]: Done 256 tasks      | elapsed:  4.4min
[Parallel(n_jobs=-1)]: Done 616 tasks      | elapsed: 10.5min
[Parallel(n_jobs=-1)]: Done 1120 tasks      | elapsed: 19.6min
[Parallel(n_jobs=-1)]: Done 1768 tasks      | elapsed: 32.1min
[Parallel(n_jobs=-1)]: Done 2560 tasks      | elapsed: 46.8min
[Parallel(n_jobs=-1)]: Done 3496 tasks      | elapsed: 63.2min
Finished iterating grid search parameters: 4817.081090688705


Lowest RMSE from Grid Search:
0.9429468447894139


Parameters of Model with lowest RMSE from Grid Search:
{'bsl_options': {'method': 'als', 'n_epochs': 50, 'reg_u': 1, 'reg_i': 3}}
[Parallel(n_jobs=-1)]: Done 4374 out of 4374 | elapsed: 80.2min finished

   Then we can use the parameters that resulted in the lowest RMSE to fit on the train set, predict on the test set, apply the functions and save the prediction results.

In [ ]:
algo = gs.best_estimator['rmse']

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./BaselineOnly_moreParams_bestGrid_Model_file', predictions, algo)
#predictions, algo = dump.load('./BaselineOnly_moreParams_bestGrid_Model_file')

df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)

df2 = pd.merge(df, df1, how='right',
               left_on=['reviewer_id', 'item_id'],
               right_on=['reviewer_id', 'item_id'])

del df1
Estimating biases using als...
RMSE from fit best parameters on train predict on test:
RMSE: 0.9400
0.9400458413742169

   We can examine the top 10 best predictions:

In [ ]:
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)
Best 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
111339  B0053O89WY      757  A10EU5IAKELBDI        14872  5.0  5.0   
204037  B000YDMPCO     7653   A551XY0L7WQXU         1218  5.0  5.0   
74750   B00U38W0X4    18648  A227P36VP2YMFJ        12389  5.0  5.0   
158992  B00FDJGL6A      306  A16J08RGNPYXW6        11073  5.0  5.0   
204044  B00005JKZU      366  A36GQV2C4O0DXE         5800  5.0  5.0   
204054  B000059H6T     2016   AP40TY8769CBM        16798  5.0  5.0   
158983  B00005BJWN    13388  A3ESQSCCM2IR24         7591  5.0  5.0   
158982  B00S89IAWU    20893  A2545BZKJ6FLF6        16037  5.0  5.0   
204055  B0012Z36FI     4689   AWD6SR6I52C5C          476  5.0  5.0   
158978  B00D91GRA4       87  A2RC58KZZC41QW         2180  5.0  5.0   

                          details   Iu   Ui  err  
111339  {'was_impossible': False}   19  107  0.0  
204037  {'was_impossible': False}   98   32  0.0  
74750   {'was_impossible': False}   29   17  0.0  
158992  {'was_impossible': False}   22   52  0.0  
204044  {'was_impossible': False}   38  365  0.0  
204054  {'was_impossible': False}   22   50  0.0  
158983  {'was_impossible': False}   33   12  0.0  
158982  {'was_impossible': False}   18    6  0.0  
204055  {'was_impossible': False}  174   25  0.0  
158978  {'was_impossible': False}   66  694  0.0  

   Let's examine the worst 10 predictions as well:

In [ ]:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df2, best_predictions, worst_predictions
Worst 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
162516  B000VS20M2      148  A1GFX8FS58ZDHR         6479  1.0  5.0   
220011  B01DPPMPSG     6988  A30BQKQU5XGGS1        15523  1.0  5.0   
126619  6305627401     1667   AJ32DH3NMIES9         5904  1.0  5.0   
99713   6302707722    65249  A1JFJFHKG8Y7LB         3612  1.0  5.0   
99311   B000NOIX6G    12926  A3VHB8OAQQF9W6         1094  1.0  5.0   
13116   6303686850     3483  A3N3Y4UJ07LOVG         3284  1.0  5.0   
54348   B01D5MQQ2A    61254  A3AEZXERB7JP4R        11448  1.0  5.0   
108646  B0046ZT40W       20  A15XO0U10UDT8O         8302  1.0  5.0   
26866   B00005T33K     2680  A28UFH7QJBZZMU         4268  1.0  5.0   
96274   B000BKVKSA     7802  A10175AMUHOQC4          458  1.0  5.0   

                          details   Iu   Ui  err  
162516  {'was_impossible': False}   40  450  4.0  
220011  {'was_impossible': False}   19   21  4.0  
126619  {'was_impossible': False}   36  105  4.0  
99713   {'was_impossible': False}   44    1  4.0  
99311   {'was_impossible': False}  102   18  4.0  
13116   {'was_impossible': False}   54  101  4.0  
54348   {'was_impossible': False}   29    4  4.0  
108646  {'was_impossible': False}   32   53  4.0  
26866   {'was_impossible': False}   45   14  4.0  
96274   {'was_impossible': False}  163   22  4.0  

KNNBaseline

   Now we can run 3-fold cross validation with the default parameters for 3 epochs using method='als' for the baseline estimates, with RMSE and MAE as the metrics. Then we can predict with the model, apply the functions and save the prediction results.

In [ ]:
print('Train/predict using KNNBaseline default parameters with Alternating Least Squares for 3 epochs:')
print('\n')
bsl_options = {'method': 'als',
               'n_epochs': 3}
print('Baselines estimates configuration:')
print(bsl_options)
print('\n')
print('Model parameters:')
print('KNNBaseline(k=40, min_k=1, bsl_options=bsl_options, verbose=False)')
Train/predict using KNNBaseline default parameters with Alternating Least Squares for 3 epochs:


Baselines estimates configuration:
{'method': 'als', 'n_epochs': 3}


Model parameters:
KNNBaseline(k=40, min_k=1, bsl_options=bsl_options, verbose=False)
In [ ]:
print('Time for iterating through KNNBaseline default parameters epochs=3 using ALS..')
search_time_start = time.time()
algo = KNNBaseline(bsl_options=bsl_options)
cv = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=False,
                    n_jobs=-1)
print('Finished iterating through KNNBaseline default parameters epochs=3 using ALS:',
      time.time() - search_time_start)
print('\n')
print('Cross validation results:')
for key, value in cv.items():
    print(key, ' : ', value)
Time for iterating through KNNBaseline default parameters epochs=3 using ALS..
Finished iterating through KNNBaseline default parameters epochs=3 using ALS: 52.3457145690918


Cross validation results:
test_rmse  :  [0.98871454 0.98392946 0.98468003]
test_mae  :  [0.70193266 0.6991709  0.69981287]
fit_time  :  (12.814245462417603, 12.74767255783081, 12.50976276397705)
test_time  :  (27.62490439414978, 28.00885558128357, 28.005619525909424)
In [ ]:
predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
dump.dump('./KNNBaseline_3epochs_DefaultParamModel_file', predictions, algo)
#predictions, algo = dump.load('./KNNBaseline_3epochs_DefaultParamModel_file')

df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)

df2 = pd.merge(df, df1, how='right',
               left_on=['reviewer_id', 'item_id'],
               right_on=['reviewer_id', 'item_id'])

del df1
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE from fit best parameters on train predict on test:
RMSE: 0.9639
0.9638924023238667

   We can examine the top 10 best predictions:

In [ ]:
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)
Best 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
82292   B002ZG97EC      599  A29TWAFI927A05         7359  5.0  5.0   
66702   B016LGTB1A     1054  A3UCR3ANY81Y3X        10988  5.0  5.0   
199042  B00000JQU8     4924   AHD101501WCN1         3797  5.0  5.0   
66720   B000ANDBTY     5079   ALWB64XOXNMDP          305  5.0  5.0   
66728   B00OJ0X41E     1670  A3HYD0KTB8XKPQ        13455  5.0  5.0   
66730   B001EHDSRK    10371   AR0BHE8D3J5LD          567  5.0  5.0   
199038  B007K7IC6K    20933  A35X4WCJ3K67BY         3490  5.0  5.0   
66697   B0066E6RWE    33621   A1UJPR7G1TX5D        16570  5.0  5.0   
66734   6304493703    10521  A386NVAVQV5WUO         4546  5.0  5.0   
66745   B0000C508X    32981  A1XEVKJR5FBBSK        12943  5.0  5.0   

                                          details   Iu   Ui  err  
82292   {'actual_k': 40, 'was_impossible': False}   36  104  0.0  
66702   {'actual_k': 40, 'was_impossible': False}   26  257  0.0  
199042  {'actual_k': 40, 'was_impossible': False}   47   73  0.0  
66720   {'actual_k': 40, 'was_impossible': False}  226   57  0.0  
66728    {'actual_k': 3, 'was_impossible': False}   24   13  0.0  
66730    {'actual_k': 5, 'was_impossible': False}  153    5  0.0  
199038   {'actual_k': 9, 'was_impossible': False}   49   15  0.0  
66697    {'actual_k': 1, 'was_impossible': False}   23    9  0.0  
66734    {'actual_k': 1, 'was_impossible': False}   47    5  0.0  
66745    {'actual_k': 2, 'was_impossible': False}   26    8  0.0  

   Let's examine the worst 10 predictions as well:

In [ ]:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df2, best_predictions, worst_predictions
Worst 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
50924   B000UYN9PY    41193  A2POT13FER5L82         1518  5.0  1.0   
162946  B00AZMFG48     3768  A1NFCJXNYVJ0O1         5015  1.0  5.0   
64718   B00LT1JHLW     2240  A1QWGK4SXTRXMD         3387  1.0  5.0   
121979  630249933X    46077  A2WHWME9PTOS0E        14782  1.0  5.0   
87343   6304329008     6536  A30NXBU3CDNDFR         5468  1.0  5.0   
1375    B0009UVCO4    55298  A141HP4LYPWMSR           76  5.0  1.0   
37276   B00435KP44    22619  A347I00XTPNIUP         3867  5.0  1.0   
128317  B00000IYR0    24076  A1POCQWA3VAY8T        17696  1.0  5.0   
21647   B000JMKJPK    44655   AETEH7SOTOFRT        11177  5.0  1.0   
7373    B000QCUZ7U    28629   AJTQS6KM9ZV5L         1420  1.0  5.0   

                                          details   Iu  Ui  err  
50924    {'actual_k': 1, 'was_impossible': False}   85   5  4.0  
162946   {'actual_k': 6, 'was_impossible': False}   43  21  4.0  
64718   {'actual_k': 22, 'was_impossible': False}   49  66  4.0  
121979   {'actual_k': 1, 'was_impossible': False}   23   1  4.0  
87343    {'actual_k': 7, 'was_impossible': False}   42  48  4.0  
1375     {'actual_k': 2, 'was_impossible': False}  503   2  4.0  
37276    {'actual_k': 1, 'was_impossible': False}   46   3  4.0  
128317   {'actual_k': 1, 'was_impossible': False}   20  11  4.0  
21647    {'actual_k': 1, 'was_impossible': False}   26   2  4.0  
7373     {'actual_k': 2, 'was_impossible': False}   89   4  4.0  

   Let's now define the parameters for the grid search.

In [ ]:
print('KNNBaseline HPO using Grid Search:')
print('\n')
param_grid = {'bsl_options': {'method': ['als', 'sgd']},
              'k': [30, 35, 40, 45, 50],
              'min_k': [20, 25],
              'random_state': [seed_value],
              'sim_options': {'name': ['pearson_baseline'],
                              'min_support': [5, 10],
                              'shrinkage': [0, 100]}}
print('Grid search parameters:')
param_grid
KNNBaseline HPO using Grid Search:


Grid search parameters:
Out[ ]:
{'bsl_options': {'method': ['als', 'sgd']},
 'k': [30, 35, 40, 45, 50],
 'min_k': [20, 25],
 'random_state': [42],
 'sim_options': {'name': ['pearson_baseline'],
  'min_support': [5, 10],
  'shrinkage': [0, 100]}}

   Now we can run the grid search with RMSE and MAE as the metrics; this grid contains 2 x 5 x 2 x 2 x 2 = 80 parameter combinations, so 3-fold cross validation fits 240 models. Then we can examine the parameters that resulted in the lowest RMSE.

In [ ]:
gs = GridSearchCV(KNNBaseline, param_grid, measures=['rmse', 'mae'], cv=3,
                  joblib_verbose=-1, n_jobs=5)
print('Start time for iterating grid search parameters..')
search_time_start = time.time()
gs.fit(data)
print('Finished iterating grid search parameters:',
      time.time() - search_time_start)
print('\n')
print('Model with lowest RMSE:')
print(gs.best_score['rmse'])
print('\n')
print('Parameters with the lowest RMSE:')
print(gs.best_params['rmse'])
Start time for iterating grid search parameters..
[Parallel(n_jobs=5)]: Using backend LokyBackend with 5 concurrent workers.
[Parallel(n_jobs=5)]: Done  62 tasks      | elapsed: 11.1min
Finished iterating grid search parameters: 2447.833321094513


Model with lowest RMSE:
0.9453110845949882


Parameters with the lowest RMSE:
{'bsl_options': {'method': 'sgd'}, 'k': 40, 'min_k': 20, 'random_state': 42, 'sim_options': {'name': 'pearson_baseline', 'min_support': 5, 'shrinkage': 100, 'user_based': True}}
[Parallel(n_jobs=5)]: Done 240 out of 240 | elapsed: 40.7min finished

   Note that user_based=True appears in the reported sim_options because it is the surprise default rather than part of the search. Then we can use the parameters that resulted in the lowest RMSE to fit on the train set, predict on the test set, apply the functions and save the prediction results.

In [ ]:
algo = gs.best_estimator['rmse']

predictions = algo.fit(train).test(test)
print('RMSE from fit best parameters on train predict on test:')
print(accuracy.rmse(predictions))
print('\n')
dump.dump('./KNNBaseline_bestGrid_Model_file', predictions, algo)
#predictions, algo = dump.load('./KNNBaseline_bestGrid_Model_file')

df1 = pd.DataFrame(predictions, columns=['reviewer_id', 'item_id', 'rui', 'est',
                                         'details'])
df1['Iu'] = df1.reviewer_id.apply(get_Ir)
df1['Ui'] = df1.item_id.apply(get_Ri)
df1['err'] = abs(df1.est - df1.rui)

df2 = pd.merge(df, df1, how='right',
               left_on=['reviewer_id', 'item_id'],
               right_on=['reviewer_id', 'item_id'])

del df1
Estimating biases using sgd...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE from fit best parameters on train predict on test:
RMSE: 0.9379
0.9378743356689234


   We can examine the top 10 best predictions:

In [ ]:
best_predictions = df2.sort_values(by='err')[:10]
print('Best 10 predictions:')
print(best_predictions)
Best 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
90940   B00O2IZPD8      211  A2Q9961GP0RMBT         4608  5.0  5.0   
159489  6303686761    14836   AZIV57ZFSUYI6        18439  5.0  5.0   
134670  B000063EME     3672   AN49MA84I1W13        12176  5.0  5.0   
41605   B00C888NFQ       90  A2EQRK5VKWY4UL         5787  5.0  5.0   
214781  B00FEP9PQG      188   AKWS79G5WUYUE        16444  5.0  5.0   
14334   079073463X     1319  A3MQAQT8C6D1I7         7397  5.0  5.0   
41607   B000UFIYQ2     8628  A1QWQF0CXTE41E          602  5.0  5.0   
200020  B000CDSS18    17834  A2XOH1PC423E2S         8201  5.0  5.0   
41613   B00FR23GPW     7599   AJC05Q1I8IG0U         8629  5.0  5.0   
91520   B00HEPCRLE      874  A12XYDOWQA4H5B         6810  5.0  5.0   

                                          details   Iu   Ui  err  
90940    {'actual_k': 6, 'was_impossible': False}   48  231  0.0  
159489   {'actual_k': 2, 'was_impossible': False}   17   26  0.0  
134670   {'actual_k': 0, 'was_impossible': False}   24   54  0.0  
41605    {'actual_k': 1, 'was_impossible': False}   40  152  0.0  
214781   {'actual_k': 2, 'was_impossible': False}   20   96  0.0  
14334    {'actual_k': 2, 'was_impossible': False}   35   96  0.0  
41607    {'actual_k': 0, 'was_impossible': False}  133    7  0.0  
200020   {'actual_k': 0, 'was_impossible': False}   33    4  0.0  
41613    {'actual_k': 0, 'was_impossible': False}   33   31  0.0  
91520   {'actual_k': 25, 'was_impossible': False}   35  261  0.0  

   Let's examine the worst 10 predictions as well:

In [ ]:
worst_predictions = df2.sort_values(by='err')[-10:]
print('Worst 10 predictions:')
print(worst_predictions)

del predictions, df2, best_predictions, worst_predictions
Worst 10 predictions:
              item  item_id      reviewerID  reviewer_id  rui  est  \
199613  B00005JO20      733  A167KI3P7XN1AM         2527  1.0  5.0   
111347  B000J10F8C    20835  A33RN6T49VEFUO          302  1.0  5.0   
128997  B019T8QBR4      700  A2BXZJPLOTMI5A         7822  1.0  5.0   
13429   6304961693     2718   AK2FSJJ7GAN8B        10089  1.0  5.0   
135680  B00007K027    23913  A192UPT6KST3E2          804  1.0  5.0   
105987  B00FF9SKSK      388   AE35V0M480BBV         9570  1.0  5.0   
145506  6304178360     8074   AWG2O9C42XW5G           10  1.0  5.0   
44675   B00008MTW7     4328   AXOIYNI70LPN1         5849  1.0  5.0   
64715   B00G3D732Q      100  A3JDBLJIN704EW         3907  1.0  5.0   
201633  B0009PW4D2     1842  A2JHYW5V7UFIQ2         3958  5.0  1.0   

                                          details    Iu   Ui  err  
199613   {'actual_k': 5, 'was_impossible': False}    65  417  4.0  
111347   {'actual_k': 0, 'was_impossible': False}   218    5  4.0  
128997   {'actual_k': 4, 'was_impossible': False}    34  114  4.0  
13429    {'actual_k': 0, 'was_impossible': False}    30   87  4.0  
135680   {'actual_k': 0, 'was_impossible': False}   122    5  4.0  
105987   {'actual_k': 0, 'was_impossible': False}    25  299  4.0  
145506   {'actual_k': 8, 'was_impossible': False}  1307   38  4.0  
44675    {'actual_k': 1, 'was_impossible': False}    40   71  4.0  
64715   {'actual_k': 26, 'was_impossible': False}    49  452  4.0  
201633   {'actual_k': 0, 'was_impossible': False}    54   29  4.0  

Comments