TensorFlow Recommenders - How to retrieve candidate items?



TensorFlow Recommenders (TFRS), released in 2020 as part of the TensorFlow ecosystem, provides an entire stack for recommender systems, including
  • Retrieval
  • Ranking
  • Post-ranking
where the Retrieval component shrinks the item candidates from O(thousands of millions) down to O(thousands), the Ranking component trims those candidates from O(thousands) to O(hundreds), and finally the Post-ranking component narrows them from O(hundreds) to O(dozens).

In this post, we focus on the first part of the TFRS tutorial, in which we use the well-established MovieLens 100K dataset containing 100,000 movie ratings from users. This tutorial requires TFRS to be installed (TFRS requires TensorFlow 2.x). You can install TFRS easily with:

pip install tensorflow-recommenders

Contents

  • Prepare dataset
  • Build model
  • Train and evaluate the model
  • Export the model, load it back, and retrieve items with it

Prepare dataset

First, let's import packages that we need, and print out Tensorflow and TFRS versions for reference.

from typing import Dict, Text  # for type hints
 
import os
import pprint
import numpy as np
import tempfile
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs

print(tf.__version__)
print(tfrs.__version__)
Output:

2.9.1
v0.7.0
Let's load the MovieLens dataset from tensorflow_datasets.

# ----------------------------------------
# Prepare Movielens100k data
# ----------------------------------------
# Ratings data
ratings = tfds.load('movielens/100k-ratings', split='train')
# Features of all the available movies
movies = tfds.load('movielens/100k-movies', split='train')
We can print out some of the samples from ratings and movies.

# ----------------------------------------
# Explore dataset
# ----------------------------------------
n = 1
print('\nRatings:')
for r in ratings.take(n).as_numpy_iterator(): # as_numpy_iterator() returns an iterator that converts all elements of the dataset to NumPy arrays.
    pprint.pprint(r)
    
print('\nMovies:')
for m in movies.take(n).as_numpy_iterator():
    pprint.pprint(m)
Output:

Ratings:
{'bucketized_user_age': 45.0,
 'movie_genres': array([7], dtype=int64),
 'movie_id': b'357',
 'movie_title': b"One Flew Over the Cuckoo's Nest (1975)",
 'raw_user_age': 46.0,
 'timestamp': 879024327,
 'user_gender': True,
 'user_id': b'138',
 'user_occupation_label': 4,
 'user_occupation_text': b'doctor',
 'user_rating': 4.0,
 'user_zip_code': b'53211'}

Movies:
{'movie_genres': array([4], dtype=int64),
 'movie_id': b'1681',
 'movie_title': b'You So Crazy (1994)'}
Now, we move on and prepare the dataset, which includes the following steps:
  • Shuffle the dataset and split it into train/test sets of 80,000 and 20,000 examples, respectively.
  • Build the user id and movie title vocabularies, which we will use to learn user/item embeddings later on.


# ----------------------------------------
# Shuffle and prepare ML dataset 
# ----------------------------------------
tf.random.set_seed(42)

# Get user, item information only
ratings = ratings.map(lambda x: {
    'movie_title': x['movie_title'],
    'user_id': x['user_id']
})
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)

movie_titles = movies.map(lambda x: x['movie_title'])
user_ids = ratings.map(lambda x: x['user_id'])

# Convert raw string ids to integer indices
user_ids_vocabulary = tf.keras.layers.StringLookup(mask_token=None)
user_ids_vocabulary.adapt(user_ids)

movie_titles_vocabulary = tf.keras.layers.StringLookup(mask_token=None)
movie_titles_vocabulary.adapt(movie_titles)
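
As a quick sanity check, we can pass a few raw ids through the adapted vocabularies. StringLookup maps each string to an integer index; the exact indices depend on the adapted vocabulary, and index 0 is reserved for out-of-vocabulary tokens since mask_token=None.

# Sanity check: raw string ids are mapped to integer indices.
# The exact indices depend on the adapted vocabulary; 0 is the OOV bucket.
print(user_ids_vocabulary(tf.constant(['42', '138'])))
print(movie_titles_vocabulary(tf.constant(["One Flew Over the Cuckoo's Nest (1975)"])))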


Build model

To build a TFRS model, we need to prepare three things:
  • user model
  • item model
  • task (which is retrieval in this tutorial, as one might expect)
The user and item models are TensorFlow Keras models. tfrs.metrics.FactorizedTopK computes metrics across the top-K candidates surfaced by a retrieval model; its ks parameter is a sequence of values of k at which to perform the retrieval evaluation.

# Define user and movie models.
user_model = tf.keras.Sequential([
    user_ids_vocabulary,
    tf.keras.layers.Embedding(user_ids_vocabulary.vocabulary_size(), 64)
])

item_model = tf.keras.Sequential([
    movie_titles_vocabulary,
    tf.keras.layers.Embedding(movie_titles_vocabulary.vocabulary_size(), 64)
])

# Define metrics
metrics = tfrs.metrics.FactorizedTopK(
    candidates=movie_titles.batch(128).map(item_model),
    ks=[5, 10, 20])

# Define the retrieval task, which wraps the loss (and metrics) computation
task = tfrs.tasks.Retrieval(metrics=metrics)

# Put the full model together
class MovielensModel(tfrs.Model):
    def __init__(self, user_model: tf.keras.Model,
                 item_model: tf.keras.Model,
                 task: tf.keras.layers.Layer):
        super().__init__()
        self.item_model = item_model
        self.user_model = user_model
        self.task = task
        
    def compute_loss(self, features: Dict[Text, tf.Tensor], 
                     training=False) -> tf.Tensor:
        # Define how the loss is computed.

        user_embeddings = self.user_model(features["user_id"])
        item_embeddings = self.item_model(features["movie_title"])

        return self.task(user_embeddings, item_embeddings)
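
For intuition, the Retrieval task computes an in-batch softmax loss by default: for each (user, item) pair in a batch, the paired item acts as the positive while all the other items in the batch act as negatives. A minimal sketch of the idea (not the exact TFRS implementation, which additionally handles metrics, sample weights, and candidate masking):

# Minimal sketch of the in-batch softmax loss behind tfrs.tasks.Retrieval.
# scores[i, j] is the affinity between user i and item j within the batch;
# the true item for user i sits on the diagonal.
def in_batch_softmax_loss(user_embeddings: tf.Tensor,
                          item_embeddings: tf.Tensor) -> tf.Tensor:
    scores = tf.matmul(user_embeddings, item_embeddings, transpose_b=True)
    labels = tf.eye(tf.shape(scores)[0])  # identity matrix: positives on diagonal
    return tf.reduce_mean(
        tf.keras.losses.categorical_crossentropy(labels, scores, from_logits=True))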





Train and evaluate the model

Given the defined model, we can compile and fit the model with our training dataset.

# Compile and fit
model = MovielensModel(user_model, item_model, task)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

cached_train = train.shuffle(100_000).batch(8192).cache()
cached_test = test.batch(4096).cache()

model.fit(cached_train, epochs=3)
Output:

Epoch 1/3
10/10 [==============================] - 10s 695ms/step - factorized_top_k/top_5_categorical_accuracy: 0.0081 - factorized_top_k/top_10_categorical_accuracy: 0.0190 - factorized_top_k/top_20_categorical_accuracy: 0.0401 - loss: 70403.1754 - regularization_loss: 0.0000e+00 - total_loss: 70403.1754
Epoch 2/3
10/10 [==============================] - 6s 613ms/step - factorized_top_k/top_5_categorical_accuracy: 0.0183 - factorized_top_k/top_10_categorical_accuracy: 0.0370 - factorized_top_k/top_20_categorical_accuracy: 0.0740 - loss: 67763.8679 - regularization_loss: 0.0000e+00 - total_loss: 67763.8679
Epoch 3/3
10/10 [==============================] - 7s 664ms/step - factorized_top_k/top_5_categorical_accuracy: 0.0241 - factorized_top_k/top_10_categorical_accuracy: 0.0492 - factorized_top_k/top_20_categorical_accuracy: 0.0956 - loss: 66349.8991 - regularization_loss: 0.0000e+00 - total_loss: 66349.8991
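
If we also want to track test-set metrics during training, we can pass validation data to fit(); note that computing the FactorizedTopK metrics over all candidates is relatively expensive, so this slows each epoch down.

# Optional: monitor test-set metrics while training.
# Computing FactorizedTopK metrics is expensive, so expect slower epochs.
model.fit(cached_train, epochs=3, validation_data=cached_test)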

Next, we can evaluate how well the trained model performs on our test set. The return_dict parameter makes evaluate() return the results as a dictionary.

# Evaluation
model.evaluate(cached_test, return_dict=True)
Output:

5/5 [==============================] - 2s 185ms/step - factorized_top_k/top_5_categorical_accuracy: 0.0052 - factorized_top_k/top_10_categorical_accuracy: 0.0135 - factorized_top_k/top_20_categorical_accuracy: 0.0342 - loss: 31226.3392 - regularization_loss: 0.0000e+00 - total_loss: 31226.3392
{'factorized_top_k/top_5_categorical_accuracy': 0.005249999929219484,
 'factorized_top_k/top_10_categorical_accuracy': 0.013500000350177288,
 'factorized_top_k/top_20_categorical_accuracy': 0.03424999862909317,
 'loss': 28381.1328125,
 'regularization_loss': 0,
 'total_loss': 28381.1328125}
The test-set metrics are considerably worse than the final training metrics, which is expected: the model overfits the examples it has seen during training. Still, we can use the fitted model to retrieve the top-k items (movies) for a target user.

# ----------------------------------------
# Use trained model for predictions
# ----------------------------------------
# Use brute-force search to set up retrieval using the trained representations.
# BruteForce means we exhaustively score every candidate item embedding
# against the query (user) embedding.
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
index.index_from_dataset(
    movie_titles.batch(100).map(lambda title: (title, model.item_model(title))))

# Get some recommendations for user id 42
_, titles = index(np.array(["42"]))
print(f"Top 3 recommendations for user 42: {titles[0, :3]}")
Output:

Top 3 recommendations for user 42: [b'Rudy (1993)' b'Father of the Bride Part II (1995)'
 b'Bridges of Madison County, The (1995)']
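
BruteForce scores every candidate exhaustively, which is fine at MovieLens scale but becomes slow for corpora with millions of items. For larger corpora, TFRS offers an approximate index with the same API; a sketch, assuming the scann package is installed (pip install scann):

# Approximate nearest-neighbor retrieval with ScaNN (requires the scann package).
# Same indexing API as BruteForce, but scales to much larger corpora.
scann_index = tfrs.layers.factorized_top_k.ScaNN(model.user_model)
scann_index.index_from_dataset(
    movie_titles.batch(100).map(lambda title: (title, model.item_model(title))))
_, titles = scann_index(np.array(["42"]), k=3)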


Export the model, load it back, and retrieve items with it

In practice, we might need to save a well-trained model and serve it in a production system. We can save the trained model to disk and reload it for inference, i.e., retrieving the top-k items for a user. As one might expect, this gives exactly the same results as above.

# --------------------------------------
# Export the query model
# --------------------------------------
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model")
    
    # Save the index
    index.save(path)
    
    # Load it back
    # Can also be done in TF Serving
    loaded = tf.keras.models.load_model(path)
    
    # Pass a user id, get recommendations
    scores, titles = loaded(["42"])
    
    print(f"Recommendations: {titles[0][:3]}")


In the next post, we look at how to utilize contextual features of users/items in the retrieval model.

More TFRS tutorials can be found at https://parklize.blogspot.com/p/tensorflow.html

