RecSys MLOps. BigQuery ML and Vertex AI. Part 2

Movielens
BigQuery
BigQuery ML
RecSys
Recommender
Pipeline
Vertex AI
MLOps
Author

Francisco Mussari

Published

July 24, 2023

Overview

Part 1 was about creating a Vertex AI pipeline for training and deploying a Matrix Factorization model using BigQuery ML.

We ended up with a model we can query from BigQuery, and which is also available as a Cloud Run endpoint, deployed as a Docker container.

Note

This notebook is better suited for running in Colab.

Authenticate

from google.colab import auth

auth.authenticate_user()

Get recommendations from deployed endpoint

As defined in part 1, the service was deployed as recommender-model.

SERVICE = "recommender-model"
REGION = 'us-central1'
PROJECT_ID = "[your-project-id]"

To get the endpoint URL, you can navigate to the Cloud Run console (insert your Project ID):

https://console.cloud.google.com/run/detail/us-central1/recommender-model/metrics?project=[PROJECT_ID]

Or you can run the following command:

service_url = ! gcloud run services describe {SERVICE} --platform managed --region {REGION} --format 'value(status.url)'
service_url = service_url[0] + "/v1/models/recommender_model:predict"
service_url

And the endpoint should look something like this:
https://recommender-model-<ENDPOINT_ID>-uc.a.run.app/v1/models/recommender_model:predict

Call the deployed model endpoint

To call the API, simply use curl. In this case we get recommendations for user_id equal to 1:

! curl --header "Content-Type: application/json" \
  --request POST \
  --data '{"instances": [{"user_id": 1}]}' \
  https://recommender-model-<ENDPOINT_ID>-uc.a.run.app/v1/models/recommender_model:predict
{
    "predictions": [
        {
            "predicted_rating": [7.1595457186221525, 6.931200454403859, 6.7487105522542379, 6.6099371431826093, 6.3368375925279663, 6.3162291916598949, 6.2034087300038294, 6.1908018877869777, 6.095065519595515, 6.0748385283824344, 6.0707896719918022, 6.0115152056191317, 5.9950883198035312, 5.9266960142259579, 5.8977393292303706, 5.8963543166167049, 5.8879213166109317, 5.8651216371567951, 5.8157813717342961, 5.7918400837633026, 5.7728366859651743, 5.7459233895973343, 5.7286289144902627, 5.7236862593334399, 5.6735891385114323, 5.6712837959564055, 5.6344651298840862, 5.6144777548802312, 5.6136065773986719, 5.5966244955484585, 5.5826125214031448, 5.5438993473737597, 5.5405328509132108, 5.5156535458862912, 5.5060371378140998, 5.4983916934699879, 5.4779312652757479, 5.4381644519356804, 5.436539915444734, 5.4274984607954586, 5.4162018042063238, 5.4131954911985822, 5.4112950178511134, 5.3977833272890878, 5.3924309392602101, 5.3921355023131454, 5.3795566101479775, 5.3784815662419794, 5.3724805166998593, 5.3716275159116167],
            "predicted_item_id": ["3382", "3920", "2178", "662", "469", "199", "215", "2073", "881", "3132", "1900", "1615", "3141", "2928", "445", "973", "3605", "766", "935", "1841", "2330", "944", "2314", "3447", "2171", "3327", "2419", "2351", "557", "3720", "156", "2066", "2475", "2970", "26", "2360", "3018", "2272", "2506", "3670", "1631", "414", "2165", "1177", "2264", "743", "2725", "1934", "2710", "3275"]
        }
    ]
}

This returns the top 50 items (movies) by predicted rating for the specified user.

After processing the output as sketched below, we get a dataframe like this (remember these are predictions for user_id=1):
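One way to do that processing from Python is shown here (a minimal sketch, assuming the Cloud Run service accepts unauthenticated requests as in the curl call above; requests is not used elsewhere in this notebook):

import requests
import pandas as pd

# Call the Cloud Run endpoint with the same payload as the curl example
response = requests.post(service_url, json={"instances": [{"user_id": 1}]})
prediction = response.json()["predictions"][0]

# One row per recommended item, ordered by predicted rating
endpoint_predictions_df = pd.DataFrame({
    "predicted_rating": prediction["predicted_rating"],
    "predicted_item_id": prediction["predicted_item_id"],
})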

endpoint_predictions_df.head(10)
predicted_rating predicted_item_id
0 7.159546 3382
1 6.931200 3920
2 6.748711 2178
3 6.609937 662
4 6.336838 469
5 6.316229 199
6 6.203409 215
7 6.190802 2073
8 6.095066 881
9 6.074839 3132

BigQuery

Authenticate to BigQuery

from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID)

Set the same parameters as in Part 1

MODEL_DATASET = "[bq-model-dataset]"  # @param {type:"string"}
MOVIELENS_DATASET = "[bq-data-dataset]"  # @param {type:"string"}

Query the data tables

movies_df = client.query(f'''
  SELECT *
  FROM `{MOVIELENS_DATASET}.movie_titles`
''').to_dataframe()
movies_df.head()
movie_id movie_title genre
0 632 Land and Freedom (Tierra y libertad) (1995) War
1 665 Underground (1995) War
2 760 Stalingrad (1993) War
3 777 Pharaoh's Army (1995) War
4 1450 Prisoner of the Mountains (Kavkazsky Plennik) ... War
user_ratings_df = client.query(f'''
  SELECT t1.user_id, t1.rating, t1.item_id, t2.movie_title, t2.genre
  FROM `{MOVIELENS_DATASET}.movielens_1m` t1
  LEFT JOIN {MOVIELENS_DATASET}.movie_titles t2
  ON t1.item_id=t2.movie_id
  --WHERE t1.user_id=1
  ORDER BY rating DESC
''').to_dataframe()
user_ratings_df.head()
user_id rating item_id movie_title genre
0 1 5.0 1193 One Flew Over the Cuckoo's Nest (1975) Drama
1 1 5.0 2355 Bug's Life, A (1998) Animation|Children's|Comedy
2 1 5.0 1287 Ben-Hur (1959) Action|Adventure|Drama
3 1 5.0 2804 Christmas Story, A (1983) Comedy|Drama
4 1 5.0 595 Beauty and the Beast (1991) Animation|Children's|Musical

Information about model(s) created

models = client.list_models(MODEL_DATASET)
print("Models contained in '{}':".format(MODEL_DATASET))

models_fmid = []
for model in models:
    full_model_id = f"{model.project}.{model.dataset_id}.{model.model_id}"
    models_fmid.append(full_model_id)

    print(f"{full_model_id}")
Models contained in 'bqml_modelos':
pythonapi-205723.bqml_modelos.mf_model_pipe_01
pythonapi-205723.bqml_modelos.mf_model_pipe_02
pythonapi-205723.bqml_modelos.mf_model_pipe_03

As you can see, I created three models in the bqml_modelos dataset.

client.get_model(models_fmid[0])
Model(reference=ModelReference(project_id='pythonapi-205723', dataset_id='bqml_modelos', model_id='mf_model_pipe_01'))
model.model_type
'MATRIX_FACTORIZATION'

Training statistics

client.query(f'''
  SELECT *
  FROM ML.TRAINING_INFO(MODEL `{models_fmid[2]}`)
''').to_dataframe()
training_run iteration loss eval_loss duration_ms
0 0 15 0.259132 NaN 64410
1 0 14 0.261374 NaN 107980
2 0 13 0.265850 NaN 60169
3 0 12 0.268827 NaN 29616
4 0 11 0.274483 NaN 57306
5 0 10 0.278608 NaN 117195
6 0 9 0.286101 NaN 59622
7 0 8 0.292234 NaN 115613
8 0 7 0.302932 NaN 62337
9 0 6 0.313072 NaN 101301
10 0 5 0.330243 NaN 131648
11 0 4 0.350469 NaN 118377
12 0 3 0.385341 NaN 55254
13 0 2 0.449968 NaN 103391
14 0 1 0.581976 NaN 64919
15 0 0 4.047567 NaN 168618
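To see the convergence at a glance, the loss per iteration can be plotted (a quick sketch; the training_info_df name is introduced here and is not part of the original notebook):

import matplotlib.pyplot as plt

# Re-run the training info query into a dataframe to plot loss vs. iteration
training_info_df = client.query(f'''
  SELECT iteration, loss
  FROM ML.TRAINING_INFO(MODEL `{models_fmid[2]}`)
  ORDER BY iteration
''').to_dataframe()

plt.plot(training_info_df.iteration, training_info_df.loss, marker='o')
plt.xlabel('iteration')
plt.ylabel('training loss')
plt.show()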

Evaluation

client.query(f'''
SELECT *
FROM ML.EVALUATE(MODEL `{models_fmid[2]}`,
    (
    SELECT
      user_id,
      item_id,
      rating
     FROM
      {MOVIELENS_DATASET}.movielens_1m))
''').to_dataframe()
mean_absolute_error mean_squared_error mean_squared_log_error median_absolute_error r2_score explained_variance
0 0.382252 0.259132 0.016406 0.293064 0.792348 0.792348

Rating predictions

Let's get predicted ratings for all user-item pairs.

Note

This query generates a predicted rating for every user-item pair, so the output can be rather large. It is recommended to save the output to a table and use that table in subsequent queries for analysis.

Let's create one table to store all predictions, called recommend_m2_1m, and another one with the top predictions by user, called recommend_m2_1m_top.
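The recommend_m2_1m table itself can be materialized with the same destination-table pattern used later for recommend_m2_1m_top (a sketch; the exact job configuration from Part 1 may differ):

# Materialize all ML.RECOMMEND predictions into a table
job_config = bigquery.QueryJobConfig()
job_config.destination = f"{PROJECT_ID}.{MODEL_DATASET}.recommend_m2_1m"

client.query(f'''
SELECT *
FROM ML.RECOMMEND(MODEL {MODEL_DATASET}.mf_model_pipe_03)
''', job_config=job_config).result()

For inspection, we can also pull the predictions straight into a dataframe: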

recommend_m2_1m_df = client.query(f'''
SELECT *
FROM ML.RECOMMEND(MODEL {MODEL_DATASET}.mf_model_pipe_03)
''').to_dataframe()
recommend_m2_1m_df
predicted_rating user_id item_id
0 3.489891 3575 1376
1 6.627501 2213 1362
2 2.694415 5876 1330
3 -1.662953 4817 3042
4 2.208275 5448 563
... ... ... ...
22384235 3.183006 933 2282
22384236 1.201612 5843 391
22384237 2.886981 4321 3525
22384238 2.488141 1000 2881
22384239 2.805384 5148 1049

22384240 rows × 3 columns

print(f"Number of users: {len(user_ratings_df.user_id.unique()):,}")
print(f"Number of movies: {len(user_ratings_df.item_id.unique()):,}")
assert len(user_ratings_df.user_id.unique()) * len(user_ratings_df.item_id.unique()) == len(recommend_m2_1m_df)
Number of users: 6,040
Number of movies: 3,706

As we can see, the full table has 22,384,240 rows, equal to 6,040 users × 3,706 movies.

Compare to endpoint results

recommend_m2_1m_df[recommend_m2_1m_df.user_id==1].sort_values(by='predicted_rating', ascending=False).head(10)
predicted_rating user_id item_id
50602 7.159546 1 3382
22190133 6.931200 1 3920
5356399 6.748711 1 2178
15157835 6.609937 1 662
4687764 6.336838 1 469
17892355 6.316229 1 199
1365223 6.203409 1 215
16742143 6.190802 1 2073
8094661 6.095066 1 881
21623544 6.074839 1 3132
endpoint_predictions_df.head(10)
predicted_rating predicted_item_id
0 7.159546 3382
1 6.931200 3920
2 6.748711 2178
3 6.609937 662
4 6.336838 469
5 6.316229 199
6 6.203409 215
7 6.190802 2073
8 6.095066 881
9 6.074839 3132

For user_id=1 and item_id=3382, the predicted rating is 7.159546.

Latent Factors (Embeddings) and Bias

The matrix factorization model trained in BigQuery ML computes the predicted rating with the following equation:

predicted = Global Bias + DotProduct(user_factors, item_factors) + user_bias + item_bias

Where Global Bias is a constant shared by all pairs, user_factors and item_factors are 60-dimensional vectors (as defined in the training process), and user_bias and item_bias are scalars specific to each user and each item.

Global Bias

global_bias = client.query(f"""
    SELECT intercept
    FROM ML.WEIGHTS(model {MODEL_DATASET}.mf_model_pipe_03)
    WHERE feature = 'global__INTERCEPT__'
""").to_dataframe()
global_bias_np = global_bias.loc[0].values
global_bias_np
array([3.58156445])

Movie Embeddings

item_factors_df = client.query(f"""
    SELECT feature AS item_id, intercept AS bias, fw.factor AS factor, fw.weight
    FROM ML.WEIGHTS(model {MODEL_DATASET}.mf_model_pipe_03),
    UNNEST(factor_weights) AS fw
    WHERE processed_input = 'item_id'
""").to_dataframe()
item_factors_df
item_id bias factor weight
0 2 -0.393036 60 -0.113832
1 2 -0.393036 59 0.048548
2 2 -0.393036 58 -0.112024
3 2 -0.393036 57 0.118971
4 2 -0.393036 56 -0.050817
... ... ... ... ...
222355 3833 -1.592288 5 0.123739
222356 3833 -1.592288 4 0.234778
222357 3833 -1.592288 3 -0.296197
222358 3833 -1.592288 2 0.502392
222359 3833 -1.592288 1 -0.232240

222360 rows × 4 columns

User Embeddings

user_factors_df = client.query(f"""
    SELECT feature AS user_id, intercept AS bias, fw.factor AS factor, fw.weight
    FROM ML.WEIGHTS(model {MODEL_DATASET}.mf_model_pipe_03),
    UNNEST(factor_weights) AS fw
    WHERE processed_input = 'user_id'
""").to_dataframe()
user_factors_df
user_id bias factor weight
0 256 0.255283 60 0.713645
1 256 0.255283 59 1.387058
2 256 0.255283 58 -0.567324
3 256 0.255283 57 -0.709545
4 256 0.255283 56 -1.813568
... ... ... ... ...
362395 5887 -0.408196 5 1.493551
362396 5887 -0.408196 4 0.960923
362397 5887 -0.408196 3 -2.272016
362398 5887 -0.408196 2 0.393577
362399 5887 -0.408196 1 -2.099104

362400 rows × 4 columns

Do the math for a prediction

We already saw that for user_id=1 and item_id=3382, the predicted rating is 7.159546. Let’s do the math:

user_1_factors = user_factors_df[user_factors_df.user_id=='1'].weight.to_numpy()
user_1_bias = user_factors_df[user_factors_df.user_id=='1'].bias.to_numpy()[:1]
user_1_bias, user_1_factors[:5], user_1_factors.shape
(array([0.04259326]),
 array([ 0.72214304, -1.06488835, -0.35848901, -0.08184539, -0.34282946]),
 (60,))
item_3382_factors = item_factors_df[item_factors_df.item_id=="3382"].weight.to_numpy()
item_3382_bias = item_factors_df[item_factors_df.item_id=="3382"].bias.to_numpy()[:1]
item_3382_bias, item_3382_factors[:5], item_3382_factors.shape
(array([3.535388]),
 array([ 1.38777878e-17,  3.03576608e-18, -2.81892565e-18,  1.73472348e-18,
        -5.02894452e-17]),
 (60,))
import numpy as np
global_bias_np + np.dot(user_1_factors, item_3382_factors) + user_1_bias + item_3382_bias
array([7.15954572])

There it is: the same value as the predicted rating returned by the endpoint.

Full Pair Predictions

# Change type
user_factors_df[['user_id', 'factor']] = user_factors_df[['user_id', 'factor']].astype('int')

# Bias for each user
user_bias_df = user_factors_df.groupby('user_id')['bias'].max().reset_index()
# Change type
item_factors_df[['item_id', 'factor']] = item_factors_df[['item_id', 'factor']].astype('int')

# Bias for each movie
item_bias_df = item_factors_df.groupby('item_id')['bias'].max().reset_index()

Convert to numpy

# bias
user_bias_np = user_bias_df.sort_values("user_id", ascending=True).bias.to_numpy()
# embeddings
user_factors_np = user_factors_df.sort_values(
    ["user_id", "factor"], ascending=[True, False]
).weight.to_numpy().reshape((6_040, 60))

user_bias_np.shape, user_factors_np.shape
((6040,), (6040, 60))
# bias
item_bias_np = item_bias_df.sort_values("item_id", ascending=True).bias.to_numpy()
# embeddings
item_factors_np = item_factors_df.sort_values(
    ["item_id", "factor"], ascending=[True, False]
).weight.to_numpy().reshape((3_706, 60))

item_bias_np.shape, item_factors_np.shape
((3706,), (3706, 60))

Map user_id and item_id to numpy indices

idx2item_id = {idx: item_id for idx, item_id in zip(item_bias_df.index, item_bias_df.item_id)}
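The analogous mapping for users can be built the same way (shown for completeness; it is not strictly needed below because user_id=1 happens to map to row 0):

idx2user_id = {idx: user_id for idx, user_id in zip(user_bias_df.index, user_bias_df.user_id)}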

User Item interaction

user_items_interactions = global_bias_np + np.dot(
    user_factors_np, item_factors_np.T
) + user_bias_np[:,None] + item_bias_np[None,:] 

user_items_interactions.shape
(6040, 3706)

So user_id=1 corresponds to index 0 of the numpy array.

top_10_1 = user_items_interactions[0].argsort()[-10:][::-1]
[idx2item_id[idx] for idx in top_10_1]
[3382, 3920, 2178, 662, 469, 199, 215, 2073, 881, 3132]
endpoint_predictions_df.head(10)
predicted_rating predicted_item_id
0 7.159546 3382
1 6.931200 3920
2 6.748711 2178
3 6.609937 662
4 6.336838 469
5 6.316229 199
6 6.203409 215
7 6.190802 2073
8 6.095066 881
9 6.074839 3132

Same top item_ids as the endpoint returned. We are doing fine.

Top 10 predictions for each user

top_10_idxs = np.argsort(user_items_interactions, axis=1)[:,::-1][:,:10]
top_10_idxs
array([[3152, 3673, 1997, ..., 1892,  823, 2916],
       [3258, 3191, 2758, ...,  447, 2719, 1917],
       [1763,  668,  984, ..., 3152, 1917, 1860],
       ...,
       [3152, 3504, 2122, ..., 2232, 1470,  543],
       [ 986, 2964,  856, ...,  616, 2118, 2684],
       [3237, 1298, 3579, ..., 2925, 2732, 1451]])
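If we wanted the actual item_ids for every user at once, the idx2item_id mapping can be applied to the whole matrix (a small sketch):

# Map every index in the (6040, 10) matrix to its item_id
top_10_item_ids = np.vectorize(idx2item_id.get)(top_10_idxs)
top_10_item_ids[0]  # item_ids for user_id=1, same as the endpoint result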

Top 10 ratings for user_id=1

user_items_interactions[0][top_10_idxs[0]]
array([7.15954572, 6.93120045, 6.74871055, 6.60993714, 6.33683759,
       6.31622919, 6.20340873, 6.19080189, 6.09506552, 6.07483853])
import pandas as pd

df = pd.DataFrame(top_10_idxs[0])
df.columns = ['movie_id']
df.movie_id = df.movie_id.map(idx2item_id)

df = df.merge(movies_df)
df['predicted_rating'] = user_items_interactions[0][top_10_idxs[0]]
df
movie_id movie_title genre predicted_rating
0 3382 Song of Freedom (1936) Drama 7.159546
1 3920 Faraway, So Close (In Weiter Ferne, So Nah!) (... Drama|Fantasy 6.931200
2 2178 Frenzy (1972) Thriller 6.748711
3 662 Fear (1996) Thriller 6.609937
4 469 House of the Spirits, The (1993) Drama|Romance 6.336838
5 199 Umbrellas of Cherbourg, The (Parapluies de Che... Drama|Musical 6.316229
6 215 Before Sunrise (1995) Drama|Romance 6.203409
7 2073 Fandango (1985) Comedy 6.190802
8 881 First Kid (1996) Children's|Comedy 6.095066
9 3132 Daddy Long Legs (1919) Comedy 6.074839

Getting Top Movies for each User from BigQuery

Let’s create a table containing the Top 5 movies by user.

query = f'''SELECT
  user_id,
  ARRAY_AGG(STRUCT(movie_title, genre, predicted_rating)
ORDER BY predicted_rating DESC LIMIT 5)
FROM (
SELECT
  user_id,
  item_id,
  predicted_rating,
  movie_title,
  genre
FROM
  {MODEL_DATASET}.recommend_m2_1m
JOIN
  {MOVIELENS_DATASET}.movie_titles
ON
  item_id = movie_id)
GROUP BY
  user_id
'''
job_config = bigquery.QueryJobConfig()
job_config.destination = f"{PROJECT_ID}.{MODEL_DATASET}.recommend_m2_1m_top"

query_job = client.query(query, job_config=job_config)
results = query_job.result()
items_for_user_1_df = client.query(f"""
    SELECT * 
    FROM `{PROJECT_ID}.{MODEL_DATASET}.recommend_m2_1m_top` 
    WHERE user_id=1
""").to_dataframe()

# The ARRAY_AGG column is unaliased, so BigQuery returns it as f0_
items_for_user_1_df.f0_.loc[0]
array([{'movie_title': 'Song of Freedom (1936)', 'genre': 'Drama', 'predicted_rating': 7.1595457186221525},
       {'movie_title': 'Faraway, So Close (In Weiter Ferne, So Nah!) (1993)', 'genre': 'Drama|Fantasy', 'predicted_rating': 6.931200454403858},
       {'movie_title': 'Frenzy (1972)', 'genre': 'Thriller', 'predicted_rating': 6.748710552254239},
       {'movie_title': 'Fear (1996)', 'genre': 'Thriller', 'predicted_rating': 6.609937143182611},
       {'movie_title': 'House of the Spirits, The (1993)', 'genre': 'Drama|Romance', 'predicted_rating': 6.336837592527967}],
      dtype=object)
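If a flat dataframe is easier to work with, the array of structs can be expanded with pandas (a small sketch; top_5_user_1_df is a name introduced here):

import pandas as pd

# Expand the array of STRUCTs into one row per recommended movie
top_5_user_1_df = pd.DataFrame(list(items_for_user_1_df.f0_.loc[0]))
top_5_user_1_df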