datamatica - RecSys MLOps. BigQuery ML and Vertex AI. Part 2

Overview

Part 1 was about creating a Vertex AI pipeline for training and deploying Matrix Factorization model using BigQuery ML.

We ended with a model we can use from BigQuery, but that is also available as a Cloud Run endpoint, deployed as a Docker container.

Note

This notebook is better suited for running in Colab.

Authenticate

from google.colab import auth

auth.authenticate_user()

Get recommendations from deployed endpoint

As defined in part 1, the service was deployed as recommender-model.

SERVICE = "recommender-model"
REGION = 'us-central1'
PROJECT_ID = "[your-project-id]"

To get the endpoint url, you can navigate to Cloud Run console (insert your Project ID):

https://console.cloud.google.com/run/detail//recommender-model/metrics?project=[PROJECT_ID]

Or you can run the following command:

service_url = ! gcloud run services describe {SERVICE} --platform managed --region {REGION} --format 'value(status.url)'
service_url = service_url[0] + "/v1/models/recommender_model:predict"
service_url

And the endpoint should look something like this:
https://recommender-model-<ENDPOINT_ID>-uc.a.run.app/v1/models/recommender_model:predict

Call the deployed model endpoint

To call the API simply use curl. In this case we get recommendations for the user_id equals 1:

! curl --header "Content-Type: application/json" \
  --request POST \
  --data '{"instances": [{"user_id": 1}]}' \
  https://recommender-model-<ENDPOINT_ID>-uc.a.run.app/v1/models/recommender_model:predict

{
    "predictions": [
        {
            "predicted_rating": [7.1595457186221525, 6.931200454403859, 6.7487105522542379, 6.6099371431826093, 6.3368375925279663, 6.3162291916598949, 6.2034087300038294, 6.1908018877869777, 6.095065519595515, 6.0748385283824344, 6.0707896719918022, 6.0115152056191317, 5.9950883198035312, 5.9266960142259579, 5.8977393292303706, 5.8963543166167049, 5.8879213166109317, 5.8651216371567951, 5.8157813717342961, 5.7918400837633026, 5.7728366859651743, 5.7459233895973343, 5.7286289144902627, 5.7236862593334399, 5.6735891385114323, 5.6712837959564055, 5.6344651298840862, 5.6144777548802312, 5.6136065773986719, 5.5966244955484585, 5.5826125214031448, 5.5438993473737597, 5.5405328509132108, 5.5156535458862912, 5.5060371378140998, 5.4983916934699879, 5.4779312652757479, 5.4381644519356804, 5.436539915444734, 5.4274984607954586, 5.4162018042063238, 5.4131954911985822, 5.4112950178511134, 5.3977833272890878, 5.3924309392602101, 5.3921355023131454, 5.3795566101479775, 5.3784815662419794, 5.3724805166998593, 5.3716275159116167],
            "predicted_item_id": ["3382", "3920", "2178", "662", "469", "199", "215", "2073", "881", "3132", "1900", "1615", "3141", "2928", "445", "973", "3605", "766", "935", "1841", "2330", "944", "2314", "3447", "2171", "3327", "2419", "2351", "557", "3720", "156", "2066", "2475", "2970", "26", "2360", "3018", "2272", "2506", "3670", "1631", "414", "2165", "1177", "2264", "743", "2725", "1934", "2710", "3275"]
        }
    ]
}

Which return the top 50 items (movies) by predicted rating for the specified user.

After processing the output, we can get a dataframe like this (remember this are predictions for user_id=1):

endpoint_predictions_df.head(10)

	predicted_rating	predicted_item_id
0	7.159546	3382
1	6.931200	3920
2	6.748711	2178
3	6.609937	662
4	6.336838	469
5	6.316229	199
6	6.203409	215
7	6.190802	2073
8	6.095066	881
9	6.074839	3132

BigQuery

Authenticate to BigQuery

from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID)

Set the same parameters as in Part 1

MODEL_DATASET = "[bq-model-dataset]"  # @param {type:"string"}
MOVIELENS_DATASET = "[bq-data-dataset]"  # @param {type:"string"}

Query the data tables

movies_df = client.query(f'''
  SELECT *
  FROM `{MOVIELENS_DATASET}.movie_titles`
''').to_dataframe()
movies_df.head()

	movie_id	movie_title	genre
0	632	Land and Freedom (Tierra y libertad) (1995)	War
1	665	Underground (1995)	War
2	760	Stalingrad (1993)	War
3	777	Pharaoh's Army (1995)	War
4	1450	Prisoner of the Mountains (Kavkazsky Plennik) ...	War

user_ratings_df = client.query(f'''
  SELECT t1.user_id, t1.rating, t1.item_id, t2.movie_title, t2.genre
  FROM `{MOVIELENS_DATASET}.movielens_1m` t1
  LEFT JOIN {MOVIELENS_DATASET}.movie_titles t2
  ON t1.item_id=t2.movie_id
  --WHERE t1.user_id=1
  ORDER BY rating DESC
''').to_dataframe()
user_ratings_df.head()

	user_id	rating	item_id	movie_title	genre
0	1	5.0	1193	One Flew Over the Cuckoo's Nest (1975)	Drama
1	1	5.0	2355	Bug's Life, A (1998)	Animation\|Children's\|Comedy
2	1	5.0	1287	Ben-Hur (1959)	Action\|Adventure\|Drama
3	1	5.0	2804	Christmas Story, A (1983)	Comedy\|Drama
4	1	5.0	595	Beauty and the Beast (1991)	Animation\|Children's\|Musical

Information about model(s) created

models = client.list_models(MODEL_DATASET)
print("Models contained in '{}':".format(MODEL_DATASET))

models_fmid = []
for model in models:
    full_model_id = f"{model.project}.{model.dataset_id}.{model.model_id}"
    models_fmid.append(full_model_id)

    print(f"{full_model_id}")

Models contained in 'bqml_modelos':
pythonapi-205723.bqml_modelos.mf_model_pipe_01
pythonapi-205723.bqml_modelos.mf_model_pipe_02
pythonapi-205723.bqml_modelos.mf_model_pipe_03

As you can see, I created three models in bqml_modelos dataset.

client.get_model(models_fmid[0])

Model(reference=ModelReference(project_id='pythonapi-205723', dataset_id='bqml_modelos', model_id='mf_model_pipe_01'))

model.model_type

'MATRIX_FACTORIZATION'

Training statistics

client.query(f'''
  SELECT *
  FROM ML.TRAINING_INFO(MODEL `{models_fmid[2]}`)
''').to_dataframe()

	iteration	loss	eval_loss	duration_ms
0	15	0.259132	NaN	64410
1	14	0.261374	NaN	107980
2	13	0.265850	NaN	60169
3	12	0.268827	NaN	29616
4	11	0.274483	NaN	57306
5	10	0.278608	NaN	117195
6	9	0.286101	NaN	59622
7	8	0.292234	NaN	115613
8	7	0.302932	NaN	62337
9	6	0.313072	NaN	101301
10	5	0.330243	NaN	131648
11	4	0.350469	NaN	118377
12	3	0.385341	NaN	55254
13	2	0.449968	NaN	103391
14	1	0.581976	NaN	64919
15	0	4.047567	NaN	168618

Evaluation

client.query(f'''
SELECT *
FROM ML.EVALUATE(MODEL `{models_fmid[2]}`,
    (
    SELECT
      user_id,
      item_id,
      rating
     FROM
      {MOVIELENS_DATASET}.movielens_1m))
''').to_dataframe()

	mean_absolute_error	mean_squared_error	mean_squared_log_error	median_absolute_error	r2_score	explained_variance
0	0.382252	0.259132	0.016406	0.293064	0.792348	0.792348

Rating predictions

Lets get predicted ratings for all user-item pairs.

Note

This query generates a predicted rating for every user-item pair. As a result, this can be a rather large output. It is recommended that the output is saved in a table which is then used in other queries for analysis.

Lets create one table to store all predictions called recommend_m2_1m and another one with top predictions by user called recommend_m2_1m_top.

recommend_m2_1m_df = client.query(f'''
SELECT *
FROM ML.RECOMMEND(MODEL bqml_modelos.mf_model_pipe_03)
''').to_dataframe()

recommend_m2_1m_df

	predicted_rating	user_id	item_id
0	3.489891	3575	1376
1	6.627501	2213	1362
2	2.694415	5876	1330
3	-1.662953	4817	3042
4	2.208275	5448	563
...	...	...	...
22384235	3.183006	933	2282
22384236	1.201612	5843	391
22384237	2.886981	4321	3525
22384238	2.488141	1000	2881
22384239	2.805384	5148	1049

22384240 rows × 3 columns

print(f"Number of users: {len(user_ratings_df.user_id.unique()):,}")
print(f"Number of movies: {len(user_ratings_df.item_id.unique()):,}")
assert len(user_ratings_df.user_id.unique()) * len(user_ratings_df.item_id.unique()) == len(recommend_m2_1m_df)

Number of users: 6,040
Number of movies: 3,706

As we can see, the full table has 22,384,240 rows, equals to 6,040 users x 3,706 movies.

Compare to endpoint results

recommend_m2_1m_df[recommend_m2_1m_df.user_id==1].sort_values(by='predicted_rating', ascending=False).head(10)

	predicted_rating	user_id	item_id
50602	7.159546	1	3382
22190133	6.931200	1	3920
5356399	6.748711	1	2178
15157835	6.609937	1	662
4687764	6.336838	1	469
17892355	6.316229	1	199
1365223	6.203409	1	215
16742143	6.190802	1	2073
8094661	6.095066	1	881
21623544	6.074839	1	3132

endpoint_predictions_df.head(10)

	predicted_rating	predicted_item_id
0	7.159546	3382
1	6.931200	3920
2	6.748711	2178
3	6.609937	662
4	6.336838	469
5	6.316229	199
6	6.203409	215
7	6.190802	2073
8	6.095066	881
9	6.074839	3132

For user_id=1 and item_id=3382, the predicted rating is 7.159546.

Latent Factors (Embeddings) and Bias

The matrix factorization model trained in BigQuery ML returns a predicted rating as the following equation:

predicted = Global Bias + DotProduct(user_factors, item_factors) + user_bias + item_bias

Where Global Bias is a constant for all the pairs, user_factors and item_factors are vectors with 60 dimensions (as definied in the training process) and user_bias and item_bias are a distinct number for each user and for each item.

Global Bias

global_bias = client.query(f"""
    SELECT intercept
    FROM ML.WEIGHTS(model {MODEL_DATASET}.mf_model_pipe_03)
    WHERE feature = 'global__INTERCEPT__'
""").to_dataframe()

global_bias_np = global_bias.loc[0].values
global_bias_np

array([3.58156445])

Movie Embeddings

item_factors_df = client.query(f"""
    SELECT feature AS item_id, intercept AS bias, fw.factor AS factor, fw.weight
    FROM ML.WEIGHTS(model {MODEL_DATASET}.mf_model_pipe_03),
    UNNEST(factor_weights) AS fw
    WHERE processed_input = 'item_id'
""").to_dataframe()

item_factors_df

	item_id	bias	factor	weight
0	2	-0.393036	60	-0.113832
1	2	-0.393036	59	0.048548
2	2	-0.393036	58	-0.112024
3	2	-0.393036	57	0.118971
4	2	-0.393036	56	-0.050817
...	...	...	...	...
222355	3833	-1.592288	5	0.123739
222356	3833	-1.592288	4	0.234778
222357	3833	-1.592288	3	-0.296197
222358	3833	-1.592288	2	0.502392
222359	3833	-1.592288	1	-0.232240

222360 rows × 4 columns

User Embeddings

user_factors_df = client.query("""
    SELECT feature AS user_id, intercept AS bias, fw.factor AS factor, fw.weight
    FROM ML.WEIGHTS(model bqml_modelos.mf_model_pipe_03),
    UNNEST(factor_weights) AS fw
    WHERE processed_input = 'user_id'
""").to_dataframe()

user_factors_df

	user_id	bias	factor	weight
0	256	0.255283	60	0.713645
1	256	0.255283	59	1.387058
2	256	0.255283	58	-0.567324
3	256	0.255283	57	-0.709545
4	256	0.255283	56	-1.813568
...	...	...	...	...
362395	5887	-0.408196	5	1.493551
362396	5887	-0.408196	4	0.960923
362397	5887	-0.408196	3	-2.272016
362398	5887	-0.408196	2	0.393577
362399	5887	-0.408196	1	-2.099104

362400 rows × 4 columns

Do the math for a prediction

We already saw that for user_id=1 and item_id=3382, the predicted rating is 7.159546. Let’s do the math:

user_1_factors = user_factors_df[user_factors_df.user_id=='1'].weight.to_numpy()
user_1_bias = user_factors_df[user_factors_df.user_id=='1'].bias.to_numpy()[:1]
user_1_bias, user_1_factors[:5], user_1_factors.shape

(array([0.04259326]),
 array([ 0.72214304, -1.06488835, -0.35848901, -0.08184539, -0.34282946]),
 (60,))

item_3382_factors = item_factors_df[item_factors_df.item_id=="3382"].weight.to_numpy()
item_3382_bias = item_factors_df[item_factors_df.item_id=="3382"].bias.to_numpy()[:1]
item_3382_bias, item_3382_factors[:5], item_3382_factors.shape

(array([3.535388]),
 array([ 1.38777878e-17,  3.03576608e-18, -2.81892565e-18,  1.73472348e-18,
        -5.02894452e-17]),
 (60,))

import numpy as np
global_bias_np + np.dot(user_1_factors, item_3382_factors) + user_1_bias + item_3382_bias

array([7.15954572])

There it is, same value for the predicted rating.

Full Pair Predictions

# Change type
user_factors_df[['user_id', 'factor']] = user_factors_df[['user_id', 'factor']].astype('int')

# Bias for each user
user_bias_df = user_factors_df.groupby('user_id')['bias'].max().reset_index()

# Change type
item_factors_df[['item_id', 'factor']] = item_factors_df[['item_id', 'factor']].astype('int')

# Bias for each movie
item_bias_df = item_factors_df.groupby('item_id')['bias'].max().reset_index()

Convert to numpy

# bias
user_bias_np = user_bias_df.sort_values("user_id", ascending=True).bias.to_numpy()
# embeddings
user_factors_np = user_factors_df.sort_values(
    ["user_id", "factor"], ascending=[True, False]
).weight.to_numpy().reshape((6_040, 60))

user_bias_np.shape, user_factors_np.shape

((6040,), (6040, 60))

# bias
item_bias_np = item_bias_df.sort_values("item_id", ascending=True).bias.to_numpy()
# embeddings
item_factors_np = item_factors_df.sort_values(
    ["item_id", "factor"], ascending=[True, False]
).weight.to_numpy().reshape((3_706, 60))

item_bias_np.shape, item_factors_np.shape

((3706,), (3706, 60))

Map `user_id` and `item_id` to numpy indices

idx2item_id = {idx: item_id for idx, item_id in zip(item_bias_df.index, item_bias_df.item_id)}

User Item interaction

user_items_interactions = global_bias_np + np.dot(
    user_factors_np, item_factors_np.T
) + user_bias_np[:,None] + item_bias_np[None,:] 

user_items_interactions.shape

(6040, 3706)

So, user_id=1 is the index 0 for the numpy array.

top_10_1 = user_items_interactions[0].argsort()[-10:][::-1]
[idx2item_id[idx] for idx in top_10_1]

[3382, 3920, 2178, 662, 469, 199, 215, 2073, 881, 3132]

endpoint_predictions_df.head(10)

	predicted_rating	predicted_item_id
0	7.159546	3382
1	6.931200	3920
2	6.748711	2178
3	6.609937	662
4	6.336838	469
5	6.316229	199
6	6.203409	215
7	6.190802	2073
8	6.095066	881
9	6.074839	3132

Same Top item_id from the endpoint. We are doing fine.

Top 10 predictions for each pair

top_10_idxs = np.argsort(user_items_interactions, axis=1)[:,::-1][:,:10]
top_10_idxs

array([[3152, 3673, 1997, ..., 1892,  823, 2916],
       [3258, 3191, 2758, ...,  447, 2719, 1917],
       [1763,  668,  984, ..., 3152, 1917, 1860],
       ...,
       [3152, 3504, 2122, ..., 2232, 1470,  543],
       [ 986, 2964,  856, ...,  616, 2118, 2684],
       [3237, 1298, 3579, ..., 2925, 2732, 1451]])

Top 10 ratings for user_id=1

user_items_interactions[0][top_10_idxs[0]]

array([7.15954572, 6.93120045, 6.74871055, 6.60993714, 6.33683759,
       6.31622919, 6.20340873, 6.19080189, 6.09506552, 6.07483853])

df = pd.DataFrame(top_10_idxs[0])
df.columns = ['movie_id']
df.movie_id = df.movie_id.map(idx2item_id)

df = df.merge(movies_df)
df['predicted_rating'] = user_items_interactions[0][top_10_idxs[0]]
df

	movie_id	movie_title	genre	predicted_rating
0	3382	Song of Freedom (1936)	Drama	7.159546
1	3920	Faraway, So Close (In Weiter Ferne, So Nah!) (...	Drama\|Fantasy	6.931200
2	2178	Frenzy (1972)	Thriller	6.748711
3	662	Fear (1996)	Thriller	6.609937
4	469	House of the Spirits, The (1993)	Drama\|Romance	6.336838
5	199	Umbrellas of Cherbourg, The (Parapluies de Che...	Drama\|Musical	6.316229
6	215	Before Sunrise (1995)	Drama\|Romance	6.203409
7	2073	Fandango (1985)	Comedy	6.190802
8	881	First Kid (1996)	Children's\|Comedy	6.095066
9	3132	Daddy Long Legs (1919)	Comedy	6.074839

Getting Top Movies for each User from BigQuery

Let’s create a table containing the Top 5 movies by user.

query = f'''SELECT
  user_id,
  ARRAY_AGG(STRUCT(movie_title, genre, predicted_rating)
ORDER BY predicted_rating DESC LIMIT 5)
FROM (
SELECT
  user_id,
  item_id,
  predicted_rating,
  movie_title,
  genre
FROM
  {MODEL_DATASET}.recommend_m2_1m
JOIN
  movielens.movie_titles
ON
  item_id = movie_id)
GROUP BY
  user_id
'''
job_config = bigquery.QueryJobConfig()
job_config.destination = "{PROJECT_ID}.{MODEL_DATASET}.recommend_m2_1m_top"

query_job = client.query(query, job_config)
results = query_job.result()

items_for_user_1_df = client.query(f"""
    SELECT * 
    FROM `{PROJECT_ID}.{MODEL_DATASET}.recommend_m2_1m_top` 
    WHERE user_id=1
""").to_dataframe()

items_for_user_1_df.f0_.loc[0]

array([{'movie_title': 'Song of Freedom (1936)', 'genre': 'Drama', 'predicted_rating': 7.1595457186221525},
       {'movie_title': 'Faraway, So Close (In Weiter Ferne, So Nah!) (1993)', 'genre': 'Drama|Fantasy', 'predicted_rating': 6.931200454403858},
       {'movie_title': 'Frenzy (1972)', 'genre': 'Thriller', 'predicted_rating': 6.748710552254239},
       {'movie_title': 'Fear (1996)', 'genre': 'Thriller', 'predicted_rating': 6.609937143182611},
       {'movie_title': 'House of the Spirits, The (1993)', 'genre': 'Drama|Romance', 'predicted_rating': 6.336837592527967}],
      dtype=object)

Overview

Authenticate

Get recommendations from deployed endpoint

Call the deployed model endpoint

BigQuery

Authenticate to BigQuery

Set the same parameters as in Part 1

Query the data tables

Information about model(s) created

Training statistics

Evaluation

Rating predictions

Compare to endpoint results

Latent Factors (Embeddings) and Bias

Global Bias

Movie Embeddings

User Embeddings

Do the math for a prediction

Full Pair Predictions

Convert to numpy

Map user_id and item_id to numpy indices

User Item interaction

Top 10 predictions for each pair

Getting Top Movies for each User from BigQuery

Map `user_id` and `item_id` to numpy indices