from google.colab import auth
auth.authenticate_user()
Overview
Part 1 was about creating a Vertex AI pipeline for training and deploying Matrix Factorization model using BigQuery ML.
We ended with a model we can use from BigQuery, but that is also available as a Cloud Run endpoint, deployed as a Docker container.
This notebook is better suited for running in Colab.
Authenticate
Get recommendations from deployed endpoint
As defined in part 1, the service was deployed as recommender-model
.
= "recommender-model"
SERVICE = 'us-central1'
REGION = "[your-project-id]" PROJECT_ID
To get the endpoint url, you can navigate to Cloud Run console (insert your Project ID):
https://console.cloud.google.com/run/detail/
Or you can run the following command:
= ! gcloud run services describe {SERVICE} --platform managed --region {REGION} --format 'value(status.url)'
service_url = service_url[0] + "/v1/models/recommender_model:predict"
service_url service_url
And the endpoint should look something like this:
https://recommender-model-<ENDPOINT_ID>-uc.a.run.app/v1/models/recommender_model:predict
Call the deployed model endpoint
To call the API simply use curl
. In this case we get recommendations for the user_id
equals 1
:
! curl --header "Content-Type: application/json" \
--request POST \
--data '{"instances": [{"user_id": 1}]}' \
//recommender-model-<ENDPOINT_ID>-uc.a.run.app/v1/models/recommender_model:predict https:
{
"predictions": [
{
"predicted_rating": [7.1595457186221525, 6.931200454403859, 6.7487105522542379, 6.6099371431826093, 6.3368375925279663, 6.3162291916598949, 6.2034087300038294, 6.1908018877869777, 6.095065519595515, 6.0748385283824344, 6.0707896719918022, 6.0115152056191317, 5.9950883198035312, 5.9266960142259579, 5.8977393292303706, 5.8963543166167049, 5.8879213166109317, 5.8651216371567951, 5.8157813717342961, 5.7918400837633026, 5.7728366859651743, 5.7459233895973343, 5.7286289144902627, 5.7236862593334399, 5.6735891385114323, 5.6712837959564055, 5.6344651298840862, 5.6144777548802312, 5.6136065773986719, 5.5966244955484585, 5.5826125214031448, 5.5438993473737597, 5.5405328509132108, 5.5156535458862912, 5.5060371378140998, 5.4983916934699879, 5.4779312652757479, 5.4381644519356804, 5.436539915444734, 5.4274984607954586, 5.4162018042063238, 5.4131954911985822, 5.4112950178511134, 5.3977833272890878, 5.3924309392602101, 5.3921355023131454, 5.3795566101479775, 5.3784815662419794, 5.3724805166998593, 5.3716275159116167],
"predicted_item_id": ["3382", "3920", "2178", "662", "469", "199", "215", "2073", "881", "3132", "1900", "1615", "3141", "2928", "445", "973", "3605", "766", "935", "1841", "2330", "944", "2314", "3447", "2171", "3327", "2419", "2351", "557", "3720", "156", "2066", "2475", "2970", "26", "2360", "3018", "2272", "2506", "3670", "1631", "414", "2165", "1177", "2264", "743", "2725", "1934", "2710", "3275"]
}
]
}
Which return the top 50 items (movies) by predicted rating for the specified user.
After processing the output, we can get a dataframe like this (remember this are predictions for user_id=1):
10) endpoint_predictions_df.head(
predicted_rating | predicted_item_id | |
---|---|---|
0 | 7.159546 | 3382 |
1 | 6.931200 | 3920 |
2 | 6.748711 | 2178 |
3 | 6.609937 | 662 |
4 | 6.336838 | 469 |
5 | 6.316229 | 199 |
6 | 6.203409 | 215 |
7 | 6.190802 | 2073 |
8 | 6.095066 | 881 |
9 | 6.074839 | 3132 |
BigQuery
Authenticate to BigQuery
from google.cloud import bigquery
= bigquery.Client(project=PROJECT_ID) client
Set the same parameters as in Part 1
= "[bq-model-dataset]" # @param {type:"string"}
MODEL_DATASET = "[bq-data-dataset]" # @param {type:"string"} MOVIELENS_DATASET
Query the data tables
= client.query(f'''
movies_df SELECT *
FROM `{MOVIELENS_DATASET}.movie_titles`
''').to_dataframe()
movies_df.head()
movie_id | movie_title | genre | |
---|---|---|---|
0 | 632 | Land and Freedom (Tierra y libertad) (1995) | War |
1 | 665 | Underground (1995) | War |
2 | 760 | Stalingrad (1993) | War |
3 | 777 | Pharaoh's Army (1995) | War |
4 | 1450 | Prisoner of the Mountains (Kavkazsky Plennik) ... | War |
= client.query(f'''
user_ratings_df SELECT t1.user_id, t1.rating, t1.item_id, t2.movie_title, t2.genre
FROM `{MOVIELENS_DATASET}.movielens_1m` t1
LEFT JOIN {MOVIELENS_DATASET}.movie_titles t2
ON t1.item_id=t2.movie_id
--WHERE t1.user_id=1
ORDER BY rating DESC
''').to_dataframe()
user_ratings_df.head()
user_id | rating | item_id | movie_title | genre | |
---|---|---|---|---|---|
0 | 1 | 5.0 | 1193 | One Flew Over the Cuckoo's Nest (1975) | Drama |
1 | 1 | 5.0 | 2355 | Bug's Life, A (1998) | Animation|Children's|Comedy |
2 | 1 | 5.0 | 1287 | Ben-Hur (1959) | Action|Adventure|Drama |
3 | 1 | 5.0 | 2804 | Christmas Story, A (1983) | Comedy|Drama |
4 | 1 | 5.0 | 595 | Beauty and the Beast (1991) | Animation|Children's|Musical |
Information about model(s) created
= client.list_models(MODEL_DATASET)
models print("Models contained in '{}':".format(MODEL_DATASET))
= []
models_fmid for model in models:
= f"{model.project}.{model.dataset_id}.{model.model_id}"
full_model_id
models_fmid.append(full_model_id)
print(f"{full_model_id}")
Models contained in 'bqml_modelos':
pythonapi-205723.bqml_modelos.mf_model_pipe_01
pythonapi-205723.bqml_modelos.mf_model_pipe_02
pythonapi-205723.bqml_modelos.mf_model_pipe_03
As you can see, I created three models in bqml_modelos
dataset.
0]) client.get_model(models_fmid[
Model(reference=ModelReference(project_id='pythonapi-205723', dataset_id='bqml_modelos', model_id='mf_model_pipe_01'))
model.model_type
'MATRIX_FACTORIZATION'
Training statistics
f'''
client.query( SELECT *
FROM ML.TRAINING_INFO(MODEL `{models_fmid[2]}`)
''').to_dataframe()
training_run | iteration | loss | eval_loss | duration_ms | |
---|---|---|---|---|---|
0 | 0 | 15 | 0.259132 | NaN | 64410 |
1 | 0 | 14 | 0.261374 | NaN | 107980 |
2 | 0 | 13 | 0.265850 | NaN | 60169 |
3 | 0 | 12 | 0.268827 | NaN | 29616 |
4 | 0 | 11 | 0.274483 | NaN | 57306 |
5 | 0 | 10 | 0.278608 | NaN | 117195 |
6 | 0 | 9 | 0.286101 | NaN | 59622 |
7 | 0 | 8 | 0.292234 | NaN | 115613 |
8 | 0 | 7 | 0.302932 | NaN | 62337 |
9 | 0 | 6 | 0.313072 | NaN | 101301 |
10 | 0 | 5 | 0.330243 | NaN | 131648 |
11 | 0 | 4 | 0.350469 | NaN | 118377 |
12 | 0 | 3 | 0.385341 | NaN | 55254 |
13 | 0 | 2 | 0.449968 | NaN | 103391 |
14 | 0 | 1 | 0.581976 | NaN | 64919 |
15 | 0 | 0 | 4.047567 | NaN | 168618 |
Evaluation
f'''
client.query(SELECT *
FROM ML.EVALUATE(MODEL `{models_fmid[2]}`,
(
SELECT
user_id,
item_id,
rating
FROM
{MOVIELENS_DATASET}.movielens_1m))
''').to_dataframe()
mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score | explained_variance | |
---|---|---|---|---|---|---|
0 | 0.382252 | 0.259132 | 0.016406 | 0.293064 | 0.792348 | 0.792348 |
Rating predictions
Lets get predicted ratings for all user-item pairs.
This query generates a predicted rating for every user-item pair. As a result, this can be a rather large output. It is recommended that the output is saved in a table which is then used in other queries for analysis.
Lets create one table to store all predictions called recommend_m2_1m
and another one with top predictions by user called recommend_m2_1m_top
.
= client.query(f'''
recommend_m2_1m_df SELECT *
FROM ML.RECOMMEND(MODEL bqml_modelos.mf_model_pipe_03)
''').to_dataframe()
recommend_m2_1m_df
predicted_rating | user_id | item_id | |
---|---|---|---|
0 | 3.489891 | 3575 | 1376 |
1 | 6.627501 | 2213 | 1362 |
2 | 2.694415 | 5876 | 1330 |
3 | -1.662953 | 4817 | 3042 |
4 | 2.208275 | 5448 | 563 |
... | ... | ... | ... |
22384235 | 3.183006 | 933 | 2282 |
22384236 | 1.201612 | 5843 | 391 |
22384237 | 2.886981 | 4321 | 3525 |
22384238 | 2.488141 | 1000 | 2881 |
22384239 | 2.805384 | 5148 | 1049 |
22384240 rows × 3 columns
print(f"Number of users: {len(user_ratings_df.user_id.unique()):,}")
print(f"Number of movies: {len(user_ratings_df.item_id.unique()):,}")
assert len(user_ratings_df.user_id.unique()) * len(user_ratings_df.item_id.unique()) == len(recommend_m2_1m_df)
Number of users: 6,040
Number of movies: 3,706
As we can see, the full table has 22,384,240 rows, equals to 6,040 users x 3,706 movies.
Compare to endpoint results
==1].sort_values(by='predicted_rating', ascending=False).head(10) recommend_m2_1m_df[recommend_m2_1m_df.user_id
predicted_rating | user_id | item_id | |
---|---|---|---|
50602 | 7.159546 | 1 | 3382 |
22190133 | 6.931200 | 1 | 3920 |
5356399 | 6.748711 | 1 | 2178 |
15157835 | 6.609937 | 1 | 662 |
4687764 | 6.336838 | 1 | 469 |
17892355 | 6.316229 | 1 | 199 |
1365223 | 6.203409 | 1 | 215 |
16742143 | 6.190802 | 1 | 2073 |
8094661 | 6.095066 | 1 | 881 |
21623544 | 6.074839 | 1 | 3132 |
10) endpoint_predictions_df.head(
predicted_rating | predicted_item_id | |
---|---|---|
0 | 7.159546 | 3382 |
1 | 6.931200 | 3920 |
2 | 6.748711 | 2178 |
3 | 6.609937 | 662 |
4 | 6.336838 | 469 |
5 | 6.316229 | 199 |
6 | 6.203409 | 215 |
7 | 6.190802 | 2073 |
8 | 6.095066 | 881 |
9 | 6.074839 | 3132 |
For user_id=1
and item_id=3382
, the predicted rating is 7.159546
.
Latent Factors (Embeddings) and Bias
The matrix factorization model trained in BigQuery ML returns a predicted rating as the following equation:
predicted = Global Bias + DotProduct(user_factors, item_factors) + user_bias + item_bias
Where Global Bias
is a constant for all the pairs, user_factors
and item_factors
are vectors with 60 dimensions (as definied in the training process) and user_bias
and item_bias
are a distinct number for each user and for each item.
Global Bias
= client.query(f"""
global_bias SELECT intercept
FROM ML.WEIGHTS(model {MODEL_DATASET}.mf_model_pipe_03)
WHERE feature = 'global__INTERCEPT__'
""").to_dataframe()
= global_bias.loc[0].values
global_bias_np global_bias_np
array([3.58156445])
Movie Embeddings
= client.query(f"""
item_factors_df SELECT feature AS item_id, intercept AS bias, fw.factor AS factor, fw.weight
FROM ML.WEIGHTS(model {MODEL_DATASET}.mf_model_pipe_03),
UNNEST(factor_weights) AS fw
WHERE processed_input = 'item_id'
""").to_dataframe()
item_factors_df
item_id | bias | factor | weight | |
---|---|---|---|---|
0 | 2 | -0.393036 | 60 | -0.113832 |
1 | 2 | -0.393036 | 59 | 0.048548 |
2 | 2 | -0.393036 | 58 | -0.112024 |
3 | 2 | -0.393036 | 57 | 0.118971 |
4 | 2 | -0.393036 | 56 | -0.050817 |
... | ... | ... | ... | ... |
222355 | 3833 | -1.592288 | 5 | 0.123739 |
222356 | 3833 | -1.592288 | 4 | 0.234778 |
222357 | 3833 | -1.592288 | 3 | -0.296197 |
222358 | 3833 | -1.592288 | 2 | 0.502392 |
222359 | 3833 | -1.592288 | 1 | -0.232240 |
222360 rows × 4 columns
User Embeddings
= client.query("""
user_factors_df SELECT feature AS user_id, intercept AS bias, fw.factor AS factor, fw.weight
FROM ML.WEIGHTS(model bqml_modelos.mf_model_pipe_03),
UNNEST(factor_weights) AS fw
WHERE processed_input = 'user_id'
""").to_dataframe()
user_factors_df
user_id | bias | factor | weight | |
---|---|---|---|---|
0 | 256 | 0.255283 | 60 | 0.713645 |
1 | 256 | 0.255283 | 59 | 1.387058 |
2 | 256 | 0.255283 | 58 | -0.567324 |
3 | 256 | 0.255283 | 57 | -0.709545 |
4 | 256 | 0.255283 | 56 | -1.813568 |
... | ... | ... | ... | ... |
362395 | 5887 | -0.408196 | 5 | 1.493551 |
362396 | 5887 | -0.408196 | 4 | 0.960923 |
362397 | 5887 | -0.408196 | 3 | -2.272016 |
362398 | 5887 | -0.408196 | 2 | 0.393577 |
362399 | 5887 | -0.408196 | 1 | -2.099104 |
362400 rows × 4 columns
Do the math for a prediction
We already saw that for user_id=1
and item_id=3382
, the predicted rating is 7.159546
. Let’s do the math:
= user_factors_df[user_factors_df.user_id=='1'].weight.to_numpy()
user_1_factors = user_factors_df[user_factors_df.user_id=='1'].bias.to_numpy()[:1]
user_1_bias 5], user_1_factors.shape user_1_bias, user_1_factors[:
(array([0.04259326]),
array([ 0.72214304, -1.06488835, -0.35848901, -0.08184539, -0.34282946]),
(60,))
= item_factors_df[item_factors_df.item_id=="3382"].weight.to_numpy()
item_3382_factors = item_factors_df[item_factors_df.item_id=="3382"].bias.to_numpy()[:1]
item_3382_bias 5], item_3382_factors.shape item_3382_bias, item_3382_factors[:
(array([3.535388]),
array([ 1.38777878e-17, 3.03576608e-18, -2.81892565e-18, 1.73472348e-18,
-5.02894452e-17]),
(60,))
import numpy as np
+ np.dot(user_1_factors, item_3382_factors) + user_1_bias + item_3382_bias global_bias_np
array([7.15954572])
There it is, same value for the predicted rating.
Full Pair Predictions
# Change type
'user_id', 'factor']] = user_factors_df[['user_id', 'factor']].astype('int')
user_factors_df[[
# Bias for each user
= user_factors_df.groupby('user_id')['bias'].max().reset_index() user_bias_df
# Change type
'item_id', 'factor']] = item_factors_df[['item_id', 'factor']].astype('int')
item_factors_df[[
# Bias for each movie
= item_factors_df.groupby('item_id')['bias'].max().reset_index() item_bias_df
Convert to numpy
# bias
= user_bias_df.sort_values("user_id", ascending=True).bias.to_numpy()
user_bias_np # embeddings
= user_factors_df.sort_values(
user_factors_np "user_id", "factor"], ascending=[True, False]
[6_040, 60))
).weight.to_numpy().reshape((
user_bias_np.shape, user_factors_np.shape
((6040,), (6040, 60))
# bias
= item_bias_df.sort_values("item_id", ascending=True).bias.to_numpy()
item_bias_np # embeddings
= item_factors_df.sort_values(
item_factors_np "item_id", "factor"], ascending=[True, False]
[3_706, 60))
).weight.to_numpy().reshape((
item_bias_np.shape, item_factors_np.shape
((3706,), (3706, 60))
Map user_id
and item_id
to numpy indices
= {idx: item_id for idx, item_id in zip(item_bias_df.index, item_bias_df.item_id)} idx2item_id
User Item interaction
= global_bias_np + np.dot(
user_items_interactions
user_factors_np, item_factors_np.T+ user_bias_np[:,None] + item_bias_np[None,:]
)
user_items_interactions.shape
(6040, 3706)
So, user_id=1
is the index 0 for the numpy array.
= user_items_interactions[0].argsort()[-10:][::-1]
top_10_1 for idx in top_10_1] [idx2item_id[idx]
[3382, 3920, 2178, 662, 469, 199, 215, 2073, 881, 3132]
10) endpoint_predictions_df.head(
predicted_rating | predicted_item_id | |
---|---|---|
0 | 7.159546 | 3382 |
1 | 6.931200 | 3920 |
2 | 6.748711 | 2178 |
3 | 6.609937 | 662 |
4 | 6.336838 | 469 |
5 | 6.316229 | 199 |
6 | 6.203409 | 215 |
7 | 6.190802 | 2073 |
8 | 6.095066 | 881 |
9 | 6.074839 | 3132 |
Same Top item_id from the endpoint. We are doing fine.
Top 10 predictions for each pair
= np.argsort(user_items_interactions, axis=1)[:,::-1][:,:10]
top_10_idxs top_10_idxs
array([[3152, 3673, 1997, ..., 1892, 823, 2916],
[3258, 3191, 2758, ..., 447, 2719, 1917],
[1763, 668, 984, ..., 3152, 1917, 1860],
...,
[3152, 3504, 2122, ..., 2232, 1470, 543],
[ 986, 2964, 856, ..., 616, 2118, 2684],
[3237, 1298, 3579, ..., 2925, 2732, 1451]])
Top 10 ratings for user_id=1
0][top_10_idxs[0]] user_items_interactions[
array([7.15954572, 6.93120045, 6.74871055, 6.60993714, 6.33683759,
6.31622919, 6.20340873, 6.19080189, 6.09506552, 6.07483853])
= pd.DataFrame(top_10_idxs[0])
df = ['movie_id']
df.columns = df.movie_id.map(idx2item_id)
df.movie_id
= df.merge(movies_df)
df 'predicted_rating'] = user_items_interactions[0][top_10_idxs[0]]
df[ df
movie_id | movie_title | genre | predicted_rating | |
---|---|---|---|---|
0 | 3382 | Song of Freedom (1936) | Drama | 7.159546 |
1 | 3920 | Faraway, So Close (In Weiter Ferne, So Nah!) (... | Drama|Fantasy | 6.931200 |
2 | 2178 | Frenzy (1972) | Thriller | 6.748711 |
3 | 662 | Fear (1996) | Thriller | 6.609937 |
4 | 469 | House of the Spirits, The (1993) | Drama|Romance | 6.336838 |
5 | 199 | Umbrellas of Cherbourg, The (Parapluies de Che... | Drama|Musical | 6.316229 |
6 | 215 | Before Sunrise (1995) | Drama|Romance | 6.203409 |
7 | 2073 | Fandango (1985) | Comedy | 6.190802 |
8 | 881 | First Kid (1996) | Children's|Comedy | 6.095066 |
9 | 3132 | Daddy Long Legs (1919) | Comedy | 6.074839 |
Getting Top Movies for each User from BigQuery
Let’s create a table containing the Top 5 movies by user.
= f'''SELECT
query user_id,
ARRAY_AGG(STRUCT(movie_title, genre, predicted_rating)
ORDER BY predicted_rating DESC LIMIT 5)
FROM (
SELECT
user_id,
item_id,
predicted_rating,
movie_title,
genre
FROM
{MODEL_DATASET}.recommend_m2_1m
JOIN
movielens.movie_titles
ON
item_id = movie_id)
GROUP BY
user_id
'''
= bigquery.QueryJobConfig()
job_config = "{PROJECT_ID}.{MODEL_DATASET}.recommend_m2_1m_top"
job_config.destination
= client.query(query, job_config)
query_job = query_job.result() results
= client.query(f"""
items_for_user_1_df SELECT *
FROM `{PROJECT_ID}.{MODEL_DATASET}.recommend_m2_1m_top`
WHERE user_id=1
""").to_dataframe()
0] items_for_user_1_df.f0_.loc[
array([{'movie_title': 'Song of Freedom (1936)', 'genre': 'Drama', 'predicted_rating': 7.1595457186221525},
{'movie_title': 'Faraway, So Close (In Weiter Ferne, So Nah!) (1993)', 'genre': 'Drama|Fantasy', 'predicted_rating': 6.931200454403858},
{'movie_title': 'Frenzy (1972)', 'genre': 'Thriller', 'predicted_rating': 6.748710552254239},
{'movie_title': 'Fear (1996)', 'genre': 'Thriller', 'predicted_rating': 6.609937143182611},
{'movie_title': 'House of the Spirits, The (1993)', 'genre': 'Drama|Romance', 'predicted_rating': 6.336837592527967}],
dtype=object)