from google.colab import auth
auth.authenticate_user()Overview
Part 1 was about creating a Vertex AI pipeline for training and deploying Matrix Factorization model using BigQuery ML.
We ended with a model we can use from BigQuery, but that is also available as a Cloud Run endpoint, deployed as a Docker container.
This notebook is better suited for running in Colab.
Authenticate
Get recommendations from deployed endpoint
As defined in part 1, the service was deployed as recommender-model.
SERVICE = "recommender-model"
REGION = 'us-central1'
PROJECT_ID = "[your-project-id]"To get the endpoint url, you can navigate to Cloud Run console (insert your Project ID):
https://console.cloud.google.com/run/detail/
Or you can run the following command:
service_url = ! gcloud run services describe {SERVICE} --platform managed --region {REGION} --format 'value(status.url)'
service_url = service_url[0] + "/v1/models/recommender_model:predict"
service_urlAnd the endpoint should look something like this:
https://recommender-model-<ENDPOINT_ID>-uc.a.run.app/v1/models/recommender_model:predict
Call the deployed model endpoint
To call the API simply use curl. In this case we get recommendations for the user_id equals 1:
! curl --header "Content-Type: application/json" \
--request POST \
--data '{"instances": [{"user_id": 1}]}' \
https://recommender-model-<ENDPOINT_ID>-uc.a.run.app/v1/models/recommender_model:predict{
"predictions": [
{
"predicted_rating": [7.1595457186221525, 6.931200454403859, 6.7487105522542379, 6.6099371431826093, 6.3368375925279663, 6.3162291916598949, 6.2034087300038294, 6.1908018877869777, 6.095065519595515, 6.0748385283824344, 6.0707896719918022, 6.0115152056191317, 5.9950883198035312, 5.9266960142259579, 5.8977393292303706, 5.8963543166167049, 5.8879213166109317, 5.8651216371567951, 5.8157813717342961, 5.7918400837633026, 5.7728366859651743, 5.7459233895973343, 5.7286289144902627, 5.7236862593334399, 5.6735891385114323, 5.6712837959564055, 5.6344651298840862, 5.6144777548802312, 5.6136065773986719, 5.5966244955484585, 5.5826125214031448, 5.5438993473737597, 5.5405328509132108, 5.5156535458862912, 5.5060371378140998, 5.4983916934699879, 5.4779312652757479, 5.4381644519356804, 5.436539915444734, 5.4274984607954586, 5.4162018042063238, 5.4131954911985822, 5.4112950178511134, 5.3977833272890878, 5.3924309392602101, 5.3921355023131454, 5.3795566101479775, 5.3784815662419794, 5.3724805166998593, 5.3716275159116167],
"predicted_item_id": ["3382", "3920", "2178", "662", "469", "199", "215", "2073", "881", "3132", "1900", "1615", "3141", "2928", "445", "973", "3605", "766", "935", "1841", "2330", "944", "2314", "3447", "2171", "3327", "2419", "2351", "557", "3720", "156", "2066", "2475", "2970", "26", "2360", "3018", "2272", "2506", "3670", "1631", "414", "2165", "1177", "2264", "743", "2725", "1934", "2710", "3275"]
}
]
}
Which return the top 50 items (movies) by predicted rating for the specified user.
After processing the output, we can get a dataframe like this (remember this are predictions for user_id=1):
endpoint_predictions_df.head(10)| predicted_rating | predicted_item_id | |
|---|---|---|
| 0 | 7.159546 | 3382 |
| 1 | 6.931200 | 3920 |
| 2 | 6.748711 | 2178 |
| 3 | 6.609937 | 662 |
| 4 | 6.336838 | 469 |
| 5 | 6.316229 | 199 |
| 6 | 6.203409 | 215 |
| 7 | 6.190802 | 2073 |
| 8 | 6.095066 | 881 |
| 9 | 6.074839 | 3132 |
BigQuery
Authenticate to BigQuery
from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID)Set the same parameters as in Part 1
MODEL_DATASET = "[bq-model-dataset]" # @param {type:"string"}
MOVIELENS_DATASET = "[bq-data-dataset]" # @param {type:"string"}Query the data tables
movies_df = client.query(f'''
SELECT *
FROM `{MOVIELENS_DATASET}.movie_titles`
''').to_dataframe()
movies_df.head()| movie_id | movie_title | genre | |
|---|---|---|---|
| 0 | 632 | Land and Freedom (Tierra y libertad) (1995) | War |
| 1 | 665 | Underground (1995) | War |
| 2 | 760 | Stalingrad (1993) | War |
| 3 | 777 | Pharaoh's Army (1995) | War |
| 4 | 1450 | Prisoner of the Mountains (Kavkazsky Plennik) ... | War |
user_ratings_df = client.query(f'''
SELECT t1.user_id, t1.rating, t1.item_id, t2.movie_title, t2.genre
FROM `{MOVIELENS_DATASET}.movielens_1m` t1
LEFT JOIN {MOVIELENS_DATASET}.movie_titles t2
ON t1.item_id=t2.movie_id
--WHERE t1.user_id=1
ORDER BY rating DESC
''').to_dataframe()
user_ratings_df.head()| user_id | rating | item_id | movie_title | genre | |
|---|---|---|---|---|---|
| 0 | 1 | 5.0 | 1193 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 1 | 1 | 5.0 | 2355 | Bug's Life, A (1998) | Animation|Children's|Comedy |
| 2 | 1 | 5.0 | 1287 | Ben-Hur (1959) | Action|Adventure|Drama |
| 3 | 1 | 5.0 | 2804 | Christmas Story, A (1983) | Comedy|Drama |
| 4 | 1 | 5.0 | 595 | Beauty and the Beast (1991) | Animation|Children's|Musical |
Information about model(s) created
models = client.list_models(MODEL_DATASET)
print("Models contained in '{}':".format(MODEL_DATASET))
models_fmid = []
for model in models:
full_model_id = f"{model.project}.{model.dataset_id}.{model.model_id}"
models_fmid.append(full_model_id)
print(f"{full_model_id}")Models contained in 'bqml_modelos':
pythonapi-205723.bqml_modelos.mf_model_pipe_01
pythonapi-205723.bqml_modelos.mf_model_pipe_02
pythonapi-205723.bqml_modelos.mf_model_pipe_03
As you can see, I created three models in bqml_modelos dataset.
client.get_model(models_fmid[0])Model(reference=ModelReference(project_id='pythonapi-205723', dataset_id='bqml_modelos', model_id='mf_model_pipe_01'))
model.model_type'MATRIX_FACTORIZATION'
Training statistics
client.query(f'''
SELECT *
FROM ML.TRAINING_INFO(MODEL `{models_fmid[2]}`)
''').to_dataframe()| training_run | iteration | loss | eval_loss | duration_ms | |
|---|---|---|---|---|---|
| 0 | 0 | 15 | 0.259132 | NaN | 64410 |
| 1 | 0 | 14 | 0.261374 | NaN | 107980 |
| 2 | 0 | 13 | 0.265850 | NaN | 60169 |
| 3 | 0 | 12 | 0.268827 | NaN | 29616 |
| 4 | 0 | 11 | 0.274483 | NaN | 57306 |
| 5 | 0 | 10 | 0.278608 | NaN | 117195 |
| 6 | 0 | 9 | 0.286101 | NaN | 59622 |
| 7 | 0 | 8 | 0.292234 | NaN | 115613 |
| 8 | 0 | 7 | 0.302932 | NaN | 62337 |
| 9 | 0 | 6 | 0.313072 | NaN | 101301 |
| 10 | 0 | 5 | 0.330243 | NaN | 131648 |
| 11 | 0 | 4 | 0.350469 | NaN | 118377 |
| 12 | 0 | 3 | 0.385341 | NaN | 55254 |
| 13 | 0 | 2 | 0.449968 | NaN | 103391 |
| 14 | 0 | 1 | 0.581976 | NaN | 64919 |
| 15 | 0 | 0 | 4.047567 | NaN | 168618 |
Evaluation
client.query(f'''
SELECT *
FROM ML.EVALUATE(MODEL `{models_fmid[2]}`,
(
SELECT
user_id,
item_id,
rating
FROM
{MOVIELENS_DATASET}.movielens_1m))
''').to_dataframe()| mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score | explained_variance | |
|---|---|---|---|---|---|---|
| 0 | 0.382252 | 0.259132 | 0.016406 | 0.293064 | 0.792348 | 0.792348 |
Rating predictions
Lets get predicted ratings for all user-item pairs.
This query generates a predicted rating for every user-item pair. As a result, this can be a rather large output. It is recommended that the output is saved in a table which is then used in other queries for analysis.
Lets create one table to store all predictions called recommend_m2_1m and another one with top predictions by user called recommend_m2_1m_top.
recommend_m2_1m_df = client.query(f'''
SELECT *
FROM ML.RECOMMEND(MODEL bqml_modelos.mf_model_pipe_03)
''').to_dataframe()recommend_m2_1m_df| predicted_rating | user_id | item_id | |
|---|---|---|---|
| 0 | 3.489891 | 3575 | 1376 |
| 1 | 6.627501 | 2213 | 1362 |
| 2 | 2.694415 | 5876 | 1330 |
| 3 | -1.662953 | 4817 | 3042 |
| 4 | 2.208275 | 5448 | 563 |
| ... | ... | ... | ... |
| 22384235 | 3.183006 | 933 | 2282 |
| 22384236 | 1.201612 | 5843 | 391 |
| 22384237 | 2.886981 | 4321 | 3525 |
| 22384238 | 2.488141 | 1000 | 2881 |
| 22384239 | 2.805384 | 5148 | 1049 |
22384240 rows × 3 columns
print(f"Number of users: {len(user_ratings_df.user_id.unique()):,}")
print(f"Number of movies: {len(user_ratings_df.item_id.unique()):,}")
assert len(user_ratings_df.user_id.unique()) * len(user_ratings_df.item_id.unique()) == len(recommend_m2_1m_df)Number of users: 6,040
Number of movies: 3,706
As we can see, the full table has 22,384,240 rows, equals to 6,040 users x 3,706 movies.
Compare to endpoint results
recommend_m2_1m_df[recommend_m2_1m_df.user_id==1].sort_values(by='predicted_rating', ascending=False).head(10)| predicted_rating | user_id | item_id | |
|---|---|---|---|
| 50602 | 7.159546 | 1 | 3382 |
| 22190133 | 6.931200 | 1 | 3920 |
| 5356399 | 6.748711 | 1 | 2178 |
| 15157835 | 6.609937 | 1 | 662 |
| 4687764 | 6.336838 | 1 | 469 |
| 17892355 | 6.316229 | 1 | 199 |
| 1365223 | 6.203409 | 1 | 215 |
| 16742143 | 6.190802 | 1 | 2073 |
| 8094661 | 6.095066 | 1 | 881 |
| 21623544 | 6.074839 | 1 | 3132 |
endpoint_predictions_df.head(10)| predicted_rating | predicted_item_id | |
|---|---|---|
| 0 | 7.159546 | 3382 |
| 1 | 6.931200 | 3920 |
| 2 | 6.748711 | 2178 |
| 3 | 6.609937 | 662 |
| 4 | 6.336838 | 469 |
| 5 | 6.316229 | 199 |
| 6 | 6.203409 | 215 |
| 7 | 6.190802 | 2073 |
| 8 | 6.095066 | 881 |
| 9 | 6.074839 | 3132 |
For user_id=1 and item_id=3382, the predicted rating is 7.159546.
Latent Factors (Embeddings) and Bias
The matrix factorization model trained in BigQuery ML returns a predicted rating as the following equation:
predicted = Global Bias + DotProduct(user_factors, item_factors) + user_bias + item_bias
Where Global Bias is a constant for all the pairs, user_factors and item_factors are vectors with 60 dimensions (as definied in the training process) and user_bias and item_bias are a distinct number for each user and for each item.
Global Bias
global_bias = client.query(f"""
SELECT intercept
FROM ML.WEIGHTS(model {MODEL_DATASET}.mf_model_pipe_03)
WHERE feature = 'global__INTERCEPT__'
""").to_dataframe()global_bias_np = global_bias.loc[0].values
global_bias_nparray([3.58156445])
Movie Embeddings
item_factors_df = client.query(f"""
SELECT feature AS item_id, intercept AS bias, fw.factor AS factor, fw.weight
FROM ML.WEIGHTS(model {MODEL_DATASET}.mf_model_pipe_03),
UNNEST(factor_weights) AS fw
WHERE processed_input = 'item_id'
""").to_dataframe()item_factors_df| item_id | bias | factor | weight | |
|---|---|---|---|---|
| 0 | 2 | -0.393036 | 60 | -0.113832 |
| 1 | 2 | -0.393036 | 59 | 0.048548 |
| 2 | 2 | -0.393036 | 58 | -0.112024 |
| 3 | 2 | -0.393036 | 57 | 0.118971 |
| 4 | 2 | -0.393036 | 56 | -0.050817 |
| ... | ... | ... | ... | ... |
| 222355 | 3833 | -1.592288 | 5 | 0.123739 |
| 222356 | 3833 | -1.592288 | 4 | 0.234778 |
| 222357 | 3833 | -1.592288 | 3 | -0.296197 |
| 222358 | 3833 | -1.592288 | 2 | 0.502392 |
| 222359 | 3833 | -1.592288 | 1 | -0.232240 |
222360 rows × 4 columns
User Embeddings
user_factors_df = client.query("""
SELECT feature AS user_id, intercept AS bias, fw.factor AS factor, fw.weight
FROM ML.WEIGHTS(model bqml_modelos.mf_model_pipe_03),
UNNEST(factor_weights) AS fw
WHERE processed_input = 'user_id'
""").to_dataframe()user_factors_df| user_id | bias | factor | weight | |
|---|---|---|---|---|
| 0 | 256 | 0.255283 | 60 | 0.713645 |
| 1 | 256 | 0.255283 | 59 | 1.387058 |
| 2 | 256 | 0.255283 | 58 | -0.567324 |
| 3 | 256 | 0.255283 | 57 | -0.709545 |
| 4 | 256 | 0.255283 | 56 | -1.813568 |
| ... | ... | ... | ... | ... |
| 362395 | 5887 | -0.408196 | 5 | 1.493551 |
| 362396 | 5887 | -0.408196 | 4 | 0.960923 |
| 362397 | 5887 | -0.408196 | 3 | -2.272016 |
| 362398 | 5887 | -0.408196 | 2 | 0.393577 |
| 362399 | 5887 | -0.408196 | 1 | -2.099104 |
362400 rows × 4 columns
Do the math for a prediction
We already saw that for user_id=1 and item_id=3382, the predicted rating is 7.159546. Let’s do the math:
user_1_factors = user_factors_df[user_factors_df.user_id=='1'].weight.to_numpy()
user_1_bias = user_factors_df[user_factors_df.user_id=='1'].bias.to_numpy()[:1]
user_1_bias, user_1_factors[:5], user_1_factors.shape(array([0.04259326]),
array([ 0.72214304, -1.06488835, -0.35848901, -0.08184539, -0.34282946]),
(60,))
item_3382_factors = item_factors_df[item_factors_df.item_id=="3382"].weight.to_numpy()
item_3382_bias = item_factors_df[item_factors_df.item_id=="3382"].bias.to_numpy()[:1]
item_3382_bias, item_3382_factors[:5], item_3382_factors.shape(array([3.535388]),
array([ 1.38777878e-17, 3.03576608e-18, -2.81892565e-18, 1.73472348e-18,
-5.02894452e-17]),
(60,))
import numpy as np
global_bias_np + np.dot(user_1_factors, item_3382_factors) + user_1_bias + item_3382_biasarray([7.15954572])
There it is, same value for the predicted rating.
Full Pair Predictions
# Change type
user_factors_df[['user_id', 'factor']] = user_factors_df[['user_id', 'factor']].astype('int')
# Bias for each user
user_bias_df = user_factors_df.groupby('user_id')['bias'].max().reset_index()# Change type
item_factors_df[['item_id', 'factor']] = item_factors_df[['item_id', 'factor']].astype('int')
# Bias for each movie
item_bias_df = item_factors_df.groupby('item_id')['bias'].max().reset_index()Convert to numpy
# bias
user_bias_np = user_bias_df.sort_values("user_id", ascending=True).bias.to_numpy()
# embeddings
user_factors_np = user_factors_df.sort_values(
["user_id", "factor"], ascending=[True, False]
).weight.to_numpy().reshape((6_040, 60))
user_bias_np.shape, user_factors_np.shape((6040,), (6040, 60))
# bias
item_bias_np = item_bias_df.sort_values("item_id", ascending=True).bias.to_numpy()
# embeddings
item_factors_np = item_factors_df.sort_values(
["item_id", "factor"], ascending=[True, False]
).weight.to_numpy().reshape((3_706, 60))
item_bias_np.shape, item_factors_np.shape((3706,), (3706, 60))
Map user_id and item_id to numpy indices
idx2item_id = {idx: item_id for idx, item_id in zip(item_bias_df.index, item_bias_df.item_id)}User Item interaction
user_items_interactions = global_bias_np + np.dot(
user_factors_np, item_factors_np.T
) + user_bias_np[:,None] + item_bias_np[None,:]
user_items_interactions.shape(6040, 3706)
So, user_id=1 is the index 0 for the numpy array.
top_10_1 = user_items_interactions[0].argsort()[-10:][::-1]
[idx2item_id[idx] for idx in top_10_1][3382, 3920, 2178, 662, 469, 199, 215, 2073, 881, 3132]
endpoint_predictions_df.head(10)| predicted_rating | predicted_item_id | |
|---|---|---|
| 0 | 7.159546 | 3382 |
| 1 | 6.931200 | 3920 |
| 2 | 6.748711 | 2178 |
| 3 | 6.609937 | 662 |
| 4 | 6.336838 | 469 |
| 5 | 6.316229 | 199 |
| 6 | 6.203409 | 215 |
| 7 | 6.190802 | 2073 |
| 8 | 6.095066 | 881 |
| 9 | 6.074839 | 3132 |
Same Top item_id from the endpoint. We are doing fine.
Top 10 predictions for each pair
top_10_idxs = np.argsort(user_items_interactions, axis=1)[:,::-1][:,:10]
top_10_idxsarray([[3152, 3673, 1997, ..., 1892, 823, 2916],
[3258, 3191, 2758, ..., 447, 2719, 1917],
[1763, 668, 984, ..., 3152, 1917, 1860],
...,
[3152, 3504, 2122, ..., 2232, 1470, 543],
[ 986, 2964, 856, ..., 616, 2118, 2684],
[3237, 1298, 3579, ..., 2925, 2732, 1451]])
Top 10 ratings for user_id=1
user_items_interactions[0][top_10_idxs[0]]array([7.15954572, 6.93120045, 6.74871055, 6.60993714, 6.33683759,
6.31622919, 6.20340873, 6.19080189, 6.09506552, 6.07483853])
df = pd.DataFrame(top_10_idxs[0])
df.columns = ['movie_id']
df.movie_id = df.movie_id.map(idx2item_id)
df = df.merge(movies_df)
df['predicted_rating'] = user_items_interactions[0][top_10_idxs[0]]
df| movie_id | movie_title | genre | predicted_rating | |
|---|---|---|---|---|
| 0 | 3382 | Song of Freedom (1936) | Drama | 7.159546 |
| 1 | 3920 | Faraway, So Close (In Weiter Ferne, So Nah!) (... | Drama|Fantasy | 6.931200 |
| 2 | 2178 | Frenzy (1972) | Thriller | 6.748711 |
| 3 | 662 | Fear (1996) | Thriller | 6.609937 |
| 4 | 469 | House of the Spirits, The (1993) | Drama|Romance | 6.336838 |
| 5 | 199 | Umbrellas of Cherbourg, The (Parapluies de Che... | Drama|Musical | 6.316229 |
| 6 | 215 | Before Sunrise (1995) | Drama|Romance | 6.203409 |
| 7 | 2073 | Fandango (1985) | Comedy | 6.190802 |
| 8 | 881 | First Kid (1996) | Children's|Comedy | 6.095066 |
| 9 | 3132 | Daddy Long Legs (1919) | Comedy | 6.074839 |
Getting Top Movies for each User from BigQuery
Let’s create a table containing the Top 5 movies by user.
query = f'''SELECT
user_id,
ARRAY_AGG(STRUCT(movie_title, genre, predicted_rating)
ORDER BY predicted_rating DESC LIMIT 5)
FROM (
SELECT
user_id,
item_id,
predicted_rating,
movie_title,
genre
FROM
{MODEL_DATASET}.recommend_m2_1m
JOIN
movielens.movie_titles
ON
item_id = movie_id)
GROUP BY
user_id
'''
job_config = bigquery.QueryJobConfig()
job_config.destination = "{PROJECT_ID}.{MODEL_DATASET}.recommend_m2_1m_top"
query_job = client.query(query, job_config)
results = query_job.result()items_for_user_1_df = client.query(f"""
SELECT *
FROM `{PROJECT_ID}.{MODEL_DATASET}.recommend_m2_1m_top`
WHERE user_id=1
""").to_dataframe()
items_for_user_1_df.f0_.loc[0]array([{'movie_title': 'Song of Freedom (1936)', 'genre': 'Drama', 'predicted_rating': 7.1595457186221525},
{'movie_title': 'Faraway, So Close (In Weiter Ferne, So Nah!) (1993)', 'genre': 'Drama|Fantasy', 'predicted_rating': 6.931200454403858},
{'movie_title': 'Frenzy (1972)', 'genre': 'Thriller', 'predicted_rating': 6.748710552254239},
{'movie_title': 'Fear (1996)', 'genre': 'Thriller', 'predicted_rating': 6.609937143182611},
{'movie_title': 'House of the Spirits, The (1993)', 'genre': 'Drama|Romance', 'predicted_rating': 6.336837592527967}],
dtype=object)