datamatica - Building a Full-Text Search Engine for fastai YouTube channel

Part 1. Extracting transcriptions and creating SQLite’s searchable index

Introduction

In this post we are going to get the transcriptions of YouTube videos from one or more given Playlists. Here we are going to do it for fastai channel, but it can be done for any given list of playlists (if the videos have transcriptions).

After we get the transcriptions, we are going to build a search engine with SQLite’s full-text search functionality provided by its FTS5 extension.

In Part 2 we are going to build and share the search engine as a Streamlit web app, just like this one: Full-Text Search Engine for fastai Youtube Chanel

References

If you want to get deeper, I encourage you to read these articles:

Get YouTube Transcriptions

Install and Import Libraries

We need to first install the libraries we need (pytube and youtube-transcript-api).
We can use pip:

$ pip install pytube
$ pip install youtube_transcript_api

Or conda:

$ conda install -c conda-forge pytube
$ conda install -c conda-forge youtube-transcript-api

from youtube_transcript_api import YouTubeTranscriptApi
from pytube import YouTube, Playlist

import sqlite3

YouTube Playlists

Let’s create a list of YouTube playlist ids. We can get them browsing YouTube playlist. The id is in the url which has the following format:
https://www.youtube.com/playlist?list={PLAYLIST_ID}

base_pl = 'https://www.youtube.com/playlist?list='
base_yt = 'https://youtu.be/'

yt_pl_ids = [
    'PLfYUBJiXbdtSgU6S_3l6pX-4hQYKNJZFU', # fast.ai APL Study Group #2022
    'PLfYUBJiXbdtSvpQjSnJJ_PmDQB_VyT5iU', # Practical Deep Learning for Coders 2022
    'PLfYUBJiXbdtSLBPJ1GMx-sQWf6iNhb8mM', # fast.ai live coding & tutorials #2022
    'PLfYUBJiXbdtRL3FMB3GoWHRI8ieU6FhfM', # Practical Deep Learning for Coders (2020)
    'PLfYUBJiXbdtTIdtE1U8qgyxo4Jy2Y91uj', # Deep Learning from the Foundations #2019
    'PLfYUBJiXbdtSWRCYUHh-ThVCC39bp5yiq', # fastai v2 code walk-thrus #2019
    'PLfYUBJiXbdtSIJb-Qd3pw0cqCbkGeS0xn', # Practical Deep Learning for Coders 2019
    'PLfYUBJiXbdtSyktd8A_x0JNd6lxDcZE96', # Introduction to Machine Learning for Coders
    'PLfYUBJiXbdtTttBGq-u2zeY1OTjs5e-Ia', # Cutting Edge Deep Learning for Coders 2 #2018
    'PLfYUBJiXbdtS2UQRzyrxmyVHoGW0gmLSM', # Practical Deep Learning For Coders 2018
]

Get Transcriptions

Let’s explore the methods:

playlist = Playlist('https://www.youtube.com/playlist?list=PLfYUBJiXbdtSvpQjSnJJ_PmDQB_VyT5iU')
print(playlist.title)
video = YouTube(playlist[0])
print(video.title)
print(playlist[0])
video_id = playlist[0].split('=')[1]
script = YouTubeTranscriptApi.get_transcript(video_id, languages=('en',))
print(script[0])

Practical Deep Learning for Coders 2022
Lesson 1: Practical Deep Learning for Coders 2022
https://www.youtube.com/watch?v=8SF_h3xF3cE
{'text': 'Welcome to Practical Deep Learning for coders,\xa0\nlesson one. This is version five of this course,\xa0\xa0', 'start': 2.0, 'duration': 8.0}

Download all transcriptions

Now we are going to download all the transcriptions. Let’s create three dictionaries to store the data: - playlists to store each playlist as {playlist_id: playlist_name} - videos to store videos as {video_id: video_name} - database to store all captions as {playlist_id: {video_id: {'start': caption}}.

playlists = dict()
videos = dict()
database = dict()

for pl_id in yt_pl_ids:
    playlist = Playlist(base_pl + pl_id)
    print(playlist.title)
    playlists[pl_id] = playlist.title
    database[pl_id] = dict()

    for video in playlist:
        video_id = video.split("=")[1]
        videos[video_id] = YouTube(video).title
        database[pl_id][video_id] = dict()
        # Manually created transcripts are returned first
        script = YouTubeTranscriptApi.get_transcript(video_id, languages=('en',))

        for txt in script:
            database[pl_id][video_id][txt['start']] = txt['text']

fast.ai APL Study Group
Practical Deep Learning for Coders 2022
fast.ai live coding & tutorials
Practical Deep Learning for Coders (2020)
Deep Learning from the Foundations
fastai v2 code walk-thrus
Practical Deep Learning for Coders 2019
Introduction to Machine Learning for Coders
Cutting Edge Deep Learning for Coders 2
Practical Deep Learning For Coders 2018

Building the Search Engine

Formatting the data to facilitate insertion into SQLite

# https://stackoverflow.com/a/60932565/10013187
records = [
    (level1, level2, level3, leaf)
    for level1, level2_dict in database.items()
    for level2, level3_dict in level2_dict.items()
    for level3, leaf in level3_dict.items()
]
print("(playlist_id, video_id, start, text)")
print(records[100])

(playlist_id, video_id, start, text)
('PLfYUBJiXbdtSgU6S_3l6pX-4hQYKNJZFU', 'CGpR2ILao5M', 294.18, 'gonna go watch them or anything all')

Creating the database

db = sqlite3.connect('fastai_yt.db')
cur = db.cursor()

# virtual table configured to allow full-text search
cur.execute('DROP TABLE IF EXISTS transcriptions_fts;') 
cur.execute('CREATE VIRTUAL TABLE transcriptions_fts USING fts5(playlist_id, video_id, start, text, tokenize="porter unicode61");')

# dimension like tables
cur.execute('DROP TABLE IF EXISTS playlist;')
cur.execute('CREATE TABLE playlist (playlist_id, playlist_name);')
cur.execute('DROP TABLE IF EXISTS video;')
cur.execute('CREATE TABLE video (video_id, video_name);')

<sqlite3.Cursor>

# bulk index records
cur.executemany('INSERT INTO transcriptions_fts (playlist_id, video_id, start, text) values (?,?,?,?);', records)
cur.executemany('INSERT INTO playlist (playlist_id, playlist_name) values (?,?);', playlists.items())
cur.executemany('INSERT INTO video (video_id, video_name) values (?,?);', videos.items())
db.commit()

Example of a simple query:

cur.execute('SELECT start, text FROM transcriptions_fts WHERE video_id="8SF_h3xF3cE" LIMIT 5').fetchall()

[(2.0,
  'Welcome to Practical Deep Learning for coders,\xa0\nlesson one. This is version five of this course,\xa0\xa0'),
 (11.44,
  "and it's the first new one we've done\xa0\nin two years. So, we've got a lot of\xa0\xa0"),
 (15.12,
  "cool things to cover! It's amazing how much has\xa0\nchanged. Here is an xkcd from the end of 2015.\xa0\xa0"),
 (28.0,
  'Who here has seen xkcd comics before?\xa0\n…Pretty much everybody. Not surprising.\xa0\xa0'),
 (35.36,
  "So the basic joke here is… I'll let you\xa0\nread it, and then I'll come back to it.")]

fastai_yt.db. Once we have the database populated, we can use it in any application we want without the need to get the transcriptions from YouTube.

Search queries

def print_search_results(res):
    for each in res:
        print()
        print(playlists[each[0]], "->", videos[each[1]])
        print(f'"... {each[4]}..."')
        print('https://youtu.be/' + each[1] + "?t=" + str(int(each[2])))

def get_query(q, limit):
    search_in = 'text'
    if 'text:' in q: search_in = 'transcriptions_fts'
    query = f"""
    SELECT *, HIGHLIGHT(transcriptions_fts, 3, '[', ']')
    FROM transcriptions_fts WHERE {search_in} MATCH '{q}' ORDER BY rank
    LIMIT "{limit}" 
    """
    print(query)
    return query

Search for a word

q = 'fastc*'
res = cur.execute(get_query(q, limit=5)).fetchall()
print_search_results(res)


    SELECT *, HIGHLIGHT(transcriptions_fts, 3, '[', ']')
    FROM transcriptions_fts WHERE text MATCH 'fastc*' ORDER BY rank
    LIMIT "5" 
    

fast.ai live coding & tutorials -> Live coding 3
"... going to install python and [fastcore]..."
https://youtu.be/B6BQiIgiEks?t=820

fast.ai live coding & tutorials -> Live coding 3
"... but for a library like [fastcore]..."
https://youtu.be/B6BQiIgiEks?t=2818

fast.ai live coding & tutorials -> Live coding 3
"... use the latest version of [fastcore]..."
https://youtu.be/B6BQiIgiEks?t=2975

fast.ai live coding & tutorials -> Live coding 3
"... no module named [fastcore] is actually..."
https://youtu.be/B6BQiIgiEks?t=3617

fast.ai live coding & tutorials -> Live coding 2
"... fastgen so [fastchan] is a channel that..."
https://youtu.be/0pWjZByJ3Lk?t=3720

q = 'deleg*'
res = cur.execute(get_query(q, limit=5)).fetchall()
print_search_results(res)


    SELECT *, HIGHLIGHT(transcriptions_fts, 3, '[', ']')
    FROM transcriptions_fts WHERE text MATCH 'deleg*' ORDER BY rank
    LIMIT "5" 
    

fastai v2 code walk-thrus -> fastai v2 walk-thru #9
"... [delegated] down to that so [delegates] down..."
https://youtu.be/bBqFVBpOZoY?t=2462

Deep Learning from the Foundations -> Lesson 9 (2019) - How to train your model
"... [delegate] get attribute to the other..."
https://youtu.be/AcA8HAYh7IE?t=6435

fastai v2 code walk-thrus -> fastai v2 walk-thru #9
"... [delegate] everything Sodor in Python..."
https://youtu.be/bBqFVBpOZoY?t=2304

Deep Learning from the Foundations -> Lesson 13 (2019) - Basics of Swift for Deep Learning
"... default [delegates] is probably going to..."
https://youtu.be/3TqN_M1L4ts?t=6750

fastai v2 code walk-thrus -> fastai v2 walk-thru #2
"... this [delegates] decorator and what the..."
https://youtu.be/yEe5ZUMLEys?t=4756

Faceted Search

We can limit the search to specific playlists in a faceted like search.

playlists

{'PLfYUBJiXbdtSgU6S_3l6pX-4hQYKNJZFU': 'fast.ai APL Study Group',
 'PLfYUBJiXbdtSvpQjSnJJ_PmDQB_VyT5iU': 'Practical Deep Learning for Coders 2022',
 'PLfYUBJiXbdtSLBPJ1GMx-sQWf6iNhb8mM': 'fast.ai live coding & tutorials',
 'PLfYUBJiXbdtRL3FMB3GoWHRI8ieU6FhfM': 'Practical Deep Learning for Coders (2020)',
 'PLfYUBJiXbdtTIdtE1U8qgyxo4Jy2Y91uj': 'Deep Learning from the Foundations',
 'PLfYUBJiXbdtSWRCYUHh-ThVCC39bp5yiq': 'fastai v2 code walk-thrus',
 'PLfYUBJiXbdtSIJb-Qd3pw0cqCbkGeS0xn': 'Practical Deep Learning for Coders 2019',
 'PLfYUBJiXbdtSyktd8A_x0JNd6lxDcZE96': 'Introduction to Machine Learning for Coders',
 'PLfYUBJiXbdtTttBGq-u2zeY1OTjs5e-Ia': 'Cutting Edge Deep Learning for Coders 2',
 'PLfYUBJiXbdtS2UQRzyrxmyVHoGW0gmLSM': 'Practical Deep Learning For Coders 2018'}

pl_lst = list(playlists.keys())

# Search in playlist 'Practical Deep Learning for Coders 2022' or
# 'fast.ai live coding & tutorials'
q = f"""
(text: fastcore OR paral*) AND 
(playlist_id: "{pl_lst[1]}" OR "{pl_lst[2]}")
"""
res = cur.execute(get_query(q, limit=10)).fetchall()

print_search_results(res)


    SELECT *, HIGHLIGHT(transcriptions_fts, 3, '[', ']')
    FROM transcriptions_fts WHERE transcriptions_fts MATCH '
(text: fastcore OR paral*) AND 
(playlist_id: "PLfYUBJiXbdtSvpQjSnJJ_PmDQB_VyT5iU" OR "PLfYUBJiXbdtSLBPJ1GMx-sQWf6iNhb8mM")
' ORDER BY rank
    LIMIT "10" 
    

Practical Deep Learning for Coders 2022 -> Lesson 6: Practical Deep Learning for Coders 2022
"... but my [fastcore] library has a [parallel] sub module 
which can basically do anything that you can do  ..."
https://youtu.be/AdhG64NF76E?t=3799

fast.ai live coding & tutorials -> Live coding 3
"... going to install python and [fastcore]..."
https://youtu.be/B6BQiIgiEks?t=820

fast.ai live coding & tutorials -> Live coding 3
"... but for a library like [fastcore]..."
https://youtu.be/B6BQiIgiEks?t=2818

fast.ai live coding & tutorials -> Live coding 3
"... use the latest version of [fastcore]..."
https://youtu.be/B6BQiIgiEks?t=2975

fast.ai live coding & tutorials -> Live coding 3
"... no module named [fastcore] is actually..."
https://youtu.be/B6BQiIgiEks?t=3617

fast.ai live coding & tutorials -> Live coding 8
"... [parallel]..."
https://youtu.be/-Scs4gbwWXg?t=1155

fast.ai live coding & tutorials -> Live coding 8
"... [parallel]..."
https://youtu.be/-Scs4gbwWXg?t=1160

fast.ai live coding & tutorials -> Live coding 3
"... and import [fastcore] it can't find it..."
https://youtu.be/B6BQiIgiEks?t=3401

fast.ai live coding & tutorials -> Live coding 8
"... for [parallel]..."
https://youtu.be/-Scs4gbwWXg?t=1049

fast.ai live coding & tutorials -> Live coding 15
"... somewhat in [parallel]..."
https://youtu.be/6JGoes9_bPs?t=5589

Conclusions

We used youtube-transcript-api and pytube Python libraries to extract YouTube captions based on the given playlists.
We indexed the captions using the capabilities of the ubiquitous SQLite and FTS5.
We did some powerful full-text search queries and simulated a faceted search.
We can go exactly to the video part the search is returning.
In Part 2 we are going to deploy an web app to Streamlit.