from youtube_transcript_api import YouTubeTranscriptApi
from pytube import YouTube, Playlist
import sqlite3
Part 1. Extracting transcriptions and creating SQLite’s searchable index
Introduction
In this post we are going to get the transcriptions of YouTube videos from one or more playlists. Here we do it for the fastai channel, but the same approach works for any list of playlists, provided the videos have transcriptions.
After we get the transcriptions, we are going to build a search engine with SQLite’s full-text search functionality provided by its FTS5 extension.
In Part 2 we are going to build and share the search engine as a Streamlit web app, just like this one: Full-Text Search Engine for fastai YouTube Channel
Get YouTube Transcriptions
Install and Import Libraries
First we need to install the required libraries, pytube and youtube-transcript-api.
We can use pip:
$ pip install pytube
$ pip install youtube_transcript_api
Or conda:
$ conda install -c conda-forge pytube
$ conda install -c conda-forge youtube-transcript-api
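The sqlite3 module ships with Python, but whether FTS5 is available depends on how the underlying SQLite library was compiled. The following is a minimal sketch of a check we could run before going further; the helper name fts5_available is hypothetical and not part of any library.

```python
import sqlite3


def fts5_available() -> bool:
    """Return True if this Python's SQLite library can create FTS5 tables."""
    con = sqlite3.connect(":memory:")
    try:
        # Creating a throwaway FTS5 table fails with OperationalError
        # when the extension is not compiled in.
        con.execute("CREATE VIRTUAL TABLE probe USING fts5(body)")
        return True
    except sqlite3.OperationalError:
        return False
    finally:
        con.close()


print("SQLite version:", sqlite3.sqlite_version)
print("FTS5 available:", fts5_available())
```

If the check prints False, installing Python from a different distribution (for example through conda) is one way that may provide a SQLite build with FTS5.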
YouTube Playlists
Let’s create a list of YouTube playlist ids. We can get them by browsing to a playlist on YouTube: the id is in the URL, which has the following format:
https://www.youtube.com/playlist?list={PLAYLIST_ID}
base_pl = 'https://www.youtube.com/playlist?list='
base_yt = 'https://youtu.be/'

yt_pl_ids = [
    'PLfYUBJiXbdtSgU6S_3l6pX-4hQYKNJZFU', # fast.ai APL Study Group #2022
    'PLfYUBJiXbdtSvpQjSnJJ_PmDQB_VyT5iU', # Practical Deep Learning for Coders 2022
    'PLfYUBJiXbdtSLBPJ1GMx-sQWf6iNhb8mM', # fast.ai live coding & tutorials #2022
    'PLfYUBJiXbdtRL3FMB3GoWHRI8ieU6FhfM', # Practical Deep Learning for Coders (2020)
    'PLfYUBJiXbdtTIdtE1U8qgyxo4Jy2Y91uj', # Deep Learning from the Foundations #2019
    'PLfYUBJiXbdtSWRCYUHh-ThVCC39bp5yiq', # fastai v2 code walk-thrus #2019
    'PLfYUBJiXbdtSIJb-Qd3pw0cqCbkGeS0xn', # Practical Deep Learning for Coders 2019
    'PLfYUBJiXbdtSyktd8A_x0JNd6lxDcZE96', # Introduction to Machine Learning for Coders
    'PLfYUBJiXbdtTttBGq-u2zeY1OTjs5e-Ia', # Cutting Edge Deep Learning for Coders 2 #2018
    'PLfYUBJiXbdtS2UQRzyrxmyVHoGW0gmLSM', # Practical Deep Learning For Coders 2018
]
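If we would rather paste full playlist URLs than copy the ids by hand, the id can be read from the list query parameter. This is a small sketch using only the standard library; the function name playlist_id_from_url is hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def playlist_id_from_url(url: str) -> str:
    """Extract the value of the 'list' query parameter from a YouTube URL."""
    query = parse_qs(urlparse(url).query)
    if "list" not in query:
        raise ValueError(f"no 'list' parameter in {url!r}")
    return query["list"][0]


url = "https://www.youtube.com/playlist?list=PLfYUBJiXbdtSvpQjSnJJ_PmDQB_VyT5iU"
print(playlist_id_from_url(url))
```

The same function also works for a watch URL that carries a list parameter next to the video id.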
Get Transcriptions
Let’s explore the methods:
playlist = Playlist('https://www.youtube.com/playlist?list=PLfYUBJiXbdtSvpQjSnJJ_PmDQB_VyT5iU')
print(playlist.title)

video = YouTube(playlist[0])
print(video.title)
print(playlist[0])

video_id = playlist[0].split('=')[1]
script = YouTubeTranscriptApi.get_transcript(video_id, languages=('en',))
print(script[0])
Practical Deep Learning for Coders 2022
Lesson 1: Practical Deep Learning for Coders 2022
https://www.youtube.com/watch?v=8SF_h3xF3cE
{'text': 'Welcome to Practical Deep Learning for coders,\xa0\nlesson one. This is version five of this course,\xa0\xa0', 'start': 2.0, 'duration': 8.0}
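Two details in the exploration above are worth a closer look. Splitting the URL on '=' only works while the video id is the single query parameter, and the caption text contains non-breaking spaces (\xa0) and newlines. The sketch below shows one possible way to handle both with the standard library; video_id_from_url and clean_caption are hypothetical helper names, not part of pytube or youtube-transcript-api.

```python
from urllib.parse import urlparse, parse_qs


def video_id_from_url(url: str) -> str:
    """Return the video id from a watch URL or a youtu.be short URL."""
    parsed = urlparse(url)
    if parsed.netloc.endswith("youtu.be"):
        # Short links carry the id in the path: https://youtu.be/<id>
        return parsed.path.lstrip("/")
    values = parse_qs(parsed.query).get("v")
    if not values:
        raise ValueError(f"no video id found in {url!r}")
    return values[0]


def clean_caption(text: str) -> str:
    """Collapse newlines, non-breaking spaces and repeated blanks into single spaces."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # which includes '\n' and '\xa0'.
    return " ".join(text.split())


raw = ('Welcome to Practical Deep Learning for coders,\xa0\nlesson one. '
       'This is version five of this course,\xa0\xa0')
print(video_id_from_url("https://www.youtube.com/watch?v=8SF_h3xF3cE&t=10s"))
print(clean_caption(raw))
```

Cleaning is optional for searching, since the tokenizer ignores whitespace anyway, but it makes the printed results easier to read.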
Download all transcriptions
Now we are going to download all the transcriptions. Let’s create three dictionaries to store the data:
- playlists to store each playlist as {playlist_id: playlist_name}
- videos to store videos as {video_id: video_name}
- database to store all captions as {playlist_id: {video_id: {start: caption}}}
playlists = dict()
videos = dict()
database = dict()

for pl_id in yt_pl_ids:
    playlist = Playlist(base_pl + pl_id)
    print(playlist.title)
    playlists[pl_id] = playlist.title
    database[pl_id] = dict()

    for video in playlist:
        video_id = video.split("=")[1]
        videos[video_id] = YouTube(video).title
        database[pl_id][video_id] = dict()
        # Manually created transcripts are returned first
        script = YouTubeTranscriptApi.get_transcript(video_id, languages=('en',))

        for txt in script:
            database[pl_id][video_id][txt['start']] = txt['text']
fast.ai APL Study Group
Practical Deep Learning for Coders 2022
fast.ai live coding & tutorials
Practical Deep Learning for Coders (2020)
Deep Learning from the Foundations
fastai v2 code walk-thrus
Practical Deep Learning for Coders 2019
Introduction to Machine Learning for Coders
Cutting Edge Deep Learning for Coders 2
Practical Deep Learning For Coders 2018
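The loop above stops at the first video whose transcript cannot be fetched. For playlists where some videos have no captions, one option is to skip those videos and keep a list of what was skipped. The sketch below is an assumption about how that could look: collect_captions is a hypothetical helper, and the fetch function is injected so the example runs without network access. In the real loop we would pass something like lambda vid: YouTubeTranscriptApi.get_transcript(vid, languages=('en',)).

```python
def collect_captions(video_ids, fetch):
    """Fetch captions for each video id, skipping the ones that fail.

    fetch(video_id) must return a list of dicts with 'start' and 'text' keys,
    the same shape as the transcript segments printed earlier in this post.
    """
    captions, skipped = {}, []
    for video_id in video_ids:
        try:
            segments = fetch(video_id)
        except Exception as exc:  # broad on purpose: any failure means "skip"
            skipped.append((video_id, repr(exc)))
            continue
        captions[video_id] = {seg["start"]: seg["text"] for seg in segments}
    return captions, skipped


def fake_fetch(video_id):
    """Stand-in for the real transcript call, used only for this demo."""
    if video_id == "no_captions":
        raise LookupError("no transcript for this video")
    return [{"text": "hello", "start": 0.0, "duration": 1.0}]


captions, skipped = collect_captions(["abc", "no_captions"], fake_fetch)
print(captions)
print(skipped)
```

Catching every exception is a deliberate simplification here; narrowing it to the library's own error types would be a reasonable refinement.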
Building the Search Engine
Formatting the data to facilitate insertion into SQLite
# https://stackoverflow.com/a/60932565/10013187
records = [
    (level1, level2, level3, leaf)
    for level1, level2_dict in database.items()
    for level2, level3_dict in level2_dict.items()
    for level3, leaf in level3_dict.items()
]
print("(playlist_id, video_id, start, text)")
print(records[100])
(playlist_id, video_id, start, text)
('PLfYUBJiXbdtSgU6S_3l6pX-4hQYKNJZFU', 'CGpR2ILao5M', 294.18, 'gonna go watch them or anything all')
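To see what the nested comprehension does, here is the same flattening applied to a tiny hand-made dictionary. The data and the function name flatten are made up for illustration.

```python
toy = {
    "playlist_a": {
        "video_1": {0.0: "hello", 2.5: "world"},
        "video_2": {1.0: "again"},
    },
    # A playlist whose only video has no captions contributes no rows.
    "playlist_b": {"video_3": {}},
}


def flatten(database):
    """Turn {playlist: {video: {start: text}}} into a list of 4-tuples."""
    return [
        (playlist_id, video_id, start, text)
        for playlist_id, videos_dict in database.items()
        for video_id, captions in videos_dict.items()
        for start, text in captions.items()
    ]


rows = flatten(toy)
for row in rows:
    print(row)
```

Each tuple has the same order as the columns of the table we create next, which is what lets executemany insert the whole list in one call.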
Creating the database
db = sqlite3.connect('fastai_yt.db')
cur = db.cursor()

# virtual table configured to allow full-text search
cur.execute('DROP TABLE IF EXISTS transcriptions_fts;')
cur.execute('CREATE VIRTUAL TABLE transcriptions_fts USING fts5(playlist_id, video_id, start, text, tokenize="porter unicode61");')

# dimension like tables
cur.execute('DROP TABLE IF EXISTS playlist;')
cur.execute('CREATE TABLE playlist (playlist_id, playlist_name);')
cur.execute('DROP TABLE IF EXISTS video;')
cur.execute('CREATE TABLE video (video_id, video_name);')
<sqlite3.Cursor>
# bulk index records
cur.executemany('INSERT INTO transcriptions_fts (playlist_id, video_id, start, text) values (?,?,?,?);', records)
cur.executemany('INSERT INTO playlist (playlist_id, playlist_name) values (?,?);', playlists.items())
cur.executemany('INSERT INTO video (video_id, video_name) values (?,?);', videos.items())
db.commit()
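The tokenize="porter unicode61" option wraps the default unicode61 tokenizer with a Porter stemmer, so different forms of a word are indexed under the same stem. The sketch below compares a stemmed and an unstemmed table on three made-up captions, using an in-memory database; the table names and the count_matches helper are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE stemmed USING fts5(text, tokenize='porter unicode61')")
con.execute("CREATE VIRTUAL TABLE plain USING fts5(text, tokenize='unicode61')")

captions = [
    ("he runs the notebook",),
    ("running the cell again",),
    ("a different sentence",),
]
con.executemany("INSERT INTO stemmed (text) VALUES (?)", captions)
con.executemany("INSERT INTO plain (text) VALUES (?)", captions)


def count_matches(table, term):
    """Count rows of `table` matching `term`.

    `table` is one of the two fixed names above, never user input,
    so formatting it into the SQL string is safe in this sketch.
    """
    sql = f"SELECT count(*) FROM {table} WHERE {table} MATCH ?"
    return con.execute(sql, (term,)).fetchone()[0]


stemmed_hits = count_matches("stemmed", "run")
plain_hits = count_matches("plain", "run")
print("with porter:", stemmed_hits)
print("without porter:", plain_hits)
```

With stemming, a search for run finds both runs and running; without it, only the exact token would match.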
Example of a simple query:
cur.execute('SELECT start, text FROM transcriptions_fts WHERE video_id="8SF_h3xF3cE" LIMIT 5').fetchall()
[(2.0,
'Welcome to Practical Deep Learning for coders,\xa0\nlesson one. This is version five of this course,\xa0\xa0'),
(11.44,
"and it's the first new one we've done\xa0\nin two years. So, we've got a lot of\xa0\xa0"),
(15.12,
"cool things to cover! It's amazing how much has\xa0\nchanged. Here is an xkcd from the end of 2015.\xa0\xa0"),
(28.0,
'Who here has seen xkcd comics before?\xa0\n…Pretty much everybody. Not surprising.\xa0\xa0'),
(35.36,
"So the basic joke here is… I'll let you\xa0\nread it, and then I'll come back to it.")]
The data is now stored in the file fastai_yt.db. Once the database is populated, we can use it in any application we want without fetching the transcriptions from YouTube again.
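As a rough illustration of that reuse, the sketch below writes a tiny FTS5 database to a temporary file, closes it, and then reopens it read-only the way a separate application might. The file name demo_yt.db and the sample row are made up; with the real data we would point the reader at fastai_yt.db instead.

```python
import pathlib
import sqlite3
import tempfile

# Hypothetical database file in a temporary directory.
path = pathlib.Path(tempfile.mkdtemp()) / "demo_yt.db"

# "Indexer" side: create and populate the database once.
writer = sqlite3.connect(str(path))
writer.execute(
    "CREATE VIRTUAL TABLE transcriptions_fts "
    "USING fts5(playlist_id, video_id, start, text, tokenize='porter unicode61')"
)
writer.execute(
    "INSERT INTO transcriptions_fts (playlist_id, video_id, start, text) VALUES (?,?,?,?)",
    ("pl1", "vid1", 2.0, "welcome to lesson one"),
)
writer.commit()
writer.close()

# "Application" side: open the same file read-only through a SQLite URI.
reader = sqlite3.connect(path.as_uri() + "?mode=ro", uri=True)
rows = reader.execute(
    "SELECT video_id, text FROM transcriptions_fts WHERE transcriptions_fts MATCH ?",
    ("lesson",),
).fetchall()
reader.close()
print(rows)
```

Opening with mode=ro is optional, but it guards against an application modifying the index by accident.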
Search queries
def print_search_results(res):
    for each in res:
        print()
        print(playlists[each[0]], "->", videos[each[1]])
        print(f'"... {each[4]}..."')
        print('https://youtu.be/' + each[1] + "?t=" + str(int(each[2])))

def get_query(q, limit):
    search_in = 'text'
    if 'text:' in q: search_in = 'transcriptions_fts'
    query = f"""
    SELECT *, HIGHLIGHT(transcriptions_fts, 3, '[', ']')
    FROM transcriptions_fts WHERE {search_in} MATCH '{q}' ORDER BY rank
    LIMIT "{limit}"
    """
    print(query)
    return query
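get_query builds the SQL by formatting the search string into it, which is convenient for printing but breaks on a search containing a single quote and is unsafe with untrusted input. A possible alternative, sketched below under the assumption that we always match against the table name and put any column filter inside the search string, is to bind the search string and the limit as parameters. The function name search and the sample rows are hypothetical.

```python
import sqlite3


def search(cur, q, limit=5):
    """Run an FTS5 query with bound parameters instead of string formatting."""
    sql = """
        SELECT playlist_id, video_id, start,
               highlight(transcriptions_fts, 3, '[', ']') AS snippet
        FROM transcriptions_fts
        WHERE transcriptions_fts MATCH ?
        ORDER BY rank
        LIMIT ?
    """
    return cur.execute(sql, (q, limit)).fetchall()


# Small in-memory table with the same columns as in this post.
db = sqlite3.connect(":memory:")
cur = db.cursor()
cur.execute(
    "CREATE VIRTUAL TABLE transcriptions_fts "
    "USING fts5(playlist_id, video_id, start, text, tokenize='porter unicode61')"
)
cur.executemany(
    "INSERT INTO transcriptions_fts (playlist_id, video_id, start, text) VALUES (?,?,?,?)",
    [
        ("pl1", "vidA", 820.0, "going to install python and fastcore"),
        ("pl1", "vidB", 12.5, "nothing relevant here"),
    ],
)

# 'text: ...' restricts the match to the text column.
rows = search(cur, "text: fastc*")
for row in rows:
    print(row)
```

Note that the column order of the result differs from SELECT *, so a printing helper written for get_query would need its indexes adjusted.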
Search for a word
q = 'fastc*'
res = cur.execute(get_query(q, limit=5)).fetchall()
print_search_results(res)
SELECT *, HIGHLIGHT(transcriptions_fts, 3, '[', ']')
FROM transcriptions_fts WHERE text MATCH 'fastc*' ORDER BY rank
LIMIT "5"
fast.ai live coding & tutorials -> Live coding 3
"... going to install python and [fastcore]..."
https://youtu.be/B6BQiIgiEks?t=820
fast.ai live coding & tutorials -> Live coding 3
"... but for a library like [fastcore]..."
https://youtu.be/B6BQiIgiEks?t=2818
fast.ai live coding & tutorials -> Live coding 3
"... use the latest version of [fastcore]..."
https://youtu.be/B6BQiIgiEks?t=2975
fast.ai live coding & tutorials -> Live coding 3
"... no module named [fastcore] is actually..."
https://youtu.be/B6BQiIgiEks?t=3617
fast.ai live coding & tutorials -> Live coding 2
"... fastgen so [fastchan] is a channel that..."
https://youtu.be/0pWjZByJ3Lk?t=3720
q = 'deleg*'
res = cur.execute(get_query(q, limit=5)).fetchall()
print_search_results(res)
SELECT *, HIGHLIGHT(transcriptions_fts, 3, '[', ']')
FROM transcriptions_fts WHERE text MATCH 'deleg*' ORDER BY rank
LIMIT "5"
fastai v2 code walk-thrus -> fastai v2 walk-thru #9
"... [delegated] down to that so [delegates] down..."
https://youtu.be/bBqFVBpOZoY?t=2462
Deep Learning from the Foundations -> Lesson 9 (2019) - How to train your model
"... [delegate] get attribute to the other..."
https://youtu.be/AcA8HAYh7IE?t=6435
fastai v2 code walk-thrus -> fastai v2 walk-thru #9
"... [delegate] everything Sodor in Python..."
https://youtu.be/bBqFVBpOZoY?t=2304
Deep Learning from the Foundations -> Lesson 13 (2019) - Basics of Swift for Deep Learning
"... default [delegates] is probably going to..."
https://youtu.be/3TqN_M1L4ts?t=6750
fastai v2 code walk-thrus -> fastai v2 walk-thru #2
"... this [delegates] decorator and what the..."
https://youtu.be/yEe5ZUMLEys?t=4756
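Each result ends with a link that opens the video at the matching caption: the start time in seconds is truncated to an integer and passed as the t parameter. The sketch below isolates that step and adds a human-readable timestamp; both function names are hypothetical.

```python
def timestamp_url(video_id: str, start: float) -> str:
    """Build a youtu.be link that starts playback at `start` seconds."""
    return f"https://youtu.be/{video_id}?t={int(start)}"


def format_timestamp(start: float) -> str:
    """Format seconds as m:ss, or h:mm:ss for videos longer than an hour."""
    hours, remainder = divmod(int(start), 3600)
    minutes, seconds = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{seconds:02d}"
    return f"{minutes}:{seconds:02d}"


print(timestamp_url("B6BQiIgiEks", 820.64), format_timestamp(820.64))
print(timestamp_url("AcA8HAYh7IE", 6435.2), format_timestamp(6435.2))
```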
Faceted Search
We can limit the search to specific playlists, in the style of a faceted search.
playlists
{'PLfYUBJiXbdtSgU6S_3l6pX-4hQYKNJZFU': 'fast.ai APL Study Group',
'PLfYUBJiXbdtSvpQjSnJJ_PmDQB_VyT5iU': 'Practical Deep Learning for Coders 2022',
'PLfYUBJiXbdtSLBPJ1GMx-sQWf6iNhb8mM': 'fast.ai live coding & tutorials',
'PLfYUBJiXbdtRL3FMB3GoWHRI8ieU6FhfM': 'Practical Deep Learning for Coders (2020)',
'PLfYUBJiXbdtTIdtE1U8qgyxo4Jy2Y91uj': 'Deep Learning from the Foundations',
'PLfYUBJiXbdtSWRCYUHh-ThVCC39bp5yiq': 'fastai v2 code walk-thrus',
'PLfYUBJiXbdtSIJb-Qd3pw0cqCbkGeS0xn': 'Practical Deep Learning for Coders 2019',
'PLfYUBJiXbdtSyktd8A_x0JNd6lxDcZE96': 'Introduction to Machine Learning for Coders',
'PLfYUBJiXbdtTttBGq-u2zeY1OTjs5e-Ia': 'Cutting Edge Deep Learning for Coders 2',
'PLfYUBJiXbdtS2UQRzyrxmyVHoGW0gmLSM': 'Practical Deep Learning For Coders 2018'}
pl_lst = list(playlists.keys())

# Search in playlist 'Practical Deep Learning for Coders 2022' or
# 'fast.ai live coding & tutorials'
q = f"""
(text: fastcore OR paral*) AND
(playlist_id: "{pl_lst[1]}" OR "{pl_lst[2]}")
"""
res = cur.execute(get_query(q, limit=10)).fetchall()

print_search_results(res)
SELECT *, HIGHLIGHT(transcriptions_fts, 3, '[', ']')
FROM transcriptions_fts WHERE transcriptions_fts MATCH '
(text: fastcore OR paral*) AND
(playlist_id: "PLfYUBJiXbdtSvpQjSnJJ_PmDQB_VyT5iU" OR "PLfYUBJiXbdtSLBPJ1GMx-sQWf6iNhb8mM")
' ORDER BY rank
LIMIT "10"
Practical Deep Learning for Coders 2022 -> Lesson 6: Practical Deep Learning for Coders 2022
"... but my [fastcore] library has a [parallel] sub module
which can basically do anything that you can do ..."
https://youtu.be/AdhG64NF76E?t=3799
fast.ai live coding & tutorials -> Live coding 3
"... going to install python and [fastcore]..."
https://youtu.be/B6BQiIgiEks?t=820
fast.ai live coding & tutorials -> Live coding 3
"... but for a library like [fastcore]..."
https://youtu.be/B6BQiIgiEks?t=2818
fast.ai live coding & tutorials -> Live coding 3
"... use the latest version of [fastcore]..."
https://youtu.be/B6BQiIgiEks?t=2975
fast.ai live coding & tutorials -> Live coding 3
"... no module named [fastcore] is actually..."
https://youtu.be/B6BQiIgiEks?t=3617
fast.ai live coding & tutorials -> Live coding 8
"... [parallel]..."
https://youtu.be/-Scs4gbwWXg?t=1155
fast.ai live coding & tutorials -> Live coding 8
"... [parallel]..."
https://youtu.be/-Scs4gbwWXg?t=1160
fast.ai live coding & tutorials -> Live coding 3
"... and import [fastcore] it can't find it..."
https://youtu.be/B6BQiIgiEks?t=3401
fast.ai live coding & tutorials -> Live coding 8
"... for [parallel]..."
https://youtu.be/-Scs4gbwWXg?t=1049
fast.ai live coding & tutorials -> Live coding 15
"... somewhat in [parallel]..."
https://youtu.be/6JGoes9_bPs?t=5589
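FTS5 column filters can also be applied to a whole parenthesised expression, which makes it straightforward to build the playlist facet from a list of ids. The following sketch uses an in-memory table with made-up playlist ids (plA, plB, plC); facet_query is a hypothetical helper, and it assumes the search terms themselves are already valid FTS5 syntax.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE VIRTUAL TABLE transcriptions_fts "
    "USING fts5(playlist_id, video_id, start, text, tokenize='porter unicode61')"
)
con.executemany(
    "INSERT INTO transcriptions_fts (playlist_id, video_id, start, text) VALUES (?,?,?,?)",
    [
        ("plA", "v1", 10.0, "install fastcore first"),
        ("plB", "v2", 20.0, "run the jobs in parallel"),
        ("plC", "v3", 30.0, "fastcore is used here too"),
        ("plA", "v4", 40.0, "unrelated caption"),
    ],
)


def facet_query(terms, playlist_ids):
    """Combine search terms with a playlist_id filter in FTS5 query syntax."""
    # Each id is double-quoted; embedded double quotes are doubled to escape them.
    quoted = " OR ".join('"' + p.replace('"', '""') + '"' for p in playlist_ids)
    return f"text : ({terms}) AND playlist_id : ({quoted})"


q = facet_query("fastcore OR paral*", ["plA", "plB"])
print(q)
rows = con.execute(
    "SELECT video_id FROM transcriptions_fts WHERE transcriptions_fts MATCH ? ORDER BY rank",
    (q,),
).fetchall()
print(rows)
```

The row from plC mentions fastcore but is filtered out, because its playlist id is not in the facet.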
Conclusions
- We used the youtube-transcript-api and pytube Python libraries to extract YouTube captions based on the given playlists.
- We indexed the captions using the capabilities of the ubiquitous SQLite and FTS5.
- We did some powerful full-text search queries and simulated a faceted search.
- Each result links to the exact moment in the video where the match occurs.
- In Part 2 we are going to deploy a web app to Streamlit.