Overview
The exploration I’m documenting here about dbt (Data Build Tool) is largely a recreation of the awesome tutorial Building a Kimball dimensional model with dbt, but with some twists.
I encourage you to read that tutorial for a discussion of what a dimensional model is, its benefits, and how it differs from other modeling techniques such as Third Normal Form (3NF) and One Big Table (OBT).
What I did differently from that tutorial:
- Worked with the AdventureWorks data already loaded into PostgreSQL instead of seeding it from .csv files. For that I created a backup that you can restore into your own Postgres database (e.g. with pg_restore). You can download it here.
- Experimented with three flavors of Postgres: local PostgreSQL, Supabase, and Amazon Aurora.
- Created custom schema names, following the video Deploy to custom schemas & override dbt defaults.
- Created two date dimension tables: one with calogica's dbt-date extension, which is a wrapper around dbt_utils.date_spine, and the other with date_spine itself. I managed to create the calendar dynamically, ranging from the first date in the fact table to the last one.
Setting up the project
It is not in the scope of this document to explain how to get started with dbt. I had never used dbt before, and getting started was straightforward with these two tutorials:
- Building a Kimball dimensional model with dbt
- Intro to Data Build Tool (dbt) // Create your first project!
You can explore the whole experiment I did in this repository: fmussari/dbt-dimensional-modeling-experiment
Defining the databases in profiles.yml
dbt is a transformation tool, so it works with one database at a time: the same database is both the source and the target. In this case we have the normalized AdventureWorks database, and we are creating the transformations that turn it into a model of dimension and fact tables. I found it interesting that multiple databases can be defined in a project, so you can test, for example, against your local database and then create the models in a cloud database.
For storing the credentials I used conda environment variables (set with conda env config vars set, e.g. conda env config vars set SUPABASE_HOST=...).
The profiles.yml ended up looking like this:
profiles.yml
dbt_dimensional_demo:
  outputs:
    postgresAdventure:
      type: postgres
      threads: 4
      host: localhost
      port: 5432
      user: postgres
      pass: '10042076'
      database: Adventureworks
      schema: target # this is overridden in each model's .sql
    supabaseAdventure:
      type: postgres
      threads: 4
      host: "{{ env_var('SUPABASE_HOST') }}"
      port: 5432
      user: "{{ env_var('SUPABASE_USER') }}"
      pass: "{{ env_var('SUPABASE_PASS') }}"
      database: Adventureworks
      schema: t
    auroraAdventure:
      type: postgres
      threads: 4
      host: "{{ env_var('AURORA_HOST') }}"
      port: 5432
      user: "{{ env_var('AURORA_USER') }}"
      pass: "{{ env_var('AURORA_PASS') }}"
      database: Adventureworks
      schema: t
  target: supabaseAdventure

As you can see, the experiment can run against any database specified in the file just by changing the target, either in the file itself or on the command line with dbt run --target postgresAdventure.
Creating the Dimensional Model
This part is almost a copy of the aforementioned tutorial, the only difference being that, based on the video Deploy to custom schemas & override dbt defaults, I added some lines to configure the schema in each transformation file.
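By default, dbt prefixes a model's custom schema with the target schema (producing something like target_Dimensions). The video covers overriding the generate_schema_name macro so that the custom schema name is used as-is; a minimal sketch of that override, adapted from the dbt documentation, goes in the macros/ folder:
macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {#- fall back to the target schema when no custom schema is configured -#}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {#- use the custom schema exactly as written in config(schema=...) -#}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}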
Another difference was that I specified my sources in the file schema.yml, which looks like this:
schema.yml
# Copied from: https://docs.getdbt.com/docs/build/sources
version: 2
sources:
  - name: pg_production
    database: Adventureworks
    description: 'Adventureworks, production schema'
    schema: production
    tables:
      - name: product
      - name: productcategory
      - name: productsubcategory
  - name: pg_person
    database: Adventureworks
    description: 'Adventureworks, person schema'
    schema: person
    tables:
      - name: address
      - name: stateprovince
      - name: countryregion
      - name: person
  - name: pg_sales
    database: Adventureworks
    description: 'Adventureworks, sales schema'
    schema: sales
    tables:
      - name: customer
      - name: store
      - name: creditcard
      - name: salesorderheader
      - name: salesorderdetail

So dim_address.sql had the lines for materialization and schema configuration, pointed to the sources defined in the previous .yml, and ended up looking like this:
dim_address.sql
{{ config(materialized='table') }}
{{ config(schema='Dimensions') }}
with stg_address as (
    select *
    from {{ source('pg_person', 'address') }}
),
stg_stateprovince as (
    select *
    from {{ source('pg_person', 'stateprovince') }}
),
stg_countryregion as (
    select *
    from {{ source('pg_person', 'countryregion') }}
)
select
    {{ dbt_utils.generate_surrogate_key(['stg_address.addressid']) }} as address_key,
    stg_address.addressid,
    stg_address.city as city_name,
    stg_stateprovince.name as state_name,
    stg_countryregion.name as country_name
from stg_address
left join stg_stateprovince on stg_address.stateprovinceid = stg_stateprovince.stateprovinceid
left join stg_countryregion on stg_stateprovince.countryregioncode = stg_countryregion.countryregioncode

In the same way as in the tutorial, the other dimensions were created.
Fact table:
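The fact table follows the same pattern: it takes its grain from salesorderdetail, joins salesorderheader for order-level attributes such as the order date, and hashes the natural keys with generate_surrogate_key so they match the keys generated in the dimensions. The sketch below is trimmed down and the schema name is illustrative; see the repository for the actual model with all its keys and measures.
fct_sales.sql
{{ config(materialized='table') }}
{{ config(schema='Facts') }} -- illustrative schema name
with stg_salesorderheader as (
    select *
    from {{ source('pg_sales', 'salesorderheader') }}
),
stg_salesorderdetail as (
    select *
    from {{ source('pg_sales', 'salesorderdetail') }}
)
select
    -- one row per order line
    {{ dbt_utils.generate_surrogate_key(['stg_salesorderdetail.salesorderid', 'stg_salesorderdetail.salesorderdetailid']) }} as sales_key,
    -- hashing shiptoaddressid the same way dim_address hashes addressid makes the keys join
    {{ dbt_utils.generate_surrogate_key(['stg_salesorderheader.shiptoaddressid']) }} as ship_address_key,
    stg_salesorderheader.orderdate,
    stg_salesorderdetail.orderqty,
    stg_salesorderdetail.unitprice,
    stg_salesorderdetail.orderqty * stg_salesorderdetail.unitprice as revenue
from stg_salesorderdetail
inner join stg_salesorderheader
    on stg_salesorderdetail.salesorderid = stg_salesorderheader.salesorderid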
Date Dimension
This is where I did most of my own experimenting. I wanted to create a Date Dimension, or Calendar table, ranging from the earliest to the latest date in the fact table, which comes from salesorderheader. I had little to no experience with Jinja templates, which is what dbt relies on for transformations, so there may be better ways to achieve what I intended to do, but I must say it worked fine.
dbt_utils.date_spine
The first version of the date dimension was created using dbt_utils, based on the tutorial As of Date Tables. I simply changed the start_date and end_date expressions to get the aforementioned earliest and latest dates from salesorderheader. The transformation ended up looking like this:
dim_dates.sql
{{ config(materialized='table') }}
{{ config(schema='Dimensions') }}
{%- set datepart = "day" -%}
{%- set start_date = "(select MIN(cast(orderdate as date)) from sales.salesorderheader)" -%}
{%- set end_date = "(select MAX(cast(orderdate as date)) + (interval '1 day') from sales.salesorderheader)" -%}
with as_of_date as (
    {{ dbt_utils.date_spine(
        datepart=datepart,
        start_date=start_date,
        end_date=end_date
    ) }}
)
select * from as_of_date

This isn’t a full calendar but a sequence of dates from the earliest to the latest. For a full calendar we would then need to create all the columns such as year, month, day, quarter…
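For instance, the final select * could be replaced with something like the sketch below, which derives the usual calendar columns from the spine using plain Postgres functions (date_day is the column date_spine emits for a day datepart; the column names here just mimic the ones dbt_date generates later):
select
    date_day,
    extract(year from date_day)::int    as year_number,
    extract(quarter from date_day)::int as quarter_of_year,
    extract(month from date_day)::int   as month_of_year,
    extract(day from date_day)::int     as day_of_month,
    extract(isodow from date_day)::int  as day_of_week,
    to_char(date_day, 'TMMonth')        as month_name
from as_of_date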
Statement Blocks
I then experimented with statement blocks and created one called get_dates to generate a view with the earliest and latest dates from salesorderheader:
dim_view_date_range.sql
{%- call statement('get_dates', fetch_result=True) -%}
SELECT MIN(cast(orderdate as date)), MAX(cast(orderdate as date))
FROM {{ source('pg_sales', 'salesorderheader') }}
{%- endcall -%}
{%- set earliest = load_result('get_dates')['data'][0][0] -%}
{%- set latest = load_result('get_dates')['data'][0][1] -%}
SELECT cast('{{ earliest }}' as date) AS earliest, cast('{{ latest }}' as date) AS latest

It worked: now we know how to create a statement that gets the dates from salesorderheader. Let's use it for the final step in this experiment, which is creating the calendar with calogica's dbt-date.
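An aside from my experiments: because statement blocks run while the model is being compiled, it can be handy to print the fetched values to the console with dbt's built-in log() while debugging, e.g. by temporarily adding a line like this after the load_result() calls:
{# prints during dbt run/compile; log() renders as an empty string in the SQL #}
{{ log('earliest order date: ' ~ earliest, info=True) }}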
dbt_date.get_date_dimension
This wrapper creates a full calendar dimension. Combining dbt_date.get_date_dimension with the statement block from the previous step, the transformation ended up like this:
dim_dates_calogical.sql
{{ config(materialized='table') }}
{{ config(schema='Dimensions') }}
{%- call statement('get_dates', fetch_result=True) -%}
SELECT MIN(cast(orderdate as date)), MAX(cast(orderdate as date)) + (interval '1 day')
FROM {{ source('pg_sales', 'salesorderheader') }}
{%- endcall -%}
{%- set earliest = load_result('get_dates')['data'][0][0] -%}
{%- set latest = load_result('get_dates')['data'][0][1] -%}
{{ dbt_date.get_date_dimension(
start_date=earliest,
end_date=latest
)
}}

This script creates the following columns: date_day, prior_date_day, next_date_day, prior_year_date_day, prior_year_over_year_date_day, day_of_week, day_of_week_name, day_of_week_name_short, day_of_month, day_of_year, week_start_date, week_end_date, prior_year_week_start_date, prior_year_week_end_date, week_of_year, iso_week_start_date, iso_week_end_date, prior_year_iso_week_start_date, prior_year_iso_week_end_date, iso_week_of_year, prior_year_week_of_year, prior_year_iso_week_of_year, month_of_year, month_name, month_name_short, month_start_date, month_end_date, prior_year_month_start_date, prior_year_month_end_date, quarter_of_year, quarter_start_date, quarter_end_date, year_number, year_start_date and year_end_date.
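As a quick illustration of how the calendar joins to the fact table (a hypothetical query: the measure and date column names depend on how fct_sales was built), monthly sales could then be aggregated like this:
select
    d.year_number,
    d.month_of_year,
    d.month_name,
    sum(f.revenue) as total_revenue -- revenue is a hypothetical measure column
from fct_sales f
join dim_dates_calogical d
    on cast(f.orderdate as date) = d.date_day
group by d.year_number, d.month_of_year, d.month_name
order by d.year_number, d.month_of_year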
Wrap Up
dbt is one of the most representative tools of the so-called Modern Data Stack and of Analytics Engineering, which claim to bring Software Engineering practices to the data stack. What I found interesting is that (at least in theory) we can create the transformations collaboratively, version control them, and then deploy those models to any supported data warehouse.
To dig deeper into the data landscape, the Modern Data Stack and dbt, I highly recommend: