Overview
The exploration I'm documenting here about dbt (Data Build Tool) is almost a recreation of the awesome tutorial Building a Kimball dimensional model with dbt, but with some twists.
I encourage you to read that tutorial for a discussion of what a dimensional model is, its benefits, and the differences with other modeling techniques such as Third Normal Form (3NF) and One Big Table (OBT).
What I did differently from that tutorial was the following:
- Worked with the AdventureWorks data already loaded into PostgreSQL, instead of seeding it from .csv files. For that I created a backup that you can restore to your Postgres database. You can download it here.
- Experimented with three flavors of Postgres: local PostgreSQL, Supabase, and Amazon Aurora (see profiles.yml below).
- Created custom schema names, following the video Deploy to custom schemas & override dbt defaults.
- Created two date dimension tables: one with Calogica's dbt-date package, which is a wrapper around dbt_utils.date_spine, and the other with date_spine itself. In both cases I managed to create the calendar dynamically, ranging from the first day in the fact table to the last one.
Setting up the project
It is not within the scope of this document to explain how to get started with dbt. I had never used dbt before, and it was straightforward with these two tutorials:
- Building a Kimball dimensional model with dbt
- Intro to Data Build Tool (dbt) // Create your first project!
You can explore the whole experiment I did in this repository: fmussari/dbt-dimensional-modeling-experiment
Defining the databases in profiles.yml
dbt is a transformation tool, so it works with one database at a time: the same database is both the source and the target. In this case we have the normalized AdventureWorks database, and we are writing the transformations that build a model with dimension and fact tables. I found it interesting that multiple databases can be defined in a project, so you can test, for example, against your local database and then build the models in a cloud database.
For storing the credentials I used conda, which can set environment variables per environment (with conda env config vars set). The profiles.yml ended up looking like this:
profiles.yml
dbt_dimensional_demo:
  outputs:
    postgresAdventure:
      type: postgres
      threads: 4
      host: localhost
      port: 5432
      user: postgres
      pass: '10042076'
      database: Adventureworks
      schema: target # this is overridden in each model's .sql
    supabaseAdventure:
      type: postgres
      threads: 4
      host: "{{ env_var('SUPABASE_HOST') }}"
      port: 5432
      user: "{{ env_var('SUPABASE_USER') }}"
      pass: "{{ env_var('SUPABASE_PASS') }}"
      database: Adventureworks
      schema: t
    auroraAdventure:
      type: postgres
      threads: 4
      host: "{{ env_var('AURORA_HOST') }}"
      port: 5432
      user: "{{ env_var('AURORA_USER') }}"
      pass: "{{ env_var('AURORA_PASS') }}"
      database: Adventureworks
      schema: t
  target: supabaseAdventure
As you can see, the experiment can run against any of the databases specified in the file just by changing the target.
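The target can also be switched per run without editing the file; for instance, assuming the standard dbt CLI, this runs the models against the local database instead of the default Supabase one:

dbt run --target postgresAdventure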
Creating the Dimensional Model
This part is almost a copy of the aforementioned tutorial, the only difference being that, based on the video Deploy to custom schemas & override dbt defaults, I added some lines to configure the schema in each model's .sql file.
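By default, dbt concatenates the target schema with the custom one (producing something like target_Dimensions), so the video overrides the generate_schema_name macro to use the custom schema verbatim. A minimal sketch of such an override, assuming it lives in macros/generate_schema_name.sql:

{% macro generate_schema_name(custom_schema_name, node) -%}
    {# Use the custom schema as-is when one is configured; fall back to the target schema otherwise #}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}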
Another difference was that I specified my sources in the file schema.yml, which looks like this:
schema.yml
# Copied from: https://docs.getdbt.com/docs/build/sources
version: 2

sources:
  - name: pg_production
    database: Adventureworks
    description: 'Adventureworks, production schema'
    schema: production
    tables:
      - name: product
      - name: productcategory
      - name: productsubcategory

  - name: pg_person
    database: Adventureworks
    description: 'Adventureworks, person schema'
    schema: person
    tables:
      - name: address
      - name: stateprovince
      - name: countryregion
      - name: person

  - name: pg_sales
    database: Adventureworks
    description: 'Adventureworks, sales schema'
    schema: sales
    tables:
      - name: customer
      - name: store
      - name: creditcard
      - name: salesorderheader
      - name: salesorderdetail
So dim_address.sql had the lines for materialization and schema configuration, pointed to the sources defined in the previous .yml, and ended up looking like this:
dim_address.sql
{{ config(materialized='table') }}
{{ config(schema='Dimensions') }}

with stg_address as (
    select *
    from {{ source('pg_person', 'address') }}
),

stg_stateprovince as (
    select *
    from {{ source('pg_person', 'stateprovince') }}
),

stg_countryregion as (
    select *
    from {{ source('pg_person', 'countryregion') }}
)

select
    {{ dbt_utils.generate_surrogate_key(['stg_address.addressid']) }} as address_key,
    stg_address.addressid,
    stg_address.city as city_name,
    stg_stateprovince.name as state_name,
    stg_countryregion.name as country_name
from stg_address
left join stg_stateprovince on stg_address.stateprovinceid = stg_stateprovince.stateprovinceid
left join stg_countryregion on stg_stateprovince.countryregioncode = stg_countryregion.countryregioncode
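Note that generate_surrogate_key, like the date_spine and dbt_date macros used later, comes from a package, so the project also needs a packages.yml followed by a dbt deps run. A sketch of that file; the pinned versions here are illustrative, not necessarily the ones in my repo:

packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: calogica/dbt_date
    version: 0.10.0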
The other dimensions and the fact table were created in the same way as in the tutorial.
Date Dimension
This is where I did most of my own experiments. I wanted to create a Date Dimension, or Calendar table, ranging from the earliest to the latest date in the fact table, which comes from salesorderheader. I had little to no experience with Jinja templates, which is what dbt relies on for transformations, so maybe there are better ways to achieve what I intended, but I must say it worked fine.
dbt_utils.date_spine
The first version of the date dimension was created using dbt_utils, based on the tutorial As of Date Tables. I simply changed the start_date and end_date statements to get the aforementioned earliest and latest dates from salesorderheader. The transformation ended up looking like this:
dim_dates.sql
{{ config(materialized='table') }}
{{ config(schema='Dimensions') }}

{%- set datepart = "day" -%}
{%- set start_date = "(select MIN(cast(orderdate as date)) from sales.salesorderheader)" -%}
{%- set end_date = "(select MAX(cast(orderdate as date)) + (interval '1 day') from sales.salesorderheader)" -%}

with as_of_date AS (
    {{ dbt_utils.date_spine(
        datepart=datepart,
        start_date=start_date,
        end_date=end_date
    )
    }}
)

SELECT * FROM as_of_date
This isn't a full calendar but just a sequence of dates from the earliest to the latest. For a full calendar we would then need to derive all the columns, such as year, month, day, and quarter.
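As a sketch of that idea (this model is not in my repo), those columns could be derived right after the spine with plain Postgres date functions; date_day is the column name date_spine generates for datepart="day":

select
    date_day,
    extract(year from date_day) as year_number,
    extract(quarter from date_day) as quarter_of_year,
    extract(month from date_day) as month_of_year,
    extract(day from date_day) as day_of_month,
    extract(isodow from date_day) as day_of_week
from as_of_date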
Statement Blocks
I then experimented with statement blocks and created one called get_dates to generate a view with the earliest and latest dates from salesorderheader:
dim_view_date_range.sql
{%- call statement('get_dates', fetch_result=True) -%}
    SELECT MIN(cast(orderdate as date)), MAX(cast(orderdate as date))
    FROM {{ source('pg_sales', 'salesorderheader') }}
{%- endcall -%}

{%- set earliest = load_result('get_dates')['data'][0][0] -%}
{%- set latest = load_result('get_dates')['data'][0][1] -%}

SELECT cast('{{ earliest }}' as date) AS earliest, cast('{{ latest }}' as date) AS latest
It worked: now we know how to create a statement that gets dates from salesorderheader. Let's use it for the final step in this experiment, which is creating the calendar with Calogica's dbt-date package.
dbt_date.get_date_dimension
This wrapper creates a full calendar dimension. The transformation ended up looking like this:
dim_dates_calogical.sql
{{ config(materialized='table') }}
{{ config(schema='Dimensions') }}

{%- call statement('get_dates', fetch_result=True) -%}
    SELECT MIN(cast(orderdate as date)), MAX(cast(orderdate as date)) + (interval '1 day')
    FROM {{ source('pg_sales', 'salesorderheader') }}
{%- endcall -%}

{%- set earliest = load_result('get_dates')['data'][0][0] -%}
{%- set latest = load_result('get_dates')['data'][0][1] -%}

{{ dbt_date.get_date_dimension(
    start_date=earliest,
    end_date=latest
) }}
This script creates the following columns: date_day, prior_date_day, next_date_day, prior_year_date_day, prior_year_over_year_date_day, day_of_week, day_of_week_name, day_of_week_name_short, day_of_month, day_of_year, week_start_date, week_end_date, prior_year_week_start_date, prior_year_week_end_date, week_of_year, iso_week_start_date, iso_week_end_date, prior_year_iso_week_start_date, prior_year_iso_week_end_date, iso_week_of_year, prior_year_week_of_year, prior_year_iso_week_of_year, month_of_year, month_name, month_name_short, month_start_date, month_end_date, prior_year_month_start_date, prior_year_month_end_date, quarter_of_year, quarter_start_date, quarter_end_date, year_number, year_start_date, and year_end_date.
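A quick sanity check you could run against the generated calendar (a hypothetical query; qualify the table with whatever schema your configuration resolves to):

SELECT year_number, MIN(date_day) AS first_day, MAX(date_day) AS last_day, COUNT(*) AS days
FROM dim_dates_calogical
GROUP BY year_number
ORDER BY year_number;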
Wrap Up
dbt is one of the most representative tools of the so-called Modern Data Stack and of Analytics Engineering, bringing software engineering practices to the data stack. What I found most interesting is that (at least in theory) we can create the transformations collaboratively, version control them, and then deploy the models to any supported data warehouse.
To dig deeper into the data landscape, the Modern Data Stack, and dbt, I highly recommend: