Dimensional Modelling with dbt and different flavors of PostgreSQL

dbt
Data-Engineer-Camp
Datawarehousing
Kimball
Author

Francisco Mussari

Published

May 15, 2023

Overview

The exploration I’m documenting here about dbt (Data Build Tool) is almost a recreation of the awesome tutorial Building a Kimball dimensional model with dbt, but with some twists.

I encourage you to read that tutorial for a discussion of what a dimensional model is, its benefits, and how it differs from other modeling techniques such as Third Normal Form (3NF) and One Big Table (OBT).

What I did differently from that tutorial was the following:

  1. Worked with the AdventureWorks data already loaded into PostgreSQL, rather than seeded from .csv files. For that I created a backup that you can restore to your own Postgres database. You can download it here.

  2. Experimented with three flavors of Postgres: local PostgreSQL, Supabase, and Amazon Aurora (as defined in profiles.yml below).

  3. Created custom schema names, following the video Deploy to custom schemas & override dbt defaults.

  4. Created two date dimension tables: one with the calogica dbt-date package, which wraps dbt_utils.date_spine, and the other with date_spine itself. I managed to build the calendar dynamically, ranging from the first date in the fact table to the last one.

Setting up the project

It is not within the scope of this document to explain how to get started with dbt. I had never used dbt before, and getting started was straightforward with these two tutorials:

  1. Building a Kimball dimensional model with dbt
  2. Intro to Data Build Tool (dbt) // Create your first project!

You can explore the whole experiment I did in this repository: fmussari/dbt-dimensional-modeling-experiment

Defining the databases in profiles.yml

dbt is a transformation tool, so it works with one database at a time, meaning the same database is both the source and the target. In this case we have the normalized AdventureWorks database, and we are writing the transformations that build a model with dimension and fact tables. I found it interesting that multiple databases can be defined in a project, so you can test, for example, against your local database and then build the models in a cloud database.

For storing the credentials I used conda environment variables.

The profiles.yml ended up looking like this:

profiles.yml
dbt_dimensional_demo:

  outputs:

    postgresAdventure:
      type: postgres
      threads: 4
      host: localhost
      port: 5432
      user: postgres
      pass: '10042076'
      database: Adventureworks
      schema: target  # this is overridden in each model's sql

    supabaseAdventure:
      type: postgres
      threads: 4
      host: "{{ env_var('SUPABASE_HOST') }}"
      port: 5432
      user: "{{ env_var('SUPABASE_USER') }}"
      pass: "{{ env_var('SUPABASE_PASS') }}"
      database: Adventureworks
      schema: t
    
    auroraAdventure:
      type: postgres
      threads: 4
      host: "{{ env_var('AURORA_HOST') }}"
      port: 5432
      user: "{{ env_var('AURORA_USER') }}"
      pass: "{{ env_var('AURORA_PASS') }}"
      database: Adventureworks
      schema: t

  target: supabaseAdventure

As you can see, the experiment can run against any of the databases defined in the file just by changing the target.

Creating the Dimensional Model

This part is almost a copy of the aforementioned tutorial. The only difference is that, based on the video Deploy to custom schemas & override dbt defaults, I added some lines to configure the schema in each transformation file.
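
For context, dbt's default behavior is to concatenate the target schema and the custom schema, so a model configured with schema='Dimensions' on the supabaseAdventure target would land in a schema like t_Dimensions. The video shows how to override the generate_schema_name macro so the custom name is used as-is. A minimal sketch of such an override, assuming it lives in macros/generate_schema_name.sql (my actual override may differ), could look like this:

generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}

    {%- set default_schema = target.schema -%}

    {#- Use the custom schema exactly as written when one is configured,
        instead of dbt's default "<target schema>_<custom schema>" -#}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}

{%- endmacro %}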

Another difference was that I specified my sources in the file schema.yml, which looks like this:

schema.yml
# Copied from: https://docs.getdbt.com/docs/build/sources

version: 2

sources:
  - name: pg_production
    database: Adventureworks
    description: 'Adventureworks, production schema'
    schema: production
    tables:
      - name: product
      - name: productcategory
      - name: productsubcategory

  - name: pg_person
    database: Adventureworks
    description: 'Adventureworks, person schema'
    schema: person
    tables:
      - name: address
      - name: stateprovince
      - name: countryregion
      - name: person

  - name: pg_sales
    database: Adventureworks
    description: 'Adventureworks, sales schema, customer table'
    schema: sales
    tables:
      - name: customer
      - name: store
      - name: creditcard
      - name: salesorderheader
      - name: salesorderdetail

So dim_address.sql has the materialization and schema configuration lines, points to the sources defined in the previous .yml, and ended up looking like this:

dim_address.sql
{{ config(materialized='table') }}
{{ config(schema='Dimensions') }}

with stg_address as (
    select *
    from {{ source('pg_person', 'address') }}
),

stg_stateprovince as (
    select *
    from {{ source('pg_person', 'stateprovince') }}
),

stg_countryregion as (
    select *
    from {{ source('pg_person', 'countryregion') }}
)

select
    {{ dbt_utils.generate_surrogate_key(['stg_address.addressid']) }} as address_key,
    stg_address.addressid,
    stg_address.city as city_name,
    stg_stateprovince.name as state_name,
    stg_countryregion.name as country_name
from stg_address
left join stg_stateprovince on stg_address.stateprovinceid = stg_stateprovince.stateprovinceid
left join stg_countryregion on stg_stateprovince.countryregioncode = stg_countryregion.countryregioncode

The other dimension tables and the fact table were created in the same way as in the tutorial. As an illustration, a product dimension following the same pattern is sketched below.
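
Here is a hypothetical dim_product.sql built from the pg_production sources defined above, mirroring the structure of dim_address.sql (the model in the repository may differ, and the AdventureWorks column names are assumptions):

dim_product.sql
{{ config(materialized='table') }}
{{ config(schema='Dimensions') }}

with stg_product as (
    select *
    from {{ source('pg_production', 'product') }}
),

stg_productsubcategory as (
    select *
    from {{ source('pg_production', 'productsubcategory') }}
),

stg_productcategory as (
    select *
    from {{ source('pg_production', 'productcategory') }}
)

-- one row per product, with its subcategory and category denormalized in
select
    {{ dbt_utils.generate_surrogate_key(['stg_product.productid']) }} as product_key,
    stg_product.productid,
    stg_product.name as product_name,
    stg_productsubcategory.name as product_subcategory_name,
    stg_productcategory.name as product_category_name
from stg_product
left join stg_productsubcategory on stg_product.productsubcategoryid = stg_productsubcategory.productsubcategoryid
left join stg_productcategory on stg_productsubcategory.productcategoryid = stg_productcategory.productcategoryid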

Date Dimension

This is where I did most of my own experimenting. I wanted to create a date dimension, or calendar table, ranging from the earliest to the latest date in the fact table, which comes from salesorderheader. I had little to no experience with Jinja templates, which is what dbt relies on for transformations, so there may be better ways to achieve what I intended to do, but I must say it worked fine.

dbt_utils.date_spine

The first version of the date dimension was created using dbt_utils, based on the tutorial As of Date Tables. I simply changed the start_date and end_date expressions to get the aforementioned earliest and latest dates from salesorderheader. The transformation ended up looking like this:

dim_dates.sql
{{ config(materialized='table') }}
{{ config(schema='Dimensions') }}

{%- set datepart = "day" -%}
{%- set start_date = "(select MIN(cast(orderdate as date)) from sales.salesorderheader)" -%}
{%- set end_date = "(select MAX(cast(orderdate as date)) + (interval '1 day') from sales.salesorderheader)" -%}

with as_of_date AS (
    {{ dbt_utils.date_spine(
        datepart=datepart, 
        start_date=start_date,
        end_date=end_date
       ) 
    }}
)

SELECT * FROM as_of_date

This isn’t a full calendar but a sequence of dates from the earliest to the latest. For a full calendar we would then need to add columns such as year, month, day, quarter, and so on; a sketch of what that might look like follows.
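
As a sketch of that idea (assuming the spine column is named date_day, which is what dbt_utils.date_spine produces when datepart is "day"), the final select of dim_dates.sql could derive calendar columns with Postgres extract():

-- replacing the final "SELECT * FROM as_of_date" in dim_dates.sql
select
    date_day,
    extract(year from date_day)::int    as year_number,
    extract(quarter from date_day)::int as quarter_of_year,
    extract(month from date_day)::int   as month_of_year,
    extract(day from date_day)::int     as day_of_month,
    extract(isodow from date_day)::int  as day_of_week
from as_of_date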

Statement Blocks

I then experimented with statement blocks and created one called get_dates to generate a view with the earliest and latest dates from salesorderheader:

dim_view_date_range.sql
{%- call statement('get_dates', fetch_result=True) -%}

      SELECT  MIN(cast(orderdate as date)), MAX(cast(orderdate as date)) 
      FROM {{ source('pg_sales', 'salesorderheader') }}

{%- endcall -%}

{%- set earliest = load_result('get_dates')['data'][0][0] -%}
{%- set latest = load_result('get_dates')['data'][0][1] -%}

SELECT cast('{{ earliest }}' as date) AS earliest, cast('{{ latest }}' as date) AS latest

It worked. Now we know how to create a statement that gets the dates from salesorderheader. Let's use it for the final step in this experiment, which is creating the calendar with the calogica dbt-date package.

dbt_date.get_date_dimension

This wrapper creates a full calendar dimension. The transformation ended up looking like this:

dim_dates_calogical.sql
{{ config(materialized='table') }}
{{ config(schema='Dimensions') }}

{%- call statement('get_dates', fetch_result=True) -%}

      SELECT  MIN(cast(orderdate as date)), MAX(cast(orderdate as date)) + (interval '1 day')
      FROM {{ source('pg_sales', 'salesorderheader') }}

{%- endcall -%}

{%- set earliest = load_result('get_dates')['data'][0][0] -%}
{%- set latest = load_result('get_dates')['data'][0][1] -%}

{{ dbt_date.get_date_dimension(
        start_date=earliest, 
        end_date=latest
    ) 
}}

This script creates the following columns: date_day, prior_date_day, next_date_day, prior_year_date_day, prior_year_over_year_date_day, day_of_week, day_of_week_name, day_of_week_name_short, day_of_month, day_of_year, week_start_date, week_end_date, prior_year_week_start_date, prior_year_week_end_date, week_of_year, iso_week_start_date, iso_week_end_date, prior_year_iso_week_start_date, prior_year_iso_week_end_date, iso_week_of_year, prior_year_week_of_year, prior_year_iso_week_of_year, month_of_year, month_name, month_name_short, month_start_date, month_end_date, prior_year_month_start_date, prior_year_month_end_date, quarter_of_year, quarter_start_date, quarter_end_date, year_number, year_start_date and year_end_date.
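
As a quick sanity check of the generated calendar, a hypothetical dbt analysis (for example analyses/days_per_month.sql; dbt compiles analyses but does not materialize them) could count the days covered per month using the year_number and month_of_year columns listed above:

days_per_month.sql
-- counts how many calendar days the generated dimension covers per month
select
    year_number,
    month_of_year,
    count(*) as days_in_month
from {{ ref('dim_dates_calogical') }}
group by year_number, month_of_year
order by year_number, month_of_year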

Wrap Up

dbt is one of the most representative tools of the so-called Modern Data Stack and of Analytics Engineering, which claim to bring software engineering practices to the data stack. What I found most interesting is that (at least in theory) we can create the transformations collaboratively, version control them, and then deploy those models to any supported data warehouse.

To go deeper into the data landscape, the Modern Data Stack, and dbt, I highly recommend: