Skip to main content

Command Palette

Search for a command to run...

Mastering dbt: From Core Concepts to CI/CD, Lineage, and Medallion Architecture

Updated
4 min read

Modern data teams don’t struggle with collecting data anymore — they struggle with trusting it. That’s exactly the problem dbt was built to solve.

This post is a complete, end-to-end guide to dbt, covering:

  • What problems it solves

  • Core concepts (models, materialization, lineage)

  • Medallion architecture

  • CI/CD with GitHub Actions

  • Interview-ready mental models and scenarios


1. What is dbt?

dbt (data build tool) is a SQL-first transformation framework that runs transformations inside your data warehouse.

What dbt is for

  • Transforming raw data into analytics-ready tables

  • Applying software engineering practices to analytics

  • Enabling testing, documentation, and lineage

What dbt is NOT for

  • Data ingestion (Fivetran, NiFi, Glue)

  • Orchestration (Airflow)

  • Streaming or real-time processing

One-liner: dbt brings version control, testing, and CI/CD to SQL-based analytics.


2. The Problem dbt Solves

Before dbt:

  • Business logic lives in BI tools

  • No version control for SQL

  • Broken dashboards after schema changes

  • No visibility into downstream impact

With dbt:

  • All transformations live in Git

  • SQL is modular and testable

  • Automatic dependency graph (lineage)

  • Safe deployments via CI/CD


3. Core dbt Concepts (You Must Know)

3.1 Models

  • SQL SELECT statements

  • Each model becomes a table, view, or incremental table

  • Dependencies defined using ref()

select *
from {{ ref('stg_orders') }}

3.2 Materialization (Very Important)

Materialization defines how a model is stored in the warehouse.

TypeUse Case
viewLightweight, always fresh
tableBI-facing, fast queries
incrementalLarge fact tables
ephemeralReusable logic, no storage

Example (incremental):

{{ config(materialized='incremental', unique_key='order_id') }}

select *
from raw.orders
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}

Materialization is dbt’s cost and performance control knob.


3.3 Sources

  • Explicit declaration of raw tables

  • Enables freshness and source tests

sources:
  - name: raw
    tables:
      - name: orders

3.4 Tests

  • Built-in tests:

    • not_null

    • unique

    • relationships

  • Custom SQL tests supported

Tests fail the pipeline, not silently.


3.5 Macros

  • Reusable SQL using Jinja

  • Helps avoid duplication

  • Used for:

    • Date logic

    • Environment-based behavior

    • Custom tests


3.6 Snapshots

  • Track historical changes (SCD Type 2)

  • Useful when source systems don’t provide history


4. Data Lineage Explained Simply

Lineage = visual map of data flow from source to final model.

dbt builds lineage automatically using:

  • source()

  • ref()

Example:

raw.orders
   ↓
stg_orders
   ↓
int_orders
   ↓
fct_orders

Why lineage matters

  • Impact analysis

  • Faster debugging

  • Safer CI/CD

  • Easier onboarding

dbt lineage shows data dependencies, not execution order.


5. Medallion Architecture (Bronze → Silver → Gold)

Bronze (Raw)

  • Ingested data

  • Minimal transformation

  • Append-only

Silver (Cleaned)

  • Deduplication

  • Type casting

  • Business logic begins

Gold (Analytics)

  • Aggregated

  • Business-ready

  • Used by BI tools

How dbt fits

  • Bronze → dbt sources

  • Silver → staging models

  • Gold → marts


6. dbt in the Modern Data Stack

ToolResponsibility
NiFi / FivetranIngestion
AWS GlueHeavy Spark ETL
AirflowOrchestration
dbtAnalytics transformations

dbt focuses on the last mile of analytics.


7. CI/CD for dbt (Production-Grade)

Why CI/CD is critical

  • Prevent broken SQL in prod

  • Catch schema changes early

  • Enforce data quality


CI (Pull Requests)

  • dbt compile

  • Slim CI (state:modified+)

  • Run tests on changed models only

CD (Main branch)

  • Full dbt run

  • Full dbt test

  • Generate documentation


8. GitHub Actions – Complete CI/CD Workflow

Environments

  • dev → CI

  • prod → production deploy

CI + CD Workflow

name: dbt-ci-cd

on:
  pull_request:
  push:
    branches: [main]

jobs:
  dbt-ci:
    runs-on: ubuntu-latest
    environment: dev
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - run: pip install dbt-snowflake
      - run: dbt deps
      - run: dbt compile
      - run: dbt test

  dbt-prod:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: prod
    needs: dbt-ci
    steps:
      - uses: actions/checkout@v4
      - run: pip install dbt-snowflake
      - run: dbt run
      - run: dbt test

9. Real Interview Scenarios

Why dbt over stored procedures?

  • Version control

  • Testing

  • Lineage

  • CI/CD

  • Modular SQL

How do you handle large tables?

  • Incremental models

  • Partitioning

  • Late-arriving data logic

How do you stop bad data in prod?

  • CI tests

  • Protected branches

  • Mandatory PR checks


10. Final Mental Model

  • dbt → Transform data

  • Materialization → Control cost & performance

  • Lineage → Trust & impact analysis

  • Medallion architecture → Structure data layers

  • CI/CD → Safe, scalable deployments


Final One-Liner

dbt is a SQL-based analytics engineering framework that enables trusted, testable, and production-ready data transformations using modern software engineering practices.