TD 6 - Analytics Engineering with dbt (GEMINI generated)
Prerequisites
Make sure you have the following installed:
- Python 3.8+ (managed via `uv` or `conda`)
- Visual Studio Code (highly recommended, with the "dbt Power User" extension)
- A terminal
1. Setup & Installation
Note: This tutorial was (GEMINI generated) to demonstrate dbt capabilities within this repository.
In this tutorial, we will build a "Modern Data Stack" locally. Instead of a cloud warehouse like Snowflake or BigQuery, we will use DuckDB, an in-process SQL OLAP database that is fast and easy to set up.
a. Project Initialization
We will create a specific directory for our analytics project to keep it separate from the documentation site.
Exercise - Environment Setup
- Create a new folder named `dbt_project` at the root of your repository (sibling to `docs`, `gen_logs`, etc.).
- Open your terminal in this new folder.
- Initialize a new Python environment: `uv venv` (or `python -m venv .venv`).
- Activate the environment:
  - Windows: `.venv\Scripts\activate`
  - Mac/Linux: `source .venv/bin/activate`
- Install dbt for DuckDB: `uv pip install dbt-duckdb` (or `pip install dbt-duckdb`).
b. dbt Initialization
dbt (data build tool) comes with a CLI to scaffold projects.
Exercise - Create the dbt project
- Run the initialization command: `dbt init ecommerce_analytics`.
- The wizard will ask you for configuration:
  - Database: Select `duckdb`.
  - Plugins: Press Enter.
  - Threads: 1 or 4.
  - Path: Type `dbt.duckdb` (this is crucial to save your data to a file!).
- Once finished, you will have an `ecommerce_analytics` folder. Open it in VS Code.
c. The profiles.yml
You might wonder: Where is the database connection defined?
By default, `dbt init` creates a global configuration file in your home directory (`~/.dbt/profiles.yml`).
For this project to be self-contained, we will create a local configuration.
Exercise - Local Configuration
- Inside your `ecommerce_analytics` folder, create a file named `profiles.yml`.
- Paste the following content:
```yaml
ecommerce_analytics:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: 'dbt.duckdb'
      extensions:
        - httpfs
        - parquet
```
To make dbt use this local file instead of the global one, you have two options:

1. Add the `--profiles-dir .` flag to every command (e.g., `dbt debug --profiles-dir .`).
2. (Recommended) Set an environment variable in your terminal:
   - **Windows (PowerShell)**: `$env:DBT_PROFILES_DIR="."`
   - **Mac/Linux**: `export DBT_PROFILES_DIR=.`

- Run `dbt debug` to confirm the connection works.
Note: The extensions list ensures DuckDB can read remote files and parquet, which is useful for advanced features.
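As a quick illustration of what those extensions enable, here is a hedged sketch (the URL is purely illustrative and not part of this tutorial):

```sql
-- With httpfs and parquet loaded, DuckDB can query a remote Parquet file directly.
-- The URL below is a placeholder; substitute any Parquet file you have access to.
select *
from read_parquet('https://example.com/some_file.parquet')
limit 10;
```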
2. The Scenario: E-Commerce Traffic
You are the Analytics Engineer for a growing e-commerce site. The backend team has provided you with:
- Raw Server Logs: A messy stream of HTTP requests containing valuable user behavior data.
- Product Catalog: Static JSON files describing the products and categories.
Your goal is to turn this mess into a clean "Daily Revenue" report.
a. Generate Data
We will use the repository's existing tools to generate our raw data.
Exercise - Data Generation
- Go back to the root of the repository (where `gen_logs` is located).
- Generate 50,000 log lines with the `gen_logs` generator.
  Note: You might need to adjust the path depending on where you run the command. The goal is to put `access.log` inside your dbt project root.
- Copy the dimension data into your dbt project's `seeds` folder.
  - Source: `gen_logs/data/products.json`, `departments.json`, `categories.json`
  - Destination: `dbt_project/ecommerce_analytics/seeds/`
Your folder structure should look like this:
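The exact scaffolding depends on your dbt version, but it should look roughly like this (assuming you copied `access.log` and the JSON files as described above):

```
dbt_project/
└── ecommerce_analytics/
    ├── dbt_project.yml
    ├── profiles.yml
    ├── access.log
    ├── models/
    └── seeds/
        ├── products.json
        ├── departments.json
        └── categories.json
```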
3. Ingestion (Bronze Layer)
In the "Modern Data Stack", we often follow the ELT (Extract, Load, Transform) pattern. We load raw data first, then transform it in place.
a. Seeds (Dimensions)
dbt "Seeds" are perfect for static data like our product catalogs.
Exercise - Loading Seeds
- In your terminal (inside `ecommerce_analytics`), run `dbt seed`.
- dbt will automatically infer schemas from the JSON files and create tables in your DuckDB database (by default, a file named `dbt.duckdb` will appear).
- You can verify the data using the DuckDB CLI or by installing the `duckdb` Python package and querying it (see the example below), but trusting dbt is usually enough!
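If you do want to peek at the loaded data, a minimal sanity check might look like this (assuming the seed created a `products` table in the default `main` schema; run it from the DuckDB CLI with `duckdb dbt.duckdb`):

```sql
-- Count the rows dbt loaded from the products seed.
select count(*) as n_products
from products;
```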
b. External Sources (Facts)
The `access.log` file is raw text. We don't want to insert it manually. DuckDB is amazing here because it can query files directly.
Exercise - Configure the Source
- Create a new file `models/sources.yml`.
- We will define a source that points to our local file.
```yaml
version: 2

sources:
  - name: raw_data
    schema: main
    tables:
      - name: access_logs
        description: "Raw server logs from the webserver"
        meta:
          external_location: "read_text('access.log')"
```
Note: `read_text` is a DuckDB table function that reads text files into a single column named `content`.
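If you want to check that DuckDB can actually see the file before building any model, a quick hedged check from the DuckDB CLI (path assumed relative to the project root) could be:

```sql
-- Confirm the raw log file is readable and non-empty.
select filename, length(content) as content_length
from read_text('access.log');
```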
4. Transformation (Silver Layer)
Now that our data is "defined", we need to parse it. The logs are currently just unstructured text strings.
a. Staging Models
We will create a "staging" model to clean up the raw data. This is where we apply Regex to extract columns.
Exercise - Parsing Logs
- Create a new file `models/staging/stg_access_logs.sql`.
- We need to extract the IP, Timestamp, URL, and Status Code.
- Copy the following SQL (DuckDB dialect):
```sql
with source as (
    select * from {{ source('raw_data', 'access_logs') }}
),

parsed as (
    select
        -- DuckDB regex extraction
        regexp_extract(content, '^(\S+)', 1) as ip_address,
        strptime(regexp_extract(content, '\[(.*?)\]', 1), '%d/%b/%Y:%H:%M:%S %z') as log_timestamp,
        regexp_extract(content, '"GET (.*?) HTTP', 1) as url,
        cast(regexp_extract(content, 'HTTP/1.1" (\d{3})', 1) as int) as status_code
    from source
)

select * from parsed
```
- Run `dbt run` to build this model.
- Inspect the view/table created in DuckDB (or use `dbt show --select stg_access_logs`); see the example query below.
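For reference, a minimal inspection query (assuming the model was materialized as `stg_access_logs` in the default `main` schema) might look like:

```sql
-- Peek at a few parsed rows to confirm the regex extraction worked.
select ip_address, log_timestamp, url, status_code
from stg_access_logs
limit 5;
```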
b. Feature Extraction
We have a `url` column, but we need business context. The URLs look like `/product/105` or `/add_to_cart/105`.
Challenge - Extracting Context
- Modify `models/staging/stg_access_logs.sql`.
- Add logic to extract:
  - `product_id`: The number at the end of `/product/...` or `/add_to_cart/...` URLs.
  - `action`: The type of action (e.g., 'view_product', 'add_to_cart', 'checkout').

Hint: You can use `case when` statements with `LIKE` operators or more regex.
Click for Solution
```sql
select
    ip_address,
    log_timestamp,
    status_code,
    url,
    -- Extract Product ID if present
    case
        when url like '/product/%' then split_part(url, '/', 3)
        when url like '/add_to_cart/%' then split_part(url, '/', 3)
        else null
    end as product_id,
    -- Label the action
    case
        when url like '/product/%' then 'view_product'
        when url like '/add_to_cart/%' then 'add_to_cart'
        when url = '/checkout' then 'checkout'
        else 'browse'
    end as action
from parsed
where status_code = 200 -- We only care about successful requests
```
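A note on the design choice: `split_part` is 1-indexed, and a URL starting with `/` yields an empty first part, which is why the product id is the third piece. A tiny sanity check:

```sql
-- '/product/105' splits on '/' into '', 'product', '105', so part 3 is the id.
select split_part('/product/105', '/', 3) as product_id;  -- '105'
```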
5. Business Logic (Gold Layer)
Now that we have clean events, we can join them with our dimension tables to calculate business metrics.
a. Potential Revenue Mart
We want to know how much money is being "added to cart" each day.
Exercise - Daily Revenue Mart
- Create a new file `models/marts/mart_daily_revenue.sql`.
- We need to:
  - Select from `stg_access_logs`.
  - Filter for `add_to_cart` actions.
  - Join with the `products` seed to get the price.
  - Group by day.
```sql
with logs as (
    select * from {{ ref('stg_access_logs') }}
),

products as (
    select * from {{ ref('products') }}
)

select
    -- Truncate timestamp to day
    date_trunc('day', logs.log_timestamp) as report_date,
    sum(products.product_price) as potential_revenue,
    count(*) as items_added
from logs
join products on cast(logs.product_id as int) = products.product_id
where logs.action = 'add_to_cart'
group by 1
order by 1 desc
```
- Run `dbt run`.
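To see the result, a hedged example query (assuming the mart is materialized as `mart_daily_revenue` in the default schema) is:

```sql
-- Show the most recent days of potential revenue.
select report_date, potential_revenue, items_added
from mart_daily_revenue
order by report_date desc
limit 7;
```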
6. Quality & Documentation
A huge advantage of dbt is that documentation and testing are code, living right next to your SQL.
a. Testing
Exercise - Add Tests
- Open `models/schema.yml` (create it if it doesn't exist).
- Add tests to ensure our data assumptions hold true.
```yaml
version: 2

models:
  - name: stg_access_logs
    columns:
      - name: log_timestamp
        tests:
          - not_null
      - name: action
        tests:
          - accepted_values:
              values: ['view_product', 'add_to_cart', 'checkout', 'browse']

  - name: mart_daily_revenue
    columns:
      - name: report_date
        tests:
          - unique
          - not_null
```
- Run `dbt test`. If everything is correct, all tests should pass (green).
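Beyond these generic column tests, dbt also supports "singular" tests: plain SQL files in the `tests/` folder that fail if they return any rows. A minimal sketch (the file name is hypothetical):

```sql
-- tests/assert_no_negative_revenue.sql
-- Fails if any day shows negative potential revenue.
select *
from {{ ref('mart_daily_revenue') }}
where potential_revenue < 0
```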
b. Documentation
Exercise - Generate Documentation
- Run the following command: `dbt docs generate`.
- Then launch the local documentation server: `dbt docs serve`.
- A browser tab will open. Explore the Lineage Graph (bottom right icon). You will see the full flow of your data from `access.log` to `mart_daily_revenue`.
Summary
You have built a complete ELT pipeline locally!
- Extract: Read raw text logs.
- Load: DuckDB accessed them directly.
- Transform: Parsed the logs with regex, joined them with the JSON seeds, and aggregated revenue.