01

Bertelsmann Marketing Data Product

Case study of moving analytics ETL from Databricks notebooks to Airflow, dbt, and Snowflake marts for analytics and ML

Edi · Data Engineer · Airflow, SQL, Jinja, Python, Snowflake, dbt

02

Agenda

  • 1. Business Context and My Role (4 min)
  • 2. Source Domains and Consumers (3 min)
  • 3. Why Databricks Was Not the Right ETL Fit (5 min)
  • 4. Target Architecture: Airflow + Snowflake + dbt (8 min)
  • 5. Raw → Staging → Intermediate → Marts (8 min)
  • 6. Tests, Validation, and Star Schema (5 min)
  • 7. Snowflake Advantages in Context (3 min)
  • 8. Why This Fits Your Company + Q&A
03

Who Am I?

Edi

  • Data Engineer with hands-on experience in Airflow, SQL, Jinja, Python, Snowflake, dbt, and Databricks
  • Worked with Airflow at Bertelsmann and helped prepare data for the Analytics and ML teams
  • Built the pipeline from raw ingestion to clean marts with star-schema models and governed business logic
  • Can explain architecture and trade-offs clearly

Why This Matters

Enterprise case study about building an end-to-end marketing data product: Airflow for orchestration, dbt for transformations, Snowflake for data modeling, delivered as clean marts for analytics and ML.

04

What The Data Product Had To Solve

Bertelsmann Marketing Analytics Case

  • Business goal: Create one trusted marketing data product for reporting, segmentation, and ML feature consumption
  • Scope: Roughly 18 source tables across CRM, order, product, campaign, and customer-interaction silos
  • Consumers: Analytics for KPI reporting and the ML team for churn, propensity, and next-best-offer features
  • My role: Help prepare the data and shape the pipeline from raw ingestion to clean marts
  • Constraint: Synthetic marketing data in this demo, because the real data is confidential
05

Source Domains and Consumers

Source Domains

  • Customer consent and master data
  • Orders, order items, and the product catalog
  • Customer interactions from marketing and support channels
  • Historical data arriving with different quality levels and owners

Consumers

  • Marketing analytics and marketing KPI reporting
  • Analytics engineers building reusable marts
  • ML teams consuming stable customer and marketing features
  • One shared and standardized business definition instead of silo business logic

The main challenge was not only compute. It was turning the fragmented source data into one governed, analytics-ready product that different teams could rely on and use.

06

Synthetic Data, Real Architecture

Bertelsmann Data Product

  • RAW.CUSTOMERS
  • RAW.PRODUCTS
  • RAW.ORDERS
  • RAW.ORDER_ITEMS
  • RAW.CUSTOMER_INTERACTIONS
  • Customer segment and region
  • Order status and order dates
  • Product category and unit price
  • Quantity, discount, and revenue
  • Email, phone, and chat interactions
  • Recency and lifetime value features
  • Star-schema marts for BI and ML

Key: The data is synthetic, but the pipeline design, orchestration pattern, transformation layers, and mart outputs reflect the real Bertelsmann project work.
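To make "synthetic but structurally real" concrete, here is a minimal Python sketch of a customer-row generator. It is a hypothetical stand-in: the actual project generated rows in Snowflake SQL, so the function name, seed, and row shape are illustrative; only the segment values mirror the deck.

```python
import random

# Segment values mirror the deck (Enterprise / Mid-Market / SMB).
SEGMENTS = ["Enterprise", "Mid-Market", "SMB"]

def generate_customers(n: int, seed: int = 42) -> list[dict]:
    """Return n synthetic customer rows; deterministic for a given seed."""
    rng = random.Random(seed)
    return [
        {
            "customer_id": i,
            "customer_name": f"Customer_{i}",
            "segment": rng.choice(SEGMENTS),
        }
        for i in range(1, n + 1)
    ]

rows = generate_customers(5)
```

The point is the shape, not the values: downstream layers only depend on stable keys and column types, so synthetic rows exercise the same pipeline as confidential production data.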

07

Starting Point: Databricks At Bertelsmann

Why Databricks Was There First

  • 1. Notebook and Spark-based jobs already existed in the team
  • 2. Databricks was still useful for heavier ML-intensive workloads and exploratory data analyses
  • 3. The daily marketing ETL also started there, with one shared production job cluster per domain
  • 4. That setup worked, but it was not the cleanest operating model for governed analytics delivery (no source of truth, lineage, or testing)
08

Why Databricks Was Not the Right ETL Fit

  • Notebook logic was harder to standardize and hand over
  • Business transformations were spread across Python, Spark SQL, and multiple notebook steps
  • Version control, Python dependency management, and ownership were weaker than under the dbt structure
  • The team needed a cleaner and clearer path from raw data to governed marts
  • It was strong for ML-heavy processing, but heavyweight for this daily analytics ETL
  • Analysts and downstream teams needed clear and readable SQL models
  • Governance, testing, and lineage needed to be first-class to establish a single source of truth
  • More clusters would not fix orchestration, traceability, or business-definition alignment
09

Migration Decision

Keep Databricks where ML intensity justifies it

Move analytics ETL and marts to Snowflake

The target was a SQL-first, orchestrated, tested, and easier-to-govern data product for analytics and ML consumers.

10

Target Architecture

Airflow Orchestration + Snowflake Storage/Compute + dbt Transformations

11

Step 1: Orchestrate With Airflow

The pipeline is scheduled and controlled in Airflow, not run as ad-hoc, manually triggered SQL stored procedures

marketing_data_product_dag:

generate_synthetic_data → dbt_run_staging → dbt_run_intermediate → dbt_run_marts → dbt_test → validate_end_to_end

This is where I used Airflow at Bertelsmann to make the flow repeatable, visible, and easy to operate

→ Orchestration becomes part of the product.

12

Step 2: Build the Raw Layer in Snowflake

CREATE OR REPLACE TABLE MARKETING_DB.RAW.CUSTOMERS AS
SELECT
    ROW_NUMBER() OVER (ORDER BY SEQ4()) AS customer_id,
    'Customer_' || ROW_NUMBER() OVER (ORDER BY SEQ4()) AS customer_name,
    CASE UNIFORM(1, 3, RANDOM())
        WHEN 1 THEN 'Enterprise'
        WHEN 2 THEN 'Mid-Market'
        ELSE 'SMB'
    END AS segment,
    CURRENT_TIMESTAMP() AS created_at
FROM TABLE(GENERATOR(ROWCOUNT => 10000));  -- row count illustrative; without a generator the SELECT returns a single row

Raw layer: CUSTOMERS, PRODUCTS, ORDERS, ORDER_ITEMS, and CUSTOMER_INTERACTIONS land first, before any business logic is applied.

13

Step 3: dbt Staging Models

Standardize Before Joining

-- models/staging/stg_orders.sql
select
    order_id,
    customer_id,
    cast(order_date as date) as order_date,
    upper(status) as order_status,
    total_amount::number(12,2) as total_amount
from {{ source('raw', 'orders') }}

Why this matters: dbt makes the business logic readable, version-controlled, and much easier to review than notebook-based ETL.

14

Step 4: Intermediate + Star Schema

dimensions/
    dim_customers
    dim_products
    dim_dates
facts/
    fct_marketing
intermediate/
    int_orders_enriched
    int_customer_interactions

Result: A clean star model that analysts can query directly and ML teams can consume without redoing joins or transformations.

15

Step 5: Analytics-Ready `fct_marketing` Mart

select
    customer_id,
    segment,
    region,
    count(distinct order_id) as order_frequency,
    sum(net_revenue) as lifetime_value,
    avg(net_revenue) as avg_order_value,
    max(order_date) as last_order_date,
    count(distinct interaction_id) as interaction_count
from {{ ref('int_orders_enriched') }}
group by 1, 2, 3

For analytics: revenue, order frequency, and segment reporting

For ML: recency, lifetime-value, and interaction features are already packaged in one mart

16

Step 6: dbt Tests and Validation

dbt test --select staging
dbt test --select intermediate
dbt test --select marts

tests:
  - not_null
  - unique
  - relationships
  - accepted_values

Validation: I ran the pipeline end to end, checked row counts, key relationships, and mart outputs before handing data to downstream teams.

This is where the project became a data product, not just a collection of SQL stored procedures.
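The end-to-end checks described above can be sketched in plain Python. This is illustrative only: the real validation ran against Snowflake as the validate_end_to_end task, so the function name and in-memory sample rows are hypothetical stand-ins for row-count, orphan-key, and duplicate-key checks.

```python
def validate(fact_rows: list[dict], dim_customers: list[dict]) -> list[str]:
    """Return human-readable validation failures (empty list = pass)."""
    errors = []
    # Row-count check: an empty mart should never be handed downstream.
    if not fact_rows:
        errors.append("fct_marketing is empty")
    # Relationship check: every fact row must join to a customer dimension row.
    known_ids = {c["customer_id"] for c in dim_customers}
    orphans = [r for r in fact_rows if r["customer_id"] not in known_ids]
    if orphans:
        errors.append(f"{len(orphans)} fact row(s) without a matching customer")
    # Uniqueness check: one row per customer in this mart.
    keys = [r["customer_id"] for r in fact_rows]
    if len(keys) != len(set(keys)):
        errors.append("duplicate customer_id in fct_marketing")
    return errors

# One orphan on purpose: customer_id 3 has no dimension row.
dims = [{"customer_id": 1}, {"customer_id": 2}]
facts = [{"customer_id": 1}, {"customer_id": 3}]
issues = validate(facts, dims)
```

Failing any of these checks blocks the handover to downstream teams, which is what turns the pipeline into a governed data product rather than a set of scripts.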

17

Why Snowflake Fit This ETL Better

Snowflake advantages in this use case

  • SQL transformation flow
  • Clear warehouse and schema boundaries
  • Elastic compute for scheduled ETL
  • Zero-copy clone for safe dev/test
  • Time Travel for recovery and comparison

Important nuance: Snowflake was the better operating model for this pipeline, not a statement that Databricks is bad in every case.

18

What Improved After the Migration

19

Right Tool, Right Workload

Snowflake For This Use-Case

  • Daily analytics ETL and marts
  • Readable SQL and dbt model structure
  • Governance, cloning, and Time Travel

Databricks Still Matters

  • ML-intensive workloads and heavier engineering tasks
  • Exploration where notebook workflows still make sense
  • Not the best fit for this governed ETL product

My message: I choose platforms by workload, not by hype.

20

Why This Fits Your Company

  • Real enterprise case study: a credible Bertelsmann data-team product, not a toy project
  • Strong analytics-engineering fit: Airflow, SQL, Jinja, dbt, tests, and dimensional modeling
  • Balanced platform judgment: Snowflake where it fits, Databricks where it still adds value
  • Reusable delivery pattern: raw, staging, intermediate, marts, and validation
  • Business communication: I can explain architecture in stakeholder language
  • Team enablement: clean marts that analysts and ML teams can share
21

Ready to Build Governed Data Products

This is the kind of end-to-end analytics workflow I can deliver: orchestrated, tested, business-facing, and clear about where each platform fits.

Questions?

Edi · info@mbitai.com