01

Bertelsmann Marketing Data Product

Case study of moving analytics ETL from Databricks notebooks to Airflow, dbt, and Snowflake marts for analytics and ML

Edi · Data Engineer · Airflow, SQL, Jinja, Python, Snowflake, dbt

02

Agenda

  • 1. Business Context and My Role (4 min)
  • 2. Source Domains and Consumers (3 min)
  • 3. Why Databricks Was Not the Right ETL Fit (5 min)
  • 4. Target Architecture: Airflow + Snowflake + dbt (8 min)
  • 5. Raw → Staging → Intermediate → Marts (8 min)
  • 6. Tests, Validation, and Star Schema (5 min)
  • 7. Snowflake Advantages in Context (3 min)
  • 8. Why This Fits Your Company + Q&A
03

Who Am I?

Edi

  • Data Engineer with hands-on experience in Airflow, SQL, Jinja, Python, Snowflake, dbt, and Databricks
  • Worked with Airflow at Bertelsmann and helped prepare data for the Analytics and ML teams
  • Built the pipeline from raw ingestion to clean marts with star-schema models and governed business logic
  • Can explain architecture and trade-offs clearly

Why This Matters

Enterprise case study about building an end-to-end marketing data product: Airflow for orchestration, dbt for transformations, Snowflake for data modeling, delivered as clean marts for analytics and ML.

04

What The Data Product Had To Solve

Bertelsmann Marketing Analytics Case

  • Business goal: Create one trusted marketing data product for reporting, segmentation, and ML feature consumption
  • Scope: Roughly 18 source tables across CRM, order, product, campaign, and customer-interaction silos
  • Consumers: Analytics for KPI reporting and the ML team for churn, propensity, and next-best-offer features
  • My role: Help prepare the data and shape the pipeline from raw ingestion to clean marts
  • Constraint: Synthetic marketing data in this demo, because the real data is confidential
05

Source Domains and Consumers

Source Domains

  • Customer consent and master data
  • Orders, order items, and the product catalog
  • Customer interactions from marketing and support channels
  • Historical data arriving with different quality levels and owners

Consumers

  • Marketing analytics and marketing KPI reporting
  • Analytics engineers building reusable marts
  • ML teams consuming stable customer and marketing features
  • One shared and standardized business definition instead of silo business logic

The main challenge was not only compute. It was turning the fragmented source data into one governed, analytics-ready product that different teams could rely on and use.

06

Synthetic Data, Real Architecture

Bertelsmann Data Product

  • RAW.CUSTOMERS
  • RAW.PRODUCTS
  • RAW.ORDERS
  • RAW.ORDER_ITEMS
  • RAW.CUSTOMER_INTERACTIONS
  • Customer segment and region
  • Order status and order dates
  • Product category and unit price
  • Quantity, discount, and revenue
  • Email, phone, and chat interactions
  • Recency and lifetime value features
  • Star-schema marts for BI and ML

Key: The data is synthetic, but the pipeline design, orchestration pattern, transformation layers, and mart outputs reflect the real Bertelsmann project work.
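To make "synthetic but structurally real" concrete, here is a minimal Python sketch of a customer-row generator. It is a hypothetical stand-in: the actual project generated rows in Snowflake SQL, so the function name, seed, and row shape are illustrative; only the segment values mirror the deck.

```python
import random

# Segment values mirror the deck (Enterprise / Mid-Market / SMB).
SEGMENTS = ["Enterprise", "Mid-Market", "SMB"]

def generate_customers(n: int, seed: int = 42) -> list[dict]:
    """Return n synthetic customer rows; deterministic for a given seed."""
    rng = random.Random(seed)
    return [
        {
            "customer_id": i,
            "customer_name": f"Customer_{i}",
            "segment": rng.choice(SEGMENTS),
        }
        for i in range(1, n + 1)
    ]

rows = generate_customers(5)
```

The point is the shape, not the values: downstream layers only depend on stable keys and column types, so synthetic rows exercise the same pipeline as confidential production data.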

07

Starting Point: Databricks At Bertelsmann

Why Databricks Was There First

  • 1. Notebook and Spark-based jobs already existed in the team
  • 2. Databricks was still useful for heavier ML-intensive workloads and exploratory data analyses
  • 3. The daily marketing ETL also started there, with one shared production job cluster per domain
  • 4. That setup worked, but it was not the cleanest operating model for governed analytics delivery (no source of truth, lineage, or testing)
08

Why Databricks Was Not the Right ETL Fit

  • Notebook logic was harder to standardize and hand over
  • Business transformations were spread across Python, Spark SQL, and multiple notebook steps
  • Version control, Python dependency management, and ownership were weaker than under the dbt structure
  • The team needed a cleaner and clearer path from raw data to governed marts
  • It was strong for ML-heavy processing, but heavyweight for this daily analytics ETL
  • Analysts and downstream teams needed clear and readable SQL models
  • Governance, testing, and lineage needed to be first-class to establish a single source of truth
  • More clusters would not fix orchestration, traceability, or business-definition alignment
09

Migration Decision

Keep Databricks where ML intensity justifies it

Move analytics ETL and marts to Snowflake

The target was a SQL-first, orchestrated, tested, and easier-to-govern data product for analytics and ML consumers.

10

Target Architecture

Airflow Orchestration + Snowflake Storage/Compute + dbt Transformations

11

Step 1: Orchestrate With Airflow

The pipeline is scheduled and controlled in Airflow, not run as ad-hoc, manually triggered SQL stored procedures

marketing_data_product_dag:

generate_synthetic_data → dbt_run_staging → dbt_run_intermediate → dbt_run_marts → dbt_test → validate_end_to_end

This is where I used Airflow at Bertelsmann to make the flow repeatable, visible, and easy to operate

→ Orchestration becomes part of the product.

12

Step 2: Build the Raw Layer in Snowflake

CREATE OR REPLACE TABLE MARKETING_DB.RAW.CUSTOMERS AS
SELECT
    ROW_NUMBER() OVER (ORDER BY SEQ4()) AS customer_id,
    'Customer_' || ROW_NUMBER() OVER (ORDER BY SEQ4()) AS customer_name,
    CASE UNIFORM(1, 3, RANDOM())
        WHEN 1 THEN 'Enterprise'
        WHEN 2 THEN 'Mid-Market'
        ELSE 'SMB'
    END AS segment,
    CURRENT_TIMESTAMP() AS created_at
FROM TABLE(GENERATOR(ROWCOUNT => 10000));  -- row count illustrative; without a generator the SELECT returns a single row

Raw layer: CUSTOMERS, PRODUCTS, ORDERS, ORDER_ITEMS, and CUSTOMER_INTERACTIONS land first, before any business logic is applied.

13

Step 3: dbt Staging Models

Standardize Before Joining

-- models/staging/stg_orders.sql
select
    order_id,
    customer_id,
    cast(order_date as date) as order_date,
    upper(status) as order_status,
    total_amount::number(12,2) as total_amount
from {{ source('raw', 'orders') }}

Why this matters: dbt makes the business logic readable, version-controlled, and much easier to review than notebook-based ETL.

14

Step 4: Intermediate + Star Schema

dimensions/
    dim_customers
    dim_products
    dim_dates
facts/
    fct_marketing
intermediate/
    int_orders_enriched
    int_customer_interactions

Result: A clean star model that analysts can query directly and ML teams can consume without redoing joins or transformations.

15

Step 5: Analytics-Ready `fct_marketing` Mart

select
    customer_id,
    segment,
    region,
    count(distinct order_id) as order_frequency,
    sum(net_revenue) as lifetime_value,
    avg(net_revenue) as avg_order_value,
    max(order_date) as last_order_date,
    count(distinct interaction_id) as interaction_count
from {{ ref('int_orders_enriched') }}
group by 1, 2, 3

For analytics: revenue, order frequency, and segment reporting

For ML: recency, lifetime-value, and interaction features are already packaged in one mart

16

Step 6: dbt Tests and Validation

dbt test --select staging
dbt test --select intermediate
dbt test --select marts

tests:
  - not_null
  - unique
  - relationships
  - accepted_values

Validation: I ran the pipeline end to end, checked row counts, key relationships, and mart outputs before handing data to downstream teams.

This is where the project became a data product, not just a collection of SQL stored procedures.
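The end-to-end checks described above can be sketched in plain Python. This is illustrative only: the real validation ran against Snowflake as the validate_end_to_end task, so the function name and in-memory sample rows are hypothetical stand-ins for row-count, orphan-key, and duplicate-key checks.

```python
def validate(fact_rows: list[dict], dim_customers: list[dict]) -> list[str]:
    """Return human-readable validation failures (empty list = pass)."""
    errors = []
    # Row-count check: an empty mart should never be handed downstream.
    if not fact_rows:
        errors.append("fct_marketing is empty")
    # Relationship check: every fact row must join to a customer dimension row.
    known_ids = {c["customer_id"] for c in dim_customers}
    orphans = [r for r in fact_rows if r["customer_id"] not in known_ids]
    if orphans:
        errors.append(f"{len(orphans)} fact row(s) without a matching customer")
    # Uniqueness check: one row per customer in this mart.
    keys = [r["customer_id"] for r in fact_rows]
    if len(keys) != len(set(keys)):
        errors.append("duplicate customer_id in fct_marketing")
    return errors

# One orphan on purpose: customer_id 3 has no dimension row.
dims = [{"customer_id": 1}, {"customer_id": 2}]
facts = [{"customer_id": 1}, {"customer_id": 3}]
issues = validate(facts, dims)
```

Failing any of these checks blocks the handover to downstream teams, which is what turns the pipeline into a governed data product rather than a set of scripts.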

17

Why Snowflake Fit This ETL Better

Snowflake advantages in this use case

  • SQL transformation flow
  • Clear warehouse and schema boundaries
  • Elastic compute for scheduled ETL
  • Zero-copy clone for safe dev/test
  • Time Travel for recovery and comparison

Important nuance: Snowflake was the better operating model for this pipeline, not a statement that Databricks is bad in every case.

18

What Improved After the Migration

19

Right Tool, Right Workload

Snowflake For This Use-Case

  • Daily analytics ETL and marts
  • Readable SQL and dbt model structure
  • Governance, cloning, and Time Travel

Databricks Still Matters

  • ML-intensive workloads and heavier engineering tasks
  • Exploration where notebook workflows still make sense
  • Not the best fit for this governed ETL product

My message: I choose platforms by workload, not by hype.

20

Why This Fits Your Company

  • Real enterprise case study: a credible Bertelsmann data-team product, not a toy project
  • Strong analytics-engineering fit: Airflow, SQL, Jinja, dbt, tests, and dimensional modeling
  • Balanced platform judgment: Snowflake where it fits, Databricks where it still adds value
  • Reusable delivery pattern: raw, staging, intermediate, marts, and validation
  • Business communication: I can explain architecture in stakeholder language
  • Team enablement: clean marts that analysts and ML teams can share
21

Ready to Build Governed Data Products

This is the kind of end-to-end analytics workflow I can deliver: orchestrated, tested, business-facing, and clear about where each platform fits.

Questions?

Edi · info@mbitai.com