Case study of moving analytics ETL from Databricks notebooks to Airflow, dbt, and Snowflake marts for analytics and ML
Edi · Data Engineer · Airflow, SQL, Jinja, Python, Snowflake, dbt
Enterprise case study about building an end-to-end marketing data product: Airflow for orchestration, dbt for transformations, Snowflake for data modelling, delivered as clean marts for analytics and ML.
The main challenge was not only compute. It was turning the fragmented source data into one governed, analytics-ready product that different teams could rely on and use.
Key: The data is synthetic, but the pipeline design, orchestration pattern, transformation layers, and mart outputs reflect the real Bertelsmann project work.
Keep Databricks where ML intensity justifies it
Move analytics ETL and marts to Snowflake
The target was a SQL-based, orchestrated, tested, and easier-to-govern data product for analytics and ML consumers.
The pipeline is scheduled and controlled in Airflow, not run as ad-hoc manual SQL stored procedures
This is where I used Airflow at Bertelsmann to make the flow repeatable, visible, and easy to operate
→ Orchestration becomes part of the product.
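The dependency structure the DAG enforces can be sketched with the standard library's `graphlib`; the real pipeline uses Airflow operators and schedules, and the task names below are illustrative, not the production DAG:

```python
from graphlib import TopologicalSorter

# Illustrative task graph mirroring the Airflow DAG:
# raw loads -> dbt staging -> dbt marts -> validation.
# Each key maps a task to the set of tasks it depends on.
tasks = {
    "load_raw_customers": set(),
    "load_raw_orders": set(),
    "dbt_run_staging": {"load_raw_customers", "load_raw_orders"},
    "dbt_run_marts": {"dbt_run_staging"},
    "validate_marts": {"dbt_run_marts"},
}

# A valid execution order: every task runs only after its upstream tasks.
order = list(TopologicalSorter(tasks).static_order())
print(order)
```

Airflow computes exactly this kind of ordering from task dependencies, plus retries, scheduling, and visibility, which is what makes the flow repeatable and easy to operate.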
Raw layer: CUSTOMERS, PRODUCTS, ORDERS, ORDER_ITEMS, and CUSTOMER_INTERACTIONS land first, before any business logic is applied.
Why this matters: dbt makes the business logic readable, version-controlled, and much easier to review than notebook-based ETL.
Result: A clean star model that analysts can query directly and ML teams can consume without redoing joins or transformations.
For analytics: revenue, order frequency, and segment reporting
For ML: validity, value, and interaction features are already packaged in one mart
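The mart-style aggregation behind metrics like revenue and order frequency can be sketched in a few lines; the real transformations run as dbt models in Snowflake, and the toy data and column names here are illustrative:

```python
import pandas as pd

# Toy ORDERS rows; in the real pipeline this is a Snowflake table
# and the aggregation lives in a dbt mart model.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_total": [120.0, 80.0, 40.0, 10.0, 25.0, 15.0],
})

# Per-customer revenue and order frequency, packaged once in the mart
# so analysts and ML teams do not redo the aggregation downstream.
customer_mart = (
    orders.groupby("customer_id")
    .agg(revenue=("order_total", "sum"),
         order_count=("order_total", "size"))
    .reset_index()
)
print(customer_mart)
```

Because the mart materializes these measures once, both reporting and feature engineering consume the same numbers.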
Validation: I ran the pipeline end to end, checked row counts, key relationships, and mart outputs before handing data to downstream teams.
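The validation step can be sketched as plain assertions; the real checks ran against Snowflake tables, and the in-memory rows below are illustrative:

```python
# Minimal validation sketch over in-memory rows (illustrative data).
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
]

# Row-count check: no layer should silently produce empty tables.
assert len(customers) > 0 and len(orders) > 0

# Key-relationship check: every order must reference an existing customer.
customer_keys = {c["customer_id"] for c in customers}
orphans = [o for o in orders if o["customer_id"] not in customer_keys]
assert orphans == [], f"orphan orders: {orphans}"

print("validation passed")
```

In the dbt project the same intent is expressed declaratively as `not_null`, `unique`, and `relationships` tests, so the checks run on every pipeline execution rather than by hand.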
This is where the project became a data product, not just a collection of SQL stored procedures.
Important nuance: Snowflake was the better operating model for this pipeline, not a statement that Databricks is bad in every case.
My message: I choose platforms by workload, not by hype.
This is the kind of end-to-end analytics workflow I can deliver: orchestrated, tested, business-facing, and honest about where each platform fits.
Questions?
Edi · info@mbitai.com