Lakehouse migration: from Databricks to Snowflake for a European media company
How I migrated 20M+ records from Databricks to Snowflake, cutting monthly infrastructure costs by 40% and fixing years of accumulated governance debt.
The situation
A European media company had built their analytics platform on Databricks. It had worked during a period of rapid growth, but by the time I came in the monthly cost was €50K+ and the platform was ungoverned. The cost was high because clusters ran around the clock instead of spinning up on demand. There were 50+ data scientists working off 500+ tables spread across Delta Lake, Parquet, and JSON formats, with no central catalog and no clear lineage.
The harder problem was trust. When quarterly reports took two hours and the numbers sometimes differed depending on who ran them, teams started routing around the platform instead of using it. That’s usually where things stand when I get called in.
What I did
Assessment (weeks 1–4)
Profiled all 200+ notebooks to understand compute vs. storage usage patterns. Built a cost model comparing the two platforms under realistic usage assumptions. Designed a hybrid architecture: Snowflake as the compute layer over the existing S3 storage, which avoided duplicating data and reduced migration risk.
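A minimal sketch of that kind of cost model, comparing always-on compute with compute billed only while active. The workload names, hourly rates, active hours, and the 10% overhead are made-up placeholders, not the client's figures or any vendor's pricing.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # average number of hours in a month


@dataclass
class Workload:
    """One class of compute demand. All figures are hypothetical."""
    name: str
    active_hours_per_month: float  # hours the workload actually needs compute
    hourly_rate_eur: float         # assumed blended cost of one compute-hour


def always_on_cost(workloads):
    """Monthly cost when each workload's compute runs around the clock."""
    return sum(w.hourly_rate_eur * HOURS_PER_MONTH for w in workloads)


def on_demand_cost(workloads, idle_overhead=0.10):
    """Monthly cost when compute is billed only while active.

    idle_overhead is an assumed fraction covering warm-up time and the
    delay before idle compute is suspended.
    """
    return sum(
        w.hourly_rate_eur * w.active_hours_per_month * (1 + idle_overhead)
        for w in workloads
    )


def saving_fraction(before, after):
    """Relative saving, e.g. 0.4 for a 40% reduction."""
    return (before - after) / before


# Hypothetical inputs for illustration only.
workloads = [
    Workload("etl", active_hours_per_month=300, hourly_rate_eur=20.0),
    Workload("analytics", active_hours_per_month=400, hourly_rate_eur=30.0),
    Workload("reporting", active_hours_per_month=100, hourly_rate_eur=10.0),
]

before = always_on_cost(workloads)
after = on_demand_cost(workloads)
print(f"always-on: EUR {before:,.0f}")
print(f"on-demand: EUR {after:,.0f}")
print(f"saving:    {saving_fraction(before, after):.0%}")
```

The useful part of a model like this is the sensitivity: varying the active hours per workload shows quickly whether the conclusion holds under pessimistic usage assumptions.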
Foundation and tooling (weeks 5–10)
Set up multi-cluster Snowflake warehouses, with a separate warehouse each for ETL, analytics, and reporting workloads so they don't compete for compute. Built migration pipelines with dbt and Airflow. Added data validation and reconciliation at each step. This is where many migrations fail: teams assume the data arrived correctly without checking.
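As an illustration of what a reconciliation step can look like (a sketch, not the project's actual checks), the snippet below compares a row count and an order-independent checksum between source and target. The function names are hypothetical, and it assumes both sides have already been read into Python tuples with values normalised to the same types.

```python
import hashlib

_MODULUS = 2 ** 256  # keeps the running checksum at a fixed size


def table_fingerprint(rows):
    """Return (row_count, checksum) for an iterable of row tuples.

    Per-row SHA-256 digests are added modulo 2**256, so the result does
    not depend on row order, and duplicate rows still change the checksum.
    """
    count = 0
    checksum = 0
    for row in rows:
        digest = hashlib.sha256(repr(tuple(row)).encode("utf-8")).digest()
        checksum = (checksum + int.from_bytes(digest, "big")) % _MODULUS
        count += 1
    return count, checksum


def reconcile(source_rows, target_rows):
    """Compare source and target fingerprints and report what matched."""
    src_count, src_sum = table_fingerprint(source_rows)
    tgt_count, tgt_sum = table_fingerprint(target_rows)
    return {
        "source_rows": src_count,
        "target_rows": tgt_count,
        "row_count_match": src_count == tgt_count,
        "checksum_match": src_sum == tgt_sum,
    }


# Hypothetical sample data: same rows, different order.
source = [(1, "article"), (2, "video"), (3, "podcast")]
target = [(3, "podcast"), (1, "article"), (2, "video")]
print(reconcile(source, target))
```

In practice the normalisation is the hard part: decimals, timestamps, and nulls are often represented differently by each engine, so values need to be cast to a common form before hashing.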
Incremental migration (weeks 11–18)
Started with high-traffic datasets: user behavior and content metadata. Both platforms ran in parallel through the transition. I decommissioned Databricks notebooks gradually as Snowflake equivalents proved out, rather than doing a hard cutover.
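One way to make "proved out" concrete during a parallel run is a simple streak rule: a notebook becomes a decommissioning candidate only after its Snowflake equivalent has matched it for a number of consecutive daily checks. The rule, the function name, and the seven-day default below are assumptions for illustration, not the criteria used on this project.

```python
def ready_to_decommission(daily_checks, required_streak=7):
    """Return True if the most recent `required_streak` checks all passed.

    daily_checks: list of booleans, oldest first, one per parallel-run day,
    where True means both platforms produced matching results that day.
    """
    if required_streak < 1:
        raise ValueError("required_streak must be at least 1")
    if len(daily_checks) < required_streak:
        return False  # not enough history yet
    return all(daily_checks[-required_streak:])


# Hypothetical history: one early mismatch, then seven clean days.
history = [False] + [True] * 7
print(ready_to_decommission(history))
```

A single failed check resets the streak, which keeps flaky equivalents running in parallel until they are stable.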
Optimisation (weeks 19–22)
Right-sized compute based on actual usage patterns. Added materialized views for common reporting aggregations. Set up cost monitoring and alerts.
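The alerting logic can be as simple as two rules: flag any day that exceeds its share of the monthly budget, and flag any day that spikes well above the recent average. The sketch below assumes daily spend figures have already been exported as a list; the thresholds, window, and function name are hypothetical.

```python
from statistics import mean


def spend_alerts(daily_spend_eur, monthly_budget_eur,
                 days_in_month=30, spike_factor=1.5, window=7):
    """Return a list of (day_index, reason) alerts.

    A day is flagged as "over_daily_budget" when it exceeds an even share
    of the monthly budget, and as "spike" when it exceeds spike_factor
    times the mean of up to `window` preceding days.
    """
    alerts = []
    daily_budget = monthly_budget_eur / days_in_month
    for i, spend in enumerate(daily_spend_eur):
        if spend > daily_budget:
            alerts.append((i, "over_daily_budget"))
        history = daily_spend_eur[max(0, i - window):i]
        if history and spend > spike_factor * mean(history):
            alerts.append((i, "spike"))
    return alerts


# Hypothetical daily spend in EUR against a 30K monthly budget.
spend = [900, 950, 1000, 2000]
for day, reason in spend_alerts(spend, monthly_budget_eur=30_000):
    print(f"day {day}: {reason}")
```

The spike rule matters because a runaway query can double a day's spend while still sitting under the budget line early in the month.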
Results
Monthly infrastructure costs dropped from €50K+ to around €30K (roughly a 40% reduction). Most of the savings came from moving away from always-on clusters to Snowflake’s per-second billing.
Typical analytical queries ran roughly three times faster. Quarterly reports went from two hours to fifteen minutes. Concurrent analyst capacity doubled.
Data lineage is now fully tracked. There’s a central catalog. The self-service analytics that had been the original promise of the Databricks setup started getting used once people could find data and trust it.
What I’d do differently
I spent more time on the cost model at the start than I needed to. The architecture decision was right, but I could have reached it faster and spent that time on better tooling for schema evolution edge cases. There were Delta Lake schema changes mid-migration that required manual intervention I hadn’t fully planned for.
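The tooling I have in mind is a drift check that runs before each migration batch. A minimal sketch, assuming schemas are available as plain column-to-type dictionaries; the function names and the rule that only additive changes are safe to auto-apply are assumptions for illustration.

```python
def diff_schema(old, new):
    """Compare two {column_name: type_name} mappings.

    Returns the columns that were added, removed, or changed type
    between the old snapshot and the new one.
    """
    added = {c: t for c, t in new.items() if c not in old}
    removed = {c: t for c, t in old.items() if c not in new}
    retyped = {
        c: (old[c], new[c])
        for c in old.keys() & new.keys()
        if old[c] != new[c]
    }
    return {"added": added, "removed": removed, "retyped": retyped}


def needs_manual_review(diff):
    """Assumed policy: new columns can flow through automatically,
    while dropped or retyped columns stop the batch for a human."""
    return bool(diff["removed"] or diff["retyped"])


# Hypothetical snapshots of one table's schema, taken a week apart.
last_week = {"user_id": "bigint", "event_ts": "timestamp", "score": "int"}
today = {"user_id": "bigint", "event_ts": "timestamp",
         "score": "double", "device": "string"}

drift = diff_schema(last_week, today)
print(drift)
print("manual review needed:", needs_manual_review(drift))
```

Running a check like this against the source before every batch turns a mid-migration surprise into a blocked pipeline run with a clear reason attached.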
Client details anonymized. Metrics are from the actual project.
Eddie Beloiu
Freelance Data Platform Engineer · Munich