3 things that break when you move Databricks notebooks to dbt
The three SQL syntax, schema reference, and state management problems I keep seeing when teams migrate from Databricks to dbt; and how to fix each one.
I’ve helped multiple teams move their data transformations from Databricks notebooks to dbt. The case for doing it is usually clear: version control, proper testing, reproducible builds. But the migration itself keeps hitting the same three snags.
1. Spark SQL syntax differences
Databricks notebooks accumulate Spark-specific SQL over time. Some of it looks like standard SQL and isn’t. When you paste it into a dbt model, you get errors that aren’t always obvious to diagnose.
The most common examples:
current_date()instead ofCURRENT_DATE- Date arithmetic that uses
INTERVAL 30 DAYS(Spark) vsINTERVAL '30' DAY(ANSI SQL) - Column names with special characters handled differently
The fix is a translation pass before migration. Make a reference sheet of your Databricks functions and their standard SQL equivalents, and go through the notebooks systematically:
-- Databricks
SELECT * FROM my_table WHERE date_col > current_date() - INTERVAL 30 DAYS
-- Standard SQL (dbt)
SELECT * FROM my_table WHERE date_col > CURRENT_DATE - INTERVAL '30' DAY
2. Missing schema references
Databricks resolves table references implicitly against the cluster’s default database. dbt doesn’t. When you move the same query, it fails because there’s no schema prefix.
Related problems:
- Hardcoded environment-specific database names
- Missing grants for the dbt service account
Use dbt’s {{ ref() }} function and configure schema properly:
-- In a dbt model
SELECT *
FROM {{ ref('staging_orders') }}
WHERE order_date >= '{{ var("start_date") }}'
# dbt_project.yml
models:
your_project:
staging:
schema: staging
marts:
schema: analytics
3. State management and incremental logic
Databricks notebooks sometimes rely on Spark’s caching between cells, or on checkpoint files on DBFS. Neither of these exists in dbt. Incremental patterns that look fine in a notebook need to be rethought.
Specific failure modes:
- Assuming intermediate data is still cached when the next cell runs
- Complex incremental logic built around Spark DataFrames rather than SQL
- Late-arriving data that the original notebook handled manually
dbt’s incremental materialization handles most of these cases cleanly:
{{ config(materialized='incremental') }}
SELECT
order_id,
customer_id,
order_total,
order_date
FROM {{ ref('raw_orders') }}
{% if is_incremental() %}
WHERE order_date > (SELECT MAX(order_date) FROM {{ this }})
{% endif %}
Migration approach
Don’t try to convert everything at once. I usually start with the models that run most often; they have the most test coverage and the clearest business logic. Convert simple SELECT statements first, then work up to the incremental models once you have confidence in the simpler ones.
Keep a migration log. Which notebooks map to which dbt models, what was changed during translation, what tests were added. That document pays for itself the first time someone asks why a model looks different from what’s in the old notebook.
What I’d do differently next time: set up the dbt schema configuration before touching any SQL. Most of the schema-related failures I’ve seen happened because people started converting models before the environments were properly configured.
Eddie Beloiu
Freelance Data Platform Engineer · Munich