3 things that break when you move Databricks notebooks to dbt

I’ve helped multiple teams move their data transformations from Databricks notebooks to dbt. The case for doing it is usually clear: version control, proper testing, reproducible builds. But the migration itself keeps hitting the same three snags.

1. Spark SQL syntax differences

Databricks notebooks accumulate Spark-specific SQL over time. Some of it looks like standard SQL and isn’t. When you paste it into a dbt model, you get errors that aren’t always obvious to diagnose.

The most common examples:

current_date() instead of CURRENT_DATE
Date arithmetic that uses INTERVAL 30 DAYS (Spark) vs INTERVAL '30' DAY (ANSI SQL)
Column names with special characters handled differently

The fix is a translation pass before migration. Make a reference sheet of your Databricks functions and their standard SQL equivalents, and go through the notebooks systematically:

-- Databricks
SELECT * FROM my_table WHERE date_col > current_date() - INTERVAL 30 DAYS

-- Standard SQL (dbt)
SELECT * FROM my_table WHERE date_col > CURRENT_DATE - INTERVAL '30' DAY

2. Missing schema references

Databricks resolves table references implicitly against the cluster’s default database. dbt doesn’t. When you move the same query, it fails because there’s no schema prefix.

3. State management and incremental logic

Databricks notebooks sometimes rely on Spark’s caching between cells, or on checkpoint files on DBFS. Neither of these exists in dbt! Incremental patterns that look fine in a notebook need to be re-written.

Specific failure modes:

Assuming intermediate data is still cached when the next cell runs
Complex incremental logic built around Spark DataFrames rather than SQL
Late-arriving data that the original notebook handled manually

dbt’s incremental materialization handles most of these cases cleanly:

{{ config(materialized='incremental') }}

SELECT
  order_id,
  customer_id,
  order_total,
  order_date
FROM {{ ref('raw_orders') }}

{% if is_incremental() %}
  WHERE order_date > (SELECT MAX(order_date) FROM {{ this }})
{% endif %}

Migration approach

Don’t try to convert everything at once. I usually start with the models that run most often, as they have the most test coverage and the clearest business logic. Convert simple SELECT statements first, then work up to the incremental models once you have confidence in the simpler ones.

Keep a migration log. Which notebooks map to which dbt models, what was changed during translation, what tests were added. That document pays for itself the first time someone asks why a model looks different from what’s in the old notebook!

What I’d do differently next time: set up the dbt schema configuration before touching any SQL. Most of the schema-related failures I’ve seen happened because people started converting models before the environments were properly configured.

Moving Databricks notebooks to dbt and want these snags mapped before they cost you a sprint? That is the kind of work I take on. Get in touch.