Databricks Lakeflow Declarative Pipelines (Delta Live Tables) in Practice
How Lakeflow Declarative Pipelines (formerly Delta Live Tables) work in production: streaming tables vs materialized views, AUTO CDC, data quality expectations, and cost tuning.
Databricks
This post was written by an engineer at QueryPlane. QueryPlane is an app builder for your database: bring your own postgres db and you can create interactive applications to share with other developers, coworkers or even your customers. If you’re interested in trying it out, get started here.
If you ran data pipelines on Databricks any time between 2022 and mid-2025, you almost certainly built them with Delta Live Tables (DLT): a declarative framework where you describe each table as a query and Databricks figures out the dependency graph, the streaming vs batch incrementalization, the data-quality checks, and the orchestration. At the 2025 Data + AI Summit Databricks rebranded the product as Lakeflow Declarative Pipelines under a broader Lakeflow umbrella, open-sourced the core engine as Apache Spark Declarative Pipelines, and made some of the most useful syntax (notably CREATE STREAMING TABLE outside the pipeline DDL) generally available on Databricks SQL.
The model has not changed: you still author tables and views declaratively, the runtime still owns ordering, retries, checkpoint state, and quality enforcement, and the materialization unit is still a Delta table. What changed is the surface — new keyword set on SQL warehouses, new product names in the UI, and a path to run the same pipelines on plain open-source Spark. The migration is essentially a find-and-replace plus a UI re-mapping. The interesting work is everything underneath: where streaming tables fit versus materialized views, how the AUTO CDC flow (legacy DLT’s APPLY CHANGES INTO) actually maintains type-2 history, why the data_quality block is the cheapest production guardrail in the entire Databricks platform, and which patterns make a pipeline cheap rather than just correct.
This post walks through Lakeflow Declarative Pipelines the way a team that already runs Databricks should think about them: pick the right primitive, write the smallest amount of declarative code, let the runtime handle the hard parts, and pay attention to the few places where the runtime’s defaults don’t match your workload.
In this post, we’ll cover:
- The DLT → Lakeflow rebrand — what changed, what didn’t, and what to call things now
- Streaming tables vs materialized views — the two primitives every pipeline composes
- Writing pipelines — SQL and Python syntax, dependencies, and the dependency graph
- Expectations — declarative data quality without writing tests
- Change data capture —
AUTO CDCflows for SCD type 1 and type 2 - Triggered vs continuous, serverless compute, and target lag — choosing the right execution mode
- Production patterns — Auto Loader ingestion, multi-hop medallion architecture, schema evolution
- Pitfalls — the handful of footguns that are not obvious from the docs
What changed in the rebrand
The user-facing rename is the most visible part. Delta Live Tables is now Lakeflow Declarative Pipelines; the broader Lakeflow product also includes Lakeflow Connect for managed ingestion connectors and Lakeflow Jobs (the new name for Databricks Workflows) for orchestrating arbitrary tasks. Inside the workspace UI the navigation moved from “Delta Live Tables” to “Pipelines” under the Data Engineering section. The databricks.com documentation has redirected most /delta-live-tables/ URLs to /declarative-pipelines/ paths, but the historical content is still indexed.
What didn’t change is the programming model. The same @dlt.table Python decorator works, the same CREATE OR REFRESH ... SQL syntax works, and existing pipelines continue to run against the same dlt Python module name. New code on the open-source Spark Declarative Pipelines API uses the updated pipelines module — from pyspark import pipelines as dp and @dp.table in place of @dlt.table — but Databricks has been explicit that the legacy import dlt path will keep working indefinitely for backward compatibility. If you have hundreds of pipelines in production, you do not need to migrate them.
The genuinely new technical capability is that streaming tables and materialized views are now first-class objects on Databricks SQL warehouses, not just inside a pipeline. You can run CREATE STREAMING TABLE or CREATE MATERIALIZED VIEW directly in a SQL editor or databricks_sql_query Terraform resource and the runtime handles the same incremental refresh that DLT always did. For teams that wanted DLT semantics without standing up a separate pipeline cluster, this is the bigger win than the rename.
The third change is the open-source path via Spark Declarative Pipelines. The same declarative API now exists in upstream Apache Spark, which means a pipeline you write on Databricks can — with some effort — run on a vanilla Spark cluster. In practice almost no one will move away from the managed runtime (you lose Auto Loader, Unity Catalog integration, and serverless compute), but the open source story matters for organizations whose procurement requires “no vendor lock-in on the compute layer.”
Streaming tables vs materialized views
Every Lakeflow pipeline is composed of two primitives. Picking the right one for each table is the most consequential design decision in the entire framework.
A streaming table is incrementally updated from a streaming source. The source must be append-only — a Delta table with Change Data Feed, a Kafka topic, an Auto Loader directory, or another streaming table. The runtime maintains checkpoint state, so each run only processes records that arrived since the last run. The query body must be a streaming-compatible SQL or DataFrame expression (no ORDER BY over the whole input, no global aggregations without watermarking). The result is a Delta table that grows as new records arrive.
A materialized view is a recomputable view of one or more sources. Each refresh re-evaluates the query against the current state of the sources and writes the result to a Delta table. The query body is arbitrary — joins, full aggregations, window functions over the whole dataset — and the runtime tries to use incremental refresh when the query shape allows it (Databricks publishes a list of incremental-eligible patterns). When the query is not incremental-eligible, a refresh becomes a full recompute.
The decision rule we use in practice: if the source is a stream and the transformation is row-by-row (or per-key with a watermark), use a streaming table. If the transformation requires looking at multiple rows together — aggregations, joins, window functions — use a materialized view. A typical bronze table reading raw events from S3 is a streaming table. A typical silver table that joins events to dimension data is a streaming table (with a static or slowly-changing join). A typical gold table that aggregates by day and customer is a materialized view.
The cost model differs significantly. Streaming tables are cheap because each refresh only processes the delta, but you pay for checkpoint state and you pay continuously if the pipeline runs in continuous mode. Materialized views are expensive on full refresh — a daily aggregation over a 5 TB events table is a 5 TB scan — but cheap on incremental refresh when the query pattern allows it. The single largest cost optimization in most Lakeflow accounts is converting a materialized view whose query is incrementally maintainable to actually be incrementally maintained, which usually means rewriting the query to avoid the operations that block incremental refresh: DISTINCT, OFFSET, certain window functions, and non-deterministic functions like CURRENT_TIMESTAMP().
Writing your first pipeline
The SQL form of a pipeline is the easiest to read.
-- Bronze: raw events from cloud storage
CREATE OR REFRESH STREAMING TABLE events_bronze
COMMENT "Raw events ingested from S3 with Auto Loader"
AS SELECT *
FROM STREAM read_files(
's3://my-bucket/events/',
format => 'json',
schemaLocation => 's3://my-bucket/_schemas/events/'
);
-- Silver: enriched with dimension data
CREATE OR REFRESH STREAMING TABLE events_silver
COMMENT "Events joined to customer dimension"
AS SELECT
e.event_id,
e.user_id,
e.event_type,
e.event_time,
c.customer_tier,
c.region
FROM STREAM(events_bronze) e
LEFT JOIN customers c ON e.user_id = c.id;
-- Gold: daily aggregation
CREATE OR REFRESH MATERIALIZED VIEW events_daily
COMMENT "Daily event counts by type and region"
AS SELECT
DATE(event_time) AS day,
region,
event_type,
COUNT(*) AS event_count,
COUNT(DISTINCT user_id) AS unique_users
FROM events_silver
GROUP BY DATE(event_time), region, event_type;
Three tables, three queries, no orchestration code. The runtime parses the dependency graph from the table references (STREAM(events_bronze) declares the silver table depends on the bronze table; the gold view’s FROM events_silver declares its dependency), schedules the refreshes in topological order, and maintains state for the streaming tables across runs.
The STREAM(...) syntax matters. Writing FROM events_bronze without STREAM would read the table as a static snapshot, which would defeat the streaming semantics. The runtime requires STREAM(...) on every streaming-source reference to make the streaming/batch distinction unambiguous in the query body.
The Python form gives you more control and is the right choice when you need to compute table names dynamically, conditionally include columns, or share helper logic across tables.
import dlt
from pyspark.sql.functions import col, current_timestamp
@dlt.table(
name="events_bronze",
comment="Raw events ingested from S3 with Auto Loader",
table_properties={"quality": "bronze"},
)
def events_bronze():
return (
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/events/")
.load("s3://my-bucket/events/")
.withColumn("_ingest_ts", current_timestamp())
)
@dlt.table(name="events_silver", comment="Events joined to customer dimension")
def events_silver():
events = dlt.readStream("events_bronze")
customers = spark.read.table("LIVE.customers")
return events.join(customers, events.user_id == customers.id, "left")
dlt.readStream("events_bronze") is the Python equivalent of STREAM(events_bronze) and reads the table incrementally. spark.read.table("LIVE.customers") reads the customers table as a static snapshot — the LIVE. schema prefix tells the runtime this reference lives inside the same pipeline (and so should respect its dependency graph), but the underlying read is batch.
Expectations
Expectations are the single most under-used feature in Lakeflow. They are declarative data-quality constraints: a boolean SQL expression that should hold for every row, with an action to take when a row fails. In SQL:
CREATE OR REFRESH STREAMING TABLE events_silver (
CONSTRAINT valid_event_id EXPECT (event_id IS NOT NULL) ON VIOLATION DROP ROW,
CONSTRAINT recent_event EXPECT (event_time > current_timestamp() - INTERVAL 7 DAYS),
CONSTRAINT valid_user_id EXPECT (user_id RLIKE '^[0-9a-f]{36}$') ON VIOLATION FAIL UPDATE
)
AS SELECT ...;
Three constraints with three different actions. DROP ROW silently filters offending rows out of the result and increments a violation counter in the pipeline event log. The default action (no ON VIOLATION clause) is to let the row through and record the violation — useful for monitoring data drift without breaking the pipeline. FAIL UPDATE aborts the entire pipeline run when any row violates the constraint, which is the right setting for invariants the rest of the system depends on.
The Python form uses decorators:
@dlt.table(name="events_silver")
@dlt.expect_or_drop("valid_event_id", "event_id IS NOT NULL")
@dlt.expect("recent_event", "event_time > current_timestamp() - INTERVAL 7 DAYS")
@dlt.expect_or_fail("valid_user_id", "user_id RLIKE '^[0-9a-f]{36}$'")
def events_silver():
...
The violation counters are queryable from the pipeline event log:
SELECT
details:flow_progress.data_quality.expectations[*].name AS expectation,
details:flow_progress.data_quality.expectations[*].failed_records AS failures,
details:flow_progress.data_quality.expectations[*].passed_records AS passes
FROM event_log(TABLE(my_pipeline))
WHERE event_type = 'flow_progress'
ORDER BY timestamp DESC;
In practice we recommend treating expectations as the primary data quality layer and skipping a separate testing framework for the constraints they cover. The cost of adding an expectation is one SQL line, the runtime evaluates it on every row without a separate scan, and the violation counts surface in the pipeline UI without any dashboard work. Reserve Great Expectations or Soda for tests that cross multiple tables (referential integrity, sum-of-children-equals-parent invariants) or that need to run on a schedule independent of the pipeline.
See what QueryPlane can build for you
Connect to your database, write SQL with AI, and build shareable apps — all from your browser.
Change data capture with AUTO CDC
The hardest part of any pipeline that ingests from a transactional source is correctly applying inserts, updates, and deletes. The naive approach — read the full table on a schedule and overwrite — works for small dimensions but melts down for anything large or update-heavy. Lakeflow’s answer is the AUTO CDC flow, a declarative SCD applier that knows how to maintain SCD type 1 (overwrite) and type 2 (history) targets from a CDC stream. The legacy DLT name for the same operation was APPLY CHANGES INTO; the Lakeflow-era surface wraps it as a named CREATE FLOW ... AS AUTO CDC INTO statement, and existing pipelines using the legacy form continue to run.
A typical setup ingests a stream of CDC events (from Lakeflow Connect, Fivetran, Debezium, or a Kafka topic of outbox-pattern events) into a bronze streaming table, then applies those changes into a silver SCD table.
-- Bronze: raw CDC events
CREATE OR REFRESH STREAMING TABLE customers_cdc
AS SELECT * FROM STREAM read_files(
's3://my-bucket/cdc/customers/',
format => 'json'
);
-- Silver SCD type 1: current state
CREATE OR REFRESH STREAMING TABLE customers_current;
CREATE FLOW customers_current_flow AS AUTO CDC INTO customers_current
FROM STREAM(customers_cdc)
KEYS (customer_id)
APPLY AS DELETE WHEN operation = 'DELETE'
SEQUENCE BY change_lsn
COLUMNS * EXCEPT (operation, change_lsn);
-- Silver SCD type 2: full history
CREATE OR REFRESH STREAMING TABLE customers_history;
CREATE FLOW customers_history_flow AS AUTO CDC INTO customers_history
FROM STREAM(customers_cdc)
KEYS (customer_id)
APPLY AS DELETE WHEN operation = 'DELETE'
SEQUENCE BY change_lsn
COLUMNS * EXCEPT (operation, change_lsn)
STORED AS SCD TYPE 2;
KEYS declares the primary key. SEQUENCE BY declares the ordering column the runtime uses to decide which change to apply when multiple changes arrive for the same key — almost always a log sequence number (change_lsn), a LSN, or a high-resolution timestamp. APPLY AS DELETE WHEN declares the condition under which a record is a delete rather than an upsert (the source’s CDC format usually carries this as a column). The flow name (customers_current_flow, customers_history_flow) shows up in the pipeline UI and event log alongside the table refresh, which makes it easier to attribute row counts and failures when one target has multiple flows.
The type 2 form materializes __START_AT and __END_AT columns and writes one row per version of each key. The current row for a key has __END_AT IS NULL; historical rows have a non-null end timestamp from the SEQUENCE BY column. Queries that want the current state filter for __END_AT IS NULL; queries that want the state at a point in time filter for __START_AT <= t AND (__END_AT IS NULL OR __END_AT > t).
The most common subtle bug here is choosing the wrong SEQUENCE BY column. A wall-clock timestamp from the CDC source works until two events arrive in the same second, after which the runtime applies them in an unpredictable order and the resulting history is wrong. Always prefer a monotonically-increasing log sequence number (the CDC source’s LSN for Postgres, the SCN for Oracle, the binlog position for MySQL) or, when only a timestamp is available, append a tie-breaker column.
Triggered vs continuous, serverless, and target lag
Lakeflow pipelines run in one of two modes. Triggered mode starts the pipeline, refreshes every table to consistency with the source, and stops. Continuous mode keeps the pipeline running and processes new records as they arrive. Most teams should default to triggered and switch to continuous only when end-to-end freshness below ~5 minutes actually matters to a downstream consumer.
The cost difference is large. A continuous pipeline pays for a running cluster 24/7. A triggered pipeline on serverless compute pays per execution and scales the cluster down to zero between runs. For batch-y workloads — daily reports, hourly aggregations, even 15-minute refreshes — triggered serverless is dramatically cheaper. The Databricks pricing page for Lakeflow lists DBU consumption per hour for both modes; in practice the cross-over for a small pipeline is around an hour between triggered runs.
Materialized views also expose a target_lag setting that controls how often the runtime tries to refresh:
CREATE OR REFRESH MATERIALIZED VIEW events_hourly
SCHEDULE EVERY 1 HOUR
AS SELECT ...;
A SCHEDULE EVERY 1 HOUR means the runtime will refresh the view at least once an hour. If the upstream tables have not changed, the refresh is a no-op; if they have, the refresh kicks off. Combined with serverless compute, this gives you near-zero idle cost for views that only need hourly freshness.
Serverless compute is the default for new pipelines as of late 2025. The classic-compute path (where you specify a pipeline cluster configuration in the pipeline settings) still exists and is the right choice for workloads that need specific Photon versions, custom init scripts, or instance types not available on serverless. For everything else serverless is faster to start, cheaper to idle, and removes the cluster-configuration tuning surface that older DLT users spent significant time on.
Production patterns
The medallion architecture — bronze for raw, silver for cleaned, gold for analytics-ready — is the canonical Lakeflow pattern because the runtime’s primitives map onto it naturally. A typical production pipeline looks like:
-
Bronze streaming tables read raw data from S3 / Kafka / Lakeflow Connect with Auto Loader. Schema is inferred or evolved automatically. Expectations check for parse-level errors but otherwise let everything through.
-
Silver streaming tables clean, deduplicate, join to dimensions, and apply CDC. Expectations enforce row-level business rules (
event_timeis in the recent past,user_idis well-formed).AUTO CDCflows materialize SCD targets from CDC streams. -
Gold materialized views aggregate to the granularity each downstream consumer needs. One view per dashboard family is the right level of granularity: don’t try to build a single “wide” gold table that serves everyone.
Auto Loader is the right ingest pattern for almost every bronze table that reads from cloud storage. The runtime tracks which files have been processed using a transactional log under schemaLocation, so a re-run never double-processes a file and a backfill is a single FULL REFRESH away. The schema evolution mode is set per-table — addNewColumns is the safe default and lets new top-level columns appear without breaking downstream consumers.
CREATE OR REFRESH STREAMING TABLE events_bronze
AS SELECT * FROM STREAM read_files(
's3://my-bucket/events/',
format => 'json',
schemaLocation => 's3://my-bucket/_schemas/events/',
schemaEvolutionMode => 'addNewColumns'
);
A second pattern worth knowing is forward-fill backfill from history. When you add a new bronze table that reads from a Kafka topic with limited retention, the first run only sees records from the retention horizon onward. To backfill from S3 archives without writing a separate one-time job, define a second streaming source over the archive directory in the same table and UNION them:
CREATE OR REFRESH STREAMING TABLE events_bronze
AS
SELECT * FROM STREAM read_files('s3://my-bucket/archive/events/', format => 'json')
UNION ALL
SELECT * FROM STREAM read_kafka(
bootstrapServers => 'kafka.internal:9092',
subscribe => 'events'
);
The runtime maintains separate checkpoints for each source. The archive directory replays once at the first run and then stops producing rows; the Kafka source streams continuously from then on. The result is a single bronze table with full history and no orchestrating code.
A third pattern is using materialized views as cached joins for fast-changing dimensions. A common failure mode is a streaming silver table that joins to a customers table — if the join is a stream-static join, the runtime can’t push down predicates and the join becomes a full table scan per micro-batch. Replacing the static customer reference with a materialized view that is itself maintained from a CDC stream lets the runtime maintain a much smaller, indexed dimension snapshot that the streaming join can use efficiently.
Observability and the event log
Every pipeline writes a structured event log to a table you can query. The event log is the canonical source for “what happened” — refresh start/stop times, row counts, expectation pass/fail counts, schema changes, errors. The query pattern is the same regardless of pipeline name:
-- Last 24 hours of flow progress events
SELECT
timestamp,
details:flow_progress.metrics.num_output_rows AS rows_written,
details:flow_progress.data_quality.dropped_records AS dropped_records,
details:flow_progress.status AS status
FROM event_log(TABLE(my_pipeline))
WHERE event_type = 'flow_progress'
AND timestamp > current_timestamp() - INTERVAL 24 HOURS
ORDER BY timestamp DESC;
The other event types worth knowing: user_action records who triggered a run; update_progress records run-level state; flow_definition records the dependency graph; data_quality records expectation evaluations (also rolled up under flow_progress).
For production alerting, the right pattern is to materialize a subset of the event log into a Databricks SQL alert or pipe it to your existing observability stack (Datadog, Grafana, Honeycomb). The two metrics worth always-on alerting on are: any flow status other than COMPLETED for more than one consecutive run, and any expectation with ON VIOLATION FAIL UPDATE that fired.
Common pitfalls
A few traps catch teams the first time they put a Lakeflow pipeline into production.
FULL REFRESH on a streaming table re-processes everything from the source. Streaming tables maintain checkpoint state so re-runs are incremental. A FULL REFRESH resets that checkpoint and re-reads the entire source — for a Kafka topic with full retention, that’s everything since the topic’s beginning. The right pattern is to use targeted refresh: REFRESH TABLE my_table rather than REFRESH TABLE my_table FULL unless you specifically need to re-process from scratch (a corrupted state, a schema change you want applied retroactively).
Materialized views fall off the incremental path silently when you add a non-deterministic function. A view query that includes CURRENT_TIMESTAMP(), RAND(), or certain non-deterministic UDFs cannot be incrementally maintained — the runtime falls back to a full refresh every run. The drop in cost-efficiency is dramatic and the cause is invisible in the SQL. Always check the pipeline UI’s “Incremental refresh” badge after deploying a new materialized view. If it says “Full refresh,” look for non-deterministic expressions in the query.
Expectations on a CDC source with APPLY AS DELETE WHEN evaluate before the delete is interpreted. If you put an expectation like EXPECT (email IS NOT NULL) on a CDC bronze table, a delete event (which carries the old row’s columns or all-null columns depending on the source) can fire the violation. Either move the expectation to the SCD target (after the AUTO CDC flow has resolved deletes) or write the expectation to be aware of the operation column: EXPECT (operation = 'DELETE' OR email IS NOT NULL).
LIVE. schema references are pipeline-local. A LIVE.customers reference in a Python pipeline only resolves to a table defined in the same pipeline. Cross-pipeline references go through the catalog and schema directly (my_catalog.silver.customers), and the runtime treats them as static (no incremental hookup, no automatic refresh ordering). Plan pipeline boundaries with this in mind: putting two tables in different pipelines means losing the runtime’s ability to coordinate their refresh.
AUTO CDC does not support arbitrary MERGE logic. The runtime’s SCD applier handles insert, update, delete, and SCD type 2 history. It does not handle arbitrary merge clauses like “update only if the new value is greater than the old value.” For non-standard merge logic, fall back to a streaming table that materializes the CDC log directly and an explicit MERGE INTO in a separate Lakeflow Jobs task, or use a forEachBatch Python pipeline.
Schema evolution during a continuous run can pause the pipeline. When Auto Loader detects a new column under schemaEvolutionMode => 'addNewColumns', the pipeline stops, picks up the new schema, and restarts. This is correct behavior but it surprises teams who expect zero-downtime evolution. The duration depends on cluster startup time; on serverless this is usually under a minute. For accounts that cannot tolerate the pause, set the schema mode to rescue (new columns end up in a _rescued_data JSON column without a schema change) and promote them to top-level on a planned schema update.
The Hive metastore catalog does not support every Lakeflow feature. Streaming tables and materialized views as first-class objects on Databricks SQL require Unity Catalog. If your workspace is still on the Hive metastore, you can run the legacy DLT pipelines but you cannot use CREATE STREAMING TABLE outside a pipeline. The Unity Catalog migration is a separate (and usually larger) project, but it unblocks every modern feature the platform ships.
Frequently asked questions
Are Delta Live Tables and Lakeflow Declarative Pipelines the same thing?
Yes. Databricks renamed DLT to Lakeflow Declarative Pipelines at the 2025 Data + AI Summit as part of unifying ingestion (Lakeflow Connect), transformation (Lakeflow Declarative Pipelines), and orchestration (Lakeflow Jobs, the new name for Workflows) under one product family. Existing DLT pipelines continue to run, the dlt Python module name still works, and the SQL syntax is unchanged.
When should I use a streaming table vs a materialized view? Use a streaming table when the source is append-only (Kafka topic, Auto Loader directory, CDC stream) and the transformation is row-by-row or per-key with a watermark. Use a materialized view when the transformation aggregates, joins arbitrarily, or windows over the whole input. Streaming tables are cheaper on incremental refresh; materialized views can be incrementally refreshed only when the query shape allows it.
Can I run Lakeflow pipelines on open-source Spark? The core declarative engine is being open-sourced as Apache Spark Declarative Pipelines. The basic primitives — declarative tables, dependency graphs, the SDK — run on open-source Spark. Managed features like Auto Loader, Unity Catalog integration, expectations dashboards, and serverless compute are Databricks-only. In practice most teams that adopt declarative pipelines stay on the managed runtime to keep those features.
Do expectations slow down the pipeline? Almost never measurably. An expectation is a boolean SQL predicate evaluated per row during the same scan that produces the output. Compared to the work of reading and writing the row, the predicate evaluation is free. The exception is an expectation that calls a heavyweight UDF or a regex on a large column; in that case profile the pipeline run with the event log to see whether the expectation is on the hot path.
How do I migrate an existing DLT pipeline to Lakeflow? You don’t have to. Existing DLT pipelines run unchanged in the new runtime. The only forced changes are: the navigation in the workspace UI (Pipelines, not DLT), URLs in any documentation links you maintain, and any internal references to the product name. New pipelines should use the Lakeflow naming in commit messages and docs but the code is identical.
Can a single pipeline span multiple catalogs or schemas?
A pipeline targets one catalog and schema at a time, but LIVE. references resolve within the pipeline regardless of the target. You can read from other catalogs as static sources (SELECT * FROM other_catalog.silver.customers), but the runtime treats those as snapshots — no incremental hookup, no refresh ordering. If you need a pipeline whose output spans multiple schemas, run multiple pipelines and orchestrate them with Lakeflow Jobs.
How do I handle late-arriving data in streaming tables?
The Spark Structured Streaming watermark API is the right answer. Set a watermark on the event-time column (withWatermark("event_time", "1 hour")) and the runtime will hold state for the watermark window, accepting late records inside the window and dropping records past it. For per-row transformations without aggregations, late records are accepted and processed as they arrive — no watermark needed.
Do streaming tables guarantee exactly-once processing? Yes, within a single sink. The runtime uses Delta’s transactional write semantics and Structured Streaming’s checkpoint state to guarantee that each input record produces one output write, even on retries. The guarantee does not extend to side effects in non-deterministic UDFs (an external API call inside a UDF can be repeated on retry), so keep side-effecting code out of pipeline expressions.
Is AUTO CDC always faster than writing a manual MERGE?
For the SCD type 1 and type 2 patterns it supports, yes — the runtime can use Delta’s DELETE and MERGE internals with better statistics and incremental hookup than a hand-written job. For non-standard merge logic (windowed-merge, conditional updates, multi-source merge) you’ll need a manual MERGE INTO in a Lakeflow Jobs task, but if your CDC-applier needs are standard, AUTO CDC (still known as APPLY CHANGES INTO in legacy DLT pipelines) is faster, cheaper, and shorter to write.
Where does Lakeflow Connect fit?
Lakeflow Connect is the managed-connector layer that lands raw CDC data from operational sources (Salesforce, Workday, SQL Server, ServiceNow) directly into Delta tables. The output of a Connect ingestion is a streaming table that you then read in your pipeline with STREAM(...). If your sources are all on cloud storage or Kafka, you don’t need Connect; if they’re SaaS or operational databases, Connect is usually faster to set up than rolling your own Debezium pipeline.
Can I query Lakeflow pipeline state in a database GUI? Yes. Streaming tables and materialized views are Delta tables under the hood — any tool that can connect to a Databricks SQL warehouse can read them. We use QueryPlane’s Databricks integration to inspect pipeline output and run ad-hoc analytical queries against the gold layer without spinning up a separate Spark session.
Wrapping up
Lakeflow Declarative Pipelines (the artist formerly known as Delta Live Tables) is the right default for new data pipelines on Databricks. The mental model is small: pick streaming tables for append-only or per-key transformations, materialized views for aggregations and joins; layer expectations on every silver table for row-level data quality; use AUTO CDC flows for SCD applies; run triggered serverless unless freshness below five minutes is a hard requirement. Everything else — orchestration, checkpoint state, dependency graphs, retry logic, CDC apply mechanics — is what the runtime handles for you.
The rebrand from DLT to Lakeflow doesn’t change any of that. Existing pipelines keep running, the SQL and Python syntax is unchanged, and the migration is essentially zero-effort. What it does change is the surface area: streaming tables and materialized views are now first-class on Databricks SQL warehouses, the engine is being open-sourced under Apache Spark, and the navigation in the workspace reflects the broader Lakeflow umbrella alongside Lakeflow Connect and Lakeflow Jobs.
If you’re choosing a SQL editor to inspect pipeline output, debug expectation violations, or build interactive tools on top of your gold layer, the QueryPlane Databricks integration connects to any SQL warehouse in a couple of clicks — bring your workspace URL and a personal access token. The rest of our Databricks coverage starts with the best Databricks GUI tools for a survey of the editor landscape, and our Databricks Liquid Clustering in practice post covers the table-layout strategy that pipeline-produced streaming tables and materialized views now use by default.