Databricks Liquid Clustering in Practice

How Liquid Clustering replaces partitioning and Z-order on Delta tables: CLUSTER BY, AUTO mode, OPTIMIZE behavior, predictive optimization, and the migration from Z-order.

This post was written by an engineer at QueryPlane.

For most of the history of Delta Lake, the answer to “how do I lay out this table for queries” was a combination of PARTITIONED BY at table creation time and OPTIMIZE … ZORDER BY afterward. That advice held up well for a long time — it’s how the canonical Bronze/Silver/Gold patterns in every Databricks reference architecture are laid out — but it forces two decisions early that are expensive to change later: which columns to partition on, and which columns to Z-order on. Get them wrong and you either rewrite the whole table or live with skew and small-file fragmentation that no amount of OPTIMIZE can fully repair.

Liquid Clustering is the layout strategy Databricks released to replace both. It went GA on Delta tables in mid-2024, expanded to Unity Catalog managed tables, and through 2025-2026 became the default recommendation in Databricks’ own docs and reference templates. With Liquid Clustering you skip PARTITIONED BY entirely, declare CLUSTER BY on one to four columns (or CLUSTER BY AUTO and let Databricks pick), and rely on OPTIMIZE to reorganize files using a Hilbert-curve-based algorithm that — unlike Z-order — is incremental: it touches only newly-written data on each run, so a daily OPTIMIZE no longer has to rewrite the entire table.

This post walks through what Liquid Clustering actually does under the hood, how CLUSTER BY (manual and AUTO) interacts with OPTIMIZE and Predictive Optimization, how to migrate an existing partitioned + Z-ordered table (in place on recent runtimes, or with a one-time rewrite on older ones), and the production patterns and pitfalls that surface once a Liquid-clustered table is the warehouse default.

In this post, we’ll cover:

Where Liquid Clustering fits — vs partitioning, vs Z-order, and when not to use any of them
What Liquid Clustering actually does — Hilbert curve clustering, file metadata, and skipping
CLUSTER BY syntax and AUTO mode — declaring keys, changing them, and letting Databricks pick
OPTIMIZE behavior — incremental clustering, full-clustering rewrites, and what OPTIMIZE FULL is for
Predictive Optimization — the managed OPTIMIZE + VACUUM layer and what it costs
Migration from PARTITIONED BY + Z-order — ALTER TABLE CLUSTER BY, the trade-off with already-partitioned data
Production patterns and monitoring — DESCRIBE DETAIL, clusteringColumns, file size targets, alerting
Pitfalls — too many keys, low-cardinality keys, AUTO surprises, partition-collision migrations, VACUUM interactions

Where Liquid Clustering fits

Liquid Clustering is the answer for almost any Delta table that has either a partition column that’s becoming a maintenance burden or a Z-order column that’s becoming expensive to maintain.

The decision rule against partitioning is simple. Partitioning physically separates files into directories, which is helpful only if (1) the partition column has low to moderate cardinality, (2) most queries filter on it, and (3) each partition holds well more than a single target-sized file (Databricks’ guidance is at least roughly 1 GB of data per partition). If any one of those fails, partitioning costs you more than it saves: you end up with tens of thousands of tiny partitions, the metadata becomes the bottleneck, and queries that don’t filter on the partition column scan the whole table anyway. Liquid Clustering keeps the data in a single unpartitioned directory, lets Databricks decide which files contain which key ranges, and skips files at query time using the per-file min/max metadata that Delta already records.
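
Condition (3) is simple arithmetic, and a minimal Python sketch makes it concrete. The function names and the 1 GiB floor are assumptions chosen for illustration, not a Databricks API:

```python
GIB = 1024 ** 3

def avg_partition_bytes(table_bytes: int, distinct_partition_values: int) -> float:
    """Average data per partition if the table were partitioned on this column."""
    if distinct_partition_values <= 0:
        raise ValueError("need at least one partition value")
    return table_bytes / distinct_partition_values

def partitioning_is_reasonable(table_bytes: int,
                               distinct_partition_values: int,
                               min_partition_bytes: int = GIB) -> bool:
    """True when the average partition clears the assumed size floor."""
    return avg_partition_bytes(table_bytes, distinct_partition_values) >= min_partition_bytes

table = 2 * 1024 * GIB  # a 2 TiB table

# Daily partitions over three years: about 1.9 GiB each, above the floor.
print(partitioning_is_reasonable(table, 3 * 365))

# Partitioning the same table by 50,000 user IDs: about 42 MiB each, far below it.
print(partitioning_is_reasonable(table, 50_000))
```

The check only covers partition size; conditions (1) and (2) still have to hold for partitioning to pay off.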

The decision against Z-order is even simpler. Z-order works (it produces a multi-dimensional ordering that lets Delta skip files on multiple predicates at once), but every OPTIMIZE … ZORDER BY is a full table rewrite. On a 5 TB Bronze table that costs hours of warehouse time daily, and each run leaves a full superseded copy of the table that VACUUM can’t reclaim until it ages out of the retention window. Liquid Clustering uses a Hilbert curve (better data locality than the Z-curve for similar cardinality) and only re-clusters the files written since the last run. The first OPTIMIZE on a freshly converted table still does meaningful work, but the steady-state cost drops by roughly an order of magnitude.

Liquid Clustering is not the right answer for tables that are small enough not to need any layout strategy (under a few GB total), append-only tables where every query is a full scan, or tables with a single, very high-cardinality column where a sort-merge layout is what you actually want. For everything in between — which is the overwhelming majority of production Delta tables — it’s the new default.

What Liquid Clustering actually does

Under the hood, a Liquid-clustered table looks identical to an unpartitioned Delta table: a single directory of Parquet files plus a _delta_log of transaction commits. The clustering itself shows up in two places. First, the table properties carry a clusteringColumns list — visible in DESCRIBE DETAIL and SHOW TBLPROPERTIES — that records the current clustering keys. Second, the Delta log records, per file, both the min/max statistics for the clustering columns (the standard Delta data-skipping metadata) and a Hilbert-space coordinate that OPTIMIZE uses to decide which files belong together.

When OPTIMIZE runs, it groups files by their Hilbert coordinate and rewrites them into target-sized output files. The key property is that two files with nearby Hilbert coordinates will share similar values across all clustering columns at once. This is the multi-dimensional analogue of sorting on a single column: a query that filters on any subset of the clustering keys can skip files whose min/max ranges don’t intersect the predicate.
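
The skipping step itself is easy to picture in code. The following is a minimal sketch of min/max file pruning in plain Python; the FileStats class, the column names, and the integer ranges are all hypothetical, and the real engine reads these statistics from the Delta log rather than from Python objects:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileStats:
    path: str
    min_vals: dict  # column -> smallest value stored in the file
    max_vals: dict  # column -> largest value stored in the file

def may_contain(f: FileStats, predicate: dict) -> bool:
    """predicate maps column -> (low, high), both inclusive.

    A file is skipped as soon as one predicate range misses its [min, max].
    """
    for col, (low, high) in predicate.items():
        if f.max_vals[col] < low or f.min_vals[col] > high:
            return False
    return True

def files_to_scan(files, predicate):
    return [f.path for f in files if may_contain(f, predicate)]

# Four files laid out so that each covers a compact box in (day, user) space.
files = [
    FileStats("part-0", {"day": 1, "user": 0},   {"day": 3, "user": 499}),
    FileStats("part-1", {"day": 1, "user": 500}, {"day": 3, "user": 999}),
    FileStats("part-2", {"day": 4, "user": 0},   {"day": 6, "user": 499}),
    FileStats("part-3", {"day": 4, "user": 500}, {"day": 6, "user": 999}),
]

print(files_to_scan(files, {"day": (5, 5)}))                       # filter on one key
print(files_to_scan(files, {"user": (10, 20)}))                    # filter on the other key
print(files_to_scan(files, {"day": (2, 2), "user": (600, 700)}))   # filter on both
```

Because each file covers a compact range on both columns, a filter on either key alone prunes half the files and a filter on both prunes three quarters. That is the property clustering is trying to produce.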

The Hilbert curve maps an N-dimensional point to a single 1-D coordinate while preserving locality — points that are near each other in N-D space stay near each other on the curve. That property is what makes per-file min/max skipping efficient on multi-key filters. The Z-curve has similar locality on average but worse worst-case behavior, especially when the clustering columns have very different cardinalities. Liquid Clustering’s choice of Hilbert is one of the reasons it outperforms Z-order on the same column set.
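
To make the locality difference concrete, here is a small two-dimensional sketch using the textbook Hilbert and Z-order (Morton) index-to-coordinate mappings. It illustrates the curves themselves, not Databricks’ implementation, which is not public at this level of detail; the function names are hypothetical:

```python
def hilbert_d2xy(side: int, d: int) -> tuple[int, int]:
    """Map position d along a Hilbert curve to (x, y) on a side x side grid.

    side must be a power of two. This is the classic iterative construction.
    """
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Reflect the sub-square before rotating it.
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def morton_d2xy(bits: int, d: int) -> tuple[int, int]:
    """Map position d along a Z-order curve to (x, y) by de-interleaving bits."""
    x = y = 0
    for i in range(bits):
        x |= ((d >> (2 * i)) & 1) << i
        y |= ((d >> (2 * i + 1)) & 1) << i
    return x, y

def max_step(path) -> int:
    """Largest Manhattan distance between consecutive points on the path."""
    return max(abs(x1 - x0) + abs(y1 - y0)
               for (x0, y0), (x1, y1) in zip(path, path[1:]))

SIDE, BITS = 8, 3  # an 8 x 8 grid
hilbert_path = [hilbert_d2xy(SIDE, d) for d in range(SIDE * SIDE)]
z_path = [morton_d2xy(BITS, d) for d in range(SIDE * SIDE)]

print("Hilbert max step:", max_step(hilbert_path))
print("Z-order max step:", max_step(z_path))
```

Every step along the Hilbert path moves to an adjacent cell, while the Z-order path periodically jumps across the grid. Files cut from contiguous runs of the Hilbert ordering therefore tend to cover tighter boxes, which is what keeps their min/max ranges narrow.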

The other reason is incrementality. With Z-order, the algorithm needs the entire table to compute the global Z-coordinate ordering, so any new data invalidates the previous layout. With Liquid Clustering, new files get a Hilbert coordinate at write time, and OPTIMIZE only re-clusters the files that have been touched (or are too small) since the previous run. That changes the steady-state cost from O(table size) to O(new data) — usually a 5-20× reduction on a busy Bronze table.
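
A back-of-the-envelope model shows where a ratio in that range comes from. The numbers below are hypothetical, and the model deliberately ignores small-file compaction and any one-off re-clustering work:

```python
def rewritten_gb(initial_table_gb: float, new_gb_per_day: float,
                 days: int, incremental: bool) -> float:
    """Total GB rewritten by a daily layout job over the given number of days.

    incremental=False models a full rewrite of the whole table each day;
    incremental=True models touching only that day's new data.
    """
    total = 0.0
    table = initial_table_gb
    for _ in range(days):
        table += new_gb_per_day
        total += new_gb_per_day if incremental else table
    return total

# A 5 TB table ingesting 500 GB/day, observed over one week.
full = rewritten_gb(5000, 500, 7, incremental=False)
incr = rewritten_gb(5000, 500, 7, incremental=True)
print(full, incr, full / incr)  # 49000.0 3500.0 14.0
```

The ratio grows with the size of the table relative to its daily ingest, which is why the saving is largest on big, steadily appended tables.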

CLUSTER BY syntax and AUTO mode

Declaring clustering at table creation time uses the familiar Delta DDL with CLUSTER BY in place of PARTITIONED BY:

CREATE OR REPLACE TABLE events (
  event_id STRING,
  user_id STRING,
  event_type STRING,
  event_ts TIMESTAMP,
  payload STRING
)
USING DELTA
CLUSTER BY (event_ts, user_id);

Up to four columns are supported. Order matters less than with Z-order, because the Hilbert encoding treats all keys symmetrically, but it doesn’t disappear: the first column is the one used most often for data skipping when only a single predicate is supplied. Put the column that the largest number of read patterns filter on first, the column behind the next-most-common filter second, and so on.

The AUTO form delegates the choice to Databricks:

CREATE OR REPLACE TABLE events (
  event_id STRING,
  user_id STRING,
  event_type STRING,
  event_ts TIMESTAMP,
  payload STRING
)
USING DELTA
CLUSTER BY AUTO;

CLUSTER BY AUTO requires Unity Catalog and Predictive Optimization to be enabled. Behind the scenes, Databricks profiles query history against the table — query_history system tables surface which columns get filtered on and at what selectivity — and updates the clustering columns when the workload diverges from the current key set. A re-clustering only happens if the change is large enough to be worth the cost, so the table doesn’t thrash. Verify what AUTO chose with DESCRIBE DETAIL after the table has seen a few thousand queries; the clusteringColumns column updates as the auto-selection converges.

Changing clustering keys later is a one-line operation:

ALTER TABLE events CLUSTER BY (event_ts, event_type);

Or, to switch a manually-clustered table to AUTO:

ALTER TABLE events CLUSTER BY AUTO;

This metadata-only change updates the clustering key list but does not immediately re-cluster the data. New writes land with the new Hilbert coordinate; old files stay put until the next OPTIMIZE rewrites them. That’s intentional — it means changing keys is a cheap operation, and the layout converges over the next few OPTIMIZE cycles instead of all at once.

To disable clustering entirely on a table, use ALTER TABLE … CLUSTER BY NONE. The clustering metadata is dropped and the table becomes a plain unpartitioned, unclustered Delta table.

OPTIMIZE behavior

The OPTIMIZE you’ve already been running for Z-order and bin-packing still applies, but it behaves differently on a Liquid-clustered table:

OPTIMIZE events;

This re-clusters only the files written or modified since the last OPTIMIZE, plus any files that are below the target size and can be combined. The output is a smaller number of larger files with Hilbert coordinates that respect the clustering keys. Crucially, you never pass ZORDER BY on a Liquid-clustered table — the clustering keys are already declared at the table level, so OPTIMIZE knows what to do.

Two variants matter: FULL, and FULL combined with a WHERE predicate. OPTIMIZE … FULL (DBR 16.4 LTS and above) forces a full re-clustering of all files in the table, ignoring the per-file “already clustered” marker:

OPTIMIZE events FULL;

You almost never want OPTIMIZE FULL on a steady-state table — the whole point of Liquid is to avoid full rewrites — but it’s the right tool right after a CLUSTER BY key change if you want the new layout to apply to all historical data immediately, rather than converging gradually.

For back-filling a subset — re-clustering a single date range after a corrected upstream load, for example — combine FULL with a WHERE predicate. On a Liquid-clustered table, OPTIMIZE … FULL WHERE (DBR 18.1+) is the supported syntax for partial reclustering (plain OPTIMIZE … WHERE is the older partition-filter form and is not supported on Liquid tables):

OPTIMIZE events FULL WHERE event_ts >= '2026-06-01' AND event_ts < '2026-06-08';

A useful operational property: OPTIMIZE on a Liquid-clustered table is idempotent in the absence of new writes. Running it twice in a row produces no work on the second run, because every file is already at its target Hilbert coordinate and target size.
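
The selection logic described in this section can be modeled in a few lines. This is a toy sketch, not Databricks’ algorithm: the clustered flag stands in for the per-file marker, the 256 MB target is an assumed value, and "re-clustering" is reduced to merging candidates into target-sized outputs:

```python
from dataclasses import dataclass

TARGET_MB = 256  # assumed target file size, for illustration only

@dataclass(frozen=True)
class DataFile:
    name: str
    size_mb: int
    clustered: bool

def optimize(files, run: int, full: bool = False):
    """Return (new_file_list, number_of_files_rewritten) for one toy run."""
    if full:
        candidates = list(files)  # FULL ignores the per-file marker
    else:
        candidates = [f for f in files
                      if not f.clustered or f.size_mb < TARGET_MB]
        # A lone small file that is already clustered has nothing to merge with.
        if len(candidates) == 1 and candidates[0].clustered:
            candidates = []
    if not candidates:
        return list(files), 0

    keep = [f for f in files if f not in candidates]
    remaining = sum(f.size_mb for f in candidates)
    out = []
    while remaining > 0:
        size = min(TARGET_MB, remaining)
        out.append(DataFile(f"run{run}-{len(out)}", size, True))
        remaining -= size
    return keep + out, len(candidates)

table = [
    DataFile("a", 256, True),
    DataFile("b", 256, True),
    DataFile("new-1", 40, False),
    DataFile("new-2", 60, False),
]

table, n1 = optimize(table, run=1)             # only the two new files are rewritten
table, n2 = optimize(table, run=2)             # nothing changed, so no work
table, n3 = optimize(table, run=3, full=True)  # FULL rewrites every file
print(n1, n2, n3)
```

The second run is a no-op because every remaining file is already marked clustered and there is nothing left to combine, which is the idempotency property described above.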

Predictive Optimization

The natural conclusion to “you should be running OPTIMIZE regularly” is “have the platform do it for you,” and that’s what Predictive Optimization is. Enabled at the catalog or schema level, Predictive Optimization runs OPTIMIZE, VACUUM, and (for Liquid-clustered tables) re-clustering automatically, billed against a system-managed serverless compute pool.

The scheduling logic is workload-aware. Tables that see frequent writes get optimized more often; cold tables get optimized rarely or not at all. The DBU spend shows up in system.billing.usage under the predictive_optimization SKU, and Databricks publishes guidance that the average ROI is 1.5-2× the spend — i.e., the query-side savings exceed the optimize-side cost. Validate that on your own tables before turning it on for everything: it tends to over-deliver on warm tables with multi-column filters and under-deliver on small or cold tables.

To enable it for an entire catalog:

ALTER CATALOG main ENABLE PREDICTIVE OPTIMIZATION;

Or for a single schema:

ALTER SCHEMA main.events ENABLE PREDICTIVE OPTIMIZATION;

The same DDL supports DISABLE PREDICTIVE OPTIMIZATION to turn it off and INHERIT PREDICTIVE OPTIMIZATION to reset to the parent catalog’s setting.

For Liquid Clustering specifically, Predictive Optimization changes one more thing: with CLUSTER BY AUTO, the system both picks the clustering keys and runs the re-clustering, so a freshly-created table requires no ongoing tuning. That’s the configuration Databricks now recommends as the default for new managed tables on Unity Catalog. Manual CLUSTER BY (col1, col2) with Predictive Optimization enabled is the right configuration when you have strong opinions about which columns matter and want the OPTIMIZE schedule managed.

Migration from PARTITIONED BY + Z-order

Most production Delta tables you’re going to convert were created with PARTITIONED BY (date_col) and have a periodic OPTIMIZE … ZORDER BY (user_id, event_type) job. Liquid Clustering is not compatible with partitioning or Z-order — a table is one or the other — so the migration has to replace both, not layer clustering on top.

On a recent runtime (DBR 18.1+) the cleanest path is the in-place conversion, which Databricks recommends because it minimizes reader and writer downtime and works for both managed and external tables:

ALTER TABLE events REPLACE PARTITIONED BY WITH CLUSTER BY (event_ts, user_id);
OPTIMIZE events;

This removes the partitioning and declares the Liquid keys in a single statement (you can convert to CLUSTER BY AUTO the same way); the following OPTIMIZE re-clusters the existing data.

On older runtimes without REPLACE PARTITIONED BY, do a one-time rewrite into a new clustered table with CTAS. Note that CLUSTER BY goes right after the table name, not inside the SELECT — and CREATE TABLE … LIKE can’t take a CLUSTER BY clause, so it has to be CTAS. For very large tables, create the table from the first partition range and backfill the rest with INSERT INTO … SELECT:

CREATE OR REPLACE TABLE events_v2
CLUSTER BY (event_ts, user_id)
AS SELECT * FROM events
WHERE event_ts >= '2024-01-01' AND event_ts < '2025-01-01';

INSERT INTO events_v2
SELECT * FROM events
WHERE event_ts >= '2023-01-01' AND event_ts < '2024-01-01';

-- repeat for each range, then swap the names

Then drop the old table or rename it for safekeeping. The cost is one full rewrite — but it’s a one-time cost, after which the steady-state OPTIMIZE runs incrementally.
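
For tables with many ranges, writing the statements by hand gets tedious. A minimal sketch of a generator for the CTAS-plus-backfill sequence above might look like this; the function name and argument shapes are hypothetical, and identifiers are interpolated without quoting, so it should only be fed trusted names:

```python
from datetime import date

def backfill_statements(source: str, target: str, cluster_cols: list[str],
                        ts_col: str, boundaries: list[date]) -> list[str]:
    """Build one CTAS for the newest range and one INSERT per older range.

    boundaries is a sorted list of dates; consecutive pairs form half-open
    [low, high) ranges. Identifiers are NOT escaped (trusted input only).
    """
    if len(boundaries) < 2:
        raise ValueError("need at least two boundaries to form one range")
    ranges = list(zip(boundaries[:-1], boundaries[1:]))
    ranges.reverse()  # newest range first, matching the example above

    statements = []
    for i, (low, high) in enumerate(ranges):
        where = (f"WHERE {ts_col} >= '{low.isoformat()}' "
                 f"AND {ts_col} < '{high.isoformat()}'")
        if i == 0:
            statements.append(
                f"CREATE OR REPLACE TABLE {target}\n"
                f"CLUSTER BY ({', '.join(cluster_cols)})\n"
                f"AS SELECT * FROM {source}\n{where};"
            )
        else:
            statements.append(
                f"INSERT INTO {target}\nSELECT * FROM {source}\n{where};"
            )
    return statements

stmts = backfill_statements(
    "events", "events_v2", ["event_ts", "user_id"], "event_ts",
    [date(2023, 1, 1), date(2024, 1, 1), date(2025, 1, 1)],
)
print("\n\n".join(stmts))
```

Each generated statement can then be run in order from a notebook or job, with the table swap done by hand once the row counts match.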

For the Z-order side, the cutover is simpler: stop running OPTIMIZE … ZORDER BY and start running plain OPTIMIZE (or enable Predictive Optimization to handle it). The first plain OPTIMIZE will re-cluster all the files written since the last Z-order pass; subsequent ones will be incremental.

One specific gotcha worth calling out: tables created with PARTITIONED BY (col1) ZORDER BY (col1, col2) — using the same column for both — are particularly good candidates for migration. The partitioning was buying you the file-skipping that Liquid Clustering provides natively, and the Z-order was buying you the within-file ordering that Liquid Clustering’s incremental algorithm provides at a fraction of the cost.

Production patterns and monitoring

The first thing to wire up on a Liquid-clustered table is visibility into the clustering state. DESCRIBE DETAIL is the operational primitive:

DESCRIBE DETAIL events;

The output includes clusteringColumns (the current key set), numFiles, sizeInBytes, partitionColumns, and properties — enough to confirm the table is Liquid-clustered and on which columns. DESCRIBE DETAIL is the right primitive for “is this table set up correctly,” not for tracking clustering progress; that signal lives in DESCRIBE HISTORY and the system tables.

For monitoring clustering progress and file size distribution, run DESCRIBE HISTORY and look at the operationMetrics on each OPTIMIZE commit. DESCRIBE HISTORY is a utility statement, not a table you can wrap in a SQL subquery — query the DataFrame it returns instead (its operationMetrics is a map<string,string>, so cast the values you do math on):

from pyspark.sql import functions as F

# OPTIMIZE commits report numAddedFiles / numRemovedFiles / numAddedBytes in
# operationMetrics (numFiles and numOutputBytes belong to WRITE commits).
(spark.sql("DESCRIBE HISTORY events")
    .where("operation = 'OPTIMIZE'")
    .select(
        "version",
        "timestamp",
        F.col("operationMetrics.numAddedFiles").cast("long").alias("files_added"),
        F.col("operationMetrics.numRemovedFiles").cast("long").alias("files_removed"),
        (F.col("operationMetrics.numAddedBytes").cast("double")
            / F.col("operationMetrics.numAddedFiles").cast("double")).alias("avg_output_file_size"),
    )
    .orderBy(F.col("version").desc())
    .limit(20)
    .display())  # display() is a Databricks notebook helper; use .show() elsewhere

As a rule of thumb, the target file size for Liquid-clustered tables sits in the low hundreds of MB. A sustained drop in average output file size below ~64 MB means OPTIMIZE isn’t keeping up; consistently very large files mean the rewrites are getting expensive. A steadily rising count of files added per run, combined with a falling average output file size, is the most common “OPTIMIZE schedule is too infrequent” pattern.

For VACUUM interactions, the same retention rules apply as on any Delta table: the default delta.deletedFileRetentionDuration is 7 days, and VACUUM cleans up files that are older than that and no longer referenced by the current version. Liquid Clustering’s incremental nature is actively friendly to VACUUM here: the churn per OPTIMIZE is small, so only a small volume of superseded files sits in the retention window waiting to be reclaimed. Z-ordered tables, by contrast, would often rewrite the entire table on each OPTIMIZE and end up with the entire previous version of the table sitting in the retention window. Switching to Liquid often produces a meaningful reduction in storage costs purely on the VACUUM side.

For alerting, the two signals worth wiring into your monitoring are: (1) OPTIMIZE commits stopping for longer than the expected cadence (no row in DESCRIBE HISTORY with operation = 'OPTIMIZE' over the last N hours), and (2) the average output file size after OPTIMIZE drifting outside the 64 MB - 512 MB band. Both indicate the OPTIMIZE schedule needs adjustment, or that Predictive Optimization isn’t running on the table for some reason (catalog setting changed, ownership change, etc.). For Unity Catalog managed tables, system.storage.predictive_optimization_operations_history provides the same signal at the account level without per-table querying.
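
Both signals reduce to a few comparisons once the OPTIMIZE commits have been collected. The sketch below assumes the commits arrive as a list of dicts with a timestamp plus numAddedFiles and numAddedBytes (string values, as in operationMetrics); that shape and the function name are hypothetical, and the 64 MB to 512 MB band is the one quoted above:

```python
from datetime import datetime, timedelta

MB = 1024 * 1024
MIN_AVG_BYTES = 64 * MB
MAX_AVG_BYTES = 512 * MB

def optimize_alerts(optimize_commits, now: datetime, max_gap: timedelta):
    """Return a list of alert strings for one table (empty list = healthy)."""
    if not optimize_commits:
        return ["no OPTIMIZE commits found"]

    alerts = []
    latest = max(optimize_commits, key=lambda c: c["timestamp"])

    # Signal 1: OPTIMIZE has not run within the expected cadence.
    if now - latest["timestamp"] > max_gap:
        alerts.append("OPTIMIZE cadence stalled")

    # Signal 2: average output file size drifted outside the band.
    files_added = int(latest["numAddedFiles"])
    if files_added > 0:
        avg_bytes = int(latest["numAddedBytes"]) / files_added
        if not (MIN_AVG_BYTES <= avg_bytes <= MAX_AVG_BYTES):
            alerts.append("average output file size outside 64-512 MB band")
    return alerts

now = datetime(2026, 6, 10, 12, 0)
healthy = [{"timestamp": datetime(2026, 6, 10, 6, 0),
            "numAddedFiles": "10", "numAddedBytes": str(10 * 200 * MB)}]
stale_and_small = [{"timestamp": datetime(2026, 6, 7, 6, 0),
                    "numAddedFiles": "50", "numAddedBytes": str(50 * 20 * MB)}]

print(optimize_alerts(healthy, now, timedelta(hours=24)))
print(optimize_alerts(stale_and_small, now, timedelta(hours=24)))
```

Only the most recent commit is inspected here; a production check would more likely look at a trend over several runs before paging anyone.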

Pitfalls

A few specific traps come up repeatedly when teams roll out Liquid Clustering.

Too many keys. The maximum is four, but the useful maximum is usually two or three. The Hilbert encoding loses locality as the dimensionality grows — on smaller tables (Databricks calls out under ~10 TB) four columns can skip fewer files than two, though the gap narrows as the table grows. Pick the two columns most queries filter on and stop there. Add a third only if you have a workload where the third column shows up in over 30% of queries as a co-predicate with the first two.

Low-cardinality keys. Liquid Clustering on a column with only a handful of distinct values is wasted work. The Hilbert coordinate collapses to a few buckets and OPTIMIZE produces no useful file skipping. Stick to columns with at least a few thousand distinct values per partition you’d otherwise have used — and remember that timestamp columns are extremely high-cardinality, which is why they’re almost always one of the keys.

AUTO mode surprises. CLUSTER BY AUTO is workload-driven, which means it can change its mind. A new dashboard that filters on a previously-ignored column can cause the auto-selected keys to flip, triggering a re-clustering that you didn’t schedule. The behavior is generally desirable — it’s keeping the layout aligned with how the table is actually queried — but it can show up as an unexplained DBU spike. Pin a manual CLUSTER BY (...) if you need a deterministic layout for a critical pipeline.

Migration partial state. Because Liquid Clustering is not compatible with partitioning, a plain ALTER TABLE … CLUSTER BY (...) is not a migration path for a partitioned table: it does not remove the partitioning, and the existing files stay in their partition directories. Don’t assume queries will start skipping across partitions until you’ve run the in-place conversion or the full rewrite that removes PARTITIONED BY. Check with DESCRIBE DETAIL after migration; the partitionColumns field should be empty for a true Liquid table.

OPTIMIZE FULL on a steady-state table. Tempting after a key change, but on a busy table it can cost as much as a full table rewrite and saturate the warehouse for hours. Prefer gradual convergence via the regular OPTIMIZE schedule unless you specifically need the new layout applied to historical data within the next hour. Predictive Optimization explicitly avoids OPTIMIZE FULL for the same reason.

Reading the clustering keys from the wrong place. Some monitoring scripts adapted from Z-order tables look for properties such as delta.optimizeWrite or delta.zorder.columns and report “no clustering” on Liquid tables. The clustering keys are exposed as a first-class clusteringColumns column in DESCRIBE DETAIL, so update the script to read that field instead of matching on Z-order-era property names.

VACUUM during clustering. Avoid running VACUUM immediately after a large OPTIMIZE that touched the majority of files. The just-rewritten old files are inside the retention window and VACUUM won’t clean them up; running it does nothing useful and adds Spark cluster cost. Schedule VACUUM on a separate, less-frequent cadence — Predictive Optimization will pick the right schedule if you let it own the table.

Mixing manual OPTIMIZE with Predictive Optimization. Running both creates redundant work — Predictive Optimization checks “is this table currently being optimized” and may skip its scheduled run, but it doesn’t always, so you end up paying for two re-clusterings on roughly the same data. Pick one: manual cron-driven OPTIMIZE or Predictive Optimization, not both on the same table.
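
The first two pitfalls, too many keys and low-cardinality keys, can be turned into a small selection heuristic. This is a sketch of the rules of thumb stated above, not of how CLUSTER BY AUTO works; the input shapes, the function name, and the 1,000-distinct-values cutoff are assumptions:

```python
from collections import Counter

def suggest_cluster_keys(query_filters, distinct_counts,
                         min_distinct: int = 1000,
                         third_key_share: float = 0.30):
    """Suggest up to three clustering keys.

    query_filters: list of sets, each holding the columns one query filtered on.
    distinct_counts: column -> approximate number of distinct values.
    """
    if not query_filters:
        return []
    # Drop low-cardinality columns before ranking.
    eligible = {c for c, n in distinct_counts.items() if n >= min_distinct}
    counts = Counter(c for q in query_filters for c in q if c in eligible)
    # Most-filtered first; ties broken by name so the result is deterministic.
    ranked = sorted(counts, key=lambda c: (-counts[c], c))

    keys = ranked[:2]
    if len(keys) == 2:
        for candidate in ranked[2:]:
            # Share of all queries that filter on the candidate AND both keys.
            together = sum(1 for q in query_filters
                           if candidate in q and set(keys) <= q)
            if together / len(query_filters) > third_key_share:
                keys.append(candidate)
                break
    return keys

queries = (
    [{"event_ts", "user_id", "event_type"}] * 4
    + [{"event_ts", "user_id", "country"}]
    + [{"event_ts", "user_id"}]
    + [{"event_ts", "country"}] * 3
    + [{"country"}]
)
cardinality = {"event_ts": 10_000_000, "user_id": 250_000,
               "event_type": 4_000, "country": 30}

print(suggest_cluster_keys(queries, cardinality))
```

In this example country is filtered on in half the queries but is excluded for having only 30 distinct values, and event_type earns the third slot because it appears alongside both leading keys in 40% of queries.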

Wrapping up

Liquid Clustering is the right default for new Delta tables on Databricks in 2026 and a worthwhile migration target for almost any existing partitioned + Z-ordered table. The combination of incremental OPTIMIZE, Hilbert-curve multi-dimensional skipping, and CLUSTER BY AUTO under Predictive Optimization means most teams can stop thinking about table layout entirely and still get better query performance than they had with manually-tuned partitioning. The migration cost is a one-time rewrite, and the steady-state cost drops substantially compared to the Z-order pattern it replaces.

The patterns that show up across production rollouts — two-key clustering on a timestamp plus a user-ID-like column, Predictive Optimization owning the schedule, monitoring OPTIMIZE cadence and average output file size via DESCRIBE HISTORY, removing the legacy PARTITIONED BY only when the workload justifies the rewrite — are durable enough to bake into a team standard. The biggest risk is over-using CLUSTER BY AUTO on tables that don’t justify it (small or cold tables get auto-clustered and the spend doesn’t pay back) and overstating the migration benefit on tables that already had a healthy partition layout.

If you’re inspecting Liquid-clustered tables, debugging clustering convergence, or building internal tools and dashboards on top of Databricks SQL, QueryPlane is a SQL editor and app builder that connects to your Databricks SQL warehouse and lets you build interactive apps over those tables — including dashboards that surface DESCRIBE DETAIL and DESCRIBE HISTORY metrics across your warehouse without writing your own observability layer. For a broader Databricks GUI comparison, our roundup of the best Databricks GUI tools covers the alternatives. If Liquid Clustering is the storage layer underneath a declarative pipeline, the Lakeflow Declarative Pipelines in Practice post covers how CLUSTER BY interacts with streaming tables and materialized views. And if the data is landing through Auto Loader, the Auto Loader in Practice post is the upstream piece.

Frequently asked questions

What is Liquid Clustering in Databricks? Liquid Clustering is a Delta table layout strategy that replaces both partitioning and ZORDER BY. You declare CLUSTER BY (col1, col2, ...) (up to four columns) or CLUSTER BY AUTO, and OPTIMIZE reorganizes files using a Hilbert-curve-based algorithm that runs incrementally — it only re-clusters files written since the last run, rather than rewriting the whole table.

How is Liquid Clustering different from Z-order? Z-order uses a Z-curve and requires rewriting every file in the table on each OPTIMIZE … ZORDER BY run. Liquid Clustering uses a Hilbert curve (better locality at higher cardinality) and runs incrementally — only newly-written or small files are re-clustered. The steady-state cost on a busy table is typically an order of magnitude lower, and VACUUM can reclaim space because Liquid doesn’t rewrite the whole table on every pass.

Should I still use PARTITIONED BY with Liquid Clustering? No. The recommended pattern is no PARTITIONED BY clause plus a CLUSTER BY declaration. Liquid Clustering’s per-file min/max metadata handles the data skipping that partitioning used to provide, without the small-files or metadata-bloat problems that come from too many partitions. If you already have a partitioned table, clustering can’t be layered on top of the partitioning: convert the table in place on a recent runtime, or do a one-time rewrite into a new unpartitioned, clustered table.

What is CLUSTER BY AUTO? A mode where Databricks chooses the clustering columns for you based on the table’s query history. It requires Unity Catalog and Predictive Optimization to be enabled. It updates the clustering keys when the workload changes substantially. Use DESCRIBE DETAIL to see what AUTO chose at any point in time; the clusteringColumns field shows the current selection.

How many columns can I cluster by? Up to four, but two or three is usually the right answer. The Hilbert encoding loses locality as the dimensionality grows, so four-column clustering often skips fewer files than two-column clustering on the same table. Pick the columns the largest fraction of queries filter on.

Do I still need to run OPTIMIZE manually? You can, but the easier answer is to enable Predictive Optimization at the catalog or schema level and let Databricks handle OPTIMIZE and VACUUM scheduling. Predictive Optimization is workload-aware (it optimizes warm tables more often and cold tables less) and is billed against a managed serverless pool, with typical ROI of 1.5-2× the spend.

What’s the difference between OPTIMIZE and OPTIMIZE FULL on a Liquid table? Plain OPTIMIZE re-clusters only files that have been written or modified since the last run, plus any below-target-size files that can be combined. OPTIMIZE FULL ignores the per-file “already clustered” marker and re-clusters every file — useful right after a CLUSTER BY key change if you want the new layout applied immediately to all historical data. On a steady-state table you almost never want OPTIMIZE FULL; it negates Liquid’s incrementality benefit.

How do I change the clustering keys on an existing Liquid table? ALTER TABLE my_table CLUSTER BY (new_col1, new_col2);. This is a metadata-only change — new writes immediately use the new keys, but existing files stay put until the next OPTIMIZE rewrites them. To force immediate re-clustering of all historical data, run OPTIMIZE my_table FULL after the ALTER.

Can I disable clustering on a Liquid table? Yes, with ALTER TABLE my_table CLUSTER BY NONE. This drops the clustering metadata and the table becomes a plain unpartitioned, unclustered Delta table. Future OPTIMIZE runs do only bin-packing without any clustering reorganization.

Does Liquid Clustering work on streaming tables and materialized views? Yes. Streaming tables and materialized views in Lakeflow Declarative Pipelines accept CLUSTER BY in the same way as regular Delta tables. The pipeline runtime handles OPTIMIZE scheduling for pipeline-managed tables, so you typically don’t need to wire up Predictive Optimization separately. The clustering applies to the underlying Delta table — query-path skipping works the same way.

Is Liquid Clustering supported on external tables? Liquid Clustering is fully supported on Unity Catalog managed tables and on external Delta tables managed through Unity Catalog. The CLUSTER BY AUTO mode specifically requires Unity Catalog because it depends on query history collected by the catalog. Manual CLUSTER BY (col1, col2) works on any Delta table, regardless of catalog.

When should I not use Liquid Clustering? Tables small enough not to need any layout strategy (under a few GB total, or under ~100 files); append-only tables where every query is a full scan; and tables where a sort-merge layout on a single column is what you actually want (rare — usually you’d use a clustered key instead). For every other Delta table on Databricks in 2026, Liquid Clustering with Predictive Optimization is the right default.