Snowflake Clustering Keys in Practice
How Snowflake clustering keys and micro-partitions work, when automatic clustering pays off, and how to read CLUSTERING_DEPTH and pruning stats.
Snowflake
This post was written by an engineer at QueryPlane. QueryPlane is an app builder for your database: bring your own postgres db and you can create interactive applications to share with other developers, coworkers or even your customers. If you’re interested in trying it out, get started here.
Snowflake does not have B-tree indexes. Query performance comes from a different mechanism — micro-partitions and partition pruning — and clustering keys are the dial you turn when natural insertion order has stopped doing that work for you. Most teams ignore clustering for years and are right to. The teams that do reach for it usually do so because a single fact table has grown into the tens or hundreds of terabytes, scan ratios on common queries have crept up, and the query history shows partition pruning falling apart.
This post is for that moment. It covers what a Snowflake micro-partition is, what CLUSTER BY actually changes, when automatic clustering pays for itself, and how to read the metadata Snowflake gives you to decide whether the keys you picked are doing their job.
In this post, we’ll cover:
- Micro-partitions - what they are, why they exist, and why pruning is the whole game
- What
CLUSTER BYdoes - and what it does not do - Choosing the right clustering key - cardinality, query shape, and column ordering
- Automatic clustering - when the credits are worth it
- Reading
CLUSTERING_DEPTHandCLUSTERING_INFORMATION- the only two tools you really need - Search optimization service - the alternative when point lookups dominate
- Common pitfalls - over-clustering, hashing keys, and reclustering on the wrong column
What is a Snowflake micro-partition?
A Snowflake micro-partition is a contiguous chunk of compressed columnar storage — typically 50 MB to 500 MB uncompressed, around 16 MB compressed — that holds 1,000 to a few million rows depending on row width. Every table is automatically split into these partitions on load; you do not create them, declare them, or size them. They are not like Postgres partitions or BigQuery shards. There is no per-partition DDL.
What Snowflake does track for every micro-partition is metadata: the minimum and maximum value of every column, a distinct-value count, and a few bookkeeping fields. That metadata is small enough to live in the services layer and is the entire reason analytical queries on a 50-terabyte table can return in seconds. When you write WHERE created_at >= '2026-01-01', the query planner consults the metadata, eliminates any partition whose MAX(created_at) < '2026-01-01', and only reads the rest. That is partition pruning, and it is what makes the warehouse feel cheap.
The catch: pruning only works if the partitions you do not need have predictable boundaries on the columns you filter by. A table loaded in created_at order prunes beautifully on created_at. The same table prunes badly on customer_id because customer rows are scattered across every partition.
What CLUSTER BY actually does
CLUSTER BY (col1, col2, ...) is a hint, not an index. It tells Snowflake “I would like the data in this table to be physically organized so that partitions are tight on these columns.” Snowflake then runs a background process — automatic clustering — that rewrites partitions in batches to converge on that layout. The clustering key is metadata; the actual benefit shows up in better min/max ranges per partition, which means better pruning, which means fewer partitions scanned, which means lower compute cost.
A few things CLUSTER BY is not:
- It is not a B-tree. There is no random-access lookup structure.
- It is not a partition definition. You cannot drop “the 2024 partition.” Snowflake decides how to lay rows out across micro-partitions; you only get to declare what should be co-located.
- It is not free. Reclustering consumes credits, separate from the query workload, billed against an automatic-clustering service account.
- It is not instant. Newly inserted rows land in fresh partitions and need to be reclustered before they participate in the tighter layout.
ALTER TABLE events CLUSTER BY (event_date, customer_id);
That one statement schedules a reclustering pass on events. Snowflake will keep working in the background until the natural clustering depth on (event_date, customer_id) stabilizes, and will continue running periodic passes as new data lands. You do not have to schedule anything.
When to add a clustering key
The honest answer for most tables is never. Snowflake’s documentation explicitly recommends against defining a clustering key on tables smaller than ~1 TB, and even at multi-terabyte scale natural ingestion order frequently makes the column already well-clustered. The default behavior — micro-partition by insertion — handles the long tail of warehouse tables just fine.
You start considering a clustering key when:
- The table is materially large — call it 1 TB and above.
- Common filter columns are not the same as the natural insertion column. A bot-events table loaded by
created_atbut always queried bytenant_idis the canonical case. - Query history shows pruning getting worse over time. The
PARTITIONS_SCANNED / PARTITIONS_TOTALratio fromSNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORYis the right proxy; if it used to sit near 0.05 and is now closer to 0.40, the layout has drifted. - The credit cost of automatic clustering is less than the savings you expect on query compute. This is the easy-to-skip step.
If three of those are true, run the math. If only one or two are, leave it alone — clustering tables for the wrong reason wastes more credits than it saves on almost every workload smaller than ten terabytes.
Choosing the right clustering key
Picking the wrong clustering key is worse than not having one, because Snowflake will spend credits maintaining a layout that gives you no benefit. The rules are simpler than they look:
Pick columns that are common in WHERE and JOIN. A clustering key on event_date only pays off if event_date shows up in the predicates that drive your expensive queries. Audit the query history before you commit.
Pick moderate-cardinality columns. A clustering key on a high-cardinality column like user_id (millions of distinct values) gives every micro-partition a tiny range that prunes well for point lookups but is expensive to maintain — Snowflake has to rewrite partitions constantly because every new row is potentially out of order. A clustering key on a low-cardinality column like is_deleted (two values) gives you essentially no pruning. The sweet spot is hundreds to tens of thousands of distinct values per partition.
Order columns from low to high cardinality. CLUSTER BY (event_date, tenant_id) works better than CLUSTER BY (tenant_id, event_date) for most workloads. Snowflake co-locates rows by the leading column first, then by the next, so the leading column should be the one with the broadest, coarsest range. Date or month leading, then tenant, then user.
Cap the key at three or four columns. Each additional column in the clustering key makes reclustering more expensive and the layout less likely to be stable. If you find yourself reaching for five, your real problem is probably a missing aggregate table or materialized view.
Use expressions when you need coarser buckets. If your created_at is microsecond-precision and your queries filter by day, CLUSTER BY (DATE_TRUNC('day', created_at)) is usually better than CLUSTER BY (created_at). The truncated value has lower cardinality, the partitions get tighter on a per-day basis, and queries that filter by day will prune as well as they ever will.
-- Typical analytical fact table
ALTER TABLE events
CLUSTER BY (DATE_TRUNC('day', event_ts), tenant_id);
That gives you day-level pruning for time-range queries and tenant-level pruning inside each day. It is one of the safest defaults for an events-style table that is large enough to need clustering at all.
See what QueryPlane can build for you
Connect to your database, write SQL with AI, and build shareable apps — all from your browser.
Automatic clustering: when the credits are worth it
Automatic clustering is a Snowflake-managed service that runs in the background once you define a clustering key. It bills against a system account and reports cost in SNOWFLAKE.ACCOUNT_USAGE.AUTOMATIC_CLUSTERING_HISTORY. The bill is separate from your query warehouses, which means it is easy to miss when you are looking at the line items.
The economic question is straightforward: does the per-month reclustering cost beat the per-month query-compute savings the better layout enables? You can estimate both sides.
For the savings side, pick three to five common queries that filter on the prospective clustering key. Run them on the current layout and capture BYTES_SCANNED plus warehouse seconds. Then estimate the post-clustering scan ratio — if the column is well chosen, you can usually expect to read between one-tenth and one-third of what you do today. Multiply the seconds saved by the warehouse cost.
For the cost side, query AUTOMATIC_CLUSTERING_HISTORY after the table has been clustering for a week or two. The early days will be expensive because Snowflake is doing the bulk rewrite to get the table into shape, but steady-state cost depends almost entirely on how much new data lands per day and how out-of-order it is. Append-heavy tables loaded in clustering-key order cost almost nothing to maintain. Tables loaded out of order pay continuously.
If the savings beat the cost by 3x or more, the clustering key was a good choice. If they only beat it by 1.5x, you are paying real credits for a small improvement — usually a sign the wrong column was chosen or the table is not actually large enough. If the cost exceeds the savings, suspend automatic clustering (ALTER TABLE t SUSPEND RECLUSTER) and revert.
Reading CLUSTERING_DEPTH and CLUSTERING_INFORMATION
The two metadata functions every team should know are SYSTEM$CLUSTERING_DEPTH and SYSTEM$CLUSTERING_INFORMATION.
SYSTEM$CLUSTERING_DEPTH returns a single number that summarizes how well-clustered a table is on a given column or set of columns. The depth is the average number of micro-partitions that overlap on the clustering key. A depth of 1 is perfect — every partition has a non-overlapping range. A depth of 10 means a typical predicate ends up touching ~10 partitions where it ideally would have touched 1. The number is unitless and meant to be interpreted as “lower is better, and you want to see it dropping over time after you add a clustering key.”
SELECT SYSTEM$CLUSTERING_DEPTH('events', '(event_date, tenant_id)') AS depth;
SYSTEM$CLUSTERING_INFORMATION returns a JSON blob with the full distribution: number of micro-partitions, total partition count, average depth, depth histogram, and the column or expression list. It is the right function for an actual diagnostic rather than a check.
SELECT SYSTEM$CLUSTERING_INFORMATION('events', '(event_date, tenant_id)');
The histogram is the most useful field. A healthy distribution skews toward depth 1-4 with a long thin tail. An unhealthy distribution has a fat lump around depth 32, 64, or higher — that is a sign reclustering has not caught up, the key is too high cardinality, or the load pattern is too out-of-order for the clustering service to converge.
You can run both functions on tables without a defined clustering key, by passing the column list inline. That is how you preview whether clustering would help before committing to the operational cost.
-- Would clustering on (event_date, tenant_id) help?
SELECT SYSTEM$CLUSTERING_INFORMATION('events', '(event_date, tenant_id)');
If the simulated depth is already 1-2, leave the table alone — natural order is doing the job.
Search optimization service: the alternative for point lookups
Clustering keys are the right answer when your query pattern is range filters or equality filters on moderate-cardinality columns. They are the wrong answer when your pattern is point lookups on high-cardinality columns — finding a single order id, looking up one user by primary key, fetching a specific event by uuid.
For that pattern, Snowflake offers the search optimization service. Search optimization builds a secondary access path — closer to a B-tree in spirit — that lets point lookups skip the partition-pruning machinery entirely. It is separately billed, comparable in cost shape to automatic clustering, and far more appropriate when “find the one row with id = X” is your common shape. We cover the tradeoffs in our Snowflake search optimization service guide.
In rough terms:
| Workload shape | Best answer |
|---|---|
| Range scans on a date column | Natural insertion order, no clustering key |
| Equality on tenant/customer id (medium cardinality) | Clustering key |
| Point lookup on a uuid / primary key | Search optimization service |
| Sub-second dashboard with a fixed shape | Materialized view or dynamic table |
Mixing the wrong tool with the wrong workload is the most common reason teams burn credits without seeing performance improve.
Pitfalls
A few that show up in nearly every consulting engagement around Snowflake clustering:
Clustering on a hashed column. Engineers sometimes try CLUSTER BY (HASH(user_id)) to spread load evenly. The hash destroys the very range information clustering relies on, and pruning collapses to scanning the entire table. Hashing is occasionally useful for skew problems on a different axis, but it is almost never the right clustering input.
Clustering on a column that is already monotonic. If your data lands in event_ts order naturally, defining a clustering key on event_ts costs you reclustering credits for no improvement. Run SYSTEM$CLUSTERING_DEPTH before defining the key and confirm the depth is high enough to be worth fixing.
Reclustering on a column the query history never filters by. Painful and common. The clustering key was set during table creation based on a hypothesis that did not survive contact with the workload. Audit the query history with QUERY_HISTORY and ACCESS_HISTORY before defining or changing a key. If the filter column the team thinks they use is not actually in the top predicates, the clustering key should change.
Forgetting to monitor cost. Automatic clustering is billed against a system account, not the warehouse the queries run on, so the line item is easy to miss. Set up a resource monitor on the automatic clustering service account and review AUTOMATIC_CLUSTERING_HISTORY weekly for the first month after defining any new key.
Mixing clustering keys with frequent UPDATE traffic. Snowflake handles updates by writing new micro-partitions and tombstoning the old rows — heavy update workloads churn micro-partitions constantly, and a clustering key on the table will run up serious reclustering bills trying to catch up. Snowflake is not an OLTP store, and clustering keys assume an append-heavy workload. If you have a write pattern that looks more like OLTP, look at a streams-and-tasks pipeline or a different model entirely before reaching for clustering.
Defining a clustering key on the wrong table. The biggest table in the schema is not always the one that benefits most. Smaller fact tables that are queried in a tighter loop — a 200 GB events table hit by every dashboard — sometimes return more savings per credit than a 5 TB cold-archive table that is queried twice a month. Optimize for what is hot, not for what is largest.
A practical workflow
If you suspect a table needs a clustering key, the steps are:
- Audit the queries. Find the top 10 queries by warehouse seconds against the table. Identify the columns in their
WHEREandJOINpredicates. If one or two columns dominate, you have a candidate key. - Simulate. Run
SYSTEM$CLUSTERING_INFORMATION(table, '(col1, col2)')to see the depth distribution under the proposed key. If the simulated depth is already low, stop. If it is high, continue. - Define the key on a clone.
CREATE TABLE events_clustered CLONE events; ALTER TABLE events_clustered CLUSTER BY (...);lets you measure the impact without touching production. - Wait for reclustering to converge. Watch
AUTOMATIC_CLUSTERING_HISTORYuntil the cost-per-day flattens out. That is the steady-state maintenance cost. - Re-run the audit queries. Compare
BYTES_SCANNEDand warehouse seconds against the original table. Calculate savings. - Decide. If steady-state cost is < 1/3 of expected savings, swap the cluster onto the production table. If not, drop the clone and pick a different tactic — materialized view, dynamic table, or search optimization service.
You will know within two to three weeks whether the key was right. There is no benefit to waiting longer than that — automatic clustering converges quickly on the underlying table shape, and what you see in week three is approximately what you get.
Frequently asked questions
Do I need a clustering key on every Snowflake table? No. Most tables prune well enough on natural insertion order. The Snowflake documentation explicitly recommends against defining a clustering key on tables under 1 TB, and even above that threshold many tables do not benefit. Define a clustering key only when there is a specific, measured pruning problem the key would solve.
What is a Snowflake micro-partition? A micro-partition is a 50-500 MB chunk of columnar storage that Snowflake creates automatically on every load. It holds the row data, column statistics (min/max/distinct count), and metadata used for pruning. You do not create or size micro-partitions; Snowflake does it for you.
What is the difference between a clustering key and a primary key in Snowflake? A primary key in Snowflake is a metadata constraint that is not enforced — it does not create an index, prevent duplicates, or affect performance. A clustering key is a physical layout directive that causes Snowflake to organize data on disk to enable better pruning. They serve completely different purposes. Snowflake has no uniqueness-enforcing primary key in the OLTP sense.
How many columns should be in a Snowflake clustering key?
Two to three is the sweet spot for most workloads. More than four hurts maintenance cost and rarely improves pruning further. Order from low cardinality to high cardinality (e.g., (month, tenant, user_id)).
How do I know if a clustering key is helping?
Compare BYTES_SCANNED on representative queries before and after, and watch SYSTEM$CLUSTERING_DEPTH drop over the first one to two weeks after the key is defined. A drop from depth 30+ to depth 1-4 with a 3-10x reduction in bytes scanned is the typical successful outcome.
Does Snowflake automatically cluster tables without me defining a key?
Tables are organized by insertion order by default, which is “natural clustering” for time-loaded data. Snowflake does not run automatic reclustering on tables without an explicit CLUSTER BY; you have to opt in by defining one.
Can I have a clustering key on a column expression?
Yes. CLUSTER BY (DATE_TRUNC('day', created_at), tenant_id) is valid and often better than clustering on the raw column. Expression-based clustering is the right answer when the raw column is too high cardinality (e.g., microsecond timestamps) but a truncated version matches the query pattern.
What is the difference between clustering keys and the search optimization service? Clustering keys improve pruning for range and equality scans on moderate-cardinality columns. The search optimization service builds a secondary access path that accelerates point lookups on high-cardinality columns. They are not substitutes for each other — clustering helps “give me all events for tenant 42 last week,” search optimization helps “give me event with id 9f3b…”.
How much does automatic clustering cost?
It depends on table size, how out-of-order new data lands, and the cardinality of the clustering key. Steady-state cost is reported per-table in SNOWFLAKE.ACCOUNT_USAGE.AUTOMATIC_CLUSTERING_HISTORY. Initial reclustering (the first pass after defining a key) is often much more expensive than steady-state. Budget for a higher first month and a much smaller steady-state bill afterwards.
Wrapping up
Clustering keys in Snowflake are a precision tool, not a default. The default — micro-partitioning by insertion order — solves the pruning problem for the great majority of analytical tables, and adding a clustering key to a table that does not need one is a real cost with no return.
The teams that benefit are working with large, hot tables where the natural insertion column has stopped lining up with the query predicates. For them, CLUSTER BY is one of the most leveraged levers Snowflake offers: a well-chosen key on a hot fact table can drop scanned bytes by an order of magnitude and pay for itself many times over.
The mechanics are unforgiving in only a few places — wrong column, too-high cardinality, hashed inputs, OLTP-shaped update traffic — and easy to navigate everywhere else. Audit the query history before you commit, simulate the depth before you define the key, watch CLUSTERING_DEPTH and AUTOMATIC_CLUSTERING_HISTORY for the first two weeks, and you will know within the month whether the key earned its keep.
If you are deciding which Snowflake interface to use while running these diagnostics, our guide to the best Snowflake GUI tools covers the day-to-day options. And if your real problem is closer to “the table is too expensive for any clustering strategy to fix,” pair this with our warehouse sizing guide — the cheapest credit is the one you do not spend.