ClickHouse TTL and Data Skipping Indexes in Practice

Q: How is ClickHouse TTL different from a delete query?

A DELETE (or ALTER TABLE ... DELETE) is a one-shot mutation that rewrites every part containing the rows to delete. TTL is a declarative rule that the merge engine applies as it merges parts, so the cost is amortized across the merges that would happen anyway and the rule keeps firing forever without re-running.

Q: Can I have multiple TTL rules with different actions on the same table?

Yes — separate the expressions by commas in TTL .... The engine evaluates the rules in the order they appear and applies the first whose expression has fired. The standard pattern is one or more TO DISK / TO VOLUME rules followed by a final DELETE rule.

Q: How do I tell whether a skip index is actually pruning?

Compare read_rows from system.query_log for the same query before and after adding the index, or use the EXPLAIN indexes = 1 syntax which prints which indexes were considered and how many granules each one pruned. The "Granules" line in the output is the ground truth.

How to use ClickHouse TTL, data skipping indexes, and tiered storage together to keep large MergeTree tables fast and cheap to query in production.

This post was written by an engineer at QueryPlane. QueryPlane is an app builder for your database: bring your own postgres db and you can create interactive applications to share with other developers, coworkers or even your customers. If you’re interested in trying it out, get started here.

ClickHouse is fast on freshly designed tables. It gets slow and expensive on tables that have been collecting rows for two years, because the two things that protect performance — partition pruning and primary-key index pruning — only work for queries whose predicates line up with the sort key. Once you have a table with billions of rows and an analyst writing WHERE user_id = 42 AND event_type = 'signup' against a table sorted by (event_time, project_id), you are scanning a lot of granules.

This post is about the two features in ClickHouse that exist to fix that exact problem without adding a second physical copy of the data: TTL, which removes or downsamples rows the workload no longer needs, and data skipping indexes, which let the planner prove a granule cannot contain matching rows so it never reads the granule at all. The two features compose — TTL trims the table so skip indexes have less to cover, and skip indexes accelerate the queries that survive the trim.

The basic mechanics are well-documented. What is less well-documented is the set of choices and pitfalls that decide whether these features actually pay back the engineering cost: which TYPE of skip index to pick for a given predicate, what GRANULARITY value actually does, why a TTL ... DELETE rule can sit unmerged for weeks, and the conditions under which TTL ... TO DISK 's3' plus skip indexes give you a fast hot tier and a cheap cold tier without rewriting any queries.

In this post, we’ll cover:

TTL semantics — DELETE, TO DISK, TO VOLUME, GROUP BY, and the merge-driven model that runs them
Data skipping indexes — minmax, set, bloom_filter, ngrambf_v1, tokenbf_v1, and how GRANULARITY actually works
Composing TTL with tiered storage — hot/cold S3 patterns and move_factor
TTL GROUP BY — automatic rollups for retention-friendly aggregates
Skip index pitfalls — false positives, low-cardinality traps, and indexes that get worse as you ingest
Operational tips — MATERIALIZE INDEX, system.parts, monitoring the merges that drive TTL
Six production pitfalls — the failure modes worth knowing about before you ship a rule
FAQ — short answers to the questions teams ask after they enable their first TTL rule

TTL semantics

ClickHouse TTL is not a background sweeper that scans tables looking for expired rows. It is a rule attached to a column or a table that the merge engine consults when it decides what to do with a part. Because merges are the engine of TTL, a TTL rule does not fire on a fixed schedule. It fires when the part containing the row gets merged, or when a dedicated TTL merge picks the part up because merge_with_ttl_timeout has elapsed since the last TTL merge for that table.

The simplest form is row-level deletion:

CREATE TABLE events
(
  event_time DateTime,
  user_id    UInt64,
  event_type LowCardinality(String),
  payload    String
)
ENGINE = MergeTree
ORDER BY (event_time, user_id)
TTL event_time + INTERVAL 90 DAY DELETE;

The TTL <expression> DELETE clause attaches a rule to the table: any row whose event_time + INTERVAL 90 DAY lies in the past is eligible for deletion the next time the part is merged. The expression has to be deterministic and computable from columns of the row — usually a DateTime or Date column with an INTERVAL offset. ClickHouse stores the minimum TTL value across all rows in a part as part metadata, so a part whose minimum TTL is still in the future is skipped entirely by the TTL merge scheduler.

DELETE is the default action and can be omitted. The interesting actions are the others:

ALTER TABLE events
MODIFY TTL
  event_time + INTERVAL 30 DAY TO DISK   's3_cold',
  event_time + INTERVAL 180 DAY TO VOLUME 'archive',
  event_time + INTERVAL 365 DAY DELETE;

Each TTL expression is evaluated independently against every row: a TO DISK rule that fired at day 30 moves the part to the cold disk, the TO VOLUME rule at day 180 moves it again to the archive volume, and the DELETE rule at day 365 removes it entirely. There can be at most one DELETE action per table, but multiple move actions can fire in sequence as time advances. TO DISK moves the part to a specific disk by name. TO VOLUME moves it to a named multi-disk volume — useful when your cold tier is itself sharded across several S3 buckets or local HDDs. DELETE removes the part the same way the simple form does.

There is also TTL ... RECOMPRESS CODEC(ZSTD(12)) for re-encoding old parts with a heavier compression codec — useful when you keep cold data on the same disk but want to trade CPU for storage. And there is TTL ... GROUP BY for aggregation, which we cover below.

Two operational details that catch teams off guard the first time. First, merge_with_ttl_timeout defaults to 4 hours, which means a part that has fully expired rows can sit around for up to four hours before the engine bothers to merge it for TTL purposes. If your retention rule is “delete within 24 hours of expiry” you can leave the default; if it is “delete within 5 minutes” you need to lower this value to the order of minutes and provision enough merge-pool capacity (background_pool_size, background_merges_mutations_concurrency_ratio) for the resulting load — the extra CPU and IO are real and need to be sized against the table’s ingest rate.

Second, when you ALTER TABLE ... MODIFY TTL, existing parts do not get re-evaluated against the new rule until the next merge touches them. If you tighten a retention rule from 90 days to 30 days, the old 60-day-old parts will still be there until the next merge for that part. To force evaluation, run ALTER TABLE events MATERIALIZE TTL — this is a heavy operation on a big table because it rewrites every part, but it is the only way to make a TTL change land synchronously.

Data skipping indexes

ClickHouse’s primary-key index is sparse — it indexes every index_granularity-th row of the sorted file (8192 rows by default), so the granule is the unit of pruning. A query like WHERE event_time BETWEEN ... AND ... against a table sorted by event_time reads exactly the granules whose min/max overlaps the range, which on a year-scale table can prune 99% of the granules without touching the data.

Skip indexes generalize this idea to columns that are not in the ORDER BY. A skip index stores, for every GRANULARITY consecutive granules of the base table, a small summary of the values appearing in some expression. When a query has a predicate on the expression, the planner consults the summary; if the summary proves the predicate cannot match any row in those granules, ClickHouse skips reading them.

There are five built-in types worth knowing.

minmax stores the minimum and maximum value of the expression over each block of granules. It is the right answer when the column is numeric or date-like and is naturally clustered in storage even though it is not in the sort key — for example, an event_time column that is correlated with the primary key but not part of it. A WHERE event_time > now() - INTERVAL 1 HOUR against a minmax index on event_time will prune any block whose max event_time is below the threshold.

ALTER TABLE events
  ADD INDEX events_event_time_minmax (event_time) TYPE minmax GRANULARITY 4;

set(max_rows) stores the set of distinct values over each block. It is the right answer when the column has low-to-moderate cardinality and queries filter by equality. The max_rows parameter caps how many distinct values are stored per block; if a block has more than max_rows distinct values, the index gives up for that block and the planner has to read it. Sizing the parameter is the entire art: too small and the index is mostly useless because blocks blow past the cap, too large and the index itself becomes a sizeable lookup the planner has to scan.

ALTER TABLE events
  ADD INDEX events_event_type_set (event_type) TYPE set(100) GRANULARITY 4;

bloom_filter([false_positive_rate]) stores a Bloom filter per block of the values appearing in the expression. It is the right answer for equality predicates on high-cardinality columns where set() blows past max_rows immediately — UUIDs, user IDs, anything with millions of distinct values. The optional false-positive rate defaults to 0.025. Lowering it makes the filter bigger and more selective; raising it makes the filter smaller and accept more false-positive granule reads. For most production tables, the default rate is the right starting point.

ALTER TABLE events
  ADD INDEX events_user_id_bf (user_id) TYPE bloom_filter(0.01) GRANULARITY 4;

ngrambf_v1(n, size_of_bloom_filter, number_of_hash_functions, random_seed) and tokenbf_v1(...) are Bloom filters built on substrings (or whitespace-separated tokens) of a string column. They accelerate LIKE, hasToken(), and equality search on text columns. ngrambf_v1 indexes character n-grams (useful when you want LIKE '%substr%' queries to prune); tokenbf_v1 indexes whitespace-separated tokens (useful when you want hasToken('error') to prune log messages). These are the right answer for log search and free-text search against high-cardinality string columns. Sizing them is harder than the numeric Bloom filter; the ClickHouse data skipping index docs include a calculator and rules of thumb.

ALTER TABLE logs
  ADD INDEX logs_message_token (message) TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 4;

The GRANULARITY N clause that appears on every skip index defines how many base-table granules are summarized in a single skip-index granule. With index_granularity = 8192 and skip-index GRANULARITY 4, each skip-index entry covers 4 × 8192 = 32,768 rows. A smaller GRANULARITY makes the index finer-grained and more selective per block, at the cost of a larger index. A larger GRANULARITY makes the index coarser but cheaper. The default of GRANULARITY 1 is rarely optimal for big tables; values between 4 and 64 are common in production depending on how dense the relevant values are.

See what QueryPlane can build for you

Connect to your database, write SQL with AI, and build shareable apps — all from your browser.

Get Started Book a Demo

Composing TTL with tiered storage

The single highest-leverage TTL pattern in production is hot/cold tiering: keep the last 30 days on local SSD where queries are fast, move 30-to-365 day data to S3 where storage is an order of magnitude cheaper, and delete the rest. ClickHouse’s storage policy abstraction makes this a configuration change rather than a code change.

Define a storage policy in config.xml (or a file under config.d/) that has two volumes — one local and one S3:

<storage_configuration>
  <disks>
    <s3_cold>
      <type>s3</type>
      <endpoint>https://s3.us-east-1.amazonaws.com/queryplane-events-cold/</endpoint>
      <use_environment_credentials>true</use_environment_credentials>
    </s3_cold>
  </disks>
  <policies>
    <hot_cold>
      <volumes>
        <hot>
          <disk>default</disk>
        </hot>
        <cold>
          <disk>s3_cold</disk>
        </cold>
      </volumes>
      <move_factor>0.2</move_factor>
    </hot_cold>
  </policies>
</storage_configuration>

Attach the policy to the table and add TTL rules that target the volume by name:

CREATE TABLE events
(
  event_time DateTime,
  user_id    UInt64,
  event_type LowCardinality(String),
  payload    String
)
ENGINE = MergeTree
ORDER BY (event_time, user_id)
TTL event_time + INTERVAL 30  DAY TO VOLUME 'cold',
    event_time + INTERVAL 365 DAY DELETE
SETTINGS storage_policy = 'hot_cold';

From that point, the merge scheduler moves parts older than 30 days to the cold volume — which is backed by S3 — and deletes parts older than 365 days. Queries against the table do not change: the planner reads parts from whichever disk they sit on, and the only thing the analyst notices is that queries against cold data are a few hundred milliseconds slower because of S3 latency.

The move_factor setting in the policy controls when ClickHouse spills hot parts to cold without a TTL match. When the free space on the hot volume drops below move_factor (here, 20%), the engine starts moving the oldest parts to the next volume even if their TTL has not fired. This is the safety valve that keeps a fast-growing table from filling up local SSD; it is also a subtle reason a cold-tier byte count can grow faster than retention rules alone would predict.

Skip indexes work transparently across the tiers: the index is part of the part, so when the part moves to S3 the index moves with it. A query against cold data still benefits from skip-index pruning, which matters more on S3 than on SSD because every avoided granule read avoids a network round-trip.

TTL with GROUP BY for retention-friendly rollups

A less well-known TTL action is GROUP BY, which lets ClickHouse downsample older data instead of deleting it. The rule says “for rows older than X, aggregate them along these columns using these aggregate functions, then delete the underlying rows.” The result is a table where the recent data is full-fidelity and the historical data is pre-aggregated, all without a separate materialized view or scheduled task.

CREATE TABLE events
(
  event_date DateTime,
  user_id    UInt64,
  event_type LowCardinality(String),
  cnt        UInt64 DEFAULT 1
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, event_type, user_id)
TTL event_date + INTERVAL 90 DAY
    GROUP BY event_date, event_type
    SET user_id = min(user_id),
        cnt     = sum(cnt);

Rows older than 90 days are aggregated by (event_date, event_type); user_id is collapsed to the minimum value in the group (the per-user dimension is gone), and cnt becomes the sum across all users. The grouped row replaces the underlying rows in storage. Old data still answers SELECT event_date, event_type, sum(cnt) FROM events GROUP BY 1, 2, but it cannot answer WHERE user_id = 42 — the per-user dimension is gone, by design. Note the ORDER BY puts event_type before user_id: the GROUP BY list in a TTL rollup has to be a prefix of the sort key, so the schema is designed around that constraint up front.

Two constraints on GROUP BY TTL that bite people. First, the GROUP BY columns must be a prefix of the table’s ORDER BY — the engine cannot re-sort the part to perform a non-prefix rollup. Second, the SET clause must enumerate every column that is not in the GROUP BY; columns you do not list are reset to their default value, which may or may not be what you want. Both restrictions are documented in the ALTER TTL reference.

Skip index pitfalls

Skip indexes are the area where teams most often discover that a documented feature does not behave the way the docs suggest.

The first pitfall is the index that never prunes anything. A set or bloom_filter index is only useful if values are locally clustered in the part — that is, if a small fraction of blocks contain any given value. If the data is random with respect to ingestion order, every block has every value and the index returns “might match” for every query. The fix is to ingest in an order that clusters the indexed column — for example, sort by (event_time, user_id) so that an event_type skip index has hope of clustering inside an hour. If the column is high-cardinality and uncorrelated with the sort key, no skip index will help; that is a sign the column should be in the ORDER BY or in a projection.

The second pitfall is picking the wrong type. set(N) only prunes when the per-block cardinality is below N. Teams pick set(1000) for a column that has 100,000 distinct values per block, and the index silently does nothing — every block exceeds the cap and the planner falls back to reading the data. system.data_skipping_indices exposes per-index size and the marks scanned, but it does not directly expose “this index produced zero pruning”; the reliable way to verify a skip index is doing work is to compare system.query_log rows for the same query with and without the index, looking at read_rows and read_bytes.

The third pitfall is the LIKE query that doesn’t use the ngram index. ngrambf_v1 only helps when the substring in the query has at least n characters; a WHERE message LIKE '%x%' against an ngrambf_v1(3, ...) index cannot use the index because the substring is shorter than the n-gram length. Document the minimum supported substring length for your text-search use case and size the n-gram parameter to match.

The fourth pitfall is indexes that hurt insert performance. Every skip index is recomputed on insert and every merge. A table with five skip indexes inserts roughly 30-40% slower than the same table without them, and the merge throughput is reduced proportionally. The math is straightforward — adding an index is a write-path tradeoff — but teams discover this only after they ship the indexes and watch ingest lag climb.

The fifth pitfall is building an index on a column you have just added. ADD INDEX only computes the index for parts that ClickHouse subsequently writes or merges; existing parts have no index data, and the planner cannot prune them. The fix is ALTER TABLE events MATERIALIZE INDEX events_user_id_bf, which schedules a materialization mutation that rewrites every existing part to include the index. This is identical in cost to materializing a TTL change — it rewrites every part — so plan it for off-peak and watch system.mutations until it completes.

Operational tips

A handful of system tables are essential for running TTL and skip indexes in production.

system.parts has a delete_ttl_info_min / delete_ttl_info_max and move_ttl_info block on every part. Joining system.parts against the storage policy makes it easy to chart “how much data is on hot vs cold” and “what is the oldest part that hasn’t been moved yet”:

SELECT
    table,
    disk_name,
    formatReadableSize(sum(bytes_on_disk)) AS size,
    min(min_time) AS oldest_event,
    max(max_time) AS newest_event
FROM system.parts
WHERE active AND database = 'analytics'
GROUP BY table, disk_name
ORDER BY table, disk_name;

system.data_skipping_indices has per-table, per-index size and the marks column showing how many index marks exist. If marks is zero for a freshly added index, you forgot to run MATERIALIZE INDEX. If data_compressed_bytes is a meaningful fraction of the table’s compressed size, the index is too large for what it returns — consider a coarser GRANULARITY or a smaller set() cap.

system.mutations shows in-flight MATERIALIZE INDEX and MATERIALIZE TTL operations. A mutation row stays around with is_done = 0 until every part affected has been rewritten; the parts_to_do column counts remaining parts. Watch this when you ship a new index or TTL rule to a large table — the rewrite can run for hours.

system.merges shows current merges, and the merge_type column (docs) identifies what kind of merge each row represents — Regular for size-driven merges, TTL_DELETE for merges driven by a TTL delete or move rule, and TTL_RECOMPRESS for codec-recompression rules. Counting WHERE merge_type LIKE 'TTL%' over time tells you whether retention rules are doing real work; if the count is stuck at zero, either nothing is expiring or the merge scheduler is starved.

Six production pitfalls

A few failure modes worth knowing about before you ship a rule:

TTL rules that never fire on stopped tables. TTL is driven by merges. A table with no inserts and no merges does not have its TTL rules evaluated. If you take a partition offline for migration work and forget about it, expired data sits in it indefinitely. Cron job: ALTER TABLE ... MATERIALIZE TTL on tables with declining ingestion to ensure expiry happens.

TO DISK with the wrong disk name. ClickHouse does not validate the disk name in MODIFY TTL against the table’s storage policy at definition time. A typo in the disk name results in a rule that fires at the right time but with the wrong action — usually a no-op move, sometimes a part landing on an unintended volume. Validate disk names with SELECT name FROM system.disks before shipping the rule.

S3 part counts blowing up. A part that has been compressed to a few hundred kilobytes on S3 is still a part, which means every query against cold data fans out to a small object on S3 per part. If you don’t tune merge_max_block_size_to_lower_memory_usage and the related merge settings for cold parts, you end up with thousands of tiny S3 objects, each adding latency. ClickHouse’s cold storage docs cover the settings; the short version is “keep merging in the cold tier, just at a longer cadence.”

Bloom filter false-positive accumulation across granules. A skip index built with a high false-positive rate (say, 0.05) is fine on a small table, but on a billion-row table with one million granules, the expected number of false-positive granule reads is fifty thousand — enough to substantially reduce the pruning win. Decrease the rate by an order of magnitude on big tables.

TTL ... DELETE WHERE predicate getting cheaper to evaluate than intended. ClickHouse 22.5 added DELETE WHERE to TTL (release notes), which lets you express “delete rows older than X and matching a predicate.” If the predicate is on a column that is materialized only at merge time (e.g., a MATERIALIZED default), the predicate’s value at TTL evaluation may differ from the value at insert. The safer pattern is to put both expressions in the same TTL rule on a column whose value is fixed at insert.

MATERIALIZE TTL colliding with regular merges on a busy table. Running MATERIALIZE TTL on a table with steady ingest can saturate the merge pool because every part has to be rewritten and the scheduler still needs to keep up with new parts. Throttle by setting max_replicated_merges_in_queue lower for the duration of the materialization, or schedule the operation on a replica that you temporarily drain from the load balancer.

FAQ

How is ClickHouse TTL different from a delete query?

A DELETE (or ALTER TABLE ... DELETE) is a one-shot mutation that rewrites every part containing the rows to delete. TTL is a declarative rule that the merge engine applies as it merges parts, so the cost is amortized across the merges that would happen anyway and the rule keeps firing forever without re-running.

When does a TTL rule actually run?

When a part containing eligible rows is merged. Dedicated TTL merges fire at most every merge_with_ttl_timeout (default 4 hours) per table, in addition to regular size-based merges that may also apply the rule.

Can I have multiple TTL rules with different actions on the same table?

Yes — separate the expressions by commas in TTL .... The engine evaluates the rules in the order they appear and applies the first whose expression has fired. The standard pattern is one or more TO DISK / TO VOLUME rules followed by a final DELETE rule.

Do skip indexes work on columns inside `MATERIALIZED` defaults?

Yes, but only after a merge has materialized the column for the part. Recently inserted parts may not yet have the column populated, in which case the skip index will not prune those parts. MATERIALIZE INDEX forces this.

Which is faster, a skip index or a projection?

A projection is a separate physical layout — it has its own sort order and primary key and can answer queries by reading only its data. It is much faster for the queries it covers, at the cost of doubling (or more) the on-disk size of the underlying data. A skip index is cheap, but it only helps queries that filter on the indexed expression. The rule of thumb: if a query pattern is hot enough to justify a 2x storage cost, use a projection; otherwise, use a skip index.

Should I drop the primary-key index when I add skip indexes?

No. The primary-key index is the cheapest and most effective pruning mechanism in ClickHouse. Skip indexes are additive — they prune granules the primary-key index could not. Keep the primary key tuned for the dominant query pattern and add skip indexes for the secondary ones.

How do I tell whether a skip index is actually pruning?

Compare read_rows from system.query_log for the same query before and after adding the index, or use the EXPLAIN indexes = 1 syntax which prints which indexes were considered and how many granules each one pruned. The “Granules” line in the output is the ground truth.

Can I use TTL `TO DISK` to keep a table partially on local SSD and partially on a slower local HDD?

Yes — define the SSD and the HDD as separate disks in the storage policy and use TO VOLUME to move parts between them. The same pattern that moves hot parts to S3 works for local-HDD tiering.

Does `TTL ... GROUP BY` break with non-deterministic aggregates?

Yes, in subtle ways. ClickHouse evaluates the rollup at merge time, so any aggregate whose result depends on the order of input rows (groupArray, argMax with ties, etc.) can produce different output across two equivalent rollups. Stick to deterministic aggregates (sum, count, min, max, avg) in TTL GROUP BY clauses.

What happens to skip indexes when a part moves to S3?

The index file moves with the part. The planner consults the same index file regardless of which disk the part lives on; only the latency of reading the underlying granules differs.

Wrapping up

TTL and data skipping indexes are the two MergeTree features that keep a growing ClickHouse table fast and affordable without bolting on infrastructure that has to be kept in sync. TTL trims and tiers the data so the engine spends its compute budget on the rows that still matter; skip indexes prune what’s left so queries against the trimmed data still finish quickly. Composed correctly, the two reduce a multi-terabyte hot table to a tens-of-gigabytes hot tier and a cheap-but-still-pruneable cold tier, with no application-layer change.

The decision points worth slowing down on: which TYPE of skip index matches the predicate (numeric ranges → minmax, low-cardinality equality → set, high-cardinality equality → bloom_filter, text → tokenbf_v1 / ngrambf_v1), what GRANULARITY keeps the index small enough to be worth its weight, whether the data is naturally clustered enough on the indexed expression for any index to help at all, and whether the cold tier is allowed to be on S3 from the start (almost always yes — the latency cost is trivially worth the storage savings).

If you are exploring a ClickHouse table that already has TTL and skip indexes configured, system.parts, system.data_skipping_indices, system.mutations, and EXPLAIN indexes = 1 are the four things you want at hand — and they are exactly the kind of system-table introspection QueryPlane is built for. Connect a ClickHouse instance, point it at one of those tables, and you can paste an EXPLAIN directly next to a chart of system.parts size by disk to see, in one screen, whether the indexes are pruning and whether the tiered storage policy is doing what you think it is.

For related ClickHouse internals, see our guides on partition and primary key design, ReplacingMergeTree, projections, LowCardinality, and JOIN strategies. If you’re choosing a GUI to drive these workflows, our top ClickHouse GUI tools post compares the most common options.