An Intro to ClickHouse Joins
Learn how to use joins in ClickHouse with real examples, when to aggregate before joining, when ANY JOIN helps, and where dictionaries fit.
ClickHouse
This post was written by an engineer at QueryPlane. QueryPlane is an app builder for your database: bring your own postgres db and you can create interactive applications to share with other developers, coworkers or even your customers. If you’re interested in trying it out, get started here.
ClickHouse joins have a reputation problem. A lot of people still carry around an old mental model that says “ClickHouse is great for scans, but you should avoid joins.” That advice is too blunt now. ClickHouse can handle joins well, but the best patterns still look a little different from what you might reach for in PostgreSQL or MySQL.
Instead of keeping this guide theoretical, I verified the examples below locally with clickhouse local on ClickHouse 26.4.1.1039 using the official binary. That makes the advice much more concrete.
In this post, we’ll cover:
- A straightforward fact-to-dimension join
- Why aggregating before joining often helps
- What
LEFT ANY JOINactually does - When dictionaries are better than a regular join
- A simple decision ladder for production workloads
Example setup
We’ll start with a small events table and a user dimension:
CREATE TABLE users
(
user_id UInt32,
region LowCardinality(String),
plan LowCardinality(String)
)
ENGINE = Memory;
INSERT INTO users VALUES
(1, 'us', 'free'),
(2, 'eu', 'pro'),
(3, 'us', 'team'),
(4, 'apac', 'free');
CREATE TABLE events
(
event_date Date,
user_id UInt32,
revenue UInt32,
event_type LowCardinality(String)
)
ENGINE = Memory;
INSERT INTO events VALUES
('2026-04-01', 1, 0, 'page_view'),
('2026-04-01', 1, 120, 'purchase'),
('2026-04-01', 2, 80, 'purchase'),
('2026-04-02', 2, 0, 'page_view'),
('2026-04-02', 3, 250, 'purchase'),
('2026-04-02', 3, 0, 'page_view'),
('2026-04-03', 4, 40, 'purchase'),
('2026-04-03', 4, 0, 'page_view');
This is intentionally small, but it is enough to show the join patterns that matter.
Example 1: A normal fact-to-dimension INNER JOIN
This is the kind of join people often overthink in ClickHouse. For a clean dimension lookup, the straightforward query works fine:
SELECT
u.region,
count() AS event_count,
sum(e.revenue) AS revenue
FROM events AS e
INNER JOIN users AS u ON e.user_id = u.user_id
WHERE e.event_date >= '2026-04-01'
GROUP BY u.region
ORDER BY revenue DESC, u.region ASC;
Local result:
us 4 370
eu 2 80
apac 2 40
This is the first mindset shift worth making: ClickHouse can absolutely do a normal fact-to-dimension join. If the logic is naturally relational and the query is clear, do not contort the model just because old advice says joins are bad.
Example 2: Aggregate first, then join
One of the best ClickHouse habits is to reduce the number of rows that reach the join when you can. Instead of joining raw events to users and then rolling up by plan, aggregate the fact table first:
SELECT
u.plan,
sum(user_events) AS event_count,
sum(user_revenue) AS revenue
FROM
(
SELECT
user_id,
count() AS user_events,
sum(revenue) AS user_revenue
FROM events
WHERE event_date >= '2026-04-01'
GROUP BY user_id
) AS e
INNER JOIN users AS u ON e.user_id = u.user_id
GROUP BY u.plan
ORDER BY revenue DESC, u.plan ASC;
Local result:
team 2 250
free 4 160
pro 2 80
On this tiny dataset the performance difference does not matter, but the pattern does. On large event tables, pre-aggregating before the join often reduces memory pressure and makes the query easier for ClickHouse to execute efficiently.
If you only need user-level totals, do not make the join process every raw event row.
See what QueryPlane can build for you
Connect to your database, write SQL with AI, and build shareable apps — all from your browser.
Example 3: LEFT JOIN vs LEFT ANY JOIN
ANY JOIN is one of the most ClickHouse-specific tools in this area, and it is worth understanding because it solves a real problem: row multiplication when the right-hand side has duplicates.
Here is a simple campaign example:
CREATE TABLE campaign_history
(
campaign_id UInt32,
owner LowCardinality(String),
version UInt8
)
ENGINE = Memory;
INSERT INTO campaign_history VALUES
(10, 'alice', 1),
(10, 'bob', 2),
(20, 'carol', 1);
CREATE TABLE campaign_clicks
(
campaign_id UInt32,
clicks UInt32
)
ENGINE = Memory;
INSERT INTO campaign_clicks VALUES
(10, 100),
(20, 50),
(30, 25);
With a plain LEFT JOIN, duplicate matches on the right create duplicate rows:
SELECT
c.campaign_id,
h.owner,
c.clicks
FROM campaign_clicks AS c
LEFT JOIN campaign_history AS h ON c.campaign_id = h.campaign_id
ORDER BY c.campaign_id, h.version;
Local result:
10 alice 100
10 bob 100
20 carol 50
30 25
Now switch to LEFT ANY JOIN:
SELECT
c.campaign_id,
h.owner,
c.clicks
FROM campaign_clicks AS c
LEFT ANY JOIN campaign_history AS h ON c.campaign_id = h.campaign_id
ORDER BY c.campaign_id;
Local result:
10 alice 100
20 carol 50
30 25
This is the important nuance: LEFT ANY JOIN removed the duplicate row for campaign 10, but it did not give me the latest owner. In this run it picked alice, not bob.
So ANY JOIN is useful when “one match is enough,” but it is the wrong tool if you need a specific match such as “latest version.”
If you need the latest row, deduplicate first
For the campaign example, the safer pattern is to reduce the right-hand side to one row per key before joining:
SELECT
c.campaign_id,
h.owner,
c.clicks
FROM campaign_clicks AS c
LEFT JOIN
(
SELECT campaign_id, argMax(owner, version) AS owner
FROM campaign_history
GROUP BY campaign_id
) AS h ON c.campaign_id = h.campaign_id
ORDER BY c.campaign_id;
Local result:
10 bob 100
20 carol 50
30 25
This is usually the better production pattern when the right-hand side is versioned history rather than a truly unique dimension.
When dictionaries are better than a join
ClickHouse dictionaries are still worth considering when the right-hand side is:
- small
- lookup-oriented
- relatively static
Typical examples:
- status code mappings
- geo enrichment
- user tier labels
- reference metadata that fits comfortably in memory
The decision rule is simple:
- if you need normal relational logic, use a join
- if you mostly need fast key-value enrichment from a stable lookup table, evaluate a dictionary
That is one of the most ClickHouse-specific optimization moves available.
Join performance still depends on table design
Even with better join performance, poor table design still hurts. The main problems are familiar:
- large unfiltered fact-to-fact joins
- too many rows reaching the join
- layouts that prevent pruning before the join
- unnecessary columns carried through intermediate steps
This is why ClickHouse performance work usually starts with table design, not clever join syntax. If you need a refresher there, read our guide to ClickHouse PARTITION BY, ORDER BY, and PRIMARY KEY.
A realistic decision ladder
When you need enrichment or relational logic in ClickHouse, use this order:
- Can I keep the model simple with denormalization?
- If not, is a normal join the clearest answer?
- Can I reduce rows before the join by aggregating first?
- Is the right-hand side really a lookup table that should be a dictionary?
- If the right-hand side has duplicates, do I need a specific row or is
ANY JOINtruly acceptable?
That is a much better workflow than either “never join” or “joins are always fine.”
Frequently asked questions
Does ClickHouse support joins?
Yes. Every ANSI join type is supported — INNER, LEFT, RIGHT, FULL, plus ClickHouse-specific variants like ANY and ASOF. The execution model is different from a row store, but the syntax and semantics work as you’d expect.
Why are ClickHouse joins sometimes slow? ClickHouse defaults to building an in-memory hash table from the right-hand side of the join, so a large right-hand table — or a large left-hand stream that overflows the hash join’s memory budget — hurts. Aggregating before the join, picking the smaller side as the right-hand input, or switching to a dictionary for stable lookups usually fixes it.
What is the difference between JOIN and ANY JOIN in ClickHouse?
A regular join emits a row for every matching pair on the right-hand side, the same as SQL standard. ANY JOIN emits at most one match per left-hand row and is implemented as a faster code path. Use ANY JOIN only when an arbitrary matching row is acceptable; otherwise it silently drops duplicates.
When should I use a dictionary instead of a join?
Use a dictionary when the right-hand side is a small, slow-changing lookup table (countries, plans, feature flags, status enums) that you reuse across many queries. dictGet('name', 'col', key) is essentially a hash lookup, runs without the join machinery, and lets you keep the lookup in memory across queries.
What is ASOF JOIN used for?
ASOF JOIN matches each left-hand row to the right-hand row with the nearest preceding (or equal) value on a sorted column — usually a timestamp. It is the right answer for time-aligned lookups like “the price at the moment of this trade” or “the user’s plan at the moment of this event” without a self-join.
Should I denormalize instead of joining in ClickHouse? Often yes. ClickHouse is a column store, and storing the joined fields directly on the fact table is the cheapest read pattern. Denormalize when the right-hand side is small, the join is on every query, and the columns rarely change. Keep the join when one of those breaks down.
Why does my ClickHouse join fail with MEMORY_LIMIT_EXCEEDED?
The hash table built from the right-hand side did not fit in the configured memory budget. Reduce the right-hand side by aggregating or filtering before the join, swap the sides if the smaller table ended up on the left, or set join_algorithm = 'partial_merge' or 'grace_hash' so ClickHouse can spill to disk.
How do I tune join performance in ClickHouse?
Three settings move the needle most: join_algorithm (default direct, hash; consider parallel_hash, grace_hash, or partial_merge for large joins), max_bytes_in_join (the hash table memory cap), and join_use_nulls (controls null semantics on unmatched rows). Combine those with table design — sort key, dictionaries, denormalization — for compound effect.
Are ClickHouse joins distributed?
Yes, but you have to be deliberate. Joins inside a single shard run locally and fast. Cross-shard joins use GLOBAL JOIN, which broadcasts the right-hand side to every node — fine for small lookups, expensive for fact-to-fact joins. Co-locate joined data on the same shard via a shared sharding key whenever you can.
Wrapping up
ClickHouse joins are very usable, but the best patterns are selective:
- use a regular join when the logic is naturally relational
- aggregate before joining when that reduces the working set
- use
ANY JOINonly when any match is genuinely acceptable - pre-deduplicate the right-hand side when you need a specific version
- evaluate dictionaries for stable lookup enrichment
The key improvement in this mental model is not “joins good” or “joins bad.” It is “joins are a real tool, and ClickHouse gives you a few extra ways to shape them well.”
If you want a more ergonomic daily workflow for testing queries and schemas, see our guide to the best ClickHouse GUI tools.