Yesterday I was a bit tired of studying, so I opened my editor and hacked together a small C++ project to understand the internals of ClickHouse a little better.
I’ve used ClickHouse a lot at Tinybird (a real-time analytics database built on top of ClickHouse, with our own internal fork). In practice that means I’ve spent a lot of time not just running ClickHouse-based systems at production scale, but also tuning and extending ClickHouse for real workloads (ingestion patterns, merge/compaction behavior, and query performance). These days I’m working on billing, but ClickHouse is still very much “home turf”. Earlier, I also used ClickHouse at Keytrends.
I’ve always had that engineer curiosity: what if I implement a tiny version of MergeTree from scratch? Just to learn the fundamentals.
So I built a minimal (but functional) MergeTree-like engine in C++ (the code is on GitHub: manumartinm/clickhouse-mergetree). In this post, I’ll explain the ideas behind it, the trade-offs, and how the end-to-end flow works.
Academic Fundamentals: Some math & papers
LSM-Trees
The LSM-Tree (Log-Structured Merge-Tree) concept was introduced in the classic paper (O’Neil et al., 1996). The core idea is:
Instead of updating data in-place like traditional B-trees, keep data in immutable structures and merge periodically
This approach trades random I/O for sequential I/O. In a B-tree, a single write can trigger multiple random I/Os; in an LSM-tree, writes are sequential appends, and the cost is paid later during background merges.
ClickHouse’s MergeTree engine family isn’t a “textbook LSM-tree”, but it shares a lot of the same high-level shape: data lands in immutable on-disk parts, and the system continuously merges (compacts) those parts in the background to keep reads fast (ClickHouse, n.d.-a; O’Neil et al., 1996).
Column Stores
The C-Store paper (Stonebraker et al., 2005) established the foundations of columnar storage.
In practice, the benefits are pretty straightforward: better compression (similar data compresses better together), selective I/O (you only read the columns you need), and vectorization (SIMD operations over homogeneous arrays).
Sparse Indexing
Unlike traditional dense indexes, ClickHouse uses a sparse primary index. The high-level idea (indexing blocks, not rows) is common in LSM/SSTable systems (e.g., Chang et al., 2006) and is documented clearly in ClickHouse’s own docs (ClickHouse, n.d.-b).
A dense index stores one entry per row (O(n) memory). A sparse index stores one entry per granule, e.g. every 8192 rows (O(n / 8192) memory).
For a billion-row table, that’s roughly 122K index entries instead of 1B. The difference is enormous.
Code Deep Dive: From data structures to the end-to-end flow
After studying these papers and spending a lot of time with ClickHouse in production at Tinybird (parts, merges/compaction, query performance), I built my own minimal implementation with the help of Claude Code. I’ll start with the key design decisions, and then connect them to the full write/read/merge flow.
One very important thing about LSM designs is this: writes are cheap because we append immutable sorted data, and then we pay the cost later by compacting/merging parts in the background (O’Neil et al., 1996).
Here’s the mental model of the implementation:
INSERT → MemTable (sorted in RAM) → FLUSH → Part on disk (immutable)
QUERY → MemTable + relevant Parts → merge results → deduplicate
MERGE → pick Parts → k-way merge → new bigger Part → remove old Parts
Key design decisions
Basic data structures: rows and granules
struct Row {
    std::string key;
    std::string value;
    uint64_t timestamp;

    bool operator<(const Row& other) const {
        if (key != other.key) return key < other.key;
        return timestamp < other.timestamp;
    }
};
Granules are blocks of 8192 rows, the default size ClickHouse settled on after years of production tuning:
If the block is too small you pay index overhead (too many entries); if it’s too large you lose selectivity (you read more irrelevant data). The size also plays nicely with efficient memory mapping (think OS page size alignment and sequential access).
MemTable: skip lists
class MemTable {
private:
    static constexpr int MAX_LEVEL = 16;
    static constexpr double PROBABILITY = 0.5;
    std::shared_ptr<SkipListNode> header_;

public:
    void insert(const Row& row);  // O(log n) expected
    RowVector query(const std::string& start, const std::string& end) const;
};
I chose skip lists over B-trees because they’re inherently more thread-friendly, they have fewer corner cases (which matters a lot in a small “learn by building” project), and they tend to have decent memory locality in write-heavy workloads.
Pugh’s original paper is still a great read (Pugh, 1990).
On-disk layout: immutable parts + columnar files
Each Part is immutable and stores data in columnar format:
part_001/
├── metadata.bin # Part statistics
├── primary.idx # Sparse primary key index
├── granule_0_keys.bin # Key column
├── granule_0_values.bin # Value column
├── granule_0_timestamps.bin # Timestamp column
└── ...
This organization enables what C-Store called late materialization: you only reconstruct full rows when you actually need them (Stonebraker et al., 2005).
Merge algorithm: k-way merge with a priority queue
class MergeIterator {
private:
    std::priority_queue<RowWithSource> heap_;
    std::vector<std::unique_ptr<Part>> parts_;

public:
    Row next();  // O(log k) where k = number of parts
    bool has_next() const;
};
The merge algorithm is a classic k-way merge using a heap. While conceptually simple, the decisions of when to merge are complex: you’re constantly balancing write amplification (don’t merge too frequently), read amplification (don’t end up with too many parts), and space amplification (merges need temporary storage).
End-to-end flow
The invariants (what must always be true)
There are three invariants I try to keep in mind. First, everything we query efficiently is sorted by the same order (in my case, primary key + timestamp), which is what makes range queries and merges simple. Second, on-disk data is immutable: once a Part is written, we never update it in place; new versions come from new parts (flush or merge). Third, each part has a sparse index (one entry per granule, not per row), which is the trick that keeps index size and memory usage under control (see also: ClickHouse, n.d.-b).
Write path (INSERT → MemTable → FLUSH → Part)
Insert hits the MemTable. The MemTable is an in-memory ordered structure (skip list). That means inserts are cheap-ish (O(log n) expected) and, crucially, the MemTable can give you rows already sorted when it’s time to flush.
When MemTable is “big enough”, we flush. Flush is the moment where we turn RAM state into an immutable on-disk Part: we snapshot the sorted rows from MemTable (to freeze a consistent view), split them into fixed-size granules (8192 rows in my implementation), write each column to its own .bin file, build a sparse index (primary.idx) with one entry per granule (typically min/max key), and finally register the part in the MergeTree’s list of parts.
The important insight: flushing creates more parts. That’s fine for writes, but it increases the number of files/segments a query might have to look at (read amplification). That’s why merges exist.
What exactly is inside a Part?
A Part is the immutable on-disk unit: a set of granules (fixed-size blocks of rows), per-granule column files (so you can read only what you need), and a sparse primary index that tells you which granules might overlap a query range. This is a tiny but very practical approximation of ClickHouse’s “parts + marks” idea: don’t index every row; index blocks.
Read path (QUERY → prune parts → prune granules → scan)
When you do a range query [start_key, end_key], the engine does three levels of “filtering” before it reads data:
First, you always check MemTable because it contains the newest rows that might not have been flushed yet. Then you do part-level pruning: skip parts whose global min/max key doesn’t overlap your requested range. Finally, inside each remaining part you do granule-level pruning using the sparse index: you only scan granules that might overlap:
bool IndexEntry::overlaps_range(const std::string& start_key,
                                const std::string& end_key) const {
    return !(max_key < start_key || min_key > end_key);
}
Then we only load/scan those granules. In a “real” column store, this is also where the engine decides which columns to read. My minimal version reads the columns needed to reconstruct Row objects for results.
Merging query results (why we “merge and dedup”)
Because we never update parts in place, you can have multiple versions of the “same logical key” across different parts (and also in the MemTable).
So the final query result comes from two operations: you merge the sorted streams (MemTable + each part’s results) into one sorted stream, and then you deduplicate by your versioning rule (in this project: keep the latest by timestamp).
My code currently does this in a very simple way by concatenating results, sorting, and unique-ing. It’s not the most efficient approach, but it’s conceptually clean for an educational implementation. A production engine would do this as a streaming merge of already-sorted iterators to avoid extra memory/sorts.
Background merges (why we compact parts at all)
Every flush adds a new part. If we never merge, reads get slower because:
you have to check more parts, you might have to scan more granules, and you do more merge/dedup work at query time.
So merges (compactions) trade write amplification for lower read amplification:
pick a few candidate parts (heuristics: size tiers, number of parts, etc.), do a k-way merge of their rows (because each part is sorted), apply the same dedup rule (keep the newest version), write out a new bigger part, and drop the old ones.
The “k-way merge” is the classic algorithmic core: you keep a min-heap with the current head row of each input part, pop the smallest, advance that input, and repeat. That’s exactly the same shape you’d use in merge sort, just generalized to k inputs.
A tiny worked example (with numbers)
To make the flow tangible, let’s pretend the granule size is 4 rows (instead of 8192), the sort order is (key, timestamp), and the “versioning rule” is “latest timestamp wins”.
Now we insert these rows (same key can appear multiple times):
| insert order | key | value | ts |
|---|---|---|---|
| 1 | a | v1 | 10 |
| 2 | b | v1 | 11 |
| 3 | a | v2 | 12 |
| 4 | c | v1 | 13 |
| 5 | b | v2 | 20 |
After inserts 1..4, MemTable hits the flush threshold and we write Part_1. Because MemTable is ordered, the flushed rows are already sorted:
Part_1 rows (sorted):
(a,10) (a,12) (b,11) (c,13)
Insert 5 stays in MemTable for now:
MemTable:
(b,20)
Now query the range [a, b]:
MemTable returns (b,20), Part_1 returns (a,10) (a,12) (b,11), and merge + dedup keeps the newest version per key:
Result:
a → (a,12)
b → (b,20)
This example is basically the whole system in miniature: immutability creates duplicates across structures, and the engine resolves them by merging and applying a “latest wins” rule.
A “drawing” of the system (what lives where)
Here’s the physical picture of one Part on disk in my project:
MergeTree/
parts/
part_1/
metadata.bin
primary.idx # sparse index (1 entry per granule)
granule_0_keys.bin
granule_0_values.bin
granule_0_timestamps.bin
granule_1_keys.bin
...
And here’s how a query uses that layout:
query(start_key, end_key)
├─ read MemTable (sorted in RAM)
├─ for each Part:
│ ├─ check part min/max (cheap prune)
│ ├─ consult primary.idx to pick candidate granules
│ └─ read only those granule_*_{keys,values,timestamps}.bin
└─ merge + dedup across all sources
Notes & further reading
Math notes (simple but useful)
Skip list insert (MemTable).
Expected insertion/search cost is O(log n). With promotion probability p = 1/2, the expected number of levels per node is constant, so you get a fast ordered structure without explicit rebalancing (Pugh, 1990).
Sparse index size.
If a part has N rows and granule size G, the number of index entries is ⌈N / G⌉ (one per granule). So compared to a dense index (N entries), you reduce it by roughly a factor of G (8192 in my “real” implementation).
K-way merge cost.
If you merge k sorted inputs with N total rows, a heap-based k-way merge costs O(N log k).
That’s why having too many parts (large k) hurts reads, and why compaction/merges matter.
A tiny cost model (intuition, not a proof).
For a range query, you can think:

Cost ≈ c1·P + c2·P·I + c3·R

Where:
- P = number of parts you “touch”,
- I = index entries scanned/checked in each part (worst case; in practice you can binary search),
- R = total rows actually scanned/returned,
- c1, c2, c3 = constant factors for part pruning, index work, and row scanning.

Merges aim to reduce P (and often R, since duplicate versions get collapsed) for the same logical query, at the cost of doing extra work in the background.
Performance Analysis: Real Numbers
My implementation achieves:
Insert performance: 13,970 rows/sec (single-threaded)
Query performance: ~316 rows/ms scanning
Memory efficiency: 1.3 MB for 50K rows with 50 parts
Storage overhead: ~42 bytes per row (before compression)
For context, ClickHouse in production can do:
- Inserts: 100M+ rows/sec (multi-threaded, clustered)
- Queries: GB/sec scanning rates
- Compression: 10-20x reduction vs raw data
Running the Implementation
The complete implementation is built with CMake and can be easily compiled and executed:
# Install CMake (if not already installed)
brew install cmake
# Navigate to project directory
cd clickhouse_mergetree
# Configure and build
mkdir -p build && cd build
cmake ..
make
# Run the demo
./demo
The demo showcases all the core functionality:
ClickHouse MergeTree Implementation Demo
=========================================
=== Testing Basic Operations ===
Inserting test data...
Querying single key...
Found 2 entries for key1
key1 -> value1 (ts: 1000)
key1 -> updated_value1 (ts: 4000)
Querying range...
Found 4 entries in range [key1, key3]
Basic operations test completed successfully!
=== Performance Test ===
Inserting 50000 rows...
Insert performance: 50000 rows in 3873 ms (12909.9 rows/sec)
Final stats:
Parts: 50
Total rows: 50000
Memory usage: 1327 KB
Disk usage: 2091 KB
Query performance: 5549 results in 17807 μs
Performance test completed successfully!
The Code and Beyond
My complete implementation (~2000 lines) is available on GitHub with full documentation: manumartinm/clickhouse-mergetree.
References
Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., & Gruber, R. E. (2006). Bigtable: A distributed storage system for structured data. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’06).
ClickHouse. (n.d.-a). MergeTree (table engine). Retrieved January 11, 2026.
ClickHouse. (n.d.-b). Sparse primary indexes. Retrieved January 11, 2026.
ClickHouse. (n.d.-c). Column compression codecs. Retrieved January 11, 2026.
Facebook. (n.d.). Compaction. RocksDB Wiki. Retrieved January 11, 2026.
Kleppmann, M. (2017). Designing data-intensive applications. O’Reilly Media.
O’Neil, P., Cheng, E., Gawlick, D., & O’Neil, E. (1996). The log-structured merge-tree (LSM-tree).
Pugh, W. (1990). Skip lists: A probabilistic alternative to balanced trees. Communications of the ACM, 33(6), 668–676.
Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O’Neil, E., O’Neil, P., Rasin, A., Tran, N., & Zdonik, S. (2005). C-Store: A column-oriented DBMS. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB ’05).