Why Databases Reach for B+ Trees
A practical look at why B+ trees became the default on-disk index structure for relational databases — and where they start to creak.
Almost every relational database we touch — PostgreSQL, MySQL, SQL Server, Oracle, SQLite — stores its indexes (and often its tables) in a B+ tree. This post walks through why that choice keeps winning, what the structure actually looks like on disk, and where it starts to creak.
The constraint that shapes everything
Indexes don't live in RAM. They live on disk, and disk is roughly five orders of magnitude slower than memory for random access. Even on NVMe, a single page read is on the order of 100 µs; an L1 cache hit is under a nanosecond.
That gap is the entire reason B+ trees exist. Any on-disk index structure is judged by one metric first: how many page reads does it take to find a row? Everything else — CPU work, in-memory layout, theoretical complexity — is rounding error next to that.
Why not the obvious candidates
Before we reach for B+ trees, it's worth seeing what fails.
Sorted array. Binary search is O(log n) and the array is dense, but inserts are O(n) — every insert in the middle shifts everything after it. Hopeless for a table that takes writes.
Linked list. Inserts are cheap, lookups are O(n). Equally hopeless.
Hash table. O(1) point lookups, but no range queries, no ordered iteration, and rehashing a multi-gigabyte on-disk hash table is a nightmare. Databases do use hash indexes for specific cases, but not as the general default.
Binary search tree (BST). O(log n) in the average case, but each node holds one key. For a billion rows, that's ~30 levels deep — meaning up to 30 random disk reads per lookup. A self-balancing BST (AVL, red-black) has the same problem: the fanout is two.
The fanout is the punchline. To minimise disk reads, we want a tree that is as shallow as possible, which means each node should hold as many keys as we can fit in one page read.
What a B+ tree actually is
A B+ tree is a balanced, n-ary search tree designed so that each node fills exactly one disk page (typically 4 KB or 8 KB, 16 KB in InnoDB). Two properties matter:
- All values live in the leaf nodes. Internal nodes hold only keys and pointers used for routing. This is the "+" in B+ tree — vanilla B-trees store values in internal nodes too.
- Leaf nodes are linked in a doubly linked list. Once we find the start of a range, walking forward is sequential page reads, not tree traversals.
A small example with order 4 (each node holds up to 3 keys):
[ 20 | 40 ]
/ | \
[10|15] [20|25|30] [40|50|60]
| | |
leaves linked: [10|15] <-> [20|25|30] <-> [40|50|60]A lookup for 25 reads the root, picks the middle pointer (20 <= 25 < 40), reads that leaf page, and finds the row. Two page reads.
The fanout argument, with real numbers
Suppose a page is 8 KB and a key-plus-pointer pair is roughly 16 bytes. That's about 500 entries per internal node. With a fanout of 500:
- 1 level → 500 keys
- 2 levels → 250,000 keys
- 3 levels → 125,000,000 keys
- 4 levels → 62,500,000,000 keys
Three to four page reads gets us to any row in a table with tens of billions of entries. The top levels are almost always cached in RAM, so in practice a point lookup costs one disk read on a warm cache. This is the entire ballgame.
Compare with an AVL tree on the same billion rows: ~30 levels, ~30 reads in the worst case. The B+ tree is ten times shallower because each node is a thousand times wider.
Range queries fall out for free
Because leaves are sorted and linked:
SELECT * FROM orders WHERE id BETWEEN 1000 AND 2000;…costs one descent to find id = 1000, then a sequential walk along the leaf list until we pass 2000. No re-traversal, no random I/O — just contiguous page reads, which the OS prefetcher loves. Hash indexes can't do this at all.
The same property powers ORDER BY indexed_column LIMIT n (walk the leaves in order, stop early) and the merge join from the previous post — both sides arrive pre-sorted from their indexes.
Inserts and the balancing act
Inserts go to the right leaf. If the leaf is full, it splits in half and the median key is promoted to the parent. If the parent overflows, it splits too, and so on up to the root. The tree stays balanced because splits always preserve the invariant that every leaf is at the same depth.
In practice, B+ tree implementations leave some slack space per page (InnoDB's "fill factor", Postgres's fillfactor) so that not every insert triggers a split. For an append-only table with monotonically increasing keys — which is most tables with a serial primary key — splits happen only at the rightmost leaf, and the structure stays compact.
ALTER TABLE big_table SET (fillfactor = 90);
REINDEX TABLE big_table;I'd reach for fillfactor only when we're seeing visible page-split overhead on a write-heavy table. The default of 90 is well-chosen.
Clustered vs secondary indexes
In InnoDB (MySQL's default engine), the primary key index is the table — leaf pages store the full row. This is called a clustered index. Secondary indexes store the primary key as the row pointer, not a physical row location.
PostgreSQL works differently. Every index, primary or not, is a separate B+ tree whose leaves hold a tuple identifier (ctid) pointing into the heap. The heap is unordered. PostgreSQL has CLUSTER table USING index to physically reorder the heap, but it's a one-shot operation — new inserts go wherever there's space.
Both designs have trade-offs. Clustered indexes make primary-key range scans extremely fast (the leaf is the row), but secondary index lookups cost an extra hop. PostgreSQL's uniform model is simpler but every index lookup pays one heap fetch unless it's an Index Only Scan.
Where B+ trees creak
B+ trees are not the right answer for every workload. The cases where they hurt:
- Write-heavy, random-key workloads. Every insert into a random spot in the keyspace is a random page write. LSM trees (used by RocksDB, Cassandra, ScyllaDB) buffer writes in memory and flush them as sorted runs, trading read amplification for much higher write throughput.
- Very high write rates with bounded read latency requirements. Same answer — LSM.
- Pure key-value point lookups with no range queries. A hash index is faster and simpler. PostgreSQL has them; few people use them.
- Full-text search. Inverted indexes (GIN in PostgreSQL, Lucene under the hood of Elasticsearch) are the right tool.
- Spatial queries. R-trees and their variants.
For typical OLTP — mixed reads and writes, range queries, secondary indexes, transactional consistency — the B+ tree's combination of shallow depth, ordered leaves, and predictable update cost is genuinely hard to beat. That's why it has been the default for forty years.
Things to keep in mind
- B+ tree depth grows with
log(n)to the base of the fanout. Doubling the table size barely changes the number of disk reads per lookup, which is why "the index gets slow as the table grows" is usually not actually true. - The top two or three levels of any active index are almost always in PostgreSQL's
shared_buffersor InnoDB's buffer pool. Cold-cache numbers are misleading for production reasoning. - Index bloat is real. Long-running transactions and heavy updates leave dead tuples behind.
pg_stat_user_indexesandpgstattupletell us when aREINDEX CONCURRENTLYis worth scheduling. - A B+ tree on a
UUID v4column is a worst case — random keys cause splits everywhere and trash the cache. Reach forUUID v7orULIDif we want UUIDs with insert locality.
What I'd Recommend
Treat the B+ tree as the default and reach for alternatives only when the workload genuinely demands it. In my experience, most "we need a different index structure" conversations end with "actually the B-tree was fine, we just needed a composite index in a different column order." Understand the structure well enough to know why it's fast, and the cases where it isn't will be obvious when we hit them.