Why Databases Reach for B+ Trees

PostgreSQL, MySQL, SQL Server, Oracle, SQLite - they all store indexes in B+ trees, and often the tables too. To understand why, we need to see what an index looks like on disk and how the alternative data structures would fare there.

Indexes are stored on disk, and disk is slow.

Indexes don't live in RAM. They live on disk, and disk is orders of magnitude slower than memory for random access. A single page read on NVMe runs about 100 µs. A main-memory access is around 100 ns, and an L1 cache hit is under a nanosecond. That is a gap of three to five orders of magnitude.

This gap is the reason B+ trees exist. There is really one question worth asking about an on-disk index: how many page reads does it take to find a row? CPU work and in-memory layout are small factors compared to a disk read.

The alternative data structures, and why they are worse

Sorted array: binary search gives O(log n) lookups. But inserts are O(n), because every insert in the middle shifts everything after it. No good for a table that takes writes. A linked list flips that - cheap inserts once you are at the right position, but O(n) lookups to get there. So a linked list is no good either.

A hash table gives us O(1) point lookups. But it offers no range queries and no ordered iteration, and rehashing a multi-gigabyte on-disk hash table is slow. Databases keep hash indexes around for specific cases, but not as the default.
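
The range-query difference is easy to see in memory. The sketch below is only an illustration with made-up ids; range_sorted and range_hashed are hypothetical helper names, and a Python list and dict stand in for the on-disk structures.

```python
import bisect

# Hypothetical data: ten indexed ids, kept both as a sorted list and as a dict.
ids = [3, 8, 15, 21, 42, 57, 63, 78, 84, 99]
by_hash = {key: f"row-{key}" for key in ids}

def range_sorted(keys, lo, hi):
    """Return keys in [lo, hi] using two binary searches on a sorted list."""
    start = bisect.bisect_left(keys, lo)
    stop = bisect.bisect_right(keys, hi)
    return keys[start:stop]

def range_hashed(table, lo, hi):
    """A hash table has no key order, so every entry has to be examined."""
    return sorted(key for key in table if lo <= key <= hi)

print(range_sorted(ids, 20, 60))      # [21, 42, 57]
print(range_hashed(by_hash, 20, 60))  # [21, 42, 57], but only after a full scan
```

Both calls return the same keys. The sorted structure gets there with two O(log n) searches, the hash table with an O(n) scan plus a sort.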

A binary search tree gives us O(log n) lookups when it is balanced. But each node holds one key, so a billion rows means a tree around 30 levels deep, and up to 30 random disk reads per lookup. Self-balancing (AVL, red-black) guarantees that depth but cannot reduce it.

So fanout is the lever. Fewer disk reads means a shallower tree, and a shallower tree means each node should pack as many keys as fit in one page read.

So what a B+ tree actually is

A balanced n-ary search tree built so that each node occupies exactly one disk page. A page is usually 4 KB or 8 KB; InnoDB uses 16 KB by default. Two things make it work.

All the values live in the leaves. Internal nodes hold keys and child pointers, nothing else - they just route. That is what the "+" in "B+ tree" refers to. The leaves are also chained together, typically as a doubly linked list. So once you find the start of a range, walking forward is just sequential page reads, with no need to go back up the tree over and over.

Small example, order 4 (each node holds up to 3 keys):

                  [ 20 | 40 ]
                 /     |     \
        [10|15]  [20|25|30]  [40|50|60]
           |         |            |
        leaves linked: [10|15] <-> [20|25|30] <-> [40|50|60]

Looking up 25 goes like this: read the root, pick the middle pointer (20 <= 25 < 40), read that leaf, find the row. Two page reads in total.
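
The same lookup as a minimal in-memory sketch. Leaf, Internal, make_leaf and lookup are illustrative names, not any database's API, and each node visit is counted as one page read under the assumption that one node equals one page.

```python
import bisect
from dataclasses import dataclass

@dataclass
class Leaf:
    keys: list
    rows: list

@dataclass
class Internal:
    keys: list      # separator keys
    children: list  # always len(keys) + 1 children

def lookup(root, key):
    """Return (row, pages_read); row is None if the key is absent."""
    node, pages_read = root, 1
    while isinstance(node, Internal):
        # bisect_right sends key == separator to the right child,
        # which matches the routing rule 20 <= 25 < 40 -> middle pointer.
        node = node.children[bisect.bisect_right(node.keys, key)]
        pages_read += 1
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.rows[i], pages_read
    return None, pages_read

def make_leaf(keys):
    return Leaf(keys, [f"row-{k}" for k in keys])

# The order-4 example tree from the diagram above.
root = Internal([20, 40], [make_leaf([10, 15]),
                           make_leaf([20, 25, 30]),
                           make_leaf([40, 50, 60])])

print(lookup(root, 25))  # ('row-25', 2)
print(lookup(root, 35))  # (None, 2)
```

A miss costs the same two reads as a hit: the descent always ends at a leaf, and the leaf decides.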

Fanout, in numbers

A page is 8 KB and a key-plus-pointer pair is approximately 16 bytes, so an internal node holds about 500 entries (8192 / 16 = 512). From that we can estimate how many keys a tree of a given depth can address:

1 level → 500 keys
2 levels → 250,000 keys
3 levels → 125,000,000 keys
4 levels → 62,500,000,000 keys

Three or four page reads get you to any row in a table with tens of billions of entries. The top levels are nearly always cached, so a lookup on a warm cache usually costs a single disk read.

That same billion rows in an AVL tree requires around 30 levels, so 30 reads in the worst case. The B+ tree is about eight times shallower (4 levels versus 30) because each node is about 500 times wider.
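
The arithmetic is easy to reproduce. This is a back-of-the-envelope sketch: levels_needed is a hypothetical helper, the 16-byte entry size is the rough figure assumed above, and capacity is approximated as fanout to the power of the level count for both tree shapes.

```python
PAGE_BYTES = 8 * 1024
ENTRY_BYTES = 16                       # assumed key-plus-pointer size
exact_fanout = PAGE_BYTES // ENTRY_BYTES
ROUNDED_FANOUT = 500                   # the rounded figure used in the table

def levels_needed(num_keys, fanout):
    """Smallest level count whose capacity fanout**levels covers num_keys."""
    levels, capacity = 1, fanout
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels

print(exact_fanout)                                  # 512
print([ROUNDED_FANOUT ** k for k in range(1, 5)])    # 500, 250000, 125000000, 62500000000

for n in (10**6, 10**9, 10**10):
    # key count, B+ tree levels, binary tree levels
    print(n, levels_needed(n, ROUNDED_FANOUT), levels_needed(n, 2))
# 1000000 3 20
# 1000000000 4 30
# 10000000000 4 34
```

Going from a billion to ten billion keys adds four levels to the binary tree and none to the B+ tree.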

Range queries are cheap on B+ trees

Leaves are sorted and linked, so:

SELECT * FROM orders WHERE id BETWEEN 1000 AND 2000;

only requires finding id = 1000 and then walking along the leaf list until we pass 2000. No re-traversal and little random I/O - largely sequential page reads, which the OS prefetcher handles well. A hash index just can't do this.

The same goes for ORDER BY indexed_column LIMIT n: we walk the leaves in order and stop early. It also helps the merge join from the previous post, where both sides arrive already sorted from their indexes.
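
A minimal sketch of that leaf walk, assuming the root-to-leaf descent has already found the starting leaf. Leaf, link and range_scan are illustrative names; the optional limit parameter models the early stop of ORDER BY ... LIMIT n.

```python
import bisect
from dataclasses import dataclass
from typing import Optional

@dataclass
class Leaf:
    keys: list
    next_leaf: Optional["Leaf"] = None  # right sibling in the leaf chain

def link(leaves):
    """Chain the leaves left to right."""
    for left, right in zip(leaves, leaves[1:]):
        left.next_leaf = right
    return leaves

def range_scan(start_leaf, lo, hi, limit=None):
    """Collect keys in [lo, hi] by walking the leaf chain.

    Returns (keys, leaf_pages_read). limit, if given, must be positive.
    """
    out, pages = [], 1
    leaf = start_leaf
    i = bisect.bisect_left(leaf.keys, lo)
    while leaf is not None:
        for key in leaf.keys[i:]:
            if key > hi:
                return out, pages
            out.append(key)
            if limit is not None and len(out) == limit:
                return out, pages   # stop early, as LIMIT would
        leaf = leaf.next_leaf
        i = 0
        if leaf is not None:
            pages += 1
    return out, pages

first, middle, last = link([Leaf([10, 15]), Leaf([20, 25, 30]), Leaf([40, 50, 60])])

print(range_scan(middle, 25, 50))           # ([25, 30, 40, 50], 2)
print(range_scan(first, 10, 100, limit=2))  # ([10, 15], 1)
```

The scan never goes back to the root: after the first leaf, every step is a sibling pointer.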

Inserts

An insert goes to the leaf where the key belongs. If that leaf is full, it splits in half and the median key is copied up to the parent. If the parent overflows, it splits too, and this can repeat all the way up to the root. The tree stays balanced because it only grows taller when the root itself splits, which adds one level above every leaf at once.

B+ tree implementations leave some free space in each page (InnoDB's "fill factor", Postgres's fillfactor) so that not every insert triggers a split. An append-only table with increasing keys only ever splits the rightmost leaf, and stays compact.
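
A leaf split, sketched on plain sorted lists. MAX_KEYS follows the order-4 example (three keys per node), and insert_into_leaf is a hypothetical helper that covers only the leaf level, not the propagation to parents.

```python
import bisect

MAX_KEYS = 3  # order 4, as in the small example above

def insert_into_leaf(leaf_keys, key):
    """Insert key into a sorted leaf.

    Returns (left, right, separator). right and separator are None
    when the leaf had room and no split was needed.
    """
    keys = list(leaf_keys)
    bisect.insort(keys, key)
    if len(keys) <= MAX_KEYS:
        return keys, None, None
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    # The separator is copied up to the parent and also stays in the
    # right leaf, because in a B+ tree every key must remain in a leaf.
    return left, right, right[0]

print(insert_into_leaf([10, 15], 12))      # ([10, 12, 15], None, None)
print(insert_into_leaf([20, 25, 30], 27))  # ([20, 25], [27, 30], 27)
```

The parent would then gain the separator 27 and a pointer to the new right leaf. If the parent is full as well, the same split applies one level up.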

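-- big_table_pkey is an assumed index name. Postgres B-tree indexes default to
-- fillfactor 90; a lower value leaves more free space in each leaf page.
ALTER INDEX big_table_pkey SET (fillfactor = 80);
REINDEX INDEX big_table_pkey;

You can tune fillfactor, but touch it only when you actually see page-split overhead on a write-heavy table.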

Clustered vs secondary indexes

In InnoDB, MySQL's default engine, the primary key index is the table. Leaf pages hold the full row. That's a clustered index. Secondary indexes store the primary key as the row pointer instead of a physical location.

Postgres does it differently. Both primary-key and secondary indexes are B+ trees whose leaves hold a tuple identifier (ctid) pointing into the heap. The heap itself is unordered. There is a CLUSTER table USING index operation that physically reorders it, but it is a one-time operation: new inserts still land wherever there is space.

Both approaches have pros and cons. Clustered indexes make primary-key range scans very fast because the leaf is the row - but a secondary index lookup needs an extra step: a second traversal of the primary-key tree. Postgres's uniform model is easier to reason about. We pay for it with a heap fetch per lookup, unless the query qualifies for an Index Only Scan.
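
The two lookup paths can be sketched with plain dictionaries standing in for the B+ trees and the heap. The table contents, the email column and the function names are all made up for illustration; the point is only the number of steps in each path.

```python
# InnoDB-style layout: the clustered index IS the table (primary key -> full row),
# and a secondary index maps the indexed column to the primary key.
clustered = {
    1: {"id": 1, "email": "a@example.com"},
    2: {"id": 2, "email": "b@example.com"},
}
secondary_innodb = {"a@example.com": 1, "b@example.com": 2}

def innodb_lookup_by_email(email):
    pk = secondary_innodb[email]   # step 1: traverse the secondary B+ tree
    return clustered[pk]           # step 2: traverse the clustered B+ tree

# Postgres-style layout: rows sit in an unordered heap, addressed by ctid
# (block number, offset). Every index, primary or not, maps key -> ctid.
heap = {
    (0, 1): {"id": 1, "email": "a@example.com"},
    (0, 2): {"id": 2, "email": "b@example.com"},
}
pk_index_pg = {1: (0, 1), 2: (0, 2)}
email_index_pg = {"a@example.com": (0, 1), "b@example.com": (0, 2)}

def pg_lookup_by_email(email):
    ctid = email_index_pg[email]   # step 1: traverse the index
    return heap[ctid]              # step 2: fetch the tuple from the heap

def pg_lookup_by_id(row_id):
    ctid = pk_index_pg[row_id]     # the primary key pays the same heap fetch
    return heap[ctid]

print(innodb_lookup_by_email("b@example.com"))  # {'id': 2, 'email': 'b@example.com'}
print(pg_lookup_by_email("b@example.com"))      # {'id': 2, 'email': 'b@example.com'}
```

In the InnoDB-style model a primary-key lookup is a single step (clustered[pk]), while in the Postgres-style model every lookup has two.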

Where B+ trees fall short

A B+ tree isn't the answer for everything.

One poor-performance case is a write-heavy table with random keys. Every insert lands somewhere random in the keyspace, so each one dirties a different page and results in a random page write. LSM trees (RocksDB, Cassandra, ScyllaDB) instead buffer writes in memory and flush them as sorted runs. They accept higher read amplification in exchange for much more write throughput.
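
A toy model of that trade-off. TinyLSM is a hypothetical class written for this post, not how RocksDB or Cassandra are implemented: it has no compaction, no write-ahead log and no deletes. It shows only the two properties named above, buffered writes flushed as sorted runs and reads that may have to check several runs.

```python
import bisect

class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}   # in-memory write buffer
        self.runs = []       # flushed sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value   # no page is touched on the write path
        if len(self.memtable) >= self.memtable_limit:
            # One sequential write of a sorted run instead of many random ones.
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Read amplification: check runs from newest to oldest.
        for run in reversed(self.runs):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = TinyLSM(memtable_limit=2)
db.put(5, "a")
db.put(1, "b")   # memtable full: flushed as [(1, 'b'), (5, 'a')]
db.put(5, "c")
db.put(9, "d")   # second flush: [(5, 'c'), (9, 'd')]

print(db.get(5))     # c  (the newest run wins)
print(db.get(1))     # b  (found only after checking the newer run first)
print(len(db.runs))  # 2
```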

For typical usage (mixed reads and writes, range queries, secondary indexes, transactional consistency) the B+ tree is hard to beat. It has a shallow depth, ordered leaves, and a predictable update cost. That's why it has stayed the default index structure for so long.

A couple of things to keep in mind. Depth grows as the logarithm of n to the base of the fanout, so doubling the table size barely moves the reads per lookup. And the top two or three levels of any active index sit in Postgres's shared_buffers or InnoDB's buffer pool nearly all the time, so cold-cache numbers will mislead you about production.

A few words on index maintenance. Long-running transactions and heavy updates leave dead tuples and bloated index pages behind. pg_stat_user_indexes and pgstattuple help tell you when a REINDEX CONCURRENTLY is worth scheduling. Also avoid putting a B+ tree index on a UUID v4 column where you can - random keys split pages all over the tree and push useful pages out of cache. If you want UUIDs with insert locality, use UUID v7 or ULID instead.
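
The locality argument can be checked with a crude simulation. Everything here is an assumption made for illustration: plain integers stand in for UUIDs (random ones for v4-style keys, increasing ones for v7-style keys), a sorted list stands in for the leaf level, and a "page" is simply the insert position divided by a made-up PAGE_CAPACITY.

```python
import bisect
import random

PAGE_CAPACITY = 100  # hypothetical number of keys per leaf page

def pages_touched(existing_keys, new_keys):
    """Insert new_keys into a sorted list and count distinct leaf pages hit."""
    keys = sorted(existing_keys)
    touched = set()
    for key in new_keys:
        pos = bisect.bisect_left(keys, key)
        touched.add(pos // PAGE_CAPACITY)
        keys.insert(pos, key)
    return len(touched)

rng = random.Random(42)
KEYSPACE = 10**9
existing = rng.sample(range(KEYSPACE), 100_000)

random_keys = [rng.randrange(KEYSPACE) for _ in range(1_000)]  # v4-like
increasing_keys = [KEYSPACE + i for i in range(1_000)]         # v7-like

random_pages = pages_touched(existing, random_keys)
increasing_pages = pages_touched(existing, increasing_keys)

print("random keys touched", random_pages, "pages")
print("increasing keys touched", increasing_pages, "pages")    # 10 pages
```

A thousand increasing keys all land at the right edge and touch ten pages. A thousand random keys are spread over hundreds of different pages, and each of those has to be in cache or be read from disk.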

In conclusion, the B+ tree is the best general-purpose data structure for indexes, and for typical workloads it is rarely the bottleneck.