db-10 — B-Tree Fundamentals
The first lab of the B-tree track. Up to db-09 every persistent structure we built was an LSM (log + sorted runs + merge). Postgres, MySQL/InnoDB, SQLite, Oracle, SQL Server, and most embedded key-value engines you have never heard of are B-trees instead. This lab builds the in-memory kernel; db-11 wraps it in a pager so it can live on disk.
What is it?
A self-balancing search tree where every node holds up to 2T - 1
keys (and, if internal, up to 2T children). We pick the smallest
non-trivial degree T = 2, giving:
- 1 ≤ keys ≤ 3 per non-root node
- 2 ≤ children ≤ 4 per non-root internal node
- root may hold 1..3 keys (or 0 if the tree is empty)
The algorithms are the textbook CLRS B-tree: insert splits a child
proactively while descending if it is full; delete rebalances a
child proactively while descending if it would underflow. With this
discipline every operation is exactly one root-to-leaf traversal —
no second pass, no recursion back up to fix invariants.
Keys and values are arbitrary byte slices; comparison is lexicographic. Each node carries the value of every key it holds (this is a B-tree, not a B+-tree — values do not live exclusively in the leaves). db-11 will make the leaf-only choice when we introduce the pager and need to keep internal nodes small.
Why does it matter?
- Predictable depth.
log_T(n)height withT=2gives a small, perfectly bounded number of comparisons per lookup, no matter the insertion order. LSMs trade log writes for O(log levels) read amplification; B-trees trade page rewrites for a tight bound. - In-place update. A B-tree key update mutates exactly one node. LSMs append a new record and reclaim the old one during compaction. Which is better depends on workload — db-22 will measure it.
- The canonical study substrate. Every working storage engineer has implemented a B-tree at least once. Splits and merges are the microcosm of every concurrent, copy-on-write, or page-versioned variant that exists in production code.
How does it work?
Node layout
┌─────────────────────────── Node ────────────────────────────┐
│ is_leaf : bool │
│ keys : Vec<(key, value)> // 1..3 entries │
│ children: Vec<Box<Node>> // 0 if leaf, else nkeys+1│
└─────────────────────────────────────────────────────────────┘
Internal node with two keys (k0 < k1):
┌──────────┬──────────┐
│ k0,v0 │ k1,v1 │
└─┬──────┬─┴────────┬─┘
│ │ │
▼ ▼ ▼
c0 keys c0<…<k0 k0<…<k1 c2 keys k1<…
Insert (proactive split)
Descend from the root. Before stepping into any full child
(nkeys == 3), split that child in place: promote its middle key to
the parent, drop the right sibling into the parent's child list at
position i+1, and let the new (now non-full) child take the
descent. If the root itself is full, grow upwards: create a new
parent with the old root as its only child, then split. This is the
only place tree height increases.
Before split (child too full): After split (middle promoted):
[ K ] [ K , k1 ]
│ │ │
▼ ▼ ▼
[k0, k1, k2] [k0] [k2]
Delete (proactive rebalance)
Descend from the root looking for the key. Before stepping into any
child that has only T - 1 = 1 key, ensure it has at least T = 2
keys by one of:
- Borrow from left sibling — rotate left sibling's last key up into the parent, parent's separator down into the child's front.
- Borrow from right sibling — symmetric.
- Merge with a sibling — pull the parent's separator down,
concatenate child + separator + sibling into a single node with
2T - 1 = 3keys.
If the root becomes an empty internal node (only one child, no keys) after the operation, collapse it: the root's only child becomes the new root. This is the only place tree height decreases.
Deletes that hit an internal key are handled by replacing the key with its in-order successor (or predecessor) and recursing the delete into that subtree, where the recursion terminates at a leaf.
Canonical serialization
A preorder traversal of the tree emitting, per node:
u8 is_leaf (1 if leaf, 0 if internal)
u32 LE nkeys
nkeys * { u32 LE klen | klen bytes key | u32 LE vlen | vlen bytes val }
if !is_leaf:
(nkeys + 1) * recurse(child)
The empty tree therefore serializes as five bytes:
01 00 00 00 00 (one leaf node with zero keys).
This format captures structure, not just contents. Two trees with
the same {(key, value)} set but different splits / shapes produce
different byte sequences — so scripts/cross_test.sh would catch a
language whose insertion order or split rule diverged, even if the
externally-visible scan output still agreed.
Deterministic workload
run_workload(scenario, seed, ops) drives a fresh tree using
SplitMix64(seed) to generate keys (8-byte big-endian indices
modulo a 200-entry key space) and values (4-byte big-endian). The
three scenarios:
| scenario | per-iteration behavior |
|---|---|
inserts | always insert(key, val) |
deletes | insert during the first half, delete(key) during the second |
mixed | bits 62..63 of r1 decide: insert (0,1), delete (2), no-op (3) |
Two PRNG outputs are consumed per iteration regardless of which branch is taken, so the key sequence is invariant under the scenario choice and only the operation kind differs. This makes the three scenarios easy to reason about: they all visit the same keys in the same order.
The btreectl CLI
btreectl --seed N --ops M --scenario {inserts|deletes|mixed}
Runs the chosen workload and writes the serialized tree to stdout (raw bytes, no trailing newline).
Cross-language invariant
scripts/cross_test.sh runs the same (seed, ops, scenario) triple
through Rust, Go, and C++ btreectl binaries and asserts that all
three produce byte-identical output via sha256 for two scenarios:
| scenario | seed | ops | sha256 | size |
|---|---|---|---|---|
A inserts | 42 | 500 | 4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088 | varies |
B mixed | 7 | 500 | 9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718 | 2515 B |
A matching hash proves that all three implementations agree on: the PRNG, the lexicographic key compare, the proactive-split insertion, the proactive-rebalance deletion, and the precise tree shape after the workload. Any drift in any of these surfaces as a sha256 mismatch.
What's intentionally out of scope
- Persistence. db-11 introduces the pager and turns nodes into fixed-size disk pages.
- B+-tree leaves-only-values layout. Also db-11; it's the natural change once internal nodes need to fit one to a page.
- Concurrent / lock-coupling B-trees. db-13 (MVCC) and db-21 (storage-engine advanced) explore copy-on-write and latch protocols.
- Variable-length keys with prefix compression. SQLite and RocksDB both do this; we will revisit in db-15.