db-10 — B-Tree Fundamentals

The first lab of the B-tree track. Up to db-09 every persistent structure we built was an LSM (log + sorted runs + merge). Postgres, MySQL/InnoDB, SQLite, Oracle, SQL Server, and most embedded key-value engines you have never heard of are B-trees instead. This lab builds the in-memory kernel; db-11 wraps it in a pager so it can live on disk.

What is it?

A self-balancing search tree where every node holds up to 2T - 1 keys (and, if internal, up to 2T children). We pick the smallest non-trivial degree T = 2, giving:

  • 1 ≤ keys ≤ 3 per non-root node
  • 2 ≤ children ≤ 4 per non-root internal node
  • root may hold 1..3 keys (or 0 if the tree is empty)

The algorithms are the textbook CLRS B-tree: insert splits a child proactively while descending if it is full; delete rebalances a child proactively while descending if it would underflow. With this discipline every operation is exactly one root-to-leaf traversal — no second pass, no recursion back up to fix invariants.

Keys and values are arbitrary byte slices; comparison is lexicographic. Each node carries the value of every key it holds (this is a B-tree, not a B+-tree — values do not live exclusively in the leaves). db-11 will make the leaf-only choice when we introduce the pager and need to keep internal nodes small.

Why does it matter?

  • Predictable depth. log_T(n) height with T=2 gives a small, perfectly bounded number of comparisons per lookup, no matter the insertion order. LSMs trade log writes for O(log levels) read amplification; B-trees trade page rewrites for a tight bound.
  • In-place update. A B-tree key update mutates exactly one node. LSMs append a new record and reclaim the old one during compaction. Which is better depends on workload — db-22 will measure it.
  • The canonical study substrate. Every working storage engineer has implemented a B-tree at least once. Splits and merges are the microcosm of every concurrent, copy-on-write, or page-versioned variant that exists in production code.

How does it work?

Node layout

        ┌─────────────────────────── Node ────────────────────────────┐
        │  is_leaf : bool                                             │
        │  keys    : Vec<(key, value)>      // 1..3 entries           │
        │  children: Vec<Box<Node>>         // 0 if leaf, else nkeys+1│
        └─────────────────────────────────────────────────────────────┘

Internal node with two keys (k0 < k1):

        ┌──────────┬──────────┐
        │   k0,v0  │   k1,v1  │
        └─┬──────┬─┴────────┬─┘
          │      │          │
          ▼      ▼          ▼
        c0 keys  c0<…<k0    k0<…<k1    c2 keys k1<…

Insert (proactive split)

Descend from the root. Before stepping into any full child (nkeys == 3), split that child in place: promote its middle key to the parent, drop the right sibling into the parent's child list at position i+1, and let the new (now non-full) child take the descent. If the root itself is full, grow upwards: create a new parent with the old root as its only child, then split. This is the only place tree height increases.

Before split (child too full):     After split (middle promoted):

   [   K   ]                          [ K , k1 ]
        │                              │      │
        ▼                              ▼      ▼
  [k0, k1, k2]                       [k0]    [k2]

Delete (proactive rebalance)

Descend from the root looking for the key. Before stepping into any child that has only T - 1 = 1 key, ensure it has at least T = 2 keys by one of:

  1. Borrow from left sibling — rotate left sibling's last key up into the parent, parent's separator down into the child's front.
  2. Borrow from right sibling — symmetric.
  3. Merge with a sibling — pull the parent's separator down, concatenate child + separator + sibling into a single node with 2T - 1 = 3 keys.

If the root becomes an empty internal node (only one child, no keys) after the operation, collapse it: the root's only child becomes the new root. This is the only place tree height decreases.

Deletes that hit an internal key are handled by replacing the key with its in-order successor (or predecessor) and recursing the delete into that subtree, where the recursion terminates at a leaf.

Canonical serialization

A preorder traversal of the tree emitting, per node:

u8     is_leaf            (1 if leaf, 0 if internal)
u32 LE nkeys
nkeys * { u32 LE klen | klen bytes key | u32 LE vlen | vlen bytes val }
if !is_leaf:
   (nkeys + 1) * recurse(child)

The empty tree therefore serializes as five bytes: 01 00 00 00 00 (one leaf node with zero keys).

This format captures structure, not just contents. Two trees with the same {(key, value)} set but different splits / shapes produce different byte sequences — so scripts/cross_test.sh would catch a language whose insertion order or split rule diverged, even if the externally-visible scan output still agreed.

Deterministic workload

run_workload(scenario, seed, ops) drives a fresh tree using SplitMix64(seed) to generate keys (8-byte big-endian indices modulo a 200-entry key space) and values (4-byte big-endian). The three scenarios:

scenarioper-iteration behavior
insertsalways insert(key, val)
deletesinsert during the first half, delete(key) during the second
mixedbits 62..63 of r1 decide: insert (0,1), delete (2), no-op (3)

Two PRNG outputs are consumed per iteration regardless of which branch is taken, so the key sequence is invariant under the scenario choice and only the operation kind differs. This makes the three scenarios easy to reason about: they all visit the same keys in the same order.

The btreectl CLI

btreectl --seed N --ops M --scenario {inserts|deletes|mixed}

Runs the chosen workload and writes the serialized tree to stdout (raw bytes, no trailing newline).

Cross-language invariant

scripts/cross_test.sh runs the same (seed, ops, scenario) triple through Rust, Go, and C++ btreectl binaries and asserts that all three produce byte-identical output via sha256 for two scenarios:

scenarioseedopssha256size
A inserts425004b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088varies
B mixed75009edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e377182515 B

A matching hash proves that all three implementations agree on: the PRNG, the lexicographic key compare, the proactive-split insertion, the proactive-rebalance deletion, and the precise tree shape after the workload. Any drift in any of these surfaces as a sha256 mismatch.

What's intentionally out of scope

  • Persistence. db-11 introduces the pager and turns nodes into fixed-size disk pages.
  • B+-tree leaves-only-values layout. Also db-11; it's the natural change once internal nodes need to fit one to a page.
  • Concurrent / lock-coupling B-trees. db-13 (MVCC) and db-21 (storage-engine advanced) explore copy-on-write and latch protocols.
  • Variable-length keys with prefix compression. SQLite and RocksDB both do this; we will revisit in db-15.