Data Structures for Storage
Status: Complete. Companion to db-01 and prerequisite for db-05 (MemTable) and db-10 (B-Tree).
1. What Is It
This lab is about the in-memory data structures that databases use, and why those choices change completely when the data is on disk. We build two structures from scratch:
- A skip list — an ordered, probabilistic, pointer-based map. This is what LevelDB and RocksDB use for their MemTable, and what Redis uses for sorted sets.
- A hash table with open-addressing + Robin Hood probing — an unordered, array-backed map. This is what you use when you need O(1) point lookups and don't need ordering.
We then benchmark them against each other on three workloads (insert, point lookup, range scan) and explain why the numbers come out the way they do.
This lab does not implement a B-Tree or a B+-Tree — those are on-disk structures and arrive in db-10. The lesson here is why a B-Tree dominates on disk even though a skip list or hash table is faster in RAM.
2. Why It Matters
Every database has a critical-path data structure for "find this key":
| System | Structure | Why |
|---|---|---|
| LevelDB / RocksDB MemTable | Skip list | Lock-free reads, ordered iteration for flush to SSTable |
| Redis sorted-set (ZSET) | Skip list + hash table | Skip list for ranked access, hash for O(1) by-key |
Memcached, Java HashMap | Open-addressing hash table | Unordered, point lookup only |
| SQLite, PostgreSQL, InnoDB | B+-Tree | On-disk: minimize page reads |
| Cassandra, ScyllaDB MemTable | Skip list | Same reasoning as LevelDB |
| Lucene postings | Skip list + delta-encoded arrays | Range scans over sorted doc IDs |
If you pick the wrong structure you don't get "a little slower" — you get "100× slower" or "we run out of memory at 100M keys." The cost model for an in-memory structure (cycles, cache misses) is not the cost model for an on-disk one (page reads, sync latency), and a structure tuned for one will lose badly in the other domain.
3. How It Works
Skip list
A skip list is a stack of linked lists. The bottom list contains every key in sorted order. Each higher list contains a random subset of the keys below it, sampled with probability p (we use p = 0.5). To find a key, you walk right on the highest level until you'd overshoot, then drop down a level, repeat.
level 3: HEAD ────────────────────────────────────────────────► NIL
│ │
level 2: HEAD ────────► [13] ───────────────► [42] ────────────► NIL
│ │ │ │
level 1: HEAD ──► [7] ─► [13] ─► [21] ───────► [42] ─► [55] ──► NIL
│ │ │ │ │ │ │
level 0: HEAD ► [3]►[7]►[13]►[21]►[34]►[39]►[42]►[51]►[55] ──► NIL
Searching for 39 from the top:
- L3: HEAD → NIL (overshoot from HEAD), drop to L2
- L2: HEAD → 13 → 42? 42 > 39, drop to L1
- L1: 13 → 21 → 42? 42 > 39, drop to L0
- L0: 21 → 34 → 39. Found.
Expected time: O(log n) with constant factor 1/p · log_{1/p}(n). With p = 0.5 that's 2 · log₂ n comparisons.
Hash table (open addressing, Robin Hood)
An array of 2^k slots. Hash the key, mod by table size, that's the home slot. If occupied by a different key, probe linearly (slot+1, slot+2, ...). Robin Hood twist: when you probe past slot i for the d-th time, and the resident at slot i has been probed only d' < d times, swap them — the "rich" entry gets displaced by the "poor" one. This bounds the worst-case probe distance to roughly the mean.
home(K1)=2 home(K2)=2 home(K3)=4
hash before insert:
┌──┬──┬────┬────┬────┬──┬──┬──┐
│ │ │ K1 │ K2 │ K3 │ │ │ │
└──┴──┴────┴────┴────┴──┴──┴──┘
0 1 2 3 4 5 6 7
(K1 home=2 dist=0, K2 home=2 dist=1, K3 home=4 dist=0)
insert K4 with home=2:
probe slot 2 (K1, dist 0); K4 dist=0; equal — keep going
probe slot 3 (K2, dist 1); K4 dist=1; equal — keep going
probe slot 4 (K3, dist 0); K4 dist=2 > 0 — STEAL, K3 displaced
probe slot 5 (empty); place K3 with dist=1
result:
┌──┬──┬────┬────┬────┬────┬──┬──┐
│ │ │ K1 │ K2 │ K4 │ K3 │ │ │
└──┴──┴────┴────┴────┴────┴──┴──┘
Loads up to ~0.9 work well with Robin Hood; we resize at 0.85 (× 2 capacity, rehash all).
When in-memory and on-disk diverge
A skip-list node holding a 16-byte key + 16-byte value + 4 forward pointers is ~64 bytes in memory. The same record packed into a B+-Tree leaf page (4 KiB page, no per-record pointers) is ~36 bytes — no level header, no forward-pointer array. And the B+-Tree co-locates ~100 records in one page, so a range scan of 100 keys is 1 page read instead of 100 random reads.
On disk:
- A pointer is an 8-byte offset that triggers a page read (~100 µs cold).
- A cache miss is ~100× a cache hit.
- A page is 4 KiB whether you read 1 byte or 4096.
Therefore on-disk structures want high fan-out, low height, contiguous siblings, no random pointer chasing. Skip lists violate all four; B+-Trees satisfy all four.
4. Core Terminology
| Term | Definition |
|---|---|
| Skip list | Probabilistic ordered map: stack of linked lists with geometric level distribution |
| Level | Index of a forward-pointer array in a skip-list node (0 = bottom, dense; higher = sparser) |
p | Per-level promotion probability (we use 0.5) |
| Sentinel head | Dummy node with the maximum possible level; all searches start here |
| Open addressing | Collision resolution: probe other slots in the same array (vs chaining into a list) |
| Linear probing | Probe sequence is h, h+1, h+2, … (best cache behavior) |
| Robin Hood | On insert, displace any resident whose probe distance is smaller than the newcomer's |
| Probe distance / PSL | Slots between a key's home slot and its actual slot (probe sequence length) |
| Load factor | len / capacity |
| Tombstone | Sentinel marking a deleted slot so probes don't short-circuit (we use backward-shift deletion instead) |
| Backward-shift deletion | After deleting a slot, shift the following non-home entries left by one; avoids tombstones |
| Cache line | 64 bytes on x86_64 / Apple Silicon; the unit the CPU fetches from RAM |
| Pointer chasing | A traversal whose next address depends on the byte just loaded; CPU cannot prefetch |
| Fan-out | Number of children per internal node in a tree; B+-Trees aim for hundreds |
5. Mental Models
"A skip list is a binary search you can mutate cheaply"
A balanced BST and a skip list have the same asymptotic complexity. The skip list wins because it has no rebalancing: no rotations, no recoloring. Each insert is one geometric coin flip + N forward-pointer writes (N = node height, ≈ log₂ n expected). This makes it much easier to make concurrent — LevelDB's MemTable allows lock-free reads while a writer inserts, because a partially-published node is invisible until the bottom-level pointer is CAS'd in.
"A hash table is a sparse array you pretend is dense"
If keys were integers in [0, N) you'd use an array. A hash function fakes that: it maps any key into [0, capacity). Collisions are the cost of the lie. Robin Hood specifically equalizes the cost of the lie across keys, so the worst-case lookup is close to the average.
"Cost models differ by 5 orders of magnitude"
L1 cache hit ~1 ns
Main memory ~100 ns
NVMe SSD random read (cold) ~100 µs ← 1000× RAM
HDD seek ~10 ms ← 100× SSD
Cross-DC round trip ~50 ms
A "fast" in-memory structure becomes irrelevant if it issues 10 page reads where a B+-Tree issues 1. The B+-Tree's "slow" O(log_B n) with B = 256 beats the skip list's "fast" O(log₂ n) on disk by a factor of log₂(256) = 8 — every level you avoid saves 100 µs.
6. Common Misconceptions
- "Skip lists are slow because they're probabilistic." No — the expected and with-high-probability bounds are tight. The variance is small for any list above ~1000 keys. Failure modes are bad RNG seed (we use a deterministic xorshift here) and adversarial key insertion patterns (irrelevant for hashed keys; mitigated by per-instance seed).
- "Hash tables have O(1) worst case." Average, not worst. A pathological hash function or adversarial keys produce O(n) chains. Robin Hood mitigates variance but does not change the worst case.
- "You should always use a hash table when you don't need ordering." Two cases where skip lists or trees win even unordered: (a) you need iteration in any deterministic order across runs; (b) memory is tight and you can't afford the 15–40% slack of a hash table at safe load factors.
- "Open addressing wastes memory because of empty slots." Chaining wastes more in practice: every chain node is a heap allocation with a header (
mallocarenas + pointer + next ptr ≈ 32 bytes overhead per entry). Linear-probing hash tables with 70% load factor still use less memory thanstd::unordered_map. - "A B-Tree is just a balanced BST with more children." No: B+-Trees keep all data in leaves and chain leaves with sibling pointers, making range scans
O(1)per page after the initial descent. - "
std::unordered_map/ Gomap/ Pythondictare the gold standard." They're general-purpose. Specialized hash tables (Abseil'sflat_hash_map, Rust'shashbrown, F14) beat them by 2–5× on most workloads. Database authors often roll their own.
7. Interview Talking Points
- "Why does LevelDB use a skip list for the MemTable instead of a red-black tree?" → Lockless reads via single-pointer CAS publish; no rotations; easier to implement correctly.
- "Why isn't a hash table good enough for a MemTable?" → MemTable flushes to an SSTable, which is a sorted file. A hash table would require sorting at flush time (O(n log n)); a skip list is already sorted, so flush is O(n) sequential.
- "When would you use chaining vs open addressing?" → Open addressing for small fixed-size values (better cache); chaining when values are large and you want pointer stability across resizes.
- "What's the cost model on disk that breaks skip lists?" → Each level traversal is a potential page read. With
log₂ nlevels you paylog₂ n × 100 µs. A B+-Tree with fan-out 256 haslog_256 nlevels, so 3 page reads for 16 M keys vs 24. - "Why is Robin Hood probing useful?" → Bounds variance: the maximum probe distance grows as
O(log n)w.h.p., and lookups for missing keys become almost as fast as hits because you can stop as soon as you see a slot with smaller PSL than yours. - "What's the alternative to tombstones?" → Backward-shift deletion: walk forward from the deleted slot, shift each non-home entry left until you hit an empty slot or a home-slotted entry. O(probe-length) per delete, no tombstone bookkeeping.
8. Connections to Other Labs
- db-01 Storage Primitives — the page abstraction the disk structures use.
- db-03 WAL — the WAL is appended before the MemTable insert; failure recovery rebuilds the MemTable by replaying the WAL.
- db-04 Bloom Filters — sits in front of the SSTable; same family of probabilistic in-memory structures.
- db-05 LSM MemTable — uses the skip list from this lab, adds a snapshot / immutable flip.
- db-10 B-Tree Fundamentals — contrast with this lab; same problem, different cost model.
- db-21 Storage Engine Advanced — concurrent skip list (CAS publish), concurrent hash table (extendible hashing, lock striping).