Analysis — db-05 LSM MemTable
Problem
Implement the in-memory write buffer of an LSM-tree such that
- it supports
put,delete(tombstone insertion),get, and ordered iteration; - it can be serialized to disk in a deterministic byte layout shared by Rust, Go, and C++;
- the same dump can be reloaded in any of the three languages;
- iteration order is byte-lexicographic on keys.
Constraints
- Keys are arbitrary byte sequences up to
2^32 − 1bytes (u32length prefix). - Values are arbitrary byte sequences up to
2^32 − 1bytes; for tombstones the value length is 0. - Cross-language interop: the dump format must be identical byte-for-byte and the cross-test script asserts SHA-256 equality of the three dumps.
- No allocator tricks: simplicity over LevelDB-style arena/skiplist; we use the standard sorted map in each language.
- No concurrency in this lab: single-threaded API. Concurrency arrives in db-09.
Design decisions
Why a sorted associative container, not a skip list?
Production LSMs (LevelDB, RocksDB) use skip lists because they support concurrent lock-free reads and arena allocation. For this teaching lab those benefits are irrelevant — what matters is determinism, byte-identical serialization, and the fact that iteration is in key order so the flush is a sequential write. Any sorted container satisfies that:
| Language | Container | Why |
|---|---|---|
| Rust | BTreeMap<Vec<u8>, Entry> | Vec<u8>: Ord is byte-lex; balanced. |
| Go | map[string]Entry + sort.Strings(keys) | Avoid third-party sorted maps. |
| C++ | std::map<std::vector<uint8_t>, Entry> | RB-tree; vector<uint8_t>::operator< is lex. |
All three give the same iteration order for identical input, which is what cross-test checks.
Tombstones as entries
A delete(k) replaces whatever entry k had with Entry::Tombstone. Crucially the
key is not erased — the tombstone must propagate to the SSTable to shadow older
on-disk versions of the key.
On-disk dump layout
offset size field
------ ---- --------------------------------
0 4 magic ASCII "MMT1"
4 4 count: u32 LE (entry count)
8 ? entries, sorted by key ascending:
[ klen: u32 LE ]
[ vlen: u32 LE ] (0 if tombstone)
[ type: u8 ] (0 = Value, 1 = Tombstone)
[ key bytes ]
[ value bytes ]
All multi-byte integers little-endian; the file is self-delimiting via count and
each entry's two length prefixes. No checksum at this layer — the WAL (db-03) and the
SSTable (db-06) carry their own.
Size accounting
size_bytes() returns the on-disk dump size assuming the current contents flush
immediately: 8 bytes header + per entry (9 + klen + vlen). This is what an LSM
would compare against write_buffer_size.
Error model
The decoder validates:
- magic ==
MMT1, - enough bytes remain for each header field and the declared key/value spans,
typeis 0 or 1,- tombstones have
vlen == 0, - no trailing garbage,
- keys appear in strictly ascending order.
A failure returns an explicit error (Error::* in Rust, error in Go,
std::invalid_argument / std::runtime_error in C++) rather than panicking. The
encoder cannot fail (no I/O at that layer).
Trade-offs
- No sequence numbers. Real LSMs prepend a 64-bit
(seqno << 8) | typeto every internal key so MVCC snapshots can pick the right version. We collapse to "latest write wins" because db-13 reintroduces MVCC. - No range tombstones. Each delete shadows exactly one key. RocksDB-style range deletes are db-09 work.
- No prefix bloom or compressed entries. The MemTable is in RAM; flushing to a proper SSTable (db-06) is where compression and block boundaries appear.
- Allocation policy:
Vec/vector/[]byte-per-entry, not an arena. Allocator pressure becomes interesting only at multi-million-key scales, which we exercise in db-22 benchmarking.
Alternatives considered
- Skip list with arena (LevelDB style). Better concurrency, cache locality, and drop-the-whole-arena freeing — but the data structure complexity (random levels, acquire/release pointer ops) would dwarf the lab's pedagogical point.
- Sorted vector with binary search. Lowest memory overhead, but every
putisO(n)due to mid-vector insertion. Fine for tiny tables (<1 K entries), terrible beyond that. - HashMap with periodic sort. Fast inserts, but iteration is no longer cheap; every flush triggers a sort. Acceptable if flush is rare, painful otherwise.
- B-epsilon tree. Batches writes inside internal nodes, blurring the line with LSM. Out of scope.