Analysis — db-05 LSM MemTable

Problem

Implement the in-memory write buffer of an LSM-tree such that

  1. it supports put, delete (tombstone insertion), get, and ordered iteration;
  2. it can be serialized to disk in a deterministic byte layout shared by Rust, Go, and C++;
  3. the same dump can be reloaded in any of the three languages;
  4. iteration order is byte-lexicographic on keys.

Constraints

  • Keys are arbitrary byte sequences up to 2^32 − 1 bytes (u32 length prefix).
  • Values are arbitrary byte sequences up to 2^32 − 1 bytes; for tombstones the value length is 0.
  • Cross-language interop: the dump format must be identical byte-for-byte and the cross-test script asserts SHA-256 equality of the three dumps.
  • No allocator tricks: simplicity over LevelDB-style arena/skiplist; we use the standard sorted map in each language.
  • No concurrency in this lab: single-threaded API. Concurrency arrives in db-09.

Design decisions

Why a sorted associative container, not a skip list?

Production LSMs (LevelDB, RocksDB) use skip lists because they support concurrent lock-free reads and arena allocation. For this teaching lab those benefits are irrelevant — what matters is determinism, byte-identical serialization, and the fact that iteration is in key order so the flush is a sequential write. Any sorted container satisfies that:

LanguageContainerWhy
RustBTreeMap<Vec<u8>, Entry>Vec<u8>: Ord is byte-lex; balanced.
Gomap[string]Entry + sort.Strings(keys)Avoid third-party sorted maps.
C++std::map<std::vector<uint8_t>, Entry>RB-tree; vector<uint8_t>::operator< is lex.

All three give the same iteration order for identical input, which is what cross-test checks.

Tombstones as entries

A delete(k) replaces whatever entry k had with Entry::Tombstone. Crucially the key is not erased — the tombstone must propagate to the SSTable to shadow older on-disk versions of the key.

On-disk dump layout

   offset  size  field
   ------  ----  --------------------------------
        0     4  magic ASCII "MMT1"
        4     4  count: u32 LE (entry count)
        8     ?  entries, sorted by key ascending:
                   [ klen: u32 LE ]
                   [ vlen: u32 LE ]   (0 if tombstone)
                   [ type: u8     ]   (0 = Value, 1 = Tombstone)
                   [ key bytes    ]
                   [ value bytes  ]

All multi-byte integers little-endian; the file is self-delimiting via count and each entry's two length prefixes. No checksum at this layer — the WAL (db-03) and the SSTable (db-06) carry their own.

Size accounting

size_bytes() returns the on-disk dump size assuming the current contents flush immediately: 8 bytes header + per entry (9 + klen + vlen). This is what an LSM would compare against write_buffer_size.

Error model

The decoder validates:

  • magic == MMT1,
  • enough bytes remain for each header field and the declared key/value spans,
  • type is 0 or 1,
  • tombstones have vlen == 0,
  • no trailing garbage,
  • keys appear in strictly ascending order.

A failure returns an explicit error (Error::* in Rust, error in Go, std::invalid_argument / std::runtime_error in C++) rather than panicking. The encoder cannot fail (no I/O at that layer).

Trade-offs

  • No sequence numbers. Real LSMs prepend a 64-bit (seqno << 8) | type to every internal key so MVCC snapshots can pick the right version. We collapse to "latest write wins" because db-13 reintroduces MVCC.
  • No range tombstones. Each delete shadows exactly one key. RocksDB-style range deletes are db-09 work.
  • No prefix bloom or compressed entries. The MemTable is in RAM; flushing to a proper SSTable (db-06) is where compression and block boundaries appear.
  • Allocation policy: Vec/vector/[]byte-per-entry, not an arena. Allocator pressure becomes interesting only at multi-million-key scales, which we exercise in db-22 benchmarking.

Alternatives considered

  • Skip list with arena (LevelDB style). Better concurrency, cache locality, and drop-the-whole-arena freeing — but the data structure complexity (random levels, acquire/release pointer ops) would dwarf the lab's pedagogical point.
  • Sorted vector with binary search. Lowest memory overhead, but every put is O(n) due to mid-vector insertion. Fine for tiny tables (<1 K entries), terrible beyond that.
  • HashMap with periodic sort. Fast inserts, but iteration is no longer cheap; every flush triggers a sort. Acceptable if flush is rare, painful otherwise.
  • B-epsilon tree. Batches writes inside internal nodes, blurring the line with LSM. Out of scope.