Broader Ideas — db-05 LSM MemTable

The MemTable in this lab is intentionally minimal. Real systems extend it in many directions; this doc maps the design space.

Concurrency-friendly structures

  • Skip list (LevelDB, RocksDB). Single writer + many readers, lock-free reads via memory-order acquire/release. The dominant choice for LSM MemTables.
  • Bw-tree (Hekaton). Lock-free B+ tree using delta records and a mapping table; shines on multi-writer workloads.
  • ART (Adaptive Radix Tree). Cache-friendly trie; very fast point lookups, used by HyPer, DuckDB, and recent CockroachDB internals for some indexes.
  • Masstree. Trie-of-B+trees; outperforms skip list on long variable keys.

Arena allocation

LevelDB's MemTable allocates all skip-list nodes from a bump-arena. Freeing is O(1) (drop the arena). RocksDB has a configurable Arena and a ConcurrentArena for parallel writes. Real benefit: less fragmentation and one cache-line probe per allocation. Our lab uses standard allocators because the lesson is the data layout, not the allocator.

Sequence numbers & MVCC

Production LSMs prepend a 64-bit sequence number (and an 8-bit type byte) to every internal key. Snapshot reads pick the latest sequence ≤ the snapshot's tag. db-13 revisits this when we add MVCC; here we collapse to last-write-wins.

Range tombstones

A single tombstone shadows one key. RocksDB's DeleteRange tombstones cover a key range [start, end) and live in a separate auxiliary structure inside the MemTable (RangeDelAggregator). This avoids exploding the MemTable size when bulk-deleting. Adding it would require:

  1. A RangeTombstone struct: (start, end, seqno).
  2. A second sorted container inside MemTable.
  3. get consults both: a key shadowed by an overlapping range tombstone returns Tombstone even if it has a Value entry.

Multiple MemTables (active + immutable list)

Production engines keep one active MemTable plus a list of immutable MemTables awaiting flush. Reads consult [active, ...immutables, L0, L1, ...] in order. Writers swap atomically (active → immutable + new empty active) when the size cap is hit. This decouples flush latency from write latency.

Write amplification interplay

The MemTable size cap (write_buffer_size) is the first knob in the LSM write amplification dial:

  • Larger MemTable → fewer, bigger L0 SSTables → less L0 compaction → lower write amp but slower recovery and more RAM.
  • Smaller MemTable → more L0 SSTables → more compaction work → higher write amp but fast recovery.

RocksDB and Cassandra default in the range 64–256 MiB; LevelDB defaults to 4 MiB.

Persistent MemTables (PMEM)

Intel Optane / CXL persistent memory blurs the WAL+MemTable boundary: the MemTable itself lives in persistent memory, so the WAL is unnecessary. Papers from VLDB 2018–2020 (NoveLSM, SLM-DB, FloDB) explore this.

Encryption

Cassandra and RocksDB optionally encrypt at-rest data including the MemTable's flushed SSTables. The MemTable itself is in RAM and inherits process-memory protection. Encrypting in-memory pages requires hardware support (SGX, AMD SEV).

Compression of in-memory entries

For very long values, RocksDB can compress values inside the MemTable using LZ4 or ZSTD via the MemTableRep's EncodeKey hook. Trades CPU for memory; useful when RAM is the limit.

Skip-list level distribution

Pugh's original skip list uses geometric level distribution with p=0.5 (max levels = log₂ n). LevelDB sets max levels = 12 and branching = 4; RocksDB defaults max = 16, branching = 4. Lower branching → taller list → more memory but better adaptivity.

Adversarial concerns

  • Memory amplification via tombstones. A flood of deletes can make the MemTable hold many entries with no live data; eventually all tombstones must propagate to SSTables and may take generations of compaction to GC.
  • Skew-induced flush storms. A hot key prefix can keep one MemTable bucket pinned while others empty; with hash-partitioned MemTables (HashSkipList) this is pronounced.

Beyond LSM

  • B-epsilon trees (TokuDB / Percona) batch writes inside internal B+ tree nodes; no separate MemTable.
  • Anti-caching (HyPer, VoltDB) keeps the working set in memory and evicts cold rows to disk; inverts the LSM model.
  • WiscKey decouples keys (LSM) from values (separate log) to slash write amplification for large values.