Analysis — SSTable Format

Problem

Take a sorted stream of K/V (and tombstone) entries — exactly what db-05 produces — and persist it as an immutable, randomly-readable file:

  • writing is one sequential pass (no re-reads, no buffering all keys);
  • a point lookup costs O(log N) on the index plus one block read;
  • the file is self-describing: a reader can validate and navigate it using only the file itself.

Constraints

  • 4 KB target data-block size (close to a page; tunable).
  • Little-endian integers throughout (matches db-03 / db-05).
  • No per-block CRCs in this lab — added in db-21 ("Storage Engine Advanced"). The footer magic is the only integrity gate.
  • No compression, no delta-encoded keys: the goal is a format simple enough to compare byte-for-byte across three languages.
  • Cross-language interop: Rust, Go, and C++ MUST emit byte-identical SSTables for the same input MemTable.

Design

Stream-once writer

sst_writer.add(key, type, value) -> writes into current data block buffer
sst_writer.finish() -> flushes the current block, writes index, writes footer

The writer accumulates one block in memory at a time. When adding an entry would push the encoded block size past 4096 bytes, the current block is flushed and a new one started. The first key of every block is captured into an IndexEntry { key, offset, size }.

Index sizing

Index entries are ~ (4 + 8 + 8 + k̄) = 20 + k̄ bytes; for k̄ = 16 and ~250 entries per 4 KB block, a 1 GB SSTable carries ~262 144 blocks × 36 B ≈ 9 MB of index — small enough to keep in RAM per open SSTable.

Lookup

fn get(key) -> Option<Entry>:
    footer = read_tail(32)
    assert footer.magic == "SST1\0\0\0\0"
    index = read(footer.index_offset, footer.index_size)
    blk = bsearch_floor(index, key)?                # None => below smallest
    block = read(blk.offset, blk.size)
    return linear_scan(block, key)

bsearch_floor is the rightmost index entry whose first key ≤ target. Returning None (target precedes the smallest first-key) is a fast miss without reading any data block.

Per-language container choice

LanguageWriter bufferIndex repr
RustVec<u8> for the current blockVec<IndexEntry>
Go[]byte[]IndexEntry
C++std::vector<uint8_t>std::vector<IndexEntry>

IndexEntry is (Vec<u8> key, u64 offset, u64 size) in all three.

Build-from-memtable bridge

For cross-test friendliness, the writer's input source is a decoded MemTable dump (the output of db-05 encode). The CLI command build reads IN.mt, iterates in sorted order, and emits OUT.sst.

What could break

  • Block boundary drift. If two implementations disagree on when to flush a block (e.g. > 4096 vs >= 4096), the data blocks land at different offsets, the index differs, and the footer hashes differ. We pin the rule: *flush when `current_block_size + next_entry_size

    4096ANDcurrent_block_size > 0`*.

  • Index encoding for the very first block. Its first key may be the empty string ""; the index entry then has klen=0. The reader must still treat it as the floor for any non-empty target.
  • Footer alignment. Anything other than exactly 32 bytes after the index block invalidates the magic offset.