Analysis — SSTable Format
Problem
Take a sorted stream of K/V (and tombstone) entries — exactly what db-05 produces — and persist it as an immutable, randomly-readable file:
- writing is one sequential pass (no re-reads, no buffering all keys);
- a point lookup costs O(log N) on the index plus one block read;
- the file is self-describing: a reader can validate and navigate it using only the file itself.
Constraints
- 4 KB target data-block size (close to a page; tunable).
- Little-endian integers throughout (matches db-03 / db-05).
- No per-block CRCs in this lab — added in db-21 ("Storage Engine Advanced"). The footer magic is the only integrity gate.
- No compression, no delta-encoded keys: the goal is a format simple enough to compare byte-for-byte across three languages.
- Cross-language interop: Rust, Go, and C++ MUST emit byte-identical SSTables for the same input MemTable.
Design
Stream-once writer
sst_writer.add(key, type, value) -> writes into current data block buffer
sst_writer.finish() -> flushes the current block, writes index, writes footer
The writer accumulates one block in memory at a time. When adding an
entry would push the encoded block size past 4096 bytes, the current
block is flushed and a new one started. The first key of every block
is captured into an IndexEntry { key, offset, size }.
Index sizing
Index entries are ~ (4 + 8 + 8 + k̄) = 20 + k̄ bytes; for k̄ = 16 and ~250 entries per 4 KB block, a 1 GB SSTable carries ~262 144 blocks × 36 B ≈ 9 MB of index — small enough to keep in RAM per open SSTable.
Lookup
fn get(key) -> Option<Entry>:
footer = read_tail(32)
assert footer.magic == "SST1\0\0\0\0"
index = read(footer.index_offset, footer.index_size)
blk = bsearch_floor(index, key)? # None => below smallest
block = read(blk.offset, blk.size)
return linear_scan(block, key)
bsearch_floor is the rightmost index entry whose first key ≤ target.
Returning None (target precedes the smallest first-key) is a fast
miss without reading any data block.
Per-language container choice
| Language | Writer buffer | Index repr |
|---|---|---|
| Rust | Vec<u8> for the current block | Vec<IndexEntry> |
| Go | []byte | []IndexEntry |
| C++ | std::vector<uint8_t> | std::vector<IndexEntry> |
IndexEntry is (Vec<u8> key, u64 offset, u64 size) in all three.
Build-from-memtable bridge
For cross-test friendliness, the writer's input source is a decoded
MemTable dump (the output of db-05 encode). The CLI command
build reads IN.mt, iterates in sorted order, and emits OUT.sst.
What could break
- Block boundary drift. If two implementations disagree on when to
flush a block (e.g.
> 4096vs>= 4096), the data blocks land at different offsets, the index differs, and the footer hashes differ. We pin the rule: *flush when `current_block_size + next_entry_size4096
ANDcurrent_block_size > 0`*. - Index encoding for the very first block. Its first key may be
the empty string
""; the index entry then hasklen=0. The reader must still treat it as the floor for any non-empty target. - Footer alignment. Anything other than exactly 32 bytes after the index block invalidates the magic offset.