Restart points and prefix compression. LevelDB stores keys as
(shared_prefix_len, unshared_suffix, value) and resets the prefix
every N entries (a "restart point"). The block trailer lists the
restart offsets so binary search inside a block is still O(log
N_restarts). Halves on-disk size for sorted key sets but couples
decode to encode order.
Two-level / partitioned index. A 1 TB SSTable would have ~10 GB
of index entries. Partitioning the index into "index blocks of
index blocks" keeps the resident index small at the cost of one
extra pread per miss. RocksDB uses this above ~2 GB SSTables.
Per-block bloom filters. Attaching a small Bloom (db-04) to each
data block lets the reader skip the entire block on a miss without
decoding it. Trades index/Bloom RAM for fewer block reads.
Block CRCs / per-block compression. Real engines write
[data][type byte: compression][crc32c] per block; the writer
computes the CRC over compressed bytes. Detects bit-rot at read
time but adds CPU cost per block.
Streaming writer to disk. This lab returns Vec<u8>; production
writers stream blocks into an os.File and only buffer the index in
RAM. With a 4 KB block target and 250 entries/block, peak RAM is
~4 KB + the growing index.
Min/max keys per block in the index. Index entries can carry the
last key too, so a query strictly between two blocks short-circuits
without reading either. Costs ~2× index size.
Splitting hot blocks. Some engines (e.g. CockroachDB) measure
read frequency per block and adaptively shrink hot blocks to reduce
read amplification on small lookups.
Versioned magic. A future format change (e.g. adding bloom
blocks) bumps the magic to SST2; readers can keep both code paths
and choose at open time. Cheap, common practice.