Broader Ideas — SSTable Format

  • Restart points and prefix compression. LevelDB stores keys as (shared_prefix_len, unshared_suffix, value) and resets the prefix every N entries (a "restart point"). The block trailer lists the restart offsets so binary search inside a block is still O(log N_restarts). Halves on-disk size for sorted key sets but couples decode to encode order.
  • Two-level / partitioned index. A 1 TB SSTable would have ~10 GB of index entries. Partitioning the index into "index blocks of index blocks" keeps the resident index small at the cost of one extra pread per miss. RocksDB uses this above ~2 GB SSTables.
  • Per-block bloom filters. Attaching a small Bloom (db-04) to each data block lets the reader skip the entire block on a miss without decoding it. Trades index/Bloom RAM for fewer block reads.
  • Block CRCs / per-block compression. Real engines write [data][type byte: compression][crc32c] per block; the writer computes the CRC over compressed bytes. Detects bit-rot at read time but adds CPU cost per block.
  • Streaming writer to disk. This lab returns Vec<u8>; production writers stream blocks into an os.File and only buffer the index in RAM. With a 4 KB block target and 250 entries/block, peak RAM is ~4 KB + the growing index.
  • Min/max keys per block in the index. Index entries can carry the last key too, so a query strictly between two blocks short-circuits without reading either. Costs ~2× index size.
  • Splitting hot blocks. Some engines (e.g. CockroachDB) measure read frequency per block and adaptively shrink hot blocks to reduce read amplification on small lookups.
  • Versioned magic. A future format change (e.g. adding bloom blocks) bumps the magic to SST2; readers can keep both code paths and choose at open time. Cheap, common practice.