SSTable Format

1. What Is It

A Sorted String Table (SSTable) is an immutable on-disk file holding key/value entries in byte-lex key order, organised into fixed-size blocks with an index block that maps each block's first key to its byte range inside the file, and a fixed-size footer that locates the index block.

The format in this lab:

+--------------------+   0
| data block 0       |
+--------------------+
| data block 1       |
+--------------------+
| ...                |
+--------------------+
| data block N-1     |
+--------------------+   index_offset
| index block        |
+--------------------+   file_size - 32
| footer (32 bytes)  |
+--------------------+   file_size

The footer always lives in the last 32 bytes and ends with the magic SST1\0\0\0\0, so any reader can validate the file with one pread of the tail and then pread the index block, and only then the relevant data block.

2. Why It Matters

  • Read-once, write-never. Each SSTable is written sequentially and then treated as read-only. That eliminates most concurrency hazards: lookups, range scans, and compactions can all share a single immutable file.
  • Bounded read amplification. A point lookup is footer → index → one data block. With a 4 KB block and a 16-byte average entry, ≤256 keys are scanned per lookup, regardless of file size.
  • Predictable I/O. Blocks are aligned write units; the OS page cache can pin hot blocks. Tail latency is dominated by exactly two I/Os per miss (index + data).
  • Foundation for LSM. Compaction merges multiple immutable SSTables into new immutable SSTables. The format is what makes "immutable + sorted + indexed" a usable storage primitive.

3. How It Works

3.1 Data block

A data block is a self-describing run of entries.

[count: u32 LE]
repeat count times (keys ascending byte-lex within the block):
  [klen: u32 LE][vlen: u32 LE][type: u8][key bytes][value bytes]

The writer flushes a block when its accumulated size would exceed a target (default 4096 bytes). The very first key of each block is the index key for that block.

3.2 Index block

[count: u32 LE]
repeat count times:
  [klen: u32 LE][offset: u64 LE][size: u64 LE][first-key bytes]

offset and size locate the data block inside the file. Index entries are listed in ascending block order, which is the same as ascending first-key order.

[index_offset: u64 LE]
[index_size:   u64 LE]
[num_blocks:   u64 LE]
[magic:        "SST1\0\0\0\0" (8 bytes)]

The fixed size makes the tail trivially locatable: pread(fd, buf, 32, file_size - 32).

3.4 Point lookup

  1. Read footer; verify magic.
  2. Read index block (index_offset..index_offset+index_size).
  3. Binary-search the index for the rightmost index entry whose first key ≤ target. That entry's block is the only one that can contain the target.
  4. Read that block; linear-scan within it.

A miss in step 3 (target < first entry's key) means the key is absent without reading any data block.

4. Core Terminology

TermDefinition
SSTableImmutable sorted on-disk K/V file.
BlockContiguous run of entries inside the SSTable; the I/O and indexing unit.
Data blockBlock containing user K/V entries.
Index blockBlock mapping each data block's first key to (offset, size).
FooterFixed-size tail (32 B) locating the index block; ends with a magic.
MagicSentinel byte pattern (SST1 here) used to validate file identity.
Index keyThe first key of a data block, copied into the index entry.
Block boundaryThe byte position where one data block ends and the next begins.
Restart point(Not used here; LevelDB-style intra-block delta-encoding marker.)
TombstoneEntry whose type=1 records that a key has been logically deleted.

5. Mental Models

  • Phone book. Data blocks are pages; the index is the alphabetical tabs on the side. The footer is the spine label saying "Volume 3 of 3".
  • Skiplist with one level. The index is a single coarse "level" above the sorted data; binary search on the index replaces a multi- level skiplist traversal.
  • Two-tier B+tree. Conceptually an SSTable is a 2-level B+tree whose leaves are data blocks and whose root is the index block — but built sequentially and frozen.

6. Common Misconceptions

  • "You need to scan the whole file to find a key." No — one index lookup pins a single block.
  • "The index must be at the start so you can read it first." No — the footer pointer makes the index location flexible, and writing the index last avoids buffering all keys before any data is flushed.
  • "Block size = block contents size." The on-disk block includes its own count header; the writer tracks an estimate of accumulated bytes so blocks land roughly on the target.
  • "Tombstones can be dropped at write time." Not safely — a tombstone must survive until no older SSTable can shadow it (handled by compaction in db-07).
  • "Binary search needs fixed-size index entries." The index entries here are variable-length, but the index block itself is small and fully loaded into RAM, where any search structure is cheap.

7. Interview Talking Points

  • Why is the footer at the end? ("So the writer can stream data blocks then index without two passes; one pread of the tail finds everything.")
  • What changes if you target 64 KB blocks instead of 4 KB? (Fewer index entries → smaller index → faster directory lookups; larger read amplification per miss; better compression ratios.)
  • How does this format become durable? (fsync after the footer is written, and a parent directory fsync so the dirent is recoverable. Without it a crash can leave the magic visible but data missing.)
  • What is bsearch looking for inside the index? (The floor — the largest first-key ≤ target — not equality.)
  • What stops a corrupt footer from poisoning the reader? (Magic check
    • length plausibility checks. Real systems add CRCs per block.)

8. Connections to Other Labs

  • db-05 MemTable supplies the sorted, in-memory K/V stream that this writer drains into blocks.
  • db-04 Bloom Filters can be attached per SSTable to skip the index lookup on negative queries (added in db-08 / db-09).
  • db-07 LSM Compaction consumes many SSTables and produces new ones using exactly this format.
  • db-08 Block Cache and Iterators caches the parsed data block rather than re-decoding on each lookup, and turns intra-block scans into iterators.
  • db-09 LevelDB Complete stitches MemTable + WAL + SSTable + compaction + bloom into a working engine.