Analysis — Bloom Filters and Hashing

Problem statement

Build a fixed-memory probabilistic set that:

  1. Never reports false negatives (lookups for present keys always return true).
  2. Reports false positives at a tunable, well-understood rate.
  3. Is fast enough to consult in the hot path of a key-value store lookup (≈ nanoseconds).
  4. Has an on-disk representation identical across Rust, Go, and C++ so the same filter built in any language can be read by any other.

Constraints

ConstraintWhy it matters
Compact (≤ 2 bytes/key for 5% FPR)The filter is loaded into RAM beside the table it indexes.
Constant-time add and containsHot path of Get(k).
Deterministic across languagesCross-language tests must pass.
Single 64-bit hashWe synthesize k indices via Kirsch–Mitzenmacher — keeps CPU low.
No removePure Bloom. Counting variants left to db-21.

Design decisions

  1. Base hash = FNV-1a64 then SplitMix64 mixing. FNV-1a64 is trivial to implement identically across languages; SplitMix64 finalizing fixes its weak avalanche so the upper and lower 32 bits are both well-distributed. The two 32-bit halves become h1 and h2 for double hashing.
  2. Kirsch–Mitzenmacher: g_i = h1 + i*h2, all u64 arithmetic, with the final mod m using a single u128-multiplication trick (Daniel Lemire, Fast Random Integer Generation in an Interval, 2018) so we don't pay for a div.
  3. Bit array is little-endian byte-packed, bit i lives in bytes[i/8] >> (i%8) & 1. Same convention LevelDB and RocksDB use.
  4. Header = k(u32 LE) || m(u64 LE). 12 bytes total. We deliberately put k first so a partial-read of just the header reveals the hash count without needing m.
  5. No checksum on the filter itself. Bloom filters can tolerate a flipped bit (it adds at most a few keys' worth of false positives); pages-level checksumming belongs to db-06 (SSTable).
  6. new_with_fpr(n, fpr) constructor. Picks m = ceil(-n * ln(fpr) / (ln 2)^2) and k = round((m/n) * ln 2). Caps k at 30 to avoid degenerate sizing for absurdly small FPRs.

Why this design over alternatives

  • vs MurmurHash3 / xxhash: faster and arguably higher quality, but each is hundreds of lines to re-implement identically in three languages. FNV+SplitMix is 12 lines per language and indistinguishable in our use case.
  • vs k independent hashes: 2× CPU for no measurable FPR change (Kirsch & Mitzenmacher 2006).
  • vs Cuckoo / xor filters: more space-efficient at low FPR but much more code. Worth a separate lab — db-21.
  • vs in-language hashers (std::hash, hash/fnv, std::hash<string>): per-language differences — Go's maphash is randomized per process; C++ std::hash<string> is implementation-defined. None of them survive cross-language interop.

Failure modes addressed

FailureHow
FPR much higher than claimedTest V1: empirically measure FPR with 100k random queries and assert it's within 2× of the theoretical bound.
Bit packing mismatched across languagesTest V2 (cross-lang): each writer dumps its filter bytes; each reader queries it for known-present & known-absent keys.
Endian mismatch in headerAll header fields encoded little-endian explicitly.
Hash function mismatchTest V3: known FNV-1a64 vectors (""→0xcbf29ce484222325, "foobar"→0x85944171f73967e8) checked at startup.
Saturation at n ≫ plannedcontains still works; FPR climbs gracefully. Filter constructors document the assumed n.

Failure modes NOT addressed

  • Concurrent inserts. Single writer model. Concurrent add corrupts overlapping byte writes. Lock externally or use atomic OR per byte — covered in db-08 / db-21.
  • Adversarial keys. FNV-1a64 is not cryptographic — an attacker can craft collisions to inflate FPR. Use SipHash / xxh3 (with secret seed) if filter inputs are attacker-controlled.
  • Deletion. Use a counting Bloom or a cuckoo filter. See db-21.