Analysis — Bloom Filters and Hashing

Problem statement

Build a fixed-memory probabilistic set that:

Never reports false negatives (lookups for present keys always return true).
Reports false positives at a tunable, well-understood rate.
Is fast enough to consult in the hot path of a key-value store lookup (≈ nanoseconds).
Has an on-disk representation identical across Rust, Go, and C++ so the same filter built in any language can be read by any other.

Constraint	Why it matters
Compact (≤ 2 bytes/key for 5% FPR)	The filter is loaded into RAM beside the table it indexes.
Constant-time `add` and `contains`	Hot path of `Get(k)`.
Deterministic across languages	Cross-language tests must pass.
Single 64-bit hash	We synthesize `k` indices via Kirsch–Mitzenmacher — keeps CPU low.
No remove	Pure Bloom. Counting variants left to db-21.

Base hash = FNV-1a64 then SplitMix64 mixing. FNV-1a64 is trivial to implement identically across languages; SplitMix64 finalizing fixes its weak avalanche so the upper and lower 32 bits are both well-distributed. The two 32-bit halves become h1 and h2 for double hashing.
Kirsch–Mitzenmacher: g_i = h1 + i*h2, all u64 arithmetic, with the final mod m using a single u128-multiplication trick (Daniel Lemire, Fast Random Integer Generation in an Interval, 2018) so we don't pay for a div.
Bit array is little-endian byte-packed, bit i lives in bytes[i/8] >> (i%8) & 1. Same convention LevelDB and RocksDB use.
Header = k(u32 LE) || m(u64 LE). 12 bytes total. We deliberately put k first so a partial-read of just the header reveals the hash count without needing m.
No checksum on the filter itself. Bloom filters can tolerate a flipped bit (it adds at most a few keys' worth of false positives); pages-level checksumming belongs to db-06 (SSTable).
new_with_fpr(n, fpr) constructor. Picks m = ceil(-n * ln(fpr) / (ln 2)^2) and k = round((m/n) * ln 2). Caps k at 30 to avoid degenerate sizing for absurdly small FPRs.

vs MurmurHash3 / xxhash: faster and arguably higher quality, but each is hundreds of lines to re-implement identically in three languages. FNV+SplitMix is 12 lines per language and indistinguishable in our use case.
vs k independent hashes: 2× CPU for no measurable FPR change (Kirsch & Mitzenmacher 2006).
vs Cuckoo / xor filters: more space-efficient at low FPR but much more code. Worth a separate lab — db-21.
vs in-language hashers (std::hash, hash/fnv, std::hash<string>): per-language differences — Go's maphash is randomized per process; C++ std::hash<string> is implementation-defined. None of them survive cross-language interop.

Failure	How
FPR much higher than claimed	Test V1: empirically measure FPR with 100k random queries and assert it's within 2× of the theoretical bound.
Bit packing mismatched across languages	Test V2 (cross-lang): each writer dumps its filter bytes; each reader queries it for known-present & known-absent keys.
Endian mismatch in header	All header fields encoded little-endian explicitly.
Hash function mismatch	Test V3: known FNV-1a64 vectors (`""→0xcbf29ce484222325`, `"foobar"→0x85944171f73967e8`) checked at startup.
Saturation at `n ≫ planned`	`contains` still works; FPR climbs gracefully. Filter constructors document the assumed `n`.

Concurrent inserts. Single writer model. Concurrent add corrupts overlapping byte writes. Lock externally or use atomic OR per byte — covered in db-08 / db-21.
Adversarial keys. FNV-1a64 is not cryptographic — an attacker can craft collisions to inflate FPR. Use SipHash / xxh3 (with secret seed) if filter inputs are attacker-controlled.
Deletion. Use a counting Bloom or a cuckoo filter. See db-21.