Bloom Filters and Hashing

Status: complete — runnable in Rust, Go, C++.

1. What Is It

A Bloom filter is a probabilistic set: add(x) always succeeds; contains(x) returns either definitely not present or probably present. It uses a fixed-size bit array m and k independent hash functions; add(x) sets bits at positions h_1(x) mod m, …, h_k(x) mod m; contains(x) returns true iff all those bits are set.

add("foo"):
   h1=37 h2=812 h3=4    →  bits[37]=bits[812]=bits[4]=1

contains("bar"):
   h1=99 h2=812 h3=120  →  bits[99]=0  ⇒  definitely absent
contains("foo"):
   h1=37 h2=812 h3=4    →  all 1  ⇒  probably present

False positives are inherent (any other key that hits the same k bits looks present); false negatives are impossible (a stored key set its bits, and we never unset).

2. Why It Matters

Without a bloom filter	With one
LevelDB / RocksDB `Get(k)` on a miss probes every SSTable's index → many disk reads	One in-memory bit-test per SSTable rejects 99% of misses
Distributed cache: "do I have this key?" requires a network RTT	Local bit-test on a 1 MB filter answers in nanoseconds
Spell-checker holds full dictionary	Few bits per word
Webcrawler revisits the same URL	A few bits per URL prevent recrawl

Filter sizes are tiny: at the textbook optimum (~9.6 bits/key for 1% FPR) a million keys fit in 1.2 MB. Cache-resident.

3. How It Works

For n inserts into m bits with k hashes (assuming independent uniform hashes), the probability a given bit is still zero is (1 - 1/m)^(kn) ≈ e^(-kn/m), so the false-positive rate is

$$\text{FPR} \approx \left(1 - e^{-kn/m}\right)^k$$

Differentiating with respect to k yields the optimal hash count

$$k^* = \frac{m}{n}\ln 2 \approx 0.693 \cdot \frac{m}{n}$$

and the achievable FPR at $k^*$:

$$\text{FPR}^* \approx 0.6185^{,m/n}$$

So 10 bits/key ⇒ ~1% FPR with 7 hashes; 20 bits/key ⇒ ~0.01% with 14 hashes.

Kirsch–Mitzenmacher double hashing

We do not compute k independent hashes. Per Kirsch & Mitzenmacher (2006), it is sufficient — with no measurable FPR penalty — to compute one 64-bit hash, split it into halves h1 and h2, and synthesize:

g_i(x) = h1(x) + i * h2(x)   for i = 0..k-1

This is what LevelDB, RocksDB, and most production filters do.

In this lab the underlying 64-bit hash is FNV-1a64 of the key, then mixed once through SplitMix64 to spread the bits. (FNV-1a64 alone is biased in its high bits, and the Kirsch–Mitzenmacher splitting cares about both halves being well-distributed.)

On-disk / on-wire layout

 ┌─────────┬─────────┬───────────────────────────┐
 │ k (u32) │ m (u64) │  bits  (⌈m/8⌉ bytes, LE)  │
 └─────────┴─────────┴───────────────────────────┘

Identical layout across Rust/Go/C++ so all three can read each other's filters byte-for-byte.

4. Core Terminology

Term	Definition
`m`	Bit-array size in bits.
`n`	Number of distinct keys inserted.
`k`	Number of hash functions per key.
FPR	False-positive rate at query time. False negatives are impossible.
Bits/key	`m / n`. The single knob that determines achievable FPR.
Saturation	Once a large fraction of bits are 1, FPR climbs sharply; filters should be sized for the maximum expected `n`.
Counting Bloom	Variant that supports `remove` by storing 4-bit counters per slot. Costs 4× memory.
Cuckoo filter	Modern alternative: supports delete, often lower space at FPR ≤ 1%, harder to size.
Xor filter	Static (build once, query many) — best space efficiency, but no incremental inserts.

5. Mental Models

Bloom is a hash-collision amplifier. One hash collision is rare; needing k of them simultaneously is rarer. The filter trades memory for that compounding.
A Bloom filter is a negative index. Use it to avoid work; never use it to prove presence.
Hash quality matters less than independence. Once individual bits are well-distributed, the Kirsch–Mitzenmacher trick gives you arbitrarily many "independent" hashes for free.
You can compose them. Union ⇒ bitwise OR (with same m, k). Approximate intersection ⇒ bitwise AND (overestimates).

6. Common Misconceptions

"FPR depends on the number of bits set." It depends on m, n, and k. Two filters with the same fill factor but different k have different FPR.
"Bigger k is always better." Past $k^*$, FPR climbs again because each insert sets more bits, accelerating saturation.
"I can resize a Bloom filter." No — bit positions depend on m. Resize by building a fresh filter from the underlying data (or by maintaining a scalable filter, which is a series of growing Bloom filters).
"Cryptographic hashes are required." Wasted CPU. Anything well-distributed and fast (FNV, xxhash, MurmurHash3, CityHash) works.
"remove would be cheap if I just cleared the bits." It would also clear bits set by every other key that shares positions. Counting Bloom exists for this reason.

7. Interview Talking Points

Derive $k^* = (m/n) \ln 2$ and the resulting FPR formula from first principles.
Explain Kirsch–Mitzenmacher and why it doesn't increase FPR (citation: Less Hashing, Same Performance, ESA 2006).
Walk through how RocksDB pairs a Bloom filter with each SSTable — and how the new ribbon filter improves on that.
Quantify: "for 1% FPR you need ~10 bits/key; for one-in-a-million, ~30."
Contrast Bloom vs. Cuckoo vs. xor filters and their tradeoffs.

8. Connections to Other Labs

db-06 — every SSTable carries an embedded Bloom filter.
db-07 — compaction rebuilds filters because input filters can't be merged exactly.
db-08 — filter block is cached separately from data blocks.
db-09 — LookupKey flow consults the per-table filter before reading the index block.
db-21 — prefix Bloom filters, partitioned filters, ribbon filters.

Distributed Systems Engineer — Build Databases & Consensus From Scratch