Bloom Filters and Hashing
Status: complete — runnable in Rust, Go, C++.
1. What Is It
A Bloom filter is a probabilistic set: add(x) always succeeds; contains(x) returns either definitely not present or probably present. It uses a fixed-size bit array m and k independent hash functions; add(x) sets bits at positions h_1(x) mod m, …, h_k(x) mod m; contains(x) returns true iff all those bits are set.
add("foo"):
h1=37 h2=812 h3=4 → bits[37]=bits[812]=bits[4]=1
contains("bar"):
h1=99 h2=812 h3=120 → bits[99]=0 ⇒ definitely absent
contains("foo"):
h1=37 h2=812 h3=4 → all 1 ⇒ probably present
False positives are inherent (any other key that hits the same k bits looks present); false negatives are impossible (a stored key set its bits, and we never unset).
2. Why It Matters
| Without a bloom filter | With one |
|---|---|
LevelDB / RocksDB Get(k) on a miss probes every SSTable's index → many disk reads | One in-memory bit-test per SSTable rejects 99% of misses |
| Distributed cache: "do I have this key?" requires a network RTT | Local bit-test on a 1 MB filter answers in nanoseconds |
| Spell-checker holds full dictionary | Few bits per word |
| Webcrawler revisits the same URL | A few bits per URL prevent recrawl |
Filter sizes are tiny: at the textbook optimum (~9.6 bits/key for 1% FPR) a million keys fit in 1.2 MB. Cache-resident.
3. How It Works
For n inserts into m bits with k hashes (assuming independent uniform hashes), the probability a given bit is still zero is (1 - 1/m)^(kn) ≈ e^(-kn/m), so the false-positive rate is
$$\text{FPR} \approx \left(1 - e^{-kn/m}\right)^k$$
Differentiating with respect to k yields the optimal hash count
$$k^* = \frac{m}{n}\ln 2 \approx 0.693 \cdot \frac{m}{n}$$
and the achievable FPR at $k^*$:
$$\text{FPR}^* \approx 0.6185^{,m/n}$$
So 10 bits/key ⇒ ~1% FPR with 7 hashes; 20 bits/key ⇒ ~0.01% with 14 hashes.
Kirsch–Mitzenmacher double hashing
We do not compute k independent hashes. Per Kirsch & Mitzenmacher (2006), it is sufficient — with no measurable FPR penalty — to compute one 64-bit hash, split it into halves h1 and h2, and synthesize:
g_i(x) = h1(x) + i * h2(x) for i = 0..k-1
This is what LevelDB, RocksDB, and most production filters do.
In this lab the underlying 64-bit hash is FNV-1a64 of the key, then mixed once through SplitMix64 to spread the bits. (FNV-1a64 alone is biased in its high bits, and the Kirsch–Mitzenmacher splitting cares about both halves being well-distributed.)
On-disk / on-wire layout
┌─────────┬─────────┬───────────────────────────┐
│ k (u32) │ m (u64) │ bits (⌈m/8⌉ bytes, LE) │
└─────────┴─────────┴───────────────────────────┘
Identical layout across Rust/Go/C++ so all three can read each other's filters byte-for-byte.
4. Core Terminology
| Term | Definition |
|---|---|
m | Bit-array size in bits. |
n | Number of distinct keys inserted. |
k | Number of hash functions per key. |
| FPR | False-positive rate at query time. False negatives are impossible. |
| Bits/key | m / n. The single knob that determines achievable FPR. |
| Saturation | Once a large fraction of bits are 1, FPR climbs sharply; filters should be sized for the maximum expected n. |
| Counting Bloom | Variant that supports remove by storing 4-bit counters per slot. Costs 4× memory. |
| Cuckoo filter | Modern alternative: supports delete, often lower space at FPR ≤ 1%, harder to size. |
| Xor filter | Static (build once, query many) — best space efficiency, but no incremental inserts. |
5. Mental Models
- Bloom is a hash-collision amplifier. One hash collision is rare; needing
kof them simultaneously is rarer. The filter trades memory for that compounding. - A Bloom filter is a negative index. Use it to avoid work; never use it to prove presence.
- Hash quality matters less than independence. Once individual bits are well-distributed, the Kirsch–Mitzenmacher trick gives you arbitrarily many "independent" hashes for free.
- You can compose them. Union ⇒ bitwise OR (with same
m,k). Approximate intersection ⇒ bitwise AND (overestimates).
6. Common Misconceptions
- "FPR depends on the number of bits set." It depends on
m,n, andk. Two filters with the same fill factor but differentkhave different FPR. - "Bigger
kis always better." Past $k^*$, FPR climbs again because each insert sets more bits, accelerating saturation. - "I can resize a Bloom filter." No — bit positions depend on
m. Resize by building a fresh filter from the underlying data (or by maintaining a scalable filter, which is a series of growing Bloom filters). - "Cryptographic hashes are required." Wasted CPU. Anything well-distributed and fast (FNV, xxhash, MurmurHash3, CityHash) works.
- "
removewould be cheap if I just cleared the bits." It would also clear bits set by every other key that shares positions. Counting Bloom exists for this reason.
7. Interview Talking Points
- Derive $k^* = (m/n) \ln 2$ and the resulting FPR formula from first principles.
- Explain Kirsch–Mitzenmacher and why it doesn't increase FPR (citation: Less Hashing, Same Performance, ESA 2006).
- Walk through how RocksDB pairs a Bloom filter with each SSTable — and how the new ribbon filter improves on that.
- Quantify: "for 1% FPR you need ~10 bits/key; for one-in-a-million, ~30."
- Contrast Bloom vs. Cuckoo vs. xor filters and their tradeoffs.
8. Connections to Other Labs
- db-06 — every SSTable carries an embedded Bloom filter.
- db-07 — compaction rebuilds filters because input filters can't be merged exactly.
- db-08 — filter block is cached separately from data blocks.
- db-09 —
LookupKeyflow consults the per-table filter before reading the index block. - db-21 — prefix Bloom filters, partitioned filters, ribbon filters.