Broader Ideas — Beyond the Minimum
Block / partitioned filters
RocksDB partitions the filter so that one filter probe touches only a single cache line. Trade: marginally higher FPR for ~3× faster contains on cache-cold filters. See Optimizing Bloom Filter: Challenges, Solutions, and Comparisons (Luo et al., IEEE 2019).
Cuckoo filters
Replace bit array with a hash table of fingerprints. Same FPR as Bloom at lower bits/key (~6 bits/key for 1% FPR), and supports remove. Slower to build, occasionally fails to insert when over-full. Excellent for membership tests with a known max size and a need for deletion.
Xor filters
Build-once, query-many. ~9% smaller than Cuckoo at the same FPR, faster lookup (always exactly 3 memory accesses). Bad fit if you insert incrementally; great for static datasets like compiled SSTable filters.
Ribbon filters (RocksDB 6.15+)
A linear-algebra reformulation of xor filters. ~30% smaller than Bloom at the same FPR, slightly slower lookup, ~10× slower to build. RocksDB now uses these as the default for SSTable filters.
Prefix Bloom filters
When most queries are by prefix (e.g. userid:12345:*), build the filter from prefixes instead of full keys. Saves space and lets prefix-range queries use the filter.
Scaling without resizing
A scalable Bloom filter (Almeida et al., 2007) chains a sequence of progressively larger filters with progressively tighter FPRs. add writes to the youngest; contains ORs across all. Memory grows logarithmically with n.
Compressed Bloom filters
If you transmit a Bloom filter over a network, sparsity makes it gzip well. Mitzenmacher (2002) showed that optimizing for compressed size leads to a different k than optimizing for in-memory FPR.
Cardinality estimation: HyperLogLog vs Bloom
Bloom tells you "in set" with FPR; HLL gives you |set| ± ~2% with constant memory. Often used in the same systems for different questions.
Filters in distributed systems
- Bigtable / HBase: block-level filters per SSTable.
- Cassandra: row-level filter per SSTable, plus a key cache.
- Akamai / CDNs: "did this URL get cached?" Bloom-based pre-checks.
- Gmail: per-user spam fingerprint filters.
- Bitcoin SPV clients (BIP 37): filters published to full nodes to indicate which addresses the SPV wallet cares about. Famously broken from a privacy standpoint — the filter leaks the address set.
Adversarial considerations
Bloom-filter parameters and hashes are usually public. If users can choose keys, they can craft collisions that fill the filter and push FPR to 100%. Defenses:
- Use a keyed hash (SipHash) seeded at filter creation.
- Cap inserts or fall back to an exact structure beyond a threshold.
- Periodically rebuild.
Information-theoretic lower bound
Carter et al. (1978) prove that any approximate set with FPR ε requires at least n * log2(1/ε) bits — that's ~6.64 bits/key for 1% FPR. Bloom uses ~9.6 (44% overhead). Xor filters approach the bound at ~9% overhead. Ribbon filters get within ~3%.