Broader Ideas — Beyond the Minimum

Block / partitioned filters

RocksDB partitions the filter so that one filter probe touches only a single cache line. Trade: marginally higher FPR for ~3× faster contains on cache-cold filters. See Optimizing Bloom Filter: Challenges, Solutions, and Comparisons (Luo et al., IEEE 2019).

Cuckoo filters

Replace bit array with a hash table of fingerprints. Same FPR as Bloom at lower bits/key (~6 bits/key for 1% FPR), and supports remove. Slower to build, occasionally fails to insert when over-full. Excellent for membership tests with a known max size and a need for deletion.

Xor filters

Build-once, query-many. ~9% smaller than Cuckoo at the same FPR, faster lookup (always exactly 3 memory accesses). Bad fit if you insert incrementally; great for static datasets like compiled SSTable filters.

Ribbon filters (RocksDB 6.15+)

A linear-algebra reformulation of xor filters. ~30% smaller than Bloom at the same FPR, slightly slower lookup, ~10× slower to build. RocksDB now uses these as the default for SSTable filters.

Prefix Bloom filters

When most queries are by prefix (e.g. userid:12345:*), build the filter from prefixes instead of full keys. Saves space and lets prefix-range queries use the filter.

Scaling without resizing

A scalable Bloom filter (Almeida et al., 2007) chains a sequence of progressively larger filters with progressively tighter FPRs. add writes to the youngest; contains ORs across all. Memory grows logarithmically with n.

Compressed Bloom filters

If you transmit a Bloom filter over a network, sparsity makes it gzip well. Mitzenmacher (2002) showed that optimizing for compressed size leads to a different k than optimizing for in-memory FPR.

Cardinality estimation: HyperLogLog vs Bloom

Bloom tells you "in set" with FPR; HLL gives you |set| ± ~2% with constant memory. Often used in the same systems for different questions.

Filters in distributed systems

  • Bigtable / HBase: block-level filters per SSTable.
  • Cassandra: row-level filter per SSTable, plus a key cache.
  • Akamai / CDNs: "did this URL get cached?" Bloom-based pre-checks.
  • Gmail: per-user spam fingerprint filters.
  • Bitcoin SPV clients (BIP 37): filters published to full nodes to indicate which addresses the SPV wallet cares about. Famously broken from a privacy standpoint — the filter leaks the address set.

Adversarial considerations

Bloom-filter parameters and hashes are usually public. If users can choose keys, they can craft collisions that fill the filter and push FPR to 100%. Defenses:

  1. Use a keyed hash (SipHash) seeded at filter creation.
  2. Cap inserts or fall back to an exact structure beyond a threshold.
  3. Periodically rebuild.

Information-theoretic lower bound

Carter et al. (1978) prove that any approximate set with FPR ε requires at least n * log2(1/ε) bits — that's ~6.64 bits/key for 1% FPR. Bloom uses ~9.6 (44% overhead). Xor filters approach the bound at ~9% overhead. Ribbon filters get within ~3%.