Observation — Looking at the Bits

1. Hexdump a freshly built filter

./build/bloombench build /tmp/bf 4 0.05
xxd /tmp/bf
00000000: 0500 0000 1f00 0000 0000 0000 1206 92    ...............

Reading the header: k=5, m=31 bits ⇒ ⌈31/8⌉ = 4 bytes of bits. The trailing 12 06 92 … is the bit vector with 4 keys mixed in. The actual high byte may differ depending on how m is rounded.

For 1000 keys at 1% FPR you should see roughly 9.6 bits/key ⇒ 1200 bytes of bits, and k ≈ 7.

2. Sanity-check the hash chain

./build/bloombench hash foobar
# fnv1a64=85944171f73967e8  splitmix=...  h1=...  h2=...

Known FNV-1a64 vectors (used in tests):

Inputfnv1a64
""0xcbf29ce484222325
"a"0xaf63dc4c8601ec8c
"foobar"0x85944171f73967e8

All three languages must print the same fnv1a64 and same splitmix64 for any given input. If they don't, cross-language interop is dead on arrival.

3. Empirical FPR matches the formula

./build/bloombench build /tmp/bf 100000 0.01
./build/bloombench fpr-test /tmp/bf 100000 1000000
# expected: observed=0.0098  theoretical≈0.0100   (within ±20% with 1M samples)

If observed FPR is much higher than theoretical:

  • k is wrong (probably 0 or 1 due to integer truncation; check with_fpr math).
  • Hash is biased (FNV without SplitMix mixing — the high bits are clumped).
  • mod m step has a bias (using h % m with non-prime m is OK; using h & (m-1) only works when m is a power of two).

If observed FPR is much lower: probably double-counting or your "random absent" key generator overlaps with the present set — verify input.

4. Bit density vs FPR

for fpr in 0.10 0.05 0.01 0.001 0.0001; do
  ./build/bloombench build /tmp/bf 10000 $fpr
  ls -l /tmp/bf
done

Sample row sizes (header + body):

FPRBytesBits/key
0.10~6 040~4.8
0.05~7 820~6.2
0.01~12 010~9.6
0.001~17 990~14.4
0.0001~23 970~19.2

The 9.6 bits/key heuristic for 1% FPR is the one most often quoted in interviews.

5. Cross-language byte-identical filters

for lang in go rust cpp; do
  ./src/$lang/.../bloombench build /tmp/bf.$lang 1000 0.01
done
md5sum /tmp/bf.*    # all three identical

If any digest differs, suspect (in order): endian, bit ordering inside the byte, integer types of the header, or hash mismatch.

What "working" looks like

  • Bytes 0..3 = k, bytes 4..11 = m, bytes 12..end = bits. No padding.
  • Empirical FPR is within ±2× of theoretical for any (n, fpr) you try.
  • All three languages produce identical filters and read each other's filters.

What "broken" looks like

  • contains(k) returns false for a key you just inserted ⇒ false negative ⇒ critical bug. Likely indexing math: set and get disagree about bit-within-byte.
  • FPR is 100% ⇒ all bits are 1 ⇒ either m was rounded down to 0 or you're indexing past the bit array.
  • FPR is 0% with realistic load ⇒ add is a no-op or contains always returns true on the "absent" path.
  • Cross-language readers disagree ⇒ print the first 16 bytes of each filter and the first three h1, h2 values for a known key; one of them is wrong.