Observation — Looking at the Bits
1. Hexdump a freshly built filter
./build/bloombench build /tmp/bf 4 0.05
xxd /tmp/bf
00000000: 0500 0000 1f00 0000 0000 0000 1206 92 ...............
Reading the header: k=5, m=31 bits ⇒ ⌈31/8⌉ = 4 bytes of bits. The trailing 12 06 92 … is the bit vector with 4 keys mixed in. The actual high byte may differ depending on how m is rounded.
For 1000 keys at 1% FPR you should see roughly 9.6 bits/key ⇒ 1200 bytes of bits, and k ≈ 7.
2. Sanity-check the hash chain
./build/bloombench hash foobar
# fnv1a64=85944171f73967e8 splitmix=... h1=... h2=...
Known FNV-1a64 vectors (used in tests):
| Input | fnv1a64 |
|---|---|
"" | 0xcbf29ce484222325 |
"a" | 0xaf63dc4c8601ec8c |
"foobar" | 0x85944171f73967e8 |
All three languages must print the same fnv1a64 and same splitmix64 for any given input. If they don't, cross-language interop is dead on arrival.
3. Empirical FPR matches the formula
./build/bloombench build /tmp/bf 100000 0.01
./build/bloombench fpr-test /tmp/bf 100000 1000000
# expected: observed=0.0098 theoretical≈0.0100 (within ±20% with 1M samples)
If observed FPR is much higher than theoretical:
kis wrong (probably0or1due to integer truncation; checkwith_fprmath).- Hash is biased (FNV without SplitMix mixing — the high bits are clumped).
mod mstep has a bias (usingh % mwith non-primemis OK; usingh & (m-1)only works whenmis a power of two).
If observed FPR is much lower: probably double-counting or your "random absent" key generator overlaps with the present set — verify input.
4. Bit density vs FPR
for fpr in 0.10 0.05 0.01 0.001 0.0001; do
./build/bloombench build /tmp/bf 10000 $fpr
ls -l /tmp/bf
done
Sample row sizes (header + body):
| FPR | Bytes | Bits/key |
|---|---|---|
| 0.10 | ~6 040 | ~4.8 |
| 0.05 | ~7 820 | ~6.2 |
| 0.01 | ~12 010 | ~9.6 |
| 0.001 | ~17 990 | ~14.4 |
| 0.0001 | ~23 970 | ~19.2 |
The 9.6 bits/key heuristic for 1% FPR is the one most often quoted in interviews.
5. Cross-language byte-identical filters
for lang in go rust cpp; do
./src/$lang/.../bloombench build /tmp/bf.$lang 1000 0.01
done
md5sum /tmp/bf.* # all three identical
If any digest differs, suspect (in order): endian, bit ordering inside the byte, integer types of the header, or hash mismatch.
What "working" looks like
- Bytes 0..3 =
k, bytes 4..11 =m, bytes 12..end = bits. No padding. - Empirical FPR is within ±2× of theoretical for any (n, fpr) you try.
- All three languages produce identical filters and read each other's filters.
What "broken" looks like
contains(k)returnsfalsefor a key you just inserted ⇒ false negative ⇒ critical bug. Likely indexing math:setandgetdisagree about bit-within-byte.- FPR is 100% ⇒ all bits are 1 ⇒ either
mwas rounded down to 0 or you're indexing past the bit array. - FPR is 0% with realistic load ⇒
addis a no-op orcontainsalways returns true on the "absent" path. - Cross-language readers disagree ⇒ print the first 16 bytes of each filter and the first three
h1, h2values for a known key; one of them is wrong.