Step 03 — Sizing, encode/decode, FPR measurement
Picking m and k from (n, fpr)
Given target false-positive rate p and expected key count n:
$$m = \left\lceil \frac{-n \cdot \ln p}{(\ln 2)^2} \right\rceil$$
$$k = \max\left(1,; \text{round}\left(\frac{m}{n} \cdot \ln 2\right)\right)$$
Reference numbers (compute once, hard-code in tests):
| n | p | m (bits) | bits/key | k |
|---|---|---|---|---|
| 1 000 | 0.10 | ~4 793 | 4.79 | 3 |
| 1 000 | 0.01 | ~9 586 | 9.59 | 7 |
| 1 000 | 0.001 | ~14 378 | 14.38 | 10 |
| 10 000 | 0.01 | ~95 851 | 9.59 | 7 |
Implementation:
with_fpr(n, p):
ln2 = ln(2)
m_real = -(n as f64) * ln(p) / (ln2 * ln2)
m = ceil(m_real)
k_real = (m as f64 / n as f64) * ln2
k = round(k_real) clamped to [1, 30]
return BloomFilter::new(m, k)
The clamp on k prevents pathological cases. with_fpr(1, 1e-100) would request k ≈ 332 and almost certainly saturate the filter.
Encode
[ k: u32 LE ][ m: u64 LE ][ bits: ⌈m/8⌉ bytes ]
#![allow(unused)] fn main() { pub fn encode(&self) -> Vec<u8> { let mut out = Vec::with_capacity(12 + self.bits.len()); out.extend_from_slice(&self.k.to_le_bytes()); out.extend_from_slice(&self.m.to_le_bytes()); out.extend_from_slice(&self.bits); out } }
func (b *Bloom) Encode() []byte {
out := make([]byte, 12+len(b.bits))
binary.LittleEndian.PutUint32(out[0:4], b.k)
binary.LittleEndian.PutUint64(out[4:12], b.m)
copy(out[12:], b.bits)
return out
}
std::vector<std::uint8_t> Bloom::Encode() const {
std::vector<std::uint8_t> out(12 + bits_.size());
EncodeU32LE(out.data() + 0, k_);
EncodeU64LE(out.data() + 4, m_);
std::memcpy(out.data() + 12, bits_.data(), bits_.size());
return out;
}
Decode
decode(bytes):
if len(bytes) < 12: error
k = read u32 LE @ 0
m = read u64 LE @ 4
body = bytes[12:]
if len(body) != ⌈m/8⌉: error
return BloomFilter { k, m, bits: body }
Validate sizes. If k == 0 or m == 0, reject — those are nonsense.
Measuring FPR
fpr-test(filter, n, m_queries):
seed reader rng to disjoint stream
hits = 0
for q in 0..m_queries:
key = generate distinctly-not-inserted key
if filter.contains(key):
hits += 1
observed = hits / m_queries
theoretical = (1 - exp(-k * n / m))^k
print observed, theoretical
Generating known-absent keys: use the same family as the inserted ones, but with indices ≥ n. If insert used "key0", "key1", ..., "key{n-1}", query with "q0", "q1", ... — different prefix guarantees no accidental overlap.
A million absent queries gives ±10% noise on a 1% FPR estimate; that's the sample size used in the test fpr_within_2x.
Test gate
with_fpr(1000, 0.01)returnsk=7andm∈[9 581, 9 591].encodethendecodegives an identical filter.fpr-testwithn=10 000,m_queries=100 000reports observed FPR within 2× of theoretical (well within ±50%).- The encoded filter is byte-identical across Rust / Go / C++.
What broken looks like
k=0fromwith_fpr→ integer truncation; you usedint(k_real)instead ofround.- Decode fails on a perfectly valid file → endian mismatch or header offset wrong.
- Observed FPR is exactly 1.0 → bit array got written but indices land outside its range (modulo bug).
- Observed FPR is exactly 0.0 →
containsalways returns false; bit array isn't being touched onadd(you forgot to mutateself).