Step 03 — Sizing, encode/decode, FPR measurement

Picking `m` and `k` from `(n, fpr)`

Given target false-positive rate p and expected key count n:

$$m = \left\lceil \frac{-n \cdot \ln p}{(\ln 2)^2} \right\rceil$$

$$k = \max\left(1,; \text{round}\left(\frac{m}{n} \cdot \ln 2\right)\right)$$

Reference numbers (compute once, hard-code in tests):

n	p	m (bits)	bits/key	k
1 000	0.10	~4 793	4.79	3
1 000	0.01	~9 586	9.59	7
1 000	0.001	~14 378	14.38	10
10 000	0.01	~95 851	9.59	7

Implementation:

with_fpr(n, p):
    ln2     = ln(2)
    m_real  = -(n as f64) * ln(p) / (ln2 * ln2)
    m       = ceil(m_real)
    k_real  = (m as f64 / n as f64) * ln2
    k       = round(k_real) clamped to [1, 30]
    return BloomFilter::new(m, k)

The clamp on k prevents pathological cases. with_fpr(1, 1e-100) would request k ≈ 332 and almost certainly saturate the filter.

Encode

[ k: u32 LE ][ m: u64 LE ][ bits: ⌈m/8⌉ bytes ]

#![allow(unused)]
fn main() {
pub fn encode(&self) -> Vec<u8> {
    let mut out = Vec::with_capacity(12 + self.bits.len());
    out.extend_from_slice(&self.k.to_le_bytes());
    out.extend_from_slice(&self.m.to_le_bytes());
    out.extend_from_slice(&self.bits);
    out
}
}

func (b *Bloom) Encode() []byte {
    out := make([]byte, 12+len(b.bits))
    binary.LittleEndian.PutUint32(out[0:4], b.k)
    binary.LittleEndian.PutUint64(out[4:12], b.m)
    copy(out[12:], b.bits)
    return out
}

std::vector<std::uint8_t> Bloom::Encode() const {
    std::vector<std::uint8_t> out(12 + bits_.size());
    EncodeU32LE(out.data() + 0, k_);
    EncodeU64LE(out.data() + 4, m_);
    std::memcpy(out.data() + 12, bits_.data(), bits_.size());
    return out;
}

Decode

decode(bytes):
    if len(bytes) < 12: error
    k = read u32 LE @ 0
    m = read u64 LE @ 4
    body = bytes[12:]
    if len(body) != ⌈m/8⌉: error
    return BloomFilter { k, m, bits: body }

Validate sizes. If k == 0 or m == 0, reject — those are nonsense.

Measuring FPR

fpr-test(filter, n, m_queries):
    seed reader rng to disjoint stream
    hits = 0
    for q in 0..m_queries:
        key = generate distinctly-not-inserted key
        if filter.contains(key):
            hits += 1
    observed = hits / m_queries
    theoretical = (1 - exp(-k * n / m))^k
    print observed, theoretical

Generating known-absent keys: use the same family as the inserted ones, but with indices ≥ n. If insert used "key0", "key1", ..., "key{n-1}", query with "q0", "q1", ... — different prefix guarantees no accidental overlap.

A million absent queries gives ±10% noise on a 1% FPR estimate; that's the sample size used in the test fpr_within_2x.

Test gate

with_fpr(1000, 0.01) returns k=7 and m ∈ [9 581, 9 591].
encode then decode gives an identical filter.
fpr-test with n=10 000, m_queries=100 000 reports observed FPR within 2× of theoretical (well within ±50%).
The encoded filter is byte-identical across Rust / Go / C++.

What broken looks like

k=0 from with_fpr → integer truncation; you used int(k_real) instead of round.
Decode fails on a perfectly valid file → endian mismatch or header offset wrong.
Observed FPR is exactly 1.0 → bit array got written but indices land outside its range (modulo bug).
Observed FPR is exactly 0.0 → contains always returns false; bit array isn't being touched on add (you forgot to mutate self).

Distributed Systems Engineer — Build Databases & Consensus From Scratch