Step 02 — Bit array, add, contains

The bit array

Backed by a Vec<u8> / []byte / std::vector<uint8_t> of length ⌈m / 8⌉. Indexing is little-endian within each byte:

bit i  →  byte index  = i / 8
         bit within  = i % 8         // bit 0 is the LSB
         test:  (bytes[i/8] >> (i%8)) & 1
         set:    bytes[i/8] |= 1 << (i%8)

Side-by-side:

#![allow(unused)]
fn main() {
fn set_bit(bits: &mut [u8], i: u64) {
    let idx = (i / 8) as usize;
    let off = (i % 8) as u8;
    bits[idx] |= 1u8 << off;
}
fn get_bit(bits: &[u8], i: u64) -> bool {
    let idx = (i / 8) as usize;
    let off = (i % 8) as u8;
    (bits[idx] >> off) & 1 == 1
}
}
func setBit(bits []byte, i uint64) {
    bits[i/8] |= 1 << (i % 8)
}
func getBit(bits []byte, i uint64) bool {
    return bits[i/8]&(1<<(i%8)) != 0
}
inline void SetBit(std::uint8_t* b, std::uint64_t i) {
    b[i / 8] |= std::uint8_t{1} << (i % 8);
}
inline bool GetBit(const std::uint8_t* b, std::uint64_t i) {
    return (b[i / 8] >> (i % 8)) & 1u;
}

⚠️ Pick one bit-order now. LSB-first (above) matches LevelDB and is the natural choice in C. MSB-first matches some networking specs (TCP option encoding). Whichever you pick, all three implementations must agree.

add(key)

add(key):
    h = mix64(key)
    h1 = h & 0xffffffff
    h2 = h >> 32
    for i in 0..k:
        idx = (h1 + i * h2) mod m
        set_bit(idx)

Notes:

  • All arithmetic is wrapping 64-bit (u64/uint64/std::uint64_t).
  • i * h2 overflows for large k. That's fine — mod m will still produce a valid index. Languages with overflow checks (Rust debug mode) need wrapping_mul/wrapping_add.
  • We compute h once per key, then derive k indices. That's the entire Kirsch–Mitzenmacher win.

contains(key)

contains(key):
    h = mix64(key)
    h1 = h & 0xffffffff
    h2 = h >> 32
    for i in 0..k:
        idx = (h1 + i * h2) mod m
        if not get_bit(idx):
            return false
    return true

Early-exit on the first zero bit. With FPR 1% and a random absent key, you typically exit after 1 or 2 probes.

Three-language add skeleton

#![allow(unused)]
fn main() {
pub fn add(&mut self, key: &[u8]) {
    let h = mix64(key);
    let h1 = h as u32 as u64;
    let h2 = h >> 32;
    for i in 0..self.k as u64 {
        let idx = h1.wrapping_add(i.wrapping_mul(h2)) % self.m;
        set_bit(&mut self.bits, idx);
    }
}
}
func (b *Bloom) Add(key []byte) {
    h := Mix64(key)
    h1 := h & 0xffffffff
    h2 := h >> 32
    for i := uint64(0); i < uint64(b.k); i++ {
        idx := (h1 + i*h2) % b.m
        setBit(b.bits, idx)
    }
}
void Bloom::Add(const std::uint8_t* k, std::size_t n) {
    std::uint64_t h = Mix64(k, n);
    std::uint64_t h1 = h & 0xffffffffULL;
    std::uint64_t h2 = h >> 32;
    for (std::uint64_t i = 0; i < k_; ++i) {
        std::uint64_t idx = (h1 + i * h2) % m_;
        SetBit(bits_.data(), idx);
    }
}

Test gate

  • Insert keys "k0".."k999" (UTF-8). contains("kN") must be true for every N. Any false negative is a critical bug.
  • Filter must be byte-identical across the three languages (md5sum).

What broken looks like

  • contains is sometimes false for present keys → set and get disagree on bit-within-byte. Suspect LSB vs MSB.
  • Cross-language filters differ → mod m reduction differs (one uses & instead of % and m isn't a power of two), or h1/h2 halves are swapped.
  • contains is always true → m was constructed as 0; bit array is empty so every (h % 0) panics or all indices land in a never-cleared byte.