Step 02 — Bit array, add, contains
The bit array
Backed by a Vec<u8> / []byte / std::vector<uint8_t> of length ⌈m / 8⌉. Indexing is little-endian within each byte:
bit i → byte index = i / 8
bit within = i % 8 // bit 0 is the LSB
test: (bytes[i/8] >> (i%8)) & 1
set: bytes[i/8] |= 1 << (i%8)
Side-by-side:
#![allow(unused)] fn main() { fn set_bit(bits: &mut [u8], i: u64) { let idx = (i / 8) as usize; let off = (i % 8) as u8; bits[idx] |= 1u8 << off; } fn get_bit(bits: &[u8], i: u64) -> bool { let idx = (i / 8) as usize; let off = (i % 8) as u8; (bits[idx] >> off) & 1 == 1 } }
func setBit(bits []byte, i uint64) {
bits[i/8] |= 1 << (i % 8)
}
func getBit(bits []byte, i uint64) bool {
return bits[i/8]&(1<<(i%8)) != 0
}
inline void SetBit(std::uint8_t* b, std::uint64_t i) {
b[i / 8] |= std::uint8_t{1} << (i % 8);
}
inline bool GetBit(const std::uint8_t* b, std::uint64_t i) {
return (b[i / 8] >> (i % 8)) & 1u;
}
⚠️ Pick one bit-order now. LSB-first (above) matches LevelDB and is the natural choice in C. MSB-first matches some networking specs (TCP option encoding). Whichever you pick, all three implementations must agree.
add(key)
add(key):
h = mix64(key)
h1 = h & 0xffffffff
h2 = h >> 32
for i in 0..k:
idx = (h1 + i * h2) mod m
set_bit(idx)
Notes:
- All arithmetic is wrapping 64-bit (
u64/uint64/std::uint64_t). i * h2overflows for largek. That's fine —mod mwill still produce a valid index. Languages with overflow checks (Rust debug mode) needwrapping_mul/wrapping_add.- We compute
honce per key, then derivekindices. That's the entire Kirsch–Mitzenmacher win.
contains(key)
contains(key):
h = mix64(key)
h1 = h & 0xffffffff
h2 = h >> 32
for i in 0..k:
idx = (h1 + i * h2) mod m
if not get_bit(idx):
return false
return true
Early-exit on the first zero bit. With FPR 1% and a random absent key, you typically exit after 1 or 2 probes.
Three-language add skeleton
#![allow(unused)] fn main() { pub fn add(&mut self, key: &[u8]) { let h = mix64(key); let h1 = h as u32 as u64; let h2 = h >> 32; for i in 0..self.k as u64 { let idx = h1.wrapping_add(i.wrapping_mul(h2)) % self.m; set_bit(&mut self.bits, idx); } } }
func (b *Bloom) Add(key []byte) {
h := Mix64(key)
h1 := h & 0xffffffff
h2 := h >> 32
for i := uint64(0); i < uint64(b.k); i++ {
idx := (h1 + i*h2) % b.m
setBit(b.bits, idx)
}
}
void Bloom::Add(const std::uint8_t* k, std::size_t n) {
std::uint64_t h = Mix64(k, n);
std::uint64_t h1 = h & 0xffffffffULL;
std::uint64_t h2 = h >> 32;
for (std::uint64_t i = 0; i < k_; ++i) {
std::uint64_t idx = (h1 + i * h2) % m_;
SetBit(bits_.data(), idx);
}
}
Test gate
- Insert keys
"k0".."k999"(UTF-8).contains("kN")must betruefor every N. Any false negative is a critical bug. - Filter must be byte-identical across the three languages (
md5sum).
What broken looks like
containsis sometimes false for present keys →setandgetdisagree on bit-within-byte. Suspect LSB vs MSB.- Cross-language filters differ →
mod mreduction differs (one uses&instead of%andmisn't a power of two), orh1/h2halves are swapped. containsis always true →mwas constructed as 0; bit array is empty so every(h % 0)panics or all indices land in a never-cleared byte.