db-22 — Performance and Benchmarking

Why this lab exists

Benchmarks lie. They mostly lie because a benchmark answers a different question than the one you thought you were asking. db-22 is a small, self-contained system whose only purpose is to be measured: a keyed in-memory counter store driven by a deterministic synthetic workload. We freeze a wire format and a workload, hash the resulting state across three implementations (Rust, Go, C++), and use the resulting binary identity as the load-bearing definition of "the same program."

Once correctness is cross-language identical, we can compare throughput of the three implementations on the same hardware honestly — and we can talk about what does and does not constitute a fair comparison.

The data structure: CounterStore

A CounterStore is a mapping i64 -> u64 (counters) plus a single total_ops: u64 running counter. Three operations:

  • incr(k, by): total_ops += 1, counters[k] += by (entry created with value by if missing).
  • decr(k, by): total_ops += 1. If k is missing the call has no further effect (total_ops was already incremented). Otherwise:
    • if current <= by, the entry is removed (saturating decrement);
    • else counters[k] = current - by.
  • get(k) -> Option<u64>: live lookup, returns None if absent.

There are no tombstones. Removed counters leave no trace in the snapshot. This is intentional and matters: it makes the snapshot a pure function of the final live state, not of the history of operations.

The semantic that total_ops is bumped on every call (including no-op decr on missing) is the simplest invariant and is the one against which all golden hashes were computed. Changing it would change every hash.

Wire format: dump_snapshot

The snapshot is a function CounterStore -> bytes whose output must be byte-identical across all three implementations.

offset  size  field
------  ----  ---------------------------------------------
0       8     magic "DSEBENCH" (ASCII)
8       8     total_ops (u64 little-endian)
16      4     distinct_keys (u32 little-endian)
20+     16    one row per key, ascending by key:
              - 8 bytes: key (i64 little-endian)
              - 8 bytes: count (u64 little-endian)

Ordering is the only subtle bit. Rust uses BTreeMap, whose iteration is naturally ascending. Go uses a plain map and explicitly sorts the keys before emitting. C++ uses std::map, also ascending. All three converge on the same byte sequence.

The workload: deterministic by construction

We need a workload that:

  1. Is identical across languages.
  2. Exercises a mix of insert / mutate / delete to produce a non-trivial end state.
  3. Can be scaled in both ops and keys.

We use SplitMix64 for randomness. It is small, fast, has trivially portable arithmetic (just u64 adds, shifts, multiplies, and xors), and needs no library. The constants and step function are well-known:

state += 0x9E3779B97F4A7C15
z = state
z = (z ^ (z >> 30)) * 0xBF58476D1CE4E7B5
z = (z ^ (z >> 27)) * 0x94D049BB133111EB
return z ^ (z >> 31)

Each workload iteration draws exactly three u64 words. Drawing the same number every iteration is what keeps the RNG stream identical across languages even when one branch is short and another is long.

r1, r2, r3 = rng.next(), rng.next(), rng.next()
kind = (r1 >> 62) & 0x3        # 0,1,2 → Incr ; 3 → Decr  (3:1 ratio)
k    = (r2 % keys) as i64
by   = (r3 % 100) + 1          # 1..=100

Three-to-one incr:decr means the counter map grows in expectation, but with keys small relative to ops the map fills up and the decrement path actually deletes entries — both code paths get exercised in any non-trivial run.

Two frozen scenarios

ScenarioseedopskeysSHA-256 of snapshot
A42500324b72eab6cbc773ac9584104c5923a5139b34ab466052bdb8ceacb087c06a9015
B750002565c35e7b1507834fda4960246640e6fb0b194b75b9593bec87159eafcbc3876a1

scripts/cross_test.sh builds all three binaries and asserts that the hashes match each other and these golden values.

Common cross-language divergence sources (and how we avoid them)

  • Map iteration order. We never iterate HashMap. We sort keys explicitly (Go) or use BTreeMap/std::map (Rust, C++).
  • Integer promotion in shifts. u64-only arithmetic. No mixed signed/unsigned shifts.
  • % semantics for negative operands. r2 is u64; modulus and cast to i64 happen exactly once and in the same order.
  • size_t width. We only put u32/u64 on the wire, never size_t directly.
  • Trailing whitespace / newlines in CLI output. hash prints the hex with no trailing newline. bench writes its line to stderr so it can never pollute stdout that a script might be hashing.

Bench methodology

benchctl bench runs a short warm-up (ops/10 + 1) to pull pages and populate caches, then a single timed pass over ops calls. It prints ops, keys, elapsed_us, ops_per_sec, and distinct (the number of live counters at the end) to stderr.

This is intentionally crude — the workload is a single thread doing in-memory map operations. It is good enough for "is the Rust build twice as fast as the Go build?" type questions; it is not a microbenchmark replacement for criterion / go test -bench / Google Benchmark. The references in references.md cover the deeper rabbit hole.

What you actually learn from this lab

  • Why a benchmark needs to fix a deterministic workload before it fixes a metric.
  • Why "the same program in two languages" needs a binary equality test, not a "looks the same" code review.
  • Why bench harnesses must warm up, isolate stdout/stderr, and avoid hidden allocations inside the timed region.
  • Why HashMap iteration order is a footgun for portable wire formats.