Analysis — db-22

What we are actually trying to do

The brief was "performance and benchmarking." That is a topic, not a problem statement. The first design pass turned it into a problem statement:

Build a tiny system that has one knob (a deterministic workload) and one measurable property (throughput on that workload), then implement it in three languages and use byte-identical correctness as the precondition for any speed claim.

Everything else in the lab follows from that constraint.

Constraints I started with

Three languages must produce the same bytes for the same inputs. Without this, "Rust is faster than Go on this workload" is unfalsifiable — they might just be doing different work.
No external dependencies for the core data structure. SHA-256 has to be reimplementable from scratch in C++ (no OpenSSL), SplitMix64 has to be reimplemented in all three. This is the same discipline used in db-15 and is the only way to guarantee bit-identity.
The workload must be small enough to brute-force-test for determinism, but large enough that a 1% difference in implementation efficiency shows up in the bench numbers. The two scenarios (500 ops / 32 keys and 5000 ops / 256 keys) bracket this.
The bench harness must not contaminate the correctness harness. Throughput numbers go to stderr; the hex hash goes to stdout with no trailing newline. A shell script can $(benchctl hash ...) cleanly.

Data structure choice: counter store, not KV store

I considered an mvcc KV store (like db-15), a small B-tree, or even reusing db-20's distributed KV. All three are overkill for what we want to measure here. A i64 -> u64 counter store is:

Small enough to fit in ~400 LOC per language.
Big enough to exercise the map implementations of each language (BTreeMap, map, std::map).
Has interesting cross-language failure modes (HashMap iteration order, signed/unsigned subtraction, integer-width truncation in serialization).
Has a workload that genuinely creates and destroys entries, so the map's resize / rebalance / erase code paths all execute.

The saturating-decrement decision

The choice about what decr(k, by) does when by > current or when k is missing is the most consequential semantic decision in the lab. I considered three options:

No-op on missing, total_ops unchanged. Cleaner in some ways but makes total_ops a partial counter: you cannot replay the operation stream and recover the same total_ops. Rejected.
Underflow / panic on negative. Would force the workload generator to remember which keys are live, defeating the determinism. Rejected.
Saturating: bump total_ops, then either remove the entry or subtract. Total_ops always tracks the operation stream. Snapshots are pure functions of the final state, not the operation history. This is what we picked.

The cost is that "decrement past zero" is silently lossy. For a benchmark workload that is the right trade.

Why three RNG draws per iteration

A subtle correctness footgun: if some branches consume fewer RNG words than others, the RNG stream diverges from a different implementation that happens to evaluate the branches in a different order. Drawing all three words before branching means every iteration consumes the same number of RNG bytes regardless of kind. This makes the workload trivially portable.

SplitMix64 over xoshiro / pcg / etc.

SplitMix64 has the smallest state (one u64) and the simplest step (one add, three multiplies, four xors, three shifts). Its only operations are 64-bit integer ops that all three languages handle identically with no surprises. Anything fancier is a footgun for cross-language byte-equality with no upside for a synthetic workload.

Wire format design notes

Little-endian everywhere. ASCII magic so a hexdump -C is human-readable. Length prefix (distinct_keys) so a reader could in principle parse the snapshot incrementally — we never actually do this in the lab, but the format is forward-compatible.

We do not embed keys or ops or seed in the snapshot. The snapshot is purely about the resulting state; the workload that produced it is metadata.

Bench harness design

Four decisions:

One pass, one timing region. No statistical machinery. The exercises that need percentiles or distributions go to criterion / go test -bench / Google Benchmark — not this harness.
Warm-up sized as ops/10 + 1. A small constant warm-up (+ 1 handles ops < 10) that pulls cache lines and triggers allocator first-touch. Empirically this stabilizes the second-pass timing to within a few percent run-to-run.
Bench output to stderr. Lets benchctl bench and benchctl hash use the same flag layout and lets shell scripts redirect them independently.
distinct in the output. It's a sanity check: if the bench reports distinct=0, your workload is collapsing entries faster than it creates them and your throughput number is measuring deletes, not inserts. (See observation.md for the actual numbers.)

What I'd do differently with more time

Add a third scenario that intentionally has heavy contention on a single key (small keys, large ops) to make the bench numbers more sensitive to allocator behavior.
Wire the bench harness to also produce a flamegraph hint (elapsed_us bucketed per operation kind).
Add a --profile flag that runs the workload twice and reports the ratio, as a cheap "is this stable?" check.

Distributed Systems Engineer — Build Databases & Consensus From Scratch