Analysis — db-22
What we are actually trying to do
The brief was "performance and benchmarking." That is a topic, not a problem statement. The first design pass turned it into a problem statement:
Build a tiny system that has one knob (a deterministic workload) and one measurable property (throughput on that workload), then implement it in three languages and use byte-identical correctness as the precondition for any speed claim.
Everything else in the lab follows from that constraint.
Constraints I started with
- Three languages must produce the same bytes for the same inputs. Without this, "Rust is faster than Go on this workload" is unfalsifiable — they might just be doing different work.
- No external dependencies for the core data structure. SHA-256 has to be reimplementable from scratch in C++ (no OpenSSL), SplitMix64 has to be reimplemented in all three. This is the same discipline used in db-15 and is the only way to guarantee bit-identity.
- The workload must be small enough to brute-force-test for determinism, but large enough that a 1% difference in implementation efficiency shows up in the bench numbers. The two scenarios (500 ops / 32 keys and 5000 ops / 256 keys) bracket this.
- The bench harness must not contaminate the correctness harness.
Throughput numbers go to stderr; the hex hash goes to stdout with no
trailing newline. A shell script can
$(benchctl hash ...)cleanly.
Data structure choice: counter store, not KV store
I considered an mvcc KV store (like db-15), a small B-tree, or even
reusing db-20's distributed KV. All three are overkill for what we want
to measure here. A i64 -> u64 counter store is:
- Small enough to fit in ~400 LOC per language.
- Big enough to exercise the map implementations of each language
(
BTreeMap,map,std::map). - Has interesting cross-language failure modes (HashMap iteration order, signed/unsigned subtraction, integer-width truncation in serialization).
- Has a workload that genuinely creates and destroys entries, so the map's resize / rebalance / erase code paths all execute.
The saturating-decrement decision
The choice about what decr(k, by) does when by > current or when k
is missing is the most consequential semantic decision in the lab. I
considered three options:
- No-op on missing, total_ops unchanged. Cleaner in some ways but
makes
total_opsa partial counter: you cannot replay the operation stream and recover the sametotal_ops. Rejected. - Underflow / panic on negative. Would force the workload generator to remember which keys are live, defeating the determinism. Rejected.
- Saturating: bump total_ops, then either remove the entry or subtract. Total_ops always tracks the operation stream. Snapshots are pure functions of the final state, not the operation history. This is what we picked.
The cost is that "decrement past zero" is silently lossy. For a benchmark workload that is the right trade.
Why three RNG draws per iteration
A subtle correctness footgun: if some branches consume fewer RNG words
than others, the RNG stream diverges from a different implementation
that happens to evaluate the branches in a different order. Drawing all
three words before branching means every iteration consumes the same
number of RNG bytes regardless of kind. This makes the workload trivially
portable.
SplitMix64 over xoshiro / pcg / etc.
SplitMix64 has the smallest state (one u64) and the simplest step
(one add, three multiplies, four xors, three shifts). Its only
operations are 64-bit integer ops that all three languages handle
identically with no surprises. Anything fancier is a footgun for
cross-language byte-equality with no upside for a synthetic workload.
Wire format design notes
Little-endian everywhere. ASCII magic so a hexdump -C is human-readable.
Length prefix (distinct_keys) so a reader could in principle parse the
snapshot incrementally — we never actually do this in the lab, but the
format is forward-compatible.
We do not embed keys or ops or seed in the snapshot. The
snapshot is purely about the resulting state; the workload that produced
it is metadata.
Bench harness design
Four decisions:
- One pass, one timing region. No statistical machinery. The
exercises that need percentiles or distributions go to
criterion/go test -bench/ Google Benchmark — not this harness. - Warm-up sized as
ops/10 + 1. A small constant warm-up (+ 1handlesops < 10) that pulls cache lines and triggers allocator first-touch. Empirically this stabilizes the second-pass timing to within a few percent run-to-run. - Bench output to stderr. Lets
benchctl benchandbenchctl hashuse the same flag layout and lets shell scripts redirect them independently. distinctin the output. It's a sanity check: if the bench reportsdistinct=0, your workload is collapsing entries faster than it creates them and your throughput number is measuring deletes, not inserts. (Seeobservation.mdfor the actual numbers.)
What I'd do differently with more time
- Add a third scenario that intentionally has heavy contention on a
single key (small
keys, largeops) to make the bench numbers more sensitive to allocator behavior. - Wire the bench harness to also produce a flamegraph hint (
elapsed_usbucketed per operation kind). - Add a
--profileflag that runs the workload twice and reports the ratio, as a cheap "is this stable?" check.