Broader Ideas — db-22

The lab as it stands is a deliberately minimal harness. These are extensions that would build naturally on top of it.

A. Percentile-aware bench harness

Replace the single-pass timer with a per-operation timing loop that collects a histogram (HDR-style) of per-op latencies. Then bench reports p50 / p90 / p99 / p99.9 in addition to throughput. This is where the Gil Tene "How NOT to Measure Latency" talk earns its keep — even on a synchronous single-thread loop, a long-tail GC pause in Go or a page fault in C++ will move the tail dramatically.

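Trap to avoid: taking a timestamp per op (time.Now() / std::chrono::steady_clock::now()) itself costs ~30 ns on most boxes, which is comparable to one workload op. Time batches of ops and divide instead. The catch is that each sample is then a batch mean, so a single slow op inside a batch is averaged down rather than recorded at full size; keep batches small enough that the tail stays visible.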
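A minimal sketch of the batch-timing idea, assuming a plain sorted vector of samples stands in for the HDR-style histogram. The names time_batches and percentile are hypothetical, not part of the lab's harness.

```rust
use std::time::Instant;

/// Nearest-rank percentile over an ascending-sorted slice; `p` is in (0, 100].
fn percentile(sorted: &[u64], p: f64) -> u64 {
    assert!(!sorted.is_empty(), "need at least one sample");
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    // Clamp so that rounding can never index outside the slice.
    sorted[rank.max(1).min(sorted.len()) - 1]
}

/// Runs `batches` batches of `batch_size` calls to `op` and returns the mean
/// nanoseconds per op for each batch, sorted ascending.
/// Only two timestamps are taken per batch, so clock overhead is amortized.
fn time_batches<F: FnMut()>(mut op: F, batches: usize, batch_size: u32) -> Vec<u64> {
    assert!(batch_size > 0, "batch_size must be non-zero");
    let mut samples = Vec::with_capacity(batches);
    for _ in 0..batches {
        let start = Instant::now();
        for _ in 0..batch_size {
            op();
        }
        let elapsed_ns = start.elapsed().as_nanos() as u64;
        samples.push(elapsed_ns / u64::from(batch_size));
    }
    samples.sort_unstable();
    samples
}
```

Feeding the result of time_batches to percentile with 50.0, 90.0, 99.0 and 99.9 gives the four reported figures. With few batches the upper percentiles collapse onto the maximum sample, so the batch count has to be large for p99.9 to mean anything.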

B. Allocator pressure scenario

Add a third scenario whose workload is deliberately allocator-heavy: short-lived strings as values (move from u64 to String), or a churn pattern that constantly creates and removes keys so the store keeps allocating and freeing nodes (tree-backed maps) or reallocating its backing storage (slice-backed ones). The cross-language throughput delta for this scenario would likely be much larger than for the existing one, and the results would speak to the maturity of each language's allocator.
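One possible shape for such a scenario, as a sketch only. The function churn and its key-scrambling constant are assumptions made for illustration; the lab's real workload generator is not shown here.

```rust
use std::collections::BTreeMap;

/// Allocator-heavy churn: even steps insert a freshly allocated String value,
/// odd steps remove a key, so the allocator sees a steady stream of
/// short-lived blocks. Returns the number of live keys at the end.
fn churn(ops: u64, keys: u64) -> usize {
    assert!(keys > 0, "keys must be non-zero");
    let mut map: BTreeMap<u64, String> = BTreeMap::new();
    for i in 0..ops {
        // Multiplicative scramble of the step index: deterministic, no RNG state.
        let k = (i.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> 32) % keys;
        if i % 2 == 0 {
            // New heap allocation; a previous value under the same key is freed.
            map.insert(k, format!("value-{}", i));
        } else {
            // Frees the String (and possibly a tree node) if the key is live.
            map.remove(&k);
        }
    }
    map.len()
}
```

Keeping keys small relative to ops makes most inserts overwrite or re-create an entry, which is what drives the allocate/free traffic.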

C. Multi-threaded variant

Wrap CounterStore in a sync primitive and run N workers. The point is not to demonstrate scaling (Mutex<BTreeMap<…>> won't scale) but to show the difference between coarse locking, sharded locking, and lock-free updates. Each language has different idioms here (parking_lot vs std::sync, sync.Map vs sync/atomic, std::shared_mutex vs std::atomic), and the cross-language comparison becomes a language-features comparison.
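The coarse and sharded variants could look roughly like this in Rust, using only std::sync. CoarseStore, ShardedStore and run_workers are hypothetical names, and the shard choice (key modulo shard count) is an assumption; a lock-free variant is left out of the sketch.

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;
use std::thread;

/// Coarse locking: one mutex guards the whole map.
struct CoarseStore {
    inner: Mutex<BTreeMap<u64, u64>>,
}

impl CoarseStore {
    fn new() -> Self {
        CoarseStore { inner: Mutex::new(BTreeMap::new()) }
    }
    fn incr(&self, k: u64, by: u64) {
        *self.inner.lock().unwrap().entry(k).or_insert(0) += by;
    }
    fn total(&self) -> u64 {
        self.inner.lock().unwrap().values().sum()
    }
}

/// Sharded locking: the key selects one of several independently locked maps.
struct ShardedStore {
    shards: Vec<Mutex<BTreeMap<u64, u64>>>,
}

impl ShardedStore {
    fn new(n: usize) -> Self {
        assert!(n > 0, "need at least one shard");
        ShardedStore { shards: (0..n).map(|_| Mutex::new(BTreeMap::new())).collect() }
    }
    fn incr(&self, k: u64, by: u64) {
        let idx = (k % self.shards.len() as u64) as usize;
        *self.shards[idx].lock().unwrap().entry(k).or_insert(0) += by;
    }
    fn total(&self) -> u64 {
        self.shards.iter().map(|s| s.lock().unwrap().values().sum::<u64>()).sum()
    }
}

/// Spawns `workers` scoped threads; each performs `per_worker` increments of 1
/// on keys in 0..keys through the supplied `incr` callback.
fn run_workers<F: Fn(u64, u64) + Sync>(workers: u64, per_worker: u64, keys: u64, incr: F) {
    thread::scope(|s| {
        for w in 0..workers {
            let incr = &incr;
            s.spawn(move || {
                for i in 0..per_worker {
                    incr((w + i) % keys, 1);
                }
            });
        }
    });
}
```

Timing run_workers once per store with the same worker count gives the coarse-versus-sharded comparison; the sum of all counters is a cheap check that no increment was lost.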

D. Snapshot replay / log shipping

Right now dump_snapshot produces bytes that are only used for hashing. Add a restore_snapshot and a small "log" of operations (just the sequence of (op, k, by) triples), and you have a tiny replicated store. Connect three nodes via a deterministic schedule and you have a toy version of db-23.
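A sketch of the restore-and-replay loop, under stated assumptions: the snapshot layout (key and value as consecutive 8-byte little-endian words, in key order) and the Op enum are guesses for illustration, and the CounterStore below is a stand-in rather than the lab's actual type.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Op {
    Incr,
    Decr,
}

#[derive(Default, Debug, PartialEq)]
struct CounterStore {
    map: BTreeMap<u64, u64>,
}

impl CounterStore {
    /// Applies one (op, k, by) triple; decrements saturate at zero.
    fn apply(&mut self, op: Op, k: u64, by: u64) {
        let v = self.map.entry(k).or_insert(0);
        *v = match op {
            Op::Incr => v.wrapping_add(by),
            Op::Decr => v.saturating_sub(by),
        };
    }

    /// Assumed layout: 16 bytes per entry, key then value, little-endian.
    fn dump_snapshot(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(self.map.len() * 16);
        for (k, v) in &self.map {
            out.extend_from_slice(&k.to_le_bytes());
            out.extend_from_slice(&v.to_le_bytes());
        }
        out
    }

    /// Inverse of dump_snapshot; returns None if the length is not a multiple of 16.
    fn restore_snapshot(bytes: &[u8]) -> Option<CounterStore> {
        if bytes.len() % 16 != 0 {
            return None;
        }
        let mut map = BTreeMap::new();
        for rec in bytes.chunks_exact(16) {
            let (mut kb, mut vb) = ([0u8; 8], [0u8; 8]);
            kb.copy_from_slice(&rec[..8]);
            vb.copy_from_slice(&rec[8..]);
            map.insert(u64::from_le_bytes(kb), u64::from_le_bytes(vb));
        }
        Some(CounterStore { map })
    }
}

/// Replays a log of operations in order, as a replica would after a restore.
fn replay(store: &mut CounterStore, log: &[(Op, u64, u64)]) {
    for &(op, k, by) in log {
        store.apply(op, k, by);
    }
}
```

A replica that restores a snapshot taken at log position i and then replays the entries from i onward should produce the same dump_snapshot bytes as the primary, which is the property the hash comparison would check.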

E. Energy and non-time metrics

On Apple Silicon, powermetrics --samplers cpu_power (which must be run as root) reports CPU power over each sampling interval; multiply the average power by the run's duration and divide by the op count to get energy per op. The relative energy of the Rust / Go / C++ implementations on the same workload is a more honest "which is more efficient" claim than throughput, because it folds in stalls, branch mispredictions, and memory bandwidth.
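The arithmetic is short enough to write down. This sketch assumes the sampler output has already been reduced to one average power figure in milliwatts for the duration of the run; nanojoules_per_op is a hypothetical helper, and parsing the powermetrics output is not shown.

```rust
/// Energy per operation in nanojoules.
/// `avg_power_mw`: average power over the run, in milliwatts.
/// `elapsed_secs`: wall-clock duration of the run, in seconds.
/// `ops`: number of workload operations executed during the run.
fn nanojoules_per_op(avg_power_mw: f64, elapsed_secs: f64, ops: u64) -> f64 {
    assert!(ops > 0, "ops must be non-zero");
    // milliwatts -> watts, then watts * seconds = joules
    let joules = avg_power_mw / 1000.0 * elapsed_secs;
    // joules -> nanojoules, spread over the ops
    joules * 1e9 / ops as f64
}
```

For example, 2000 mW sustained for 1.5 s over 3,000,000 ops is 3 J in total, or 1000 nJ per op. Subtracting an idle-power baseline measured the same way would make the figure less sensitive to background load.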

F. Comparison with off-the-shelf benchmark frameworks

Run the same workload under criterion (Rust), go test -bench, and Google Benchmark (C++). Compare:

  • Their reported throughput vs ours.
  • Their reported variance.
  • The shape of their output.

The lab's homegrown harness will look crude in comparison, and that's the point — the exercise of measuring the difference is more educational than the difference itself.

G. Worst-case scenario discovery

Use coverage-guided fuzzing on the workload generator (with the saturating-decrement invariant as the asserted property) to find a seed/ops/keys combination that minimizes throughput or maximizes memory pressure. This connects perf work to the fuzz/property-test discipline used in db-13 and db-15.
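The property itself can be written as a plain function that a fuzz target would call with fuzzer-chosen (seed, ops, keys). In this sketch the SplitMix64 generator, the op encoding and the function name are all assumptions; the real check would drive the lab's CounterStore instead of a bare BTreeMap.

```rust
use std::collections::BTreeMap;

/// SplitMix64 step: a small deterministic generator so any failure is
/// reproducible from its seed.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Generates `ops` operations over `keys` keys and checks that every decrement
/// lands on max(before - by, 0). Returns Err(step) at the first violation.
fn check_saturating_decrement(seed: u64, ops: u64, keys: u64) -> Result<(), u64> {
    assert!(keys > 0, "keys must be non-zero");
    let mut state = seed;
    let mut store: BTreeMap<u64, u64> = BTreeMap::new();
    for step in 0..ops {
        let r = splitmix64(&mut state);
        let k = r % keys;
        let by = splitmix64(&mut state) % 16;
        let slot = store.entry(k).or_insert(0);
        let before = *slot;
        if r >> 63 == 0 {
            *slot = before.wrapping_add(by);
        } else {
            *slot = before.saturating_sub(by);
            // Reference result computed in wider signed arithmetic.
            let expected = (before as i128 - by as i128).max(0) as u64;
            if *slot != expected {
                return Err(step);
            }
        }
    }
    Ok(())
}
```

Returning the failing step rather than panicking lets the same function serve both as a fuzz assertion and as a seed-sweep helper that records which inputs were interesting.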

H. Cross-architecture verification

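Run the existing scripts/cross_test.sh under qemu-user-static for aarch64 / x86_64 / riscv64 and confirm the hashes still match. They should, since the wire format is little-endian and the arithmetic is all 64-bit, but the only way to be sure is to actually do it. Note that all three targets are themselves little-endian and 64-bit, so this run checks ISA and toolchain differences rather than byte-order handling; a big-endian target would be needed to exercise that.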

I. Cache-aware redesign of CounterStore

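std::map and BTreeMap are node-based trees (std::map is typically a red-black tree with one heap node per entry; BTreeMap packs several entries per node but still links nodes by pointer), so iterating them means chasing pointers. The Go version's sorted slice is already closer to a flat layout. A flat sorted array with binary search would be slower for insert, since each new key shifts the tail of the array, but dramatically faster for the iteration step (which is the critical path in dump_snapshot). For a workload that touches each key only a handful of times before snapshotting, the array would be worth measuring.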
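A minimal sketch of the flat layout, assuming u64 keys and values as elsewhere in the lab. FlatStore is a hypothetical name and covers only the increment path.

```rust
/// Counter store backed by a single Vec kept sorted by key.
#[derive(Default)]
struct FlatStore {
    entries: Vec<(u64, u64)>,
}

impl FlatStore {
    /// O(log n) search; O(n) shift when the key is new.
    fn incr(&mut self, k: u64, by: u64) {
        match self.entries.binary_search_by_key(&k, |&(key, _)| key) {
            Ok(i) => self.entries[i].1 = self.entries[i].1.wrapping_add(by),
            Err(i) => self.entries.insert(i, (k, by)),
        }
    }

    fn get(&self, k: u64) -> Option<u64> {
        self.entries
            .binary_search_by_key(&k, |&(key, _)| key)
            .ok()
            .map(|i| self.entries[i].1)
    }

    /// Entries in key order as one contiguous slice: the snapshot step can
    /// walk this linearly with no pointer chasing.
    fn as_slice(&self) -> &[(u64, u64)] {
        &self.entries
    }
}
```

The trade-off shows up directly in incr: the Err branch is where the insert cost lives, so the design only pays off when new keys are rare relative to snapshot iterations.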

J. The "ten percent rule"

A small operational rule we picked up doing this lab: any perf change worth claiming must move the bench number by more than ten percent. Below that, run-to-run noise on a laptop dominates. Above that, you can usually attribute the change to a specific code path. The harness is deliberately not precise enough to defend a 2% claim, and that's a feature.
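Written as a check, the rule is a one-line relative-change test. The helper name and the choice to treat regressions and improvements symmetrically are assumptions of this sketch.

```rust
/// True when the candidate's throughput differs from the baseline by more
/// than ten percent in either direction.
fn is_claimable(baseline_ops_per_sec: f64, candidate_ops_per_sec: f64) -> bool {
    assert!(baseline_ops_per_sec > 0.0, "baseline must be positive");
    let relative_change =
        (candidate_ops_per_sec - baseline_ops_per_sec).abs() / baseline_ops_per_sec;
    relative_change > 0.10
}
```

A move from 100 to 115 Mops/s passes, as does a drop to 85; a move to 102 does not.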