observation
What we measured
For each implementation, on every commit:
- All unit tests pass under release optimisation. (Bugs that only show up outside debug builds are real: assert(side_effect) under -DNDEBUG is the classic.)
- The CLI binary produces both golden hashes.
- The cross-language script produces the same hash from all three binaries.
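A minimal sketch of what the cross-language comparison could look like, assuming each binary prints its hash on stdout. The helper names `run_hash` and `disagreements` are hypothetical, and the command lines for the real binaries are not specified here, so the caller supplies them.

```python
import subprocess


def run_hash(cmd):
    """Run one CLI binary (hypothetical invocation) and return its stdout, stripped."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def disagreements(hashes):
    """Return the names whose hash differs from the first implementation's hash.

    `hashes` maps an implementation name to the hash string it produced.
    An empty result means every implementation agrees.
    """
    names = list(hashes)
    if not names:
        return []
    reference = hashes[names[0]]
    return [name for name in names[1:] if hashes[name] != reference]
```

In a CI job, one would call `run_hash` once per binary, collect the results into a dict, and fail the build if `disagreements` returns anything.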
What the bytes look like
The snapshot from scenario A is 7088 bytes. Roughly:
- 8 bytes magic
- 8 bytes next_txid (~501 for scenario A; some ops are no-ops)
- 4 bytes primary row count (≈ 28 of 32 possible keys are touched)
- Per row: 8 + 8 + 4 + len(tag) + 8 + 8 = 36 + len(tag) bytes
- 4 bytes secondary distinct tag count
- Per (tag, keys): 4 + len(tag) + 4 + 8*key_count bytes
The largest single section is the primary; the secondary is small because the tag alphabet is fixed at 16.
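The layout above can be turned into a size calculation, which is a cheap sanity check on a snapshot before parsing it. This is a sketch under the byte counts listed above only; the function name `snapshot_size` and its argument shapes are hypothetical, and it does not assume anything about what the fixed-width row fields mean.

```python
HEADER_BYTES = 8 + 8 + 4             # magic + next_txid + primary row count
ROW_FIXED_BYTES = 8 + 8 + 4 + 8 + 8  # fixed-width row fields; the tag bytes come on top


def snapshot_size(row_tags, secondary):
    """Expected snapshot size in bytes.

    row_tags:  one tag (bytes) per primary row, tombstoned rows included.
    secondary: dict mapping each distinct tag (bytes) to its key count.
    """
    primary = sum(ROW_FIXED_BYTES + len(tag) for tag in row_tags)
    secondary_bytes = 4  # distinct tag count
    for tag, key_count in secondary.items():
        secondary_bytes += 4 + len(tag) + 4 + 8 * key_count
    return HEADER_BYTES + primary + secondary_bytes
```

Comparing this number against the actual file length catches truncated or padded snapshots without decoding a single row.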
Visibility of tombstones
Because tombstoned rows stay in the primary, you can read the
snapshot and recover the current visible state by filtering on
deleted_at == 0. That property let us write a test that asserts
the primary row count equals live_count + tombstone_count, which
caught a regression where exec_delete was removing the row from
the primary instead of marking it.
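A sketch of that invariant, assuming the test keeps its own independent tally of live and tombstoned keys. The `Row` type, its `key` field, and the helper names are hypothetical; the only thing taken from the text is that a live row has deleted_at == 0.

```python
from dataclasses import dataclass


@dataclass
class Row:
    key: int             # hypothetical field name
    deleted_at: int = 0  # 0 means live; any other value marks a tombstone


def visible_rows(primary):
    """Recover the current visible state by filtering out tombstones."""
    return [row for row in primary if row.deleted_at == 0]


def row_count_holds(primary, live_count, tombstone_count):
    """True when the primary holds exactly the live rows plus the tombstones."""
    return len(primary) == live_count + tombstone_count


# Correct delete: the row stays in the primary and is only marked.
marked = [Row(1), Row(2, deleted_at=7), Row(3)]
# Regressed delete: the row was removed from the primary outright.
removed = [Row(1), Row(3)]
```

With an expected tally of two live rows and one tombstone, `row_count_holds(marked, 2, 1)` passes while `row_count_holds(removed, 2, 1)` fails, which is the shape of the regression described above.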
The shape of kind distribution
Across 2000 ops in scenario B, the empirical distribution of kind
matches the design 3:2:1:1:1 ratio within ~3%. The weights sum to 8,
so the top 3 bits of each splitmix64 output map directly onto a kind,
and the measurement suggests those bits are uniform enough that we do
not need a rejection sampler.
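A sketch of how such a measurement could be reproduced. The splitmix64 step uses the standard published constants; the bucket-to-kind table, the integer kind labels, and the seed are assumptions for illustration, since the text only fixes the 3:2:1:1:1 ratio.

```python
from collections import Counter

MASK64 = (1 << 64) - 1


def splitmix64(state):
    """Advance the splitmix64 state once; return (new_state, 64-bit output)."""
    state = (state + 0x9E3779B97F4A7C15) & MASK64
    z = state
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return state, z ^ (z >> 31)


# Assumed mapping: 8 buckets from the top 3 bits, split 3:2:1:1:1 over kinds 0..4.
KIND_BY_BUCKET = (0, 0, 0, 1, 1, 2, 3, 4)


def kind_counts(seed, n_ops):
    """Count how often each kind is drawn over n_ops consecutive outputs."""
    counts = Counter()
    state = seed
    for _ in range(n_ops):
        state, out = splitmix64(state)
        counts[KIND_BY_BUCKET[out >> 61]] += 1  # top 3 bits select the bucket
    return counts
```

Because the table has exactly 8 entries, every 3-bit value lands on a kind and no draw is ever discarded. For 2000 ops the design counts would be 750, 500, 250, 250, 250.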
Non-determinism we did not observe
- No flakes across 50 runs of scenario B.
- No drift between debug and release builds for any implementation.
- No drift between macOS arm64 and Linux x86_64 (sanity-checked once in a throwaway container).
All three of those properties are load-bearing for the cross-language test to be useful: if any one of them fails, the script becomes a flaky test and people learn to ignore it.
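The repeated-run check can be sketched as a small helper that collects the distinct digests seen over a number of runs. The name `distinct_digests` is hypothetical, and SHA-256 is an assumption standing in for whatever hash the golden values actually use.

```python
import hashlib


def distinct_digests(produce_snapshot, runs=50):
    """Call produce_snapshot() `runs` times and return the set of digests seen.

    produce_snapshot must return the snapshot as bytes. A deterministic
    implementation yields a set with exactly one element.
    """
    return {hashlib.sha256(produce_snapshot()).hexdigest() for _ in range(runs)}
```

Asserting that the returned set has size 1 turns "no flakes across 50 runs" into a check that can be rerun on demand rather than a one-off observation.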