Observations

What we measured

For each implementation, on every commit:

  • All unit tests pass under release optimisation. (Release-only bugs are real: assert(side_effect) under -DNDEBUG is the classic, because the side effect is compiled out together with the assertion.)
  • The CLI binary produces both golden hashes.
  • The cross-language script produces the same hash from all three binaries.

What the bytes look like

The snapshot from scenario A is 7088 bytes. Roughly:

  • 8 bytes magic
  • 8 bytes next_txid (~501 for scenario A; some ops are no-ops)
  • 4 bytes primary row count (≈ 28 of 32 possible keys are touched)
  • Per row: 8 + 8 + 4 + len(tag) + 8 + 8 = 36 + len(tag) bytes
  • 4 bytes secondary distinct tag count
  • Per (tag, keys): 4 + len(tag) + 4 + 8*key_count bytes

The largest single section is the primary; the secondary is small because the tag alphabet is fixed at 16.
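
As a sanity check on the layout above, the total size can be recomputed from the row tags and the secondary index. This is a minimal sketch, assuming tags are byte strings and that every fixed-width field has exactly the width listed; the function and parameter names are hypothetical and do not come from any of the implementations.

```python
from typing import Iterable, Mapping

HEADER_BYTES = 8 + 8                 # magic + next_txid
ROW_FIXED_BYTES = 8 + 8 + 4 + 8 + 8  # fixed-width fields per primary row (36)


def snapshot_size(row_tags: Iterable[bytes], secondary: Mapping[bytes, int]) -> int:
    """Expected snapshot size in bytes.

    row_tags:  the tag of every primary row (tombstoned rows included).
    secondary: tag -> number of keys listed under that tag.
    """
    size = HEADER_BYTES
    size += 4  # primary row count
    size += sum(ROW_FIXED_BYTES + len(tag) for tag in row_tags)
    size += 4  # secondary distinct tag count
    size += sum(4 + len(tag) + 4 + 8 * n for tag, n in secondary.items())
    return size
```

Comparing this number against the length of the bytes actually written is a cheap way to notice a field that was added to one implementation but not to the others.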

Visibility of tombstones

Because tombstoned rows stay in the primary, you can read the snapshot and recover the current visible state by filtering on deleted_at == 0. That property let us write a test that asserts the primary row count equals live_count + tombstone_count, which caught a regression where exec_delete was removing the row from the primary instead of marking it.
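
The filter and the row-count invariant can be sketched as follows. The Row type and its fields other than deleted_at are hypothetical stand-ins for whatever each implementation decodes from the snapshot, and the second assertion (visible rows equal live_count) is an extra check added here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    key: int
    tag: bytes
    deleted_at: int  # 0 means live; non-zero is assumed to mark a tombstone


def visible(rows):
    """Current visible state: rows that have not been tombstoned."""
    return [r for r in rows if r.deleted_at == 0]


def check_row_count(rows, live_count, tombstone_count):
    """Primary must hold live rows plus tombstones, never fewer."""
    assert len(rows) == live_count + tombstone_count
    assert len(visible(rows)) == live_count
```

If a delete drops the row instead of marking it, the first assertion fails because the primary is one row short.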

The shape of kind distribution

Across 2000 ops in scenario B, the empirical distribution of kind matches the design 3:2:1:1:1 ratio within ~3%. The weights sum to 8, so the top 3 bits of each splitmix64 output map onto the kinds exactly, and the measurement is consistent with those bits being uniform enough that we do not need a rejection sampler.
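
A measurement of this kind can be reproduced with a few lines. The splitmix64 step below uses the standard published constants; the bucket-to-kind table, the kind numbering 0..4, and the seed are assumptions made for illustration, since the actual mapping and scenario seed are not given here.

```python
from collections import Counter

MASK64 = (1 << 64) - 1

# Assumed table: 8 equally likely buckets split 3:2:1:1:1 across kinds 0..4.
KIND_OF_BUCKET = (0, 0, 0, 1, 1, 2, 3, 4)


def splitmix64(state):
    """One splitmix64 step; returns (new_state, output)."""
    state = (state + 0x9E3779B97F4A7C15) & MASK64
    z = state
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return state, z ^ (z >> 31)


def kind_counts(seed, n_ops):
    """Count how often each kind is drawn from the top 3 bits of the output."""
    counts = Counter()
    state = seed
    for _ in range(n_ops):
        state, out = splitmix64(state)
        counts[KIND_OF_BUCKET[out >> 61]] += 1  # top 3 bits select the bucket
    return counts
```

Dividing each count by n_ops and comparing against 3/8, 2/8, 1/8, 1/8, 1/8 gives the deviation per kind.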

Non-determinism we did not observe

  • No flakes across 50 runs of scenario B.
  • No drift between debug and release builds for any implementation.
  • No drift between macOS arm64 and Linux x86_64 (sanity-checked once in a throwaway container).

All three of those properties are load-bearing for the cross-language test to be useful: if any of them fails, the script becomes a flaky test and people learn to ignore it.
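
A repeat-run check along these lines is one way to keep those properties from regressing silently. It is only a sketch: each runner is a hypothetical callable that executes one implementation on a fixed scenario and returns its snapshot hash as a string, and the default of 50 runs simply mirrors the count above.

```python
from typing import Callable, Dict, List, Mapping


def distinct_hashes(runners: Mapping[str, Callable[[], str]], runs: int) -> Dict[str, List[str]]:
    """Run every implementation `runs` times and map each hash seen to the
    sorted names of the implementations that produced it."""
    seen = {}
    for name, run in runners.items():
        for _ in range(runs):
            seen.setdefault(run(), set()).add(name)
    return {h: sorted(names) for h, names in seen.items()}


def assert_deterministic(runners, runs=50):
    """Fail unless all runs of all implementations produced one single hash."""
    seen = distinct_hashes(runners, runs)
    if len(seen) != 1:
        raise AssertionError(f"hash drift: {seen}")
```

Reporting which implementations produced which hash makes a failure actionable, which matters if the goal is that nobody learns to ignore the check.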