db-12 step 03 — Serializer and cross-language byte identity

Goal

Define a deterministic binary format for the AST, implement serialize(stmts) -> Vec<u8> in all three languages, ship a sqlctl CLI that prints the bytes, and prove via sha256 that all three implementations agree on every legal input.

CLI contract

sqlctl parse --file <path>
sqlctl parse --inline "<sql>"
  • On success, stdout receives the raw bytes from serialize(parse(...)), with no framing and no trailing newline.
  • On success, stderr receives the lowercase hex sha256 of stdout, with no trailing newline.
  • On parse error, write "parse error at line L col C: <msg>\n" to stderr and exit 1. Stdout must be empty.
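
The contract above can be sketched as a single output routine. This is a minimal sketch, not the project's actual code: the ParseError struct, the emit function, and the hash_hex parameter (standing in for the project's sha256_hex) are all hypothetical names chosen for illustration.

```rust
use std::io::Write;

/// Hypothetical parse-error shape; the real parser's error type may differ.
pub struct ParseError {
    pub line: usize,
    pub col: usize,
    pub msg: String,
}

/// Writes the output of one `sqlctl parse` run and returns the exit code.
/// `hash_hex` is an assumed stand-in for sha256_hex so the sketch is self-contained.
pub fn emit<O: Write, E: Write>(
    result: Result<Vec<u8>, ParseError>,
    hash_hex: impl Fn(&[u8]) -> String,
    out: &mut O,
    err: &mut E,
) -> std::io::Result<i32> {
    match result {
        Ok(bytes) => {
            // Raw serialized bytes: no framing, no trailing newline.
            out.write_all(&bytes)?;
            // Lowercase hex digest of exactly those bytes, no trailing newline.
            err.write_all(hash_hex(&bytes).as_bytes())?;
            Ok(0)
        }
        Err(e) => {
            // Nothing is written to stdout on the error path.
            writeln!(err, "parse error at line {} col {}: {}", e.line, e.col, e.msg)?;
            Ok(1)
        }
    }
}
```

Taking the two writers as parameters, rather than writing to the process streams directly, lets a unit test capture both outputs in Vec<u8> buffers; a real main would presumably pass locked stdout and stderr handles and hand the returned code to std::process::exit.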

Tasks

  1. Implement serialize per the wire format in CONCEPTS.md: the magic header b"DSESQL01", then a u32 LE statement count, then per-statement records with u8 kind tags. Integers are unsigned little-endian unless noted; INT literals are i64 LE; strings are a u32 LE length followed by raw UTF-8 bytes.
  2. Inline a SHA-256 implementation (Rust sha256 + sha256_hex; C++ sha256_hex). In Go, use crypto/sha256 for brevity (stdlib is allowed; the digest is fully determined by the standard, so cross-language identity is preserved).
  3. Build sqlctl in Rust (src/rust/src/bin/sqlctl.rs), Go (src/go/cmd/sqlctl/main.go), and C++ (src/cpp/src/sqlctl.cc).
  4. Freeze the two fixtures scripts/fixtures/a_basic.sql and scripts/fixtures/b_full.sql. The fixtures must exercise every statement kind, both literal types, the '' escape, and every comparison operator. Compute their sha256 once from the Rust reference and freeze the values in:
    • scripts/cross_test.sh (as want_hash cases)
    • src/go/sql_test.go (TestFixtureAHash, TestFixtureBHash)
    • src/cpp/tests/test_sqlfront12.cc (test_fixture_a_hash, test_fixture_b_hash)
    • CONCEPTS.md (frozen-hash table)
  5. Write scripts/verify.sh — builds + unit-tests the three languages; prints === OK === on success.
  6. Write scripts/cross_test.sh:
    • Build the three sqlctl binaries.
    • For each fixture, run sqlctl parse --file FIX for all three; assert all three stderr hashes match each other and match the frozen value; assert the CLI hash equals shasum -a 256 of stdout; assert the bytes are bit-identical (cmp -s).
    • Inline-arg smoke test: sqlctl parse --inline 'SELECT * FROM t;' must match across languages.
    • Error-path smoke test: feed SELECT FROM t; to all three; each must exit non-zero with a stderr line that mentions the column.
    • Print === ALL OK === on success.
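
The primitive encodings named in task 1 can be sketched as a few append helpers. This is a sketch under assumptions: the names MAGIC, put_u32, put_i64, put_str, and begin are hypothetical, the string length is assumed to be a byte count, and the per-statement record layout is left to CONCEPTS.md and not reproduced here.

```rust
/// Magic header from task 1.
const MAGIC: &[u8; 8] = b"DSESQL01";

/// Appends a u32 in little-endian byte order.
fn put_u32(buf: &mut Vec<u8>, v: u32) {
    buf.extend_from_slice(&v.to_le_bytes());
}

/// Appends an INT literal as a signed 64-bit little-endian value.
fn put_i64(buf: &mut Vec<u8>, v: i64) {
    buf.extend_from_slice(&v.to_le_bytes());
}

/// Appends a string as a u32 LE length followed by the raw UTF-8 bytes.
/// Assumption: the length is a byte count and fits in a u32.
fn put_str(buf: &mut Vec<u8>, s: &str) {
    put_u32(buf, s.len() as u32);
    buf.extend_from_slice(s.as_bytes());
}

/// Starts an output buffer with the magic header and the statement count.
/// The per-statement records (u8 kind tag plus fields) would be appended next.
fn begin(stmt_count: u32) -> Vec<u8> {
    let mut buf = Vec::with_capacity(12);
    buf.extend_from_slice(MAGIC);
    put_u32(&mut buf, stmt_count);
    buf
}
```

Using to_le_bytes instead of casting or transmuting keeps the output independent of the host's endianness, which is one of the things the cross-language hash check relies on.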

Acceptance

$ scripts/verify.sh
=== rust === ... ok
=== go   === ... ok
=== cpp  === ... ok
=== OK ===

$ scripts/cross_test.sh
=== build ===
=== fixture: a_basic.sql ===
  rust=071b40fd... (     181 B)
  go  =071b40fd... (     181 B)
  cpp =071b40fd... (     181 B)
  match: 071b40fd5d0c684695c5a8499be6fe970ed4533af16f71dcc4c455091b576d15
=== fixture: b_full.sql ===
  rust=e219f1ee... (     486 B)
  ...
  match: e219f1ee4ae69f194cca7b9791aa2e34ecdb2680956dbf8a94618fa8093aa962
=== inline-arg smoke test ===
  inline hash: 941f2125...
=== error-path smoke test ===
  [rust] parse error at line 1 col 8: expected identifier
  [go] parse error at line 1 col 8: expected identifier
  [cpp] parse error at line 1 col 8: expected identifier
=== ALL OK ===

Inline unit tests (mirror across three languages):

  • serialize_header_and_count: output starts with "DSESQL01" + the correct u32 LE statement count.
  • serialize_is_deterministic: serialize(ast) == serialize(ast) byte-for-byte on a non-trivial AST.
  • sha256_known_vectors: sha256("") and sha256("abc") match the FIPS 180-4 reference vectors.
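
For the sha256_known_vectors test, an inlined SHA-256 might look like the following. It is a straightforward, unoptimized sketch of the FIPS 180-4 algorithm that buffers the whole padded message in memory; the function names sha256 and sha256_hex follow task 2, but the body is an illustration and not necessarily how the reference implementation is written.

```rust
/// Round constants from FIPS 180-4.
const K: [u32; 64] = [
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
];

/// Computes the SHA-256 digest of `data`.
pub fn sha256(data: &[u8]) -> [u8; 32] {
    let mut h: [u32; 8] = [
        0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
        0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19,
    ];
    // Padding: 0x80, zeros up to 56 mod 64, then the bit length as a big-endian u64.
    let mut msg = data.to_vec();
    msg.push(0x80);
    while msg.len() % 64 != 56 {
        msg.push(0);
    }
    msg.extend_from_slice(&(data.len() as u64).wrapping_mul(8).to_be_bytes());

    for block in msg.chunks_exact(64) {
        // Message schedule: 16 big-endian words, expanded to 64.
        let mut w = [0u32; 64];
        for i in 0..16 {
            w[i] = u32::from_be_bytes([block[4 * i], block[4 * i + 1], block[4 * i + 2], block[4 * i + 3]]);
        }
        for i in 16..64 {
            let s0 = w[i - 15].rotate_right(7) ^ w[i - 15].rotate_right(18) ^ (w[i - 15] >> 3);
            let s1 = w[i - 2].rotate_right(17) ^ w[i - 2].rotate_right(19) ^ (w[i - 2] >> 10);
            w[i] = w[i - 16].wrapping_add(s0).wrapping_add(w[i - 7]).wrapping_add(s1);
        }
        // Compression: 64 rounds over the working variables.
        let [mut a, mut b, mut c, mut d, mut e, mut f, mut g, mut hh] = h;
        for i in 0..64 {
            let s1 = e.rotate_right(6) ^ e.rotate_right(11) ^ e.rotate_right(25);
            let ch = (e & f) ^ (!e & g);
            let t1 = hh.wrapping_add(s1).wrapping_add(ch).wrapping_add(K[i]).wrapping_add(w[i]);
            let s0 = a.rotate_right(2) ^ a.rotate_right(13) ^ a.rotate_right(22);
            let maj = (a & b) ^ (a & c) ^ (b & c);
            let t2 = s0.wrapping_add(maj);
            hh = g; g = f; f = e; e = d.wrapping_add(t1);
            d = c; c = b; b = a; a = t1.wrapping_add(t2);
        }
        let vals = [a, b, c, d, e, f, g, hh];
        for i in 0..8 {
            h[i] = h[i].wrapping_add(vals[i]);
        }
    }
    // The digest is the eight state words, each written big-endian.
    let mut out = [0u8; 32];
    for (i, word) in h.iter().enumerate() {
        out[4 * i..4 * i + 4].copy_from_slice(&word.to_be_bytes());
    }
    out
}

/// Lowercase hex encoding of the digest, as the CLI prints it on stderr.
pub fn sha256_hex(data: &[u8]) -> String {
    sha256(data).iter().map(|b| format!("{:02x}", b)).collect()
}
```

Note that SHA-256 itself is big-endian internally, while the wire format is little-endian; the two conventions never mix because the hash only ever sees the finished byte stream.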

Discussion prompts

  • Why is the cross-language sha256 match a near-proof of correctness rather than an actual proof? What kind of bug could match anyway?
  • The b_full.sql test is 486 bytes. Why is that more interesting than a 100k-byte randomly generated SQL file with the same hash check?
  • If we wanted to add LIMIT N to the SELECT grammar tomorrow, what would the smallest backwards-compatible change to the wire format look like? Why does that question matter the first time we want to evolve the AST?