db-12 step 03 — Serializer and cross-language byte identity
Goal
Define a deterministic binary format for the AST, implement
serialize(stmts) -> Vec<u8> in all three languages, ship a sqlctl
CLI that prints the bytes, and prove via sha256 that all three
implementations agree on every legal input.
CLI contract
sqlctl parse --file <path>
sqlctl parse --inline "<sql>"
- Stdout receives the raw bytes from serialize(parse(...)): no framing, no trailing newline.
- Stderr receives the lowercase hex sha256 of stdout, with no trailing newline.
- On parse error, write the following line to stderr and exit 1. Stdout must be empty.
    parse error at line L col C: <msg>\n
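The contract above can be read as a pure mapping from a parse result to stdout, stderr, and an exit code. The following is a minimal Rust sketch under stated assumptions, not the project's actual sqlctl code: ParseError, CliOutput, and render are hypothetical names, the hash function is passed in as a parameter so the block stays self-contained, and exit code 0 on success is assumed (the contract only states exit 1 on error).

```rust
/// Hypothetical error type; the real parser's error type may differ.
pub struct ParseError {
    pub line: usize,
    pub col: usize,
    pub msg: String,
}

/// Hypothetical container for everything the CLI would emit.
pub struct CliOutput {
    pub stdout: Vec<u8>,
    pub stderr: String,
    pub exit_code: i32,
}

/// Maps the result of serialize(parse(...)) to the CLI contract.
/// `sha256_hex` is injected; it is expected to return lowercase hex.
pub fn render(
    result: Result<Vec<u8>, ParseError>,
    sha256_hex: fn(&[u8]) -> String,
) -> CliOutput {
    match result {
        Ok(bytes) => {
            // Success: raw bytes on stdout, hex digest on stderr,
            // neither followed by a newline.
            let digest = sha256_hex(&bytes);
            CliOutput { stdout: bytes, stderr: digest, exit_code: 0 }
        }
        Err(e) => CliOutput {
            // Failure: stdout stays empty, stderr gets one
            // newline-terminated message, exit code is 1.
            stdout: Vec::new(),
            stderr: format!(
                "parse error at line {} col {}: {}\n",
                e.line, e.col, e.msg
            ),
            exit_code: 1,
        },
    }
}
```

Keeping this mapping free of I/O means the same assertions can be mirrored in Go and C++ without spawning a process; the binary then only has to write the two buffers and exit.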
Tasks
- Implement serialize per the wire format in CONCEPTS.md. Magic header b"DSESQL01", then u32 LE count, then per-statement records with u8 kind tags. Numbers are unsigned little-endian unless noted; INT literals are i64 LE; strings are u32 LE length + raw UTF-8 bytes.
- Inline a SHA-256 implementation (Rust sha256 + sha256_hex; C++ sha256_hex). In Go, use crypto/sha256 for brevity (stdlib is allowed; the digest is determined by the standard, so cross-language identity is preserved).
- Build sqlctl in Rust (src/rust/src/bin/sqlctl.rs), Go (src/go/cmd/sqlctl/main.go), and C++ (src/cpp/src/sqlctl.cc).
- Freeze the two fixtures scripts/fixtures/a_basic.sql and scripts/fixtures/b_full.sql, which exercise every statement kind, both literal types, the '' escape, and every comparison operator. Compute their sha256 once from the Rust reference; freeze the values in:
  - scripts/cross_test.sh (as want_hash cases)
  - src/go/sql_test.go (TestFixtureAHash, TestFixtureBHash)
  - src/cpp/tests/test_sqlfront12.cc (test_fixture_a_hash, test_fixture_b_hash)
  - CONCEPTS.md (frozen-hash table)
- Write scripts/verify.sh: builds and unit-tests the three languages; prints === OK === on success.
- Write scripts/cross_test.sh:
  - Build the three sqlctl binaries.
  - For each fixture, run sqlctl parse --file FIX for all three; assert all three stderr hashes match each other and match the frozen value; assert the CLI hash equals shasum -a 256 of stdout; assert the bytes are bit-identical (cmp -s).
  - Inline-arg smoke test: sqlctl parse --inline 'SELECT * FROM t;' must match across languages.
  - Error-path smoke test: feed SELECT FROM t; to all three; each must exit non-zero with a stderr line that mentions the column.
  - Print === ALL OK === on success.
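The encoding rules in the first task (magic header, u32 LE count, u8 kind tags, i64 LE integers, length-prefixed UTF-8 strings) can be sketched in Rust as below. The per-statement record layouts live in CONCEPTS.md and are not reproduced here, so the Stmt enum, its two variants, and the kind tag values 1 and 2 are hypothetical placeholders chosen only to show the primitives in use.

```rust
pub const MAGIC: &[u8; 8] = b"DSESQL01";

// Primitive writers. Every multi-byte number is little-endian.
fn put_u32(out: &mut Vec<u8>, v: u32) {
    out.extend_from_slice(&v.to_le_bytes());
}

fn put_i64(out: &mut Vec<u8>, v: i64) {
    out.extend_from_slice(&v.to_le_bytes());
}

// Strings: u32 LE byte length, then the raw UTF-8 bytes.
// Assumes the string is shorter than 4 GiB.
fn put_str(out: &mut Vec<u8>, s: &str) {
    put_u32(out, s.len() as u32);
    out.extend_from_slice(s.as_bytes());
}

/// Hypothetical AST: the real statement kinds are defined elsewhere.
pub enum Stmt {
    DropTable { name: String },
    InsertInt { table: String, value: i64 },
}

pub fn serialize(stmts: &[Stmt]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(MAGIC);
    put_u32(&mut out, stmts.len() as u32);
    for s in stmts {
        match s {
            Stmt::DropTable { name } => {
                out.push(1); // hypothetical kind tag
                put_str(&mut out, name);
            }
            Stmt::InsertInt { table, value } => {
                out.push(2); // hypothetical kind tag
                put_str(&mut out, table);
                put_i64(&mut out, *value);
            }
        }
    }
    out
}
```

Because the output depends only on the AST and never on map iteration order, pointers, or host endianness, the Go and C++ ports can reproduce it byte for byte by writing the same fields in the same order.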
Acceptance
$ scripts/verify.sh
=== rust === ... ok
=== go === ... ok
=== cpp === ... ok
=== OK ===
$ scripts/cross_test.sh
=== build ===
=== fixture: a_basic.sql ===
rust=071b40fd... ( 181 B)
go =071b40fd... ( 181 B)
cpp =071b40fd... ( 181 B)
match: 071b40fd5d0c684695c5a8499be6fe970ed4533af16f71dcc4c455091b576d15
=== fixture: b_full.sql ===
rust=e219f1ee... ( 486 B)
...
match: e219f1ee4ae69f194cca7b9791aa2e34ecdb2680956dbf8a94618fa8093aa962
=== inline-arg smoke test ===
inline hash: 941f2125...
=== error-path smoke test ===
[rust] parse error at line 1 col 8: expected identifier
[go] parse error at line 1 col 8: expected identifier
[cpp] parse error at line 1 col 8: expected identifier
=== ALL OK ===
Inline unit tests (mirror across three languages):
- serialize_header_and_count: output starts with "DSESQL01" + the correct u32 LE statement count.
- serialize_is_deterministic: serialize(ast) == serialize(ast) byte-for-byte on a non-trivial AST.
- sha256_known_vectors: sha256("") and sha256("abc") match the FIPS 180-4 reference vectors.
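For the sha256_known_vectors test, an inlined SHA-256 can be sketched in Rust as follows. This is one straightforward way to write the algorithm from FIPS 180-4 and is not necessarily how the reference implementation is structured; only the function names sha256 and sha256_hex come from the task list, and it buffers the whole padded message in memory for simplicity.

```rust
// Round constants from FIPS 180-4, section 4.2.2.
const K: [u32; 64] = [
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
];

pub fn sha256(data: &[u8]) -> [u8; 32] {
    // Initial hash values from FIPS 180-4, section 5.3.3.
    let mut h: [u32; 8] = [
        0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
        0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19,
    ];

    // Padding: 0x80, zeros up to 56 mod 64, then the bit length as u64 BE.
    let mut msg = data.to_vec();
    msg.push(0x80);
    while msg.len() % 64 != 56 {
        msg.push(0);
    }
    msg.extend_from_slice(&(data.len() as u64).wrapping_mul(8).to_be_bytes());

    for block in msg.chunks_exact(64) {
        // Message schedule: 16 big-endian words expanded to 64.
        let mut w = [0u32; 64];
        for i in 0..16 {
            w[i] = u32::from_be_bytes([
                block[4 * i], block[4 * i + 1], block[4 * i + 2], block[4 * i + 3],
            ]);
        }
        for i in 16..64 {
            let s0 = w[i - 15].rotate_right(7) ^ w[i - 15].rotate_right(18) ^ (w[i - 15] >> 3);
            let s1 = w[i - 2].rotate_right(17) ^ w[i - 2].rotate_right(19) ^ (w[i - 2] >> 10);
            w[i] = w[i - 16].wrapping_add(s0).wrapping_add(w[i - 7]).wrapping_add(s1);
        }

        // Compression: 64 rounds over the working variables a..hh.
        let [mut a, mut b, mut c, mut d, mut e, mut f, mut g, mut hh] = h;
        for i in 0..64 {
            let s1 = e.rotate_right(6) ^ e.rotate_right(11) ^ e.rotate_right(25);
            let ch = (e & f) ^ (!e & g);
            let t1 = hh.wrapping_add(s1).wrapping_add(ch).wrapping_add(K[i]).wrapping_add(w[i]);
            let s0 = a.rotate_right(2) ^ a.rotate_right(13) ^ a.rotate_right(22);
            let maj = (a & b) ^ (a & c) ^ (b & c);
            let t2 = s0.wrapping_add(maj);
            hh = g; g = f; f = e; e = d.wrapping_add(t1);
            d = c; c = b; b = a; a = t1.wrapping_add(t2);
        }
        let vars = [a, b, c, d, e, f, g, hh];
        for i in 0..8 {
            h[i] = h[i].wrapping_add(vars[i]);
        }
    }

    // Digest: the eight state words, big-endian.
    let mut out = [0u8; 32];
    for i in 0..8 {
        out[4 * i..4 * i + 4].copy_from_slice(&h[i].to_be_bytes());
    }
    out
}

/// Lowercase hex of the digest, as the CLI contract requires on stderr.
pub fn sha256_hex(data: &[u8]) -> String {
    sha256(data).iter().map(|b| format!("{:02x}", b)).collect()
}
```

The two known vectors cover the empty message and a message shorter than one block, so they exercise the padding path but not a multi-block input; the 486-byte b_full.sql fixture covers that case indirectly through the frozen hash.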
Discussion prompts
- Why is the cross-language sha256 match a near-proof of correctness rather than an actual proof? What kind of bug could match anyway?
- The b_full.sql test is 486 bytes. Why is that more interesting than a 100k-byte randomly generated SQL file with the same hash check?
- If we wanted to add LIMIT N to the SELECT grammar tomorrow, what would the smallest backwards-compatible change to the wire format look like? Why does that question matter the first time we want to evolve the AST?