db-12 — Analysis
We are building a hand-written SQL frontend in three languages and proving agreement byte-for-byte. The hard part is not any single piece — tokenizing, parsing, and emitting bytes are all small, well-understood components — but holding all three implementations to a single set of design decisions tight enough that the output hashes match on every input.
Required invariants
- Deterministic encoding. Given the same input text, the serializer
must produce exactly the same byte sequence on every run, in every
language, on every machine. No iteration over hash-maps, no
environment-dependent integer widths, no locale-sensitive case
conversion. Iteration order of
set/colsis insertion order (which is parse order, which is source order). - Error reporting carries 1-based
(line, col). Tokenizer errors point at the offending character; parser errors point at the offending token. The error string format is identical across languages (across_test.shsmoke test asserts this onSELECT FROM t;). - Identifiers are preserved verbatim. Only keywords are
case-insensitive.
select FROM uSeRsis legal; the table identifieruSeRsis emitted as the bytesu,S,e,R,s— notusers, notUSERS. - String literals use SQL escape: doubled single-quote = one
single-quote.
'pad''let'is the 7-byte stringpad'let. No backslash escapes; noE'...'C-style escapes. The serializer emits the unescaped string contents. - All five statement kinds round-trip identically. No statement is "almost canonical" — every parsable input produces a byte-identical serialization to the same input parsed by the other two languages.
- Cross-language byte identity is the only acceptable proof. Equal AST shapes "by inspection" don't count; equal sha256 over the serialized bytes does.
Design decisions
Why a u8 tag in front of every variant
The wire format is a tagged union. Every statement, every literal, every WHERE-or-no-WHERE choice starts with a single byte that tells you what follows. The alternatives all fail:
- Implicit type from position: requires a schema, which the frontend has no access to.
- Self-describing JSON-like format: kills byte identity (key ordering, whitespace, escape choices).
- Protobuf-style varints: introduces "unknown field" / "default value" ambiguities. Two encoders that agree on the schema can still disagree on the bytes.
A fixed u8 tag with a tight numeric assignment (Create=1, Insert=2, Select=3, Delete=4, Update=5) plus length-prefixed strings gives us
trivially-determinizable bytes.
Why INT is i64 LE, not varint
i64 LE is the simplest thing that works in all three languages without
a helper library. Varint would save a few bytes on small literals but
costs a non-trivial encoder/decoder that we'd have to keep in lockstep
across Rust/Go/C++.
Why operators get a single byte, not a string
Same reason: a fixed numeric assignment (Eq=1..Ge=6) makes the byte
layout exact and language-agnostic. If we wrote "=", then someone in
some language would eventually decide to emit "==" and the hashes
would drift on the day the lab grew expressions.
Why we keep the MAGIC header
"DSESQL01" is 8 bytes of self-description. It costs nothing, lets a
hypothetical decoder detect "this isn't a db-12 AST blob" before
mis-parsing, and pins the wire format version inside the bytes themselves
(01). If the format ever changes incompatibly, we bump to DSESQL02
and update the frozen hashes.
Why we don't compute the AST length up front
A length prefix on each statement would force a two-pass serialize (size then write), or backpatching. We get away without it because the wire format is fully self-describing left-to-right; a decoder needs no random access. Keeping the encoder one-pass keeps all three implementations short and obviously equivalent.
Why the C++ build is self-contained
db-12 has no upstream lab dependencies. The C++ CMakeLists.txt does
not add_subdirectory(../db-NN). That keeps the lab's ctest output
clean (only one test target: test_sqlfront12) and avoids the trap from
db-09 where leaking upstream add_test calls polluted local runs. Each
lab's CMake should ask itself: do I genuinely need upstream code in this
binary? For db-12, the answer is no.
Why deferring execution is the right call
A VM that walks the AST is the natural next step, but it needs a storage backend (db-10/11 pager or db-05/06 LSM), a notion of column types and rows, and ideally a transaction layer. Bolting any of that into db-12 would either bind it to a specific storage shape too early or build a toy in-memory engine we'd throw away in db-13. Stopping at AST bytes keeps the lab small, scope-clean, and shippable.
Why three languages
The same reason as every lab from db-01 onward: the only honest way to prove that two implementations of a binary protocol agree is to compute sha256 of their output and compare. With three independent implementations all matching the same frozen reference hash, the probability that a bug in one of them produces a matching sha256 is vanishingly small. A matching hash on a non-trivial fixture is therefore a near-proof of correctness for the entire tokenize → parse → encode pipeline.