db-12 — Analysis

We are building a hand-written SQL frontend in three languages and proving agreement byte-for-byte. The hard part is not any single piece — tokenizing, parsing, and emitting bytes are all small, well-understood components — but holding all three implementations to a single set of design decisions tight enough that the output hashes match on every input.

Required invariants

  1. Deterministic encoding. Given the same input text, the serializer must produce exactly the same byte sequence on every run, in every language, on every machine. No iteration over hash-maps, no environment-dependent integer widths, no locale-sensitive case conversion. Iteration order of set / cols is insertion order (which is parse order, which is source order).
  2. Error reporting carries 1-based (line, col). Tokenizer errors point at the offending character; parser errors point at the offending token. The error string format is identical across languages (a cross_test.sh smoke test asserts this on SELECT FROM t;).
  3. Identifiers are preserved verbatim. Only keywords are case-insensitive. select FROM uSeRs is legal; the table identifier uSeRs is emitted as the bytes u, S, e, R, s — not users, not USERS.
  4. String literals use SQL escape: doubled single-quote = one single-quote. 'pad''let' is the 7-byte string pad'let. No backslash escapes; no E'...' C-style escapes. The serializer emits the unescaped string contents.
  5. All five statement kinds round-trip identically. No statement is "almost canonical" — every parsable input produces a byte-identical serialization to the same input parsed by the other two languages.
  6. Cross-language byte identity is the only acceptable proof. Equal AST shapes "by inspection" don't count; equal sha256 over the serialized bytes does.

Design decisions

Why a u8 tag in front of every variant

The wire format is a tagged union. Every statement, every literal, every WHERE-or-no-WHERE choice starts with a single byte that tells you what follows. The alternatives all fail:

  • Implicit type from position: requires a schema, which the frontend has no access to.
  • Self-describing JSON-like format: kills byte identity (key ordering, whitespace, escape choices).
  • Protobuf-style varints: introduces "unknown field" / "default value" ambiguities. Two encoders that agree on the schema can still disagree on the bytes.

A fixed u8 tag with a tight numeric assignment (Create=1, Insert=2, Select=3, Delete=4, Update=5) plus length-prefixed strings gives us trivially-determinizable bytes.

Why INT is i64 LE, not varint

i64 LE is the simplest thing that works in all three languages without a helper library. Varint would save a few bytes on small literals but costs a non-trivial encoder/decoder that we'd have to keep in lockstep across Rust/Go/C++.

Why operators get a single byte, not a string

Same reason: a fixed numeric assignment (Eq=1..Ge=6) makes the byte layout exact and language-agnostic. If we wrote "=", then someone in some language would eventually decide to emit "==" and the hashes would drift on the day the lab grew expressions.

Why we keep the MAGIC header

"DSESQL01" is 8 bytes of self-description. It costs nothing, lets a hypothetical decoder detect "this isn't a db-12 AST blob" before mis-parsing, and pins the wire format version inside the bytes themselves (01). If the format ever changes incompatibly, we bump to DSESQL02 and update the frozen hashes.

Why we don't compute the AST length up front

A length prefix on each statement would force a two-pass serialize (size then write), or backpatching. We get away without it because the wire format is fully self-describing left-to-right; a decoder needs no random access. Keeping the encoder one-pass keeps all three implementations short and obviously equivalent.

Why the C++ build is self-contained

db-12 has no upstream lab dependencies. The C++ CMakeLists.txt does not add_subdirectory(../db-NN). That keeps the lab's ctest output clean (only one test target: test_sqlfront12) and avoids the trap from db-09 where leaking upstream add_test calls polluted local runs. Each lab's CMake should ask itself: do I genuinely need upstream code in this binary? For db-12, the answer is no.

Why deferring execution is the right call

A VM that walks the AST is the natural next step, but it needs a storage backend (db-10/11 pager or db-05/06 LSM), a notion of column types and rows, and ideally a transaction layer. Bolting any of that into db-12 would either bind it to a specific storage shape too early or build a toy in-memory engine we'd throw away in db-13. Stopping at AST bytes keeps the lab small, scope-clean, and shippable.

Why three languages

The same reason as every lab from db-01 onward: the only honest way to prove that two implementations of a binary protocol agree is to compute sha256 of their output and compare. With three independent implementations all matching the same frozen reference hash, the probability that a bug in one of them produces a matching sha256 is vanishingly small. A matching hash on a non-trivial fixture is therefore a near-proof of correctness for the entire tokenize → parse → encode pipeline.