db-12 — SQL Frontend
What is it?
A self-contained SQL frontend: a tokenizer + recursive-descent parser that turns a small but realistic SQL dialect into an Abstract Syntax Tree, plus a canonical byte serializer that hashes deterministically. There is no execution engine in this lab — the AST stops at bytes-on-disk and bytes-on- the-wire.
The supported dialect is the smallest one that's still interesting:
CREATE TABLE name (col TYPE, …);withINTandTEXTcolumns.INSERT INTO name VALUES (…), (…), …;with single-row and multi-row form.SELECT * | col, col, … FROM name [WHERE col OP literal];DELETE FROM name [WHERE col OP literal];UPDATE name SET col = lit, col = lit, … [WHERE col OP literal];
with six comparison operators (=, !=, <, <=, >, >=), integer
and text literals ('pad''let' style escape), -- line comments, and
case-insensitive keywords. Identifiers are preserved verbatim.
Execution — the bytecode VM that walks the AST to actually run the statement — is deliberately deferred to db-13 (where it can share the transaction machinery it really needs). This lab stops at "the program parsed to this exact AST, and we can prove it byte-for-byte across three languages".
Why does it matter?
This is the lab where the project pivots from storage to language. Every SQL database in the world starts with the same three-stage front:
source text ──► tokens ──► AST ──► (planner / VM / executor)
What we are doing here is exactly steps one and two, plus a fourth step — serialize the AST to a canonical byte stream — that no production engine needs but the project needs as the only honest cross-language proof that three independent parsers agree on the meaning of the same SQL text.
If you've ever read SQLite's tokenize.c or Postgres's scan.l /
gram.y, this lab is the same shape, written by hand:
- Tokenizing is a single character-by-character pass with a handful of state branches (whitespace, comment, identifier, number, string, operator, single-char punct).
- Parsing is recursive descent: one function per non-terminal, peek at the next token, dispatch, recurse. No parser generators, no table-driven state machines, no lookahead arithmetic — the grammar is tiny enough that the code is almost a 1:1 transliteration of the BNF.
- The AST is a discriminated union (Rust
enum, Go field-bag struct, C++structwith a kind tag). Statements know their kind; literals know their type.
Once you've built one frontend by hand, every other one becomes a reading exercise.
How does it work?
┌──────── source text (UTF-8 bytes) ────────┐
│ │
tokenize │ char loop: ws | -- comment | ident | │
─────────►│ number | string | op | punct │
│ → Vec<Token { kind, payload, line, col }>│
│ │
parse │ recursive descent: │
─────────►│ parse_program = stmt* EOF │
│ parse_stmt dispatches on kw │
│ parse_create/insert/select/delete/update│
│ → Vec<Statement> │
│ │
serialize│ walk AST, emit canonical bytes (see │
─────────►│ "wire format" below). Magic header lets │
│ decoders sanity-check. │
│ → Vec<u8> │
│ │
sha256 │ inline FIPS 180-4 (Rust + C++); │
─────────►│ crypto/sha256 (Go). Output hex matches │
│ in all three languages on any input. │
└────────────────────────────────────────────┘
Wire format
Magic header "DSESQL01" (8 ASCII bytes), then u32 LE statement count,
then that many statement records:
Statement record:
u8 kind 1=Create, 2=Insert, 3=Select, 4=Delete, 5=Update
u32 LE name_len
name bytes
if Create:
u32 LE col_count
repeat col_count: { u32 LE name_len, name bytes, u8 type (1=Int|2=Text) }
if Insert:
u32 LE row_count
repeat row_count:
u32 LE col_count
repeat col_count: literal
if Select:
u8 cols_kind 1 = *, 0 = named
if named: u32 LE n, repeat n: { u32 LE name_len, name bytes }
where
if Delete:
where
if Update:
u32 LE set_count
repeat set_count: { u32 LE name_len, name bytes, literal }
where
literal:
u8 tag 1 = Int, 2 = Text
if Int: i64 LE (two's-complement, little-endian)
if Text: u32 LE n, n bytes
where:
u8 has_where 0 = no WHERE, 1 = WHERE
if 1:
u32 LE col_name_len, col_name bytes
u8 op 1=Eq, 2=Ne, 3=Lt, 4=Le, 5=Gt, 6=Ge
literal
Every integer is unsigned-little-endian unless noted. Strings carry their own length prefix (no null-terminators, no escapes — the bytes are exactly what the parser saw between the unescaped quotes).
Error format
Every error message is one line of the form:
parse error at line L col C: <message>
Lines and columns are 1-based. tokenize errors report the position of
the bad character; parse errors report the position of the offending
token (or one past the last token's column if the input ended early).
What's intentionally out of scope
- Execution. No VM, no query plan, no I/O. db-13.
JOIN,GROUP BY,ORDER BY,LIMIT, expressions. Single-table predicates only. Future labs.- Schema validation. A
SELECT name FROM treferencing an undefined column parses cheerfully; that's the planner's job, not the parser's. - Identifier case folding. SQLite folds
Usersanduserstogether; Postgres folds them to lowercase. We do neither — identifiers are preserved verbatim, only keywords are case-insensitive. This makes the byte-identity test sharper. - Quoted identifiers (
"foo"), backticks, square brackets. One identifier syntax keeps the tokenizer trivial. - Negative literals as expressions. A leading
-before an integer literal in a value position is parsed as a sign on that literal; it is not a unary-minus operator. There is no general expression grammar.
Cross-language invariant
All three implementations expose sqlctl parse --file FOO.sql (or
--inline "..."). Stdout receives the canonical bytes; stderr receives
the sha256 hex (no trailing newline).
scripts/cross_test.sh runs both fixtures through all three binaries and
asserts:
- All three stderr-emitted sha256s match.
- The matching hash equals the frozen-in-tests value (so the wire format cannot silently drift even if all three implementations drift together).
- The bytes themselves are bit-identical (
cmp -s) — guarding against a hypothetical sha256 collision. - The error path also matches: feeding
"SELECT FROM t;"to all three binaries must produce a non-zero exit and an error line that mentions the column.
The frozen reference hashes are:
| Fixture | Bytes | sha256 |
|---|---|---|
a_basic.sql | 181 | 071b40fd5d0c684695c5a8499be6fe970ed4533af16f71dcc4c455091b576d15 |
b_full.sql | 486 | e219f1ee4ae69f194cca7b9791aa2e34ecdb2680956dbf8a94618fa8093aa962 |
Any change to the AST shape, tokenizer behavior, or wire format must
update those numbers in scripts/cross_test.sh, the Go test
(sql_test.go), the C++ test (tests/test_sqlfront12.cc), and this
table — all in the same commit.