db-12 — SQL Frontend

What is it?

A self-contained SQL frontend: a tokenizer + recursive-descent parser that turns a small but realistic SQL dialect into an Abstract Syntax Tree, plus a canonical byte serializer that hashes deterministically. There is no execution engine in this lab — the AST stops at bytes-on-disk and bytes-on- the-wire.

The supported dialect is the smallest one that's still interesting:

  • CREATE TABLE name (col TYPE, …); with INT and TEXT columns.
  • INSERT INTO name VALUES (…), (…), …; with single-row and multi-row form.
  • SELECT * | col, col, … FROM name [WHERE col OP literal];
  • DELETE FROM name [WHERE col OP literal];
  • UPDATE name SET col = lit, col = lit, … [WHERE col OP literal];

with six comparison operators (=, !=, <, <=, >, >=), integer and text literals ('pad''let' style escape), -- line comments, and case-insensitive keywords. Identifiers are preserved verbatim.

Execution — the bytecode VM that walks the AST to actually run the statement — is deliberately deferred to db-13 (where it can share the transaction machinery it really needs). This lab stops at "the program parsed to this exact AST, and we can prove it byte-for-byte across three languages".

Why does it matter?

This is the lab where the project pivots from storage to language. Every SQL database in the world starts with the same three-stage front:

source text ──► tokens ──► AST ──► (planner / VM / executor)

What we are doing here is exactly steps one and two, plus a fourth step — serialize the AST to a canonical byte stream — that no production engine needs but the project needs as the only honest cross-language proof that three independent parsers agree on the meaning of the same SQL text.

If you've ever read SQLite's tokenize.c or Postgres's scan.l / gram.y, this lab is the same shape, written by hand:

  • Tokenizing is a single character-by-character pass with a handful of state branches (whitespace, comment, identifier, number, string, operator, single-char punct).
  • Parsing is recursive descent: one function per non-terminal, peek at the next token, dispatch, recurse. No parser generators, no table-driven state machines, no lookahead arithmetic — the grammar is tiny enough that the code is almost a 1:1 transliteration of the BNF.
  • The AST is a discriminated union (Rust enum, Go field-bag struct, C++ struct with a kind tag). Statements know their kind; literals know their type.

Once you've built one frontend by hand, every other one becomes a reading exercise.

How does it work?

            ┌──────── source text (UTF-8 bytes) ────────┐
            │                                            │
   tokenize │  char loop: ws | -- comment | ident |     │
   ─────────►│  number | string | op | punct             │
            │  → Vec<Token { kind, payload, line, col }>│
            │                                            │
   parse    │  recursive descent:                        │
   ─────────►│    parse_program        = stmt* EOF       │
            │    parse_stmt            dispatches on kw  │
            │    parse_create/insert/select/delete/update│
            │  → Vec<Statement>                          │
            │                                            │
   serialize│  walk AST, emit canonical bytes (see       │
   ─────────►│  "wire format" below). Magic header lets  │
            │  decoders sanity-check.                    │
            │  → Vec<u8>                                 │
            │                                            │
   sha256   │  inline FIPS 180-4 (Rust + C++);           │
   ─────────►│  crypto/sha256 (Go). Output hex matches   │
            │  in all three languages on any input.      │
            └────────────────────────────────────────────┘

Wire format

Magic header "DSESQL01" (8 ASCII bytes), then u32 LE statement count, then that many statement records:

Statement record:
  u8 kind           1=Create, 2=Insert, 3=Select, 4=Delete, 5=Update
  u32 LE name_len
  name bytes

  if Create:
    u32 LE col_count
    repeat col_count: { u32 LE name_len, name bytes, u8 type (1=Int|2=Text) }

  if Insert:
    u32 LE row_count
    repeat row_count:
      u32 LE col_count
      repeat col_count: literal

  if Select:
    u8 cols_kind     1 = *, 0 = named
    if named: u32 LE n, repeat n: { u32 LE name_len, name bytes }
    where

  if Delete:
    where

  if Update:
    u32 LE set_count
    repeat set_count: { u32 LE name_len, name bytes, literal }
    where

literal:
  u8 tag            1 = Int, 2 = Text
  if Int:  i64 LE (two's-complement, little-endian)
  if Text: u32 LE n, n bytes

where:
  u8 has_where      0 = no WHERE, 1 = WHERE
  if 1:
    u32 LE col_name_len, col_name bytes
    u8 op           1=Eq, 2=Ne, 3=Lt, 4=Le, 5=Gt, 6=Ge
    literal

Every integer is unsigned-little-endian unless noted. Strings carry their own length prefix (no null-terminators, no escapes — the bytes are exactly what the parser saw between the unescaped quotes).

Error format

Every error message is one line of the form:

parse error at line L col C: <message>

Lines and columns are 1-based. tokenize errors report the position of the bad character; parse errors report the position of the offending token (or one past the last token's column if the input ended early).

What's intentionally out of scope

  • Execution. No VM, no query plan, no I/O. db-13.
  • JOIN, GROUP BY, ORDER BY, LIMIT, expressions. Single-table predicates only. Future labs.
  • Schema validation. A SELECT name FROM t referencing an undefined column parses cheerfully; that's the planner's job, not the parser's.
  • Identifier case folding. SQLite folds Users and users together; Postgres folds them to lowercase. We do neither — identifiers are preserved verbatim, only keywords are case-insensitive. This makes the byte-identity test sharper.
  • Quoted identifiers ("foo"), backticks, square brackets. One identifier syntax keeps the tokenizer trivial.
  • Negative literals as expressions. A leading - before an integer literal in a value position is parsed as a sign on that literal; it is not a unary-minus operator. There is no general expression grammar.

Cross-language invariant

All three implementations expose sqlctl parse --file FOO.sql (or --inline "..."). Stdout receives the canonical bytes; stderr receives the sha256 hex (no trailing newline). scripts/cross_test.sh runs both fixtures through all three binaries and asserts:

  • All three stderr-emitted sha256s match.
  • The matching hash equals the frozen-in-tests value (so the wire format cannot silently drift even if all three implementations drift together).
  • The bytes themselves are bit-identical (cmp -s) — guarding against a hypothetical sha256 collision.
  • The error path also matches: feeding "SELECT FROM t;" to all three binaries must produce a non-zero exit and an error line that mentions the column.

The frozen reference hashes are:

FixtureBytessha256
a_basic.sql181071b40fd5d0c684695c5a8499be6fe970ed4533af16f71dcc4c455091b576d15
b_full.sql486e219f1ee4ae69f194cca7b9791aa2e34ecdb2680956dbf8a94618fa8093aa962

Any change to the AST shape, tokenizer behavior, or wire format must update those numbers in scripts/cross_test.sh, the Go test (sql_test.go), the C++ test (tests/test_sqlfront12.cc), and this table — all in the same commit.