db-12 — Broader Ideas

Where to take this frontend next, and how the patterns generalize.

Immediate next labs

  • db-13 — Execution + transactions + MVCC. The bytecode VM that walks the AST we built here is the natural next step. db-13 needs:

    • A storage backend (we'll plug in the db-11 pager for B-tree-backed tables, or the db-09 LSM for log-structured tables).
    • A row representation (compact bytes per row).
    • A type checker that turns "the AST referenced column name" into "column index 1 of type TEXT".
    • A planner — even a trivial one — to convert SELECT … WHERE … into a scan-with-filter or an index lookup.
    • A transaction layer. Each of those is a small project. The reason db-12 stops where it does is so db-13 can spend its budget on those, not on re-parsing.
  • db-14 — Indexes + query optimization. Once we have a planner, an obvious next move is to add secondary indexes and a cost-based picker between scan and index-lookup. The AST shape from db-12 is rich enough to drive that without modification.

How this lab's patterns show up in real systems

  • Recursive descent is what you actually read in most production database front-ends, even when the surface uses a parser generator. SQLite's parse.y (Lemon) generates a parser whose state machine looks nothing like recursive descent, but the hand-rolled tokenize.c and the hand-written expr.c (operator-precedence parser for SQL expressions) are exactly the style we used. Postgres is similar: gram.y is bison, but analyze.c and the planner do recursive walks over the AST that look just like our serializer.

  • AST → bytes is a primitive that quietly underlies a lot of database engineering:

    • Query caching: Postgres's prepared-statement caching keys on a canonical AST representation.
    • Plan-hash matching: Oracle uses an AST/plan hash to decide "this query is the same as one I've seen before, reuse the plan."
    • Audit logs: serialize the AST instead of the raw SQL text so you can normalize whitespace, comments, and identifier case for diff-friendly storage.
    • Cross-version compatibility tests: serialize an AST in version N, deserialize in version N+1, and assert nothing changed — exactly the byte-identity discipline we use here, except across time instead of across languages.
  • Cross-language byte identity is rare in industry (most teams ship in one language) but the same discipline appears in:

    • Compiler bootstrapping: GCC and rustc both rebuild themselves and require bit-identical second-stage output.
    • Deterministic builds: Bazel/Nix/Reproducible Builds project all rely on the same "bytes out are a pure function of bytes in" property we exercise here.
    • Cryptographic protocol implementations: TLS test vectors, canonical CBOR (RFC 8949 §4.2), Ed25519 deterministic signatures.

Performance experiments worth running later

These don't affect lab status (which is green), but they're good Saturday-afternoon explorations:

  • Replace the per-token String allocation in Rust with a slice into the source buffer (zero-copy tokens). Measure how much that buys on a 1 MB SQL script.
  • Profile the C++ serializer on a 100k-statement input. The hand-written push_back loop is probably memory-bandwidth-bound; a single reserve(estimate) up front should help.
  • Generate a 1k-fixture random-SQL fuzz corpus, parse it in all three languages, and assert sha256 match across languages on every input. This catches drift the two hand-written fixtures don't cover.
  • Pratt-parse expressions: add col + col, col * literal, etc., to the WHERE grammar using Pratt's top-down operator precedence. The AST gets a recursive Expr node; the serializer gets one more branch.

What "production-ready" would require beyond this lab

  • Lookahead beyond LL(1) in a handful of places (e.g., INSERT INTO t (col, col) VALUES ... vs INSERT INTO t VALUES ...).
  • A real expression grammar (Pratt or precedence-climbing).
  • JOIN, subqueries, CTEs, window functions, ORDER BY, GROUP BY, HAVING, LIMIT/OFFSET.
  • Quoted identifiers ("foo bar") and the associated escape semantics.
  • A separate semantic-analysis pass between parse and execute (name resolution, type checking, ambiguous-column detection).
  • Error recovery in the parser: real SQL frontends report multiple errors per parse rather than bailing on the first one.
  • Internationalized identifiers (Unicode identifier class, NFC normalization).
  • Concurrent parsing for prepared-statement caches (lock-free hash lookup, AST interning).

None of these change the shape of the front-end — they make the same shape bigger.