db-12 — Broader Ideas
Where to take this frontend next, and how the patterns generalize.
Immediate next labs
-
db-13 — Execution + transactions + MVCC. The bytecode VM that walks the AST we built here is the natural next step. db-13 needs:
- A storage backend (we'll plug in the db-11 pager for B-tree-backed tables, or the db-09 LSM for log-structured tables).
- A row representation (compact bytes per row).
- A type checker that turns "the AST referenced column
name" into "column index 1 of type TEXT". - A planner — even a trivial one — to convert
SELECT … WHERE …into a scan-with-filter or an index lookup. - A transaction layer. Each of those is a small project. The reason db-12 stops where it does is so db-13 can spend its budget on those, not on re-parsing.
-
db-14 — Indexes + query optimization. Once we have a planner, an obvious next move is to add secondary indexes and a cost-based picker between scan and index-lookup. The AST shape from db-12 is rich enough to drive that without modification.
How this lab's patterns show up in real systems
-
Recursive descent is what you actually read in most production database front-ends, even when the surface uses a parser generator. SQLite's
parse.y(Lemon) generates a parser whose state machine looks nothing like recursive descent, but the hand-rolledtokenize.cand the hand-writtenexpr.c(operator-precedence parser for SQL expressions) are exactly the style we used. Postgres is similar:gram.yis bison, butanalyze.cand the planner do recursive walks over the AST that look just like our serializer. -
AST → bytes is a primitive that quietly underlies a lot of database engineering:
- Query caching: Postgres's prepared-statement caching keys on a canonical AST representation.
- Plan-hash matching: Oracle uses an AST/plan hash to decide "this query is the same as one I've seen before, reuse the plan."
- Audit logs: serialize the AST instead of the raw SQL text so you can normalize whitespace, comments, and identifier case for diff-friendly storage.
- Cross-version compatibility tests: serialize an AST in version N, deserialize in version N+1, and assert nothing changed — exactly the byte-identity discipline we use here, except across time instead of across languages.
-
Cross-language byte identity is rare in industry (most teams ship in one language) but the same discipline appears in:
- Compiler bootstrapping: GCC and rustc both rebuild themselves and require bit-identical second-stage output.
- Deterministic builds: Bazel/Nix/Reproducible Builds project all rely on the same "bytes out are a pure function of bytes in" property we exercise here.
- Cryptographic protocol implementations: TLS test vectors, canonical CBOR (RFC 8949 §4.2), Ed25519 deterministic signatures.
Performance experiments worth running later
These don't affect lab status (which is green), but they're good Saturday-afternoon explorations:
- Replace the per-token
Stringallocation in Rust with a slice into the source buffer (zero-copy tokens). Measure how much that buys on a 1 MB SQL script. - Profile the C++ serializer on a 100k-statement input. The
hand-written
push_backloop is probably memory-bandwidth-bound; a singlereserve(estimate)up front should help. - Generate a 1k-fixture random-SQL fuzz corpus, parse it in all three languages, and assert sha256 match across languages on every input. This catches drift the two hand-written fixtures don't cover.
- Pratt-parse expressions: add
col + col,col * literal, etc., to the WHERE grammar using Pratt's top-down operator precedence. The AST gets a recursiveExprnode; the serializer gets one more branch.
What "production-ready" would require beyond this lab
- Lookahead beyond
LL(1)in a handful of places (e.g.,INSERT INTO t (col, col) VALUES ...vsINSERT INTO t VALUES ...). - A real expression grammar (Pratt or precedence-climbing).
JOIN, subqueries, CTEs, window functions,ORDER BY,GROUP BY,HAVING,LIMIT/OFFSET.- Quoted identifiers (
"foo bar") and the associated escape semantics. - A separate semantic-analysis pass between parse and execute (name resolution, type checking, ambiguous-column detection).
- Error recovery in the parser: real SQL frontends report multiple errors per parse rather than bailing on the first one.
- Internationalized identifiers (Unicode identifier class, NFC normalization).
- Concurrent parsing for prepared-statement caches (lock-free hash lookup, AST interning).
None of these change the shape of the front-end — they make the same shape bigger.