db-12 — References
Primary sources
- Crafting Interpreters, Robert Nystrom — chapters 4 ("Scanning") and 6 ("Parsing Expressions") map almost 1:1 onto what we built. The hand- rolled recursive-descent style and the "one function per non-terminal" discipline are taken straight from this book. https://craftinginterpreters.com/
- Modern Compiler Implementation in ML (or in C / Java), Andrew Appel —
chapter 3 ("Parsing"). The clearest exposition of why recursive descent
works for
LL(1)grammars and what changes when you need lookahead or precedence climbing. - The Dragon Book (Aho, Lam, Sethi, Ullman) — chapters 3 and 4. The textbook source for lexical analysis (DFA construction, regular expressions to scanners) and predictive parsing. Read for theory; the practice is in Crafting Interpreters.
How real databases parse SQL
- SQLite —
src/tokenize.c(hand-rolled DFA, conceptually the same as ours but written in C with a generated keyword-lookup function) andsrc/parse.y(a Lemon-parser grammar, which generates a table-driven bottom-up parser rather than our recursive-descent). https://github.com/sqlite/sqlite/blob/master/src/tokenize.c https://www.sqlite.org/lemon.html - PostgreSQL —
src/backend/parser/scan.l(flex) andsrc/backend/parser/gram.y(bison). A generator-based front; the grammar file alone is ~17k lines. Worth opening just to see the scale of the dialect they support. https://github.com/postgres/postgres/tree/master/src/backend/parser - DuckDB — uses
libpg_query, which is a stripped-down Postgres parser exposed as a library. A useful pattern when you want Postgres-compatible SQL without the rest of Postgres. https://github.com/duckdb/duckdb/tree/main/third_party/libpg_query - CockroachDB —
pkg/sql/parser/sql.y(goyacc). Like DuckDB above, another example of "real database, generated parser".
Recursive descent specifically
- Rob Pike, Lexical Scanning in Go, GopherCon 2011 — the talk that
popularized the "scanner emits tokens to a channel; parser reads from
the channel" style. We don't use channels (we just return a
Vec), but the state-machine framing of the scanner loop is the same. https://www.youtube.com/watch?v=HxaD_trXwRE - Doug Crockford, Top Down Operator Precedence, 2007 — the cleanest explanation of Pratt parsing, which is what you reach for next once recursive descent runs into expression-precedence pain. We don't need it here (no expression grammar) but it's the natural follow-up. https://crockford.com/javascript/tdop/tdop.html
Determinism / wire formats
- Federal Information Processing Standards Publication 180-4, NIST, 2015 — the SHA-256 specification we implemented inline in Rust and C++. https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf
- Protobuf encoding docs — useful counterexample for what not to do if you want byte-identity. Protobuf's "unknown field" handling and optional canonicalization are exactly the corners that prevent stable hashes across implementations. Our format avoids those corners on purpose. https://protobuf.dev/programming-guides/encoding/
Cross-lab dependencies
None. db-12 is intentionally self-contained: there is nothing in earlier
labs (storage, WAL, LSM, B-tree) that the parser needs, and the AST
serializer uses no upstream wire format. The C++ build does not
add_subdirectory(../db-NN). That isolation is a feature, not an
oversight — it keeps the lab small enough to be reasoned about as a
self-contained compiler-front exercise.
db-12 feeds db-13 (execution + transactions), where the AST will finally be walked by a VM.