db-12 step 02 — Parser and AST
Goal
Implement parse(src) -> Result<Vec<Statement>, ParseError> that
consumes the token stream from step 01 and produces a typed AST.
Parser errors carry 1-based (line, col) from the offending token.
Tasks
- Define the AST:
ColType { Int, Text }.Literal { Int(i64), Text(String) }with explicit tag bytesInt=1,Text=2(matters for serialization).Op { Eq=1, Ne=2, Lt=3, Le=4, Gt=5, Ge=6 }—#[repr(u8)]in Rust,OpEq=1..OpGe=6constants in Go,enum class Op : uint8_tin C++.Where { col: String, op: Op, lit: Literal }— present-or-absent viaOption/pointer/has_whereflag.SelectCols { Star, Named(Vec<String>) }.Statementenum with five variants:Create { name, cols: Vec<(name, ColType)> },Insert { name, rows: Vec<Vec<Literal>> },Select { name, cols: SelectCols, where_: Option<Where> },Delete { name, where_: Option<Where> },Update { name, sets: Vec<(name, Literal)>, where_: Option<Where> }.
- Implement
Parseras{ tokens: &[Token], pos: usize }with one method per non-terminal:parse_program,parse_stmt,parse_create,parse_insert,parse_select,parse_delete,parse_update,parse_where,parse_literal. Each method consumes tokens left-to-right with single-token lookahead viapeek. - On any unexpected token, produce
parse error at line L col C: <message>. Make sure the<message>and(L, C)are stable across the three languages —cross_test.shasserts this. - Preserve insertion order everywhere. SELECT column lists, UPDATE
SET assignments, INSERT row lists, CREATE column lists are all
Vec/slice/std::vector(neverHashMap/map). - A leading
-before an integer literal in value position (RHS ofWHERE col OP -1, INSERT/UPDATE literals) parses as a negative integer literal. It is not a unary-minus operator; there is no expression grammar.
Acceptance
Inline unit tests (Rust names; mirror in Go and C++):
parse_create_table—CREATE TABLEwith one INT column and one TEXT column.parse_insert_multirow— multi-rowINSERT VALUES (..), (..), exercising both literal kinds.parse_select_variants_and_all_ops—SELECT *,SELECT col, col, and each of the 6 comparison operators in WHERE.parse_update_and_delete—UPDATEwith multi-column SET and WHERE;DELETEwith WHERE.parse_multi_with_comments_and_case—--line comments, keywords in mixed case, identifiers preserved verbatim.parse_errors_report_column—SELECT FROM t;reportsparse error at line 1 col 8: expected identifier.
All six green in Rust, Go, and C++.
Discussion prompts
- Recursive descent works because our grammar is
LL(1). What's the single most popular SQL construct that isn'tLL(1)and how would we extend the parser to handle it? - We parse
INSERT INTO t VALUES (1, 'a')but notINSERT INTO t (a, b) VALUES (1, 'x'). Which token's lookahead would tell us we're in the second form, and how would that changeparse_insert? - Why does the negative-literal-in-value-position decision live in the
parser rather than the tokenizer? Hint: what would
WHERE a - bmean if it were a tokenizer rule?