db-12 step 02 — Parser and AST

Goal

Implement parse(src) -> Result<Vec<Statement>, ParseError> that consumes the token stream from step 01 and produces a typed AST. Parser errors carry 1-based (line, col) from the offending token.

Tasks

  1. Define the AST:
    • ColType { Int, Text }.
    • Literal { Int(i64), Text(String) } with explicit tag bytes Int=1, Text=2 (matters for serialization).
    • Op { Eq=1, Ne=2, Lt=3, Le=4, Gt=5, Ge=6 }#[repr(u8)] in Rust, OpEq=1..OpGe=6 constants in Go, enum class Op : uint8_t in C++.
    • Where { col: String, op: Op, lit: Literal } — present-or-absent via Option/pointer/has_where flag.
    • SelectCols { Star, Named(Vec<String>) }.
    • Statement enum with five variants: Create { name, cols: Vec<(name, ColType)> }, Insert { name, rows: Vec<Vec<Literal>> }, Select { name, cols: SelectCols, where_: Option<Where> }, Delete { name, where_: Option<Where> }, Update { name, sets: Vec<(name, Literal)>, where_: Option<Where> }.
  2. Implement Parser as { tokens: &[Token], pos: usize } with one method per non-terminal: parse_program, parse_stmt, parse_create, parse_insert, parse_select, parse_delete, parse_update, parse_where, parse_literal. Each method consumes tokens left-to-right with single-token lookahead via peek.
  3. On any unexpected token, produce parse error at line L col C: <message>. Make sure the <message> and (L, C) are stable across the three languages — cross_test.sh asserts this.
  4. Preserve insertion order everywhere. SELECT column lists, UPDATE SET assignments, INSERT row lists, CREATE column lists are all Vec/slice/std::vector (never HashMap / map).
  5. A leading - before an integer literal in value position (RHS of WHERE col OP -1, INSERT/UPDATE literals) parses as a negative integer literal. It is not a unary-minus operator; there is no expression grammar.

Acceptance

Inline unit tests (Rust names; mirror in Go and C++):

  • parse_create_tableCREATE TABLE with one INT column and one TEXT column.
  • parse_insert_multirow — multi-row INSERT VALUES (..), (..), exercising both literal kinds.
  • parse_select_variants_and_all_opsSELECT *, SELECT col, col, and each of the 6 comparison operators in WHERE.
  • parse_update_and_deleteUPDATE with multi-column SET and WHERE; DELETE with WHERE.
  • parse_multi_with_comments_and_case-- line comments, keywords in mixed case, identifiers preserved verbatim.
  • parse_errors_report_columnSELECT FROM t; reports parse error at line 1 col 8: expected identifier.

All six green in Rust, Go, and C++.

Discussion prompts

  • Recursive descent works because our grammar is LL(1). What's the single most popular SQL construct that isn't LL(1) and how would we extend the parser to handle it?
  • We parse INSERT INTO t VALUES (1, 'a') but not INSERT INTO t (a, b) VALUES (1, 'x'). Which token's lookahead would tell us we're in the second form, and how would that change parse_insert?
  • Why does the negative-literal-in-value-position decision live in the parser rather than the tokenizer? Hint: what would WHERE a - b mean if it were a tokenizer rule?