db-12 step 01 — Tokenizer

Goal

Implement tokenize(src) -> Result<Vec<Token>, ParseError>. Any character that cannot start a valid token must produce an error pointing at its 1-based (line, col), and the legal token kinds must form a stream the parser can consume left to right with no lookahead.

Tasks

  1. Define TokKind covering:
    • Keywords: SELECT, FROM, WHERE, INSERT, INTO, VALUES, CREATE, TABLE, DELETE, UPDATE, SET, AND, INT, TEXT.
    • Identifier (case-preserving).
    • Integer literal, text literal.
    • Punctuation: ,, ;, (, ), *.
    • Operators: =, !=, <, <=, >, >=.
  2. Implement tokenize as a single character-by-character pass over the source bytes:
    • Skip whitespace, tracking the line by counting \n and the column as the 1-based byte offset since the last \n.
    • On --: skip to end of line.
    • On [A-Za-z_]: read an identifier; uppercase-fold it and look it up in the keyword table. If found, emit the keyword token; otherwise emit an identifier token with the verbatim bytes (no case folding).
    • On [0-9]: read an integer literal. The tokenizer never consumes a leading -; an optional sign in value position is handled by the parser.
    • On ': read a string literal; '' inside the literal is a single embedded quote; a missing closing quote is an error reporting the (line, col) of the opening quote.
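    • On <, >, !: peek for = to form <=, >=, !=. Without a following =, emit < or > as a single-char token; a bare ! is not in the operator list, so it falls under the error case.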
    • On =, ,, ;, (, ), *: emit a single-char token.
    • Anything else: error reporting (line, col) of the bad byte.
  3. Every emitted Token carries its (line, col) (the start of the token), so parser errors can blame the right column even when the token is several characters long.
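
The three tasks above can be sketched in Rust roughly as follows. This is a minimal sketch under stated assumptions, not a reference solution: the variant shapes of TokKind, the field names on Token and ParseError, the KEYWORDS table, and the error message strings are all hypothetical. The spec fixes only the token kinds, the scanning rules, and the 1-based (line, col) reporting. The sketch also assumes an integer literal that does not fit in i64 is an error, which the spec does not state.

```rust
// Sketch only. TokKind's variant shapes, the Token/ParseError fields, and
// the error messages are assumptions; the spec fixes only the behavior.
#[derive(Debug, Clone, PartialEq)]
pub enum TokKind {
    Keyword(&'static str), // uppercase spelling taken from KEYWORDS
    Ident(String),         // verbatim bytes, case preserved
    Int(i64),
    Text(String),          // contents with '' collapsed to '
    Punct(char),           // , ; ( ) *
    Op(&'static str),      // = != < <= > >=
}

#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub kind: TokKind,
    pub line: usize, // 1-based line of the token's first byte
    pub col: usize,  // 1-based byte column of the token's first byte
}

#[derive(Debug, Clone, PartialEq)]
pub struct ParseError {
    pub line: usize,
    pub col: usize,
    pub msg: String,
}

const KEYWORDS: [&str; 14] = [
    "SELECT", "FROM", "WHERE", "INSERT", "INTO", "VALUES", "CREATE",
    "TABLE", "DELETE", "UPDATE", "SET", "AND", "INT", "TEXT",
];

pub fn tokenize(src: &str) -> Result<Vec<Token>, ParseError> {
    let b = src.as_bytes();
    let (mut i, mut line, mut line_start) = (0usize, 1usize, 0usize);
    let mut out = Vec::new();
    while i < b.len() {
        // Remember where this token starts before consuming anything.
        let (start, tok_line, col) = (i, line, i - line_start + 1);
        let err = |msg: &str| ParseError { line: tok_line, col, msg: msg.to_string() };
        let c = b[i];
        let kind = match c {
            b'\n' => { i += 1; line += 1; line_start = i; continue; }
            b' ' | b'\t' | b'\r' => { i += 1; continue; }
            // Line comment: stop at the newline so the arm above counts it.
            b'-' if b.get(i + 1) == Some(&b'-') => {
                while i < b.len() && b[i] != b'\n' { i += 1; }
                continue;
            }
            b'A'..=b'Z' | b'a'..=b'z' | b'_' => {
                while i < b.len() && (b[i].is_ascii_alphanumeric() || b[i] == b'_') { i += 1; }
                let word = &src[start..i];
                let upper = word.to_ascii_uppercase();
                match KEYWORDS.iter().find(|k| **k == upper.as_str()) {
                    Some(k) => TokKind::Keyword(*k),
                    None => TokKind::Ident(word.to_string()),
                }
            }
            b'0'..=b'9' => {
                while i < b.len() && b[i].is_ascii_digit() { i += 1; }
                let n = src[start..i].parse::<i64>()
                    .map_err(|_| err("integer literal out of range"))?;
                TokKind::Int(n)
            }
            b'\'' => {
                i += 1;
                loop {
                    match b.get(i) {
                        None => return Err(err("unterminated string literal")),
                        Some(b'\'') if b.get(i + 1) == Some(&b'\'') => i += 2,
                        Some(b'\'') => break,
                        Some(b'\n') => { i += 1; line += 1; line_start = i; }
                        Some(_) => i += 1,
                    }
                }
                i += 1; // consume the closing quote
                TokKind::Text(src[start + 1..i - 1].replace("''", "'"))
            }
            b'<' | b'>' | b'!' | b'=' => {
                i += 1;
                let eq = c != b'=' && b.get(i) == Some(&b'=');
                if eq { i += 1; }
                match (c, eq) {
                    (b'<', true) => TokKind::Op("<="),
                    (b'<', false) => TokKind::Op("<"),
                    (b'>', true) => TokKind::Op(">="),
                    (b'>', false) => TokKind::Op(">"),
                    (b'!', true) => TokKind::Op("!="),
                    (b'=', _) => TokKind::Op("="),
                    _ => return Err(err("expected = after !")),
                }
            }
            b',' | b';' | b'(' | b')' | b'*' => { i += 1; TokKind::Punct(c as char) }
            _ => return Err(err(&format!("unexpected byte 0x{:02x}", c))),
        };
        out.push(Token { kind, line: tok_line, col });
    }
    Ok(out)
}
```

One design choice worth noting in this sketch: the start position is captured once at the top of the loop, so a string literal that spans several lines still reports the line and column of its opening quote, both on the emitted Token and in the unterminated-string error.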

Acceptance

Inline unit tests (Rust names; mirror them in Go and C++):

  • tokenize_happy — a single mixed input exercising every token kind. Assert the resulting Vec<TokKind> matches the expected sequence.
  • tokenize_strings_and_errors — a '' escape lexes to the unescaped contents; an unterminated string returns parse error at line N col M: ... with the correct (N, M).

Both green in Rust, Go, and C++.
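
The error text that tokenize_strings_and_errors checks could be produced by a Display impl along these lines. This is only a sketch: the msg field and the has_position helper are hypothetical names, and the spec leaves the wording after the colon open, so the helper compares only the "parse error at line N col M: " prefix.

```rust
use std::fmt;

// Assumed error shape; only the "parse error at line N col M: " prefix
// comes from the acceptance text. The `msg` field is hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub struct ParseError {
    pub line: usize,
    pub col: usize,
    pub msg: String,
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "parse error at line {} col {}: {}", self.line, self.col, self.msg)
    }
}

impl std::error::Error for ParseError {}

/// Hypothetical test helper: true when the rendered error starts with the
/// expected position prefix, whatever message follows the colon.
pub fn has_position(e: &ParseError, line: usize, col: usize) -> bool {
    let prefix = format!("parse error at line {} col {}: ", line, col);
    e.to_string().starts_with(&prefix)
}
```

Checking only the prefix keeps the Rust, Go, and C++ tests comparable even if the three implementations word the trailing message differently.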

Discussion prompts

  • Why fold keywords but not identifiers? What would change in our fixture hashes if we case-folded identifiers like SQLite does?
  • The tokenizer recognizes 14 keywords. Which keyword would we add first if we wanted to parse LIMIT 10? Why does adding it require a parser change too?
  • We chose to record a (line, col) pair on each token rather than a single character offset. What's the trade-off?