db-12 step 01 — Tokenizer
Goal
Implement tokenize(src) -> Result<Vec<Token>, ParseError> such that any
character that cannot start a valid token produces an error pointing at
its 1-based (line, col), and the legal token kinds form a stream the
parser can consume left-to-right with no lookahead.
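As a rough illustration of the error side of this goal, the sketch below shows one possible shape for ParseError and a helper that maps a byte offset to a 1-based (line, col). The field names (line, col, msg) and the helper position_at are assumptions made for this example, not names fixed by the step; only the message format is taken from the acceptance tests.

```rust
use std::fmt;

/// A parse error carrying a 1-based source position.
/// The field names here are assumptions for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct ParseError {
    pub line: usize,
    pub col: usize,
    pub msg: String,
}

impl fmt::Display for ParseError {
    // Renders in the "parse error at line N col M: ..." shape used by the tests.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "parse error at line {} col {}: {}", self.line, self.col, self.msg)
    }
}

impl std::error::Error for ParseError {}

/// Hypothetical helper: convert a byte offset into a 1-based (line, col).
/// The line is one plus the number of '\n' bytes before the offset; the
/// column is the byte distance from the most recent '\n' (or from the
/// start of the input when there is none).
pub fn position_at(src: &str, offset: usize) -> (usize, usize) {
    let before = &src.as_bytes()[..offset];
    let line = 1 + before.iter().filter(|&&b| b == b'\n').count();
    let col = match before.iter().rposition(|&b| b == b'\n') {
        Some(nl) => offset - nl,
        None => offset + 1,
    };
    (line, col)
}
```

With this shape, the first byte of the input maps to (1, 1), and the byte right after a newline maps to column 1 of the next line.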
Tasks
- Define TokKind covering:
  - Keywords: SELECT, FROM, WHERE, INSERT, INTO, VALUES, CREATE, TABLE, DELETE, UPDATE, SET, AND, INT, TEXT.
  - Identifier (case-preserving).
  - Integer literal, text literal.
  - Punctuation: , ; ( ) *
  - Operators: = != < <= > >=
- Implement tokenize as a single character-by-character pass over the source bytes:
  - Skip whitespace, tracking the line via \n and the column via the byte index since the last \n.
  - On --: skip to end of line.
  - On [A-Za-z_]: read an identifier, uppercase-fold it, and look it up in the keyword table. If found, emit the keyword token; otherwise emit an identifier token with the verbatim bytes (no case folding).
  - On [0-9]: read an integer literal. An optional leading - is consumed by the parser in value position, not here in the tokenizer.
  - On ': read a string literal; '' is a single embedded quote; a missing close quote is an error reporting the opening (line, col).
  - On <, >, !: peek for = to form <=, >=, !=.
  - On one of = , ; ( ) *: emit a single-char token.
  - Anything else: error reporting the (line, col) of the bad byte.
- Every emitted Token carries its (line, col) (the start of the token), so parser errors can blame the right column even when the token is several characters long.
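The tasks above can be sketched as a single byte-wise loop. This is a minimal sketch under stated assumptions, not the reference solution: the variant names (Ident, IntLit, TextLit, Semi, and so on), the i64 payload for integers, the msg field on ParseError, and the error wording are all choices made for this example. It also assumes that a bare ! and a lone - are errors, because the token list names neither as a token; adjust that if the parser expects a minus token.

```rust
use TokKind as K;

#[derive(Debug, Clone, PartialEq)]
pub enum TokKind {
    Select, From, Where, Insert, Into, Values, Create, Table,
    Delete, Update, Set, And, Int, Text,          // keywords
    Ident(String), IntLit(i64), TextLit(String),  // identifier and literals
    Comma, Semi, LParen, RParen, Star,            // punctuation
    Eq, Ne, Lt, Le, Gt, Ge,                       // operators
}

#[derive(Debug, Clone, PartialEq)]
pub struct Token { pub kind: TokKind, pub line: usize, pub col: usize }
#[derive(Debug, Clone, PartialEq)]
pub struct ParseError { pub line: usize, pub col: usize, pub msg: String }

/// Keyword table; the caller passes the uppercase-folded word.
fn keyword(upper: &str) -> Option<TokKind> {
    Some(match upper {
        "SELECT" => K::Select, "FROM" => K::From, "WHERE" => K::Where,
        "INSERT" => K::Insert, "INTO" => K::Into, "VALUES" => K::Values,
        "CREATE" => K::Create, "TABLE" => K::Table, "DELETE" => K::Delete,
        "UPDATE" => K::Update, "SET" => K::Set, "AND" => K::And,
        "INT" => K::Int, "TEXT" => K::Text,
        _ => return None,
    })
}

pub fn tokenize(src: &str) -> Result<Vec<Token>, ParseError> {
    let b = src.as_bytes();
    let (mut i, mut line, mut line_start) = (0usize, 1usize, 0usize);
    let mut out = Vec::new();
    while i < b.len() {
        // Remember where this token starts; line and col are both 1-based.
        let (start, tok_line, col) = (i, line, i - line_start + 1);
        let err = move |msg: &str| ParseError { line: tok_line, col, msg: msg.to_string() };
        let c = b[i];
        let kind = match c {
            b'\n' => { i += 1; line += 1; line_start = i; continue; }
            _ if c.is_ascii_whitespace() => { i += 1; continue; }
            b'-' if b.get(i + 1) == Some(&b'-') => {
                while i < b.len() && b[i] != b'\n' { i += 1; } // comment runs to end of line
                continue;
            }
            _ if c.is_ascii_alphabetic() || c == b'_' => {
                while i < b.len() && (b[i].is_ascii_alphanumeric() || b[i] == b'_') { i += 1; }
                let word = &src[start..i];
                // Fold only for the lookup; identifiers keep their original bytes.
                keyword(&word.to_ascii_uppercase()).unwrap_or_else(|| K::Ident(word.to_string()))
            }
            _ if c.is_ascii_digit() => {
                while i < b.len() && b[i].is_ascii_digit() { i += 1; }
                let n = src[start..i].parse::<i64>().map_err(|_| err("integer literal out of range"))?;
                K::IntLit(n)
            }
            b'\'' => {
                i += 1;
                let mut text = Vec::new();
                loop {
                    match b.get(i) {
                        // `err` still holds the position of the opening quote.
                        None => return Err(err("unterminated string literal")),
                        Some(b'\'') if b.get(i + 1) == Some(&b'\'') => { text.push(b'\''); i += 2; }
                        Some(b'\'') => { i += 1; break; }
                        Some(&ch) => {
                            if ch == b'\n' { line += 1; line_start = i + 1; }
                            text.push(ch);
                            i += 1;
                        }
                    }
                }
                // Only ASCII quotes were dropped, so the bytes are still valid UTF-8.
                K::TextLit(String::from_utf8(text).expect("valid UTF-8"))
            }
            b'<' | b'>' | b'!' => {
                let has_eq = b.get(i + 1) == Some(&b'=');
                i += if has_eq { 2 } else { 1 };
                match (c, has_eq) {
                    (b'<', true) => K::Le, (b'<', false) => K::Lt,
                    (b'>', true) => K::Ge, (b'>', false) => K::Gt,
                    (_, true) => K::Ne,
                    // Assumption: a bare `!` is not a token, so it is an error.
                    _ => return Err(err("expected '=' after '!'")),
                }
            }
            b'=' => { i += 1; K::Eq }
            b',' => { i += 1; K::Comma }
            b';' => { i += 1; K::Semi }
            b'(' => { i += 1; K::LParen }
            b')' => { i += 1; K::RParen }
            b'*' => { i += 1; K::Star }
            // Includes a lone `-`, since the token list names no minus token.
            _ => return Err(err("unexpected character")),
        };
        out.push(Token { kind, line: tok_line, col });
    }
    Ok(out)
}
```

Capturing tok_line and col by value in the err closure is what lets an unterminated string report its opening quote even after the scan has advanced line across embedded newlines.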
Acceptance
Inline unit tests (Rust names; mirror them in Go and C++):
- tokenize_happy: a single mixed input exercising every token kind. Assert the resulting Vec<TokKind> matches the expected sequence.
- tokenize_strings_and_errors: a '' escape lexes to the unescaped contents; an unterminated string returns parse error at line N col M: ... with the correct (N, M).
Both green in Rust, Go, and C++.
Discussion prompts
- Why fold keywords but not identifiers? What would change in our fixture hashes if we case-folded identifiers like SQLite does?
- The tokenizer recognizes 14 keywords. Which keyword would we add first if we wanted to parse LIMIT 10? Why does adding it require a parser change too?
- We chose to track (line, col) per token rather than per character offset. What's the trade-off?