broader ideas
What a "real" SQLite slice would add
If the goal is fidelity rather than pedagogy, the natural next steps, roughly in order of payoff:
- A pager backed by
pwrite. This is db-11. Once you have a pager the snapshot becomes the file, not an ad-hoc byte stream. - Page-level checksums. Even just XXH3 per page; turns silent corruption into noisy corruption.
- A WAL. Append-only journal of operations, replay on open. db-03 does the WAL; the fusion is in db-23.
- Schema and DDL.
CREATE TABLE, multiple tables, column types. The single-table assumption hides almost all the catalog complexity. - A query planner. Even just a cost-based decision between
SELECT_BY_KandSELECT_BY_TAGwould be educational. With one table and two indexes the planner is trivial; with joins it explodes.
What this engine could become with concurrency
The MVCC bookkeeping is already there — created_at and deleted_at.
What is missing for real read-mostly concurrency:
- A reader-visible snapshot timestamp, so
SELECTreads consistently as of "the latest committed txid I saw". - Write-set tracking and a commit barrier, so two writers cannot both bump txid without serialising.
- Garbage collection of tombstoned rows once no live reader could observe them. The current code holds tombstones forever, which is fine for a benchmark and disastrous for a real system.
The interesting thing is that the snapshot wire format would still work — you would just be dumping a consistent point rather than the literal in-memory state.
What this engine could never be
Without on-disk persistence, this is not a database; it is a test fixture. Adding the pager moves it to "embedded KV with SQL", which is roughly what SQLite is.
It will never be a server. Network protocols, connection management, client-side query plans, authentication — none of that is in scope for any lab in this series, by design.
Useful tangent: cross-language byte identity as a discipline
This lab is a microcosm of a discipline that pays off elsewhere:
- gRPC and protobuf rely on a canonical encoding for hash-based signing.
- Git's object hashing depends on a canonical layout per object type.
- Bitcoin transactions are SHA-256-d in a canonical byte form.
Whenever you find yourself asking "is this implementation correct", producing a canonical byte stream from each implementation and hashing it is one of the cheapest mechanical proofs you can buy.