broader ideas

What a "real" SQLite slice would add

If the goal is fidelity rather than pedagogy, the natural next steps, roughly in order of payoff:

  • A pager backed by pwrite. This is db-11. Once you have a pager the snapshot becomes the file, not an ad-hoc byte stream.
  • Page-level checksums. Even just XXH3 per page; turns silent corruption into noisy corruption.
  • A WAL. Append-only journal of operations, replay on open. db-03 does the WAL; the fusion is in db-23.
  • Schema and DDL. CREATE TABLE, multiple tables, column types. The single-table assumption hides almost all the catalog complexity.
  • A query planner. Even just a cost-based decision between SELECT_BY_K and SELECT_BY_TAG would be educational. With one table and two indexes the planner is trivial; with joins it explodes.

What this engine could become with concurrency

The MVCC bookkeeping is already there — created_at and deleted_at. What is missing for real read-mostly concurrency:

  • A reader-visible snapshot timestamp, so SELECT reads consistently as of "the latest committed txid I saw".
  • Write-set tracking and a commit barrier, so two writers cannot both bump txid without serialising.
  • Garbage collection of tombstoned rows once no live reader could observe them. The current code holds tombstones forever, which is fine for a benchmark and disastrous for a real system.

The interesting thing is that the snapshot wire format would still work — you would just be dumping a consistent point rather than the literal in-memory state.

What this engine could never be

Without on-disk persistence, this is not a database; it is a test fixture. Adding the pager moves it to "embedded KV with SQL", which is roughly what SQLite is.

It will never be a server. Network protocols, connection management, client-side query plans, authentication — none of that is in scope for any lab in this series, by design.

Useful tangent: cross-language byte identity as a discipline

This lab is a microcosm of a discipline that pays off elsewhere:

  • gRPC and protobuf rely on a canonical encoding for hash-based signing.
  • Git's object hashing depends on a canonical layout per object type.
  • Bitcoin transactions are SHA-256-d in a canonical byte form.

Whenever you find yourself asking "is this implementation correct", producing a canonical byte stream from each implementation and hashing it is one of the cheapest mechanical proofs you can buy.