Indexes and Query Optimization

1. What Is It

A secondary index is an auxiliary data structure that maps each value of a non-primary-key column to the set of row-ids that contain it. A query planner turns a logical query (predicates + projections) into a physical plan tree (scan → filters → project) and picks an access path per predicate. A rule-based planner uses fixed heuristics; a cost-based planner consults statistics. db-14 implements the rule-based half end-to-end and keeps the cost model deliberately tiny (rows / distinct_keys for =, (rows+2)/3 for ranges) so the byte-for-byte cross-language invariant is tractable.

2. Why It Matters

A SeqScan on N rows costs Θ(N) regardless of selectivity. A point lookup through a sorted (or hashed) index is O(log N) (or O(1) amortised). When predicates have selectivity ≪ 1 — the normal case in OLTP — choosing the right access path is the single largest performance lever a database has. And once two physical operators are available, you need a planner. Even a naive planner with the wrong cost model can be catastrophically slow on real workloads (see Leis et al., "How Good Are Query Optimizers, Really?", PVLDB 2015) — but it is also the concept through which every later optimisation (joins, partitioning, parallelism) gets expressed.

3. How It Works

            Query{projections, predicates}
                          │
                          ▼
                  ┌───────────────┐
                  │   Planner     │   rule-based:
                  │ estimate per  │   • Eq  → rows/distinct
                  │ indexable pred│   • Lt/Le/Gt/Ge → (rows+2)/3
                  │ pick min      │   • Ne / no-index → SeqScan
                  └──────┬────────┘
                         ▼
            Plan{ Pipeline[ scan, *Filter, Project? ] }
                         │
                         ▼
                  ┌───────────────┐
                  │   Executor    │   Volcano-style: scan rows,
                  │  scan→filter  │   retain on predicate,
                  │   →project    │   rewrite columns at end.
                  └──────┬────────┘
                         ▼
                     []Row

Indexes are BTreeMap<Value, Vec<row_id>> in Rust, std::map in C++, and a sorted slice of (Value, []row_id) in Go. Insertion order is preserved inside each bucket, which (combined with ascending key traversal) gives a total, deterministic output order shared across all three implementations.

4. Core Terminology

Term	Definition
Secondary index	Sorted map from column value → list of row-ids.
Access path	Concrete way to read rows for a predicate (SeqScan vs IndexScan).
Selectivity	Fraction of rows a predicate keeps; 0 ≤ s ≤ 1.
Cardinality estimate	Predicted row count out of an operator.
Pipeline	Linear chain of operators evaluated row-at-a-time (Volcano model).
Rule-based optimizer	Picks plans from fixed heuristics; no statistics.
Cost-based optimizer	Searches plan space; uses statistics + a cost function.
Covering index	Index that includes every column the query needs (skip the row lookup).
Index-only scan	Read the index without touching the heap.
Tuple	A single row of a relation.

5. Mental Models

Index = sorted dictionary. Equality is dictionary lookup; range is dictionary range(). Everything else is a special case.
Planner = predicate auctioneer. Each indexable predicate "bids" its estimated row count; the lowest bid wins the scan, the rest become Filters.
Executor = pipeline. Each operator pulls rows from its child. Operators don't materialise unless they must (project, sort, hash).
Wire format = correctness oracle. If three languages serialise the same plan and result bytes for the same query, they agree on planning + execution semantics. The SHA-256 collapses N MB of bytes into a 64-char string we can put in a case statement.

6. Common Misconceptions

"Indexes always speed up reads." False for low-selectivity predicates: a SeqScan reads the heap once; an IndexScan dereferences each row-id, which may be worse if most rows match.
"More indexes is free." Every write must update every index — and indexes cost RAM/disk.
"Rule-based is obsolete." Modern systems (SQLite, MySQL) ship rule-based fallbacks for simple queries because cost-based planning has its own pathologies (bad stats → catastrophic plans).
"Selectivity = 1 / distinct keys." Only under uniform-distribution assumption. Skewed data needs histograms or sampling.
"Project is free." Wide projection through long pipelines materialises copies; columnar engines avoid this; row engines pay for it.

7. Interview Talking Points

Explain how a B-tree index supports both point and range queries, and what changes for a hash index (point only, no ordering).
Walk through plan selection: predicates → estimates → cheapest scan → remaining as filters → optional project.
Why is index iteration order important? Determinism, merge-join inputs, ORDER BY elimination.
Explain Volcano vs vectorised execution. Tradeoffs?
What is a covering index, and when does it dominate?
Discuss EXPLAIN output: how do you read a query plan?
What can go wrong when the planner picks a SeqScan instead of an IndexScan (or vice versa)? Stale statistics, correlated predicates, type coercions that disable the index.

8. Connections to Other Labs

db-02 (data structures) introduced sorted maps; this lab puts them to work.
db-10 (B-tree) is the persistent counterpart of the in-memory index here.
db-12 (SQL frontend) produces the Query struct planners consume.
db-13 (transactions/MVCC) governs which rows the index sees per snapshot.
db-15 (SQLite-complete) stitches all of the above into a real engine.
db-22 (perf/benchmarking) measures the planner choices we make here.

9. Frozen Wire Format

Plan stream      = 0x05 (PIPELINE) | u32 LE child_count | child*
Child node tags:
   0x01 SeqScan    | u32 LE table_id(=0)
   0x02 IndexScan  | u32 LE col_idx | u8 op | u8 val_tag | <val>
   0x03 Filter     | u32 LE col_idx | u8 op | u8 val_tag | <val>
   0x04 Project    | u32 LE col_count | (u32 LE col_idx)*

Op codes : Eq=1 Ne=2 Lt=3 Le=4 Gt=5 Ge=6
Val tags : Int=1 (i64 LE) ; Text=2 (u32 LE len | bytes)

Result stream    = "DSEQR01" (7 bytes) | u32 LE row_count |
                   per row: u32 LE col_count | (u8 tag | <val>)*

Both streams are concatenated per op; the SHA-256 of that concatenation, across N ops, is the byte-identity oracle for the cross-language test.

Distributed Systems Engineer — Build Databases & Consensus From Scratch