Step 3 — Group commit benchmark
Goal
Quantify the cost of fsync and the throughput win from group commit.
Workload
    bench-group PATH N BATCH:
        for i in 0..N:
            append(payload)
            if (i+1) % BATCH == 0: sync()
        sync()  // final
PATH is a brand-new file each run. N = 50_000 is a good starting point on a modern SSD.
Numbers to look for (M2 Pro, APFS, 64-byte payload)
| Batch | Throughput | Avg latency / sync | Bytes flushed / sync |
|---|---|---|---|
| 1 | ~1,800 rec/s | ~560 µs | ~72 B |
| 8 | ~12,000 rec/s | ~670 µs | ~576 B |
| 64 | ~110,000 rec/s | ~580 µs | ~4.6 KB |
| 512 | ~260,000 rec/s | ~1.0 ms | ~37 KB |
| 4096 | ~310,000 rec/s | ~13 ms | ~295 KB |
Two effects worth noting:
- Sync time is roughly constant up to ~4 KB per sync (560 to 670 µs in the table): the bottleneck is the per-fsync overhead (syscall + journal commit), not the byte count.
- Returns diminish past batch ~256: bandwidth becomes the next limit. Past ~4096 you start hitting tail-latency cliffs.
What "broken" looks like
- Throughput at batch=1 is the same as at batch=64: your `sync()` isn't doing anything (no-op, wrong fd, or `bufio.Writer` swallowing the write).
- Throughput keeps climbing past batch=4096: you may not be calling `sync()` at all between batches.
- macOS numbers look impossibly fast: plain `fsync` does not flush the device cache. Re-run with `F_FULLFSYNC` to compare.
Comparing to a Linux box
On NVMe + ext4:
| Batch | Throughput |
|---|---|
| 1 | ~3,000 rec/s |
| 64 | ~180,000 rec/s |
| 4096 | ~600,000 rec/s |
The shape is identical; absolute numbers depend on the device's flush latency.