go-toml PR #1067 · optimization breakdown

go-toml rewritten by Fable 5

TOML is a configuration-file format (the .toml files you'll recognize from Rust, Python and Go tooling), and go-toml is a widely used library for reading and writing it in Go. PR #1067 rewrote its three core pieces from scratch: the parser (reads the text), the decoder (fills your Go values), and the encoder (writes them back). It kept three hard rules: an identical public API, every existing test passing unchanged, and no use of unsafe (Go's escape hatch that trades safety for raw speed). This page walks through the 20 optimizations that took it from the shipping version (v2) to the final design, and shows what each did to performance on both macOS (arm64) and Linux (amd64).

Every milestone is measured with the same frozen benchmark suite, rebuilt at that commit with one fixed Go toolchain and run on the same machine, so only the library changes. Read each platform against itself, never across the two. The charts are interactive: switch metric, view and scale, step through the guided walkthrough, or open the per-benchmark detail below.

Speedup from v2 to the final design, geometric mean of all benchmarks. Big number is Linux; macOS is shown below. Lower is faster.


Overall arc: whole-suite geomean across all benchmarks

[Chart: whole-suite speedup vs v2, by optimization step. Bold = geomean · faint = individual benchmarks · ★ = phase start]
Phase A · rewrite from scratch (steps 1-12)
Phase B · faster decoding (steps 13-20)
Steps 3-4: correct but un-tuned (intentional spike)

Per-benchmark detail: every benchmark, grouped

Optimization-by-optimization

Δ = whole-suite geomean change at this step
step | optimization | what it does | Δ time (L · M) | Δ allocs (L · M)

The Δ columns give the change in the geometric mean of all benchmarks at that step versus the previous step (negative = faster / fewer allocations). A narrow optimization that only touches one benchmark shows a small whole-suite Δ even when its own benchmark moves a lot; the prose names the benchmark it targets. Changes smaller than about 1-2% are within measurement noise, especially on the shared macOS machine.

Methodology & caveats

The benchmark harness is held constant: the frozen #1067 suite (6 real-world datasets, the ReferenceFile/SimpleDocument/Hugo Marshal+Unmarshal cases, and 7 "real-world consumer" benchmarks) is overlaid on each milestone commit. All milestones share an identical public API, so identical benchmark code runs against every implementation. Each commit is checked out, rebuilt and run with go1.26.4.

macOS (M-series, arm64) and Linux (amd64) are different machines: compare each series only against itself.

The x-axis is the optimization milestone order on the reimpl-perf branch, which is a single linear history containing every optimization commit. The two un-tuned rewrite steps (3-4) are deliberately kept in: a correct functional decoder, before pooling/caching, is several times slower than v2, and the following commits reclaim that headroom.

The merged PR adds one further commit (internal/parserbridge) on top of step 20 to keep the exported symbol set identical to v2; it costs ~1.5% on the smallest decode micro-cases and is neutral on real documents, so it is omitted from this pure-performance chain.
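For readers unfamiliar with how Go reports the time and allocation figures behind the Δ columns, the following is a rough sketch using only the standard testing package. decodeStandIn is a hypothetical placeholder for the call under test (it is not go-toml and does not parse real TOML), and measure is an illustrative helper, not part of the #1067 suite.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// decodeStandIn is a hypothetical stand-in for the library call under test.
// It only splits "key = value" lines into a map; it is NOT a TOML decoder.
func decodeStandIn(doc string) map[string]string {
	out := make(map[string]string)
	for _, line := range strings.Split(doc, "\n") {
		if k, v, ok := strings.Cut(line, "="); ok {
			out[strings.TrimSpace(k)] = strings.TrimSpace(v)
		}
	}
	return out
}

// measure runs fn under the standard benchmark driver and returns the two
// figures a time/allocs comparison needs: ns/op and allocs/op.
func measure(fn func()) (nsPerOp, allocsPerOp int64) {
	testing.Init() // registers testing flags outside "go test"; no effect if already called
	r := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			fn()
		}
	})
	return r.NsPerOp(), r.AllocsPerOp()
}

func main() {
	doc := "title = \"example\"\nport = 8080\ndebug = true"
	ns, allocs := measure(func() { decodeStandIn(doc) })
	fmt.Printf("%d ns/op, %d allocs/op\n", ns, allocs)
}
```

Because the benchmark body only depends on the function's signature, the same code can be rebuilt against different implementations of that function, which is the property the frozen suite relies on.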