
bench: harden benchmark validity and capture topology and attribution findings#604

Open

belveryin wants to merge 14 commits into main from feat/swmr-benchmark-first-pass

Conversation


@belveryin commented Mar 29, 2026

Summary

This PR turns the benchmark branch into a validity-gated investigation across both engine and surface layers.

It now does seven things:

  • hardens the swmr_gb_scale_mixed benchmark so the reported 1 GB numbers are correctness-checked
  • hardens the standard-S3 benchmark path so invalid zero-row artifacts are rejected instead of silently accepted
  • adds topology metadata and a matrix helper for local vs object-store comparison work
  • captures the first working Tonbo S3 Express runs, including same-host same-region EC2 results
  • adds local write-path attribution for swmr_gb_scale_mixed so foreground writer cost is broken down instead of inferred from end-to-end latency alone
  • adds a first-pass surface_open_and_fresh_read benchmark to measure snapshot/open, selective HEAD reads, foreground write, and write-to-visible follow-up behavior
  • stacks the Tonbo-side Express benchmark wiring on top of Fusio PR tonbo-io/fusio#269

What Changed

SWMR validity hardening

  • swmr_gb_scale_mixed now records per-reader expectations:
    • expected rows
    • expected first/last key
    • expected key fingerprint
    • validation model
  • reader validity is now explicit:
    • head_light: count_and_key_band
    • head_heavy, pinned_light, pinned_heavy: exact_shape_stable
  • pinned readers now use a held snapshot object instead of reconstructing state only through snapshot_at(latest_manifest_version.timestamp)
  • the artifact now records both held-snapshot expectations and manifest-reconstruction observations so the old pinned mismatch is visible and machine-readable
  • added deterministic regression coverage for SWMR snapshot stability and reader-shape expectations
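
The per-reader expectation record and the two validity models can be sketched roughly as follows; all type and field names here are illustrative assumptions, not Tonbo's actual benchmark types:

```rust
// Hypothetical sketch of the per-reader expectation record described above.

#[derive(Debug, Clone, PartialEq)]
enum ValidationModel {
    /// head_light: row count plus first/last key band must match.
    CountAndKeyBand,
    /// head_heavy / pinned_*: full shape, including key fingerprint, must be stable.
    ExactShapeStable,
}

#[derive(Debug, Clone)]
struct ReaderExpectation {
    expected_rows: u64,
    expected_first_key: String,
    expected_last_key: String,
    expected_key_fingerprint: u64,
    model: ValidationModel,
}

#[derive(Debug, Clone)]
struct ReaderObservation {
    rows: u64,
    first_key: String,
    last_key: String,
    key_fingerprint: u64,
}

fn is_valid(exp: &ReaderExpectation, obs: &ReaderObservation) -> bool {
    let band_ok = obs.rows == exp.expected_rows
        && obs.first_key == exp.expected_first_key
        && obs.last_key == exp.expected_last_key;
    match exp.model {
        ValidationModel::CountAndKeyBand => band_ok,
        ValidationModel::ExactShapeStable => {
            band_ok && obs.key_fingerprint == exp.expected_key_fingerprint
        }
    }
}
```

The point of the split is that a fingerprint drift fails `exact_shape_stable` readers while still passing a `count_and_key_band` reader, which is what makes the old pinned mismatch machine-detectable.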

Standard-S3 benchmark hardening

  • benchmark artifacts now carry topology metadata:
    • runner env / region / AZ / instance type
    • bucket region / AZ
    • object-store flavor
    • endpoint kind
    • network path
  • added benches/compaction/run_matrix.sh to run local / S3 matrix cells and emit a report stub
  • switched the benchmark back to the native object-store DB path for standard S3 correctness after the probed S3 wrapper produced invalid artifacts
  • hardened read_compaction_quiesced setup so the benchmark reopens a fresh measurement DB and waits until it sees the expected visible row count before accepting the run
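
The acceptance gate in the last bullet can be sketched as a bounded poll; `count_visible_rows` is a hypothetical stand-in for reopening the measurement DB and counting visible rows, not Tonbo's API:

```rust
use std::{
    thread,
    time::{Duration, Instant},
};

// Hypothetical sketch: poll until the expected visible row count appears,
// otherwise reject the run instead of silently accepting a zero-row artifact.
fn wait_for_visible_rows(
    mut count_visible_rows: impl FnMut() -> u64,
    expected: u64,
    timeout: Duration,
) -> Result<(), String> {
    let deadline = Instant::now() + timeout;
    loop {
        let seen = count_visible_rows();
        if seen == expected {
            return Ok(());
        }
        if Instant::now() >= deadline {
            // The run is invalid; do not emit an artifact for it.
            return Err(format!("invalid run: saw {seen} rows, expected {expected}"));
        }
        thread::sleep(Duration::from_millis(10)); // poll interval
    }
}
```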

S3 Express enablement and benchmark wiring

On top of Fusio PR tonbo-io/fusio#269, this branch now:

  • exposes S3Spec.s3_express
  • wires Express mode through benchmark and smoke-test env handling
  • propagates spec.s3_express into the separate object-store FS used for snapshot / cleanup metadata walking
  • pins Tonbo to the exact Fusio commit under test via git dependencies rather than a local path override

That last harness fix matters because the initial failing benchmark path was not the core DB path. It was the metadata walk rebuilding an S3 client without Express mode and then hitting the real Express endpoint with the wrong signing flow.
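
A hypothetical sketch of the shape of the fix: every filesystem handle derived from the spec, including the one used for the metadata walk, must carry the Express flag. `S3Spec.s3_express` mirrors the field named above; everything else is illustrative, not the real Fusio/Tonbo types:

```rust
// Illustrative types only; the real spec and FS builder live in the harness.
#[derive(Clone)]
struct S3Spec {
    bucket: String,
    region: String,
    s3_express: bool,
}

#[derive(Debug, PartialEq)]
struct ObjectStoreFs {
    express_signing: bool,
}

// The bug was a second builder that dropped the flag; the fix is that any FS
// built from the spec honors spec.s3_express, metadata walker included.
fn build_fs(spec: &S3Spec) -> ObjectStoreFs {
    ObjectStoreFs { express_signing: spec.s3_express }
}
```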

Local write-path attribution

The branch now also records a local attribution pass for the foreground writer path used by swmr_gb_scale_mixed.

That adds:

  • explicit timing capture for partitioning, WAL append, WAL commit, mutable insert, seal, and minor compaction in the ingest path
  • benchmark artifact emission for the aggregated writer-path breakdown
  • regression coverage that checks the profiled timings are coherent and that ingest visibility semantics still hold
  • a checked-in result note summarizing the local ~1 GB writer-path breakdown
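
A minimal sketch of the stage-timing capture, assuming a simple accumulate-per-stage model; the type and helper names are illustrative, not the actual profiling hooks:

```rust
use std::time::{Duration, Instant};

// One accumulator per profiled foreground-writer stage.
#[derive(Default)]
struct WriterPathTimings {
    partition: Duration,
    wal_append: Duration,
    wal_commit: Duration,
    mutable_insert: Duration,
    seal: Duration,
    minor_compaction: Duration,
}

// Wrap a stage closure, accumulate its wall time into the matching slot,
// and pass its result through unchanged.
fn time_stage<T>(slot: &mut Duration, stage: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = stage();
    *slot += start.elapsed();
    out
}
```

Aggregating the accumulators over all ingest steps yields the per-stage breakdown emitted into the artifact.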

First surface benchmark

The branch now also adds a first narrow surface benchmark:

  • scenario: surface_open_and_fresh_read
  • measures:
    • begin_snapshot as a first-pass open/snapshot cost
    • one selective HEAD read with a small projection
    • one heavier HEAD read
    • one foreground write with write-path profiling
    • one follow-up selective HEAD read after the write to capture a write-to-visible surface cost
  • reuses the current artifact pipeline, topology metadata, and object-store/local backend selection
  • is intentionally closer to a user-facing interactive path than read_compaction_quiesced, but is still not a filesystem benchmark

Key Findings

Same-host EC2 local vs standard S3

On the same EC2 instance in eu-central-1:

| Cell | Mean | p95 | Rows Processed |
|---|---|---|---|
| local, scale=1 | 28.88 ms | 29.30 ms | 24,576 |
| standard S3, scale=1 | 747.19 ms | 793.70 ms | 24,576 |
| local, scale=4 | 81.18 ms | 82.58 ms | 24,576 |
| standard S3, scale=4 | 1263.28 ms | 1284.84 ms | 24,576 |

Observed ratio:

  • scale=1 mean: 25.9x
  • scale=1 p95: 27.1x
  • scale=4 mean: 15.6x
  • scale=4 p95: 15.6x
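
These ratios are straight quotients of the table rows, for example 747.19 ms / 28.88 ms ≈ 25.9:

```rust
// Ratio of the standard-S3 latency to the local latency for the same cell.
fn ratio(s3_ms: f64, local_ms: f64) -> f64 {
    s3_ms / local_ms
}
```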

Interpretation:

  • same-region EC2 materially improves standard-S3 numbers relative to the non-EC2 remote host, but standard S3 is still much slower than local on the same machine
  • the earlier EC2 scale=4 zero-row artifact was a benchmark-harness acceptance bug, not a demonstrated generic reopen/read bug in Tonbo
  • standard-S3 cost remains heavily prepare/setup dominated on this read-only directional scenario

Same-host EC2 SWMR 1 GB

On the same EC2 instance at ~1 GB logical state:

| Cell | Mean Step (s) | p95 Step (s) | Throughput | Writer Mean (s) | Head Light (s) | Head Heavy (s) | Pinned Light (s) | Pinned Heavy (s) |
|---|---|---|---|---|---|---|---|---|
| local, ~1 GB logical | 0.211 | 0.283 | 37.57 Krows/s | 0.148 | 0.0076 | 0.0356 | 0.0028 | 0.0176 |
| standard S3, ~1 GB logical | 9.642 | 11.950 | 823 rows/s | 5.840 | 0.887 | 1.617 | 0.275 | 1.022 |

Observed ratio:

  • whole mixed step mean: 45.7x
  • writer mean: 39.5x
  • head_light: 117.3x
  • head_heavy: 45.4x
  • pinned_light: 99.6x
  • pinned_heavy: 58.1x

Interpretation:

  • standard-S3 performance on this branch clearly deteriorates as state size and workload realism increase
  • the main cost wall on standard S3 is still the writer path, but reader costs are also materially worse at ~1 GB
  • both same-host 1 GB artifacts are correctness-valid, so these are usable comparison numbers

Local writer-path attribution at ~1 GB

For swmr_gb_scale_mixed at ~1 GB logical on the local backend, the profiled foreground writer path breaks down as:

  • minor compaction: 229.18 ms
  • WAL durability (append + commit): 218.31 ms
  • mutable insert: 109.17 ms
  • seal: 60.04 ms
  • partition: 4.89 ms
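
As a quick sanity check on this breakdown (assuming, as hedged above, that these five stages cover the profiled foreground path, which sums to ~621.6 ms), minor compaction plus WAL durability account for roughly 72% of the profiled writer cost:

```rust
// Fraction of the summed profiled writer path attributable to one stage.
fn stage_share(stage_ms: f64, all_stages_ms: &[f64]) -> f64 {
    stage_ms / all_stages_ms.iter().sum::<f64>()
}
```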

Interpretation:

  • the current user-visible write latency is not a WAL-only ack path
  • inline minor compaction and WAL durability together dominate the foreground writer cost
  • the next design question is whether Tonbo should acknowledge after durable WAL completion and move more maintenance behind that boundary

First surface benchmark follow-up

The first local vs standard-S3 comparison for surface_open_and_fresh_read shows that the severe object-store penalty is not limited to the earlier compaction-focused engine cell.

| Metric | Local | Standard S3 | Ratio |
|---|---|---|---|
| whole surface op mean | 14.08 ms | 11.321 s | 804x |
| begin_snapshot | 0.87 ms | 534.87 ms | 613x |
| latest light read | 2.91 ms | 2.921 s | 1004x |
| latest heavy read | 5.68 ms | 3.746 s | 659x |
| foreground write | 1.86 ms | 1.207 s | 649x |
| write-to-visible follow-up | 4.60 ms | 4.118 s | 894x |

Read-path split in the object-store surface run:

  • latest light read:
    • prepare: 2919.80 ms
    • consume: 0.78 ms
  • latest heavy read:
    • prepare: 3742.38 ms
    • consume: 3.48 ms
  • fresh light read after write:
    • prepare: 2910.10 ms
    • consume: 0.49 ms
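
A minimal sketch of the prepare/consume split timing behind this breakdown; the closures are stand-ins for building and draining the real reader, not the actual read path:

```rust
use std::time::Instant;

// "prepare" covers building the reader (snapshot/scan setup, fetch planning);
// "consume" covers draining rows out of it. Returns (prepare_ms, consume_ms, rows).
fn timed_read<R>(
    prepare: impl FnOnce() -> R,
    consume: impl FnOnce(R) -> u64,
) -> (f64, f64, u64) {
    let t0 = Instant::now();
    let reader = prepare();
    let prepare_ms = t0.elapsed().as_secs_f64() * 1e3;

    let t1 = Instant::now();
    let rows = consume(reader);
    let consume_ms = t1.elapsed().as_secs_f64() * 1e3;

    (prepare_ms, consume_ms, rows)
}
```

Splitting the two phases is what lets the artifact show that the object-store penalty sits almost entirely in `prepare`.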

Interpretation:

  • a narrower user-facing open/fresh-read proxy also lands in a slowdown regime of several hundred to roughly a thousand times relative to local
  • as with the earlier engine-layer results, the observed penalty is still overwhelmingly prepare/setup dominated rather than row-consume dominated
  • this is a first-pass API-surface benchmark, not a filesystem benchmark

S3 Express same-host same-region follow-up

The branch now also contains the first directly comparable same-host runs between standard S3 and S3 Express in us-east-1, with the EC2 runner in AZ ID use1-az6 matching the directory bucket.

| Cell | Standard S3 | S3 Express | Express vs Standard |
|---|---|---|---|
| read_compaction_quiesced scale=1 mean | 873.05 ms | 2278.97 ms | 2.61x slower |
| read_compaction_quiesced scale=1 p95 | 1263.49 ms | 2621.39 ms | 2.07x slower |
| read_compaction_quiesced scale=4 mean | 2478.59 ms | 4071.72 ms | 1.64x slower |
| read_compaction_quiesced scale=4 p95 | 2768.13 ms | 4264.24 ms | 1.54x slower |
| swmr_gb_scale_mixed ~1 GB mean step | 11.644 s | 40.162 s | 3.45x slower |
| swmr_gb_scale_mixed ~1 GB p95 step | 13.747 s | 42.338 s | 3.08x slower |
| swmr_gb_scale_mixed ~1 GB throughput | 681.57 rows/s | 197.60 rows/s | 3.45x lower |
| swmr_gb_scale_mixed ~1 GB writer mean | 7.571 s | 21.575 s | 2.85x slower |

Important debug result:

  • a same-host Express debug rerun with FUSIO_S3_EXPRESS_DEBUG=1 logged exactly 1 CreateSession
  • Express list failures: 0

Interpretation:

  • the old cross-region explanation is no longer enough
  • the old repeated-session churn bug is no longer the dominant explanation either
  • in the current Tonbo/Fusio path, same-AZ S3 Express is still slower than the same-host standard-S3 control
  • the largest same-host gap appears in the write/setup path, especially 1 GB preload time:
    • standard S3 preload: 699.734 s
    • Express preload: 2331.777 s

Conclusion

This branch now supports three design conclusions.

  • The current write contract is too broad for low-latency ingest. Foreground writes still wait for work beyond WAL durability, including seal and opportunistic minor compaction. The next design step should be to narrow the user-visible ack boundary closer to durable WAL completion and move more maintenance behind it.
  • Direct WAL-on-object-store is likely the wrong default fast path. For deployments with local disk, the next meaningful comparison is hybrid WAL topologies such as WAL(disk) -> later S3 publish or WAL(disk) -> stream pipeline -> eventual S3 persistence. Direct object-store WAL may still be necessary for environments without a filesystem, but the current path looks too expensive as the general case.
  • Read performance remains insufficiently explained, but the evidence now shows more than a small object-store floor. Both the larger SWMR workload and the first surface benchmark show severe user-visible degradation as workload realism grows. The next work should focus on attributing whether that deterioration is driven mainly by bytes read, object-count/metadata growth, snapshot/manifest setup, or a combination.

What this branch still does not prove:

  • the final hybrid WAL architecture,
  • the exact root-cause breakdown of the read path,
  • or full product-surface behavior such as filesystem-style traversal workloads.

Result Notes

Checked-in notes that support the branch story:

  • swmr_gb_scale_2026-03-27.md: pinned-snapshot root cause
  • swmr_gb_scale_2026-03-28.md: move from non-empty checks to shape validation
  • swmr_gb_scale_2026-03-29.md: first valid object-store 1 GB SWMR evidence
  • compaction_topology_2026-03-30.md: first local vs S3 topology pass and original broken scale=4 state
  • ec2_same_host_topology_2026-03-31.md: final same-host EC2 local vs standard-S3 comparison
  • ec2_s3_express_cross_region_2026-04-01.md: first successful cross-region Express enablement runs
  • ec2_use1_same_region_2026-04-01.md: same-host same-region standard-S3 vs Express comparison
  • swmr_write_path_attribution_2026-04-14.md: local writer-path attribution for the ~1 GB SWMR benchmark

Validation

Ran across this branch work:

  • cargo test
  • cargo clippy --all-targets -- -D warnings
  • cargo +nightly fmt --all
  • cargo bench -p tonbo --bench compaction_local --no-run
  • local and standard-S3 directional reruns
  • same-host EC2 local and standard-S3 directional reruns
  • same-host EC2 local and standard-S3 SWMR 1 GB reruns
  • same-host EC2 cross-region S3 Express directional and SWMR reruns
  • same-host EC2 same-region same-AZ S3 Express directional and SWMR reruns
  • same-host same-AZ Express debug rerun with FUSIO_S3_EXPRESS_DEBUG=1
  • local SWMR attribution rerun for swmr_gb_scale_mixed
  • local and standard-S3 first-pass surface reruns for surface_open_and_fresh_read

Related

  • belveryin requested a review from ethe (March 29, 2026 20:24)
  • belveryin marked this pull request as ready for review (March 29, 2026 20:32)
  • belveryin force-pushed the feat/swmr-benchmark-first-pass branch from 310ef66 to e8fd769 (March 30, 2026 09:26)
  • belveryin changed the title from "bench: harden swmr benchmark validity" to "bench: harden benchmark validity and capture topology findings" (Mar 31, 2026)
  • belveryin changed the title from "bench: harden benchmark validity and capture topology findings" to "bench: harden benchmark validity and capture topology and attribution findings" (Apr 14, 2026)
Closes: bench: add high-volume one-writer / multiple-readers benchmark