
bench: harden benchmark validity and capture topology and attribution findings#604

Open

belveryin wants to merge 14 commits into main from feat/swmr-benchmark-first-pass

Conversation


@belveryin commented Mar 29, 2026

Summary

This PR turns the benchmark branch into a validity-gated investigation across both engine and surface layers.

It now does seven things:

  • hardens the swmr_gb_scale_mixed benchmark so the reported 1 GB numbers are correctness-checked
  • hardens the standard-S3 benchmark path so invalid zero-row artifacts are rejected instead of silently accepted
  • adds topology metadata and a matrix helper for local vs object-store comparison work
  • captures the first working Tonbo S3 Express runs, including same-host same-region EC2 results
  • adds local write-path attribution for swmr_gb_scale_mixed so foreground writer cost is broken down instead of inferred from end-to-end latency alone
  • adds a first-pass surface_open_and_fresh_read benchmark to measure snapshot/open, selective HEAD reads, foreground write, and write-to-visible follow-up behavior
  • stacks the Tonbo-side Express benchmark wiring on top of Fusio PR tonbo-io/fusio#269

What Changed

SWMR validity hardening

  • swmr_gb_scale_mixed now records per-reader expectations:
    • expected rows
    • expected first/last key
    • expected key fingerprint
    • validation model
  • reader validity is now explicit:
    • head_light: count_and_key_band
    • head_heavy, pinned_light, pinned_heavy: exact_shape_stable
  • pinned readers now use a held snapshot object instead of reconstructing state only through snapshot_at(latest_manifest_version.timestamp)
  • the artifact now records both held-snapshot expectations and manifest-reconstruction observations so the old pinned mismatch is visible and machine-readable
  • added deterministic regression coverage for SWMR snapshot stability and reader-shape expectations
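
The per-reader expectation record and the two validity models can be sketched roughly as follows; all type and field names here are illustrative assumptions, not Tonbo's actual benchmark types:

```rust
// Hypothetical sketch of the per-reader expectation record described above.

#[derive(Debug, Clone, PartialEq)]
enum ValidationModel {
    /// head_light: row count plus first/last key band must match.
    CountAndKeyBand,
    /// head_heavy / pinned_*: full shape, including key fingerprint, must be stable.
    ExactShapeStable,
}

#[derive(Debug, Clone)]
struct ReaderExpectation {
    expected_rows: u64,
    expected_first_key: String,
    expected_last_key: String,
    expected_key_fingerprint: u64,
    model: ValidationModel,
}

#[derive(Debug, Clone)]
struct ReaderObservation {
    rows: u64,
    first_key: String,
    last_key: String,
    key_fingerprint: u64,
}

fn is_valid(exp: &ReaderExpectation, obs: &ReaderObservation) -> bool {
    let band_ok = obs.rows == exp.expected_rows
        && obs.first_key == exp.expected_first_key
        && obs.last_key == exp.expected_last_key;
    match exp.model {
        ValidationModel::CountAndKeyBand => band_ok,
        ValidationModel::ExactShapeStable => {
            band_ok && obs.key_fingerprint == exp.expected_key_fingerprint
        }
    }
}
```

The point of the split is that a fingerprint drift fails `exact_shape_stable` readers while still passing a `count_and_key_band` reader, which is what makes the old pinned mismatch machine-detectable.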

Standard-S3 benchmark hardening

  • benchmark artifacts now carry topology metadata:
    • runner env / region / AZ / instance type
    • bucket region / AZ
    • object-store flavor
    • endpoint kind
    • network path
  • added benches/compaction/run_matrix.sh to run local / S3 matrix cells and emit a report stub
  • switched the benchmark back to the native object-store DB path for standard S3 correctness after the probed S3 wrapper produced invalid artifacts
  • hardened read_compaction_quiesced setup so the benchmark reopens a fresh measurement DB and waits until it sees the expected visible row count before accepting the run
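
The acceptance gate in the last bullet can be sketched as a bounded poll; `count_visible_rows` is a hypothetical stand-in for reopening the measurement DB and counting visible rows, not Tonbo's API:

```rust
use std::{
    thread,
    time::{Duration, Instant},
};

// Hypothetical sketch: poll until the expected visible row count appears,
// otherwise reject the run instead of silently accepting a zero-row artifact.
fn wait_for_visible_rows(
    mut count_visible_rows: impl FnMut() -> u64,
    expected: u64,
    timeout: Duration,
) -> Result<(), String> {
    let deadline = Instant::now() + timeout;
    loop {
        let seen = count_visible_rows();
        if seen == expected {
            return Ok(());
        }
        if Instant::now() >= deadline {
            // The run is invalid; do not emit an artifact for it.
            return Err(format!("invalid run: saw {seen} rows, expected {expected}"));
        }
        thread::sleep(Duration::from_millis(10)); // poll interval
    }
}
```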

S3 Express enablement and benchmark wiring

On top of Fusio PR tonbo-io/fusio#269, this branch now:

  • exposes S3Spec.s3_express
  • wires Express mode through benchmark and smoke-test env handling
  • propagates spec.s3_express into the separate object-store FS used for snapshot / cleanup metadata walking
  • pins Tonbo to the exact Fusio commit under test via git dependencies rather than a local path override

That last harness fix matters because the initial failing benchmark path was not the core DB path. It was the metadata walk rebuilding an S3 client without Express mode and then hitting the real Express endpoint with the wrong signing flow.
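
A hypothetical sketch of the shape of the fix: every filesystem handle derived from the spec, including the one used for the metadata walk, must carry the Express flag. `S3Spec.s3_express` mirrors the field named above; everything else is illustrative, not the real Fusio/Tonbo types:

```rust
// Illustrative types only; the real spec and FS builder live in the harness.
#[derive(Clone)]
struct S3Spec {
    bucket: String,
    region: String,
    s3_express: bool,
}

#[derive(Debug, PartialEq)]
struct ObjectStoreFs {
    express_signing: bool,
}

// The bug was a second builder that dropped the flag; the fix is that any FS
// built from the spec honors spec.s3_express, metadata walker included.
fn build_fs(spec: &S3Spec) -> ObjectStoreFs {
    ObjectStoreFs { express_signing: spec.s3_express }
}
```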

Local write-path attribution

The branch now also records a local attribution pass for the foreground writer path used by swmr_gb_scale_mixed.

That adds:

  • explicit timing capture for partitioning, WAL append, WAL commit, mutable insert, seal, and minor compaction in the ingest path
  • benchmark artifact emission for the aggregated writer-path breakdown
  • regression coverage that checks the profiled timings are coherent and that ingest visibility semantics still hold
  • a checked-in result note summarizing the local ~1 GB writer-path breakdown
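
A minimal sketch of the stage-timing capture, assuming a simple accumulate-per-stage model; the type and helper names are illustrative, not the actual profiling hooks:

```rust
use std::time::{Duration, Instant};

// One accumulator per profiled foreground-writer stage.
#[derive(Default)]
struct WriterPathTimings {
    partition: Duration,
    wal_append: Duration,
    wal_commit: Duration,
    mutable_insert: Duration,
    seal: Duration,
    minor_compaction: Duration,
}

// Wrap a stage closure, accumulate its wall time into the matching slot,
// and pass its result through unchanged.
fn time_stage<T>(slot: &mut Duration, stage: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = stage();
    *slot += start.elapsed();
    out
}
```

Aggregating the accumulators over all ingest steps yields the per-stage breakdown emitted into the artifact.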

First surface benchmark

The branch now also adds a first narrow surface benchmark:

  • scenario: surface_open_and_fresh_read
  • measures:
    • begin_snapshot as a first-pass open/snapshot cost
    • one selective HEAD read with a small projection
    • one heavier HEAD read
    • one foreground write with write-path profiling
    • one follow-up selective HEAD read after the write to capture a write-to-visible surface cost
  • reuses the current artifact pipeline, topology metadata, and object-store/local backend selection
  • is intentionally closer to a user-facing interactive path than read_compaction_quiesced, but is still not a filesystem benchmark

Key Findings

Same-host EC2 local vs standard S3

On the same EC2 instance in eu-central-1:

| Cell | Mean | p95 | Rows Processed |
|---|---|---|---|
| local, scale=1 | 28.88 ms | 29.30 ms | 24,576 |
| standard S3, scale=1 | 747.19 ms | 793.70 ms | 24,576 |
| local, scale=4 | 81.18 ms | 82.58 ms | 24,576 |
| standard S3, scale=4 | 1263.28 ms | 1284.84 ms | 24,576 |

Observed ratio:

  • scale=1 mean: 25.9x
  • scale=1 p95: 27.1x
  • scale=4 mean: 15.6x
  • scale=4 p95: 15.6x
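
These ratios are straight quotients of the table rows, for example 747.19 ms / 28.88 ms ≈ 25.9:

```rust
// Ratio of the standard-S3 latency to the local latency for the same cell.
fn ratio(s3_ms: f64, local_ms: f64) -> f64 {
    s3_ms / local_ms
}
```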

Interpretation:

  • same-region EC2 materially improves standard-S3 numbers relative to the non-EC2 remote host, but standard S3 is still much slower than local on the same machine
  • the earlier EC2 scale=4 zero-row artifact was a benchmark-harness acceptance bug, not a demonstrated generic reopen/read bug in Tonbo
  • standard-S3 cost remains heavily prepare/setup dominated on this read-only directional scenario

Same-host EC2 SWMR 1 GB

On the same EC2 instance at ~1 GB logical state:

| Cell | Mean Step (s) | p95 Step (s) | Throughput | Writer Mean (s) | Head Light (s) | Head Heavy (s) | Pinned Light (s) | Pinned Heavy (s) |
|---|---|---|---|---|---|---|---|---|
| local, ~1 GB logical | 0.211 | 0.283 | 37.57 Krows/s | 0.148 | 0.0076 | 0.0356 | 0.0028 | 0.0176 |
| standard S3, ~1 GB logical | 9.642 | 11.950 | 823 rows/s | 5.840 | 0.887 | 1.617 | 0.275 | 1.022 |

Observed ratio:

  • whole mixed step mean: 45.7x
  • writer mean: 39.5x
  • head_light: 117.3x
  • head_heavy: 45.4x
  • pinned_light: 99.6x
  • pinned_heavy: 58.1x

Interpretation:

  • standard-S3 performance on this branch clearly deteriorates as state size and workload realism increase
  • the main cost wall on standard S3 is still the writer path, but reader costs are also materially worse at ~1 GB
  • both same-host 1 GB artifacts are correctness-valid, so these are usable comparison numbers

Local writer-path attribution at ~1 GB

For swmr_gb_scale_mixed at ~1 GB logical on the local backend, the profiled foreground writer path breaks down as:

  • minor compaction: 229.18 ms
  • WAL durability (append + commit): 218.31 ms
  • mutable insert: 109.17 ms
  • seal: 60.04 ms
  • partition: 4.89 ms
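
As a quick sanity check on this breakdown (assuming, as hedged above, that these five stages cover the profiled foreground path, which sums to ~621.6 ms), minor compaction plus WAL durability account for roughly 72% of the profiled writer cost:

```rust
// Fraction of the summed profiled writer path attributable to one stage.
fn stage_share(stage_ms: f64, all_stages_ms: &[f64]) -> f64 {
    stage_ms / all_stages_ms.iter().sum::<f64>()
}
```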

Interpretation:

  • the current user-visible write latency is not a WAL-only ack path
  • inline minor compaction and WAL durability together dominate the foreground writer cost
  • the next design question is whether Tonbo should acknowledge after durable WAL completion and move more maintenance behind that boundary

First surface benchmark follow-up

The first local vs standard-S3 comparison for surface_open_and_fresh_read shows that the severe object-store penalty is not limited to the earlier compaction-focused engine cell.

| Metric | Local | Standard S3 | Ratio |
|---|---|---|---|
| whole surface op mean | 14.08 ms | 11.321 s | 804x |
| begin_snapshot | 0.87 ms | 534.87 ms | 613x |
| latest light read | 2.91 ms | 2.921 s | 1004x |
| latest heavy read | 5.68 ms | 3.746 s | 659x |
| foreground write | 1.86 ms | 1.207 s | 649x |
| write-to-visible follow-up | 4.60 ms | 4.118 s | 894x |

Read-path split in the object-store surface run:

  • latest light read:
    • prepare: 2919.80 ms
    • consume: 0.78 ms
  • latest heavy read:
    • prepare: 3742.38 ms
    • consume: 3.48 ms
  • fresh light read after write:
    • prepare: 2910.10 ms
    • consume: 0.49 ms
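
A minimal sketch of the prepare/consume split timing behind this breakdown; the closures are stand-ins for building and draining the real reader, not the actual read path:

```rust
use std::time::Instant;

// "prepare" covers building the reader (snapshot/scan setup, fetch planning);
// "consume" covers draining rows out of it. Returns (prepare_ms, consume_ms, rows).
fn timed_read<R>(
    prepare: impl FnOnce() -> R,
    consume: impl FnOnce(R) -> u64,
) -> (f64, f64, u64) {
    let t0 = Instant::now();
    let reader = prepare();
    let prepare_ms = t0.elapsed().as_secs_f64() * 1e3;

    let t1 = Instant::now();
    let rows = consume(reader);
    let consume_ms = t1.elapsed().as_secs_f64() * 1e3;

    (prepare_ms, consume_ms, rows)
}
```

Splitting the two phases is what lets the artifact show that the object-store penalty sits almost entirely in `prepare`.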

Interpretation:

  • a narrower user-facing open/fresh-read proxy also lands in a slowdown regime of several hundred to roughly a thousand times relative to local
  • as with the earlier engine-layer results, the observed penalty is still overwhelmingly prepare/setup dominated rather than row-consume dominated
  • this is a first-pass API-surface benchmark, not a filesystem benchmark

S3 Express same-host same-region follow-up

The branch now also contains the first directly comparable same-host runs between standard S3 and S3 Express in us-east-1, with the EC2 runner in AZ ID use1-az6 matching the directory bucket.

| Cell | Standard S3 | S3 Express | Express vs Standard |
|---|---|---|---|
| read_compaction_quiesced scale=1 mean | 873.05 ms | 2278.97 ms | 2.61x slower |
| read_compaction_quiesced scale=1 p95 | 1263.49 ms | 2621.39 ms | 2.07x slower |
| read_compaction_quiesced scale=4 mean | 2478.59 ms | 4071.72 ms | 1.64x slower |
| read_compaction_quiesced scale=4 p95 | 2768.13 ms | 4264.24 ms | 1.54x slower |
| swmr_gb_scale_mixed ~1 GB mean step | 11.644 s | 40.162 s | 3.45x slower |
| swmr_gb_scale_mixed ~1 GB p95 step | 13.747 s | 42.338 s | 3.08x slower |
| swmr_gb_scale_mixed ~1 GB throughput | 681.57 rows/s | 197.60 rows/s | 3.45x lower |
| swmr_gb_scale_mixed ~1 GB writer mean | 7.571 s | 21.575 s | 2.85x slower |

Important debug result:

  • a same-host Express debug rerun with FUSIO_S3_EXPRESS_DEBUG=1 logged exactly 1 CreateSession
  • Express list failures: 0

Interpretation:

  • the old cross-region explanation is no longer enough
  • the old repeated-session churn bug is no longer the dominant explanation either
  • in the current Tonbo/Fusio path, same-AZ S3 Express is still slower than the same-host standard-S3 control
  • the largest same-host gap appears in the write/setup path, especially 1 GB preload time:
    • standard S3 preload: 699.734 s
    • Express preload: 2331.777 s

Conclusion

This branch now supports three design conclusions.

  • The current write contract is too broad for low-latency ingest. Foreground writes still wait for work beyond WAL durability, including seal and opportunistic minor compaction. The next design step should be to narrow the user-visible ack boundary closer to durable WAL completion and move more maintenance behind it.
  • Direct WAL-on-object-store is likely the wrong default fast path. For deployments with local disk, the next meaningful comparison is hybrid WAL topologies such as WAL(disk) -> later S3 publish or WAL(disk) -> stream pipeline -> eventual S3 persistence. Direct object-store WAL may still be necessary for environments without a filesystem, but the current path looks too expensive as the general case.
  • Read performance remains insufficiently explained, but the evidence now shows more than a small object-store floor. Both the larger SWMR workload and the first surface benchmark show severe user-visible degradation as workload realism grows. The next work should focus on attributing whether that deterioration is driven mainly by bytes read, object-count/metadata growth, snapshot/manifest setup, or a combination.

What this branch still does not prove:

  • the final hybrid WAL architecture,
  • the exact root-cause breakdown of the read path,
  • or full product-surface behavior such as filesystem-style traversal workloads.

Result Notes

Checked-in notes that support the branch story:

  • swmr_gb_scale_2026-03-27.md: pinned-snapshot root cause
  • swmr_gb_scale_2026-03-28.md: move from non-empty checks to shape validation
  • swmr_gb_scale_2026-03-29.md: first valid object-store 1 GB SWMR evidence
  • compaction_topology_2026-03-30.md: first local vs S3 topology pass and original broken scale=4 state
  • ec2_same_host_topology_2026-03-31.md: final same-host EC2 local vs standard-S3 comparison
  • ec2_s3_express_cross_region_2026-04-01.md: first successful cross-region Express enablement runs
  • ec2_use1_same_region_2026-04-01.md: same-host same-region standard-S3 vs Express comparison
  • swmr_write_path_attribution_2026-04-14.md: local writer-path attribution for the ~1 GB SWMR benchmark

Validation

Ran across this branch work:

  • cargo test
  • cargo clippy --all-targets -- -D warnings
  • cargo +nightly fmt --all
  • cargo bench -p tonbo --bench compaction_local --no-run
  • local and standard-S3 directional reruns
  • same-host EC2 local and standard-S3 directional reruns
  • same-host EC2 local and standard-S3 SWMR 1 GB reruns
  • same-host EC2 cross-region S3 Express directional and SWMR reruns
  • same-host EC2 same-region same-AZ S3 Express directional and SWMR reruns
  • same-host same-AZ Express debug rerun with FUSIO_S3_EXPRESS_DEBUG=1
  • local SWMR attribution rerun for swmr_gb_scale_mixed
  • local and standard-S3 first-pass surface reruns for surface_open_and_fresh_read

Related

  • belveryin requested a review from ethe (March 29, 2026 20:24)
  • belveryin marked this pull request as ready for review (March 29, 2026 20:32)
  • belveryin force-pushed the feat/swmr-benchmark-first-pass branch from 310ef66 to e8fd769 (March 30, 2026 09:26)
  • belveryin changed the title from "bench: harden swmr benchmark validity" to "bench: harden benchmark validity and capture topology findings" (Mar 31, 2026)
  • belveryin changed the title from "bench: harden benchmark validity and capture topology findings" to "bench: harden benchmark validity and capture topology and attribution findings" (Apr 14, 2026)
Closes: bench: add high-volume one-writer / multiple-readers benchmark