fix(download): residential saturation + transient failure hardening #95

Open: jacderida wants to merge 2 commits into WithAutonomi:main from jacderida:home-download-fixes

jacderida commented May 22, 2026

Summary

Hardens the download path against two distinct failure modes observed when running ant file download against the production network: residential link saturation, and per-peer / DHT transient errors that previously aborted multi-hundred-chunk downloads. On a residential connection, an ant file download run goes from "aborts on the first 256-wide concurrent batch's saturation event" to "completes 11/11 files including 2+ GB downloads". On a fat-pipe droplet it goes from "matches baseline" to "matches baseline": there is no regression on the warm-start path that production downloaders actually exercise.

What's in here

Six related pieces:

  1. retry-on-Ok(None) with unanimous-NotFound threshold in chunk_get. When the close group returns Ok(None) (no peer returned the chunk), retry once with a fresh find_closest_peers lookup, unless every queried peer responded with an authoritative NotFound (which is the only safe stop for genuine data absence).
  2. rebucketed_unordered in file.rs instead of buffer_unordered for the in-flight chunk fetches, so the adaptive limiter's cap can shrink the in-flight count mid-batch under sustained pressure.
  3. observe-outer with Ok(None) → Outcome::Timeout instead of observe-per-peer. The controller sees one observation per chunk_get (not one per peer attempt), classified via a new chunk_get_outcome helper that treats Ok(None) as a load-shedding signal. Avoids the per-peer noise floor on the production network where some peers in any K=7 close group are unreachable from any given client even on a healthy link.
  4. ChannelStart::fetch: 64 → 4 cold-start. The original 64-wide initial burst would saturate residential connections before the controller had any observation to act on. 4 is the value confirmed safe on the operator's home link. On droplets the cost is a one-off cold-start warm-up of ~16 min on the first 2.5 GB file; subsequent files warm-start from the persisted snapshot (which reaches cap=256 cleanly).
  5. Deferred-retry pass in streaming_decrypt's consumer. When chunk_get returns Ok(None) for a chunk during a batch, the chunk is deferred rather than aborting the batch. After the main batch settles, the deferred chunks are retried serially with sleeps of 10/30/60 s, giving the link time to clear any transient saturation. A chunk only becomes fatal after all 3 deferred attempts fail.
  6. Per-peer protocol-error tolerance and deferred-retry transient-error tolerance. A single peer returning Error::Protocol (e.g., "Chunk verification failed" from a corrupted local copy) no longer aborts the close-group sweep — the loop counts it and continues to the next peer. Similarly, an Err(_) from chunk_get_observed during a deferred-retry attempt logs and falls through to the next attempt's longer backoff, rather than escalating.
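
Items 1 and 6 together define how a single close-group sweep ends. The sketch below illustrates that decision only, under stated assumptions: `PeerReply`, `SweepResult`, and `sweep_close_group` are hypothetical names, not the ant-core types, and real peer I/O is replaced by a slice of already-collected replies.

```rust
/// Hypothetical per-peer reply; the real ant-core types are not shown in this PR.
#[derive(Debug, Clone, PartialEq)]
pub enum PeerReply {
    /// The peer returned the chunk bytes.
    Found(Vec<u8>),
    /// Authoritative answer: the peer responded and does not hold the chunk.
    NotFound,
    /// Timeout or connection failure; says nothing about data absence.
    Unreachable,
    /// For example "Chunk verification failed" from a corrupted local copy.
    ProtocolError(String),
}

#[derive(Debug, PartialEq)]
pub enum SweepResult {
    Found(Vec<u8>),
    /// No peer returned the chunk. `retry` is false only when every queried
    /// peer answered NotFound (the unanimous-NotFound stop).
    Missing { retry: bool, protocol_errors: usize },
}

/// Walks the replies of one close-group sweep in order.
pub fn sweep_close_group(replies: &[PeerReply]) -> SweepResult {
    let mut not_found = 0usize;
    let mut protocol_errors = 0usize;
    for reply in replies {
        match reply {
            PeerReply::Found(bytes) => return SweepResult::Found(bytes.clone()),
            PeerReply::NotFound => not_found += 1,
            // Counted and tolerated: one bad peer no longer aborts the sweep.
            PeerReply::ProtocolError(_) => protocol_errors += 1,
            PeerReply::Unreachable => {}
        }
    }
    let unanimous = !replies.is_empty() && not_found == replies.len();
    SweepResult::Missing { retry: !unanimous, protocol_errors }
}
```

In this sketch a mix of NotFound and unreachable peers asks for the retry, because an unreachable peer may still hold the chunk.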

Also: latency_inflation_factor default 2.0 → 4.0 (cherry-picked from the previously-validated tune-latency-inflation-factor branch — natural close-group fallback latency on the production network routinely doubles vs the EWMA baseline, and was firing spurious Decrease decisions on the droplet).
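
To see why the factor matters, here is a minimal sketch of the kind of check this knob is assumed to feed: a window counts as congested when its p95 latency exceeds the factor times the EWMA baseline. The function name and the exact comparison are assumptions for illustration; the real limiter may differ.

```rust
/// Hypothetical check: is the window's p95 latency inflated relative to
/// the EWMA baseline by more than `factor`?
pub fn latency_inflated(p95_ms: f64, ewma_baseline_ms: f64, factor: f64) -> bool {
    // A zero or negative baseline means there is nothing to compare against yet.
    ewma_baseline_ms > 0.0 && p95_ms > factor * ewma_baseline_ms
}
```

With an 800 ms baseline, a window whose p95 is 1800 ms (a little over double) trips the check at factor 2.0 but not at 4.0, which matches the spurious-Decrease behaviour described above.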

What's not in here

  • No new user-facing CLI flags. All controls are internal adaptive knobs.
  • No retry behavior at the per-peer level inside chunk_get_from_peer. The retry happens at the close-group sweep level (once inside chunk_get, once at the deferred pass).

Evidence

The most recent end-to-end runs both completed cleanly:

Local download (residential connection, PROD-LOCAL-DL-04)

11/11 files completed including 2.51 GB and 2.76 GB downloads:

| # | Size | Duration | Started (UTC) |
|---|------|----------|---------------|
| 1 | 18 B | 24.2 s | 2026-05-22 14:43:33 |
| 2 | 150.4 KB | 28.8 s | 2026-05-22 14:43:57 |
| 3 | 15.0 MB | 42.7 s | 2026-05-22 14:44:26 |
| 4 | 2.51 GB | 42m 43.3s | 2026-05-22 14:45:09 |
| 5 | 2.76 GB | 46m 26.5s | 2026-05-22 15:27:52 |
| 6 | 65.9 MB | 1m 21.0s | 2026-05-22 16:14:19 |
| 7 | 6.0 MB | 32.5 s | 2026-05-22 16:15:40 |
| 8 | 802.6 KB | 37.4 s | 2026-05-22 16:16:12 |
| 9 | 2.2 MB | 36.6 s | 2026-05-22 16:16:50 |
| 10 | 12.8 MB | 50.6 s | 2026-05-22 16:17:26 |
| 11 | 961.6 KB | 25.3 s | 2026-05-22 16:18:17 |

Files 4 and 5 are the multi-GB workloads that previously aborted on the first close-group exhaustion, within the first few minutes. They now complete via the deferred-retry mechanism: during the earlier successful home test (PROD-LOCAL-DL-03), 14 chunks were deferred and every one recovered on attempt 1/3 after the 10 s sleep.

Droplet download (production, PROD-DL-05)

20/20 files completed:

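| # | Size | Duration | Notes |
|---|------|----------|-------|
| 1 | 18 B | 25.1 s | |
| 2 | 150.4 KB | 25.0 s | |
| 3 | 15.0 MB | 38.8 s | |
| 4 | 2.51 GB | 16m 9.4s | cold-start, only file paying warm-up cost |
| 5 | 2.76 GB | 6m 22.5s | |
| 6 | 65.9 MB | 48.0 s | |
| 7 | 6.0 MB | 26.8 s | |
| 8 | 802.6 KB | 30.1 s | |
| 9 | 2.2 MB | 25.2 s | |
| 10 | 12.8 MB | 47.7 s | |
| 11 | 961.6 KB | 19.1 s | |
| 12 | 1.77 GB | 3m 53.9s | |
| 13 | 2.25 GB | 5m 4.3s | |
| 14 | 2.34 GB | 3m 55.2s | |
| 15 | 2.28 GB | 4m 17.7s | |
| 16 | 2.47 GB | 5m 7.0s | |
| 17 | 2.55 GB | 5m 12.8s | |
| 18 | 2.50 GB | 4m 32.6s | |
| 19 | 2.75 GB | 5m 5.6s | |
| 20 | 2.84 GB | 4m 53.0s | |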

File #4 carries the cold-start cost: the adaptive limiter ramps from ChannelStart::fetch=4 through doublings to the channel ceiling of 256, and the snapshot persists at 256 for subsequent runs. Files #5 onwards run at near-baseline speeds: roughly 4 to 6.5 min per 2+ GB file, vs the pre-change baseline of ~5 min on PROD-DL-02.
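
As a quick check on the ramp arithmetic, the sketch below lists the caps visited when doubling from a start value up to a ceiling. It assumes an uninterrupted run of healthy windows (no Decrease in between) and says nothing about wall-clock time; `doublings_to_ceiling` is a hypothetical helper, not part of ant-core.

```rust
/// Lists every cap visited when doubling from `start` until `ceiling` is reached.
pub fn doublings_to_ceiling(start: u32, ceiling: u32) -> Vec<u32> {
    let mut caps = vec![start];
    let mut cap = start;
    while cap < ceiling {
        // max(1) keeps a zero start from looping forever; min() clamps at the ceiling.
        cap = cap.max(1).saturating_mul(2).min(ceiling);
        caps.push(cap);
    }
    caps
}
```

Starting at 4 the sequence is 4, 8, 16, 32, 64, 128, 256, so six clean doublings reach the ceiling, versus two from the old start of 64.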

Grepping the per-file logs on PROD-DL-05 shows none of the new retry mechanisms fired across the 20-file run — every chunk_get succeeded on its first close-group sweep. So this is a healthy-network success, not a "saved by deferred retry" success. The deferred-retry mechanism is proven on the home runs (14 successful recoveries in PROD-LOCAL-DL-03); the droplet just didn't need it.

Test plan

  • All 296 ant-core unit tests pass (cargo test -p ant-core --lib).
  • End-to-end residential download (PROD-LOCAL-DL-04): 11/11 files completed.
  • End-to-end droplet download (PROD-DL-05): 20/20 files completed.

🤖 Generated with Claude Code

Hardens the download path against two distinct failure modes observed
running `ant file download` against the production network: residential
link saturation and per-peer / DHT transient errors that previously
fatally aborted multi-hundred-chunk downloads.

End to end, this takes a residential `ant file download` from "aborts
on the first 256-wide concurrent batch's saturation event" to
"completes 11/11 files including 2+ GB downloads," and on a fat-pipe
droplet from "matches baseline" to "matches baseline" — no regression
on the warm-start path that production downloaders actually exercise.

Six related changes:

1. retry-on-Ok(None) with unanimous-NotFound threshold in chunk_get.
   When the close group returns Ok(None) (no peer returned the chunk),
   retry once with a fresh find_closest_peers lookup, unless every
   queried peer responded with an authoritative NotFound (the only
   safe stop for genuine data absence). The previous behaviour treated
   Ok(None) as fatal on first occurrence, which on a saturated link
   meant any single chunk's transient close-group exhaustion aborted
   the whole download.

2. rebucketed_unordered in file.rs instead of buffer_unordered for the
   in-flight chunk fetches. The adaptive limiter's cap can now shrink
   the in-flight count mid-batch under sustained pressure;
   buffer_unordered snapshotted the cap once at pipeline build and
   ignored later Decrease decisions.

3. observe-outer with Ok(None) -> Outcome::Timeout instead of
   observe-per-peer. The controller sees one observation per chunk_get
   (not one per peer attempt), classified via a new chunk_get_outcome
   helper that treats Ok(None) as a load-shedding signal. Avoids the
   per-peer noise floor on the production network where some peers in
   any K=7 close group are unreachable from any given client even on a
   healthy link — that noise was driving spurious Decrease decisions
   on the droplet and pinning steady-state cap low.

4. ChannelStart::fetch: 64 -> 4 cold-start. The 64-wide initial burst
   saturated residential connections before the controller had any
   observation to act on. 4 is the value confirmed safe on a real
   residential link. On droplets the cost is a one-off cold-start
   warm-up of ~16 min on the first 2.5 GB file; subsequent files
   warm-start from the persisted client_adaptive.json snapshot (which
   the controller cleanly grows to cap=256, the channel ceiling).

5. Deferred-retry pass in streaming_decrypt's consumer. When chunk_get
   returns Ok(None) for a chunk during a batch, the chunk is deferred
   rather than aborting the batch. After the main batch settles, the
   deferred chunks are retried serially with sleeps of 10/30/60 s.
   This rides out transient saturation events that hit multiple
   in-flight chunks at once — by the time the batch has drained and
   the first sleep elapses, the link has usually settled. A chunk
   only becomes fatal after all 3 deferred attempts fail.

6. Per-peer protocol-error tolerance and deferred-retry transient-error
   tolerance. A single peer returning Error::Protocol (e.g. "Chunk
   verification failed" from a corrupted local copy) no longer aborts
   the close-group sweep — the loop counts it and continues to the
   next peer. Similarly, an Err(_) from chunk_get_observed during a
   deferred-retry attempt logs and falls through to the next attempt's
   longer backoff rather than escalating.
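
The observe-outer classification in item 3 can be pictured as a single mapping from the outer chunk_get result to one controller observation. This is a sketch under assumptions: only `Outcome::Timeout` and the Ok(None) mapping come from the text above; the other variants, the generic signature, and the `_sketch` name are hypothetical.

```rust
/// Hypothetical observation type. Only `Timeout` is named in this commit;
/// `Success` and `Error` are assumed for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Success,
    Timeout,
    Error,
}

/// One observation per outer chunk_get call, not one per peer attempt.
pub fn chunk_get_outcome_sketch<T, E>(result: &Result<Option<T>, E>) -> Outcome {
    match result {
        Ok(Some(_)) => Outcome::Success,
        // Close-group exhaustion is reported as a load-shedding signal.
        Ok(None) => Outcome::Timeout,
        // How real errors are classified is an assumption in this sketch.
        Err(_) => Outcome::Error,
    }
}
```

Because the classifier only sees the final result, an unreachable peer inside an otherwise successful sweep produces no negative observation.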

Also: latency_inflation_factor default 2.0 -> 4.0. Natural close-group
fallback latency on the production network routinely doubles vs the
EWMA baseline (a single peer hitting fallback adds ~10 s on top of a
sub-second median), and was firing spurious Decrease decisions even
on the droplet. 4.0 is the value validated on the previously-merged
tune-latency-inflation-factor branch.

Test plan:
  - 296 ant-core unit tests pass.
  - End-to-end residential download (PROD-LOCAL-DL-04): 11/11 files
    completed including 2.51 GB (42m 43s) and 2.76 GB (46m 26s).
    During an earlier residential run 14 chunks went through the
    deferred-retry path and every one recovered on attempt 1/3 after
    the 10 s sleep.
  - End-to-end droplet download (PROD-DL-05): 20/20 files completed.
    The first 2.5 GB file paid 16m 9s of cold-start cost; subsequent
    2+ GB files ran in roughly 4 to 6.5 min each, near the pre-change
    ~5 min baseline on PROD-DL-02. No retry mechanism fired across the
    20-file run, so this was a healthy-network success.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jacderida force-pushed the home-download-fixes branch from e8b1de5 to 13070d9 on May 22, 2026 17:17

The previous commit fixed residential saturation and abort-on-first-
failure, but field reports from fast connections (e.g. an Oracle VPS
that gets full speed on the released client) showed the opposite
problem: the fetch cap stayed pinned at ~13-24 across an entire 36-file
run and never climbed toward the 256 ceiling, so multi-GB files took
~22 min each instead of ~5.

Three compounding causes, all from having tuned exclusively for the
saturated-home case:

1. Cap can't grow. AIMD exits slow-start permanently on the first
   Decrease, then grows +1 per 32-observation window. On a link with a
   steady ~4% close-group-exhaustion trickle, intermittent Decreases
   fire often enough that additive +1 never gets ahead — equilibrium
   ~20. Additive growth simply cannot reach a useful cap from a low
   base before a file finishes.

   Fix: add `LimiterConfig::slow_start_ramp_threshold`. Below it, a
   Decrease still halves the cap but keeps slow-start armed, so the
   next healthy window doubles back up instead of crawling. The fetch
   channel sets it to the channel ceiling, so download concurrency
   tracks the connection's real capacity. Default 0 preserves the
   original behaviour for quote/store.

2. The p95-latency Decrease misfires on fetch. `chunk_get_observed`'s
   latency includes the internal 1 s retry sleep and the slow retry
   sweep for chunks that needed one, so a window with a couple of
   retry-path chunks has a wildly inflated p95 that reads as
   congestion. Fix: add `LimiterConfig::latency_decrease_enabled`,
   false for fetch. Genuine fetch congestion still surfaces via the
   Ok(None) -> Timeout rate, which the timeout_ceiling check catches.

3. The deferred-retry pass was a throughput sink. It retried deferred
   chunks SERIALLY with a mandatory 10 s pre-sleep each; a batch that
   deferred ~20 chunks burned minutes of near-zero throughput even
   though every chunk succeeded on its first retry (the 10 s sleep was
   pure waste — the deferrals were peer-side noise that clears in <1 s).
   Fix: retry deferred chunks in CONCURRENT rounds reusing the fetch
   limiter, with the first round firing immediately and later rounds
   backing off (0/15/45 s) only for chunks that survive a round. Both
   Ok(None) and transient errors re-defer to the next round; only the
   final round's leftovers are fatal.
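
The round structure of the reworked deferred-retry pass in item 3 can be sketched as below. Assumptions are loud here: `retry_deferred`, the `u32` chunk ids, and the injected `fetch` and `sleep` closures are all hypothetical, and chunks inside a round are fetched one after another for simplicity, whereas the change described above runs each round concurrently under the fetch limiter.

```rust
use std::time::Duration;

/// Backoff before each round, in seconds (0/15/45 as described above).
pub const ROUND_BACKOFF_SECS: [u64; 3] = [0, 15, 45];

/// Retries deferred chunks in rounds. Returns the recovered chunks, or the
/// ids still missing after the final round (the only fatal case).
pub fn retry_deferred<F, S>(
    mut pending: Vec<u32>,
    mut fetch: F,
    mut sleep: S,
) -> Result<Vec<(u32, Vec<u8>)>, Vec<u32>>
where
    F: FnMut(u32, usize) -> Result<Option<Vec<u8>>, String>,
    S: FnMut(Duration),
{
    let mut recovered = Vec::new();
    for (round, secs) in ROUND_BACKOFF_SECS.iter().enumerate() {
        if pending.is_empty() {
            break;
        }
        // The first round fires immediately; later rounds back off.
        if *secs > 0 {
            sleep(Duration::from_secs(*secs));
        }
        let mut still_pending = Vec::new();
        for chunk in pending {
            match fetch(chunk, round) {
                Ok(Some(bytes)) => recovered.push((chunk, bytes)),
                // Both Ok(None) and transient errors re-defer to the next round.
                Ok(None) | Err(_) => still_pending.push(chunk),
            }
        }
        pending = still_pending;
    }
    if pending.is_empty() {
        Ok(recovered)
    } else {
        Err(pending)
    }
}
```

Injecting `sleep` keeps the sketch testable without real waits; a caller that wants real backoff could pass `std::thread::sleep`.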

Quote and store channel behaviour is unchanged (threshold 0,
latency-decrease enabled). New unit tests cover protected-vs-additive
recovery, the disabled latency check, and that the controller applies
the download tuning to fetch only.
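
The protected-vs-additive recovery mentioned above can be illustrated with a toy limiter. Only the two field names `slow_start_ramp_threshold` and `latency_decrease_enabled` come from this commit; every other name, the one-verdict-per-window model, and the choice to compare the halved cap against the threshold are assumptions made for the sketch.

```rust
#[derive(Debug, Clone, Copy)]
pub struct LimiterConfig {
    pub min_cap: usize,
    pub max_cap: usize,
    /// From the commit: 0 preserves the original behaviour.
    pub slow_start_ramp_threshold: usize,
    /// From the commit: false for the fetch channel.
    pub latency_decrease_enabled: bool,
}

/// Hypothetical summary of one observation window.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WindowVerdict {
    Healthy,
    TimeoutCeilingExceeded,
    LatencyInflated,
}

#[derive(Debug)]
pub struct Limiter {
    pub cfg: LimiterConfig,
    pub cap: usize,
    pub slow_start: bool,
}

impl Limiter {
    pub fn new(cfg: LimiterConfig, start: usize) -> Self {
        Limiter { cfg, cap: start.clamp(cfg.min_cap, cfg.max_cap), slow_start: true }
    }

    pub fn on_window(&mut self, verdict: WindowVerdict) {
        let decrease = match verdict {
            WindowVerdict::Healthy => false,
            WindowVerdict::TimeoutCeilingExceeded => true,
            // Ignored when the latency check is disabled (the fetch channel).
            WindowVerdict::LatencyInflated => self.cfg.latency_decrease_enabled,
        };
        if decrease {
            self.cap = (self.cap / 2).max(self.cfg.min_cap);
            // Assumption: protection applies while the halved cap is below the threshold.
            let protected = self.cap < self.cfg.slow_start_ramp_threshold;
            self.slow_start = self.slow_start && protected;
        } else if self.slow_start {
            self.cap = self.cap.saturating_mul(2).min(self.cfg.max_cap);
        } else {
            self.cap = (self.cap + 1).min(self.cfg.max_cap);
        }
    }
}
```

With the threshold at the ceiling, a Decrease from 32 drops to 16 and the next healthy window returns to 32; with threshold 0 the same sequence recovers only to 17.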

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>