fix(search): survive the transient I/O abort (os 995) during a parked-drive re-warm #469

Merged: githubrobbi merged 1 commit into main from fix/search-resilient-to-rewarm-abort on Jun 20, 2026
@githubrobbi

Collaborator

Symptom

The first search after the daemon idle-parks drives can fail:

Error: Daemon search_cli failed
  Caused by: I/O error: The I/O operation has been aborted because of either a
  thread exit or an application request. (os error 995)

…and a manual retry then works (the re-warm finished in the background). Observed on Windows after ~3 h idle (7 drives, 26 M records): the search waited out the re-warm, then died at the end.

Root cause

os error 995 = ERROR_OPERATION_ABORTED: the OS cancels a pending overlapped read when the thread that issued it is reaped. During a parked-drive re-warm, DiskBodyLoader::load → load_drive_with_usn_refresh reads MFT extents over the broker's FILE_FLAG_OVERLAPPED handle from spawn_blocking/rayon workers that recycle; a pending read on a retiring thread gets aborted. The read itself is valid and only needs reissuing, but the abort propagated out as a hard search failure.
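
The classification step described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the function name `is_operation_aborted` comes from the description, but its signature and the constant are assumptions.

```rust
use std::io;

/// Win32 `ERROR_OPERATION_ABORTED`, reported by Rust as "os error 995".
const ERROR_OPERATION_ABORTED: i32 = 995;

/// Returns true only for the transient abort; every other error is treated
/// as real. Hypothetical signature: the PR names this function but does not
/// show its body.
fn is_operation_aborted(err: &io::Error) -> bool {
    err.raw_os_error() == Some(ERROR_OPERATION_ABORTED)
}
```

Matching on the raw OS code rather than on `io::ErrorKind` keeps the check narrow, so an access-denied error (os error 5) or a synthetic error without an OS code is never retried.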

Fix — root + safety net

  1. Root (uffs-mft read_handle_at): bounded retry on ERROR_OPERATION_ABORTED. Split into a read_handle_at retry wrapper + read_handle_at_once single attempt, with an is_operation_aborted classifier. Every broker-handle read flows through this primitive, so the whole re-warm read path becomes abort-resilient.
  2. Safety net (uffs-cli search_retry): new module — search_cli_with_warm_retry retries the search with back-off on the os-995 transient and prints "UFFS index is warming up — retrying…" to stderr (never stdout/CSV) instead of a raw error. Catches the transient regardless of origin; real errors still fail fast.
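
A rough sketch of how the two layers could share one bounded-retry helper is shown below. The names `read_handle_at` and `search_cli_with_warm_retry` come from the description; everything else (the `retry_on_abort` helper, the closure-based signatures, the attempt budget of 4, the exponential back-off) is assumed for illustration and may differ from the merged code.

```rust
use std::io;
use std::thread;
use std::time::Duration;

const ERROR_OPERATION_ABORTED: i32 = 995;
/// Assumed retry budget; the PR only says "bounded".
const MAX_ATTEMPTS: u32 = 4;

fn is_operation_aborted(err: &io::Error) -> bool {
    err.raw_os_error() == Some(ERROR_OPERATION_ABORTED)
}

/// Hypothetical shared helper: re-runs `attempt` while it fails with
/// os error 995, calling `on_retry` and sleeping `base_delay * 2^n` between
/// tries. Any other error, or the last 995, is returned unchanged.
fn retry_on_abort<T>(
    base_delay: Duration,
    mut on_retry: impl FnMut(),
    mut attempt: impl FnMut() -> io::Result<T>,
) -> io::Result<T> {
    let mut tries: u32 = 0;
    loop {
        tries += 1;
        match attempt() {
            Err(err) if is_operation_aborted(&err) && tries < MAX_ATTEMPTS => {
                on_retry();
                thread::sleep(base_delay * 2u32.pow(tries - 1));
            }
            other => return other,
        }
    }
}

/// Root layer: `read_once` stands in for `read_handle_at_once`, the single
/// overlapped read attempt. Retries immediately, without a delay.
fn read_handle_at(
    offset: u64,
    buf: &mut [u8],
    mut read_once: impl FnMut(u64, &mut [u8]) -> io::Result<usize>,
) -> io::Result<usize> {
    retry_on_abort(Duration::ZERO, || {}, || read_once(offset, &mut *buf))
}

/// Safety net: `search` stands in for `search_cli_raw`. The notice goes to
/// stderr so stdout/CSV output stays clean (message text as quoted in the PR).
fn search_cli_with_warm_retry<T>(
    base_delay: Duration,
    search: impl FnMut() -> io::Result<T>,
) -> io::Result<T> {
    retry_on_abort(
        base_delay,
        || eprintln!("UFFS index is warming up — retrying…"),
        search,
    )
}
```

Keeping the classifier as the only retry condition is what lets real errors fail fast: a non-995 error falls through to the `other => return other` arm on the first attempt.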

Tests / verification

  • warming_abort_matches_only_995 unit test: retry only on 995, not on os error 5 / ConnectionClosed.
  • macOS + x86_64-pc-windows-msvc builds, macOS + Windows clippy clean, full pre-push gate green.
  • New module keeps main.rs within the 800-LOC policy.
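
For illustration, a unit test along the lines of `warming_abort_matches_only_995` might look like the sketch below. The `ClientError` enum and the `is_warming_abort` helper are hypothetical stand-ins; the PR does not show the real error type or how the CLI inspects it.

```rust
use std::io;

/// Hypothetical client-side error type; the real one is not shown in the PR.
#[derive(Debug)]
enum ClientError {
    Io(io::Error),
    ConnectionClosed,
}

/// Assumed classifier: only an I/O error carrying os error 995 counts as
/// the warming-up transient.
fn is_warming_abort(err: &ClientError) -> bool {
    matches!(err, ClientError::Io(e) if e.raw_os_error() == Some(995))
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn warming_abort_matches_only_995() {
        // The transient abort is retried.
        assert!(is_warming_abort(&ClientError::Io(io::Error::from_raw_os_error(995))));
        // Access denied (os error 5) is a real failure.
        assert!(!is_warming_abort(&ClientError::Io(io::Error::from_raw_os_error(5))));
        // A dropped connection is a real failure as well.
        assert!(!is_warming_abort(&ClientError::ConnectionClosed));
    }
}
```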

🤖 Generated with Claude Code

…-drive re-warm

A search landing while the daemon re-warms parked drives could fail with
`Daemon search_cli failed → I/O error … (os error 995)`. 995 is
`ERROR_OPERATION_ABORTED`: the OS cancels a *pending* overlapped read when the
thread that issued it is reaped (tokio `spawn_blocking` / rayon worker recycling
during the re-warm, whose USN-refresh reads MFT extents over the broker's
overlapped handle). The read is valid — it just needs reissuing — but it
propagated straight out as a hard search failure (retrying by hand worked).

Two layers, root + safety net:

1. **Root (uffs-mft `read_handle_at`):** wrap the overlapped read in a bounded
   retry on `ERROR_OPERATION_ABORTED`. Every broker-handle read goes through
   this primitive, so the whole re-warm read path becomes abort-resilient. Split
   into a `read_handle_at` retry wrapper + `read_handle_at_once` attempt;
   `is_operation_aborted` classifies the 995.

2. **Safety net (uffs-cli `search_retry`):** a new module wraps `search_cli_raw`
   in `search_cli_with_warm_retry` — bounded retries with back-off when it sees
   the os-995 transient, printing "UFFS index is warming up — retrying…" on
   stderr (never stdout) instead of a raw I/O error. Catches the transient
   regardless of which read produced it; real errors still fail fast.

Verified: macOS + windows-msvc builds, macOS + Windows clippy clean, and a
`warming_abort_matches_only_995` unit test (retry only on 995, not real errors).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

githubrobbi enabled auto-merge (squash) June 20, 2026 18:22
githubrobbi merged commit d3872ca into main Jun 20, 2026
21 checks passed
githubrobbi deleted the fix/search-resilient-to-rewarm-abort branch June 20, 2026 18:37