fix(cli): keep doctor resilient to a corrupt browser cache#1822
Conversation
A partial or corrupt browser cache (a stub file where a version directory is expected, a missing executable, or malformed metadata) makes getInstalledBrowsers throw ENOTDIR. That throw propagated up through findBrowser -> checkChrome -> runEnvironmentChecks, and since doctor.run calls runEnvironmentChecks before any try/catch or the --json output, the command crashed with exit 1. doctor --json is documented to exit 0 even when checks fail, so it must report a corrupt cache as "Chrome not found", not crash on it. - checkChrome now catches any error from findBrowser and converts it to the existing ok:false "Chrome not found" outcome with the browser ensure hint, so runEnvironmentChecks never throws for a missing or corrupt browser. - findFromCache treats a throwing getInstalledBrowsers as "no cached browser", letting resolution fall through to system/download instead of crashing every caller (render included), not just doctor. A healthy browser still reports ok:true. Adds a preflight test asserting an ok:false Chrome outcome when discovery throws, instead of propagating.
james-russo-rames-d-jusso
left a comment
There was a problem hiding this comment.
Reviewed at 3647ef78b1a48174e2b9bda975bd989583c172e8. Diff is 44 additions / 3 deletions across manager.ts, preflight.ts, and the preflight test. The stated symptom is real (unguarded getInstalledBrowsers throwing ENOTDIR on a stub file bubbles to a crash-with-no-JSON on doctor --json, violating its documented exit-0 contract), and the fix mechanically matches the diagnosis. Two concerns before it lands, one nit.
🟡 Silent-swallow in findFromCache — no observability signal on the recovery path
packages/cli/src/browser/manager.ts:108-116 — the new try/catch converts the throw into installed = [], then the function falls through to findFromSystem / download. Zero log line, zero telemetry event, zero exposure of "the cache was corrupt on this run". Consequences:
- A user whose HF cache goes corrupt (partial download, disk-full during install,
rm -rfhalf-completed) sees no signal that anything is unusual — they'll just experience "download takes a while again" and shrug. Repeated corruption is invisible to the team. - Debugging a real "why is
doctorslow now" report has to be done by inspection: someone has to notice the cache dir is malformed, since neitherdoctornor the CLI ever emits the "your cache is corrupt, falling back to X" state. - This is precisely the class of bug the
doctorcommand exists to surface. Silencing it inside the cache-read helper defeats the purpose of runningdoctoron it.
Suggested minimum: console.warn with the error's .code / .message from the catch, mirrored to the existing stale hyperframes cache warn at manager.ts:126 for consistency. Optional next-level: emit a trackBrowserInstall-style telemetry event so the team can see the counter.
The checkChrome catch in preflight.ts:145-149 is a different animal — it's already surfacing the outcome via the "Chrome not found" report line, which IS the doctor signal. That one's fine as-is (the report line is the observability). The manager.ts one is not.
🟡 Root-cause vs. defense-in-depth — no accompanying investigation of what produces the corruption
The PR body describes the corruption as a static state ("stub file where the directory belongs") and doesn't identify a writer-side bug that PRODUCES that state. Two possibilities:
- This is genuinely external corruption (user manually pokes the cache, partial download from a killed
browser ensure, filesystem-level issues). Then the defense-in-depth here is correct AND we should still ask: doesbrowser ensureclean up half-downloaded stub files onSIGINT/ on download-failure? If not, the CLI itself is a corruption source, and this PR is treating the symptom. - There is a writer-side bug (e.g. the download-in-progress leaves stub files that don't get cleaned up on interrupt). Then this PR is fine as a hardening pass, but a sibling PR should exist for the writer.
Worth a one-line acknowledgement in the PR body: "the corruption source is [X] and [is / is not] separately addressed." If (2) applies, a sibling PR reference would land this a lot cleaner.
The hyperframes browser clear command exists (per browser.ts examples) — is it aggressive enough to wipe a corrupt cache state? A user hitting the crash today has no in-CLI escape hatch other than rm -rf ~/.cache/hyperframes/chrome, which is what happens if browser clear implements what its docstring says. Worth confirming.
🟢 Fix mechanically matches the diagnosis
findFromCachecatches the throw fromgetInstalledBrowsersand returns[], so callers fall through to system / download as intended.checkChromecatches fromfindBrowser()and produces the exact "Chrome not found" outcome thedoctorcontract promises.ok:false, correcthint,runEnvironmentChecksno longer throws for this case.- The
browser ensurecommand atbrowser.ts:33,66(which callsfindBrowser()directly, outsidecheckChrome) also benefits —findBrowserno longer throws becausefindFromCacheno longer throws — so the render/preview/inspect paths inherit the resilience. Good coverage for the whole CLI surface, not justdoctor. - Preflight test at
preflight.test.ts:68-90correctly asserts the outcome shape (ok:false, "Chrome not found", ensure hint). Thetry/finallyaroundspy.mockRestore()is defensive-correct.
🟢 Doctor's documented --json exit-0 contract is preserved
The PR body correctly identifies the contract (stale-version check returning ok:false must not poison CI), and the fix maintains it. Chrome-not-found is now reported the same way as ffmpeg-not-found — via the outcomes list, not via a crash.
Nits
findFromCacheat manager.ts:108 — the catch block eats the specific error. If you keep the silent-swallow, consider at least destructuring for.codeand re-checking againstENOTDIR/ENOENT/EACCES(the plausible crash surfaces), so a genuine bug (like a permission error the user should know about) isn't hidden as "no cache found". Something like:This preserves the resilience for the documented failure modes, keeps a signal on the surprising ones.} catch (err) { const code = (err as NodeJS.ErrnoException)?.code; if (code && !["ENOTDIR", "ENOENT", "EACCES"].includes(code)) { console.warn(`[browser] Unexpected cache read error (${code}); treating as no cached browser.`); } installed = []; }
- The
checkChromecomment (preflight.ts:141-146) is great; thefindFromCachecomment (manager.ts:108-111) is fine but could reference the siblingcheckChromecomment so the two hardenings are cross-linked in future greps.
What I didn't verify
- Whether
browser ensure --force(orbrowser clear) cleanly wipes a stub-file cache before re-downloading. If not, a user hitting the corruption today may still be stuck. - Whether the test injection via
vi.spyOn(manager, "findBrowser")is actually intercepting the call —preflight.tsimportsfindBrowserat module scope, so ESM binding semantics can defeat namespace-spying. Green CI (1083 tests) suggests it's fine, but worth an eye if you extend the pattern. - Whether there are other callers of
getInstalledBrowsers(orreaddirSyncon the cache dir) beyondfindFromCachethat could still crash on the same corruption.
Non-blocking overall. The doctor-contract fix is correct and the callers-of-findBrowser inherit the resilience for free. The observability gap in the silent-swallow is the item worth addressing before merge — every other line of this PR is defensive-programming to make the failure LEGIBLE, so making the recovery itself invisible cuts against the PR's own thesis.
— Review by Rames D Jusso
james-russo-rames-d-jusso
left a comment
There was a problem hiding this comment.
R2 @ cf2d7d85e36a23490ec44dc3c3ca16e44c8023c9 — verifying against my R1 at 3647ef78.
R1 findings
🟡 (a) Silent-swallow in findFromCache — no observability signal → ✅ RESOLVED
manager.ts:113-119 now emits a structured warn on the recovery path:
} catch (err) {
const code = (err as NodeJS.ErrnoException | undefined)?.code;
const suffix = code ? ` (${code})` : "";
console.warn(
`[hyperframes] Browser cache read failed${suffix}: ${normalizeErrorMessage(err)}. Falling back to system Chrome or a fresh download.`,
);
installed = [];
}
- Structured — carries the error code (
ENOTDIR,ENOENT,EACCESall captured) in parens +normalizeErrorMessage(err)for the message body. - Prefixed
[hyperframes]for greppability in a mixed-log environment. - Wording explicitly names the recovery path ("Falling back to system Chrome or a fresh download") so a user grepping their log after a slow
doctorrun sees the diagnosis without needing to inspect the cache dir.
NEW regression test at manager.test.ts:150-166 — asserts BOTH the outcome ({ executablePath: SYSTEM_CHROME, source: "system" }) AND the warn (toContain("Browser cache read failed (ENOTDIR)") + toContain("Falling back to system Chrome")). That's the observability bug-pin I was asking for.
The R1 nit-level suggestion (destructure code and filter surprise-vs-expected errors) was deliberately not taken — the current code warns on ALL codes. That's the more conservative choice (never silent, even on surprising failures) and aligns with the PR's "failures should be legible" thesis; happy with it as-is.
🟡 (b) Writer-side root-cause investigation → ✅ RESOLVED (acknowledged + deferred)
PR body now has an explicit "Corruption source" section:
This PR treats the observed state as external/partial-cache corruption and keeps the CLI resilient when it happens. The writer-side path still has the existing recovery command:
hyperframes browser clearremoves the managed cache, andhyperframes browser ensurerecreates it. If we later findbrowser ensureitself leaves this stub state on interrupt/download failure, that should be handled as a sibling writer-side cleanup PR; this PR keepsdoctor, render, preview, and inspect from crashing on the read side.
That's the acknowledgment I was after — read-side scope is bounded, writer-side is (a) accessible via existing recovery command AND (b) explicitly punted to a follow-up if browser ensure turns out to be a corruption source. Closes the finding per feedback_open_item_alternative_resolution (operational goal — reviewer + future maintainer can see the scope decision — is met via the doc, no code required).
Adjacent-file check
preflight.ts:141-154 — the try { info = await findBrowser(); } catch { info = undefined; } still uses a bare catch (no observability). That's intentional and fine here: the outcome path already emits ok:false + the "Chrome not found" title with the npx hyperframes browser ensure hint, which IS the doctor signal. Warning at THIS level would be noise (user already sees the outcome in the report). The manager.ts warn covers the render/preview/inspect paths where there's no outcomes-list to surface the signal. Correct split.
The new manager.test.ts regression test at test.ts:150-166 uses installedInHfCacheError (new option) to inject a rejected getInstalledBrowsers, then asserts fallback to system Chrome. It also calls _resetSystemFallbackWarnForTests() to defeat the once-flag on the system-Chrome warn — good hygiene, otherwise test isolation depends on invocation order.
Nits from R1
- Error-code discrimination (nit-level) — Not addressed; current code warns on all
.codevalues. Non-blocking as noted above; conservative choice. - Cross-link between the two catch comments — Not addressed;
findFromCachecatch andcheckChromecatch each have their own comment. Fine — the two are logically distinct (one is defensive-cache-read, one is doctor-outcome-adapter) and don't need cross-referencing at this scale.
CI status
BLOCKED (needs 1 approval), MERGEABLE. Required checks green on R2 push; a few CI jobs (CodeQL javascript-typescript, some Windows render + Typecheck/Build/Test) still IN_PROGRESS on the current push.
Peer reviews since R1
None. Only my R1 review on the record.
R2 verdict
LGTM — both 🟡 findings resolved (structured observability warn with test coverage + PR body scope acknowledgment with deferred writer-side path). Doctor's --json exit-0 contract preserved; render/preview/inspect paths inherit the resilience. Ships cleanly.
— Review by Rames D Jusso
Problem
hyperframes doctor --jsonis documented to exit 0 even when individual checks fail (a stale-version check returningok:falsemust not poison a CI pipeline runningdoctor --json). But when the Chrome headless shell cache is corrupt or partial (folder present, stub files, no real executable),doctorcrashes with exit 1 and prints no JSON at all, instead of reporting "Chrome not found" and exiting 0.Root cause
doctor.run()callsrunEnvironmentChecks({ includeBrowser: true })as its first line, before any try/catch or the--jsonoutput branch. That reachescheckChrome(), which callsawait findBrowser()with no error handling.findBrowser->findFromCachecallsgetInstalledBrowsers({ cacheDir }), which doesfs.readdirSyncon each browser sub-directory. A stub file where thechrome-headless-shelldirectory is expected makes that throwENOTDIR, and the unguarded throw propagates up to the CLI runner, producing an exit-1 crash on the exact condition the check exists to report.Corruption source
This PR treats the observed state as external/partial-cache corruption and keeps the CLI resilient when it happens. The writer-side path still has the existing recovery command:
hyperframes browser clearremoves the managed cache, andhyperframes browser ensurerecreates it. If we later findbrowser ensureitself leaves this stub state on interrupt/download failure, that should be handled as a sibling writer-side cleanup PR; this PR keepsdoctor, render, preview, and inspect from crashing on the read side.Reproduce
Plant a stub file where the cached browser directory belongs, then run the built CLI:
Before: exits 1 with
ENOTDIR: not a directory, scandir '.../chrome-headless-shell', no JSON printed.Fix
checkChrome(preflight.ts) wraps thefindBrowser()call: any error is caught and converted into the existing cleanok:false"Chrome not found" outcome with thenpx hyperframes browser ensurehint.runEnvironmentChecksno longer throws for a missing or corrupt browser.findFromCache(manager.ts) treats a throwinggetInstalledBrowsersas "no cached browser", so resolution falls through to system/download instead of crashing every caller (the render path benefits too, not justdoctor).findFromCachefallback now emits a warning with the cache-read error code/message, so corrupt-cache recovery is visible instead of silently looking like a normal cache miss.The documented
--jsonexit-0 behavior is unchanged, anddoctordoes not swallow any unrelated error. A healthy browser still reportsok:true.Verification
doctor --jsonexits 0, prints its full report, and the Chrome check isok:falsewith the ensure hint. A healthy cache still reportsok:true.ok:false"Chrome not found" outcome when browser discovery throws, instead of propagating.bun run --filter @hyperframes/cli test -- src/browser/manager.test.ts src/browser/preflight.test.tsgreen locally (18 tests).oxfmt+oxlintclean on changed files. Pre-commit fallow / typecheck / commitlint gates pass.