Skip to content

fix(cli): keep doctor resilient to a corrupt browser cache#1822

Merged
miguel-heygen merged 2 commits into
mainfrom
fix/doctor-resilient-to-broken-browser-cache
Jul 1, 2026
Merged

fix(cli): keep doctor resilient to a corrupt browser cache#1822
miguel-heygen merged 2 commits into
mainfrom
fix/doctor-resilient-to-broken-browser-cache

Conversation

@miguel-heygen

@miguel-heygen miguel-heygen commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

Problem

hyperframes doctor --json is documented to exit 0 even when individual checks fail (a stale-version check returning ok:false must not poison a CI pipeline running doctor --json). But when the Chrome headless shell cache is corrupt or partial (folder present, stub files, no real executable), doctor crashes with exit 1 and prints no JSON at all, instead of reporting "Chrome not found" and exiting 0.

Root cause

doctor.run() calls runEnvironmentChecks({ includeBrowser: true }) as its first line, before any try/catch or the --json output branch. That reaches checkChrome(), which calls await findBrowser() with no error handling. findBrowser -> findFromCache calls getInstalledBrowsers({ cacheDir }), which does fs.readdirSync on each browser sub-directory. A stub file where the chrome-headless-shell directory is expected makes that throw ENOTDIR, and the unguarded throw propagates up to the CLI runner, producing an exit-1 crash on the exact condition the check exists to report.

Corruption source

This PR treats the observed state as external/partial-cache corruption and keeps the CLI resilient when it happens. The writer-side path still has the existing recovery command: hyperframes browser clear removes the managed cache, and hyperframes browser ensure recreates it. If we later find browser ensure itself leaves this stub state on interrupt/download failure, that should be handled as a sibling writer-side cleanup PR; this PR keeps doctor, render, preview, and inspect from crashing on the read side.

Reproduce

Plant a stub file where the cached browser directory belongs, then run the built CLI:

mkdir -p ~/.cache/hyperframes/chrome
echo stub > ~/.cache/hyperframes/chrome/chrome-headless-shell   # file, not a dir
node packages/cli/dist/cli.js doctor --json

Before: exits 1 with ENOTDIR: not a directory, scandir '.../chrome-headless-shell', no JSON printed.

Fix

  • checkChrome (preflight.ts) wraps the findBrowser() call: any error is caught and converted into the existing clean ok:false "Chrome not found" outcome with the npx hyperframes browser ensure hint. runEnvironmentChecks no longer throws for a missing or corrupt browser.
  • findFromCache (manager.ts) treats a throwing getInstalledBrowsers as "no cached browser", so resolution falls through to system/download instead of crashing every caller (the render path benefits too, not just doctor).
  • The findFromCache fallback now emits a warning with the cache-read error code/message, so corrupt-cache recovery is visible instead of silently looking like a normal cache miss.

The documented --json exit-0 behavior is unchanged, and doctor does not swallow any unrelated error. A healthy browser still reports ok:true.

Verification

  • Repro re-run after the fix: doctor --json exits 0, prints its full report, and the Chrome check is ok:false with the ensure hint. A healthy cache still reports ok:true.
  • New preflight test asserts an ok:false "Chrome not found" outcome when browser discovery throws, instead of propagating.
  • New manager test asserts the corrupt-cache warning is emitted and browser resolution falls through to system Chrome.
  • bun run --filter @hyperframes/cli test -- src/browser/manager.test.ts src/browser/preflight.test.ts green locally (18 tests).
  • oxfmt + oxlint clean on changed files. Pre-commit fallow / typecheck / commitlint gates pass.

A partial or corrupt browser cache (a stub file where a version directory
is expected, a missing executable, or malformed metadata) makes
getInstalledBrowsers throw ENOTDIR. That throw propagated up through
findBrowser -> checkChrome -> runEnvironmentChecks, and since doctor.run
calls runEnvironmentChecks before any try/catch or the --json output, the
command crashed with exit 1.

doctor --json is documented to exit 0 even when checks fail, so it must
report a corrupt cache as "Chrome not found", not crash on it.

- checkChrome now catches any error from findBrowser and converts it to the
  existing ok:false "Chrome not found" outcome with the browser ensure hint,
  so runEnvironmentChecks never throws for a missing or corrupt browser.
- findFromCache treats a throwing getInstalledBrowsers as "no cached
  browser", letting resolution fall through to system/download instead of
  crashing every caller (render included), not just doctor.

A healthy browser still reports ok:true. Adds a preflight test asserting an
ok:false Chrome outcome when discovery throws, instead of propagating.

@james-russo-rames-d-jusso james-russo-rames-d-jusso left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed at 3647ef78b1a48174e2b9bda975bd989583c172e8. Diff is 44 additions / 3 deletions across manager.ts, preflight.ts, and the preflight test. The stated symptom is real (unguarded getInstalledBrowsers throwing ENOTDIR on a stub file bubbles to a crash-with-no-JSON on doctor --json, violating its documented exit-0 contract), and the fix mechanically matches the diagnosis. Two concerns before it lands, one nit.

🟡 Silent-swallow in findFromCache — no observability signal on the recovery path

packages/cli/src/browser/manager.ts:108-116 — the new try/catch converts the throw into installed = [], then the function falls through to findFromSystem / download. Zero log line, zero telemetry event, zero exposure of "the cache was corrupt on this run". Consequences:

  1. A user whose HF cache goes corrupt (partial download, disk-full during install, rm -rf half-completed) sees no signal that anything is unusual — they'll just experience "download takes a while again" and shrug. Repeated corruption is invisible to the team.
  2. Debugging a real "why is doctor slow now" report has to be done by inspection: someone has to notice the cache dir is malformed, since neither doctor nor the CLI ever emits the "your cache is corrupt, falling back to X" state.
  3. This is precisely the class of bug the doctor command exists to surface. Silencing it inside the cache-read helper defeats the purpose of running doctor on it.

Suggested minimum: console.warn with the error's .code / .message from the catch, mirrored to the existing stale hyperframes cache warn at manager.ts:126 for consistency. Optional next-level: emit a trackBrowserInstall-style telemetry event so the team can see the counter.

The checkChrome catch in preflight.ts:145-149 is a different animal — it's already surfacing the outcome via the "Chrome not found" report line, which IS the doctor signal. That one's fine as-is (the report line is the observability). The manager.ts one is not.

🟡 Root-cause vs. defense-in-depth — no accompanying investigation of what produces the corruption

The PR body describes the corruption as a static state ("stub file where the directory belongs") and doesn't identify a writer-side bug that PRODUCES that state. Two possibilities:

  1. This is genuinely external corruption (user manually pokes the cache, partial download from a killed browser ensure, filesystem-level issues). Then the defense-in-depth here is correct AND we should still ask: does browser ensure clean up half-downloaded stub files on SIGINT / on download-failure? If not, the CLI itself is a corruption source, and this PR is treating the symptom.
  2. There is a writer-side bug (e.g. the download-in-progress leaves stub files that don't get cleaned up on interrupt). Then this PR is fine as a hardening pass, but a sibling PR should exist for the writer.

Worth a one-line acknowledgement in the PR body: "the corruption source is [X] and [is / is not] separately addressed." If (2) applies, a sibling PR reference would land this a lot cleaner.

The hyperframes browser clear command exists (per browser.ts examples) — is it aggressive enough to wipe a corrupt cache state? A user hitting the crash today has no in-CLI escape hatch other than rm -rf ~/.cache/hyperframes/chrome, which is what happens if browser clear implements what its docstring says. Worth confirming.

🟢 Fix mechanically matches the diagnosis

  • findFromCache catches the throw from getInstalledBrowsers and returns [], so callers fall through to system / download as intended.
  • checkChrome catches from findBrowser() and produces the exact "Chrome not found" outcome the doctor contract promises. ok:false, correct hint, runEnvironmentChecks no longer throws for this case.
  • The browser ensure command at browser.ts:33,66 (which calls findBrowser() directly, outside checkChrome) also benefits — findBrowser no longer throws because findFromCache no longer throws — so the render/preview/inspect paths inherit the resilience. Good coverage for the whole CLI surface, not just doctor.
  • Preflight test at preflight.test.ts:68-90 correctly asserts the outcome shape (ok:false, "Chrome not found", ensure hint). The try/finally around spy.mockRestore() is defensive-correct.

🟢 Doctor's documented --json exit-0 contract is preserved

The PR body correctly identifies the contract (stale-version check returning ok:false must not poison CI), and the fix maintains it. Chrome-not-found is now reported the same way as ffmpeg-not-found — via the outcomes list, not via a crash.

Nits

  • findFromCache at manager.ts:108 — the catch block eats the specific error. If you keep the silent-swallow, consider at least destructuring for .code and re-checking against ENOTDIR / ENOENT / EACCES (the plausible crash surfaces), so a genuine bug (like a permission error the user should know about) isn't hidden as "no cache found". Something like:
    } catch (err) {
      const code = (err as NodeJS.ErrnoException)?.code;
      if (code && !["ENOTDIR", "ENOENT", "EACCES"].includes(code)) {
        console.warn(`[browser] Unexpected cache read error (${code}); treating as no cached browser.`);
      }
      installed = [];
    }
    This preserves the resilience for the documented failure modes, keeps a signal on the surprising ones.
  • The checkChrome comment (preflight.ts:141-146) is great; the findFromCache comment (manager.ts:108-111) is fine but could reference the sibling checkChrome comment so the two hardenings are cross-linked in future greps.

What I didn't verify

  • Whether browser ensure --force (or browser clear) cleanly wipes a stub-file cache before re-downloading. If not, a user hitting the corruption today may still be stuck.
  • Whether the test injection via vi.spyOn(manager, "findBrowser") is actually intercepting the call — preflight.ts imports findBrowser at module scope, so ESM binding semantics can defeat namespace-spying. Green CI (1083 tests) suggests it's fine, but worth an eye if you extend the pattern.
  • Whether there are other callers of getInstalledBrowsers (or readdirSync on the cache dir) beyond findFromCache that could still crash on the same corruption.

Non-blocking overall. The doctor-contract fix is correct and the callers-of-findBrowser inherit the resilience for free. The observability gap in the silent-swallow is the item worth addressing before merge — every other line of this PR is defensive-programming to make the failure LEGIBLE, so making the recovery itself invisible cuts against the PR's own thesis.

Review by Rames D Jusso

@miguel-heygen miguel-heygen marked this pull request as ready for review July 1, 2026 02:38

@james-russo-rames-d-jusso james-russo-rames-d-jusso left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

R2 @ cf2d7d85e36a23490ec44dc3c3ca16e44c8023c9 — verifying against my R1 at 3647ef78.

R1 findings

🟡 (a) Silent-swallow in findFromCache — no observability signal → ✅ RESOLVED

manager.ts:113-119 now emits a structured warn on the recovery path:

} catch (err) {
  const code = (err as NodeJS.ErrnoException | undefined)?.code;
  const suffix = code ? ` (${code})` : "";
  console.warn(
    `[hyperframes] Browser cache read failed${suffix}: ${normalizeErrorMessage(err)}. Falling back to system Chrome or a fresh download.`,
  );
  installed = [];
}
  • Structured — carries the error code (ENOTDIR, ENOENT, EACCES all captured) in parens + normalizeErrorMessage(err) for the message body.
  • Prefixed [hyperframes] for greppability in a mixed-log environment.
  • Wording explicitly names the recovery path ("Falling back to system Chrome or a fresh download") so a user grepping their log after a slow doctor run sees the diagnosis without needing to inspect the cache dir.

NEW regression test at manager.test.ts:150-166 — asserts BOTH the outcome ({ executablePath: SYSTEM_CHROME, source: "system" }) AND the warn (toContain("Browser cache read failed (ENOTDIR)") + toContain("Falling back to system Chrome")). That's the observability bug-pin I was asking for.

The R1 nit-level suggestion (destructure code and filter surprise-vs-expected errors) was deliberately not taken — the current code warns on ALL codes. That's the more conservative choice (never silent, even on surprising failures) and aligns with the PR's "failures should be legible" thesis; happy with it as-is.

🟡 (b) Writer-side root-cause investigation → ✅ RESOLVED (acknowledged + deferred)

PR body now has an explicit "Corruption source" section:

This PR treats the observed state as external/partial-cache corruption and keeps the CLI resilient when it happens. The writer-side path still has the existing recovery command: hyperframes browser clear removes the managed cache, and hyperframes browser ensure recreates it. If we later find browser ensure itself leaves this stub state on interrupt/download failure, that should be handled as a sibling writer-side cleanup PR; this PR keeps doctor, render, preview, and inspect from crashing on the read side.

That's the acknowledgment I was after — read-side scope is bounded, writer-side is (a) accessible via existing recovery command AND (b) explicitly punted to a follow-up if browser ensure turns out to be a corruption source. Closes the finding per feedback_open_item_alternative_resolution (operational goal — reviewer + future maintainer can see the scope decision — is met via the doc, no code required).

Adjacent-file check

preflight.ts:141-154 — the try { info = await findBrowser(); } catch { info = undefined; } still uses a bare catch (no observability). That's intentional and fine here: the outcome path already emits ok:false + the "Chrome not found" title with the npx hyperframes browser ensure hint, which IS the doctor signal. Warning at THIS level would be noise (user already sees the outcome in the report). The manager.ts warn covers the render/preview/inspect paths where there's no outcomes-list to surface the signal. Correct split.

The new manager.test.ts regression test at test.ts:150-166 uses installedInHfCacheError (new option) to inject a rejected getInstalledBrowsers, then asserts fallback to system Chrome. It also calls _resetSystemFallbackWarnForTests() to defeat the once-flag on the system-Chrome warn — good hygiene, otherwise test isolation depends on invocation order.

Nits from R1

  • Error-code discrimination (nit-level) — Not addressed; current code warns on all .code values. Non-blocking as noted above; conservative choice.
  • Cross-link between the two catch comments — Not addressed; findFromCache catch and checkChrome catch each have their own comment. Fine — the two are logically distinct (one is defensive-cache-read, one is doctor-outcome-adapter) and don't need cross-referencing at this scale.

CI status

BLOCKED (needs 1 approval), MERGEABLE. Required checks green on R2 push; a few CI jobs (CodeQL javascript-typescript, some Windows render + Typecheck/Build/Test) still IN_PROGRESS on the current push.

Peer reviews since R1

None. Only my R1 review on the record.

R2 verdict

LGTM — both 🟡 findings resolved (structured observability warn with test coverage + PR body scope acknowledgment with deferred writer-side path). Doctor's --json exit-0 contract preserved; render/preview/inspect paths inherit the resilience. Ships cleanly.

Review by Rames D Jusso

@miguel-heygen miguel-heygen merged commit 0d202ea into main Jul 1, 2026
52 of 64 checks passed
@miguel-heygen miguel-heygen deleted the fix/doctor-resilient-to-broken-browser-cache branch July 1, 2026 03:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants