Skip to content

Host test memory leak fixes + leak-hunting skill#4397

Open
lukemelia wants to merge 13 commits intomainfrom
cs-host-test-memory-leak-fixes
Open

Host test memory leak fixes + leak-hunting skill#4397
lukemelia wants to merge 13 commits intomainfrom
cs-host-test-memory-leak-fixes

Conversation

@lukemelia
Copy link
Copy Markdown
Contributor

@lukemelia lukemelia commented Apr 13, 2026

Summary

Fixes a stack of memory leaks in the host test suite that were driving CI shards toward V8 OOM in long runs, plus the per-test instrumentation that surfaces future regressions and a Claude Code Skill that documents the hunting workflow.

The dominant leak (~7.5 MB / test) was App._applicationInstances retaining destroyed-but-not-unwatched ApplicationInstances. The fix sweeps them out in QUnit.testDone. Smaller fixes address listener-removal mismatches, globalThis closures that pinned the test owner, and per-test state held by setupMockMatrix.

Validation

Local card-basics probe (90 tests):

Metric Before fix stack After fix stack
Heap growth t=10→t=90 158 → 775 MB 116 → 177 MB
Per-test slope (steady state) ~7.5 MB/test ~0.125 MB/test
App._applicationInstances size climbs 11 → 51 (+1/test) stays at 0
Base-realm module-source duplication +31 copies / 40 tests +1 copy / 40 tests

With the leak gone, this PR also reverts the temporary 30-shard CI bump back to 20.

What's in the PR

The fixes (9 commits):

  • Sweep destroyed ApplicationInstances from App._applicationInstances
  • MEMPROBE instrumentation in setup-qunit.js (per-test gc + every-10 log line)
  • setupMockMatrix testState cleanup
  • globalThis closure + listener capture flag fixes
  • Resize handle modifier listener cleanup

The tooling (1 commit):

  • .claude/skills/host-test-memory-leak-hunting/ — Skill (workflow) + known-leaks catalog
  • packages/host/scripts/{heap-snapshot-runner,snapshot-diff,snapshot-by-class,snapshot-retainers}.js — CDP-driven snapshot capture and analysis
  • packages/host/scripts/HEAP_PROBE.md — invocation reference

The scripts are dev-time only — never invoked from CI or production builds.

Open follow-up

super.willDestroy() in ApplicationInstance throws somewhere in the chain — that's why _unwatchInstance is skipped and the workaround sweep is needed. The sweep is provably safe (it only removes instances @glimmer/destroyable has already marked DESTROYED), but the underlying error is worth tracking down so we can file an upstream Ember issue.

Test plan

  • All 20 host-test shards pass on this PR
  • Shard 3 (the historical OOM canary) passes well within memory limits
  • MEMPROBE log lines in CI output show app_instances=0 and used= flat across the run
  • Spot-check: load tests/index.html?filter=card-basics locally, confirm no test failures

lukemelia and others added 10 commits April 13, 2026 19:47
The `manageHandleRegistration` modifier on `ResizablePanelGroup`'s Handle
component called `registerResizeHandle()` (which returns an unregister
function) but never returned a cleanup from the modifier. When a
Handle was destroyed, its `setResizeHandlerState` closure stayed in the
module-scope `registeredResizeHandlers` Set, and the global
pointerdown/pointermove/pointerup/dblclick listeners on document.body
stayed live.

Because `setResizeHandlerState` captures `this` (the Handle component)
and transitively the Ember owner, every mount/unmount of a
ResizablePanelGroup accumulated a pinned owner graph. In host tests
this showed up as every service (LoaderService, StoreService,
MatrixService, etc.) growing +1 per test; in production it would
leak listeners and component references any time a user mounted and
unmounted resizable panels.

Fix: `registerHandle()` now returns the unregister function from
`registerResizeHandle()`, and the modifier returns a cleanup that
calls it on destroy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…owners

Three related fixes for owner-leak retainers found via heap snapshot
analysis. Each closes a path where ApplicationInstance (and its services)
remained reachable across test boundaries.

1. boxelTransitionTo helper closure (registerBoxelTransitionTo,
   render.ts #setupTransitionHelper):
   The helper closures captured `router` (and in render.ts `this`),
   were stashed on globalThis, and were never cleared on owner destroy.
   Each callsite now passes a destroyable owner; a registerDestructor
   removes the global iff it still points at the function we installed.

2. render-route globals (__boxelRenderContext, __renderModel,
   __docsInFlight, __waitForRenderLoadStability):
   Cleared in deactivate() but not on owner destroy. In tests the owner
   is destroyed without a normal route teardown, leaving these closures
   pinning `this` on globalThis. Added registerDestructor in beforeModel
   for RenderRoute, ModuleRoute, CommandRunnerRoute, and
   RenderFileExtractRoute to clear them on owner destroy.

3. panel-resize-handle-registry capture-flag mismatch:
   pointerdown was added with `{ capture: true }` but removed without
   the capture flag, so removeEventListener silently no-op'd. The
   listener stayed on document.body, retaining its module-level closure
   scope (which contained the registered handles' modifier functions
   and through them the entire ApplicationInstance).
Collapses two long conditional/function-signature lines to single-line
form per the project's prettier config. No behavioral change.
The module-level `testState` object inside `setupMockMatrix` held a
reference to the LAST test's `owner` and `sdk` until the next test's
`beforeEach` overwrote it. Between tests (and after the final test in
each module), that stale reference kept the entire owner subgraph
alive: services, Realm, TestRealmAdapter, queue handlers, bound Worker
methods, MockSDK event listeners, etc.

Heap-snapshot retainer chain confirmed the leak:
  ApplicationInstance
    ← testState object (property "owner")
    ← MockUtils (property "testState")
    ← TestRealmAdapter (private #mockMatrixUtils)
    ← Realm (private #adapter)

Clearing `testState.owner`/`sdk`/`opts` in an afterEach (after
`settled()` so any in-flight async completes first) lets the previous
test's owner be garbage-collected before the next test starts.
Temporary instrumentation: after each test, force GC (if --expose-gc
exposes it) and emit "PROBE t=N used=XMB total=YMB name=..." so CI
logs show per-test heap deltas. Chrome args now include --expose-gc
and --enable-precise-memory-info so the numbers are usable.

To be reverted once the residual per-test leak is eliminated.
Per-test gc() calls (via --expose-gc) are what keep shard 3 under the
4GB old-space ceiling — V8's automatic GC couldn't reclaim fast enough
and the heap drifted into OOM around test 139. The heap still grows
~20MB/test (155MB → 3.5GB across shard 3), so there is an underlying
~1-owner-subgraph-per-test leak that remains to be eliminated, but
the forced GC keeps us below the ceiling.

Cutting the log output to every 10 tests (MEMPROBE lines) — the
per-test noise was pushing log size past useful.
- mock-matrix.ts: blank line between import groups (import/order)
- setup-qunit.js: multi-line console.log (prettier) and drop an
  unused eslint-disable-next-line directive that was no longer
  needed after the log was already wrapped
Add a WeakRef to each test's this.owner in testDone (after the test
body has run), then after gc() count how many prior owners are still
dereferenceable. If live_owners grows linearly with test count, the
ApplicationInstance is still leaking; if it stays near 0 (or bounded
by gc cadence), the leak lives below the owner — e.g., in module-level
caches that survive owner destruction.
Heap snapshot analysis showed `App._applicationInstances` accumulating ~1
prior-test ApplicationInstance per host test (11→51 over 40 tests). Each
pinned instance retained its Registry → template factories →
FieldComponent closures → Box trees → ~7.5MB of base-realm module
sources, driving the host test suite into V8 OOM in long shards.

Root cause: `ApplicationInstance.willDestroy()` calls `super.willDestroy()`
THEN `_unwatchInstance(this)`. When something in the super chain throws,
`_unwatchInstance` is skipped — but `@glimmer/destroyable` still marks
the instance as `isDestroyed`. The Set keeps a strong reference to a
destroyed corpse forever.

Fix: in `QUnit.testDone`, after gc(), iterate `_applicationInstances`
and delete any entry where `isDestroyed(inst)` is true. Provably safe —
only sweeps instances already in the DESTROYED state per
@glimmer/destroyable.

Validated locally on `card-basics` (90 tests):
- Heap growth t=10→t=90: 158→775 MB → 116→177 MB
- Per-test slope (steady state): ~7.5 MB → ~0.125 MB
- `card-api` source string copies: +31/40 tests → +1/40 tests
- `app_instances` Set size: climbs to 50+ → stays at 0

Replaces the WeakRef-on-context.owner instrumentation (which always
read 0 because `getContext()` returns undefined at testDone after
`unsetContext()` runs in teardownContext) with a direct readout of
`getApplication()._applicationInstances.size` plus per-state counts.

With the leak gone, revert the temporary 30-shard CI bump back to 20.
Documents the local probe loop used to find the leaks in this branch and
includes the snapshot-analysis scripts so the next person can repeat the
investigation without re-deriving the workflow from scratch.

- `.claude/skills/host-test-memory-leak-hunting/SKILL.md` — when to use,
  the probe loop step-by-step, pitfalls, and how to validate a fix.
- `.claude/skills/host-test-memory-leak-hunting/known-leaks.md` — catalog
  of leaks found so far (App._applicationInstances Set, mock-matrix
  testState, globalThis closures + listener capture-flag mismatch, resize
  handle modifier) plus patterns to look for when adding new code.
- `packages/host/scripts/heap-snapshot-runner.js` — CDP client that
  watches Chrome console for `MEMPROBE t=N` and writes
  `/tmp/snap-tN.heapsnapshot` (streams chunks to disk; joining blows V8's
  max string length past ~300MB).
- `packages/host/scripts/snapshot-diff.js` — constructor-name count
  deltas between two snapshots, sorted by retained-size delta.
- `packages/host/scripts/snapshot-by-class.js` — same idea, name-truncated
  for at-a-glance scanning, sorted by count delta.
- `packages/host/scripts/snapshot-retainers.js` — backward BFS from
  matching nodes to GC roots; `--strong` flag (almost always required)
  skips weak/shortcut edges and WeakMap "part of key" internal edges
  that would otherwise mask the true retainer chain.
- `packages/host/scripts/HEAP_PROBE.md` — invocation reference for the
  scripts, linked from the skill.

Dev-time tooling only — not invoked from CI or production builds.
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 13, 2026

The leak sweep in setup-qunit.js brings growth down from ~7.5MB/test
to ~0.125MB/test, so 20 shards is sufficient again.
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 14, 2026

Realm Server Test Results

  1 files  ±0    1 suites  ±0   16m 0s ⏱️ + 2m 33s
844 tests ±0  844 ✅ ±0  0 💤 ±0  0 ❌ ±0 
915 runs  ±0  915 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit 2f7f67f. ± Comparison against base commit 9131738.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 14, 2026

Host Test Results

2 194 tests  ±0   2 179 ✅ ±0   2h 31m 38s ⏱️ + 2m 24s
    1 suites ±0      15 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 2f7f67f. ± Comparison against base commit 9131738.

♻️ This comment has been updated with latest results.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces memory growth in the host test suite by fixing several leak sources and adding per-test memory instrumentation, plus adding dev-time heap snapshot tooling and documentation to make future leak regressions easier to diagnose.

Changes:

  • Add QUnit.testDone cleanup/instrumentation: sweep destroyed ApplicationInstances, force GC, and periodically log MEMPROBE heap + instance stats.
  • Fix additional per-test retention sources (mock-matrix testState cleanup, globalThis closures cleaned up with destructors, and event listener removal mismatch).
  • Add dev-time heap snapshot capture/analysis scripts and documentation (plus a Claude skill describing the workflow).

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
packages/host/tests/helpers/setup-qunit.js Adds per-test GC + MEMPROBE logging and sweeps destroyed ApplicationInstances from App._applicationInstances.
packages/host/tests/helpers/mock-matrix.ts Clears module-level testState after each test to allow GC of owner/sdk graphs.
packages/host/testem.js Enables --expose-gc and precise memory info for Chrome test runs.
packages/host/scripts/heap-snapshot-runner.js New CDP-based snapshot runner that captures .heapsnapshot files at chosen probe indices.
packages/host/scripts/snapshot-diff.js New tool to diff two heap snapshots by constructor counts and retained-size deltas.
packages/host/scripts/snapshot-by-class.js New tool to report constructor count deltas with truncated names for readability.
packages/host/scripts/snapshot-retainers.js New tool to find retainer paths from targets back to GC roots.
packages/host/scripts/HEAP_PROBE.md Usage/reference docs for snapshot tooling.
packages/host/app/utils/register-boxel-transition.ts Registers global transition helper with destructor-based cleanup to avoid global retention.
packages/host/app/routes/standby.ts Updates callsite to pass an owner for destructor registration.
packages/host/app/routes/render/file-extract.ts Clears __boxelRenderContext on owner destroy (tests) to prevent per-test retention.
packages/host/app/routes/render.ts Adds destructor-based cleanup for globals + transition helper to avoid globalThis pinning in tests.
packages/host/app/routes/module.ts Adds owner into model context and clears __boxelRenderContext on destroy (tests).
packages/host/app/routes/command-runner.ts Clears __boxelRenderContext on destroy and updates transition helper registration signature.
packages/host/app/components/card-prerender.gts Passes owner through module model context so global cleanup can be tied to a destroyable.
packages/boxel-ui/addon/src/components/resizable-panel-group/utils/panel-resize-handle-registry.ts Fixes removeEventListener capture mismatch so listeners are actually removed.
packages/boxel-ui/addon/src/components/resizable-panel-group/handle.gts Ensures resize-handle registration is unregistered when the modifier is destroyed.
.claude/skills/host-test-memory-leak-hunting/known-leaks.md Adds catalog of known leak patterns and fixes.
.claude/skills/host-test-memory-leak-hunting/SKILL.md Adds documented workflow for leak hunting and validating fixes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Pre-filter raw message strings before JSON.parse in cdpCall and the
  console listener so HeapProfiler chunk notifications aren't parsed
  redundantly
- Replace queue.shift() with head-index pointer in BFS retainer walk
  to avoid O(n²) on large snapshots
@lukemelia lukemelia requested review from a team, backspace and habdelra April 14, 2026 01:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants