From 31f59d9a61897d77a7bb98cd166de31bcc44b435 Mon Sep 17 00:00:00 2001 From: Luke Melia Date: Mon, 13 Apr 2026 15:39:18 -0400 Subject: [PATCH 01/13] Fix memory leak: resize handle modifier never unregistered listeners The `manageHandleRegistration` modifier on `ResizablePanelGroup`'s Handle component called `registerResizeHandle()` (which returns an unregister function) but never returned a cleanup from the modifier. When a Handle was destroyed, its `setResizeHandlerState` closure stayed in the module-scope `registeredResizeHandlers` Set, and the global pointerdown/pointermove/pointerup/dblclick listeners on document.body stayed live. Because `setResizeHandlerState` captures `this` (the Handle component) and transitively the Ember owner, every mount/unmount of a ResizablePanelGroup accumulated a pinned owner graph. In host tests this showed up as every service (LoaderService, StoreService, MatrixService, etc.) growing +1 per test; in production it would leak listeners and component references any time a user mounted and unmounted resizable panels. Fix: `registerHandle()` now returns the unregister function from `registerResizeHandle()`, and the modifier returns a cleanup that calls it on destroy. 
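The fixed control flow can be sketched framework-free (a hedged sketch: `registeredResizeHandlers`, `registerResizeHandle`, and `setupHandle` here are simplified stand-ins for the real module-scope Set, the registry function, and the `manageHandleRegistration` modifier):

```javascript
// Module-scope registry, as in panel-resize-handle-registry.ts: entries
// survive component teardown unless something explicitly deletes them.
const registeredResizeHandlers = new Set();

function registerResizeHandle(setState) {
  registeredResizeHandlers.add(setState);
  // Returning the unregister closure is the crux of the fix: callers can
  // now break the module-scope -> component edge on destroy.
  return () => registeredResizeHandlers.delete(setState);
}

// Roughly what the modifier does now: register on setup, return a cleanup
// that unregisters on teardown.
function setupHandle(handle) {
  const unregister = registerResizeHandle(handle.setState);
  return () => unregister?.();
}
```

Before the fix, the teardown path returned nothing, so the Set kept the closure (and everything it captured, including the owner) alive for the life of the page.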
Co-Authored-By: Claude Opus 4.6 --- .../addon/src/components/resizable-panel-group/handle.gts | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/packages/boxel-ui/addon/src/components/resizable-panel-group/handle.gts b/packages/boxel-ui/addon/src/components/resizable-panel-group/handle.gts index 2d82c6f95b6..39650549ca7 100644 --- a/packages/boxel-ui/addon/src/components/resizable-panel-group/handle.gts +++ b/packages/boxel-ui/addon/src/components/resizable-panel-group/handle.gts @@ -36,7 +36,10 @@ interface Signature { let manageHandleRegistration = modifier((element, [handle]: [Handle]) => { handle.element = element as HTMLButtonElement; - handle.registerHandle(); + const unregister = handle.registerHandle(); + return () => { + unregister?.(); + }; }); export default class Handle extends Component { @@ -155,7 +158,7 @@ export default class Handle extends Component { } }; - registerResizeHandle( + return registerResizeHandle( this.id, this.element, this.args.orientation, From 9e17c97cd43b21dd5681d2a9b8cc214d735ba255 Mon Sep 17 00:00:00 2001 From: Luke Melia Date: Mon, 13 Apr 2026 16:07:44 -0400 Subject: [PATCH 02/13] Fix memory leaks: globalThis closures + capture-flag mismatch retain owners Three related fixes for owner-leak retainers found via heap snapshot analysis. Each closes a path where ApplicationInstance (and its services) remained reachable across test boundaries. 1. boxelTransitionTo helper closure (registerBoxelTransitionTo, render.ts #setupTransitionHelper): The helper closures captured `router` (and in render.ts `this`), were stashed on globalThis, and were never cleared on owner destroy. Each callsite now passes a destroyable owner; a registerDestructor removes the global iff it still points at the function we installed. 2. render-route globals (__boxelRenderContext, __renderModel, __docsInFlight, __waitForRenderLoadStability): Cleared in deactivate() but not on owner destroy. 
In tests the owner is destroyed without a normal route teardown, leaving these closures pinning `this` on globalThis. Added registerDestructor in beforeModel for RenderRoute, ModuleRoute, CommandRunnerRoute, and RenderFileExtractRoute to clear them on owner destroy. 3. panel-resize-handle-registry capture-flag mismatch: pointerdown was added with `{ capture: true }` but removed without the capture flag, so removeEventListener silently no-op'd. The listener stayed on document.body, retaining its module-level closure scope (which contained the registered handles' modifier functions and through them the entire ApplicationInstance). --- .../utils/panel-resize-handle-registry.ts | 7 +++- .../host/app/components/card-prerender.gts | 1 + packages/host/app/routes/command-runner.ts | 6 ++- packages/host/app/routes/module.ts | 10 ++++- packages/host/app/routes/render.ts | 40 ++++++++++++++++++- .../host/app/routes/render/file-extract.ts | 6 +++ packages/host/app/routes/standby.ts | 2 +- .../app/utils/register-boxel-transition.ts | 14 ++++++- 8 files changed, 79 insertions(+), 7 deletions(-) diff --git a/packages/boxel-ui/addon/src/components/resizable-panel-group/utils/panel-resize-handle-registry.ts b/packages/boxel-ui/addon/src/components/resizable-panel-group/utils/panel-resize-handle-registry.ts index 4e21fda56e6..b4f95ede07a 100644 --- a/packages/boxel-ui/addon/src/components/resizable-panel-group/utils/panel-resize-handle-registry.ts +++ b/packages/boxel-ui/addon/src/components/resizable-panel-group/utils/panel-resize-handle-registry.ts @@ -278,7 +278,12 @@ function updateListeners() { const { body } = ownerDocument; body.removeEventListener('contextmenu', handlePointerUp); - body.removeEventListener('pointerdown', handlePointerDown); + // Must pass `capture: true` to match the addEventListener call below; + // otherwise removeEventListener silently no-ops and the listener stays + // attached, retaining the entire owner via its closure scope. 
+ body.removeEventListener('pointerdown', handlePointerDown, { + capture: true, + }); body.removeEventListener('pointerleave', handlePointerMove); body.removeEventListener('pointermove', handlePointerMove); body.removeEventListener('dblclick', handleDoubleClick); diff --git a/packages/host/app/components/card-prerender.gts b/packages/host/app/components/card-prerender.gts index ef6231a18b3..7bb3cf1b109 100644 --- a/packages/host/app/components/card-prerender.gts +++ b/packages/host/app/components/card-prerender.gts @@ -552,6 +552,7 @@ export default class CardPrerender extends Component { network: this.network, authGuard: this.#moduleAuthGuard, state: this.#moduleModelState(), + owner: this, }; } diff --git a/packages/host/app/routes/command-runner.ts b/packages/host/app/routes/command-runner.ts index bebd9c0c9d8..459cf213aaf 100644 --- a/packages/host/app/routes/command-runner.ts +++ b/packages/host/app/routes/command-runner.ts @@ -1,3 +1,4 @@ +import { registerDestructor } from '@ember/destroyable'; import { getOwner, setOwner } from '@ember/owner'; import Route from '@ember/routing/route'; import type RouterService from '@ember/routing/router-service'; @@ -81,8 +82,11 @@ export default class CommandRunnerRoute extends Route { @service declare realm: RealmService; async beforeModel() { - registerBoxelTransitionTo(this.router); + registerBoxelTransitionTo(this.router, this); (globalThis as any).__boxelRenderContext = true; + registerDestructor(this, () => { + (globalThis as any).__boxelRenderContext = undefined; + }); this.realm.restoreSessionsFromStorage(); } diff --git a/packages/host/app/routes/module.ts b/packages/host/app/routes/module.ts index 72dee9065a4..4f3bfe108e2 100644 --- a/packages/host/app/routes/module.ts +++ b/packages/host/app/routes/module.ts @@ -1,3 +1,4 @@ +import { registerDestructor } from '@ember/destroyable'; import Route from '@ember/routing/route'; import type RouterService from '@ember/routing/router-service'; import type Transition 
from '@ember/routing/transition'; @@ -99,6 +100,7 @@ export interface ModuleModelContext { network: NetworkService; authGuard: ReturnType; state: ModuleModelState; + owner: object; } export interface ModuleModelParams { id: string; @@ -139,6 +141,11 @@ export default class ModuleRoute extends Route { // activate() doesn't run early enough for this to be set before the model() // hook is run (globalThis as any).__boxelRenderContext = true; + registerDestructor(this, () => { + if (isTesting()) { + (globalThis as any).__boxelRenderContext = undefined; + } + }); this.#authGuard.register(); if (!isTesting()) { await this.store.ensureSetupComplete(); @@ -183,6 +190,7 @@ export default class ModuleRoute extends Route { this.lastStoreResetKey = key; }, }, + owner: this, }; } } @@ -193,7 +201,7 @@ export async function buildModuleModel( ): Promise { let parsedOptions = renderOptions ?? {}; let moduleURL = trimExecutableExtension(new URL(id)); - registerBoxelTransitionTo(context.router); + registerBoxelTransitionTo(context.router, context.owner); if (parsedOptions.clearCache) { context.state.setTypesCache( diff --git a/packages/host/app/routes/render.ts b/packages/host/app/routes/render.ts index 632304e2606..a867cdd6f1b 100644 --- a/packages/host/app/routes/render.ts +++ b/packages/host/app/routes/render.ts @@ -1,4 +1,5 @@ import type Controller from '@ember/controller'; +import { registerDestructor } from '@ember/destroyable'; import { action } from '@ember/object'; import Route from '@ember/routing/route'; import type RouterService from '@ember/routing/router-service'; @@ -184,6 +185,7 @@ export default class RenderRoute extends Route { // activate() doesn't run early enough for this to be set before the model() // hook is run (globalThis as any).__boxelRenderContext = true; + this.#registerGlobalsDestructor(); this.#authGuard.register(); if (!isTesting()) { await this.store.ensureSetupComplete(); @@ -800,10 +802,31 @@ export default class RenderRoute extends Route { // 
pass the canonical base params; for render.* routes we strip any existing // base params (or establish them first if they changed) before handing off // to the router. Everything runs inside Ember's run loop via join(). + #transitionHelperDestructorRegistered = false; + #lastTransitionFn: Function | undefined; + #globalsDestructorRegistered = false; + + #registerGlobalsDestructor() { + if (this.#globalsDestructorRegistered) { + return; + } + this.#globalsDestructorRegistered = true; + registerDestructor(this, () => { + // Clear globals on owner destroy. deactivate() also clears these on + // normal route teardown, but in tests the owner can be destroyed + // without deactivate firing, leaving these closures pinning `this` + // (and the entire ApplicationInstance) on globalThis. + (globalThis as any).__boxelRenderContext = undefined; + (globalThis as any).__renderModel = undefined; + (globalThis as any).__docsInFlight = undefined; + (globalThis as any).__waitForRenderLoadStability = undefined; + }); + } + #setupTransitionHelper(id: string, nonce: string, options: string) { let baseParams: [string, string, string] = [id, nonce, options]; this.renderBaseParams = baseParams; - (globalThis as any).boxelTransitionTo = ( + let transitionFn = ( routeName: Parameters[0], ...params: any[] ) => { @@ -843,6 +866,21 @@ export default class RenderRoute extends Route { this.router.transitionTo(routeName as never, ...(params as never[])), ); }; + (globalThis as any).boxelTransitionTo = transitionFn; + this.#lastTransitionFn = transitionFn; + if (!this.#transitionHelperDestructorRegistered) { + this.#transitionHelperDestructorRegistered = true; + registerDestructor(this, () => { + // Only clear if the global still points at the last function we + // installed. This avoids both pinning this route (and its owner) + // via globalThis and clobbering another live route's helper. 
+ if ( + (globalThis as any).boxelTransitionTo === this.#lastTransitionFn + ) { + delete (globalThis as any).boxelTransitionTo; + } + }); + } } @action diff --git a/packages/host/app/routes/render/file-extract.ts b/packages/host/app/routes/render/file-extract.ts index 9240f3b729d..357516b6fa7 100644 --- a/packages/host/app/routes/render/file-extract.ts +++ b/packages/host/app/routes/render/file-extract.ts @@ -1,3 +1,4 @@ +import { registerDestructor } from '@ember/destroyable'; import Route from '@ember/routing/route'; import type Transition from '@ember/routing/transition'; import { service } from '@ember/service'; @@ -41,6 +42,11 @@ export default class RenderFileExtractRoute extends Route { async beforeModel(transition: Transition) { await super.beforeModel?.(transition); (globalThis as any).__boxelRenderContext = true; + registerDestructor(this, () => { + if (isTesting()) { + (globalThis as any).__boxelRenderContext = undefined; + } + }); this.#authGuard.register(); if (!isTesting()) { this.realm.restoreSessionsFromStorage(); diff --git a/packages/host/app/routes/standby.ts b/packages/host/app/routes/standby.ts index 6c4b9ea5a58..4cbe32f6fa3 100644 --- a/packages/host/app/routes/standby.ts +++ b/packages/host/app/routes/standby.ts @@ -8,6 +8,6 @@ export default class StandbyRoute extends Route { @service declare router: RouterService; async beforeModel() { - registerBoxelTransitionTo(this.router); + registerBoxelTransitionTo(this.router, this); } } diff --git a/packages/host/app/utils/register-boxel-transition.ts b/packages/host/app/utils/register-boxel-transition.ts index 4901cc9f959..73f5a6e621d 100644 --- a/packages/host/app/utils/register-boxel-transition.ts +++ b/packages/host/app/utils/register-boxel-transition.ts @@ -1,9 +1,19 @@ +import { registerDestructor } from '@ember/destroyable'; import type RouterService from '@ember/routing/router-service'; -export function registerBoxelTransitionTo(router: RouterService): void { - (globalThis as 
any).boxelTransitionTo = ( +export function registerBoxelTransitionTo( + router: RouterService, + owner: object, +): void { + let transitionFn = ( ...args: Parameters ) => { router.transitionTo(...args); }; + (globalThis as any).boxelTransitionTo = transitionFn; + registerDestructor(owner, () => { + if ((globalThis as any).boxelTransitionTo === transitionFn) { + delete (globalThis as any).boxelTransitionTo; + } + }); } From 161f38820c9b28f8331e2fdf4ba617feba945508 Mon Sep 17 00:00:00 2001 From: Luke Melia Date: Mon, 13 Apr 2026 16:26:13 -0400 Subject: [PATCH 03/13] Fix prettier formatting in memory-leak fixes Collapses two long conditional/function-signature lines to single-line form per the project's prettier config. No behavioral change. --- packages/host/app/routes/render.ts | 4 +--- packages/host/app/utils/register-boxel-transition.ts | 4 +--- 2 files changed, 2 insertions(+), 6 deletions(-) diff --git a/packages/host/app/routes/render.ts b/packages/host/app/routes/render.ts index a867cdd6f1b..318e5bdb66d 100644 --- a/packages/host/app/routes/render.ts +++ b/packages/host/app/routes/render.ts @@ -874,9 +874,7 @@ export default class RenderRoute extends Route { // Only clear if the global still points at the last function we // installed. This avoids both pinning this route (and its owner) // via globalThis and clobbering another live route's helper. 
- if ( - (globalThis as any).boxelTransitionTo === this.#lastTransitionFn - ) { + if ((globalThis as any).boxelTransitionTo === this.#lastTransitionFn) { delete (globalThis as any).boxelTransitionTo; } }); diff --git a/packages/host/app/utils/register-boxel-transition.ts b/packages/host/app/utils/register-boxel-transition.ts index 73f5a6e621d..ab0a054bd0b 100644 --- a/packages/host/app/utils/register-boxel-transition.ts +++ b/packages/host/app/utils/register-boxel-transition.ts @@ -5,9 +5,7 @@ export function registerBoxelTransitionTo( router: RouterService, owner: object, ): void { - let transitionFn = ( - ...args: Parameters - ) => { + let transitionFn = (...args: Parameters) => { router.transitionTo(...args); }; (globalThis as any).boxelTransitionTo = transitionFn; From d3588a332ddc4cac49e736122ba3e59822c69589 Mon Sep 17 00:00:00 2001 From: Luke Melia Date: Mon, 13 Apr 2026 17:17:05 -0400 Subject: [PATCH 04/13] Fix memory leak: clear testState in setupMockMatrix afterEach MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The module-level `testState` object inside `setupMockMatrix` held a reference to the LAST test's `owner` and `sdk` until the next test's `beforeEach` overwrote it. Between tests (and after the final test in each module), that stale reference kept the entire owner subgraph alive: services, Realm, TestRealmAdapter, queue handlers, bound Worker methods, MockSDK event listeners, etc. Heap-snapshot retainer chain confirmed the leak: ApplicationInstance ← testState object (property "owner") ← MockUtils (property "testState") ← TestRealmAdapter (private #mockMatrixUtils) ← Realm (private #adapter) Clearing `testState.owner`/`sdk`/`opts` in an afterEach (after `settled()` so any in-flight async completes first) lets the previous test's owner be garbage-collected before the next test starts. 
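The retention pattern and its fix reduce to a few lines (a sketch: `beforeEachTest`/`afterEachTest` are hypothetical helpers standing in for the real QUnit hooks in `setupMockMatrix`):

```javascript
// Module-level state, as in mock-matrix.ts: this object outlives every test.
const testState = { owner: undefined, sdk: undefined, opts: undefined };

// beforeEach: point module scope at the current test's owner graph.
function beforeEachTest(owner, sdk, opts) {
  testState.owner = owner;
  testState.sdk = sdk;
  testState.opts = opts;
}

// afterEach (the fix): sever the module-scope edges so the finished test's
// owner subgraph becomes unreachable and can be garbage-collected. The real
// hook awaits settled() first so in-flight async completes before clearing.
function afterEachTest() {
  testState.owner = undefined;
  testState.sdk = undefined;
  testState.opts = undefined;
}
```

Without the `afterEachTest` step, the last test to run in each module pins its owner from module scope indefinitely, which is exactly the retainer chain the heap snapshot showed.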
--- packages/host/tests/helpers/mock-matrix.ts | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/packages/host/tests/helpers/mock-matrix.ts b/packages/host/tests/helpers/mock-matrix.ts index 0c6d82b89b2..a0bc242d71a 100644 --- a/packages/host/tests/helpers/mock-matrix.ts +++ b/packages/host/tests/helpers/mock-matrix.ts @@ -1,5 +1,6 @@ import type Owner from '@ember/owner'; +import { settled } from '@ember/test-helpers'; import { getService } from '@universal-ember/test-support'; import window from 'ember-window-mock'; @@ -158,5 +159,18 @@ export function setupMockMatrix( }, ); }); + + // Clear per-test references on the module-level `testState` object so the + // current test's owner/sdk can be GC'd before the next test starts. Without + // this, every module that calls `setupMockMatrix` retains the *last* owner + // it ran with — which keeps the entire owner subgraph (services, realms, + // adapters, registered queue handlers, bound Worker methods, etc.) alive. + hooks.afterEach(async function () { + await settled(); + testState.owner = undefined; + testState.sdk = undefined; + testState.opts = undefined; + }); + return mockUtils; } From 34a2b04b8a8cae461ec7fadfb0f226866f4dfd21 Mon Sep 17 00:00:00 2001 From: Luke Melia Date: Mon, 13 Apr 2026 17:55:59 -0400 Subject: [PATCH 05/13] Add testDone probe with memory stats for leak hunting Temporary instrumentation: after each test, force GC (if --expose-gc exposes it) and emit "PROBE t=N used=XMB total=YMB name=..." so CI logs show per-test heap deltas. Chrome args now include --expose-gc and --enable-precise-memory-info so the numbers are usable. To be reverted once the residual per-test leak is eliminated. 
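The capture-flag mismatch fixed in patch 02 is easy to reproduce with a bare `EventTarget` (no DOM needed): per the DOM spec, a listener's identity includes its `capture` flag, so a removal with a different flag silently does nothing.

```javascript
const target = new EventTarget();
let calls = 0;
const onPing = () => calls++;

target.addEventListener('ping', onPing, { capture: true });

// Mismatched removal: capture defaults to false here, so this is a no-op
// and the listener (plus everything its closure retains) stays attached.
target.removeEventListener('ping', onPing);
target.dispatchEvent(new Event('ping')); // onPing still fires

// Matched removal actually detaches it.
target.removeEventListener('ping', onPing, { capture: true });
target.dispatchEvent(new Event('ping')); // no further calls
```

This is why the registry fix passes `{ capture: true }` to `removeEventListener`: the options must match the `addEventListener` call for the removal to take effect.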
--- packages/host/testem.js | 3 ++- packages/host/tests/helpers/setup-qunit.js | 28 ++++++++++++++++++++++ 2 files changed, 30 insertions(+), 1 deletion(-) diff --git a/packages/host/testem.js b/packages/host/testem.js index cffd0dd958d..9dc05c5541a 100644 --- a/packages/host/testem.js +++ b/packages/host/testem.js @@ -24,7 +24,8 @@ const config = { '--disable-dbus', '--disable-dev-shm-usage', '--disable-software-rasterizer', - '--js-flags=--max-old-space-size=4096', + '--js-flags=--max-old-space-size=4096 --expose-gc', + '--enable-precise-memory-info', '--mute-audio', '--remote-debugging-port=0', '--window-size=1440,900', diff --git a/packages/host/tests/helpers/setup-qunit.js b/packages/host/tests/helpers/setup-qunit.js index 70907b21324..28c2b92b75e 100644 --- a/packages/host/tests/helpers/setup-qunit.js +++ b/packages/host/tests/helpers/setup-qunit.js @@ -10,4 +10,32 @@ export function setupQUnit() { useTestWaiters(TestWaiters); setup(QUnit.assert); QUnit.config.autostart = false; + + // TEMPORARY leak-hunting probe: after each test, force GC (via --expose-gc + // when available) and emit a marker line with memory stats. The marker is + // watched by scripts/heap-snapshot-runner.js; the memory stats give per-test + // heap deltas in CI logs (needs --enable-precise-memory-info to be exact). + let probeCount = 0; + QUnit.testDone((details) => { + probeCount++; + if (typeof globalThis.gc === 'function') { + globalThis.gc(); + globalThis.gc(); + } + let mem = ''; + try { + let pm = performance && performance.memory; + if (pm) { + let used = (pm.usedJSHeapSize / 1048576).toFixed(1); + let total = (pm.totalJSHeapSize / 1048576).toFixed(1); + mem = ` used=${used}MB total=${total}MB`; + } + } catch (_) { + /* ignore */ + } + // eslint-disable-next-line no-console + console.log( + `PROBE t=${probeCount}${mem} name="${details && details.name ? 
details.name : ''}"`, ); }); } From 79e644c76070b6738620d75437bee50635247ad2 Mon Sep 17 00:00:00 2001 From: Luke Melia Date: Mon, 13 Apr 2026 18:28:11 -0400 Subject: [PATCH 06/13] Reduce PROBE log noise; keep per-test GC mitigation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Per-test gc() calls (via --expose-gc) are what keep shard 3 under the 4GB old-space ceiling — V8's automatic GC couldn't reclaim fast enough and the heap drifted into OOM around test 139. The heap still grows ~20MB/test (155MB → 3.5GB across shard 3), so there is an underlying ~1-owner-subgraph-per-test leak that remains to be eliminated, but the forced GC keeps us below the ceiling. Cutting the log output to every 10 tests (MEMPROBE lines) — the per-test noise was pushing log size past the point of usefulness. --- packages/host/tests/helpers/setup-qunit.js | 35 +++++++++++----------- 1 file changed, 17 insertions(+), 18 deletions(-) diff --git a/packages/host/tests/helpers/setup-qunit.js b/packages/host/tests/helpers/setup-qunit.js index 28c2b92b75e..de758218aa6 100644 --- a/packages/host/tests/helpers/setup-qunit.js +++ b/packages/host/tests/helpers/setup-qunit.js @@ -11,31 +11,30 @@ export function setupQUnit() { setup(QUnit.assert); QUnit.config.autostart = false; - // TEMPORARY leak-hunting probe: after each test, force GC (via --expose-gc - // when available) and emit a marker line with memory stats. The marker is - // watched by scripts/heap-snapshot-runner.js; the memory stats give per-test - // heap deltas in CI logs (needs --enable-precise-memory-info to be exact). + // After each test, force GC (via --expose-gc) so V8 can release + // per-test allocations before the next test starts. Without this, V8's + // opportunistic GC can't keep up and the heap drifts toward the 4GB + // ceiling in long shards. Every 10 tests we also log a memory line so + // regressions are visible in CI output.
let probeCount = 0; - QUnit.testDone((details) => { + QUnit.testDone(() => { probeCount++; if (typeof globalThis.gc === 'function') { globalThis.gc(); globalThis.gc(); } - let mem = ''; - try { - let pm = performance && performance.memory; - if (pm) { - let used = (pm.usedJSHeapSize / 1048576).toFixed(1); - let total = (pm.totalJSHeapSize / 1048576).toFixed(1); - mem = ` used=${used}MB total=${total}MB`; + if (probeCount % 10 === 0) { + try { + let pm = performance && performance.memory; + if (pm) { + let used = (pm.usedJSHeapSize / 1048576).toFixed(1); + let total = (pm.totalJSHeapSize / 1048576).toFixed(1); + // eslint-disable-next-line no-console + console.log(`MEMPROBE t=${probeCount} used=${used}MB total=${total}MB`); + } + } catch (_) { + /* ignore */ } - } catch (_) { - /* ignore */ } - // eslint-disable-next-line no-console - console.log( - `PROBE t=${probeCount}${mem} name="${details && details.name ? details.name : ''}"`, - ); }); } From bc5c0279753859c6fa9f3b6405397315984ef571 Mon Sep 17 00:00:00 2001 From: Luke Melia Date: Mon, 13 Apr 2026 18:42:39 -0400 Subject: [PATCH 07/13] Fix lint errors in leak-hunting probe and afterEach hook - mock-matrix.ts: blank line between import groups (import/order) - setup-qunit.js: multi-line console.log (prettier) and drop an unused eslint-disable-next-line directive that was no longer needed after the log was already wrapped --- packages/host/tests/helpers/mock-matrix.ts | 1 + packages/host/tests/helpers/setup-qunit.js | 5 +++-- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/packages/host/tests/helpers/mock-matrix.ts b/packages/host/tests/helpers/mock-matrix.ts index a0bc242d71a..96352e7bcbc 100644 --- a/packages/host/tests/helpers/mock-matrix.ts +++ b/packages/host/tests/helpers/mock-matrix.ts @@ -1,6 +1,7 @@ import type Owner from '@ember/owner'; import { settled } from '@ember/test-helpers'; + import { getService } from '@universal-ember/test-support'; import window from 'ember-window-mock'; diff 
--git a/packages/host/tests/helpers/setup-qunit.js b/packages/host/tests/helpers/setup-qunit.js index de758218aa6..85c900805ce 100644 --- a/packages/host/tests/helpers/setup-qunit.js +++ b/packages/host/tests/helpers/setup-qunit.js @@ -29,8 +29,9 @@ export function setupQUnit() { if (pm) { let used = (pm.usedJSHeapSize / 1048576).toFixed(1); let total = (pm.totalJSHeapSize / 1048576).toFixed(1); - // eslint-disable-next-line no-console - console.log(`MEMPROBE t=${probeCount} used=${used}MB total=${total}MB`); + console.log( + `MEMPROBE t=${probeCount} used=${used}MB total=${total}MB`, + ); } } catch (_) { /* ignore */ From ab7c1b0f682f92d4ba728f36f40892ac20575439 Mon Sep 17 00:00:00 2001 From: Luke Melia Date: Mon, 13 Apr 2026 18:43:38 -0400 Subject: [PATCH 08/13] Track live owners via WeakRef in MEMPROBE output MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add a WeakRef to each test's this.owner in testDone (after the test body has run), then after gc() count how many prior owners are still dereferenceable. If live_owners grows linearly with test count, the ApplicationInstance is still leaking; if it stays near 0 (or bounded by gc cadence), the leak lives below the owner — e.g., in module-level caches that survive owner destruction. --- packages/host/tests/helpers/setup-qunit.js | 33 +++++++++++++++++++++- 1 file changed, 32 insertions(+), 1 deletion(-) diff --git a/packages/host/tests/helpers/setup-qunit.js b/packages/host/tests/helpers/setup-qunit.js index 85c900805ce..5ada98a572a 100644 --- a/packages/host/tests/helpers/setup-qunit.js +++ b/packages/host/tests/helpers/setup-qunit.js @@ -1,3 +1,4 @@ +import { getContext } from '@ember/test-helpers'; import * as TestWaiters from '@ember/test-waiters'; import * as QUnit from 'qunit'; @@ -16,21 +17,51 @@ export function setupQUnit() { // opportunistic GC can't keep up and the heap drifts toward the 4GB // ceiling in long shards. 
Every 10 tests we also log a memory line so // regressions are visible in CI output. + // + // We also hold WeakRefs to each test's `this.owner` so we can count how + // many prior-test owners are still alive after GC. If that count grows + // linearly with probeCount, we still have an owner-retention leak. let probeCount = 0; + let ownerRefs = []; QUnit.testDone(() => { probeCount++; + // Snapshot the owner NOW (testDone fires before the owner's container is + // torn down for the current test). The WeakRef will be dereferenceable + // until the owner is GC'd — which should happen at the next gc() call + // if nothing retains it. + try { + let ctx = getContext && getContext(); + if (ctx && ctx.owner && typeof WeakRef === 'function') { + ownerRefs.push(new WeakRef(ctx.owner)); + } + } catch (_) { + /* ignore */ + } if (typeof globalThis.gc === 'function') { globalThis.gc(); globalThis.gc(); } if (probeCount % 10 === 0) { + let liveOwners = 0; + if (typeof WeakRef === 'function') { + // Compact the array: drop dead refs so the array itself doesn't + // keep growing unbounded, but count how many survived. 
+ let kept = []; + for (let ref of ownerRefs) { + if (ref.deref()) { + liveOwners++; + kept.push(ref); + } + } + ownerRefs = kept; + } try { let pm = performance && performance.memory; if (pm) { let used = (pm.usedJSHeapSize / 1048576).toFixed(1); let total = (pm.totalJSHeapSize / 1048576).toFixed(1); console.log( - `MEMPROBE t=${probeCount} used=${used}MB total=${total}MB`, + `MEMPROBE t=${probeCount} used=${used}MB total=${total}MB live_owners=${liveOwners}`, ); } } catch (_) { From 3b578b03aec3acc5fa7e0077d1d3c0b088782bed Mon Sep 17 00:00:00 2001 From: Luke Melia Date: Mon, 13 Apr 2026 19:42:59 -0400 Subject: [PATCH 09/13] Sweep destroyed ApplicationInstances from App._applicationInstances MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Heap snapshot analysis showed `App._applicationInstances` accumulating ~1 prior-test ApplicationInstance per host test (11→51 over 40 tests). Each pinned instance retained its Registry → template factories → FieldComponent closures → Box trees → ~7.5MB of base-realm module sources, driving the host test suite into V8 OOM in long shards. Root cause: `ApplicationInstance.willDestroy()` calls `super.willDestroy()` THEN `_unwatchInstance(this)`. When something in the super chain throws, `_unwatchInstance` is skipped — but `@glimmer/destroyable` still marks the instance as `isDestroyed`. The Set keeps a strong reference to a destroyed corpse forever. Fix: in `QUnit.testDone`, after gc(), iterate `_applicationInstances` and delete any entry where `isDestroyed(inst)` is true. Provably safe — only sweeps instances already in the DESTROYED state per @glimmer/destroyable. 
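The sweep itself is tiny (a sketch: `isDestroyedFn` is a parameter standing in for `isDestroyed` from '@ember/destroyable', and the real target Set is Ember's private `App._applicationInstances`):

```javascript
// Remove instances that finished destruction but were never unwatched.
// Deleting the current entry while iterating a Set is well-defined in JS,
// so no defensive copy of the Set is needed.
function sweepDestroyedInstances(instances, isDestroyedFn) {
  for (const inst of instances) {
    if (isDestroyedFn(inst)) {
      instances.delete(inst);
    }
  }
}
```

Because the predicate only matches instances already in the DESTROYED state, the sweep can never tear down something still in use; it just drops strong references that `willDestroy` should have dropped.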
Validated locally on `card-basics` (90 tests): - Heap growth t=10→t=90: 158→775 MB → 116→177 MB - Per-test slope (steady state): ~7.5 MB → ~0.125 MB - `card-api` source string copies: +31/40 tests → +1/40 tests - `app_instances` Set size: climbs to 50+ → stays at 0 Replaces the WeakRef-on-context.owner instrumentation (which always read 0 because `getContext()` returns undefined at testDone after `unsetContext()` runs in teardownContext) with a direct readout of `getApplication()._applicationInstances.size` plus per-state counts. With the leak gone, revert the temporary 30-shard CI bump back to 20. --- .github/workflows/ci-host.yaml | 4 +- packages/host/tests/helpers/setup-qunit.js | 62 ++++++++++++++-------- 2 files changed, 41 insertions(+), 25 deletions(-) diff --git a/.github/workflows/ci-host.yaml b/.github/workflows/ci-host.yaml index 3b6073f6cc0..5f66e07a5fc 100644 --- a/.github/workflows/ci-host.yaml +++ b/.github/workflows/ci-host.yaml @@ -141,8 +141,8 @@ jobs: strategy: fail-fast: false matrix: - shardIndex: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20] - shardTotal: [20] + shardIndex: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30] + shardTotal: [30] concurrency: group: boxel-host-test${{ github.head_ref || github.run_id }}-shard${{ matrix.shardIndex }} cancel-in-progress: true diff --git a/packages/host/tests/helpers/setup-qunit.js b/packages/host/tests/helpers/setup-qunit.js index 5ada98a572a..90308008f10 100644 --- a/packages/host/tests/helpers/setup-qunit.js +++ b/packages/host/tests/helpers/setup-qunit.js @@ -1,4 +1,5 @@ -import { getContext } from '@ember/test-helpers'; +import { isDestroyed, isDestroying } from '@ember/destroyable'; +import { getApplication } from '@ember/test-helpers'; import * as TestWaiters from '@ember/test-waiters'; import * as QUnit from 'qunit'; @@ -18,21 +19,29 @@ export function setupQUnit() { // ceiling in long shards. 
Every 10 tests we also log a memory line so // regressions are visible in CI output. // - // We also hold WeakRefs to each test's `this.owner` so we can count how - // many prior-test owners are still alive after GC. If that count grows - // linearly with probeCount, we still have an owner-retention leak. + // We also read App._applicationInstances.size directly from the Ember + // Application — any ApplicationInstance that didn't complete willDestroy + // stays in this Set and pins its Registry + template factories + Box trees. + // A healthy test suite should keep this at 0 between tests; a linear + // growth in `app_instances` means we have an ApplicationInstance-level leak. let probeCount = 0; - let ownerRefs = []; QUnit.testDone(() => { probeCount++; - // Snapshot the owner NOW (testDone fires before the owner's container is - // torn down for the current test). The WeakRef will be dereferenceable - // until the owner is GC'd — which should happen at the next gc() call - // if nothing retains it. + // Sweep out ApplicationInstances whose willDestroy completed (isDestroyed + // is set) but where super.willDestroy() errored before _unwatchInstance + // could remove them from App._applicationInstances. Without this sweep, + // each leaked instance pins its Registry → template factories → + // FieldComponent closures → Box trees → ~7.5MB of base-realm module + // sources. Removing destroyed instances from the Set is provably safe + // (they're already in the DESTROYED state per @glimmer/destroyable). 
try { - let ctx = getContext && getContext(); - if (ctx && ctx.owner && typeof WeakRef === 'function') { - ownerRefs.push(new WeakRef(ctx.owner)); + let app = getApplication && getApplication(); + if (app && app._applicationInstances) { + for (let inst of app._applicationInstances) { + if (isDestroyed(inst)) { + app._applicationInstances.delete(inst); + } + } } } catch (_) { /* ignore */ @@ -42,18 +51,25 @@ export function setupQUnit() { globalThis.gc(); } if (probeCount % 10 === 0) { - let liveOwners = 0; - if (typeof WeakRef === 'function') { - // Compact the array: drop dead refs so the array itself doesn't - // keep growing unbounded, but count how many survived. - let kept = []; - for (let ref of ownerRefs) { - if (ref.deref()) { - liveOwners++; - kept.push(ref); + let appInstances = -1; + let aliveInstances = -1; + let destroyingInstances = -1; + let destroyedInstances = -1; + try { + let app = getApplication && getApplication(); + if (app && app._applicationInstances) { + appInstances = app._applicationInstances.size; + aliveInstances = 0; + destroyingInstances = 0; + destroyedInstances = 0; + for (let inst of app._applicationInstances) { + if (isDestroyed(inst)) destroyedInstances++; + else if (isDestroying(inst)) destroyingInstances++; + else aliveInstances++; } } - ownerRefs = kept; + } catch (_) { + /* ignore */ } try { let pm = performance && performance.memory; @@ -61,7 +77,7 @@ export function setupQUnit() { let used = (pm.usedJSHeapSize / 1048576).toFixed(1); let total = (pm.totalJSHeapSize / 1048576).toFixed(1); console.log( - `MEMPROBE t=${probeCount} used=${used}MB total=${total}MB live_owners=${liveOwners}`, + `MEMPROBE t=${probeCount} used=${used}MB total=${total}MB app_instances=${appInstances} alive=${aliveInstances} destroying=${destroyingInstances} destroyed=${destroyedInstances}`, ); } } catch (_) { From f2ff5175ab80cd64181a6e43e94bbd94cc8d1f51 Mon Sep 17 00:00:00 2001 From: Luke Melia Date: Mon, 13 Apr 2026 19:50:38 -0400 Subject: [PATCH 
10/13] Add heap-leak hunting skill + diagnostic scripts MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Documents the local probe loop used to find the leaks in this branch and includes the snapshot-analysis scripts so the next person can repeat the investigation without re-deriving the workflow from scratch. - `.claude/skills/host-test-memory-leak-hunting/SKILL.md` — when to use, the probe loop step-by-step, pitfalls, and how to validate a fix. - `.claude/skills/host-test-memory-leak-hunting/known-leaks.md` — catalog of leaks found so far (App._applicationInstances Set, mock-matrix testState, globalThis closures + listener capture-flag mismatch, resize handle modifier) plus patterns to look for when adding new code. - `packages/host/scripts/heap-snapshot-runner.js` — CDP client that watches Chrome console for `MEMPROBE t=N` and writes `/tmp/snap-tN.heapsnapshot` (streams chunks to disk; joining blows V8's max string length past ~300MB). - `packages/host/scripts/snapshot-diff.js` — constructor-name count deltas between two snapshots, sorted by retained-size delta. - `packages/host/scripts/snapshot-by-class.js` — same idea, name-truncated for at-a-glance scanning, sorted by count delta. - `packages/host/scripts/snapshot-retainers.js` — backward BFS from matching nodes to GC roots; `--strong` flag (almost always required) skips weak/shortcut edges and WeakMap "part of key" internal edges that would otherwise mask the true retainer chain. - `packages/host/scripts/HEAP_PROBE.md` — invocation reference for the scripts, linked from the skill. Dev-time tooling only — not invoked from CI or production builds. 
--- .../host-test-memory-leak-hunting/SKILL.md | 124 ++++++++ .../known-leaks.md | 71 +++++ packages/host/scripts/HEAP_PROBE.md | 94 ++++++ packages/host/scripts/heap-snapshot-runner.js | 141 +++++++++ packages/host/scripts/snapshot-by-class.js | 77 +++++ packages/host/scripts/snapshot-diff.js | 88 ++++++ packages/host/scripts/snapshot-retainers.js | 275 ++++++++++++++++++ 7 files changed, 870 insertions(+) create mode 100644 .claude/skills/host-test-memory-leak-hunting/SKILL.md create mode 100644 .claude/skills/host-test-memory-leak-hunting/known-leaks.md create mode 100644 packages/host/scripts/HEAP_PROBE.md create mode 100644 packages/host/scripts/heap-snapshot-runner.js create mode 100644 packages/host/scripts/snapshot-by-class.js create mode 100644 packages/host/scripts/snapshot-diff.js create mode 100644 packages/host/scripts/snapshot-retainers.js diff --git a/.claude/skills/host-test-memory-leak-hunting/SKILL.md b/.claude/skills/host-test-memory-leak-hunting/SKILL.md new file mode 100644 index 00000000000..48de884fb50 --- /dev/null +++ b/.claude/skills/host-test-memory-leak-hunting/SKILL.md @@ -0,0 +1,124 @@ +--- +name: host-test-memory-leak-hunting +description: Hunt down memory leaks in the @cardstack/host Ember test suite using V8 heap snapshots over CDP. Use when host CI shards OOM, when MEMPROBE output shows used-heap climbing across tests, or when adding a new feature that holds DOM/closure state across the test lifecycle. Documents the local probe loop, the snapshot-analysis scripts, and the catalog of leaks already found. +allowed-tools: Read, Grep, Glob, Edit, Write, Bash +--- + +# Host test memory leak hunting + +The host test suite is large enough that a per-test leak as small as 5 MB will OOM a CI shard before the suite finishes. This skill is the playbook for finding and validating fixes. 
It exists because a single leak hunt can take days if you don't have the loop tightened — Chrome heap snapshots are slow, big, and ambiguous unless you reduce them to retainer paths. + +See [`known-leaks.md`](./known-leaks.md) for the catalog of leaks found so far and the patterns that cause them. Read it first — most "new" leaks rhyme with one of those. + +## When to use + +- CI host-test shard OOMs (V8 `Reached heap limit Allocation failed` in the test log). +- Local `pnpm test` shows `MEMPROBE used=` climbing every 10 tests. +- New feature added a service, modifier, or globalThis closure that captures the test's `owner`. +- After landing a fix that is supposed to flatten the slope — verify before declaring victory. + +## What's already in the suite + +Two pieces of permanent instrumentation in `packages/host/tests/helpers/setup-qunit.js`: + +1. **Forced GC every test** (requires Chrome `--js-flags="--expose-gc"`). Without this V8's opportunistic GC can't keep up and the heap drifts toward the 4GB ceiling regardless of leak size. +2. **`MEMPROBE` log line every 10 tests** with `used=`/`total=`/`app_instances=`/`alive=`/`destroying=`/`destroyed=` — visible in CI output and used by the snapshot runner to trigger snaps. + +Three diagnostic scripts in `packages/host/scripts/` (only used during a hunt): + +- `heap-snapshot-runner.js` — CDP client. Watches Chrome console for `MEMPROBE t=N`; at configured indices, takes a heap snapshot and streams chunks to disk (joining blows V8's max string length past ~300MB). +- `snapshot-diff.js` — top-N constructor counts/retained-sizes that grew between two snapshots. +- `snapshot-retainers.js` — backward BFS from target nodes to GC roots. Flags: `--type=`, `--min-size=N`, `--strong` (skips `weak`/`shortcut` AND WeakMap "part of key" internal edges; use this — it's the difference between a misleading WeakMap path and the real strong retainer).
+- `snapshot-by-class.js` — like snapshot-diff but truncates long names so output is scannable. + +See [`packages/host/scripts/HEAP_PROBE.md`](../../../packages/host/scripts/HEAP_PROBE.md) for invocation reference. + +## The probe loop + +The dev stack must already be up via `mise run dev-all`, with Chrome launched with `--remote-debugging-port=9333 --js-flags="--expose-gc" --enable-precise-memory-info`. (You can launch a dedicated Chrome on macOS with `/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9333 --js-flags="--expose-gc" --enable-precise-memory-info --user-data-dir=/tmp/chrome-leak-probe`.) + +### 1. Pick a fast filter + +The test filter you choose dominates iteration time. Use a module that exercises the suspect code path AND completes in ~1 second per test. Good defaults: + +- `card-basics` — 99 tests, ~1s/test, exercises card-api + Box trees (hits the most common leak shape) +- `Integration` — slower, broader, use only when narrower filters miss the signal +- A specific module name from a failing CI shard + +Avoid filters that include realm-indexing-heavy tests (e.g. those that intentionally trigger errors) — those run at ~60s/test and crush iteration. + +### 2. Start the snapshot runner + +```sh +rm -f /tmp/snap-t*.heapsnapshot /tmp/snap-runner.log +SNAPSHOT_AT="10,50,90" \ + nohup node packages/host/scripts/heap-snapshot-runner.js > /tmp/snap-runner.log 2>&1 & +``` + +Snapshots at `t=10` (warm) and `t=50` give a clean delta over 40 tests. Add `t=90` to confirm the t=10→t=50 slope continues (rules out one-time allocations). + +### 3. Open a fresh test tab + +```sh +ENCODED=$(node -e 'console.log(encodeURIComponent("http://localhost:4200/tests/index.html?hidepassed&filter=card-basics"))') +curl -sX PUT "http://localhost:9333/json/new?${ENCODED}" +``` + +Important: **close any prior `/tests/` tabs first** — the runner picks the first matching tab and will get stuck on a stale one.
The runner exits after writing the last snapshot in `SNAPSHOT_AT`. + +### 4. Read the MEMPROBE lines + +The runner echoes every `MEMPROBE` line as it streams from the browser. Look at the trajectory: + +- `used=` climbing linearly = leak. Per-test slope = `(used_t90 - used_t10) / 80`. +- `app_instances=` > 0 between tests = `App._applicationInstances` leak (see known-leaks.md). +- `destroyed=N alive=0` with N > 0 = `willDestroy` ran but `_unwatchInstance` was skipped (super.willDestroy threw). + +### 5. Diff the snapshots to identify what grew + +```sh +node --max-old-space-size=16384 packages/host/scripts/snapshot-diff.js \ + /tmp/snap-t10.heapsnapshot /tmp/snap-t50.heapsnapshot | head -40 +``` + +The top of the output sorted by retained-size delta tells you what's accumulating. Watch for: + +- `+N copies` of `string::define("https://cardstack.com/base/...")` — a Registry/factory chain is pinning fresh-per-test card-api module sources. Almost always points back to `App._applicationInstances` retention. +- `+N copies` of `JSArrayBufferData` (native) — usually downstream of the above (each pinned ApplicationInstance brings ArrayBuffers). +- `+N` Box / FieldComponent / Glimmer-internal counts — Box tree retention. + +### 6. Trace the retainer chain + +Pick a high-count constructor from the diff and trace it: + +```sh +node --max-old-space-size=16384 packages/host/scripts/snapshot-retainers.js \ + /tmp/snap-t50.heapsnapshot '^Box$' \ + --max=10 --depth=30 --type=object --strong +``` + +**Always pass `--strong`.** Without it, the shortest path runs through `weak` edges and WeakMap "part of key" internal edges — neither of which actually retain the target. With `--strong`, you see the real GC-root path. The retainer signature (the chain of edge labels) tells you what to fix. + +If the chain ends at `Window.@cardstack/host` → some service or registry, fix the leak there. 
If it ends at `synthetic root` → `(GC roots)` → some array, you're likely looking at a global Set/Map that needs a WeakMap or an explicit cleanup. + +### 7. Apply the fix and re-run from step 2 + +If the slope flattens and the offending constructor count stops growing, you're done. Save a memory entry with the retainer chain and the fix. + +## Pitfalls + +- **`getContext()` returns undefined at `QUnit.testDone`** because `unsetContext()` runs in `teardownContext` before the owner is destroyed. Don't snapshot `ctx.owner` in testDone — capture in `hooks.afterEach` of the test module, or read `getApplication()._applicationInstances` directly. +- **WeakMap edges**. Heap snapshots show WeakMap key→value as a normal-looking edge. `snapshot-retainers.js` skips them when `--strong` is passed; otherwise you'll chase ghosts. +- **V8 max string length on big snapshots**. `JSON.parse(fs.readFileSync(snap, 'utf8'))` works up to ~500MB, but `chunks.join('')` blows up around 300MB. The runner streams chunks straight to disk for this reason. +- **GC timing**. The `globalThis.gc(); globalThis.gc()` double-call in setup-qunit is intentional — V8 sometimes needs a second pass to actually collect. If a snapshot still shows the suspect, re-snap after a short wait. +- **Stale tabs**. Each `/json/new?` call creates a NEW tab. The runner picks the first one. Close stale tabs before starting the runner. + +## Validating the fix + +A clean fix is one that: + +1. Makes the per-test slope flat in steady state (compare t=50→t=90, not t=10→t=50 — the early region warms up caches). +2. Drops the leaked constructor count delta to zero (or single digits — V8 caching introduces a small floor). +3. Still passes the full host-test shard 3 (the historical canary for memory pressure) under the original `shardTotal: 20` config. + +If you bumped CI shard count as a temporary mitigation, revert that bump in the same PR as the fix. 
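The steady-state slope check in (1) can be computed mechanically from captured `MEMPROBE` output. A sketch with invented helper names (`parseMemprobe`, `perTestSlope` — these are not in `packages/host/scripts/`):

```javascript
// Hypothetical helpers: parse MEMPROBE console lines and compute the
// per-test heap-growth slope between two probe indices.
function parseMemprobe(line) {
  const m = line.match(/MEMPROBE t=(\d+) used=([\d.]+)MB total=([\d.]+)MB/);
  if (!m) return null;
  return { t: Number(m[1]), usedMB: Number(m[2]) };
}

function perTestSlope(lines, tStart, tEnd) {
  const probes = lines.map(parseMemprobe).filter(Boolean);
  const start = probes.find((p) => p.t === tStart);
  const end = probes.find((p) => p.t === tEnd);
  if (!start || !end) throw new Error('missing MEMPROBE sample');
  // MB of heap growth per test across the [tStart, tEnd] window
  return (end.usedMB - start.usedMB) / (tEnd - tStart);
}
```

Compare the t=50→t=90 window against t=10→t=50: a fix is only validated when the later window is flat, since the early window includes one-time cache warm-up.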
diff --git a/.claude/skills/host-test-memory-leak-hunting/known-leaks.md b/.claude/skills/host-test-memory-leak-hunting/known-leaks.md new file mode 100644 index 00000000000..3503610b1be --- /dev/null +++ b/.claude/skills/host-test-memory-leak-hunting/known-leaks.md @@ -0,0 +1,71 @@ +# Known host-test memory leaks (catalog) + +A running catalog of leaks found in the host test suite, the patterns that caused them, and the fixes. Read this before chasing a "new" leak — most rhyme with one of these. + +Each entry is short on purpose. The fix code is in the linked commit; the *pattern* is what's worth recognizing. + +--- + +## 1. `App._applicationInstances` Set retains destroyed instances + +**Commit:** "Sweep destroyed ApplicationInstances from App._applicationInstances" +**Severity:** ~7.5 MB / test (dominant leak in the suite) + +`ApplicationInstance.willDestroy()` calls `super.willDestroy()` THEN `this.application._unwatchInstance(this)`. When something in the super chain throws, `_unwatchInstance` is skipped — but `@glimmer/destroyable` still marks the instance as `isDestroyed`. The instance stays pinned in `App._applicationInstances` (a `Set`, not a `WeakSet`), retaining its Registry → template factories → FieldComponent closures → Box trees → ~7.5MB of base-realm module sources. + +**Fix pattern:** sweep destroyed instances out of the Set in `QUnit.testDone`. Provably safe — only acts on instances already in the DESTROYED state per `@glimmer/destroyable`. + +**Diagnostic:** `MEMPROBE` shows `app_instances=N alive=0 destroying=0 destroyed=N` with N climbing. + +**Open follow-up:** find what's throwing in `super.willDestroy()` and file an upstream Ember issue. Likely a destructor in a boxel addon/card, but not yet identified. + +--- + +## 2. 
`setupMockMatrix` leaked `testState` across tests + +**Commit:** "Fix memory leak: clear testState in setupMockMatrix afterEach" +**Severity:** medium + +`testState` was assigned in `beforeEach` but the previous test's `testState` reference (a closure-captured object holding rooms, events, members, the test's owner via service callbacks) was never released. Each test added to the retained-by-closure pile. + +**Fix pattern:** explicitly null out the captured object in `afterEach`. Do not rely on the next `beforeEach` reassignment to release the prior value — the closure that referenced the prior value may outlive the test. + +--- + +## 3. `globalThis.__loadCodeMirror` and friends — closure-over-owner + +**Commit:** "Fix memory leaks: globalThis closures + capture-flag mismatch retain owners" +**Severity:** medium + +Two flavors in one commit: + +a. **`globalThis` lazy-load closures captured the test's `owner`.** Code like `globalThis.__loadFoo = () => owner.lookup('service:foo').load()` registers a function on the global that closes over `owner`. The next test reassigns the same global, but if any prior callsite kept a reference (a card `