Reworked threading model.#1652
Merged
Merged
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR reworks the graphics threading/synchronization model by replacing per-thread encoders and the SafeTimespan/UpdateToken mechanism with a single per-frame bgfx encoder owned by DeviceImpl and a new FrameCompletionScope gate/counter for cross-thread frame completion synchronization.
Changes:
- Removed
SafeTimespanGuarantor,UpdateToken/Update, and per-thread encoder management; introduced a single frame encoder lifecycle inDeviceImpl. - Added
FrameCompletionScope+ frame-start/before-render/after-render scheduling to coordinate JS-thread work withbgfx::frame(). - Updated NativeEngine/NativeXR/Canvas/FrameBuffer APIs to use
GetActiveEncoder()and internal encoder acquisition patterns instead of passing encoders through call chains.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| Core/Graphics/Source/DeviceImpl.h | Adds single-frame encoder pointer and frame gate (mutex/CV + pending scope count). |
| Core/Graphics/Source/DeviceImpl.cpp | Implements frame open/close sequencing, scope blocking, and frame-start dispatcher tick. |
| Core/Graphics/InternalInclude/Babylon/Graphics/DeviceContext.h | Replaces Update/UpdateToken with FrameCompletionScope, adds frame-start scheduler + encoder accessors. |
| Core/Graphics/Source/DeviceContext.cpp | Implements FrameCompletionScope RAII and forwards new scheduling/encoder methods. |
| Core/Graphics/Include/Shared/Babylon/Graphics/Device.h | Converts DeviceUpdate into a no-op compatibility shim and inlines GetUpdate. |
| Core/Graphics/Source/Device.cpp | Removes old GetUpdate implementation. |
| Core/Graphics/InternalInclude/Babylon/Graphics/FrameBuffer.h | Removes encoder parameters from bind/unbind/viewport/scissor APIs. |
| Core/Graphics/Source/FrameBuffer.cpp | Updates FrameBuffer internals to acquire view IDs without needing an encoder parameter. |
| Plugins/NativeEngine/Source/NativeEngine.h | Switches to active-encoder access + outside-frame scope holding; adds begin/end frame no-ops. |
| Plugins/NativeEngine/Source/NativeEngine.cpp | Updates command submission and RAF scheduling to use frame-start scheduling + scopes; updates ReadTexture blit scheduling. |
| Plugins/NativeEngine/Source/PerFrameValue.h | Adjusts API to remove encoder parameters (but currently has a type-signature issue). |
| Plugins/NativeXr/Source/NativeXrImpl.h | Removes stored Update member from XR session state. |
| Plugins/NativeXr/Source/NativeXrImpl.cpp | Migrates XR scheduling to frame-start + scopes; updates encoder usage. |
| Polyfills/Canvas/Source/Context.h | Removes stored Update member. |
| Polyfills/Canvas/Source/Context.cpp | Uses GetActiveEncoder() and acquires a FrameCompletionScope when flushing outside the frame cycle. |
| Polyfills/Canvas/Source/nanovg/nanovg_babylon.cpp | Updates FrameBuffer binding calls to new signature (no encoder param). |
| Apps/HeadlessScreenshotApp/Win32/App.cpp | Wraps startup wait with Start/Finish frame calls to keep gate open during startup. |
| Apps/StyleTransferApp/Win32/App.cpp | Same startup framing change as HeadlessScreenshotApp. |
| Core/Graphics/CMakeLists.txt | Removes SafeTimespanGuarantor sources from build. |
Comments suppressed due to low confidence (1)
Plugins/NativeEngine/Source/PerFrameValue.h:33
PerFrameValueis templated onT, butSetcurrently takes aboolparameter. This makes the template misleading and will break or silently convert ifPerFrameValueis ever instantiated with a non-bool type. ChangeSetto acceptT(e.g.,const T&orT) to match the class template parameter.
T Get() const
{
return m_value;
}
void Set(bool value)
{
m_value = value;
if (!m_isResetScheduled)
{
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
9f25ed5 to
0b8bfad
Compare
…er FrameCompletionScope acquisition.
…ompletionScope instead of deferred member release.
This reverts commit 0d315bb.
…cquired FrameCompletionScope in SubmitCommands.
…ck on FrameCompletionScope.
…enderScheduler fires one frame late when called outside RAF.
…te leaking into nanovg rendering.
… getFrameBufferData outside RAF.
0b8bfad to
e97bfdc
Compare
bghgary
added a commit
to bghgary/BabylonNative
that referenced
this pull request
Apr 23, 2026
Matches the established pattern from ExternalTexture.AddToContextAsyncAndUpdate on master: wait for JS work to complete while the frame is still open, then Finish. The new CreateForJavaScript and RenderTextureArray tests in this PR had accidentally dropped this pattern, which works on master but deadlocks under the reworked threading model in BabylonJS#1652 (Finish closes the frame gate while JS is still acquiring FrameCompletionScopes). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bghgary
added a commit
to bghgary/BabylonNative
that referenced
this pull request
Apr 29, 2026
Hypothesis test for the 4-30x Linux JSC CI runtime variance on rework-thread-model: The new FrameCompletionScope gate model on this branch makes JS-thread SubmitCommands legal only between StartRenderingCurrentFrame and FinishRenderingCurrentFrame. The previous test pump called Start and Finish back-to-back with the wait_for AFTER Finish — collapsing the open window to ~zero. JS-thread work then had to win a scheduler race to land its scope before the pump thread re-acquired the mutex and closed the gate. On contended runners, JS thread loses that race many times per second, producing 4-30x runtime variance. Restructure: open the gate, then wait_for, then briefly close+reopen each iteration. Gate is open ~99% of the time so JS thread never has to race. If this commit reduces variance to master-baseline (5-10 min) consistently, the production fix belongs in BabylonJS#1652 itself (either invert the consumer pattern or eliminate the gate). [Created by Copilot on behalf of @bghgary] Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bghgary
added a commit
to bghgary/BabylonNative
that referenced
this pull request
Apr 29, 2026
Root cause of the 4-30x Linux JSC CI runtime variance proven on rework-thread-model-perf branch (4 baseline runs hit 30-60min, 4 fix runs ran Unit Tests in 3 sec). On rework-thread-model the new FrameCompletionScope gate model makes JS-thread SubmitCommands legal only between StartRenderingCurrentFrame and FinishRenderingCurrentFrame. The previous test pump called Start and Finish back-to-back with the wait_for AFTER Finish, collapsing the open window to ~zero. JS-thread work then had to win a scheduler race to land its scope before the pump thread re-acquired the mutex and closed the gate. On contended runners JS lost that race repeatedly, accumulating per-poll 16ms+frame() stalls into 30+ minute runtimes. Restructure: open the gate before the loop, then for each iteration wait_for(16ms) -> Finish -> Start. Gate is open ~99% of the time so JS thread never has to race. Validation: rework-thread-model-perf branch (BabylonJS#1652 base + this fix only) ran Unit Tests in 3 seconds across all reachable jobs in 4 parallel runs. The previous baseline showed Unit Tests timing out at 30-60 min on multiple runs. Note: Validation Tests (Playground pixel diff) is a separate flaky issue unrelated to this fix. [Created by Copilot on behalf of @bghgary] Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…fresh docs The std::move(scope) capture in 7f98c7c produced a move-only lambda. Dispatchable at the AppRuntime layer is move-only and accepts it, but JsRuntime downstream type-erases the callable through std::function which requires copy-constructibility: std_function.h:439:18: error: static assertion failed due to requirement 'is_copy_constructible<...>::value': std::function target must be copy-constructible MSVC accepted this; gcc 14 (Ubuntu CI) rejects it. Wrap the scope in shared_ptr so the lambda is copyable, matching the existing RAF dispatcher pattern in ScheduleRequestAnimationFrameCallbacks. Also refresh the comments on FrameCompletionScope and AcquireFrameCompletionScope. The previous class-level comment listed three usage patterns, but one of them ("NativeEngine::GetEncoder() acquires lazily") didn't match the code — GetEncoder doesn't acquire a scope. Replace with the two patterns that are actually in use: 1. JS-frame scoped via m_runtime.Dispatch capture (RAF, SubmitCommands) 2. Block scoped on the stack (Canvas::Flush fallback, ReadTextureAsync) [Created by Copilot on behalf of @bghgary] Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Schedule the FrameCompletionScope-releasing lambda before the command stream is processed, not after. Equivalent in the success path, but makes the JS-frame coverage exception-safe — if the command processing throws, the lambda is already queued and the scope survives to the next dispatch cycle, instead of being destroyed at the throw and dropping the count to 0 on the render thread's view. Also drops the explicit local in favour of an inline make_shared in the lambda capture — the local was only there to be captured. [Created by Copilot on behalf of @bghgary] Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The MSAA tests deadlocked after merging master into rework-thread-model. The render thread waits on startupDone while the JS thread runs the AddToContextAsync .then() callback. That callback calls startup() -> new NativeEngine() / new Scene() / new RenderTargetTexture(), which eventually hits SubmitCommands. SubmitCommands now synchronously acquires a FrameCompletionScope, which blocks while m_frameBlocked is true. The gate only opens on the next StartRenderingCurrentFrame -- but the render thread is stuck on startupDone. Deadlock. Fix: open frame 2 before waiting on startupDone, and reuse the same frame for renderFrame(). Any JS work that touches the engine has to run while a frame is in progress under the new model. This workaround is temporary. Once #1646 lands, the test should migrate from AddToContextAsync to the new synchronous CreateForJavaScript, at which point startup() runs in the same JS task as the texture creation and the cross-frame dance disappears entirely. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bghgary
added a commit
to bghgary/BabylonNative
that referenced
this pull request
May 29, 2026
With BabylonJS#1646's synchronous ExternalTexture::CreateForJavaScript and BabylonJS#1652's new threading model both in this branch, the original test pattern is both unnecessary and broken: - Unnecessary: AddToContextAsync's two-frame dance (queue in frame 1, wait for bgfx::frame() to resolve the promise, run startup in frame 2) exists only because the texture wrap was async. CreateForJavaScript is synchronous so the wrap, startup(), and renderFrame() all run in the same JS task inside a single frame. - Broken: under the new model, SubmitCommands synchronously acquires a FrameCompletionScope which blocks when no frame is in progress. The old between-frames startup() pattern deadlocks. Collapse to a single frame + single dispatch that wraps the texture, runs startup(), and chains renderFrame()'s promise resolution. The host app pattern host apps should follow with the new API. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bghgary
added a commit
to bghgary/BabylonNative
that referenced
this pull request
May 29, 2026
BabylonJS#1689 added the rendering-enabled check to DeviceImpl::UpdateDevice on master. Both BabylonJS#1646 and BabylonJS#1652 also have it. The check was dropped during this branch's merge with the older partner-test base, causing Device.UpdateDeviceThrowsWhenRenderingEnabled to fail in CI. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bghgary
added a commit
to bghgary/BabylonNative
that referenced
this pull request
May 29, 2026
Same fix as the MSAA migration: AddToContextAsync's between-frames .then() pattern deadlocks under BabylonJS#1652's threading model because SubmitCommands synchronously acquires a FrameCompletionScope which blocks when no frame is in progress. CreateForJavaScript runs synchronously inside the same dispatch lambda, so startup() / restoreTexture() execute in the same JS task as the wrap. No cross-frame .then() dance, no deadlock. Without this, all four Win32 D3D11 jobs hit the 60-minute timeout on the 'ExternalTexture.RestoreAfterDeviceLoss' test (added in BabylonJS#1716, merged into master after this branch's last sync). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
# Conflicts: # Plugins/NativeEngine/Source/NativeEngine.cpp # Polyfills/Canvas/Source/Context.cpp
The test drives device loss/restore via the async AddToContextAsync + manual frame-pump pattern, which deadlocks under the single-frame-encoder model in this PR. It is re-enabled in #1646 after migration to the synchronous CreateForJavaScript API. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replaces the earlier skip. Under the single-frame-encoder model the async AddToContextAsync .then() runs startup()/restoreTexture() on the JS thread, whose SubmitCommands blocks until a frame is in progress. Open the next frame before each blocking wait on the async result (and reuse it for the following renderFrame), mirroring the workaround already applied to Tests.ExternalTexture.Msaa.cpp. Obsoleted once #1646 migrates these tests to the synchronous CreateForJavaScript API. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
|
🎉 |
bghgary
added a commit
to bghgary/BabylonNative
that referenced
this pull request
Jun 9, 2026
Conflict resolutions (final state matches the BabylonJS#1674 integration reference plus the layerIndex work from the Teams branch): - Apps (HeadlessScreenshot/PrecompiledShaderTest/StyleTransfer): combine BabylonJS#1652's frame-pump (open a startup frame so CreateForJavaScript's blit has an active frame) with BabylonJS#1646's synchronous CreateForJavaScript migration. - Tests.ExternalTexture.Msaa.cpp: BabylonJS#1646's synchronous CreateForJavaScript migration supersedes the frame-pump workaround BabylonJS#1652 added. - NativeEngine::SetTexture: combine BabylonJS#1652's GetEncoder() with BabylonJS#1646's layerIndex single-slice view overload. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bghgary
added a commit
to bghgary/BabylonNative
that referenced
this pull request
Jun 9, 2026
The master merge brought in BabylonJS#1716's RestoreAfterDeviceLoss test using the async AddToContextAsync + frame-pump pattern (added by BabylonJS#1652). With BabylonJS#1646's synchronous CreateForJavaScript, that test runs startup() in the same JS task as the texture wrap, so the frame-pump is unnecessary; migrate it to match the other ExternalTexture tests. Also correct a now-stale comment in AddToContextAsyncAndUpdate: the deprecated AddToContextAsync shim resolves synchronously via CreateForJavaScript, so the frame finish no longer drives its completion. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bghgary
added a commit
to bghgary/BabylonNative
that referenced
this pull request
Jun 9, 2026
Conflict resolutions (final state matches the BabylonJS#1674 integration reference): - Apps (HeadlessScreenshot/PrecompiledShaderTest/StyleTransfer): combine BabylonJS#1652's frame-pump (open a startup frame so CreateForJavaScript's blit has an active frame) with BabylonJS#1646's synchronous CreateForJavaScript migration. - Tests.ExternalTexture.Msaa.cpp: BabylonJS#1646's synchronous CreateForJavaScript migration supersedes the frame-pump workaround BabylonJS#1652 added. - NativeEngine::SetTexture: combine BabylonJS#1652's GetEncoder() with BabylonJS#1646's layerIndex single-slice view overload. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bghgary
added a commit
to bghgary/BabylonNative
that referenced
this pull request
Jun 9, 2026
The master merge brought in BabylonJS#1716's RestoreAfterDeviceLoss test using the async AddToContextAsync + frame-pump pattern (added by BabylonJS#1652). With BabylonJS#1646's synchronous CreateForJavaScript, that test runs startup() in the same JS task as the texture wrap, so the frame-pump is unnecessary; migrate it to match the other ExternalTexture tests. Also correct a now-stale comment in AddToContextAsyncAndUpdate: the deprecated AddToContextAsync shim resolves synchronously via CreateForJavaScript, so the frame finish no longer drives its completion. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bkaradzic-microsoft
added a commit
that referenced
this pull request
Jun 10, 2026
…1744) ## Problem After syncing to the latest master (which includes #1739 divisor-driven per-instance vertex attributes and the #1652 thread-model rework), a number of previously-excluded Playground validation tests now render correctly. They were left disabled because they used to crash/hang or pixel-diff against their references on Babylon Native. ## Change Re-enable **16** `validation_native.js` tests in `Apps/Playground/Scripts/config.json`. No code changes. ## Verification Each candidate was first confirmed passing **in isolation** on both backends, then the **full sweep** was re-run on both to ensure no order-dependent state-leak cascade (the runner aborts the whole sweep on the first hard failure): - **Win32 D3D11:** `ran=289 passed=289 failed=0` (was 273) - **OpenGL (ANGLE):** `ran=280 passed=280 failed=0` Stencil/scissor tests that pass in isolation but leak GPU state into a later test (`cube-with-holes-using-stencil-buffer`, `Scissor test with 0.9 hardware scaling`) were intentionally **kept excluded** to avoid reintroducing a cascade. ## Re-enabled tests - Area Lights - Standard Material - Area Lights - PBR - GLTF Node visibility test - Export GLTF Extension KHR_texture_transform - simple-render-target-with-blue-spheres - pillars-sphere-and-torus-with-PCSS-shadows - torus-knot-mirror - simple-sphere-in-4-mirrors - procedural-texture-with-node-material - simple-custom-shader - change-texture-of-material - OpenPBR Thin Film Weight VS IOR - Analytic Lights - OpenPBR Thin Film Thickness VS IOR - Analytic Lights - OpenPBR Thin Film and Specular Weight - Analytic Lights - OpenPBR Thin Film and Specular Weight Metals - Analytic Lights - WebGPU async pipeline pre-warming ## Notes Validated locally on **Win32 D3D11** and **OpenGL** only; Metal/iOS not validated locally — worth watching Mac CI. --------- Co-authored-by: Branimir Karadzic <branimirkaradzic@gmail.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bkaradzic-microsoft
pushed a commit
that referenced
this pull request
Jun 10, 2026
DeviceUpdate was reduced to a no-op compatibility shim when the threading model was reworked (#1652) -- frame synchronization is now handled by FrameCompletionScope inside StartRenderingCurrentFrame/ FinishRenderingCurrentFrame. This removes the deprecated shim (Graphics::DeviceUpdate and Device::GetUpdate) and all of its now-redundant callers across the apps, platform shells, and unit tests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bkaradzic-microsoft
pushed a commit
that referenced
this pull request
Jun 10, 2026
DeviceUpdate was reduced to a no-op compatibility shim when the threading model was reworked (#1652) -- frame synchronization is now handled by FrameCompletionScope inside StartRenderingCurrentFrame/ FinishRenderingCurrentFrame. This removes the deprecated shim (Graphics::DeviceUpdate and Device::GetUpdate) and all of its now-redundant callers across the apps, platform shells, and unit tests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bkaradzic-microsoft
pushed a commit
that referenced
this pull request
Jun 10, 2026
DeviceUpdate became a no-op compatibility shim when the threading model was reworked (#1652) -- frame synchronization is now handled by FrameCompletionScope inside StartRenderingCurrentFrame/ FinishRenderingCurrentFrame. This removes all of its now-redundant callers across the apps, platform shells (Win32/UWP/X11/Android/iOS/macOS/visionOS), and unit tests, and marks Graphics::DeviceUpdate and Device::GetUpdate [[deprecated]] so any remaining external callers get a warning. The shim itself is kept for source compatibility and will be removed in a future PR. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bkaradzic-microsoft
pushed a commit
that referenced
this pull request
Jun 10, 2026
DeviceUpdate became a no-op compatibility shim when the threading model was reworked (#1652) -- frame synchronization is now handled by FrameCompletionScope inside StartRenderingCurrentFrame/ FinishRenderingCurrentFrame. This removes all of its now-redundant callers across the apps, platform shells (Win32/UWP/X11/Android/iOS/macOS/visionOS), and unit tests, and marks DeviceUpdate::Start/Finish/RequestFinish and Device::GetUpdate [[deprecated]] so any remaining external callers get a warning. The shim itself is kept for source compatibility and will be removed in a future PR. The type is left un-deprecated so GetUpdate can keep returning it without a self-warning. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bkaradzic-microsoft
added a commit
that referenced
this pull request
Jun 11, 2026
`DeviceUpdate` became a no-op compatibility shim when the threading model was reworked (#1652) — frame synchronization is now handled by `FrameCompletionScope` inside `StartRenderingCurrentFrame`/`FinishRenderingCurrentFrame`. This PR: - **Removes all of its now-redundant callers** across the apps, platform shells (Win32/UWP/X11/Android/iOS/macOS/visionOS), and unit tests. - **Marks `DeviceUpdate::Start`/`Finish`/`RequestFinish` and `Device::GetUpdate` `[[deprecated]]`** so any remaining external callers get a warning. The type itself is left un-deprecated so `GetUpdate` can keep returning it without a self-warning (no compiler-specific pragmas needed). The shim implementation is **kept** for source compatibility and will be removed in a future PR. Rebuilt fresh on top of current `master` (the original branch predated the squash-merge of #1652). Verified locally on Win32: Playground and UnitTests build clean with no deprecation-warning leakage, and all UnitTests pass. Co-authored-by: Branimir Karadžić <branimirk@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bkaradzic-microsoft
pushed a commit
to bkaradzic-microsoft/BabylonNative
that referenced
this pull request
Jun 12, 2026
LoadTextureFromImage (and the single-mip-per-face branch of LoadCubeTextureFromImages) drove its mip loop with the condition `mip < image->m_numMips`, re-reading image->m_numMips every iteration. On the last iteration the mip's bgfx::makeRef hands ownership of `image` to bgfx's release queue (releaseFn calls bimg::imageFree(image)). bgfx may run that release on the render thread during bgfx::frame() as soon as the Update2D/UpdateCube command is submitted. The loop then evaluates its exit condition by dereferencing image->m_numMips on memory that has just been freed; if the stale read happens to exceed mip, the loop re-enters imageGetRawData() on the freed container, whose garbage m_format indexes s_imageBlockInfo out of bounds -> intermittent access violation (bimg/image.cpp). This is a use-after-free (CWE-416), reproduced as a rare (~2-4%) crash on texture-loading validation tests. Note: bgfx resource creation/update are themselves mutex-guarded (m_resourceApiLock, also held by bgfx::frame()) and are safe to call from a worker thread; the bug is BabylonNative dereferencing `image` after transferring its ownership to bgfx's async release queue, not a bgfx race. Fix: cache image->m_numMips in the loop initializer (`for (uint8_t mip = 0, numMips = image->m_numMips; mip < numMips; ++mip)`) so the loop bound is not re-read from `image` after the freeing submit. This restores verbatim the change made by commit e8ee5f7 (BabylonJS#1628) that was reverted by BabylonJS#1652. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bkaradzic-microsoft
added a commit
that referenced
this pull request
Jun 12, 2026
) ## Problem `NativeEngine::LoadTextureFromImage` (and the single-mip-per-face branch of `LoadCubeTextureFromImages`) drove the mip-upload loop with `for (uint8_t mip = 0; mip < image->m_numMips; ++mip)`, re-reading `image->m_numMips` for the exit condition on every iteration. On the **last** iteration the mip is submitted to bgfx via `bgfx::makeRef(imageMip.m_data, imageMip.m_size, releaseFn, image)`, where `releaseFn` calls `bimg::imageFree(image)`. bgfx invokes that release once it has consumed the memory — which happens on the render thread inside `bgfx::frame()`, potentially **immediately** after `Update2D`/`UpdateCube` submits the command. The loop then evaluates `mip < image->m_numMips` by dereferencing `image` *after it may already have been freed*. If the stale `m_numMips` read happens to be greater than `mip`, the loop re-enters `bimg::imageGetRawData(*image, ...)` on the freed container; its garbage `m_format` indexes `s_imageBlockInfo` out of bounds → an intermittent access violation. This is a **use-after-free (CWE-416)**, reproduced as a rare (~2–4%) crash on texture-loading validation tests (e.g. an NME particle test that loads a ramp texture while the render loop is active). ### Not a bgfx threading bug bgfx resource creation/update **are** mutex-guarded (`m_resourceApiLock`, which `bgfx::frame()` also holds), so calling them from the decode worker thread is safe and supported. The defect is purely that BabylonNative keeps dereferencing `image` after handing its ownership to bgfx's asynchronous release queue. ## Fix Cache `image->m_numMips` in the loop initializer (`for (uint8_t mip = 0, numMips = image->m_numMips; mip < numMips; ++mip)`) so the loop bound is no longer re-read from `image` after the freeing submit. This restores **verbatim** the change made in [`e8ee5f7`](e8ee5f7) (#1628) — see Regression history below. ## Regression history This is a **re-fix of a previously-fixed bug**. The exact same race was fixed in April by commit [`e8ee5f7`](e8ee5f7) ("Fixed race condition when passing data to bgfx. Issue #1398", #1628), which changed both loops to `for (uint8_t mip = 0, numMips = image->m_numMips; ...)`. That fix was inadvertently reverted by commit [`6bb8014`](6bb8014) ("Reworked threading model.", #1652), which restored the original `for (uint8_t mip = 0; mip < image->m_numMips; ++mip)` form in both `LoadTextureFromImage` and `LoadCubeTextureFromImages`, reintroducing the use-after-free. This PR re-applies the original fix unchanged. ## Validation - Stress-tested an affected texture-loading validation test **3000× headless** (30 × 100, 5-way concurrent for extra scheduler pressure): **0 crashes** (baseline crashes ~8% / ~1-in-12). At that baseline rate P(0 in 3000) is effectively zero, so the use-after-free is eliminated. - Particle validation subset: no regressions. Co-authored-by: Branimir Karadzic <branimirkaradzic@gmail.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
[Updated by Copilot on behalf of @bghgary]
Context
Replaces the old per-thread encoder model (
GetEncoderForThread/EndEncoders+UpdateTokenwrappingSafeTimespanGuarantor) with a single frame encoder owned byDeviceImpl, synchronized by aFrameCompletionScopeRAII gate.SafeTimespanGuarantor,UpdateToken, and theUpdateclasses are removed.DeviceUpdateremains as a no-op shim — removing it is an API break and a separate PR (#1653).Model
StartRenderingCurrentFrameacquires the encoder (bgfx::begin) and opens the gate;FinishRenderingCurrentFramewaits for allFrameCompletionScopes to release, closes the gate, ends the encoder, and submits the frame. All consumers (NativeEngine, Canvas, NativeXr) read the shared encoder viaGetActiveEncoder().SubmitCommandsandReadTextureeach hold a stack-scopedFrameCompletionScope, so the encoder can't be ended mid-work; when invoked outside a frame (e.g. an XHR callback) the scope blocks until the next frame opens.Synchronization rules
SubmitCommands/ReadTexturealways hold a scope — the encoder can't be ended mid-command, regardless of microtask draining or reentrancy.Start/Finish, orSubmitCommandsblocks forever waiting for the gate (HeadlessScreenshotApp, StyleTransferApp, UnitTests do this).Canvas::Flushacquires a scope when called outside a frame, then discards residual encoder state before nanovg rendering (the shared encoder carries NativeEngine state).Notes
Tests.ExternalTexture.Msaa.cppopens the next frame before waiting onstartup()— a workaround for the legacy asyncAddToContextAsyncpattern, obsoleted once Add synchronous CreateForJavaScript and deprecate AddToContextAsync in ExternalTexture #1646's syncCreateForJavaScriptlands.Tests.ExternalTexture.RestoreAfterDeviceLoss(added on master after this branch) uses the same async pattern; the same frame-pump workaround is applied here. Obsoleted once Add synchronous CreateForJavaScript and deprecate AddToContextAsync in ExternalTexture #1646's syncCreateForJavaScriptlands.m_encoderBeginLockinbgfx::frame()to prevent a deadlock withbgfx::destroy().DeviceUpdateno-op shim is an API-breaking change handled in Remove DeviceUpdate usage and mark the shim deprecated #1653.