Fix: raise CPU-sim scheduler idle cap via platform-defined constant by ChaoZheng109 · Pull Request #845 · hw-native-sys/simpler

ChaoZheng109 · 2026-05-22T07:47:00Z

Summary

MAX_IDLE_ITERATIONS was a single runtime constant (800000) shared by sim and
onboard. On CPU sim, idle scheduler iterations accumulate far faster than real
kernel progress, so matmul-heavy kernels (e.g. compressor_ratio4) hit the cap
and fail with PTO2_ERROR_SCHEDULER_TIMEOUT (run_prepared failed with code -100) while the kernel is still making forward progress.

This moves the cap to a platform-defined PLATFORM_MAX_IDLE_ITERATIONS in
each backend's spin_hint.h (the existing per-backend AICPU header), so the
runtime consumes a single symbol while sim and onboard supply their own values:

- onboard: 800000 (unchanged hardware behavior)
- sim: 8000000 (10× the onboard cap; a conservative bump, see the caveats below)
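
As a rough illustration of the split described above, the two values can be placed side by side. This is only a sketch: in the PR each backend ships its own spin_hint.h, whereas the namespaces `onboard_platform` and `sim_platform` below are hypothetical stand-ins so both values fit in one self-contained file. Whether the real headers use `constexpr` or a macro is also an assumption.

```cpp
#include <cstdint>

// Hypothetical stand-in for the onboard spin_hint.h variant.
namespace onboard_platform {
// Unchanged hardware cap, as stated in the PR description.
constexpr std::int32_t PLATFORM_MAX_IDLE_ITERATIONS = 800000;
}  // namespace onboard_platform

// Hypothetical stand-in for the sim spin_hint.h variant.
namespace sim_platform {
// Sim cap from the PR description: ten times the onboard value.
constexpr std::int32_t PLATFORM_MAX_IDLE_ITERATIONS = 8000000;
}  // namespace sim_platform

// Compile-time check of the "10x" relationship quoted in the summary.
static_assert(sim_platform::PLATFORM_MAX_IDLE_ITERATIONS ==
                  10 * onboard_platform::PLATFORM_MAX_IDLE_ITERATIONS,
              "sim cap is expected to be 10x the onboard cap");
```

Both values fit comfortably in a 32-bit signed integer, so the choice of `std::int32_t` here is for illustration only and does not reflect the type used in the actual headers.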

Changes

- PLATFORM_MAX_IDLE_ITERATIONS added to all four spin_hint.h headers (a5/a2a3 ×
  sim/onboard). The sim variant's doc comment already described this exact
  MAX_IDLE_ITERATIONS interaction, so the cap is co-located there.
- scheduler_types.h aliases MAX_IDLE_ITERATIONS = PLATFORM_MAX_IDLE_ITERATIONS,
  mirroring the existing MAX_AICPU_THREADS = PLATFORM_MAX_AICPU_THREADS pattern.
  The runtime stays platform-agnostic: no sim/onboard #ifdef in runtime code.
- STALL_LOG_INTERVAL is now derived as MAX_IDLE_ITERATIONS / 2 instead of a
  fixed 400000, so the stall-diagnostic log fires roughly once, halfway to
  timeout, on every platform (with a fixed interval, the larger sim cap would
  otherwise spam stall warnings). This preserves prior onboard behavior exactly
  (800000 / 2 == 400000).
- The onboard spin_hint.h headers gain the standard license block required by
  the pre-commit header check.
- Applied to both a5 and a2a3 trees.
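
The alias and the derived log interval listed above can be sketched as follows. The constant names come from the PR description; the `constexpr` form, the integer type, and the helper `stall_log_interval_for` are assumptions made for illustration and are not taken from the actual scheduler_types.h.

```cpp
#include <cstdint>

// Assumed to be supplied by the backend's spin_hint.h. The sim value is used
// here; an onboard build would supply 800000 instead.
constexpr std::int32_t PLATFORM_MAX_IDLE_ITERATIONS = 8000000;

// Runtime-side alias: the runtime refers to one symbol only, with no
// sim/onboard #ifdef.
constexpr std::int32_t MAX_IDLE_ITERATIONS = PLATFORM_MAX_IDLE_ITERATIONS;

// Derived rather than fixed, so the stall log scales with the cap.
constexpr std::int32_t STALL_LOG_INTERVAL = MAX_IDLE_ITERATIONS / 2;

// Hypothetical helper, only used to check the derivation for other caps.
constexpr std::int32_t stall_log_interval_for(std::int32_t cap) {
    return cap / 2;
}

// The onboard cap reproduces the previous hard-coded interval exactly.
static_assert(stall_log_interval_for(800000) == 400000,
              "onboard stall-log interval must stay at 400000");
```

With a fixed 400000 interval and the 8000000 sim cap, the stall warning could fire up to 19 times before timeout; deriving the interval from the cap keeps it to one warning at the halfway point.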

Design note

The runtime consumes one platform-provided symbol; the sim/onboard split lives
entirely in the platform layer (same precedent as SPIN_WAIT_HINT() and
PLATFORM_MAX_AICPU_THREADS). A wall-clock timeout via get_sys_cnt_aicpu()
was considered but rejected to keep the idle hot path a plain integer compare
(lower per-iteration overhead, no clock noise).
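
To make the "plain integer compare" point concrete, here is a minimal sketch of an iteration-counted idle check. `IdleTracker`, `tick`, `ticks_until_timeout`, and `stall_logs_before_timeout` are hypothetical names, and resetting the counter whenever progress is observed is an assumption about the scheduler loop, not something the PR states. The rejected wall-clock variant based on get_sys_cnt_aicpu() is not shown.

```cpp
#include <cstdint>

// Hypothetical idle-iteration tracker. Assumes max_idle_iterations >= 2 so
// that the derived stall-log interval is non-zero.
struct IdleTracker {
    std::int32_t max_idle_iterations;
    std::int32_t stall_log_interval;
    std::int32_t idle_iterations = 0;
    std::int32_t stall_logs = 0;

    // Call once per scheduler loop iteration.
    // Returns false once the idle cap is reached (the timeout case).
    constexpr bool tick(bool made_progress) {
        if (made_progress) {
            idle_iterations = 0;  // assumed: any progress resets the count
            return true;
        }
        ++idle_iterations;
        if (idle_iterations >= max_idle_iterations) {
            return false;  // plain integer compare, no clock read
        }
        if (idle_iterations % stall_log_interval == 0) {
            ++stall_logs;  // stands in for emitting the stall diagnostic
        }
        return true;
    }
};

// Number of consecutive idle ticks needed to trip the cap.
constexpr std::int32_t ticks_until_timeout(std::int32_t cap) {
    IdleTracker tracker{cap, cap / 2};
    std::int32_t ticks = 1;  // counts the final tick that returns false
    while (tracker.tick(false)) {
        ++ticks;
    }
    return ticks;
}

// Number of stall diagnostics emitted before the cap trips.
constexpr std::int32_t stall_logs_before_timeout(std::int32_t cap) {
    IdleTracker tracker{cap, cap / 2};
    while (tracker.tick(false)) {
    }
    return tracker.stall_logs;
}
```

The trade-off is the one named in the design note: the counter is cheap and deterministic, but its meaning in wall-clock terms depends on how fast the platform spins, which is why the cap has to differ between sim and onboard.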

Scope / caveats

- The sim value (8000000) is not empirically calibrated. The repro kernel
  compressor_ratio4.py lives in the PTO/pypto repo, not here, so the failure
  could not be reproduced or measured in this repo. 8000000 is a 10× bump over
  the onboard cap; for reference, the sibling host_build_graph runtime already
  runs CPU sim with a 50000000 idle cap. The value should be validated
  against compressor_ratio4 in the PTO env and raised if a slow kernel still
  times out spuriously (a one-line change).
- The PTO-ISA TMatmulNzZn optimizations suggested in the issue are in the
  PTO-ISA repo and out of scope for this change.

Testing

- a2a3sim + a5sim runtimes build clean (all targets)
- tensormap_and_ringbuffer sim st test passes on both archs (dummy_task)
- Hardware tests (unchanged onboard value; not run here)
- compressor_ratio4 on a2a3sim in PTO env (requires PTO_ISA_ROOT)

Fixes #840

gemini-code-assist

Code Review

This pull request introduces platform-specific constants for MAX_IDLE_ITERATIONS, allowing for a significantly higher threshold in simulation environments compared to onboard hardware to prevent false deadlock timeouts during slow tasks. Review feedback recommends scaling the STALL_LOG_INTERVAL relative to MAX_IDLE_ITERATIONS instead of using a hardcoded value, which would prevent excessive log volume in simulation while maintaining consistent diagnostic frequency across platforms.

Fixes hw-native-sys#840

The scheduler's MAX_IDLE_ITERATIONS was a single runtime constant (800000) shared by sim and onboard. On CPU sim, idle iterations accumulate far faster than real kernel progress, so matmul-heavy kernels (e.g. compressor_ratio4) hit the cap and fail with PTO2_ERROR_SCHEDULER_TIMEOUT (-100) while the kernel is still making forward progress.

Move the cap to a platform-defined PLATFORM_MAX_IDLE_ITERATIONS in each backend's spin_hint.h (the existing per-backend AICPU header), so the runtime consumes one symbol while sim and onboard supply their own values:

- onboard: 800000 (unchanged hardware behavior)
- sim: 8000000 (10x the onboard cap; conservative bump, tune if needed)

scheduler_types.h aliases MAX_IDLE_ITERATIONS = PLATFORM_MAX_IDLE_ITERATIONS, mirroring the existing MAX_AICPU_THREADS = PLATFORM_MAX_AICPU_THREADS pattern.

STALL_LOG_INTERVAL is now derived as MAX_IDLE_ITERATIONS / 2 instead of a fixed 400000, so the stall-diagnostic log fires ~once halfway to timeout on every platform. This preserves the prior onboard behavior exactly (800000 / 2 == 400000).

Applied to both a5 and a2a3 trees. The onboard spin_hint.h headers also gain the standard license block required by the pre-commit header check.

Note: the repro kernel lives in the PTO/pypto repo, so the sim value should be validated against compressor_ratio4 there. The PTO-ISA TMatmulNzZn optimizations suggested in the issue are out of scope for this repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist Bot reviewed May 22, 2026

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h Outdated

Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h Outdated

ChaoZheng109 force-pushed the fix/issue-840-sim-scheduler-idle-cap branch 2 times, most recently from 30b2340 to 20cd44d (May 22, 2026 09:23)

ChaoZheng109 wants to merge 1 commit into hw-native-sys:main from ChaoZheng109:fix/issue-840-sim-scheduler-idle-cap
