Fix: raise CPU-sim scheduler idle cap via platform-defined constant #845

Open
ChaoZheng109 wants to merge 1 commit into hw-native-sys:main from ChaoZheng109:fix/issue-840-sim-scheduler-idle-cap

Conversation

ChaoZheng109 (Collaborator) commented May 22, 2026

Summary

MAX_IDLE_ITERATIONS was a single runtime constant (800000) shared by sim and
onboard. On CPU sim, idle scheduler iterations accumulate far faster than real
kernel progress, so matmul-heavy kernels (e.g. compressor_ratio4) hit the cap
and fail with PTO2_ERROR_SCHEDULER_TIMEOUT (run_prepared failed with code -100) while the kernel is still making forward progress.

This moves the cap to a platform-defined PLATFORM_MAX_IDLE_ITERATIONS in
each backend's spin_hint.h (the existing per-backend AICPU header), so the
runtime consumes a single symbol while sim and onboard supply their own values:

  • onboard: 800000 (unchanged hardware behavior)
  • sim: 8000000 (10× the onboard cap — a conservative bump, see caveat below)

Changes

  • PLATFORM_MAX_IDLE_ITERATIONS added to all four spin_hint.h (a5/a2a3 ×
    sim/onboard). The sim variant's doc comment already described this exact
    MAX_IDLE_ITERATIONS interaction, so the cap is co-located there.
  • scheduler_types.h aliases MAX_IDLE_ITERATIONS = PLATFORM_MAX_IDLE_ITERATIONS,
    mirroring the existing MAX_AICPU_THREADS = PLATFORM_MAX_AICPU_THREADS pattern.
    The runtime stays platform-agnostic — no sim/onboard #ifdef in runtime code.
  • STALL_LOG_INTERVAL is now derived as MAX_IDLE_ITERATIONS / 2 instead of a
    fixed 400000, so the stall-diagnostic log fires ~once halfway to timeout on
    every platform (the larger sim cap would otherwise spam stall warnings).
    Preserves prior onboard behavior exactly (800000 / 2 == 400000).
  • The onboard spin_hint.h headers gain the standard license block required by
    the pre-commit header check.
  • Applied to both a5 and a2a3 trees.
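
The aliasing and the derived stall interval described in the list above can be sketched as follows. This is a minimal illustration, not the PR's actual headers: the namespaces, the `SchedulerLimits` template, and the `int32_t` type are assumptions made so that both platform values can be shown in one translation unit. In the PR each backend supplies its own `spin_hint.h`, so a given build sees only one definition.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the two kinds of spin_hint.h. In the PR each
// backend ships its own header, so only one definition is visible per build.
namespace onboard_platform {
constexpr int32_t PLATFORM_MAX_IDLE_ITERATIONS = 800000;   // unchanged hardware cap
}
namespace sim_platform {
constexpr int32_t PLATFORM_MAX_IDLE_ITERATIONS = 8000000;  // 10x the onboard cap
}

// Stand-in for scheduler_types.h: the runtime derives both limits from the
// single platform symbol (modelled here as a template parameter).
template <int32_t PlatformMaxIdle>
struct SchedulerLimits {
    static constexpr int32_t MAX_IDLE_ITERATIONS = PlatformMaxIdle;
    static constexpr int32_t STALL_LOG_INTERVAL = MAX_IDLE_ITERATIONS / 2;
};

using OnboardLimits = SchedulerLimits<onboard_platform::PLATFORM_MAX_IDLE_ITERATIONS>;
using SimLimits = SchedulerLimits<sim_platform::PLATFORM_MAX_IDLE_ITERATIONS>;

// Onboard behavior is preserved exactly; sim scales with its larger cap.
static_assert(OnboardLimits::STALL_LOG_INTERVAL == 400000, "onboard interval unchanged");
static_assert(SimLimits::STALL_LOG_INTERVAL == 4000000, "sim interval tracks the sim cap");
```

Deriving the interval rather than hardcoding it keeps the ratio between "first stall warning" and "timeout" identical on every platform, so a later change to the sim cap needs no second edit.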

Design note

The runtime consumes one platform-provided symbol; the sim/onboard split lives
entirely in the platform layer (same precedent as SPIN_WAIT_HINT() and
PLATFORM_MAX_AICPU_THREADS). A wall-clock timeout via get_sys_cnt_aicpu()
was considered but rejected to keep the idle hot path a plain integer compare
(lower per-iteration overhead, no clock noise).
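
To make the trade-off concrete, here is a rough sketch of an idle loop whose timeout check is a plain integer compare. It is not the scheduler's real code: `spin_until_progress`, `IdleStats`, and the `progress_at` parameter (the idle iteration at which work would arrive) are hypothetical names used only to model the behavior, and the order of the checks is an assumption.

```cpp
#include <cstdint>

enum class IdleResult { Progress, Timeout };

struct IdleStats {
    IdleResult result;
    int32_t stall_logs;   // how many stall diagnostics would have been logged
    int32_t iterations;   // idle iterations spent before returning
};

// Models one idle episode. progress_at is the idle iteration at which work
// becomes available; pass a negative value to model "work never arrives".
constexpr IdleStats spin_until_progress(int32_t max_idle, int32_t progress_at) {
    const int32_t stall_log_interval = max_idle / 2;
    int32_t stall_logs = 0;
    for (int32_t idle = 1;; ++idle) {
        if (idle == progress_at) {
            return {IdleResult::Progress, stall_logs, idle};
        }
        // The hot-path check: one integer compare, no clock read.
        if (idle >= max_idle) {
            return {IdleResult::Timeout, stall_logs, idle};
        }
        if (stall_log_interval > 0 && idle % stall_log_interval == 0) {
            ++stall_logs;  // a real scheduler would emit its diagnostic here
        }
    }
}
```

With scaled-down numbers, work that arrives at idle iteration 50 times out under a cap of 8 but completes under a cap of 80, which mirrors the reported failure: the kernel was still progressing, just more slowly than the idle counter advanced.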

Scope / caveats

  • The sim value (8000000) is not empirically calibrated. The repro kernel
    compressor_ratio4.py lives in the PTO/pypto repo, not here, so it could not
    be reproduced or measured in this repo. 8000000 is a 10× bump over the
    onboard cap; for reference, the sibling host_build_graph runtime already
    runs CPU sim with a 50000000 idle cap. The value should be validated
    against compressor_ratio4 in the PTO env and raised if a slow kernel still
    times out spuriously (a one-line change).
  • The PTO-ISA TMatmulNzZn optimizations suggested in the issue are in the
    PTO-ISA repo and out of scope for this change.

Testing

  • a2a3sim + a5sim runtimes build clean (all targets)
  • tensormap_and_ringbuffer sim st test passes on both archs (dummy_task)
  • Hardware tests (unchanged onboard value; not run here)
  • compressor_ratio4 on a2a3sim in PTO env (requires PTO_ISA_ROOT; not run in this repo, see caveats)

Fixes #840

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces platform-specific constants for MAX_IDLE_ITERATIONS, allowing for a significantly higher threshold in simulation environments compared to onboard hardware to prevent false deadlock timeouts during slow tasks. Review feedback recommends scaling the STALL_LOG_INTERVAL relative to MAX_IDLE_ITERATIONS instead of using a hardcoded value, which would prevent excessive log volume in simulation while maintaining consistent diagnostic frequency across platforms.

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h Outdated
Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h Outdated
@ChaoZheng109 ChaoZheng109 force-pushed the fix/issue-840-sim-scheduler-idle-cap branch 2 times, most recently from 30b2340 to 20cd44d Compare May 22, 2026 09:23
Fixes hw-native-sys#840

The scheduler's MAX_IDLE_ITERATIONS was a single runtime constant (800000)
shared by sim and onboard. On CPU sim, idle iterations accumulate far faster
than real kernel progress, so matmul-heavy kernels (e.g. compressor_ratio4)
hit the cap and fail with PTO2_ERROR_SCHEDULER_TIMEOUT (-100) while the kernel
is still making forward progress.

Move the cap to a platform-defined PLATFORM_MAX_IDLE_ITERATIONS in each
backend's spin_hint.h (the existing per-backend AICPU header), so the runtime
consumes one symbol while sim and onboard supply their own values:
  - onboard: 800000 (unchanged hardware behavior)
  - sim:     8000000 (10x the onboard cap; conservative bump, tune if needed)

scheduler_types.h aliases MAX_IDLE_ITERATIONS = PLATFORM_MAX_IDLE_ITERATIONS,
mirroring the existing MAX_AICPU_THREADS = PLATFORM_MAX_AICPU_THREADS pattern.
STALL_LOG_INTERVAL is now derived as MAX_IDLE_ITERATIONS / 2 instead of a fixed
400000, so the stall-diagnostic log fires ~once halfway to timeout on every
platform. This preserves the prior onboard behavior exactly (800000 / 2 ==
400000).

Applied to both a5 and a2a3 trees. The onboard spin_hint.h headers also gain
the standard license block required by the pre-commit header check.

Note: the repro kernel lives in the PTO/pypto repo, so the sim value should be
validated against compressor_ratio4 there. The PTO-ISA TMatmulNzZn
optimizations suggested in the issue are out of scope for this repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

[Bug] CPU sim scheduler times out on slow TMATMUL-heavy kernels before real deadlock timeout