[cute] Cleaner LICM: alias DCE + FMA-friendly scale hoist #2575
Open
oulgen wants to merge 1 commit into
This was referenced May 25, 2026
jansel
approved these changes
May 25, 2026
Contributor
jansel
left a comment
Fix test failures before landing
5b8b743 to
301c843
Compare
oulgen
added a commit
that referenced
this pull request
May 25, 2026
Extends hoist_loop_invariant_recip.py with 4 sub-passes orchestrated
in dependency order:

1. Alias DCE: inline SSA-style ``NAME = ANOTHER_NAME`` chains so
   ``mi_copy_1 = mi; mi_copy_1_0 = mi_copy_1`` collapse to the root.
   This removes the per-iter "copy" instructions Helion's SSA
   maintenance leaves behind.
2. Outer-in reciprocal hoist (was inner-first in P16): places the
   ``inv = 1.0 / di`` at the OUTERMOST legal scope so we don't emit
   cascade aliases like ``_helion_inv_div_0 = 1.0 * _helion_inv_div_1``
   at each nested loop level.
3. FMA-friendly scale hoist: detects ``(A - INV) * CONST`` where INV
   is loop-invariant. Emits ``scaled_K = INV * CONST`` above the loop
   and rewrites inside to ``A * CONST - scaled_K``: same value, but
   fewer per-iter instructions and FMA-friendly.
4. DCE for dead pure assigns: removes ``v_N = pure-expr`` whose target
   is never read afterwards (e.g. the ``v_10 = v_9 - mi`` that the
   FMA hoist supersedes). Iterates to fixed point.

A correctness piece (rename-aware invariance) plumbs the rename group
map from DeviceFunction so ``v_1_0`` (which post-pass renames to ``mi``)
is recognized as the SAME variable. Without this the FMA hoist would lift
``mi * 1.4427`` ABOVE the reduce loop and capture the stale initial -inf,
producing wildly wrong softmax outputs.

Bench (B200, fp16, HELION_AUTOTUNE_EFFORT=quick):

  Shape           Pre-P17     Post-P17    vs ATen
  (4096, 256)     84 GB/s     90 GB/s     0.18x (launcher bound)
  (4096, 6400)    1384 GB/s   1422 GB/s   1.47x (beats ATen by 47%)
  (4096, 12672)   1719 GB/s   1747 GB/s   0.79x
  average         1062 GB/s   1087 GB/s   0.81x

Cumulative: 0.45x -> 0.81x ATen (+80% from baseline).

Tests added (TestCuteHoistLoopInvariantP17, 6 tests):
- useless_cascade_alias_removed
- ssa_alias_chain_inlined
- fma_scale_hoist_above_consume
- fma_scale_hoist_in_reduce_v_loop
- dce_removes_dead_sub_after_fma_hoist
- invariance_canonicalization_does_not_break_consume

stack-info: PR: #2575, branch: oulgen/stack/317
Stacked PRs:
[cute] Cleaner LICM: alias DCE + FMA-friendly scale hoist
Extends hoist_loop_invariant_recip.py with 4 sub-passes orchestrated
in dependency order:
1. Alias DCE: inline SSA-style ``NAME = ANOTHER_NAME`` chains so
   ``mi_copy_1 = mi; mi_copy_1_0 = mi_copy_1`` collapse to the root.
   This removes the per-iter "copy" instructions Helion's SSA
   maintenance leaves behind.
2. Outer-in reciprocal hoist (was inner-first in P16): places the
   ``inv = 1.0 / di`` at the OUTERMOST legal scope so we don't emit
   cascade aliases like ``_helion_inv_div_0 = 1.0 * _helion_inv_div_1``
   at each nested loop level.
3. FMA-friendly scale hoist: detects ``(A - INV) * CONST`` where INV
   is loop-invariant. Emits ``scaled_K = INV * CONST`` above the loop
   and rewrites inside to ``A * CONST - scaled_K``: same value, but
   fewer per-iter instructions and FMA-friendly.
4. DCE for dead pure assigns: removes ``v_N = pure-expr`` whose target
   is never read afterwards (e.g. the ``v_10 = v_9 - mi`` that the
   FMA hoist supersedes). Iterates to fixed point.
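
For readers unfamiliar with the alias-inlining and dead-assign sub-passes described above, here is a minimal sketch on a toy straight-line program using Python's standard `ast` module (Python 3.9+). It is not the code in hoist_loop_invariant_recip.py: `resolve_aliases`, `drop_dead_assigns`, `is_pure`, and the `live_out` parameter are hypothetical names, and the sketch assumes every name is assigned at most once and that any expression without a call is pure.

```python
import ast


def resolve_aliases(source: str) -> str:
    """Inline `NAME = OTHER_NAME` copies so uses refer to the chain's root."""
    tree = ast.parse(source)
    root: dict[str, str] = {}  # alias name -> root name
    kept = []
    for stmt in tree.body:
        # Rewrite reads first, so a copy of a copy already points at the root.
        for node in ast.walk(stmt):
            if (isinstance(node, ast.Name)
                    and isinstance(node.ctx, ast.Load)
                    and node.id in root):
                node.id = root[node.id]
        is_copy = (isinstance(stmt, ast.Assign)
                   and len(stmt.targets) == 1
                   and isinstance(stmt.targets[0], ast.Name)
                   and isinstance(stmt.value, ast.Name))
        if is_copy:
            root[stmt.targets[0].id] = stmt.value.id
            continue  # drop the copy statement itself
        kept.append(stmt)
    tree.body = kept
    return ast.unparse(tree)


def is_pure(expr: ast.expr) -> bool:
    """Toy purity check (assumption): anything without a call is pure."""
    return not any(isinstance(n, ast.Call) for n in ast.walk(expr))


def drop_dead_assigns(source: str, live_out: set[str]) -> str:
    """Remove pure assignments whose target is never read; repeat until stable."""
    tree = ast.parse(source)
    changed = True
    while changed:
        changed = False
        used = set(live_out)
        for stmt in tree.body:
            for node in ast.walk(stmt):
                if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
                    used.add(node.id)
        kept = []
        for stmt in tree.body:
            dead = (isinstance(stmt, ast.Assign)
                    and len(stmt.targets) == 1
                    and isinstance(stmt.targets[0], ast.Name)
                    and stmt.targets[0].id not in used
                    and is_pure(stmt.value))
            if dead:
                changed = True
            else:
                kept.append(stmt)
        tree.body = kept
    return ast.unparse(tree)


toy = (
    "mi_copy_1 = mi\n"
    "mi_copy_1_0 = mi_copy_1\n"
    "v_9 = x * 2.0\n"
    "v_10 = v_9 - mi_copy_1_0\n"
    "v_11 = v_10 + 1.0\n"
    "out = v_9 * 1.5\n"
)
inlined = resolve_aliases(toy)
cleaned = drop_dead_assigns(inlined, live_out={"out"})
print(inlined)
print("---")
print(cleaned)
```

In the toy input, `v_11` is removed on the first DCE round and `v_10` only on the second, once its last reader is gone, which is why the loop runs to a fixed point.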
A correctness piece (rename-aware invariance) plumbs the rename group
map from DeviceFunction so ``v_1_0`` (which post-pass renames to ``mi``)
is recognized as the SAME variable. Without this the FMA hoist would lift
``mi * 1.4427`` ABOVE the reduce loop and capture the stale initial -inf,
producing wildly wrong softmax outputs.
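
The hazard above can be reproduced with plain Python floats. This is a sketch under stated assumptions, not the generated kernel: `row`, `scaled_original`, and `scaled_hoisted` are hypothetical names, the reduce loop is modeled as a running max starting at -inf, and 1.4427 is the constant quoted above. It compares the original ``(A - INV) * CONST`` form with the rewritten ``A * CONST - scaled_K`` form, once with `scaled_K` computed after the reduce loop and once with it wrongly computed before.

```python
import math

CONST = 1.4427  # constant quoted in the description above


def scaled_original(a: float, inv: float, const: float) -> float:
    # Form before the rewrite: subtract, then multiply, every iteration.
    return (a - inv) * const


def scaled_hoisted(a: float, scaled_k: float, const: float) -> float:
    # Form after the rewrite: one multiply-subtract per iteration,
    # with scaled_k = inv * const computed once outside the loop.
    return a * const - scaled_k


row = [0.5, 2.0, -1.0]  # hypothetical input row

mi = -math.inf
# WRONG placement: above the reduce loop, mi is still the initial -inf.
stale_scaled_k = mi * CONST

# Reduce loop: mi is written here, so it is not invariant across this loop.
for x in row:
    mi = max(mi, x)

# Legal placement: after the reduce loop, above the consume loop.
good_scaled_k = mi * CONST

ref = [scaled_original(x, mi, CONST) for x in row]
good = [scaled_hoisted(x, good_scaled_k, CONST) for x in row]
bad = [scaled_hoisted(x, stale_scaled_k, CONST) for x in row]

print("reference:", ref)
print("hoisted  :", good)
print("stale    :", bad)
```

With the legal placement the two forms agree up to floating-point rounding; with the stale `-inf` every element becomes `+inf`, which is the kind of failure the rename-aware invariance check is meant to prevent.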
Bench (B200, fp16, HELION_AUTOTUNE_EFFORT=quick):
  Shape           Pre-P17     Post-P17    vs ATen
  (4096, 256)     84 GB/s     90 GB/s     0.18x (launcher bound)
  (4096, 6400)    1384 GB/s   1422 GB/s   1.47x (beats ATen by 47%)
  (4096, 12672)   1719 GB/s   1747 GB/s   0.79x
  average         1062 GB/s   1087 GB/s   0.81x
Cumulative: 0.45x -> 0.81x ATen (+80% from baseline).
Tests added (TestCuteHoistLoopInvariantP17, 6 tests):