
[cute] Cleaner LICM: alias DCE + FMA-friendly scale hoist #2575

Open: oulgen wants to merge 1 commit into oulgen/stack/316 from oulgen/stack/317

@oulgen oulgen commented May 25, 2026

Stacked PRs:


[cute] Cleaner LICM: alias DCE + FMA-friendly scale hoist

Extends ``hoist_loop_invariant_recip.py`` with four sub-passes, run
in dependency order:

  1. Alias DCE: inlines SSA-style ``NAME = ANOTHER_NAME`` chains so that
    ``mi_copy_1 = mi; mi_copy_1_0 = mi_copy_1`` collapses to the root ``mi``.
    This removes the per-iteration copy instructions that Helion's SSA
    maintenance leaves behind.

  2. Outer-in reciprocal hoist (was inner-first in P16): places
    ``inv = 1.0 / di`` at the outermost legal scope, so we no longer emit
    cascade aliases like ``_helion_inv_div_0 = 1.0 * _helion_inv_div_1``
    at each nested loop level.

  3. FMA-friendly scale hoist: detects ``(A - INV) * CONST`` where ``INV``
    is loop-invariant, emits ``scaled_K = INV * CONST`` above the loop,
    and rewrites the loop body to ``A * CONST - scaled_K``. The result is
    algebraically the same value (floating-point rounding can differ
    slightly), but the per-iteration subtract-then-multiply becomes a
    single multiply-subtract that maps onto an FMA.

  4. DCE for dead pure assigns: removes ``v_N = pure-expr`` assignments
    whose target is never read afterwards (e.g. the ``v_10 = v_9 - mi``
    that the FMA hoist supersedes). Iterates to a fixed point.
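
For reviewers who want a concrete picture of sub-passes 1 and 4, here is a minimal sketch on straight-line Python source using the standard `ast` module. It is not the code in this PR: the helper names (`inline_alias_chains`, `drop_dead_pure_assigns`) are hypothetical, it assumes every name is assigned once (SSA-style), and it treats any expression without a call as pure.

```python
import ast


def _is_name_copy(stmt):
    """True for a plain `x = y` statement (one Name target, Name value)."""
    return (
        isinstance(stmt, ast.Assign)
        and len(stmt.targets) == 1
        and isinstance(stmt.targets[0], ast.Name)
        and isinstance(stmt.value, ast.Name)
    )


def inline_alias_chains(src):
    """Drop `x = y` copies and redirect later reads of x to the chain's root."""
    tree = ast.parse(src)
    root = {}  # alias name -> root name
    kept = []
    for stmt in tree.body:
        # Rewrite reads first, so a copy of a copy resolves to the root.
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
                node.id = root.get(node.id, node.id)
        if _is_name_copy(stmt):
            root[stmt.targets[0].id] = stmt.value.id
        else:
            kept.append(stmt)
    tree.body = kept
    return ast.unparse(tree)


def _is_dead_pure(stmt, reads):
    """True for `name = expr` where name is unread and expr has no calls."""
    return (
        isinstance(stmt, ast.Assign)
        and len(stmt.targets) == 1
        and isinstance(stmt.targets[0], ast.Name)
        and stmt.targets[0].id not in reads
        and not any(isinstance(n, ast.Call) for n in ast.walk(stmt.value))
    )


def drop_dead_pure_assigns(src, live_out):
    """Remove dead pure assignments, repeating until nothing changes."""
    tree = ast.parse(src)
    while True:
        reads = set(live_out)  # names still needed after this block
        for node in ast.walk(tree):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
                reads.add(node.id)
        kept = [s for s in tree.body if not _is_dead_pure(s, reads)]
        if len(kept) == len(tree.body):
            return ast.unparse(tree)
        tree.body = kept
```

On the example from item 1, `inline_alias_chains("mi_copy_1 = mi\nmi_copy_1_0 = mi_copy_1\nv_10 = v_9 - mi_copy_1_0")` returns `v_10 = v_9 - mi`. The fixed-point loop matters for DCE because removing one dead assignment can make the assignment feeding it dead as well.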

A correctness fix (rename-aware invariance) plumbs the rename-group
map from ``DeviceFunction`` into the invariance check, so that ``v_1_0``
(which the post-pass renames to ``mi``) is recognized as the same
variable as ``mi``. Without this, the FMA hoist would lift
``mi * 1.4427`` above the reduce loop and capture the stale initial
``-inf``, producing wildly wrong softmax outputs.
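
The hazard can be reproduced in a few lines of plain Python. This is an illustrative sketch only: `canonical`, `is_loop_invariant`, and `exponents` are hypothetical names, and the rename map is modeled as a plain dict rather than whatever structure `DeviceFunction` actually exposes.

```python
import math

LOG2E = 1.4427  # the constant quoted in the description above


def canonical(name, rename_map):
    """Map a name to its rename-group representative (itself if absent)."""
    return rename_map.get(name, name)


def is_loop_invariant(name, loop_writes, rename_map):
    """A name is invariant only if no write in the loop hits its rename group."""
    written = {canonical(w, rename_map) for w in loop_writes}
    return canonical(name, rename_map) not in written


def exponents(xs, hoist_above_reduce):
    """Return the rewritten `A * CONST - scaled_K` terms for one row."""
    mi = -math.inf
    scaled = None
    if hoist_above_reduce:
        scaled = mi * LOG2E  # too early: captures the initial -inf
    for x in xs:  # reduce loop: mi is rewritten every iteration
        mi = max(mi, x)
    if not hoist_above_reduce:
        scaled = mi * LOG2E  # legal spot: after reduce, before consume
    return [x * LOG2E - scaled for x in xs]  # consume loop
```

Without the rename map, `is_loop_invariant("mi", {"v_1_0"}, {})` returns `True` and the hoist looks legal; with `{"v_1_0": "mi"}` it returns `False`. Calling `exponents(xs, True)` shows the consequence: every term becomes `+inf`, whereas `exponents(xs, False)` agrees with the original `(x - mi) * LOG2E` form up to rounding.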

Bench (B200, fp16, HELION_AUTOTUNE_EFFORT=quick):
  Shape          Pre-P17    Post-P17   vs ATen
  (4096, 256)      84 GB/s    90 GB/s  0.18x  (launcher bound)
  (4096, 6400)   1384 GB/s  1422 GB/s  1.47x  (beats ATen by 47%)
  (4096, 12672)  1719 GB/s  1747 GB/s  0.79x
  average        1062 GB/s  1087 GB/s  0.81x

Cumulative: 0.45x -> 0.81x ATen (+80% from baseline).

Tests added (TestCuteHoistLoopInvariantP17, 6 tests):

  • useless_cascade_alias_removed
  • ssa_alias_chain_inlined
  • fma_scale_hoist_above_consume
  • fma_scale_hoist_in_reduce_v_loop
  • dce_removes_dead_sub_after_fma_hoist
  • invariance_canonicalization_does_not_break_consume


@jansel jansel left a comment

Fix test failures before landing

@oulgen oulgen marked this pull request as draft May 25, 2026 16:41
@oulgen oulgen changed the base branch from oulgen/stack/316 to main May 25, 2026 16:41
@oulgen oulgen force-pushed the oulgen/stack/317 branch 2 times, most recently from 5b8b743 to 301c843 Compare May 25, 2026 16:42
@oulgen oulgen changed the base branch from main to oulgen/stack/316 May 25, 2026 16:42
@oulgen oulgen marked this pull request as ready for review May 25, 2026 16:42
@oulgen oulgen marked this pull request as draft May 25, 2026 17:11
@oulgen oulgen changed the base branch from oulgen/stack/316 to main May 25, 2026 17:11
@oulgen oulgen force-pushed the oulgen/stack/317 branch from 301c843 to ffc20a2 Compare May 25, 2026 17:11
@oulgen oulgen changed the base branch from main to oulgen/stack/316 May 25, 2026 17:11
@oulgen oulgen marked this pull request as ready for review May 25, 2026 17:12
@oulgen oulgen force-pushed the oulgen/stack/316 branch from ae85a58 to 8a8c885 Compare May 25, 2026 17:44
oulgen added a commit that referenced this pull request May 25, 2026
@oulgen oulgen force-pushed the oulgen/stack/317 branch from ffc20a2 to 637dade Compare May 25, 2026 17:45
@oulgen oulgen changed the base branch from oulgen/stack/316 to main May 26, 2026 19:05
@oulgen oulgen force-pushed the oulgen/stack/317 branch from 637dade to 25bd005 Compare May 26, 2026 19:06
@oulgen oulgen changed the base branch from main to oulgen/stack/316 May 26, 2026 19:06
@oulgen oulgen marked this pull request as ready for review May 26, 2026 19:06
@oulgen oulgen marked this pull request as draft May 26, 2026 19:36
@oulgen oulgen changed the base branch from oulgen/stack/316 to main May 26, 2026 19:36
@oulgen oulgen force-pushed the oulgen/stack/317 branch from 25bd005 to 161a763 Compare May 26, 2026 19:36
@oulgen oulgen changed the base branch from main to oulgen/stack/316 May 26, 2026 19:37
@oulgen oulgen marked this pull request as ready for review May 26, 2026 19:37
@oulgen oulgen marked this pull request as draft May 26, 2026 20:24
@oulgen oulgen changed the base branch from oulgen/stack/316 to main May 26, 2026 20:24
@oulgen oulgen force-pushed the oulgen/stack/317 branch from 161a763 to c954159 Compare May 26, 2026 20:24
@oulgen oulgen changed the base branch from main to oulgen/stack/316 May 26, 2026 20:24
@oulgen oulgen marked this pull request as ready for review May 26, 2026 20:25
@oulgen oulgen marked this pull request as draft May 26, 2026 20:26
@oulgen oulgen changed the base branch from oulgen/stack/316 to main May 26, 2026 20:26
@oulgen oulgen force-pushed the oulgen/stack/317 branch from c954159 to 07e5bcf Compare May 26, 2026 20:26
@oulgen oulgen changed the base branch from main to oulgen/stack/316 May 26, 2026 20:27
@oulgen oulgen marked this pull request as ready for review May 26, 2026 20:27
@oulgen oulgen marked this pull request as draft May 26, 2026 20:31
@oulgen oulgen changed the base branch from oulgen/stack/316 to main May 26, 2026 20:31
@oulgen oulgen force-pushed the oulgen/stack/317 branch from 07e5bcf to 0b4d2de Compare May 26, 2026 20:31
@oulgen oulgen changed the base branch from main to oulgen/stack/316 May 26, 2026 20:31
@oulgen oulgen marked this pull request as ready for review May 26, 2026 20:32
@oulgen oulgen marked this pull request as draft May 26, 2026 20:58
@oulgen oulgen changed the base branch from oulgen/stack/316 to main May 26, 2026 20:58
@oulgen oulgen force-pushed the oulgen/stack/317 branch from 0b4d2de to 79f419b Compare May 26, 2026 20:58
@oulgen oulgen changed the base branch from main to oulgen/stack/316 May 26, 2026 20:58
@oulgen oulgen marked this pull request as ready for review May 26, 2026 20:58