Skip to content

[Feature] Support unaligned barrier sync#2295

Open
Rachmanino wants to merge 1 commit into
tile-ai:mainfrom
Rachmanino:feat/unaligned-barrier-sync
Open

[Feature] Support unaligned barrier sync#2295
Rachmanino wants to merge 1 commit into
tile-ai:mainfrom
Rachmanino:feat/unaligned-barrier-sync

Conversation

@Rachmanino

@Rachmanino Rachmanino commented May 28, 2026

Copy link
Copy Markdown
Collaborator

Summary by CodeRabbit

  • New Features

    • Added an aligned parameter to sync primitives (sync_threads() and named_barrier_arrive()), letting callers choose aligned vs unaligned barrier behavior (default: True). Generated kernels will reflect the chosen form.
  • Tests

    • Added CUDA-focused tests covering unaligned barrier synchronization, including positional and keyword argument usages.

@github-actions

Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds an optional aligned flag end-to-end so TileLang can emit aligned (bar.*) or unaligned (barrier.*) PTX barrier intrinsics via templated device helpers, CuteDSL, and CUDA codegen.

Changes

Unaligned Barrier Synchronization

Layer / File(s) Summary
Device template barrier selection
src/tl_templates/cuda/common.h
__sync_thread_partial and __named_barrier_arrive are templated on aligned and conditionally emit bar.sync/bar.arrive (true) or barrier.sync/barrier.arrive (false).
CUDA codegen parsing & emission
src/backend/cuda/codegen/codegen_cuda.cc, src/backend/cuda/codegen/codegen_cutedsl.cc
Add GetBoolImm helpers. PrintStorageSync/PrintStorageSync_ accept a 4th boolean immediate to forward aligned; VisitExpr_ for tl::named_barrier_arrive accepts an optional 3rd aligned arg and emits template specialization when present.
CuteDSL barrier helpers
tilelang/contrib/cutedsl/reduce.py, tilelang/contrib/cutedsl/utils.py
bar_sync_ptx gains an aligned parameter to choose the PTX mnemonic; sync_thread_partial forwards the aligned flag.
TileLang builtin barrier API
tilelang/language/builtin.py
sync_threads and named_barrier_arrive accept aligned: bool = True (keyword-only for named_barrier_arrive), validate runtime types, and append a boolean argument when lowering to select unaligned emission.
Barrier alignment tests
testing/python/language/test_tilelang_language_sync_threads.py
Adds CUDA-only tests: test_sync_threads_unaligned, test_sync_threads_unaligned_positional, and test_named_barrier_arrive_unaligned asserting the emitted source contains the unaligned specializations.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • tile-ai/tilelang#2194: Related work that added the 2-arg tl.named_barrier_arrive intrinsic which this PR extends with an optional aligned argument and template emission.
  • tile-ai/tilelang#2197: Related changes to storage_sync("shared") codegen paths; both PRs touch validation and emission of barrier/thread arguments.

Suggested reviewers

  • LeiWang1999

Poem

🐰 I hopped through kernels, thread fences in sight,
Templated barriers now flip day to night.
Aligned or not, the assembly will flow,
bar.* or barrier.* — the choice we now know.
A rabbit approves: syncs land just right.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 13.73% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title '[Feature] Support unaligned barrier sync' directly and clearly summarizes the main change: adding support for an unaligned variant of barrier synchronization operations across multiple files (codegen, templates, tests, and language bindings).
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

<< PrintExpr(args[2]);
if (args.size() == 4) {
this->stream << ", "
<< (GetBoolImm(args[3], "storage_sync aligned") ? "true"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's "storage_sync aligned"?

@Rachmanino Rachmanino force-pushed the feat/unaligned-barrier-sync branch from 53eacd3 to 9de3806 Compare May 28, 2026 09:35

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
testing/python/transform/test_tilelang_transform_layout_inference.py (1)

185-187: ⚡ Quick win

Prefer structural layout markers over exact local array size literals.

Asserting values[16] vs values[32] is fragile. Keep the fallback/partition intent check structural (e.g., scalar fragment path present and vector-pack fallback absent) to reduce false failures.

Based on learnings: For Python tests in testing/python/transform, assertions should target structural behavior rather than specific numeric literals.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@testing/python/transform/test_tilelang_transform_layout_inference.py` around
lines 185 - 187, Replace fragile literal-size assertions on kernel_source with
structural checks: instead of asserting "signed char values[16]" and not "signed
char values[32]", verify the scalar-fragment code path exists (e.g., assert
presence of the scalar fragment identifier or code block that handles
single-element processing) and verify the vector-packed fallback is absent by
keeping the existing check against "make_longlong4(" (or asserting absence of
any vector-pack helper like "make_longlong" patterns). Update the test to assert
these structural markers on kernel_source rather than specific numeric array
sizes.
testing/python/transform/test_tilelang_transform_thread_sync.py (1)

112-113: ⚡ Quick win

Avoid hard-coding barrier IDs/count literals in transform-pass assertions.

These checks are brittle to unrelated pass/internal allocation changes. Assert structural behavior (aligned vs unaligned form and occurrence patterns) instead of exact (3, 64/128) literals.

Suggested assertion style
-    assert "tl::__sync_thread_partial<false>(3, 128);" in src, src
+    assert "tl::__sync_thread_partial<false>(" in src, src

-    assert "tl::__sync_thread_partial(3, 64);" in src, src
+    assert "tl::__sync_thread_partial(" in src, src
     assert "tl::__sync_thread_partial<false>(" not in src, src

-    assert src.count("tl::__sync_thread_partial(3, 64);") == 2, src
+    assert src.count("tl::__sync_thread_partial(") == 2, src
     assert "tl::__sync_thread_partial<false>(" not in src, src

Based on learnings: For Python tests of the tilelang transform passes, focus assertions on structural patterns in generated source and avoid relying on specific numeric literals.

Also applies to: 128-130, 146-147

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@testing/python/transform/test_tilelang_transform_thread_sync.py` around lines
112 - 113, The assertion currently checks for a hard-coded barrier literal
"tl::__sync_thread_partial<false>(3, 128);" which is brittle; update the test in
test_tilelang_transform_thread_sync.py to assert structural patterns instead:
check that the generated source (src) contains the function call name
"tl::__sync_thread_partial<false>" and then assert whether the call appears in
the aligned form (e.g., with a power-of-two second argument pattern) or
unaligned form by using a regex or substring checks for the surrounding token
patterns rather than exact numeric literals; replace the exact-match assertions
(the one shown and the ones at lines ~128-130 and ~146-147) with these
structural/occurrence pattern checks on src.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/backend/cuda/codegen/codegen_cuda.cc`:
- Around line 1425-1439: The optional aligned selector (args[3]) is downcast
with Downcast<Bool> without checking it's a boolean immediate; add an ICHECK
before the downcast to ensure args.size()==4 implies args[3] is a Bool immediate
(e.g., check args[3].dtype().is_bool() and/or that it is a BoolImm) so the code
in the tl::__sync_thread_partial emission path (the block using args,
Downcast<Bool>, PrintExpr and this->stream) fails with a clear diagnostic
instead of a generic downcast failure; apply the same guard in the similar
section around lines 2404-2415 where Downcast<Bool> is used.

---

Nitpick comments:
In `@testing/python/transform/test_tilelang_transform_layout_inference.py`:
- Around line 185-187: Replace fragile literal-size assertions on kernel_source
with structural checks: instead of asserting "signed char values[16]" and not
"signed char values[32]", verify the scalar-fragment code path exists (e.g.,
assert presence of the scalar fragment identifier or code block that handles
single-element processing) and verify the vector-packed fallback is absent by
keeping the existing check against "make_longlong4(" (or asserting absence of
any vector-pack helper like "make_longlong" patterns). Update the test to assert
these structural markers on kernel_source rather than specific numeric array
sizes.

In `@testing/python/transform/test_tilelang_transform_thread_sync.py`:
- Around line 112-113: The assertion currently checks for a hard-coded barrier
literal "tl::__sync_thread_partial<false>(3, 128);" which is brittle; update the
test in test_tilelang_transform_thread_sync.py to assert structural patterns
instead: check that the generated source (src) contains the function call name
"tl::__sync_thread_partial<false>" and then assert whether the call appears in
the aligned form (e.g., with a power-of-two second argument pattern) or
unaligned form by using a regex or substring checks for the surrounding token
patterns rather than exact numeric literals; replace the exact-match assertions
(the one shown and the ones at lines ~128-130 and ~146-147) with these
structural/occurrence pattern checks on src.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 340d38ad-aae7-4cd9-ba89-1fa9ba7299aa

📥 Commits

Reviewing files that changed from the base of the PR and between 53eacd3 and 9de3806.

📒 Files selected for processing (11)
  • src/backend/cuda/codegen/codegen_cuda.cc
  • src/backend/cuda/codegen/codegen_cutedsl.cc
  • src/op/parallel.cc
  • src/tl_templates/cuda/common.h
  • src/transform/thread_storage_sync.cc
  • testing/python/language/test_tilelang_language_sync_threads.py
  • testing/python/transform/test_tilelang_transform_layout_inference.py
  • testing/python/transform/test_tilelang_transform_thread_sync.py
  • tilelang/contrib/cutedsl/reduce.py
  • tilelang/contrib/cutedsl/utils.py
  • tilelang/language/builtin.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tilelang/contrib/cutedsl/reduce.py

Comment thread src/backend/cuda/codegen/codegen_cuda.cc Outdated
@Rachmanino Rachmanino force-pushed the feat/unaligned-barrier-sync branch from 9de3806 to ffcd43d Compare June 1, 2026 07:13

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
tilelang/language/builtin.py (1)

361-377: ⚡ Quick win

Validate read type in tma_store_wait before lowering

read is documented as bool, but currently any value is forwarded into the intrinsic. This can surface later as codegen downcast failures instead of a clear frontend error. Add a runtime bool check (same pattern used by aligned in this file).

Suggested patch
 def tma_store_wait(count: int = 0, read: bool = True):
@@
-    return tirx.call_intrin("handle", tirx.op.Op.get("tl.tma_store_wait"), count, read)
+    if not isinstance(read, bool):
+        raise TypeError(f"Expect read to be bool, but got {type(read)}.")
+    return tirx.call_intrin("handle", tirx.op.Op.get("tl.tma_store_wait"), count, read)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/language/builtin.py` around lines 361 - 377, The tma_store_wait
function forwards the documented bool parameter read directly into the intrinsic
which can cause downstream codegen type errors; add a runtime type check similar
to the aligned validation in this file: verify isinstance(read, bool) and raise
a TypeError with a clear message if not, before the existing return that calls
tirx.call_intrin("handle", tirx.op.Op.get("tl.tma_store_wait"), count, read) so
only booleans are lowered.
testing/python/language/test_tilelang_language_sync_threads.py (1)

52-52: ⚡ Quick win

Make source assertions less brittle to formatting-only codegen changes

These checks currently depend on exact spacing/text rendering. Consider asserting stable tokens (callee + unaligned marker) instead of full-line exact string equality.

Suggested patch
-    assert "tl::__sync_thread_partial(1, 128, false);" in src, src
+    assert "tl::__sync_thread_partial(" in src and ", false);" in src, src
@@
-    assert "tl::__sync_thread_partial(1, 128, false);" in src, src
+    assert "tl::__sync_thread_partial(" in src and ", false);" in src, src
@@
-    assert "tl::__named_barrier_arrive<false>(2, 128);" in src, src
+    assert "tl::__named_barrier_arrive<false>(" in src, src

Also applies to: 64-64, 76-76

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@testing/python/language/test_tilelang_language_sync_threads.py` at line 52,
The assertion is brittle because it matches exact spacing of the generated line
("tl::__sync_thread_partial(1, 128, false);"); update the three assertions in
test_tilelang_language_sync_threads.py (the ones checking for
tl::__sync_thread_partial at lines ~52, ~64, ~76) to instead assert on stable
tokens such as the callee name and key arguments/markers (for example assert
"tl::__sync_thread_partial" in src and assert "false" in the same context, or
use a simple regex matching
tl::__sync_thread_partial\\s*\\(.*1.*128.*false.*\\)); locate the assertions by
the string "tl::__sync_thread_partial" and replace exact-string checks with
token-based or regex checks so formatting-only codegen changes won’t break the
test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@testing/python/language/test_tilelang_language_sync_threads.py`:
- Line 52: The assertion is brittle because it matches exact spacing of the
generated line ("tl::__sync_thread_partial(1, 128, false);"); update the three
assertions in test_tilelang_language_sync_threads.py (the ones checking for
tl::__sync_thread_partial at lines ~52, ~64, ~76) to instead assert on stable
tokens such as the callee name and key arguments/markers (for example assert
"tl::__sync_thread_partial" in src and assert "false" in the same context, or
use a simple regex matching
tl::__sync_thread_partial\\s*\\(.*1.*128.*false.*\\)); locate the assertions by
the string "tl::__sync_thread_partial" and replace exact-string checks with
token-based or regex checks so formatting-only codegen changes won’t break the
test.

In `@tilelang/language/builtin.py`:
- Around line 361-377: The tma_store_wait function forwards the documented bool
parameter read directly into the intrinsic which can cause downstream codegen
type errors; add a runtime type check similar to the aligned validation in this
file: verify isinstance(read, bool) and raise a TypeError with a clear message
if not, before the existing return that calls tirx.call_intrin("handle",
tirx.op.Op.get("tl.tma_store_wait"), count, read) so only booleans are lowered.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e680c950-46ba-4827-b4b4-29d2662652a4

📥 Commits

Reviewing files that changed from the base of the PR and between 9de3806 and ffcd43d.

📒 Files selected for processing (7)
  • src/backend/cuda/codegen/codegen_cuda.cc
  • src/backend/cuda/codegen/codegen_cutedsl.cc
  • src/tl_templates/cuda/common.h
  • testing/python/language/test_tilelang_language_sync_threads.py
  • tilelang/contrib/cutedsl/reduce.py
  • tilelang/contrib/cutedsl/utils.py
  • tilelang/language/builtin.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/backend/cuda/codegen/codegen_cutedsl.cc
  • tilelang/contrib/cutedsl/reduce.py
  • tilelang/contrib/cutedsl/utils.py
  • src/tl_templates/cuda/common.h

@Rachmanino Rachmanino force-pushed the feat/unaligned-barrier-sync branch from ffcd43d to d65e9f8 Compare June 1, 2026 07:55

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
testing/python/language/test_tilelang_language_sync_threads.py (1)

52-52: ⚡ Quick win

Make codegen assertions less formatting-fragile.

These checks are good, but exact full-statement matches are brittle to whitespace/template formatting changes. Prefer asserting stable semantic substrings (e.g., intrinsic name + false specialization).

Proposed adjustment
-    assert "tl::__sync_thread_partial(1, 128, false);" in src, src
+    assert "tl::__sync_thread_partial(" in src and "false" in src, src
...
-    assert "tl::__sync_thread_partial(1, 128, false);" in src, src
+    assert "tl::__sync_thread_partial(" in src and "false" in src, src
...
-    assert "tl::__named_barrier_arrive<false>(2, 128);" in src, src
+    assert "tl::__named_barrier_arrive<false>" in src, src

Also applies to: 64-64, 76-76

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@testing/python/language/test_tilelang_language_sync_threads.py` at line 52,
The assertions that currently check the exact generated statement string (e.g.,
"tl::__sync_thread_partial(1, 128, false);") are too formatting-fragile; change
them to assert on stable semantic substrings instead — for example verify that
"tl::__sync_thread_partial(" is in src and that the specialization flag
substring ", false" (or "false" near the intrinsic) is present (or assert both
"tl::__sync_thread_partial(" and "false" are in src) for the three occurrences
that currently match the full-statement (the checks referencing
tl::__sync_thread_partial at the three assertion sites); update those assertions
to use substring membership checks rather than exact full-statement equality.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tilelang/language/builtin.py`:
- Around line 959-966: In the sync lowering logic (the block that handles
barrier_id, arrive_count and aligned — look for variables barrier_id,
arrive_count, aligned and the T.sync_threads path), add a validation that if
aligned is True and arrive_count is provided while barrier_id is None, you raise
a clear error (TypeError/ValueError) and do not lower; this prevents
interpreting a single positional arg as barrier_id. Update the conditional so
you only append arrive_count when barrier_id is present (or not aligned) and
ensure the error message references both arrive_count and barrier_id to guide
callers.

---

Nitpick comments:
In `@testing/python/language/test_tilelang_language_sync_threads.py`:
- Line 52: The assertions that currently check the exact generated statement
string (e.g., "tl::__sync_thread_partial(1, 128, false);") are too
formatting-fragile; change them to assert on stable semantic substrings instead
— for example verify that "tl::__sync_thread_partial(" is in src and that the
specialization flag substring ", false" (or "false" near the intrinsic) is
present (or assert both "tl::__sync_thread_partial(" and "false" are in src) for
the three occurrences that currently match the full-statement (the checks
referencing tl::__sync_thread_partial at the three assertion sites); update
those assertions to use substring membership checks rather than exact
full-statement equality.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e9ba6796-6ee4-4894-80a3-161fbac6983b

📥 Commits

Reviewing files that changed from the base of the PR and between ffcd43d and d65e9f8.

📒 Files selected for processing (7)
  • src/backend/cuda/codegen/codegen_cuda.cc
  • src/backend/cuda/codegen/codegen_cutedsl.cc
  • src/tl_templates/cuda/common.h
  • testing/python/language/test_tilelang_language_sync_threads.py
  • tilelang/contrib/cutedsl/reduce.py
  • tilelang/contrib/cutedsl/utils.py
  • tilelang/language/builtin.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • tilelang/contrib/cutedsl/reduce.py
  • src/backend/cuda/codegen/codegen_cutedsl.cc
  • tilelang/contrib/cutedsl/utils.py
  • src/tl_templates/cuda/common.h

Comment on lines +959 to 966
if barrier_id is not None or not aligned:
if barrier_id is None:
barrier_id = 0
args.append(barrier_id)
if arrive_count is not None:
if arrive_count is not None or not aligned:
if arrive_count is None:
arrive_count = 0
args.append(arrive_count)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Disallow arrive_count without barrier_id in aligned mode.

T.sync_threads(arrive_count=...) is currently accepted and lowers with a single positional arg, which is ambiguous and can map to the wrong barrier id at codegen time.

Proposed fix
 def sync_threads(barrier_id: int = None, arrive_count: int = None, aligned: bool = True):
     """Synchronize all threads in a block."""
     args = []
     if not isinstance(aligned, bool):
         raise TypeError(f"Expect aligned to be bool, but got {type(aligned)}.")
+    if aligned and barrier_id is None and arrive_count is not None:
+        raise ValueError("T.sync_threads(arrive_count=...) requires barrier_id.")
     if barrier_id is not None or not aligned:
         if barrier_id is None:
             barrier_id = 0
         args.append(barrier_id)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/language/builtin.py` around lines 959 - 966, In the sync lowering
logic (the block that handles barrier_id, arrive_count and aligned — look for
variables barrier_id, arrive_count, aligned and the T.sync_threads path), add a
validation that if aligned is True and arrive_count is provided while barrier_id
is None, you raise a clear error (TypeError/ValueError) and do not lower; this
prevents interpreting a single positional arg as barrier_id. Update the
conditional so you only append arrive_count when barrier_id is present (or not
aligned) and ensure the error message references both arrive_count and
barrier_id to guide callers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants