feat(cambricon): add CausalSoftmaxOp in Cambricon #35

Draft
bitzyz wants to merge 1 commit into master from feat/dev-causal-softmax-cambricon

Contributor

@bitzyz bitzyz commented Mar 27, 2026

Summary

  • Add Cambricon MLU backend implementation for the CausalSoftmax operator (src/native/cambricon/ops/causal_softmax/causal_softmax.h, src/native/cambricon/ops/causal_softmax/kernel.mlu)
  • Support data types: float16, bfloat16, float32
  • Add reusable reduce helper functions (SumInternal, SumTyped, Sum, SumBatched, MaxInternal, MaxTyped, Max, MaxBatched) in src/native/cambricon/common.h for NRAM-based reduction operations
  • Update src/native/cambricon/ops/rms_norm/kernel.mlu to use the new shared SumInternal helper
  • Adjust test tolerance in tests/test_causal_softmax.py to align with InfiniCore accuracy expectations

Motivation

Extends operator coverage to the Cambricon platform by implementing the CausalSoftmax operator using BANG MLU kernels. The reduce utilities in common.h are also extracted as shared infrastructure for reuse by other operators (e.g. rms_norm).
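
For readers unfamiliar with the operator, the following is a minimal pure-Python reference sketch of a causal softmax over one 2-D slice. It is not the kernel from this PR. The masking convention (row `i` sees columns `j <= i + (total_seq_len - seq_len)`) is an assumption based on the usual definition and should be checked against the test reference; the function name and list-based layout are illustrative only.

```python
import math


def causal_softmax(x):
    """Reference causal softmax over a 2-D list of shape (seq_len, total_seq_len).

    Assumption: row i may attend to columns j <= i + (total_seq_len - seq_len),
    with total_seq_len >= seq_len. Masked columns receive probability 0.
    """
    seq_len = len(x)
    total_seq_len = len(x[0])
    offset = total_seq_len - seq_len
    out = []

    for i, row in enumerate(x):
        valid = i + offset + 1  # Number of unmasked leading columns.
        row_max = max(row[:valid])  # Max reduction, for numerical stability.
        exps = [math.exp(v - row_max) for v in row[:valid]]
        denom = sum(exps)  # Sum reduction.
        out.append([e / denom for e in exps] + [0.0] * (total_seq_len - valid))

    return out
```

Each row needs one max reduction and one sum reduction, which is why shared `Max` and `Sum` helpers are a natural fit for this operator.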

Type of Change

  • feat — new feature / new operator / new platform

Platforms Affected

  • CPU (WITH_CPU)
  • NVIDIA (WITH_NVIDIA)
  • Iluvatar (WITH_ILUVATAR)
  • MetaX (WITH_METAX)
  • Cambricon (WITH_CAMBRICON)
  • Moore (WITH_MOORE)
  • Ascend (WITH_ASCEND)
  • PyTorch C++ bindings (WITH_TORCH)
  • Build system / CMake / CI
  • Python bindings / user-facing API

Test Results on Supported Platforms

| Platform | Built | pytest Result | Notes / Hardware |
| --- | --- | --- | --- |
| NVIDIA | N/A | N/A | Not affected |
| Iluvatar | N/A | N/A | Not affected |
| MetaX | N/A | N/A | Not affected |
| Cambricon | ✅ | Passed | Tested locally |
| Moore | N/A | N/A | Not affected |
| Ascend | N/A | N/A | Not affected |

Benchmark / Performance Impact

N/A

Notes for Reviewers

  • The kernel uses Union1 task type for parallelizing across batches and sequence positions.
  • The reduce helpers (SumBatched/MaxBatched) in common.h use optimized BANG intrinsics (__bang_sumpool, __bang_maxpool, etc.) for large batches and fall back to scalar loops for small batches.
  • Test tolerances were relaxed slightly to match InfiniCore reference results on MLU hardware.
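
As a rough illustration of the first note, the sketch below shows one way that the `batch * seq_len` rows could be divided among parallel tasks. The partitioning scheme, the function name `rows_for_task`, and the sizes are all hypothetical; the actual kernel may distribute work differently.

```python
def rows_for_task(total_rows, task_dim, task_id):
    """Return the [start, end) row range handled by one task.

    Hypothetical scheme: rows are split as evenly as possible, and the first
    `total_rows % task_dim` tasks each take one extra row.
    """
    base, extra = divmod(total_rows, task_dim)
    start = task_id * base + min(task_id, extra)
    end = start + base + (1 if task_id < extra else 0)

    return start, end


batch, seq_len, task_dim = 3, 5, 4  # Illustrative sizes only.
ranges = [rows_for_task(batch * seq_len, task_dim, t) for t in range(task_dim)]
```

With these illustrative sizes, the 15 rows are covered exactly once by four contiguous ranges, so each task can run its per-row max and sum reductions independently.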

Checklist

Title, Branch, and Commits

  • PR title follows Conventional Commits (e.g. feat(nvidia): …, fix(cuda/gemm): …).
  • Branch name follows <type>/xxx-yyyy-zzzz where <type> matches the PR title's Conventional Commits type and words are joined with hyphens (see CONTRIBUTING.md §Branches).
  • Each commit message follows Conventional Commits.
  • Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable (see CONTRIBUTING.md §Pull Requests).
  • No stray merge commits from master — the branch is rebased cleanly on top of the current master.
  • No fixup! / squash! / wip commits remain.

Scope and Design

  • Changes are minimal — nothing unrelated to the stated motivation was added (CONTRIBUTING.md §Code/General).
  • No dead code, commented-out blocks, debug prints, printf/std::cout/print(...) left behind, or TODO without an owner and issue link.
  • No unrelated formatting churn that would obscure the diff.
  • Public API changes (if any) are intentional, documented, and reflected in affected callers/tests.

General Code Hygiene (applies to all languages)

  • The code is self-explanatory; comments were added only where the why is non-obvious (CONTRIBUTING.md §Code/General).
  • Every modified or added file ends with a single trailing newline (CONTRIBUTING.md §Code/General).
  • No trailing whitespace, tab/space mixing, or stray BOMs.
  • Identifiers in comments and error messages are wrapped in backticks (e.g. the `seqlens_k` tensor) (CONTRIBUTING.md §Code/General).
  • All comments and error messages are in English (CONTRIBUTING.md §Code/General).
  • Comments and error messages are complete sentences — capitalized first letter, terminal punctuation — unless the language/framework convention says otherwise (CONTRIBUTING.md §Code/General; §Python).

C++ Specific (if C++ files changed)

  • Code follows the Google C++ Style Guide strictly.
  • clang-format (version 21, per .github/workflows/clang-format.yml) has been run against all modified .h, .cc, .cuh, and .mlu files; the diff is clean.
  • clang-tidy concerns (per .clang-tidy) have been reviewed — no new warnings beyond the existing baseline.
  • Operator parameter order is inputs first, outputs last; attributes are between inputs and outputs; naming follows PyTorch → ONNX → CUDA API precedence (CONTRIBUTING.md §C++).
  • No exceptions are thrown. Error paths use assert with messages that include at least __FILE__, __LINE__, and __func__ (CONTRIBUTING.md §C++).
  • Error and warning message wording follows the LLVM Coding Standards (CONTRIBUTING.md §C++).
  • Kernel files are named correctly: custom = kernel / kernel_v2 / …; well-known algorithms use the algorithm name; library-based implementations use the library name (CONTRIBUTING.md §C++).
  • Kernel and kernel launcher are in separate files: launcher .h, kernel follows platform conventions (e.g. .cuh + .cu) even when non-templated (CONTRIBUTING.md §C++).
  • Constructor initializer list order matches member declaration order (CONTRIBUTING.md §C++).
  • Exactly one blank line between classes, between classes and functions, and between functions (CONTRIBUTING.md §C++).
  • Exactly one blank line between members (functions and variables) within a class (CONTRIBUTING.md §C++).
  • Exactly one blank line before and after the contents of a namespace (CONTRIBUTING.md §C++).
  • New operators added via src/base/<op>.h (inheriting Operator<Op>) with platform implementations under src/<category>/<platform>/ inheriting the base (CONTRIBUTING.md §Adding an Operator).
  • No raw new/delete; RAII / smart pointers / existing allocators are used.

Python Specific (if Python files changed)

  • Code is PEP 8 compliant; ruff check passes cleanly on CI (see .github/workflows/ruff.yml).
  • ruff format --check passes cleanly — if not, run ruff format and commit the result.
  • Comments are complete English sentences, starting with a capital letter and ending with punctuation; Markdown backticks are used for code references (CONTRIBUTING.md §Python).
  • Framework-specific conventions (e.g. lowercase pytest.skip messages without terminal period) are honored where applicable (CONTRIBUTING.md §Python).
  • No blank line between the function signature and the body when there is no docstring or comment (CONTRIBUTING.md §Python).
  • A blank line is present before and after if, for, and similar control-flow statements (CONTRIBUTING.md §Python).
  • A blank line appears before each return, except when it directly follows a control-flow statement (CONTRIBUTING.md §Python).
  • Docstrings (if any) follow PEP 257 (CONTRIBUTING.md §Python).
  • Type hints are added / kept consistent with the surrounding code.

Testing

  • pytest was run locally on every supported platform that this PR can affect, and the results are recorded in the "Test Results" table above (CONTRIBUTING.md §Pull Requests).
  • For any platform that could not be tested, an explicit reason is given in the table and a reviewer with access has been tagged.
  • New functionality has matching tests under tests/ following tests/test_add.py / tests/test_gemm.py patterns (CONTRIBUTING.md §Adding an Operator).
  • Tests use pytest.mark.parametrize correctly: dependent parameters share one decorator (e.g. @pytest.mark.parametrize("dtype, rtol, atol", …)), independent parameters use separate decorators ordered by parameter declaration.
  • Where appropriate, pytest.mark.auto_act_and_assert is used and the test returns a Payload whose func and ref share the same calling convention.
  • Default dtype / device parameterization is relied on, or overridden with an explicit pytest.mark.parametrize when necessary.
  • Any new test that is flaky under parallelism is marked so, or documented to require pytest -n 1.
  • For bug fixes: a regression test has been added that fails on master and passes with this PR.

Build, CI, and Tooling

  • The project builds cleanly from a fresh directory with pip install .[dev] on at least one affected platform.
  • compile_commands.json still regenerates (CMake option CMAKE_EXPORT_COMPILE_COMMANDS=ON in pyproject.toml — required by the code-lint skill and clang-tidy -p).
  • New backends / devices have been added to auto-detection in CMakeLists.txt under if(AUTO_DETECT_DEVICES) and to if(AUTO_DETECT_BACKENDS) if applicable.
  • Only one CUDA-like GPU backend is selectable at a time — the existing mutual-exclusion check in CMakeLists.txt is not broken.
  • Both CI workflows (clang-format.yml, ruff.yml) are green locally (or expected to be green on CI).
  • No new runtime dependency was added without updating pyproject.toml's [project.optional-dependencies] (or justified in the PR description).

Documentation

  • README.md, CONTRIBUTING.md, or inline docs updated when behavior, build flags, or developer workflow changed.
  • New operators, new dispatch helpers, or new public utilities are documented (docstring, header comment, or an addition to CONTRIBUTING.md §Some Code Explanations).
  • Any user-visible breaking change is called out explicitly under "Motivation" and in the commit/PR title with a ! or BREAKING CHANGE: footer.

Security and Safety

  • No secrets, access tokens, internal URLs, customer data, or personal hardware identifiers have been committed.
  • Third-party code is license-compatible and attributed.
  • No unsafe pointer arithmetic, uninitialized reads, or missing bounds checks were introduced.

@bitzyz bitzyz self-assigned this Mar 27, 2026
@bitzyz bitzyz marked this pull request as draft March 27, 2026 03:23
@bitzyz bitzyz force-pushed the feat/dev-causal-softmax-cambricon branch 3 times, most recently from 7de4dcf to d6bf09c Compare March 27, 2026 06:23
@bitzyz bitzyz marked this pull request as ready for review March 30, 2026 02:19
@bitzyz bitzyz requested review from Ziminli and voltjia March 30, 2026 06:40
@bitzyz bitzyz force-pushed the feat/dev-causal-softmax-cambricon branch from d6bf09c to c25096a Compare April 8, 2026 03:26
@bitzyz bitzyz changed the base branch from feat/dev-infra to master April 8, 2026 03:26
Collaborator

voltjia commented Apr 23, 2026

The tolerance changes in this PR were made according to https://github.com/InfiniTensor/InfiniCore/blob/main/test/infiniop/causal_softmax.py.
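
For context on what `rtol` and `atol` control, here is a minimal sketch of the allclose-style criterion, `|actual - expected| <= atol + rtol * |expected|`, applied elementwise. The helper name `is_close` is hypothetical, and the numbers in the usage note are illustrative, not the tolerances used in this PR.

```python
def is_close(actual, expected, rtol, atol):
    """Return True if every element satisfies the allclose-style bound."""
    return all(
        abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected)
    )
```

For example, `is_close([1.0, 2.001], [1.0, 2.0], rtol=1e-3, atol=0.0)` holds, while an error of `0.01` on the second element would fail under the same `rtol`.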

Comment thread src/native/cambricon/common.h
Comment thread src/native/cambricon/ops/causal_softmax/kernel.mlu
Comment thread src/cambricon/causal_softmax/causal_softmax.h Outdated
Comment thread src/native/cambricon/ops/causal_softmax/causal_softmax.h
@voltjia voltjia self-requested a review April 23, 2026 07:58
@bitzyz bitzyz marked this pull request as draft April 29, 2026 08:46
@bitzyz bitzyz force-pushed the feat/dev-causal-softmax-cambricon branch from c25096a to 720c2ca Compare May 28, 2026 05:58
@bitzyz bitzyz changed the title feat: add cambricon causal softmax op feat(cambricon): add CausalSoftmaxOp in Cambricon May 28, 2026