
feat(cambricon): add SwigluOp in Cambricon #37

Draft
bitzyz wants to merge 1 commit into master from feat/dev-swiglu-cambricon

bitzyz commented on Mar 30, 2026

Summary

  • Add Cambricon MLU backend implementation for the Swiglu operator (src/native/cambricon/ops/swiglu/swiglu.h, src/native/cambricon/ops/swiglu/kernel.mlu)
  • Support data types: float16, bfloat16, float32
  • Includes a fast path for contiguous tensors with matching shapes (no broadcast) and a general path handling non-contiguous tensors and broadcasting
  • For float32, computes SwiGLU element-wise as input * gate * sigmoid(gate) using a scalar loop; for float16/bfloat16, uses BANG intrinsics (__bang_active_sigmoid, __bang_mul)
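
For reviewers who want a quick numerical reference, the formula above can be sketched on the host as plain C++. This is not the code in kernel.mlu; the names `Sigmoid` and `SwigluReference` are hypothetical, and the sketch assumes contiguous float32 inputs of equal length (the fast-path case, no broadcasting).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Logistic sigmoid: 1 / (1 + exp(-x)).
inline float Sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Hypothetical host-side reference, not the MLU kernel:
// out[i] = input[i] * gate[i] * sigmoid(gate[i]).
// Assumes both vectors are contiguous and have the same length.
inline std::vector<float> SwigluReference(const std::vector<float>& input,
                                          const std::vector<float>& gate) {
  assert(input.size() == gate.size() &&
         "`input` and `gate` must have the same length.");
  std::vector<float> out(input.size());
  for (std::size_t i = 0; i < input.size(); ++i) {
    out[i] = input[i] * gate[i] * Sigmoid(gate[i]);
  }
  return out;
}
```

A gate value of 0 always yields an output of 0, regardless of the input, which makes a convenient sanity check when comparing against device results.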

Motivation

Extends operator coverage to the Cambricon platform by implementing the element-wise Swiglu operator using BANG MLU kernels.

Type of Change

  • feat — new feature / new operator / new platform

Platforms Affected

  • CPU (WITH_CPU)
  • NVIDIA (WITH_NVIDIA)
  • Iluvatar (WITH_ILUVATAR)
  • MetaX (WITH_METAX)
  • Cambricon (WITH_CAMBRICON)
  • Moore (WITH_MOORE)
  • Ascend (WITH_ASCEND)
  • PyTorch C++ bindings (WITH_TORCH)
  • Build system / CMake / CI
  • Python bindings / user-facing API

Test Results on Supported Platforms

| Platform | Built | pytest Result | Notes / Hardware |
|---|---|---|---|
| NVIDIA | N/A | N/A | Not affected |
| Iluvatar | N/A | N/A | Not affected |
| MetaX | N/A | N/A | Not affected |
| Cambricon | ✅ | Passed | Tested locally |
| Moore | N/A | N/A | Not affected |
| Ascend | N/A | N/A | Not affected |

Benchmark / Performance Impact

N/A

Notes for Reviewers

  • The kernel uses Union1 task type (single cluster) for simplicity. Multi-cluster support can be added as a follow-up if needed.
  • The workspace stores shape/stride metadata for device-side access during kernel execution.
  • No test file changes were needed — the existing test_swiglu.py parameterization covers the MLU device through the default device/dtype fixtures.
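
To illustrate why shape/stride metadata is needed on the device for the general path, here is a minimal host-side sketch of the usual index mapping. It is an assumption about the technique, not the layout actually used in this PR's workspace; the name `OffsetFromFlatIndex` is hypothetical. A flat row-major index over the output shape is decomposed into per-dimension coordinates, and each coordinate is multiplied by the operand's stride. A stride of 0 on a dimension expresses broadcasting along it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: maps a flat row-major index over `shape` to an
// element offset in an operand described by `strides` (in elements).
// Assumes every extent in `shape` is positive.
inline std::int64_t OffsetFromFlatIndex(
    std::int64_t flat_index, const std::vector<std::int64_t>& shape,
    const std::vector<std::int64_t>& strides) {
  assert(shape.size() == strides.size() &&
         "`shape` and `strides` must have the same rank.");
  std::int64_t offset = 0;
  // Walk dimensions from innermost to outermost.
  for (std::size_t d = shape.size(); d-- > 0;) {
    offset += (flat_index % shape[d]) * strides[d];
    flat_index /= shape[d];
  }
  return offset;
}
```

For a contiguous operand the mapping is the identity, which is what lets a fast path skip it entirely; for a broadcast or transposed operand the offsets differ from the flat index.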

Checklist

Title, Branch, and Commits

  • PR title follows Conventional Commits (e.g. feat(nvidia): …, fix(cuda/gemm): …).
  • Branch name follows <type>/xxx-yyyy-zzzz where <type> matches the PR title's Conventional Commits type and words are joined with hyphens (see CONTRIBUTING.md §Branches).
  • Each commit message follows Conventional Commits.
  • Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable (see CONTRIBUTING.md §Pull Requests).
  • No stray merge commits from master — the branch is rebased cleanly on top of the current master.
  • No fixup! / squash! / wip commits remain.

Scope and Design

  • Changes are minimal — nothing unrelated to the stated motivation was added (CONTRIBUTING.md §Code/General).
  • No dead code, commented-out blocks, debug prints, printf/std::cout/print(...) left behind, or TODO without an owner and issue link.
  • No unrelated formatting churn that would obscure the diff.
  • Public API changes (if any) are intentional, documented, and reflected in affected callers/tests.

General Code Hygiene (applies to all languages)

  • The code is self-explanatory; comments were added only where the why is non-obvious (CONTRIBUTING.md §Code/General).
  • Every modified or added file ends with a single trailing newline (CONTRIBUTING.md §Code/General).
  • No trailing whitespace, tab/space mixing, or stray BOMs.
  • Identifiers in comments and error messages are wrapped in backticks (e.g. the `seqlens_k` tensor) (CONTRIBUTING.md §Code/General).
  • All comments and error messages are in English (CONTRIBUTING.md §Code/General).
  • Comments and error messages are complete sentences — capitalized first letter, terminal punctuation — unless the language/framework convention says otherwise (CONTRIBUTING.md §Code/General; §Python).

C++ Specific (if C++ files changed)

  • Code follows the Google C++ Style Guide strictly.
  • clang-format (version 21, per .github/workflows/clang-format.yml) has been run against all modified .h, .cc, .cuh, and .mlu files; the diff is clean.
  • clang-tidy concerns (per .clang-tidy) have been reviewed — no new warnings beyond the existing baseline.
  • Operator parameter order is inputs first, outputs last; attributes are between inputs and outputs; naming follows PyTorch → ONNX → CUDA API precedence (CONTRIBUTING.md §C++).
  • No exceptions are thrown. Error paths use assert with messages that include at least __FILE__, __LINE__, and __func__ (CONTRIBUTING.md §C++).
  • Error and warning message wording follows the LLVM Coding Standards (CONTRIBUTING.md §C++).
  • Kernel files are named correctly: custom = kernel / kernel_v2 / …; well-known algorithms use the algorithm name; library-based implementations use the library name (CONTRIBUTING.md §C++).
  • Kernel and kernel launcher are in separate files: launcher .h, kernel follows platform conventions (e.g. .cuh + .cu) even when non-templated (CONTRIBUTING.md §C++).
  • Constructor initializer list order matches member declaration order (CONTRIBUTING.md §C++).
  • Exactly one blank line between classes, between classes and functions, and between functions (CONTRIBUTING.md §C++).
  • Exactly one blank line between members (functions and variables) within a class (CONTRIBUTING.md §C++).
  • Exactly one blank line before and after the contents of a namespace (CONTRIBUTING.md §C++).
  • New operators added via src/base/<op>.h (inheriting Operator<Op>) with platform implementations under src/<category>/<platform>/ inheriting the base (CONTRIBUTING.md §Adding an Operator).
  • No raw new/delete; RAII / smart pointers / existing allocators are used.

Python Specific (if Python files changed)

N/A — no Python files were changed in this PR.

Testing

  • pytest was run locally on every supported platform that this PR can affect, and the results are recorded in the "Test Results" table above (CONTRIBUTING.md §Pull Requests).
  • For any platform that could not be tested, an explicit reason is given in the table and a reviewer with access has been tagged.
  • New functionality has matching tests under tests/ following tests/test_add.py / tests/test_gemm.py patterns (CONTRIBUTING.md §Adding an Operator).
  • Tests use pytest.mark.parametrize correctly: dependent parameters share one decorator (e.g. @pytest.mark.parametrize("dtype, rtol, atol", …)), independent parameters use separate decorators ordered by parameter declaration.
  • Where appropriate, pytest.mark.auto_act_and_assert is used and the test returns a Payload whose func and ref share the same calling convention.
  • Default dtype / device parameterization is relied on, or overridden with an explicit pytest.mark.parametrize when necessary.
  • Any new test that is flaky under parallelism is marked so, or documented to require pytest -n 1.
  • For bug fixes: a regression test has been added that fails on master and passes with this PR.

Build, CI, and Tooling

  • The project builds cleanly from a fresh directory with pip install .[dev] on at least one affected platform.
  • compile_commands.json still regenerates (CMake option CMAKE_EXPORT_COMPILE_COMMANDS=ON in pyproject.toml — required by the code-lint skill and clang-tidy -p).
  • New backends / devices have been added to auto-detection in CMakeLists.txt under if(AUTO_DETECT_DEVICES) and to if(AUTO_DETECT_BACKENDS) if applicable.
  • Only one CUDA-like GPU backend is selectable at a time — the existing mutual-exclusion check in CMakeLists.txt is not broken.
  • Both CI workflows (clang-format.yml, ruff.yml) are green locally (or expected to be green on CI).
  • No new runtime dependency was added without updating pyproject.toml's [project.optional-dependencies] (or justified in the PR description).

Documentation

  • README.md, CONTRIBUTING.md, or inline docs updated when behavior, build flags, or developer workflow changed.
  • New operators, new dispatch helpers, or new public utilities are documented (docstring, header comment, or an addition to CONTRIBUTING.md §Some Code Explanations).
  • Any user-visible breaking change is called out explicitly under "Motivation" and in the commit/PR title with a ! or BREAKING CHANGE: footer.

Security and Safety

  • No secrets, access tokens, internal URLs, customer data, or personal hardware identifiers have been committed.
  • Third-party code is license-compatible and attributed.
  • No unsafe pointer arithmetic, uninitialized reads, or missing bounds checks were introduced.

bitzyz self-assigned this on Mar 30, 2026
bitzyz requested review from Ziminli and voltjia on March 30, 2026 06:42
bitzyz force-pushed the feat/dev-swiglu-cambricon branch from bed0481 to 15cd664 on March 30, 2026 06:47
bitzyz force-pushed the feat/dev-swiglu-cambricon branch from 15cd664 to 607361a on April 8, 2026 03:18
bitzyz changed the base branch from feat/dev-infra to master on April 8, 2026 03:19
bitzyz marked this pull request as draft on April 29, 2026 08:45
bitzyz force-pushed the feat/dev-swiglu-cambricon branch from 607361a to c16b0a6 on May 28, 2026 03:14
bitzyz force-pushed the feat/dev-swiglu-cambricon branch from c16b0a6 to d8ea2a7 on May 28, 2026 03:19
bitzyz changed the title from "feat: add cambricon swiglu op" to "feat(cambricon): add SwigluOp in Cambricon" on May 28, 2026