
feat: add quarot transform support and related changes, and asym for … #1682

Closed
wenhuach21 wants to merge 2 commits into main from feat/autoround-quarot

Conversation

@wenhuach21
Contributor

…activations. only for test.

Description

Please briefly describe your main changes and the motivation behind them.

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Fixes or relates to #

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

Copilot AI review requested due to automatic review settings April 14, 2026 09:39
@wenhuach21 wenhuach21 marked this pull request as draft April 14, 2026 09:39
@azure-pipelines

Azure Pipelines:
Successfully started running 6 pipeline(s).
1 pipeline(s) require an authorized user to comment /azp run to run.

@azure-pipelines

Azure Pipelines:
6 pipeline(s) were filtered out due to trigger conditions.
1 pipeline(s) require an authorized user to comment /azp run to run.

Contributor

Copilot AI left a comment


Pull request overview

This PR introduces a new W4A4 preset scheme and adds experimental QuaRot-style (Hadamard/rotation) support for Llama models, wiring the configuration through CLI → compression → export/load → inference/eval paths.

Changes:

  • Add W4A4 preset scheme and enable it across formats and CPU tests.
  • Add llama_quarot placement strategy with offline weight rotation + online activation transforms (hooks + wrapper monkey-patches).
  • Extend CLI/config plumbing for --hadamard_config and activation symmetry overrides; adjust evaluation to keep in-memory models for fake so runtime hooks persist.
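The rotation trick these changes rely on fits in a few lines: because a normalized Hadamard matrix H is orthogonal (H Hᵀ = I), folding H into a linear layer's weight offline and applying H to the activations online leaves the layer's output unchanged while spreading outlier channels. A minimal numpy sketch of the equivalence (illustrative only — `hadamard`, `W`, and `x` are made up here, not the PR's code):

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix via Sylvester's construction (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # normalized so that H @ H.T == I

rng = np.random.default_rng(0)
n = 8
H = hadamard(n)
W = rng.standard_normal((4, n))  # hypothetical linear weight, shape (out, in)
x = rng.standard_normal((2, n))  # hypothetical activations

W_rot = W @ H                    # offline: fold the rotation into the weight
y_ref = x @ W.T                  # original layer output
y_rot = (x @ H) @ W_rot.T        # online: rotate activations, use rotated weight
max_err = np.abs(y_ref - y_rot).max()  # zero up to float error, since H is orthogonal
```

The quantization benefit comes from quantizing `x @ H` and `W_rot` instead of `x` and `W`: the rotation mixes every channel into every other, so per-channel outliers are flattened before low-bit rounding.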

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Summary per file:

  • test/test_cpu/schemes/test_scheme.py — Adds a CPU test covering W4A4 quantize-and-save + reload.
  • auto_round/schemes.py — Introduces the W4A4 preset and registers it in PRESET_SCHEMES.
  • auto_round/inference/convert_model.py — Simplifies HadamardConfig construction when registering input hooks during model conversion.
  • auto_round/inference/backend.py — Makes Hadamard config handling robust to dict vs object; adds a placement strategy check.
  • auto_round/formats.py — Advertises W4A4 support for the AutoGPTQ and AutoRound formats.
  • auto_round/experimental/utils.py — Adds the llama_quarot shorthand and relaxes scheme gating when using that placement strategy.
  • auto_round/experimental/transform/patch_modules.py — Adds wrapper forward monkey-patches to apply selective online activation transforms.
  • auto_round/experimental/transform/llama_quarot.py — New implementation of offline Llama QuaRot weight rotation + online transforms.
  • auto_round/experimental/transform/hadamard_config.py — Extends HadamardConfig with placement_strategy and QuaRot-specific options.
  • auto_round/experimental/transform/apply.py — Routes placement_strategy == llama_quarot to the new offline+online transform flow.
  • auto_round/eval/evaluation.py — Forces in-memory evaluation for the fake format so runtime hooks remain attached.
  • auto_round/compressors/base.py — Normalizes/validates the hadamard config once, passes the target device, and applies QuaRot layer-config overrides.
  • auto_round/main.py — Adds --hadamard_config, --act_sym/--act_asym, and argument resolution helpers.
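Paired enable/disable flags like --act_sym/--act_asym are typically resolved into a tri-state override, where leaving both unset means "keep the scheme's default". A hypothetical argparse sketch of that pattern — not necessarily how auto_round/main.py implements it:

```python
import argparse

# Sketch: both flags write the same destination, so at most one takes effect;
# with neither flag present the value stays None (use the scheme default).
parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group()
group.add_argument("--act_sym", dest="act_sym", action="store_true", default=None,
                   help="force symmetric activation quantization")
group.add_argument("--act_asym", dest="act_sym", action="store_false", default=None,
                   help="force asymmetric activation quantization")

default_val = parser.parse_args([]).act_sym              # None -> scheme default
forced_sym = parser.parse_args(["--act_sym"]).act_sym    # True
forced_asym = parser.parse_args(["--act_asym"]).act_sym  # False
```

Downstream code can then apply the override only when the value is not None, which is what makes the pair composable with per-scheme defaults.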
Comments suppressed due to low confidence (1)

auto_round/formats.py:1071

  • AutoRoundFormat.support_schemes now includes W4A4, but the default format="auto_round" export path selects the GPTQ backend for symmetric int schemes and GPTQ inference ignores activation-quantization fields (act_bits, etc.). As a result, users can export W4A4 successfully but won’t actually get 4-bit activation quantization at load/inference time. Consider either (1) disallowing act_bits < 16 for auto_round exports (consistent with the existing error message later in __init__), or (2) adding an activation-quant-capable export/inference backend before advertising W4A4 support here.
class AutoRoundFormat(OutputFormat):
    support_schemes = [
        "W4A16",
        "W4A4",
        "W4A16_MIXED",
        "W2A16",
        "W3A16",
        "W8A16",

input_tensor = args[0]
transformed_input = llama_quarot_online_transform(module, input_tensor)
if len(args) == 1:
    return transformed_input

Copilot AI Apr 14, 2026


In _build_online_hook, the forward pre-hook returns a bare Tensor when the module has a single positional input (len(args) == 1). PyTorch wraps a single non-tuple value returned from a forward pre-hook into a tuple automatically, so this works in practice, but the documented contract is to return None or a tuple of positional args. Returning (transformed_input,) explicitly is safer, guards against a transform that ever returns a tuple itself, and keeps the single-arg and multi-arg paths consistent.

Suggested change
-    return transformed_input
+    return (transformed_input,)
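As a self-contained illustration of the tuple-returning form, here is a toy pre-hook that doubles a module's input (the `2.0 * t` transform is a stand-in for the online Hadamard transform; this is a sketch, not the PR's `_build_online_hook`):

```python
import torch
import torch.nn as nn

def make_pre_hook(transform):
    # register_forward_pre_hook contract: return None or a tuple of the
    # positional args (a single non-tuple return is auto-wrapped by PyTorch,
    # but the explicit tuple keeps both paths uniform).
    def hook(module, args):
        transformed = transform(args[0])
        return (transformed,) + tuple(args[1:])
    return hook

layer = nn.Linear(4, 3, bias=False)
layer.register_forward_pre_hook(make_pre_hook(lambda t: 2.0 * t))

x = torch.ones(1, 4)
out = layer(x)  # the hook doubles the input before the linear runs
expected = torch.nn.functional.linear(2.0 * x, layer.weight)
```

Registering the transform as a pre-hook (rather than editing the module's forward) is what lets the online rotation survive model reloads, as long as the hook is re-attached.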

Comment on lines 516 to +517
class AutoGPTQFormat(OutputFormat):
-    support_schemes = ["W4A16", "W2A16", "W3A16", "W8A16", "BF16", "W2A16G64", "W2A16G32", "W4A16_MIXED"]
+    support_schemes = ["W4A16", "W4A4", "W2A16", "W3A16", "W8A16", "BF16", "W2A16G64", "W2A16G32", "W4A16_MIXED"]

Copilot AI Apr 14, 2026


Adding W4A4 to AutoGPTQFormat.support_schemes is misleading because the GPTQ export/inference path only uses bits/group_size/sym when creating inference layers (activation quantization fields like act_bits/act_group_size/act_sym are ignored in convert_model._create_quant_layer for GPTQ backends). This means a model exported as auto_gptq from W4A4 will effectively behave like weight-only W4A16 at inference. Either remove W4A4 from this list or add a hard check that scheme.act_bits is 16 for this format (and route users to fake if they want activation quantization).
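One way to implement the hard check this comment suggests is a small validation guard at export time. A hypothetical sketch (the function and set names here are assumptions for illustration, not auto_round's actual API):

```python
# Formats whose inference layers only read bits/group_size/sym should refuse
# schemes that request activation quantization, instead of silently dropping it.
WEIGHT_ONLY_FORMATS = {"auto_gptq"}  # assumed name for illustration

def validate_scheme(fmt: str, act_bits: int) -> None:
    if fmt in WEIGHT_ONLY_FORMATS and act_bits < 16:
        raise ValueError(
            f"format {fmt!r} ignores activation quantization; "
            f"export with format='fake' to keep act_bits={act_bits} in effect"
        )

validate_scheme("auto_gptq", 16)  # weight-only W4A16: accepted
```

Failing loudly here routes users to the fake format, where the runtime hooks that perform activation quantization actually stay attached.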

@wenhuach21 wenhuach21 closed this Apr 14, 2026

3 participants