feat: add quarot transform support and related changes, and asym for … #1682
wenhuach21 wants to merge 2 commits into main from …
Conversation
…activations. only for test.
Pull request overview
This PR introduces a new W4A4 preset scheme and adds experimental QuaRot-style (Hadamard/rotation) support for Llama models, wiring the configuration through CLI → compression → export/load → inference/eval paths.
Changes:
- Add `W4A4` preset scheme and enable it across formats and CPU tests.
- Add `llama_quarot` placement strategy with offline weight rotation + online activation transforms (hooks + wrapper monkey-patches).
- Extend CLI/config plumbing for `--hadamard_config` and activation symmetry overrides; adjust evaluation to keep in-memory models for `fake` so runtime hooks persist.
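QuaRot-style rotation works because an orthogonal matrix cancels between a rotated weight (folded offline) and a rotated activation (applied online by a hook). A minimal numpy sketch of that identity; the `hadamard` helper and variable names here are illustrative, not the PR's actual API:

```python
import numpy as np

def hadamard(n):
    # Sylvester construction: n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(H.shape[0])  # orthonormal: H @ H.T == I

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))  # layer weight
x = rng.standard_normal(4)       # incoming activation

H = hadamard(4)
W_rot = W @ H    # offline weight rotation (baked into the checkpoint)
x_rot = H.T @ x  # online activation transform (runtime pre-hook)

# The rotations cancel: W_rot @ x_rot == W @ H @ H.T @ x == W @ x.
print(np.allclose(W_rot @ x_rot, W @ x))  # True
```

The point of rotating before quantization is that the Hadamard transform spreads outlier channels across all dimensions, which makes low-bit (e.g. A4) activation quantization far less lossy while leaving the full-precision output mathematically unchanged.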
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| test/test_cpu/schemes/test_scheme.py | Adds a CPU test covering W4A4 quantize-and-save + reload. |
| auto_round/schemes.py | Introduces the W4A4 preset and registers it in PRESET_SCHEMES. |
| auto_round/inference/convert_model.py | Simplifies HadamardConfig construction when registering input hooks during model conversion. |
| auto_round/inference/backend.py | Makes Hadamard config handling robust to dict vs object; adds placement strategy check. |
| auto_round/formats.py | Advertises W4A4 support for AutoGPTQ + AutoRound formats. |
| auto_round/experimental/utils.py | Adds llama_quarot shorthand and relaxes scheme gating when using that placement strategy. |
| auto_round/experimental/transform/patch_modules.py | Adds wrapper forward monkey-patches to apply selective online activation transforms. |
| auto_round/experimental/transform/llama_quarot.py | New implementation for offline Llama QuaRot weight rotation + online transforms. |
| auto_round/experimental/transform/hadamard_config.py | Extends HadamardConfig with placement_strategy and QuaRot-specific options. |
| auto_round/experimental/transform/apply.py | Routes placement_strategy == llama_quarot to the new offline+online transform flow. |
| auto_round/eval/evaluation.py | Forces in-memory evaluation for fake format so runtime hooks remain attached. |
| auto_round/compressors/base.py | Normalizes/validates hadamard config once, passes target device, and applies QuaRot layer-config overrides. |
| auto_round/main.py | Adds --hadamard_config, --act_sym/--act_asym, and argument resolution helpers. |
Comments suppressed due to low confidence (1)
auto_round/formats.py:1071
`AutoRoundFormat.support_schemes` now includes `W4A4`, but the default `format="auto_round"` export path selects the GPTQ backend for symmetric int schemes, and GPTQ inference ignores activation-quantization fields (`act_bits`, etc.). As a result, users can export `W4A4` successfully but won't actually get 4-bit activation quantization at load/inference time. Consider either (1) disallowing `act_bits < 16` for `auto_round` exports (consistent with the existing error message later in `__init__`), or (2) adding an activation-quant-capable export/inference backend before advertising `W4A4` support here.
```python
class AutoRoundFormat(OutputFormat):
    support_schemes = [
        "W4A16",
        "W4A4",
        "W4A16_MIXED",
        "W2A16",
        "W3A16",
        "W8A16",
```
```python
input_tensor = args[0]
transformed_input = llama_quarot_online_transform(module, input_tensor)
if len(args) == 1:
    return transformed_input
```
In `_build_online_hook`, the forward pre-hook returns a bare Tensor when the module has a single positional input (`len(args) == 1`). For `register_forward_pre_hook`, the hook must return `None` or a tuple of positional args; returning a Tensor will cause PyTorch to treat it as an iterable of inputs when splatting (`*input`), breaking the forward call. Return `(transformed_input,)` instead (and keep the tuple path for multi-arg cases).
```diff
-return transformed_input
+return (transformed_input,)
```
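The tuple-return contract can be demonstrated in isolation. A small sketch of a correct pre-hook; the doubling transform here is a hypothetical stand-in for the online Hadamard transform, not the PR's code:

```python
import torch
import torch.nn as nn

def pre_hook(module, args):
    # Stand-in transform: scale the first positional input.
    transformed = args[0] * 2.0
    # Must return None or a tuple of positional args, never a bare Tensor:
    # PyTorch splats the return value back into forward(*args).
    return (transformed,) + args[1:]

layer = nn.Identity()
layer.register_forward_pre_hook(pre_hook)
out = layer(torch.ones(3))
print(out)  # tensor([2., 2., 2.])
```

Had the hook returned `transformed` directly, PyTorch would iterate over the Tensor's first dimension when unpacking the replacement inputs, producing a shape-dependent failure rather than an immediate error.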
```diff
 class AutoGPTQFormat(OutputFormat):
-    support_schemes = ["W4A16", "W2A16", "W3A16", "W8A16", "BF16", "W2A16G64", "W2A16G32", "W4A16_MIXED"]
+    support_schemes = ["W4A16", "W4A4", "W2A16", "W3A16", "W8A16", "BF16", "W2A16G64", "W2A16G32", "W4A16_MIXED"]
```
Adding `W4A4` to `AutoGPTQFormat.support_schemes` is misleading because the GPTQ export/inference path only uses `bits`/`group_size`/`sym` when creating inference layers (activation-quantization fields like `act_bits`/`act_group_size`/`act_sym` are ignored in `convert_model._create_quant_layer` for GPTQ backends). This means a model exported as `auto_gptq` from `W4A4` will effectively behave like weight-only `W4A16` at inference. Either remove `W4A4` from this list or add a hard check that `scheme.act_bits` is 16 for this format (and route users to `fake` if they want activation quantization).
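The hard check suggested above could look something like the following. This is a hypothetical sketch, not auto_round's actual validation code; `check_weight_only_export` and `SchemeError` are invented names, and the scheme is modeled as a plain dict for illustration:

```python
# Hypothetical guard: reject activation quantization for a weight-only
# export format instead of silently dropping the act_* fields.
class SchemeError(ValueError):
    pass

def check_weight_only_export(scheme: dict, fmt: str = "auto_gptq") -> None:
    act_bits = scheme.get("act_bits", 16)
    if act_bits < 16:
        raise SchemeError(
            f"{fmt} inference is weight-only; act_bits={act_bits} would be "
            f"silently ignored. Use format='fake' for activation quantization."
        )

check_weight_only_export({"bits": 4, "act_bits": 16})  # passes: weight-only
try:
    check_weight_only_export({"bits": 4, "act_bits": 4})  # W4A4: rejected
except SchemeError as e:
    print("rejected:", e)
```

Failing loudly at export time is preferable here because the silent fallback to `W4A16` behavior would only surface as an unexplained accuracy gap between calibration and deployment.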