feat https://github.com/NVIDIA/TensorRT-Edge-LLM/issues/87: add CustomVoice language conditioning support for Qwen3-TTS by suharvest · Pull Request #98 · NVIDIA/TensorRT-Edge-LLM

suharvest · 2026-05-27T03:11:13Z

What does this PR do?

Type of change: new feature

Overview:

docs/source/user_guide/getting_started/supported-models.md at v0.7.1 states that Qwen3-TTS support is limited to the CustomVoice checkpoints (Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice and the 1.7B sibling). However, the C++ runtime currently has no support for those models' language-conditioned prefill. The Python reference at modeling_qwen3_tts.py:2120-2186 emits a 9-row prefix with a language_id codec token injected between codec_think_bos_id and codec_think_eos_id:

if language_id is None:                          # base 12Hz / no-lang
    codec_prefill_list = [[codec_nothink_id, codec_think_bos_id, codec_think_eos_id]]
else:                                             # CustomVoice + language
    codec_prefill_list = [[codec_think_id, codec_think_bos_id, language_id, codec_think_eos_id]]

assistantPreambleKernel only implements the 8-row (no-language) branch — git grep for language_id / customvoice / codec_language in v0.7.1 returns 0 hits anywhere in cpp/, examples/omni/, or experimental/llm_loader/models/qwen3_tts/. Running the only officially supported Qwen3-TTS family therefore produces unintelligible output, validated end-to-end via an objective ASR loop (see "Validation" below).

This PR adds the missing 9-row branch and threads the language string from input JSON → runtime → kernel.
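
To make the one-token difference between the two layouts explicit, the reference branch can be restated as a small standalone function. This is only a sketch: `build_codec_prefill` is a hypothetical name, and the four control-token ids are placeholders chosen for illustration, not the real checkpoint values. Only 2055 (chinese) comes from the `codec_language_id` map in this PR.

```python
from typing import List, Optional

# Placeholder control-token ids for illustration only (NOT the checkpoint values).
CODEC_NOTHINK_ID = 1
CODEC_THINK_ID = 2
CODEC_THINK_BOS_ID = 3
CODEC_THINK_EOS_ID = 4


def build_codec_prefill(language_id: Optional[int]) -> List[int]:
    """Mirror the two branches of the Python reference quoted above."""
    if language_id is None:
        # Base 12Hz / no-language path: three control tokens.
        return [CODEC_NOTHINK_ID, CODEC_THINK_BOS_ID, CODEC_THINK_EOS_ID]
    # CustomVoice + language: the language token sits between think_bos and think_eos.
    return [CODEC_THINK_ID, CODEC_THINK_BOS_ID, language_id, CODEC_THINK_EOS_ID]


no_lang = build_codec_prefill(None)
with_lang = build_codec_prefill(2055)  # 2055 = "chinese" in codec_language_id
print(len(with_lang) - len(no_lang))   # one extra codec token in the language branch
```

That single extra token is what separates the 9-row prefix from the 8-row one.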

Usage

After this PR the CLI accepts "language" as a top-level field (with per-request override) in the existing inputJson schema. Engines exported through experimental/llm_loader/export_all_cli.py automatically pick up the codec_language_id map from the source talker_config. Example input:

{
  "speaker": "vivian",
  "language": "chinese",
  "apply_chat_template": true,
  "add_generation_prompt": true,
  "enable_thinking": false,
  "max_audio_length": 24000,
  "requests": [
    {"messages": [{"role": "user", "content": "今天天气真不错"}]}
  ]
}

Run:

./qwen3_tts_inference \
  --inputFile=in.json \
  --talkerEngineDir=engines/talker \
  --code2wavEngineDir=engines/code2wav \
  --tokenizerDir=engines/talker \
  --outputFile=out/result.json \
  --outputAudioDir=out

Stderr will now include:

CustomVoice language conditioning enabled: language="chinese" -> codec_id=2055
projectToTalkerInput: ... outputSeqLen=15, speakerId=3065, langId=2055, prefixRows=9

When "language" is absent (or unknown), the runtime falls back to the existing 8-row no-language path with a WARNING log line, and behavior is unchanged from main.
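
The resolution order described above (per-request value first, then the top-level value, then fallback) can be sketched in Python as follows. This is not the C++ runtime code: `resolve_language_id` and `NO_LANGUAGE` are hypothetical names, and lower-casing the requested string before lookup is an assumption (the PR only states that the map keys are lower-cased).

```python
from typing import Mapping, Optional

NO_LANGUAGE = -1  # sentinel that selects the existing 8-row path


def resolve_language_id(top_level: Optional[str],
                        per_request: Optional[str],
                        lang_map: Mapping[str, int]) -> int:
    """Return the codec id for the requested language, or -1 to fall back."""
    # A per-request "language" overrides the top-level field.
    name = per_request if per_request is not None else top_level
    if name is None:
        return NO_LANGUAGE
    # Assumption: the request string is matched case-insensitively.
    return lang_map.get(name.lower(), NO_LANGUAGE)


# Two entries taken from the codec_language_id map in this PR.
lang_map = {"chinese": 2055, "english": 2050}

print(resolve_language_id("chinese", None, lang_map))       # top-level only
print(resolve_language_id("chinese", "english", lang_map))  # per-request override
print(resolve_language_id(None, None, lang_map))            # absent
print(resolve_language_id("klingon", None, lang_map))       # unknown
```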

Changes by file

cpp/kernels/talkerMLPKernels/talkerMLPKernels.{cu,h} — add language_id as a kernel argument (default -1, preserves existing 8-row layout). Emit the 9-row prefix when language_id >= 0, with the language codec id injected between codec_think_bos_id and codec_think_eos_id.
cpp/runtime/qwen3OmniTTSRuntime.{cpp,h} — parse codec_language_id (string→int, lower-cased keys) and codec_think_id from the engine config.json. Accept language per-request, resolve via the map, log at LOG_INFO. Thread (speakerId, langId, prefixRows) through projectToTalkerInput into the kernel.
examples/omni/qwen3_tts_inference.cpp — forward top-level "language" from input JSON, allow per-request override.
experimental/llm_loader/export_all_cli.py — forward codec_language_id from the source HuggingFace talker_config into the engine config.json alongside the existing codec_*_id keys. Without this, the runtime read always sees an empty map and CustomVoice language conditioning is a silent no-op.
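
As a rough illustration of the exporter change, the forwarding step amounts to copying one dictionary from the source config into the engine config. The function name `forward_codec_language_id` is hypothetical, and placing the map at the top level of the engine config is an assumption; the PR only says it is written alongside the existing codec_*_id keys.

```python
import json
from typing import Any, Dict


def forward_codec_language_id(hf_config: Dict[str, Any],
                              engine_config: Dict[str, Any]) -> Dict[str, Any]:
    """Copy talker_config.codec_language_id into the engine config, if present."""
    talker_cfg = hf_config.get("talker_config", {})
    lang_map = talker_cfg.get("codec_language_id")
    out = dict(engine_config)  # leave the caller's dict untouched
    if lang_map:
        # Assumed layout: the map sits at the top level of the engine config.
        out["codec_language_id"] = {str(k): int(v) for k, v in lang_map.items()}
    return out


hf_config = {"talker_config": {"codec_language_id": {"chinese": 2055, "english": 2050}}}
engine_config = forward_codec_language_id(hf_config, {"existing_key": 1})
print(json.dumps(engine_config, indent=2))
```

When the source config carries no map, the engine config is returned unchanged, which corresponds to the silent no-op case described in the last item above.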

`codec_language_id` map (CustomVoice 12-language set)

The 12 (language → codec token id) pairs come directly from the public Qwen3-TTS CustomVoice checkpoint configs — talker_config.codec_language_id in Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice/config.json:

| key | codec id | key | codec id |
| --- | --- | --- | --- |
| chinese | 2055 | japanese | 2058 |
| english | 2050 | korean | 2064 |
| german | 2053 | french | 2061 |
| italian | 2070 | russian | 2069 |
| portuguese | 2071 | beijing_dialect | 2074 |
| spanish | 2054 | sichuan_dialect | 2062 |

These are read from the source talker_config at export time rather than hardcoded, so future checkpoints with additional languages work without code changes.

Validation

End-to-end on Jetson Orin NX (sm_87, JetPack 6.x, CUDA 12.6, TRT 10.3), built with -DENABLE_CUTE_DSL=gemm -DCUTE_DSL_ARTIFACT_TAG=sm_87 -DEMBEDDED_TARGET=jetson-orin. ASR is radxa sherpa-onnx SenseVoice model.int8.onnx, language='zh', use_itn=True, applied to the synthesized WAV:

| Variant | WAV md5 (bit-stable across 10 reruns) | ASR transcript | Input text |
| --- | --- | --- | --- |
| FP16 | `a4f56dcbccc580a58a9bab0e0b727eae` | `今天天气真不错。` | `今天天气真不错` |
| W8A16 (ModelOpt AWQ per-channel) | `7c4e1825ca1bcbaa4a9b61505967e79d` | `今天天气真不错哦。` | `今天天气真不错` |
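
One way to put a number on the two transcripts is a character error rate (CER) after stripping punctuation. The sketch below is an illustration only: the actual PASS/FAIL criterion of the reproducer script is not stated in this PR, and `normalize`, `edit_distance`, and `cer` are hypothetical helpers.

```python
import unicodedata


def normalize(text: str) -> str:
    """Drop punctuation and whitespace so a trailing '。' is not counted as an error."""
    return "".join(ch for ch in text
                   if not ch.isspace()
                   and not unicodedata.category(ch).startswith("P"))


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    ref_n, hyp_n = normalize(ref), normalize(hyp)
    return edit_distance(ref_n, hyp_n) / len(ref_n)


reference = "今天天气真不错"
print(cer(reference, "今天天气真不错。"))    # FP16 transcript
print(cer(reference, "今天天气真不错哦。"))  # W8A16 transcript
```

Under this metric the FP16 transcript matches the input exactly, while the W8A16 transcript has one inserted character (哦) against seven reference characters, a CER of 1/7.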

Reviewer note: end-to-end audio validation also depends on the env-var workaround documented in issue #87 (Issue C — a separate bug in the prebuilt CuTe DSL artifact at kernelSrcs/cuteDSLPrebuilt/cutedsl_aarch64_sm_87_cuda12.tar.gz). This PR is a strict prerequisite (without it the prefix is still wrong even with the CuTe workaround), but it cannot be byte-verified independently of Issue C resolving at NVIDIA's end. Unit-level verification — kernel produces 9-row prefix for language_id >= 0 and unchanged 8-row prefix for language_id == -1 — is dependency-free and is the minimal test we recommend adding.

🚀 Pull Request Checklist

✅ Pre-commit Checks

pip install pre-commit
pre-commit install
pre-commit run --files <6 changed files> passes (clang-format / codespell / yapf / ruff / autoflake / etc. all green after staging clang-format auto-fixes)

🧪 Tests

TODO: add a unit test for assistantPreambleKernel exercising the language_id != -1 9-row layout (compare prefix output against the Python reference for speaker=vivian, language=chinese). The kernel-only path does not depend on the Issue C workaround.
Existing Qwen3-TTS smoke (base / no-language path) still passes — language_id defaults to -1 so behavior is unchanged for that path.

📄 Documentation

docs/source/user_guide/getting_started/supported-models.md already lists the CustomVoice checkpoints; this PR makes that documented support actually work. No doc change required for the PR itself.
Suggestion (not blocking): add a short "CustomVoice language conditioning" subsection under the Qwen3-TTS section explaining the new "language" input field and the 12 supported language keys.

⚙️ Compatibility

Backward compatible:
- language_id defaults to -1 → existing callers that do not set "language" get the same 8-row prefix as before.
- Engines built without codec_language_id in their config.json — including all engines exported by main today — read an empty map and continue to use the no-language path with a WARNING log line.
- No public API symbol removed; no schema field renamed.
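
The second compatibility point can be illustrated with a small loader that treats a missing codec_language_id as an empty map. This is a Python sketch of the described behavior, not the C++ implementation; `load_language_map` is a hypothetical name and the top-level key placement is an assumption.

```python
import json
from typing import Dict


def load_language_map(config_text: str) -> Dict[str, int]:
    """Parse codec_language_id from an engine config; a missing key yields {}."""
    cfg = json.loads(config_text)
    raw = cfg.get("codec_language_id") or {}
    # Keys are lower-cased on load, as described under "Changes by file".
    return {str(k).lower(): int(v) for k, v in raw.items()}


old_engine = load_language_map('{"some_other_key": 0}')
new_engine = load_language_map('{"codec_language_id": {"Chinese": 2055}}')

# An empty map means every lookup misses, so language_id stays at -1 (8-row path).
print(old_engine.get("chinese", -1))
print(new_engine.get("chinese", -1))
```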

Additional Information

Related issue: #87 — [Bug] Qwen3-TTS 0.6B TTS output is incorrect on Jetson (Issue B in the follow-up comment from 2026-05-27)
This PR addresses a single concern (CustomVoice language conditioning end-to-end).
Independent reproducer harness (BSD-3 license): scripts/verify_customvoice_tts_radxa_asr.sh — TTS+ASR roundtrip with PASS/FAIL JSON output.
Pre-built engines + W8A16 ONNX for reproducer on Orin NX are mirrored at harvestsu/seeed-local-voice-artifacts/orin-nx/qwen3-tts-12hz-0.6b-customvoice/ on Hugging Face.


zzd1994 · 2026-05-27T09:57:33Z

Hello, does your fork support ONNX conversion and TensorRT inference for the qwen3-tts-0.6B-Base model?

suharvest · 2026-05-27T10:47:01Z

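> Hello, does your fork support ONNX conversion and TensorRT inference for the qwen3-tts-0.6B-Base model?

The v0.7.0-based branches support it; highperf adds streaming and the like. I am now working on top of v0.7.1 and have just got CustomVoice running. I will try porting Base support over later.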

zzd1994 · 2026-05-29T02:48:02Z

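> Hello, does your fork support ONNX conversion and TensorRT inference for the qwen3-tts-0.6B-Base model?

> The v0.7.0-based branches support it; highperf adds streaming and the like. I am now working on top of v0.7.1 and have just got CustomVoice running. I will try porting Base support over later.

https://github.com/suharvest/TensorRT-Edge-LLM/tree/highperf/runtime-service
Is this the branch you mean? When I export ONNX with tensorrt-edgellm-export-llm, the file small_to_mtp_projection.safetensors is not produced.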

suharvest · 2026-05-29T05:31:58Z

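@zzd1994 Yes, it is the highperf/runtime-service branch (based on v0.7.0, with a streaming runtime added). Export and inference for Base 12Hz 0.6B have both been run successfully on that branch.

About small_to_mtp_projection.safetensors: it is normal and expected for the Base model not to have this file. It does not mean the export failed or that you picked the wrong branch.

This file is a conditional artifact of CodePredictor. It is generated only when the talker's hidden_size and the code_predictor's hidden_size differ (in that case the projection layer is a real nn.Linear). The check in the export code:

# tensorrt_edgellm/onnx_export/llm_export.py
proj = getattr(model, 'small_to_mtp_projection', None)
if proj is not None and not isinstance(proj, nn.Identity):   # written to disk only when not Identity
    save_file(..., "small_to_mtp_projection.safetensors")

For 0.6B-Base, the talker hidden_size is 1024 and the code_predictor hidden_size is also 1024 (they are equal), so this projection layer is nn.Identity and is skipped at export. That is what the trailing comment # if not Identity on that line of the artifact tree in the docs means. Only models such as CustomVoice, where the dimensions differ (talker 2048 ↔ code_predictor 1024), actually produce this file.

The downstream stages also treat the file as optional, so Base works end to end without it:

Builder: small_to_mtp_projection.safetensors is listed in cpOptionalFiles, so its absence is not an error;
Runtime: when talkerHiddenSize == codePredictorHiddenSize the projection is skipped (a No small_to_mtp_projection needed log line is printed).

So for Base you can go straight on to build and inference. The one counter-case to watch for: if the runtime reports small_to_mtp_projection.safetensors required when talkerHiddenSize != codePredictorHiddenSize at startup, the dimensions really do not match and further investigation is needed. Otherwise the missing file has no effect.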
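
The rule above can be turned into a quick pre-build check. The following is a minimal sketch and not part of the repository: `check_projection_artifact` is a hypothetical helper, and the two hidden sizes are passed in by hand rather than read from any config file.

```python
import tempfile
from pathlib import Path

PROJECTION_FILE = "small_to_mtp_projection.safetensors"


def check_projection_artifact(export_dir: str,
                              talker_hidden_size: int,
                              code_predictor_hidden_size: int) -> str:
    """Report whether a missing projection file is a problem for this model."""
    if talker_hidden_size == code_predictor_hidden_size:
        # Projection is nn.Identity: the file is never written and never needed.
        return "not needed"
    if (Path(export_dir) / PROJECTION_FILE).is_file():
        return "present"
    raise FileNotFoundError(
        f"{PROJECTION_FILE} required: talker hidden size {talker_hidden_size} "
        f"!= code predictor hidden size {code_predictor_hidden_size}")


with tempfile.TemporaryDirectory() as export_dir:
    # 0.6B-Base: 1024 == 1024, so an export directory without the file is fine.
    print(check_projection_artifact(export_dir, 1024, 1024))
```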

feat NVIDIA#87: add CustomVoice language conditioning support for Qwen3-TTS

e071421

suharvest requested a review from a team May 27, 2026 03:11

suharvest mentioned this pull request May 27, 2026

[Bug] Qwen3-TTS 0.6B TTS output is incorrect on Jetson unless Talker / CodePredictor / Code2Wav contracts are fixed #87

Open

suharvest wants to merge 1 commit into NVIDIA:main from suharvest:feat/qwen3-tts-customvoice-language
