feat https://github.com/NVIDIA/TensorRT-Edge-LLM/issues/87: add CustomVoice language conditioning support for Qwen3-TTS #98

Open
suharvest wants to merge 1 commit into NVIDIA:main from suharvest:feat/qwen3-tts-customvoice-language


@suharvest

What does this PR do?

Type of change: new feature

Overview:

docs/source/user_guide/getting_started/supported-models.md at v0.7.1 states that Qwen3-TTS support is limited to the CustomVoice checkpoints (Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice and the 1.7B sibling). However, the C++ runtime currently does not implement the language-conditioned prefill those models require. The Python reference at modeling_qwen3_tts.py:2120-2186 emits a 9-row prefix with a language_id codec token injected between codec_think_bos_id and codec_think_eos_id:

if language_id is None:                          # base 12Hz / no-lang
    codec_prefill_list = [[codec_nothink_id, codec_think_bos_id, codec_think_eos_id]]
else:                                             # CustomVoice + language
    codec_prefill_list = [[codec_think_id, codec_think_bos_id, language_id, codec_think_eos_id]]

assistantPreambleKernel only implements the 8-row (no-language) branch — git grep for language_id / customvoice / codec_language in v0.7.1 returns 0 hits anywhere in cpp/, examples/omni/, or experimental/llm_loader/models/qwen3_tts/. Running the only officially supported Qwen3-TTS family therefore produces unintelligible output, validated end-to-end via an objective ASR loop (see "Validation" below).

This PR adds the missing 9-row branch and threads the language string from input JSON → runtime → kernel.

Usage

After this PR the CLI accepts "language" as a top-level field (with per-request override) in the existing inputJson schema. Engines exported through experimental/llm_loader/export_all_cli.py automatically pick up the codec_language_id map from the source talker_config. Example input:

{
  "speaker": "vivian",
  "language": "chinese",
  "apply_chat_template": true,
  "add_generation_prompt": true,
  "enable_thinking": false,
  "max_audio_length": 24000,
  "requests": [
    {"messages": [{"role": "user", "content": "今天天气真不错"}]}
  ]
}

Run:

./qwen3_tts_inference \
  --inputFile=in.json \
  --talkerEngineDir=engines/talker \
  --code2wavEngineDir=engines/code2wav \
  --tokenizerDir=engines/talker \
  --outputFile=out/result.json \
  --outputAudioDir=out

Stderr will now include:

CustomVoice language conditioning enabled: language="chinese" -> codec_id=2055
projectToTalkerInput: ... outputSeqLen=15, speakerId=3065, langId=2055, prefixRows=9

When "language" is absent (or unknown), the runtime falls back to the existing 8-row no-language path with a WARNING log line, and behavior is unchanged from main.
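The resolution and fallback behavior described above can be sketched in Python. This is an illustration, not the C++ runtime code: `effective_language` and `resolve_language_id` are hypothetical helper names, the per-request field is assumed to be named "language" like the top-level one, and the map keys are assumed to be lower-cased already.

```python
import logging

logger = logging.getLogger("qwen3_tts_sketch")

NO_LANGUAGE = -1  # mirrors the kernel default described in this PR


def effective_language(input_json, request):
    """Per-request "language" wins over the top-level field (assumed precedence)."""
    return request.get("language", input_json.get("language"))


def resolve_language_id(language, codec_language_id):
    """Map a language string to (language_id, prefix_rows).

    codec_language_id is the string -> int map from the engine config.json;
    its keys are assumed to be lower-cased.
    """
    if language is not None:
        lang_id = codec_language_id.get(language.lower())
        if lang_id is not None:
            logger.info('language="%s" -> codec_id=%d', language, lang_id)
            return lang_id, 9
    # Absent or unknown language: keep the existing 8-row no-language path.
    logger.warning("language %r not resolved; using the no-language path", language)
    return NO_LANGUAGE, 8
```

With the map from the CustomVoice config, `resolve_language_id("Chinese", {"chinese": 2055})` yields `(2055, 9)`, while an empty map (an engine exported without codec_language_id) always yields `(-1, 8)`.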

Changes by file

  1. cpp/kernels/talkerMLPKernels/talkerMLPKernels.{cu,h} — add language_id as a kernel argument (default -1, preserves existing 8-row layout). Emit the 9-row prefix when language_id >= 0, with the language codec id injected between codec_think_bos_id and codec_think_eos_id.
  2. cpp/runtime/qwen3OmniTTSRuntime.{cpp,h} — parse codec_language_id (string→int, lower-cased keys) and codec_think_id from the engine config.json. Accept language per-request, resolve via the map, log at LOG_INFO. Thread (speakerId, langId, prefixRows) through projectToTalkerInput into the kernel.
  3. examples/omni/qwen3_tts_inference.cpp — forward top-level "language" from input JSON, allow per-request override.
  4. experimental/llm_loader/export_all_cli.py — forward codec_language_id from the source HuggingFace talker_config into the engine config.json alongside the existing codec_*_id keys. Without this, the runtime read always sees an empty map and CustomVoice language conditioning is a silent no-op.
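Item 4 can be pictured as a small merge step at export time. The sketch below is not the code in export_all_cli.py: `forward_language_map` is a hypothetical helper, and normalizing the keys to lower case during export is an assumption made here for illustration.

```python
import json


def forward_language_map(talker_config, engine_config):
    """Return a copy of engine_config with codec_language_id carried over.

    talker_config stands in for the source HuggingFace talker_config dict.
    If the source has no language map, the engine config is left unchanged,
    which corresponds to the runtime seeing an empty map.
    """
    merged = dict(engine_config)
    language_map = talker_config.get("codec_language_id") or {}
    if language_map:
        merged["codec_language_id"] = {
            str(name).lower(): int(codec_id)  # assumed normalization
            for name, codec_id in language_map.items()
        }
    return merged


if __name__ == "__main__":
    source = {"codec_language_id": {"chinese": 2055, "english": 2050}}
    print(json.dumps(forward_language_map(source, {}), indent=2))
```

Returning a new dict instead of mutating the engine config keeps the step easy to test in isolation.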

codec_language_id map (CustomVoice 12-language set)

The 12 (language → codec token id) pairs come directly from the public Qwen3-TTS CustomVoice checkpoint configs — talker_config.codec_language_id in Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice/config.json:

| key | codec id | key | codec id |
| --- | --- | --- | --- |
| chinese | 2055 | japanese | 2058 |
| english | 2050 | korean | 2064 |
| german | 2053 | french | 2061 |
| italian | 2070 | russian | 2069 |
| portuguese | 2071 | beijing_dialect | 2074 |
| spanish | 2054 | sichuan_dialect | 2062 |

These are read from the source talker_config at export time rather than hardcoded, so future checkpoints with additional languages work without code changes.

Validation

End-to-end on Jetson Orin NX (sm_87, JetPack 6.x, CUDA 12.6, TRT 10.3), built with -DENABLE_CUTE_DSL=gemm -DCUTE_DSL_ARTIFACT_TAG=sm_87 -DEMBEDDED_TARGET=jetson-orin. ASR is radxa sherpa-onnx SenseVoice model.int8.onnx, language='zh', use_itn=True, applied to the synthesized WAV:

| Variant | WAV md5 (bit-stable across 10 reruns) | ASR transcript | Input text |
| --- | --- | --- | --- |
| FP16 | a4f56dcbccc580a58a9bab0e0b727eae | 今天天气真不错。 | 今天天气真不错 |
| W8A16 (ModelOpt AWQ per-channel) | 7c4e1825ca1bcbaa4a9b61505967e79d | 今天天气真不错哦。 | 今天天气真不错 |
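One way to turn the transcripts above into a number is a character error rate (CER) against the input text, ignoring punctuation. The sketch below is an illustration only and is not the validation script used for this PR; the function names are hypothetical.

```python
import unicodedata


def strip_punctuation(text):
    """Drop punctuation and whitespace so only spoken characters are compared."""
    return "".join(
        ch for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (0 cost on match)
            ))
        prev = curr
    return prev[-1]


def char_error_rate(reference, transcript):
    """Edit distance divided by the reference length, after punctuation removal."""
    ref = strip_punctuation(reference)
    hyp = strip_punctuation(transcript)
    if not ref:
        raise ValueError("reference text is empty after removing punctuation")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    text = "今天天气真不错"
    print(char_error_rate(text, "今天天气真不错。"))              # 0.0
    print(round(char_error_rate(text, "今天天气真不错哦。"), 3))  # 0.143
```

Under this measure the FP16 row matches the input exactly, and the W8A16 row differs by one inserted character out of seven.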

Reviewer note: end-to-end audio validation also depends on the env-var workaround documented in issue #87 (Issue C, a separate bug in the prebuilt CuTe DSL artifact at kernelSrcs/cuteDSLPrebuilt/cutedsl_aarch64_sm_87_cuda12.tar.gz). This PR is a strict prerequisite (without it the prefix is still wrong even with the CuTe workaround), but its output cannot be byte-verified on its own until Issue C is resolved on NVIDIA's side. Unit-level verification (the kernel produces the 9-row prefix for language_id >= 0 and the unchanged 8-row prefix for language_id == -1) has no such dependency and is the minimal test we recommend adding.

🚀 Pull Request Checklist

✅ Pre-commit Checks

  • pip install pre-commit
  • pre-commit install
  • pre-commit run --files <6 changed files> passes (clang-format / codespell / yapf / ruff / autoflake / etc. all green after staging clang-format auto-fixes)

🧪 Tests

  • TODO: add a unit test for assistantPreambleKernel exercising the language_id != -1 9-row layout (compare prefix output against the Python reference for speaker=vivian, language=chinese). The kernel-only path does not depend on the Issue C workaround.
  • Existing Qwen3-TTS smoke (base / no-language path) still passes — language_id defaults to -1 so behavior is unchanged for that path.
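The expected values for such a unit test could come from a small Python reference of the codec portion of the prefix. This is a sketch under stated assumptions: the token ids below are placeholders (real values come from talker_config), `reference_codec_prefill` is a hypothetical helper that mirrors the two branches quoted in the overview, and the comparison against the actual kernel output is left out because it depends on the C++ test harness.

```python
# Placeholder token ids for illustration only; not the real codec ids.
CODEC_NOTHINK_ID = 1
CODEC_THINK_ID = 2
CODEC_THINK_BOS_ID = 3
CODEC_THINK_EOS_ID = 4


def reference_codec_prefill(language_id=-1):
    """Codec prefill tokens for the no-language (-1) and language branches."""
    if language_id < 0:
        return [CODEC_NOTHINK_ID, CODEC_THINK_BOS_ID, CODEC_THINK_EOS_ID]
    return [CODEC_THINK_ID, CODEC_THINK_BOS_ID, language_id, CODEC_THINK_EOS_ID]


def check_language_branch(language_id):
    """Assert the properties a kernel test would also need to hold."""
    base = reference_codec_prefill()
    lang = reference_codec_prefill(language_id)
    # Exactly one extra row, matching the 8-row vs 9-row layouts.
    assert len(lang) == len(base) + 1
    # The language token sits between think_bos and think_eos.
    pos = lang.index(language_id)
    assert lang[pos - 1] == CODEC_THINK_BOS_ID
    assert lang[pos + 1] == CODEC_THINK_EOS_ID
    return lang


if __name__ == "__main__":
    print(check_language_branch(2055))  # language=chinese per the map above
```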

📄 Documentation

  • docs/source/user_guide/getting_started/supported-models.md already lists the CustomVoice checkpoints; this PR makes that documented support actually work. No doc change required for the PR itself.
  • Suggestion (not blocking): add a short "CustomVoice language conditioning" subsection under the Qwen3-TTS section explaining the new "language" input field and the 12 supported language keys.

⚙️ Compatibility

  • Backward compatible:
    • language_id defaults to -1 → existing callers that do not set "language" get the same 8-row prefix as before.
    • Engines built without codec_language_id in their config.json — including all engines exported by main today — read an empty map and continue to use the no-language path with a WARNING log line.
    • No public API symbol removed; no schema field renamed.

Additional Information

@zzd1994

zzd1994 commented May 27, 2026

Hello, does your fork support ONNX conversion and TensorRT inference for the qwen3-tts-0.6B-Base model?

@suharvest
Author

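> Hello, does your fork support ONNX conversion and TensorRT inference for the qwen3-tts-0.6B-Base model?

The two v0.7.0-based branches support it; highperf adds streaming and so on. I am now working on top of v0.7.1 and have just got CustomVoice running. I will try porting Base support over afterwards.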

@zzd1994

zzd1994 commented May 29, 2026

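> > Hello, does your fork support ONNX conversion and TensorRT inference for the qwen3-tts-0.6B-Base model?
>
> The two v0.7.0-based branches support it; highperf adds streaming and so on. I am now working on top of v0.7.1 and have just got CustomVoice running. I will try porting Base support over afterwards.

https://github.com/suharvest/TensorRT-Edge-LLM/tree/highperf/runtime-service
Is this the branch you mean? When I export ONNX with tensorrt-edgellm-export-llm, there is no small_to_mtp_projection.safetensors file.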

@suharvest
Author

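@zzd1994 Yes, it is the highperf/runtime-service branch (based on v0.7.0, with a streaming runtime added). Export and inference for Base 12Hz 0.6B both run end to end on that branch.

About small_to_mtp_projection.safetensors: it is normal and expected that the Base model does not have this file. It does not mean the export failed or that you picked the wrong branch.

This file is a conditional artifact of the CodePredictor. It is generated only when the talker's hidden_size and the code_predictor's hidden_size differ (in that case the projection layer is a real nn.Linear). The check in the export code: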

# tensorrt_edgellm/onnx_export/llm_export.py
proj = getattr(model, 'small_to_mtp_projection', None)
if proj is not None and not isinstance(proj, nn.Identity):   # written to disk only when not Identity
    save_file(..., "small_to_mtp_projection.safetensors")

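For 0.6B-Base, the talker hidden_size is 1024 and the code_predictor hidden_size is also 1024 (they are equal), so this projection layer is an nn.Identity and is skipped at export. That is what the trailing comment # if not Identity on that line of the artifact tree in the docs means. Only a model such as CustomVoice, where the talker (2048) and code_predictor (1024) dimensions differ, actually produces this file.

The later stages also treat the file as optional, so Base works end to end without it:

  • Builder: small_to_mtp_projection.safetensors is listed in cpOptionalFiles, so no error is raised when it is missing;
  • Runtime: when talkerHiddenSize == codePredictorHiddenSize the projection is skipped outright (a No small_to_mtp_projection needed log line is printed).

So for Base you can go straight on to build + inference. The one counterexample to watch for: if the runtime reports small_to_mtp_projection.safetensors required when talkerHiddenSize != codePredictorHiddenSize at startup, the dimensions really do not match and further investigation is needed; otherwise the missing file has no effect.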
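The decision described in this reply can be summarized in a few lines of Python. This is a sketch of the logic only, not the builder or runtime source: `projection_plan` is a hypothetical function, and the messages are the ones quoted in this comment.

```python
def projection_plan(talker_hidden_size, code_predictor_hidden_size, file_present):
    """Decide what to do with small_to_mtp_projection.safetensors.

    Returns "skip" when the hidden sizes match (the projection is an identity),
    "load" when they differ and the file exists, and raises otherwise.
    """
    if talker_hidden_size == code_predictor_hidden_size:
        # Runtime logs "No small_to_mtp_projection needed" in this case.
        return "skip"
    if not file_present:
        raise FileNotFoundError(
            "small_to_mtp_projection.safetensors required when "
            "talkerHiddenSize != codePredictorHiddenSize"
        )
    return "load"


if __name__ == "__main__":
    print(projection_plan(1024, 1024, file_present=False))  # skip
    print(projection_plan(2048, 1024, file_present=True))   # load
```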
