feat https://github.com/NVIDIA/TensorRT-Edge-LLM/issues/87: add CustomVoice language conditioning support for Qwen3-TTS #98
Conversation
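Hello, does your fork support ONNX conversion and TensorRT inference for the qwen3-tts-0.6B-Base model?
These two branches based on v0.7.0 support it; highperf adds streaming and the like. I am now working on top of v0.7.1 and have just got CustomVoice running; I will try porting that back to Base later.
https://github.com/suharvest/TensorRT-Edge-LLM/tree/highperf/runtime-service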
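@zzd1994 Right, exactly. This file is a conditional artifact of the CodePredictor: it is generated only when the talker's hidden_size and the code_predictor's hidden_size differ (in that case the projection layer is a real, non-Identity module).
# tensorrt_edgellm/onnx_export/llm_export.py
proj = getattr(model, 'small_to_mtp_projection', None)
if proj is not None and not isinstance(proj, nn.Identity):  # only a non-Identity projection is written to disk
    save_file(..., "small_to_mtp_projection.safetensors")
For 0.6B-Base, the talker's downstream stages also all treat it as optional, so Base closes the loop correctly without this file.
So for Base you can go straight on to build + inference. The one counterexample to watch for: if the runtime reports at startup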
What does this PR do?
Type of change: new feature
Overview:
docs/source/user_guide/getting_started/supported-models.md at v0.7.1 states that Qwen3-TTS support is limited to the CustomVoice checkpoints (Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice and the 1.7B sibling). However, the C++ runtime currently has no support for those models' language-conditioned prefill. The Python reference at modeling_qwen3_tts.py:2120-2186 emits a 9-row prefix with a language_id codec token injected between codec_think_bos_id and codec_think_eos_id.
assistantPreambleKernel only implements the 8-row (no-language) branch: git grep for language_id / customvoice / codec_language in v0.7.1 returns 0 hits anywhere in cpp/, examples/omni/, or experimental/llm_loader/models/qwen3_tts/. Running the only officially supported Qwen3-TTS family therefore produces unintelligible output, validated end-to-end via an objective ASR loop (see "Validation" below).
This PR adds the missing 9-row branch and threads the language string from input JSON → runtime → kernel.
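The flow described in this PR (pick the language string, resolve it to a codec id, then emit either the 8-row or the 9-row prefix) can be sketched in Python as follows. This is an illustration only, not the C++ runtime or the CUDA kernel: it assumes each prefix row can be represented by a single token id, and every id, the 8-row base layout, and the function names `effective_language`, `resolve_language_id`, and `build_prefix` are hypothetical placeholders. Real values come from the engine config.json.

```python
import logging

logger = logging.getLogger("tts_language_sketch")

# Hypothetical ids for illustration; real ids live in the engine config.json.
THINK_BOS, THINK_EOS = 101, 102
BASE_ROWS = [10, 11, 12, 13, THINK_BOS, THINK_EOS, 14, 15]  # assumed 8-row layout
LANG_MAP = {"chinese": 900, "english": 901}  # stand-in for codec_language_id


def effective_language(top_level, request):
    """A per-request "language" overrides the top-level field."""
    return request.get("language", top_level.get("language"))


def resolve_language_id(language, codec_language_id):
    """Return the codec id for a language string, or -1 if absent or unknown."""
    if not language:
        logger.warning("no language given; using the 8-row no-language prefix")
        return -1
    # Lower-case both sides so "Chinese" and "chinese" resolve identically.
    table = {key.lower(): value for key, value in codec_language_id.items()}
    lang_id = table.get(language.lower(), -1)
    if lang_id < 0:
        logger.warning("unknown language %r; using the 8-row prefix", language)
    return lang_id


def build_prefix(base_rows, think_bos_id, think_eos_id, language_id=-1):
    """Insert language_id between think_bos and think_eos when it is >= 0."""
    rows = list(base_rows)
    if language_id < 0:
        return rows  # default path: layout unchanged
    bos = rows.index(think_bos_id)
    if bos + 1 >= len(rows) or rows[bos + 1] != think_eos_id:
        raise ValueError("expected think_eos to directly follow think_bos")
    rows.insert(bos + 1, language_id)
    return rows


top = {"speaker": "vivian", "language": "chinese"}
lang_id = resolve_language_id(effective_language(top, {}), LANG_MAP)
prefix = build_prefix(BASE_ROWS, THINK_BOS, THINK_EOS, lang_id)
```

Keeping `-1` as the "no language" sentinel means callers that never set `"language"` go through the same code path and get the same 8-row output as before.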
Usage
After this PR the CLI accepts "language" as a top-level field (with per-request override) in the existing inputJson schema. Engines exported through experimental/llm_loader/export_all_cli.py automatically pick up the codec_language_id map from the source talker_config. Example input:
    {
      "speaker": "vivian",
      "language": "chinese",
      "apply_chat_template": true,
      "add_generation_prompt": true,
      "enable_thinking": false,
      "max_audio_length": 24000,
      "requests": [
        {"messages": [{"role": "user", "content": "今天天气真不错"}]}
      ]
    }
Run:
Stderr will now include:
When "language" is absent (or unknown), the runtime falls back to the existing 8-row no-language path with a WARNING log line, and behavior is unchanged from main.
Changes by file
- cpp/kernels/talkerMLPKernels/talkerMLPKernels.{cu,h}: add language_id as a kernel argument (default -1, preserves existing 8-row layout). Emit the 9-row prefix when language_id >= 0, with the language codec id injected between codec_think_bos_id and codec_think_eos_id.
- cpp/runtime/qwen3OmniTTSRuntime.{cpp,h}: parse codec_language_id (string→int, lower-cased keys) and codec_think_id from the engine config.json. Accept language per-request, resolve via the map, log at LOG_INFO. Thread (speakerId, langId, prefixRows) through projectToTalkerInput into the kernel.
- examples/omni/qwen3_tts_inference.cpp: forward top-level "language" from input JSON, allow per-request override.
- experimental/llm_loader/export_all_cli.py: forward codec_language_id from the source HuggingFace talker_config into the engine config.json alongside the existing codec_*_id keys. Without this, the runtime read always sees an empty map and CustomVoice language conditioning is a silent no-op.
codec_language_id map (CustomVoice 12-language set)
The 12 (language → codec token id) pairs come directly from the public Qwen3-TTS CustomVoice checkpoint configs: talker_config.codec_language_id in Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice/config.json.
These are read from the source talker_config at export time rather than hardcoded, so future checkpoints with additional languages work without code changes.
Validation
End-to-end on Jetson Orin NX (sm_87, JetPack 6.x, CUDA 12.6, TRT 10.3), built with -DENABLE_CUTE_DSL=gemm -DCUTE_DSL_ARTIFACT_TAG=sm_87 -DEMBEDDED_TARGET=jetson-orin. ASR is radxa sherpa-onnx SenseVoice model.int8.onnx, language='zh', use_itn=True, applied to the synthesized WAV:
a4f56dcbccc580a58a9bab0e0b727eae | 今天天气真不错。 | 今天天气真不错
7c4e1825ca1bcbaa4a9b61505967e79d | 今天天气真不错哦。 | 今天天气真不错
🚀 Pull Request Checklist
✅ Pre-commit Checks
pip install pre-commit
pre-commit install
pre-commit run --files <6 changed files> passes (clang-format / codespell / yapf / ruff / autoflake / etc. all green after staging clang-format auto-fixes)
🧪 Tests
assistantPreambleKernel exercising the language_id != -1 9-row layout (compare prefix output against the Python reference for speaker=vivian, language=chinese). The kernel-only path does not depend on the Issue C workaround.
language_id defaults to -1 so behavior is unchanged for that path.
📄 Documentation
docs/source/user_guide/getting_started/supported-models.md already lists the CustomVoice checkpoints; this PR makes that documented support actually work. No doc change required for the PR itself.
"language" input field and the 12 supported language keys.
⚙️ Compatibility
language_id defaults to -1 → existing callers that do not set "language" get the same 8-row prefix as before.
Engines without codec_language_id in their config.json (including all engines exported by main today) read an empty map and continue to use the no-language path with a WARNING log line.
Additional Information
scripts/verify_customvoice_tts_radxa_asr.sh: TTS+ASR roundtrip with PASS/FAIL JSON output.
harvestsu/seeed-local-voice-artifacts/orin-nx/qwen3-tts-12hz-0.6b-customvoice/ on Hugging Face.
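For readers who want to reproduce a TTS+ASR roundtrip check without the script, the comparison step might look roughly like the Python sketch below. It is not the logic of verify_customvoice_tts_radxa_asr.sh, whose pass criteria are not shown in this PR: the character-error-rate metric, the `max_cer` threshold, and the helper names `normalize`, `edit_distance`, and `roundtrip_verdict` are all assumptions made for illustration.

```python
import json
import unicodedata


def normalize(text):
    """Drop punctuation, symbols, separators and control characters."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] not in ("P", "S", "Z", "C")
    )


def edit_distance(a, b):
    """Levenshtein distance between two strings, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def roundtrip_verdict(expected, transcript, max_cer=0.2):
    """Compare the input text with the ASR transcript; max_cer is an assumed threshold."""
    ref, hyp = normalize(expected), normalize(transcript)
    cer = edit_distance(ref, hyp) / max(len(ref), 1)
    return {
        "expected": expected,
        "transcript": transcript,
        "cer": round(cer, 3),
        "verdict": "PASS" if cer <= max_cer else "FAIL",
    }


result = roundtrip_verdict("今天天气真不错", "今天天气真不错。")
print(json.dumps(result, ensure_ascii=False))
```

Normalizing before comparison matters here because ASR with inverse text normalization typically adds punctuation that the input text did not contain.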