# SoulX-Duplug

Official code for enabling full-duplex speech interaction with **SoulX-Duplug: Plug-and-Play Streaming State Prediction Module for Realtime Full-Duplex Speech Conversation**.
SoulX-Duplug is a plug-and-play streaming semantic VAD model designed for real-time full-duplex speech conversation. Through text-guided streaming state prediction, SoulX-Duplug enables low-latency, semantically aware streaming dialogue management. In addition to the core model, we also open-source a dialogue system built on top of SoulX-Duplug, which demonstrates the practicality of our model in real-world applications.
To facilitate benchmarking and research in this area, we also release SoulX-Duplug-Eval, a complementary evaluation set for benchmarking full-duplex spoken dialogue systems.
## Demo

Below is a demo of full-duplex speech interaction powered by SoulX-Duplug.
SoulX-Duplug-demo-30fps.mp4
You can also try the online interactive demo here:
👉 https://soulx-duplug.sjtuxlance.com/
## News

- [2026-03-17] Our paper on this project has been published! You can read it here: SoulX-Duplug.
- [2026-03-16] The SoulX-Duplug checkpoint and SoulX-Duplug-Eval are now available on Hugging Face! You can access them directly from SoulX-Duplug-HF.
## Installation

Here are instructions for installing on Linux.

- Clone the repo

  ```sh
  git clone https://github.com/Soul-AILab/SoulX-Duplug.git
  cd SoulX-Duplug
  ```

- Install system dependencies

  ```sh
  sudo apt-get update
  sudo apt-get install ffmpeg sox libsox-dev -y
  ```
- Install Conda: please see https://docs.conda.io/en/latest/miniconda.html

- Create Conda env

  ```sh
  conda create -n soulx-duplug -y python=3.10
  conda activate soulx-duplug
  pip install -r requirements.txt
  # If you are in mainland China, you can set the mirror as follows:
  pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
  ```

## Model Download

Download via `huggingface-cli`:
```sh
# If you are in mainland China, please first set the mirror:
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download Soul-AILab/SoulX-Duplug-0.6B --local-dir pretrained_models
```

Download via Python:

```python
from huggingface_hub import snapshot_download
snapshot_download("Soul-AILab/SoulX-Duplug-0.6B", local_dir="pretrained_models")
```

Download via git clone:
```sh
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/Soul-AILab/SoulX-Duplug-0.6B pretrained_models
```

## Configuration

In `config/config.yaml`:
- For the `infer_config.asr` field:
  - For Chinese, we recommend using `model_name: paraformer`
  - For English, set it to `model_name: sensevoice`, `language: en`
  - For bilingual scenarios, use `model_name: sensevoice`, `language: auto`

- The `max_wait_num` parameter is used as a fallback mechanism to handle potential misclassification of incomplete cases. It defines the number of chunks to wait without additional user speech before the assistant starts responding.

- The `far_field_threshold` parameter sets the threshold for filtering far-field audio in noisy environments.
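Putting the options above together, a hypothetical `config/config.yaml` fragment for a bilingual setup might look like the sketch below. The field layout and the numeric values are illustrative assumptions, not defaults shipped with the repo — consult the actual `config/config.yaml` for the real structure:

```yaml
# Illustrative sketch only; field nesting and values are assumptions.
infer_config:
  asr:
    model_name: sensevoice   # paraformer (Chinese) | sensevoice (English / bilingual)
    language: auto           # en | auto (only meaningful for sensevoice)
  max_wait_num: 3            # assumed value: chunks to wait before fallback turn-taking
  far_field_threshold: 0.5   # assumed value: threshold for filtering far-field audio
```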
## Inference

We provide a streaming inference server for SoulX-Duplug. Start the server:

```sh
bash run.sh
```

For usage (see `example_client.py` for reference), stream your audio query to the server in chunks; the server returns its prediction of the current dialogue state as a dict:
- Format:

  ```python
  {
      "type": "turn_state",
      "session_id": ...,       # session id
      "state": {
          "state": ...,        # predicted state: "idle", "nonidle", "speak", or "blank"
          "text": ...,         # (optional) ASR result of the user's turn
          "asr_segment": ...,  # (optional) ASR result of the current chunk
          "asr_buffer": ...,   # (optional) ASR result of the last 3.2 s
      },
      "ts": time.time(),       # timestamp
  }
  ```
"idle" indicates that the current audio chunk contains no semantic content (e.g., silence, noise, or backchannel).
-
"nonidle" indicates that the current audio chunk contains semantic content. In this case,
"asr_segment"returns the ASR result of the current chunk, and"asr_buffer"returns the ASR result of the accumulated audio over the past 3.2 seconds. -
"speak" indicates that up to the current chunk, the user is judged to have stopped speaking and the utterance is semantically complete, meaning the system can take the turn. In this case,
"asr_segment"returns the ASR result of the current chunk,"asr_buffer"returns the ASR result of the accumulated audio over the past 3.2 seconds, and"text"returns the complete transcription of the user’s utterance for this turn. -
"blank" indicates that the current unprocessed streaming input does not yet fill a full chunk; the server has cached the input and is waiting for the next query.
## Dialogue System

We implemented a demo full-duplex spoken dialogue system based on SoulX-Duplug. See the `dialogue-system` branch for the demo code.
## To-Do List

- Publish the technical report.
- Release evaluation scripts.
## Citation

If you find this work useful in your research, please consider citing:

```bibtex
@misc{yan2026soulxduplug,
      title={SoulX-Duplug: Plug-and-Play Streaming State Prediction Module for Realtime Full-Duplex Speech Conversation},
      author={Ruiqi Yan and Wenxi Chen and Zhanxun Liu and Ziyang Ma and Haopeng Lin and Hanlin Wen and Hanke Xie and Jun Wu and Yuzhe Liang and Yuxiang Zhao and Pengchao Feng and Jiale Qian and Hao Meng and Yuhang Dai and Shunshun Yin and Ming Tao and Lei Xie and Kai Yu and Xinsheng Wang and Xie Chen},
      year={2026},
      eprint={2603.14877},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2603.14877},
}
```

## License

This project is licensed under the Apache 2.0 License.
## Acknowledgements

We thank the following open-source projects for their contributions:
