Chatterbox is a high-performance, OpenAI-compatible Text-to-Speech (TTS) server optimized for AIRI. It features dynamic voice cloning, persona-based text preprocessing, and a specialized queuing system to handle high-concurrency requests.
- OpenAI Compatible: Seamlessly integrates with any OpenAI-compatible TTS client.
- Voice Cloning: Clone any voice by simply placing a clip (6s recommended) in the `voices/` directory.
- Auto-Padding: Automatically lengthens voice clone samples shorter than 5 seconds to satisfy Turbo model requirements.
- Presets (Virtual Voices): Create complex configurations binding a base voice to a mannerism profile and parameters.
- AIRI Management Studio: Manage presets and profiles directly through the integrated UI.
- Hot-Reloading: Automatically detects changes to `profiles.json` and `presets.json` on the fly.
- Mannerisms: Customize character-specific fillers (e.g., `~`), emoticons, and tilde mappings.
- Emotion Tags: Trigger specific sounds like `[laughter]`, `[sigh]`, or `[whisper]`.
- Turbo Mode: Supports high-speed `ChatterboxTurboTTS` for near-real-time synthesis.
- OGG Opus Support: Natively streams high-quality, low-bandwidth audio.
- Python 3.11 Target: Optimized and tested specifically for Python 3.11 to avoid dependency conflicts (e.g., `numpy<2`).
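The Auto-Padding behavior can be sketched in a few lines: append trailing silence until the clip reaches the 5-second minimum the Turbo model needs. `pad_to_min_length` and its parameters are illustrative names, not the project's actual API:

```python
def pad_to_min_length(samples, sample_rate, min_seconds=5.0):
    """Return `samples` extended with trailing silence (zeros) so the
    clip lasts at least `min_seconds` at the given `sample_rate`."""
    required = int(min_seconds * sample_rate)
    if len(samples) >= required:
        return list(samples)          # already long enough, leave as-is
    return list(samples) + [0] * (required - len(samples))
```
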
Important
Python 3.11 is the officially supported version for Chatterbox.
Using Python 3.12 or newer will cause installation failures (specifically with `numpy` and legacy build tools like `imp`). If you have multiple versions installed, `install.bat` will attempt to select `py -3.11` automatically.
Chatterbox now has a first-class management flow inside AIRI. The release is available on the forks via:

- `airi` commit `7c5ec4b1`: Chatterbox studio CRUD UI
- `chatterbox-tts-airi` commit `dd6d484`: preset/profile CRUD endpoints
In practice:
- Presets choose the base voice, TTS mode, exaggeration, and linked profile.
- Profiles define text transformations such as tilde replacements, hmph replacements, and emoticon-to-sound mappings.
- AIRI uses the provider page to create, edit, and delete both without restarting the Chatterbox server.
For the full workflow and editing model, see USAGE.md.
The Chatterbox models (especially the Turbo version) are hosted on Hugging Face. To download them automatically, you must set an environment variable with your Hugging Face Access Token:
- Get a Token: Create a "Read" token at huggingface.co/settings/tokens.
- Set Environment Variable:
  - Windows: `setx HF_TOKEN "your_token_here"` (restart your terminal)
  - Linux/bash: `export HF_TOKEN="your_token_here"`

If you don't want to set a system-wide variable, you can run this once to download the model and start the server:

`set HF_TOKEN=your_token_here && run_server.bat --mannerisms=catgirl --turbo`

Note
You only need the token for the first run while the model is downloading. After that, the files are cached locally and you can run run_server.bat normally.
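A small pre-flight check can confirm the token is visible to the process before the first run. `hf_token_status` is an illustrative helper, not part of the project:

```python
import os

def hf_token_status():
    """Report whether HF_TOKEN is set in the current environment."""
    if not os.environ.get("HF_TOKEN"):
        return "HF_TOKEN not set - first-run model download will fail"
    return "HF_TOKEN set"
```
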
Chatterbox is designed to be flexible across different GPU generations. Depending on your hardware, you should choose the appropriate installation path:
Most users should use the standard stable drivers:
- Environment: CUDA 11.8 or 12.1 (Stable)
- Performance: Reliable, standard synthesis speeds.
If you have an RTX 5090 or similar and encounter kernel errors (e.g., `torchvision::nms` missing):
- Environment: PyTorch Nightly with cu128 support.
- Performance: High-throughput synthesis, especially in Turbo Mode.
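The choice between the two install paths can be expressed as a simple check. In practice the device name would come from `torch.cuda.get_device_name(0)`; `recommended_wheels` is a hypothetical helper:

```python
def recommended_wheels(gpu_name):
    """Map a GPU name to the suggested PyTorch install path:
    RTX 50-series needs the cu128 nightly wheels, everything
    else uses the stable CUDA 11.8 / 12.1 builds."""
    if "RTX 50" in gpu_name or "5090" in gpu_name:
        return "nightly/cu128"
    return "stable (cu118/cu121)"
```
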
Run the provided installation script to automatically create the virtual environment and install base dependencies: `install.bat`

Alternatively, set things up manually:

1. Prepare Environment: It is highly recommended to use the `py` launcher (Windows) to target the correct version:
   `py -3.11 -m venv venv`
   `.\venv\Scripts\activate`
2. Install Core Dependencies:
   `pip install -r requirements.txt`
3. Hardware-Specific Optimization (Crucial):
   - For Standard GPUs: already handled by `install.bat` / `requirements.txt`.
   - For RTX 50-Series: Run these commands inside your activated venv:
     `pip uninstall torch torchvision torchaudio`
     `pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128`
Run the provided batch file to start the FastAPI server:

`run_server.bat --mannerisms=kappybara --turbo`

- `--mannerisms`: Choose a character mannerism from `profiles.json`.
- `--turbo`: Use the high-speed Turbo model (requires higher VRAM).
- `--port`: Default is `8090`.
For quick one-off generation:

`python runner.py zenbara "Hello Phil... [sigh] ~ how are you? ~" --turbo`

Presets are "Virtual Voices" that simplify character management. They map a unique ID (e.g., `Lain (Acting)`) to a base voice and a mannerism profile.
```json
{
  "Lain (Acting)": {
    "voice_file": "lain",
    "mannerism_profile": "wired_goddess",
    "exaggeration": 0.0,
    "ui_expressions": ["[whisper]", "[sigh]", "[gasp]"]
  }
}
```

Manage character-specific logic:
- `tilde`: Mappings for the `~` character (e.g., `nyan` for catgirl, `bro` for kappybara).
- `hmph`: Custom pronunciations for "hmph" variants (e.g., `hahmf`).
- `emoticons`: Regex-based replacements for patterns like `0_0`.
- `narrative`: Character-specific speech settings for `*text*` (rate, volume).
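Applied in code, such a profile might look like the following sketch; `apply_profile` and the exact replacement order are assumptions, not the server's actual implementation:

```python
import re

def apply_profile(text, profile):
    """Rewrite text using a mannerism profile: tilde substitution,
    'hmph' pronunciation, then regex-based emoticon replacement."""
    text = text.replace("~", profile.get("tilde", "~"))
    text = re.sub(r"\bhmph\b", profile.get("hmph", "hmph"),
                  text, flags=re.IGNORECASE)
    for pattern, sound in profile.get("emoticons", {}).items():
        text = re.sub(pattern, sound, text)
    return text
```
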
The server monitors the modification timestamps of `profiles.json` and `presets.json`. You can manually edit these files while the server is running, and the changes will be picked up instantly on the next API request.
- Exact Tags: Use the tags exactly as written in `supported_tags.md`.
- Exaggeration: The default is set to `0.0` (natural baseline). Increase this (e.g., to `1.0` or higher) for more intense emotional delivery.
Environment: PyTorch Nightly cu128
Turbo Model (--turbo)
| Length | Chars | Time (s) | Chars/s |
|---|---|---|---|
| 20 chars | 41 | 2.515 | 16.30 |
| 60 chars | 93 | 3.849 | 24.16 |
| 300 chars | 386 | 17.953 | 21.50 |
Standard Model
| Length | Chars | Time (s) | Chars/s |
|---|---|---|---|
| 20 chars | 41 | 7.062 | 5.81 |
| 60 chars | 93 | 8.071 | 11.52 |
| 300 chars | 386 | 32.237 | 11.97 |
Environment: Standard Desktop Hardware
Standard Model
| Length | Chars | Time (s) | Chars/s |
|---|---|---|---|
| 20 chars | 41 | 3.92 | 10.46 |
| 60 chars | 93 | 6.89 | 13.50 |
| 300 chars | 386 | 19.01 | 20.30 |
Turbo Model (--turbo)
| Length | Chars | Time (s) | Chars/s |
|---|---|---|---|
| 20 chars | 41 | 11.88 | 3.45 |
| 60 chars | 93 | 20.86 | 4.46 |
| 300 chars | 386 | 121.15 | 3.19 |
Warning
Performance Inversion: On the RTX 3070, the Standard model is significantly faster than the Turbo model. This is likely due to VRAM limitations (8GB) or lack of specific kernel optimizations for this architecture in the Turbo model's dependency stack. Stick to the Standard model on this hardware.
Environment: Standard cu118
Turbo Model (--turbo)
| Length | Chars | Time (s) | Chars/s |
|---|---|---|---|
| 20 chars | 41 | 6.054 | 6.77 |
| 60 chars | 93 | 7.921 | 11.74 |
| 300 chars | 386 | 16.499 | 23.40 |
Standard Model
| Length | Chars | Time (s) | Chars/s |
|---|---|---|---|
| 20 chars | 41 | 7.609 | 5.39 |
| 60 chars | 93 | 13.182 | 7.06 |
| 300 chars | 386 | 81.751 | 4.72 |
- `server.py`: The FastAPI wrapper. Supports presets, hot-reloading, and dynamic resolution.
- `presets.json`: JSON store for virtual voice configurations.
- `profiles.json`: JSON store for character mannerisms.
- `install.bat`: Automated setup and dependency installation script.
- `requirements.txt`: Pinned dependencies for environment stability.
- `runner.py`: CLI script for direct OGG Opus generation.
- `benchmark.py`: Script used for gathering generation timings.
- `supported_tags.md`: Reference list of verified sound/emotion tokens.
- `voices/`: Directory for voice cloning source files.
- `padded_voices/`: Directory for lengthened voice clone samples (Auto-Padding).
Clients can discover capabilities and voices via:
- `GET /v1/voices`: Returns a merged list of native voice files and virtual presets.
- `GET /v1/audio/voices`: Alias for the above.
- `GET /chatterbox/capabilities`: Returns available raw voice files, mannerism profiles, and TTS modes.
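The merge performed by `GET /v1/voices` amounts to concatenating native voice files with preset IDs and deduplicating; `merged_voice_list` below is a hypothetical illustration, not the server's code:

```python
def merged_voice_list(native_voices, presets):
    """Combine native voice names with virtual preset IDs,
    preserving order and dropping duplicates."""
    seen, merged = set(), []
    for name in list(native_voices) + list(presets):
        if name not in seen:
            seen.add(name)
            merged.append(name)
    return merged
```
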
`POST /v1/audio/speech`

- `voice`: Can be a native voice (e.g., `ivy`) OR a preset ID (e.g., `Lain (Acting)`).
- `input`: The text to synthesize, supporting tags and profiles.
- `response_format`: `mp3` (OGG Opus) or `wav`.
`nyan! [laughter] ... [sigh] ... chill bro.`