dasilva333/chatterbox-tts-airi

Chatterbox TTS Server

Chatterbox is a high-performance, OpenAI-compatible Text-to-Speech (TTS) server optimized for AIRI. It features dynamic voice cloning, persona-based text preprocessing, and a specialized queuing system to handle high-concurrency requests.

Features

  • OpenAI Compatible: Seamlessly integrates with any OpenAI-compatible TTS client.
  • Voice Cloning: Clone any voice by simply placing a clip (6s recommended) in the voices/ directory.
  • Auto-Padding: Automatically lengthens voice clone samples shorter than 5 seconds to satisfy Turbo model requirements.
  • Presets (Virtual Voices): Create complex configurations binding a base voice to a mannerism profile and parameters.
  • AIRI Management Studio: Manage presets and profiles directly through the integrated UI.
  • Hot-Reloading: Automatically detects changes to profiles.json and presets.json on the fly.
  • Mannerisms: Customize character-specific fillers (e.g., ~), emoticons, and tilde mappings.
  • Emotion Tags: Trigger specific sounds like [laughter], [sigh], or [whisper].
  • Turbo Mode: Supports high-speed ChatterboxTurboTTS for near-real-time synthesis.
  • OGG Opus Support: Natively streams high-quality, low-bandwidth audio.
  • Python 3.11 Target: Optimized and tested specifically for Python 3.11 to avoid dependency conflicts (e.g., numpy<2).
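The Auto-Padding feature above can be sketched in a few lines. This is an illustration only: the real server works on audio files in voices/ and writes results to padded_voices/, and names like MIN_SECONDS and the assumed sample rate are not from the project.

```python
# Sketch of auto-padding: clips shorter than the 5-second Turbo
# minimum are extended with trailing silence (illustrative only).
MIN_SECONDS = 5
SAMPLE_RATE = 24_000  # assumed sample rate for this example


def pad_clip(samples: list[float], sample_rate: int = SAMPLE_RATE) -> list[float]:
    """Append silence until the clip reaches MIN_SECONDS of audio."""
    required = MIN_SECONDS * sample_rate
    if len(samples) >= required:
        return samples  # already long enough; leave untouched
    return samples + [0.0] * (required - len(samples))
```

A 1-second clip at 24 kHz would come back extended to 120,000 samples, with the original audio intact at the front.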

🐍 Supported Python Version

Important

Python 3.11 is the officially supported version for Chatterbox.

Using Python 3.12 or newer will cause installation failures (notably with numpy and packages that still import the removed imp module). If you have multiple versions installed, install.bat will attempt to select py -3.11 automatically.


AIRI Integration

Chatterbox now has a first-class management flow inside AIRI. The feature landed in the following fork commits:

  • airi commit 7c5ec4b1: Chatterbox studio CRUD UI
  • chatterbox-tts-airi commit dd6d484: preset/profile CRUD endpoints

In practice:

  • Presets choose the base voice, TTS mode, exaggeration, and linked profile.
  • Profiles define text transformations such as tilde replacements, hmph replacements, and emoticon-to-sound mappings.
  • AIRI uses the provider page to create, edit, and delete both without restarting the Chatterbox server.

For the full workflow and editing model, see USAGE.md.


Prerequisites (Authentication)

The Chatterbox models (especially the Turbo version) are hosted on Hugging Face. To download them automatically, you must set an environment variable with your Hugging Face Access Token:

  1. Get a Token: Create a "Read" token at huggingface.co/settings/tokens.
  2. Set Environment Variable:
    • Windows: setx HF_TOKEN "your_token_here" (restart your terminal)
    • Linux/bash: export HF_TOKEN="your_token_here"
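As a sanity check, a startup script could fail fast when the token is missing. This is a hypothetical helper, not part of the project; the actual run_server.bat / server.py behavior may differ.

```python
# Hypothetical startup check: fail early with a clear message if
# HF_TOKEN is not set (illustrative; not the project's actual code).
import os


def check_hf_token(env: dict = os.environ) -> str:
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; create a 'Read' token at "
            "huggingface.co/settings/tokens and export it first."
        )
    return token
```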

One-Line Setup (Fast Track)

If you don't want to set a system-wide variable, you can run this once to download the model and start the server:

set HF_TOKEN=your_token_here && run_server.bat --mannerisms=catgirl --turbo

Note

You only need the token for the first run while the model is downloading. After that, the files are cached locally and you can run run_server.bat normally.


Hardware & Compatibility

Chatterbox is designed to be flexible across different GPU generations. Depending on your hardware, you should choose the appropriate installation path:

βš™οΈ Option A: Standard Hardware (RTX 20/30/40-Series)

Most users should use the standard stable drivers:

  • Environment: CUDA 11.8 or 12.1 (Stable)
  • Performance: Reliable, standard synthesis speeds.

🚀 Option B: Next-Gen Hardware (RTX 50-Series / Blackwell)

If you have an RTX 5090 or similar and encounter kernel errors (e.g., torchvision::nms missing):

  • Environment: PyTorch Nightly with cu128 support.
  • Performance: High-throughput synthesis, especially in Turbo Mode.

Installation & Setup

⚡ Automated Setup (Recommended)

Run the provided installation script to automatically create the virtual environment and install base dependencies:

install.bat

🔧 Manual Installation

  1. Prepare Environment: It is highly recommended to use the py launcher (Windows) to target the correct version:

    py -3.11 -m venv venv
    .\venv\Scripts\activate
  2. Install Core Dependencies:

    pip install -r requirements.txt
  3. Hardware-Specific Optimization (Crucial):

    • For Standard GPUs: (Already handled by install.bat / requirements.txt)
    • For RTX 50-Series: Run these commands inside your activated venv:
      pip uninstall torch torchvision torchaudio
      pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128

Usage

Starting the Server

Run the provided batch file to start the FastAPI server:

run_server.bat --mannerisms=kappybara --turbo
  • --mannerisms: Choose a character mannerism from profiles.json.
  • --turbo: Use the high-speed Turbo model (requires higher VRAM).
  • --port: Default is 8090.

CLI Runner

For quick one-off generation:

python runner.py zenbara "Hello Phil... [sigh] ~ how are you? ~" --turbo

Configuration & Customization

Presets (presets.json)

Presets are "Virtual Voices" that simplify character management. They map a unique ID (e.g., Lain (Acting)) to a base voice and a mannerism profile.

{
  "Lain (Acting)": {
    "voice_file": "lain",
    "mannerism_profile": "wired_goddess",
    "exaggeration": 0.0,
    "ui_expressions": ["[whisper]", "[sigh]", "[gasp]"]
  }
}
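Conceptually, a requested voice name is first looked up as a preset ID and falls back to a native voice file. The resolver below is a hedged sketch of that lookup, using the preset fields shown above; it is not the server's actual implementation.

```python
# Sketch of preset ("virtual voice") resolution: a requested voice is
# first treated as a preset ID, falling back to a native voice file.
# (Illustrative only; server.py's real resolution logic may differ.)
def resolve_voice(requested: str, presets: dict) -> dict:
    if requested in presets:
        preset = presets[requested]
        return {
            "voice_file": preset["voice_file"],
            "profile": preset.get("mannerism_profile"),
            "exaggeration": preset.get("exaggeration", 0.0),
        }
    # Not a preset: treat the name as a native voice with no profile.
    return {"voice_file": requested, "profile": None, "exaggeration": 0.0}
```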

Mannerisms (profiles.json)

Manage character-specific logic:

  • tilde: Mappings for the ~ character (e.g., nyan for catgirl, bro for kappybara).
  • hmph: Custom pronunciations for "hmph" variants (e.g., hahmf).
  • emoticons: Regex-based replacements for patterns like 0_0.
  • narrative: Character-specific speech settings for *text* (rate, volume).
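To make the profile fields concrete, here is a hypothetical sketch of the kind of preprocessing they drive: tilde replacement, "hmph" respelling, and emoticon-to-sound mapping. The field names mirror the keys listed above, but the function and the sample catgirl profile are illustrations, not the server's actual code.

```python
# Hypothetical mannerism preprocessing: apply a profile's tilde,
# hmph, and emoticon rules to input text (illustrative only).
import re


def apply_profile(text: str, profile: dict) -> str:
    # Replace the ~ character with the profile's filler word.
    if "tilde" in profile:
        text = text.replace("~", profile["tilde"])
    # Respell "hmph" variants for better pronunciation.
    if "hmph" in profile:
        text = re.sub(r"\bhmph\b", profile["hmph"], text, flags=re.IGNORECASE)
    # Map emoticon patterns (regex keys) to sound tags.
    for pattern, sound in profile.get("emoticons", {}).items():
        text = re.sub(pattern, sound, text)
    return text


catgirl = {"tilde": "nyan", "hmph": "hahmf", "emoticons": {r"0_0": "[gasp]"}}
```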

Hot-Reloading

The server monitors the modification timestamps of profiles.json and presets.json. You can manually edit these files while the server is running, and the changes will be picked up instantly on the next API request.
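The mtime-based reload described above can be sketched as a small wrapper that re-reads the file whenever its timestamp changes. This is a minimal illustration of the technique, not server.py's actual implementation.

```python
# Minimal sketch of mtime-based hot-reloading: re-read a JSON file
# only when its modification timestamp changes (illustrative only).
import json
import os


class HotReloadedJSON:
    def __init__(self, path: str):
        self.path = path
        self._mtime = None
        self._data = {}

    def get(self) -> dict:
        mtime = os.path.getmtime(self.path)
        if mtime != self._mtime:  # file changed since the last read
            with open(self.path, encoding="utf-8") as f:
                self._data = json.load(f)
            self._mtime = mtime
        return self._data
```

Calling `get()` on every API request gives the "picked up instantly" behavior without a file watcher.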

Emotion & Expressiveness Tips

  • Exact Tags: Use the tags exactly as written in supported_tags.md.
  • Exaggeration: The default is set to 0.0 (natural baseline). Increase this (e.g., to 1.0 or higher) for more intense emotional delivery.

Benchmark Results (CUDA)

RTX 5090 (Blackwell)

Environment: PyTorch Nightly cu128

Turbo Model (--turbo)

Length      Chars   Time (s)   Chars/s
20 chars     41      2.515      16.30
60 chars     93      3.849      24.16
300 chars   386     17.953      21.50

Standard Model

Length      Chars   Time (s)   Chars/s
20 chars     41      7.062       5.81
60 chars     93      8.071      11.52
300 chars   386     32.237      11.97

RTX 3070 (Mid-Range Baseline)

Environment: Standard Desktop Hardware

Standard Model

Length      Chars   Time (s)   Chars/s
20 chars     41      3.92       10.46
60 chars     93      6.89       13.50
300 chars   386     19.01       20.30

Turbo Model (--turbo)

Length      Chars   Time (s)   Chars/s
20 chars     41     11.88        3.45
60 chars     93     20.86        4.46
300 chars   386    121.15        3.19

Warning

Performance Inversion: On the RTX 3070, the Standard model is significantly faster than the Turbo model. This is likely due to VRAM limitations (8GB) or lack of specific kernel optimizations for this architecture in the Turbo model's dependency stack. Stick to the Standard model on this hardware.


RTX 4070 Mobile (Baseline)

Environment: Standard cu118

Turbo Model (--turbo)

Length      Chars   Time (s)   Chars/s
20 chars     41      6.054       6.77
60 chars     93      7.921      11.74
300 chars   386     16.499      23.40

Standard Model

Length      Chars   Time (s)   Chars/s
20 chars     41      7.609       5.39
60 chars     93     13.182       7.06
300 chars   386     81.751       4.72

Project Structure

  • server.py: The FastAPI wrapper. Supports presets, hot-reloading, and dynamic resolution.
  • presets.json: JSON store for virtual voice configurations.
  • profiles.json: JSON store for character mannerisms.
  • install.bat: Automated setup and dependency installation script.
  • requirements.txt: Pinned dependencies for environment stability.
  • runner.py: CLI script for direct OGG Opus generation.
  • benchmark.py: Script used for gathering generation timings.
  • supported_tags.md: Reference list of verified sound/emotion tokens.
  • voices/: Directory for voice cloning source files.
  • padded_voices/: Directory for lengthened voice clone samples (Auto-Padding).

API & Discovery

Clients can discover capabilities and voices via:

  • GET /v1/voices: Returns a merged list of native voice files and virtual presets.
  • GET /v1/audio/voices: Alias for the above.
  • GET /chatterbox/capabilities: Returns available raw voice files, mannerism profiles, and TTS modes.
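The "merged list" returned by /v1/voices can be sketched as combining the files in voices/ with the preset IDs. The function below is an illustration of that merge under assumed conventions (.wav files, filename stems as voice names); server.py's actual logic may differ.

```python
# Sketch of the /v1/voices merge: native voice files plus virtual
# preset IDs in one list (illustrative; assumes .wav sources).
from pathlib import Path


def list_voices(voices_dir: str, presets: dict) -> list[str]:
    native = sorted(p.stem for p in Path(voices_dir).glob("*.wav"))
    # Preset IDs are exposed alongside native voices as "virtual voices".
    return native + sorted(presets.keys())
```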

Synthesis (OpenAI Compatible)

POST /v1/audio/speech

  • voice: Can be a native voice (e.g., ivy) OR a preset ID (e.g., Lain (Acting)).
  • input: The text to synthesize, supporting tags and profiles.
  • response_format: mp3 (OGG Opus) or wav.

Example of profile-processed text: nyan! [laughter] ... [sigh] ... chill bro.
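A request body using the parameters above can be built like this. The field names follow the OpenAI-style parameters listed here, but the "model" value is an assumption (the server may ignore it); POST the resulting JSON to http://localhost:8090/v1/audio/speech with any HTTP client.

```python
# Hedged sketch of an OpenAI-compatible speech request payload for
# this server (field set per the docs above; "model" is assumed).
def build_speech_request(voice: str, text: str, response_format: str = "mp3") -> dict:
    return {
        "model": "chatterbox",  # assumed model name; may be ignored
        "voice": voice,  # native voice (e.g., "ivy") or preset ID
        "input": text,  # may include tags like [sigh] and ~ fillers
        "response_format": response_format,  # "mp3" (OGG Opus) or "wav"
    }
```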
