Qwen3 TTS - Rust CLI tools

A Rust implementation of the Qwen3 Text-to-Speech (TTS) model inference. Provides three cross-platform CLI tools suitable for agentic skills for AI agents and bots.

tts — generate speech from text with named speaker voices
voice_clone — clone a voice from reference audio
api_server — OpenAI-compatible HTTP API server

Supports two backends: libtorch (via the tch crate, cross-platform with optional CUDA) and MLX (Apple Silicon native via Metal GPU).

Learn more:

A Rust implementation / CLI for Qwen3's ASR (Automatic Speech Recognition or Speech-to-Text) models
An OpenAI compatible API server for audio / speech
An OpenClaw SKILL for voice generation. Copy and Paste to your lobster to install it

Quick Start

Install binaries, models, and reference audio for your platform:

curl -sSf https://raw.githubusercontent.com/second-state/qwen3_tts_rs/main/install.sh | bash
cd qwen3_tts_rs

The installer detects your OS, CPU, and NVIDIA GPU (if present), then sets up everything in ./qwen3_tts_rs/.

Text-to-Speech

Generate speech with a named speaker using the CustomVoice model:

./tts models/Qwen3-TTS-12Hz-0.6B-CustomVoice "Hello world, this is a test." Vivian english
# Output: output.wav (24 kHz)

Voice Cloning

Clone a voice from reference audio using the Base model (ICL mode):

./voice_clone models/Qwen3-TTS-12Hz-0.6B-Base reference_audio/trump.wav \
  "Hello, this is a voice cloning test." english \
  "Angered and appalled millions of Americans across the political spectrum"
# Output: output_voice_clone.wav (24 kHz)

API Server

Start the OpenAI-compatible API server with the CustomVoice model:

./api_server models/Qwen3-TTS-12Hz-0.6B-CustomVoice --port 8080

Then call the endpoint:

curl -X POST http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello world!", "voice": "alloy"}' \
  -o output.wav

Reference

`tts` — Text-to-Speech

tts <model_path> [text] [speaker] [language] [instruction]

Argument	Default	Description
`model_path`	(required)	Path to model directory
`text`	"Hello! This is a test..."	Text to synthesize (max ~4096 chars)
`speaker`	Vivian	Speaker name (see below)
`language`	english	Language: `english`, `chinese`, `japanese`, `korean`
`instruction`	(empty)	Voice style instruction (1.7B models only)

Output: output.wav (24 kHz, 16-bit PCM)

Available speakers (CustomVoice models): Vivian, Serena, Ryan, Aiden, Uncle_fu, Ono_anna, Sohee, Eric, Dylan

Instruction examples (1.7B CustomVoice only):

"Speak in an urgent and excited voice"
"Speak happily and joyfully"
"Speak slowly and calmly"
"Speak in a whisper"

`voice_clone` — Voice Cloning

voice_clone <model_path> <ref_audio> [text] [language] [ref_text]

Argument	Default	Description
`model_path`	(required)	Path to model directory
`ref_audio`	(required)	Path to reference WAV file
`text`	"Hello! This is a test..."	Text to synthesize
`language`	english	Language
`ref_text`	(none)	Transcript of reference audio (enables ICL mode, higher quality)

Output: output_voice_clone.wav (24 kHz, 16-bit PCM)

Preparing reference audio: Must be mono 24 kHz 16-bit WAV. Convert with ffmpeg:

ffmpeg -i input.m4a -ac 1 -ar 24000 -sample_fmt s16 reference.wav

Example:

./voice_clone models/Qwen3-TTS-12Hz-0.6B-Base reference_audio/trump.wav \
  "Hello, this is a voice cloning test." english \
  "Angered and appalled millions of Americans across the political spectrum"

`api_server` — OpenAI-Compatible API

api_server <model_path> [--host 127.0.0.1] [--port 8080]

Option	Default	Description
`model_path`	(required)	Path to model directory
`--host`	127.0.0.1	Bind address
`--port`	8080	Listen port

Endpoints:

Method	Path	Description
POST	`/v1/audio/speech`	Generate speech (OpenAI-compatible)
GET	`/v1/models`	List available models
GET	`/health`	Health check

API Request: `POST /v1/audio/speech`

{
  "input": "Text to synthesize",
  "voice": "alloy",
  "model": "qwen3-tts",
  "response_format": "wav",
  "speed": 1.0,
  "stream": false,
  "language": "english",
  "instructions": "Speak urgently",
  "audio_sample": "<base64-encoded WAV>",
  "audio_sample_text": "Transcript of the reference audio"
}

Field	Type	Default	Description
`input`	string	(required)	Text to synthesize (max 4096 chars)
`voice`	string	`"alloy"`	OpenAI name or Qwen3 speaker name (see mapping below)
`model`	string	—	Accepted for compatibility, ignored
`response_format`	string	`"wav"`	`"wav"`, `"pcm"`, `"mp3"`, `"flac"`, `"ogg"`, or `"opus"`
`speed`	float	1.0	Speed multiplier (0.25–4.0)
`stream`	bool	false	Enable SSE streaming (requires `"pcm"`)
`language`	string	`"english"`	`english`, `chinese`, `japanese`, `korean`, `auto`
`instructions`	string	—	Voice style instruction (1.7B models only)
`audio_sample`	string	—	Base64-encoded reference WAV for voice cloning
`audio_sample_text`	string	—	Transcript of reference audio (required with `audio_sample`)

Voice name mapping (OpenAI → Qwen3):

OpenAI	Qwen3
alloy	serena
echo	ryan
fable	vivian
onyx	eric
nova	ono_anna
shimmer	sohee

You can also pass Qwen3 speaker names directly (e.g., "voice": "vivian").

Streaming: When stream: true and response_format: "pcm", the server returns Server-Sent Events with base64-encoded PCM chunks:

data: {"type":"speech.audio.delta","delta":"<base64 PCM>"}
data: {"type":"speech.audio.done"}

Voice cloning via API:

# Encode reference audio as base64
REF_B64=$(base64 < reference.wav)

curl -X POST http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d "{
    \"input\": \"Hello from a cloned voice.\",
    \"voice\": \"alloy\",
    \"audio_sample\": \"$REF_B64\",
    \"audio_sample_text\": \"Transcript of the reference audio\"
  }" -o cloned.wav

Build from Source

macOS (MLX backend)

Requires Apple Silicon Mac, Xcode, and CMake.

brew install cmake
git clone https://github.com/second-state/qwen3_tts_rs.git
cd qwen3_tts_rs
git submodule update --init --recursive
cargo build --release --no-default-features --features mlx

Linux (libtorch backend)

1. Download libtorch from libtorch-releases:

# Linux x86_64 (CPU)
curl -LO https://github.com/second-state/libtorch-releases/releases/download/v2.7.1/libtorch-cxx11-abi-x86_64-2.7.1.tar.gz
tar xzf libtorch-cxx11-abi-x86_64-2.7.1.tar.gz

# Linux x86_64 (CUDA 12.6)
curl -LO https://github.com/second-state/libtorch-releases/releases/download/v2.7.1/libtorch-cxx11-abi-x86_64-cuda12.6-2.7.1.tar.gz
tar xzf libtorch-cxx11-abi-x86_64-cuda12.6-2.7.1.tar.gz

# Linux ARM64 (CPU)
curl -LO https://github.com/second-state/libtorch-releases/releases/download/v2.7.1/libtorch-cxx11-abi-aarch64-2.7.1.tar.gz
tar xzf libtorch-cxx11-abi-aarch64-2.7.1.tar.gz

# Linux ARM64 (CUDA 12.6 / Jetson)
curl -LO https://github.com/second-state/libtorch-releases/releases/download/v2.7.1/libtorch-cxx11-abi-aarch64-cuda12.6-2.7.1.tar.gz
tar xzf libtorch-cxx11-abi-aarch64-cuda12.6-2.7.1.tar.gz

2. Set environment and build:

export LIBTORCH=$(pwd)/libtorch
export LIBTORCH_BYPASS_VERSION_CHECK=1

git clone https://github.com/second-state/qwen3_tts_rs.git
cd qwen3_tts_rs
cargo build --release

Alternatively, use pip-installed PyTorch instead of downloading libtorch:

pip install torch==2.7.1
export LIBTORCH_USE_PYTORCH=1
export LD_LIBRARY_PATH=$(python3 -c "import torch; print(torch.__path__[0])")/lib:$LD_LIBRARY_PATH

Download models and generate tokenizer

After building, download models and generate tokenizer.json for each:

pip install huggingface_hub transformers

huggingface-cli download Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice --local-dir models/Qwen3-TTS-12Hz-0.6B-CustomVoice

python3 -c "
from transformers import AutoTokenizer
for model in ['Qwen3-TTS-12Hz-0.6B-CustomVoice', 'Qwen3-TTS-12Hz-0.6B-Base', 'Qwen3-TTS-12Hz-1.7B-CustomVoice']:
    path = f'models/{model}'
    try:
        tok = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
        tok.backend_tokenizer.save(f'{path}/tokenizer.json')
        print(f'Saved {path}/tokenizer.json')
    except Exception as e:
        print(f'Skipped {model}: {e}')
"

Binaries are in target/release/: tts, voice_clone, api_server.

Rust library usage

Add to your Cargo.toml:

[dependencies]
qwen3-tts-rs = "0.2"
# Or for MLX backend:
# qwen3-tts-rs = { version = "0.2", default-features = false, features = ["mlx"] }

See the API documentation on docs.rs for library usage examples.

Performance (Apple M4 Mac, MLX backend)

Test sentences: ~15–20 words in English ("The quick brown fox..." / "Scientists have discovered...") and Chinese. RTF = Real-Time Factor (wall time / audio duration). Lower is better; < 1.0 means faster than real-time.

0.6B CustomVoice

CLI (`tts` / `voice_clone`)

Test	Speaker	Language	Audio	Wall Time	RTF
Preset voice	Vivian	English	5.92s	10.93s	1.85x
Preset voice	Ryan	English	8.16s	14.15s	1.73x
Preset voice	Vivian	Chinese	6.64s	11.26s	1.70x
Voice clone (ICL)	ref audio	English	7.04s	16.77s	2.38x

API server (after warmup)

Test	Voice	Mode	Audio	Wall Time	RTF
Non-streaming WAV	alloy (serena)	full	6.40s	9.90s	1.55x
Non-streaming WAV	echo (ryan)	full	8.00s	12.70s	1.59x
Streaming PCM	alloy (serena)	stream	~6.4s	10.22s	~1.60x
Voice clone WAV	alloy + ref	full	9.04s	20.11s	2.22x

1.7B CustomVoice

CLI (`tts`)

Test	Speaker	Language	Audio	Wall Time	RTF
Preset voice	Vivian	English	6.24s	18.92s	3.03x
Preset voice	Ryan	English	8.64s	27.50s	3.18x
Preset voice	Vivian	Chinese	6.24s	18.31s	2.93x
Preset + instruction	Vivian	English	8.80s	29.97s	3.41x

API server (after warmup)

Test	Voice	Mode	Audio	Wall Time	RTF
Non-streaming WAV	alloy (serena)	full	5.52s	15.14s	2.74x
Non-streaming WAV	echo (ryan)	full	8.64s	26.31s	3.05x
Streaming PCM	alloy (serena)	stream	8.16s	19.79s	2.43x

0.6B vs 1.7B comparison

Metric	0.6B avg RTF	1.7B avg RTF	Slowdown
CLI preset voice	1.76x	3.05x	~1.7x
API non-streaming	1.57x	2.90x	~1.8x

Architecture

Text → Tokenizer → Dual-stream Embeddings → TalkerModel (28-layer Transformer)
                                                    ↓
                                              codec_head → Code 0
                                                    ↓
                                        CodePredictor (5-layer Transformer) → Codes 1-15
                                                    ↓
                                              Vocoder → 24kHz Waveform

License

Apache-2.0

Credits

Based on the original Python implementation by the Alibaba Qwen team.

Name		Name	Last commit message	Last commit date
Latest commit History 114 Commits
.github/workflows		.github/workflows
demos		demos
examples		examples
mlx-c @ a1290d2		mlx-c @ a1290d2
reference_audio		reference_audio
skills		skills
src		src
.gitignore		.gitignore
.gitmodules		.gitmodules
CLAUDE.md		CLAUDE.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
README.md		README.md
build.rs		build.rs
install.sh		install.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Qwen3 TTS - Rust CLI tools

Quick Start

Text-to-Speech

Voice Cloning

API Server

Reference

`tts` — Text-to-Speech

`voice_clone` — Voice Cloning

`api_server` — OpenAI-Compatible API

API Request: `POST /v1/audio/speech`

Build from Source

macOS (MLX backend)

Linux (libtorch backend)

Download models and generate tokenizer

Rust library usage

Performance (Apple M4 Mac, MLX backend)

0.6B CustomVoice

CLI (`tts` / `voice_clone`)

API server (after warmup)

1.7B CustomVoice

CLI (`tts`)

API server (after warmup)

0.6B vs 1.7B comparison

Architecture

License

Credits

About

Uh oh!

Releases 9

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Qwen3 TTS - Rust CLI tools

Quick Start

Text-to-Speech

Voice Cloning

API Server

Reference

tts — Text-to-Speech

voice_clone — Voice Cloning

api_server — OpenAI-Compatible API

API Request: POST /v1/audio/speech

Build from Source

macOS (MLX backend)

Linux (libtorch backend)

Download models and generate tokenizer

Rust library usage

Performance (Apple M4 Mac, MLX backend)

0.6B CustomVoice

CLI (tts / voice_clone)

API server (after warmup)

1.7B CustomVoice

CLI (tts)

API server (after warmup)

0.6B vs 1.7B comparison

Architecture

License

Credits

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 9

Contributors

Uh oh!

Languages

`tts` — Text-to-Speech

`voice_clone` — Voice Cloning

`api_server` — OpenAI-Compatible API

API Request: `POST /v1/audio/speech`

CLI (`tts` / `voice_clone`)

CLI (`tts`)