@@ -88,8 +88,43 @@ python export_voxtral_rt.py \
 |---------|---------|-----------|--------------|
 | `xnnpack` | ✓ | ✓ | `4w`, `8w`, `8da4w`, `8da8w` |
 | `metal` | ✓ | ✓ | none (fp32) or `fpa4w` (Metal-specific 4-bit) |
+| `cuda` | ✓ | ✓ | `4w`, `8w` |
 
-Metal backend provides Apple GPU acceleration.
+Metal backend provides Apple GPU acceleration. CUDA backend provides NVIDIA GPU
+acceleration via AOTInductor.
+
+#### CUDA export examples
+
+Offline with int4 quantization:
+
+```bash
+python export_voxtral_rt.py \
+  --model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \
+  --backend cuda \
+  --dtype bf16 \
+  --output-dir ./voxtral_rt_exports \
+  --qlinear-encoder 4w \
+  --qlinear-encoder-packing-format tile_packed_to_4d \
+  --qlinear 4w \
+  --qlinear-packing-format tile_packed_to_4d \
+  --qembedding 8w
+```
+
+Streaming with int4 quantization:
+
+```bash
+python export_voxtral_rt.py \
+  --model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \
+  --backend cuda \
+  --dtype bf16 \
+  --streaming \
+  --output-dir ./voxtral_rt_exports \
+  --qlinear-encoder 4w \
+  --qlinear-encoder-packing-format tile_packed_to_4d \
+  --qlinear 4w \
+  --qlinear-packing-format tile_packed_to_4d \
+  --qembedding 8w
+```
 
 #### Metal export examples
 
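To make the quantization options above concrete, here is a rough, illustrative Python sketch of why weight-only 4-bit (`4w`) group quantization shrinks a linear layer roughly 7x versus fp32. It assumes one 16-bit scale per group at the exporter's default group size of 32, and ignores zero-points and padding; `quantized_linear_bytes` is a hypothetical helper for estimation only, not part of the exporter.

```python
def quantized_linear_bytes(n_weights: int, bits: int = 4, group_size: int = 32,
                           scale_bytes: int = 2) -> int:
    """Approximate storage for group-quantized weights: packed low-bit
    values plus one scale per group (zero-points and padding ignored)."""
    packed = n_weights * bits // 8
    scales = (n_weights // group_size) * scale_bytes
    return packed + scales

fp32_bytes = 1024 * 1024 * 4                  # a 1024x1024 linear layer in fp32
q4_bytes = quantized_linear_bytes(1024 * 1024)
print(fp32_bytes, q4_bytes, round(fp32_bytes / q4_bytes, 1))
```

The per-group scales are why the ratio lands near 7.1x rather than a clean 8x.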
@@ -133,14 +168,17 @@ EXECUTORCH_BUILD_KERNELS_TORCHAO=1 TORCHAO_BUILD_EXPERIMENTAL_MPS=1 ./install_ex
 | Flag | Default | Description |
 |------|---------|-------------|
 | `--model-path` | (required) | Directory with `params.json` + `consolidated.safetensors` |
-| `--backend` | `xnnpack` | `xnnpack`, `metal`, or `portable` |
+| `--backend` | `xnnpack` | `xnnpack`, `metal`, `cuda`, or `portable` |
+| `--dtype` | `fp32` | Model dtype: `fp32` or `bf16` |
 | `--output-dir` | `./voxtral_rt_exports` | Output directory |
 | `--max-seq-len` | `4096` | KV cache length |
 | `--delay-tokens` | `6` | Transcription delay in tokens (6 = 480ms) |
 | `--qlinear` | (none) | Decoder linear layer quantization (`4w`, `8w`, `8da4w`, `8da8w`, `fpa4w`) |
 | `--qlinear-group-size` | `32` | Group size for decoder linear quantization |
+| `--qlinear-packing-format` | (none) | Packing format for decoder 4w quantization (`tile_packed_to_4d` for CUDA) |
 | `--qlinear-encoder` | (none) | Encoder linear layer quantization (`4w`, `8w`, `8da4w`, `8da8w`, `fpa4w`) |
 | `--qlinear-encoder-group-size` | `32` | Group size for encoder linear quantization |
+| `--qlinear-encoder-packing-format` | (none) | Packing format for encoder 4w quantization (`tile_packed_to_4d` for CUDA) |
 | `--qembedding` | (none) | Embedding layer quantization (`8w`) |
 | `--streaming` | off | Export streaming encoder with KV cache |
 | `--max-enc-len` | `750` | Encoder sliding window size (streaming only) |
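The `--delay-tokens` flag maps linearly to added latency: the table's "6 = 480ms" figure implies 80 ms of audio per token. A minimal sketch of that arithmetic, where `MS_PER_TOKEN` and `delay_ms` are illustrative names derived only from that figure, not exporter code:

```python
MS_PER_TOKEN = 80  # implied by the flag table: 6 tokens = 480 ms

def delay_ms(delay_tokens: int) -> int:
    """Transcription latency, in milliseconds, added by --delay-tokens."""
    return delay_tokens * MS_PER_TOKEN

print(delay_ms(6))   # the default -> 480
```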
@@ -164,6 +202,15 @@ make voxtral_realtime-cpu
 This builds ExecuTorch core libraries with XNNPACK, then the runner binary
 at `cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner`.
 
+### CUDA (NVIDIA GPU)
+
+```bash
+make voxtral_realtime-cuda
+```
+
+This builds ExecuTorch with CUDA backend support. The runner binary is at
+the same path as above. Requires an NVIDIA GPU with the CUDA toolkit installed.
+
 ### Metal (Apple GPU)
 
 ```bash
@@ -180,10 +227,22 @@ The runner requires:
 - `tekken.json` — tokenizer from the model weights directory
 - `preprocessor.pte` — mel spectrogram preprocessor (see [Preprocessor](#preprocessor))
 - A 16kHz mono WAV audio file (or live audio via `--mic`)
+- For CUDA: `aoti_cuda_blob.ptd` — delegate data file (pass via `--data_path`)
+
+```bash
+cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner \
+  --model_path voxtral_rt_exports/model.pte \
+  --tokenizer_path ~/models/Voxtral-Mini-4B-Realtime-2602/tekken.json \
+  --preprocessor_path voxtral_rt_exports/preprocessor.pte \
+  --audio_path input.wav
+```
+
+For CUDA, include the `.ptd` data file:
 
 ```bash
 cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner \
   --model_path voxtral_rt_exports/model.pte \
+  --data_path voxtral_rt_exports/aoti_cuda_blob.ptd \
   --tokenizer_path ~/models/Voxtral-Mini-4B-Realtime-2602/tekken.json \
   --preprocessor_path voxtral_rt_exports/preprocessor.pte \
   --audio_path input.wav
@@ -218,9 +277,13 @@ ffmpeg -f avfoundation -i ":0" -ar 16000 -ac 1 -f f32le -nostats -loglevel error
 
 Ctrl+C stops recording and flushes remaining text.
 
+**CUDA:** Add `--data_path voxtral_rt_exports/aoti_cuda_blob.ptd` to all
+run commands above when using the CUDA backend.
+
 | Flag | Default | Description |
 |------|---------|-------------|
 | `--model_path` | `model.pte` | Path to exported model |
+| `--data_path` | (none) | Path to delegate data file (`.ptd`, required for CUDA) |
 | `--tokenizer_path` | `tekken.json` | Path to Tekken tokenizer |
 | `--preprocessor_path` | (none) | Path to mel preprocessor `.pte` |
 | `--audio_path` | (none) | Path to 16kHz mono WAV file |
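The ffmpeg pipe used for `--mic` emits raw f32le samples at 16 kHz mono, which works out to 64 kB per second of audio. As a sketch of what a consumer of that stream does with each chunk (this is not runner code; `decode_f32le` is a hypothetical helper):

```python
import struct

SAMPLE_RATE = 16_000      # matches ffmpeg's -ar 16000
BYTES_PER_SAMPLE = 4      # f32le: 32-bit little-endian floats

def decode_f32le(chunk: bytes) -> list[float]:
    """Unpack a raw f32le chunk (as piped by ffmpeg) into float samples,
    dropping any trailing partial sample."""
    n = len(chunk) // BYTES_PER_SAMPLE
    return list(struct.unpack(f"<{n}f", chunk[: n * BYTES_PER_SAMPLE]))

print(SAMPLE_RATE * BYTES_PER_SAMPLE)                     # bytes/second -> 64000
print(decode_f32le(struct.pack("<4f", 0.0, 0.5, -0.5, 1.0)))
```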