Conversation
- Added methods for encoding audio to latent representations and normalizing audio inputs.
- Implemented instruction generation based on task types, improving flexibility for various audio tasks.
- Introduced new task type constants and instruction templates in the configuration file.
- Updated imports and module exports to include the new constants and configurations.

These changes enhance the model's capabilities for audio processing and task management.
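The task-type-driven instruction generation described above can be sketched as follows. The constant names, templates, and function name are illustrative assumptions, not the actual contents of the configuration file:

```python
# Hypothetical sketch of selecting an instruction template by task type.
# TASK_INSTRUCTIONS and build_instruction are illustrative names; the
# real constants live in the configuration file mentioned above.
TASK_INSTRUCTIONS = {
    "text2music": "Generate music matching this description: {prompt}",
    "cover": "Re-render this audio in the described style: {prompt}",
}

def build_instruction(task_type: str, prompt: str) -> str:
    template = TASK_INSTRUCTIONS.get(task_type)
    if template is None:
        raise ValueError(f"unknown task type: {task_type}")
    return template.format(prompt=prompt)
```

Keeping the templates in one mapping means adding a new task type is a one-line config change rather than a new code path.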
- Introduced a new section in the README for music generation, detailing the ACE-Step model for text-to-music tasks.
- Added CLI and Python API examples for generating music from text descriptions and enhancing existing audio tracks.
- Updated the `generate.py` script to accept additional generation parameters via a JSON string.
- Enhanced the ACE-Step model loading logic to check for weights in both turbo and root directories.

These changes expand the functionality of the MLX-Audio framework to include music generation features.
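Passing generation parameters as a JSON string, as `generate.py` now does, can be sketched like this. The specific keys (`num_steps`, `use_lm`) appear in later commits of this PR, but the exact schema accepted by the script is an assumption:

```python
import json

# Sketch: a --gen-kwargs style JSON string parsed into keyword
# arguments for the generation call. Key names are illustrative.
raw = '{"num_steps": 20, "use_lm": true}'
gen_kwargs = json.loads(raw)

# The resulting dict can be splatted into a generate call, e.g.
# model.generate(prompt, **gen_kwargs)  (call shape is hypothetical)
```

A JSON string keeps the CLI stable: new generation parameters need no new argparse flags.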
optimize ace_step: completely remove torch runtime dependency + unified conversion pipeline
Signed-off-by: Prince Canuma <prince.gdt@gmail.com>
- Converter: export all weight prefixes (tokenizer, detokenizer, null_condition_emb), not just decoder/encoder
- Converter: keep the decoder. prefix instead of renaming to dit.
- Converter: use underscore naming for decoder conv params (proj_in_weight) to match DiTModel bare parameters
- Sanitize: add a dit. -> decoder. mapping for legacy weights
- Sanitize: scope conv param renaming to decoder only, to avoid breaking detokenizer.proj_out (nn.Linear)
- Add "ace" to MODEL_REMAPPING for HF repo name resolution
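The sanitize rules above can be sketched as a key-remapping pass. The function name is illustrative; the prefix and parameter names (`dit.`, `decoder.`, `proj_in`) come from the commit message:

```python
# Hedged sketch of the weight sanitization described above, not the
# actual mlx_audio implementation.
def sanitize_weights(weights: dict) -> dict:
    out = {}
    for key, value in weights.items():
        # Legacy checkpoints used a dit. prefix; map it back to decoder.
        if key.startswith("dit."):
            key = "decoder." + key[len("dit."):]
        # Underscore-rename conv params, scoped to decoder. only so that
        # detokenizer.proj_out (an nn.Linear) is left untouched.
        if key.startswith("decoder.") and key.endswith("proj_in.weight"):
            key = key[: -len("proj_in.weight")] + "proj_in_weight"
        out[key] = value
    return out
```

Scoping the rename by prefix is what prevents the detokenizer's identically named linear weight from being mangled.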
The turbo model requires 5Hz LM hints to generate music. Without LM-generated audio codes guiding the DiT, the diffusion simply denoises back to silence. Previously, LM hints were only generated for cover tasks.
The turbo model produces silence without LM hints, so enable them by default for a working out-of-the-box experience.
Add explicit language section to the LM prompt template so the 5Hz planner model better respects the vocal_language parameter. The turbo HF model uses Qwen3 word embeddings for lyrics (not BPE phoneme tokens), which limits vocal synthesis to the model's learned capabilities.
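An explicit language section in a prompt template might look like the sketch below. The section headings and wording are assumptions for illustration; the actual 5Hz planner prompt is not reproduced here:

```python
# Hypothetical prompt template with a dedicated language section so the
# planner sees vocal_language explicitly. Exact wording is an assumption.
PROMPT_TEMPLATE = (
    "# Language\n{vocal_language}\n\n"
    "# Lyrics\n{lyrics}\n"
)

prompt = PROMPT_TEMPLATE.format(vocal_language="en", lyrics="la la la")
```

Making the language its own labeled field, rather than burying it in free text, gives the LM a consistent place to read the parameter from.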
8 steps produces good instrumentals but vocals need more diffusion steps to resolve clearly. 20 steps still runs well under real-time (~8s diffusion for 30s of audio).
- Cache sliding window mask in DiT instead of recreating per-layer (240x)
- Pass None instead of all-zeros no-op self-attention mask
- Remove dead code: test_model(), the omega_scale and silence_latent params, unused nested_weights, TASK_TYPES_TURBO/BASE, VAEConfig, TextEncoderConfig
- Remove redundant hidden_size attrs stored but never read (3 classes)
- Fix fragile .replace(".pt", ".npy") path handling
- Strip narrating comments that restate what code already expresses
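The mask-caching change can be sketched as building the sliding-window mask once at the model level and handing the same array to every layer, instead of recreating it per layer. Class and method names below are illustrative, and NumPy stands in for MLX:

```python
import numpy as np

# Sketch of caching the sliding-window attention mask in the model
# (names are hypothetical; the real code uses MLX arrays).
class DiTModel:
    def __init__(self, num_layers: int, window: int):
        self.num_layers = num_layers
        self.window = window
        self._mask_cache: dict = {}

    def sliding_window_mask(self, seq_len: int) -> np.ndarray:
        mask = self._mask_cache.get(seq_len)
        if mask is None:
            idx = np.arange(seq_len)
            dist = idx[None, :] - idx[:, None]   # dist[i, j] = j - i
            # position i may attend to j when i - window < j <= i
            allowed = (dist <= 0) & (dist > -self.window)
            mask = np.where(allowed, 0.0, -np.inf)
            self._mask_cache[seq_len] = mask
        return mask
```

Every layer then receives the one cached array, so the cost of building the mask is paid once per sequence length rather than once per layer per step.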
22 tests covering: config, KVCache, attention masks, RMSNorm, TimestepEmbedding, RotaryEmbedding, Attention (self + cross with cache), MLP, DiTLayer, DiTModel (forward, cache, conv_transpose), and weight sanitization (dit prefix remap, proj_in scoping, scale_shift_table reshape). Tests caught a stale cross_attn_mask variable left from the simplify pass — fixed to None.
- Document use_lm=True default (required for turbo model)
- Document num_steps=20 default (needed for clear vocals)
- Add "How It Works" section explaining the LM planner + DiT pipeline
- Add Hugging Face repo links (fp32 and 4-bit quantized)
- Update performance table with actual M4 Max measurements
- Document LM size trade-offs (0.6B vs 4B)
- List known limitations (language detection, seed variance)
The --save argument was accidentally fused into the --gen-kwargs argument body, causing a 'positional argument follows keyword argument' SyntaxError when importing mlx_audio.tts.generate.
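A minimal repro of that failure mode: once a string literal lands after the keyword arguments inside an `add_argument(...)` call, the module cannot even be compiled. The call below is a simplified stand-in, not the actual source line from `generate.py`:

```python
import argparse

# Correct form: --gen-kwargs and --save each get their own call.
parser = argparse.ArgumentParser()
parser.add_argument("--gen-kwargs", type=str, default="{}")
parser.add_argument("--save", action="store_true")

# The bug: the "--save" flag string fused into the --gen-kwargs call
# after its keyword arguments. Python rejects this at compile time,
# so the whole module fails to import.
buggy = 'parser.add_argument("--gen-kwargs", default="{}", "--save")'
try:
    compile(buggy, "generate.py", "exec")
    error_msg = None
except SyntaxError as exc:
    error_msg = exc.msg
```

Because this is a compile-time error, it surfaces on `import mlx_audio.tts.generate` rather than when the flag is used.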
Fix ACE-Step: weight conversion, LM hints, and vocal support