
Add ACE-1.5 #499

Draft
Blaizzy wants to merge 23 commits into main from pc/add-ace

Conversation

@Blaizzy
Owner

@Blaizzy Blaizzy commented Feb 15, 2026

No description provided.

- Added methods for encoding audio to latent representations and normalizing audio inputs.
- Implemented instruction generation based on task types, improving flexibility for various audio tasks.
- Introduced new task type constants and instruction templates in the configuration file.
- Updated imports and module exports to include new constants and configurations.

These changes enhance the model's capabilities for audio processing and task management.

- Introduced a new section in the README for music generation, detailing the ACE-Step model for text-to-music tasks.
- Added CLI and Python API examples for generating music from text descriptions and enhancing existing audio tracks.
- Updated the `generate.py` script to accept additional generation parameters via a JSON string.
- Enhanced the ACE-Step model loading logic to check for weights in both turbo and root directories.

These changes expand the functionality of the MLX-Audio framework to include music generation features.
mm65x and others added 18 commits March 20, 2026 10:14
optimize ace_step: completely remove torch runtime dependency + unified conversion pipeline
Signed-off-by: Prince Canuma <prince.gdt@gmail.com>
- Converter: export all weight prefixes (`tokenizer`, `detokenizer`,
  `null_condition_emb`), not just `decoder`/`encoder`
- Converter: keep the `decoder.` prefix instead of renaming it to `dit.`
- Converter: use underscore naming for decoder conv params
  (`proj_in_weight`) to match the bare parameters on `DiTModel`
- Sanitize: add a `dit.` -> `decoder.` mapping for legacy weights
- Sanitize: scope conv param renaming to the decoder only, to avoid
  breaking `detokenizer.proj_out` (an `nn.Linear`)
- Add "ace" to `MODEL_REMAPPING` for HF repo name resolution
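The two sanitize rules above can be sketched roughly as follows. This is a minimal illustration and not the PR's actual code: the function name `sanitize_keys`, the exact key layout (`decoder.proj_in.weight`), and the assumption that `proj_out` under the decoder is also a conv parameter are all hypothetical.

```python
import re

# Hypothetical key pattern: top-level decoder conv parameters such as
# "decoder.proj_in.weight". The real checkpoint layout may differ.
_DECODER_CONV = re.compile(r"^decoder\.(proj_in|proj_out)\.(weight|bias)$")


def sanitize_keys(weights):
    """Return a new dict with legacy prefixes and conv param names normalized."""
    sanitized = {}
    for key, value in weights.items():
        # Legacy checkpoints used a "dit." prefix; map it to "decoder.".
        if key.startswith("dit."):
            key = "decoder." + key[len("dit."):]
        # Rename conv params only under "decoder." so that
        # "detokenizer.proj_out.weight" (a linear layer) is left untouched.
        match = _DECODER_CONV.match(key)
        if match:
            key = f"decoder.{match.group(1)}_{match.group(2)}"
        sanitized[key] = value
    return sanitized


legacy = {
    "dit.proj_in.weight": "conv-in",
    "decoder.layers.0.norm.weight": "norm",
    "detokenizer.proj_out.weight": "linear-out",
}
print(sorted(sanitize_keys(legacy)))
```

Anchoring the pattern at `^decoder\.` is what keeps the rename scoped: a bare substring replacement on `proj_out.weight` would also rewrite the detokenizer's linear layer.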
The turbo model requires 5Hz LM hints to generate music.
Without LM-generated audio codes guiding the DiT, the
diffusion simply denoises back to silence. Previously,
LM hints were only generated for cover tasks.
The turbo model produces silence without LM hints, so
enable them by default for a working out-of-the-box experience.
Add explicit language section to the LM prompt template so the
5Hz planner model better respects the vocal_language parameter.
The turbo HF model uses Qwen3 word embeddings for lyrics (not
BPE phoneme tokens), which limits vocal synthesis to the model's
learned capabilities.
8 steps produce good instrumentals, but vocals need more
diffusion steps to resolve clearly. 20 steps still run
well under real time (~8s of diffusion for 30s of audio).
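As a quick check of the "well under real-time" claim, the approximate figures quoted above give the following ratio. This covers the diffusion stage only; LM planning and audio decoding time are not included in the quoted number.

```python
diffusion_seconds = 8.0  # approximate figure quoted in the commit message
audio_seconds = 30.0     # length of the generated clip

# A real-time factor below 1.0 means generation is faster than playback.
real_time_factor = diffusion_seconds / audio_seconds
speedup = audio_seconds / diffusion_seconds
print(f"RTF ~ {real_time_factor:.2f}, about {speedup:.2f}x faster than real time")
```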
- Cache sliding window mask in DiT instead of recreating per-layer (240x)
- Pass None instead of all-zeros no-op self-attention mask
- Remove dead code: test_model(), omega_scale param, silence_latent param,
  unused nested_weights, TASK_TYPES_TURBO/BASE, VAEConfig, TextEncoderConfig
- Remove redundant hidden_size attrs stored but never read (3 classes)
- Fix fragile .replace(".pt", ".npy") path handling
- Strip narrating comments that restate what code already expresses
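The mask-caching change in the first bullet can be illustrated with a small sketch. It is not the PR's implementation: the class name, the plain-list mask representation, the symmetric window, and the layer count are assumptions made for this example (the real code presumably builds an MLX array).

```python
class SlidingWindowMaskCache:
    """Builds a sliding-window attention mask once per (seq_len, window)."""

    def __init__(self):
        self._cache = {}
        self.builds = 0  # counts how often a mask is actually constructed

    def get(self, seq_len, window):
        key = (seq_len, window)
        if key not in self._cache:
            self.builds += 1
            # True where position i may attend to position j
            # (symmetric window, assumed for illustration).
            self._cache[key] = [
                [abs(i - j) <= window for j in range(seq_len)]
                for i in range(seq_len)
            ]
        return self._cache[key]


cache = SlidingWindowMaskCache()
num_layers = 24  # hypothetical layer count, for illustration only
masks = [cache.get(seq_len=8, window=2) for _ in range(num_layers)]
print(cache.builds)  # 1: every layer reuses the same mask object
```

Because the mask depends only on sequence length and window size, each layer across every diffusion step can share one instance instead of rebuilding it.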
22 tests covering: config, KVCache, attention masks, RMSNorm,
TimestepEmbedding, RotaryEmbedding, Attention (self + cross with
cache), MLP, DiTLayer, DiTModel (forward, cache, conv_transpose),
and weight sanitization (dit prefix remap, proj_in scoping,
scale_shift_table reshape).

Tests caught a stale cross_attn_mask variable left over from the
simplification pass; it is now set to None.
- Document use_lm=True default (required for turbo model)
- Document num_steps=20 default (needed for clear vocals)
- Add "How It Works" section explaining the LM planner + DiT pipeline
- Add Hugging Face repo links (fp32 and 4-bit quantized)
- Update performance table with actual M4 Max measurements
- Document LM size trade-offs (0.6B vs 4B)
- List known limitations (language detection, seed variance)
The --save argument was accidentally fused into the --gen-kwargs
argument body, causing 'positional argument follows keyword argument'
SyntaxError when importing mlx_audio.tts.generate.
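The failure mode and the fix can be reproduced in isolation. In the sketch below, only the flag names `--gen-kwargs` and `--save` come from the PR; the argument types, defaults, and help strings are assumptions, and the real `generate.py` parser may define them differently.

```python
import argparse
import json

# A positional argument after a keyword argument is rejected at compile
# time, which is why the bug surfaced as soon as the module was imported.
try:
    compile("f(a=1, 2)", "<demo>", "eval")
    error_message = None
except SyntaxError as exc:
    error_message = exc.msg


def build_parser():
    parser = argparse.ArgumentParser(description="Illustrative parser only")
    # Each option gets its own add_argument call.
    parser.add_argument(
        "--gen-kwargs",
        type=json.loads,  # parse the JSON string into a dict
        default={},
        help="Extra generation parameters as a JSON string",
    )
    parser.add_argument(
        "--save",
        action="store_true",  # flag semantics assumed for this sketch
        help="Hypothetical help text",
    )
    return parser


args = build_parser().parse_args(["--gen-kwargs", '{"num_steps": 20}', "--save"])
print(error_message)
print(args.gen_kwargs, args.save)
```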
Fix ACE-Step: weight conversion, LM hints, and vocal support

5 participants