Conversation
- Added methods for encoding audio to latent representations and normalizing audio inputs.
- Implemented instruction generation based on task types, improving flexibility for various audio tasks.
- Introduced new task type constants and instruction templates in the configuration file.
- Updated imports and module exports to include the new constants and configurations.

These changes enhance the model's capabilities for audio processing and task management.
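The task-type-driven instruction generation described above can be sketched as follows. The constant names, templates, and function name are illustrative assumptions, not the actual contents of the configuration file:

```python
# Hypothetical sketch of selecting an instruction template by task type.
# TASK_INSTRUCTIONS and build_instruction are illustrative names; the
# real constants live in the configuration file mentioned above.
TASK_INSTRUCTIONS = {
    "text2music": "Generate music matching this description: {prompt}",
    "cover": "Re-render this audio in the described style: {prompt}",
}

def build_instruction(task_type: str, prompt: str) -> str:
    template = TASK_INSTRUCTIONS.get(task_type)
    if template is None:
        raise ValueError(f"unknown task type: {task_type}")
    return template.format(prompt=prompt)
```

Keeping the templates in one mapping means adding a new task type is a one-line config change rather than a new code path.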
- Introduced a new section in the README for music generation, detailing the ACE-Step model for text-to-music tasks.
- Added CLI and Python API examples for generating music from text descriptions and enhancing existing audio tracks.
- Updated the `generate.py` script to accept additional generation parameters via a JSON string.
- Enhanced the ACE-Step model loading logic to check for weights in both turbo and root directories.

These changes expand the functionality of the MLX-Audio framework to include music generation features.
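Passing generation parameters as a JSON string, as `generate.py` now does, can be sketched like this. The specific keys (`num_steps`, `use_lm`) appear in later commits of this PR, but the exact schema accepted by the script is an assumption:

```python
import json

# Sketch: a --gen-kwargs style JSON string parsed into keyword
# arguments for the generation call. Key names are illustrative.
raw = '{"num_steps": 20, "use_lm": true}'
gen_kwargs = json.loads(raw)

# The resulting dict can be splatted into a generate call, e.g.
# model.generate(prompt, **gen_kwargs)  (call shape is hypothetical)
```

A JSON string keeps the CLI stable: new generation parameters need no new argparse flags.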
optimize ace_step: completely remove torch runtime dependency + unified conversion pipeline
Signed-off-by: Prince Canuma <prince.gdt@gmail.com>
- Converter: export all weight prefixes (tokenizer, detokenizer, null_condition_emb), not just decoder/encoder
- Converter: keep the decoder. prefix instead of renaming to dit.
- Converter: use underscore naming for decoder conv params (proj_in_weight) to match DiTModel bare parameters
- Sanitize: add a dit. -> decoder. mapping for legacy weights
- Sanitize: scope conv param renaming to decoder only, to avoid breaking detokenizer.proj_out (nn.Linear)
- Add "ace" to MODEL_REMAPPING for HF repo name resolution
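The sanitize rules above can be sketched as a key-remapping pass. The function name is illustrative; the prefix and parameter names (`dit.`, `decoder.`, `proj_in`) come from the commit message:

```python
# Hedged sketch of the weight sanitization described above, not the
# actual mlx_audio implementation.
def sanitize_weights(weights: dict) -> dict:
    out = {}
    for key, value in weights.items():
        # Legacy checkpoints used a dit. prefix; map it back to decoder.
        if key.startswith("dit."):
            key = "decoder." + key[len("dit."):]
        # Underscore-rename conv params, scoped to decoder. only so that
        # detokenizer.proj_out (an nn.Linear) is left untouched.
        if key.startswith("decoder.") and key.endswith("proj_in.weight"):
            key = key[: -len("proj_in.weight")] + "proj_in_weight"
        out[key] = value
    return out
```

Scoping the rename by prefix is what prevents the detokenizer's identically named linear weight from being mangled.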
The turbo model requires 5Hz LM hints to generate music. Without LM-generated audio codes guiding the DiT, the diffusion simply denoises back to silence. Previously, LM hints were only generated for cover tasks.
The turbo model produces silence without LM hints, so enable them by default for a working out-of-the-box experience.
Add explicit language section to the LM prompt template so the 5Hz planner model better respects the vocal_language parameter. The turbo HF model uses Qwen3 word embeddings for lyrics (not BPE phoneme tokens), which limits vocal synthesis to the model's learned capabilities.
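An explicit language section in a prompt template might look like the sketch below. The section headings and wording are assumptions for illustration; the actual 5Hz planner prompt is not reproduced here:

```python
# Hypothetical prompt template with a dedicated language section so the
# planner sees vocal_language explicitly. Exact wording is an assumption.
PROMPT_TEMPLATE = (
    "# Language\n{vocal_language}\n\n"
    "# Lyrics\n{lyrics}\n"
)

prompt = PROMPT_TEMPLATE.format(vocal_language="en", lyrics="la la la")
```

Making the language its own labeled field, rather than burying it in free text, gives the LM a consistent place to read the parameter from.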
8 steps produces good instrumentals but vocals need more diffusion steps to resolve clearly. 20 steps still runs well under real-time (~8s diffusion for 30s of audio).
- Cache sliding window mask in DiT instead of recreating per-layer (240x)
- Pass None instead of all-zeros no-op self-attention mask
- Remove dead code: test_model(), the omega_scale and silence_latent params, unused nested_weights, TASK_TYPES_TURBO/BASE, VAEConfig, TextEncoderConfig
- Remove redundant hidden_size attrs stored but never read (3 classes)
- Fix fragile .replace(".pt", ".npy") path handling
- Strip narrating comments that restate what code already expresses
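The mask-caching change can be sketched as building the sliding-window mask once at the model level and handing the same array to every layer, instead of recreating it per layer. Class and method names below are illustrative, and NumPy stands in for MLX:

```python
import numpy as np

# Sketch of caching the sliding-window attention mask in the model
# (names are hypothetical; the real code uses MLX arrays).
class DiTModel:
    def __init__(self, num_layers: int, window: int):
        self.num_layers = num_layers
        self.window = window
        self._mask_cache: dict = {}

    def sliding_window_mask(self, seq_len: int) -> np.ndarray:
        mask = self._mask_cache.get(seq_len)
        if mask is None:
            idx = np.arange(seq_len)
            dist = idx[None, :] - idx[:, None]   # dist[i, j] = j - i
            # position i may attend to j when i - window < j <= i
            allowed = (dist <= 0) & (dist > -self.window)
            mask = np.where(allowed, 0.0, -np.inf)
            self._mask_cache[seq_len] = mask
        return mask
```

Every layer then receives the one cached array, so the cost of building the mask is paid once per sequence length rather than once per layer per step.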
22 tests covering: config, KVCache, attention masks, RMSNorm, TimestepEmbedding, RotaryEmbedding, Attention (self + cross with cache), MLP, DiTLayer, DiTModel (forward, cache, conv_transpose), and weight sanitization (dit prefix remap, proj_in scoping, scale_shift_table reshape). Tests caught a stale cross_attn_mask variable left from the simplify pass — fixed to None.
- Document use_lm=True default (required for turbo model)
- Document num_steps=20 default (needed for clear vocals)
- Add "How It Works" section explaining the LM planner + DiT pipeline
- Add Hugging Face repo links (fp32 and 4-bit quantized)
- Update performance table with actual M4 Max measurements
- Document LM size trade-offs (0.6B vs 4B)
- List known limitations (language detection, seed variance)
The --save argument was accidentally fused into the --gen-kwargs argument body, causing a 'positional argument follows keyword argument' SyntaxError when importing mlx_audio.tts.generate.
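A minimal repro of that failure mode: once a string literal lands after the keyword arguments inside an `add_argument(...)` call, the module cannot even be compiled. The call below is a simplified stand-in, not the actual source line from `generate.py`:

```python
import argparse

# Correct form: --gen-kwargs and --save each get their own call.
parser = argparse.ArgumentParser()
parser.add_argument("--gen-kwargs", type=str, default="{}")
parser.add_argument("--save", action="store_true")

# The bug: the "--save" flag string fused into the --gen-kwargs call
# after its keyword arguments. Python rejects this at compile time,
# so the whole module fails to import.
buggy = 'parser.add_argument("--gen-kwargs", default="{}", "--save")'
try:
    compile(buggy, "generate.py", "exec")
    error_msg = None
except SyntaxError as exc:
    error_msg = exc.msg
```

Because this is a compile-time error, it surfaces on `import mlx_audio.tts.generate` rather than when the flag is used.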
Fix ACE-Step: weight conversion, LM hints, and vocal support