This directory contains examples of using Model Optimizer with the NeMo Megatron-Bridge framework for pruning, distillation, quantization, etc.

| Section | Description | Link |
|---|---|---|
| Pre-Requisites | Development environment setup | [Link] |
| Pruning | Examples of pruning a model using the Minitron algorithm | [Link] |
| Distillation | Examples of distilling a pruned or quantized model | [Link] |
| Post-Training Quantization | Examples of quantizing a model | [Link] |
| Resources | Extra links to relevant resources | [Link] |

Running these examples requires many additional dependencies (e.g., Megatron-Bridge, Megatron-core, etc.), so we strongly recommend using the NeMo container directly (e.g., nvcr.io/nvidia/nemo:26.02), which has all the dependencies installed.
To get the ModelOpt example scripts, mount your Model-Optimizer repo into the container as follows:
export MODELOPT_DIR=${PWD}/Model-Optimizer # or set to your local Model-Optimizer repository path if you have cloned it
if [ ! -d "${MODELOPT_DIR}" ]; then
git clone https://github.com/NVIDIA/Model-Optimizer.git ${MODELOPT_DIR}
fi
export DOCKER_IMAGE=nvcr.io/nvidia/nemo:26.02
docker run \
--gpus all \
--shm-size=16GB \
--net=host \
--ulimit memlock=-1 \
--rm -it \
-v ${MODELOPT_DIR}:/opt/Model-Optimizer \
-v ${MODELOPT_DIR}/modelopt:/opt/venv/lib/python3.12/site-packages/modelopt \
-w /opt/Model-Optimizer/examples/megatron_bridge \
${DOCKER_IMAGE} bash

Once inside the container, you need to log in with your HuggingFace token to download gated datasets / models.
Note that the default dataset for pruning and quantization is nemotron-post-training-dataset-v2, which is gated.
hf auth login --token <your token>

Warning: Use python -m pip instead of pip to avoid conflicts with the system-wide installed packages in the NeMo containers.
This section shows how to prune a HuggingFace model using the Minitron algorithm in the Megatron-Bridge framework. Check out other available pruning algorithms, supported frameworks and models, and the general pruning getting-started guide in the pruning README.
Example usage to prune Qwen3-8B to 6B on 2 GPUs (Pipeline Parallelism = 2) while skipping pruning of num_attention_heads, using the following defaults:

- 1024 samples from nemotron-post-training-dataset-v2 are used for calibration,
- at most 20% of the depth (num_layers) and 40% of the width is pruned per prunable hparam (hidden_size, ffn_hidden_size, ...),
- the top-10 candidates are evaluated for MMLU score (5% sampled data) to select the best model.

torchrun --nproc_per_node 2 prune_minitron.py \
--pp_size 2 \
--hf_model_name_or_path Qwen/Qwen3-8B \
--prune_target_params 6e9 \
--hparams_to_skip num_attention_heads \
--output_hf_path /tmp/Qwen3-8B-Pruned-6B

Example usage for manually pruning to a specific architecture, using the default of 1024 samples from nemotron-post-training-dataset-v2 for calibration:

torchrun --nproc_per_node 2 prune_minitron.py \
--pp_size 2 \
--hf_model_name_or_path Qwen/Qwen3-8B \
--prune_export_config '{"hidden_size": 3584, "ffn_hidden_size": 9216}' \
--output_hf_path /tmp/Qwen3-8B-Pruned-6B-manual

To see the full usage for advanced configurations, run:

torchrun --nproc_per_node 1 prune_minitron.py --help

Tip: If the number of layers in the model is not divisible by the number of GPUs, i.e. the pipeline parallel (PP) size, you can configure uneven PP by setting --num_layers_in_first_pipeline_stage and --num_layers_in_last_pipeline_stage. E.g. for Qwen3-8B with 36 layers and 8 GPUs, you can set both to 3 to get 3-5-5-5-5-5-5-3 layers per GPU.

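The layer split in the tip can be checked with a few lines of arithmetic. The sketch below is illustrative only: `layers_per_stage` is a hypothetical helper, not part of prune_minitron.py or Megatron-Bridge, and it assumes that the layers not pinned to the first and last stages are divided evenly among the remaining stages.

```python
from typing import List, Optional


def layers_per_stage(
    num_layers: int,
    pp_size: int,
    first: Optional[int] = None,
    last: Optional[int] = None,
) -> List[int]:
    """Return the number of layers on each pipeline stage (hypothetical helper).

    `first` / `last` mirror --num_layers_in_first_pipeline_stage and
    --num_layers_in_last_pipeline_stage. The even split of the remaining
    layers over the remaining stages is an assumption of this sketch.
    """
    pinned = [n for n in (first, last) if n is not None]
    middle_stages = pp_size - len(pinned)
    remaining = num_layers - sum(pinned)
    if middle_stages < 0 or remaining < 0:
        raise ValueError("first/last stage settings do not fit this model")
    if middle_stages == 0:
        if remaining != 0:
            raise ValueError("first + last must cover all layers when pp_size == 2")
        middle = []
    else:
        if remaining % middle_stages != 0:
            raise ValueError(
                f"{remaining} layers cannot be split evenly over {middle_stages} stages"
            )
        middle = [remaining // middle_stages] * middle_stages
    head = [first] if first is not None else []
    tail = [last] if last is not None else []
    return head + middle + tail


# 36 layers on 8 GPUs with 3 layers on the first and last stage.
print("-".join(map(str, layers_per_stage(36, 8, first=3, last=3))))  # 3-5-5-5-5-5-5-3
```

Calling `layers_per_stage(36, 8)` without the two overrides raises a `ValueError`, since 36 is not divisible by 8. That is the case the tip addresses.
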
This section shows how to distill a student model from a teacher model in the Megatron-Bridge framework.
This can be used stand-alone or after Pruning / Post-Training Quantization to recover accuracy of the model by distilling from the original model (teacher).
The distill.py script supports both standard HuggingFace checkpoints and Puzzletron AnyModel checkpoints as student/teacher inputs. Just pass the checkpoint path via --student_hf_path / --teacher_hf_path. The distilled model is saved to <output_dir>/checkpoints in Megatron distributed checkpoint format.
The distillation script expects pre-tokenized data in Megatron's binary format (.bin / .idx files).
See the Dataset Preparation README for full instructions on tokenizing JSONL files and Hugging Face datasets, and for the list of output prefixes that you can pass to the --data_paths argument.
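As a convenience, the weight/prefix pairs for --data_paths can be assembled and sanity-checked before launching a long run. This is a minimal sketch, not part of distill.py: `build_data_paths` is a hypothetical helper, and it assumes each output prefix corresponds to a `<prefix>.bin` and `<prefix>.idx` file pair on disk.

```python
from pathlib import Path
from typing import Dict, List


def build_data_paths(weighted_prefixes: Dict[str, float], check_files: bool = True) -> List[str]:
    """Flatten {prefix: weight} into the alternating "weight prefix" argument list.

    Hypothetical helper. With check_files=True it also verifies that the
    assumed <prefix>.bin and <prefix>.idx files exist.
    """
    args: List[str] = []
    for prefix, weight in weighted_prefixes.items():
        if weight <= 0:
            raise ValueError(f"weight for {prefix!r} must be positive, got {weight}")
        if check_files:
            missing = [ext for ext in (".bin", ".idx") if not Path(prefix + ext).is_file()]
            if missing:
                raise FileNotFoundError(f"{prefix}: missing {', '.join(missing)}")
        args += [str(weight), prefix]
    return args


# Reproduces the argument value used in the distillation example
# (file checks disabled because the paths are placeholders).
example_args = build_data_paths(
    {
        "tokenized_qwen3/data1_text_document": 1.0,
        "tokenized_qwen3/data2_text_document": 1.0,
    },
    check_files=False,
)
print(" ".join(example_args))
```
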
Example usage to distill a 4B student (HF) from an 8B teacher (HF) on 8 GPUs (TP=8, PP=1):
torchrun --nnodes 1 --nproc_per_node 8 distill.py \
--tp_size 8 \
--teacher_hf_path Qwen/Qwen3-8B \
--student_hf_path Qwen/Qwen3-4B \
--data_paths 1.0 tokenized_qwen3/data1_text_document 1.0 tokenized_qwen3/data2_text_document \
--data_path_to_cache /path/to/cache/dataset_indices_qwen3 \
--seq_length 8192 \
--mbs 1 \
--gbs 768 \
--train_iters 15000 \
--lr 1e-4 \
--min_lr 1e-5 \
--lr_warmup_iters 50 \
--eval_interval 100 \
--eval_iters 32 \
--log_interval 10 \
--output_dir /output/qwen3_8b_to_4b_distill

TensorBoard logging is enabled by default and logs are saved to the <output_dir>/tensorboard directory.
To use Weights & Biases for logging, set the WANDB_API_KEY environment variable and pass the --wandb_project argument.
Optionally, you can also pass --wandb_entity and --wandb_exp_name arguments to group runs under a project and experiment name.
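To get a feel for what the batch settings in the 8-GPU example imply, the sketch below works through the arithmetic. It assumes the usual Megatron relationship between global batch size, micro batch size, and data-parallel size, and a context-parallel size of 1. `batch_plan` is a hypothetical helper, and the token counts are upper bounds that assume every sequence is filled to --seq_length.

```python
from typing import Dict


def batch_plan(
    num_gpus: int,
    tp_size: int,
    pp_size: int,
    mbs: int,
    gbs: int,
    seq_length: int,
    train_iters: int,
) -> Dict[str, int]:
    """Derive data-parallel size, micro-batches per iteration and token budget.

    Hypothetical helper; assumes dp_size = num_gpus / (tp_size * pp_size)
    and that gbs is split into micro-batches of size mbs on each DP rank.
    """
    model_parallel = tp_size * pp_size
    if num_gpus % model_parallel != 0:
        raise ValueError("num_gpus must be divisible by tp_size * pp_size")
    dp_size = num_gpus // model_parallel
    if gbs % (mbs * dp_size) != 0:
        raise ValueError("gbs must be divisible by mbs * dp_size")
    tokens_per_iter = gbs * seq_length
    return {
        "dp_size": dp_size,
        "microbatches_per_iter": gbs // (mbs * dp_size),
        "tokens_per_iter": tokens_per_iter,
        "total_tokens": tokens_per_iter * train_iters,
    }


# Settings from the 8-GPU Qwen3-8B -> Qwen3-4B example (TP=8, PP=1).
plan = batch_plan(num_gpus=8, tp_size=8, pp_size=1, mbs=1, gbs=768,
                  seq_length=8192, train_iters=15000)
for key, value in plan.items():
    print(f"{key}: {value:,}")
```

Under these assumptions the example runs with a data-parallel size of 1, accumulates 768 micro-batches per iteration, and processes at most about 6.3M tokens per iteration, or roughly 94B tokens over 15000 iterations.
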
To see all available arguments:
torchrun --nproc_per_node 1 distill.py --help

Example usage with mock data for quick testing (no pre-tokenized data needed):

torchrun --nproc_per_node 8 distill.py \
--tp_size 8 \
--teacher_hf_path Qwen/Qwen3-0.6B \
--student_hf_path Qwen/Qwen3-0.6B \
--use_mock_data \
--seq_length 512 \
--mbs 1 \
--gbs 8 \
--train_iters 100 \
--eval_interval 10 \
--eval_iters 4 \
--output_dir /tmp/test_distill

To run the distillation script on a Slurm cluster for multi-node training, use python instead of torchrun and set the number of nodes with the #SBATCH --nodes=<num_nodes> directive in your Slurm script.

The distilled checkpoint is saved in Megatron distributed format. If you need a HuggingFace checkpoint, there are two ways to convert it:
Inline -- add --hf_export_path and --student_hf_model to the distill.py command to automatically convert the final checkpoint after distillation:
torchrun --nnodes 1 --nproc_per_node 8 distill.py \
... \
--hf_export_path /path/to/save/distilled_hf_ckpt \
--student_hf_model Qwen/Qwen3-4B

The --student_hf_model value should match the base architecture of the student (it is used as a template for export). For non-Puzzletron (i.e. standard) models, it should be the same as --student_hf_path.
Separate conversion -- convert any saved iteration using the Megatron-Bridge conversion script:
uv run python /opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py export \
--hf-model <path_to_pruned_hf_ckpt> \
--megatron-path <distill_output_dir>/checkpoints/iter_<iter_number> \
--hf-path <path_to_save_distilled_hf_ckpt>

For more details, see the Megatron-Bridge conversion README.
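When converting separately, you first need to decide which saved iteration to pass as --megatron-path. The sketch below picks the highest-numbered one. It is illustrative only: `latest_iteration_dir` is a hypothetical helper, and it assumes the iteration directories under <output_dir>/checkpoints are named `iter_` followed by digits, as the command above suggests.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

# Assumed naming: "iter_" followed by digits (zero-padded or not).
ITER_DIR = re.compile(r"^iter_(\d+)$")


def latest_iteration_dir(checkpoints_dir: str) -> Optional[Path]:
    """Return the iter_<N> subdirectory with the largest N, or None if there is none."""
    root = Path(checkpoints_dir)
    if not root.is_dir():
        return None
    best: Optional[Path] = None
    best_iter = -1
    for child in root.iterdir():
        match = ITER_DIR.match(child.name)
        # Compare numerically so padded and unpadded names sort correctly.
        if child.is_dir() and match and int(match.group(1)) > best_iter:
            best_iter = int(match.group(1))
            best = child
    return best


# Small self-contained demo on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("iter_0000100", "iter_0001000", "notes"):
        (Path(tmp) / name).mkdir()
    demo_latest = latest_iteration_dir(tmp).name
print(demo_latest)  # iter_0001000
```
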
See results/puzzletron.md for MMLU results demonstrating knowledge distillation on Puzzletron-compressed student models.
Check out the quantization scripts for LLMs and VLMs in the Megatron-Bridge repository here.