[DRAFT] Add heterogeneous AnyModel distillation example for Puzzletron. by chochowski · Pull Request #1725 · NVIDIA/Model-Optimizer

chochowski · 2026-06-15T09:27:55Z

Ports examples/puzzletron/distillation from the puzzletron_distillation branch with SPDX headers, nemo2 registry fix, and puzzletron README link.

What does this PR do?

This PR adds support for mbridge distillation of puzzled heterogeneous models

Usage

In general distillation requires more compute than normal training, thus ideally you should run this on a multinode-setting (unless very constrained scenarion)

torchrun --nproc-per-node=8  examples/puzzletron/distillation/distill.py \
--student llama \
--teacher llama \
--student-checkpoint /puzzletron/workspaces/Llama-3.1-8B-Instruct/mip/puzzle_solutions/target_memory_78000MiB-num_params_7G/solutions--checkpoints/solution_0/ \
--teacher-checkpoint /puzzletron/workspaces/Llama-3.1-8B-Instruct/ckpts/teacher/   \
--config-file examples/puzzletron/distillation/kd-container-llama.yaml \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 8 \
--expert-model-parallel-size 1 \
--expert-tensor-parallel-size 1 \
train.train_iters=1000 \
checkpoint.save=/puzzletron/workspaces/Llama-3.1-8B-Instruct/kd/puzzle_solutions/target_memory_78000MiB-num_params_7G-intermediate-fix/ \
logger.wandb_exp_name=Llama-3.1-8B-Instruct-target_memory_78000MiB-num_params_7G-intermediate-fix-verify

Testing

TBD

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

Is this change backward compatible?: ✅ / ❌ / N/A
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
Did you write any new necessary tests?: ✅ / ❌ / N/A
Did you update Changelog?: ✅ / ❌ / N/A
Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

New Features
- Added a heterogeneous knowledge distillation workflow for AnyModel/Puzzletron pairs via Megatron Bridge.
- Introduced new distillation and export command-line scripts, including HuggingFace ↔ Megatron Bridge GPT-OSS weight conversion support.
- Added supporting distillation utilities and presets for streamlined heterogeneous scenarios.
Documentation
- Expanded Puzzletron distillation guides with detailed heterogeneous and GPT-OSS-specific instructions, plus end-to-end distill/export examples.
Bug Fixes
- Improved hybrid MoE distillation stability by preventing aux-loss buffer sizing mismatches on pipeline stages.
Chores
- Updated linting rules for the Puzzletron distillation example files.

Ports examples/puzzletron/distillation from the puzzletron_distillation branch with SPDX headers, nemo2 registry fix, and puzzletron README link. Signed-off-by: mchochowski <mchochowski@nvidia.com>

coderabbitai · 2026-06-15T09:28:10Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a70b550c-f6ff-4240-8eaa-de10c78ef5bc

📥 Commits

Reviewing files that changed from the base of the PR and between 22cda76 and b48c43f.

📒 Files selected for processing (3)

examples/puzzletron/distillation/kd-container-llama.yaml
examples/puzzletron/distillation/kd-container-nemotron3.yaml
examples/puzzletron/distillation/kd-container-qwen.yaml

✅ Files skipped from review due to trivial changes (1)

examples/puzzletron/distillation/kd-container-llama.yaml

🚧 Files skipped from review as they are similar to previous changes (2)

examples/puzzletron/distillation/kd-container-qwen.yaml
examples/puzzletron/distillation/kd-container-nemotron3.yaml

📝 Walkthrough

Walkthrough

Adds a new examples/puzzletron/distillation/ package implementing heterogeneous AnyModel/Puzzletron knowledge distillation via Megatron Bridge. It introduces per-layer block config translation to MCore overrides, a mbridge_patcher context manager for monkey-patching layer construction, bridge and provider patches, a GPTOSSBridge with MXFP4 dequantization, distill.py and export_to_hf.py entrypoints, configuration templates for multiple models, and comprehensive documentation.

Changes

Heterogeneous Puzzletron distillation example

Layer / File(s)	Summary
Per-layer block config data contract and `MCoreLayerOverrides` `examples/puzzletron/distillation/block_config_utils.py`	Defines `MCoreLayerOverrides` dataclass and all functions to load, normalize, and translate per-decoder-layer `block_configs` into Megatron-Core `TransformerConfig` override structures with attention/MLP no-op flags, supporting GQA, Mamba slots, dense MLP, and MoE FFN.
MCore layer patching and no-op submodule replacement `examples/puzzletron/distillation/layer_patchers.py`	Introduces `NoOpWithBias`, `NoOpRMSNorm`, thread-local patcher state, `_NO_OP_RULES`, `_apply_no_ops`, and the `mbridge_patcher` context manager that monkey-patches `build_module` in multiple Megatron namespaces and `MambaLayer.__init__` to apply per-layer config overrides and replace disabled subblocks during model construction.
Model bridge and provider monkey-patches `examples/puzzletron/distillation/model_bridge_patch.py`, `examples/puzzletron/distillation/provider_patch.py`	Patches `MegatronModelBridge.load_weights_hf_to_megatron` with heterogeneous MoE name resolver and post-load diagnostics, and patches `ModelProviderMixin.provide` / `DistillationProvider.provide` to wrap provider construction inside `mbridge_patcher`; includes `set_provider_block_configs` and `set_student_block_configs` helpers.
GPT-OSS Megatron bridge and MXFP4 dequantization `examples/puzzletron/distillation/gpt_oss_bridge.py`	Registers `GPTOSSBridge` converting `GptOssForCausalLM` to Megatron `GPTModel` with per-expert weight re-indexing, MXFP4 block+scale dequantization via `_dequantize_mxfp4`, and custom `AutoMapping` subclasses for MoE MLP expert down-projection and gate/up projection with interleaving.
Shared helpers and `MODEL_REGISTRY` `examples/puzzletron/distillation/_common.py`	Defines `MODEL_REGISTRY`, path constants, logging setup, HF config/block config/descriptor loading helpers, `_load_bridge` via `deci_x_patcher`, `_build_provider`, and the `run_entrypoint` wrapper with distributed teardown.
Distillation entrypoint and YAML configs `examples/puzzletron/distillation/distill.py`, `examples/puzzletron/distillation/kd-container-*.yaml`	Implements `main()` orchestrating provider setup, `ConfigContainer` construction, YAML/Hydra override merging, teacher config sync, `_install_hybrid_moe_aux_loss_size_fix` (monkey-patching `track_moe_metrics` to prevent NCCL deadlocks on zero-MoE PP stages), and `distill()` invocation. Includes default YAML and model-specific templates (Llama, Nemotron-3, Qwen) providing training, optimizer, dataset, checkpoint, logging, and runtime defaults.
Export-to-HF entrypoint `examples/puzzletron/distillation/export_to_hf.py`	Implements `main()` for exporting an MCore checkpoint back to HuggingFace format: validates registry key, loads student HF config and block configs, applies provider patches, exports via `student_bridge.export_ckpt()` inside `mbridge_patcher`, and copies tokenizer/config files into the output directory.
Public API, docs, and project config `examples/puzzletron/distillation/__init__.py`, `examples/puzzletron/distillation/README.md`, `examples/puzzletron/README.md`, `pyproject.toml`	Adds the `__init__.py` re-exporting the public API, full `distillation/README.md` covering setup/usage/export/debugging, a pointer in the parent README, and Ruff `per-file-ignores` entry for the new directory.

Sequence Diagram(s)

sequenceDiagram
    participant CLI
    participant distill_main as distill.py main()
    participant _common
    participant model_bridge_patch
    participant provider_patch
    participant DistillationProvider
    participant mbridge_patcher
    participant MCore_build as MCore build_module
    participant Bridge_distill as Bridge.distill()

    CLI->>distill_main: student/teacher keys, checkpoint paths, YAML + CLI overrides
    distill_main->>_common: _load_hf_config → _get_block_configs (student + teacher)
    distill_main->>model_bridge_patch: apply_patch() — heterogeneous MoE resolution
    distill_main->>provider_patch: apply_patch() + apply_distillation_patch()
    distill_main->>_common: _load_bridge + _build_provider (student + teacher)
    distill_main->>DistillationProvider: set_student_block_configs(block_configs)
    distill_main->>distill_main: _build_distill_config_container → merge YAML + CLI overrides
    distill_main->>distill_main: _sync_teacher_config_from_student
    distill_main->>distill_main: _install_hybrid_moe_aux_loss_size_fix (patch track_moe_metrics)
    distill_main->>Bridge_distill: distill(config)
    Bridge_distill->>DistillationProvider: provide(config)
    DistillationProvider->>mbridge_patcher: enter(block_configs, num_heads, hidden_size)
    mbridge_patcher->>MCore_build: patched — apply MCoreLayerOverrides + NoOpWithBias per layer
    MCore_build-->>mbridge_patcher: layer built with overrides
    mbridge_patcher-->>DistillationProvider: exit, restore build_module

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested reviewers

ChenhanYu
jenchen13
hychiang-git

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 53.85% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely summarizes the main change: adding a heterogeneous AnyModel distillation example for Puzzletron. It accurately reflects the primary purpose of the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	All security anti-patterns from SECURITY.md checked: zero torch.load/numpy.load with unsafe defaults, trust_remote_code properly exposed as configurable CLI flag defaulting to False, no eval/exec,...

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch 1679-add-mbridge-dsitillation-for-puzzletron

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-06-15T09:38:18Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.20%. Comparing base (cc17f2c) to head (b48c43f).
⚠️ Report is 45 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1725      +/-   ##
==========================================
- Coverage   77.09%   75.20%   -1.90%     
==========================================
  Files         511      511              
  Lines       56168    58240    +2072     
==========================================
+ Hits        43302    43798     +496     
- Misses      12866    14442    +1576

Flag	Coverage Δ
examples	`40.59% <ø> (-1.36%)`	⬇️
gpu	`57.70% <ø> (-0.62%)`	⬇️
regression	`14.66% <ø> (+0.06%)`	⬆️
unit	`54.43% <ø> (+0.10%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 7

🧹 Nitpick comments (6)

examples/puzzletron/distillation/export_to_hf.py (1)

112-113: ⚡ Quick win

Move deferred imports to module scope, or explicitly justify them inline.

Imports inside main() currently have no documented reason (circular dependency, optional dependency, or heavy import deferral), so this deviates from the repository import rule.

As per coding guidelines, imports should stay at module top unless delayed import is necessary and explicitly justified.

Also applies to: 135-135
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/puzzletron/distillation/export_to_hf.py` around lines 112 - 113,
Move the deferred imports out of the main() function and place them at the
module scope with other imports at the top of the file. The import statement
`from provider_patch import apply_distillation_patch, apply_patch` at line
112-113 and the other deferred import at line 135 should both be relocated to
the module-level import section unless there is a documented reason (such as
circular dependency, optional dependency, or heavy import deferral) for keeping
them deferred—in which case, add an explicit inline comment justifying the
deferred import at that location.
Source: Coding guidelines

examples/puzzletron/distillation/block_config_utils.py (1)

42-52: 💤 Low value

Consider adding __all__ to declare the public API.

Per coding guidelines, each module should declare its public surface with __all__ at the top. This module exports MCoreLayerOverrides, load_block_configs, block_config_to_mcore_overrides, and get_overrides_for_layer.
Suggested addition after imports
__all__ = [
    "MCoreLayerOverrides",
    "load_block_configs",
    "block_config_to_mcore_overrides",
    "get_overrides_for_layer",
]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/puzzletron/distillation/block_config_utils.py` around lines 42 - 52,
The module is missing an `__all__` declaration to explicitly define its public
API. Add an `__all__` list after the import statements (after the logger
initialization) that includes the four public symbols: MCoreLayerOverrides,
load_block_configs, block_config_to_mcore_overrides, and
get_overrides_for_layer. This makes the module's public interface explicit and
aligns with coding guidelines.
Source: Coding guidelines

examples/puzzletron/distillation/layer_patchers.py (1)

43-59: 💤 Low value

Consider adding __all__ to declare the public API.

This module exports NoOpWithBias, NoOpRMSNorm, and mbridge_patcher as public symbols. Per coding guidelines, each module should declare its public surface with __all__.

Suggested addition after imports

__all__ = [
    "NoOpWithBias",
    "NoOpRMSNorm",
    "mbridge_patcher",
]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/puzzletron/distillation/layer_patchers.py` around lines 43 - 59, The
module is missing an `__all__` declaration to explicitly define its public API.
After the logger initialization line where `logger =
logging.getLogger(__name__)` is defined, add an `__all__` list that includes the
three exported symbols: NoOpWithBias, NoOpRMSNorm, and mbridge_patcher. This
declaration clarifies the module's public surface and follows coding guidelines
for proper API documentation.

Source: Coding guidelines

examples/puzzletron/distillation/distill.py (2)

616-623: 💤 Low value

Redundant distributed teardown.

The torch.distributed.destroy_process_group() is already handled in run_entrypoint() (line 176-177 in _common.py). This duplicate call is safe (it checks is_initialized()) but unnecessary.

♻️ Proposed fix

     try:
         distill(config)
     except Exception as e:
         logger.error("Error during distillation: %s", e)
         raise e
-    finally:
-        if torch.distributed.is_initialized():
-            torch.distributed.destroy_process_group()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/puzzletron/distillation/distill.py` around lines 616 - 623, The
finally block in the distill function that calls
torch.distributed.destroy_process_group() is redundant because this cleanup is
already handled by run_entrypoint() in _common.py. Remove the entire finally
block (which includes the is_initialized() check and destroy_process_group()
call) from the distill function, as the distributed process group teardown will
still be properly managed at the higher level in run_entrypoint().

151-152: 💤 Low value

Star imports pollute the module namespace.

The star imports from modelopt.torch.puzzletron.anymodel.converter and model_descriptor can pollute the namespace and make it unclear which names are used. Consider importing only the specific symbols needed.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/puzzletron/distillation/distill.py` around lines 151 - 152, The star
imports from modelopt.torch.puzzletron.anymodel.converter and
modelopt.torch.puzzletron.anymodel.model_descriptor pollute the namespace and
obscure which symbols are actually used. Replace both star import statements
with explicit imports of only the specific symbols needed from each module.
Identify which classes, functions, or constants from converter and
model_descriptor are actually used in this file and import them by name instead
of using the wildcard syntax.

examples/puzzletron/distillation/gpt_oss_bridge.py (1)

188-202: 💤 Low value

Potential variable shadowing with num_experts and ep_size.

Lines 188-190 calculate num_experts from self.hf_config.num_local_experts, but lines 199-202 immediately override these values from block_config. The initial calculation at lines 188-190 is never used.

Consider removing the dead code:

♻️ Proposed fix

     def maybe_modify_converted_hf_weight(
         self,
         task: WeightConversionTask,
         converted_weights_dict: dict[str, torch.Tensor],
         hf_state_dict: Mapping[str, torch.Tensor],
     ) -> dict[str, torch.Tensor]:
-        num_experts = self.hf_config.num_local_experts
-        ep_size = parallel_state.get_expert_model_parallel_world_size()
-        experts_per_rank = num_experts // ep_size
-
         try:
             layer_idx = extract_layer_idx_from_param(task.param_name)
             expert_number = extract_expert_number_from_param(task.param_name)
         except ValueError:
             # not an expert weight
             return converted_weights_dict

         block_config = self.hf_config.block_configs[layer_idx]
         num_experts = block_config["ffn"]["moe"]["num_local_experts"]
         ep_size = parallel_state.get_expert_model_parallel_world_size()
         experts_per_rank = num_experts // ep_size

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/puzzletron/distillation/gpt_oss_bridge.py` around lines 188 - 202,
Remove the dead code that calculates num_experts, ep_size, and experts_per_rank
from self.hf_config.num_local_experts before the try-except block. These
variables are shadowed and immediately recalculated after the
extract_layer_idx_from_param and extract_expert_number_from_param calls using
block_config instead, making the initial calculations at lines 188-190 unused.
Delete the redundant variable assignments and keep only the recalculation that
uses block_config["ffn"]["moe"]["num_local_experts"].

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/puzzletron/distillation/__init__.py`:
- Around line 36-55: The import statements at lines 36-43 use absolute imports
(block_config_utils, layer_patchers, provider_patch) which can break when
importing the distillation package as a whole. Convert these to package-relative
imports by prepending a dot (.) before each module name (from
.block_config_utils import ..., from .layer_patchers import ..., from
.provider_patch import ...). This ensures the imports are resolved relative to
the current package and follows the repository __init__.py API pattern using
relative star re-exports.

In `@examples/puzzletron/distillation/block_config_utils.py`:
- Line 177: The parameter moe_shared_expert_intermediate_omitted is defined and
documented in the docstring but never used in the function body. Either
implement the MoE handling logic described in the docstring that should use this
flag to control whether shared experts are cleared when the key was omitted from
JSON block_configs, or remove the parameter entirely if the functionality is not
needed. If implementing, add the conditional logic in the MoE handling section
that checks this flag and clears shared experts accordingly.
- Around line 130-137: The condition checking if converter_cls is None will
never be true because ConverterFactory.get() returns the input string itself
when the converter is not found, not None. Replace the `if converter_cls is
None:` check with a condition that verifies converter_cls is actually a class or
callable (not a string), such as checking if it is not a string type or if it is
callable. This will properly catch cases where an invalid converter_name is
provided and prevent the AttributeError that would occur when trying to call
create_block_configs_from_main_config() on a string object.

In `@examples/puzzletron/distillation/export_to_hf.py`:
- Around line 209-211: Update the help text string that describes the student
model key argument to reference the correct flag name. Change the mention of
`--student-checkpoint` to `--student-hf-checkpoint` in the help text to match
the actual CLI argument name, ensuring the help documentation is accurate and
not misleading to users.
- Around line 155-166: The copy loop that iterates over file names and uses
shutil.copy() will crash when student_path is a remote HF repo id or when
optional files like chat_template.jinja are missing. Before attempting to copy
each file in the loop where src is constructed as Path(student_path) / fname,
check if the source file exists using Path.exists(). For optional configuration
files, skip the copy operation gracefully if the source file doesn't exist
instead of raising FileNotFoundError. Ensure the success message is only printed
when a file was actually copied.

In `@examples/puzzletron/distillation/provider_patch.py`:
- Line 210: Replace all implicit relative imports with explicit relative imports
using dot notation to ensure the modules can be imported correctly regardless of
execution context. In examples/puzzletron/distillation/provider_patch.py at
lines 210 and 329, change `from layer_patchers import mbridge_patcher` to `from
.layer_patchers import mbridge_patcher`. In
examples/puzzletron/distillation/_common.py at lines 105-107, change `from
block_config_utils import load_block_configs` to `from .block_config_utils
import load_block_configs`. In examples/puzzletron/distillation/distill.py at
lines 467-473, change all import statements like `from model_bridge_patch import
...` and `from provider_patch import ...` to use explicit relative imports by
adding a leading dot prefix (e.g., `from .model_bridge_patch import ...` and
`from .provider_patch import ...`).

In `@examples/puzzletron/distillation/README.md`:
- Around line 169-186: The opening code fence in the README.md file does not
include a language identifier. Add a language specifier (such as `text`) to the
opening triple backticks of the fenced code block containing the directory tree
structure to ensure markdown linting compliance and maintain documentation
tooling standards.

---

Nitpick comments:
In `@examples/puzzletron/distillation/block_config_utils.py`:
- Around line 42-52: The module is missing an `__all__` declaration to
explicitly define its public API. Add an `__all__` list after the import
statements (after the logger initialization) that includes the four public
symbols: MCoreLayerOverrides, load_block_configs,
block_config_to_mcore_overrides, and get_overrides_for_layer. This makes the
module's public interface explicit and aligns with coding guidelines.

In `@examples/puzzletron/distillation/distill.py`:
- Around line 616-623: The finally block in the distill function that calls
torch.distributed.destroy_process_group() is redundant because this cleanup is
already handled by run_entrypoint() in _common.py. Remove the entire finally
block (which includes the is_initialized() check and destroy_process_group()
call) from the distill function, as the distributed process group teardown will
still be properly managed at the higher level in run_entrypoint().
- Around line 151-152: The star imports from
modelopt.torch.puzzletron.anymodel.converter and
modelopt.torch.puzzletron.anymodel.model_descriptor pollute the namespace and
obscure which symbols are actually used. Replace both star import statements
with explicit imports of only the specific symbols needed from each module.
Identify which classes, functions, or constants from converter and
model_descriptor are actually used in this file and import them by name instead
of using the wildcard syntax.

In `@examples/puzzletron/distillation/export_to_hf.py`:
- Around line 112-113: Move the deferred imports out of the main() function and
place them at the module scope with other imports at the top of the file. The
import statement `from provider_patch import apply_distillation_patch,
apply_patch` at line 112-113 and the other deferred import at line 135 should
both be relocated to the module-level import section unless there is a
documented reason (such as circular dependency, optional dependency, or heavy
import deferral) for keeping them deferred—in which case, add an explicit inline
comment justifying the deferred import at that location.

In `@examples/puzzletron/distillation/gpt_oss_bridge.py`:
- Around line 188-202: Remove the dead code that calculates num_experts,
ep_size, and experts_per_rank from self.hf_config.num_local_experts before the
try-except block. These variables are shadowed and immediately recalculated
after the extract_layer_idx_from_param and extract_expert_number_from_param
calls using block_config instead, making the initial calculations at lines
188-190 unused. Delete the redundant variable assignments and keep only the
recalculation that uses block_config["ffn"]["moe"]["num_local_experts"].

In `@examples/puzzletron/distillation/layer_patchers.py`:
- Around line 43-59: The module is missing an `__all__` declaration to
explicitly define its public API. After the logger initialization line where
`logger = logging.getLogger(__name__)` is defined, add an `__all__` list that
includes the three exported symbols: NoOpWithBias, NoOpRMSNorm, and
mbridge_patcher. This declaration clarifies the module's public surface and
follows coding guidelines for proper API documentation.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 18de3e9f-6e82-4675-b39f-0cf70e941dfe

📥 Commits

Reviewing files that changed from the base of the PR and between 9f6e8fd and 48cb89a.

📒 Files selected for processing (13)

examples/puzzletron/README.md
examples/puzzletron/distillation/README.md
examples/puzzletron/distillation/__init__.py
examples/puzzletron/distillation/_common.py
examples/puzzletron/distillation/block_config_utils.py
examples/puzzletron/distillation/distill.py
examples/puzzletron/distillation/export_to_hf.py
examples/puzzletron/distillation/gpt_oss_bridge.py
examples/puzzletron/distillation/kd-container-default.yaml
examples/puzzletron/distillation/layer_patchers.py
examples/puzzletron/distillation/model_bridge_patch.py
examples/puzzletron/distillation/provider_patch.py
pyproject.toml

coderabbitai · 2026-06-15T09:39:41Z

+        2. Detects whether this is a Mamba/hybrid provider (to set ``apply_no_ops``).
+        3. Activates ``mbridge_patcher`` and delegates to the original ``provide()``.
+    """
+    from layer_patchers import mbridge_patcher


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Implicit relative imports may fail depending on execution context. All three files use implicit relative imports (e.g., from layer_patchers import ... instead of from .layer_patchers import ...). These work when the script directory is in sys.path but will fail if the modules are imported as part of a package from a different working directory.

examples/puzzletron/distillation/provider_patch.py#L210-L210: Change from layer_patchers import mbridge_patcher to from .layer_patchers import mbridge_patcher (also at line 329).

examples/puzzletron/distillation/_common.py#L105-L107: Change from block_config_utils import load_block_configs to from .block_config_utils import load_block_configs.

examples/puzzletron/distillation/distill.py#L467-L473: Change from model_bridge_patch import ... and from provider_patch import ... to use explicit relative imports with the dot prefix.

📍 Affects 3 files

examples/puzzletron/distillation/provider_patch.py#L210-L210 (this comment)

examples/puzzletron/distillation/_common.py#L105-L107

examples/puzzletron/distillation/distill.py#L467-L473

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/puzzletron/distillation/provider_patch.py` at line 210, Replace all implicit relative imports with explicit relative imports using dot notation to ensure the modules can be imported correctly regardless of execution context. In examples/puzzletron/distillation/provider_patch.py at lines 210 and 329, change `from layer_patchers import mbridge_patcher` to `from .layer_patchers import mbridge_patcher`. In examples/puzzletron/distillation/_common.py at lines 105-107, change `from block_config_utils import load_block_configs` to `from .block_config_utils import load_block_configs`. In examples/puzzletron/distillation/distill.py at lines 467-473, change all import statements like `from model_bridge_patch import ...` and `from provider_patch import ...` to use explicit relative imports by adding a leading dot prefix (e.g., `from .model_bridge_patch import ...` and `from .provider_patch import ...`).

kevalmorabia97 · 2026-06-15T10:09:00Z

we have been using https://github.com/NVIDIA/Model-Optimizer/blob/main/examples/megatron_bridge/distill.py for puzzletron anymodel distillation already. Why do we now need all this extra patching logic?

MCore heterogeneous support is not general e.g. it does not support mamba stack. With the patching we can get independent from mcore support, and rely only on whether mbridge supports a model or not - the whole heterogeneous logic is done with patching

Signed-off-by: mchochowski <mchochowski@nvidia.com>

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 3

🧹 Nitpick comments (1)

examples/puzzletron/distillation/kd-container-qwen.yaml (1)
24-24: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Use YAML anchor to avoid duplicating seq_length.

The seq_length value 4096 is duplicated on lines 24 and 93. If these values need to change, both locations must be updated, creating a maintenance risk. The nemotron3 config demonstrates the preferred pattern using YAML anchors and aliases.
♻️ Proposed fix using YAML anchor pattern
- seq_length: 4096
+ seq_length: &seqlen 4096
And later in the dataset section:
  dataset:
-   seq_length: 4096
+   seq_length: *seqlen
Also applies to: 93-93
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/puzzletron/distillation/kd-container-qwen.yaml` at line 24, The
seq_length value 4096 appears in multiple places in the YAML file, creating
maintenance risk. Define a YAML anchor on the first occurrence of seq_length
(around line 24) by adding &seq_length_anchor after the value, then replace the
duplicate seq_length value on line 93 with an alias reference using
*seq_length_anchor. This ensures a single source of truth for the seq_length
configuration value throughout the file.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/puzzletron/distillation/kd-container-nemotron3.yaml`:
- Line 15: The field name `hetereogenous_dist_checkpoint` contains a typo and
should be corrected to `heterogeneous_dist_checkpoint`. Locate the field in the
YAML configuration and fix the spelling by replacing `hetereogenous` with
`heterogeneous`. This will ensure the configuration field is properly recognized
by the system and prevent validation failures or the field being silently
ignored at runtime.

In `@examples/puzzletron/distillation/kd-container-qwen.yaml`:
- Line 95: The configuration file has inconsistent seed values for
reproducibility: the dataset section uses random_seed set to 1234 while the RNG
config section uses seed set to 42. To fix this, adopt a YAML anchor pattern by
defining a seed value at the top of the file (e.g., seed_value: &SEED 1234) and
then reference it in both locations using *SEED. Alternatively, choose a single
seed value and update both the random_seed in the dataset configuration and the
seed in the RNG configuration to use the same value consistently throughout the
file.
- Line 15: The field name hetereogenous_dist_checkpoint contains a spelling
error in the YAML configuration file. Correct the spelling by changing
hetereogenous_dist_checkpoint to heterogeneous_dist_checkpoint (the correct
spelling is heterogeneous, not hetereogenous, with an 'e' after the 't' and
before the 'r'). This will ensure the configuration field is properly recognized
at runtime and validation passes correctly.

---

Nitpick comments:
In `@examples/puzzletron/distillation/kd-container-qwen.yaml`:
- Line 24: The seq_length value 4096 appears in multiple places in the YAML
file, creating maintenance risk. Define a YAML anchor on the first occurrence of
seq_length (around line 24) by adding &seq_length_anchor after the value, then
replace the duplicate seq_length value on line 93 with an alias reference using
*seq_length_anchor. This ensures a single source of truth for the seq_length
configuration value throughout the file.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 58c84ba6-41f2-4e1d-8788-a57f2a683ed2

📥 Commits

Reviewing files that changed from the base of the PR and between 48cb89a and 97e0135.

📒 Files selected for processing (7)

examples/puzzletron/distillation/README.md
examples/puzzletron/distillation/block_config_utils.py
examples/puzzletron/distillation/gpt_oss_bridge.py
examples/puzzletron/distillation/kd-container-default.yaml
examples/puzzletron/distillation/kd-container-llama.yaml
examples/puzzletron/distillation/kd-container-nemotron3.yaml
examples/puzzletron/distillation/kd-container-qwen.yaml

✅ Files skipped from review due to trivial changes (2)

examples/puzzletron/distillation/kd-container-llama.yaml
examples/puzzletron/distillation/README.md

🚧 Files skipped from review as they are similar to previous changes (3)

examples/puzzletron/distillation/block_config_utils.py
examples/puzzletron/distillation/kd-container-default.yaml
examples/puzzletron/distillation/gpt_oss_bridge.py

coderabbitai · 2026-06-22T08:59:50Z

+# ---------------------------------------------------------------------------
+model:
+  # Heterogeneous AnyModel / per-layer MoE: distinct checkpoint keys per global layer.
+  hetereogenous_dist_checkpoint: true


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fix the typo in hetereogenous_dist_checkpoint.

The field name contains a typo: hetereogenous_dist_checkpoint should be heterogeneous_dist_checkpoint. This misspelling may cause configuration validation failures or the field to be silently ignored at runtime.

📝 Proposed fix

- hetereogenous_dist_checkpoint: true + heterogeneous_dist_checkpoint: true

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

hetereogenous_dist_checkpoint: true

heterogeneous_dist_checkpoint: true

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/puzzletron/distillation/kd-container-qwen.yaml` at line 15, The field name hetereogenous_dist_checkpoint contains a spelling error in the YAML configuration file. Correct the spelling by changing hetereogenous_dist_checkpoint to heterogeneous_dist_checkpoint (the correct spelling is heterogeneous, not hetereogenous, with an 'e' after the 't' and before the 'r'). This will ensure the configuration field is properly recognized at runtime and validation passes correctly.

- Changed the README.md code block formatting to `text`. - Added `__all__` exports in `layer_patchers.py` and `block_config_utils.py` for better module visibility. - Removed unused imports in `distill.py` and streamlined the error handling in the `main` function. - Updated `export_to_hf.py` to avoid redundant imports and added a check for file existence before copying. - Minor adjustments in `gpt_oss_bridge.py` to remove unnecessary expert-related calculations. These changes enhance code clarity and maintainability across the distillation examples. Signed-off-by: mchochowski <mchochowski@nvidia.com>

Signed-off-by: mchochowski <mchochowski@nvidia.com>

Add heterogeneous AnyModel distillation example for Puzzletron.

48cb89a

Ports examples/puzzletron/distillation from the puzzletron_distillation branch with SPDX headers, nemo2 registry fix, and puzzletron README link. Signed-off-by: mchochowski <mchochowski@nvidia.com>

chochowski requested review from a team as code owners June 15, 2026 09:27

chochowski requested a review from kevalmorabia97 June 15, 2026 09:27

chochowski linked an issue Jun 15, 2026 that may be closed by this pull request

Add MBridge dsitillation for puzzletron #1679

Open

coderabbitai Bot reviewed Jun 15, 2026

View reviewed changes

kevalmorabia97 reviewed Jun 15, 2026

View reviewed changes

distillation configuration examples

97e0135

Signed-off-by: mchochowski <mchochowski@nvidia.com>

coderabbitai Bot reviewed Jun 22, 2026

View reviewed changes

chochowski added 3 commits June 22, 2026 03:22

unify the random seed

22cda76

Signed-off-by: mchochowski <mchochowski@nvidia.com>

unify seqlen and seed anchors across the yamls

b48c43f

Signed-off-by: mchochowski <mchochowski@nvidia.com>

	hetereogenous_dist_checkpoint: true
	heterogeneous_dist_checkpoint: true

Conversation

chochowski commented Jun 15, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Usage

Testing

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

codecov Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kevalmorabia97 Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

chochowski Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

chochowski commented Jun 15, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 15, 2026 •

edited

Loading

codecov Bot commented Jun 15, 2026 •

edited

Loading