[DRAFT] Add heterogeneous AnyModel distillation example for Puzzletron.#1725
[DRAFT] Add heterogeneous AnyModel distillation example for Puzzletron.#1725chochowski wants to merge 5 commits into
Conversation
Ports examples/puzzletron/distillation from the puzzletron_distillation branch with SPDX headers, nemo2 registry fix, and puzzletron README link. Signed-off-by: mchochowski <mchochowski@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughAdds a new ChangesHeterogeneous Puzzletron distillation example
Sequence Diagram(s)sequenceDiagram
participant CLI
participant distill_main as distill.py main()
participant _common
participant model_bridge_patch
participant provider_patch
participant DistillationProvider
participant mbridge_patcher
participant MCore_build as MCore build_module
participant Bridge_distill as Bridge.distill()
CLI->>distill_main: student/teacher keys, checkpoint paths, YAML + CLI overrides
distill_main->>_common: _load_hf_config → _get_block_configs (student + teacher)
distill_main->>model_bridge_patch: apply_patch() — heterogeneous MoE resolution
distill_main->>provider_patch: apply_patch() + apply_distillation_patch()
distill_main->>_common: _load_bridge + _build_provider (student + teacher)
distill_main->>DistillationProvider: set_student_block_configs(block_configs)
distill_main->>distill_main: _build_distill_config_container → merge YAML + CLI overrides
distill_main->>distill_main: _sync_teacher_config_from_student
distill_main->>distill_main: _install_hybrid_moe_aux_loss_size_fix (patch track_moe_metrics)
distill_main->>Bridge_distill: distill(config)
Bridge_distill->>DistillationProvider: provide(config)
DistillationProvider->>mbridge_patcher: enter(block_configs, num_heads, hidden_size)
mbridge_patcher->>MCore_build: patched — apply MCoreLayerOverrides + NoOpWithBias per layer
MCore_build-->>mbridge_patcher: layer built with overrides
mbridge_patcher-->>DistillationProvider: exit, restore build_module
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1725 +/- ##
==========================================
- Coverage 77.09% 75.20% -1.90%
==========================================
Files 511 511
Lines 56168 58240 +2072
==========================================
+ Hits 43302 43798 +496
- Misses 12866 14442 +1576
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 7
🧹 Nitpick comments (6)
examples/puzzletron/distillation/export_to_hf.py (1)
112-113: ⚡ Quick winMove deferred imports to module scope, or explicitly justify them inline.
Imports inside
main()currently have no documented reason (circular dependency, optional dependency, or heavy import deferral), so this deviates from the repository import rule.As per coding guidelines, imports should stay at module top unless delayed import is necessary and explicitly justified.
Also applies to: 135-135
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/puzzletron/distillation/export_to_hf.py` around lines 112 - 113, Move the deferred imports out of the main() function and place them at the module scope with other imports at the top of the file. The import statement `from provider_patch import apply_distillation_patch, apply_patch` at line 112-113 and the other deferred import at line 135 should both be relocated to the module-level import section unless there is a documented reason (such as circular dependency, optional dependency, or heavy import deferral) for keeping them deferred—in which case, add an explicit inline comment justifying the deferred import at that location.Source: Coding guidelines
examples/puzzletron/distillation/block_config_utils.py (1)
42-52: 💤 Low valueConsider adding
__all__to declare the public API.Per coding guidelines, each module should declare its public surface with
__all__at the top. This module exportsMCoreLayerOverrides,load_block_configs,block_config_to_mcore_overrides, andget_overrides_for_layer.Suggested addition after imports
__all__ = [ "MCoreLayerOverrides", "load_block_configs", "block_config_to_mcore_overrides", "get_overrides_for_layer", ]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/puzzletron/distillation/block_config_utils.py` around lines 42 - 52, The module is missing an `__all__` declaration to explicitly define its public API. Add an `__all__` list after the import statements (after the logger initialization) that includes the four public symbols: MCoreLayerOverrides, load_block_configs, block_config_to_mcore_overrides, and get_overrides_for_layer. This makes the module's public interface explicit and aligns with coding guidelines.Source: Coding guidelines
examples/puzzletron/distillation/layer_patchers.py (1)
43-59: 💤 Low valueConsider adding
__all__to declare the public API.This module exports
NoOpWithBias,NoOpRMSNorm, andmbridge_patcheras public symbols. Per coding guidelines, each module should declare its public surface with__all__.Suggested addition after imports
__all__ = [ "NoOpWithBias", "NoOpRMSNorm", "mbridge_patcher", ]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/puzzletron/distillation/layer_patchers.py` around lines 43 - 59, The module is missing an `__all__` declaration to explicitly define its public API. After the logger initialization line where `logger = logging.getLogger(__name__)` is defined, add an `__all__` list that includes the three exported symbols: NoOpWithBias, NoOpRMSNorm, and mbridge_patcher. This declaration clarifies the module's public surface and follows coding guidelines for proper API documentation.Source: Coding guidelines
examples/puzzletron/distillation/distill.py (2)
616-623: 💤 Low valueRedundant distributed teardown.
The
torch.distributed.destroy_process_group()is already handled inrun_entrypoint()(line 176-177 in_common.py). This duplicate call is safe (it checksis_initialized()) but unnecessary.♻️ Proposed fix
try: distill(config) except Exception as e: logger.error("Error during distillation: %s", e) raise e - finally: - if torch.distributed.is_initialized(): - torch.distributed.destroy_process_group()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/puzzletron/distillation/distill.py` around lines 616 - 623, The finally block in the distill function that calls torch.distributed.destroy_process_group() is redundant because this cleanup is already handled by run_entrypoint() in _common.py. Remove the entire finally block (which includes the is_initialized() check and destroy_process_group() call) from the distill function, as the distributed process group teardown will still be properly managed at the higher level in run_entrypoint().
151-152: 💤 Low valueStar imports pollute the module namespace.
The star imports from
modelopt.torch.puzzletron.anymodel.converterandmodel_descriptorcan pollute the namespace and make it unclear which names are used. Consider importing only the specific symbols needed.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/puzzletron/distillation/distill.py` around lines 151 - 152, The star imports from modelopt.torch.puzzletron.anymodel.converter and modelopt.torch.puzzletron.anymodel.model_descriptor pollute the namespace and obscure which symbols are actually used. Replace both star import statements with explicit imports of only the specific symbols needed from each module. Identify which classes, functions, or constants from converter and model_descriptor are actually used in this file and import them by name instead of using the wildcard syntax.examples/puzzletron/distillation/gpt_oss_bridge.py (1)
188-202: 💤 Low valuePotential variable shadowing with
num_expertsandep_size.Lines 188-190 calculate
num_expertsfromself.hf_config.num_local_experts, but lines 199-202 immediately override these values fromblock_config. The initial calculation at lines 188-190 is never used.Consider removing the dead code:
♻️ Proposed fix
def maybe_modify_converted_hf_weight( self, task: WeightConversionTask, converted_weights_dict: dict[str, torch.Tensor], hf_state_dict: Mapping[str, torch.Tensor], ) -> dict[str, torch.Tensor]: - num_experts = self.hf_config.num_local_experts - ep_size = parallel_state.get_expert_model_parallel_world_size() - experts_per_rank = num_experts // ep_size - try: layer_idx = extract_layer_idx_from_param(task.param_name) expert_number = extract_expert_number_from_param(task.param_name) except ValueError: # not an expert weight return converted_weights_dict block_config = self.hf_config.block_configs[layer_idx] num_experts = block_config["ffn"]["moe"]["num_local_experts"] ep_size = parallel_state.get_expert_model_parallel_world_size() experts_per_rank = num_experts // ep_size🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/puzzletron/distillation/gpt_oss_bridge.py` around lines 188 - 202, Remove the dead code that calculates num_experts, ep_size, and experts_per_rank from self.hf_config.num_local_experts before the try-except block. These variables are shadowed and immediately recalculated after the extract_layer_idx_from_param and extract_expert_number_from_param calls using block_config instead, making the initial calculations at lines 188-190 unused. Delete the redundant variable assignments and keep only the recalculation that uses block_config["ffn"]["moe"]["num_local_experts"].
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/puzzletron/distillation/__init__.py`:
- Around line 36-55: The import statements at lines 36-43 use absolute imports
(block_config_utils, layer_patchers, provider_patch) which can break when
importing the distillation package as a whole. Convert these to package-relative
imports by prepending a dot (.) before each module name (from
.block_config_utils import ..., from .layer_patchers import ..., from
.provider_patch import ...). This ensures the imports are resolved relative to
the current package and follows the repository __init__.py API pattern using
relative star re-exports.
In `@examples/puzzletron/distillation/block_config_utils.py`:
- Line 177: The parameter moe_shared_expert_intermediate_omitted is defined and
documented in the docstring but never used in the function body. Either
implement the MoE handling logic described in the docstring that should use this
flag to control whether shared experts are cleared when the key was omitted from
JSON block_configs, or remove the parameter entirely if the functionality is not
needed. If implementing, add the conditional logic in the MoE handling section
that checks this flag and clears shared experts accordingly.
- Around line 130-137: The condition checking if converter_cls is None will
never be true because ConverterFactory.get() returns the input string itself
when the converter is not found, not None. Replace the `if converter_cls is
None:` check with a condition that verifies converter_cls is actually a class or
callable (not a string), such as checking if it is not a string type or if it is
callable. This will properly catch cases where an invalid converter_name is
provided and prevent the AttributeError that would occur when trying to call
create_block_configs_from_main_config() on a string object.
In `@examples/puzzletron/distillation/export_to_hf.py`:
- Around line 209-211: Update the help text string that describes the student
model key argument to reference the correct flag name. Change the mention of
`--student-checkpoint` to `--student-hf-checkpoint` in the help text to match
the actual CLI argument name, ensuring the help documentation is accurate and
not misleading to users.
- Around line 155-166: The copy loop that iterates over file names and uses
shutil.copy() will crash when student_path is a remote HF repo id or when
optional files like chat_template.jinja are missing. Before attempting to copy
each file in the loop where src is constructed as Path(student_path) / fname,
check if the source file exists using Path.exists(). For optional configuration
files, skip the copy operation gracefully if the source file doesn't exist
instead of raising FileNotFoundError. Ensure the success message is only printed
when a file was actually copied.
In `@examples/puzzletron/distillation/provider_patch.py`:
- Line 210: Replace all implicit relative imports with explicit relative imports
using dot notation to ensure the modules can be imported correctly regardless of
execution context. In examples/puzzletron/distillation/provider_patch.py at
lines 210 and 329, change `from layer_patchers import mbridge_patcher` to `from
.layer_patchers import mbridge_patcher`. In
examples/puzzletron/distillation/_common.py at lines 105-107, change `from
block_config_utils import load_block_configs` to `from .block_config_utils
import load_block_configs`. In examples/puzzletron/distillation/distill.py at
lines 467-473, change all import statements like `from model_bridge_patch import
...` and `from provider_patch import ...` to use explicit relative imports by
adding a leading dot prefix (e.g., `from .model_bridge_patch import ...` and
`from .provider_patch import ...`).
In `@examples/puzzletron/distillation/README.md`:
- Around line 169-186: The opening code fence in the README.md file does not
include a language identifier. Add a language specifier (such as `text`) to the
opening triple backticks of the fenced code block containing the directory tree
structure to ensure markdown linting compliance and maintain documentation
tooling standards.
---
Nitpick comments:
In `@examples/puzzletron/distillation/block_config_utils.py`:
- Around line 42-52: The module is missing an `__all__` declaration to
explicitly define its public API. Add an `__all__` list after the import
statements (after the logger initialization) that includes the four public
symbols: MCoreLayerOverrides, load_block_configs,
block_config_to_mcore_overrides, and get_overrides_for_layer. This makes the
module's public interface explicit and aligns with coding guidelines.
In `@examples/puzzletron/distillation/distill.py`:
- Around line 616-623: The finally block in the distill function that calls
torch.distributed.destroy_process_group() is redundant because this cleanup is
already handled by run_entrypoint() in _common.py. Remove the entire finally
block (which includes the is_initialized() check and destroy_process_group()
call) from the distill function, as the distributed process group teardown will
still be properly managed at the higher level in run_entrypoint().
- Around line 151-152: The star imports from
modelopt.torch.puzzletron.anymodel.converter and
modelopt.torch.puzzletron.anymodel.model_descriptor pollute the namespace and
obscure which symbols are actually used. Replace both star import statements
with explicit imports of only the specific symbols needed from each module.
Identify which classes, functions, or constants from converter and
model_descriptor are actually used in this file and import them by name instead
of using the wildcard syntax.
In `@examples/puzzletron/distillation/export_to_hf.py`:
- Around line 112-113: Move the deferred imports out of the main() function and
place them at the module scope with other imports at the top of the file. The
import statement `from provider_patch import apply_distillation_patch,
apply_patch` at line 112-113 and the other deferred import at line 135 should
both be relocated to the module-level import section unless there is a
documented reason (such as circular dependency, optional dependency, or heavy
import deferral) for keeping them deferred—in which case, add an explicit inline
comment justifying the deferred import at that location.
In `@examples/puzzletron/distillation/gpt_oss_bridge.py`:
- Around line 188-202: Remove the dead code that calculates num_experts,
ep_size, and experts_per_rank from self.hf_config.num_local_experts before the
try-except block. These variables are shadowed and immediately recalculated
after the extract_layer_idx_from_param and extract_expert_number_from_param
calls using block_config instead, making the initial calculations at lines
188-190 unused. Delete the redundant variable assignments and keep only the
recalculation that uses block_config["ffn"]["moe"]["num_local_experts"].
In `@examples/puzzletron/distillation/layer_patchers.py`:
- Around line 43-59: The module is missing an `__all__` declaration to
explicitly define its public API. After the logger initialization line where
`logger = logging.getLogger(__name__)` is defined, add an `__all__` list that
includes the three exported symbols: NoOpWithBias, NoOpRMSNorm, and
mbridge_patcher. This declaration clarifies the module's public surface and
follows coding guidelines for proper API documentation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 18de3e9f-6e82-4675-b39f-0cf70e941dfe
📒 Files selected for processing (13)
examples/puzzletron/README.mdexamples/puzzletron/distillation/README.mdexamples/puzzletron/distillation/__init__.pyexamples/puzzletron/distillation/_common.pyexamples/puzzletron/distillation/block_config_utils.pyexamples/puzzletron/distillation/distill.pyexamples/puzzletron/distillation/export_to_hf.pyexamples/puzzletron/distillation/gpt_oss_bridge.pyexamples/puzzletron/distillation/kd-container-default.yamlexamples/puzzletron/distillation/layer_patchers.pyexamples/puzzletron/distillation/model_bridge_patch.pyexamples/puzzletron/distillation/provider_patch.pypyproject.toml
| 2. Detects whether this is a Mamba/hybrid provider (to set ``apply_no_ops``). | ||
| 3. Activates ``mbridge_patcher`` and delegates to the original ``provide()``. | ||
| """ | ||
| from layer_patchers import mbridge_patcher |
There was a problem hiding this comment.
Implicit relative imports may fail depending on execution context. All three files use implicit relative imports (e.g., from layer_patchers import ... instead of from .layer_patchers import ...). These work when the script directory is in sys.path but will fail if the modules are imported as part of a package from a different working directory.
examples/puzzletron/distillation/provider_patch.py#L210-L210: Changefrom layer_patchers import mbridge_patchertofrom .layer_patchers import mbridge_patcher(also at line 329).examples/puzzletron/distillation/_common.py#L105-L107: Changefrom block_config_utils import load_block_configstofrom .block_config_utils import load_block_configs.examples/puzzletron/distillation/distill.py#L467-L473: Changefrom model_bridge_patch import ...andfrom provider_patch import ...to use explicit relative imports with the dot prefix.
📍 Affects 3 files
examples/puzzletron/distillation/provider_patch.py#L210-L210(this comment)examples/puzzletron/distillation/_common.py#L105-L107examples/puzzletron/distillation/distill.py#L467-L473
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/puzzletron/distillation/provider_patch.py` at line 210, Replace all
implicit relative imports with explicit relative imports using dot notation to
ensure the modules can be imported correctly regardless of execution context. In
examples/puzzletron/distillation/provider_patch.py at lines 210 and 329, change
`from layer_patchers import mbridge_patcher` to `from .layer_patchers import
mbridge_patcher`. In examples/puzzletron/distillation/_common.py at lines
105-107, change `from block_config_utils import load_block_configs` to `from
.block_config_utils import load_block_configs`. In
examples/puzzletron/distillation/distill.py at lines 467-473, change all import
statements like `from model_bridge_patch import ...` and `from provider_patch
import ...` to use explicit relative imports by adding a leading dot prefix
(e.g., `from .model_bridge_patch import ...` and `from .provider_patch import
...`).
There was a problem hiding this comment.
we have been using https://github.com/NVIDIA/Model-Optimizer/blob/main/examples/megatron_bridge/distill.py for puzzletron anymodel distillation already. Why do we now need all this extra patching logic?
There was a problem hiding this comment.
MCore heterogeneous support is not general e.g. it does not support mamba stack. With the patching we can get independent from mcore support, and rely only on whether mbridge supports a model or not - the whole heterogeneous logic is done with patching
Signed-off-by: mchochowski <mchochowski@nvidia.com>
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 3
🧹 Nitpick comments (1)
examples/puzzletron/distillation/kd-container-qwen.yaml (1)
24-24: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winUse YAML anchor to avoid duplicating
seq_length.The
seq_lengthvalue4096is duplicated on lines 24 and 93. If these values need to change, both locations must be updated, creating a maintenance risk. The nemotron3 config demonstrates the preferred pattern using YAML anchors and aliases.♻️ Proposed fix using YAML anchor pattern
- seq_length: 4096 + seq_length: &seqlen 4096And later in the dataset section:
dataset: - seq_length: 4096 + seq_length: *seqlenAlso applies to: 93-93
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/puzzletron/distillation/kd-container-qwen.yaml` at line 24, The seq_length value 4096 appears in multiple places in the YAML file, creating maintenance risk. Define a YAML anchor on the first occurrence of seq_length (around line 24) by adding &seq_length_anchor after the value, then replace the duplicate seq_length value on line 93 with an alias reference using *seq_length_anchor. This ensures a single source of truth for the seq_length configuration value throughout the file.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/puzzletron/distillation/kd-container-nemotron3.yaml`:
- Line 15: The field name `hetereogenous_dist_checkpoint` contains a typo and
should be corrected to `heterogeneous_dist_checkpoint`. Locate the field in the
YAML configuration and fix the spelling by replacing `hetereogenous` with
`heterogeneous`. This will ensure the configuration field is properly recognized
by the system and prevent validation failures or the field being silently
ignored at runtime.
In `@examples/puzzletron/distillation/kd-container-qwen.yaml`:
- Line 95: The configuration file has inconsistent seed values for
reproducibility: the dataset section uses random_seed set to 1234 while the RNG
config section uses seed set to 42. To fix this, adopt a YAML anchor pattern by
defining a seed value at the top of the file (e.g., seed_value: &SEED 1234) and
then reference it in both locations using *SEED. Alternatively, choose a single
seed value and update both the random_seed in the dataset configuration and the
seed in the RNG configuration to use the same value consistently throughout the
file.
- Line 15: The field name hetereogenous_dist_checkpoint contains a spelling
error in the YAML configuration file. Correct the spelling by changing
hetereogenous_dist_checkpoint to heterogeneous_dist_checkpoint (the correct
spelling is heterogeneous, not hetereogenous, with an 'e' after the 't' and
before the 'r'). This will ensure the configuration field is properly recognized
at runtime and validation passes correctly.
---
Nitpick comments:
In `@examples/puzzletron/distillation/kd-container-qwen.yaml`:
- Line 24: The seq_length value 4096 appears in multiple places in the YAML
file, creating maintenance risk. Define a YAML anchor on the first occurrence of
seq_length (around line 24) by adding &seq_length_anchor after the value, then
replace the duplicate seq_length value on line 93 with an alias reference using
*seq_length_anchor. This ensures a single source of truth for the seq_length
configuration value throughout the file.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 58c84ba6-41f2-4e1d-8788-a57f2a683ed2
📒 Files selected for processing (7)
examples/puzzletron/distillation/README.mdexamples/puzzletron/distillation/block_config_utils.pyexamples/puzzletron/distillation/gpt_oss_bridge.pyexamples/puzzletron/distillation/kd-container-default.yamlexamples/puzzletron/distillation/kd-container-llama.yamlexamples/puzzletron/distillation/kd-container-nemotron3.yamlexamples/puzzletron/distillation/kd-container-qwen.yaml
✅ Files skipped from review due to trivial changes (2)
- examples/puzzletron/distillation/kd-container-llama.yaml
- examples/puzzletron/distillation/README.md
🚧 Files skipped from review as they are similar to previous changes (3)
- examples/puzzletron/distillation/block_config_utils.py
- examples/puzzletron/distillation/kd-container-default.yaml
- examples/puzzletron/distillation/gpt_oss_bridge.py
| # --------------------------------------------------------------------------- | ||
| model: | ||
| # Heterogeneous AnyModel / per-layer MoE: distinct checkpoint keys per global layer. | ||
| hetereogenous_dist_checkpoint: true |
There was a problem hiding this comment.
Fix the typo in hetereogenous_dist_checkpoint.
The field name contains a typo: hetereogenous_dist_checkpoint should be heterogeneous_dist_checkpoint. This misspelling may cause configuration validation failures or the field to be silently ignored at runtime.
📝 Proposed fix
- hetereogenous_dist_checkpoint: true
+ heterogeneous_dist_checkpoint: true📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| hetereogenous_dist_checkpoint: true | |
| heterogeneous_dist_checkpoint: true |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/puzzletron/distillation/kd-container-qwen.yaml` at line 15, The
field name hetereogenous_dist_checkpoint contains a spelling error in the YAML
configuration file. Correct the spelling by changing
hetereogenous_dist_checkpoint to heterogeneous_dist_checkpoint (the correct
spelling is heterogeneous, not hetereogenous, with an 'e' after the 't' and
before the 'r'). This will ensure the configuration field is properly recognized
at runtime and validation passes correctly.
- Changed the README.md code block formatting to `text`. - Added `__all__` exports in `layer_patchers.py` and `block_config_utils.py` for better module visibility. - Removed unused imports in `distill.py` and streamlined the error handling in the `main` function. - Updated `export_to_hf.py` to avoid redundant imports and added a check for file existence before copying. - Minor adjustments in `gpt_oss_bridge.py` to remove unnecessary expert-related calculations. These changes enhance code clarity and maintainability across the distillation examples. Signed-off-by: mchochowski <mchochowski@nvidia.com>
Signed-off-by: mchochowski <mchochowski@nvidia.com>
Signed-off-by: mchochowski <mchochowski@nvidia.com>
Ports examples/puzzletron/distillation from the puzzletron_distillation branch with SPDX headers, nemo2 registry fix, and puzzletron README link.
What does this PR do?
This PR adds support for mbridge distillation of puzzled heterogeneous models
Usage
In general distillation requires more compute than normal training, thus ideally you should run this on a multinode-setting (unless very constrained scenarion)
Testing
TBD
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (
git commit -s -S).Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded
trust_remote_code=True,torch.load(..., weights_only=False),pickle, etc.).CONTRIBUTING.md: ✅ / ❌ / N/AAdditional Information
Summary by CodeRabbit