Update offline distillation and saving top-k teacher logits to be efficient by ajkv-google · Pull Request #3990 · AI-Hypercomputer/maxtext

ajkv-google · 2026-05-27T17:36:42Z

Description

This PR implements an optimized Offline Distillation pipeline in MaxText, allowing student models to train using pre-saved teacher logits to significantly reduce compute costs.

Key changes:

The save_top_k_teacher_logits.py script now supports parallel execution across hosts and uses asynchronous GCS uploads to keep the TPUs from idling.
Offline distillation now uses the Grain input pipeline instead of the custom offline ArrayRecord iterator. Logits are no longer read on a single thread, which speeds up offline distillation.
Introduced a sparse KL divergence in distillation_utils.py that, when running offline distillation, computes the KL divergence using only the teacher's top-k predictions.
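
To illustrate the sparse KL divergence described above, here is a minimal JAX sketch. It is not the code from distillation_utils.py: the function names `top_k_teacher_logits` and `sparse_kl` and their signatures are hypothetical, and the normalization choice (teacher over the top-k entries, student over the full vocabulary) simply mirrors the diff excerpt quoted in the review thread.

```python
import jax
import jax.numpy as jnp


def top_k_teacher_logits(teacher_logits, k):
  """Keeps the k largest raw teacher logits and their vocabulary ids per position."""
  values, indices = jax.lax.top_k(teacher_logits, k)
  return values, indices


def sparse_kl(student_logits, top_k_values, top_k_indices, temperature=1.0):
  """KL(teacher_top_k || student), summed over the teacher's top-k entries only."""
  # Teacher log-probs are normalized over the k stored entries (assumption of this sketch).
  log_t = jax.nn.log_softmax(top_k_values / temperature, axis=-1)
  # Student log-probs are normalized over the full vocabulary, then gathered at the teacher's ids.
  log_s_full = jax.nn.log_softmax(student_logits / temperature, axis=-1)
  log_s = jnp.take_along_axis(log_s_full, top_k_indices, axis=-1)
  return jnp.sum(jnp.exp(log_t) * (log_t - log_s), axis=-1)


# Toy example: one token position, vocabulary of 4, teacher and student share the same logits.
logits = jnp.array([[2.0, 1.0, 0.5, -1.0]])
values, indices = top_k_teacher_logits(logits, k=2)
loss = sparse_kl(logits, values, indices)
```

Only `values` and `indices` would need to be stored per token in this sketch, which is what keeps the saved files small compared with full-vocabulary outputs.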

Tests

Used the following to run offline distillation:

Command: https://paste.googleplex.com/4852897526448128
Yaml: https://paste.googleplex.com/4981440528908288
Example run for offline distillation (using our team's LTI checkpoint for 30b model): Tensorboard Link

Used the following to run saving of top-k teacher logits:

Command: https://paste.googleplex.com/6383382496935936
Yaml: https://paste.googleplex.com/5898105013796864
GS Bucket Path for saved logits (1B tokens of climbmix): Saved offline logits

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-05-27T17:41:54Z

Codecov Report

❌ Patch coverage is 40.81633% with 29 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/maxtext/input_pipeline/input_pipeline_utils.py | 25.00% | 7 Missing and 5 partials ⚠️ |
| ...ners/post_train/distillation/distillation_utils.py | 56.25% | 6 Missing and 1 partial ⚠️ |
| .../trainers/post_train/distillation/train_distill.py | 0.00% | 6 Missing ⚠️ |
| ...rc/maxtext/input_pipeline/grain_data_processing.py | 63.63% | 2 Missing and 2 partials ⚠️ |

vlad-karp · 2026-05-27T18:09:44Z

+      log_t_p_T_sparse = jax.nn.log_softmax(t_logits / temperature, axis=-1)
+
+      # 2. Student log-probs must be computed over the FULL vocabulary to be mathematically valid
+      log_s_T_full = jax.nn.log_softmax(s_logits / temperature, axis=-1)


Should we also compute the teacher softmax over the entire vocabulary before saving to files?

ajkv-google · 2026-05-28

We save the raw logits instead of the full softmax so that we can still adjust the temperature on the fly during training. Also, saving full-vocab probabilities would make the files bigger, which could reduce read speeds during offline training. For now, I think it makes sense to stick with the current approach: store only the information that is needed, and read it quickly during offline training.
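
As a small illustration of adjusting the temperature at training time, the sketch below rescales stored raw top-k logits before the softmax. The helper name `teacher_probs_from_saved_logits` and the example values are hypothetical and are not taken from the PR.

```python
import jax
import jax.numpy as jnp


def teacher_probs_from_saved_logits(saved_top_k_logits, temperature):
  """Turns stored raw top-k logits into top-k-normalized probabilities at a chosen temperature."""
  return jax.nn.softmax(saved_top_k_logits / temperature, axis=-1)


# Hypothetical stored top-3 logits for a single token position.
saved = jnp.array([[4.0, 2.0, 1.0]])
sharp = teacher_probs_from_saved_logits(saved, temperature=1.0)
soft = teacher_probs_from_saved_logits(saved, temperature=4.0)
```

The same stored array yields a sharper or flatter teacher distribution depending on the temperature chosen at load time, so nothing needs to be re-saved when the temperature changes.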

vlad-karp · 2026-05-28

"Also, saving full-vocab probabilities would make the files bigger": why do you consider saving the full vocab? The entire idea of offline distillation is to use a limited set of logits.
My concern is purely mathematical: you normalize the teacher only over the top-k, while the student logits are normalized over the entire vocabulary, and then you calculate the KL divergence between those two distributions, which have completely different normalization scales.
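
One way to make this concern concrete is the short derivation below. The notation is introduced here for illustration and does not appear in the PR: $p$ is the teacher's full-vocabulary distribution at the chosen temperature, $q$ is the student's, $K$ is the teacher's top-k index set, $M$ is the teacher probability mass inside $K$, and $\tilde p$ is the teacher renormalized over $K$ (what a softmax over only the stored logits produces).

```latex
M = \sum_{i \in K} p_i, \qquad \tilde p_i = \frac{p_i}{M} \quad (i \in K)

\mathcal{L}_{\text{sparse}}
  = \sum_{i \in K} \tilde p_i \left( \log \tilde p_i - \log q_i \right)
  = \frac{1}{M} \sum_{i \in K} p_i \log \frac{p_i}{q_i} \; - \; \log M

q = p \;\Longrightarrow\; \mathcal{L}_{\text{sparse}} = -\log M \;\ge\; 0,
\quad \text{with equality only when } M = 1
```

Under these assumptions the loss equals the KL divergence from the truncated teacher $\tilde p$ (extended with zeros outside $K$) to the student, so it is minimized at $q = \tilde p$ rather than at $q = p$. The $-\log M$ term does not depend on the student and therefore does not affect gradients, but the $1/M$ rescaling and the reassignment of the teacher's tail mass to the top-k entries do grow as $M$ falls below 1.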

ajkv-google added 10 commits May 26, 2026 18:53

Updated script to work in multihost setting for collecting and using logits offline

b601866

fixed sparsecore offloading issue by updating the offline arrayrecord

532f85a

updated code to be cleaner and more readable

2755bb8

Speed up offline distillation and saving top-k teacher logits

4e360de

updated train_distill to work with offline distillaiton

3013498

updated efficiency for offline distillation pipeline

83dc295

removed comments and updated to latest teacher saving logits code

61cc857

pdated offline distillation code to match latest code

550bc55

cleaned up comments

9f91eb4

removed offline distill run script

202db3c

ajkv-google requested a review from igorts-git as a code owner May 27, 2026 17:36

ajkv-google changed the title ~~Update offline distillation and saving top-k teacher logits to be efficient reliable~~ Update offline distillation and saving top-k teacher logits to be efficient May 27, 2026

vlad-karp reviewed May 27, 2026

ajkv-google added 6 commits May 28, 2026 17:50

removed redundancy in loss calculation

fa6fe2b

updated code formatting for maxtext guidelines

072cf7a

resolved merge conflict and cleaned up code

153e5fe

updated formatting for code readability

fe496c6

removed offline arrayrecord iterator test since we removed that in the new efficient version

1d941e7

fixed formatting for test file

4baa18e

ajkv-google wants to merge 16 commits into main from ajkv/offline-distillation-branch
