
Hyperparameter tuning, FocalDiceLoss, and 5M/10M cross-regime transfer evaluation#21

Open
sridhs21 wants to merge 7 commits into main from feature/hyperparameter-tuning

Conversation

Contributor

@sridhs21 sridhs21 commented May 7, 2026

Summary

This PR modifies one file (XPointMLTest.py) and adds three new files (optuna_tuner.py, test_xpoint_transfer.py, build_transfer_cache.py).

What was done:

  • Optuna-driven hyperparameter tuning (optuna_tuner.py) — TPE sampler + median pruner over base_channels, dropout, weight decay, learning rate, positive-patch ratio, focal/dice weighting, scheduler choice, and SWA start fraction. Replaces the prior ad-hoc grid where base_channels was hard-coded to 32 while tuners assumed 64.

  • FocalDiceLoss + LR scheduling + SWA — new loss combining focal cross-entropy and Dice with configurable α / γ / dice weight; linear LR warmup followed by cosine annealing (or ReduceLROnPlateau); optional Stochastic Weight Averaging with a custom BN-update step compatible with the dict-based dataloader.

  • Cross-regime transfer evaluation (test_xpoint_transfer.py) — loads the best PKPM-trained checkpoint and evaluates zero-shot on 5M and 10M Gkeyll datasets (150 frames each), producing per-dataset and combined summaries. Re-evaluates the PKPM validation set as an in-domain reference.

  • Cache build pipeline (build_transfer_cache.py) — precomputes the deterministic X-point finder for all 150 frames of each transfer dataset, so subsequent evaluation runs read .npy caches instead of re-parsing .gkyl files. Supports --workers N for parallel processing and RC_EXTRACT_DIR / RC_CACHE_BASE env-var path overrides for ramdisk staging.

  • Augmentation correctness fixes in XPointMLTest.py — brightness/contrast jitter is now applied globally (not per channel) so the physical identities Bx = ∂y ψ, By = -∂x ψ, Jz = -∇²ψ/μ₀ stay consistent across the four input channels; cutout no longer mutates cached frame tensors in place.

  • Profiling and minor perf in getPgkylData — per-stage [PROFILE] timings around compactRead, gradient computation, getCritPoints, and getXOPoints; Hessian is now packed from already-computed second derivatives and passed to getXOPoints(hessian=…) to avoid recomputing gradients.
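The focal-plus-Dice combination described above can be illustrated framework-agnostically. The actual FocalDiceLoss is presumably a PyTorch module; the sketch below uses NumPy for binary masks, and the names `alpha`, `gamma`, and `dice_weight` simply mirror the α / γ / dice-weight knobs from the bullet, with illustrative defaults.

```python
import numpy as np

def focal_dice_loss(probs, targets, alpha=0.25, gamma=2.0, dice_weight=0.5, eps=1e-7):
    """Sketch of a combined focal + Dice loss for binary segmentation.

    probs   -- predicted foreground probabilities in (0, 1)
    targets -- binary ground-truth mask, same shape as probs
    """
    probs = np.clip(probs, eps, 1.0 - eps)
    # Focal cross-entropy: down-weights easy examples via (1 - p_t)^gamma.
    p_t = np.where(targets == 1, probs, 1.0 - probs)
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    focal = -(alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)).mean()
    # Soft Dice: overlap-based term, largely insensitive to class imbalance.
    inter = (probs * targets).sum()
    dice = 1.0 - (2.0 * inter + eps) / (probs.sum() + targets.sum() + eps)
    return (1.0 - dice_weight) * focal + dice_weight * dice
```

Both terms push the same way (lower is better), so a confident, well-overlapping prediction scores strictly lower than a confidently wrong one.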
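The schedule from the same bullet (linear LR warmup followed by cosine annealing) has a simple closed form. In the PR this is presumably wired through a PyTorch scheduler; the standalone function below only illustrates the shape, and its argument names are hypothetical.

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, base_lr=1e-3, min_lr=0.0):
    """LR at `step`: linear ramp over `warmup_steps`, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        # Linear warmup from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine anneal over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```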
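A usage sketch for the cache build pipeline, combining the `--workers` flag and the env-var overrides named above; the paths are placeholders, not values from the PR.

```shell
# Stage extraction and cache output on a ramdisk (paths are examples only).
export RC_EXTRACT_DIR=/dev/shm/rc_extract   # where .tgz archives are unpacked
export RC_CACHE_BASE=/dev/shm/rc_cache      # where the .npy caches are written

# Run the deterministic X-point finder over all frames with 8 parallel workers.
python build_transfer_cache.py --workers 8
```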
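The augmentation fix (one jitter draw shared by all channels) can be sketched as follows. The function name and the contrast/brightness ranges are made up for illustration; the point is that a single shared draw keeps the four channels mutually consistent, whereas independent per-channel draws would rescale Bx, By, and Jz separately and break their derivative relations to ψ.

```python
import numpy as np

def global_intensity_jitter(frame, rng, contrast=0.1, brightness=0.1):
    """Apply ONE contrast/brightness draw to all channels of `frame`.

    frame -- array of shape (C, H, W), e.g. channels (psi, Bx, By, Jz).
    A per-channel draw would scale each channel independently and destroy
    the relations tying the derivative channels to psi; one shared draw
    perturbs all channels by the same affine map instead.
    """
    scale = 1.0 + rng.uniform(-contrast, contrast)   # one draw for every channel
    shift = rng.uniform(-brightness, brightness)
    return frame * scale + shift
```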

sridhs21 and others added 7 commits November 19, 2025 00:48
…osRatio, lossFunction, warmupEpochs, and swa so we can actually tune everything that was hard-coded before, especially base_channels, which was stuck at 32 while all the Optuna tuners assumed 64. Also added a FocalDiceLoss class that combines focal and Dice loss to help with the severe class imbalance, and hooked up linear LR warmup and stochastic weight averaging with a custom BN update that works with our dict-based dataloader. Created test_xpoint_transfer.py to evaluate our best PKPM-trained model on the 5M and 10M datasets, including a monkey patch for the double component-indexing bug in getData.py, since we can't modify files outside reconClassifier. Then made build_transfer_cache.py to precompute and cache the X-point finder results for all 150 frames of 5M and 10M data, so we don't have to wait 20 minutes per frame every time we want to run the transfer evaluation.
… RC_CACHE_BASE) for ramdisk staging, and repoint the transfer evaluation to the production checkpoint testdir_2026-04-02-13-23-05. XPointMLTest.py now profiles the getPgkylData stages and reuses precomputed second derivatives as the Hessian for getXOPoints.
Contributor

@cwsmith cwsmith left a comment


Thank you. A few comments are below.

Comment thread optuna_tuner.py
```shell
--xptCacheDir /path/to/cache \
--n-trials 50 \
--study-name xpoint-tuning \
--db sqlite:///optuna_xpoint.db
```
Contributor


Does Optuna automatically create the db, or are additional manual setup steps required?

Comment thread test_xpoint_transfer.py
Cross-domain inference: evaluate the best PKPM-trained model on 5M and 10M data.

This script:
1. Extracts 5M.tgz and 10M.tgz (if not already extracted)
Contributor


We should force this to use the cache if XPointMLTest.py requires it.

Comment thread XPointMLTest.py
```python
specify the path to the parameter txt file, the parent
directory of that file must contain the gkyl input training data
''')
parser.add_argument('--xptCacheDir', type=Path, default=None,
```
Contributor

@cwsmith cwsmith May 7, 2026


IIRC, this option will run the hessian based classifier and build the cache. How does this differ from the new build_transfer_cache.py? If they do the same thing we should likely remove the option, and supporting functionality, here and require the use of the cache prepared with build_transfer_cache.py.

On that note, we should probably rename build_transfer_cache.py to run_hessian_and_build_cache.py or something similarly explicit.

Comment thread XPointMLTest.py
Comment on lines 297 to 303
```python
[fileName, axesNorm, critPoints, xpts, optsMax, optsMin, coords, psi, bx, by, jz] = getPgkylData(self.paramFile, fnum, verbosity=self.verbosity)
fields = {"psi": psi, "critPts": critPoints, "xpts": xpts,
          "optsMax": optsMax, "optsMin": optsMin,
          "axesNorm": axesNorm, "coords": coords,
          "fileName": fileName,
          "Bx": bx, "By": by, "Jz": jz}
writePgkylDataToCache(self.xptCacheDir, fnum, fields)
```
Contributor


This looks like the call that runs the Hessian-based classifier and writes the cache.

Contributor

cwsmith commented May 7, 2026

IIRC, a patch was needed for an indexing bug in https://github.com/SCOREC/pgkylFrontEnd. If so, would you please create a PR with that change?
