Hyperparameter tuning, FocalDiceLoss, and 5M/10M cross-regime transfer evaluation#21
Conversation
…physical field relationships
…ames via in-place cutout
…osRatio, lossFunction, warmupEpochs, and swa so we can actually tune everything that was hardcoded before, especially base_channels, which was stuck at 32 while all the Optuna tuners assumed 64. Also added a FocalDiceLoss class that combines focal and Dice loss to handle the severe class imbalance, and hooked up linear LR warmup and stochastic weight averaging with a custom BN update that works with our dict-based dataloader. Created test_xpoint_transfer.py to evaluate our best PKPM-trained model on the 5M and 10M datasets; it includes a monkey patch for the double component-indexing bug in getData.py since we can't modify files outside reconClassifier. Then made build_transfer_cache.py to precompute and cache the X-point finder results for all 150 frames of the 5M and 10M data so we don't have to wait ~20 minutes per frame every time we run the transfer evaluation.
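The combined loss described above is implemented in the PR's FocalDiceLoss class; that code is not shown here, but a minimal sketch of a focal + Dice combination for binary segmentation might look like the following (the parameter names alpha, gamma, and dice_weight are assumptions, not necessarily the PR's actual signature):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalDiceLoss(nn.Module):
    """Sketch: weighted sum of focal BCE and soft Dice loss for binary masks,
    intended to counter heavy class imbalance (few X-point pixels)."""
    def __init__(self, alpha=0.25, gamma=2.0, dice_weight=0.5, eps=1e-6):
        super().__init__()
        self.alpha, self.gamma = alpha, gamma
        self.dice_weight, self.eps = dice_weight, eps

    def forward(self, logits, target):
        p = torch.sigmoid(logits)
        # Focal term: per-pixel BCE down-weighted by (1 - p_t)^gamma so the
        # abundant easy negatives contribute little to the gradient.
        bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
        p_t = p * target + (1 - p) * (1 - target)
        alpha_t = self.alpha * target + (1 - self.alpha) * (1 - target)
        focal = (alpha_t * (1 - p_t) ** self.gamma * bce).mean()
        # Dice term: soft overlap between prediction and target masks.
        inter = (p * target).sum()
        dice = 1 - (2 * inter + self.eps) / (p.sum() + target.sum() + self.eps)
        return (1 - self.dice_weight) * focal + self.dice_weight * dice
```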
… RC_CACHE_BASE) for ramdisk staging, and repoint transfer eval to the production checkpoint testdir_2026-04-02-13-23-05. XPointMLTest.py now profiles getPgkylData stages and reuses precomputed second derivatives as the Hessian for getXOPoints.
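The Hessian reuse mentioned above could work roughly as sketched below: pack the second derivatives of ψ that were already computed during gradient evaluation into the array consumed by the X/O-point finder, instead of re-differentiating. The function name pack_hessian and the (ny, nx, 2, 2) layout are assumptions, not the actual getXOPoints interface:

```python
import numpy as np

def pack_hessian(d2psi_dx2, d2psi_dy2, d2psi_dxdy):
    """Sketch: assemble a per-gridpoint 2x2 Hessian of psi from precomputed
    second derivatives. det(H) < 0 marks a saddle of psi, i.e. an X-point."""
    h = np.empty(d2psi_dx2.shape + (2, 2), dtype=d2psi_dx2.dtype)
    h[..., 0, 0] = d2psi_dx2
    h[..., 1, 1] = d2psi_dy2
    h[..., 0, 1] = d2psi_dxdy  # psi is smooth, so mixed partials are equal
    h[..., 1, 0] = d2psi_dxdy
    return h
```

For example, psi = x² - y² has constant second derivatives (2, -2, 0), giving det(H) = -4 everywhere, the saddle (X-point) signature.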
cwsmith
left a comment
Thank you. A few comments are below.
```
--xptCacheDir /path/to/cache \
--n-trials 50 \
--study-name xpoint-tuning \
--db sqlite:///optuna_xpoint.db
```
does optuna automatically create the db or are additional manual setup steps required?
```python
Cross-domain inference: evaluate the best PKPM-trained model on 5M and 10M data.

This script:
1. Extracts 5M.tgz and 10M.tgz (if not already extracted)
```
We should force this to use the cache if XPointMLTest.py requires it.
```python
specify the path to the parameter txt file, the parent
directory of that file must contain the gkyl input training data
''')
parser.add_argument('--xptCacheDir', type=Path, default=None,
```
IIRC, this option will run the Hessian-based classifier and build the cache. How does this differ from the new build_transfer_cache.py? If they do the same thing, we should likely remove the option and its supporting functionality here, and require the use of the cache prepared with build_transfer_cache.py.
On that note, we should probably rename build_transfer_cache.py to run_hessian_and_build_cache.py or something similarly explicit.
```python
[fileName, axesNorm, critPoints, xpts, optsMax, optsMin, coords, psi, bx, by, jz] = getPgkylData(self.paramFile, fnum, verbosity=self.verbosity)
fields = {"psi":psi, "critPts":critPoints, "xpts":xpts,
          "optsMax":optsMax, "optsMin":optsMin,
          "axesNorm": axesNorm, "coords": coords,
          "fileName": fileName,
          "Bx":bx, "By":by, "Jz":jz}
writePgkylDataToCache(self.xptCacheDir, fnum, fields)
```
This looks like the call that runs the Hessian-based classifier and writes the cache.
IIRC, a patch was needed for an indexing bug in https://github.com/SCOREC/pgkylFrontEnd. If so, would you please create a PR with that change?
Summary
This PR changes: 1 file modified (`XPointMLTest.py`), 3 new files (`optuna_tuner.py`, `test_xpoint_transfer.py`, `build_transfer_cache.py`).

What was done:
- **Optuna-driven hyperparameter tuning** (`optuna_tuner.py`) — TPE sampler + median pruner over `base_channels`, dropout, weight decay, learning rate, positive-patch ratio, focal/dice weighting, scheduler choice, and SWA start fraction. Replaces the prior ad-hoc grid where `base_channels` was hard-coded to 32 while tuners assumed 64.
- **FocalDiceLoss + LR scheduling + SWA** — new loss combining focal cross-entropy and Dice with configurable α / γ / dice weight; linear LR warmup followed by cosine annealing (or `ReduceLROnPlateau`); optional Stochastic Weight Averaging with a custom BN-update step compatible with the dict-based dataloader.
- **Cross-regime transfer evaluation** (`test_xpoint_transfer.py`) — loads the best PKPM-trained checkpoint and evaluates zero-shot on the 5M and 10M Gkeyll datasets (150 frames each), producing per-dataset and combined summaries. Re-evaluates the PKPM validation set as an in-domain reference.
- **Cache build pipeline** (`build_transfer_cache.py`) — precomputes the deterministic X-point finder for all 150 frames of each transfer dataset, so subsequent evaluation runs read `.npy` caches instead of re-parsing `.gkyl` files. Supports `--workers N` for parallel processing and `RC_EXTRACT_DIR`/`RC_CACHE_BASE` env-var path overrides for ramdisk staging.
- **Augmentation correctness fixes in `XPointMLTest.py`** — brightness/contrast jitter is now applied globally (not per channel) so the physical identities `Bx = ∂y ψ`, `By = -∂x ψ`, `Jz = -∇²ψ/μ₀` stay consistent across the four input channels; cutout no longer mutates cached frame tensors in place.
- **Profiling and minor perf in `getPgkylData`** — per-stage `[PROFILE]` timings around `compactRead`, gradient computation, `getCritPoints`, and `getXOPoints`; the Hessian is now packed from already-computed second derivatives and passed to `getXOPoints(hessian=…)` to avoid recomputing gradients.
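The SWA item above notes a custom BN-update step for dict-based dataloaders; torch's built-in `torch.optim.swa_utils.update_bn` expects batches that are plain tensors (or `(input, target)` pairs), so a loader yielding dicts needs a hand-rolled variant. A sketch under that assumption (the key name `"image"` is a placeholder for the repo's actual batch key):

```python
import torch

@torch.no_grad()
def update_bn_from_dicts(loader, model, device, input_key="image"):
    """Sketch: recompute BatchNorm running statistics after SWA weight
    averaging, for loaders that yield dict batches. Mirrors the logic of
    torch.optim.swa_utils.update_bn but indexes the batch by input_key."""
    momenta = {}
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            momenta[m] = m.momentum
            m.momentum = None  # None => cumulative moving average over batches
    if not momenta:
        return
    was_training = model.training
    model.train()
    for batch in loader:
        model(batch[input_key].to(device))  # forward pass refreshes BN stats
    for m, mom in momenta.items():
        m.momentum = mom
    model.train(was_training)
```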
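The cache build pipeline above persists per-frame finder outputs as `.npy` files. The actual on-disk layout used by `writePgkylDataToCache` is not shown in this thread; the sketch below illustrates one plausible scheme (one directory per frame, one `.npy` per field; all names here are assumptions):

```python
import numpy as np
from pathlib import Path

def write_frame_cache(cache_dir, fnum, fields):
    """Sketch: persist one frame's X-point-finder outputs so later runs can
    skip the ~20-minute Hessian classifier. Hypothetical layout, not the
    repo's actual writePgkylDataToCache format."""
    frame_dir = Path(cache_dir) / f"frame_{fnum:04d}"
    frame_dir.mkdir(parents=True, exist_ok=True)
    for name, arr in fields.items():
        np.save(frame_dir / f"{name}.npy", np.asarray(arr))

def read_frame_cache(cache_dir, fnum):
    """Load every cached field for one frame back into a dict."""
    frame_dir = Path(cache_dir) / f"frame_{fnum:04d}"
    return {p.stem: np.load(p) for p in sorted(frame_dir.glob("*.npy"))}
```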
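The augmentation fix above has two parts: draw one brightness/contrast perturbation per frame rather than per channel, and copy before cutting out. A minimal sketch under those assumptions (function names and ranges are illustrative, not the PR's actual code):

```python
import torch

def global_jitter(frame, max_brightness=0.1, max_contrast=0.1, generator=None):
    """Sketch: one shared affine map (contrast scale + brightness offset) for
    all channels. Independent per-channel draws would rescale Bx, By, Jz
    independently of psi, breaking the relationships between the channels."""
    b = (torch.rand(1, generator=generator) * 2 - 1) * max_brightness
    c = 1 + (torch.rand(1, generator=generator) * 2 - 1) * max_contrast
    return frame * c + b  # identical transform applied to every channel

def cutout(frame, y0, y1, x0, x1):
    """Sketch: clone before zeroing so cached frame tensors are never
    mutated in place across epochs."""
    out = frame.clone()
    out[..., y0:y1, x0:x1] = 0
    return out
```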