Add Gadi Singularity container build files for UW3 #133

Open
jcgraciosa wants to merge 1 commit into underworldcode:development from jcgraciosa:gadi-singularity-container

@jcgraciosa
Contributor

Adds two Containerfiles and a README for building and running UW3 as a Singularity container on Gadi (NCI), modeled on the UW2 gadi_singularity setup.

Files added

  • docs/developer/gadi_singularity/petsc.rhel — builds PETSc 3.25.0 with full AMR support (petsc4py, slepc4py, mmg, parmmg, ptscotch, hypre, etc.) on Rocky Linux 8.10
  • docs/developer/gadi_singularity/underworld3.rhel — builds UW3 on top of the PETSc image
  • docs/developer/gadi_singularity/README.md — build and deployment instructions

Tested

  • Built for linux/amd64 using podman with QEMU emulation on Apple Silicon
  • Pushed to ghcr.io and pulled on Gadi with singularity pull
  • Stokes flow test passed with 4 MPI ranks on Gadi normalbw queue

Notes

  • Build context must be the UW3 repo root (for COPY petsc-custom/patches/...)
  • --platform linux/amd64 required when building on Apple Silicon
  • SINGULARITY_CACHEDIR must point to scratch on Gadi (home quota too small)

Future work

  • Rigorous scaling tests on Gadi using the Singularity container (multi-node, increasing MPI ranks)
  • Suppress OpenFabrics (mlx5_0) warnings by excluding the openib BTL (--mca btl ^openib) so that Open MPI uses the UCX transport instead
  • Kaiju support: install Apptainer via spack and test container there

Underworld development team with AI support from Claude Code

@jcgraciosa jcgraciosa requested a review from julesghub April 22, 2026 10:02
@jcgraciosa jcgraciosa requested a review from lmoresi as a code owner April 22, 2026 10:02
Member

@lmoresi lmoresi left a comment

Thanks @jcgraciosa — multi-stage build is clean and the Gadi-specific environment touches (`OMPI_MCA_io=ompio`, `OPENBLAS_NUM_THREADS=1`, scratch redirect for the Singularity cache) are exactly right. Nice touch on the aarch64 graceful skips for gmsh/vtk-osmesa so the recipe still tests locally on Apple Silicon.

One change required, plus a few optional polish items.

Required: drop `pykdtree` from `underworld3.rhel`

Line ~143 of `underworld3.rhel` lists `pykdtree` in the pip install. UW3 removed this dependency in the Aug 2025 KDTree backend switch — `uw.kdtree` is now backed by ckdtree/nanoflann. Beyond being unused, pykdtree's OpenMP integration causes:

  • Fatal crashes on macOS when loaded alongside PETSc/numpy/scipy (double `libomp.dylib` initialisation — C-level abort, not catchable by Python).
  • Hangs under MPI during KDTree queries due to thread contention between OpenMP and the MPI processes.

The second one is the concern for this container specifically — Gadi runs are MPI by construction. Please drop `pykdtree` from the pip install list.

(Background and justification are in the planning entry titled "Remove pykdtree dependency", 2026-02-13.)
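
A check along these lines could guard against the package creeping back into the recipe. This is a minimal sketch, not part of the PR: the function name `pip_installs_package` and the sample Containerfile text are hypothetical, and it assumes the pip install is written as a `RUN` instruction with backslash line continuations.

```python
import re


def pip_installs_package(containerfile_text: str, package: str) -> bool:
    """Return True if any `pip install` command in the text names `package`."""
    # Join backslash-continued lines so each instruction is one logical line.
    joined = re.sub(r"\\\s*\n", " ", containerfile_text)
    for line in joined.splitlines():
        if line.lstrip().startswith("#"):
            continue  # ignore Containerfile comments
        match = re.search(r"\bpip3?\s+install\b(.*)", line)
        if not match:
            continue
        for token in match.group(1).split():
            # Drop surrounding quotes, then any version specifier or extras.
            name = re.split(r"[=<>!~\[]", token.strip("'\""), maxsplit=1)[0]
            if name.lower() == package.lower():
                return True
    return False


# Hypothetical excerpt standing in for the pip install step in underworld3.rhel.
SAMPLE = """\
RUN pip install --no-cache-dir \\
    numpy \\
    pykdtree \\
    sympy
"""

print(pip_installs_package(SAMPLE, "pykdtree"))
print(pip_installs_package(SAMPLE.replace("    pykdtree \\\n", ""), "pykdtree"))
```

The first call reports the package as present; the second runs against the same text with the `pykdtree` line removed.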

Optional polish

These are not blockers — happy to land the must-fix without them, and follow up later.

  1. Default `PETSC_IMAGE` points at a personal namespace. `underworld3.rhel:24` defaults to `ghcr.io/jcgraciosa/petsc:3.25.0-ompi`. The README correctly tells users to override with `--build-arg`, but a downstream user copy-pasting the build command without reading the README would pull from your personal namespace. Either point the default at an org-level image (e.g. `ghcr.io/underworldcode/petsc:3.25.0-ompi`, once published) or leave the default empty so the build fails loudly instead of silently using the wrong image.

  2. `--with-cxx-dialect=C++11` for PETSc 3.25. PETSc 3.25 requires C++14 as a minimum. The flag may still work for backwards compatibility but it's understating what PETSc actually needs — easier to drop it entirely (PETSc auto-detects) or set `C++14` explicitly.

  3. `--download-fblaslapack=1` while runtime has `openblas`. The runtime layer installs system `openblas`, but PETSc downloads and builds the reference Fortran BLAS/LAPACK separately. Using `--with-blaslapack-dir` against the system openblas would shrink the image a bit and speed up the configure stage. Optimisation, not correctness.
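
To illustrate the "fail loudly" option from item 1, here is a small sketch of a wrapper that refuses to assemble a build command unless the base image is given explicitly. It is an analogue at the script level, not the Containerfile change itself. The helper name `podman_build_argv` and the tag `uw3:gadi` are hypothetical; the image name is the org-level example from item 1, which is not yet published.

```python
import shlex


def podman_build_argv(containerfile: str, tag: str, petsc_image: str = "") -> list[str]:
    """Assemble a `podman build` command line, refusing an empty base image."""
    if not petsc_image:
        # No silent fallback to a personal namespace: stop before building.
        raise ValueError("PETSC_IMAGE must be set explicitly")
    return [
        "podman", "build",
        "--platform", "linux/amd64",           # required on Apple Silicon hosts
        "--build-arg", f"PETSC_IMAGE={petsc_image}",
        "-f", containerfile,
        "-t", tag,
        ".",                                    # build context: the UW3 repo root
    ]


argv = podman_build_argv(
    "docs/developer/gadi_singularity/underworld3.rhel",
    "uw3:gadi",                                 # hypothetical tag
    "ghcr.io/underworldcode/petsc:3.25.0-ompi",
)
print(shlex.join(argv))
```

Calling the helper without a `petsc_image` raises `ValueError` instead of producing a command that pulls an unintended image.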

Cosmetic

  • Containerfile build-command comments reference `./docs/development/gadi_singularity/` (in two places), but the actual path is `./docs/developer/gadi_singularity/` (`development` vs `developer`).
  • `petsc.rhel` ends without a trailing newline.
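
Both cosmetic items are easy to catch mechanically. The following is a rough sketch under stated assumptions: `lint_containerfile` is a hypothetical helper, and the sample text is invented for illustration rather than copied from `petsc.rhel`.

```python
WRONG_PATH = "./docs/development/gadi_singularity/"
RIGHT_PATH = "./docs/developer/gadi_singularity/"


def lint_containerfile(text: str) -> list[str]:
    """Report stale path references and a missing final newline."""
    issues = []
    for number, line in enumerate(text.splitlines(), start=1):
        if WRONG_PATH in line:
            issues.append(f"line {number}: '{WRONG_PATH}' should be '{RIGHT_PATH}'")
    if text and not text.endswith("\n"):
        issues.append("file does not end with a newline")
    return issues


# Invented sample: two stale path comments and no trailing newline.
SAMPLE_TEXT = (
    "# podman build -f ./docs/development/gadi_singularity/petsc.rhel .\n"
    "FROM rockylinux:8.10\n"
    "# see ./docs/development/gadi_singularity/README.md\n"
    "CMD [\"/bin/bash\"]"
)

for issue in lint_containerfile(SAMPLE_TEXT):
    print(issue)
```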

Looks good

  • Multi-stage `runtime → builder → final` keeps the final image small.
  • Patch application has a graceful "already-merged-upstream" fallback.
  • petsc4py/slepc4py install failure dumps the build log on exit — good debuggability.
  • `mpi4py` is forced to build from source (`--no-binary`) against openmpi, and `h5py` is rebuilt with `HDF5_MPI=ON` against PETSc's HDF5. Both are essential and easy to get wrong.

Happy to approve and merge once `pykdtree` is removed. The polish items can roll into this PR or a follow-up — whichever you prefer.

Underworld development team with AI support from Claude Code

@lmoresi lmoresi force-pushed the gadi-singularity-container branch from 986513b to 06813bd May 4, 2026 12:03
lmoresi added a commit to jcgraciosa/underworld3 that referenced this pull request May 4, 2026
Since 2026-04-30 every push to development and every PR has been failing
CI with the runner timing out:

    Resolving Environment ⧖ Starting           # 02:51:02
    ##[error]The runner has received a shutdown signal.   # 03:57:25

That's 66 minutes spent in micromamba's conda-forge solve before the
test step starts, after which GitHub kills the runner. environment.yaml
has loose constraints (python <= 3.11, an exact pin on petsc=3.21.5
that's now a year old in conda-forge, and several unpinned packages
including pykdtree, which UW3 doesn't actually use any more). That
combination, plus recent conda-forge package state shifts, has put the
solver into deep backtracking.
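
The 66-minute figure follows directly from the two timestamps in the log excerpt. A minimal sketch of the arithmetic, assuming both timestamps fall on the same day (the helper name `elapsed` is hypothetical):

```python
from datetime import datetime


def elapsed(start: str, end: str):
    """Return the time between two HH:MM:SS stamps on the same day."""
    fmt = "%H:%M:%S"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


delta = elapsed("02:51:02", "03:57:25")
print(delta)                              # 1:06:23
print(int(delta.total_seconds() // 60))   # 66
```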

The pixi.lock committed in the repo already captures the exact same
dependency set we use locally for development. Using
prefix-dev/setup-pixi with frozen: true (refuses to re-solve) gives a
deterministic, fast install matching local dev state. Bonus: the build
step now uses `pixi run -e dev build` (= `pip install . --no-build-isolation`
from pixi.toml), avoiding the editable install the previous CI was doing
in violation of project policy (CLAUDE.md "NEVER use pip install -e .").

Replaces the workflow file in place; keeping environment.yaml in the
repo for now because PR underworldcode#133 (Gadi Singularity container) consumes it,
and removing it would create a cross-PR ordering hazard.

Test plan: this commit's own CI run is the test plan — if it goes
green in <10 min instead of red in 90 min, we have our answer.

Underworld development team with AI support from Claude Code