ci: enable --stepmgr for Frontier jobs to relieve slurmctld pressure #1611
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Enables Slurm step management delegation (#SBATCH --stepmgr) for Frontier and Frontier-AMD CI job submissions to reduce slurmctld load from high-volume srun step creation.
Changes:
- Add `#SBATCH --stepmgr` via `extra_sbatch` for Frontier CI jobs.
- Document rationale for `--stepmgr` in the Frontier cluster block.
Codecov Report
✅ All modified and coverable lines are covered by tests.

| Metric | master | #1611 |
|---|---|---|
| Coverage | 60.51% | 60.51% |
| Files | 83 | 83 |
| Lines | 19905 | 19905 |
| Branches | 2950 | 2950 |
| Hits | 12046 | 12046 |
| Misses | 5866 | 5866 |
| Partials | 1993 | 1993 |
Validated on Frontier ✅
Ran live tests on Frontier (Slurm 25.11.5) to confirm
Prerequisites confirmed present controller-side:
Test 1 — stepmgr engages, step bursts don't hit the controller (job
Production incident data (Frontier accounting)
OLCF pointed at job
The flagged job
A single-node test shard that ran ~half the suite in-process and issued 2,103
It runs alongside many sibling shards. In one ~35-min window (Jun 16 ~10:00–10:35), these single-node test shards were live concurrently, each hammering
(Build/prebuild/run jobs are only ~2 steps each, which is negligible. The step pressure comes entirely from the in-allocation test/bench runs.)
Why this PR covers all of it:
Each CI test/bench job is a single-node allocation that runs the full regression suite in-process via ./mfc.sh test, launching one srun per target (pre_process/simulation/post_process) for ~560 cases = ~1700+ srun step-creates per job, up to 32 concurrent. This congests the Frontier Slurm controller. --stepmgr delegates step management to each job's slurmstepd instead of routing every srun through slurmctld, which is the appropriate mechanism for many-step single-allocation workloads. Added via the existing extra_sbatch hook for both frontier and frontier_amd; Phoenix unaffected.
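As a minimal sketch of the per-cluster behaviour described in the commit message, the helper below emits the extra directive only for the two Frontier targets. The function names `extra_sbatch_for` and `render_header`, and the lowercase cluster key `phoenix`, are hypothetical; the real `submit-slurm-job.sh` may structure its `extra_sbatch` hook differently.

```shell
#!/usr/bin/env bash
# Hypothetical illustration only: not the actual contents of submit-slurm-job.sh.

# Print any extra #SBATCH directives for the given cluster name.
extra_sbatch_for() {
  case "$1" in
    frontier|frontier_amd)
      # Delegate step management to the job's own slurmstepd.
      printf '%s\n' '#SBATCH --stepmgr'
      ;;
    *)
      # Other clusters (e.g. phoenix) get no extra directives.
      ;;
  esac
}

# Assemble an illustrative batch-script header for a cluster.
render_header() {
  printf '%s\n' '#!/bin/bash' '#SBATCH -N 1'
  extra_sbatch_for "$1"
}

render_header frontier
```

Keeping the directive behind a per-cluster switch is what leaves Phoenix submissions byte-for-byte unchanged.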
59fe0c7 to a116bda
What
Add `#SBATCH --stepmgr` to Frontier and Frontier-AMD CI job submissions, via the existing `extra_sbatch` hook in `.github/scripts/submit-slurm-job.sh`. Phoenix is unaffected.
Why
Each Frontier CI test/bench job is a single-node (`-N 1`) allocation that runs the full regression suite inside the allocation via `./mfc.sh test -a -j 32`. Each of the ~560 test cases launches a separate `srun` per target (pre_process / simulation / post_process), with restart-roundtrip cases adding more: on the order of 1,700+ `srun` step-creates per job, up to 32 concurrent, all brokered by `slurmctld`. Across the test + bench + AMD matrix this reaches ~3,000 step-creates and congests the Frontier Slurm controller.
OLCF flagged this and temporarily limited the maintainer account to one running job.
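The per-job figure quoted above follows from simple arithmetic, sketched here with a hypothetical helper `estimate_steps`. It assumes exactly one `srun` per target per case; the optional third argument stands in for the additional restart-roundtrip steps, whose exact count is not stated here.

```shell
#!/usr/bin/env bash
# estimate_steps CASES TARGETS [EXTRA]
# Rough count of srun step-creates for one job: one step per target per case,
# plus EXTRA additional steps (e.g. restart-roundtrip cases; defaults to 0).
estimate_steps() {
  local cases="$1" targets="$2" extra="${3:-0}"
  echo $(( cases * targets + extra ))
}

# ~560 cases x 3 targets (pre_process, simulation, post_process)
estimate_steps 560 3   # prints 1680
```

The 1,680 baseline plus the restart-roundtrip steps is consistent with the "1,700+" estimate.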
`--stepmgr` is the mechanism OLCF recommended: it delegates step management to each job's own `slurmstepd` instead of routing every `srun` through the central controller, which is the appropriate model for many-step single-allocation workloads.
Scope
`srun` calls or to test logic; this only changes who brokers the steps.
Validation
`enable_stepmgr` is active on the Frontier controller (coordinating with OLCF).
If `--stepmgr` proves insufficient, follow-ups in reserve: lower in-job concurrency (`-j`), or coalesce the three executables into a single `srun` per case.
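For the controller-side item under Validation, one possible check is sketched below. It assumes `scontrol show config` prints `Key = value` lines and that `enable_stepmgr` appears in the comma-separated `SlurmctldParameters` value; the helper name `slurm_config_has` is hypothetical, and which settings actually need to be present is for OLCF to confirm.

```shell
#!/usr/bin/env bash
# slurm_config_has KEY TOKEN
# Reads "Key = value" lines on stdin (the format assumed for
# `scontrol show config`) and succeeds if KEY's comma-separated
# value list contains TOKEN exactly.
slurm_config_has() {
  awk -v key="$1" -v token="$2" '
    BEGIN { found = 0 }
    {
      eq = index($0, "=")            # split at the first "="
      if (eq == 0) next
      k = substr($0, 1, eq - 1)
      v = substr($0, eq + 1)
      gsub(/[ \t]/, "", k)           # strip padding around key and value
      gsub(/[ \t]/, "", v)
      if (k != key) next
      n = split(v, parts, ",")
      for (i = 1; i <= n; i++) if (parts[i] == token) found = 1
    }
    END { exit(found ? 0 : 1) }
  '
}

# Intended use on a login node (not run here):
#   scontrol show config | slurm_config_has SlurmctldParameters enable_stepmgr \
#     && echo "enable_stepmgr is set"
```

Reading from stdin keeps the helper usable against a saved copy of the config output as well as a live `scontrol` call.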