ci: enable --stepmgr for Frontier jobs to relieve slurmctld pressure #1611

Merged
sbryngelson merged 1 commit into master from ci/frontier-stepmgr
Jun 20, 2026

Conversation

@sbryngelson sbryngelson commented Jun 18, 2026

What

Add #SBATCH --stepmgr to Frontier and Frontier-AMD CI job submissions, via the existing extra_sbatch hook in .github/scripts/submit-slurm-job.sh. Phoenix is unaffected.
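
A minimal sketch of how a per-cluster hook such as extra_sbatch might add the directive. The function name extra_sbatch_for and the generated header are hypothetical, since the contents of submit-slurm-job.sh are not reproduced in this description; the cluster names frontier and frontier_amd are taken from the commit message.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: pick extra #SBATCH lines per cluster.
# extra_sbatch_for is an illustrative name, not the real script's interface.
extra_sbatch_for() {
    local cluster="$1"
    case "$cluster" in
        frontier|frontier_amd)
            # Delegate step management to the job's own slurmstepd.
            printf '#SBATCH --stepmgr\n'
            ;;
        *)
            # Other clusters (e.g. phoenix) get no extra directives.
            ;;
    esac
}

# Splice the result into an illustrative batch header.
extra_sbatch="$(extra_sbatch_for frontier)"
printf '#!/bin/bash\n#SBATCH -N 1\n%s\n' "$extra_sbatch"
```

Keeping the directive behind a per-cluster switch is what leaves Phoenix submissions unchanged.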

Why

Each Frontier CI test/bench job is a single-node (-N 1) allocation that runs the full regression suite inside the allocation via ./mfc.sh test -a -j 32. Each of the ~560 test cases launches a separate srun per target (pre_process / simulation / post_process), with restart-roundtrip cases adding more — on the order of 1,700+ srun step-creates per job, up to 32 concurrent, all brokered by slurmctld. Across the test + bench + AMD matrix this reaches ~3,000 step-creates and congests the Frontier Slurm controller.
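
As a back-of-envelope check of the per-job figure, the baseline can be computed from the numbers quoted above. This sketch assumes exactly three srun calls per case and ignores the extra steps from restart-roundtrip cases, so it gives a lower bound rather than the exact count.

```shell
#!/usr/bin/env bash
# Rough lower bound on srun step-creates per job, from the figures above.
cases=560            # approximate number of test cases
targets_per_case=3   # pre_process, simulation, post_process
baseline_steps=$(( cases * targets_per_case ))
echo "baseline srun step-creates per job: ${baseline_steps}"
```

The result, 1680, is consistent with the "1,700+" estimate once the restart-roundtrip cases are added.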

OLCF flagged this and temporarily limited the maintainer account to one running job. --stepmgr is the mechanism OLCF recommended: it delegates step management to each job's own slurmstepd instead of routing every srun through the central controller, which is the appropriate model for many-step single-allocation workloads.

Scope

  • One file changed; behavior identical except for the added SBATCH directive on Frontier.
  • No change to the number of srun calls or to test logic — this only changes who brokers the steps.

Validation

  • Confirm enable_stepmgr is active on the Frontier controller (coordinating with OLCF).
  • Run the Frontier test + bench jobs on this branch and confirm reduced controller load.

If --stepmgr proves insufficient, follow-ups in reserve: lower in-job concurrency (-j), or coalesce the three executables into a single srun per case.

Copilot AI review requested due to automatic review settings June 18, 2026 21:26

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Enables Slurm step management delegation (#SBATCH --stepmgr) for Frontier and Frontier-AMD CI job submissions to reduce slurmctld load from high-volume srun step creation.

Changes:

  • Add #SBATCH --stepmgr via extra_sbatch for Frontier CI jobs.
  • Document rationale for --stepmgr in the Frontier cluster block.

Comment thread .github/scripts/submit-slurm-job.sh

codecov Bot commented Jun 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.51%. Comparing base (b4be438) to head (a116bda).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1611   +/-   ##
=======================================
  Coverage   60.51%   60.51%           
=======================================
  Files          83       83           
  Lines       19905    19905           
  Branches     2950     2950           
=======================================
  Hits        12046    12046           
  Misses       5866     5866           
  Partials     1993     1993           

@sbryngelson

Validated on Frontier ✅

Ran live tests on Frontier (Slurm 25.11.5) to confirm --stepmgr is accepted, actually engaged, and works end-to-end with the real test suite.

Prerequisites confirmed present controller-side:

  • PrologFlags = Alloc,Contain,... (Contain is the hard requirement for per-job --stepmgr.)
  • SlurmctldParameters = ...,rl_enable,rl_bucket_size=350,rl_refill_rate=64 — the controller already RPC-rate-limits, which is what was throttling our srun bursts. --stepmgr sidesteps it by having each job's slurmstepd broker its own steps.
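
To give a feel for why the rate limit bites on srun bursts, here is a rough token-bucket estimate using the rl_bucket_size and rl_refill_rate values above. It is a simplification under stated assumptions: the bucket starts full, the refill period is one second (the period is not shown in the config excerpt), each step-create costs one RPC, and the burst size of 1000 is a hypothetical input. The function name burst_drain_seconds is illustrative.

```shell
#!/usr/bin/env bash
# Token-bucket estimate: seconds until a burst of RPCs is fully admitted.
# Assumes a full bucket at t=0 and one refill of $rate tokens per second.
burst_drain_seconds() {
    local burst="$1" bucket="$2" rate="$3"
    local excess=$(( burst - bucket ))
    if (( excess <= 0 )); then
        # The whole burst fits in the bucket; nothing is delayed.
        echo 0
    else
        # Integer ceiling of excess / rate.
        echo $(( (excess + rate - 1) / rate ))
    fi
}

# Hypothetical burst of 1000 RPCs against bucket=350, refill=64 per second.
burst_drain_seconds 1000 350 64
```

Under these assumptions the first 350 RPCs pass immediately and the remaining 650 take about 11 seconds to be admitted. In practice a single srun can involve more than one RPC, so the real effect may be larger.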

Test 1 — stepmgr engages, step bursts don't hit the controller (job 4866187)

StepMgrEnabled=Yes
SLURM_STEPMGR env = 'frontier01755'
80 sequential srun steps done in 6s, failures=0   (rate-limit/errors: none)
16 concurrent overlapping srun steps             (errors: none)

sacct: 4866187 COMPLETED 00:00:09 0:0
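
A sequential step burst like the one in Test 1 could be driven by a small helper along these lines. This is a sketch, not the script that produced the output above; run_step_burst is a hypothetical name, and the launcher is passed as an argument so the helper can be dry-run without Slurm.

```shell
#!/usr/bin/env bash
# Run COUNT sequential steps via a launcher command and count failures.
# Usage: run_step_burst COUNT LAUNCHER [ARGS...]
run_step_burst() {
    local count="$1"; shift
    local failures=0 i
    for (( i = 0; i < count; i++ )); do
        # Each iteration launches one step; a non-zero exit is a failure.
        "$@" >/dev/null 2>&1 || failures=$(( failures + 1 ))
    done
    echo "${count} sequential steps done, failures=${failures}"
}

# Dry run without Slurm: 'true' stands in for the launcher.
run_step_burst 5 true
```

Inside an allocation submitted with #SBATCH --stepmgr, the same helper might be called as `run_step_burst 80 srun -N 1 -n 1 true`, with `scontrol show job "$SLURM_JOB_ID"` inspected for the StepMgrEnabled field.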

Test 2 — real MFC suite under --stepmgr (job 4866242)

Command: ./mfc.sh test -j 8 --gpu acc --no-build --only 1D -- -c frontier, run inside an allocation carrying #SBATCH --stepmgr.

StepMgrEnabled=Yes
157 passed
  7 failed

sacct: 4866242 COMPLETED 00:02:12 0:0

157/164 1D tests passed, with every srun step (~470+: syscheck + pre_process + simulation per case) brokered by slurmstepd instead of slurmctld. The 7 failures are unrelated to stepmgr — they are gpu-acc-chem/syscheck: No such file or directory, i.e. the chemistry-enabled build variant simply wasn't compiled in my local test build. Zero stepmgr / step-creation / rate-limit errors throughout.

Conclusion

#SBATCH --stepmgr is honored on Frontier and the suite runs normally with controller step-RPC pressure removed — exactly the mitigation OLCF requested. Safe to merge for the Frontier/Frontier-AMD CI paths.

@sbryngelson

Production incident data (Frontier accounting)

OLCF pointed at job 4816911 as a specific offender. Accounting confirms it and shows this is the normal CI fan-out, not a one-off:

The flagged job

4816911  MFC-test-gpu-omp-1-of-2  1 node  hackathon  COMPLETED  14m22s
steps recorded: 2103

A single-node test shard that ran ~half the suite in-process and issued 2,103 srun step-creates, the bulk bursting at startup when all in-job workers launch at once — exactly the "~1000 steps in a few seconds" OLCF observed.

It runs alongside many sibling shards. In one ~35-min window (Jun 16 ~10:00–10:35), these single-node test shards were live concurrently, each hammering slurmctld:

Job      Name                      Steps
4816813  MFC-test-cpu-none          4170
4816895  MFC-test-cpu-none          4170
4816911  MFC-test-gpu-omp-1-of-2    2103
4816909  MFC-test-gpu-acc-1-of-2    2100
4816880  MFC-test-gpu-acc-2-of-2    2093
4816898  MFC-test-gpu-omp-2-of-2    2093
4816858  MFC-test-cpu-none-2-of-2   2086
4816882  MFC-test-cpu-none-2-of-2   2086
4817028  MFC-test-gpu-omp-2-of-2    2086

(Build/prebuild/run jobs are only ~2 steps each — negligible. The step pressure is entirely the in-allocation test/bench runs.)
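
For a sense of the aggregate, the Steps column of the table above can be summed directly. The sketch below only adds up the nine listed rows; it does not query Slurm accounting.

```shell
#!/usr/bin/env bash
# Sum the per-job step counts listed in the table (last field of each row).
total_steps=$(awk '{ sum += $NF } END { print sum }' <<'EOF'
4816813 MFC-test-cpu-none 4170
4816895 MFC-test-cpu-none 4170
4816911 MFC-test-gpu-omp-1-of-2 2103
4816909 MFC-test-gpu-acc-1-of-2 2100
4816880 MFC-test-gpu-acc-2-of-2 2093
4816898 MFC-test-gpu-omp-2-of-2 2093
4816858 MFC-test-cpu-none-2-of-2 2086
4816882 MFC-test-cpu-none-2-of-2 2086
4817028 MFC-test-gpu-omp-2-of-2 2086
EOF
)
echo "total step-creates across 9 shards: ${total_steps}"
```

That comes to 22,987 step-creates from these nine shards within the same window.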

Why this PR covers all of it: --stepmgr is added in the shared submit-slurm-job.sh wrapper, so every one of these shards (test + bench, cpu + gpu, acc + omp) gets slurmstepd-managed steps. The live validation above (jobs 4866187, 4866242) confirms the mechanism engages and the suite still passes. This removes the controller step-RPC load that these ~2,000-step shards were generating in aggregate.

Each CI test/bench job is a single-node allocation that runs the full
regression suite in-process via ./mfc.sh test, launching one srun per
target (pre_process/simulation/post_process) for ~560 cases = ~1700+
srun step-creates per job, up to 32 concurrent. This congests the
Frontier Slurm controller.

--stepmgr delegates step management to each job's slurmstepd instead of
routing every srun through slurmctld, which is the appropriate mechanism
for many-step single-allocation workloads. Added via the existing
extra_sbatch hook for both frontier and frontier_amd; Phoenix unaffected.
@sbryngelson sbryngelson force-pushed the ci/frontier-stepmgr branch from 59fe0c7 to a116bda Compare June 19, 2026 01:30
@sbryngelson sbryngelson merged commit 22d6e1f into master Jun 20, 2026
94 of 98 checks passed
@sbryngelson sbryngelson deleted the ci/frontier-stepmgr branch June 20, 2026 14:31