ci: retry NVHPC image pull on transient nvcr.io timeouts#1554
Conversation
The bare 'docker pull' in the NVHPC jobs fails the whole job when nvcr.io returns 'context deadline exceeded', which happens sporadically with ~30 matrix jobs pulling concurrently. Retry up to 5 times with linear backoff; pulls resume completed layers so retries are cheap.
|
Claude Code Review Head SHA: c8995c4 Files changed:
Findings:
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #1554 +/- ##
=======================================
Coverage 61.17% 61.17%
=======================================
Files 74 74
Lines 20313 20313
Branches 2961 2961
=======================================
Hits 12427 12427
Misses 5870 5870
Partials 2016 2016 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Problem
NVHPC matrix jobs sporadically fail at the Pull NVHPC container step with:
Example: NVHPC 25.7 (cpu) in run 27312594672, where 1 of 30 NVHPC jobs hit the timeout while the other 29 pulled the same registry fine — a transient nvcr.io flake, likely aggravated by ~30 concurrent pulls per run.
Fix
Retry
docker pullup to 5 times with linear backoff (30s, 60s, ...). Docker resumes already-downloaded layers, so retries are cheap. Worst case adds ~5 min before failing with an explicit error annotation.