ci: retry NVHPC image pull on transient nvcr.io timeouts #1554

Merged
sbryngelson merged 1 commit into master from fix-nvhpc-pull-retry on Jun 11, 2026
@sbryngelson

Member

Problem

NVHPC matrix jobs sporadically fail at the Pull NVHPC container step with:

Error response from daemon: Get "https://nvcr.io/v2/": context deadline exceeded

Example: NVHPC 25.7 (cpu) in run 27312594672, where 1 of 30 NVHPC jobs hit the timeout while the other 29 pulled the same registry fine — a transient nvcr.io flake, likely aggravated by ~30 concurrent pulls per run.

Fix

Retry docker pull up to 5 times with linear backoff (30s, 60s, ...). Docker resumes already-downloaded layers, so retries are cheap. Worst case adds ~5 min before failing with an explicit error annotation.
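
The backoff arithmetic can be checked with a small sketch. `backoff_total` is a hypothetical helper, not part of the workflow; it sums the linear delays for a given number of attempts and step size, with or without a sleep after the final failed attempt. It counts only the sleeps and ignores the time spent inside `docker pull` itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the workflow): total seconds slept by a
# linear-backoff retry loop, where the sleep after attempt i is i * step.
# Usage: backoff_total MAX_ATTEMPTS STEP_SECONDS [SLEEP_AFTER_LAST]
#   SLEEP_AFTER_LAST=1 models a loop that also sleeps after the final failure.
backoff_total() {
    local max_attempts=$1 step=$2 sleep_after_last=${3:-0}
    local sleeps=$((max_attempts - 1 + sleep_after_last))
    local total=0 i=1
    while [ "$i" -le "$sleeps" ]; do
        total=$((total + i * step))
        i=$((i + 1))
    done
    echo "$total"
}

backoff_total 5 30 0   # prints 300 (30 + 60 + 90 + 120)
backoff_total 5 30 1   # prints 450 (adds the 150 s sleep after attempt 5)
```

Under these assumptions, the ~5 min figure matches sleeping only between attempts (300 s); a loop that also sleeps after the last failure would wait 450 s before erroring out.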

The bare 'docker pull' in the NVHPC jobs fails the whole job when
nvcr.io returns 'context deadline exceeded', which happens
sporadically with ~30 matrix jobs pulling concurrently. Retry up to
5 times with linear backoff; pulls resume completed layers so
retries are cheap.
Copilot AI review requested due to automatic review settings June 11, 2026 01:32

Copilot AI left a comment

Contributor

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@github-actions

Claude Code Review

Head SHA: c8995c4

Files changed: 1

  • .github/workflows/test.yml

Findings:

  • Unnecessary 150-second sleep on final failed attempt (.github/workflows/test.yml): On the last loop iteration (attempt=5), when docker pull fails, the script falls through to sleep $((attempt * 30)) — sleeping 150 seconds — before the loop exits and the exit 1 fires. There is no further retry after iteration 5, so this sleep serves no purpose and adds 2.5 minutes to every eventual failure path. The echo message on that iteration also says "retrying in 150s…" which is misleading since no retry follows. Fix: skip the sleep (and misleading message) on the final attempt:

    for attempt in 1 2 3 4 5; do
        docker pull "$NVHPC_IMAGE" && exit 0
        if [ "$attempt" -lt 5 ]; then
            echo "docker pull failed (attempt $attempt/5); retrying in $((attempt * 30))s..."
            sleep $((attempt * 30))
        fi
    done
    echo "::error::Failed to pull $NVHPC_IMAGE after 5 attempts"
    exit 1
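
The same pattern can be sketched as a reusable function, which makes the "no sleep after the last attempt" behaviour easy to dry-run. This is an illustration only: `retry_linear` and the `SLEEP_CMD` override are hypothetical names, not something the workflow defines, and the function returns a status instead of calling `exit` so it can be tested in the current shell.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to MAX_ATTEMPTS times with linear
# backoff (attempt * STEP_SECONDS), skipping the sleep after the final failure.
# Usage: retry_linear MAX_ATTEMPTS STEP_SECONDS COMMAND [ARGS...]
# SLEEP_CMD is an assumed hook: set it to ":" to dry-run without waiting.
retry_linear() {
    local max_attempts=$1 step=$2
    shift 2
    local attempt=1 delay
    while [ "$attempt" -le "$max_attempts" ]; do
        "$@" && return 0
        if [ "$attempt" -lt "$max_attempts" ]; then
            delay=$((attempt * step))
            echo "command failed (attempt $attempt/$max_attempts); retrying in ${delay}s..."
            ${SLEEP_CMD:-sleep} "$delay"
        fi
        attempt=$((attempt + 1))
    done
    # GitHub Actions renders lines starting with ::error:: as error annotations.
    echo "::error::Command failed after $max_attempts attempts: $*"
    return 1
}

# Example call, equivalent in spirit to the loop above (not executed here):
#   retry_linear 5 30 docker pull "$NVHPC_IMAGE"
```

Passing the command as arguments keeps the attempt count and step in one place; whether that is worth the indirection for a single `docker pull` step is a judgment call.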

@sbryngelson merged commit ac30c32 into master on Jun 11, 2026
90 of 91 checks passed
@sbryngelson deleted the fix-nvhpc-pull-retry branch on June 11, 2026 02:15
@codecov

codecov Bot commented Jun 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 61.17%. Comparing base (c0792ec) to head (c8995c4).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1554   +/-   ##
=======================================
  Coverage   61.17%   61.17%           
=======================================
  Files          74       74           
  Lines       20313    20313           
  Branches     2961     2961           
=======================================
  Hits        12427    12427           
  Misses       5870     5870           
  Partials     2016     2016           
