
fix(supervisor): retry transient instance create failures in compute workload manager#3902

Merged
myftija merged 4 commits into main from tri-10293-cold-start-create-retry
Jun 11, 2026

@myftija

@myftija myftija commented Jun 11, 2026

ComputeWorkloadManager.create currently swallows gateway errors, so a cold start that fails placement (e.g. a netns slot with a busy tap, or a full node disk) silently abandons the dequeued run until the run engine's PENDING_EXECUTING heartbeat timeout redrives it via stall detection.

Changes

  • Retry instances.create with short backoff (default 3 attempts, 250ms base delay), recording createAttempts in the wide event.
  • Retry only statuses where the create definitely did not commit: 500 (agent/fcrun create failed) and 503 (no placement). 502 and 504 are excluded: the gateway emits those when it fails to reach the node or read its response, which can happen after the agent has committed the create. Because the gateway records the instance name only on a clean 201, a same-name retry would miss the collision check and could double-create the VM on another node. Network-level fetch failures are retried (if the gateway processed the create, its name index is populated and the retry harmlessly returns 409). Timeouts are not retried.
  • Retry attempts after a 5xx use a deterministic -rN name suffix, because a failed create can leave its name registered until async cleanup runs. Attempt 1 keeps the unsuffixed name.
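
The classification rules above could be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `isRetryableCreateError` is named in the review walkthrough, but its signature and the `CreateError` shape used here are assumptions made for the example.

```typescript
// Hypothetical error shape for illustration; the real gateway client
// may surface failures differently.
type CreateError =
  | { kind: "http"; status: number } // gateway answered with a status code
  | { kind: "network" } // fetch failed before any response arrived
  | { kind: "timeout" }; // request timed out; instance may still be provisioning

// Statuses where the create definitely did not commit.
const RETRYABLE_STATUSES = new Set<number>([500, 503]);

function isRetryableCreateError(err: CreateError): boolean {
  switch (err.kind) {
    case "http":
      // 502/504 and all 4xx fall through to "not retryable".
      return RETRYABLE_STATUSES.has(err.status);
    case "network":
      // Safe to retry: if the gateway did process the create, its name
      // index is populated and the retry gets a harmless 409.
      return true;
    case "timeout":
      return false;
  }
}
```

Keeping the retryable statuses in an explicit allowlist means any status not listed, including ones added to the gateway later, defaults to no retry.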

myftija added 2 commits June 11, 2026 15:48
…he run

ComputeWorkloadManager.create swallows gateway errors by design, so a
cold start that fails placement (e.g. a netns slot with a busy tap, a
full node disk) silently abandons the dequeued run until the run
engine's PENDING_EXECUTING timeout redrives it minutes later. These
failures are transient per placement - redriven runs virtually always
succeed - so retry the create up to 3 times with short backoff before
giving up. Gateway 5xx and network-level fetch failures are retried;
4xx responses (won't heal) and timeouts (the instance may still be
provisioning) are not.
A failed create can leave its instance name registered gateway/fcrun-side
until async cleanup runs, so a same-name retry can 409 against our own
residue (observed: tap-EBUSY 500 at 18:29Z followed by 409 name_conflict
on the retry 2.7s later, costing the full redrive anyway). Give retry
attempts a deterministic -rN suffix; attempt 1 keeps the unsuffixed name
so the non-retry path is unchanged. The suffixed name flows into both the
instance name and TRIGGER_RUNNER_ID from the same variable - every
downstream flow (suspend scheduling, snapshot dispatch, cancel guards,
run-engine fields) treats it as one opaque self-reported token, and
restored VMs already carry deterministic name suffixes.

Temporary measure (TRI-10293): the proper fix is gateway-side cleanup of
failed-create registrations.
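
A rough sketch of the naming scheme and bounded retry loop described in this commit message, under several assumptions: `runnerNameForAttempt` is named in the review walkthrough, but its signature here is a guess; N in `-rN` is assumed to be the 1-based attempt number; the delay is assumed to grow linearly; and `createWithRetry`, `CreateResult`, and the injected `create`/`isRetryable` callbacks are hypothetical names for illustration.

```typescript
// Attempt 1 keeps the unsuffixed name so the non-retry path is unchanged.
// Assumption: N in "-rN" is the 1-based attempt number.
function runnerNameForAttempt(baseName: string, attempt: number): string {
  return attempt <= 1 ? baseName : `${baseName}-r${attempt}`;
}

interface CreateResult {
  ok: boolean;
  runnerId: string; // the name used on the last attempt
  createAttempts: number; // how many attempts were made
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical wrapper: `create` performs one gateway call with the given
// name and throws on failure; `isRetryable` classifies the thrown error.
async function createWithRetry(
  baseName: string,
  create: (name: string) => Promise<void>,
  isRetryable: (err: unknown) => boolean,
  maxAttempts = 3,
  baseDelayMs = 250,
): Promise<CreateResult> {
  let name = baseName;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // The same variable would feed both the instance name and TRIGGER_RUNNER_ID.
    name = runnerNameForAttempt(baseName, attempt);
    try {
      await create(name);
      return { ok: true, runnerId: name, createAttempts: attempt };
    } catch (err) {
      if (attempt === maxAttempts || !isRetryable(err)) {
        return { ok: false, runnerId: name, createAttempts: attempt };
      }
      // Assumed linear backoff; the source only says "short backoff".
      await sleep(baseDelayMs * attempt);
    }
  }
  // Only reached when maxAttempts < 1, i.e. no attempt was made.
  return { ok: false, runnerId: name, createAttempts: 0 };
}
```

Because the suffix is derived only from the base name and the attempt number, the retry name is reproducible from logs without storing extra state.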
@changeset-bot

changeset-bot Bot commented Jun 11, 2026

⚠️ No Changeset found

Latest commit: ddbe4b7

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Walkthrough

This PR adds deterministic retry support to compute instance creation. It introduces two helpers—runnerNameForAttempt and isRetryableCreateError—adds configurable retry settings via env and runner options, and refactors the create method to run a bounded retry loop that conditionally retries based on error classification, suffixes runner names on retries, updates event.runnerId on success, and records createAttempts. Tests cover naming and retry-classification behavior and release notes document the fix.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: implementing retry logic for transient instance creation failures in the compute workload manager. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description clearly explains the problem, implementation details, and rationale for design decisions including retry logic and naming strategy. |

@devin-ai-integration devin-ai-integration Bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

@myftija myftija changed the title from "fix(supervisor): retry transient instance create failures instead of abandoning the run" to "fix(supervisor): retry transient instance create failures in compute workload manager" Jun 11, 2026
COMPUTE_INSTANCE_CREATE_MAX_ATTEMPTS (default 3, 1 disables retries) and
COMPUTE_INSTANCE_CREATE_RETRY_BASE_DELAY_MS (default 250) thread through
ComputeWorkloadManagerOptions.createRetry; behavior is unchanged at the
defaults.
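
As an illustration of how these two settings might be read, here is a minimal sketch. The environment variable names and defaults come from the commit message above; `parseCreateRetry`, the `CreateRetryOptions` shape, and its field names are assumptions, and the real supervisor may validate its env differently.

```typescript
// Hypothetical shape for the createRetry option; field names are assumed.
interface CreateRetryOptions {
  maxAttempts: number; // 1 disables retries
  baseDelayMs: number;
}

// Takes the env as a parameter so the sketch does not depend on Node typings.
function parseCreateRetry(
  env: Record<string, string | undefined>,
): CreateRetryOptions {
  const maxAttempts = Number.parseInt(
    env.COMPUTE_INSTANCE_CREATE_MAX_ATTEMPTS ?? "",
    10,
  );
  const baseDelayMs = Number.parseInt(
    env.COMPUTE_INSTANCE_CREATE_RETRY_BASE_DELAY_MS ?? "",
    10,
  );
  return {
    // Fall back to the documented defaults on missing or invalid values.
    maxAttempts:
      Number.isInteger(maxAttempts) && maxAttempts >= 1 ? maxAttempts : 3,
    baseDelayMs:
      Number.isInteger(baseDelayMs) && baseDelayMs >= 0 ? baseDelayMs : 250,
  };
}
```

With an empty environment this yields 3 attempts and a 250 ms base delay, matching the stated defaults; setting `COMPUTE_INSTANCE_CREATE_MAX_ATTEMPTS=1` yields a single attempt with no retries.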
devin-ai-integration[bot]

This comment was marked as resolved.

@myftija myftija merged commit 2397ca2 into main Jun 11, 2026
23 checks passed
@myftija myftija deleted the tri-10293-cold-start-create-retry branch June 11, 2026 16:29