fix(supervisor): retry transient instance create failures in compute workload manager#3902
Conversation
…he run ComputeWorkloadManager.create swallows gateway errors by design, so a cold start that fails placement (e.g. a netns slot with a busy tap, a full node disk) silently abandons the dequeued run until the run engine's PENDING_EXECUTING timeout redrives it minutes later. These failures are transient per placement - redriven runs virtually always succeed - so retry the create up to 3 times with short backoff before giving up. Gateway 5xx and network-level fetch failures are retried; 4xx responses (won't heal) and timeouts (the instance may still be provisioning) are not.
A failed create can leave its instance name registered gateway/fcrun-side until async cleanup runs, so a same-name retry can 409 against our own residue (observed: tap-EBUSY 500 at 18:29Z followed by 409 name_conflict on the retry 2.7s later, costing the full redrive anyway). Give retry attempts a deterministic -rN suffix; attempt 1 keeps the unsuffixed name so the non-retry path is unchanged. The suffixed name flows into both the instance name and TRIGGER_RUNNER_ID from the same variable - every downstream flow (suspend scheduling, snapshot dispatch, cancel guards, run-engine fields) treats it as one opaque self-reported token, and restored VMs already carry deterministic name suffixes. Temporary measure (TRI-10293): the proper fix is gateway-side cleanup of failed-create registrations.
|
WalkthroughThis PR adds deterministic retry support to compute instance creation. It introduces two helpers—runnerNameForAttempt and isRetryableCreateError—adds configurable retry settings via env and runner options, and refactors the create method to run a bounded retry loop that conditionally retries based on error classification, suffixes runner names on retries, updates event.runnerId on success, and records createAttempts. Tests cover naming and retry-classification behavior and release notes document the fix. 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
COMPUTE_INSTANCE_CREATE_MAX_ATTEMPTS (default 3, 1 disables retries) and COMPUTE_INSTANCE_CREATE_RETRY_BASE_DELAY_MS (default 250) thread through ComputeWorkloadManagerOptions.createRetry; behavior is unchanged at the defaults.
ComputeWorkloadManager.createswallows gateway errors currently, so a cold start that fails placement (e.g. a netns slot with a busy tap, a full node disk) silently abandons the dequeued run until the run engine'sPENDING_EXECUTINGheartbeat timeout redrives it via stall detection.Changes
instances.createwith short backoff (default 3 attempts, 250ms backoff), recordingcreateAttemptsin the wide event.-rNname suffix: a failed create can leave its name registered until async cleanup runs. Attempt 1 keeps the unsuffixed name.