add a one-shot retry for object not found errors #6759
Open
glightfoot wants to merge 1 commit into tilt-dev:master
Signed-off-by: Greg Lightfoot <greg.lightfoot@reddit.com>
Summary
Fix a rare transient `NotFound` error during Kubernetes apply reloads.
During rapid reloads, Tilt can hit errors like:
`rolebindings.rbac.authorization.k8s.io "..." not found`
The same YAML usually succeeds if applied again manually.
What was happening
Tilt uses kubectl's apply implementation for Kubernetes upserts. Client-side apply does a read of the current object, computes a patch, then sends the patch/update.
There is a race where the object can be deleted between those steps. For example, a previous reload may have started an async delete, then the next reload begins applying the same object before the API server has fully converged.
Kubectl apply already handles the simple case where the initial read returns `NotFound`: it creates the object. But it does not recover if the read succeeds and the later patch/update returns `NotFound`.
Tilt already had retry handling for a related case where apply returns an object with a deletion timestamp. This change covers the adjacent failure mode where apply fails before returning an updated object.
Fix
When `Apply` returns a Kubernetes-style `NotFound` error, Tilt now treats it as a transient apply race and retries the apply once.
This keeps the retry narrow: only `NotFound` errors are retried.
I am not sure that this is the best way to fix this long-term, but it does solve the issue in our environment.
Why this works
If the object was deleted during kubectl apply's read/patch window, retrying starts a fresh apply operation after the cluster has had another chance to converge. On retry, kubectl either sees that the object no longer exists and creates it, or sees the current object and patches it normally.
Tests
Added a regression test that simulates a RoleBinding apply returning:
`rolebindings.rbac.authorization.k8s.io "app-worker-discovery" not found`
and verifies that Tilt retries and successfully applies the object.