add a one-shot retry for object not found errors #6759

Open
glightfoot wants to merge 1 commit into tilt-dev:master from glightfoot:retry-type-ordering-bug

@glightfoot
Contributor

Summary

Fix a rare transient NotFound error during Kubernetes apply reloads.

During rapid reloads, Tilt can hit errors like:

rolebindings.rbac.authorization.k8s.io "..." not found

The same YAML usually succeeds if applied again manually.

What was happening

Tilt uses kubectl's apply implementation for Kubernetes upserts. Client-side apply reads the current object, computes a patch, then sends the patch/update.

There is a race where the object can be deleted between those steps. For example, a previous reload may have started an async delete, then the next reload begins applying the same object before the API server has fully converged.

Kubectl apply already handles the simple case where the initial read returns NotFound: it creates the object. But it does not recover if the read succeeds and the later patch/update returns NotFound.
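The read/patch window can be illustrated with a small in-memory model. This is a simplified sketch, not kubectl's actual code: `fakeCluster`, `clientSideApply`, and the `betweenReadAndPatch` hook are hypothetical names invented for illustration, and the hook stands in for an async delete that lands after the read.

```go
package main

import "fmt"

// fakeCluster is a hypothetical in-memory stand-in for the API server.
type fakeCluster struct {
	objects map[string]string // object name -> spec
}

func (c *fakeCluster) get(name string) (string, bool) {
	spec, ok := c.objects[name]
	return spec, ok
}

func (c *fakeCluster) create(name, spec string) {
	c.objects[name] = spec
}

func (c *fakeCluster) remove(name string) {
	delete(c.objects, name)
}

// patch fails with a not-found error if the object vanished after the read.
func (c *fakeCluster) patch(name, spec string) error {
	if _, ok := c.objects[name]; !ok {
		return fmt.Errorf("%q not found", name)
	}
	c.objects[name] = spec
	return nil
}

// clientSideApply models the read, then create-or-patch flow described above.
// betweenReadAndPatch (may be nil) simulates concurrent activity in the window
// between the read and the patch.
func clientSideApply(c *fakeCluster, name, spec string, betweenReadAndPatch func()) error {
	if _, exists := c.get(name); !exists {
		// The simple case: the read reports NotFound, so create the object.
		c.create(name, spec)
		return nil
	}
	if betweenReadAndPatch != nil {
		betweenReadAndPatch()
	}
	// The unhandled case: the read succeeded, but the patch may now fail.
	return c.patch(name, spec)
}
```

In this model, passing a hook that calls `remove` makes the first call fail with a not-found error, while a second call with a nil hook takes the create path and succeeds.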

Tilt already had retry handling for a related case where apply returns an object with a deletion timestamp. This change covers the adjacent failure mode where apply fails before returning an updated object.

Fix

When Apply returns a Kubernetes-style NotFound error, Tilt now treats it as a transient apply race:

  1. Rebuild the resource list so it reflects the latest cluster state.
  2. Retry apply once.
  3. Return the retry error if the object still cannot be applied.

This keeps the retry narrow:

  • only NotFound errors are retried
  • only one retry is attempted
  • non-transient apply errors are still returned immediately

I am not sure that this is the best way to fix this long-term, but it does solve the issue in our environment.

Why this works

If the object was deleted during kubectl apply's read/patch window, retrying starts a fresh apply operation after the cluster has had another chance to converge. On retry, kubectl either sees that the object no longer exists and creates it, or sees the current object and patches it normally.

Tests

Added a regression test that simulates a RoleBinding apply returning:

rolebindings.rbac.authorization.k8s.io "app-worker-discovery" not found

and verifies that Tilt retries and successfully applies the object.

Signed-off-by: Greg Lightfoot <greg.lightfoot@reddit.com>