/kind bug
1. What `kops` version are you running? The command `kops version` will display this information.
1.33.1
2. What Kubernetes version are you running? `kubectl version` will print the version if a cluster is running or provide the Kubernetes version specified as a `kops` flag.
```
$ kubectl version
Client Version: v1.35.0
Kustomize Version: v5.7.1
Server Version: v1.32.10
```
3. What cloud provider are you using?
aws
4. What commands did you run? What is the simplest way to reproduce this issue?
Reproduction scenario (simplest observed so far):

- Create a Kubernetes cluster on AWS using kOps with:
  - Kubernetes v1.32.x (kube-proxy v1.32.10),
  - `nodeTerminationHandler` enabled in the kOps cluster spec (using aws-node-termination-handler),
  - standard kubelet resource reservations (no extreme overcommit).
- Run a workload pod on a node, e.g.:
  - Namespace: `zookeeper`
  - Pod: `cluster-2`
  - Container: `zookeeper`
  - Container resources: `resources.requests.memory: 512Mi`, no memory limit set.
  - The pod runs for a long time (in our case since 2026-02-11T10:11:23Z).
- Trigger a node drain / pod eviction not caused by memory pressure, for example:
  - by draining the node (`kubectl drain <node> --force --ignore-daemonsets --delete-emptydir-data`),
  - by causing the node to be terminated by AWS (e.g. via aws-node-termination-handler / ASG scale-in / maintenance), or
  - via a rolling update that cordons & drains the node.
- Observe the pod status after the eviction/termination:
  - `kubectl get pod -n zookeeper cluster-2 -o yaml`
  - and/or watch application logs and metrics around the time of termination.
We have reproduced this multiple times by evicting pods on kOps-managed nodes in this way.
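The status check in the last step can be sketched as a small script. This is a hypothetical helper, not part of any kOps tooling; the embedded JSON mirrors the container status we observe (in practice it would come from `kubectl get pod -n zookeeper cluster-2 -o json`):

```python
import json

# Sample pod status mirroring what we observe after a drain (values from
# section 5 below); in practice, load this from `kubectl ... -o json`.
pod_status = json.loads("""
{
  "phase": "Failed",
  "containerStatuses": [
    {
      "name": "zookeeper",
      "state": {
        "terminated": {
          "reason": "OOMKilled",
          "exitCode": 143,
          "startedAt": "2026-02-11T10:11:23Z",
          "finishedAt": "2026-03-06T13:34:26Z"
        }
      }
    }
  ]
}
""")

for cs in pod_status["containerStatuses"]:
    term = cs["state"].get("terminated")
    if term:
        # Exit code 143 = 128 + 15 (SIGTERM), which contradicts an OOM kill:
        # the kernel OOM killer sends SIGKILL, i.e. exit code 137 = 128 + 9.
        print(f"{cs['name']}: reason={term['reason']} exitCode={term['exitCode']}")
```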
5. What happened after the commands executed?
For an example pod (`zookeeper/cluster-2`):

- The pod is eventually marked as `phase: Failed`.
- The container (`zookeeper`) is reported in `pod.status.containerStatuses` as:
  - `state.terminated.reason = "OOMKilled"`
  - `state.terminated.exitCode = 143`
  - `state.terminated.startedAt = 2026-02-11T10:11:23Z`
  - `state.terminated.finishedAt = 2026-03-06T13:34:26Z`
- There is no corresponding memory pressure or OOM indication:
  - Prometheus (kubelet / cAdvisor) shows:
    - `container_memory_working_set_bytes{namespace="zookeeper",pod="cluster-2",container="zookeeper"}` is stable around ~350–370 MiB before termination, well below the 512Mi request.
    - `container_oom_events_total{namespace="zookeeper",pod="cluster-2",container="zookeeper"} = 0` at and after the termination time.
  - Kubernetes Events in Loki for this pod show `reason=Unhealthy` liveness probe failures (Zookeeper liveness script exit 1), and no events indicating OOM or memory pressure.
- An observability component (Robusta) logs the pod status it receives from the API server and confirms that `pod.status.containerStatuses[*].state.terminated.reason` is `"OOMKilled"` with `exitCode=143` for the terminating container at the time the alert is raised.

So effectively, pods that are drained/evicted without memory pressure are reported as if they were OOMKilled, but with exit code 143 (SIGTERM).
6. What did you expect to happen?
For non-OOM terminations such as node drains or manual pod deletions we would expect:

- `container.state.terminated.reason` not to be `"OOMKilled"` (e.g. `"Error"`, `"Completed"`, or another appropriate reason),
- or the pod-level `status.reason` to indicate `Evicted` or `NodeShutdown`, while the container termination reason remains consistent with the signal / exit code (143 for SIGTERM, 0 for a clean exit, etc.).

Specifically, we do not expect the `OOMKilled` reason for pods terminated due to node drains or liveness failures without any evidence of memory pressure.

This misclassification is problematic because downstream controllers/alerting (including ours) rely on `containerStatuses[*].state.terminated.reason` to distinguish true OOM events from expected evictions.
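To make the expected invariant concrete, here is a minimal sketch of the kind of downstream check our alerting relies on (a hypothetical function, not Robusta's actual code): a termination should only be treated as a genuine OOM kill when the reported reason and the exit code agree.

```python
def looks_like_real_oom(terminated: dict) -> bool:
    """Treat a termination as a genuine OOM kill only when reason AND exit
    code agree. The kernel OOM killer sends SIGKILL, so a real OOM kill has
    exit code 137 (128 + 9); exit code 143 (128 + 15) means the container
    exited on SIGTERM, e.g. during a node drain."""
    return (
        terminated.get("reason") == "OOMKilled"
        and terminated.get("exitCode") == 137  # 128 + SIGKILL(9)
    )

# The status we actually observe after a drain (section 5):
drained = {"reason": "OOMKilled", "exitCode": 143}
print(looks_like_real_oom(drained))  # → False
```

Under this check the reported status is self-contradictory: the reason claims an OOM kill, but the exit code says SIGTERM, so any consumer must either trust the reason (false OOM alerts on every drain) or trust the exit code (and ignore the reason field entirely).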
7. Please provide your cluster manifest.
```yaml
# REDACTED / simplified example
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: prod
spec:
  cloudProvider: aws
  kubernetesVersion: 1.32.10
  api:
    loadBalancer:
      type: Public
  nodeTerminationHandler:
    enabled: true
    cpuRequest: 200m
    enableRebalanceMonitoring: true
    enableSQSTerminationDraining: true
    managedASGTag: "aws-node-termination-handler/managed"
    prometheusEnable: true
  kubelet:
    # Typical reservations; nothing exotic
    kubeReserved:
      cpu: "1"
      memory: "2Gi"
      ephemeral-storage: "1Gi"
    systemReserved:
      cpu: "500m"
      memory: "1Gi"
      ephemeral-storage: "1Gi"
    enforceNodeAllocatable: "pods,system-reserved,kube-reserved"
  # ... other standard kOps cluster config (subnets, IAM, etc.) ...
```
8. Please run the commands with most verbose logging by adding the `-v 10` flag. Paste the logs into this report, or in a gist and provide the gist link here.
n.a.
9. Anything else do we need to know?
Any guidance is appreciated 🙏