Sacred Systems
Pod Lifecycle and Failure States
Pods are the symptom surface. If you can’t interpret their phases, reasons, and events, you cannot diagnose the cluster with discipline.
Authored as doctrine; evaluated as systems craft.
Doctrine
A pod is the smallest schedulable unit, but it is not the right unit of operation. Operators still read pods because pods record the first truth of failure: what the scheduler decided, what kubelet attempted, and what the runtime rejected.
Kubblai doctrine: read the pod as testimony, then return to the controller as the unit of intent.
Phases vs container states
Pod phase is a coarse summary. Container state carries the detail: waiting reasons, termination reasons, exit codes, and whether the last crash is preserved.
Most investigations fail because the operator reads only `kubectl get pods` and ignores `describe` and events.
- Pending often means scheduling constraints or storage prerequisites.
- Running does not mean Ready; readiness gates traffic.
- CrashLoopBackOff is a restart pattern, not a root cause.
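Phase alone will not tell you which container crashed or why. A minimal sketch of reading container-level detail: `status.json` and its values are illustrative stand-ins for real pod status; in a live cluster you would query the API server instead, as shown in the comment.

```shell
# Illustrative pod status snippet. In a live cluster you would query, e.g.:
#   kubectl get pod <pod> -n <ns> \
#     -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
cat > status.json <<'EOF'
{"phase":"Running","containerStatuses":[{"name":"app","ready":false,
 "lastState":{"terminated":{"reason":"OOMKilled","exitCode":137}}}]}
EOF
# The phase says Running; the container record says the last crash was an OOM kill.
grep -o '"reason":"[^"]*"' status.json
grep -o '"exitCode":[0-9]*' status.json
```

The phase summarizes; `lastState.terminated` preserves the crash you actually need to explain.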
Events are the platform speaking
Events are not perfect, but they are often the shortest path to classification: image pull, mount, admission, probes, scheduling. When they repeat, they are also a cost signal.
Treat repeated events as a control loop gone wrong: the platform is retrying and making things worse.
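One way to make that cost signal visible is to count event reasons. A sketch under assumptions: `events.txt` here is a fabricated stand-in for the output of `kubectl get events -n <ns> -o custom-columns=REASON:.reason --no-headers`.

```shell
# events.txt stands in for:
#   kubectl get events -n <ns> -o custom-columns=REASON:.reason --no-headers
cat > events.txt <<'EOF'
BackOff
FailedMount
BackOff
FailedScheduling
BackOff
EOF
# Count repeats; the loudest reason is usually the retry loop doing damage.
sort events.txt | uniq -c | sort -rn
```

A reason that dominates the count is a control loop retrying, not a one-off failure.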
```shell
kubectl describe pod <pod> -n <ns>
kubectl get events -n <ns> --sort-by=.lastTimestamp | tail -n 40
```

Common failure signatures
Learn a small set of signatures. Do not treat each incident as a new story.
- ImagePullBackOff: naming/auth/egress/rate limiting; no app signal exists yet.
- OOMKilled: memory limit exceeded or node pressure; fix economics before you tune probes.
- Readiness failing: traffic gate is closed; endpoints empty or withheld.
- Pending: placement failure, quota, policy, or prerequisites (PVC).
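The signatures above can be encoded as a triage table. A minimal sketch: `classify` is a hypothetical helper, not a kubectl feature, and the STATUS strings are the ones `kubectl get pods` commonly displays.

```shell
# Hypothetical triage helper: map a pod STATUS to a first move.
classify() {
  case "$1" in
    ImagePullBackOff|ErrImagePull) echo "check image name, pull secrets, registry egress" ;;
    OOMKilled)                     echo "check memory limits and node pressure" ;;
    CrashLoopBackOff)              echo "read last-crash logs: kubectl logs --previous" ;;
    Pending)                       echo "describe pod: scheduling, quota, PVC prerequisites" ;;
    *)                             echo "describe pod and read events" ;;
  esac
}
classify ImagePullBackOff
```

The point is not the script; it is that classification is a lookup, not an investigation. Investigation starts after the signature is named.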
Field notes
Pods die for reasons unrelated to your application: node pressure, eviction thresholds, storage detach latency, admission failure. Don’t let the surface symptom trick you into changing code when the platform is the cause.
If you’re on call, your first job is classification. Your second job is containment.