Evidence First
Troubleshooting Atlas
Common Kubernetes failures, written as procedure: symptom → evidence → commands → smallest safe fix → verification → next reading.
How to use the atlas
A small protocol that prevents thrash.
- Confirm the symptom precisely (don’t generalize).
- Run the inspection commands and collect evidence.
- Choose the smallest safe fix and verify convergence.
- Follow related readings to strengthen the underlying model.
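The protocol above can be sketched as a small record that refuses to skip steps. This is only an illustration, not part of the atlas itself: the `TriageRecord` class, its method names, and the step labels are hypothetical, and the pod name, namespace, and observation in the demo are invented placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TriageRecord:
    """One pass through the protocol for a single, precisely stated symptom."""

    symptom: str
    evidence: List[str] = field(default_factory=list)
    fix: Optional[str] = None
    verified: bool = False

    def add_evidence(self, command: str, observation: str) -> None:
        # Store the command next to what it showed so the claim can be re-checked.
        self.evidence.append(f"{command} -> {observation}")

    def choose_fix(self, fix: str) -> None:
        # Refuse to pick a fix before any evidence has been collected.
        if not self.evidence:
            raise ValueError("collect evidence before choosing a fix")
        self.fix = fix

    def mark_verified(self) -> None:
        # Verification only makes sense once a fix has been chosen.
        if self.fix is None:
            raise ValueError("nothing to verify: no fix has been chosen")
        self.verified = True

    def next_step(self) -> str:
        # Report the earliest protocol step that is still outstanding.
        if not self.evidence:
            return "collect evidence"
        if self.fix is None:
            return "choose smallest safe fix"
        if not self.verified:
            return "verify convergence"
        return "follow related readings"


if __name__ == "__main__":
    # Hypothetical pod name, namespace, and observation, for illustration only.
    record = TriageRecord(symptom="pod web-0 in namespace shop is in CrashLoopBackOff")
    print(record.next_step())
    record.add_evidence("kubectl describe pod web-0 -n shop", "placeholder observation")
    print(record.next_step())
```

The ordering checks are the point of the sketch: a fix cannot be recorded without evidence, and an entry cannot be marked verified without a fix, which mirrors how the protocol is meant to prevent thrash.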
Entries
19 diagnostic texts · built for search and speed
Atlas
TroubleshootAtlas: Pods in CrashLoopBackOff
CrashLoopBackOff is a symptom. This entry provides a canonical triage sequence and safe resolutions.
TroubleshootAtlas: ImagePullBackOff / ErrImagePull
Pull failures are usually naming, auth, or network. This entry gives the shortest path to truth.
TroubleshootAtlas: Service Has No Endpoints
If endpoints are empty, traffic cannot route. This entry teaches the endpoint-first diagnostic sequence.
TroubleshootAtlas: Pods Pending (Scheduling)
Pending pods are placement failures. This entry teaches you to read scheduler testimony and fix the governing constraint.
TroubleshootAtlas: Readiness Probe Failing
Readiness is the traffic gate. This entry teaches probe semantics that prevent silent outages and restart storms.
TroubleshootAtlas: Admission Webhook Timeouts
When admission fails, deploys stop. This entry teaches the shortest path to identifying the webhook and restoring the gate of truth.
TroubleshootAtlas: Deployment Rollout Stalled
A rollout is a control loop with gates. This entry teaches how to read the gates and restore forward motion safely.
TroubleshootAtlas: Liveness Probe Restarts
Liveness is the kill switch. When it is wrong, it creates outages that look like instability.
TroubleshootAtlas: Ingress Returns 502/503
When ingress returns 502/503, the edge is telling you upstream is missing, unhealthy, or too slow.
TroubleshootAtlas: PVC Pending (Storage)
PVC Pending is a binding failure. This entry teaches how to read storage events and unblock provisioning safely.
TroubleshootAtlas: Node NotReady
Node NotReady is a failure domain boundary. This entry teaches containment first, then root cause.
TroubleshootAtlas: OOMKilled and Evictions
Memory failures are accounting failures. This entry shows how to prove the killer and right-size with restraint.
TroubleshootAtlas: HPA Not Scaling
When HPA does nothing, either metrics are missing or the signal is wrong. This entry teaches the proof path.
TroubleshootAtlas: DNS Resolution Failures
DNS failures wear many masks. This entry teaches a proof-first sequence to distinguish naming mistakes from routing/policy/outage realities.
TroubleshootAtlas: RBAC Forbidden
RBAC errors are deterministic. This entry teaches you to turn ‘forbidden’ into an exact sentence and fix the binding with restraint.
TroubleshootAtlas: PVC Mount/Attach Failures
If the PVC is Bound but pods still fail to start, the problem is in the attach or mount stage. This entry teaches how to prove which stage failed and why.
TroubleshootAtlas: Pod Stuck Terminating
A stuck termination is usually a finalizer, an unreachable node, or a storage/network dependency. This entry teaches containment and safe cleanup.
TroubleshootAtlas: ConfigMap/Secret Not Found
Most config failures are naming failures. This entry teaches you to prove object existence and wiring before you restart anything.
TroubleshootAtlas: Readiness Flapping
Readiness flapping creates traffic storms and partial outages. This entry teaches you to stabilize the gate without lying to yourself.