Evidence First

Troubleshooting Atlas

Common Kubernetes failures, written as procedure: symptom → evidence → commands → smallest safe fix → verification → next reading.

How to use the atlas

A small protocol that prevents thrash.

1. Confirm the symptom precisely (don’t generalize).
2. Run the inspection commands and collect evidence.
3. Choose the smallest safe fix and verify convergence.
4. Follow related readings to strengthen the underlying model.
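
The protocol above can be sketched as a small record that refuses to skip steps. This is an illustration only: the `Triage` class and its method names are hypothetical, not part of the atlas or of any Kubernetes tooling.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Triage:
    """One pass through the protocol for a single, precisely stated symptom."""
    symptom: str
    evidence: List[str] = field(default_factory=list)
    fix: Optional[str] = None
    verified: bool = False

    def collect(self, observation: str) -> None:
        # Step 2: record what the inspection commands actually showed.
        self.evidence.append(observation)

    def choose_fix(self, fix: str) -> None:
        # Step 3 is gated on step 2: no evidence, no fix.
        if not self.evidence:
            raise ValueError("collect evidence before choosing a fix")
        self.fix = fix

    def verify(self, converged: bool) -> None:
        # Verification only makes sense once a fix has been chosen.
        if self.fix is None:
            raise ValueError("nothing to verify: no fix was chosen")
        self.verified = converged
```

The point of the gate in `choose_fix` is the same as the protocol's: a fix chosen without evidence is a guess, and guesses are what produce thrash.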

Entries

19 diagnostic texts · built for search and speed

Atlas

Troubleshoot

Atlas: Pods in CrashLoopBackOff

CrashLoopBackOff is a symptom. This entry provides a canonical triage sequence and safe resolutions.
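
A minimal sketch of the first triage step, assuming the input is the parsed JSON from `kubectl get pod <name> -o json`. The function name `last_crash` is hypothetical; the field names follow the Pod status API. It separates what the container is doing now (waiting in back-off) from how it last exited, which is the evidence that matters.

```python
def last_crash(pod: dict) -> list:
    """Summarize, per container, the current waiting reason and the previous exit."""
    summaries = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        waiting = (cs.get("state") or {}).get("waiting") or {}
        terminated = (cs.get("lastState") or {}).get("terminated") or {}
        summaries.append({
            "container": cs.get("name"),
            "waiting_reason": waiting.get("reason"),      # e.g. "CrashLoopBackOff"
            "last_exit_code": terminated.get("exitCode"),  # exit code of the previous run
            "last_reason": terminated.get("reason"),       # e.g. "Error", "OOMKilled"
            "restarts": cs.get("restartCount", 0),
        })
    return summaries
```

The previous run's output is usually the next piece of evidence; `kubectl logs <pod> --previous` retrieves it.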

Atlas: ImagePullBackOff / ErrImagePull

Pull failures usually come down to a wrong image name or tag, missing registry credentials, or a network problem. This entry gives the shortest path to truth.

Atlas: Service Has No Endpoints

If a Service has no endpoints, it has nowhere to route traffic. This entry teaches the endpoint-first diagnostic sequence.
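
An endpoint-first check can be sketched as two questions: do any pod labels match the Service selector, and are any of the matching pods Ready? The sketch below assumes the Service and pods are parsed `kubectl get ... -o json` objects from the same namespace. The function names are hypothetical, and the logic is deliberately simplified (it ignores `publishNotReadyAddresses`, terminating pods, and port mismatches).

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """A Service selector matches when every key/value pair appears in the pod labels."""
    if not selector:
        # A Service without a selector gets no automatically managed endpoints.
        return False
    return all(labels.get(key) == value for key, value in selector.items())


def is_ready(pod: dict) -> bool:
    """True when the pod reports a Ready condition with status "True"."""
    for cond in pod.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond.get("status") == "True"
    return False


def explain_empty_endpoints(service: dict, pods: list) -> str:
    selector = service.get("spec", {}).get("selector") or {}
    matched = [
        p for p in pods
        if selector_matches(selector, p.get("metadata", {}).get("labels") or {})
    ]
    if not matched:
        return "no pod labels match the Service selector"
    if not any(is_ready(p) for p in matched):
        return "pods match the selector but none are Ready"
    return "at least one Ready pod matches; check ports next"
```

The order matters: a selector mismatch is a wiring problem, while matched-but-not-Ready pods point to the readiness probe or the workload itself.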

Atlas: Pods Pending (Scheduling)

Pending pods are placement failures. This entry teaches you to read scheduler testimony and fix the governing constraint.

Atlas: Readiness Probe Failing

Readiness is the traffic gate. This entry teaches probe semantics that prevent silent outages and restart storms.

Atlas: Admission Webhook Timeouts

When admission fails, deploys stop. This entry teaches the shortest path to identifying the webhook and restoring the gate of truth.

Atlas: Deployment Rollout Stalled

A rollout is a control loop with gates. This entry teaches how to read the gates and restore forward motion safely.

Atlas: Liveness Probe Restarts

Liveness is the kill switch. When it is wrong, it creates outages that look like instability.

Atlas: Ingress Returns 502/503

When ingress returns 502/503, the edge is telling you that the upstream is missing, unhealthy, or too slow.

Atlas: PVC Pending (Storage)

PVC Pending is a binding failure. This entry teaches how to read storage events and unblock provisioning safely.

Atlas: Node NotReady

Node NotReady is a failure domain boundary. This entry teaches containment first, then root cause.

Atlas: OOMKilled and Evictions

Memory failures are accounting failures. This entry shows how to prove the killer and right-size with restraint.
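
Proving the killer means telling apart a container killed at its own memory limit from a pod evicted by the kubelet under node pressure. A rough sketch, assuming parsed `kubectl get pod <name> -o json` output; the function name `memory_killer` is hypothetical. An eviction message names the resource that was short, which is not always memory, so read it rather than assume.

```python
def memory_killer(pod: dict) -> str:
    """Report whether pod status records an eviction or an OOMKilled container."""
    status = pod.get("status", {})
    # Evicted pods carry the reason on the pod itself, not on a container.
    if status.get("reason") == "Evicted":
        return "pod evicted by the kubelet: " + status.get("message", "")
    for cs in status.get("containerStatuses", []):
        # Check both the current and the previous termination record.
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                return "container {} OOMKilled (exit code {})".format(
                    cs.get("name"), terminated.get("exitCode")
                )
    return "no memory kill recorded in pod status"
```

The two answers lead to different fixes: an OOMKilled container points at that container's limit or usage, while an eviction points at requests and node-level pressure.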

Atlas: HPA Not Scaling

When HPA does nothing, either metrics are missing or the signal is wrong. This entry teaches the proof path.
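
One way to test whether the signal is wrong is to redo the autoscaler's arithmetic by hand. The sketch below follows the scaling formula in the Kubernetes HPA documentation, desired = ceil(current replicas × current value / target value), with the default tolerance of 0.1. It is a simplification: it ignores min/max replica bounds, missing metrics, not-yet-ready pods, and stabilization windows, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas: int,
                     current_value: float,
                     target_value: float,
                     tolerance: float = 0.1) -> int:
    """Replica count the documented HPA formula would ask for, before any bounds."""
    ratio = current_value / target_value
    # Within tolerance of the target, the autoscaler leaves the count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)
```

For example, `desired_replicas(4, 200, 100)` returns 8, while `desired_replicas(4, 105, 100)` returns 4 because the ratio sits inside the tolerance band. If the real metric never leaves that band, "HPA does nothing" is the correct behavior.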

Atlas: DNS Resolution Failures

DNS failures wear many masks. This entry teaches a proof-first sequence to distinguish naming mistakes from routing, network policy, or outage problems.

Atlas: RBAC Forbidden

RBAC errors are deterministic. This entry teaches you to turn ‘forbidden’ into an exact sentence and fix the binding with restraint.
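
Turning "forbidden" into an exact sentence can be sketched as parsing the API server's message into who, verb, resource, and namespace, then replaying that question with `kubectl auth can-i`. The regular expression below assumes the common message shape shown in the test string; other shapes (subresources, non-resource URLs) are left unhandled and return `None`. The function name is hypothetical.

```python
import re
from typing import List, Optional

# Assumed shape: User "<user>" cannot <verb> resource "<resource>"
# in API group "<group>" [in the namespace "<namespace>"]
_FORBIDDEN = re.compile(
    r'User "(?P<user>[^"]+)" cannot (?P<verb>\S+) resource "(?P<resource>[^"]+)"'
    r' in API group "(?P<group>[^"]*)"'
    r'(?: in the namespace "(?P<namespace>[^"]+)")?'
)


def can_i_command(message: str) -> Optional[List[str]]:
    """Build a `kubectl auth can-i` command that replays a forbidden request."""
    m = _FORBIDDEN.search(message)
    if m is None or "/" in m["resource"]:
        # Unknown shape or a subresource such as pods/log: do not guess.
        return None
    resource = m["resource"]
    if m["group"]:
        # Non-core groups are written as resource.group, e.g. deployments.apps.
        resource = "{}.{}".format(resource, m["group"])
    cmd = ["kubectl", "auth", "can-i", m["verb"], resource, "--as", m["user"]]
    if m["namespace"]:
        cmd += ["--namespace", m["namespace"]]
    return cmd
```

Running the resulting command before and after changing a binding gives a yes/no verification that matches the original failure exactly, which keeps the fix as narrow as the denial.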

Atlas: PVC Mount/Attach Failures

If the PVC is Bound but pods still fail, the problem is attachment or mount. This entry teaches how to prove which stage failed and why.

Atlas: Pod Stuck Terminating

A stuck termination is usually a finalizer, an unreachable node, or a storage/network dependency. This entry teaches containment and safe cleanup.

Atlas: ConfigMap/Secret Not Found

Most config failures are naming failures. This entry teaches you to prove object existence and wiring before you restart anything.

Atlas: Readiness Flapping

Readiness flapping creates traffic storms and partial outages. This entry teaches you to stabilize the gate without lying to yourself.