Evidence First

Troubleshooting Atlas

Common Kubernetes failures, written as procedure: symptom → evidence → commands → smallest safe fix → verification → next reading.

How to use the atlas

A small protocol that prevents thrash.

  • Confirm the symptom precisely (don’t generalize).
  • Run the inspection commands and collect evidence.
  • Choose the smallest safe fix and verify convergence.
  • Follow related readings to strengthen the underlying model.

Entries

19 diagnostic texts · built for search and speed

Atlas

Troubleshoot

Atlas: Pods in CrashLoopBackOff

CrashLoopBackOff is a symptom. This entry provides a canonical triage sequence and safe resolutions.

Atlas: ImagePullBackOff / ErrImagePull

Pull failures are usually naming, auth, or network. This entry gives the shortest path to truth.

Atlas: Service Has No Endpoints

If a Service's endpoints are empty, it has no backends and traffic cannot route. This entry teaches the endpoint-first diagnostic sequence.
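
As a rough illustration of endpoint-first reasoning, the sketch below checks the two most common reasons a Service ends up with no endpoints: no pod labels match the selector, or matching pods exist but none report Ready. It works on plain dicts shaped like the JSON from `kubectl get service -o json` and `kubectl get pods -o json`. The function names and return strings are hypothetical, and the pods are assumed to come from the Service's own namespace.

```python
def selector_matches(selector, labels):
    """True when every key/value in the Service selector appears in the pod labels."""
    # A Service without a selector gets no automatically managed endpoints,
    # so this sketch treats an empty selector as matching nothing.
    if not selector:
        return False
    return all(labels.get(key) == value for key, value in selector.items())


def is_ready(pod):
    """True when the pod reports a Ready condition with status "True"."""
    conditions = pod.get("status", {}).get("conditions") or []
    return any(c.get("type") == "Ready" and c.get("status") == "True"
               for c in conditions)


def explain_empty_endpoints(service, pods):
    """Return a short, hypothetical diagnosis string for a Service and candidate pods."""
    selector = service.get("spec", {}).get("selector") or {}
    matching = [p for p in pods
                if selector_matches(selector, p.get("metadata", {}).get("labels") or {})]
    if not matching:
        return "no pods match the selector"
    if not any(is_ready(p) for p in matching):
        return "pods match but none are Ready"
    return "endpoints expected"
```

Running the check against saved JSON keeps the diagnosis reproducible: the first result points at labels and selectors, the second at readiness probes.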

Atlas: Pods Pending (Scheduling)

Pending pods are placement failures. This entry teaches you to read scheduler testimony and fix the governing constraint.

Atlas: Readiness Probe Failing

Readiness is the traffic gate. This entry teaches probe semantics that prevent silent outages and restart storms.

Atlas: Admission Webhook Timeouts

When admission fails, deploys stop. This entry teaches the shortest path to identifying the webhook and restoring the gate of truth.

Atlas: Deployment Rollout Stalled

A rollout is a control loop with gates. This entry teaches how to read the gates and restore forward motion safely.

Atlas: Liveness Probe Restarts

Liveness is the kill switch. When it is wrong, it creates outages that look like instability.

Atlas: Ingress Returns 502/503

When ingress returns 502/503, the edge is telling you upstream is missing, unhealthy, or too slow.

Atlas: PVC Pending (Storage)

PVC Pending is a binding failure. This entry teaches how to read storage events and unblock provisioning safely.

Atlas: Node NotReady

Node NotReady is a failure domain boundary. This entry teaches containment first, then root cause.

Atlas: OOMKilled and Evictions

Memory failures are accounting failures. This entry shows how to prove the killer and right-size with restraint.
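
To make "prove the killer" concrete, here is a minimal sketch that separates a container killed for exceeding its memory limit from a pod evicted by the kubelet. It reads the pod's `status` as returned by `kubectl get pod <name> -o json`. The function name and return strings are hypothetical, and the sketch only looks at the fields named in the comments.

```python
def memory_failure_kind(pod):
    """Classify a pod dict as evicted, OOM-killed, or showing no memory evidence."""
    status = pod.get("status", {})

    # Evicted pods end in phase "Failed" with reason "Evicted".
    # status.message names the resource under pressure, which may not be memory.
    if status.get("phase") == "Failed" and status.get("reason") == "Evicted":
        return "evicted"

    # An OOM kill is recorded per container, either in the current state or,
    # after a restart, in lastState.
    for container in status.get("containerStatuses") or []:
        for key in ("state", "lastState"):
            terminated = (container.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                return "oomkilled:" + container.get("name", "?")

    return "no-memory-evidence"
```

Knowing which of the two happened matters for the fix: an OOM kill points at the container's own limit, while an eviction points at pressure on the node.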

Atlas: HPA Not Scaling

When HPA does nothing, either metrics are missing or the signal is wrong. This entry teaches the proof path.

Atlas: DNS Resolution Failures

DNS failures wear many masks. This entry teaches a proof-first sequence to distinguish naming mistakes from routing/policy/outage realities.

Atlas: RBAC Forbidden

RBAC errors are deterministic. This entry teaches you to turn ‘forbidden’ into an exact sentence and fix the binding with restraint.

Atlas: PVC Mount/Attach Failures

If the PVC is Bound but pods still fail to start, the problem is attachment or mount. This entry teaches how to prove which stage failed and why.

Atlas: Pod Stuck Terminating

A stuck termination is usually a finalizer, an unreachable node, or a storage/network dependency. This entry teaches containment and safe cleanup.
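
The three usual causes can be checked in order from the pod object itself. The sketch below is one possible triage, assuming a pod dict from `kubectl get pod <name> -o json` and a `node_ready` flag the caller derives separately from the node's Ready condition. The function name, the flag, and the returned messages are all hypothetical.

```python
def terminating_blocker(pod, node_ready):
    """Suggest what is most likely holding a terminating pod, checked in order."""
    metadata = pod.get("metadata", {})

    # A pod is only terminating once the API server has set deletionTimestamp.
    if not metadata.get("deletionTimestamp"):
        return "not terminating"

    # The object cannot be removed while any finalizer remains listed.
    finalizers = metadata.get("finalizers") or []
    if finalizers:
        return "waiting on finalizers: " + ", ".join(finalizers)

    # Without a reachable kubelet, nothing can confirm the containers stopped.
    if not node_ready:
        return "node unreachable: kubelet cannot confirm shutdown"

    return "check container shutdown and volume/network teardown on the node"
```

Checking finalizers before the node keeps the cleanup safe: removing a finalizer or forcing deletion skips work that some controller or kubelet was expected to finish.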

Atlas: ConfigMap/Secret Not Found

Most config failures are naming failures. This entry teaches you to prove object existence and wiring before you restart anything.

Atlas: Readiness Flapping

Readiness flapping creates traffic storms and partial outages. This entry teaches you to stabilize the gate without lying to yourself.