KubePodNotHealthy

Kubernetes Pod Not Healthy

Overview

This runbook describes how to diagnose and resolve an alert that a Kubernetes pod is not healthy. Common causes include application errors, resource constraints, and underlying infrastructure issues.

Initial Response

Alert received indicating pod not healthy.
Acknowledge the alert and assign yourself as the incident owner.
Notify the team about the ongoing incident using the primary communication channel.
Update the incident status on the incident tracking system.

Check RunBook Match

This runbook applies if the output of kubectl get pod shows a line like this for your pod:

NAME              READY   STATUS      RESTARTS   AGE
pod-gdhsd         0/1     Failed      0          4d7h
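
To check a whole namespace rather than one pod at a time, the listing can be filtered down to pods that look unhealthy. The following is a minimal sketch, not an official tool: the helper name unhealthy_pods is hypothetical, and it assumes the default column layout shown above (READY in the second column, STATUS in the third).

```shell
# Hypothetical helper: print NAME, READY and STATUS for pods that look unhealthy.
# Reads `kubectl get pods` output on stdin and assumes the default columns
# (NAME READY STATUS RESTARTS AGE).
# A pod is reported when its STATUS is neither Running nor Completed, or when
# it is Running but not all of its containers are ready (for example 0/1).
unhealthy_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")
    if ($3 != "Completed" && ($3 != "Running" || ready[1] != ready[2]))
      print $1, $2, $3
  }'
}

# Usage (replace <namespace> first):
#   kubectl get pods -n <namespace> | unhealthy_pods
```

If only pods in the Failed phase matter, kubectl can also filter on the server side with kubectl get pods -n <namespace> --field-selector=status.phase=Failed.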

Detailed Steps

1) Gather information

To determine the root cause here, first gather relevant information that you may need to refer back to later:

kubectl get pods -n <namespace>
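
Because an unhealthy pod can be deleted or replaced while you investigate, it can help to save its state to files first. This is a minimal sketch that assumes kubectl is already pointed at the affected cluster; the function name snapshot_pod and the output file names are hypothetical.

```shell
# Hypothetical helper: save a pod's description, manifest and namespace events.
# Usage: snapshot_pod <pod-name> <namespace> [output-dir]
snapshot_pod() {
  pod=$1
  ns=$2
  dir=${3:-./incident-$pod}
  mkdir -p "$dir" || return 1
  # Human-readable summary, including container states and recent events.
  kubectl describe pod "$pod" -n "$ns" > "$dir/describe.txt"
  # Full object as returned by the API server, including status conditions.
  kubectl get pod "$pod" -n "$ns" -o yaml > "$dir/pod.yaml"
  # Events in the namespace, sorted by their last timestamp.
  kubectl get events -n "$ns" --sort-by=.lastTimestamp > "$dir/events.txt"
}
```

For example, snapshot_pod pod-gdhsd <namespace> would write the three files under ./incident-pod-gdhsd.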

2) Check Pod Logs

Fetch logs from the problematic pod to identify any application errors or exceptions:

kubectl logs <pod-name> -n <namespace>

Analyze the logs for any error messages that might provide insights into the cause of the problem.
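
If a container has restarted, the useful error is often in the logs of the previous container instance, which kubectl logs returns with the --previous flag. The sketch below combines both sets of logs and keeps only lines that look like errors. The function name pod_error_lines, the 500-line tail and the search pattern are assumptions; adjust the pattern to your application's log format.

```shell
# Hypothetical helper: print error-looking lines from the current and the
# previous container instances of a pod.
# Usage: pod_error_lines <pod-name> <namespace>
pod_error_lines() {
  {
    kubectl logs "$1" -n "$2" --all-containers=true --tail=500
    # --previous fails if no earlier container instance exists, so its
    # error output is discarded.
    kubectl logs "$1" -n "$2" --all-containers=true --tail=500 --previous 2>/dev/null
  } | grep -i -E 'error|exception|fatal|panic'
}
```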

3) Resource Constraints

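Check the pod's resource requests and limits, its recent events, and its container states (for example, a container terminated with reason OOMKilled) for signs of CPU or memory pressure: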

kubectl describe pod <pod-name> -n <namespace>
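
The termination reasons can also be read directly from the pod status, which is easier to scan than the full describe output. This is a sketch under stated assumptions: the function names termination_reasons and was_oom_killed are hypothetical, and the output is one tab-separated line per container (name, current termination reason, last termination reason).

```shell
# Hypothetical helper: list each container's current and last termination reason.
# Fields that are not set (for example on a running container) print as empty.
# Usage: termination_reasons <pod-name> <namespace>
termination_reasons() {
  kubectl get pod "$1" -n "$2" -o \
    jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state.terminated.reason}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Hypothetical helper: succeed if any container reports OOMKilled, meaning it
# was killed for running out of memory.
# Usage: was_oom_killed <pod-name> <namespace>
was_oom_killed() {
  termination_reasons "$1" "$2" | grep -q 'OOMKilled'
}

# Live CPU and memory usage needs the Metrics Server add-on in the cluster:
#   kubectl top pod <pod-name> -n <namespace> --containers
```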

4) Network Connectivity

Verify that the pod can reach its dependencies (databases, services, APIs, etc.).
Check if there are any network policies blocking communication.
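
A pod that has failed cannot run commands, so one option is to probe the dependency from a short-lived pod in the same namespace. This is only a sketch: the function name probe_dependency is hypothetical, and it assumes that the busybox image can be pulled in your cluster and that its nc build supports the -z and -w options. The probe pod does not carry the original pod's labels, so network policies that select pods by label may treat it differently.

```shell
# Hypothetical helper: attempt a TCP connection to host:port from a temporary
# pod, which is deleted when the command finishes (--rm).
# Assumes busybox is pullable and that its nc supports -z (connect only) and
# -w (timeout in seconds).
# Usage: probe_dependency <namespace> <host> <port>
probe_dependency() {
  ns=$1 host=$2 port=$3
  kubectl run "net-probe-$$" -n "$ns" --rm -i --restart=Never \
    --image=busybox -- nc -z -w 5 "$host" "$port"
}

# List the network policies defined in the namespace:
#   kubectl get networkpolicy -n <namespace>
```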

5) Rolling Updates

If the pod was recently updated, check if the new version is causing issues.
Consider rolling back to a previous version to see if the problem persists.
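
If the pod is managed by a Deployment, which is an assumption and not something this runbook states, the rollout history and rollback can be sketched as follows. The function name rollback_deployment and the 120-second timeout are hypothetical. The rollback only runs after an explicit confirmation, because it changes what is running in the cluster.

```shell
# Hypothetical helper: show the rollout history, then roll back to the previous
# revision only if the operator confirms with "y".
# Assumes the pod is managed by a Deployment.
# Usage: rollback_deployment <deployment-name> <namespace>
rollback_deployment() {
  deploy=$1 ns=$2
  kubectl rollout history "deployment/$deploy" -n "$ns" || return 1
  printf 'Roll back %s to the previous revision? [y/N] ' "$deploy"
  read -r answer
  [ "$answer" = "y" ] || { echo "Aborted."; return 1; }
  # Undo the latest rollout, then wait for the rollback to complete.
  kubectl rollout undo "deployment/$deploy" -n "$ns" &&
    kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=120s
}
```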

6) Service Dependencies

Investigate whether any services or resources that the pod depends on are experiencing issues.
Check the health of databases, APIs, and external services.
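
For dependencies that run in the same cluster behind a Service, a quick health signal is whether the Service currently has any ready endpoint addresses. This is a minimal sketch that does not cover external services; the function name service_has_endpoints is hypothetical.

```shell
# Hypothetical helper: succeed if the Service has at least one ready endpoint
# address, and fail if it has none (or if the lookup itself fails).
# Usage: service_has_endpoints <service-name> <namespace>
service_has_endpoints() {
  ips=$(kubectl get endpoints "$1" -n "$2" -o jsonpath='{.subsets[*].addresses[*].ip}')
  [ -n "$ips" ]
}

# Example:
#   service_has_endpoints <service-name> <namespace> || echo "no ready backends"
```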

Escalation

If the issue persists or is severe, escalate to a senior SRE for additional support and guidance.

Further Information

Pod Lifecycle


