CrashLoopBackOff

A CrashLoopBackOff error occurs when a pod repeatedly fails to start in Kubernetes: the kubelet restarts the failing container with an increasing back-off delay between attempts.

Initial Response


  • Alert received indicating a pod in CrashLoopBackOff state.

  • Acknowledge the alert and assign yourself as the incident owner.

  • Notify the team about the ongoing incident using the primary communication channel.

  • Update the incident status on the incident tracking system.

Check RunBook Match

When running a kubectl get pods command, you will see a line like this in the output for your pod:

NAME                     READY     STATUS             RESTARTS   AGE
blockops-7ef9efa7cd-qd   0/1       CrashLoopBackOff   2          1m

If you see something like:

NAME                 READY     STATUS                 RESTARTS   AGE
blockops-7ef9efa    0/2       Init:CrashLoopBackOff   2          1m

then continue with this runbook, remembering that the problem is likely specific to the init container.
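
If you need to check whether any other pods in the cluster are affected, a quick filter over kubectl get pods output works; this is a minimal sketch (the grep pattern also matches the Init: variant):

# List all pods currently in a CrashLoopBackOff state (also matches Init:CrashLoopBackOff)
kubectl get pods --all-namespaces | grep CrashLoopBackOff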

Detailed Steps


1) Gather information

Run these commands to gather relevant information in one step:

kubectl describe -n [NAMESPACE_NAME] pod [POD_NAME] > /tmp/runbooks_describe_pod.txt
kubectl logs [POD_NAME] --all-containers -n [NAMESPACE_NAME] > /tmp/runbooks_pod_logs.txt
kubectl logs [POD_NAME] --all-containers --previous -n [NAMESPACE_NAME] > /tmp/runbooks_previous_pod_logs.txt

2) Examine Events section in output

Look at the Events section of your /tmp/runbooks_describe_pod.txt file.
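
If the describe output is long, you can print just the Events section from the file you saved in step 1); a minimal sketch using grep:

# Show the Events section (plus the following 20 lines) of the saved describe output
grep -A 20 '^Events' /tmp/runbooks_describe_pod.txt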

2.1) Back-off restarting failed container

If you see a warning like the following in your /tmp/runbooks_describe_pod.txt output:

Warning  BackOff    8s (x2 over 9s)    kubelet, dali      Back-off restarting failed container

then the pod has repeatedly failed to start up successfully.

Make a note of any containers that have a State of Waiting and a Reason of CrashLoopBackOff in the description. These are the containers you will need to fix.
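
As a shortcut, the following sketch uses kubectl's jsonpath filter support to print only the containers currently waiting in CrashLoopBackOff (for init containers, query .status.initContainerStatuses instead):

# Print the names of containers whose waiting reason is CrashLoopBackOff
kubectl get pod [POD_NAME] -n [NAMESPACE_NAME] \
  -o jsonpath='{range .status.containerStatuses[?(@.state.waiting.reason=="CrashLoopBackOff")]}{.name}{"\n"}{end}'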

3) Check the exit code

Examine the describe output in /tmp/runbooks_describe_pod.txt and look for the Exit Code.
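
You can also pull the exit code for each container straight from the API; a minimal sketch using kubectl's jsonpath output:

# Show each container's name and the exit code from its last terminated state
kubectl get pod [POD_NAME] -n [NAMESPACE_NAME] \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'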

3.1) Exit Code 0

This exit code implies that the specified container command completed ‘successfully’, but the container exited and was restarted too often for Kubernetes to accept it as working.

Did you fail to specify a command in the pod spec, and the container ran (for example) a default shell command that failed? If so, you will need to add the correct command. See solution C).

Examine the logs in /tmp/runbooks_pod_logs.txt and /tmp/runbooks_previous_pod_logs.txt to see whether there are any clues as to why the application was terminated.

3.2) Exit Code 1

The container failed to run its command successfully and returned an exit code of 1. This is an application failure: the process started, but later exited with a failing exit code.

If this is happening with all pods running on your cluster, then there may be a problem with your nodes. Check that the nodes on your cluster are OK with: kubectl get nodes -o wide.

Examine the logs in /tmp/runbooks_pod_logs.txt and /tmp/runbooks_previous_pod_logs.txt and determine a resolution in the context of the command that ran, as specified in the image, or debug the application directly.

3.3) Exit Code 2

An exit code of 2 indicates either that the application chose to return that error code, or (by convention) there was a misuse of a shell builtin. Check your pod’s command specification to ensure that the command is correct. If you think it is correct, try running the image locally with a shell and running the command directly, as shown below.
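
For example, assuming you have Docker available locally and [IMAGE_NAME] is the image your pod runs, you can override the entrypoint to get a shell and try the command by hand (note that some minimal images do not ship /bin/sh):

# Start an interactive shell in the image, bypassing its normal entrypoint
docker run -it --rm --entrypoint /bin/sh [IMAGE_NAME]
# ...then, inside the container, run the command from your pod spec directly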

3.4) Exit Code 128

An exit code of 128 indicates that the container could not run. Check this by examining the /tmp/runbooks_describe_pod.txt output to see whether the LastState Reason is ContainerCannotRun.
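
A quick way to check, using the file saved in step 1):

# Look for ContainerCannotRun in the saved describe output, with surrounding context
grep -B 3 -A 3 'ContainerCannotRun' /tmp/runbooks_describe_pod.txt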

3.5) Exit Code 137

This indicates that the container was killed with signal 9 (SIGKILL).

This can be due to one of the following reasons:

3.5.1) Container ran out of memory

This may be because your application needs more resources than it’s allowed to use, or your application is using more than it should. Which of these is the case is context-specific, so you will need to use your judgment.

If you want to increase your pod’s resource request, see solution E).

3.5.2) The OOMKiller killed the container

You will also likely see Reason: OOMKilled for the container in the /tmp/runbooks_describe_pod.txt output.
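
To confirm, the following sketch prints the last termination reason for each container (expect OOMKilled here); kubectl top additionally shows current usage, assuming the metrics-server is installed in your cluster:

# Print each container's last termination reason
kubectl get pod [POD_NAME] -n [NAMESPACE_NAME] \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'

# Compare current memory usage against the pod's limits (requires metrics-server)
kubectl top pod [POD_NAME] -n [NAMESPACE_NAME]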

3.5.3) The liveness probe failed

If you see a warning like this in the Events output of /tmp/runbooks_describe_pod.txt:

Warning  Unhealthy  13s (x3 over 23s)  kubelet, dali      Liveness probe failed: cat: can't open '/tmp/healthy': No such file or directory

then you will need to check your liveness probes. Skip to step 4).

4) Check liveness/readiness probes

If these are too short for the application’s initialization time, then Kubernetes may be killing the application before it has a chance to start properly.

Whether startup is slow because there is a genuine problem, or because the application simply takes longer to start than the probes allow, is a judgment call for the reader/application owner.

If the probe times are too short, see solution D) below.

See the Kubernetes documentation on liveness, readiness, and startup probes for more background information.
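
To see the probe configuration currently in effect for each container, a minimal jsonpath sketch:

# Print each container's liveness probe configuration (empty output means none is set)
kubectl get pod [POD_NAME] -n [NAMESPACE_NAME] \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.livenessProbe}{"\n"}{end}'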

5) Check common application issues

Some common application problems to consider that may not be specific to your context:

  • If your application requires privileged access to function, then you may need to set allowPrivilegeEscalation in the container’s securityContext (some core components rely on this, e.g. CoreDNS); a spec sketch follows after this list.

  • SELinux or AppArmor controls may be preventing your application from running

Note that by allowing privilege escalation, you may be undermining necessary controls, or allowing your application to do something that is not allowed in your context.
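
If you conclude that privilege escalation is genuinely required, the relevant field lives in the container’s securityContext. This is a minimal sketch of the pod spec fragment; the container name is hypothetical:

# Pod spec fragment: allow the container to gain more privileges than its parent process
spec:
  containers:
    - name: my-app              # hypothetical container name
      image: [IMAGE_NAME]
      securityContext:
        allowPrivilegeEscalation: true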

Solution Details


A) Fix the application

Inform the software developers so they can fix the application.

B) Add a startup command

In order for a container to start, it needs a command to run. Consider adding one to the container image (as an ENTRYPOINT or CMD), or adding a command to the container specification(s) within the pod, as sketched below.
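
As a sketch, this is where a startup command goes in the pod specification; the executable and arguments shown are placeholders for your application’s own:

# Container spec fragment: 'command' overrides the image ENTRYPOINT, 'args' overrides CMD
spec:
  containers:
    - name: my-app                                  # hypothetical container name
      image: [IMAGE_NAME]
      command: ["/app/server"]                      # placeholder executable
      args: ["--config", "/etc/app/config.yaml"]    # placeholder arguments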

C) Correct the container or spec to run a command that exists in the container

If no command was specified in either the image or the pod specification, then add a command in one of those places.

If the command was not executable, make it executable. This may require a change to the container build, or specifying a correct executable.

D) Adjust the time for the liveness/readiness probes

See the Kubernetes documentation on configuring liveness and readiness probes for information on how and what to change in your pod specification; a sketch follows below.
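
As a sketch, these are the timing fields you would typically adjust on a liveness probe; the endpoint, port, and values are illustrative, not recommendations:

# Liveness probe fragment: give the application more time before and between checks
livenessProbe:
  httpGet:
    path: /healthz          # placeholder health endpoint
    port: 8080              # placeholder port
  initialDelaySeconds: 30   # wait before the first probe
  periodSeconds: 10         # interval between probes
  timeoutSeconds: 5         # how long a single probe may take
  failureThreshold: 3       # consecutive failures before the container is restarted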

E) Increase resource request

If you want to increase the resources allocated to your pod, see the Kubernetes documentation on managing resources for containers; a sketch follows below.
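
A minimal sketch of the resources stanza in the container spec (the values are illustrative):

# Container spec fragment: raise the memory request and limit
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"

Alternatively, kubectl set resources deployment [DEPLOYMENT_NAME] -n [NAMESPACE_NAME] --requests=memory=256Mi --limits=memory=512Mi patches an existing deployment in place.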

Check Resolution


If the pod starts up with status Running according to the output of kubectl get pods, then the issue has been resolved.

If there is a different status, then it may be that this particular issue is resolved, but a new issue has been revealed, and the runbook needs to be re-followed.
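
To watch the pod’s status change in real time while you verify the fix:

# Watch pod status updates as they happen (Ctrl-C to stop)
kubectl get pods -n [NAMESPACE_NAME] -w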

Escalation


If the issue persists or is severe, escalate to senior engineering and development teams for additional support and guidance.

Further Information


  • Init containers

  • Probes

  • Kubelet logs
