KubeStatefulsetDown
StatefulSet Down Troubleshooting and Recovery
Overview
This runbook outlines the steps to diagnose a StatefulSet that is down (fewer ready replicas than desired) and to restore it to normal operation.
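The underlying alert condition is a readiness comparison: ready replicas versus desired replicas. A minimal sketch of that check, assuming the two counts are fetched separately (for example via kubectl get statefulset <statefulset-name> -n <namespace> -o jsonpath='{.status.readyReplicas} {.spec.replicas}'); the function name here is a hypothetical illustration, not part of the alert itself:

```shell
# Illustrative readiness check for the condition behind this alert.
# readyReplicas may be omitted from status when it is zero, so default to 0.
check_statefulset() {
  ready=${1:-0}
  desired=$2
  if [ "$ready" -lt "$desired" ]; then
    echo "DEGRADED: $ready/$desired replicas ready"
  else
    echo "OK: $ready/$desired replicas ready"
  fi
}

check_statefulset 0 3   # prints: DEGRADED: 0/3 replicas ready
```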
Initial Response
Alert received indicating the StatefulSet is down.
Acknowledge the alert and assign yourself as the incident owner.
Notify the team about the ongoing incident using the primary communication channel.
Update the incident status on the incident tracking system.
Check RunBook Match
When running a kubectl get statefulsets command, you will see a line like this in the output for your StatefulSet:
kubectl get statefulsets -n <namespace>
A StatefulSet matching this runbook will show fewer ready replicas than desired, for example (illustrative output):
NAME   READY   AGE
web    0/3     4h12m
Detailed Steps
1) Gather information
Use kubectl to check the status of the affected StatefulSet, its pods, and associated resources. To determine the root cause, first gather relevant information that you may need to refer back to later:
kubectl get statefulsets,pods,services -n <namespace>
2) Inspect pod logs for error messages or abnormal behavior
kubectl logs <pod-name> -n <namespace>
3) Check cluster events to identify potential causes
kubectl get events -n <namespace>
Solution Details
1) Attempt to restart the pod:
If any pods are misconfigured or crashing, delete them so the StatefulSet controller recreates them:
kubectl delete pod <pod-name> -n <namespace>
2) Scale the StatefulSet down
If the issue is related to resource constraints, scale the StatefulSet down so its pods can be rescheduled onto nodes with sufficient capacity; scale back up once the pods are healthy:
kubectl scale statefulset <statefulset-name> --replicas=<desired-replicas> -n <namespace>
3) Immediate Response
If the StatefulSet's downtime is due to insufficient resources, consider temporarily scaling up the resources for the associated pods.
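As a sketch of the temporary scale-up, kubectl set resources can raise the requests and limits on the StatefulSet's pod template. The values below are placeholders; choose ones appropriate for your workload, and capture the current spec first so you can revert:

```shell
# Save the current spec so the temporary change can be reverted later.
kubectl get statefulset <statefulset-name> -n <namespace> -o yaml > sts-backup.yaml

# Illustrative only: raising requests/limits triggers a rolling update
# of the StatefulSet's pods.
kubectl set resources statefulset <statefulset-name> -n <namespace> \
  --requests=cpu=500m,memory=512Mi \
  --limits=cpu=1,memory=1Gi
```

Remember this is a stopgap; revisit the resource sizing properly once the incident is resolved.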
If the pods are crashing, perform a rolling restart of the StatefulSet:
kubectl rollout restart statefulset <statefulset-name> -n <namespace>
Escalation:
If the issue persists or is severe, escalate to a senior SRE for additional support and guidance.
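When escalating, it helps to attach a snapshot of the current state. A minimal collection script might look like the following; the namespace, StatefulSet name, and sts-debug directory are placeholder choices:

```shell
# Illustrative pre-escalation snapshot; requires cluster access.
NS=my-namespace        # placeholder: the affected namespace
STS=my-statefulset     # placeholder: the affected StatefulSet
mkdir -p sts-debug
kubectl describe statefulset "$STS" -n "$NS" > sts-debug/describe.txt
kubectl get events -n "$NS" --sort-by=.lastTimestamp > sts-debug/events.txt
# Logs from the first pod, including the previous (crashed) container if any.
kubectl logs "${STS}-0" -n "$NS" > sts-debug/pod-0.log 2>&1
kubectl logs "${STS}-0" -n "$NS" --previous > sts-debug/pod-0-previous.log 2>&1
```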