NodeHighCPULoad

Resolving High CPU Load on a Node

Overview


This runbook provides a step-by-step guide to diagnosing and resolving high CPU load on a node in a Kubernetes cluster running on Linux.

Initial Response


  • Alert received indicating a high CPU load on a node.

  • Acknowledge the alert and assign yourself as the incident owner.

  • Notify the team about the ongoing incident using the primary communication channel.

  • Update the incident status on the incident tracking system.

Detailed Steps


1) Identify the Affected Nodes

1.1 Confirm that kubectl is connected to the intended Kubernetes cluster

kubectl cluster-info

1.2 Check the status of all nodes in the cluster

kubectl get nodes

2) Gather CPU Usage Metrics for the Nodes

2.1 Show current CPU and memory usage per node (this requires the Metrics Server, or another Metrics API provider, to be running in the cluster)

kubectl top nodes

2.2 Identify the nodes with high CPU usage. Repeat the check a few times: a node whose CPU utilization stays high across several readings needs further investigation, while a single brief spike may not.
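The filtering in step 2.2 can be sketched as a small shell helper. This is a minimal example, not part of the original runbook: the function name high_cpu_nodes and the default threshold of 80% are assumptions chosen for illustration, and it assumes the usual column layout of kubectl top nodes (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%).

```shell
# Print the names of nodes whose CPU% is at or above a threshold.
# Reads `kubectl top nodes` output on stdin.
# The threshold defaults to 80 (an assumed value; tune it to your alert rule).
high_cpu_nodes() {
  awk -v t="${1:-80}" '
    {
      cpu = $3
      sub(/%/, "", cpu)
      # The header row and rows without metrics are skipped because
      # their third column is not a plain number.
      if (cpu ~ /^[0-9]+$/ && cpu + 0 >= t + 0) print $1
    }'
}

# Usage:
#   kubectl top nodes | high_cpu_nodes 80
```

Running the helper a few times, a minute or so apart, helps separate sustained load from a short spike.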

3) Identify Problematic Pods

3.1 List all pods running on the affected node(s)

kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>

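3.2 Note pods with high resource utilization or abnormal behavior (for example, frequent restarts), then inspect each one's events, restart count, and configured resources: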

kubectl describe pod <pod-name> -n <namespace>
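kubectl describe pod does not report live CPU usage, so one way to find the heaviest consumers is to rank the output of kubectl top pods. The helper below is a sketch under assumptions: the name top_cpu_pods and the default of five results are illustrative, and it expects the column layout NAMESPACE, NAME, CPU(cores), MEMORY(bytes) with CPU printed in millicores (for example 250m). It ranks every pod it is given, so cross-check the result against the pod list from step 3.1.

```shell
# Rank pods by CPU usage, highest first.
# Reads `kubectl top pods --all-namespaces` output on stdin.
# $1 = how many pods to print (defaults to 5, an assumed value).
top_cpu_pods() {
  awk '$3 ~ /^[0-9]+m$/ { print $3, $1 "/" $2 }' |
    sort -rn |
    head -n "${1:-5}"
}

# Usage:
#   kubectl top pods --all-namespaces | top_cpu_pods 5
```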

3.3 Check the Pod's logs for any errors, excessive logging, or patterns that might indicate the root cause of the high CPU usage.
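A quick way to look for the error patterns and excessive logging mentioned in step 3.3 is to summarize the log stream. The two helpers below are a sketch: their names and the keyword list (error, exception, panic) are assumptions, and counting repeated lines works best when the application does not prefix every line with a unique timestamp.

```shell
# Count lines that mention common failure keywords (case-insensitive).
count_error_lines() {
  grep -ciE 'error|exception|panic' || true
}

# Show the most frequently repeated lines, formatted as "<count>x <line>".
# $1 = how many distinct lines to print (defaults to 5).
top_repeated_lines() {
  sort | uniq -c | sort -rn | head -n "${1:-5}" |
    sed 's/^ *\([0-9][0-9]*\) /\1x /'
}

# Usage:
#   kubectl logs <pod-name> -n <namespace> --since=15m | count_error_lines
#   kubectl logs <pod-name> -n <namespace> --since=15m | top_repeated_lines 5
#   kubectl logs <pod-name> -n <namespace> --previous   # logs from the last restart
```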

Solution Details


1) Scale the Deployment

If the high CPU usage is caused by legitimate load, consider scaling the Deployment to distribute the load across multiple Pods.

kubectl scale deployment <deployment-name> --replicas=<new-replica-count> -n <namespace>
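When choosing the new replica count, a rough starting point is the proportional rule that the Horizontal Pod Autoscaler documentation describes: desired = ceil(current replicas x current utilization / target utilization). The helper below is a sketch of that arithmetic only; the function name and the example figures are assumptions, and the target utilization is something your team has to decide.

```shell
# Estimate a replica count from current and target CPU utilization.
# $1 = current replicas, $2 = current CPU utilization (%), $3 = target CPU utilization (%)
desired_replicas() {
  # Integer ceiling of (current * utilization / target).
  echo $(( ($1 * $2 + $3 - 1) / $3 ))
}

# Usage: 3 replicas at 90% CPU, aiming for 60%
#   desired_replicas 3 90 60
```

After scaling, kubectl rollout status deployment <deployment-name> -n <namespace> reports whether the new replicas became ready.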

2) Optimize Application Code

Work with the development team to optimize the application's code and queries that might be causing the CPU spikes.

3) Resource Requests and Limits

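Ensure that the Pod's resource requests and limits are appropriately set to avoid resource contention. Adjust the values in the YAML definition of the workload that owns the Pod (for example, its Deployment) so that replacement Pods pick up the change.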
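As an alternative to editing YAML by hand, kubectl set resources can change CPU requests and limits on a Deployment. The wrapper below is a sketch under assumptions: the function names, the DRY_RUN variable, and the example values are illustrative and not part of the original runbook. By default it only prints the command so it can be reviewed first; changing the pod template triggers a rollout of the Deployment.

```shell
# Print a command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Set CPU request and limit on every container in a Deployment.
# $1 = deployment, $2 = namespace, $3 = CPU request, $4 = CPU limit
set_cpu_resources() {
  run kubectl set resources deployment "$1" -n "$2" \
    --requests="cpu=$3" --limits="cpu=$4"
}

# Usage (prints the command; rerun with DRY_RUN=0 to apply it):
#   set_cpu_resources <deployment-name> <namespace> 250m 500m
```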

4) Inspect Pod Resource Usage

4.1 Describe the problematic pod to gather more information:

kubectl describe pod <pod-name> -n <namespace>

4.2 Examine the resources requested and limits set for CPU in the pod's YAML configuration. Adjust these values if necessary.
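To judge whether the configured values need adjusting, it can help to compare observed usage with the CPU limit. The helpers below are a sketch: the function names are assumptions, and they handle only the common CPU quantity forms (millicores such as 500m, or cores such as 2 or 0.5).

```shell
# Convert a Kubernetes CPU quantity ("500m", "2", "0.5") to millicores.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v c="$1" 'BEGIN { printf "%d\n", c * 1000 + 0.5 }' ;;
  esac
}

# Print observed CPU usage as a whole-number percentage of the CPU limit.
# $1 = observed usage (from `kubectl top pods`), $2 = configured CPU limit
cpu_pct_of_limit() {
  usage=$(cpu_to_millicores "$1")
  limit=$(cpu_to_millicores "$2")
  if [ "$limit" -le 0 ]; then
    echo "no CPU limit set" >&2
    return 1
  fi
  echo $(( usage * 100 / limit ))
}

# Usage:
#   List each container's CPU request and limit:
#   kubectl get pod <pod-name> -n <namespace> \
#     -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests.cpu}{"\t"}{.resources.limits.cpu}{"\n"}{end}'
#   Then compare usage against the limit:
#   cpu_pct_of_limit 450m 0.5
```

A container that sits close to 100% of its limit is likely being throttled, which is a signal to raise the limit or scale out.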

5) Scaling and Resource Adjustment

5.1 If the pod is CPU-bound, consider horizontally scaling the deployment by increasing the replica count:

kubectl scale deployment <deployment-name> --replicas=<desired-replica-count> -n <namespace>

5.2 If the issue persists, consider adjusting resource requests and limits for CPU in the pod's configuration.

Escalation


If the issue persists after the steps above, or if it is severe, escalate to a senior SRE for additional support and guidance.

Further Information

