NodeHighCPULoad
Resolving High CPU Load on a Node
Overview
This runbook provides a step-by-step guide to diagnosing and resolving high CPU load on a Linux node in a Kubernetes cluster.
Initial Response
Alert received indicating a high CPU load on a node.
Acknowledge the alert and assign yourself as the incident owner.
Notify the team about the ongoing incident using the primary communication channel.
Update the incident status on the incident tracking system.
Detailed Steps
1) Identify the Affected Nodes
1.1. Connect to the Kubernetes cluster using kubectl
kubectl cluster-info
1.2. Check the status of all nodes in the cluster
kubectl get nodes
2) Gather CPU Usage Metrics for the Nodes
2.1. Use kubectl to gather CPU usage metrics for the nodes
kubectl top nodes
2.2. Identify the nodes with high CPU usage. Nodes with consistently high CPU utilization may need further investigation.
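Picking out the busy nodes from step 2 can be scripted. The sketch below is one possible approach, not part of the original procedure: `high_cpu_nodes` is a hypothetical helper name, the default threshold of 80 is an arbitrary assumption, and the filter assumes that CPU% is the third column of the `kubectl top nodes` output.

```shell
# Hypothetical helper: print nodes whose CPU% is at or above a threshold.
# Assumes CPU% is the third column of `kubectl top nodes`.
# The default threshold of 80 is an arbitrary choice; tune it for your cluster.
high_cpu_nodes() {
  threshold="${1:-80}"
  kubectl top nodes |
    awk -v t="$threshold" '{
      pct = $3
      sub(/%/, "", pct)                 # strip the trailing percent sign
      # Skip the header row and nodes that report no numeric value
      if (pct ~ /^[0-9]+$/ && pct + 0 >= t + 0) print $1, $3
    }'
}

# Example call (requires access to a cluster with metrics available):
#   high_cpu_nodes 90
```

Running the helper several times a few minutes apart helps separate sustained load from a short spike.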
3) Identify Problematic Pods
3.1. List all pods running on the affected node(s)
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
3.2. Note down pods with high resource utilization or abnormal behavior.
kubectl describe pod <pod-name> -n <namespace>
3.3. Check the Pod's logs for any errors, excessive logging, or patterns that might indicate the root cause of the high CPU usage.
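For the log check in step 3.3, a quick count of suspicious lines can show whether a pod is worth a closer look. This is a minimal sketch under stated assumptions: `count_log_errors` is a hypothetical helper name, and both the 500-line tail and the search pattern are illustrative choices rather than values from this runbook.

```shell
# Hypothetical helper: count recent log lines that look like errors.
# The 500-line tail and the pattern below are assumptions; adjust both
# to match how your application actually logs.
count_log_errors() {
  pod="$1"
  namespace="$2"
  kubectl logs "$pod" -n "$namespace" --tail=500 |
    grep -ciE 'error|exception|panic'
}

# Example call (requires cluster access):
#   count_log_errors <pod-name> <namespace>
```

If the container has restarted, adding `--previous` to `kubectl logs` shows the output of the prior instance, which may contain the lines that preceded the restart.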
Solution Details
1) Scale the Deployment
If the high CPU usage is caused by legitimate load, consider scaling the Deployment to distribute the load across multiple Pods.
kubectl scale deployment <deployment-name> --replicas=<new-replica-count> -n <namespace>
2) Optimize Application Code
Work with the development team to optimize the application's code and queries that might be causing the CPU spikes.
3) Resource Requests and Limits
Ensure that the Pod's resource requests and limits are appropriately set to avoid resource contention. Adjust the values in the Pod's YAML definition.
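As an alternative to editing YAML by hand, CPU requests and limits can be changed from the command line with `kubectl set resources`. The sketch below assumes the pod is managed by a Deployment; `set_cpu_resources` is a hypothetical wrapper name, and the values in the example call are placeholders, not recommendations.

```shell
# Hypothetical wrapper around `kubectl set resources`.
# Assumes the pod belongs to a Deployment. Changing the pod template
# triggers a rolling update of that Deployment's pods.
set_cpu_resources() {
  deployment="$1"
  namespace="$2"
  cpu_request="$3"
  cpu_limit="$4"
  kubectl set resources deployment "$deployment" -n "$namespace" \
    --requests="cpu=$cpu_request" --limits="cpu=$cpu_limit"
}

# Example call with placeholder values (requires cluster access):
#   set_cpu_resources <deployment-name> <namespace> 250m 500m
```

A change made this way is not reflected in the manifest kept in version control, so the same values should be applied there to avoid the next deploy reverting them.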
4) Inspect Pod Resource Usage
4.1. Describe the problematic pod to gather more information:
kubectl describe pod <pod-name> -n <namespace>
4.2. Examine the resources requested and limits set for CPU in the pod's YAML configuration. Adjust these values if necessary.
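The CPU settings from step 4.2 can also be pulled out directly instead of being read from the full `describe` output. The following is a rough sketch: `cpu_settings` is a hypothetical helper name, and it relies on kubectl's JSONPath output printing an empty string for a field that is not set.

```shell
# Hypothetical helper: show the CPU request and limit of each container in a pod.
# A missing request or limit is reported as "unset".
cpu_settings() {
  pod="$1"
  namespace="$2"
  kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests.cpu}{"\t"}{.resources.limits.cpu}{"\n"}{end}' |
    awk -F '\t' '{
      printf "%s request=%s limit=%s\n", $1,
        ($2 == "" ? "unset" : $2),
        ($3 == "" ? "unset" : $3)
    }'
}

# Example call (requires cluster access):
#   cpu_settings <pod-name> <namespace>
```

Containers reported with an unset limit are worth checking first, since nothing caps how much CPU they can consume on the node.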
6) Scaling and Resource Adjustment
6.1. If the pod is CPU-bound, consider horizontally scaling the deployment by increasing the replica count:
kubectl scale deployment <deployment-name> --replicas=<desired-replica-count> -n <namespace>
6.2. If the issue persists, consider adjusting resource requests and limits for CPU in the pod's configuration.
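After scaling as in step 6.1, it is useful to wait for the new replicas to become ready before judging whether the load has dropped. A minimal sketch, assuming a Deployment: `scale_and_wait` is a hypothetical helper name, and the 120-second timeout is an arbitrary choice.

```shell
# Hypothetical helper: scale a Deployment, then wait for the rollout to settle.
# The 120s timeout is an assumption; pick a value that suits your workloads.
scale_and_wait() {
  deployment="$1"
  namespace="$2"
  replicas="$3"
  kubectl scale deployment "$deployment" --replicas="$replicas" -n "$namespace" &&
    kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout=120s
}

# Example call (requires cluster access):
#   scale_and_wait <deployment-name> <namespace> <desired-replica-count>
```

Once the rollout reports success, rerunning `kubectl top nodes` shows whether CPU usage on the affected node has come down.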
Escalation
If the issue persists or is severe, escalate to a senior SRE for additional support and guidance.
Further Information