When you’re managing a Kubernetes cluster, few things are more frustrating than seeing your pods stuck in a “Pending” state while your application refuses to start. The Pending phase means the cluster has accepted the pod but has not yet scheduled it onto a node, so it is still waiting for a place to run. This guide walks you through the most common causes of pending pods and provides proven solutions to get your workloads running smoothly.
1. Understanding the Pod Pending State
Pods follow a defined lifecycle: they start in the Pending phase and move to Running once at least one of their primary containers starts successfully. During the Pending phase, your pod is essentially waiting in line for the Kubernetes scheduler to find a suitable node that meets all of its requirements.
Most pods progress from Pending to Running within seconds and spend the bulk of their lifetime in the Running state. When this transition doesn’t happen quickly, it usually indicates an underlying issue that needs immediate attention.
Quick Status Check
To identify pending pods in your cluster, use this simple command:
kubectl get pods --field-selector=status.phase=Pending
For a more detailed view across all namespaces:
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
2. Primary Causes of Pod Pending Status
Insufficient Cluster Resources
A common cause for pods remaining in the pending state is insufficient node resources. Kubernetes requires adequate CPU and memory on nodes to launch new pods. When these resources are depleted, the scheduler cannot place pods, leaving them in pending status.
Real-world example: Let’s say you have a pod requesting 2 CPU cores and 4Gi of memory, but your largest available node only has 1.5 CPU cores remaining. The scheduler will keep the pod in Pending state until resources become available.
resources:
  requests:
    memory: "4Gi"
    cpu: "2000m"
  limits:
    memory: "8Gi"
    cpu: "4000m"
Node Scheduling Constraints
If you cordon a node (e.g., to prepare the node for an upgrade), Kubernetes will mark it unschedulable. A cordoned node can continue to host the pods it is already running, but it cannot accept any new pods.
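For reference, cordoning a node and checking its schedulability look like this; cordoned nodes typically show a status of Ready,SchedulingDisabled:
kubectl cordon <node-name>
kubectl get nodes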
Node Selector and Affinity Mismatches
Pods can specify node selectors or affinity rules to control on which nodes they run. If these criteria are too restrictive, the scheduler may struggle to find an eligible node, extending the pending state duration.
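As a simple illustration (using the same disktype=ssd label as the affinity example later in this guide), a nodeSelector like the following keeps a pod Pending until at least one node carries the matching label:
spec:
  nodeSelector:
    disktype: ssd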
Storage and Volume Issues
If a pod requires a PersistentVolumeClaim (PVC) and the requested storage class or volume is not available, the pod may get stuck in Pending. A PVC is a request for storage that the pod consumes as a volume; if that storage cannot be provisioned or bound, the pod remains Pending until the claim is satisfied.
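A minimal sketch of such a claim; fast-ssd is a placeholder StorageClass name, and if no class by that name exists (or its provisioner cannot satisfy the request), the claim and any pod that mounts it will stay Pending:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 10Gi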
3. Essential Troubleshooting Commands
Step 1: Investigate the Pod Status
The first and most important troubleshooting step is to examine the pod details:
kubectl describe pod <pod-name> -n <namespace>
The “Events” section in the output often indicates why the pod is not being scheduled. For instance, “FailedScheduling” could highlight resource constraints or node-related issues.
Example output you might see:
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  30s   default-scheduler  0/5 nodes are available: 2 Insufficient CPU, 3 Insufficient memory.
Step 2: Check Cluster Events
You can list all events in the cluster and filter them to find those associated with the pod. This provides additional context, especially for events that are not captured in kubectl describe pod.
kubectl get events --sort-by=.metadata.creationTimestamp
Filter events for a specific pod:
kubectl get events --field-selector involvedObject.name=<pod-name>
Step 3: Examine Node Resources
Check the resource availability across your nodes:
kubectl top nodes
Note that kubectl top nodes requires the metrics-server add-on to be installed. Compare the reported usage against the “Insufficient CPU” or “Insufficient memory” messages from the pod’s Events section (kubectl describe pod <pod-name>).
For detailed node information:
kubectl describe nodes
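To focus on how much of each node’s capacity is already committed to existing pods, you can filter the output for the Allocated resources section:
kubectl describe node <node-name> | grep -A 8 "Allocated resources"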
Step 4: Verify Node Readiness
Each node has several conditions that can be True, False, or Unknown, depending on the node’s status. Node conditions are a primary factor in scheduling decisions.
kubectl get nodes -o wide
Check for tainted nodes:
kubectl describe node <node-name> | grep -i "taints"
4. Systematic Resolution Strategies
Resolving Resource Constraints
Option 1: Adjust Resource Requests
Ensure that the pod’s resource requests and limits are realistic. Over-provisioning resources can lead to pods being stuck in Pending because Kubernetes cannot find a node that meets the excessive requirements.
resources:
  requests:
    memory: "64Mi"
    cpu: "250m"
  limits:
    memory: "128Mi"
    cpu: "500m"
Option 2: Scale Your Cluster
If your cluster frequently experiences resource shortages, consider enabling the Cluster Autoscaler. This tool automatically adds nodes to your cluster when resources are insufficient to meet pod scheduling demands.
Check if cluster autoscaler is running:
kubectl get deployment -n kube-system cluster-autoscaler
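If the autoscaler is running with its status ConfigMap enabled (the default in most installations), you can also review its recent scale-up decisions:
kubectl describe configmap -n kube-system cluster-autoscaler-status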
Fixing Node Scheduling Issues
Uncordon Nodes
If the node needs to stay cordoned (for example, during maintenance), you can resolve the issue by adding more nodes to host the Pending pods. Otherwise, uncordon it:
kubectl uncordon <node-name>
Address Node Health Issues
A node marked as “Not Ready” cannot accept additional workloads. Common causes include kubelet failures, network problems, or disk, memory, or PID pressure on the node. Identifying and resolving the underlying issue is crucial to restoring node functionality.
Managing Node Affinity and Selectors
Example of problematic node affinity:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
This means that the pod will get scheduled only on a node that has a disktype=ssd label. If no nodes have this label, the pod will remain pending.
Solution: Either add the required label to nodes or modify the affinity rules:
kubectl label node <node-name> disktype=ssd
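To confirm which nodes now satisfy the affinity rule, filter by the label:
kubectl get nodes -l disktype=ssd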
Resolving Storage Issues
Check PVC status:
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
If the PVC status is Pending, the pod will not be scheduled.
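If the claim is Pending, verify that the requested StorageClass exists and check the claim’s events for provisioning errors:
kubectl get storageclass
kubectl get events --field-selector involvedObject.name=<pvc-name> -n <namespace>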
5. Advanced Troubleshooting Scenarios
Taints and Tolerations
Think of taints as “keep out unless you have permission” signs on your Kubernetes nodes. A taint marks a node with a key, an optional value, and an effect, such as key1=value1:NoSchedule. By default, pods cannot be scheduled on tainted nodes unless they declare a matching toleration.
Example taint removal:
kubectl taint node <node-name> key1=value1:NoSchedule-
Adding toleration to pod:
spec:
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"
Priority and Preemption
Kubernetes allows configuring priorities for pods. If no node has room for a pod in a high-priority class, Kubernetes may evict lower-priority pods to make space, a process known as preemption.
Create a priority class:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
globalDefault: false
description: "High priority class for critical workloads"
Scheduler Logs Analysis
The Kubernetes scheduler logs provide a detailed view of scheduling operations, offering insights into why pods remain pending. This is particularly helpful for debugging complex scheduling scenarios.
kubectl logs -n kube-system $(kubectl get pods -n kube-system | grep scheduler | awk '{print $1}')
6. Prevention and Best Practices
Resource Management
Reserve CPU and memory for critical system daemons such as the kubelet and container runtime using the --kube-reserved flag, and for operating-system daemons using --system-reserved. This prevents resource starvation on nodes that might otherwise delay pod scheduling.
Define default resource requests and limits with a LimitRange in each namespace, as shown below, so that pods always end up with sensible resource allocations.
Example LimitRange for a namespace:
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
spec:
  limits:
  - default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    type: Container
Monitoring and Alerting
Set up alerting mechanisms to notify you when a pod stays in the Pending state for too long. This can be done using Kubernetes-native tools or third-party monitoring solutions.
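As a sketch of such an alert, assuming kube-state-metrics and the Prometheus Operator are installed (the rule name, threshold, and labels are illustrative), a PrometheusRule could fire when a pod stays Pending for more than 15 minutes:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pending-pods
spec:
  groups:
  - name: pod-scheduling
    rules:
    - alert: PodStuckPending
      # kube_pod_status_phase is exported by kube-state-metrics
      expr: sum by (namespace, pod) (kube_pod_status_phase{phase="Pending"}) > 0
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been Pending for over 15 minutes"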
Proactive Capacity Planning
Deploy a small number of low-priority pods simulating load on nodes to ensure there is always some buffer capacity for higher-priority pods when needed. This approach works well in clusters with variable workloads.
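A minimal sketch of this pattern, assuming you have created a PriorityClass named overprovisioning with a value lower than your real workloads (for example, -10); the pause containers simply reserve capacity that higher-priority pods can reclaim through preemption:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-reservation
spec:
  replicas: 2
  selector:
    matchLabels:
      app: capacity-reservation
  template:
    metadata:
      labels:
        app: capacity-reservation
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: reserve
        # the pause image does nothing but hold the requested resources
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"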
7. Real-World Troubleshooting Example
Let’s walk through a complete troubleshooting scenario:
Problem: A deployment with 3 replicas has 2 pods running but 1 stuck in Pending.
Step 1: Check the pending pod
kubectl describe pod my-app-deployment-xyz123
Output:
Events:
Warning FailedScheduling 2m default-scheduler 0/3 nodes are available: 1 Insufficient memory, 2 node(s) had taints that the pod didn't tolerate.
Step 2: Check node resources and taints
kubectl top nodes
kubectl describe nodes | grep -A 5 "Taints"
Step 3: Solution – Either add toleration or remove taint
kubectl taint nodes worker-node-2 special-workload=true:NoSchedule-
8. Quick Reference Table
| Issue Type | Diagnostic Command | Common Fix |
|---|---|---|
| Resource shortage | kubectl top nodes | Scale cluster or reduce requests |
| Node affinity | kubectl describe pod <name> | Update node labels or affinity rules |
| Taints/tolerations | kubectl describe nodes | Add tolerations or remove taints |
| Storage issues | kubectl get pvc | Fix storage class or volume |
| Scheduler problems | kubectl logs kube-scheduler-* | Restart scheduler pod |
Pod pending issues in Kubernetes can stem from various causes, but systematic troubleshooting using the right commands will help you identify and resolve problems quickly. Remember that maintaining a healthy Kubernetes environment requires ongoing monitoring and adjustments as your workloads and infrastructure evolve.
The key to effective troubleshooting is starting with kubectl describe pod to understand the immediate cause, then working through the logical progression of resource availability, node health, and scheduling constraints. By implementing the prevention strategies outlined in this guide, you can significantly reduce the occurrence of pending pod issues in your cluster.