Picture this: You’re monitoring your production Kubernetes cluster when suddenly pods start dying left and right, showing that dreaded OOMKilled
status. Even worse, your nodes seem to have plenty of free memory. Sound familiar?
OOMKilled errors remain one of the most frustrating issues in Kubernetes environments, catching even experienced teams off guard. I’ve seen entire services go down because of poorly configured memory limits, and frankly, it’s a problem that keeps happening because the underlying mechanics aren’t well understood.
In this guide, we’ll cut through the confusion and give you practical, battle-tested strategies to prevent and resolve OOMKilled errors.
Understanding the OOMKilled Phenomenon
Let’s start with what actually happens when you see that OOMKilled
status. Contrary to popular belief, Kubernetes doesn’t kill your pods directly when a container breaches its memory limit. It’s the Linux kernel’s OOM Killer that does the dirty work (the kubelet only evicts pods itself under node-level memory pressure).
Here’s the actual sequence of events:
- Container’s cgroup exceeds its memory limit
- Linux kernel invokes the OOM Killer for that cgroup
- OOM Killer sends SIGKILL (signal 9) to the offending process
- Kubelet detects the termination
- Pod status updates to “OOMKilled”
The telltale sign is Exit Code 137 (128 + 9), which means the process was terminated by SIGKILL; together with the OOMKilled reason, it confirms an OOM kill event.
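You can confirm this straight from the pod’s status; a quick check (pod and namespace names are placeholders):
# Show the reason and exit code of the last container termination
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'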
Quick Diagnosis Commands
# Check pod status and recent events
kubectl describe pod <pod-name> -n <namespace>
# Look for the smoking gun in events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | grep OOMKilled
# Check current resource usage
kubectl top pods -n <namespace> --sort-by=memory
Root Causes: Why Pods Get OOMKilled
Based on real-world incidents I’ve investigated, OOMKilled errors typically stem from five main causes:
1. Misconfigured Memory Limits
This is the big one. Developers often set memory limits based on guesswork rather than actual usage patterns.
Real Example: A fintech company’s fraud detection system was configured with a 2Gi memory limit, but during peak hours, it actually needed 3.5Gi. Result? Constant OOMKills during business hours.
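Rather than guessing, size limits from observed peaks. If Prometheus is already scraping cAdvisor metrics (covered later in this guide), a sketch like the following shows the highest working-set usage per container over the past week; add roughly 20–30% headroom on top of it:
# Peak working-set memory per container over the last 7 days
max by (namespace, pod, container) (
  max_over_time(container_memory_working_set_bytes{container!="", container!="POD"}[7d])
)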
2. Memory Leaks in Application Code
Memory leaks cause gradual memory consumption growth until the container hits its limit. Different languages have different leak patterns:
- Java: the JVM garbage collector not fully accounting for container memory constraints
- Go: improved since Go 1.19 with GOMEMLIMIT, but still requires attention (see the sketch after this list)
- Node.js: event loop-related memory leak patterns
- Python: circular references and unclosed resources
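For the Go case, a minimal sketch of wiring GOMEMLIMIT into a container spec (the value is a placeholder sized below the container limit):
env:
  - name: GOMEMLIMIT
    value: "1750MiB"   # soft target ~85% of a 2Gi limit, leaving headroom for memory outside the Go runtime's accounting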
3. Traffic Spikes
Applications that handle variable workloads can experience sudden memory spikes during traffic bursts.
Real Example: An e-commerce checkout service crashed during a flash sale when user traffic increased 10x, causing memory usage to spike beyond configured limits.
4. Node-Level Memory Pressure
Even well-configured pods can get killed when the entire node runs out of memory. Kubernetes follows a priority system based on Quality of Service (QoS) classes:
| QoS Class | Priority | Description |
|---|---|---|
| BestEffort | Lowest | No resource requests or limits set |
| Burstable | Medium | Requests set, with limits higher than requests or unset |
| Guaranteed | Highest | Requests equal limits for every container |
5. Resource Overcommitment
This happens when the sum of all pod memory requests exceeds node capacity, or when pods burst beyond their requests simultaneously.
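A quick way to spot overcommitment is to compare what pods have requested against what each node can actually allocate:
# Summarize requested vs. allocatable resources per node (line count after the match is approximate)
kubectl describe nodes | grep -A 8 "Allocated resources"
# Allocatable memory per node
kubectl get nodes -o custom-columns=NAME:.metadata.name,ALLOCATABLE_MEMORY:.status.allocatable.memory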
Memory Management Best Practices
The Golden Rule: Memory Limit = Memory Request
In 2025, the best practice is setting memory limit = memory request. This might surprise you, especially since we recommend NOT setting CPU limits at all.
Here’s why memory is different:
apiVersion: v1
kind: Pod
spec:
containers:
- name: my-app
resources:
requests:
memory: "2Gi"
cpu: "500m"
limits:
memory: "2Gi" # Same as request
# CPU limit intentionally omitted
Think of it like a pizza party analogy: If each guest orders 2 slices but you allow them to eat up to 4 slices, you’ll run out of pizza mid-party. Memory works similarly—when actual usage exceeds requests, unpredictable situations arise.
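To keep whole namespaces from drifting back to guesswork, a LimitRange can apply this rule by default; a minimal sketch with placeholder names and sizes:
apiVersion: v1
kind: LimitRange
metadata:
  name: memory-defaults
  namespace: my-namespace
spec:
  limits:
    - type: Container
      defaultRequest:
        memory: "512Mi"   # used when a container omits its memory request
      default:
        memory: "512Mi"   # used when a container omits its memory limit; kept equal to the request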
Understanding QoS Classes
Kubernetes automatically assigns QoS classes based on your resource configuration:
# Guaranteed QoS - Highest priority
resources:
requests:
memory: "1Gi"
cpu: "500m"
limits:
memory: "1Gi"
cpu: "500m"
# Burstable QoS - Medium priority
resources:
requests:
memory: "512Mi"
limits:
memory: "1Gi"
# BestEffort QoS - Lowest priority
# No requests or limits specified
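You can verify which class Kubernetes actually assigned to a running pod:
# Print the assigned QoS class (BestEffort, Burstable, or Guaranteed)
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.qosClass}{"\n"}'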
Memory QoS Feature (Kubernetes 1.27+)
The Memory Quality of Service feature, an alpha capability redesigned in Kubernetes 1.27, provides finer control over memory management using cgroups v2:
apiVersion: v1
kind: Pod
metadata:
annotations:
# Opt out of Memory QoS per pod if needed
qos.memory.kubernetes.io/disabled: "true"
spec:
containers:
- name: my-app
resources:
requests:
memory: "1Gi"
limits:
memory: "2Gi"
Monitoring and Observability Stack
Effective memory management starts with proper monitoring. Here’s the battle-tested stack that works in production:
Setting Up Prometheus + Grafana
# Install kube-prometheus-stack with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.retention=30d \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
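Once the chart is up, verify the pods and port-forward Grafana; the service name below assumes the release name "monitoring" used above:
# Confirm the monitoring stack is running
kubectl get pods -n monitoring
# Reach Grafana locally at http://localhost:3000
kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80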
Essential Memory Metrics
Container-level metrics:
- container_memory_working_set_bytes: the working-set value that OOM and eviction decisions are effectively based on
- container_memory_usage_bytes: current memory usage, including reclaimable page cache
- container_spec_memory_limit_bytes: the configured memory limit
Pod-level metrics:
- kube_pod_container_resource_limits: configured resource limits
- kube_pod_container_resource_requests: configured resource requests
- kube_pod_status_phase: pod status information
Production-Ready PromQL Queries
Memory utilization monitoring:
# Memory usage percentage per pod
100 * (
container_memory_working_set_bytes{job="kubelet", container!=""}
/
container_spec_memory_limit_bytes{job="kubelet", container!=""}
)
# Total memory usage by namespace
sum by (namespace) (container_memory_working_set_bytes{job="kubelet", container!=""})
# Pods using more than 80% of memory limit
(
container_memory_working_set_bytes{job="kubelet", container!=""}
/
container_spec_memory_limit_bytes{job="kubelet", container!=""}
) > 0.8
OOMKilled detection:
# OOMKilled-related restarts in the last hour
# (kube_pod_container_status_restarts_total has no reason label, so join with the last-terminated-reason metric)
increase(kube_pod_container_status_restarts_total[1h]) > 0
and on (namespace, pod, container)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
# Containers whose most recent termination was an OOM kill
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
Grafana Dashboard Configuration
Create panels for comprehensive memory monitoring:
| Panel Type | Query Focus | Purpose |
|---|---|---|
| Time Series | Memory usage trends | Pattern analysis |
| Stat | Current memory usage | Real-time status |
| Table | OOMKilled pod list | Problem identification |
| Heatmap | Node memory distribution | Resource balance check |
Practical Problem-Solving Scenarios
Let me walk you through real scenarios and their solutions:
Scenario 1: Data Processing Application with Memory Spikes
Problem: Large file processing causes memory spikes leading to OOMKills
Solution Strategy:
# Before: Undersized resources
resources:
requests:
memory: "1Gi"
limits:
memory: "2Gi"
# After: Properly sized with room for spikes
resources:
requests:
memory: "4Gi"
limits:
memory: "4Gi"
# Additional: Switch from memory-backed to disk-backed storage
volumes:
- name: temp-storage
emptyDir:
medium: "" # Use disk instead of memory
sizeLimit: 10Gi
Scenario 2: Java Application Heap Management
Problem: JVM doesn’t properly recognize container memory limits
Solution:
env:
- name: JAVA_OPTS
value: "-XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0 -XX:+UseG1GC"
resources:
requests:
memory: "2Gi"
limits:
memory: "2Gi"
Why this works:
- UseContainerSupport: makes the JVM aware of container limits (enabled by default since JDK 10, but harmless to set explicitly)
- MaxRAMPercentage=75.0: leaves 25% headroom for non-heap memory such as metaspace, thread stacks, and direct buffers
- UseG1GC: a garbage collector that behaves well in containerized environments
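To confirm the JVM really derived its heap from the container limit, you can inspect the computed flags inside the running pod (standard HotSpot options; pod name is a placeholder):
# Print the heap size the JVM calculated from the container's memory limit
kubectl exec <pod-name> -n <namespace> -- java -XX:+PrintFlagsFinal -version | grep -i maxheapsize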
Scenario 3: Node.js Memory Leak
Problem: Gradual memory increase over time
Solution:
env:
- name: NODE_OPTIONS
value: "--max-old-space-size=1536" # Limit to 1.5GB
resources:
requests:
memory: "2Gi"
limits:
memory: "2Gi"
# Add graceful shutdown handling
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 15"]
Implementing Vertical Pod Autoscaler (VPA)
VPA can automatically adjust resource limits based on actual usage:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: my-app-vpa
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: my-app
updatePolicy:
updateMode: "Auto" # Automatically restart pods with new resources
resourcePolicy:
containerPolicies:
- containerName: my-app
minAllowed:
memory: "100Mi"
maxAllowed:
memory: "8Gi"
controlledResources: ["memory"]
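Before trusting "Auto" mode, it helps to review the recommendations VPA is producing (setting updateMode to "Off" gives you recommendations without restarts):
# Inspect the memory targets VPA has computed per container
kubectl describe vpa my-app-vpa
kubectl get vpa my-app-vpa -o jsonpath='{.status.recommendation.containerRecommendations[*].target.memory}{"\n"}'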
Advanced Memory Profiling Techniques
Go Application Profiling
Add pprof endpoints to your Go applications:
import _ "net/http/pprof"
go func() {
log.Println(http.ListenAndServe("localhost:6060", nil))
}()
Then profile in production:
# Port-forward to the pod
kubectl port-forward pod/<pod-name> 6060:6060 &
# Collect heap profile
go tool pprof http://localhost:6060/debug/pprof/heap
# Analyze memory usage patterns
(pprof) top
(pprof) list <function-name>
eBPF-Based Continuous Profiling
For production environments, eBPF provides continuous profiling with minimal overhead:
# Deploy continuous profiling with Parca
apiVersion: v1
kind: ConfigMap
metadata:
name: parca-config
data:
parca.yaml: |
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_profiles_grafana_com_memory_scrape]
action: keep
regex: true
OOMKilled Error Prevention Strategies
Development Phase Best Practices
1. Local Memory Testing
# Monitor Docker container memory usage during development
docker stats <container-id>
# Use load testing with memory tracking
for i in {1..100}; do
echo "Test iteration $i"
kubectl top pod <pod-name> >> memory-usage.log
sleep 30
done
2. Proper Resource Estimation
# Start with generous limits, then optimize
resources:
requests:
memory: "2Gi" # Based on observed usage + 50% buffer
limits:
memory: "2Gi"
Production Monitoring Setup
Proactive Alerting:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: memory-alerts
spec:
groups:
- name: memory.rules
rules:
- alert: HighMemoryUsage
expr: |
(
container_memory_working_set_bytes{job="kubelet", container!=""}
/
container_spec_memory_limit_bytes{job="kubelet", container!=""}
) > 0.8
for: 2m
labels:
severity: warning
annotations:
summary: "Container memory usage exceeds 80%"
description: "{{ $labels.namespace }}/{{ $labels.pod }} memory usage is {{ $value | humanizePercentage }}"
- alert: OOMKilledDetected
expr: |
(increase(kube_pod_container_status_restarts_total[5m]) > 0)
and on (namespace, pod, container)
(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1)
labels:
severity: critical
annotations:
summary: "OOMKilled event detected"
description: "{{ $labels.namespace }}/{{ $labels.pod }} was OOMKilled"
Memory-Based Horizontal Pod Autoscaling:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: memory-based-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: my-app
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 70 # Scale at 70% memory usage
Infrastructure Optimization
Node Resource Reservations:
# Configure kubelet to reserve system resources
--system-reserved=memory=1Gi,cpu=500m
--kube-reserved=memory=500Mi,cpu=500m
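The same reservations are more commonly set in the kubelet configuration file; a sketch with example values:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  memory: "1Gi"
  cpu: "500m"
kubeReserved:
  memory: "500Mi"
  cpu: "500m"
evictionHard:
  memory.available: "200Mi"   # example node-pressure eviction threshold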
Pod Priority and Preemption:
# High-priority workloads get better protection
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: high-priority
value: 1000
description: "High priority for critical applications"
---
apiVersion: v1
kind: Pod
spec:
priorityClassName: high-priority
containers:
- name: critical-app
resources:
requests:
memory: "2Gi"
limits:
memory: "2Gi"
OOMKilled errors don’t have to be the bane of your Kubernetes operations. With proper understanding, monitoring, and proactive management, you can minimize their impact and often prevent them entirely.
The key takeaways:
- Set memory limits equal to requests for predictable behavior
- Monitor continuously with Prometheus and Grafana
- Profile applications to understand real memory usage patterns
- Implement proactive alerts before problems occur
- Use VPA and HPA for dynamic resource management
The landscape of Kubernetes memory management continues to evolve with features like Memory QoS and improved container runtime integration. Stay current with these developments, and your clusters will be more stable and efficient.