Picture this: You’re monitoring your production Kubernetes cluster when suddenly pods start dying left and right, showing that dreaded OOMKilled status. To make it more confusing, your nodes appear to have plenty of free memory. Sound familiar?

OOMKilled errors remain one of the most frustrating issues in Kubernetes environments, catching even experienced teams off guard. I’ve seen entire services go down because of poorly configured memory limits, and frankly, it’s a problem that keeps happening because the underlying mechanics aren’t well understood.

In this guide, we’ll cut through the confusion and give you practical, battle-tested strategies to prevent and resolve OOMKilled errors.

 

Understanding the OOMKilled Phenomenon

Let’s start with what actually happens when you see that OOMKilled status. Contrary to popular belief, Kubernetes doesn’t kill your pods directly. Instead, it’s the Linux kernel’s OOM Killer that does the dirty work.

Here’s the actual sequence of events:

  1. Container exceeds its memory limit
  2. Linux kernel detects memory pressure
  3. OOM Killer sends SIGKILL (signal 9) to the process
  4. Kubelet detects the termination
  5. Pod status updates to “OOMKilled”

The telltale sign is Exit Code 137 (128 + 9), which means the process was terminated by SIGKILL. Combined with the OOMKilled reason in the pod status, it confirms an OOM kill rather than some other forced termination.

Quick Diagnosis Commands

# Check pod status and recent events
kubectl describe pod <pod-name> -n <namespace>

# Look for the smoking gun in events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | grep OOMKilled

# Check current resource usage
kubectl top pods -n <namespace> --sort-by=memory
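
To confirm it really was an OOM kill, you can also read the last termination state straight from the pod status (standard kubectl JSONPath queries; substitute your own pod and namespace):

# Exit code of the last terminated container (137 = killed by SIGKILL)
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'

# Recorded termination reason (expect "OOMKilled")
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'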

 

 

Root Causes: Why Pods Get OOMKilled

Based on real-world incidents I’ve investigated, OOMKilled errors typically stem from five main causes:

1. Misconfigured Memory Limits

This is the big one. Developers often set memory limits based on guesswork rather than actual usage patterns.

Real Example: A fintech company’s fraud detection system was configured with a 2Gi memory limit, but during peak hours, it actually needed 3.5Gi. Result? Constant OOMKills during business hours.

2. Memory Leaks in Application Code

Memory leaks cause gradual memory consumption growth until the container hits its limit. Different languages have different leak patterns:

  • Java: Older JVM defaults size the heap from host memory rather than the container limit
  • Go: Much improved since Go 1.19 introduced GOMEMLIMIT, but the limit still has to be set explicitly (see the snippet after this list)
  • Node.js: Listeners and closures kept alive across event-loop cycles
  • Python: Circular references and unclosed resources
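
For Go services specifically, one option is to wire GOMEMLIMIT to the container’s memory limit through the Downward API. This is a minimal sketch (the env block goes inside your container spec); note that it maps the soft limit to 100% of the hard limit, so some teams hard-code a slightly lower value instead to keep headroom:

env:
- name: GOMEMLIMIT
  valueFrom:
    resourceFieldRef:
      resource: limits.memory   # injected as a plain byte count, which the Go runtime accepts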

3. Traffic Spikes

Applications that handle variable workloads can experience sudden memory spikes during traffic bursts.

Real Example: An e-commerce checkout service crashed during a flash sale when user traffic increased 10x, causing memory usage to spike beyond configured limits.

4. Node-Level Memory Pressure

Even well-configured pods can get killed when the entire node runs out of memory. Kubernetes follows a priority system based on Quality of Service (QoS) classes:

QoS Class    Priority   Description
BestEffort   Lowest     No resource requests/limits set
Burstable    Medium     Requests < limits
Guaranteed   Highest    Requests = limits

5. Resource Overcommitment

This happens when the sum of all pod memory requests exceeds node capacity, or when pods burst beyond their requests simultaneously.
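
A quick way to spot overcommitment is to compare each node’s allocatable memory with the requests already scheduled onto it (standard kubectl output; formatting may vary slightly between versions):

# Allocatable capacity vs. summed requests/limits on a node
kubectl describe node <node-name> | grep -A 8 "Allocated resources"

# Allocatable memory across all nodes at a glance
kubectl get nodes -o custom-columns=NAME:.metadata.name,ALLOCATABLE_MEM:.status.allocatable.memory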

 

 

Memory Management Best Practices

The Golden Rule: Memory Limit = Memory Request

In 2025, the best practice is setting memory limit = memory request. This might surprise you, especially since we recommend NOT setting CPU limits at all.

Here’s why memory is different:

apiVersion: v1
kind: Pod
spec:
  containers:
  - name: my-app
    resources:
      requests:
        memory: "2Gi"
        cpu: "500m"
      limits:
        memory: "2Gi"  # Same as request
        # CPU limit intentionally omitted

Think of it like a pizza party analogy: If each guest orders 2 slices but you allow them to eat up to 4 slices, you’ll run out of pizza mid-party. Memory works similarly—when actual usage exceeds requests, unpredictable situations arise.

Understanding QoS Classes

Kubernetes automatically assigns QoS classes based on your resource configuration:

# Guaranteed QoS - Highest priority
resources:
  requests:
    memory: "1Gi"
    cpu: "500m"
  limits:
    memory: "1Gi"
    cpu: "500m"

# Burstable QoS - Medium priority  
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"

# BestEffort QoS - Lowest priority
# No requests or limits specified
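
You can verify which class Kubernetes actually assigned by reading the pod’s status field:

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.qosClass}{"\n"}'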

Memory QoS Feature (Kubernetes 1.27+)

The Memory Quality of Service feature, an alpha capability behind the MemoryQoS feature gate (first added in v1.22 and reworked in v1.27), provides finer control over memory management using cgroups v2:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    # Opt out of Memory QoS per pod if needed
    qos.memory.kubernetes.io/disabled: "true"
spec:
  containers:
  - name: my-app
    resources:
      requests:
        memory: "1Gi"
      limits:
        memory: "2Gi"

 

 

Monitoring and Observability Stack

Effective memory management starts with proper monitoring. Here’s the battle-tested stack that works in production:

Setting Up Prometheus + Grafana

# Install kube-prometheus-stack with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
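
Once the chart is running, you can reach Grafana through a port-forward (the service and secret names below assume the release name monitoring used above):

# Access Grafana at http://localhost:3000
kubectl port-forward svc/monitoring-grafana 3000:80 -n monitoring

# Retrieve the generated admin password
kubectl get secret monitoring-grafana -n monitoring \
  -o jsonpath='{.data.admin-password}' | base64 -d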

Essential Memory Metrics

Container-level metrics:

  • container_memory_working_set_bytes: The metric the kubelet uses for eviction decisions and the closest proxy for OOM-kill risk
  • container_memory_usage_bytes: Current memory usage
  • container_spec_memory_limit_bytes: Configured memory limit

Pod-level metrics:

  • kube_pod_container_resource_limits: Resource limit values
  • kube_pod_container_resource_requests: Resource request values
  • kube_pod_status_phase: Pod status information

Production-Ready PromQL Queries

Memory utilization monitoring:

# Memory usage percentage per pod
100 * (
  container_memory_working_set_bytes{job="kubelet", container!=""}
  /
  container_spec_memory_limit_bytes{job="kubelet", container!=""}
)

# Total memory usage by namespace
sum by (namespace) (container_memory_working_set_bytes{job="kubelet", container!=""})

# Pods using more than 80% of memory limit
(
  container_memory_working_set_bytes{job="kubelet", container!=""}
  /
  container_spec_memory_limit_bytes{job="kubelet", container!=""}
) > 0.8

OOMKilled detection:

# OOMKilled restarts in the last hour (the restart counter has no "reason" label,
# so join it with the last-terminated-reason metric from kube-state-metrics)
increase(kube_pod_container_status_restarts_total[1h])
  * on (namespace, pod, container) group_left
  kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}

# Containers whose most recent termination was an OOM kill
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1

Grafana Dashboard Configuration

Create panels for comprehensive memory monitoring:

Panel Type    Query Focus                Purpose
Time Series   Memory usage trends        Pattern analysis
Stat          Current memory usage       Real-time status
Table         OOMKilled pod list         Problem identification
Heatmap       Node memory distribution   Resource balance check

 

 

Practical Problem-Solving Scenarios

Let me walk you through real scenarios and their solutions:

Scenario 1: Data Processing Application with Memory Spikes

Problem: Large file processing causes memory spikes leading to OOMKills

Solution Strategy:

# Before: Undersized resources
resources:
  requests:
    memory: "1Gi"
  limits:
    memory: "2Gi"

# After: Properly sized with room for spikes
resources:
  requests:
    memory: "4Gi"
  limits:
    memory: "4Gi"

# Additional: Switch from memory-backed to disk-backed storage
volumes:
- name: temp-storage
  emptyDir:
    medium: ""  # Use disk instead of memory
    sizeLimit: 10Gi

Scenario 2: Java Application Heap Management

Problem: JVM doesn’t properly recognize container memory limits

Solution:

env:
- name: JAVA_OPTS
  value: "-XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0 -XX:+UseG1GC"
resources:
  requests:
    memory: "2Gi"
  limits:
    memory: "2Gi"

Why this works:

  • UseContainerSupport: Makes JVM aware of container limits
  • MaxRAMPercentage=75.0: Leaves 25% headroom for non-heap memory
  • UseG1GC: Better garbage collection for containerized environments
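
One caveat: JAVA_OPTS only takes effect if the image’s entrypoint script actually appends it to the java command line. If you are unsure whether it does, JDK 9+ launchers read JDK_JAVA_OPTIONS on their own, so a hedged alternative is:

env:
- name: JDK_JAVA_OPTIONS   # picked up automatically by the java launcher since JDK 9
  value: "-XX:MaxRAMPercentage=75.0 -XX:+UseG1GC"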

Scenario 3: Node.js Memory Leak

Problem: Gradual memory increase over time

Solution:

env:
- name: NODE_OPTIONS
  value: "--max-old-space-size=1536"  # Limit to 1.5GB
resources:
  requests:
    memory: "2Gi"
  limits:
    memory: "2Gi"

# Add graceful shutdown handling
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 15"]

Implementing Vertical Pod Autoscaler (VPA)

VPA can automatically adjust resource limits based on actual usage:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"  # Automatically restart pods with new resources
  resourcePolicy:
    containerPolicies:
    - containerName: my-app
      minAllowed:
        memory: "100Mi"
      maxAllowed:
        memory: "8Gi"
      controlledResources: ["memory"]
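
Before trusting Auto mode, it helps to watch the recommendations for a while (assuming the VPA components are installed in the cluster):

# Inspect the current recommendation (target, lower bound, upper bound)
kubectl describe vpa my-app-vpa

# Or pull just the recommendation out of the status
kubectl get vpa my-app-vpa -o jsonpath='{.status.recommendation.containerRecommendations}'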

 

 

Advanced Memory Profiling Techniques

Go Application Profiling

Add pprof endpoints to your Go applications:

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

// Serve pprof on a side port; reach it later with kubectl port-forward
go func() {
    log.Println(http.ListenAndServe("localhost:6060", nil))
}()

Then profile in production:

# Port-forward to the pod
kubectl port-forward pod/<pod-name> 6060:6060 &

# Collect heap profile
go tool pprof http://localhost:6060/debug/pprof/heap

# Analyze memory usage patterns
(pprof) top
(pprof) list <function-name>

eBPF-Based Continuous Profiling

For production environments, eBPF provides continuous profiling with minimal overhead:

# Deploy continuous profiling with Parca
apiVersion: v1
kind: ConfigMap
metadata:
  name: parca-config
data:
  parca.yaml: |
    scrape_configs:
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_profiles_grafana_com_memory_scrape]
        action: keep
        regex: true

 

 

OOMKilled Error Prevention Strategies

Development Phase Best Practices

1. Local Memory Testing

# Monitor Docker container memory usage during development
docker stats <container-id>

# Use load testing with memory tracking
for i in {1..100}; do
  echo "Test iteration $i"
  kubectl top pod <pod-name> >> memory-usage.log
  sleep 30
done

2. Proper Resource Estimation

# Start with generous limits, then optimize
resources:
  requests:
    memory: "2Gi"  # Based on observed usage + 50% buffer
  limits:
    memory: "2Gi"
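
To turn “observed usage” into an actual number, one approach is to query the peak working set over a representative window in Prometheus (the namespace label here is illustrative):

# Peak working-set memory per pod over the last 7 days
max by (pod) (
  max_over_time(container_memory_working_set_bytes{namespace="my-namespace", container!=""}[7d])
)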

Production Monitoring Setup

Proactive Alerting:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: memory-alerts
spec:
  groups:
  - name: memory.rules
    rules:
    - alert: HighMemoryUsage
      expr: |
        (
          container_memory_working_set_bytes{job="kubelet", container!=""}
          /
          container_spec_memory_limit_bytes{job="kubelet", container!=""}
        ) > 0.8
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "Container memory usage exceeds 80%"
        description: "{{ $labels.namespace }}/{{ $labels.pod }} memory usage is {{ $value | humanizePercentage }}"
    
    - alert: OOMKilledDetected
      expr: |
        (
          increase(kube_pod_container_status_restarts_total[5m])
          * on (namespace, pod, container) group_left
          kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
        ) > 0
      labels:
        severity: critical
      annotations:
        summary: "OOMKilled event detected"
        description: "{{ $labels.namespace }}/{{ $labels.pod }} was OOMKilled"

Memory-Based Horizontal Pod Autoscaling:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: memory-based-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70  # Scale at 70% memory usage

Infrastructure Optimization

Node Resource Reservations:

# Configure kubelet to reserve system resources
--system-reserved=memory=1Gi,cpu=500m
--kube-reserved=memory=500Mi,cpu=500m
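
If your nodes use a kubelet config file instead of flags, the equivalent settings look roughly like this (a sketch using the standard KubeletConfiguration API; the eviction threshold is an illustrative addition):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  memory: "1Gi"
  cpu: "500m"
kubeReserved:
  memory: "500Mi"
  cpu: "500m"
evictionHard:
  memory.available: "200Mi"   # evict pods before the node itself runs out of memory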

Pod Priority and Preemption:

# High-priority workloads get better protection
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
description: "High priority for critical applications"

---
apiVersion: v1
kind: Pod
spec:
  priorityClassName: high-priority
  containers:
  - name: critical-app
    resources:
      requests:
        memory: "2Gi"
      limits:
        memory: "2Gi"

 

 

OOMKilled errors don’t have to be the bane of your Kubernetes operations. With proper understanding, monitoring, and proactive management, you can minimize their impact and often prevent them entirely.

The key takeaways:

  • Set memory limits equal to requests for predictable behavior
  • Monitor continuously with Prometheus and Grafana
  • Profile applications to understand real memory usage patterns
  • Implement proactive alerts before problems occur
  • Use VPA and HPA for dynamic resource management

 

The landscape of Kubernetes memory management continues to evolve with features like Memory QoS and improved container runtime integration. Stay current with these developments, and your clusters will be more stable and efficient.

 
