โ† Back to all stories

OOMKilled: A Horror Story in Three Acts

When your pods keep dying and you can't figure out why. A deep dive into memory limits, Java heap sizes, and container mysteries.

Reason: OOMKilled. Two words that have haunted my dreams. This is the story of how a simple Java application brought our cluster to its knees.

🎭 Act 1: The Mystery Begins

It started innocently enough. Our Java microservice, running perfectly fine for months, started getting OOMKilled randomly. No code changes. No traffic spikes. Just pods dying.

kubectl describe pod
State:          Terminated
  Reason:       OOMKilled
  Exit Code:    137
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137

Our resource limits looked reasonable:

deployment.yaml
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "1Gi"
    cpu: "500m"

1GB should be plenty for this service. The developers said it only needed "about 500MB." Famous last words.

๐Ÿ•ณ๏ธ Act 2: Down the Rabbit Hole

I started investigating. First, I checked if something was actually using all that memory:

Inside the container
$ cat /sys/fs/cgroup/memory/memory.usage_in_bytes
1073741824  # Exactly 1GB - hitting the limit

$ cat /sys/fs/cgroup/memory/memory.limit_in_bytes
1073741824  # The limit is 1GB

# On cgroup v2 hosts, the equivalents are memory.current and memory.max

But here's where it got weird. The Java heap was set to 512MB:

JVM flags
-Xms256m -Xmx512m

512MB heap, 1GB limit. Should be fine, right? Wrong.

⚠️
The JVM lies about memory. Heap is just one part. There's also:
  • Metaspace (class metadata)
  • Thread stacks
  • Direct buffers (NIO)
  • JIT compiled code cache
  • GC overhead
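Seeing is believing. Here's a minimal, hypothetical sketch (not from the original incident) showing that a direct buffer is invisible to heap accounting, even though the kernel charges it against the container:

```java
import java.nio.ByteBuffer;

public class OffHeapDemo {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long heapBefore = rt.totalMemory() - rt.freeMemory();

        // 16 MB allocated outside the Java heap: invisible to -Xmx,
        // but fully counted against the container's memory limit.
        ByteBuffer direct = ByteBuffer.allocateDirect(16 * 1024 * 1024);

        long heapGrowth = (rt.totalMemory() - rt.freeMemory()) - heapBefore;
        System.out.println("Direct capacity: " + direct.capacity() / (1024 * 1024) + " MB");
        System.out.println("Heap growth:     " + heapGrowth / (1024 * 1024) + " MB");  // typically ~0
    }
}
```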

I enabled Native Memory Tracking (-XX:NativeMemoryTracking=summary, then jcmd <pid> VM.native_memory summary) to see the real picture:

NMT Output (summarized)
Total: 1.2GB
- Java Heap:        512MB
- Metaspace:        120MB
- Thread Stacks:    180MB  # 180 threads × 1MB each!
- Code Cache:       80MB
- Direct Buffers:   250MB  # The culprit!
- Other:            78MB

250MB of direct buffers. The application was using an HTTP client that allocated off-heap memory for connection pooling. And it was never releasing it.
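The post doesn't name the client, but the failure mode can be sketched like this (hypothetical class and sizes, for illustration only): a pool that allocates one direct buffer per connection and retains every one of them:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the leak pattern described above.
public class LeakyPool {
    private final List<ByteBuffer> buffers = new ArrayList<>();

    ByteBuffer acquire() {
        // Every new "connection" gets a fresh 1 MB off-heap buffer...
        ByteBuffer buf = ByteBuffer.allocateDirect(1024 * 1024);
        buffers.add(buf);  // ...and the pool holds a reference forever.
        return buf;
    }

    int retainedMb() {
        return buffers.stream().mapToInt(ByteBuffer::capacity).sum() / (1024 * 1024);
    }

    public static void main(String[] args) {
        LeakyPool pool = new LeakyPool();
        for (int i = 0; i < 16; i++) pool.acquire();
        // None of this shows up in a heap dump - only in NMT and the cgroup counters.
        System.out.println(pool.retainedMb() + " MB retained off-heap");
    }
}
```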

🎬 Act 3: The Resolution

The fix was twofold:

Fixed JVM flags
# Limit direct memory and reduce threads
-Xms256m
-Xmx512m
-XX:MaxDirectMemorySize=128m
-XX:MaxMetaspaceSize=150m
-Xss512k  # Reduce thread stack size

And we increased the container limit to give breathing room:

deployment.yaml (fixed)
resources:
  requests:
    memory: "1Gi"
    cpu: "250m"
  limits:
    memory: "1.5Gi"  # Room for non-heap memory
    cpu: "500m"
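As a sanity check on the new 1.5Gi limit, here's the back-of-the-envelope budget under the fixed flags (the 180-thread count comes from the NMT dump; the code-cache-and-other figure is a rough assumption):

```java
public class MemoryBudget {
    public static void main(String[] args) {
        // Worst-case JVM footprint under the fixed flags, in MB.
        int heap = 512;              // -Xmx512m
        int direct = 128;            // -XX:MaxDirectMemorySize=128m
        int metaspace = 150;         // -XX:MaxMetaspaceSize=150m
        int stacks = 180 / 2;        // 180 threads × 512 KB (-Xss512k) = 90 MB
        int codeCacheAndOther = 160; // code cache, GC structures, native libs (estimate)

        int total = heap + direct + metaspace + stacks + codeCacheAndOther;
        System.out.println("Worst-case footprint ≈ " + total + " MB");
        // ≈ 1040 MB, comfortably under the 1.5Gi (1536 MB) container limit.
    }
}
```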

📖 Epilogue: Lessons Learned

🧮 Container Memory ≠ Heap

Always budget 1.5-2x your max heap for the container limit.

📊 Enable NMT

-XX:NativeMemoryTracking=summary is your friend.

🔍 Check Direct Buffers

HTTP clients and NIO can eat memory silently.

🧵 Count Threads

Each thread reserves 1MB of stack by default. They add up.
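To put a number on that last lesson, here's a quick sketch using the standard ThreadMXBean to count live threads (the 1 MB-per-thread figure is the common platform default, not a guarantee):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadStackBudget {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        int count = threads.getThreadCount();
        // Assumes the platform default of roughly 1 MB reserved per thread
        // stack (-Xss1m); the actual figure varies by OS and flags.
        System.out.println(count + " threads ≈ " + count + " MB of stack reservations");
    }
}
```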

💡
Modern Java tip: -XX:+UseContainerSupport has been enabled by default since JDK 10 (and was backported to 8u191), so the JVM already detects container limits. Pair it with -XX:MaxRAMPercentage=75.0 to let it auto-size the heap from the container limit.
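A quick way to check what the JVM actually decided (a sketch; the reported number depends on your container limit and percentage flags):

```java
public class HeapSizing {
    public static void main(String[] args) {
        // With -XX:MaxRAMPercentage=75.0 in a container limited to 1.5Gi,
        // this should report roughly 1100-1200 MB.
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Effective max heap: " + maxHeapMb + " MB");
    }
}
```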

Two weeks without an OOMKill. I call that a win. How do you handle JVM memory in containers?