โ† Back to all stories

OOMKilled: A Horror Story in Three Acts

When your pods keep dying and you can't figure out why. A deep dive into memory limits, Java heap sizes, and container mysteries.

Reason: OOMKilled. Two words that have haunted my dreams. This is the story of how a simple Java application brought our cluster to its knees.

🎭 Act 1: The Mystery Begins

It started innocently enough. Our Java microservice, running perfectly fine for months, started getting OOMKilled randomly. No code changes. No traffic spikes. Just pods dying.

kubectl describe pod
State:          Terminated
  Reason:       OOMKilled
  Exit Code:    137
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137

Our resource limits looked reasonable:

deployment.yaml
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "1Gi"
    cpu: "500m"

1GB should be plenty for this service. The developers said it only needed "about 500MB." Famous last words.

๐Ÿ•ณ๏ธ Act 2: Down the Rabbit Hole

I started investigating. First, I checked if something was actually using all that memory:

Inside the container
$ cat /sys/fs/cgroup/memory/memory.usage_in_bytes
1073741824  # Exactly 1GB - hitting the limit

$ cat /sys/fs/cgroup/memory/memory.limit_in_bytes
1073741824  # The limit is 1GB

# On cgroup v2 hosts, the equivalents are memory.current and memory.max

But here's where it got weird. The Java heap was set to 512MB:

JVM flags
-Xms256m -Xmx512m

512MB heap, 1GB limit. Should be fine, right? Wrong.

⚠️
The JVM lies about memory. Heap is just one part. There's also:
  • Metaspace (class metadata)
  • Thread stacks
  • Direct buffers (NIO)
  • JIT compiled code cache
  • GC overhead
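Seeing is believing. Here's a minimal, hypothetical sketch (not from the original incident) showing that a direct buffer is invisible to heap accounting, even though the kernel charges it against the container:

```java
import java.nio.ByteBuffer;

public class OffHeapDemo {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long heapBefore = rt.totalMemory() - rt.freeMemory();

        // 16 MB allocated outside the Java heap: invisible to -Xmx,
        // but fully counted against the container's memory limit.
        ByteBuffer direct = ByteBuffer.allocateDirect(16 * 1024 * 1024);

        long heapGrowth = (rt.totalMemory() - rt.freeMemory()) - heapBefore;
        System.out.println("Direct capacity: " + direct.capacity() / (1024 * 1024) + " MB");
        System.out.println("Heap growth:     " + heapGrowth / (1024 * 1024) + " MB");  // typically ~0
    }
}
```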

I enabled Native Memory Tracking (-XX:NativeMemoryTracking=summary, then jcmd <pid> VM.native_memory summary) to see the real picture:

NMT Output (summarized)
Total: 1.2GB
- Java Heap:        512MB
- Metaspace:        120MB
- Thread Stacks:    180MB  # 180 threads × 1MB each!
- Code Cache:       80MB
- Direct Buffers:   250MB  # The culprit!
- Other:            78MB

250MB of direct buffers. The application was using an HTTP client that allocated off-heap memory for connection pooling. And it was never releasing it.
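The post doesn't name the client, but the failure mode can be sketched like this (hypothetical class and sizes, for illustration only): a pool that allocates one direct buffer per connection and retains every one of them:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the leak pattern described above.
public class LeakyPool {
    private final List<ByteBuffer> buffers = new ArrayList<>();

    ByteBuffer acquire() {
        // Every new "connection" gets a fresh 1 MB off-heap buffer...
        ByteBuffer buf = ByteBuffer.allocateDirect(1024 * 1024);
        buffers.add(buf);  // ...and the pool holds a reference forever.
        return buf;
    }

    int retainedMb() {
        return buffers.stream().mapToInt(ByteBuffer::capacity).sum() / (1024 * 1024);
    }

    public static void main(String[] args) {
        LeakyPool pool = new LeakyPool();
        for (int i = 0; i < 16; i++) pool.acquire();
        // None of this shows up in a heap dump - only in NMT and the cgroup counters.
        System.out.println(pool.retainedMb() + " MB retained off-heap");
    }
}
```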

🎬 Act 3: The Resolution

The fix was twofold:

Fixed JVM flags
# Limit direct memory and reduce threads
-Xms256m
-Xmx512m
-XX:MaxDirectMemorySize=128m
-XX:MaxMetaspaceSize=150m
-Xss512k  # Reduce thread stack size

And we increased the container limit to give breathing room:

deployment.yaml (fixed)
resources:
  requests:
    memory: "1Gi"
    cpu: "250m"
  limits:
    memory: "1.5Gi"  # Room for non-heap memory
    cpu: "500m"
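As a sanity check on the new 1.5Gi limit, here's the back-of-the-envelope budget under the fixed flags (the 180-thread count comes from the NMT dump; the code-cache-and-other figure is a rough assumption):

```java
public class MemoryBudget {
    public static void main(String[] args) {
        // Worst-case JVM footprint under the fixed flags, in MB.
        int heap = 512;              // -Xmx512m
        int direct = 128;            // -XX:MaxDirectMemorySize=128m
        int metaspace = 150;         // -XX:MaxMetaspaceSize=150m
        int stacks = 180 / 2;        // 180 threads × 512 KB (-Xss512k) = 90 MB
        int codeCacheAndOther = 160; // code cache, GC structures, native libs (estimate)

        int total = heap + direct + metaspace + stacks + codeCacheAndOther;
        System.out.println("Worst-case footprint ≈ " + total + " MB");
        // ≈ 1040 MB, comfortably under the 1.5Gi (1536 MB) container limit.
    }
}
```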

📖 Epilogue: Lessons Learned

🧮 Container Memory ≠ Heap

Always budget 1.5-2x your max heap for the container limit.

📊 Enable NMT

-XX:NativeMemoryTracking=summary is your friend.

🔍 Check Direct Buffers

HTTP clients and NIO can eat memory silently.

🧵 Count Threads

Each thread reserves 1MB of stack by default. They add up.
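To put a number on that last lesson, here's a quick sketch using the standard ThreadMXBean to count live threads (the 1 MB-per-thread figure is the common platform default, not a guarantee):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadStackBudget {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        int count = threads.getThreadCount();
        // Assumes the platform default of roughly 1 MB reserved per thread
        // stack (-Xss1m); the actual figure varies by OS and flags.
        System.out.println(count + " threads ≈ " + count + " MB of stack reservations");
    }
}
```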

💡
Modern Java tip: -XX:+UseContainerSupport has been enabled by default since JDK 10 (and was backported to 8u191), so the JVM already detects container limits. Pair it with -XX:MaxRAMPercentage=75.0 to let it auto-size the heap from the container limit.
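A quick way to check what the JVM actually decided (a sketch; the reported number depends on your container limit and percentage flags):

```java
public class HeapSizing {
    public static void main(String[] args) {
        // With -XX:MaxRAMPercentage=75.0 in a container limited to 1.5Gi,
        // this should report roughly 1100-1200 MB.
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Effective max heap: " + maxHeapMb + " MB");
    }
}
```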

Two weeks without an OOMKill. I call that a win. How do you handle JVM memory in containers?