Reason: OOMKilled. Three words that have haunted my dreams. This is the story of how a simple Java application brought our cluster to its knees.
🎭 Act 1: The Mystery Begins
It started innocently enough. Our Java microservice, running perfectly fine for months, started getting OOMKilled randomly. No code changes. No traffic spikes. Just pods dying.
State: Terminated
Reason: OOMKilled
Exit Code: 137
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
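Exit code 137 is the kernel's signature: 128 plus signal number 9 (SIGKILL), which is how the OOM killer terminates a process that blows past its cgroup limit. A quick decode (the class is mine, for illustration):

```java
public class ExitCodeDecode {
    public static void main(String[] args) {
        int exitCode = 137;
        // Exit codes above 128 mean "terminated by signal (exitCode - 128)"
        int signal = exitCode - 128;
        System.out.println("killed by signal " + signal + (signal == 9 ? " (SIGKILL)" : ""));
    }
}
```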
Our resource limits looked reasonable:
resources:
requests:
memory: "512Mi"
cpu: "250m"
limits:
memory: "1Gi"
cpu: "500m"
1GB should be plenty for this service. The developers said it only needed "about 500MB." Famous last words.
🕳️ Act 2: Down the Rabbit Hole
I started investigating. First, I checked if something was actually using all that memory:
$ cat /sys/fs/cgroup/memory/memory.usage_in_bytes
1073741824 # Exactly 1GB - hitting the limit
$ cat /sys/fs/cgroup/memory/memory.limit_in_bytes
1073741824 # The limit is 1GB
But here's where it got weird. The Java heap was set to 512MB:
-Xms256m -Xmx512m
512MB heap, 1GB limit. Should be fine, right? Wrong. The heap is only one slice of the JVM's footprint; the rest lives off-heap:
- Metaspace (class metadata)
- Thread stacks
- Direct buffers (NIO)
- JIT compiled code cache
- GC overhead
I used Native Memory Tracking to see the real picture:
Total: 1.2GB
- Java Heap: 512MB
- Metaspace: 120MB
- Thread Stacks: 180MB # 180 threads × 1MB each!
- Code Cache: 80MB
- Direct Buffers: 250MB # The culprit!
- Other: 78MB
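Summing the NMT lines shows exactly why a 1GB container limit was never going to hold. A quick check of the arithmetic (the class is mine, numbers are from the NMT summary above):

```java
public class NmtSum {
    public static void main(String[] args) {
        // Figures from the NMT summary, in MB
        int heap = 512, metaspace = 120, threadStacks = 180;
        int codeCache = 80, directBuffers = 250, other = 78;
        int total = heap + metaspace + threadStacks + codeCache + directBuffers + other;
        System.out.println(total + " MB total vs 1024 MB container limit");
    }
}
```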
250MB of direct buffers. The application was using an HTTP client that allocated off-heap memory for connection pooling. And it was never releasing it.
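The nasty part of direct buffers is that they are invisible to heap monitoring. A minimal sketch (class name and sizes are mine, not the app's code) of how ByteBuffer.allocateDirect takes native memory that never shows up against -Xmx:

```java
import java.nio.ByteBuffer;

public class DirectBufferDemo {
    public static void main(String[] args) {
        // allocateDirect takes native memory outside the Java heap
        ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024 * 1024); // 64 MB off-heap
        long heapUsed = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
        System.out.println("direct capacity: " + buf.capacity() + " bytes");
        System.out.println("heap in use: " + (heapUsed / (1024 * 1024)) + " MB (the 64 MB is not in here)");
    }
}
```

Watch the container's RSS climb while heap dashboards stay flat: that gap is exactly what NMT surfaced.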
🔬 Act 3: The Resolution
The fix was two-fold. First, cap the JVM's off-heap consumers:
# Cap off-heap memory and shrink thread stacks
-Xms256m
-Xmx512m
-XX:MaxDirectMemorySize=128m
-XX:MaxMetaspaceSize=150m
-Xss512k # Reduce thread stack size
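After rolling out the flags, it's worth confirming from inside the pod that they actually took effect. A small sketch (class name is mine) using the standard management beans:

```java
import java.lang.management.ManagementFactory;

public class MemorySanityCheck {
    public static void main(String[] args) {
        // maxMemory() reflects the effective -Xmx (or MaxRAMPercentage) setting
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        // Live thread count: each one carries an -Xss-sized stack off-heap
        int liveThreads = ManagementFactory.getThreadMXBean().getThreadCount();
        System.out.println("max heap (MB): " + maxHeapMb);
        System.out.println("live threads: " + liveThreads);
    }
}
```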
And we increased the container limit to give breathing room:
resources:
  requests:
    memory: "1Gi"
    cpu: "250m"
  limits:
    memory: "1.5Gi" # Room for non-heap memory
    cpu: "500m"
📚 Epilogue: Lessons Learned
Container Memory ≠ Heap
Always allocate 1.5-2x your max heap for containers
Enable NMT
-XX:NativeMemoryTracking=summary is your friend
Check Direct Buffers
HTTP clients and NIO can eat memory silently
Count Threads
Each thread = 1MB by default. They add up.
Use Container Support
-XX:+UseContainerSupport (default in JDK 11+) lets the JVM read cgroup limits; pair it with -XX:MaxRAMPercentage=75 to auto-size the heap from the container limit
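The MaxRAMPercentage math for the new 1.5Gi limit works out like this (a sketch, assuming the JVM sees the full container limit):

```java
public class RamPercentageMath {
    public static void main(String[] args) {
        long containerLimitMib = 1536;     // the 1.5Gi container limit
        double maxRamPercentage = 75.0;    // -XX:MaxRAMPercentage=75
        long maxHeapMib = (long) (containerLimitMib * maxRamPercentage / 100);
        // The remaining 25% stays free for metaspace, stacks, and direct buffers
        System.out.println("auto-sized max heap: " + maxHeapMib + " MiB");
    }
}
```

That 1152 MiB heap is roomier than the fixed -Xmx512m, so the percentage approach trades some off-heap headroom for heap; tune the percentage to your NMT numbers.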
Two weeks without an OOMKill. I call that a win. How do you handle JVM memory in containers?