Let me tell you about the night I had Prometheus running, Grafana dashboards glowing green, and absolutely no idea why pods were restarting every 5 minutes. The CPU graph looked fine. Memory? Normal. But something was very, very wrong.
This blog exists because of that failure. It's not about how to install Prometheus and Grafana. You can Google that. It's about setting up monitoring that actually helps during incidents, on a multi-node Kubernetes cluster, covering node-, pod-, and cluster-level metrics.
"What is Prometheus?" copied theory. "Grafana is a visualization tool" β everyone knows. Step-by-step UI walkthroughs. A demo that works only on localhost and dies in real clusters. This is NOT about alerts β that's a separate blog.
The Wake-Up Call
3:42 AM. PagerDuty screaming. User-facing latency through the roof. I opened Grafana, looked at the dashboards I had spent days setting up, and saw... nothing useful.
CPU utilization: 23%. Memory: 45%. Network I/O: normal. All green. Yet our API was returning 5-second response times.
Dashboard: "Kubernetes Cluster Overview" (imported from Grafana.com)
Node CPU: 23% β
(looks fine!)
Node Memory: 45% β
(plenty of room!)
Pod Count: 47/50 β
(all healthy!)
Network: 120Mbps β
(normal!)
Reality: Users screaming, API timing out, SLA violated
Here's what I learned that night: Green dashboards lie. My pod had a CPU limit of 500m and was being throttled to death. But "node CPU" was fine because my node had 16 cores. I was looking at the wrong metric.
"We had Prometheus running but still had no clue what was happening."
– Me, at 4 AM, questioning my career choices
What Tutorials Hide (And What Will Bite You)
Before we install anything, let's get some misconceptions out of the way. These are the things that will bite you in production if you don't understand them now.
Scraping ≠ Monitoring
Prometheus is pulling metrics. That doesn't mean you're monitoring anything useful. Scraping is just data collection.
Metrics ≠ Alerts
Having metrics doesn't mean you'll be notified when things go wrong. Alerts are a separate concern entirely.
Dashboards ≠ Answers
A pretty dashboard doesn't help if you don't know which panel to look at during an incident.
Node Metrics ≠ Pod Metrics
Node CPU being fine says nothing about container throttling. They're completely different metrics.
How Data Actually Flows
Before you write any YAML, understand the architecture. This is where 80% of confusion comes from.
+------------------------------------------------------------------+
|                           YOUR CLUSTER                           |
|                                                                  |
|  +----------+   +----------+   +----------+   +----------+       |
|  |  Node 1  |   |  Node 2  |   |  Node 3  |   |  Node N  |       |
|  |          |   |          |   |          |   |          |       |
|  |  kubelet |   |  kubelet |   |  kubelet |   |  kubelet |       |
|  | (metrics)|   | (metrics)|   | (metrics)|   | (metrics)|       |
|  +----+-----+   +----+-----+   +----+-----+   +----+-----+       |
|       |              |              |              |             |
|       +--------------+------+-------+--------------+             |
|                             |                                    |
|                             v                                    |
|                  +--------------------+                          |
|                  |     PROMETHEUS     |                          |
|                  |                    |                          |
|                  | * Scrapes          | <- Pulls metrics         |
|                  |   every 15-30s     |    from targets          |
|                  |                    |                          |
|                  | * TSDB stores      | <- Time-series DB        |
|                  |   metrics          |    (local storage)       |
|                  |                    |                          |
|                  | * PromQL queries   | <- Query language        |
|                  +---------+----------+                          |
|                            |                                     |
|                            v                                     |
|                  +--------------------+                          |
|                  |      GRAFANA       |                          |
|                  |                    |                          |
|                  | * Visualizes       | <- Dashboards            |
|                  | * Queries          | <- Uses PromQL           |
|                  |   Prometheus       |                          |
|                  |                    |                          |
|                  | YOU -> Dashboard   | <- What you see          |
|                  +--------------------+                          |
+------------------------------------------------------------------+
CRITICAL: Prometheus PULLS data. If a target is unreachable,
you get no metrics. Silence is not "everything is fine."
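This is worth making concrete. Prometheus records a synthetic `up` series for every scrape target, and `absent()` catches the case where a target vanishes from service discovery entirely. Two sketches (the job label value is an example; check your own target names):

```promql
# Targets Prometheus knows about but cannot scrape
up == 0

# up == 0 never fires if the whole job disappeared from service discovery;
# absent() returns 1 when no matching series exist at all
absent(up{job="node-exporter"})
```

Put both on a dashboard panel. A flat "no data" panel and an `up == 0` panel fail in opposite ways, and you want to notice both.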
Installation (Only What Matters)
I'm going to use kube-prometheus-stack via Helm because it gives you Prometheus, Grafana, AlertManager, and a bunch of useful exporters in one shot. But let me be honest about the trade-offs.
Why kube-prometheus-stack?
- Pros: Pre-configured ServiceMonitors, sensible defaults, Grafana dashboards included
- Cons: Hides a LOT of configuration, large resource footprint, upgrades can be painful
Yes, Helm makes this easyβbut it also hides important configs. Know what you're deploying.
# Add the prometheus-community repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack
# Version matters - I'm using 66.3.0 in production
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--version 66.3.0 \
--set grafana.service.type=NodePort \
--set grafana.service.nodePort=30080
- retention: Default is 10d. For production, consider 15-30d depending on storage.
- scrapeInterval: Default 30s. For high-cardinality workloads, this can explode your storage.
- resources: Prometheus with defaults will OOM on a busy cluster. Set explicit limits.
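These defaults are easier to manage in a values file than in --set flags. A sketch of where they live in the kube-prometheus-stack chart (sizes are illustrative starting points, not recommendations):

```yaml
# values.yaml for kube-prometheus-stack
prometheus:
  prometheusSpec:
    retention: 15d
    scrapeInterval: 30s
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        memory: 4Gi   # memory limit only: a CPU limit would throttle Prometheus itself
```

Apply with `helm upgrade --install prometheus prometheus-community/kube-prometheus-stack -n monitoring -f values.yaml`.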
I also installed additional exporters for specific use cases:
# MySQL exporter for database metrics
helm install mysql-exporter prometheus-community/prometheus-mysql-exporter \
--namespace monitoring \
--version 2.11.0
# DCGM exporter for GPU metrics (if you have GPU nodes)
# Assumes the NVIDIA chart repo was added first:
#   helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  --version 3.5.0
What I'm intentionally skipping:
- Thanos/Cortex: Overkill for most clusters. Add when you need multi-cluster or long-term storage.
- Loki: Great for logs, but this blog is dashboards only.
- Custom ServiceMonitors: The defaults cover 90% of cases. Add when needed.
Setting Up Grafana Properly
Here's what I did after installation to make Grafana actually usable in a team environment.
Step 1: Expose Grafana via Nginx
In production, I expose Grafana through our existing Nginx ingress on a /grafana path:
location /grafana/ {
    proxy_pass http://grafana-nodeport-service:30080/;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
}
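One gotcha with the subpath setup: Grafana itself must be told it lives under /grafana/, or its asset and redirect URLs will 404 behind the proxy. With kube-prometheus-stack that means Helm values like these (a sketch; the grafana.ini keys are standard Grafana server settings):

```yaml
grafana:
  grafana.ini:
    server:
      root_url: "%(protocol)s://%(domain)s/grafana/"
      serve_from_sub_path: true
```

If the login page loads but is unstyled, this setting is almost always the culprit.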
Step 2: Create Users for Different Teams
Don't give everyone admin access. I created separate users:
# Users I created:
- admin (Admin) → DevOps team
- devops (Editor) → DevOps engineers
- backend (Viewer) → Backend team
- qa (Viewer) → QA team

# Why?
# - Viewers can see dashboards but not modify them
# - Editors can create dashboards but not change data sources
# - Admin is for infrastructure changes only
Step 3: Verify Data Source
First thing after login: Configuration → Data Sources → Prometheus → Test
If this test fails, nothing else works. The URL should be the internal service name (usually http://prometheus-kube-prometheus-prometheus:9090).
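When that test fails, check from inside the cluster before blaming Grafana. A throwaway curl pod does it (the service name assumes the Helm release is called prometheus):

```bash
kubectl -n monitoring run tmp-curl --rm -it --restart=Never \
  --image=curlimages/curl -- \
  curl -s http://prometheus-kube-prometheus-prometheus:9090/-/healthy
```

A healthy instance answers with HTTP 200; if this hangs or fails, the problem is DNS, a NetworkPolicy, or the service name, not Grafana.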
Dashboards That Work (Not Just Look Pretty)
This is where most blogs fail. They tell you to "import dashboard ID 1860 and you're done." That dashboard has 47 panels. Which one do you look at when things are breaking?
Here's my approach: Every panel must answer a specific question. If you can't articulate the question, delete the panel.
The Three Questions Every Dashboard Must Answer
1. Is something broken RIGHT NOW?
   → Real-time error rates, latency spikes, pod restarts
2. What is trending toward broken?
   → Resource saturation, disk filling up, memory creeping
3. What changed recently?
   → Deployments, config changes, traffic patterns
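Question 2 is the one PromQL is uniquely good at: `predict_linear()` fits a linear trend to a range and extrapolates it. A sketch using node-exporter's filesystem metric (the mountpoint selector is an example):

```promql
# Based on the last 6h trend, will this filesystem run out within 4 hours?
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4 * 3600) < 0
```

A panel built on this fires hours before `disk_free == 0` would, which is the whole point of "trending toward broken".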
CPU: Usage vs Throttling (The Lie I Believed)
CPU Usage tells you how much CPU a container used. CPU Throttling tells you how often it was blocked from using more CPU because of limits.
# CPU Throttling percentage
# This is what actually matters, not CPU usage
sum(rate(container_cpu_cfs_throttled_periods_total{
namespace="production"
}[5m])) by (pod)
/
sum(rate(container_cpu_cfs_periods_total{
namespace="production"
}[5m])) by (pod)
* 100
# If this is > 25%, your pod is being starved.
# Usage might show 30%, but throttling at 80% = performance death
Bad conclusion people draw: "CPU is at 30%, we have plenty of headroom!"
Reality: Container is throttled 80% of the time because limit is too low.
Memory: Usage vs OOMKills
Same story. Memory usage being at 60% means nothing if you have OOMKills.
# Memory usage as percentage of limit
container_memory_working_set_bytes{namespace="production"}
/
container_spec_memory_limit_bytes{namespace="production"}
* 100
# OOMKill events (the metric that actually wakes you up)
# Note: restarts_total carries no "reason" label, so join it with
# kube-state-metrics' last-terminated-reason series instead
increase(kube_pod_container_status_restarts_total{namespace="production"}[1h])
  and on (namespace, pod, container)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
Node Metrics vs Pod Metrics (Don't Mix Them)
I see this mistake constantly: putting node-level and pod-level metrics on the same dashboard without clear separation.
NODE METRICS (from node-exporter):
- node_cpu_seconds_total
- node_memory_MemAvailable_bytes
- node_filesystem_avail_bytes
→ "Is my infrastructure healthy?"

POD METRICS (from cAdvisor/kubelet):
- container_cpu_usage_seconds_total
- container_memory_working_set_bytes
→ "Is my application healthy?"

CLUSTER METRICS (from kube-state-metrics):
- kube_pod_status_phase
- kube_deployment_status_replicas_available
- kube_pod_container_status_restarts_total
→ "Is Kubernetes doing its job?"
A Debugging Story: The Latency Spike
Let me walk you through an actual incident and how I used (and misused) dashboards.
The symptom: API latency jumped from 100ms to 2000ms at 2 PM.
First metric I checked (Wrong)
Node CPU. It was at 45%. "That's fine," I thought.
Second metric I checked (Also wrong)
Pod CPU usage. It was at 35%. "Definitely not CPU."
The metric that actually mattered
# Container CPU throttling - THIS was the problem
rate(container_cpu_cfs_throttled_seconds_total{
pod=~"api-.*"
}[5m])
# Result: 0.8 seconds throttled per second
# Translation: pod is blocked 80% of the time!
What happened: We had deployed a new feature that was CPU-intensive. The pod's CPU limit was 500m. Usage was "35%" of the node, but the container was hitting its limit constantly.
The fix: Increased CPU limit to 1000m. Latency dropped to 80ms.
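The fix itself is a one-line change in the container's resources block (the Deployment snippet below is illustrative, not our actual manifest):

```yaml
# deployment.yaml (snippet) for a hypothetical api Deployment
resources:
  requests:
    cpu: 500m
  limits:
    cpu: "1"   # raised from 500m; the throttling, not usage, drove the latency
```

After changing a limit, re-check the throttling query, not the usage graph, to confirm the fix landed.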
What Prometheus Can't Tell You
Prometheus is not magic. Here are its blind spots:
- Application-level context: It doesn't know if your 500 error is a bug or a bad request. It just counts them.
- Why something happened: Metrics tell you what, not why. Logs tell you why.
- Distributed traces: If request A causes slowdown in service B, Prometheus won't connect those dots. You need tracing.
- What's missing: If a pod dies and stops reporting, you get gaps. Absence of data isn't flagged automatically.
+---------------------------------------------------+
|               OBSERVABILITY REALITY               |
+---------------------------------------------------+
|                                                   |
|  METRICS (Prometheus)                             |
|  |- "What" is happening?                          |
|  +- CPU at 90%, errors increased                  |
|                                                   |
|  LOGS (Loki/ELK)                                  |
|  |- "Why" is it happening?                        |
|  +- NullPointerException at line 47               |
|                                                   |
|  TRACES (Jaeger/Tempo)                            |
|  |- "Where" in the system?                        |
|  +- Latency from service A -> B -> C              |
|                                                   |
|  You need ALL THREE for the full picture.         |
|  Dashboards alone won't save you.                 |
+---------------------------------------------------+
The Uncomfortable Truths
After running this setup in production for over a year, here's what I've learned:
Dashboards Don't Save You
They're for investigation, not detection. Alerts save you. Dashboards help you understand after you're alerted.
Fewer Panels, More Answers
A dashboard with 5 panels you understand beats 50 panels you ignore. Delete what you don't use.
Test Your Monitoring
Kill a pod on purpose. Does your dashboard show it? If not, fix that before production kills it for you.
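A concrete drill, sketched below: force one container restart in a non-production namespace and watch for it on the dashboard (assumes the namespace is called staging, the workload is deploy/my-app, and the image ships a kill binary):

```bash
# Crash PID 1 in one container; the kubelet will restart it
kubectl -n staging exec deploy/my-app -- kill 1

# The restart should show up within a scrape interval or two via:
#   increase(kube_pod_container_status_restarts_total{namespace="staging"}[10m]) > 0
```

Note that deleting a pod outright is a weaker test: the replacement pod starts with a restart count of zero, so a restarts panel may never blink.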
Document Your Panels
Every panel should have a description: what it shows, what's normal, what's concerning.
How to Know Your Monitoring is "Working"
Ask yourself these questions:
- Can a new team member look at the dashboard and understand what's wrong within 60 seconds?
- When the last incident happened, did you find the root cause using your dashboards?
- Are there panels that nobody has looked at in 3 months? Delete them.
- Do you trust your dashboards when they say "everything is fine"?
What to Add Next
This blog was about dashboards. Here's your monitoring maturity roadmap:
Level 1: Metrics + Dashboards (You are here)
✓ Prometheus scraping
✓ Grafana visualizing
✓ Understanding what to look at

Level 2: Alerts
→ AlertManager configuration
→ PagerDuty/Slack integration
→ Runbooks for each alert

Level 3: SLOs and Error Budgets
→ Define what "good" looks like
→ Burn rate alerts
→ Data-driven reliability decisions

Level 4: Logs and Traces
→ Loki for logs
→ Tempo/Jaeger for distributed tracing
→ Correlation between metrics, logs, and traces

Level 5: AIOps (maybe)
→ Anomaly detection
→ Automated remediation
→ Probably overkill for most teams
The One Thing I Want You to Remember
Monitoring doesn't end at installation. The setup is 20% of the work. The other 80% is understanding what the numbers mean, building dashboards that answer real questions, and testing that they work before you need them at 3 AM.
Next up: I'll write about alert queriesβbecause dashboards are useless if nobody is looking at them when things break.
Resources
- Prometheus Official Documentation – Start here for concepts
- Grafana Documentation – Dashboard creation best practices
- Google SRE Book, Monitoring chapter – The philosophy behind good monitoring
- kube-prometheus-stack Helm Chart – What we installed today