Setting Up Prometheus + Grafana Dashboards That Actually Help During Incidents

Not another "install and import dashboard 1860" tutorial. This is about building monitoring that answers questions at 3 AM when everything is on fire.

Let me tell you about the night I had Prometheus running, Grafana dashboards glowing green, and absolutely no idea why pods were restarting every 5 minutes. The CPU graph looked fine. Memory? Normal. But something was very, very wrong.

This blog exists because of that failure. It's not about how to install Prometheus and Grafana. You can Google that. This is about setting up monitoring that actually helps during incidents—on a multi-node Kubernetes cluster, covering node, pod, and cluster-level metrics.

🚫
What this blog is NOT:

"What is Prometheus?" copied theory. "Grafana is a visualization tool" β€” everyone knows. Step-by-step UI walkthroughs. A demo that works only on localhost and dies in real clusters. This is NOT about alerts β€” that's a separate blog.

🚨 The Wake-Up Call

3:42 AM. PagerDuty screaming. User-facing latency through the roof. I opened Grafana, looked at the dashboards I had spent days setting up, and saw... nothing useful.

CPU utilization: 23%. Memory: 45%. Network I/O: normal. All green. Yet our API was returning 5-second response times.

What I was looking at
Dashboard: "Kubernetes Cluster Overview" (imported from Grafana.com)

Node CPU: 23% ✅ (looks fine!)
Node Memory: 45% ✅ (plenty of room!)
Pod Count: 47/50 ✅ (all healthy!)
Network: 120Mbps ✅ (normal!)

Reality: Users screaming, API timing out, SLA violated

Here's what I learned that night: Green dashboards lie. My pod had a CPU limit of 500m and was being throttled to death. But "node CPU" was fine because my node had 16 cores. I was looking at the wrong metric.

"We had Prometheus running but still had no clue what was happening."

— Me, at 4 AM, questioning my career choices

🙈 What Tutorials Hide (And What Will Bite You)

Before we install anything, let's get some misconceptions out of the way. These are the things that will bite you in production if you don't understand them now.

❌

Scraping ≠ Monitoring

Prometheus pulling metrics doesn't mean you're monitoring anything useful. Scraping is just data collection.

❌

Metrics ≠ Alerts

Having metrics doesn't mean you'll be notified when things go wrong. Alerts are a separate concern entirely.

❌

Dashboards ≠ Answers

A pretty dashboard doesn't help if you don't know which panel to look at during an incident.

❌

Node Metrics ≠ Pod Metrics

Node CPU being fine says nothing about container throttling. They're completely different metrics.

🔄 How Data Actually Flows

Before you write any YAML, understand the architecture. This is where 80% of confusion comes from.

The Data Pipeline
YOUR CLUSTER

  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
  │ Node 1   │   │ Node 2   │   │ Node 3   │   │ Node N   │
  │          │   │          │   │          │   │          │
  │ kubelet  │   │ kubelet  │   │ kubelet  │   │ kubelet  │
  │ (metrics)│   │ (metrics)│   │ (metrics)│   │ (metrics)│
  └────┬─────┘   └────┬─────┘   └────┬─────┘   └────┬─────┘
       │              │              │              │
       └──────────────┴──────┬───────┴──────────────┘
                             │
                             ▼
                   ┌────────────────────┐
                   │     PROMETHEUS     │
                   │                    │
                   │ • Scrapes targets  │  ← pulls metrics every 15-30s
                   │ • TSDB stores      │  ← time-series DB (local storage)
                   │   metrics locally  │
                   │ • PromQL queries   │  ← query language
                   └─────────┬──────────┘
                             │
                             ▼
                   ┌────────────────────┐
                   │      GRAFANA       │
                   │                    │
                   │ • Visualizes       │  ← dashboards
                   │ • Queries          │  ← uses PromQL
                   │   Prometheus       │
                   │                    │
                   │ YOU → Dashboard    │  ← what you see
                   └────────────────────┘

CRITICAL: Prometheus PULLS data. If a target is unreachable,
          you get no metrics. Silence is not "everything is fine."
💡
Key insight: Prometheus doesn't know if a service is healthy. It only knows if it can reach the /metrics endpoint. A target being "up" means Prometheus scraped it successfully. The service behind it could be returning 500s all day.
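
You can check that distinction directly. Two quick sanity queries against the built-in up metric (the job label values are whatever your setup uses):

PromQL - Scrape health, not service health
# Every target Prometheus currently cannot reach
# ("up" is 1 if the last scrape succeeded, 0 if it failed)
up == 0

# Fraction of healthy targets per job
sum(up) by (job) / count(up) by (job)

# Remember: up == 1 only means /metrics responded.
# It says nothing about the service's own error rate.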

πŸ› οΈ Installation (Only What Matters)

I'm going to use kube-prometheus-stack via Helm because it gives you Prometheus, Grafana, AlertManager, and a bunch of useful exporters in one shot. But let me be honest about the trade-offs.

Why kube-prometheus-stack?

  • Pros: Pre-configured ServiceMonitors, sensible defaults, Grafana dashboards included
  • Cons: Hides a LOT of configuration, large resource footprint, upgrades can be painful

Yes, Helm makes this easy—but it also hides important configs. Know what you're deploying.

Shell
# Add the prometheus-community repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
# Version matters - I'm using 66.3.0 in production
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --version 66.3.0 \
  --set grafana.service.type=NodePort \
  --set grafana.service.nodePort=30080
⚠️
Dangerous defaults I changed:
  • retention: Default is 10d. For production, consider 15-30d depending on storage.
  • scrapeInterval: Default 30s. For high-cardinality workloads, this can explode your storage.
  • resources: Prometheus with defaults will OOM on a busy cluster. Set explicit limits.

I also installed additional exporters for specific use cases:

Shell
# MySQL exporter for database metrics
helm install mysql-exporter prometheus-community/prometheus-mysql-exporter \
  --namespace monitoring \
  --version 2.11.0

# DCGM exporter for GPU metrics (if you have GPU nodes)
# (requires the NVIDIA chart repo to be added first, e.g.
#  helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts)
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  --version 3.5.0

What I'm intentionally skipping:

  • Thanos/Cortex: Overkill for most clusters. Add when you need multi-cluster or long-term storage.
  • Loki: Great for logs, but this blog is dashboards only.
  • Custom ServiceMonitors: The defaults cover 90% of cases. Add when needed.

📊 Setting Up Grafana Properly

Here's what I did after installation to make Grafana actually usable in a team environment.

Step 1: Expose Grafana via Nginx

In production, I expose Grafana through our existing Nginx ingress on a /grafana path:

nginx.conf (relevant part)
location /grafana/ {
    proxy_pass http://grafana-nodeport-service:30080/;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
}
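
One gotcha: Grafana itself must know it is being served from a sub-path, or its assets and redirects break. In Helm values that looks roughly like this (a sketch; these map to standard grafana.ini server options, but verify against your chart version):

values.yaml - Grafana sub-path (sketch)
grafana:
  grafana.ini:
    server:
      root_url: "%(protocol)s://%(domain)s/grafana/"
      serve_from_sub_path: true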

Step 2: Create Users for Different Teams

Don't give everyone admin access. I created separate users:

User Setup
# Users I created:
- admin      (Admin)   → DevOps team
- devops     (Editor)  → DevOps engineers
- backend    (Viewer)  → Backend team
- qa         (Viewer)  → QA team

# Why?
# - Viewers can see dashboards but not modify
# - Editors can create dashboards but not change data sources
# - Admin is for infrastructure changes only

Step 3: Verify Data Source

First thing after login: Configuration → Data Sources → Prometheus → Test

If this test fails, nothing else works. The URL should be the internal service name (usually http://prometheus-kube-prometheus-prometheus:9090).

📈 Dashboards That Work (Not Just Look Pretty)

This is where most blogs fail. They tell you to "import dashboard ID 1860 and you're done." That dashboard has 47 panels. Which one do you look at when things are breaking?

Here's my approach: Every panel must answer a specific question. If you can't articulate the question, delete the panel.

The Three Questions Every Dashboard Must Answer

Dashboard Philosophy
1. Is something broken RIGHT NOW?
   → Real-time error rates, latency spikes, pod restarts

2. What is trending toward broken?
   → Resource saturation, disk filling up, memory creeping

3. What changed recently?
   → Deployments, config changes, traffic patterns
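
Each question maps to a concrete query. A few I keep on the top row (the namespace and mountpoint labels are my conventions; adjust to yours):

PromQL - One query per question
# 1. Broken right now? Pods that restarted in the last 15 minutes:
increase(kube_pod_container_status_restarts_total{namespace="production"}[15m]) > 0

# 2. Trending toward broken? Disks that will be full within 4 hours:
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0

# 3. What changed? Deployments whose spec changed in the last hour:
changes(kube_deployment_metadata_generation{namespace="production"}[1h]) > 0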

CPU: Usage vs Throttling (The Lie I Believed)

CPU Usage tells you how much CPU a container used. CPU Throttling tells you how often it was blocked from using more CPU because of limits.

PromQL - CPU Throttling Rate
# CPU Throttling percentage
# This is what actually matters, not CPU usage
sum(rate(container_cpu_cfs_throttled_periods_total{
  namespace="production"
}[5m])) by (pod)
/
sum(rate(container_cpu_cfs_periods_total{
  namespace="production"
}[5m])) by (pod)
* 100

# If this is > 25%, your pod is being starved.
# Usage might show 30%, but throttling at 80% = performance death

Bad conclusion people draw: "CPU is at 30%, we have plenty of headroom!"
Reality: Container is throttled 80% of the time because limit is too low.

Memory: Usage vs OOMKills

Same story. Memory usage being at 60% means nothing if you have OOMKills.

PromQL - Memory and OOMKills
# Memory usage as percentage of limit
# (container!="" skips pod-level cgroup rows; "> 0" skips
#  containers that have no memory limit set)
container_memory_working_set_bytes{namespace="production", container!=""}
/
(container_spec_memory_limit_bytes{namespace="production", container!=""} > 0)
* 100

# OOMKill events (the metric that actually wakes you up)
# Note: restarts_total has no "reason" label, so join against
# kube_pod_container_status_last_terminated_reason instead
increase(kube_pod_container_status_restarts_total{namespace="production"}[1h])
* on (namespace, pod, container)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}

Node Metrics vs Pod Metrics (Don't Mix Them)

I see this mistake constantly: putting node-level and pod-level metrics on the same dashboard without clear separation.

Separate Concerns
NODE METRICS (from node-exporter):
- node_cpu_seconds_total
- node_memory_MemAvailable_bytes
- node_filesystem_avail_bytes
→ "Is my infrastructure healthy?"

POD METRICS (from cAdvisor/kubelet):
- container_cpu_usage_seconds_total
- container_memory_working_set_bytes
- kube_pod_status_phase
→ "Is my application healthy?"

CLUSTER METRICS (from kube-state-metrics):
- kube_deployment_status_replicas_available
- kube_pod_container_status_restarts_total
→ "Is Kubernetes doing its job?"

πŸ” A Debugging Story: The Latency Spike

Let me walk you through an actual incident and how I used (and misused) dashboards.

The symptom: API latency jumped from 100ms to 2000ms at 2 PM.

First metric I checked (Wrong)

Node CPU. It was at 45%. "That's fine," I thought.

Second metric I checked (Also wrong)

Pod CPU usage. It was at 35%. "Definitely not CPU."

The metric that actually mattered

PromQL - What I should have checked first
# Container CPU throttling - THIS was the problem
rate(container_cpu_cfs_throttled_seconds_total{
  pod=~"api-.*"
}[5m])

# Result: 0.8 seconds throttled per second
# Translation: pod is blocked 80% of the time!

What happened: We had deployed a new feature that was CPU-intensive. The pod's CPU limit was 500m. Usage was "35%" of the node, but the container was hitting its limit constantly.

The fix: Increased CPU limit to 1000m. Latency dropped to 80ms.
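
In the deployment spec, the change was a few lines (an illustrative resources block; the CPU values are from this incident, the rest is placeholder):

YAML - The fix
resources:
  requests:
    cpu: 500m        # scheduler reservation unchanged
  limits:
    cpu: 1000m       # was 500m; raising the ceiling stopped the throttling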

🎯
Lesson: Always check throttling before concluding "it's not CPU." Container limits create a ceiling that doesn't show up in simple usage metrics.

🤷 What Prometheus Can't Tell You

Prometheus is not magic. Here are its blind spots:

  • Application-level context: It doesn't know if your 500 error is a bug or a bad request. It just counts them.
  • Why something happened: Metrics tell you what, not why. Logs tell you why.
  • Distributed traces: If request A causes slowdown in service B, Prometheus won't connect those dots. You need tracing.
  • What's missing: If a pod dies and stops reporting, you get gaps. Absence of data isn't flagged automatically.

The Monitoring Stack Reality
┌─────────────────────────────────────────────────┐
│              OBSERVABILITY REALITY              │
├─────────────────────────────────────────────────┤
│                                                 │
│  METRICS (Prometheus)                           │
│  └─ "What" is happening?                        │
│     └─ CPU at 90%, errors increased             │
│                                                 │
│  LOGS (Loki/ELK)                                │
│  └─ "Why" is it happening?                      │
│     └─ NullPointerException at line 47          │
│                                                 │
│  TRACES (Jaeger/Tempo)                          │
│  └─ "Where" in the system?                      │
│     └─ Latency from service A → B → C           │
│                                                 │
│  You need ALL THREE for the full picture.       │
│  Dashboards alone won't save you.               │
└─────────────────────────────────────────────────┘

💊 The Uncomfortable Truths

After running this setup in production for over a year, here's what I've learned:

1️⃣

Dashboards Don't Save You

They're for investigation, not detection. Alerts save you. Dashboards help you understand after you're alerted.

2️⃣

Fewer Panels, More Answers

A dashboard with 5 panels you understand beats 50 panels you ignore. Delete what you don't use.

3️⃣

Test Your Monitoring

Kill a pod on purpose. Does your dashboard show it? If not, fix that before production kills it for you.
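
A minimal fire drill looks like this (the namespace, label selector, and deployment name are placeholders):

Shell
# Pick one pod and kill it on purpose
kubectl -n production delete pod \
  "$(kubectl -n production get pods -l app=api -o name | head -n 1)"

# Your dashboard should show the dip within one scrape interval, e.g. in:
#   kube_deployment_status_replicas_available{deployment="api"}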

4️⃣

Document Your Panels

Every panel should have a description: what it shows, what's normal, what's concerning.

How to Know Your Monitoring is "Working"

Ask yourself these questions:

  • Can a new team member look at the dashboard and understand what's wrong within 60 seconds?
  • When the last incident happened, did you find the root cause using your dashboards?
  • Are there panels that nobody has looked at in 3 months? Delete them.
  • Do you trust your dashboards when they say "everything is fine"?

What to Add Next

This blog was about dashboards. Here's your monitoring maturity roadmap:

Monitoring Maturity Levels
Level 1: Metrics + Dashboards (You are here)
         ✅ Prometheus scraping
         ✅ Grafana visualizing
         ✅ Understanding what to look at

Level 2: Alerts
         → AlertManager configuration
         → PagerDuty/Slack integration
         → Runbooks for each alert

Level 3: SLOs and Error Budgets
         → Define what "good" looks like
         → Burn rate alerts
         → Data-driven reliability decisions

Level 4: Logs and Traces
         → Loki for logs
         → Tempo/Jaeger for distributed tracing
         → Correlation between metrics, logs, traces

Level 5: AIOps (maybe)
         → Anomaly detection
         → Automated remediation
         → Probably overkill for most teams

The One Thing I Want You to Remember

Monitoring doesn't end at installation. The setup is 20% of the work. The other 80% is understanding what the numbers mean, building dashboards that answer real questions, and testing that they work before you need them at 3 AM.

Next up: I'll write about alert queries—because dashboards are useless if nobody is looking at them when things break.
