📅 Updated April 2026 · ⏱ 12 min read · 🏷 Prometheus · Grafana · Observability · SRE · Monitoring
👨‍💻 master.devops
Practising DevOps Engineer with deep hands-on experience in Kubernetes, AWS, CI/CD, and SRE. Every guide is written from real production work.
Prometheus and Grafana are the standard observability stack for Kubernetes environments. I have
deployed and tuned Prometheus in production to monitor EKS clusters — writing
PromQL queries for SLO burn rate alerts, building Grafana dashboards for engineering teams, and
configuring Alertmanager for on-call routing. This guide covers everything from the data model
to production SLO alerting.
How Prometheus Works — The Pull Model
Prometheus uses a pull model — it scrapes metrics from targets on a schedule
(every 15-60 seconds is typical). This is fundamentally different from push-based systems like
StatsD or InfluxDB, where applications push metrics to the monitoring system. The pull model means:
Prometheus controls the scrape rate, a failed scrape marks the target as down (up == 0, so missing
targets are easy to alert on), and no agent needs to be installed in your application — just expose
a /metrics endpoint.
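A minimal scrape configuration looks like this — a sketch, with the job name and target address purely illustrative; in Kubernetes you would normally replace static_configs with kubernetes_sd_configs so Prometheus discovers pods automatically:
# prometheus.yml — minimal static scrape config
global:
  scrape_interval: 15s            # how often Prometheus pulls from every target
scrape_configs:
  - job_name: 'api'
    metrics_path: /metrics        # the endpoint each target must expose
    static_configs:
      - targets: ['api.production.svc.cluster.local:8080']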
Pull vs Push — the interview answer: Pull is easier to reason about (Prometheus knows
exactly what it is monitoring), makes target discovery more natural (Prometheus finds your pods via
Kubernetes Service Discovery), and avoids push storms where all services push simultaneously. Push works
better for ephemeral jobs (batch jobs, cron jobs) — use pushgateway for these.
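On the Prometheus side the Pushgateway is just another scrape target. A sketch, assuming a Pushgateway reachable at pushgateway:9091:
# Scrape the Pushgateway; honor_labels keeps the job/instance labels the batch job
# pushed instead of overwriting them with the Pushgateway's own
scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true
    static_configs:
      - targets: ['pushgateway:9091']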
Prometheus Data Model
Every metric in Prometheus is a time series identified by a metric name and a set of key-value labels.
Labels are what make Prometheus powerful — they allow you to slice and aggregate metrics by any dimension.
# Example: HTTP request counter with labels
http_requests_total{
method="GET",
path="/api/users",
status="200",
service="api",
namespace="production"
} 1847 1713000000000
# The same metric with different label combinations
http_requests_total{method="POST", path="/api/users", status="201", ...} 234
http_requests_total{method="GET", path="/api/users", status="500", ...} 12
Four Metric Types
Counter — Monotonically increasing value (never decreases). Request count, error count, bytes sent. Always use rate() or increase() to query counters — the raw value is meaningless without a time window.
Gauge — Value that can go up or down. Memory usage, queue depth, active connections, number of running pods. Query directly.
Histogram — Samples observations and counts them in configurable buckets. Used for latency and request size. Enables percentile calculations. Most important metric type for SLO work.
Summary — Similar to histogram but calculates percentiles client-side. Cannot be aggregated across instances — generally use Histogram instead.
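For context, a histogram on the /metrics endpoint is exposed as one cumulative series per le ("less than or equal") bucket boundary plus _sum and _count — the boundaries and values below are purely illustrative:
# Illustrative histogram exposition
http_request_duration_seconds_bucket{le="0.1"}   24054
http_request_duration_seconds_bucket{le="0.5"}   33444
http_request_duration_seconds_bucket{le="1"}     34000
http_request_duration_seconds_bucket{le="+Inf"}  34123
http_request_duration_seconds_sum                53423.2
http_request_duration_seconds_count              34123
histogram_quantile() operates on these _bucket series, which is why the le label must survive any aggregation.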
PromQL — Essential Queries
# Request rate (requests per second over last 5 minutes)
rate(http_requests_total{namespace="production"}[5m])
# Error rate (percentage of 5xx responses)
# sum() both sides first — otherwise label matching pairs each 5xx series with itself
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# p99 latency from histogram
histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket{
    namespace="production",
    service="api"
  }[5m])
)
# Top 10 memory-consuming pods
topk(10,
container_memory_working_set_bytes{namespace="production", container!=""}
)
# CPU utilisation per pod (% of the CPU request)
# aggregate both sides to (pod) so the division matches one-to-one
sum(rate(container_cpu_usage_seconds_total{namespace="production", container!=""}[5m])) by (pod)
/
sum(kube_pod_container_resource_requests{namespace="production", resource="cpu"}) by (pod) * 100
# Pods not ready in production
kube_pod_status_ready{namespace="production", condition="true"} == 0
# SLO burn rate alert query (1-hour window, 14x burn rate)
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
)
> (1 - 0.999) * 14   # 0.999 = 99.9% SLO, 14x = fast burn
Recording Rules — Performance Optimisation
Complex PromQL queries run on every panel load and every alert evaluation. For expensive queries
(wide range vectors, many series), use recording rules to pre-compute the result on a schedule — the
rule group's evaluation interval — and store it as a new, cheap-to-query time series. This dramatically
reduces query time for dashboards and ensures alerts evaluate quickly.
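A sketch of a recording-rule group — the names follow the level:metric:operation convention, and the label set is illustrative:
# recording-rules.yaml — pre-computed series for dashboards and alerts
groups:
  - name: api-recording-rules
    interval: 30s                                # how often this group is evaluated
    rules:
      - record: namespace:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (namespace)
      - record: namespace:http_errors:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (namespace)
            / sum(rate(http_requests_total[5m])) by (namespace)
Dashboards and alerts then query namespace:http_errors:ratio_rate5m directly instead of re-running the expensive expression.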
Alertmanager — Routing Alerts to the Right Team
# alertmanager.yaml — route alerts by team
global:
  slack_api_url: 'https://hooks.slack.com/services/...'

route:
  group_by: ['alertname', 'namespace']
  group_wait: 30s          # wait 30s to group related alerts
  group_interval: 5m
  repeat_interval: 4h      # resend unresolved alert every 4h
  receiver: 'slack-general'
  routes:
    - match:
        severity: critical
        namespace: production
      receiver: 'pagerduty-oncall'
    - match:
        team: platform
      receiver: 'slack-platform'

receivers:
  - name: 'slack-general'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: 'slack-platform'        # every receiver a route references must be defined (channel name illustrative)
    slack_configs:
      - channel: '#platform-alerts'
  - name: 'pagerduty-oncall'
    pagerduty_configs:
      - service_key: '$PD_SERVICE_KEY'

inhibit_rules:                    # suppress warning if critical already firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'namespace']
Grafana Dashboards
Grafana queries Prometheus (and other data sources like Loki, Tempo, CloudWatch) and renders
visualisations. In production, store dashboards as JSON in Git and provision them via ConfigMaps
in Kubernetes — this is "dashboard-as-code" and means dashboards are version-controlled and
reproducible.
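A sketch of that provisioning pattern, assuming the common Grafana sidecar setup (as shipped with kube-prometheus-stack, for example) that watches ConfigMaps carrying a grafana_dashboard label — names and the dashboard JSON are illustrative:
# dashboard-configmap.yaml — dashboard JSON provisioned from Git via a labelled ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: api-golden-signals
  namespace: monitoring
  labels:
    grafana_dashboard: "1"        # the sidecar loads any ConfigMap carrying this label
data:
  api-golden-signals.json: |
    { "title": "API Golden Signals", "panels": [] }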
The Four Golden Signals (Google SRE Book)
Latency — Time to serve a request. Track p50, p95, p99 separately — averages hide tail latency problems.
Traffic — Volume of requests. Requests per second, queries per second, messages per second.
Errors — Rate of failed requests. Distinguish between client errors (4xx) and server errors (5xx).
Saturation — How "full" the service is. CPU %, memory %, queue depth. Saturation predicts problems before they cause latency increases.
Interview Q&A
Q1: Counter vs Gauge vs Histogram — when to use each?
Counter: for things that only increase — request count, error count, bytes transferred. Always query with rate() over a time window. Gauge: for values that fluctuate up and down — current memory usage, active connections, queue depth, number of running pods. Query directly. Histogram: for measuring distributions — request latency, request size. Allows calculating any percentile (p50, p95, p99) via histogram_quantile(). Use histogram for any SLO that involves latency. Never use Summary when you need to aggregate across multiple instances — only Histogram supports cross-instance aggregation.
Q2: How do you implement SLO burn rate alerting?
Burn rate = current error rate / (1 - SLO target). For a 99.9% SLO the error budget is 0.1% of requests per month; a burn rate of 1 means you are consuming that budget at exactly the sustainable rate, and a burn rate of 14 will exhaust the whole month's budget in roughly two days. The standard approach (from Google's SRE Workbook) uses multiple windows: a fast-burn alert over a 1h window (at ~14x burn) catches outages quickly, and a slow-burn alert over a 6h window (at ~6x) catches gradual leaks. Each long window is paired with a short confirmation window (5m and 30m respectively), and the alert fires only when BOTH windows of a pair exceed the threshold — this stops brief spikes from paging anyone.
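A sketch of the fast-burn rule as a Prometheus alerting rule, assuming the SLI is the 5xx ratio on http_requests_total and a 99.9% SLO (1h is the long window, 5m the short confirmation window):
# slo-burn-rate.yaml — fire only while BOTH the long and short windows burn at >14x
groups:
  - name: api-slo-burn
    rules:
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (14 * (1 - 0.999))
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14 * (1 - 0.999))
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning at >14x — on track to exhaust the monthly budget in ~2 days"
The slow-burn rule is the same shape with 6h/30m windows and a lower multiplier (typically 6x), routed at warning severity instead of paging.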
Q3: What is Grafana Loki and how does it differ from Elasticsearch?
Loki is a log aggregation system designed to work with Prometheus — it uses the same label model and ships with Grafana. Unlike Elasticsearch, Loki does NOT index log content — it only indexes labels (pod name, namespace, app). Full-text search uses streaming grep over compressed log chunks. This makes Loki dramatically cheaper (10x less storage, no Lucene indexing overhead) but slower for ad-hoc full-text search. Use Loki when: you already have Prometheus/Grafana, cost is a concern, your team queries logs by service/pod rather than arbitrary full-text search. Use Elasticsearch when: you need powerful full-text search, complex aggregations, or compliance requirements for searchable audit logs.
Prometheus Data Model — Quick Reference
Prometheus stores data as time series — streams of timestamped float64 values identified by a metric name
and key-value labels. Understanding this model is essential before writing PromQL, because it determines
what queries are possible.
Type — What it measures — Example
Counter — Monotonically increasing; never goes down (except on restart) — http_requests_total, errors_total
Gauge — Current value; goes up or down — memory_bytes, active_connections
Histogram — Distribution across configurable buckets — request latency (p50/p95/p99)
Summary — Pre-calculated quantiles, computed client-side — GC pause time, response size
Essential PromQL Queries
These are the queries every SRE and DevOps engineer needs to know — they cover the four golden signals
and appear regularly in interviews and on-call runbooks.
# Request rate (per second, 5-min window)
rate(http_requests_total[5m])
# Error rate as percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100
# p99 latency — the #1 interview PromQL question
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
# CPU usage per pod in Kubernetes
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod)
# Memory usage MB per pod
container_memory_working_set_bytes{namespace="production"} / 1024 / 1024
# Node disk usage percentage
(node_filesystem_size_bytes - node_filesystem_free_bytes)
/ node_filesystem_size_bytes * 100
Key rule: Always use rate() on counters for dashboards and alerts — it averages over the window, so graphs and alert conditions stay stable.
Use irate() only for fine-grained graphs of fast-moving counters (it takes just the last two samples); avoid it in alerts, where its spikiness causes flapping.
A Counter only ever increases (or resets to zero on restart). Use it for things that accumulate: total requests, total errors, total bytes. Always query counters with rate() or increase(). A Gauge represents a current value that can go up or down — memory usage, active connections, queue depth. Query gauges directly. The key rule: if you are counting events, use Counter. If you are measuring current state, use Gauge.
Q: How do you write a p99 latency PromQL query?
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])). You must use rate() on the _bucket metric before passing to histogram_quantile. The le label must be included in any aggregation: sum(...) by (le, service). Omitting le in the by() clause breaks the quantile calculation — this is the most common PromQL mistake in interviews.
Q: What are recording rules and why do you need them?
Recording rules pre-compute expensive PromQL queries and store results as new time series. Without them, a dashboard with 20 panels each running complex histogram_quantile queries across millions of series will time out. With recording rules, the query runs once on schedule (e.g., every 30s) and the result is a simple gauge that dashboards can query instantly. Naming convention: level:metric:operations, e.g., job:http_requests_total:rate5m.
Q: What are the four golden signals?
Defined by Google's SRE book: Latency — how long requests take (distinguish successful vs error latency); Traffic — how much demand the system handles (requests/sec, queries/sec); Errors — rate of requests that fail (explicit 5xx, implicit wrong content, policy violations); Saturation — how "full" the service is (CPU utilisation, memory pressure, queue depth). If you can only instrument four things, make it these four.
Master DevOps is a community of practising DevOps and SRE engineers sharing real production knowledge —
from Kubernetes internals to CI/CD pipeline design. All content is written from hands-on experience,
not copied from documentation. Our mission: make senior-level DevOps knowledge free for everyone.