📅 Updated April 2026 · ⏱ 12 min read · 🏷 Prometheus · Grafana · Observability · SRE · Monitoring
👨‍💻 master.devops
Practising DevOps Engineer with deep hands-on experience in Kubernetes, AWS, CI/CD, and SRE. Every guide is written from real production work.
Prometheus and Grafana are the standard observability stack for Kubernetes environments. I have
deployed and tuned Prometheus in production to monitor EKS clusters — writing
PromQL queries for SLO burn rate alerts, building Grafana dashboards for engineering teams, and
configuring Alertmanager for on-call routing. This guide covers everything from the data model
to production SLO alerting.
How Prometheus Works — The Pull Model
Prometheus uses a pull model — it scrapes metrics from targets on a schedule
(every 15-60 seconds is typical). This is fundamentally different from push-based systems like
StatsD or InfluxDB, where applications push metrics to the monitoring system. The pull model means:
Prometheus controls the scrape rate, a failed scrape marks the target as down (up == 0, so missing
targets are easy to alert on), and no agent needs to be installed in your application — just expose
a /metrics endpoint.
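A minimal scrape configuration looks like this — a sketch, with the job name and target address purely illustrative; in Kubernetes you would normally replace static_configs with kubernetes_sd_configs so Prometheus discovers pods automatically:
# prometheus.yml — minimal static scrape config
global:
  scrape_interval: 15s            # how often Prometheus pulls from every target
scrape_configs:
  - job_name: 'api'
    metrics_path: /metrics        # the endpoint each target must expose
    static_configs:
      - targets: ['api.production.svc.cluster.local:8080']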
Pull vs Push — the interview answer: Pull is easier to reason about (Prometheus knows
exactly what it is monitoring), makes target discovery more natural (Prometheus finds your pods via
Kubernetes Service Discovery), and avoids push storms where all services push simultaneously. Push works
better for ephemeral jobs (batch jobs, cron jobs) — use pushgateway for these.
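On the Prometheus side the Pushgateway is just another scrape target. A sketch, assuming a Pushgateway reachable at pushgateway:9091:
# Scrape the Pushgateway; honor_labels keeps the job/instance labels the batch job
# pushed instead of overwriting them with the Pushgateway's own
scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true
    static_configs:
      - targets: ['pushgateway:9091']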
Prometheus Data Model
Every metric in Prometheus is a time series identified by a metric name and a set of key-value labels.
Labels are what make Prometheus powerful — they allow you to slice and aggregate metrics by any dimension.
# Example: HTTP request counter with labels
http_requests_total{
method="GET",
path="/api/users",
status="200",
service="api",
namespace="production"
} 1847 1713000000000
# The same metric with different label combinations
http_requests_total{method="POST", path="/api/users", status="201", ...} 234
http_requests_total{method="GET", path="/api/users", status="500", ...} 12
Four Metric Types
Counter — Monotonically increasing value (never decreases). Request count, error count, bytes sent. Always use rate() or increase() to query counters — the raw value is meaningless without a time window.
Gauge — Value that can go up or down. Memory usage, queue depth, active connections, number of running pods. Query directly.
Histogram — Samples observations and counts them in configurable buckets. Used for latency and request size. Enables percentile calculations. Most important metric type for SLO work.
Summary — Similar to histogram but calculates percentiles client-side. Cannot be aggregated across instances — generally use Histogram instead.
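For context, a histogram on the /metrics endpoint is exposed as one cumulative series per le ("less than or equal") bucket boundary plus _sum and _count — the boundaries and values below are purely illustrative:
# Illustrative histogram exposition
http_request_duration_seconds_bucket{le="0.1"}   24054
http_request_duration_seconds_bucket{le="0.5"}   33444
http_request_duration_seconds_bucket{le="1"}     34000
http_request_duration_seconds_bucket{le="+Inf"}  34123
http_request_duration_seconds_sum                53423.2
http_request_duration_seconds_count              34123
histogram_quantile() operates on these _bucket series, which is why the le label must survive any aggregation.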
PromQL — Essential Queries
# Request rate (requests per second over last 5 minutes)
rate(http_requests_total{namespace="production"}[5m])
# Error rate (percentage of 5xx responses)
# sum() both sides first — otherwise label matching pairs each 5xx series with itself
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# p99 latency from histogram
histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket{
    namespace="production",
    service="api"
  }[5m])
)
# Top 10 memory-consuming pods
topk(10,
container_memory_working_set_bytes{namespace="production", container!=""}
)
# CPU utilisation per pod (% of the CPU request)
# aggregate both sides to (pod) so the division matches one-to-one
sum(rate(container_cpu_usage_seconds_total{namespace="production", container!=""}[5m])) by (pod)
/
sum(kube_pod_container_resource_requests{namespace="production", resource="cpu"}) by (pod) * 100
# Pods not ready in production
kube_pod_status_ready{namespace="production", condition="true"} == 0
# SLO burn rate alert query (1-hour window, 14x burn rate)
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
)
> (1 - 0.999) * 14   # 0.999 = 99.9% SLO, 14x = fast burn
Recording Rules — Performance Optimisation
Complex PromQL queries run on every panel load and every alert evaluation. For expensive queries
(wide range vectors, many series), use recording rules to pre-compute the result on a schedule — the
rule group's evaluation interval — and store it as a new, cheap-to-query time series. This dramatically
reduces query time for dashboards and ensures alerts evaluate quickly.
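A sketch of a recording-rule group — the names follow the level:metric:operation convention, and the label set is illustrative:
# recording-rules.yaml — pre-computed series for dashboards and alerts
groups:
  - name: api-recording-rules
    interval: 30s                                # how often this group is evaluated
    rules:
      - record: namespace:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (namespace)
      - record: namespace:http_errors:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (namespace)
            / sum(rate(http_requests_total[5m])) by (namespace)
Dashboards and alerts then query namespace:http_errors:ratio_rate5m directly instead of re-running the expensive expression.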
Alertmanager — Routing Alerts to the Right Team
# alertmanager.yaml — route alerts by team
global:
  slack_api_url: 'https://hooks.slack.com/services/...'

route:
  group_by: ['alertname', 'namespace']
  group_wait: 30s          # wait 30s to group related alerts
  group_interval: 5m
  repeat_interval: 4h      # resend unresolved alert every 4h
  receiver: 'slack-general'
  routes:
    - match:
        severity: critical
        namespace: production
      receiver: 'pagerduty-oncall'
    - match:
        team: platform
      receiver: 'slack-platform'

receivers:
  - name: 'slack-general'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: 'slack-platform'        # every receiver a route references must be defined (channel name illustrative)
    slack_configs:
      - channel: '#platform-alerts'
  - name: 'pagerduty-oncall'
    pagerduty_configs:
      - service_key: '$PD_SERVICE_KEY'

inhibit_rules:                    # suppress warning if critical already firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'namespace']
Grafana Dashboards
Grafana queries Prometheus (and other data sources like Loki, Tempo, CloudWatch) and renders
visualisations. In production, store dashboards as JSON in Git and provision them via ConfigMaps
in Kubernetes — this is "dashboard-as-code" and means dashboards are version-controlled and
reproducible.
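A sketch of that provisioning pattern, assuming the common Grafana sidecar setup (as shipped with kube-prometheus-stack, for example) that watches ConfigMaps carrying a grafana_dashboard label — names and the dashboard JSON are illustrative:
# dashboard-configmap.yaml — dashboard JSON provisioned from Git via a labelled ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: api-golden-signals
  namespace: monitoring
  labels:
    grafana_dashboard: "1"        # the sidecar loads any ConfigMap carrying this label
data:
  api-golden-signals.json: |
    { "title": "API Golden Signals", "panels": [] }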
The Four Golden Signals (Google SRE Book)
Latency — Time to serve a request. Track p50, p95, p99 separately — averages hide tail latency problems.
Traffic — Volume of requests. Requests per second, queries per second, messages per second.
Errors — Rate of failed requests. Distinguish between client errors (4xx) and server errors (5xx).
Saturation — How "full" the service is. CPU %, memory %, queue depth. Saturation predicts problems before they cause latency increases.
Interview Q&A
Q1: Counter vs Gauge vs Histogram — when to use each?
Counter: for things that only increase — request count, error count, bytes transferred. Always query with rate() over a time window. Gauge: for values that fluctuate up and down — current memory usage, active connections, queue depth, number of running pods. Query directly. Histogram: for measuring distributions — request latency, request size. Allows calculating any percentile (p50, p95, p99) via histogram_quantile(). Use histogram for any SLO that involves latency. Never use Summary when you need to aggregate across multiple instances — only Histogram supports cross-instance aggregation.
Q2: How do you implement SLO burn rate alerting?
Burn rate = current error rate / (1 - SLO target). For a 99.9% SLO the error budget is 0.1% of requests per month; a burn rate of 1 means you are consuming that budget at exactly the sustainable rate, and a burn rate of 14 will exhaust the whole month's budget in roughly two days. The standard approach (from Google's SRE Workbook) uses multiple windows: a fast-burn alert over a 1h window (at ~14x burn) catches outages quickly, and a slow-burn alert over a 6h window (at ~6x) catches gradual leaks. Each long window is paired with a short confirmation window (5m and 30m respectively), and the alert fires only when BOTH windows of a pair exceed the threshold — this stops brief spikes from paging anyone.
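A sketch of the fast-burn rule as a Prometheus alerting rule, assuming the SLI is the 5xx ratio on http_requests_total and a 99.9% SLO (1h is the long window, 5m the short confirmation window):
# slo-burn-rate.yaml — fire only while BOTH the long and short windows burn at >14x
groups:
  - name: api-slo-burn
    rules:
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (14 * (1 - 0.999))
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14 * (1 - 0.999))
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning at >14x — on track to exhaust the monthly budget in ~2 days"
The slow-burn rule is the same shape with 6h/30m windows and a lower multiplier (typically 6x), routed at warning severity instead of paging.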
Q3: What is Grafana Loki and how does it differ from Elasticsearch?
Loki is a log aggregation system designed to work with Prometheus — it uses the same label model and ships with Grafana. Unlike Elasticsearch, Loki does NOT index log content — it only indexes labels (pod name, namespace, app). Full-text search uses streaming grep over compressed log chunks. This makes Loki dramatically cheaper (10x less storage, no Lucene indexing overhead) but slower for ad-hoc full-text search. Use Loki when: you already have Prometheus/Grafana, cost is a concern, your team queries logs by service/pod rather than arbitrary full-text search. Use Elasticsearch when: you need powerful full-text search, complex aggregations, or compliance requirements for searchable audit logs.
Prometheus Data Model — Quick Reference
Prometheus stores data as time series — streams of timestamped float64 values identified by a metric name
and key-value labels. Understanding this model is essential before writing PromQL, because it determines
what queries are possible.
Type — What it measures — Example
Counter — Monotonically increasing; never goes down (except on restart) — http_requests_total, errors_total
Gauge — Current value; goes up or down — memory_bytes, active_connections
Histogram — Distribution across configurable buckets — request latency (p50/p95/p99)
Summary — Pre-calculated quantiles, computed client-side — GC pause time, response size
Essential PromQL Queries
These are the queries every SRE and DevOps engineer needs to know — they cover the four golden signals
and appear regularly in interviews and on-call runbooks.
# Request rate (per second, 5-min window)
rate(http_requests_total[5m])
# Error rate as percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100
# p99 latency — the #1 interview PromQL question
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
# CPU usage per pod in Kubernetes
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod)
# Memory usage MB per pod
container_memory_working_set_bytes{namespace="production"} / 1024 / 1024
# Node disk usage percentage
(node_filesystem_size_bytes - node_filesystem_free_bytes)
/ node_filesystem_size_bytes * 100
Key rule: Always use rate() on counters for dashboards and alerts — it averages over the window, so graphs and alert conditions stay stable.
Use irate() only for fine-grained graphs of fast-moving counters (it takes just the last two samples); avoid it in alerts, where its spikiness causes flapping.
A Counter only ever increases (or resets to zero on restart). Use it for things that accumulate: total requests, total errors, total bytes. Always query counters with rate() or increase(). A Gauge represents a current value that can go up or down — memory usage, active connections, queue depth. Query gauges directly. The key rule: if you are counting events, use Counter. If you are measuring current state, use Gauge.
Q: How do you write a p99 latency PromQL query?
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])). You must use rate() on the _bucket metric before passing to histogram_quantile. The le label must be included in any aggregation: sum(...) by (le, service). Omitting le in the by() clause breaks the quantile calculation — this is the most common PromQL mistake in interviews.
Q: What are recording rules and why do you need them?
Recording rules pre-compute expensive PromQL queries and store results as new time series. Without them, a dashboard with 20 panels each running complex histogram_quantile queries across millions of series will time out. With recording rules, the query runs once on schedule (e.g., every 30s) and the result is a simple gauge that dashboards can query instantly. Naming convention: level:metric:operations, e.g., job:http_requests_total:rate5m.
Q: What are the four golden signals?
Defined by Google's SRE book: Latency — how long requests take (distinguish successful vs error latency); Traffic — how much demand the system handles (requests/sec, queries/sec); Errors — rate of requests that fail (explicit 5xx, implicit wrong content, policy violations); Saturation — how "full" the service is (CPU utilisation, memory pressure, queue depth). If you can only instrument four things, make it these four.
Master DevOps is a community of practising DevOps and SRE engineers sharing real production knowledge —
from Kubernetes internals to CI/CD pipeline design. All content is written from hands-on experience,
not copied from documentation. Our mission: make senior-level DevOps knowledge free for everyone.