Metrics & Prometheus
Prometheus architecture, metric types, scrape configuration, ServiceMonitor/PodMonitor CRDs, PromQL from basic to advanced, recording rules, cardinality management, kube-prometheus-stack, and long-term storage with Thanos and Mimir.
Coverage Checklist
- Prometheus pull model vs push model
- Prometheus TSDB architecture
- 4 metric types: Counter/Gauge/Histogram/Summary
- Histogram vs Summary tradeoffs
- scrape_configs: static, service discovery, relabeling
- Prometheus Operator: ServiceMonitor / PodMonitor / PrometheusRule
- kube-prometheus-stack Helm install and configuration
- Key K8s metrics: cAdvisor, kube-state-metrics, node_exporter, API server
- PromQL: selectors, range vectors, functions, aggregation
- PromQL: rate/irate, histogram_quantile, absent, topk
- Recording rules: naming convention, evaluation interval
- Cardinality: explosion causes, TSDB cardinality analysis
- Label dropping via relabeling
- Thanos: sidecar, store gateway, querier, compactor
- Mimir: horizontal scalability, multi-tenant
- Remote write configuration
- Federation for multi-cluster
- 5 metrics, 4 alerts, 5 runbooks, 8 best practices
Prometheus Architecture
Prometheus uses a pull-based model — it scrapes metrics from targets at a configurable interval. Targets expose metrics on an HTTP endpoint (typically /metrics) in the Prometheus exposition format.
Prometheus Server
├── Retrieval: scrapes /metrics from targets on schedule
│ └── Service Discovery (K8s API, static, DNS, EC2, ...)
│
├── TSDB (Time Series Database)
│ ├── In-memory: active head block (2h)
│ ├── WAL (Write-Ahead Log): crash recovery
│ └── On-disk: persistent blocks (compressed, indexed)
│
├── Rules Engine: recording rules + alerting rules (evaluated every evaluation_interval; 1m by default, 15s in the config below)
│
└── HTTP API: PromQL queries (Grafana, tooling, manual)
│ Alerts
▼
Alertmanager: dedup, grouping, routing, silencing → PagerDuty/Slack
Scrape targets:
App pods node_exporter kube-state-metrics cAdvisor etcd API server
Prometheus pulls metrics from targets, so the monitoring system controls the scrape rate and targets cannot overwhelm Prometheus with data. The downside: short-lived jobs (batch, CronJob) may exit before they are ever scraped. Use the Pushgateway for these cases: the job pushes its metrics before it exits, and Prometheus then scrapes the gateway.
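To make the pull model concrete, here is a minimal sketch of a scrape target written with only the Python standard library. The metric values, the `REQUESTS_TOTAL` dictionary, the `serve` helper, and port 8000 are all hypothetical; a real service would normally use an official Prometheus client library instead of rendering the text format by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter state, keyed by (method, status).
REQUESTS_TOTAL = {("GET", "200"): 12345, ("POST", "500"): 7}

def render_metrics() -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(REQUESTS_TOTAL.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port: int = 8000) -> None:
    """Block forever, answering scrapes on the given (arbitrary) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; Prometheus then decides when to fetch `/metrics`, and the target never initiates a connection.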
TSDB Storage Format
| Component | Duration | Description |
|---|---|---|
| Head block | ~2h (in memory + WAL) | Active write target; fast appends; WAL ensures durability |
| Persistent blocks | 2h chunks, compacted to larger | Immutable, compressed, indexed; default retention 15d |
| Compaction | Background process | Merges small blocks into larger; deduplicates; improves query speed |
| Retention | Default 15d (configurable) | --storage.tsdb.retention.time=30d or --storage.tsdb.retention.size=50GB |
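For sizing the retention settings above, the Prometheus documentation gives a rough capacity formula: retention time in seconds, times ingested samples per second, times bytes per sample (typically 1 to 2 bytes after compression). The sketch below applies it; the workload figures (1M active series, 15s scrape interval) are hypothetical, and the result is an estimate, not a guarantee.

```python
def estimate_disk_bytes(retention_seconds: float,
                        samples_per_second: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Rough TSDB disk estimate: retention * ingest rate * bytes per sample."""
    return retention_seconds * samples_per_second * bytes_per_sample

# Hypothetical workload: 1,000,000 active series scraped every 15 seconds.
samples_per_second = 1_000_000 / 15
retention_seconds = 15 * 24 * 3600          # 15 days

needed = estimate_disk_bytes(retention_seconds, samples_per_second)
print(f"Estimated disk for 15d retention: {needed / 1e9:.1f} GB")
```

With these assumed numbers the estimate is about 173 GB, which suggests that a size-based retention limit should be set with some headroom below the volume size.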
Metric Types
| Type | Behavior | Use Case | PromQL Function |
|---|---|---|---|
| Counter | Monotonically increasing; resets to 0 on restart | Total requests, errors, bytes sent | rate(), increase() |
| Gauge | Can go up or down; current value | Memory usage, queue depth, temperature, active connections | Direct use, delta(), deriv() |
| Histogram | Samples observations into configurable buckets; exposes _bucket, _count, _sum | Request latency, response sizes — percentile queries at query time | histogram_quantile() |
| Summary | Calculates quantiles client-side; exposes pre-calculated quantiles, _count, _sum | Pre-calculated percentiles where aggregation across instances is not needed | Direct quantile labels |
Histogram vs Summary
| Property | Histogram | Summary |
|---|---|---|
| Quantile calculation | Server-side (PromQL at query time) | Client-side (in instrumented code) |
| Aggregation across instances | Yes — sum buckets then quantile | No — quantiles cannot be summed |
| Configurable accuracy | Depends on bucket boundaries | Configurable quantile error bound |
| Query performance | CPU-intensive at query time | Cheap at query time (pre-calculated) |
| Recommendation | Preferred for Kubernetes workloads | Legacy; avoid for new instrumentation |
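The aggregation row is the decisive one, and a small numeric sketch shows why. The latencies for the two replicas below are made up, and `quantile` and `bucket_counts` are illustrative helpers, not Prometheus APIs. Averaging per-instance quantiles (all a Summary allows) gives a misleading answer, while per-instance bucket counts sum to exactly the buckets of the pooled data.

```python
import math

def quantile(samples, q):
    """Nearest-rank quantile of raw samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def bucket_counts(samples, bounds):
    """Cumulative counts per upper bound, like a histogram's _bucket series."""
    return [sum(1 for s in samples if s <= b) for b in bounds]

# Hypothetical latencies (seconds) from two replicas of one service.
pod_a = [0.01] * 99 + [0.02]    # fast replica, 100 requests
pod_b = [0.5] * 10              # slow replica, 10 requests
bounds = [0.05, 0.1, 1.0]

# Summary-style: each pod reports its own p90, and we can only average them.
avg_of_p90 = (quantile(pod_a, 0.9) + quantile(pod_b, 0.9)) / 2
# Ground truth: p90 over all requests from both pods.
true_p90 = quantile(pod_a + pod_b, 0.9)

# Histogram-style: bucket counts add up across pods without loss.
summed = [a + b for a, b in zip(bucket_counts(pod_a, bounds),
                                bucket_counts(pod_b, bounds))]
pooled = bucket_counts(pod_a + pod_b, bounds)

print("average of per-pod p90:", avg_of_p90)
print("true p90 over all requests:", true_p90)
print("summed buckets equal pooled buckets:", summed == pooled)
```

The averaged value is more than ten times the true p90 here because the slow replica handled only a small share of the traffic, which the average ignores.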
Native histograms use a sparse exponential bucket schema — they automatically adapt bucket boundaries, eliminate the need to pre-configure buckets, and provide better accuracy with lower cardinality. Enabled with --enable-feature=native-histograms. The OTel SDK and Prometheus client libraries support native histograms in recent versions.
# Prometheus exposition format examples
# Counter
http_requests_total{method="GET",status="200"} 12345
# Gauge
process_resident_memory_bytes 45678901
# Histogram (automatic _bucket, _count, _sum)
http_request_duration_seconds_bucket{le="0.005"} 100
http_request_duration_seconds_bucket{le="0.01"} 200
http_request_duration_seconds_bucket{le="0.025"} 350
http_request_duration_seconds_bucket{le="0.05"} 400
http_request_duration_seconds_bucket{le="+Inf"} 500
http_request_duration_seconds_count 500
http_request_duration_seconds_sum 12.345
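As a worked example, the bucket values above can be turned into percentiles by hand. The function below is a simplified model of what histogram_quantile() does for classic histograms (linear interpolation inside the bucket that contains the target rank); it is a sketch for intuition, not the Prometheus implementation, and it ignores edge cases such as negative bounds.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]              # the +Inf bucket holds the total count
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            if count == lower_count:
                return upper_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return float("nan")

# The http_request_duration_seconds buckets from the exposition example.
buckets = [(0.005, 100), (0.01, 200), (0.025, 350), (0.05, 400), (math.inf, 500)]

p50 = histogram_quantile(0.50, buckets)
p99 = histogram_quantile(0.99, buckets)
print(f"p50 ~ {p50:.4f}s, p99 ~ {p99:.4f}s")
```

The p50 lands at 0.015s, interpolated inside the 0.01 to 0.025 bucket. The p99 is reported as 0.05s only because 100 of the 500 observations fall beyond the largest finite bucket, which is a sign that the bucket layout is too narrow for this latency distribution.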
Scrape Configuration
# prometheus.yaml: scrape configuration
global:
  scrape_interval: 15s       # Default scrape interval
  evaluation_interval: 15s   # Rules evaluation interval
  scrape_timeout: 10s
scrape_configs:
  # Scrape Kubernetes API server
  - job_name: kubernetes-apiservers
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [default]
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
  # Scrape all pods with annotation prometheus.io/scrape: "true"
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
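The relabeling rules in the kubernetes-pods job can be hard to read at first, so the following sketch simulates them in Python. It is a simplified model that assumes only the keep, drop, and replace actions, the default ";" separator, and the default replacement of the first capture group; the pod label sets are hypothetical.

```python
import re

def relabel(labels, rules):
    """Apply keep/drop/replace rules in order; return None if the target is dropped."""
    labels = dict(labels)
    for rule in rules:
        # Source label values are joined with ";" (the default separator).
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        # Relabeling regexes are anchored at both ends, hence fullmatch.
        match = re.fullmatch(rule.get("regex", "(.*)"), value)
        action = rule.get("action", "replace")
        if action == "keep" and match is None:
            return None
        if action == "drop" and match is not None:
            return None
        if action == "replace" and match is not None:
            # Simplification: always use the first capture group ($1).
            labels[rule["target_label"]] = match.group(1)
    return labels

# The four rules from the kubernetes-pods job.
rules = [
    {"source_labels": ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"],
     "action": "keep", "regex": "true"},
    {"source_labels": ["__meta_kubernetes_pod_annotation_prometheus_io_path"],
     "regex": "(.+)", "target_label": "__metrics_path__"},
    {"source_labels": ["__meta_kubernetes_namespace"], "target_label": "namespace"},
    {"source_labels": ["__meta_kubernetes_pod_name"], "target_label": "pod"},
]

# Hypothetical discovered pods.
annotated = {
    "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true",
    "__meta_kubernetes_pod_annotation_prometheus_io_path": "/custom/metrics",
    "__meta_kubernetes_namespace": "production",
    "__meta_kubernetes_pod_name": "myapp-abc",
}
unannotated = {
    "__meta_kubernetes_namespace": "production",
    "__meta_kubernetes_pod_name": "other-xyz",
}

kept = relabel(annotated, rules)
dropped = relabel(unannotated, rules)
print(kept["__metrics_path__"], kept["namespace"], kept["pod"])
print("unannotated pod dropped:", dropped is None)
```

The pod without the scrape annotation is dropped by the first rule because an empty string does not match the anchored regex `true`.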
Prometheus Operator
The Prometheus Operator extends Kubernetes with CRDs that let you declare Prometheus instances and scrape targets as Kubernetes objects, enabling GitOps-compatible monitoring configuration.
Core CRDs
| CRD | Purpose | Key Fields |
|---|---|---|
| Prometheus | Declares a Prometheus instance | replicas, retention, storage, ruleSelector, serviceMonitorSelector |
| Alertmanager | Declares an Alertmanager instance | replicas, configSecret |
| ServiceMonitor | Selects Services to scrape via label selectors | selector, endpoints (port, path, interval), namespaceSelector |
| PodMonitor | Selects Pods to scrape directly | selector, podMetricsEndpoints, namespaceSelector |
| PrometheusRule | Recording rules and alerting rules | groups (name, interval, rules) |
| ScrapeConfig | Raw scrape config for non-K8s targets | staticConfigs, httpSDConfigs |
ServiceMonitor
# ServiceMonitor: tells Prometheus Operator which Services to scrape
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: production
  labels:
    team: backend            # Must match Prometheus.spec.serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: myapp             # Selects Services with this label
  namespaceSelector:
    matchNames: [production]
  endpoints:
    - port: metrics          # Named port on the Service
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
      scheme: http
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node
      metricRelabelings:
        # Drop high-cardinality metrics at ingest time
        - sourceLabels: [__name__]
          regex: go_gc_.*
          action: drop
PrometheusRule
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-rules
  namespace: production
  labels:
    team: backend
spec:
  groups:
    - name: myapp.recording
      interval: 30s
      rules:
        - record: job:http_requests:rate5m
          expr: sum by (job) (rate(http_requests_total[5m]))
    - name: myapp.alerts
      rules:
        - alert: HighErrorRate
          # Aggregate both sides by job; dividing the raw series would match each
          # 5xx series only against itself and always yield 1
          expr: |
            sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
              /
            sum by (job) (rate(http_requests_total[5m])) > 0.01
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High error rate on {{ $labels.job }}"
            description: "Error rate is {{ $value | humanizePercentage }}"
kube-prometheus-stack
The kube-prometheus-stack Helm chart is the standard way to deploy the full Prometheus monitoring stack with pre-built dashboards, recording rules, and alert rules for Kubernetes.
# Add Helm repo and install
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--values kps-values.yaml
# kps-values.yaml: production configuration
prometheus:
  prometheusSpec:
    retention: 15d
    retentionSize: 50GB
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: [ReadWriteOnce]
          resources:
            requests:
              storage: 100Gi
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        memory: 4Gi
    # Select all ServiceMonitors, PodMonitors, and rules, not only those labeled by this Helm release
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
    # TSDB tuning
    tsdb:
      outOfOrderTimeWindow: 10m   # Accept late-arriving samples
    walCompression: true          # Field of prometheusSpec, not of tsdb
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: standard
          resources:
            requests:
              storage: 10Gi
grafana:
  adminPassword: changeme-use-secret
  persistence:
    enabled: true
    size: 10Gi
nodeExporter:
  enabled: true
kubeStateMetrics:
  enabled: true
Key Kubernetes Metrics
Container Resources (cAdvisor)
| Metric | Type | Description |
|---|---|---|
| container_cpu_usage_seconds_total | Counter | CPU seconds consumed by container |
| container_memory_working_set_bytes | Gauge | Working set memory (OOM killer uses this) |
| container_memory_rss | Gauge | Resident set size |
| container_network_receive_bytes_total | Counter | Network bytes received by container |
| container_fs_reads_bytes_total | Counter | Filesystem bytes read |
| container_oom_events_total | Counter | OOM kill events per container |
Kubernetes State (kube-state-metrics)
| Metric | Type | Description |
|---|---|---|
| kube_pod_status_phase | Gauge | Pod phase (Running/Pending/Failed); one 0/1 series per value of the phase label |
| kube_pod_container_status_restarts_total | Counter | Container restart count |
| kube_deployment_status_replicas_unavailable | Gauge | Unavailable replicas in a Deployment |
| kube_node_status_condition | Gauge | Node conditions (Ready, MemoryPressure, DiskPressure) |
| kube_horizontalpodautoscaler_status_current_replicas | Gauge | Current HPA replica count |
| kube_persistentvolumeclaim_status_phase | Gauge | PVC binding status |
API Server
| Metric | Description |
|---|---|
| apiserver_request_duration_seconds | API request latency histogram by verb, resource, scope |
| apiserver_request_total | Total API requests by verb, resource, code |
| apiserver_current_inflight_requests | In-flight requests (mutating and read-only) |
| apiserver_admission_webhook_admission_duration_seconds | Admission webhook latency |
| apiserver_storage_objects | Count of objects stored in etcd by resource type (replaces etcd_object_counts, which is deprecated in recent Kubernetes releases) |
PromQL
PromQL (Prometheus Query Language) is a functional query language for time series data. It operates on selectors, range vectors, and a rich set of built-in functions.
Selectors and Labels
# Instant vector: current value
http_requests_total
# Label equality filter
http_requests_total{job="myapp", status="200"}
# Label regex match
http_requests_total{status=~"5.."} # 5xx errors
http_requests_total{status!~"2.."} # Not 2xx
# Range vector: samples over a time window
http_requests_total[5m] # Last 5 minutes of samples
# Offset: look back in time
http_requests_total offset 1h # Value 1 hour ago
# Subquery: range query over instant expression
rate(http_requests_total[5m])[1h:5m] # rate over 5m, sampled every 5m for 1h
Essential Functions
# rate(): per-second rate of increase of a counter (avg over window)
# Use for alerting — smooths out spikes
rate(http_requests_total[5m])
# irate(): instant rate — uses last two samples only
# More responsive to spikes; noisy for alerts
irate(http_requests_total[5m])
# increase(): total increase over window (rate * window seconds)
increase(http_requests_total[1h]) # Approx requests in last hour
# histogram_quantile(): calculate percentile from histogram
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# sum by: aggregate across labels
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m]))
# without: aggregate, dropping specified labels
sum without (instance, pod) (rate(http_requests_total[5m]))
# topk: top N time series by value
topk(10, rate(http_requests_total[5m]))
# absent: alert when metric is missing (target down)
absent(up{job="myapp"} == 1)
# predict_linear: forecast gauge value
predict_linear(node_filesystem_avail_bytes[1h], 4*3600) < 0
# Will disk be full in 4 hours?
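The difference between rate(), irate(), and increase() is easier to see on concrete samples. The sketch below is a simplified model: it handles counter resets the way Prometheus does (a drop in value is treated as a restart from zero), but it does not extrapolate to the window boundaries as the real functions do. The sample data is hypothetical.

```python
def increase(samples):
    """Total counter increase across (timestamp, value) samples, handling resets."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter reset; count the new value as growth since zero.
        total += cur - prev if cur >= prev else cur
    return total

def rate(samples):
    """Average per-second increase over the whole window."""
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed

def irate(samples):
    """Per-second increase using only the last two samples."""
    return rate(samples[-2:])

# Hypothetical scrapes every 15s; the process restarts between t=30 and t=45.
samples = [(0, 100), (15, 130), (30, 160), (45, 20), (60, 50)]

print("increase:", increase(samples))
print("rate:", rate(samples))
print("irate:", irate(samples))
```

Despite the restart, the computed increase stays positive (110 over 60 seconds), which is why counters should always be wrapped in rate() or increase() rather than graphed raw.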
Common Production Queries
# Container CPU utilization % (vs request)
sum by (namespace, pod, container) (
rate(container_cpu_usage_seconds_total{container!=""}[5m])
) /
sum by (namespace, pod, container) (
kube_pod_container_resource_requests{resource="cpu", container!=""}
) * 100
# Container memory utilization % (vs limit)
sum by (namespace, pod, container) (
container_memory_working_set_bytes{container!=""}
) /
sum by (namespace, pod, container) (
kube_pod_container_resource_limits{resource="memory", container!=""}
) * 100
# Error rate for HTTP services
sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (job) (rate(http_requests_total[5m]))
# p99 latency across all instances of a service
histogram_quantile(0.99,
sum by (job, le) (
rate(http_request_duration_seconds_bucket{job="myapp"}[5m])
)
)
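# Pods NOT running (pending/failed/unknown)
# kube_pod_status_phase exposes a 0/1 series for every phase, so sum the values rather than counting series
sum by (namespace) (
kube_pod_status_phase{phase!="Running", phase!="Succeeded"}
) > 0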
# Nodes with memory pressure
kube_node_status_condition{condition="MemoryPressure", status="true"} == 1
Recording Rules
Recording rules pre-compute expensive PromQL expressions and store the result as a new time series. This dramatically speeds up dashboard load times and reduces query-time CPU load.
Recording rule metrics follow the convention: level:metric:operations. For example, job:http_requests:rate5m means: aggregated by job, from http_requests, using rate over 5m. This convention makes it easy to identify pre-computed metrics from raw ones.
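The convention can be checked mechanically, for example in a CI lint step. The following is a minimal sketch; `RuleName` and `parse_rule_name` are hypothetical helpers, not part of any Prometheus tooling.

```python
from typing import NamedTuple

class RuleName(NamedTuple):
    level: str        # aggregation level, e.g. "job"
    metric: str       # source metric, e.g. "http_requests"
    operations: str   # operations applied, e.g. "rate5m"

def parse_rule_name(name: str) -> RuleName:
    """Split a recording rule name into level:metric:operations."""
    parts = name.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"{name!r} does not follow level:metric:operations")
    return RuleName(*parts)

parsed = parse_rule_name("job:http_requests:rate5m")
print(parsed)

# A raw metric name has no colons and is rejected.
try:
    parse_rule_name("http_requests_total")
    raw_rejected = False
except ValueError:
    raw_rejected = True
print("raw metric rejected:", raw_rejected)
```

Because raw metric names should not contain colons, the presence of two colons is a quick signal that a series comes from a recording rule.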
# PrometheusRule with recording rules
groups:
  - name: kubernetes-resources.recording
    interval: 30s
    rules:
      # Container CPU usage in cores (pre-computed for dashboards)
      - record: namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
        expr: |
          sum by (namespace, pod, container) (
            irate(container_cpu_usage_seconds_total{job="kubelet", container!=""}[5m])
          )
      # Request rate per job and status (the level prefix lists the labels that are kept)
      - record: job_status:http_requests:rate5m
        expr: |
          sum by (job, status) (rate(http_requests_total[5m]))
      # Error ratio, used by SLO burn rate alert
      - record: job:http_request_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
      # p99 latency recording rule
      - record: job:http_request_duration_seconds:histogram_quantile99_rate5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )
Cardinality Management
Cardinality is the number of unique time series in Prometheus. It is the primary driver of Prometheus memory usage and query performance. High cardinality is the most common operational problem.
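A back-of-the-envelope calculation shows how quickly series counts grow. The worst case for one metric is the product of the number of distinct values of each label, and a classic histogram multiplies that again by its bucket count plus the _sum and _count series. The label counts below are hypothetical.

```python
from math import prod

def worst_case_series(label_values: dict) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_values.values())

def histogram_series(label_values: dict, buckets: int) -> int:
    """Classic histogram: one series per bucket (including +Inf) plus _sum and _count."""
    return worst_case_series(label_values) * (buckets + 2)

# Hypothetical label cardinalities for http_requests_total.
base = {"method": 5, "status": 10, "path": 50, "pod": 20}
with_user_id = {**base, "user_id": 10_000}

print("counter, bounded labels:", worst_case_series(base))
print("counter, plus user_id:", worst_case_series(with_user_id))
print("histogram with 12 buckets:", histogram_series(base, buckets=12))
```

Adding a single unbounded label such as user_id turns 50,000 potential series into 500 million, which is why per-user or per-request identifiers belong in traces or logs rather than in metric labels.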
Cardinality Analysis
# Check total time series count
prometheus_tsdb_head_series
# Top 10 jobs by series count (PromQL)
topk(10,
count by (job) ({__name__=~".+"})
)
# Top 10 metrics by series count
topk(10,
count by (__name__) ({__name__=~".+"})
)
# Series count for a specific metric
count(http_requests_total)
# cardinality API endpoint (Prometheus 2.14+)
curl http://prometheus:9090/api/v1/status/tsdb | jq .data.headStats
# Detailed cardinality analysis
# Quote the URL so the shell does not interpret "?"; entries have "name" and "value" fields
curl -s 'http://prometheus:9090/api/v1/status/tsdb?limit=20' | \
jq '.data.labelValueCountByLabelName | sort_by(.value) | reverse | .[0:10]'
Reducing Cardinality
# Drop high-cardinality labels via metric relabeling in ServiceMonitor
metricRelabelings:
  # Drop pod-template-hash label (changes every deployment, creates new series)
  - action: labeldrop
    regex: pod_template_hash
  # Drop entire metrics that are not needed
  - sourceLabels: [__name__]
    regex: go_gc_duration_seconds.*
    action: drop
  # Replace high-cardinality URL paths with bucketed versions
  - sourceLabels: [path]
    regex: /api/users/[0-9]+
    targetLabel: path
    replacement: /api/users/:id
# Prometheus per-scrape cardinality limit (prevent a single job from exploding)
scrape_configs:
  - job_name: myapp
    sample_limit: 10000   # Max samples per scrape (the whole scrape fails if exceeded)
    label_limit: 64       # Max labels per sample
    label_name_length_limit: 128
    label_value_length_limit: 256
Long-Term Storage: Thanos & Mimir
Prometheus stores data locally with a default retention of 15 days. For long-term storage, multi-cluster aggregation, and high availability, use Thanos or Grafana Mimir.
Thanos Architecture
Prometheus (per cluster)
└── Thanos Sidecar
├── Exposes StoreAPI for real-time data (last 2h)
└── Uploads TSDB blocks to object storage every 2h
│
Object Storage (S3/GCS/Azure Blob)
└── Long-term TSDB blocks (years)
│
Thanos Store Gateway
└── StoreAPI for historical data
Thanos Querier
├── Aggregates StoreAPI from Sidecar + Store Gateway
├── Deduplication across HA Prometheus pairs
└── Exposes PromQL API (Grafana data source)
Thanos Compactor (single instance)
├── Compacts and downsamples old blocks in object storage
└── Retention enforcement
Thanos Ruler (optional)
└── Evaluates recording/alert rules across global view
# Thanos Sidecar: prometheus.yaml addition
# Add to kube-prometheus-stack values:
prometheus:
  prometheusSpec:
    thanos:
      image: quay.io/thanos/thanos:v0.35.0
      objectStorageConfig:
        existingSecret:
          name: thanos-objstore-secret
          key: objstore.yml
---
# objstore.yml (stored as Secret)
type: S3
config:
  bucket: my-thanos-bucket
  endpoint: s3.amazonaws.com
  region: us-east-1
  aws_sdk_auth: true   # Use IRSA
Grafana Mimir
Grafana Mimir is a horizontally scalable, multi-tenant Prometheus-compatible backend. It is the successor to Cortex and replaces the Thanos Sidecar + Store Gateway pattern with a unified write path.
| Feature | Thanos | Mimir |
|---|---|---|
| Architecture | Sidecar to existing Prometheus | Replaces Prometheus write path; Prometheus only for scraping |
| Scalability | Scales via object storage; query latency depends on Querier | Horizontally scaled Ingester/Querier/Compactor |
| Multi-tenancy | Limited (namespace-based) | First-class; X-Scope-OrgID header |
| Operations | Multiple components to operate | More complex; monolithic mode available for small clusters |
| Use case | Multiple clusters, HA, long-term storage | SaaS-scale, multi-tenant, high write throughput |
Remote Write & Federation
# Remote write: ship metrics to external backend
remote_write:
  - url: https://mimir.example.com/api/v1/push
    headers:
      X-Scope-OrgID: production
    remote_timeout: 30s
    queue_config:
      capacity: 10000
      max_shards: 30
      min_shards: 5
      max_samples_per_send: 5000
      batch_send_deadline: 5s
    write_relabel_configs:
      # Only send selected metrics to reduce egress
      # (native Prometheus config uses snake_case source_labels, unlike the Operator CRDs)
      - source_labels: [__name__]
        regex: job:.*|kube_.*|node_.*
        action: keep
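To get a feel for the queue_config values, it helps to compute how much the in-memory queue can actually hold. The sketch below assumes that capacity is the number of samples buffered per shard, and uses a hypothetical ingest rate of 50,000 samples per second; the helper names are illustrative.

```python
def buffered_samples(shards: int, capacity_per_shard: int) -> int:
    """Samples the remote-write queue can hold in memory across all shards."""
    return shards * capacity_per_shard

def seconds_of_buffer(shards: int, capacity_per_shard: int,
                      samples_per_second: float) -> float:
    """How long the in-memory queue absorbs a stalled endpoint at a given ingest rate."""
    return buffered_samples(shards, capacity_per_shard) / samples_per_second

# Values from the queue_config above, at the maximum shard count.
max_buffer = buffered_samples(shards=30, capacity_per_shard=10_000)
# Hypothetical ingest rate.
absorb = seconds_of_buffer(30, 10_000, samples_per_second=50_000)

print("max buffered samples:", max_buffer)
print("seconds of buffer at 50k samples/s:", absorb)
```

A few seconds of buffering is all the queue itself provides under these assumptions; longer outages are covered by Prometheus reading from the WAL once the endpoint recovers, so the queue settings mostly tune throughput rather than outage tolerance.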
# Federation: aggregate subset of metrics from child Prometheus
scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      match[]:
        - '{__name__=~"job:.*"}'   # Only recording rules
        - up
    static_configs:
      - targets:
          - prometheus-cluster-a:9090
          - prometheus-cluster-b:9090
Metrics, Alerts & Runbooks
Key Prometheus Self-Metrics
| Metric | Description |
|---|---|
| prometheus_tsdb_head_series | Current active time series count (primary cardinality indicator) |
| prometheus_tsdb_storage_blocks_bytes | Disk usage of persistent TSDB blocks |
| prometheus_target_scrape_pool_sync_total | Scrape pool sync operations |
| up | 1 if target is up, 0 if down; the most important scrape health metric |
| prometheus_rule_evaluation_duration_seconds | Rule evaluation latency (p99 > 10s = rules falling behind) |
Alerts
# Alert: Prometheus target down
- alert: PrometheusTargetDown
  expr: up == 0
  for: 5m
  annotations:
    summary: "Scrape target {{ $labels.job }} / {{ $labels.instance }} is down"
# Alert: Prometheus cardinality too high
- alert: PrometheusHighCardinality
  expr: prometheus_tsdb_head_series > 2000000
  for: 15m
  annotations:
    summary: "Prometheus has >2M active series, memory pressure risk"
# Alert: Prometheus rule evaluation slow
- alert: PrometheusRuleEvaluationSlow
  expr: prometheus_rule_evaluation_duration_seconds{quantile="0.9"} > 1
  for: 10m
  annotations:
    summary: "Rule evaluation p90 > 1s, rules may miss evaluation windows"
# Alert: Remote write queue filling
- alert: PrometheusRemoteWriteQueueFull
  # Total queue capacity (shards x per-shard capacity) minus pending samples
  expr: |
    prometheus_remote_storage_shards * prometheus_remote_storage_shard_capacity
      - prometheus_remote_storage_samples_pending < 100
  for: 5m
  annotations:
    summary: "Remote write queue almost full, samples may be dropped"
Runbooks
Prometheus OOM / High Memory
1. Check prometheus_tsdb_head_series — if >2M investigate cardinality
2. Find top series producers: topk(10, count by (job) ({__name__=~".+"}))
3. Add metricRelabelings to drop unused metrics or labels
4. Increase memory limit as temporary measure
Target Scrape Failures
1. Check target health: Prometheus UI → Targets
2. Verify pod is running: kubectl get pod
3. Check ServiceMonitor selector matches Service labels
4. Verify metrics port is correct name in Service spec
Rules Evaluation Falling Behind
1. Check prometheus_rule_evaluation_duration_seconds
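2. Identify slow rule groups: compare prometheus_rule_group_last_duration_seconds with prometheus_rule_group_interval_seconds; check prometheus_rule_evaluation_failures_total for rules that fail outright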
3. Simplify expensive recording rules or increase evaluation interval
4. Add Prometheus resources (CPU) if queries are compute-bound
Remote Write Dropping Samples
1. Check prometheus_remote_storage_samples_failed_total (named prometheus_remote_storage_failed_samples_total in older releases)
2. Verify remote endpoint is reachable
3. Increase queue_config.max_shards or capacity
4. Add write_relabel_configs to reduce remote write volume
TSDB Disk Full
1. Check prometheus_tsdb_storage_blocks_bytes
2. Reduce retention: --storage.tsdb.retention.size
3. Expand the Prometheus PVC (the StorageClass must allow volume expansion)
4. Enable remote write to offload historical data to Thanos/Mimir
Best Practices
Use kube-prometheus-stack as the baseline
Don't build your own scrape configuration from scratch. The kube-prometheus-stack includes pre-built ServiceMonitors, recording rules, and 20+ Grafana dashboards covering Kubernetes core components out of the box.
Never use high-cardinality labels in metrics
User IDs, request IDs, session tokens, or arbitrary string values as label values will explode cardinality. Instrument at the service boundary, not at the per-request level. Use traces for per-request detail.
Write recording rules for all dashboard and alert expressions
If a PromQL expression appears in a Grafana dashboard or alert rule, it should have a corresponding recording rule. This pre-computes the result, reduces query time from seconds to milliseconds, and reduces Prometheus CPU load.
Use Histogram over Summary for new instrumentation
Histograms allow aggregation across pod instances using histogram_quantile(). Summaries calculate quantiles in-process and cannot be aggregated — useless for Kubernetes where you always have multiple replicas.
Set sample_limit per scrape job
Without a sample_limit, a single misbehaving application can push millions of time series into Prometheus and cause an OOM. Set sample_limit: 10000 for most jobs; higher only when justified.
Deploy Thanos or Mimir for production clusters
A single Prometheus with only local retention is a single point of failure: any outage leaves a gap in the data, and history is lost if the volume is lost or the pod runs without persistent storage. Use Thanos Sidecar with object storage, or remote write to Mimir, for HA and long-term retention.
Alert on absence of expected metrics
Use absent() to alert when a time series disappears entirely — this catches target-down scenarios that won't fire rate-based alerts (because there are no samples to compute a rate from).
Use rate() over irate() for alerts
irate() uses only the last two samples and is highly sensitive to single-sample spikes, causing flapping alerts. Use rate() with a 5–10 minute window for alerting. Reserve irate() for exploratory dashboards where responsiveness is more important than stability.