Metrics & Prometheus

Coverage Checklist

Prometheus pull model vs push model
Prometheus TSDB architecture
4 metric types: Counter/Gauge/Histogram/Summary
Histogram vs Summary tradeoffs
scrape_configs: static, service discovery, relabeling
Prometheus Operator: ServiceMonitor / PodMonitor / PrometheusRule
kube-prometheus-stack Helm install and configuration
Key K8s metrics: cAdvisor, kube-state-metrics, node_exporter, API server
PromQL: selectors, range vectors, functions, aggregation
PromQL: rate/irate, histogram_quantile, absent, topk
Recording rules: naming convention, evaluation interval
Cardinality: explosion causes, TSDB cardinality analysis
Label dropping via relabeling
Thanos: sidecar, store gateway, querier, compactor
Mimir: horizontal scalability, multi-tenant
Remote write configuration
Federation for multi-cluster
5 metrics, 4 alerts, 5 runbooks, 8 best practices

Prometheus Architecture

Prometheus uses a pull-based model — it scrapes metrics from targets at a configurable interval. Targets expose metrics on an HTTP endpoint (typically /metrics) in the Prometheus exposition format.

── Prometheus pull architecture ──────────────────────────────────────

Prometheus Server
├── Retrieval: scrapes /metrics from targets on schedule
│ └── Service Discovery (K8s API, static, DNS, EC2, ...)
│
├── TSDB (Time Series Database)
│ ├── In-memory: active head block (2h)
│ ├── WAL (Write-Ahead Log): crash recovery
│ └── On-disk: persistent blocks (compressed, indexed)
│
├── Rules Engine: recording rules + alerting rules (evaluated every 15s)
│
└── HTTP API: PromQL queries (Grafana, tooling, manual)

│ Alerts
▼
Alertmanager: dedup, grouping, routing, silencing → PagerDuty/Slack

Scrape targets:
App pods node_exporter kube-state-metrics cAdvisor etcd API server
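
To make the pull model concrete, here is a minimal sketch of what a scrape target looks like from the application side, using only the Python standard library. The metric name `demo_http_requests_total`, the port, and the module-level counter are assumptions for illustration; a real service would normally use a Prometheus client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 0  # hypothetical in-process counter


def render_metrics(count: int) -> bytes:
    """Render one counter in the Prometheus text exposition format."""
    return (
        "# HELP demo_http_requests_total Requests served by this demo target.\n"
        "# TYPE demo_http_requests_total counter\n"
        f"demo_http_requests_total {count}\n"
    ).encode()


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global REQUEST_COUNT
        if self.path != "/metrics":
            self.send_error(404)
            return
        REQUEST_COUNT += 1
        body = render_metrics(REQUEST_COUNT)
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever; Prometheus, not the app, decides when GET /metrics happens."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

The target never initiates a connection to Prometheus. It only answers scrapes, which is why the scrape rate stays under the monitoring system's control.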

Pull vs Push

Prometheus pulls metrics from targets, so the monitoring system controls the scrape rate and targets cannot overwhelm Prometheus with data. The downside is that short-lived jobs (batch jobs, CronJobs) may exit before they are ever scraped. For these cases, use the Pushgateway: the job pushes its metrics before it exits, and Prometheus scrapes the gateway on its normal schedule.
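
As a rough illustration of the push step, the sketch below builds the HTTP request a batch job could send to a Pushgateway before exiting. The gateway address, job name, and metric names are hypothetical, and the request is constructed but not sent so the example runs offline.

```python
import urllib.request


def build_push_request(gateway: str, job: str, metrics: dict) -> urllib.request.Request:
    """Build (but do not send) a PUT that replaces all metrics in the job's group."""
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    return urllib.request.Request(
        url=f"{gateway}/metrics/job/{job}",
        data=body.encode(),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


req = build_push_request(
    "http://pushgateway.example:9091",  # hypothetical address
    "nightly_backup",                   # hypothetical job name
    {
        "backup_last_success_timestamp_seconds": 1700000000,
        "backup_duration_seconds": 42.5,
    },
)
# urllib.request.urlopen(req) would send it just before the job exits.
```

Pushed values stay on the gateway until they are overwritten or deleted, so a timestamp metric like the one above is usually more useful to alert on than a bare success flag.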

TSDB Storage Format

Component	Duration	Description
Head block	~2h (in memory + WAL)	Active write target; fast appends; WAL ensures durability
Persistent blocks	2h chunks, compacted to larger	Immutable, compressed, indexed; default retention 15d
Compaction	Background process	Merges small blocks into larger; deduplicates; improves query speed
Retention	Default 15d (configurable)	`--storage.tsdb.retention.time=30d` or `--storage.tsdb.retention.size=50GB`
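
For capacity planning, the Prometheus storage documentation gives a rule of thumb: disk needed is roughly retention time in seconds, times ingested samples per second, times bytes per sample (typically around 1 to 2 bytes after compression). The sketch below applies it to an assumed workload; the series count and the 2-byte figure are illustrative inputs, not measurements.

```python
def tsdb_disk_bytes(retention_days: float,
                    samples_per_second: float,
                    bytes_per_sample: float = 2.0) -> float:
    """Rough disk estimate: retention seconds x ingest rate x bytes per sample."""
    return retention_days * 86_400 * samples_per_second * bytes_per_sample


# Assumed workload: 1,000,000 active series scraped every 15 seconds.
samples_per_second = 1_000_000 / 15
estimate = tsdb_disk_bytes(retention_days=15, samples_per_second=samples_per_second)
print(f"{estimate / 1e9:.1f} GB")
```

The estimate excludes the WAL and temporary compaction overhead, so leave headroom when sizing the volume or setting a size-based retention limit.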

Metric Types

Type	Behavior	Use Case	PromQL Function
Counter	Monotonically increasing; resets to 0 on restart	Total requests, errors, bytes sent	`rate()`, `increase()`
Gauge	Can go up or down; current value	Memory usage, queue depth, temperature, active connections	Direct use, `delta()`, `deriv()`
Histogram	Samples observations into configurable buckets; exposes _bucket, _count, _sum	Request latency, response sizes — percentile queries at query time	`histogram_quantile()`
Summary	Calculates quantiles client-side; exposes pre-calculated quantiles, _count, _sum	Pre-calculated percentiles where aggregation across instances is not needed	Direct quantile labels

Histogram vs Summary

Property	Histogram	Summary
Quantile calculation	Server-side (PromQL at query time)	Client-side (in instrumented code)
Aggregation across instances	Yes — sum buckets then quantile	No — quantiles cannot be summed
Configurable accuracy	Depends on bucket boundaries	Configurable quantile error bound
Query performance	CPU-intensive at query time	Cheap at query time (pre-calculated)
Recommendation	Preferred for Kubernetes workloads	Legacy; avoid for new instrumentation

Native Histograms (Prometheus 2.40+)

Native histograms use a sparse exponential bucket schema — they automatically adapt bucket boundaries, eliminate the need to pre-configure buckets, and provide better accuracy with lower cardinality. Enabled with --enable-feature=native-histograms. The OTel SDK and Prometheus client libraries support native histograms in recent versions.

# Prometheus exposition format examples

# Counter
http_requests_total{method="GET",status="200"} 12345

# Gauge
process_resident_memory_bytes 45678901

# Histogram (automatic _bucket, _count, _sum)
http_request_duration_seconds_bucket{le="0.005"} 100
http_request_duration_seconds_bucket{le="0.01"}  200
http_request_duration_seconds_bucket{le="0.025"} 350
http_request_duration_seconds_bucket{le="0.05"}  400
http_request_duration_seconds_bucket{le="+Inf"}  500
http_request_duration_seconds_count              500
http_request_duration_seconds_sum                12.345
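
To show how the cumulative buckets above turn into a percentile, here is a simplified Python sketch of the linear interpolation that `histogram_quantile()` performs on classic histograms. It assumes sorted buckets ending in `+Inf` and a lower bound of 0 for the first bucket; it is not the Prometheus implementation.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Assumes observations are spread uniformly inside each bucket.
    """
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Cannot interpolate into +Inf: report the highest finite bound.
                return lower_bound
            if count == lower_count:
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# The bucket values from the exposition example above.
buckets = [(0.005, 100), (0.01, 200), (0.025, 350), (0.05, 400), (math.inf, 500)]
p50 = histogram_quantile(0.50, buckets)  # falls inside the 0.01 to 0.025 bucket
p99 = histogram_quantile(0.99, buckets)  # falls inside the +Inf bucket
```

With these buckets the p99 lands in the `+Inf` bucket, so the estimate is capped at 0.05. This is why the bucket boundaries should bracket the latencies you care about.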

Scrape Configuration

# prometheus.yaml — scrape configuration
global:
  scrape_interval: 15s         # Default scrape interval
  evaluation_interval: 15s     # Rules evaluation interval
  scrape_timeout: 10s

scrape_configs:

# Scrape Kubernetes API server
- job_name: kubernetes-apiservers
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names: [default]
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    action: keep
    regex: default;kubernetes;https

# Scrape all pods with annotation prometheus.io/scrape: "true"
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: pod

Prometheus Operator

The Prometheus Operator extends Kubernetes with CRDs that let you declare Prometheus instances and scrape targets as Kubernetes objects, enabling GitOps-compatible monitoring configuration.

Core CRDs

CRD	Purpose	Key Fields
`Prometheus`	Declares a Prometheus instance	replicas, retention, storage, ruleSelector, serviceMonitorSelector
`Alertmanager`	Declares an Alertmanager instance	replicas, configSecret
`ServiceMonitor`	Selects Services to scrape via label selectors	selector, endpoints (port, path, interval), namespaceSelector
`PodMonitor`	Selects Pods to scrape directly	selector, podMetricsEndpoints, namespaceSelector
`PrometheusRule`	Recording rules and alerting rules	groups (name, interval, rules)
`ScrapeConfig`	Raw scrape config for non-K8s targets	staticConfigs, httpSDConfigs

ServiceMonitor

# ServiceMonitor: tells Prometheus Operator which Services to scrape
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: production
  labels:
    team: backend    # Must match Prometheus.spec.serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: myapp      # Selects Services with this label
  namespaceSelector:
    matchNames: [production]
  endpoints:
  - port: metrics          # Named port on the Service
    path: /metrics
    interval: 30s
    scrapeTimeout: 10s
    scheme: http
    relabelings:
    - sourceLabels: [__meta_kubernetes_pod_node_name]
      targetLabel: node
    metricRelabelings:
    # Drop high-cardinality metrics at ingest time
    - sourceLabels: [__name__]
      regex: go_gc_.*
      action: drop

PrometheusRule

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-rules
  namespace: production
  labels:
    team: backend
spec:
  groups:
  - name: myapp.recording
    interval: 30s
    rules:
    - record: job:http_requests:rate5m
      expr: sum by (job) (rate(http_requests_total[5m]))
  - name: myapp.alerts
    rules:
    - alert: HighErrorRate
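      # Aggregate both sides by job; without it, each status series is divided by itself
      expr: |
        sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (job) (rate(http_requests_total[5m])) > 0.01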
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High error rate on {{ $labels.job }}"
        description: "Error rate is {{ $value | humanizePercentage }}"

kube-prometheus-stack

The kube-prometheus-stack Helm chart is the standard way to deploy the full Prometheus monitoring stack with pre-built dashboards, recording rules, and alert rules for Kubernetes.

# Add Helm repo and install
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values kps-values.yaml

# kps-values.yaml — production configuration
prometheus:
  prometheusSpec:
    retention: 15d
    retentionSize: 50GB
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: [ReadWriteOnce]
          resources:
            requests:
              storage: 100Gi
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        memory: 4Gi
    # Select all ServiceMonitors, PodMonitors, and rules, not only those labeled with this Helm release
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
    # TSDB ingestion and WAL tuning
    tsdb:
      outOfOrderTimeWindow: 10m   # Accept late-arriving samples
    walCompression: true

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: standard
          resources:
            requests:
              storage: 10Gi

grafana:
  adminPassword: changeme-use-secret
  persistence:
    enabled: true
    size: 10Gi

nodeExporter:
  enabled: true

kubeStateMetrics:
  enabled: true

Key Kubernetes Metrics

Container Resources (cAdvisor)

Metric	Type	Description
`container_cpu_usage_seconds_total`	Counter	CPU seconds consumed by container
`container_memory_working_set_bytes`	Gauge	Working set memory (used by the kubelet for eviction decisions; the usual value to compare against the memory limit)
`container_memory_rss`	Gauge	Resident set size
`container_network_receive_bytes_total`	Counter	Network bytes received by container
`container_fs_reads_bytes_total`	Counter	Filesystem bytes read
`container_oom_events_total`	Counter	OOM kill events per container

Kubernetes State (kube-state-metrics)

Metric	Type	Description
`kube_pod_status_phase`	Gauge	Pod phase (Running/Pending/Failed) — labels: phase
`kube_pod_container_status_restarts_total`	Counter	Container restart count
`kube_deployment_status_replicas_unavailable`	Gauge	Unavailable replicas in a Deployment
`kube_node_status_condition`	Gauge	Node conditions (Ready, MemoryPressure, DiskPressure)
`kube_horizontalpodautoscaler_status_current_replicas`	Gauge	Current HPA replica count
`kube_persistentvolumeclaim_status_phase`	Gauge	PVC binding status

API Server

Metric	Description
`apiserver_request_duration_seconds`	API request latency histogram by verb, resource, scope
`apiserver_request_total`	Total API requests by verb, resource, code
`apiserver_current_inflight_requests`	In-flight requests (mutating and read-only)
`apiserver_admission_webhook_admission_duration_seconds`	Admission webhook latency
`apiserver_storage_objects`	Count of objects stored in etcd by resource type (replaces the deprecated `etcd_object_counts`)

PromQL

PromQL (Prometheus Query Language) is a functional query language for time series data. It operates on selectors, range vectors, and a rich set of built-in functions.

Selectors and Labels

# Instant vector: current value
http_requests_total

# Label equality filter
http_requests_total{job="myapp", status="200"}

# Label regex match
http_requests_total{status=~"5.."}        # 5xx errors
http_requests_total{status!~"2.."}        # Not 2xx

# Range vector: samples over a time window
http_requests_total[5m]                   # Last 5 minutes of samples

# Offset: look back in time
http_requests_total offset 1h             # Value 1 hour ago

# Subquery: range query over instant expression
rate(http_requests_total[5m])[1h:5m]      # rate over 5m, sampled every 5m for 1h

Essential Functions

# rate(): per-second rate of increase of a counter (avg over window)
# Use for alerting — smooths out spikes
rate(http_requests_total[5m])

# irate(): instant rate — uses last two samples only
# More responsive to spikes; noisy for alerts
irate(http_requests_total[5m])

# increase(): total increase over window (rate * window seconds)
increase(http_requests_total[1h])    # Approx requests in last hour

# histogram_quantile(): calculate percentile from histogram
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# sum by: aggregate across labels
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m]))

# without: aggregate, dropping specified labels
sum without (instance, pod) (rate(http_requests_total[5m]))

# topk: top N time series by value
topk(10, rate(http_requests_total[5m]))

# absent: alert when metric is missing (target down)
absent(up{job="myapp"} == 1)

# predict_linear: forecast gauge value
predict_linear(node_filesystem_avail_bytes[1h], 4*3600) < 0
# Will disk be full in 4 hours?
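
The difference between `rate()` and `irate()`, and the way both tolerate counter resets, can be sketched in a few lines of Python. This is a deliberate simplification: real Prometheus also extrapolates to the edges of the range window, which is omitted here, and the sample data is made up.

```python
def counter_increase(samples: list) -> float:
    """Sum the increases between consecutive (timestamp, value) samples.

    A drop in value is treated as a counter reset, so the post-reset value
    counts as the increase for that step.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def simple_rate(samples: list) -> float:
    """Average per-second increase across the whole window."""
    return counter_increase(samples) / (samples[-1][0] - samples[0][0])


def simple_irate(samples: list) -> float:
    """Per-second increase using only the last two samples."""
    return simple_rate(samples[-2:])


# One sample every 15s; the counter resets at t=45 and a burst follows.
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 400)]
avg = simple_rate(samples)    # smoothed over the full minute
last = simple_irate(samples)  # dominated by the final burst
```

The averaged value is much lower than the instant value for the same data, which is why `irate()` tends to produce noisy alerts.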

Common Production Queries

# Container CPU utilization % (vs request)
sum by (namespace, pod, container) (
  rate(container_cpu_usage_seconds_total{container!=""}[5m])
) /
sum by (namespace, pod, container) (
  kube_pod_container_resource_requests{resource="cpu", container!=""}
) * 100

# Container memory utilization % (vs limit)
sum by (namespace, pod, container) (
  container_memory_working_set_bytes{container!=""}
) /
sum by (namespace, pod, container) (
  kube_pod_container_resource_limits{resource="memory", container!=""}
) * 100

# Error rate for HTTP services
sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (job) (rate(http_requests_total[5m]))

# p99 latency across all instances of a service
histogram_quantile(0.99,
  sum by (job, le) (
    rate(http_request_duration_seconds_bucket{job="myapp"}[5m])
  )
)

# Pods NOT running (pending/failed/unknown)
# kube_pod_status_phase exposes one series per phase with value 0 or 1, so keep only == 1
count by (namespace) (
  kube_pod_status_phase{phase!="Running", phase!="Succeeded"} == 1
)

# Nodes with memory pressure
kube_node_status_condition{condition="MemoryPressure", status="true"} == 1
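
The error-rate query above wraps both sides of the division in `sum by (job)`. The Python sketch below models why: by default PromQL divides only series whose label sets are identical, so an unaggregated 5xx series ends up divided by itself. Series are modelled as dictionaries keyed by frozen label sets, and the values are invented.

```python
def divide_one_to_one(lhs: dict, rhs: dict) -> dict:
    """Divide series whose full label sets are identical (default matching)."""
    return {labels: value / rhs[labels]
            for labels, value in lhs.items() if labels in rhs}


def sum_by(series: dict, keep: set) -> dict:
    """Aggregate series, keeping only the labels named in `keep`."""
    out = {}
    for labels, value in series.items():
        key = frozenset((k, v) for k, v in labels if k in keep)
        out[key] = out.get(key, 0.0) + value
    return out


def s(**labels) -> frozenset:
    return frozenset(labels.items())


all_requests = {s(job="myapp", status="200"): 99.0,
                s(job="myapp", status="500"): 1.0}
errors = {k: v for k, v in all_requests.items()
          if dict(k)["status"].startswith("5")}

naive = divide_one_to_one(errors, all_requests)
aggregated = divide_one_to_one(sum_by(errors, {"job"}),
                               sum_by(all_requests, {"job"}))
```

The naive division yields a ratio of 1 for the `status="500"` series, while the aggregated form yields the actual error fraction per job.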

Recording Rules

Recording rules pre-compute expensive PromQL expressions and store the result as a new time series. This dramatically speeds up dashboard load times and reduces query-time CPU load.

Naming Convention

Recording rule metrics follow the convention: level:metric:operations. For example, job:http_requests:rate5m means: aggregated by job, from http_requests, using rate over 5m. This convention makes it easy to identify pre-computed metrics from raw ones.
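
As a small illustration of the convention, the sketch below splits a recorded series name into its three parts. The helper names are hypothetical, and the colon check relies on the guideline that colons in metric names are reserved for recording rules.

```python
def parse_rule_name(name: str) -> dict:
    """Split a recording rule name into level, metric, and operations."""
    parts = name.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected level:metric:operations, got {name!r}")
    level, metric, operations = parts
    return {"level": level, "metric": metric, "operations": operations}


def is_recorded(name: str) -> bool:
    """Raw metric names should not contain colons, so a colon marks a recorded series."""
    return ":" in name


parsed = parse_rule_name("job:http_requests:rate5m")
```

A check like this could run in CI against rule files to catch names that drift from the convention.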

# PrometheusRule with recording rules
groups:
- name: kubernetes-resources.recording
  interval: 30s
  rules:

  # CPU usage in cores per container (pre-computed for dashboards)
  - record: namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
    expr: |
      sum by (namespace, pod, container) (
        irate(container_cpu_usage_seconds_total{job="kubelet", container!=""}[5m])
      )

  # Request rate per job (name and aggregation level both say "job")
  - record: job:http_requests:rate5m
    expr: |
      sum by (job) (rate(http_requests_total[5m]))

  # Error ratio — used by SLO burn rate alert
  - record: job:http_request_errors:ratio_rate5m
    expr: |
      sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
      /
      sum by (job) (rate(http_requests_total[5m]))

  # p99 latency recording rule
  - record: job:http_request_duration_seconds:histogram_quantile99_rate5m
    expr: |
      histogram_quantile(0.99,
        sum by (job, le) (
          rate(http_request_duration_seconds_bucket[5m])
        )
      )

Cardinality Management

Cardinality is the number of unique time series in Prometheus. It is the primary driver of Prometheus memory usage and query performance. High cardinality is the most common operational problem.
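
A quick way to reason about cardinality is that the worst-case series count for one metric is the product of the number of distinct values of each label. The label names and counts below are assumptions chosen for illustration.

```python
import math


def worst_case_series(label_values: dict) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return math.prod(label_values.values())


base = {"method": 5, "status": 10, "path": 50, "pod": 20}
bounded = worst_case_series(base)

# Adding a single unbounded label multiplies everything that came before it.
with_user_id = worst_case_series({**base, "user_id": 10_000})
```

Real workloads rarely hit the full product, but the multiplication explains why one unbounded label can move a metric from thousands of series to hundreds of millions.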

Cardinality Analysis

# Check total time series count
prometheus_tsdb_head_series

# Top 10 jobs by series count (PromQL)
topk(10,
  count by (job) ({__name__=~".+"})
)

# Top 10 metrics by series count
topk(10,
  count by (__name__) ({__name__=~".+"})
)

# Series count for a specific metric
count(http_requests_total)

# cardinality API endpoint (Prometheus 2.14+)
curl http://prometheus:9090/api/v1/status/tsdb | jq .data.headStats

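# Detailed cardinality analysis: label names with the most distinct values
# (quote the URL so the shell does not interpret "?"; entries are {name, value} objects)
curl -s 'http://prometheus:9090/api/v1/status/tsdb?limit=20' | \
  jq '.data.labelValueCountByLabelName | sort_by(.value) | reverse | .[0:10]'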

Reducing Cardinality

# Drop high-cardinality labels via metric relabeling in ServiceMonitor
metricRelabelings:
# Drop pod-template-hash label (changes every deployment, creates new series)
- action: labeldrop
  regex: pod_template_hash

# Drop entire metrics that are not needed
- sourceLabels: [__name__]
  regex: go_gc_duration_seconds.*
  action: drop

# Collapse per-user URL paths into a single templated value
- sourceLabels: [path]
  regex: /api/users/[0-9]+
  targetLabel: path
  replacement: /api/users/:id

# Prometheus per-scrape cardinality limit (prevent a single job from exploding)
scrape_configs:
- job_name: myapp
  sample_limit: 10000    # Max samples per scrape (drops the scrape if exceeded)
  label_limit: 64        # Max labels per sample
  label_name_length_limit: 128
  label_value_length_limit: 256
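
To see the effect of the path-rewriting rule above, the sketch below applies the same fully anchored regex to a synthetic set of label values and counts the distinct results. The paths are made up, and only the static replacement is modelled (no capture-group expansion).

```python
import re


def rewrite_path(path: str) -> str:
    """Mimic the `replace` rule: the regex must match the whole label value."""
    if re.fullmatch(r"/api/users/[0-9]+", path):
        return "/api/users/:id"
    return path


raw_paths = [f"/api/users/{user_id}" for user_id in range(1000)] + ["/healthz"]
before = len(set(raw_paths))
after = len({rewrite_path(p) for p in raw_paths})
```

A path such as `/api/users/42/orders` would not be rewritten by this rule because the regex has to match the full value; it would need its own pattern.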

Long-Term Storage: Thanos & Mimir

Prometheus stores data locally with a default retention of 15 days. For long-term storage, multi-cluster aggregation, and high availability, use Thanos or Grafana Mimir.

Thanos Architecture

── Thanos components ─────────────────────────────────────────────────

Prometheus (per cluster)
└── Thanos Sidecar
├── Exposes StoreAPI for real-time data (last 2h)
└── Uploads TSDB blocks to object storage every 2h
│
Object Storage (S3/GCS/Azure Blob)
└── Long-term TSDB blocks (years)
│
Thanos Store Gateway
└── StoreAPI for historical data

Thanos Querier
├── Aggregates StoreAPI from Sidecar + Store Gateway
├── Deduplication across HA Prometheus pairs
└── Exposes PromQL API (Grafana data source)

Thanos Compactor (single instance)
├── Compacts and downsamples old blocks in object storage
└── Retention enforcement

Thanos Ruler (optional)
└── Evaluates recording/alert rules across global view

# Thanos Sidecar: prometheus.yaml addition
# Add to kube-prometheus-stack values:
prometheus:
  prometheusSpec:
    thanos:
      image: quay.io/thanos/thanos:v0.35.0
      objectStorageConfig:
        existingSecret:
          name: thanos-objstore-secret
          key: objstore.yml
---
# objstore.yml (stored as Secret)
type: S3
config:
  bucket: my-thanos-bucket
  endpoint: s3.amazonaws.com
  region: us-east-1
  aws_sdk_auth: true    # Use IRSA

Grafana Mimir

Grafana Mimir is a horizontally scalable, multi-tenant, Prometheus-compatible backend. It began as a fork of Cortex. Instead of uploading blocks through a Thanos Sidecar, Prometheus ships samples to Mimir over remote write, giving a single unified write path.

Feature	Thanos	Mimir
Architecture	Sidecar to existing Prometheus	Replaces Prometheus write path; Prometheus only for scraping
Scalability	Scales via object storage; query latency depends on Querier	Horizontally scaled Ingester/Querier/Compactor
Multi-tenancy	Limited (namespace-based)	First-class; X-Scope-OrgID header
Operations	Multiple components to operate	More complex; monolithic mode available for small clusters
Use case	Multiple clusters, HA, long-term storage	SaaS-scale, multi-tenant, high write throughput

Remote Write & Federation

# Remote write: ship metrics to external backend
remote_write:
- url: https://mimir.example.com/api/v1/push
  headers:
    X-Scope-OrgID: production
  remote_timeout: 30s
  queue_config:
    capacity: 10000
    max_shards: 30
    min_shards: 5
    max_samples_per_send: 5000
    batch_send_deadline: 5s
  write_relabel_configs:
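  # Only forward metrics whose names match (recording rules, kube_*, node_*) to reduce egress
  # Note: prometheus.yaml uses snake_case (source_labels); camelCase is for Operator CRDs
  - source_labels: [__name__]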
    regex: job:.*|kube_.*|node_.*
    action: keep

# Federation: aggregate subset of metrics from child Prometheus
scrape_configs:
- job_name: federate
  honor_labels: true
  metrics_path: /federate
  params:
    match[]:
    - '{__name__=~"job:.*"}'        # Only recording rules
    - up
  static_configs:
  - targets:
    - prometheus-cluster-a:9090
    - prometheus-cluster-b:9090
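
When debugging federation it can help to request `/federate` by hand. The sketch below builds the same URL the scrape job above would request, with each `match[]` selector URL-encoded; the host name is taken from the example targets and nothing is actually fetched.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(base: str, matchers: list) -> str:
    """Build a /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", m) for m in matchers])
    return f"{base}/federate?{query}"


url = federate_url(
    "http://prometheus-cluster-a:9090",
    ['{__name__=~"job:.*"}', "up"],
)
# Round-trip the query string to confirm both selectors survive encoding.
decoded = parse_qs(urlsplit(url).query)["match[]"]
```

Passing the resulting URL to `curl` against a child Prometheus shows exactly which series the parent would ingest.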

Metrics, Alerts & Runbooks

Key Prometheus Self-Metrics

Metric	Description
`prometheus_tsdb_head_series`	Current active time series count (primary cardinality indicator)
`prometheus_tsdb_storage_blocks_bytes`	Disk usage of persistent TSDB blocks
`prometheus_target_scrape_pool_sync_total`	Scrape pool sync operations
`up`	1 if target is up, 0 if down — the most important scrape health metric
`prometheus_rule_evaluation_duration_seconds`	Rule evaluation latency (p99 > 10s = rules falling behind)

Alerts

# Alert: Prometheus target down
- alert: PrometheusTargetDown
  expr: up == 0
  for: 5m
  annotations:
    summary: "Scrape target {{ $labels.job }} / {{ $labels.instance }} is down"

# Alert: Prometheus cardinality too high
- alert: PrometheusHighCardinality
  expr: prometheus_tsdb_head_series > 2000000
  for: 15m
  annotations:
    summary: "Prometheus has >2M active series — memory pressure risk"

# Alert: Prometheus rule evaluation slow
- alert: PrometheusRuleEvaluationSlow
  expr: prometheus_rule_evaluation_duration_seconds{quantile="0.9"} > 1
  for: 10m
  annotations:
    summary: "Rule evaluation p90 > 1s — rules may miss evaluation windows"

# Alert: Remote write queue filling
- alert: PrometheusRemoteWriteQueueFull
  expr: |
    prometheus_remote_storage_samples_pending
      / (prometheus_remote_storage_shard_capacity * prometheus_remote_storage_shards) > 0.9
  for: 5m
  annotations:
    summary: "Remote write queue almost full — samples may be dropped"

Runbooks

Prometheus OOM / High Memory

1. Check prometheus_tsdb_head_series — if >2M investigate cardinality
2. Find top series producers: topk(10, count by (job) ({__name__=~".+"}))
3. Add metricRelabelings to drop unused metrics or labels
4. Increase memory limit as temporary measure

Target Scrape Failures

1. Check target health: Prometheus UI → Targets
2. Verify pod is running: kubectl get pod
3. Check ServiceMonitor selector matches Service labels
4. Verify metrics port is correct name in Service spec

Rules Evaluation Falling Behind

1. Check prometheus_rule_evaluation_duration_seconds
2. Identify slow rule groups: prometheus_rule_group_last_duration_seconds; check prometheus_rule_group_iterations_missed_total for skipped evaluations
3. Simplify expensive recording rules or increase evaluation interval
4. Add Prometheus resources (CPU) if queries are compute-bound

Remote Write Dropping Samples

1. Check prometheus_remote_storage_samples_failed_total (named prometheus_remote_storage_failed_samples_total in older releases)
2. Verify remote endpoint is reachable
3. Increase queue_config.max_shards or capacity
4. Add write_relabel_configs to reduce remote write volume

TSDB Disk Full

1. Check prometheus_tsdb_storage_blocks_bytes
2. Reduce retention: --storage.tsdb.retention.size
3. Add disk to Prometheus PVC
4. Offload historical data to long-term storage (Thanos Sidecar uploads, or remote write to Mimir) so local retention can be shortened

Best Practices

Use kube-prometheus-stack as the baseline

Don't build your own scrape configuration from scratch. The kube-prometheus-stack includes pre-built ServiceMonitors, recording rules, and 20+ Grafana dashboards covering Kubernetes core components out of the box.

Never use high-cardinality labels in metrics

User IDs, request IDs, session tokens, or arbitrary string values as label values will explode cardinality. Instrument at the service boundary, not at the per-request level. Use traces for per-request detail.

Write recording rules for all dashboard and alert expressions

If a PromQL expression appears in a Grafana dashboard or alert rule, it should have a corresponding recording rule. This pre-computes the result, reduces query time from seconds to milliseconds, and reduces Prometheus CPU load.

Use Histogram over Summary for new instrumentation

Histograms allow aggregation across pod instances using histogram_quantile(). Summaries calculate quantiles in-process and cannot be aggregated — useless for Kubernetes where you always have multiple replicas.

Set sample_limit per scrape job

Without a sample_limit, a single misbehaving application can expose millions of time series and drive Prometheus into an OOM kill. Set sample_limit: 10000 for most jobs; raise it only when justified.

Deploy Thanos or Mimir for production clusters

Single-node Prometheus with local retention is not production-ready. It is a single point of failure: nothing is scraped while it is down, and history is lost if the pod runs without a persistent volume or the volume itself is lost. Use Thanos Sidecar + object storage for HA and long-term retention with minimal operational complexity.

Alert on absence of expected metrics

Use absent() to alert when a time series disappears entirely — this catches target-down scenarios that won't fire rate-based alerts (because there are no samples to compute a rate from).

Use `rate()` over `irate()` for alerts

irate() uses only the last two samples and is highly sensitive to single-sample spikes, causing flapping alerts. Use rate() with a 5–10 minute window for alerting. Reserve irate() for exploratory dashboards where responsiveness is more important than stability.