
Prometheus Architecture

Prometheus uses a pull-based model — it scrapes metrics from targets at a configurable interval. Targets expose metrics on an HTTP endpoint (typically /metrics) in the Prometheus exposition format.

── Prometheus pull architecture ──────────────────────────────────────

Prometheus Server
├── Retrieval: scrapes /metrics from targets on schedule
│ └── Service Discovery (K8s API, static, DNS, EC2, ...)

├── TSDB (Time Series Database)
│ ├── In-memory: active head block (2h)
│ ├── WAL (Write-Ahead Log): crash recovery
│ └── On-disk: persistent blocks (compressed, indexed)

├── Rules Engine: recording rules + alerting rules (evaluated every evaluation_interval; default 1m, 15s in the config below)

└── HTTP API: PromQL queries (Grafana, tooling, manual)

│ Alerts
▼
Alertmanager: dedup, grouping, routing, silencing → PagerDuty/Slack

Scrape targets:
App pods, node_exporter, kube-state-metrics, cAdvisor, etcd, API server

Pull vs Push

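Prometheus pulls metrics from targets, so the monitoring system controls the scrape rate and targets cannot overwhelm Prometheus with data. The downside is that short-lived jobs (batch Jobs, CronJobs) may exit before they are ever scraped. For these cases, use the Pushgateway: the job pushes its metrics before exiting, and Prometheus scrapes the gateway. The Pushgateway keeps the last pushed values until they are deleted, so it suits job-level results (last success time, duration) rather than per-instance health.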
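As a rough illustration of the push path, the sketch below renders two metrics in the text exposition format and builds the HTTP request a batch job could send before exiting. It assumes the Pushgateway's /metrics/job/<job> endpoint; the gateway address, job name, and metric names are hypothetical, and the request is only built, not sent.

```python
import urllib.request


def render_exposition(metrics: dict) -> str:
    """Render metric name -> value pairs in the Prometheus text exposition format."""
    lines = [f"{name} {value}" for name, value in sorted(metrics.items())]
    return "\n".join(lines) + "\n"  # the format expects a trailing newline


def build_push_request(gateway: str, job: str, metrics: dict) -> urllib.request.Request:
    """Build (but do not send) a PUT that replaces all metrics in the job's group."""
    return urllib.request.Request(
        url=f"{gateway}/metrics/job/{job}",
        data=render_exposition(metrics).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


# Hypothetical gateway address, job name, and metric names.
request = build_push_request(
    "http://pushgateway.monitoring.svc:9091",
    "nightly_backup",
    {"backup_duration_seconds": 42.5, "backup_last_success_timestamp_seconds": 1700000000},
)
print(request.get_method(), request.full_url)
print(request.data.decode())
# urllib.request.urlopen(request, timeout=5)  # uncomment to actually push
```

In real instrumentation a client library would normally handle the formatting and the push; the point here is only that the job initiates the transfer and Prometheus still scrapes the gateway on its own schedule.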

TSDB Storage Format

Component | Duration | Description
Head block | ~2h (in memory + WAL) | Active write target; fast appends; WAL ensures durability
Persistent blocks | 2h blocks, compacted into larger ones | Immutable, compressed, indexed; default retention 15d
Compaction | Background process | Merges small blocks into larger; deduplicates; improves query speed
Retention | Default 15d (configurable) | --storage.tsdb.retention.time=30d or --storage.tsdb.retention.size=50GB
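To size the disk for a given retention, a common rule of thumb is retention seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per compressed sample. The sketch below applies that estimate; the series count and scrape interval are illustrative assumptions, and real usage also depends on churn and WAL overhead.

```python
def samples_per_second(active_series: int, scrape_interval_s: float) -> float:
    """Approximate ingestion rate when every series is scraped once per interval."""
    return active_series / scrape_interval_s


def estimate_tsdb_bytes(retention_days: float, ingest_rate: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Rule-of-thumb block storage estimate: seconds * samples/s * bytes/sample."""
    return retention_days * 86_400 * ingest_rate * bytes_per_sample


# Illustrative inputs: 1M active series, 15s scrape interval, 15d retention.
rate = samples_per_second(1_000_000, 15)
needed = estimate_tsdb_bytes(15, rate)
print(f"{rate:,.0f} samples/s -> about {needed / 1e9:.1f} GB for 15d")
```

Treat the result as a starting point for the PVC size and leave headroom, since compaction temporarily needs extra space.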

Metric Types

Type | Behavior | Use Case | PromQL Function
Counter | Monotonically increasing; resets to 0 on restart | Total requests, errors, bytes sent | rate(), increase()
Gauge | Can go up or down; current value | Memory usage, queue depth, temperature, active connections | Direct use, delta(), deriv()
Histogram | Samples observations into configurable buckets; exposes _bucket, _count, _sum | Request latency, response sizes; percentiles computed at query time | histogram_quantile()
Summary | Calculates quantiles client-side; exposes pre-calculated quantiles, _count, _sum | Pre-calculated percentiles where aggregation across instances is not needed | Direct quantile labels

Histogram vs Summary

Property | Histogram | Summary
Quantile calculation | Server-side (PromQL at query time) | Client-side (in instrumented code)
Aggregation across instances | Yes: sum buckets, then compute the quantile | No: quantiles cannot be summed
Configurable accuracy | Depends on bucket boundaries | Configurable quantile error bound
Query performance | CPU-intensive at query time | Cheap at query time (pre-calculated)
Recommendation | Preferred for Kubernetes workloads | Legacy; avoid for new instrumentation

Native Histograms (Prometheus 2.40+)

Native histograms use a sparse exponential bucket schema: bucket boundaries adapt automatically, so there is no need to pre-configure buckets, and they provide better accuracy at lower cardinality. They were introduced as an experimental feature and are enabled with --enable-feature=native-histograms. Recent versions of the OTel SDK and the Prometheus client libraries support them.

# Prometheus exposition format examples

# Counter
http_requests_total{method="GET",status="200"} 12345

# Gauge
process_resident_memory_bytes 45678901

# Histogram (automatic _bucket, _count, _sum)
http_request_duration_seconds_bucket{le="0.005"} 100
http_request_duration_seconds_bucket{le="0.01"}  200
http_request_duration_seconds_bucket{le="0.025"} 350
http_request_duration_seconds_bucket{le="0.05"}  400
http_request_duration_seconds_bucket{le="+Inf"}  500
http_request_duration_seconds_count              500
http_request_duration_seconds_sum                12.345
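To see how cumulative buckets turn into a percentile, here is a minimal sketch of the linear interpolation that histogram_quantile() performs on classic histograms, applied to the bucket counts above. It is a simplification: it works on raw counts rather than rates, assumes 0 < q <= 1, and assumes at least one finite bucket.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The list must include the (+Inf, total) bucket.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[idx]
    if math.isinf(upper):
        # The quantile lies beyond the last finite bound; report that bound.
        return buckets[idx - 1][0]
    lower, prev_count = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


BUCKETS = [(0.005, 100), (0.01, 200), (0.025, 350), (0.05, 400), (math.inf, 500)]
print(histogram_quantile(0.5, BUCKETS))   # median falls in the 0.01..0.025 bucket
print(histogram_quantile(0.99, BUCKETS))  # falls in +Inf, so capped at 0.05
```

The second call shows why bucket boundaries matter: 100 of the 500 observations are slower than 50 ms, but the estimate cannot exceed the highest finite bound.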

Scrape Configuration

# prometheus.yaml — scrape configuration
global:
  scrape_interval: 15s         # Default scrape interval
  evaluation_interval: 15s     # Rules evaluation interval
  scrape_timeout: 10s

scrape_configs:

# Scrape Kubernetes API server
- job_name: kubernetes-apiservers
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names: [default]
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    action: keep
    regex: default;kubernetes;https

# Scrape all pods with annotation prometheus.io/scrape: "true"
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: pod
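The relabel rules above are easier to reason about once the matching semantics are explicit: source label values are joined with ";" and the regex must match the whole joined string. The following is a simplified model of the keep, drop, and replace actions, not Prometheus's implementation; it ignores other actions and details such as removing labels that end up empty.

```python
import re
from typing import Optional


def relabel(labels: dict, rule: dict) -> Optional[dict]:
    """Apply one relabel rule; return None when the target is dropped."""
    separator = rule.get("separator", ";")
    joined = separator.join(labels.get(name, "") for name in rule.get("source_labels", []))
    match = re.fullmatch(rule.get("regex", "(.*)"), joined)  # regexes are fully anchored
    action = rule.get("action", "replace")
    if action == "keep":
        return dict(labels) if match else None
    if action == "drop":
        return None if match else dict(labels)
    if action == "replace":
        out = dict(labels)
        if match:  # a non-matching replace rule leaves the labels untouched
            replacement = rule.get("replacement", "$1")
            out[rule["target_label"]] = re.sub(
                r"\$(\d+)", lambda ref: match.group(int(ref.group(1))), replacement
            )
        return out
    raise ValueError(f"unsupported action: {action}")


KEEP_APISERVER = {
    "source_labels": ["__meta_kubernetes_namespace", "__meta_kubernetes_service_name",
                      "__meta_kubernetes_endpoint_port_name"],
    "action": "keep",
    "regex": "default;kubernetes;https",
}
SET_PATH = {
    "source_labels": ["__meta_kubernetes_pod_annotation_prometheus_io_path"],
    "action": "replace",
    "target_label": "__metrics_path__",
    "regex": "(.+)",
}

apiserver = {"__meta_kubernetes_namespace": "default",
             "__meta_kubernetes_service_name": "kubernetes",
             "__meta_kubernetes_endpoint_port_name": "https"}
print(relabel(apiserver, KEEP_APISERVER) is not None)  # target kept
print(relabel({"__meta_kubernetes_pod_annotation_prometheus_io_path": "/stats"}, SET_PATH))
```

Because (.+) does not match an empty string, pods without the path annotation keep the default /metrics path.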

Prometheus Operator

The Prometheus Operator extends Kubernetes with CRDs that let you declare Prometheus instances and scrape targets as Kubernetes objects, enabling GitOps-compatible monitoring configuration.

Core CRDs

CRD | Purpose | Key Fields
Prometheus | Declares a Prometheus instance | replicas, retention, storage, ruleSelector, serviceMonitorSelector
Alertmanager | Declares an Alertmanager instance | replicas, configSecret
ServiceMonitor | Selects Services to scrape via label selectors | selector, endpoints (port, path, interval), namespaceSelector
PodMonitor | Selects Pods to scrape directly | selector, podMetricsEndpoints, namespaceSelector
PrometheusRule | Recording rules and alerting rules | groups (name, interval, rules)
ScrapeConfig | Raw scrape config for non-K8s targets | staticConfigs, httpSDConfigs

ServiceMonitor

# ServiceMonitor: tells Prometheus Operator which Services to scrape
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: production
  labels:
    team: backend    # Must match Prometheus.spec.serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: myapp      # Selects Services with this label
  namespaceSelector:
    matchNames: [production]
  endpoints:
  - port: metrics          # Named port on the Service
    path: /metrics
    interval: 30s
    scrapeTimeout: 10s
    scheme: http
    relabelings:
    - sourceLabels: [__meta_kubernetes_pod_node_name]
      targetLabel: node
    metricRelabelings:
    # Drop high-cardinality metrics at ingest time
    - sourceLabels: [__name__]
      regex: go_gc_.*
      action: drop

PrometheusRule

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-rules
  namespace: production
  labels:
    team: backend
spec:
  groups:
  - name: myapp.recording
    interval: 30s
    rules:
    - record: job:http_requests:rate5m
      expr: sum by (job) (rate(http_requests_total[5m]))
  - name: myapp.alerts
    rules:
    - alert: HighErrorRate
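      # Aggregate both sides by job; without it, each 5xx series is divided by itself and the ratio is always 1
      expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.01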
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High error rate on {{ $labels.job }}"
        description: "Error rate is {{ $value | humanizePercentage }}"

kube-prometheus-stack

The kube-prometheus-stack Helm chart is the standard way to deploy the full Prometheus monitoring stack with pre-built dashboards, recording rules, and alert rules for Kubernetes.

# Add Helm repo and install
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values kps-values.yaml

# kps-values.yaml: production configuration
prometheus:
  prometheusSpec:
    retention: 15d
    retentionSize: 50GB
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: [ReadWriteOnce]
          resources:
            requests:
              storage: 100Gi
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        memory: 4Gi
    # Accept ServiceMonitors from all namespaces
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
    # TSDB compaction and query tuning
    tsdb:
      outOfOrderTimeWindow: 10m   # Accept late-arriving samples
    walCompression: true

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: standard
          resources:
            requests:
              storage: 10Gi

grafana:
  adminPassword: changeme-use-secret
  persistence:
    enabled: true
    size: 10Gi

nodeExporter:
  enabled: true

kubeStateMetrics:
  enabled: true

Key Kubernetes Metrics

Container Resources (cAdvisor)

Metric | Type | Description
container_cpu_usage_seconds_total | Counter | CPU seconds consumed by container
container_memory_working_set_bytes | Gauge | Working set memory (what the kubelet uses for eviction decisions; the best proxy for OOM risk against the limit)
container_memory_rss | Gauge | Resident set size
container_network_receive_bytes_total | Counter | Network bytes received by container
container_fs_reads_bytes_total | Counter | Filesystem bytes read
container_oom_events_total | Counter | OOM kill events per container

Kubernetes State (kube-state-metrics)

Metric | Type | Description
kube_pod_status_phase | Gauge | Pod phase (Running/Pending/Failed); label: phase
kube_pod_container_status_restarts_total | Counter | Container restart count
kube_deployment_status_replicas_unavailable | Gauge | Unavailable replicas in a Deployment
kube_node_status_condition | Gauge | Node conditions (Ready, MemoryPressure, DiskPressure)
kube_horizontalpodautoscaler_status_current_replicas | Gauge | Current HPA replica count
kube_persistentvolumeclaim_status_phase | Gauge | PVC binding status

API Server

Metric | Description
apiserver_request_duration_seconds | API request latency histogram by verb, resource, scope
apiserver_request_total | Total API requests by verb, resource, code
apiserver_current_inflight_requests | In-flight requests (mutating and read-only)
apiserver_admission_webhook_admission_duration_seconds | Admission webhook latency
apiserver_storage_objects | Count of objects stored in etcd by resource type (replaces the deprecated etcd_object_counts)

PromQL

PromQL (Prometheus Query Language) is a functional query language for time series data. It operates on selectors, range vectors, and a rich set of built-in functions.

Selectors and Labels

# Instant vector: current value
http_requests_total

# Label equality filter
http_requests_total{job="myapp", status="200"}

# Label regex match
http_requests_total{status=~"5.."}        # 5xx errors
http_requests_total{status!~"2.."}        # Not 2xx

# Range vector: samples over a time window
http_requests_total[5m]                   # Last 5 minutes of samples

# Offset: look back in time
http_requests_total offset 1h             # Value 1 hour ago

# Subquery: range query over instant expression
rate(http_requests_total[5m])[1h:5m]      # rate over 5m, sampled every 5m for 1h

Essential Functions

# rate(): per-second rate of increase of a counter (avg over window)
# Use for alerting — smooths out spikes
rate(http_requests_total[5m])

# irate(): instant rate — uses last two samples only
# More responsive to spikes; noisy for alerts
irate(http_requests_total[5m])

# increase(): total increase over window (rate * window seconds)
increase(http_requests_total[1h])    # Approx requests in last hour

# histogram_quantile(): calculate percentile from histogram
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# sum by: aggregate across labels
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m]))

# without: aggregate, dropping specified labels
sum without (instance, pod) (rate(http_requests_total[5m]))

# topk: top N time series by value
topk(10, rate(http_requests_total[5m]))

# absent: alert when metric is missing (target down)
absent(up{job="myapp"} == 1)

# predict_linear: forecast gauge value
predict_linear(node_filesystem_avail_bytes[1h], 4*3600) < 0
# Will disk be full in 4 hours?
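The difference between rate() and irate() is easier to see on concrete samples. The sketch below is a simplified model: it handles counter resets the way Prometheus does (a decrease is treated as a restart from 0), but it skips the extrapolation to window boundaries that the real rate() and increase() apply. The sample values are made up.

```python
def increase(samples: list) -> float:
    """Total counter increase over (timestamp, value) samples, reset-aware."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted, so the new value is the increase.
        total += cur - prev if cur >= prev else cur
    return total


def simple_rate(samples: list) -> float:
    """Average per-second increase across the whole window."""
    return increase(samples) / (samples[-1][0] - samples[0][0])


def simple_irate(samples: list) -> float:
    """Per-second increase between the last two samples only."""
    (t0, v0), (t1, v1) = samples[-2:]
    return (v1 - v0 if v1 >= v0 else v1) / (t1 - t0)


# Made-up counter scraped every 15s; it resets at t=45 and then spikes.
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 400)]
print(simple_rate(window))   # averaged over 60s
print(simple_irate(window))  # dominated by the final spike
```

On this window the averaged rate is about 7.7/s while the instant rate is 26/s, which is the kind of gap that makes irate() noisy as an alert input.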

Common Production Queries

# Container CPU utilization % (vs request)
sum by (namespace, pod, container) (
  rate(container_cpu_usage_seconds_total{container!=""}[5m])
) /
sum by (namespace, pod, container) (
  kube_pod_container_resource_requests{resource="cpu", container!=""}
) * 100

# Container memory utilization % (vs limit)
sum by (namespace, pod, container) (
  container_memory_working_set_bytes{container!=""}
) /
sum by (namespace, pod, container) (
  kube_pod_container_resource_limits{resource="memory", container!=""}
) * 100

# Error rate for HTTP services
sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (job) (rate(http_requests_total[5m]))

# p99 latency across all instances of a service
histogram_quantile(0.99,
  sum by (job, le) (
    rate(http_request_duration_seconds_bucket{job="myapp"}[5m])
  )
)

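# Pods NOT running (pending/failed/unknown)
# kube_pod_status_phase exposes one series per phase; == 1 keeps only the phase each pod is currently in
count by (namespace) (
  kube_pod_status_phase{phase!="Running", phase!="Succeeded"} == 1
)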

# Nodes with memory pressure
kube_node_status_condition{condition="MemoryPressure", status="true"} == 1
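Any of the queries above can also be run outside Grafana through the HTTP API mentioned in the architecture section. As a sketch, the helpers below build an instant-query URL for /api/v1/query and parse a vector response; the base URL and the sample response body are assumptions for illustration, and no request is actually sent.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(body: str) -> list:
    """Return (labels, value) pairs from an instant-vector response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample is {"metric": {...}, "value": [timestamp, "value-as-string"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


# Hypothetical in-cluster address and an illustrative response body.
print(build_query_url("http://prometheus:9090", "up == 0"))
sample_body = json.dumps({
    "status": "success",
    "data": {"resultType": "vector",
             "result": [{"metric": {"namespace": "production"},
                         "value": [1700000000, "3"]}]},
})
print(parse_vector(sample_body))
```

Values arrive as strings in the JSON payload, which is why the parser converts them explicitly.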

Recording Rules

Recording rules pre-compute expensive PromQL expressions and store the result as a new time series. This dramatically speeds up dashboard load times and reduces query-time CPU load.

Naming Convention

Recording rule metrics follow the convention: level:metric:operations. For example, job:http_requests:rate5m means: aggregated by job, from http_requests, using rate over 5m. This convention makes it easy to identify pre-computed metrics from raw ones.

# Recording rule groups (with the Operator, place these under spec.groups of a PrometheusRule)
groups:
- name: kubernetes-resources.recording
  interval: 30s
  rules:

  # Per-container CPU usage in cores (pre-computed for dashboards)
  - record: namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
    expr: |
      sum by (namespace, pod, container) (
        irate(container_cpu_usage_seconds_total{job="kubelet", container!=""}[5m])
      )

  # Request rate per job (name and aggregation level agree: job)
  - record: job:http_requests:rate5m
    expr: |
      sum by (job) (rate(http_requests_total[5m]))

  # Error ratio — used by SLO burn rate alert
  - record: job:http_request_errors:ratio_rate5m
    expr: |
      sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
      /
      sum by (job) (rate(http_requests_total[5m]))

  # p99 latency recording rule
  - record: job:http_request_duration_seconds:histogram_quantile99_rate5m
    expr: |
      histogram_quantile(0.99,
        sum by (job, le) (
          rate(http_request_duration_seconds_bucket[5m])
        )
      )
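The level:metric:operations convention described above can be checked mechanically, for example in a CI lint step. This is a minimal sketch with a hypothetical helper name; it only validates the three-part shape and does not verify that the level matches the labels the expression actually aggregates by.

```python
import re
from typing import Optional

# Three colon-separated parts, each a valid metric-name fragment.
_PART = r"[a-zA-Z_][a-zA-Z0-9_]*"
RULE_NAME = re.compile(rf"^(?P<level>{_PART}):(?P<metric>{_PART}):(?P<operations>{_PART})$")


def parse_rule_name(name: str) -> Optional[dict]:
    """Split a recording rule name into its parts, or return None if it does not fit."""
    match = RULE_NAME.match(name)
    return match.groupdict() if match else None


print(parse_rule_name("job:http_requests:rate5m"))
print(parse_rule_name("http_requests_total"))  # raw metric name, not a recording rule
```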

Cardinality Management

Cardinality is the number of unique time series in Prometheus. It is the primary driver of Prometheus memory usage and query performance. High cardinality is the most common operational problem.
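A quick way to reason about cardinality before shipping a metric is to multiply the number of distinct values each label can take, which gives the worst-case series count. The label counts below are made-up illustrations; real metrics usually stay under the worst case because not every combination occurs.

```python
from math import prod


def worst_case_series(label_values: dict) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_values.values())


def histogram_series(label_values: dict, bucket_count: int) -> int:
    """A classic histogram adds one series per bucket plus _sum and _count."""
    return worst_case_series(label_values) * (bucket_count + 2)


# Illustrative label cardinalities for an HTTP request counter.
labels = {"method": 5, "status": 10, "path": 50, "pod": 20}
print(worst_case_series(labels))                          # 50,000 series
print(worst_case_series({**labels, "user_id": 10_000}))   # one unbounded label: 500M
print(histogram_series(labels, bucket_count=12))          # same labels as a histogram
```

The second line is the reason unbounded values such as user IDs do not belong in labels: one extra label multiplies everything that came before it.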

Cardinality Analysis

# Check total time series count
prometheus_tsdb_head_series

# Top 10 jobs by series count (PromQL)
topk(10,
  count by (job) ({__name__=~".+"})
)

# Top 10 metrics by series count
topk(10,
  count by (__name__) ({__name__=~".+"})
)

# Series count for a specific metric
count(http_requests_total)

# cardinality API endpoint (Prometheus 2.14+)
curl http://prometheus:9090/api/v1/status/tsdb | jq .data.headStats

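# Detailed cardinality analysis: labels with the most distinct values
# (quote the URL so the shell does not interpret "?"; entries are {name, value} objects)
curl -s 'http://prometheus:9090/api/v1/status/tsdb?limit=20' | \
  jq '.data.labelValueCountByLabelName | sort_by(.value) | reverse | .[0:10]'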
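The same TSDB status payload can be post-processed in a script, for instance to track the largest metrics over time. The sketch below assumes the response shape of /api/v1/status/tsdb, where each section is a list of name/value objects; the sample payload and its numbers are invented for illustration.

```python
import json


def top_entries(status_body: str, section: str, n: int = 10) -> list:
    """Return the n largest (name, value) pairs from one section of the TSDB status."""
    data = json.loads(status_body)["data"]
    entries = sorted(data[section], key=lambda entry: entry["value"], reverse=True)
    return [(entry["name"], entry["value"]) for entry in entries[:n]]


# Invented payload with the same structure as a real response.
sample = json.dumps({
    "status": "success",
    "data": {
        "seriesCountByMetricName": [
            {"name": "http_request_duration_seconds_bucket", "value": 48000},
            {"name": "container_cpu_usage_seconds_total", "value": 5200},
            {"name": "apiserver_request_duration_seconds_bucket", "value": 91000},
        ],
        "labelValueCountByLabelName": [
            {"name": "pod", "value": 1800},
            {"name": "path", "value": 950},
        ],
    },
})
print(top_entries(sample, "seriesCountByMetricName", n=2))
print(top_entries(sample, "labelValueCountByLabelName", n=1))
```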

Reducing Cardinality

# Drop high-cardinality labels via metric relabeling in ServiceMonitor
metricRelabelings:
# Drop pod-template-hash label (changes every deployment, creates new series)
- action: labeldrop
  regex: pod_template_hash

# Drop entire metrics that are not needed
- sourceLabels: [__name__]
  regex: go_gc_duration_seconds.*
  action: drop

# Collapse high-cardinality URL paths into a route template
- sourceLabels: [path]
  regex: /api/users/[0-9]+
  targetLabel: path
  replacement: /api/users/:id

# Prometheus per-scrape limits (prevent a single job from exploding cardinality)
scrape_configs:
- job_name: myapp
  sample_limit: 10000    # Max samples per scrape (drops the scrape if exceeded)
  label_limit: 64        # Max labels per sample
  label_name_length_limit: 128
  label_value_length_limit: 256

Long-Term Storage: Thanos & Mimir

Prometheus stores data locally with a default retention of 15 days. For long-term storage, multi-cluster aggregation, and high availability, use Thanos or Grafana Mimir.

Thanos Architecture

── Thanos components ─────────────────────────────────────────────────

Prometheus (per cluster)
└── Thanos Sidecar
    ├── Exposes StoreAPI for real-time data (last 2h)
    └── Uploads TSDB blocks to object storage every 2h

Object Storage (S3/GCS/Azure Blob)
└── Long-term TSDB blocks (years)

Thanos Store Gateway
└── StoreAPI for historical data

Thanos Querier
├── Aggregates StoreAPI from Sidecar + Store Gateway
├── Deduplication across HA Prometheus pairs
└── Exposes PromQL API (Grafana data source)

Thanos Compactor (single instance)
├── Compacts and downsamples old blocks in object storage
└── Retention enforcement

Thanos Ruler (optional)
└── Evaluates recording/alert rules across global view

# Thanos Sidecar: prometheus.yaml addition
# Add to kube-prometheus-stack values:
prometheus:
  prometheusSpec:
    thanos:
      image: quay.io/thanos/thanos:v0.35.0
      objectStorageConfig:
        existingSecret:
          name: thanos-objstore-secret
          key: objstore.yml
---
# objstore.yml (stored as Secret)
type: S3
config:
  bucket: my-thanos-bucket
  endpoint: s3.amazonaws.com
  region: us-east-1
  aws_sdk_auth: true    # Use IRSA

Grafana Mimir

Grafana Mimir is a horizontally scalable, multi-tenant, Prometheus-compatible backend and the successor to Cortex. Instead of the Thanos Sidecar pattern, Prometheus (or an agent) ships samples to Mimir via remote write, and Mimir handles ingestion, long-term storage, and querying.

Feature | Thanos | Mimir
Architecture | Sidecar to existing Prometheus | Replaces Prometheus write path; Prometheus only for scraping
Scalability | Scales via object storage; query latency depends on Querier | Horizontally scaled Ingester/Querier/Compactor
Multi-tenancy | Limited (namespace-based) | First-class; X-Scope-OrgID header
Operations | Multiple components to operate | More complex; monolithic mode available for small clusters
Use case | Multiple clusters, HA, long-term storage | SaaS-scale, multi-tenant, high write throughput

Remote Write & Federation

# Remote write: ship metrics to external backend
remote_write:
- url: https://mimir.example.com/api/v1/push
  headers:
    X-Scope-OrgID: production
  remote_timeout: 30s
  queue_config:
    capacity: 10000
    max_shards: 30
    min_shards: 5
    max_samples_per_send: 5000
    batch_send_deadline: 5s
  write_relabel_configs:
  # Only send selected metric names to reduce egress
  # (plain Prometheus config uses source_labels; sourceLabels is the Operator CRD spelling)
  - source_labels: [__name__]
    regex: job:.*|kube_.*|node_.*
    action: keep

# Federation: aggregate subset of metrics from child Prometheus
scrape_configs:
- job_name: federate
  honor_labels: true
  metrics_path: /federate
  params:
    match[]:
    - '{__name__=~"job:.*"}'        # Only recording rules
    - up
  static_configs:
  - targets:
    - prometheus-cluster-a:9090
    - prometheus-cluster-b:9090
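For debugging a federation job, it helps to reproduce the request the parent Prometheus sends. The sketch below builds the /federate URL with repeated match[] parameters, mirroring the params block above; the helper name is hypothetical and nothing is fetched.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(base_url: str, selectors: list) -> str:
    """Build a /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url}/federate?{query}"


url = federate_url("http://prometheus-cluster-a:9090", ['{__name__=~"job:.*"}', "up"])
print(url)
# Round-trip the query string to confirm both selectors survive encoding.
print(parse_qs(urlsplit(url).query)["match[]"])
```

Fetching that URL with curl returns plain exposition text, which makes it easy to check exactly which series the child instance would hand over.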

Metrics, Alerts & Runbooks

Key Prometheus Self-Metrics

Metric | Description
prometheus_tsdb_head_series | Current active time series count (primary cardinality indicator)
prometheus_tsdb_storage_blocks_bytes | Disk usage of persistent TSDB blocks
prometheus_target_scrape_pool_sync_total | Scrape pool sync operations
up | 1 if target is up, 0 if down; the most important scrape health metric
prometheus_rule_evaluation_duration_seconds | Rule evaluation latency (p99 > 10s = rules falling behind)

Alerts

# Alert: Prometheus target down
- alert: PrometheusTargetDown
  expr: up == 0
  for: 5m
  annotations:
    summary: "Scrape target {{ $labels.job }} / {{ $labels.instance }} is down"

# Alert: Prometheus cardinality too high
- alert: PrometheusHighCardinality
  expr: prometheus_tsdb_head_series > 2000000
  for: 15m
  annotations:
    summary: "Prometheus has >2M active series — memory pressure risk"

# Alert: Prometheus rule evaluation slow
- alert: PrometheusRuleEvaluationSlow
  expr: prometheus_rule_evaluation_duration_seconds{quantile="0.9"} > 1
  for: 10m
  annotations:
    summary: "Rule evaluation p90 > 1s — rules may miss evaluation windows"

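# Alert: Remote write queue filling
# shard_capacity is per shard, so multiply by the shard count before comparing with pending samples
- alert: PrometheusRemoteWriteQueueFull
  expr: |
    (prometheus_remote_storage_shard_capacity * prometheus_remote_storage_shards)
    - prometheus_remote_storage_samples_pending < 100
  for: 5m
  annotations:
    summary: "Remote write queue almost full: remote write is falling behind and samples may be dropped"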

Runbooks

Prometheus OOM / High Memory

1. Check prometheus_tsdb_head_series — if >2M investigate cardinality
2. Find top series producers: topk(10, count by (job) ({__name__=~".+"}))
3. Add metricRelabelings to drop unused metrics or labels
4. Increase memory limit as temporary measure

Target Scrape Failures

1. Check target health: Prometheus UI → Targets
2. Verify pod is running: kubectl get pod
3. Check ServiceMonitor selector matches Service labels
4. Verify metrics port is correct name in Service spec

Rules Evaluation Falling Behind

1. Check prometheus_rule_evaluation_duration_seconds
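2. Identify slow rule groups: compare prometheus_rule_group_last_duration_seconds with prometheus_rule_group_interval_seconds, and check prometheus_rule_group_iterations_missed_total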
3. Simplify expensive recording rules or increase evaluation interval
4. Add Prometheus resources (CPU) if queries are compute-bound

Remote Write Dropping Samples

1. Check prometheus_remote_storage_samples_failed_total (prometheus_remote_storage_failed_samples_total on older versions)
2. Verify remote endpoint is reachable
3. Increase queue_config.max_shards or capacity
4. Add write_relabel_configs to reduce remote write volume

TSDB Disk Full

1. Check prometheus_tsdb_storage_blocks_bytes
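2. Reduce retention: --storage.tsdb.retention.time or --storage.tsdb.retention.size
3. Expand the Prometheus PVC (requires a StorageClass that allows volume expansion)
4. Offload historical data: remote write to Mimir or Thanos Receive, or Thanos Sidecar block uploads to object storage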

Best Practices

1. Use kube-prometheus-stack as the baseline

Don't build your own scrape configuration from scratch. The kube-prometheus-stack includes pre-built ServiceMonitors, recording rules, and 20+ Grafana dashboards covering Kubernetes core components out of the box.

2. Never use high-cardinality labels in metrics

User IDs, request IDs, session tokens, or arbitrary string values as label values will explode cardinality. Instrument at the service boundary, not at the per-request level. Use traces for per-request detail.

3. Write recording rules for all dashboard and alert expressions

If a PromQL expression appears in a Grafana dashboard or alert rule, it should have a corresponding recording rule. This pre-computes the result, reduces query time from seconds to milliseconds, and reduces Prometheus CPU load.

4. Use Histogram over Summary for new instrumentation

Histograms allow aggregation across pod instances using histogram_quantile(). Summaries calculate quantiles in-process and cannot be aggregated — useless for Kubernetes where you always have multiple replicas.

5. Set sample_limit per scrape job

Without a sample_limit, a single misbehaving application can push millions of time series into Prometheus and cause an OOM. Set sample_limit: 10000 for most jobs; higher only when justified.

6. Deploy Thanos or Mimir for production clusters

A single Prometheus with only local retention is a single point of failure: history is limited to the local retention window, and it is lost entirely if the pod runs without persistent storage or the volume fails. Use Thanos Sidecar + object storage for HA and long-term retention with minimal operational complexity.

7. Alert on absence of expected metrics

Use absent() to alert when a time series disappears entirely — this catches target-down scenarios that won't fire rate-based alerts (because there are no samples to compute a rate from).

8. Use rate() over irate() for alerts

irate() uses only the last two samples and is highly sensitive to single-sample spikes, causing flapping alerts. Use rate() with a 5–10 minute window for alerting. Reserve irate() for exploratory dashboards where responsiveness is more important than stability.