Service Mesh
A service mesh moves cross-cutting concerns — mutual TLS, observability, retries, circuit breaking, traffic splitting — out of application code and into a dedicated infrastructure layer. This page covers sidecar and sidecarless architectures, Istio, Linkerd, Cilium service mesh, SPIFFE/SPIRE identity, and production patterns.
What This Page Covers
- What a service mesh is; problems it solves without code changes
- Sidecar vs sidecarless (ambient) architecture comparison
- Istio architecture: istiod (Pilot+Citadel+Galley merged), Envoy data plane
- Istio CRDs: VirtualService, DestinationRule, Gateway, ServiceEntry, Sidecar
- Istio security: PeerAuthentication (STRICT/PERMISSIVE mTLS), AuthorizationPolicy (ALLOW/DENY/AUDIT/CUSTOM)
- Istio traffic management: retries, timeouts, circuit breaking (outlier detection), fault injection, traffic mirroring
- Istio observability: Prometheus metrics, Kiali topology, Jaeger/Zipkin tracing, access logs
- Istio ambient mode: ztunnel (L4 per-node), waypoint proxy (L7 per-namespace/SA)
- Istio multi-cluster: single/multiple control planes, east-west gateway, flat network vs gateway topologies
- Linkerd: Rust-based linkerd2-proxy, zero-config mTLS, ServiceProfile CRD, retries, timeouts, traffic splitting
- Cilium service mesh: eBPF-based L4 mesh, optional per-node Envoy for L7, Hubble observability
- Consul Connect: service intentions, transparent proxy, mesh gateways
- SPIFFE/SPIRE: SVIDs, X.509 and JWT formats, workload attestation, trust bundles
- Traffic management patterns: canary, blue/green, A/B testing, dark launch (mirroring)
- mTLS migration: PERMISSIVE → STRICT phase rollout
- Service mesh vs NetworkPolicy: complementary, not alternatives
- Mesh vs Gateway API: HTTPRoute with a Service parentRef for mesh use cases (GAMMA)
- 8 metrics + 4 alerting rules + 5 troubleshooting runbooks
- 9 best practices for production service mesh operation
What a Service Mesh Is — and Is Not
A service mesh is an infrastructure layer that intercepts all network traffic between services and applies policy, observability, and reliability logic — without any changes to application code. It solves problems that are otherwise duplicated in every service:
| Problem | Without Mesh | With Mesh |
|---|---|---|
| Mutual TLS between services | Each app implements TLS with certificate rotation | Automatic; zero-code-change; cert rotation transparent |
| Service-to-service authorization | Application-level API keys or shared secrets | Identity-based (SPIFFE SVID); cryptographically verified |
| Retries, timeouts, circuit breaking | Each SDK/library re-implements per language | Uniform policy in mesh proxy; language-agnostic |
| Distributed tracing | Manual instrumentation in every service | Proxy emits spans automatically; the application only has to forward trace headers (B3/W3C) |
| Traffic splitting / canary | Separate deployments with DNS trickery | Weighted routing at proxy layer; no DNS change needed |
| Observability (RED metrics) | Each service must expose /metrics | Proxy generates request rate, error rate, duration per route |
NetworkPolicy operates at L3/L4 (IP + port) and is enforced in the kernel by the CNI. Service mesh policies operate at L7 (HTTP, gRPC) in a user-space proxy. They are complementary: NetworkPolicy provides coarse-grained firewall rules; service mesh provides fine-grained identity-based authorization and traffic management. Run both in production.
Sidecar vs Sidecarless Architecture
Sidecar Model (Istio, Linkerd)
- Proxy injected as extra container in every pod
- iptables rules redirect all traffic through proxy
- Full L7 visibility per pod
- Pod startup overhead (~100ms), resource overhead (~50Mi RAM, ~0.05 CPU per sidecar)
- Requires pod restart to inject/upgrade proxy
- Isolation: per-pod failure domain
Sidecarless / Ambient (Istio Ambient, Cilium)
- No sidecar injected; no pod restart needed
- L4 proxy (ztunnel/eBPF) runs per node
- L7 proxy (waypoint) runs per namespace or SA — only when needed
- Lower resource overhead; better for large clusters
- Gradual adoption: enroll namespace without restarting pods
- Blast radius: node-level failure affects all pods on node (ztunnel)
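The resource trade-off between the two models comes down to simple multiplication, sketched below. The ~50Mi per-sidecar figure is taken from the list above; the cluster size and the per-node proxy figure are placeholder assumptions for illustration, not measurements, and the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    nodes: int
    pods: int


def sidecar_memory_mi(cluster: Cluster, per_sidecar_mi: float = 50.0) -> float:
    """Total proxy memory (Mi) when every pod carries its own sidecar."""
    return cluster.pods * per_sidecar_mi


def per_node_memory_mi(cluster: Cluster, per_node_proxy_mi: float) -> float:
    """Total proxy memory (Mi) when a single L4 proxy runs on each node."""
    return cluster.nodes * per_node_proxy_mi


# Hypothetical cluster: 50 nodes, 40 pods per node.
cluster = Cluster(nodes=50, pods=2000)
sidecar_total = sidecar_memory_mi(cluster)
# ASSUMPTION: 200 Mi per node-level proxy is a placeholder, not a measurement.
node_total = per_node_memory_mi(cluster, per_node_proxy_mi=200.0)
ratio = sidecar_total / node_total

print(f"sidecars: {sidecar_total:.0f} Mi")
print(f"per-node: {node_total:.0f} Mi")
print(f"ratio:    {ratio:.1f}x")
```

The ratio is driven by pod density (pods per node), so the saving should be re-computed with figures measured in your own cluster. Waypoint proxies, when deployed, add to the per-node total.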
Istio
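Istio is the most feature-rich service mesh for Kubernetes. Since version 1.5, the control plane components Pilot, Citadel, and Galley (plus the sidecar injector) ship as a single binary: istiod. Mixer was not merged into istiod; it was deprecated in 1.5 and later removed, with telemetry moving into the Envoy proxies.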
Control Plane Components
istiod
- Pilot — xDS server; pushes route/cluster/listener config to Envoy proxies
- Citadel — CA; issues SPIFFE X.509 SVIDs to workloads; rotates certs automatically (~24h TTL)
- Galley — config validation and distribution; validates Istio CRDs
- Port 15010 (gRPC xDS), 15012 (secure xDS), 15014 (control plane metrics), 8080 (debug)
Envoy Data Plane
- istio-proxy sidecar: Envoy + Istio agent
- Intercepts traffic via iptables REDIRECT (init container sets rules)
- Port 15001 (outbound), 15006 (inbound), 15090 (Prometheus metrics), 15000 (admin)
- xDS: LDS/RDS/CDS/EDS pushed from istiod; no polling
- Cert refresh from istiod via SDS (Secret Discovery Service)
Installation
# Install via istioctl (recommended)
istioctl install --set profile=default # minimal | default | demo | empty
# Verify installation
istioctl verify-install
kubectl get pods -n istio-system
# Enable sidecar injection for a namespace
kubectl label namespace production istio-injection=enabled
# Manual injection (without label):
istioctl kube-inject -f deployment.yaml | kubectl apply -f -
# Check proxy status across all pods
istioctl proxy-status
Istio CRDs — Traffic Management
VirtualService
VirtualService defines traffic routing rules for services within the mesh. It intercepts requests to a service's hostname and applies routing logic before the request reaches any pod:
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews-vs
  namespace: production
spec:
  hosts:
  - reviews                    # short name = reviews.production.svc.cluster.local
  - reviews.example.com        # external hostname (when paired with Istio Gateway)
  gateways:
  - mesh                       # "mesh" = all sidecars in mesh
  - production/my-gateway      # also apply to external traffic via this Gateway
  http:
  # Header-based routing: requests carrying x-canary: "true" always go to v2
  - match:
    - headers:
        x-canary:
          exact: "true"
    route:
    - destination:
        host: reviews
        subset: v2
  # Weight-based canary split for all other requests: 90% to v1, 10% to v2
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 90
    - destination:
        host: reviews
        subset: v2
      weight: 10
    timeout: 10s
    retries:
      attempts: 3
      perTryTimeout: 3s
      retryOn: "gateway-error,connect-failure,retriable-4xx"
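The timeout and retries values in the VirtualService above interact, and the arithmetic is worth checking before shipping a policy. The sketch below assumes that attempts counts retries (so one more try than that can be made) and ignores the backoff between retries. The function and field names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryBudget:
    max_tries: int        # initial try plus allowed retries
    uncapped_s: float     # time needed if every try ran to its per-try timeout
    effective_s: float    # what the overall timeout actually permits
    full_tries: int       # tries that can run to completion within the timeout


def retry_budget(attempts: int, per_try_timeout_s: float, timeout_s: float) -> RetryBudget:
    # ASSUMPTION: `attempts` counts retries, so up to attempts + 1 tries are made.
    # Backoff between retries is ignored to keep the arithmetic simple.
    max_tries = attempts + 1
    uncapped_s = max_tries * per_try_timeout_s
    full_tries = min(max_tries, math.floor(timeout_s / per_try_timeout_s))
    return RetryBudget(max_tries, uncapped_s, min(uncapped_s, timeout_s), full_tries)


# Values from the VirtualService above: attempts=3, perTryTimeout=3s, timeout=10s.
budget = retry_budget(attempts=3, per_try_timeout_s=3.0, timeout_s=10.0)
print(budget)
```

Under these assumptions the four possible tries would need 12s, but the 10s route timeout leaves room for only three full tries. Either raise the timeout or lower perTryTimeout if all retries are meant to complete.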
DestinationRule
DestinationRule defines subsets (for canary routing) and traffic policies applied after routing (load balancing, circuit breaking, TLS settings to upstream):
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews-dr
  namespace: production
spec:
  host: reviews
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST     # ROUND_ROBIN | LEAST_REQUEST | RANDOM | PASSTHROUGH (LEAST_CONN is deprecated)
    connectionPool:
      tcp:
        maxConnections: 100
        connectTimeout: 30ms
      http:
        http2MaxRequests: 1000
        maxRequestsPerConnection: 10
    outlierDetection:           # circuit breaker
      consecutiveGatewayErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50    # never eject more than 50% of hosts
      minHealthPercent: 50
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
    trafficPolicy:              # per-subset override
      loadBalancer:
        simple: ROUND_ROBIN
Istio Gateway (vs Kubernetes Ingress)
Istio Gateway manages inbound/outbound traffic at the mesh boundary. It configures the Istio ingress gateway pods (Envoy). Despite the shared kind name, it is a different resource from the Kubernetes Gateway API Gateway (gateway.networking.k8s.io):
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: my-gateway
  namespace: production
spec:
  selector:
    istio: ingressgateway         # targets the istio-ingressgateway pods
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE                # SIMPLE | MUTUAL | PASSTHROUGH | AUTO_PASSTHROUGH
      credentialName: my-tls-cert # references a Secret
    hosts:
    - "*.example.com"
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*.example.com"
    tls:
      httpsRedirect: true
ServiceEntry — External Services
ServiceEntry adds external services to Istio's service registry, enabling traffic policies, mTLS, and observability for egress traffic:
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: stripe-api
  namespace: production
spec:
  hosts:
  - api.stripe.com
  ports:
  - number: 443
    name: https
    protocol: HTTPS
  location: MESH_EXTERNAL       # MESH_EXTERNAL | MESH_INTERNAL
  resolution: DNS
---
# Block all external traffic except explicit ServiceEntries:
# Set outboundTrafficPolicy: REGISTRY_ONLY in MeshConfig
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    outboundTrafficPolicy:
      mode: REGISTRY_ONLY       # default: ALLOW_ANY
Istio Security
PeerAuthentication — mTLS Policy
# Namespace-wide STRICT mTLS (all traffic must be mTLS)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT                # STRICT | PERMISSIVE | DISABLE
---
# Workload-level and port-level exception (useful during migration):
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: legacy-app-exception    # example name
  namespace: production
spec:
  selector:
    matchLabels:
      app: legacy-app
  mtls:
    mode: PERMISSIVE            # namespace is STRICT, but this workload also accepts plaintext
  portLevelMtls:
    8080:
      mode: DISABLE             # mTLS is off on this port; it accepts plaintext only
---
# Mesh-wide STRICT (applies to all namespaces):
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system       # root namespace = mesh-wide policy
spec:
  mtls:
    mode: STRICT
AuthorizationPolicy
AuthorizationPolicy enforces access control at the sidecar level using service identity (SPIFFE) and request attributes. Four actions: ALLOW, DENY, AUDIT, CUSTOM (external auth):
# Allow only frontend service to call reviews on GET /api/*
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: reviews-authz
  namespace: production
spec:
  selector:
    matchLabels:
      app: reviews
  action: ALLOW
  rules:
  - from:
    - source:
        principals:
        # Identity from the peer certificate (SPIFFE ID without the spiffe:// prefix)
        - "cluster.local/ns/production/sa/frontend"
    to:
    - operation:
        methods: ["GET"]
        paths: ["/api/*"]
    when:
    - key: request.headers[x-api-version]
      values: ["v1", "v2"]
---
# Deny all except explicitly allowed (default deny):
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-all
  namespace: production
spec: {}                        # empty spec = ALLOW policy with no rules, so nothing matches
---
# JWT-based authorization (spec fragment; requestPrincipals only match
# when a RequestAuthentication has validated the token)
spec:
  rules:
  - from:
    - source:
        requestPrincipals: ["https://accounts.google.com/*"]
    when:
    - key: request.auth.claims[role]
      values: ["admin"]
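How the ALLOW policy and the empty deny-all policy above combine can be sketched with a small evaluator. This is a simplified model under stated assumptions: it covers only ALLOW and DENY actions, ignores when conditions, CUSTOM and AUDIT, and treats all policies as selecting the same workload. The class and function names are hypothetical, not Istio APIs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rule:
    principals: List[str] = field(default_factory=list)  # empty list = any source
    methods: List[str] = field(default_factory=list)     # empty list = any method
    paths: List[str] = field(default_factory=list)       # empty list = any path


@dataclass
class Policy:
    action: str                                           # "ALLOW" or "DENY"
    rules: List[Rule] = field(default_factory=list)


@dataclass
class Request:
    principal: str
    method: str
    path: str


def match_value(pattern: str, value: str) -> bool:
    """Exact, prefix ("/api/*"), suffix ("*/info") and presence ("*") matching."""
    if pattern == "*":
        return value != ""
    if pattern.endswith("*"):
        return value.startswith(pattern[:-1])
    if pattern.startswith("*"):
        return value.endswith(pattern[1:])
    return pattern == value


def rule_matches(rule: Rule, req: Request) -> bool:
    def any_match(patterns: List[str], value: str) -> bool:
        return not patterns or any(match_value(p, value) for p in patterns)

    return (any_match(rule.principals, req.principal)
            and any_match(rule.methods, req.method)
            and any_match(rule.paths, req.path))


def is_allowed(policies: List[Policy], req: Request) -> bool:
    def matches(policy: Policy) -> bool:
        return any(rule_matches(r, req) for r in policy.rules)

    # 1. Any matching DENY policy rejects the request.
    if any(p.action == "DENY" and matches(p) for p in policies):
        return False
    # 2. With no ALLOW policies at all, the request is allowed.
    allow_policies = [p for p in policies if p.action == "ALLOW"]
    if not allow_policies:
        return True
    # 3. Otherwise at least one ALLOW policy must match.
    return any(matches(p) for p in allow_policies)


reviews_authz = Policy("ALLOW", [Rule(
    principals=["cluster.local/ns/production/sa/frontend"],
    methods=["GET"],
    paths=["/api/*"],
)])
deny_all = Policy("ALLOW", [])  # empty spec: an ALLOW policy that matches nothing

frontend_get = Request("cluster.local/ns/production/sa/frontend", "GET", "/api/reviews")
print(is_allowed([reviews_authz, deny_all], frontend_get))
```

The empty policy never matches a request, so once it exists every request needs some other ALLOW policy to match; that is what makes it a default deny.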
Traffic Management Patterns
Fault Injection
Fault injection deliberately introduces delays and errors to test system resilience — chaos engineering at the proxy layer without modifying application code:
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ratings-fault
  namespace: production
spec:
  hosts: [ratings]
  http:
  - match:
    - headers:
        x-test-user:
          exact: "chaos"        # only inject for specific test header
    fault:
      delay:
        percentage:
          value: 50             # 50% of requests get a 5s delay
        fixedDelay: 5s
      abort:
        percentage:
          value: 10             # 10% of requests get HTTP 503
        httpStatus: 503
    route:
    - destination:
        host: ratings
Traffic Mirroring (Dark Launch)
Mirror a percentage of live traffic to a shadow service — requests are duplicated, responses from the mirror are ignored. Ideal for testing new versions under real traffic without user impact:
spec:
  http:
  - route:
    - destination:
        host: product-service
        subset: v1
      weight: 100
    mirror:
      host: product-service
      subset: v2                # shadow target; receives copy of every request
    mirrorPercentage:
      value: 100.0              # mirror 100% of traffic (can be lower, e.g. 10.0)
Circuit Breaking (Outlier Detection)
Defined in DestinationRule's outlierDetection (see earlier). When a host accumulates consecutive errors, it is ejected from the load balancing pool for a backoff period:
# Full outlier detection example:
outlierDetection:
  splitExternalLocalOriginErrors: true   # count local-origin errors separately from upstream responses
  consecutiveLocalOriginFailures: 3      # 3 consecutive local-origin failures (connect errors, timeouts) → eject
  consecutiveGatewayErrors: 5            # 5 consecutive 502/503/504 responses → eject
  consecutive5xxErrors: 5                # 5 consecutive 5xx responses of any kind → eject
  interval: 10s                          # time between ejection analysis sweeps
  baseEjectionTime: 30s                  # minimum ejection duration; grows with each repeat ejection
  maxEjectionPercent: 33                 # max % of hosts ejectable at once
  minHealthPercent: 50                   # outlier detection is disabled while fewer than 50% of hosts are healthy
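The ejection behaviour configured above can be modelled in a few lines. This is a simplified sketch, not Envoy's implementation. It assumes ejection happens immediately on the Nth consecutive 5xx, that ejection time is the base time multiplied by the host's ejection count, and that at least one host may always be ejected regardless of the percentage cap. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OutlierDetector:
    hosts: List[str]
    consecutive_5xx: int = 5
    base_ejection_time_s: float = 30.0
    max_ejection_percent: int = 33
    streak: Dict[str, int] = field(default_factory=dict)           # consecutive 5xx per host
    ejections: Dict[str, int] = field(default_factory=dict)        # times each host was ejected
    ejected_until: Dict[str, float] = field(default_factory=dict)  # ejection expiry per host

    def active_ejections(self, now: float) -> List[str]:
        return [h for h, until in self.ejected_until.items() if until > now]

    def record(self, host: str, status: int, now: float) -> bool:
        """Record one response; return True if it caused an ejection."""
        if status < 500:
            self.streak[host] = 0  # any non-5xx response resets the streak
            return False
        self.streak[host] = self.streak.get(host, 0) + 1
        if self.streak[host] < self.consecutive_5xx:
            return False
        # ASSUMPTION: at least one host may be ejected even if the percentage rounds to 0.
        max_ejectable = max(1, len(self.hosts) * self.max_ejection_percent // 100)
        if len(self.active_ejections(now)) >= max_ejectable:
            return False  # cap reached: the host keeps receiving traffic
        count = self.ejections.get(host, 0) + 1
        self.ejections[host] = count
        # Ejection time grows linearly with the number of ejections.
        self.ejected_until[host] = now + self.base_ejection_time_s * count
        self.streak[host] = 0
        return True


detector = OutlierDetector(hosts=["pod-a", "pod-b", "pod-c"])
for second in range(5):
    ejected = detector.record("pod-a", 503, now=float(second))
print(ejected, detector.ejected_until)
```

With three hosts and a 33% cap, only one host can be out of the pool at a time in this model. A second failing host stays in rotation until the first ejection expires, which is the trade-off maxEjectionPercent makes between protecting callers and keeping capacity.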
Istio Ambient Mode
Ambient mode (GA in Istio 1.24) removes sidecars entirely. Traffic is intercepted at the node level by ztunnel (a per-node Rust DaemonSet) for L4, and optionally by waypoint proxies (per-namespace or per-ServiceAccount Envoy Deployments) for L7.
- No pod restart needed to enroll — just label the namespace
- Much lower proxy memory overhead than per-pod sidecars (one ztunnel per node instead of N sidecars for N pods); the saving grows with pod density
- Rolling upgrades: upgrade ztunnel without touching application pods
- L7 features are opt-in: only namespaces that need HTTP-level policy pay the waypoint cost
Istio Observability
Standard Metrics (Envoy → Prometheus)
Every Envoy sidecar exposes Prometheus metrics on port 15090 at /stats/prometheus; ztunnel serves its own metrics endpoint. The standard istio_* metrics below can be customized via the Telemetry API:
# Key Istio metrics:
# Request rate:
sum(rate(istio_requests_total[5m])) by (destination_service, response_code)
# P99 latency:
histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (le, destination_service))
# Error rate:
sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service)
/
sum(rate(istio_requests_total[5m])) by (destination_service)
# New TCP connections per second (the metric is a counter, so take its rate):
sum(rate(istio_tcp_connections_opened_total[5m])) by (destination_service)
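The P99 query above returns an estimate, not an exact value, because histogram_quantile interpolates linearly inside the bucket that contains the target rank. The sketch below reproduces that interpolation for classic cumulative buckets. It is a simplified model that assumes at least two buckets and non-negative bounds, and the sample bucket counts are made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The list must include the +Inf bucket, whose count is the total.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    # Linear interpolation within the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Made-up latency buckets in milliseconds: (le, cumulative request count).
sample = [(100.0, 900), (500.0, 980), (1000.0, 995), (math.inf, 1000)]
p99 = histogram_quantile(0.99, sample)
print(f"estimated p99: {p99:.1f} ms")
```

Because the result can only land between two bucket bounds, coarse buckets around the SLO threshold make latency alerts imprecise. It is worth checking that a bucket boundary sits close to the value you alert on.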
Telemetry API — Custom Metrics & Access Logs
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: custom-metrics
  namespace: production
spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - match:
        metric: REQUEST_COUNT
      tagOverrides:
        request_id:
          value: "request.headers['x-request-id'] | 'unknown'"
  accessLogging:
  - providers:
    - name: envoy                        # built-in envoy access log
    filter:
      expression: "response.code >= 400" # only log errors
  tracing:
  - providers:
    - name: jaeger
    randomSamplingPercentage: 1.0        # 1% sampling
Istio Multi-Cluster
| Topology | Control Planes | Network | Use Case |
|---|---|---|---|
| Single primary | 1 (manages all clusters) | Flat (pods reach each other directly) | Dev/staging; tight coupling acceptable |
| Primary-remote | 1 primary, remotes have no istiod | Flat | Central control; reduced resource use on small clusters |
| Multi-primary | 1 per cluster (replicated config) | Flat or gateway | HA; independent failure domains; recommended for prod |
| Multi-primary + gateway | 1 per cluster | East-west gateway per cluster | Non-flat networks; clusters in different VPCs/clouds |
# East-west gateway (for non-flat multi-cluster)
# Exposes services on port 15443 (TLS SNI-based routing)
# Install east-west gateway:
istioctl install -f east-west-gateway.yaml --context=cluster1
istioctl install -f east-west-gateway.yaml --context=cluster2
# Expose services across clusters:
kubectl apply -f expose-services.yaml --context=cluster1
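# Endpoint discovery: istiod reads Services and endpoints from each remote cluster's
# API server (access is granted through a remote secret); no ServiceEntry is created for this
# Cross-cluster load balancing:
# A Service with the same name and namespace in both clusters is treated as one service,
# and its endpoints from all clusters are merged into a single load balancing pool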
Linkerd
Linkerd is a CNCF graduated project focused on simplicity and minimal resource overhead. Its data plane proxy (linkerd2-proxy) is written in Rust and has a much smaller memory footprint than Envoy (roughly 10Mi versus 50Mi per proxy in the figures used on this page).
Linkerd Key Properties
- Proxy: Rust linkerd2-proxy (~10Mi RAM vs ~50Mi Envoy)
- Automatic mTLS with zero configuration
- No CRDs required for basic use (just linkerd inject)
- Control plane: destination, identity, proxy-injector
- ServiceProfile CRD for per-route metrics and retries
- Traffic splitting via SMI (TrafficSplit) or HTTPRoute
Linkerd Limitations
- Less feature-rich than Istio (no fault injection, limited circuit breaking)
- No ambient/sidecarless mode (as of 2025)
- Smaller ecosystem of integrations
- ServiceProfile requires manual route definition
# Install Linkerd
linkerd install --crds | kubectl apply -f -
linkerd install | kubectl apply -f -
linkerd check
# Inject sidecar into a namespace (annotation-based auto-inject):
kubectl annotate namespace production linkerd.io/inject=enabled
# Or inject manually:
kubectl get deploy -n production -o yaml | linkerd inject - | kubectl apply -f -
# ServiceProfile — per-route metrics and retry policy:
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: reviews.production.svc.cluster.local
  namespace: production
spec:
  routes:
  - name: GET /api/reviews
    condition:
      method: GET
      pathRegex: /api/reviews.*
    responseClasses:
    - condition:
        status:
          min: 500
          max: 599
      isFailure: true           # count 5xx as failures in success-rate metrics
    isRetryable: true           # route-level flag: failed requests on this route may be retried
    timeout: 5000ms             # per-route timeout
  - name: POST /api/reviews
    condition:
      method: POST
      pathRegex: /api/reviews
    isRetryable: false          # do not retry non-idempotent writes
# Traffic splitting (Gateway API HTTPRoute):
# Linkerd 2.14+ supports HTTPRoute for traffic splitting
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: reviews-split
  namespace: production
spec:
  parentRefs:
  - name: reviews
    kind: Service
    group: core
    port: 80
  rules:
  - backendRefs:
    - name: reviews-v1
      port: 80
      weight: 90
    - name: reviews-v2
      port: 80
      weight: 10
Cilium Service Mesh
Cilium can operate as a service mesh using eBPF for L4 (no sidecar needed) and optionally deploying a per-node Envoy proxy for L7 features. This gives a sidecarless mesh with kernel-level performance:
Cilium Service Mesh Modes
- eBPF-only (L4): mTLS via eBPF + SPIFFE; mutual auth without any proxy; lowest overhead
- Envoy per-node (L7): one Envoy DaemonSet per node; handles HTTP/gRPC routing, retries, circuit breaking; no per-pod sidecar
- Sidecar compatible: can run alongside Istio or Linkerd sidecars as CNI
# envoy.enabled: per-node Envoy for L7
# authentication.mutual.spire.*: SPIFFE/SPIRE integration
helm upgrade cilium cilium/cilium \
  --set serviceMesh.enabled=true \
  --set envoy.enabled=true \
  --set kubeProxyReplacement=true \
  --set authentication.mutual.spire.enabled=true \
  --set authentication.mutual.spire.install.enabled=true
# CiliumEnvoyConfig — configure per-node Envoy:
apiVersion: cilium.io/v2
kind: CiliumEnvoyConfig
metadata:
  name: http-policy
  namespace: production
spec:
  services:
  - name: reviews
    namespace: production
  resources:
  - "@type": type.googleapis.com/envoy.config.listener.v3.Listener
    # ... full Envoy xDS config
SPIFFE / SPIRE — Workload Identity
SPIFFE (Secure Production Identity Framework For Everyone) defines a standard for workload identity. SPIRE is the reference implementation. Istio's CA, Linkerd's identity service, and Cilium's auth mode all implement or integrate with SPIFFE.
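A SPIFFE ID is a URI of the form spiffe://trust-domain/path. On Kubernetes, meshes commonly encode the namespace and ServiceAccount in the path, as in the AuthorizationPolicy principal shown earlier. The sketch below parses that convention; the ns/.../sa/... layout is an assumption about the deployment, not something SPIFFE itself mandates, and the names are hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class KubeIdentity:
    trust_domain: str
    namespace: str
    service_account: str


def parse_spiffe_id(spiffe_id: str) -> KubeIdentity:
    """Parse spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>."""
    parsed = urlparse(spiffe_id)
    if parsed.scheme != "spiffe" or not parsed.netloc:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    parts = parsed.path.strip("/").split("/")
    # ASSUMPTION: the path follows the ns/<namespace>/sa/<service-account> convention.
    if len(parts) != 4 or parts[0] != "ns" or parts[2] != "sa":
        raise ValueError(f"unexpected SPIFFE ID path: {parsed.path!r}")
    return KubeIdentity(parsed.netloc, parts[1], parts[3])


identity = parse_spiffe_id("spiffe://cluster.local/ns/production/sa/frontend")
print(identity)
```

The trust domain (cluster.local here) scopes the identity: two workloads with the same namespace and ServiceAccount in different trust domains are different identities unless their trust bundles are federated.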
mTLS Migration: PERMISSIVE → STRICT
Migrating to STRICT mTLS without downtime requires a phased approach. All services must be enrolled in the mesh before flipping to STRICT — otherwise plaintext services are rejected:
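The gate before flipping a namespace to STRICT can be expressed as a simple readiness check, sketched below. The inputs are hypothetical: in practice the proxy flag would come from pod specs, and the plaintext count from telemetry (for example, requests whose connection_security_policy label is not mutual_tls). The names and the 24-hour window are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Workload:
    name: str
    has_proxy: bool                   # enrolled in the mesh (sidecar or ambient)
    plaintext_requests_last_24h: int  # inbound requests received without mTLS


def blockers_for_strict(workloads: List[Workload]) -> List[str]:
    """Return the reasons a namespace is not yet safe to switch to STRICT."""
    blockers = []
    for w in workloads:
        if not w.has_proxy:
            # Without a proxy it cannot present a certificate to STRICT peers.
            blockers.append(f"{w.name}: no mesh proxy")
        elif w.plaintext_requests_last_24h > 0:
            # Some client still connects without mTLS and would be rejected.
            blockers.append(f"{w.name}: still receives plaintext traffic")
    return blockers


# Hypothetical namespace inventory.
production = [
    Workload("frontend", has_proxy=True, plaintext_requests_last_24h=0),
    Workload("legacy-batch", has_proxy=False, plaintext_requests_last_24h=0),
    Workload("reviews", has_proxy=True, plaintext_requests_last_24h=12),
]
blockers = blockers_for_strict(production)
print("ready for STRICT" if not blockers else "\n".join(blockers))
```

Running a check like this per namespace while the policy is still PERMISSIVE gives a concrete exit criterion for each phase: switch to STRICT only once the blocker list is empty.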
Service Mesh Comparison
| Dimension | Istio (Sidecar) | Istio Ambient | Linkerd | Cilium Mesh |
|---|---|---|---|---|
| Data plane | Envoy per pod | ztunnel + waypoint | linkerd2-proxy per pod | eBPF + optional Envoy per node |
| Memory overhead | ~50Mi per pod | ~1 ztunnel per node | ~10Mi per pod | Minimal (eBPF in kernel) |
| Automatic mTLS | Yes (PERMISSIVE default) | Yes (L4 via ztunnel) | Yes (always on) | Yes (eBPF + SPIFFE) |
| L7 traffic management | Full (VirtualService/DR) | Yes (via waypoint) | Limited (ServiceProfile) | Via CiliumEnvoyConfig |
| Circuit breaking | Yes (outlier detection) | Yes (via waypoint) | Limited | Via Envoy config |
| Fault injection | Yes | Yes (via waypoint) | No | No |
| Observability | Full (Kiali, Jaeger, Prometheus) | Full | Good (Viz dashboard) | Hubble (eBPF-native) |
| Multi-cluster | Mature (4 topologies) | Maturing | Yes (multicluster extension) | Cluster Mesh (up to 255) |
| Pod restart needed | Yes (inject sidecar) | No | Yes (inject sidecar) | No |
| Learning curve | High (many CRDs) | Medium | Low | Medium |
| CNCF status | Graduated | Graduated | Graduated | Graduated |
Service Mesh vs NetworkPolicy
| Dimension | NetworkPolicy | Service Mesh AuthorizationPolicy |
|---|---|---|
| Layer | L3/L4 (IP + port) | L7 (HTTP method, path, JWT, headers) |
| Identity | Pod label + namespace label | SPIFFE workload identity (cryptographic) |
| Enforcement | Kernel (CNI eBPF/iptables) | User-space proxy (Envoy sidecar) |
| Overhead | Near-zero (kernel) | Proxy CPU/RAM per pod |
| Bypass risk | Cannot bypass (kernel enforced) | Can bypass if sidecar injection skipped |
| mTLS | No | Yes (automatic cert rotation) |
| External traffic | ipBlock for external CIDRs | ServiceEntry + AuthorizationPolicy |
NetworkPolicy at the kernel level blocks traffic that completely bypasses the mesh (e.g., a compromised pod that disables its sidecar). AuthorizationPolicy enforces L7 access control that NetworkPolicy cannot express. Running both layers provides defense in depth — an attacker must defeat both the CNI enforcement and the service mesh identity verification.
GAMMA — Gateway API for Mesh
GAMMA (Gateway API for Mesh Management and Administration) extends Gateway API to the east-west (service-to-service) mesh use case. Instead of creating Istio-specific VirtualService CRDs, services can use standard HTTPRoute to control mesh traffic:
# HTTPRoute for mesh traffic (parentRef = Service, not Gateway)
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: reviews-mesh-route
  namespace: production
spec:
  parentRefs:
  - group: ""
    kind: Service
    name: reviews               # parentRef is a Service = mesh traffic routing
    port: 80
  rules:
  - backendRefs:
    - name: reviews-v1
      port: 80
      weight: 90
    - name: reviews-v2
      port: 80
      weight: 10
Istio 1.16+ supports GAMMA HTTPRoute for mesh routing. Linkerd 2.14+ supports it natively as well (the smi-adaptor is only needed for SMI TrafficSplit resources). This enables portable mesh config that works across implementations.
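The weights in backendRefs are proportional rather than percentages: each backend receives weight divided by the sum of all weights. The sketch below computes the resulting shares and shows one way a proxy could map a uniform random point to a backend. The selection function is illustrative only and is not how any particular mesh implements it.

```python
import bisect
import itertools
from typing import Dict


def traffic_shares(backends: Dict[str, int]) -> Dict[str, float]:
    """Fraction of requests each backend receives: weight / sum(weights)."""
    total = sum(backends.values())
    if total <= 0:
        raise ValueError("at least one backend needs a positive weight")
    return {name: weight / total for name, weight in backends.items()}


def pick_backend(backends: Dict[str, int], point: float) -> str:
    """Map a point in [0, 1) to a backend using cumulative weights."""
    names = list(backends)
    cumulative = list(itertools.accumulate(backends[n] for n in names))
    index = bisect.bisect_right(cumulative, point * cumulative[-1])
    return names[index]


# Weights from the HTTPRoute above.
weights = {"reviews-v1": 90, "reviews-v2": 10}
shares = traffic_shares(weights)
print(shares)
print(pick_backend(weights, 0.50), pick_backend(weights, 0.95))
```

Because only the ratio matters, weights of 9 and 1 produce the same split as 90 and 10, and setting a weight to 0 removes a backend from rotation without deleting it from the route.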
Metrics, Alerting & Troubleshooting
Key Metrics (Istio / Prometheus)
| Metric | Description | Alert Threshold |
|---|---|---|
| istio_requests_total | Total requests by source/destination/response_code | rate spike or drop |
| istio_request_duration_milliseconds | Request latency histogram | P99 > SLO |
| envoy_cluster_upstream_rq_pending_overflow | Requests rejected by circuit breaker | > 0 for 5m |
| envoy_cluster_outlier_detection_ejections_active | Actively ejected hosts (outlier detection) | > 0 sustained |
| istio_tcp_connections_opened_total | TCP connections through mesh | rate anomaly |
| pilot_xds_push_errors | istiod xDS push failures | > 0 |
| pilot_proxy_convergence_time | Time for config to reach all proxies | P99 > 30s |
| response_total{classification="failure"} | Linkerd failed responses per route | error rate > 1% |
Alerting Rules
# High error rate for a destination service
- alert: MeshServiceHighErrorRate
  expr: |
    sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
    /
    sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Service {{ $labels.destination_service_name }} error rate > 5%"
# P99 latency SLO breach
- alert: MeshServiceHighLatency
  expr: |
    histogram_quantile(0.99,
      sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (le, destination_service_name)
    ) > 2000
  for: 10m
  labels:
    severity: warning
# Circuit breaker tripping
- alert: MeshCircuitBreakerOpen
  expr: increase(envoy_cluster_upstream_rq_pending_overflow[5m]) > 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Circuit breaker opened for {{ $labels.envoy_cluster_name }}"
# istiod xDS push errors
- alert: IstiodXDSPushErrors
  expr: rate(pilot_xds_push_errors[5m]) > 0
  for: 5m
  labels:
    severity: warning
Troubleshooting Runbooks
Runbook 1: RBAC: access denied (mTLS / AuthorizationPolicy)
# Check if source has a sidecar (must be enrolled in mesh)
istioctl proxy-status | grep <source-pod>
# Check AuthorizationPolicy applying to destination
kubectl get authorizationpolicy -n <dest-namespace>
# Debug auth decision
istioctl x authz check <dest-pod> -n <dest-namespace>
# Check whether the destination enforces STRICT mTLS while the source has no sidecar
# (istioctl authn tls-check was removed from recent istioctl releases)
istioctl x describe pod <dest-pod> -n <dest-namespace>
# View real-time access log for denied requests
kubectl logs <dest-pod> -c istio-proxy | grep "RBAC"
Runbook 2: VirtualService Routing Not Applied
# Check VirtualService is valid
istioctl analyze -n <namespace>
# Verify proxy has received the config
istioctl proxy-config routes <pod> -n <namespace>
# Look for the route matching your VirtualService
# Check DestinationRule subsets exist
istioctl proxy-config cluster <pod> -n <namespace> | grep <service>
# Subsets should appear as separate clusters
# Common issues:
# - hosts field doesn't match service FQDN or short name scope
# - gateways field is missing "mesh" for in-cluster routing
# - DestinationRule subset labels don't match pod labels
kubectl get pods -l version=v2 -n <namespace> # verify subset pods exist
Runbook 3: Sidecar Injection Not Happening
# Check namespace label
kubectl get namespace <ns> --show-labels | grep istio-injection
# Check MutatingWebhookConfiguration
kubectl get mutatingwebhookconfigurations istio-sidecar-injector -o yaml \
| grep namespaceSelector
# Check pod-level annotation override (can disable injection)
kubectl get pod <pod> -o jsonpath='{.metadata.annotations}'
# sidecar.istio.io/inject: "false" → overrides namespace label
# Trigger re-injection (restart deployment)
kubectl rollout restart deploy/<name> -n <namespace>
# Verify sidecar present after restart
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].name}'
# Should include: istio-proxy
Runbook 4: istiod Not Syncing Config to Proxies (high xDS push delay)
# Check istiod pod status and resources
kubectl top pod -n istio-system -l app=istiod
# Check connected proxy count vs istiod capacity (istiod serves its metrics on port 15014)
kubectl port-forward -n istio-system deploy/istiod 15014:15014 &
curl -s localhost:15014/metrics | grep -E '^pilot_xds[ {]'
# Check proxy convergence time (should be <30s P99)
curl -s localhost:15014/metrics | grep pilot_proxy_convergence_time
# Common causes:
# - istiod OOM: increase memory limits
# - Too many services/endpoints causing full xDS push storms
# Fix: limit each proxy's config scope with Sidecar resources and tune
# push debouncing (PILOT_DEBOUNCE_AFTER, PILOT_DEBOUNCE_MAX)
# - Network policy blocking istiod ↔ sidecar port 15012
Runbook 5: Linkerd — Service Metrics Missing
# Check linkerd proxy injection
linkerd check --proxy -n <namespace>
# Check if pods have the proxy
kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}: {.spec.containers[*].name}{"\n"}{end}' \
| grep linkerd-proxy
# If missing: re-annotate and restart
kubectl annotate namespace <ns> linkerd.io/inject=enabled
kubectl rollout restart deploy -n <namespace>
# View live traffic stats
linkerd viz stat deploy -n <namespace>
linkerd viz top deploy/<name> -n <namespace>
# Check for certificate expiry
linkerd check --proxy 2>&1 | grep -i cert
Best Practices
- Start with PERMISSIVE mTLS, then enforce STRICT namespace by namespace: never flip the entire cluster to STRICT on day one. Before switching each namespace, confirm that every client is enrolled in the mesh (for example with istioctl x describe pod, or by checking that no plaintext requests remain in the metrics).
- Set resource requests/limits on sidecars: the default proxy resources are generic and rarely match real traffic. In production, set global.proxy.resources.requests/limits in Helm values to prevent sidecars from starving application containers.
- Use outlier detection (circuit breaking) on all DestinationRules: without it, a slow or failing pod keeps receiving traffic until its endpoint is removed. Add consecutiveGatewayErrors: 5 and interval: 30s as a minimum baseline.
- Enable outboundTrafficPolicy: REGISTRY_ONLY in production: it forces all egress through declared ServiceEntries, which prevents data exfiltration to arbitrary external IPs and makes egress auditable.
- Use Istio ambient mode for large clusters (1000+ pods): sidecar overhead becomes significant at scale. Ambient mode can cut proxy memory by an order of magnitude or more, depending on pod density, and removes the pod restart requirement for mesh enrollment.
- Validate CRDs before applying: run istioctl analyze -n <namespace> and istioctl analyze --all-namespaces in CI pipelines. Invalid VirtualService or DestinationRule configs cause silent routing failures, not errors.
- Propagate trace headers in every service: Istio generates trace spans but cannot correlate them across services unless your application forwards B3 headers (x-b3-traceid, x-b3-spanid, x-request-id). Add header propagation middleware to every service.
- Use GAMMA (HTTPRoute with parentRef=Service) for new mesh routing: it is portable across Istio, Linkerd, and future implementations, and avoids Istio-specific VirtualService lock-in for basic traffic splitting.
- Back up Istio config state regularly: all mesh config (VirtualService, DestinationRule, PeerAuthentication, AuthorizationPolicy) is stored in etcd. Treat it as part of GitOps: store all mesh CRDs in version control and reconcile via ArgoCD/Flux.