Service Mesh

A service mesh moves cross-cutting concerns — mutual TLS, observability, retries, circuit breaking, traffic splitting — out of application code and into a dedicated infrastructure layer. This page covers sidecar and sidecarless architectures, Istio, Linkerd, Cilium service mesh, SPIFFE/SPIRE identity, and production patterns.

What This Page Covers

What a service mesh is; problems it solves without code changes
Sidecar vs sidecarless (ambient) architecture comparison
Istio architecture: istiod (Pilot+Citadel+Galley merged), Envoy data plane
Istio CRDs: VirtualService, DestinationRule, Gateway, ServiceEntry, Sidecar
Istio security: PeerAuthentication (STRICT/PERMISSIVE mTLS), AuthorizationPolicy (ALLOW/DENY/AUDIT/CUSTOM)
Istio traffic management: retries, timeouts, circuit breaking (outlier detection), fault injection, traffic mirroring
Istio observability: Prometheus metrics, Kiali topology, Jaeger/Zipkin tracing, access logs
Istio ambient mode: ztunnel (L4 per-node), waypoint proxy (L7 per-namespace, per-service, or per-workload)
Istio multi-cluster: single/multiple control planes, east-west gateway, flat network vs gateway topologies
Linkerd: Rust-based linkerd2-proxy, zero-config mTLS, ServiceProfile CRD, retries, timeouts, traffic splitting
Cilium service mesh: eBPF-based L4 mesh, optional Envoy sidecar for L7, Hubble observability
Consul Connect: service intentions, transparent proxy, mesh gateways
SPIFFE/SPIRE: SVIDs, X.509 and JWT formats, workload attestation, trust bundles
Traffic management patterns: canary, blue/green, A/B testing, dark launch (mirroring)
mTLS migration: PERMISSIVE → STRICT phase rollout
Service mesh vs NetworkPolicy: complementary, not alternatives
Mesh vs Gateway API: HTTPRoute with a Service parentRef for mesh use cases (GAMMA)
8 metrics + 4 alerting rules + 5 troubleshooting runbooks
9 best practices for production service mesh operation

What a Service Mesh Is — and Is Not

A service mesh is an infrastructure layer that intercepts all network traffic between services and applies policy, observability, and reliability logic — without any changes to application code. It solves problems that are otherwise duplicated in every service:

Problem	Without Mesh	With Mesh
Mutual TLS between services	Each app implements TLS with certificate rotation	Automatic; zero-code-change; cert rotation transparent
Service-to-service authorization	Application-level API keys or shared secrets	Identity-based (SPIFFE SVID); cryptographically verified
Retries, timeouts, circuit breaking	Each SDK/library re-implements per language	Uniform policy in mesh proxy; language-agnostic
Distributed tracing	Manual instrumentation in every service	Proxies emit spans automatically; the app only forwards trace headers (B3/W3C)
Traffic splitting / canary	Separate deployments with DNS trickery	Weighted routing at proxy layer; no DNS change needed
Observability (RED metrics)	Each service must expose /metrics	Proxy generates request rate, error rate, duration per route

⚠️

A service mesh is not a replacement for NetworkPolicy

NetworkPolicy operates at L3/L4 (IP + port) and is enforced in the kernel by the CNI. Service mesh policies operate at L7 (HTTP, gRPC) in a user-space proxy. They are complementary: NetworkPolicy provides coarse-grained firewall rules; service mesh provides fine-grained identity-based authorization and traffic management. Run both in production.

Sidecar vs Sidecarless Architecture

Sidecar Model (Istio, Linkerd)

Proxy injected as extra container in every pod
iptables rules redirect all traffic through proxy
Full L7 visibility per pod
Pod startup overhead (~100ms), resource overhead (~50Mi RAM, ~0.05 CPU per sidecar)
Requires pod restart to inject/upgrade proxy
Isolation: per-pod failure domain

Sidecarless / Ambient (Istio Ambient, Cilium)

No sidecar injected; no pod restart needed
L4 proxy (ztunnel/eBPF) runs per node
L7 proxy (waypoint) runs per namespace, service, or workload, and only when needed
Lower resource overhead; better for large clusters
Gradual adoption: enroll namespace without restarting pods
Blast radius: node-level failure affects all pods on node (ztunnel)

SIDECAR MODEL (Istio / Linkerd)
─────────────────────────────────────────────────────
  Pod A                              Pod B
┌──────────────────────┐          ┌──────────────────────┐
│ app-container        │          │ app-container        │
│ ┌──────────────────┐ │   mTLS   │ ┌──────────────────┐ │
│ │ Envoy sidecar    │─┼──────────┼─│ Envoy sidecar    │ │
│ │ (istio-proxy)    │ │          │ │ (istio-proxy)    │ │
│ └──────────────────┘ │          │ └──────────────────┘ │
└──────────────────────┘          └──────────────────────┘
        ↑ iptables REDIRECT (port 15001/15006)
        │
istiod (control plane): pushes xDS config to all proxies

AMBIENT MODEL (Istio Ambient)
─────────────────────────────────────────────────────
Node
┌──────────────────────────────────────────────────────┐
│ ztunnel (per-node DaemonSet)  ← L4 only: mTLS, authz │
│ ┌─────────────┐  ┌─────────────┐                     │
│ │ Pod A       │  │ Pod B       │                     │
│ │ app only    │  │ app only    │  ← no sidecar!      │
│ └─────────────┘  └─────────────┘                     │
└──────────────────────────────────────────────────────┘
        │ L7 needed?
        ▼
Waypoint proxy (per-namespace Deployment)  ← L7: VirtualService, AuthzPolicy HTTP

Istio

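Istio is the most feature-rich service mesh for Kubernetes. Since version 1.5, the control plane components Pilot, Citadel, and Galley (together with the sidecar injector) run as a single binary: istiod. Mixer was not merged; it was deprecated in the same release and later removed, with telemetry generation moving into the Envoy proxies.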

Control Plane Components

istiod

Pilot — xDS server; pushes route/cluster/listener config to Envoy proxies
Citadel — CA; issues SPIFFE X.509 SVIDs to workloads; rotates certs automatically (~24h TTL)
Galley — config validation and distribution; validates Istio CRDs
Port 15010 (gRPC xDS), 15012 (secure xDS), 15014 (control plane metrics), 8080 (debug)

Envoy Data Plane

istio-proxy sidecar: Envoy + Istio agent
Intercepts traffic via iptables REDIRECT (init container sets rules)
Port 15001 (outbound), 15006 (inbound), 15090 (Prometheus metrics), 15000 (admin)
xDS: LDS/RDS/CDS/EDS pushed from istiod; no polling
Cert refresh from istiod via SDS (Secret Discovery Service)

Installation

# Install via istioctl (recommended)
istioctl install --set profile=default    # minimal | default | demo | empty

# Verify installation
istioctl verify-install
kubectl get pods -n istio-system

# Enable sidecar injection for a namespace
kubectl label namespace production istio-injection=enabled

# Manual injection (without label):
istioctl kube-inject -f deployment.yaml | kubectl apply -f -

# Check proxy status across all pods
istioctl proxy-status

Istio CRDs — Traffic Management

VirtualService

VirtualService defines traffic routing rules for services within the mesh. It intercepts requests to a service's hostname and applies routing logic before the request reaches any pod:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews-vs
  namespace: production
spec:
  hosts:
    - reviews                    # short name = reviews.production.svc.cluster.local
    - reviews.example.com        # external hostname (when paired with Istio Gateway)
  gateways:
    - mesh                       # "mesh" = all sidecars in mesh
    - production/my-gateway      # also apply to external traffic via this Gateway
  http:
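    # Rule 1: header-based routing; requests carrying x-canary: "true" always go to v2
    - match:
        - headers:
            x-canary:
              exact: "true"       # header-based routing to canary
      route:
        - destination:
            host: reviews
            subset: v2
    # Rule 2 (all other requests): weighted canary split, 90% to v1 and 10% to v2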
    - route:
        - destination:
            host: reviews
            subset: v1
          weight: 90
        - destination:
            host: reviews
            subset: v2
          weight: 10
      timeout: 10s
      retries:
        attempts: 3
        perTryTimeout: 3s
        retryOn: "gateway-error,connect-failure,retriable-4xx"
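
The rule ordering in this VirtualService matters: the proxy evaluates http rules top to bottom, the first matching rule wins, and weights only apply inside the rule that matched. The sketch below is a small in-memory model of that behaviour, not Istio code. The RULES structure and the pick_subset function are hypothetical names introduced for illustration.

```python
import random

# Hypothetical in-memory model of the two http rules in reviews-vs.
RULES = [
    {   # rule 1: header match, all matching traffic goes to v2
        "match_headers": {"x-canary": "true"},
        "routes": [("v2", 100)],
    },
    {   # rule 2: no match clause, so it catches every remaining request
        "match_headers": {},
        "routes": [("v1", 90), ("v2", 10)],
    },
]


def pick_subset(headers, rng):
    """Return the subset for one request: first matching rule wins, then weights apply."""
    for rule in RULES:
        wanted = rule["match_headers"]
        if all(headers.get(name) == value for name, value in wanted.items()):
            subsets = [name for name, _ in rule["routes"]]
            weights = [weight for _, weight in rule["routes"]]
            return rng.choices(subsets, weights=weights, k=1)[0]
    return None  # no rule matched


if __name__ == "__main__":
    rng = random.Random(42)
    print("with x-canary header:", pick_subset({"x-canary": "true"}, rng))
    sample = [pick_subset({}, rng) for _ in range(10_000)]
    print("share of v2 without header:", sample.count("v2") / len(sample))
```

Because the header rule comes first, a tester can always reach v2 deterministically while ordinary traffic sees only the 10% split.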

DestinationRule

DestinationRule defines subsets (for canary routing) and traffic policies applied after routing (load balancing, circuit breaking, TLS settings to upstream):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews-dr
  namespace: production
spec:
  host: reviews
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST        # ROUND_ROBIN | LEAST_REQUEST | RANDOM | PASSTHROUGH (LEAST_CONN is deprecated)
    connectionPool:
      tcp:
        maxConnections: 100
        connectTimeout: 30ms
      http:
        http2MaxRequests: 1000
        maxRequestsPerConnection: 10
    outlierDetection:              # circuit breaker
      consecutiveGatewayErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50       # never eject more than 50% of hosts
      minHealthPercent: 50
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
      trafficPolicy:               # per-subset override
        loadBalancer:
          simple: ROUND_ROBIN

Istio Gateway (vs Kubernetes Ingress)

Istio Gateway manages inbound/outbound traffic at the mesh boundary. It configures the Istio ingress gateway pods (Envoy). Do not confuse it with the Gateway and GatewayClass resources of the Kubernetes Gateway API:

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: my-gateway
  namespace: production
spec:
  selector:
    istio: ingressgateway            # targets the istio-ingressgateway pods
  servers:
    - port:
        number: 443
        name: https
        protocol: HTTPS
      tls:
        mode: SIMPLE                 # SIMPLE | MUTUAL | PASSTHROUGH | AUTO_PASSTHROUGH
        credentialName: my-tls-cert  # references a Secret
      hosts:
        - "*.example.com"
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "*.example.com"
      tls:
        httpsRedirect: true

ServiceEntry — External Services

ServiceEntry adds external services to Istio's service registry, enabling traffic policies, mTLS, and observability for egress traffic:

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: stripe-api
  namespace: production
spec:
  hosts:
    - api.stripe.com
  ports:
    - number: 443
      name: https
      protocol: HTTPS
  location: MESH_EXTERNAL           # MESH_EXTERNAL | MESH_INTERNAL
  resolution: DNS
---
# Block all external traffic except explicit ServiceEntries:
# Set outboundTrafficPolicy: REGISTRY_ONLY in MeshConfig
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    outboundTrafficPolicy:
      mode: REGISTRY_ONLY           # default: ALLOW_ANY

Istio Security

PeerAuthentication — mTLS Policy

# Namespace-wide STRICT mTLS (all traffic must be mTLS)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT                   # STRICT | PERMISSIVE | DISABLE

---
# Workload-level and port-level exception (useful during migration).
# Fragment: apiVersion, kind, and metadata as in the policy above.
spec:
  selector:
    matchLabels:
      app: legacy-app
  mtls:
    mode: PERMISSIVE               # namespace STRICT, but this app accepts plaintext
  portLevelMtls:
    8080:
      mode: DISABLE                # specific port allows plaintext

---
# Mesh-wide STRICT (applies to all namespaces). Fragment: apiVersion and kind as above.
metadata:
  name: default
  namespace: istio-system          # root namespace = mesh-wide policy
spec:
  mtls:
    mode: STRICT
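
The three scopes above combine by specificity: a workload-level policy (with a selector) overrides the namespace-wide policy, which overrides the mesh-wide policy, and a portLevelMtls entry overrides the workload mode for that one port. A minimal sketch of that resolution follows. The dict layout and the function names are assumptions made for illustration, not an Istio API.

```python
def effective_mtls_mode(port, workload=None, namespace=None, mesh=None):
    """Resolve the mTLS mode for one inbound port; the narrowest scope wins.

    Each policy argument is a dict such as
    {"mode": "PERMISSIVE", "ports": {8080: "DISABLE"}}, or None when no
    policy exists at that scope (hypothetical layout for this sketch).
    """
    if workload is not None:
        port_mode = workload.get("ports", {}).get(port)
        if port_mode is not None:
            return port_mode          # port-level override
        if workload.get("mode") is not None:
            return workload["mode"]   # workload-level policy
    for policy in (namespace, mesh):
        if policy is not None and policy.get("mode") is not None:
            return policy["mode"]
    return "PERMISSIVE"               # nothing configured at any scope


def accepts_plaintext(mode):
    """Only STRICT refuses plaintext connections."""
    return mode != "STRICT"


if __name__ == "__main__":
    ns_strict = {"mode": "STRICT"}
    legacy_app = {"mode": "PERMISSIVE", "ports": {8080: "DISABLE"}}
    for port in (8080, 9090):
        mode = effective_mtls_mode(port, workload=legacy_app, namespace=ns_strict)
        print(port, mode, "plaintext ok" if accepts_plaintext(mode) else "mTLS only")
```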

AuthorizationPolicy

AuthorizationPolicy enforces access control at the sidecar level using service identity (SPIFFE) and request attributes. Four actions: ALLOW, DENY, AUDIT, CUSTOM (external auth):

# Allow only frontend service to call reviews on GET /api/*
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: reviews-authz
  namespace: production
spec:
  selector:
    matchLabels:
      app: reviews
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              # SPIFFE identity: cluster.local/ns/production/sa/frontend
              - "cluster.local/ns/production/sa/frontend"
      to:
        - operation:
            methods: ["GET"]
            paths: ["/api/*"]
      when:
        - key: request.headers[x-api-version]
          values: ["v1", "v2"]

---
# Deny all except explicitly allowed (default deny):
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-all
  namespace: production
spec:
  {}                               # empty spec = deny all

---
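# JWT-based authorization (fragment). requestPrincipals is only populated when a
# RequestAuthentication policy validates the token for this workload.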
spec:
  rules:
    - from:
        - source:
            requestPrincipals: ["https://accounts.google.com/*"]
      when:
        - key: request.auth.claims[role]
          values: ["admin"]
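
How several policies combine explains why an empty spec denies everything. Istio documents the order as: a matching DENY policy rejects the request; if no ALLOW policy targets the workload, the request is allowed; otherwise at least one ALLOW rule must match. The sketch below models only that ordering. It leaves out CUSTOM and AUDIT actions and the when conditions, and the dict layout and function names are hypothetical.

```python
def matches_value(pattern, value):
    """Istio-style string match: exact, prefix ("/api/*"), suffix ("*/x"), or presence ("*")."""
    if pattern == "*":
        return value != ""
    if pattern.endswith("*"):
        return value.startswith(pattern[:-1])
    if pattern.startswith("*"):
        return value.endswith(pattern[1:])
    return value == pattern


def rule_matches(rule, request):
    """A rule matches when every field it sets matches; unset fields match anything."""
    checks = (("principals", "principal"), ("methods", "method"), ("paths", "path"))
    for field, attr in checks:
        patterns = rule.get(field)
        if patterns and not any(matches_value(p, request.get(attr, "")) for p in patterns):
            return False
    return True


def decide(policies, request):
    """Apply the DENY-then-ALLOW ordering to the policies selecting one workload."""
    deny = [p for p in policies if p["action"] == "DENY"]
    allow = [p for p in policies if p["action"] == "ALLOW"]
    if any(rule_matches(r, request) for p in deny for r in p["rules"]):
        return "DENY"
    if not allow:
        return "ALLOW"   # no ALLOW policy exists, so nothing restricts the workload
    if any(rule_matches(r, request) for p in allow for r in p["rules"]):
        return "ALLOW"
    return "DENY"        # ALLOW policies exist but none matched


if __name__ == "__main__":
    reviews_authz = {"action": "ALLOW", "rules": [{
        "principals": ["cluster.local/ns/production/sa/frontend"],
        "methods": ["GET"],
        "paths": ["/api/*"],
    }]}
    deny_all = {"action": "ALLOW", "rules": []}   # empty spec: ALLOW with no rules
    request = {"principal": "cluster.local/ns/production/sa/frontend",
               "method": "GET", "path": "/api/reviews"}
    print(decide([reviews_authz, deny_all], request))
    print(decide([deny_all], request))
```

The deny-all policy never matches anything, but its presence means an ALLOW policy exists, so every request not matched by another ALLOW rule falls through to the final deny.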

Traffic Management Patterns

Fault Injection

Fault injection deliberately introduces delays and errors to test system resilience — chaos engineering at the proxy layer without modifying application code:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ratings-fault
  namespace: production
spec:
  hosts: [ratings]
  http:
    - match:
        - headers:
            x-test-user:
              exact: "chaos"       # only inject for specific test header
      fault:
        delay:
          percentage:
            value: 50              # 50% of requests get a 5s delay
          fixedDelay: 5s
        abort:
          percentage:
            value: 10              # 10% of requests get HTTP 503
          httpStatus: 503
      route:
        - destination:
            host: ratings

Traffic Mirroring (Dark Launch)

Mirror a percentage of live traffic to a shadow service — requests are duplicated, responses from the mirror are ignored. Ideal for testing new versions under real traffic without user impact:

spec:
  http:
    - route:
        - destination:
            host: product-service
            subset: v1
          weight: 100
      mirror:
        host: product-service
        subset: v2             # shadow target; receives copy of every request
      mirrorPercentage:
        value: 100.0           # mirror 100% of traffic (can be lower, e.g. 10.0)

Circuit Breaking (Outlier Detection)

Defined in DestinationRule's outlierDetection (see earlier). When a host accumulates consecutive errors, it is ejected from the load balancing pool for a backoff period:

# Full outlier detection example:
outlierDetection:
  splitExternalLocalOriginErrors: true    # separate local-origin errors from upstream
  consecutiveLocalOriginFailures: 3       # 3 connection failures → eject
  consecutiveGatewayErrors: 5             # 5 consecutive 502/503/504 responses → eject
  consecutive5xxErrors: 5                 # 5 consecutive 5xx responses of any kind → eject
  interval: 10s                           # check window
  baseEjectionTime: 30s                   # first ejection duration
  maxEjectionPercent: 33                  # max % of hosts ejectable at once
  minHealthPercent: 50                    # stop ejecting if fewer than 50% healthy
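
The ejection logic can be sketched as a small state machine. This is a simplified model of Envoy-style outlier detection, not Envoy's implementation: it tracks one consecutive-error counter per host, lengthens each ejection by the number of times that host has been ejected, and caps concurrent ejections at maxEjectionPercent (while still allowing at least one). The class name and its parameters are hypothetical, and minHealthPercent and the interval sweep are left out.

```python
class OutlierDetector:
    """Simplified consecutive-error ejection for a fixed set of hosts."""

    def __init__(self, hosts, consecutive_errors=5, base_ejection_s=30,
                 max_ejection_percent=33):
        self.consecutive_errors = consecutive_errors
        self.base_ejection_s = base_ejection_s
        self.max_ejection_percent = max_ejection_percent
        self.errors = {h: 0 for h in hosts}         # consecutive error count
        self.times_ejected = {h: 0 for h in hosts}  # drives the growing ejection time
        self.ejected_until = {}                     # host -> time it may return

    def record(self, host, ok, now):
        """Record one response; return True if this call ejected the host."""
        if ok:
            self.errors[host] = 0
            return False
        self.errors[host] += 1
        if self.errors[host] < self.consecutive_errors:
            return False
        self.errors[host] = 0
        if self.ejected_until.get(host, 0) > now:
            return False  # already ejected
        active = sum(1 for until in self.ejected_until.values() if until > now)
        allowed = max(1, len(self.errors) * self.max_ejection_percent // 100)
        if active >= allowed:
            return False  # cap reached: keep the host in rotation
        self.times_ejected[host] += 1
        self.ejected_until[host] = now + self.base_ejection_s * self.times_ejected[host]
        return True

    def healthy_hosts(self, now):
        return [h for h in self.errors if self.ejected_until.get(h, 0) <= now]


if __name__ == "__main__":
    detector = OutlierDetector(["a", "b", "c"])
    for _ in range(5):
        detector.record("a", ok=False, now=0)
    print(detector.healthy_hosts(now=10))
    print(detector.healthy_hosts(now=31))
```

The cap is what keeps a widespread failure from emptying the pool: once the allowed share is ejected, further failing hosts stay in rotation.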

Istio Ambient Mode

Ambient mode (GA in Istio 1.24) removes sidecars entirely. Traffic is intercepted at the node level by ztunnel (a per-node Rust DaemonSet) for L4, and optionally by waypoint proxies (Envoy Deployments, typically one per namespace, that can also be scoped to individual services or workloads) for L7.

Enrollment:
kubectl label namespace production istio.io/dataplane-mode=ambient

L4 layer (always present): ztunnel DaemonSet (one per node, Rust)
  → HBONE tunnel (HTTP/2 CONNECT, port 15008) between nodes
  → Enforces: mTLS, L4 AuthorizationPolicy, traffic metrics

L7 layer (optional): waypoint Deployment (Envoy)
  → Enforces: HTTP routing (VirtualService), L7 AuthorizationPolicy, fault injection
  → Only for traffic to namespaces, services, or workloads labelled istio.io/use-waypoint

istioctl waypoint list -n production                                          # list waypoints (they are Gateway resources)
istioctl waypoint apply -n production --enroll-namespace                      # create namespace waypoint
istioctl waypoint apply -n production --name reviews-waypoint --for service   # named waypoint (example name); bind it with the istio.io/use-waypoint label

✅

Ambient mode advantages for large clusters

No pod restart needed to enroll — just label the namespace
Far lower memory overhead than per-pod sidecars (1 ztunnel per node vs N sidecars for N pods); the actual ratio depends on pod density and on how many waypoints are deployed
Rolling upgrades: upgrade ztunnel without touching application pods
L7 features are opt-in: only namespaces that need HTTP-level policy pay the waypoint cost
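
A quick back-of-the-envelope calculation shows why the memory saving depends on pod density. The sketch below uses the ~50Mi per-sidecar figure quoted earlier on this page. The ztunnel and waypoint footprints are placeholder inputs, not measurements, and the function names are hypothetical.

```python
def sidecar_memory_mi(pods, per_sidecar_mi=50):
    """Total proxy memory in the sidecar model: one proxy per pod."""
    return pods * per_sidecar_mi


def ambient_memory_mi(nodes, per_ztunnel_mi, waypoints=0, per_waypoint_mi=0):
    """Total proxy memory in ambient mode: one ztunnel per node plus optional waypoints."""
    return nodes * per_ztunnel_mi + waypoints * per_waypoint_mi


if __name__ == "__main__":
    # Hypothetical cluster: 1000 pods on 20 nodes with 5 waypoints.
    # The 100Mi ztunnel and waypoint figures are placeholders; measure your own.
    sidecars = sidecar_memory_mi(pods=1000)
    ambient = ambient_memory_mi(nodes=20, per_ztunnel_mi=100,
                                waypoints=5, per_waypoint_mi=100)
    print(f"sidecars: {sidecars} Mi, ambient: {ambient} Mi, "
          f"ratio: {sidecars / ambient:.0f}x")
```

Rerunning it with fewer pods per node, or with a waypoint in every namespace, shrinks the ratio quickly, so a single headline multiplier is best treated as an upper bound.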

Istio Observability

Standard Metrics (Envoy → Prometheus)

Every Envoy sidecar exposes Prometheus metrics at :15090/stats/prometheus, and ztunnel serves its own metrics endpoint. The standard Istio metrics below can be customized through the Telemetry API:

# Key Istio metrics:
# Request rate:
sum(rate(istio_requests_total[5m])) by (destination_service, response_code)

# P99 latency:
histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (le, destination_service))

# Error rate:
sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service)
/
sum(rate(istio_requests_total[5m])) by (destination_service)

# TCP connections opened per second (the metric is a counter, so take its rate):
sum(rate(istio_tcp_connections_opened_total[5m])) by (destination_service)
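
The P99 query relies on histogram_quantile, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket where the target rank falls. The sketch below reproduces that estimate for a single series, as a simplified reading of the documented behaviour rather than Prometheus code. The bucket values in the example are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    with the last upper bound being +Inf, as in a Prometheus histogram.
    """
    if not buckets or not 0 <= q <= 1:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # Hypothetical latency buckets in milliseconds: (le, cumulative count).
    buckets = [(100, 50), (250, 90), (500, 99), (math.inf, 100)]
    for q in (0.5, 0.95, 0.99):
        print(q, round(histogram_quantile(q, buckets), 1))
```

The estimate can never be more precise than the bucket boundaries, which is why an SLO threshold should sit on or near a bucket edge.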

Telemetry API — Custom Metrics & Access Logs

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: custom-metrics
  namespace: production
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: REQUEST_COUNT
          tagOverrides:
            request_id:
              value: "request.headers['x-request-id'] | 'unknown'"
  accessLogging:
    - providers:
        - name: envoy             # built-in envoy access log
      filter:
        expression: "response.code >= 400"   # only log errors
  tracing:
    - providers:
        - name: jaeger
      randomSamplingPercentage: 1.0          # 1% sampling

Istio Multi-Cluster

Topology	Control Planes	Network	Use Case
Single primary	1 (manages all clusters)	Flat (pods reach each other directly)	Dev/staging; tight coupling acceptable
Primary-remote	1 primary, remotes have no istiod	Flat	Central control; reduced resource use on small clusters
Multi-primary	1 per cluster (replicated config)	Flat or gateway	HA; independent failure domains; recommended for prod
Multi-primary + gateway	1 per cluster	East-west gateway per cluster	Non-flat networks; clusters in different VPCs/clouds

# East-west gateway (for non-flat multi-cluster)
# Exposes services on port 15443 (TLS SNI-based routing)
# Install east-west gateway:
istioctl install -f east-west-gateway.yaml --context=cluster1
istioctl install -f east-west-gateway.yaml --context=cluster2

# Expose services across clusters:
kubectl apply -f expose-services.yaml --context=cluster1
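# No per-service ServiceEntry is needed: each istiod discovers remote endpoints
# by watching the other cluster's API server through a remote secret
# (created with: istioctl create-remote-secret --context=cluster2 --name=cluster2)

# Cross-cluster load balancing:
# A Service with the same name and namespace in both clusters is treated as one service
# Istio merges the endpoints from both clusters into a single load-balancing pool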

Linkerd

Linkerd is a CNCF graduated project focused on simplicity and minimal resource overhead. Its data plane proxy (linkerd2-proxy) is written in Rust and has a markedly smaller memory footprint than Envoy (roughly 10Mi versus 50Mi per proxy in the figures used on this page).

Linkerd Key Properties

Proxy: Rust linkerd2-proxy (~10Mi RAM vs ~50Mi Envoy)
Automatic mTLS with zero configuration
No CRDs required for basic use (just linkerd inject)
Control plane: destination, identity, proxy-injector
ServiceProfile CRD for per-route metrics and retries
Traffic splitting via SMI (TrafficSplit) or HTTPRoute

Linkerd Limitations

Less feature-rich than Istio (no fault injection, limited circuit breaking)
No ambient/sidecarless mode (as of 2025)
Smaller ecosystem of integrations
ServiceProfile requires manual route definition

# Install Linkerd
linkerd install --crds | kubectl apply -f -
linkerd install | kubectl apply -f -
linkerd check

# Inject sidecar into a namespace (annotation-based auto-inject):
kubectl annotate namespace production linkerd.io/inject=enabled

# Or inject manually:
kubectl get deploy -n production -o yaml | linkerd inject - | kubectl apply -f -

# ServiceProfile — per-route metrics and retry policy:
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: reviews.production.svc.cluster.local
  namespace: production
spec:
  routes:
    - name: GET /api/reviews
      condition:
        method: GET
        pathRegex: /api/reviews.*
      responseClasses:
        - condition:
            status:
              min: 500
          isFailure: true        # treat 5xx as failure for retry decisions
      isRetryable: true
      timeout: 5000ms            # per-route timeout
    - name: POST /api/reviews
      condition:
        method: POST
        pathRegex: /api/reviews
      isRetryable: false         # do not retry non-idempotent writes

# Traffic splitting (Gateway API HTTPRoute):
# Linkerd 2.14+ supports HTTPRoute for traffic splitting
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: reviews-split
  namespace: production          # no linkerd.io/inject annotation here: injection applies to workloads, not routes
spec:
  parentRefs:
    - name: reviews
      kind: Service
      group: core
  rules:
    - backendRefs:
        - name: reviews-v1
          port: 80
          weight: 90
        - name: reviews-v2
          port: 80
          weight: 10
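
The ServiceProfile above decides retry eligibility per route, so it helps to see how a request is mapped to a route. The sketch below models that lookup under the assumption that pathRegex must match the whole path (which is why the GET route ends in .* while the POST route does not). It ignores Linkerd's retry budget, and the ROUTES list and function names are hypothetical.

```python
import re

# Hypothetical in-memory copy of the two routes from the ServiceProfile.
ROUTES = [
    {"name": "GET /api/reviews", "method": "GET",
     "path_regex": r"/api/reviews.*", "retryable": True},
    {"name": "POST /api/reviews", "method": "POST",
     "path_regex": r"/api/reviews", "retryable": False},
]


def match_route(method, path):
    """Return the first route whose method and full-path regex match, else None."""
    for route in ROUTES:
        if route["method"] == method and re.fullmatch(route["path_regex"], path):
            return route
    return None


def should_retry(method, path, status):
    """Retry only failures (5xx here) on routes marked retryable."""
    route = match_route(method, path)
    return bool(route and route["retryable"] and status >= 500)


if __name__ == "__main__":
    print(match_route("GET", "/api/reviews/42"))
    print(should_retry("GET", "/api/reviews", 503))
    print(should_retry("POST", "/api/reviews", 503))
```

Requests that match no route still flow through the proxy; they simply get no per-route metrics, timeout, or retry behaviour.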

Cilium Service Mesh

Cilium can operate as a service mesh using eBPF for L4 (no sidecar needed) and optionally deploying a per-node Envoy proxy for L7 features. This gives a sidecarless mesh with kernel-level performance:

Cilium Service Mesh Modes

eBPF-only (L4): mTLS via eBPF + SPIFFE; mutual auth without any proxy; lowest overhead
Envoy per-node (L7): one Envoy DaemonSet per node; handles HTTP/gRPC routing, retries, circuit breaking; no per-pod sidecar
Sidecar compatible: can run alongside Istio or Linkerd sidecars as CNI

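# envoy.enabled: per-node Envoy for L7
# authentication.mutual.spire.*: SPIFFE/SPIRE integration
# Comments cannot follow a trailing backslash, so they are listed up here.
# Value names differ between chart versions; check them against your chart's values.yaml.
helm upgrade cilium cilium/cilium \
  --set serviceMesh.enabled=true \
  --set envoy.enabled=true \
  --set kubeProxyReplacement=true \
  --set authentication.mutual.spire.enabled=true \
  --set authentication.mutual.spire.install.enabled=true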

# CiliumEnvoyConfig — configure per-node Envoy:
apiVersion: cilium.io/v2
kind: CiliumEnvoyConfig
metadata:
  name: http-policy
  namespace: production
spec:
  services:
    - name: reviews
      namespace: production
  resources:
    - "@type": type.googleapis.com/envoy.config.listener.v3.Listener
      # ... full Envoy xDS config

SPIFFE / SPIRE — Workload Identity

SPIFFE (Secure Production Identity Framework For Everyone) defines a standard for workload identity. SPIRE is the reference implementation. Istio's CA, Linkerd's identity service, and Cilium's auth mode all implement or integrate with SPIFFE.

SPIFFE Identity (SVID: SPIFFE Verifiable Identity Document)
───────────────────────────────────────────────────────────────
Format (Kubernetes convention used by Istio):
  spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>
Example:
  spiffe://cluster.local/ns/production/sa/checkout-service

Two SVID types:
  X.509 SVID: TLS certificate with SPIFFE URI in SAN field
    → used for mTLS between services
  JWT SVID: short-lived JWT token with SPIFFE subject
    → used for HTTP bearer token auth

SPIRE Architecture:
  SPIRE Server (control plane)
    → stores registration entries (workload selectors → SPIFFE IDs)
    → signs SVIDs with the trust domain's CA key and publishes the trust bundle used to verify them
  SPIRE Agent (DaemonSet on each node)
    → attests workload identity (k8s: pod UID, SA, namespace)
    → serves SVIDs to workloads via Workload API (Unix socket)

Istio integration:
  istiod acts as a SPIFFE-compatible CA
  Envoy fetches SVIDs via SDS from Istio agent (no SPIRE server needed)
  Or: external SPIRE + Istio CA integration for cross-cluster identity
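
The identity format above is regular enough to parse mechanically, which is useful when turning a certificate's URI SAN into the principal string that AuthorizationPolicy expects. The sketch below assumes the Kubernetes-style ns/<namespace>/sa/<service-account> path shown above; generic SPIFFE IDs may use other paths. WorkloadIdentity, parse_spiffe_id, and to_istio_principal are hypothetical helper names.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class WorkloadIdentity(NamedTuple):
    trust_domain: str
    namespace: str
    service_account: str


def parse_spiffe_id(spiffe_id):
    """Split a Kubernetes-style SPIFFE ID into its three components."""
    parts = urlsplit(spiffe_id)
    if parts.scheme != "spiffe" or not parts.netloc:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    segments = parts.path.strip("/").split("/")
    well_formed = (len(segments) == 4 and all(segments)
                   and segments[0] == "ns" and segments[2] == "sa")
    if not well_formed:
        raise ValueError(f"unexpected SPIFFE path: {parts.path!r}")
    return WorkloadIdentity(parts.netloc, segments[1], segments[3])


def to_istio_principal(identity):
    """Istio principals are the SPIFFE ID without the spiffe:// prefix."""
    return f"{identity.trust_domain}/ns/{identity.namespace}/sa/{identity.service_account}"


if __name__ == "__main__":
    identity = parse_spiffe_id("spiffe://cluster.local/ns/production/sa/checkout-service")
    print(identity)
    print(to_istio_principal(identity))
```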

mTLS Migration: PERMISSIVE → STRICT

Migrating to STRICT mTLS without downtime requires a phased approach. All services must be enrolled in the mesh before flipping to STRICT — otherwise plaintext services are rejected:

Phase 1: Install mesh, set PERMISSIVE globally
→ Mesh accepts both mTLS and plaintext
→ No service disruption

Phase 2: Enroll namespaces gradually
kubectl label namespace team-a istio-injection=enabled
kubectl rollout restart deployment -n team-a
→ Verify all pods have sidecars: istioctl proxy-status

Phase 3: Monitor traffic for plaintext
# pilot-agent lives in the istio-proxy sidecar container, not in istiod
kubectl exec <pod> -n team-a -c istio-proxy -- \
  pilot-agent request GET stats | grep ssl.handshake
# Or in Kiali: Services → check mTLS lock icon

Phase 4: Switch to STRICT per-namespace
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: team-a
spec:
  mtls:
    mode: STRICT
EOF
→ Any plaintext caller is now rejected at the TLS layer (the connection is reset). "RBAC: access denied" comes from AuthorizationPolicy, not from STRICT mTLS.

Phase 5: Mesh-wide STRICT
Apply PeerAuthentication in istio-system namespace
→ Verify with: istioctl x describe pod <pod> (istioctl authn tls-check is no longer available in recent releases)

Service Mesh Comparison

Dimension	Istio (Sidecar)	Istio Ambient	Linkerd	Cilium Mesh
Data plane	Envoy per pod	ztunnel + waypoint	linkerd2-proxy per pod	eBPF + optional Envoy per node
Memory overhead	~50Mi per pod	~1 ztunnel per node	~10Mi per pod	Minimal (eBPF in kernel)
Automatic mTLS	Yes (PERMISSIVE default)	Yes (L4 via ztunnel)	Yes (always on)	Yes (eBPF + SPIFFE)
L7 traffic management	Full (VirtualService/DR)	Yes (via waypoint)	Limited (ServiceProfile)	Via CiliumEnvoyConfig
Circuit breaking	Yes (outlier detection)	Yes (via waypoint)	Limited	Via Envoy config
Fault injection	Yes	Yes (via waypoint)	No	No
Observability	Full (Kiali, Jaeger, Prometheus)	Full	Good (Viz dashboard)	Hubble (eBPF-native)
Multi-cluster	Mature (4 topologies)	Maturing	Yes (multicluster extension)	Cluster Mesh (up to 255)
Pod restart needed	Yes (inject sidecar)	No	Yes (inject sidecar)	No
Learning curve	High (many CRDs)	Medium	Low	Medium
CNCF status	Graduated	Graduated	Graduated	Graduated

Service Mesh vs NetworkPolicy

Dimension	NetworkPolicy	Service Mesh AuthorizationPolicy
Layer	L3/L4 (IP + port)	L7 (HTTP method, path, JWT, headers)
Identity	Pod label + namespace label	SPIFFE workload identity (cryptographic)
Enforcement	Kernel (CNI eBPF/iptables)	User-space proxy (Envoy sidecar)
Overhead	Near-zero (kernel)	Proxy CPU/RAM per pod
Bypass risk	Cannot bypass (kernel enforced)	Can bypass if sidecar injection skipped
mTLS	No	Yes (automatic cert rotation)
External traffic	ipBlock for external CIDRs	ServiceEntry + AuthorizationPolicy

ℹ️

Defense in depth: use both

NetworkPolicy at the kernel level blocks traffic that completely bypasses the mesh (e.g., a compromised pod that disables its sidecar). AuthorizationPolicy enforces L7 access control that NetworkPolicy cannot express. Running both layers provides defense in depth — an attacker must defeat both the CNI enforcement and the service mesh identity verification.

GAMMA — Gateway API for Mesh

GAMMA (Gateway API for Mesh Management and Administration) extends Gateway API to the east-west (service-to-service) mesh use case. Instead of creating Istio-specific VirtualService CRDs, services can use standard HTTPRoute to control mesh traffic:

# HTTPRoute for mesh traffic (parentRef = Service, not Gateway)
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: reviews-mesh-route
  namespace: production
spec:
  parentRefs:
    - group: ""
      kind: Service
      name: reviews              # parentRef is a Service = mesh traffic routing
      port: 80
  rules:
    - backendRefs:
        - name: reviews-v1
          port: 80
          weight: 90
        - name: reviews-v2
          port: 80
          weight: 10

Istio 1.16+ supports GAMMA HTTPRoute for mesh routing. Linkerd 2.14+ also supports HTTPRoute with a Service parentRef natively; the SMI extension is only needed for the older TrafficSplit resource. This enables portable mesh config that works across implementations.

Metrics, Alerting & Troubleshooting

Key Metrics (Istio / Prometheus)

Metric	Description	Alert Threshold
`istio_requests_total`	Total requests by source/destination/response_code	rate spike or drop
`istio_request_duration_milliseconds`	Request latency histogram	P99 > SLO
`envoy_cluster_upstream_rq_pending_overflow`	Requests rejected by circuit breaker	> 0 for 5m
`envoy_cluster_outlier_detection_ejections_active`	Actively ejected hosts (outlier detection)	> 0 sustained
`istio_tcp_connections_opened_total`	TCP connections through mesh	rate anomaly
`pilot_xds_push_errors`	istiod xDS push failures	> 0
`pilot_proxy_convergence_time`	Time for config to reach all proxies	P99 > 30s
`response_total{classification="failure"}`	Linkerd proxy responses classified as failures	error rate > 1%

Alerting Rules

# High error rate for a destination service
- alert: MeshServiceHighErrorRate
  expr: |
    sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
    /
    sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Service {{ $labels.destination_service_name }} error rate > 5%"

# P99 latency SLO breach
- alert: MeshServiceHighLatency
  expr: |
    histogram_quantile(0.99,
      sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (le, destination_service_name)
    ) > 2000
  for: 10m
  labels:
    severity: warning

# Circuit breaker tripping
- alert: MeshCircuitBreakerOpen
  expr: increase(envoy_cluster_upstream_rq_pending_overflow[5m]) > 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Circuit breaker opened for {{ $labels.envoy_cluster_name }}"

# istiod xDS push errors
- alert: IstiodXDSPushErrors
  expr: rate(pilot_xds_push_errors[5m]) > 0
  for: 5m
  labels:
    severity: warning

Troubleshooting Runbooks

Runbook 1: RBAC: access denied (mTLS / AuthorizationPolicy)

# Check if source has a sidecar (must be enrolled in mesh)
istioctl proxy-status | grep <source-pod>

# Check AuthorizationPolicy applying to destination
kubectl get authorizationpolicy -n <dest-namespace>

# Debug auth decision
istioctl x authz check <dest-pod> -n <dest-namespace>

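# Check if PeerAuthentication is STRICT but source has no sidecar
# (istioctl authn tls-check is no longer available in recent releases; describe reports the effective mTLS settings)
istioctl x describe pod <dest-pod> -n <dest-namespace>

# View real-time access log for denied requests (the log flag is lower case: rbac_access_denied)
kubectl logs <dest-pod> -c istio-proxy -n <dest-namespace> | grep -i rbac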

Runbook 2: VirtualService Routing Not Applied

# Check VirtualService is valid
istioctl analyze -n <namespace>

# Verify proxy has received the config
istioctl proxy-config routes <pod> -n <namespace>
# Look for the route matching your VirtualService

# Check DestinationRule subsets exist
istioctl proxy-config cluster <pod> -n <namespace> | grep <service>
# Subsets should appear as separate clusters

# Common issues:
# - hosts field doesn't match service FQDN or short name scope
# - gateways field is missing "mesh" for in-cluster routing
# - DestinationRule subset labels don't match pod labels
kubectl get pods -l version=v2 -n <namespace>  # verify subset pods exist

Runbook 3: Sidecar Injection Not Happening

# Check namespace label
kubectl get namespace <ns> --show-labels | grep istio-injection

# Check MutatingWebhookConfiguration
kubectl get mutatingwebhookconfigurations istio-sidecar-injector -o yaml \
  | grep namespaceSelector

# Check pod-level annotation override (can disable injection)
kubectl get pod <pod> -o jsonpath='{.metadata.annotations}'
# sidecar.istio.io/inject: "false" → overrides namespace label

# Trigger re-injection (restart deployment)
kubectl rollout restart deploy/<name> -n <namespace>

# Verify sidecar present after restart
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].name}'
# Should include: istio-proxy

Runbook 4: istiod Not Syncing Config to Proxies (high xDS push delay)

# Check istiod pod status and resources
kubectl top pod -n istio-system -l app=istiod

# Check connected proxies and their sync state (SYNCED / STALE)
istioctl proxy-status

# Check proxy convergence time (should be <30s P99)
# istiod serves its own metrics on port 15014; pilot-agent exists only in the sidecar
kubectl -n istio-system port-forward deploy/istiod 15014:15014 &
curl -s localhost:15014/metrics | grep -E 'pilot_proxy_convergence_time|pilot_inbound_updates'

# Common causes:
# - istiod OOM: increase memory limits
# - Too many services/endpoints causing full xDS push storms
#   Fix: narrow each proxy's scope with the Sidecar CRD and tune push debouncing
#   (PILOT_DEBOUNCE_AFTER, PILOT_DEBOUNCE_MAX, PILOT_ENABLE_EDS_DEBOUNCE)
# - Network policy blocking istiod ↔ sidecar port 15012

Runbook 5: Linkerd — Service Metrics Missing

# Check linkerd proxy injection
linkerd check --proxy -n <namespace>

# Check if pods have the proxy
kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}: {.spec.containers[*].name}{"\n"}{end}' \
  | grep linkerd-proxy

# If missing: re-annotate and restart
kubectl annotate namespace <ns> linkerd.io/inject=enabled
kubectl rollout restart deploy -n <namespace>

# View live traffic stats
linkerd viz stat deploy -n <namespace>
linkerd viz top deploy/<name> -n <namespace>

# Check for certificate expiry
linkerd check --proxy 2>&1 | grep -i cert

Best Practices

Start with PERMISSIVE mTLS, enforce STRICT per-namespace gradually: never flip the entire cluster to STRICT on day one. Migrate namespace by namespace, and before switching each one confirm that every caller has a proxy (istioctl proxy-status) and inspect the effective policy with istioctl x describe pod.
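Set resource requests/limits on sidecars: the injected proxy ships with generic defaults (100m CPU and 128Mi memory requests in the default profile) that rarely fit real workloads. In production, set global.proxy.resources.requests/limits in Helm values so that sidecars neither starve application containers nor get throttled themselves.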
Use outlier detection (circuit breaking) on all DestinationRules — without it, a slow/failing pod continues receiving traffic until endpoints age out. Add consecutiveGatewayErrors: 5 and interval: 30s as a minimum baseline.
Enable outboundTrafficPolicy: REGISTRY_ONLY in production — forces all egress through declared ServiceEntries. Prevents data exfiltration to arbitrary external IPs and makes egress auditable.
Use Istio ambient mode for large clusters (1000+ pods): sidecar overhead becomes significant at scale. Ambient mode replaces one proxy per pod with one ztunnel per node plus optional waypoints, so the memory saving grows with pod density, and it eliminates pod restart requirements for mesh enrollment.
Validate CRDs before applying — run istioctl analyze -n <namespace> and istioctl analyze --all-namespaces in CI pipelines. Invalid VirtualService or DestinationRule configs cause silent routing failures, not errors.
Propagate trace headers in every service: Istio generates trace spans but cannot correlate them across services unless your application forwards the trace headers it receives (B3: x-b3-traceid, x-b3-spanid, x-b3-parentspanid, x-b3-sampled; or W3C traceparent) together with x-request-id. Add header propagation middleware to every service.
Use GAMMA (HTTPRoute with parentRef=Service) for new mesh routing — portable across Istio, Linkerd, and future implementations. Avoids Istio-specific VirtualService lock-in for basic traffic splitting.
Back up Istio config state regularly — all mesh config (VirtualService, DestinationRule, PeerAuthentication, AuthorizationPolicy) is stored in etcd. Treat it as part of GitOps — store all mesh CRDs in version control and reconcile via ArgoCD/Flux.