Service Mesh

A service mesh moves cross-cutting concerns — mutual TLS, observability, retries, circuit breaking, traffic splitting — out of application code and into a dedicated infrastructure layer. This page covers sidecar and sidecarless architectures, Istio, Linkerd, Cilium service mesh, SPIFFE/SPIRE identity, and production patterns.

What This Page Covers
  • What a service mesh is; problems it solves without code changes
  • Sidecar vs sidecarless (ambient) architecture comparison
  • Istio architecture: istiod (Pilot+Citadel+Galley merged), Envoy data plane
  • Istio CRDs: VirtualService, DestinationRule, Gateway, ServiceEntry, Sidecar
  • Istio security: PeerAuthentication (STRICT/PERMISSIVE mTLS), AuthorizationPolicy (ALLOW/DENY/AUDIT/CUSTOM)
  • Istio traffic management: retries, timeouts, circuit breaking (outlier detection), fault injection, traffic mirroring
  • Istio observability: Prometheus metrics, Kiali topology, Jaeger/Zipkin tracing, access logs
  • Istio ambient mode: ztunnel (L4 per-node), waypoint proxy (L7, per namespace or per service/workload)
  • Istio multi-cluster: single/multiple control planes, east-west gateway, flat network vs gateway topologies
  • Linkerd: Rust-based linkerd2-proxy, zero-config mTLS, ServiceProfile CRD, retries, timeouts, traffic splitting
  • Cilium service mesh: eBPF-based L4 mesh, optional Envoy sidecar for L7, Hubble observability
  • Consul Connect: service intentions, transparent proxy, mesh gateways
  • SPIFFE/SPIRE: SVIDs, X.509 and JWT formats, workload attestation, trust bundles
  • Traffic management patterns: canary, blue/green, A/B testing, dark launch (mirroring)
  • mTLS migration: PERMISSIVE → STRICT phase rollout
  • Service mesh vs NetworkPolicy: complementary, not alternatives
  • Mesh vs Gateway API: HTTPRoute with a Service parentRef for mesh use cases (GAMMA)
  • 8 metrics + 4 alerting rules + 5 troubleshooting runbooks
  • 9 best practices for production service mesh operation

What a Service Mesh Is — and Is Not

A service mesh is an infrastructure layer that intercepts all network traffic between services and applies policy, observability, and reliability logic — without any changes to application code. It solves problems that are otherwise duplicated in every service:

Problem | Without Mesh | With Mesh
Mutual TLS between services | Each app implements TLS with certificate rotation | Automatic; zero-code-change; cert rotation transparent
Service-to-service authorization | Application-level API keys or shared secrets | Identity-based (SPIFFE SVID); cryptographically verified
Retries, timeouts, circuit breaking | Each SDK/library re-implements per language | Uniform policy in mesh proxy; language-agnostic
Distributed tracing | Manual instrumentation in every service | Proxies emit spans automatically; apps only need to forward trace headers (B3/W3C)
Traffic splitting / canary | Separate deployments with DNS trickery | Weighted routing at proxy layer; no DNS change needed
Observability (RED metrics) | Each service must expose /metrics | Proxy generates request rate, error rate, duration per route
⚠️
A service mesh is not a replacement for NetworkPolicy

NetworkPolicy operates at L3/L4 (IP + port) and is enforced in the kernel by the CNI. Service mesh policies operate at L7 (HTTP, gRPC) in a user-space proxy. They are complementary: NetworkPolicy provides coarse-grained firewall rules; service mesh provides fine-grained identity-based authorization and traffic management. Run both in production.

Sidecar vs Sidecarless Architecture

Sidecar Model (Istio, Linkerd)

  • Proxy injected as extra container in every pod
  • iptables rules redirect all traffic through proxy
  • Full L7 visibility per pod
  • Pod startup overhead (~100ms), resource overhead (~50Mi RAM, ~0.05 CPU per sidecar)
  • Requires pod restart to inject/upgrade proxy
  • Isolation: per-pod failure domain

Sidecarless / Ambient (Istio Ambient, Cilium)

  • No sidecar injected; no pod restart needed
  • L4 proxy (ztunnel/eBPF) runs per node
  • L7 proxy (waypoint) runs per namespace, or per service/workload, and only when needed
  • Lower resource overhead; better for large clusters
  • Gradual adoption: enroll namespace without restarting pods
  • Blast radius: node-level failure affects all pods on node (ztunnel)

SIDECAR MODEL (Istio / Linkerd)
─────────────────────────────────────────────────────
Pod A                             Pod B
┌──────────────────────┐          ┌──────────────────────┐
│ app-container        │          │ app-container        │
│ ┌──────────────────┐ │   mTLS   │ ┌──────────────────┐ │
│ │ Envoy sidecar    │─┼──────────┼─│ Envoy sidecar    │ │
│ │ (istio-proxy)    │ │          │ │ (istio-proxy)    │ │
│ └──────────────────┘ │          │ └──────────────────┘ │
└──────────────────────┘          └──────────────────────┘
  ↑ iptables REDIRECT (port 15001/15006)
  │
  istiod (control plane): pushes xDS config to all proxies

AMBIENT MODEL (Istio Ambient)
─────────────────────────────────────────────────────
Node
┌──────────────────────────────────────────────────────┐
│ ztunnel (per-node DaemonSet)  ← L4 only: mTLS, authz │
│ ┌─────────────┐  ┌─────────────┐                     │
│ │ Pod A       │  │ Pod B       │                     │
│ │ app only    │  │ app only    │  ← no sidecar!      │
│ └─────────────┘  └─────────────┘                     │
└──────────────────────────────────────────────────────┘
          │ L7 needed?
          ▼
Waypoint proxy (per-namespace Deployment)  ← L7: VirtualService, AuthzPolicy HTTP
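
To make the per-pod versus per-node trade-off concrete, here is a rough back-of-the-envelope sketch. The ~50Mi sidecar figure comes from this page; the ztunnel footprint (`ztunnel_mib`) is a placeholder assumption, not a measured value, and waypoint proxies are ignored.

```python
def mesh_memory_mib(pods: int, nodes: int,
                    sidecar_mib: float = 50.0,
                    ztunnel_mib: float = 50.0) -> dict:
    """Rough data-plane memory estimate: one proxy per pod vs one ztunnel per node.

    sidecar_mib uses the ~50Mi figure quoted on this page; ztunnel_mib is an
    assumed placeholder. Waypoint proxies (L7) are not included.
    """
    if pods < 0 or nodes <= 0:
        raise ValueError("pods must be >= 0 and nodes must be > 0")
    sidecar_total = pods * sidecar_mib
    ambient_total = nodes * ztunnel_mib
    return {
        "sidecar_mib": sidecar_total,
        "ambient_mib": ambient_total,
        "ratio": sidecar_total / ambient_total,
    }


estimate = mesh_memory_mib(pods=1000, nodes=20)
print(estimate)
```

The ratio depends almost entirely on pod density per node, which is why any single "N times less memory" figure should be read as an order-of-magnitude claim.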

Istio

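Istio is the most feature-rich service mesh for Kubernetes. Since version 1.5, the control plane components Pilot, Citadel, and Galley (plus the sidecar injector) are merged into a single binary: istiod. Mixer was not merged; it was deprecated in the same release, and its telemetry and policy functions moved into the Envoy proxies.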

Control Plane Components

istiod

  • Pilot — xDS server; pushes route/cluster/listener config to Envoy proxies
  • Citadel — CA; issues SPIFFE X.509 SVIDs to workloads; rotates certs automatically (~24h TTL)
  • Galley — config validation and distribution; validates Istio CRDs
  • Port 15010 (gRPC xDS), 15012 (secure xDS), 15014 (control plane metrics), 8080 (debug)

Envoy Data Plane

  • istio-proxy sidecar: Envoy + Istio agent
  • Intercepts traffic via iptables REDIRECT (init container sets rules)
  • Port 15001 (outbound), 15006 (inbound), 15090 (Prometheus metrics), 15000 (admin)
  • xDS: LDS/RDS/CDS/EDS pushed from istiod; no polling
  • Cert refresh from istiod via SDS (Secret Discovery Service)

Installation

# Install via istioctl (recommended)
istioctl install --set profile=default    # minimal | default | demo | empty

# Verify installation
istioctl verify-install
kubectl get pods -n istio-system

# Enable sidecar injection for a namespace
kubectl label namespace production istio-injection=enabled

# Manual injection (without label):
istioctl kube-inject -f deployment.yaml | kubectl apply -f -

# Check proxy status across all pods
istioctl proxy-status

Istio CRDs — Traffic Management

VirtualService

VirtualService defines traffic routing rules for services within the mesh. It intercepts requests to a service's hostname and applies routing logic before the request reaches any pod:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews-vs
  namespace: production
spec:
  hosts:
    - reviews                    # short name = reviews.production.svc.cluster.local
    - reviews.example.com        # external hostname (when paired with Istio Gateway)
  gateways:
    - mesh                       # "mesh" = all sidecars in mesh
    - production/my-gateway      # also apply to external traffic via this Gateway
  http:
    # Rule 1: header-based routing; requests with x-canary: "true" always go to v2
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: reviews
            subset: v2
    # Rule 2 (default): weight-based canary, 90% to v1 and 10% to v2
    - route:
        - destination:
            host: reviews
            subset: v1
          weight: 90
        - destination:
            host: reviews
            subset: v2
          weight: 10
      timeout: 10s
      retries:
        attempts: 3
        perTryTimeout: 3s
        retryOn: "gateway-error,connect-failure,retriable-4xx"
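
The routing decision encoded by this VirtualService can be sketched in a few lines of Python. This is only an illustration of "first matching rule wins, then pick a destination by weight"; the `RULES` structure and `pick_subset` function are hypothetical names for this sketch, not part of Istio or Envoy.

```python
import random

# Mirrors the reviews-vs rules: (header match or None, [(subset, weight), ...]).
RULES = [
    ({"x-canary": "true"}, [("v2", 100)]),
    (None, [("v1", 90), ("v2", 10)]),
]


def pick_subset(headers: dict, rng: random.Random, rules=RULES) -> str:
    """Return the destination subset for one request."""
    for match, destinations in rules:
        # Rules are evaluated in order; the first rule whose match holds wins.
        if match is None or all(headers.get(k) == v for k, v in match.items()):
            subsets = [name for name, _ in destinations]
            weights = [weight for _, weight in destinations]
            return rng.choices(subsets, weights=weights, k=1)[0]
    raise LookupError("no route matched")


rng = random.Random(42)  # seeded only so the sketch is repeatable
sample = [pick_subset({}, rng) for _ in range(10_000)]
v2_share = sample.count("v2") / len(sample)
print(f"v2 share without the header: {v2_share:.3f}")
print("with x-canary header:", pick_subset({"x-canary": "true"}, rng))
```

Because the split is probabilistic per request, a 10% weight converges on 10% of traffic only over many requests; a single client can still see several v2 responses in a row.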

DestinationRule

DestinationRule defines subsets (for canary routing) and traffic policies applied after routing (load balancing, circuit breaking, TLS settings to upstream):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews-dr
  namespace: production
spec:
  host: reviews
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST        # ROUND_ROBIN | LEAST_REQUEST | RANDOM | PASSTHROUGH (LEAST_CONN is deprecated)
    connectionPool:
      tcp:
        maxConnections: 100
        connectTimeout: 30ms
      http:
        http2MaxRequests: 1000
        maxRequestsPerConnection: 10
    outlierDetection:              # circuit breaker
      consecutiveGatewayErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50       # never eject more than 50% of hosts
      minHealthPercent: 50
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
      trafficPolicy:               # per-subset override
        loadBalancer:
          simple: ROUND_ROBIN

Istio Gateway (vs Kubernetes Ingress)

Istio Gateway manages inbound/outbound traffic at the mesh boundary. It configures the Istio ingress gateway pods (Envoy) and is a different API from the Kubernetes Gateway API's Gateway resource (gateway.networking.k8s.io):

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: my-gateway
  namespace: production
spec:
  selector:
    istio: ingressgateway            # targets the istio-ingressgateway pods
  servers:
    - port:
        number: 443
        name: https
        protocol: HTTPS
      tls:
        mode: SIMPLE                 # SIMPLE | MUTUAL | PASSTHROUGH | AUTO_PASSTHROUGH
        credentialName: my-tls-cert  # references a Secret
      hosts:
        - "*.example.com"
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "*.example.com"
      tls:
        httpsRedirect: true

ServiceEntry — External Services

ServiceEntry adds external services to Istio's service registry, enabling traffic policies, TLS origination, and observability for egress traffic:

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: stripe-api
  namespace: production
spec:
  hosts:
    - api.stripe.com
  ports:
    - number: 443
      name: https
      protocol: HTTPS
  location: MESH_EXTERNAL           # MESH_EXTERNAL | MESH_INTERNAL
  resolution: DNS
---
# Block all external traffic except explicit ServiceEntries:
# Set outboundTrafficPolicy: REGISTRY_ONLY in MeshConfig
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    outboundTrafficPolicy:
      mode: REGISTRY_ONLY           # default: ALLOW_ANY

Istio Security

PeerAuthentication — mTLS Policy

# Namespace-wide STRICT mTLS (all traffic must be mTLS)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT                   # STRICT | PERMISSIVE | DISABLE
---
# Workload- and port-level exception (useful during migration):
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: legacy-app-exception       # illustrative name
  namespace: production
spec:
  selector:
    matchLabels:
      app: legacy-app
  mtls:
    mode: PERMISSIVE               # namespace is STRICT, but this app also accepts plaintext
  portLevelMtls:
    8080:
      mode: DISABLE                # mTLS disabled on this port (plaintext only)
---
# Mesh-wide STRICT (applies to all namespaces):
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system          # root namespace = mesh-wide policy
spec:
  mtls:
    mode: STRICT
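
How the three scopes above combine for a given workload and port can be sketched as follows. This is a simplified model, assuming the usual precedence (workload selector, then namespace-wide, then mesh-wide, then the PERMISSIVE default) and ignoring UNSET inheritance; `PeerAuthn` and `effective_mode` are hypothetical names for this sketch.

```python
from dataclasses import dataclass, field
from typing import Optional

ROOT_NAMESPACE = "istio-system"  # assumed mesh root namespace


@dataclass
class PeerAuthn:
    namespace: str
    mode: str
    selector: Optional[dict] = None      # matchLabels; None = namespace-wide
    port_modes: dict = field(default_factory=dict)


def effective_mode(policies, namespace: str, labels: dict, port: int) -> str:
    """Resolve the mTLS mode: workload policy > namespace policy > mesh policy."""
    def selects(policy):
        return policy.selector is not None and all(
            labels.get(k) == v for k, v in policy.selector.items())

    for p in policies:  # 1. workload-specific policy, with port-level override
        if p.namespace == namespace and selects(p):
            return p.port_modes.get(port, p.mode)
    for p in policies:  # 2. namespace-wide policy
        if p.namespace == namespace and p.selector is None:
            return p.mode
    for p in policies:  # 3. mesh-wide policy in the root namespace
        if p.namespace == ROOT_NAMESPACE and p.selector is None:
            return p.mode
    return "PERMISSIVE"  # default when no policy applies


POLICIES = [
    PeerAuthn("istio-system", "STRICT"),
    PeerAuthn("production", "STRICT"),
    PeerAuthn("production", "PERMISSIVE", {"app": "legacy-app"}, {8080: "DISABLE"}),
]

print(effective_mode(POLICIES, "production", {"app": "legacy-app"}, 8080))
print(effective_mode(POLICIES, "production", {"app": "reviews"}, 9080))
```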

AuthorizationPolicy

AuthorizationPolicy enforces access control at the sidecar level using service identity (SPIFFE) and request attributes. Four actions: ALLOW, DENY, AUDIT, CUSTOM (external auth):

# Allow only frontend service to call reviews on GET /api/*
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: reviews-authz
  namespace: production
spec:
  selector:
    matchLabels:
      app: reviews
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              # SPIFFE identity: cluster.local/ns/production/sa/frontend
              - "cluster.local/ns/production/sa/frontend"
      to:
        - operation:
            methods: ["GET"]
            paths: ["/api/*"]
      when:
        - key: request.headers[x-api-version]
          values: ["v1", "v2"]

---
# Deny all except explicitly allowed (default deny):
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-all
  namespace: production
spec:
  {}                               # empty spec = deny all

---
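# JWT-based authorization (requires a RequestAuthentication that validates the token)
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: require-admin-jwt          # illustrative name
  namespace: production
spec:
  action: ALLOW
  rules:
    - from:
        - source:
            requestPrincipals: ["https://accounts.google.com/*"]
      when:
        - key: request.auth.claims[role]
          values: ["admin"]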
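
The way ALLOW and DENY policies combine is easy to get wrong, so here is a minimal sketch of the evaluation order as it is commonly described: a matching DENY wins, no ALLOW policies means allow, otherwise at least one ALLOW policy must match. CUSTOM and AUDIT actions, `when` conditions, and namespaces are left out; all class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    principals: tuple = ()   # empty tuple = any value
    methods: tuple = ()
    paths: tuple = ()


@dataclass(frozen=True)
class Policy:
    action: str              # "ALLOW" or "DENY"
    rules: tuple = ()        # no rules = the policy never matches


@dataclass(frozen=True)
class Request:
    principal: str
    method: str
    path: str


def path_matches(pattern: str, path: str) -> bool:
    """Exact, prefix ("/api/*"), suffix ("*/info") or presence ("*") match."""
    if pattern == "*":
        return True
    if pattern.endswith("*"):
        return path.startswith(pattern[:-1])
    if pattern.startswith("*"):
        return path.endswith(pattern[1:])
    return path == pattern


def rule_matches(rule: Rule, req: Request) -> bool:
    return ((not rule.principals or req.principal in rule.principals)
            and (not rule.methods or req.method in rule.methods)
            and (not rule.paths or any(path_matches(p, req.path) for p in rule.paths)))


def decide(policies, req: Request) -> str:
    def matched(policy):
        return any(rule_matches(r, req) for r in policy.rules)

    if any(p.action == "DENY" and matched(p) for p in policies):
        return "DENY"
    allow_policies = [p for p in policies if p.action == "ALLOW"]
    if not allow_policies:
        return "ALLOW"   # no ALLOW policy selects the workload
    return "ALLOW" if any(matched(p) for p in allow_policies) else "DENY"


REVIEWS_AUTHZ = Policy("ALLOW", (Rule(
    principals=("cluster.local/ns/production/sa/frontend",),
    methods=("GET",), paths=("/api/*",)),))
frontend = "cluster.local/ns/production/sa/frontend"
print(decide([REVIEWS_AUTHZ], Request(frontend, "GET", "/api/reviews")))
print(decide([Policy("ALLOW")], Request(frontend, "GET", "/api/reviews")))
```

The second call shows why an empty-spec policy behaves as deny-all: it is an ALLOW policy that can never match, so every request falls through to the final deny.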

Traffic Management Patterns

Fault Injection

Fault injection deliberately introduces delays and errors to test system resilience — chaos engineering at the proxy layer without modifying application code:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ratings-fault
  namespace: production
spec:
  hosts: [ratings]
  http:
    - match:
        - headers:
            x-test-user:
              exact: "chaos"       # only inject for specific test header
      fault:
        delay:
          percentage:
            value: 50              # 50% of requests get a 5s delay
          fixedDelay: 5s
        abort:
          percentage:
            value: 10              # 10% of requests get HTTP 503
          httpStatus: 503
      route:
        - destination:
            host: ratings

Traffic Mirroring (Dark Launch)

Mirror a percentage of live traffic to a shadow service — requests are duplicated, responses from the mirror are ignored. Ideal for testing new versions under real traffic without user impact:

spec:
  http:
    - route:
        - destination:
            host: product-service
            subset: v1
          weight: 100
      mirror:
        host: product-service
        subset: v2             # shadow target; receives copy of every request
      mirrorPercentage:
        value: 100.0           # mirror 100% of traffic (can be lower, e.g. 10.0)

Circuit Breaking (Outlier Detection)

Defined in DestinationRule's outlierDetection (see earlier). When a host accumulates consecutive errors, it is ejected from the load balancing pool for a backoff period:

# Full outlier detection example:
outlierDetection:
  splitExternalLocalOriginErrors: true    # count local-origin errors (connect failures, timeouts) separately from upstream responses
  consecutiveLocalOriginFailures: 3       # 3 consecutive local-origin failures → eject
  consecutiveGatewayErrors: 5             # 5 consecutive 502/503/504 responses → eject
  consecutive5xxErrors: 5                 # 5 consecutive 5xx responses of any kind → eject
  interval: 10s                           # how often hosts are evaluated for ejection
  baseEjectionTime: 30s                   # first ejection duration; grows with repeated ejections
  maxEjectionPercent: 33                  # max % of hosts ejectable at once
  minHealthPercent: 50                    # outlier detection is disabled when fewer than 50% of hosts are healthy
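
The core of outlier detection (count consecutive errors per host, eject at the threshold, respect the ejection cap) can be sketched as below. This is a deliberately reduced model: it ignores the evaluation interval, timed un-ejection, and minHealthPercent, and the `OutlierDetector` class is a hypothetical name for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class OutlierDetector:
    hosts: list
    consecutive_errors: int = 5       # like consecutive5xxErrors
    max_ejection_percent: int = 50    # like maxEjectionPercent
    errors: dict = field(default_factory=dict)
    ejected: set = field(default_factory=set)

    def record(self, host: str, ok: bool) -> None:
        """Record one response; any success resets the consecutive-error streak."""
        if ok:
            self.errors[host] = 0
            return
        self.errors[host] = self.errors.get(host, 0) + 1
        if self.errors[host] >= self.consecutive_errors:
            self._try_eject(host)

    def _try_eject(self, host: str) -> None:
        if host in self.ejected:
            return
        # Simplification: the cap is floor(hosts * percent / 100).
        limit = len(self.hosts) * self.max_ejection_percent // 100
        if len(self.ejected) < limit:
            self.ejected.add(host)
            self.errors[host] = 0

    def healthy(self) -> list:
        return [h for h in self.hosts if h not in self.ejected]


detector = OutlierDetector(hosts=["a", "b", "c", "d"])
for _ in range(5):
    detector.record("a", ok=False)
print(detector.healthy())
```

The ejection cap is what keeps a correlated failure (for example, a bad dependency that makes every host return 5xx) from emptying the load-balancing pool.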

Istio Ambient Mode

Ambient mode (GA in Istio 1.24) removes sidecars entirely. Traffic is intercepted at the node level by ztunnel (a per-node Rust DaemonSet) for L4, and optionally by waypoint proxies (Envoy Deployments, usually one per namespace, optionally scoped to specific services or workloads) for L7.

Enrollment:
  kubectl label namespace production istio.io/dataplane-mode=ambient

L4 layer (always present): ztunnel DaemonSet (one per node, Rust)
  → HBONE tunnel (HTTP/2 CONNECT, port 15008) between nodes
  → Enforces: mTLS, L4 AuthorizationPolicy, traffic metrics

L7 layer (optional): waypoint Deployment (Envoy, scaled independently of the applications)
  → Enforces: HTTP routing (VirtualService), L7 AuthorizationPolicy, fault injection
  → Only for traffic destined to the namespace, services, or workloads bound to that waypoint

istioctl waypoint list -n production                        # list waypoints
istioctl waypoint apply -n production --enroll-namespace    # create and bind a namespace waypoint
istioctl waypoint apply -n production --name reviews-waypoint --for service   # additional waypoint; bind it with the istio.io/use-waypoint label

Ambient mode advantages for large clusters
  • No pod restart needed to enroll — just label the namespace
  • Much lower memory overhead than per-pod sidecars (one ztunnel per node instead of one proxy per pod); the saving grows with pod density
  • Rolling upgrades: upgrade ztunnel without touching application pods
  • L7 features are opt-in: only namespaces that need HTTP-level policy pay the waypoint cost

Istio Observability

Standard Metrics (Envoy → Prometheus)

Every Envoy sidecar exposes Prometheus metrics at :15090/stats/prometheus (ztunnel serves its own metrics on port 15020). Istio also generates higher-level metrics, which can be customized via the Telemetry API:

# Key Istio metrics:
# Request rate:
sum(rate(istio_requests_total[5m])) by (destination_service, response_code)

# P99 latency:
histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (le, destination_service))

# Error rate:
sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service)
/
sum(rate(istio_requests_total[5m])) by (destination_service)

# Connection open rate (TCP); the metric is a cumulative counter, so wrap it in rate():
sum(rate(istio_tcp_connections_opened_total[5m])) by (destination_service)
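
The P99 query above relies on histogram_quantile, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. A minimal sketch of that estimation follows; the bucket values are made up for illustration, and the function is a simplified re-implementation, not Prometheus code.

```python
import math


def histogram_quantile(q: float, buckets) -> float:
    """Estimate quantile q from [(upper_bound, cumulative_count), ...].

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical latency buckets in milliseconds (le, cumulative count).
BUCKETS = [(100, 50), (250, 90), (500, 99), (1000, 100), (math.inf, 100)]
print(histogram_quantile(0.50, BUCKETS))
print(histogram_quantile(0.95, BUCKETS))
```

Because of the interpolation, the reported P99 can never be more precise than the bucket boundaries, which matters when an SLO threshold sits in the middle of a wide bucket.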

Telemetry API — Custom Metrics & Access Logs

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: custom-metrics
  namespace: production
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: REQUEST_COUNT
          tagOverrides:
            request_id:
              value: "request.headers['x-request-id'] | 'unknown'"
  accessLogging:
    - providers:
        - name: envoy             # built-in envoy access log
      filter:
        expression: "response.code >= 400"   # only log errors
  tracing:
    - providers:
        - name: jaeger
      randomSamplingPercentage: 1.0          # 1% sampling

Istio Multi-Cluster

Topology | Control Planes | Network | Use Case
Single primary | 1 (manages all clusters) | Flat (pods reach each other directly) | Dev/staging; tight coupling acceptable
Primary-remote | 1 primary, remotes have no istiod | Flat | Central control; reduced resource use on small clusters
Multi-primary | 1 per cluster (replicated config) | Flat or gateway | HA; independent failure domains; recommended for prod
Multi-primary + gateway | 1 per cluster | East-west gateway per cluster | Non-flat networks; clusters in different VPCs/clouds

# East-west gateway (for non-flat multi-cluster)
# Exposes services on port 15443 (TLS SNI-based routing)
# Install east-west gateway:
istioctl install -f east-west-gateway.yaml --context=cluster1
istioctl install -f east-west-gateway.yaml --context=cluster2

# Expose services across clusters:
kubectl apply -f expose-services.yaml --context=cluster1
# Each istiod discovers remote endpoints through a remote secret
# (istioctl create-remote-secret); no hand-written ServiceEntry is required

# Cross-cluster load balancing:
# A service with the same name and namespace in both clusters is treated as one service;
# Istio merges the endpoints and load-balances across clusters
# (through the east-west gateway on non-flat networks)

Linkerd

Linkerd is a CNCF graduated project focused on simplicity and minimal resource overhead. Its data plane proxy (linkerd2-proxy) is written in Rust and has a much smaller memory footprint than Envoy (roughly 10Mi versus 50Mi per proxy, using the figures on this page).

Linkerd Key Properties

  • Proxy: Rust linkerd2-proxy (~10Mi RAM vs ~50Mi Envoy)
  • Automatic mTLS with zero configuration
  • No CRDs required for basic use (just linkerd inject)
  • Control plane: destination, identity, proxy-injector
  • ServiceProfile CRD for per-route metrics and retries
  • Traffic splitting via SMI (TrafficSplit) or HTTPRoute

Linkerd Limitations

  • Less feature-rich than Istio (no fault injection, limited circuit breaking)
  • No ambient/sidecarless mode (as of 2025)
  • Smaller ecosystem of integrations
  • ServiceProfile requires manual route definition
# Install Linkerd
linkerd install --crds | kubectl apply -f -
linkerd install | kubectl apply -f -
linkerd check

# Inject sidecar into a namespace (annotation-based auto-inject):
kubectl annotate namespace production linkerd.io/inject=enabled

# Or inject manually:
kubectl get deploy -n production -o yaml | linkerd inject - | kubectl apply -f -

# ServiceProfile — per-route metrics and retry policy:
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: reviews.production.svc.cluster.local
  namespace: production
spec:
  routes:
    - name: GET /api/reviews
      condition:
        method: GET
        pathRegex: /api/reviews.*
      responseClasses:
        - condition:
            status:
              min: 500
          isFailure: true        # treat 5xx as failure for retry decisions
      isRetryable: true
      timeout: 5000ms            # per-route timeout
    - name: POST /api/reviews
      condition:
        method: POST
        pathRegex: /api/reviews
      isRetryable: false         # do not retry non-idempotent writes

# Traffic splitting (Gateway API HTTPRoute):
# Linkerd 2.14+ supports HTTPRoute for traffic splitting
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: reviews-split
  namespace: production
spec:
  parentRefs:
    - name: reviews
      kind: Service
      group: core
      port: 80
  rules:
    - backendRefs:
        - name: reviews-v1
          port: 80
          weight: 90
        - name: reviews-v2
          port: 80
          weight: 10
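
For the ServiceProfile shown earlier in this block, the sketch below illustrates how a request could be mapped to a route and how isRetryable then gates retries. It assumes that pathRegex is matched against the whole path (anchored) and it ignores Linkerd's retry budget; `Route`, `match_route`, and `should_retry` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    name: str
    method: str
    path_regex: str
    is_retryable: bool = False


# Mirrors the two routes of the ServiceProfile above.
ROUTES = [
    Route("GET /api/reviews", "GET", r"/api/reviews.*", is_retryable=True),
    Route("POST /api/reviews", "POST", r"/api/reviews", is_retryable=False),
]


def match_route(method: str, path: str, routes=ROUTES) -> Optional[Route]:
    """Return the first route whose method and (fully anchored) regex match."""
    for route in routes:
        if route.method == method and re.fullmatch(route.path_regex, path):
            return route
    return None


def should_retry(method: str, path: str, status: int) -> bool:
    """Retry only failures (5xx) on routes explicitly marked retryable."""
    route = match_route(method, path)
    return route is not None and route.is_retryable and status >= 500


print(match_route("GET", "/api/reviews/42"))
print(should_retry("GET", "/api/reviews", 503))
print(should_retry("POST", "/api/reviews", 503))
```

Requests that match no route still flow through the proxy; they simply get no per-route metrics, timeout, or retry behavior, which is why unanchored assumptions about pathRegex are a common source of "missing route" confusion.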

Cilium Service Mesh

Cilium can operate as a service mesh using eBPF for L4 (no sidecar needed) and optionally deploying a per-node Envoy proxy for L7 features. This gives a sidecarless mesh with kernel-level performance:

Cilium Service Mesh Modes

  • eBPF-only (L4): mTLS via eBPF + SPIFFE; mutual auth without any proxy; lowest overhead
  • Envoy per-node (L7): one Envoy DaemonSet per node; handles HTTP/gRPC routing, retries, circuit breaking; no per-pod sidecar
  • Sidecar compatible: can run alongside Istio or Linkerd sidecars as CNI
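
# envoyConfig.enabled: enables the CiliumEnvoyConfig CRD
# envoy.enabled: per-node Envoy DaemonSet for L7
# authentication.mutual.spire.*: SPIFFE/SPIRE integration for mutual authentication
# (comments cannot follow a line-continuation backslash, so they are listed here)
helm upgrade cilium cilium/cilium \
  --set envoyConfig.enabled=true \
  --set envoy.enabled=true \
  --set kubeProxyReplacement=true \
  --set authentication.mutual.spire.enabled=true \
  --set authentication.mutual.spire.install.enabled=true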

# CiliumEnvoyConfig — configure per-node Envoy:
apiVersion: cilium.io/v2
kind: CiliumEnvoyConfig
metadata:
  name: http-policy
  namespace: production
spec:
  services:
    - name: reviews
      namespace: production
  resources:
    - "@type": type.googleapis.com/envoy.config.listener.v3.Listener
      # ... full Envoy xDS config

SPIFFE / SPIRE — Workload Identity

SPIFFE (Secure Production Identity Framework For Everyone) defines a standard for workload identity. SPIRE is the reference implementation. Istio's CA, Linkerd's identity service, and Cilium's auth mode all implement or integrate with SPIFFE.

SPIFFE Identity (SVID: SPIFFE Verifiable Identity Document)
───────────────────────────────────────────────────────────────
Format (Kubernetes convention used by Istio):
  spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>
Example:
  spiffe://cluster.local/ns/production/sa/checkout-service

Two SVID types:
  X.509 SVID: TLS certificate with SPIFFE URI in SAN field
    → used for mTLS between services
  JWT SVID: short-lived JWT token with SPIFFE subject
    → used for HTTP bearer token auth

SPIRE Architecture:
  SPIRE Server (control plane)
    → stores registration entries (workload selectors → SPIFFE IDs)
    → signs SVIDs with the trust domain's signing keys and publishes the trust bundle used to verify them
  SPIRE Agent (DaemonSet on each node)
    → attests workload identity (k8s: pod UID, SA, namespace)
    → serves SVIDs to workloads via Workload API (Unix socket)

Istio integration:
  istiod acts as a SPIFFE-compatible CA
  Envoy fetches SVIDs via SDS from Istio agent (no SPIRE server needed)
  Or: external SPIRE + Istio CA integration for cross-cluster identity
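
As a small illustration of the identity format above, the following sketch parses a Kubernetes-style SPIFFE ID and derives the principal string used in AuthorizationPolicy (the same ID without the spiffe:// prefix). It only handles the /ns/<namespace>/sa/<service-account> path convention; `WorkloadIdentity` and `parse_spiffe_id` are hypothetical names.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class WorkloadIdentity:
    trust_domain: str
    namespace: str
    service_account: str

    @property
    def istio_principal(self) -> str:
        """Principal form used in AuthorizationPolicy `principals`."""
        return f"{self.trust_domain}/ns/{self.namespace}/sa/{self.service_account}"


def parse_spiffe_id(uri: str) -> WorkloadIdentity:
    """Parse spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>."""
    parts = urlsplit(uri)
    if parts.scheme != "spiffe" or not parts.netloc:
        raise ValueError(f"not a SPIFFE ID: {uri!r}")
    segments = parts.path.strip("/").split("/")
    if (len(segments) != 4 or segments[0] != "ns" or segments[2] != "sa"
            or not all(segments)):
        raise ValueError(f"unexpected SPIFFE path layout: {parts.path!r}")
    return WorkloadIdentity(parts.netloc, segments[1], segments[3])


identity = parse_spiffe_id("spiffe://cluster.local/ns/production/sa/checkout-service")
print(identity)
print(identity.istio_principal)
```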

mTLS Migration: PERMISSIVE → STRICT

Migrating to STRICT mTLS without downtime requires a phased approach. All clients of a service must be enrolled in the mesh before that service flips to STRICT; otherwise their plaintext connections are rejected:

Phase 1: Install mesh, set PERMISSIVE globally
  → Mesh accepts both mTLS and plaintext
  → No service disruption

Phase 2: Enroll namespaces gradually
kubectl label namespace team-a istio-injection=enabled
kubectl rollout restart deployment -n team-a
  → Verify all pods have sidecars: istioctl proxy-status

Phase 3: Monitor traffic for plaintext
# pilot-agent runs in the istio-proxy sidecar container, not in istiod
kubectl exec -n team-a <pod> -c istio-proxy -- \
  pilot-agent request GET stats | grep ssl.handshake
# Or in Kiali: Services → check mTLS lock icon

Phase 4: Switch to STRICT per-namespace
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: team-a
spec:
  mtls:
    mode: STRICT
EOF
  → Any plaintext caller is now rejected at the TLS layer (typically seen as "connection reset by peer")

Phase 5: Mesh-wide STRICT
  Apply PeerAuthentication in istio-system namespace
  → Verify with: istioctl x describe pod <pod> -n <namespace>

Service Mesh Comparison

Dimension | Istio (Sidecar) | Istio Ambient | Linkerd | Cilium Mesh
Data plane | Envoy per pod | ztunnel + waypoint | linkerd2-proxy per pod | eBPF + optional Envoy per node
Memory overhead | ~50Mi per pod | ~1 ztunnel per node | ~10Mi per pod | Minimal (eBPF in kernel)
Automatic mTLS | Yes (PERMISSIVE default) | Yes (L4 via ztunnel) | Yes (always on) | Yes (eBPF + SPIFFE)
L7 traffic management | Full (VirtualService/DR) | Yes (via waypoint) | Limited (ServiceProfile) | Via CiliumEnvoyConfig
Circuit breaking | Yes (outlier detection) | Yes (via waypoint) | Limited | Via Envoy config
Fault injection | Yes | Yes (via waypoint) | No | No
Observability | Full (Kiali, Jaeger, Prometheus) | Full | Good (Viz dashboard) | Hubble (eBPF-native)
Multi-cluster | Mature (4 topologies) | Maturing | Yes (multicluster extension) | Cluster Mesh (up to 255)
Pod restart needed | Yes (inject sidecar) | No | Yes (inject sidecar) | No
Learning curve | High (many CRDs) | Medium | Low | Medium
CNCF status | Graduated | Graduated | Graduated | Graduated

Service Mesh vs NetworkPolicy

Dimension | NetworkPolicy | Service Mesh AuthorizationPolicy
Layer | L3/L4 (IP + port) | L7 (HTTP method, path, JWT, headers)
Identity | Pod label + namespace label | SPIFFE workload identity (cryptographic)
Enforcement | Kernel (CNI eBPF/iptables) | User-space proxy (Envoy sidecar)
Overhead | Near-zero (kernel) | Proxy CPU/RAM per pod
Bypass risk | Cannot bypass (kernel enforced) | Can bypass if sidecar injection skipped
mTLS | No | Yes (automatic cert rotation)
External traffic | ipBlock for external CIDRs | ServiceEntry + AuthorizationPolicy
ℹ️
Defense in depth: use both

NetworkPolicy at the kernel level blocks traffic that completely bypasses the mesh (e.g., a compromised pod that disables its sidecar). AuthorizationPolicy enforces L7 access control that NetworkPolicy cannot express. Running both layers provides defense in depth — an attacker must defeat both the CNI enforcement and the service mesh identity verification.

GAMMA — Gateway API for Mesh

GAMMA (Gateway API for Mesh Management and Administration) extends Gateway API to the east-west (service-to-service) mesh use case. Instead of creating Istio-specific VirtualService CRDs, services can use standard HTTPRoute to control mesh traffic:

# HTTPRoute for mesh traffic (parentRef = Service, not Gateway)
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: reviews-mesh-route
  namespace: production
spec:
  parentRefs:
    - group: ""
      kind: Service
      name: reviews              # parentRef is a Service = mesh traffic routing
      port: 80
  rules:
    - backendRefs:
        - name: reviews-v1
          port: 80
          weight: 90
        - name: reviews-v2
          port: 80
          weight: 10

Istio 1.16+ supports GAMMA HTTPRoute for mesh routing. Linkerd 2.14+ supports it natively as well; the smi-adaptor is only needed for legacy SMI TrafficSplit resources. This enables portable mesh config that works across implementations.

Metrics, Alerting & Troubleshooting

Key Metrics (Istio / Prometheus)

Metric | Description | Alert Threshold
istio_requests_total | Total requests by source/destination/response_code | rate spike or drop
istio_request_duration_milliseconds | Request latency histogram | P99 > SLO
envoy_cluster_upstream_rq_pending_overflow | Requests rejected by circuit breaker | > 0 for 5m
envoy_cluster_outlier_detection_ejections_active | Actively ejected hosts (outlier detection) | > 0 sustained
istio_tcp_connections_opened_total | TCP connections through mesh | rate anomaly
pilot_xds_push_errors | istiod xDS push failures | > 0
pilot_proxy_convergence_time | Time for config to reach all proxies | P99 > 30s
response_total{classification="failure"} | Linkerd failed responses per route (compare against response_total) | error rate > 1%

Alerting Rules

# High error rate for a destination service
- alert: MeshServiceHighErrorRate
  expr: |
    sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
    /
    sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Service {{ $labels.destination_service_name }} error rate > 5%"

# P99 latency SLO breach
- alert: MeshServiceHighLatency
  expr: |
    histogram_quantile(0.99,
      sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (le, destination_service_name)
    ) > 2000
  for: 10m
  labels:
    severity: warning

# Circuit breaker tripping
- alert: MeshCircuitBreakerOpen
  expr: increase(envoy_cluster_upstream_rq_pending_overflow[5m]) > 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Circuit breaker opened for {{ $labels.envoy_cluster_name }}"

# istiod xDS push errors
- alert: IstiodXDSPushErrors
  expr: rate(pilot_xds_push_errors[5m]) > 0
  for: 5m
  labels:
    severity: warning

Troubleshooting Runbooks

Runbook 1: RBAC: access denied (mTLS / AuthorizationPolicy)

# Check if source has a sidecar (must be enrolled in mesh)
istioctl proxy-status | grep <source-pod>

# Check AuthorizationPolicy applying to destination
kubectl get authorizationpolicy -n <dest-namespace>

# Debug auth decision
istioctl x authz check <dest-pod> -n <dest-namespace>

# Check if PeerAuthentication is STRICT but source has no sidecar
# (describe reports the effective mTLS mode and the policies applied to the pod)
istioctl x describe pod <dest-pod> -n <dest-namespace>

# View real-time access log for denied requests
kubectl logs <dest-pod> -c istio-proxy | grep "RBAC"

Runbook 2: VirtualService Routing Not Applied

# Check VirtualService is valid
istioctl analyze -n <namespace>

# Verify proxy has received the config
istioctl proxy-config routes <pod> -n <namespace>
# Look for the route matching your VirtualService

# Check DestinationRule subsets exist
istioctl proxy-config cluster <pod> -n <namespace> | grep <service>
# Subsets should appear as separate clusters

# Common issues:
# - hosts field doesn't match service FQDN or short name scope
# - gateways field is missing "mesh" for in-cluster routing
# - DestinationRule subset labels don't match pod labels
kubectl get pods -l version=v2 -n <namespace>  # verify subset pods exist

Runbook 3: Sidecar Injection Not Happening

# Check namespace label
kubectl get namespace <ns> --show-labels | grep istio-injection

# Check MutatingWebhookConfiguration
kubectl get mutatingwebhookconfigurations istio-sidecar-injector -o yaml \
  | grep namespaceSelector

# Check pod-level annotation override (can disable injection)
kubectl get pod <pod> -o jsonpath='{.metadata.annotations}'
# sidecar.istio.io/inject: "false" → overrides namespace label

# Trigger re-injection (restart deployment)
kubectl rollout restart deploy/<name> -n <namespace>

# Verify sidecar present after restart
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].name}'
# Should include: istio-proxy

Runbook 4: istiod Not Syncing Config to Proxies (high xDS push delay)

# Check istiod pod status and resources
kubectl top pod -n istio-system -l app=istiod

# Check connected proxy count and inbound update volume
# (istiod serves its own metrics on port 15014; pilot-agent exists only in sidecars)
kubectl port-forward -n istio-system deploy/istiod 15014:15014 &
curl -s localhost:15014/metrics | grep -E 'pilot_xds|pilot_inbound_updates'

# Check proxy convergence time (should be <30s P99)
curl -s localhost:15014/metrics | grep pilot_proxy_convergence

# Common causes:
# - istiod OOM: increase memory limits
# - Too many services/endpoints causing full xDS push storms
#   Fix: narrow each proxy's config scope with Sidecar resources or discoverySelectors,
#   and tune push debouncing (PILOT_DEBOUNCE_AFTER, PILOT_DEBOUNCE_MAX)
# - Network policy blocking istiod ↔ sidecar port 15012

Runbook 5: Linkerd — Service Metrics Missing

# Check linkerd proxy injection
linkerd check --proxy -n <namespace>

# Check if pods have the proxy
kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}: {.spec.containers[*].name}{"\n"}{end}' \
  | grep linkerd-proxy

# If missing: re-annotate and restart
kubectl annotate namespace <ns> linkerd.io/inject=enabled
kubectl rollout restart deploy -n <namespace>

# View live traffic stats
linkerd viz stat deploy -n <namespace>
linkerd viz top deploy/<name> -n <namespace>

# Check for certificate expiry
linkerd check --proxy 2>&1 | grep -i cert

Best Practices

  1. Start with PERMISSIVE mTLS, enforce STRICT per-namespace gradually: never flip the entire cluster to STRICT on day one. Migrate namespace by namespace, and check the effective mTLS policy with istioctl x describe pod before switching each namespace.
  2. Set resource requests/limits on sidecars: the default proxy requests and limits are generic and rarely fit a specific workload. In production, set global.proxy.resources.requests/limits in Helm values (or the per-pod sidecar.istio.io/proxyCPU and sidecar.istio.io/proxyMemory annotations) so that sidecars neither starve application containers nor get throttled.
  3. Use outlier detection (circuit breaking) on all DestinationRules: without it, a slow or failing pod keeps receiving traffic until its readiness probe fails or the pod is removed. Add consecutiveGatewayErrors: 5 and interval: 30s as a minimum baseline.
  4. Enable outboundTrafficPolicy: REGISTRY_ONLY in production: this forces all egress through declared ServiceEntries, prevents data exfiltration to arbitrary external IPs, and makes egress auditable.
  5. Consider Istio ambient mode for large clusters (1000+ pods): sidecar overhead becomes significant at scale. Ambient mode replaces per-pod proxies with one ztunnel per node, which can cut data-plane memory substantially (the saving grows with pod density), and removes the pod restart requirement for mesh enrollment.
  6. Validate CRDs before applying: run istioctl analyze -n <namespace> and istioctl analyze --all-namespaces in CI pipelines. Invalid VirtualService or DestinationRule configs cause silent routing failures, not errors.
  7. Propagate trace headers in every service: Istio generates trace spans but cannot correlate them across services unless your application forwards the trace headers (x-request-id plus the B3 headers x-b3-traceid, x-b3-spanid, x-b3-parentspanid, x-b3-sampled, or W3C traceparent). Add header propagation middleware to every service.
  8. Use GAMMA (HTTPRoute with parentRef=Service) for new mesh routing: it is portable across Istio, Linkerd, and future implementations, and avoids Istio-specific VirtualService lock-in for basic traffic splitting.
  9. Back up Istio config state regularly: all mesh config (VirtualService, DestinationRule, PeerAuthentication, AuthorizationPolicy) is stored in etcd. Treat it as part of GitOps by storing all mesh CRDs in version control and reconciling via ArgoCD/Flux.