Leader Election Flow

Overview

This document traces how Kubernetes controllers use the Lease API to elect a single active leader among multiple replicas, and what happens during startup, failover, and leader loss.

Why Leader Election

kube-controller-manager, kube-scheduler, and custom operators
run as multiple replicas for availability.

Problem: if two replicas both reconcile the same object simultaneously,
they can fight each other → duplicate work, split-brain writes.

Solution: leader election — only ONE replica is "active" at any time.
  Others are "standby" — ready to take over if the leader dies.

Kubernetes uses the Lease API (coordination.k8s.io/v1 Lease object)
as the lock primitive.

Leader Election Architecture

  Replica A (leader)          Replica B (standby)        Replica C (standby)
       │                             │                          │
       │  every retryPeriod(2s):     │  every retryPeriod(2s):  │
       │  PUT Lease (update)         │  GET Lease               │
       │  renewTime: now             │  check holderIdentity    │
       │  holderIdentity: replica-a  │  if expired → try to acquire
       │                             │                          │
       │                    API Server                          │
       │                         │                             │
       │                        etcd (Lease object)            │

Full Leader Election Sequence — Initial Acquisition

Replica A          Replica B          Replica C       API Server    etcd
    │                  │                  │               │           │
    │  [All replicas start simultaneously]│               │           │
    │                  │                  │               │           │
    │─GET Lease ─────────────────────────────────────────►│           │
    │                  │─GET Lease ──────────────────────►│           │
    │                  │                  │─GET Lease ───►│           │
    │                  │                  │               │           │
    │◄── 404 Not Found ───────────────────────────────────│           │
    │                  │◄── 404 Not Found ────────────────│           │
    │                  │                  │◄── 404 ───────│           │
    │                  │                  │               │           │
    │  [All three attempt CREATE simultaneously]          │           │
    │─CREATE Lease ──────────────────────────────────────►│           │
    │  holderIdentity: replica-a          │               │──WRITE──► │
    │  leaseDuration: 15s                 │               │           │
    │  acquireTime: now                   │               │           │
    │                  │                  │               │           │
    │                  │─CREATE Lease ───────────────────►│           │
    │                  │  holderIdentity: replica-b       │           │
    │                  │                  │               │           │
    │                  │                  │─CREATE Lease─►│           │
    │                  │                  │  holderIdentity: replica-c
    │                  │                  │               │           │
    │◄── 201 Created (A wins, etcd atomic write) ─────────│           │
    │                  │◄── 409 Conflict ─────────────────│           │
    │                  │                  │◄── 409 ───────│           │
    │                  │                  │               │           │
    │  Replica A starts reconciliation loop               │           │
    │  Replica B enters standby (retry GET loop)         │           │
    │  Replica C enters standby                          │           │
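
The race above can be sketched with an in-memory stand-in for the API server. This is a minimal illustration rather than client-go code: `leaseStore`, `tryAcquire`, and `electOnce` are hypothetical names, and a mutex plays the role of etcd's atomic write.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errAlreadyExists stands in for the 409 the API server returns on a duplicate create.
var errAlreadyExists = errors.New("lease already exists")

// leaseStore is a hypothetical in-memory stand-in for the API server backed by etcd.
type leaseStore struct {
	mu     sync.Mutex
	holder string // empty means the Lease object does not exist yet
}

// get returns the current holder and whether a Lease exists (false models a 404).
func (s *leaseStore) get() (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.holder, s.holder != ""
}

// create writes the Lease only if none exists; the check and the write are one atomic step.
func (s *leaseStore) create(identity string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.holder != "" {
		return errAlreadyExists
	}
	s.holder = identity
	return nil
}

// tryAcquire performs one GET-then-CREATE attempt and reports whether identity became leader.
func tryAcquire(s *leaseStore, identity string) bool {
	if _, found := s.get(); found {
		return false // a holder already exists: stay on standby
	}
	// Another replica can still win between get and create; create arbitrates.
	return s.create(identity) == nil
}

// electOnce starts all replicas at the same time and returns those that became leader.
func electOnce(s *leaseStore, replicas []string) []string {
	var wg sync.WaitGroup
	won := make([]bool, len(replicas))
	for i, id := range replicas {
		wg.Add(1)
		go func(i int, id string) {
			defer wg.Done()
			won[i] = tryAcquire(s, id)
		}(i, id)
	}
	wg.Wait()
	var leaders []string
	for i, ok := range won {
		if ok {
			leaders = append(leaders, replicas[i])
		}
	}
	return leaders
}

func main() {
	store := &leaseStore{}
	leaders := electOnce(store, []string{"replica-a", "replica-b", "replica-c"})
	holder, _ := store.get()
	fmt.Println(len(leaders), leaders[0] == holder) // 1 true
}
```

Which replica wins varies from run to run; the point is that exactly one create succeeds no matter how the goroutines interleave.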

Lease Renewal — Active Leader

Replica A              API Server       etcd
    │                      │              │
    │  every retryPeriod (2s default);    │
    │  each renewal must succeed within   │
    │  renewDeadline (10s default)        │
    │                      │              │
    │─PUT Lease ──────────►│              │
    │  renewTime: now      │              │
    │  resourceVersion: 42 │  (optimistic lock)
    │                      │──WRITE ─────►│
    │◄── 200 OK ───────────│              │
    │                      │              │
    │  [renew loop continues every ~2s]   │
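
A sketch of a single renewal round, assuming the client-go-style behaviour in which the leader retries every retryPeriod and abandons leadership once renewDeadline has passed without a success. It runs on a virtual clock so nothing sleeps; `renewWithinDeadline` and the two constants are hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"time"
)

// Defaults used throughout this document (assumed here, not read from a cluster).
const (
	defaultRetryPeriod   = 2 * time.Second
	defaultRenewDeadline = 10 * time.Second
)

// renewWithinDeadline simulates one renewal round on a virtual clock.
// attempt is called at t=0, retryPeriod, 2*retryPeriod, ... until it succeeds
// or the next try would start at or after renewDeadline. It returns whether
// the renewal succeeded and the virtual time elapsed when the round ended.
func renewWithinDeadline(attempt func(try int) bool, retryPeriod, renewDeadline time.Duration) (bool, time.Duration) {
	elapsed := time.Duration(0)
	for try := 0; elapsed < renewDeadline; try++ {
		if attempt(try) {
			return true, elapsed
		}
		elapsed += retryPeriod
	}
	return false, elapsed
}

func main() {
	// API server unreachable for the first three tries, then it recovers.
	ok, at := renewWithinDeadline(func(try int) bool { return try >= 3 }, defaultRetryPeriod, defaultRenewDeadline)
	fmt.Println(ok, at) // true 6s

	// API server never answers: the leader gives up once the deadline passes.
	ok, at = renewWithinDeadline(func(int) bool { return false }, defaultRetryPeriod, defaultRenewDeadline)
	fmt.Println(ok, at) // false 10s
}
```

With the defaults, a leader gets five attempts (at 0s, 2s, 4s, 6s, 8s) before it has to step down, which is 5s before a standby would consider the 15s lease expired.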

Failover — Leader Dies

Replica A (DEAD)       Replica B           Replica C       API Server    etcd
    │                      │                   │               │           │
    │  [A crashes / network partition]         │               │           │
    │                      │                   │               │           │
    │                      │  GET Lease ──────────────────────►│           │
    │                      │◄── Lease: holderIdentity=A, renewTime=T ─────  │
    │                      │                   │               │           │
    │                      │  [B checks: is Lease expired?]    │           │
    │                      │  now - renewTime > leaseDuration  │           │
    │                      │  → YES (A missed renewals)        │           │
    │                      │                   │               │           │
    │                      │─PUT Lease ───────────────────────►│           │
    │                      │  holderIdentity: replica-b        │           │
    │                      │  acquireTime: now                 │           │
    │                      │  leaderTransitions: +1            │           │
    │                      │  resourceVersion: 43              │──WRITE──► │
    │                      │◄── 200 OK ─────────────────────── │           │
    │                      │                   │               │           │
    │                      │  Replica B becomes leader         │           │
    │                      │  Starts reconciliation loop       │           │
    │                      │                   │               │           │
    │  [A recovers]        │                   │               │           │
    │─PUT Lease ──────────────────────────────────────────────►│           │
    │  holderIdentity: A   │                   │               │           │
    │  resourceVersion: 42 │ (stale!)          │               │           │
    │◄── 409 Conflict ─────────────────────────────────────────│           │
    │  [A sees it lost the election]            │               │           │
    │  A re-enters standby                      │               │           │
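
The expiry check and takeover that Replica B performs can be written as two small pure functions. This is a sketch under the diagram's simplified model, which compares against renewTime directly; real clients such as client-go instead measure from the moment they last observed the Lease change, to reduce sensitivity to clock skew. `leaseSpec`, `expired`, and `takeOver` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// leaseSpec mirrors the Lease fields used in the failover sequence.
type leaseSpec struct {
	HolderIdentity    string
	LeaseDuration     time.Duration
	AcquireTime       time.Time
	RenewTime         time.Time
	LeaderTransitions int
}

// expired reports whether the holder has gone longer than LeaseDuration without renewing.
func expired(l leaseSpec, now time.Time) bool {
	return now.Sub(l.RenewTime) > l.LeaseDuration
}

// takeOver returns the spec a standby would write to claim an expired lease.
// While the lease is still valid it returns the input unchanged and false.
func takeOver(l leaseSpec, candidate string, now time.Time) (leaseSpec, bool) {
	if !expired(l, now) {
		return l, false
	}
	l.HolderIdentity = candidate
	l.AcquireTime = now
	l.RenewTime = now
	l.LeaderTransitions++
	return l, true
}

func main() {
	t0 := time.Date(2025, 1, 15, 10, 0, 0, 0, time.UTC)
	lease := leaseSpec{
		HolderIdentity:    "replica-a",
		LeaseDuration:     15 * time.Second,
		AcquireTime:       t0,
		RenewTime:         t0,
		LeaderTransitions: 3,
	}

	_, ok := takeOver(lease, "replica-b", t0.Add(10*time.Second))
	fmt.Println(ok) // false: only 10s since the last renewal

	next, ok := takeOver(lease, "replica-b", t0.Add(16*time.Second))
	fmt.Println(ok, next.HolderIdentity, next.LeaderTransitions) // true replica-b 4
}
```

The function only computes the new spec; whether the write actually lands is still decided by the resourceVersion check on the API server.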

Lease Object

# Lease object in kube-system (created by kube-controller-manager)
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  holderIdentity: "ip-10-0-1-5_uuid-abc"   # pod identity of current leader
  leaseDurationSeconds: 15                   # how long lease is valid without renewal
  acquireTime: "2025-01-15T10:00:00Z"       # when this replica acquired leadership
  renewTime: "2025-01-15T10:05:43Z"         # last renewal (leader refreshes it every retryPeriod, ~2s by default)
  leaderTransitions: 3                       # how many times leadership has changed

Leader Election in Custom Operators

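// controller-runtime leader election setup (main.go)
// Requires imports: "time" and ctrl "sigs.k8s.io/controller-runtime".
leaseDuration := 15 * time.Second
renewDeadline := 10 * time.Second
retryPeriod := 2 * time.Second

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    Scheme: scheme,

    // Enable leader election
    LeaderElection:          true,
    LeaderElectionID:        "payments-operator.mycompany.io",  // Lease name
    LeaderElectionNamespace: "kube-system",

    // Tuning (defaults shown)
    LeaseDuration: &leaseDuration,   // 15s: how long the lease stays valid without renewal
    RenewDeadline: &renewDeadline,   // 10s: leader gives up if it cannot renew within this
    RetryPeriod:   &retryPeriod,     // 2s: interval between acquire/renew attempts
})
if err != nil {
    panic(err) // in a real main.go, log the error and exit non-zero
}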

# RBAC for leader election (operator needs Lease CRUD)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: payments-operator-leader-election
  namespace: kube-system
rules:
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "patch"]

Leader Election Timing Parameters

Parameter	Default	Meaning	If too low	If too high
`leaseDuration`	15s	Lease validity without renewal	Too many false failovers	Slow failover after leader crash
`renewDeadline`	10s	Leader must renew within this window	Leader gives up leadership too easily	Leader stays active too long during partial failure
`retryPeriod`	2s	Standby replica poll interval	API server load	Slow failover detection

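Invariant: retryPeriod < renewDeadline < leaseDuration (client-go additionally requires renewDeadline > 1.2 × retryPeriod, to leave room for retry jitter)
Typical tuning for fast failover: leaseDuration 5s / renewDeadline 4s / retryPeriod 2s (at the cost of more false failovers when the API server is briefly slow)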
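
The invariant can be checked mechanically before a configuration is rolled out. The sketch below is an assumption-laden illustration: `timings`, `validate`, and `worstCaseFailover` are hypothetical names, the 1.2 jitter factor is borrowed from client-go's validation, and the failover figure is a rough estimate under this document's model (lease expiry plus one standby poll) that ignores jitter, clock skew, and request latency.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// jitterFactor is assumed to match client-go, which requires
// renewDeadline > retryPeriod * 1.2.
const jitterFactor = 1.2

// timings groups the three leader election parameters.
type timings struct {
	LeaseDuration, RenewDeadline, RetryPeriod time.Duration
}

// validate enforces retryPeriod*jitterFactor < renewDeadline < leaseDuration.
func (t timings) validate() error {
	if t.RetryPeriod <= 0 {
		return errors.New("retryPeriod must be positive")
	}
	if t.LeaseDuration <= t.RenewDeadline {
		return errors.New("leaseDuration must be greater than renewDeadline")
	}
	if t.RenewDeadline <= time.Duration(jitterFactor*float64(t.RetryPeriod)) {
		return errors.New("renewDeadline must be greater than retryPeriod*jitterFactor")
	}
	return nil
}

// worstCaseFailover estimates the time from the leader's last successful
// renewal until a standby notices the expiry: the lease has to run out,
// then the standby's next poll has to arrive.
func (t timings) worstCaseFailover() time.Duration {
	return t.LeaseDuration + t.RetryPeriod
}

func main() {
	defaults := timings{LeaseDuration: 15 * time.Second, RenewDeadline: 10 * time.Second, RetryPeriod: 2 * time.Second}
	fast := timings{LeaseDuration: 5 * time.Second, RenewDeadline: 4 * time.Second, RetryPeriod: 2 * time.Second}
	broken := timings{LeaseDuration: 10 * time.Second, RenewDeadline: 10 * time.Second, RetryPeriod: 2 * time.Second}

	fmt.Println(defaults.validate(), defaults.worstCaseFailover()) // <nil> 17s
	fmt.Println(fast.validate(), fast.worstCaseFailover())         // <nil> 7s
	fmt.Println(broken.validate())                                 // leaseDuration must be greater than renewDeadline
}
```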

Diagnosing Leader Election Issues

# See who is the current leader
kubectl get lease kube-controller-manager -n kube-system \
  -o jsonpath='{.spec.holderIdentity}'

kubectl get lease kube-scheduler -n kube-system \
  -o jsonpath='{.spec.holderIdentity}'

# See all leases in kube-system
kubectl get leases -n kube-system

# Watch leader transitions in real time
kubectl get lease kube-controller-manager -n kube-system \
  -w -o jsonpath='{.spec.holderIdentity}{"\n"}'

# Check leaderTransitions counter (high = instability)
kubectl get lease kube-controller-manager -n kube-system \
  -o jsonpath='{.spec.leaderTransitions}'

# Custom operator lease
kubectl get lease payments-operator.mycompany.io -n kube-system -o yaml

# See leader election events
kubectl get events -n kube-system \
  --field-selector reason=LeaderElection

# Controller-manager logs showing election
kubectl logs -n kube-system \
  -l component=kube-controller-manager --tail=50 | grep -i leader
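
When scripting these checks, the output of `kubectl get lease <name> -o json` can be parsed to tell whether the current holder is still renewing. A minimal sketch using only the standard library; `lease`, `summarize`, and `mustParse` are hypothetical names, and only the `spec` fields shown in the Lease object above are decoded.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// lease decodes the subset of a coordination.k8s.io/v1 Lease used here.
type lease struct {
	Spec struct {
		HolderIdentity       string    `json:"holderIdentity"`
		LeaseDurationSeconds int       `json:"leaseDurationSeconds"`
		RenewTime            time.Time `json:"renewTime"`
		LeaderTransitions    int       `json:"leaderTransitions"`
	} `json:"spec"`
}

// summarize parses Lease JSON and reports the holder, the time since the last
// renewal, and whether that age exceeds leaseDurationSeconds.
func summarize(raw []byte, now time.Time) (holder string, age time.Duration, stale bool, err error) {
	var l lease
	if err = json.Unmarshal(raw, &l); err != nil {
		return "", 0, false, err
	}
	age = now.Sub(l.Spec.RenewTime)
	limit := time.Duration(l.Spec.LeaseDurationSeconds) * time.Second
	return l.Spec.HolderIdentity, age, age > limit, nil
}

// mustParse is a helper for the example; it panics on malformed RFC 3339 input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	raw := []byte(`{"spec":{"holderIdentity":"ip-10-0-1-5_uuid-abc","leaseDurationSeconds":15,"renewTime":"2025-01-15T10:05:43.000000Z","leaderTransitions":3}}`)
	holder, age, stale, err := summarize(raw, mustParse("2025-01-15T10:06:03Z"))
	fmt.Println(holder, age, stale, err) // ip-10-0-1-5_uuid-abc 20s true <nil>
}
```

A stale result only means no renewal has landed within the lease duration as seen from the machine running the check; clock skew between that machine and the leader can shift the answer by a few seconds.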

Split Brain Prevention

How Kubernetes prevents two leaders:

1. Optimistic locking (resourceVersion):
   - Both standby replicas see the lease as expired
   - Both send an update carrying the resourceVersion they last read
   - The API server commits only the first (atomic compare-and-swap in etcd)
   - The second gets 409 Conflict and stays on standby, retrying after retryPeriod

2. Old leader detects loss:
   - renewDeadline exceeded → old leader stops leading, logs an error, and exits
   - OR: old leader's update returns 409 (lost the lock to another replica)
   - controller-runtime: process exits, pod restarts as standby

3. No "deny" messages needed:
   - etcd atomicity guarantees exactly one winner per write
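
The compare-and-swap rule in point 1 can be illustrated with a few lines of Go. This is a simplified model, not the API server's implementation: `lockStore`, `update`, and `errConflict` are hypothetical names, and an integer stands in for the opaque resourceVersion string.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errConflict stands in for the API server's 409 Conflict response.
var errConflict = errors.New("409 conflict: resourceVersion is stale")

// lockStore is a hypothetical in-memory model of a versioned Lease object.
type lockStore struct {
	mu              sync.Mutex
	holder          string
	resourceVersion int
}

// get returns the current holder and the version a client must echo back on update.
func (s *lockStore) get() (string, int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.holder, s.resourceVersion
}

// update applies the write only if seenVersion still matches the stored version.
// The comparison and the write happen under one lock, so they are atomic.
func (s *lockStore) update(identity string, seenVersion int) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if seenVersion != s.resourceVersion {
		return errConflict
	}
	s.holder = identity
	s.resourceVersion++
	return nil
}

func main() {
	store := &lockStore{holder: "replica-a", resourceVersion: 42}

	// B and C both read the expired lease at version 42.
	_, seenByB := store.get()
	_, seenByC := store.get()

	fmt.Println(store.update("replica-b", seenByB)) // <nil>
	fmt.Println(store.update("replica-c", seenByC)) // 409 conflict: resourceVersion is stale

	// The old leader comes back and retries with the version it last wrote against.
	fmt.Println(store.update("replica-a", 42)) // 409 conflict: resourceVersion is stale

	holder, version := store.get()
	fmt.Println(holder, version) // replica-b 43
}
```

Every successful write bumps the version, so any client still holding an older version is rejected regardless of who it is.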

Related

01 — Pod Creation Flow — kube-scheduler is also leader-elected
10 — Garbage Collection — GC controller runs under kube-controller-manager (leader-elected)
05 — Control Plane Issues — leader election failures in operations
