Overview

Traces how Kubernetes controllers elect a single active leader from multiple replicas — using the Lease API — and what happens during startup, failover, and leader loss.

Why Leader Election

kube-controller-manager, kube-scheduler, and custom operators
run as multiple replicas for availability.

Problem: if two replicas both reconcile the same object simultaneously,
they can fight each other → duplicate work, split-brain writes.

Solution: leader election — only ONE replica is "active" at any time.
  Others are "standby" — ready to take over if the leader dies.

Kubernetes uses the Lease API (coordination.k8s.io/v1 Lease object)
as the lock primitive.

Leader Election Architecture

  Replica A (leader)          Replica B (standby)        Replica C (standby)
       │                             │                          │
       │  every retryPeriod (2s):    │  every retryPeriod(2s):  │
       │  PATCH Lease                │  GET Lease               │
       │  renewTime: now             │  check holderIdentity    │
       │  holderIdentity: replica-a  │  if expired → try to acquire
       │  (renewDeadline 10s = max   │                          │
       │   time to land a renewal)   │                          │
       │                             │                          │
       │                    API Server                          │
       │                         │                             │
       │                        etcd (Lease object)            │

Full Leader Election Sequence — Initial Acquisition

Replica A          Replica B          Replica C       API Server    etcd
    │                  │                  │               │           │
    │  [All replicas start simultaneously]│               │           │
    │                  │                  │               │           │
    │─GET Lease ─────────────────────────────────────────►│           │
    │                  │─GET Lease ──────────────────────►│           │
    │                  │                  │─GET Lease ───►│           │
    │                  │                  │               │           │
    │◄── 404 Not Found ───────────────────────────────────│           │
    │                  │◄── 404 Not Found ────────────────│           │
    │                  │                  │◄── 404 ───────│           │
    │                  │                  │               │           │
    │  [All three attempt CREATE simultaneously]          │           │
    │─CREATE Lease ──────────────────────────────────────►│           │
    │  holderIdentity: replica-a          │               │           │
    │  leaseDurationSeconds: 15           │               │           │
    │  acquireTime: now                   │               │           │
    │                  │─CREATE Lease ───────────────────►│           │
    │                  │  holderIdentity: replica-b       │           │
    │                  │                  │─CREATE Lease─►│           │
    │                  │                  │  (replica-c)  │           │
    │                  │                  │               │──WRITE───►│
    │  [etcd commits only the first create]               │           │
    │                  │                  │               │           │
    │◄── 201 Created (A's create won) ───────────────────│           │
    │                  │◄── 409 Conflict ─────────────────│           │
    │                  │                  │◄── 409 ───────│           │
    │                  │                  │               │           │
    │  Replica A starts reconciliation loop               │           │
    │  Replica B enters standby (retry GET loop)         │           │
    │  Replica C enters standby                          │           │
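
The create race above can be sketched without a cluster. This is a minimal in-memory model, not client-go code: `leaseStore`, `raceToCreate`, and `ErrAlreadyExists` are hypothetical names, and a mutex stands in for the atomic write that etcd performs.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrAlreadyExists stands in for the API server's 409 Conflict on create.
var ErrAlreadyExists = errors.New("lease already exists")

// leaseStore is a hypothetical in-memory stand-in for API server + etcd.
type leaseStore struct {
	mu     sync.Mutex
	holder string // empty means the Lease does not exist yet
}

// Create succeeds only if no Lease exists; the mutex models the atomic write.
func (s *leaseStore) Create(identity string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.holder != "" {
		return ErrAlreadyExists
	}
	s.holder = identity
	return nil
}

// raceToCreate starts one goroutine per replica and returns the identities
// whose Create call succeeded.
func raceToCreate(s *leaseStore, replicas []string) []string {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		winners []string
	)
	for _, id := range replicas {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			if err := s.Create(id); err == nil {
				mu.Lock()
				winners = append(winners, id)
				mu.Unlock()
			}
		}(id)
	}
	wg.Wait()
	return winners
}

func main() {
	store := &leaseStore{}
	winners := raceToCreate(store, []string{"replica-a", "replica-b", "replica-c"})
	// Which replica wins varies between runs; the count is always 1.
	fmt.Println("leaders elected:", len(winners), "holder:", store.holder)
}
```

Whichever goroutine reaches the store first becomes the holder; the other two see the conflict error and would fall back to the standby loop.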

Lease Renewal — Active Leader

Replica A              API Server       etcd
    │                      │              │
    │  every retryPeriod   │              │
    │  (2s by default)     │              │
    │                      │              │
    │─PATCH Lease ────────►│              │
    │  renewTime: now      │              │
    │  resourceVersion: 42 │  (optimistic lock)
    │                      │──WRITE ─────►│
    │◄── 200 OK ───────────│              │
    │                      │              │
    │  [renew loop repeats every ~2s; A stops leading if no
    │   renewal succeeds within renewDeadline (10s)]
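
How the retry interval and the renew deadline interact can be sketched as below. This is a simplified model with simulated time (no sleeping, no jitter), not the client-go implementation; `renewWithin` and the two constants are hypothetical names that mirror the defaults quoted in this document.

```go
package main

import (
	"fmt"
	"time"
)

// Defaults quoted in this document.
const (
	defaultRenewDeadline = 10 * time.Second
	defaultRetryPeriod   = 2 * time.Second
)

// renewWithin retries attempt every retryPeriod until it succeeds or
// renewDeadline has elapsed. Time is simulated so the sketch runs instantly.
// It returns whether a renewal landed and how many attempts were made.
func renewWithin(renewDeadline, retryPeriod time.Duration, attempt func() bool) (bool, int) {
	tries := 0
	for elapsed := time.Duration(0); elapsed < renewDeadline; elapsed += retryPeriod {
		tries++
		if attempt() {
			return true, tries
		}
	}
	return false, tries
}

func main() {
	// The API server is unreachable for two attempts, then recovers.
	calls := 0
	flaky := func() bool { calls++; return calls >= 3 }
	ok, tries := renewWithin(defaultRenewDeadline, defaultRetryPeriod, flaky)
	fmt.Println(ok, tries) // true 3

	// The API server never answers: the leader gives up at the deadline.
	ok, tries = renewWithin(defaultRenewDeadline, defaultRetryPeriod, func() bool { return false })
	fmt.Println(ok, tries) // false 5
}
```

A `false` result corresponds to the leader stepping down, which happens before the 15s lease can expire for the standbys.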

Failover — Leader Dies

Replica A (DEAD)       Replica B           Replica C       API Server    etcd
    │                      │                   │               │           │
    │  [A crashes / network partition]         │               │           │
    │                      │                   │               │           │
    │                      │  GET Lease ──────────────────────►│           │
    │                      │◄── Lease: holderIdentity=A, renewTime=T ─────  │
    │                      │                   │               │           │
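    │                      │  [B checks: is Lease expired?]    │           │
    │                      │  now - renewTime > leaseDuration  │           │
    │                      │  (client-go times this from when  │           │
    │                      │   B last saw the Lease change,    │           │
    │                      │   so clock skew matters less)     │           │
    │                      │  → YES (A missed renewals)        │           │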
    │                      │                   │               │           │
    │                      │─PATCH Lease ───────────────────►  │           │
    │                      │  holderIdentity: replica-b        │           │
    │                      │  acquireTime: now                 │           │
    │                      │  leaderTransitions: +1            │           │
    │                      │  resourceVersion: 43              │──WRITE──► │
    │                      │◄── 200 OK ─────────────────────── │           │
    │                      │                   │               │           │
    │                      │  Replica B becomes leader         │           │
    │                      │  Starts reconciliation loop       │           │
    │                      │                   │               │           │
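    │  [A recovers]        │                   │               │           │
    │─PATCH Lease ────────────────────────────────────────────►│           │
    │  holderIdentity: A   │                   │               │           │
    │  resourceVersion: 42 │ (stale!)          │               │           │
    │◄── 409 Conflict ─────────────────────────────────────────│           │
    │  [A sees it lost the election and stops reconciling]     │           │
    │  A rejoins as standby (with controller-runtime, by       │           │
    │  exiting so the restarted pod starts as a standby)       │           │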
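
The failover steps above (expiry check, takeover, stale write rejected) can be modeled in a few lines. This is a sketch under stated assumptions: `lease`, `server`, `expired`, and `tryTakeOver` are hypothetical names, and the expiry test uses the simplified `now - renewTime` comparison from the diagram rather than client-go's locally observed time.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrConflict stands in for the API server's 409 on a stale resourceVersion.
var ErrConflict = errors.New("409 Conflict: resourceVersion is stale")

// lease is a trimmed-down, hypothetical copy of the Lease spec plus its
// resourceVersion.
type lease struct {
	Holder            string
	RenewTime         time.Time
	Duration          time.Duration
	LeaderTransitions int
	ResourceVersion   int
}

// server stands in for the API server; update is a compare-and-swap on
// ResourceVersion.
type server struct{ current lease }

func (s *server) update(next lease, readVersion int) error {
	if readVersion != s.current.ResourceVersion {
		return ErrConflict
	}
	next.ResourceVersion = readVersion + 1
	s.current = next
	return nil
}

// expired reports whether the lease went unrenewed for longer than Duration.
func expired(l lease, now time.Time) bool {
	return now.Sub(l.RenewTime) > l.Duration
}

// tryTakeOver lets a standby claim the lease it read, if that lease expired.
func tryTakeOver(s *server, seen lease, me string, now time.Time) error {
	if !expired(seen, now) {
		return fmt.Errorf("lease still held by %s", seen.Holder)
	}
	next := seen
	next.Holder, next.RenewTime = me, now
	next.LeaderTransitions++
	return s.update(next, seen.ResourceVersion)
}

func main() {
	t0 := time.Date(2025, 1, 15, 10, 0, 0, 0, time.UTC)
	s := &server{current: lease{Holder: "replica-a", RenewTime: t0,
		Duration: 15 * time.Second, ResourceVersion: 42}}
	staleCopy := s.current // what A last read before it stalled

	// 20s later B sees an expired lease and takes over.
	fmt.Println(tryTakeOver(s, s.current, "replica-b", t0.Add(20*time.Second))) // <nil>

	// A wakes up and tries to renew with its old resourceVersion.
	staleCopy.RenewTime = t0.Add(21 * time.Second)
	fmt.Println(s.update(staleCopy, staleCopy.ResourceVersion)) // 409 Conflict: resourceVersion is stale
	fmt.Println(s.current.Holder, s.current.LeaderTransitions)  // replica-b 1
}
```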

Lease Object

# Lease object in kube-system (created by kube-controller-manager)
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  holderIdentity: "ip-10-0-1-5_uuid-abc"   # pod identity of current leader
  leaseDurationSeconds: 15                   # how long lease is valid without renewal
  acquireTime: "2025-01-15T10:00:00Z"       # when this replica acquired leadership
  renewTime: "2025-01-15T10:05:43Z"         # last renewal (updated every retryPeriod, ~2s)
  leaderTransitions: 3                       # how many times leadership has changed
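
To show how these fields combine, the sketch below parses the JSON form of a Lease (as printed by `kubectl get lease ... -o json`) and reports how long ago it was renewed. `leaseDoc`, `leaseAge`, `sampleLease`, and `sampleNow` are hypothetical names; the sample values are copied from the manifest above. Staleness here is judged against the caller's clock, so treat it as a diagnostic hint rather than the election logic itself.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// leaseDoc mirrors only the Lease fields used in this sketch.
type leaseDoc struct {
	Spec struct {
		HolderIdentity       string    `json:"holderIdentity"`
		LeaseDurationSeconds int       `json:"leaseDurationSeconds"`
		RenewTime            time.Time `json:"renewTime"`
		LeaderTransitions    int       `json:"leaderTransitions"`
	} `json:"spec"`
}

// sampleLease carries the values from the manifest above.
const sampleLease = `{
  "spec": {
    "holderIdentity": "ip-10-0-1-5_uuid-abc",
    "leaseDurationSeconds": 15,
    "acquireTime": "2025-01-15T10:00:00Z",
    "renewTime": "2025-01-15T10:05:43Z",
    "leaderTransitions": 3
  }
}`

// sampleNow is an assumed "current time", two seconds after renewTime.
var sampleNow = time.Date(2025, 1, 15, 10, 5, 45, 0, time.UTC)

// leaseAge returns the holder, the time since the last renewal, and whether
// that age exceeds leaseDurationSeconds.
func leaseAge(raw []byte, now time.Time) (holder string, age time.Duration, stale bool, err error) {
	var l leaseDoc
	if err = json.Unmarshal(raw, &l); err != nil {
		return "", 0, false, err
	}
	age = now.Sub(l.Spec.RenewTime)
	stale = age > time.Duration(l.Spec.LeaseDurationSeconds)*time.Second
	return l.Spec.HolderIdentity, age, stale, nil
}

func main() {
	holder, age, stale, err := leaseAge([]byte(sampleLease), sampleNow)
	fmt.Println(holder, age, stale, err) // ip-10-0-1-5_uuid-abc 2s false <nil>
}
```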

Leader Election in Custom Operators

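// controller-runtime leader election setup (main.go)
// Needs imports: "os", "time", ctrl "sigs.k8s.io/controller-runtime"
leaseDuration := 15 * time.Second
renewDeadline := 10 * time.Second
retryPeriod := 2 * time.Second

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    Scheme: scheme, // runtime.Scheme built earlier in main()

    // Enable leader election
    LeaderElection:          true,
    LeaderElectionID:        "payments-operator.mycompany.io", // Lease name
    LeaderElectionNamespace: "kube-system",

    // Tuning (defaults shown)
    LeaseDuration: &leaseDuration, // 15s: how long the lease is valid without renewal
    RenewDeadline: &renewDeadline, // 10s: leader gives up if it cannot renew within this
    RetryPeriod:   &retryPeriod,   // 2s: interval between acquire/renew attempts
})
if err != nil {
    os.Exit(1) // the manager could not be built
}

# RBAC for leader election (operator needs Lease CRUD)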
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: payments-operator-leader-election
  namespace: kube-system
rules:
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "patch"]

Leader Election Timing Parameters

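Parameter     | Default | Meaning                                          | If too low                            | If too high
--------------+---------+--------------------------------------------------+---------------------------------------+----------------------------------------------------
leaseDuration | 15s     | Lease validity without renewal                   | Too many false failovers              | Slow failover after leader crash
renewDeadline | 10s     | Leader must renew within this window             | Leader gives up leadership too easily | Leader stays active too long during partial failure
retryPeriod   | 2s      | Interval between acquire (standby) and renew     | API server load                       | Slow failover detection
              |         | (leader) attempts                                |                                       |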
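Invariant: retryPeriod < renewDeadline < leaseDuration (client-go also requires renewDeadline > 1.2 × retryPeriod, its jitter factor)
Typical tuning for fast failover: leaseDuration 5s / renewDeadline 4s / retryPeriod 2s (at cost of higher API server load)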
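
A quick way to sanity-check a tuning before deploying it is to encode the invariant. The sketch below is an assumption-laden helper, not part of client-go or controller-runtime: `timing`, `validate`, `roughFailover`, and the 1.2 `jitterFactor` constant are illustrative names and values, and the failover estimate ignores jitter and request latency.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// jitterFactor is assumed to match the multiplier client-go applies when it
// validates renewDeadline against retryPeriod; verify against your version.
const jitterFactor = 1.2

// timing groups the three leader election durations.
type timing struct {
	LeaseDuration, RenewDeadline, RetryPeriod time.Duration
}

var (
	defaults     = timing{15 * time.Second, 10 * time.Second, 2 * time.Second}
	fastFailover = timing{5 * time.Second, 4 * time.Second, 2 * time.Second}
)

// validate checks retryPeriod < renewDeadline < leaseDuration, with the
// jitter margin on the first inequality.
func (t timing) validate() error {
	if t.RetryPeriod <= 0 {
		return errors.New("retryPeriod must be positive")
	}
	if t.LeaseDuration <= t.RenewDeadline {
		return errors.New("leaseDuration must be greater than renewDeadline")
	}
	if float64(t.RenewDeadline) <= jitterFactor*float64(t.RetryPeriod) {
		return errors.New("renewDeadline must be greater than 1.2 x retryPeriod")
	}
	return nil
}

// roughFailover estimates the time from the leader's last renewal to a
// standby noticing the expiry: the lease must run out, then the next poll
// must happen.
func (t timing) roughFailover() time.Duration {
	return t.LeaseDuration + t.RetryPeriod
}

func main() {
	bad := timing{5 * time.Second, 4 * time.Second, 4 * time.Second}
	for _, t := range []timing{defaults, fastFailover, bad} {
		fmt.Println(t.validate(), t.roughFailover())
	}
}
```

With the defaults this estimates roughly 17s to fail over, and roughly 7s with the fast tuning; the third configuration is rejected because 4s is not greater than 1.2 × 4s.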

Diagnosing Leader Election Issues

# See who is the current leader
kubectl get lease kube-controller-manager -n kube-system \
  -o jsonpath='{.spec.holderIdentity}'

kubectl get lease kube-scheduler -n kube-system \
  -o jsonpath='{.spec.holderIdentity}'

# See all leases in kube-system
kubectl get leases -n kube-system

# Watch leader transitions in real time
kubectl get lease kube-controller-manager -n kube-system \
  -w -o jsonpath='{.spec.holderIdentity}{"\n"}'

# Check leaderTransitions counter (high = instability)
kubectl get lease kube-controller-manager -n kube-system \
  -o jsonpath='{.spec.leaderTransitions}'

# Custom operator lease
kubectl get lease payments-operator.mycompany.io -n kube-system -o yaml

# See leader election events
kubectl get events -n kube-system \
  --field-selector reason=LeaderElection

# Controller-manager logs showing election
kubectl logs -n kube-system \
  -l component=kube-controller-manager --tail=50 | grep -i leader

Split Brain Prevention

How Kubernetes prevents two leaders:

1. Optimistic locking (resourceVersion):
   - Both standby replicas see lease expired
   - Both attempt PATCH carrying the same resourceVersion they just read
   - etcd accepts only the first (atomic compare-and-swap); that write bumps resourceVersion
   - Second gets 409 Conflict — retries as standby

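2. Old leader detects loss:
   - renewDeadline exceeded → old leader logs error + exits
   - OR: old leader's PATCH returns 409 (lost lock to another replica)
   - controller-runtime: process exits, pod restarts as standby
   - renewDeadline < leaseDuration is what makes this safe: a leader that
     cannot renew stops before any standby may treat the lease as expired
   - Caveat: this relies on timeouts and similar clock rates; client-go
     documents that it is not a strict fencing guarantee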

3. No "deny" messages needed:
   - etcd atomicity guarantees exactly one winner per write