Leader Election Flow
Overview
Traces how Kubernetes controllers use the Lease API to elect a single active leader from multiple replicas, and what happens during startup, lease renewal, and failover.
Why Leader Election
kube-controller-manager, kube-scheduler, and custom operators
run as multiple replicas for availability.
Problem: if two replicas both reconcile the same object simultaneously,
they can fight each other → duplicate work, split-brain writes.
Solution: leader election — only ONE replica is "active" at any time.
Others are "standby" — ready to take over if the leader dies.
Kubernetes uses the Lease API (coordination.k8s.io/v1 Lease object)
as the lock primitive.
Leader Election Architecture
Replica A (leader)                      Replica B (standby)         Replica C (standby)
        │                                       │                           │
every retryPeriod (2s):                 every retryPeriod (2s):     (same as Replica B)
  UPDATE Lease (PUT)                      GET Lease
  renewTime: now                          check holderIdentity
  holderIdentity: replica-a               if expired → try to acquire
  give up after renewDeadline (10s)
        │                                       │                           │
        └───────────────────────────────────────┼───────────────────────────┘
                                                │
                                           API Server
                                                │
                                       etcd (Lease object)
Full Leader Election Sequence — Initial Acquisition
Replica A Replica B Replica C API Server etcd
│ │ │ │ │
│ [All replicas start simultaneously]│ │ │
│ │ │ │ │
│─GET Lease ───────────────────────────────────────► │ │
│─GET Lease ────────────────────────► │ │ │
│─GET Lease ─────────────────────────────────────────►│ │
│ │ │ │ │
│ │ │◄── 404 Not Found ─────────│
│◄── 404 Not Found ───────────────────────────────────────────── │
│◄── 404 Not Found ───────────────────────────────────────────── │
│ │ │ │ │
│ [All three attempt CREATE simultaneously] │ │
│─CREATE Lease ───────────────────────────────────► │ │
│ holderIdentity: replica-a │ │──WRITE──► │
│ leaseDuration: 15s │ │ │
│ acquireTime: now │ │ │
│ │ │ │
│─────────── CREATE Lease ────────────────────────► │ │
│ holderIdentity: replica-b │ │
│ │ │
│ ─ CREATE Lease ────► │ │
│ holderIdentity: replica-c │ │
│ │ │
│◄── 201 Created (A wins, etcd atomic write) ──────────────────── │
│ (B and C receive 409 Conflict) │ │ │
│ │ │ │ │
│ Replica A starts reconciliation loop │ │
│ Replica B enters standby (retry GET loop) │ │
│ Replica C enters standby │ │
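The create race above can be sketched with a small in-memory model. This is an illustration only: `leaseStore`, `Create`, and `raceToCreate` are hypothetical names, and a mutex stands in for the atomic create-if-absent that the API server and etcd provide. Which replica wins is not deterministic here, whereas the diagram shows A winning.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrConflict stands in for the 409 returned when the Lease already exists.
var ErrConflict = errors.New("409 Conflict: lease already exists")

// leaseStore is a hypothetical in-memory stand-in for the API server plus etcd.
type leaseStore struct {
	mu     sync.Mutex
	holder string // empty means the Lease object does not exist yet
}

// Create succeeds only for the first caller; later callers get ErrConflict.
func (s *leaseStore) Create(identity string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.holder != "" {
		return ErrConflict
	}
	s.holder = identity
	return nil
}

// raceToCreate starts one goroutine per replica and returns the identities
// whose Create call succeeded.
func raceToCreate(s *leaseStore, replicas []string) []string {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		winners []string
	)
	for _, id := range replicas {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			if err := s.Create(id); err == nil {
				mu.Lock()
				winners = append(winners, id)
				mu.Unlock()
			}
		}(id)
	}
	wg.Wait()
	return winners
}

func main() {
	s := &leaseStore{}
	winners := raceToCreate(s, []string{"replica-a", "replica-b", "replica-c"})
	fmt.Printf("leaders: %d, holder: %s\n", len(winners), s.holder)
}
```

However the goroutines interleave, exactly one `Create` returns nil, which is the property the election relies on.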
Lease Renewal — Active Leader
Replica A                 API Server               etcd
    │                          │                     │
    │  every retryPeriod (2s)  │                     │
    │                          │                     │
    │─ UPDATE Lease (PUT) ────►│                     │
    │    renewTime: now        │                     │
    │    resourceVersion: 42   │ (optimistic lock)   │
    │                          │──── WRITE ─────────►│
    │◄──────── 200 OK ─────────│                     │
    │                          │                     │
    │  [renew loop repeats every retryPeriod;
    │   each renewal must succeed within renewDeadline (10s)]
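The optimistic lock in the renewal step can be sketched as a version-checked update. This is a simplified model under stated assumptions: `store`, `lease`, and `renew` are hypothetical names, and `ResourceVersion` is an integer here although the real field is an opaque string managed by the API server.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict stands in for the 409 returned on a resourceVersion mismatch.
var ErrConflict = errors.New("409 Conflict: resourceVersion mismatch")

// lease is a trimmed, hypothetical model of the stored Lease object.
type lease struct {
	ResourceVersion int
	Holder          string
	RenewTime       int64 // seconds; simplified from a timestamp
}

// store is a single-threaded stand-in for the API server plus etcd.
type store struct{ current lease }

// Update applies the write only if the caller's resourceVersion matches the
// stored one, then bumps the version so older copies become stale.
func (s *store) Update(next lease) (lease, error) {
	if next.ResourceVersion != s.current.ResourceVersion {
		return s.current, ErrConflict
	}
	next.ResourceVersion++
	s.current = next
	return next, nil
}

// renew stamps a new renewTime on the caller's cached copy and submits it.
func renew(s *store, cached lease, now int64) (lease, error) {
	cached.RenewTime = now
	return s.Update(cached)
}

func main() {
	s := &store{current: lease{ResourceVersion: 42, Holder: "replica-a", RenewTime: 100}}

	// A renewal carrying the current version succeeds and returns version 43.
	fresh, err := renew(s, s.current, 102)
	fmt.Println(fresh.ResourceVersion, err)

	// A renewal that still carries version 42 is rejected.
	_, err = renew(s, lease{ResourceVersion: 42, Holder: "replica-a"}, 104)
	fmt.Println(err)
}
```

The second call is the situation a recovered or lagging replica is in: its cached copy is behind, so its write cannot overwrite a newer holder.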
Failover — Leader Dies
Replica A (DEAD) Replica B Replica C API Server etcd
│ │ │ │ │
│ [A crashes / network partition] │ │ │
│ │ │ │ │
│ │ GET Lease ──────────────────────►│ │
│ │◄── Lease: holderIdentity=A, renewTime=T ───── │
│ │ │ │ │
│ │ [B checks: is Lease expired?] │ │
│ │ now - renewTime > leaseDuration │ │
│ │ (client-go measures from when B last saw the Lease change, on B's own clock) │ │
│ │ → YES (A missed renewals) │ │
│ │ │ │ │
│ │─UPDATE Lease (PUT) ────────────► │ │
│ │ holderIdentity: replica-b │ │
│ │ acquireTime: now │ │
│ │ leaseTransitions: +1 │ │
│ │ resourceVersion: 43 │──WRITE──► │
│ │◄── 200 OK ─────────────────────── │ │
│ │ │ │ │
│ │ Replica B becomes leader │ │
│ │ Starts reconciliation loop │ │
│ │ │ │ │
│ [A recovers] │ │ │ │
│─UPDATE Lease (PUT) ───────────────────► │ │ │
│ holderIdentity: replica-a │ │ │ │
│ resourceVersion: 42 │ (stale!) │ │ │
│◄── 409 Conflict ───────────────────────── │ │ │
│ [A sees it lost the election] │ │ │
│ A re-enters standby │ │ │
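The expiry check and takeover can be sketched as follows. This is a simulation with integer seconds, not client-go code: `record`, `observer`, `takeOver`, and `simulateFailover` are hypothetical names. It assumes the standby judges expiry by how long the record has gone unchanged on its own clock, which avoids comparing timestamps written by another machine.

```go
package main

import "fmt"

// record is a trimmed model of the Lease spec.
type record struct {
	Holder           string
	LeaseDurationSec int64
	RenewTime        int64 // seconds; simplified from a timestamp
	Transitions      int
}

// observer holds what a standby remembers between polls.
type observer struct {
	last       record
	observedAt int64 // local clock reading when `last` was first seen
}

// observe resets the local timer whenever the fetched record has changed.
func (o *observer) observe(r record, now int64) {
	if r != o.last {
		o.last = r
		o.observedAt = now
	}
}

// expired reports whether a full lease duration has passed locally
// without the record changing.
func (o *observer) expired(now int64) bool {
	return now-o.observedAt > o.last.LeaseDurationSec
}

// takeOver builds the record a standby writes after expiry.
func takeOver(old record, identity string, now int64) record {
	next := old
	next.Holder = identity
	next.RenewTime = now
	if old.Holder != identity {
		next.Transitions = old.Transitions + 1
	}
	return next
}

// simulateFailover polls a record that is no longer being renewed every
// retryPeriod seconds (must be > 0) and returns the takeover record and time.
func simulateFailover(frozen record, identity string, start, retryPeriod int64) (record, int64) {
	o := &observer{}
	for now := start; ; now += retryPeriod {
		o.observe(frozen, now)
		if o.expired(now) {
			return takeOver(frozen, identity, now), now
		}
	}
}

func main() {
	// Replica A renewed at t=102 and then crashed.
	rec := record{Holder: "replica-a", LeaseDurationSec: 15, RenewTime: 102, Transitions: 3}
	next, at := simulateFailover(rec, "replica-b", 102, 2)
	fmt.Printf("t=%d holder=%s transitions=%d\n", at, next.Holder, next.Transitions)
	// prints "t=118 holder=replica-b transitions=4"
}
```

With a 15s lease and a 2s poll, the first poll that is more than 15s past the last observed change is at t=118, so the takeover lags the crash by the lease duration plus up to one poll interval.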
Lease Object
# Lease object in kube-system (created by kube-controller-manager)
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  holderIdentity: "ip-10-0-1-5_uuid-abc"      # identity of current leader (hostname_uuid)
  leaseDurationSeconds: 15                    # how long lease is valid without renewal
  acquireTime: "2025-01-15T10:00:00.000000Z"  # when this replica acquired leadership
  renewTime: "2025-01-15T10:05:43.000000Z"    # last renewal (updated every retryPeriod, ~2s)
  leaseTransitions: 3                         # how many times leadership has changed hands
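As a diagnostic aid, the fields above can be read from `kubectl get lease ... -o json` output with the standard library alone. The sketch below is an assumption-laden illustration: `leaseObject`, `renewalAge`, `sampleLease`, and `sampleNow` are hypothetical names, the JSON is made-up sample data, and the counter is decoded under the name `leaseTransitions`, which is how the Lease spec serializes it. Comparing `renewTime` with a local clock is sensitive to clock skew, so treat the result as a hint rather than a verdict.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// leaseObject keeps only the spec fields discussed above.
type leaseObject struct {
	Spec struct {
		HolderIdentity       string    `json:"holderIdentity"`
		LeaseDurationSeconds int       `json:"leaseDurationSeconds"`
		RenewTime            time.Time `json:"renewTime"`
		LeaseTransitions     int       `json:"leaseTransitions"`
	} `json:"spec"`
}

// sampleLease is made-up data in the shape of `kubectl get lease -o json`.
const sampleLease = `{
  "spec": {
    "holderIdentity": "ip-10-0-1-5_uuid-abc",
    "leaseDurationSeconds": 15,
    "renewTime": "2025-01-15T10:05:43.000000Z",
    "leaseTransitions": 3
  }
}`

// sampleNow is a fixed "current time" so the example is deterministic.
func sampleNow() time.Time {
	return time.Date(2025, time.January, 15, 10, 5, 47, 0, time.UTC)
}

// renewalAge returns how long ago the lease was renewed relative to now and
// whether that age already exceeds leaseDurationSeconds.
func renewalAge(raw []byte, now time.Time) (age time.Duration, stale bool, err error) {
	var l leaseObject
	if err = json.Unmarshal(raw, &l); err != nil {
		return 0, false, err
	}
	age = now.Sub(l.Spec.RenewTime)
	stale = age > time.Duration(l.Spec.LeaseDurationSeconds)*time.Second
	return age, stale, nil
}

func main() {
	age, stale, err := renewalAge([]byte(sampleLease), sampleNow())
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("last renewed %s ago, stale=%v\n", age, stale)
	// prints "last renewed 4s ago, stale=false"
}
```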
Leader Election in Custom Operators
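// controller-runtime leader election setup (main.go)
// Needs imports "os", "time", and ctrl "sigs.k8s.io/controller-runtime".
// scheme and setupLog are assumed to be defined elsewhere in main.go.
leaseDuration := 15 * time.Second
renewDeadline := 10 * time.Second
retryPeriod := 2 * time.Second

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Scheme: scheme,
	// Enable leader election
	LeaderElection:          true,
	LeaderElectionID:        "payments-operator.mycompany.io", // Lease name
	LeaderElectionNamespace: "kube-system",
	// Tuning (defaults shown)
	LeaseDuration: &leaseDuration, // 15s: how long the lease is valid without renewal
	RenewDeadline: &renewDeadline, // 10s: leader gives up if it cannot renew within this
	RetryPeriod:   &retryPeriod,   // 2s: interval between acquire/renew attempts
})
if err != nil {
	setupLog.Error(err, "unable to create manager")
	os.Exit(1)
}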
# RBAC for leader election (operator needs Lease CRUD)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: payments-operator-leader-election
  namespace: kube-system
rules:
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "patch"]
Leader Election Timing Parameters
| Parameter | Default | Meaning | If too low | If too high |
|---|---|---|---|---|
| leaseDuration | 15s | Lease validity without renewal | Too many false failovers | Slow failover after leader crash |
| renewDeadline | 10s | Leader must renew within this window | Leader gives up leadership too easily | Leader stays active too long during partial failure |
| retryPeriod | 2s | Interval between attempts (standby polls, leader renewals) | Higher API server load | Slow failover detection |
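Invariant: retryPeriod < renewDeadline < leaseDuration (client-go additionally requires renewDeadline > 1.2 × retryPeriod, to leave room for retry jitter)
Typical tuning for fast failover: leaseDuration 5s / renewDeadline 4s / retryPeriod 2s (at the cost of more false failovers when the API server is slow; API server load is driven by retryPeriod, which stays at 2s here)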
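The ordering constraint can be checked before a configuration is rolled out. The sketch below is standalone and does not call client-go: `timing`, `validate`, and `seconds` are hypothetical names, and the 1.2 jitter factor is included on the assumption that it matches the check client-go performs.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// jitterFactor is the multiplier applied to retryPeriod in the check below
// (assumed to match client-go's value).
const jitterFactor = 1.2

// timing groups the three leader election durations.
type timing struct {
	LeaseDuration time.Duration
	RenewDeadline time.Duration
	RetryPeriod   time.Duration
}

// seconds is a small helper for building durations from whole seconds.
func seconds(n int) time.Duration { return time.Duration(n) * time.Second }

// validate enforces leaseDuration > renewDeadline > jitterFactor * retryPeriod.
func (c timing) validate() error {
	if c.RetryPeriod <= 0 {
		return errors.New("retryPeriod must be positive")
	}
	if c.LeaseDuration <= c.RenewDeadline {
		return errors.New("leaseDuration must be greater than renewDeadline")
	}
	if c.RenewDeadline <= time.Duration(jitterFactor*float64(c.RetryPeriod)) {
		return errors.New("renewDeadline must be greater than retryPeriod * 1.2")
	}
	return nil
}

func main() {
	configs := []timing{
		{LeaseDuration: seconds(15), RenewDeadline: seconds(10), RetryPeriod: seconds(2)}, // defaults
		{LeaseDuration: seconds(5), RenewDeadline: seconds(4), RetryPeriod: seconds(2)},   // fast failover
		{LeaseDuration: seconds(5), RenewDeadline: seconds(2), RetryPeriod: seconds(2)},   // too tight
	}
	for _, c := range configs {
		fmt.Printf("%v/%v/%v -> %v\n", c.LeaseDuration, c.RenewDeadline, c.RetryPeriod, c.validate())
	}
}
```

The third configuration fails because a 2s renewDeadline leaves no room for even one jittered 2s retry.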
Diagnosing Leader Election Issues
# See who is the current leader
kubectl get lease kube-controller-manager -n kube-system \
  -o jsonpath='{.spec.holderIdentity}'
kubectl get lease kube-scheduler -n kube-system \
  -o jsonpath='{.spec.holderIdentity}'
# See all leases in kube-system
kubectl get leases -n kube-system
# Watch leader transitions in real time
kubectl get lease kube-controller-manager -n kube-system \
  -w -o jsonpath='{.spec.holderIdentity}{"\n"}'
# Check leaseTransitions counter (high = instability)
kubectl get lease kube-controller-manager -n kube-system \
  -o jsonpath='{.spec.leaseTransitions}'
# Custom operator lease
kubectl get lease payments-operator.mycompany.io -n kube-system -o yaml
# See leader election events
kubectl get events -n kube-system \
  --field-selector reason=LeaderElection
# Controller-manager logs showing election
kubectl logs -n kube-system \
  -l component=kube-controller-manager --tail=50 | grep -i leader
Split Brain Prevention
How Kubernetes prevents two leaders:
1. Optimistic locking (resourceVersion):
- Both standby replicas see the lease as expired
- Both send an update carrying the resourceVersion they last read
- The API server, backed by etcd's atomic compare-and-swap, accepts only the first
- The second gets 409 Conflict and stays in standby
2. Old leader detects loss:
- It cannot renew within renewDeadline → it logs an error and stops leading
- OR: its update returns 409 (another replica now holds the lock)
- controller-runtime: the process exits and the pod restarts as standby
3. No "deny" messages needed:
- etcd atomicity guarantees exactly one winner per write
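The old leader's side of point 2 can be sketched as a bounded retry loop. Time is simulated in whole seconds and `renewWithinDeadline` is a hypothetical helper, not a client-go function; it only illustrates that a leader which cannot renew before renewDeadline must stop acting as leader.

```go
package main

import "fmt"

// renewWithinDeadline calls attempt once per retryPeriod (simulated seconds)
// and reports whether any attempt succeeded before renewDeadline elapsed.
func renewWithinDeadline(attempt func(elapsed int) bool, retryPeriod, renewDeadline int) bool {
	for elapsed := 0; elapsed < renewDeadline; elapsed += retryPeriod {
		if attempt(elapsed) {
			return true
		}
	}
	return false
}

func main() {
	attempts := 0
	// partitioned models a leader whose every update fails, either because
	// the API server is unreachable or because it keeps returning 409.
	partitioned := func(elapsed int) bool {
		attempts++
		return false
	}
	if !renewWithinDeadline(partitioned, 2, 10) {
		fmt.Printf("renewal failed after %d attempts; stop leading and exit\n", attempts)
	}
	// prints "renewal failed after 5 attempts; stop leading and exit"
}
```

Because renewDeadline is shorter than leaseDuration, the old leader gives up before a standby is allowed to consider the lease expired, which narrows the window in which two replicas could both act as leader.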
Related
- 01 — Pod Creation Flow — kube-scheduler is also leader-elected
- 10 — Garbage Collection — GC controller runs under kube-controller-manager (leader-elected)
- 05 — Control Plane Issues — leader election failures in operations