Kubernetes API Model
The declarative REST API that is the universal interface to every Kubernetes subsystem — resource objects, watch semantics, informer architecture, patch types, Server-Side Apply, CEL validation, and API Priority & Fairness.
REST API URL Patterns
Every Kubernetes resource is addressable by a deterministic URL. kube-apiserver (see 01-kube-apiserver.html) serves these paths over HTTPS on :6443.
| Scope | URL Template | Example |
|---|---|---|
| Cluster-scoped list | /apis/{group}/{version}/{resource} | /apis/rbac.authorization.k8s.io/v1/clusterroles |
| Cluster-scoped single | /apis/{group}/{version}/{resource}/{name} | /apis/rbac.authorization.k8s.io/v1/clusterroles/admin |
| Namespace-scoped list | /apis/{group}/{version}/namespaces/{ns}/{resource} | /apis/apps/v1/namespaces/default/deployments |
| Namespace-scoped single | /apis/{group}/{version}/namespaces/{ns}/{resource}/{name} | /apis/apps/v1/namespaces/default/deployments/nginx |
| Core group list | /api/v1/{resource} | /api/v1/nodes |
| Core group NS-scoped | /api/v1/namespaces/{ns}/{resource} | /api/v1/namespaces/kube-system/pods |
| Subresource | .../{resource}/{name}/{sub} | /api/v1/namespaces/default/pods/nginx/log |
| Discovery | /apis, /api, /apis/{group} | kubectl api-resources uses these |
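The URL templates in the table above can be assembled mechanically. The following is a minimal sketch using only the Go standard library; `resourcePath` and its parameter order are illustrative names, not part of client-go.

```go
package main

import (
	"fmt"
	"path"
)

// resourcePath builds the REST path for a resource from the URL templates.
// Assumptions of this sketch: an empty group means the core group, an empty
// namespace means cluster scope, and an empty name addresses the collection.
func resourcePath(group, version, namespace, resource, name string) string {
	segments := []string{"/apis", group, version}
	if group == "" {
		segments = []string{"/api", version} // the core group lives under /api
	}
	if namespace != "" {
		segments = append(segments, "namespaces", namespace)
	}
	segments = append(segments, resource)
	if name != "" {
		segments = append(segments, name)
	}
	return path.Join(segments...)
}

func main() {
	fmt.Println(resourcePath("apps", "v1", "default", "deployments", "nginx"))
	fmt.Println(resourcePath("", "v1", "", "nodes", ""))
}
```

A subresource such as /log or /status would be one more segment appended after the name.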
Subresources
| Subresource | Parent | Purpose |
|---|---|---|
| /status | Most resources | Update the status subobject only; authorized as a separate RBAC resource (e.g. deployments/status) |
| /scale | Deployment, ReplicaSet, StatefulSet, CRDs | Read/write replicas (used by the HPA) without full object access |
| /exec | Pod | Streaming exec into a container (WebSocket or SPDY) |
| /log | Pod | Stream container logs |
| /portforward | Pod | TCP tunnel to container port |
| /attach | Pod | Attach stdin/stdout to running process |
| /eviction | Pod | Graceful eviction respecting PodDisruptionBudgets |
| /ephemeralcontainers | Pod | Add debug containers to running pod |
| /binding | Pod | Scheduler assigns pod to node |
# Inspect raw API calls made by kubectl
kubectl get pod nginx -n default -v=8 2>&1 | grep -E 'GET|POST|PATCH'
# Equivalent curl (service-account token and CA bundle from inside a pod)
SA=/var/run/secrets/kubernetes.io/serviceaccount
TOKEN=$(cat "$SA/token")
curl -s --cacert "$SA/ca.crt" https://kubernetes.default.svc/api/v1/namespaces/default/pods/nginx \
  -H "Authorization: Bearer $TOKEN"
# Discover all API groups + versions
kubectl api-versions | sort
curl -s https://<apiserver>:6443/apis --cert client.crt --key client.key \
  --cacert ca.crt | jq '.groups[].name'
API Groups and Versions
Resources are organized into named API groups with independent versioning lifecycles. The core group (/api/v1) is legacy; all modern resources live under named groups in /apis/.
Alpha (v1alpha1)
- Off by default, feature-gated
- No stability guarantee
- May be removed any release
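- Example: flowcontrol.apiserver.k8s.io/v1alpha1
Beta (v1beta1, v2beta2)
- New beta APIs are disabled by default since Kubernetes 1.24 (earlier betas were enabled by default)
- Schema may change between betas
- Once deprecated, served for at least 9 months or 3 releases (whichever is longer) before removal
- Example: autoscaling/v2beta2 → v2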
GA / Stable (v1)
- Once deprecated, served for at least 12 months or 3 releases (whichever is longer)
- No breaking schema changes
- Example: apps/v1, v1 (the core group has no group prefix)
Resource Object Anatomy
Every Kubernetes resource is a typed JSON/YAML object with a universal envelope. Understanding each field is essential for writing operators and debugging API behavior.
apiVersion: apps/v1 # <group>/<version> (TypeMeta)
kind: Deployment # PascalCase resource kind (TypeMeta)
metadata:
  name: nginx # required unless generateName is set
  generateName: nginx- # server appends 5-char random suffix; used by ReplicaSets
  namespace: default
  uid: 3fa85f64-5717-4562-b3fc-2c963f66afa6 # immutable UUID set by server
  resourceVersion: "1042819" # etcd revision string: optimistic locking & watch
  generation: 3 # incremented on spec changes (NOT status updates)
  creationTimestamp: "2024-01-15T10:30:00Z"
  deletionTimestamp: null # set when DELETE is called; lives until finalizers clear
  deletionGracePeriodSeconds: 30
  labels:
    app: nginx
    version: "1.25"
  annotations:
    deployment.kubernetes.io/revision: "3"
    kubectl.kubernetes.io/last-applied-configuration: '...'
  ownerReferences: # garbage-collection chain
  - apiVersion: apps/v1
    kind: ReplicaSet
    name: nginx-6799fc88d8
    uid: abc123
    controller: true # only one ownerRef may have controller:true
    blockOwnerDeletion: true # prevents owner deletion until dependent gone
  finalizers: # object not deleted until all strings removed
  - kubernetes.io/pvc-protection
  managedFields: # Server-Side Apply field ownership tracking
  - manager: kubectl
    operation: Apply
    apiVersion: apps/v1
    time: "2024-01-15T10:30:00Z"
    fieldsType: FieldsV1
    fieldsV1: { ... }
spec: # desired state; varies by kind
  replicas: 3
status: # observed state; written only by controllers
  readyReplicas: 3
generateName and Naming Conventions
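metadata.generateName causes kube-apiserver to append a random 5-character alphanumeric suffix before persisting the object. Used by ReplicaSets (which create Pods with generateName: nginx-6799fc88d8-), Jobs, and any controller that needs unique names without coordinating them. Uniqueness is probabilistic rather than guaranteed: on a rare suffix collision the create fails with 409 AlreadyExists, so clients should be prepared to retry.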
- If both name and generateName are set, name wins and generateName is ignored.
- Most resource names must be valid DNS subdomain names (RFC 1123): lowercase alphanumerics, - or ., at most 253 characters, starting and ending with an alphanumeric. Some kinds are stricter and require a DNS label (at most 63 characters, no dots), for example Services and Namespaces. Pod names are validated as subdomains, but keeping them within DNS label rules avoids surprises with the Pod hostname.
- The server rejects names violating these rules with 422 Unprocessable Entity.
- Because Namespace names are DNS labels, they cannot contain dots; this keeps DNS names of the form <service>.<namespace>.svc unambiguous.
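# generateName in action (kubectl run requires a fixed NAME; kubectl create honors generateName in a manifest)
kubectl create -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata: {generateName: debug-}
spec:
  restartPolicy: Never
  containers: [{name: debug, image: busybox, command: [sleep, "3600"]}]
EOF
# Result: pod/debug-k7x9p created (suffix varies)
# DNS label validation: this will fail
kubectl create namespace "My_Namespace"
# Error: rejected as invalid; the name must be a lowercase RFC 1123 label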
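The two naming rules can be checked locally before a request is sent. This is a minimal sketch assuming the usual RFC 1123 label and subdomain patterns; the function names are illustrative, and the server's own validation remains authoritative.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// One DNS label: lowercase alphanumerics and '-', alphanumeric at both ends.
	labelRe = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)
	// A subdomain: one or more labels joined by dots.
	subdomainRe = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)
)

// isDNS1123Label reports whether s is usable where a DNS label is required
// (for example a Namespace name): at most 63 characters, no dots.
func isDNS1123Label(s string) bool {
	return len(s) <= 63 && labelRe.MatchString(s)
}

// isDNS1123Subdomain reports whether s is usable where a DNS subdomain is
// required (most resource names): at most 253 characters in total.
func isDNS1123Subdomain(s string) bool {
	return len(s) <= 253 && subdomainRe.MatchString(s)
}

func main() {
	for _, name := range []string{"kube-system", "My_Namespace", "foos.example.com"} {
		fmt.Printf("%-18s label=%v subdomain=%v\n", name, isDNS1123Label(name), isDNS1123Subdomain(name))
	}
}
```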
HTTP Verbs, RBAC Verbs, and Patch Types
| HTTP | RBAC Verb | Semantics |
|---|---|---|
| GET (collection) | list | Return all resources; supports fieldSelector, labelSelector, limit/continue |
| GET (single) | get | Return single resource by name |
| GET (watch=true) | watch | Streaming response of change events; a separate RBAC verb from list |
| POST | create | Create new; server sets uid, resourceVersion, creationTimestamp |
| PUT | update | Full replacement; must include current resourceVersion |
| PATCH | patch | Partial update; 4 distinct strategies (see below) |
| DELETE (single) | delete | Initiates deletion; sets deletionTimestamp if finalizers exist |
| DELETE (collection) | deletecollection | Bulk delete matching label selector |
| POST /subresource | create | e.g., exec, portforward, eviction |
| GET /status | get | Often granted separately from main resource get |
| PUT /status | update | Controllers update status without main update permission |
Four Patch Types
JSON Merge Patch (RFC 7396)
Content-Type: application/merge-patch+json
Send only changed fields. Setting a key to null deletes it. Cannot surgically modify list elements — replaces entire array. Works on CRDs.
kubectl patch pod nginx \
--type=merge \
-p '{"spec":{"terminationGracePeriodSeconds":60}}'
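The merge rules described above (null deletes a key, arrays are replaced wholesale, objects merge recursively) follow RFC 7396. Below is a minimal standard-library sketch of that algorithm; `mergePatch` and `applyMergePatch` are illustrative names, not the apiserver's implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mergePatch applies an RFC 7396 merge patch to target and returns the result.
func mergePatch(target, patch any) any {
	patchObj, ok := patch.(map[string]any)
	if !ok {
		return patch // scalars and arrays replace the target wholesale
	}
	targetObj, ok := target.(map[string]any)
	if !ok {
		targetObj = map[string]any{} // a non-object target is treated as empty
	}
	for key, value := range patchObj {
		if value == nil {
			delete(targetObj, key) // JSON null removes the key
			continue
		}
		targetObj[key] = mergePatch(targetObj[key], value)
	}
	return targetObj
}

// applyMergePatch decodes both documents, merges them, and re-encodes the result.
func applyMergePatch(doc, patch string) (string, error) {
	var d, p any
	if err := json.Unmarshal([]byte(doc), &d); err != nil {
		return "", err
	}
	if err := json.Unmarshal([]byte(patch), &p); err != nil {
		return "", err
	}
	out, err := json.Marshal(mergePatch(d, p))
	return string(out), err
}

func main() {
	out, err := applyMergePatch(
		`{"a":1,"b":{"c":2,"d":3},"list":[1,2,3]}`,
		`{"b":{"c":null},"list":[9]}`,
	)
	fmt.Println(out, err)
}
```

The list in the example is replaced rather than merged, which is the limitation that Strategic Merge Patch and Server-Side Apply address.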
Strategic Merge Patch
Content-Type: application/strategic-merge-patch+json
Kubernetes-specific. Uses patchMergeKey struct tags to merge list elements by key (containers by name, volumes by name). Cannot be used on CRDs — CRDs have no Go struct tags.
kubectl patch deployment nginx \
--type=strategic \
-p '{"spec":{"template":{"spec":{"containers":[{"name":"nginx","image":"nginx:1.26"}]}}}}'
JSON Patch (RFC 6902)
Content-Type: application/json-patch+json
Array of operations: add, remove, replace, move, copy, test. Precise surgical edits including array indices. test enables conditional updates.
kubectl patch deployment nginx \
--type=json \
-p '[
{"op":"test","path":"/spec/replicas","value":3},
{"op":"replace","path":"/spec/replicas","value":5}
]'
Server-Side Apply (SSA)
Content-Type: application/apply-patch+yaml
Declarative ownership. Server tracks managedFields per manager. Conflicts returned as 409 Conflict. Only send intent; full object not required. See § Server-Side Apply.
kubectl apply --server-side \
--field-manager=my-operator \
-f deployment.yaml
DeleteOptions and Propagation Policy
When deleting an object, the client can send a DeleteOptions body to control cascading behavior and race protection:
apiVersion: v1
kind: DeleteOptions
gracePeriodSeconds: 0 # override pod's terminationGracePeriodSeconds (0 = immediate)
preconditions:
  uid: "3fa85f64-..." # only delete if UID matches; prevents deleting a replacement
  resourceVersion: "1042819" # only delete if resourceVersion matches; prevents races
propagationPolicy: Foreground # Foreground | Background | Orphan
Foreground Deletion
Owner gets deletionTimestamp + foregroundDeletion finalizer. The GC controller deletes the dependents first; those whose ownerReference sets blockOwnerDeletion: true must be gone before it removes the finalizer. The owner stays visible as "terminating" until then.
Background Deletion (default)
Owner deleted immediately. GC controller asynchronously deletes dependents via ownerReferences traversal in the background. Most common mode for kubectl delete. Dependents may outlive the owner briefly.
Orphan Policy
Owner deleted; dependents have their ownerReferences stripped but are not deleted. Used when you want to keep child resources after removing the parent — e.g., keep Pods when deleting a ReplicaSet for manual management.
Watch Protocol and ResourceVersion
Controllers never poll. They establish a long-lived HTTP/2 GET with ?watch=true that streams NDJSON event objects. The resourceVersion field drives resumable watches and optimistic locking.
| Event Type | Meaning | Controller action |
|---|---|---|
| ADDED | New object (or first seen after relist) | Add to local store, enqueue reconcile |
| MODIFIED | Any field changed (including status) | Update local store, enqueue reconcile |
| DELETED | Object deleted from etcd | Remove from store, enqueue cleanup |
| BOOKMARK | Server heartbeat with updated RV; no change | Update bookmark RV for reconnect |
| ERROR | Server error; often 410 Gone (RV compacted) | Re-LIST → re-WATCH with new RV |
ResourceVersion and MVCC
Every object and collection has a resourceVersion — a string representation of the etcd revision at which it was last written. The MVCC internals are covered in 02-etcd.html § MVCC. From the API perspective:
- Optimistic locking: a PUT (or a PATCH that includes metadata.resourceVersion) is rejected with 409 Conflict if another writer changed the object after that version was read.
- Watch resumption: ?resourceVersion=X streams all events after revision X. If X has been compacted, the server returns 410 Gone.
- Cache reads: ?resourceVersion=0 reads from watchCache (slightly stale but fast). Omit it for a quorum read from etcd.
Informer Architecture
Controllers never call the API per event. The client-go informer machinery consolidates List+Watch into a local in-memory store and work queue, enabling safe concurrent reconciliation. The server-side counterpart — watchCache — is covered in 01-kube-apiserver.html § watchCache.
Reflector
Performs initial LIST (resourceVersion=0 for cache hit), then establishes long-running WATCH. On 410 Gone or disconnect, re-LISTs. Feeds all deltas into DeltaFIFO.
DeltaFIFO
Ordered queue of typed deltas per object key (Added, Updated, Deleted, Replaced, Sync). Coalesces rapid changes to the same object key. Pop calls the handler and updates ThreadSafeStore atomically.
ThreadSafeStore (Indexer)
In-memory map of namespace/name → object. Supports secondary indices (e.g., "all pods on node X"). Controllers call lister.Pods(ns).Get(name) — reads local cache, zero API calls per reconcile loop.
WorkQueue Semantics
Rate-limited deduplicating queue. Add(key) enqueues once regardless of call count. Done(key) signals completion. AddRateLimited for retries with exponential backoff (base 5ms, max 1000s). AddAfter for RequeueAfter.
controller-runtime Reconciler Example (Go)
import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	myv1 "example.com/my-operator/api/v1" // hypothetical module path for the MyResource types
)

type MyReconciler struct {
	client.Client
	Scheme *runtime.Scheme
}

func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Read through the manager's cache-backed client, NOT a live API call
	var myObj myv1.MyResource
	if err := r.Get(ctx, req.NamespacedName, &myObj); err != nil {
		if apierrors.IsNotFound(err) {
			return ctrl.Result{}, nil // deleted; nothing to do
		}
		return ctrl.Result{}, err
	}
	if myObj.Status.Phase == "" {
		myObj.Status.Phase = "Pending"
		// Status update only touches the /status subresource
		if err := r.Status().Update(ctx, &myObj); err != nil {
			return ctrl.Result{}, err
		}
	}
	// Periodic re-check
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}

// SetupWithManager registers the reconciler with the controller manager
func (r *MyReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&myv1.MyResource{}).
		Owns(&appsv1.Deployment{}). // watch Deployments owned by MyResource
		Complete(r)
}
Server-Side Apply (SSA)
SSA moves merge logic to the server and introduces field ownership — each field is owned by a named manager. This enables safe co-management by multiple actors (controller + HPA + human operator).
| Dimension | Client-Side Apply (CSA) | Server-Side Apply (SSA) |
|---|---|---|
| Merge logic | kubectl (strategic merge) | kube-apiserver |
| Full object required | Yes (annotation stores last-applied) | No — only send intent fields |
| Field ownership | None (last writer wins) | Per-field, per-manager in managedFields |
| Conflict detection | None | 409 Conflict if another manager owns the field |
| CRD support | Limited | Full (no patchMergeKey struct tag needed) |
| kubectl flag | kubectl apply | kubectl apply --server-side |
Once the HPA has written spec.replicas, its field manager owns that field. If a human then server-side applies a manifest with a different spec.replicas, they get a 409 Conflict naming the owning manager. They must either pass --force-conflicts (and the HPA will overwrite the value on its next cycle) or remove replicas from their manifest entirely, leaving the HPA as the sole owner.
# Apply with SSA
kubectl apply --server-side -f deployment.yaml
# Custom manager name for operators
kubectl apply --server-side --field-manager=my-operator -f crd-instance.yaml
# Force-take ownership of conflicting fields
kubectl apply --server-side --force-conflicts -f deployment.yaml
# Inspect managed fields
kubectl get deployment nginx -o jsonpath='{.metadata.managedFields}' | jq .
# Give up ownership of a field: omit it from the manifest and re-apply
# If no other manager owns it, the field is removed from the object (or reset to its default); otherwise the remaining managers keep it
etcd Key Encoding
All objects are persisted in etcd under deterministic key paths. Deep internals (MVCC, compaction, WAL) are in 02-etcd.html § Kubernetes Keyspace.
# Key examples
/registry/pods/default/nginx-abc12
/registry/deployments/kube-system/coredns
/registry/secrets/default/my-secret
/registry/clusterroles/admin # cluster-scoped: no namespace
/registry/apiextensions.k8s.io/customresourcedefinitions/foos.example.com
# Object encoding: k8s\x00 magic prefix + protobuf Unknown wrapper
# To decode (needs etcd access + auger tool):
etcdctl get /registry/pods/default/nginx --print-value-only | auger decode
# Built-in resources are generally keyed by resource name alone:
# /registry/{resource}/{ns}/{name} (e.g. /registry/deployments/..., not /registry/apps/deployments/...)
# Custom resources include the group: /registry/{group}/{resource}/{ns}/{name}
API Request Lifecycle
Every request traverses a staged pipeline in kube-apiserver. Full detail in 01-kube-apiserver.html § Request Pipeline.
| Stage | What happens | Failure code |
|---|---|---|
| ① Authentication | Identity determined (x509, OIDC, SA JWT, etc.) | 401 |
| ② APF | Request classified by the authenticated identity into FlowSchema → PriorityLevel; throttled if queue full | 429 |
| ③ Authorization | RBAC/Node/Webhook checks if identity may perform verb on resource | 403 |
| ④ Mutating Admission | Webhooks + built-in plugins may modify object (defaults, injections) | 400/500 |
| ⑤ Schema Validation | OpenAPI v3 structural schema + CEL rules enforced in-process | 422 |
| ⑥ Validating Admission | Webhooks validate the (possibly mutated) final object | 400/403 |
| ⑦ etcd persist | protobuf-encoded, optionally encrypted, MVCC revision assigned | 500/503 |
Label and Field Selectors
Label Selectors
| Type | Syntax | Example |
|---|---|---|
| Equality | key=value, key==value, key!=value | app=nginx,tier!=frontend |
| Set-based | key in (v1,v2), key notin (v1), key, !key | env in (prod,staging) |
Services only support equality-based selectors: their .spec.selector is a plain key/value map. ReplicaSets and Deployments support both equality (matchLabels) and set-based (matchExpressions) in their .spec.selector, because the newer workload APIs use the richer LabelSelector type.
kubectl get pods -l app=nginx,tier=frontend
kubectl get pods -l 'env in (prod,staging)'
kubectl get pods -l '!canary' # pods WITHOUT the canary label
# ReplicaSet/Deployment combined selector
selector:
  matchLabels:
    app: nginx
  matchExpressions:
  - key: tier
    operator: In
    values: [frontend, backend]
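How matchLabels and matchExpressions combine (every requirement must hold) can be sketched as a small matcher. The types below are illustrative stand-ins for the LabelSelector API type, not the apimachinery implementation.

```go
package main

import "fmt"

type requirement struct {
	Key      string
	Operator string // In, NotIn, Exists, DoesNotExist
	Values   []string
}

type selector struct {
	MatchLabels      map[string]string
	MatchExpressions []requirement
}

func contains(values []string, v string) bool {
	for _, candidate := range values {
		if candidate == v {
			return true
		}
	}
	return false
}

// matches reports whether labels satisfy every matchLabels entry AND every
// matchExpressions requirement.
func (s selector) matches(labels map[string]string) bool {
	for key, want := range s.MatchLabels {
		if got, ok := labels[key]; !ok || got != want {
			return false
		}
	}
	for _, r := range s.MatchExpressions {
		value, present := labels[r.Key]
		switch r.Operator {
		case "In":
			if !present || !contains(r.Values, value) {
				return false
			}
		case "NotIn": // an absent key also satisfies NotIn
			if present && contains(r.Values, value) {
				return false
			}
		case "Exists":
			if !present {
				return false
			}
		case "DoesNotExist":
			if present {
				return false
			}
		default:
			return false // unknown operator: fail closed in this sketch
		}
	}
	return true
}

func main() {
	s := selector{
		MatchLabels:      map[string]string{"app": "nginx"},
		MatchExpressions: []requirement{{Key: "tier", Operator: "In", Values: []string{"frontend", "backend"}}},
	}
	fmt.Println(s.matches(map[string]string{"app": "nginx", "tier": "frontend"}))
	fmt.Println(s.matches(map[string]string{"app": "nginx", "tier": "cache"}))
}
```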
Field Selectors
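Filter on specific object fields. Support varies by resource: every resource supports metadata.name and metadata.namespace, and each kind adds a small fixed set (for Pods, fields such as spec.nodeName and status.phase). Filtering happens on the server; a selector on an unsupported field is rejected with an error rather than being applied client-side.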
kubectl get pods --field-selector status.phase=Running
kubectl get pods --field-selector spec.nodeName=worker-1
kubectl get events --field-selector involvedObject.name=nginx,type=Warning
kubectl get pods --field-selector 'status.phase!=Running,spec.restartPolicy=Always'
OpenAPI v3 and Schema Validation
kube-apiserver publishes per-group OpenAPI v3 schemas at /openapi/v3/apis/{group}/{version}. kubectl uses these for client-side validation. The combined v2 schema is at /openapi/v2.
# Get OpenAPI v3 schema for apps/v1 Deployment
kubectl get --raw '/openapi/v3/apis/apps/v1' | \
jq '.components.schemas."io.k8s.api.apps.v1.Deployment".properties.spec'
# Validate YAML without applying
kubectl apply --dry-run=client -f deployment.yaml # client-side only
kubectl apply --dry-run=server -f deployment.yaml # full server pipeline, no persist
# View structural schema on a CRD
kubectl get crd foos.example.com \
-o jsonpath='{.spec.versions[0].schema.openAPIV3Schema}' | jq .
CEL Validation in CRDs and ValidatingAdmissionPolicy
CEL (Common Expression Language) rules run in-process in kube-apiserver, so there is no webhook round-trip and evaluation is bounded by per-rule cost limits. Available in CRD x-kubernetes-validations (GA in 1.29) and ValidatingAdmissionPolicy (GA in 1.30).
CEL in CRDs
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: foos.example.com
spec:
  group: example.com # required; metadata.name must be <plural>.<group>
  scope: Namespaced
  names: {plural: foos, singular: foo, kind: Foo}
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            required: [replicas, minReplicas, maxReplicas] # the rules below read all three fields
            x-kubernetes-validations:
            - rule: "self.minReplicas <= self.maxReplicas"
              message: "minReplicas must be <= maxReplicas"
            - rule: "self.replicas >= self.minReplicas && self.replicas <= self.maxReplicas"
              message: "replicas must be within [minReplicas, maxReplicas]"
            properties:
              replicas: {type: integer, minimum: 0}
              minReplicas: {type: integer, minimum: 0}
              maxReplicas: {type: integer, minimum: 1}
ValidatingAdmissionPolicy
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-run-as-non-root
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE","UPDATE"]
      resources: ["pods"]
  validations:
  - expression: >
      object.spec.containers.all(c,
      has(c.securityContext) &&
      has(c.securityContext.runAsNonRoot) &&
      c.securityContext.runAsNonRoot == true)
    message: "All containers must set securityContext.runAsNonRoot=true"
  # has() guards keep the rule from erroring when the optional fields are unset
  - expression: >
      (!has(object.spec.hostPID) || !object.spec.hostPID) &&
      (!has(object.spec.hostNetwork) || !object.spec.hostNetwork)
    message: "hostPID and hostNetwork must be false"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-run-as-non-root-binding
spec:
  policyName: require-run-as-non-root
  validationActions: [Deny]
  matchResources:
    namespaceSelector:
      matchLabels:
        enforce-security: "true"
CRD Versioning and Conversion Webhooks
CRDs support multiple versions simultaneously. The served flag controls which versions kube-apiserver serves; the storage flag (exactly one version) controls which version is persisted in etcd.
spec:
  versions:
  - name: v1alpha1
    served: true # /apis/example.com/v1alpha1/... is active
    storage: false # NOT the etcd storage version
    schema: { ... }
  - name: v1
    served: true
    storage: true # ALL objects written to etcd as v1
    schema: { ... }
  conversion:
    strategy: Webhook # None (default) | Webhook
    webhook:
      conversionReviewVersions: ["v1","v1beta1"]
      clientConfig:
        service:
          namespace: default
          name: crd-conversion-webhook
          path: /convert
Hub version pattern: designate one version as the authoritative internal representation and route every conversion through it, for example v1alpha1 → hub (v1) → v1beta1. With n versions this needs 2(n-1) conversion functions instead of the n(n-1) required for direct pairwise conversion. The webhook receives a ConversionReview:
# ConversionReview request (sent by apiserver to webhook)
apiVersion: apiextensions.k8s.io/v1
kind: ConversionReview
request:
  uid: "f5cba308-..."
  desiredAPIVersion: example.com/v1
  objects:
  - apiVersion: example.com/v1alpha1
    kind: Foo
    metadata: { name: myfoo }
    spec: { oldField: value }
response: # webhook returns this
  uid: "f5cba308-..."
  result: {status: Success}
  convertedObjects:
  - apiVersion: example.com/v1
    kind: Foo
    metadata: { name: myfoo }
    spec: { newField: value } # converted representation
Changing the storage: true version does NOT migrate existing objects. Objects written under the old version stay in etcd in their old form (custom resources are stored as JSON) until they are next written. To migrate, rewrite every object, for example with kubectl get RESOURCE -A -o yaml | kubectl replace -f - or with the kube-storage-version-migrator, and then drop the old version from the CRD's status.storedVersions.
apimachinery Library Layout
Understanding k8s.io/apimachinery is essential for writing controllers and API extensions.
k8s.io/apimachinery/pkg/api/errors
HTTP status helpers. Use these instead of raw HTTP codes:
import apierrors "k8s.io/apimachinery/pkg/api/errors"
apierrors.IsNotFound(err) // 404
apierrors.IsConflict(err) // 409 (RV mismatch)
apierrors.IsAlreadyExists(err) // 409 (name taken)
apierrors.IsTooManyRequests(err)// 429 (APF throttle)
apierrors.IsServerTimeout(err) // 500 (reason ServerTimeout); IsServiceUnavailable covers 503
apierrors.IsUnauthorized(err) // 401
apierrors.IsForbidden(err) // 403
apierrors.IsInvalid(err) // 422
runtime.Object and TypeMeta/ObjectMeta
All Kubernetes objects implement runtime.Object (two methods: GetObjectKind(), DeepCopyObject()). TypeMeta carries apiVersion/kind. ObjectMeta carries all metadata.*.
type TypeMeta struct {
Kind string `json:"kind,omitempty"`
APIVersion string `json:"apiVersion,omitempty"`
}
type ObjectMeta struct {
Name, Namespace string
UID types.UID
ResourceVersion string
Generation int64
Labels, Annotations map[string]string
OwnerReferences []OwnerReference
Finalizers []string
}
schema.GroupVersionResource / GroupVersionKind
import "k8s.io/apimachinery/pkg/runtime/schema"
// GVR — used to construct API paths
gvr := schema.GroupVersionResource{
Group: "apps", Version: "v1",
Resource: "deployments",
}
// GVK — used in TypeMeta
gvk := schema.GroupVersionKind{
Group: "apps", Version: "v1",
Kind: "Deployment",
}
// Convert via RESTMapper
mapper.RESTMappings(gvk.GroupKind(), gvk.Version)
meta/v1 Types and Condition Pattern
// ListOptions
metav1.ListOptions{
LabelSelector: "app=nginx",
FieldSelector: "status.phase=Running",
ResourceVersion: "0", // from watchCache
Limit: 500,
}
// Standard condition (use for entries in status.conditions)
metav1.Condition{
Type: "Ready",
Status: metav1.ConditionTrue,
Reason: "DeploymentAvailable",
Message: "3/3 replicas running",
LastTransitionTime: metav1.Now(),
}
Pagination and Chunking
Large LIST responses should be paginated using limit + continue token to avoid memory pressure on both client and server. All pages share the same resourceVersion snapshot.
# Shell: paginate pods in 100-item chunks via the raw API
# (kubectl get has no limit/continue flags; its --chunk-size flag pages internally)
CONTINUE=""
while true; do
  URL="/api/v1/namespaces/default/pods?limit=100"
  [ -n "$CONTINUE" ] && URL="$URL&continue=$CONTINUE"
  RESP=$(kubectl get --raw "$URL")
  echo "$RESP" | jq -r '.items[].metadata.name'
  CONTINUE=$(echo "$RESP" | jq -r '.metadata.continue // empty')
  [ -z "$CONTINUE" ] && break
done
// Go: using the client-go pager (client is a kubernetes.Interface, ctx a context.Context)
import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/pager"
)

p := pager.New(pager.SimplePageFunc(func(opts metav1.ListOptions) (runtime.Object, error) {
	return client.CoreV1().Pods("").List(ctx, opts)
}))
p.PageSize = 100
err := p.EachListItem(ctx, metav1.ListOptions{}, func(obj runtime.Object) error {
	pod := obj.(*corev1.Pod)
	fmt.Println(pod.Name)
	return nil
})
All pages of a chunked LIST share the same resourceVersion. Objects created or deleted after the first page do not affect subsequent pages. This gives a consistent point-in-time snapshot — the same guarantee as a single large LIST, but without the memory spike.
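The limit/continue loop itself is simple enough to sketch against a fake server. In this example the continue token is just an index into a fixed snapshot, which is an assumption made for illustration; real tokens are opaque and must be passed back unchanged.

```go
package main

import (
	"fmt"
	"strconv"
)

// page mirrors the parts of a list response that matter for chunking.
type page struct {
	Items    []string
	Continue string // empty on the last page
}

// listPage is a fake server: it serves one chunk of a fixed snapshot.
// The token format (a decimal offset) is purely illustrative.
func listPage(snapshot []string, limit int, cont string) page {
	start := 0
	if cont != "" {
		start, _ = strconv.Atoi(cont)
	}
	end := start + limit
	if end >= len(snapshot) {
		return page{Items: snapshot[start:]}
	}
	return page{Items: snapshot[start:end], Continue: strconv.Itoa(end)}
}

// listAll follows continue tokens until the server returns an empty one.
func listAll(fetch func(cont string) page) []string {
	var all []string
	cont := ""
	for {
		p := fetch(cont)
		all = append(all, p.Items...)
		if p.Continue == "" {
			return all
		}
		cont = p.Continue
	}
}

func main() {
	snapshot := []string{"pod-a", "pod-b", "pod-c", "pod-d", "pod-e"}
	calls := 0
	names := listAll(func(cont string) page {
		calls++
		return listPage(snapshot, 2, cont)
	})
	fmt.Println(names, "in", calls, "requests")
}
```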
API Priority and Fairness (APF)
APF refines the coarse --max-requests-inflight and --max-mutating-requests-inflight limits with per-flow fair queuing; the sum of those two flags still sets the server's total concurrency budget. Every request is classified by a FlowSchema and placed in a PriorityLevel queue.
FlowSchema
Matches requests (by user, group, ServiceAccount, resource, verb) to a PriorityLevel. Ordered by matchingPrecedence (lower number = checked first). Built-in FlowSchemas include exempt, system-nodes, system-leader-election, kube-controller-manager, service-accounts, global-default, and catch-all.
PriorityLevelConfiguration
Two types: Exempt (never queued or rejected; used for the system:masters group) and Limited (a bounded share of concurrency with shuffle-sharded fair queues). Built-in Limited levels include leader-election, workload-high, workload-low, and global-default. Shuffle sharding isolates noisy tenants by hashing each flow onto a small subset of the level's queues.
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: my-app-high-priority
spec:
  matchingPrecedence: 200 # lower = matched first
  priorityLevelConfiguration:
    name: workload-high
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: my-critical-app
        namespace: default
    resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
      namespaces: ["*"] # needed to match namespaced requests
      clusterScope: true # also match cluster-scoped requests
# Monitor APF
kubectl get flowschemas
kubectl get prioritylevelconfigurations
kubectl get --raw /metrics | grep 'apiserver_flowcontrol'
# Key metrics:
# apiserver_flowcontrol_current_inqueue_requests{priority_level="workload-high"}
# apiserver_flowcontrol_rejected_requests_total{reason="queue-full"}
# apiserver_flowcontrol_request_wait_duration_seconds
# Diagnose throttling (look for 429 in audit log)
grep '"code":429' /var/log/kubernetes/audit.log | jq '.user.username' | sort | uniq -c
API Troubleshooting
HTTP 409 Conflict — resourceVersion mismatch
Another writer updated the object between your GET and PUT/PATCH. Always re-GET before retrying. Use RetryOnConflict in controllers:
import "k8s.io/client-go/util/retry"
err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
obj, err := client.Get(ctx, name, metav1.GetOptions{})
if err != nil { return err }
obj.Spec.Replicas = &desiredReplicas
_, err = client.Update(ctx, obj, metav1.UpdateOptions{})
return err
})
HTTP 410 Gone — watch resourceVersion too old (compacted)
History is compacted periodically (by default kube-apiserver requests an etcd compaction every 5 minutes, controlled by --etcd-compaction-interval). If a watcher is paused longer, its RV is gone. Informers handle this automatically: they re-LIST and then re-WATCH from the new RV.
# Check current revision and DB size on the etcd members
kubectl -n kube-system exec etcd-master -- \
  etcdctl endpoint status --write-out=table
# Flags to inspect:
# kube-apiserver: --etcd-compaction-interval=5m (default)
# etcd: --auto-compaction-mode / --auto-compaction-retention (only if set on etcd itself)
HTTP 429 Too Many Requests — APF throttled
Response includes Retry-After header. Root causes: burst watch reconnects after rolling restart, controller calling List in tight loop, unbounded LISTs without limit/continue. Fix: create a FlowSchema promoting the critical SA to a higher PriorityLevel.
kubectl get --raw /metrics | grep -E 'rejected_requests_total|current_inqueue'
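# List the configured flow schemas
kubectl get --raw '/apis/flowcontrol.apiserver.k8s.io/v1/flowschemas' | jq '.items[].metadata.name'
# To see which one a request matched, inspect the X-Kubernetes-PF-FlowSchema-UID and X-Kubernetes-PF-PriorityLevel-UID response headers (kubectl -v=8)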
HTTP 422 Unprocessable Entity — schema validation failure
Response body contains field-level errors. Common causes: CRD CEL validation failure, structural schema violation, immutable field change (e.g., selector), name violating DNS rules.
# Full error detail
kubectl apply -f resource.yaml 2>&1
# Output includes: "spec.selector: Invalid value ... field is immutable"
# Pre-flight check
kubectl apply --dry-run=server -f resource.yaml
Debugging API calls with kubectl verbosity
# -v=6: request URL + response code
kubectl get pod nginx -v=6
# -v=8: full request/response headers
kubectl get pod nginx -v=8
# -v=9: full request/response body
kubectl apply -f . -v=9
# Record for analysis
kubectl apply -f . -v=8 2>&1 | tee /tmp/kubectl-debug.log | grep -E 'Request|Response code'
Production Best Practices
Use SSA for operators
Use --server-side --field-manager=my-operator. This prevents clobbering fields managed by HPA, cert-manager, or human operators.
Paginate all large LISTs
Never do a bare LIST of pods or events cluster-wide without limit. Unbounded LISTs can cause memory spikes (up to OOM kills) in kube-apiserver and GC pressure on the client.
Pin resourceVersion=0 for cache reads
Controllers needing eventual-consistent data should LIST with resourceVersion=0 — served from watchCache (in-memory), not etcd. Dramatically reduces etcd load at scale.
Use Informers, not polling
A single Informer multiplexes events to all controllers in a process. Never write a controller that polls GET /pods on a timer — does not scale past a few hundred pods.
Respect retry semantics
Use RetryOnConflict for 409, respect Retry-After for 429, implement exponential backoff for 5xx. For controller queues, workqueue.DefaultItemBasedRateLimiter supplies the per-item exponential backoff when failed keys are re-added with AddRateLimited.
Use finalizers for external cleanup
Add a finalizer before creating external resources and remove it only after cleanup completes. Make cleanup idempotent and retryable so objects never stay stuck in Terminating, and document an emergency procedure for removing the finalizer by hand if the controller is gone.
Conditions over booleans in status
Follow metav1.Condition: Type, Status, Reason (machine-readable CamelCase), Message (human-readable), LastTransitionTime. This enables programmatic tooling and standard dashboards.
CEL validation over webhooks
Admission webhooks add latency and availability risk. For structural invariants expressible in CEL, use x-kubernetes-validations or ValidatingAdmissionPolicy — zero-latency, no webhook to maintain.
Plan CRD versions from day one
Designate a hub version, write conversion webhooks, never modify the storage version without migrating stored objects first. Use kube-storage-version-migrator or a migration Job.
RBAC needs both list AND watch
Controllers using Informers need both list and watch RBAC verbs. Missing watch causes a 403 when the Reflector tries to establish the watch stream — a subtle, delayed failure.
Namespace-scope new CRDs
Default to namespace-scoped CRDs. They get RBAC isolation, quota enforcement, and lifecycle management. Only use cluster-scoped for genuinely cluster-global resources.
APF FlowSchema for critical controllers
Create a FlowSchema pointing critical controllers (cert-manager, ArgoCD) to workload-high or a dedicated Limited PriorityLevel. Without it they share a priority level with every other service account, so one noisy client can delay them.