REST API URL Patterns

Every Kubernetes resource is addressable by a deterministic URL. kube-apiserver (see 01-kube-apiserver.html) serves these paths over HTTPS on :6443.

| Scope | URL Template | Example |
|---|---|---|
| Cluster-scoped list | /apis/{group}/{version}/{resource} | /apis/rbac.authorization.k8s.io/v1/clusterroles |
| Cluster-scoped single | /apis/{group}/{version}/{resource}/{name} | /apis/rbac.authorization.k8s.io/v1/clusterroles/admin |
| Namespace-scoped list | /apis/{group}/{version}/namespaces/{ns}/{resource} | /apis/apps/v1/namespaces/default/deployments |
| Namespace-scoped single | /apis/{group}/{version}/namespaces/{ns}/{resource}/{name} | /apis/apps/v1/namespaces/default/deployments/nginx |
| Core group list | /api/v1/{resource} | /api/v1/nodes |
| Core group NS-scoped | /api/v1/namespaces/{ns}/{resource} | /api/v1/namespaces/kube-system/pods |
| Subresource | .../{resource}/{name}/{sub} | /api/v1/namespaces/default/pods/nginx/log |
| Discovery | /apis, /api, /apis/{group} | kubectl api-resources uses these |
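The URL templates listed above are regular enough to build mechanically. The following is a minimal sketch, not client-go code: the function name ResourcePath and its parameter order are assumptions made for illustration. An empty group selects the core group, an empty namespace selects cluster scope, and an empty name addresses the collection.

```go
package main

import (
	"fmt"
	"strings"
)

// ResourcePath builds a REST path from the URL templates.
// group == ""     -> core group (/api/{version})
// namespace == "" -> cluster-scoped path
// name == ""      -> collection (list) path; subresources are ignored
func ResourcePath(group, version, namespace, resource, name string, subresource ...string) string {
	parts := []string{}
	if group == "" {
		parts = append(parts, "api", version)
	} else {
		parts = append(parts, "apis", group, version)
	}
	if namespace != "" {
		parts = append(parts, "namespaces", namespace)
	}
	parts = append(parts, resource)
	if name != "" {
		parts = append(parts, name)
		parts = append(parts, subresource...)
	}
	return "/" + strings.Join(parts, "/")
}

func main() {
	fmt.Println(ResourcePath("apps", "v1", "default", "deployments", "nginx"))
	fmt.Println(ResourcePath("", "v1", "default", "pods", "nginx", "log"))
	fmt.Println(ResourcePath("rbac.authorization.k8s.io", "v1", "", "clusterroles", ""))
}
```

Real clients normally obtain the group, version, and resource from discovery rather than hard-coding them.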

Subresources

| Subresource | Parent | Purpose |
|---|---|---|
| /status | Most resources | Update the status subobject only; needs its own RBAC rule on the subresource (e.g. deployments/status) |
| /scale | Deployment, ReplicaSet, StatefulSet, CRDs | Read/write replicas (used by HPA) without full object access |
| /exec | Pod | WebSocket exec into a container |
| /log | Pod | Stream container logs |
| /portforward | Pod | TCP tunnel to container port |
| /attach | Pod | Attach stdin/stdout to running process |
| /eviction | Pod | Graceful eviction respecting PodDisruptionBudgets |
| /ephemeralcontainers | Pod | Add debug containers to running pod |
| /binding | Pod | Scheduler assigns pod to node |

# Inspect raw API calls made by kubectl
kubectl get pod nginx -n default -v=8 2>&1 | grep -E 'GET|POST|PATCH'

# Equivalent curl (service-account token and CA bundle from inside a pod)
SA=/var/run/secrets/kubernetes.io/serviceaccount
TOKEN=$(cat $SA/token)
curl -s --cacert $SA/ca.crt https://kubernetes.default.svc/api/v1/namespaces/default/pods/nginx \
  -H "Authorization: Bearer $TOKEN"

# Discover all API groups + versions
kubectl api-versions | sort
# No -k here: --cacert verifies the server certificate, -k would disable that check
curl -s https://<apiserver>:6443/apis --cert client.crt --key client.key \
  --cacert ca.crt | jq '.groups[].name'

API Groups and Versions

Resources are organized into named API groups with independent versioning lifecycles. The core group keeps the legacy /api/v1 path and still holds fundamental kinds such as Pods, Services, and Nodes; newer resources are added under named groups in /apis/.

[Figure: API group version evolution. core: /api/v1 (GA). apps: extensions/v1beta1 (removed) → apps/v1beta1 (removed) → apps/v1beta2 (removed) → apps/v1 (GA). batch: batch/v1beta1 (removed) → batch/v1 (GA). autoscaling: autoscaling/v1 (GA), autoscaling/v2 (GA). Legend: GA (stable), Beta (deprecated), Removed.]

Alpha (v1alpha1)

  • Off by default, feature-gated
  • No stability guarantee
  • May be removed any release
  • Example: flowcontrol.apiserver.k8s.io/v1alpha1

Beta (v1beta1, v2beta2)

  • Enabled by default before 1.24; new beta APIs are off by default since 1.24
  • Schema may change between betas
  • Deprecated → removed in 3 releases
  • Example: autoscaling/v2beta2 (superseded by autoscaling/v2)

GA / Stable (v1)

  • 12+ months or 3-release support
  • No breaking schema changes
  • Example: apps/v1, core/v1

Resource Object Anatomy

Every Kubernetes resource is a typed JSON/YAML object with a universal envelope. Understanding each field is essential for writing operators and debugging API behavior.

apiVersion: apps/v1          # <group>/<version> — TypeMeta
kind: Deployment             # PascalCase resource kind — TypeMeta
metadata:
  name: nginx                # required unless generateName is set
  generateName: nginx-       # server appends 5-char random suffix; used by ReplicaSets
  namespace: default
  uid: 3fa85f64-5717-4562-b3fc-2c963f66afa6   # immutable UUID set by server
  resourceVersion: "1042819" # etcd revision string — optimistic locking & watch
  generation: 3              # incremented on spec changes (NOT status updates)
  creationTimestamp: "2024-01-15T10:30:00Z"
  deletionTimestamp: null    # set when DELETE is called; lives until finalizers clear
  deletionGracePeriodSeconds: 30
  labels:
    app: nginx
    version: "1.25"
  annotations:
    deployment.kubernetes.io/revision: "3"
    kubectl.kubernetes.io/last-applied-configuration: '...'
  ownerReferences:           # garbage-collection chain
  - apiVersion: apps/v1
    kind: ReplicaSet
    name: nginx-6799fc88d8
    uid: abc123
    controller: true         # only one ownerRef may have controller:true
    blockOwnerDeletion: true # prevents owner deletion until dependent gone
  finalizers:                # object not deleted until all strings removed
  - kubernetes.io/pvc-protection
  managedFields:             # Server-Side Apply field ownership tracking
  - manager: kubectl
    operation: Apply
    apiVersion: apps/v1
    time: "2024-01-15T10:30:00Z"
    fieldsType: FieldsV1
    fieldsV1: { ... }
spec:                        # desired state — varies by kind
  replicas: 3
status:                      # observed state — written only by controllers
  readyReplicas: 3

generateName and Naming Conventions

metadata.generateName causes kube-apiserver to append a random 5-character alphanumeric suffix before persisting the object. Used by ReplicaSets (which create Pods with generateName: nginx-6799fc88d8-), Jobs, and any controller that needs guaranteed unique names.

  • If both name and generateName are set, name wins — generateName is ignored.
  • Most resource names must be DNS subdomain names (RFC 1123): lowercase alphanumerics, - or ., max 253 chars, start/end alphanumeric. Some kinds are stricter and require a DNS label (max 63 chars, no dots), for example Services and Namespaces.
  • The server rejects names violating these rules with 422 Unprocessable Entity.
  • Namespace names are DNS labels (max 63 chars), so a name containing dots is rejected outright.
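# generateName in action: kubectl run has no generateName flag, so create from a manifest
kubectl create -f - <<'EOF'
{"apiVersion":"v1","kind":"Pod","metadata":{"generateName":"debug-"},
 "spec":{"restartPolicy":"Never","containers":[{"name":"debug","image":"busybox","command":["sleep","3600"]}]}}
EOF
# Result: pod/debug-k7x9p created (or similar)

# DNS label validation: this will fail
kubectl create namespace "My_Namespace"
# Error: invalid metadata.name; a namespace name must be a lowercase RFC 1123 label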
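The naming rules above can be checked client-side before a request is sent. The sketch below uses only the standard library; the helper names IsDNSLabel and IsDNSSubdomain are hypothetical, and the regular expressions are an approximation of the RFC 1123 rules described here rather than a copy of the server's validation code.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// One label: lowercase alphanumerics and '-', alphanumeric at both ends.
	labelRe = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)
	// One or more labels joined by dots.
	subdomainRe = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)
)

// IsDNSLabel reports whether s is a valid DNS label (max 63 chars, no dots).
func IsDNSLabel(s string) bool {
	return len(s) <= 63 && labelRe.MatchString(s)
}

// IsDNSSubdomain reports whether s is a valid DNS subdomain (max 253 chars).
func IsDNSSubdomain(s string) bool {
	return len(s) <= 253 && subdomainRe.MatchString(s)
}

func main() {
	for _, name := range []string{"kube-system", "My_Namespace", "foos.example.com", "-bad"} {
		fmt.Printf("%-18s label=%v subdomain=%v\n", name, IsDNSLabel(name), IsDNSSubdomain(name))
	}
}
```

In real controllers, prefer the validation helpers shipped in k8s.io/apimachinery over hand-written patterns; this version only illustrates the rules.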

HTTP Verbs, RBAC Verbs, and Patch Types

| HTTP | RBAC Verb | Semantics |
|---|---|---|
| GET (collection) | list | Return all resources; supports fieldSelector, labelSelector, limit/continue |
| GET (single) | get | Return single resource by name |
| GET (watch=true) | watch | Streaming response; separate RBAC verb from list |
| POST | create | Create new; server sets uid, resourceVersion, creationTimestamp |
| PUT | update | Full replacement; send the resourceVersion you read so concurrent changes are detected |
| PATCH | patch | Partial update; 4 distinct strategies (see below) |
| DELETE (single) | delete | Initiates deletion; sets deletionTimestamp if finalizers exist |
| DELETE (collection) | deletecollection | Bulk delete matching label selector |
| POST /subresource | create | e.g., exec, portforward, eviction |
| GET /status | get | Often granted separately from main resource get |
| PUT /status | update | Controllers update status without main update permission |

Four Patch Types

JSON Merge Patch (RFC 7396)

Content-Type: application/merge-patch+json

Send only changed fields. Setting a key to null deletes it. Cannot surgically modify list elements — replaces entire array. Works on CRDs.

kubectl patch pod nginx \
  --type=merge \
  -p '{"spec":{"terminationGracePeriodSeconds":60}}'

Strategic Merge Patch

Content-Type: application/strategic-merge-patch+json

Kubernetes-specific. Uses patchMergeKey struct tags to merge list elements by key (containers by name, volumes by name). Cannot be used on CRDs — CRDs have no Go struct tags.

kubectl patch deployment nginx \
  --type=strategic \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"nginx","image":"nginx:1.26"}]}}}}'

JSON Patch (RFC 6902)

Content-Type: application/json-patch+json

Array of operations: add, remove, replace, move, copy, test. Precise surgical edits including array indices. test enables conditional updates.

kubectl patch deployment nginx \
  --type=json \
  -p '[
    {"op":"test","path":"/spec/replicas","value":3},
    {"op":"replace","path":"/spec/replicas","value":5}
  ]'

Server-Side Apply (SSA)

Content-Type: application/apply-patch+yaml

Declarative ownership. Server tracks managedFields per manager. Conflicts returned as 409 Conflict. Only send intent; full object not required. See § Server-Side Apply.

kubectl apply --server-side \
  --field-manager=my-operator \
  -f deployment.yaml

DeleteOptions and Propagation Policy

When deleting an object, the client can send a DeleteOptions body to control cascading behavior and race protection:

apiVersion: v1
kind: DeleteOptions
gracePeriodSeconds: 0         # override pod's terminationGracePeriodSeconds (0 = immediate)
preconditions:
  uid: "3fa85f64-..."         # only delete if UID matches — prevents deleting replacement
  resourceVersion: "1042819"  # only delete if resourceVersion matches — prevents races
propagationPolicy: Foreground # Foreground | Background | Orphan

Foreground Deletion

Owner gets deletionTimestamp + foregroundDeletion finalizer. GC controller deletes all dependents (those with blockOwnerDeletion: true) first, then removes the finalizer. Owner is visible as "terminating" until all dependents gone.

Background Deletion (default)

Owner deleted immediately. GC controller asynchronously deletes dependents via ownerReferences traversal in the background. Most common mode for kubectl delete. Dependents may outlive the owner briefly.

Orphan Policy

Owner deleted; dependents have their ownerReferences stripped but are not deleted. Used when you want to keep child resources after removing the parent — e.g., keep Pods when deleting a ReplicaSet for manual management.

Watch Protocol and ResourceVersion

Controllers never poll. They establish a long-lived HTTP/2 GET with ?watch=true that streams NDJSON event objects. The resourceVersion field drives resumable watches and optimistic locking.

Watch Protocol (sequence between Controller, kube-apiserver, and etcd)

  • Controller → kube-apiserver: LIST pods?limit=500&resourceVersion=0
  • kube-apiserver → Controller: 200 OK {items:[...], resourceVersion:"5000"}
  • Controller → kube-apiserver: GET /pods?watch=true&resourceVersion=5000
  • kube-apiserver → etcd: etcd.Watch(rev=5000); etcd returns ADDED/MODIFIED/DELETED events
  • kube-apiserver → Controller: stream of {"type":"ADDED","object":{...}} and {"type":"BOOKMARK","resourceVersion":"5100"}
  • On 410 Gone: re-LIST, then re-WATCH

| Event Type | Meaning | Controller action |
|---|---|---|
| ADDED | New object (or first seen after relist) | Add to local store, enqueue reconcile |
| MODIFIED | Any field changed (including status) | Update local store, enqueue reconcile |
| DELETED | Object deleted from etcd | Remove from store, enqueue cleanup |
| BOOKMARK | Server heartbeat with updated RV; no change | Update bookmark RV for reconnect |
| ERROR | Server error; often 410 Gone (RV compacted) | Re-LIST → re-WATCH with new RV |

ResourceVersion and MVCC

Every object and collection has a resourceVersion — a string representation of the etcd revision at which it was last written. The MVCC internals are covered in 02-etcd.html § MVCC. From the API perspective:

  • Optimistic locking: a PUT (or a PATCH that includes metadata.resourceVersion) is rejected with 409 Conflict if the object has changed since that resourceVersion was read.
  • Watch resumption: ?resourceVersion=X streams all events after revision X. If X is compacted, returns 410 Gone.
  • Cache reads: ?resourceVersion=0 reads from watchCache (slightly stale but fast). Omit for quorum read from etcd.
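To make the event handling and resume logic concrete, here is a minimal sketch of a watch consumer that decodes a stream of newline-delimited JSON events, maintains a local store, and remembers the last resourceVersion for reconnects. It uses only the standard library; the type and function names (watchEvent, Consume, Replay) are hypothetical, and the event shapes are trimmed to the fields the sketch needs.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// watchEvent mirrors only the fields this sketch needs from a watch event.
type watchEvent struct {
	Type   string `json:"type"`
	Object struct {
		Code     int `json:"code"` // populated when the object is a Status (ERROR events)
		Metadata struct {
			Name            string `json:"name"`
			ResourceVersion string `json:"resourceVersion"`
		} `json:"metadata"`
	} `json:"object"`
}

// Consume applies events from r to store (name -> resourceVersion). It returns
// the last resourceVersion seen and whether the caller must re-LIST (410 Gone).
func Consume(r io.Reader, store map[string]string) (string, bool, error) {
	lastRV := ""
	dec := json.NewDecoder(r)
	for {
		var ev watchEvent
		if err := dec.Decode(&ev); err == io.EOF {
			return lastRV, false, nil // stream closed: reconnect from lastRV
		} else if err != nil {
			return lastRV, false, err
		}
		meta := ev.Object.Metadata
		switch ev.Type {
		case "ADDED", "MODIFIED":
			store[meta.Name] = meta.ResourceVersion
		case "DELETED":
			delete(store, meta.Name)
		case "BOOKMARK":
			// No object change; only the resume point advances.
		case "ERROR":
			return lastRV, ev.Object.Code == 410, nil
		}
		lastRV = meta.ResourceVersion
	}
}

// Replay feeds a literal stream to Consume (convenience for demos and tests).
func Replay(stream string) (map[string]string, string, bool) {
	store := map[string]string{}
	rv, relist, _ := Consume(strings.NewReader(stream), store)
	return store, rv, relist
}

func main() {
	store, rv, relist := Replay(`{"type":"ADDED","object":{"metadata":{"name":"a","resourceVersion":"5001"}}}
{"type":"ADDED","object":{"metadata":{"name":"b","resourceVersion":"5002"}}}
{"type":"DELETED","object":{"metadata":{"name":"a","resourceVersion":"5003"}}}
{"type":"BOOKMARK","object":{"metadata":{"resourceVersion":"5100"}}}
{"type":"ERROR","object":{"kind":"Status","code":410}}`)
	fmt.Println(store, rv, relist)
}
```

A production controller would use a client-go informer instead, which implements this loop along with relisting and backoff.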

Informer Architecture

Controllers never call the API per event. The client-go informer machinery consolidates List+Watch into a local in-memory store and work queue, enabling safe concurrent reconciliation. The server-side counterpart — watchCache — is covered in 01-kube-apiserver.html § watchCache.

Informer Pipeline (client-go): Reflector (List+Watch) → DeltaFIFO (change log) → ThreadSafeStore (cache, Get/List) → EventHandlers → WorkQueue (dedup + rate limit) → Reconciler

Reflector

Performs initial LIST (resourceVersion=0 for cache hit), then establishes long-running WATCH. On 410 Gone or disconnect, re-LISTs. Feeds all deltas into DeltaFIFO.

DeltaFIFO

Ordered queue of typed deltas per object key (Added, Updated, Deleted, Replaced, Sync). Coalesces rapid changes to the same object key. Pop calls the handler and updates ThreadSafeStore atomically.

ThreadSafeStore (Indexer)

In-memory map of namespace/name → object. Supports secondary indices (e.g., "all pods on node X"). Controllers call lister.Pods(ns).Get(name) — reads local cache, zero API calls per reconcile loop.

WorkQueue Semantics

Rate-limited deduplicating queue. Add(key) enqueues once regardless of call count. Done(key) signals completion. AddRateLimited for retries with exponential backoff (base 5ms, max 1000s). AddAfter for RequeueAfter.
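The per-item exponential backoff mentioned above (base 5ms, max 1000s) doubles the delay on every consecutive failure of the same key until it hits the cap. The sketch below reproduces only that arithmetic; the function name Backoff and the constants are illustrative and this is not the client-go rate limiter itself.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = 5 * time.Millisecond
	maxDelay  = 1000 * time.Second
)

// Backoff returns the delay for a key that has already failed `failures`
// times: baseDelay * 2^failures, capped at maxDelay.
func Backoff(failures int) time.Duration {
	d := baseDelay
	for i := 0; i < failures; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay // cap early so the value can never overflow
		}
	}
	return d
}

func main() {
	for _, n := range []int{0, 1, 2, 10, 18} {
		fmt.Printf("failures=%d delay=%v\n", n, Backoff(n))
	}
}
```

With these values the cap is reached on the 19th consecutive failure, so a persistently failing key settles at one retry roughly every 17 minutes.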

controller-runtime Reconciler Example (Go)
package controllers

import (
    "context"
    "time"

    appsv1 "k8s.io/api/apps/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/apimachinery/pkg/runtime"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"

    myv1 "example.com/my-operator/api/v1" // hypothetical module path for the MyResource types
)

type MyReconciler struct {
    client.Client
    Scheme *runtime.Scheme
}

func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Read from the informer-backed local cache: NOT a live API call
    var myObj myv1.MyResource
    if err := r.Get(ctx, req.NamespacedName, &myObj); err != nil {
        if apierrors.IsNotFound(err) {
            return ctrl.Result{}, nil // deleted; nothing to do
        }
        return ctrl.Result{}, err
    }

    if myObj.Status.Phase == "" {
        myObj.Status.Phase = "Pending"
        // Status update only touches the /status subresource
        if err := r.Status().Update(ctx, &myObj); err != nil {
            return ctrl.Result{}, err
        }
    }

    // Periodic re-check
    return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}

// SetupWithManager registers the reconciler with the controller manager.
func (r *MyReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&myv1.MyResource{}).
        Owns(&appsv1.Deployment{}). // also reconcile when an owned Deployment changes
        Complete(r)
}

Server-Side Apply (SSA)

SSA moves merge logic to the server and introduces field ownership — each field is owned by a named manager. This enables safe co-management by multiple actors (controller + HPA + human operator).

| Dimension | Client-Side Apply (CSA) | Server-Side Apply (SSA) |
|---|---|---|
| Merge logic | kubectl (strategic merge) | kube-apiserver |
| Full object required | Yes (annotation stores last-applied) | No, only send intent fields |
| Field ownership | None (last writer wins) | Per-field, per-manager in managedFields |
| Conflict detection | None | 409 Conflict if another manager owns the field |
| CRD support | Limited | Full (no patchMergeKey struct tag needed) |
| kubectl flag | kubectl apply | kubectl apply --server-side |

HPA + Manual Replicas Conflict

When an HPA manages a Deployment, spec.replicas is owned by the HPA's field manager. If a human applies a manifest with a different spec.replicas, they get a 409 Conflict. They must either pass --force-conflicts (and the HPA will overwrite the value on its next cycle) or remove replicas from their manifest entirely, letting the HPA remain the sole owner.

# Apply with SSA
kubectl apply --server-side -f deployment.yaml

# Custom manager name for operators
kubectl apply --server-side --field-manager=my-operator -f crd-instance.yaml

# Force-take ownership of conflicting fields
kubectl apply --server-side --force-conflicts -f deployment.yaml

# Inspect managed fields
kubectl get deployment nginx -o jsonpath='{.metadata.managedFields}' | jq .

# Give up ownership of a field: omit it from the manifest and re-apply
# If another manager also owns it, the value stays; if you were the sole owner,
# the field is removed from the object (or reset to its default)

etcd Key Encoding

All objects are persisted in etcd under deterministic key paths. Deep internals (MVCC, compaction, WAL) are in 02-etcd.html § Kubernetes Keyspace.

/registry/{resource-type}/{namespace}/{name}
# Key examples
/registry/pods/default/nginx-abc12
/registry/deployments/kube-system/coredns
/registry/secrets/default/my-secret
/registry/clusterroles/admin                                    # cluster-scoped: no namespace
/registry/apiextensions.k8s.io/customresourcedefinitions/foos.example.com

# Object encoding: k8s\x00 magic prefix + protobuf Unknown wrapper
# To decode (needs etcd access + auger tool):
etcdctl get /registry/pods/default/nginx --print-value-only | auger decode

# Most built-in resources omit the API group: /registry/{resource}/{ns}/{name}
# Custom resources (and some built-in groups) include it:
# /registry/{group}/{resource}/{ns}/{name}

API Request Lifecycle

Every request traverses a staged pipeline in kube-apiserver. Full detail in 01-kube-apiserver.html § Request Pipeline.

① Authn → ② APF → ③ Authz → ④ Mutating → ⑤ Schema Valid. → ⑥ Validating → ⑦ etcd

Authentication runs before APF because FlowSchemas classify requests by the authenticated user and group.

| Stage | What happens | Failure code |
|---|---|---|
| ① Authentication | Identity determined (x509, OIDC, SA JWT, etc.) | 401 |
| ② APF | Request classified into FlowSchema → PriorityLevel; throttled if queue full | 429 |
| ③ Authorization | RBAC/Node/Webhook checks if identity may perform verb on resource | 403 |
| ④ Mutating Admission | Webhooks + built-in plugins may modify object (defaults, injections) | 400/500 |
| ⑤ Schema Validation | OpenAPI v3 structural schema + CEL rules enforced in-process | 422 |
| ⑥ Validating Admission | Webhooks validate the (possibly mutated) final object | 400/403 |
| ⑦ etcd persist | protobuf-encoded, optionally encrypted, MVCC revision assigned | 500/503 |

Label and Field Selectors

Label Selectors

| Type | Syntax | Example |
|---|---|---|
| Equality | key=value, key==value, key!=value | app=nginx,tier!=frontend |
| Set-based | key in (v1,v2), key notin (v1), key, !key | env in (prod,staging) |

Service vs ReplicaSet Selector Syntax

Services only support equality-based selectors (map syntax in YAML). ReplicaSets and Deployments support both equality (matchLabels) and set-based (matchExpressions) in their .spec.selector. This is because Service endpoints use a different code path than replica set membership.

kubectl get pods -l app=nginx,tier=frontend
kubectl get pods -l 'env in (prod,staging)'
kubectl get pods -l '!canary'  # pods WITHOUT the canary label

# ReplicaSet/Deployment combined selector
selector:
  matchLabels:
    app: nginx
  matchExpressions:
  - key: tier
    operator: In
    values: [frontend, backend]
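The combined selector above is evaluated as a logical AND of every matchLabels pair and every matchExpressions requirement. The following sketch shows that evaluation with the standard library only; the Selector and Requirement types are simplified stand-ins invented for this example, not the apimachinery types.

```go
package main

import "fmt"

// Requirement is one matchExpressions entry.
type Requirement struct {
	Key      string
	Operator string // In, NotIn, Exists, DoesNotExist
	Values   []string
}

// Selector combines equality-based and set-based terms.
type Selector struct {
	MatchLabels      map[string]string
	MatchExpressions []Requirement
}

func contains(values []string, v string) bool {
	for _, x := range values {
		if x == v {
			return true
		}
	}
	return false
}

// Matches reports whether labels satisfy every term of the selector.
func (s Selector) Matches(labels map[string]string) bool {
	for key, want := range s.MatchLabels {
		if got, ok := labels[key]; !ok || got != want {
			return false
		}
	}
	for _, r := range s.MatchExpressions {
		value, present := labels[r.Key]
		switch r.Operator {
		case "In":
			if !present || !contains(r.Values, value) {
				return false
			}
		case "NotIn":
			// Objects without the key also satisfy NotIn.
			if present && contains(r.Values, value) {
				return false
			}
		case "Exists":
			if !present {
				return false
			}
		case "DoesNotExist":
			if present {
				return false
			}
		default:
			return false // unknown operator: fail closed
		}
	}
	return true
}

func main() {
	sel := Selector{
		MatchLabels:      map[string]string{"app": "nginx"},
		MatchExpressions: []Requirement{{Key: "tier", Operator: "In", Values: []string{"frontend", "backend"}}},
	}
	fmt.Println(sel.Matches(map[string]string{"app": "nginx", "tier": "frontend"}))
	fmt.Println(sel.Matches(map[string]string{"app": "nginx", "tier": "cache"}))
}
```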

Field Selectors

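Filter on specific object fields. Support varies by resource: every resource supports metadata.name and metadata.namespace, and a few add more (for example spec.nodeName and status.phase on Pods). Filtering happens server-side; a selector on an unsupported field is rejected with an error rather than being applied client-side.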

kubectl get pods --field-selector status.phase=Running
kubectl get pods --field-selector spec.nodeName=worker-1
kubectl get events --field-selector involvedObject.name=nginx,type=Warning
kubectl get pods --field-selector 'status.phase!=Running,spec.restartPolicy=Always'

OpenAPI v3 and Schema Validation

kube-apiserver publishes per-group OpenAPI v3 schemas at /openapi/v3/apis/{group}/{version}. kubectl uses these for client-side validation. The combined v2 schema is at /openapi/v2.

# Get OpenAPI v3 schema for apps/v1 Deployment
kubectl get --raw '/openapi/v3/apis/apps/v1' | \
  jq '.components.schemas."io.k8s.api.apps.v1.Deployment".properties.spec'

# Validate YAML without applying
kubectl apply --dry-run=client -f deployment.yaml   # client-side only
kubectl apply --dry-run=server -f deployment.yaml   # full server pipeline, no persist

# View structural schema on a CRD
kubectl get crd foos.example.com \
  -o jsonpath='{.spec.versions[0].schema.openAPIV3Schema}' | jq .

CEL Validation in CRDs and ValidatingAdmissionPolicy

CEL (Common Expression Language) rules run in-process in kube-apiserver, so there is no webhook round-trip or extra network hop. Available in CRD x-kubernetes-validations (GA in 1.29) and ValidatingAdmissionPolicy (GA in 1.30).

CEL in CRDs

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: foos.example.com
spec:
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            x-kubernetes-validations:
            - rule: "self.minReplicas <= self.maxReplicas"
              message: "minReplicas must be <= maxReplicas"
            - rule: "self.replicas >= self.minReplicas && self.replicas <= self.maxReplicas"
              message: "replicas must be within [minReplicas, maxReplicas]"
            properties:
              replicas: {type: integer, minimum: 0}
              minReplicas: {type: integer, minimum: 0}
              maxReplicas: {type: integer, minimum: 1}

ValidatingAdmissionPolicy

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-run-as-non-root
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE","UPDATE"]
      resources: ["pods"]
  validations:
  - expression: >
      object.spec.containers.all(c,
        has(c.securityContext) &&
        has(c.securityContext.runAsNonRoot) &&
        c.securityContext.runAsNonRoot == true)
    message: "All containers must set securityContext.runAsNonRoot=true"
  - expression: "(!has(object.spec.hostPID) || !object.spec.hostPID) && (!has(object.spec.hostNetwork) || !object.spec.hostNetwork)"
    message: "hostPID and hostNetwork must be false"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-run-as-non-root-binding
spec:
  policyName: require-run-as-non-root
  validationActions: [Deny]
  matchResources:
    namespaceSelector:
      matchLabels:
        enforce-security: "true"

CRD Versioning and Conversion Webhooks

CRDs support multiple versions simultaneously. The served flag controls which versions kube-apiserver serves; the storage flag (exactly one version) controls which version is persisted in etcd.

spec:
  versions:
  - name: v1alpha1
    served: true      # /apis/example.com/v1alpha1/... is active
    storage: false    # NOT the etcd storage version
    schema: { ... }
  - name: v1
    served: true
    storage: true     # ALL objects written to etcd as v1
    schema: { ... }
  conversion:
    strategy: Webhook  # None (default) | Webhook
    webhook:
      conversionReviewVersions: ["v1","v1beta1"]
      clientConfig:
        service:
          namespace: default
          name: crd-conversion-webhook
          path: /convert

Hub version pattern: designate one version as the authoritative internal representation and route every conversion through it, for example v1alpha1 → hub (v1) → v1beta1. Each version then needs converters only to and from the hub, which avoids O(n²) conversion pairs. The webhook receives a ConversionReview:

# ConversionReview request (sent by apiserver to webhook)
apiVersion: apiextensions.k8s.io/v1
kind: ConversionReview
request:
  uid: "f5cba308-..."
  desiredAPIVersion: example.com/v1
  objects:
  - apiVersion: example.com/v1alpha1
    kind: Foo
    metadata: { name: myfoo }
    spec: { oldField: value }
response:               # webhook returns this
  uid: "f5cba308-..."
  result: {status: Success}
  convertedObjects:
  - apiVersion: example.com/v1
    kind: Foo
    metadata: { name: myfoo }
    spec: { newField: value }  # converted representation
⚠ Storage Migration

Changing the storage: true version does NOT migrate existing objects. Objects written under the old version stay in etcd encoded as that version until they are next written. To rewrite them, read and write back every object (for example kubectl get RESOURCE -A -o json | kubectl replace -f -) or run kube-storage-version-migrator, then drop the old version from the CRD's status.storedVersions.

apimachinery Library Layout

Understanding k8s.io/apimachinery is essential for writing controllers and API extensions.

k8s.io/apimachinery/pkg/api/errors

HTTP status helpers. Use these instead of raw HTTP codes:

import apierrors "k8s.io/apimachinery/pkg/api/errors"

apierrors.IsNotFound(err)      // 404
apierrors.IsConflict(err)      // 409 (RV mismatch)
apierrors.IsAlreadyExists(err) // 409 (name taken)
apierrors.IsTooManyRequests(err)   // 429 (APF throttle)
apierrors.IsServiceUnavailable(err) // 503
apierrors.IsUnauthorized(err)  // 401
apierrors.IsForbidden(err)     // 403
apierrors.IsInvalid(err)       // 422

runtime.Object and TypeMeta/ObjectMeta

All Kubernetes objects implement runtime.Object (two methods: GetObjectKind(), DeepCopyObject()). TypeMeta carries apiVersion/kind. ObjectMeta carries all metadata.*.

type TypeMeta struct {
    Kind       string `json:"kind,omitempty"`
    APIVersion string `json:"apiVersion,omitempty"`
}
type ObjectMeta struct {
    Name, Namespace  string
    UID              types.UID
    ResourceVersion  string
    Generation       int64
    Labels, Annotations map[string]string
    OwnerReferences  []OwnerReference
    Finalizers       []string
}

schema.GroupVersionResource / GroupVersionKind

import "k8s.io/apimachinery/pkg/runtime/schema"

// GVR — used to construct API paths
gvr := schema.GroupVersionResource{
    Group: "apps", Version: "v1",
    Resource: "deployments",
}
// GVK — used in TypeMeta
gvk := schema.GroupVersionKind{
    Group: "apps", Version: "v1",
    Kind: "Deployment",
}
// Convert via RESTMapper
mapper.RESTMappings(gvk.GroupKind(), gvk.Version)

meta/v1 Types and Condition Pattern

// ListOptions
metav1.ListOptions{
    LabelSelector:   "app=nginx",
    FieldSelector:   "status.phase=Running",
    ResourceVersion: "0",   // from watchCache
    Limit:           500,
}
// Standard condition (use for entries in status.conditions)
metav1.Condition{
    Type:    "Ready",
    Status:  metav1.ConditionTrue,
    Reason:  "DeploymentAvailable",
    Message: "3/3 replicas running",
    LastTransitionTime: metav1.Now(),
}

Pagination and Chunking

Large LIST responses should be paginated using limit + continue token to avoid memory pressure on both client and server. All pages share the same resourceVersion snapshot.

# Shell: paginate pods in 100-item chunks via the raw API
# (kubectl get has no --limit/--continue flags; it chunks internally with --chunk-size)
CONTINUE=""
while true; do
  URL="/api/v1/namespaces/default/pods?limit=100"
  [ -n "$CONTINUE" ] && URL="$URL&continue=$CONTINUE"
  RESP=$(kubectl get --raw "$URL")
  echo "$RESP" | jq -r '.items[].metadata.name'
  CONTINUE=$(echo "$RESP" | jq -r '.metadata.continue // empty')
  [ -z "$CONTINUE" ] && break
done
// Go: using client-go pager
import "k8s.io/client-go/tools/pager"

p := pager.New(pager.SimplePageFunc(func(opts metav1.ListOptions) (runtime.Object, error) {
    return client.CoreV1().Pods("").List(ctx, opts)
}))
p.PageSize = 100
err = p.EachListItem(ctx, metav1.ListOptions{}, func(obj runtime.Object) error {
    pod := obj.(*corev1.Pod)
    fmt.Println(pod.Name)
    return nil
})
⚠ Consistency Guarantee

All pages of a chunked LIST share the same resourceVersion. Objects created or deleted after the first page do not affect subsequent pages. This gives a consistent point-in-time snapshot — the same guarantee as a single large LIST, but without the memory spike.
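The limit/continue loop itself is independent of any client library. The sketch below shows its shape against a fake in-memory lister; Page, ListFunc, ListAll, and FakeLister are hypothetical names for this example, and the fake's continue token is simply the next offset, whereas real tokens are opaque strings that should be passed back unchanged.

```go
package main

import (
	"fmt"
	"strconv"
)

// Page is one chunk of a paginated LIST.
type Page struct {
	Items    []string
	Continue string // empty on the last page
}

// ListFunc fetches one page, given a page size and the previous continue token.
type ListFunc func(limit int, continueToken string) (Page, error)

// ListAll follows continue tokens until the server reports no more pages.
func ListAll(list ListFunc, limit int) ([]string, error) {
	var all []string
	token := ""
	for {
		page, err := list(limit, token)
		if err != nil {
			return nil, err
		}
		all = append(all, page.Items...)
		if page.Continue == "" {
			return all, nil
		}
		token = page.Continue
	}
}

// FakeLister serves pages from a fixed snapshot, so every page sees the same data.
func FakeLister(snapshot []string) ListFunc {
	return func(limit int, token string) (Page, error) {
		if limit <= 0 {
			return Page{}, fmt.Errorf("limit must be positive, got %d", limit)
		}
		start := 0
		if token != "" {
			n, err := strconv.Atoi(token)
			if err != nil || n < 0 || n > len(snapshot) {
				return Page{}, fmt.Errorf("invalid continue token %q", token)
			}
			start = n
		}
		end := start + limit
		if end >= len(snapshot) {
			return Page{Items: snapshot[start:]}, nil
		}
		return Page{Items: snapshot[start:end], Continue: strconv.Itoa(end)}, nil
	}
}

func main() {
	pods := []string{"pod-a", "pod-b", "pod-c", "pod-d", "pod-e"}
	names, err := ListAll(FakeLister(pods), 2)
	fmt.Println(names, err)
}
```

Against a real cluster, a continue token can expire; the server then answers 410 Gone and the client has to restart the list from the first page.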

API Priority and Fairness (APF)

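APF builds on the older --max-requests-inflight and --max-mutating-requests-inflight limits (their sum becomes the server's total concurrency budget) and adds per-flow fair queuing. Every request is classified by a FlowSchema and placed in a PriorityLevel queue.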

FlowSchema

Matches requests (by user, group, ServiceAccount, resource, verb) to a PriorityLevel. Ordered by matchingPrecedence (lower number = checked first). Built-in examples: exempt, system-nodes, system-leader-election, kube-controller-manager, service-accounts, global-default, catch-all.

PriorityLevelConfiguration

Two types: Exempt (never queued or rejected; used for the system:masters group) and Limited (bounded concurrency with fair queuing and shuffle sharding). Shuffle sharding isolates noisy tenants by giving each flow its own small subset of queues.

apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: my-app-high-priority
spec:
  matchingPrecedence: 200       # lower = matched first
  priorityLevelConfiguration:
    name: workload-high
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: my-critical-app
        namespace: default
    resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
# Monitor APF
kubectl get flowschemas
kubectl get prioritylevelconfigurations
kubectl get --raw /metrics | grep 'apiserver_flowcontrol'

# Key metrics:
# apiserver_flowcontrol_current_inqueue_requests{priority_level="workload-high"}
# apiserver_flowcontrol_rejected_requests_total{reason="queue-full"}
# apiserver_flowcontrol_request_wait_duration_seconds

# Diagnose throttling (look for 429 in audit log)
grep '"code":429' /var/log/kubernetes/audit.log | jq '.user.username' | sort | uniq -c

API Troubleshooting

HTTP 409 Conflict — resourceVersion mismatch

Another writer updated the object between your GET and PUT/PATCH. Always re-GET before retrying. Use RetryOnConflict in controllers:

import "k8s.io/client-go/util/retry"

err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
    obj, err := client.Get(ctx, name, metav1.GetOptions{})
    if err != nil { return err }
    obj.Spec.Replicas = &desiredReplicas
    _, err = client.Update(ctx, obj, metav1.UpdateOptions{})
    return err
})
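The reason the retry closure re-GETs on every attempt is easiest to see with a toy store that enforces optimistic locking. The sketch below is self-contained and uses an in-memory object; Store, Object, ErrConflict, and RetryConflicts are hypothetical stand-ins, not the client-go retry package.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict plays the role of an HTTP 409 response.
var ErrConflict = errors.New("409 Conflict: resourceVersion mismatch")

type Object struct {
	ResourceVersion int
	Replicas        int
}

// Store holds one object and rejects writes based on a stale resourceVersion.
type Store struct{ obj Object }

func (s *Store) Get() Object { return s.obj }

func (s *Store) Update(o Object) error {
	if o.ResourceVersion != s.obj.ResourceVersion {
		return ErrConflict
	}
	o.ResourceVersion++ // every successful write gets a new version
	s.obj = o
	return nil
}

// RetryConflicts reruns fn while it returns ErrConflict, up to attempts times.
func RetryConflicts(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); !errors.Is(err, ErrConflict) {
			return err
		}
	}
	return err
}

func main() {
	s := &Store{obj: Object{ResourceVersion: 1, Replicas: 3}}
	tries := 0
	err := RetryConflicts(5, func() error {
		tries++
		obj := s.Get() // fresh read on every attempt
		if tries == 1 {
			// Simulate another writer landing between our GET and PUT.
			other := s.Get()
			other.Replicas = 4
			_ = s.Update(other)
		}
		obj.Replicas = 5
		return s.Update(obj)
	})
	fmt.Println(err, tries, s.Get())
}
```

If the read were hoisted out of the closure, every retry would resend the same stale resourceVersion and fail in the same way.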
HTTP 410 Gone — watch resourceVersion too old (compacted)

etcd compacts history (typically every 5 minutes). If a watcher is paused longer, its RV is gone. Informers handle this automatically — they re-LIST from RV=0 then re-WATCH from the new RV.

# Check etcd revision and DB size per endpoint
kubectl -n kube-system exec etcd-master -- \
  etcdctl endpoint status --write-out=table

# Flags to inspect:
# etcd:           --auto-compaction-mode=periodic --auto-compaction-retention=5m
# kube-apiserver: --etcd-compaction-interval (default 5m)
HTTP 429 Too Many Requests — APF throttled

Response includes Retry-After header. Root causes: burst watch reconnects after rolling restart, controller calling List in tight loop, unbounded LISTs without limit/continue. Fix: create a FlowSchema promoting the critical SA to a higher PriorityLevel.

kubectl get --raw /metrics | grep -E 'rejected_requests_total|current_inqueue'
# Check which flow schemas the request matches
kubectl get --raw '/apis/flowcontrol.apiserver.k8s.io/v1/flowschemas' | jq '.items[].metadata.name'
HTTP 422 Unprocessable Entity — schema validation failure

Response body contains field-level errors. Common causes: CRD CEL validation failure, structural schema violation, immutable field change (e.g., selector), name violating DNS rules.

# Full error detail
kubectl apply -f resource.yaml 2>&1
# Output includes: "spec.selector: Invalid value ... field is immutable"

# Pre-flight check
kubectl apply --dry-run=server -f resource.yaml
Debugging API calls with kubectl verbosity
# -v=6: request URL + response code
kubectl get pod nginx -v=6

# -v=8: full request/response headers
kubectl get pod nginx -v=8

# -v=9: full request/response body
kubectl apply -f . -v=9

# Record for analysis
kubectl apply -f . -v=8 2>&1 | tee /tmp/kubectl-debug.log | grep -E 'Request|Response code'

Production Best Practices

Use SSA for operators

Use --server-side --field-manager=my-operator. This prevents clobbering fields managed by HPA, cert-manager, or human operators.

Paginate all large LISTs

Never do a bare LIST of pods or events cluster-wide without limit. Unbounded LISTs cause OOM on kube-apiserver watchCache and GC pressure on the client.

Pin resourceVersion=0 for cache reads

Controllers needing eventual-consistent data should LIST with resourceVersion=0 — served from watchCache (in-memory), not etcd. Dramatically reduces etcd load at scale.

Use Informers, not polling

A single Informer multiplexes events to all controllers in a process. Never write a controller that polls GET /pods on a timer — does not scale past a few hundred pods.

Respect retry semantics

Use RetryOnConflict for 409, respect Retry-After for 429, implement exponential backoff for 5xx. workqueue.DefaultItemBasedRateLimiter handles this automatically for controller queues.

Use finalizers for external cleanup

Add a finalizer before creating external resources. Remove it only after cleanup completes. Never leave objects permanently stuck with a finalizer you cannot remove — use an emergency removal procedure.

Conditions over booleans in status

Follow metav1.Condition: Type, Status, Reason (machine-readable CamelCase), Message (human-readable), LastTransitionTime. This enables programmatic tooling and standard dashboards.

CEL validation over webhooks

Admission webhooks add latency and availability risk. For structural invariants expressible in CEL, use x-kubernetes-validations or ValidatingAdmissionPolicy — zero-latency, no webhook to maintain.

Plan CRD versions from day one

Designate a hub version, write conversion webhooks, never modify the storage version without migrating stored objects first. Use kube-storage-version-migrator or a migration Job.

RBAC needs both list AND watch

Controllers using Informers need both list and watch RBAC verbs. Missing watch causes a 403 when the Reflector tries to establish the watch stream — a subtle, delayed failure.

Namespace-scope new CRDs

Default to namespace-scoped CRDs. They get RBAC isolation, quota enforcement, and lifecycle management. Only use cluster-scoped for genuinely cluster-global resources.

APF FlowSchema for critical controllers

Create a FlowSchema pointing critical controllers (cert-manager, ArgoCD) to workload-high or a dedicated PriorityLevel. Without it, they share queues with ordinary service-account traffic and can be starved by a noisy client.