Platform APIs: Extending Kubernetes

Complete guide to extending the Kubernetes API — Custom Resource Definitions, Admission Webhooks, API Aggregation, Operator patterns, controller development with controller-runtime, and building internal developer APIs that feel native to the Kubernetes ecosystem.

Kubernetes Extension Points
Custom Resource Definitions
CRD Validation & Conversion
Admission Webhooks
Operator Pattern
Kubebuilder & controller-runtime
Writing a Reconciler
API Aggregation Layer
Platform API Design Principles
Testing Operators & Webhooks
Operator Observability
Best Practices

Kubernetes Extension Points

Kubernetes is designed as an extensible platform. Rather than building every feature into core, it provides stable hooks that allow the platform team to add domain-specific abstractions while inheriting the full K8s control loop, RBAC, audit logging, and tooling.

KUBERNETES EXTENSION MECHANISMS

kube-apiserver
├── Built-in APIs (v1, apps/v1, batch/v1 ...)
│
├── Custom Resource Definitions (CRDs)        ← Add new resource types
│   └── Operator/Controller watches CRDs      ← Custom reconciliation logic
│
├── Admission Webhooks                        ← Intercept API requests
│   ├── MutatingAdmissionWebhook              ← Modify objects (defaults, injection)
│   └── ValidatingAdmissionWebhook            ← Accept/reject objects (policy)
│
├── API Aggregation Layer                     ← Proxy to external API server
│   └── APIService → custom API server (metrics-server, custom metrics)
│
└── ValidatingAdmissionPolicy (CEL)           ← In-process CEL validation (1.30 GA)

kubectl / CI / GitOps interact with ALL of these via standard K8s API verbs
(get, list, watch, create, update, patch, delete)

Extension Mechanism Decision Guide

Use Case	Mechanism	Complexity
Store custom configuration with K8s RBAC + audit	CRD (no controller)	Low
Automate lifecycle of cloud resources or K8s objects	CRD + Operator (controller-runtime)	Medium
Inject defaults / mutate objects at admission time	MutatingAdmissionWebhook	Medium
Enforce business rules at admission time	ValidatingAdmissionWebhook or Kyverno/Gatekeeper	Low (policy engine) / Medium (custom)
Simple in-process validation rules	ValidatingAdmissionPolicy (CEL)	Low
Serve custom metrics (HPA scale target)	Custom Metrics API (API Aggregation)	High
Extend kubectl with new subcommands	kubectl plugin (krew)	Low
Multi-cluster or cross-namespace resource management	CRD + Operator with multi-cluster watch	High

Custom Resource Definitions

CRDs add new resource types to the Kubernetes API. Once registered, teams can use kubectl apply, kubectl get, and kubectl get --watch on the new types, with RBAC, audit logging, and GitOps support inherited automatically.

Full CRD Example: Application

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: applications.platform.example.com
spec:
  group: platform.example.com
  scope: Namespaced
  names:
    plural: applications
    singular: application
    kind: Application
    shortNames: ["app"]
    categories: ["platform"]   # appears in kubectl get platform
  versions:
  - name: v1alpha1
    served: true
    storage: true              # only one version is storage version
    subresources:
      status: {}               # enable /status subresource (separate RBAC)
      scale:                   # enable /scale subresource (HPA support)
        specReplicasPath: .spec.replicas
        statusReplicasPath: .status.replicas
    additionalPrinterColumns:  # kubectl get app output
    - name: Desired
      type: integer
      jsonPath: .spec.replicas
    - name: Ready
      type: integer
      jsonPath: .status.readyReplicas
    - name: Phase
      type: string
      jsonPath: .status.phase
    - name: Age
      type: date
      jsonPath: .metadata.creationTimestamp
    schema:
      openAPIV3Schema:
        type: object
        required: ["spec"]
        properties:
          spec:
            type: object
            required: ["image","port"]
            properties:
              image:
                type: string
                description: "Container image (must include digest or semver tag)"
              port:
                type: integer
                minimum: 1
                maximum: 65535
              replicas:
                type: integer
                minimum: 1
                maximum: 100
                default: 2
              resources:
                type: object
                properties:
                  cpu:
                    type: string
                    pattern: "^[0-9]+(m|\\.[0-9]+)?$"
                  memory:
                    type: string
                    pattern: "^[0-9]+(Mi|Gi)$"
              env:
                type: array
                items:
                  type: object
                  required: ["name","value"]
                  properties:
                    name:
                      type: string
                    value:
                      type: string
              ingress:
                type: object
                properties:
                  enabled:
                    type: boolean
                    default: false
                  host:
                    type: string
                  tlsEnabled:
                    type: boolean
                    default: true
          status:
            type: object
            properties:
              phase:
                type: string
                enum: ["Pending","Deploying","Running","Degraded","Failed"]
              replicas:
                type: integer      # backs statusReplicasPath for the /scale subresource
              readyReplicas:
                type: integer
              conditions:
                type: array
                items:
                  type: object
                  required: ["type","status"]
                  properties:
                    type:
                      type: string
                    status:
                      type: string
                    reason:
                      type: string
                    message:
                      type: string
                    lastTransitionTime:
                      type: string
                      format: date-time

CRD Instance

apiVersion: platform.example.com/v1alpha1
kind: Application
metadata:
  name: payments-api
  namespace: payments-api-production
  labels:
    team: payments
    env: production
spec:
  image: 123456789.dkr.ecr.us-east-1.amazonaws.com/payments-api@sha256:abc123
  port: 8080
  replicas: 3
  resources:
    cpu: "500m"
    memory: "512Mi"
  env:
  - name: LOG_LEVEL
    value: info
  ingress:
    enabled: true
    host: payments-api.company.com
    tlsEnabled: true

CRD Validation & Conversion

CEL Validation Rules (x-kubernetes-validations)

# Add CEL validation rules directly in the CRD schema (beta in K8s 1.25, GA in 1.29)
# These run in-process in the API server, so no webhook is needed
spec:
  versions:
  - name: v1alpha1
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            x-kubernetes-validations:
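            # Cross-field validation: production must have >= 2 replicas.
            # Assumes a string field spec.environment, which the Application schema
            # above does not define (spec.env there is the list of env vars).
            - rule: "!(has(self.environment) && self.environment == 'production' && self.replicas < 2)"
              message: "Production applications must have at least 2 replicas"
            # Image must not use the :latest tag (applies to every environment)
            - rule: "!self.image.endsWith(':latest')"
              message: "Image tag ':latest' is not allowed"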
            properties:
              replicas:
                type: integer
                minimum: 1
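                x-kubernetes-validations:
                # Transition rule: referencing oldSelf means it is evaluated only on updates.
                # As written it is already implied by minimum: 1; it illustrates the syntax.
                - rule: "self >= oldSelf || self >= 1"
                  message: "Replicas cannot be reduced below 1"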

CRD Version Conversion Webhook

# When you add v1beta1 alongside v1alpha1, a conversion webhook translates between versions
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: applications.platform.example.com
spec:
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions: ["v1","v1beta1"]
      clientConfig:
        service:
          name: platform-operator-webhook
          namespace: platform-system
          path: /convert
        caBundle: "LS0tLS1CRUdJT..."
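
Behind the /convert endpoint, the webhook runs a pair of functions that translate objects between versions. The following is a minimal sketch with plain Go structs rather than the real API types. The v1beta1 shape, which moves port under a service block, is a hypothetical change invented for illustration. The property worth checking is that a round trip through the newer version loses nothing.

```go
package main

import "fmt"

// AppV1alpha1 mirrors a few fields of the v1alpha1 spec used in this guide.
type AppV1alpha1 struct {
	Image    string
	Port     int32
	Replicas int32
}

// ServiceV1beta1 and AppV1beta1 are a hypothetical newer shape in which
// the port moves under a nested service block.
type ServiceV1beta1 struct {
	Port int32
}

type AppV1beta1 struct {
	Image    string
	Replicas int32
	Service  ServiceV1beta1
}

// toV1beta1 converts the old shape to the new one.
func toV1beta1(in AppV1alpha1) AppV1beta1 {
	return AppV1beta1{
		Image:    in.Image,
		Replicas: in.Replicas,
		Service:  ServiceV1beta1{Port: in.Port},
	}
}

// toV1alpha1 converts back so that old clients keep working.
func toV1alpha1(in AppV1beta1) AppV1alpha1 {
	return AppV1alpha1{
		Image:    in.Image,
		Port:     in.Service.Port,
		Replicas: in.Replicas,
	}
}

func main() {
	old := AppV1alpha1{Image: "nginx:1.25.0", Port: 80, Replicas: 2}
	roundTripped := toV1alpha1(toV1beta1(old))
	fmt.Println(roundTripped == old) // true: the conversion is lossless
}
```

If a newer version adds a field the older one cannot represent, a common approach is to stash it in an annotation during down-conversion so the round trip stays lossless.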

Admission Webhooks

Admission webhooks intercept API server requests before persistence. They are the right tool for logic that cannot be expressed in CEL validation rules — injecting sidecar containers, generating child resources, or calling external systems for approval.

Mutating Webhook: Inject Default Labels

// webhook/defaulter.go — using controller-runtime's webhook builder
package webhook

import (
    "context"
    "encoding/json"
    "net/http"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// PodDefaulter applies platform defaults to Pods at admission time.
type PodDefaulter struct {
    // The pointer form matches controller-runtime releases where
    // admission.NewDecoder returns *admission.Decoder. Newer releases define
    // Decoder as an interface, in which case the field type is admission.Decoder.
    decoder *admission.Decoder
}

func (d *PodDefaulter) Handle(ctx context.Context, req admission.Request) admission.Response {
    pod := &corev1.Pod{}
    if err := d.decoder.Decode(req, pod); err != nil {
        return admission.Errored(http.StatusBadRequest, err)
    }

    // Inject defaults
    if pod.Labels == nil {
        pod.Labels = map[string]string{}
    }
    if _, ok := pod.Labels["managed-by"]; !ok {
        pod.Labels["managed-by"] = "platform"
    }

    // Set default terminationGracePeriodSeconds if not set
    if pod.Spec.TerminationGracePeriodSeconds == nil {
        grace := int64(60)
        pod.Spec.TerminationGracePeriodSeconds = &grace
    }

    // Return a JSON patch computed from the original and the mutated object
    marshaledPod, err := json.Marshal(pod)
    if err != nil {
        return admission.Errored(http.StatusInternalServerError, err)
    }
    return admission.PatchResponseFromRaw(req.Object.Raw, marshaledPod)
}

Register Webhook in main.go

// main.go
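// Port and CertDir on ctrl.Options are accepted by older controller-runtime
// releases. Newer ones take them through
// WebhookServer: webhook.NewServer(webhook.Options{Port: ..., CertDir: ...}).
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    Scheme: scheme,
    Port:   9443,   // webhook server port
    CertDir: "/tmp/k8s-webhook-server/serving-certs",
})
if err != nil {
    panic(err) // nothing can be registered without a manager
}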

// Register mutating webhook
mgr.GetWebhookServer().Register("/mutate-v1-pod",
    &webhook.Admission{Handler: &PodDefaulter{decoder: admission.NewDecoder(scheme)}})

// Register validating webhook
mgr.GetWebhookServer().Register("/validate-platform-v1alpha1-application",
    &webhook.Admission{Handler: &ApplicationValidator{}})

MutatingWebhookConfiguration

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: platform-mutating-webhook
  annotations:
    cert-manager.io/inject-ca-from: platform-system/platform-operator-tls
webhooks:                    # top-level field; webhook configurations have no spec
- name: mpod.platform.example.com
  admissionReviewVersions: ["v1"]
  clientConfig:
    service:
      name: platform-operator-webhook-service
      namespace: platform-system
      path: /mutate-v1-pod
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
    scope: Namespaced
  namespaceSelector:
    matchExpressions:
    - key: platform.example.com/inject-defaults
      operator: Exists
  failurePolicy: Ignore    # don't block pods if webhook is down
  sideEffects: None
  timeoutSeconds: 5
  reinvocationPolicy: IfNeeded

ℹ️

cert-manager webhook TLS injection. Add cert-manager.io/inject-ca-from annotation on the WebhookConfiguration and create a cert-manager Certificate for the webhook service. cert-manager's cainjector automatically patches the caBundle field, eliminating manual cert rotation for webhook TLS.

Operator Pattern

An Operator encodes human operational knowledge as code. It watches a CRD and reconciles the real world to match the desired state — creating, updating, and deleting dependent K8s and cloud resources as needed.

Operator Reconcile Loop

User applies Application CR
        │
        ▼
API server stores CR in etcd (desired state)
        │
        ▼  Informer watch fires
Operator Reconciler
├── Read Application spec
├── Get/List owned Deployments, Services, Ingresses
├── Compare desired vs actual
├── Create/Update/Delete resources to match desired
├── Update Application.status
└── Return Result{RequeueAfter: 5m}
        │
        ▼  After RequeueAfter
Reconciler runs again (drift detection)

Operator also handles:
├── Owned resource changes (ownerReference watch)
├── Error backoff (exponential retry)
└── Finalizers (cleanup before deletion)
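
The compare-and-converge step of the loop can be sketched without any Kubernetes dependency. In this illustration, two maps from object name to replica count stand in for the desired state and for the Deployments the operator owns. The function and its action strings are hypothetical, not controller-runtime API. Running it twice shows why a level-triggered loop is safe to repeat: the second pass finds nothing to change.

```go
package main

import (
	"fmt"
	"sort"
)

// reconcile drives actual toward desired and reports what it changed.
// Both maps go from object name to replica count.
func reconcile(desired, actual map[string]int) []string {
	var actions []string
	for name, want := range desired {
		got, exists := actual[name]
		switch {
		case !exists:
			actual[name] = want
			actions = append(actions, "create "+name)
		case got != want:
			actual[name] = want
			actions = append(actions, "update "+name)
		}
	}
	// Anything present but no longer desired is removed.
	for name := range actual {
		if _, wanted := desired[name]; !wanted {
			delete(actual, name)
			actions = append(actions, "delete "+name)
		}
	}
	sort.Strings(actions) // map iteration order is random; sort for stable output
	return actions
}

func main() {
	desired := map[string]int{"payments-api": 3}
	actual := map[string]int{"payments-api": 1, "orphan": 1}

	fmt.Println(reconcile(desired, actual)) // [delete orphan update payments-api]
	fmt.Println(reconcile(desired, actual)) // []: already converged
}
```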

Operator Maturity Model

Level	Capabilities	Example
Level 1: Basic Install	Automated provisioning and configuration	Create Deployment+Service from CRD spec
Level 2: Seamless Upgrades	Patch and minor version upgrades, config changes	Rolling update when image field changes
Level 3: Full Lifecycle	Backup, restore, failure recovery	Auto-backup on schedule, restore from snapshot
Level 4: Deep Insights	Metrics, alerting, log processing, workload analysis	Expose per-instance Prometheus metrics
Level 5: Auto Pilot	Horizontal/vertical scaling, auto-config tuning	Auto-tune JVM heap based on pod memory

Kubebuilder & controller-runtime

Kubebuilder is an SDK for building Kubernetes operators, maintained as a Kubernetes SIG project (kubernetes-sigs/kubebuilder). It scaffolds a project around controller-runtime (the Go library), generates CRD manifests from Go markers and struct tags, and provides a webhook framework.

Bootstrap a New Operator

# Install kubebuilder
curl -L -o kubebuilder "https://go.kubebuilder.io/dl/latest/$(go env GOOS)/$(go env GOARCH)"
chmod +x kubebuilder && mv kubebuilder /usr/local/bin/

# Create project
mkdir platform-operator && cd platform-operator
kubebuilder init \
  --domain example.com \
  --repo github.com/myorg/platform-operator

# Create API (CRD + Controller skeleton)
kubebuilder create api \
  --group platform \
  --version v1alpha1 \
  --kind Application \
  --resource --controller

# Create webhook
kubebuilder create webhook \
  --group platform \
  --version v1alpha1 \
  --kind Application \
  --defaulting --programmatic-validation

# Generate CRD manifests and DeepCopy methods
make generate manifests

Go Type Definition (generates CRD schema)

// api/v1alpha1/application_types.go

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.replicas
// +kubebuilder:printcolumn:name="Desired",type="integer",JSONPath=".spec.replicas"
// +kubebuilder:printcolumn:name="Ready",type="integer",JSONPath=".status.readyReplicas"
// +kubebuilder:printcolumn:name="Phase",type="string",JSONPath=".status.phase"
// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
// +kubebuilder:resource:shortName=app;apps,categories=platform

type Application struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`
    Spec   ApplicationSpec   `json:"spec,omitempty"`
    Status ApplicationStatus `json:"status,omitempty"`
}

type ApplicationSpec struct {
    // +kubebuilder:validation:Required
    Image string `json:"image"`

    // +kubebuilder:validation:Required
    // +kubebuilder:validation:Minimum=1
    // +kubebuilder:validation:Maximum=65535
    Port int32 `json:"port"`

    // +kubebuilder:default=2
    // +kubebuilder:validation:Minimum=1
    // +kubebuilder:validation:Maximum=100
    Replicas *int32 `json:"replicas,omitempty"`

    Resources *ResourceRequirements `json:"resources,omitempty"`
    Env       []corev1.EnvVar       `json:"env,omitempty"`
    Ingress   *IngressSpec          `json:"ingress,omitempty"`
}

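// +kubebuilder:validation:Enum=Pending;Deploying;Running;Degraded;Failed
type Phase string

const (
    PhasePending   Phase = "Pending"
    PhaseDeploying Phase = "Deploying"
    PhaseRunning   Phase = "Running"
    PhaseDegraded  Phase = "Degraded"
    PhaseFailed    Phase = "Failed"
)

type ApplicationStatus struct {
    Phase         Phase              `json:"phase,omitempty"`
    Replicas      int32              `json:"replicas,omitempty"` // backs statuspath in the scale marker
    ReadyReplicas int32              `json:"readyReplicas,omitempty"`
    Conditions    []metav1.Condition `json:"conditions,omitempty"`
}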

Writing a Reconciler

// controllers/application_controller.go
package controllers

import (
    "context"
    "time"

    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/client-go/util/workqueue"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/controller"
    "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

    platformv1alpha1 "github.com/myorg/platform-operator/api/v1alpha1"
)

const finalizerName = "platform.example.com/finalizer"

type ApplicationReconciler struct {
    client.Client
    Scheme *runtime.Scheme
}

// +kubebuilder:rbac:groups=platform.example.com,resources=applications,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=platform.example.com,resources=applications/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=platform.example.com,resources=applications/finalizers,verbs=update
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups="",resources=services,verbs=get;list;watch;create;update;patch;delete

func (r *ApplicationReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := ctrl.LoggerFrom(ctx)

    // 1. Fetch the Application
    app := &platformv1alpha1.Application{}
    if err := r.Get(ctx, req.NamespacedName, app); err != nil {
        if errors.IsNotFound(err) {
            return ctrl.Result{}, nil  // deleted; nothing to do
        }
        return ctrl.Result{}, err
    }

    // 2. Handle deletion (finalizer pattern)
    if !app.DeletionTimestamp.IsZero() {
        if controllerutil.ContainsFinalizer(app, finalizerName) {
            if err := r.cleanup(ctx, app); err != nil {
                return ctrl.Result{}, err
            }
            controllerutil.RemoveFinalizer(app, finalizerName)
            return ctrl.Result{}, r.Update(ctx, app)
        }
        return ctrl.Result{}, nil
    }

    // 3. Add finalizer
    if !controllerutil.ContainsFinalizer(app, finalizerName) {
        controllerutil.AddFinalizer(app, finalizerName)
        if err := r.Update(ctx, app); err != nil {
            return ctrl.Result{}, err
        }
    }

    // 4. Reconcile Deployment. CreateOrUpdate fetches the live object first and
    //    then runs the mutate function, so the desired fields and the owner
    //    reference are applied inside it.
    deployment := r.desiredDeployment(app)
    result, err := controllerutil.CreateOrUpdate(ctx, r.Client, deployment, func() error {
        r.updateDeployment(deployment, app)
        return controllerutil.SetControllerReference(app, deployment, r.Scheme)
    })
    if err != nil {
        return ctrl.Result{}, err
    }
    log.Info("Deployment reconciled", "result", result)

    // 5. Reconcile Service (same pattern)
    svc := r.desiredService(app)
    if _, err := controllerutil.CreateOrUpdate(ctx, r.Client, svc, func() error {
        r.updateService(svc, app)
        return controllerutil.SetControllerReference(app, svc, r.Scheme)
    }); err != nil {
        return ctrl.Result{}, err
    }

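    // 6. Update status: report Running only once every desired replica is ready
    app.Status.ReadyReplicas = deployment.Status.ReadyReplicas
    app.Status.Phase = platformv1alpha1.PhaseDeploying
    if app.Spec.Replicas != nil && app.Status.ReadyReplicas >= *app.Spec.Replicas {
        app.Status.Phase = platformv1alpha1.PhaseRunning
    }
    if err := r.Status().Update(ctx, app); err != nil {
        return ctrl.Result{}, err
    }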

    // 7. Requeue periodically for drift detection
    return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil
}

func (r *ApplicationReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&platformv1alpha1.Application{}).
        Owns(&appsv1.Deployment{}).  // watch owned Deployments too
        Owns(&corev1.Service{}).
        WithOptions(controller.Options{
            MaxConcurrentReconciles: 4,
            RateLimiter: workqueue.NewItemExponentialFailureRateLimiter(
                500*time.Millisecond, 5*time.Minute,
            ),
        }).
        Complete(r)
}

desiredDeployment Helper

func (r *ApplicationReconciler) desiredDeployment(app *platformv1alpha1.Application) *appsv1.Deployment {
    labels := map[string]string{
        "app.kubernetes.io/name":       app.Name,
        "app.kubernetes.io/managed-by": "platform-operator",
    }
    return &appsv1.Deployment{
        ObjectMeta: metav1.ObjectMeta{
            Name:      app.Name,
            Namespace: app.Namespace,
            Labels:    labels,
        },
        Spec: appsv1.DeploymentSpec{
            Replicas: app.Spec.Replicas,
            Selector: &metav1.LabelSelector{MatchLabels: labels},
            Template: corev1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{Labels: labels},
                Spec: corev1.PodSpec{
                    Containers: []corev1.Container{{
                        Name:  app.Name,
                        Image: app.Spec.Image,
                        Ports: []corev1.ContainerPort{{
                            ContainerPort: app.Spec.Port,
                        }},
                        Env: app.Spec.Env,
                    }},
                },
            },
        },
    }
}

API Aggregation Layer

The API Aggregation Layer allows registering a custom API server that the kube-apiserver proxies requests to. This is how metrics-server (for kubectl top) and custom metrics APIs (for HPA) work.

Custom Metrics API for HPA

# APIService registers your custom metrics server with the aggregation layer
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.custom.metrics.k8s.io
spec:
  service:
    name: custom-metrics-apiserver
    namespace: custom-metrics
    port: 443
  group: custom.metrics.k8s.io
  version: v1beta1
  insecureSkipTLSVerify: false
  caBundle: "LS0tLS1CRUdJTi..."
  groupPriorityMinimum: 100
  versionPriority: 100

HPA on an External Metric

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api-hpa
  namespace: payments-api-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
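  # Scale on a Prometheus metric via KEDA or a metrics adapter. External metrics
  # are served from the external.metrics.k8s.io API group, which needs its own
  # APIService in addition to custom.metrics.k8s.io.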
  - type: External
    external:
      metric:
        name: payments_queue_depth
        selector:
          matchLabels:
            queue: payments-processing
      target:
        type: AverageValue
        averageValue: "10"   # scale out when queue > 10 items per replica
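
To see what the AverageValue target means in practice, here is a rough sketch of the core calculation: the HPA aims for enough replicas that the metric total divided by the replica count stays at or below the target, then clamps the result to minReplicas and maxReplicas. The function name is hypothetical, and the real controller also applies a tolerance band and stabilization windows that are left out here.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas returns ceil(metricTotal / targetAverage), clamped to
// [minReplicas, maxReplicas]. Tolerance and stabilization are omitted.
func desiredReplicas(metricTotal, targetAverage float64, minReplicas, maxReplicas int) int {
	n := int(math.Ceil(metricTotal / targetAverage))
	if n < minReplicas {
		return minReplicas
	}
	if n > maxReplicas {
		return maxReplicas
	}
	return n
}

func main() {
	// Values from the HPA above: target 10 per replica, 2 to 20 replicas.
	fmt.Println(desiredReplicas(85, 10, 2, 20))  // 9: ceil(8.5)
	fmt.Println(desiredReplicas(5, 10, 2, 20))   // 2: clamped to minReplicas
	fmt.Println(desiredReplicas(500, 10, 2, 20)) // 20: clamped to maxReplicas
}
```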

Platform API Design Principles

Platform APIs are consumed by developers every day. Poor API design creates friction, bugs, and support burden. These principles apply whether you're designing a CRD, a Crossplane XRD (see 08-service-catalog.html), or a Backstage template.

API Design Guidelines

Intent over Implementation

Teams declare what they want, not how to achieve it. spec.size: medium instead of spec.instanceType: m5.xlarge. The operator maps intent to implementation — teams are insulated from AWS instance type changes.
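
Inside the operator, that mapping can be as small as a lookup table. The sketch below is illustrative only: the medium entry comes from the example above, while the small and large entries and the function name are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrUnknownSize is returned for sizes the platform does not offer.
var ErrUnknownSize = errors.New("unknown size")

// sizeToInstanceType is a hypothetical table owned by the platform team.
// Changing the right-hand side never requires teams to edit their specs.
var sizeToInstanceType = map[string]string{
	"small":  "m5.large",
	"medium": "m5.xlarge",
	"large":  "m5.2xlarge",
}

// resolveInstanceType maps the declared intent to a concrete instance type.
func resolveInstanceType(size string) (string, error) {
	instanceType, ok := sizeToInstanceType[size]
	if !ok {
		return "", fmt.Errorf("%w: %q", ErrUnknownSize, size)
	}
	return instanceType, nil
}

func main() {
	instanceType, err := resolveInstanceType("medium")
	fmt.Println(instanceType, err) // m5.xlarge <nil>

	_, err = resolveInstanceType("huge")
	fmt.Println(errors.Is(err, ErrUnknownSize)) // true
}
```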

Sensible Defaults

Every optional field should have a default that works for 80% of cases. Use +kubebuilder:default= markers and CRD default: in OpenAPI schema. Teams shouldn't need to specify 20 fields to get a working deployment.

Immutable Fields via CEL

Some fields should never change after creation (database name, region). Mark them with x-kubernetes-validations: rule: "self == oldSelf". This prevents accidental data loss from field updates that would trigger destructive replacement.
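
When the same check has to live in a validating webhook instead of CEL, it reduces to comparing the old and new objects on update. The following sketch uses a hypothetical DatabaseSpec with two create-only fields; it is not a type defined elsewhere in this guide.

```go
package main

import "fmt"

// DatabaseSpec is a hypothetical spec with two create-only fields.
type DatabaseSpec struct {
	Name      string
	Region    string
	StorageGB int
}

// validateUpdate rejects changes to fields that must stay fixed after
// creation, mirroring the CEL transition rule self == oldSelf.
func validateUpdate(oldSpec, newSpec DatabaseSpec) error {
	if newSpec.Name != oldSpec.Name {
		return fmt.Errorf("spec.name is immutable (was %q)", oldSpec.Name)
	}
	if newSpec.Region != oldSpec.Region {
		return fmt.Errorf("spec.region is immutable (was %q)", oldSpec.Region)
	}
	return nil // mutable fields such as StorageGB may change freely
}

func main() {
	current := DatabaseSpec{Name: "orders", Region: "us-east-1", StorageGB: 20}

	grown := current
	grown.StorageGB = 50
	fmt.Println(validateUpdate(current, grown)) // <nil>

	moved := current
	moved.Region = "eu-west-1"
	fmt.Println(validateUpdate(current, moved) != nil) // true: rejected
}
```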

Status Conditions (Standard)

Use the standard metav1.Condition type for status conditions — type, status, reason, message, lastTransitionTime. This integrates with kubectl wait, Argo CD health checks, and standard tooling without custom logic.
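
The one subtle rule with conditions is that lastTransitionTime should move only when status actually flips, not on every reconcile. In a real operator, meta.SetStatusCondition from k8s.io/apimachinery handles this. The sketch below reimplements the idea with a local Condition struct so it runs on its own; the struct and function are simplified stand-ins, not the apimachinery types.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a simplified stand-in for metav1.Condition.
type Condition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// setCondition inserts or updates the condition with c.Type.
// LastTransitionTime changes only when Status changes.
func setCondition(conds []Condition, c Condition, now time.Time) []Condition {
	for i := range conds {
		if conds[i].Type != c.Type {
			continue
		}
		if conds[i].Status != c.Status {
			conds[i].LastTransitionTime = now
		}
		conds[i].Status = c.Status
		conds[i].Reason = c.Reason
		conds[i].Message = c.Message
		return conds
	}
	c.LastTransitionTime = now
	return append(conds, c)
}

func main() {
	t1 := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	t2 := t1.Add(time.Minute)

	conds := setCondition(nil, Condition{Type: "Available", Status: "False", Reason: "Deploying"}, t1)
	conds = setCondition(conds, Condition{Type: "Available", Status: "False", Reason: "StillDeploying"}, t2)
	fmt.Println(conds[0].LastTransitionTime.Equal(t1)) // true: status did not flip

	conds = setCondition(conds, Condition{Type: "Available", Status: "True", Reason: "MinimumReplicasAvailable"}, t2)
	fmt.Println(conds[0].LastTransitionTime.Equal(t2)) // true: status flipped
}
```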

Additive-Only Evolution

Never remove or rename fields in a served version. Add new fields with defaults. When breaking changes are needed, add a new API version (v1beta1) alongside the old one and provide a conversion webhook.

ownerReferences for Garbage Collection

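Call controllerutil.SetControllerReference on every resource an operator creates. When the parent custom resource is deleted, the Kubernetes garbage collector cascades deletion to all owned resources automatically. Without this, orphaned Deployments and Services accumulate. Owner references only work within a namespace (or from a cluster-scoped owner), so cross-namespace children need a finalizer instead.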

API Versioning Lifecycle

# Serving multiple versions simultaneously
spec:
  versions:
  - name: v1alpha1
    served: true
    storage: false    # deprecated; still served for old clients
    deprecated: true
    deprecationWarning: "v1alpha1 is deprecated; use v1beta1"
  - name: v1beta1
    served: true
    storage: true     # new storage version
  - name: v1
    served: true
    storage: false    # GA; not yet storage (migration in progress)

Testing Operators & Webhooks

envtest: Integration Tests Without a Cluster

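// suite_test.go: envtest starts a real kube-apiserver and etcd as local processes (no kubelet, no nodes)
package controllers_test

import (
    "context"
    "testing"

    . "github.com/onsi/ginkgo/v2"
    . "github.com/onsi/gomega"
    "k8s.io/client-go/kubernetes/scheme"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/envtest"

    platformv1alpha1 "github.com/myorg/platform-operator/api/v1alpha1"
    "github.com/myorg/platform-operator/controllers"
)

var (
    testEnv   *envtest.Environment
    k8sClient client.Client
    ctx       context.Context
    cancel    context.CancelFunc
)

func TestControllers(t *testing.T) {
    RegisterFailHandler(Fail)
    RunSpecs(t, "Controller Suite")
}

var _ = BeforeSuite(func() {
    ctx, cancel = context.WithCancel(context.Background())

    testEnv = &envtest.Environment{
        CRDDirectoryPaths: []string{"../config/crd/bases"},
        // For these webhooks to be reachable, the manager must serve on
        // testEnv.WebhookInstallOptions.LocalServingHost/Port/CertDir.
        WebhookInstallOptions: envtest.WebhookInstallOptions{
            Paths: []string{"../config/webhook"},
        },
    }
    cfg, err := testEnv.Start()
    Expect(err).NotTo(HaveOccurred())

    // Register the Application types alongside the built-in ones
    Expect(platformv1alpha1.AddToScheme(scheme.Scheme)).To(Succeed())

    k8sClient, err = client.New(cfg, client.Options{Scheme: scheme.Scheme})
    Expect(err).NotTo(HaveOccurred())

    // Start the manager with controllers
    mgr, err := ctrl.NewManager(cfg, ctrl.Options{Scheme: scheme.Scheme})
    Expect(err).NotTo(HaveOccurred())
    Expect((&controllers.ApplicationReconciler{Client: mgr.GetClient(), Scheme: mgr.GetScheme()}).
        SetupWithManager(mgr)).To(Succeed())
    go func() {
        defer GinkgoRecover()
        Expect(mgr.Start(ctx)).To(Succeed())
    }()
})

var _ = AfterSuite(func() {
    cancel() // stop the manager before tearing down the API server
    Expect(testEnv.Stop()).To(Succeed())
})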

Reconciler Test

var _ = Describe("Application controller", func() {
    It("should create a Deployment when an Application is created", func() {
        app := &platformv1alpha1.Application{
            ObjectMeta: metav1.ObjectMeta{
                Name:      "test-app",
                Namespace: "default",
            },
            Spec: platformv1alpha1.ApplicationSpec{
                Image:    "nginx:1.25.0",
                Port:     80,
                Replicas: ptr.To(int32(2)),
            },
        }
        Expect(k8sClient.Create(ctx, app)).To(Succeed())

        // Wait for Deployment to be created
        deployment := &appsv1.Deployment{}
        Eventually(func() error {
            return k8sClient.Get(ctx, types.NamespacedName{
                Name: "test-app", Namespace: "default",
            }, deployment)
        }, 10*time.Second, 100*time.Millisecond).Should(Succeed())

        Expect(*deployment.Spec.Replicas).To(Equal(int32(2)))
        Expect(deployment.Spec.Template.Spec.Containers[0].Image).To(Equal("nginx:1.25.0"))

        // Verify ownerReference is set
        Expect(deployment.OwnerReferences).To(HaveLen(1))
        Expect(deployment.OwnerReferences[0].Name).To(Equal("test-app"))
    })
})

Webhook Unit Test (calling Default and ValidateCreate directly)

var _ = Describe("Application defaulting webhook", func() {
    It("should set default replicas to 2", func() {
        app := &platformv1alpha1.Application{
            Spec: platformv1alpha1.ApplicationSpec{
                Image: "nginx:1.25.0",
                Port:  80,
                // Replicas intentionally omitted
            },
        }
        // Defaulting webhook should set Replicas = 2
        app.Default()
        Expect(*app.Spec.Replicas).To(Equal(int32(2)))
    })

    It("should reject :latest image tag", func() {
        app := &platformv1alpha1.Application{
            Spec: platformv1alpha1.ApplicationSpec{
                Image:    "nginx:latest",
                Port:     80,
                Replicas: ptr.To(int32(2)),
            },
        }
        _, err := app.ValidateCreate()
        Expect(err).To(HaveOccurred())
        Expect(err.Error()).To(ContainSubstring("latest"))
    })
})
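
The tests above call app.Default() and app.ValidateCreate(), which this guide does not show. A minimal sketch of what those methods might contain follows, using local stand-in types instead of the generated API package. The real ValidateCreate in the test returns warnings as well as an error; this simplified version returns only the error.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ApplicationSpec and Application are simplified stand-ins for the
// generated v1alpha1 types.
type ApplicationSpec struct {
	Image    string
	Port     int32
	Replicas *int32
}

type Application struct {
	Spec ApplicationSpec
}

// Default fills optional fields that the user left unset.
func (a *Application) Default() {
	if a.Spec.Replicas == nil {
		replicas := int32(2)
		a.Spec.Replicas = &replicas
	}
}

// ValidateCreate rejects images that use the mutable :latest tag.
func (a *Application) ValidateCreate() error {
	if strings.HasSuffix(a.Spec.Image, ":latest") {
		return errors.New("image tag ':latest' is not allowed")
	}
	return nil
}

func main() {
	app := &Application{Spec: ApplicationSpec{Image: "nginx:1.25.0", Port: 80}}
	app.Default()
	fmt.Println(*app.Spec.Replicas) // 2

	bad := &Application{Spec: ApplicationSpec{Image: "nginx:latest", Port: 80}}
	fmt.Println(bad.ValidateCreate()) // image tag ':latest' is not allowed
}
```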

Operator Observability

controller-runtime Built-in Metrics

# controller-runtime exposes these metrics automatically
controller_runtime_reconcile_total{controller,result}        # reconcile count by result (success/error/requeue/requeue_after)
controller_runtime_reconcile_errors_total{controller}        # error count
controller_runtime_reconcile_time_seconds{controller}        # reconcile duration histogram
controller_runtime_webhook_requests_total{webhook,code}      # webhook call count
controller_runtime_webhook_latency_seconds{webhook}          # webhook latency
workqueue_depth{name}                                        # items waiting in work queue
workqueue_queue_duration_seconds{name}                       # time items spend queued

Custom Operator Metrics

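// Register custom metrics with controller-runtime's registry so the manager's
// metrics endpoint serves them alongside the built-in ones.
import (
    "github.com/prometheus/client_golang/prometheus"
    "sigs.k8s.io/controller-runtime/pkg/metrics"
)
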
var (
    applicationTotal = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "platform_application_total",
            Help: "Total number of Application instances by phase",
        },
        []string{"namespace", "phase"},
    )
    reconcileDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "platform_reconcile_duration_seconds",
            Help:    "Time taken to reconcile an Application",
            Buckets: prometheus.DefBuckets,
        },
        []string{"controller"},
    )
)

func init() {
    metrics.Registry.MustRegister(applicationTotal, reconcileDuration)
}

PrometheusRule for Operator Health

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: platform-operator-alerts
  namespace: monitoring
spec:
  groups:
  - name: platform.operator
    rules:
    - alert: PlatformOperatorHighErrorRate
      expr: |
        rate(controller_runtime_reconcile_errors_total{
          controller="application"
        }[5m]) > 0.1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Platform operator reconcile error rate high"
        description: "{{ $value | humanize }} errors/s for Application controller"

    - alert: PlatformOperatorQueueBacklog
      expr: |
        workqueue_depth{name="application"} > 50
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Platform operator work queue depth {{ $value }}"
        description: "Large backlog may indicate reconciliation is stuck or too slow."

    - alert: PlatformApplicationDegraded
      expr: |
        platform_application_total{phase="Degraded"} > 0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "{{ $value }} Application(s) in Degraded phase"

Best Practices

Idempotent Reconcilers

A reconciler must be safe to call any number of times with the same input. Use controllerutil.CreateOrUpdate instead of create-then-check patterns. Every reconcile should produce the same result regardless of how many times it runs — the K8s control loop guarantees it will run many times.

Status Conditions Over Phase Strings

Use metav1.Condition array for status (type/status/reason/message/lastTransitionTime) rather than a single phase string. Conditions compose — an Application can be Progressing=True and Available=True simultaneously. This is what kubectl wait and Argo CD health checks understand natively.

Exponential Backoff on Errors

Never return a bare error from Reconcile without understanding the retry behavior. controller-runtime uses an exponential backoff rate limiter by default. For transient errors (network), return the error. For permanent errors (invalid spec), update status and return nil — don't keep retrying what will never succeed.
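
Two pieces of that advice are easy to sketch in isolation: how a per-item exponential backoff grows, and how a reconciler might decide whether an error is worth retrying. The helper names and the ErrInvalidSpec sentinel below are hypothetical. The delays use the 500ms base and 5m cap from the rate limiter configured earlier.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrInvalidSpec marks errors that retrying cannot fix.
var ErrInvalidSpec = errors.New("invalid spec")

// errTimeout stands in for a transient failure such as a network error.
var errTimeout = errors.New("connection timed out")

// backoff returns base doubled once per prior failure, capped at limit.
func backoff(failures int, base, limit time.Duration) time.Duration {
	d := base
	for i := 0; i < failures; i++ {
		d *= 2
		if d >= limit {
			return limit
		}
	}
	return d
}

// shouldRetry reports whether Reconcile should return the error (and so be
// requeued with backoff) or record it in status and return nil.
func shouldRetry(err error) bool {
	return err != nil && !errors.Is(err, ErrInvalidSpec)
}

func main() {
	base, limit := 500*time.Millisecond, 5*time.Minute
	for _, failures := range []int{0, 1, 3, 20} {
		fmt.Println(failures, backoff(failures, base, limit))
	}
	fmt.Println(shouldRetry(errTimeout))                                   // true
	fmt.Println(shouldRetry(fmt.Errorf("replicas=0: %w", ErrInvalidSpec))) // false
}
```

Wrapping permanent errors with a sentinel keeps the decision in one place, so the reconciler body only has to branch on shouldRetry.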

Leader Election for HA

Run operator replicas with leader election (ctrl.Options{LeaderElection: true, LeaderElectionID: "platform-operator-leader"}). Only the leader reconciles; others are hot standbys. Without this, multiple replicas will conflict on concurrent updates to the same resource.

RBAC via Markers

Use // +kubebuilder:rbac: markers on the Reconcile function. make manifests generates the ClusterRole automatically. This keeps RBAC in sync with what the controller actually needs — no manually maintained ClusterRole that drifts over time.

Webhook TLS via cert-manager

Never manually manage webhook TLS certificates. Create a cert-manager Certificate for the webhook service and annotate the WebhookConfiguration with cert-manager.io/inject-ca-from. cert-manager's cainjector patches the caBundle automatically on rotation.

Test with envtest, Not Mocks

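Test reconcilers against a real API server via envtest, not with mocked clients. Mocks miss critical behaviors like watch cache delays, conflict errors on concurrent updates, and admission webhook calls. envtest starts a local kube-apiserver and etcd from downloaded binaries, so it runs in CI with no cluster dependency. It does not run the built-in controllers or a kubelet, so a Deployment created in a test never produces Pods; assert on the objects your reconciler writes, not on their downstream effects.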

Finalizers for Cleanup

If your operator creates cloud resources, always add a finalizer. Without it, deleting the custom resource removes the Kubernetes object immediately, but the cloud resource (RDS, S3 bucket) lives on as an orphan that keeps accruing cost. The finalizer blocks deletion until the operator has cleaned up external state and removed the finalizer.
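The ordering that makes finalizers safe can be sketched with plain Go. Object, the cloud map, and the finalizer name platform.example.com/cleanup are hypothetical stand-ins; in a real operator the controllerutil helpers AddFinalizer, ContainsFinalizer and RemoveFinalizer operate on the actual object metadata.

```go
package main

// cleanupFinalizer is a hypothetical finalizer name used for illustration.
const cleanupFinalizer = "platform.example.com/cleanup"

// Object is a pared-down stand-in for a custom resource's metadata.
type Object struct {
	Name       string
	Finalizers []string
	Deleting   bool // stands in for a non-nil metadata.deletionTimestamp
}

func hasFinalizer(o *Object, f string) bool {
	for _, x := range o.Finalizers {
		if x == f {
			return true
		}
	}
	return false
}

func removeFinalizer(o *Object, f string) {
	kept := o.Finalizers[:0]
	for _, x := range o.Finalizers {
		if x != f {
			kept = append(kept, x)
		}
	}
	o.Finalizers = kept
}

// reconcile returns true when a deleting object has no finalizers left,
// which is the point at which the API server would actually remove it.
func reconcile(o *Object, cloud map[string]bool) bool {
	if !o.Deleting {
		// Register the finalizer BEFORE creating anything external,
		// so a delete can never slip in between the two steps.
		if !hasFinalizer(o, cleanupFinalizer) {
			o.Finalizers = append(o.Finalizers, cleanupFinalizer)
		}
		cloud[o.Name] = true // ensure the external resource exists
		return false
	}
	if hasFinalizer(o, cleanupFinalizer) {
		delete(cloud, o.Name) // clean up external state first
		removeFinalizer(o, cleanupFinalizer)
	}
	return len(o.Finalizers) == 0
}
```

Both branches are safe to repeat: a second pass over a live object adds nothing new, and a second pass over a deleted one finds nothing left to clean up.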

Coverage: 10 · Platform APIs

Kubernetes extension mechanisms diagram (CRDs/Webhooks/API Aggregation/VAP with relationship to API server)
Extension mechanism decision guide table (8 use cases mapped to mechanism + complexity)
Full CRD YAML (Application): group/scope/names/shortNames/categories/versions/subresources (status+scale)/additionalPrinterColumns/full openAPIV3Schema with nested properties/status conditions array
CRD instance YAML (Application with image/port/replicas/resources/env/ingress)
CEL validation rules in CRD schema (x-kubernetes-validations: cross-field prod requires 2 replicas, no :latest, replicas cannot reduce below 1)
CRD version conversion webhook configuration (strategy:Webhook, conversionReviewVersions, clientConfig)
Mutating webhook Go handler: PodDefaulter (inject managed-by label, set terminationGracePeriodSeconds, return JSON patch)
Register webhooks in main.go (mgr.GetWebhookServer().Register for mutate + validate paths)
MutatingWebhookConfiguration YAML (cert-manager inject-ca-from annotation, namespaceSelector matchExpressions, failurePolicy:Ignore, reinvocationPolicy:IfNeeded)
cert-manager webhook TLS injection callout (cainjector patches caBundle automatically)
Operator reconcile loop architecture diagram (desired state → informer → reconciler → create/update/delete → status → requeue)
Operator maturity model table (Level 1-5: Basic Install/Upgrades/Full Lifecycle/Insights/Auto Pilot)
Kubebuilder scaffold commands (init, create api, create webhook, make generate manifests)
Go type definition with kubebuilder markers (object:root/subresource:status+scale/printcolumn/resource shortName/validation markers)
Full Reconciler (Reconcile function): fetch object/handle deletion with finalizer/add finalizer/CreateOrUpdate Deployment/CreateOrUpdate Service/update status/RequeueAfter 5m
SetupWithManager: For/Owns Deployment+Service/WithOptions MaxConcurrentReconciles+RateLimiter
desiredDeployment helper function (labels/ObjectMeta/DeploymentSpec from ApplicationSpec)
APIService resource for custom metrics aggregation (group/service/caBundle/groupPriorityMinimum)
HPA on external custom metric (payments_queue_depth with AverageValue:10)
Platform API design principles: 6 cards (intent over implementation/defaults/immutable CEL/status conditions/additive evolution/ownerReferences)
CRD API versioning lifecycle YAML (served/storage/deprecated flags for v1alpha1+v1beta1+v1)
envtest integration test suite (BeforeSuite: start testEnv/start manager/register reconciler)
Reconciler test (create Application → Eventually check Deployment created → verify replicas+image+ownerReference)
Webhook test (defaulting sets replicas:2, validation rejects :latest)
controller-runtime built-in metrics reference (reconcile_total/errors/time/webhook latency/workqueue_depth)
Custom operator metrics in Go (GaugeVec for application phases, HistogramVec for reconcile duration)
PrometheusRule: PlatformOperatorHighErrorRate / PlatformOperatorQueueBacklog / PlatformApplicationDegraded
8 best practices cards (idempotent reconcilers/status conditions/exponential backoff/leader election/RBAC markers/cert-manager TLS/envtest not mocks/finalizers for cleanup)