Platform Engineering Overview

1. What Is Platform Engineering?

Platform engineering is the discipline of designing and building self-service toolchains and workflows — collectively called an Internal Developer Platform (IDP) — that enable product teams to deploy, run, and observe their software in production without requiring deep infrastructure expertise for every task.

The goal is not to build the most technically sophisticated infrastructure. It is to reduce cognitive load on application developers while maintaining the guardrails (security, cost control, compliance) that the organisation requires.

Platform engineering vs "just a Kubernetes cluster"

A Kubernetes cluster is infrastructure. A platform wraps that cluster with golden-path workflows, self-service provisioning, pre-integrated observability, policy enforcement, and a developer portal — so that a new team can go from zero to production in hours, not weeks, while automatically complying with company standards.

Why Platform Engineering Emerged

The cloud-native ecosystem solved infrastructure scalability but created a new problem: the number of tools, concepts, and decisions required to deploy a service grew dramatically. A developer shipping a feature now needed to understand Kubernetes YAML, Helm charts, Prometheus scrape configs, Grafana dashboards, Vault secrets paths, network policies, and CI pipeline YAML — all before the first production deployment.

Before platform engineering (developer cognitive load):
  ┌──────────────────────────────────────────────────────────┐
  │ Write code → Write Dockerfile → Write Helm chart →       │
  │ Configure CI → Configure CD → Set up monitoring →        │
  │ Create alerts → Request secrets → Configure NetworkPolicy│
  │ → Write runbook → Set resource limits → ...              │
  │                                              ~40 tasks   │
  └──────────────────────────────────────────────────────────┘

After platform engineering (developer cognitive load):
  ┌─────────────────────────────────────────────┐
  │ Write code → Push to git                    │
  │  → Platform handles everything else         │
  │                                   ~2 tasks  │
  └─────────────────────────────────────────────┘
  (Security, observability, networking pre-wired by platform)

2. Platform Engineering vs DevOps vs SRE

Dimension | DevOps | SRE | Platform Engineering
Primary goal | Break dev/ops silos; continuous delivery | Reliability of production services (SLOs, error budgets) | Developer productivity; reduce cognitive load
Core output | Cultural practices, CI/CD pipelines | SLO frameworks, incident response, automation to replace toil | Internal Developer Platform (IDP)
Who are the customers? | The organisation (cultural transformation) | End users (reliability) and operators (toil reduction) | Product/feature teams (developers)
Key metric | DORA metrics (deployment frequency, lead time, MTTR, change failure rate) | SLO compliance, error budget burn rate, toil % | Time to first deployment, developer NPS, self-service ratio
Infrastructure ownership | Shared (everyone owns their pipeline) | Shared (product teams own SLOs, SRE owns platform reliability) | Platform team owns the platform; product teams own their apps
Relationship to Kubernetes | Kubernetes is part of the CD pipeline | Kubernetes is a production environment to make reliable | Kubernetes is the infrastructure substrate the platform abstracts

These roles are complementary, not competing

A mature engineering organisation practises all three: DevOps culture (fast feedback loops), SRE principles (error budgets, toil reduction), and platform engineering (IDP to scale developer autonomy). The platform team's work enables both DevOps and SRE practices at scale.

3. IDP Capability Map

An Internal Developer Platform is not a single product — it is a composition of capabilities across five planes:

┌─────────────────────────────────────────────────────────────────┐
│                  Developer Experience Plane                      │
│  Backstage portal / CLI tools / VS Code extensions / Docs       │
│  Service catalog / Templates / Scorecards / TechDocs            │
├─────────────────────────────────────────────────────────────────┤
│                   Application Plane                              │
│  Golden path templates / Helm charts / Kustomize overlays       │
│  CI pipelines (Tekton/GitHub Actions) / CD (Argo CD/Flux)       │
│  Feature flags / Release automation / Preview environments       │
├─────────────────────────────────────────────────────────────────┤
│                   Security & Policy Plane                        │
│  OPA/Gatekeeper policies / Kyverno / Pod Security Standards     │
│  Secrets management (Vault/ESO) / SBOM / Image signing          │
│  Network policies / RBAC templates / Admission webhooks         │
├─────────────────────────────────────────────────────────────────┤
│                   Infrastructure Plane                           │
│  Cluster provisioning (Cluster API/EKS Blueprints/Crossplane)   │
│  Cloud resources (Crossplane / ACK / Terraform / Pulumi)        │
│  DNS / TLS (cert-manager / external-dns) / Load balancers       │
├─────────────────────────────────────────────────────────────────┤
│                   Observability Plane                            │
│  Metrics (Prometheus/Thanos) / Logs (Loki) / Traces (Tempo)     │
│  Dashboards (Grafana) / Alerting (Alertmanager) / Profiling      │
│  Events / Runbooks / On-call rotation (PagerDuty/Opsgenie)      │
└─────────────────────────────────────────────────────────────────┘

Capability Ownership Model

Capability | Platform Team Provides | Product Team Configures
Kubernetes clusters | Provisioning, upgrades, node pools, add-ons | Namespace resource quotas, workload specs
CI/CD pipelines | Pipeline templates, shared tasks, image registry | Pipeline triggers, test commands, deploy targets
Observability | Prometheus stack, Loki, Tempo, Grafana, Alertmanager | ServiceMonitors, dashboards, alert rules, SLOs
Secrets | Vault clusters, ESO install, access policies | SecretStore references, ExternalSecret paths
Networking | Ingress controllers, cert-manager, external-dns, service mesh | Ingress resources, certificates, network policies
Policy | Gatekeeper/Kyverno install, baseline policies | Namespace-scoped exceptions (via PR review)
Cloud resources | Crossplane providers, Compositions, XRDs | Claim YAML files (e.g., PostgreSQL, S3 bucket)
Developer portal | Backstage instance, plugins, RBAC | catalog-info.yaml, TechDocs, templates

4. Platform as a Product

The most important mindset shift in platform engineering: treat the internal developer platform as a product with users (developers), a roadmap, feedback loops, and success metrics — not as an IT service delivered on a project basis.

Understand your users

Run developer surveys (quarterly NPS), shadow sessions (watch a dev onboard), and track support ticket patterns. Build what reduces the most common pain, not what is technically interesting.

user research

Treat abstractions as contracts

When you publish a golden-path Helm chart or Crossplane Composition, that is a public API. Breaking changes require versioning and migration paths, just like a REST API.

API design
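One way to make the "public API" idea concrete is to gate chart or Composition releases on semantic versioning. The sketch below assumes golden-path artefacts carry MAJOR.MINOR.PATCH versions and that a major bump signals a breaking change; the function names are hypothetical and not part of Helm or Crossplane.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Split a MAJOR.MINOR.PATCH string into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def needs_migration_guide(current: str, proposed: str) -> bool:
    """Return True when the upgrade crosses a major version boundary."""
    return parse_version(proposed)[0] > parse_version(current)[0]


print(needs_migration_guide("1.4.2", "1.5.0"))  # minor bump: additive change
print(needs_migration_guide("1.4.2", "2.0.0"))  # major bump: breaking change
```

A release pipeline could run a check like this and refuse to publish a major version that ships without a migration document.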

Measure developer productivity

Track DORA metrics (deployment frequency, lead time), time to first deploy for new services, and self-service ratio (platform-provisioned vs ticket-provisioned resources).

DORA
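The self-service ratio mentioned in this card is simple arithmetic, shown here as a minimal sketch. The two counts are assumed to come from whatever systems record platform-provisioned resources and infrastructure tickets; the function name and inputs are illustrative only.

```python
def self_service_ratio(self_service_requests: int, ticket_requests: int) -> float:
    """Share of provisioning requests fulfilled without a ticket (0.0 to 1.0)."""
    total = self_service_requests + ticket_requests
    if total == 0:
        return 0.0  # nothing was provisioned in the window
    return self_service_requests / total


# 180 resources provisioned through the platform, 20 through tickets.
print(self_service_ratio(180, 20))
```

A ratio that trends upward over successive quarters suggests developers are choosing the platform over the ticket queue.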

Maintain a public roadmap

Publish what the platform team is building and when. Invite RFCs for major changes. Developers who understand the roadmap stop building workarounds.

transparency

Provide SLOs for the platform itself

If the platform's CI system has 99.5% availability and 10-minute P99 build time, publish that. Platform reliability directly multiplies or divides developer throughput.

platform SLOs
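To show what a published availability figure implies, the sketch below converts an SLO percentage into an error budget. It assumes a 30-day rolling window, which the text does not specify; the function name is illustrative.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


budget = error_budget_minutes(99.5)
print(f"{budget:.0f} minutes of CI downtime allowed per 30 days")
```

Under these assumptions, a 99.5% target leaves 216 minutes (3.6 hours) of downtime per 30 days, which makes the cost of an outage easier for developers to reason about than the percentage alone.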

Avoid forced adoption

Mandate security guardrails (policy enforcement). Make everything else opt-in by being better than the alternative. Forced tools without value create shadow IT.

adoption

5. Golden Paths & Paved Roads

A golden path is a well-lit, well-maintained route from code to production that embeds all organisational standards by default. Developers can deviate from it, but the golden path should be the easiest option — not a constraint.

What a Golden Path Typically Includes

Developer creates a service from the go-microservice template (for example, through the Backstage software templates UI)

Result: a Git repository pre-configured with:
├── src/                          # Application code skeleton
├── Dockerfile                    # Multi-stage, distroless base, non-root user
├── helm/                         # Helm chart (resources/limits/probes/PDB/HPA)
│   └── values.yaml               # Sane defaults (replica:2, readinessProbe, etc.)
├── .github/workflows/            # CI: lint → test → build → sign → push
│   ├── ci.yaml                   # (or .tekton/pipeline.yaml for in-cluster CI)
│   └── cd.yaml                   # ArgoCD app-of-apps update on merge to main
├── config/                       # Kustomize overlays: dev/staging/prod
├── catalog-info.yaml             # Backstage service catalog registration
├── docs/                         # TechDocs skeleton (mkdocs.yml + index.md)
├── .policy/                      # Pre-approved exceptions file (if needed)
└── monitoring/
    ├── servicemonitor.yaml       # Pre-wired to platform Prometheus
    ├── dashboard.json            # Grafana dashboard (RED metrics)
    └── alerts.yaml               # PrometheusRule with basic SLO alerts
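
A scorecard-style check can confirm that a repository still matches this layout. The sketch below is an assumption about how such a check might look, not a Backstage feature; the path list is an illustrative subset of the tree above and the function name is hypothetical.

```python
from pathlib import Path

# Paths taken from the tree above; an illustrative subset, not a complete list.
EXPECTED_PATHS = [
    "Dockerfile",
    "helm/values.yaml",
    "catalog-info.yaml",
    "monitoring/servicemonitor.yaml",
    "monitoring/alerts.yaml",
]


def missing_golden_path_files(repo_root: Path) -> list[str]:
    """Return the expected golden-path files that are absent from the repo."""
    return [path for path in EXPECTED_PATHS if not (repo_root / path).is_file()]
```

Calling `missing_golden_path_files(Path("."))` from a repository root returns an empty list when every expected file is present.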

Golden Path vs Escape Hatch

Situation | Recommendation
Standard HTTP microservice (Go, Java, Python, Node.js) | Use golden path template; covers 90% of services
ML training job (GPU, large memory, spot instances) | Specialised template; golden path extended for batch/GPU workloads
Stateful service needing custom storage topology | Escape hatch: custom Helm chart, but must pass Gatekeeper policies
Legacy service not yet containerised | Migration path: platform provides lift-and-shift guide + VM-based workload support
Security-sensitive service (handles PII/PCI) | Hardened golden path variant with additional network policies, encryption, audit logging

Escape hatches must still pass policy

Teams that deviate from the golden path are not exempt from Gatekeeper/Kyverno policies. The golden path builds compliance in by default; escape hatches must achieve compliance explicitly. Admission policies (OPA Gatekeeper or Kyverno) are the non-negotiable layer; golden paths are the convenience layer on top.
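
The principle that every workload faces the same checks, golden path or not, can be sketched in plain Python. This is not Gatekeeper or Kyverno code (those express policies in Rego and YAML respectively); the two rules and the dict layout are assumptions for illustration, loosely modelled on a Kubernetes container spec.

```python
def policy_violations(container: dict) -> list[str]:
    """Check one container spec (as a dict) against two baseline rules."""
    violations = []
    security = container.get("securityContext", {})
    if security.get("runAsNonRoot") is not True:
        violations.append("container must set runAsNonRoot: true")
    limits = container.get("resources", {}).get("limits", {})
    for resource in ("cpu", "memory"):
        if resource not in limits:
            violations.append(f"container must set a {resource} limit")
    return violations


golden_path = {
    "securityContext": {"runAsNonRoot": True},
    "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}},
}
escape_hatch = {"resources": {"limits": {"memory": "8Gi"}}}

print(policy_violations(golden_path))   # compliant by default
print(policy_violations(escape_hatch))  # must fix both findings explicitly
```

The same function runs against both specs; only the starting point differs, which is the distinction the callout draws.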

6. Tooling Landscape

Platform Engineering Tool Categories

Category | Open Source / CNCF | Commercial / Managed
Developer portal | Backstage (CNCF) | Port, Cortex, OpsLevel, Rely.io
Cluster provisioning | Cluster API, kOps, k3s | EKS Blueprints (CDK/Terraform), GKE Autopilot, AKS
Infrastructure as Code (cloud resources) | Crossplane (CNCF), ACK, Terraform, Pulumi | AWS CDK, Terraform Cloud, Pulumi Cloud
GitOps / CD | Argo CD (CNCF), Flux (CNCF) | Harness, Spinnaker (CD Foundation)
CI pipelines | Tekton (CD Foundation), Dagger, Woodpecker | GitHub Actions, GitLab CI, CircleCI, Jenkins
Policy enforcement | OPA Gatekeeper (CNCF), Kyverno (CNCF) | Styra DAS, Nirmata
Secrets management | Vault (HashiCorp/BSL), External Secrets Operator | AWS Secrets Manager, GCP Secret Manager, Azure Key Vault
Service mesh | Istio (CNCF), Linkerd (CNCF), Cilium | AWS App Mesh, Google Traffic Director
Image registry | Harbor (CNCF), Zot | ECR, GAR, ACR, Docker Hub, JFrog Artifactory
Image signing | Cosign / Sigstore (OpenSSF), Notary v2 | AWS Signer
Network / DNS / TLS | cert-manager (CNCF), external-dns, MetalLB | AWS ACM, Cloudflare, Route53
Cost management | OpenCost (CNCF), Kubecost | AWS Cost Explorer, CloudHealth, Apptio
Preview environments | vCluster, Argo CD ApplicationSets | Environments by Render, Railway, Architect
Feature flags | OpenFeature (CNCF), Flagd | LaunchDarkly, Statsig, Split

CNCF Landscape Positioning

CNCF Platform Engineering Relevant Projects (2024):

Graduated:  Argo, Flux, OPA, Harbor, cert-manager (graduated late 2024)
Incubating: Backstage, Crossplane, Kyverno, KubeVela, OpenFeature, OpenCost
Sandbox:    Headlamp
Other:      Cluster API, external-dns (Kubernetes SIG subprojects);
            Kratix (open source, not a CNCF-hosted project)

Key patterns these tools address:
  Crossplane    → Kubernetes-native IaC (cloud resources as CRDs)
  Cluster API   → Kubernetes-native cluster lifecycle management
  Backstage     → Developer portal and service catalog
  Argo CD       → GitOps continuous delivery
  Flux          → GitOps with Helm/Kustomize reconciliation
  Kyverno       → Policy-as-code without Rego
  OPA           → Policy engine (Gatekeeper for K8s admission)
  cert-manager  → Automated TLS certificate lifecycle
  external-dns  → DNS record sync from Kubernetes Ingress/Service

7. Reference Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                     Developer Workflow                               │
│                                                                      │
│  git push  →  CI Pipeline (GitHub Actions / Tekton)                  │
│              │  lint / test / build / sign / push image              │
│              │  update Helm values / Kustomize image tag             │
│              ▼                                                       │
│         GitOps Repo (argocd-apps / fleet)                           │
│              │                                                       │
│              ▼                                                       │
│         Argo CD / Flux  ←──────── Drift detection                   │
│              │  reconcile desired state from Git                     │
│              ▼                                                       │
│  ┌───────────────────────────────────────────────────────────┐      │
│  │              Kubernetes Cluster                            │      │
│  │                                                            │      │
│  │  ┌─────────────────────┐  ┌──────────────────────────┐   │      │
│  │  │  Admission Control  │  │   Namespace (team-A)      │   │      │
│  │  │  Gatekeeper/Kyverno │  │   workloads / services    │   │      │
│  │  │  OPA policies       │  │   ConfigMaps / secrets    │   │      │
│  │  └─────────────────────┘  └──────────────────────────┘   │      │
│  │                                                            │      │
│  │  ┌─────────────────────┐  ┌──────────────────────────┐   │      │
│  │  │  Networking         │  │   Observability           │   │      │
│  │  │  Ingress NGINX      │  │   Prometheus / Loki       │   │      │
│  │  │  cert-manager       │  │   Tempo / Grafana         │   │      │
│  │  │  external-dns       │  │   Alertmanager            │   │      │
│  │  │  Cilium / Istio     │  │   Pyroscope               │   │      │
│  │  └─────────────────────┘  └──────────────────────────┘   │      │
│  │                                                            │      │
│  │  ┌─────────────────────┐  ┌──────────────────────────┐   │      │
│  │  │  Secrets            │  │   Infrastructure          │   │      │
│  │  │  Vault + ESO        │  │   Crossplane providers    │   │      │
│  │  │  Secret rotation    │  │   RDS / S3 / SQS claims   │   │      │
│  │  └─────────────────────┘  └──────────────────────────┘   │      │
│  └───────────────────────────────────────────────────────────┘      │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │                Backstage Developer Portal                     │   │
│  │  Service catalog / Templates / TechDocs / Scorecards          │   │
│  │  Kubernetes plugin / Argo CD plugin / Cost plugin             │   │
│  └──────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

Multi-Cluster Platform Architecture

Management Cluster (platform-managed)          Workload Clusters
┌───────────────────────────────┐         ┌──────────────────┐
│  Argo CD / Flux               │────────▶│  cluster: dev    │
│  Crossplane control plane     │         │  (team namespaces│
│  Vault                        │────────▶│   + policies)    │
│  Backstage                    │         └──────────────────┘
│  Harbor (image registry)      │         ┌──────────────────┐
│  Policy validation webhook    │────────▶│  cluster: staging│
│  Platform observability stack │         └──────────────────┘
└───────────────────────────────┘         ┌──────────────────┐
                                    ──────▶│  cluster: prod-1 │
                                    │      └──────────────────┘
                                    ──────▶│  cluster: prod-2 │
                                           │  (regional)      │
                                           └──────────────────┘

Fleet management: Cluster API (provisioning) + Argo CD ApplicationSets
                  (deploy platform add-ons uniformly across all clusters)

8. Platform Maturity Model

Level 1: Provisional

  • Manual cluster provisioning
  • Ad-hoc CI/CD per team
  • No shared observability
  • Secrets in environment variables
  • No policy enforcement
  • Deployments require ops tickets

Level 2: Operationalized

  • Cluster provisioning scripted (Terraform)
  • Shared CI templates (GitHub Actions)
  • Basic Prometheus + Grafana
  • Vault for secrets
  • Some RBAC standards
  • Manual CD (kubectl apply)

Level 3: Scalable

  • Cluster API / EKS Blueprints
  • GitOps (Argo CD / Flux)
  • Full LGTM observability stack
  • External Secrets Operator
  • Gatekeeper policies enforced
  • Golden path templates
  • Self-service namespaces

Level 4: Optimising

  • Full IDP with Backstage portal
  • Crossplane cloud resources
  • Preview environments
  • Multi-cluster fleet management
  • Cost attribution per team
  • SLO-driven deployments
  • Platform SLOs published
  • Developer NPS tracked

Don't skip levels

Organisations that jump from Level 1 directly to Backstage + Crossplane + Argo CD typically fail to adopt any of it because the prerequisite operational practices are absent. Build Level 2 (reliable deployments, working observability) before automating Level 3 (GitOps, policy). Add the portal (Level 4) last — it has no value if the underlying platform is unreliable.
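
The "don't skip levels" rule can be expressed as a scoring function in which a gap at a lower level caps the result. The sketch below is illustrative: the capability names are hypothetical shorthand condensed from the level lists above, not an official assessment scheme.

```python
# Hypothetical capability names per level, condensed from the lists above.
LEVEL_REQUIREMENTS = {
    2: {"scripted_provisioning", "shared_ci_templates", "basic_observability"},
    3: {"gitops", "policy_enforcement", "golden_path_templates"},
    4: {"developer_portal", "platform_slos", "cost_attribution"},
}


def maturity_level(capabilities: set[str]) -> int:
    """Return the highest level whose requirements, and all lower ones, are met."""
    level = 1
    for candidate in sorted(LEVEL_REQUIREMENTS):
        if not LEVEL_REQUIREMENTS[candidate] <= capabilities:
            break  # a gap at this level caps the score, whatever sits above it
        level = candidate
    return level


# A portal and GitOps without the Level 2 foundations still scores Level 1.
print(maturity_level({"developer_portal", "gitops", "policy_enforcement"}))
```

Because the loop stops at the first unmet level, adopting Level 4 tooling early does not raise the score, which mirrors the argument in the callout.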

9. Team Topologies

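Matthew Skelton and Manuel Pais's Team Topologies framework describes how platform engineering teams relate to other teams. It defines four team types (stream-aligned, enabling, complicated-subsystem, and platform); the SRE card below is a common addition in practice, not one of the four.

Platform Team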

Owns the IDP. Treats product teams as customers. Measures developer productivity. Works in 6–12-week cycles. Publishes roadmap and SLOs.

platform

Stream-Aligned Team

Product/feature teams that own end-to-end delivery of a product stream. They use the platform — they do not manage infrastructure, CI tooling, or observability backends.

consumer

Complicated Subsystem Team

Owns deeply specialised components (ML platform, data pipelines, real-time event bus). Interfaces with platform team for infrastructure; interfaces with stream-aligned teams as a service provider.

specialist

SRE / Reliability Team

Owns availability targets, incident response, and toil automation. Either embedded in the platform team or run as a parallel team. Uses the IDP to deploy reliability tooling (chaos engineering, synthetic monitors, runbooks).

reliability

Platform Team Sizing

Organisation Size | Engineers | Platform Team Size | Focus
Startup | <30 | 1–2 (infra-aware engineers) | Basic CI/CD + managed K8s (EKS/GKE)
Scale-up | 30–150 | 3–6 | GitOps + observability + policy + Vault
Mid-size | 150–500 | 6–12 | Full IDP + Backstage + Crossplane + cost
Enterprise | >500 | 12–30+ | Multi-cluster fleet + compliance + portals

Ratio rule of thumb

A platform team should support 10–15 stream-aligned engineers per platform engineer. A ratio below 8:1 suggests the platform team is over-staffed relative to demand. Above 20:1, developer experience degrades (slow onboarding, long wait times for platform changes).
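
The thresholds in this rule of thumb translate directly into a small check. The sketch below uses the 8:1, 10–15:1, and 20:1 figures from the text; the "acceptable" label for the in-between bands (8–10 and 15–20) is an assumption, since the text does not name them.

```python
def staffing_signal(product_engineers: int, platform_engineers: int) -> str:
    """Classify the product-to-platform engineer ratio against the rule of thumb."""
    if platform_engineers <= 0:
        raise ValueError("platform_engineers must be positive")
    ratio = product_engineers / platform_engineers
    if ratio < 8:
        return "over-staffed"
    if ratio > 20:
        return "under-staffed"
    if 10 <= ratio <= 15:
        return "target range"
    return "acceptable"  # assumed label for the 8-10 and 15-20 bands


print(staffing_signal(product_engineers=120, platform_engineers=10))
```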

10. Developer Experience Metrics

Measuring platform value requires metrics that reflect the developer experience, not just infrastructure uptime.

DORA Metrics (DevOps Research and Assessment, part of Google Cloud)

Metric | Elite | High | Medium | Low
Deployment frequency | Multiple per day | Daily to weekly | Weekly to monthly | Monthly or slower
Lead time for changes | < 1 hour | 1 day – 1 week | 1 week – 1 month | > 1 month
Change failure rate | < 5% | 5–10% | 10–15% | > 15%
MTTR (mean time to restore) | < 1 hour | < 1 day | 1 day – 1 week | > 1 week
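
As a worked example of reading the table, the sketch below computes a change failure rate and maps it to a tier. The tier boundaries come from the table above; how to treat values that sit exactly on a boundary (10% or 15%) is an assumption, resolved here in favour of the better tier.

```python
def change_failure_rate(total_deployments: int, failed_deployments: int) -> float:
    """Percentage of deployments that caused a failure in production."""
    if total_deployments == 0:
        raise ValueError("no deployments in the window")
    return 100.0 * failed_deployments / total_deployments


def failure_rate_tier(rate_percent: float) -> str:
    """Map a change failure rate to the tiers in the table above."""
    if rate_percent < 5:
        return "Elite"
    if rate_percent <= 10:
        return "High"
    if rate_percent <= 15:
        return "Medium"
    return "Low"


rate = change_failure_rate(total_deployments=200, failed_deployments=14)
print(f"{rate:.1f}% -> {failure_rate_tier(rate)}")
```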

Platform-Specific Metrics

# Prometheus queries to track platform health
# (many from DORA, some platform-specific).
# Metric names vary by exporter version; verify each against your /metrics output.

# Time to first production deployment for new services
# Intended measure: time from ApplicationSet creation to Healthy.
# Assumes a sync-duration histogram is exported under this name
# (it may require a custom exporter).
histogram_quantile(0.90,
  sum by (le) (rate(argocd_app_sync_total_duration_seconds_bucket[7d]))
)

# Self-service ratio: platform-provisioned vs ticket-provisioned resources
# Track: number of Crossplane claims created vs infra tickets.
# jira_infra_tickets_30d is assumed to be a custom gauge fed from the ticket system.
sum(increase(crossplane_managed_resource_ready_total[30d]))
  / (
    sum(increase(crossplane_managed_resource_ready_total[30d]))
    + sum(jira_infra_tickets_30d)
  )

# CI pipeline P90 duration (developer wait time)
histogram_quantile(0.90,
  sum by (le) (rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket{status="success"}[7d]))
)

# Platform availability (GitOps reconciliation loop health)
# sum() on both sides so the phase label does not block vector matching.
1 - (
  sum(rate(argocd_app_sync_total{phase="Error"}[24h]))
  / sum(rate(argocd_app_sync_total[24h]))
)

# Developer NPS: tracked via quarterly survey (not Prometheus)
# Target: NPS > 30 (platform is net positive to developers)
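
Since developer NPS comes from survey data rather than Prometheus, it is usually computed offline. The sketch below applies the standard NPS definition (promoters score 9–10, detractors 0–6, on a 0–10 scale); the sample responses are invented for illustration.

```python
def net_promoter_score(scores: list[int]) -> float:
    """NPS = % promoters (9-10) minus % detractors (0-6) on a 0-10 scale."""
    if not scores:
        raise ValueError("no survey responses")
    promoters = sum(1 for score in scores if score >= 9)
    detractors = sum(1 for score in scores if score <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


# Invented sample: 5 promoters, 3 passives, 2 detractors.
responses = [10, 9, 9, 10, 9, 8, 7, 8, 5, 6]
print(net_promoter_score(responses))
```

With this sample the score is exactly 30, which sits on the target threshold rather than above it.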

SPACE Framework (GitHub Research)

Dimension | Metric Examples
Satisfaction and well-being | Developer NPS, survey scores, onboarding satisfaction
Performance | Code review cycle time, deployment success rate, incident MTTR
Activity | Deployments per week, PRs merged, features shipped
Communication and collaboration | Documentation quality score, knowledge-sharing sessions
Efficiency and flow | Time waiting for CI, on-call interruption rate, toil %

11. Section Guide

The Platform Engineering section covers 10 detailed topics. Each builds on this overview:

Coverage Checklist
  • Platform engineering definition and goal (reduce developer cognitive load)
  • Before/after cognitive load diagram (40 tasks → 2 tasks)
  • Platform engineering vs DevOps vs SRE comparison table
  • IDP five-plane capability map diagram (developer experience/application/security/infrastructure/observability)
  • Capability ownership table (platform provides vs product team configures)
  • Platform as a product: 6 principle cards (user research/API contracts/DORA/roadmap/platform SLOs/opt-in adoption)
  • Golden path anatomy: template → repo structure with all pre-configured files
  • Golden path vs escape hatch decision table (5 scenarios)
  • Escape hatch must still pass policy (callout)
  • Platform engineering tool landscape table (14 categories, CNCF and commercial options)
  • CNCF graduated/incubating/sandbox projects reference
  • Reference architecture diagram: CI → GitOps repo → Argo CD → K8s cluster (all planes)
  • Multi-cluster platform architecture (management cluster + workload clusters, fleet management)
  • Platform maturity model: 4 levels (Provisional/Operationalized/Scalable/Optimising) with characteristics
  • Don't skip levels callout (portal before reliable platform fails)
  • Team Topologies: platform/stream-aligned/complicated-subsystem/SRE team cards
  • Platform team sizing table (startup to enterprise with ratios)
  • Ratio rule of thumb callout (10–15 developers per platform engineer)
  • DORA metrics table with elite/high/medium/low thresholds
  • Platform-specific PromQL metrics (time-to-deploy, self-service ratio, CI P90, GitOps availability)
  • SPACE framework (Satisfaction/Performance/Activity/Communication/Efficiency)
  • Section guide with links to all 10 detail pages and descriptions