Platform Engineering Overview
1. What Is Platform Engineering?
Platform engineering is the discipline of designing and building self-service toolchains and workflows — collectively called an Internal Developer Platform (IDP) — that enable product teams to deploy, run, and observe their software in production without requiring deep infrastructure expertise for every task.
The goal is not to build the most technically sophisticated infrastructure. It is to reduce cognitive load on application developers while maintaining the guardrails (security, cost control, compliance) that the organisation requires.
A Kubernetes cluster is infrastructure. A platform wraps that cluster with golden-path workflows, self-service provisioning, pre-integrated observability, policy enforcement, and a developer portal — so that a new team can go from zero to production in hours, not weeks, while automatically complying with company standards.
Why Platform Engineering Emerged
The cloud-native ecosystem solved infrastructure scalability but created a new problem: the number of tools, concepts, and decisions required to deploy a service grew dramatically. A developer shipping a feature now needed to understand Kubernetes YAML, Helm charts, Prometheus scrape configs, Grafana dashboards, Vault secrets paths, network policies, and CI pipeline YAML — all before the first production deployment.
Before platform engineering (developer cognitive load):
┌──────────────────────────────────────────────────────────┐
│ Write code → Write Dockerfile → Write Helm chart →       │
│ Configure CI → Configure CD → Set up monitoring →        │
│ Create alerts → Request secrets → Configure NetworkPolicy│
│ → Write runbook → Set resource limits → ...              │
│ ~40 tasks                                                │
└──────────────────────────────────────────────────────────┘
After platform engineering (developer cognitive load):
┌─────────────────────────────────────────────┐
│ Write code → Push to git                    │
│ → Platform handles everything else          │
│ ~2 tasks                                    │
└─────────────────────────────────────────────┘
(Security, observability, networking pre-wired by platform)
2. Platform Engineering vs DevOps vs SRE
| Dimension | DevOps | SRE | Platform Engineering |
|---|---|---|---|
| Primary goal | Break dev/ops silos; continuous delivery | Reliability of production services (SLOs, error budgets) | Developer productivity; reduce cognitive load |
| Core output | Cultural practices, CI/CD pipelines | SLO frameworks, incident response, automation to replace toil | Internal Developer Platform (IDP) |
| Who are the customers? | The organisation (cultural transformation) | End users (reliability) and operators (toil reduction) | Product/feature teams (developers) |
| Key metric | DORA metrics (deployment freq, lead time, MTTR, change failure rate) | SLO compliance, error budget burn rate, toil % | Time to first deployment, developer NPS, self-service ratio |
| Infrastructure ownership | Shared (everyone owns their pipeline) | Shared (product teams own SLOs, SRE owns platform reliability) | Platform team owns the platform; product teams own their apps |
| Relationship to Kubernetes | Kubernetes is part of the CD pipeline | Kubernetes is a production environment to make reliable | Kubernetes is the infrastructure substrate the platform abstracts |
A mature engineering organisation practises all three: DevOps culture (fast feedback loops), SRE principles (error budgets, toil reduction), and platform engineering (IDP to scale developer autonomy). The platform team's work enables both DevOps and SRE practices at scale.
3. IDP Capability Map
An Internal Developer Platform is not a single product — it is a composition of capabilities across five planes:
┌─────────────────────────────────────────────────────────────────┐
│ Developer Experience Plane │
│ Backstage portal / CLI tools / VS Code extensions / Docs │
│ Service catalog / Templates / Scorecards / TechDocs │
├─────────────────────────────────────────────────────────────────┤
│ Application Plane │
│ Golden path templates / Helm charts / Kustomize overlays │
│ CI pipelines (Tekton/GitHub Actions) / CD (Argo CD/Flux) │
│ Feature flags / Release automation / Preview environments │
├─────────────────────────────────────────────────────────────────┤
│ Security & Policy Plane │
│ OPA/Gatekeeper policies / Kyverno / Pod Security Standards │
│ Secrets management (Vault/ESO) / SBOM / Image signing │
│ Network policies / RBAC templates / Admission webhooks │
├─────────────────────────────────────────────────────────────────┤
│ Infrastructure Plane │
│ Cluster provisioning (Cluster API / EKS Blueprints / Crossplane)│
│ Cloud resources (Crossplane / ACK / Terraform / Pulumi) │
│ DNS / TLS (cert-manager / external-dns) / Load balancers │
├─────────────────────────────────────────────────────────────────┤
│ Observability Plane │
│ Metrics (Prometheus/Thanos) / Logs (Loki) / Traces (Tempo) │
│ Dashboards (Grafana) / Alerting (Alertmanager) / Profiling │
│ Events / Runbooks / On-call rotation (PagerDuty/Opsgenie) │
└─────────────────────────────────────────────────────────────────┘
Capability Ownership Model
| Capability | Platform Team Provides | Product Team Configures |
|---|---|---|
| Kubernetes clusters | Provisioning, upgrades, node pools, add-ons | Namespace resource quotas, workload specs |
| CI/CD pipelines | Pipeline templates, shared tasks, image registry | Pipeline triggers, test commands, deploy targets |
| Observability | Prometheus stack, Loki, Tempo, Grafana, Alertmanager | ServiceMonitors, dashboards, alert rules, SLOs |
| Secrets | Vault clusters, ESO install, access policies | SecretStore references, ExternalSecret paths |
| Networking | Ingress controllers, cert-manager, external-dns, service mesh | Ingress resources, certificates, network policies |
| Policy | Gatekeeper/Kyverno install, baseline policies | Namespace-scoped exceptions (via PR review) |
| Cloud resources | Crossplane providers, Compositions, XRDs | Claim YAML files (e.g., PostgreSQL, S3 bucket) |
| Developer portal | Backstage instance, plugins, RBAC | catalog-info.yaml, TechDocs, templates |
4. Platform as a Product
The most important mindset shift in platform engineering: treat the internal developer platform as a product with users (developers), a roadmap, feedback loops, and success metrics — not as an IT service delivered on a project basis.
Understand your users
Run developer surveys (quarterly NPS), shadow sessions (watch a dev onboard), and track support ticket patterns. Build what reduces the most common pain, not what is technically interesting.
user research
Treat abstractions as contracts
When you publish a golden-path Helm chart or Crossplane Composition, that is a public API. Breaking changes require versioning and migration paths, just like a REST API.
API design
Measure developer productivity
Track DORA metrics (deployment frequency, lead time), time to first deploy for new services, and self-service ratio (platform-provisioned vs ticket-provisioned resources).
DORA
Maintain a public roadmap
Publish what the platform team is building and when. Invite RFCs for major changes. Developers who understand the roadmap stop building workarounds.
transparency
Provide SLOs for the platform itself
If the platform's CI system has 99.5% availability and 10-minute P99 build time, publish that. Platform reliability directly multiplies or divides developer throughput.
platform SLOs
Avoid forced adoption
Mandate security guardrails (policy enforcement). Make everything else opt-in by being better than the alternative. Forced tools without value create shadow IT.
adoption
5. Golden Paths & Paved Roads
A golden path is a well-lit, well-maintained route from code to production that embeds all organisational standards by default. Developers can deviate from it, but the golden path should be the easiest option — not a constraint.
What a Golden Path Typically Includes
Developer creates a new service from the go-microservice template (for example, through the Backstage software templates UI)
Result: a Git repository pre-configured with:
├── src/ # Application code skeleton
├── Dockerfile # Multi-stage, distroless base, non-root user
├── helm/ # Helm chart (resources/limits/probes/PDB/HPA)
│ └── values.yaml # Sane defaults (replica:2, readinessProbe, etc.)
├── .github/workflows/ # CI: lint → test → build → sign → push
│ ├── ci.yaml # (or .tekton/pipeline.yaml for in-cluster CI)
│ └── cd.yaml # ArgoCD app-of-apps update on merge to main
├── config/ # Kustomize overlays: dev/staging/prod
├── catalog-info.yaml # Backstage service catalog registration
├── docs/ # TechDocs skeleton (mkdocs.yml + index.md)
├── .policy/ # Pre-approved exceptions file (if needed)
└── monitoring/
├── servicemonitor.yaml # Pre-wired to platform Prometheus
├── dashboard.json # Grafana dashboard (RED metrics)
└── alerts.yaml # PrometheusRule with basic SLO alerts
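As a rough illustration of what such a template produces, the sketch below lays the same skeleton down on disk. It is a stand-in for a real scaffolder, which would render files from a template repository; the `scaffold_service` function and the placeholder file contents are assumptions made for this example only.

```python
from pathlib import Path

# Files from the golden-path layout above; contents are placeholders only.
GOLDEN_PATH_FILES = [
    "src/.gitkeep",
    "Dockerfile",
    "helm/values.yaml",
    ".github/workflows/ci.yaml",
    ".github/workflows/cd.yaml",
    "config/.gitkeep",
    "catalog-info.yaml",
    "docs/mkdocs.yml",
    "docs/index.md",
    ".policy/.gitkeep",
    "monitoring/servicemonitor.yaml",
    "monitoring/dashboard.json",
    "monitoring/alerts.yaml",
]


def scaffold_service(root: Path, service_name: str) -> list[Path]:
    """Create the golden-path skeleton for `service_name` under `root`."""
    repo = root / service_name
    created = []
    for relative in GOLDEN_PATH_FILES:
        path = repo / relative
        path.parent.mkdir(parents=True, exist_ok=True)
        # A real template would render meaningful defaults here.
        path.write_text(f"# placeholder for {service_name}: {relative}\n")
        created.append(path)
    return created
```

Calling `scaffold_service(Path("/tmp/demo"), "payments")` would create `/tmp/demo/payments/` with every file in the list, which is enough to see how much a developer gets without writing any of it by hand.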
Golden Path vs Escape Hatch
| Situation | Recommendation |
|---|---|
| Standard HTTP microservice (Go, Java, Python, Node.js) | Use golden path template — covers 90% of services |
| ML training job (GPU, large memory, spot instances) | Specialised template; golden path extended for batch/GPU workloads |
| Stateful service needing custom storage topology | Escape hatch: custom Helm chart, but must pass Gatekeeper policies |
| Legacy service not yet containerised | Migration path: platform provides lift-and-shift guide + VM-based workload support |
| Security-sensitive service (handles PII/PCI) | Hardened golden path variant with additional network policies, encryption, audit logging |
Teams that deviate from the golden path are not exempt from Gatekeeper/Kyverno policies. The golden path builds compliance in by default; escape hatches must achieve compliance explicitly. Admission policies (OPA Gatekeeper or Kyverno) are the non-negotiable layer; golden paths are the convenience layer on top.
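The point that policy applies regardless of how a manifest was produced can be sketched in a few lines. The example below is not how Gatekeeper or Kyverno evaluate policies; it is a plain-Python stand-in, and the `Workload` fields, the registry prefix, and the three rules are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical approved registry; a real platform would encode this rule as a
# Gatekeeper constraint or Kyverno policy evaluated at admission time.
APPROVED_REGISTRY = "registry.example.internal/"


@dataclass
class Workload:
    name: str
    image: str
    run_as_non_root: bool = False
    cpu_limit: Optional[str] = None
    memory_limit: Optional[str] = None
    from_golden_path: bool = False  # recorded, but never consulted by the checks


def violations(workload: Workload) -> list[str]:
    """Return baseline policy violations; the manifest's origin is ignored."""
    found = []
    if not workload.image.startswith(APPROVED_REGISTRY):
        found.append("image must come from the approved registry")
    if not workload.run_as_non_root:
        found.append("container must run as non-root")
    if workload.cpu_limit is None or workload.memory_limit is None:
        found.append("CPU and memory limits are required")
    return found
```

A workload generated from the golden path passes because the template already sets these fields. A hand-written escape-hatch manifest goes through the same function and has to set them itself.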
6. Tooling Landscape
Platform Engineering Tool Categories
| Category | Open Source / CNCF | Commercial / Managed |
|---|---|---|
| Developer portal | Backstage (CNCF) | Port, Cortex, OpsLevel, Rely.io |
| Cluster provisioning | Cluster API, kOps, k3s | EKS Blueprints (CDK/Terraform), GKE Autopilot, AKS |
| Infrastructure as Code (cloud resources) | Crossplane (CNCF), ACK, Terraform, Pulumi | AWS CDK, Terraform Cloud, Pulumi Cloud |
| GitOps / CD | Argo CD (CNCF), Flux (CNCF), Spinnaker (CD Foundation) | Harness |
| CI pipelines | Tekton (CD Foundation), Dagger, Woodpecker, Jenkins | GitHub Actions, GitLab CI, CircleCI |
| Policy enforcement | OPA Gatekeeper (CNCF), Kyverno (CNCF) | Styra DAS, Nirmata |
| Secrets management | Vault (HashiCorp/BSL), External Secrets Operator | AWS Secrets Manager, GCP Secret Manager, Azure Key Vault |
| Service mesh | Istio (CNCF), Linkerd (CNCF), Cilium | AWS App Mesh, Google Traffic Director |
| Image registry | Harbor (CNCF), Zot | ECR, GAR, ACR, Docker Hub, JFrog Artifactory |
| Image signing | Cosign / Sigstore (CNCF), Notary v2 | AWS Signer |
| Network / DNS / TLS | cert-manager (CNCF), external-dns, MetalLB | AWS ACM, Cloudflare, Route53 |
| Cost management | OpenCost (CNCF), Kubecost | AWS Cost Explorer, CloudHealth, Apptio |
| Preview environments | vCluster, Argo CD ApplicationSets | Environments by Render, Railway, Architect |
| Feature flags | OpenFeature (CNCF), Flagd | LaunchDarkly, Statsig, Split |
CNCF Landscape Positioning
CNCF Platform Engineering Relevant Projects (2024):
Graduated: Argo, Flux, OPA, Harbor, cert-manager (graduated late 2024)
Incubating: Backstage, Crossplane, Kyverno, KubeVela, OpenFeature, OpenCost
Kubernetes SIG subprojects: Cluster API, external-dns
Sandbox: Headlamp, Kratix
Key patterns these tools address:
Crossplane → Kubernetes-native IaC (cloud resources as CRDs)
Cluster API → Kubernetes-native cluster lifecycle management
Backstage → Developer portal and service catalog
Argo CD → GitOps continuous delivery
Flux → GitOps with Helm/Kustomize reconciliation
Kyverno → Policy-as-code without Rego
OPA → Policy engine (Gatekeeper for K8s admission)
cert-manager → Automated TLS certificate lifecycle
external-dns → DNS record sync from Kubernetes Ingress/Service
7. Reference Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ Developer Workflow │
│ │
│ git push → CI Pipeline (GitHub Actions / Tekton) │
│ │ lint / test / build / sign / push image │
│ │ update Helm values / Kustomize image tag │
│ ▼ │
│ GitOps Repo (argocd-apps / fleet) │
│ │ │
│ ▼ │
│ Argo CD / Flux ←──────── Drift detection │
│ │ reconcile desired state from Git │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Kubernetes Cluster │ │
│ │ │ │
│ │ ┌─────────────────────┐ ┌──────────────────────────┐ │ │
│ │ │ Admission Control │ │ Namespace (team-A) │ │ │
│ │ │ Gatekeeper/Kyverno │ │ workloads / services │ │ │
│ │ │ OPA policies │ │ ConfigMaps / secrets │ │ │
│ │ └─────────────────────┘ └──────────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────┐ ┌──────────────────────────┐ │ │
│ │ │ Networking │ │ Observability │ │ │
│ │ │ Ingress NGINX │ │ Prometheus / Loki │ │ │
│ │ │ cert-manager │ │ Tempo / Grafana │ │ │
│ │ │ external-dns │ │ Alertmanager │ │ │
│ │ │ Cilium / Istio │ │ Pyroscope │ │ │
│ │ └─────────────────────┘ └──────────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────┐ ┌──────────────────────────┐ │ │
│ │ │ Secrets │ │ Infrastructure │ │ │
│ │ │ Vault + ESO │ │ Crossplane providers │ │ │
│ │ │ Secret rotation │ │ RDS / S3 / SQS claims │ │ │
│ │ └─────────────────────┘ └──────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Backstage Developer Portal │ │
│ │ Service catalog / Templates / TechDocs / Scorecards │ │
│ │ Kubernetes plugin / Argo CD plugin / Cost plugin │ │
│ └──────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Multi-Cluster Platform Architecture
Management Cluster (platform-managed)     Workload Clusters
┌───────────────────────────────┐         ┌──────────────────┐
│ Argo CD / Flux                │────────▶│ cluster: dev     │
│ Crossplane control plane      │         │ (team namespaces │
│ Vault                         │────────▶│  + policies)     │
│ Backstage                     │         └──────────────────┘
│ Harbor (image registry)       │         ┌──────────────────┐
│ Policy validation webhook     │────────▶│ cluster: staging │
│ Platform observability stack  │         └──────────────────┘
└───────────────┬───────────────┘         ┌──────────────────┐
                ├────────────────────────▶│ cluster: prod-1  │
                │                         └──────────────────┘
                │                         ┌──────────────────┐
                └────────────────────────▶│ cluster: prod-2  │
                                          │ (regional)       │
                                          └──────────────────┘
Fleet management: Cluster API (provisioning) + Argo CD ApplicationSets
(deploy platform add-ons uniformly across all clusters)
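Deploying add-ons uniformly amounts to generating one application per (cluster, add-on) pair, which is similar in spirit to what an Argo CD ApplicationSet does with its cluster generator. The sketch below shows only that fan-out; the add-on list and the record shape are assumptions for illustration, not Argo CD's data model.

```python
from itertools import product

CLUSTERS = ["dev", "staging", "prod-1", "prod-2"]
# Illustrative add-on list, drawn from components named in this overview.
ADDONS = ["cert-manager", "external-dns", "gatekeeper"]


def fleet_applications(clusters, addons):
    """Yield one application record per (cluster, add-on) pair."""
    for cluster, addon in product(clusters, addons):
        yield {"name": f"{addon}-{cluster}", "cluster": cluster, "addon": addon}
```

Adding a new cluster to `CLUSTERS` yields the full add-on set for it with no per-cluster configuration, which is the property fleet management is after.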
8. Platform Maturity Model
Level 1: Provisional
- Manual cluster provisioning
- Ad-hoc CI/CD per team
- No shared observability
- Secrets in environment variables
- No policy enforcement
- Deployments require ops tickets
Level 2: Operationalized
- Cluster provisioning scripted (Terraform)
- Shared CI templates (GitHub Actions)
- Basic Prometheus + Grafana
- Vault for secrets
- Some RBAC standards
- Manual CD (kubectl apply)
Level 3: Scalable
- Cluster API / EKS Blueprints
- GitOps (Argo CD / Flux)
- Full LGTM observability stack
- External Secrets Operator
- Gatekeeper policies enforced
- Golden path templates
- Self-service namespaces
Level 4: Optimising
- Full IDP with Backstage portal
- Crossplane cloud resources
- Preview environments
- Multi-cluster fleet management
- Cost attribution per team
- SLO-driven deployments
- Platform SLOs published
- Developer NPS tracked
Organisations that jump from Level 1 directly to Backstage + Crossplane + Argo CD typically fail to adopt any of it because the prerequisite operational practices are absent. Build Level 2 (reliable deployments, working observability) before automating Level 3 (GitOps, policy). Add the portal (Level 4) last — it has no value if the underlying platform is unreliable.
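The "no skipping levels" rule can be made concrete with a small check: a level only counts if every level below it is also satisfied. The capability names and the choice of which bullets are "required" per level are assumptions made for this sketch.

```python
# Capability names are shorthand for bullets in the maturity model; which
# ones count as required per level is an assumption made for this example.
LEVEL_REQUIREMENTS = {
    2: {"scripted_provisioning", "shared_ci_templates", "basic_observability"},
    3: {"gitops", "policy_enforcement", "golden_paths"},
    4: {"developer_portal", "cost_attribution", "platform_slos"},
}


def maturity_level(capabilities: set[str]) -> int:
    """Highest level whose requirements, and all lower levels', are met."""
    level = 1
    for candidate in sorted(LEVEL_REQUIREMENTS):
        if LEVEL_REQUIREMENTS[candidate] <= capabilities:
            level = candidate
        else:
            break  # a gap at this level blocks credit for higher ones
    return level
```

An organisation that has a portal, cost attribution, and published SLOs but no scripted provisioning or shared CI still evaluates to Level 1 here, which mirrors the failure mode described above.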
9. Team Topologies
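Matthew Skelton and Manuel Pais's Team Topologies framework defines four team types: stream-aligned, enabling, complicated-subsystem, and platform. The cards below cover the types most relevant to a platform organisation, plus the SRE team, which is not a Team Topologies type but commonly works alongside the platform team:
Platform Team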
Owns the IDP. Treats product teams as customers. Measures developer productivity. Works in 6–12-week cycles. Publishes roadmap and SLOs.
platform
Stream-Aligned Team
Product/feature teams that own end-to-end delivery of a product stream. They use the platform — they do not manage infrastructure, CI tooling, or observability backends.
consumer
Complicated Subsystem Team
Owns deeply specialised components (ML platform, data pipelines, real-time event bus). Interfaces with platform team for infrastructure; interfaces with stream-aligned teams as a service provider.
specialist
SRE / Reliability Team
Owns availability targets, incident response, and toil automation. Embedded in platform team or parallel. Uses the IDP to deploy reliability tooling (chaos engineering, synthetic monitors, runbooks).
reliability
Platform Team Sizing
| Organisation Size | Platform Team Size | Focus |
|---|---|---|
| Startup (<30 engineers) | 1–2 (infra-aware engineers) | Basic CI/CD + managed K8s (EKS/GKE) |
| Scale-up (30–150 engineers) | 3–6 | GitOps + observability + policy + Vault |
| Mid-size (150–500 engineers) | 6–12 | Full IDP + Backstage + Crossplane + cost |
| Enterprise (>500 engineers) | 12–30+ | Multi-cluster fleet + compliance + portals |
As a rule of thumb, each platform engineer should support 10–15 stream-aligned engineers. A ratio below 8:1 suggests the platform team is over-staffed relative to demand; above 20:1, developer experience degrades (slow onboarding, long waits for platform changes).
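The rule of thumb translates directly into a small check. The thresholds come from the paragraph above; the "borderline" label for the 8–10 and 15–20 bands is an assumption, since the text does not name them.

```python
def staffing_assessment(stream_aligned_engineers: int, platform_engineers: int) -> str:
    """Classify the developer-to-platform-engineer ratio."""
    if platform_engineers <= 0:
        raise ValueError("platform_engineers must be positive")
    ratio = stream_aligned_engineers / platform_engineers
    if ratio < 8:
        return "over-staffed"
    if ratio > 20:
        return "under-staffed"
    if 10 <= ratio <= 15:
        return "target"
    return "borderline"  # 8-10 or 15-20: not labelled by the rule of thumb
```

For example, 120 stream-aligned engineers supported by 10 platform engineers is a 12:1 ratio and falls in the target band.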
10. Developer Experience Metrics
Measuring platform value requires metrics that reflect the developer experience, not just infrastructure uptime.
DORA Metrics (DevOps Research and Assessment)
| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deployment frequency | Multiple per day | Daily to weekly | Weekly to monthly | Monthly or slower |
| Lead time for changes | < 1 hour | 1 day – 1 week | 1 week – 1 month | > 1 month |
| Change failure rate | < 5% | 5–10% | 10–15% | > 15% |
| MTTR (mean time to restore) | < 1 hour | < 1 day | 1 day – 1 week | > 1 week |
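A minimal sketch of applying two rows of the table to measured values. Two assumptions are made here: the table leaves lead times between one hour and one day unassigned, and this sketch places them in "high"; a month is treated as 30 days.

```python
from datetime import timedelta


def classify_lead_time(lead_time: timedelta) -> str:
    """Map lead time for changes to a performance tier from the table above."""
    if lead_time < timedelta(hours=1):
        return "elite"
    if lead_time <= timedelta(weeks=1):
        return "high"  # includes the 1 hour to 1 day gap (assumption)
    if lead_time <= timedelta(days=30):
        return "medium"
    return "low"


def classify_change_failure_rate(failed: int, total: int) -> str:
    """Map change failure rate (failed / total deployments) to a tier."""
    if total <= 0:
        raise ValueError("total deployments must be positive")
    rate = failed / total
    if rate < 0.05:
        return "elite"
    if rate <= 0.10:
        return "high"
    if rate <= 0.15:
        return "medium"
    return "low"
```

A team with a six-hour lead time and 8 failed changes out of 100 would land in "high" on both measures under these assumptions.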
Platform-Specific Metrics
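# PromQL sketches for tracking platform health
# (some overlap with DORA, some are platform-specific).
# Metric names vary by tool version; verify each against what your
# Argo CD, Crossplane, and Tekton installations actually export.
# Time to first production deployment for new services
# Intended measure: time from ApplicationSet creation to Healthy.
# The sync-duration histogram below is only a proxy, and its name is taken
# as given here rather than confirmed as a built-in Argo CD metric.
histogram_quantile(0.90,
  sum by (le) (rate(argocd_app_sync_total_duration_seconds_bucket[7d]))
)
# Self-service ratio: platform-provisioned vs ticket-provisioned resources
# Track: number of Crossplane claims created vs infra tickets.
# Both series names are placeholders: the Crossplane counter is unconfirmed,
# and jira_infra_tickets_30d stands for a gauge exported from the ticketing
# system. sum() on each side removes labels so the vectors can be combined.
sum(increase(crossplane_managed_resource_ready_total[30d]))
/ (sum(increase(crossplane_managed_resource_ready_total[30d])) + sum(jira_infra_tickets_30d))
# CI pipeline P90 duration (developer wait time)
# Tekton's controller exports this histogram with a
# tekton_pipelines_controller_ prefix; check the name in your version.
histogram_quantile(0.90,
  sum by (le) (rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket{status="success"}[7d]))
)
# Platform availability (GitOps reconciliation loop health)
# Aggregate both sides with sum(); without it the division only matches
# series with identical labels and the ratio is always 1.
1 - (
  sum(rate(argocd_app_sync_total{phase="Error"}[24h]))
  / sum(rate(argocd_app_sync_total[24h]))
)
# Developer NPS: tracked via quarterly survey (not Prometheus)
# Target: NPS > 30 (platform is net positive to developers)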
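The two metrics that do not come straight out of Prometheus are simple arithmetic. The sketch below uses the standard NPS definition (share of 9–10 scores minus share of 0–6 scores on a 0–10 scale) and the self-service ratio as defined above; the function names and inputs are assumptions for illustration.

```python
def net_promoter_score(scores: list[int]) -> float:
    """NPS = % promoters (9-10) minus % detractors (0-6) on a 0-10 scale."""
    if not scores:
        raise ValueError("at least one survey response is required")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)


def self_service_ratio(self_service_requests: int, ticket_requests: int) -> float:
    """Share of resources provisioned through the platform rather than tickets."""
    total = self_service_requests + ticket_requests
    if total == 0:
        raise ValueError("no provisioning requests recorded")
    return self_service_requests / total
```

Ten survey responses with five promoters and three detractors give an NPS of 20, which is below the target of 30 stated above; 90 self-service claims against 10 tickets gives a self-service ratio of 0.9.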
SPACE Framework (GitHub and Microsoft Research)
| Dimension | Metric Examples |
|---|---|
| Satisfaction and well-being | Developer NPS, survey scores, onboarding satisfaction |
| Performance | Code review cycle time, deployment success rate, incident MTTR |
| Activity | Deployments per week, PRs merged, features shipped |
| Communication and collaboration | Documentation quality score, knowledge-sharing sessions |
| Efficiency and flow | Time waiting for CI, on-call interruption rate, toil % |
11. Section Guide
The Platform Engineering section covers 10 detailed topics, each of which builds on this overview.
Coverage Checklist
- Platform engineering definition and goal (reduce developer cognitive load)
- Before/after cognitive load diagram (40 tasks → 2 tasks)
- Platform engineering vs DevOps vs SRE comparison table
- IDP five-plane capability map diagram (developer experience/application/security/infrastructure/observability)
- Capability ownership table (platform provides vs product team configures)
- Platform as a product: 6 principle cards (user research/API contracts/DORA/roadmap/platform SLOs/opt-in adoption)
- Golden path anatomy: template → repo structure with all pre-configured files
- Golden path vs escape hatch decision table (5 scenarios)
- Escape hatch must still pass policy — callout
- Platform engineering tool landscape table (14 categories, CNCF and commercial options)
- CNCF graduated/incubating/sandbox projects reference
- Reference architecture diagram: CI → GitOps repo → Argo CD → K8s cluster (all planes)
- Multi-cluster platform architecture (management cluster + workload clusters, fleet management)
- Platform maturity model: 4 levels (Provisional/Operationalized/Scalable/Optimising) with characteristics
- Don't skip levels callout (portal before reliable platform fails)
- Team Topologies: platform/stream-aligned/complicated-subsystem/SRE team cards
- Platform team sizing table (startup to enterprise with ratios)
- Ratio rule of thumb callout (10–15 developers per platform engineer)
- DORA metrics table with elite/high/medium/low thresholds
- Platform-specific PromQL metrics (time-to-deploy, self-service ratio, CI P90, GitOps availability)
- SPACE framework (Satisfaction/Performance/Activity/Communication/Efficiency)
- Section guide with links to all 10 detail pages and descriptions