Platform Engineering Overview

Contents

What Is Platform Engineering?
Platform Engineering vs DevOps vs SRE
IDP Capability Map
Platform as a Product
Golden Paths & Paved Roads
Tooling Landscape
Reference Architecture
Platform Maturity Model
Team Topologies
Developer Experience Metrics
Section Guide

1. What Is Platform Engineering?

Platform engineering is the discipline of designing and building self-service toolchains and workflows — collectively called an Internal Developer Platform (IDP) — that enable product teams to deploy, run, and observe their software in production without requiring deep infrastructure expertise for every task.

The goal is not to build the most technically sophisticated infrastructure. It is to reduce cognitive load on application developers while maintaining the guardrails (security, cost control, compliance) that the organisation requires.

Platform engineering vs "just a Kubernetes cluster"

A Kubernetes cluster is infrastructure. A platform wraps that cluster with golden-path workflows, self-service provisioning, pre-integrated observability, policy enforcement, and a developer portal — so that a new team can go from zero to production in hours, not weeks, while automatically complying with company standards.

Why Platform Engineering Emerged

The cloud-native ecosystem solved infrastructure scalability but created a new problem: the number of tools, concepts, and decisions required to deploy a service grew dramatically. A developer shipping a feature now needed to understand Kubernetes YAML, Helm charts, Prometheus scrape configs, Grafana dashboards, Vault secrets paths, network policies, and CI pipeline YAML — all before the first production deployment.

Before platform engineering (developer cognitive load):
  ┌──────────────────────────────────────────────────────────┐
  │ Write code → Write Dockerfile → Write Helm chart →       │
  │ Configure CI → Configure CD → Set up monitoring →        │
  │ Create alerts → Request secrets → Configure NetworkPolicy│
  │ → Write runbook → Set resource limits → ...              │
  │                                              ~40 tasks   │
  └──────────────────────────────────────────────────────────┘

After platform engineering (developer cognitive load):
  ┌─────────────────────────────────────────────┐
  │ Write code → Push to git                    │
  │  → Platform handles everything else         │
  │                                   ~2 tasks  │
  └─────────────────────────────────────────────┘
  (Security, observability, networking pre-wired by platform)

2. Platform Engineering vs DevOps vs SRE

Dimension	DevOps	SRE	Platform Engineering
Primary goal	Break dev/ops silos; continuous delivery	Reliability of production services (SLOs, error budgets)	Developer productivity; reduce cognitive load
Core output	Cultural practices, CI/CD pipelines	SLO frameworks, incident response, automation to replace toil	Internal Developer Platform (IDP)
Who are the customers?	The organisation (cultural transformation)	End users (reliability) and operators (toil reduction)	Product/feature teams (developers)
Key metric	DORA metrics (deployment freq, lead time, MTTR, change failure rate)	SLO compliance, error budget burn rate, toil %	Time to first deployment, developer NPS, self-service ratio
Infrastructure ownership	Shared (everyone owns their pipeline)	Shared (product teams own SLOs, SRE owns platform reliability)	Platform team owns the platform; product teams own their apps
Relationship to Kubernetes	Kubernetes is part of the CD pipeline	Kubernetes is a production environment to make reliable	Kubernetes is the infrastructure substrate the platform abstracts

These roles are complementary, not competing

A mature engineering organisation practises all three: DevOps culture (fast feedback loops), SRE principles (error budgets, toil reduction), and platform engineering (IDP to scale developer autonomy). The platform team's work enables both DevOps and SRE practices at scale.

3. IDP Capability Map

An Internal Developer Platform is not a single product — it is a composition of capabilities across five planes:

┌─────────────────────────────────────────────────────────────────┐
│                  Developer Experience Plane                      │
│  Backstage portal / CLI tools / VS Code extensions / Docs       │
│  Service catalog / Templates / Scorecards / TechDocs            │
├─────────────────────────────────────────────────────────────────┤
│                   Application Plane                              │
│  Golden path templates / Helm charts / Kustomize overlays       │
│  CI pipelines (Tekton/GitHub Actions) / CD (Argo CD/Flux)       │
│  Feature flags / Release automation / Preview environments       │
├─────────────────────────────────────────────────────────────────┤
│                   Security & Policy Plane                        │
│  OPA/Gatekeeper policies / Kyverno / Pod Security Standards     │
│  Secrets management (Vault/ESO) / SBOM / Image signing          │
│  Network policies / RBAC templates / Admission webhooks         │
├─────────────────────────────────────────────────────────────────┤
│                   Infrastructure Plane                           │
│  Clusters (Cluster API / EKS Blueprints / Crossplane)           │
│  Cloud resources (Crossplane / ACK / Terraform / Pulumi)        │
│  DNS / TLS (cert-manager / external-dns) / Load balancers       │
├─────────────────────────────────────────────────────────────────┤
│                   Observability Plane                            │
│  Metrics (Prometheus/Thanos) / Logs (Loki) / Traces (Tempo)     │
│  Dashboards (Grafana) / Alerting (Alertmanager) / Profiling      │
│  Events / Runbooks / On-call rotation (PagerDuty/Opsgenie)      │
└─────────────────────────────────────────────────────────────────┘

Capability Ownership Model

Capability	Platform Team Provides	Product Team Configures
Kubernetes clusters	Provisioning, upgrades, node pools, add-ons	Namespace resource quotas, workload specs
CI/CD pipelines	Pipeline templates, shared tasks, image registry	Pipeline triggers, test commands, deploy targets
Observability	Prometheus stack, Loki, Tempo, Grafana, Alertmanager	ServiceMonitors, dashboards, alert rules, SLOs
Secrets	Vault clusters, ESO install, access policies	SecretStore references, ExternalSecret paths
Networking	Ingress controllers, cert-manager, external-dns, service mesh	Ingress resources, certificates, network policies
Policy	Gatekeeper/Kyverno install, baseline policies	Namespace-scoped exceptions (via PR review)
Cloud resources	Crossplane providers, Compositions, XRDs	Claim YAML files (e.g., PostgreSQL, S3 bucket)
Developer portal	Backstage instance, plugins, RBAC	catalog-info.yaml, TechDocs, templates

4. Platform as a Product

The most important mindset shift in platform engineering: treat the internal developer platform as a product with users (developers), a roadmap, feedback loops, and success metrics — not as an IT service delivered on a project basis.

Understand your users

Run developer surveys (quarterly NPS), shadow sessions (watch a dev onboard), and track support ticket patterns. Build what reduces the most common pain, not what is technically interesting.

user research

Treat abstractions as contracts

When you publish a golden-path Helm chart or Crossplane Composition, that is a public API. Breaking changes require versioning and migration paths, just like a REST API.

API design
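One way to make the contract idea concrete is to derive the required version bump from the change in a published interface. The sketch below is illustrative only: `required_bump` and the `{field: type}` description of a chart's values are hypothetical, and it assumes every newly added field ships with a default (an added field without a default would also be a breaking change).

```python
def required_bump(old: dict[str, str], new: dict[str, str]) -> str:
    """Compare two {field: type} interface descriptions and return a semver bump.

    Removing a field or changing its type breaks consumers ("major").
    Adding a field is treated as backwards compatible ("minor"),
    assuming the new field has a default. No interface change is a "patch".
    """
    removed = old.keys() - new.keys()
    retyped = {k for k in old.keys() & new.keys() if old[k] != new[k]}
    if removed or retyped:
        return "major"
    if new.keys() - old.keys():
        return "minor"
    return "patch"
```

For example, going from `{"replicas": "int"}` to `{"replicas": "int", "hpa": "bool"}` is a minor bump, while dropping `replicas` is a major one and calls for a migration path.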

Measure developer productivity

Track DORA metrics (deployment frequency, lead time), time to first deploy for new services, and the self-service ratio (platform-provisioned vs ticket-provisioned resources).

DORA
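The self-service ratio is simple arithmetic over two counts. A minimal sketch, assuming the counts are exported from whatever systems record platform-provisioned resources and infrastructure tickets; the function name and the sample numbers are hypothetical.

```python
def self_service_ratio(platform_provisioned: int, ticket_provisioned: int) -> float:
    """Share of resources provisioned through the platform rather than via tickets."""
    if platform_provisioned < 0 or ticket_provisioned < 0:
        raise ValueError("counts must be non-negative")
    total = platform_provisioned + ticket_provisioned
    # With no provisioning activity the ratio is undefined; report 0.0 by convention.
    if total == 0:
        return 0.0
    return platform_provisioned / total
```

With 180 self-service resources and 20 tickets in a month, `self_service_ratio(180, 20)` gives 0.9, meaning nine in ten requests never needed a human in the loop.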

Maintain a public roadmap

Publish what the platform team is building and when. Invite RFCs for major changes. Developers who understand the roadmap stop building workarounds.

transparency

Provide SLOs for the platform itself

If the platform's CI system has 99.5% availability and 10-minute P99 build time, publish that. Platform reliability directly multiplies or divides developer throughput.

platform SLOs
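A published availability target translates directly into an error budget. The sketch below shows the arithmetic under the assumption of a 30-day rolling window; the function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO in (0, 1]."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo, window_days)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return 1.0 - downtime_minutes / budget
```

For the 99.5% CI availability mentioned above, the budget is about 216 minutes per 30 days (0.005 × 43,200). After 54 minutes of CI downtime, roughly three quarters of the budget remain.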

Avoid forced adoption

Mandate security guardrails (policy enforcement). Make everything else opt-in by being better than the alternative. Forced tools without value create shadow IT.

adoption

5. Golden Paths & Paved Roads

A golden path is a well-lit, well-maintained route from code to production that embeds all organisational standards by default. Developers can deviate from it, but the golden path should be the easiest option — not a constraint.

What a Golden Path Typically Includes

Developer runs the go-microservice Software Template from the Backstage portal (Scaffolder)

Result: a Git repository pre-configured with:
├── src/                          # Application code skeleton
├── Dockerfile                    # Multi-stage, distroless base, non-root user
├── helm/                         # Helm chart (resources/limits/probes/PDB/HPA)
│   └── values.yaml               # Sane defaults (replica:2, readinessProbe, etc.)
├── .github/workflows/            # CI: lint → test → build → sign → push
│   ├── ci.yaml                   # (or .tekton/pipeline.yaml for in-cluster CI)
│   └── cd.yaml                   # ArgoCD app-of-apps update on merge to main
├── config/                       # Kustomize overlays: dev/staging/prod
├── catalog-info.yaml             # Backstage service catalog registration
├── docs/                         # TechDocs skeleton (mkdocs.yml + index.md)
├── .policy/                      # Pre-approved exceptions file (if needed)
└── monitoring/
    ├── servicemonitor.yaml       # Pre-wired to platform Prometheus
    ├── dashboard.json            # Grafana dashboard (RED metrics)
    └── alerts.yaml               # PrometheusRule with basic SLO alerts
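A scorecard-style check can confirm that a repository still carries the golden-path files. This is a minimal sketch assuming the layout shown above; `EXPECTED_PATHS` and `missing_golden_path_files` are hypothetical names, and the list covers only a subset of the tree.

```python
from pathlib import Path

# Subset of the golden-path layout; extend to match your own template.
EXPECTED_PATHS = [
    "Dockerfile",
    "helm/values.yaml",
    "catalog-info.yaml",
    "monitoring/servicemonitor.yaml",
    "monitoring/dashboard.json",
    "monitoring/alerts.yaml",
]


def missing_golden_path_files(repo_root: str) -> list[str]:
    """Return the expected paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]
```

An empty result means the repository matches the checked subset; anything else is a candidate finding for a portal scorecard.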

Golden Path vs Escape Hatch

Situation	Recommendation
Standard HTTP microservice (Go, Java, Python, Node.js)	Use golden path template — covers 90% of services
ML training job (GPU, large memory, spot instances)	Specialised template; golden path extended for batch/GPU workloads
Stateful service needing custom storage topology	Escape hatch: custom Helm chart, but must pass Gatekeeper policies
Legacy service not yet containerised	Migration path: platform provides lift-and-shift guide + VM-based workload support
Security-sensitive service (handles PII/PCI)	Hardened golden path variant with additional network policies, encryption, audit logging

Escape hatches must still pass policy

Teams that deviate from the golden path are not exempt from Gatekeeper/Kyverno policies. The golden path builds compliance in by default; escape hatches must achieve compliance explicitly. Admission policies (whether written for Gatekeeper or Kyverno) are the non-negotiable layer; golden paths are the convenience layer on top.
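To illustrate what "achieving compliance explicitly" means, here is a plain-Python stand-in for two typical admission rules. It is not the Gatekeeper or Kyverno API, only a sketch of the kind of check those engines perform; the rule set and the function name are assumptions, and for brevity it inspects only container-level settings (real policies also consider the pod-level securityContext).

```python
def policy_violations(pod_spec: dict) -> list[str]:
    """Return human-readable violations for a pod spec given as a plain dict."""
    violations = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        # Rule 1: containers must declare that they run as a non-root user.
        security = container.get("securityContext", {})
        if security.get("runAsNonRoot") is not True:
            violations.append(f"{name}: must set securityContext.runAsNonRoot=true")
        # Rule 2: containers must declare CPU and memory limits.
        limits = container.get("resources", {}).get("limits", {})
        for resource in ("cpu", "memory"):
            if resource not in limits:
                violations.append(f"{name}: missing resources.limits.{resource}")
    return violations
```

A golden-path chart would pass this by construction; a custom chart from an escape hatch has to set the same fields itself before the manifest is admitted.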

6. Tooling Landscape

Platform Engineering Tool Categories

Category	Open Source / CNCF	Commercial / Managed
Developer portal	Backstage (CNCF)	Port, Cortex, OpsLevel, Rely.io
Cluster provisioning	Cluster API, kOps, k3s	EKS Blueprints (CDK/Terraform), GKE Autopilot, AKS
Infrastructure as Code (cloud resources)	Crossplane (CNCF), ACK, Terraform, Pulumi	AWS CDK, Terraform Cloud, Pulumi Cloud
GitOps / CD	Argo CD (CNCF), Flux (CNCF), Spinnaker (CD Foundation)	Harness
CI pipelines	Tekton (CD Foundation), Jenkins (CD Foundation), Dagger, Woodpecker	GitHub Actions, GitLab CI, CircleCI
Policy enforcement	OPA Gatekeeper (CNCF), Kyverno (CNCF)	Styra DAS, Nirmata
Secrets management	Vault (HashiCorp/BSL), External Secrets Operator	AWS Secrets Manager, GCP Secret Manager, Azure Key Vault
Service mesh	Istio (CNCF), Linkerd (CNCF), Cilium	AWS App Mesh, Google Traffic Director
Image registry	Harbor (CNCF), Zot	ECR, GAR, ACR, Docker Hub, JFrog Artifactory
Image signing	Cosign / Sigstore (OpenSSF), Notation (Notary Project, CNCF)	AWS Signer
Network / DNS / TLS	cert-manager (CNCF), external-dns, MetalLB	AWS ACM, Cloudflare, Route53
Cost management	OpenCost (CNCF), Kubecost	AWS Cost Explorer, CloudHealth, Apptio
Preview environments	vCluster, Argo CD ApplicationSets	Environments by Render, Railway, Architect
Feature flags	OpenFeature (CNCF), Flagd	LaunchDarkly, Statsig, Split

CNCF Landscape Positioning

CNCF Platform Engineering Relevant Projects (2024):

Graduated:  Argo, Flux, OPA, Harbor, cert-manager (graduated late 2024)
Incubating: Backstage, Crossplane, Kyverno, OpenFeature, OpenCost, KubeVela
Sandbox:    Headlamp, Kratix
Kubernetes SIG subprojects (not separately CNCF-hosted): Cluster API, external-dns

Key patterns these tools address:
  Crossplane    → Kubernetes-native IaC (cloud resources as CRDs)
  Cluster API   → Kubernetes-native cluster lifecycle management
  Backstage     → Developer portal and service catalog
  Argo CD       → GitOps continuous delivery
  Flux          → GitOps with Helm/Kustomize reconciliation
  Kyverno       → Policy-as-code without Rego
  OPA           → Policy engine (Gatekeeper for K8s admission)
  cert-manager  → Automated TLS certificate lifecycle
  external-dns  → DNS record sync from Kubernetes Ingress/Service

7. Reference Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                     Developer Workflow                               │
│                                                                      │
│  git push  →  CI Pipeline (GitHub Actions / Tekton)                  │
│              │  lint / test / build / sign / push image              │
│              │  update Helm values / Kustomize image tag             │
│              ▼                                                       │
│         GitOps Repo (argocd-apps / fleet)                           │
│              │                                                       │
│              ▼                                                       │
│         Argo CD / Flux  ←──────── Drift detection                   │
│              │  reconcile desired state from Git                     │
│              ▼                                                       │
│  ┌───────────────────────────────────────────────────────────┐      │
│  │              Kubernetes Cluster                            │      │
│  │                                                            │      │
│  │  ┌─────────────────────┐  ┌──────────────────────────┐   │      │
│  │  │  Admission Control  │  │   Namespace (team-A)      │   │      │
│  │  │  Gatekeeper/Kyverno │  │   workloads / services    │   │      │
│  │  │  OPA policies       │  │   ConfigMaps / secrets    │   │      │
│  │  └─────────────────────┘  └──────────────────────────┘   │      │
│  │                                                            │      │
│  │  ┌─────────────────────┐  ┌──────────────────────────┐   │      │
│  │  │  Networking         │  │   Observability           │   │      │
│  │  │  Ingress NGINX      │  │   Prometheus / Loki       │   │      │
│  │  │  cert-manager       │  │   Tempo / Grafana         │   │      │
│  │  │  external-dns       │  │   Alertmanager            │   │      │
│  │  │  Cilium / Istio     │  │   Pyroscope               │   │      │
│  │  └─────────────────────┘  └──────────────────────────┘   │      │
│  │                                                            │      │
│  │  ┌─────────────────────┐  ┌──────────────────────────┐   │      │
│  │  │  Secrets            │  │   Infrastructure          │   │      │
│  │  │  Vault + ESO        │  │   Crossplane providers    │   │      │
│  │  │  Secret rotation    │  │   RDS / S3 / SQS claims   │   │      │
│  │  └─────────────────────┘  └──────────────────────────┘   │      │
│  └───────────────────────────────────────────────────────────┘      │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │                Backstage Developer Portal                     │   │
│  │  Service catalog / Templates / TechDocs / Scorecards          │   │
│  │  Kubernetes plugin / Argo CD plugin / Cost plugin             │   │
│  └──────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘
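The drift-detection arrow in the diagram boils down to comparing the desired state in Git with the live state in the cluster. The sketch below models both as flat dictionaries, which is a deliberate simplification; real GitOps controllers diff full Kubernetes objects, and the function name is hypothetical.

```python
def detect_drift(desired: dict, live: dict) -> dict:
    """Return {field: (desired_value, live_value)} for every field that differs."""
    drift = {}
    for key in sorted(desired.keys() | live.keys()):
        want, have = desired.get(key), live.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift
```

If Git says `{"image": "app:1.2", "replicas": 2}` and someone scaled the live Deployment to 5 replicas by hand, the result is `{"replicas": (2, 5)}`, and reconciliation sets the live value back to what Git declares.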

Multi-Cluster Platform Architecture

Management Cluster (platform-managed)          Workload Clusters
┌───────────────────────────────┐         ┌──────────────────┐
│  Argo CD / Flux               │────────▶│  cluster: dev    │
│  Crossplane control plane     │         │  (team namespaces│
│  Vault                        │────────▶│   + policies)    │
│  Backstage                    │         └──────────────────┘
│  Harbor (image registry)      │         ┌──────────────────┐
│  Policy validation webhook    │────────▶│  cluster: staging│
│  Platform observability stack │         └──────────────────┘
└───────────────────────────────┘         ┌──────────────────┐
                                    ──────▶│  cluster: prod-1 │
                                    │      └──────────────────┘
                                    ──────▶│  cluster: prod-2 │
                                           │  (regional)      │
                                           └──────────────────┘

Fleet management: Cluster API (provisioning) + Argo CD ApplicationSets
                  (deploy platform add-ons uniformly across all clusters)
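Conceptually, deploying add-ons uniformly is a fan-out of every add-on across every cluster. The following sketch models only that idea in plain Python; it does not use or reproduce the Argo CD ApplicationSet API, and the cluster and add-on names are taken from the diagrams above purely as examples.

```python
from itertools import product


def fan_out(clusters: list[str], addons: list[str]) -> list[dict[str, str]]:
    """Produce one deployment record per (cluster, add-on) pair."""
    return [
        {"name": f"{addon}-{cluster}", "cluster": cluster, "addon": addon}
        for cluster, addon in product(clusters, addons)
    ]
```

Four clusters and two add-ons yield eight records, so adding a cluster to the fleet automatically implies a full set of add-on deployments for it.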

8. Platform Maturity Model

Level 1

Provisional

Manual cluster provisioning
Ad-hoc CI/CD per team
No shared observability
Secrets in environment variables
No policy enforcement
Deployments require ops tickets

Level 2

Operationalized

Cluster provisioning scripted (Terraform)
Shared CI templates (GitHub Actions)
Basic Prometheus + Grafana
Vault for secrets
Some RBAC standards
Manual CD (kubectl apply)

Level 3

Scalable

Cluster API / EKS Blueprints
GitOps (Argo CD / Flux)
Full LGTM observability stack
External Secrets Operator
Gatekeeper policies enforced
Golden path templates
Self-service namespaces

Level 4

Optimising

Full IDP with Backstage portal
Crossplane cloud resources
Preview environments
Multi-cluster fleet management
Cost attribution per team
SLO-driven deployments
Platform SLOs published
Developer NPS tracked

Don't skip levels

Organisations that jump from Level 1 directly to Backstage + Crossplane + Argo CD typically fail to adopt any of it because the prerequisite operational practices are absent. Build Level 2 (reliable deployments, working observability) before automating Level 3 (GitOps, policy). Add the portal (Level 4) last — it has no value if the underlying platform is unreliable.

9. Team Topologies

Matthew Skelton and Manuel Pais's Team Topologies framework describes how platform engineering teams relate to other teams:

Platform Team

Owns the IDP. Treats product teams as customers. Measures developer productivity. Works in 6–12-week cycles. Publishes roadmap and SLOs.

platform

Stream-Aligned Team

Product/feature teams that own end-to-end delivery of a product stream. They use the platform — they do not manage infrastructure, CI tooling, or observability backends.

consumer

Complicated Subsystem Team

Owns deeply specialised components (ML platform, data pipelines, real-time event bus). Interfaces with platform team for infrastructure; interfaces with stream-aligned teams as a service provider.

specialist

SRE / Reliability Team

Owns availability targets, incident response, and toil automation. Embedded in platform team or parallel. Uses the IDP to deploy reliability tooling (chaos engineering, synthetic monitors, runbooks).

reliability

Platform Team Sizing

Organisation Size	Platform Team Size	Focus
Startup (<30 engineers)	1–2 (infra-aware engineers)	Basic CI/CD + managed K8s (EKS/GKE)
Scale-up (30–150 engineers)	3–6	GitOps + observability + policy + Vault
Mid-size (150–500 engineers)	6–12	Full IDP + Backstage + Crossplane + cost
Enterprise (>500 engineers)	12–30+	Multi-cluster fleet + compliance + portals

Ratio rule of thumb

As a rule of thumb, each platform engineer should support 10–15 stream-aligned engineers. A ratio below 8:1 suggests the platform team is over-staffed relative to demand. Above 20:1, developer experience degrades (slow onboarding, long wait times for platform changes).
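The thresholds can be expressed as a small helper. This is a sketch of the rule of thumb only; the function name is hypothetical, and the "borderline" label for the ranges the text does not classify (8–10 and 15–20) is an assumption.

```python
def staffing_signal(stream_aligned_engineers: int, platform_engineers: int) -> str:
    """Classify the engineers-per-platform-engineer ratio against the rule of thumb."""
    if platform_engineers <= 0:
        raise ValueError("platform_engineers must be positive")
    ratio = stream_aligned_engineers / platform_engineers
    if ratio < 8:
        return "over-staffed"
    if ratio > 20:
        return "under-staffed"
    if 10 <= ratio <= 15:
        return "target"
    return "borderline"  # 8-10 or 15-20: not classified by the rule of thumb
```

For example, 120 stream-aligned engineers and 10 platform engineers give a 12:1 ratio, which lands in the target band.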

10. Developer Experience Metrics

Measuring platform value requires metrics that reflect the developer experience, not just infrastructure uptime.

DORA Metrics (DevOps Research and Assessment)

Metric	Elite	High	Medium	Low
Deployment frequency	Multiple per day	Daily to weekly	Weekly to monthly	Monthly or slower
Lead time for changes	< 1 hour	1 day – 1 week	1 week – 1 month	> 1 month
Change failure rate	< 5%	5–10%	10–15%	> 15%
MTTR (mean time to restore)	< 1 hour	< 1 day	1 day – 1 week	> 1 week
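The table can be applied mechanically to a team's own numbers. The sketch below encodes two of the four rows; the function names are hypothetical, and where the table's ranges meet (for example exactly 10%) the choice of tier is an assumption.

```python
def change_failure_tier(rate_percent: float) -> str:
    """Map a change failure rate (in percent) to a tier from the table."""
    if rate_percent < 5:
        return "Elite"
    if rate_percent < 10:
        return "High"
    if rate_percent <= 15:
        return "Medium"
    return "Low"


def restore_time_tier(hours: float) -> str:
    """Map mean time to restore (in hours) to a tier from the table."""
    if hours < 1:
        return "Elite"
    if hours < 24:
        return "High"
    if hours <= 24 * 7:
        return "Medium"
    return "Low"
```

A team with a 7.5% change failure rate and a 6-hour restore time would sit in the High tier on both measures.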

Platform-Specific Metrics

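# PromQL sketches for tracking platform health
# (some relate to DORA, some are platform-specific).
# Metric names vary by tool and version; confirm each one against the
# /metrics output of your own Argo CD, Crossplane, and Tekton installs.

# Time to first production deployment for new services
# Sync duration is only a proxy: the full measure runs from Application
# creation to the first Healthy status.
histogram_quantile(0.90,
  sum by (le) (rate(argocd_app_sync_total_duration_seconds_bucket[7d]))
)

# Self-service ratio: platform-provisioned vs ticket-provisioned resources
# Track: number of Crossplane claims created vs infra tickets
# (jira_infra_tickets_30d is a custom gauge you would export yourself;
# sum() strips its labels so the addition can match)
sum(increase(crossplane_managed_resource_ready_total[30d]))
  / (sum(increase(crossplane_managed_resource_ready_total[30d])) + sum(jira_infra_tickets_30d))

# CI pipeline P90 duration (developer wait time)
histogram_quantile(0.90,
  sum by (le) (rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket{status="success"}[7d]))
)

# Platform availability (GitOps reconciliation loop health)
# Both sides are aggregated with sum(); without it, label matching would
# divide the Error series by itself and the result would always be 0.
1 - (
  sum(rate(argocd_app_sync_total{phase="Error"}[24h]))
  / sum(rate(argocd_app_sync_total[24h]))
)

# Developer NPS: tracked via quarterly survey (not Prometheus)
# Target: NPS > 30 (platform is net positive to developers)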
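Since NPS comes from a survey rather than from Prometheus, it has to be computed separately. A minimal sketch using the standard definition (0–10 scale, promoters score 9–10, detractors score 0–6); the function name and the sample responses are hypothetical.

```python
def net_promoter_score(scores: list[int]) -> float:
    """Percentage of promoters minus percentage of detractors, in the range -100 to 100."""
    if not scores:
        raise ValueError("at least one survey response is required")
    if any(s < 0 or s > 10 for s in scores):
        raise ValueError("scores must be between 0 and 10")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)
```

Ten responses of `[10, 9, 9, 8, 7, 6, 3, 10, 9, 8]` contain five promoters and two detractors, giving an NPS of 30, which sits exactly at the target threshold rather than above it.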

SPACE Framework (GitHub Research)

Dimension	Metric Examples
Satisfaction and well-being	Developer NPS, survey scores, onboarding satisfaction
Performance	Code review cycle time, deployment success rate, incident MTTR
Activity	Deployments per week, PRs merged, features shipped
Communication and collaboration	Documentation quality score, knowledge-sharing sessions
Efficiency and flow	Time waiting for CI, on-call interruption rate, toil %

11. Section Guide

The Platform Engineering section covers 10 detailed topics. Each builds on this overview:

08 — 01

Cluster Provisioning

Cluster API, EKS Blueprints, kOps, kubeadm, cluster add-ons, node pools, upgrade strategies

08 — 02

GitOps

Argo CD (app-of-apps, ApplicationSets, sync waves), Flux (Kustomize/Helm controllers, image automation), Git branching strategies

08 — 03

CI/CD Pipelines

Tekton Pipelines, GitHub Actions, image build/sign/push, progressive delivery (canary/blue-green), Argo Rollouts

08 — 04

Developer Portal

Backstage: service catalog, Software Templates, TechDocs, plugins (K8s/Argo/cost), scorecards, onboarding automation

08 — 05

Policy Enforcement

OPA Gatekeeper ConstraintTemplates, Kyverno policies (validate/mutate/generate), Rego authoring, policy-as-code testing

08 — 06

Multi-Tenancy

Namespace-per-team, vCluster, Hierarchical Namespaces (HNC), ResourceQuota, LimitRange, network isolation patterns

08 — 07

Cost Management

OpenCost/Kubecost, cost allocation by namespace/team, rightsizing, Spot instances, Karpenter, showback vs chargeback

08 — 08

Service Catalog

Crossplane Compositions (XRDs/Claims), AWS ACK, cloud resource self-service, composition validation, provider configuration

08 — 09

Secrets Automation

External Secrets Operator, Vault Agent Injector vs ESO, secret rotation, IRSA/Workload Identity, sealed secrets

08 — 10

Platform APIs

CRD-based platform APIs, Crossplane XRDs as abstractions, Kubernetes API aggregation, admission webhooks for platform logic

Coverage Checklist

Platform engineering definition and goal (reduce developer cognitive load)
Before/after cognitive load diagram (40 tasks → 2 tasks)
Platform engineering vs DevOps vs SRE comparison table
IDP five-plane capability map diagram (developer experience/application/security/infrastructure/observability)
Capability ownership table (platform provides vs product team configures)
Platform as a product: 6 principle cards (user research/API contracts/DORA/roadmap/platform SLOs/opt-in adoption)
Golden path anatomy: git push → repo structure with all pre-configured files
Golden path vs escape hatch decision table (5 scenarios)
Escape hatch must still pass policy — callout
Platform engineering tool landscape table (14 categories, CNCF and commercial options)
CNCF graduated/incubating/sandbox projects reference
Reference architecture diagram: CI → GitOps repo → Argo CD → K8s cluster (all planes)
Multi-cluster platform architecture (management cluster + workload clusters, fleet management)
Platform maturity model: 4 levels (Provisional/Operationalized/Scalable/Optimising) with characteristics
Don't skip levels callout (portal before reliable platform fails)
Team Topologies: platform/stream-aligned/complicated-subsystem/SRE team cards
Platform team sizing table (startup to enterprise with ratios)
Ratio rule of thumb callout (10–15 developers per platform engineer)
DORA metrics table with elite/high/medium/low thresholds
Platform-specific PromQL metrics (time-to-deploy, self-service ratio, CI P90, GitOps availability)
SPACE framework (Satisfaction/Performance/Activity/Communication/Efficiency)
Section guide with links to all 10 detail pages and descriptions