Contents

| #  | File | Topic |
|----|------|-------|
| 00 | workflows-overview.md | This file (section guide) |
| 01 | local-development.md | Local dev clusters, Skaffold, Tilt |
| 02 | ci-cd-pipelines.md | GitHub Actions, Tekton, ArgoCD |
| 03 | helm.md | Helm charts, releases, lifecycle |
| 04 | kustomize.md | Kustomize overlays, patches |
| 05 | testing-strategies.md | Unit, integration, E2E, chaos |
| 06 | progressive-delivery.md | Canary, blue-green, feature flags |
| 07 | developer-self-service.md | Platform portals, Backstage, Crossplane |
| 08 | runbooks.md | Runbook authoring, templates |
| 09 | on-call.md | On-call rotation, escalation, postmortems |

The Platform-Developer Contract

The core challenge in team workflows is the cognitive load split: platform/SRE teams own the cluster, but application developers own what runs on it. A healthy Kubernetes workflow minimises the surface area where those two worlds collide.

Developer concern:
  "I want to deploy my service, see its logs, and know it's healthy."

Platform concern:
  "I want every deployment to meet security, resource, and reliability standards."

The overlap (where friction lives):
  ┌─────────────────────────────────────────────────┐
  │  Dockerfile  │  resource requests  │  probes    │
  │  RBAC scope  │  NetworkPolicy      │  PDB       │
  │  image tags  │  secrets handling   │  HPA       │
  └─────────────────────────────────────────────────┘

Good workflow design turns the overlap into guardrails, not gatekeepers — developers can move fast while platform policies enforce correctness at admission time.
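
As an illustration of a guardrail enforced at admission time, the sketch below shows a Kyverno ClusterPolicy that rejects Pods whose containers omit CPU and memory requests or a memory limit. The policy name and the exact set of required fields are assumptions made for this example, and newer Kyverno releases may prefer a per-rule failure action over the spec-level `validationFailureAction` used here.

```yaml
# Sketch only: policy name and required fields are illustrative assumptions.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests-limits
spec:
  # Enforce blocks non-compliant Pods; Audit would only report them.
  validationFailureAction: Enforce
  background: true
  rules:
    - name: validate-resources
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory requests and a memory limit are required."
        pattern:
          spec:
            containers:
              # "?*" matches any non-empty value for each listed field.
              - resources:
                  requests:
                    cpu: "?*"
                    memory: "?*"
                  limits:
                    memory: "?*"
```

Starting such a policy in `Audit` mode and switching to `Enforce` once existing workloads comply is one way to keep it a guardrail rather than a gatekeeper.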

Workflow Maturity Model

| Level | Characteristics | Typical Pain |
|-------|-----------------|--------------|
| L1: Manual | kubectl apply from laptop, shared kubeconfig, no CI | Config drift, "works on my machine" |
| L2: CI Deployments | Pipeline runs kubectl apply, image tags in manifests, one environment | Hard to trace what's running, no rollback |
| L3: GitOps | Argo CD / Flux, separate config repo, environment branches | Environment promotion is manual PR |
| L4: Platform | Developer portal, self-service namespaces, policy-as-code, SLO dashboards | Developer onboarding still slow |
| L5: Product Platform | Golden paths per workload type, internal IDP, automated compliance, cost showback | Maintenance of platform itself |

Most teams shipping production Kubernetes should target L3–L4. L5 is worth pursuing when the platform team serves 50+ application teams.

Core Toolchain Map

 Code                    Build                   Deploy                  Operate
 ────                    ─────                   ──────                  ───────
 Git (GitHub/GitLab)  →  CI (Actions/Tekton)  →  Argo CD / Flux       →  kubectl / k9s
 Dockerfile           →  BuildKit / Kaniko    →  Helm / Kustomize     →  Prometheus
 Skaffold / Tilt      →  Trivy scan           →  Kyverno policies     →  Loki / Tempo
 devcontainer         →  cosign sign          →  Progressive rollout  →  PagerDuty

Tool Selection Reference

| Category | OSS Choice | Alternative (OSS or managed) | Notes |
|----------|------------|------------------------------|-------|
| Local dev | Tilt, Skaffold | Docker Desktop K8s | Tilt has stronger live update (hot reload); Skaffold has simpler config |
| Image build | BuildKit, Kaniko | GitHub Actions buildx | Kaniko for in-cluster builds (no Docker daemon) |
| Image scan | Trivy | Snyk, Prisma Cloud | Trivy is free, fast, OCI-native |
| CI | GitHub Actions, Tekton | CircleCI, Buildkite | Tekton is K8s-native but complex |
| GitOps | Argo CD | Flux | Argo CD has a better UI; Flux has better multi-tenancy |
| Packaging | Helm, Kustomize | OCI Helm charts | Helm for versioned releases; Kustomize for env overlays |
| Progressive delivery | Argo Rollouts | Flagger | Argo Rollouts integrates with Argo CD natively |
| Policy | Kyverno | OPA/Gatekeeper | Kyverno is simpler and YAML-native; OPA/Gatekeeper is more expressive |
| Developer portal | Backstage | Port, Cortex | Backstage is extensible but heavy to operate |

The Golden Path

A golden path is a paved, opinionated workflow that guides developers from code to production without requiring deep Kubernetes expertise. It does not prevent teams from going off-road, but it makes the default path the easiest path.

Minimal Golden Path Components

  1. Service template — a Helm chart or Kustomize base that embeds resource requests, probes, PDB, NetworkPolicy, and HPA defaults. Developers fill in image and config only.
  2. CI template — a reusable GitHub Actions workflow (or Tekton Pipeline) that builds, scans, signs, and pushes the image, then opens a PR to the config repo.
  3. Environment promotion — Argo CD ApplicationSets that automatically sync dev on every merge, and require a PR approval to promote to staging and production.
  4. Observability defaults — ServiceMonitor, log format enforced via Kyverno, and a pre-built Grafana dashboard template linked from the service template.
  5. Developer runbook — a one-page doc per service type (HTTP API / worker / cron) covering: how to deploy, how to roll back, how to read its dashboard, and who to call.
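
The environment promotion component (item 3) could be sketched as an Argo CD ApplicationSet that stamps out one Application per environment from a single template. The repository URL, the `argocd` namespace, the target namespaces, and the environment list are assumptions for illustration. In this sketch every environment syncs automatically from Git, so the approval gate lives in the pull request that changes each environment's directory, not in Argo CD itself.

```yaml
# Sketch only: repoURL, namespaces, and environment names are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: payments-api
  namespace: argocd
spec:
  generators:
    # One generated Application per list element.
    - list:
        elements:
          - env: staging
          - env: production
  template:
    metadata:
      name: 'payments-api-{{env}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/acme/k8s-config.git
        targetRevision: main
        # Each environment reads its own overlay directory.
        path: 'services/payments-api/{{env}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: 'payments-{{env}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
```

A dev environment that syncs on every merge would be one more list element; branch protection on the production overlay path is what enforces the approval.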

Example: New Service Onboarding in 10 Minutes

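# 1. Scaffold from template (Backstage software template or cookiecutter)
#    Illustrative pseudo-command, not a real backstage-cli invocation:
#    Backstage software templates are normally run from the Backstage
#    "Create" page or through the scaffolder API.
backstage-cli create --template kubernetes-http-service \
  --name payments-api \
  --owner team-payments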

# 2. Scaffold creates:
#   - GitHub repo with Dockerfile, go.mod / package.json
#   - Helm chart in the app repo (or config repo PR)
#   - Argo CD Application pointing at the chart
#   - Namespace + RBAC for team-payments
#   - Monitoring hook: Prometheus ServiceMonitor (or Datadog monitors)
#   - PagerDuty service integration

# 3. First deploy happens automatically once the first merge to main passes CI
# 4. Developer opens Backstage → sees service health, docs, runbook link
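
For context, a Backstage software template is itself a YAML entity registered in the catalog. The sketch below shows roughly what a `kubernetes-http-service` template could look like. The `./skeleton` directory, the `acme` GitHub organisation, the `platform-team` owner, and the parameter names are assumptions, and a real template would add further steps for the config repo PR, namespace, and PagerDuty integration.

```yaml
# Sketch only: skeleton path, org name, owner, and parameters are assumptions.
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: kubernetes-http-service
  title: Kubernetes HTTP Service
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Service details
      required: [name, owner]
      properties:
        name:
          type: string
          description: Service name, for example payments-api
        owner:
          type: string
          description: Owning team, for example team-payments
  steps:
    # Render the skeleton (Dockerfile, chart, catalog-info.yaml) with the inputs.
    - id: fetch
      name: Fetch skeleton
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
          owner: ${{ parameters.owner }}
    # Create the GitHub repository and push the rendered files.
    - id: publish
      name: Publish to GitHub
      action: publish:github
      input:
        repoUrl: github.com?owner=acme&repo=${{ parameters.name }}
    # Register the new component so it appears in the Backstage catalog.
    - id: register
      name: Register in catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps['publish'].output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml
```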

GitOps Repository Structure

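Two common patterns: monorepo (all teams in one config repo) and polyrepo (one config repo per team or service). The first tree below (k8s-config/) shows a monorepo layout; the second (github.com/acme/) shows a polyrepo layout.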

# Monorepo: all teams share one config repo
k8s-config/
├── clusters/
│   ├── production/
│   │   ├── apps/          # Argo CD Application manifests
│   │   └── infra/         # cluster-level infra (cert-manager, prometheus...)
│   └── staging/
│       ├── apps/
│       └── infra/
├── services/
│   ├── payments-api/
│   │   ├── base/          # Kustomize base or Helm values-base.yaml
│   │   ├── staging/       # overlay / values-staging.yaml
│   │   └── production/    # overlay / values-production.yaml
│   └── auth-service/
│       ├── base/
│       ├── staging/
│       └── production/
└── platform/
    ├── cert-manager/
    ├── prometheus-stack/
    └── kyverno/
# Polyrepo: each team has app-repo (source code) + config-repo (K8s manifests)
github.com/acme/
├── payments-api/           # source code + Dockerfile
├── payments-api-config/    # K8s manifests, Helm values, Argo CD app
├── auth-service/
├── auth-service-config/
└── platform-config/        # owned by platform team

The tradeoff: monorepo is easier to enforce standards across (one CI pipeline, one policy check); polyrepo gives teams more autonomy and avoids merge conflicts at scale.
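
In either layout, an environment directory can be as small as one Kustomize overlay. The following sketch assumes a monorepo path of `services/payments-api/production/`, a base that references the image as `payments-api`, and a hypothetical `ghcr.io/acme` registry. The tag value is a placeholder for whatever the CI pipeline writes.

```yaml
# services/payments-api/production/kustomization.yaml
# Sketch only: registry, tag placeholder, and replica count are assumptions.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../base
images:
  # Rewrites the image named "payments-api" in the base manifests.
  - name: payments-api
    newName: ghcr.io/acme/payments-api
    newTag: sha-0000000
replicas:
  # Overrides spec.replicas on the Deployment named payments-api.
  - name: payments-api
    count: 3
```

Keeping the image tag in the overlay means a promotion is a one-line diff, which keeps review cheap in both the monorepo and polyrepo patterns.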

Change Management and Review Standards

Deployment Approval Matrix

| Change Type | Who can merge | Required reviews | Deploy path |
|-------------|---------------|------------------|-------------|
| Config-only (env var, replica count) | App team | 1 peer | Auto-sync to staging → manual promote to prod |
| Application image bump | App team | 1 peer | Auto-sync staging → progressive rollout to prod |
| New service | App team + platform review | 2 (1 platform) | Manual gate before prod |
| Platform infra change | Platform team | 2 platform | Staging first, prod after 24h soak |
| Cluster upgrade | Platform team | 2 platform + manager | Change window, full DR drill first |

Branch and PR Conventions

# Branches
main          → production (protected, no direct push)
staging       → staging environment (optional; Argo CD can target main with image filter)
feature/...   → developer branches

# PR title convention (enforced by CI lint)
feat(payments-api): add idempotency key support
fix(auth-service): handle token expiry edge case
chore(platform): upgrade cert-manager to v1.16

# Commit message body: include the Jira/Linear ticket
# PR description: link to runbook update if ops behaviour changes
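
The PR title convention above could be enforced with a small GitHub Actions workflow like the sketch below. The workflow name, the allowed types (`feat`, `fix`, `chore`), and the scope character set are assumptions derived from the three example titles. The title is passed through an environment variable rather than interpolated into the script, which avoids shell injection from untrusted PR titles.

```yaml
# .github/workflows/pr-title-lint.yaml
# Sketch only: allowed types and scope pattern are assumptions.
name: pr-title-lint
on:
  pull_request:
    types: [opened, edited, reopened, synchronize]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - name: Check PR title format
        env:
          TITLE: ${{ github.event.pull_request.title }}
        run: |
          # Expected shape: type(scope): summary
          pattern='^(feat|fix|chore)\([a-z0-9-]+\): .+'
          if ! printf '%s' "$TITLE" | grep -Eq "$pattern"; then
            echo "PR title must look like: feat(payments-api): short summary" >&2
            exit 1
          fi
```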

Environment Promotion Workflow

                    merge to main
                         │
                         ▼
              ┌─── CI Pipeline ──────────────────┐
              │  build → test → scan → sign       │
              │  push image:sha-<git-sha>         │
              │  open PR: update image tag in     │
              │           config repo (staging)   │
              └───────────────────────────────────┘
                         │ auto-merge if tests pass
                         ▼
               Argo CD syncs staging
                         │
                    soak period
               (automated smoke tests,
                Datadog monitors green)
                         │
              ┌──── Promotion PR ────────────────┐
              │  update image tag in             │
              │  config repo (production)        │
              │  requires manual approval        │
              └──────────────────────────────────┘
                         │
               Argo CD syncs production
               via Argo Rollouts (canary)
                         │
                   metrics healthy?
                    ┌────┴────┐
                   yes       no
                    │         │
                 promote   rollback
                  100%    (automatic)
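
The final canary stage of this flow could be expressed with Argo Rollouts roughly as follows. This is a sketch under several assumptions: the step weights and pause durations, the Prometheus address, the `http_requests_total` metric with `app` and `code` labels, and the 99% success threshold are all illustrative. When the analysis fails, the Rollout aborts and returns traffic to the stable version, which corresponds to the automatic rollback branch in the diagram.

```yaml
# Sketch only: weights, durations, metric names, and thresholds are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments-api
spec:
  replicas: 4
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: payments-api
          image: ghcr.io/acme/payments-api:sha-0000000
  strategy:
    canary:
      steps:
        - setWeight: 25
        - pause: {duration: 5m}
        # Blocks here until the analysis run succeeds or fails.
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 50
        - pause: {duration: 5m}
        # After the last step the new version is promoted to 100%.
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 1m
      count: 5
      # Fraction of non-5xx responses must stay at or above 99%.
      successCondition: result[0] >= 0.99
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(http_requests_total{app="payments-api",code!~"5.."}[2m]))
            /
            sum(rate(http_requests_total{app="payments-api"}[2m]))
```

Without a traffic router configured, `setWeight` is approximated by the ratio of canary to stable replicas, so small replica counts give coarse weights.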

Cross-Cutting Concerns Checklist

Before a service goes to production, the following must be in place. Platform teams often encode this as a Kyverno ClusterPolicy or Backstage scaffolder check.

## Service Production Readiness

### Reliability
- [ ] `resources.requests` and `resources.limits` set on all containers
- [ ] `readinessProbe` and `livenessProbe` configured
- [ ] `startupProbe` for slow-starting services
- [ ] `minReadySeconds: 30` set on Deployment
- [ ] PodDisruptionBudget exists (`minAvailable: 1` or higher)
- [ ] HPA configured (CPU or custom metric)
- [ ] Topology spread constraints for multi-AZ

### Observability
- [ ] ServiceMonitor exists (Prometheus scrape)
- [ ] Structured JSON logging with `level`, `msg`, `trace_id`
- [ ] OpenTelemetry tracing instrumented
- [ ] Grafana dashboard created and linked in Backstage
- [ ] SLO defined in Sloth / PrometheusRule

### Security
- [ ] `securityContext.runAsNonRoot: true`
- [ ] `securityContext.readOnlyRootFilesystem: true`
- [ ] `securityContext.allowPrivilegeEscalation: false`
- [ ] `automountServiceAccountToken: false` (if no API access needed)
- [ ] NetworkPolicy: explicit ingress + egress
- [ ] Secrets via External Secrets Operator (ESO), not hardcoded in manifests
- [ ] Image signed with cosign, verified by Kyverno

### Operations
- [ ] Runbook written and linked in Backstage
- [ ] PagerDuty service exists, routing rules set
- [ ] On-call rotation includes this service
- [ ] Graceful shutdown: `preStop` hook + `terminationGracePeriodSeconds`
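
To make several of the checklist items concrete, here is a sketch of a Deployment and PodDisruptionBudget that would satisfy most of the Reliability and Security entries and the graceful shutdown entry. The image reference, port, probe paths, resource sizes, and timing values are assumptions. The `preStop` sleep also assumes the image ships a `sleep` binary.

```yaml
# Sketch only: image, port, probe paths, sizes, and timings are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 3
  minReadySeconds: 30
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      automountServiceAccountToken: false
      # Must exceed the preStop delay plus the app's own shutdown time.
      terminationGracePeriodSeconds: 45
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: payments-api
      containers:
        - name: payments-api
          image: ghcr.io/acme/payments-api:sha-0000000
          ports:
            - name: http
              containerPort: 8080
          resources:
            requests: {cpu: 100m, memory: 128Mi}
            limits: {cpu: 500m, memory: 256Mi}
          readinessProbe:
            httpGet: {path: /readyz, port: http}
            periodSeconds: 5
          livenessProbe:
            httpGet: {path: /healthz, port: http}
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                # Gives load balancers time to stop routing before SIGTERM.
                command: ["sleep", "10"]
          securityContext:
            runAsNonRoot: true
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: payments-api
```

The remaining items (HPA, NetworkPolicy, ServiceMonitor, image signature verification) are separate resources and are omitted here to keep the sketch short.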

Cross-references: 09 — Production Overview · 08 — GitOps · Progressive Delivery