Team Workflows Overview
Contents
| # | File | Topic |
|---|---|---|
| 00 | workflows-overview.md | This file — section guide |
| 01 | local-development.md | Local dev clusters, Skaffold, Tilt |
| 02 | ci-cd-pipelines.md | GitHub Actions, Tekton, ArgoCD |
| 03 | helm.md | Helm charts, releases, lifecycle |
| 04 | kustomize.md | Kustomize overlays, patches |
| 05 | testing-strategies.md | Unit, integration, E2E, chaos |
| 06 | progressive-delivery.md | Canary, blue-green, feature flags |
| 07 | developer-self-service.md | Platform portals, Backstage, Crossplane |
| 08 | runbooks.md | Runbook authoring, templates |
| 09 | on-call.md | On-call rotation, escalation, postmortems |
The Platform-Developer Contract
The core challenge in team workflows is the cognitive load split: platform/SRE teams own the cluster, but application developers own what runs on it. A healthy Kubernetes workflow minimises the surface area where those two worlds collide.
Developer concern:
"I want to deploy my service, see its logs, and know it's healthy."
Platform concern:
"I want every deployment to meet security, resource, and reliability standards."
The overlap (where friction lives):
┌─────────────────────────────────────────────────┐
│ Dockerfile │ resource requests │ probes │
│ RBAC scope │ NetworkPolicy │ PDB │
│ image tags │ secrets handling │ HPA │
└─────────────────────────────────────────────────┘
Good workflow design turns the overlap into guardrails, not gatekeepers — developers can move fast while platform policies enforce correctness at admission time.
Workflow Maturity Model
| Level | Characteristics | Typical Pain |
|---|---|---|
| L1 — Manual | kubectl apply from laptop, shared kubeconfig, no CI | Config drift, "works on my machine" |
| L2 — CI Deployments | Pipeline runs kubectl apply, image tags in manifests, one environment | Hard to trace what's running, no rollback |
| L3 — GitOps | Argo CD / Flux, separate config repo, environment branches | Environment promotion is manual PR |
| L4 — Platform | Developer portal, self-service namespaces, policy-as-code, SLO dashboards | Developer onboarding still slow |
| L5 — Product Platform | Golden paths per workload type, internal IDP, automated compliance, cost showback | Maintenance of platform itself |
Most teams shipping production Kubernetes should target L3–L4. L5 is worth pursuing when the platform team serves 50+ application teams.
Core Toolchain Map
Code Build Deploy Operate
────── ───── ────── ───────
Git (GitHub/GitLab) → CI (Actions/Tekton) → Argo CD / Flux → kubectl / k9s
Dockerfile → BuildKit/Kaniko → Helm / Kustomize → Prometheus
Skaffold / Tilt → Trivy scan → Kyverno policies → Loki / Tempo
devcontainer → cosign sign → Progressive rollout → PagerDuty
Tool Selection Reference
| Category | OSS Choice | Alternative | Notes |
|---|---|---|---|
| Local dev | Tilt, Skaffold | Docker Desktop K8s | Tilt has stronger live-update (hot reload) support; Skaffold has simpler config |
| Image build | BuildKit, Kaniko | GitHub Actions buildx | Kaniko for in-cluster builds (no Docker daemon) |
| Image scan | Trivy | Snyk, Prisma Cloud | Trivy is free, fast, OCI-native |
| CI | GitHub Actions, Tekton | CircleCI, Buildkite | Tekton is K8s-native but complex |
| GitOps | Argo CD | Flux | Argo CD has better UI; Flux better multi-tenancy |
| Packaging | Helm, Kustomize | OCI Helm charts | Helm for versioned releases; Kustomize for env overlays |
| Progressive delivery | Argo Rollouts | Flagger | Argo Rollouts integrates with Argo CD natively |
| Policy | Kyverno | OPA/Gatekeeper | Kyverno is simpler YAML-native; OPA more powerful |
| Developer portal | Backstage | Port, Cortex | Backstage is extensible but heavy to operate |
The Golden Path
A golden path is a paved, opinionated workflow that guides developers from code to production without requiring deep Kubernetes expertise. It does not prevent teams from going off-road, but it makes the default path the easiest path.
Minimal Golden Path Components
- Service template: a Helm chart or Kustomize base that embeds resource requests, probes, PDB, NetworkPolicy, and HPA defaults. Developers fill in image and config only.
- CI template: a reusable GitHub Actions workflow (or Tekton Pipeline) that builds, scans, signs, and pushes the image, then opens a PR to the config repo.
- Environment promotion: Argo CD ApplicationSets that automatically sync `dev` on every merge, and require a PR approval to promote to `staging` and `production`.
- Observability defaults: ServiceMonitor, log format enforced via Kyverno, and a pre-built Grafana dashboard template linked from the service template.
- Developer runbook: a one-page doc per service type (HTTP API / worker / cron) covering how to deploy, how to roll back, how to read its dashboard, and who to call.
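The service-template idea in the list above amounts to a defaults merge: the platform owns the baseline, the developer supplies image and config, and anything else the developer overrides is a visible step off the golden path. The following is a minimal sketch in Python, not a real chart. `PLATFORM_DEFAULTS`, `render_values`, the key names, and the default numbers are all hypothetical and chosen only for illustration.

```python
import copy

# Hypothetical platform-owned defaults; key names and numbers are illustrative only.
PLATFORM_DEFAULTS = {
    "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}},
    "probes": {"readinessPath": "/healthz", "livenessPath": "/healthz"},
    "pdb": {"minAvailable": 1},
    "hpa": {"minReplicas": 2, "maxReplicas": 5, "targetCPUPercent": 70},
    "networkPolicy": {"enabled": True},
}


def render_values(developer_values):
    """Merge developer input over platform defaults.

    Returns (values, off_road), where off_road lists the platform defaults
    the developer chose to override. Overrides are allowed, just reported.
    """
    if "image" not in developer_values:
        raise ValueError("image is required")
    values = copy.deepcopy(PLATFORM_DEFAULTS)
    off_road = sorted(key for key in developer_values if key in PLATFORM_DEFAULTS)
    # Top-level merge: an overridden key replaces the whole default block.
    values.update(copy.deepcopy(developer_values))
    return values, off_road


if __name__ == "__main__":
    values, off_road = render_values(
        {"image": "registry.example.com/payments-api:sha-abc1234",
         "config": {"LOG_LEVEL": "info"}}
    )
    print(values["pdb"], off_road)
```

Reporting overrides instead of rejecting them matches the intent that a golden path does not prevent going off-road; a reviewer or policy check can decide what to do with the `off_road` list.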
Example: New Service Onboarding in 10 Minutes
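# 1. Scaffold from template (Backstage software template or cookiecutter)
#    Illustrative command only: Backstage software templates are normally run
#    from the portal's Create page or the scaffolder API, not from backstage-cli.
backstage-cli create --template kubernetes-http-service \
  --name payments-worker \
  --owner team-payments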
# 2. Scaffold creates:
# - GitHub repo with Dockerfile, go.mod / package.json
# - Helm chart in the app repo (or config repo PR)
# - Argo CD Application pointing at the chart
# - Namespace + RBAC for team-payments
# - ServiceMonitor for Prometheus (or Datadog autodiscovery annotations)
# - PagerDuty service integration
# 3. First deploy happens automatically when CI merges to main
# 4. Developer opens Backstage → sees service health, docs, runbook link
GitOps Repository Structure
Two common patterns: monorepo (all teams in one config repo) and polyrepo (one config repo per team or service).
Monorepo Layout (recommended for fewer than 20 services)
k8s-config/
├── clusters/
│ ├── production/
│ │ ├── apps/ # Argo CD Application manifests
│ │ └── infra/ # cluster-level infra (cert-manager, prometheus...)
│ └── staging/
│ ├── apps/
│ └── infra/
├── services/
│ ├── payments-api/
│ │ ├── base/ # Kustomize base or Helm values-base.yaml
│ │ ├── staging/ # overlay / values-staging.yaml
│ │ └── production/ # overlay / values-production.yaml
│ └── auth-service/
│ ├── base/
│ ├── staging/
│ └── production/
└── platform/
├── cert-manager/
├── prometheus-stack/
└── kyverno/
Polyrepo Layout (recommended for 20 or more services, or where teams need strong autonomy)
# Each team has: app-repo (source code) + config-repo (K8s manifests)
github.com/acme/
├── payments-api/ # source code + Dockerfile
├── payments-api-config/ # K8s manifests, Helm values, Argo CD app
├── auth-service/
├── auth-service-config/
└── platform-config/ # owned by platform team
The tradeoff: a monorepo makes standards easier to enforce (one CI pipeline, one policy check), while a polyrepo gives teams more autonomy and avoids merge conflicts at scale.
Change Management and Review Standards
Deployment Approval Matrix
| Change Type | Who can merge | Required reviews | Deploy path |
|---|---|---|---|
| Config-only (env var, replica count) | App team | 1 peer | Auto-sync to staging → manual promote to prod |
| Application image bump | App team | 1 peer | Auto-sync staging → progressive rollout to prod |
| New service | App team + platform review | 2 (1 platform) | Manual gate before prod |
| Platform infra change | Platform team | 2 platform | Staging first, prod after 24h soak |
| Cluster upgrade | Platform team | 2 platform + manager | Change window, full DR drill first |
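The "Required reviews" column of the matrix above can be checked mechanically, for example in a merge bot. This is a minimal sketch under assumptions: the change-type keys, the `Rule` dataclass, and `can_merge` are hypothetical names, and "2 platform + manager" is read as two platform approvals plus at least one manager approval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    total_reviews: int
    platform_reviews: int
    manager_approval: bool = False


# Encodes the "Required reviews" column; the dictionary keys are hypothetical labels.
APPROVAL_MATRIX = {
    "config-only": Rule(1, 0),
    "image-bump": Rule(1, 0),
    "new-service": Rule(2, 1),
    "platform-infra": Rule(2, 2),
    "cluster-upgrade": Rule(2, 2, manager_approval=True),
}


def can_merge(change_type, approvers, platform_team, managers=frozenset()):
    """Return True when the approvals satisfy the rule for this change type."""
    rule = APPROVAL_MATRIX[change_type]
    approvers = set(approvers)
    enough_reviews = len(approvers) >= rule.total_reviews
    enough_platform = len(approvers & set(platform_team)) >= rule.platform_reviews
    manager_ok = (not rule.manager_approval) or bool(approvers & set(managers))
    return enough_reviews and enough_platform and manager_ok


if __name__ == "__main__":
    platform = {"pat", "sam"}
    print(can_merge("new-service", {"alice", "pat"}, platform))
```

Who is allowed to press the merge button (the "Who can merge" column) is a separate concern, usually handled by branch protection or CODEOWNERS, and is not modelled here.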
Branch and PR Conventions
# Branches
main → production (protected, no direct push)
staging → staging environment (optional; Argo CD can target main with image filter)
feature/... → developer branches
# PR title convention (enforced by CI lint)
feat(payments-api): add idempotency key support
fix(auth-service): handle token expiry edge case
chore(platform): upgrade cert-manager to v1.16
# Commit message body: include the Jira/Linear ticket
# PR description: link to runbook update if ops behaviour changes
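The PR title convention above is said to be enforced by a CI lint. One way such a lint could look is sketched below in Python. The regular expression and the allowed type list are assumptions: only `feat`, `fix`, and `chore` appear in the examples, so a real setup may accept more types.

```python
import re

# Assumption: only the three types shown in the examples are allowed.
ALLOWED_TYPES = ("feat", "fix", "chore")

# type(scope): subject, with a lowercase, hyphenated scope such as "payments-api".
TITLE_RE = re.compile(
    r"^(?P<type>[a-z]+)\((?P<scope>[a-z0-9][a-z0-9-]*)\): (?P<subject>\S.*)$"
)


def lint_pr_title(title):
    """Return a list of problems; an empty list means the title passes."""
    match = TITLE_RE.match(title)
    if match is None:
        return ["title must look like 'type(scope): subject'"]
    problems = []
    if match["type"] not in ALLOWED_TYPES:
        problems.append(f"unknown type '{match['type']}'")
    return problems


if __name__ == "__main__":
    for title in ("feat(payments-api): add idempotency key support",
                  "Add idempotency keys"):
        print(title, "->", lint_pr_title(title) or "ok")
```

In CI, the script would exit non-zero when the returned list is non-empty, which blocks the merge until the title is fixed.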
Environment Promotion Workflow
merge to main
│
▼
┌─── CI Pipeline ──────────────────┐
│ build → test → scan → sign │
│ push image:sha-<git-sha> │
│ open PR: update image tag in │
│ config repo (staging) │
└───────────────────────────────────┘
│ auto-merge if tests pass
▼
Argo CD syncs staging
│
soak period
(automated smoke tests,
Datadog monitors green)
│
┌──── Promotion PR ────────────────┐
│ update image tag in │
│ config repo (production) │
│ requires manual approval │
└──────────────────────────────────┘
│
Argo CD syncs production
via Argo Rollouts (canary)
│
metrics healthy?
┌────┴────┐
yes no
│ │
promote rollback
100% (automatic)
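The promotion flow above is a sequence of gates, each of which either advances the release or holds it. The sketch below models those gates as a small decision function in Python. `PipelineState`, `Step`, and `next_step` are hypothetical names for illustration; in the workflow described here the actual gating is done by CI, Argo CD, and Argo Rollouts, not by a script like this.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Step(Enum):
    STOP_CI_FAILED = "stop: CI failed"
    MERGE_STAGING_PR = "auto-merge staging PR"
    WAIT_SOAK = "wait for soak period"
    WAIT_APPROVAL = "wait for manual approval"
    WAIT_CANARY = "wait for canary metrics"
    PROMOTE = "promote to 100%"
    ROLLBACK = "automatic rollback"


@dataclass
class PipelineState:
    ci_passed: bool = False
    staging_synced: bool = False
    soak_green: bool = False
    prod_approved: bool = False
    canary_healthy: Optional[bool] = None  # None while the canary is still running


def image_tag(git_sha):
    """Build the immutable tag format used in the diagram: sha-<git-sha>."""
    return f"sha-{git_sha}"


def next_step(state):
    """Walk the gates in diagram order and return the first one not yet passed."""
    if not state.ci_passed:
        return Step.STOP_CI_FAILED
    if not state.staging_synced:
        return Step.MERGE_STAGING_PR
    if not state.soak_green:
        return Step.WAIT_SOAK
    if not state.prod_approved:
        return Step.WAIT_APPROVAL
    if state.canary_healthy is None:
        return Step.WAIT_CANARY
    return Step.PROMOTE if state.canary_healthy else Step.ROLLBACK


if __name__ == "__main__":
    state = PipelineState(ci_passed=True, staging_synced=True, soak_green=True)
    print(image_tag("abc1234"), "->", next_step(state).value)
```

Tagging by commit SHA rather than a mutable tag keeps every promotion PR traceable to one build, which is what makes the rollback branch of the diagram safe.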
Cross-Cutting Concerns Checklist
Before a service goes to production, the following must be in place. Platform teams often encode this as a Kyverno ClusterPolicy or Backstage scaffolder check.
## Service Production Readiness
### Reliability
- [ ] `resources.requests` and `resources.limits` set on all containers
- [ ] `readinessProbe` and `livenessProbe` configured
- [ ] `startupProbe` for slow-starting services
- [ ] `minReadySeconds: 30` set on Deployment
- [ ] PodDisruptionBudget exists (`minAvailable: 1` or higher)
- [ ] HPA configured (CPU or custom metric)
- [ ] Topology spread constraints for multi-AZ
### Observability
- [ ] ServiceMonitor exists (Prometheus scrape)
- [ ] Structured JSON logging with `level`, `msg`, `trace_id`
- [ ] OpenTelemetry tracing instrumented
- [ ] Grafana dashboard created and linked in Backstage
- [ ] SLO defined in Sloth / PrometheusRule
### Security
- [ ] `securityContext.runAsNonRoot: true`
- [ ] `securityContext.readOnlyRootFilesystem: true`
- [ ] `securityContext.allowPrivilegeEscalation: false`
- [ ] `automountServiceAccountToken: false` (if no API access needed)
- [ ] NetworkPolicy: explicit ingress + egress
- [ ] Secrets via External Secrets Operator (ESO), not hardcoded in manifests
- [ ] Image signed with cosign, verified by Kyverno
### Operations
- [ ] Runbook written and linked in Backstage
- [ ] PagerDuty service exists, routing rules set
- [ ] On-call rotation includes this service
- [ ] Graceful shutdown: `preStop` hook + `terminationGracePeriodSeconds`
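Several items on the checklist above can be verified mechanically before merge. The sketch below checks a subset of them (resources, probes, and the three `securityContext` fields) against a Deployment manifest that has already been parsed into a Python dict. It is a stand-in to show the logic, not a Kyverno policy; `check_deployment` is a hypothetical helper, and the remaining checklist items are deliberately left out.

```python
def check_deployment(manifest):
    """Return a list of findings; an empty list means every covered item passed."""
    findings = []
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    pod_sc = pod_spec.get("securityContext") or {}
    containers = pod_spec.get("containers") or []
    if not containers:
        findings.append("no containers found in spec.template.spec")

    for container in containers:
        name = container.get("name", "<unnamed>")

        # Reliability: requests and limits on every container.
        resources = container.get("resources") or {}
        for key in ("requests", "limits"):
            if not resources.get(key):
                findings.append(f"{name}: resources.{key} not set")

        # Reliability: readiness and liveness probes.
        for probe in ("readinessProbe", "livenessProbe"):
            if not container.get(probe):
                findings.append(f"{name}: {probe} missing")

        # Security: runAsNonRoot may come from the pod or be set per container;
        # the other two fields exist only at container level.
        sc = container.get("securityContext") or {}
        if sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot")) is not True:
            findings.append(f"{name}: runAsNonRoot is not true")
        if sc.get("readOnlyRootFilesystem") is not True:
            findings.append(f"{name}: readOnlyRootFilesystem is not true")
        if sc.get("allowPrivilegeEscalation") is not False:
            findings.append(f"{name}: allowPrivilegeEscalation is not false")

    return findings


if __name__ == "__main__":
    bare = {"spec": {"template": {"spec": {"containers": [{"name": "api"}]}}}}
    for finding in check_deployment(bare):
        print(finding)
```

A check like this suits a pre-merge CI step on the config repo, where it gives developers feedback before the admission-time policy would reject the same manifest.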
Related Sections
- Local Development — running K8s locally with Tilt and Skaffold
- CI/CD Pipelines — GitHub Actions, Tekton, image build and signing
- Helm — chart authoring, release management, library charts
- Kustomize — base/overlay pattern, patches, environment promotion
- Testing Strategies — unit, integration, E2E, chaos testing
- Progressive Delivery — canary, blue-green, Argo Rollouts
- Developer Self-Service — Backstage, Crossplane, golden paths
- Runbooks — authoring standards, templates, automation
- On-Call — rotation setup, alert routing, postmortems
Cross-references: 09 — Production Overview · 08 — GitOps · Progressive Delivery