Team Workflows Overview

#	File	Topic
00	workflows-overview.md	This file — section guide
01	local-development.md	Local dev clusters, Skaffold, Tilt
02	ci-cd-pipelines.md	GitHub Actions, Tekton, ArgoCD
03	helm.md	Helm charts, releases, lifecycle
04	kustomize.md	Kustomize overlays, patches
05	testing-strategies.md	Unit, integration, E2E, chaos
06	progressive-delivery.md	Canary, blue-green, feature flags
07	developer-self-service.md	Platform portals, Backstage, Crossplane
08	runbooks.md	Runbook authoring, templates
09	on-call.md	On-call rotation, escalation, postmortems

The Platform-Developer Contract

The core challenge in team workflows is how ownership and cognitive load are split: platform/SRE teams own the cluster, while application developers own what runs on it. A healthy Kubernetes workflow minimises the surface area where those two worlds collide.

Developer concern:
  "I want to deploy my service, see its logs, and know it's healthy."

Platform concern:
  "I want every deployment to meet security, resource, and reliability standards."

The overlap (where friction lives):
  ┌─────────────────────────────────────────────────┐
  │  Dockerfile  │  resource requests  │  probes    │
  │  RBAC scope  │  NetworkPolicy      │  PDB       │
  │  image tags  │  secrets handling   │  HPA       │
  └─────────────────────────────────────────────────┘

Good workflow design turns the overlap into guardrails, not gatekeepers — developers can move fast while platform policies enforce correctness at admission time.
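
As a rough illustration of a guardrail, the sketch below checks one item from the overlap box (image tags) and returns an actionable message rather than a bare rejection. In a real cluster this logic would normally live in a Kyverno policy or another admission controller; Python is used here only to make the rule concrete, and the function names `image_tag_violation` and `check_pod_images` are hypothetical.

```python
from typing import List, Optional


def image_tag_violation(image: str) -> Optional[str]:
    """Return a readable violation for a mutable image reference, else None."""
    # A digest-pinned reference (repo@sha256:...) is immutable, so accept it.
    if "@sha256:" in image:
        return None
    # Only the last path segment can carry a tag; this avoids mistaking a
    # registry port (localhost:5000/app) for a tag separator.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return f"{image}: no tag given, which resolves to the mutable 'latest'"
    tag = last_segment.split(":", 1)[1]
    if tag == "latest":
        return f"{image}: 'latest' is mutable; pin a version or git-SHA tag"
    return None


def check_pod_images(pod_spec: dict) -> List[str]:
    """Collect violations across initContainers and containers of a pod spec."""
    violations = []
    for key in ("initContainers", "containers"):
        for container in pod_spec.get(key, []):
            problem = image_tag_violation(container.get("image", ""))
            if problem:
                violations.append(f"{container.get('name', '?')}: {problem}")
    return violations
```

The point of returning a message instead of a boolean is that the developer learns what to change without having to ask the platform team.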

Workflow Maturity Model

Level	Characteristics	Typical Pain
L1 — Manual	`kubectl apply` from laptop, shared kubeconfig, no CI	Config drift, "works on my machine"
L2 — CI Deployments	Pipeline runs `kubectl apply`, image tags in manifests, one environment	Hard to trace what's running, no rollback
L3 — GitOps	Argo CD / Flux, separate config repo, environment branches	Environment promotion is manual PR
L4 — Platform	Developer portal, self-service namespaces, policy-as-code, SLO dashboards	Developer onboarding still slow
L5 — Product Platform	Golden paths per workload type, internal developer platform (IDP), automated compliance, cost showback	Maintenance of platform itself

Most teams shipping production Kubernetes should target L3–L4. L5 is worth pursuing when the platform team serves 50+ application teams.

Core Toolchain Map

 Code                   Build                  Deploy                 Operate
 ────                   ─────                  ──────                 ───────
 Git (GitHub/GitLab)  → CI (Actions/Tekton)  → Argo CD / Flux       → kubectl / k9s
 Dockerfile           → BuildKit / Kaniko    → Helm / Kustomize     → Prometheus
 Skaffold / Tilt      → Trivy scan           → Kyverno policies     → Loki / Tempo
 devcontainer         → cosign sign          → Progressive rollout  → PagerDuty

Tool Selection Reference

Category	Primary Choice (mostly OSS)	Alternative (OSS or managed/SaaS)	Notes
Local dev	Tilt, Skaffold	Docker Desktop K8s	Tilt has better live update (hot reload); Skaffold has simpler config
Image build	BuildKit, Kaniko	GitHub Actions buildx	Kaniko for in-cluster builds (no Docker daemon)
Image scan	Trivy	Snyk, Prisma Cloud	Trivy is free, fast, OCI-native
CI	GitHub Actions, Tekton	CircleCI, Buildkite	Tekton is K8s-native but complex
GitOps	Argo CD	Flux	Argo CD has better UI; Flux better multi-tenancy
Packaging	Helm, Kustomize	OCI Helm charts	Helm for versioned releases; Kustomize for env overlays
Progressive delivery	Argo Rollouts	Flagger	Argo Rollouts integrates with Argo CD natively
Policy	Kyverno	OPA/Gatekeeper	Kyverno is simpler YAML-native; OPA more powerful
Developer portal	Backstage	Port, Cortex	Backstage is extensible but heavy to operate

The Golden Path

A golden path is a paved, opinionated workflow that guides developers from code to production without requiring deep Kubernetes expertise. It does not prevent teams from going off-road, but it makes the default path the easiest path.

Minimal Golden Path Components

Service template — a Helm chart or Kustomize base that embeds resource requests, probes, PDB, NetworkPolicy, and HPA defaults. Developers fill in image and config only.
CI template — a reusable GitHub Actions workflow (or Tekton Pipeline) that builds, scans, signs, and pushes the image, then opens a PR to the config repo.
Environment promotion — Argo CD ApplicationSets that automatically sync dev on every merge, and require a PR approval to promote to staging and production.
Observability defaults — ServiceMonitor, log format enforced via Kyverno, and a pre-built Grafana dashboard template linked from the service template.
Developer runbook — a one-page doc per service type (HTTP API / worker / cron) covering: how to deploy, how to roll back, how to read its dashboard, and who to call.
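
To make the "developers fill in image and config only" idea concrete, here is a minimal sketch of how a service template might layer developer input over platform defaults. It assumes a recursive map merge, similar in spirit to how Helm layers values files. `PLATFORM_DEFAULTS`, its keys, its numbers, and `render_values` are all hypothetical placeholders, not recommended settings.

```python
import copy
from typing import Optional

# Hypothetical platform defaults; the values are placeholders, not advice.
PLATFORM_DEFAULTS = {
    "replicas": 2,
    "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}},
    "probes": {"readinessPath": "/healthz", "livenessPath": "/healthz"},
    "pdb": {"minAvailable": 1},
    "hpa": {"minReplicas": 2, "maxReplicas": 10, "targetCPUUtilization": 70},
}


def deep_merge(base: dict, override: dict) -> dict:
    """Return base with override applied recursively; inputs are not mutated."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def render_values(image: str, config: dict,
                  overrides: Optional[dict] = None) -> dict:
    """Build final values: defaults, then optional overrides, then app input."""
    values = deep_merge(PLATFORM_DEFAULTS, overrides or {})
    values["image"] = image
    values["config"] = dict(config)
    return values
```

A team on the default path would call `render_values(image, config)` and inherit every default. A team that needs, say, a higher `maxReplicas` passes a small override instead of forking the template.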

Example: New Service Onboarding in 10 Minutes

# 1. Scaffold from a template (Backstage software template or cookiecutter).
#    Illustrative invocation only: Backstage software templates are normally run
#    from the Backstage UI or the scaffolder API, so treat this command as a
#    placeholder for whatever wrapper your platform provides.
backstage-cli create --template kubernetes-http-service \
  --name payments-api \
  --owner team-payments

# 2. The scaffolder creates:
#   - GitHub repo with Dockerfile, go.mod / package.json
#   - Helm chart in the app repo (or config repo PR)
#   - Argo CD Application pointing at the chart
#   - Namespace + RBAC for team-payments
#   - Prometheus ServiceMonitor (or the Datadog equivalent)
#   - PagerDuty service integration

# 3. The first deploy happens automatically once the first change merges to main and CI completes
# 4. Developer opens Backstage → sees service health, docs, runbook link

GitOps Repository Structure

Two common patterns: monorepo (all teams in one config repo) and polyrepo (one config repo per team or service).

Monorepo Layout (recommended for fewer than 20 services)

k8s-config/
├── clusters/
│   ├── production/
│   │   ├── apps/          # Argo CD Application manifests
│   │   └── infra/         # cluster-level infra (cert-manager, prometheus...)
│   └── staging/
│       ├── apps/
│       └── infra/
├── services/
│   ├── payments-api/
│   │   ├── base/          # Kustomize base or Helm values-base.yaml
│   │   ├── staging/       # overlay / values-staging.yaml
│   │   └── production/    # overlay / values-production.yaml
│   └── auth-service/
│       ├── base/
│       ├── staging/
│       └── production/
└── platform/
    ├── cert-manager/
    ├── prometheus-stack/
    └── kyverno/

Polyrepo Layout (recommended for more than 20 services or strong team autonomy)

# Each team has: app-repo (source code) + config-repo (K8s manifests)
github.com/acme/
├── payments-api/           # source code + Dockerfile
├── payments-api-config/    # K8s manifests, Helm values, Argo CD app
├── auth-service/
├── auth-service-config/
└── platform-config/        # owned by platform team

The tradeoff: a monorepo makes it easier to enforce standards (one CI pipeline, one policy check), while a polyrepo gives teams more autonomy and avoids merge conflicts at scale.

Change Management and Review Standards

Deployment Approval Matrix

Change Type	Who can merge	Required reviews	Deploy path
Config-only (env var, replica count)	App team	1 peer	Auto-sync to staging → manual promote to prod
Application image bump	App team	1 peer	Auto-sync staging → progressive rollout to prod
New service	App team + platform review	2 (1 platform)	Manual gate before prod
Platform infra change	Platform team	2 platform	Staging first, prod after 24h soak
Cluster upgrade	Platform team	2 platform + manager	Change window, full DR drill first
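
The matrix above can be encoded as data so that a merge bot or CI check evaluates it consistently. The sketch below is one possible encoding. The change-type keys, the `ApprovalRule` fields, and the `can_merge` helper are hypothetical names, and "2 platform + manager" is read as two platform reviews plus a separate manager sign-off.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ApprovalRule:
    total_reviews: int             # minimum approving reviews overall
    platform_reviews: int          # how many must come from the platform team
    manager_signoff: bool = False  # extra sign-off on top of the reviews


# Keys are hypothetical labels for the rows of the matrix above.
APPROVAL_MATRIX = {
    "config-only": ApprovalRule(1, 0),
    "image-bump": ApprovalRule(1, 0),
    "new-service": ApprovalRule(2, 1),
    "platform-infra": ApprovalRule(2, 2),
    "cluster-upgrade": ApprovalRule(2, 2, manager_signoff=True),
}


def can_merge(change_type: str, approvers: Iterable[str],
              platform_team: Iterable[str],
              manager_approved: bool = False) -> bool:
    """Check a set of approvers against the rule for the given change type."""
    rule = APPROVAL_MATRIX[change_type]
    unique_approvers = set(approvers)
    platform_count = len(unique_approvers & set(platform_team))
    return (
        len(unique_approvers) >= rule.total_reviews
        and platform_count >= rule.platform_reviews
        and (manager_approved or not rule.manager_signoff)
    )
```

Keeping the rules in a single table means a policy change is a one-line diff that itself goes through review.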

Branch and PR Conventions

# Branches
main          → production (protected, no direct push)
staging       → staging environment (optional; Argo CD can target main with image filter)
feature/...   → developer branches

# PR title convention (enforced by CI lint)
feat(payments-api): add idempotency key support
fix(auth-service): handle token expiry edge case
chore(platform): upgrade cert-manager to v1.16

# Commit message body: include the Jira/Linear ticket
# PR description: link to runbook update if ops behaviour changes
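
A CI lint for the title convention can be as small as one regular expression. The sketch below assumes the `type(scope): summary` shape shown in the examples above. The set of allowed types beyond `feat`, `fix`, and `chore` is an assumption, and `parse_pr_title` is a hypothetical helper name.

```python
import re
from typing import Optional

# type(scope): summary, e.g. "feat(payments-api): add idempotency key support".
# Types other than feat/fix/chore are assumed, not taken from the convention.
TITLE_PATTERN = re.compile(
    r"^(?P<type>feat|fix|chore|docs|refactor|test)"
    r"\((?P<scope>[a-z0-9][a-z0-9-]*)\): "
    r"(?P<summary>\S.*)$"
)


def parse_pr_title(title: str) -> Optional[dict]:
    """Return the parsed parts of a conforming title, or None if it fails."""
    match = TITLE_PATTERN.match(title)
    return match.groupdict() if match else None
```

A CI job would fail the check when `parse_pr_title` returns `None`, and could use the parsed `scope` to route the PR to the owning team.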

Environment Promotion Workflow

                    merge to main
                         │
                         ▼
              ┌─── CI Pipeline ──────────────────┐
              │  build → test → scan → sign       │
              │  push image:sha-<git-sha>         │
              │  open PR: update image tag in     │
              │           config repo (staging)   │
              └───────────────────────────────────┘
                         │ auto-merge if tests pass
                         ▼
               Argo CD syncs staging
                         │
                    soak period
               (automated smoke tests,
                Datadog monitors green)
                         │
              ┌──── Promotion PR ────────────────┐
              │  update image tag in             │
              │  config repo (production)        │
              │  requires manual approval        │
              └──────────────────────────────────┘
                         │
               Argo CD syncs production
               via Argo Rollouts (canary)
                         │
                   metrics healthy?
                    ┌────┴────┐
                   yes       no
                    │         │
                 promote   rollback
                  100%    (automatic)
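
The decision points in the diagram can be summarised as a small function from stage and check results to the next action. This is a sketch of the control flow only, not of how Argo CD or Argo Rollouts implement it. The `Stage` enum, the check names, and the action strings are hypothetical.

```python
from enum import Enum


class Stage(Enum):
    CI = "ci"
    STAGING_SOAK = "staging-soak"
    CANARY = "canary"


def image_tag(git_sha: str) -> str:
    """Build the immutable tag pushed by CI, following sha-<git-sha>."""
    return f"sha-{git_sha}"


def next_action(stage: Stage, checks: dict) -> str:
    """Map the current stage and its check results to the next action."""
    # An empty result set counts as failure, so a missing signal never promotes.
    passed = bool(checks) and all(checks.values())
    if stage is Stage.CI:
        return "auto-merge staging PR" if passed else "block: fix the build"
    if stage is Stage.STAGING_SOAK:
        # Production is never automatic: a green soak only opens the PR.
        return "open promotion PR (manual approval)" if passed else "hold in staging"
    if stage is Stage.CANARY:
        return "promote to 100%" if passed else "automatic rollback"
    raise ValueError(f"unknown stage: {stage!r}")
```

For example, `next_action(Stage.STAGING_SOAK, {"smoke_tests": True, "monitors_green": False})` holds the release in staging, which matches the soak gate in the diagram.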

Cross-Cutting Concerns Checklist

Before a service goes to production, the following must be in place. Platform teams often encode this as a Kyverno ClusterPolicy or Backstage scaffolder check.

## Service Production Readiness

### Reliability
- [ ] `resources.requests` and `resources.limits` set on all containers
- [ ] `readinessProbe` and `livenessProbe` configured
- [ ] `startupProbe` for slow-starting services
- [ ] `minReadySeconds: 30` set on Deployment
- [ ] PodDisruptionBudget exists (`minAvailable: 1` or higher)
- [ ] HPA configured (CPU or custom metric)
- [ ] Topology spread constraints for multi-AZ

### Observability
- [ ] ServiceMonitor exists (Prometheus scrape)
- [ ] Structured JSON logging with `level`, `msg`, `trace_id`
- [ ] OpenTelemetry tracing instrumented
- [ ] Grafana dashboard created and linked in Backstage
- [ ] SLO defined in Sloth / PrometheusRule

### Security
- [ ] `securityContext.runAsNonRoot: true`
- [ ] `securityContext.readOnlyRootFilesystem: true`
- [ ] `securityContext.allowPrivilegeEscalation: false`
- [ ] `automountServiceAccountToken: false` (if no API access needed)
- [ ] NetworkPolicy: explicit ingress + egress
- [ ] Secrets via External Secrets Operator (ESO), not hardcoded in manifests
- [ ] Image signed with cosign, verified by Kyverno

### Operations
- [ ] Runbook written and linked in Backstage
- [ ] PagerDuty service exists, routing rules set
- [ ] On-call rotation includes this service
- [ ] Graceful shutdown: `preStop` hook + `terminationGracePeriodSeconds`
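
Several checklist items can be verified mechanically from a parsed Deployment manifest. The sketch below covers only a subset (resources, probes, `minReadySeconds`, and the three `securityContext` fields) and is meant as a pre-merge helper, not a replacement for an admission policy. `readiness_findings` is a hypothetical function name, and the manifest is assumed to be already loaded into a dict.

```python
from typing import List


def readiness_findings(deployment: dict) -> List[str]:
    """Return unmet checklist items for a Deployment given as a parsed dict."""
    findings = []
    spec = deployment.get("spec", {})
    pod_spec = spec.get("template", {}).get("spec", {})
    pod_sc = pod_spec.get("securityContext", {})

    if spec.get("minReadySeconds") is None:
        findings.append("minReadySeconds not set")

    for container in pod_spec.get("containers", []):
        name = container.get("name", "?")
        resources = container.get("resources", {})
        for field in ("requests", "limits"):
            if not resources.get(field):
                findings.append(f"{name}: resources.{field} missing")
        for probe in ("readinessProbe", "livenessProbe"):
            if probe not in container:
                findings.append(f"{name}: {probe} missing")

        sc = container.get("securityContext", {})
        # runAsNonRoot may be set at pod level; the container value wins.
        if sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot")) is not True:
            findings.append(f"{name}: runAsNonRoot is not true")
        if sc.get("readOnlyRootFilesystem") is not True:
            findings.append(f"{name}: readOnlyRootFilesystem is not true")
        if sc.get("allowPrivilegeEscalation") is not False:
            findings.append(f"{name}: allowPrivilegeEscalation is not false")
    return findings
```

An empty list means the covered items pass. Items such as the runbook link or the PagerDuty routing still need a human or a separate integration check.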

Local Development — running K8s locally with Tilt and Skaffold
CI/CD Pipelines — GitHub Actions, Tekton, image build and signing
Helm — chart authoring, release management, library charts
Kustomize — base/overlay pattern, patches, environment promotion
Testing Strategies — unit, integration, E2E, chaos testing
Progressive Delivery — canary, blue-green, Argo Rollouts
Developer Self-Service — Backstage, Crossplane, golden paths
Runbooks — authoring standards, templates, automation
On-Call — rotation setup, alert routing, postmortems

Cross-references: 09 — Production Overview · 08 — GitOps · Progressive Delivery
