CI/CD Pipelines
- CI/CD Pipeline Architecture
- GitHub Actions for Kubernetes
- Tekton Pipelines
- Container Image Build (BuildKit / Kaniko)
- Image Signing with Cosign
- SBOM Generation
- Registry Push & Tag Strategy
- GitOps Promotion from CI
- Progressive Delivery: Argo Rollouts
- Canary Deployments
- Blue-Green Deployments
- Flagger (Flux Progressive Delivery)
- Pipeline Observability
- Best Practices
1. CI/CD Pipeline Architecture
Complete CI/CD flow:
      Developer
          │── git push / PR ──▶ GitHub / GitLab
          │
┌─────────▼──────────┐
│ CI Pipeline        │
│                    │
│ lint + unit test   │
│ integration test   │
│ build image        │
│ SBOM generation    │
│ vulnerability scan │
│ sign image         │
│ push to registry   │
└─────────┬──────────┘
          │ update image tag in GitOps repo
          ▼
┌─────────────────────┐
│ GitOps Repo (Git)   │
│ services/order-svc  │
│ overlays/staging/   │
│ image: ...v1.4.3    │
└─────────┬───────────┘
          │ Argo CD / Flux detects change
          ▼
┌─────────────────────┐
│ Argo CD / Flux      │
│ sync to staging     │
└─────────┬───────────┘
          │
┌─────────▼──────────┐
│ Progressive CD     │
│ (Argo Rollouts /   │
│ Flagger)           │
│ canary: 10%→50%→   │
│ 100% or rollback   │
└─────────┬──────────┘
          │ promote to production (GitOps PR / auto)
          ▼
     Production
CI vs CD Separation
| Phase | Runs in | Outputs | Tools |
|---|---|---|---|
| CI (Continuous Integration) | CI system (GitHub Actions runners, or Tekton running in a cluster) | Verified image, SBOM, signatures, updated GitOps repo | GitHub Actions, Tekton, GitLab CI, Jenkins |
| CD (Continuous Delivery) | In-cluster (GitOps agent) | Running workload, deployment state | Argo CD, Flux, Argo Rollouts, Flagger |
CI produces an immutable, signed, scanned container image and updates a GitOps repository with the new tag. CD is entirely the responsibility of the in-cluster GitOps agent: CI never runs kubectl apply against production. Because of this separation, CI credentials never touch production clusters.
2. GitHub Actions for Kubernetes
Full CI Workflow — Build, Scan, Sign, Push, Promote
# .github/workflows/ci.yaml
name: CI
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
env:
  REGISTRY: 123456789.dkr.ecr.us-east-1.amazonaws.com
  IMAGE_NAME: order-service
permissions:
  contents: write         # update GitOps repo
  id-token: write         # OIDC for keyless Cosign signing
  packages: write         # push to GHCR (if used)
  security-events: write  # upload SARIF scan results
jobs:
  ci:
    runs-on: ubuntu-latest
    outputs:
      image_tag: ${{ steps.meta.outputs.tags }}
      image_digest: ${{ steps.build.outputs.digest }}
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      # ── Lint & Test ─────────────────────────────────────────
      - name: Set up Go
        uses: actions/setup-go@v5
        with:
          go-version: "1.22"
          cache: true
      - name: Lint
        uses: golangci/golangci-lint-action@v6
        with:
          version: v1.59
      - name: Unit tests
        run: go test ./... -race -coverprofile=coverage.out
      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          file: coverage.out
      # ── Docker Build & Push ──────────────────────────────────
      - name: Configure AWS credentials (OIDC, no long-lived keys)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/github-actions-ecr
          aws-region: us-east-1
      - name: Login to ECR
        id: ecr-login
        uses: aws-actions/amazon-ecr-login@v2
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Extract metadata (tags, labels)
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=,suffix=,format=short   # short git SHA
            type=semver,pattern={{version}}         # 1.4.2 when a v1.4.2 tag is pushed
            type=raw,value=latest,enable=${{ github.ref == 'refs/heads/main' }}
      - name: Build and push
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          platforms: linux/amd64,linux/arm64
          push: ${{ github.event_name != 'pull_request' }}
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
          provenance: true   # generate SLSA provenance attestation
          sbom: true         # generate SBOM attestation
      # ── Vulnerability Scan ───────────────────────────────────
      - name: Scan image with Trivy
        # Nothing is pushed on pull requests, so there is no registry image to scan
        if: github.event_name != 'pull_request'
        uses: aquasecurity/trivy-action@0.24.0   # pinned release instead of floating @master
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}@${{ steps.build.outputs.digest }}
          format: sarif
          output: trivy-results.sarif
          severity: CRITICAL,HIGH
          exit-code: 1   # fail on CRITICAL/HIGH
      - name: Upload Trivy SARIF to GitHub Security
        uses: github/codeql-action/upload-sarif@v3
        # always() keeps the upload running when the scan step fails the job
        if: always() && github.event_name != 'pull_request'
        with:
          sarif_file: trivy-results.sarif
      # ── Cosign Keyless Signing ───────────────────────────────
      - name: Install Cosign
        uses: sigstore/cosign-installer@v3
      - name: Sign image (keyless OIDC)
        if: github.event_name != 'pull_request'
        # Cosign v2+ signs keylessly by default; COSIGN_EXPERIMENTAL is no longer required
        run: |
          cosign sign --yes \
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}@${{ steps.build.outputs.digest }}
      # ── GitOps Promotion ─────────────────────────────────────
      - name: Update image tag in GitOps repo (staging)
        if: github.event_name != 'pull_request'
        env:
          GIT_SHA: ${{ github.sha }}
        run: |
          SHORT_SHA="${GIT_SHA:0:7}"
          git clone https://x-access-token:${{ secrets.GITOPS_TOKEN }}@github.com/myorg/platform-gitops.git
          cd platform-gitops
          yq eval -i ".images[0].newTag = \"${SHORT_SHA}\"" \
            services/order-service/overlays/staging/kustomization.yaml
          git config user.email "ci-bot@example.com"
          git config user.name "CI Bot"
          git add services/order-service/overlays/staging/kustomization.yaml
          git commit -m "chore: update order-service to ${SHORT_SHA} in staging"
          git push
GitHub Actions: Reusable Workflows
# .github/workflows/reusable-build.yaml (platform team maintains)
name: Reusable Build & Push
on:
  workflow_call:
    inputs:
      image-name:
        required: true
        type: string
      dockerfile:
        required: false
        type: string
        default: Dockerfile
    outputs:
      digest:
        description: "Image digest"
        value: ${{ jobs.build.outputs.digest }}
      tag:
        description: "Short SHA tag"
        value: ${{ jobs.build.outputs.tag }}
    secrets:
      ecr-role-arn:
        required: true
jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      digest: ${{ steps.build.outputs.digest }}
      tag: ${{ steps.tag.outputs.value }}
    steps:
      # ... (same build/scan/sign steps as above)

# Usage in a service repo:
# .github/workflows/ci.yaml
jobs:
  build:
    uses: myorg/.github/.github/workflows/reusable-build.yaml@main
    with:
      image-name: order-service
    secrets:
      ecr-role-arn: ${{ secrets.ECR_ROLE_ARN }}
3. Tekton Pipelines
Tekton is a Kubernetes-native CI/CD framework. Pipelines run as Pods in the cluster, so they use cluster RBAC and service accounts and can reach internal cluster resources (useful for integration tests against real services).
Tekton Object Hierarchy
Task            → defines a set of Steps (containers run sequentially in a Pod)
Pipeline        → defines a sequence/graph of Tasks
PipelineRun     → instantiation of a Pipeline with parameters and workspaces
TaskRun         → instantiation of a Task
Workspace       → shared volume between steps (PVC / emptyDir / Secret / ConfigMap)
TriggerTemplate → creates PipelineRun on events (webhooks)
EventListener   → HTTP endpoint that receives webhooks and fires TriggerTemplates
Task: Run Tests
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: go-test
spec:
  params:
    - name: package
      type: string
      default: "./..."
  workspaces:
    - name: source
  steps:
    - name: unit-test
      # Debian-based image: the race detector needs cgo and a C toolchain,
      # which the alpine variant does not ship
      image: golang:1.22
      workingDir: $(workspaces.source.path)
      env:
        - name: GOFLAGS
          value: "-mod=vendor"
        - name: CGO_ENABLED
          value: "1"   # go test -race refuses to run with CGO_ENABLED=0
      script: |
        #!/bin/sh
        set -ex
        # Not piped through tee: in plain sh a pipeline returns tee's exit
        # status, which would hide failing tests from Tekton
        go test $(params.package) \
          -race \
          -coverprofile=/tmp/coverage.out \
          -v
        go tool cover -html=/tmp/coverage.out -o /tmp/coverage.html
    - name: lint
      image: golangci/golangci-lint:v1.59-alpine
      workingDir: $(workspaces.source.path)
      script: |
        golangci-lint run --timeout 5m
Task: Build and Push with Kaniko
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: kaniko-build
spec:
  params:
    - name: IMAGE
      description: Full image name with registry
    - name: DOCKERFILE
      default: Dockerfile
    - name: CONTEXT
      default: "."
  workspaces:
    - name: source
    - name: docker-credentials
      optional: true
  results:
    - name: IMAGE_DIGEST
      description: Digest of the built image
    - name: IMAGE_URL
      description: Full image URL with digest
  steps:
    - name: build-and-push
      image: gcr.io/kaniko-project/executor:v1.23.0-debug
      workingDir: $(workspaces.source.path)
      env:
        - name: DOCKER_CONFIG
          value: /kaniko/.docker
      command:
        - /kaniko/executor
      args:
        - --dockerfile=$(params.DOCKERFILE)
        - --context=$(params.CONTEXT)
        - --destination=$(params.IMAGE)
        - --cache=true
        - --cache-repo=$(params.IMAGE)-cache
        - --compressed-caching=false
        - --snapshot-mode=redo
        - --use-new-run
        - --digest-file=/tekton/results/IMAGE_DIGEST
      volumeMounts:
        - name: docker-config
          mountPath: /kaniko/.docker
  volumes:
    - name: docker-config
      projected:
        sources:
          - secret:
              name: ecr-credentials
Full Pipeline: CI for Go Service
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: go-service-ci
spec:
  params:
    - name: git-url
    - name: git-revision
      default: main
    - name: image-name
    - name: gitops-repo-url
    - name: gitops-path
  workspaces:
    - name: shared-data   # Git clone workspace
    - name: git-credentials
  tasks:
    - name: clone
      taskRef:
        resolver: hub
        params:
          - name: catalog
            value: tekton-catalog-tasks   # Artifact Hub catalog that hosts Tasks
          - name: type
            value: artifact
          - name: kind
            value: task
          - name: name
            value: git-clone
          - name: version
            value: "0.9"
      workspaces:
        - name: output
          workspace: shared-data
        - name: ssh-directory
          workspace: git-credentials
      params:
        - name: url
          value: $(params.git-url)
        - name: revision
          value: $(params.git-revision)
    - name: test
      runAfter: [clone]
      taskRef:
        name: go-test
      workspaces:
        - name: source
          workspace: shared-data
    - name: build
      runAfter: [test]
      taskRef:
        name: kaniko-build
      workspaces:
        - name: source
          workspace: shared-data
      params:
        - name: IMAGE
          value: $(params.image-name):$(tasks.clone.results.commit)
    - name: scan
      runAfter: [build]
      taskRef:
        name: trivy-scan
      params:
        - name: IMAGE
          value: $(params.image-name)@$(tasks.build.results.IMAGE_DIGEST)
        - name: SEVERITY
          value: "CRITICAL,HIGH"
        - name: EXIT_CODE
          value: "1"
    - name: sign
      runAfter: [scan]
      taskRef:
        name: cosign-sign
      params:
        - name: IMAGE
          value: $(params.image-name)@$(tasks.build.results.IMAGE_DIGEST)
    - name: update-gitops
      runAfter: [sign]
      taskRef:
        name: git-update-deployment
      workspaces:
        - name: source
          workspace: shared-data
        - name: ssh-directory
          workspace: git-credentials
      params:
        - name: GIT_REPOSITORY
          value: $(params.gitops-repo-url)
        - name: GIT_PATH_FILES
          value: $(params.gitops-path)
        - name: NEW_TAG
          value: $(tasks.clone.results.commit)
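The runAfter entries in the Pipeline above form a dependency graph, and a Task can only start once every Task it runs after has finished. The following Python sketch is not part of Tekton; it only illustrates how that ordering resolves. The names RUN_AFTER and execution_order are hypothetical, and the mapping is copied by hand from the Pipeline.

```python
from graphlib import TopologicalSorter

# Task name -> Tasks listed in its runAfter (copied from the Pipeline above).
RUN_AFTER = {
    "clone": [],
    "test": ["clone"],
    "build": ["test"],
    "scan": ["build"],
    "sign": ["scan"],
    "update-gitops": ["sign"],
}


def execution_order(run_after: dict[str, list[str]]) -> list[str]:
    """Return one valid execution order for the task graph.

    Raises graphlib.CycleError if the runAfter references form a cycle,
    which is the kind of definition a pipeline engine has to reject.
    """
    return list(TopologicalSorter(run_after).static_order())


if __name__ == "__main__":
    print(" -> ".join(execution_order(RUN_AFTER)))
```

Because this Pipeline is a straight chain, only one order is possible. Tasks with no path between them (for example a lint Task that also runs after clone) could run in parallel.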
EventListener + Trigger — Webhook-based PipelineRun
apiVersion: triggers.tekton.dev/v1beta1
kind: EventListener
metadata:
  name: github-webhook
  namespace: tekton-pipelines
spec:
  serviceAccountName: tekton-triggers-sa
  triggers:
    - name: github-push
      interceptors:
        - ref:
            name: github
          params:
            - name: secretRef
              value:
                secretName: github-webhook-secret
                secretKey: token
            - name: eventTypes
              value: ["push"]
        - ref:
            name: cel
          params:
            - name: filter
              value: "body.ref == 'refs/heads/main'"
      bindings:
        - ref: github-push-binding
      template:
        ref: github-push-template
---
apiVersion: triggers.tekton.dev/v1beta1
kind: TriggerBinding
metadata:
  name: github-push-binding
spec:
  params:
    - name: gitrevision
      value: $(body.head_commit.id)
    - name: gitrepositoryurl
      value: $(body.repository.clone_url)
---
apiVersion: triggers.tekton.dev/v1beta1
kind: TriggerTemplate
metadata:
  name: github-push-template
spec:
  params:
    - name: gitrevision
    - name: gitrepositoryurl
  resourcetemplates:
    - apiVersion: tekton.dev/v1
      kind: PipelineRun
      metadata:
        generateName: go-service-ci-run-
      spec:
        pipelineRef:
          name: go-service-ci
        params:
          - name: git-url
            value: $(tt.params.gitrepositoryurl)
          - name: git-revision
            value: $(tt.params.gitrevision)
          - name: image-name
            value: 123456789.dkr.ecr.us-east-1.amazonaws.com/order-service
          # The Pipeline declares these two params without defaults, so the run must set them.
          # Values assume the GitOps repo and staging overlay used elsewhere in this guide.
          - name: gitops-repo-url
            value: git@github.com:myorg/platform-gitops.git
          - name: gitops-path
            value: services/order-service/overlays/staging
        workspaces:
          - name: shared-data
            volumeClaimTemplate:
              spec:
                accessModes: [ReadWriteOnce]
                resources:
                  requests:
                    storage: 1Gi
          - name: git-credentials
            secret:
              secretName: github-ssh-key
4. Container Image Build (BuildKit / Kaniko)
Production Dockerfile — Multi-Stage, Distroless, Non-Root
# ── Stage 1: Build ──────────────────────────────────────────────────
FROM golang:1.22-alpine AS builder
# Set by BuildKit for each entry in --platform; the defaults below cover
# builders that do not provide them
ARG TARGETOS
ARG TARGETARCH
# Install build deps
RUN apk add --no-cache git ca-certificates tzdata
WORKDIR /app
# Copy go.mod first for layer caching
COPY go.mod go.sum ./
RUN go mod download
COPY . .
# Build with CGO disabled, trimpath for reproducible builds.
# GOARCH follows the target platform: a hard-coded amd64 would put an amd64
# binary inside the linux/arm64 image of a multi-platform build.
RUN CGO_ENABLED=0 GOOS=${TARGETOS:-linux} GOARCH=${TARGETARCH:-amd64} go build \
    -trimpath \
    -ldflags="-s -w \
      -X main.version=$(git describe --tags --always 2>/dev/null || echo 'dev') \
      -X main.commitSHA=$(git rev-parse --short HEAD 2>/dev/null || echo 'unknown')" \
    -o /app/server ./cmd/server
# ── Stage 2: Runtime ─────────────────────────────────────────────────
# Use distroless: no shell, no package manager, minimal attack surface
FROM gcr.io/distroless/static-debian12:nonroot
COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/
COPY --from=builder /usr/share/zoneinfo /usr/share/zoneinfo
COPY --from=builder /app/server /app/server
# Run as non-root user (distroless nonroot = UID 65532)
USER nonroot:nonroot
EXPOSE 8080 6060
ENTRYPOINT ["/app/server"]
BuildKit vs Kaniko
| Feature | BuildKit (docker buildx) | Kaniko |
|---|---|---|
| Docker daemon required | No (rootless mode) | No |
| Runs in Kubernetes | Yes (DinD or rootless) | Yes (single pod, no daemon) |
| Cache | Registry cache, local, GitHub Actions cache | Registry cache (--cache-repo) |
| Multi-platform | Yes (QEMU emulation or native builders) | Per-arch build only (no cross-arch) |
| Security | Rootless available; DinD needs privileged | Runs in userspace; no privileged required |
| Speed | Fast (parallel layer build, smart caching) | Slower (sequential layers, no parallelism) |
| Best for | GitHub Actions, external CI, fast builds | Kubernetes-native CI (Tekton), air-gapped |
5. Image Signing with Cosign
Cosign (part of Sigstore) provides keyless container image signing using ephemeral keys anchored to OIDC identity. In CI, the GitHub Actions OIDC token or Tekton ServiceAccount token proves the pipeline's identity — no long-lived signing keys to rotate.
Keyless Signing in CI
# Keyless signing: uses OIDC identity (no key management)
# In GitHub Actions (Cosign v2+ signs keylessly by default, so COSIGN_EXPERIMENTAL=1 is no longer needed):
cosign sign --yes \
  --rekor-url https://rekor.sigstore.dev \
  123456789.dkr.ecr.us-east-1.amazonaws.com/order-service@sha256:abc123...
# The signature is stored in the OCI registry as a separate artifact:
# 123456789.dkr.ecr.us-east-1.amazonaws.com/order-service:sha256-abc123....sig
# Verify signature (in admission webhook or manually).
# The certificate identity of a GitHub Actions run is the workflow path plus the git ref:
cosign verify \
  --certificate-identity="https://github.com/myorg/order-service/.github/workflows/ci.yaml@refs/heads/main" \
  --certificate-oidc-issuer=https://token.actions.githubusercontent.com \
  123456789.dkr.ecr.us-east-1.amazonaws.com/order-service@sha256:abc123...
Key-based Signing (for air-gapped / private Rekor)
# Generate signing key pair (run once, store private key in Vault)
cosign generate-key-pair
# Sign with private key
cosign sign --key cosign.key \
--annotations "git-sha=${GITHUB_SHA}" \
--annotations "pipeline-url=${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}/actions/runs/${GITHUB_RUN_ID}" \
123456789.dkr.ecr.us-east-1.amazonaws.com/order-service:${TAG}
# Verify with public key (admission webhook uses cosign.pub)
cosign verify --key cosign.pub \
123456789.dkr.ecr.us-east-1.amazonaws.com/order-service:${TAG}
Enforcing Signatures via Kyverno Policy
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signature
spec:
  validationFailureAction: Enforce
  background: false
  rules:
    - name: check-image-signature
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces: ["production", "staging"]
      verifyImages:
        - imageReferences:
            - "123456789.dkr.ecr.us-east-1.amazonaws.com/*"
          attestors:
            - count: 1
              entries:
                - keyless:
                    subject: "https://github.com/myorg/*/.github/workflows/ci.yaml@refs/heads/main"
                    issuer: "https://token.actions.githubusercontent.com"
                    rekor:
                      url: https://rekor.sigstore.dev
          mutateDigest: true   # replace :tag with @sha256 digest
          verifyDigest: true   # reject images without digest
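The subject in the policy above is a wildcard pattern over the signing identity, which for GitHub Actions is the workflow file path followed by the git ref. The Python sketch below shows which identities such a pattern would admit. It approximates the wildcard with fnmatch from the standard library, which is an assumption and not how Kyverno implements matching; workflow_identity and subject_allowed are hypothetical helper names.

```python
from fnmatch import fnmatchcase

# Subject pattern copied from the ClusterPolicy above.
SUBJECT_PATTERN = (
    "https://github.com/myorg/*/.github/workflows/ci.yaml@refs/heads/main"
)


def workflow_identity(org: str, repo: str, workflow_file: str, ref: str) -> str:
    """Build the identity string a GitHub Actions keyless signature carries."""
    return f"https://github.com/{org}/{repo}/.github/workflows/{workflow_file}@{ref}"


def subject_allowed(identity: str, pattern: str = SUBJECT_PATTERN) -> bool:
    """Case-sensitive glob match; '*' stands for any run of characters."""
    return fnmatchcase(identity, pattern)


if __name__ == "__main__":
    for ref in ("refs/heads/main", "refs/heads/feature-x"):
        ident = workflow_identity("myorg", "order-service", "ci.yaml", ref)
        print(ref, "->", subject_allowed(ident))
```

Under this approximation, an image signed from a feature branch or from another organisation is rejected, while any repository under myorg that signs from ci.yaml on main is accepted.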
6. SBOM Generation
A Software Bill of Materials (SBOM) lists every package and dependency in a container image. SBOMs enable vulnerability auditing, license compliance checking, and rapid incident response ("which images contain log4j?").
# Generate SBOM with Syft (SPDX or CycloneDX format)
syft 123456789.dkr.ecr.us-east-1.amazonaws.com/order-service@sha256:abc123 \
-o spdx-json=sbom.spdx.json \
-o cyclonedx-json=sbom.cdx.json
# Attest SBOM to the image (stored in OCI registry alongside image)
cosign attest --yes \
--predicate sbom.spdx.json \
--type spdxjson \
123456789.dkr.ecr.us-east-1.amazonaws.com/order-service@sha256:abc123
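# Verify and extract SBOM from image.
# The attestation above was created keylessly, so verify it by identity rather than with --key:
cosign verify-attestation \
  --type spdxjson \
  --certificate-identity="https://github.com/myorg/order-service/.github/workflows/ci.yaml@refs/heads/main" \
  --certificate-oidc-issuer=https://token.actions.githubusercontent.com \
  123456789.dkr.ecr.us-east-1.amazonaws.com/order-service@sha256:abc123 \
  | jq '.payload | @base64d | fromjson | .predicate'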
# Audit SBOM for known vulnerabilities (Grype):
grype sbom:sbom.spdx.json --fail-on high
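When using docker buildx build --sbom=true --provenance=true, Docker BuildKit generates an SPDX SBOM and SLSA provenance attestation automatically and attaches them to the image index. They are stored in the registry next to the image and can be inspected with docker buildx imagetools inspect or Docker Scout. BuildKit does not sign these attestations, so use cosign attest as shown above when you need attestations that can be cryptographically verified.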
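The incident-response question from the start of this section ("which images contain log4j?") comes down to searching the package list of each stored SBOM. The Python sketch below assumes SPDX 2.x JSON as produced by the syft command above, where each entry under packages carries name and versionInfo. The functions load_sbom and find_packages and the sample data are illustrative only.

```python
import json


def load_sbom(path: str) -> dict:
    """Read an SPDX JSON document such as sbom.spdx.json from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def find_packages(spdx_doc: dict, needle: str) -> list[tuple[str, str]]:
    """Return (name, version) for every package whose name contains needle."""
    needle = needle.lower()
    hits = []
    for pkg in spdx_doc.get("packages", []):
        name = pkg.get("name", "")
        if needle in name.lower():
            hits.append((name, pkg.get("versionInfo", "unknown")))
    return hits


# Hand-written sample in the shape of an SPDX 2.x document (not real scan output).
SAMPLE_SBOM = {
    "spdxVersion": "SPDX-2.3",
    "packages": [
        {"name": "log4j-core", "versionInfo": "2.14.1"},
        {"name": "golang.org/x/net", "versionInfo": "v0.24.0"},
        {"name": "busybox"},
    ],
}

if __name__ == "__main__":
    print(find_packages(SAMPLE_SBOM, "log4j"))
```

Running the same search over the SBOM of every deployed digest gives the list of affected images without pulling or rescanning them.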
7. Registry Push & Tag Strategy
Image Tag Strategy
| Tag | Format | Mutable | Use Case |
|---|---|---|---|
| Git SHA | abc1234 (7-char) | No | Primary deployment tag — immutable, auditable |
| Semver | v1.4.2 | No (once pushed) | Release artifacts; Helm chart appVersion |
| latest | latest | Yes | Development only — never deploy with :latest in production |
| Branch | main, pr-123 | Yes | Preview / feature branches; CI testing only |
| Date + SHA | 20240115-abc1234 | No | Sortable + unique; used by Flux alphabetical ImagePolicy |
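Never deploy :latest to production
:latest is a mutable tag: it points to a different image after every push, so two pods created from the same manifest can run different code and a rollback cannot name the previous version. With imagePullPolicy: IfNotPresent, each node keeps running whatever copy it cached. With imagePullPolicy: Always (the Kubernetes default when the tag is :latest), every pod start contacts the registry, which adds latency, and a broken push is picked up by every pod that restarts or scales up anywhere in the cluster. Use immutable SHA-based tags.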
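As a small illustration of the tag table above, the Python sketch below derives the two immutable tag formats from a commit SHA and a build time, and flags tags that should not reach production. The function names and the MUTABLE_TAGS set are assumptions made for this example; a real pipeline would normally let docker/metadata-action produce the tags.

```python
from datetime import datetime, timezone

# Tags from the table that can move to a different image after a push.
MUTABLE_TAGS = {"latest", "main"}


def immutable_tags(commit_sha: str, built_at: datetime) -> dict[str, str]:
    """Return the short-SHA tag and the sortable date+SHA tag for a build."""
    if len(commit_sha) < 7:
        raise ValueError("expected a full git commit SHA")
    short = commit_sha[:7]
    return {"sha": short, "date_sha": f"{built_at:%Y%m%d}-{short}"}


def is_deployable(tag: str) -> bool:
    """Reject mutable tags (latest, branch names, pr-* previews)."""
    return tag not in MUTABLE_TAGS and not tag.startswith("pr-")


if __name__ == "__main__":
    tags = immutable_tags(
        "abc1234def5678900000000000000000000000aa",
        datetime(2024, 1, 15, tzinfo=timezone.utc),
    )
    print(tags, is_deployable("latest"))
```

The date prefix is what makes the date+SHA form sort chronologically as plain text, which is the property an alphabetical image policy relies on.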
# ECR lifecycle policy: clean up old images, keep release tags and recent builds.
# Rules are evaluated in priority order; an image matched by an earlier rule is not
# expired by a later one, and the tagStatus "any" rule must have the highest priority number.
aws ecr put-lifecycle-policy \
  --repository-name order-service \
  --lifecycle-policy-text '{
    "rules": [
      {
        "rulePriority": 1,
        "description": "Keep semver tags (effectively forever: expires only beyond 9999 images)",
        "selection": {
          "tagStatus": "tagged",
          "tagPrefixList": ["v"],
          "countType": "imageCountMoreThan",
          "countNumber": 9999
        },
        "action": {"type": "expire"}
      },
      {
        "rulePriority": 2,
        "description": "Expire untagged images after 7 days",
        "selection": {
          "tagStatus": "untagged",
          "countType": "sinceImagePushed",
          "countUnit": "days",
          "countNumber": 7
        },
        "action": {"type": "expire"}
      },
      {
        "rulePriority": 3,
        "description": "Keep the 50 most recent remaining images (SHA tags have no common prefix to match on)",
        "selection": {
          "tagStatus": "any",
          "countType": "imageCountMoreThan",
          "countNumber": 50
        },
        "action": {"type": "expire"}
      }
    ]
  }'
8. GitOps Promotion from CI
#!/bin/bash
# Complete GitOps promotion script (used in CI after image push + sign)
# Expects GITOPS_TOKEN, REGISTRY, IMAGE_DIGEST and the GITHUB_* variables in the environment
set -euo pipefail
GITOPS_REPO="https://x-access-token:${GITOPS_TOKEN}@github.com/myorg/platform-gitops.git"
SERVICE="order-service"
NEW_TAG="${GITHUB_SHA:0:7}"
ENV="${1:-staging}" # default to staging; pass 'production' for prod promotion
git clone "${GITOPS_REPO}" /tmp/gitops
cd /tmp/gitops
# Update image tag in Kustomize overlay.
# The parentheses select the newTag field of the matching image, then assign to it
# (yq v4 treats * in the == operand as a wildcard).
yq eval -i "(.images[] | select(.name == \"*/${SERVICE}\") | .newTag) = \"${NEW_TAG}\"" \
  "services/${SERVICE}/overlays/${ENV}/kustomization.yaml"
# Verify the change looks correct
git diff
git config user.email "ci-bot@myorg.com"
git config user.name "CI Bot"
git add "services/${SERVICE}/overlays/${ENV}/"
git commit -m "chore(${ENV}): update ${SERVICE} to ${NEW_TAG}

Source: ${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}/actions/runs/${GITHUB_RUN_ID}
Image: ${REGISTRY}/${SERVICE}@${IMAGE_DIGEST}"
# Retry push in case of concurrent commits; fail the job if every attempt fails
pushed=false
for i in 1 2 3; do
  if git pull --rebase origin main && git push; then
    pushed=true
    break
  fi
  sleep $((i * 5))
done
if [ "${pushed}" != true ]; then
  echo "git push failed after 3 attempts" >&2
  exit 1
fi
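The core of the promotion script is a single edit: find the image entry for the service in the overlay's kustomization.yaml and replace its newTag. The Python sketch below shows that edit on an already-parsed document, so the matching rule is easy to see and test. It is not a replacement for the yq command; set_image_tag is a hypothetical helper, and YAML parsing is left out because the standard library has no YAML module.

```python
def set_image_tag(kustomization: dict, service: str, new_tag: str) -> int:
    """Set newTag on every images[] entry whose name is the service.

    An entry matches when its name equals the service or ends with
    "/<service>", mirroring the "*/order-service" wildcard in the script.
    Returns the number of entries updated so the caller can fail on 0.
    """
    updated = 0
    for image in kustomization.get("images", []):
        name = image.get("name", "")
        if name == service or name.endswith("/" + service):
            image["newTag"] = new_tag
            updated += 1
    return updated


if __name__ == "__main__":
    overlay = {
        "resources": ["../../base"],
        "images": [
            {
                "name": "123456789.dkr.ecr.us-east-1.amazonaws.com/order-service",
                "newTag": "abc1234",
            },
            {"name": "docker.io/library/redis", "newTag": "7.2"},
        ],
    }
    changed = set_image_tag(overlay, "order-service", "def5678")
    print(changed, overlay["images"][0]["newTag"])
```

Checking the returned count guards against a silent no-op commit when the image name in the overlay does not match the service name.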
9. Progressive Delivery: Argo Rollouts
Argo Rollouts extends Kubernetes with advanced deployment strategies — canary, blue-green, and experiment-based — with automated metric analysis gates that roll back if error rate or latency thresholds are breached.
Install
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts \
-f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
# kubectl plugin
kubectl argo rollouts version
Rollout CRD (replaces Deployment)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: order-service
  namespace: order-service
spec:
  replicas: 10
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: 123456789.dkr.ecr.us-east-1.amazonaws.com/order-service:abc1234
          ports:
            - containerPort: 8080
          resources:
            requests: {cpu: 250m, memory: 256Mi}
            limits: {memory: 512Mi}
          readinessProbe:
            httpGet: {path: /health/ready, port: 8080}
            initialDelaySeconds: 5
            periodSeconds: 5
  strategy:
    canary:
      # Traffic management via NGINX Ingress (header-based or weight-based)
      canaryService: order-service-canary
      stableService: order-service-stable
      trafficRouting:
        nginx:
          stableIngress: order-service-ingress
          annotationPrefix: nginx.ingress.kubernetes.io
          additionalIngressAnnotations:
            canary-by-header: X-Canary
      # Canary rollout steps
      steps:
        - setWeight: 5            # 5% traffic to canary
        - pause: {duration: 5m}   # wait 5 minutes
        - analysis:               # automated metric analysis
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: order-service-canary
        - setWeight: 20
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {}               # manual approval (indefinite pause)
        - setWeight: 100
      # Analysis during rollout (continuous background check)
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1           # start analysis at step 1
        args:
          - name: service-name
            value: order-service-canary
AnalysisTemplate — Metric Gates
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
  namespace: order-service
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.95   # 95% success rate required
      failureLimit: 3                       # tolerate 3 failed measurements; the 4th aborts
      provider:
        prometheus:
          address: http://prometheus.observability.svc:9090
          query: |
            sum(rate(
              http_requests_total{
                app="{{args.service-name}}",
                status!~"5.."
              }[5m]
            )) /
            sum(rate(
              http_requests_total{
                app="{{args.service-name}}"
              }[5m]
            ))
    - name: latency-p99
      interval: 1m
      successCondition: result[0] < 0.5     # p99 < 500ms
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.observability.svc:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(
                http_request_duration_seconds_bucket{
                  app="{{args.service-name}}"
                }[5m]
              )) by (le)
            )
    - name: error-rate
      interval: 1m
      successCondition: result[0] < 0.01    # error rate < 1%
      failureCondition: result[0] >= 0.05   # abort at 5% errors or more; 1% to 5% is inconclusive
      provider:
        prometheus:
          address: http://prometheus.observability.svc:9090
          query: |
            sum(rate(
              http_requests_total{app="{{args.service-name}}",status=~"5.."}[5m]
            )) /
            sum(rate(
              http_requests_total{app="{{args.service-name}}"}[5m]
            ))
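To make the interaction of successCondition, failureCondition and failureLimit concrete, the Python sketch below models how a series of measurements could be turned into a verdict. It is a simplified model written for this guide, not the Argo Rollouts implementation: it assumes a metric fails once the number of failed measurements exceeds failureLimit, and that a value satisfying neither condition counts as inconclusive. The name evaluate_metric and its parameters are hypothetical.

```python
from typing import Callable, Iterable, Optional

Condition = Callable[[float], bool]


def evaluate_metric(
    results: Iterable[float],
    success: Condition,
    failure: Optional[Condition] = None,
    failure_limit: int = 0,
    inconclusive_limit: int = 0,
) -> str:
    """Return "Successful", "Failed" or "Inconclusive" for a metric.

    Each value is one measurement (one query result per interval).
    With only a success condition, any value that misses it is a failure.
    With both conditions, a value that meets neither is inconclusive.
    """
    failed = 0
    inconclusive = 0
    for value in results:
        if failure is not None and failure(value):
            failed += 1
        elif success(value):
            pass  # measurement passed
        elif failure is None:
            failed += 1
        else:
            inconclusive += 1
        if failed > failure_limit:
            return "Failed"
        if inconclusive > inconclusive_limit:
            return "Inconclusive"
    return "Successful"


if __name__ == "__main__":
    # success-rate metric: three dips below 95% are tolerated with failureLimit 3.
    print(evaluate_metric([0.99, 0.90, 0.91, 0.92, 0.99],
                          success=lambda v: v >= 0.95, failure_limit=3))
    # error-rate metric: 3% errors is neither below 1% nor at or above 5%.
    print(evaluate_metric([0.03], success=lambda v: v < 0.01,
                          failure=lambda v: v >= 0.05))
```

In this model a 3% error rate does not trigger a rollback on its own; it leaves the analysis inconclusive, which is worth keeping in mind when choosing the gap between the two thresholds.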
Rollout CLI Commands
# Watch rollout progress
kubectl argo rollouts get rollout order-service --watch
# Pause a rollout at current step
kubectl argo rollouts pause order-service
# Promote (advance past a manual pause step)
kubectl argo rollouts promote order-service
# Abort and roll back to stable
kubectl argo rollouts abort order-service
# Retry after abort
kubectl argo rollouts retry rollout order-service
# Set image (trigger new rollout)
kubectl argo rollouts set image order-service \
order-service=123456789.dkr.ecr.us-east-1.amazonaws.com/order-service:abc5678
# Undo to previous version
kubectl argo rollouts undo order-service
10. Canary Deployments in Detail
Traffic Splitting Options
| Method | Mechanism | Granularity | Requires |
|---|---|---|---|
| Pod-count canary | N canary pods out of total → N% traffic | Coarse (1/10 = 10%) | Nothing extra (default Argo Rollouts) |
| NGINX Ingress weight | canary-weight annotation | 1% granularity | NGINX Ingress Controller |
| Istio VirtualService | HTTP route destination weights | 1% granularity | Istio service mesh |
| AWS ALB weighted target groups | Listener rule weights | 1% granularity | AWS ALB Controller |
| Header-based | Route specific users to canary by header | 0 or 100% | NGINX / Istio |
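The "coarse" granularity of the pod-count method in the table follows from simple arithmetic: without a traffic router, the share of traffic the canary receives is roughly the share of pods it owns. The Python sketch below works this out under two assumptions that are not stated in the table: the canary replica count is rounded up, and requests are spread evenly across pods. The function name pod_count_canary is illustrative.

```python
def pod_count_canary(replicas: int, weight_percent: int) -> tuple[int, float]:
    """Return (canary pods, effective traffic %) for a requested weight.

    Assumes the canary replica count is rounded up and that every pod
    receives an equal share of requests.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if not 0 <= weight_percent <= 100:
        raise ValueError("weight_percent must be between 0 and 100")
    # Integer ceiling of replicas * weight / 100, avoiding float rounding.
    canary_pods = (replicas * weight_percent + 99) // 100
    effective = 100 * canary_pods / replicas
    return canary_pods, effective


if __name__ == "__main__":
    for weight in (5, 20, 50):
        pods, pct = pod_count_canary(10, weight)
        print(f"setWeight {weight}% -> {pods} canary pod(s), about {pct:.0f}% of traffic")
```

With 10 replicas a requested 5% becomes one pod and therefore about 10% of traffic, which is why the finer-grained methods in the table need an ingress controller or mesh to split traffic independently of pod count.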
NGINX Ingress Canary Configuration
---
# Stable service. Both services use the same selector in Git; at runtime Argo Rollouts
# adds a rollouts-pod-template-hash selector so this one targets only stable pods.
apiVersion: v1
kind: Service
metadata:
  name: order-service-stable
spec:
  selector:
    app: order-service
  ports:
    - port: 80
      targetPort: 8080
---
# Canary service (Argo Rollouts points it at the canary ReplicaSet the same way)
apiVersion: v1
kind: Service
metadata:
  name: order-service-canary
spec:
  selector:
    app: order-service
  ports:
    - port: 80
      targetPort: 8080
---
# Primary Ingress (stable)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: order-service-ingress
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /order
            pathType: Prefix
            backend:
              service:
                name: order-service-stable
                port:
                  number: 80
# Argo Rollouts controller creates and manages the canary Ingress automatically:
# (annotations are set/removed by rollout controller)
# nginx.ingress.kubernetes.io/canary: "true"
# nginx.ingress.kubernetes.io/canary-weight: "20"
Istio-based Canary with VirtualService
# Argo Rollouts manages these weights automatically:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
    - order-service
  http:
    - route:
        - destination:
            host: order-service-stable
          weight: 80
        - destination:
            host: order-service-canary
          weight: 20   # ← Argo Rollouts updates this during rollout
# Rollout trafficRouting config for Istio:
# strategy:
#   canary:
#     trafficRouting:
#       istio:
#         virtualService:
#           name: order-service
#         # Only for subset-level splitting (one Service, two DestinationRule subsets);
#         # the VirtualService above splits by host and does not need it:
#         destinationRule:
#           name: order-service-destrule
#           canarySubsetName: canary
#           stableSubsetName: stable
11. Blue-Green Deployments
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: order-service-bg
spec:
  replicas: 5
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: 123456789.dkr.ecr.us-east-1.amazonaws.com/order-service:abc1234
  strategy:
    blueGreen:
      # Active service: currently receiving production traffic
      activeService: order-service-active
      # Preview service: new version (green) before promotion
      previewService: order-service-preview
      # Auto-promotion is disabled here: promotion requires a manual
      # kubectl argo rollouts promote, even after analysis passes
      autoPromotionEnabled: false
      # Scale down old (blue) replicas after this duration
      scaleDownDelaySeconds: 30
      # Run analysis against preview before promoting
      prePromotionAnalysis:
        templates:
          - templateName: success-rate
        args:
          - name: service-name
            value: order-service-preview
      # Run analysis after promotion (watch for regressions)
      postPromotionAnalysis:
        templates:
          - templateName: success-rate
        args:
          - name: service-name
            value: order-service-active
Blue-Green vs Canary Decision Guide
| Factor | Use Blue-Green | Use Canary |
|---|---|---|
| Database schema change | Yes (run migration against preview, validate, then promote) | Risky (both versions hit same DB simultaneously) |
| Stateful services | Yes (swap traffic atomically after testing) | Risky (canary pods may have different state) |
| Gradual risk reduction | No (all-or-nothing switch) | Yes (expose 5%, 20%, 50%, 100%) |
| Long rollout acceptable | No (instant switch or rollback) | Yes (hours-long rollout with analysis) |
| Cost concern | 2× resource usage during cutover | Slightly above normal (canary pods added) |
| Traffic isolation testing | Yes (full green env accessible via preview Service) | Partial (5% real traffic is canary) |
12. Flagger (Flux Progressive Delivery)
Flagger is the progressive delivery operator for Flux (and compatible with Argo CD). It automates canary and blue-green deployments using Kubernetes Deployments (not Rollout CRDs), making adoption easier for existing workloads.
helm repo add flagger https://flagger.app
helm upgrade --install flagger flagger/flagger \
  --namespace flagger-system \
  --create-namespace \
  --set meshProvider=nginx \
  --set metricsServer=http://prometheus.observability.svc:9090
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: order-service
  namespace: order-service
spec:
  # Target: the existing Deployment to wrap
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  # Ingress for traffic routing
  ingressRef:
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    name: order-service
  # Autoscaling (HPA follows canary/primary)
  autoscalerRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: order-service
  service:
    port: 80
    targetPort: 8080
    gateways: []
    hosts:
      - api.example.com
  analysis:
    interval: 1m     # run the metric checks every minute
    threshold: 5     # roll back after 5 failed checks
    maxWeight: 50    # max canary weight before full promotion
    stepWeight: 10   # increase by 10% each interval
    # Built-in and custom metrics share one list: a second "metrics" key
    # would be a duplicate YAML key and replace the first
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99    # minimum 99% success rate
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500   # maximum 500ms
        interval: 1m
      # Custom metric using PromQL
      - name: error-rate
        templateRef:
          name: error-rate
          namespace: flagger-system
        thresholdRange:
          max: 0.01
        interval: 1m
    # Run integration tests as webhook before promotion
    webhooks:
      - name: integration-test
        type: pre-rollout
        url: http://flagger-loadtester.flagger-system/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'anon' http://order-service-canary.order-service/checkout | grep 'order_id'"
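The interval, stepWeight, maxWeight and threshold settings above describe a small state machine: each passing check shifts more traffic to the canary, and too many failing checks send it back to zero. The Python sketch below is a simplified model of that loop under stated assumptions (failed checks accumulate and are never reset, and promotion happens as soon as maxWeight is reached). It is not Flagger's code, and canary_weights and run_analysis are hypothetical names.

```python
def canary_weights(step_weight: int, max_weight: int) -> list[int]:
    """Traffic percentages the canary passes through before promotion."""
    return list(range(step_weight, max_weight + 1, step_weight))


def run_analysis(
    check_results: list[bool],
    step_weight: int = 10,
    max_weight: int = 50,
    threshold: int = 5,
) -> tuple[str, int]:
    """Replay metric check outcomes, one per analysis interval.

    Returns (state, canary weight) where state is "promoted",
    "rolled back" or "progressing".
    """
    weight = 0
    failed_checks = 0
    for passed in check_results:
        if not passed:
            failed_checks += 1
            if failed_checks >= threshold:
                return "rolled back", 0  # all traffic returns to primary
            continue  # hold the current weight and re-check next interval
        weight = min(weight + step_weight, max_weight)
        if weight >= max_weight:
            return "promoted", weight
    return "progressing", weight


if __name__ == "__main__":
    print(canary_weights(10, 50))
    print(run_analysis([True, True, False, True, True, True]))
    print(run_analysis([True] + [False] * 5))
```

With the values from the Canary resource, a healthy release needs five passing checks, so the rollout takes at least five one-minute intervals before promotion.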
13. Pipeline Observability
Tekton Metrics
# Tekton exposes Prometheus metrics on port 9090 of each controller
# Key metrics:
tekton_pipelines_controller_pipelinerun_duration_seconds_bucket # pipeline duration histogram
tekton_pipelines_controller_pipelinerun_count{status} # success/failure counts
tekton_pipelines_controller_taskrun_duration_seconds_bucket
tekton_pipelines_controller_taskrun_count{status}
tekton_pipelines_controller_reconcile_count # reconciliation health
PrometheusRule — Pipeline Alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pipeline-alerts
  namespace: tekton-pipelines
spec:
  groups:
    - name: ci-pipelines
      rules:
        - alert: PipelineHighFailureRate
          # sum() on both sides: without it the division only pairs series with
          # identical labels, so failed runs would be divided by failed runs
          expr: |
            (
              sum(rate(tekton_pipelines_controller_pipelinerun_count{status="failed"}[1h]))
              /
              sum(rate(tekton_pipelines_controller_pipelinerun_count[1h]))
            ) > 0.20
          for: 15m
          labels:
            severity: warning
            team: platform
          annotations:
            summary: "CI pipeline failure rate > 20% for last hour"
        - alert: PipelineSlowBuilds
          expr: |
            histogram_quantile(0.90,
              sum by (le) (
                rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket[1h])
              )
            ) > 600
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "P90 pipeline duration > 10 minutes"
        - alert: ArgoRolloutDegraded
          expr: |
            rollout_info{phase!~"Healthy|Paused|Progressing"} == 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Argo Rollout {{ $labels.name }} is {{ $labels.phase }}"
Deployment Frequency Dashboard
# PromQL: deployment frequency per service per day (DORA metric)
# Counts successful Argo CD syncs, used here as a proxy for GitOps commits that update image tags
sum(increase(
  argocd_app_sync_total{
    name=~".*order-service.*",
    phase="Succeeded"
  }[24h]
))
# Lead time proxy: time from last commit to successful sync
# (requires custom metric or CI instrumentation)
# Change failure rate proxy: failed or errored syncs as a share of all syncs.
# argocd_app_info is a gauge, so rate() over it does not count events; health
# regressions after a successful sync need a separate signal.
sum(increase(argocd_app_sync_total{phase=~"Error|Failed"}[7d]))
/
sum(increase(argocd_app_sync_total[7d]))
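When sync metrics are too coarse, the same two DORA numbers can be computed from deployment records emitted by CI. The Python sketch below shows the arithmetic on a hand-written list. The Deployment record, its fields and the sample data are assumptions for illustration and do not correspond to any Argo CD or Tekton API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Deployment:
    day: date
    service: str
    failed: bool  # True when the deploy was rolled back or caused an incident


def deployment_frequency(deploys: list[Deployment], service: str, window_days: int) -> float:
    """Average deployments per day for one service over the window."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    count = sum(1 for d in deploys if d.service == service)
    return count / window_days


def change_failure_rate(deploys: list[Deployment], service: str) -> float:
    """Share of a service's deployments that failed (0.0 when it has none)."""
    mine = [d for d in deploys if d.service == service]
    if not mine:
        return 0.0
    return sum(1 for d in mine if d.failed) / len(mine)


# Illustrative records, not real data.
HISTORY = [
    Deployment(date(2024, 1, 15), "order-service", False),
    Deployment(date(2024, 1, 15), "order-service", True),
    Deployment(date(2024, 1, 16), "order-service", False),
    Deployment(date(2024, 1, 16), "order-service", False),
    Deployment(date(2024, 1, 16), "payment-service", False),
]

if __name__ == "__main__":
    print(deployment_frequency(HISTORY, "order-service", window_days=2))
    print(change_failure_rate(HISTORY, "order-service"))
```

Recording the failed flag at rollback time (for example from an aborted rollout) gives a change failure rate that reflects production impact rather than sync errors alone.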
14. Best Practices
1. Separate CI (image build) from CD (deploy)
CI builds and pushes images, updates GitOps repo. CD is entirely handled by the in-cluster GitOps agent. CI never runs kubectl apply against production. This hard separation means CI credential compromise cannot directly affect production clusters.
2. Use OIDC — never long-lived CI credentials
GitHub Actions, GitLab CI, and Tekton all support OIDC-based credential issuance to AWS/GCP/Azure. Use aws-actions/configure-aws-credentials with role-to-assume. No AWS_ACCESS_KEY_ID in repository secrets.
3. Pin every action and image version
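Pin GitHub Actions to full commit SHAs (uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683) and container images to digests. Floating references such as @v4 or :latest are a supply chain attack vector, because the code they point to can change without any change in your repository.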
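A pinning rule is easy to enforce mechanically. The following is a minimal sketch, not an existing tool: it scans workflow text with a regular expression (rather than a YAML parser) and reports every `uses:` reference whose ref is not a 40-character commit SHA. The function name `unpinned_actions` is an assumption of this example.

```python
import re

# Matches a `uses:` line and captures the action path and the ref after '@'.
# Commented-out lines do not match because '#' precedes `uses:`.
USES_RE = re.compile(r"^\s*-?\s*uses:\s*([^\s@#]+)@([^\s#]+)", re.MULTILINE)
FULL_SHA_RE = re.compile(r"[0-9a-f]{40}")


def unpinned_actions(workflow_text):
    """Return (action, ref) pairs whose ref is not a full 40-character commit SHA."""
    findings = []
    for action, ref in USES_RE.findall(workflow_text):
        if action.startswith("docker://"):
            continue  # container references are pinned by digest, not commit SHA
        if not FULL_SHA_RE.fullmatch(ref):
            findings.append((action, ref))
    return findings
```

A CI job could run this over `.github/workflows/*.yaml` and fail when the returned list is non-empty.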
4. Fail CI on CRITICAL vulnerabilities
Configure Trivy or Grype to exit non-zero on CRITICAL/HIGH CVEs, and have teams triage findings within the sprint. Handle false positives with individual, justified exceptions rather than setting exit-code: 0 for everything.
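One way to picture "exceptions with justification" is a small gate that reads a Trivy JSON report and ignores only the CVEs that carry a recorded reason. This is a sketch under assumptions: it relies on the `Results[].Vulnerabilities[]` layout with `VulnerabilityID` and `Severity` fields, and the `exceptions` mapping is a hypothetical allowlist format invented for this example. Trivy's own ignore-file support may be the better fit in practice.

```python
import json

BLOCKING = {"CRITICAL", "HIGH"}


def blocking_findings(report, exceptions):
    """Return blocking CVE IDs from a parsed Trivy JSON report.

    `exceptions` maps a CVE ID to a justification string (hypothetical format).
    An entry with an empty or whitespace-only justification does not waive anything.
    """
    found = set()
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            cve = vuln.get("VulnerabilityID")
            if vuln.get("Severity") not in BLOCKING:
                continue
            if exceptions.get(cve, "").strip():
                continue  # waived, with a recorded reason
            found.add(cve)
    return sorted(found)


def gate(report_json, exceptions):
    """Return the exit code CI should use: 1 if anything blocks, else 0."""
    return 1 if blocking_findings(json.loads(report_json), exceptions) else 0
```

Requiring a non-empty justification keeps the allowlist from turning into a silent global bypass.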
5. Use AnalysisTemplates for automated rollback
Argo Rollouts with Prometheus-based AnalysisTemplates gate promotion on error rate and latency. A misconfigured deploy that raises errors from 0.1% to 3% will be rolled back automatically, without on-call intervention.
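The gating idea can be illustrated with a simplified model. This is not Argo Rollouts code: the function name, the 0.99 threshold, and the `failure_limit` parameter are assumptions chosen to mirror the notion of a success condition and a limit on failed measurements.

```python
def analysis_verdict(success_rates, threshold=0.99, failure_limit=0):
    """Simplified model of a metric gate.

    Each entry in `success_rates` is one measurement. A measurement fails when
    it is below `threshold`; the run is aborted as soon as the number of failed
    measurements exceeds `failure_limit`.
    Returns ("rollback", index_of_deciding_measurement) or ("promote", None).
    """
    failures = 0
    for index, rate in enumerate(success_rates):
        if rate < threshold:
            failures += 1
            if failures > failure_limit:
                return ("rollback", index)
    return ("promote", None)
```

With the example from the text, an error rate rising from 0.1% to 3% corresponds to a success rate dropping from 0.999 to 0.97, so `analysis_verdict([0.999, 0.97])` decides on rollback at the second measurement.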
6. Start with canary for all production services
Even a 5%→100% canary with a 5-minute pause is better than a simultaneous 100% rollout. It limits the blast radius of configuration bugs, memory leaks, and serialisation errors that only manifest under real traffic.
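The blast-radius argument is simple arithmetic. The sketch below assumes, purely for illustration, a steady request rate, a bug that fails a fixed fraction of the requests it serves, and the same detection time in both rollout styles. All numbers are made up.

```python
def failed_requests(requests_per_second, traffic_share, bug_error_rate, exposure_seconds):
    """Estimate requests that fail while a buggy version receives `traffic_share` of traffic.

    traffic_share and bug_error_rate are fractions between 0 and 1.
    """
    return requests_per_second * traffic_share * bug_error_rate * exposure_seconds


# Illustrative numbers: 200 req/s, a bug failing 3% of requests, detected after 5 minutes.
canary = failed_requests(200, 0.05, 0.03, 300)   # buggy version gets 5% of traffic
full = failed_requests(200, 1.00, 0.03, 300)     # buggy version gets all traffic
```

Under these assumptions the canary exposes roughly 90 requests to the bug, against roughly 1,800 for the all-at-once rollout, a twentyfold difference that follows directly from the 5% weight.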
7. Use reusable workflows / shared Tekton Tasks
The platform team maintains the canonical build/scan/sign/push pipeline template. Service teams call it; they do not copy it. Updates (e.g., a new Trivy version or a new signing key) then propagate to all services automatically.
8. Tag images with git SHA — never :latest in production
Mutable tags create non-reproducible deployments. SHA-based tags make every deployment traceable to a specific commit. Also add the git SHA as an image label (for example, org.opencontainers.image.revision) so that docker inspect provides an audit trail.
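A tagging rule like this can be checked before a manifest is promoted. The helpers below are a minimal sketch with hypothetical names (`image_ref`, `uses_latest_tag`), and `registry.example.com` in the usage note is a placeholder registry.

```python
import re

# Accept abbreviated (7+) or full (40) lowercase hex commit SHAs.
GIT_SHA_RE = re.compile(r"[0-9a-f]{7,40}")


def image_ref(registry, repository, git_sha):
    """Build registry/repository:sha, refusing anything that is not a commit SHA."""
    if not GIT_SHA_RE.fullmatch(git_sha):
        raise ValueError(f"not a git SHA: {git_sha!r}")
    return f"{registry}/{repository}:{git_sha}"


def uses_latest_tag(ref):
    """True if an image reference resolves to :latest, explicitly or by omission."""
    if "@sha256:" in ref:
        return False  # pinned by digest
    name = ref.rsplit("/", 1)[-1]  # ignore any registry port such as localhost:5000
    if ":" not in name:
        return True  # no tag means the runtime defaults to :latest
    return name.rsplit(":", 1)[1] == "latest"
```

For example, `image_ref("registry.example.com", "order-service", "11bd719")` yields a SHA-tagged reference, while `uses_latest_tag("order-service")` is true because an untagged reference falls back to `:latest`.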
Coverage Checklist
- CI/CD pipeline architecture diagram (commit → CI → GitOps repo → Argo CD → progressive delivery)
- CI vs CD separation table (where each runs, outputs, tools)
- CI/CD contract callout (CI never kubectl applies to production)
- Full GitHub Actions CI workflow: OIDC auth, Buildx multi-platform, docker/metadata-action tags, docker/build-push-action (sbom+provenance), Trivy SARIF scan, Cosign keyless signing, GitOps repo promotion
- GitHub Actions reusable workflows (workflow_call with inputs/outputs/secrets)
- Tekton object hierarchy (Task/Pipeline/PipelineRun/TaskRun/Workspace/TriggerTemplate/EventListener)
- Tekton Task: go-test (unit test + lint steps, shared workspace)
- Tekton Task: kaniko-build (executor with cache, digest result output, ECR credentials volume)
- Full Tekton Pipeline: clone → test → build → scan → sign → update-gitops (with task dependencies)
- Tekton EventListener + TriggerBinding + TriggerTemplate (GitHub webhook → PipelineRun)
- Production Dockerfile: multi-stage Go build (trimpath, ldflags, CGO_ENABLED=0) + distroless nonroot runtime
- BuildKit vs Kaniko comparison table (daemon/K8s/cache/multi-platform/security/speed)
- Cosign keyless signing in GitHub Actions (COSIGN_EXPERIMENTAL, sign + verify commands)
- Cosign key-based signing (generate-key-pair, sign with --annotations, verify)
- Kyverno ClusterPolicy: verifyImages (keyless, subject regexp, issuer, mutateDigest, verifyDigest)
- SBOM generation with Syft (spdx-json + cyclonedx-json formats)
- Cosign attest for SBOM attestation + verify-attestation + Grype audit
- Docker Buildx native SBOM+provenance callout (--sbom=true --provenance=true)
- Image tag strategy table (SHA/semver/latest/branch/date+SHA with mutability)
- Never :latest in production callout (imagePullPolicy risks)
- ECR lifecycle policy JSON (keep semver, keep 50 SHA-tagged, expire untagged after 7d)
- GitOps promotion script: yq update kustomization, git commit with run URL + digest, retry push loop
- Argo Rollouts install (kubectl apply + plugin)
- Rollout CRD: replicas, canary strategy, canaryService+stableService, NGINX traffic routing, canary steps (setWeight/pause/analysis), background analysis
- AnalysisTemplate: success-rate metric (Prometheus PromQL), latency-p99, error-rate with failureCondition
- kubectl argo rollouts CLI: get --watch, pause, promote, abort, retry, set image, undo
- Canary traffic splitting options table (pod-count/NGINX/Istio/ALB/header-based)
- NGINX Ingress canary: stable + canary Services + primary Ingress (Argo Rollouts manages canary Ingress)
- Istio VirtualService weight-based canary + Rollout trafficRouting config
- Blue-Green Rollout CRD (activeService/previewService, autoPromotionEnabled, prePromotionAnalysis/postPromotionAnalysis)
- Blue-Green vs Canary decision guide table (5 factors)
- Flagger Helm install + Canary CRD (targetRef Deployment, NGINX ingress, analysis interval/threshold/stepWeight/maxWeight, metrics, webhooks)
- Tekton Prometheus metrics reference (pipelinerun/taskrun duration buckets + counts)
- PrometheusRule: PipelineHighFailureRate (>20%), PipelineSlowBuilds (P90 >10min), ArgoCDRolloutDegraded
- DORA metrics PromQL (deployment frequency, change failure rate from Argo CD sync metrics)
- 8 best practices cards