CI/CD Pipelines — Kubernetes Docs

Contents

CI/CD Pipeline Architecture
GitHub Actions for Kubernetes
Tekton Pipelines
Container Image Build (BuildKit / Kaniko)
Image Signing with Cosign
SBOM Generation
Registry Push & Tag Strategy
GitOps Promotion from CI
Progressive Delivery: Argo Rollouts
Canary Deployments
Blue-Green Deployments
Flagger (Flux Progressive Delivery)
Pipeline Observability
Best Practices

1. CI/CD Pipeline Architecture

Complete CI/CD flow:

Developer
  │── git push / PR ──▶ GitHub / GitLab
                              │
                    ┌─────────▼──────────┐
                    │    CI Pipeline      │
                    │                    │
                    │  lint + unit test  │
                    │  integration test  │
                    │  build image       │
                    │  SBOM generation   │
                    │  vulnerability scan│
                    │  sign image        │
                    │  push to registry  │
                    └─────────┬──────────┘
                              │ update image tag in GitOps repo
                              ▼
                    ┌─────────────────────┐
                    │   GitOps Repo (Git) │
                    │  services/order-svc │
                    │  overlays/staging/  │
                    │  image: ...v1.4.3   │
                    └─────────┬───────────┘
                              │ Argo CD / Flux detects change
                              ▼
                    ┌─────────────────────┐
                    │  Argo CD / Flux     │
                    │  sync to staging    │
                    └─────────┬───────────┘
                              │
                    ┌─────────▼──────────┐
                    │  Progressive CD    │
                    │  (Argo Rollouts /  │
                    │   Flagger)         │
                    │  canary: 10%→50%→  │
                    │  100% or rollback  │
                    └─────────┬──────────┘
                              │ promote to production (GitOps PR / auto)
                              ▼
                         Production

CI vs CD Separation

Phase	Runs in	Outputs	Tools
CI (Continuous Integration)	External CI (GitHub Actions, Tekton)	Verified image, SBOM, signatures, updated GitOps repo	GitHub Actions, Tekton, GitLab CI, Jenkins
CD (Continuous Delivery)	In-cluster (GitOps agent)	Running workload, deployment state	Argo CD, Flux, Argo Rollouts, Flagger

The CI/CD contract

CI produces an immutable, signed, scanned container image and updates a GitOps repository with the new tag. CD is entirely the responsibility of the in-cluster GitOps agent: CI never runs kubectl apply against production. Because the agent pulls changes from Git, CI credentials never touch production clusters.

2. GitHub Actions for Kubernetes

Full CI Workflow — Build, Scan, Sign, Push, Promote

# .github/workflows/ci.yaml
name: CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: 123456789.dkr.ecr.us-east-1.amazonaws.com
  IMAGE_NAME: order-service

permissions:
  contents: write        # update GitOps repo
  id-token: write        # OIDC for keyless Cosign signing
  packages: write        # push to GHCR (if used)
  security-events: write # upload SARIF scan results

jobs:
  ci:
    runs-on: ubuntu-latest
    outputs:
      image_tag: ${{ steps.meta.outputs.tags }}
      image_digest: ${{ steps.build.outputs.digest }}

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      # ── Lint & Test ─────────────────────────────────────────
      - name: Set up Go
        uses: actions/setup-go@v5
        with:
          go-version: "1.22"
          cache: true

      - name: Lint
        uses: golangci/golangci-lint-action@v6
        with:
          version: v1.59

      - name: Unit tests
        run: go test ./... -race -coverprofile=coverage.out

      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          file: coverage.out

      # ── Docker Build & Push ──────────────────────────────────
      - name: Configure AWS credentials (OIDC — no long-lived keys)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/github-actions-ecr
          aws-region: us-east-1

      - name: Login to ECR
        id: ecr-login
        uses: aws-actions/amazon-ecr-login@v2

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Extract metadata (tags, labels)
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=,suffix=,format=short     # short git SHA
            type=semver,pattern={{version}}            # 1.4.2 when tag v1.4.2 is pushed (needs a tag push trigger)
            type=raw,value=latest,enable=${{ github.ref == 'refs/heads/main' }}

      - name: Build and push
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          platforms: linux/amd64,linux/arm64
          push: ${{ github.event_name != 'pull_request' }}
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
          provenance: true    # generate SLSA provenance attestation
          sbom: true          # generate SBOM attestation

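      # ── Vulnerability Scan ───────────────────────────────────
      # Skipped on pull requests: the image is not pushed there, so the
      # digest reference below would not resolve in the registry.
      - name: Scan image with Trivy
        if: github.event_name != 'pull_request'
        uses: aquasecurity/trivy-action@master   # pin to a release tag or commit SHA in real pipelines
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}@${{ steps.build.outputs.digest }}
          format: sarif
          output: trivy-results.sarif
          severity: CRITICAL,HIGH
          exit-code: 1   # fail on CRITICAL/HIGH

      - name: Upload Trivy SARIF to GitHub Security
        uses: github/codeql-action/upload-sarif@v3
        if: always() && github.event_name != 'pull_request'
        with:
          sarif_file: trivy-results.sarif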

      # ── Cosign Keyless Signing ───────────────────────────────
      - name: Install Cosign
        uses: sigstore/cosign-installer@v3

      - name: Sign image (keyless OIDC)
        if: github.event_name != 'pull_request'
        # Cosign v2 signs keylessly by default; COSIGN_EXPERIMENTAL is not needed
        run: |
          cosign sign --yes \
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}@${{ steps.build.outputs.digest }}

      # ── GitOps Promotion ─────────────────────────────────────
      - name: Update image tag in GitOps repo (staging)
        if: github.event_name != 'pull_request'
        env:
          GIT_SHA: ${{ github.sha }}
        run: |
          SHORT_SHA="${GIT_SHA:0:7}"
          git clone https://x-access-token:${{ secrets.GITOPS_TOKEN }}@github.com/myorg/platform-gitops.git
          cd platform-gitops
          yq eval -i ".images[0].newTag = \"${SHORT_SHA}\"" \
            services/order-service/overlays/staging/kustomization.yaml
          git config user.email "ci-bot@example.com"
          git config user.name "CI Bot"
          git add services/order-service/overlays/staging/kustomization.yaml
          git commit -m "chore: update order-service to ${SHORT_SHA} in staging"
          git push

GitHub Actions: Reusable Workflows

# .github/workflows/reusable-build.yaml (platform team maintains)
name: Reusable Build & Push

on:
  workflow_call:
    inputs:
      image-name:
        required: true
        type: string
      dockerfile:
        required: false
        type: string
        default: Dockerfile
    outputs:
      digest:
        description: "Image digest"
        value: ${{ jobs.build.outputs.digest }}
      tag:
        description: "Short SHA tag"
        value: ${{ jobs.build.outputs.tag }}
    secrets:
      ecr-role-arn:
        required: true

jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      digest: ${{ steps.build.outputs.digest }}
      tag: ${{ steps.tag.outputs.value }}
    steps:
      # ... (same build/scan/sign steps as above)

# Usage in a service repo:
# .github/workflows/ci.yaml
jobs:
  build:
    uses: myorg/.github/.github/workflows/reusable-build.yaml@main
    with:
      image-name: order-service
    secrets:
      ecr-role-arn: ${{ secrets.ECR_ROLE_ARN }}

3. Tekton Pipelines

Tekton is a Kubernetes-native CI/CD framework. Pipelines run as Pods in the cluster, which means they share cluster RBAC, service accounts, and can access internal cluster resources (useful for integration tests against real services).

Tekton Object Hierarchy

Task          → defines a set of Steps (containers run sequentially in a Pod)
Pipeline      → defines a sequence/graph of Tasks
PipelineRun   → instantiation of a Pipeline with parameters and workspaces
TaskRun       → instantiation of a Task
Workspace     → storage shared between Steps and across Tasks (PVC / emptyDir / Secret / ConfigMap)
TriggerTemplate → creates PipelineRun on events (webhooks)
EventListener   → HTTP endpoint that receives webhooks and fires TriggerTemplates

Task: Run Tests

apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: go-test
spec:
  params:
    - name: package
      type: string
      default: "./..."
  workspaces:
    - name: source
  steps:
    - name: unit-test
      image: golang:1.22            # Debian-based: -race needs cgo and a C toolchain
      workingDir: $(workspaces.source.path)
      env:
        - name: GOFLAGS
          value: "-mod=vendor"
        - name: CGO_ENABLED
          value: "1"                # the race detector requires cgo
      script: |
        #!/usr/bin/env bash
        # pipefail keeps a failing go test from being masked by tee
        set -euxo pipefail
        go test $(params.package) \
          -race \
          -coverprofile=/tmp/coverage.out \
          -v 2>&1 | tee /tmp/test-output.txt
        go tool cover -html=/tmp/coverage.out -o /tmp/coverage.html
    - name: lint
      image: golangci/golangci-lint:v1.59-alpine
      workingDir: $(workspaces.source.path)
      script: |
        golangci-lint run --timeout 5m

Task: Build and Push with Kaniko

apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: kaniko-build
spec:
  params:
    - name: IMAGE
      description: Full image name with registry
    - name: DOCKERFILE
      default: Dockerfile
    - name: CONTEXT
      default: "."
  workspaces:
    - name: source
    - name: docker-credentials
      optional: true
  results:
    - name: IMAGE_DIGEST
      description: Digest of the built image
    - name: IMAGE_URL
      description: Full image URL with digest
  steps:
    - name: build-and-push
      image: gcr.io/kaniko-project/executor:v1.23.0-debug
      workingDir: $(workspaces.source.path)
      env:
        - name: DOCKER_CONFIG
          value: /kaniko/.docker
      command:
        - /kaniko/executor
      args:
        - --dockerfile=$(params.DOCKERFILE)
        - --context=$(params.CONTEXT)
        - --destination=$(params.IMAGE)
        - --cache=true
        - --cache-repo=$(params.IMAGE)-cache
        - --compressed-caching=false
        - --snapshot-mode=redo
        - --use-new-run
        - --digest-file=$(results.IMAGE_DIGEST.path)
      volumeMounts:
        - name: docker-config
          mountPath: /kaniko/.docker
  volumes:
    - name: docker-config
      projected:
        sources:
          - secret:
              name: ecr-credentials

Full Pipeline: CI for Go Service

apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: go-service-ci
spec:
  params:
    - name: git-url
    - name: git-revision
      default: main
    - name: image-name
    - name: gitops-repo-url
    - name: gitops-path

  workspaces:
    - name: shared-data     # Git clone workspace
    - name: git-credentials

  tasks:
    - name: clone
      taskRef:
        resolver: hub
        params:
          - name: catalog
            value: tekton-catalog-tasks
          - name: type
            value: artifact
          - name: kind
            value: task
          - name: name
            value: git-clone
          - name: version
            value: "0.9"
      workspaces:
        - name: output
          workspace: shared-data
        - name: ssh-directory
          workspace: git-credentials
      params:
        - name: url
          value: $(params.git-url)
        - name: revision
          value: $(params.git-revision)

    - name: test
      runAfter: [clone]
      taskRef:
        name: go-test
      workspaces:
        - name: source
          workspace: shared-data

    - name: build
      runAfter: [test]
      taskRef:
        name: kaniko-build
      workspaces:
        - name: source
          workspace: shared-data
      params:
        - name: IMAGE
          value: $(params.image-name):$(tasks.clone.results.commit)

    - name: scan
      runAfter: [build]
      taskRef:
        name: trivy-scan
      params:
        - name: IMAGE
          value: $(params.image-name)@$(tasks.build.results.IMAGE_DIGEST)
        - name: SEVERITY
          value: "CRITICAL,HIGH"
        - name: EXIT_CODE
          value: "1"

    - name: sign
      runAfter: [scan]
      taskRef:
        name: cosign-sign
      params:
        - name: IMAGE
          value: $(params.image-name)@$(tasks.build.results.IMAGE_DIGEST)

    - name: update-gitops
      runAfter: [sign]
      taskRef:
        name: git-update-deployment
      workspaces:
        - name: source
          workspace: shared-data
        - name: ssh-directory
          workspace: git-credentials
      params:
        - name: GIT_REPOSITORY
          value: $(params.gitops-repo-url)
        - name: GIT_PATH_FILES
          value: $(params.gitops-path)
        - name: NEW_TAG
          value: $(tasks.clone.results.commit)
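
The Pipeline above only declares parameters and workspaces; nothing runs until a PipelineRun supplies them. As a minimal sketch for trying the Pipeline by hand, the helper below prints such a PipelineRun. The function name render_pipelinerun and the default GIT_URL are hypothetical; the GitOps repository and overlay path are taken from the GitHub Actions example earlier on this page and rewritten as an SSH URL, which is an assumption.

```shell
# Print a PipelineRun manifest for the go-service-ci Pipeline.
# Usage: render_pipelinerun <git-revision>
# GIT_URL is a hypothetical placeholder; override it through the environment.
render_pipelinerun() {
  local revision="${1:-main}"
  local git_url="${GIT_URL:-git@github.com:myorg/order-service.git}"
  cat <<EOF
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  generateName: go-service-ci-manual-
spec:
  pipelineRef:
    name: go-service-ci
  params:
    - name: git-url
      value: ${git_url}
    - name: git-revision
      value: ${revision}
    - name: image-name
      value: 123456789.dkr.ecr.us-east-1.amazonaws.com/order-service
    - name: gitops-repo-url
      value: git@github.com:myorg/platform-gitops.git
    - name: gitops-path
      value: services/order-service/overlays/staging/kustomization.yaml
  workspaces:
    - name: shared-data
      volumeClaimTemplate:
        spec:
          accessModes: [ReadWriteOnce]
          resources:
            requests:
              storage: 1Gi
    - name: git-credentials
      secret:
        secretName: github-ssh-key
EOF
}

# Usage (generateName requires "create", not "apply"):
#   render_pipelinerun "$(git rev-parse HEAD)" | kubectl create -f -
```

Every Pipeline parameter without a default (git-url, image-name, gitops-repo-url, gitops-path) has to appear in the run, otherwise Tekton rejects the PipelineRun at validation time.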

EventListener + Trigger — Webhook-based PipelineRun

apiVersion: triggers.tekton.dev/v1beta1
kind: EventListener
metadata:
  name: github-webhook
  namespace: tekton-pipelines
spec:
  serviceAccountName: tekton-triggers-sa
  triggers:
    - name: github-push
      interceptors:
        - ref:
            name: github
          params:
            - name: secretRef
              value:
                secretName: github-webhook-secret
                secretKey: token
            - name: eventTypes
              value: [push]
        - ref:
            name: cel
          params:
            - name: filter
              value: "body.ref == 'refs/heads/main'"
      bindings:
        - ref: github-push-binding
      template:
        ref: github-push-template
---
apiVersion: triggers.tekton.dev/v1beta1
kind: TriggerBinding
metadata:
  name: github-push-binding
spec:
  params:
    - name: gitrevision
      value: $(body.head_commit.id)
    - name: gitrepositoryurl
      value: $(body.repository.ssh_url)   # SSH URL, so the ssh-directory workspace credentials are used
---
apiVersion: triggers.tekton.dev/v1beta1
kind: TriggerTemplate
metadata:
  name: github-push-template
spec:
  params:
    - name: gitrevision
    - name: gitrepositoryurl
  resourcetemplates:
    - apiVersion: tekton.dev/v1
      kind: PipelineRun
      metadata:
        generateName: go-service-ci-run-
      spec:
        pipelineRef:
          name: go-service-ci
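        params:
          - name: git-url
            value: $(tt.params.gitrepositoryurl)
          - name: git-revision
            value: $(tt.params.gitrevision)
          - name: image-name
            value: 123456789.dkr.ecr.us-east-1.amazonaws.com/order-service
          # The Pipeline declares these without defaults, so the run must set them
          - name: gitops-repo-url
            value: git@github.com:myorg/platform-gitops.git
          - name: gitops-path
            value: services/order-service/overlays/staging/kustomization.yaml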
        workspaces:
          - name: shared-data
            volumeClaimTemplate:
              spec:
                accessModes: [ReadWriteOnce]
                resources:
                  requests:
                    storage: 1Gi
          - name: git-credentials
            secret:
              secretName: github-ssh-key

4. Container Image Build (BuildKit / Kaniko)

Production Dockerfile — Multi-Stage, Distroless, Non-Root

# ── Stage 1: Build ──────────────────────────────────────────────────
FROM golang:1.22-alpine AS builder

# Install build deps
RUN apk add --no-cache git ca-certificates tzdata

WORKDIR /app

# Copy go.mod first for layer caching
COPY go.mod go.sum ./
RUN go mod download

COPY . .

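# BuildKit sets these per target platform (linux/amd64, linux/arm64, ...)
ARG TARGETOS
ARG TARGETARCH

# Build with CGO disabled, trimpath for reproducible builds.
# Falls back to linux/amd64 when the builder does not set the platform args.
RUN CGO_ENABLED=0 GOOS=${TARGETOS:-linux} GOARCH=${TARGETARCH:-amd64} go build \
    -trimpath \
    -ldflags="-s -w \
              -X main.version=$(git describe --tags --always 2>/dev/null || echo 'dev') \
              -X main.commitSHA=$(git rev-parse --short HEAD 2>/dev/null || echo 'unknown')" \
    -o /app/server ./cmd/server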

# ── Stage 2: Runtime ─────────────────────────────────────────────────
# Use distroless — no shell, no package manager, minimal attack surface
FROM gcr.io/distroless/static-debian12:nonroot

COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/
COPY --from=builder /usr/share/zoneinfo /usr/share/zoneinfo
COPY --from=builder /app/server /app/server

# Run as non-root user (distroless nonroot = UID 65532)
USER nonroot:nonroot

EXPOSE 8080 6060

ENTRYPOINT ["/app/server"]

BuildKit vs Kaniko

Feature	BuildKit (docker buildx)	Kaniko
Docker daemon required	No (rootless mode)	No
Runs in Kubernetes	Yes (DinD or rootless)	Yes (single pod, no daemon)
Cache	Registry cache, local, GitHub Actions cache	Registry cache (--cache-repo)
Multi-platform	Yes (QEMU emulation or native builders)	Per-arch build only (no cross-arch)
Security	Rootless available; DinD needs privileged	Runs in userspace; no privileged required
Speed	Fast (parallel layer build, smart caching)	Slower (sequential layers, no parallelism)
Best for	GitHub Actions, external CI, fast builds	Kubernetes-native CI (Tekton), air-gapped

5. Image Signing with Cosign

Cosign (part of Sigstore) provides keyless container image signing using ephemeral keys anchored to OIDC identity. In CI, the GitHub Actions OIDC token or Tekton ServiceAccount token proves the pipeline's identity — no long-lived signing keys to rotate.

Keyless Signing in CI

# Keyless signing — uses OIDC identity (no key management)
# In GitHub Actions (Cosign v2 signs keylessly by default; COSIGN_EXPERIMENTAL is not needed):
cosign sign --yes \
  --rekor-url https://rekor.sigstore.dev \
  123456789.dkr.ecr.us-east-1.amazonaws.com/order-service@sha256:abc123...

# The signature is stored in the OCI registry as a separate artifact:
# 123456789.dkr.ecr.us-east-1.amazonaws.com/order-service:sha256-abc123....sig

# Verify signature (in admission webhook or manually):
cosign verify \
  --certificate-identity="https://github.com/myorg/order-service/.github/workflows/ci.yaml@refs/heads/main" \
  --certificate-oidc-issuer=https://token.actions.githubusercontent.com \
  123456789.dkr.ecr.us-east-1.amazonaws.com/order-service@sha256:abc123...
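
Keyless verification depends on naming the signer's identity exactly. As a small sketch, the helper below assembles that identity string for a GitHub Actions workflow, following the same shape as the subject in the Kyverno policy on this page (repository, workflow file, then the Git ref after an @). The function names are hypothetical, and a signature produced inside a reusable workflow may carry the called workflow's identity instead, so treat the output as a starting point.

```shell
# Build the certificate identity expected for a GitHub Actions workflow.
# Arguments: <owner/repo> <workflow file name> <git ref>
expected_identity() {
  local repo="$1" workflow="$2" ref="$3"
  printf 'https://github.com/%s/.github/workflows/%s@%s\n' "$repo" "$workflow" "$ref"
}

# Return 0 when a candidate identity equals the expected one.
# Arguments: <candidate> <owner/repo> <workflow file name> <git ref>
identity_matches() {
  [ "$1" = "$(expected_identity "$2" "$3" "$4")" ]
}

# Usage:
#   cosign verify \
#     --certificate-identity="$(expected_identity myorg/order-service ci.yaml refs/heads/main)" \
#     --certificate-oidc-issuer=https://token.actions.githubusercontent.com \
#     <image>@<digest>
```

An exact identity is stricter than a regular expression: a signature from the same workflow on a feature branch does not verify against the refs/heads/main identity.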

Key-based Signing (for air-gapped / private Rekor)

# Generate signing key pair (run once, store private key in Vault)
cosign generate-key-pair

# Sign with private key
cosign sign --key cosign.key \
  --annotations "git-sha=${GITHUB_SHA}" \
  --annotations "pipeline-url=${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}/actions/runs/${GITHUB_RUN_ID}" \
  123456789.dkr.ecr.us-east-1.amazonaws.com/order-service:${TAG}

# Verify with public key (admission webhook uses cosign.pub)
cosign verify --key cosign.pub \
  123456789.dkr.ecr.us-east-1.amazonaws.com/order-service:${TAG}

Enforcing Signatures via Kyverno Policy

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signature
spec:
  validationFailureAction: Enforce
  background: false
  rules:
    - name: check-image-signature
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces: ["production", "staging"]
      verifyImages:
        - imageReferences:
            - "123456789.dkr.ecr.us-east-1.amazonaws.com/*"
          attestors:
            - count: 1
              entries:
                - keyless:
                    subject: "https://github.com/myorg/*/.github/workflows/ci.yaml@refs/heads/main"
                    issuer: "https://token.actions.githubusercontent.com"
                    rekor:
                      url: https://rekor.sigstore.dev
          mutateDigest: true      # replace :tag with @sha256 digest
          verifyDigest: true      # reject images without digest

6. SBOM Generation

A Software Bill of Materials (SBOM) lists every package and dependency in a container image. SBOMs enable vulnerability auditing, license compliance checking, and rapid incident response ("which images contain log4j?").

# Generate SBOM with Syft (SPDX or CycloneDX format)
syft 123456789.dkr.ecr.us-east-1.amazonaws.com/order-service@sha256:abc123 \
  -o spdx-json=sbom.spdx.json \
  -o cyclonedx-json=sbom.cdx.json

# Attest SBOM to the image (stored in OCI registry alongside image)
cosign attest --yes \
  --predicate sbom.spdx.json \
  --type spdxjson \
  123456789.dkr.ecr.us-east-1.amazonaws.com/order-service@sha256:abc123

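# Verify and extract SBOM from image (keyless, matching the attest step above):
cosign verify-attestation \
  --type spdxjson \
  --certificate-identity="https://github.com/myorg/order-service/.github/workflows/ci.yaml@refs/heads/main" \
  --certificate-oidc-issuer=https://token.actions.githubusercontent.com \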
  123456789.dkr.ecr.us-east-1.amazonaws.com/order-service@sha256:abc123 \
  | jq '.payload | @base64d | fromjson | .predicate'

# Audit SBOM for known vulnerabilities (Grype):
grype sbom:sbom.spdx.json --fail-on high
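
For the incident-response question raised above ("which images contain log4j?"), a directory of stored SBOM files can be triaged with plain grep. This is a rough sketch, assuming one pretty-printed SPDX JSON file per image in a directory; the function name sboms_containing and that layout are hypothetical. The package name is used as a regular expression, and the SPDX document name also matches, so confirm hits with a real SBOM tool such as Grype.

```shell
# List SBOM files in a directory that mention a package name.
# Arguments: <package name fragment> <directory of *.json SBOM files>
sboms_containing() {
  local pkg="$1" dir="$2"
  # -l prints matching file names only; "|| true" makes "no match" a non-error
  grep -l -E "\"name\": *\"[^\"]*${pkg}[^\"]*\"" "$dir"/*.json 2>/dev/null || true
}

# Usage:
#   sboms_containing log4j ./sboms
```

Naming each SBOM file after the image digest it describes makes the output directly usable as a list of affected images.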

Docker Buildx native SBOM

When using docker buildx build --sbom=true --provenance=true, BuildKit generates an SPDX SBOM and a SLSA provenance attestation automatically and attaches them to the image index in the registry. They can be inspected with docker buildx imagetools inspect or analyzed with Docker Scout. BuildKit does not sign these attestations, so use cosign attest when consumers need a verifiable signature.

7. Registry Push & Tag Strategy

Image Tag Strategy

Tag	Format	Mutable	Use Case
Git SHA	`abc1234` (7-char)	No	Primary deployment tag — immutable, auditable
Semver	`v1.4.2`	No (once pushed)	Release artifacts; Helm chart `appVersion`
latest	`latest`	Yes	Development only — never deploy with :latest in production
Branch	`main`, `pr-123`	Yes	Preview / feature branches; CI testing only
Date + SHA	`20240115-abc1234`	No	Sortable + unique; used by Flux alphabetical ImagePolicy
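
The two immutable tag formats in the table can be derived from values CI already has. A minimal sketch, with hypothetical function names, assuming the full commit SHA and a YYYYMMDD date string are passed in:

```shell
# 7-character Git SHA tag, e.g. abc1234
short_sha_tag() {
  printf '%.7s\n' "$1"
}

# Sortable date + SHA tag, e.g. 20240115-abc1234
# Arguments: <YYYYMMDD> <full commit SHA>
date_sha_tag() {
  printf '%s-%s\n' "$1" "$(short_sha_tag "$2")"
}

# Usage in CI:
#   date_sha_tag "$(date -u +%Y%m%d)" "$GITHUB_SHA"
```

Putting the date first is what makes the tag sort chronologically, which an alphabetical image policy relies on.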

Never deploy :latest to production

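:latest is a mutable tag: it points at a different image after every push. With imagePullPolicy: IfNotPresent, a node keeps whatever :latest it cached, so different nodes can end up running different builds. With imagePullPolicy: Always (the default Kubernetes applies when the tag is :latest or omitted), every pod start contacts the registry, and a broken push is picked up by every pod that starts or restarts afterwards. Use immutable SHA-based tags so that a given manifest always resolves to the same image.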
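
A promotion script can refuse mutable references before they reach the GitOps repository. The check below is a sketch with hypothetical function names; it looks only at the shape of the reference string and does not contact a registry.

```shell
# Return 0 when the reference resolves through the mutable :latest tag,
# either explicitly or because no tag is given.
is_mutable_latest() {
  case "$1" in
    *@sha256:*) return 1 ;;          # digest references are immutable
  esac
  local last="${1##*/}"              # last path component, so registry:port is ignored
  case "$last" in
    *:latest) return 0 ;;
    *:*)      return 1 ;;
    *)        return 0 ;;            # no tag: resolved as :latest
  esac
}

# Return 0 when the reference is pinned as repo@sha256:<64 hex characters>.
is_digest_pinned() {
  case "$1" in
    *@sha256:*) ;;
    *) return 1 ;;
  esac
  local digest="${1##*@sha256:}"
  [ "${#digest}" -eq 64 ] || return 1
  case "$digest" in
    *[!0-9a-f]*) return 1 ;;
  esac
  return 0
}

# Usage:
#   is_mutable_latest "$IMAGE" && { echo "refusing to promote $IMAGE" >&2; exit 1; }
```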

# ECR lifecycle policy: keep release tags, cap everything else, expire untagged images.
# Rules are evaluated by rulePriority (lowest first); an image matched by a
# rule cannot be expired by a later one. A rule with tagStatus "any" must
# have the highest rulePriority value.
aws ecr put-lifecycle-policy \
  --repository-name order-service \
  --lifecycle-policy '{
    "rules": [
      {
        "rulePriority": 1,
        "description": "Keep semver tags (expire only beyond 9999 images)",
        "selection": {
          "tagStatus": "tagged",
          "tagPrefixList": ["v"],
          "countType": "imageCountMoreThan",
          "countNumber": 9999
        },
        "action": {"type": "expire"}
      },
      {
        "rulePriority": 2,
        "description": "Expire untagged images after 7 days",
        "selection": {
          "tagStatus": "untagged",
          "countType": "sinceImagePushed",
          "countUnit": "days",
          "countNumber": 7
        },
        "action": {"type": "expire"}
      },
      {
        "rulePriority": 3,
        "description": "Keep the 50 most recent remaining images (SHA tags have no common prefix)",
        "selection": {
          "tagStatus": "any",
          "countType": "imageCountMoreThan",
          "countNumber": 50
        },
        "action": {"type": "expire"}
      }
    ]
  }'

8. GitOps Promotion from CI

#!/bin/bash
# Complete GitOps promotion script (used in CI after image push + sign)
set -euo pipefail

# Fail fast if CI did not provide the variables used below
: "${GITOPS_TOKEN:?}" "${GITHUB_SHA:?}" "${REGISTRY:?}" "${IMAGE_DIGEST:?}"

GITOPS_REPO="https://x-access-token:${GITOPS_TOKEN}@github.com/myorg/platform-gitops.git"
SERVICE="order-service"
NEW_TAG="${GITHUB_SHA:0:7}"
ENV="${1:-staging}"   # default to staging; pass 'production' for prod promotion

git clone "${GITOPS_REPO}" /tmp/gitops
cd /tmp/gitops

# Update image tag in Kustomize overlay (the wildcard matches any registry prefix)
yq eval -i "(.images[] | select(.name == \"*/${SERVICE}\") | .newTag) = \"${NEW_TAG}\"" \
  "services/${SERVICE}/overlays/${ENV}/kustomization.yaml"

# Verify the change looks correct
git diff

git config user.email "ci-bot@myorg.com"
git config user.name "CI Bot"
git add "services/${SERVICE}/overlays/${ENV}/"
git commit -m "chore(${ENV}): update ${SERVICE} to ${NEW_TAG}

Source: ${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}/actions/runs/${GITHUB_RUN_ID}
Image: ${REGISTRY}/${SERVICE}@${IMAGE_DIGEST}"

# Retry push in case of concurrent commits; fail the job if every attempt fails
pushed=false
for i in 1 2 3; do
  if git pull --rebase origin main && git push; then
    pushed=true
    break
  fi
  sleep $((i * 5))
done
[ "${pushed}" = true ] || { echo "push to GitOps repo failed after 3 attempts" >&2; exit 1; }

9. Progressive Delivery: Argo Rollouts

Argo Rollouts extends Kubernetes with advanced deployment strategies — canary, blue-green, and experiment-based — with automated metric analysis gates that roll back if error rate or latency thresholds are breached.

Install

kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts \
  -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml

# kubectl plugin
kubectl argo rollouts version

Rollout CRD (replaces Deployment)

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: order-service
  namespace: order-service
spec:
  replicas: 10
  revisionHistoryLimit: 5

  selector:
    matchLabels:
      app: order-service

  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: 123456789.dkr.ecr.us-east-1.amazonaws.com/order-service:abc1234
          ports:
            - containerPort: 8080
          resources:
            requests: {cpu: 250m, memory: 256Mi}
            limits: {memory: 512Mi}
          readinessProbe:
            httpGet: {path: /health/ready, port: 8080}
            initialDelaySeconds: 5
            periodSeconds: 5

  strategy:
    canary:
      # Traffic management via NGINX Ingress (header-based or weight-based)
      canaryService: order-service-canary
      stableService: order-service-stable
      trafficRouting:
        nginx:
          stableIngress: order-service-ingress
          annotationPrefix: nginx.ingress.kubernetes.io
          additionalIngressAnnotations:
            canary-by-header: X-Canary

      # Canary rollout steps
      steps:
        - setWeight: 5         # 5% traffic to canary
        - pause: {duration: 5m}  # wait 5 minutes
        - analysis:            # automated metric analysis
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: order-service-canary
        - setWeight: 20
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {}            # manual approval (indefinite pause)
        - setWeight: 100

      # Analysis during rollout (continuous background check)
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1        # zero-based step index: starts at the first pause, after setWeight: 5
        args:
          - name: service-name
            value: order-service-canary

AnalysisTemplate — Metric Gates

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
  namespace: order-service
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.95       # 95% success rate required
      failureLimit: 3                             # tolerate up to 3 failed measurements; a 4th aborts
      provider:
        prometheus:
          address: http://prometheus.observability.svc:9090
          query: |
            sum(rate(
              http_requests_total{
                app="{{args.service-name}}",
                status!~"5.."
              }[5m]
            )) /
            sum(rate(
              http_requests_total{
                app="{{args.service-name}}"
              }[5m]
            ))

    - name: latency-p99
      interval: 1m
      successCondition: result[0] < 0.5          # p99 < 500ms
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.observability.svc:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(
                http_request_duration_seconds_bucket{
                  app="{{args.service-name}}"
                }[5m]
              )) by (le)
            )

    - name: error-rate
      interval: 1m
      successCondition: result[0] < 0.01         # error rate < 1%
      failureCondition: result[0] >= 0.05        # fail at 5% errors or more; values in between are inconclusive
      provider:
        prometheus:
          address: http://prometheus.observability.svc:9090
          query: |
            sum(rate(
              http_requests_total{app="{{args.service-name}}",status=~"5.."}[5m]
            )) /
            sum(rate(
              http_requests_total{app="{{args.service-name}}"}[5m]
            ))
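
The error-rate metric above sets both a successCondition and a failureCondition, which leaves a band between 1% and 5% that matches neither. The sketch below mirrors that three-way outcome for a single measurement computed from raw counters. The function name is hypothetical, and the "no-data" branch is a simplification for this example, not how the controller reports an empty query result.

```shell
# Classify one error-rate measurement: success below 1%, failure at or
# above 5%, inconclusive in between.
# Arguments: <5xx request count> <total request count>
classify_error_rate() {
  local errors="$1" total="$2"
  awk -v e="$errors" -v t="$total" 'BEGIN {
    if (t == 0) { print "no-data"; exit }   # ratio undefined without traffic
    r = e / t
    if (r < 0.01)       print "success"
    else if (r >= 0.05) print "failure"
    else                print "inconclusive"
  }'
}

# Usage:
#   classify_error_rate 20 1000
```

An inconclusive measurement does not count as a pass, so keeping the two thresholds close together reduces the number of rollouts that stall waiting for a human decision.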

Rollout CLI Commands

# Watch rollout progress
kubectl argo rollouts get rollout order-service --watch

# Pause a rollout at current step
kubectl argo rollouts pause order-service

# Promote (advance past a manual pause step)
kubectl argo rollouts promote order-service

# Abort and roll back to stable
kubectl argo rollouts abort order-service

# Retry after abort
kubectl argo rollouts retry rollout order-service

# Set image (trigger new rollout)
kubectl argo rollouts set image order-service \
  order-service=123456789.dkr.ecr.us-east-1.amazonaws.com/order-service:abc5678

# Undo to previous version
kubectl argo rollouts undo order-service

10. Canary Deployments in Detail

Traffic Splitting Options

Method	Mechanism	Granularity	Requires
Pod-count canary	N canary pods out of total → N% traffic	Coarse (1/10 = 10%)	Nothing extra (default Argo Rollouts)
NGINX Ingress weight	`canary-weight` annotation	1% granularity	NGINX Ingress Controller
Istio VirtualService	HTTP route `weight` split	1% granularity	Istio service mesh
AWS ALB weighted target groups	Listener rule weights	1% granularity	AWS ALB Controller
Header-based	Route specific users to canary by header	0 or 100%	NGINX / Istio
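
The "coarse" granularity of the pod-count method follows from simple arithmetic: without a traffic router, the share of traffic reaching the canary is the share of pods running it. The sketch below assumes the controller rounds the canary pod count up, as the Argo Rollouts documentation describes for the basic canary; maxSurge and maxUnavailable can shift the real numbers. Function names are hypothetical.

```shell
# Canary pod count for a requested weight: ceil(replicas * weight / 100).
# Arguments: <replicas> <weight percent>
canary_pods() {
  local replicas="$1" weight="$2"
  echo $(( (replicas * weight + 99) / 100 ))
}

# Percentage of pods (and so, roughly, of traffic) that is actually canary.
effective_percent() {
  local replicas="$1" weight="$2" pods
  pods="$(canary_pods "$replicas" "$weight")"
  echo $(( pods * 100 / replicas ))
}

# Usage:
#   effective_percent 10 5
```

With 10 replicas, a requested weight of 5% still means one whole canary pod, so about 10% of traffic. A traffic router such as NGINX or Istio is what makes small weights meaningful.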

NGINX Ingress Canary Configuration

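---
# Stable service. Both Services start with the same selector; the Argo Rollouts
# controller adds a rollouts-pod-template-hash selector at runtime so this one
# targets only stable pods.
apiVersion: v1
kind: Service
metadata:
  name: order-service-stable
spec:
  selector:
    app: order-service
  ports:
    - port: 80          # port referenced by the Ingress backend
      targetPort: 8080  # containerPort of order-service

---
# Canary service (the controller points it at canary pods the same way)
apiVersion: v1
kind: Service
metadata:
  name: order-service-canary
spec:
  selector:
    app: order-service
  ports:
    - port: 80
      targetPort: 8080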

---
# Primary Ingress (stable)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: order-service-ingress
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /order
            pathType: Prefix
            backend:
              service:
                name: order-service-stable
                port:
                  number: 80

# Argo Rollouts controller creates and manages the canary Ingress automatically:
# (annotations are set/removed by rollout controller)
# nginx.ingress.kubernetes.io/canary: "true"
# nginx.ingress.kubernetes.io/canary-weight: "20"

Istio-based Canary with VirtualService

# Argo Rollouts manages these weights automatically:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
    - order-service
  http:
    - route:
        - destination:
            host: order-service-stable
          weight: 80
        - destination:
            host: order-service-canary
          weight: 20   # ← Argo Rollouts updates this during rollout

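# Rollout trafficRouting config for Istio (host-level splitting, matching the
# two destination hosts above):
# strategy:
#   canary:
#     canaryService: order-service-canary
#     stableService: order-service-stable
#     trafficRouting:
#       istio:
#         virtualService:
#           name: order-service
# For subset-level splitting, route to DestinationRule subsets instead and add:
#         destinationRule:
#           name: order-service-destrule
#           canarySubsetName: canary
#           stableSubsetName: stable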

11. Blue-Green Deployments

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: order-service-bg
spec:
  replicas: 5
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: 123456789.dkr.ecr.us-east-1.amazonaws.com/order-service:abc1234

  strategy:
    blueGreen:
      # Active service: currently receiving production traffic
      activeService: order-service-active
      # Preview service: new version (green) before promotion
      previewService: order-service-preview

      # Require manual promotion (kubectl argo rollouts promote <rollout>).
      # Set to true to promote automatically once pre-promotion analysis passes.
      autoPromotionEnabled: false

      # Scale down old (blue) replicas after this duration
      scaleDownDelaySeconds: 30

      # Run analysis against preview before promoting
      prePromotionAnalysis:
        templates:
          - templateName: success-rate
        args:
          - name: service-name
            value: order-service-preview

      # Run analysis after promotion (watch for regressions)
      postPromotionAnalysis:
        templates:
          - templateName: success-rate
        args:
          - name: service-name
            value: order-service-active
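
The Rollout above refers to two Services by name. A minimal sketch of both is shown here, assuming the container listens on port 8080 (that port is an assumption, not taken from the Rollout spec). Both start with the same selector; Argo Rollouts adds a rollouts-pod-template-hash selector to each so that active points at blue and preview at green.

```yaml
# Sketch only: Services referenced by activeService / previewService.
apiVersion: v1
kind: Service
metadata:
  name: order-service-active
spec:
  selector:
    app: order-service
  ports:
    - port: 80
      targetPort: 8080   # assumed container port
---
apiVersion: v1
kind: Service
metadata:
  name: order-service-preview
spec:
  selector:
    app: order-service
  ports:
    - port: 80
      targetPort: 8080   # assumed container port
```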

Blue-Green vs Canary Decision Guide

Factor	Use Blue-Green	Use Canary
Database schema change	Yes (run migration against preview, validate, then promote)	Risky (both versions hit same DB simultaneously)
Stateful services	Yes (swap traffic atomically after testing)	Risky (canary pods may have different state)
Gradual risk reduction	No (all-or-nothing switch)	Yes (expose 5%, 20%, 50%, 100%)
Long rollout acceptable	No (instant switch or rollback)	Yes (hours-long rollout with analysis)
Cost concern	2× resource usage during cutover	Slightly above normal (canary pods added)
Traffic isolation testing	Yes (full green env accessible via preview Service)	Partial (5% real traffic is canary)

12. Flagger (Flux Progressive Delivery)

Flagger is the progressive delivery operator for Flux (and compatible with Argo CD). It automates canary and blue-green deployments using Kubernetes Deployments (not Rollout CRDs), making adoption easier for existing workloads.

helm repo add flagger https://flagger.app
helm upgrade --install flagger flagger/flagger \
  --namespace flagger-system \
  --create-namespace \
  --set meshProvider=nginx \
  --set metricsServer=http://prometheus.observability.svc:9090

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: order-service
  namespace: order-service
spec:
  # Target: the existing Deployment to wrap
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service

  # Ingress for traffic routing
  ingressRef:
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    name: order-service

  # Autoscaling (HPA follows canary/primary)
  autoscalerRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: order-service

  service:
    port: 80
    targetPort: 8080
    gateways: []
    hosts:
      - api.example.com

  analysis:
    interval: 1m           # run the metric checks once per minute
    threshold: 5           # roll back after 5 failed checks
    maxWeight: 50          # max canary weight before full promotion
    stepWeight: 10         # increase canary weight by 10% after each passing interval

    # Built-in checks followed by a custom PromQL check
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99          # minimum 99% success rate
        interval: 1m

      - name: request-duration
        thresholdRange:
          max: 500         # maximum 500ms
        interval: 1m

      # Custom metric using PromQL (MetricTemplate referenced by name)
      - name: error-rate
        templateRef:
          name: error-rate
          namespace: flagger-system
        thresholdRange:
          max: 0.01
        interval: 1m

    # Run integration tests as webhook before promotion
    webhooks:
      - name: integration-test
        type: pre-rollout
        url: http://flagger-loadtester.flagger-system/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'anon' http://order-service-canary.order-service/checkout | grep 'order_id'"
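
The error-rate check in the Canary resolves its templateRef to a MetricTemplate in flagger-system. A minimal sketch of such a template follows, assuming ingress-nginx metrics are scraped by the Prometheus instance used in the Helm install. The query and its label filters are illustrative and should be checked against the metrics your controller actually exports. It returns a ratio, so the threshold of 0.01 corresponds to 1% errors.

```yaml
# Sketch only: custom Flagger metric backing the `error-rate` templateRef.
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: error-rate
  namespace: flagger-system
spec:
  provider:
    type: prometheus
    address: http://prometheus.observability.svc:9090
  # {{ namespace }}, {{ ingress }} and {{ interval }} are filled in by Flagger.
  query: |
    sum(rate(nginx_ingress_controller_requests{
      namespace="{{ namespace }}",
      ingress="{{ ingress }}",
      canary!="",
      status=~"5.*"
    }[{{ interval }}]))
    /
    sum(rate(nginx_ingress_controller_requests{
      namespace="{{ namespace }}",
      ingress="{{ ingress }}",
      canary!=""
    }[{{ interval }}]))
```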

13. Pipeline Observability

Tekton Metrics

# Tekton exposes Prometheus metrics on port 9090 of each controller
# Key metrics:
tekton_pipelines_controller_pipelinerun_duration_seconds_bucket  # pipeline duration histogram
tekton_pipelines_controller_pipelinerun_count{status}            # success/failure counts
tekton_pipelines_controller_taskrun_duration_seconds_bucket
tekton_pipelines_controller_taskrun_count{status}
tekton_pipelines_controller_reconcile_count                      # reconciliation health
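
These metrics only reach Prometheus once the controller endpoint is scraped. With the Prometheus Operator, a ServiceMonitor along the following lines could do that. The label selector and port name are assumptions based on a default Tekton install; confirm them with `kubectl get svc -n tekton-pipelines --show-labels` before applying.

```yaml
# Sketch only: scrape the Tekton Pipelines controller (labels and port name assumed).
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tekton-pipelines-controller
  namespace: tekton-pipelines
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: tekton-pipelines
      app.kubernetes.io/component: controller
  endpoints:
    - port: http-metrics   # assumed name of the 9090 metrics port
      interval: 30s
```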

PrometheusRule — Pipeline Alerts

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pipeline-alerts
  namespace: tekton-pipelines
spec:
  groups:
    - name: ci-pipelines
      rules:
        - alert: PipelineHighFailureRate
          expr: |
            (
              sum(rate(tekton_pipelines_controller_pipelinerun_count{status="failed"}[1h]))
              /
              sum(rate(tekton_pipelines_controller_pipelinerun_count[1h]))
            ) > 0.20
          for: 15m
          labels:
            severity: warning
            team: platform
          annotations:
            summary: "CI pipeline failure rate > 20% for last hour"

        - alert: PipelineSlowBuilds
          expr: |
            histogram_quantile(0.90,
              rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket[1h])
            ) > 600
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "P90 pipeline duration > 10 minutes"

        # Argo Rollouts health, based on the controller's rollout_info metric
        - alert: ArgoCDRolloutDegraded
          expr: |
            rollout_info{phase!~"Healthy|Paused|Progressing"} == 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Argo Rollout {{ $labels.name }} is {{ $labels.phase }}"

Deployment Frequency Dashboard

# PromQL: deployment frequency per service per day (DORA metric)
# Counts successful Argo CD syncs, a proxy for GitOps commits that update image tags
increase(
  argocd_app_sync_total{
    name=~".*order-service.*",
    phase="Succeeded"
  }[24h]
)

# Lead time proxy: time from last commit to successful sync
# (requires custom metric or CI instrumentation)

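# Change failure rate (proxy): share of syncs that ended in Error or Failed.
# Health degradation after a successful sync is not captured here and needs
# a separate signal (for example rollout aborts or incident records).
sum(increase(argocd_app_sync_total{phase=~"Error|Failed"}[7d]))
/
sum(increase(argocd_app_sync_total[7d]))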
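
For dashboards it can help to precompute these queries as recording rules. The sketch below assumes the Prometheus Operator and an Argo CD install in the argocd namespace; the rule and record names are hypothetical. The ratio yields no sample for an application that has no failed syncs in the window, so dashboards should treat a missing value as zero.

```yaml
# Sketch only: recording rules for DORA-style panels (names are hypothetical).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dora-metrics
  namespace: argocd          # assumed namespace
spec:
  groups:
    - name: dora
      rules:
        # Successful syncs per application over the last 24 hours
        - record: dora:deployments:increase24h
          expr: sum by (name) (increase(argocd_app_sync_total{phase="Succeeded"}[24h]))
        # Share of syncs per application that failed over the last 7 days
        - record: dora:sync_failure_ratio:7d
          expr: |
            sum by (name) (increase(argocd_app_sync_total{phase=~"Error|Failed"}[7d]))
            /
            sum by (name) (increase(argocd_app_sync_total[7d]))
```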

14. Best Practices

1. Separate CI (image build) from CD (deploy)

CI builds and pushes images, then updates the GitOps repo. CD is handled entirely by the in-cluster GitOps agent. CI never runs kubectl apply against production, so a compromised CI credential cannot directly modify production clusters.

2. Use OIDC — never long-lived CI credentials

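GitHub Actions and GitLab CI can exchange OIDC tokens for short-lived AWS/GCP/Azure credentials; Tekton can do the same through Kubernetes service-account identity (for example IRSA on EKS or Workload Identity on GKE). In GitHub Actions, use aws-actions/configure-aws-credentials with role-to-assume, and keep AWS_ACCESS_KEY_ID out of repository secrets.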

3. Pin every action and image version

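Pin GitHub Actions to a full commit SHA (uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683) and pin container images to a digest rather than a tag. Floating references such as @v4 or :latest can be repointed at different code, which makes them a supply chain attack vector.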

4. Fail CI on CRITICAL and HIGH vulnerabilities

Trivy or Grype should exit non-zero when it finds CRITICAL or HIGH CVEs, and teams should triage findings within the sprint. Handle false positives with individually justified exceptions rather than setting exit-code: 0 for every finding.

5. Use AnalysisTemplates for automated rollback

Argo Rollouts with Prometheus-based AnalysisTemplates gate promotion on error rate and latency. A misconfigured deploy that raises errors from 0.1% to 3% will be rolled back automatically, without on-call intervention.

6. Start with canary for all production services

Even a 5%→100% canary with a 5-minute pause is better than sending 100% of traffic to the new version at once. It limits the blast radius of configuration bugs, memory leaks, and serialisation errors that only manifest under real traffic.

7. Use reusable workflows / shared Tekton Tasks

The platform team maintains the canonical build/scan/sign/push pipeline template. Service teams call it rather than copying it, so updates (e.g., a new Trivy version or a new signing key) propagate to all services automatically.

8. Tag images with git SHA — never :latest in production

Mutable tags create non-reproducible deployments. SHA-based tags make every deployment traceable to a specific commit. Also record the git SHA as an image label (for example org.opencontainers.image.revision) so that docker inspect shows which commit produced the image.

Coverage Checklist

CI/CD pipeline architecture diagram (commit → CI → GitOps repo → Argo CD → progressive delivery)
CI vs CD separation table (where each runs, credentials, outputs)
CI/CD contract callout (CI never kubectl applies to production)
Full GitHub Actions CI workflow: OIDC auth, Buildx multi-platform, docker/metadata-action tags, docker/build-push-action (sbom+provenance), Trivy SARIF scan, Cosign keyless signing, GitOps repo promotion
GitHub Actions reusable workflows (workflow_call with inputs/outputs/secrets)
Tekton object hierarchy (Task/Pipeline/PipelineRun/TaskRun/Workspace/TriggerTemplate/EventListener)
Tekton Task: go-test (unit test + lint steps, shared workspace)
Tekton Task: kaniko-build (executor with cache, digest result output, ECR credentials volume)
Full Tekton Pipeline: clone → test → build → scan → sign → update-gitops (with task dependencies)
Tekton EventListener + TriggerBinding + TriggerTemplate (GitHub webhook → PipelineRun)
Production Dockerfile: multi-stage Go build (trimpath, ldflags, CGO_ENABLED=0) + distroless nonroot runtime
BuildKit vs Kaniko comparison table (daemon/K8s/cache/multi-platform/security/speed)
Cosign keyless signing in GitHub Actions (COSIGN_EXPERIMENTAL, sign + verify commands)
Cosign key-based signing (generate-key-pair, sign with --annotations, verify)
Kyverno ClusterPolicy: verifyImages (keyless, subject regexp, issuer, mutateDigest, verifyDigest)
SBOM generation with Syft (spdx-json + cyclonedx-json formats)
Cosign attest for SBOM attestation + verify-attestation + Grype audit
Docker Buildx native SBOM+provenance callout (--sbom=true --provenance=true)
Image tag strategy table (SHA/semver/latest/branch/date+SHA with mutability)
Never :latest in production callout (imagePullPolicy risks)
ECR lifecycle policy JSON (keep semver, keep 50 SHA-tagged, expire untagged after 7d)
GitOps promotion script: yq update kustomization, git commit with run URL + digest, retry push loop
Argo Rollouts install (kubectl apply + plugin)
Rollout CRD: replicas, canary strategy, canaryService+stableService, NGINX traffic routing, canary steps (setWeight/pause/analysis), background analysis
AnalysisTemplate: success-rate metric (Prometheus PromQL), latency-p99, error-rate with failureCondition
kubectl argo rollouts CLI: get --watch, pause, promote, abort, retry, set image, undo
Canary traffic splitting options table (pod-count/NGINX/Istio/ALB/header-based)
NGINX Ingress canary: stable + canary Services + primary Ingress (Argo Rollouts manages canary Ingress)
Istio VirtualService weight-based canary + Rollout trafficRouting config
Blue-Green Rollout CRD (activeService/previewService, autoPromotionEnabled, prePromotionAnalysis/postPromotionAnalysis)
Blue-Green vs Canary decision guide table (6 factors)
Flagger Helm install + Canary CRD (targetRef Deployment, NGINX ingress, analysis interval/threshold/stepWeight/maxWeight, metrics, webhooks)
Tekton Prometheus metrics reference (pipelinerun/taskrun duration buckets + counts)
PrometheusRule: PipelineHighFailureRate (>20%), PipelineSlowBuilds (P90 >10min), ArgoCDRolloutDegraded
DORA metrics PromQL (deployment frequency, change failure rate from Argo CD sync metrics)
8 best practices cards