Artifact Management: Build Caching, Provenance, and Retention

Manage CI/CD artifacts effectively—build caching for speed, provenance tracking for security, and retention policies for cost control.


Effective artifact management improves pipeline speed, ensures security, and controls costs. This guide covers caching strategies, provenance tracking, and retention policies.

Artifact Types and Storage Backends

CI/CD pipelines produce various artifact types requiring different storage characteristics.

Common artifact types:

Type                  Examples                Size    Retention
Build outputs         JAR, DLL, binary        Medium  Long
Container images      Docker, OCI             Large   Medium
Test reports          JUnit, Cobertura        Small   Short
Dependencies          npm packages, Maven     Large   Medium
Deployment manifests  Helm charts, K8s YAML   Small   Medium

Storage backend options:

Backend               Best For             Cost          Features
S3/GCS                Any artifact type    Pay per use   Versioning, lifecycle
Azure Blob            Cross-cloud          Competitive   Immutable blobs
Artifactory           Package management   Enterprise    Universal format
GitHub Actions cache  Build cache          Limited free  Built-in
GitLab CI artifacts   Native integration   Included      Simple

GitHub Actions artifact configuration:

- name: Upload build artifacts
  uses: actions/upload-artifact@v4
  with:
    name: build-${{ matrix.node-version }}
    path: |
      dist/
      coverage/
      *.nupkg
    retention-days: 30
    compression-level: 9

- name: Download artifacts
  uses: actions/download-artifact@v4
  with:
    pattern: build-*
    path: ./combined
    merge-multiple: true

GitLab CI artifacts:

build:
  stage: build
  script:
    - npm run build
  artifacts:
    name: "build-$CI_COMMIT_SHORT_SHA"
    paths:
      - dist/
      - coverage/
    expire_in: 1 week
    reports:
      junit: junit.xml
      coverage_report:
        coverage_format: cobertura
        path: cobertura.xml

Build Cache Strategies

Caching dependencies and intermediate build outputs dramatically reduces pipeline time.

Dependency cache patterns:

# npm with cache
- uses: actions/setup-node@v4
  with:
    node-version: "20"
    cache: "npm"
    cache-dependency-path: "package-lock.json"
# Maven with cache
- uses: actions/cache@v4
  with:
    path: |
      ~/.m2/repository
      build/
    key: maven-${{ runner.os }}-${{ hashFiles('**/pom.xml') }}
    restore-keys: |
      maven-${{ runner.os }}-

Layer-based Docker caching:

# GitHub Actions with BuildKit
- uses: docker/setup-buildx-action@v3

- uses: docker/build-push-action@v5
  with:
    push: true
    tags: myregistry.azurecr.io/myapp:${{ github.sha }}
    cache-from: type=gha
    cache-to: type=gha,mode=max

GitLab Docker layer caching:

build:docker:
  stage: build
  image: docker:24
  services:
    - docker:dind
  script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
    - docker pull $CI_REGISTRY_IMAGE:latest || true
    - docker build --build-arg BUILDKIT_INLINE_CACHE=1 --cache-from $CI_REGISTRY_IMAGE:latest -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
  variables:
    DOCKER_BUILDKIT: "1"

Distributed cache for self-hosted runners:

# Terraform for S3 cache bucket
resource "aws_s3_bucket" "cache" {
  bucket = "myci-cache-bucket"
}

resource "aws_s3_bucket_lifecycle_configuration" "cache" {
  bucket = aws_s3_bucket.cache.id

  rule {
    id     = "expire-stale-cache"
    status = "Enabled"

    filter {}

    expiration {
      days = 14
    }
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "cache" {
  bucket = aws_s3_bucket.cache.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}
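On top of that bucket, the runner still needs save/restore logic. A minimal sketch in shell, assuming the AWS CLI and GNU coreutils are available; the bucket name and lock-file path are illustrative:

```shell
#!/bin/bash
# Restore/save a dependency cache in an S3 bucket, keyed by lock-file
# content; bucket and file names are illustrative assumptions.
set -euo pipefail

CACHE_BUCKET="myci-cache-bucket"

# Content-addressed cache key derived from the lock file
cache_key() {
  echo "deps-$(sha256sum "$1" | cut -c1-16)"
}

# Pull the cached archive if one exists for the current key
restore_cache() {
  local key; key=$(cache_key package-lock.json)
  if aws s3 cp "s3://$CACHE_BUCKET/$key.tar.gz" - 2>/dev/null | tar xz; then
    echo "Cache hit: $key"
  else
    echo "Cache miss: $key"
  fi
}

# Upload the freshly built dependency tree under the current key
save_cache() {
  local key; key=$(cache_key package-lock.json)
  tar cz node_modules | aws s3 cp - "s3://$CACHE_BUCKET/$key.tar.gz"
}
```

Because the key is a hash of the lock file, any dependency change produces a new key and a clean cache miss rather than a stale hit.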

SBOM and Artifact Provenance

Software Bills of Materials and provenance tracking improve supply chain security.

Generate SBOM with Syft:

- name: Generate SBOM
  uses: anchore/sbom-action@v0
  with:
    image: myregistry.azurecr.io/myapp:${{ github.sha }}
    format: spdx-json
    output-file: sbom.spdx.json

Attach provenance with GitHub Actions:

- name: Generate provenance
  uses: actions/attest-build-provenance@v1
  with:
    subject-name: myregistry.azurecr.io/myapp
    push-to-registry: true
    subject-digest: ${{ env.IMAGE_SHA }}

GitLab CI SBOM generation:

sbom:generation:
  stage: analyze
  image:
    name: anchore/syft:latest
    entrypoint: [""]
  script:
    - syft myregistry.azurecr.io/myapp:${CI_COMMIT_SHA} -o spdx-json > sbom.spdx.json
  artifacts:
    paths:
      - sbom.spdx.json
    expire_in: 1 week

Verify provenance at deployment:

# Illustrative OPA admission policy loaded from a ConfigMap via kube-mgmt
# (Gatekeeper instead ships policies as ConstraintTemplates);
# `provenance_verified` is a placeholder for a real verification rule
apiVersion: v1
kind: ConfigMap
metadata:
  name: provenance-policy
  namespace: opa
  labels:
    openpolicyagent.org/policy: rego
data:
  policy: |
    package kubernetes.admission

    deny[msg] {
      image := input.request.object.spec.containers[_].image
      not provenance_verified(image)
      msg := sprintf("Image provenance verification failed: %s", [image])
    }
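Outside the cluster, a deploy script can apply the same gate before rollout. A sketch using Cosign's keyless attestation verification, assuming provenance was attested by a GitHub Actions workflow; the identity regexp and image name are illustrative:

```shell
#!/bin/bash
# Gate a deployment on provenance verification; the image and identity
# values below are illustrative assumptions.
set -uo pipefail

verify_provenance() {
  # Keyless check of a SLSA provenance attestation on the image
  cosign verify-attestation \
    --type slsaprovenance \
    --certificate-oidc-issuer https://token.actions.githubusercontent.com \
    --certificate-identity-regexp '^https://github.com/myorg/myapp/' \
    "$1" > /dev/null 2>&1
}

# Turn the verifier's exit status into a deploy/block decision
gate_deploy() {
  if "$1" "$2"; then
    echo "deploy"
  else
    echo "block"
  fi
}

# Usage: gate_deploy verify_provenance myregistry.azurecr.io/myapp:v1.0.0
```

Separating the decision (`gate_deploy`) from the verifier makes the gate easy to test and to swap for a different attestation tool.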

Retention Policies and Cleanup

Automatic cleanup prevents storage costs from growing unbounded.

GitHub Actions retention:

# Set globally or per artifact
- name: Upload artifact
  uses: actions/upload-artifact@v4
  with:
    name: logs
    path: logs/
    retention-days: 7 # Override default 90 days

S3 lifecycle policies:

resource "aws_s3_bucket_lifecycle_configuration" "artifacts" {
  bucket = aws_s3_bucket.artifacts.id

  rule {
    id     = "cleanup-old-artifacts"
    status = "Enabled"

    filter {
      prefix = "builds/"
    }

    expiration {
      days = 30
    }

    noncurrent_version_expiration {
      noncurrent_days = 7
    }
  }

  rule {
    id     = "cleanup-failed-builds"
    status = "Enabled"

    filter {
      prefix = "builds/failed/"
    }

    expiration {
      days = 7
    }
  }
}

GitLab artifact expiry:

# In .gitlab-ci.yml
job:
  artifacts:
    expire_in: 1 week # or 3 days, 2 weeks, etc.
    when: always # Keep artifacts even on failure

Automated cleanup scripts:

#!/bin/bash
# cleanup-artifacts.sh
REGISTRY="myregistry.azurecr.io"
RETENTION_DAYS=30

# Delete images older than $RETENTION_DAYS days
az acr repository show-manifests \
  --name myregistry \
  --repository myapp \
  --orderby time_asc \
  --query "[?timestamp<'$(date -d "$RETENTION_DAYS days ago" -I)'].digest" \
  --output tsv | while read digest; do
  echo "Deleting digest: $digest"
  az acr repository delete \
    --yes \
    --name myregistry \
    --image myapp@sha256:$digest
done
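Registries are not the only thing that accumulates: workflow artifacts do too. A sketch that prunes old GitHub Actions artifacts through the REST API with the `gh` CLI (the owner, repo, and retention values are illustrative; GNU `date` assumed):

```shell
#!/bin/bash
# Prune GitHub Actions artifacts older than a retention window via the
# gh CLI; the OWNER/REPO values are illustrative assumptions.
set -euo pipefail

OWNER=myorg
REPO=myapp
RETENTION_DAYS=30

# True when an ISO-8601 timestamp falls before the retention cutoff
is_expired() {
  local cutoff
  cutoff=$(date -d "$RETENTION_DAYS days ago" +%s)
  [ "$(date -d "$1" +%s)" -lt "$cutoff" ]
}

# Walk all artifacts and delete the expired ones
cleanup_artifacts() {
  gh api "repos/$OWNER/$REPO/actions/artifacts" --paginate \
    -q '.artifacts[] | "\(.id) \(.created_at)"' |
  while read -r id created; do
    if is_expired "$created"; then
      echo "Deleting artifact $id ($created)"
      gh api -X DELETE "repos/$OWNER/$REPO/actions/artifacts/$id"
    fi
  done
}

# Run: cleanup_artifacts
```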

Artifact Signing and Verification

Sign artifacts to ensure integrity and authenticity.

Cosign for signing:

# Install cosign
brew install cosign

# Sign image
cosign sign --yes myregistry.azurecr.io/myapp:v1.0.0

# Sign with key (production)
cosign sign --key cosign.key myregistry.azurecr.io/myapp:v1.0.0

# Verify
cosign verify --key cosign.pub myregistry.azurecr.io/myapp:v1.0.0

GitHub Actions signing:

- name: Install Cosign
  uses: sigstore/cosign-installer@v3

- name: Sign container image
  env:
    COSIGN_YES: "true"
    COSIGN_KEY: ${{ secrets.COSIGN_KEY }}
  run: |
    cosign sign \
      --key env://COSIGN_KEY \
      ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}@${{ env.DIGEST }}

Verification in Kubernetes admission:

# Kyverno policy for signed images
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-signature
      match:
        resources:
          kinds:
            - Pod
      verifyImages:
        - imageReferences:
            - "myregistry.azurecr.io/*"
          attestors:
            - entries:
                - keys:
                    secret:
                      name: cosign-public-key
                      namespace: kyverno

Cost Optimization for Artifact Storage

Tiered storage strategies:

# Terraform: tiered storage-class transitions for artifact objects
resource "aws_s3_bucket" "artifacts" {
  bucket = "myci-artifacts"
}

resource "aws_s3_bucket_lifecycle_configuration" "tiering" {
  bucket = aws_s3_bucket.artifacts.id

  rule {
    id     = "tiered-storage"
    status = "Enabled"

    filter {}

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 90
      storage_class = "GLACIER"
    }

    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"
    }
  }
}

Cache optimization:

# Prefer cache over artifact storage
- name: Build with cache
  run: |
    # Cache hit - use cached dependencies
    if [ -d node_modules ]; then
      echo "Using cached dependencies"
    fi

Monitor storage usage:

# GitLab CI storage report
storage:
  stage: report
  script:
    - |
      echo "Checking artifact storage usage..."
      api_url="https://gitlab.com/api/v4"
      project_id=$CI_PROJECT_ID
      token=$GITLAB_TOKEN

      # Get storage statistics from the project endpoint
      curl -s --header "PRIVATE-TOKEN: $token" \
        "$api_url/projects/$project_id?statistics=true" | jq .statistics

When to Use / When Not to Use

When artifact management pays off

Artifact management matters when your pipelines build things more than once. If you are compiling code, packaging containers, or generating reports across dozens of commits per day, managing what gets stored and for how long directly affects your pipeline speed and cloud bill.

Use explicit artifact management when you need to pass outputs between pipeline stages. Downloading a fresh copy of node_modules on every job is wasteful when you could cache it. Same goes for build outputs that other jobs need later.

SBOM generation and provenance tracking make sense for any organization subject to supply chain security requirements. If your customers or compliance team ask “what went into this build?”, artifact management practices give you an answer.
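With SBOMs retained as build artifacts, that question becomes a query. A sketch against an SPDX JSON SBOM using `jq` (the file and package names are illustrative):

```shell
#!/bin/bash
# Query a stored SPDX JSON SBOM; file and package names are illustrative.
set -eu

# List "name version" pairs for every package recorded in the SBOM
sbom_packages() {
  jq -r '.packages[] | "\(.name) \(.versionInfo // "unknown")"' "$1"
}

# Exit 0 if the named package appears anywhere in the SBOM
sbom_contains() {
  sbom_packages "$1" | awk -v p="$2" '$1 == p { found = 1 } END { exit !found }'
}

# Usage: sbom_contains sbom.spdx.json log4j-core && echo "present"
```

Run across every retained SBOM, this answers "which builds shipped package X?" without rebuilding anything.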

When to keep it simple

For small projects with fast builds and infrequent deployments, aggressive artifact management adds complexity without much return. A project that builds in 30 seconds does not need layer caching or distributed cache infrastructure.

If you are just starting out with CI/CD, set basic retention policies first. You can always add SBOM generation and artifact signing later when the workflow stabilizes.

Artifact Management Decision Flow

flowchart TD
    A[Pipeline Build Completes] --> B{Artifact needed later?}
    B -->|No| C[Skip artifact storage]
    B -->|Yes| D{Shared across jobs?}
    D -->|Yes| E[Upload to shared storage]
    D -->|No| F{Cache dependency?}
    F -->|Yes| G[Use build cache]
    F -->|No| C
    E --> H{Security required?}
    H -->|SBOM/provenance| I[Generate SBOM + sign]
    H -->|No| J[Apply retention policy]

Production Failure Scenarios

Common Artifact Failures

Failure                     Impact                                         Mitigation
Cache key collision         Wrong cached artifact used for different code  Include file hash in cache key
Retention too short         Artifact deleted before debugging complete     Set minimum 7-day retention for test artifacts
Storage quota exceeded      Pipeline fails uploading artifacts             Monitor usage, set lifecycle policies
Unsigned artifact deployed  Security policy blocks deployment              Require signed artifacts in admission controller
SBOM stale after build      SBOM does not match deployed image             Generate SBOM after final image build, not intermediate

Cache Corruption Recovery

flowchart TD
    A[Build Fails with Cache] --> B{Cache suspected?}
    B -->|Yes| C[Clear cache]
    B -->|No| D[Investigate build script]
    C --> E[Retry build]
    E --> F{Succeeds?}
    F -->|Yes| D
    F -->|No| G[Disable cache, retry]
    D --> H[Fix root cause]

Artifact Verification Checklist

# Verify artifact integrity
sha256sum myapp-1.0.0.jar

# Check Cosign signature
cosign verify --key cosign.pub myregistry.azurecr.io/myapp:v1.0.0

# Validate SBOM against image
syft myregistry.azurecr.io/myapp:v1.0.0 -o spdx-json | jq '.packages[].name'

Observability Hooks

Track artifact health to catch problems before they affect builds.

What to monitor:

  • Artifact upload success/failure rate
  • Cache hit ratio (cache vs fresh download)
  • Storage consumption growth per pipeline
  • Artifact age distribution (are retention policies working?)
  • SBOM generation success rate

# GitHub Actions - surface the cache hit in logs (assumes a prior
# actions/cache step with id: cache)
- name: Build
  run: |
    echo "Cache hit: $CACHE_HIT"
    npm ci && npm test
  env:
    CACHE_HIT: ${{ steps.cache.outputs.cache-hit }}

# GitLab CI - project storage statistics
curl --header "PRIVATE-TOKEN: $TOKEN" \
  "https://gitlab.com/api/v4/projects/$PROJECT_ID?statistics=true" | jq .statistics

# AWS S3 - total object size under a prefix (bytes)
aws s3api list-objects-v2 \
  --bucket myci-artifacts \
  --prefix builds/ \
  --query 'sum(Contents[].Size)' \
  --output text

Common Pitfalls / Anti-Patterns

Using the same cache for unrelated builds

If your cache key is just node-modules without a hash of lock files, a React project and a Vue project can end up sharing a cache with incompatible dependency versions. Always include a content hash in cache keys.
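The difference is easy to see side by side. A sketch of a collision-prone key next to a properly scoped one (paths and key names are illustrative):

```yaml
# Anti-pattern: one global key shared by every project and every lockfile
- uses: actions/cache@v4
  with:
    path: node_modules
    key: node-modules

# Better: scope the key by OS and by the lockfile's content hash
- uses: actions/cache@v4
  with:
    path: node_modules
    key: node-${{ runner.os }}-${{ hashFiles('package-lock.json') }}
    restore-keys: node-${{ runner.os }}-
```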

Forgetting to expire artifacts

Without retention policies, artifacts accumulate forever. A busy CI system can accumulate terabytes in months. Set lifecycle rules on S3 buckets and retention-days on CI artifacts from day one.

Storing credentials in artifacts

Never upload artifacts containing secrets, API keys, or credentials. Even private repositories are not immune to accidental exposure. Audit what goes into your artifacts before uploading.

Not validating cached outputs

A cache hit does not mean the cached artifact is correct. A build that succeeds once with wrong configuration will keep succeeding from cache until someone clears it. Validate cache integrity when possible.
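One lightweight safeguard is to stamp the cache with the hash of the lock file that produced it, and reject the cache when they diverge. A sketch in shell (file names are illustrative):

```shell
#!/bin/bash
# Validate a restored node_modules cache against the lock file that
# produced it; file names are illustrative assumptions.
set -euo pipefail

# Record which lock file this dependency tree was installed from
stamp_cache() {
  sha256sum package-lock.json | cut -d' ' -f1 > node_modules/.cache-stamp
}

# A restored cache is valid only if its stamp matches the current lock file
cache_valid() {
  [ -f node_modules/.cache-stamp ] &&
  [ "$(cat node_modules/.cache-stamp)" = "$(sha256sum package-lock.json | cut -d' ' -f1)" ]
}

# In the pipeline: reinstall whenever the restored cache fails validation
ensure_deps() {
  if cache_valid; then
    echo "Cache validated"
  else
    echo "Stale or missing cache; reinstalling"
    rm -rf node_modules && npm ci && stamp_cache
  fi
}
```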

Over-engineering the first pass

Starting with distributed caches, SBOMs, and artifact signing before your pipeline is stable is backwards. Get the basic artifact storage working first, add caching, then layer on security features as the project matures.

Quick Recap

Key Takeaways

  • Cache dependency downloads to cut pipeline time significantly
  • SBOMs and provenance are essential for supply chain security compliance
  • Retention policies prevent storage costs from growing unbounded
  • Sign artifacts with Cosign and verify in Kubernetes admission
  • Cache keys must include content hashes to avoid collisions

Artifact Health Checklist

# Check cache hit rate in GitHub Actions
# (enable Actions step summary for this)

# Verify artifact retention is set
grep -r "retention-days" .github/workflows/

# Test Cosign verification locally
cosign verify --key cosign.pub myregistry.azurecr.io/myapp:v1.0.0

# Check S3 lifecycle rules
aws s3api get-bucket-lifecycle-configuration --bucket myci-artifacts

# Scan for credentials in artifacts before upload
git secrets --scan && echo "No secrets found"

# List large artifacts for cleanup
aws s3api list-objects-v2 \
  --bucket myci-artifacts \
  --query 'sort_by(Contents, &Size)[-10:]'

Trade-off Summary

Artifact Storage           Latency  Cost              Security               Best For
S3 / GCS / Blob            Low      Pay per use       IAM + bucket policies  Most CI/CD workloads
Artifactory                Medium   License + infra   Rich access control    Enterprise with many package formats
GitHub Packages            Medium   Storage + egress  GitHub token           GitHub-native workflows
GitLab Container Registry  Medium   Storage + egress  GitLab token           GitLab-native workflows
Harbor                     Medium   Infrastructure    RBAC + OPA             Enterprise / air-gapped
Amazon ECR / GCR / ACR     Low      Storage + egress  IAM                    Cloud-native workloads

Conclusion

Artifact management balances speed, security, and cost. Use aggressive caching for dependencies, generate SBOMs for supply chain security, and enforce retention policies to control storage. Sign artifacts with Cosign for verification in production clusters. For more on CI/CD practices, see our CI/CD Pipelines overview, and for deployment patterns, see our Deployment Strategies guide.
