Artifact Management: Build Caching, Provenance, and Retention

Manage CI/CD artifacts effectively: build caching for speed, provenance tracking for security, and retention policies for cost control.

Published: March 25, 2026 | Reading time: 19 min | Author: GeekWorkBench

Introduction

Artifact management is the discipline of organizing, storing, and managing the lifecycle of your CI/CD pipeline outputs. Every build produces artifacts: compiled binaries, container images, test reports, and deployment manifests. Without thoughtful management, these outputs become unmanageable, costly to store, and potentially insecure.

Proper artifact management solves three problems. First, speed: reusing cached dependencies instead of re-downloading them on every pipeline run cuts build times dramatically. Second, security: provenance tracking and signing ensure you can verify that the image you deployed is exactly what your pipeline built. Third, cost: retention policies prevent storage bills from growing unbounded while keeping the artifacts you actually need for debugging and compliance.

This guide covers artifact types and storage backends, build caching strategies, SBOM and provenance tracking, retention policies and cleanup, artifact signing, and cost optimization. You will learn how to design artifact management that scales with your CI/CD maturity.

Artifact Types and Storage Backends

CI/CD pipelines produce various artifact types requiring different storage characteristics.

Common artifact types:

Type	Examples	Size	Retention
Build outputs	JAR, DLL, binary	Medium	Long
Container images	Docker, OCI	Large	Medium
Test reports	JUnit, cobertura	Small	Short
Dependencies	npm packages, Maven	Large	Medium
Deployment manifests	Helm charts, K8s YAML	Small	Medium

Storage backend options:

Backend	Best For	Cost	Features
S3/GCS	Any artifact type	Pay per use	Versioning, lifecycle
Azure Blob	Cross-cloud	Competitive	Immutable blobs
Artifactory	Package management	Enterprise	Universal format
GitHub Actions cache	Build cache	Limited free	Built-in
GitLab CI artifacts	Native integration	Included	Simple

GitHub Actions artifact configuration:

- name: Upload build artifacts
  uses: actions/upload-artifact@v4
  with:
    name: build-${{ matrix.node-version }}
    path: |
      dist/
      coverage/
      *.nupkg
    retention-days: 30
    compression-level: 9

- name: Download artifacts
  uses: actions/download-artifact@v4
  with:
    pattern: build-*
    path: ./combined
    merge-multiple: true

GitLab CI artifacts:

build:
  stage: build
  script:
    - npm run build
  artifacts:
    name: "build-$CI_COMMIT_SHORT_SHA"
    paths:
      - dist/
      - coverage/
    expire_in: 1 week
    reports:
      junit: junit.xml
      coverage_report: cobertura.xml

Build Cache Strategies

Caching dependencies and intermediate build outputs dramatically reduces pipeline time.

Dependency cache patterns:

# npm with cache
- uses: actions/setup-node@v4
  with:
    node-version: "20"
    cache: "npm"
    cache-dependency-path: "package-lock.json"

# Maven with cache
- uses: actions/cache@v4
  with:
    # Cache only the local repository; build output depends on source files, not just pom.xml
    path: ~/.m2/repository
    key: maven-${{ runner.os }}-${{ hashFiles('**/pom.xml') }}
    restore-keys: |
      maven-${{ runner.os }}-
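
How the key and restore-keys interact is easier to see in a small model. The sketch below is not how GitHub Actions implements its cache; it only imitates the lookup order (exact key first, then the newest entry matching a restore prefix). The function names and the in-memory list of stored keys are assumptions made for illustration.

```python
import hashlib
from typing import List, Optional


def cache_key(prefix: str, runner_os: str, lockfile_bytes: bytes) -> str:
    """Build a key that changes whenever the lock file content changes."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()
    return f"{prefix}-{runner_os}-{digest}"


def lookup(key: str, restore_prefixes: List[str], stored_keys: List[str]) -> Optional[str]:
    """Return an exact match if present, else the newest key matching a prefix.

    stored_keys is assumed to be ordered oldest to newest.
    """
    if key in stored_keys:
        return key
    for prefix in restore_prefixes:
        matches = [k for k in stored_keys if k.startswith(prefix)]
        if matches:
            return matches[-1]
    return None


# Hypothetical pom.xml contents before and after a dependency bump
old_key = cache_key("maven", "Linux", b"<dependency>a:1.0</dependency>")
new_key = cache_key("maven", "Linux", b"<dependency>a:1.1</dependency>")
stored = [old_key]

exact = lookup(old_key, ["maven-Linux-"], stored)     # exact hit
fallback = lookup(new_key, ["maven-Linux-"], stored)  # partial hit via prefix
miss = lookup(new_key, [], stored)                    # no restore-keys: cold start
print(exact == old_key, fallback == old_key, miss)
```

The fallback case is why restore-keys are worth having: after a dependency bump the job starts from the previous cache and downloads only what changed, rather than starting cold.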

Layer-based Docker caching:

# GitHub Actions with BuildKit
- uses: docker/setup-buildx-action@v3

- uses: docker/build-push-action@v5
  with:
    push: true
    tags: myregistry.azurecr.io/myapp:${{ github.sha }}
    # The gha backend picks up its credentials from the Actions runtime; no URL or token is set here
    cache-from: type=gha
    cache-to: type=gha,mode=max

GitLab Docker layer caching:

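build:docker:
  stage: build
  image: docker:24
  services:
    - docker:dind
  script:
    - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"
    # Pull the previous image so its layers can seed the cache; tolerate failure on the first run
    - docker pull $CI_REGISTRY_IMAGE:previous || true
    # BUILDKIT_INLINE_CACHE=1 embeds cache metadata so the pushed image can serve as a cache source
    - docker build --cache-from $CI_REGISTRY_IMAGE:previous --build-arg BUILDKIT_INLINE_CACHE=1 -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA -t $CI_REGISTRY_IMAGE:previous .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
    - docker push $CI_REGISTRY_IMAGE:previous
  variables:
    DOCKER_BUILDKIT: "1"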
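
Layer caching rewards a particular instruction order, and a toy model shows why. The sketch below assumes each layer is identified by its parent layer plus its own instruction, which is a simplification of what BuildKit really computes. The `@v1` and `@v2` suffixes are a made-up stand-in for the content hash of the copied files.

```python
import hashlib
from typing import List


def layer_ids(instructions: List[str]) -> List[str]:
    """Chain each layer ID to its parent, so an early change invalidates later layers."""
    ids = []
    parent = ""
    for instruction in instructions:
        parent = hashlib.sha256((parent + "\n" + instruction).encode()).hexdigest()[:12]
        ids.append(parent)
    return ids


def reusable_layers(previous: List[str], current: List[str]) -> int:
    """Count the leading layers whose IDs match between two builds."""
    count = 0
    for old, new in zip(layer_ids(previous), layer_ids(current)):
        if old != new:
            break
        count += 1
    return count


# Dependencies copied before source: a source edit only rebuilds the last layer
deps_first = ["FROM node:20", "COPY package-lock.json@v1", "RUN npm ci", "COPY src@v1"]
deps_first_edit = ["FROM node:20", "COPY package-lock.json@v1", "RUN npm ci", "COPY src@v2"]

# Source copied first: the same edit invalidates the dependency install too
src_first = ["FROM node:20", "COPY src@v1", "COPY package-lock.json@v1", "RUN npm ci"]
src_first_edit = ["FROM node:20", "COPY src@v2", "COPY package-lock.json@v1", "RUN npm ci"]

good_order = reusable_layers(deps_first, deps_first_edit)
bad_order = reusable_layers(src_first, src_first_edit)
print(good_order, bad_order)  # 3 1
```

In this model, copying the lock file and installing dependencies before copying source keeps three of four layers reusable after a source edit, against one when the source is copied first.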

Distributed cache for self-hosted runners:

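# Terraform for S3 cache bucket
resource "aws_s3_bucket" "cache" {
  bucket = "myci-cache-bucket"
}

# AWS provider v4 and later configure lifecycle rules in a separate resource
resource "aws_s3_bucket_lifecycle_configuration" "cache" {
  bucket = aws_s3_bucket.cache.id

  rule {
    id     = "expire-cache-entries"
    status = "Enabled"
    filter {}
    expiration {
      days = 14
    }
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "cache" {
  bucket = aws_s3_bucket.cache.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}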

SBOM and Artifact Provenance

Software Bills of Materials and provenance tracking improve supply chain security.

Generate SBOM with Syft:

- name: Generate SBOM
  uses: anchore/sbom-action@v0
  with:
    image: myregistry.azurecr.io/myapp:${{ github.sha }}
    format: spdx-json
    output-file: sbom.spdx.json

Attach provenance with GitHub Actions:

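# The job needs permissions id-token: write and attestations: write
- name: Generate provenance
  uses: actions/attest-build-provenance@v1
  with:
    subject-name: myregistry.azurecr.io/myapp
    # Full digest of the pushed image, in the form sha256:<hex>
    subject-digest: ${{ env.IMAGE_SHA }}
    push-to-registry: true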

GitLab CI SBOM generation:

sbom:generation:
  stage: analyze
  image:
    name: anchore/syft:latest
    entrypoint: [""]
  script:
    - syft myregistry.azurecr.io/myapp:${CI_COMMIT_SHA} -o spdx-json > sbom.spdx.json
  artifacts:
    paths:
      - sbom.spdx.json
    expire_in: 1 week

Verify provenance at deployment:

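# OPA policy sketch for provenance (illustrative pseudocode, not deployable as-is)
# provenance.verify is a placeholder, not an OPA built-in; real verification needs
# an external data provider or a dedicated verifier. This ConfigMap form matches the
# older OPA + kube-mgmt setup; Gatekeeper itself expects ConstraintTemplates.
apiVersion: v1
kind: ConfigMap
metadata:
  name: provenance-policy
  namespace: gatekeeper-system
data:
  policy: |
    package kubernetes.admission
    deny[msg] {
      image := input.request.object.spec.containers[_].image
      not provenance.verify(image)
      msg := sprintf("Image provenance verification failed: %v", [image])
    }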
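
One part of a provenance check is easy to show concretely: confirming that the attestation talks about the image you are about to deploy. The sketch below is a minimal illustration, assuming the attestation payload is an in-toto statement with a `subject` list and a SLSA `predicateType`. It does not verify any signature, which must happen first with a tool such as cosign; the function name and sample data are hypothetical.

```python
import json

SLSA_PREDICATE_PREFIX = "https://slsa.dev/provenance/"


def subject_matches(statement_json: str, image_digest: str) -> bool:
    """Check that a provenance statement names the given image digest as a subject.

    image_digest is expected in the form "sha256:<hex>".
    """
    statement = json.loads(statement_json)
    predicate_type = str(statement.get("predicateType", ""))
    if not predicate_type.startswith(SLSA_PREDICATE_PREFIX):
        return False
    algorithm, _, hex_digest = image_digest.partition(":")
    if not hex_digest:
        return False
    return any(
        subject.get("digest", {}).get(algorithm) == hex_digest
        for subject in statement.get("subject", [])
    )


# Hypothetical statement for illustration; real ones carry a full predicate
statement = json.dumps({
    "_type": "https://in-toto.io/Statement/v1",
    "subject": [{"name": "myregistry.azurecr.io/myapp", "digest": {"sha256": "ab" * 32}}],
    "predicateType": "https://slsa.dev/provenance/v1",
    "predicate": {},
})

ok = subject_matches(statement, "sha256:" + "ab" * 32)
wrong_image = subject_matches(statement, "sha256:" + "cd" * 32)
print(ok, wrong_image)
```

Comparing digests rather than tags matters here: a tag can be moved to a different image after the attestation was produced, a digest cannot.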

Retention Policies and Cleanup

Automatic cleanup prevents storage costs from growing unbounded.

GitHub Actions retention:

# Set globally or per artifact
- name: Upload artifact
  uses: actions/upload-artifact@v4
  with:
    name: logs
    path: logs/
    retention-days: 7 # Override default 90 days

S3 lifecycle policies:

resource "aws_s3_bucket_lifecycle_configuration" "artifacts" {
  bucket = aws_s3_bucket.artifacts.id

  rule {
    id     = "cleanup-old-artifacts"
    status = "Enabled"
    filter {
      prefix = "builds/"
    }
    expiration {
      days = 30
    }
    noncurrent_version_expiration {
      noncurrent_days = 7
    }
  }

  rule {
    id     = "cleanup-failed-builds"
    status = "Enabled"
    filter {
      prefix = "builds/failed/"
    }
    expiration {
      days = 7
    }
  }
}
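
The two rules overlap: anything under builds/failed/ also matches builds/. The sketch below models how such rules resolve for a given object key, assuming the shorter expiration applies when several rules match. That matches my understanding of S3's documented behaviour, but confirm it for your own setup. The `Rule` class and helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    rule_id: str
    prefix: str
    expire_after_days: int


# Mirrors the two Terraform rules above
RULES = [
    Rule("cleanup-old-artifacts", "builds/", 30),
    Rule("cleanup-failed-builds", "builds/failed/", 7),
]


def effective_expiry_days(key: str, rules: List[Rule] = RULES) -> Optional[int]:
    """Return the shortest expiration among matching rules, or None if no rule matches."""
    matching = [rule.expire_after_days for rule in rules if key.startswith(rule.prefix)]
    return min(matching) if matching else None


def is_expired(key: str, age_days: int, rules: List[Rule] = RULES) -> bool:
    """True when the object is at least as old as its effective expiration."""
    limit = effective_expiry_days(key, rules)
    return limit is not None and age_days >= limit


print(effective_expiry_days("builds/1234/app.jar"))         # 30
print(effective_expiry_days("builds/failed/1235/log.txt"))  # 7
print(effective_expiry_days("reports/junit.xml"))           # None
```

Keys outside every prefix never expire under these rules, which is a common source of unexpected storage growth.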

GitLab artifact expiry:

# In .gitlab-ci.yml
job:
  artifacts:
    expire_in: 1 week # or 3 days, 2 weeks, etc.
    when: always # Upload artifacts whether the job succeeds or fails

Automated cleanup scripts:

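#!/bin/bash
# cleanup-artifacts.sh
set -euo pipefail

REGISTRY_NAME="myregistry"
REPOSITORY="myapp"
RETENTION_DAYS=30

# Cutoff as an ISO date (GNU date syntax), compared against manifest timestamps
CUTOFF="$(date -d "${RETENTION_DAYS} days ago" -I)"

# Delete manifests older than the cutoff. This does not check whether an image
# is still deployed, so exclude in-use digests before running it in production.
az acr repository show-manifests \
  --name "$REGISTRY_NAME" \
  --repository "$REPOSITORY" \
  --orderby time_asc \
  --query "[?timestamp<'${CUTOFF}'].digest" \
  --output tsv | while read -r digest; do
  echo "Deleting digest: $digest"
  # The digest already includes its sha256: prefix
  az acr repository delete \
    --yes \
    --name "$REGISTRY_NAME" \
    --image "${REPOSITORY}@${digest}"
done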
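
An age filter alone can delete an image that is still running somewhere. The sketch below shows the selection step with an explicit keep list, assuming you can gather the set of deployed digests from your clusters. The manifest shape (a digest plus an ISO 8601 timestamp with offset) and the function name are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Set


def select_for_deletion(
    manifests: List[Dict[str, str]],
    in_use: Set[str],
    now: datetime,
    retention_days: int = 30,
) -> List[str]:
    """Return digests older than the retention window that are not currently deployed."""
    cutoff = now - timedelta(days=retention_days)
    return [
        m["digest"]
        for m in manifests
        if datetime.fromisoformat(m["timestamp"]) < cutoff and m["digest"] not in in_use
    ]


now = datetime(2026, 3, 25, tzinfo=timezone.utc)
manifests = [
    {"digest": "sha256:aaa", "timestamp": "2026-01-01T00:00:00+00:00"},  # old, unused
    {"digest": "sha256:bbb", "timestamp": "2026-01-15T00:00:00+00:00"},  # old, still deployed
    {"digest": "sha256:ccc", "timestamp": "2026-03-20T00:00:00+00:00"},  # recent
]
to_delete = select_for_deletion(manifests, in_use={"sha256:bbb"}, now=now)
print(to_delete)  # ['sha256:aaa']
```

Keeping the selection separate from the deletion also makes a dry run trivial: print the list, review it, then delete.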

Artifact Signing and Verification

Sign artifacts to ensure integrity and authenticity.

Cosign for signing:

# Install cosign
brew install cosign

# Sign image (keyless: identity comes from an OIDC login via Sigstore)
cosign sign --yes myregistry.azurecr.io/myapp:v1.0.0

# Sign with key (production)
cosign sign --key cosign.key myregistry.azurecr.io/myapp:v1.0.0

# Verify a key-based signature
cosign verify --key cosign.pub myregistry.azurecr.io/myapp:v1.0.0

GitHub Actions signing:

- name: Install Cosign
  uses: sigstore/cosign-installer@v3

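- name: Sign container image
  env:
    COSIGN_YES: "true"
    # Secret names are examples; store the private key and its password as repository secrets
    COSIGN_KEY: ${{ secrets.COSIGN_KEY }}
    COSIGN_PASSWORD: ${{ secrets.COSIGN_PASSWORD }}
  run: |
    cosign sign \
      --key env://COSIGN_KEY \
      ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}@${{ env.DIGEST }}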

Verification in Kubernetes admission:

# Kyverno policy for signed images
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-signature
      match:
        any:
          - resources:
              kinds:
                - Pod
      verifyImages:
        - imageReferences:
            - "myregistry.azurecr.io/*"
          attestors:
            - entries:
                - keys:
                    # Paste the contents of cosign.pub here
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      ...
                      -----END PUBLIC KEY-----

Cost Optimization for Artifact Storage

Tiered storage strategies:

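# Terraform lifecycle transitions to colder storage classes
# (inline lifecycle_rule is the older AWS provider syntax; on provider v4 and later,
# add these transitions to the bucket's aws_s3_bucket_lifecycle_configuration instead)
resource "aws_s3_bucket" "artifacts" {
  bucket = "myci-artifacts"

  lifecycle_rule {
    id      = "standard-ia-transition"
    enabled = true
    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }
    transition {
      days          = 90
      storage_class = "GLACIER"
    }
    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"
    }
  }
}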
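
To see what the transitions buy you, it helps to map object age to storage class and price the result. The sketch below uses the 30, 90, and 365 day thresholds from the rule above. The per-GB prices are invented round numbers for illustration only, not AWS pricing, and retrieval or minimum-duration charges are ignored.

```python
from typing import List, Tuple

# (minimum age in days, storage class), checked from coldest to warmest
TRANSITIONS = [(365, "DEEP_ARCHIVE"), (90, "GLACIER"), (30, "STANDARD_IA")]

# Hypothetical prices in USD per GB-month; substitute your provider's real rates
PRICE_PER_GB = {
    "STANDARD": 0.02,
    "STANDARD_IA": 0.01,
    "GLACIER": 0.004,
    "DEEP_ARCHIVE": 0.001,
}


def storage_class(age_days: int) -> str:
    """Return the class an object of this age would be in under the lifecycle rule."""
    for threshold, name in TRANSITIONS:
        if age_days >= threshold:
            return name
    return "STANDARD"


def monthly_cost(objects: List[Tuple[int, float]]) -> float:
    """Sum the monthly cost for (age_days, size_gb) pairs."""
    return sum(size_gb * PRICE_PER_GB[storage_class(age)] for age, size_gb in objects)


fleet = [(10, 100.0), (45, 100.0)]  # 100 GB fresh, 100 GB past the 30-day mark
tiered = monthly_cost(fleet)
untiered = sum(size for _, size in fleet) * PRICE_PER_GB["STANDARD"]
print(storage_class(45), storage_class(400))
print(tiered < untiered)
```

Run against a real inventory of artifact ages and sizes, a calculation like this shows whether tiering is worth the extra retrieval latency before you change any bucket configuration.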

Cache optimization:

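# Prefer cache over artifact storage
- name: Cache node_modules
  id: cache
  uses: actions/cache@v4
  with:
    path: node_modules
    key: node-modules-${{ runner.os }}-${{ hashFiles('package-lock.json') }}

# Cache miss - install dependencies; on a hit this step is skipped
- name: Install dependencies
  if: steps.cache.outputs.cache-hit != 'true'
  run: npm ci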

Monitor storage usage:

# GitLab CI storage report
storage:
  stage: report
  script:
    - |
      echo "Checking artifact storage usage..."
      api_url="https://gitlab.com/api/v4"
      project_id=$CI_PROJECT_ID
      token=$GITLAB_TOKEN

      # Get storage usage (statistics include storage_size and job_artifacts_size)
      curl -s --header "PRIVATE-TOKEN: $token" \
        "$api_url/projects/$project_id?statistics=true" | jq .statistics

When to Use / When Not to Use

When artifact management pays off

Artifact management matters when your pipelines build things more than once. If you are compiling code, packaging containers, or generating reports across dozens of commits per day, managing what gets stored and for how long directly affects your pipeline speed and cloud bill.

Use explicit artifact management when you need to pass outputs between pipeline stages. Downloading a fresh copy of node_modules on every job is wasteful when you could cache it. Same goes for build outputs that other jobs need later.

SBOM generation and provenance tracking make sense for any organization subject to supply chain security requirements. If your customers or compliance team ask “what went into this build?”, artifact management practices give you an answer.

When to keep it simple

For small projects with fast builds and infrequent deployments, aggressive artifact management adds complexity without much return. A project that builds in 30 seconds does not need layer caching or distributed cache infrastructure.

If you are just starting out with CI/CD, set basic retention policies first. You can always add SBOM generation and artifact signing later when the workflow stabilizes.

Artifact Management Decision Flow

flowchart TD
    A[Pipeline Build Completes] --> B{Artifact needed later?}
    B -->|No| C[Skip artifact storage]
    B -->|Yes| D{Shared across jobs?}
    D -->|Yes| E[Upload to shared storage]
    D -->|No| F{Cache dependency?}
    F -->|Yes| G[Use build cache]
    F -->|No| C
    E --> H{Security required?}
    H -->|SBOM/provenance| I[Generate SBOM + sign]
    H -->|No| J[Apply retention policy]
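
The flowchart can also be read as a small function, which is handy when you want to encode the decision in pipeline tooling. The sketch below follows the chart branch by branch; the parameter names and returned action strings are my own labels, not part of any CI platform.

```python
from typing import List


def decide(
    needed_later: bool,
    shared_across_jobs: bool,
    is_dependency_cache: bool,
    security_required: bool,
) -> List[str]:
    """Return the actions the decision flow leads to for one build output."""
    if not needed_later:
        return ["skip artifact storage"]
    if not shared_across_jobs:
        # Outputs used by a single job are either cacheable dependencies or not stored at all
        return ["use build cache"] if is_dependency_cache else ["skip artifact storage"]
    actions = ["upload to shared storage"]
    actions.append("generate SBOM + sign" if security_required else "apply retention policy")
    return actions


release_image = decide(True, True, False, True)
node_modules = decide(True, False, True, False)
scratch_file = decide(False, False, False, False)
print(release_image, node_modules, scratch_file)
```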

Production Failure Scenarios

Common Artifact Failures

Failure	Impact	Mitigation
Cache key collision	Wrong cached artifact used for different code	Include file hash in cache key
Retention too short	Artifact deleted before debugging complete	Set minimum 7-day retention for test artifacts
Storage quota exceeded	Pipeline fails uploading artifacts	Monitor usage, set lifecycle policies
Unsigned artifact deployed	Unverified image runs in production	Require signed artifacts in admission controller
SBOM stale after build	SBOM does not match deployed image	Generate SBOM after final image build, not intermediate

Cache Corruption Recovery

flowchart TD
    A[Build Fails with Cache] --> B{Cache suspected?}
    B -->|Yes| C[Clear cache]
    B -->|No| D[Investigate build script]
    C --> E[Retry build]
    E --> F{Succeeds?}
    F -->|Yes| H[Fix root cause]
    F -->|No| G[Disable cache, retry]
    G --> D
    D --> H

Artifact Verification Checklist

# Verify artifact integrity
sha256sum myapp-1.0.0.jar

# Check Cosign signature
cosign verify --key cosign.pub myregistry.azurecr.io/myapp:v1.0.0

# List the packages Syft finds in the image, to compare against the stored SBOM
syft myregistry.azurecr.io/myapp:v1.0.0 -o spdx-json | jq '.packages[].name'
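
Listing package names is only half of the validation; the other half is comparing them with the SBOM you stored at build time. The sketch below diffs two SPDX JSON documents on the `name` and `versionInfo` fields of their `packages` entries. The function names and the sample documents are hypothetical, and a real comparison may also need to account for package identifiers such as purls.

```python
import json
from typing import Dict, List, Set, Tuple

Package = Tuple[str, str]


def package_set(sbom_json: str) -> Set[Package]:
    """Extract (name, version) pairs from an SPDX JSON document."""
    doc = json.loads(sbom_json)
    return {(p["name"], p.get("versionInfo", "")) for p in doc.get("packages", [])}


def sbom_drift(stored_json: str, fresh_json: str) -> Dict[str, List[Package]]:
    """Report packages present in one SBOM but not the other."""
    stored, fresh = package_set(stored_json), package_set(fresh_json)
    return {
        "missing_from_image": sorted(stored - fresh),
        "not_in_stored_sbom": sorted(fresh - stored),
    }


# Hypothetical SBOMs: the stored one predates an openssl upgrade in the image
stored = json.dumps({"spdxVersion": "SPDX-2.3", "packages": [
    {"name": "openssl", "versionInfo": "3.0.13"},
    {"name": "zlib", "versionInfo": "1.3"},
]})
fresh = json.dumps({"spdxVersion": "SPDX-2.3", "packages": [
    {"name": "openssl", "versionInfo": "3.0.14"},
    {"name": "zlib", "versionInfo": "1.3"},
]})

drift = sbom_drift(stored, fresh)
print(drift)
```

An empty result on both sides is the signal that the stored SBOM still describes the deployed image.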

Observability Hooks

Track artifact health to catch problems before they affect builds.

What to monitor:

Artifact upload success/failure rate
Cache hit ratio (cache vs fresh download)
Storage consumption growth per pipeline
Artifact age distribution (are retention policies working?)
SBOM generation success rate

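# GitHub Actions - record cache hit or miss in the job summary
# Assumes an earlier actions/cache step with id: cache
- name: Build
  run: |
    echo "Cache hit: ${CACHE_HIT:-false}" >> "$GITHUB_STEP_SUMMARY"
    npm ci && npm test
  env:
    CACHE_HIT: ${{ steps.cache.outputs.cache-hit }}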

# GitLab CI - storage metrics
curl --header "PRIVATE-TOKEN: $TOKEN" \
  "https://gitlab.com/api/v4/projects/$PROJECT_ID?statistics=true" | jq .statistics

# AWS S3 - total bytes under one prefix
aws s3api list-objects-v2 \
  --bucket myci-artifacts \
  --prefix builds/ \
  --query 'sum(Contents[].Size)' \
  --output json
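
A single total hides which pipelines are responsible for the growth. The sketch below groups object sizes by top-level prefix, assuming the input is the JSON that `aws s3api list-objects-v2 --output json` prints (a `Contents` list with `Key` and `Size`). The function name and the sample listing are hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict


def size_by_prefix(listing_json: str) -> Dict[str, int]:
    """Sum object sizes in bytes per top-level prefix; keys without a slash go under '(root)'."""
    listing = json.loads(listing_json)
    totals: Dict[str, int] = defaultdict(int)
    for obj in listing.get("Contents", []):
        key = obj["Key"]
        prefix = key.split("/", 1)[0] + "/" if "/" in key else "(root)"
        totals[prefix] += obj["Size"]
    return dict(totals)


# Hypothetical listing for illustration
listing = json.dumps({"Contents": [
    {"Key": "builds/1234/app.jar", "Size": 100},
    {"Key": "builds/failed/1235/log.txt", "Size": 20},
    {"Key": "reports/junit.xml", "Size": 5},
    {"Key": "README", "Size": 1},
]})

totals = size_by_prefix(listing)
print(totals)  # {'builds/': 120, 'reports/': 5, '(root)': 1}
```

Emitting these totals as metrics on a schedule gives you the per-pipeline growth curve listed under what to monitor.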

Common Pitfalls / Anti-Patterns

Using the same cache for unrelated builds

If your cache key is just node-modules without a hash of lock files, a React project and a Vue project can end up sharing a cache with incompatible dependency versions. Always include a content hash in cache keys.

Forgetting to expire artifacts

Without retention policies, artifacts accumulate forever. A busy CI system can accumulate terabytes in months. Set lifecycle rules on S3 buckets and retention-days on CI artifacts from day one.

Storing credentials in artifacts

Never upload artifacts containing secrets, API keys, or credentials. Even private repositories are not immune to accidental exposure. Audit what goes into your artifacts before uploading.

Not validating cached outputs

A cache hit does not mean the cached artifact is correct. A build that succeeds once with wrong configuration will keep succeeding from cache until someone clears it. Validate cache integrity when possible.
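
One inexpensive form of validation is to record a checksum when the cache entry is written and compare it on restore. The sketch below is a minimal version of that check using SHA-256. Where the expected hash is stored (a sidecar file, the cache key itself) is left open, and the helper names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_entry_is_valid(path: Path, expected_sha256: str) -> bool:
    """True only if the file exists and its content matches the recorded hash."""
    return path.is_file() and sha256_of(path) == expected_sha256.lower()


with tempfile.TemporaryDirectory() as tmp:
    cached = Path(tmp) / "app.tar"
    cached.write_bytes(b"build output")
    expected = hashlib.sha256(b"build output").hexdigest()

    intact = cache_entry_is_valid(cached, expected)
    cached.write_bytes(b"corrupted")  # simulate a damaged cache entry
    tampered = cache_entry_is_valid(cached, expected)
    missing = cache_entry_is_valid(Path(tmp) / "absent.tar", expected)

print(intact, tampered, missing)  # True False False
```

This catches corruption and truncation. It does not catch a cache that was built correctly from wrong inputs, which is why the cache key still needs to cover every input that affects the output.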

Over-engineering the first pass

Starting with distributed caches, SBOMs, and artifact signing before your pipeline is stable is backwards. Get the basic artifact storage working first, add caching, then layer on security features as the project matures.

Trade-off Analysis

Artifact Storage	Latency	Cost	Security	Best For
S3 / GCS / Blob	Low	Pay per use	IAM + bucket policies	Most CI/CD workloads
Artifactory	Medium	License + infra	Rich access control	Enterprises managing many package formats
GitHub Packages	Medium	Storage + egress	GitHub token	GitHub-native workflows
GitLab Container Registry	Medium	Storage + egress	GitLab token	GitLab-native workflows
Harbor	Medium	Infrastructure	RBAC + OPA	Enterprise / air-gapped
Amazon ECR / GCR / ACR	Low	Pay per storage + egress	IAM	Cloud-native workloads

Interview Questions

1. What are the different types of CI/CD artifacts and their storage requirements?

Expected answer points:

Build outputs (JAR, DLL, binary) - medium size, long retention
Container images (Docker, OCI) - large size, medium retention
Test reports (JUnit, cobertura) - small size, short retention
Dependencies (npm packages, Maven) - large size, medium retention
Deployment manifests (Helm charts, K8s YAML) - small size, medium retention

2. How do you optimize build cache hit rates in CI/CD pipelines?

Expected answer points:

Include content hashes in cache keys (file hashes, not just names)
Cache dependency directories (node_modules, .m2, ~/.cargo)
Use restore-keys for partial matches when exact key misses
Docker layer caching with BuildKit cache-from/cache-to
Prefer cache over artifact storage when possible
Monitor cache hit ratio and adjust keys based on usage patterns

3. What is an SBOM and why is it important for supply chain security?

Expected answer points:

SBOM = Software Bill of Materials - a formal inventory of software components
Lists all dependencies, licenses, and versions in a build
Enables vulnerability scanning against known CVEs
Required for compliance in regulated industries (FDA, NIST)
Tools like Syft generate SPDX/CycloneDX SBOMs
SBOM at deployment must match the actual image contents

4. How do you implement artifact retention policies?

Expected answer points:

GitHub Actions: retention-days on upload-artifact action (default 90; configurable up to 90 days for public repositories and up to 400 for private ones)
GitLab CI: expire_in directive on artifacts (1 week, 3 days, etc.)
S3: Lifecycle rules with expiration days, transition to IA/Glacier
Azure Blob: lifecycle management policies
Container registries: Use az acr repository show-manifests with timestamp filter

5. How do you sign and verify container images with Cosign?

Expected answer points:

Install cosign: brew install cosign or from GitHub releases
Sign image: cosign sign --yes myregistry.azurecr.io/myapp:v1.0.0
For production: use cosign sign --key cosign.key for key-based signing
Verify: cosign verify --key cosign.pub myregistry.azurecr.io/myapp:v1.0.0
CI signing: Use sigstore/cosign-installer action with COSIGN_KEY env var

6. What is the difference between artifact caching and artifact storage?

Expected answer points:

Artifact caching: stores intermediate build outputs (dependencies, layers) to speed up future builds
Artifact storage: stores final build outputs (binaries, images, reports) for deployment or archival
Cache entries are keyed by hashes of their inputs (lock files, OS, tool versions)
Artifacts are named and versioned, often with retention policies
Cache misses are acceptable; artifact misses cause pipeline failures

7. How do you handle cache key collisions in CI pipelines?

Expected answer points:

Cache key must include content hashes, not just cache name
Example: key: maven-${{ runner.os }}-${{ hashFiles('**/pom.xml') }}
If unrelated builds share the same cache, different lock files get mixed
Use restore-keys for fallback to partial matches
Monitor for unexpected cache behavior and refine keys

8. How do you recover from cache corruption in a CI pipeline?

Expected answer points:

Identify cache as root cause when builds fail only with cache present
Clear cache via CI platform UI or API
Retry build - if succeeds, cache was the issue
If still fails, investigate build script or disable cache temporarily
Fix root cause: bad cache key, corrupted dependency, etc.
Monitor for recurring cache issues and address fundamentally

9. What is provenance tracking and how do you implement it?

Expected answer points:

Provenance tracks the origin and build process of an artifact
GitHub Actions: use actions/attest-build-provenance to attach provenance
Provenance records the subject name (image reference), the subject digest, and the build context
Verify provenance at deployment using OPA Gatekeeper or Kyverno
Store provenance attestations in transparency logs (Rekor) for auditability

10. How do you implement tiered storage for CI/CD artifacts?

Expected answer points:

S3 lifecycle: transition to STANDARD_IA after 30 days, GLACIER after 90 days
Deep Archive for rarely accessed artifacts (1 year+ old)
Set minimum retention for artifacts needed for debugging (7+ days)
Separate buckets or prefixes for different artifact types with different policies
Monitor storage costs and adjust policies based on usage patterns

11. How do you prevent credentials from being stored in artifacts?

Expected answer points:

Never upload artifacts containing secrets, API keys, or credentials
Pass secrets to jobs through the CI secret store or environment, not through files written into the workspace
Audit what goes into your artifacts before uploading
Use git secrets --scan to detect credentials before upload
Use sealed-secrets or external secrets operators instead of plain secrets
Even private repositories are not immune to accidental exposure

12. How do you validate cached outputs are correct?

Expected answer points:

A cache hit does not mean the cached artifact is correct
Build that succeeds once with wrong configuration keeps succeeding from cache
Validate cache integrity: compare output hash with expected hash
Consider cache invalidation triggers: lock file changes, env changes
Periodically clear cache to catch stale results

13. What is the decision flow for artifact storage vs caching?

Expected answer points:

Pipeline Build Completes → Artifact needed later?
No → Skip artifact storage
Yes → Shared across jobs? → Yes: Upload to shared storage
No → Cache dependency? → Yes: Use build cache
Security required (SBOM/provenance)? → Generate SBOM + sign
Otherwise: Apply retention policy

14. How do you monitor CI/CD artifact storage costs?

Expected answer points:

Track artifact upload success/failure rate
Monitor storage consumption growth per pipeline
Track artifact age distribution - are retention policies working?
GitHub Actions: enable step summary for cache reporting
GitLab CI: use API to get project storage usage
AWS S3: list-objects with query to sum sizes

15. What are the trade-offs between different artifact storage backends?

Expected answer points:

S3/GCS/Blob: Low latency, pay-per-use, versioning, lifecycle - best for most CI/CD
Artifactory: Medium latency, license + infra, rich access control - enterprises managing many package formats
GitHub Packages: Medium latency, storage + egress, GitHub token - GitHub-native workflows
GitLab Container Registry: Medium latency, storage + egress, GitLab token - GitLab-native
Harbor: Medium latency, infrastructure, RBAC + OPA - enterprise/air-gapped
ECR/GCR/ACR: Low latency, pay per storage + egress, IAM - cloud-native workloads

16. How do you handle failed builds that leave stale artifacts?

Expected answer points:

GitLab CI: use expire_in with when: always to keep artifacts on failure
Create a separate cleanup job that runs on failure to remove stale artifacts
Use lifecycle policies with prefix filters for builds/failed/ paths
Set short retention (7 days) for failed build artifacts
Monitor for artifacts exceeding expected retention

17. How do you implement image signing in Kubernetes admission?

Expected answer points:

Use Kyverno ClusterPolicy with verifyImages rule
Attestors block specifies how to verify signatures (keys or attestors)
validationFailureAction: Enforce blocks non-signed images
imageReferences can be wildcards like myregistry.azurecr.io/*
The keys entry supplies the public key for verification (inline via publicKeys, or by reference to a Secret)

18. What is the difference between Cosign key-based signing and keyless signing?

Expected answer points:

Key-based: uses a stored private key (cosign.key) for signing
Keyless: uses OIDC identity from GitHub Actions/Google, short-lived certificates
Keyless leverages Sigstore PKI infrastructure
Keyless preferred for CI/CD to avoid key management complexity
Production may prefer key-based for audit trail control

19. How do you handle layer-based Docker caching with BuildKit?

Expected answer points:

Use docker/setup-buildx-action to enable BuildKit
cache-from: type=gha uses GitHub Actions cache
cache-to: type=gha,mode=max saves all layers (not just final)
mode=max ensures intermediate layers cached for better hits
GitLab: DOCKER_BUILDKIT=1 with --cache-from for registry caching

20. How do you generate and verify SBOMs with Syft?

Expected answer points:

GitHub Actions: anchore/sbom-action generates SBOM on image build
Syft CLI: syft myregistry.azurecr.io/myapp:v1.0.0 -o spdx-json
GitLab CI: syft image in anchore/syft:latest container with spdx-json output
Verify: syft can regenerate SBOM from image and compare with stored
Run license checks against the generated SBOM in a separate step; Syft itself only inventories components

Conclusion

Key Takeaways

Cache dependency downloads to cut pipeline time significantly
SBOMs and provenance are essential for supply chain security compliance
Retention policies prevent storage costs from growing unbounded
Sign artifacts with Cosign and verify in Kubernetes admission
Cache keys must include content hashes to avoid collisions

Artifact Health Checklist

# Check cache hit rate in GitHub Actions
# (enable Actions step summary for this)

# Verify artifact retention is set
grep -r "retention-days" .github/workflows/

# Test Cosign verification locally
cosign verify --key cosign.pub myregistry.azurecr.io/myapp:v1.0.0

# Check S3 lifecycle rules
aws s3api get-bucket-lifecycle-configuration --bucket myci-artifacts

# Scan for credentials in artifacts before upload
git secrets --scan && echo "No secrets found"

# List large artifacts for cleanup
aws s3api list-objects-v2 \
  --bucket myci-artifacts \
  --query 'sort_by(Contents, &Size)[-10:]'
