Artifact Management: Build Caching, Provenance, and Retention
Manage CI/CD artifacts effectively: build caching for speed, provenance tracking for security, and retention policies for cost control.
Introduction
Artifact management is the discipline of organizing, storing, and lifecycle-managing the outputs of your CI/CD pipelines. Every build produces artifacts: compiled binaries, container images, test reports, and deployment manifests. Without thoughtful management, these outputs become unmanageable, costly to store, and potentially insecure.
Proper artifact management solves three problems. First, speed: reusing cached dependencies instead of re-downloading them on every pipeline run cuts build times dramatically. Second, security: provenance tracking and signing ensure you can verify that the image you deployed is exactly what your pipeline built. Third, cost: retention policies prevent storage bills from growing unbounded while keeping the artifacts you actually need for debugging and compliance.
This guide covers artifact types and storage backends, build caching strategies, SBOM and provenance tracking, retention policies and cleanup, artifact signing, and cost optimization. You will learn how to design artifact management that scales with your CI/CD maturity.
Artifact Types and Storage Backends
CI/CD pipelines produce various artifact types requiring different storage characteristics.
Common artifact types:
| Type | Examples | Size | Retention |
|---|---|---|---|
| Build outputs | JAR, DLL, binary | Medium | Long |
| Container images | Docker, OCI | Large | Medium |
| Test reports | JUnit, cobertura | Small | Short |
| Dependencies | npm packages, Maven | Large | Medium |
| Deployment manifests | Helm charts, K8s YAML | Small | Medium |
Storage backend options:
| Backend | Best For | Cost | Features |
|---|---|---|---|
| S3/GCS | Any artifact type | Pay per use | Versioning, lifecycle |
| Azure Blob | Cross-cloud | Competitive | Immutable blobs |
| Artifactory | Package management | Enterprise | Universal format |
| GitHub Actions cache | Build cache | Limited free | Built-in |
| GitLab CI artifacts | Native integration | Included | Simple |
GitHub Actions artifact configuration:
- name: Upload build artifacts
  uses: actions/upload-artifact@v4
  with:
    name: build-${{ matrix.node-version }}
    path: |
      dist/
      coverage/
      *.nupkg
    retention-days: 30
    compression-level: 9
- name: Download artifacts
  uses: actions/download-artifact@v4
  with:
    pattern: build-*
    path: ./combined
    merge-multiple: true
GitLab CI artifacts:
build:
  stage: build
  script:
    - npm run build
  artifacts:
    name: "build-$CI_COMMIT_SHORT_SHA"
    paths:
      - dist/
      - coverage/
    expire_in: 1 week
    reports:
      junit: junit.xml
      # coverage_report takes a format and a path, not a bare file name
      coverage_report:
        coverage_format: cobertura
        path: cobertura.xml
Build Cache Strategies
Caching dependencies and intermediate build outputs dramatically reduces pipeline time.
Dependency cache patterns:
# npm with cache
- uses: actions/setup-node@v4
  with:
    node-version: "20"
    cache: "npm"
    cache-dependency-path: "package-lock.json"
# Maven with cache (local repository only; build outputs keyed on pom.xml alone would go stale)
- uses: actions/cache@v4
  with:
    path: ~/.m2/repository
    key: maven-${{ runner.os }}-${{ hashFiles('**/pom.xml') }}
    restore-keys: |
      maven-${{ runner.os }}-
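The key and restore-keys pattern above can be sketched in Python. This is a simplified model, not the cache action's implementation: `cache_key` and `restore` are hypothetical helpers, the lock file is hashed as a single byte string, and stored keys are assumed to be ordered oldest to newest.

```python
import hashlib
from typing import List, Optional


def cache_key(prefix: str, os_name: str, lock_contents: bytes) -> str:
    """Build a key such as 'maven-Linux-HASH' from the lock file contents."""
    digest = hashlib.sha256(lock_contents).hexdigest()
    return f"{prefix}-{os_name}-{digest}"


def restore(stored_keys: List[str], key: str, restore_keys: List[str]) -> Optional[str]:
    """Pick the entry to restore: exact match first, then prefix fallbacks."""
    if key in stored_keys:
        return key
    for prefix in restore_keys:
        matches = [k for k in stored_keys if k.startswith(prefix)]
        if matches:
            # Assumption: the newest entry is last in stored_keys.
            return matches[-1]
    return None


old = cache_key("maven", "Linux", b"dependencies v1")
new = cache_key("maven", "Linux", b"dependencies v2")
stored = [old]

print(restore(stored, old, ["maven-Linux-"]) == old)  # exact hit
print(restore(stored, new, ["maven-Linux-"]) == old)  # partial hit via fallback
print(restore(stored, new, []))                       # miss: no fallback configured
```

A partial hit restores a cache built for a different lock file, so the build still has to resolve whatever changed; the fallback only saves the unchanged downloads.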
Layer-based Docker caching:
# GitHub Actions with BuildKit
- uses: docker/setup-buildx-action@v3
- uses: docker/build-push-action@v5
  with:
    push: true
    tags: myregistry.azurecr.io/myapp:${{ github.sha }}
    # The gha cache backend needs no URL or token here; never embed GITHUB_TOKEN in a cache URL
    cache-from: type=gha
    cache-to: type=gha,mode=max
GitLab Docker layer caching:
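build:docker:
  stage: build
  image: docker:24
  services:
    - docker:dind
  variables:
    DOCKER_BUILDKIT: "1"
  script:
    - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"
    # Pull the previous image as a cache source; tolerate a miss on the first run
    - docker pull "$CI_REGISTRY_IMAGE:previous" || true
    # BUILDKIT_INLINE_CACHE embeds cache metadata so later builds can reuse the pushed layers
    - docker build --cache-from "$CI_REGISTRY_IMAGE:previous" --build-arg BUILDKIT_INLINE_CACHE=1 -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA" -t "$CI_REGISTRY_IMAGE:previous" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"
    - docker push "$CI_REGISTRY_IMAGE:previous"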
Distributed cache for self-hosted runners:
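# Terraform for S3 cache bucket
resource "aws_s3_bucket" "cache" {
  bucket = "myci-cache-bucket"
}
# Expire cache objects after 14 days (separate resource; the inline lifecycle_rule block is deprecated in AWS provider v4+)
resource "aws_s3_bucket_lifecycle_configuration" "cache" {
  bucket = aws_s3_bucket.cache.id
  rule {
    id     = "expire-cache"
    status = "Enabled"
    filter {}
    expiration {
      days = 14
    }
  }
}
resource "aws_s3_bucket_server_side_encryption_configuration" "cache" {
  bucket = aws_s3_bucket.cache.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}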
SBOM and Artifact Provenance
Software Bills of Materials and provenance tracking improve supply chain security.
Generate SBOM with Syft:
- name: Generate SBOM
  uses: anchore/sbom-action@v0
  with:
    image: myregistry.azurecr.io/myapp:${{ github.sha }}
    format: spdx-json
    output-file: sbom.spdx.json
    # License policy enforcement is not among this action's documented inputs; run it as a separate check against the SBOM
Attach provenance with GitHub Actions:
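# Requires job permissions id-token: write and attestations: write
- name: Generate provenance
  uses: actions/attest-build-provenance@v1
  with:
    subject-name: myregistry.azurecr.io/myapp
    subject-digest: ${{ env.IMAGE_SHA }} # must be the full digest, prefixed with sha256:
    push-to-registry: true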
GitLab CI SBOM generation:
sbom:generation:
  stage: analyze
  image:
    name: anchore/syft:latest
    entrypoint: [""]
  script:
    - syft myregistry.azurecr.io/myapp:${CI_COMMIT_SHA} -o spdx-json > sbom.spdx
  artifacts:
    paths:
      - sbom.spdx
    expire_in: 1 week
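Once an SPDX JSON SBOM exists, reading it back is straightforward. The sketch below is a minimal illustration, assuming the SPDX 2.x JSON layout with a top-level `packages` array whose entries carry `name` and an optional `versionInfo`; `list_packages` is a hypothetical helper and the sample document is made up for the example.

```python
import json
from typing import List, Tuple


def list_packages(spdx_json: str) -> List[Tuple[str, str]]:
    """Return (name, version) pairs from an SPDX JSON document, sorted by name."""
    doc = json.loads(spdx_json)
    return sorted(
        (pkg["name"], pkg.get("versionInfo", "unknown"))
        for pkg in doc.get("packages", [])
    )


# Hypothetical sample; a real SBOM would come from the sbom.spdx artifact.
sample = json.dumps({
    "spdxVersion": "SPDX-2.3",
    "packages": [
        {"name": "openssl", "versionInfo": "3.0.13"},
        {"name": "busybox"},
    ],
})

for name, version in list_packages(sample):
    print(f"{name} {version}")
```

An inventory like this is what a vulnerability scanner or license check consumes downstream.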
Verify provenance at deployment:
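# OPA Gatekeeper policy for provenance (illustrative pseudocode:
# provenance.verify is not a built-in Rego function, and Gatekeeper
# normally loads Rego through a ConstraintTemplate, not a ConfigMap)
apiVersion: v1
kind: ConfigMap
metadata:
  name: provenance-policy
  namespace: gatekeeper-system
data:
  policy: |
    package kubernetes.admission
    deny[msg] {
      image := input.request.object.spec.containers[_].image
      not provenance.verify(image)
      msg := "Image provenance verification failed"
    }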
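One piece of a deployment-time provenance check can be sketched in Python: confirming that an attestation actually names the image digest about to be deployed. This assumes the in-toto Statement layout, where `subject` is a list of entries with a `digest` map keyed by algorithm. `subject_matches` is a hypothetical helper and the digests are placeholders. It does not verify the attestation's signature, which a real verifier must do first.

```python
def subject_matches(statement: dict, image_digest: str) -> bool:
    """True if any subject in the statement carries the given digest.

    image_digest is expected in 'algorithm:hex' form, for example 'sha256:ab12...'.
    """
    algo, _, hex_digest = image_digest.partition(":")
    if not hex_digest:
        return False  # malformed digest: refuse rather than guess
    return any(
        subject.get("digest", {}).get(algo) == hex_digest
        for subject in statement.get("subject", [])
    )


# Placeholder statement for illustration only.
statement = {
    "_type": "https://in-toto.io/Statement/v1",
    "subject": [
        {"name": "myregistry.azurecr.io/myapp", "digest": {"sha256": "a" * 64}}
    ],
    "predicateType": "https://slsa.dev/provenance/v1",
    "predicate": {},
}

print(subject_matches(statement, "sha256:" + "a" * 64))  # True
print(subject_matches(statement, "sha256:" + "b" * 64))  # False
```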
Retention Policies and Cleanup
Automatic cleanup prevents storage costs from growing unbounded.
GitHub Actions retention:
# Set globally or per artifact
- name: Upload artifact
  uses: actions/upload-artifact@v4
  with:
    name: logs
    path: logs/
    retention-days: 7 # Override default 90 days
S3 lifecycle policies:
resource "aws_s3_bucket_lifecycle_configuration" "artifacts" {
  bucket = aws_s3_bucket.artifacts.id
  rule {
    id     = "cleanup-old-artifacts"
    status = "Enabled"
    filter {
      prefix = "builds/"
    }
    expiration {
      days = 30
    }
    noncurrent_version_expiration {
      noncurrent_days = 7
    }
  }
  rule {
    id     = "cleanup-failed-builds"
    status = "Enabled"
    filter {
      prefix = "builds/failed/"
    }
    expiration {
      days = 7
    }
  }
}
GitLab artifact expiry:
# In .gitlab-ci.yml
job:
  artifacts:
    expire_in: 1 week # or 3 days, 2 weeks, etc.
    when: always # Upload artifacts even when the job fails
Automated cleanup scripts:
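#!/bin/bash
# cleanup-artifacts.sh
set -euo pipefail
REGISTRY="myregistry" # registry name, without the .azurecr.io suffix
REPOSITORY="myapp"
RETENTION_DAYS=30
CUTOFF="$(date -u -d "${RETENTION_DAYS} days ago" -I)" # GNU date syntax
# Delete manifests older than the cutoff. This does not check whether an image is still deployed.
# Newer Azure CLI versions deprecate show-manifests in favor of `az acr manifest list-metadata`.
az acr repository show-manifests \
  --name "$REGISTRY" \
  --repository "$REPOSITORY" \
  --orderby time_asc \
  --query "[?timestamp<'${CUTOFF}'].digest" \
  --output tsv | while read -r digest; do
    echo "Deleting digest: $digest"
    # The digest value already carries the sha256: prefix
    az acr repository delete \
      --yes \
      --name "$REGISTRY" \
      --image "${REPOSITORY}@${digest}"
done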
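The age-based selection in the cleanup script can be sketched in Python, with one addition the script lacks: a set of digests to keep because they are still deployed. `expired_digests` and the `keep` parameter are hypothetical, and the manifest records are made up; they mimic the `digest` and `timestamp` fields the script queries.

```python
from datetime import datetime, timedelta, timezone
from typing import FrozenSet, List


def expired_digests(
    manifests: List[dict],
    retention_days: int,
    now: datetime,
    keep: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Return digests older than the retention window, skipping protected ones."""
    cutoff = now - timedelta(days=retention_days)
    return [
        m["digest"]
        for m in manifests
        if datetime.fromisoformat(m["timestamp"]) < cutoff and m["digest"] not in keep
    ]


now = datetime(2024, 3, 1, tzinfo=timezone.utc)
manifests = [
    {"digest": "sha256:old", "timestamp": "2024-01-01T00:00:00+00:00"},
    {"digest": "sha256:live", "timestamp": "2024-01-02T00:00:00+00:00"},
    {"digest": "sha256:new", "timestamp": "2024-02-25T00:00:00+00:00"},
]

print(expired_digests(manifests, 30, now))
print(expired_digests(manifests, 30, now, keep=frozenset({"sha256:live"})))
```

Feeding the keep set from whatever is currently running in each environment avoids deleting an old image that a long-lived deployment still references.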
Artifact Signing and Verification
Sign artifacts to ensure integrity and authenticity.
Cosign for signing:
# Install cosign
brew install cosign
# Sign image (keyless: authenticates through an OIDC identity provider)
cosign sign --yes myregistry.azurecr.io/myapp:v1.0.0
# Sign with key (production)
cosign sign --key cosign.key myregistry.azurecr.io/myapp:v1.0.0
# Verify a key-based signature
cosign verify --key cosign.pub myregistry.azurecr.io/myapp:v1.0.0
GitHub Actions signing:
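- name: Install Cosign
  uses: sigstore/cosign-installer@v3
- name: Sign container image
  env:
    # Secret names are placeholders; cosign reads the key from the variable named in env://
    COSIGN_PRIVATE_KEY: ${{ secrets.COSIGN_PRIVATE_KEY }}
    COSIGN_PASSWORD: ${{ secrets.COSIGN_PASSWORD }}
  run: |
    cosign sign --yes \
      --key env://COSIGN_PRIVATE_KEY \
      ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}@${{ env.DIGEST }}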
Verification in Kubernetes admission:
# Kyverno policy for signed images
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-signature
      match:
        any:
          - resources:
              kinds:
                - Pod
      verifyImages:
        - imageReferences:
            - "myregistry.azurecr.io/*"
          attestors:
            - entries:
                - keys:
                    # Paste the contents of cosign.pub here
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      ...
                      -----END PUBLIC KEY-----
Cost Optimization for Artifact Storage
Tiered storage strategies:
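# Terraform lifecycle transitions to colder storage classes
# (this is rule-based tiering, not the S3 Intelligent-Tiering class; S3 allows one
# lifecycle configuration per bucket, so merge these with any expiration rules)
resource "aws_s3_bucket" "artifacts" {
  bucket = "myci-artifacts"
}
resource "aws_s3_bucket_lifecycle_configuration" "artifacts_tiering" {
  bucket = aws_s3_bucket.artifacts.id
  rule {
    id     = "standard-ia-transition"
    status = "Enabled"
    filter {}
    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }
    transition {
      days          = 90
      storage_class = "GLACIER"
    }
    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"
    }
  }
}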
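To reason about where an artifact sits under the transitions above, the thresholds can be modeled as a small lookup. This is only a sketch: `storage_class_for_age` is a hypothetical helper, and it ignores the detail that S3 applies transitions asynchronously rather than at the exact moment an object crosses a threshold.

```python
# Thresholds mirror the lifecycle rule above, checked from coldest to warmest.
TRANSITIONS = [
    (365, "DEEP_ARCHIVE"),
    (90, "GLACIER"),
    (30, "STANDARD_IA"),
]


def storage_class_for_age(age_days: int) -> str:
    """Return the storage class an object of this age is expected to be in."""
    for threshold, storage_class in TRANSITIONS:
        if age_days >= threshold:
            return storage_class
    return "STANDARD"


for age in (0, 30, 89, 90, 400):
    print(age, storage_class_for_age(age))
```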
Cache optimization:
# Prefer cache over artifact storage
- name: Build with cache
  run: |
    if [ -d node_modules ]; then
      # Cache hit - use cached dependencies
      echo "Using cached dependencies"
    else
      # Cache miss - install from the lock file
      npm ci
    fi
Monitor storage usage:
# GitLab CI storage report
storage:
  stage: report
  script:
    - |
      echo "Checking artifact storage usage..."
      api_url="https://gitlab.com/api/v4"
      project_id=$CI_PROJECT_ID
      token=$GITLAB_TOKEN
      # Get storage statistics (includes job_artifacts_size and storage_size)
      curl -s --header "PRIVATE-TOKEN: $token" \
        "$api_url/projects/$project_id?statistics=true" | jq .statistics
When to Use / When Not to Use
When artifact management pays off
Artifact management matters when your pipelines build things more than once. If you are compiling code, packaging containers, or generating reports across dozens of commits per day, managing what gets stored and for how long directly affects your pipeline speed and cloud bill.
Use explicit artifact management when you need to pass outputs between pipeline stages. Downloading a fresh copy of node_modules on every job is wasteful when you could cache it. Same goes for build outputs that other jobs need later.
SBOM generation and provenance tracking make sense for any organization subject to supply chain security requirements. If your customers or compliance team ask “what went into this build?”, artifact management practices give you an answer.
When to keep it simple
For small projects with fast builds and infrequent deployments, aggressive artifact management adds complexity without much return. A project that builds in 30 seconds does not need layer caching or distributed cache infrastructure.
If you are just starting out with CI/CD, set basic retention policies first. You can always add SBOM generation and artifact signing later when the workflow stabilizes.
Artifact Management Decision Flow
flowchart TD
A[Pipeline Build Completes] --> B{Artifact needed later?}
B -->|No| C[Skip artifact storage]
B -->|Yes| D{Shared across jobs?}
D -->|Yes| E[Upload to shared storage]
D -->|No| F{Cache dependency?}
F -->|Yes| G[Use build cache]
F -->|No| C
E --> H{Security required?}
H -->|SBOM/provenance| I[Generate SBOM + sign]
H -->|No| J[Apply retention policy]
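The same flow reads naturally as a function. The sketch below mirrors the branches of the diagram one for one; `artifact_actions` and its parameter names are hypothetical.

```python
from typing import List


def artifact_actions(
    needed_later: bool,
    shared_across_jobs: bool,
    cache_dependency: bool,
    security_required: bool,
) -> List[str]:
    """Return the steps the decision flow prescribes for one pipeline output."""
    if not needed_later:
        return ["skip artifact storage"]
    if shared_across_jobs:
        follow_up = "generate SBOM + sign" if security_required else "apply retention policy"
        return ["upload to shared storage", follow_up]
    if cache_dependency:
        return ["use build cache"]
    return ["skip artifact storage"]


print(artifact_actions(True, True, False, True))
print(artifact_actions(True, False, True, False))
print(artifact_actions(False, False, False, False))
```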
Production Failure Scenarios
Common Artifact Failures
| Failure | Impact | Mitigation |
|---|---|---|
| Cache key collision | Wrong cached artifact used for different code | Include file hash in cache key |
| Retention too short | Artifact deleted before debugging complete | Set minimum 7-day retention for test artifacts |
| Storage quota exceeded | Pipeline fails uploading artifacts | Monitor usage, set lifecycle policies |
| Unsigned artifact reaches deployment | Admission policy blocks the rollout, or unverified code runs if no policy exists | Sign in the pipeline and require signed artifacts in the admission controller |
| SBOM stale after build | SBOM does not match deployed image | Generate SBOM after final image build, not intermediate |
Cache Corruption Recovery
flowchart TD
A[Build Fails with Cache] --> B{Cache suspected?}
B -->|Yes| C[Clear cache]
B -->|No| D[Investigate build script]
C --> E[Retry build]
E --> F{Succeeds?}
F -->|Yes| H[Fix root cause]
F -->|No| G[Disable cache, retry]
G --> D
D --> H
Artifact Verification Checklist
# Verify artifact integrity
sha256sum myapp-1.0.0.jar
# Check Cosign signature
cosign verify --key cosign.pub myregistry.azurecr.io/myapp:v1.0.0
# Validate SBOM against image
syft myregistry.azurecr.io/myapp:v1.0.0 -o spdx-json | jq '.packages[].name'
Observability Hooks
Track artifact health to catch problems before they affect builds.
What to monitor:
- Artifact upload success/failure rate
- Cache hit ratio (cache vs fresh download)
- Storage consumption growth per pipeline
- Artifact age distribution (are retention policies working?)
- SBOM generation success rate
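# GitHub Actions - check cache hit rate
# Report the cache result in the job summary
- uses: actions/cache@v4
  id: cache
  with:
    path: ~/.npm
    key: npm-${{ runner.os }}-${{ hashFiles('package-lock.json') }}
- name: Build
  run: |
    echo "Cache hit: ${{ steps.cache.outputs.cache-hit }}" >> "$GITHUB_STEP_SUMMARY"
    npm ci && npm test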
# GitLab CI - storage metrics (statistics include job_artifacts_size)
curl --header "PRIVATE-TOKEN: $TOKEN" \
  "https://gitlab.com/api/v4/projects/$PROJECT_ID?statistics=true"
# AWS S3 - bucket size by prefix
aws s3api list-objects-v2 \
  --bucket myci-artifacts \
  --prefix builds/ \
  --query 'sum(Contents[].Size)' \
  --output text
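Two of the metrics listed above reduce to simple arithmetic once the raw data is collected. The sketch below assumes you already have a list of per-job cache results and a list of artifact ages in days; `cache_hit_ratio` and `age_buckets` are hypothetical helpers and the inputs are made up.

```python
from collections import Counter
from typing import Iterable


def cache_hit_ratio(hits: Iterable[bool]) -> float:
    """Fraction of jobs that restored an exact cache hit (0.0 when there is no data)."""
    hits = list(hits)
    return sum(hits) / len(hits) if hits else 0.0


def age_buckets(ages_days: Iterable[int], retention_days: int) -> Counter:
    """Count artifacts inside and beyond the retention window."""
    return Counter(
        "expired" if age > retention_days else "within" for age in ages_days
    )


print(cache_hit_ratio([True, True, False, True]))
print(dict(age_buckets([1, 5, 40], retention_days=30)))
```

A nonzero "expired" count means artifacts are outliving the policy, which usually points at a lifecycle rule that does not match the prefix actually in use.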
Common Pitfalls / Anti-Patterns
Using the same cache for unrelated builds
If your cache key is just node-modules without a hash of lock files, a React project and a Vue project can end up sharing a cache with incompatible dependency versions. Always include a content hash in cache keys.
Forgetting to expire artifacts
Without retention policies, artifacts accumulate forever. A busy CI system can accumulate terabytes in months. Set lifecycle rules on S3 buckets and retention-days on CI artifacts from day one.
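A back-of-the-envelope projection shows how quickly this adds up. The numbers below are purely hypothetical (200 builds per day at 500 MB each), and `stored_gb` is an illustrative helper that assumes every build produces the same artifact size.

```python
from typing import Optional


def stored_gb(
    builds_per_day: int,
    artifact_mb: float,
    days_elapsed: int,
    retention_days: Optional[int] = None,
) -> float:
    """Approximate storage held after days_elapsed, with or without retention."""
    # Without retention everything is kept; with it, only the last N days remain.
    days_kept = days_elapsed if retention_days is None else min(days_elapsed, retention_days)
    return builds_per_day * artifact_mb * days_kept / 1024


unbounded = stored_gb(200, 500, 180)
bounded = stored_gb(200, 500, 180, retention_days=30)
print(f"No retention after 180 days: {unbounded:.0f} GB")
print(f"30-day retention:            {bounded:.0f} GB")
```

With retention in place the total plateaus at the window size instead of growing linearly with time.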
Storing credentials in artifacts
Never upload artifacts containing secrets, API keys, or credentials. Even private repositories are not immune to accidental exposure. Audit what goes into your artifacts before uploading.
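A pre-upload audit can start as a simple pattern scan. The sketch below is a minimal illustration, not a replacement for a dedicated secret scanner: `find_secrets` is a hypothetical helper, and the two patterns (the AWS access key ID prefix and PEM private key headers) cover only a small share of what can leak.

```python
import re
from typing import List

PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def find_secrets(text: str) -> List[str]:
    """Return the names of the patterns that match anywhere in the text."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(text))


# Synthetic strings for illustration; these are not real credentials.
fake_key = "AKIA" + "A" * 16
print(find_secrets(f"aws_key={fake_key}"))
print(find_secrets("-----BEGIN RSA PRIVATE KEY-----"))
print(find_secrets("build finished in 42s"))
```

Running a check like this over the files selected for upload, and failing the job on any match, is cheaper than rotating a leaked key.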
Not validating cached outputs
A cache hit does not mean the cached artifact is correct. A build that succeeds once with wrong configuration will keep succeeding from cache until someone clears it. Validate cache integrity when possible.
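One way to validate a restored cache is to store a manifest of file hashes alongside it and compare after restore. The sketch below shows the idea with hypothetical helpers `build_manifest` and `changed_files`, run against a temporary directory. Note that this catches corruption or tampering after the cache was saved, not a cache that was wrong when it was created.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's relative path to the SHA-256 of its contents."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: Dict[str, str]) -> List[str]:
    """List files that were added, removed, or modified since the manifest was built."""
    current = build_manifest(root)
    return sorted(
        name
        for name in set(manifest) | set(current)
        if manifest.get(name) != current.get(name)
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "lib.js").write_text("module.exports = 1;\n")
    manifest = build_manifest(root)          # saved next to the cache
    print(changed_files(root, manifest))     # nothing changed yet
    (root / "lib.js").write_text("module.exports = 2;\n")
    print(changed_files(root, manifest))     # the modified file is reported
```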
Over-engineering the first pass
Starting with distributed caches, SBOMs, and artifact signing before your pipeline is stable is backwards. Get the basic artifact storage working first, add caching, then layer on security features as the project matures.
Trade-off Analysis
| Artifact Storage | Latency | Cost | Security | Best For |
|---|---|---|---|---|
| S3 / GCS / Blob | Low | Pay per use | IAM + bucket policies | Most CI/CD workloads |
| Artifactory | Medium | License + infra | Rich access control | Enterprises managing many package formats |
| GitHub Packages | Medium | Storage + egress | GitHub token | GitHub-native workflows |
| GitLab Container Registry | Medium | Storage + egress | GitLab token | GitLab-native workflows |
| Harbor | Medium | Infrastructure | RBAC + OPA | Enterprise / air-gapped |
| Amazon ECR / GCR / ACR | Low | Pay per storage + egress | IAM | Cloud-native workloads |
Interview Questions
Expected answer points:
- Build outputs (JAR, DLL, binary) - medium size, long retention
- Container images (Docker, OCI) - large size, medium retention
- Test reports (JUnit, cobertura) - small size, short retention
- Dependencies (npm packages, Maven) - large size, medium retention
- Deployment manifests (Helm charts, K8s YAML) - small size, medium retention
Expected answer points:
- Include content hashes in cache keys (file hashes, not just names)
- Cache dependency directories (node_modules, .m2, ~/.cargo)
- Use restore-keys for partial matches when exact key misses
- Docker layer caching with BuildKit cache-from/cache-to
- Prefer cache over artifact storage when possible
- Monitor cache hit ratio and adjust keys based on usage patterns
Expected answer points:
- SBOM = Software Bill of Materials - a formal inventory of software components
- Lists all dependencies, licenses, and versions in a build
- Enables vulnerability scanning against known CVEs
- Increasingly required in regulated contexts (for example FDA medical device submissions and US federal procurement that follows NIST guidance)
- Tools like Syft generate SPDX/CycloneDX SBOMs
- SBOM at deployment must match the actual image contents
Expected answer points:
- GitHub Actions: retention-days on upload-artifact action (default 90; maximum 90 for public repositories, 400 for private ones)
- GitLab CI: expire_in directive on artifacts (1 week, 3 days, etc.)
- S3: Lifecycle rules with expiration days, transition to IA/Glacier
- Azure Blob: lifecycle management policies
- Container registries: Use az acr repository show-manifests with timestamp filter
Expected answer points:
- Install cosign: brew install cosign or from GitHub releases
- Sign image: cosign sign --yes myregistry.azurecr.io/myapp:v1.0.0
- For production: use cosign sign --key cosign.key for key-based signing
- Verify: cosign verify --key cosign.pub myregistry.azurecr.io/myapp:v1.0.0
- CI signing: Use sigstore/cosign-installer action and pass the private key through an environment variable (cosign accepts --key env://VARIABLE_NAME)
Expected answer points:
- Artifact caching: stores intermediate build outputs (dependencies, layers) to speed up future builds
- Artifact storage: stores final build outputs (binaries, images, reports) for deployment or archival
- Cache entries are looked up by key, ideally one derived from input hashes
- Artifacts are named and versioned, often with retention policies
- Cache misses are acceptable; artifact misses cause pipeline failures
Expected answer points:
- Cache key must include content hashes, not just cache name
- Example: key: maven-${{ runner.os }}-${{ hashFiles('**/pom.xml') }}
- If unrelated builds share the same cache, different lock files get mixed
- Use restore-keys for fallback to partial matches
- Monitor for unexpected cache behavior and refine keys
Expected answer points:
- Identify cache as root cause when builds fail only with cache present
- Clear cache via CI platform UI or API
- Retry build - if succeeds, cache was the issue
- If still fails, investigate build script or disable cache temporarily
- Fix root cause: bad cache key, corrupted dependency, etc.
- Monitor for recurring cache issues and address fundamentally
Expected answer points:
- Provenance tracks the origin and build process of an artifact
- GitHub Actions: use actions/attest-build-provenance to attach provenance
- Provenance includes subject-name (image reference), subject-digest (the image's sha256 digest), build context
- Verify provenance at deployment using OPA Gatekeeper or Kyverno
- Store provenance attestations in transparency logs (Rekor) for auditability
Expected answer points:
- S3 lifecycle: transition to STANDARD_IA after 30 days, GLACIER after 90 days
- Deep Archive for rarely accessed artifacts (1 year+ old)
- Set minimum retention for artifacts needed for debugging (7+ days)
- Separate buckets or prefixes for different artifact types with different policies
- Monitor storage costs and adjust policies based on usage patterns
Expected answer points:
- Never upload artifacts containing secrets, API keys, or credentials
- Use --dry-run=client when creating secrets to avoid exposure
- Audit what goes into your artifacts before uploading
- Use git secrets --scan to detect credentials before upload
- Use sealed-secrets or external secrets operators instead of plain secrets
- Even private repositories are not immune to accidental exposure
Expected answer points:
- A cache hit does not mean the cached artifact is correct
- Build that succeeds once with wrong configuration keeps succeeding from cache
- Validate cache integrity: compare output hash with expected hash
- Consider cache invalidation triggers: lock file changes, env changes
- Periodically clear cache to catch stale results
Expected answer points:
- Pipeline Build Completes → Artifact needed later?
- No → Skip artifact storage
- Yes → Shared across jobs? → Yes: Upload to shared storage
- No → Cache dependency? → Yes: Use build cache
- Security required (SBOM/provenance)? → Generate SBOM + sign
- Otherwise: Apply retention policy
Expected answer points:
- Track artifact upload success/failure rate
- Monitor storage consumption growth per pipeline
- Track artifact age distribution - are retention policies working?
- GitHub Actions: enable step summary for cache reporting
- GitLab CI: use API to get project storage usage
- AWS S3: list-objects with query to sum sizes
Expected answer points:
- S3/GCS/Blob: Low latency, pay-per-use, versioning, lifecycle - best for most CI/CD
- Artifactory: Medium latency, license + infra, rich access control - enterprises managing many package formats
- GitHub Packages: Medium latency, storage + egress, GitHub token - GitHub-native workflows
- GitLab Container Registry: Medium latency, storage + egress, GitLab token - GitLab-native
- Harbor: Medium latency, infrastructure, RBAC + OPA - enterprise/air-gapped
- ECR/GCR/ACR: Low latency, pay per storage + egress, IAM - cloud-native workloads
Expected answer points:
- GitLab CI: use expire_in with when: always to keep artifacts on failure
- Create a separate cleanup job that runs on failure to remove stale artifacts
- Use lifecycle policies with prefix filters for builds/failed/ paths
- Set short retention (7 days) for failed build artifacts
- Monitor for artifacts exceeding expected retention
Expected answer points:
- Use Kyverno ClusterPolicy with verifyImages rule
- Attestors block specifies how to verify signatures (for example static keys or keyless identities)
- validationFailureAction: Enforce blocks non-signed images
- imageReferences can be wildcards like myregistry.azurecr.io/*
- The keys entry supplies the public key for verification (inline via publicKeys or from a Secret)
Expected answer points:
- Key-based: uses a stored private key (cosign.key) for signing
- Keyless: uses OIDC identity from GitHub Actions/Google, short-lived certificates
- Keyless leverages Sigstore PKI infrastructure
- Keyless preferred for CI/CD to avoid key management complexity
- Production may prefer key-based for audit trail control
Expected answer points:
- Use docker/setup-buildx-action to enable BuildKit
- cache-from: type=gha uses GitHub Actions cache
- cache-to: type=gha,mode=max saves all layers (not just final)
- mode=max ensures intermediate layers cached for better hits
- GitLab: DOCKER_BUILDKIT=1 with --cache-from for registry caching
Expected answer points:
- GitHub Actions: anchore/sbom-action generates SBOM on image build
- Syft CLI: syft myregistry.azurecr.io/myapp:v1.0.0 -o spdx-json
- GitLab CI: syft image in anchore/syft:latest container with spdx-json output
- Verify: syft can regenerate SBOM from image and compare with stored
- Flag prohibited licenses with a separate license check run against the generated SBOM
Further Reading
- GitHub Actions Artifact Documentation - Official GitHub Actions docs
- GitLab CI Artifact Documentation - Official GitLab CI docs
- Cosign Documentation - Artifact signing
- Syft SBOM Generator - Software Bill of Materials
- ArgoCD Image Updater - Automated image updates
- Renovate for Dependency Updates - Automated dependency management
Conclusion
Key Takeaways
- Cache dependency downloads to cut pipeline time significantly
- SBOMs and provenance are essential for supply chain security compliance
- Retention policies prevent storage costs from growing unbounded
- Sign artifacts with Cosign and verify in Kubernetes admission
- Cache keys must include content hashes to avoid collisions
Artifact Health Checklist
# Check cache hit rate in GitHub Actions
# (enable Actions step summary for this)
# Verify artifact retention is set
grep -r "retention-days" .github/workflows/
# Test Cosign verification locally
cosign verify --key cosign.pub myregistry.azurecr.io/myapp:v1.0.0
# Check S3 lifecycle rules
aws s3api get-bucket-lifecycle-configuration --bucket myci-artifacts
# Scan for credentials in artifacts before upload
git secrets --scan && echo "No secrets found"
# List large artifacts for cleanup
aws s3api list-objects-v2 \
  --bucket myci-artifacts \
  --query 'sort_by(Contents, &Size)[-10:]'