Git Object Database and Pack Files

Understanding Git's object storage: loose objects, pack files, delta compression, and how Git optimizes storage for repositories with millions of objects and gigabytes of history.

Reading time: 13 min · Updated: March 31, 2026

Introduction

Every Git repository stores its entire history in the object database under .git/objects/. When you first start a project, objects are stored as individual compressed files — one file per blob, tree, commit, or tag. This “loose object” format is simple and fast for small repositories.

But as your project grows to thousands of commits and millions of files, storing each object as a separate file becomes inefficient. Filesystem overhead, disk space waste, and slow enumeration become real problems. Git solves this with pack files — a binary format that stores multiple objects together with delta compression between similar objects.

Understanding the object database and pack file system is essential for managing large repositories, optimizing CI/CD clone times, and debugging storage issues. This article explains how Git stores objects, how pack files work, and how to optimize your repository’s storage.

When to Use / When Not to Use

When to understand pack files:

  • Managing large repositories with long histories
  • Optimizing clone and fetch times in CI/CD pipelines
  • Debugging repository size issues
  • Understanding git gc and git repack behavior
  • Building Git server infrastructure

When not to manipulate pack files directly:

  • Daily development — Git manages packs automatically
  • Small repositories — loose objects are fine
  • When unsure — use git gc instead of manual repacking

Core Concepts

Git’s object database has two storage modes:


graph TD
    OBJ[".git/objects/"] --> LOOSE["Loose Objects\none file per object"]
    OBJ --> PACK["Pack Files\nmultiple objects per file"]

    LOOSE --> L1["zlib-compressed\nindividual files"]
    LOOSE --> L2["path: .git/objects/ab/cdef...\n(2-char prefix directory)"]

    PACK --> P1[".git/objects/pack/\npack-<hash>.pack\npack-<hash>.idx"]
    PACK --> P2["delta-compressed\nbetween similar objects"]
    PACK --> P3["reverse-index for\nfast lookup"]

Loose objects are simple: each object is zlib-compressed and stored in a file named by its SHA-1 hash. The first two characters form a subdirectory.

Pack files are complex: they store multiple objects in a single binary file, with delta compression between similar objects (e.g., consecutive versions of the same file), and an index file for fast random access.

Architecture or Flow Diagram


flowchart LR
    COMMIT["New Commit"] -->|creates| NEW_OBJ["New Objects\n(blobs, trees, commit)"]
    NEW_OBJ -->|stored as| LOOSE["Loose Objects\n.git/objects/XX/"]

    GC["git gc / git repack"] -->|collects| MANY["Many Loose Objects"]
    MANY -->|delta compression| DELTA["Delta Chains\nbase → delta → delta"]
    DELTA -->|writes| PACK["Pack File\n.pack + .idx"]
    PACK -->|replaces| OLD["Old Loose Objects\n(deleted)"]

    FETCH["git fetch"] -->|receives| THIN["Thin Pack\n(deltas against missing bases)"]
    THIN -->|resolves| COMPLETE["Complete Pack\n(all bases present)"]

The flow shows how objects start as loose files, get packed during garbage collection, and how fetch operations use thin packs for efficient network transfer.

Step-by-Step Guide / Deep Dive

Loose Object Format

Each loose object is stored as:


zlib(<object type> <size>\0<object content>)

# Create a loose object
echo "hello" | git hash-object -w --stdin
# Output: ce013625030ba8dba906f756967f9e9ca394464a

# The file is at .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a
ls -la .git/objects/ce/

# Inspect the raw compressed content
python3 -c "
import zlib, sys
with open('.git/objects/ce/013625030ba8dba906f756967f9e9ca394464a', 'rb') as f:
    data = zlib.decompress(f.read())
    print(repr(data))
"
# Output: b'blob 6\x00hello\n'
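
The same construction can be reproduced end-to-end in Python — a minimal sketch using only the standard library (no Git required) that builds the header, hashes it, and derives the loose-object path:

```python
import hashlib
import zlib

def loose_object(content: bytes, obj_type: str = "blob"):
    """Build the canonical loose-object representation of `content`.

    Returns (sha1_hex, relative_path, compressed_bytes), mirroring
    what `git hash-object -w` writes under .git/objects/.
    """
    # Header: "<type> <size>\0", prepended to the raw content
    store = f"{obj_type} {len(content)}".encode() + b"\x00" + content
    sha = hashlib.sha1(store).hexdigest()
    # The first two hex characters become the subdirectory name
    path = f".git/objects/{sha[:2]}/{sha[2:]}"
    return sha, path, zlib.compress(store)

sha, path, _ = loose_object(b"hello\n")
print(sha)   # ce013625030ba8dba906f756967f9e9ca394464a
print(path)  # .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a
```

Note that the hash covers the header as well as the content, which is why an empty blob and an empty tree have different IDs.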

Pack File Format

A pack file (.pack) contains:

  1. Header: PACK signature, version number, object count
  2. Objects: Each object is either:
    • A full (base) object: type + size + zlib-compressed data
    • A delta object: type + size + base object reference + delta instructions
  3. Trailer: SHA-1 checksum of the entire pack
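
The fixed 12-byte header is simple enough to parse directly; a short Python sketch (the signature, big-endian version, and object count are what the pack format specifies — the synthetic bytes below are illustrative):

```python
import struct

def parse_pack_header(data: bytes):
    """Parse the 12-byte header of a .pack file.

    Layout: 4-byte "PACK" signature, 4-byte big-endian version
    (2 or 3), 4-byte big-endian object count.
    """
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"PACK":
        raise ValueError("not a pack file")
    return version, count

# Synthetic header: version 2, 1234 objects
header = b"PACK" + (2).to_bytes(4, "big") + (1234).to_bytes(4, "big")
print(parse_pack_header(header))  # (2, 1234)
```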

The index file (.idx) provides:

  • Sorted list of object SHAs with their offsets in the pack
  • Fan-out table for binary search
  • Pack checksum
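
The fan-out table is easy to model: entry b holds the number of object IDs whose first byte is ≤ b, so a lookup only needs to binary-search the slice between fanout[b-1] and fanout[b]. A toy sketch with synthetic short IDs (the real .idx stores full 20-byte hashes):

```python
import bisect

def build_fanout(sorted_shas):
    """fanout[b] = number of object IDs whose first byte is <= b."""
    fanout = [0] * 256
    for sha in sorted_shas:
        fanout[sha[0]] += 1
    for b in range(1, 256):
        fanout[b] += fanout[b - 1]  # cumulative counts
    return fanout

def find(sorted_shas, fanout, target):
    """Binary-search only the bucket selected by the first byte."""
    lo = fanout[target[0] - 1] if target[0] else 0
    hi = fanout[target[0]]
    i = bisect.bisect_left(sorted_shas, target, lo, hi)
    return i if i < hi and sorted_shas[i] == target else None

# Six fake 2-byte object IDs across three first-byte buckets
shas = sorted(bytes([b, x]) for b in (0x00, 0xCE, 0xFF) for x in (1, 2))
fan = build_fanout(shas)
print(find(shas, fan, bytes([0xCE, 2])))  # 3 (position in sorted list)
```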

# List pack files
ls -lh .git/objects/pack/
# Output:
# pack-abc123.idx  (index - fast lookup)
# pack-abc123.pack (data - compressed objects)
# pack-abc123.rev  (reverse index - offset to SHA)

# Inspect pack contents
git verify-pack -v .git/objects/pack/pack-abc123.idx
# Output per object:
# <sha> <type> <size> <packed-size> <offset> [<depth> <base-sha>]
# abc123... commit 234  230 12
# def456... tree   120  115 242 1 abc999...   (delta, depth 1)
# 789ghi... blob   5432 128 357 2 def456...   (delta, depth 2)
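
This output is easy to post-process; a hedged Python sketch that turns verify-pack per-object lines into records (non-delta lines have five fields, delta lines append depth and base SHA; the sample text is illustrative):

```python
def parse_verify_pack(text: str):
    """Parse `git verify-pack -v` per-object lines into dicts.

    Skips chain-length summary lines and the trailer by requiring
    a known object type in the second field.
    """
    objects = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[1] not in ("commit", "tree", "blob", "tag"):
            continue  # not a per-object line
        objects.append({
            "sha": fields[0],
            "type": fields[1],
            "size": int(fields[2]),
            "packed": int(fields[3]),
            "offset": int(fields[4]),
            "depth": int(fields[5]) if len(fields) > 5 else 0,
        })
    return objects

sample = """\
abc123 commit 234 230 12
def456 tree 120 115 242 1 abc999
chain length = 1: 1 object
"""
for o in parse_verify_pack(sample):
    print(o["type"], o["size"], o["depth"])
```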

Delta Compression

Delta compression is Git’s key storage optimization. Instead of storing full copies of similar objects, Git stores:

  1. Base object: A full copy (usually the oldest version)
  2. Delta objects: Instructions to transform the base into the target

graph LR
    BASE["Base Object\nv1 of file.py\n(full copy, 5KB)"] -->|delta| D1["Delta 1\nv1 → v2\n(changes only, 200B)"]
    D1 -->|delta| D2["Delta 2\nv2 → v3\n(changes only, 150B)"]
    D2 -->|delta| D3["Delta 3\nv3 → v4\n(changes only, 180B)"]

This can reduce storage by 90%+ for text files that change incrementally. The delta depth matters — deep chains require more decompression work to reconstruct an object.
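
The copy/insert instruction model can be sketched with a simplified delta format — real pack deltas use a compact binary opcode encoding, but representing each instruction as a tuple keeps the idea visible:

```python
def apply_delta(base: bytes, instructions) -> bytes:
    """Reconstruct a target from a base plus copy/insert instructions.

    ("copy", offset, length) copies a span from the base;
    ("insert", data) appends literal new bytes. These mirror the two
    opcode families in Git's binary delta encoding, minus the packing.
    """
    out = bytearray()
    for op in instructions:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:  # "insert"
            out += op[1]
    return bytes(out)

v1 = b"def greet():\n    print('hello')\n"
# v2 changes only the printed string; everything else is copied
delta = [
    ("copy", 0, 24),           # "def greet():\n    print('"
    ("insert", b"goodbye"),    # the only new bytes
    ("copy", 29, 3),           # "')\n"
]
v2 = apply_delta(v1, delta)
print(v2.decode())
```

Reconstructing a deep chain repeats this step once per link, which is why delta depth translates directly into access cost.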

Thin Packs

When fetching from a remote, Git sends “thin packs” that contain deltas against objects the receiver already has. This minimizes network transfer:


# The server sends deltas against objects it knows you have
git fetch origin
# Output: remote: Compressing objects: 100% (15/15), done.

# Your Git resolves the deltas against local objects
# and creates a complete pack file

Pack Bitmaps

For very large repositories, Git can generate bitmap indexes that accelerate git rev-list operations:


# Enable bitmaps in config
git config repack.writeBitmaps true

# Repack with bitmaps
git repack -adb

# The .bitmap file speeds up history queries
ls .git/objects/pack/*.bitmap
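
The idea behind bitmaps can be modeled with plain integers as bitsets: each selected commit stores a bitmap whose i-th bit marks whether object i (by pack position) is reachable from it, so reachability queries become bitwise OR instead of graph walks. A toy sketch — the real .bitmap format uses EWAH-compressed bitmaps, not raw integers:

```python
def reachable_count(bitmaps, commits):
    """Count objects reachable from any of `commits`.

    `bitmaps` maps commit name -> int used as a bitset over pack
    positions; set union is bitwise OR, cardinality is a popcount.
    """
    union = 0
    for c in commits:
        union |= bitmaps[c]
    return bin(union).count("1")

# Toy pack with 6 objects at positions 0..5
bitmaps = {
    "A": 0b000111,  # A reaches objects 0, 1, 2
    "B": 0b011110,  # B reaches objects 1, 2, 3, 4
}
print(reachable_count(bitmaps, ["A", "B"]))  # 5 (objects 0..4)
```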

Production Failure Scenarios + Mitigations

| Scenario | Symptoms | Mitigation |
|---|---|---|
| Corrupted pack file | "error: bad packed object" | Run git unpack-objects from a fresh clone; delete the corrupted pack |
| Deep delta chains | Slow git log or checkout | Repack with git repack -f to recompute delta chains |
| Missing base object | "fatal: bad object" during unpack | Fetch from the remote; the base may exist only in another pack |
| Pack file too large | Memory issues during repack | Use git repack --window-memory=1g to limit memory |
| Incomplete fetch | "error: packfile does not match index" | Delete pack files and re-fetch: rm .git/objects/pack/* && git fetch |

Trade-offs

| Aspect | Advantage | Disadvantage |
|---|---|---|
| Loose objects | Simple, fast individual access | Wasteful for many similar objects |
| Pack files | Excellent compression, fast enumeration | Slower individual object access |
| Delta compression | 90%+ space savings for text | CPU cost for delta creation and resolution |
| Thin packs | Minimal network transfer | Requires receiver to have base objects |
| Pack bitmaps | Fast history queries | Additional disk space and repack time |

Implementation Snippets


# Check object database statistics
git count-objects -vH
# Output:
# count: 1234        (loose objects)
# size: 5.6M         (loose object size)
# in-pack: 56789     (packed objects)
# packs: 3           (number of pack files)
# size-pack: 45.2M   (pack file size)

# Verify pack integrity
git verify-pack -v .git/objects/pack/pack-*.idx

# List objects in a pack, sorted by size
git verify-pack -v .git/objects/pack/pack-*.idx | sort -k3 -n -r | head -20

# Repack with maximum compression
git repack -a -d -f --depth=250 --window=250

# Repack without delta compression (faster, but larger packs)
git repack -a -d --window=0

# Create a pack with bitmaps
git repack -a -d -b

# Garbage collect aggressively
git gc --aggressive --prune=now

# Find largest objects in the repository
git rev-list --objects --all | \
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | \
  sort -k3 -n -r | head -20

# Check delta chain depth
git verify-pack -v .git/objects/pack/pack-*.idx | \
  awk 'NF==7 {print $6}' | sort -n | uniq -c | sort -n
# (delta entries have 7 fields, ending in depth and base SHA)

Observability Checklist

  • Monitor: Loose object count (git count-objects -v)
  • Track: Pack file sizes and count over time
  • Alert: Delta chain depth exceeding 50 (causes slow access)
  • Verify: Pack integrity with git verify-pack after repacking
  • Audit: Largest objects in repository (may need Git LFS migration)
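
For monitoring, the plain `git count-objects -v` output (without `-H`) parses cleanly into metrics; a small sketch — the alert threshold below is illustrative, and note that Git reports `size` and `size-pack` in KiB:

```python
def parse_count_objects(output: str) -> dict:
    """Turn `git count-objects -v` output into an int-valued dict."""
    stats = {}
    for line in output.splitlines():
        key, _, value = line.partition(":")
        stats[key.strip()] = int(value.strip())
    return stats

# Sample output of `git count-objects -v`
sample = """\
count: 1234
size: 5734
in-pack: 56789
packs: 3
size-pack: 46284
prune-packable: 0
garbage: 0
size-garbage: 0
"""
stats = parse_count_objects(sample)
# Illustrative threshold: many loose objects suggests running git gc
if stats["count"] > 1000:
    print(f"loose objects: {stats['count']} (consider running git gc)")
```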

Security/Compliance Notes

  • Pack files contain all historical data — deleted files may still exist in packs
  • Running git gc --prune=now removes unreachable objects but doesn’t overwrite disk blocks
  • For true secret removal, see Removing Sensitive Data from History
  • Pack files are not encrypted — anyone with filesystem access can extract objects

Common Pitfalls / Anti-Patterns

  • Running git gc too frequently — wastes CPU; Git auto-gc is usually sufficient
  • Ignoring large objects in packs — they bloat every clone; migrate to Git LFS
  • Disabling auto-gc — leads to excessive loose objects and slow operations
  • Not pruning after history rewrite — old objects remain in packs until pruned
  • Assuming git rm deletes data — it only removes from the working tree; objects persist

Quick Recap Checklist

  • Loose objects are individual zlib-compressed files
  • Pack files store multiple objects with delta compression
  • Delta chains save space but add decompression cost
  • Thin packs minimize network transfer during fetch
  • Pack bitmaps accelerate history queries
  • git gc converts loose objects to packs automatically
  • git repack gives fine-grained control over packing
  • Deleted files may still exist in pack files

Interview Q&A

How does delta compression work in Git pack files?

Git finds similar objects (usually different versions of the same file) and stores one as a full base object. Subsequent versions are stored as delta instructions — byte-level copy/insert commands that transform the base into the target. This is similar to xdelta or bsdiff. Delta chains can be deep (A → B → C → D), but Git limits depth to balance compression ratio with access speed.

What's the difference between `git gc` and `git repack`?

git repack only creates new pack files from loose objects and/or existing packs. git gc is a higher-level command that runs repack, prunes unreachable objects, expires reflogs, and runs other maintenance tasks. Think of repack as a tool and gc as a maintenance workflow that uses repack.

Why does `git fetch` sometimes say "Compressing objects" on the server side?

The server creates a thin pack — a pack file containing deltas against objects it believes the client already has. This minimizes network transfer. The client then "thickens" the pack by resolving deltas against its local objects. If the server's assumption is wrong, the fetch fails and retries with a complete pack.

How can you find the largest objects in a Git repository?

Use git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sort -k3 -n -r | head -20. The custom format is needed because rev-list prints "<sha> <path>" lines, and the default --batch-check would treat the whole line as an object name; %(rest) carries the path through. This lists all objects reachable from any ref with their sizes, sorted descending. For sizes as stored inside pack files, use git verify-pack -v .git/objects/pack/*.idx | sort -k3 -n -r | head -20.

Object Storage Architecture


graph TD
    DB[".git/objects/"] --> LOOSE["Loose Objects"]
    DB --> PACK_DIR["pack/"]

    LOOSE --> L_FORMAT["zlib-compressed files"]
    LOOSE --> L_PATH["Path: .git/objects/XX/YYYY..."]
    LOOSE --> L_ACCESS["Fast individual access"]

    PACK_DIR --> IDX[".idx (index file)"]
    PACK_DIR --> PACK[".pack (object data)"]
    PACK_DIR --> REV[".rev (reverse index)"]

    PACK --> DELTA_CHAIN["Delta Chain"]
    DELTA_CHAIN --> BASE["Base object (full copy)"]
    BASE --> D1["Delta 1 (changes)"]
    D1 --> D2["Delta 2 (changes)"]

Production Failure: Pack File Corruption

Scenario: Incomplete clone with missing deltas


# Symptoms
$ git log --oneline
error: Could not read abc123...
fatal: Failed to traverse parents of commit def456...

$ git fsck --full
error: packfile .git/objects/pack/pack-abc.pack does not match index
error: def456...: object missing

# Root cause: Network interruption during clone/fetch left pack file
# incomplete, or disk corruption damaged pack data

# Recovery steps:

# 1. Identify corrupted pack
ls -la .git/objects/pack/
# Note the pack file names

# 2. Remove corrupted pack files
rm -f .git/objects/pack/pack-*.pack
rm -f .git/objects/pack/pack-*.idx
rm -f .git/objects/pack/pack-*.rev

# 3. Re-fetch from remote
git fetch origin --refetch

# 4. Verify integrity
git fsck --full

# 5. If remote doesn't have the objects (local-only work):
#    Restore from backup or another clone
rsync -avz backup-server:/path/to/repo/.git/objects/ .git/objects/

# Prevention:
# - Verify incoming objects: git config transfer.fsckObjects true
# - Verify after large fetches: git fsck --connectivity-only
# - Use shallow clones only when full history isn't needed

Trade-offs: Loose vs Packed Storage

| Aspect | Loose Objects | Packed Objects |
|---|---|---|
| Performance (read) | Fast for individual objects | Slower per object (must resolve deltas) |
| Performance (enumerate) | Slow (many filesystem calls) | Fast (single file scan) |
| Disk space | High (no delta compression) | Low (delta compression saves 90%+) |
| Transfer efficiency | Poor (many small files) | Excellent (single pack file) |
| Creation cost | None (written on each commit) | High (CPU-intensive repacking) |
| Network transfer | Inefficient for fetch/clone | Optimized with thin packs |
| Corruption impact | Single object lost | Entire pack may be unreadable |
| Best for | Active repos with few objects | Large repos, archives, servers |

Rule of thumb: Let Git manage this automatically. Manual repacking is only needed for very large repos or before archiving.

Implementation: Manual Pack Creation and Inspection


# === Create a pack from loose objects ===
# Repack all objects into a single pack
git repack -a -d
# -a: pack everything (not just unreachable)
# -d: delete loose objects after packing

# === Create pack with maximum compression ===
git repack -a -d -f --depth=250 --window=250
# -f: ignore existing delta info, recompute
# --depth: max delta chain length
# --window: objects to consider for delta

# === Inspect pack contents ===
PACK_FILE=$(ls .git/objects/pack/pack-*.idx | head -1)
git verify-pack -v "$PACK_FILE" | head -20
# Output format:
# <sha> <type> <size> <packed-size> <offset> <delta-depth>

# === Find largest objects in pack ===
git verify-pack -v "$PACK_FILE" | \
  grep -v "^$" | \
  grep -v "chain" | \
  sort -k3 -n -r | \
  head -10

# === Check delta chain depths ===
git verify-pack -v "$PACK_FILE" | \
  awk 'NF==7 {print $6}' | \
  sort -n | uniq -c | sort -rn
# Shows distribution: how many objects at each delta depth

# === Create a custom pack with specific objects ===
# Pack all objects reachable from main; rev-list --objects emits the
# trees and blobs too, which pack-objects reads from stdin
git rev-list --objects main | git pack-objects my-pack
# Creates: my-pack-<sha>.pack and my-pack-<sha>.idx

# === Create thin pack for network transfer ===
# Pack the objects the remote lacks, as deltas against objects it has
git rev-list --objects origin/main..main | \
  git pack-objects --thin --stdout | \
  ssh remote "cd /path/to/repo.git && git unpack-objects"

Resources


Related Posts

Git Garbage Collection and Maintenance

Master git gc, git prune, git fsck, and automated repository maintenance. Learn how Git manages object storage, cleans unreachable data, and keeps repositories healthy.

#git #version-control #garbage-collection

Centralized vs Distributed VCS: Architecture, Trade-offs, and When to Use Each

Compare centralized (SVN, CVS) vs distributed (Git, Mercurial) version control systems — their architectures, trade-offs, and when to use each approach.

#git #version-control #svn

Automated Changelog Generation: From Commit History to Release Notes

Build automated changelog pipelines from git commit history using conventional commits, conventional-changelog, and semantic-release. Learn parsing, templating, and production patterns.

#git #version-control #changelog