Git Object Database and Pack Files

Understanding Git's object storage: loose objects, pack files, delta compression, and how Git optimizes storage for repositories with millions of objects and gigabytes of history.

Reading time: 20 min | Author: Geek Workbench | Updated: March 31, 2026

Introduction

Every Git repository stores its entire history in the object database under .git/objects/. When you first start a project, objects are stored as individual compressed files — one file per blob, tree, commit, or tag. This “loose object” format is simple and fast for small repositories.

But as your project grows to thousands of commits and millions of files, storing each object as a separate file becomes inefficient. Filesystem overhead, disk space waste, and slow enumeration become real problems. Git solves this with pack files — a binary format that stores multiple objects together with delta compression between similar objects.

Understanding the object database and pack file system is essential for managing large repositories, optimizing CI/CD clone times, and debugging storage issues. This article explains how Git stores objects, how pack files work, and how to optimize your repository’s storage.

When to Use / When Not to Use

When to understand pack files:

  • Managing large repositories with long histories
  • Optimizing clone and fetch times in CI/CD pipelines
  • Debugging repository size issues
  • Understanding git gc and git repack behavior
  • Building Git server infrastructure

When not to manipulate pack files directly:

  • Daily development — Git manages packs automatically
  • Small repositories — loose objects are fine
  • When unsure — use git gc instead of manual repacking

Core Concepts

Git’s object database has two storage modes:


graph TD
    OBJ[".git/objects/"] --> LOOSE["Loose Objects\none file per object"]
    OBJ --> PACK["Pack Files\nmultiple objects per file"]

    LOOSE --> L1["zlib-compressed\nindividual files"]
    LOOSE --> L2["path: .git/objects/ab/cdef...\n(2-char prefix directory)"]

    PACK --> P1[".git/objects/pack/\npack-<hash>.pack\npack-<hash>.idx"]
    PACK --> P2["delta-compressed\nbetween similar objects"]
    PACK --> P3["reverse-index for\nfast lookup"]

Loose objects are simple: each object is zlib-compressed and stored in a file named by its SHA-1 hash. The first two characters form a subdirectory.

Pack files are complex: they store multiple objects in a single binary file, with delta compression between similar objects (e.g., consecutive versions of the same file), and an index file for fast random access.

Architecture or Flow Diagram


flowchart LR
    COMMIT["New Commit"] -->|creates| NEW_OBJ["New Objects\n(blobs, trees, commit)"]
    NEW_OBJ -->|stored as| LOOSE["Loose Objects\n.git/objects/XX/"]

    GC["git gc / git repack"] -->|collects| MANY["Many Loose Objects"]
    MANY -->|delta compression| DELTA["Delta Chains\nbase → delta → delta"]
    DELTA -->|writes| PACK["Pack File\n.pack + .idx"]
    PACK -->|replaces| OLD["Old Loose Objects\n(deleted)"]

    FETCH["git fetch"] -->|receives| THIN["Thin Pack\n(deltas against missing bases)"]
    THIN -->|resolves| COMPLETE["Complete Pack\n(all bases present)"]

The flow shows how objects start as loose files, get packed during garbage collection, and how fetch operations use thin packs for efficient network transfer.

Step-by-Step Guide / Deep Dive

Loose Object Format

Each loose object is stored as:


zlib(<object type> <size>\0<object content>)

# Create a loose object
echo "hello" | git hash-object -w --stdin
# Output: ce013625030ba8dba906f756967f9e9ca394464a

# The file is at .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a
ls -la .git/objects/ce/

# Inspect the raw compressed content
python3 -c "
import zlib, sys
with open('.git/objects/ce/013625030ba8dba906f756967f9e9ca394464a', 'rb') as f:
    data = zlib.decompress(f.read())
    print(repr(data))
"
# Output: b'blob 6\x00hello\n'
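The same format can be reproduced without Git. The following is a minimal sketch, assuming a SHA-1 repository; the helper name `loose_object` is made up for illustration. It shows that the object ID is the hash of the header plus the content, and that the first two hex characters become the directory name.

```python
import hashlib
import zlib


def loose_object(content: bytes, obj_type: str = "blob") -> tuple[str, str, bytes]:
    """Return (object id, relative path, compressed bytes) for a loose object."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    store = header + content
    oid = hashlib.sha1(store).hexdigest()       # the ID covers header + content
    path = f".git/objects/{oid[:2]}/{oid[2:]}"  # 2-char prefix directory
    return oid, path, zlib.compress(store)


oid, path, compressed = loose_object(b"hello\n")
print(oid)   # ce013625030ba8dba906f756967f9e9ca394464a
print(path)
print(zlib.decompress(compressed))
```

The compressed bytes may differ from what Git writes (the zlib level can differ), but they decompress to the same header and content, which is all the hash depends on.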

Pack File Format

A pack file (.pack) contains:

  1. Header: PACK signature, version number, object count
  2. Objects: Each object is either:
    • A full (base) object: type + size + zlib-compressed data
    • A delta object: type + size + base object reference + delta instructions
  3. Trailer: SHA-1 checksum of the entire pack
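To make the layout concrete, here is a hedged sketch that parses the 12-byte header, checks the trailer, and decodes the variable-length type-and-size header that precedes each object. The function names are illustrative, and the sample bytes are hand-built, not taken from a real repository.

```python
import hashlib
import struct

OBJ_TYPES = {1: "commit", 2: "tree", 3: "blob", 4: "tag", 6: "ofs-delta", 7: "ref-delta"}


def parse_pack_header(data: bytes) -> tuple[int, int]:
    """Return (version, object count) from the 12-byte pack header."""
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"PACK":
        raise ValueError("not a pack file")
    return version, count


def trailer_ok(data: bytes) -> bool:
    """Check that the final 20 bytes are the SHA-1 of everything before them."""
    return hashlib.sha1(data[:-20]).digest() == data[-20:]


def parse_entry_header(data: bytes, pos: int) -> tuple[str, int, int]:
    """Decode one entry's type and inflated size; return (type, size, next_pos)."""
    byte = data[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x7   # bits 6-4: object type
    size = byte & 0x0F             # bits 3-0: lowest bits of the size
    shift = 4
    while byte & 0x80:             # bit 7: another size byte follows
        byte = data[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return OBJ_TYPES[obj_type], size, pos


# Hand-built example: an empty version-2 pack (header plus trailer, no objects).
body = struct.pack(">4sII", b"PACK", 2, 0)
empty_pack = body + hashlib.sha1(body).digest()
print(parse_pack_header(empty_pack), trailer_ok(empty_pack))

# Entry header for a 300-byte blob encodes as the two bytes 0xBC 0x12.
print(parse_entry_header(bytes([0xBC, 0x12]), 0))
```

For a real pack, pass the file's bytes to `parse_pack_header` and `trailer_ok`; walking every entry also requires inflating each zlib stream to find where the next entry starts, which this sketch leaves out.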

The index file (.idx) provides:

  • Sorted list of object SHAs with their offsets in the pack
  • Fan-out table for binary search
  • Pack checksum
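The fan-out idea is easier to see in code. This is a simplified in-memory sketch, not a parser for the real `.idx` binary layout: it builds a 256-entry cumulative table over a sorted list of object IDs and uses it to bound the binary search. The IDs are generated from arbitrary strings purely for illustration.

```python
import bisect
import hashlib


def build_fanout(sorted_ids: list[bytes]) -> list[int]:
    """fanout[b] = number of ids whose first byte is <= b (cumulative count)."""
    fanout = [0] * 256
    for oid in sorted_ids:
        fanout[oid[0]] += 1
    for b in range(1, 256):
        fanout[b] += fanout[b - 1]
    return fanout


def lookup(sorted_ids, fanout, oid):
    """Return the position of oid in the sorted table, or None if absent."""
    first = oid[0]
    lo = fanout[first - 1] if first > 0 else 0   # start of the slice for this byte
    hi = fanout[first]                           # end of the slice (exclusive)
    pos = bisect.bisect_left(sorted_ids, oid, lo, hi)
    return pos if pos < hi and sorted_ids[pos] == oid else None


ids = sorted(hashlib.sha1(str(i).encode()).digest() for i in range(1000))
fanout = build_fanout(ids)
print(lookup(ids, fanout, ids[417]))      # 417
print(lookup(ids, fanout, b"\x00" * 20))  # None
```

In the real index, the position found this way is then used to read the object's pack offset from a parallel table.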

# List pack files
ls -lh .git/objects/pack/
# Output:
# pack-abc123.idx  (index - fast lookup)
# pack-abc123.pack (data - compressed objects)
# pack-abc123.rev  (reverse index - offset to SHA)

# Inspect pack contents
git verify-pack -v .git/objects/pack/pack-abc123.idx
# Output per object:
# <sha> <type> <size> <size-in-pack> <offset> [<depth> <base-sha>]
# abc123... commit 234  160 12
# def456... tree   120  98  172
# 789abc... blob   5432 128 270 2 456def...  (delta, depth 2)

Delta Compression

Delta compression is Git’s key storage optimization. Instead of storing full copies of similar objects, Git stores:

  1. Base object: A full copy (often the most recent version, because Git's packing heuristics favor fast access to recent history)
  2. Delta objects: Instructions to transform the base into the target

graph LR
    BASE["Base Object\nv4 of file.py\n(full copy, 5KB)"] -->|delta| D1["Delta 1\nv4 → v3\n(changes only, 200B)"]
    D1 -->|delta| D2["Delta 2\nv3 → v2\n(changes only, 150B)"]
    D2 -->|delta| D3["Delta 3\nv2 → v1\n(changes only, 180B)"]

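For text files that change incrementally, this can cut storage dramatically; the often-quoted 90%+ figure depends heavily on the content. Delta depth matters: to read an object at depth N, Git must first obtain its base and then apply N deltas in sequence, so deep chains make reads slower.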
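A delta is a small program made of copy and insert instructions. The sketch below applies one, assuming the pack delta encoding (two size varints, then instructions where a set high bit means "copy from base" and a value of 1 to 127 means "insert that many literal bytes"). The names `read_varint` and `apply_delta` and the sample delta are illustrative; the delta was assembled by hand, not produced by Git.

```python
def read_varint(delta: bytes, pos: int) -> tuple[int, int]:
    """Read a little-endian base-128 size; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = delta[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos


def apply_delta(base: bytes, delta: bytes) -> bytes:
    """Rebuild the target object from its base and a delta."""
    src_size, pos = read_varint(delta, 0)
    dst_size, pos = read_varint(delta, pos)
    if src_size != len(base):
        raise ValueError("delta was made for a different base")
    out = bytearray()
    while pos < len(delta):
        cmd = delta[pos]
        pos += 1
        if cmd & 0x80:                     # copy a range from the base
            offset = size = 0
            for i in range(4):             # optional offset bytes (bits 0-3)
                if cmd & (1 << i):
                    offset |= delta[pos] << (8 * i)
                    pos += 1
            for i in range(3):             # optional size bytes (bits 4-6)
                if cmd & (0x10 << i):
                    size |= delta[pos] << (8 * i)
                    pos += 1
            size = size or 0x10000         # size 0 means 64 KiB
            out += base[offset:offset + size]
        elif cmd:                          # insert the next `cmd` literal bytes
            out += delta[pos:pos + cmd]
            pos += cmd
        else:
            raise ValueError("reserved delta opcode 0")
    if len(out) != dst_size:
        raise ValueError("reconstructed size mismatch")
    return bytes(out)


base = b"def greet():\n    print('hello')\n"
target_size = 32
# Copy the first 24 bytes, insert b"world", then copy 3 bytes from offset 29.
delta = (bytes([len(base), target_size]) + bytes([0x90, 24])
         + bytes([5]) + b"world" + bytes([0x91, 29, 3]))
print(apply_delta(base, delta))
print(len(delta), "delta bytes instead of", target_size)
```

A chain of depth N is this function applied N times, each result serving as the base for the next delta, which is why deep chains cost more to read.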

Thin Packs

When fetching from a remote, Git sends “thin packs” that contain deltas against objects the receiver already has. This minimizes network transfer:


# The server sends deltas against objects it knows you have
git fetch origin
# Output: remote: Compressing objects: 100% (15/15), done.

# Your Git resolves the deltas against local objects
# and creates a complete pack file

Pack Bitmaps

For very large repositories, Git can generate bitmap indexes that accelerate git rev-list operations:


# Enable bitmaps in config
git config repack.writeBitmaps true

# Repack with bitmaps
git repack -adb

# The .bitmap file speeds up history queries
ls .git/objects/pack/*.bitmap
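The idea behind a reachability bitmap can be sketched with Python integers as bitsets. This is a toy model with a made-up three-commit history; real bitmap files are compressed and stored for selected commits only, so treat it as an illustration of the principle, not of the file format.

```python
# Toy history: commit -> (parent commits, objects that commit introduces).
history = {
    "c1": ([], ["tree1", "blobA"]),
    "c2": (["c1"], ["tree2", "blobB"]),
    "c3": (["c2"], ["tree3", "blobC"]),
}

# Give every object a bit position, much as a bitmap index numbers pack entries.
objects = [o for c, (_, objs) in history.items() for o in [c, *objs]]
bit = {name: i for i, name in enumerate(objects)}


def build_bitmaps(history):
    """Precompute, per commit, a bitset of everything reachable from it."""
    bitmaps = {}
    for commit, (parents, objs) in history.items():   # parents are listed first
        bits = 1 << bit[commit]
        for o in objs:
            bits |= 1 << bit[o]
        for p in parents:
            bits |= bitmaps[p]                        # inherit the parent's set
        bitmaps[commit] = bits
    return bitmaps


bitmaps = build_bitmaps(history)
# "What must be sent to a client that has c1 and wants c3?" is one AND-NOT.
to_send = bitmaps["c3"] & ~bitmaps["c1"]
print(sorted(name for name, i in bit.items() if to_send >> i & 1))
```

Without the precomputed sets, answering the same question means walking every commit and tree between the two points, which is the work bitmaps let a server skip.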

Production Failure Scenarios

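| Scenario | Symptoms | Mitigation |
|---|---|---|
| Corrupted pack file | "error: bad packed object" | Move the damaged pack out of .git/objects/pack/, salvage what is readable by feeding it to git unpack-objects -r, then re-fetch or copy the missing objects from a good clone |
| Deep delta chains | Slow git log or checkout | Repack with git repack -a -d -f --depth=50 to recompute deltas with a bounded depth |
| Missing base object | "fatal: bad object" during unpack | Fetch from remote; the base may exist only in another pack |
| Pack file too large | Memory issues during repack | Use git repack --window-memory=1g to limit memory |
| Incomplete fetch | "error: packfile does not match index" | Move the affected .pack/.idx pair aside and re-fetch; avoid rm .git/objects/pack/*, which also deletes packs that hold local-only work |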

Trade-off Analysis

| Aspect | Advantage | Disadvantage |
|---|---|---|
| Loose objects | Simple, fast individual access | Wasteful for many similar objects |
| Pack files | Excellent compression, fast enumeration | Slower individual object access |
| Delta compression | 90%+ space savings for text | CPU cost for delta creation and resolution |
| Thin packs | Minimal network transfer | Requires receiver to have base objects |
| Pack bitmaps | Fast history queries | Additional disk space, repack time |

Implementation Snippets


# Check object database statistics
git count-objects -vH
# Output:
# count: 1234        (loose objects)
# size: 5.6M         (loose object size)
# in-pack: 56789     (packed objects)
# packs: 3           (number of pack files)
# size-pack: 45.2M   (pack file size)

# Verify pack integrity
git verify-pack -v .git/objects/pack/pack-*.idx

# List objects in a pack, sorted by size
git verify-pack -v .git/objects/pack/pack-*.idx | sort -k3 -n -r | head -20

# Repack with maximum compression
git repack -a -d -f --depth=250 --window=250

# Repack without delta compression (faster to write, larger pack)
# --window=0 skips the delta search; -f stops existing deltas from being reused
git repack -a -d -f --window=0

# Create a pack with bitmaps
git repack -a -d -b

# Garbage collect aggressively
git gc --aggressive --prune=now

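# Find largest objects in the repository
# (%(rest) carries the path that rev-list prints after each object ID)
git rev-list --objects --all | \
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | \
  sort -k3 -n -r | head -20

# Check delta chain depth (deltified entries have 7 fields; field 6 is the depth)
git verify-pack -v .git/objects/pack/pack-*.idx | \
  awk 'NF==7 {print $6}' | sort -n | uniq -c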

Observability Checklist

  • Monitor: Loose object count (git count-objects -v)
  • Track: Pack file sizes and count over time
  • Alert: Delta chain depth exceeding 50 (causes slow access)
  • Verify: Pack integrity with git verify-pack after repacking
  • Audit: Largest objects in repository (may need Git LFS migration)
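The delta-depth alert can be automated by parsing `git verify-pack -v` output. The following is a sketch that assumes 40-character SHA-1 object IDs; the sample text is invented for illustration and the function name is hypothetical.

```python
from collections import Counter

SAMPLE = """\
1111111111111111111111111111111111111111 commit 234 160 12
2222222222222222222222222222222222222222 tree   120 98 172
3333333333333333333333333333333333333333 blob   5432 1801 270
4444444444444444444444444444444444444444 blob   40 52 2071 1 3333333333333333333333333333333333333333
5555555555555555555555555555555555555555 blob   35 47 2123 2 4444444444444444444444444444444444444444
non delta: 3 objects
chain length = 1: 1 object
chain length = 2: 1 object
"""


def depth_histogram(verify_pack_output: str) -> Counter:
    """Count objects per delta depth; objects stored in full count as depth 0."""
    depths = Counter()
    for line in verify_pack_output.splitlines():
        fields = line.split()
        if not fields or len(fields[0]) != 40:
            continue                       # summary line, not an object entry
        if len(fields) == 7:               # deltified: ... <depth> <base-sha>
            depths[int(fields[5])] += 1
        elif len(fields) == 5:             # stored in full
            depths[0] += 1
    return depths


hist = depth_histogram(SAMPLE)
print(dict(hist))
print("deepest chain:", max(hist))
```

In a monitoring job, the sample string would be replaced by the captured output of `git verify-pack -v` for each `.idx` file, with an alert when the deepest chain exceeds the chosen threshold.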

Security & Compliance Considerations

  • Pack files contain all historical data — deleted files may still exist in packs
  • Running git gc --prune=now removes unreachable objects but doesn’t overwrite disk blocks
  • For true secret removal, see Removing Sensitive Data from History
  • Pack files are not encrypted — anyone with filesystem access can extract objects

Common Pitfalls / Anti-Patterns

  • Running git gc too frequently — wastes CPU; Git auto-gc is usually sufficient
  • Ignoring large objects in packs — they bloat every clone; migrate to Git LFS
  • Disabling auto-gc — leads to excessive loose objects and slow operations
  • Not pruning after history rewrite — old objects remain in packs until pruned
  • Assuming git rm deletes data: it only removes the file from the working tree and the index; the objects stay in history
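To see why auto-gc is usually sufficient, it helps to know roughly how it decides to run. The sketch below approximates that check under two assumptions: the default `gc.auto` threshold of 6700 loose objects, and Git's shortcut of counting a single fan-out directory (`objects/17`) and extrapolating across all 256. The function name is made up, and the details should be read as an approximation, not a specification.

```python
import math
import os
import tempfile

HEX = set("0123456789abcdef")


def estimate_needs_gc(objects_dir: str, gc_auto: int = 6700) -> bool:
    """Estimate whether loose objects exceed gc.auto by sampling one bucket."""
    bucket = os.path.join(objects_dir, "17")
    if gc_auto <= 0 or not os.path.isdir(bucket):
        return False                              # auto-gc disabled or nothing to count
    per_bucket_limit = math.ceil(gc_auto / 256)   # 27 for the default threshold
    loose = sum(
        1 for name in os.listdir(bucket)
        if len(name) == 38 and set(name) <= HEX   # SHA-1 loose object file names
    )
    return loose > per_bucket_limit


with tempfile.TemporaryDirectory() as objects_dir:
    os.mkdir(os.path.join(objects_dir, "17"))
    for i in range(30):                           # 30 fake loose objects in bucket 17
        open(os.path.join(objects_dir, "17", f"{i:038x}"), "w").close()
    print(estimate_needs_gc(objects_dir))         # True: 30 > 27
```

Because the check is this cheap, Git can run it after ordinary commands, which is why a manual `git gc` on a schedule rarely adds anything.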

Quick Recap Checklist

  • Loose objects are individual zlib-compressed files
  • Pack files store multiple objects with delta compression
  • Delta chains save space but add decompression cost
  • Thin packs minimize network transfer during fetch
  • Pack bitmaps accelerate history queries
  • git gc converts loose objects to packs automatically
  • git repack gives fine-grained control over packing
  • Deleted files may still exist in pack files

Multi-Pack Index (MIDX) for Large Repositories

For repositories with hundreds of pack files, Git supports Multi-Pack Index (MIDX) to accelerate object lookups:


# Generate MIDX to accelerate multi-pack repos
git multi-pack-index write

# Verify MIDX integrity
git multi-pack-index verify

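# Inspect the MIDX file (it is stored next to the packs)
ls -lh .git/objects/pack/multi-pack-index

# core.multiPackIndex only controls whether Git reads the MIDX
# (on by default in recent releases); to write one while repacking:
git repack -d --write-midx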

How MIDX works:

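  • Creates a single multi-pack-index file in .git/objects/pack/ that indexes the objects of every pack it covers
  • Lets Git find an object with one lookup instead of searching each pack's .idx in turn
  • Can optionally carry a reachability bitmap that spans all covered packs (git multi-pack-index write --bitmap)
  • With the bitmap, object counting for clones, fetches, and git rev-list --objects gets faster; commands such as git log benefit mainly from the quicker object lookups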
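The lookup difference can be sketched in a few lines. This toy model uses made-up pack names, two-character object IDs, and dictionaries in place of the binary index files; it only illustrates why one merged, sorted table beats probing each pack in turn.

```python
import bisect

# Hypothetical per-pack indexes: object id -> offset within that pack.
packs = {
    "pack-a": {"0a": 12, "5c": 340, "9f": 800},
    "pack-b": {"1b": 12, "5c": 95, "e2": 410},   # "5c" exists in two packs
    "pack-c": {"3d": 12, "c7": 77},
}


def lookup_per_pack(oid):
    """Without a MIDX: probe each pack's index until one has the object."""
    probes = 0
    for name, index in packs.items():
        probes += 1
        if oid in index:
            return name, index[oid], probes
    return None


def build_midx(packs):
    """Merge all indexes into one sorted table, keeping one copy per object."""
    merged = {}
    for name, index in packs.items():
        for oid, offset in index.items():
            merged.setdefault(oid, (name, offset))   # first pack seen wins here
    return sorted(merged.items())


midx = build_midx(packs)
keys = [oid for oid, _ in midx]


def lookup_midx(oid):
    """With a MIDX: one binary search answers 'which pack, which offset'."""
    i = bisect.bisect_left(keys, oid)
    return midx[i][1] if i < len(keys) and keys[i] == oid else None


print(lookup_per_pack("c7"))   # ('pack-c', 77, 3)
print(lookup_midx("c7"))       # ('pack-c', 77)
```

With three packs the saving is trivial, but the per-pack probing cost grows with the number of packs while the merged lookup stays a single search. How Git chooses among duplicate copies is simplified here.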

When to use MIDX:

  • Repositories with more than 50 pack files
  • CI/CD environments doing frequent partial clones
  • Monorepos with large object counts (>5 million objects)
  • Server-side Git with heavy read traffic

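Benchmark comparison (illustrative; repository size, hardware, and Git version are not specified):

| Operation | Without MIDX | With MIDX |
|---|---|---|
| git log --all | 120s | 25s |
| git clone | 45s | 18s |
| git rev-list | 80s | 15s |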

Object Storage Architecture (Clean)


graph TD
    DB[".git/objects/"] --> LOOSE["Loose Objects"]
    DB --> PACK_DIR["pack/"]

    LOOSE --> L_FORMAT["zlib-compressed files"]
    LOOSE --> L_PATH["Path: .git/objects/XX/YYYY..."]
    LOOSE --> L_ACCESS["Fast individual access"]

    PACK_DIR --> IDX[".idx (index file)"]
    PACK_DIR --> PACK[".pack (object data)"]
    PACK_DIR --> REV[".rev (reverse index)"]
    PACK_DIR --> MIDX["multi-pack-index (MIDX)"]

    MIDX --> BITMAP[".bitmap (reachability bitmap)"]

    PACK --> DELTA_CHAIN["Delta Chain"]
    DELTA_CHAIN --> BASE["Base object (full copy)"]
    BASE --> D1["Delta 1 (changes)"]
    D1 --> D2["Delta 2 (changes)"]

Production Failure: Pack File Corruption

Scenario: Incomplete clone with missing deltas


# Symptoms
$ git log --oneline
error: Could not read abc123...
fatal: Failed to traverse parents of commit def456...

$ git fsck --full
error: packfile .git/objects/pack/pack-abc.pack does not match index
error: def456...: object missing

# Root cause: Network interruption during clone/fetch left pack file
# incomplete, or disk corruption damaged pack data

# Recovery steps:

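# 1. Identify the corrupted pack
ls -la .git/objects/pack/
# git fsck names the damaged pack; note its file name

# 2. Move the damaged pack aside instead of deleting every pack
#    (pack-abc is the example name from the fsck output above;
#    healthy packs may hold local-only objects, so leave them in place)
mkdir -p ../pack-quarantine
mv .git/objects/pack/pack-abc.* ../pack-quarantine/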

# 3. Re-fetch from remote
git fetch origin --refetch

# 4. Verify integrity
git fsck --full

# 5. If remote doesn't have the objects (local-only work):
#    Restore from backup or another clone
rsync -avz backup-server:/path/to/repo/.git/objects/ .git/objects/

# Prevention:
# - Validate incoming objects on fetch: git config transfer.fsckObjects true
# - Verify after large fetches: git fsck --connectivity-only
# - Use shallow clones only when full history isn't needed

Trade-offs: Loose vs Packed Storage

| Aspect | Loose Objects | Packed Objects |
|---|---|---|
| Performance (read) | Fast for individual objects | Slower per-object (must unpack) |
| Performance (enumerate) | Slow (many filesystem calls) | Fast (single file scan) |
| Disk space | High (no delta compression) | Low (delta compression saves 90%+) |
| Transfer efficiency | Poor (many small files) | Excellent (single pack file) |
| Creation cost | None (created on each commit) | High (CPU-intensive repacking) |
| Network transfer | Inefficient for fetch/clone | Optimized with thin packs |
| Corruption impact | Single object lost | Entire pack may be unreadable |
| Best for | Active repos with few objects | Large repos, archives, servers |

Rule of thumb: Let Git manage this automatically. Manual repacking is only needed for very large repos or before archiving.

Implementation: Manual Pack Creation and Inspection


# === Create a pack from loose objects ===
# Repack all objects into a single pack
git repack -a -d
# -a: repack all reachable objects into a single pack (not just loose ones)
# -d: remove the packs and loose objects made redundant by the new pack

# === Create pack with maximum compression ===
git repack -a -d -f --depth=250 --window=250
# -f: ignore existing delta info, recompute
# --depth: max delta chain length
# --window: objects to consider for delta

# === Inspect pack contents ===
PACK_FILE=$(ls .git/objects/pack/pack-*.idx | head -1)
git verify-pack -v "$PACK_FILE" | head -20
# Output format:
# <sha> <type> <size> <packed-size> <offset> <delta-depth>

# === Find largest objects in pack ===
git verify-pack -v "$PACK_FILE" | \
  grep -v "^$" | \
  grep -v "chain" | \
  sort -k3 -n -r | \
  head -10

# === Check delta chain depths ===
git verify-pack -v "$PACK_FILE" | \
  awk 'NF==7 {print $6}' | \
  sort -n | uniq -c | sort -rn
# Shows distribution: how many deltified objects sit at each delta depth
# (deltified entries have 7 fields: the 6th is the depth, the 7th the base)

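# === Create a custom pack with specific objects ===
# Pack all objects reachable from main (--objects adds trees and blobs;
# plain rev-list would list commits only)
git rev-list --objects main | git pack-objects my-pack
# Creates: my-pack-<sha>.pack and my-pack-<sha>.idx

# === Create thin pack for network transfer ===
# Pack objects in main that origin/main lacks. --revs reads revision
# arguments from stdin, which tells --thin what the receiver already has.
# ("remote" and the repository path are placeholders)
printf 'main\n^origin/main\n' | \
  git pack-objects --revs --thin --stdout | \
  ssh remote "git --git-dir=/path/to/repo.git unpack-objects"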

Interview Questions

1. How does delta compression work in Git pack files?

Git finds similar objects (usually different versions of the same file) and stores one as a full base object. Subsequent versions are stored as delta instructions — byte-level copy/insert commands that transform the base into the target. This is similar to xdelta or bsdiff. Delta chains can be deep (A → B → C → D), but Git limits depth to balance compression ratio with access speed.

2. What's the difference between `git gc` and `git repack`?

git repack only creates new pack files from loose objects and/or existing packs. git gc is a higher-level command that runs repack, prunes unreachable objects, expires reflogs, and runs other maintenance tasks. Think of repack as a tool and gc as a maintenance workflow that uses repack.

3. Why does `git fetch` sometimes say "Compressing objects" on the server side?

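"Compressing objects" is the progress message from the delta-compression phase of git pack-objects on the server, which builds a pack containing the objects you lack. During negotiation the client tells the server which commits it has, so the server can send a thin pack: deltas whose bases are objects the client already holds. The client completes ("thickens") the pack by appending those bases from its local store (git index-pack --fix-thin). If a base is genuinely missing, the fetch fails with an error; there is no automatic retry with a full pack.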

4. How can you find the largest objects in a Git repository?

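Use git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sort -k3 -n -r | head -20. This lists all objects reachable from any ref, shows their sizes and paths, and sorts by size descending. The explicit format is needed because rev-list prints a path after each object ID, and %(rest) tells cat-file to pass it through instead of treating it as part of the object name. For objects inside pack files, use git verify-pack -v .git/objects/pack/*.idx | sort -k3 -n -r | head -20.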

5. What is a thin pack and when does Git create one?

A thin pack contains delta objects that reference base objects the sender believes the receiver already has. Git creates thin packs during git push and git fetch to minimize network transfer. The receiving side "thickens" the pack by appending the missing bases from its existing objects, so the pack stored on disk is self-contained. If the receiver lacks a base object, the operation fails with an unresolved-delta error; it does not fall back to a full pack on its own.

6. How do you recover from a corrupted pack file?

Steps: (1) Identify corrupted packs with git fsck --full, (2) Move the corrupted .pack, .idx, and .rev files out of .git/objects/pack/ and keep them until recovery succeeds, (3) Re-fetch from remote with git fetch --refetch, (4) Verify integrity with git fsck --full. If objects are local-only, restore from backup or another clone using rsync.

7. What is the purpose of pack bitmaps and when should you enable them?

Pack bitmaps accelerate git rev-list operations by precomputing reachability information. Enable with git config repack.writeBitmaps true or git repack -adb. Best for large repositories with many objects where history queries are frequent (CI/CD, code hosting servers). Trade-off: additional disk space and longer repack times.

8. What is Multi-Pack Index (MIDX) and when does it help?

MIDX creates a combined index across multiple pack files, speeding up object lookups in repos with many packs (50+). Use git multi-pack-index write to generate it, or git repack -d --write-midx to write one while repacking. Combined with bitmaps, it can substantially shorten reachability-heavy operations in large monorepos. The core.multiPackIndex setting only controls whether Git reads an existing MIDX; it does not generate one.

9. Why can deleted files still exist in pack files?

Pack files keep every object that is still reachable from a ref or a reflog entry, and unreachable objects can linger until they are pruned. git rm removes a file from the working tree and the index, not from history, so the blob stays part of the existing commits. Even after a history rewrite, the old objects persist until the reflogs expire and git gc --prune=now runs. For true secret removal, use git filter-repo or BFG Repo-Cleaner to rewrite history, then expire reflogs and prune.

10. How does delta depth affect Git performance?

Deep delta chains slow down git log, checkout, and fetch because Git must reconstruct an object by applying every delta in its chain. The default limit (pack.depth) is 50. Use git repack -a -d -f --depth=<n> to recompute deltas with a chosen limit: shallower chains (depth ~10-20) trade disk space for faster access, while a large value such as --depth=250 does the opposite.

11. What is the difference between .idx, .pack, and .rev files?

.pack contains the actual compressed objects and delta instructions. .idx is the forward index (SHA → offset) for fast object lookup using binary search. .rev is the reverse index (offset → SHA) for operations needing offset-to-SHA mapping. Recent Git versions write .rev by default; older ones write it only when pack.writeReverseIndex is set and otherwise compute the reverse mapping in memory.
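A short sketch shows why the offset-ordered view is useful. The object IDs, offsets, and `PACK_END` below are invented; the point is only that sorting entries by offset lets you compute how many bytes each object occupies on disk.

```python
# Hypothetical forward index (.idx): object id -> byte offset in the .pack file.
forward = {"9f3a": 1290, "02bc": 12, "c41d": 402, "77e0": 958}
PACK_END = 1500  # assumed offset where the trailing checksum begins


def build_reverse_index(forward):
    """Return object ids ordered by pack offset, the ordering a .rev file records."""
    return sorted(forward, key=forward.get)


def packed_size(oid, forward, reverse, pack_end):
    """Bytes an object occupies on disk: distance to the next object's offset."""
    pos = reverse.index(oid)
    next_offset = forward[reverse[pos + 1]] if pos + 1 < len(reverse) else pack_end
    return next_offset - forward[oid]


reverse = build_reverse_index(forward)
print(reverse)                                           # ['02bc', 'c41d', '77e0', '9f3a']
print(packed_size("c41d", forward, reverse, PACK_END))   # 556
```

The forward index alone cannot answer "what comes next in the pack" without sorting all offsets, which is the work a stored reverse index avoids.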

12. When should you avoid manual repacking?

Avoid manual repacking for: (1) Daily development — Git's auto-gc is sufficient, (2) Small repositories — loose objects work fine, (3) During active development — repacking creates brief lock contention, (4) When unsure — git gc handles most cases safely. Only repack manually for optimization before archiving or when debugging specific storage issues.

13. How does `git clone --depth 1` affect pack files?

A shallow clone receives a single pack containing only the objects reachable from the most recent commit(s), and records the cut-off commits in .git/shallow. The pack on disk is still self-contained: any thin deltas are completed on receipt. This saves disk space and speeds up cloning. You can extend the history later with git fetch --deepen=<n> or git fetch --depth=<n>, or fetch all of it with git fetch --unshallow.

14. What causes "packfile does not match index" errors?

This error occurs when: (1) Network interruption during fetch left pack file incomplete, (2) Disk corruption damaged pack data, (3) Concurrent process modified pack during read, (4) Insufficient disk space during clone caused partial write. Fix by moving the affected .pack/.idx pair out of .git/objects/pack/ and re-fetching. Deleting every pack (rm .git/objects/pack/*) also destroys objects that exist only locally.

15. How do pack files affect Git LFS migration decisions?

Large binary files stored in pack files bloat every clone. Migrate to Git LFS with git lfs migrate import, which rewrites history to replace large blobs with pointer files. After migration, expire reflogs (git reflog expire --expire=now --all) and run git gc --prune=now to drop the old blobs from local packs. Check large objects before migration with git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sort -k3 -n -r | head -20.

16. What is the difference between `git repack -a -d` and `git repack -a -d -f`?

git repack -a -d packs everything into a single pack and removes the packs and loose objects that the new pack makes redundant. git repack -a -d -f adds -f to force recomputation of deltas from scratch, ignoring existing delta information. Use -f when delta chains are too deep, when you suspect delta corruption, or when you want maximum compression at the cost of longer repack time.

17. What is the format of a loose object in Git's object database?

A loose object is stored as zlib(<object type> <size>\0<object content>). The object type is one of: blob, tree, commit, tag. The content varies by type: a blob contains file data, a tree lists modes, filenames, and SHA pointers, a commit references a tree and parent commits, a tag references another object. Git stores these in .git/objects/<first 2 chars>/<remaining 38 chars>, where the name is the SHA-1 hash of the header plus the content (computed before compression).

18. How does Git's pack file format achieve such high compression ratios?

Pack files achieve high compression ratios (90%+ is common for source code) by combining zlib compression of every entry with three techniques: (1) Delta compression: similar objects are stored as differences from a base object, (2) Object grouping: candidates are sorted by type, path name, and size so that versions of the same file land next to each other for delta matching, (3) Base object selection: within a sliding window, Git picks the base that yields the smallest delta. For text files with incremental changes (like source code), this approach is extremely effective.

19. What is the role of the fan-out table in pack index files?

The fan-out table in a .idx file narrows the binary search for an object. It has 256 entries, one per possible first byte of the object ID, and entry N stores the cumulative number of objects whose first byte is less than or equal to N. Entries N-1 and N therefore bound the slice of the sorted ID table that can contain the object, so Git binary-searches only that slice instead of the whole table. The last entry doubles as the total object count.

20. How do you diagnose and resolve "error: bad packed object CRC" errors?

Diagnosis steps: (1) Run git fsck --full to identify corrupted objects, (2) Use git verify-pack -v to pinpoint which pack contains corruption, (3) Check disk health with smartctl — corruption may indicate hardware failure. Resolution: (1) Delete the corrupted pack and index files, (2) Fetch from a known-good remote, (3) If no remote available, restore from backup. Prevention: use ECC RAM, monitor disk health, and ensure proper shutdown procedures.

Conclusion

Git’s object database stores every version of every file as individual objects until pack files compress them into efficient deltas. This design — loose objects for fresh work, packed storage for history — is why Git can hold years of project history in megabytes.
