Git Objects: Blobs, Trees, Commits, Tags
Understanding Git's four object types — blobs, trees, commits, and annotated tags — how they relate through content-addressable storage, and how to inspect them with plumbing commands.
Introduction
Git is fundamentally a content-addressable filesystem with a VCS user interface. At its core lies a simple but powerful abstraction: four object types that together represent every snapshot of your project’s history. Understanding these objects — blobs, trees, commits, and tags — is the key to demystifying Git’s internals.
Unlike traditional version control systems that store deltas (differences between versions), Git stores complete snapshots. Each snapshot is decomposed into these four object types, linked together by SHA-1 hashes into a directed acyclic graph (DAG). This design gives Git its speed, integrity guarantees, and distributed nature.
This article examines each object type in depth, shows how they interconnect, and teaches you to inspect them using Git’s plumbing commands. By the end, you’ll be able to manually reconstruct a Git repository from raw objects if needed.
When to Use / When Not to Use
When to understand Git objects:
- Debugging corruption or missing data in repositories
- Building Git tooling or integrations
- Understanding how Git achieves integrity verification
- Optimizing repository size and performance
- Recovering lost data from the object database
When not to manipulate objects directly:
- Daily development: use porcelain commands (git add, git commit)
- When unsure: direct object manipulation can corrupt repositories
- For simple history inspection: git log and git show are sufficient
Core Concepts
Git stores everything as objects in .git/objects/, each identified by the SHA-1 hash of its content plus a short type-and-size header. There are exactly four types:
graph TD
TAG["Annotated Tag\n(type: tag)"] -->|points to| COMMIT["Commit\n(type: commit)"]
COMMIT -->|tree: points to| TREE["Tree\n(type: tree)"]
COMMIT -->|parent: points to| COMMIT2["Parent Commit"]
TREE -->|contains| TREE2["Subdirectory Tree"]
TREE -->|contains| BLOB1["Blob\n(file content)"]
TREE -->|contains| BLOB2["Blob\n(file content)"]
TREE2 -->|contains| BLOB3["Blob\n(file content)"]
The relationship is hierarchical: tags point to commits, commits point to trees (and parent commits), trees point to other trees and blobs. Blobs are the leaves — they contain actual file content.
Each object is stored as:
<object type> <content length>\0<content>
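The header and content together are zlib-compressed and stored in .git/objects/ under a path derived from the SHA-1 hash: the first two hex characters name the directory and the remaining 38 name the file.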
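The format above can be checked by hand. The following is a minimal sketch, assuming a POSIX shell with `sha1sum` from coreutils and a Git that uses the default SHA-1 object format; the temporary file and variable names are illustrative.

```shell
# Compute a blob ID by hand: SHA-1 over "blob <size>\0<content>".
content_file=$(mktemp)
printf 'Hello, objects!\n' > "$content_file"

# Size in bytes of the content only (the header is not counted)
size=$(wc -c < "$content_file" | tr -d ' ')

# Prepend the header, then hash header + content
manual_sha=$( { printf 'blob %s\0' "$size"; cat "$content_file"; } | sha1sum | cut -d' ' -f1 )

# Ask Git for the ID of the same bytes (no -w, so nothing is written)
git_sha=$(git hash-object --stdin < "$content_file")

echo "manual: $manual_sha"
echo "git:    $git_sha"
```

Both lines should print the same ID, which shows that the object name depends only on the type, the size, and the bytes.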
Architecture or Flow Diagram
flowchart LR
FILE["File Content"] -->|git hash-object| BLOB["Blob Object\nSHA: abc123..."]
BLOB -->|referenced by| TREE["Tree Object\nSHA: def456..."]
TREE -->|referenced by| COMMIT["Commit Object\nSHA: 789ghi..."]
COMMIT -->|referenced by| TAG["Tag Object\nSHA: jkl012..."]
META["Author, Date, Message"] --> COMMIT
PARENT["Parent SHA"] --> COMMIT
The flow shows how file content becomes a blob, which is referenced by a tree, which is referenced by a commit, which may be referenced by a tag. Metadata flows into commits separately from content.
Step-by-Step Guide / Deep Dive
Blob Objects
Blobs store file content. They do not store filenames, permissions, or directory structure — just raw bytes.
# Create a blob from a file
echo "Hello, World!" | git hash-object -w --stdin
# Output: 8ab686eafeb1f44702738c8b0f24f2567c36da6d
# The object is now in .git/objects/8a/
ls .git/objects/8a/
# Inspect the blob
git cat-file -p 8ab686eafeb1f44702738c8b0f24f2567c36da6d
# Output: Hello, World!
# Check the type
git cat-file -t 8ab686eafeb1f44702738c8b0f24f2567c36da6d
# Output: blob
Key properties:
- Content-addressable: identical files produce identical blobs (deduplication)
- Immutable: once created, a blob never changes
- No metadata: filename and permissions are stored in the tree, not the blob
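The deduplication property can be observed directly. This is a small sketch in a throwaway repository (the directory and file names are made up for illustration): two files with different names and the same bytes map to one blob.

```shell
# Work in a scratch repository so nothing real is touched
repo=$(mktemp -d)
cd "$repo" && git init -q

printf 'same bytes\n' > a.txt
mkdir -p docs && printf 'same bytes\n' > docs/copy-of-a.md
printf 'different bytes\n' > b.txt

# -w writes each blob into .git/objects/
sha_a=$(git hash-object -w a.txt)
sha_copy=$(git hash-object -w docs/copy-of-a.md)
sha_b=$(git hash-object -w b.txt)

echo "a.txt             $sha_a"
echo "docs/copy-of-a.md $sha_copy"
echo "b.txt             $sha_b"

# Count loose object files: the duplicate content was stored only once
object_count=$(find .git/objects -type f -path '.git/objects/??/*' | wc -l | tr -d ' ')
echo "loose objects: $object_count"
```

Three files were hashed, but only two objects exist, because the name and location of a file play no part in the blob's identity.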
Tree Objects
Trees represent directories. They map filenames to blob SHAs (or other tree SHAs for subdirectories).
# Create a tree from the current index
git write-tree
# Output: the tree's SHA-1 (an empty index always yields 4b825dc642cb6eb9a060e54bf899d69f824970a0)
# Inspect a tree by its SHA
git cat-file -p <tree-sha>
# Output format (entries are sorted by name):
# 100644 blob abc123... file1.txt
# 100755 blob 789ghi... script.sh
# 040000 tree def456... subdir
Each tree entry contains:
- Mode: file permissions (100644 for regular files, 100755 for executables, 040000 for directories)
- Type: blob or tree
- SHA-1: hash of the referenced object
- Filename: the name within this directory
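To see these four fields in practice, a tree can be assembled by hand. The sketch below uses `git mktree`, which reads lines in `git ls-tree` format, inside a scratch repository; the file names (`run.sh`, `src/main.py`) are hypothetical.

```shell
repo=$(mktemp -d)
cd "$repo" && git init -q

# One blob, reused under two names to keep the example short
blob=$(printf 'echo hi\n' | git hash-object -w --stdin)

# mktree expects "<mode> <type> <sha><TAB><name>" per line
subtree=$(printf '100644 blob %s\tmain.py\n' "$blob" | git mktree)
root=$(printf '100755 blob %s\trun.sh\n040000 tree %s\tsrc\n' "$blob" "$subtree" | git mktree)

# Top-level entries: mode, type, SHA, name
git ls-tree "$root"

# Recursive walk through the nested tree
git ls-tree -r --name-only "$root"

entry_count=$(git ls-tree "$root" | wc -l | tr -d ' ')
paths=$(git ls-tree -r --name-only "$root" | tr '\n' ' ')
```

The root tree holds two entries, and the subdirectory is simply another tree object referenced by SHA.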
Commit Objects
Commits are the backbone of Git history. Each commit records a snapshot (via a tree), authorship, and lineage.
# Create a commit object manually
export GIT_AUTHOR_NAME="Test User"
export GIT_AUTHOR_EMAIL="test@example.com"
export GIT_COMMITTER_NAME="Test User"
export GIT_COMMITTER_EMAIL="test@example.com"
export GIT_AUTHOR_DATE="2024-03-31T12:00:00+00:00"
export GIT_COMMITTER_DATE="2024-03-31T12:00:00+00:00"
TREE_SHA=$(git write-tree)
COMMIT_SHA=$(echo "Initial commit" | git commit-tree $TREE_SHA)
echo $COMMIT_SHA
# Output: a1b2c3d4e5f6...
# Inspect the commit
git cat-file -p $COMMIT_SHA
# Output:
# tree 4b825dc642cb6eb9a060e54bf899d69f824970a0
# author Test User <test@example.com> 1711886400 +0000
# committer Test User <test@example.com> 1711886400 +0000
#
# Initial commit
A commit object contains:
- tree: SHA-1 of the root tree (the snapshot)
- parent: SHA-1 of the parent commit(s) — zero for the initial commit, one for normal commits, two+ for merges
- author: who wrote the code (name, email, timestamp, timezone)
- committer: who committed the code (can differ from author, e.g., after rebase)
- message: the commit message
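The parent field is what links commits into history. As a rough sketch (scratch repository, placeholder identity, empty tree for brevity), two commits can be chained with `git commit-tree` and the link read back:

```shell
repo=$(mktemp -d)
cd "$repo" && git init -q
export GIT_AUTHOR_NAME="Test User" GIT_AUTHOR_EMAIL="test@example.com"
export GIT_COMMITTER_NAME="Test User" GIT_COMMITTER_EMAIL="test@example.com"

# An empty index yields the empty tree, which is enough for this demo
tree=$(git write-tree)

# Root commit: no -p, so it has zero parents
first=$(echo "first" | git commit-tree "$tree")
# Child commit: -p records the parent SHA
second=$(echo "second" | git commit-tree "$tree" -p "$first")

# Shows the tree, parent, author and committer lines, then the message
git cat-file -p "$second"

# %P prints a commit's parent IDs
parent_of_second=$(git log -1 --format=%P "$second")
parent_count_first=$(git log -1 --format=%P "$first" | wc -w | tr -d ' ')
```

The second commit names the first as its parent, while the first has none, which makes it a root of the graph.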
Tag Objects
There are two types of tags in Git:
Lightweight tags are simply refs pointing to a commit — no tag object is created:
git tag v1.0-lightweight
# Creates: .git/refs/tags/v1.0-lightweight → commit SHA
Annotated tags are full objects with metadata:
git tag -a v1.0 -m "Release version 1.0"
# Creates a tag object
# Inspect the tag object
git cat-file -p $(git rev-parse v1.0)
# Output:
# object a1b2c3d4e5f6...
# type commit
# tag v1.0
# tagger Test User <test@example.com> 1711886400 +0000
#
# Release version 1.0
Annotated tags contain:
- object: the SHA-1 of what the tag points to (usually a commit)
- type: the type of the object (commit, tree, blob, or tag)
- tag: the tag name
- tagger: who created the tag
- message: the tag message
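The difference between the two tag kinds shows up as soon as the ref is resolved. A minimal sketch, assuming a scratch repository and a placeholder identity:

```shell
repo=$(mktemp -d)
cd "$repo" && git init -q
export GIT_AUTHOR_NAME="Test User" GIT_AUTHOR_EMAIL="test@example.com"
export GIT_COMMITTER_NAME="Test User" GIT_COMMITTER_EMAIL="test@example.com"

commit=$(echo "release" | git commit-tree "$(git write-tree)")

git tag v1.0-lightweight "$commit"
git tag -a v1.0 -m "Release version 1.0" "$commit"

light_type=$(git cat-file -t v1.0-lightweight)  # the ref points straight at the commit
annot_type=$(git cat-file -t v1.0)              # the ref points at a tag object
tag_object=$(git rev-parse v1.0)                # SHA of the tag object itself
peeled=$(git rev-parse 'v1.0^{commit}')         # follow the tag down to its commit

echo "lightweight -> $light_type"
echo "annotated   -> $annot_type (peels to $peeled)"
```

The `^{commit}` suffix peels a tag until a commit is reached, which is how commands that need a commit handle annotated tags.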
Production Failure Scenarios
| Scenario | Symptoms | Mitigation |
|---|---|---|
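| Missing blob | git fsck reports "missing blob <sha>"; checkout fails with "unable to read sha1 file" | Fetch from remote, or restore from backup; blobs are immutable so any copy works |
| Corrupted tree | Directory listing fails; "fatal: unable to read tree" | Restore the tree from a remote or backup, or re-stage the working tree with git add and write a new tree with git write-tree; verify with git fsck |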
| Broken commit chain | git log stops abruptly | Use git replace to graft history, or rebase onto valid ancestor |
| Tag pointing to wrong type | Unexpected behavior on tag checkout | Verify with git cat-file -t; recreate annotated tag if needed |
| Object store corruption | Multiple “bad object” errors | Run git fsck --full; clone fresh from remote; restore from backup |
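The first row of the table can be reproduced safely in a throwaway repository. The sketch below deletes a loose object on purpose (something to avoid in any real repository) and lets `git fsck` report it; the file name is illustrative.

```shell
repo=$(mktemp -d)
cd "$repo" && git init -q
export GIT_AUTHOR_NAME="Test User" GIT_AUTHOR_EMAIL="test@example.com"
export GIT_COMMITTER_NAME="Test User" GIT_COMMITTER_EMAIL="test@example.com"

echo "important" > config.txt
git add config.txt
git commit -q -m "add config"

# Find the blob behind the file, then simulate corruption by removing it
blob=$(git rev-parse HEAD:config.txt)
rm -f ".git/objects/$(echo "$blob" | cut -c1-2)/$(echo "$blob" | cut -c3-)"

# fsck walks the graph and notices the tree entry with no object behind it
fsck_output=$(git fsck 2>&1)
fsck_status=$?
echo "$fsck_output"
echo "exit status: $fsck_status"
```

A non-zero exit status makes `git fsck` easy to wire into a scheduled health check.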
Trade-off Analysis
| Aspect | Advantage | Disadvantage |
|---|---|---|
| Content-addressable storage | Automatic deduplication, integrity verification | SHA-1 collision risk (being mitigated with SHA-256) |
| Snapshot-based (not delta) | Fast checkouts, simple model | Higher storage for text files (mitigated by pack files) |
| Immutable objects | Safe concurrent access, easy replication | No in-place updates; every change creates new objects |
| No filenames in blobs | Blobs can be shared across trees | Must traverse tree to find which file a blob belongs to |
Implementation Snippets
# Create a blob and store it
echo "content" | git hash-object -w --stdin
# Read a blob's content
git cat-file -p <sha>
# Get object type
git cat-file -t <sha>
# Get object size
git cat-file -s <sha>
# Create a tree from index
git write-tree
# Read a tree into index
git read-tree <tree-sha>
# Create a commit object
git commit-tree <tree-sha> -p <parent-sha> -m "message"
# Create an annotated tag object (header lines, a blank line, then the message)
git mktag << EOF
object <commit-sha>
type commit
tag v1.0
tagger Name <email> <unix-timestamp> <+hhmm>

Release version 1.0
EOF
# List all objects in the repository
git rev-list --objects --all
# Find all unreachable objects
git fsck --unreachable
Observability Checklist
- Monitor: Object count growth with git count-objects -v
- Verify: Run git fsck periodically to detect corruption
- Track: Ratio of loose to packed objects (should favor packed)
- Alert: Unexpected object count spikes (may indicate accidental large file commits)
- Audit: Tag signatures for release integrity verification
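For the monitoring items, a per-type summary is often more informative than a raw count. The following sketch is one possible way to produce it, using `git cat-file --batch-check --batch-all-objects` and `awk`; the summary layout is an assumption, not a standard report.

```shell
repo=$(mktemp -d)
cd "$repo" && git init -q
export GIT_AUTHOR_NAME="Test User" GIT_AUTHOR_EMAIL="test@example.com"
export GIT_COMMITTER_NAME="Test User" GIT_COMMITTER_EMAIL="test@example.com"

echo "hello" > notes.txt
git add notes.txt
git commit -q -m "add notes"

# Each line of batch-check output is "<sha> <type> <size>"
git cat-file --batch-check --batch-all-objects

# Summarize as "<type> <object count> <total bytes>"
summary=$(git cat-file --batch-check --batch-all-objects \
  | awk '{count[$2]++; bytes[$2] += $3} END {for (t in count) print t, count[t], bytes[t]}' \
  | sort)
echo "$summary"
```

Tracking the blob byte total over time is a simple way to notice an accidentally committed large file.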
Security & Compliance Considerations
- Object hashes provide integrity verification — tampering changes the hash
- SHA-1 is being deprecated in favor of SHA-256 for collision resistance
- Signed tags (GPG/SSH) provide non-repudiation for releases
- Objects are not encrypted — sensitive data in blobs is readable by anyone with repo access
- See Git Secrets Management for preventing secret commits
Common Pitfalls / Anti-Patterns
- Assuming on-disk object size equals file size: stored objects carry a header and are zlib-compressed; git cat-file -s reports the uncompressed content size
- Confusing lightweight and annotated tags: lightweight tags are just refs, not objects
- Modifying objects directly: objects are immutable; use Git commands to create new ones
- Ignoring unreachable objects: they consume space until git gc prunes them
- Storing large binary files as blobs: use Git LFS instead
Quick Recap Checklist
- Blobs store file content only — no filenames or metadata
- Trees map filenames to blob/tree SHAs — represent directories
- Commits point to a tree, parent(s), and record authorship
- Annotated tags are full objects; lightweight tags are just refs
- All objects are content-addressable by SHA-1 hash
- Objects are immutable and zlib-compressed
- The object graph forms a directed acyclic graph (DAG)
- Use git cat-file to inspect any object by type, size, or content
Object Relationship Diagram (Clean)
graph TD
REPO["Repository"] -->|contains| TAGS["Annotated Tags"]
TAGS -->|points to| COMMITS["Commits"]
COMMITS -->|tree ref| TREES["Trees"]
COMMITS -->|parent ref| PARENTS["Parent Commits"]
TREES -->|entries| SUBTREES["Subdirectory Trees"]
TREES -->|entries| BLOBS["Blobs (file content)"]
SUBTREES -->|entries| MORE_BLOBS["More Blobs"]
BLOBS -->|content only| FILES["Raw File Bytes"]
MORE_BLOBS -->|content only| FILES
Production Failure: Corrupted Object Database
Scenario: Missing blob causing checkout failure
# Symptoms
$ git checkout main
error: unable to read sha1 file of src/config.py (abc123def456...)
fatal: unable to checkout working tree
$ git fsck --full
error: abc123def456...: object missing
error: 789ghi...: object corrupt
# Root cause: Disk corruption, interrupted gc, or filesystem error
# destroyed blob objects in .git/objects/
# Recovery steps:
# 1. Identify missing objects
git fsck --full 2>&1 | grep "missing"
# 2. Try to fetch missing objects from remote
git fetch origin --refetch
# 3. If remote doesn't have them (local-only commits):
# Check reflog for last known good state
git reflog
git checkout HEAD@{1} # Try previous HEAD
# 4. As last resort, clone fresh and cherry-pick
cd ..
git clone https://github.com/user/repo.git repo-clean
cd repo-clean
git fetch ../repo <branch>   # import surviving commits from the damaged repository
git cherry-pick <sha>
# 5. Prevent future corruption:
# - Use reliable storage (SSD > HDD for .git/)
# - Run git fsck periodically
# - Keep remote backups
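git push --mirror <backup-remote> # Full backup of all refs; --mirror also deletes remote refs that are missing locally, so push to a dedicated backup remote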
Trade-offs: Annotated vs Lightweight Tags
| Aspect | Annotated Tags | Lightweight Tags |
|---|---|---|
| Object type | Full tag object in .git/objects/ | Simple ref file in .git/refs/tags/ |
| Metadata | Tagger name, email, date, message | None |
| GPG signing | Supported (git tag -s) | Not supported |
| Storage | ~200 bytes per tag | ~41 bytes (SHA only) |
| Immutability | Tag object is immutable; the ref can still be moved with git tag -f | Ref can be moved with git tag -f |
| Use case | Releases, public milestones | Private bookmarks, temporary markers |
| git describe | Considered by default | Ignored unless --tags is passed |
| Platform display | Shows message on GitHub/GitLab | Shows as simple pointer |
Recommendation: Use annotated tags for anything public or release-related. Lightweight tags are fine for personal, temporary markers.
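One practical consequence of the table is how `git describe` treats the two kinds. A short sketch in a scratch repository (the tag names are placeholders):

```shell
repo=$(mktemp -d)
cd "$repo" && git init -q
export GIT_AUTHOR_NAME="Test User" GIT_AUTHOR_EMAIL="test@example.com"
export GIT_COMMITTER_NAME="Test User" GIT_COMMITTER_EMAIL="test@example.com"

git commit -q --allow-empty -m "initial"

# Only a lightweight tag exists at this point
git tag bookmark
if git describe >/dev/null 2>&1; then default_describe=found; else default_describe=none; fi
with_tags=$(git describe --tags)

# Add an annotated tag on the same commit
git tag -a v1.0 -m "Release 1.0"
annotated_describe=$(git describe)

echo "describe (lightweight only): $default_describe"
echo "describe --tags:             $with_tags"
echo "describe (with annotated):   $annotated_describe"
```

Release tooling that relies on plain `git describe` therefore needs annotated tags, or has to pass `--tags` explicitly.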
Implementation: Creating and Inspecting Each Object Type Manually
# === 1. BLOB ===
# Create blob from string
BLOB_SHA=$(echo "file content" | git hash-object -w --stdin)
echo "Blob SHA: $BLOB_SHA"
# Create blob from file
git hash-object -w myfile.txt
# Inspect
git cat-file -t $BLOB_SHA # blob
git cat-file -s $BLOB_SHA # size in bytes
git cat-file -p $BLOB_SHA # content
# === 2. TREE ===
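# Build a tree directly from ls-tree-style input (mktree needs a TAB before the name)
TREE_SHA=$(printf '100644 blob %s\ttest.txt\n' "$BLOB_SHA" | git mktree)
echo "Tree SHA: $TREE_SHA"
# Or snapshot the current index instead
# TREE_SHA=$(git write-tree)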
# Inspect
git cat-file -t $TREE_SHA # tree
git cat-file -p $TREE_SHA # entries (mode, type, sha, name)
# === 3. COMMIT ===
# Create commit (requires env vars for author)
export GIT_AUTHOR_NAME="Test"
export GIT_AUTHOR_EMAIL="test@example.com"
export GIT_COMMITTER_NAME="Test"
export GIT_COMMITTER_EMAIL="test@example.com"
COMMIT_SHA=$(echo "Initial commit" | git commit-tree $TREE_SHA)
# Inspect
git cat-file -t $COMMIT_SHA # commit
git cat-file -p $COMMIT_SHA # tree, author, committer, message
# === 4. ANNOTATED TAG ===
# Create tag object
git tag -a v1.0 -m "Release 1.0" $COMMIT_SHA
# Get tag object SHA (not the commit it points to)
TAG_SHA=$(git rev-parse v1.0^{tag})
# Inspect
git cat-file -t $TAG_SHA # tag
git cat-file -p $TAG_SHA # object, type, tag, tagger, message
# === Verify the chain ===
echo "Tag -> Commit -> Tree -> Blob"
git cat-file -p $TAG_SHA | grep "^object"
git cat-file -p $COMMIT_SHA | grep "^tree"
git cat-file -p $TREE_SHA | grep "blob"
Pack Files and Object Compression
Loose objects (individual files in .git/objects/) are eventually packed into pack files for efficiency:
# Trigger manual packing
git gc
# List packed objects
git verify-pack -v .git/objects/pack/*.idx
# See pack file statistics
git count-objects -v
Pack file structure:
graph TD
LOOSE["Loose Objects\n(individual files)"] -->|git gc triggers| PACKING["Packing Process"]
PACKING -->|delta compression| PACK["Pack File\n(.pack)"]
PACK -->|index for fast lookup| IDX["Pack Index\n(.idx)"]
PACK -->|reduced storage| STORED["Smaller on disk\n(~10-20% of loose)"]
How packing works:
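- Git sorts objects so that likely-similar ones (same type, similar path and size) sit next to each other
- Stores one “base” object plus deltas (differences) for others
- Creates an .idx file for O(log n) SHA lookups
- Result: a much smaller footprint for text files; the exact ratio depends on how similar successive versions are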
Why pack files matter:
- Clone and fetch operations transfer pack files, not loose objects
- Deduplication happens across all branches and history
- git clone --reference borrows objects from an existing local repository instead of copying them, which saves storage
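The move from loose objects to a pack can be watched with `git count-objects -v`. This is a small sketch in a scratch repository; three tiny commits stand in for real history.

```shell
repo=$(mktemp -d)
cd "$repo" && git init -q
export GIT_AUTHOR_NAME="Test User" GIT_AUTHOR_EMAIL="test@example.com"
export GIT_COMMITTER_NAME="Test User" GIT_COMMITTER_EMAIL="test@example.com"

# Three revisions of one file: each commit adds a blob, a tree and a commit
for i in 1 2 3; do
  echo "line $i" >> notes.txt
  git add notes.txt
  git commit -q -m "revision $i"
done

# "count" is the number of loose objects, "in-pack" the number of packed ones
loose_before=$(git count-objects -v | awk '/^count:/ {print $2}')
git gc -q
loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
packed_after=$(git count-objects -v | awk '/^in-pack:/ {print $2}')

echo "loose before gc: $loose_before"
echo "loose after gc:  $loose_after"
echo "packed after gc: $packed_after"
```

After `git gc`, the reachable objects live in a single pack and the loose copies are gone.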
The Index (Staging Area)
The index (.git/index) is a binary file that maps tracked files to blob SHAs — it’s the staging area between working tree and repository.
# View the index
git ls-files --stage
# Sample output:
# 100644 abc123... 0 file1.txt
# 100644 def456... 0 file2.sh
# 100644 789abc... 0 subdir/file3.txt
# (the index lists files only; directories appear as path prefixes)
# See the same content as a tree: write the index out and list the result
git cat-file -p "$(git write-tree)"
Index entries contain:
- Mode: File permissions (100644, 100755, etc.)
- SHA: Blob hash of the file in the repository
- Stage: 0 for normal entries, 1-3 for merge conflicts
- Name: Relative path from repository root
Index vs Trees:
| Aspect | Index | Tree |
|---|---|---|
| Location | .git/index | .git/objects/ (SHA-named) |
| Mutability | Updates on every git add | Immutable once created |
| Scope | One per working tree (the content staged for the next commit) | Any commit in history |
| Purpose | Staging area, merge conflict resolution | Snapshot of directory at commit |
Index workflow:
flowchart TD
WORKING["Working Tree\n(modified files)"] -->|git add| INDEX["Index\n(staged content)"]
INDEX -->|git commit| TREE["Tree Object\n(snapshot)"]
TREE -->|commit| REPO["Repository\n(permanent)"]
WORKING -.->|git diff| INDEX
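The workflow above can be traced object by object. The sketch below, run in a scratch repository, counts objects after each step; `count_type` is a helper defined here for illustration and is not a Git command.

```shell
repo=$(mktemp -d)
cd "$repo" && git init -q

# Hypothetical helper: number of objects of a given type in the object store
count_type() {
  git cat-file --batch-check --batch-all-objects | awk -v t="$1" '$2 == t {n++} END {print n + 0}'
}

echo "draft" > notes.txt
git add notes.txt                 # writes a blob and records it in the index

blobs_after_add=$(count_type blob)
trees_after_add=$(count_type tree)

# The index entry: mode, blob SHA, stage number, path
git ls-files --stage

tree=$(git write-tree)            # only now is a tree object created
trees_after_write=$(count_type tree)

staged_sha=$(git ls-files --stage notes.txt | awk '{print $2}')
tree_sha=$(git ls-tree "$tree" notes.txt | awk '{print $3}')
```

Staging creates the blob immediately, while the tree appears only when the index is written out, which is what `git commit` does internally.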
Interview Questions
Blobs store only file content to enable deduplication. If two files in different directories have identical content, they share the same blob. Filenames are stored in tree objects, which map names to blob SHAs. This separation means renaming a file doesn't create a new blob — only a new tree.
Every object's SHA-1 hash is computed from its type, size, and content. When Git reads an object, it recomputes the hash and compares it to the filename. If they don't match, the object is corrupted. This is why Git is called a "content-addressable filesystem" — the address is the content's fingerprint.
What is the difference between git hash-object and git hash-object -w?
Without -w, git hash-object only computes and prints the SHA-1 hash without storing the object. With -w (write), it also stores the object in .git/objects/. This is useful for checking if content already exists before writing it.
Yes — the initial commit of any repository has zero parents. Additionally, commits created with git commit-tree without the -p flag, or orphan branches created with git checkout --orphan, produce commits with no parent. This creates a new root in the commit DAG.
An annotated tag creates a full tag object in .git/objects/ with metadata (tagger, date, message, GPG signature). A lightweight tag is just a file in .git/refs/tags/ containing a commit SHA — no object is created. Annotated tags are preferred for releases because they're immutable and verifiable.
Blobs are immutable and are never deleted while anything still references them. When you delete a file and commit, Git creates a new tree that no longer references the blob's SHA, but the blob stays in `.git/objects/` because earlier commits still point to it. A blob is pruned by `git gc` only once it is unreachable (no ref or reflog entry leads to it) and older than the prune expiry window, two weeks by default.
git fsck validates each object's SHA-1 hash against its content. It walks the objects reachable from refs (branches, tags) and checks that: (1) the object exists, (2) its hash matches its content, (3) every object it references exists. For packed objects, it also verifies the pack's checksums. git fsck --full, the default in current Git, includes packed objects and alternate object stores; add --unreachable to list objects that no ref reaches.
When Git was designed (2005), SHA-1 was the standard choice for integrity verification, and its 160-bit output was considered sufficiently collision-resistant for content-addressable storage. Git requires deterministic hashing (same content = same hash) and depends on different content producing different hashes in practice. A practical SHA-1 collision was demonstrated in 2017, and SHA-256 support was added later; SHA-1 remains the default mainly for ubiquity and compatibility across systems.
Yes — merge commits have two or more parent commits. The first parent is typically the branch you were on when merging; additional parents are the commits from merged branches. This forms a directed acyclic graph (DAG) rather than a simple chain. Octopus merges can have even more parents (e.g., `git merge branch1 branch2 branch3`).
What is the difference between git cat-file -p, -t, and -s?
-p pretty-prints the object's content in human-readable format. -t shows only the object type (blob, tree, commit, tag). -s shows only the size in bytes. These correspond to the three parts of a stored object: the type and size in its header, and the content that follows.
Directories are represented as tree objects themselves. Each tree entry has a mode (040000 for directories), a SHA pointing to either a blob (file) or another tree (subdirectory), and the filename. When you `git add` a new file, Git creates a blob and records it in the index; at commit time Git writes a new tree for the file's directory and for each parent directory up to the root, all as separate immutable objects.
Git's rename detection works via content similarity (optional, enabled with `-M` flag in `git diff`). Internally: a rename creates a new tree entry pointing to the existing blob (deduplication means the blob stays the same). Git compares old and new trees to infer renames. No blob is created or destroyed — only the tree structure changes.
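A quick experiment illustrates this. The sketch below renames a file in a scratch repository and compares object IDs before and after; the file names are placeholders.

```shell
repo=$(mktemp -d)
cd "$repo" && git init -q
export GIT_AUTHOR_NAME="Test User" GIT_AUTHOR_EMAIL="test@example.com"
export GIT_COMMITTER_NAME="Test User" GIT_COMMITTER_EMAIL="test@example.com"

echo "body" > old-name.txt
git add old-name.txt
git commit -q -m "add file"
before_blob=$(git rev-parse HEAD:old-name.txt)
before_tree=$(git rev-parse 'HEAD^{tree}')

git mv old-name.txt new-name.txt
git commit -q -m "rename file"
after_blob=$(git rev-parse HEAD:new-name.txt)
after_tree=$(git rev-parse 'HEAD^{tree}')

# -M asks diff to pair deleted and added paths by content similarity
status_line=$(git diff -M --name-status HEAD~1 HEAD)
echo "$status_line"
```

The blob ID is unchanged, the tree ID differs, and the diff reports the pair as a rename with a 100% similarity score.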
Dangling blobs are not referenced by any tree, commit, or ref but still exist in the object store. To find them: git fsck --unreachable --no-reflogs. To view content: git cat-file -p <sha>. To recover: redirect that output to a file (git cat-file -p <sha> > recovered.txt), or run git fsck --lost-found to write dangling blobs under .git/lost-found/other/. Dangling blobs usually come from content that was staged with `git add` and then replaced or unstaged before being committed.
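As an illustration, a dangling blob can be produced and recovered in a scratch repository. This sketch stages one draft, overwrites it before committing, and then reads the orphaned blob back with `git cat-file`; the file name is hypothetical.

```shell
repo=$(mktemp -d)
cd "$repo" && git init -q
export GIT_AUTHOR_NAME="Test User" GIT_AUTHOR_EMAIL="test@example.com"
export GIT_COMMITTER_NAME="Test User" GIT_COMMITTER_EMAIL="test@example.com"

echo "first draft" > notes.txt
git add notes.txt                  # writes a blob for the first draft
echo "second draft" > notes.txt
git add notes.txt                  # the index now points at a different blob
git commit -q -m "keep only the second draft"

# fsck prints "dangling blob <sha>" for blobs that nothing references
dangling=$(git fsck 2>/dev/null | awk '$1 == "dangling" && $2 == "blob" {print $3}')

# Read the lost content back out of the object store
recovered=$(git cat-file -p "$dangling")
echo "recovered: $recovered"
```

Recovery works only until `git gc` prunes the blob, so it is worth acting soon after the content is lost.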
Git has no hard limit on object size, theoretically up to 2^63 bytes. In practice, Git becomes inefficient with objects larger than ~2GB (due to memory for delta compression). For large files, use Git LFS, which stores pointer files in the repository and the actual content on LFS servers. Without LFS, every version of a large binary file stays in history and bloats the repository.
Two files with identical content always produce identical blob SHAs (content-addressable). If the SHAs differ, the stored bytes differ: (1) check for line-ending differences (CRLF vs LF), since Git may normalize line endings when content is added, (2) check for trailing spaces, a byte-order mark, or other invisible characters, (3) check whether a filter or text attribute in .gitattributes applies to only one of the paths. File mode is not a factor: it is stored in the tree entry, so an executable and a non-executable file with the same bytes share one blob.
What is the .git/objects/pack directory for?
Pack files store multiple objects compressed together using delta encoding. Git groups similar objects (e.g., successive versions of a file) and stores only the differences plus a base object. This greatly reduces storage for text files, with the ratio depending on how similar the versions are. The `.idx` file provides O(log n) SHA lookups. Running git gc converts loose objects to pack files; git clone primarily transfers pack files.
Commits are immutable — you cannot modify any field including the tree SHA. To change what's in a commit, you must create a new commit with a different SHA. This is what operations like `git commit --amend`, `git rebase`, and `git cherry-pick` do internally. The old commit remains (becomes dangling) until garbage collected.
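The effect is easy to observe with `git commit --amend`. A minimal sketch in a scratch repository, where only the commit message is changed:

```shell
repo=$(mktemp -d)
cd "$repo" && git init -q
export GIT_AUTHOR_NAME="Test User" GIT_AUTHOR_EMAIL="test@example.com"
export GIT_COMMITTER_NAME="Test User" GIT_COMMITTER_EMAIL="test@example.com"

echo "v1" > file.txt
git add file.txt
git commit -q -m "Add fiel"
old=$(git rev-parse HEAD)

# Fix the typo: this writes a new commit object rather than editing the old one
git commit -q --amend -m "Add file"
new=$(git rev-parse HEAD)

# The original commit is still in the object store under its old SHA
old_type=$(git cat-file -t "$old")
old_tree=$(git rev-parse "$old^{tree}")
new_tree=$(git rev-parse "$new^{tree}")

echo "old: $old"
echo "new: $new"
```

Both commits share the same tree, since the files did not change, but the different message yields a different commit SHA.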
Tree entries store file mode as an octal number: 100644 (regular file), 100755 (executable), 040000 (subdirectory), 120000 (symlink). When permissions change, Git creates a new tree entry with the updated mode but same blob SHA (if content unchanged). This is why `git diff` can show permission-only changes.
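A permission-only change can be sketched as follows, using `git update-index --chmod=+x` so the example does not depend on the filesystem's handling of executable bits; the script name is a placeholder.

```shell
repo=$(mktemp -d)
cd "$repo" && git init -q
export GIT_AUTHOR_NAME="Test User" GIT_AUTHOR_EMAIL="test@example.com"
export GIT_COMMITTER_NAME="Test User" GIT_COMMITTER_EMAIL="test@example.com"

printf '#!/bin/sh\necho ok\n' > run.sh
git add run.sh
git commit -q -m "add script"

# Flip the executable bit in the index only, then commit that change
git update-index --chmod=+x run.sh
git commit -q -m "make script executable"

# ls-tree prints "<mode> <type> <sha><TAB><name>"
mode_before=$(git ls-tree HEAD~1 run.sh | awk '{print $1}')
mode_after=$(git ls-tree HEAD run.sh | awk '{print $1}')
blob_before=$(git ls-tree HEAD~1 run.sh | awk '{print $3}')
blob_after=$(git ls-tree HEAD run.sh | awk '{print $3}')

echo "$mode_before -> $mode_after"
```

The mode moves from 100644 to 100755 while the blob SHA stays the same, so the change costs one new tree and one new commit but no new blob.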
Refs (branches, tags) are named pointers to object SHAs, stored as files under `.git/refs/` or in `.git/packed-refs`. HEAD is a special ref that normally points to the current branch (a symbolic ref) and only holds a commit SHA directly when detached. Objects (blobs, trees, commits, tags) are stored in `.git/objects/`. When you commit, Git: (1) creates a tree from the index, (2) creates a commit pointing to that tree, with the current HEAD commit as its parent, (3) updates the branch ref that HEAD points to (or HEAD itself when detached) to the new commit.
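The interplay between HEAD, the branch ref, and commit objects can be observed with `git symbolic-ref` and `git rev-parse`. A small sketch in a scratch repository, using empty commits to keep it short:

```shell
repo=$(mktemp -d)
cd "$repo" && git init -q
export GIT_AUTHOR_NAME="Test User" GIT_AUTHOR_EMAIL="test@example.com"
export GIT_COMMITTER_NAME="Test User" GIT_COMMITTER_EMAIL="test@example.com"

git commit -q --allow-empty -m "first"

# HEAD is a symbolic ref: it names a branch, not a commit
head_target=$(git symbolic-ref HEAD)
before=$(git rev-parse HEAD)

git commit -q --allow-empty -m "second"

# After the commit, HEAD still names the same branch, but the branch moved
head_still=$(git symbolic-ref HEAD)
after=$(git rev-parse HEAD)
branch_tip=$(git rev-parse "$head_target")
parent_of_after=$(git log -1 --format=%P "$after")

echo "HEAD -> $head_target"
echo "branch tip: $before -> $after"
```

Committing moves the branch ref and leaves HEAD pointing at that branch, and the new commit records the previous tip as its parent.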
Content-addressable storage (storing content by its SHA-1 hash) solves several problems: deduplication (identical content is stored once regardless of filename), integrity verification (the hash acts as a checksum — any corruption changes the hash), immutability (changing content produces a new hash, leaving the original intact), and efficient comparison (objects with the same hash are guaranteed identical). This design is why Git operations are so fast — most operations are local hash lookups.
Further Reading
- Git LFS - Large file handling
- Official Git documentation: git cat-file, git hash-object, git write-tree
Conclusion
Git’s object model — blobs for content, trees for structure, commits for snapshots, tags for anchors — is elegant in its simplicity. Every Git operation is a dance between these four object types. Master them and you master Git.