Data Catalog: Organizing and Discovering Data Assets
A data catalog is the single source of truth for data metadata. Learn how catalogs work, what they manage, and how to choose one.
You have 3,000 tables in Snowflake. Business analysts need customer revenue data. Data scientists need feature engineering datasets. Engineers need to know which tables are actively maintained and which are deprecated. Without a catalog, people ask in Slack, get inconsistent answers, and eventually just pick a table that looks right.
A data catalog is a searchable inventory of your data assets. It stores metadata about tables, columns, pipelines, dashboards, and the people and processes associated with them. It is the difference between data as a black box and data as a governed, discoverable resource.
What a Data Catalog Does
At minimum, a data catalog records what data exists and where it lives. Modern catalogs do significantly more.
Discovery: Search for tables by name, column, tag, description, or owner. A data scientist looking for customer features can search “customer features” rather than guessing table names.
Lineage: Track how data flows from source to transformation to output. If a revenue number looks wrong, lineage shows which tables and pipelines contributed to it.
Documentation: Store human-readable descriptions of tables and columns. “This column represents the customer’s lifetime value in USD, calculated as sum of all orders minus returns.”
Governance: Enforce access policies, mark sensitive columns (PII, PHI), track data contracts.
Quality: Store data quality metrics, test results, and freshness SLAs.
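The discovery capability above can be pictured as a search over table metadata. A toy sketch, assuming an in-memory list of catalog entries (all field names and tables here are hypothetical, not a real catalog API):

```python
# Toy discovery index: search tables by name, description, tag, or owner.
# Entries and field names are illustrative, not a real catalog's schema.
def search_catalog(tables, query):
    """Return names of tables whose metadata mentions the query string."""
    q = query.lower()
    hits = []
    for t in tables:
        haystack = " ".join(
            [t["name"], t.get("description", ""), t.get("owner", "")]
            + t.get("tags", [])
        ).lower()
        if q in haystack:
            hits.append(t["name"])
    return hits

tables = [
    {"name": "fact_orders", "description": "All customer orders",
     "owner": "finance-data", "tags": ["revenue"]},
    {"name": "dim_customer_features", "description": "Customer features for ML",
     "owner": "ml-platform", "tags": ["features", "pii"]},
]

print(search_catalog(tables, "customer features"))  # → ['dim_customer_features']
print(search_catalog(tables, "revenue"))            # → ['fact_orders']
```

Real catalogs back this with a proper search index (Elasticsearch in DataHub and Amundsen), but the principle is the same: the more metadata fields are populated, the more queries land on the right table.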
Metadata: The Foundation
A data catalog is only as good as its metadata. There are several layers.
Technical metadata: Table names, column names and types, partition keys, indexes, file formats, storage locations, row counts, data volume.
```json
{
  "table_name": "fact_orders",
  "schema": "analytics",
  "database": "snowflake",
  "columns": [
    { "name": "order_id", "type": "VARCHAR", "nullable": false },
    { "name": "customer_id", "type": "NUMBER", "nullable": false },
    { "name": "order_total_usd", "type": "NUMBER(10,2)", "nullable": false },
    { "name": "order_date", "type": "DATE", "nullable": false }
  ],
  "partition_keys": ["order_date"],
  "row_count": 1450000000,
  "bytes": 85899345920
}
```
Business metadata: Human-readable descriptions, ownership, business context, data contracts. This is the hardest metadata to maintain because it requires human input and upkeep.
Operational metadata: Last updated timestamp, data freshness, pipeline job status, quality test results, access patterns.
Structural metadata: Schema information, column-level statistics, key distributions.
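The four layers can be pictured as one record per table. A minimal sketch, assuming hypothetical field names rather than any specific catalog's schema:

```python
# Minimal sketch of the four metadata layers attached to one table.
# Field names are illustrative, not any specific catalog's data model.
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    # technical metadata
    name: str
    columns: dict                  # column name -> type
    row_count: int
    # business metadata (requires human input)
    description: str = ""
    owner: str = ""
    tags: list = field(default_factory=list)
    # operational metadata
    last_updated: str = ""         # ISO timestamp of the last pipeline run
    freshness_sla_hours: int = 24
    # structural metadata
    column_stats: dict = field(default_factory=dict)  # column -> distinct count

meta = TableMetadata(
    name="fact_orders",
    columns={"order_id": "VARCHAR", "order_total_usd": "NUMBER(10,2)"},
    row_count=1_450_000_000,
    description="One row per order, USD totals net of returns",
    owner="finance-data",
    tags=["revenue"],
)
print(meta.owner)  # → finance-data
```

Notice that the technical fields can be filled by a scanner, while the business fields default to empty: that gap is exactly what curation workflows exist to close.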
How Data Catalogs Work
Data catalogs typically integrate with data platforms through connectors or automated scanning.
Automated discovery
Catalog tools connect to data warehouses, data lakes, and BI platforms and automatically scan for new tables. When a new table appears in Snowflake, the catalog picks it up within the next scan cycle.
```yaml
# Example: OpenMetadata scanner configuration
# Scans Snowflake and ingests table metadata automatically
connectors:
  - type: snowflake
    config:
      host: account.snowflakecomputing.com
      username: catalog_service_account
      role: sysadmin
      databases:
        - analytics
        - production
      schemaFilter:
        include:
          - "analytics.*"
          - "production.customers"
```
Manual curation
Not everything can be automated. Business context, data contracts, and sensitive column markings require human input. The best catalogs combine automated scanning with curation workflows.
Lineage extraction
Catalogs can derive lineage from multiple sources:
- SQL parsing: Parse transformation SQL to understand which tables feed into which
- Pipeline metadata: Pull lineage from Airflow DAGs or dbt models
- BI metadata: Extract which reports and dashboards use which tables
```sql
-- Example lineage query (illustrative schema, not a specific catalog's API)
SELECT
  from_table_fqn,
  to_table_fqn,
  sql_query
FROM lineage
WHERE sql_query ILIKE '%customer_id%'
ORDER BY to_table_fqn;
```
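The SQL-parsing approach can be sketched with a naive pass that pulls source tables out of FROM and JOIN clauses. Production catalogs use full SQL parsers; this regex sketch handles only simple, well-formed statements, and the table names are hypothetical:

```python
# Naive lineage extraction: find source tables in FROM/JOIN clauses.
# A real catalog uses a full SQL parser; this handles only simple statements.
import re

def extract_lineage(target_table, sql):
    """Return (source, target) edges for a CREATE/INSERT statement."""
    sources = re.findall(r"\b(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)
    return [(src, target_table) for src in sorted(set(sources))]

sql = """
INSERT INTO analytics.customer_revenue
SELECT c.customer_id, SUM(o.order_total_usd)
FROM analytics.fact_orders o
JOIN analytics.dim_customer c ON o.customer_id = c.customer_id
GROUP BY c.customer_id
"""
print(extract_lineage("analytics.customer_revenue", sql))
```

Even this crude version illustrates why SQL parsing is the easiest lineage source to start with: the transformation text already names its inputs and outputs.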
Popular Data Catalog Tools
Apache Atlas
Apache Atlas is the open-source standard for Hadoop ecosystems. It provides metadata management, lineage, and governance for data lakes built on Hive, Spark, and Kafka.
Atlas uses a type system for metadata objects. You define types (like “hive_table” or “kafka_topic”), create instances of those types, and attach classifications (like “PII” or “confidential”). Lineage is derived from job metadata collected from Spark and Hive.
DataHub
DataHub is a modern open-source metadata platform originally built at LinkedIn. It stores metadata in a relational database (MySQL or PostgreSQL), indexes it in Elasticsearch for search, and propagates real-time updates through a Kafka-based streaming architecture, with a GraphQL API for consumption.
DataHub’s strength is its extensibility. The metadata model is flexible, and there are pre-built connectors for Snowflake, BigQuery, Kafka, Airflow, and dbt.
Amundsen
Amundsen, built by Lyft, focuses on search and discovery. It aggregates metadata from various sources into a searchable index optimized for data consumer workflows.
Amundsen’s philosophy is that the most important metadata is what helps people find and understand data quickly. It prioritizes the search experience over governance workflows.
Cloud-native catalogs
AWS Glue Catalog, Google Cloud Data Catalog, and Azure Purview provide managed catalog services that integrate with their respective cloud platforms. If your data stack is heavily cloud-vendor specific, these reduce operational overhead.
Data catalog tool comparison:
| Tool | Best for | Strengths | Weaknesses |
|---|---|---|---|
| Apache Atlas | Hadoop ecosystems | Native Hive/Spark/Kafka lineage, governance types | Steep setup, UI dated, slow |
| DataHub | General-purpose | Streaming metadata updates (Kafka), GraphQL API, extensible connectors | Heavier deployment footprint (Kafka, Elasticsearch, relational DB) |
| Amundsen | Search-first discovery | Lyft's search-quality focus, lightweight | Less governance tooling out of the box |
| AWS Glue | AWS-only shops | Managed, integrates with Athena, Lake Formation | Lock-in, limited lineage beyond AWS |
| Google Data Catalog | GCP-only shops | Managed, integrates with BigQuery | Lock-in, regional limits |
| Azure Purview | Azure-only shops | Managed, integrates with ADF, Cosmos DB | Lock-in, cost at scale |
Data Catalog in the Modern Stack
In a modern data stack, the catalog sits at the center and integrates with every platform.
```mermaid
flowchart TD
  Snowflake -->|metadata| Catalog
  BigQuery -->|metadata| Catalog
  S3 -->|metadata| Catalog
  Kafka -->|metadata| Catalog
  Airflow -->|lineage| Catalog
  dbt -->|lineage| Catalog
  Looker -->|dashboard metadata| Catalog
  Catalog -->|search, lineage| Analyst
  Catalog -->|policy enforcement| Governance
```
dbt has become a de facto metadata source for data catalogs. dbt models define transformations, and catalog tools extract model definitions and test results as metadata. The dbt manifest.json and catalog.json files are widely used as lineage sources.
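Extracting lineage edges from a dbt manifest can be sketched in a few lines using the manifest's nodes and their `depends_on.nodes` lists. The manifest below is a synthetic stand-in for the real `manifest.json` artifact, and the project/model names are hypothetical:

```python
# Sketch: derive model-level lineage edges from a dbt manifest.json.
# Uses the manifest's nodes -> depends_on.nodes structure; the manifest
# here is a synthetic stand-in for the real dbt artifact.
import json

manifest_json = """
{
  "nodes": {
    "model.shop.customer_revenue": {
      "depends_on": {"nodes": ["model.shop.fact_orders",
                               "model.shop.dim_customer"]}
    },
    "model.shop.fact_orders": {"depends_on": {"nodes": []}}
  }
}
"""

manifest = json.loads(manifest_json)
edges = [
    (upstream, node_id)
    for node_id, node in manifest["nodes"].items()
    for upstream in node.get("depends_on", {}).get("nodes", [])
]
for src, dst in edges:
    print(f"{src} -> {dst}")
```

Because dbt already resolves `ref()` calls into these dependency lists, a catalog that ingests the manifest gets model-level lineage essentially for free.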
For more on how transformation pipelines generate lineage, see Extract-Transform-Load and dbt.
Implementing a Data Catalog: Practical Steps
Start with what you have. Scanning an existing Snowflake instance and ingesting technical metadata takes hours, not weeks. Automate the basics before investing in curation.
Define ownership. Every table should have an owner (team or individual). Owners are responsible for documentation and are the first point of contact when questions arise.
Tag sensitive data. PII, PHI, financial data, and other regulated categories should be marked in the catalog. This enables governance workflows and access control.
Integrate lineage gradually. SQL parsing lineage is easiest to implement. Pipeline-level lineage requires integration with your orchestrator (Airflow, Dagster, Prefect).
Make the catalog the source of truth for data questions. When someone asks “which table has customer revenue data?”, the answer should come from the catalog, not from a Slack message.
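The ownership and tagging steps above can be enforced with a short audit script that flags tables missing an owner or a sensitivity review. A sketch, with illustrative entry fields rather than a real catalog API:

```python
# Sketch: audit catalog entries for the basics — an owner assigned and
# sensitivity tags reviewed. Entry fields are illustrative.
def audit_catalog(tables):
    """Return table names mapped to their unresolved governance issues."""
    problems = {}
    for t in tables:
        issues = []
        if not t.get("owner"):
            issues.append("no owner")
        if "sensitivity" not in t.get("tags_reviewed", []):
            issues.append("sensitivity not reviewed")
        if issues:
            problems[t["name"]] = issues
    return problems

tables = [
    {"name": "fact_orders", "owner": "finance-data",
     "tags_reviewed": ["sensitivity"]},
    {"name": "tmp_scratch", "owner": ""},
]
print(audit_catalog(tables))  # only tmp_scratch is flagged
```

Running a check like this on a schedule and routing the output to owning teams is one concrete way to keep business metadata from decaying.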
When to Use and When Not to Use a Data Catalog
Use a data catalog when:
- You have more than a few dozen data assets (tables, streams, files) that multiple teams consume
- Discovery is a pain point — people ask in Slack instead of finding data themselves
- You need to track lineage for governance, debugging, or compliance
- You have regulated data (PII, PHI, financial) that requires column-level marking
Do not use a data catalog when:
- Your data stack is small and single-team — a shared spreadsheet may suffice until you hit real scale
- You cannot assign ownership — a catalog without owners decays into noise
- Your priority is speed to market and governance overhead will block pipelines
- Metadata curation requires more effort than the data work itself
Common Pitfalls
The catalog becomes stale. Automated scanning keeps technical metadata fresh. Business metadata and descriptions decay unless there is a process for maintaining them. Assign clear ownership and include metadata accuracy in team responsibilities.
Too many tags. Teams over-tag when first implementing governance. Start with essential classifications: PII, sensitive, deprecated. Expand only when the taxonomy has proven value.
Catalog is too hard to use. If searching the catalog is harder than asking in Slack, people will not use it. Invest in search quality and make the catalog the path of least resistance.
Observability for Data Catalogs
A data catalog quietly goes stale if nobody watches it.
What to track:
- Scan freshness: when was each platform last scanned? A catalog that has not touched Snowflake in a week is wrong more often than it is right.
- Untagged tables: what percentage of tables lack business descriptions or owners? High percentages mean the catalog is mostly noise.
- Search-to-click ratio: how often do searches end in a table visit? Low ratios mean the search results are not landing people where they need to go.
- Lineage coverage: what percentage of pipelines have lineage captured? Low coverage makes lineage useless for incident investigation.
Catalog health metrics to expose:
```shell
# Example: OpenMetadata health check (endpoint and response fields illustrative)
curl http://openmetadata:8585/api/v1/databaseServices | jq '.data[] | {
  name: .name,
  lastScan: .lastScanTimestamp,
  status: .connectionStatus
}'
```
Alert on: scan freshness older than 48 hours, connection status changes, sudden drops in search activity.
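The scan-freshness alert can be sketched as a check over last-scan timestamps, using the 48-hour threshold above. The service records here are hypothetical stand-ins for a catalog API response:

```python
# Sketch: flag catalog sources whose last scan is older than 48 hours.
# Service records are illustrative stand-ins for a catalog API response.
from datetime import datetime, timedelta, timezone

def stale_sources(services, now, max_age=timedelta(hours=48)):
    """Return names of services not scanned within max_age."""
    return [
        s["name"] for s in services
        if now - datetime.fromisoformat(s["last_scan"]) > max_age
    ]

now = datetime(2024, 6, 10, 12, 0, tzinfo=timezone.utc)
services = [
    {"name": "snowflake-prod", "last_scan": "2024-06-10T06:00:00+00:00"},
    {"name": "kafka-events",   "last_scan": "2024-06-01T06:00:00+00:00"},
]
print(stale_sources(services, now))  # → ['kafka-events']
```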
Capacity Estimation for Data Catalogs
Catalog capacity planning comes down to how many metadata objects you have and how often they change.
Metadata storage:
Each table with 50 columns generates roughly 5-10 KB of metadata per scan (column types, stats, descriptions, tags). For 3,000 tables, a single scan produces about 30 MB. Over a year with daily scans, raw metadata is roughly 11 GB before compression. PostgreSQL and Elasticsearch compress this significantly on disk.
Scan frequency vs freshness:
- Hourly scans: high freshness, high compute cost. Makes sense for fast-moving ETL environments where schemas change daily.
- Daily scans: fine for most analytical warehouses. 3,000 tables scanned in under 5 minutes with a well-indexed catalog.
- Weekly scans: acceptable for stable data warehouses that rarely change.
Search index sizing:
Amundsen and DataHub use Elasticsearch for search. Index size runs about 1-2 KB per table (name, descriptions, tags, column names), so 3,000 tables need roughly a 3-6 MB index. Budget 2x overhead for Elasticsearch replicas and index structures.
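The back-of-envelope numbers in this section can be reproduced in a few lines. The inputs are this section's assumptions (per-table metadata size, scan cadence), not measurements:

```python
# Back-of-envelope catalog sizing using this section's assumptions.
TABLES = 3_000
METADATA_PER_TABLE_KB = 10   # upper bound of metadata per table per scan
SCANS_PER_YEAR = 365         # daily scans
INDEX_PER_TABLE_KB = 2       # search index entry, upper bound
ES_OVERHEAD = 2              # Elasticsearch replica/overhead factor

scan_mb = TABLES * METADATA_PER_TABLE_KB / 1_000
yearly_gb = scan_mb * SCANS_PER_YEAR / 1_000
index_mb = TABLES * INDEX_PER_TABLE_KB * ES_OVERHEAD / 1_000

print(f"per scan: {scan_mb:.0f} MB")       # → per scan: 30 MB
print(f"per year: {yearly_gb:.1f} GB")     # roughly 11 GB before compression
print(f"search index: {index_mb:.0f} MB")  # → search index: 12 MB
```

The takeaway is that metadata volume is small; scan compute and connector reliability, not storage, are the real capacity concerns.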
Quick Recap
- A data catalog solves the discovery problem. When you have hundreds of tables, people should find data without asking in Slack.
- Automated scanning handles technical metadata. Human curation handles business context.
- Atlas, DataHub, and Amundsen are the main open-source choices. Cloud-native catalogs work best in single-vendor shops.
- SQL parsing, pipeline orchestrators, and BI tools all produce partial lineage — combine them.
- Watch scan freshness, untagged tables, and search effectiveness. A stale catalog is worse than no catalog.
Conclusion
A data catalog transforms data from an opaque resource into a governed, discoverable asset. It solves the discovery problem that emerges in any organization with more than a few dozen tables.
The best catalogs combine automated metadata collection with human curation. Automated scanning handles technical metadata at scale. Curation workflows handle business context that cannot be derived automatically.
Start with what you have. Catalog your existing tables before building sophisticated governance workflows. Discovery is the immediate pain point. Governance comes later.
For related reading on data pipeline components, see Pipeline Orchestration and Data Quality.