Data Catalog: Organizing and Discovering Data Assets
A data catalog is the single source of truth for data metadata. Learn how catalogs work, what they manage, and how to choose one.
You have 3,000 tables in Snowflake. Business analysts need customer revenue data. Data scientists need feature engineering datasets. Engineers need to know which tables are actively maintained and which are deprecated. Without a catalog, people ask in Slack, get inconsistent answers, and eventually just pick a table that looks right.
A data catalog is a searchable inventory of your data assets. It stores metadata about tables, columns, pipelines, dashboards, and the people and processes associated with them. It is the difference between data as a black box and data as a governed, discoverable resource.
What a Data Catalog Does
At minimum, a data catalog records what data exists and where it lives. Modern catalogs do significantly more.
Discovery: Search for tables by name, column, tag, description, or owner. A data scientist looking for customer features can search “customer features” rather than guessing table names.
Lineage: Track how data flows from source to transformation to output. If a revenue number looks wrong, lineage shows which tables and pipelines contributed to it.
Documentation: Store human-readable descriptions of tables and columns. “This column represents the customer’s lifetime value in USD, calculated as sum of all orders minus returns.”
Governance: Enforce access policies, mark sensitive columns (PII, PHI), track data contracts.
Quality: Store data quality metrics, test results, and freshness SLAs.
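The discovery capability above can be pictured as a search over table metadata. A toy sketch, assuming an in-memory list of catalog entries (all field names and tables here are hypothetical, not a real catalog API):

```python
# Toy discovery index: search tables by name, description, tag, or owner.
# Entries and field names are illustrative, not a real catalog's schema.
def search_catalog(tables, query):
    """Return names of tables whose metadata mentions the query string."""
    q = query.lower()
    hits = []
    for t in tables:
        haystack = " ".join(
            [t["name"], t.get("description", ""), t.get("owner", "")]
            + t.get("tags", [])
        ).lower()
        if q in haystack:
            hits.append(t["name"])
    return hits

tables = [
    {"name": "fact_orders", "description": "All customer orders",
     "owner": "finance-data", "tags": ["revenue"]},
    {"name": "dim_customer_features", "description": "Customer features for ML",
     "owner": "ml-platform", "tags": ["features", "pii"]},
]

print(search_catalog(tables, "customer features"))  # → ['dim_customer_features']
print(search_catalog(tables, "revenue"))            # → ['fact_orders']
```

Real catalogs back this with a proper search index (Elasticsearch in DataHub and Amundsen), but the principle is the same: the more metadata fields are populated, the more queries land on the right table.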
Metadata: The Foundation
A data catalog is only as good as its metadata. There are several layers.
Technical metadata: Table names, column names and types, partition keys, indexes, file formats, storage locations, row counts, data volume.
```json
{
  "table_name": "fact_orders",
  "schema": "analytics",
  "database": "snowflake",
  "columns": [
    { "name": "order_id", "type": "VARCHAR", "nullable": false },
    { "name": "customer_id", "type": "NUMBER", "nullable": false },
    { "name": "order_total_usd", "type": "NUMBER(10,2)", "nullable": false },
    { "name": "order_date", "type": "DATE", "nullable": false }
  ],
  "partition_keys": ["order_date"],
  "row_count": 1450000000,
  "bytes": 85899345920
}
```
Business metadata: Human-readable descriptions, ownership, business context, data contracts. This is the hardest metadata to maintain because it requires human input and upkeep.
Operational metadata: Last updated timestamp, data freshness, pipeline job status, quality test results, access patterns.
Structural metadata: Schema information, column-level statistics, key distributions.
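The four layers can be pictured as one record per table. A minimal sketch, assuming hypothetical field names rather than any specific catalog's schema:

```python
# Minimal sketch of the four metadata layers attached to one table.
# Field names are illustrative, not any specific catalog's data model.
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    # technical metadata
    name: str
    columns: dict                  # column name -> type
    row_count: int
    # business metadata (requires human input)
    description: str = ""
    owner: str = ""
    tags: list = field(default_factory=list)
    # operational metadata
    last_updated: str = ""         # ISO timestamp of the last pipeline run
    freshness_sla_hours: int = 24
    # structural metadata
    column_stats: dict = field(default_factory=dict)  # column -> distinct count

meta = TableMetadata(
    name="fact_orders",
    columns={"order_id": "VARCHAR", "order_total_usd": "NUMBER(10,2)"},
    row_count=1_450_000_000,
    description="One row per order, USD totals net of returns",
    owner="finance-data",
    tags=["revenue"],
)
print(meta.owner)  # → finance-data
```

Notice that the technical fields can be filled by a scanner, while the business fields default to empty: that gap is exactly what curation workflows exist to close.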
How Data Catalogs Work
Data catalogs typically integrate with data platforms through connectors or automated scanning.
Automated discovery
Catalog tools connect to data warehouses, data lakes, and BI platforms and automatically scan for new tables. When a new table appears in Snowflake, the catalog picks it up within the next scan cycle.
```yaml
# Example: OpenMetadata scanner configuration
# Scans Snowflake and ingests table metadata automatically
connectors:
  - type: snowflake
    config:
      host: account.snowflakecomputing.com
      username: catalog_service_account
      role: sysadmin
      databases:
        - analytics
        - production
      schemaFilter:
        include:
          - "analytics.*"
          - "production.customers"
```
Manual curation
Not everything can be automated. Business context, data contracts, and sensitive column markings require human input. The best catalogs combine automated scanning with curation workflows.
Lineage extraction
Catalogs can derive lineage from multiple sources:
- SQL parsing: Parse transformation SQL to understand which tables feed into which
- Pipeline metadata: Pull lineage from Airflow DAGs or dbt models
- BI metadata: Extract which reports and dashboards use which tables
```sql
-- Example lineage query (illustrative schema, not a specific catalog's API)
SELECT
  from_table_fqn,
  to_table_fqn,
  sql_query
FROM lineage
WHERE sql_query ILIKE '%customer_id%'
ORDER BY to_table_fqn;
```
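The SQL-parsing approach can be sketched with a naive pass that pulls source tables out of FROM and JOIN clauses. Production catalogs use full SQL parsers; this regex sketch handles only simple, well-formed statements, and the table names are hypothetical:

```python
# Naive lineage extraction: find source tables in FROM/JOIN clauses.
# A real catalog uses a full SQL parser; this handles only simple statements.
import re

def extract_lineage(target_table, sql):
    """Return (source, target) edges for a CREATE/INSERT statement."""
    sources = re.findall(r"\b(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)
    return [(src, target_table) for src in sorted(set(sources))]

sql = """
INSERT INTO analytics.customer_revenue
SELECT c.customer_id, SUM(o.order_total_usd)
FROM analytics.fact_orders o
JOIN analytics.dim_customer c ON o.customer_id = c.customer_id
GROUP BY c.customer_id
"""
print(extract_lineage("analytics.customer_revenue", sql))
```

Even this crude version illustrates why SQL parsing is the easiest lineage source to start with: the transformation text already names its inputs and outputs.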
Popular Data Catalog Tools
Apache Atlas
Apache Atlas is the open-source standard for Hadoop ecosystems. It provides metadata management, lineage, and governance for data lakes built on Hive, Spark, and Kafka.
Atlas uses a type system for metadata objects. You define types (like “hive_table” or “kafka_topic”), create instances of those types, and attach classifications (like “PII” or “confidential”). Lineage is derived from job metadata collected from Spark and Hive.
DataHub
DataHub is a modern open-source metadata platform originally built at LinkedIn. It stores metadata in a relational database (MySQL or PostgreSQL), indexes it in Elasticsearch for search, and propagates real-time updates through a Kafka-based streaming architecture, with a GraphQL API for consumption.
DataHub’s strength is its extensibility. The metadata model is flexible, and there are pre-built connectors for Snowflake, BigQuery, Kafka, Airflow, and dbt.
Amundsen
Amundsen, built by Lyft, focuses on search and discovery. It aggregates metadata from various sources into a searchable index optimized for data consumer workflows.
Amundsen’s philosophy is that the most important metadata is what helps people find and understand data quickly. It prioritizes the search experience over governance workflows.
Cloud-native catalogs
AWS Glue Catalog, Google Cloud Data Catalog, and Azure Purview provide managed catalog services that integrate with their respective cloud platforms. If your data stack is heavily cloud-vendor specific, these reduce operational overhead.
Data catalog tool comparison:
| Tool | Best for | Strengths | Weaknesses |
|---|---|---|---|
| Apache Atlas | Hadoop ecosystems | Native Hive/Spark/Kafka lineage, governance types | Steep setup, UI dated, slow |
| DataHub | General-purpose | Streaming metadata updates (Kafka), GraphQL API, extensible connectors | Heavier deployment footprint (Kafka, Elasticsearch, relational DB) |
| Amundsen | Search-first discovery | Lyft's search-quality focus, lightweight | Less governance tooling out of the box |
| AWS Glue | AWS-only shops | Managed, integrates with Athena, Lake Formation | Lock-in, limited lineage beyond AWS |
| Google Data Catalog | GCP-only shops | Managed, integrates with BigQuery | Lock-in, regional limits |
| Azure Purview | Azure-only shops | Managed, integrates with ADF, Cosmos DB | Lock-in, cost at scale |
Data Catalog in the Modern Stack
In a modern data stack, the catalog sits at the center and integrates with every platform.
```mermaid
flowchart TD
  Snowflake -->|metadata| Catalog
  BigQuery -->|metadata| Catalog
  S3 -->|metadata| Catalog
  Kafka -->|metadata| Catalog
  Airflow -->|lineage| Catalog
  dbt -->|lineage| Catalog
  Looker -->|dashboard metadata| Catalog
  Catalog -->|search, lineage| Analyst
  Catalog -->|policy enforcement| Governance
```
dbt has become a de facto metadata source for data catalogs. dbt models define transformations, and catalog tools extract model definitions and test results as metadata. The dbt manifest.json and catalog.json files are widely used as lineage sources.
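Extracting lineage edges from a dbt manifest can be sketched in a few lines using the manifest's nodes and their `depends_on.nodes` lists. The manifest below is a synthetic stand-in for the real `manifest.json` artifact, and the project/model names are hypothetical:

```python
# Sketch: derive model-level lineage edges from a dbt manifest.json.
# Uses the manifest's nodes -> depends_on.nodes structure; the manifest
# here is a synthetic stand-in for the real dbt artifact.
import json

manifest_json = """
{
  "nodes": {
    "model.shop.customer_revenue": {
      "depends_on": {"nodes": ["model.shop.fact_orders",
                               "model.shop.dim_customer"]}
    },
    "model.shop.fact_orders": {"depends_on": {"nodes": []}}
  }
}
"""

manifest = json.loads(manifest_json)
edges = [
    (upstream, node_id)
    for node_id, node in manifest["nodes"].items()
    for upstream in node.get("depends_on", {}).get("nodes", [])
]
for src, dst in edges:
    print(f"{src} -> {dst}")
```

Because dbt already resolves `ref()` calls into these dependency lists, a catalog that ingests the manifest gets model-level lineage essentially for free.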
For more on how transformation pipelines generate lineage, see Extract-Transform-Load and dbt.
Implementing a Data Catalog: Practical Steps
Start with what you have. Scanning an existing Snowflake instance and ingesting technical metadata takes hours, not weeks. Automate the basics before investing in curation.
Define ownership. Every table should have an owner (team or individual). Owners are responsible for documentation and are the first point of contact when questions arise.
Tag sensitive data. PII, PHI, financial data, and other regulated categories should be marked in the catalog. This enables governance workflows and access control.
Integrate lineage gradually. SQL parsing lineage is easiest to implement. Pipeline-level lineage requires integration with your orchestrator (Airflow, Dagster, Prefect).
Make the catalog the source of truth for data questions. When someone asks “which table has customer revenue data?”, the answer should come from the catalog, not from a Slack message.
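The ownership and tagging steps above can be enforced with a short audit script that flags tables missing an owner or a sensitivity review. A sketch, with illustrative entry fields rather than a real catalog API:

```python
# Sketch: audit catalog entries for the basics — an owner assigned and
# sensitivity tags reviewed. Entry fields are illustrative.
def audit_catalog(tables):
    """Return table names mapped to their unresolved governance issues."""
    problems = {}
    for t in tables:
        issues = []
        if not t.get("owner"):
            issues.append("no owner")
        if "sensitivity" not in t.get("tags_reviewed", []):
            issues.append("sensitivity not reviewed")
        if issues:
            problems[t["name"]] = issues
    return problems

tables = [
    {"name": "fact_orders", "owner": "finance-data",
     "tags_reviewed": ["sensitivity"]},
    {"name": "tmp_scratch", "owner": ""},
]
print(audit_catalog(tables))  # only tmp_scratch is flagged
```

Running a check like this on a schedule and routing the output to owning teams is one concrete way to keep business metadata from decaying.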
When to Use and When Not to Use a Data Catalog
Use a data catalog when:
- You have more than a few dozen data assets (tables, streams, files) that multiple teams consume
- Discovery is a pain point — people ask in Slack instead of finding data themselves
- You need to track lineage for governance, debugging, or compliance
- You have regulated data (PII, PHI, financial) that requires column-level marking
Do not use a data catalog when:
- Your data stack is small and single-team — a shared spreadsheet may suffice until you hit real scale
- You cannot assign ownership — a catalog without owners decays into noise
- Your priority is speed to market and governance overhead will block pipelines
- Metadata curation requires more effort than the data work itself
Common Pitfalls
The catalog becomes stale. Automated scanning keeps technical metadata fresh. Business metadata and descriptions decay unless there is a process for maintaining them. Assign clear ownership and include metadata accuracy in team responsibilities.
Too many tags. Teams over-tag when first implementing governance. Start with essential classifications: PII, sensitive, deprecated. Expand only when the taxonomy has proven value.
Catalog is too hard to use. If searching the catalog is harder than asking in Slack, people will not use it. Invest in search quality and make the catalog the path of least resistance.
Observability for Data Catalogs
A data catalog quietly goes stale if nobody watches it.
What to track:
- Scan freshness: when was each platform last scanned? A catalog that has not touched Snowflake in a week is wrong more often than it is right.
- Untagged tables: what percentage of tables lack business descriptions or owners? High percentages mean the catalog is mostly noise.
- Search-to-click ratio: how often do searches end in a table visit? Low ratios mean the search results are not landing people where they need to go.
- Lineage coverage: what percentage of pipelines have lineage captured? Low coverage makes lineage useless for incident investigation.
Catalog health metrics to expose:
```shell
# Example: OpenMetadata health check (endpoint and response fields illustrative)
curl http://openmetadata:8585/api/v1/databaseServices | jq '.data[] | {
  name: .name,
  lastScan: .lastScanTimestamp,
  status: .connectionStatus
}'
```
Alert on: scan freshness older than 48 hours, connection status changes, sudden drops in search activity.
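The scan-freshness alert can be sketched as a check over last-scan timestamps, using the 48-hour threshold above. The service records here are hypothetical stand-ins for a catalog API response:

```python
# Sketch: flag catalog sources whose last scan is older than 48 hours.
# Service records are illustrative stand-ins for a catalog API response.
from datetime import datetime, timedelta, timezone

def stale_sources(services, now, max_age=timedelta(hours=48)):
    """Return names of services not scanned within max_age."""
    return [
        s["name"] for s in services
        if now - datetime.fromisoformat(s["last_scan"]) > max_age
    ]

now = datetime(2024, 6, 10, 12, 0, tzinfo=timezone.utc)
services = [
    {"name": "snowflake-prod", "last_scan": "2024-06-10T06:00:00+00:00"},
    {"name": "kafka-events",   "last_scan": "2024-06-01T06:00:00+00:00"},
]
print(stale_sources(services, now))  # → ['kafka-events']
```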
Capacity Estimation for Data Catalogs
Catalog capacity planning comes down to how many metadata objects you have and how often they change.
Metadata storage:
Each table with 50 columns generates roughly 5-10 KB of metadata per scan (column types, stats, descriptions, tags). For 3,000 tables, a single scan produces about 30 MB. Over a year with daily scans, raw metadata is roughly 11 GB before compression. PostgreSQL and Elasticsearch compress this significantly on disk.
Scan frequency vs freshness:
- Hourly scans: high freshness, high compute cost. Makes sense for fast-moving ETL environments where schemas change daily.
- Daily scans: fine for most analytical warehouses. 3,000 tables scanned in under 5 minutes with a well-indexed catalog.
- Weekly scans: acceptable for stable data warehouses that rarely change.
Search index sizing:
Amundsen and DataHub use Elasticsearch for search. Index size runs about 1-2 KB per table (name, descriptions, tags, column names), so 3,000 tables need roughly a 3-6 MB index. Budget 2x overhead for Elasticsearch replicas and index structures.
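The back-of-envelope numbers in this section can be reproduced in a few lines. The inputs are this section's assumptions (per-table metadata size, scan cadence), not measurements:

```python
# Back-of-envelope catalog sizing using this section's assumptions.
TABLES = 3_000
METADATA_PER_TABLE_KB = 10   # upper bound of metadata per table per scan
SCANS_PER_YEAR = 365         # daily scans
INDEX_PER_TABLE_KB = 2       # search index entry, upper bound
ES_OVERHEAD = 2              # Elasticsearch replica/overhead factor

scan_mb = TABLES * METADATA_PER_TABLE_KB / 1_000
yearly_gb = scan_mb * SCANS_PER_YEAR / 1_000
index_mb = TABLES * INDEX_PER_TABLE_KB * ES_OVERHEAD / 1_000

print(f"per scan: {scan_mb:.0f} MB")       # → per scan: 30 MB
print(f"per year: {yearly_gb:.1f} GB")     # roughly 11 GB before compression
print(f"search index: {index_mb:.0f} MB")  # → search index: 12 MB
```

The takeaway is that metadata volume is small; scan compute and connector reliability, not storage, are the real capacity concerns.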
Quick Recap
- A data catalog solves the discovery problem. When you have hundreds of tables, people should find data without asking in Slack.
- Automated scanning handles technical metadata. Human curation handles business context.
- Atlas, DataHub, and Amundsen are the main open-source choices. Cloud-native catalogs work best in single-vendor shops.
- SQL parsing, pipeline orchestrators, and BI tools all produce partial lineage — combine them.
- Watch scan freshness, untagged tables, and search effectiveness. A stale catalog is worse than no catalog.
Conclusion
A data catalog transforms data from an opaque resource into a governed, discoverable asset. It solves the discovery problem that emerges in any organization with more than a few dozen tables.
The best catalogs combine automated metadata collection with human curation. Automated scanning handles technical metadata at scale. Curation workflows handle business context that cannot be derived automatically.
Start with what you have. Catalog your existing tables before building sophisticated governance workflows. Discovery is the immediate pain point. Governance comes later.
For related reading on data pipeline components, see Pipeline Orchestration and Data Quality.