Data Quality: Validation, Monitoring, and Governance

Data quality determines whether pipeline outputs are trustworthy. Learn how to define rules, implement validation, and catch bad data before it reaches users.

A dashboard shows revenue down 30% this month. The analyst panics. An engineer starts debugging. Hours later, the answer is mundane: a pipeline extracted a batch of test orders from the staging environment and loaded them alongside real orders, skewing the aggregates. The data was wrong, not the business.

Bad data costs real time. Data quality is the practice of catching bad data before it reaches users, and when it does reach users, catching it fast enough to minimize damage.

Data Quality Flow Through Pipeline Stages

Data quality checks apply at each stage of the pipeline, with different concerns at each layer:

flowchart TD
    subgraph Extraction[Extraction Stage]
        E1[Source data ingestion]
        E2[Schema validation]
        E3[Type checks]
    end
    subgraph Staging[Staging Stage]
        S1[Row count validation]
        S2[Null rate checks]
        S3[Unique key validation]
    end
    subgraph Transform[Transformation Stage]
        T1[Business logic rules]
        T2[Cross-table consistency]
        T3[Anomaly detection]
    end
    subgraph Load[Load Stage]
        L1[Referential integrity]
        L2[Constraint validation]
        L3[Freshness check]
    end
    E1 --> E2 --> E3 --> S1 --> S2 --> S3 --> T1 --> T2 --> T3 --> L1 --> L2 --> L3

Catching issues early (extraction/staging) is cheaper than catching them late (load stage). Data that fails a check at extraction can simply be re-extracted; an issue discovered at load requires a backfill.

What Makes Data “Bad”

Data quality issues fall into several categories:

Completeness: Required fields are null or missing. A customer record without an email address is incomplete. A fact table row without a foreign key to the date dimension is broken.

Validity: Data is present but wrong. A date of “February 30th”. An email address that is not an email address. A percentage field with values over 100.

Consistency: The same entity has different values across systems. Customer name is “Alice Sharma” in the CRM and “Alice S.” in the billing system. The records describe the same customer but fail to join as one.

Uniqueness: Duplicate records where duplicates should not exist. The same order appearing twice because a pipeline ran twice.

Timeliness: Data arrives late. A daily pipeline that usually runs at 2am has been running at 6am for three days because of an upstream issue. Reports that assume 2am data are stale.

Accuracy: Data does not reflect reality. A sensor reading of 999 degrees when the actual temperature was 73. A zip code that does not exist.

Data Quality Rules

Data quality rules define what “good” data looks like. Rules are specific, measurable assertions about data.

# Example: Data quality rules in Python (Great Expectations, legacy pandas-style API)
import great_expectations as ge

# Load data as a validating DataFrame
df = ge.read_csv("customers.csv")

# Define expectations: specific, measurable assertions about the data
df.expect_column_values_to_not_be_null("customer_id")
df.expect_column_values_to_be_unique("customer_id")
df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
df.expect_column_values_to_match_regex("email", r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$")
df.expect_column_values_to_be_in_set("status", ["active", "inactive", "pending"])  # allowed set is illustrative

# Run all expectations and inspect the result
results = df.validate()
if not results["success"]:
    print(f"Validation failed: {results['results']}")

Rules can be implemented at multiple levels:

Schema-level rules: Enforce data types, nullability, and primary key constraints at load time. Most data warehouses support these constraints natively.

Statistical rules: Monitor column statistics. If a column normally ranges from 0 to 100 and suddenly shows values in the thousands, something is wrong.

Business logic rules: Customer age should not exceed 120. Order total should equal sum of line items. These require explicit validation logic.
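As an illustration of the statistical level, a drift check can be as simple as comparing a batch's mean to a stored baseline. This is a minimal sketch; the function name, baseline figures, and 3-sigma threshold are illustrative choices, not from any particular tool:

```python
# Minimal statistical rule: flag a batch whose mean drifts far from a
# stored baseline. Baseline numbers below are illustrative.

def mean_drift_alert(values, baseline_mean, baseline_std, z_threshold=3.0):
    """Return True when the batch mean sits more than z_threshold
    baseline standard deviations away from the baseline mean."""
    if not values or baseline_std == 0:
        return False
    batch_mean = sum(values) / len(values)
    return abs(batch_mean - baseline_mean) / baseline_std > z_threshold

# A column that normally ranges 0-100 suddenly showing values in the thousands:
mean_drift_alert([4500, 5200, 4900], baseline_mean=50, baseline_std=25)  # → True
mean_drift_alert([48, 55, 51], baseline_mean=50, baseline_std=25)        # → False
```

In practice the baseline would be recomputed periodically from trailing history rather than hardcoded.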

Pipeline-Level Validation

Integrate validation into the pipeline, not as a separate process.

# Example: Validation step in a pipeline
from dataclasses import dataclass

import pandas as pd

@dataclass
class Violation:
    type: str
    count: int
    sample: pd.DataFrame

@dataclass
class ValidationResult:
    passed: bool
    violations: list
    row_count: int
    error_rate: float

def validate_orders(orders_df: pd.DataFrame) -> ValidationResult:
    """Run validation on orders data and return the result."""
    violations = []

    # Check for null customer IDs
    null_customers = orders_df[orders_df["customer_id"].isna()]
    if len(null_customers) > 0:
        violations.append(
            Violation(type="null_key", count=len(null_customers), sample=null_customers.head())
        )

    # Check for duplicate order IDs (keep=False flags every copy, not just the extras)
    duplicates = orders_df[orders_df["order_id"].duplicated(keep=False)]
    if len(duplicates) > 0:
        violations.append(
            Violation(type="duplicate_key", count=len(duplicates), sample=duplicates.head())
        )

    # Check order total consistency: total should equal the sum of line items.
    # Compare with a tolerance rather than ==, since totals are floats.
    calc_total = orders_df["line_items"].apply(sum)
    inconsistent_totals = orders_df[(orders_df["order_total"] - calc_total).abs() > 0.005]
    if len(inconsistent_totals) > 0:
        violations.append(
            Violation(type="invalid_total", count=len(inconsistent_totals), sample=inconsistent_totals.head())
        )

    bad_rows = sum(v.count for v in violations)
    return ValidationResult(
        passed=len(violations) == 0,
        violations=violations,
        row_count=len(orders_df),
        error_rate=bad_rows / len(orders_df) if len(orders_df) > 0 else 0.0,
    )

A validation failure in a pipeline step should determine the pipeline outcome. Stop and alert on critical failures. Log and continue on non-critical ones, but track the error rate.
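That stop-or-continue policy can be sketched as a routing step. This is a minimal illustration with a hypothetical `severity` field on each violation; it is not tied to any particular validation framework:

```python
class PipelineHalt(Exception):
    """Raised to stop the pipeline on a critical data quality failure."""

def route_violations(violations, logger=print):
    """Stop and alert on critical violations; log and count non-critical ones."""
    critical = [v for v in violations if v.get("severity") == "critical"]
    if critical:
        raise PipelineHalt(f"{len(critical)} critical violation(s): {critical}")
    for v in violations:
        logger(f"data quality warning: {v['type']}")  # tracked, but non-blocking
    return len(violations)  # non-critical count feeds the error-rate metric

warnings_only = [{"type": "null_rate_drift", "severity": "warning"}]
route_violations(warnings_only, logger=lambda m: None)  # → 1, pipeline continues
```

The returned count is what you would persist per run so the error rate can be tracked over time.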

Data Contracts

A data contract is an explicit agreement between a data producer and a data consumer about what data will look like. It defines the schema, quality requirements, and SLA for data freshness.

{
  "contract_id": "orders-warehouse-v1",
  "producer": "orders-service",
  "consumer": "analytics-warehouse",
  "schema": {
    "order_id": { "type": "string", "required": true },
    "customer_id": { "type": "string", "required": true },
    "order_total": { "type": "number", "required": true, "min": 0 },
    "status": {
      "type": "string",
      "enum": ["pending", "shipped", "delivered", "cancelled"]
    }
  },
  "quality_requirements": {
    "null_rate_threshold": { "customer_id": 0.001 },
    "duplicate_rate_threshold": { "order_id": 0.0 }
  },
  "freshness_sla": {
    "max_latency_minutes": 30,
    "frequency": "continuous"
  }
}

When a producer violates the contract (sends data outside the schema or exceeds the null rate threshold), consumers are notified. This prevents silent data quality degradation.
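Contract enforcement against the schema block above can be sketched as a per-record check. This is a minimal illustration, not a full contract engine; a real setup would also enforce the quality thresholds and freshness SLA:

```python
def check_record(record: dict, schema: dict) -> list:
    """Return a list of contract violations for one record."""
    type_map = {"string": str, "number": (int, float)}
    violations = []
    for field_name, spec in schema.items():
        value = record.get(field_name)
        if value is None:
            if spec.get("required"):
                violations.append(f"{field_name}: missing required field")
            continue
        if not isinstance(value, type_map[spec["type"]]):
            violations.append(f"{field_name}: expected {spec['type']}")
            continue
        if "min" in spec and value < spec["min"]:
            violations.append(f"{field_name}: below minimum {spec['min']}")
        if "enum" in spec and value not in spec["enum"]:
            violations.append(f"{field_name}: not in allowed set")
    return violations

# Schema fragment mirroring the contract above
schema = {
    "order_id": {"type": "string", "required": True},
    "order_total": {"type": "number", "required": True, "min": 0},
    "status": {"type": "string", "enum": ["pending", "shipped", "delivered", "cancelled"]},
}
bad = check_record({"order_id": "o-1", "order_total": -5, "status": "lost"}, schema)
# Two violations: negative total, status outside the enum
```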

Anomaly Detection

Rule-based validation catches known failure modes. Anomaly detection catches unknown ones. Monitor statistical properties of your data and alert when something deviates significantly from baseline.

-- Simple anomaly detection: monitor distribution changes
SELECT
    date,
    AVG(order_total) AS avg_order_total,
    STDDEV(order_total) AS stddev_order_total,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY order_total) AS p99_order_total,
    COUNT(*) AS order_count
FROM fact_orders
WHERE date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY date
ORDER BY date;

-- Alert if today's stddev is 3x the 30-day average
-- This catches distribution shifts even when individual values are valid

More sophisticated anomaly detection uses machine learning models trained on historical data patterns. These models can catch subtle shifts that threshold-based alerts miss.
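The alert logic from the SQL comment can be sketched in Python as a rolling-baseline check. A minimal example, assuming you already have the daily values in hand (the data here is synthetic):

```python
import statistics

def deviates_from_baseline(history, today, z_threshold=3.0):
    """Flag today's value if it sits more than z_threshold standard
    deviations from the mean of the trailing history."""
    if len(history) < 2:
        return False
    mean = statistics.mean(history)
    std = statistics.stdev(history)
    if std == 0:
        return today != mean
    return abs(today - mean) / std > z_threshold

# 30 days of daily order counts hovering around 1,000, then a sudden spike:
daily_counts = [1000 + (i % 7) * 10 for i in range(30)]
deviates_from_baseline(daily_counts, today=980)   # → False: normal variation
deviates_from_baseline(daily_counts, today=5000)  # → True: distribution shift
```

The same shape works for any of the statistics in the query above (average, stddev, p99, row count); each one becomes its own monitored series.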

Data Quality Monitoring

Build quality metrics dashboards. Track these over time:

  • Null rate per column per table
  • Duplicate rate per primary key
  • Row count vs historical average
  • Freshness: time since last successful load
  • Validation failure rate per rule

# Example: Data quality metrics for Prometheus/Grafana
metrics:
  - name: pipeline_data_quality_null_rate
    query: |
      SELECT
        table_name,
        column_name,
        SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END)::float / COUNT(*) AS null_rate
      FROM column_stats
      GROUP BY table_name, column_name
    thresholds:
      warning: 0.01
      critical: 0.05

  - name: pipeline_freshness_minutes
    query: |
      SELECT
        table_name,
        EXTRACT(EPOCH FROM (now() - last_loaded_at)) / 60 AS minutes_since_load
      FROM pipeline_status
    thresholds:
      warning: 30
      critical: 60
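The freshness thresholds in that config can be sketched as a simple classification. Function and field names here are illustrative, assuming a last-load timestamp is tracked per table:

```python
from datetime import datetime, timedelta, timezone

def freshness_status(last_loaded_at, warning_min=30, critical_min=60, now=None):
    """Classify a table's freshness against warning/critical thresholds (minutes)."""
    now = now or datetime.now(timezone.utc)
    age_min = (now - last_loaded_at).total_seconds() / 60
    if age_min >= critical_min:
        return "critical"
    if age_min >= warning_min:
        return "warning"
    return "ok"

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
freshness_status(now - timedelta(minutes=10), now=now)  # → 'ok'
freshness_status(now - timedelta(minutes=45), now=now)  # → 'warning'
freshness_status(now - timedelta(minutes=90), now=now)  # → 'critical'
```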

For pipeline-level monitoring patterns, see Pipeline Orchestration.

Governance and Ownership

Data quality is a governance problem as much as a technical problem. Someone must own each dataset and be accountable for its quality.

Data stewards: Individuals responsible for specific datasets. They respond to quality issues, approve schema changes, and maintain documentation.

Quality SLAs: Define what “good enough” means. Not every dataset needs 100% completeness. Different use cases have different tolerance for bad data.

Incident process: When bad data reaches users, there must be a process for identifying the root cause, fixing the data, and preventing recurrence. This is a standard incident management process adapted for data.

When to Use Data Quality Checks

Use data quality checks when:

  • Pipelines process data that drives business decisions (revenue, metrics, reports)
  • Source data comes from external systems with unreliable schemas
  • Regulatory requirements mandate data validation (SOX, GDPR, HIPAA)
  • Multiple teams produce data that other teams consume (data mesh contexts)

Do not use comprehensive data quality checks when:

  • Prototyping or exploring data — overhead slows down initial iteration
  • One-off data migrations — validate manually instead
  • Pipeline latency is critical and data quality is already handled upstream

Production Failure Scenarios

Data quality failures in production tend to follow predictable patterns:

| Failure mode | Cause | Impact | Mitigation |
| --- | --- | --- | --- |
| Silent aggregation corruption | Test/dev data loaded into production alongside real data | Wrong revenue/metric calculations | Separate environment credentials, pipeline-level environment filtering |
| Stale data served as fresh | Pipeline runs late but downstream queries assume freshness | Executives see wrong numbers | Freshness SLA monitoring with alerts |
| Schema drift silent failure | Source adds a column, downstream breaks | Dashboard errors, pipeline failures | Schema validation at ingestion |
| Duplicate records inflate metrics | Upsert key collision or re-run without cleanup | Incorrect aggregate counts | Unique key monitoring, idempotent pipeline design |
| Null values misinterpreted | NULL treated as zero or empty string in aggregates | Subtly wrong totals | Enforce NOT NULL on critical columns |

The most dangerous failures are the silent ones — wrong numbers that look plausible and get acted on before anyone notices.

Trade-off Table: Validation Approaches

| Aspect | Schema-level | Statistical | Business logic |
| --- | --- | --- | --- |
| Coverage | Narrow: types and nulls | Medium: distributions | Wide: custom rules |
| Implementation cost | Low: native warehouse constraints | Medium: query templates | High: requires domain knowledge |
| Maintenance | Low: rarely changes | Medium: thresholds drift | High: business rules evolve |
| False positive rate | Low | Medium | High |
| Catches unknown issues | No | Sometimes | No |
| Best for | Early-stage pipelines | Mature pipelines with baseline data | Critical business metrics |

Schema-level validation is table stakes. Statistical validation catches what schema validation misses. Business logic validation is essential for metrics that drive decisions.

Common Pitfalls

Too many tests, not enough coverage: Teams add many individual column tests but miss the business logic errors that span multiple columns.

Validation at the wrong layer: Validation that only happens at load time misses issues introduced in transformations. Running dbt tests after each transformation model catches problems closer to where they are introduced.

No historical tracking: Without tracking quality metrics over time, you cannot tell if quality is improving or degrading. Build metrics persistence into your pipeline.
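A minimal sketch of metrics persistence, using SQLite as a stand-in for whatever metrics store your pipeline already has (table and metric names are illustrative):

```python
import sqlite3
from datetime import datetime, timezone

def record_quality_metric(conn, table_name, metric, value):
    """Append one quality metric observation so trends can be queried later."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS quality_metrics (
               recorded_at TEXT, table_name TEXT, metric TEXT, value REAL)"""
    )
    conn.execute(
        "INSERT INTO quality_metrics VALUES (?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), table_name, metric, value),
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
record_quality_metric(conn, "fact_orders", "null_rate.customer_id", 0.002)
record_quality_metric(conn, "fact_orders", "null_rate.customer_id", 0.004)

# Trend query: is the null rate degrading over time?
rows = conn.execute(
    "SELECT value FROM quality_metrics"
    " WHERE metric = 'null_rate.customer_id' ORDER BY recorded_at"
).fetchall()
```

Appending rather than overwriting is the point: each run adds a row, and the dashboard queries the series.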

Blaming the source: When data quality issues appear, teams often blame the source system. In practice, most quality issues are introduced in the pipeline itself.

Quick Recap

  • Bad data comes in six flavors: completeness, validity, consistency, uniqueness, timeliness, accuracy.
  • Integrate validation into the pipeline, not as a separate post-processing step.
  • Catch issues at extraction or staging when re-extraction is still possible.
  • Data contracts formalize the agreement between producers and consumers.
  • Track quality metrics over time. Without history, you cannot tell if quality is improving.
  • Silent failures — wrong numbers that look plausible — are the most dangerous.

Conclusion

Data quality is not a nice-to-have. Bad data leads to bad decisions, wasted engineering time, and lost trust in data systems. The practices are straightforward: define rules, validate automatically in the pipeline, monitor metrics over time, and have a process for responding to failures.

The hard part is discipline. Quality checks add overhead. They slow down pipeline development. They create alerts that someone must respond to. Skipping them seems efficient until a test order batch skews your revenue dashboard.

For related reading on pipeline reliability, see Pipeline Orchestration and Backfills.
