Data Quality: Validation, Monitoring, and Governance
Data quality determines whether pipeline outputs are trustworthy. Learn how to define rules, implement validation, and catch bad data before it reaches users.
A dashboard shows revenue down 30% this month. The analyst panics. An engineer starts debugging. Hours later, the answer is mundane: a pipeline extracted a batch of test orders from the staging environment and loaded them alongside real orders, skewing the aggregates. The data was wrong, not the business.
Bad data costs real time. Data quality is the practice of catching bad data before it reaches users, and when it does reach users, catching it fast enough to minimize damage.
Data Quality Flow Through Pipeline Stages
Data quality checks apply at each stage of the pipeline, with different concerns at each layer:
```mermaid
flowchart TD
    subgraph Extraction[Extraction Stage]
        E1[Source data ingestion]
        E2[Schema validation]
        E3[Type checks]
    end
    subgraph Staging[Staging Stage]
        S1[Row count validation]
        S2[Null rate checks]
        S3[Unique key validation]
    end
    subgraph Transform[Transformation Stage]
        T1[Business logic rules]
        T2[Cross-table consistency]
        T3[Anomaly detection]
    end
    subgraph Load[Load Stage]
        L1[Referential integrity]
        L2[Constraint validation]
        L3[Freshness check]
    end
    E1 --> E2 --> E3 --> S1 --> S2 --> S3 --> T1 --> T2 --> T3 --> L1 --> L2 --> L3
```
Catching issues early (extraction/staging) is cheaper than catching them late (load stage). Bad data caught at extraction can simply be re-extracted; bad data caught at load requires a backfill.
What Makes Data “Bad”
Data quality issues fall into several categories:
Completeness: Required fields are null or missing. A customer record without an email address is incomplete. A fact table row without a foreign key to the date dimension is broken.
Validity: Data is present but wrong. A date of “February 30th”. An email address that is not an email address. A percentage field with values over 100.
Consistency: The same entity has different values across systems. Customer name is “Alice Sharma” in the CRM and “Alice S.” in the billing system. These should be joined as the same customer but appear as two different ones.
Uniqueness: Duplicate records where duplicates should not exist. The same order appearing twice because a pipeline ran twice.
Timeliness: Data arrives late. A daily pipeline that usually runs at 2am has been running at 6am for three days because of an upstream issue. Reports that assume 2am data are stale.
Accuracy: Data does not reflect reality. A sensor reading of 999 degrees when the actual temperature was 73. A zip code that does not exist.
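Several of these categories can be spot-checked with a few lines of pandas. A minimal sketch — the DataFrame and its column names are illustrative, not from a real system:

```python
import pandas as pd

# Illustrative orders data (column names are assumptions for this example)
orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o2", "o3"],
    "email": ["a@example.com", None, "not-an-email", "b@example.com"],
    "discount_pct": [10, 105, 20, 0],
})

# Completeness: fraction of rows with a missing email
null_rate = orders["email"].isna().mean()

# Validity: percentage fields should never exceed 100
invalid_pct = (orders["discount_pct"] > 100).sum()

# Uniqueness: order IDs that repeat across rows
dup_count = orders["order_id"].duplicated().sum()

print(null_rate, invalid_pct, dup_count)
```

Checks like these are cheap enough to run on every batch; the harder categories — consistency across systems and accuracy against reality — need reference data to compare against.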
Data Quality Rules
Data quality rules define what “good” data looks like. Rules are specific, measurable assertions about data.
```python
# Example: Data quality rules in Python (Great Expectations style)
import great_expectations as ge

# Load data
df = ge.read_csv("customers.csv")

# Define expectations
df.expect_column_values_to_not_be_null("customer_id")
df.expect_column_values_to_be_unique("customer_id")
df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
df.expect_column_values_to_match_regex("email", r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$")
df.expect_column_values_to_be_in_set("status", ["active", "inactive"])  # allowed values are illustrative

# Run validation
results = df.validate()
if not results["success"]:
    print(f"Validation failed: {results['results']}")
```
Rules can be implemented at multiple levels:
Schema-level rules: Enforce data types, nullability, and primary key constraints at load time. Most data warehouses support these constraints natively.
Statistical rules: Monitor column statistics. If a column normally ranges from 0 to 100 and suddenly shows values in the thousands, something is wrong.
Business logic rules: Customer age should not exceed 120. Order total should equal sum of line items. These require explicit validation logic.
Pipeline-Level Validation
Integrate validation into the pipeline, not as a separate process.
```python
# Example: Validation step in a pipeline
from dataclasses import dataclass

import pandas as pd

@dataclass
class Violation:
    type: str
    count: int
    sample: pd.DataFrame

@dataclass
class ValidationResult:
    passed: bool
    violations: list
    row_count: int
    error_rate: float

def validate_orders(orders_df: pd.DataFrame) -> ValidationResult:
    """Run validation on orders data and return the result."""
    violations = []

    # Check for null customer IDs
    null_customers = orders_df[orders_df["customer_id"].isna()]
    if len(null_customers) > 0:
        violations.append(
            Violation(type="null_key", count=len(null_customers), sample=null_customers.head())
        )

    # Check for duplicate order IDs
    duplicates = orders_df[orders_df["order_id"].duplicated(keep=False)]
    if len(duplicates) > 0:
        violations.append(
            Violation(type="duplicate_key", count=len(duplicates), sample=duplicates.head())
        )

    # Check order total consistency: order_total must equal the sum of line items
    calc_total = orders_df["line_items"].apply(sum)
    inconsistent_totals = orders_df[orders_df["order_total"] != calc_total]
    if len(inconsistent_totals) > 0:
        violations.append(
            Violation(type="invalid_total", count=len(inconsistent_totals), sample=inconsistent_totals.head())
        )

    violating_rows = sum(v.count for v in violations)
    return ValidationResult(
        passed=len(violations) == 0,
        violations=violations,
        row_count=len(orders_df),
        error_rate=violating_rows / len(orders_df) if len(orders_df) > 0 else 0,
    )
```
A validation failure in a pipeline step should determine the pipeline outcome. Stop and alert on critical failures. Log and continue on non-critical ones, but track the error rate.
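One way to wire that decision into a pipeline step. The `CRITICAL_TYPES` set and the severity policy are illustrative assumptions, and the dataclasses are minimal stand-ins for the pipeline's result types:

```python
from dataclasses import dataclass, field

# Minimal stand-ins for the pipeline's result types (field names are
# assumptions for this sketch)
@dataclass
class Violation:
    type: str
    count: int
    sample: object = None

@dataclass
class ValidationResult:
    passed: bool
    violations: list = field(default_factory=list)

# Violation types severe enough to halt the pipeline (illustrative policy)
CRITICAL_TYPES = {"null_key", "duplicate_key"}

def handle_validation(result: ValidationResult) -> str:
    """Decide the pipeline outcome: halt on critical violations,
    warn-and-continue on the rest, ok when validation passed."""
    if result.passed:
        return "ok"
    if any(v.type in CRITICAL_TYPES for v in result.violations):
        # Stop and alert: broken keys must not reach the warehouse
        return "halt"
    for v in result.violations:
        # Log and continue, but keep the error visible for trend tracking
        print(f"WARN {v.type}: {v.count} rows affected")
    return "warn"
```

In a real orchestrator, "halt" would raise and page the on-call, while "warn" would emit a metric that feeds the error-rate tracking described above.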
Data Contracts
A data contract is an explicit agreement between a data producer and a data consumer about what data will look like. It defines the schema, quality requirements, and SLA for data freshness.
```json
{
  "contract_id": "orders-warehouse-v1",
  "producer": "orders-service",
  "consumer": "analytics-warehouse",
  "schema": {
    "order_id": { "type": "string", "required": true },
    "customer_id": { "type": "string", "required": true },
    "order_total": { "type": "number", "required": true, "min": 0 },
    "status": {
      "type": "string",
      "enum": ["pending", "shipped", "delivered", "cancelled"]
    }
  },
  "quality_requirements": {
    "null_rate_threshold": { "customer_id": 0.001 },
    "duplicate_rate_threshold": { "order_id": 0.0 }
  },
  "freshness_sla": {
    "max_latency_minutes": 30,
    "frequency": "continuous"
  }
}
```
When a producer violates the contract (sends data outside the schema or exceeds the null rate threshold), consumers are notified. This prevents silent data quality degradation.
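Enforcement can be sketched as a small checker that mirrors the contract's schema section. In practice a JSON Schema validator would do this work, but a hand-rolled version (field names copied from the contract above; the function name is an assumption) shows the idea:

```python
# Contract schema mirrored as Python rules (types map JSON "string"/"number"
# to Python types for this sketch)
CONTRACT_SCHEMA = {
    "order_id": {"type": str, "required": True},
    "customer_id": {"type": str, "required": True},
    "order_total": {"type": (int, float), "required": True, "min": 0},
    "status": {"type": str, "enum": ["pending", "shipped", "delivered", "cancelled"]},
}

def check_record(record: dict, schema: dict) -> list:
    """Return a list of contract violations for one record."""
    errors = []
    for name, rules in schema.items():
        value = record.get(name)
        if value is None:
            if rules.get("required"):
                errors.append(f"{name}: missing required field")
            continue
        if not isinstance(value, rules["type"]):
            errors.append(f"{name}: wrong type {type(value).__name__}")
            continue
        if "min" in rules and value < rules["min"]:
            errors.append(f"{name}: below minimum {rules['min']}")
        if "enum" in rules and value not in rules["enum"]:
            errors.append(f"{name}: not in allowed values")
    return errors

good = {"order_id": "o1", "customer_id": "c1", "order_total": 49.5, "status": "shipped"}
bad = {"order_id": "o2", "order_total": -5, "status": "lost"}
print(check_record(good, CONTRACT_SCHEMA))
print(check_record(bad, CONTRACT_SCHEMA))
```

Running the checker in the producer's CI, not just the consumer's pipeline, is what makes the contract an agreement rather than a complaint mechanism.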
Anomaly Detection
Rule-based validation catches known failure modes. Anomaly detection catches unknown ones. Monitor statistical properties of your data and alert when something deviates significantly from baseline.
```sql
-- Simple anomaly detection: monitor distribution changes
SELECT
    date,
    AVG(order_total) AS avg_order_total,
    STDDEV(order_total) AS stddev_order_total,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY order_total) AS p99_order_total,
    COUNT(*) AS order_count
FROM fact_orders
WHERE date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY date
ORDER BY date;

-- Alert if today's stddev is 3x the 30-day average
-- This catches distribution shifts even when individual values are valid
```
More sophisticated anomaly detection uses machine learning models trained on historical data patterns. These models can catch subtle shifts that threshold-based alerts miss.
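Before reaching for models, the deviation-from-baseline idea can be sketched in plain Python. The window contents, threshold, and function name below are illustrative:

```python
import statistics

def is_anomalous(history: list, today: float, sigmas: float = 3.0) -> bool:
    """Flag today's value if it deviates more than `sigmas` standard
    deviations from the baseline window (e.g. the last 30 daily values)."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # Flat baseline: any change at all is a deviation
        return today != mean
    return abs(today - mean) > sigmas * stdev

# Daily average order totals for the last week (illustrative numbers)
baseline = [52.1, 49.8, 51.3, 50.2, 48.9, 50.7, 51.0]
print(is_anomalous(baseline, 50.4))   # within the normal range
print(is_anomalous(baseline, 85.0))   # a distribution shift
```

Even this simple check catches the "valid values, wrong distribution" failures that rule-based validation passes silently.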
Data Quality Monitoring
Build quality metrics dashboards. Track these over time:
- Null rate per column per table
- Duplicate rate per primary key
- Row count vs historical average
- Freshness: time since last successful load
- Validation failure rate per rule
```yaml
# Example: Data quality metrics for Prometheus/Grafana
metrics:
  - name: pipeline_data_quality_null_rate
    query: |
      SELECT
        table_name,
        column_name,
        SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS null_rate
      FROM column_stats
      GROUP BY table_name, column_name
    thresholds:
      warning: 0.01
      critical: 0.05
  - name: pipeline_freshness_minutes
    query: |
      SELECT
        table_name,
        EXTRACT(EPOCH FROM (now() - last_loaded_at)) / 60 AS minutes_since_load
      FROM pipeline_status
    thresholds:
      warning: 30
      critical: 60
```
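Applying warning/critical thresholds like these is a small mapping from metric value to severity. A sketch, assuming higher values are worse (the function name is an assumption):

```python
def severity(value: float, warning: float, critical: float) -> str:
    """Map a metric value to a severity level given config thresholds,
    where a higher value means worse quality."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"

# Null-rate thresholds from the config above
print(severity(0.002, warning=0.01, critical=0.05))
print(severity(0.03, warning=0.01, critical=0.05))
print(severity(0.12, warning=0.01, critical=0.05))
```

The same evaluator works for freshness: a table 45 minutes stale against a 30/60 config is a warning, not yet a page.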
For pipeline-level monitoring patterns, see Pipeline Orchestration.
Governance and Ownership
Data quality is a governance problem as much as a technical problem. Someone must own each dataset and be accountable for its quality.
Data stewards: Individuals responsible for specific datasets. They respond to quality issues, approve schema changes, and maintain documentation.
Quality SLAs: Define what “good enough” means. Not every dataset needs 100% completeness. Different use cases have different tolerance for bad data.
Incident process: When bad data reaches users, there must be a process for identifying the root cause, fixing the data, and preventing recurrence. This is a standard incident management process adapted for data.
When to Use Data Quality Checks
Use data quality checks when:
- Pipelines process data that drives business decisions (revenue, metrics, reports)
- Source data comes from external systems with unreliable schemas
- Regulatory requirements mandate data validation (SOX, GDPR, HIPAA)
- Multiple teams produce data that other teams consume (data mesh contexts)
Do not use comprehensive data quality checks when:
- Prototyping or exploring data — overhead slows down initial iteration
- One-off data migrations — validate manually instead
- Pipeline latency is critical and data quality is already handled upstream
Production Failure Scenarios
Data quality failures in production tend to follow predictable patterns:
| Failure Mode | Cause | Impact | Mitigation |
|---|---|---|---|
| Silent aggregation corruption | Test/dev data loaded into production alongside real data | Wrong revenue/metric calculations | Separate environment credentials, pipeline-level environment filtering |
| Stale data served as fresh | Pipeline runs late but downstream queries assume freshness | Executives see wrong numbers | Freshness SLA monitoring with alerts |
| Schema drift silent failure | Source adds column, downstream breaks | Dashboard errors, pipeline failures | Schema validation at ingestion |
| Duplicate records inflate metrics | Upsert key collision or re-run without cleanup | Incorrect aggregate counts | Unique key monitoring, idempotent pipeline design |
| Null values misinterpreted | NULL treated as zero or empty string in aggregates | Subtle wrong totals | Enforce NOT NULL on critical columns |
The most dangerous failures are the silent ones — wrong numbers that look plausible and get acted on before anyone notices.
Trade-off Table: Validation Approaches
| Aspect | Schema-level | Statistical | Business logic |
|---|---|---|---|
| Coverage | Narrow — types and nulls | Medium — distributions | Wide — custom rules |
| Implementation cost | Low — native warehouse constraints | Medium — query templates | High — requires domain knowledge |
| Maintenance | Low — rarely changes | Medium — thresholds drift | High — business rules evolve |
| False positive rate | Low | Medium | High |
| Catches unknown issues | No | Sometimes | No |
| Best for | Early-stage pipelines | Mature pipelines with baseline data | Critical business metrics |
Schema-level validation is table stakes. Statistical validation catches what schema validation misses. Business logic validation is essential for metrics that drive decisions.
Common Pitfalls
Too many tests, not enough coverage: Teams add many individual column tests but miss the business logic errors that span multiple columns.
Validation at the wrong layer: Validation that only happens at load time misses issues introduced during transformations. dbt tests that run after each transformation model catch issues closer to where they are introduced.
No historical tracking: Without tracking quality metrics over time, you cannot tell if quality is improving or degrading. Build metrics persistence into your pipeline.
Blaming the source: When data quality issues appear, teams often blame the source system. In practice, most quality issues are introduced in the pipeline itself.
Quick Recap
- Bad data comes in six flavors: completeness, validity, consistency, uniqueness, timeliness, accuracy.
- Integrate validation into the pipeline, not as a separate post-processing step.
- Catch issues at extraction or staging when re-extraction is still possible.
- Data contracts formalize the agreement between producers and consumers.
- Track quality metrics over time. Without history, you cannot tell if quality is improving.
- Silent failures — wrong numbers that look plausible — are the most dangerous.
Conclusion
Data quality is not a nice-to-have. Bad data leads to bad decisions, wasted engineering time, and lost trust in data systems. The practices are straightforward: define rules, validate automatically in the pipeline, monitor metrics over time, and have a process for responding to failures.
The hard part is discipline. Quality checks add overhead. They slow down pipeline development. They create alerts that someone must respond to. Skipping them seems efficient until a test order batch skews your revenue dashboard.
For related reading on pipeline reliability, see Pipeline Orchestration and Backfills.