Data Quality: Validation, Monitoring, and Governance
Data quality determines whether pipeline outputs are trustworthy. Learn how to define rules, implement validation, and catch bad data before it reaches users.
A dashboard shows revenue down 30% this month. The analyst panics. An engineer starts debugging. Hours later, the answer is mundane: a pipeline extracted a batch of test orders from the staging environment and loaded them alongside real orders, skewing the aggregates. The data was wrong, not the business.
Bad data costs real time. Data quality is the practice of catching bad data before it reaches users, and when it does reach users, catching it fast enough to minimize damage.
Data Quality Flow Through Pipeline Stages
Data quality checks apply at each stage of the pipeline, with different concerns at each layer:
```mermaid
flowchart TD
    subgraph Extraction[Extraction Stage]
        E1[Source data ingestion]
        E2[Schema validation]
        E3[Type checks]
    end
    subgraph Staging[Staging Stage]
        S1[Row count validation]
        S2[Null rate checks]
        S3[Unique key validation]
    end
    subgraph Transform[Transformation Stage]
        T1[Business logic rules]
        T2[Cross-table consistency]
        T3[Anomaly detection]
    end
    subgraph Load[Load Stage]
        L1[Referential integrity]
        L2[Constraint validation]
        L3[Freshness check]
    end
    E1 --> E2 --> E3 --> S1 --> S2 --> S3 --> T1 --> T2 --> T3 --> L1 --> L2 --> L3
```
Catching issues early (extraction/staging) is cheaper than catching them late (load stage). Bad data caught at extraction can simply be re-extracted; bad data caught at load requires a backfill.
What Makes Data “Bad”
Data quality issues fall into several categories:
Completeness: Required fields are null or missing. A customer record without an email address is incomplete. A fact table row without a foreign key to the date dimension is broken.
Validity: Data is present but wrong. A date of “February 30th”. An email address that is not an email address. A percentage field with values over 100.
Consistency: The same entity has different values across systems. Customer name is “Alice Sharma” in the CRM and “Alice S.” in the billing system. These should be joined as the same customer but appear as two different ones.
Uniqueness: Duplicate records where duplicates should not exist. The same order appearing twice because a pipeline ran twice.
Timeliness: Data arrives late. A daily pipeline that usually runs at 2am has been running at 6am for three days because of an upstream issue. Reports that assume 2am data are stale.
Accuracy: Data does not reflect reality. A sensor reading of 999 degrees when the actual temperature was 73. A zip code that does not exist.
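Several of these categories can be spot-checked with a few lines of pandas. A minimal sketch — the DataFrame and its column names are illustrative, not from a real system:

```python
import pandas as pd

# Illustrative orders data (column names are assumptions for this example)
orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o2", "o3"],
    "email": ["a@example.com", None, "not-an-email", "b@example.com"],
    "discount_pct": [10, 105, 20, 0],
})

# Completeness: fraction of rows with a missing email
null_rate = orders["email"].isna().mean()

# Validity: percentage fields should never exceed 100
invalid_pct = (orders["discount_pct"] > 100).sum()

# Uniqueness: order IDs that repeat across rows
dup_count = orders["order_id"].duplicated().sum()

print(null_rate, invalid_pct, dup_count)
```

Checks like these are cheap enough to run on every batch; the harder categories — consistency across systems and accuracy against reality — need reference data to compare against.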
Data Quality Rules
Data quality rules define what “good” data looks like. Rules are specific, measurable assertions about data.
```python
# Example: Data quality rules in Python (Great Expectations style)
import great_expectations as ge

# Load data
df = ge.read_csv("customers.csv")

# Define expectations
df.expect_column_values_to_not_be_null("customer_id")
df.expect_column_values_to_be_unique("customer_id")
df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
df.expect_column_values_to_match_regex("email", r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$")
df.expect_column_values_to_be_in_set("status", ["active", "inactive"])  # allowed values are illustrative

# Run validation
results = df.validate()
if not results["success"]:
    print(f"Validation failed: {results['results']}")
```
Rules can be implemented at multiple levels:
Schema-level rules: Enforce data types, nullability, and primary key constraints at load time. Most data warehouses support these constraints natively.
Statistical rules: Monitor column statistics. If a column normally ranges from 0 to 100 and suddenly shows values in the thousands, something is wrong.
Business logic rules: Customer age should not exceed 120. Order total should equal sum of line items. These require explicit validation logic.
Pipeline-Level Validation
Integrate validation into the pipeline, not as a separate process.
```python
# Example: Validation step in a pipeline
from dataclasses import dataclass

import pandas as pd

@dataclass
class Violation:
    type: str
    count: int
    sample: pd.DataFrame

@dataclass
class ValidationResult:
    passed: bool
    violations: list
    row_count: int
    error_rate: float

def validate_orders(orders_df: pd.DataFrame) -> ValidationResult:
    """Run validation on orders data and return the result."""
    violations = []

    # Check for null customer IDs
    null_customers = orders_df[orders_df["customer_id"].isna()]
    if len(null_customers) > 0:
        violations.append(
            Violation(type="null_key", count=len(null_customers), sample=null_customers.head())
        )

    # Check for duplicate order IDs
    duplicates = orders_df[orders_df["order_id"].duplicated(keep=False)]
    if len(duplicates) > 0:
        violations.append(
            Violation(type="duplicate_key", count=len(duplicates), sample=duplicates.head())
        )

    # Check order total consistency: order_total must equal the sum of line items
    calc_total = orders_df["line_items"].apply(sum)
    inconsistent_totals = orders_df[orders_df["order_total"] != calc_total]
    if len(inconsistent_totals) > 0:
        violations.append(
            Violation(type="invalid_total", count=len(inconsistent_totals), sample=inconsistent_totals.head())
        )

    violating_rows = sum(v.count for v in violations)
    return ValidationResult(
        passed=len(violations) == 0,
        violations=violations,
        row_count=len(orders_df),
        error_rate=violating_rows / len(orders_df) if len(orders_df) > 0 else 0,
    )
```
A validation failure in a pipeline step should determine the pipeline outcome. Stop and alert on critical failures. Log and continue on non-critical ones, but track the error rate.
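One way to wire that decision into a pipeline step. The `CRITICAL_TYPES` set and the severity policy are illustrative assumptions, and the dataclasses are minimal stand-ins for the pipeline's result types:

```python
from dataclasses import dataclass, field

# Minimal stand-ins for the pipeline's result types (field names are
# assumptions for this sketch)
@dataclass
class Violation:
    type: str
    count: int
    sample: object = None

@dataclass
class ValidationResult:
    passed: bool
    violations: list = field(default_factory=list)

# Violation types severe enough to halt the pipeline (illustrative policy)
CRITICAL_TYPES = {"null_key", "duplicate_key"}

def handle_validation(result: ValidationResult) -> str:
    """Decide the pipeline outcome: halt on critical violations,
    warn-and-continue on the rest, ok when validation passed."""
    if result.passed:
        return "ok"
    if any(v.type in CRITICAL_TYPES for v in result.violations):
        # Stop and alert: broken keys must not reach the warehouse
        return "halt"
    for v in result.violations:
        # Log and continue, but keep the error visible for trend tracking
        print(f"WARN {v.type}: {v.count} rows affected")
    return "warn"
```

In a real orchestrator, "halt" would raise and page the on-call, while "warn" would emit a metric that feeds the error-rate tracking described above.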
Data Contracts
A data contract is an explicit agreement between a data producer and a data consumer about what data will look like. It defines the schema, quality requirements, and SLA for data freshness.
```json
{
  "contract_id": "orders-warehouse-v1",
  "producer": "orders-service",
  "consumer": "analytics-warehouse",
  "schema": {
    "order_id": { "type": "string", "required": true },
    "customer_id": { "type": "string", "required": true },
    "order_total": { "type": "number", "required": true, "min": 0 },
    "status": {
      "type": "string",
      "enum": ["pending", "shipped", "delivered", "cancelled"]
    }
  },
  "quality_requirements": {
    "null_rate_threshold": { "customer_id": 0.001 },
    "duplicate_rate_threshold": { "order_id": 0.0 }
  },
  "freshness_sla": {
    "max_latency_minutes": 30,
    "frequency": "continuous"
  }
}
```
When a producer violates the contract (sends data outside the schema or exceeds the null rate threshold), consumers are notified. This prevents silent data quality degradation.
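Enforcement can be sketched as a small checker that mirrors the contract's schema section. In practice a JSON Schema validator would do this work, but a hand-rolled version (field names copied from the contract above; the function name is an assumption) shows the idea:

```python
# Contract schema mirrored as Python rules (types map JSON "string"/"number"
# to Python types for this sketch)
CONTRACT_SCHEMA = {
    "order_id": {"type": str, "required": True},
    "customer_id": {"type": str, "required": True},
    "order_total": {"type": (int, float), "required": True, "min": 0},
    "status": {"type": str, "enum": ["pending", "shipped", "delivered", "cancelled"]},
}

def check_record(record: dict, schema: dict) -> list:
    """Return a list of contract violations for one record."""
    errors = []
    for name, rules in schema.items():
        value = record.get(name)
        if value is None:
            if rules.get("required"):
                errors.append(f"{name}: missing required field")
            continue
        if not isinstance(value, rules["type"]):
            errors.append(f"{name}: wrong type {type(value).__name__}")
            continue
        if "min" in rules and value < rules["min"]:
            errors.append(f"{name}: below minimum {rules['min']}")
        if "enum" in rules and value not in rules["enum"]:
            errors.append(f"{name}: not in allowed values")
    return errors

good = {"order_id": "o1", "customer_id": "c1", "order_total": 49.5, "status": "shipped"}
bad = {"order_id": "o2", "order_total": -5, "status": "lost"}
print(check_record(good, CONTRACT_SCHEMA))
print(check_record(bad, CONTRACT_SCHEMA))
```

Running the checker in the producer's CI, not just the consumer's pipeline, is what makes the contract an agreement rather than a complaint mechanism.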
Anomaly Detection
Rule-based validation catches known failure modes. Anomaly detection catches unknown ones. Monitor statistical properties of your data and alert when something deviates significantly from baseline.
```sql
-- Simple anomaly detection: monitor distribution changes
SELECT
    date,
    AVG(order_total) AS avg_order_total,
    STDDEV(order_total) AS stddev_order_total,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY order_total) AS p99_order_total,
    COUNT(*) AS order_count
FROM fact_orders
WHERE date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY date
ORDER BY date;

-- Alert if today's stddev is 3x the 30-day average
-- This catches distribution shifts even when individual values are valid
```
More sophisticated anomaly detection uses machine learning models trained on historical data patterns. These models can catch subtle shifts that threshold-based alerts miss.
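Before reaching for models, the deviation-from-baseline idea can be sketched in plain Python. The window contents, threshold, and function name below are illustrative:

```python
import statistics

def is_anomalous(history: list, today: float, sigmas: float = 3.0) -> bool:
    """Flag today's value if it deviates more than `sigmas` standard
    deviations from the baseline window (e.g. the last 30 daily values)."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # Flat baseline: any change at all is a deviation
        return today != mean
    return abs(today - mean) > sigmas * stdev

# Daily average order totals for the last week (illustrative numbers)
baseline = [52.1, 49.8, 51.3, 50.2, 48.9, 50.7, 51.0]
print(is_anomalous(baseline, 50.4))   # within the normal range
print(is_anomalous(baseline, 85.0))   # a distribution shift
```

Even this simple check catches the "valid values, wrong distribution" failures that rule-based validation passes silently.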
Data Quality Monitoring
Build quality metrics dashboards. Track these over time:
- Null rate per column per table
- Duplicate rate per primary key
- Row count vs historical average
- Freshness: time since last successful load
- Validation failure rate per rule
```yaml
# Example: Data quality metrics for Prometheus/Grafana
metrics:
  - name: pipeline_data_quality_null_rate
    query: |
      SELECT
        table_name,
        column_name,
        SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS null_rate
      FROM column_stats
      GROUP BY table_name, column_name
    thresholds:
      warning: 0.01
      critical: 0.05
  - name: pipeline_freshness_minutes
    query: |
      SELECT
        table_name,
        EXTRACT(EPOCH FROM (now() - last_loaded_at)) / 60 AS minutes_since_load
      FROM pipeline_status
    thresholds:
      warning: 30
      critical: 60
```
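Applying warning/critical thresholds like these is a small mapping from metric value to severity. A sketch, assuming higher values are worse (the function name is an assumption):

```python
def severity(value: float, warning: float, critical: float) -> str:
    """Map a metric value to a severity level given config thresholds,
    where a higher value means worse quality."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"

# Null-rate thresholds from the config above
print(severity(0.002, warning=0.01, critical=0.05))
print(severity(0.03, warning=0.01, critical=0.05))
print(severity(0.12, warning=0.01, critical=0.05))
```

The same evaluator works for freshness: a table 45 minutes stale against a 30/60 config is a warning, not yet a page.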
For pipeline-level monitoring patterns, see Pipeline Orchestration.
Governance and Ownership
Data quality is a governance problem as much as a technical problem. Someone must own each dataset and be accountable for its quality.
Data stewards: Individuals responsible for specific datasets. They respond to quality issues, approve schema changes, and maintain documentation.
Quality SLAs: Define what “good enough” means. Not every dataset needs 100% completeness. Different use cases have different tolerance for bad data.
Incident process: When bad data reaches users, there must be a process for identifying the root cause, fixing the data, and preventing recurrence. This is a standard incident management process adapted for data.
When to Use Data Quality Checks
Use data quality checks when:
- Pipelines process data that drives business decisions (revenue, metrics, reports)
- Source data comes from external systems with unreliable schemas
- Regulatory requirements mandate data validation (SOX, GDPR, HIPAA)
- Multiple teams produce data that other teams consume (data mesh contexts)
Do not use comprehensive data quality checks when:
- Prototyping or exploring data — overhead slows down initial iteration
- One-off data migrations — validate manually instead
- Pipeline latency is critical and data quality is already handled upstream
Production Failure Scenarios
Data quality failures in production tend to follow predictable patterns:
| Failure Mode | Cause | Impact | Mitigation |
|---|---|---|---|
| Silent aggregation corruption | Test/dev data loaded into production alongside real data | Wrong revenue/metric calculations | Separate environment credentials, pipeline-level environment filtering |
| Stale data served as fresh | Pipeline runs late but downstream queries assume freshness | Executives see wrong numbers | Freshness SLA monitoring with alerts |
| Schema drift silent failure | Source adds column, downstream breaks | Dashboard errors, pipeline failures | Schema validation at ingestion |
| Duplicate records inflate metrics | Upsert key collision or re-run without cleanup | Incorrect aggregate counts | Unique key monitoring, idempotent pipeline design |
| Null values misinterpreted | NULL treated as zero or empty string in aggregates | Subtle wrong totals | Enforce NOT NULL on critical columns |
The most dangerous failures are the silent ones — wrong numbers that look plausible and get acted on before anyone notices.
Trade-off Table: Validation Approaches
| Aspect | Schema-level | Statistical | Business logic |
|---|---|---|---|
| Coverage | Narrow — types and nulls | Medium — distributions | Wide — custom rules |
| Implementation cost | Low — native warehouse constraints | Medium — query templates | High — requires domain knowledge |
| Maintenance | Low — rarely changes | Medium — thresholds drift | High — business rules evolve |
| False positive rate | Low | Medium | High |
| Catches unknown issues | No | Sometimes | No |
| Best for | Early-stage pipelines | Mature pipelines with baseline data | Critical business metrics |
Schema-level validation is table stakes. Statistical validation catches what schema validation misses. Business logic validation is essential for metrics that drive decisions.
Common Pitfalls
Too many tests, not enough coverage: Teams add many individual column tests but miss the business logic errors that span multiple columns.
Validation at the wrong layer: Validation that only happens at load time misses issues introduced during transformations. dbt tests that run after each transformation model catch issues closer to where they are introduced.
No historical tracking: Without tracking quality metrics over time, you cannot tell if quality is improving or degrading. Build metrics persistence into your pipeline.
Blaming the source: When data quality issues appear, teams often blame the source system. In practice, most quality issues are introduced in the pipeline itself.
Quick Recap
- Bad data comes in six flavors: completeness, validity, consistency, uniqueness, timeliness, accuracy.
- Integrate validation into the pipeline, not as a separate post-processing step.
- Catch issues at extraction or staging when re-extraction is still possible.
- Data contracts formalize the agreement between producers and consumers.
- Track quality metrics over time. Without history, you cannot tell if quality is improving.
- Silent failures — wrong numbers that look plausible — are the most dangerous.
Conclusion
Data quality is not a nice-to-have. Bad data leads to bad decisions, wasted engineering time, and lost trust in data systems. The practices are straightforward: define rules, validate automatically in the pipeline, monitor metrics over time, and have a process for responding to failures.
The hard part is discipline. Quality checks add overhead. They slow down pipeline development. They create alerts that someone must respond to. Skipping them seems efficient until a test order batch skews your revenue dashboard.
For related reading on pipeline reliability, see Pipeline Orchestration and Backfills.