Schema Evolution: Managing Changing Data Structures

Schema evolution lets pipelines handle changing data structures without breaking consumers. Learn backward and forward compatibility strategies.



Your data pipeline ingests customer events. Six months in, the business decides to add a loyalty tier field to the event schema. The production pipeline has been running uninterrupted. Now you have old events without the loyalty_tier field and new events with it. Your queries should not break. Your dashboards should not error. Your consumers should not crash.

Schema evolution is the set of practices and technologies that let data structures change over time while maintaining compatibility between producers and consumers. It is one of the hardest operational problems in data engineering, and it is unavoidable in production systems.

The Core Problem

When data producers and consumers operate independently, schema changes create coupling. A producer changes its output schema. Consumers that expect the old schema break. The pipeline stops.

Schema evolution solves this by defining rules for how schemas can change while remaining compatible. There are two directions of compatibility:

  • Backward compatibility: New schema can read data written under the old schema. Consumers on the new schema can process old data.
  • Forward compatibility: Old schema can read data written under the new schema. Producers on the old schema can process new data.

Most teams aim for backward compatibility because consumers typically upgrade before producers.

Schema Evolution in Avro

Avro is the canonical example of schema evolution built into the format. Avro schemas can evolve according to specific rules.

Adding a field with a default value is backward compatible. The new field appears in the schema but old data does not contain it. Readers using the new schema see the default value for missing fields.

// Original schema
{
  "type": "record",
  "name": "Customer",
  "fields": [
    {"name": "customer_id", "type": "string"},
    {"name": "email", "type": "string"}
  ]
}

// Evolved schema (adds loyalty_tier)
{
  "type": "record",
  "name": "Customer",
  "fields": [
    {"name": "customer_id", "type": "string"},
    {"name": "email", "type": "string"},
    {"name": "loyalty_tier", "type": "string", "default": "standard"}
  ]
}

Removing a field is backward compatible: readers on the new schema simply skip the field when it appears in old data. Forward compatibility is the direction that needs care here: old readers still expect the removed field, so it must have carried a default in the old schema for them to fall back on when new data omits it.

Changing a field type is generally not compatible. A string cannot become an integer without potentially corrupting data or causing runtime errors.

Renaming a field is not directly backward compatible. Avro offers reader-schema aliases as a partial workaround, but tooling support varies; the safer path is to add a field with the new name, mark the old name as deprecated, and coordinate the cutover.

Schema Evolution in Parquet

Parquet supports schema evolution through column addition. You can add new columns to a Parquet dataset without rewriting existing files. New columns appear in new files, and queries that reference only old columns continue to work.

Parquet does not have the same rich schema evolution rules as Avro. Changing the type of an existing column or removing a column typically requires rewriting the affected files.

import pyarrow as pa
import pyarrow.parquet as pq

# Original schema
original_schema = pa.schema([
    ('customer_id', pa.string()),
    ('email', pa.string())
])

# Evolved schema adds loyalty_tier column
evolved_schema = pa.schema([
    ('customer_id', pa.string()),
    ('email', pa.string()),
    ('loyalty_tier', pa.string())
])

# New files use the evolved schema
new_table = pa.table({
    'customer_id': ['CUST-001', 'CUST-002'],
    'email': ['alice@example.com', 'bob@example.com'],
    'loyalty_tier': ['gold', 'standard']
}).cast(evolved_schema)

pq.write_table(new_table, 'customers_v2.parquet')

# Reading only the original columns still works - the new column is ignored
old_schema_table = pq.read_table('customers_v2.parquet',
                                 columns=['customer_id', 'email'])

Parquet’s approach is pragmatic: new columns are additions only. This is sufficient for many real-world scenarios where schema changes are primarily additions.

Schema Registry: Centralized Schema Management

In Kafka-based pipelines, schema evolution is managed through a Schema Registry. The registry stores schemas and enforces compatibility rules at the topic level.

flowchart LR
    Producer[Producer App] -->|1. Register schema| Registry[(Schema Registry)]
    Registry -->|2. Schema ID| Producer
    Producer -->|3. Encoded record with ID| Kafka[Kafka Topic]
    Kafka -->|4. Encoded record| Consumer[Consumer App]
    Consumer -->|5. Fetch schema by ID| Registry
    Registry -->|6. Schema definition| Consumer

The flow: Producer registers its schema and receives a schema ID. The producer embeds this ID in each record. Consumers fetch the schema by ID on first encounter and cache it locally. If a producer tries to register an incompatible schema, the registry rejects the write. The producer never sends data with a schema the consumer cannot interpret.

Confluent Schema Registry is the standard implementation for Kafka. You configure a compatibility mode per topic.

| Compatibility mode | Producer writes with | Consumer reads with | Use when |
| --- | --- | --- | --- |
| BACKWARD | Old schema | New schema | Consumers upgrade before producers (most common) |
| FORWARD | New schema | Old schema | Producers upgrade before consumers |
| FULL | Either schema | Either schema | Bidirectional migration window |
| NONE | Any schema | Any schema | No enforcement — not recommended |

Most teams use BACKWARD compatibility. Consumers can process old data without schema changes, giving them time to migrate while producers evolve independently.

// Register a new schema version with the Confluent Schema Registry Java client
String avroSchema = "{\"type\": \"record\", \"name\": \"Customer\", \"fields\": ["
    + "{\"name\": \"customer_id\", \"type\": \"string\"},"
    + "{\"name\": \"email\", \"type\": \"string\"},"
    + "{\"name\": \"loyalty_tier\", \"type\": \"string\", \"default\": \"standard\"}]}";

SchemaRegistryClient client =
    new CachedSchemaRegistryClient("http://schema-registry:8081", 100);
int schemaId = client.register("customer-events-value", new AvroSchema(avroSchema));

When a producer tries to write with a new schema, the registry validates it against the compatibility rules. If the schema is incompatible, the write is rejected. This prevents broken schemas from propagating through the pipeline.
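The rule the registry applies for BACKWARD mode can be approximated in a few lines: every field the new schema adds relative to the old one must carry a default. A toy checker, illustrative only and far simpler than the registry's actual resolution algorithm:

```python
def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """Approximate BACKWARD check: a reader on the new schema can read
    old data only if every field it adds has a default."""
    old_fields = {f['name'] for f in old_schema['fields']}
    return all(f['name'] in old_fields or 'default' in f
               for f in new_schema['fields'])

old = {'fields': [{'name': 'customer_id', 'type': 'string'},
                  {'name': 'email', 'type': 'string'}]}
with_default = {'fields': old['fields'] + [
    {'name': 'loyalty_tier', 'type': 'string', 'default': 'standard'}]}
without_default = {'fields': old['fields'] + [
    {'name': 'loyalty_tier', 'type': 'string'}]}

print(is_backward_compatible(old, with_default))     # True
print(is_backward_compatible(old, without_default))  # False
```

The second registration is the one a registry in BACKWARD mode would reject.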

For more on how this fits into event-driven systems, see Event-Driven Architecture and Apache Kafka.

Data Contracts: Schema as a Service

Schema evolution works best when producers and consumers agree on a contract. A data contract is an explicit agreement about the schema, including which changes are allowed and how they are communicated.

A data contract approach:

  1. Schema is registered in a central registry before production use
  2. Compatibility rules are enforced automatically
  3. Breaking changes require explicit negotiation and versioning
  4. Consumers subscribe to specific schema versions

Data contracts shift schema management from reactive (something breaks, then you fix it) to proactive (schema changes are reviewed before deployment).

Observability for Schema Evolution

Schema evolution problems usually surface at read time, long after the write happened. Catch drift before it takes down a consumer.

What to track:

  • Schema version mismatch rate: how often consumers fetch schema IDs they have not cached. High rates mean producers are publishing faster than consumers can keep up.
  • Rejected writes: Schema Registry counts registrations rejected for incompatibility. A spike signals a producer pushing breaking changes.
  • Reader lag by schema version: know which schema version each consumer is running. When versions drift far apart across consumers, the pipeline gets hard to reason about.
  • Field null rate on new columns: a new nullable field should gradually fill in. If it stays at 0% for weeks, the producer migration is stalled.

Schema Registry metrics to expose:

# Illustrative only: metric names and endpoints vary by implementation.
# Confluent Schema Registry exposes JMX metrics; Karapace (a
# Confluent-compatible registry) can export Prometheus-style metrics.
curl http://schema-registry:8081/metrics | grep schema

# The kinds of metrics to watch (names shown are illustrative):
# schema_registry.schema.version.count        — schemas registered per version
# schema_registry.schema.compatibility.result — compatibility check results
# schema_registry.schema.rejected.count       — rejected registrations

Without observability, you learn about schema problems the hard way — through consumer crashes in production.

Common Schema Evolution Pitfalls

Silent data loss

Changing a field type without proper validation corrupts data silently. A string “123” becomes integer 123. A string “abc” causes a runtime error. Validate data before schema changes reach production.

Implicit nullability

Adding a new required field to a consumer that expects all fields causes null pointer exceptions when old records arrive. Always add new fields as nullable or with defaults.

Schema drift

Producers and consumers evolve independently. Without central schema enforcement, producers may emit fields that consumers do not understand or expect. A schema registry prevents this drift.

Default value mismatches

A producer writes a default value for a new field that differs from what the consumer expects. Queries that count or aggregate the field produce inconsistent results depending on which schema version was used.

When to Use and When Not to Use

Schema evolution tools have distinct sweet spots:

Use Schema Registry when:

  • You run Kafka and need enforced compatibility at the pipeline level
  • Multiple producer and consumer teams need to coordinate schema changes
  • You want automatic rejection of breaking changes before they reach production

Do not use Schema Registry when:

  • Your pipeline is single-producer, single-consumer with tight coordination
  • Schema changes are infrequent and tightly controlled through other means
  • Operational overhead of managing a registry outweighs the compatibility guarantees

Use Avro when:

  • You need format-level schema evolution without a central registry
  • You run streaming pipelines with high-throughput write patterns
  • Schema changes are primarily additive (new fields with defaults)

Use Parquet when:

  • Schema changes are mostly column additions
  • You are working in a Spark or Hive environment
  • You do not need fine-grained compatibility rules

Use versioning as a fallback when:

  • Backward or forward compatibility cannot be maintained
  • Consumers and producers cannot coordinate on a compatibility schedule
  • You need to support multiple schema versions long-term

Versioning Strategies

When backward compatibility is not achievable, versioning provides a fallback. Rather than modifying a schema in place, create a new version.

File naming conventions for versioned schemas:

customers_v1.parquet
customers_v2.parquet

Topic naming conventions for versioned Kafka topics:

customer-events-v1
customer-events-v2

Versioning lets consumers migrate on their own timeline. They can consume from the old version until they are ready to upgrade, without blocking producers from evolving.
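During the migration window, a consumer subscribed to both topics can normalize v1 records to the v2 shape before processing. A hypothetical helper (the record shapes follow this article's example; real payloads will differ):

```python
def normalize_to_v2(record: dict) -> dict:
    """Upgrade a customer-events-v1 record to the v2 shape by filling
    the field v2 added with its documented default."""
    upgraded = dict(record)  # do not mutate the caller's record
    upgraded.setdefault('loyalty_tier', 'standard')
    return upgraded

v1_record = {'customer_id': 'CUST-001', 'email': 'alice@example.com'}
print(normalize_to_v2(v1_record)['loyalty_tier'])  # standard
```

Downstream logic then handles a single shape, regardless of which topic a record came from.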

Quick Recap

  • Backward compatibility: new schema readers process old data. Add fields with defaults.
  • Forward compatibility: old schema readers process new data. Remove fields after they have defaults.
  • Schema Registry rejects incompatible schemas at write time — broken data never reaches consumers.
  • Avro handles format-level schema evolution. Parquet handles column additions.
  • Data contracts shift schema management from reactive to proactive.
  • When compatibility cannot be maintained: version the schema. New topic or file path, consumers migrate on their own schedule.

Conclusion

Schema evolution is not optional in production data systems. Business requirements change, products evolve, and data structures follow. The question is whether your pipeline handles it gracefully or catastrophically.

Avro and Parquet provide format-level support for schema changes. Schema registries enforce compatibility at the pipeline level. Data contracts formalize the producer-consumer agreement. Together, they let your pipeline evolve without the kind of incidents that ruin weekends.

For more on how data pipelines manage the flow of changing data, see Extract-Transform-Load and Change Data Capture.
