Schema Evolution: Managing Changing Data Structures

Schema evolution lets pipelines handle changing data structures without breaking consumers. Learn backward and forward compatibility strategies.



Your data pipeline ingests customer events. Six months in, the business decides to add a loyalty tier field to the event schema. The production pipeline has been running uninterrupted. Now you have old events without the loyalty_tier field and new events with it. Your queries should not break. Your dashboards should not error. Your consumers should not crash.

Schema evolution is the set of practices and technologies that let data structures change over time while maintaining compatibility between producers and consumers. It is one of the hardest operational problems in data engineering, and it is unavoidable in production systems.

The Core Problem

When data producers and consumers operate independently, schema changes create coupling. A producer changes its output schema. Consumers that expect the old schema break. The pipeline stops.

Schema evolution solves this by defining rules for how schemas can change while remaining compatible. There are two directions of compatibility:

  • Backward compatibility: New schema can read data written under the old schema. Consumers on the new schema can process old data.
  • Forward compatibility: Old schema can read data written under the new schema. Producers on the old schema can process new data.

Most teams aim for backward compatibility because consumers typically upgrade before producers.

Schema Evolution in Avro

Avro is the canonical example of schema evolution built into the format. Avro schemas can evolve according to specific rules.

Adding a field with a default value is backward compatible. The new field appears in the schema but old data does not contain it. Readers using the new schema see the default value for missing fields.

// Original schema
{
  "type": "record",
  "name": "Customer",
  "fields": [
    {"name": "customer_id", "type": "string"},
    {"name": "email", "type": "string"}
  ]
}

// Evolved schema (adds loyalty_tier)
{
  "type": "record",
  "name": "Customer",
  "fields": [
    {"name": "customer_id", "type": "string"},
    {"name": "email", "type": "string"},
    {"name": "loyalty_tier", "type": "string", "default": "standard"}
  ]
}

Removing a field is backward compatible: readers on the new schema simply skip the field when it appears in old data. Forward compatibility is the direction that needs care here: old readers still expect the removed field, so it must have carried a default in the old schema for them to fall back on when new data omits it.

Changing a field type is generally not compatible. A string cannot become an integer without potentially corrupting data or causing runtime errors.

Renaming a field is not directly backward compatible. Avro offers reader-schema aliases as a partial workaround, but tooling support varies; the safer path is to add a field with the new name, mark the old name as deprecated, and coordinate the cutover.

Schema Evolution in Parquet

Parquet supports schema evolution through column addition. You can add new columns to a Parquet dataset without rewriting existing files. New columns appear in new files, and queries that reference only old columns continue to work.

Parquet does not have the same rich schema evolution rules as Avro. Changing the type of an existing column or removing a column typically requires rewriting the affected files.

import pyarrow as pa
import pyarrow.parquet as pq

# Original schema
original_schema = pa.schema([
    ('customer_id', pa.string()),
    ('email', pa.string())
])

# Evolved schema adds loyalty_tier column
evolved_schema = pa.schema([
    ('customer_id', pa.string()),
    ('email', pa.string()),
    ('loyalty_tier', pa.string())
])

# New files use the evolved schema
new_table = pa.table({
    'customer_id': ['CUST-001', 'CUST-002'],
    'email': ['alice@example.com', 'bob@example.com'],
    'loyalty_tier': ['gold', 'standard']
}).cast(evolved_schema)

pq.write_table(new_table, 'customers_v2.parquet')

# Reading only the original columns still works - the new column is ignored
old_schema_table = pq.read_table('customers_v2.parquet',
                                 columns=['customer_id', 'email'])

Parquet’s approach is pragmatic: new columns are additions only. This is sufficient for many real-world scenarios where schema changes are primarily additions.

Schema Registry: Centralized Schema Management

In Kafka-based pipelines, schema evolution is managed through a Schema Registry. The registry stores schemas and enforces compatibility rules at the topic level.

flowchart LR
    Producer[Producer App] -->|1. Register schema| Registry[(Schema Registry)]
    Registry -->|2. Schema ID| Producer
    Producer -->|3. Encoded record with ID| Kafka[Kafka Topic]
    Kafka -->|4. Encoded record| Consumer[Consumer App]
    Consumer -->|5. Fetch schema by ID| Registry
    Registry -->|6. Schema definition| Consumer

The flow: Producer registers its schema and receives a schema ID. The producer embeds this ID in each record. Consumers fetch the schema by ID on first encounter and cache it locally. If a producer tries to register an incompatible schema, the registry rejects the write. The producer never sends data with a schema the consumer cannot interpret.

Confluent Schema Registry is the standard implementation for Kafka. You configure a compatibility mode per topic.

| Compatibility mode | Producer writes with | Consumer reads with | Use when |
| --- | --- | --- | --- |
| BACKWARD | Old schema | New schema | Consumers upgrade before producers (most common) |
| FORWARD | New schema | Old schema | Producers upgrade before consumers |
| FULL | Either schema | Either schema | Bidirectional migration window |
| NONE | Any schema | Any schema | No enforcement — not recommended |

Most teams use BACKWARD compatibility. Consumers can process old data without schema changes, giving them time to migrate while producers evolve independently.

// Register a new schema version with the Confluent Schema Registry Java client
String avroSchema = "{\"type\": \"record\", \"name\": \"Customer\", \"fields\": ["
    + "{\"name\": \"customer_id\", \"type\": \"string\"},"
    + "{\"name\": \"email\", \"type\": \"string\"},"
    + "{\"name\": \"loyalty_tier\", \"type\": \"string\", \"default\": \"standard\"}]}";

SchemaRegistryClient client =
    new CachedSchemaRegistryClient("http://schema-registry:8081", 100);
int schemaId = client.register("customer-events-value", new AvroSchema(avroSchema));

When a producer tries to write with a new schema, the registry validates it against the compatibility rules. If the schema is incompatible, the write is rejected. This prevents broken schemas from propagating through the pipeline.
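The rule the registry applies for BACKWARD mode can be approximated in a few lines: every field the new schema adds relative to the old one must carry a default. A toy checker, illustrative only and far simpler than the registry's actual resolution algorithm:

```python
def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """Approximate BACKWARD check: a reader on the new schema can read
    old data only if every field it adds has a default."""
    old_fields = {f['name'] for f in old_schema['fields']}
    return all(f['name'] in old_fields or 'default' in f
               for f in new_schema['fields'])

old = {'fields': [{'name': 'customer_id', 'type': 'string'},
                  {'name': 'email', 'type': 'string'}]}
with_default = {'fields': old['fields'] + [
    {'name': 'loyalty_tier', 'type': 'string', 'default': 'standard'}]}
without_default = {'fields': old['fields'] + [
    {'name': 'loyalty_tier', 'type': 'string'}]}

print(is_backward_compatible(old, with_default))     # True
print(is_backward_compatible(old, without_default))  # False
```

The second registration is the one a registry in BACKWARD mode would reject.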

For more on how this fits into event-driven systems, see Event-Driven Architecture and Apache Kafka.

Data Contracts: Schema as a Service

Schema evolution works best when producers and consumers agree on a contract. A data contract is an explicit agreement about the schema, including which changes are allowed and how they are communicated.

A data contract approach:

  1. Schema is registered in a central registry before production use
  2. Compatibility rules are enforced automatically
  3. Breaking changes require explicit negotiation and versioning
  4. Consumers subscribe to specific schema versions

Data contracts shift schema management from reactive (something breaks, then you fix it) to proactive (schema changes are reviewed before deployment).

Observability for Schema Evolution

Schema evolution problems usually surface at read time, long after the write happened. Catch drift before it takes down a consumer.

What to track:

  • Schema version mismatch rate: how often consumers fetch schema IDs they have not cached. High rates mean producers are publishing faster than consumers can keep up.
  • Rejected writes: Schema Registry counts registrations rejected for incompatibility. A spike signals a producer pushing breaking changes.
  • Reader lag by schema version: know which schema version each consumer is running. When versions drift far apart across consumers, the pipeline gets hard to reason about.
  • Field null rate on new columns: a new nullable field should gradually fill in. If it stays at 0% for weeks, the producer migration is stalled.

Schema Registry metrics to expose:

# Illustrative only: metric names and endpoints vary by implementation.
# Confluent Schema Registry exposes JMX metrics; Karapace (a
# Confluent-compatible registry) can export Prometheus-style metrics.
curl http://schema-registry:8081/metrics | grep schema

# The kinds of metrics to watch (names shown are illustrative):
# schema_registry.schema.version.count        — schemas registered per version
# schema_registry.schema.compatibility.result — compatibility check results
# schema_registry.schema.rejected.count       — rejected registrations

Without observability, you learn about schema problems the hard way — through consumer crashes in production.

Common Schema Evolution Pitfalls

Silent data loss

Changing a field type without proper validation corrupts data silently. A string “123” becomes integer 123. A string “abc” causes a runtime error. Validate data before schema changes reach production.

Implicit nullability

Adding a new required field to a consumer that expects all fields causes null pointer exceptions when old records arrive. Always add new fields as nullable or with defaults.

Schema drift

Producers and consumers evolve independently. Without central schema enforcement, producers may emit fields that consumers do not understand or expect. A schema registry prevents this drift.

Default value mismatches

A producer writes a default value for a new field that differs from what the consumer expects. Queries that count or aggregate the field produce inconsistent results depending on which schema version was used.

When to Use and When Not to Use

Schema evolution tools have distinct sweet spots:

Use Schema Registry when:

  • You run Kafka and need enforced compatibility at the pipeline level
  • Multiple producer and consumer teams need to coordinate schema changes
  • You want automatic rejection of breaking changes before they reach production

Do not use Schema Registry when:

  • Your pipeline is single-producer, single-consumer with tight coordination
  • Schema changes are infrequent and tightly controlled through other means
  • Operational overhead of managing a registry outweighs the compatibility guarantees

Use Avro when:

  • You need format-level schema evolution without a central registry
  • You run streaming pipelines with high-throughput write patterns
  • Schema changes are primarily additive (new fields with defaults)

Use Parquet when:

  • Schema changes are mostly column additions
  • You are working in a Spark or Hive environment
  • You do not need fine-grained compatibility rules

Use versioning as a fallback when:

  • Backward or forward compatibility cannot be maintained
  • Consumers and producers cannot coordinate on a compatibility schedule
  • You need to support multiple schema versions long-term

Versioning Strategies

When backward compatibility is not achievable, versioning provides a fallback. Rather than modifying a schema in place, create a new version.

File naming conventions for versioned schemas:

customers_v1.parquet
customers_v2.parquet

Topic naming conventions for versioned Kafka topics:

customer-events-v1
customer-events-v2

Versioning lets consumers migrate on their own timeline. They can consume from the old version until they are ready to upgrade, without blocking producers from evolving.
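During the migration window, a consumer subscribed to both topics can normalize v1 records to the v2 shape before processing. A hypothetical helper (the record shapes follow this article's example; real payloads will differ):

```python
def normalize_to_v2(record: dict) -> dict:
    """Upgrade a customer-events-v1 record to the v2 shape by filling
    the field v2 added with its documented default."""
    upgraded = dict(record)  # do not mutate the caller's record
    upgraded.setdefault('loyalty_tier', 'standard')
    return upgraded

v1_record = {'customer_id': 'CUST-001', 'email': 'alice@example.com'}
print(normalize_to_v2(v1_record)['loyalty_tier'])  # standard
```

Downstream logic then handles a single shape, regardless of which topic a record came from.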

Quick Recap

  • Backward compatibility: new schema readers process old data. Add fields with defaults.
  • Forward compatibility: old schema readers process new data. Remove fields after they have defaults.
  • Schema Registry rejects incompatible schemas at write time — broken data never reaches consumers.
  • Avro handles format-level schema evolution. Parquet handles column additions.
  • Data contracts shift schema management from reactive to proactive.
  • When compatibility cannot be maintained: version the schema. New topic or file path, consumers migrate on their own schedule.

Conclusion

Schema evolution is not optional in production data systems. Business requirements change, products evolve, and data structures follow. The question is whether your pipeline handles it gracefully or catastrophically.

Avro and Parquet provide format-level support for schema changes. Schema registries enforce compatibility at the pipeline level. Data contracts formalize the producer-consumer agreement. Together, they let your pipeline evolve without the kind of incidents that ruin weekends.

For more on how data pipelines manage the flow of changing data, see Extract-Transform-Load and Change Data Capture.
