Data Formats: JSON, CSV, Parquet, Avro, and ORC Explained

Compare data formats — JSON, CSV, Parquet, Avro, and ORC — covering structure, compression, schema handling, and when to use each in pipelines.


Data formats are the language your data pipeline speaks. Move data from one system to another, and the format determines how much storage you burn, how fast you query, and whether downstream systems understand what they are reading. The wrong format for the wrong workload is a silent killer of performance.

This post covers the five most common formats in data engineering: JSON, CSV, Parquet, Avro, and ORC. We will look at structure, compression, schema handling, and when to use each.

JSON: Flexibility at a Cost

JSON (JavaScript Object Notation) is the dominant format for APIs and semi-structured data. It stores data as key-value pairs nested into objects and arrays. Human-readable, flexible, and everywhere.

{
  "order_id": "ORD-12345",
  "customer": {
    "id": "CUST-789",
    "name": "Priya Sharma",
    "email": "priya@example.com"
  },
  "items": [
    { "product": "wireless-headphones", "qty": 2, "price": 79.99 },
    { "product": "usb-c-cable", "qty": 1, "price": 12.99 }
  ],
  "total": 172.97,
  "timestamp": "2026-03-27T10:23:45Z"
}

JSON is self-describing: every field name appears with every record. This makes it easy to evolve records without breaking readers, but it adds significant overhead. A billion JSON records carry a billion copies of each field name, burning storage and slowing scans.

JSON works well for event payloads, configuration files, and data exchange between services. It is the wrong choice for analytical workloads scanning millions of rows.
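The field-name overhead is easy to measure. This stdlib-only sketch (the record shape is hypothetical) compares the serialized size of a JSON record against the bytes its values alone occupy:

```python
import json

# A hypothetical flat record: short values, comparatively long field names.
record = {"order_id": "ORD-12345", "customer_id": "CUST-789", "total": 172.97}

encoded = json.dumps(record)
# Bytes spent on the values alone, ignoring keys, quotes, and punctuation.
value_bytes = sum(len(str(v)) for v in record.values())

print(len(encoded), value_bytes)  # the keys and syntax dominate the payload
```

For records like this, well over half the serialized bytes are field names and syntax rather than data, and that ratio repeats for every record in the dataset.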

JSON trade-offs

| Aspect | Assessment |
| --- | --- |
| Readability | Excellent — human-readable |
| Schema evolution | Flexible — fields can be added without breaking readers |
| Storage efficiency | Poor — repeated field names |
| Query performance | Slow — requires parsing full record |
| Compression | Moderate — text formats compress reasonably |

CSV: Simple but Fragile

CSV (Comma-Separated Values) is the lowest common denominator for tabular data exchange. It is a plain text format with one record per line and commas delimiting fields.

order_id,customer_id,product,qty,price,total
ORD-12345,CUST-789,wireless-headphones,2,79.99,172.97
ORD-12346,CUST-456,laptop-stand,1,45.00,45.00

The simplicity is the appeal. Every tool in existence reads CSV. The problems are equally fundamental: no schema definition, no type preservation, no handling of delimiters inside fields without quoting rules, and no efficient columnar access.

CSV has no standard way to represent dates versus strings versus numbers. A column of dates arrives as text that downstream systems must parse. Null values appear as empty fields or magic strings like “NULL” or “N/A” that nobody standardized.

CSV works for one-time data transfers and flat tabular exports. It is a poor choice for production pipelines that care about types, schema evolution, or query performance.
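The type-loss problem shows up in a simple round trip. This stdlib sketch (hypothetical row shape) writes typed values to CSV and reads them back; every value returns as a string, and the null becomes an empty field:

```python
import csv
import io

# Typed values going in: an int, a float, and a null.
rows = [{"order_id": "ORD-12345", "qty": 2, "price": 79.99, "shipped": None}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=rows[0])
writer.writeheader()
writer.writerows(rows)

# Reading back: everything is a string, None has become "".
buf.seek(0)
parsed = next(csv.DictReader(buf))
print(parsed)  # {'order_id': 'ORD-12345', 'qty': '2', 'price': '79.99', 'shipped': ''}
```

Every downstream consumer must re-derive the types, and each one may guess differently.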

Parquet: Columnar for Analytical Workloads

Parquet is a columnar storage format built for analytical queries. It stores data column-by-column rather than row-by-row, which means scanning a specific column reads only that column’s data from disk.

Parquet was designed for the Hadoop ecosystem and is now the default format for modern data lakes. It supports complex nested data structures similar to JSON, but organizes them in columnar fashion.

import pyarrow as pa
import pyarrow.parquet as pq

# Write a Parquet file
table = pa.table({
    'order_id': ['ORD-12345', 'ORD-12346'],
    'customer_id': ['CUST-789', 'CUST-456'],
    'total': [172.97, 45.00]
})
pq.write_table(table, 'orders.parquet')

# Read only specific columns (column pruning)
table = pq.read_table('orders.parquet', columns=['order_id', 'total'])

Parquet files embed a schema in the file footer and encode data using techniques like dictionary encoding and run-length encoding. A column of repeated category values compresses dramatically because the same strings appear contiguously.
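Run-length encoding is simple enough to illustrate outside Parquet itself. This toy sketch (stdlib-only, not Parquet's actual encoder) shows why a contiguous, low-cardinality column shrinks so well:

```python
from itertools import groupby

def run_length_encode(column):
    """Collapse consecutive repeated values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]

# A low-cardinality column as it might sit contiguously in a column chunk.
statuses = ["shipped"] * 4 + ["pending"] * 3 + ["shipped"] * 2
print(run_length_encode(statuses))  # [('shipped', 4), ('pending', 3), ('shipped', 2)]
```

Nine values collapse to three pairs; sorting or partitioning the data before writing lengthens the runs and improves the ratio further.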

Typical compression ratios for Parquet are 5:1 to 10:1 compared to raw JSON. For analytical workloads that aggregate across columns, Parquet is the standard choice.

For more on how Parquet fits into data lakes versus data warehouses, see Data Warehousing.

Avro: Row-Based with Schema Evolution

Avro is a row-based binary format designed for serialization and data exchange, particularly in Hadoop and Kafka ecosystems. Unlike Parquet, Avro stores data row by row, which makes it better for write-heavy workloads and record-level access.

Avro’s distinguishing feature is how it handles the schema. The schema travels separately from the data: it is written once in the file header, or stored in a Schema Registry for message streams. Records do not carry field names with every row. Instead, data is encoded against the schema, and the schema provides the context for decoding.

{
  "type": "record",
  "name": "Order",
  "fields": [
    { "name": "order_id", "type": "string" },
    { "name": "customer_id", "type": "string" },
    { "name": "total", "type": "double" },
    {
      "name": "timestamp",
      "type": { "type": "long", "logicalType": "timestamp-millis" }
    }
  ]
}

Avro supports schema evolution. You can add a new field with a default value, or remove a field, without breaking readers operating on older schemas. This makes Avro well-suited for streaming pipelines where schemas evolve over time.

Avro uses binary encoding, which is compact and fast to serialize and deserialize. For Kafka producers and consumers that need to exchange records with evolving schemas, Avro with Confluent Schema Registry is a common pattern.
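Avro's schema resolution rules can be sketched without the library itself. The helper below is a hypothetical, stdlib-only illustration (not Avro's implementation) of how a reader schema with a default fills in a field that is missing from records written under an older schema:

```python
def resolve_record(record, reader_fields):
    """Mimic Avro-style schema resolution: keep known fields, fill defaults.

    reader_fields maps field name -> default value (None marks a required field).
    """
    resolved = {}
    for name, default in reader_fields.items():
        if name in record:
            resolved[name] = record[name]
        elif default is not None:
            resolved[name] = default
        else:
            raise ValueError(f"required field missing: {name}")
    return resolved

# The old writer schema had no 'currency'; the reader schema adds it with a default.
old_record = {"order_id": "ORD-12345", "total": 172.97}
reader_fields = {"order_id": None, "total": None, "currency": "USD"}
new_record = resolve_record(old_record, reader_fields)
print(new_record)  # {'order_id': 'ORD-12345', 'total': 172.97, 'currency': 'USD'}
```

Real Avro performs this resolution from the writer and reader schemas at decode time, which is what lets old consumers and new producers coexist.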

ORC: Hive’s Columnar Format

ORC (Optimized Row Columnar) is Hive’s columnar format, similar in concept to Parquet but optimized for Hive’s workload patterns. ORC was developed to address limitations in Hive’s original RCFile format.

# Inspect ORC file metadata, stripes, and statistics
hive --orcfiledump orders.orc

ORC includes features like built-in indexes (bloom filters and statistics), ACID support for Hive transactions, and native support for complex types. It stores data in a stripe-based columnar layout, where each stripe contains column data for a set of rows.

| Feature | Parquet | ORC |
| --- | --- | --- |
| Hive ACID support | No | Yes |
| Built-in indexes | Limited | Yes (bloom filters, min/max) |
| Timestamp precision | Nanoseconds | Seconds |
| Nested structure support | Complex types | More explicit |
| Primary ecosystem | Spark, Drill, Presto | Hive |

ORC makes sense when you are running Hive or Spark with heavy analytical aggregations and need the built-in index features for predicate pushdown. Parquet is more broadly supported across query engines and is the safer default for general-purpose data lake storage.

Choosing a Format

The right format depends on your workload pattern.

| Workload | Recommended Format |
| --- | --- |
| APIs and web services | JSON |
| One-time data exchange | CSV |
| Analytical queries (read-heavy) | Parquet |
| Streaming with schema evolution | Avro |
| Hive/Spark with aggregations | ORC |

For most modern data pipelines landing data in a data lake, Parquet is the default choice. For Kafka message serialization with evolving schemas, Avro with Schema Registry is the common pattern. CSV and JSON persist in data exchange scenarios where their trade-offs are acceptable.

Data Format Selection Flow

Choosing the right format is a decision tree. This diagram maps the selection logic:

flowchart TD
    Start[New format decision] --> Workload{Workload type?}
    Workload -->|APIs / Web| JSON
    Workload -->|One-time export| CSV
    Workload -->|Read-heavy analytics| Analytics{Column projections?}
    Analytics -->|Wide reads / aggregations| Parquet
    Analytics -->|Hive / Spark aggregations| ORC
    Workload -->|Streaming with schema changes| Avro
    Workload -->|General-purpose lake| Parquet

Start at the top and follow the path that matches your workload. If you land on multiple candidates, use the comparison table in the “Choosing a Format” section to break the tie.

Common Anti-Patterns

Using JSON for analytical workloads

JSON’s self-describing format carries every field name with every record. A billion records means a billion copies of each field name. Storage costs scale with record count, not data volume. Column scans pay the price. If you are building aggregations, reports, or OLAP queries, JSON is the wrong starting point.

Using CSV for typed data

CSV has no type system. Dates arrive as strings. Numbers arrive as text. Nulls appear as “NULL”, “None”, “N/A”, or empty strings depending on what the exporting tool decided that day. Downstream systems spend more code parsing CSV quirks than doing actual analysis. Use CSV only for one-time, low-stakes data exchanges.

Choosing Parquet for write-heavy workloads

Parquet’s columnar layout is built for reads, not writes. Writing Parquet buffers rows into column pages before persisting them. For high-frequency writes or record-level access, Avro’s row-based layout is a better fit.

Capacity Estimation

Storage planning comes down to format, compression ratios, and data shape. Benchmarks to anchor your estimates:

| Format | Uncompressed vs JSON | Typical Compression | When to Re-estimate |
| --- | --- | --- | --- |
| JSON | 1x (baseline) | 3-5x (gzip) | Schema changes add fields |
| CSV | 0.9-1.1x vs JSON | 3-5x (gzip) | New delimiters or quoting |
| Parquet | 1.2-1.5x vs JSON | 5-10x (Snappy/Gzip) | After column cardinality changes |
| Avro | 1.1-1.3x vs JSON | 5-8x (deflate) | After schema changes |
| ORC | 1.2-1.4x vs JSON | 5-10x (zlib) | After Hive query patterns change |

For a dataset of 1 billion records with 50 columns averaging 20 bytes each, raw JSON is approximately 1 TB. Parquet with Snappy compression lands around 100-150 GB. Calculate your compression ratio as: uncompressed_size / parquet_size = compression_ratio.
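The arithmetic above can be wrapped in a small helper. The 7x ratio here is an assumed mid-range Parquet figure from the table, not a measurement:

```python
def estimate_storage_gb(records, columns, avg_bytes_per_value, compression_ratio=7.0):
    """Rough storage estimate: raw JSON-scale size and compressed columnar size.

    compression_ratio is an assumed mid-range Parquet ratio (the 5-10x band).
    """
    raw_bytes = records * columns * avg_bytes_per_value
    raw_gb = raw_bytes / 1e9
    return raw_gb, raw_gb / compression_ratio

# The worked example from the text: 1B records, 50 columns, ~20 bytes per value.
raw_gb, parquet_gb = estimate_storage_gb(1_000_000_000, 50, 20)
print(f"raw ~{raw_gb:.0f} GB, Parquet ~{parquet_gb:.0f} GB")  # raw ~1000 GB, Parquet ~143 GB
```

Re-run the estimate whenever the record count, column width, or cardinality profile shifts; the ratio is the least stable input.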

For related reading on how data formats interact with storage systems, see Object Storage and Database Replication.

Conclusion

Data formats are not interchangeable. CSV is simple but fragile. JSON is flexible but expensive to store and query. Parquet gives you columnar efficiency for analytical reads. Avro gives you compact binary encoding with schema evolution. ORC adds Hive-specific features like ACID and indexes.

Choose based on how your pipeline reads data. Row-based formats like Avro and CSV favor write throughput. Columnar formats like Parquet and ORC favor read-heavy analytical queries. The format is not a detail — it determines the performance and cost profile of everything downstream.

Quick Recap

  • JSON: APIs and event payloads. Not for analytical workloads — field names bloat storage and slow scans.
  • CSV: Universal compatibility. No type safety, no schema, no column pruning. One-time exports only.
  • Parquet: Default for data lakes. 5-10x compression over JSON, fast aggregations on column reads.
  • Avro: Row-based binary with schema evolution. Kafka and streaming pipelines with changing schemas.
  • ORC: Hive ACID transactions, built-in bloom filters. Use when Parquet’s ecosystem support is not enough.
  • Match format to read/write pattern — this is the decision that haunts pipelines at scale.

Related Posts

Schema Evolution: Managing Changing Data Structures

Schema evolution lets pipelines handle changing data structures without breaking consumers. Learn backward and forward compatibility strategies.

#data-engineering #schema-evolution #avro

Schema Registry: Enforcing Data Contracts

Learn how Schema Registry prevents data incompatibilities in distributed systems, supports schema evolution, and enables reliable streaming pipelines.

#data-engineering #schema-registry #avro

Alerting in Production: Building Alerts That Matter

Build alerting systems that catch real problems without fatigue. Learn alert design principles, severity levels, runbooks, and on-call best practices.

#data-engineering #alerting #monitoring