DuckDB: The SQLite for Analytical Workloads
Explore DuckDB, the in-process analytical database that runs anywhere, handles columnar storage efficiently, and brings analytics to where your data lives.
SQLite became ubiquitous because it solved a real problem: a zero-configuration, serverless, embeddable database that runs inside your application. Every phone, every browser, every desktop application that needed structured data used SQLite. For transactional workloads—inserts, updates, point queries—SQLite delivered exactly what it promised.
Analytical workloads needed something different. When you’re aggregating millions of rows, SQLite’s row-based storage scans the entire table. Its single-writer architecture limits concurrency. Running analytical queries on SQLite is painful.
DuckDB is the answer to the question SQLite could not answer: what if you took the simplicity and portability of SQLite and optimized it for analytical queries? DuckDB is in-process, serverless, runs anywhere, and screams at analytical workloads. No infrastructure to manage. No separate service to run. Just a library that processes data at the speed of memory.
The Basics: Installation and First Queries
DuckDB installs in seconds and runs immediately. Single binary, no dependencies, no configuration.
# Install with pip, conda, or download the binary
pip install duckdb
# Or download the CLI and run it directly (unzip cannot read from a pipe)
curl -LO https://github.com/duckdb/duckdb/releases/download/v1.1.0/duckdb_cli-osx-universal.zip
unzip duckdb_cli-osx-universal.zip
./duckdb
From Python, R, or any language with DuckDB bindings, you connect with a single line:
import duckdb
# Connect to a database file (omit the argument for an in-memory database)
con = duckdb.connect('analytics.db')
# Run a query
result = con.execute("""
SELECT
category,
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS revenue,
COUNT(DISTINCT customer_id) AS unique_customers
FROM 'sales.parquet'
GROUP BY 1, 2
ORDER BY 2, 3 DESC
""").fetchdf()
print(result.head())
The result comes back as a pandas DataFrame, ready for analysis or visualization. You can also query directly from pandas DataFrames without loading data into DuckDB first.
Columnar Storage: Why DuckDB Is Fast
DuckDB uses columnar storage internally. Instead of storing rows contiguously on disk (row store), each column is stored separately. When you query SELECT category, SUM(amount), DuckDB reads only the category and amount columns, skipping everything else.
This is the same storage format that Snowflake, BigQuery, and Redshift use. The difference is that DuckDB runs in your process rather than on a remote cluster.
# DuckDB can query remote data without downloading it
con.execute("""
SELECT
*
FROM read_parquet('s3://warehouse/gold/fact_orders/**/*.parquet')
WHERE order_date >= DATE '2025-01-01'
""")
# Query a pandas DataFrame directly
import pandas as pd
df = pd.read_csv('large_file.csv')
con.execute("""
SELECT category, COUNT(*), AVG(amount)
FROM df
GROUP BY category
""").fetchdf()
Vectorized execution processes columns in batches (2048 rows at a time by default) using SIMD instructions. A single 256-bit AVX2 instruction operates on 32 bytes of data at once, and AVX-512 doubles that to 64 bytes. A row-at-a-time executor wastes most of this bandwidth; vectorized execution keeps the CPU's arithmetic units saturated.
Zero-Copy Integration with Pandas
One of DuckDB's killer features is zero-copy reading from pandas DataFrames and writing results back. DuckDB doesn't copy data when you query a pandas DataFrame; it reads the underlying NumPy arrays in place.
import numpy as np
import pandas as pd
import duckdb
# Create a large DataFrame
df = pd.DataFrame({
'user_id': range(10_000_000),
'category': np.random.choice(['A', 'B', 'C', 'D'], 10_000_000),
'amount': np.random.uniform(0, 1000, 10_000_000),
'date': pd.date_range('2020-01-01', periods=10_000_000, freq='min')
})
# Query it directly with DuckDB (zero copy)
result = duckdb.sql("""
SELECT
category,
DATE_TRUNC('day', date) AS day,
SUM(amount) AS total_amount,
COUNT(*) AS transaction_count
FROM df
WHERE date >= '2025-01-01'
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 20
""").df()
Large-scale aggregations that would take minutes in pandas complete in seconds.
Reading Directly from Files and Remote Sources
DuckDB can read from Parquet, CSV, JSON, and even remote sources like S3 without loading into a table first. The read_parquet() and read_csv() functions query data in place.
-- Query Parquet files on S3 without loading
SELECT
supplier_name,
SUM(line_item_value) AS total_value,
COUNT(DISTINCT order_id) AS order_count
FROM read_parquet('s3://warehouse/gold/purchase_orders/*.parquet',
hive_partitioning = true)
WHERE order_date >= DATE '2025-01-01'
AND supplier_region = 'EMEA'
GROUP BY supplier_name
ORDER BY total_value DESC
LIMIT 20;
-- Union multiple CSV files
SELECT * FROM read_csv_auto('data/2025-01-*.csv')
UNION ALL
SELECT * FROM read_csv_auto('data/2025-02-*.csv')
UNION ALL
SELECT * FROM read_csv_auto('data/2025-03-*.csv');
This capability turns DuckDB into a polyglot query engine. You can join a local Parquet file with a remote S3 Parquet file and a pandas DataFrame in the same query, and DuckDB handles the execution planning.
Window Functions and Advanced SQL
DuckDB supports the full SQL standard including sophisticated window functions, common table expressions, and complex joins:
WITH weekly_metrics AS (
    SELECT
        DATE_TRUNC('week', order_date) AS week,
        customer_id,
        SUM(amount) AS weekly_revenue,
        AVG(amount) AS avg_order_value,
        COUNT(*) AS order_count,
        -- Running total per customer
        SUM(SUM(amount)) OVER (
            PARTITION BY customer_id
            ORDER BY DATE_TRUNC('week', order_date)
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS cumulative_revenue,
        -- Rank within week
        RANK() OVER (
            PARTITION BY DATE_TRUNC('week', order_date)
            ORDER BY SUM(amount) DESC
        ) AS weekly_rank
    FROM orders
    WHERE order_date >= CURRENT_DATE - INTERVAL '12 months'
    GROUP BY 1, 2
)
SELECT *
FROM weekly_metrics
WHERE weekly_rank <= 10
ORDER BY week, weekly_rank;
Window functions are essential for analytical workloads—moving averages, running totals, ranking, lag/lead comparisons. DuckDB implements them correctly and efficiently.
Extensions: PostgreSQL, Iceberg, and More
DuckDB ships with an extension framework that adds capabilities beyond the core engine.
The postgres scanner lets you query a live PostgreSQL database without ETL. You can join PostgreSQL tables with local data, push down predicates to PostgreSQL to minimize data transfer, and use DuckDB’s analytical engine for aggregations.
# Load the postgres scanner, then query PostgreSQL alongside local data
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("""
CREATE TABLE local_lookup AS
SELECT * FROM postgres_scan('host=db port=5432 dbname=warehouse',
'public', 'product_categories');
""")
# Join PostgreSQL source with local table
result = con.execute("""
SELECT
p.category,
p.brand,
l.lookup_value,
COUNT(*) AS event_count
FROM read_parquet('s3://events/2025/**/*.parquet') e
JOIN local_lookup l ON e.product_id = l.product_id
JOIN postgres_scan('host=db', 'public', 'products') p
ON e.product_id = p.product_id
GROUP BY 1, 2, 3
""").df()
The Iceberg extension reads Apache Iceberg tables directly, enabling DuckDB to query lakehouse data without a dedicated query engine.
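A minimal sketch of what that looks like, assuming the extension is available in your environment; the table path is a placeholder:

```sql
-- Install and load the extension once per environment
INSTALL iceberg;
LOAD iceberg;

-- Scan an Iceberg table by its location (path is illustrative)
SELECT *
FROM iceberg_scan('s3://lake/warehouse/db/events')
LIMIT 10;
```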
When DuckDB Is the Right Tool
DuckDB excels at analytical workloads on a single machine. If you are processing data that fits on one server, DuckDB is simpler than a Spark cluster and faster than pandas for large datasets. Data scientists exploring large CSV exports, analysts running complex SQL on local Parquet files, applications that need embedded analytics—these are DuckDB’s sweet spots.
DuckDB is not a replacement for distributed query engines when you need to process terabytes across multiple machines. It is not a transactional database (use SQLite for that). It does not have a distributed filesystem or cluster coordination built in.
DuckDB in the Analytical Toolchain
Understanding where DuckDB fits relative to other tools helps you make better architectural decisions.
flowchart LR
subgraph "Single-Machine Tools"
DuckDB[DuckDB]
SQLite[SQLite]
Pandas[Pandas]
end
subgraph "Distributed Engines"
Spark[Apache Spark]
Snowflake[Snowflake]
BigQuery[BigQuery]
Redshift[Amazon Redshift]
end
Pandas -->|too slow for >10M rows| DuckDB
SQLite -->|no columnar, no analytical perf| DuckDB
DuckDB -->|need cluster scale| Spark
DuckDB -->|need cloud warehouse| Snowflake
Spark -->|small data, local perf| DuckDB
DuckDB sits between pandas (single-machine, row-based) and Spark (cluster, distributed). It wins when your data fits on one machine but exceeds what pandas can handle efficiently.
When to Use DuckDB
Use DuckDB for:
- Datasets that fit in RAM but are too large for pandas to handle at speed (typically >5 million rows)
- SQL analytical queries on local Parquet or CSV files
- Embedded analytics inside applications without running a database server
- Querying across multiple remote sources (S3, PostgreSQL, Iceberg) without ETL
- Faster aggregations than pandas during exploratory data analysis
Avoid DuckDB when:
- Your data needs distributed processing across multiple machines
- You require concurrent write transactions (reach for PostgreSQL or SQLite instead)
- You need a client-server model with simultaneous connections
- You are processing streaming data in real time (Flink, Spark Streaming, and Kafka Streams are built for this)
Production Failure Scenarios
DuckDB runs in your application process. When it runs out of memory or hits a lock conflict, your application takes the hit. Here are the scenarios that bite people in production.
Scenario 1: Concurrent Write Limitations
DuckDB is single-writer by design. Within one process, concurrent writes are serialized; a second process attempting to open the same database file for writing fails with a lock error.
What happens: Your ETL pipeline runs a DuckDB write while a separate reporting query holds a read lock. The write blocks. If the read transaction runs long, the write times out.
Mitigation:
import duckdb
import queue
import threading

class DuckDBWriter:
    """Serializes all writes through one background thread that owns
    the only write connection, preventing concurrent write errors."""

    def __init__(self, db_path):
        self.db_path = db_path
        self.write_queue = queue.Queue()
        self._worker = threading.Thread(target=self._process_queue, daemon=True)
        self._worker.start()

    def write(self, query, params=None):
        """Thread-safe write: enqueue and block until the worker finishes"""
        result_queue = queue.Queue()
        self.write_queue.put((query, params, result_queue))
        # Wait for the write to complete
        return result_queue.get()

    def _process_queue(self):
        """Background thread that processes writes serially"""
        conn = duckdb.connect(self.db_path)
        while True:
            query, params, result_queue = self.write_queue.get()
            try:
                if params:
                    conn.execute(query, params)
                else:
                    conn.execute(query)
                result_queue.put({'success': True})
            except Exception as e:
                result_queue.put({'success': False, 'error': str(e)})
            finally:
                self.write_queue.task_done()
Scenario 2: Large Result Set OOM
Fetching millions of rows into a pandas DataFrame via fetchdf() loads everything into memory at once. For wide tables or large result sets, this causes OOM crashes.
What happens: You query a 50-column table with 50 million rows. DuckDB materializes the full result set in memory before converting to pandas. Your process is killed by the OS OOM killer.
Mitigation:
# BAD: loads all rows into memory at once
result = con.execute("SELECT * FROM large_table").fetchdf()

# GOOD: execute once, then stream results in batches with fetchmany()
cursor = con.execute("SELECT * FROM large_table")
while True:
    batch = cursor.fetchmany(100_000)
    if not batch:
        break
    handle_batch(batch)  # process each chunk instead of accumulating them all

# GOOD: stream Arrow record batches without materializing the result
conn = duckdb.connect()
conn.execute("INSTALL httpfs; LOAD httpfs;")
reader = conn.execute("""
    SELECT * FROM read_parquet('s3://bucket/large_table/**/*.parquet')
""").fetch_record_batch(100_000)
for record_batch in reader:
    handle_batch(record_batch)  # each item is a pyarrow.RecordBatch
Scenario 3: Extension Loading Failures in Production
The postgres scanner and Iceberg extensions must be installed and loaded before use. If an extension is missing from your deployment environment, queries that depend on it fail with a catalog error.
What happens: Your local test with postgres_scan() works perfectly. In the production Docker container the extension was never installed, and the same query throws a catalog error because the postgres_scan function does not exist there.
Mitigation:
import duckdb
import pandas as pd

def safe_postgres_scan(conn, host, port, dbname, schema, table, fallback=True):
    """Try postgres_scan with a graceful fallback to a direct connection"""
    try:
        query = f"""
            SELECT * FROM postgres_scan(
                'host={host} port={port} dbname={dbname}',
                '{schema}', '{table}'
            )
        """
        return conn.execute(query).fetchdf()
    except (duckdb.CatalogException, duckdb.IOException):
        if not fallback:
            raise
        # Fallback: use a standard DB-API connection instead
        import psycopg2
        pg_conn = psycopg2.connect(host=host, port=port, database=dbname)
        return pd.read_sql(f"SELECT * FROM {schema}.{table}", pg_conn)
Scenario 4: S3 Query Performance on Wide Files
Querying remote Parquet on S3 without partition pruning reads entire datasets. For wide (hundreds of columns) and deep (millions of rows) files, network transfer becomes the bottleneck.
What happens: SELECT * FROM read_parquet('s3://bucket/table/*.parquet') with no filters reads every row group of every file. DuckDB does push column projections and filters into Parquet scans via HTTP range requests, but without partition pruning or a selective column list you still transfer gigabytes only to keep 100 rows.
Mitigation:
# Project only needed columns and filter early; DuckDB pushes the
# projection and filter down into the Parquet scan automatically
result = con.execute("""
    SELECT
        customer_id,
        order_date,
        total_amount
    FROM read_parquet('s3://warehouse/gold/orders/**/*.parquet')
    WHERE order_date >= DATE '2025-01-01'
      AND total_amount > 100
""").fetchdf()

# Use Hive partitioning for partition pruning
# Path structure: s3://bucket/table/year=2025/month=01/day=15/
result = con.execute("""
    SELECT * FROM read_parquet('s3://bucket/table/**/*.parquet',
                               hive_partitioning = true)
    WHERE year = 2025 AND month = 1
""").fetchdf()
Trade-Off Table: DuckDB vs Alternatives
| Dimension | DuckDB | Apache Spark | Pandas |
|---|---|---|---|
| Deployment | In-process library | Cluster / single-node | In-memory |
| Setup complexity | Zero config | Requires cluster | Zero config |
| Concurrent writes | Single writer only | Multiple writers | Single writer |
| Scaling | Single machine | Multi-node cluster | Single machine |
| Performance (large data) | Excellent for single-machine | Excellent at scale | Poor beyond ~10M rows |
| Remote data queries | S3, PostgreSQL, Iceberg | S3, JDBC, native | Requires load |
| Memory model | Columnar vectorized | Distributed columnar | Row-based |
| Best for | Local analytics, embedded BI | Petabyte lakehouse | Small data, prototyping |
Capacity Estimation
DuckDB’s memory requirements scale with your data and query complexity.
Memory Formula
Working Memory ≈ (Vector Batch Size × Column Width × Concurrent Operations) + Sort Buffers
Vector Batch Size = 2048 rows (default)
Column Width = average bytes per value per column
Example: 50-column table, 10M rows, avg 20 bytes/column value
==============================================================
Raw data in memory: 50 × 10,000,000 × 20 bytes = 10 GB
Vector batch working set: 2048 × 50 × 20 = 2 MB per batch
Sort buffer (8-way): 8 × 256 MB = 2 GB
Estimated peak: 12-15 GB RAM needed
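The worked example above can be wrapped in a small helper. Note that the 30% headroom factor for hash tables and intermediates is an assumption made for estimation here, not a DuckDB constant:

```python
def estimate_peak_memory_gb(n_rows, n_columns, avg_bytes_per_value,
                            sort_ways=8, sort_buffer_mb=256,
                            headroom=1.3):
    """Back-of-envelope peak RAM estimate mirroring the worked example.

    The headroom factor (assumed, not a DuckDB constant) covers hash
    tables and intermediates; real usage depends on the query plan,
    compression, and spill-to-disk behaviour.
    """
    raw_data_gb = n_rows * n_columns * avg_bytes_per_value / 1e9
    sort_buffers_gb = sort_ways * sort_buffer_mb / 1e3
    low = raw_data_gb + sort_buffers_gb
    high = raw_data_gb * headroom + sort_buffers_gb
    return low, high

# The 50-column, 10M-row, 20-bytes-per-value example from above
low, high = estimate_peak_memory_gb(10_000_000, 50, 20)
print(f"Estimated peak: {low:.0f}-{high:.0f} GB")  # Estimated peak: 12-15 GB
```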
Dataset Size Guidelines
| Dataset Size | Recommended Configuration | Expected Performance |
|---|---|---|
| <100 MB | Default DuckDB settings | Sub-second queries |
| 100 MB – 1 GB | Default settings | 1-5 second queries |
| 1 GB – 10 GB | 4-8 GB memory limit | 5-30 second queries |
| 10 GB – 50 GB | 16-32 GB RAM, vectorized | 30s – 2min queries |
| 50 GB+ | Consider Spark or distributed engine | Depends on architecture |
Vector Size and Memory Tuning
DuckDB processes data in vectors of 2048 rows. The vector size is fixed at build time rather than being a runtime knob; what you can tune is the memory budget and parallelism:
# Set memory limit to prevent runaway queries
con.execute("SET memory_limit = '8GB'")
# Limit worker threads (defaults to the number of cores)
con.execute("SET threads = 4")
# Inspect effective settings
print(con.execute("SELECT current_setting('memory_limit')").fetchone())
# Monitor database size and memory usage
print(con.execute("PRAGMA database_size").fetchdf())
Observability Hooks
Track query performance so you catch regressions before they become incidents.
import logging
import resource
import time

import duckdb

logger = logging.getLogger(__name__)

class ObservedConnection:
    """Wrapper that instruments DuckDB queries"""

    def __init__(self, db_path=None, memory_limit_gb=8):
        self.con = duckdb.connect(db_path) if db_path else duckdb.connect()
        self.con.execute(f"SET memory_limit = '{memory_limit_gb}GB'")

    def execute(self, query, params=None):
        start_time = time.time()
        memory_before = self._get_memory_usage()
        try:
            if params:
                result = self.con.execute(query, params).fetchdf()
            else:
                result = self.con.execute(query).fetchdf()
            elapsed = time.time() - start_time
            memory_after = self._get_memory_usage()
            logger.info(
                f"query_duration_ms={elapsed * 1000:.1f} "
                f"memory_delta_mb={(memory_after - memory_before) / 1e6:.1f} "
                f"rows={len(result)}"
            )
            return result
        except Exception as e:
            elapsed = time.time() - start_time
            logger.error(f"query_failed after {elapsed * 1000:.1f}ms: {e}")
            raise

    def _get_memory_usage(self):
        """Approximate memory usage: peak RSS of this process in bytes.
        ru_maxrss is kilobytes on Linux (bytes on macOS), and a peak
        value never decreases, so treat deltas as a rough signal."""
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024
What to monitor:
- Query duration — watch for regressions on queries that previously ran fast
- Memory delta per query — sudden spikes mean OOM risk is rising
- Row count returned — zero results when you expected rows means the data is missing
- Extension load failures in logs — postgres_scan failing silently in production is a nasty surprise
Security Checklist
DuckDB runs in your application process. Think about these production concerns:
- No built-in authentication. Layer access control at the application level; DuckDB assumes a trusted environment
- S3 credentials. Prefer IAM roles for EC2 and ECS over static keys. For local work, aws sso login or short-lived environment-variable tokens are safer
- Data at rest. DuckDB database files are unencrypted by default. If you are handling sensitive data, use filesystem encryption: LUKS on Linux, BitLocker on Windows
- Query injection. Parameterize any query that accepts user input. The Python bindings support con.execute("SELECT ?", (user_input,)), which handles this safely
- Extension sources. Only load extensions you trust. Malicious extensions run with the same privileges as your process
- Crash dumps. Core dumps and crash reports can contain query result data. Configure your production environment to handle or suppress these appropriately
Common Anti-Patterns
1. Using DuckDB as a Transactional Database
DuckDB has no concurrent write support. If you need multi-user read/write transactions, use PostgreSQL. DuckDB is OLAP, not OLTP.
2. Fetching Full Tables into Pandas for Export
con.execute("SELECT * FROM table").fetchdf() loads everything into memory before converting to pandas. For large tables, use fetchmany() batches, Arrow readers, or export directly to Parquet with COPY.
# Export to Parquet instead of fetching into pandas
con.execute("""
COPY (SELECT * FROM large_table WHERE date >= '2025-01-01')
TO 's3://bucket/export/' (FORMAT PARQUET, PER_THREAD_OUTPUT=true)
""")
3. Ignoring Memory Limits in Long-Running Services
Without a memory limit, a runaway aggregation can consume all system RAM and crash your service. Always set memory_limit.
4. Using DuckDB for Real-Time Streaming
DuckDB processes bounded datasets. For streaming, use Kafka Streams, Flink, or Spark Streaming. DuckDB can consume a Kafka topic in micro-batches, but it is not built for streaming workloads.
5. Loading Data Before Querying
Newcomers often load data into DuckDB with CREATE TABLE AS SELECT when DuckDB can query data in place via read_parquet(), read_csv_auto(), and extension scanners. Loading copies data unnecessarily.
# Unnecessary: copy data into DuckDB
con.execute("CREATE TABLE sales AS SELECT * FROM read_parquet('s3://data/sales.parquet')")
# Better: query directly
result = con.execute("""
SELECT * FROM read_parquet('s3://data/sales.parquet')
WHERE sale_date >= '2025-01-01'
""").fetchdf()
Quick Recap
Key Takeaways:
- DuckDB is an in-process, serverless analytical database — no config, no infrastructure, no operational overhead, and performance that embarrasses pandas for single-machine workloads
- Columnar vectorized execution handles 10M–100M row aggregations at speeds competitive with distributed engines on the same hardware
- Zero-copy pandas integration means existing pandas code gets faster without any API changes
- The extension system (PostgreSQL scanner, Iceberg, httpfs) lets you query S3, PostgreSQL, and local files in a single JOIN without moving data
- Set memory limits, use batched fetching for large result sets, and do not use it as a transactional database
When DuckDB is the Right Tool:
- Local analytical queries on data too large for pandas
- Embedded analytics in desktop or server applications
- Quick exploration of remote data without ETL
- Polyglot joins across S3, PostgreSQL, and local files
When to Choose Alternatives:
- Distributed scale → Apache Spark
- Cloud data warehouse → Snowflake, BigQuery, Redshift
- Concurrent writes → PostgreSQL
- Streaming/real-time → Kafka Streams, Flink, Spark Streaming
For cluster-scale processing, see Apache Spark. For portable pipelines across runners, read about Apache Beam.
Conclusion
DuckDB is what SQLite would be if designed for analytics in 2025. No configuration, no infrastructure, no operational overhead. Run it on your laptop, in a serverless function, inside a Spark executor, or in the browser via WebAssembly. The same SQL runs everywhere with the same performance characteristics on single-machine data.
For most data engineering tasks that do not require petabyte scale, DuckDB is worth trying before defaulting to Spark. It is faster than pandas for large local datasets, simpler to operate than any distributed engine, and more capable than SQLite for analytical workloads.