DuckDB: The SQLite for Analytical Workloads
Explore DuckDB, the in-process analytical database that runs anywhere, handles columnar storage efficiently, and brings analytics to where your data lives.
SQLite became ubiquitous because it solved a real problem: a zero-configuration, serverless, embeddable database that runs inside your application. Every phone, every browser, every desktop application that needed structured data used SQLite. For transactional workloads—inserts, updates, point queries—SQLite delivered exactly what it promised.
Analytical workloads needed something different. When you’re aggregating millions of rows, SQLite’s row-based storage scans the entire table. Its single-writer architecture limits concurrency. Running analytical queries on SQLite is painful.
DuckDB is the answer to the question SQLite could not answer: what if you took the simplicity and portability of SQLite and optimized it for analytical queries? DuckDB is in-process, serverless, runs anywhere, and screams at analytical workloads. No infrastructure to manage. No separate service to run. Just a library that processes data at the speed of memory.
The Basics: Installation and First Queries
DuckDB installs in seconds and runs immediately. Single binary, no dependencies, no configuration.
# Install with pip, conda, or download the binary
pip install duckdb
# Or download the CLI and run it directly (unzip cannot read from a pipe)
curl -LO https://github.com/duckdb/duckdb/releases/download/v1.1.0/duckdb_cli-osx-universal.zip
unzip duckdb_cli-osx-universal.zip
./duckdb
From Python, R, or any language with DuckDB bindings, you connect with a single line:
import duckdb
# Connect to a database file (omit the argument for an in-memory database)
con = duckdb.connect('analytics.db')
# Run a query
result = con.execute("""
SELECT
category,
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS revenue,
COUNT(DISTINCT customer_id) AS unique_customers
FROM 'sales.parquet'
GROUP BY 1, 2
ORDER BY 2, 3 DESC
""").fetchdf()
print(result.head())
The result comes back as a pandas DataFrame, ready for analysis or visualization. You can also query directly from pandas DataFrames without loading data into DuckDB first.
Columnar Storage: Why DuckDB Is Fast
DuckDB uses columnar storage internally. Instead of storing rows contiguously on disk (row store), each column is stored separately. When you query SELECT category, SUM(amount), DuckDB reads only the category and amount columns, skipping everything else.
This is the same storage format that Snowflake, BigQuery, and Redshift use. The difference is that DuckDB runs in your process rather than on a remote cluster.
# DuckDB can query remote data without downloading it
con.execute("""
SELECT
*
FROM read_parquet('s3://warehouse/gold/fact_orders/**/*.parquet')
WHERE order_date >= DATE '2025-01-01'
""")
# Query a pandas DataFrame directly
import pandas as pd
df = pd.read_csv('large_file.csv')
con.execute("""
SELECT category, COUNT(*), AVG(amount)
FROM df
GROUP BY category
""").fetchdf()
Vectorized execution processes columns in batches (2048 rows at a time by default) using SIMD instructions. A single 256-bit AVX2 instruction operates on 32 bytes of data at once, and AVX-512 doubles that to 64 bytes. A row-at-a-time executor wastes most of this bandwidth; vectorized execution keeps the CPU's arithmetic units saturated.
Zero-Copy Integration with Pandas
One of DuckDB's killer features is zero-copy reading from pandas DataFrames and writing results back. DuckDB doesn't copy data when you query a pandas DataFrame; it reads the underlying NumPy arrays in place.
import numpy as np
import pandas as pd
import duckdb
# Create a large DataFrame
df = pd.DataFrame({
'user_id': range(10_000_000),
'category': np.random.choice(['A', 'B', 'C', 'D'], 10_000_000),
'amount': np.random.uniform(0, 1000, 10_000_000),
'date': pd.date_range('2020-01-01', periods=10_000_000, freq='min')
})
# Query it directly with DuckDB (zero copy)
result = duckdb.sql("""
SELECT
category,
DATE_TRUNC('day', date) AS day,
SUM(amount) AS total_amount,
COUNT(*) AS transaction_count
FROM df
WHERE date >= '2025-01-01'
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 20
""").df()
Large-scale aggregations that would take minutes in pandas complete in seconds.
Reading Directly from Files and Remote Sources
DuckDB can read from Parquet, CSV, JSON, and even remote sources like S3 without loading into a table first. The read_parquet() and read_csv() functions query data in place.
-- Query Parquet files on S3 without loading
SELECT
supplier_name,
SUM(line_item_value) AS total_value,
COUNT(DISTINCT order_id) AS order_count
FROM read_parquet('s3://warehouse/gold/purchase_orders/*.parquet',
hive_partitioning = true)
WHERE order_date >= DATE '2025-01-01'
AND supplier_region = 'EMEA'
GROUP BY supplier_name
ORDER BY total_value DESC
LIMIT 20;
-- Union multiple CSV files
SELECT * FROM read_csv_auto('data/2025-01-*.csv')
UNION ALL
SELECT * FROM read_csv_auto('data/2025-02-*.csv')
UNION ALL
SELECT * FROM read_csv_auto('data/2025-03-*.csv');
This capability turns DuckDB into a polyglot query engine. You can join a local Parquet file with a remote S3 Parquet file and a pandas DataFrame in the same query, and DuckDB handles the execution planning.
Window Functions and Advanced SQL
DuckDB supports the full SQL standard including sophisticated window functions, common table expressions, and complex joins:
WITH weekly_metrics AS (
    SELECT
        DATE_TRUNC('week', order_date) AS week,
        customer_id,
        SUM(amount) AS weekly_revenue,
        AVG(amount) AS avg_order_value,
        COUNT(*) AS order_count,
        -- Running total per customer
        SUM(SUM(amount)) OVER (
            PARTITION BY customer_id
            ORDER BY DATE_TRUNC('week', order_date)
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS cumulative_revenue,
        -- Rank within week
        RANK() OVER (
            PARTITION BY DATE_TRUNC('week', order_date)
            ORDER BY SUM(amount) DESC
        ) AS weekly_rank
    FROM orders
    WHERE order_date >= CURRENT_DATE - INTERVAL '12 months'
    GROUP BY 1, 2
)
SELECT *
FROM weekly_metrics
WHERE weekly_rank <= 10
ORDER BY week, weekly_rank;
Window functions are essential for analytical workloads—moving averages, running totals, ranking, lag/lead comparisons. DuckDB implements them correctly and efficiently.
Extensions: PostgreSQL, Iceberg, and More
DuckDB ships with an extension framework that adds capabilities beyond the core engine.
The postgres scanner lets you query a live PostgreSQL database without ETL. You can join PostgreSQL tables with local data, push down predicates to PostgreSQL to minimize data transfer, and use DuckDB’s analytical engine for aggregations.
# Load the postgres scanner, then query PostgreSQL alongside local data
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("""
CREATE TABLE local_lookup AS
SELECT * FROM postgres_scan('host=db port=5432 dbname=warehouse',
'public', 'product_categories');
""")
# Join PostgreSQL source with local table
result = con.execute("""
SELECT
p.category,
p.brand,
l.lookup_value,
COUNT(*) AS event_count
FROM read_parquet('s3://events/2025/**/*.parquet') e
JOIN local_lookup l ON e.product_id = l.product_id
JOIN postgres_scan('host=db', 'public', 'products') p
ON e.product_id = p.product_id
GROUP BY 1, 2, 3
""").df()
The Iceberg extension reads Apache Iceberg tables directly, enabling DuckDB to query lakehouse data without a dedicated query engine.
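A minimal sketch of what that looks like, assuming the extension is available in your environment; the table path is a placeholder:

```sql
-- Install and load the extension once per environment
INSTALL iceberg;
LOAD iceberg;

-- Scan an Iceberg table by its location (path is illustrative)
SELECT *
FROM iceberg_scan('s3://lake/warehouse/db/events')
LIMIT 10;
```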
When DuckDB Is the Right Tool
DuckDB excels at analytical workloads on a single machine. If you are processing data that fits on one server, DuckDB is simpler than a Spark cluster and faster than pandas for large datasets. Data scientists exploring large CSV exports, analysts running complex SQL on local Parquet files, applications that need embedded analytics—these are DuckDB’s sweet spots.
DuckDB is not a replacement for distributed query engines when you need to process terabytes across multiple machines. It is not a transactional database (use SQLite for that). It does not have a distributed filesystem or cluster coordination built in.
DuckDB in the Analytical Toolchain
Understanding where DuckDB fits relative to other tools helps you make better architectural decisions.
flowchart LR
subgraph "Single-Machine Tools"
DuckDB[DuckDB]
SQLite[SQLite]
Pandas[Pandas]
end
subgraph "Distributed Engines"
Spark[Apache Spark]
Snowflake[Snowflake]
BigQuery[BigQuery]
Redshift[Amazon Redshift]
end
Pandas -->|too slow for >10M rows| DuckDB
SQLite -->|no columnar, no analytical perf| DuckDB
DuckDB -->|need cluster scale| Spark
DuckDB -->|need cloud warehouse| Snowflake
Spark -->|small data, local perf| DuckDB
DuckDB sits between pandas (single-machine, row-based) and Spark (cluster, distributed). It wins when your data fits on one machine but exceeds what pandas can handle efficiently.
When to Use DuckDB
Use DuckDB for:
- Datasets that fit in RAM but are too large for pandas to handle at speed (typically >5 million rows)
- SQL analytical queries on local Parquet or CSV files
- Embedded analytics inside applications without running a database server
- Querying across multiple remote sources (S3, PostgreSQL, Iceberg) without ETL
- Faster aggregations than pandas during exploratory data analysis
Avoid DuckDB when:
- Your data needs distributed processing across multiple machines
- You require concurrent write transactions (reach for PostgreSQL or SQLite instead)
- You need a client-server model with simultaneous connections
- You are processing streaming data in real time (Flink, Spark Streaming, and Kafka Streams are built for this)
Production Failure Scenarios
DuckDB runs in your application process. When it runs out of memory or hits a lock conflict, your application takes the hit. Here are the scenarios that bite people in production.
Scenario 1: Concurrent Write Limitations
DuckDB is single-writer by design. Within one process, concurrent writes are serialized; a second process attempting to open the same database file for writing fails with a lock error.
What happens: Your ETL pipeline runs a DuckDB write while a separate reporting query holds a read lock. The write blocks. If the read transaction runs long, the write times out.
Mitigation:
import duckdb
import queue
import threading

class DuckDBWriter:
    """Serializes all writes through one background thread that owns
    the only write connection, preventing concurrent write errors."""

    def __init__(self, db_path):
        self.db_path = db_path
        self.write_queue = queue.Queue()
        self._worker = threading.Thread(target=self._process_queue, daemon=True)
        self._worker.start()

    def write(self, query, params=None):
        """Thread-safe write: enqueue and block until the worker finishes"""
        result_queue = queue.Queue()
        self.write_queue.put((query, params, result_queue))
        # Wait for the write to complete
        return result_queue.get()

    def _process_queue(self):
        """Background thread that processes writes serially"""
        conn = duckdb.connect(self.db_path)
        while True:
            query, params, result_queue = self.write_queue.get()
            try:
                if params:
                    conn.execute(query, params)
                else:
                    conn.execute(query)
                result_queue.put({'success': True})
            except Exception as e:
                result_queue.put({'success': False, 'error': str(e)})
            finally:
                self.write_queue.task_done()
Scenario 2: Large Result Set OOM
Fetching millions of rows into a pandas DataFrame via fetchdf() loads everything into memory at once. For wide tables or large result sets, this causes OOM crashes.
What happens: You query a 50-column table with 50 million rows. DuckDB materializes the full result set in memory before converting to pandas. Your process is killed by the OS OOM killer.
Mitigation:
# BAD: loads all rows into memory at once
result = con.execute("SELECT * FROM large_table").fetchdf()

# GOOD: execute once, then stream results in batches with fetchmany()
cursor = con.execute("SELECT * FROM large_table")
while True:
    batch = cursor.fetchmany(100_000)
    if not batch:
        break
    handle_batch(batch)  # process each chunk instead of accumulating them all

# GOOD: stream Arrow record batches without materializing the result
conn = duckdb.connect()
conn.execute("INSTALL httpfs; LOAD httpfs;")
reader = conn.execute("""
    SELECT * FROM read_parquet('s3://bucket/large_table/**/*.parquet')
""").fetch_record_batch(100_000)
for record_batch in reader:
    handle_batch(record_batch)  # each item is a pyarrow.RecordBatch
Scenario 3: Extension Loading Failures in Production
The postgres scanner and Iceberg extensions must be installed and loaded before use. If an extension is missing from your deployment environment, queries that depend on it fail with a catalog error.
What happens: Your local test with postgres_scan() works perfectly. In the production Docker container the extension was never installed, and the same query throws a catalog error because the postgres_scan function does not exist there.
Mitigation:
import duckdb
import pandas as pd

def safe_postgres_scan(conn, host, port, dbname, schema, table, fallback=True):
    """Try postgres_scan with a graceful fallback to a direct connection"""
    try:
        query = f"""
            SELECT * FROM postgres_scan(
                'host={host} port={port} dbname={dbname}',
                '{schema}', '{table}'
            )
        """
        return conn.execute(query).fetchdf()
    except (duckdb.CatalogException, duckdb.IOException):
        if not fallback:
            raise
        # Fallback: use a standard DB-API connection instead
        import psycopg2
        pg_conn = psycopg2.connect(host=host, port=port, database=dbname)
        return pd.read_sql(f"SELECT * FROM {schema}.{table}", pg_conn)
Scenario 4: S3 Query Performance on Wide Files
Querying remote Parquet on S3 without partition pruning reads entire datasets. For wide (hundreds of columns) and deep (millions of rows) files, network transfer becomes the bottleneck.
What happens: SELECT * FROM read_parquet('s3://bucket/table/*.parquet') with no filters reads every row group of every file. DuckDB does push column projections and filters into Parquet scans via HTTP range requests, but without partition pruning or a selective column list you still transfer gigabytes only to keep 100 rows.
Mitigation:
# Project only needed columns and filter early; DuckDB pushes the
# projection and filter down into the Parquet scan automatically
result = con.execute("""
    SELECT
        customer_id,
        order_date,
        total_amount
    FROM read_parquet('s3://warehouse/gold/orders/**/*.parquet')
    WHERE order_date >= DATE '2025-01-01'
      AND total_amount > 100
""").fetchdf()

# Use Hive partitioning for partition pruning
# Path structure: s3://bucket/table/year=2025/month=01/day=15/
result = con.execute("""
    SELECT * FROM read_parquet('s3://bucket/table/**/*.parquet',
                               hive_partitioning = true)
    WHERE year = 2025 AND month = 1
""").fetchdf()
Trade-Off Table: DuckDB vs Alternatives
| Dimension | DuckDB | Apache Spark | Pandas |
|---|---|---|---|
| Deployment | In-process library | Cluster / single-node | In-memory |
| Setup complexity | Zero config | Requires cluster | Zero config |
| Concurrent writes | Single writer only | Multiple writers | Single writer |
| Scaling | Single machine | Multi-node cluster | Single machine |
| Performance (large data) | Excellent for single-machine | Excellent at scale | Poor beyond ~10M rows |
| Remote data queries | S3, PostgreSQL, Iceberg | S3, JDBC, native | Requires load |
| Memory model | Columnar vectorized | Distributed columnar | Row-based |
| Best for | Local analytics, embedded BI | Petabyte lakehouse | Small data, prototyping |
Capacity Estimation
DuckDB’s memory requirements scale with your data and query complexity.
Memory Formula
Working Memory ≈ (Vector Batch Size × Column Width × Concurrent Operations) + Sort Buffers
Vector Batch Size = 2048 rows (default)
Column Width = average bytes per value per column
Example: 50-column table, 10M rows, avg 20 bytes/column value
==============================================================
Raw data in memory: 50 × 10,000,000 × 20 bytes = 10 GB
Vector batch working set: 2048 × 50 × 20 = 2 MB per batch
Sort buffer (8-way): 8 × 256 MB = 2 GB
Estimated peak: 12-15 GB RAM needed
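The worked example above can be wrapped in a small helper. Note that the 30% headroom factor for hash tables and intermediates is an assumption made for estimation here, not a DuckDB constant:

```python
def estimate_peak_memory_gb(n_rows, n_columns, avg_bytes_per_value,
                            sort_ways=8, sort_buffer_mb=256,
                            headroom=1.3):
    """Back-of-envelope peak RAM estimate mirroring the worked example.

    The headroom factor (assumed, not a DuckDB constant) covers hash
    tables and intermediates; real usage depends on the query plan,
    compression, and spill-to-disk behaviour.
    """
    raw_data_gb = n_rows * n_columns * avg_bytes_per_value / 1e9
    sort_buffers_gb = sort_ways * sort_buffer_mb / 1e3
    low = raw_data_gb + sort_buffers_gb
    high = raw_data_gb * headroom + sort_buffers_gb
    return low, high

# The 50-column, 10M-row, 20-bytes-per-value example from above
low, high = estimate_peak_memory_gb(10_000_000, 50, 20)
print(f"Estimated peak: {low:.0f}-{high:.0f} GB")  # Estimated peak: 12-15 GB
```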
Dataset Size Guidelines
| Dataset Size | Recommended Configuration | Expected Performance |
|---|---|---|
| <100 MB | Default DuckDB settings | Sub-second queries |
| 100 MB – 1 GB | Default settings | 1-5 second queries |
| 1 GB – 10 GB | 4-8 GB memory limit | 5-30 second queries |
| 10 GB – 50 GB | 16-32 GB RAM, vectorized | 30s – 2min queries |
| 50 GB+ | Consider Spark or distributed engine | Depends on architecture |
Vector Size and Memory Tuning
DuckDB processes data in vectors of 2048 rows. The vector size is fixed at build time rather than being a runtime knob; what you can tune is the memory budget and parallelism:
# Set memory limit to prevent runaway queries
con.execute("SET memory_limit = '8GB'")
# Limit worker threads (defaults to the number of cores)
con.execute("SET threads = 4")
# Inspect effective settings
print(con.execute("SELECT current_setting('memory_limit')").fetchone())
# Monitor database size and memory usage
print(con.execute("PRAGMA database_size").fetchdf())
Observability Hooks
Track query performance so you catch regressions before they become incidents.
import logging
import resource
import time

import duckdb

logger = logging.getLogger(__name__)

class ObservedConnection:
    """Wrapper that instruments DuckDB queries"""

    def __init__(self, db_path=None, memory_limit_gb=8):
        self.con = duckdb.connect(db_path) if db_path else duckdb.connect()
        self.con.execute(f"SET memory_limit = '{memory_limit_gb}GB'")

    def execute(self, query, params=None):
        start_time = time.time()
        memory_before = self._get_memory_usage()
        try:
            if params:
                result = self.con.execute(query, params).fetchdf()
            else:
                result = self.con.execute(query).fetchdf()
            elapsed = time.time() - start_time
            memory_after = self._get_memory_usage()
            logger.info(
                f"query_duration_ms={elapsed * 1000:.1f} "
                f"memory_delta_mb={(memory_after - memory_before) / 1e6:.1f} "
                f"rows={len(result)}"
            )
            return result
        except Exception as e:
            elapsed = time.time() - start_time
            logger.error(f"query_failed after {elapsed * 1000:.1f}ms: {e}")
            raise

    def _get_memory_usage(self):
        """Approximate memory usage: peak RSS of this process in bytes.
        ru_maxrss is kilobytes on Linux (bytes on macOS), and a peak
        value never decreases, so treat deltas as a rough signal."""
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024
What to monitor:
- Query duration — watch for regressions on queries that previously ran fast
- Memory delta per query — sudden spikes mean OOM risk is rising
- Row count returned — zero results when you expected rows means the data is missing
- Extension load failures in logs — postgres_scan failing silently in production is a nasty surprise
Security Checklist
DuckDB runs in your application process. Think about these production concerns:
- No built-in authentication. Layer access control at the application level; DuckDB assumes a trusted environment
- S3 credentials. Prefer IAM roles for EC2 and ECS over static keys. For local work, aws sso login or short-lived environment-variable tokens are safer
- Data at rest. DuckDB database files are unencrypted by default. If you are handling sensitive data, use filesystem encryption: LUKS on Linux, BitLocker on Windows
- Query injection. Parameterize any query that accepts user input. The Python bindings support con.execute("SELECT ?", (user_input,)), which handles this safely
- Extension sources. Only load extensions you trust. Malicious extensions run with the same privileges as your process
- Crash dumps. Core dumps and crash reports can contain query result data. Configure your production environment to handle or suppress these appropriately
Common Anti-Patterns
1. Using DuckDB as a Transactional Database
DuckDB has no concurrent write support. If you need multi-user read/write transactions, use PostgreSQL. DuckDB is OLAP, not OLTP.
2. Fetching Full Tables into Pandas for Export
con.execute("SELECT * FROM table").fetchdf() loads everything into memory before converting to pandas. For large tables, use fetchmany() batches, Arrow readers, or export directly to Parquet with COPY.
# Export to Parquet instead of fetching into pandas
con.execute("""
COPY (SELECT * FROM large_table WHERE date >= '2025-01-01')
TO 's3://bucket/export/' (FORMAT PARQUET, PER_THREAD_OUTPUT=true)
""")
3. Ignoring Memory Limits in Long-Running Services
Without a memory limit, a runaway aggregation can consume all system RAM and crash your service. Always set memory_limit.
4. Using DuckDB for Real-Time Streaming
DuckDB processes bounded datasets. For streaming, use Kafka Streams, Flink, or Spark Streaming. DuckDB can consume a Kafka topic in micro-batches, but it is not built for streaming workloads.
5. Loading Data Before Querying
Newcomers often load data into DuckDB with CREATE TABLE AS SELECT when DuckDB can query data in place via read_parquet(), read_csv_auto(), and extension scanners. Loading copies data unnecessarily.
# Unnecessary: copy data into DuckDB
con.execute("CREATE TABLE sales AS SELECT * FROM read_parquet('s3://data/sales.parquet')")
# Better: query directly
result = con.execute("""
SELECT * FROM read_parquet('s3://data/sales.parquet')
WHERE sale_date >= '2025-01-01'
""").fetchdf()
Quick Recap
Key Takeaways:
- DuckDB is an in-process, serverless analytical database — no config, no infrastructure, no operational overhead, and performance that embarrasses pandas for single-machine workloads
- Columnar vectorized execution handles 10M–100M row aggregations at speeds competitive with distributed engines on the same hardware
- Zero-copy pandas integration means existing pandas code gets faster without any API changes
- The extension system (PostgreSQL scanner, Iceberg, httpfs) lets you query S3, PostgreSQL, and local files in a single JOIN without moving data
- Set memory limits, use batched fetching for large result sets, and do not use it as a transactional database
When DuckDB is the Right Tool:
- Local analytical queries on data too large for pandas
- Embedded analytics in desktop or server applications
- Quick exploration of remote data without ETL
- Polyglot joins across S3, PostgreSQL, and local files
When to Choose Alternatives:
- Distributed scale → Apache Spark
- Cloud data warehouse → Snowflake, BigQuery, Redshift
- Concurrent writes → PostgreSQL
- Streaming/real-time → Kafka Streams, Flink, Spark Streaming
For cluster-scale processing, see Apache Spark. For portable pipelines across runners, read about Apache Beam.
Conclusion
DuckDB is what SQLite would be if designed for analytics in 2025. No configuration, no infrastructure, no operational overhead. Run it on your laptop, in a serverless function, inside a Spark executor, or in the browser via WebAssembly. The same SQL runs everywhere with the same performance characteristics on single-machine data.
For most data engineering tasks that do not require petabyte scale, DuckDB is worth trying before defaulting to Spark. It is faster than pandas for large local datasets, simpler to operate than any distributed engine, and more capable than SQLite for analytical workloads.