DuckDB: The SQLite for Analytical Workloads

Explore DuckDB, the in-process analytical database that runs anywhere, handles columnar storage efficiently, and brings analytics to where your data lives.

SQLite became ubiquitous because it solved a real problem: a zero-configuration, serverless, embeddable database that runs inside your application. Every phone, every browser, every desktop application that needed structured data used SQLite. For transactional workloads—inserts, updates, point queries—SQLite delivered exactly what it promised.

Analytical workloads needed something different. When you’re aggregating millions of rows, SQLite’s row-based storage scans the entire table. Its single-writer architecture limits concurrency. Running analytical queries on SQLite is painful.

DuckDB is the answer to the question SQLite could not answer: what if you took the simplicity and portability of SQLite and optimized it for analytical queries? DuckDB is in-process, serverless, runs anywhere, and screams at analytical workloads. No infrastructure to manage. No separate service to run. Just a library that processes data at the speed of memory.

The Basics: Installation and First Queries

DuckDB installs in seconds and runs immediately. Single binary, no dependencies, no configuration.

# Install with pip, conda, or download the binary
pip install duckdb

# Or run the CLI directly
curl -LO https://github.com/duckdb/duckdb/releases/download/v1.1.0/duckdb_cli-osx-universal.zip
unzip duckdb_cli-osx-universal.zip
./duckdb

From Python, R, or any language with DuckDB bindings, you connect with a single line:

import duckdb

# Connect to an in-memory database (or a file)
con = duckdb.connect('analytics.db')

# Run a query
result = con.execute("""
    SELECT
        category,
        DATE_TRUNC('month', order_date) AS month,
        SUM(amount) AS revenue,
        COUNT(DISTINCT customer_id) AS unique_customers
    FROM 'sales.parquet'
    GROUP BY 1, 2
    ORDER BY 2, 3 DESC
""").fetchdf()

print(result.head())

The result comes back as a pandas DataFrame, ready for analysis or visualization. You can also query directly from pandas DataFrames without loading data into DuckDB first.

Columnar Storage: Why DuckDB Is Fast

DuckDB uses columnar storage internally. Instead of storing rows contiguously on disk (row store), each column is stored separately. When you query SELECT category, SUM(amount), DuckDB reads only the category and amount columns, skipping everything else.

This is the same storage format that Snowflake, BigQuery, and Redshift use. The difference is that DuckDB runs in your process rather than on a remote cluster.
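To make the storage difference concrete, here is a pure-Python sketch (illustrative only, not DuckDB's actual on-disk format): summing one field in a columnar layout touches only that field's values, while a row layout walks past every other field.

```python
# Row store: every field of every row sits together; summing 'amount'
# still iterates over records that carry 'category' and 'customer_id'.
rows = [
    {"category": "A", "amount": 10.0, "customer_id": 1},
    {"category": "B", "amount": 25.0, "customer_id": 2},
    {"category": "A", "amount": 5.0, "customer_id": 3},
]
row_total = sum(r["amount"] for r in rows)

# Column store: each column is a separate contiguous array; summing
# 'amount' reads only that array and never touches the other columns.
columns = {
    "category": ["A", "B", "A"],
    "amount": [10.0, 25.0, 5.0],
    "customer_id": [1, 2, 3],
}
col_total = sum(columns["amount"])

print(row_total, col_total)  # both 40.0, but the column scan touches far less data
```

With three columns the column scan reads roughly a third of the data; with the 50-column tables common in warehouses, the savings dominate query time.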

# DuckDB can query remote data without downloading it
con.execute("""
    SELECT
        *
    FROM read_parquet('s3://warehouse/gold/fact_orders/**/*.parquet')
    WHERE order_date >= DATE '2025-01-01'
""")

# Query a pandas DataFrame directly
import pandas as pd
df = pd.read_csv('large_file.csv')

con.execute("""
    SELECT category, COUNT(*), AVG(amount)
    FROM df
    GROUP BY category
""").fetchdf()

Vectorized execution processes columns in batches (typically 2048 rows at a time) using SIMD instructions. Modern CPUs can process 256-512 bits of data per instruction with AVX registers. A row-at-a-time executor wastes most of this bandwidth; vectorized execution keeps the CPU's arithmetic units saturated.
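The batch-at-a-time idea can be sketched in plain Python (illustrative only; DuckDB does this in C++ with SIMD kernels, and its vector size is fixed at build time):

```python
# Illustrative sketch: aggregate a column in fixed-size batches
# (DuckDB's default vector size is 2048 rows) rather than one
# value per interpreter-loop iteration.
VECTOR_SIZE = 2048

def batched_sum(values):
    total = 0.0
    for start in range(0, len(values), VECTOR_SIZE):
        batch = values[start:start + VECTOR_SIZE]
        # A real engine applies a tight SIMD kernel to the whole batch here.
        total += sum(batch)
    return total

print(batched_sum([1.0] * 10_000))  # 10000.0
```

The payoff in a real engine comes from amortizing dispatch overhead over 2048 values and letting the compiler emit SIMD code for the inner loop.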

Zero-Copy Integration with Pandas

One of DuckDB’s killer features is zero-copy integration with pandas. DuckDB doesn’t copy data when you query a pandas DataFrame; it reads the underlying numpy arrays in place.

import numpy as np
import pandas as pd
import duckdb

# Create a large DataFrame
df = pd.DataFrame({
    'user_id': range(10_000_000),
    'category': np.random.choice(['A', 'B', 'C', 'D'], 10_000_000),
    'amount': np.random.uniform(0, 1000, 10_000_000),
    'date': pd.date_range('2020-01-01', periods=10_000_000, freq='min')
})

# Query it directly with DuckDB (zero copy)
result = duckdb.sql("""
    SELECT
        category,
        DATE_TRUNC('day', date) AS day,
        SUM(amount) AS total_amount,
        COUNT(*) AS transaction_count
    FROM df
    WHERE date >= '2025-01-01'
    GROUP BY 1, 2
    ORDER BY 3 DESC
    LIMIT 20
""").df()

Large-scale aggregations that would take minutes in pandas complete in seconds.

Reading Directly from Files and Remote Sources

DuckDB can read from Parquet, CSV, JSON, and even remote sources like S3 without loading into a table first. The read_parquet() and read_csv() functions query data in place.

-- Query Parquet files on S3 without loading
SELECT
    supplier_name,
    SUM(line_item_value) AS total_value,
    COUNT(DISTINCT order_id) AS order_count
FROM read_parquet('s3://warehouse/gold/purchase_orders/*.parquet',
                  hive_partitioning = true)
WHERE order_date >= DATE '2025-01-01'
  AND supplier_region = 'EMEA'
GROUP BY supplier_name
ORDER BY total_value DESC
LIMIT 20;

-- Union multiple CSV files
SELECT * FROM read_csv_auto('data/2025-01-*.csv')
UNION ALL
SELECT * FROM read_csv_auto('data/2025-02-*.csv')
UNION ALL
SELECT * FROM read_csv_auto('data/2025-03-*.csv');

This capability turns DuckDB into a polyglot query engine. You can join a local Parquet file with a remote S3 Parquet file and a pandas DataFrame in the same query, and DuckDB handles the execution planning.

Window Functions and Advanced SQL

DuckDB supports the full SQL standard including sophisticated window functions, common table expressions, and complex joins:

WITH weekly_metrics AS (
    SELECT
        DATE_TRUNC('week', order_date) AS week,
        customer_id,
        SUM(amount) AS weekly_revenue,
        AVG(amount) AS avg_order_value,
        COUNT(*) AS order_count,
        -- Running total
        SUM(SUM(amount)) OVER (
            PARTITION BY customer_id
            ORDER BY DATE_TRUNC('week', order_date)
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS cumulative_revenue,
        -- Rank within week
        RANK() OVER (
            PARTITION BY DATE_TRUNC('week', order_date)
            ORDER BY SUM(amount) DESC
        ) AS weekly_rank
    FROM orders
    WHERE order_date >= CURRENT_DATE - INTERVAL '12 months'
    GROUP BY 1, 2
)
SELECT *
FROM weekly_metrics
WHERE weekly_rank <= 10
ORDER BY week, weekly_rank;

Window functions are essential for analytical workloads—moving averages, running totals, ranking, lag/lead comparisons. DuckDB implements them correctly and efficiently.
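As a mental model, the running total above is a per-partition cumulative sum; in Python it corresponds to itertools.accumulate (a simplified single-partition sketch, not how DuckDB executes it):

```python
from itertools import accumulate

# Weekly revenue for one customer, ordered by week.
weekly_revenue = [100.0, 250.0, 80.0, 120.0]

# Equivalent of SUM(...) OVER (ORDER BY week ROWS BETWEEN
# UNBOUNDED PRECEDING AND CURRENT ROW) within one partition.
cumulative = list(accumulate(weekly_revenue))
print(cumulative)  # [100.0, 350.0, 430.0, 550.0]
```

The SQL version does this across every customer partition in one pass, which is exactly the kind of bookkeeping that is tedious and slow to replicate by hand.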

Extensions: PostgreSQL, Iceberg, and More

DuckDB ships with an extension framework that adds capabilities beyond the core engine.

The postgres scanner lets you query a live PostgreSQL database without ETL. You can join PostgreSQL tables with local data, push down predicates to PostgreSQL to minimize data transfer, and use DuckDB’s analytical engine for aggregations.

# Query PostgreSQL alongside local data
con.execute("""
    CREATE TABLE local_lookup AS
    SELECT * FROM postgres_scan('host=db port=5432 dbname=warehouse',
                                 'public', 'product_categories');
""")

# Join PostgreSQL source with local table
result = con.execute("""
    SELECT
        p.category,
        p.brand,
        l.lookup_value,
        COUNT(*) AS event_count
    FROM read_parquet('s3://events/2025/**/*.parquet') e
    JOIN local_lookup l ON e.product_id = l.product_id
    JOIN postgres_scan('host=db', 'public', 'products') p
        ON e.product_id = p.product_id
    GROUP BY 1, 2, 3
""").df()

The Iceberg extension reads Apache Iceberg tables directly, enabling DuckDB to query lakehouse data without a dedicated query engine.

When DuckDB Is the Right Tool

DuckDB excels at analytical workloads on a single machine. If you are processing data that fits on one server, DuckDB is simpler than a Spark cluster and faster than pandas for large datasets. Data scientists exploring large CSV exports, analysts running complex SQL on local Parquet files, applications that need embedded analytics—these are DuckDB’s sweet spots.

DuckDB is not a replacement for distributed query engines when you need to process terabytes across multiple machines. It is not a transactional database (use SQLite for that). It does not have a distributed filesystem or cluster coordination built in.

DuckDB in the Analytical Toolchain

Understanding where DuckDB fits relative to other tools helps you make better architectural decisions.

flowchart LR
    subgraph "Single-Machine Tools"
        DuckDB[DuckDB]
        SQLite[SQLite]
        Pandas[Pandas]
    end

    subgraph "Distributed Engines"
        Spark[Apache Spark]
        Snowflake[Snowflake]
        BigQuery[BigQuery]
        Redshift[Amazon Redshift]
    end

    Pandas -->|too slow for >10M rows| DuckDB
    SQLite -->|no columnar, no analytical perf| DuckDB
    DuckDB -->|need cluster scale| Spark
    DuckDB -->|need cloud warehouse| Snowflake
    Spark -->|small data, local perf| DuckDB

DuckDB sits between pandas (single-machine, row-based) and Spark (cluster, distributed). It wins when your data fits on one machine but exceeds what pandas can handle efficiently.

When to Use DuckDB

Use DuckDB for:

  • Datasets that fit in RAM but are too large for pandas to handle at speed (typically >5 million rows)
  • SQL analytical queries on local Parquet or CSV files
  • Embedded analytics inside applications without running a database server
  • Querying across multiple remote sources (S3, PostgreSQL, Iceberg) without ETL
  • Faster aggregations than pandas during exploratory data analysis

Avoid DuckDB when:

  • Your data needs distributed processing across multiple machines
  • You require concurrent write transactions (reach for PostgreSQL or SQLite instead)
  • You need a client-server model with simultaneous connections
  • You are processing streaming data in real time (Flink, Spark Streaming, and Kafka Streams are built for this)

Production Failure Scenarios

DuckDB runs in your application process. When it runs out of memory or hits a lock conflict, your application takes the hit. Here are the scenarios that bite people in production.

Scenario 1: Concurrent Write Limitations

DuckDB is single-writer by design. Within one process, concurrent writes are serialized; a second process that tries to open the same database file for writing fails with a lock error.

What happens: Your ETL pipeline runs a DuckDB write while a separate reporting query holds a read lock. The write blocks. If the read transaction runs long, the write times out.

Mitigation:

import duckdb
import threading
import queue

class DuckDBWriter:
    """Serialize all writes through one background thread and one connection"""
    def __init__(self, db_path):
        self.db_path = db_path
        self.write_queue = queue.Queue()
        # One daemon thread owns the write connection and drains the queue
        worker = threading.Thread(target=self._process_queue, daemon=True)
        worker.start()

    def write(self, query, params=None):
        """Thread-safe write: enqueue, then block until the worker finishes"""
        result_queue = queue.Queue()
        self.write_queue.put((query, params, result_queue))

        # Wait for the write to complete
        return result_queue.get()

    def _process_queue(self):
        """Background thread that applies queued writes serially"""
        conn = duckdb.connect(self.db_path)
        while True:
            query, params, result_queue = self.write_queue.get()
            try:
                if params:
                    conn.execute(query, params)
                else:
                    conn.execute(query)
                result_queue.put({'success': True})
            except Exception as e:
                result_queue.put({'success': False, 'error': str(e)})
            finally:
                self.write_queue.task_done()

# Usage: all writes funnel through the single background connection
# writer = DuckDBWriter('analytics.db')
# writer.write("INSERT INTO events VALUES (?, ?)", [1, 'click'])

Scenario 2: Large Result Set OOM

Fetching millions of rows into a pandas DataFrame via fetchdf() loads everything into memory at once. For wide tables or large result sets, this causes OOM crashes.

What happens: You query a 50-column table with 50 million rows. DuckDB materializes the full result set in memory before converting to pandas. Your process is killed by the OS OOM killer.

Mitigation:

# BAD: Loads all rows into memory at once
result = con.execute("SELECT * FROM large_table").fetchdf()

# GOOD: Stream results in batches using fetchmany()
con.execute("SELECT * FROM large_table")   # execute the query once
while True:
    batch = con.fetchmany(100_000)         # then fetch incrementally
    if not batch:
        break
    handle_batch(batch)  # process each chunk instead of accumulating all rows

# GOOD: Use DuckDB's native Arrow export for batch streaming
import pyarrow as pa
conn = duckdb.connect()
conn.execute("INSTALL httpfs; LOAD httpfs;")
reader = conn.execute("""
    SELECT * FROM read_parquet('s3://bucket/large_table/**/*.parquet')
""").fetch_record_batch()
# Iterate over Arrow record batches without materializing the full result

Scenario 3: Extension Loading Failures in Production

The postgres scanner and Iceberg extensions must be installed and loaded before use. If an extension is not available in your deployment environment, queries that depend on it fail at load time.

What happens: Your local test with postgres_scan() works perfectly. In the production Docker container the extension was never installed and cannot be downloaded, so queries throw an IO Error when DuckDB tries to load it.

Mitigation:

import pandas as pd

def safe_postgres_scan(conn, host, port, dbname, schema, table, fallback=True):
    """Try postgres_scan with graceful fallback to a plain DB-API connection.

    schema and table must come from trusted config, not user input,
    since they are interpolated into the SQL text.
    """
    try:
        query = f"""
            SELECT * FROM postgres_scan(
                'host={host} port={port} dbname={dbname}',
                '{schema}', '{table}'
            )
        """
        return conn.execute(query).fetchdf()
    except (duckdb.Error, IOError):
        if not fallback:
            raise
        # Fallback: use a standard DB-API connection instead of the extension
        import psycopg2
        pg_conn = psycopg2.connect(host=host, port=port, database=dbname)
        return pd.read_sql(f"SELECT * FROM {schema}.{table}", pg_conn)

Scenario 4: S3 Query Performance on Wide Files

Querying remote Parquet on S3 without partition pruning reads entire datasets. For wide (hundreds of columns) and deep (millions of rows) files, network transfer becomes the bottleneck.

What happens: SELECT * FROM read_parquet('s3://bucket/table/*.parquet') reads every column of every file. Without partition pruning or column projection, DuckDB must transfer most of the dataset over the network just to filter it down to 100 rows.

Mitigation:

# Always project only needed columns and filter early
result = con.execute("""
    SELECT
        customer_id,
        order_date,
        total_amount
    FROM read_parquet('s3://warehouse/gold/orders/**/*.parquet')
    WHERE order_date >= DATE '2025-01-01'
      AND total_amount > 100
""").fetchdf()

# Use Hive partitioning for partition pruning
# Path structure: s3://bucket/table/year=2025/month=01/day=15/
result = con.execute("""
    SELECT * FROM read_parquet('s3://bucket/table/**/*.parquet', hive_partitioning=true)
    WHERE year = 2025 AND month = 1
""").fetchdf()

Trade-Off Table: DuckDB vs Alternatives

| Dimension | DuckDB | Apache Spark | Pandas |
|---|---|---|---|
| Deployment | In-process library | Cluster / single-node | In-memory library |
| Setup complexity | Zero config | Requires cluster | Zero config |
| Concurrent writes | Single writer only | Multiple writers | Single writer |
| Scaling | Single machine | Multi-node cluster | Single machine |
| Performance (large data) | Excellent for single-machine | Excellent at scale | Poor beyond ~10M rows |
| Remote data queries | S3, PostgreSQL, Iceberg | S3, JDBC, native | Requires load |
| Execution model | Columnar, vectorized | Distributed columnar | Eager, whole-DataFrame |
| Best for | Local analytics, embedded BI | Petabyte lakehouse | Small data, prototyping |

Capacity Estimation

DuckDB’s memory requirements scale with your data and query complexity.

Memory Formula

Working Memory ≈ (Vector Batch Size × Column Width × Concurrent Operations) + Sort Buffers

Vector Batch Size = 2048 rows (default)
Column Width = average bytes per value per column

Example: 50-column table, 10M rows, avg 20 bytes/column value
==============================================================
Raw data in memory: 50 × 10,000,000 × 20 bytes = 10 GB
Vector batch working set: 2048 × 50 × 20 = 2 MB per batch
Sort buffer (8-way): 8 × 256 MB = 2 GB
Estimated peak: 12-15 GB RAM needed
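The arithmetic above can be wrapped in a quick back-of-the-envelope helper (a sketch using the example's constants and decimal GB; real peaks run higher due to engine overhead, hence the 12-15 GB estimate):

```python
def estimate_peak_memory_gb(n_rows, n_cols, bytes_per_value,
                            vector_size=2048, sort_ways=8, sort_buffer_mb=256):
    """Rough peak-memory estimate mirroring the formula above (decimal GB)."""
    raw = n_rows * n_cols * bytes_per_value            # full dataset in memory
    batch = vector_size * n_cols * bytes_per_value     # per-batch working set
    sort = sort_ways * sort_buffer_mb * 1_000_000      # sort/merge buffers
    return (raw + batch + sort) / 1e9

# 50-column table, 10M rows, avg 20 bytes per value
print(round(estimate_peak_memory_gb(10_000_000, 50, 20), 1))  # 12.1
```

Treat the result as a floor, not a ceiling: hash tables for GROUP BY and intermediate results add on top of it.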

Dataset Size Guidelines

| Dataset Size | Recommended Configuration | Expected Performance |
|---|---|---|
| <100 MB | Default DuckDB settings | Sub-second queries |
| 100 MB – 1 GB | Default settings | 1-5 second queries |
| 1 GB – 10 GB | 4-8 GB memory limit | 5-30 second queries |
| 10 GB – 50 GB | 16-32 GB RAM | 30 s – 2 min queries |
| 50 GB+ | Consider Spark or a distributed engine | Depends on architecture |

Vector Size and Memory Tuning

DuckDB processes data in vectors of 2048 rows. The vector size is fixed at build time, not a runtime knob; what you tune in practice is the memory limit and thread count:

# Set memory limit to prevent runaway queries
con.execute("SET memory_limit = '8GB'")

# Control parallelism (defaults to the number of CPU cores)
con.execute("SET threads = 8")

# Inspect the effective settings
result = con.execute("""
    SELECT name, value FROM duckdb_settings()
    WHERE name IN ('memory_limit', 'threads')
""").fetchdf()
print(result)

Observability Hooks

Track query performance so you catch regressions before they become incidents.

import time
import logging

logger = logging.getLogger(__name__)

class ObservedConnection:
    """Wrapper that instruments DuckDB queries"""
    def __init__(self, db_path=None, memory_limit_gb=8):
        self.con = duckdb.connect(db_path)
        self.con.execute(f"SET memory_limit = '{memory_limit_gb}GB'")

    def execute(self, query, params=None):
        start_time = time.time()
        memory_before = self._get_memory_usage()

        try:
            if params:
                result = self.con.execute(query, params).fetchdf()
            else:
                result = self.con.execute(query).fetchdf()

            elapsed = time.time() - start_time
            memory_after = self._get_memory_usage()

            logger.info(
                f"query_duration_ms={elapsed * 1000:.1f} "
                f"memory_delta_mb={(memory_after - memory_before) / 1e6:.1f} "
                f"rows={len(result)}"
            )

            return result
        except Exception as e:
            elapsed = time.time() - start_time
            logger.error(f"query_failed after {elapsed * 1000:.1f}ms: {e}")
            raise

    def _get_memory_usage(self):
        """Bytes currently held by DuckDB's buffer manager"""
        row = self.con.execute(
            "SELECT COALESCE(SUM(memory_usage_bytes), 0) FROM duckdb_memory()"
        ).fetchone()
        return row[0]

What to monitor:

  • Query duration — watch for regressions on queries that previously ran fast
  • Memory delta per query — sudden spikes mean OOM risk is rising
  • Row count returned — zero results when you expected rows means the data is missing
  • Extension load failures in logs — postgres_scan failing silently in production is a nasty surprise

Security Checklist

DuckDB runs in your application process. Think about these production concerns:

  • No built-in authentication. DuckDB assumes a trusted environment; layer access control at the application level
  • S3 credentials. Prefer IAM roles for EC2 and ECS over static keys. For local work, aws sso login or short-lived environment variable tokens are safer
  • Data at rest. DuckDB database files are unencrypted by default. If you are handling sensitive data, use filesystem encryption — LUKS on Linux, BitLocker on Windows
  • Query injection. Parameterize any query that accepts user input. The Python bindings support con.execute("SELECT ?", (user_input,)) which handles this safely
  • Extension sources. Only load extensions you trust. Malicious extensions run with the same privileges as your process
  • Crash dumps. Core dumps and crash reports can contain query result data. Configure your production environment to handle or suppress these appropriately

Common Anti-Patterns

1. Using DuckDB as a Transactional Database

DuckDB allows only a single writing process at a time. If you need multi-user read/write transactions, use PostgreSQL. DuckDB is OLAP, not OLTP.

2. Fetching Full Tables into Pandas for Export

con.execute("SELECT * FROM table").fetchdf() loads everything into memory before converting to pandas. For large tables, use fetchmany() batches, Arrow readers, or export directly to Parquet with COPY.

# Export to Parquet instead of fetching into pandas
con.execute("""
    COPY (SELECT * FROM large_table WHERE date >= '2025-01-01')
    TO 's3://bucket/export/' (FORMAT PARQUET, PER_THREAD_OUTPUT=true)
""")

3. Ignoring Memory Limits in Long-Running Services

Without a memory limit, a runaway aggregation can consume all system RAM and crash your service. Always set memory_limit.

4. Using DuckDB for Real-Time Streaming

DuckDB processes bounded datasets. For streaming, use Kafka Streams, Flink, or Spark Streaming. You can periodically batch-load data from a Kafka topic into DuckDB with external tooling, but it is not built for streaming workloads.

5. Loading Data Before Querying

Newcomers often load data into DuckDB with CREATE TABLE AS SELECT when DuckDB can query data in place via read_parquet(), read_csv_auto(), and extension scanners. Loading copies data unnecessarily.

# Unnecessary: copy data into DuckDB
con.execute("CREATE TABLE sales AS SELECT * FROM read_parquet('s3://data/sales.parquet')")

# Better: query directly
result = con.execute("""
    SELECT * FROM read_parquet('s3://data/sales.parquet')
    WHERE sale_date >= '2025-01-01'
""").fetchdf()

Quick Recap

Key Takeaways:

  • DuckDB is an in-process, serverless analytical database — no config, no infrastructure, no operational overhead, and performance that embarrasses pandas for single-machine workloads
  • Columnar vectorized execution handles 10M–100M row aggregations at speeds competitive with distributed engines on the same hardware
  • Zero-copy pandas integration means existing pandas code gets faster without any API changes
  • The extension system (PostgreSQL scanner, Iceberg, httpfs) lets you query S3, PostgreSQL, and local files in a single JOIN without moving data
  • Set memory limits, use batched fetching for large result sets, and do not use it as a transactional database

When DuckDB is the Right Tool:

  • Local analytical queries on data too large for pandas
  • Embedded analytics in desktop or server applications
  • Quick exploration of remote data without ETL
  • Polyglot joins across S3, PostgreSQL, and local files

When to Choose Alternatives:

  • Distributed scale → Apache Spark
  • Cloud data warehouse → Snowflake, BigQuery, Redshift
  • Concurrent writes → PostgreSQL
  • Streaming/real-time → Kafka Streams, Flink, Spark Streaming

For cluster-scale processing, see Apache Spark. For portable pipelines across runners, read about Apache Beam.

Conclusion

DuckDB is what SQLite would be if designed for analytics in 2025. No configuration, no infrastructure, no operational overhead. Run it on your laptop, in a serverless function, inside a Spark executor, or in the browser via WebAssembly. The same SQL runs everywhere with the same performance characteristics on single-machine data.

For most data engineering tasks that do not require petabyte scale, DuckDB is worth trying before defaulting to Spark. It is faster than pandas for large local datasets, simpler to operate than any distributed engine, and more capable than SQLite for analytical workloads.
