DuckDB vs Pandas vs Polars: Python Data Analysis Compared

Compare DuckDB, Pandas, and Polars for Python data analysis — performance benchmarks, architecture differences, and which fits your workflow.

This post was written by an engineer at QueryPlane. QueryPlane is an app builder for your database: bring your own postgres db and you can create interactive applications to share with other developers, coworkers or even your customers. If you’re interested in trying it out, get started here.

If you’re working with data in Python, you’ve probably used Pandas. It’s been the default for over a decade—the library everyone learns first, with an ecosystem of tutorials, Stack Overflow answers, and integrations that nothing else matches. But Pandas has well-known limitations: it’s single-threaded, uses too much memory, and slows down as datasets grow.

Two newer tools have emerged as serious alternatives. DuckDB is an in-process analytical database that lets you query data with SQL. Polars is a DataFrame library written in Rust that focuses on speed and memory efficiency. Both can handle datasets that would bring Pandas to its knees.

The question isn’t which one is “best”—it’s which one fits your workflow.

In this post, we’ll cover:

What each tool is designed for - The architectural differences that drive performance
Performance comparisons - Benchmarks on real-world operations
Interoperability - How all three tools work together
DuckDB beyond DataFrames - Querying files directly, ETL pipelines, and browser support
When to use each - A practical decision framework

Architectural differences

The performance gaps between these tools aren’t accidental—they come from fundamentally different designs.

Pandas

Pandas was created by Wes McKinney at AQR Capital Management in 2008 and open-sourced in 2009. It’s built on top of NumPy, which means it stores data in row-oriented NumPy arrays. Operations are mostly single-threaded and execute eagerly—every operation runs immediately and returns a result.

Pandas 2.x introduced optional Apache Arrow backends and copy-on-write semantics, which reduce memory usage in some scenarios. But the core execution model hasn’t changed. For datasets that fit comfortably in memory (up to a few hundred megabytes), Pandas is fast enough and its API is the most mature and well-documented.

The memory problem is real, though. Pandas typically needs 5-10x the size of your dataset in RAM because of intermediate copies, type inference, and the way it represents strings and nullable types. A 2 GB CSV file might require 10-20 GB of RAM to process.

Polars

Polars was created by Ritchie Vink in 2020. It’s written in Rust and built on Apache Arrow’s columnar memory format. Polars is multi-threaded by default—it automatically parallelizes operations across all available CPU cores.

The key architectural difference is lazy evaluation. Polars has two APIs: an eager API (like Pandas, where operations execute immediately) and a lazy API, where operations are recorded as a query plan and optimized before execution. The query optimizer can reorder operations, push down filters, eliminate unnecessary columns, and fuse operations together.

import polars as pl

# Lazy evaluation: build a plan, then execute
result = (
    pl.scan_csv("events.csv")
    .filter(pl.col("timestamp") > "2024-01-01")
    .group_by("event_type")
    .agg(pl.col("duration").mean())
    .sort("duration", descending=True)
    .collect()  # execute the optimized plan
)

Polars also handles larger-than-memory datasets through its streaming engine. When you use scan_csv instead of read_csv, Polars processes the file in chunks rather than loading it entirely into memory.

DuckDB

DuckDB was created by Mark Raasveldt and Hannes Mühleisen at CWI Amsterdam (the same research institute where PostgreSQL’s predecessor, Ingres, originated). DuckDB is an in-process OLAP (Online Analytical Processing) database—think of it as “SQLite for analytics.”

Like SQLite, DuckDB runs inside your application process with no separate server. But where SQLite is row-oriented and optimized for transactional workloads (OLTP), DuckDB is columnar and optimized for analytical queries—aggregations, joins, window functions, and scans over large datasets. It uses vectorized execution, processing data in batches of columns rather than one row at a time.

DuckDB’s most distinctive feature is its ability to query data in place. You can run SQL directly against CSV files, Parquet files, JSON files, and even Pandas DataFrames—without importing anything first.

import duckdb

# Query a Parquet file directly—no import step
result = duckdb.sql("""
    SELECT event_type, AVG(duration) as avg_duration
    FROM 'events.parquet'
    WHERE timestamp > '2024-01-01'
    GROUP BY event_type
    ORDER BY avg_duration DESC
""")

Performance comparisons

Benchmarks always come with caveats—results depend on hardware, dataset size, data types, and the specific operations being tested. That said, the performance gaps between these tools are large enough that the overall picture is clear.

On the H2O.ai database-like operations benchmark, which tests groupby and join operations on datasets from 0.5 GB to 50 GB, DuckDB and Polars consistently outperform Pandas by 5-50x depending on the operation and dataset size. DuckDB and Polars trade wins depending on the specific query—DuckDB tends to be faster on complex joins and aggregations, while Polars excels on simpler transformations.

For a 5 GB dataset with a basic groupby-sum operation, representative results look roughly like this:

Tool	Time	Memory
Pandas	~45s	~25 GB
Polars	~2s	~6 GB
DuckDB	~2s	~4 GB

The memory difference is often more important than the speed difference. Pandas’ 5x memory overhead means you hit out-of-memory errors much sooner. Both DuckDB and Polars can process datasets larger than available RAM—DuckDB through its disk-spilling query engine, Polars through its streaming mode.

On smaller datasets (under 100 MB), the performance gap narrows. Pandas is fast enough for interactive analysis at this scale, and its startup overhead is lower than DuckDB’s query parsing. If you’re doing quick exploratory work on a small CSV, Pandas is still fine.

Interoperability

These tools aren’t mutually exclusive. One of DuckDB’s most powerful features is its ability to query Pandas and Polars DataFrames directly, with zero-copy access through Apache Arrow.

import pandas as pd
import polars as pl
import duckdb

# Create a Pandas DataFrame
pdf = pd.read_csv("events.csv")

# Query it with DuckDB SQL—no import or conversion
result = duckdb.sql("""
    SELECT event_type, COUNT(*) as count
    FROM pdf
    GROUP BY event_type
    HAVING count > 100
""").fetchdf()  # returns a Pandas DataFrame

This means you can use Pandas for data loading and manipulation where its API is convenient, then switch to DuckDB for heavy analytical queries. The conversion is essentially free because both can share Arrow memory.

Polars DataFrames work the same way:

# Polars DataFrame queried with DuckDB
plf = pl.read_csv("events.csv")

result = duckdb.sql("""
    SELECT * FROM plf WHERE duration > 1000
""").pl()  # returns a Polars DataFrame

See what QueryPlane can build for you

Connect to your database, write SQL with AI, and build shareable apps — all from your browser.

Get Started Book a Demo

DuckDB beyond DataFrames

DuckDB’s value extends well beyond being a fast SQL engine for DataFrames.

Direct file querying

DuckDB reads CSV, Parquet, JSON, and even Excel files directly. You can query multiple files at once using glob patterns, and DuckDB will handle schema detection automatically.

-- Query all Parquet files in a directory
SELECT date_trunc('month', timestamp) as month,
       COUNT(*) as events
FROM 'data/events/*.parquet'
GROUP BY month
ORDER BY month;

-- Query a remote CSV over HTTP
SELECT * FROM 'https://example.com/data.csv' LIMIT 10;

ETL with SQL

A growing use case for DuckDB is replacing multi-step Python ETL scripts with pure SQL. DuckDB can read from one format and write to another, handling transformations in between.

-- Read CSV, transform, write to Parquet
COPY (
    SELECT
        user_id,
        event_type,
        timestamp::TIMESTAMP as event_time,
        COALESCE(duration, 0) as duration
    FROM 'raw_events.csv'
    WHERE timestamp IS NOT NULL
) TO 'clean_events.parquet' (FORMAT PARQUET);

The COALESCE(duration, 0) swaps NULL durations for 0 so downstream aggregates do not get poisoned — see our COALESCE in SQL guide for the full pattern. For data pipelines that are mostly extract-transform-load, this can eliminate hundreds of lines of Python. DuckDB also supports reading from and writing to S3-compatible object stores directly, with the httpfs extension.

-- Read from S3, write back to S3
COPY (
    SELECT * FROM 's3://my-bucket/raw/*.parquet'
    WHERE event_date >= '2024-01-01'
) TO 's3://my-bucket/processed/events.parquet';

DuckDB in the browser

DuckDB-WASM compiles DuckDB to WebAssembly, making it possible to run analytical SQL queries entirely in the browser. This enables privacy-first analytics dashboards, local data exploration tools, and interactive SQL playgrounds without sending data to a server.

The DuckDB Shell is a working demo—a full SQL shell running in your browser tab. Projects like Observable and Evidence use DuckDB-WASM to power client-side data visualization.

There are limitations: WASM runs single-threaded in the browser (though worker threads help), memory is capped by the browser tab’s allocation, and some extensions aren’t available. But for datasets under a few hundred megabytes, the performance is surprisingly good.

When to use each

Use Pandas when:

Your dataset is under ~500 MB and fits comfortably in memory
You need maximum ecosystem compatibility (scikit-learn, matplotlib, seaborn, statsmodels all expect Pandas DataFrames)
You’re doing quick exploratory analysis in a notebook
You need time series functionality—Pandas’ DatetimeIndex and resampling are still best-in-class
Your team already knows Pandas and the code is working fine

Use Polars when:

You prefer a DataFrame API over SQL
Your dataset is too large for Pandas (multi-GB)
You want lazy evaluation and automatic query optimization
You need consistent multi-threaded performance
You’re building data pipelines where Polars’ streaming mode handles larger-than-memory data

Use DuckDB when:

You prefer SQL or work in a SQL-heavy environment
You want to query files directly without a loading step (CSV, Parquet, JSON)
You’re building ETL pipelines and want to replace Python transformation code with SQL
You need to join across multiple data sources (files, DataFrames, databases)
You want analytical queries on data that’s already in PostgreSQL—DuckDB can attach to Postgres and query it directly

Use them together when:

The most pragmatic approach is often a combination. Load and clean data with Pandas or Polars, run complex analytical queries with DuckDB’s SQL, and use Polars for performance-critical transformations in production pipelines. The zero-copy interoperability through Apache Arrow makes this practical with minimal overhead.

Wrapping up

The Python data tooling landscape has shifted. Pandas is still valuable for small datasets and quick exploration, but it’s no longer the only option—and for anything beyond a few hundred megabytes, it probably shouldn’t be the default.

Polars and DuckDB represent two different philosophies for solving the same problem. Polars gives you a modern DataFrame API with lazy evaluation and Rust-powered performance. DuckDB gives you SQL and the ability to query anything—files, DataFrames, remote databases—as if it were a table.

If you’re working with data in PostgreSQL, DuckDB’s Postgres extension is worth exploring. It lets you query your Postgres database directly from DuckDB, combining Postgres’ transactional capabilities with DuckDB’s analytical speed. For more on getting the most out of PostgreSQL, check out our posts on choosing a primary key strategy and top PostgreSQL extensions.

For local-first analytics use cases—where you want to process data on the client without sending it to a server—both DuckDB-WASM and SQLite-WASM are options. See our post on local-first databases for more on this emerging architecture.