Skip to content
Go back

DuckDB for Local Analytics: Pandas on Steroids

By SumGuy 10 min read
DuckDB for Local Analytics: Pandas on Steroids

You Don’t Need a Data Lake for 50 Million Rows

Somewhere along the way, “analytics” became synonymous with “spin up a Kafka cluster, a Spark fleet, and a BI platform that costs more than your mortgage.” You’ve got a 5 GB CSV from your Frigate camera events. You want to know which hour of the day gets the most motion triggers. Someone in a conference room is about to suggest you “build a data lake.”

Don’t let them.

DuckDB is what happens when the SQLite team looks at the analytics world and says, “this is embarrassing.” It’s a single binary, in-process columnar OLAP database that reads CSV, Parquet, JSON, and Arrow directly—no import, no server, no YAML manifests for a Kubernetes operator. You run a SQL query against a file on disk and get an answer in seconds.

Let’s get into it.


What DuckDB Actually Is

Think SQLite, but built for the query patterns that SQLite hates: full table scans, group-bys, aggregations, window functions over 50 million rows. SQLite is row-oriented and optimized for transactional workloads—lots of small reads and writes to individual records. DuckDB is column-oriented, which means when you SELECT sum(amount) FROM sales, it only reads the amount column from disk. The other 40 columns don’t exist for that query.

They’re complements, not competitors. The DuckDB team will tell you this themselves. Use SQLite for your application’s operational data. Use DuckDB when you need to ask questions about that data at scale.

DuckDB 1.x (the current stable line as of 2026) ships with:

Zero servers. Zero configuration. One file or no file at all.


Installing It

Three reasonable ways depending on your setup:

Terminal window
# Python ecosystem (recommended for data work)
pipx install duckdb
# macOS / Linux with Homebrew
brew install duckdb
# Or just grab the static binary
wget https://github.com/duckdb/duckdb/releases/latest/download/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip
chmod +x duckdb
sudo mv duckdb /usr/local/bin/

Then pop into the CLI:

Terminal window
duckdb mydb.duckdb

No database file? No problem—just run duckdb with no arguments and you get an in-memory session. Perfect for one-off queries against files you’ll never import.


Query a CSV Without Importing It

This is where DuckDB earns its keep immediately. You’ve got nginx_access.csv sitting there, maybe 2 million rows. In the old workflow you’d load it into pandas, wait, curse the memory usage, then ask your question.

In DuckDB:

SELECT
strftime(timestamp, '%H') AS hour,
count(*) AS hits,
count(*) FILTER (WHERE status >= 500) AS errors
FROM read_csv_auto('nginx_access.csv')
GROUP BY hour
ORDER BY hits DESC;

DuckDB reads only the columns it needs, streams the file, and projects away everything else. read_csv_auto sniffs the schema automatically—types, delimiter, header row, the works. If it guesses wrong you can override with read_csv('file.csv', delim='|', header=true, columns={'ts': 'TIMESTAMP', 'ip': 'VARCHAR'}).

You can also just drop the function call entirely and use the filename as a table:

SELECT * FROM 'big_file.csv' WHERE response_time_ms > 2000 LIMIT 20;

That’s it. That’s the whole import story.


Parquet: DuckDB’s Native Language

If CSV is duct tape, Parquet is load-bearing steel. DuckDB treats Parquet as a first-class citizen with full predicate pushdown—it’ll skip entire row groups that don’t match your WHERE clause without reading them from disk.

-- Local Parquet
SELECT
customer_id,
sum(amount) AS total_spent
FROM read_parquet('sales_2026.parquet')
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 10;
-- S3/R2 with httpfs extension
INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'auto';
SET s3_endpoint = 'your-account.r2.cloudflarestorage.com';
SET s3_access_key_id = 'YOUR_KEY';
SET s3_secret_access_key = 'YOUR_SECRET';
SELECT
customer_id,
sum(amount) AS total_spent
FROM 's3://your-bucket/data/year=2026/*.parquet'
GROUP BY 1
ORDER BY 2 DESC;

That glob pattern on the S3 path? DuckDB handles partition pruning. If your Parquet files are partitioned by year/month and your WHERE clause filters on those columns, DuckDB skips files it doesn’t need entirely. It’s not magic—it’s just smart.

On a modern laptop, a 5 GB Parquet file with 50 million rows, running a group-by aggregation, finishes in roughly 2 seconds. Not “fast for Python” fast. Actually fast.

Want to see why? Run EXPLAIN ANALYZE on any query:

EXPLAIN ANALYZE
SELECT region, sum(revenue)
FROM read_parquet('sales.parquet')
GROUP BY region;

You’ll get a physical plan showing row counts, timing per operator, and where time is actually spent. It’s not beautiful output, but it tells you exactly what the engine decided to do.


Pandas Integration: Zero-Copy via Arrow

Here’s where the Python crowd perks up. DuckDB’s Python API can read from—and write to—pandas DataFrames directly via Apache Arrow. No serialization. The data doesn’t move.

duckdb_pandas.py
import duckdb
import pandas as pd
# Your existing DataFrame
df_in = pd.read_csv("events.csv")
# Query it with SQL — DuckDB sees the variable by name
result = duckdb.sql("SELECT event_type, count(*) AS n FROM df_in GROUP BY 1").df()
print(result)

That df_in in the SQL string? DuckDB finds it in your Python scope automatically. No registration step, no .to_sql(), no SQLAlchemy URI. It just works.

Going the other way—DuckDB result back to pandas—is equally clean:

to_pandas.py
import duckdb
conn = duckdb.connect("analytics.duckdb")
# Returns a pandas DataFrame
df = conn.sql("""
SELECT
date_trunc('hour', ts) AS hour,
avg(response_ms) AS p50_approx
FROM read_parquet('logs/*.parquet')
GROUP BY 1
ORDER BY 1
""").df()

When does pandas still win? Interactive Jupyter cell-by-cell munging where you want to inspect intermediate state. Plotting—matplotlib and seaborn speak pandas natively. And for truly small datasets (under ~100K rows), the overhead difference is irrelevant and pandas is more ergonomic. DuckDB wins the moment you’re doing aggregations, joins, or window functions at scale.


JSON: Application Logs Without Preprocessing

Your application dumps JSON event logs. You want to ask questions about them without standing up Elasticsearch or hand-writing a parser.

-- Read a newline-delimited JSON log
SELECT
json_extract_string(data, '$.user.email') AS email,
json_extract_string(data, '$.event') AS event_type,
count(*) AS occurrences
FROM read_json_auto('events.jsonl')
GROUP BY 1, 2
ORDER BY 3 DESC;

read_json_auto handles both JSON arrays and newline-delimited JSONL. For deeply nested structures, json_extract and json_extract_string let you pull specific paths. If the schema is consistent enough, DuckDB will auto-infer columns from the JSON structure and let you query them like regular columns—no extraction needed.

This is extremely useful for Frigate event logs, Home Assistant history exports, or any application that logs structured JSON.


Iceberg and Delta Lake (Because You Asked)

For the home labbers who’ve gone deep on modern table formats, DuckDB speaks both:

-- Apache Iceberg
INSTALL iceberg;
LOAD iceberg;
SELECT count(*), max(event_time)
FROM iceberg_scan('s3://your-bucket/warehouse/events/');
-- Delta Lake
INSTALL delta;
LOAD delta;
SELECT * FROM delta_scan('s3://your-bucket/delta/sales/')
WHERE year = 2026;

Both extensions handle snapshot isolation, schema evolution, and partition pruning. If you’re already running these table formats (maybe via Spark or Trino at work and you’ve exported some tables), DuckDB can query them directly with zero additional infrastructure.


Real Home Lab Use Cases

Let’s get concrete about where this actually earns its place on your server.

Nginx access logs as CSV. Drop a log_format csv directive in your nginx config, or just pipe the default access log through a quick transform, and suddenly every request is a queryable row. Find your top 10 slowest endpoints in seconds.

Frigate camera event database. Frigate stores events in SQLite. DuckDB can attach to it:

ATTACH '/data/frigate/frigate.db' AS frigate (TYPE SQLITE);
SELECT
camera,
label,
count(*) AS events,
avg(score) AS avg_confidence
FROM frigate.events
WHERE start_time > epoch_ms(unixepoch('now', '-7 days') * 1000)
GROUP BY camera, label
ORDER BY events DESC;

Yes, DuckDB can cross-join against a live SQLite file. Frigate stays running. You get analytics.

Jellyfin playback history. Same trick—Jellyfin uses SQLite. Attach it, query your own watch history, figure out which shows you actually finish versus abandon at episode 3.

Restic backup manifest comparison. Restic can export snapshot metadata as JSON. Dump two snapshots, diff them with DuckDB to find files that changed between backups.

Prometheus metrics dump. Export a metric time series as JSON or CSV, load into DuckDB, run window functions to find anomalies without PromQL gymnastics.


Embedding DuckDB: Quick Python ETL Script

Here’s a real pattern for ingesting JSON event logs and writing aggregated Parquet—the kind of lightweight ETL you’d run as a cron job:

summarize_events.py
import duckdb
from datetime import date
conn = duckdb.connect() # in-memory
conn.execute("""
CREATE TABLE events AS
SELECT *
FROM read_json_auto('/var/log/app/events.jsonl')
""")
conn.execute(f"""
COPY (
SELECT
json_extract_string(data, '$.event_type') AS event_type,
date_trunc('hour', CAST(json_extract_string(data, '$.ts') AS TIMESTAMP)) AS hour,
count(*) AS n,
avg(CAST(json_extract_string(data, '$.duration_ms') AS DOUBLE)) AS avg_duration_ms
FROM events
GROUP BY 1, 2
ORDER BY 2, 1
)
TO '/data/analytics/events_{date.today()}.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD)
""")
print("Done.")

Run this nightly. After a month you’ve got 30 daily Parquet files you can query in aggregate:

SELECT event_type, sum(n) AS total
FROM read_parquet('/data/analytics/events_*.parquet')
GROUP BY 1
ORDER BY 2 DESC;

No Airflow. No Spark. No Kubernetes operator. Just a cron job and a folder.


DuckDB vs. Pandas: The Real Comparison

DuckDBPandas
ExecutionLazy, vectorized, parallelEager, single-threaded by default
MemoryPredictable, spillable to diskLoads everything upfront
SQLFull ANSI SQL + window functions.groupby(), .merge(), .agg()
Schema enforcementStrong typingDtype chaos (object columns, hello)
File I/ORead CSV/Parquet/JSON natively, streamingpd.read_csv loads entire file
PlottingNo nativeNative matplotlib/seaborn integration
Jupyter UXSQL strings in cellsChainable operations feel natural

The short version: if you’re writing a for loop over a DataFrame to do an aggregation, stop. That’s a GROUP BY. Use DuckDB. If you’re doing interactive exploration where you want to see the DataFrame shape after each transformation and then plot it, pandas is still your friend.


Should You Bother?

Yes, if any of these describe you:

No, if:

DuckDB is the kind of tool that makes you feel slightly embarrassed about how much complexity you were tolerating before. It doesn’t replace your database—it replaces the 200-line Python script you wrote to answer a question you could have answered with two lines of SQL.

Install it. Query a CSV. You’ll understand in about 90 seconds.


Share this post on:

Send a Webmention

Written about this post on your own site? Send a webmention and it'll show up above once verified.


Next Post
Boundary vs Teleport

Discussion

Powered by Garrul . Sign in with GitHub or Google, or post anonymously.

Related Posts