Skip to content
Go back

ClickHouse vs DuckDB vs StarRocks: Light OLAP

By SumGuy 11 min read
ClickHouse vs DuckDB vs StarRocks: Light OLAP

You Don’t Need Snowflake

You’ve got 100 million rows of event logs sitting in a directory somewhere. Your Grafana dashboard is choking on Postgres aggregation queries. Someone on Reddit said ClickHouse is the move. Someone else said DuckDB. Your coworker who just got back from a conference said StarRocks is the future.

Honestly? They’re all right. They’re also all wrong, depending on what you’re actually trying to do.

This is the comparison I wish existed when I was staring at a pile of JSON logs wondering how to make sense of them without spinning up a data warehouse that costs more than my rent.

Three engines. One decision. Let’s go.


The Contenders

ClickHouse 25.x — The Production Workhorse

ClickHouse is a columnar OLAP database that’s been eating the competition’s lunch for a decade. It’s genuinely fast on aggregations over huge datasets — think Grafana metric dashboards, user event analytics, log ingestion pipelines.

The core primitive is the MergeTree engine family. You pick a variant based on what you need:

The “eventually” part of ReplacingMergeTree trips people up constantly. You insert a row, insert an update, and the old version is still visible until a background merge runs. Always query with FINAL or use max(version) if you need consistent deduplication right now.

-- ClickHouse: create an event table with deduplication
CREATE TABLE events
(
event_id String,
user_id UInt64,
event_type LowCardinality(String),
ts DateTime,
properties String -- JSON blob
)
ENGINE = ReplacingMergeTree()
ORDER BY (event_type, user_id, ts);
-- Insert a batch
INSERT INTO events
SELECT
generateUUIDv4(),
rand() % 10000,
['click','view','purchase'][rand() % 3 + 1],
now() - toIntervalSecond(rand() % 86400),
'{}'
FROM numbers(1000000);
-- Query with FINAL to force deduplication
SELECT
event_type,
toStartOfHour(ts) AS hour,
count() AS events,
uniqExact(user_id) AS users
FROM events FINAL
WHERE ts >= now() - INTERVAL 24 HOUR
GROUP BY event_type, hour
ORDER BY hour DESC;

On a single node with 2–3 GB RAM, ClickHouse handles this query over 100M rows in under a second. Legitimately under a second. It’s not magic — it’s column storage + vectorized execution + aggressive compression.

Async inserts (enabled since 22.x, mature in 25.x) let you fire-and-forget small batches without hammering the server. Rows get buffered and merged in the background:

SET async_insert = 1;
SET wait_for_async_insert = 0;

The weak spot historically has been JOINs. ClickHouse was designed around denormalized wide tables, not relational schemas. Large JOIN performance has improved significantly in 24.x+ with hash join improvements, but if your workload is fundamentally join-heavy across normalized tables, read the StarRocks section carefully.

Docker Compose:

services:
clickhouse:
image: clickhouse/clickhouse-server:25.4
container_name: clickhouse
ports:
- "8123:8123" # HTTP interface
- "9000:9000" # Native protocol
volumes:
- clickhouse_data:/var/lib/clickhouse
- ./clickhouse-config.xml:/etc/clickhouse-server/config.d/custom.xml
environment:
CLICKHOUSE_USER: default
CLICKHOUSE_PASSWORD: yourpassword
CLICKHOUSE_DB: analytics
ulimits:
nofile:
soft: 262144
hard: 262144
volumes:
clickhouse_data:

Memory footprint: 1–3 GB at rest on a single node, depending on caches. Goes higher under load. Plan for at least 4 GB available if you’re running this alongside other services.


DuckDB 1.x — The Binary That Slaps

DuckDB is not a server. It’s a library. You embed it in your script, your Python notebook, your Go ETL pipeline — it runs in-process and disappears when your process exits.

That’s the pitch. No daemon, no connection pool, no port to open. You need to analyze a 20 GB Parquet file? Point DuckDB at it and run SQL. Done.

import duckdb
# DuckDB: analyze a parquet file directly — no server required
con = duckdb.connect()
result = con.execute("""
SELECT
event_type,
date_trunc('hour', ts::TIMESTAMP) AS hour,
count(*) AS events,
count(DISTINCT user_id) AS users
FROM read_parquet('/data/events/*.parquet')
WHERE ts::TIMESTAMP >= now() - INTERVAL '24 hours'
GROUP BY event_type, hour
ORDER BY hour DESC
""").fetchdf()
print(result.head(20))

DuckDB 1.x ships with native Parquet, CSV, JSON, Arrow, and Iceberg support. You can query remote S3 files directly. You can attach a Postgres database and join against it:

# Attach Postgres and query across systems
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("ATTACH 'dbname=mydb host=localhost user=app password=pass' AS pg (TYPE postgres)")
result = con.execute("""
SELECT
e.event_type,
u.email,
count(*) AS events
FROM read_parquet('/data/events.parquet') e
JOIN pg.users u ON e.user_id = u.id
GROUP BY e.event_type, u.email
ORDER BY events DESC
LIMIT 50
""").fetchdf()

That query pattern — Parquet on disk, metadata in Postgres, joined in DuckDB — is genuinely useful for home lab ETL work.

JOIN performance is DuckDB’s strength. It uses a vectorized hash join with adaptive spill-to-disk. You can throw complex multi-table joins at it and it handles them gracefully, even on a machine with 8 GB RAM.

The catch: DuckDB is single-process. One writer at a time. No concurrent connections doing heavy queries. It’s not a server — it doesn’t pretend to be.

For a CLI-accessible setup, you can use the duckdb binary:

Terminal window
# Install DuckDB CLI (Linux amd64)
curl -LO https://github.com/duckdb/duckdb/releases/download/v1.2.1/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip
chmod +x duckdb
sudo mv duckdb /usr/local/bin/
# Query a CSV right from the shell
duckdb -c "SELECT count(*) FROM read_csv_auto('/var/log/nginx/access.log')"

Or run it via MotherDuck (their managed cloud offering) if you want a persistent server. But honestly, for home lab use, the CLI or embedded mode is the right call.

Docker Compose (if you want a persistent DuckDB REST interface via their HTTP API wrapper):

services:
duckdb-api:
image: ghcr.io/iotabah/duckdb-api:latest
ports:
- "8080:8080"
volumes:
- ./data:/data
- duckdb_state:/state
environment:
DUCKDB_PATH: /state/analytics.duckdb
volumes:
duckdb_state:

Memory footprint at rest: essentially zero. DuckDB uses what it needs and releases it. For a 100M-row Parquet query, expect 2–6 GB peak usage depending on the query shape.


StarRocks 3.x — The Enterprise Interloper

StarRocks is an Apache 2.0-licensed MPP (massively parallel processing) analytical database. It forked from Apache Doris, which itself has deep roots in Google Mesa. The headline feature: it speaks the MySQL wire protocol, so anything that talks to MySQL can talk to StarRocks.

That means your Grafana MySQL datasource, your Superset connection, your little Python script using pymysql — all of it works without modification. That’s a bigger deal than it sounds when you’re stitching together home lab tooling.

StarRocks 3.x introduced shared-nothing to shared-data architecture, better Iceberg/Hudi/Delta Lake support, and solid upsert semantics via Primary Key tables:

-- StarRocks: Primary Key table with upsert support
CREATE TABLE events (
event_id BIGINT NOT NULL,
user_id BIGINT,
event_type VARCHAR(64),
ts DATETIME,
properties JSON
)
PRIMARY KEY (event_id)
DISTRIBUTED BY HASH(event_id) BUCKETS 8
PROPERTIES (
"replication_num" = "1",
"enable_persistent_index" = "true"
);
-- Upsert via stream load or INSERT ... ON CONFLICT
INSERT INTO events (event_id, user_id, event_type, ts, properties)
VALUES (12345, 789, 'click', '2026-07-15 12:00:00', '{"page": "/home"}')
ON DUPLICATE KEY UPDATE
event_type = VALUES(event_type),
ts = VALUES(ts),
properties = VALUES(properties);

The JOIN story is where StarRocks actually earns its place. It was designed for multi-table analytical queries from day one. Broadcast joins, shuffle joins, colocate joins — it picks the right strategy automatically. Running a join across a 100M-row fact table and a 10M-row dimension table? StarRocks is going to beat ClickHouse here.

Lakehouse integration is genuinely impressive for a self-hosted setup. You can create external catalogs pointing at Iceberg tables on S3-compatible storage:

-- Create an Iceberg catalog pointing at MinIO
CREATE EXTERNAL CATALOG iceberg_catalog
PROPERTIES (
"type" = "iceberg",
"iceberg.catalog.type" = "rest",
"iceberg.catalog.uri" = "http://minio:8181",
"aws.s3.endpoint" = "http://minio:9000",
"aws.s3.access_key" = "minioadmin",
"aws.s3.secret_key" = "minioadmin",
"aws.s3.enable_path_style_access" = "true"
);
-- Query across local StarRocks table and Iceberg lake table
SELECT
e.event_type,
p.product_name,
count(*) AS conversions
FROM events e
JOIN iceberg_catalog.lake.products p ON e.properties->'$.product_id' = p.id
WHERE e.ts >= '2026-07-01'
GROUP BY e.event_type, p.product_name;

The downside: StarRocks has a real appetite for resources. A minimum useful single-node deployment wants at least 8 GB RAM, and 16 GB is where it starts feeling comfortable. The FE (frontend coordinator) and BE (backend executor) are separate JVM processes with their own heap requirements.

Docker Compose (single-node development setup):

services:
starrocks-fe:
image: starrocks/fe-ubuntu:3.3.7
hostname: starrocks-fe
ports:
- "8030:8030" # HTTP
- "9020:9020" # RPC
- "9030:9030" # MySQL protocol
volumes:
- fe_data:/opt/starrocks/fe/meta
environment:
- HOST_TYPE=FQDN
command: /opt/starrocks/fe/bin/start_fe.sh
starrocks-be:
image: starrocks/be-ubuntu:3.3.7
hostname: starrocks-be
ports:
- "8040:8040"
volumes:
- be_data:/opt/starrocks/be/storage
depends_on:
- starrocks-fe
command: >
sh -c "sleep 10 &&
mysql -h starrocks-fe -P 9030 -u root -e
'ALTER SYSTEM ADD BACKEND \"starrocks-be:9050\";' &&
/opt/starrocks/be/bin/start_be.sh"
volumes:
fe_data:
be_data:

Memory footprint: 4–6 GB for FE + BE combined at idle. In practice, give it a host with 16 GB free and don’t share it with heavy workloads.


The Numbers (Rough, Honest, Not a Press Release)

100M-row event table, single server, 8-core CPU, 16 GB RAM:

QueryClickHouseDuckDBStarRocks
COUNT(*)~0.05s~0.3s~0.2s
Group by + count distinct (1 key)~0.4s~0.8s~0.5s
Group by + count distinct (3 keys)~0.9s~1.2s~0.8s
3-table JOIN aggregate~2.1s~0.9s~0.6s
Filtered range scan + percentile~0.3s~0.6s~0.4s

These are directional numbers pulled from public benchmarks (ClickBench, StarRocks public benchmarks, DuckDB docs). Your mileage will vary based on data distribution, query patterns, and whether you remembered to tune the JVM heap.

The takeaway: ClickHouse wins on raw aggregation speed. DuckDB wins on JOINs and flexibility. StarRocks wins on join-heavy OLAP workloads at concurrency.


Ecosystem: Grafana, Superset, and Friends

ClickHouse has a first-class Grafana plugin (grafana-clickhouse-datasource) that supports macros for time-series queries. Superset has a ClickHouse SQLAlchemy driver. It’s well-supported.

DuckDB doesn’t have a native Grafana datasource (because it’s not a server). You can use it as a backend for dbt, export to Parquet and serve from somewhere else, or wrap it in a lightweight FastAPI layer. It’s not a dashboarding target — it’s a processing engine.

StarRocks speaks MySQL protocol, so it works with any MySQL-compatible client. Grafana MySQL datasource, Superset’s MySQL connector, Metabase, TablePlus — all of them work. This is a significant practical advantage.


Update/Delete Semantics — The Honest Version

ClickHouse: Deletes are mutations — they rewrite data parts asynchronously. Use ALTER TABLE DELETE WHERE ... sparingly. ReplacingMergeTree handles deduplication but not real updates. Plan your schema around append-only or version-based patterns.

DuckDB: Full ACID transactions with real UPDATE and DELETE. It’s a proper relational engine. This works exactly like you’d expect.

StarRocks: Primary Key tables support true upserts and deletes with ACID guarantees. Much better than ClickHouse for mutable data. This is one of StarRocks’ real differentiators.


Pick Your Fighter

Home Lab Metrics Dashboard (Loki Replacement)

Use ClickHouse.

You’re ingesting time-series log events, querying recent windows, running aggregations. Async inserts handle bursty ingest from Vector or Fluentd. Materialized views pre-aggregate the heavy stuff. The Grafana plugin is solid. Single-node runs fine on 4 GB RAM.

Terminal window
# Ship logs from Vector to ClickHouse
# In your vector.toml:
[sinks.clickhouse]
type = "clickhouse"
inputs = ["my_source"]
endpoint = "http://clickhouse:8123"
database = "logs"
table = "events"
compression = "gzip"
auth.strategy = "basic"
auth.user = "default"
auth.password = "yourpassword"

One-Off “What Does This CSV/Parquet Look Like”

Use DuckDB.

It’s a binary. You don’t need a server. You don’t need Docker. You need answers now.

Terminal window
duckdb -c "
SELECT event_type, count(*) AS n
FROM read_csv_auto('/tmp/events.csv')
GROUP BY event_type
ORDER BY n DESC
LIMIT 10
"

Run it. Get your answer. Close the terminal. Done.

Multi-Source Dashboarding With Joins

Use StarRocks.

You’ve got Iceberg tables in MinIO, a Postgres OLTP database, and you want to join them in real-time for a business dashboard. StarRocks handles this with external catalogs and its MySQL-compatible interface means your existing BI tooling works without modification.


Should You Bother?

Yes — but pick one and commit.

The failure mode here isn’t using the wrong tool, it’s running all three because you couldn’t decide. Your home lab doesn’t need a three-engine OLAP stack. Pick the use case that actually matters to you right now:

All three are genuinely good at what they do. The key is knowing which problem you have before you pick the hammer.

Your 2 AM self will appreciate the clarity.


Share this post on:

Send a Webmention

Written about this post on your own site? Send a webmention and it'll show up above once verified.


Next Post
Boundary vs Teleport

Discussion

Powered by Garrul . Sign in with GitHub or Google, or post anonymously.

Related Posts