PostgreSQL on ZFS: Tuning, Snapshots, Pitfalls

You Already Have ZFS. Now Put Postgres On It Properly.

If you’re running a home lab with ZFS (and at this point, who isn’t?), you’ve probably already got PostgreSQL running on it. The question is whether it’s configured for ZFS or just plopped on top of it like a couch on a moving truck. Technically it works. But your neighbors (and your WAL) will have questions.

The good news: Postgres and ZFS are an unusually good match when tuned correctly. Atomic snapshots replace the pain of pg_basebackup. lz4 compression squeezes 2 to 3x on text-heavy databases. And ZFS checksums catch the silent block corruption that ext4 just quietly ignores until you’re restoring from a backup at 2 AM wondering why your users table has six thousand NULL rows.

The bad news: getting there requires about a dozen settings you won’t find in the Postgres docs, because from Postgres’s perspective ZFS doesn’t exist; it’s just a filesystem. So let’s fix that.

Why Bother in the First Place

Before you tune anything, here’s why the combination is worth it:

Atomic snapshots. ZFS snapshots are copy-on-write and instantaneous. A zfs snapshot is consistent at the block level for every dataset it covers: no pg_backup_start dance (pg_start_backup before PostgreSQL 15), no long checkpoint stalls on busy databases. For home lab and small production workloads, this is transformational.

Compression. Postgres stores a lot of null bytes, fixed-width padding, and repetitive index structure. lz4 eats all of it. On a typical web app database with TEXT columns and JSON blobs, you’ll see a 2x to 3x reduction with near-zero CPU cost.

Checksums. ZFS checksums every block on write and verifies it on every read. Postgres also has checksums (initdb --data-checksums, which is the default as of PostgreSQL 18), and you should enable both; they catch different failure modes at different layers. Silent disk corruption on consumer SATA drives is not a myth.

What you’re not getting: a speed miracle. Postgres on ZFS with default settings is slower than ext4. With proper tuning, you close most of that gap, and the operational benefits more than compensate for the remaining 10 to 20% overhead.

Dataset Layout: This Part Actually Matters

Don’t put everything in one dataset. Postgres has two distinctly different I/O patterns: random reads/writes to the data directory, and sequential append to WAL. ZFS lets you optimize each separately.

# Create datasets — adjust pool name (tank) as needed
zfs create -o recordsize=16K \
           -o compression=lz4 \
           -o atime=off \
           -o xattr=sa \
           -o dnodesize=auto \
           tank/pgdata

zfs create -o recordsize=128K \
           -o compression=lz4 \
           -o atime=off \
           -o logbias=throughput \
           -o xattr=sa \
           tank/pgwal

The logic:

recordsize=16K for pgdata matches PostgreSQL’s default 8K block size… wait, 16K? Yes. ZFS compresses each record as a unit, and one 16K record holding two 8K pages compresses better than two separate 8K records. The price is that a random 8K read or write touches a full 16K record, which is modest. The important thing is that you’re not using the default 128K recordsize, where every random 8K page read can pull a whole 128K record off disk.
recordsize=128K for pgwal: WAL is purely sequential append. Large records are fine and improve throughput.
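logbias=throughput on pgwal tells ZFS to skip any separate log device (SLOG) for this dataset and write synchronous data straight to the main pool, trading commit latency for throughput. That suits a pool with no SLOG. If you do run a fast SLOG and care about commit latency, leave logbias at its default (latency).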
atime=off everywhere. Access time writes on a database workload are pure overhead.
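To put a rough number on the read amplification mentioned above: a cold random read of one 8K page has to fetch the whole ZFS record that contains it. Here is a minimal sketch of that arithmetic, assuming the default 8K Postgres block size and ignoring compression and ARC hits. The function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Rough worst-case read amplification for one cold page read:
# ZFS must read the entire record that holds the page.
# Assumes recordsize >= page size; ignores compression and caching.
read_amplification() {
  local recordsize_kib="$1" page_kib="${2:-8}"
  echo $(( recordsize_kib / page_kib ))
}

for rs in 128 16 8; do
  echo "recordsize=${rs}K -> $(read_amplification "$rs")x data read per 8K page"
done
```

On paper that is 16x for the 128K default against 2x for 16K, which is the gap the recordsize setting is meant to close.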

Check your settings:

zfs get recordsize,compression,atime,logbias tank/pgdata tank/pgwal

Expected output:

NAME         PROPERTY     VALUE       SOURCE
tank/pgdata  recordsize   16K         local
tank/pgdata  compression  lz4         local
tank/pgdata  atime        off         local
tank/pgdata  logbias      latency     default
tank/pgwal   recordsize   128K        local
tank/pgwal   compression  lz4         local
tank/pgwal   atime        off         local
tank/pgwal   logbias      throughput  local
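If you would rather have a script tell you when a property drifts than eyeball the table, something like the following could work. It is only a sketch: check_props is a hypothetical helper, and it expects the tab-separated output of zfs get -H -o name,property,value on stdin. It needs bash 4 or later for the associative array.

```shell
#!/usr/bin/env bash
# Compare actual ZFS properties (stdin: name<TAB>property<TAB>value)
# against expected values passed as "dataset:property=value" arguments.
# Prints one line per mismatch and returns non-zero if any are found.
check_props() {
  local -A actual=()
  local name prop value
  while IFS=$'\t' read -r name prop value; do
    actual["$name:$prop"]="$value"
  done
  local want key expected rc=0
  for want in "$@"; do
    key="${want%%=*}"
    expected="${want#*=}"
    if [[ "${actual[$key]:-unset}" != "$expected" ]]; then
      echo "MISMATCH $key: want $expected, got ${actual[$key]:-unset}"
      rc=1
    fi
  done
  return "$rc"
}

# Dry run with canned input instead of a real pool
printf 'tank/pgdata\trecordsize\t16K\ntank/pgwal\tlogbias\tlatency\n' \
  | check_props tank/pgdata:recordsize=16K tank/pgwal:logbias=throughput || true
```

Against a real pool you would feed it zfs get -H -o name,property,value recordsize,compression,atime,logbias tank/pgdata tank/pgwal instead of the printf.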

Then configure PostgreSQL to use them:

# Assuming PostgreSQL 18 on Debian/Ubuntu
mkdir -p /tank/pgdata /tank/pgwal
chown postgres:postgres /tank/pgdata /tank/pgwal

# Initialize with separate WAL directory
# (Debian/Ubuntu keep initdb outside PATH, under /usr/lib/postgresql/<version>/bin)
su -c "/usr/lib/postgresql/18/bin/initdb -D /tank/pgdata --waldir=/tank/pgwal --data-checksums" postgres

PostgreSQL Settings That ZFS Changes

Open postgresql.conf and find these settings. Most of them exist because traditional filesystems do things ZFS handles differently.

# ZFS gives you CoW — recycling and pre-zeroing WAL files is harmful
wal_init_zero = off
wal_recycle = off

# Full page writes: LEAVE THIS ON unless you've verified your
# ZFS recordsize == PG block size AND you understand the implications.
# The default (on) is safe. Only turn it off if you've done your homework.
full_page_writes = on

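# Shared buffers: the usual ~25%-of-RAM guideline assumes a plain OS page cache.
# On ZFS the ARC takes that role, so size shared_buffers and the ARC together.
shared_buffers = 4GB            # example value for a 32GB machine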

# Checkpointing — ZFS handles fsync well, but don't hammer it
checkpoint_completion_target = 0.9
max_wal_size = 4GB

# Tell PG where WAL lives (matches --waldir above)
# This is set at initdb time, not in postgresql.conf directly

A word on full_page_writes: theoretically, if ZFS recordsize equals PG block size (both 8K), ZFS’s CoW makes torn writes impossible and you can turn this off. In practice, the recordsize tuning we did above (16K) means they don’t match, so keep full_page_writes = on. Turning it off incorrectly will corrupt your database in ways that are entertaining to read about and catastrophic to experience.

wal_init_zero = off and wal_recycle = off are unambiguously correct on ZFS. The defaults exist for filesystems where pre-zeroing and recycling reduce fragmentation. ZFS’s CoW makes both pointless and slightly harmful.

ARC Sizing: Don’t Let ZFS Eat Your RAM

This is where most people get hurt. ZFS ARC and PostgreSQL shared_buffers will both try to cache the same data. You end up with 8GB of database pages cached twice, once in shared_buffers, once in ARC, while your system OOMs at 2 AM.

Cap the ARC:

# /etc/modprobe.d/zfs.conf
# For a 32GB machine with 4GB shared_buffers:
# Leave ~4GB for OS + connections, 4GB for PG, rest for ARC
# Formula: zfs_arc_max = (total_ram - shared_buffers - os_overhead) * 0.8
#   (32 - 4 - 4) * 0.8 = 19.2GB, rounded down to 16GB for extra headroom

options zfs zfs_arc_max=17179869184

That’s 16GB in bytes (16 * 1024^3). Apply without rebooting:

echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
# Verify
arc_summary | grep -E "ARC|Max"
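The formula in the comment works out to about 19.2GB for this machine, so the 16GB value is a more conservative round number. To redo the arithmetic for your own RAM, a small helper along these lines might be handy. It is a sketch; the function name and the whole-GiB inputs are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# zfs_arc_max = (total_ram - shared_buffers - os_overhead) * 0.8, in bytes.
# Inputs are whole GiB; integer math, so the result is truncated to a byte.
arc_max_bytes() {
  local total_gib="$1" shared_buffers_gib="$2" os_gib="$3"
  local gib=$(( 1024 ** 3 ))
  echo $(( (total_gib - shared_buffers_gib - os_gib) * gib * 8 / 10 ))
}

arc_max_bytes 32 4 4        # formula result: 20615843020 (about 19.2GiB)
echo $(( 16 * 1024 ** 3 ))  # the rounded-down 16GiB value: 17179869184
```

Treat the result as a ceiling and round down rather than up. The ARC gives memory back under pressure, but not always as fast as Postgres asks for it.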

The trade-off is real: ARC is great for read-heavy workloads where the working set doesn’t fit in shared_buffers. If your database is 80% reads on a small hot set, let ARC have more RAM. If it’s write-heavy or your working set exceeds shared_buffers anyway, keep ARC lean and let Postgres manage its own cache.

Snapshots: The Whole Point

Here’s where the investment pays off. Instead of fiddling with pg_basebackup, replication slots, and WAL archiving complexity, you snapshot the filesystem.

# Manual snapshot: instant, space-efficient until data changes
# Name both datasets in ONE command so data and WAL capture the same instant
zfs snapshot tank/pgdata@2026-07-06_0200 tank/pgwal@2026-07-06_0200

# List snapshots
zfs list -t snapshot tank/pgdata

# Send to a backup pool (local or remote)
zfs send tank/pgdata@2026-07-06_0200 | zfs recv backup/pgdata

# Incremental send (much faster after the first)
zfs send -i tank/pgdata@2026-07-05_0200 tank/pgdata@2026-07-06_0200 \
  | zfs recv backup/pgdata

For automated backups, here’s a script that’s actually useful:

#!/usr/bin/env bash
set -euo pipefail

POOL="tank"
BACKUP_POOL="backup"
DATE=$(date +%Y-%m-%d_%H%M)
DATASETS=("pgdata" "pgwal")
KEEP=7   # snapshots to retain per dataset

# Optional: checkpoint postgres before snapshot for cleaner state
# Not required: ZFS snapshots are crash-consistent, PG recovers from WAL
# But a checkpoint reduces recovery time
psql -U postgres -c "CHECKPOINT;" 2>/dev/null || true

# Snapshot every dataset in one command so data and WAL share the same instant
SNAPS=()
for ds in "${DATASETS[@]}"; do
  SNAPS+=("${POOL}/${ds}@${DATE}")
done
zfs snapshot "${SNAPS[@]}"

for ds in "${DATASETS[@]}"; do
  SNAP="${POOL}/${ds}@${DATE}"
  echo "Snapshot: $SNAP"

  # Previous snapshot (by creation time) for incremental send
  PREV=$(zfs list -t snapshot -H -o name -s creation "${POOL}/${ds}" \
    | tail -2 | head -1)

  if [[ -n "$PREV" && "$PREV" != "$SNAP" ]]; then
    zfs send -i "$PREV" "$SNAP" | zfs recv -F "${BACKUP_POOL}/${ds}"
    echo "Incremental send complete: $PREV → $SNAP"
  else
    zfs send "$SNAP" | zfs recv "${BACKUP_POOL}/${ds}"
    echo "Full send complete: $SNAP"
  fi

  # Keep the $KEEP most recent snapshots of this dataset, destroy the rest
  # (head -n with a negative count requires GNU coreutils)
  zfs list -t snapshot -H -o name -s creation "${POOL}/${ds}" \
    | head -n "-${KEEP}" \
    | xargs -r -n1 zfs destroy
done

Schedule it nightly from a file in /etc/cron.d (that format includes the user field):

0 2 * * * root /usr/local/bin/pg-zfs-backup.sh >> /var/log/pg-zfs-backup.log 2>&1
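Retention is the part of a script like this most worth dry-running before it is allowed to destroy anything. The sketch below isolates the "keep the newest N" selection so it can be tested on plain text. snapshots_to_prune is a hypothetical helper, and it assumes the @YYYY-MM-DD_HHMM naming used here, which sorts chronologically as text.

```shell
#!/usr/bin/env bash
# Reads snapshot names on stdin and prints the ones to destroy,
# keeping the newest $1. Relies on date-stamped names that sort
# chronologically as plain text.
snapshots_to_prune() {
  local keep="$1"
  sort | awk -v keep="$keep" '
    { names[NR] = $0 }
    END { for (i = 1; i <= NR - keep; i++) print names[i] }'
}

# Dry run with made-up snapshot names: keep 2 of 3
printf '%s\n' \
  tank/pgdata@2026-07-03_0200 \
  tank/pgdata@2026-07-01_0200 \
  tank/pgdata@2026-07-02_0200 \
  | snapshots_to_prune 2
# prints tank/pgdata@2026-07-01_0200
```

Only once the printed list looks right would you pipe it into xargs -r -n1 zfs destroy. Unlike head with a negative count, the awk version does not depend on GNU coreutils.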

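If you want Restic on top for offsite, back up straight from the snapshot without touching the live database. Snapshots can’t be mounted with zfs mount, but every snapshot is already exposed read-only under the dataset’s hidden .zfs directory:

# Snapshots appear read-only under <mountpoint>/.zfs/snapshot/<name>
ls /tank/pgdata/.zfs/snapshot/2026-07-06_0200/
# Restic backup from the snapshot directory
# (repeat for /tank/pgwal, which holds the WAL the data directory needs)
restic -r s3:your-bucket/pgdata backup /tank/pgdata/.zfs/snapshot/2026-07-06_0200/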

No hot file races. No partial writes. No drama.

Point-in-Time Recovery

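Snapshots get you back to a known state. WAL gets you to an exact transaction. This assumes you have been archiving WAL (archive_mode = on with an archive_command) since before the snapshot, so that the segments written after it still exist somewhere. Together: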

# Stop Postgres
systemctl stop postgresql

# Roll back to snapshot
# (add -r if newer snapshots exist; the rollback destroys them)
zfs rollback tank/pgdata@2026-07-06_0200
zfs rollback tank/pgwal@2026-07-06_0200

# Configure recovery in postgresql.conf
# (since PG 12 the recovery settings live in postgresql.conf; there is no recovery.conf)

restore_command = 'cp /your/wal-archive/%f %p'
recovery_target_time = '2026-07-06 03:47:00'
recovery_target_action = 'promote'

# Create recovery.signal to trigger targeted recovery
# (standby.signal would start the server as a standby instead)
touch /tank/pgdata/recovery.signal

# Start Postgres — it will replay WAL to the target time
systemctl start postgresql
# Watch logs
journalctl -fu postgresql
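WAL replay only moves forward, so the recovery target has to be later than the snapshot you rolled back to. A pre-flight check along these lines can catch a mistyped timestamp before Postgres starts. This is a sketch: target_after_snapshot is a hypothetical helper, and it assumes the @YYYY-MM-DD_HHMM snapshot naming used here plus GNU date for the -d option.

```shell
#!/usr/bin/env bash
# Returns 0 if the recovery target time is later than the snapshot time,
# 1 if it is not, 2 if either timestamp cannot be parsed.
# Both times are interpreted in the local time zone.
target_after_snapshot() {
  local snapshot="$1" target="$2"
  local stamp="${snapshot#*@}"   # e.g. 2026-07-06_0200
  local day="${stamp%_*}"        # e.g. 2026-07-06
  local hhmm="${stamp#*_}"       # e.g. 0200
  local snap_epoch target_epoch
  snap_epoch=$(date -d "${day} ${hhmm:0:2}:${hhmm:2:2}" +%s) || return 2
  target_epoch=$(date -d "$target" +%s) || return 2
  (( target_epoch > snap_epoch ))
}

if target_after_snapshot "tank/pgdata@2026-07-06_0200" "2026-07-06 03:47:00"; then
  echo "target is after the snapshot: replay can reach it"
fi
```

A target earlier than the snapshot needs an older snapshot as its starting point. No amount of WAL will rewind the data directory.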

This is exactly what database-level backups try to do, except here the “base backup” is a ZFS snapshot that took 0.3 seconds instead of 45 minutes.

Pitfalls That Will Waste Your Weekend

RAIDZ is not your friend here. RAIDZ handles small random writes worse than mirrors: every block carries its own parity and padding, so small records waste space, and each RAIDZ vdev delivers roughly the random IOPS of a single disk. (The RAID-5 write hole is not the culprit; RAIDZ was designed to avoid it.) Postgres is full of small random writes. Use mirrors. RAIDZ is great for cold storage, NAS, archives. It’s measurably worse for database I/O.

# Good: mirrored vdevs
zpool create tank mirror sda sdb mirror sdc sdd

# Bad for Postgres:
# zpool create tank raidz sda sdb sdc sdd

SLOG (ZIL separate device) is probably not what you need. SLOG accelerates synchronous writes, specifically the fsync() calls that ZFS must commit before returning. Postgres does issue fsyncs, but on a ZFS pool with NVMe vdevs, the latency is already low. SLOG helps when all of these hold: your pool vdevs are slow spinning rust, you have a power-loss-protected NVMe SLOG device, and your workload is fsync-heavy (OLTP with lots of small commits). You would also need to set logbias back to latency on pgwal, since throughput bypasses the SLOG. For home lab use on all-flash, it adds complexity without measurable benefit.

Double-buffering is real and you must address it. If you don’t cap the ARC as described above, you will cache everything twice and your available memory for connections and query execution will be less than you think. pg_top showing 8GB used doesn’t mean 8GB of unique data is cached.

Snapshots are not free forever. Each snapshot holds a reference to blocks that existed at snapshot time. As data changes, those blocks can’t be freed. A busy database with 30-day snapshot retention can accumulate significant space. Monitor with:

zfs list -t snapshot -o name,used,refer tank/pgdata | sort -k2 -h
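For alerting rather than eyeballing, the parsable form of the same listing (zfs list -Hp prints exact byte counts, tab-separated, no header) is easier to filter. The helper below is a sketch with a made-up name; it reads that output on stdin and prints snapshots whose unique space exceeds a threshold.

```shell
#!/usr/bin/env bash
# Reads "name<TAB>used_bytes" lines, the shape printed by
#   zfs list -Hp -t snapshot -o name,used tank/pgdata
# and prints snapshots holding more than $1 GiB of unique space.
large_snapshots() {
  local threshold_gib="$1"
  awk -F'\t' -v limit="$threshold_gib" '
    $2 > limit * 1024 * 1024 * 1024 {
      printf "%s\t%.1fG\n", $1, $2 / (1024 * 1024 * 1024)
    }'
}

# Dry run with made-up numbers: 1GiB and 5GiB snapshots, 2GiB threshold
printf 'tank/pgdata@a\t1073741824\ntank/pgdata@b\t5368709120\n' | large_snapshots 2
```

Keep in mind that a snapshot’s used value only counts blocks unique to that one snapshot. Space shared between several snapshots shows up in none of them, so the dataset’s usedbysnapshots property is the number to watch for the total.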

Don’t forget xattr=sa and dnodesize=auto. By default ZFS stores extended attributes in a hidden directory, which costs extra I/O on every access. xattr=sa stores them as system attributes in the file’s dnode instead, and dnodesize=auto lets the dnode grow to fit them. Postgres doesn’t heavily use xattrs, but it costs nothing and future-proofs the dataset.

Real Numbers

On a test setup: AMD Ryzen 7 5700G, 32GB RAM, 2x 1TB NVMe in mirror, Ubuntu 24.04, PostgreSQL 18.

zpool list

NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
tank   1.82T  187G   1.63T        -         -     4%    10%  1.00x    ONLINE  -

pgbench at scale 100 (1.4GB database), 8 clients, 60 seconds:

Config	TPS	Latency (avg)
ext4, default PG settings	4,120	1.94ms
ZFS defaults (128K recordsize)	2,890	2.77ms
ZFS tuned (16K recordsize, settings above)	3,680	2.17ms
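The relative slowdowns implied by the table are easy to recompute, which is also useful when you rerun pgbench on your own hardware. This is a small sketch using awk for the floating-point math; pct_slower is a made-up helper name.

```shell
#!/usr/bin/env bash
# Percentage by which `other` TPS falls short of `base` TPS.
pct_slower() {
  awk -v base="$1" -v other="$2" \
    'BEGIN { printf "%.1f\n", (base - other) / base * 100 }'
}

echo "ZFS defaults vs ext4: $(pct_slower 4120 2890)% slower"   # 29.9
echo "ZFS tuned vs ext4:    $(pct_slower 4120 3680)% slower"   # 10.7
```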

Tuned ZFS is about 11% slower than ext4 on this hardware. That gap buys you: instantaneous crash-consistent backups, 2.3x compression ratio on this database (real number from zfs get compressratio tank/pgdata), per-block checksums, and point-in-time recovery to within seconds.

zfs get compressratio,used,logicalused tank/pgdata

NAME        PROPERTY       VALUE  SOURCE
tank/pgdata compressratio  2.31x  -
tank/pgdata used           81.2G  -
tank/pgdata logicalused    187G   -

187GB of logical data stored in 81GB. On lz4. With near-zero CPU cost.

Should You Bother?

Yes, if:

You’re already running ZFS (you’ve done the hard part)
Your database is text-heavy, JSON-heavy, or has lots of nullable/sparse columns
You want consistent backups without backup agents or pg_basebackup complexity
You’re on mirrors or single-disk (home lab, small VPS with ZFS)

Maybe not, if:

You need absolute maximum IOPS and have no interest in operational simplicity
You’re running RAIDZ (reconfigure your pool first, then revisit)
Your database is tiny and fits in RAM anyway: at that scale, it genuinely doesn’t matter

The 10 to 20% overhead is real and measurable. But “real and measurable” in home lab terms means the difference between 4,100 TPS and 3,700 TPS on a workload that your single-digit concurrent users will never saturate. Meanwhile, your next backup runs in 0.3 seconds and can be sent incrementally to a backup pool over the weekend.

Run ZFS. Tune it properly. Sleep better at 2 AM.
