You Already Have ZFS. Now Put Postgres On It Properly.
If you’re running a home lab with ZFS (and at this point, who isn’t) you’ve probably already got PostgreSQL running on it. The question is whether it’s configured for ZFS or just plopped on top of it like a couch on a moving truck. Technically it works. But your neighbors (and your WAL) will have questions.
The good news: Postgres and ZFS are an unusually good match when tuned correctly. Atomic snapshots replace the pain of pg_basebackup. lz4 compression squeezes 2–3x on text-heavy databases. And ZFS checksums catch the silent block corruption that ext4 just quietly ignores until you’re restoring from a backup at 2 AM wondering why your users table has six thousand NULL rows.
The bad news: getting there requires about a dozen settings you won’t find in the Postgres docs, because ZFS doesn’t exist from Postgres’s perspective — it’s just a filesystem. So let’s fix that.
Why Bother in the First Place
Before you tune anything, here’s why the combination is worth it:
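Atomic snapshots. ZFS snapshots are copy-on-write and instantaneous. A zfs snapshot at the filesystem level is consistent at the block level, so Postgres recovers from it the way it recovers from a power cut: no pg_backup_start/pg_backup_stop dance (pg_start_backup/pg_stop_backup before PostgreSQL 15), no long checkpoint stalls on busy databases. The one condition is that data and WAL are captured at the same instant. For home lab and small production workloads, this is transformational.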
Compression. Postgres stores a lot of null bytes, fixed-width padding, and repetitive index structure. lz4 eats all of it. On a typical web app database with TEXT columns and JSON blobs, you’ll see 2x–3x reduction with near-zero CPU cost.
Checksums. ZFS checksums every block when it is written and verifies that checksum on every read. Postgres also has checksums (initdb --data-checksums), and you should enable both: they catch different failure modes at different layers. Silent disk corruption on consumer SATA drives is not a myth.
What you’re not getting: a speed miracle. Postgres on ZFS with default settings is slower than ext4. With proper tuning, you close most of that gap, and the operational benefits more than compensate for the remaining 10–20% overhead.
Dataset Layout: This Part Actually Matters
Don’t put everything in one dataset. Postgres has two distinctly different I/O patterns: random reads/writes to the data directory, and sequential append to WAL. ZFS lets you optimize each separately.
    # Create datasets; adjust pool name (tank) as needed
    zfs create -o recordsize=16K \
        -o compression=lz4 \
        -o atime=off \
        -o xattr=sa \
        -o dnodesize=auto \
        tank/pgdata

    zfs create -o recordsize=128K \
        -o compression=lz4 \
        -o atime=off \
        -o logbias=throughput \
        -o xattr=sa \
        tank/pgwal

The logic:

- recordsize=16K for pgdata matches PostgreSQL’s default 8K block size… wait, 16K? Yes. ZFS records are compressed as a unit, and one 16K record compresses more efficiently than two 8K records. The cost is that each record holds two Postgres pages, so a single-page write rewrites the whole 16K record. The important thing is that you’re not using the default 128K recordsize, which creates catastrophic read amplification on random I/O.
- recordsize=128K for pgwal: WAL is purely sequential append. Large records are fine and improve throughput.
- logbias=throughput on pgwal tells ZFS not to use the SLOG (the separate intent-log device) for this dataset. WAL is already transactional; double-logging is waste.
- atime=off everywhere. Access time writes on a database workload are pure overhead.
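To put a number on "read amplification", here is a minimal sketch of the worst case: a random 8K page read that misses every cache still makes ZFS read (and checksum) the whole record containing it. The function name is illustrative, not part of any ZFS tool, and the figure is an upper bound because compressed records are smaller on disk.

```shell
#!/usr/bin/env bash
# Worst-case logical read amplification for one uncached page read:
# ZFS has to fetch the entire record to return a single Postgres page.
# Arguments are sizes in KiB.
read_amplification() {
    local recordsize_kb=$1 page_kb=$2
    # A record smaller than the page cannot amplify a read
    if (( recordsize_kb < page_kb )); then
        echo 1
        return
    fi
    echo $(( recordsize_kb / page_kb ))
}

read_amplification 128 8   # ZFS default recordsize vs. 8K page -> 16
read_amplification 16 8    # the pgdata setting used here      -> 2
```

Sixteen times the I/O per random read is where the default-settings penalty comes from; at 16K the factor drops to two.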
Check your settings:
    zfs get recordsize,compression,atime,logbias tank/pgdata tank/pgwal

Expected output:

    NAME         PROPERTY     VALUE       SOURCE
    tank/pgdata  recordsize   16K         local
    tank/pgdata  compression  lz4         local
    tank/pgdata  atime        off         local
    tank/pgdata  logbias      latency     default
    tank/pgwal   recordsize   128K        local
    tank/pgwal   compression  lz4         local
    tank/pgwal   atime        off         local
    tank/pgwal   logbias      throughput  local

Then configure PostgreSQL to use them:
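    # Assuming PostgreSQL 17 on Debian/Ubuntu
    # (zfs create has already mounted the datasets at these paths; mkdir is a no-op then)
    mkdir -p /tank/pgdata /tank/pgwal
    chown postgres:postgres /tank/pgdata /tank/pgwal

    # Initialize with separate WAL directory
    # Debian/Ubuntu keep initdb outside PATH, under /usr/lib/postgresql/<version>/bin
    su -c "/usr/lib/postgresql/17/bin/initdb -D /tank/pgdata --waldir=/tank/pgwal --data-checksums" postgres

PostgreSQL Settings That ZFS Changes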
Open postgresql.conf and find these settings. Most of them exist because traditional filesystems do things ZFS handles differently.
    # ZFS gives you CoW: recycling and pre-zeroing WAL files is harmful
    wal_init_zero = off
    wal_recycle = off

    # Full page writes: LEAVE THIS ON unless you've verified your
    # ZFS recordsize == PG block size AND you understand the implications.
    # The default (on) is safe. Only turn it off if you've done your homework.
    full_page_writes = on

    # Shared buffers: size appropriately for your RAM minus ZFS ARC
    shared_buffers = 4GB    # adjust to ~25% of RAM

    # Checkpointing: ZFS handles fsync well, but don't hammer it
    checkpoint_completion_target = 0.9
    max_wal_size = 4GB

    # Tell PG where WAL lives (matches --waldir above)
    # This is set at initdb time, not in postgresql.conf directly

A word on full_page_writes: theoretically, if ZFS recordsize equals PG block size (both 8K), ZFS’s CoW makes torn writes impossible and you can turn this off. In practice, the recordsize tuning we did above (16K) means they don’t match, so keep full_page_writes = on. Turning it off incorrectly will corrupt your database in ways that are entertaining to read about and catastrophic to experience.
wal_init_zero = off and wal_recycle = off are unambiguously correct on ZFS. The defaults exist for filesystems where pre-zeroing and recycling reduce fragmentation. ZFS’s CoW makes both pointless and slightly harmful.
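If you want to confirm a config file actually carries these values, a rough check could look like the sketch below. Both function names are made up for this example, and the parser is deliberately simple: it reads only the file it is given, and it ignores include directives, ALTER SYSTEM overrides, and the optional form of a setting written without an equals sign. Running SHOW in psql against the live server remains the authoritative answer.

```shell
#!/usr/bin/env bash
# Print the last value assigned to `key` in a postgresql.conf-style file.
# Comments are dropped, surrounding whitespace and single quotes are stripped.
conf_value() {
    local file=$1 key=$2
    awk -v key="$key" '
        { sub(/#.*/, "") }                    # drop comments
        $0 ~ "^[ \t]*" key "[ \t]*=" {
            sub(/^[^=]*=[ \t]*/, "")          # keep the right-hand side
            gsub(/[ \t]+$/, "")               # trim trailing whitespace
            gsub(/\047/, "")                  # strip single quotes
            val = $0
        }
        END { if (val != "") print val }
    ' "$file"
}

# Compare the three ZFS-relevant settings with the values recommended above.
# All three default to "on" in PostgreSQL when they are not set in the file.
check_zfs_settings() {
    local file=$1 rc=0 pair key want got
    for pair in wal_init_zero=off wal_recycle=off full_page_writes=on; do
        key=${pair%%=*}
        want=${pair#*=}
        got=$(conf_value "$file" "$key")
        got=${got:-on}
        if [[ "$got" != "$want" ]]; then
            echo "MISMATCH $key: want $want, got $got"
            rc=1
        fi
    done
    return "$rc"
}
```

Call it as `check_zfs_settings /tank/pgdata/postgresql.conf`; it prints one line per mismatch and returns non-zero if any are found.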
ARC Sizing: Don’t Let ZFS Eat Your RAM
This is where most people get hurt. ZFS ARC and PostgreSQL shared_buffers will both try to cache the same data. You end up with 8GB of database pages cached twice — once in shared_buffers, once in ARC — while your system OOMs at 2 AM.
Cap the ARC:
    # /etc/modprobe.d/zfs.conf
    # For a 32GB machine with 4GB shared_buffers:
    # Leave ~4GB for OS + connections, 4GB for PG, rest for ARC
    # Formula: zfs_arc_max = (total_ram - shared_buffers - os_overhead) * 0.8
    # (32 - 4 - 4) * 0.8 = 19.2GB; rounded down to 16GB here for extra headroom
    options zfs zfs_arc_max=17179869184

That’s 16GB in bytes (16 * 1024^3). The modprobe option takes effect the next time the module loads. Apply without rebooting:

    echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
    # Verify
    arc_summary | grep -E "ARC|Max"

The trade-off is real: ARC is great for read-heavy workloads where the working set doesn’t fit in shared_buffers. If your database is 80% reads on a small hot set, let ARC have more RAM. If it’s write-heavy or your working set exceeds shared_buffers anyway, keep ARC lean and let Postgres manage its own cache.
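The sizing formula is easy to get wrong by a factor of 1024, so here is a small sketch that does the arithmetic in bytes. The function name is hypothetical and the inputs are whole GiB; it only prints a number and does not touch the module parameter.

```shell
#!/usr/bin/env bash
GIB=$(( 1024 * 1024 * 1024 ))

# zfs_arc_max in bytes = (total_ram - shared_buffers - os_overhead) * 0.8
# Integer math, so the result is rounded down to a whole byte.
arc_max_bytes() {
    local total_gib=$1 shared_buffers_gib=$2 os_gib=$3
    echo $(( (total_gib - shared_buffers_gib - os_gib) * GIB * 8 / 10 ))
}

arc_max_bytes 32 4 4     # the 32GB example machine -> 20615843020 (about 19.2 GiB)
echo $(( 16 * GIB ))     # the rounder 16 GiB cap   -> 17179869184
```

For the example machine the formula lands at roughly 19.2 GiB, so the 16 GiB cap used in this section sits on the conservative side of it.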
Snapshots: The Whole Point
Here’s where the investment pays off. Instead of fiddling with pg_basebackup, replication slots, and WAL archiving complexity, you snapshot the filesystem.
    # Manual snapshot: instant, space-efficient until data changes.
    # Name both datasets in ONE command so data and WAL are captured atomically;
    # two separate zfs snapshot calls can leave WAL and data out of step.
    zfs snapshot tank/pgdata@2026-07-06_0200 tank/pgwal@2026-07-06_0200

    # List snapshots
    zfs list -t snapshot tank/pgdata

    # Send to a backup pool (local or remote)
    zfs send tank/pgdata@2026-07-06_0200 | zfs recv backup/pgdata

    # Incremental send (much faster after the first)
    zfs send -i tank/pgdata@2026-07-05_0200 tank/pgdata@2026-07-06_0200 \
        | zfs recv backup/pgdata

For automated backups, here’s a script that’s actually useful:
    #!/usr/bin/env bash
    set -euo pipefail

    POOL="tank"
    BACKUP_POOL="backup"
    DATE=$(date +%Y-%m-%d_%H%M)
    DATASETS=("pgdata" "pgwal")
    KEEP=7    # snapshots to keep per dataset

    # Optional: checkpoint postgres before snapshot for cleaner state
    # Not required: ZFS snapshots are crash-consistent, PG recovers from WAL
    # But a checkpoint reduces recovery time
    psql -U postgres -c "CHECKPOINT;" 2>/dev/null || true

    # Snapshot all datasets in one command so data and WAL match atomically
    SNAPS=()
    for ds in "${DATASETS[@]}"; do
        SNAPS+=("${POOL}/${ds}@${DATE}")
    done
    zfs snapshot "${SNAPS[@]}"

    for ds in "${DATASETS[@]}"; do
        SNAP="${POOL}/${ds}@${DATE}"
        echo "Snapshot: $SNAP"

        # Get previous snapshot for incremental send
        # (sorted oldest first, so it is the second-to-last line)
        PREV=$(zfs list -H -t snapshot -o name -s creation -d 1 "${POOL}/${ds}" \
            | tail -n 2 | sed -n '1p')

        if [[ -n "$PREV" && "$PREV" != "$SNAP" ]]; then
            zfs send -i "$PREV" "$SNAP" | zfs recv -F "${BACKUP_POOL}/${ds}"
            echo "Incremental send complete: $PREV → $SNAP"
        else
            zfs send "$SNAP" | zfs recv "${BACKUP_POOL}/${ds}"
            echo "Full send complete: $SNAP"
        fi

        # Clean up: keep the newest $KEEP snapshots of this dataset
        # (head -n -N is GNU coreutils syntax)
        zfs list -H -t snapshot -o name -s creation -d 1 "${POOL}/${ds}" \
            | head -n "-${KEEP}" \
            | xargs -r -n1 zfs destroy
    done

Schedule it nightly with a system crontab entry (for example, a file in /etc/cron.d):

    0 2 * * * root /usr/local/bin/pg-zfs-backup.sh >> /var/log/pg-zfs-backup.log 2>&1

If you want Restic on top for offsite, back up from the snapshot without touching the live database:
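    # Snapshots are exposed read-only under <mountpoint>/.zfs/snapshot/<name>.
    # No explicit mount is needed: the directory is hidden but always reachable.
    # Restic backup from the snapshot directories (data and WAL together).
    # Restic's S3 repository syntax is s3:<endpoint>/<bucket>/<prefix>.
    restic -r s3:s3.amazonaws.com/your-bucket/pgdata backup \
        /tank/pgdata/.zfs/snapshot/2026-07-06_0200/ \
        /tank/pgwal/.zfs/snapshot/2026-07-06_0200/

No hot file races. No partial writes. No drama.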
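The backup script above prunes by count. If you would rather prune by age, the fixed-width snapshot names used throughout this post make that a string comparison. The sketch below is one way to do it; the function name is made up, and it only filters names on stdin, so nothing is destroyed until you pipe its output somewhere.

```shell
#!/usr/bin/env bash
# Read snapshot names (dataset@YYYY-MM-DD_HHMM) on stdin and print those whose
# timestamp sorts before the cutoff. With a fixed-width, zero-padded format,
# string order and chronological order are the same thing.
snapshots_older_than() {
    local cutoff=$1 name stamp
    while IFS= read -r name; do
        stamp=${name#*@}
        [[ "$stamp" < "$cutoff" ]] && echo "$name"
    done
    return 0
}

# Example wiring (left commented out because it destroys snapshots):
#   cutoff=$(date -d '7 days ago' +%Y-%m-%d_%H%M)    # GNU date syntax
#   zfs list -H -t snapshot -o name -d 1 tank/pgdata \
#       | snapshots_older_than "$cutoff" \
#       | xargs -r -n1 zfs destroy
```

Run the pipeline without the final xargs stage first to see exactly which snapshots it would remove.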
Point-in-Time Recovery
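Snapshots get you back to a known state. Archived WAL gets you to an exact transaction, provided archive_mode and archive_command were already shipping WAL segments somewhere safe before the incident. Together: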
    # Stop Postgres
    systemctl stop postgresql

    # Roll back to snapshot
    # (add -r if newer snapshots exist; ZFS destroys them as part of the rollback)
    zfs rollback tank/pgdata@2026-07-06_0200
    zfs rollback tank/pgwal@2026-07-06_0200

    # Configure recovery in postgresql.conf
    # (PG 17 uses recovery_target_time in postgresql.conf, no recovery.conf)
    restore_command = 'cp /your/wal-archive/%f %p'
    recovery_target_time = '2026-07-06 03:47:00'
    recovery_target_action = 'promote'

    # Create recovery.signal to trigger targeted recovery
    # (standby.signal would start the server as a standby instead)
    touch /tank/pgdata/recovery.signal

    # Start Postgres; it will replay WAL to the target time
    systemctl start postgresql
    # Watch logs
    journalctl -fu postgresql

This is exactly what database-level backups try to do, except here the “base backup” is a ZFS snapshot that took 0.3 seconds instead of 45 minutes.
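One precondition is easy to miss at 2 AM: WAL replay only moves forward, so the snapshot you roll back to has to be older than the recovery target. A minimal guard, assuming the snapshot naming used in this post and that both timestamps are in the same time zone (the function name is hypothetical):

```shell
#!/usr/bin/env bash
# Succeeds if the snapshot's timestamp is earlier than the recovery target.
#   snapshot: dataset@YYYY-MM-DD_HHMM
#   target:   'YYYY-MM-DD HH:MM:SS'
# No time zone conversion is attempted.
snapshot_precedes_target() {
    local snapshot=$1 target=$2 stamp day hhmm
    stamp=${snapshot#*@}    # 2026-07-06_0200
    day=${stamp%_*}         # 2026-07-06
    hhmm=${stamp#*_}        # 0200
    [[ "$day ${hhmm:0:2}:${hhmm:2:2}:00" < "$target" ]]
}

if snapshot_precedes_target tank/pgdata@2026-07-06_0200 '2026-07-06 03:47:00'; then
    echo "ok: snapshot is older than the recovery target"
else
    echo "refusing: snapshot is newer than the recovery target"
fi
```

Running a check like this before the zfs rollback step is cheap, and a rollback is not something you can undo.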
Pitfalls That Will Waste Your Weekend
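RAIDZ is not your friend here. RAIDZ has higher write amplification than mirrors on small blocks: every block carries its own parity and padding sectors, so an 8K or 16K write can occupy far more raw space than its size suggests, and a RAIDZ vdev delivers roughly the random IOPS of a single disk. (This is not the RAID-5 write hole; RAIDZ’s variable-width stripes exist precisely to avoid that.) Postgres is full of small random writes. Use mirrors. RAIDZ is great for cold storage, NAS, archives. It’s measurably worse for database I/O.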
    # Good: mirrored vdevs
    zpool create tank mirror sda sdb mirror sdc sdd

    # Bad for Postgres:
    # zpool create tank raidz sda sdb sdc sdd

SLOG (a separate ZIL device) is probably not what you need. SLOG accelerates synchronous writes, specifically fsync() calls that ZFS must commit before returning. Postgres does issue fsyncs, but on a ZFS pool with NVMe vdevs, the latency is already low. SLOG helps when: your pool vdevs are slow spinning rust, you have a power-loss-protected NVMe SLOG device, and your workload is fsync-heavy (OLTP with lots of small commits). For home lab use on all-flash, it adds complexity without measurable benefit.
Double-buffering is real and you must address it. If you don’t cap the ARC as described above, you will cache everything twice and your available memory for connections and query execution will be less than you think. pg_top showing 8GB used doesn’t mean 8GB of unique data is cached.
Snapshots are not free forever. Each snapshot holds a reference to blocks that existed at snapshot time. As data changes, those blocks can’t be freed. A busy database with 30-day snapshot retention can accumulate significant space. Monitor with:
    # Smallest first; the header line stays at the top
    zfs list -t snapshot -o name,used,refer -s used tank/pgdata

Don’t forget xattr=sa and dnodesize=auto. Extended attributes in ZFS default to being stored in a hidden directory (slow for many small files). xattr=sa stores them in the dnode, ZFS’s equivalent of an inode, and dnodesize=auto lets the dnode grow to fit them. Postgres doesn’t heavily use xattrs, but it costs nothing and future-proofs the dataset.
Real Numbers
On a test setup: AMD Ryzen 7 5700G, 32GB RAM, 2x 1TB NVMe in mirror, Ubuntu 24.04, PostgreSQL 17.
    zpool list
    NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
    tank  1.82T   187G  1.63T        -         -    4%  10%  1.00x  ONLINE  -

pgbench at scale 100 (1.4GB database), 8 clients, 60 seconds:
| Config | TPS | Latency (avg) |
|---|---|---|
| ext4, default PG settings | 4,120 | 1.94ms |
| ZFS defaults (128K recordsize) | 2,890 | 2.77ms |
| ZFS tuned (16K recordsize, settings above) | 3,680 | 2.17ms |
Tuned ZFS is about 11% slower than ext4 on this hardware. That gap buys you: instantaneous crash-consistent backups, 2.3x compression ratio on this database (real number from zfs get compressratio tank/pgdata), per-block checksums, and point-in-time recovery to within seconds.
    zfs get compressratio,used,logicalused tank/pgdata
    NAME         PROPERTY       VALUE  SOURCE
    tank/pgdata  compressratio  2.31x  -
    tank/pgdata  used           81.2G  -
    tank/pgdata  logicalused    187G   -

187GB of logical data stored in 81GB. On lz4. With near-zero CPU cost.
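For anyone who wants to re-derive the percentages from the table and the zfs output, the arithmetic can be sketched as follows. The helper names are made up for this example, and the inputs are the figures quoted above.

```shell
#!/usr/bin/env bash
# Percentage by which `tps` falls short of `baseline`, to one decimal place.
pct_slower() {
    LC_ALL=C awk -v base="$1" -v tps="$2" \
        'BEGIN { printf "%.1f\n", (base - tps) / base * 100 }'
}

# Compression ratio = logical size / physical size, to two decimal places.
compress_ratio() {
    LC_ALL=C awk -v logical="$1" -v used="$2" \
        'BEGIN { printf "%.2f\n", logical / used }'
}

pct_slower 4120 3680      # tuned ZFS vs ext4   -> 10.7
pct_slower 4120 2890      # default ZFS vs ext4 -> 29.9
compress_ratio 187 81.2   # logicalused / used  -> 2.30
```

The recomputed ratio is 2.30 rather than the reported 2.31x because zfs rounds the sizes it displays. The second line is the one to remember: untuned ZFS costs about 30%, and tuning recovers roughly two thirds of that.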
Should You Bother?
Yes, if:
- You’re already running ZFS (you’ve done the hard part)
- Your database is text-heavy, JSON-heavy, or has lots of nullable/sparse columns
- You want consistent backups without backup agents or pg_basebackup complexity
- You’re on mirrors or single-disk (home lab, small VPS with ZFS)
Maybe not, if:
- You need absolute maximum IOPS and have no interest in operational simplicity
- You’re running RAIDZ (reconfigure your pool first, then revisit)
- Your database is tiny and fits in RAM anyway — at that scale, it genuinely doesn’t matter
The 10–20% overhead is real and measurable. But “real and measurable” in home lab terms means the difference between 4,100 TPS and 3,700 TPS on a workload that your single-digit concurrent users will never saturate. Meanwhile, your next backup runs in 0.3 seconds and can be sent incrementally to a backup pool over the weekend.
Run ZFS. Tune it properly. Sleep better at 2 AM.