ZFS Defaults Came From Spinning Rust. SSDs Want Something Else.
Here’s the thing about ZFS: it’s brilliant at managing HDDs. Pool-level redundancy, checksums, snapshots, copy-on-write magic—all optimized around the idea that seeks are expensive and data safety trumps raw speed. But drop ZFS on an SSD or NVMe drive and suddenly you’re paying a tax for optimizations that don’t apply anymore.
Spinning platters have 7ms average seek times. Flash memory has microseconds. Your 128 KB default recordsize, designed to amortize that seek cost, is now way too big for a database that wants 4 KB pages. Your ARC (Adaptive Replacement Cache) is sized for servers with 256 GB of RAM, and you’re running it on a 16 GB homelab box. Your TXG (transaction group) sync behavior, tuned for mechanical latency, is queuing writes when your NVMe can handle thousands of IOPS.
This post is about fixing that. Not by replacing ZFS—ZFS on flash is still rock-solid—but by recalibrating it for the medium. Let’s start with the biggest pain point: cache sizing.
ARC Sizing: The Hidden Tax on Small Hosts
ARC is ZFS’s answer to the page cache: an in-memory layer that keeps hot blocks in RAM so reads don’t have to hit the disk. On a 256 GB server with 200 GB pools, ARC is a godsend. On a 16 GB homelab box with 8 TB of storage, ARC eats lunch, leaves dinner on the table, and doesn’t apologize.
By default, OpenZFS on Linux lets ARC grow to roughly 50% of physical RAM. If you have 16 GB, that’s up to 8 GB in cache. But you also need buffer space for:
- Kernel page cache (for non-ZFS filesystems, application buffers, mmap regions)
- Application working memory (your database, VMs, containers)
- Dirty buffers for pending writes
Leave all of that to chance and you get a tug-of-war: your application needs memory, ARC shrinks and drops its cache, the pressure eases, ARC grows back, and the cycle repeats. It’s like hiring a forklift to move a couch: technically it works, but your memory bus will have questions.
Set zfs_arc_max explicitly:
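echo "options zfs zfs_arc_max=2147483648" >> /etc/modprobe.d/zfs.conf
modprobe -r zfs && modprobe zfs
That’s 2 GB (2147483648 bytes) on a 16 GB host, which leaves 14 GB for the kernel, your apps, and breathing room. Reloading the module only works when no pools are imported; on a live system, write the same value to /sys/module/zfs/parameters/zfs_arc_max instead, or reboot. On a 32 GB host, 8 GB is sane. On a 64 GB host, 25% of RAM still feels right.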
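The byte value is easy to mistype by hand. Here is a minimal sketch of the arithmetic, assuming a POSIX shell; gib_to_bytes is a helper name made up for this example, not a ZFS tool.

```shell
# Hypothetical helper: convert whole GiB to the byte count zfs_arc_max expects.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 2   # 2147483648, the 2 GB limit for the 16 GB host
gib_to_bytes 8   # 8589934592, the 8 GB limit for the 32 GB host

# To apply a limit on a running system (as root), something like:
#   gib_to_bytes 2 > /sys/module/zfs/parameters/zfs_arc_max
```

ARC may take a while to shrink after the limit is lowered, so give it a few minutes before judging the result.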
If you’re running a database on ZFS (PostgreSQL, MySQL, whatever), this becomes even more critical. Databases manage their own buffer pools. If ARC caches the same blocks, each hot page sits in RAM twice, and query latency swings depending on which cache happens to hold it. Set ARC smaller and let the database own the caching layer.
Monitor it:
arc_summary
# or
arcstat 10
Watch the cache hit ratio: arc_summary reports it overall, and arcstat prints per-interval read and hit/miss figures. (zpool iostat shows pool I/O, not ARC statistics.) Below 50% on an SSD workload? Your ARC is too small or you’re doing sequential reads (which shouldn’t be cached anyway). Above 95%? You’re caching your working data, but is ARC your bottleneck or is your storage? Flip primarycache=metadata (explained below) and find out.
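If you want the raw number without a reporting tool, OpenZFS on Linux exposes the ARC counters as a kstat file. The sketch below assumes the usual "name type value" layout of /proc/spl/kstat/zfs/arcstats; arc_hit_ratio is a hypothetical helper name, and the figure it prints is cumulative since the module was loaded, not a per-interval rate.

```shell
# Hypothetical helper: overall ARC hit ratio (percent) from the kstat file.
# Assumes each counter line looks like: <name> <type> <value>
# An optional first argument overrides the file path.
arc_hit_ratio() {
    awk '$1 == "hits"   { h = $3 }
         $1 == "misses" { m = $3 }
         END { if (h + m > 0) printf "%.1f\n", 100 * h / (h + m) }' \
        "${1:-/proc/spl/kstat/zfs/arcstats}"
}

# Usage on a host with the zfs module loaded:
#   arc_hit_ratio
```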
Recordsize: Stop Using 128K for Everything
The 128 KB default recordsize assumes you’re storing files the way a human would organize them on a desktop: documents, images, videos, backups. It amortizes the overhead of metadata and small writes across bigger chunks.
But SSDs don’t care about chunk size the way HDDs do. A database on 128 KB records pays for that default on every uncached random read. PostgreSQL wants to read 8 KB pages. ZFS says “sure, I’ll read 128 KB and throw away 120 KB.” That’s not just waste, it’s cache pollution.
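To put a number on that waste, divide the recordsize by the page size the application actually asks for. This is a rough worst-case sketch for uncached random reads; read_amplification is a name invented for this example.

```shell
# Hypothetical helper: how many times more data ZFS reads than the
# application requested, given recordsize and page size in KB.
read_amplification() {
    echo $(( $1 / $2 ))
}

read_amplification 128 8   # 16: one 8K page costs a full 128K record
read_amplification 16 8    # 2
read_amplification 8 8     # 1
```

The same ratio applies to small overwrites, where ZFS has to read, modify, and rewrite the whole record.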
Set recordsize per-dataset:
# For databases (Postgres, MySQL, SQLite, etc.)
zfs set recordsize=16K tank/postgres
# For OLTP workloads that do random small writes
zfs set recordsize=8K tank/database
# For video storage, archives, sequential backups
zfs set recordsize=1M tank/media
# For general purpose (documents, code repos, containers)
# recordsize=128K is still fine; leave it
This is a per-dataset setting. You can set it at pool creation, but you probably won’t get the fine-tuning right the first time, so use zfs set per dataset. The recordsize for existing blocks doesn’t change; only new writes use the new setting.
Catch: recordsize changes propagate to child datasets if you don’t override them. Create your databases dataset-by-dataset if you’re tuning aggressively:
zfs create tank/postgres
zfs set recordsize=16K tank/postgres
zfs create tank/media
zfs set recordsize=1M tank/media
If you’re not sure, start with 64K as a middle ground, monitor PostgreSQL query times, and dial it down if you see excessive cache misses.
Compression: LZ4 Is Your Friend; Zstd Is Your Secret Weapon
ZFS compression is free money on SSDs. On HDDs, the CPU time spent compressing has to pay for itself in saved seek and transfer time. On SSDs the CPU cost is small next to the gain: fewer bytes written and fewer bytes read per request.
LZ4 (compression=lz4, and what compression=on selects in current OpenZFS) is fast, typically saves 20-40% on mixed workloads, and adds negligible CPU overhead. Enable it on everything:
zfs set compression=lz4 tank
It applies to all child datasets. On highly compressible data such as text, configs, and code, you can see 2-3x space savings. Your SSD just became effectively bigger.
For cold storage (archives, backups you rarely read), zstd-3 is worth the CPU:
zfs set compression=zstd-3 tank/archive
It compresses 30-50% better than LZ4, costs a bit more CPU, but if you’re only decompressing during recovery or rare reads, who cares. And your NVMe is so fast that even with decompression overhead, you’re usually still faster than streaming uncompressed data from spinning rust.
Don’t use zstd-10 or higher on active datasets. The CPU cost isn’t worth the marginal compression gain on data you access regularly.
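The figures in this section mix two units: "percent saved" and "Nx ratio". ZFS reports the latter through the compressratio property. A small sketch for converting one to the other, assuming awk is available; saved_percent is a hypothetical helper name.

```shell
# Hypothetical helper: convert a compressratio such as "1.25x" (or "1.25")
# into the percentage of space saved.
saved_percent() {
    awk -v r="${1%x}" 'BEGIN { printf "%.0f\n", (1 - 1 / r) * 100 }'
}

saved_percent 1.25x   # 20
saved_percent 2.00x   # 50
saved_percent 3.00x   # 67

# Usage against a real dataset:
#   saved_percent "$(zfs get -H -o value compressratio tank)"
```

So a 20-40% saving corresponds to a ratio of roughly 1.25x to 1.67x, and 2-3x means half to two thirds of the space saved.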
Autotrim vs. TRIM Polling: Just Enable It
SSDs need TRIM to know which blocks are truly free. Without it, the drive treats “deleted” as “potentially needed later” and allocates garbage collection cycles to blocks you don’t care about.
ZFS supports autotrim, which sends TRIM commands asynchronously when blocks are freed. On an SSD, this is sane:
zpool set autotrim=on tank
Autotrim batches TRIM commands in the background as space is freed and doesn’t block writes. On a HDD, autotrim is pointless (most HDDs ignore TRIM anyway) and slightly wasteful. On flash, it’s free performance.
If you’d rather trim on a schedule, don’t reach for fstrim: it doesn’t work on ZFS. Run zpool trim from cron instead:
0 2 * * * root /sbin/zpool trim tank
But autotrim is cleaner and already built in. Just set it and forget it.
Sync Mode and TXG: When Writes Block
ZFS batches writes into transaction groups (TXGs) and commits one roughly every 5 seconds. Asynchronous writes are acknowledged as soon as they’re in memory. Synchronous writes (fsync, O_SYNC) are acknowledged only after they’re logged to the ZFS Intent Log (ZIL) on stable storage. This is safe (acknowledged sync writes are durable) but can feel slow. The sync property controls how those synchronous requests are handled; it does not change TXG timing:
# How synchronous write requests are handled (default shown)
zfs set sync=standard tank
Three options:
- sync=standard (default): TXGs commit every 5 seconds, or sooner when enough dirty data accumulates. Async writes are ACK’d when they’re in memory but not yet on disk; sync writes wait for the ZIL. Safe enough for most workloads.
- sync=always: Every write hits the disk before returning. Safe as houses, but slow. Only for databases that need fsync-on-every-transaction semantics. Even then, your database probably handles this itself, and double-syncing is wasteful.
- sync=disabled: Don’t wait for anything. Fast, but if you lose power between TXG commits, you lose recent data. Only for caches or data you can afford to lose.
For a homelab SSD/NVMe pool:
zfs set sync=standard tank
Stay here. Autotrim, checksums, snapshots: all working, all reasonably fast. If you’re serving a database that needs explicit durability, let the database handle fsync and leave ZFS sync alone.
On NVMe, a TXG commit takes milliseconds, so the 5-second batching interval rarely shows up as visible latency. If it does, your workload is write-saturated and no tuning here will help; you need more disk or to shard the load.
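Before touching any of this, it helps to see what the running module is actually using. On Linux, OpenZFS exposes its tunables as files under /sys/module/zfs/parameters. A small sketch; zfs_param is a hypothetical wrapper, and its second argument exists only so the function can be pointed at another directory.

```shell
# Hypothetical helper: print the current value of a ZFS module parameter.
# $1 = parameter name, $2 = optional directory override.
zfs_param() {
    cat "${2:-/sys/module/zfs/parameters}/$1"
}

# Usage on a host with the zfs module loaded:
#   zfs_param zfs_txg_timeout      # TXG commit interval in seconds
#   zfs_param zfs_dirty_data_max   # ceiling on dirty data, in bytes
```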
Primary Cache: Separating Metadata from Data
By default, ZFS caches both metadata (inodes, dentries, indirect blocks) and data (actual file contents) in ARC.
For certain workloads, especially databases, you want ARC to hold metadata only:
zfs set primarycache=metadata tank/postgres
This tells ZFS: “Cache metadata aggressively, but treat data blocks as transient. Let the database’s buffer pool own the data.”
Why? Because your database already caches what matters to it. PostgreSQL’s shared_buffers, MySQL’s innodb_buffer_pool, SQLite’s page cache: these are tuned for your schema and query patterns. ARC doesn’t know that. If ARC is also caching the same pages, every hot page occupies RAM twice.
Set primarycache=metadata for:
- PostgreSQL, MySQL, MariaDB datasets
- Redis, Valkey, Memcached instances
- Any application that manages its own page cache
Leave primarycache=all (default) for:
- General-purpose storage (NFS, Samba shares)
- Media libraries (where the OS cache is your only cache)
- Web server docroots (let ARC cache frequently served files)
Log VDEV (SLOG) vs. Special VDEV: When Each Helps
Two optional VDEVs exist to offload specific workloads:
Log VDEV (SLOG): A fast, small device (usually an SSD or NVMe) that holds the ZFS Intent Log (ZIL). When you write with sync=always or the database demands durability, the write lands on the SLOG first and is acknowledged; the data then reaches the main pool with the next TXG commit.
Use a SLOG when:
- You’re running synchronous databases (Postgres with fsync enabled) and you want write latency under 1ms
- Your main pool is HDDs and you want to decouple write latency from HDD seek time
Skip a SLOG when:
- Your main pool is already NVMe (the SLOG won’t help; NVMe is already <1ms)
- Your workload is read-heavy or async writes
- You don’t have a spare fast device (a SLOG failure can cause issues, depending on how it fails)
Special VDEV: A fast device that stores pool metadata, and optionally dedup tables and small data blocks (those at or below the dataset’s special_small_blocks threshold).
Use a special VDEV when:
- You’re running dedup and the dedup table doesn’t fit in ARC (very rare on small pools)
- You have tiny recordsizes (8K) and want metadata I/O separated from data I/O
Skip a special VDEV unless you have a specific problem it solves.
For a homelab NVMe setup: Neither is necessary. Your main pool is already fast enough.
NVMe-Specific Kernel Tuning
Modern Linux kernels handle NVMe queues well, but a few tweaks can help:
# Check your NVMe device
nvme list
# See which scheduler is active (shown in brackets); "none" is the usual NVMe default
cat /sys/block/nvme0n1/queue/scheduler
# Try mq-deadline (the legacy "deadline" scheduler is gone on blk-mq kernels)
echo mq-deadline > /sys/block/nvme0n1/queue/scheduler
# Set the scheduler queue depth; check the current value first, 32 may be lower than your default
cat /sys/block/nvme0n1/queue/nr_requests
echo 32 > /sys/block/nvme0n1/queue/nr_requests
# Don't force completions back to the submitting CPU group
echo 0 > /sys/block/nvme0n1/queue/rq_affinity
Make this persist with udev rules in a file under /etc/udev/rules.d/:
ACTION=="add|change", KERNEL=="nvme*n*", ATTR{queue/scheduler}="mq-deadline"
ACTION=="add|change", KERNEL=="nvme*n*", ATTR{queue/nr_requests}="32"
Then reload udev:
udevadm control --reload
udevadm trigger
On modern kernels (5.10+), the defaults are pretty good already. These tweaks may squeeze out a few more percent of throughput on sustained workloads; benchmark before and after to confirm.
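To confirm a scheduler change took effect, read the sysfs file back: the kernel lists every available scheduler and puts the active one in square brackets. A sketch of pulling that out; active_scheduler is a hypothetical helper name.

```shell
# Hypothetical helper: extract the bracketed (active) scheduler from the
# contents of a queue/scheduler sysfs file.
active_scheduler() {
    echo "$1" | sed -n 's/.*\[\(.*\)\].*/\1/p'
}

active_scheduler "[none] mq-deadline kyber"       # none
active_scheduler "none [mq-deadline] kyber bfq"   # mq-deadline

# Usage across all NVMe namespaces:
#   for f in /sys/block/nvme*n*/queue/scheduler; do
#       printf '%s: %s\n' "$f" "$(active_scheduler "$(cat "$f")")"
#   done
```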
Checksums and the Myth of “Free” Verification
ZFS checksums every block. By default, it uses Fletcher-4, which is fast but less thorough than SHA256. You can change it:
zfs set checksum=sha256 tank
# or
zfs set checksum=skein tank
But here’s the thing: checksums cost CPU. On reads, ZFS verifies every block you touch. Fletcher-4 is cheap enough that you rarely notice it, but on an SSD pushing 50K+ IOPS a cryptographic hash like SHA-256 can make verification the bottleneck instead of the disk.
Rule of thumb:
- checksum=fletcher4 (default): Fast, catches most bit flips. Use this.
- checksum=sha256: Paranoid mode. Use for cold storage or when cryptographic integrity matters.
For a homelab, fletcher4 is fine. Your SSD is more likely to fail catastrophically than to silently flip bits. When it does, ZFS will scream about it.
Monitoring: Know What’s Actually Happening
Before tuning, measure:
# Full ARC and L2ARC stats
arc_summary
# Pool-level health and I/O
zpool status tank
zpool iostat 5
# Per-dataset space and compression properties
zfs get -p all tank | grep -E 'used|avail|compress'
After tuning, watch for:
- ARC hit rate (from arc_summary or arcstat): Should be >60% for cached workloads, >80% for metadata.
- Read/write latency: zpool iostat -l 5 shows read and write latency per pool. If reads are >10ms on NVMe, something’s wrong (usually ARC miss rate or TXG backpressure).
- Compression ratio: zfs get compressratio tank tells you if compression is actually helping.
Keep a baseline before making changes. Then tweak one knob at a time and remeasure. Otherwise you won’t know what actually helped.
Postgres on ZFS: A Specific Example
Let’s tie this together. You’re running PostgreSQL 16 on a 2TB NVMe with 16 GB of RAM:
# Create the dataset
zfs create tank/postgres
zfs set recordsize=16K tank/postgres
zfs set compression=lz4 tank/postgres
zfs set primarycache=metadata tank/postgres
zfs set sync=standard tank/postgres
# Tune ARC globally (done once per host)
echo "options zfs zfs_arc_max=2147483648" >> /etc/modprobe.d/zfs.conf
modprobe -r zfs && modprobe zfs
# In PostgreSQL config (postgresql.conf)
shared_buffers = 4GB            # 25% of system RAM; let Postgres own data caching
effective_cache_size = 12GB     # Planner hint; rest of system RAM
synchronous_commit = on         # Let Postgres fsync; ZFS handles durability
The result:
- Postgres caches its working set (4 GB shared_buffers)
- ZFS caches metadata (inodes, directory entries) in the remaining ARC
- Recordsize is 16K, two of Postgres’s 8 KB pages per record, so read amplification stays low
- Compression saves space without adding latency
- Syncs go through ZFS’s standard path, which is safe and reasonably fast
Query speed improves because Postgres isn’t competing with ARC for the same cache, and each layer knows what it’s caching.
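It is worth writing the memory budget down explicitly, since the ARC limit and shared_buffers come out of the same 16 GB. A trivial sketch of that subtraction; remaining_gb is a name invented here, and the inputs are the figures from this example.

```shell
# Hypothetical helper: RAM left over after ARC and shared_buffers, in GB.
# $1 = total RAM, $2 = zfs_arc_max, $3 = shared_buffers
remaining_gb() {
    echo $(( $1 - $2 - $3 ))
}

remaining_gb 16 2 4   # 10
```

That remainder has to cover the kernel, Postgres backends and their work_mem, and anything else running on the box.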
A Sane Starting Config
Here’s a template for a 2TB NVMe pool on a 16 GB homelab box:
# At creation time
zpool create -f -O atime=off -O recordsize=128K tank nvme0n1
# Then tune (autotrim is a pool property, so it is set with zpool, not zfs)
zfs set compression=lz4 tank
zfs set sync=standard tank
zpool set autotrim=on tank
# Per-dataset: create databases, media, general storage separately
zfs create tank/postgres
zfs set recordsize=16K tank/postgres
zfs set primarycache=metadata tank/postgres
zfs create tank/media
zfs set recordsize=1M tank/media
zfs create tank/docker
zfs set recordsize=128K tank/docker
# ARC sizing (do this once globally)
echo "options zfs zfs_arc_max=2147483648" >> /etc/modprobe.d/zfs.conf
modprobe -r zfs && modprobe zfs
# Verify
arc_summary
zpool status tank
This config:
- Lets your database run close to its native page size (Postgres uses 8 KB pages, InnoDB 16 KB)
- Compresses general storage by 2-3x
- Keeps ARC reasonable for your RAM constraints
- Automatically trims freed blocks
- Checksums everything (fletcher4 is fine)
- Doesn’t wait unnecessarily for writes
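If you create datasets often, the per-dataset part of the template can be captured as a dry-run generator that only prints commands. This is a sketch, not a tool: profile_cmds and the profile names postgres, media, and general are made up for this example and simply mirror the settings used in this post.

```shell
# Hypothetical helper: print (not run) the zfs commands for a dataset,
# using the per-workload settings from this post.
# $1 = dataset name, $2 = profile (postgres | media | general)
profile_cmds() {
    ds=$1
    case $2 in
        postgres)
            printf 'zfs set recordsize=16K %s\n' "$ds"
            printf 'zfs set primarycache=metadata %s\n' "$ds"
            ;;
        media)
            printf 'zfs set recordsize=1M %s\n' "$ds"
            ;;
        general)
            printf 'zfs set recordsize=128K %s\n' "$ds"
            ;;
        *)
            echo "unknown profile: $2" >&2
            return 1
            ;;
    esac
}

profile_cmds tank/postgres postgres
```

Review the printed commands first; once they look right, piping the output to sh as root applies them.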
From here, measure for a week. Watch iostat, query latency, cache hit rates. Adjust recordsize or ARC if something stands out. But this baseline won’t surprise you.
ZFS is powerful. It’s also opinionated. Spinning rust shaped those opinions. On flash, a few tweaks align those opinions with your actual hardware, and suddenly ZFS on SSD/NVMe feels less like a penalty box and more like a feature you paid for.