
ZFS Tuning for SSDs and NVMe

By SumGuy 13 min read

ZFS Defaults Came From Spinning Rust. SSDs Want Something Else.

Here’s the thing about ZFS: it’s brilliant at managing HDDs. Pool-level redundancy, checksums, snapshots, copy-on-write magic—all optimized around the idea that seeks are expensive and data safety trumps raw speed. But drop ZFS on an SSD or NVMe drive and suddenly you’re paying a tax for optimizations that don’t apply anymore.

Spinning platters have 7ms average seek times. Flash memory has microseconds. Your 128 KB default recordsize, designed to amortize that seek cost, is now way too big for a database that wants 4 KB pages. Your ARC (Adaptive Replacement Cache) is sized for servers with 256 GB of RAM, and you’re running it on a 16 GB homelab box. Your TXG (transaction group) sync behavior, tuned for mechanical latency, is queuing writes when your NVMe can handle thousands of IOPS.

This post is about fixing that. Not by replacing ZFS—ZFS on flash is still rock-solid—but by recalibrating it for the medium. Let’s start with the biggest pain point: cache sizing.

ARC Sizing: The Hidden Tax on Small Hosts

ARC is ZFS’s answer to the page cache: an in-memory layer that serves hot blocks without hitting the disk. On a 256 GB server with 200 GB pools, ARC is a godsend. On a 16 GB homelab box with 8 TB of storage, ARC eats lunch, leaves dinner on the table, and doesn’t apologize.

By default, ZFS will claim up to 50% of your RAM for ARC on Linux (capped at 25% of physical memory on older kernels). If you have 16 GB, that’s up to 8 GB in cache. But the kernel, your applications, and whatever buffer pools they manage need memory too.

Leave all of that to chance and you’ll watch ZFS’s ARC evict everything when your application needs to page fault, then ARC rebuilds, then your app evicts again. It’s like hiring a forklift to move a couch—technically it works, but your memory bus will have questions.

Set zfs_arc_max explicitly:

Terminal window
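echo "options zfs zfs_arc_max=2147483648" >> /etc/modprobe.d/zfs.conf
# Apply now without unloading the module (modprobe -r fails while pools are imported)
echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max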

That’s 2 GB (2147483648 bytes) on a 16 GB host. Leaves 14 GB for the kernel, your apps, and breathing room. On a 32 GB host, 8 GB is sane. On a 64 GB host, 25% of RAM still feels right.
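The byte value is easy to mistype, so here is a minimal sketch that derives it from a GiB figure. The helper name arc_max_bytes is made up for illustration; the script only prints the option line and does not touch any config file.

```shell
#!/bin/sh
# Convert a whole-number GiB figure into the byte value zfs_arc_max expects.
# arc_max_bytes is a hypothetical helper name, not part of ZFS.
arc_max_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Print the modprobe option line for a 2 GiB cap. Redirect it into
# /etc/modprobe.d/zfs.conf yourself once you are happy with the value.
echo "options zfs zfs_arc_max=$(arc_max_bytes 2)"
```

For the 32 GB host mentioned above, arc_max_bytes 8 gives the 8 GiB equivalent.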

If you’re running a database on ZFS—PostgreSQL, MySQL, whatever—this becomes even more critical. Databases manage their own buffer pools. If ARC is also trying to cache the same data blocks, you’re wasting RAM and creating weird interactions where the database’s planner sees different performance in cache vs. not. Set ARC smaller and let the database own the caching layer.

Monitor it:

Terminal window
arc_summary
# or, for a rolling view every 10 seconds
arcstat 10

Watch the cache hit ratio (arc_summary reports the overall figure; arcstat shows it per interval). Below 50% on an SSD workload? Your ARC is too small or you’re doing sequential reads (which shouldn’t be cached anyway). Above 95%? You’re caching working data, but is ARC your bottleneck or is your storage? Flip primarycache=metadata (explained below) and find out.
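If you want the raw number for a script or a dashboard, a rough sketch is to compute it from the hits and misses counters that ZFS on Linux exposes in /proc/spl/kstat/zfs/arcstats. The function name arc_hit_ratio is an assumption for illustration, and the counters are cumulative since the module was loaded, so this is a lifetime ratio rather than a per-interval one.

```shell
#!/bin/sh
# Print the ARC hit ratio (percent) from an arcstats-format file.
# $1: path to the stats file; defaults to the live kstat on Linux.
# arc_hit_ratio is a hypothetical helper name.
arc_hit_ratio() {
    awk '$1 == "hits"   { h = $3 }
         $1 == "misses" { m = $3 }
         END {
             if (h + m > 0) { printf "%.1f\n", 100 * h / (h + m) } else { print "n/a" }
         }' "${1:-/proc/spl/kstat/zfs/arcstats}"
}

# Only query the live counters when the ZFS module is loaded.
if [ -r /proc/spl/kstat/zfs/arcstats ]; then
    arc_hit_ratio
fi
```

To see how a change affects the ratio, sample the two counters before and after a test run and compare the deltas instead of the lifetime totals.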

Recordsize: Stop Using 128K for Everything

The 128 KB default recordsize assumes you’re storing files the way a human would organize them on a desktop: documents, images, videos, backups. It amortizes the overhead of metadata and small writes across bigger chunks.

But SSDs don’t care about chunk size the way HDDs do. A 16 GB database on ZFS with 128 KB records pays read amplification on every random query. PostgreSQL wants to read 8 KB pages. ZFS says “sure, I’ll read 128 KB and throw away 120 KB.” That’s not just waste; it’s cache pollution.
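The arithmetic behind that complaint, as a small sketch. It assumes uncached random reads where each page read pulls in one whole record, and it ignores compression; the function names are illustrative only.

```shell
#!/bin/sh
# How many times more data is read than the application asked for.
# $1: recordsize in KiB, $2: application page size in KiB
read_amplification() {
    echo $(( $1 / $2 ))
}

# KiB read but not requested, per page read.
wasted_kib() {
    echo $(( $1 - $2 ))
}

echo "128K records, 8K pages: $(read_amplification 128 8)x, $(wasted_kib 128 8) KiB discarded"
echo "16K records, 8K pages: $(read_amplification 16 8)x, $(wasted_kib 16 8) KiB discarded"
```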

Set recordsize per-dataset:

Terminal window
# For databases (Postgres, MySQL, SQLite, etc.)
zfs set recordsize=16K tank/postgres
# For OLTP workloads that do random small writes
zfs set recordsize=8K tank/database
# For video storage, archives, sequential backups
zfs set recordsize=1M tank/media
# For general purpose (documents, code repos, containers)
# recordsize=128K is still fine; leave it

This is a per-dataset setting. You can set it at pool creation, but you probably won’t get the fine-tuning right the first time, so use zfs set per dataset. The recordsize for existing blocks doesn’t change—only new writes use the new setting.

Catch: recordsize changes propagate to child datasets if you don’t override them. Create your databases dataset-by-dataset if you’re tuning aggressively:

Terminal window
zfs create tank/postgres
zfs set recordsize=16K tank/postgres
zfs create tank/media
zfs set recordsize=1M tank/media

If you’re not sure, start with 64K as a middle ground, monitor PostgreSQL query times, and dial it down if you see excessive cache misses.
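If you manage more than a handful of datasets, the per-workload sizes from this section can be kept in one small lookup. This is a dry-run sketch: the workload labels and the function name recordsize_for are hypothetical, the sizes are the ones suggested earlier in this section, and the loop only prints the commands rather than running them.

```shell
#!/bin/sh
# Map a workload label to the recordsize suggested in this post.
# The labels are made up for illustration; adjust them to your datasets.
recordsize_for() {
    case "$1" in
        postgres|mysql|sqlite) echo 16K ;;
        oltp)                  echo 8K ;;
        media|archive|backup)  echo 1M ;;
        *)                     echo 128K ;;
    esac
}

# Dry run: print each command instead of executing it.
for ds in postgres media docker; do
    echo "zfs set recordsize=$(recordsize_for "$ds") tank/$ds"
done
```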

Compression: LZ4 Is Your Friend; Zstd Is Your Secret Weapon

ZFS compression is free money on SSDs. Unlike HDDs, where CPU time for compression has to amortize against seek time savings, SSDs benefit from smaller data = fewer reads.

LZ4 (default, compression=lz4) is fast, hits 20-40% compression on typical workloads, and adds negligible CPU overhead. Enable it on everything:

Terminal window
zfs set compression=lz4 tank

It applies to all child datasets. On a 16 GB dataset of text, configs, and code, you’ll see 2-3x space savings. Your SSD just became effectively bigger.

For cold storage (archives, backups you rarely read), zstd-3 is worth the CPU:

Terminal window
zfs set compression=zstd-3 tank/archive

It compresses 30-50% better than LZ4, costs a bit more CPU, but if you’re only decompressing during recovery or rare reads, who cares. And your NVMe is so fast that even with decompression overhead, you’re usually still faster than streaming uncompressed data from spinning rust.

Don’t use zstd-10 or higher on active datasets. The CPU cost isn’t worth the marginal compression gain on data you access regularly.
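ZFS reports compression as a ratio (the compressratio property, printed like 1.52x), while this section talks in both percentages and multiples. A small sketch for converting between the two; saved_percent is an illustrative name, and the input is assumed to be in the format zfs prints.

```shell
#!/bin/sh
# Convert a compressratio value such as "1.52x" into percent of space saved.
# saved_percent is a hypothetical helper name.
saved_percent() {
    # Strip the trailing "x", then compute 1 - 1/ratio.
    echo "${1%x}" | awk '{ printf "%.1f\n", 100 * (1 - 1 / $1) }'
}

saved_percent 1.25x   # 20.0
saved_percent 2.00x   # 50.0
saved_percent 3.00x   # 66.7
```

Read that way, 20-40% saved corresponds to a ratio of roughly 1.25x to 1.67x, and a 2-3x ratio means 50-67% saved.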

Autotrim vs. TRIM Polling: Just Enable It

SSDs need TRIM to know which blocks are truly free. Without it, the drive treats “deleted” as “potentially needed later” and allocates garbage collection cycles to blocks you don’t care about.

ZFS supports autotrim, which sends TRIM commands asynchronously when blocks are freed. On an SSD, this is sane:

Terminal window
zpool set autotrim=on tank

Autotrim batches TRIM commands in the background as blocks are freed and doesn’t block writes. On a HDD, autotrim is pointless (HDDs ignore TRIM anyway) and slightly wasteful. On flash, it’s free performance.

If you’d rather trim on a schedule, run a manual zpool trim from cron (fstrim does not work on ZFS datasets):

/etc/cron.d/zfs-trim
0 2 * * * root /sbin/zpool trim tank

But autotrim is cleaner and already built in. Just set it and forget it.

Sync Mode and TXG: When Writes Block

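By default, ZFS collects writes in memory and flushes them to disk as a transaction group (TXG) roughly every 5 seconds. Asynchronous writes are acknowledged as soon as they are buffered. Synchronous writes (fsync, O_SYNC) are first committed to the ZFS Intent Log (ZIL) on stable storage, and that wait is what can feel slow. The sync property controls how those synchronous requests are treated:

Terminal window
# Honor application sync requests (the default)
zfs set sync=standard tank

Three options: standard (sync requests are honored through the ZIL), always (every write is treated as synchronous), and disabled (sync requests are acknowledged immediately, so a crash or power loss can drop the last few seconds of writes).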

For a homelab SSD/NVMe pool:

Terminal window
zfs set sync=standard tank

Stay here. Autotrim, checksums, snapshots—all working, all reasonably fast. If you’re serving a database that needs explicit durability, let the database handle fsync and leave ZFS sync alone.

On NVMe, this is already cheap: a TXG still goes out about every 5 seconds, but each commit finishes in milliseconds. You’re not waiting long. If you are, your workload is write-saturated and no tuning here will help; you need more disk or to shard the load.
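To get a feel for what "write-saturated" means, here is a back-of-the-envelope sketch of how much data one TXG carries at a steady asynchronous write rate. It assumes the default 5-second interval (the zfs_txg_timeout module parameter) and treats the result as an upper bound, since ZFS also forces an earlier commit once buffered dirty data passes its own limits. The function name is made up for illustration.

```shell
#!/bin/sh
# Upper bound on data buffered in one TXG at a steady write rate.
# $1: write rate in MB/s
# $2: TXG interval in seconds (optional, defaults to 5)
txg_buffered_mb() {
    echo $(( $1 * ${2:-5} ))
}

echo "200 MB/s over 5 s: $(txg_buffered_mb 200) MB per TXG"
echo "1000 MB/s over 5 s: $(txg_buffered_mb 1000) MB per TXG"
```

If the pool cannot flush that much within the interval, writes start to queue, and that is the point where more or faster devices help and tuning does not.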

Primary Cache: Separating Metadata from Data

By default, ZFS caches both metadata (inodes, dentries, indirect blocks) and data (actual file contents) in ARC.

For certain workloads, especially databases, you want metadata only:

Terminal window
zfs set primarycache=metadata tank/postgres

This tells ZFS: “Cache metadata aggressively, but treat data blocks as transient. Let the database’s buffer pool own the data.”

Why? Because your database already caches what matters to it. PostgreSQL’s shared_buffers, MySQL’s innodb_buffer_pool, SQLite’s page cache—these are tuned for your schema and query patterns. ARC doesn’t know that. If ARC is also caching the same pages, you’re wasting RAM and creating coherency headaches.

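Set primarycache=metadata for datasets whose application keeps its own data cache, such as the databases above.

Leave primarycache=all (default) for everything else, where ARC is the only cache in front of the disk.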

Log VDEV (SLOG) vs. Special VDEV: When Each Helps

Two optional VDEVs exist to offload specific workloads:

Log VDEV (SLOG): A fast, small device (usually an SSD or NVMe) that holds the ZFS Intent Log (ZIL). When you write with sync=always or the database demands durability, the write lands on the SLOG first, then gets committed to the main pool with the next transaction group.

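Use a SLOG when the workload is dominated by synchronous writes and the main pool is slower than the log device.

Skip a SLOG when the pool itself is already fast flash, or when most writes are asynchronous.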

Special VDEV: A fast device that stores pool metadata and, optionally, dedup tables and small data blocks (controlled per dataset with the special_small_blocks property).

Use a special VDEV when:

Skip a special VDEV unless you have a specific problem it solves.

For a homelab NVMe setup: Neither is necessary. Your main pool is already fast enough.

NVMe-Specific Kernel Tuning

Modern Linux kernels handle NVMe queues well, but a few tweaks help:

Terminal window
# Check your NVMe device
nvme list
# Set deadline scheduler (good for NVMe)
echo deadline > /sys/block/nvme0n1/queue/scheduler
# Or mq-deadline (preferred on recent kernels)
echo mq-deadline > /sys/block/nvme0n1/queue/scheduler
# Set the queue depth (read the current value first; 32 is often lower than the default)
echo 32 > /sys/block/nvme0n1/queue/nr_requests
# Turn off request-completion CPU affinity
echo 0 > /sys/block/nvme0n1/queue/rq_affinity

Make this persist:

/etc/udev/rules.d/60-zfs-nvme.rules
ACTION=="add|change", KERNEL=="nvme*n*", ATTR{queue/scheduler}="mq-deadline"
ACTION=="add|change", KERNEL=="nvme*n*", ATTR{queue/nr_requests}="32"

Then reload udev:

Terminal window
udevadm control --reload
udevadm trigger

On modern kernels (5.10+), the defaults are pretty good already. These tweaks squeeze another 5-10% throughput on sustained workloads.
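Before changing anything, it helps to record what each device is using now. The scheduler file in sysfs lists every available scheduler and marks the active one in square brackets. This sketch extracts that name; active_scheduler is an illustrative function name.

```shell
#!/bin/sh
# Extract the active scheduler from the sysfs format, where the current
# choice is shown in square brackets, e.g. "[none] mq-deadline kyber".
# active_scheduler is a hypothetical helper name.
active_scheduler() {
    echo "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Report every NVMe namespace present; prints nothing on hosts without NVMe.
for f in /sys/block/nvme*n*/queue/scheduler; do
    [ -r "$f" ] || continue
    echo "$f: $(active_scheduler "$(cat "$f")")"
done
```

Running it before and after the udev reload is a quick way to confirm the rule applied.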

Checksums and the Myth of “Free” Verification

ZFS checksums every block. By default, it uses Fletcher-4, which is fast but less thorough than SHA256. You can change it:

Terminal window
zfs set checksum=sha256 tank
# or
zfs set checksum=skein tank

But here’s the thing: checksums cost CPU. On reads, ZFS verifies every block you touch. On an SSD where you’re already getting 50K+ IOPS, that verification is now your bottleneck, not the disk.

Rule of thumb:

For a homelab, fletcher4 is fine. Your SSD is more likely to fail catastrophically than to silently flip bits. When it does, ZFS will scream about it.

Monitoring: Know What’s Actually Happening

Before tuning, measure:

Terminal window
# Full ARC and L2ARC stats
arc_summary
# Pool-level I/O and space
zpool status tank
zpool iostat 5
# Dataset space usage and compression properties
zfs get -p all tank | grep -E 'used|avail|compress'

After tuning, watch for:

Keep a baseline before making changes. Then tweak one knob at a time and remeasure. Otherwise you won’t know what actually helped.

Postgres on ZFS: A Specific Example

Let’s tie this together. You’re running PostgreSQL 16 on a 2TB NVMe with 16 GB of RAM:

Terminal window
# Create the dataset
zfs create tank/postgres
zfs set recordsize=16K tank/postgres
zfs set compression=lz4 tank/postgres
zfs set primarycache=metadata tank/postgres
zfs set sync=standard tank/postgres
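# Tune ARC globally (once per host; ARC is shared by all pools)
echo "options zfs zfs_arc_max=2147483648" >> /etc/modprobe.d/zfs.conf
# Apply now without unloading the module (modprobe -r fails while pools are imported)
echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max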
# In PostgreSQL config (postgresql.conf)
shared_buffers = 4GB # 25% of system RAM; let Postgres own data caching
effective_cache_size = 12GB # Planner hint; rest of system RAM
synchronous_commit = on # Let Postgres fsync; ZFS handles durability

The result:

Query speed improves because Postgres isn’t competing with ARC for the same cache, and each layer knows what it’s caching.
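A quick sanity check on the memory budget for this example, as a sketch. The numbers are the ones from the config above (16 GB of RAM, a 2 GB ARC cap, 4 GB of shared_buffers), and remaining_gb is an illustrative name.

```shell
#!/bin/sh
# RAM left for the kernel, Postgres backends, and everything else, in GB.
# $1: total RAM, $2: ARC cap (zfs_arc_max), $3: shared_buffers
remaining_gb() {
    echo $(( $1 - $2 - $3 ))
}

echo "Left after ARC and shared_buffers: $(remaining_gb 16 2 4) GB"
```

If that remainder looks tight once work_mem and connection counts are factored in, shrink shared_buffers or the ARC cap before adding more services to the box.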

A Sane Starting Config

Here’s a template for a 2TB NVMe pool on a 16 GB homelab box:

Terminal window
# At creation time
zpool create -f -O atime=off -O recordsize=128K tank nvme0n1
# Then tune
zfs set compression=lz4 tank
zfs set sync=standard tank
# autotrim is a pool property, so it is set with zpool, not zfs
zpool set autotrim=on tank
# Per-dataset: create databases, media, general storage separately
zfs create tank/postgres
zfs set recordsize=16K tank/postgres
zfs set primarycache=metadata tank/postgres
zfs create tank/media
zfs set recordsize=1M tank/media
zfs create tank/docker
zfs set recordsize=128K tank/docker
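# ARC sizing (do this once per host)
echo "options zfs zfs_arc_max=2147483648" >> /etc/modprobe.d/zfs.conf
# Apply now without unloading the module (modprobe -r fails while pools are imported)
echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max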
# Verify
arc_summary
zpool status tank

This config:

From here, measure for a week. Watch iostat, query latency, cache hit rates. Adjust recordsize or ARC if something stands out. But this baseline won’t surprise you.

ZFS is powerful. It’s also opinionated. Spinning rust shaped those opinions. On flash, a few tweaks align those opinions with your actual hardware, and suddenly ZFS on SSD/NVMe feels less like a penalty box and more like a feature you paid for.

