ZFS Tuning for SSDs and NVMe

ZFS Defaults Came From Spinning Rust. SSDs Want Something Else.

ZFS is brilliant at managing HDDs. Pool-level redundancy, checksums, snapshots, copy-on-write magic, all optimized around the idea that seeks are expensive and data safety trumps raw speed. But drop ZFS on an SSD or NVMe drive and suddenly you’re paying a tax for optimizations that don’t apply anymore.

Spinning platters have 7ms average seek times. Flash memory has microseconds. Your 128 KB default recordsize, designed to amortize that seek cost, is now way too big for a database that wants 4 KB pages. Your ARC (Adaptive Replacement Cache) is sized for servers with 256 GB of RAM, and you’re running it on a 16 GB homelab box. Your TXG (transaction group) sync behavior, tuned for mechanical latency, is queuing writes when your NVMe can handle thousands of IOPS.

This post is about fixing that. Not by replacing ZFS (ZFS on flash is still rock-solid) but by recalibrating it for the medium. Let’s start with the biggest pain point: cache sizing.

ARC Sizing: The Hidden Tax on Small Hosts

ARC is ZFS’s answer to the page cache: an in-memory layer that serves hot blocks without touching the disk. On a 256 GB server with 200 GB pools, ARC is a godsend. On a 16 GB homelab box with 8 TB of storage, ARC eats lunch, leaves dinner on the table, and doesn’t apologize.

By default, OpenZFS on Linux has traditionally let ARC grow to 50% of your RAM. If you have 16 GB, that’s up to 8 GB in cache. But you also need room for:

Kernel page cache (for non-ZFS filesystems, application buffers, mmap regions)
Application working memory (your database, VMs, containers)
Dirty buffers for pending writes

Leave all of that to chance and you’ll watch ARC squeeze your application’s memory, get shrunk when the app pushes back, then grow again and repeat the cycle. It’s like hiring a forklift to move a couch: technically it works, but your memory bus will have questions.

Set zfs_arc_max explicitly:

echo "options zfs zfs_arc_max=2147483648" >> /etc/modprobe.d/zfs.conf
# Apply now without reloading the module (unloading fails while pools are imported)
echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max

That’s 2 GB (2147483648 bytes) on a 16 GB host, which leaves 14 GB for the kernel, your apps, and breathing room. On a 32 GB host, 8 GB is sane. On a 64 GB host, 25% of RAM still feels right.
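If you’d rather not hand-count zeros, the byte value is easy to derive. This is a minimal sketch; the helper names gib_to_bytes and arc_max_option are made up for illustration and are not part of any ZFS tooling.

```shell
# Convert a whole number of GiB into bytes for zfs_arc_max.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Print the modprobe.d line for a given ARC cap in GiB (hypothetical helper).
arc_max_option() {
    echo "options zfs zfs_arc_max=$(gib_to_bytes "$1")"
}

arc_max_option 2   # options zfs zfs_arc_max=2147483648
```

Redirect the output into /etc/modprobe.d/zfs.conf yourself once you’re happy with the number.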

If you’re running a database on ZFS (PostgreSQL, MySQL, whatever) this becomes even more critical. Databases manage their own buffer pools. If ARC is also trying to cache the same data blocks, you’re wasting RAM and creating weird interactions where the database’s planner sees different performance in cache vs. not. Set ARC smaller and let the database own the caching layer.

Monitor it:

arc_summary
# or, for a rolling view every 10 seconds
arcstat 10

Watch the cache hit ratio (arc_summary reports it under the ARC hits/misses section). Below 50% on an SSD workload? Your ARC is too small or you’re doing sequential reads (which shouldn’t be cached anyway). Above 95%? You’re caching working data, but is ARC your bottleneck or is your storage? Flip primarycache=metadata (explained below) and find out.
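If you want the hit ratio as a single number for a script or dashboard, you can compute it from the raw counters. A rough sketch, assuming the usual Linux kstat file at /proc/spl/kstat/zfs/arcstats with its name/type/data columns; arcstat_value and arc_hit_pct are illustrative names, not ZFS commands.

```shell
# Pull one counter out of an arcstats-style file (columns: name type data).
arcstat_value() {
    awk -v name="$1" '$1 == name { print $3 }' "$2"
}

# Hit ratio as a percentage with one decimal place.
arc_hit_pct() {
    awk -v h="$1" -v m="$2" 'BEGIN {
        if (h + m == 0) { print "0.0"; exit }
        printf "%.1f\n", 100 * h / (h + m)
    }'
}

# Typical use (path assumed, see above):
# stats=/proc/spl/kstat/zfs/arcstats
# arc_hit_pct "$(arcstat_value hits "$stats")" "$(arcstat_value misses "$stats")"
```

The counters are cumulative since module load, so sample twice and diff if you care about the last few minutes rather than the whole uptime.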

Recordsize: Stop Using 128K for Everything

The 128 KB default recordsize assumes you’re storing files the way a human would organize them on a desktop: documents, images, videos, backups. It amortizes the overhead of metadata and small writes across bigger chunks.

But SSDs don’t care about chunk size the way HDDs do. A 16 GB database on ZFS with 128 KB records pays read amplification on every random query. PostgreSQL wants to read 8 KB pages. ZFS says “sure, I’ll read 128 KB and throw away 120 KB.” That’s not just waste; it’s cache pollution. Small updates are worse: ZFS has to read the whole record, modify it, and write it back.
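To put a number on that, here is a back-of-the-envelope helper. It is a sketch under simple assumptions (uncached, uncompressed, one page per random read); read_amplification is an illustrative name.

```shell
# How many database pages' worth of data ZFS fetches for a single-page
# random read. Both arguments are sizes in KiB.
read_amplification() {
    record_kib=$1
    page_kib=$2
    if [ "$record_kib" -le "$page_kib" ]; then
        echo 1
    else
        echo $(( record_kib / page_kib ))
    fi
}

read_amplification 128 8   # 16
read_amplification 16 8    # 2
```

At 128K records, every cold 8 KB page read drags in sixteen pages of data. At 16K it’s two.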

Set recordsize per-dataset:

# For databases (Postgres, MySQL, SQLite, etc.)
zfs set recordsize=16K tank/postgres

# For OLTP workloads that do random small writes
zfs set recordsize=8K tank/database

# For video storage, archives, sequential backups
zfs set recordsize=1M tank/media

# For general purpose (documents, code repos, containers)
# recordsize=128K is still fine; leave it

This is a per-dataset setting. You can set it at pool creation, but you probably won’t get the fine-tuning right the first time, so use zfs set per dataset. Existing blocks keep the record size they were written with; only new writes use the new setting.

Catch: recordsize changes propagate to child datasets if you don’t override them. Create your databases dataset-by-dataset if you’re tuning aggressively:

zfs create tank/postgres
zfs set recordsize=16K tank/postgres

zfs create tank/media
zfs set recordsize=1M tank/media

If you’re not sure, start with 64K as a middle ground, monitor PostgreSQL query times, and dial it down if you see excessive cache misses.

Compression: LZ4 Is Your Friend; Zstd Is Your Secret Weapon

ZFS compression is free money on SSDs. On HDDs, the CPU time spent compressing has to pay for itself in saved seeks and transfer time. On flash, the win is simpler: smaller data means fewer bytes read and written.

LZ4 (compression=lz4, which is also what compression=on selects on current OpenZFS) is fast, hits 20-40% compression on typical workloads, and adds negligible CPU overhead. Enable it on everything:

zfs set compression=lz4 tank

It applies to all child datasets. On a 16 GB dataset of text, configs, and code, you’ll see 2-3x space savings. Your SSD just became effectively bigger.

For cold storage (archives, backups you rarely read), zstd-3 is worth the CPU:

zfs set compression=zstd-3 tank/archive

It compresses 30-50% better than LZ4, costs a bit more CPU, but if you’re only decompressing during recovery or rare reads, who cares. And your NVMe is so fast that even with decompression overhead, you’re usually still faster than streaming uncompressed data from spinning rust.

Don’t use zstd-10 or higher on active datasets. The CPU cost isn’t worth the marginal compression gain on data you access regularly.
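ZFS reports compression as a ratio (compressratio), while most people think in “percent saved”. A small sketch for converting between the two, assuming the ratio looks like 1.50x or 1.50; saved_pct is an illustrative name.

```shell
# Percentage of space saved for a given compression ratio.
# Accepts "1.50x" (zfs get style) or a bare number like "1.50".
saved_pct() {
    ratio=${1%x}
    awk -v r="$ratio" 'BEGIN {
        if (r <= 0) exit 1
        printf "%.1f\n", (1 - 1 / r) * 100
    }'
}

saved_pct 1.25x   # 20.0
saved_pct 2.00x   # 50.0
saved_pct 3.00x   # 66.7
```

So a 1.25x ratio is the “20% saved” end of the typical LZ4 range, and a 2-3x ratio on text-heavy data means half to two thirds of the space back.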

Autotrim vs. TRIM Polling: Just Enable It

SSDs need TRIM to know which blocks are truly free. Without it, the drive treats “deleted” as “potentially needed later” and allocates garbage collection cycles to blocks you don’t care about.

ZFS supports autotrim, which sends TRIM commands asynchronously when blocks are freed. On an SSD, this is sane:

zpool set autotrim=on tank

Autotrim batches freed ranges and issues TRIM in the background, so it doesn’t block writes. On an HDD, autotrim is pointless (ordinary HDDs don’t implement TRIM) and slightly wasteful. On flash, it’s free performance.

If you’d rather trim on a schedule, skip fstrim (it doesn’t work on ZFS datasets) and run zpool trim from cron instead:

0 2 * * * root /sbin/zpool trim tank

But autotrim is cleaner. Just set it and forget it.

Sync Mode and TXG: When Writes Block

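By default, ZFS batches writes into transaction groups (TXGs) and commits one roughly every 5 seconds. Asynchronous writes are acknowledged as soon as they’re in memory; synchronous writes (fsync, O_SYNC) wait until they’re on stable storage via the ZIL. This is safe (your synced data is durable) but sync-heavy workloads can feel slow.

# How often TXGs commit (seconds; default 5)
cat /sys/module/zfs/parameters/zfs_txg_timeout
# Current sync policy for the pool's root dataset
zfs get sync tank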

Three options:

sync=standard (default): TXG commits every 5 seconds or when dirty data hits zfs_dirty_data_max (default 10% of RAM, capped at 4 GB). Async writes are ACK’d when they’re in memory but not yet on disk; sync writes still go through the ZIL. Safe enough for most workloads.
sync=always: Every write hits the disk before returning. Safe as houses, but slow. Only for databases that need fsync-on-every-transaction semantics. Even then, your database probably handles this itself; double-syncing is wasteful.
sync=disabled: Don’t wait for anything. Fast, but if you lose power between TXG commits, you lose recent data. Only for caches or data you can afford to lose.
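The dirty-data ceiling mentioned under sync=standard is simple arithmetic. Here is a sketch of the default as described above (10% of RAM, capped at 4 GiB); dirty_data_max_bytes is an illustrative name, and your running value is whatever /sys/module/zfs/parameters/zfs_dirty_data_max reports.

```shell
# Default dirty-data ceiling as described above: 10% of RAM, capped at 4 GiB.
# Argument: physical RAM in bytes.
dirty_data_max_bytes() {
    ram_bytes=$1
    cap=$(( 4 * 1024 * 1024 * 1024 ))
    tenth=$(( ram_bytes / 10 ))
    if [ "$tenth" -gt "$cap" ]; then
        echo "$cap"
    else
        echo "$tenth"
    fi
}

dirty_data_max_bytes 17179869184   # 16 GiB host -> 1717986918 (about 1.6 GiB)
dirty_data_max_bytes 68719476736   # 64 GiB host -> 4294967296 (the 4 GiB cap)
```

On a 16 GB homelab box that’s roughly 1.6 GiB of writes that can sit in memory between commits, which is exactly the data sync=disabled puts at risk.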

For a homelab SSD/NVMe pool:

zfs set sync=standard tank

Stay here. Autotrim, checksums, snapshots, all working, all reasonably fast.

If you’re serving a database that needs explicit durability, let the database handle fsync and leave ZFS sync alone.

On NVMe, TXG commits are already cheap: a group closes about every 5 seconds, and flushing it takes milliseconds. You’re not waiting long. If you are, your workload is write-saturated and no tuning here will help; you need more disk or to shard the load.

Primary Cache: Separating Metadata from Data

By default, ZFS caches both metadata (inodes, dentries, indirect blocks) and data (actual file contents) in ARC.

For certain workloads, especially databases, you want metadata only:

zfs set primarycache=metadata tank/postgres

This tells ZFS: “Cache metadata aggressively, but treat data blocks as transient. Let the database’s buffer pool own the data.”

Why? Because your database already caches what matters to it. PostgreSQL’s shared_buffers, MySQL’s innodb_buffer_pool, SQLite’s page cache: these are tuned for your schema and query patterns. ARC doesn’t know that. If ARC is also caching the same pages, you’re holding two copies of the same data and wasting RAM.

Set primarycache=metadata for:

PostgreSQL, MySQL, MariaDB datasets
Redis, Valkey, Memcached instances
Any application that manages its own page cache

Leave primarycache=all (default) for:

General-purpose storage (NFS, Samba shares)
Media libraries (where the OS cache is your only cache)
Web server docroots (let ARC cache frequently served files)

Log VDEV (SLOG) vs. Special VDEV: When Each Helps

Two optional VDEVs exist to offload specific workloads:

Log VDEV (SLOG): A fast, small device (usually an SSD or NVMe) that holds the ZFS Intent Log (ZIL) instead of the main pool. When you write with sync=always or the database demands durability, the write is logged to the SLOG and acknowledged, then committed to the main pool with the next TXG. The SLOG is only read back after a crash.

Use a SLOG when:

You’re running synchronous databases (Postgres with fsync enabled) and you want write latency under 1ms
Your main pool is HDDs and you want to decouple write latency from HDD seek time

Skip a SLOG when:

Your main pool is already NVMe (the SLOG won’t help; NVMe is already <1ms)
Your workload is read-heavy or async writes
You don’t have a spare fast device (if an unmirrored SLOG dies at the same moment the host crashes, the most recent sync writes are gone)

Special VDEV: A fast device that stores pool metadata and, optionally, dedup tables and small blocks (anything at or below the dataset’s special_small_blocks threshold). Losing it means losing the pool, so mirror it.

Use a special VDEV when:

You’re running dedup and the dedup table doesn’t fit in ARC (very rare on small pools)
You have tiny recordsizes (8K) and want metadata I/O separated from data I/O

Skip a special VDEV unless you have a specific problem it solves.

For a homelab NVMe setup: Neither is necessary. Your main pool is already fast enough.

NVMe-Specific Kernel Tuning

Modern Linux kernels handle NVMe queues well, but a few tweaks help:

# Check your NVMe device
nvme list

# For NVMe, 'none' is usually best (the legacy single-queue
# 'deadline' is gone on modern blk-mq kernels)
echo none > /sys/block/nvme0n1/queue/scheduler

# Or mq-deadline if you want some I/O ordering
echo mq-deadline > /sys/block/nvme0n1/queue/scheduler

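# Increase queue depth (NVMe can handle it; the kernel may reject values
# above what the device and current scheduler allow)
echo 1024 > /sys/block/nvme0n1/queue/nr_requests

Make the scheduler choice persist with a udev rule (in a rules file under /etc/udev/rules.d/):

ACTION=="add|change", KERNEL=="nvme[0-9]*n[0-9]*", ENV{DEVTYPE}=="disk", ATTR{queue/scheduler}="none"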

Then reload udev:

udevadm control --reload
udevadm trigger

On modern kernels (5.10+), the defaults are pretty good already. These tweaks squeeze another 5-10% throughput on sustained workloads.
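To confirm the rule took effect, read the scheduler file back. The kernel lists every available scheduler and marks the active one in square brackets, so a quick sketch for extracting it looks like this (active_scheduler is an illustrative name):

```shell
# Extract the active scheduler from the contents of queue/scheduler,
# e.g. "[none] mq-deadline kyber" -> "none".
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Typical use:
# active_scheduler "$(cat /sys/block/nvme0n1/queue/scheduler)"
active_scheduler "[none] mq-deadline kyber"   # none
```

Run it after a reboot as well as after udevadm trigger; that’s the check that tells you the rule actually persists.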

Checksums and the Myth of “Free” Verification

ZFS checksums every block. By default, it uses Fletcher-4, which is fast but less thorough than SHA256. You can change it:

zfs set checksum=sha256 tank
# or
zfs set checksum=skein tank

But checksums cost CPU. On reads, ZFS verifies every block you touch. On an SSD where you’re already getting 50K+ IOPS, a cryptographic hash like SHA-256 can make that verification your bottleneck instead of the disk.

Rule of thumb:

checksum=fletcher4 (default): Fast, catches most bit flips. Use this.
checksum=sha256: Paranoid mode. Use for cold storage or when cryptographic integrity matters.

For a homelab, fletcher4 is fine. Your SSD is more likely to fail catastrophically than to silently flip bits. When it does, ZFS will scream about it.

Monitoring: Know What’s Actually Happening

Before tuning, measure:

# Full ARC and L2ARC stats
arc_summary

# Pool-level I/O and space
zpool status tank
zpool iostat 5

# Per-dataset space and compression properties
zfs get -p all tank | grep -E 'used|avail|compress'

After tuning, watch for:

ARC hit rate (from arc_summary): Should be >60% for cached workloads, >80% for metadata.
Read/write latency: zpool iostat -l 5 shows read and write latency per pool. If reads are >10ms on NVMe, something’s wrong (usually ARC miss rate or TXG backpressure).
Compression ratio: zfs get compressratio tank tells you if compression is actually helping.

Keep a baseline before making changes. Then tweak one knob at a time and remeasure. Otherwise you won’t know what actually helped.

Postgres on ZFS: A Specific Example

Let’s tie this together. You’re running PostgreSQL 18 on a 2TB NVMe with 16 GB of RAM:

# Create the dataset
zfs create tank/postgres
zfs set recordsize=16K tank/postgres
zfs set compression=lz4 tank/postgres
zfs set primarycache=metadata tank/postgres
zfs set sync=standard tank/postgres

# Tune ARC globally (done once per host; ARC is shared by all pools)
echo "options zfs zfs_arc_max=2147483648" >> /etc/modprobe.d/zfs.conf
echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max   # apply now, no module reload

# In PostgreSQL config (postgresql.conf)
shared_buffers = 4GB           # 25% of system RAM; let Postgres own data caching
effective_cache_size = 12GB    # Planner hint only; lower it if plans look too optimistic with primarycache=metadata
synchronous_commit = on        # Let Postgres fsync; ZFS handles durability

The result:

Postgres caches its working set (4 GB shared_buffers)
ZFS caches metadata (inodes, directory entries) in the remaining ARC
Recordsize is a small multiple of Postgres’s 8 KB page size (each 16K record holds 2 pages)
Compression saves space with negligible added latency
Syncs go through ZFS’s standard path, which is safe and reasonably fast

Query speed improves because Postgres isn’t competing with ARC for the same cache, and each layer knows what it’s caching.
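It’s worth sanity-checking the memory budget before committing to these numbers. A trivial sketch (remaining_gib is an illustrative name) that subtracts the ARC cap and shared_buffers from total RAM:

```shell
# GiB left for the kernel, connections, and work_mem after ARC and
# shared_buffers are carved out. Returns non-zero if nothing is left.
# Arguments: total RAM, ARC cap, shared_buffers (all in whole GiB).
remaining_gib() {
    left=$(( $1 - $2 - $3 ))
    echo "$left"
    [ "$left" -gt 0 ]
}

remaining_gib 16 2 4   # 10
```

Ten gigabytes of headroom on a 16 GB box is comfortable. If the number gets close to zero, shrink shared_buffers before you shrink ARC any further.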

A Sane Starting Config

Here’s a template for a 2TB NVMe pool on a 16 GB homelab box:

# At creation time
zpool create -f -o ashift=12 -O atime=off -O recordsize=128K tank nvme0n1

# Then tune
zfs set compression=lz4 tank
zfs set sync=standard tank
# autotrim is a pool property, so it goes through zpool, not zfs
zpool set autotrim=on tank

# Per-dataset: create databases, media, general storage separately
zfs create tank/postgres
zfs set recordsize=16K tank/postgres
zfs set primarycache=metadata tank/postgres

zfs create tank/media
zfs set recordsize=1M tank/media

zfs create tank/docker
zfs set recordsize=128K tank/docker

# ARC sizing (do this once per host)
echo "options zfs zfs_arc_max=2147483648" >> /etc/modprobe.d/zfs.conf
echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max   # apply now, no module reload

# Verify
arc_summary
zpool status tank

This config:

Keeps database records close to the engine’s page size (8 K pages for Postgres, 16 K for InnoDB)
Compresses general storage, typically by 20-40% and by 2-3x on text-heavy data
Keeps ARC reasonable for your RAM constraints
Automatically trims freed blocks
Checksums everything (fletcher4 is fine)
Doesn’t wait unnecessarily for writes

From here, measure for a week. Watch iostat, query latency, cache hit rates. Adjust recordsize or ARC if something stands out. But this baseline won’t surprise you.
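A week later it’s easy to forget which dataset got which setting, so a small property check helps. This sketch reads the tab-separated output of zfs get -H -o name,property,value on stdin and flags anything that drifted; check_prop is an illustrative name, not a ZFS command.

```shell
# Read "name<TAB>property<TAB>value" lines on stdin and report any line
# where the named property differs from the expected value.
# Exit status 0 means everything matched.
check_prop() {
    awk -F '\t' -v prop="$1" -v want="$2" '
        $2 == prop && $3 != want {
            printf "%s: %s=%s (want %s)\n", $1, $2, $3, want
            bad = 1
        }
        END { exit bad }
    '
}

# Typical use:
# zfs get -H -o name,property,value recordsize tank/postgres | check_prop recordsize 16K
# zfs get -H -o name,property,value recordsize tank/media    | check_prop recordsize 1M
```

Drop a few of these into a script and run it after upgrades or dataset migrations, when inherited properties are most likely to have changed under you.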

ZFS is powerful. It’s also opinionated. Spinning rust shaped those opinions. On flash, a few tweaks align those opinions with your actual hardware, and suddenly ZFS on SSD/NVMe works the way it should.
