SMART Disk Monitoring with smartmontools

By SumGuy 10 min read
Disks Fail. The Question Is Whether You’ll Know in Time.

Here’s the thing: every hard drive and SSD you own will fail eventually. Not metaphorically. Physically. And when it does, you want to know about it—ideally before your NAS starts rebuilding a RAID array at 3 AM, or worse, before you discover a silent data loss in your backup.

That’s where SMART comes in.

SMART stands for Self-Monitoring, Analysis, and Reporting Technology. It’s been baked into every modern drive for decades. Your drive is constantly measuring things: temperature, seek errors, sector reallocations, command timeouts. The problem? Most monitoring setups are completely useless. Your NAS tells you “SMART OK” and you assume everything’s fine. It’s not. It’s lying to you.

This guide shows you how to read what your drives are actually saying, configure smartmontools to actually catch failures before they wreck you, and integrate that data into your monitoring stack so you can sleep at night.


What SMART Actually Measures (And Why “OK” Doesn’t Mean OK)

SMART is an old standard. It predates SSDs. It was designed by drive manufacturers to tell you (the user) that a drive is about to die, not to give you deep insight into drive health. This is important. SMART status is binary: PASSED or FAILED. That “OK” badge you see is just the PASSED state. It tells you almost nothing.

Here’s the trap: a drive can have several hundred reallocated sectors and still report “OK.” It can be losing sectors in real time and report “OK.” The SMART FAILED state is more like a dead-man’s switch—by the time it trips, you’ve usually got hours to days before total failure, not weeks or months of warning.

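Backblaze, the cloud backup company, analyzed petabytes of real drive telemetry. They found that specific SMART attributes correlate with failure rates. Most attributes? Useless noise. The ones that matter, the ones that actually predict failure, are SMART 5 (Reallocated_Sector_Ct), 187 (Reported_Uncorrect), 188 (Command_Timeout), 197 (Current_Pending_Sector), and 198 (Offline_Uncorrectable).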

Everything else—Power_On_Hours, Temperature (within normal ranges), Spin-up time—is mostly decorative. Your 5-year-old drive running at 45°C is fine. Power-on hours don’t kill drives; degradation does.
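
To see just those signals on one of your own drives, a minimal sketch along these lines may help. It assumes the usual ATA attribute table printed by smartctl -A (ID in column 1, name in column 2, raw value in column 10) and the five attribute IDs Backblaze highlights (5, 187, 188, 197, 198). The function names are made up for illustration.

```shell
# Keep only the five failure-predicting attributes from `smartctl -A` output.
# Prints: ID NAME RAW_VALUE (assumes the standard 10-column ATA table).
critical_attrs() {
  awk '$1 ~ /^(5|187|188|197|198)$/ { printf "%s %s %s\n", $1, $2, $10 }'
}

# Of those five, keep only the ones whose raw value is non-zero.
nonzero_critical() {
  critical_attrs | awk '$3 != 0'
}

# Usage (needs root and a real drive):
#   sudo smartctl -A /dev/sda | nonzero_critical
```

No output means none of the five has a non-zero raw value. Any line that does appear is worth tracking over time.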


Getting Started with smartctl

smartmontools gives you two tools: smartctl for one-off queries, and smartd for continuous monitoring. Start with smartctl to get comfortable reading your drives.

Basic Commands

Terminal window
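# Dump everything: health status, attribute table, error and self-test logs
smartctl -a /dev/sda
# Show identity info only (model, serial number, firmware)
smartctl -i /dev/sda
# Run a short self-test (typically a couple of minutes)
smartctl -t short /dev/sda
# Run the long test (takes 2+ hours, longer on large drives)
smartctl -t long /dev/sda
# Check test results (the self-test log is part of the extended output)
smartctl -x /dev/sda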

The -a flag (all) is your main weapon. It dumps the whole SMART table: current values, thresholds, worst values. Read it top to bottom. The attributes that matter have non-zero raw values when failing.
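
If you plan to script around smartctl, its exit status is worth decoding as well: the man page documents it as a bitmask, one bit per problem class. A rough sketch, with a hypothetical helper name:

```shell
# Decode the smartctl exit status bitmask (bits 0-7, per the smartctl man page).
# Prints one line per bit that is set; prints nothing for status 0.
decode_smartctl_status() {
  status=$1
  [ $((status & 1)) -ne 0 ]   && echo "command line did not parse"
  [ $((status & 2)) -ne 0 ]   && echo "device open failed"
  [ $((status & 4)) -ne 0 ]   && echo "a SMART command failed or a checksum was bad"
  [ $((status & 8)) -ne 0 ]   && echo "SMART status check returned DISK FAILING"
  [ $((status & 16)) -ne 0 ]  && echo "prefail attribute at or below threshold"
  [ $((status & 32)) -ne 0 ]  && echo "attribute was at or below threshold in the past"
  [ $((status & 64)) -ne 0 ]  && echo "device error log contains errors"
  [ $((status & 128)) -ne 0 ] && echo "self-test log contains errors"
  return 0
}

# Usage (needs root and a real drive):
#   sudo smartctl -H /dev/sda >/dev/null; decode_smartctl_status $?
```

Bits 6 and 7 can be set while the overall status still says PASSED, which is one more reason not to trust the badge alone.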

For NVMe drives (the -x flag is your friend):

Terminal window
# NVMe-specific details
smartctl -x /dev/nvme0n1

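NVMe attributes are different. Instead of the numbered ATA attribute table, you get a health log. Look for fields such as Critical Warning, Available Spare, Percentage Used, and Media and Data Integrity Errors.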
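
A minimal sketch for pulling the health-log fields (Critical Warning, Available Spare, Percentage Used, Media and Data Integrity Errors) out of that output. It assumes the "Name: value" layout smartctl prints for NVMe devices. The function name is hypothetical.

```shell
# Reduce smartctl's NVMe health log to a few key=value pairs.
# The trailing colon in the pattern keeps "Available Spare Threshold" out.
nvme_health_summary() {
  awk -F': *' '
    /^(Critical Warning|Available Spare|Percentage Used|Media and Data Integrity Errors):/ {
      print $1 "=" $2
    }'
}

# Usage (needs root and a real drive):
#   sudo smartctl -x /dev/nvme0n1 | nvme_health_summary
```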


Installing smartmontools

On most distros, it’s trivial:

Terminal window
# Debian/Ubuntu
sudo apt install smartmontools
# RHEL/Rocky/CentOS
sudo dnf install smartmontools
# Arch
sudo pacman -S smartmontools

On macOS (if you’re doing this locally):

Terminal window
brew install smartmontools

After install, check whether smartd is already running:

Terminal window
sudo systemctl status smartd

If it’s not enabled, that’s fine. We’ll configure it properly next.


Setting Up smartd for Continuous Monitoring

smartctl is great for poking at a drive once. But you need something running 24/7 to catch degradation in real time. That’s smartd.

The config file is /etc/smartd.conf. Out of the box, it’s often commented out or pointing to all drives without useful alerts. Let’s fix that.

/etc/smartd.conf
# Monitor /dev/sda with aggressive attribute monitoring (add one entry per SATA drive)
# admin@example.com is a placeholder address
/dev/sda -a -o on -S on -n standby,q -s (S/../.././02|L/../../6/03) -W 4,45,50 -m admin@example.com -M exec /usr/local/bin/smartd-alert.sh
# NVMe drives (-M exec requires -m; <nomailer> runs the script without sending mail)
/dev/nvme0n1 -a -n standby,q -m <nomailer> -M exec /usr/local/bin/smartd-alert.sh
# No separate entry is needed for the logs: -a already implies -l error and -l selftest

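Breaking this down:

-a turns on the default checks: overall health status, attribute changes, the error and self-test logs, and the pending (197) and offline uncorrectable (198) sector counts.
-o on enables the drive’s automatic offline data collection, and -S on enables attribute autosave.
-n standby,q skips a check when the disk is spun down instead of waking it; the q keeps the skipped check out of the logs.
-s (S/../.././02|L/../../6/03) schedules a short self-test every day at 02:00 and a long self-test every Saturday at 03:00.
-W 4,45,50 reports temperature changes of 4°C or more, logs a notice at 45°C, and raises a warning at 50°C.
-m sets the mail recipient, and -M exec runs the given script in place of the mail command. If you only want the script, smartd also accepts -m <nomailer>.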
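
The -s schedule is the least obvious part. smartd compares the expression against a string of the form T/MM/DD/d/HH (test type, month, day of month, day of week with 1 = Monday, hour). The sketch below mimics that comparison with grep so you can check when a pattern would fire. It assumes the whole string must match, and schedule_check is a hypothetical helper, not part of smartmontools.

```shell
# schedule_check REGEX TYPE MM DD DOW HH
# Prints "run" if the -s regex matches the T/MM/DD/d/HH string, else "skip".
schedule_check() {
  regex=$1
  stamp="$2/$3/$4/$5/$6"
  if printf '%s\n' "$stamp" | grep -Eq "^${regex}\$"; then
    echo run
  else
    echo skip
  fi
}

# Short test on a Thursday at 02:00
schedule_check '(S/../.././02|L/../../6/03)' S 03 14 4 02
# Long test on a Sunday at 03:00 (the pattern only allows day 6, Saturday)
schedule_check '(S/../.././02|L/../../6/03)' L 03 16 7 03
```

The first call prints run and the second prints skip, matching the "short daily at 02:00, long on Saturdays at 03:00" reading of the pattern.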

The script part is where the real magic happens. Email often doesn’t work in home labs (no MTA). Instead, use an exec script to send alerts to your monitoring system.

Example alert script:

/usr/local/bin/smartd-alert.sh
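#!/bin/bash
# smartd passes alert details in SMARTD_* environment variables, not as arguments
DEVICE="${SMARTD_DEVICE:-unknown}"
MESSAGE="${SMARTD_MESSAGE:-no message}"
FAILTYPE="${SMARTD_FAILTYPE:-unknown}"
# Send to syslog so systemd-journald picks it up
logger -t smartd -p user.warning "[$DEVICE] ($FAILTYPE) $MESSAGE"
# Or send to a webhook/Prometheus pushgateway
# ${DEVICE##*/} drops the /dev/ prefix so the slashes don't break the URL path
curl -s -X POST "http://localhost:9091/metrics/job/smartd/instance/${DEVICE##*/}" \
--data-binary @- << EOF
# HELP smartd_alert_count Number of SMART alerts
# TYPE smartd_alert_count counter
smartd_alert_count{device="${DEVICE}",failtype="${FAILTYPE}"} 1
EOF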
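
smartd hands the alert details to a -M exec script through SMARTD_* environment variables (SMARTD_DEVICE, SMARTD_FAILTYPE, SMARTD_MESSAGE, among others), so the formatting logic can be dry-run by faking that environment. A small sketch; format_alert is a hypothetical helper, and the values below are invented for the test:

```shell
# Build the log line from the environment smartd provides to -M exec scripts.
format_alert() {
  printf '[%s] (%s) %s\n' \
    "${SMARTD_DEVICE:-unknown}" \
    "${SMARTD_FAILTYPE:-unknown}" \
    "${SMARTD_MESSAGE:-no message}"
}

# Dry run with made-up values, no smartd involved
SMARTD_DEVICE=/dev/sda SMARTD_FAILTYPE=EmailTest SMARTD_MESSAGE='dry-run alert' format_alert
```

Remember that the real script has to be executable (chmod +x) before smartd can run it.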

Start smartd:

Terminal window
sudo systemctl enable smartd
sudo systemctl start smartd
sudo systemctl status smartd

Check the logs:

Terminal window
sudo journalctl -u smartd -f

Automating Tests with Cron

smartd can handle scheduled tests, but for more control, run them via cron. This is useful if you want to stagger tests across multiple drives so they don’t all spin up at once (and cause a power spike).
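
One way to stagger is to give each drive its own day of the week for the long test. The sketch below only prints cron lines for you to review. The helper name, the 2 AM slot, and the /usr/sbin/smartctl path are assumptions, so adjust them to your system.

```shell
# Print one weekly long-test cron line per drive, each on a different weekday.
# Output uses the /etc/cron.d format (with a user field); nothing is installed.
stagger_long_tests() {
  dow=0
  for dev in "$@"; do
    printf '0 2 * * %d root /usr/sbin/smartctl -t long %s\n' "$dow" "$dev"
    dow=$(( (dow + 1) % 7 ))
  done
}

stagger_long_tests /dev/sda /dev/sdb /dev/sdc
```

With more than seven drives the weekdays wrap around, so at that point you would also want to vary the hour.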

Example cron setup
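# System crontab format (/etc/crontab or /etc/cron.d), hence the user field
# Full path to smartctl because cron's PATH may not include /usr/sbin
# Run short test on /dev/sda at 1 AM daily
0 1 * * * root /usr/sbin/smartctl -t short /dev/sda
# Run long test on /dev/sdb every Sunday at 2 AM
0 2 * * 0 root /usr/sbin/smartctl -t long /dev/sdb
# Log SMART status to a file every 6 hours
0 */6 * * * root /usr/sbin/smartctl -a /dev/sda >> /var/log/smartctl-sda.log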

Then read that log with something like:

Terminal window
# Show only reallocated sectors and pending sectors
grep -E "Reallocated_Sector|Current_Pending" /var/log/smartctl-sda.log
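
Since that log accumulates one snapshot every six hours, what usually matters is whether a raw value moved between the oldest and newest snapshot. A minimal sketch, assuming the standard ATA attribute table layout (name in column 2, raw value in column 10); attr_trend is a made-up name:

```shell
# attr_trend NAME < logfile
# Prints "first last delta" for the named attribute's raw value across all
# snapshots in the log; prints nothing if the attribute never appears.
attr_trend() {
  awk -v name="$1" '
    $2 == name { if (!seen) { first = $10; seen = 1 } last = $10 }
    END { if (seen) print first, last, last - first }'
}

# Usage:
#   attr_trend Current_Pending_Sector < /var/log/smartctl-sda.log
```

A non-zero third number means the attribute changed while you were not looking.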

Integrating with Prometheus

If you’re running Prometheus (overkill for a home lab, but worth a mention), use smartctl_exporter:

Terminal window
# Install prometheus smartctl exporter
git clone https://github.com/prometheus-community/smartctl_exporter
cd smartctl_exporter
make build
sudo cp ./smartctl_exporter /usr/local/bin/

Set up a systemd service:

/etc/systemd/system/smartctl-exporter.service
[Unit]
Description=Prometheus smartctl exporter
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/smartctl_exporter
Restart=on-failure
RestartSec=5s
[Install]
WantedBy=multi-user.target

Add to your Prometheus config:

prometheus.yml
scrape_configs:
  - job_name: 'smartctl'
    static_configs:
      - targets: ['localhost:9633']

Now you can graph SMART attributes over time and set alerts when reallocated sectors or pending sectors climb.


What To Do When A Drive Starts Failing

You saw it coming. Maybe Current_Pending_Sector jumped from 0 to 47. Maybe Reallocated_Sector_Ct started climbing. What now?

Don’t panic. You have time. That drive isn’t dead yet. But it will be.

Step 1: Verify It’s Really Failing

Run the long test and wait for results:

Terminal window
smartctl -t long /dev/sda
sleep 2h # Wait for test
smartctl -x /dev/sda # Check results

If the long test itself throws errors or the drive doesn’t complete the test, that’s a bad sign. The drive is struggling.
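
Rather than reading the whole -x dump, the self-test log on its own (smartctl -l selftest) answers the question. The sketch below assumes the ATA log layout, where the most recent test is the row starting with "# 1", and treats anything other than "Completed without error" as a reason to look closer. The function name is hypothetical.

```shell
# Reads `smartctl -l selftest` output on stdin.
# Prints "pass" if the most recent test completed without error, else "check".
latest_selftest_status() {
  if grep -E '^# *1 ' | grep -q 'Completed without error'; then
    echo pass
  else
    echo check
  fi
}

# Usage (needs root and a real drive):
#   sudo smartctl -l selftest /dev/sda | latest_selftest_status
```

It also prints check when no test has been logged yet or one is still running, so a check result means "go read the log", not "the drive is dead".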

Step 2: Back Up Everything It Holds

If this drive is in a RAID array, stop here for a moment. You have options. On ZFS, for example, you can resilver onto a replacement disk while the pool stays online:

Terminal window
zpool replace poolname /dev/sda /dev/sdc # Replace /dev/sda with /dev/sdc
zpool status -v # Watch resilver progress

If the drive is a standalone backup or data drive, just copy everything off to another disk.

Step 3: Order a Replacement

Don’t wait. Buy the replacement drive now. Shipping typically takes 3-7 business days. Your failing drive will probably last that long, but you don’t want to be surprised.

Step 4: Replace and Retest

Once the new drive arrives:

Terminal window
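# For mdadm RAID, mark the failing member as failed and remove it before the swap
sudo mdadm /dev/md0 --fail /dev/sda --remove /dev/sda
# Shut down gracefully
sudo shutdown -h now
# Physically swap the drive
# Power back on (device names can change after a reboot, so confirm them first)
# For ZFS pools (new disk in the same slot):
zpool replace poolname /dev/sda
# For mdadm RAID, add the new disk to the array:
sudo mdadm /dev/md0 --add /dev/sda
# For standalone drives, just copy data back
rsync -av /backup/ /mnt/newdrive/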

Run the long test on the new drive to make sure it’s healthy:

Terminal window
smartctl -t long /dev/sda

Common Gotchas

“SMART says OK, but the drive is failing.” SMART status is binary. It lags reality. Watch the attributes, not the status.

“I don’t see any SMART data.” Some systems require elevated privileges, or the drive doesn’t support SMART (rare). Try sudo smartctl -a /dev/sda. If you get “Unknown USB bridge” or “No SMART device” it might be behind a controller that doesn’t expose SMART data.

“smartd won’t start.” Check /etc/smartd.conf for syntax errors. Run sudo smartd -q onecheck to start smartd in debug mode: it parses the config, registers the devices, checks each one once, prints what it finds, and exits.

“My NVMe drive shows no SMART data.” Some controllers don’t expose NVMe SMART over the standard interface. Try nvme smart-log /dev/nvme0n1 (from the nvme-cli package) directly, or check if the drive manufacturer has their own monitoring tool.

“Reallocated sectors jumped overnight. Am I losing data?” No, not yet. The drive found bad sectors and moved the data to spares. You have days or weeks. Start the replacement process calmly. Panicking at 2 AM doesn’t help.


The Real Talk

SMART monitoring is boring. It’s the kind of thing you set up once and then ignore for years. That’s exactly when it’s working. The moment you see an alert about rising pending sectors or offline uncorrectable errors, you’ll be glad you bothered.

For a home lab or small NAS, this setup takes maybe 30 minutes:

  1. Install smartmontools (apt install smartmontools)
  2. Edit /etc/smartd.conf to monitor your drives and log to syslog
  3. Enable smartd (systemctl enable smartd)
  4. Set a cron job to run long tests weekly
  5. Glance at journalctl -u smartd once a month

That’s it. You’re now catching disk failures weeks or months before they destroy your data. Your future self—the one at 3 AM when a drive dies—will thank you profusely.

Disks fail. But you’ll know when they’re about to.


Full example: If you’re running this on a Proxmox cluster or bare-metal Debian, the config above works once you swap in your own device names and alert settings. For other systems (UnRaid, TrueNAS, etc.), check their docs; they often have built-in SMART monitoring that’s already wired up. Don’t reinvent the wheel there.

