Arch Linux based File Server, S.M.A.R.T.
This article describes the S.M.A.R.T. setup on tennessine, my Arch Linux file server. S.M.A.R.T. is a capability built into most modern drives (both mechanical hard drives and SSDs) that lets the system administrator monitor drive health and be notified of failures or predicted failures.
To begin, install the smartmontools package. Full documentation can be found on the smartmontools website, and the Arch Wiki article is also quite helpful. I wanted to monitor all eight Seagate 14 TB disks, as well as the WD NVMe SSD I installed. By default, the DEVICESCAN directive is enabled, which scans the system for any ATA or SCSI disks. Both the annotated /etc/smartd.conf and the smartd.conf manual page warn that this is probably not what you want, as it can generate a lot of spurious warnings for devices that don't actually exist in the system.
I ultimately settled on the following configuration in /etc/smartd.conf:
DEFAULT -a -o on -S on \
-m my-email-address@domain.tld,2345678901@carrier.sms.domain.tld \
-s (S/../.././02|L/../../6/03) \
-W 4,50,60 \
-n standby,q \
-I 1 \
-I 7 \
-I 190 \
-I 194
/dev/sda
/dev/sdb
/dev/sdc
/dev/sdd
/dev/sde
/dev/sdf
/dev/sdg
/dev/sdh
/dev/nvme0n1 -a -o on -S on \
-m my-email-address@domain.tld,mobile-number@carrier.sms.domain.tld \
-s (S/../.././02|L/../../6/03) \
-W 4,55,60 \
-n standby,q
If you have a mobile phone with SMS/text messaging, I highly recommend researching whether your mobile carrier has an email-to-text feature. Most major carriers in the US do (I have experience with AT&T and Google Fi). The address is typically your mobile phone number (possibly prefixed with the country code) at a specific email domain owned by your carrier. That way you can be notified immediately should a disk begin to fail. I've included a dummy example in my smartd.conf above.
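Before relying on the gateway for disk alerts, it is worth confirming that a message sent to it really arrives as a text. The following is a minimal Python sketch of building such a test message. The helper names, the sender address, and the phone number are hypothetical, and the gateway domain is the same dummy value used in the configuration above. Actually sending it assumes a mail server is listening on localhost, so that line is left commented out.

```python
from email.message import EmailMessage


def sms_gateway_address(phone_number: str, gateway_domain: str) -> str:
    """Build an email-to-text address from a phone number and a carrier domain."""
    # Keep only the digits, so "234-567-8901" becomes "2345678901".
    digits = "".join(ch for ch in phone_number if ch.isdigit())
    return f"{digits}@{gateway_domain}"


def build_test_message(sender: str, recipients: list[str]) -> EmailMessage:
    """Create a short plain-text test message addressed to every recipient."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = "smartd notification test"
    msg.set_content("Test message: if this arrives as a text, the gateway works.")
    return msg


# Dummy values mirroring the -m directive in the configuration above.
recipients = [
    "my-email-address@domain.tld",
    sms_gateway_address("234-567-8901", "carrier.sms.domain.tld"),
]
message = build_test_message("smartd@domain.tld", recipients)

# To send it (assumes a local mail server on port 25):
# import smtplib
# with smtplib.SMTP("localhost") as smtp:
#     smtp.send_message(message)
```

Text gateways often truncate long messages, so keeping the test body short gives a more realistic picture of what an alert will look like on the phone.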
Here is an explanation of the options used above. NOTE: these are not smartctl options!
-a: shorthand for '-H -f -t -l error -l selftest -C 197 -U 198'
-H: Monitor SMART health status, report if failed
-f: Monitor for failure of any USAGE attributes
-t: Equivalent to -p and -u directives
-p: Report changes in prefailure normalized attributes
-u: Report changes in usage normalized attributes
-l: Monitor SMART log, in this case the error and selftest logs
-C: Report if Current Pending Sector count is nonzero (ID 197 is Current_Pending_Sector)
-U: Report if Offline Uncorrectable count is nonzero (ID 198 is Offline_Uncorrectable)
-o on: Enable offline automatic tests
-S on: Enable attribute auto save
-m: Send email notification to comma-separated list of email addresses
-s: Test schedule (see smartd.conf manual for details on syntax)
-W D,I,C: Monitor temperature: D)ifference, I)nformational limit, C)ritical limit
-n MODE,N,q: No check is performed if the disk is in the given power MODE (here, standby); ,q suppresses the message that would normally be printed. See the smartd.conf manual for an explanation.
-I ID: Ignore attribute ID for -t option
1: Raw_Read_Error_Rate (see notes below)
7: Seek_Error_Rate (see notes below)
190: Airflow_Temperature_Cel
194: Temperature_Celsius
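The -s schedule is the least obvious of these options. According to the smartd.conf manual, smartd matches the regular expression against a string of the form T/MM/DD/d/HH, where T is the test type (S for short, L for long) and d is the day of the week with Monday as 1. Read that way, the expression above runs a short self-test every day at 02:00 and a long self-test every Saturday at 03:00. Here is a rough Python sketch of that matching, useful for checking an expression before putting it into the config. The function name is my own, and Python's re module only approximates the POSIX extended regular expressions smartd uses, although the two agree for a pattern this simple.

```python
import re
from datetime import datetime

# The schedule from the -s directive in the configuration above.
SCHEDULE = r"(S/../.././02|L/../../6/03)"


def scheduled_tests(when: datetime) -> list[str]:
    """Return the self-test types whose T/MM/DD/d/HH string matches SCHEDULE."""
    # isoweekday() numbers Monday as 1 and Sunday as 7, like smartd's "d" field.
    stamp = when.strftime("%m/%d/") + str(when.isoweekday()) + when.strftime("/%H")
    names = {"S": "short", "L": "long"}
    return [
        name
        for letter, name in names.items()
        if re.fullmatch(SCHEDULE, f"{letter}/{stamp}")
    ]


# 2024-06-01 was a Saturday, 2024-06-03 a Monday.
print(scheduled_tests(datetime(2024, 6, 1, 2)))  # ['short']
print(scheduled_tests(datetime(2024, 6, 1, 3)))  # ['long']
print(scheduled_tests(datetime(2024, 6, 3, 3)))  # []
```

Staggering the two tests by an hour keeps the daily short test out of the way of the Saturday long test.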
I'm ignoring certain attributes because they proved to be quite noisy. For these Seagate disks, the Raw_Read_Error_Rate and Seek_Error_Rate values are constantly increasing. I did a quick web search, and even engaged with Seagate support. I never found an official answer, but it looks like these are internal counters that only Seagate understands. They are not indicative of prefailure or any other failure condition that I'm aware of. If these disks last long enough, the counters will overflow and revert to zero before counting up again.
The temperature attributes are also quite noisy, and will report minor changes in temperature. tennessine is in a locked closet with a skylight, and during the summer months it can get quite warm in there. I've been monitoring the server temperature via a few different sensors, and as I understand the smartd.conf manual, the -W option will log an informational message if a disk reaches 50°C and send a warning email if it reaches the critical temperature of 60°C. So far the Seagate disk temperatures have not exceeded 45°C in my spot checks. Some sensors in my ThinkPad 25th Anniversary Edition (based on a T470) exceed 73°C during normal use (no gaming or 3D graphics), so I'm not too worried about tennessine. If it gets to be a problem I may move it into my loft, which is likely considerably cooler.
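As a rough illustration of how I read the two limits in -W 4,50,60, here is a simplified Python model. It is only a sketch: the function name and return values are my own, and it leaves out the first field (the 4°C difference reporting) as well as smartd's minimum/maximum tracking. Per the smartd.conf manual, a limit is considered reached when the temperature is greater than or equal to it.

```python
def classify_temperature(celsius: int, info: int = 50, crit: int = 60) -> str:
    """Classify a disk temperature against the -W INFO and CRIT limits.

    Simplified model only: real smartd also reports temperature changes
    of at least DIFF degrees, which is ignored here.
    """
    if celsius >= crit:
        return "critical"       # logged as critical; email is sent if -m is set
    if celsius >= info:
        return "informational"  # logged as informational only
    return "normal"


# Seagate disks use the defaults (50/60); the NVMe SSD uses 55/60.
print(classify_temperature(45))           # normal
print(classify_temperature(52))           # informational
print(classify_temperature(52, info=55))  # normal
print(classify_temperature(60))           # critical
```

With my highest spot check at 45°C, the Seagate disks sit 5°C below the informational limit and 15°C below the critical one.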
Conclusion
That's pretty much it for setting up S.M.A.R.T.! Using this feature is highly recommended, so you will be informed sooner rather than later should a hard drive begin to fail or fail outright.
Next Steps
This series on how I set up my new file server, tennessine, continues: