One of the precautions I take to ensure that my home server keeps steadily ticking along is to monitor the health of the hard drives with smartmontools. This uses the SMART health monitoring interfaces built into almost every modern hard drive to predict if the drive is starting to exhibit problems that might lead to data loss, or even complete drive failure. To further improve on this, I run the monitoring system as a daemon, and have it run some simple tests each night, and an extensive test (lasting several hours) each week.
And this is great. The system will email me if it spots any problems, giving me the chance to either fix them, or (worst case) order a new hard drive before the old one finally dies. Because generally, when smartd spots a problem, it's a sign of the beginning of the end for that drive.
But not always. My current hard drive has been reporting the same error to me for over 9 months now, patiently emailing me the same email every night:
This message was generated by the smartd daemon running on:
host name: house
DNS domain: xxxxxxx.com
The following warning/error was logged by the smartd daemon:
Device: /dev/sda [SAT], 3 Offline uncorrectable sectors
WDC WD20EFRX-68AX9N0, S/N:WD-WMC30043xxxx, WWN:5-0014ee-0ae19da81, FW:80.00A80, 2.00 TB
For details see host’s SYSLOG.
You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Sun Jun 29 08:07:30 2014 BST
Another message will be sent in 24 hours if the problem persists.
No matter what I try, I cannot get the drive to resolve the problem, but it’s not getting any worse, and the overall health of the drive is reported as “OK”. So actually, unless the system spots a new error, I just want it to stop emailing me, because otherwise I run the risk of ignoring the server that cried wolf …
So here is how to get the smartd daemon, as installed under Ubuntu Server 14.04 LTS, to stop reporting the same SMART error over and over again:
- cd /usr/share/smartmontools
- sudo cp smartd-runner smartd-runner.backup
Now open smartd-runner in a text editor like vi or gedit (for example, sudo vi smartd-runner) and make it look like this:
#!/bin/bash -e

# Saved error information from the last run.
# NOTE: this path is an assumption; any file that root can write to will do.
laststate=/var/lib/smartmontools/smartd-runner.laststate

# Generate a temporary filename for new error information
tmp=$(mktemp)
# Copy the new error information into the file
cat >"$tmp"

# Test if the new error information is different to the saved
# error information from our last run.
if ! cmp -s "$tmp" "$laststate"
then
    # Save the "new" latest error information for next time
    cp "$tmp" "$laststate"
    # Call the email routine
    run-parts --report --lsbsysinit --arg="$tmp" --arg="$1" \
        --arg="$2" --arg="$3" -- /etc/smartmontools/run.d
fi

# Delete the temporary copy of the error information
rm -f "$tmp"
Save the file. The system will take one more run of the smartd daemon to “prime” the state into the system, but thereafter the system will not send you the same error twice in a row. Of course, this does mean that you now need to pay attention when the system does email you … or you could modify my code here, so it will send a duplicate “reminder” email again (say) every week, or month, or whatever works for you.
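A rough sketch of that reminder idea follows. The function name should_notify and the 7-day default are hypothetical, chosen for illustration and not part of smartmontools. The function treats the saved state file's modification time as "when the last email went out", so it says "send" either when the error information has changed or when the saved copy is older than the given number of days.

```shell
#!/bin/bash
# Sketch only: should_notify and its arguments are illustrative names,
# not anything provided by smartmontools.

# should_notify NEW_REPORT SAVED_REPORT [MAX_AGE_DAYS]
# Returns 0 (send an email) when the new report differs from the saved
# one, or when the saved one was last written more than MAX_AGE_DAYS
# days ago. Returns 1 (stay quiet) otherwise.
should_notify() {
    local new_report="$1"
    local saved_report="$2"
    local max_age_days="${3:-7}"

    # No saved state yet, or the report has changed: always send.
    if ! cmp -s "$new_report" "$saved_report" 2>/dev/null; then
        return 0
    fi

    # Same report as last time: send a reminder only if the saved copy
    # is stale. find prints the file name when it is old enough.
    local max_age_minutes=$((max_age_days * 24 * 60))
    if [ -n "$(find "$saved_report" -mmin +"$max_age_minutes" 2>/dev/null)" ]; then
        return 0
    fi

    return 1
}
```

To use it, the function would go near the top of smartd-runner, and the cmp test would become: if should_notify "$tmp" "$laststate" 7. The body of the if stays the same; because it copies the new information over the saved file each time an email is sent, the reminder clock restarts from that point.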