Diagnosing potential Linux drive failures
While working on a server today, I came across some odd issues in some applications. While checking the logs, I found some drive errors in the warn log file (Failed SMART usage Attribute: 7 Seek_Error_Rate). In searching for the cause of this error, I discovered a nifty little tool, smartctl, that will tell you the health of physical disk drives as reported by the drive's own SMART firmware. SMART stands for Self-Monitoring, Analysis and Reporting Technology, and it is built into most modern disk drives.
Two commands will get you going. The first prints the drive's identity information (-i); the second prints the overall health assessment (-H) and the drive's SMART capabilities (-c):
smartctl -i <drive>
smartctl -Hc <drive>
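To run the same health check across every disk in a machine, a small loop can help. The following is a minimal sketch, not a tested monitoring script: it assumes smartctl is installed, that it runs as root, that the drives show up as /dev/sda, /dev/sdb and so on, and that they report in the ATA format shown in this post. The function names health_from_output and check_all_drives are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads `smartctl -H` output on stdin and prints the
# overall verdict (for example PASSED or FAILED!), or UNKNOWN when no
# verdict line is present.
health_from_output() {
    awk -F': *' '
        /overall-health self-assessment test result/ { print $2; found = 1 }
        END { if (!found) print "UNKNOWN" }
    '
}

# Hypothetical wrapper. Assumption: drives appear as /dev/sd? block devices.
check_all_drives() {
    for drive in /dev/sd?; do
        # Skip anything that is not a block device (including an unmatched glob).
        [ -b "$drive" ] || continue
        printf '%s: %s\n' "$drive" "$(smartctl -H "$drive" | health_from_output)"
    done
}
```

Calling check_all_drives prints one line per drive, so a failing disk stands out without reading the full report for each one.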
The following output shows how my troublesome drive is on its way to a nearby recycling plant. Notice that the health check (-H) reports FAILED, and that Reallocated_Sector_Ct, a Pre-fail type attribute, is marked FAILING_NOW.
websrver1:/var/log # smartctl -Hc /dev/sdh
smartctl 5.39 2009-08-08 r2872~ [x86_64-unknown-linux-gnu] (openSUSE RPM)
Copyright (C) 2002-9 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
Failed Attributes:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 041 041 140 Pre-fail Always FAILING_NOW 1265
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (12000) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 140) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x303f) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
These tips were found via a link from Linux Journal. Hope they help you as well as they did me! Now I just need to find a matching replacement drive for the RAID5 array. Hmmm…
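In the output above, Reallocated_Sector_Ct is flagged because its normalized VALUE (041) has dropped to or below its THRESH (140). The same comparison can be made for every attribute in the table that smartctl -A prints. The snippet below is only a sketch: the function name failing_attributes is made up, and it assumes the ten-column table layout shown in this post.

```shell
#!/bin/sh
# Hypothetical helper: reads `smartctl -A` output on stdin and prints
# "NAME VALUE THRESH" for each attribute whose normalized value is at or
# below its threshold. A threshold of 000 is skipped, because such an
# attribute can never be reported as failed.
failing_attributes() {
    awk '
        $1 ~ /^[0-9]+$/ && NF >= 10 && $6 + 0 > 0 && $4 + 0 <= $6 + 0 {
            print $2, $4, $6
        }
    '
}
```

It could be used as smartctl -A /dev/sdh | failing_attributes, where an empty result means no attribute has reached its threshold yet.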
I've recently moved to Linux, as we decided to use it here as well, and your tips are really useful. I checked all my devices with the health check; fortunately, they all appear to be working well at the moment.