[plug] HD S.M.A.R.T monitoring under Linux

Craig Ringer craig at postnewspapers.com.au
Tue Jul 22 11:03:01 WST 2003


> Do you often have drives fail while they are still under warranty?

Yes - though when they do fail, it usually seems to be within 3-6
months. At least, that's what happened with the majority of drives that
have died on me:
	3 Western Digital 120g JBs (later JB series)
	1 Seagate Barracuda 80g (older model)
I had 9 WD drives at the time those 3 failed. Each failed in a separate
machine (home box, gateway server at work, RAID storage server at work),
and all were well cooled, etc. The Seagate was a drive for my home PC
that I noticed was failing and replaced with a WD 120g JB. That
replacement drive was one of the early model WD JBs and is still running
great, but a second one I got later died quite quickly.

The short answer - there's a surprisingly high chance that, at least if
you're unlucky and buy from the wrong manufacturer at the wrong time
(there's usually a wrong manufacturer at a wrong time, unfortunately),
you'll get a drive that plans to suicide. But then, if you're doing good
backups, all it'll be is a bit of irritation, RIGHT?

>>There's no point talking to most disk manufacturers about the data 
>>though. You'll get the endless litany 'download our drive tools 
>>from....'. Yay. 
> 
> What about RA? I'm not sure how often I actually speak to the
> manufacturer but I know what you mean if some hardware fails they expect
> you to be testing it with Windows.

Actually, in my experience they're very good about that. I've dealt
directly with several HDD manufacturers, and none of them cared what OS
I was running - so long as I was using their boot disk with their hdd
tools on it.

WD even accepted all my smartctl data (something like: 'well, if the 
SMART data you've obtained says there are that many reallocated sectors, 
and you're getting DMA timeouts and CRC errors, then it does sound like 
a fault with the drives'), though they did seem a bit confused as to how 
I got it. Apparently the guys I was dealing with were unaware of any 
end-user available tools to directly access the drive's SMART data.

> What is the IBM/Hitachi util like?

*cough* *choke*
The utility is EXCELLENT; the only better one I've worked with is
Seagate's (I haven't tried the Maxtor util, as my Maxtor 120g drive is
very happy). The problem is that if your drive is an IBM DeskStar, maybe
a few years old, you're rather likely to NEED that drive utility. Then
again, if it hasn't died yet it probably won't. IBM had a massive run of
faulty drives that earned the line the nickname 'DeathStar' and caused
people using them in RAID arrays to STILL lose data, because sometimes
two or more would fail almost at once. This is suspected to be one of
the reasons IBM sold their HDD business - they always denied the
problem, and there were indications they were unable to fix it.

Western Digital's drive tools are AFAICT utterly pointless. See previous 
rant on this topic ;-)

> I can see some good applications for it including rigging it up with
> Nagios for server monitoring and adding it to Knoppix (if it isn't
> already) to diagnose desktops.

As for Knoppix - I'd be very surprised if it wasn't there.

You should also be able to make a simple parser for the output so that 
you can access it over SNMP (via snmpd's shell extensions) if you care.
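
A rough sketch of such a parser, in Python. It assumes the attribute
table layout that `smartctl -A` prints (ten whitespace-separated columns,
raw value last). The sample text is made up for illustration and the
function name is my own, so check it against what your smartctl version
actually prints before trusting it.

```python
# Made-up sample in the layout `smartctl -A` prints; not from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0007   098   095   021    Pre-fail  Always       -       4541
  5 Reallocated_Sector_Ct   0x0033   180   180   140    Pre-fail  Always       -       412
194 Temperature_Celsius     0x0022   110   098   000    Old_age   Always       -       37 (Min/Max 20/45)
"""


def parse_smart_attributes(text):
    """Return {attribute_name: {...}} for each row of the attribute table."""
    attrs = {}
    for line in text.splitlines():
        # Split into at most 10 fields so a raw value containing spaces
        # stays in one piece.
        fields = line.split(None, 9)
        # Skip the header, blank lines and anything that is not a table row.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            attrs[fields[1]] = {
                "id": int(fields[0]),
                "value": int(fields[3]),   # normalised value
                "worst": int(fields[4]),
                "thresh": int(fields[5]),
                "raw": fields[9].strip(),  # kept as text; format varies by vendor
            }
        except ValueError:
            # Row with non-numeric columns: ignore it rather than guess.
            continue
    return attrs


if __name__ == "__main__":
    attrs = parse_smart_attributes(SAMPLE)
    # One number per line is about as easy as it gets to pick up from a
    # shell-extension script.
    print(attrs["Reallocated_Sector_Ct"]["raw"])
```

On a real box you'd feed it the text from running smartctl -A against
your own device rather than the sample string.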

> I recently had a customer's HD crash on me while I was in the middle of
> diagnosing their machine so now I check the S.M.A.R.T readings before I
> do anything.

It won't always tell you of a potential fault, but it is surprisingly 
good. Generally in failing disks I've seen warning signs like huge 
spin-up times, lots of reallocated sectors, etc - but usually no SMART 
alarms. Logged errors can be a sign of a dying drive too.
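
Since the drive's own alarm often doesn't trip, a check can look at the
early signs itself. Here is a minimal sketch, assuming the attributes
have already been parsed into a dict with integer 'value', 'thresh' and
'raw' entries. The function name and the two limits (max_realloc,
spinup_margin) are arbitrary choices of mine, not vendor figures; the
0/1/2 return codes follow the usual Nagios plugin convention.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # exit codes Nagios plugins conventionally use


def assess(attrs, max_realloc=0, spinup_margin=10):
    """Return (status, notes) for a dict of parsed SMART attributes.

    max_realloc and spinup_margin are illustrative guesses; tune them.
    """
    status, notes = OK, []

    # The drive's own alarm: normalised value at or below a non-zero threshold.
    for name, a in sorted(attrs.items()):
        if a["thresh"] > 0 and a["value"] <= a["thresh"]:
            status = CRITICAL
            notes.append("%s is at or below its threshold" % name)

    # Early sign: reallocated sectors, even if the alarm has not tripped.
    realloc = attrs.get("Reallocated_Sector_Ct")
    if realloc is not None and realloc["raw"] > max_realloc:
        status = max(status, WARNING)
        notes.append("%d reallocated sectors" % realloc["raw"])

    # Early sign: spin-up value drifting down towards its threshold.
    spin = attrs.get("Spin_Up_Time")
    if spin is not None and 0 < spin["value"] - spin["thresh"] <= spinup_margin:
        status = max(status, WARNING)
        notes.append("Spin_Up_Time is close to its threshold")

    return status, notes


if __name__ == "__main__":
    # Made-up readings for illustration only.
    suspect = {
        "Reallocated_Sector_Ct": {"value": 180, "thresh": 140, "raw": 412},
        "Spin_Up_Time": {"value": 30, "thresh": 21, "raw": 9100},
    }
    status, notes = assess(suspect)
    print(status, "; ".join(notes))
```

Returning a status plus notes keeps the decision separate from however
you report it, so the same function could sit behind a Nagios check or a
cron job that mails you.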

Be aware that some issues that show up in SMART can actually be caused 
by motherboard chipsets (timing issues, etc) or cable problems.

Craig Ringer


