[plug] re: RAID5 & Hot Spares

Thu Jun 12 14:45:13 WST 2003

> I know this is an IBM h/w RAID we're talking about but just to be safe 
> it may pay to ensure that any drive errors are appropriately 
> acknowledged and the ailing drive kicked out of the array.   I am 
> thinking here in terms of what UWA faced recently with a MegaRaid 
> controller - a drive was going toes up.  The individual drive logs 
> apparently said so, but the controller was happy to leave the drive in 
> the array to cause mischief!   Not Pretty (tm)   Having said that, I 
> don't know how you'd go about artificially creating fake bad blocks on a 
> working drive to test whether or not it gets tossed out in your 
> situation.   Anyone?

Well ... if it weren't for the expensive SCSI drives, I'd probably do 
something quite horrible. Ask the boss for a spare drive to do some 
testing with - mentioning that it's going to destroy the drive in the 
process. Remove the top of the drive, give it a light, short scratch on 
a platter with a screwdriver, and replace the top. Instant bad sectors 
;-) Of course, the damage can only be done with the server /off/.

<rant> Speaking of RAID that isn't so great, I'm less than thrilled with 
3ware. Despite excellent first impressions, the card has been appalling. 
It's eaten my data - twice - though the service guys think that was a 
hardware fault (I'm awaiting a warranty replacement while the server 
sits inactive. They wouldn't send me the replacement until they got the 
original - wonderful service, I say). That problem can only be specific 
to our setup and card, since I can't possibly imagine a RAID controller 
being that bad.

However, the card lacks any way to let you query the drive SMART data 
(no reads via LUN >0 as on some SCSI arrays, and no custom tools to do 
it). Their support folks responded with "the card will do this for you". 
I had a drive fail in the array (separate from the problem with the card 
itself; it was another WD 120G JB dying) and the card failed to notice 
before the drive was totally f**ed. We're talking 400+ bad sectors, and 
almost totally unusable. When I ran a disk test, it started with 400-ish 
bad sectors and finished with 440 (according to SMART). Clearly, the 
card's SMART monitoring and drive testing don't work as well as they 
should - but there's no user monitoring facility. In retrospect, I think 
I would've 
gone for software RAID and saved $1200.
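
For what it's worth, here's a rough sketch of the kind of user-side 
monitoring I mean. It assumes the disk is one the OS can query directly 
(e.g. under software RAID, not hidden behind the card) and that 
smartmontools' smartctl is installed. The function names, the device 
path and the threshold are made up for illustration; the only thing 
taken from smartctl itself is the "-A" attribute table, where attribute 
5 (Reallocated_Sector_Ct) has its raw value in the last column.

```python
import subprocess


def parse_reallocated(smartctl_output):
    """Return the raw value of SMART attribute 5, or None if not found.

    Expects the text printed by `smartctl -A <device>`. Attribute rows
    have ten whitespace-separated columns; the raw value is the last.
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0] == "5":
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


def read_reallocated(device):
    """Run smartctl on a device (hypothetical path, e.g. /dev/hda)."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return parse_reallocated(result.stdout)


def is_degrading(previous, current, threshold=0):
    """True if the count grew by more than `threshold` between readings."""
    if previous is None or current is None:
        return False
    return current - previous > threshold
```

Run read_reallocated() from cron, keep the last reading somewhere, and 
complain when is_degrading() says the count has grown. A drive going 
from 400-ish to 440 during a single disk test would trip it straight 
away.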

Hopefully they'll fix these issues, but they don't seem to be in the 
listening-to-customers business, so maybe not.

</rant>