[plug] Raid-5 recovery

Brad Campbell brad at wasp.net.au
Sat Dec 27 23:48:13 WST 2003


G'day all,
I just went through the anguish of a 2 channel failure on a 5 disk 780GB 
raid. Most of the info I located on the web was wayyyy out of date for 
the current tools and md drivers.

In case it might save someone the anguish in the future.. some tips.

In my case the issue is caused by IDE channels locking up. Normally if 
I'm recording satellite overnight I may get a channel lock up and drop a 
drive. Cold restart and resync and we are sweet. (I'm waiting on some 
PATA-SATA adaptors to work around this issue but they are on backorder 
in the US). Last night I dropped two channels in about 30 minutes, which 
caused the array to be marked as bad. (More than one failed disk and 
you're toast.)

So.. check the syslog and dmesg event counts and determine which was the 
first disk to fail.
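
A rough sketch of that comparison in Python. It assumes the disk that 
dropped out first stopped getting superblock updates first and so carries 
the lowest event count. The device names and numbers are made up; plug in 
whatever your own logs show.

```python
def failure_order(event_counts):
    """Return device names sorted from lowest to highest event count.

    event_counts maps device name -> event count, collected by hand from
    the logs. Assumption: the lowest count belongs to the first disk to fail.
    """
    if not event_counts:
        raise ValueError("no event counts given")
    return sorted(event_counts, key=event_counts.get)


# Made-up numbers, for illustration only.
counts = {
    "/dev/hda1": 42,
    "/dev/hdc1": 37,
    "/dev/hde1": 42,
    "/dev/hdg1": 40,
    "/dev/hdi1": 42,
}
order = failure_order(counts)
print("first to fail:", order[0])
```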

Use lsraid -p -R to scan all disks and spit out a raidtab.
Ensure your raidtab matches this disk mapping exactly before you do 
anything else.

Change the first failed disk from raid-disk to failed-disk in your raidtab.
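
If you would rather not hand edit, here is a minimal sketch of the same 
change in Python. It assumes the usual raidtab layout where each "device" 
line is followed by its "raid-disk" line. The sample raidtab and its device 
names are hypothetical, and mark_failed is just a name picked for this 
example. Check the result by eye before going any further.

```python
def mark_failed(raidtab_text, device):
    """Return raidtab text with the raid-disk line of `device` changed
    to failed-disk. Raises ValueError if no such entry is found."""
    out = []
    current = None   # device named by the most recent "device" line
    changed = False
    for line in raidtab_text.splitlines():
        words = line.split()
        if len(words) >= 2 and words[0] == "device":
            current = words[1]
        elif len(words) >= 2 and words[0] == "raid-disk" and current == device:
            line = line.replace("raid-disk", "failed-disk", 1)
            changed = True
        out.append(line)
    if not changed:
        raise ValueError("no raid-disk entry found for " + device)
    return "\n".join(out) + "\n"


# Hypothetical three-disk raidtab, for illustration only.
SAMPLE_RAIDTAB = """\
raiddev /dev/md0
    raid-level              5
    nr-raid-disks           3
    persistent-superblock   1
    chunk-size              64
    device                  /dev/hda1
    raid-disk               0
    device                  /dev/hdc1
    raid-disk               1
    device                  /dev/hde1
    raid-disk               2
"""

print(mark_failed(SAMPLE_RAIDTAB, "/dev/hdc1"))
```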

Run mkraid --force --dangerous-no-resync and read the stern warning.
After sufficient checks and a big lump in your throat, run
mkraid --really-force --dangerous-no-resync

If all went well you will have a running array. If not then I guess you 
have lost the contents of the array. I have not seen any other way to 
make this work. I even contemplated hand editing the raid superblocks to 
mark the second failed disk as ok but could find no info on it.
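
One way to check whether the array really came up is to look at 
/proc/mdstat. A small sketch that pulls the disk counts out of it, assuming 
the usual status line ending in something like "[5/4] [U_UUU]". The sample 
text below is made up, including the block count.

```python
import re


def array_status(mdstat_text, name):
    """Return (total_disks, active_disks, flags) for array `name`,
    for example (5, 4, "U_UUU"). Raises ValueError if not found."""
    in_array = False
    for line in mdstat_text.splitlines():
        if line.startswith(name + " :"):
            in_array = True
        elif in_array:
            m = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
            if m:
                return int(m.group(1)), int(m.group(2)), m.group(3)
            if not line.strip():
                break   # blank line ends this array's entry
    raise ValueError("no status found for array " + name)


# Made-up /proc/mdstat contents, for illustration only.
SAMPLE_MDSTAT = """\
Personalities : [raid5]
md0 : active raid5 hdi1[4] hdg1[3] hde1[2] hda1[0]
      781422592 blocks level 5, 64k chunk, algorithm 2 [5/4] [U_UUU]

unused devices: <none>
"""

total, active, flags = array_status(SAMPLE_MDSTAT, "md0")
print("%d of %d disks active, flags %s" % (active, total, flags))
```

On the real machine you would pass in open("/proc/mdstat").read() instead 
of the sample. Four of five active is what you expect at this point, since 
the first failed disk is still marked failed-disk.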

Mount the fs readonly and check to see if everything is where it should 
be. Run fsck readonly just in case.

And if all went well, mark the failed disk as raid-disk in your raidtab 
and use raidhotadd to add it back into the array.
3 hours worth of resyncing later and you're back on-line.
<Wipes sweat from brow>

All the info out there about ckraid and the kernel doing the work for 
you is well out of date.

If someone could suggest a cheap way of backing up 720GB of data I might 
sleep easier at night. Short of that I'm going to have to find something 
that watches the syslog for a raid disk failure message and shuts the 
machine down before it gets a chance to fail a second disk.
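
A minimal sketch of such a watcher in Python. The regular expression is an 
assumption about what the kernel logs on a raid5 disk failure, so check it 
against your own syslog before trusting it. The log line in the demo is made 
up, and the function names are just ones picked for this example.

```python
import re
import subprocess

# Assumed message format; verify against your own kernel's log output.
FAILURE_RE = re.compile(r"raid5: Disk failure on (\S+?),")


def failed_device(line):
    """Return the device named in a disk failure line, or None."""
    m = FAILURE_RE.search(line)
    return m.group(1) if m else None


def watch(lines, on_failure):
    """Scan log lines; on the first disk failure, call on_failure(device)
    and stop. Returns the device name, or None if nothing matched."""
    for line in lines:
        dev = failed_device(line)
        if dev is not None:
            on_failure(dev)
            return dev
    return None


def shut_down(dev):
    """Halt the machine. Needs root; not called in the demo below."""
    subprocess.call(["shutdown", "-h", "now"])


# Demo with a made-up log line and a harmless callback.
seen = []
demo_log = [
    "Dec 27 03:10:00 box kernel: hde: lost interrupt",
    "Dec 27 03:12:01 box kernel: raid5: Disk failure on hde1, disabling device.",
]
watch(demo_log, seen.append)
print(seen)
```

For real use you would feed it a live stream, for example by piping the 
output of tail -F on your syslog file into a script that calls 
watch(sys.stdin, shut_down).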

Brad (waiting for raid resizing to be added to EVMS so I can add an 
extra couple of drives into the array, it should happen early 2004 
fingers crossed!)


