[plug] Raid-5 recovery
Brad Campbell
brad at wasp.net.au
Sat Dec 27 23:48:13 WST 2003
G'day all,
I just went through the anguish of a two-channel failure on a 5-disk 780GB
RAID. Most of the info I found on the web was way out of date for
the current tools and md drivers.
In case it saves someone the same anguish in future, here are some tips.
In my case the issue is caused by IDE channels locking up. Normally, if
I'm recording satellite overnight, I may get a channel lock-up and drop a
drive. A cold restart and a resync and we are sweet. (I'm waiting on some
PATA-SATA adaptors to work around this issue but they are on backorder
in the US.) Last night I dropped two channels in about 30 minutes, which
caused the array to be marked as bad. (More than one failed disk and
you're toast.)
So, check syslog and dmesg, compare the event counts, and work out which
disk was the first to fail.
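
As a rough illustration of that comparison (this is not a tool from raidtools): each member's md superblock carries an event counter, and the disk that dropped out first stops being updated first, so it should show the lowest count. A minimal Python sketch, assuming you have already read the counts off by hand; the function name, device names and numbers are all made up for the example.

```python
def order_of_failure(event_counts):
    """Return device names sorted from lowest to highest event count.

    The first entry is the stalest member, i.e. the most likely
    candidate for "first disk to fail".
    """
    return sorted(event_counts, key=lambda dev: event_counts[dev])


# Hypothetical counts, typed in by hand from the logs.
counts = {
    "/dev/hda1": 421,
    "/dev/hdc1": 421,
    "/dev/hde1": 412,  # lowest count: dropped out first
    "/dev/hdg1": 418,  # dropped out second
    "/dev/hdi1": 421,
}

suspects = order_of_failure(counts)
print("first to fail:", suspects[0])
```

If two disks show the same count this cannot tell them apart, so go back to the timestamps in syslog.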
Use lsraid -p -R to scan all the disks and spit out a raidtab.
Make sure your raidtab matches this disk mapping exactly before you do
anything else.
In your raidtab, change the first failed disk from raid-disk to failed-disk.
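
For anyone who would rather not make that edit by hand at 3am, here is a small Python sketch of the same change. It assumes the usual raidtab layout where each "device" line is followed by its own "raid-disk" line; the helper name, the sample raidtab and the device names are hypothetical, and you should still eyeball the result against the lsraid output before going any further.

```python
def mark_failed(raidtab_text, device):
    """Return raidtab text with `device` switched from raid-disk to failed-disk."""
    out = []
    in_target = False
    for line in raidtab_text.splitlines():
        words = line.split()
        if len(words) >= 2 and words[0] == "device":
            # Remember whether the following raid-disk line belongs to our disk.
            in_target = (words[1] == device)
        elif in_target and len(words) >= 2 and words[0] == "raid-disk":
            line = line.replace("raid-disk", "failed-disk", 1)
            in_target = False
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical three-disk raidtab, just to show the shape of the change.
SAMPLE = """raiddev /dev/md0
    raid-level 5
    nr-raid-disks 3
    device /dev/hda1
    raid-disk 0
    device /dev/hdc1
    raid-disk 1
    device /dev/hde1
    raid-disk 2
"""

patched = mark_failed(SAMPLE, "/dev/hdc1")
print(patched)
```

Only the raid-disk line that follows the named device is touched; everything else, including nr-raid-disks, is passed through unchanged.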
Run mkraid --force --dangerous-no-resync and read the stern warning.
After sufficient checks, and with a big lump in your throat, run
mkraid --really-force --dangerous-no-resync
If all went well you will have a running array. If not, then I guess you
have lost the contents of the array. I have not seen any other way to
make this work. I even contemplated hand-editing the raid superblocks to
mark the second failed disk as ok, but could find no info on it.
Mount the fs read-only and check that everything is where it should
be. Run fsck read-only as well, just in case.
If all went well, change the failed disk back to raid-disk in your raidtab
and use raidhotadd to add it back into the array.
Three hours' worth of resyncing later and you're back online.
<Wipes sweat from brow>
All the info out there about ckraid and the kernel doing the work for
you is well out of date.
If someone could suggest a cheap way of backing up 720GB of data I might
sleep easier at night. Short of that I'm going to have to find something
that watches the syslog for a raid disk failure message and shuts the
machine down before it gets a chance to fail a second disk.
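
A minimal Python sketch of what such a watcher might look like. The pattern below is an assumption about what the raid5 driver logs when it kicks a disk, so check it against the messages in your own syslog first; the function names and the sample log lines are made up, and the actual shutdown is left to whatever callback you pass in.

```python
import re

# ASSUMPTION: the failure message looks like
#   "raid5: Disk failure on hde1, disabling device. ..."
# Adjust the pattern to match what your kernel actually logs.
FAILURE_RE = re.compile(r"raid5: Disk failure on (\S+?),")


def first_failure(lines):
    """Return the device named in the first failure message, or None."""
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            return match.group(1)
    return None


def watch(lines, on_failure):
    """Scan lines and call on_failure(device) at the first disk failure."""
    device = first_failure(lines)
    if device is not None:
        on_failure(device)
    return device


# Hypothetical log excerpt for illustration only.
sample_log = [
    "Dec 27 03:10:01 box kernel: hde: lost interrupt",
    "Dec 27 03:10:05 box kernel: raid5: Disk failure on hde1, disabling device.",
]

watch(sample_log, lambda dev: print("would shut down now, lost", dev))
```

In real use the lines would come from following the syslog file as it grows, and the callback would run the shutdown command, so that the box goes down with only one disk out of the array.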
Brad (waiting for raid resizing to be added to EVMS so I can add an
extra couple of drives into the array, it should happen early 2004
fingers crossed!)