[plug] server failing with bizarre disk errors
Craig Ringer
craig at postnewspapers.com.au
Wed Apr 9 16:21:27 WST 2003
Hi folk
I've got a server here at the POST that runs great for several
hours/days, then randomly dies. The console prints a series of disk
error messages, then the machine sprouts defunct processes everywhere
and can't even be shut down. Often it stops responding entirely. It has
printed out a kernel panic on several occasions, too.
I don't like to post a message without the full copies of panics,
errors, etc - but alas I have to go "arrggh not again" and hit the big
red button (little grey button actually, but you know what I mean). It
is, of course, the busiest week we've had in a long time - which is why
I have a progressively failing server on my hands.
The errors look very much like this:
status=0x10 { SeekComplete }
I/O error: drive is not ready for command
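For reference, the status byte in that message can be picked apart bit by bit. What follows is only a rough sketch, assuming the standard ATA status register layout and the flag names the Linux IDE driver normally prints; the helper name decode_ide_status is made up for illustration and nothing here comes from this server's actual logs.

```python
# ATA status register bits, highest bit first. The names are the ones the
# Linux IDE driver is assumed to print inside the { ... } braces.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_ide_status(status):
    """Return the names of the flags set in an IDE status byte (hypothetical helper)."""
    return [name for mask, name in STATUS_BITS if status & mask]


flags = decode_ide_status(0x10)
print(flags)                  # ['SeekComplete']
print("DriveReady" in flags)  # False
```

Under those assumptions, 0x10 has only SeekComplete set and DriveReady (0x40) clear, which would be consistent with the "drive is not ready for command" line that follows it.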
The confusing thing is that the server runs /fine/ until the errors
occur, and then totally fails to recover. It has two drives, and the
problem appears to happen randomly on either of them (/dev/hda and
/dev/hdb), suggesting that it's not a physical disk fault. I've disabled
APM in the BIOS and kernel, set it to use the XT-PIC instead of the
motherboard's IO-APIC, ensured there are no shared interrupts, etc -
nothing has helped. I'm at my wits' end.
reiserfsck shows no FS corruption, and I haven't found any indications
of it myself. This is in stark contrast to another of our servers, which
recently developed a problem that could be described as "an ext3 volume
full of utter gibberish". That was a RAID controller failure :-( so
there it's safe to blame the hardware.
Anybody encountered anything similar?
Things I haven't done (yet): shuffled the drives around, or run a full
low-level media scan with the drive tools (both due to downtime
concerns). Nor have I chopped the computer into little bits of tinfoil
with a really big axe (the desire is there, but...).
I'll be doing a media scan and drive juggle tonight.
Craig