[plug] handling failed non-redundant storage in a server

Craig Ringer craig at postnewspapers.com.au
Thu Feb 12 11:09:29 WST 2004


Hi folks

Our core server at the POST has just had a small accident (it's been one
of those weeks) and I was hoping for some advice.

We recently bought a couple of 250GB SATA disks so that we can take
snapshots of the server and take them off-site. This eliminates the
nightmarish prospect of rebuilding the server from vanilla RH8.

I went to take the first snapshot last night. I plugged in the disk,
activated it using the utilities for our RAID card, partitioned and
formatted the disk. All was well, so I began the giant copy operation.
Unfortunately, it became apparent at about the 120GB mark that the disk
wasn't adequately cooled - it began to fail and was disabled by the RAID
card.

Unfortunately, this leaves me with a significant number of processes
that are trying to talk to a mounted partition on a device that's not
there. They've gone into uninterruptible sleep and aren't really
disrupting the server's operation. For some reason the load average is
at something stupid like 12 (this machine's norm being < 1, usually
about 0.3), but it doesn't appear to reflect real load. Perhaps it's the
two sync processes that are hung.
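
For anyone wanting to check the same thing on their own box: Linux counts
tasks in uninterruptible sleep (state D) towards the load average, not just
runnable ones, so a dozen hung processes would account for a load of about
12 without any real CPU work going on. Below is a minimal sketch, not
something I ran on the server itself, that lists D-state processes by
reading /proc/<pid>/stat. The function names and the proc_root parameter
are my own inventions for illustration.

```python
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' rather than on whitespace.
    left, _, right = text.rpartition(")")
    pid_str, _, comm = left.partition("(")
    state = right.split()[0]
    return int(pid_str), comm, state


def d_state_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D.

    proc_root is a hypothetical parameter so the function can be pointed
    at a fake directory tree for testing.
    """
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the stat file was unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)
```

Calling d_state_processes() with no arguments reads the real /proc and
returns a sorted list of (pid, name) pairs. If the length of that list is
close to the load average, the hung processes are the likely cause.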

Unfortunately, it looks like nautilus on one of the users' thin clients
has somehow tried to access the mounted partition associated with the
missing disk. It's also gone into D state - and new nautilus processes
run under that user ID try to signal it to start a new window rather
than starting a new full nautilus process themselves. Naturally they
also hang permanently. This is making it a bit hard for the user in
question to work - she's managing by using the in-application file
dialogs, but it's a bit painful. (As an aside, it's so cool that the
server can keep working so well under these conditions, and that a user
with a half-broken desktop can still use the rest as if nothing was
wrong.)

I was wondering if there's any way to deal with this - to remove the
processes I know will never recover, unmount the dead volume without
causing any harm to other parts of the system, etc. While I'll be able
to reboot this evening, surely there's a way of dealing with this sort
of thing without a reboot? 
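
To work out which processes would need to go before the volume could be
unmounted, something like the following sketch might help. It walks /proc
and reports every process whose working directory or open file
descriptors point under a given mount point. It only reads the symlink
text with readlink rather than stat-ing the target, which I'd expect to
avoid touching the dead device, though I haven't verified that on this
kernel. The names holders and is_under, the proc_root parameter, and the
/mnt/snapshot path in the usage note are all made up for illustration.

```python
import os


def is_under(path, mount_point):
    """True if path is the mount point itself or lies beneath it."""
    mount_point = mount_point.rstrip("/") or "/"
    if mount_point == "/":
        return path.startswith("/")
    return path == mount_point or path.startswith(mount_point + "/")


def holders(mount_point, proc_root="/proc"):
    """Map pid -> sorted list of paths that pid holds under mount_point."""
    result = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        base = os.path.join(proc_root, entry)
        links = [os.path.join(base, "cwd")]
        try:
            fd_dir = os.path.join(base, "fd")
            links += [os.path.join(fd_dir, fd) for fd in os.listdir(fd_dir)]
        except OSError:
            pass  # no permission to look, or the process has exited
        held = set()
        for link in links:
            try:
                target = os.readlink(link)  # reads link text only
            except OSError:
                continue
            if is_under(target, mount_point):
                held.add(target)
        if held:
            result[int(entry)] = sorted(held)
    return result
```

Run as root, holders("/mnt/snapshot") would return a dictionary keyed by
PID. Cross-checking those PIDs against the processes stuck in D state
shows which ones can still be killed normally and which are waiting on
I/O that will never complete.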

Being able to convince GNOME not to try to talk to the residual nautilus,
but rather start a new one, would do too. Is there some lock or socket I
can remove to force this?

BTW, the rest of the server is on RAID-redundant storage; it's only the
snapshot disks that aren't.

Oh yeah - MAKE SURE YOUR DISKS ARE WELL COOLED. I thought this one was
fine, but the continuous high load on a fast 3 platter disk was
apparently too much for it. I think our snapshot target disks will have
to live in drive cages for extra cooling, as the current airflow is
clearly just not good enough. *sigh* - it's loud enough already.

Craig Ringer



