Fw: [plug] How to diagnose a crashing Linux server?

Tue Jun 3 12:49:27 WST 2003

At 11:53 3/06/2003 +0800, Richard Mortimer wrote in part:

>Greetings Folks,
>
>As a follow-up to this email that I sent just over a week ago - I resorted
>to "plan B" which was to schedule a CRON job to reboot every night.

<snip>

> > 0:00 /usr/libexec/gcon
> > 0:00 /usr/libexec/bono
> > 0:00 metacity --sm-sav
> >
> > JLM> metacity -sm-sav = small windows manager using gtk2
> > The other two I'm not sure, but you need to issue ps au to see the owner.
>
>Is anyone familiar with these two processes "gcon" and "bono"?

Sounds like Bonobo-activation to me - not that I've played with it
knowingly :-)  Try this link for some info.  gcon seems to be related, so
neither is likely to be anything to worry about...
http://www-106.ibm.com/developerworks/library/l-gn-cor/

<snip>

(From previous material describing the hardware.)
Hardware RAID, eh?  I recall very recent problems here at UWA where the
main mail server went down and did not come back up quickly.  This happened
several times in the space of a few days.  The problem was reported to be
that the hardware RAID (AMI MegaRAID) controller was masking drive
errors.  Can you examine the individual drive error logs, or run S.M.A.R.T.
checks on the drives, to determine their state of health?

"... it turns out that you have to use the RAID
configuration software and look at each physical disk individually, and then
find the one with media errors in its history. For some reason the RAID
controller doesn't fail the disk from the array when it gets media errors,
but instead stupidly leaves it in there to cause problems!"
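
If the drives are visible to smartmontools, the per-disk health check could be scripted roughly as below. This is only a sketch under assumptions: it assumes smartctl is installed and that the controller lets SMART queries through to the physical disks (many hardware RAID controllers do not, or need a device type such as "-d megaraid,N" on later smartmontools versions). The function names and any device paths you pass in are mine, for illustration only.

```python
import re
import subprocess

# Verdict lines printed by `smartctl -H`: the first form appears for ATA
# drives, the second for SCSI drives.
_ATA_RESULT = re.compile(
    r"SMART overall-health self-assessment test result:\s*(\S+)")
_SCSI_RESULT = re.compile(r"SMART Health Status:\s*(\S+)")


def parse_health(output):
    """Return True if the drive reports healthy, False if it does not,
    and None if no verdict line was found in the output."""
    match = _ATA_RESULT.search(output)
    if match:
        return match.group(1) == "PASSED"
    match = _SCSI_RESULT.search(output)
    if match:
        return match.group(1) == "OK"
    return None


def check_drive(device, device_type=None):
    """Run `smartctl -H` on one device and return parse_health()'s verdict.

    device_type is passed to smartctl's -d option when given; what value
    is needed (if any) depends on the controller, so it is an assumption.
    """
    cmd = ["smartctl", "-H"]
    if device_type:
        cmd += ["-d", device_type]
    cmd.append(device)
    try:
        # smartctl uses non-zero exit codes as status flags, so the exit
        # code is not treated as an error here.
        result = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        return None  # smartctl is not installed
    return parse_health(result.stdout)


def suspect_drives(devices):
    """Return the devices that did not clearly report healthy."""
    return [dev for dev in devices if check_drive(dev) is not True]
```

Calling suspect_drives() with your own list of device paths returns the disks worth a closer look. Note that a passing health verdict does not rule out media errors, so the controller's own per-disk error history, as described in the quote above, is still worth checking.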

Sorry if this is "been there, done that" territory.

Cheers,
Denis