[plug] How to diagnose a crashing Linux server?

Tue May 20 14:43:48 WST 2003

> What's the behaviour in the crash? Kernel panic, just becomes
> unreachable, what?

The computer remains powered, but the screen, TCP access, SMB access, HTTP
access all fail to respond.

> As for general debugging - well, make sure the CPU isn't overheating,

The crashes appear to happen between about 3 and 4am. The box is in an
air-conditioned office and feels cool on physical inspection. The only CPU
"intensive" script running around that time is a mail script, which slurps
a text file and emails it out. The standard "cron" jobs for the daily and
weekly runs also fire around this time; however, none of these have been
modified since installation.
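
Since the hangs cluster in one part of the night, one way to narrow things
down is to pull out whatever was logged in that window and see what ran last
before the box went quiet. Below is a minimal sketch, assuming classic syslog
timestamps of the form "May 20 03:58:01"; the helper name lines_in_window and
the default window are made up for illustration.

```python
import re

# Classic syslog lines start like "May 20 03:58:01 hostname ...".
STAMP = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s+(\d{2}:\d{2}:\d{2})\s")


def lines_in_window(lines, start="03:00:00", end="04:30:00"):
    """Return the log lines whose time of day falls within [start, end].

    The helper name and the default window are illustrative assumptions.
    """
    hits = []
    for line in lines:
        m = STAMP.match(line)
        # Zero-padded HH:MM:SS strings compare correctly as plain text.
        if m and start <= m.group(1) <= end:
            hits.append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    import sys

    # Example: python window.py /var/log/messages /var/log/cron
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            for hit in lines_in_window(fh):
                print(path + ": " + hit)
```

Running it over /var/log/messages and /var/log/cron together should show
whether the same job is always the last thing logged before a hang.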

> and if you've added any new RAM recently see if you can borrow some
> different RAM to test with.

No physical changes have been made to the box since about Feb this year.

> Check syslog for disk errors (though if
> they're on the primary disk, they probably won't get written to the log
> on disk - that's another good use of a serial console).

I've noticed the lastlog file has grown to around 18MB - is this normal? The
file appears to be either encrypted or full of dud entries.
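
As far as I know, /var/log/lastlog is a binary database indexed by UID and
read with the lastlog command, and it is normally a sparse file, so the size
"ls -l" reports can be far larger than the space actually used. A rough
sketch for comparing the two figures follows; the helper name size_report is
an assumption for illustration only.

```python
import os


def size_report(path):
    """Return (apparent_bytes, allocated_bytes) for a file.

    Hypothetical helper: a large gap between the two numbers suggests a
    sparse file rather than real data.
    """
    st = os.stat(path)
    # On Linux, st_blocks is counted in 512-byte units.
    return st.st_size, st.st_blocks * 512


if __name__ == "__main__":
    import sys

    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/lastlog"
    if os.path.exists(path):
        apparent, allocated = size_report(path)
        print(path, "apparent:", apparent, "bytes, on disk:", allocated, "bytes")
    else:
        print(path, "not found")
```

If the on-disk figure is small, the 18MB is mostly holes and is unlikely to
be related to the crashes.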

Sorry, but I have to show my ignorance here: when you refer to "syslog",
does that encompass all the log files, such as /var/log/messages,
/var/log/cron, /var/log/maillog, and so forth? Or should I be able to find a
file named syslog?
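
On a Red Hat box of this vintage, "syslog" usually means the output of the
syslogd daemon, which /etc/syslog.conf spreads across files such as
/var/log/messages, with kernel messages normally landing in
/var/log/messages. A small sketch for scanning those files for disk trouble
is below. The patterns are illustrative guesses at typical kernel wording,
not a complete list, and the helper name find_disk_errors is hypothetical.

```python
import re

# Illustrative patterns only; exact kernel wording varies by driver and version.
DISK_ERROR = re.compile(
    r"I/O error|SeekComplete Error|DriveStatusError|UncorrectableError"
    r"|EXT[23]-fs error",
    re.IGNORECASE,
)


def find_disk_errors(lines):
    """Return the log lines that look like disk or filesystem errors."""
    return [line.rstrip("\n") for line in lines if DISK_ERROR.search(line)]


if __name__ == "__main__":
    import sys

    # Example: python diskerr.py /var/log/messages /var/log/messages.1
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            for hit in find_disk_errors(fh):
                print(path + ": " + hit)
```

An empty result does not rule out a failing disk, since errors on the primary
disk may never reach the log file.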

> If you can attach a null-modem cable, then try booting the machine with

Is this a standard cable I can buy? Or will I need to figure out the pin-outs
and make one?

Any pointers are (still) greatly appreciated.

Thanks

Richard

----- Original Message ----- 
From: "Craig Ringer" <craig at postnewspapers.com.au>
To: <plug at plug.linux.org.au>
Sent: Tuesday, May 20, 2003 10:41 AM
Subject: Re: [plug] How to diagnose a crashing Linux server?

> > I have a RH8.0 webserver which is crashing every 4 to 5 days. I've never
> > had a server crash on me before, and was hoping someone could go through a
> > basic "check list" of what I should be looking for.
>
> What's the behaviour in the crash? Kernel panic, just becomes
> unreachable, what?
>
> Do you have console access or the ability to attach a null-modem cable
> to another machine you control? If so, try to see if you can get some
> info on why it's crashing by looking at the console. I suggest a syslog
> entry (to /etc/syslog.conf) like:
>
> *.* /dev/tty12
>
> to dump all system messages to tty12. Can be helpful tracking faults.
>
> If you can attach a null-modem cable, then try booting the machine with
> "console=ttyS0" or "console=tty1 console=ttyS0" after
> attaching the serial cable to another machine. You should be able to use
> a program like Minicom or (if a 'doze box) Hyperterminal to access the
> serial port and watch the console output. If the machine dumps anything
> like a kernel panic, you can capture it (because you set your terminal
> app to log to a file) and that'll help your diagnostics a lot.
>
> You can also make an /etc/inittab entry to attach a getty to the serial
> line, allowing you to log in over the serial port even if you lose
> TCP/IP access. Great for (a) if you kill the ssh server while upgrading
> it and (b) if something goes badly wrong on the server.
>
> As for general debugging - well, make sure the CPU isn't overheating,
> and if you've added any new RAM recently see if you can borrow some
> different RAM to test with. Check syslog for disk errors (though if
> they're on the primary disk, they probably won't get written to the log
> on disk - that's another good use of a serial console).
>
> If you can give some more info on what's happening, it'd be helpful.
>
> Craig
>