[plug] How to diagnose a crashing Linux server?

Tue May 20 15:37:48 WST 2003

>>What's the behaviour in the crash? Kernel panic, just becomes
>>unreachable, what?
> 
> The computer remains powered, but the screen, TCP access, SMB access, HTTP
> access all fail to respond.

Interesting. Anybody on PLUG know how to stop the kernel from blanking 
the console so you can see what goes wrong if something breaks? I 
presume there's something you can set in /proc.
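
For what it's worth, the one mechanism I'm sure of is not in /proc: the
Linux console accepts an escape sequence, ESC [ 9 ; n ], that sets the
blank timeout to n minutes, with 0 turning blanking off (it is what
"setterm -blank 0" sends). A minimal sketch, assuming the text console
is /dev/tty1 and that you run it as root; the function names are made
up for illustration:

```python
def blank_timeout_sequence(minutes):
    """Build the Linux console sequence ESC [ 9 ; n ] (n = minutes, 0 = never blank)."""
    if not 0 <= minutes <= 60:
        raise ValueError("the console accepts 0..60 minutes")
    return b"\033[9;%d]" % minutes


def disable_console_blanking(tty_path="/dev/tty1"):
    """Write the 'never blank' sequence to a virtual console (path is an assumption)."""
    with open(tty_path, "wb", buffering=0) as tty:
        tty.write(blank_timeout_sequence(0))
```

Calling disable_console_blanking() once after boot should be enough
until the next reboot.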

I presume you don't get anything as useful/informative as flashing 
keyboard LEDs (kernel panic)?

Does it reply to ICMP? I've seen machines reply to very low-level IP 
traffic like ARP and ICMP, but anything that requires userspace action 
just fails to respond. Essentially, it's as if the kernel is still 
running happily, but all of userspace is just kaput.
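
If you want to check that from another box next time it hangs, here is
a rough sketch. It assumes the Linux iputils "ping" (-c and -W flags)
and a service that speaks first, such as sshd on port 22; the function
names are my own. A bare TCP connect isn't enough, because the kernel
completes the handshake itself, so the test waits for a banner that
only a live userspace process can send.

```python
import socket
import subprocess


def kernel_answers_ping(host, timeout_s=2):
    """True if one ICMP echo gets a reply (uses the system `ping`, iputils flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def userspace_sends_banner(host, port=22, timeout_s=5.0):
    """True if the service sends any bytes after connect (e.g. an SSH banner)."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.settimeout(timeout_s)
            return len(conn.recv(64)) > 0
    except OSError:  # refused, unreachable, or timed out waiting for data
        return False


def classify(ping_ok, banner_ok):
    """Turn the two observations into a rough diagnosis."""
    if ping_ok and banner_ok:
        return "kernel and userspace both responding"
    if ping_ok:
        return "kernel alive, userspace not responding"
    if banner_ok:
        return "service responding, ICMP probably filtered"
    return "no response at all (hard hang, power, or network)"
```

Something like classify(kernel_answers_ping(host), 
userspace_sends_banner(host)) gives you a one-line answer to write down
before you hit reset.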

If it is, in fact, crashing hard and without even a kernel panic - 
hmm.... I really would tend to blame the hardware or at worst a dodgy 
driver module, but as you say that machine has been unchanged for some 
time. Tricky.

Hmmm.... I have seen something like this once. A real pain it was, too. 
My DHCP server at work was crashing every few weeks, with no explanation 
at all, and usually over the weekend. First thing I knew was when my 
gkrellm snmp monitors started showing it as unreachable. It turned out 
to be a failing HDD with bad sectors in the swap partition - the kernel 
didn't like that one tiny bit. It just died - no panic or anything, just 
dead. A reset brought it back up fine ... for a while. I only figured out 
what was wrong when I detected disk-related read errors on /var as the 
problem became worse. Consider running the manufacturer's disk tools on 
your drive, and/or imaging across your install to a spare HDD.
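
As a crude first pass before the vendor tools, you can simply read the
whole device (or a big file) end to end and note where reads fail. A
minimal sketch; the function name is made up, you need read permission
on the device, and data already in the page cache may be served from
RAM, so a clean run does not prove the disk is healthy:

```python
import os


def scan_for_read_errors(path, chunk_size=1024 * 1024):
    """Read `path` start to end; return (bytes_read, offsets_where_a_read_failed)."""
    bad_offsets = []
    bytes_read = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)  # works for files and block devices
        offset = 0
        while offset < size:
            length = min(chunk_size, size - offset)
            try:
                data = os.pread(fd, length, offset)
            except OSError:
                # Typically EIO on a bad sector: record it and skip this chunk.
                bad_offsets.append(offset)
                offset += length
                continue
            if not data:  # unexpected end of data
                break
            bytes_read += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return bytes_read, bad_offsets
```

Pointing it at the swap partition and at whatever holds /var would 
cover the two places my failing disk showed up.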

> The crashes appear to be happening around 3 to 4am, it's in an
> airconditioned office, on physical inspection the box appears to be cool;
> the only CPU "intensive" script running around that time is a mail script,
> which slurps a text file, and emails it out. The standard "cron" jobs are
> running around this time for the daily and weekly jobs - however none of
> these have been modified since installation.

Can you read that file w/o problems (or run the script normally, if 
that's not disruptive) at other times, while monitoring the machine?

> I've noticed the lastlog file has grown to around 18MB - is this normal? The
> file appears to be either encrypted, or full of dud entries.

Interesting. Mine is 18k but this server has only been around for 
~9 months. Does the "lastlog" command still like the file?
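
One possible explanation, offered as a guess rather than a diagnosis:
lastlog is a binary file of fixed-size records indexed by UID, and it
is normally sparse, so its apparent size tracks the highest UID that
has ever been recorded, not the number of logins. The sketch below
assumes the usual glibc/Linux record layout of 292 bytes (4-byte time,
32-byte line, 256-byte host); with that layout a UID of 65534 (often
"nobody") works out to about 18MB. The function names are made up.

```python
import os

# Assumption: glibc struct lastlog = 4-byte time + 32-byte line + 256-byte host.
RECORD_SIZE = 292


def highest_uid_implied(apparent_size, record_size=RECORD_SIZE):
    """Highest UID slot a lastlog file of this apparent size can contain."""
    return apparent_size // record_size - 1


def lastlog_report(path="/var/log/lastlog"):
    """Compare apparent size with allocated blocks to see whether the file is sparse."""
    st = os.stat(path)
    return {
        "apparent_bytes": st.st_size,
        "allocated_bytes": st.st_blocks * 512,  # st_blocks counts 512-byte units
        "highest_uid_implied": highest_uid_implied(st.st_size),
    }
```

If allocated_bytes comes out far smaller than apparent_bytes, the file
is mostly holes and probably nothing to worry about.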

> Sorry, but I have to show my ignorance here, when you refer to "syslog",
> does that encompass all the log files, such as /var/log/messages,
> /var/log/cron, /var/log/maillog, and so forth? Or should I be able to find a
> file named syslog?

Sorry, I should've been clearer. I meant all syslog output in the 
various /var/log files. That said, Debian will have a /var/log/syslog 
that contains most of what you'd want - not important though.
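
Since the box dies silently, the useful thing in those files is often
the hole: the last line written before the hang and the first line
after the reset. A minimal sketch that looks for such gaps, assuming
the traditional "Mon DD HH:MM:SS" stamp at the start of each line; the
30-minute threshold and the function names are arbitrary choices of
mine, and a genuinely idle night will also show up as a gap.

```python
from datetime import datetime, timedelta


def parse_stamp(line):
    """Parse a leading 'Mon DD HH:MM:SS' syslog stamp; return None if there isn't one."""
    try:
        # Syslog lines carry no year, so a placeholder leap year is prepended.
        return datetime.strptime("2000 " + line[:15], "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_gaps(lines, min_gap=timedelta(minutes=30)):
    """Return (last_before, first_after) pairs where the log went quiet."""
    gaps = []
    previous = None
    for line in lines:
        stamp = parse_stamp(line)
        if stamp is None:
            continue  # not a stamped line; ignore it
        if previous is not None and stamp - previous >= min_gap:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps
```

Run over /var/log/messages with something like 
find_gaps(open("/var/log/messages")), the first element of each pair
is your best estimate of when the machine stopped.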

>>If you can attach a null-modem cable, then try booting the machine with
>
> Is this a standard cable I can buy? Or will I need to figure out the pin-outs
> and make this?

You can buy a standard serial cable and a "null modem adapter" - though 
those are becoming harder to find - or a null modem cable if you can 
find one. Alternatively, yeah, you can make one by mangling a normal 
serial cable.
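
If you do end up making one, the simplest version is three wires
between two 9-pin connectors: transmit and receive crossed, ground
straight through. The sketch below just writes that wiring down and
checks it is symmetric; the pin numbers are the standard 9-pin RS-232
assignment, the names are my own, and it assumes you run the link
without hardware flow control, since a three-wire cable carries no
handshake lines.

```python
# Standard 9-pin RS-232 pin assignment.
DB9_SIGNALS = {
    1: "DCD", 2: "RXD", 3: "TXD", 4: "DTR", 5: "GND",
    6: "DSR", 7: "RTS", 8: "CTS", 9: "RI",
}

# Minimal three-wire null modem: pin on end A -> pin on end B.
THREE_WIRE_NULL_MODEM = {2: 3, 3: 2, 5: 5}


def is_symmetric(wiring):
    """A null modem works either way round, so A->B must mirror B->A."""
    return all(wiring.get(b) == a for a, b in wiring.items())


def describe(wiring):
    """Return human-readable lines such as 'pin 2 (RXD) <-> pin 3 (TXD)'."""
    return [
        "pin %d (%s) <-> pin %d (%s)" % (a, DB9_SIGNALS[a], b, DB9_SIGNALS[b])
        for a, b in sorted(wiring.items())
    ]
```

A bought null modem cable usually also crosses the handshake lines, but
for reading console output the three wires above are normally enough.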