[plug] How to diagnose a crashing Linux server?
Craig Ringer
craig at postnewspapers.com.au
Tue May 20 15:37:48 WST 2003
>>What's the behaviour in the crash? Kernel panic, just becomes
>>unreachable, what?
>
> The computer remains powered, but the screen, TCP access, SMB access, HTTP
> access all fail to respond.
Interesting. Anybody on PLUG know how to stop the kernel from blanking
the console so you can see what goes wrong if something breaks? I
presume there's something you can set in /proc.
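One approach that doesn't need /proc at all: the Linux console understands an escape sequence, ESC [ 9 ; n ], that sets the blanking timeout to n minutes, with 0 turning blanking off (it is documented in console_codes(4)). A minimal Python sketch follows, assuming the console of interest is /dev/tty1 and that it is run as root; the function names are only illustrative.

```python
def blank_timeout_sequence(minutes: int) -> str:
    """Return the Linux console escape sequence that sets the screen
    blanking timeout in minutes. A value of 0 disables blanking."""
    if not 0 <= minutes <= 60:
        raise ValueError("timeout must be between 0 and 60 minutes")
    return f"\033[9;{minutes}]"


def disable_console_blanking(tty: str = "/dev/tty1") -> None:
    # Assumption: tty is the virtual console to keep lit, and the
    # caller is allowed to write to it (normally root only).
    with open(tty, "w") as console:
        console.write(blank_timeout_sequence(0))
```

Calling disable_console_blanking() from a boot script should leave whatever the kernel last printed visible on the screen after a hang.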
I presume you don't get anything as useful/informative as flashing
keyboard LEDs (kernel panic)?
Does it reply to ICMP? I've seen machines reply to very low-level IP
traffic like ARP and ICMP, while anything that requires userspace action
just fails to respond. Essentially, it's as if the kernel is still
running happily, but all of userspace is just kaput.
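A rough way to tell those two states apart from another machine is to compare a ping with a probe that needs a userspace process to answer. The sketch below assumes the server runs a service that speaks first on connect (SSH on port 22 is used as the default here, which is an assumption about the box); a bare TCP connect is not enough, because the kernel completes the handshake on its own. All names are hypothetical.

```python
import socket
import subprocess


def answers_ping(host: str) -> bool:
    # One echo request via the system ping; -c is the count flag on Linux.
    result = subprocess.run(
        ["ping", "-c", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def sends_banner(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    # True only if a userspace daemon accepted the connection and wrote data.
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            return bool(conn.recv(64))
    except OSError:
        return False


def diagnose(ping_ok: bool, banner_ok: bool) -> str:
    if ping_ok and banner_ok:
        return "up"
    if ping_ok:
        return "kernel answering, userspace not responding"
    if banner_ok:
        return "service answering, ICMP probably filtered"
    return "no response at all"
```

Running diagnose(answers_ping(host), sends_banner(host)) from cron on a second machine every few minutes would at least record which kind of failure it is.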
If it is, in fact, crashing hard and without even a kernel panic -
hmm.... I really would tend to blame the hardware or at worst a dodgy
driver module, but as you say that machine has been unchanged for some
time. Tricky.
Hmmm.... I have seen something like this once. A real pain it was, too.
My DHCP server at work was crashing every few weeks, with no explanation
at all, and usually over the weekend. First thing I knew was when my
gkrellm snmp monitors started showing it as unreachable. It turned out
to be a failing HDD with bad sectors in the swap partition - the kernel
didn't like that one tiny bit. It just died - no panic or anything, just
dead. A reset brought it back up fine ... for a while. I only figured out
what was wrong when I detected disk-related read errors on /var as the
problem became worse. Consider running the manufacturer's disk tools on
your drive, and/or imaging across your install to a spare HDD.
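If the manufacturer's tools aren't handy, a crude first check is to read the whole partition (or any large file) block by block and note where the kernel reports an I/O error. This is only a sketch, not a substitute for a proper surface scan: the function name is made up, and reads served from the page cache can hide a bad sector.

```python
import os


def scan_for_read_errors(path: str, block_size: int = 1024 * 1024):
    """Read path block by block; return (bytes_read, bad_offsets)."""
    bad_offsets = []
    bytes_read = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size of regular files and block devices.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            want = min(block_size, size - offset)
            try:
                chunk = os.pread(fd, want, offset)
            except OSError:
                # The kernel refused this block: record it and skip ahead.
                bad_offsets.append(offset)
                offset += want
                continue
            if not chunk:
                break
            bytes_read += len(chunk)
            offset += len(chunk)
    finally:
        os.close(fd)
    return bytes_read, bad_offsets
```

Pointing it at the swap partition's device node as root (whatever that is on the machine in question) would show whether any region fails to read.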
> The crashes appear to be happening around 3 to 4am, it's in an
> airconditioned office, on physical inspection the box appears to be cool;
> the only CPU "intensive" script running around that time is a mail script,
> which slurps a text file, and emails it out. The standard "cron" jobs are
> running around this time for the daily and weekly jobs - however none of
> these have been modified since installation.
Can you read that file w/o problems (or run the script normally, if
that's not disruptive) at other times, while monitoring the machine?
> I've noticed the lastlog file has grown to around 18Mb - is this normal? The
> file appears to be either encrypted, or full of dud entries.
Interesting. Mine is 18k but this server has only been around for
~9 months. Does the "lastlog" command still like the file?
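A large lastlog is not necessarily a sign of trouble. On Linux it is normally a sparse binary file of fixed-size records indexed by UID, so it looks like garbage in a pager, and a single login by a high-numbered UID makes the apparent size jump. Assuming the usual glibc record of 292 bytes (4-byte time, 32-byte line, 256-byte host; treat that figure as an assumption for any given system), a highest UID of 65534 works out to roughly 18 MB. A small sketch of the arithmetic, plus a comparison of apparent size against space actually allocated:

```python
import os

# Assumption: glibc's struct lastlog layout, 4 + 32 + 256 bytes.
LASTLOG_RECORD_SIZE = 292


def expected_lastlog_size(highest_uid: int) -> int:
    # The record for UID n sits at offset n * record size.
    return (highest_uid + 1) * LASTLOG_RECORD_SIZE


def size_report(path: str = "/var/log/lastlog"):
    """Return (apparent_bytes, allocated_bytes) for path."""
    st = os.stat(path)
    # st_blocks counts 512-byte units on Linux.
    return st.st_size, st.st_blocks * 512
```

If the allocated figure is far smaller than the apparent one, the file is mostly holes and is probably fine.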
> Sorry, but I have to show my ignorance here, when you refer to "syslog",
> does that encompass all the log files, such as /var/log/messages,
> /var/log/cron, /var/log/maillog, and so forth? Or should I be able to find a
> file named syslog?
Sorry, I should've been clearer. I meant all syslog output in the
various /var/log files. That said, Debian will have a /var/log/syslog
that contains most of what you'd want - not important though.
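After a crash, the quickest thing to look at is usually the last few lines each of those files received before the machine went quiet. A minimal sketch, using the log paths mentioned in this thread (not every distribution has all of them, so missing files are skipped); the helper names are illustrative.

```python
from collections import deque

# Paths taken from this thread; which ones exist depends on the distribution.
LOG_FILES = [
    "/var/log/messages",
    "/var/log/cron",
    "/var/log/maillog",
    "/var/log/syslog",
]


def last_lines(path: str, count: int = 5):
    """Return the final count lines of a text log."""
    with open(path, "r", errors="replace") as log:
        return [line.rstrip("\n") for line in deque(log, maxlen=count)]


def report(paths=LOG_FILES, count: int = 5) -> None:
    for path in paths:
        try:
            lines = last_lines(path, count)
        except OSError:
            continue  # file missing or unreadable: skip it
        print(f"== {path}")
        for line in lines:
            print(line)
```

Running report() right after a reboot, before the logs rotate, shows what each daemon said last.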
>>If you can attach a null-modem cable, then try booting the machine with
>
> Is this a standard cable I can buy? Or will I need to figure out the pin-outs
> and make this?
You can buy a standard serial cable and a "null modem adapter" - though
those are becoming harder to find - or a null modem cable if you can
find one. Alternatively, yeah, you can make one by mangling a normal serial
cable.
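For anyone making their own, the commonly documented 9-pin to 9-pin null-modem wiring crosses transmit with receive and swaps the handshake pairs. The table below is the widely used full-handshake variant, written as Python so the crossover can be checked for symmetry. It is a sketch from the standard PC serial pinout, so check it against the hardware's own documentation before cutting any wires.

```python
# 9-pin PC serial connector: pin number -> signal name.
PINS = {
    1: "DCD", 2: "RXD", 3: "TXD", 4: "DTR",
    5: "GND", 6: "DSR", 7: "RTS", 8: "CTS",
}

# Pin on end A -> pins it connects to on end B.
NULL_MODEM = {
    2: [3],     # RXD <- TXD
    3: [2],     # TXD -> RXD
    4: [6, 1],  # DTR -> DSR and DCD
    6: [4],     # DSR <- DTR
    1: [4],     # DCD <- DTR
    5: [5],     # ground straight through
    7: [8],     # RTS -> CTS
    8: [7],     # CTS <- RTS
}


def is_symmetric(wiring) -> bool:
    # Every A -> B connection must also appear as B -> A.
    return all(a in wiring[b] for a, bs in wiring.items() for b in bs)


def describe(wiring=NULL_MODEM) -> None:
    for a in sorted(wiring):
        for b in wiring[a]:
            print(f"{a} ({PINS[a]}) -> {b} ({PINS[b]})")
```

For a plain console with no hardware flow control, the three wires on pins 2, 3 and 5 are normally enough.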