[plug] server keeps dropping out

Thu Mar 25 10:18:42 WST 2004

On Thu, 2004-03-25 at 07:29, Jon Miller wrote:
> Like to get some input from the group on a problem that is a constant pain in the rear.  I have 2 sites that have a Linux gateway (one with Red Hat Linux 7.2 and the other with Debian (Linux version 2.4.18-bf2.4) .  They mainly perform gateway services in conjunction with mail services.  They sit behind Cisco routers that have firewall and VPN feature sets  turned on.
> Sporadically,  the servers go through periods where they freeze up and have to be rebooted.  For instance the Debian server normally have to be rebooted every Monday, now it's everyday and sometimes twice a day.  The Red Hat server requires at least once a day. 

That screams disk failure to me. The only time I've had Linux servers do
that sort of thing, they've always turned out to have a dying disk.
Linux doesn't seem to like bad sectors in swap partitions ;-) 

That said, you'll /usually/ start seeing DMA errors etc in syslog unless
you just have some bad sectors in the swap and nothing else.

If you had enough physical memory, I'd suggest just disabling swap and
seeing how that affected things. That was my first clue on one of the
machines I was troubleshooting, and it was confirmed by use of smartctl
and badblocks.
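
For what it's worth, here is a rough sketch of how I'd check whether swap can be turned off safely before trying that. The "free + buffers + cache must exceed swap in use" rule is just my own rule of thumb, not anything official, and the function name and the /dev/hda2 partition are made up for illustration.

```shell
# Succeeds (exit 0) if the swap currently in use would fit into
# free memory plus buffers plus page cache. Reads /proc/meminfo-style
# "Name: value kB" lines on stdin. The threshold is an assumed heuristic.
swap_fits_in_ram() {
    awk '
        $1 == "MemFree:"   { avail += $2 }
        $1 == "Buffers:"   { avail += $2 }
        $1 == "Cached:"    { avail += $2 }
        $1 == "SwapTotal:" { swtotal = $2 }
        $1 == "SwapFree:"  { swfree = $2 }
        END { exit !((swtotal - swfree) < avail) }
    '
}

# Usage (as root):
#   swap_fits_in_ram < /proc/meminfo && swapoff -a
# Read-only surface scan of the (assumed) swap partition afterwards:
#   badblocks -sv /dev/hda2
```

If the box stops freezing with swap off, that points fairly strongly at the swap partition.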

>  The servers are still running just the services either drops out or goes into what appears to be in a zombie state (loaded but not functioning).

This could be clearer. If you mean that the server appears to be running
(you can ping it, etc) but you can't log in, and it doesn't actually
allow connections to /any/ service such as ssh, smtp, etc, then that is
consistent with what I often see in disk failure situations. The Linux
kernel appears capable of staggering on with a functioning IP stack even
when mostly dead.

If, however, you mean that you can still log in but some/all network
services don't work - that's different. Do a 'ps aux' and look for apps
that appear to be hanging in the 'D' state; that can indicate disk
issues or other I/O problems (e.g. drivers). Look for long-lived zombie
processes ('Z').
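
Something along these lines will pull out just the interesting processes. It's only a sketch; the function name is my own invention, and it assumes the state is in the second column, as it is with the ps format shown in the usage comment.

```shell
# Print processes whose state starts with D (uninterruptible sleep,
# usually waiting on I/O) or Z (zombie). Expects "PID STAT COMMAND"
# columns on stdin; the header line is dropped because "STAT" does
# not match.
stuck_procs() {
    awk '$2 ~ /^[DZ]/ { print }'
}

# Usage:
#   ps axo pid,stat,comm | stuck_procs
```

Run it a few times a minute or so apart. A process that shows up in 'D' once is normal; one that stays there is not.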

BTW, if it is the latter case, some more info about exactly what
services etc are misbehaving (all/some, what versions, etc) would be
handy.

> Are there any online testing that can be performed while the server is up to get any indication as to what is going on?

Use smartctl to query the disks and find out if they think they're
dying. You may need to enable SMART first. Consider installing the
newest version first, as it knows about more hard disk quirks and more
vendor attributes. You can get it from smartmontools.sf.net .
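
As a starting point, the sketch below filters the attribute table for the three attributes I'd look at first. It assumes the usual smartctl -A layout (attribute name in column 2, raw value in column 10), which may differ on older versions, and /dev/hda is only an example device. The function name is mine.

```shell
# Print selected SMART attributes whose raw value is non-zero.
# Reads `smartctl -A` output on stdin. Assumes the attribute name is
# field 2 and the raw value is field 10.
smart_warnings() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ && $10 + 0 > 0 { print $2 "=" $10 }'
}

# Usage (as root; /dev/hda is an assumed device name):
#   smartctl -s on /dev/hda      # enable SMART if it is off
#   smartctl -H /dev/hda         # overall health self-assessment
#   smartctl -A /dev/hda | smart_warnings
```

No output doesn't prove the disk is healthy, but non-zero values here are a good reason to be suspicious.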

If you can't figure out the output or want a second opinion, post it and
I'll check it out. I'm getting used to spotting dying disks now - due,
alas, to practice.

I'd also set up a serial console to another machine, and boot the server
with console=ttyS0 (the kernel takes the device name without /dev/). Then
use dmesg -n <level> to raise the console log level, so that more kernel
messages reach the console. Get the other box to capture the serial port
input and log it. That way you can capture panics etc even when the
machine is unattended.
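
Roughly, the setup looks like the sketch below. The port, the 9600 8N1 line settings and the log path are all assumptions on my part, and the function name is made up; adjust to suit. Note that the kernel documents the console parameter as a bare device name such as ttyS0.

```shell
# Run on the logging box: set the serial port to 9600 8N1 raw mode and
# append everything that arrives to a log file. Port and log path are
# assumed defaults, override them with arguments.
capture_serial() {
    port=${1:-/dev/ttyS0}
    log=${2:-/var/log/server-console.log}
    stty -F "$port" 9600 cs8 -parenb -cstopb raw -echo
    cat "$port" >> "$log"
}

# On the server, add to the kernel command line in the boot loader:
#   console=tty0 console=ttyS0,9600n8
# and after boot raise the console log level so everything is printed:
#   dmesg -n 8
```

Listing both consoles keeps messages on the local screen as well as the serial line.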

If you can't do that, make sure a head is attached and disable console
blanking with setterm -blank 0. That way, if it's dying due to a panic
you'll see it.

Speaking of panics - any flashing keyboard LEDs?

Also start logging syslog remotely if you have another UNIX box on site.
My firewall logs to my core internal server and my SCO box with the
following two lines in syslog.conf:

*.*			@10.0.0.10
*.*			@10.0.0.4

The core server (RH8) needed tweaks to either the syslog init script or
a file in /etc/sysconfig (can't remember now) to get it to accept remote
logging. Basically, you must ensure syslogd is started with the '-r'
option. The SCO box accepts remote logging by default... *sigh*.
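
On Red Hat of that era the setting normally lives in /etc/sysconfig/syslog, so the change is probably something like the fragment below. I haven't re-checked it against that exact box, so treat it as a sketch.

```shell
# /etc/sysconfig/syslog on the receiving Red Hat machine.
# -m 0 is the stock setting (no periodic MARK lines);
# -r makes syslogd accept messages from the network.
SYSLOGD_OPTIONS="-m 0 -r"

# Then restart syslogd on the receiver:
#   service syslog restart
# and send a test message from the gateway:
#   logger -p user.notice "remote syslog test"
```

If the test message doesn't turn up, check that nothing is filtering UDP port 514 between the two machines.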

You might also be able to get a syslog server for Windows - I'm not
sure.

Oh ... and use 'chkrootkit'.

Craig Ringer