[plug] How to diagnose a crashing Linux server?

Richard Mortimer linux at netfire.com.au
Mon May 26 10:55:17 WST 2003


Hi all,

Hope you had a good weekend. Our server went down again this morning, so in
reply to the questions:

> Aha, you didn't mention you were running X anywhere. Can you run the
> server w/o X for normal operations, and just fire up X when needed?

X was off at the time of the most recent crash; the machine is set to init
level 3. The terminal was left on, but apparently the only messages on the
system console were about samba retries (I didn't see them personally). At
boot-up time there was mention of a recovery journal, but a grep for
"recovery", "core" or "dump" produced nothing. It's as though the machine
just suddenly stops.
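
For anyone wanting to repeat that keyword search, here is a rough sketch of
it in Python. The function name and the sample log lines are made up for
illustration; on a Red Hat box the real input would probably be
/var/log/messages, but that path is an assumption on my part. Like a plain
grep, it matches substrings, so "core" would also hit words such as "score".

```python
def find_keyword_lines(lines, keywords=("recovery", "core", "dump")):
    """Return (line_number, line) pairs whose text contains any keyword.

    Matching is case-insensitive and substring-based, like `grep -i`.
    """
    lowered = [k.lower() for k in keywords]
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.lower()
        if any(k in text for k in lowered):
            hits.append((number, line.rstrip("\n")))
    return hits


# Hypothetical sample lines, NOT taken from the real server's logs.
sample = [
    "May 26 09:23:01 host kernel: EXT3-fs: recovery complete.\n",
    "May 26 09:24:10 host smbd[897]: retrying connection\n",
]
hits = find_keyword_lines(sample)
for number, line in hits:
    print(number, line)
```

To run it over a real file, pass the open file object instead of the sample
list, e.g. find_keyword_lines(open(path)) with whatever log path applies.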

BTW: not sure whether this is important or not, but X has been installed
since I first configured the machine. Nothing has changed in this
environment, and it has been running for the last couple of months with no
drama.

> What's the motherboard chipset and video card? X can cause all sorts of

CPU: Intel Xeon 2GHz x 2
Graphics: ATI Mach64, 8Mb

> >>Does it reply to ICMP?
> >
> > Will a 'ping' suffice? I haven't tried this, but can next time it fails.
>
> A ping is what you want, yeah.

No, it doesn't reply to a ping.

> from your description, but I could easily be wrong - can you post a
> "free -m"?

from the 22nd May:

             total       used       free     shared    buffers     cached
Mem:          1511       1496         14          0         62       1298
-/+ buffers/cache:        136       1374
Swap:         1992          0       1992

from today (post reboot):

             total       used       free     shared    buffers     cached
Mem:          1511        161       1350          0         38         47
-/+ buffers/cache:         75       1436
Swap:         1992          0       1992
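
As a cross-check on how to read those numbers, this small Python sketch
recomputes the "-/+ buffers/cache" row from the "Mem:" row of the 22nd May
output. The function names are hypothetical, and it assumes the old
six-column free layout shown above.

```python
def parse_mem_line(line):
    """Parse the 'Mem:' row of old-style `free -m` output into a dict of MB values."""
    fields = line.split()
    if len(fields) < 7 or fields[0] != "Mem:":
        raise ValueError("expected a 'Mem:' row with six numeric columns")
    names = ("total", "used", "free", "shared", "buffers", "cached")
    return dict(zip(names, map(int, fields[1:7])))


def without_cache(mem):
    """Recompute the '-/+ buffers/cache' row.

    Returns (memory actually used by programs, memory still available to them),
    treating buffers and page cache as reclaimable.
    """
    really_used = mem["used"] - mem["buffers"] - mem["cached"]
    really_free = mem["free"] + mem["buffers"] + mem["cached"]
    return really_used, really_free


# The Mem: row from the 22nd May output.
mem = parse_mem_line(
    "Mem:          1511       1496         14          0         62       1298"
)
used, free = without_cache(mem)
print(used, free)
```

For the 22nd May figures this gives 136 used and 1374 free, matching the
second row of that output. The post-reboot figures come out a megabyte off,
since each column is converted to megabytes separately.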

>I'm just wondering if you have anything infected with UNIX/RST.[AB] or
[snip]
>Are there any 'defunct' processes listed when you 'ps waux' -

No, it doesn't appear so:

USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
root         1  0.2  0.0  1388  476 ?        S    09:23   0:03 init
root         2  0.0  0.0     0    0 ?        SW   09:23   0:00 [migration_CPU0]
root         3  0.0  0.0     0    0 ?        SW   09:23   0:00 [migration_CPU1]
root         4  0.0  0.0     0    0 ?        SW   09:23   0:00 [migration_CPU2]
root         5  0.0  0.0     0    0 ?        SW   09:23   0:00 [migration_CPU3]
root         6  0.0  0.0     0    0 ?        SW   09:23   0:00 [keventd]
root         7  0.0  0.0     0    0 ?        SWN  09:23   0:00 [ksoftirqd_CPU0]
root         8  0.0  0.0     0    0 ?        SWN  09:23   0:00 [ksoftirqd_CPU1]
root         9  0.0  0.0     0    0 ?        SWN  09:23   0:00 [ksoftirqd_CPU2]
root        10  0.0  0.0     0    0 ?        SWN  09:23   0:00 [ksoftirqd_CPU3]
root        11  0.0  0.0     0    0 ?        SW   09:23   0:00 [kswapd]
root        12  0.0  0.0     0    0 ?        SW   09:23   0:00 [bdflush]
root        13  0.0  0.0     0    0 ?        SW   09:23   0:00 [kupdated]
root        14  0.0  0.0     0    0 ?        SW   09:23   0:00 [mdrecoveryd]
root        20  0.0  0.0     0    0 ?        SW   09:23   0:00 [scsi_eh_0]
root        25  0.0  0.0     0    0 ?        SW   09:23   0:00 [kjournald]
root        81  0.0  0.0     0    0 ?        SW   09:23   0:00 [khubd]
root       175  0.0  0.0     0    0 ?        SW   09:24   0:00 [kjournald]
root       176  0.0  0.0     0    0 ?        SW   09:24   0:00 [kjournald]
root       177  0.0  0.0     0    0 ?        SW   09:24   0:00 [kjournald]
root       178  0.0  0.0     0    0 ?        SW   09:24   0:00 [kjournald]
root       505  0.0  0.0  1460  580 ?        S    09:24   0:00 syslogd -m 0
root       509  0.0  0.0  1396  452 ?        S    09:24   0:00 klogd -x
rpc        526  0.0  0.0  1552  560 ?        S    09:24   0:00 portmap
rpcuser    545  0.0  0.0  1596  752 ?        S    09:24   0:00 rpc.statd
root       619  0.0  0.0     0    0 ?        SW   09:24   0:00 [rpciod]
root       620  0.0  0.0     0    0 ?        SW   09:24   0:00 [lockd]
root       668  0.0  0.0  3344 1464 ?        S    09:24   0:00 /usr/sbin/sshd
root       683  0.0  0.0  2064  892 ?        S    09:24   0:00 xinetd -stayalive -pidfile /var/run/xinetd.pid
lp         693  0.0  0.0  4772 1184 ?        S    09:24   0:00 lpd Waiting
root       714  0.0  0.1  5596 2452 ?        S    09:24   0:00 sendmail: accepting connections
smmsp      723  0.0  0.1  4944 2084 ?        S    09:24   0:00 sendmail: Queue runner at 01:00:00 for /var/spool/clientmqueue
root       733  0.0  0.0  1428  444 ?        S    09:24   0:00 gpm -t ps/2 -m /dev/mouse
root       742  0.0  0.0  1444  592 ?        S    09:24   0:00 crond
xfs        771  0.0  0.2  4484 3140 ?        S    09:24   0:00 xfs -droppriv -daemon
root       780  0.0  0.0  1412  620 ?        SN   09:24   0:00 anacron -s
daemon     789  0.0  0.0  1432  552 ?        S    09:24   0:00 /usr/sbin/atd
root       800  0.0  0.0  3424  556 ?        S    09:24   0:00 rhnsd --interval 120
root       806  0.0  0.0  2332 1060 ?        S    09:24   0:00 login -- root
root       807  0.0  0.0  1372  416 tty2     S    09:24   0:00 /sbin/mingetty tty2
root       808  0.0  0.0  1376  420 tty3     S    09:24   0:00 /sbin/mingetty tty3
root       809  0.0  0.0  1376  420 tty4     S    09:24   0:00 /sbin/mingetty tty4
root       810  0.0  0.0  1376  420 tty5     S    09:24   0:00 /sbin/mingetty tty5
root       811  0.0  0.0  1376  420 tty6     S    09:24   0:00 /sbin/mingetty tty6
root       822  0.0  0.0  4396 1452 tty1     S    09:33   0:00 -bash
root       870  0.0  0.5 17760 8012 ?        S    09:34   0:00 httpd -k start
apache     871  0.0  0.5 17892 8228 ?        S    09:34   0:00 httpd -k start
apache     872  0.0  0.5 17892 8232 ?        S    09:34   0:00 httpd -k start
apache     873  0.0  0.5 17884 8224 ?        S    09:34   0:00 httpd -k start
apache     874  0.0  0.5 17892 8236 ?        S    09:34   0:00 httpd -k start
apache     875  0.0  0.5 17884 8220 ?        S    09:34   0:00 httpd -k start
apache     876  0.0  0.5 17892 8232 ?        S    09:34   0:00 httpd -k start
apache     877  0.0  0.5 17892 8228 ?        S    09:34   0:00 httpd -k start
apache     878  0.0  0.5 17884 8216 ?        S    09:34   0:00 httpd -k start
root       897  0.0  0.1  5000 1796 ?        S    09:34   0:00 smbd
root       900  0.0  0.1  3844 1596 ?        S    09:34   0:00 nmbd
root       908  0.0  0.1  3884 1692 ?        S    09:36   0:00 /sbin/mount.smbfs //phil/BackMeUp /mnt/phil -o rw username Administra
root       927  0.0  0.1  4032 1888 ?        S    09:42   0:00 /sbin/mount.smbfs //W2KServer/BackMeUp /mnt/W2KServer -o rw username
root       930  0.0  0.1  5484 2432 ?        S    09:43   0:00 smbd
root       948  0.0  0.0  2652  704 tty1     R    09:49   0:00 ps waux
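
Rather than eyeballing a listing like this for defunct processes, the check
could be scripted. Below is a minimal Python sketch that scans "ps waux"
style text for a STAT column starting with "Z" (zombie/defunct). The
function name is hypothetical, and the PID 999 entry in the sample is
invented purely to show a hit; no such process was on the server.

```python
def find_defunct(ps_output):
    """Return (pid, command) for each process whose STAT column starts with 'Z'."""
    zombies = []
    for line in ps_output.splitlines():
        # Columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        # Skip the header, blank lines and mail-wrapped continuation lines:
        # a real entry has a numeric PID and at least the first ten columns.
        if len(fields) < 10 or not fields[1].isdigit():
            continue
        if fields[7].startswith("Z"):
            command = fields[10] if len(fields) > 10 else ""
            zombies.append((int(fields[1]), command))
    return zombies


# Sample input: the init line is from the listing, PID 999 is made up.
sample = """USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
root         1  0.2  0.0  1388  476 ?        S    09:23   0:03 init
root       999  0.0  0.0     0    0 ?        Z    09:50   0:00 [example <defunct>]
"""
zombies = find_defunct(sample)
print(zombies)
```

Run against the real listing it should return an empty list, since every
STAT value there is S, SW, SWN, SN or R.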

So is there anything else I can check? Or should I proceed with 'plan B',
which is to schedule a cron job to reboot the server every night?

Thanks again folks

Richard


----- Original Message ----- 
From: "Craig Ringer" <craig at postnewspapers.com.au>
To: <plug at plug.linux.org.au>
Sent: Wednesday, May 21, 2003 3:28 PM
Subject: Re: [plug] How to diagnose a crashing Linux server?


> > The odd thing about my lastlog being 18Mb in size is that there are
> > three valid users on the system, two of whom rarely log on, and the
> > machine was set up a couple of months ago ~feb/mar.
> >
> > I followed the other thread on this topic, and followed a similar tack,
> > which recreated the lastlog as a 143kb file.
>
> Interesting.
>
> >>setterm -blank 0
> >
> > Thanks - however my screen keeps turning off, I've turned 'off' energy
> > saving in both the screen itself, and RH(Gnome) -> Preferences ->
> > Screensaver | Advanced and also applied the CLI parameters as supplied
> > above. Trying the 'failsafe' login mode to see if that stops the screen
> > turn off.
>
> Aha, you didn't mention you were running X anywhere. Can you run the
> server w/o X for normal operations, and just fire up X when needed?
> What's the motherboard chipset and video card? X can cause all sorts of
> stability issues occasionally, and it's always a good idea to disable X
> logins for servers unless you have a good reason not to.
>
> How to disable graphical logins depends on distro. In Red Hat you can vi
> /etc/inittab and change
> id:5:initdefault:
> to
> id:3:initdefault:
> but first make sure that all your other services will start as expected
> in runlevel 3. Unless you've customised your starting services heavily
> and only in runlevel 5, it should be the same. Make sure with
> "chkconfig" and by looking at the /etc/rc3.d and /etc/rc5.d directories.
>
> For debian, you can always simply remove /etc/rc2.d/S99gdm or mv
> /etc/rc2.d/{S,K}99gdm if you like (and assuming you're using gdm).
>
> You can always fire up X manually by logging in on the console and
> running "startx". Alternately, if you're using GDM (method is different
> for other ?DMs) you can vi /etc/X11/gdm/gdm.conf and in the [servers]
> section comment out the line that starts an X server on GDM startup.
> This allows you to have GDM running but not starting local X servers -
> important if you use a server for serving remote X clients with XDMCP.
> If it's set up to reply to XDMCP queries, you can get a login with
> X -query <server-running-gdm>
>
> As for preventing the display from turning off, you can usually just
> "xset dpms 0 0 0". However something like GNOME might override that,
> and/or anything set in the XF86Config. I've never run a desktop
> environment on a server (and barely at all on a desktop box, I use a
> customised IceWM usually) so I can't really help you there. Suggested
> solution: disable X.
>
> >>Does it reply to ICMP?
> >
> > Will a 'ping' suffice? I haven't tried this, but can next time it fails.
>
> A ping is what you want, yeah.
>
> > One thing I've noticed, however is the memory usage is up reasonably
> > high, at 3PM yesterday, following a reboot at around 9am, there was
> > 293Mb of 1.5Gb of physical memory in use, with none of the 1.9Gb swap
> > file being used. Today at 9.30AM there was 1.4Gb of memory in use, with
> > no swap file being used. Currently (3PM today) there is still 1.4Gb of
> > memory in use, the top usage programs are nautilus (~14Mb), python
> > (~15Mb), gnome-settings-daemon (~12.2Mb), gnome-panel (~10Mb),
> > gnome-system-monitor (~8.5Mb) and httpd (~8Mb) ... the total of these
> > seems to be nowhere near the 1.4Gb being reported used. This info was
> > collected via System Monitor. 'free' seems to indicate roughly the same
> > figures.
>
> That all sounds reasonable, most of your RAM usage will be cached files
> from the disk. For example:
>
> [craig at bucket craig]$ free -m
>              total       used       free     shared    buffers     cached
> Mem:          2015       1808        207          0        304       1300
> -/+ buffers/cache:        202       1812
> Swap:         3999          0       3999
>
> Here, I have 1.8 gigs of RAM "in use" but in reality, all but 200mb of
> that is available for programs to use - the kernel is just using it as a
> monster disk cache until it's needed. I assume that's what you're seeing
> from your description, but I could easily be wrong - can you post a
> "free -m"?
>


