[plug] How to diagnose a crashing Linux server?
Richard Mortimer
linux at netfire.com.au
Mon May 26 10:55:17 WST 2003
Hi all,
Hope you had a good weekend. Our server went down again this morning, so in
reply to the questions:
> Aha, you didn't mention you were running X anywhere. Can you run the
> server w/o X for normal operations, and just fire up X when needed?
X was off at the time of the most recent crash; the machine was set to
runlevel 3. The terminal was left on, but apparently the only messages on the
system console were about Samba retries (I didn't see them personally). At
boot-up there was mention of a recovery journal, but grepping the logs for
"recovery", "core" or "dump" produced nothing. It's as though the machine
just suddenly stops.
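
For reference, the keyword search described above can be sketched as a small
shell function. This is only a sketch: the function name scan_log is made up
for illustration, and the log path in the usage line assumes the Red Hat
default of /var/log/messages.

```shell
#!/bin/sh
# scan_log FILE: print every line of FILE (prefixed with its line number)
# that mentions one of the keywords searched for above.
# The name scan_log is hypothetical; it is not an existing tool.
scan_log() {
    # -i: ignore case, -n: show line numbers, -E: allow the a|b|c alternation
    grep -inE 'recovery|core|dump' "$1"
}
```

Usage would be something like `scan_log /var/log/messages`. An empty result
means none of the three words appear anywhere in that file.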
BTW: not sure whether this is important or not, but X has been installed
since I first configured the machine. Nothing has changed in this
environment, and it has been running for the last couple of months with no
drama.
> What's the motherboard chipset and video card? X can cause all sorts of
CPU: Intel Xeon 2GHz x 2
Graphics: ATI Mach64, 8MB
> >>Does it reply to ICMP?
> >
> > Will a 'ping' suffice? I haven't tried this, but can next time it fails.
>
> A ping is what you want, yeah.
No, it doesn't reply to a ping.
> from your description, but I could easily be wrong - can you post a
> "free -m"?
from the 22nd May:
             total       used       free     shared    buffers     cached
Mem:          1511       1496         14          0         62       1298
-/+ buffers/cache:        136       1374
Swap:         1992          0       1992
from today (post reboot):
             total       used       free     shared    buffers     cached
Mem:          1511        161       1350          0         38         47
-/+ buffers/cache:         75       1436
Swap:         1992          0       1992
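
To read these figures: the memory actually held by programs is roughly
"used" minus "buffers" minus "cached", which is what the "-/+ buffers/cache"
row reports. A minimal sketch that computes it from the "Mem:" row follows.
It assumes the column layout shown above (newer versions of free use
different columns), and the function name app_used_mb is made up for
illustration.

```shell
#!/bin/sh
# app_used_mb: read `free -m` output on stdin and print
# used - buffers - cached (in MB), taken from the "Mem:" row.
# Assumes the columns: total used free shared buffers cached.
# The name app_used_mb is hypothetical.
app_used_mb() {
    # On the Mem: row, $3 = used, $6 = buffers, $7 = cached
    awk '$1 == "Mem:" { print $3 - $6 - $7 }'
}
```

Usage would be `free -m | app_used_mb`. For the 22nd May figures this gives
1496 - 62 - 1298 = 136, matching the "-/+ buffers/cache" row.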
>I'm just wondering if you have anything infected with UNIX/RST.[AB] or
[snip]
>Are there any 'defunct' processes listed when you 'ps waux' -
No, it doesn't appear so:
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
root         1  0.2  0.0  1388  476 ?        S    09:23   0:03 init
root         2  0.0  0.0     0    0 ?        SW   09:23   0:00 [migration_CPU0]
root         3  0.0  0.0     0    0 ?        SW   09:23   0:00 [migration_CPU1]
root         4  0.0  0.0     0    0 ?        SW   09:23   0:00 [migration_CPU2]
root         5  0.0  0.0     0    0 ?        SW   09:23   0:00 [migration_CPU3]
root         6  0.0  0.0     0    0 ?        SW   09:23   0:00 [keventd]
root         7  0.0  0.0     0    0 ?        SWN  09:23   0:00 [ksoftirqd_CPU0]
root         8  0.0  0.0     0    0 ?        SWN  09:23   0:00 [ksoftirqd_CPU1]
root         9  0.0  0.0     0    0 ?        SWN  09:23   0:00 [ksoftirqd_CPU2]
root        10  0.0  0.0     0    0 ?        SWN  09:23   0:00 [ksoftirqd_CPU3]
root        11  0.0  0.0     0    0 ?        SW   09:23   0:00 [kswapd]
root        12  0.0  0.0     0    0 ?        SW   09:23   0:00 [bdflush]
root        13  0.0  0.0     0    0 ?        SW   09:23   0:00 [kupdated]
root        14  0.0  0.0     0    0 ?        SW   09:23   0:00 [mdrecoveryd]
root        20  0.0  0.0     0    0 ?        SW   09:23   0:00 [scsi_eh_0]
root        25  0.0  0.0     0    0 ?        SW   09:23   0:00 [kjournald]
root        81  0.0  0.0     0    0 ?        SW   09:23   0:00 [khubd]
root       175  0.0  0.0     0    0 ?        SW   09:24   0:00 [kjournald]
root       176  0.0  0.0     0    0 ?        SW   09:24   0:00 [kjournald]
root       177  0.0  0.0     0    0 ?        SW   09:24   0:00 [kjournald]
root       178  0.0  0.0     0    0 ?        SW   09:24   0:00 [kjournald]
root       505  0.0  0.0  1460  580 ?        S    09:24   0:00 syslogd -m 0
root       509  0.0  0.0  1396  452 ?        S    09:24   0:00 klogd -x
rpc        526  0.0  0.0  1552  560 ?        S    09:24   0:00 portmap
rpcuser    545  0.0  0.0  1596  752 ?        S    09:24   0:00 rpc.statd
root       619  0.0  0.0     0    0 ?        SW   09:24   0:00 [rpciod]
root       620  0.0  0.0     0    0 ?        SW   09:24   0:00 [lockd]
root       668  0.0  0.0  3344 1464 ?        S    09:24   0:00 /usr/sbin/sshd
root       683  0.0  0.0  2064  892 ?        S    09:24   0:00 xinetd -stayalive -pidfile /var/run/xinetd.pid
lp         693  0.0  0.0  4772 1184 ?        S    09:24   0:00 lpd Waiting
root       714  0.0  0.1  5596 2452 ?        S    09:24   0:00 sendmail: accepting connections
smmsp      723  0.0  0.1  4944 2084 ?        S    09:24   0:00 sendmail: Queue runner at 01:00:00 for /var/spool/clientmqueue
root       733  0.0  0.0  1428  444 ?        S    09:24   0:00 gpm -t ps/2 -m /dev/mouse
root       742  0.0  0.0  1444  592 ?        S    09:24   0:00 crond
xfs        771  0.0  0.2  4484 3140 ?        S    09:24   0:00 xfs -droppriv -daemon
root       780  0.0  0.0  1412  620 ?        SN   09:24   0:00 anacron -s
daemon     789  0.0  0.0  1432  552 ?        S    09:24   0:00 /usr/sbin/atd
root       800  0.0  0.0  3424  556 ?        S    09:24   0:00 rhnsd --interval 120
root       806  0.0  0.0  2332 1060 ?        S    09:24   0:00 login -- root
root       807  0.0  0.0  1372  416 tty2     S    09:24   0:00 /sbin/mingetty tty2
root       808  0.0  0.0  1376  420 tty3     S    09:24   0:00 /sbin/mingetty tty3
root       809  0.0  0.0  1376  420 tty4     S    09:24   0:00 /sbin/mingetty tty4
root       810  0.0  0.0  1376  420 tty5     S    09:24   0:00 /sbin/mingetty tty5
root       811  0.0  0.0  1376  420 tty6     S    09:24   0:00 /sbin/mingetty tty6
root       822  0.0  0.0  4396 1452 tty1     S    09:33   0:00 -bash
root       870  0.0  0.5 17760 8012 ?        S    09:34   0:00 httpd -k start
apache     871  0.0  0.5 17892 8228 ?        S    09:34   0:00 httpd -k start
apache     872  0.0  0.5 17892 8232 ?        S    09:34   0:00 httpd -k start
apache     873  0.0  0.5 17884 8224 ?        S    09:34   0:00 httpd -k start
apache     874  0.0  0.5 17892 8236 ?        S    09:34   0:00 httpd -k start
apache     875  0.0  0.5 17884 8220 ?        S    09:34   0:00 httpd -k start
apache     876  0.0  0.5 17892 8232 ?        S    09:34   0:00 httpd -k start
apache     877  0.0  0.5 17892 8228 ?        S    09:34   0:00 httpd -k start
apache     878  0.0  0.5 17884 8216 ?        S    09:34   0:00 httpd -k start
root       897  0.0  0.1  5000 1796 ?        S    09:34   0:00 smbd
root       900  0.0  0.1  3844 1596 ?        S    09:34   0:00 nmbd
root       908  0.0  0.1  3884 1692 ?        S    09:36   0:00 /sbin/mount.smbfs //phil/BackMeUp /mnt/phil -o rw username Administra
root       927  0.0  0.1  4032 1888 ?        S    09:42   0:00 /sbin/mount.smbfs //W2KServer/BackMeUp /mnt/W2KServer -o rw username
root       930  0.0  0.1  5484 2432 ?        S    09:43   0:00 smbd
root       948  0.0  0.0  2652  704 tty1     R    09:49   0:00 ps waux
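
Rather than reading a listing like this by eye, the check for defunct
(zombie) processes can be done with a filter on the STAT column. A minimal
sketch, assuming the eleven-column `ps waux` layout shown above, where STAT
is the 8th field and zombies are marked with a leading Z; the function name
list_zombies is made up for illustration.

```shell
#!/bin/sh
# list_zombies: read `ps waux` output on stdin and print only the rows
# whose STAT field (8th column) starts with Z, i.e. defunct processes.
# The name list_zombies is hypothetical.
list_zombies() {
    # NR > 1 skips the header row; awk prints matching rows unchanged
    awk 'NR > 1 && $8 ~ /^Z/'
}
```

Usage would be `ps waux | list_zombies`. No output means no defunct
processes were found at that moment.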
So is there anything else I can check? Or should I proceed with 'plan B',
which is to schedule a cron job to reboot the server every night?
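
For what it's worth, 'plan B' would come down to a single line in root's
crontab. A rough sketch of generating that line follows. The function name
reboot_cron_line and the 03:30 time in the usage note are only illustrative
assumptions, and a scheduled reboot would only mask the crashes, not explain
them.

```shell
#!/bin/sh
# reboot_cron_line HOUR MINUTE: print a crontab entry that reboots the
# machine every day at HOUR:MINUTE.
# The name reboot_cron_line is hypothetical.
reboot_cron_line() {
    # crontab field order: minute hour day-of-month month day-of-week command
    printf '%s %s * * * /sbin/shutdown -r now\n' "$2" "$1"
}
```

For example, `reboot_cron_line 3 30` prints an entry for 03:30 each night,
which could then be added to root's crontab with `crontab -e`.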
Thanks again folks
Richard
----- Original Message -----
From: "Craig Ringer" <craig at postnewspapers.com.au>
To: <plug at plug.linux.org.au>
Sent: Wednesday, May 21, 2003 3:28 PM
Subject: Re: [plug] How to diagnose a crashing Linux server?
> > The odd thing about my lastlog being 18MB in size is that there are three
> > valid users on the system, two of whom rarely log on, and the machine was
> > set up a couple of months ago (~Feb/Mar).
> >
> > I followed the other thread on this topic and took a similar tack,
> > which recreated the lastlog as a 143KB file.
>
> Interesting.
>
> >>setterm -blank 0
> >
> > Thanks - however my screen keeps turning off. I've turned 'off' energy
> > saving in both the screen itself and RH (Gnome) -> Preferences ->
> > Screensaver | Advanced, and also applied the CLI parameters as supplied
> > above. Trying the 'failsafe' login mode to see if that stops the screen
> > turning off.
>
> Aha, you didn't mention you were running X anywhere. Can you run the
> server w/o X for normal operations, and just fire up X when needed?
> What's the motherboard chipset and video card? X can cause all sorts of
> stability issues occasionally, and it's always a good idea to disable X
> logins for servers unless you have a good reason not to.
>
> How to disable graphical logins depends on distro. In Red Hat you can vi
> /etc/inittab and change
> id:5:initdefault:
> to
> id:3:initdefault:
> but first make sure that all your other services will start as expected
> in runlevel 3. Unless you've customised your starting services heavily
> and only in runlevel 5, it should be the same. Make sure with
> "chkconfig" and by looking at the /etc/rc3.d and /etc/rc5.d directories.
>
> For Debian, you can always simply remove /etc/rc2.d/S99gdm or mv
> /etc/rc2.d/{S,K}99gdm if you like (and assuming you're using gdm).
>
> You can always fire up X manually by logging in on the console and
> running "startx". Alternately, if you're using GDM (method is different
> for other ?DMs) you can vi /etc/X11/gdm/gdm.conf and in the [servers]
> section comment out the line that starts an X server on GDM startup.
> This allows you to have GDM running but not starting local X servers -
> important if you use a server for serving remote X clients with XDMCP.
> If it's set up to reply to XDMCP queries, you can get a login with
> X -query <server-running-gdm>
>
> As for preventing the display from turning off, you can usually just
> "xset dpms 0 0 0". However something like GNOME might override that,
> and/or anything set in the XF86Config. I've never run a desktop
> environment on a server (and barely at all on a desktop box, I use a
> customised IceWM usually) so I can't really help you there. Suggested
> solution: disable X.
>
> >>Does it reply to ICMP?
> >
> > Will a 'ping' suffice? I haven't tried this, but can next time it fails.
>
> A ping is what you want, yeah.
>
> > One thing I've noticed, however, is that memory usage is reasonably high.
> > At 3PM yesterday, following a reboot at around 9AM, there was 293MB of
> > 1.5GB of physical memory in use, with none of the 1.9GB swap file being
> > used. Today at 9.30AM there was 1.4GB of memory in use, with no swap file
> > being used. Currently (3PM today) there is still 1.4GB of memory in use.
> > The top usage programs are nautilus (~14MB), python (~15MB),
> > gnome-settings-daemon (~12.2MB), gnome-panel (~10MB),
> > gnome-system-monitor (~8.5MB) and httpd (~8MB) ... the total of these
> > seems to be nowhere near the 1.4GB being reported used. This info was
> > collected via System Monitor. 'free' seems to indicate roughly the same
> > figures.
>
> That all sounds reasonable; most of your RAM usage will be cached files
> from the disk. For example:
>
> [craig at bucket craig]$ free -m
>              total       used       free     shared    buffers     cached
> Mem:          2015       1808        207          0        304       1300
> -/+ buffers/cache:        202       1812
> Swap:         3999          0       3999
>
> Here, I have 1.8 gigs of RAM "in use" but in reality, all but 200MB of
> that is available for programs to use - the kernel is just using it as a
> monster disk cache until it's needed. I assume that's what you're seeing
> from your description, but I could easily be wrong - can you post a
> "free -m"?
>