Fw: [plug] How to diagnose a crashing Linux server?
Richard Mortimer
linux at netfire.com.au
Tue Jun 3 11:53:24 WST 2003
Greetings Folks,
As a follow-up to this email that I sent just over a week ago - I resorted
to "plan B" which was to schedule a CRON job to reboot every night.
Unfortunately we had the server go down again this weekend, which wasn't too
good for the people in the east who were trying to access the site on
Monday.
Jon Miller asked a couple of questions in an off-list email, so I thought I'd
repeat some of his queries, and my responses, here - to see if anyone else
can suggest what might be causing this:
> JLM> which model and specs?
It's an IBM eServer xSeries 235, Type 8671, with 2 x Intel Xeon 2GHz CPUs,
1.5GB memory (not sure how this is configured) and 2 x 73.4GB 10,000RPM IBM
HDs (Model: IC35L073UCDY10-0 - P/N 07N8812), mirrored via hardware RAID. The
only hardware additions to this box have been the Adaptec SCSI card
(AVA-2906) and the Travan tape unit (TapeStor 20GB, SCSI version).
> 0:00 /usr/libexec/gcon
> 0:00 /usr/libexec/bono
> 0:00 metacity --sm-sav
>
> JLM> metacity -sm-sav = small windows manager using gtk2
> The other two I'm not sure, but you need to issue ps au to see the owner.
Is anyone familiar with these two processes "gcon" and "bono"?
> JLM> Ouch, I never would use a Travan tape unit in a server. If you can
> try using a DDS.
Are there specific problems/known issues with Travan units? Are there
advantages to using a DDS-based drive?
> JLM> I do not see anything out of the norm, except those 3 processes you
> sent. If possible try to remove them to see if they have any effect on
> anything. My question would be: is this server booting to a GUI or a CLI
> interface?
It generally boots into a GUI, but another PLUG member recommended running
it from the CLI instead - the crash still occurred in both cases.
> As for hardware: if the backup is scheduled to run at a certain time and
> the server crashes soon after, during, or even later in the day, the
> question has to be asked whether memory is being depleted. The only way to
> see this is to get a top readout after the server reboots, and before and
> after the backup.
Hmmm, yes, this was something that crossed my mind too. The backups run at
11:55pm, and I scheduled a CRON job to reboot the system at 4am. Backups
generally run for around 10 minutes (backing up config files only, to a
local tar file), or around an hour when backing up data directories (creates
a tar file on disk, writes tar file to tape).
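
One way to capture the before/after readout suggested above would be a small
script run from cron shortly before the backup starts and again once it
should have finished. This is only a sketch, not something already on this
box: it assumes /proc/meminfo carries the usual MemTotal, MemFree, Buffers
and Cached lines in kB, and the log path and function names are made up for
illustration.

```python
import time

# Hypothetical log location, chosen for illustration only.
LOG_PATH = "/var/log/mem-snapshot.log"
FIELDS = ("MemTotal", "MemFree", "Buffers", "Cached", "SwapTotal", "SwapFree")


def parse_meminfo(text):
    """Return {field: kB} for the fields of interest in /proc/meminfo text."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name in FIELDS:
            parts = rest.split()
            if parts:
                values[name] = int(parts[0])  # first token is the size in kB
    return values


def format_snapshot(values, label, now=None):
    """Build one log line: timestamp, label, then field=value pairs."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    pairs = " ".join(f"{name}={values[name]}kB" for name in FIELDS if name in values)
    return f"{stamp} {label} {pairs}"


def log_snapshot(label, meminfo_path="/proc/meminfo", log_path=LOG_PATH):
    """Read the current memory figures and append one line to the log."""
    with open(meminfo_path) as handle:
        values = parse_meminfo(handle.read())
    with open(log_path, "a") as log:
        log.write(format_snapshot(values, label) + "\n")
```

Calling log_snapshot("pre-backup") a few minutes before 11:55pm and
log_snapshot("post-backup") an hour or so later would leave two lines per
night to compare after the next crash. If MemFree plus Buffers plus Cached
stays roughly level from night to night, memory depletion looks less likely.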
So - anyone have any ideas? Plan C is a new box in its place ... to see
whether the same situation occurs; then I can isolate it to hardware or
software.
Thanks again
Richard
----- Original Message -----
From: "Richard Mortimer" <linux at netfire.com.au>
To: <plug at plug.linux.org.au>
Sent: Monday, May 26, 2003 10:55 AM
Subject: Re: [plug] How to diagnose a crashing Linux server?
> Hi all,
>
> Hope you had a good weekend. Our server went down again this morning, so in
> reply to the questions:
>
> > Aha, you didn't mention you were running X anywhere. Can you run the
> > server w/o X for normal operations, and just fire up X when needed?
>
> X was off at the time of the most recent crash - set in init level 3. The
> terminal was left on, but apparently the only messages to the system console
> were about samba retries (I didn't see it personally). At boot up time there
> was mention of a recovery journal - but doing a grep for "recovery",
> "core" or "dump" produced nothing. It's as though the machine just suddenly
> stops.
>
> BTW: not sure whether this is important or not, but X has been installed
> since I first configured the machine, nothing has changed in this
> environment and it has been running for the last couple of months with no
> drama.
>
> > What's the motherboard chipset and video card? X can cause all sorts of
>
> CPU: Intel Xeon 2GHz x 2
> Graphics: ATI Mach64, 8Mb
>
> > >>Does it reply to ICMP?
> > >
> > > Will a 'ping' suffice? I haven't tried this, but can next time it
> > > fails.
> >
> > A ping is what you want, yeah.
>
> No, it doesn't reply to a ping.
>
> > from your description, but I could easily be wrong - can you post a
> > "free -m"?
>
> from the 22nd May:
>
>              total       used       free     shared    buffers     cached
> Mem:          1511       1496         14          0         62       1298
> -/+ buffers/cache:        136       1374
> Swap:         1992          0       1992
>
> from today (post reboot):
>
>              total       used       free     shared    buffers     cached
> Mem:          1511        161       1350          0         38         47
> -/+ buffers/cache:         75       1436
> Swap:         1992          0       1992
>
> >I'm just wondering if you have anything infected with UNIX/RST.[AB] or
> [snip]
> >Are there any 'defunct' processes listed when you 'ps waux' -
>
> No, it doesn't appear so:
>
> USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
> root         1  0.2  0.0  1388  476 ?        S    09:23   0:03 init
> root         2  0.0  0.0     0    0 ?        SW   09:23   0:00 [migration_CPU0]
> root         3  0.0  0.0     0    0 ?        SW   09:23   0:00 [migration_CPU1]
> root         4  0.0  0.0     0    0 ?        SW   09:23   0:00 [migration_CPU2]
> root         5  0.0  0.0     0    0 ?        SW   09:23   0:00 [migration_CPU3]
> root         6  0.0  0.0     0    0 ?        SW   09:23   0:00 [keventd]
> root         7  0.0  0.0     0    0 ?        SWN  09:23   0:00 [ksoftirqd_CPU0]
> root         8  0.0  0.0     0    0 ?        SWN  09:23   0:00 [ksoftirqd_CPU1]
> root         9  0.0  0.0     0    0 ?        SWN  09:23   0:00 [ksoftirqd_CPU2]
> root        10  0.0  0.0     0    0 ?        SWN  09:23   0:00 [ksoftirqd_CPU3]
> root        11  0.0  0.0     0    0 ?        SW   09:23   0:00 [kswapd]
> root        12  0.0  0.0     0    0 ?        SW   09:23   0:00 [bdflush]
> root        13  0.0  0.0     0    0 ?        SW   09:23   0:00 [kupdated]
> root        14  0.0  0.0     0    0 ?        SW   09:23   0:00 [mdrecoveryd]
> root        20  0.0  0.0     0    0 ?        SW   09:23   0:00 [scsi_eh_0]
> root        25  0.0  0.0     0    0 ?        SW   09:23   0:00 [kjournald]
> root        81  0.0  0.0     0    0 ?        SW   09:23   0:00 [khubd]
> root       175  0.0  0.0     0    0 ?        SW   09:24   0:00 [kjournald]
> root       176  0.0  0.0     0    0 ?        SW   09:24   0:00 [kjournald]
> root       177  0.0  0.0     0    0 ?        SW   09:24   0:00 [kjournald]
> root       178  0.0  0.0     0    0 ?        SW   09:24   0:00 [kjournald]
> root       505  0.0  0.0  1460  580 ?        S    09:24   0:00 syslogd -m 0
> root       509  0.0  0.0  1396  452 ?        S    09:24   0:00 klogd -x
> rpc        526  0.0  0.0  1552  560 ?        S    09:24   0:00 portmap
> rpcuser    545  0.0  0.0  1596  752 ?        S    09:24   0:00 rpc.statd
> root       619  0.0  0.0     0    0 ?        SW   09:24   0:00 [rpciod]
> root       620  0.0  0.0     0    0 ?        SW   09:24   0:00 [lockd]
> root       668  0.0  0.0  3344 1464 ?        S    09:24   0:00 /usr/sbin/sshd
> root       683  0.0  0.0  2064  892 ?        S    09:24   0:00 xinetd -stayalive -pidfile /var/run/xinetd.pid
> lp         693  0.0  0.0  4772 1184 ?        S    09:24   0:00 lpd Waiting
> root       714  0.0  0.1  5596 2452 ?        S    09:24   0:00 sendmail: accepting connections
> smmsp      723  0.0  0.1  4944 2084 ?        S    09:24   0:00 sendmail: Queue runner at 01:00:00 for /var/spool/clientmqueue
> root       733  0.0  0.0  1428  444 ?        S    09:24   0:00 gpm -t ps/2 -m /dev/mouse
> root       742  0.0  0.0  1444  592 ?        S    09:24   0:00 crond
> xfs        771  0.0  0.2  4484 3140 ?        S    09:24   0:00 xfs -droppriv -daemon
> root       780  0.0  0.0  1412  620 ?        SN   09:24   0:00 anacron -s
> daemon     789  0.0  0.0  1432  552 ?        S    09:24   0:00 /usr/sbin/atd
> root       800  0.0  0.0  3424  556 ?        S    09:24   0:00 rhnsd --interval 120
> root       806  0.0  0.0  2332 1060 ?        S    09:24   0:00 login -- root
> root       807  0.0  0.0  1372  416 tty2     S    09:24   0:00 /sbin/mingetty tty2
> root       808  0.0  0.0  1376  420 tty3     S    09:24   0:00 /sbin/mingetty tty3
> root       809  0.0  0.0  1376  420 tty4     S    09:24   0:00 /sbin/mingetty tty4
> root       810  0.0  0.0  1376  420 tty5     S    09:24   0:00 /sbin/mingetty tty5
> root       811  0.0  0.0  1376  420 tty6     S    09:24   0:00 /sbin/mingetty tty6
> root       822  0.0  0.0  4396 1452 tty1     S    09:33   0:00 -bash
> root       870  0.0  0.5 17760 8012 ?        S    09:34   0:00 httpd -k start
> apache     871  0.0  0.5 17892 8228 ?        S    09:34   0:00 httpd -k start
> apache     872  0.0  0.5 17892 8232 ?        S    09:34   0:00 httpd -k start
> apache     873  0.0  0.5 17884 8224 ?        S    09:34   0:00 httpd -k start
> apache     874  0.0  0.5 17892 8236 ?        S    09:34   0:00 httpd -k start
> apache     875  0.0  0.5 17884 8220 ?        S    09:34   0:00 httpd -k start
> apache     876  0.0  0.5 17892 8232 ?        S    09:34   0:00 httpd -k start
> apache     877  0.0  0.5 17892 8228 ?        S    09:34   0:00 httpd -k start
> apache     878  0.0  0.5 17884 8216 ?        S    09:34   0:00 httpd -k start
> root       897  0.0  0.1  5000 1796 ?        S    09:34   0:00 smbd
> root       900  0.0  0.1  3844 1596 ?        S    09:34   0:00 nmbd
> root       908  0.0  0.1  3884 1692 ?        S    09:36   0:00 /sbin/mount.smbfs //phil/BackMeUp /mnt/phil -o rw username Administra
> root       927  0.0  0.1  4032 1888 ?        S    09:42   0:00 /sbin/mount.smbfs //W2KServer/BackMeUp /mnt/W2KServer -o rw username
> root       930  0.0  0.1  5484 2432 ?        S    09:43   0:00 smbd
> root       948  0.0  0.0  2652  704 tty1     R    09:49   0:00 ps waux
>
> So is there anything else I can check? Or should I proceed with 'plan B',
> which is to schedule a cron job to reboot the server every night?
>
> Thanks again folks
>
> Richard
>
>
> ----- Original Message -----
> From: "Craig Ringer" <craig at postnewspapers.com.au>
> To: <plug at plug.linux.org.au>
> Sent: Wednesday, May 21, 2003 3:28 PM
> Subject: Re: [plug] How to diagnose a crashing Linux server?
>
>
> > > The odd thing about my lastlog being 18Mb in size is that there are three
> > > valid users on the system, two of whom rarely log on, and the machine was
> > > set up a couple of months ago ~feb/mar.
> > >
> > > I followed the other thread on this topic, and followed a similar tack,
> > > which recreated the lastlog as a 143kb file.
> >
> > Interesting.
> >
> > >>setterm -blank 0
> > >
> > > Thanks - however my screen keeps turning off, I've turned 'off' energy
> > > saving in both the screen itself, and RH(Gnome) -> Preferences ->
> > > Screensaver | Advanced and also applied the CLI parameters as supplied
> > > above. Trying the 'failsafe' login mode to see if that stops the screen
> > > turn off.
> >
> > Aha, you didn't mention you were running X anywhere. Can you run the
> > server w/o X for normal operations, and just fire up X when needed?
> > What's the motherboard chipset and video card? X can cause all sorts of
> > stability issues occasionally, and it's always a good idea to disable X
> > logins for servers unless you have a good reason not to.
> >
> > How to disable graphical logins depends on distro. In Red Hat you can vi
> > /etc/inittab and change
> > id:5:initdefault:
> > to
> > id:3:initdefault:
> > but first make sure that all your other services will start as expected
> > in runlevel 3. Unless you've customised your starting services heavily
> > and only in runlevel 5, it should be the same. Make sure with
> > "chkconfig" and by looking at the /etc/rc3.d and /etc/rc5.d directories.
> >
> > For debian, you can always simply remove /etc/rc2.d/S99gdm or mv
> > /etc/rc2.d/{S,K}99gdm if you like (and assuming you're using gdm).
> >
> > You can always fire up X manually by logging in on the console and
> > running "startx". Alternately, if you're using GDM (method is different
> > for other ?DMs) you can vi /etc/X11/gdm/gdm.conf and in the [servers]
> > section comment out the line that starts an X server on GDM startup.
> > This allows you to have GDM running but not starting local X servers -
> > important if you use a server for serving remote X clients with XDMCP.
> > If it's set up to reply to XDMCP queries, you can get a login with
> > X -query <server-running-gdm>
> >
> > As for preventing the display from turning off, you can usually just
> > "xset dpms 0 0 0". However something like GNOME might override that,
> > and/or anything set in the XF86Config. I've never run a desktop
> > environment on a server (and barely at all on a desktop box, I use a
> > customised IceWM usually) so I can't really help you there. Suggested
> > solution: disable X.
> >
> > >>Does it reply to ICMP?
> > >
> > > Will a 'ping' suffice? I haven't tried this, but can next time it
> > > fails.
> >
> > A ping is what you want, yeah.
> >
> > > One thing I've noticed, however, is that the memory usage is up reasonably
> > > high. At 3PM yesterday, following a reboot at around 9am, there was 293Mb
> > > of 1.5Gb of physical memory in use, with none of the 1.9Gb swap file being
> > > used. Today at 9.30AM there was 1.4Gb of memory in use, with no swap file
> > > being used. Currently (3PM today) there is still 1.4Gb of memory in use.
> > > The top usage programs are nautilus (~14Mb), python (~15Mb),
> > > gnome-settings-daemon (~12.2Mb), gnome-panel (~10Mb), gnome-system-monitor
> > > (~8.5Mb) and httpd (~8Mb) ... the total of these seems to be nowhere near
> > > the 1.4Gb being reported used. This info was collected via System Monitor.
> > > 'free' seems to indicate roughly the same figures.
> >
> > That all sounds reasonable, most of your RAM usage will be cached files
> > from the disk. For example:
> >
> > [craig at bucket craig]$ free -m
> >              total       used       free     shared    buffers     cached
> > Mem:          2015       1808        207          0        304       1300
> > -/+ buffers/cache:        202       1812
> > Swap:         3999          0       3999
> >
> > Here, I have 1.8 gigs of RAM "in use" but in reality, all but 200mb of
> > that is available for programs to use - the kernel is just using it as a
> > monster disk cache until it's needed. I assume that's what you're seeing
> > from your description, but I could easily be wrong - can you post a
> > "free -m"?
> >
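
To put numbers on Craig's point about the disk cache: the "-/+ buffers/cache"
row is just the "Mem:" row with buffers and cached moved from "used" over to
"free". Below is a rough sketch of that arithmetic. It assumes the older
"free" column layout quoted in this thread (total, used, free, shared,
buffers, cached), and the function name is my own invention.

```python
def real_usage(free_output):
    """Return (used_by_programs, available) in the units free printed.

    Assumes the older layout quoted in this thread, where the Mem: row is:
    total used free shared buffers cached.
    """
    for line in free_output.splitlines():
        # Strip any mail quoting ("> > ") before looking for the Mem: row.
        fields = line.lstrip("> ").split()
        if fields and fields[0] == "Mem:":
            total, used, free, shared, buffers, cached = map(int, fields[1:7])
            used_by_programs = used - buffers - cached
            available = free + buffers + cached
            return used_by_programs, available
    raise ValueError("no 'Mem:' row found")


# Figures from the 22nd May output earlier in this thread.
sample = """\
total used free shared buffers cached
Mem: 1511 1496 14 0 62 1298
"""
print(real_usage(sample))  # (136, 1374)
```

That matches the 136 / 1374 that free itself reported on the 22nd, so only
about 136Mb was really held by programs at that point. Craig's own figures
come out at 204 / 1811 this way against the 202 / 1812 printed, presumably
because free rounds each column separately.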