Fw: [plug] How to diagnose a crashing Linux server?

Tue Jun 3 11:53:24 WST 2003

Greetings Folks,

As a follow-up to this email that I sent just over a week ago - I resorted
to "plan B" which was to schedule a CRON job to reboot every night.

Unfortunately we had the server go down again this weekend, which wasn't too
good for the people in the east who were trying to access the site on
Monday.

Jon Miller asked a couple of questions in an off-list email, so I thought I'd
repeat some of his queries, and my responses, here - to
see if anyone else can suggest what might be causing this:

> JLM> which model and specs?

It's an IBM eServer xSeries 235, Type 8671, 2 x Intel Xeon 2GHz CPUs, 1.5Gb
Memory (not sure how this is configured), 2 x 73.4 Gb 10,000RPM IBM HD
(Model: IC35L073UCDY10-0 - P/N 07N8812), mirrored via hardware RAID. The
only hardware additions to this box have been the Adaptec SCSI card
(AVA-2906) and the Travan tape unit (TapeStor 20Gb, SCSI version).

> 0:00 /usr/libexec/gcon
> 0:00 /usr/libexec/bono
> 0:00 metacity --sm-sav
>
> JLM> metacity --sm-sav = small window manager using gtk2
> The other two I'm not sure about, but you need to issue ps au to see the owner.

Is anyone familiar with these two processes "gcon" and "bono"?

> JLM> Ouch, I never would use a Travan tape unit in a server.  If you can,
> try using a DDS.

Are there specific problems/known issues with Travan units? Are there
advantages to using a DDS based drive?

> JLM> I do not see anything out of the norm, except those 3 processes you
> sent.  If possible try to remove them to see if they have any effect on
> anything.  My question would be: is this server booting to a GUI or a CLI
> interface?

It generally boots into a GUI, but another PLUG member recommended
running it in CLI mode - the crash still occurred in both
cases.

> As for hardware: if the backup is scheduled for a certain
> time and the server crashes during it, soon after, or even later in the day,
> the question has to be asked whether memory is being depleted.  The only way
> to see this is to get a top readout after the server reboots, and before and
> after the backup.

Hmmm, yes, this was something that crossed my mind too - the backups run at
11.55pm, I scheduled a CRON job to reboot the system at 4am, backups
generally run for around 10 minutes (backing up config files only, to a
local tar file), or around an hour backing up data directories (creates a
tar file on disk, writes tar file to tape).
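
One way to capture the before/after picture Jon describes, without needing
someone at the console, might be to have cron append a memory snapshot to a
log every few minutes, so the last entries before a crash survive the reboot.
A minimal sketch, assuming Python is available on the box and that
/proc/meminfo carries MemTotal, MemFree, Buffers and Cached lines in kB; the
function names and the log path are made up for illustration:

```python
import time

def parse_meminfo(text):
    """Return {field: integer} for every 'Name: <number> ...' line."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Skip header lines whose first value is not a plain number.
        if sep and fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info

def snapshot_line(info, now=None):
    """Format one log line; 'programs' excludes buffers and page cache."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    total_mb = info["MemTotal"] // 1024
    free_mb = info["MemFree"] // 1024
    cache_mb = (info["Buffers"] + info["Cached"]) // 1024
    programs_mb = total_mb - free_mb - cache_mb
    return "%s total=%dMB free=%dMB cache=%dMB programs=%dMB" % (
        stamp, total_mb, free_mb, cache_mb, programs_mb)

def log_snapshot(meminfo_path="/proc/meminfo",
                 log_path="/var/log/memwatch.log"):  # hypothetical log path
    """Append a single snapshot line to the log file."""
    with open(meminfo_path) as f:
        info = parse_meminfo(f.read())
    with open(log_path, "a") as log:
        log.write(snapshot_line(info) + "\n")
```

A small script that calls log_snapshot() could be run from cron shortly
before the 11.55pm backup and at intervals afterwards; if the "programs"
figure climbs steadily while "cache" shrinks, that would point towards memory
actually being depleted rather than just used as disk cache.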

So - anyone have any ideas? Plan C is a new box in its place .... see
whether the same situation occurs; then I can isolate it to hardware or
software.

Thanks again

Richard

----- Original Message ----- 
From: "Richard Mortimer" <linux at netfire.com.au>
To: <plug at plug.linux.org.au>
Sent: Monday, May 26, 2003 10:55 AM
Subject: Re: [plug] How to diagnose a crashing Linux server?

> Hi all,
>
> Hope you had a good weekend. Our server went down again this morning, so in
> reply to the questions:
>
> > Aha, you didn't mention you were running X anywhere. Can you run the
> > server w/o X for normal operations, and just fire up X when needed?
>
> X was off at the time of the most recent crash - set in init level 3, the
> terminal was left on, but apparently the only messages to the system console
> were about samba retries (I didn't see it personally), at boot up time there
> was mention of a recovery journal - but doing a grep on the for "recovery",
> "core" or "dump" produced nothing. It's as though the machine just suddenly
> stops.
>
> BTW: not sure whether this is important or not, but X has been installed
> since I first configured the machine, nothing has changed in this
> environment and it has been running for the last couple of months with no
> drama.
>
> > What's the motherboard chipset and video card? X can cause all sorts of
>
> CPU: Intel Xeon 2GHz x 2
> Graphics: ATI Mach64, 8Mb
>
> > >>Does it reply to ICMP?
> > >
> > > Will a 'ping' suffice? I haven't tried this, but can next time it fails.
> >
> > A ping is what you want, yeah.
>
> No, it doesn't reply to a ping.
>
> > from your description, but I could easily be wrong - can you post a
> > "free -m"?
>
> from the 22nd May:
>
>              total       used       free     shared    buffers     cached
> Mem:          1511       1496         14          0         62       1298
> -/+ buffers/cache:        136       1374
> Swap:         1992          0       1992
>
> from today (post reboot):
>
>              total       used       free     shared    buffers     cached
> Mem:          1511        161       1350          0         38         47
> -/+ buffers/cache:         75       1436
> Swap:         1992          0       1992
>
> >I'm just wondering if you have anything infected with UNIX/RST.[AB] or
> [snip]
> >Are there any 'defunct' processes listed when you 'ps waux' -
>
> No, it doesn't appear so:
>
> USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
> root         1  0.2  0.0  1388  476 ?        S    09:23   0:03 init
> root         2  0.0  0.0     0    0 ?        SW   09:23   0:00 [migration_CPU0]
> root         3  0.0  0.0     0    0 ?        SW   09:23   0:00 [migration_CPU1]
> root         4  0.0  0.0     0    0 ?        SW   09:23   0:00 [migration_CPU2]
> root         5  0.0  0.0     0    0 ?        SW   09:23   0:00 [migration_CPU3]
> root         6  0.0  0.0     0    0 ?        SW   09:23   0:00 [keventd]
> root         7  0.0  0.0     0    0 ?        SWN  09:23   0:00 [ksoftirqd_CPU0]
> root         8  0.0  0.0     0    0 ?        SWN  09:23   0:00 [ksoftirqd_CPU1]
> root         9  0.0  0.0     0    0 ?        SWN  09:23   0:00 [ksoftirqd_CPU2]
> root        10  0.0  0.0     0    0 ?        SWN  09:23   0:00 [ksoftirqd_CPU3]
> root        11  0.0  0.0     0    0 ?        SW   09:23   0:00 [kswapd]
> root        12  0.0  0.0     0    0 ?        SW   09:23   0:00 [bdflush]
> root        13  0.0  0.0     0    0 ?        SW   09:23   0:00 [kupdated]
> root        14  0.0  0.0     0    0 ?        SW   09:23   0:00 [mdrecoveryd]
> root        20  0.0  0.0     0    0 ?        SW   09:23   0:00 [scsi_eh_0]
> root        25  0.0  0.0     0    0 ?        SW   09:23   0:00 [kjournald]
> root        81  0.0  0.0     0    0 ?        SW   09:23   0:00 [khubd]
> root       175  0.0  0.0     0    0 ?        SW   09:24   0:00 [kjournald]
> root       176  0.0  0.0     0    0 ?        SW   09:24   0:00 [kjournald]
> root       177  0.0  0.0     0    0 ?        SW   09:24   0:00 [kjournald]
> root       178  0.0  0.0     0    0 ?        SW   09:24   0:00 [kjournald]
> root       505  0.0  0.0  1460  580 ?        S    09:24   0:00 syslogd -m 0
> root       509  0.0  0.0  1396  452 ?        S    09:24   0:00 klogd -x
> rpc        526  0.0  0.0  1552  560 ?        S    09:24   0:00 portmap
> rpcuser    545  0.0  0.0  1596  752 ?        S    09:24   0:00 rpc.statd
> root       619  0.0  0.0     0    0 ?        SW   09:24   0:00 [rpciod]
> root       620  0.0  0.0     0    0 ?        SW   09:24   0:00 [lockd]
> root       668  0.0  0.0  3344 1464 ?        S    09:24   0:00 /usr/sbin/sshd
> root       683  0.0  0.0  2064  892 ?        S    09:24   0:00 xinetd -stayalive -pidfile /var/run/xinetd.pid
> lp         693  0.0  0.0  4772 1184 ?        S    09:24   0:00 lpd Waiting
> root       714  0.0  0.1  5596 2452 ?        S    09:24   0:00 sendmail: accepting connections
> smmsp      723  0.0  0.1  4944 2084 ?        S    09:24   0:00 sendmail: Queue runner at 01:00:00 for /var/spool/clientmqueue
> root       733  0.0  0.0  1428  444 ?        S    09:24   0:00 gpm -t ps/2 -m /dev/mouse
> root       742  0.0  0.0  1444  592 ?        S    09:24   0:00 crond
> xfs        771  0.0  0.2  4484 3140 ?        S    09:24   0:00 xfs -droppriv -daemon
> root       780  0.0  0.0  1412  620 ?        SN   09:24   0:00 anacron -s
> daemon     789  0.0  0.0  1432  552 ?        S    09:24   0:00 /usr/sbin/atd
> root       800  0.0  0.0  3424  556 ?        S    09:24   0:00 rhnsd --interval 120
> root       806  0.0  0.0  2332 1060 ?        S    09:24   0:00 login -- root
> root       807  0.0  0.0  1372  416 tty2     S    09:24   0:00 /sbin/mingetty tty2
> root       808  0.0  0.0  1376  420 tty3     S    09:24   0:00 /sbin/mingetty tty3
> root       809  0.0  0.0  1376  420 tty4     S    09:24   0:00 /sbin/mingetty tty4
> root       810  0.0  0.0  1376  420 tty5     S    09:24   0:00 /sbin/mingetty tty5
> root       811  0.0  0.0  1376  420 tty6     S    09:24   0:00 /sbin/mingetty tty6
> root       822  0.0  0.0  4396 1452 tty1     S    09:33   0:00 -bash
> root       870  0.0  0.5 17760 8012 ?        S    09:34   0:00 httpd -k start
> apache     871  0.0  0.5 17892 8228 ?        S    09:34   0:00 httpd -k start
> apache     872  0.0  0.5 17892 8232 ?        S    09:34   0:00 httpd -k start
> apache     873  0.0  0.5 17884 8224 ?        S    09:34   0:00 httpd -k start
> apache     874  0.0  0.5 17892 8236 ?        S    09:34   0:00 httpd -k start
> apache     875  0.0  0.5 17884 8220 ?        S    09:34   0:00 httpd -k start
> apache     876  0.0  0.5 17892 8232 ?        S    09:34   0:00 httpd -k start
> apache     877  0.0  0.5 17892 8228 ?        S    09:34   0:00 httpd -k start
> apache     878  0.0  0.5 17884 8216 ?        S    09:34   0:00 httpd -k start
> root       897  0.0  0.1  5000 1796 ?        S    09:34   0:00 smbd
> root       900  0.0  0.1  3844 1596 ?        S    09:34   0:00 nmbd
> root       908  0.0  0.1  3884 1692 ?        S    09:36   0:00 /sbin/mount.smbfs //phil/BackMeUp /mnt/phil -o rw username Administra
> root       927  0.0  0.1  4032 1888 ?        S    09:42   0:00 /sbin/mount.smbfs //W2KServer/BackMeUp /mnt/W2KServer -o rw username
> root       930  0.0  0.1  5484 2432 ?        S    09:43   0:00 smbd
> root       948  0.0  0.0  2652  704 tty1     R    09:49   0:00 ps waux
>
> So is there anything else I can check? Or should I proceed with 'plan B',
> which is to schedule a cron job to reboot the server every night?
>
> Thanks again folks
>
> Richard
>
>
> ----- Original Message ----- 
> From: "Craig Ringer" <craig at postnewspapers.com.au>
> To: <plug at plug.linux.org.au>
> Sent: Wednesday, May 21, 2003 3:28 PM
> Subject: Re: [plug] How to diagnose a crashing Linux server?
>
>
> > > The odd thing about my lastlog being 18Mb in size is that there are three
> > > valid users on the system, two of whom rarely log on, and the machine was
> > > set up a couple of months ago ~feb/mar.
> > >
> > > I followed the other thread on this topic, and followed a similar tack,
> > > which recreated the lastlog as a 143kb file.
> >
> > Interesting.
> >
> > >>setterm -blank 0
> > >
> > > Thanks - however my screen keeps turning off, I've turned 'off' energy
> > > saving in both the screen itself, and RH(Gnome) -> Preferences ->
> > > Screensaver | Advanced and also applied the CLI parameters as supplied
> > > above. Trying the 'failsafe' login mode to see if that stops the screen
> > > turn off.
> >
> > Aha, you didn't mention you were running X anywhere. Can you run the
> > server w/o X for normal operations, and just fire up X when needed?
> > What's the motherboard chipset and video card? X can cause all sorts of
> > stability issues occasionally, and it's always a good idea to disable X
> > logins for servers unless you have a good reason not to.
> >
> > How to disable graphical logins depends on distro. In Red Hat you can vi
> > /etc/inittab and change
> > id:5:initdefault:
> > to
> > id:3:initdefault:
> > but first make sure that all your other services will start as expected
> > in runlevel 3. Unless you've customised your starting services heavily
> > and only in runlevel 5, it should be the same. Make sure with
> > "chkconfig" and by looking at the /etc/rc3.d and /etc/rc5.d directories.
> >
> > For debian, you can always simply remove /etc/rc2.d/S99gdm or mv
> > /etc/rc2.d/{S,K}99gdm if you like (and assuming you're using gdm).
> >
> > You can always fire up X manually by logging in on the console and
> > running "startx". Alternately, if you're using GDM (method is different
> > for other ?DMs) you can vi /etc/X11/gdm/gdm.conf and in the [servers]
> > section comment out the line that starts an X server on GDM startup.
> > This allows you to have GDM running but not starting local X servers -
> > important if you use a server for serving remote X clients with XDMCP.
> > If it's set up to reply to XDMCP queries, you can get a login with
> > X -query <server-running-gdm>
> >
> > As for preventing the display from turning off, you can usually just
> > "xset dpms 0 0 0". However something like GNOME might override that,
> > and/or anything set in the XF86Config. I've never run a desktop
> > environment on a server (and barely at all on a desktop box, I use a
> > customised IceWM usually) so I can't really help you there. Suggested
> > solution: disable X.
> >
> > >>Does it reply to ICMP?
> > >
> > > Will a 'ping' suffice? I haven't tried this, but can next time it fails.
> >
> > A ping is what you want, yeah.
> >
> > > One thing I've noticed, however is the memory usage is up reasonably high,
> > > at 3PM yesterday, following a reboot at around 9am, there was 293Mb of 1.5Gb
> > > of physical memory in use, with none of the 1.9Gb swap file being used.
> > > Today at 9.30AM there was 1.4Gb of memory in use, with no swap file being
> > > used. Currently (3PM today) there is still 1.4Gb of memory in use, the top
> > > usage programs are nautilus (~14Mb), python (~15Mb), gnome-settings-daemon
> > > (~12.2Mb), gnome-panel (~10Mb), gnome-system-monitor (~8.5Mb) and httpd
> > > (~8Mb) ... the total of these seems to be nowhere near the 1.4Gb being
> > > reported used. This info was collected via System Monitor. 'free' seems to
> > > indicate roughly the same figures.
> >
> > That all sounds reasonable, most of your RAM usage will be cached files
> > from the disk. For example:
> >
> > [craig at bucket craig]$ free -m
> >             total       used       free     shared    buffers     cached
> > Mem:       2015       1808        207          0        304       1300
> > -/+ buffers/cache:     202       1812
> > Swap:      3999          0       3999
> >
> > Here, I have 1.8 gigs of RAM "in use" but in reality, all but 200mb of
> > that is available for programs to use - the kernel is just using it as a
> > monster disk cache until it's needed. I assume that's what you're seeing
> > from your description, but I could easily be wrong - can you post a
> > "free -m"?
> >
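
Craig's point about the disk cache can be checked against the "free -m"
figures from the 22nd May quoted earlier in this thread: the
"-/+ buffers/cache" line appears to be simple arithmetic on the "Mem:" line.
A rough sketch in Python, assuming the old six-column "free" layout shown in
this thread (total, used, free, shared, buffers, cached); the function name is
made up for illustration:

```python
def buffers_cache_line(mem_line):
    """Given the 'Mem:' line of old-style `free -m` output, return
    (used_by_programs, available_to_programs) in MB."""
    fields = mem_line.split()
    total, used, free, shared, buffers, cached = [int(x) for x in fields[1:7]]
    # Memory held as buffers/page cache is given back when programs need it,
    # so it is subtracted from "used" and added to "free".
    used_by_programs = used - buffers - cached
    available = free + buffers + cached
    return used_by_programs, available

# Figures from the 22nd May output in this thread.
print(buffers_cache_line(
    "Mem:          1511       1496         14          0         62       1298"))
# prints (136, 1374)
```

That matches the 136 / 1374 that "free" itself reported, so only about 136Mb
was really in use by programs that day. The other outputs in the thread can
differ from this arithmetic by a megabyte or two, presumably because each
column is rounded to whole megabytes separately.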
