[plug] Compaq SMP problem (PLUG list repost with updates)

Raven ian.kent at pobox.com
Mon Jun 12 21:17:18 WST 2000


Hi all,

Here I am with a story about a 'sick' Compaq.

The hardware:

    Compaq SP750 with dual 733MHz Pentium III Xeon processors
    1GB of Rambus memory
    Adaptec 7899 SCSI controller
    Matrox 16MB G400 dual head card
    Intel EtherPro100 (I believe, I will confirm the driver)
    The machine has a full-duplex link to a switch (a Cabletron
    SSR8000) with Gigabit connectivity to our Solaris servers.

The problem:

The problem is with network performance. After some period of time,
network performance drops off to almost nothing. FTP transfers that crank
through at 8-10 MBytes/sec when the machine is 'fresh' drop off to
sub-modem speeds, i.e. < 2 KBytes/sec, when it gets 'sick'.

The drop-off can happen after a few hours of operation, or it can happen
after a week. There are no other major symptoms; everything other than
network-related operations seems to perform OK. The only common factor is
that the system has allocated most or all of its memory for some purpose
(not unusual for a Unix system).
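
To tell memory that is merely holding buffers and cache from memory that
is really gone, the /proc/meminfo dumps could be compared with something
like the sketch below. It is only a rough idea: it assumes the dump
contains "Name: value kB" lines (MemTotal, MemFree, Buffers, Cached),
and the function names are made up for illustration.

```python
import re


def parse_meminfo(text):
    """Collect the 'Name:   1234 kB' lines of a /proc/meminfo dump.

    Lines in any other layout (such as a summary table at the top of
    the file) are skipped rather than guessed at.
    """
    info = {}
    for line in text.splitlines():
        m = re.match(r"^(\w+):\s+(\d+)\s+kB\s*$", line)
        if m:
            info[m.group(1)] = int(m.group(2))
    return info


def reclaimable_fraction(info):
    """Fraction of RAM that is free or held only as buffers/cache.

    Assumption: buffers and cache can be given back under pressure,
    so a low value here is more interesting than a low MemFree alone.
    """
    reclaimable = (info["MemFree"]
                   + info.get("Buffers", 0)
                   + info.get("Cached", 0))
    return reclaimable / info["MemTotal"]
```

Running this over a 'fresh' dump and a 'sick' dump and comparing the two
fractions would show whether the sick state really has less memory
available, or just a fuller cache.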

The story so far:

I have had a look at the messages output and the machine seems to
recognize everything OK and there doesn't seem to be anything that looks
relevant to when the machine gets sick. The kernel .config checks out
for an SMP kernel (according to the SMP FAQ).

Kernels that have shown the problem so far are 2.2.14, 2.3.99-pre6 and
2.2.15. They are compiled with the fewest options needed to support required
system functionality. Kernels 2.2.16 and 2.4.0-test1 have not been tried
yet.

The kernel currently used is 2.2.15. The most recent build of this
kernel performed OK for about 5-6 days and then required a shutdown for
building mains power maintenance. It is now running again in SMP mode
and has been OK for 6 days so far, but the machine will be going down
for hard disk maintenance tomorrow.

One time when the machine got sick, the interface was downed, the network
card module was unloaded and reloaded, and the interface was brought back
up. This had no effect; the machine still ran slowly until the next reboot.

A dump of the /proc tree was taken when the system was operating
normally and on a couple of occasions when the system was running
slowly. As far as we can tell, there is nothing there that indicates
the source of the problem.
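
For anyone wanting to look at the dumps the same way, here is a rough
sketch of how two saved copies of /proc/net/dev could be compared. It
assumes the two-line header layout where the second line names the
receive and transmit columns; the function names are made up for
illustration and this is not what we actually ran.

```python
def parse_net_dev(text):
    """Parse a saved /proc/net/dev dump into {iface: {counter: value}}."""
    lines = text.strip().splitlines()
    # The second header line looks like " face |rx columns|tx columns".
    _, rx_hdr, tx_hdr = lines[1].split("|")
    names = (["rx_" + n for n in rx_hdr.split()]
             + ["tx_" + n for n in tx_hdr.split()])
    stats = {}
    for line in lines[2:]:
        # The interface name ends at the first colon; the counters may
        # follow it with or without a space.
        iface, _, rest = line.partition(":")
        values = [int(v) for v in rest.split()]
        stats[iface.strip()] = dict(zip(names, values))
    return stats


def counter_deltas(before, after, iface):
    """Change in every counter for one interface between two dumps.

    A negative delta means the counter wrapped or was reset (for
    example by a module reload) and should not be trusted.
    """
    return {k: after[iface][k] - before[iface][k] for k in before[iface]}


def suspicious(deltas):
    """Keep only the error-type counters that increased."""
    keys = ("errs", "drop", "fifo", "frame", "colls", "carrier")
    return {k: v for k, v in deltas.items()
            if v > 0 and k.split("_", 1)[1] in keys}
```

Feeding it a 'fresh' dump and a 'sick' dump and printing
suspicious(counter_deltas(fresh, sick, "eth0")) would show at a glance
whether frame errors, drops or collisions climbed between the two. The
interface name "eth0" is an assumption here.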

Today I checked the network card settings and found that the Linux
machine is forced to 100 Mbps full-duplex operation. I have had the
switch changed to the same setting. Could this cause such a problem?

The next steps:

After reading the linux-kernel list FAQ I have replaced egcs-1.1.2 with
gcc-2.7.2.3, downloaded 2.2.16 and compiled a new SMP kernel. This
kernel will be used after the reboot tomorrow. We also plan to boot with
the 'nosmp' option to see if that makes any difference.

We will also try a different network card (can't say exactly when).

In the meantime, can anyone suggest what might be causing this problem,
or suggest any other things to try, please?

--
   ,-._|\    Ian Kent
  /      \   Perth, Western Australia
  *_.--._/   E-mail: ian.kent at pobox.com, raven at plug.linux.org.au
        v    Web: http://pobox.com/~ian.kent
