[plug] Compaq SMP problem

Raven ian.kent at pobox.com
Thu Jun 8 23:04:39 WST 2000


Hi all,

Here I am with the story about our 'sick' Compaq.

The hardware:

    Compaq SP750 with dual 733MHz Pentium III Xeon processors
    1GB of Rambus memory
    Adaptec 7899 SCSI controller
    Matrox 16MB G400 dual head card
    Intel EtherPro100 (I believe, I will confirm the driver)
    The machine has a full duplex link to a switch with
    Gigabit connectivity to our Solaris servers.

The problem:

The problem is with network performance. After some period of time,
network performance drops off to almost nothing. FTP transfers that crank
through at 8-10 MByte/sec when the machine is 'fresh' drop off to
sub-modem speeds, i.e. < 2 KByte/sec, when it gets 'sick'.

The drop-off can happen after a few hours of operation, or it can happen
after a week. There are no other major symptoms; everything other than
network-related operations seems to perform OK. The only common factor
seen to date is that the system has allocated most or all of its memory
for some purpose (not unusual for a Unix system).
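
To see what that memory is actually being used for, the "Name: value kB"
lines of /proc/meminfo can be broken down. The following is only a minimal
Python sketch: the function names and the sample figures are made up for
illustration, and it assumes the kernel in use prints fields called
MemTotal, Buffers and Cached in that form.

```python
def parse_meminfo(text):
    """Return {field: kilobytes} from /proc/meminfo-style 'Name: 123 kB' lines."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only "Name: <number> kB" lines. Older kernels also print a
        # summary table whose rows do not match this shape, so they are skipped.
        if sep and len(parts) == 2 and parts[0].isdigit() and parts[1] == "kB":
            fields[name.strip()] = int(parts[0])
    return fields


def cache_fraction(fields):
    """Fraction of MemTotal currently held as buffers plus page cache."""
    return (fields.get("Buffers", 0) + fields.get("Cached", 0)) / fields["MemTotal"]


# Hypothetical sample, not taken from the machine described above.
SAMPLE_MEMINFO = """\
MemTotal:  1048576 kB
MemFree:     10240 kB
Buffers:    102400 kB
Cached:     786432 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE_MEMINFO)
    print("buffers + cache as a fraction of RAM: %.2f" % cache_fraction(info))
```

On the real machine the text would come from open("/proc/meminfo").read().
A high fraction would suggest the memory is mostly buffers and cache rather
than something leaking, but that is an interpretation, not a diagnosis.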


The story so far:

I have had a quick look at the dmesg output and the machine seems to
recognise everything OK. The kernel .config checks out for an SMP kernel
(according to the SMP FAQ, brief check).

Kernels that have shown the problem so far are 2.2.14, 2.3.99-pre6 and
2.2.15. They are compiled with the fewest options needed to support the
required system functionality. Kernels 2.2.16 and 2.4.0-test1 have not
been tried yet as they are not yet stable.

The kernel currently used is 2.2.15. The most recent build of this
kernel performed OK for about 5-6 days and then had to be shut down for
maintenance on the building's mains power. This kernel will be used again
in SMP mode until the problem occurs. After that, the same kernel config
will be booted with the 'nosmp' option.

On one occasion when the machine got sick, the interface was downed, the
network card module was unloaded and reloaded, and the interface was
brought back up. This had no effect; the machine still ran slowly until
the next reboot.

A dump of the /proc tree was taken when the system was operating
normally and on a couple of occasions when the system was running
slowly. There is no information there, that we are aware of, that might
indicate the source of the problem.
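
One way to compare such dumps more systematically is to diff the
per-interface counters between a healthy and a sick snapshot. The following
is a rough Python sketch only. It assumes the dumps include /proc/net/dev
and that each interface line has the form "name: number number ...". The
column layout differs between kernel versions, so the columns are left
unnamed, and the function names and sample numbers are invented for
illustration.

```python
def parse_net_dev(text):
    """Return {interface: [counter, ...]} from /proc/net/dev-style text."""
    counters = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        values = rest.split()
        # Header lines have no "name:" prefix followed by numbers, so skip them.
        if sep and values and all(v.isdigit() for v in values):
            counters[name.strip()] = [int(v) for v in values]
    return counters


def counter_deltas(healthy_text, sick_text, iface):
    """Per-column change between two snapshots for one interface."""
    before = parse_net_dev(healthy_text)[iface]
    after = parse_net_dev(sick_text)[iface]
    return [b - a for a, b in zip(before, after)]


# Hypothetical snapshots with only four columns, for illustration.
HEALTHY = """\
Inter-|   Receive
 face |bytes packets errs drop
  eth0: 1000 10 0 0
"""
SICK = """\
Inter-|   Receive
 face |bytes packets errs drop
  eth0:5000 50 3 2
"""

if __name__ == "__main__":
    print(counter_deltas(HEALTHY, SICK, "eth0"))
```

Error or drop columns that grow between the two snapshots would be worth a
closer look. Flat counters would point away from the card and its driver.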

There doesn't seem to be anything in the messages file that looks
relevant when the machine gets sick.

A different network card will also be tried (can't say exactly when).

Ian K



