[plug] re: server freezing

Tue Jul 8 18:36:31 WST 2003

I have both a Broadcom and an Intel NIC, both 10/100/1000.  When the lockup happens, neither card responds.
Yes, the disks are in a RAID (hardware RAID 5), 4 x 73 GB, using a ServeRAID-5i controller card (IBM supplied).
The drives are all IBM U320 SCSI drives.
The server is sitting idle for the moment because I have not started importing data.  So in summary, the system freezes up while just running and doing nothing, and neither NIC allows access.
Last night I opened an ssh session from a workstation in the office and left it connected.  The server has not locked up since.  I will leave it for another 24 hrs to see whether it locks up.

Thanks

Jon L. Miller, MCNE, CNS
Director/Sr Systems Consultant
MMT Networks Pty Ltd
http://www.mmtnetworks.com.au

"I don't know the key to success, but the key to failure
 is trying to please everybody." -Bill Cosby

>>> craig at postnewspapers.com.au 1:28:04 AM 8/07/2003 >>>
> IBM x235
> 4 x 73GB SCSI U320 Drives
> 2 GB memory
> 2 x 10/100/1000 NIC
> ServerRAID -5i Raid controller.

I have a dual Xeon running RH8 with a gigabit NIC (and 2x 10/100 NICs) 
that's quite happy, but I'm using nice Intel NICs. I've heard bad things 
about Broadcom - perhaps you might want to see if you can borrow an 
Intel NIC (buy a 10/100 or see if you can get a 10/100/1000 on loan, 
whatever).

I'm also operating with 2 GB of RAM. The disk subsystem is different 
(SATA RAID - was a PITA at first, but now works like a dream) but that 
shouldn't really matter. Are the disks in RAID, and if so what type? If 
they're not in a RAID array, try doing SMART queries on them (it 
shouldn't happen, but sometimes even really top quality drives are DOA 
or close to it).

I've observed a problem similar to that which you describe in the past, 
and it turned out to be caused by the system trying to swap pages back 
in from a swap partition on a dying HDD. I ended up replacing the entire 
(basic PC hardware) machine. A few months later, the new machine started 
doing the same thing - but that time I was getting syslog messages (DMA 
errors etc) that clued me in to the problem. I think the first time the 
bad areas must've been /only/ on swap space or rarely used bits of disk, 
so I didn't get any useful messages. The disk tested "OK" with the 
manufacturer's disk utils, but proved stuffed when installed and thrashed 
with bonnie++ overnight. Anyway, what I'm trying to say, fighting 
against 1:30-am-itis, is "even if they're good disks, test them and make 
sure you're not encountering a defective HDD." Most 
*cough*westerndigital*cough* manufacturers' disk tools don't suck, and 
are capable of querying the drive SMART data (though they don't say as 
much), so that tends to be a good start.

Your server probably has BIOS serial console support, as well as support 
for IPMI. Most Xeon systems do AFAIK. I suggest that you look into these 
and see if you can get more diagnostic information from it.

Also - posting /proc/interrupts and the output of both 'lspci' and 
'lspci -vvv' can be exceedingly useful when reading "it just crashes" 
questions. Perhaps you could post this info?

People: please post detailed hardware info when dealing with potential 
hardware issues such as lockups, crashes, unexplained stalling, and the 
like (PCI device list, interrupts, `uname -a`, loaded modules, storage 
info such as RAID type if any, etc.). Someone always has to ask for it 
anyway.
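
For anyone who wants to gather all of that in one go, here is a rough
sketch of a collector script. It is only an illustration, not a standard
tool: the function names, the section layout and the choice of `lsmod`
for "loaded modules" are my own assumptions, and it simply notes any
command or file that is missing rather than failing.

```python
import shutil
import subprocess
from pathlib import Path

# Commands and files taken from the list above; lsmod is an assumed
# choice for "loaded modules".
COMMANDS = [["uname", "-a"], ["lspci"], ["lspci", "-vvv"], ["lsmod"]]
FILES = ["/proc/interrupts"]


def run_command(argv):
    """Return the combined output of a command, or a note if it cannot run."""
    if shutil.which(argv[0]) is None:
        return f"(command not found: {argv[0]})"
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, errors="replace", timeout=30
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"(failed to run: {exc})"
    return result.stdout + result.stderr


def read_file(path):
    """Return the contents of a file, or a note if it cannot be read."""
    try:
        return Path(path).read_text(errors="replace")
    except OSError as exc:
        return f"(could not read {path}: {exc})"


def build_report(commands=COMMANDS, files=FILES):
    """Build one plain-text report with a labelled section per item."""
    sections = []
    for argv in commands:
        body = run_command(argv).rstrip()
        sections.append(f"==== {' '.join(argv)} ====\n{body}")
    for path in files:
        body = read_file(path).rstrip()
        sections.append(f"==== {path} ====\n{body}")
    return "\n\n".join(sections) + "\n"


if __name__ == "__main__":
    print(build_report(), end="")
```

Run it as root if you want the full `lspci -vvv` detail, redirect the
output to a file, and paste that into your post. RAID details still need
to be added by hand, since how you query them depends on the controller.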

Oh .... if you decide the server is inexplicably FUBAR and it's easier 
to replace than fix, can I have the old one? ;-)

*grin*

Craig Ringer