[plug] re: server freezing
Jon Miller
jlmiller at mmtnetworks.com.au
Tue Jul 8 18:36:31 WST 2003
I have both the Broadcom and an Intel NIC, both 10/100/1000. When the lockup happens, neither card responds.
Yes, the disks are in a RAID (hardware RAID 5), 4 x 73 GB, using a ServeRAID-5i controller card (IBM supplied).
The drives are all IBM U320 SCSI Drives.
The only thing the server is doing at the moment is sitting idle, because I have not started importing data. So in summary, the system freezes up while just running and doing nothing. Neither NIC allows access.
What I did last night was open an ssh session from a workstation in the office and leave it connected. The server has not locked up since. I will leave it for another 24 hrs to see if a lockup occurs.
Thanks
Jon L. Miller, MCNE, CNS
Director/Sr Systems Consultant
MMT Networks Pty Ltd
http://www.mmtnetworks.com.au
"I don't know the key to success, but the key to failure
is trying to please everybody." -Bill Cosby
>>> craig at postnewspapers.com.au 1:28:04 AM 8/07/2003 >>>
> IBM x235
> 4 x 73GB SCSI U320 Drives
> 2 GB memory
> 2 x 10/100/1000 NIC
> ServerRAID -5i Raid controller.
I have a dual Xeon running RH8 with a gigabit NIC (and 2x 10/100 NICs)
that's quite happy, but I'm using nice Intel NICs. I've heard bad things
about broadcom - perhaps you might want to see if you can borrow an
Intel NIC (buy a 10/100 or see if you can get a 10/100/1000 on loan,
whatever).
I'm also operating with 2 GB of RAM. The disk subsystem is different
(SATA RAID - was a PITA at first, but now works like a dream) but that
shouldn't really matter. Are the disks in RAID, and if so what type? If
they're not in a RAID array, try doing SMART queries on them (it
shouldn't happen, but sometimes even really top quality drives are DOA
or close to it).
I've observed a problem similar to that which you describe in the past,
and it turned out to be caused by the system trying to swap pages back
in from a swap partition on a dying HDD. I ended up replacing the entire
(basic PC hardware) machine. A few months later, the new machine started
doing the same thing - but that time I was getting syslog messages (DMA
errors etc) that clued me in to the problem. I think the first time the
bad areas must've been /only/ on swap space or rarely used bits of disk,
so I didn't get any useful messages. The disk tested "OK" with the
manufacturer's disk utils, but proved stuffed when installed and thrashed
with bonnie++ overnight. Anyway, what I'm trying to say, fighting
against 1:30-am-itis, is "even if they're good disks, test them and make
sure you're not encountering a defective HDD." Most
*cough*westerndigital*cough* manufacturers' disk tools don't suck, and
are capable of querying the drive SMART data (though they don't say as
much), so that tends to be a good start.
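For anyone who would rather script the SMART check than boot a vendor tool, here is a rough sketch. It assumes the smartmontools `smartctl` utility is installed (nobody in this thread mentioned it, so treat that as my assumption) and that the drive is visible as a plain device such as `/dev/sda`, which is only a placeholder. Drives sitting behind a hardware RAID controller are usually not reachable this way. The bit meanings are taken from the smartctl man page's exit-status section.

```python
import shutil
import subprocess
import sys

# Exit-status bits as documented in the smartctl man page.
SMARTCTL_STATUS_BITS = {
    0: "command line did not parse",
    1: "device open failed",
    2: "a SMART or other command to the disk failed",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes at or below threshold",
    5: "attributes were at or below threshold in the past",
    6: "device error log contains errors",
    7: "device self-test log contains errors",
}


def decode_smartctl_status(code):
    """Turn a smartctl exit status into a list of human-readable problems."""
    return [msg for bit, msg in SMARTCTL_STATUS_BITS.items() if code & (1 << bit)]


def check_health(device):
    """Run `smartctl -H` on a device; return a problem list, or None if unusable."""
    if shutil.which("smartctl") is None:
        return None  # smartmontools not installed
    try:
        result = subprocess.run(
            ["smartctl", "-H", device],
            capture_output=True, text=True, errors="replace", timeout=60,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode < 0:
        return None  # killed by a signal, so there is no status to decode
    return decode_smartctl_status(result.returncode)


if __name__ == "__main__":
    # /dev/sda is only a placeholder; pass the real device as an argument.
    device = sys.argv[1] if len(sys.argv) > 1 else "/dev/sda"
    problems = check_health(device)
    if problems is None:
        print("could not run smartctl on", device)
    elif problems:
        print(device, "reported:", "; ".join(problems))
    else:
        print(device, "reported no SMART problems")
```

A clean result here does not prove the disk is healthy, which is the point of the story above: the overnight thrashing found what the quick check missed.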
Your server probably has BIOS serial console support, as well as support
for IPMI. Most Xeon systems do AFAIK. I suggest that you look into these
and see if you can get more diagnostic information from it.
Also - posting /proc/interrupts and the output of both 'lspci' and
'lspci -vvv' can be exceedingly useful when reading "it just crashes"
questions. Perhaps you could post this info?
People: please post detailed hardware info when dealing with potential
hardware issues such as lockups, crashes, unexplained stalling, and the
like. Think PCI devices list, interrupts, `uname -a`, loaded modules,
storage info (e.g. RAID type if any), etc. Someone always has to ask for
it anyway.
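To make that less of a chore, something like the following could gather the lot into one block of text ready to paste into a post. It is only a sketch: it assumes a Linux box with Python 3.7 or later, uses `lsmod` for the loaded modules (my choice, not named above), and skips any command that is not installed rather than failing.

```python
import shutil
import subprocess
from pathlib import Path

# Commands and files asked for in the post; lsmod is an assumed choice
# for "loaded modules".
COMMANDS = [
    ["uname", "-a"],
    ["lspci"],
    ["lspci", "-vvv"],
    ["lsmod"],
]
FILES = ["/proc/interrupts"]


def run(cmd):
    """Return a command's output, or a short note if it is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return f"({cmd[0]} not found)\n"
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, errors="replace", timeout=30
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"({' '.join(cmd)} failed: {exc})\n"
    return result.stdout + result.stderr


def read_file(path):
    """Return a file's contents, or a short note if it cannot be read."""
    try:
        return Path(path).read_text(errors="replace")
    except OSError as exc:
        return f"(could not read {path}: {exc})\n"


def collect_report():
    """Build one text report with a labelled section per command or file."""
    sections = []
    for cmd in COMMANDS:
        sections.append(f"=== {' '.join(cmd)} ===\n{run(cmd)}")
    for path in FILES:
        sections.append(f"=== {path} ===\n{read_file(path)}")
    return "\n".join(sections)


if __name__ == "__main__":
    print(collect_report())
```

Storage details such as the RAID level still have to be written in by hand, since they depend on the controller.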
Oh .... if you decide the server is inexplicably FUBAR and it's easier
to replace than fix, can I have the old one? ;-)
*grin*
Craig Ringer