[plug] hot freeze - not a contradiction in terms

Wed Dec 13 11:47:48 WST 2006

On Wed, 2006-12-13 at 11:19 +0900, Denis Brown wrote:
> At 08:34 PM 12/12/2006, Gavin Chester wrote:
> >With this hot spell I have seen my workstation emulating a windoze PC in
> >the sense that twice today without warning it has lost all keyboard
> >input and apparent disc I/O so that I could not even open a VT or do
> >anything except a full reboot.
> 
> Some random thoughts...
> RAIDed drives?   If not then drive(s) could be a possibility however, if 
> RAID, then it should fail gracefully or leave messages behind I would think 
> in something like /var/log/messages.

They are running in a 2-drive LVM setup off an LSI Logic controller card.
> 
> Although, if the problem is drive-related then drives per se have a fairly 
> high thermal mass.   When they get hot, they "stay hot" and a subsequent 
> reboot - if that is what it takes - should see the system crash again very 
> shortly thereafter, assuming you let it "cool down" for say five minutes.

Maybe that figures :-/  The second freeze came after a couple of hours'
use, when I had rebooted straight after the first freeze.  This time it
has been running solid for 14 hrs after I first rested it for a couple
of hours, though we haven't yet reached the hottest part of the day.

> If it was me, I would be thinking more in terms of memory or other mobo 
> related components since they would cool down faster and give a longer 
> period of operation before overheating again.   Sort of supporting that is 
> the thought that, if a drive failed and loaded garbage in place of a 
> required application, module, etc then only that application, module or 
> whatever should be affected - the remainder of the processes should just 
> keep marching along.   Ergo you should still have shell access, etc.
> 
> Of course I can think of heaps of things that would contravene that, swap 
> being one.   Load something dodgy out of swap and all bets are off.
> 
> If it is mobo-related then that might explain why the system has no time to 
> record in logs where it hurts before it dies.
> 
> Temperature / voltage monitoring of the mobo may be a profitable avenue to 
> pursue, especially if the logging can be done via a serial port to a dumb 
> terminal - you may be able to see some trends leading up to failure?
> 
> HTH,
> Denis
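
For the serial-port logging idea, here is a rough sketch of what I have in
mind, not a tested setup. It assumes the kernel exposes the sensors as
temp*_input files directly under /sys/class/hwmon/hwmonN (i.e. the right
lm-sensors driver is loaded), that the dumb terminal hangs off /dev/ttyS0,
and that the port speed has already been set (e.g. with stty). The function
names read_temps, format_line and log_forever are just made up for this
example.

```python
import glob
import os
import time


def read_temps(root="/sys/class/hwmon"):
    """Return {sensor_path: degrees_C} for each temp*_input file under root."""
    temps = {}
    for path in sorted(glob.glob(os.path.join(root, "hwmon*", "temp*_input"))):
        try:
            with open(path) as f:
                # The hwmon sysfs files report millidegrees Celsius.
                temps[path] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # sensor unreadable right now; skip it
    return temps


def format_line(timestamp, temps):
    """Build one log line: a timestamp followed by chip/sensor=value pairs."""
    parts = []
    for path, value in sorted(temps.items()):
        chip = os.path.basename(os.path.dirname(path))
        parts.append("%s/%s=%.1fC" % (chip, os.path.basename(path), value))
    # CR+LF so each reading starts on a fresh line of a dumb terminal.
    return "%s %s\r\n" % (timestamp, " ".join(parts) if parts else "no-sensors")


def log_forever(out_path="/dev/ttyS0", interval=5.0):
    """Write one reading every `interval` seconds until the box dies."""
    with open(out_path, "w") as out:
        while True:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            out.write(format_line(stamp, read_temps()))
            out.flush()  # push it out now; there may be no later
            time.sleep(interval)
```

Calling log_forever() as root (or as a user with write access to the port)
should leave the last readings before a freeze sitting on the terminal.
Voltages would need the in*_input files read the same way, which I have
left out.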

Thanks for the info :-)

Gavin