[plug] odd temperature and parameter changes reported in /var/log/messages

Thu Nov 29 10:18:55 WST 2007

Dear PLUG list members,

Sorry, another long post but the "executive summary" is near the top :-)

A new system, being "bedded down" prior to shipment overseas for research 
work, exhibits occasional freezes: no keyboard or mouse actions are 
recognised, the display remains static, and I cannot SSH into the machine 
when it is in this state.   The only solution is a power-cycle or hard reset.

Motherboard ASUS P5K (now with AMI BIOS version 0704 which is latest except 
for the Beta version :-) )
Onboard Jmicron BIOS untouched
CPU is Intel 6600 Core2Duo at 2.4GHz
Memory 4 GB
Graphics nVidia 8500GT with 512 MB memory
Lots of cooling, heavy duty power supply
One 80 GB Samsung SATA system drive known as /dev/sdd
Three 320 GB Samsung SATA drives in software RAID5; individual drives are 
/dev/sda, sdb and sdc
openSuSE 10.3 x86_64

Since the freezes seem to occur when the software RAID array is being 
accessed I looked at /var/log/messages and some odd things come to 
light.   For the first array drive for instance...

<quote>
Nov 28 15:23:23 spcnwks004 smartd[3712]: Device: /dev/sda, SMART Prefailure 
Attribute: 1 Raw_Read_Error_Rate changed from 253 to 100
Nov 28 15:53:24 spcnwks004 smartd[3921]: Device: /dev/sda, SMART Usage 
Attribute: 190 Temperature_Celsius changed from 139 to 136
Nov 28 15:53:24 spcnwks004 smartd[3921]: Device: /dev/sda, SMART Usage 
Attribute: 194 Temperature_Celsius changed from 139 to 136
Nov 28 16:23:24 spcnwks004 smartd[3921]: Device: /dev/sda, SMART Usage 
Attribute: 200 Multi_Zone_Error_Rate changed from 253 to 100
Nov 28 17:53:24 spcnwks004 smartd[3921]: Device: /dev/sda, SMART Usage 
Attribute: 190 Temperature_Celsius changed from 136 to 139
Nov 28 17:53:24 spcnwks004 smartd[3921]: Device: /dev/sda, SMART Usage 
Attribute: 194 Temperature_Celsius changed from 136 to 139
<unquote>

The situation seems to have improved since the BIOS upgrade however... 
prior to the upgrade I was seeing many instances of "prefailure" attribute 
change messages such as...

<quote>
Nov 27 14:49:31 spcnwks004 smartd[3341]: Device: /dev/sda, SMART Prefailure 
Attribute: 1 Raw_Read_Error_Rate changed from 253 to 100
Nov 27 14:49:31 spcnwks004 smartd[3341]: Device: /dev/sdb, SMART Prefailure 
Attribute: 1 Raw_Read_Error_Rate changed from 253 to 100
Nov 27 14:49:31 spcnwks004 smartd[3341]: Device: /dev/sdc, SMART Prefailure 
Attribute: 1 Raw_Read_Error_Rate changed from 253 to 100
Nov 27 14:49:31 spcnwks004 smartd[3341]: Device: /dev/sdd, SMART Prefailure 
Attribute: 1 Raw_Read_Error_Rate changed from 253 to 100
Nov 28 12:03:14 spcnwks004 smartd[3926]: Device: /dev/sda, SMART Prefailure 
Attribute: 1 Raw_Read_Error_Rate changed from 253 to 100
Nov 28 12:03:15 spcnwks004 smartd[3926]: Device: /dev/sdb, SMART Prefailure 
Attribute: 1 Raw_Read_Error_Rate changed from 253 to 100
Nov 28 12:03:15 spcnwks004 smartd[3926]: Device: /dev/sdc, SMART Prefailure 
Attribute: 1 Raw_Read_Error_Rate changed from 253 to 100
Nov 28 12:03:15 spcnwks004 smartd[3926]: Device: /dev/sdd, SMART Prefailure 
Attribute: 1 Raw_Read_Error_Rate changed from 253 to 100

Note: BIOS upgrade here

Nov 28 15:23:23 spcnwks004 smartd[3712]: Device: /dev/sda, SMART Prefailure 
Attribute: 1 Raw_Read_Error_Rate changed from 253 to 100
<unquote>
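
To see at a glance which drives and attributes are generating these messages, the smartd lines could be tallied with a short script. This is only a rough sketch: the regular expression is modelled on the log lines quoted above, and the names SAMPLE and tally_changes are made up for illustration. In the real /var/log/messages each entry is a single line (the mail client wrapped them here).

```python
import re
from collections import Counter

# Pattern modelled on the smartd "attribute changed" lines quoted in this post.
# \s+ after the attribute kind tolerates either a space or a wrapped line.
PATTERN = re.compile(
    r"smartd\[\d+\]: Device: (?P<dev>/dev/\w+), "
    r"SMART (?P<kind>Prefailure|Usage)\s+"
    r"Attribute: (?P<id>\d+) (?P<name>\S+) "
    r"changed from (?P<old>\d+) to (?P<new>\d+)"
)

# A few entries copied from the excerpts above, unwrapped to one line each.
SAMPLE = [
    "Nov 27 14:49:31 spcnwks004 smartd[3341]: Device: /dev/sdb, SMART Prefailure "
    "Attribute: 1 Raw_Read_Error_Rate changed from 253 to 100",
    "Nov 28 15:23:23 spcnwks004 smartd[3712]: Device: /dev/sda, SMART Prefailure "
    "Attribute: 1 Raw_Read_Error_Rate changed from 253 to 100",
    "Nov 28 15:53:24 spcnwks004 smartd[3921]: Device: /dev/sda, SMART Usage "
    "Attribute: 190 Temperature_Celsius changed from 139 to 136",
    "Nov 28 15:53:24 spcnwks004 smartd[3921]: Device: /dev/sda, SMART Usage "
    "Attribute: 194 Temperature_Celsius changed from 139 to 136",
]


def tally_changes(lines):
    """Count attribute-change messages per (device, attribute name)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match["dev"], match["name"])] += 1
    return counts


if __name__ == "__main__":
    # To run against the real log, pass open("/var/log/messages") instead
    # (reading that file usually needs root).
    for (dev, name), n in sorted(tally_changes(SAMPLE).items()):
        print(dev, name, n)
```

Comparing the tally from before and after the BIOS upgrade would show whether the drop in "prefailure" messages is real or just down to a shorter observation window.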

Results from smartctl -a /dev/sdX show...

For the first array drive sda:
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always   -       644452

So it has not failed, but the raw value shows roughly 644,000 of these 
errors in the lifetime of the drive so far.   Is this "normal" I wonder?

For sdb the value is higher...
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always   -       466109033

For sdc it is...
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always   -       285607443

And for the system drive sdd it is...
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always   -       301539788
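
For reference, the columns in a smartctl attribute row are ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED and RAW_VALUE. smartmontools treats an attribute as failed when the normalised VALUE is at or below THRESH; the raw value is vendor-specific, so it is probably more useful for comparing like drives than as an absolute error count. A minimal sketch of that check, using the four rows above (ROWS, parse_attribute_row and has_failed are illustrative names only, and the parser assumes a plain integer raw value, which holds for attribute 195 here but not for every attribute):

```python
# Attribute 195 rows as reported above, one per drive.
ROWS = {
    "sda": "195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always   -       644452",
    "sdb": "195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always   -       466109033",
    "sdc": "195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always   -       285607443",
    "sdd": "195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always   -       301539788",
}


def parse_attribute_row(row):
    """Split one smartctl attribute row into its named columns."""
    fields = row.split()
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),   # normalised current value
        "worst": int(fields[4]),   # lowest normalised value seen so far
        "thresh": int(fields[5]),  # failure threshold
        "type": fields[6],
        "raw": int(fields[9]),     # assumption: raw value is a plain integer
    }


def has_failed(attr):
    """An attribute counts as failed when VALUE <= THRESH."""
    return attr["value"] <= attr["thresh"]


if __name__ == "__main__":
    for dev, row in ROWS.items():
        attr = parse_attribute_row(row)
        print(dev, attr["name"], "raw:", attr["raw"], "failed:", has_failed(attr))
```

With VALUE at 100 and THRESH at 000 on all four drives, none of them is flagged as failed on this attribute, whatever the raw counter says.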

Primary question is "is any of the above normal"?   I am thinking of 
borrowing a hardware SATA RAID card to see if stability improves, although 
I have had no data loss per se so far with the software RAID solution.   
But I *must* solve the stability problem before this one goes to Brazil.   
Too far away for housecalls :-(

May try booting in failsafe mode to see if stability improves.   If so it 
may be a power (ACPI / APM) issue.

Thoughts appreciated!!
Denis

<Some background on this system and diagnostics performed for the 
non-faint-hearted>
This system has been the subject of previous posts... notably the nVidia 
"lack of video during POST" thread which has since been solved (sort of - 
still need to figure out how to make digital output the default.)

When I first took delivery of the hardware I ran badblocks tests on all 
drives - sweet.
Configured software RAID5 with my original choice of OS, ubuntu.   Did not 
perform much in the way of high duty cycle reading / writing to the RAID 
array but did note that a lockup occurred while using rsync-backup to copy 
some of the 80-gig system drive to the RAID array.  Did not have time to 
follow that up :-(

When openSuSE 10.3 came out recently I repartitioned the system drive and 
loaded it instead - closer match to this machine's server which runs SuSE 
SLES 9.   The ubuntu-to-openSuSE switch was an eye-opener... openSuSE actually 
preserved the old user accounts!   Colour me impressed.   RAID array still 
worked and had all the previous data.   So far so good.

Initial system freeze (video still showing but no response from mouse, 
keyboard and could not SSH into the box) happened in the midst of copying 3 
GB data over an NFS link to the software RAID array.

Rebooted the system (full power cycle) and then the data copy went ahead 
okay.   Second freeze happened right at the start of burning a DVD of the 
same 3GB data, using k3b burning software.   Same symptoms - 
non-responsive, not even SSH is possible.

Bit of Googling shows that some system freezes can be due to power 
management, e.g. APM, ACPI.   Rebooted in failsafe mode.   No system 
freezes so far BUT...

Examined the logs and, as above, found some very odd changes in status of 
drive temperature.   Red herring?
Running smartctl shows drives check okay but occasional changes in status 
of some of the parameters.   Again these are reflected in the logs, as above.

Discovered the business about "odd" CPU core temperature rise during 
glxgears.   BIOS upgrade seems to have fixed that but done nothing really 
for the freeze problem.

Furthermore after a freeze (post-upgrade of BIOS) I found it would not 
reboot - grub stage 1.5 error, disk not found!   The BIOS did not show the 
system drive.   Oh dear!   Switched it off (again) and let it "cool down" 
for 10 minutes, after which the system drive was seen in the BIOS and could 
be assigned as boot drive.   All sweet again.

However it is running in an air-conditioned room so there should be no need 
to "cool down" for the drive to be recognised.