[plug] odd temperature and parameter changes reported in /var/log/messages
Denis Brown
dsbrown at cyllene.uwa.edu.au
Thu Nov 29 10:18:55 WST 2007
Dear PLUG list members,
Sorry, another long post but the "executive summary" is near the top :-)
A new system, being "bedded down" prior to shipment overseas for research
work, exhibits occasional freezes: no keyboard or mouse actions are
recognised, the display remains static, and I cannot SSH into the machine
while it is in this state. The only solution is a power-cycle or hard reset.
Motherboard ASUS P5K (now with AMI BIOS version 0704 which is latest except
for the Beta version :-) )
Onboard Jmicron BIOS untouched
CPU is Intel 6600 Core2Duo at 2.4GHz
Memory 4 GB
Graphics nVidia 8500GT with 512 MB memory
Lots of cooling, heavy duty power supply
One 80 GB Samsung SATA system drive known as /dev/sdd
Three 320 GB Samsung SATA drives in software RAID5; the individual drives
are /dev/sda, /dev/sdb and /dev/sdc
openSuSE 10.3 x86_64
Since the freezes seem to occur when the software RAID array is being
accessed, I looked at /var/log/messages and some odd things came to
light. For the first array drive, for instance...
<quote>
Nov 28 15:23:23 spcnwks004 smartd[3712]: Device: /dev/sda, SMART Prefailure
Attribute: 1 Raw_Read_Error_Rate changed from 253 to 100
Nov 28 15:53:24 spcnwks004 smartd[3921]: Device: /dev/sda, SMART Usage
Attribute: 190 Temperature_Celsius changed from 139 to 136
Nov 28 15:53:24 spcnwks004 smartd[3921]: Device: /dev/sda, SMART Usage
Attribute: 194 Temperature_Celsius changed from 139 to 136
Nov 28 16:23:24 spcnwks004 smartd[3921]: Device: /dev/sda, SMART Usage
Attribute: 200 Multi_Zone_Error_Rate changed from 253 to 100
Nov 28 17:53:24 spcnwks004 smartd[3921]: Device: /dev/sda, SMART Usage
Attribute: 190 Temperature_Celsius changed from 136 to 139
Nov 28 17:53:24 spcnwks004 smartd[3921]: Device: /dev/sda, SMART Usage
Attribute: 194 Temperature_Celsius changed from 136 to 139
<unquote>
The situation seems to have improved since the BIOS upgrade however...
prior to the upgrade I was seeing many instances of "prefailure" attribute
change messages such as...
<quote>
Nov 27 14:49:31 spcnwks004 smartd[3341]: Device: /dev/sda, SMART Prefailure
Attribute: 1 Raw_Read_Error_Rate changed from 253 to 100
Nov 27 14:49:31 spcnwks004 smartd[3341]: Device: /dev/sdb, SMART Prefailure
Attribute: 1 Raw_Read_Error_Rate changed from 253 to 100
Nov 27 14:49:31 spcnwks004 smartd[3341]: Device: /dev/sdc, SMART Prefailure
Attribute: 1 Raw_Read_Error_Rate changed from 253 to 100
Nov 27 14:49:31 spcnwks004 smartd[3341]: Device: /dev/sdd, SMART Prefailure
Attribute: 1 Raw_Read_Error_Rate changed from 253 to 100
Nov 28 12:03:14 spcnwks004 smartd[3926]: Device: /dev/sda, SMART Prefailure
Attribute: 1 Raw_Read_Error_Rate changed from 253 to 100
Nov 28 12:03:15 spcnwks004 smartd[3926]: Device: /dev/sdb, SMART Prefailure
Attribute: 1 Raw_Read_Error_Rate changed from 253 to 100
Nov 28 12:03:15 spcnwks004 smartd[3926]: Device: /dev/sdc, SMART Prefailure
Attribute: 1 Raw_Read_Error_Rate changed from 253 to 100
Nov 28 12:03:15 spcnwks004 smartd[3926]: Device: /dev/sdd, SMART Prefailure
Attribute: 1 Raw_Read_Error_Rate changed from 253 to 100
Note: BIOS upgrade here
Nov 28 15:23:23 spcnwks004 smartd[3712]: Device: /dev/sda, SMART Prefailure
Attribute: 1 Raw_Read_Error_Rate changed from 253 to 100
<unquote>
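For anyone wanting to see how often each drive logs these changes, here is a minimal sketch in Python that tallies smartd "changed from X to Y" messages per device and attribute. It assumes each log entry sits on a single line in /var/log/messages (the excerpts above are wrapped by the mail client); the names SMARTD_CHANGE and tally_changes are made up for illustration and are not part of smartmontools.

```python
import re
from collections import Counter

# Pattern for smartd attribute-change messages, based on the excerpts above.
# Assumption: one log entry per line (the email wraps them for width).
SMARTD_CHANGE = re.compile(
    r"smartd\[\d+\]: Device: (?P<dev>\S+), SMART (?P<kind>Prefailure|Usage) "
    r"Attribute: (?P<id>\d+) (?P<name>\S+) changed from (?P<old>\d+) to (?P<new>\d+)"
)


def tally_changes(lines):
    """Count attribute-change messages per (device, attribute name)."""
    counts = Counter()
    for line in lines:
        match = SMARTD_CHANGE.search(line)
        if match:
            counts[(match["dev"], match["name"])] += 1
    return counts


# A few entries from the log above, re-joined onto single lines.
sample = [
    "Nov 27 14:49:31 spcnwks004 smartd[3341]: Device: /dev/sda, SMART Prefailure "
    "Attribute: 1 Raw_Read_Error_Rate changed from 253 to 100",
    "Nov 27 14:49:31 spcnwks004 smartd[3341]: Device: /dev/sdb, SMART Prefailure "
    "Attribute: 1 Raw_Read_Error_Rate changed from 253 to 100",
    "Nov 28 12:03:14 spcnwks004 smartd[3926]: Device: /dev/sda, SMART Prefailure "
    "Attribute: 1 Raw_Read_Error_Rate changed from 253 to 100",
    "Nov 28 15:53:24 spcnwks004 smartd[3921]: Device: /dev/sda, SMART Usage "
    "Attribute: 190 Temperature_Celsius changed from 139 to 136",
]

counts = tally_changes(sample)
for (dev, name), n in sorted(counts.items()):
    print(dev, name, n)
```

To run it against the real log, something like `tally_changes(open("/var/log/messages"))` should work, given read permission on the file.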
Results from smartctl -a /dev/sdX show...
For the first array drive sda:
195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 644452
So it has not failed but there are roughly 644,000 of these errors in the
lifetime of the drive so far. Is this "normal" I wonder?
For sdb the value is higher...
195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 466109033
For sdc it is...
195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 285607443
And for the system drive sdd it is...
195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 301539788
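To make the columns in these rows easier to read, here is a small Python sketch that splits one attribute row from smartctl into its fields. The column order (ID, name, flag, normalised value, worst, threshold, type, updated, when-failed, raw value) is the standard smartctl attribute table layout. The names SmartAttr, parse_attr_row and at_or_below_threshold are invented for this example. The idea is that the drive flags an attribute as failed when the normalised value drops to the threshold or below, while the large raw number is vendor-specific and may not be a simple error count.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int        # normalised value
    worst: int        # lowest normalised value recorded
    thresh: int       # failure threshold for the normalised value
    when_failed: str  # "-" when the attribute has never failed
    raw: str          # vendor-specific raw value, kept as text


def parse_attr_row(row):
    """Split one smartctl attribute row (on a single line) into fields."""
    f = row.split(None, 9)
    return SmartAttr(
        attr_id=int(f[0]),
        name=f[1],
        value=int(f[3]),
        worst=int(f[4]),
        thresh=int(f[5]),
        when_failed=f[8],
        raw=f[9].strip(),
    )


def at_or_below_threshold(attr):
    """True when the normalised value has reached the failure threshold."""
    return attr.value <= attr.thresh


row = "195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 644452"
attr = parse_attr_row(row)
print(attr.name, attr.value, attr.thresh, attr.raw, at_or_below_threshold(attr))
```

For the sda row above the normalised value is 100 against a threshold of 0, so by this check the attribute is nowhere near failing, whatever the raw figure means.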
The primary question is: is any of the above normal? I am thinking of trying
to borrow a hardware SATA RAID card to see if stability improves, although I
have had no data loss per se so far with the software RAID solution. But
I *must* solve the stability problem before this one goes to Brazil. Too
far away for housecalls :-(
May try booting in failsafe mode to see if stability improves. If so it
may be a power (ACPI / APM) issue.
Thoughts appreciated!!
Denis
<Some background on this system and diagnostics performed for the
non-faint-hearted>
This system has been the subject of previous posts... notably the nVidia
"lack of video during POST" thread which has since been solved (sort of -
still need to figure out how to make digital output the default.)
When I first took delivery of the hardware I ran badblocks tests on all
drives - sweet.
Configured software RAID5 with my original choice of OS, ubuntu. Did not
perform much in the way of high duty cycle reading / writing to the RAID
array but did note that a lockup occurred while using rsync-backup to copy
some of the 80-gig system drive to the RAID array. Did not have time
to follow that up :-(
When openSuSE 10.3 came out recently I repartitioned the system drive and
loaded it instead - closer match to this machine's server which runs SuSE
SLES 9. The ubuntu-to-openSuSE switch was an eye-opener... openSuSE actually
preserved the old user accounts! Colour me impressed. RAID array still
worked and had all the previous data. So far so good.
Initial system freeze (video still showing but no response from mouse,
keyboard and could not SSH into the box) happened in the midst of copying 3
GB data over an NFS link to the software RAID array.
After rebooting the system (full power cycle) the data copy went ahead
okay. Second freeze happened right at the start of burning a DVD of the
same 3GB data, using k3b burning software. Same symptoms -
non-responsive, not even SSH is possible.
A bit of Googling shows that some system freezes can be due to power
management, e.g. APM, ACPI. Rebooted in failsafe mode. No system
freezes so far BUT...
Examined the logs and as above some very odd changes in status of drive
temperature. Red herring?
Running smartctl shows drives check okay but occasional changes in status
of some of the parameters. Again these are reflected in the logs, as above.
Discovered the business about "odd" CPU core temperature rise during
glxgears. BIOS upgrade seems to have fixed that but done nothing really
for the freeze problem.
Furthermore after a freeze (post-upgrade of BIOS) I found it would not
reboot - grub stage 1.5 error, disk not found! The BIOS did not show the
system drive. Oh dear! Switched it off (again) and let it "cool down"
for 10 minutes then the system drive is seen in BIOS and can be assigned as
boot drive. All sweet again.
However it is running in an air-conditioned room so there should be no need
to "cool down" for the drive to be recognised.