Fw: [plug] How to diagnose a crashing Linux server?
Craig Ringer
craig at postnewspapers.com.au
Tue Jun 3 12:11:24 WST 2003
> As a follow-up to this email that I sent just over a week ago - I resorted
> to "plan B" which was to schedule a CRON job to reboot every night.
Wow - sounds like WinNT. Sad that you needed to do that for a Linux box.
>>JLM> Ouch, I never would use a Travan tape unit in a server. If you can
> try using a DDS.
>
> Are there specific problems/known issues with Travan units? Are there
> advantages to using a DDS based drive?
Travan drives aren't particularly reliable, and I believe I heard something
about them running /incredibly/ hot too. I use DDS4 and get good results,
but the tape life isn't as good as with linear tape technologies like
DLT. Alas, with tape tech you get what you pay for.
I seem to remember I responded to your question last time, including the
suggestion to try hooking up a null-modem cable and disabling the
console blanking so that you could see any errors the system might've
printed pre-death. Have you had a chance to try any of that?
> It generally boots into a GUI, but it was recommended by another PLUG member
to try running this in a CLI situation - the crash still occurred under both
> situations.
Interesting. And you'd actually ensured that X was not starting, not
just switched to another virtual console (Ctrl-Alt-Fx)? Oh well. X is
often the #1 culprit in system crashes, but it's not the only one.
OTOH, if you're running on the console instead of running X, you can
also see kernel messages that you might otherwise miss, like hard disk
errors. I suggest running "dmesg -n 2" or higher at the console (as
root) to set the console log level; the higher the number, the more
kernel messages reach the console. Don't do this on a firewall though,
or you'll get drowned by iptables info.
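To make the "or higher" part concrete, here is a minimal Python sketch of how the console log level filters kernel messages. It assumes the usual rule that a message is printed when its priority number is strictly lower than the console log level, with the eight priority names from syslog(3). The function name shown_on_console is made up for this illustration.

```python
# Kernel message priorities, most severe first (names as in syslog(3)).
PRIORITIES = ["emerg", "alert", "crit", "err",
              "warning", "notice", "info", "debug"]


def shown_on_console(console_loglevel):
    """Return the priorities that reach the console at the given level.

    A message is printed when its priority number is strictly lower
    than the console log level, so a higher level means more output.
    The level is clamped to the range 1..8 for this illustration.
    """
    level = max(1, min(console_loglevel, len(PRIORITIES)))
    return PRIORITIES[:level]


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, shown_on_console(n))
```

Under that rule, level 2 only lets the two most severe classes through, so disk errors logged at "err" or "warning" would need a level of 5 or more to show up on the console.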
> So - anyone have any ideas? Plan C is a new box in its place .... see
> whether the same situation occurs; then I can isolate it to hardware or
> software.
I do hate having to go to that - "I don't know WTF is wrong, but I
suspect a brand new server will solve the problem". I was getting to
that point with a machine here, where I'd replaced /all/ the parts
except the case and it was still crashing. (It'd stopped for a while,
then resumed again). Then I looked at the S.M.A.R.T. info on the second
HDD, and I knew what was wrong. I'd replaced the disks, suspecting they
were dodgy, and that hadn't appeared to fix the problem. Turns out it had,
but one of the replacement disks was also now dying:
Device: WDC WD1200JB-75CRA0 Supports ATA Version 5
Drive supports S.M.A.R.T. and is enabled
Check S.M.A.R.T. Passed.
General Smart Values:
Off-line data collection status: (0x85) Offline data collection activity was
                                        aborted by an interrupting command
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run
Total time to complete off-line
data collection:                 (4680) Seconds
Offline data collection
Capabilities:                    (0x3b) SMART EXECUTE OFF-LINE IMMEDIATE
                                        Automatic timer ON/OFF support
                                        Suspend Offline Collection upon new
                                        command
                                        Offline surface scan supported
                                        Self-test supported
Smart Capablilities:           (0x0003) Saves SMART data before entering
                                        power-saving mode
                                        Supports SMART auto save timer
Error logging capability:        (0x01) Error logging supported
Short self-test routine
recommended polling time:        (   2) Minutes
Extended self-test routine
recommended polling time:        (  87) Minutes
Vendor Specific SMART Attributes with Thresholds:
Revision Number: 16
Attribute                    Flag   Value Worst Threshold Raw Value
(  1)Raw Read Error Rate     0x000b 200   200   051       0
(  3)Spin Up Time            0x0007 099   094   021       5816
(  4)Start Stop Count        0x0032 100   100   040       27
(  5)Reallocated Sector Ct   0x0033 172   172   140       440
(  7)Seek Error Rate         0x000b 200   200   051       0
(  9)Power On Hours          0x0032 098   098   000       1740
( 10)Spin Retry Count        0x0013 100   253   051       0
( 11)Calibration Retry Count 0x0013 100   253   051       0
( 12)Power Cycle Count       0x0032 100   100   000       27
(196)Reallocated Event Count 0x0032 169   169   000       31
(197)Current Pending Sector  0x0012 200   200   000       0
(198)Offline Uncorrectable   0x0012 200   200   000       0
(199)UDMA CRC Error Count    0x000a 200   253   000       0
(200)Unknown Attribute       0x0009 200   200   051       0
SMART Error Log:
SMART Error Logging Version: 1
Error Log Data Structure Pointer: 01
ATA Error Count: 6
Non-Fatal Count: 0
Error Log Structure 1:
DCR FR SC SN CL SH D/H CR Timestamp
00 00 80 80 65 b2 e0 c8 3768
00 00 80 80 66 b2 e0 c8 3768
00 00 80 00 67 b2 e0 c8 3768
00 00 2c 80 63 b2 e0 ca 3768
00 00 80 00 64 b2 e0 c8 3768
00 40 80 2c 64 b2 e0 51 11691
Error condition: 0 Error State: 20
Number of Hours in Drive Life: 74 (life of the drive in hours)
Error Log Structure 2:
DCR FR SC SN CL SH D/H CR Timestamp
00 00 80 80 61 b2 e0 ca 3755
00 00 80 00 62 b2 e0 c8 3755
00 00 80 80 62 b2 e0 c8 3755
00 00 80 80 63 b2 e0 c8 3755
00 00 80 00 64 b2 e0 c8 3755
00 40 80 2c 64 b2 e0 51 11691
Error condition: 0 Error State: 20
Number of Hours in Drive Life: 74 (life of the drive in hours)
Error Log Structure 3:
DCR FR SC SN CL SH D/H CR Timestamp
00 00 2c 00 62 b2 e0 ca 3758
00 00 80 80 62 b2 e0 c8 3758
00 00 80 80 63 b2 e0 c8 3758
00 00 2c 00 63 b2 e0 ca 3758
00 00 80 00 64 b2 e0 c8 3758
00 40 80 2c 64 b2 e0 51 11691
Error condition: 0 Error State: 20
Number of Hours in Drive Life: 74 (life of the drive in hours)
Error Log Structure 4:
DCR FR SC SN CL SH D/H CR Timestamp
00 00 80 80 66 b2 e0 c8 3761
00 00 2c 80 62 b2 e0 ca 3761
00 00 80 80 63 b2 e0 c8 3762
00 00 2c 00 63 b2 e0 ca 3762
00 00 80 00 64 b2 e0 c8 3762
00 40 80 2c 64 b2 e0 51 11691
Error condition: 0 Error State: 20
Number of Hours in Drive Life: 74 (life of the drive in hours)
Error Log Structure 5:
DCR FR SC SN CL SH D/H CR Timestamp
00 00 80 80 66 b2 e0 c8 3765
00 00 80 00 67 b2 e0 c8 3765
00 00 2c 00 63 b2 e0 ca 3765
00 00 80 80 63 b2 e0 c8 3765
00 00 80 00 64 b2 e0 c8 3765
00 40 80 2c 64 b2 e0 51 11691
Error condition: 0 Error State: 20
Number of Hours in Drive Life: 74 (life of the drive in hours)
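Note that the drive still reports "Check S.M.A.R.T. Passed." while the Reallocated Sector Ct raw value is already 440, so the pass/fail line alone is not much to go on. As a rough illustration, the attribute table can be screened with a short Python sketch like the one below. It assumes attribute lines in the format shown above. The watch list of attribute IDs (5, 196, 197, 198) and the function names are choices made for this example, not part of any SMART tool.

```python
import re

# A few attribute lines taken from the output above.
SAMPLE = """\
( 1)Raw Read Error Rate 0x000b 200 200 051 0
( 5)Reallocated Sector Ct 0x0033 172 172 140 440
(196)Reallocated Event Count 0x0032 169 169 000 31
(197)Current Pending Sector 0x0012 200 200 000 0
"""

# (id)name flag value worst threshold raw
ATTR_RE = re.compile(
    r"\(\s*(\d+)\)\s*(.+?)\s+(0x[0-9a-fA-F]+)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)

# Hypothetical watch list: counters related to bad or remapped sectors.
WATCHED = {5, 196, 197, 198}


def parse_attributes(text):
    """Parse attribute lines into dicts, skipping lines that don't match."""
    attrs = []
    for line in text.splitlines():
        m = ATTR_RE.match(line.strip())
        if m:
            attrs.append({
                "id": int(m.group(1)),
                "name": m.group(2),
                "value": int(m.group(4)),
                "worst": int(m.group(5)),
                "threshold": int(m.group(6)),
                "raw": int(m.group(7)),
            })
    return attrs


def suspicious(attrs):
    """Return watched attributes whose raw counter is non-zero."""
    return [a for a in attrs if a["id"] in WATCHED and a["raw"] > 0]


if __name__ == "__main__":
    for a in suspicious(parse_attributes(SAMPLE)):
        margin = a["value"] - a["threshold"]
        print(a["id"], a["name"], "raw:", a["raw"], "margin:", margin)
```

The margin printed here is just the normalized value minus its threshold. A non-zero raw count on these attributes does not prove the disk is the cause of the crashes, but it is a reasonable cue to look closer.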