Fw: [plug] How to diagnose a crashing Linux server?

Tue Jun 3 12:11:24 WST 2003

> As a follow-up to this email that I sent just over a week ago - I resorted
> to "plan B" which was to schedule a CRON job to reboot every night.

Wow - sounds like WinNT. Sad that you needed to do that for a Linux box.

> >>JLM> Ouch, I never would use a Travan tape unit in a server. If you can
> >>JLM> try using a DDS.
> 
> Are there specific problems/known issues with Travan units? Are there
> advantages to using a DDS based drive?

Travan drives aren't particularly reliable, and I believe I heard 
something about them running /incredibly/ hot too. I use DDS4 and get 
good results, but the tape life isn't as good as with linear tape 
technologies like DLT. Alas, with tape tech you get what you pay for.

I seem to remember I responded to your question last time, including the 
suggestion to try hooking up a null-modem cable and disabling the 
console blanking so that you could see any errors the system might've 
printed pre-death. Have you had a chance to try any of that?

> It generally boots into a GUI, but it was recommended by another PLUG member
> to try running this in a CLI situation - the crash still occurred under both
> situations.

Interesting. And you'd actually ensured that X was not starting, not 
just switched to another virtual console (Ctrl-Alt-Fx)? Oh well. X is 
often the #1 culprit in system crashes, but it's not the only one.

OTOH, if you're running on the console instead of running X, you can 
also see kernel messages that you might otherwise miss, like hard disk 
errors. I suggest raising the console log level with "dmesg -n" (as 
root) to increase verbosity of kernel messages: "dmesg -n 1" lets only 
panics through, and higher numbers show progressively more, up to 
"dmesg -n 8" for everything. Don't do this on a firewall though, or 
you'll get drowned by iptables info.
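
If you want to see what the console log level currently is before 
changing it, the kernel exposes it as the first of four numbers in 
/proc/sys/kernel/printk. Here's a rough Python sketch that interprets 
that file. The function names (parse_printk, shown_on_console, 
read_printk) are just ones I made up for illustration, not part of any 
tool.

```python
# Sketch: interpret the four fields of /proc/sys/kernel/printk.
# The first field is the console log level that "dmesg -n" changes.
# All function names here are hypothetical helpers for illustration.

PRINTK_PATH = "/proc/sys/kernel/printk"

FIELD_NAMES = (
    "console",           # current console log level
    "default_message",   # level given to messages without an explicit one
    "minimum_console",   # lowest value the console level may be set to
    "default_console",   # console level used at boot
)


def parse_printk(text):
    """Return the four printk log levels as a dict of ints."""
    values = [int(field) for field in text.split()]
    if len(values) != len(FIELD_NAMES):
        raise ValueError("expected 4 fields, got %d" % len(values))
    return dict(zip(FIELD_NAMES, values))


def shown_on_console(levels, priority):
    """A kernel message reaches the console only if its priority
    number is lower than the console log level."""
    return priority < levels["console"]


def read_printk(path=PRINTK_PATH):
    """Read and parse the live settings (Linux only)."""
    with open(path) as fh:
        return parse_printk(fh.read())
```

On the box itself you'd call read_printk() and then, say, 
shown_on_console(levels, 4) to check whether warnings (priority 4, 
KERN_WARNING) will actually make it to the screen.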

> So - anyone have any ideas? Plan C is a new box in its place .... see
> whether the same situation occurs; then I can isolate it to hardware or
> software.

I do hate having to go to that - "I don't know WTF is wrong, but I 
suspect a brand new server will solve the problem". I was getting to 
that point with a machine here, where I'd replaced /all/ the parts 
except the case and it was still crashing. (It'd stopped for a while, 
then resumed again). Then I looked at the S.M.A.R.T. info on the second 
HDD, and I knew what was wrong. I'd replaced the disks, suspecting they 
were dodgy, and that seemed not to have fixed the problem. Turns out it 
had, but one of the replacement disks was also now dying. Note the 
Reallocated Sector Ct raw value of 440 after only 1740 power-on hours, 
plus the six logged ATA errors:

Device: WDC WD1200JB-75CRA0  Supports ATA Version 5
Drive supports S.M.A.R.T. and is enabled
Check S.M.A.R.T. Passed.

General Smart Values:
Off-line data collection status: (0x85)	Offline data collection activity was
					aborted by an interrupting command

Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run

Total time to complete off-line
data collection: 		 (4680) Seconds

Offline data collection
Capabilities: 			 (0x3b)SMART EXECUTE OFF-LINE IMMEDIATE
					Automatic timer ON/OFF support
					Suspend Offline Collection upon new
					command
					Offline surface scan supported
					Self-test supported

Smart Capablilities:           (0x0003)	Saves SMART data before entering
					power-saving mode
					Supports SMART auto save timer

Error logging capability:        (0x01)	Error logging supported

Short self-test routine
recommended polling time: 	 (   2) Minutes

Extended self-test routine
recommended polling time: 	 (  87) Minutes

Vendor Specific SMART Attributes with Thresholds:
Revision Number: 16
Attribute                    Flag     Value Worst Threshold Raw Value
(  1)Raw Read Error Rate     0x000b   200   200   051       0
(  3)Spin Up Time            0x0007   099   094   021       5816
(  4)Start Stop Count        0x0032   100   100   040       27
(  5)Reallocated Sector Ct   0x0033   172   172   140       440
(  7)Seek Error Rate         0x000b   200   200   051       0
(  9)Power On Hours          0x0032   098   098   000       1740
( 10)Spin Retry Count        0x0013   100   253   051       0
( 11)Calibration Retry Count 0x0013   100   253   051       0
( 12)Power Cycle Count       0x0032   100   100   000       27
(196)Reallocated Event Count 0x0032   169   169   000       31
(197)Current Pending Sector  0x0012   200   200   000       0
(198)Offline Uncorrectable   0x0012   200   200   000       0
(199)UDMA CRC Error Count    0x000a   200   253   000       0
(200)Unknown Attribute       0x0009   200   200   051       0
SMART Error Log:
SMART Error Logging Version: 1
Error Log Data Structure Pointer: 01
ATA Error Count: 6
Non-Fatal Count: 0

Error Log Structure 1:
DCR   FR   SC   SN   CL   SH   D/H   CR   Timestamp
  00   00   80   80   65   b2    e0   c8     3768
  00   00   80   80   66   b2    e0   c8     3768
  00   00   80   00   67   b2    e0   c8     3768
  00   00   2c   80   63   b2    e0   ca     3768
  00   00   80   00   64   b2    e0   c8     3768
  00   40   80   2c   64   b2    e0   51     11691
Error condition:   0	Error State:      20
Number of Hours in Drive Life: 74 (life of the drive in hours)

Error Log Structure 2:
DCR   FR   SC   SN   CL   SH   D/H   CR   Timestamp
  00   00   80   80   61   b2    e0   ca     3755
  00   00   80   00   62   b2    e0   c8     3755
  00   00   80   80   62   b2    e0   c8     3755
  00   00   80   80   63   b2    e0   c8     3755
  00   00   80   00   64   b2    e0   c8     3755
  00   40   80   2c   64   b2    e0   51     11691
Error condition:   0	Error State:      20
Number of Hours in Drive Life: 74 (life of the drive in hours)

Error Log Structure 3:
DCR   FR   SC   SN   CL   SH   D/H   CR   Timestamp
  00   00   2c   00   62   b2    e0   ca     3758
  00   00   80   80   62   b2    e0   c8     3758
  00   00   80   80   63   b2    e0   c8     3758
  00   00   2c   00   63   b2    e0   ca     3758
  00   00   80   00   64   b2    e0   c8     3758
  00   40   80   2c   64   b2    e0   51     11691
Error condition:   0	Error State:      20
Number of Hours in Drive Life: 74 (life of the drive in hours)

Error Log Structure 4:
DCR   FR   SC   SN   CL   SH   D/H   CR   Timestamp
  00   00   80   80   66   b2    e0   c8     3761
  00   00   2c   80   62   b2    e0   ca     3761
  00   00   80   80   63   b2    e0   c8     3762
  00   00   2c   00   63   b2    e0   ca     3762
  00   00   80   00   64   b2    e0   c8     3762
  00   40   80   2c   64   b2    e0   51     11691
Error condition:   0	Error State:      20
Number of Hours in Drive Life: 74 (life of the drive in hours)

Error Log Structure 5:
DCR   FR   SC   SN   CL   SH   D/H   CR   Timestamp
  00   00   80   80   66   b2    e0   c8     3765
  00   00   80   00   67   b2    e0   c8     3765
  00   00   2c   00   63   b2    e0   ca     3765
  00   00   80   80   63   b2    e0   c8     3765
  00   00   80   00   64   b2    e0   c8     3765
  00   40   80   2c   64   b2    e0   51     11691
Error condition:   0	Error State:      20
Number of Hours in Drive Life: 74 (life of the drive in hours)
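
Since the overall check above still says "Passed", it's worth looking 
at the attribute table yourself rather than trusting the summary. Below 
is a minimal Python sketch that picks out the rows worth worrying 
about. It assumes the old smartctl layout pasted above (newer versions 
print different columns), and the names ROW, SECTOR_COUNTERS, 
parse_attributes and suspicious are my own inventions for illustration.

```python
import re

# One attribute row in the smartctl layout pasted above, e.g.
# "(  5)Reallocated Sector Ct   0x0033   172   172   140       440"
# Assumption: this layout only; other smartctl versions differ.
ROW = re.compile(
    r"^\(\s*(\d+)\)(.+?)\s+0x[0-9a-fA-F]{4}"
    r"\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)

# Attribute IDs from the table above whose raw value counts
# reallocated, pending or uncorrectable sectors (or such events).
SECTOR_COUNTERS = {5, 196, 197, 198}


def parse_attributes(text):
    """Parse attribute rows into a list of dicts; skip other lines."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue
        attr_id, name, value, worst, threshold, raw = match.groups()
        rows.append({
            "id": int(attr_id),
            "name": name.strip(),
            "value": int(value),
            "worst": int(worst),
            "threshold": int(threshold),
            "raw": int(raw),
        })
    return rows


def suspicious(rows):
    """Return (id, name, reason) for rows that deserve a closer look."""
    flagged = []
    for row in rows:
        if row["threshold"] > 0 and row["value"] <= row["threshold"]:
            # Normalised value at or below threshold: attribute has failed.
            flagged.append((row["id"], row["name"], "at or below threshold"))
        elif row["id"] in SECTOR_COUNTERS and row["raw"] > 0:
            # Drive has not failed the attribute yet, but is remapping.
            flagged.append((row["id"], row["name"], "raw count %d" % row["raw"]))
    return flagged
```

Fed the table above, suspicious(parse_attributes(text)) flags 
attribute 5 (raw count 440) and attribute 196 (raw count 31), even 
though neither normalised value has dropped to its threshold yet.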