[plug] strangely crashing PPro [long]

Wed Dec 10 23:58:11 WST 2003

Hi folks

I've been bashing my head against this for a little while now, so I
thought I'd ask here and see if anybody has any ideas.

I recently picked up a single-CPU Pentium Pro/180 machine for use as a
firewall. It contains 32MB of 72-pin EDO RAM and four Intel EtherExpress
Pro NICs, plus an ISA video card. It does not run X.

I'm encountering instability and kernel oopses on the box, and it's
driving me nuts. I'd write it off as bad hardware, but I can't actually
isolate any particular component that's bad, nor see any trend in the
crashes that would point to a particular subsystem. The memory tests
out OK.

It installed fine (Fedora 1, text install) and booted up fine. A little
while (hours) later, programs started segfaulting. Shortly afterwards, a
series of oopses was reported, followed by a total kernel panic. A
reboot cleared things up, but the same thing happened again shortly
afterwards. I fired up memtest86 and ran it overnight - it found no
errors. As the system tends to fail within hours or minutes, I'd expect
an 8-hour memtest86 run to be good enough to nail RAM/cache problems of
that severity.

So - the memory, memory controller, and cache are probably OK. I've
swapped disks, tested with a vanilla 2.4.21 kernel, and even tried the
latest 2.6 kernel - all exhibit the same issue.

There is no apparent pattern to the failures - no particular action or
program that's executed that causes the problem. Disabling the loading
of the updated microcode didn't make any difference either (not that it
should've).

I've captured two sets of oopses, but don't see anything of particular
note in the decoded oops. The first set shows crashes in
'shrink_dcache_memory' like this:

Trace; c014e537 <shrink_dcache_memory+27/40>
Trace; c0134187 <shrink_caches+77/a0>
Trace; c01341ee <try_to_free_pages_zone+3e/60>
Trace; c013431e <kswapd_balance_pgdat+5e/b0>
Trace; c0134398 <kswapd_balance+28/40>
Trace; c01344da <kswapd+9a/c0>
Trace; c0134440 <kswapd+0/c0>
Trace; c0105000 <_stext+0/0>
Trace; c010767e <arch_kernel_thread+2e/40>
Trace; c0134440 <kswapd+0/c0>

and this (happened later in the same session):

Trace; c014e537 <shrink_dcache_memory+27/40>
Trace; c0134187 <shrink_caches+77/a0>
Trace; c01341ee <try_to_free_pages_zone+3e/60>
Trace; c0135099 <balance_classzone+59/1f0>
Trace; c013531f <__alloc_pages+ef/190>
Trace; c012a5cc <do_anonymous_page+5c/100>
Trace; c012a87f <handle_mm_fault+6f/100>
Trace; c0117b70 <do_page_fault+1b0/545>
Trace; c012a6d8 <do_no_page+68/1a0>
Trace; c012a87f <handle_mm_fault+6f/100>
Trace; c012e717 <filemap_nopage+1d7/210>
Trace; c012a6d8 <do_no_page+68/1a0>
Trace; c01179c0 <do_page_fault+0/545>
Trace; c0109370 <error_code+34/3c>
Trace; c012dd3c <file_read_actor+5c/90>
Trace; c012d7e0 <do_generic_file_read+1c0/420>
Trace; c012dce0 <file_read_actor+0/90>
Trace; c012de09 <generic_file_read+99/140>
Trace; c012dce0 <file_read_actor+0/90>
Trace; c013b535 <sys_read+a5/140>
Trace; c010927f <system_call+33/38>

(there are 4 others)

In the second session, i.e. after a reboot, I captured totally
different errors:

Trace; c012ce5d <add_to_page_cache_unique+7d/80>
Trace; c013599e <add_to_swap_cache+6e/d0>
Trace; c013489b <try_to_swap_out+12b/170>
Trace; c013475f <swap_out_pmd+10f/120>
Trace; c01345f9 <swap_out_mm+f9/150>
Trace; c0133caf <swap_out+5f/e0>
Trace; c0133e65 <shrink_cache+135/300>
Trace; c0134170 <shrink_caches+60/a0>
Trace; c01c677f <submit_bh+4f/70>
Trace; c01341ee <try_to_free_pages_zone+3e/60>
Trace; c0135099 <balance_classzone+59/1f0>
Trace; c013531f <__alloc_pages+ef/190>
Trace; c012ced9 <page_cache_read+79/d0>
Trace; c012d170 <__lock_page+b0/c0>
Trace; c012cf6e <read_cluster_nonblocking+3e/50>
Trace; c012e64c <filemap_nopage+10c/210>
Trace; c012a6d8 <do_no_page+68/1a0>
Trace; c012a87f <handle_mm_fault+6f/100>
Trace; c0117b70 <do_page_fault+1b0/545>
Trace; c0133813 <kmem_cache_free_one+f3/210>
Trace; c014a782 <sys_select+282/4b0>
Trace; c01179c0 <do_page_fault+0/545>
Trace; c0109370 <error_code+34/3c>

and this:

Trace; c013489b <try_to_swap_out+12b/170>
Trace; c013475f <swap_out_pmd+10f/120>
Trace; c01345f9 <swap_out_mm+f9/150>
Trace; c0133caf <swap_out+5f/e0>
Trace; c0133e65 <shrink_cache+135/300>
Trace; c0134170 <shrink_caches+60/a0>
Trace; c01341ee <try_to_free_pages_zone+3e/60>
Trace; c013431e <kswapd_balance_pgdat+5e/b0>
Trace; c0134398 <kswapd_balance+28/40>
Trace; c01344da <kswapd+9a/c0>
Trace; c0134440 <kswapd+0/c0>
Trace; c0105000 <_stext+0/0>
Trace; c010767e <arch_kernel_thread+2e/40>
Trace; c0134440 <kswapd+0/c0>

and this:

Trace; c013f8b0 <try_to_free_buffers+a0/110>
Trace; c0133fa3 <shrink_cache+273/300>
Trace; c0134170 <shrink_caches+60/a0>
Trace; c01341ee <try_to_free_pages_zone+3e/60>
Trace; c0135099 <balance_classzone+59/1f0>
Trace; c013531f <__alloc_pages+ef/190>
Trace; c0135c0f <read_swap_cache_async+af/c0>
Trace; c012a440 <swapin_readahead+40/60>
Trace; c012a538 <do_swap_page+d8/110>
Trace; c012a8af <handle_mm_fault+9f/100>
Trace; c0117b70 <do_page_fault+1b0/545>
Trace; c012f9fe <generic_file_write+49e/700>
Trace; c0109370 <error_code+34/3c>
Trace; c011ecdc <sys_wait4+11c/3b0>
Trace; c01179c0 <do_page_fault+0/545>
Trace; c0109370 <error_code+34/3c>

etc.

These all appear to be occurring in memory management, which seems odd -
especially since the RAM /appears/ to test out OK. I've also had a
number of kernel panics that I've been unable to capture, as my null
modem cable is at work :-( .
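
One way to look for a common thread across the decoded traces is to
tally how many traces each symbol appears in. What follows is only a
minimal sketch, assuming the ksymoops output has been saved as plain
text with "Trace;" lines in the format quoted above and that any
non-Trace line separates one trace from the next. The names
split_traces and common_symbols are made up for illustration, and
SAMPLE is just a short excerpt of the traces in this post.

```python
import re
from collections import Counter

# Matches ksymoops-style lines such as:
#   Trace; c0134187 <shrink_caches+77/a0>
TRACE_RE = re.compile(
    r"^Trace;\s+([0-9a-f]{8})\s+<([^+>]+)\+([0-9a-f]+)/([0-9a-f]+)>"
)


def split_traces(text):
    """Split decoded oops text into lists of symbol names.

    Assumption: any line that is not a "Trace;" line ends the current trace.
    """
    traces, current = [], []
    for line in text.splitlines():
        match = TRACE_RE.match(line.strip())
        if match:
            current.append(match.group(2))
        elif current:
            traces.append(current)
            current = []
    if current:
        traces.append(current)
    return traces


def common_symbols(traces):
    """Count how many traces each symbol appears in (once per trace)."""
    counts = Counter()
    for trace in traces:
        counts.update(set(trace))
    return counts


# Short excerpt of two of the traces quoted in this post.
SAMPLE = """\
Trace; c014e537 <shrink_dcache_memory+27/40>
Trace; c0134187 <shrink_caches+77/a0>
Trace; c01341ee <try_to_free_pages_zone+3e/60>

Trace; c013f8b0 <try_to_free_buffers+a0/110>
Trace; c0133fa3 <shrink_cache+273/300>
Trace; c0134170 <shrink_caches+60/a0>
Trace; c01341ee <try_to_free_pages_zone+3e/60>
"""

if __name__ == "__main__":
    traces = split_traces(SAMPLE)
    for symbol, count in common_symbols(traces).most_common():
        print(f"{count}/{len(traces)}  {symbol}")
```

All five traces quoted above pass through shrink_caches and
try_to_free_pages_zone, but the innermost frame differs from trace to
trace, so a tally like this narrows things down to the page reclaim
path without pointing at any single function.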

So - ideas anybody? If I have to I'll write it off as "generically
borked hardware" but I'm just not convinced.

I have the rest of the decoded oopses on hand if anybody is interested,
but all I'm really looking for is a quick opinion/suggestion anyway.

Craig Ringer