[plug] the GNOME panel that just won't die

Wed Jun 16 20:17:18 WST 2004

James Devenish wrote:
> In message <40D02611.80505 at postnewspapers.com.au>
> on Wed, Jun 16, 2004 at 06:50:57PM +0800, Craig Ringer wrote:
> 
>>After a while it suddenly stopped responding over the network. 'ifdown
>>eth2; ifup eth2' helped briefly, but then it stopped responding again.
>>'ethtool eth2' reported link was fine, and the interface has a
>>statically assigned IP. When the machine started reporting 'no route
>>to host' for /some/ packets (completely losing the rest) when pinging
>>another machine,
> 
> Hmm, interesting. I've encountered a similar-sounding problem with
> kernels <2.6. There's a particular machine with Intel EtherExpress Pro
> cards (IIRC) that's always given trouble, regardless of kernel version.

I use a PCI-X e1000 on this machine. That's _very_ interesting. Are you 
using the eepro100 (Donald Becker) or e100 (Intel) driver for your eepro/100?

> Mostly, problems under load (though it doesn't take much to make it act
> "heavily loaded", because it's bad at I/O on the whole). Occasionally it
> says "too much work" and stops responding on an interface.

I didn't see anything along those lines in dmesg.

What I do see, shortly before the crash, is a bunch of errors like this:

Jun 16 17:56:28 bucket mount.smbfs[15421]: tdb(/var/lib/samba/gencache.tdb): tdb_lock failed on list 10 ltype=0 (Bad file descriptor)
Jun 16 17:56:28 bucket mount.smbfs[15421]: [2004/06/16 17:56:28, 0] tdb/tdbutil.c:tdb_log(724)

from smbfs. I was noticing problems with smbfs (well, even more problems 
than smbfs usually causes) before the crash. Still, I think it likely 
that these errors are just related to the general networking failure.

This is also interesting:

Jun 16 17:54:19 bucket kernel: NETDEV WATCHDOG: eth2: transmit timed out
(repeated several times over the half hour before the crash)

and there are _lots_ of timeout errors from gdm, afpd, imapd, pop3d, and 
smbfs. Nothing else obviously raises a red flag, but the logs are _very_ 
noisy, so it'll take some careful filtering to analyse them properly.
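
A minimal sketch of that sort of filtering, assuming the usual 
"Mon DD HH:MM:SS host program[pid]: message" syslog layout seen in the 
lines quoted above. The function name and the "timed out|timeout" pattern 
are illustrative choices, not anything taken from the actual logs; it just 
counts timeout messages per program and notes when each first appeared:

```python
import re
from collections import Counter

# Assumed layout: "Jun 16 17:54:19 bucket kernel: message text"
SYSLOG_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s\d{2}:\d{2}:\d{2})\s"   # timestamp
    r"(?P<host>\S+)\s"                              # hostname
    r"(?P<prog>[^\s:\[]+)(?:\[\d+\])?:\s*"          # program, optional [pid]
    r"(?P<msg>.*)$"                                 # rest of the line
)

# Hypothetical pattern; adjust to whatever the daemons really log.
TIMEOUT_RE = re.compile(r"timed out|timeout", re.IGNORECASE)


def summarise_timeouts(lines):
    """Return (count per program, first timestamp per program)."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        m = SYSLOG_RE.match(line.rstrip("\n"))
        if m is None or not TIMEOUT_RE.search(m.group("msg")):
            continue  # not a syslog line, or not a timeout message
        prog = m.group("prog")
        counts[prog] += 1
        first_seen.setdefault(prog, m.group("ts"))
    return counts, first_seen

# Usage (the log path is an assumption; it varies by distribution):
#   with open("/var/log/messages") as f:
#       counts, first_seen = summarise_timeouts(f)
#   for prog, n in counts.most_common():
#       print(first_seen[prog], prog, n)
```

Sorting by first-seen time rather than by count should show whether the 
kernel watchdog messages precede the daemon timeouts or follow them.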

I didn't see anything interesting in dmesg when examining it before the 
crash, either, but I was in rather a hurry...

> I can connect
> via a different interface, but then TCP connections only last a few
> minutes before they stall. I can keep starting new TCP connections for a
> while, but eventually the interface will fail like the previous one.
> There is certainly some way of temporarily recovering the interfaces
> from the console, yet not the existing connections, so it still acts
> pretty screwed until rebooted.

That does sound vaguely similar, yes. Odd. I didn't get a chance to try 
another interface (the two eepro/100 interfaces are currently unused, 
and I didn't have time to fiddle around) but my experience with the 
gigabit interface matches what you mention, in terms of a temporary 
recovery and weird stalling.

--
Craig Ringer