[plug] the GNOME panel that just won't die

Tue Jun 15 14:54:48 WST 2004

Hi all

I'm running into a real head-scratcher here, and was hoping to get some 
assistance or ideas.

A user here has had the GNOME panel hang a couple of times recently. Odd 
and very annoying, but killing it and restarting it has always done the 
trick.

Not so this time.

The panel just wouldn't die. It looks like the GNOME segfault handler
ran when the panel crashed, leaving the panel flagged as traced. This
happened before, too - or at least the process was in T state when I
killed it.

Mentally swearing at GNOME, I tried to kill the panel as I had the
previous time - but it didn't die. Not even kill -9 would kill it. A
little more looking around revealed the GNOME segfault handler with the
panel as its parent (ppid) ... but the segfault handler is itself a
defunct process (Z state).

from "ps aux":

USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
aja      28611  0.0  0.5 21196 10960 ?       S    09:12   0:03 gnome-panel
jen       6442  0.0  0.0     0    0 ?        Z    14:13   0:00 [gnome_segv2] <defunct>

[craig]$ ps -e --format "pid user ppid wchan cmd" | grep jen
  6442 jen       2648 exit   [gnome_segv2] <defunct>
  2648 jen          1 finish gnome-panel --sm-config-prefix /gnome-panel-0iv5Wq/ --sm-client-id 110a000004000107517969800000242930001 --profile default
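
For anyone who wants to check their own box for the same thing, here's a
rough sketch of a filter for stopped/traced and zombie processes. It
assumes a procps-style ps that accepts the pid, ppid, stat, user and comm
column names; list_stuck is just a name I made up for illustration.

```shell
# list_stuck: hypothetical helper, not part of any package.
# Reads rows of "pid ppid stat user comm" on stdin and prints only those
# whose state starts with T (stopped/traced), t (tracing stop, as reported
# by newer kernels) or Z (zombie).
list_stuck() {
    awk '$3 ~ /^[TtZ]/ { print }'
}

# Typical use (assumes procps ps; the trailing "=" suppresses the headers):
#   ps -e -o pid= -o ppid= -o stat= -o user= -o comm= | list_stuck
```

The ppid column is the useful bit: a zombie only leaves the process table
once its parent reaps it, so the parent is the process to look at.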

I've worked around the problem for now by logging the user out, killing 
gconfd and the bonobo-activation-server, and logging them back in. GNOME 
could no longer see the 'zombie' panel, so it started a new one and is 
working OK. I still have an unkillable process on my server, though, and 
that is not something that makes me happy.

The machine in question used to be RH8, but has evolved over time. It
runs kernel 2.6.3 (soon to be 2.6.6, as it's going down for an upgrade
anyway). I didn't want to go to 2.6 but we needed some of the disk
elevator improvements quite badly. Aside from the kernel and a few
upgraded apps it's still mostly RH8. GNOME is unmodified from the RH8
GNOME. The machine has ample RAM (ECC DDR) and storage (RAID) and is
otherwise 100% rock solid, so the chances of this being hardware related
are slim to none.

Uptime is 102 days. We've done better than that before, but not all that
much better. It's not like we're using WinNT, where uptime is a plausible
explanation for "it's going insane", though. (I just rebooted the NT box
last week after over 90 days of uptime - not too bad for a Windows
server).

As you can imagine, this is driving me nuts. One process is gone but 
won't leave the process table, and another one won't finish terminating.
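
In case it helps anyone suggest something, this is roughly how I'd pull
the relevant fields for one of these processes out of /proc. It's only a
sketch: it assumes a Linux /proc whose per-process status file has State,
PPid and TracerPid lines (TracerPid may be missing on some kernels), and
proc_summary and its optional second argument are made up for
illustration.

```shell
# proc_summary: hypothetical helper.
# $1 = PID, $2 = proc mount point (optional, defaults to /proc).
# Prints the State, PPid and TracerPid lines from the status file, if present.
proc_summary() {
    status_file="${2:-/proc}/$1/status"
    if [ ! -r "$status_file" ]; then
        echo "cannot read $status_file" >&2
        return 1
    fi
    grep -E '^(State|PPid|TracerPid):' "$status_file"
}

# Typical use, with the PID of the stuck process:
#   proc_summary 2648
```

A non-zero TracerPid would point at whatever is still attached as the
tracer.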

As an aside, we have a nice shiny KDE 3.2.1 but the users don't like it, 
so they're sticking with GNOME. This is painful for me, as GNOME 2.0.1 
is ... less than entirely stable, and upgrading GNOME has proved 
impractical. Maybe GARNOME has improved since I last tried.

Any opinions or comments on the "virtual server on server using UML"
(User-Mode Linux) approach? I'm _very_ tempted to keep the terminal users
in a UML instance so that their environment can easily be cloned,
upgraded, forked off for a testbed, etc. I'd be very interested in any
info on production experience with UML, especially in environments with
interactive, memory-heavy processes.

--
Craig Ringer