[8184] in Release_7.7_team

Re: Some additional info on "hanging on login" issue and change

daemon@ATHENA.MIT.EDU (Jonathan D Reed)
Thu Jan 14 22:34:52 2016

From: Jonathan D Reed <jdreed@mit.edu>
To: Oliver Thomas <othomas@mit.edu>
CC: release-team <release-team@mit.edu>, Patricia Sheppard <pshepp@mit.edu>,
        Matthew Harrington <mjharrin@mit.edu>
Date: Fri, 15 Jan 2016 03:34:47 +0000
Message-ID: <B3944CD5-1864-4BF1-BD98-56CDFD4B155C@mit.edu>
In-Reply-To: <3C7CDA14-3432-4B0B-AAC9-953526338C93@mit.edu>
Content-Language: en-US
Content-Type: text/plain; charset="Windows-1252"
Content-ID: <6F56E8797B7DA8449A52D3416755E84C@exchange.mit.edu>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit


On Jan 14, 2016, at 8:31 PM, Oliver Thomas <othomas@mit.edu> wrote:

> Hi everyone,
> 
> Garry, Sar, and I had a chance this afternoon to look at the machines in 11, which are also exhibiting the hanging-on-login issue. The problem appears to be machine-specific: on machines where it occurs, it affects any user doing an xsession login. Garry was able to confirm that removing /etc/X11/Xsession.d/90qt-a11y resolves the issue.

I was able to reproduce the problem in an account-specific way during my initial debugging.  I’ll note the only machine where I could consistently and deterministically cause it to fail had 4GB of RAM, so it’s possible some race condition is avoided by excessive swapping, which is kind of terrifying.

> Understanding that we don't fully understand the root cause, the problem is affecting enough machines that the band-aid solution may be necessary for now.

My money is on some sort of weird race condition, possibly involving OpenAFS.  That, I think, is the only variable.  dconf and gconf have not changed since 2014, according to their changelogs.  It could of course be any key or schema in {d,g}conf that’s causing this.  You can’t attach a debugger to a zombie process, but as far as the kernel stack was concerned, it was stuck in exit(2), which makes absolutely no sense.
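
For anyone retracing this, here is a rough sketch of one way to find zombie processes and see where the kernel thinks they are. The helper name zombie_pids is made up for illustration, and reading /proc/PID/stack assumes root access and a kernel that exposes that file.

```shell
# Hypothetical helper: read "pid ppid stat comm" lines on stdin and
# print the PID of every process whose state starts with Z (zombie).
zombie_pids() {
    awk '$3 ~ /^Z/ { print $1 }'
}

# Show parent, state, and wait channel for each zombie found, then try
# to dump its kernel stack.
ps -eo pid=,ppid=,stat=,comm= | zombie_pids | while read -r pid; do
    ps -o pid,ppid,stat,wchan,comm -p "$pid"
    # Usually needs root; the file may be absent on some kernels.
    cat "/proc/$pid/stack" 2>/dev/null || echo "no kernel stack readable for $pid"
done
```

The parent PID column is worth a look too, since a zombie lingers until its parent reaps it.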

> Does anyone have the time to implement the fix to divert /etc/X11/Xsession.d/90qt-a11y?

I should have clarified: this is a “hide” operation as far as config-package-dev is concerned, because we will not be shipping our own variant or replacement of the file, simply moving it out of the way.  I’m 99% certain we already hide things in debathena-cluster-login-config, so that’s probably as good a choice as any; we’ve abused that package over the years for precisely this sort of thing, and the .hide file probably already exists.

> I think once there is a deployable change we can quickly run it through our standard change approval process (there's an expedited path for urgent changes), which includes a sign-off.
> 
> Many thanks,
> 
> Oliver
> 
> (There is also the second issue out there, Compiz segfaulting on some machines immediately after a standard login. With the research Jon's done we're pretty sure that is also an upstream issue, but there is at least an end-user accessible workaround since ignoring customizations or choosing a different graphical login session such as GNOME will still work on machines where the problem occurs.)

Wait, I’m confused.  Choosing a different session type should avoid the problem (because you get twm or sawfish or something instead of compiz), but ignoring customizations absolutely should not have any effect, because Compiz is still launched.  If ignoring customizations _does_ consistently avoid the segfaults, then I have absolutely no idea what’s going on, because “ignore customizations” only affects the shell and xsession startup.  If the segfault can be deterministically reproduced, a reasonable step might be to turn on unlimited coredumps, and examine one to see what’s going on.  When we had the problem on the 790s in April, that was segfaulting inside the radeon driver, not compiz itself.
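
A minimal sketch of the coredump step, assuming a POSIX-style shell on Linux. The helper name enable_core_dumps is hypothetical, and whether a core file actually lands on disk depends on the machine's core_pattern setting.

```shell
# Hypothetical helper: raise the core size soft limit for this shell and
# its children, and report what the limit ended up being.
enable_core_dumps() {
    ulimit -c unlimited 2>/dev/null ||
        echo "could not raise core limit (hard limit too low?)" >&2
    echo "core limit: $(ulimit -c)"
}

enable_core_dumps

# Where the kernel sends cores; on Ubuntu this is often a pipe to apport
# rather than a plain file name.
cat /proc/sys/kernel/core_pattern 2>/dev/null || true
```

The limit is inherited only by processes started from this shell, so to catch Compiz it would have to be set somewhere in the session startup path before Compiz is launched.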

(And to completely close the loop on that, the root cause of that was identified and understood: Ubuntu changed their kernel cleanup code and it was fighting with our kernel cleanup code, resulting in some machines continuing to run old kernels, even after having reported taking an update.  While the symptoms of this may be the same, the root cause is clearly different, assuming it has been verified that the machines are in fact running the latest kernels.)
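
One rough way to do that verification is to compare the running kernel against the newest kernel image installed in /boot. This is only a sketch: newest_installed_kernel is a made-up helper name, it assumes kernel images are named /boot/vmlinuz-VERSION, and it relies on GNU sort for version ordering.

```shell
# Hypothetical helper: print the highest kernel version that has an
# image in the given directory (default /boot).
newest_installed_kernel() {
    ls "${1:-/boot}"/vmlinuz-* 2>/dev/null |
        sed -e 's|.*/vmlinuz-||' -e 's|\.efi\.signed$||' |
        sort -V | tail -n 1
}

running=$(uname -r)
newest=$(newest_installed_kernel)

if [ "$running" = "$newest" ]; then
    echo "running newest installed kernel: $running"
else
    echo "running $running, newest installed is $newest"
fi
```

A mismatch only shows that the machine has not booted into its newest installed kernel; it does not say whether the newest installed kernel is the latest one available.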

-Jon
