[8183] in Release_7.7_team
Some additional info on "hanging on login" issue and change request
daemon@ATHENA.MIT.EDU (Oliver Thomas)
Thu Jan 14 20:32:03 2016
From: Oliver Thomas <othomas@mit.edu>
To: release-team <release-team@mit.edu>, Patricia Sheppard <pshepp@mit.edu>,
Matthew Harrington <mjharrin@mit.edu>
CC: Oliver Thomas <othomas@mit.edu>
Date: Fri, 15 Jan 2016 01:31:58 +0000
Message-ID: <3C7CDA14-3432-4B0B-AAC9-953526338C93@mit.edu>
Content-Language: en-US
Content-Type: multipart/alternative;
boundary="_000_3C7CDA1434324B0BAAC9953526338C93mitedu_"
MIME-Version: 1.0
--_000_3C7CDA1434324B0BAAC9953526338C93mitedu_
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Hi everyone,
Garry, Sar, and I had a chance this afternoon to look at the machines
in 11, which are also exhibiting the hanging-on-login issue. The problem
appears to be machine-specific: on machines where it occurs, it affects
any user doing an xsession login. Garry was able to confirm that
removing /etc/X11/Xsession.d/90qt-a11y resolves the issue.
Although we don't yet fully understand the root cause, the problem is
affecting enough machines that the band-aid solution may be necessary
for now.
Does anyone have the time to implement the fix to divert
/etc/X11/Xsession.d/90qt-a11y?
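As a starting point for whoever picks this up, here is a minimal sketch
of what the diversion could look like using dpkg-divert. The ".disabled"
suffix and the helper names are hypothetical, not an agreed convention,
and this is not necessarily how the change would be packaged for the
clusters. As far as I know, Xsession only sources files whose names
consist of letters, digits, underscores, and hyphens, so a diverted name
containing a dot should no longer be run.

```python
import subprocess

TARGET = "/etc/X11/Xsession.d/90qt-a11y"
# Hypothetical suffix (an assumption, not an existing convention).
SUFFIX = ".disabled"


def divert_command(path=TARGET, suffix=SUFFIX):
    """Return the dpkg-divert argv that moves `path` aside locally."""
    return [
        "dpkg-divert", "--local", "--rename",
        "--divert", path + suffix,
        "--add", path,
    ]


def remove_divert_command(path=TARGET):
    """Return the argv that removes the diversion and restores the file."""
    return ["dpkg-divert", "--rename", "--remove", path]


def run(argv, dry_run=True):
    """Print the command; only execute it when dry_run is False (needs root)."""
    print(" ".join(argv))
    if not dry_run:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    run(divert_command())  # dry run: prints the command only
```

Running the script as-is only prints the command, which makes it easy
to review before anything is changed on a production machine.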
I think once there is a deployable change we can quickly run it through
our standard change approval process (there's an expedited path for
urgent changes), which includes a sign-off.
Many thanks,
Oliver
(There is also a second issue out there: Compiz segfaulting on some
machines immediately after a standard login. With the research Jon's
done, we're pretty sure that is also an upstream issue, but there is at
least an end-user-accessible workaround, since ignoring customizations
or choosing a different graphical login session such as GNOME will
still work on machines where the problem occurs.)
On Dec 22, 2015, at 13:51, Jonathan D Reed <jdreed@mit.edu> wrote:
...
1) Changes to production, going forward
I took an extended lunch break today and took a look at the machines in
Hayden which were hanging on login. There does indeed seem to be
something going on, possibly account-dependent, and/or possibly an
obscure race condition or an upstream bug involving gsettings.
/etc/X11/Xsession.d/90qt-a11y contains a single invocation of
"gsettings get", which should not be even a little bit controversial,
yet somehow is causing gsettings to become a zombie process. A stack
trace is unenlightening, and people who have a better understanding of
the kernel and userspace than I do are out of ideas.
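For anyone who wants to check whether a given machine is showing the
symptom, here is a minimal sketch that scans /proc for processes named
gsettings in the zombie (Z) state. It assumes a Linux /proc filesystem,
and the function names are made up for illustration; it only detects
the zombie, it does not explain it.

```python
import glob


def parse_stat(line):
    """Split a /proc/<pid>/stat line into (pid, comm, state).

    comm is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the last ')' rather than on whitespace.
    """
    head, _, tail = line.rpartition(")")
    pid, _, comm = head.partition("(")
    return int(pid), comm, tail.split()[0]


def find_zombies(name="gsettings"):
    """Return sorted PIDs of processes called `name` in state Z."""
    pids = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                pid, comm, state = parse_stat(f.read())
        except OSError:
            continue  # process exited while we were scanning
        if comm == name and state == "Z":
            pids.append(pid)
    return sorted(pids)


if __name__ == "__main__":
    print(find_zombies())
```

An empty list means no zombie gsettings process was found at the moment
of the scan, so it is worth running during a login that is hanging.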
The quick-and-dirty fix is to simply divert that file out of the way,
but that's clearly a band-aid, because the root cause has not been
identified. Neither gsettings nor dconf has changed since 2014, but the
Unity login process is a maze of twisty passages, and it would take me
at least 40 FTE hours to eliminate all possible culprits (and I don't
have dedicated hardware anymore, even if I had the free time).
That having been said, I'd like to come up with a process for someone
in IS&T to sign off on changes to the clusters in production, lest
something go wrong. Obviously we can roll back, and we have the update
hook, but the previous workflow policy operated under the assumption
that there was full-time staff devoted to development and release
engineering, and that's no longer true.
...
--_000_3C7CDA1434324B0BAAC9953526338C93mitedu_--