[8183] in Release_7.7_team


Some additional info on "hanging on login" issue and change request

daemon@ATHENA.MIT.EDU (Oliver Thomas)
Thu Jan 14 20:32:03 2016

From: Oliver Thomas <othomas@mit.edu>
To: release-team <release-team@mit.edu>, Patricia Sheppard <pshepp@mit.edu>,
        Matthew Harrington <mjharrin@mit.edu>
CC: Oliver Thomas <othomas@mit.edu>
Date: Fri, 15 Jan 2016 01:31:58 +0000
Message-ID: <3C7CDA14-3432-4B0B-AAC9-953526338C93@mit.edu>
Content-Language: en-US
Content-Type: multipart/alternative;
	boundary="_000_3C7CDA1434324B0BAAC9953526338C93mitedu_"
MIME-Version: 1.0

--_000_3C7CDA1434324B0BAAC9953526338C93mitedu_
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

Hi everyone,

Garry, Sar, and I had a chance this afternoon to look at the machines in
11, which are also exhibiting the hanging-on-login issue. The problem
appears to be machine-specific: on machines where it occurs, it affects
any user doing an xsession login. Garry was able to confirm that removing
/etc/X11/Xsession.d/90qt-a11y resolves the issue.

Although we don't yet fully understand the root cause, the problem is
affecting enough machines that the band-aid solution may be necessary for
now.

Does anyone have the time to implement the fix to divert
/etc/X11/Xsession.d/90qt-a11y?
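
For anyone picking this up, here is a rough sketch of what the diversion
could look like, assuming the change is made with dpkg-divert. The helper
name divert_command and the ".divert-orig" suffix are placeholders of mine,
not an existing convention, and a packaged change would probably use
--package instead of --local. The script only builds and prints the
command; it does not run it.

```python
import shlex


def divert_command(path, suffix=".divert-orig", remove=False):
    """Build a dpkg-divert command that moves `path` aside, or restores it.

    `suffix` is a hypothetical name for the diverted copy.
    """
    diverted = path + suffix
    action = "--remove" if remove else "--add"
    # --rename makes dpkg-divert actually move the file on disk.
    return ["dpkg-divert", "--local", "--rename",
            "--divert", diverted, action, path]


# Print the commands for review instead of executing them.
target = "/etc/X11/Xsession.d/90qt-a11y"
print(shlex.join(divert_command(target)))
print(shlex.join(divert_command(target, remove=True)))
```

The second printed command is the rollback. As far as I know, Xsession
skips files in Xsession.d whose names contain a dot, so the diverted copy
should not be sourced, but that is worth confirming on a test machine.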

I think once there is a deployable change, we can quickly run it through
our standard change approval process (there's an expedited path for urgent
changes), which includes a sign-off.

Many thanks,

Oliver

(There is also a second issue out there: Compiz segfaulting on some
machines immediately after a standard login. With the research Jon's done,
we're pretty sure that is also an upstream issue, but there is at least an
end-user-accessible workaround, since ignoring customizations or choosing
a different graphical login session such as GNOME will still work on
machines where the problem occurs.)

On Dec 22, 2015, at 13:51, Jonathan D Reed <jdreed@mit.edu> wrote:
...
1) Changes to production, going forward

I took an extended lunch break today and took a look at the machines in
Hayden which were hanging on login. There does indeed seem to be something
going on, possibly account dependent, and/or possibly an obscure race
condition or an upstream bug involving gsettings.
/etc/X11/Xsession.d/90qt-a11y contains a single invocation of "gsettings
get" which should not be even a little bit controversial, yet somehow is
causing gsettings to become a zombie process. A stack trace is
unenlightening, and people who have a better understanding of the kernel
and userspace than I do are out of ideas.
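
One way to check a given machine for the hang without going through a full
login might be to run the same command under a timeout. This is only a
sketch: finishes_within is a made-up helper, and the demo uses stand-in
commands because the exact "gsettings get" arguments are not quoted in this
message.

```python
import subprocess
import sys


def finishes_within(argv, seconds):
    """Return True if argv exits within `seconds`, False if it timed out."""
    try:
        subprocess.run(argv, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, timeout=seconds)
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return False


# Stand-in commands; substitute the line copied from 90qt-a11y when testing.
quick = [sys.executable, "-c", "pass"]
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
print(finishes_within(quick, 10))
print(finishes_within(slow, 0.5))
```

A False result for the real command on an affected machine, and True on a
healthy one, would at least give a quick way to tell the two apart.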

The quick-and-dirty fix is to simply divert that file out of the way, but
that's clearly a band-aid, because the root cause has not been identified.
Neither gsettings nor dconf has changed since 2014, but the Unity login
process is a maze of twisty passages, and it would take me at least 40 FTE
hours to eliminate all possible culprits (and I don't have dedicated
hardware anymore, even if I had the free time).

That having been said, I'd like to come up with a process for someone in
IS&T to sign off on changes to the clusters in production, lest something
go wrong. Obviously we can roll back, and we have the update hook, but the
previous workflow policy operated under the assumption that there was
full-time staff devoted to development and release engineering, and that's
no longer true.
...




--_000_3C7CDA1434324B0BAAC9953526338C93mitedu_--
