[8186] in Release_7.7_team


Re: Some additional info on "hanging on login" issue and change

daemon@ATHENA.MIT.EDU (Sar Haidar)
Fri Jan 15 08:45:18 2016

From: Sar Haidar <shaidar@mit.edu>
To: Jonathan D Reed <jdreed@mit.edu>
CC: Oliver Thomas <othomas@mit.edu>, release-team <release-team@mit.edu>,
        Patricia Sheppard <pshepp@mit.edu>,
        Matthew Harrington <mjharrin@mit.edu>
Date: Fri, 15 Jan 2016 13:45:13 +0000
Message-ID: <8DBA26BF-77C2-4B66-9C54-CD7232120715@mit.edu>
In-Reply-To: <B3944CD5-1864-4BF1-BD98-56CDFD4B155C@mit.edu>
Content-Language: en-US
Content-Type: multipart/signed;
	boundary="Apple-Mail=_39EFCCC1-4BDA-4A46-BAC3-246B5B7B1454";
	protocol="application/pkcs7-signature"; micalg=sha1
MIME-Version: 1.0

--Apple-Mail=_39EFCCC1-4BDA-4A46-BAC3-246B5B7B1454
Content-Type: multipart/alternative;
	boundary="Apple-Mail=_A6E23E39-1D73-41AB-A3D0-F80A358A1B70"


--Apple-Mail=_A6E23E39-1D73-41AB-A3D0-F80A358A1B70
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=windows-1252

Here's what I have:
66-080
Dell 9030 AIO
8GB RAM
Intel HD Graphics
256GB SSD

56-129
Dell Optiplex 7010
8GB RAM
Intel Generic Graphic

11 & W20-575
Dell 9020 AIO
8GB RAM
Intel HD Graphics
256GB SSD

Cluster  Login    GNOME    No customization  Notes
-------  -------  -------  ----------------  -----
66       works    works    works             66-080-4 very inconsistent. Login
                                             hangs at times and works at
                                             others. GNOME/No customization
                                             always worked.
56       crashes  crashes  crashes           m56-129-20
                                             23 - inconsistent (worked a few
                                             times)
11       crashes  works    works             quickstation-11-4
W20      crashes  works    works             w20-575-21 didn't work most
                                             logins, but I was able to login
                                             once to default from the 6-7
                                             tries.

Testing has been difficult because the behavior is inconsistent. Not all
machines in the clusters are down, and some machines eventually work and
then stop working again after a reboot. I also noticed (mainly on the 11
and W20-575 machines) that logging into GNOME, then logging out and
trying to log back in to Default Athena, sometimes ends up working (it
worked a couple of times and failed once). In addition, and again
inconsistently, after a manual reboot following a login crash, I was
able to log in to Default Athena. The most consistent behavior (not
working) I saw today was on machines in 56, specifically m56-129-20.

Please let me know if there's anything else I can try out.

Sar


> On Jan 14, 2016, at 10:34 PM, Jonathan D Reed <jdreed@mit.edu> wrote:
>
> On Jan 14, 2016, at 8:31 PM, Oliver Thomas <othomas@mit.edu> wrote:
>
>> Hi everyone,
>>
>> Garry, Sar, and I had a chance this afternoon to look at the machines
>> in 11, which are also exhibiting the hanging-on-login issue. The
>> problem appears to be machine specific (on machines where it occurs,
>> it affects any user doing an xsession login.) Garry was able to
>> confirm that removing /etc/X11/Xsession.d/90qt-a11y resolves the
>> issue.
>
> I was able to reproduce the problem in an account-specific way during
> my initial debugging.  I'll note the only machine where I could
> consistently and deterministically cause it to fail had 4GB of RAM, so
> it's possible some race condition is avoided by excessive swapping,
> which is kind of terrifying.
>
>> Understanding that we don't fully understand the root cause, the
>> problem is affecting enough machines that the band-aid solution may
>> be necessary for now.
>
> My money is on some sort of weird race condition, possibly involving
> OpenAFS.  That, I think, is the only variable.  dconf and gconf have
> not changed since 2014, according to their changelogs.  It could of
> course be any key or schema in {d,g}conf that's causing this.  You
> can't attach a debugger to a zombie process, but as far as the kernel
> stack was concerned, it was stuck in exit(2), which makes absolutely
> no sense.
>
>> Does anyone have the time to implement the fix to divert
>> /etc/X11/Xsession.d/90qt-a11y?
>
> I should have clarified, this is a "hide" operation as far as
> config-package-dev is concerned, because we will not be shipping our
> own variant or replacement of the file, simply moving it out of the
> way.  I'm 99% certain we already hide things in
> debathena-cluster-login-config, so that's probably as good a choice as
> any; we've abused that package over the years for precisely this sort
> of thing, and the .hide file probably already exists.
>
>> I think once there is a deployable change we can quickly run it
>> through our standard change approval process (there's an expedited
>> path for urgent changes), which includes a sign-off.
>>
>> Many thanks,
>>
>> Oliver
>>
>> (There is also the second issue out there, Compiz segfaulting on some
>> machines immediately after a standard login. With the research Jon's
>> done we're pretty sure that is also an upstream issue, but there is
>> at least an end-user accessible workaround since ignoring
>> customizations or choosing a different graphical login session such
>> as GNOME will still work on machines where the problem occurs.)
>
> Wait, I'm confused.  Choosing a different session type should avoid
> the problem (because you get twm or sawfish or something instead of
> compiz), but ignoring customizations absolutely should not have any
> effect, because Compiz is still launched.  If ignoring customizations
> _does_ consistently avoid the segfaults, then I have absolutely no
> idea what's going on, because "ignore customizations" only affects the
> shell and xsession startup.  If the segfault can be deterministically
> reproduced, a reasonable step might be to turn on unlimited coredumps,
> and examine one to see what's going on.  When we had the problem on
> the 790s in April, that was segfaulting inside the radeon driver, not
> compiz itself.
>
> (And to completely close the loop on that, the root cause of that was
> identified and understood: Ubuntu changed their kernel cleanup code
> and it was fighting with our kernel cleanup code, resulting in some
> machines continuing to run old kernels, even after having reported
> taking an update.  While the symptoms of this may be the same, the
> root cause is clearly different, assuming it has been verified that
> the machines are in fact running the latest kernels.)
>
> -Jon
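
For reference, the zombie state Jon describes above can be confirmed
from /proc without attaching a debugger. The following is only a
minimal sketch, assuming a Linux /proc filesystem; the helper names
parse_proc_stat and read_state are made up for illustration and are
not part of any Debathena tooling. On kernels built with stack-trace
support, root can usually also read the kernel-side stack from
/proc/<pid>/stack.

```python
def parse_proc_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split on the LAST ')' instead of on whitespace.
    """
    head, _, tail = stat_text.rpartition(")")
    pid_text, _, comm = head.partition("(")
    state = tail.split()[0]  # e.g. 'R', 'S', 'D', or 'Z' for a zombie
    return int(pid_text), comm, state


def read_state(pid):
    """Read and parse /proc/<pid>/stat for a live PID (Linux only)."""
    with open(f"/proc/{pid}/stat") as handle:
        return parse_proc_stat(handle.read())


# Hypothetical usage on an affected machine, with a made-up PID:
#     pid, comm, state = read_state(1234)
#     if state == "Z":
#         print(f"{comm} ({pid}) is a zombie")
```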

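On the suggestion above to turn on unlimited coredumps: one way to do
that from a test harness is to raise the core-size limit before
starting the program under test. This is a rough sketch using Python's
standard resource module; raise_core_limit is an illustrative name.
It only affects the calling process and its children, so for a
graphical login the limit would have to be raised earlier in session
startup (for example through the system's limits configuration), and
on Ubuntu the kernel's core_pattern setting may hand cores to a crash
handler instead of writing a file.

```python
import resource


def raise_core_limit():
    """Raise the soft core-dump size limit to the hard limit.

    If the hard limit is resource.RLIM_INFINITY, core dumps become
    unlimited for this process and anything it starts afterwards.
    Returns the (soft, hard) pair now in effect.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


# Hypothetical usage: call raise_core_limit() and then launch the
# window manager from the same process so it inherits the limit.
```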

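Jon's closing caveat (that the machines should be verified as running
the latest kernel) could be checked with something like the sketch
below. It assumes the Debian/Ubuntu convention of kernel images named
/boot/vmlinuz-<release>; the function names are invented here, and the
ordering is a simplified numeric comparison, not the full Debian
version algorithm.

```python
import glob
import os
import platform
import re


def version_key(release):
    """Turn a release string into a tuple of its numeric fields."""
    return tuple(int(number) for number in re.findall(r"\d+", release))


def installed_releases(boot_dir="/boot"):
    """List kernel releases that have a vmlinuz-<release> image."""
    prefix = "vmlinuz-"
    pattern = os.path.join(boot_dir, prefix + "*")
    return [os.path.basename(path)[len(prefix):] for path in glob.glob(pattern)]


def newest_installed(releases):
    """Pick the release with the highest numeric version."""
    return max(releases, key=version_key)


def running_latest(running, releases):
    """True if the running kernel is at least as new as any installed one."""
    if not releases:
        return False
    return version_key(running) >= version_key(newest_installed(releases))


# Hypothetical usage on a cluster machine:
#     print(running_latest(platform.release(), installed_releases()))
```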
--Apple-Mail=_A6E23E39-1D73-41AB-A3D0-F80A358A1B70--

--Apple-Mail=_39EFCCC1-4BDA-4A46-BAC3-246B5B7B1454--
