[7192] in testers


linux update failures in w20 cluster analysis

daemon@ATHENA.MIT.EDU (Garry Zacheiss)
Wed Jun 22 07:03:12 2005

Message-Id: <200506221103.j5MB314B000240@sipb-office-escape-pod.mit.edu>
To: testers@MIT.EDU
Date: Wed, 22 Jun 2005 07:03:01 -0400
From: Garry Zacheiss <zacheiss@MIT.EDU>

I spent some time over the past few days looking at the w20 early
cluster Linux machines that failed to update.  What I learned is that
this:

>> + glibc                     warning: /etc/ld.so.conf created as /etc/ld.so.conf.rpmnew
>> ###########################################error: %post(glibc-2.3.4-2) scriptlet failed, exit status 0
>>  [  1%]

isn't telling the whole story.  At the same time this happens, the
following gets syslogged:

kernel: audit_intercept: error 38, killing task

which is what causes the update to fail.  This appears to occur on all
autoupdates from 9.3 to 9.4, but (in my limited sample set) never when
the update is run manually.  
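
To check other machines for the same symptom, something like the
following could scan a syslog file for that kernel message.  This is
only a sketch: the function name and the sample log lines are invented
for illustration, and it assumes the message always has the form quoted
above.

```python
import re

# Pattern for the kernel message quoted above; the numeric code is captured.
AUDIT_KILL = re.compile(r"kernel: audit_intercept: error (\d+), killing task")

def find_audit_kills(lines):
    """Return (line_number, error_code) for each matching syslog line."""
    hits = []
    for n, line in enumerate(lines, start=1):
        m = AUDIT_KILL.search(line)
        if m:
            hits.append((n, int(m.group(1))))
    return hits

# Invented sample input; real syslog lines will differ in timestamp and host.
sample = [
    "Jun 22 06:58:01 examplehost kernel: audit_intercept: error 38, killing task",
    "Jun 22 06:58:02 examplehost some unrelated message",
]
print(find_audit_kills(sample))  # prints [(1, 38)]
```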

That error message is coming from the audit subsystem in the 2.4 kernel,
aka laus.  I don't know what's causing it, and neither does the web;
debugging it looks painful, and I'm not sure it's worth the effort for
reasons discussed below.

I know we had discussed finding new love for laus in light of the
trojaned xlogin in recent months, but laus no longer exists in RHEL4,
and the package that seems the most likely replacement ("audit") isn't
installed in the 9.4 release, so I don't think we care.  The path of
least resistance is punting the laus package from the 9.3 release; it
might be sufficient to have the update script on Linux kill auditd.  I
have a machine in the w20 cluster (w20-575-42) that will test this
hypothesis as soon as the hesiod DCM occurs and it joins the early
cluster.
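
For illustration, the "kill auditd" step might look roughly like this.
It is a sketch, not the actual update script (which is not shown here
and is not written in Python); the function name is made up, and it
assumes killall is available on the machine.

```python
import subprocess

def kill_auditd(run=subprocess.call):
    """Ask killall to terminate auditd; return True if a process was signalled.

    `run` is injectable so the logic can be exercised without touching
    real processes (an assumption made for this sketch).
    """
    # killall exits 0 when at least one matching process was signalled.
    return run(["killall", "auditd"]) == 0
```

The update script would call something like this before starting the
package update, and carry on regardless of the result, since a machine
without auditd running has nothing to kill.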

I have tested that after the first failed update, the second will
succeed, since the machine is now running all 9.4 packages, so long as
the public workstation verification doesn't run.  When it runs after the
failed 9.4 update, it downgrades the machine to RHEL3 again (although
less successfully than one would hope, if the number of error messages
from failed RPM scriptlets is to be believed) and we're back in the same
situation we started with.

Assuming the testing goes successfully, I'll leave it up to Greg as to
whether we want to fix this by hacking the update script or if we want
to put out a 9.3.19 that removes the laus package.

Garry
