[7194] in testers
Re: linux update failures in w20 cluster analysis
daemon@ATHENA.MIT.EDU (William Cattey)
Wed Jun 22 10:59:52 2005
In-Reply-To: <200506221103.j5MB314B000240@sipb-office-escape-pod.mit.edu>
Mime-Version: 1.0 (Apple Message framework v622)
Content-Type: text/plain; charset=US-ASCII; format=flowed
Message-Id: <81583e886fe2fa84537defd290753c18@mit.edu>
Content-Transfer-Encoding: 7bit
Cc: testers@mit.edu
From: William Cattey <wdc@MIT.EDU>
Date: Wed, 22 Jun 2005 10:59:37 -0400
To: Garry Zacheiss <zacheiss@mit.edu>
Fascinating.
That would, perhaps, explain why my system auto-updated without incident.
I hand-removed laus (had my hand-removal blow up because I did it
wrong, and then I cleaned up more carefully) back when it seemed to be
filling up the disk with useless audit trails. Does my system,
tokata.mit.edu, provide any useful insight as to the extent to which
one must remove laus to get things to happen smoothly?
-wdc
"Like my father before me, I shall remain 5 years old till the day I
die!"
On Jun 22, 2005, at 7:03 AM, Garry Zacheiss wrote:
> I spent some time over the past few days looking at the w20 early
> cluster Linux machines that failed to update. What I learned is that
> this:
>
>>> + glibc warning: /etc/ld.so.conf created as
>>> /etc/ld.so.conf.rpmnew
>>> ###########################################error:
>>> %post(glibc-2.3.4-2) scriptlet failed, exit status 0
>>> [ 1%]
>
> isn't telling the whole story. At the same time this happens, the
> following gets syslogged:
>
> kernel: audit_intercept: error 38, killing task
>
> which is what causes the update to fail. This appears to occur on all
> autoupdates from 9.3 to 9.4, but (in my limited sample set) never when
> the update is run manually.
>
> That error message is coming from the audit subsystem in the 2.4 kernel,
> aka laus. I don't know what's causing it, and neither does the web;
> debugging it looks painful, and I'm not sure it's worth the effort for
> reasons discussed below.
>
> I know we had discussed finding new love for laus in light of the
> trojaned xlogin in recent months, but laus no longer exists in RHEL4,
> and the package that seems the most likely replacement ("audit") isn't
> installed in the 9.4 release, so I don't think we care. I think the
> path of least resistance is punting the laus package from the 9.3
> release; it might be sufficient to have the update script on Linux
> kill auditd. I have a machine in the w20 cluster (w20-575-42) testing
> this hypothesis as soon as the hesiod DCM occurs and it joins the early
> cluster.
>
> I have tested that after the first failed update, the second will
> succeed, since the machine is now running all 9.4 packages, so long as
> the public workstation verification doesn't run. When it runs after the
> failed 9.4 update, it downgrades the machine to RHEL3 again (although
> less successfully than one would hope, if the number of error messages
> from failed RPM scriptlets is to be believed) and we're back in the same
> situation we started with.
>
> Assuming the testing goes successfully, I'll leave it up to Greg as to
> whether we want to fix this by hacking the update script or if we want
> to put out a 9.3.19 that removes the laus package.
>
> Garry
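
For checking whether other machines hit the same failure, here is a minimal
sketch that scans syslog lines for the kernel message quoted above. The
function name find_audit_kills and the sample log prefix (timestamp and
hostname) are assumptions for illustration only; only the text
"kernel: audit_intercept: error 38, killing task" comes from this thread.

```python
import re

# Matches the kernel message quoted in the thread. The syslog prefix
# (timestamp, hostname) varies by site, so it is deliberately not matched.
AUDIT_KILL = re.compile(r"kernel: audit_intercept: error (\d+), killing task")


def find_audit_kills(lines):
    """Return (line_number, error_code) for each matching syslog line.

    Hypothetical helper: `lines` is any iterable of log lines, for
    example an open file object for the system log.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        match = AUDIT_KILL.search(line)
        if match:
            hits.append((number, int(match.group(1))))
    return hits
```

Passing it an open log file returns the line numbers and error codes of any
audit kills, which could then be compared against the time of the failed
update.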