[1004] in Release_Engineering
FYI: Fwd: a2104 APC card bug
daemon@ATHENA.MIT.EDU (Bill Cattey)
Fri May 5 16:32:46 1989
Date: Fri, 5 May 89 16:32:21 -0400 (EDT)
From: Bill Cattey <wdc@ATHENA.MIT.EDU>
To: rel-eng@ATHENA.MIT.EDU
---------- Forwarded message begins here ----------
From ibmapar Wed May 3 15:05:07 1989
Date: Wed, 3 May 89 15:05:07 19
From: ibmapar (ibmapar)
To: ttsapar
Subject: a2104 APC card bug
open_date: 890503
originator: ibmapar
site_contact: Paul Taylor, IBM Palo Alto
port: all
component: fpa
severity: 1
origin: c
responsible:
reference:
ptm_apar: noneed
description:
There is a hardware bug on the APC card that can cause some
floating point operations to fail SILENTLY. The bug involves
exception handling logic on the ROMP CPU: there is a small
window during a pagefault exception which, when combined with
other conditions, causes the ROMP to report the fault twice.
Austin is working on an EC to the ROMP card to correct the
problem. In the meantime, we must provide a software
work-around to circumvent the bug.
The Problem:
------------
We were able to verify that the following code sequence is likely
to hit the bug:
cau r5,0xfe06 # AFPA
st r3,-0x3ff6(r5) # fr1 = fr0 * D:0(r3)
l r2,0(r2) # dummy load
This is a FP DMA operation to the AFPA. The address in r3 causes a
pagefault. If a load immediately follows the store, and the load
does not take an exception, the ROMP sometimes reports the exception
twice on the st. Thus the st is erroneously restarted twice. This
is OK for a normal memory store (storing a value twice into the same
memory location doesn't hurt), but it is a disaster if the st was
a floating point instruction.
As in the example, assume fr0=2.0, D:0(r3)=2.0 then sometimes
fr1 gets 8.0 instead of 4.0. In my experience, it happens once
every 10,000 times. If there is a lot of paging activity on
the system, you may see the error more often.
repeat_by:
producing the above code
index:
FP, AFPA
fix:
For all FP code that follows the RTFL convention, the kernel will
intercept the generic RTFL block and generate actual hardware FP
code at run time. The kernel can ensure that the above code
sequence will not be generated, thus avoiding the problem.
The hf77 compiler does have an option to generate direct FP code
for the AFPA. Thus hf77 must also ensure that it avoids such
code sequences. This could be done by inserting
a DMA sync instruction (setsb scr15,8) between the st and the l.
The new code sequence would be:
cau r5,0xfe06 # AFPA
st r3,-0x3ff6(r5) # fr1 = fr0 * D:0(r3)
setsb scr15,8 # set permanent zero bit,
# thus sync up the DMA.
l r2,0(r2) # dummy load
Notice that the DMA sync instruction is needed if:
1. The st is a FP instruction which REFERENCES memory.
2. The next memory reference instruction is an l (having some
other non-memory reference instructions, such as cal or cau,
between the st and the l is NOT enough to avoid the problem).
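The two conditions above can be sketched as a small pass over an
instruction list. This is only an illustration, not the hf77 or kernel
implementation. The Insn type, the fp_dma flag and the function
insert_dma_sync are hypothetical names introduced here, and the sketch
assumes that "st" and "l" are the only memory reference mnemonics it
needs to recognize, since those are the only ones named in this report.

```python
from typing import List, NamedTuple


class Insn(NamedTuple):
    """Hypothetical model of one assembler instruction."""
    mnemonic: str          # e.g. "st", "l", "cal", "cau", "setsb"
    operands: str          # operand text, treated as opaque
    fp_dma: bool = False   # assumption: caller marks an st that is an
                           # FP instruction which references memory


# The DMA sync instruction named in the report.
SYNC = Insn("setsb", "scr15,8")


def is_sync(insn: Insn) -> bool:
    return (insn.mnemonic == "setsb"
            and insn.operands.replace(" ", "") == "scr15,8")


def insert_dma_sync(code: List[Insn]) -> List[Insn]:
    """Insert a DMA sync before an l that is the next memory reference
    after an FP st, unless a sync is already present in between."""
    out: List[Insn] = []
    pending = False  # FP st seen, no sync or memory reference since
    for insn in code:
        if is_sync(insn):
            pending = False
        elif insn.mnemonic == "l" and pending:
            out.append(SYNC)
            pending = False
        out.append(insn)
        if insn.mnemonic == "st":
            # A later st replaces the earlier one as the store of interest.
            pending = insn.fp_dma
    return out


if __name__ == "__main__":
    failing = [
        Insn("cau", "r5,0xfe06"),
        Insn("st", "r3,-0x3ff6(r5)", fp_dma=True),
        Insn("l", "r2,0(r2)"),
    ]
    for insn in insert_dma_sync(failing):
        print(insn.mnemonic, insn.operands)
```

Non-memory instructions such as cal or cau between the st and the l do
not clear the pending state in this sketch, which matches condition 2.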
While awaiting the fix from HCR, the kernel will have a check
for that special error condition. If it is detected, the kernel
may choose to kill the process.
Questions or comments call:
Tri.
8-465-4462.
configuration:
all configurations with floating point adapters