[1004] in Release_Engineering

FYI: Fwd: a2104 APC card bug

daemon@ATHENA.MIT.EDU (Bill Cattey)
Fri May 5 16:32:46 1989

Date: Fri,  5 May 89 16:32:21 -0400 (EDT)
From: Bill Cattey <wdc@ATHENA.MIT.EDU>
To: rel-eng@ATHENA.MIT.EDU
---------- Forwarded message begins here ----------
From ibmapar  Wed May  3 15:05:07 1989
Date: Wed, 3 May 89 15:05:07  19
From: ibmapar (ibmapar)
To: ttsapar
Subject:   a2104 APC card bug

open_date:       890503  
originator:      ibmapar 
site_contact:    Paul Taylor, IBM Palo Alto              
port:            all     
component:       fpa     
severity:        1       
origin:          c       
responsible:             
reference:               
ptm_apar:        noneed  

description: 
There is a hardware bug on the APC card that can cause some
floating point operations to fail SILENTLY.  The bug involves
exception handling logic on the ROMP CPU:  there is a small
window during a pagefault exception which, when combined with
other conditions, causes the ROMP to report the fault twice.

Austin is working on an EC to the ROMP card to correct the
problem.  In the meantime, we must provide a software
workaround to circumvent the bug.

The Problem:
------------

We were able to verify that the following code sequence is likely
to hit the bug:

        cau     r5,0xfe06               # AFPA
        st      r3,-0x3ff6(r5)          # fr1 = fr0 * D:0(r3)
        l       r2,0(r2)                # dummy load

This is an FP DMA operation to the AFPA.  The address in r3 causes a
pagefault.  If a load immediately follows the store, and the load
does not itself raise an exception, the ROMP sometimes reports the
exception twice on the st.  The st is then erroneously restarted
twice.  This is OK for a normal memory store (storing a value twice
into the same memory location doesn't hurt), but it is a disaster if
the st is a floating point instruction.

As in the example, assume fr0=2.0 and D:0(r3)=2.0; then sometimes
fr1 gets 8.0 instead of 4.0.  In my experience, this happens about
once in every 10,000 executions of the sequence.  If there is a lot
of paging activity on the system, you may see the error more often.
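
The difference between a harmless repeated store and a harmful
repeated FP operation can be sketched with a toy model.  This is
only an illustration: the report does not describe how the AFPA
arrives at 8.0, so the model below simply ASSUMES that each restart
multiplies the running result by the memory operand again.  The
function names plain_store and fp_multiply_dma are hypothetical.

```python
def plain_store(memory, addr, value, starts=1):
    """Ordinary store: repeating it leaves memory in the same state."""
    for _ in range(starts):
        memory[addr] = value
    return memory[addr]


def fp_multiply_dma(fr0, operand, starts=1):
    """Toy FP multiply (assumption, not the real AFPA behaviour):
    every (re)start multiplies the running result by the operand."""
    result = fr0
    for _ in range(starts):
        result = result * operand
    return result


# A store started once or twice gives the same memory contents.
print(plain_store({}, 0x100, 2.0, starts=1), plain_store({}, 0x100, 2.0, starts=2))

# The FP operation started once gives the expected product; started
# twice it gives a different, silently wrong value.
print(fp_multiply_dma(2.0, 2.0, starts=1), fp_multiply_dma(2.0, 2.0, starts=2))
```

Under that assumption the model reproduces the values quoted above
(4.0 for a single start, 8.0 for a double start), which is why a
double restart is invisible for stores but not for FP operations.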


repeat_by: 
Generate and run the above code sequence.

index: 
FP, AFPA


fix: 

For all FP code that follows the RTFL convention, the kernel
intercepts the generic RTFL block and generates the actual hardware
FP code at run time.  The kernel can ensure that the above code
sequence is never generated, thus avoiding the problem.

The hf77 compiler has an option to generate direct FP code for
the AFPA, so hf77 must also ensure that it avoids such code
sequences.  This could be done by inserting a DMA sync instruction
(setsb scr15,8) between the st and the l.

The new code sequence would be:

        cau     r5,0xfe06               # AFPA
        st      r3,-0x3ff6(r5)          # fr1 = fr0 * D:0(r3)
        setsb   scr15,8                 # set permanent zero bit,
                                        # thus sync up the DMA.
        l       r2,0(r2)                # dummy load

Notice that the DMA sync instruction is needed when both of the
following hold:

1.  The st is an FP instruction which REFERENCES memory.
2.  The next memory reference instruction is an l (having some
    other non-memory reference instructions, such as cal or cau,
    between the st and the l is NOT enough to avoid the problem).
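
The two conditions above could be applied mechanically by a small
peephole pass over the generated instructions.  The sketch below is
not hf77 or kernel code; it is a minimal illustration under stated
assumptions: the Insn record and its fp_mem_ref flag are
hypothetical, and only l and st are treated as memory reference
instructions, with everything else (cal, cau, ...) treated as
non-memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insn:
    mnemonic: str
    operands: str
    fp_mem_ref: bool = False  # hypothetical flag: FP st that references memory


MEMORY_REF = {"l", "st"}       # assumption: the only memory references modelled
SYNC = Insn("setsb", "scr15,8")  # DMA sync instruction named in the report


def insert_dma_sync(insns):
    """Insert the DMA sync before an l that is the next memory
    reference after an FP st which references memory."""
    out = []
    pending = False  # True while an FP st is waiting for its next memory ref
    for insn in insns:
        if insn == SYNC:
            pending = False          # a sync is already present
        elif pending and insn.mnemonic in MEMORY_REF:
            if insn.mnemonic == "l":
                out.append(SYNC)     # condition 2 met: sync before the load
            pending = False
        out.append(insn)
        if insn.mnemonic == "st" and insn.fp_mem_ref:
            pending = True           # condition 1 met
    return out


sequence = [
    Insn("cau", "r5,0xfe06"),
    Insn("st", "r3,-0x3ff6(r5)", fp_mem_ref=True),
    Insn("cal", "r4,0(r0)"),         # non-memory instruction does not help
    Insn("l", "r2,0(r2)"),
]
for insn in insert_dma_sync(sequence):
    print(f"        {insn.mnemonic:<8}{insn.operands}")
```

The pass places the sync directly before the l rather than directly
after the st; both positions lie between the two instructions.
Running the pass on its own output adds nothing further, since an
existing setsb scr15,8 clears the pending state.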

While awaiting the fix from HCR, the kernel will check for that
special error condition.  If it is detected, the kernel may
choose to kill the process.

For questions or comments, call:

Tri.
8-465-4462.


configuration: 
all configurations with floating point adapters

