[6028] in linux-scsi channel archive


Re: LVD SCSI problems with ncr53c8xx

daemon@ATHENA.MIT.EDU (Gerard Roudier)
Fri Mar 5 16:32:58 1999

Date:	Fri, 5 Mar 1999 21:44:58 +0100 (MET)
From: Gerard Roudier <groudier@club-internet.fr>
To: "Bradley M. Kuhn" <bkuhn@ebb.org>
Cc: linux-scsi@vger.rutgers.edu
In-Reply-To: <19990304210015.N1848@ebb.org>


On Thu, 4 Mar 1999, Bradley M. Kuhn wrote:

> Once, I even got these, which were really strange:

Indeed, they are.

> kernel: ncr53c875-0-<0,0>: phase change 2-7 10@07fe0645 resid=16711926.

2             = COMMAND PHASE
7             = MESSAGE IN PHASE
10            = size of the command (presumably, and very probably)
0x07fe0645    = start address of the data (presumably the COMMAND)
                the controller was sending
residual data = 16711926

The following points about the values just above are weird:

a) It is very unlikely a device will switch from COMMAND phase to 
   MESSAGE IN phase unless some catastrophic condition confused it 
   a lot.

b) The COMMAND area for a 10-byte command can only come from the 
   Scsi_Cmnd structure, which is normally guaranteed to be:
   1 - Allocated under the 16 MB limit
   2 - 32-bit aligned; the command area follows pointer declarations
       and so must also be 32-bit aligned.
   (0x07fe0645 does not match those conditions)

c) The residual must be less than the size of the command (10).
   (16711926 seems a bit too large :)  )

> kernel: ncr53c875-0-<0,0>: phase change 2-7 10@07fe0645 resid=4.

Except for the residual value, which seems possible, the other values
are the same.
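
The three checks above can be sketched in a few lines of Python. This is
only an illustration, not code from the driver: the helper name
check_phase_change and the regular expression are assumptions based on
the two log lines quoted here, and the "residual less than size" test
simply follows point (c).

```python
import re

# Pattern for the "phase change" line as it appears in the quoted log.
PHASE_CHANGE = re.compile(
    r"phase change (\d+)-(\d+) (\d+)@([0-9a-fA-F]+) resid=(\d+)"
)

# Only the two phase codes decoded in the message above.
PHASE_NAMES = {2: "COMMAND", 7: "MESSAGE IN"}

LIMIT_16MB = 16 * 1024 * 1024


def check_phase_change(line):
    """Hypothetical helper: decode one log line and list implausible values."""
    m = PHASE_CHANGE.search(line)
    if m is None:
        raise ValueError("not a phase change line")
    old_phase, new_phase, size = (int(m.group(i)) for i in (1, 2, 3))
    addr = int(m.group(4), 16)   # the address is printed in hex
    resid = int(m.group(5))      # the residual is printed in decimal

    problems = []
    if addr >= LIMIT_16MB:       # point (b) 1
        problems.append("address not under 16 MB")
    if addr % 4 != 0:            # point (b) 2
        problems.append("address not 32-bit aligned")
    if resid >= size:            # point (c)
        problems.append("residual not less than size")

    return {
        "from": PHASE_NAMES.get(old_phase, str(old_phase)),
        "to": PHASE_NAMES.get(new_phase, str(new_phase)),
        "size": size,
        "addr": addr,
        "resid": resid,
        "problems": problems,
    }
```

Fed the first line, the helper reports all three problems; fed the
second, only the two address problems remain.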

On Thu, 4 Mar 1999, Bradley M. Kuhn wrote:

> I have spent nearly 60-80 hours debugging this problem, and I am not
> sure where to go from here.  I humbly ask help from SCSI gurus here.

Are you sure it is a SCSI problem?

> I have a Tekram DC-390U2B/W card and a Seagate ST39175LW drive.  It uses the
> ncr53c8xx.  I am using Linux kernel 2.2.2, and the most recent ncr53c8xx
> driver (which is now called sym53c8xx, version 1.2a).

The U2B is different from the U2W. The U2W uses an LVD to/from SE
translator, which can only be very complex, since:
- The SCSI protocol for LVD has been changed in order to deal with 
  release glitches and the protocol changes can only be properly 
  implemented by devices.
- IMO, this chip must snoop the SCSI protocol and enforce the SCSI
  protocol changes for old SE devices that are on the SE side.
Btw, on paper, I am not going to trust any LVD/SE converter for these 
reasons. But, if you are not using a U2W or are not using SE devices on 
such a controller, this chip is probably just quiet ...

> I have made it so that these are the *only* two things on that SCSI card, to
> be sure that other items on the chain did not affect my operation.
> 
> The Problem:
> 
> The drive works for a while.  Sometimes up to a few days.  However, usually
> under some point of heavy load, the drive locks, with the I/O light on
> continuously, and the device driver on the system can no longer talk to it,
> giving a variety of timeout and other errors (see below for details).  Only
> a hard reset of the system brings that SCSI bus alive again.
> 
> 
> I can always force this behavior by putting it under heavy load immediately
> upon booting.
> 
> I know that there is nothing wrong with the driver.  I even tried FreeBSD,
> but to no avail.
> 
> Configurations I have tried:
> 
> 0. Using the special red-white-blue LVD cable that came with the card, I
>    hooked the terminator that came with the Tekram card (labeled SCSI LVD/SE
>    Terminator) to the end of that red-white-blue LVD cable, and put the
>    drive in between the terminator and the card. (THIS IS THE CANONICAL,
>    CORRECT CONFIGURATION, I have been told).
> 
> 1. I tried using configuration (0), with the drive's internal termination on
>    (I realize this does nothing for LVD drives, according to spec, but I was
>    desperate).
> 
> 2. I tried using plain 68 pin Ultra cable, with the drive's internal
>    termination on and the drive in *forced* SE mode (using another jumper),
>    and reproduced the problem.
> 
> 3. I tried (2), with the drive's internal termination *off* and the LVD/SE
>    terminator mentioned in (0) on the chain.
> 
> All of these produced similar results---works for a while, then fails
> in the manner described.  In almost all cases, it severely corrupts the
> ext2fs almost beyond repair.

Are you still sure it is a SCSI problem?

> Before the crash, the messages in the log look something like (sometimes it
> happens so fast there are no messages at all):
> 
> kernel: scsi : aborting command due to timeout : pid 96752, scsi1, channel 0, id 0, lun 0 Write (6) 00 00 5f 02 00 
> kernel: ncr53c8xx_abort: pid=96752 serial_number=96772 serial_number_at_timeout=96772
> kernel: scsi : aborting command due to timeout : pid 96759, scsi1, channel 0, id 0, lun 0 Write (6) 00 80 89 02 00 
> kernel: ncr53c8xx_abort: pid=96759 serial_number=96779 serial_number_at_timeout=96779
> kernel: scsi : aborting command due to timeout : pid 96760, scsi1, channel 0, id 0, lun 0 Write (6) 00 81 8b 02 00 
> kernel: ncr53c8xx_abort: pid=96760 serial_number=96780 serial_number_at_timeout=96780
> kernel: ncr53c875-0: abort ccb=c60ce800 (cancel)kernel: SCSI host 1 abort (pid 96630) timed out - resetting
> kernel: SCSI bus is being reset for host 1 channel 0.
> kernel: ncr53c8xx_reset: pid=96630 reset_flags=2 serial_number=96650 serial_number_at_timeout=96650
> kernel: ncr53c875-0: restart (scsi reset).
> kernel: ncr53c875-0: Downloading SCSI SCRIPTS.
> kernel: ncr53c875-0-<0,*>: FAST-20 WIDE SCSI 40.0 MB/s (50 ns, offset 15)
> 
> 
> (Note these occurred when I had it connected as plain Ultra chain---similar
> things happen in LVD mode)
> 
> Once, I even got these, which were really strange:
> 
> kernel: ncr53c875-0-<0,0>: phase change 2-7 10@07fe0645 resid=16711926.
> kernel: ncr53c875-0-<0,0>: phase change 2-7 10@07fe0645 resid=4.

These are the offending messages I commented on above.

> OTOH, a friend of mine with a plain *Ultra* chain put the drive on his
> machine in the middle of the chain (between other devices), and is not
> able to reproduce the problem.
> 
> He also used *my card* in his machine, without being able to reproduce the
> problem.

So, the card may not be the culprit.

> It is *not* a software problem.  I have spent a long time exchanging email
> with the very helpful author of the driver, and we are sure it isn't
> software.  Plus, I tried installing FreeBSD, and got similar results.

Then, the OS and the driver may well be fine.

> Here is my current thinking on what the problem might be (some of these are
> suggestions from comp.periphs.scsi):
> 
>   (a) I am running my bus at too fast a rate (100 MHz).  However, I have
>       been able to use other cards, even another older SCSI card in that bus
>       at that speed.  Plus, there is no way (according to my FIC PA-2013
>       manual) to use my AMD K6-2/350 at proper speed without the 100 MHz
>       bus.

Is this BUS speed within chipset and memory specifications, and is it 
divided properly so that the PCI BUS will run within the 0-33 MHz clock
range?
Btw, I have just understood that it is not.

>   (b) bad PCI bus/bad PCI slot [0]

Did you try another PCI slot for the board during your 
80 hours of problem tracking?

>   (c) bad terminator (being tested in my friend's machine now).  I don't
>       think this is it, though, because the terminator was new.

The newer a thing is, the more likely it is to be broken.

>   (d) insufficient power (is there a way I can test this?)  I don't think
>       this is a problem because I have an expensive server case with a good
>       power supply.
> 
>  
> I believe I am doing things right.  I hope you all can help me...

Hmmm .... You did most things right, but not all in my opinion, or you 
just missed some basics, for example:

1 - Check that all the components are running within specifications.
2 - A system that has ever been overclocked can be broken forever 
    in a very subtle way. Forget that if you didn't overclock.
3 - You must also try underclocking for such weird problem tracking. This
    can apply to CPU, memory BUS, PCI BUS, SCSI synchronous speed ...
4 - You also must try to limit features used, for example for PCI and 
    SCSI.
5 - If you are not using ECC or parity memory, memory errors can lead 
    to very strange symptoms.

> [0] can
> 
> -- 
>       Bradley M. Kuhn   |     bkuhn@ebb.org    |   http://www.ebb.org/bkuhn

Gérard.

PS: I do not know the PA2013 board. My understanding of your point (a),
plus some guessing, is that you are running this board at 100 MHz while
it is only supported at 66 MHz. If this leads to a 50 MHz PCI BUS, then
you may well be running a system that is completely broken in theory
and in practice as well.
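
The arithmetic behind that guess, as a small Python sketch. Everything
here is an assumption for illustration: that the PCI clock is the memory
BUS clock divided by a fixed integer, that the divider is 2 (which is
what turns 66 MHz into 33 MHz), and that a small tolerance is acceptable
for nominal clocks. I do not know which dividers the PA2013 offers.

```python
# Upper end of the 0-33 MHz PCI clock range mentioned above.
PCI_MAX_MHZ = 33.0


def pci_clock_mhz(bus_mhz, divider):
    """PCI clock, assuming it is derived by dividing the bus clock."""
    return bus_mhz / divider


def pci_in_spec(bus_mhz, divider, tolerance_mhz=0.5):
    """True if the derived PCI clock stays within the assumed limit.

    The tolerance is an assumption: it lets a nominal 33.3 MHz clock
    pass while still rejecting an obviously overclocked bus.
    """
    return pci_clock_mhz(bus_mhz, divider) <= PCI_MAX_MHZ + tolerance_mhz


# A 66 MHz bus with a divide-by-2 setting gives 33 MHz: fine.
ok_66 = pci_in_spec(66.0, 2)

# The same divider at 100 MHz gives 50 MHz: out of range.
ok_100_div2 = pci_in_spec(100.0, 2)

# A divide-by-3 setting, if the board has one, brings 100 MHz back to
# about 33.3 MHz.
ok_100_div3 = pci_in_spec(100.0, 3)
```

So the question to answer from the board manual is which divider is
actually in effect when the BUS is set to 100 MHz.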



