[6024] in linux-scsi channel archive
LVD SCSI problems with ncr53c8xx
daemon@ATHENA.MIT.EDU (Bradley M. Kuhn)
Thu Mar 4 21:39:19 1999
Date: Thu, 4 Mar 1999 21:00:15 -0500
From: "Bradley M. Kuhn" <bkuhn@ebb.org>
To: linux-scsi@vger.rutgers.edu, groudier@club-internet.fr
Mail-Followup-To: linux-scsi@vger.rutgers.edu, groudier@club-internet.fr
I have spent roughly 60-80 hours debugging this problem, and I am not
sure where to go from here. I humbly ask for help from the SCSI gurus here.
I have a Tekram DC-390U2B/W card and a Seagate ST39175LW drive. The card uses
the ncr53c8xx driver. I am using Linux kernel 2.2.2 and the most recent version
of that driver (which is now called sym53c8xx, version 1.2a).
I have made it so that these are the *only* two things on that SCSI card, to
be sure that other items on the chain did not affect my operation.
The Problem:
The drive works for a while, sometimes up to a few days. However, usually
at some point under heavy load, the drive locks, with the I/O light on
continuously, and the device driver on the system can no longer talk to it,
giving a variety of timeout and other errors (see below for details). Only
a hard reset of the system brings that SCSI bus alive again.
I can always force this behavior by putting it under heavy load immediately
upon booting.
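
A load of this kind can be approximated with a write-then-verify loop on a
scratch file on the affected drive. What follows is only a rough sketch under
stated assumptions, not the procedure used above: the function name
`stress_write_verify`, the scratch path, and the sizes are all hypothetical.

```python
import hashlib
import os


def stress_write_verify(path, total_bytes, chunk_bytes=1 << 20, seed=b"scsi-load"):
    """Write total_bytes of deterministic data to path, fsync, then read it back.

    Returns True if the SHA-1 of the data read back matches what was written.
    The path should be a scratch file on the drive under test (hypothetical).
    """
    written = hashlib.sha1()
    block = hashlib.sha1(seed).digest()
    remaining = total_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            # Derive each chunk from the previous digest so that successive
            # chunks differ from one another.
            block = hashlib.sha1(block).digest()
            size = min(chunk_bytes, remaining)
            chunk = (block * (size // len(block) + 1))[:size]
            f.write(chunk)
            written.update(chunk)
            remaining -= size
        f.flush()
        os.fsync(f.fileno())  # ask the kernel to push the data out to the device

    read_back = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_bytes), b""):
            read_back.update(chunk)
    return written.hexdigest() == read_back.hexdigest()
```

Note that the read-back may be served from the page cache rather than from
the drive, so a file considerably larger than RAM is needed before the
comparison says much about the disk itself. If the bus hangs as described,
the call will simply block rather than return False.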
I know that there is nothing wrong with the driver. I even tried FreeBSD,
but to no avail.
Configurations I have tried:
0. Using the special red-white-blue LVD cable that came with the card, I
hooked the terminator that came with the Tekram card (labeled SCSI LVD/SE
Terminator) to the end of that red-white-blue LVD cable, and put the
drive in between the terminator and the card. (THIS IS THE CANONICAL,
CORRECT CONFIGURATION, I have been told).
1. I tried using configuration (0), with the drive's internal termination on
(I realize this does nothing for LVD drives, according to spec, but I was
desperate).
2. I tried using plain 68 pin Ultra cable, with the drive's internal
termination on and the drive in *forced* SE mode (using another jumper),
and reproduced the problem.
3. I tried (2), with the drive's internal termination *off* and the LVD/SE
terminator mentioned in (0) on the chain.
All of these produced similar results---works for a while, then fails
in the manner described. In almost all cases, it severely corrupts the
ext2fs almost beyond repair.
Before the crash, the messages in the log look something like (sometimes it
happens so fast there are no messages at all):
kernel: scsi : aborting command due to timeout : pid 96752, scsi1, channel 0, id 0, lun 0 Write (6) 00 00 5f 02 00
kernel: ncr53c8xx_abort: pid=96752 serial_number=96772 serial_number_at_timeout=96772
kernel: scsi : aborting command due to timeout : pid 96759, scsi1, channel 0, id 0, lun 0 Write (6) 00 80 89 02 00
kernel: ncr53c8xx_abort: pid=96759 serial_number=96779 serial_number_at_timeout=96779
kernel: scsi : aborting command due to timeout : pid 96760, scsi1, channel 0, id 0, lun 0 Write (6) 00 81 8b 02 00
kernel: ncr53c8xx_abort: pid=96760 serial_number=96780 serial_number_at_timeout=96780
kernel: ncr53c875-0: abort ccb=c60ce800 (cancel)
kernel: SCSI host 1 abort (pid 96630) timed out - resetting
kernel: SCSI bus is being reset for host 1 channel 0.
kernel: ncr53c8xx_reset: pid=96630 reset_flags=2 serial_number=96650 serial_number_at_timeout=96650
kernel: ncr53c875-0: restart (scsi reset).
kernel: ncr53c875-0: Downloading SCSI SCRIPTS.
kernel: ncr53c875-0-<0,*>: FAST-20 WIDE SCSI 40.0 MB/s (50 ns, offset 15)
(Note these occurred when I had it connected as a plain Ultra chain; similar
things happen in LVD mode.)
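
For readers who want to pull the details out of log lines like these, here is
a small sketch. It rests on two assumptions that are not stated in the log
itself: that the five hex bytes after "Write (6)" are the CDB bytes following
the opcode, and that they follow the standard WRITE(6) layout (21-bit block
address, one-byte transfer length where 0 means 256 blocks, then a control
byte). The function names are made up for illustration.

```python
import re

# Matches the mid-level "aborting command due to timeout" lines shown above.
TIMEOUT_RE = re.compile(
    r"aborting command due to timeout : pid (\d+), scsi(\d+), channel (\d+), "
    r"id (\d+), lun (\d+) Write \(6\) ((?:[0-9a-f]{2} ?){5})"
)


def decode_write6_timeout(line):
    """Return a dict describing a WRITE(6) timeout line, or None if no match."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    pid, host, channel, target, lun = (int(g) for g in m.groups()[:5])
    b = [int(x, 16) for x in m.group(6).split()]
    # Assumed layout: b[0] low 5 bits + b[1] + b[2] = block address,
    # b[3] = transfer length (0 means 256), b[4] = control byte.
    lba = ((b[0] & 0x1F) << 16) | (b[1] << 8) | b[2]
    blocks = b[3] or 256
    return {"pid": pid, "host": host, "channel": channel,
            "id": target, "lun": lun, "lba": lba, "blocks": blocks}


def sync_rate_mb_per_s(period_ns, wide):
    """Synchronous transfer rate: one transfer per period, 2 bytes if wide."""
    transfers_per_s = 1e9 / period_ns
    return transfers_per_s * (2 if wide else 1) / 1e6
```

Under those assumptions, the three timed-out commands above decode to
two-block writes at block addresses 95, 32905 and 33163, and a 50 ns period
on a wide bus works out to the 40.0 MB/s reported in the negotiation line.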
Once, I even got these, which were really strange:
kernel: ncr53c875-0-<0,0>: phase change 2-7 10@07fe0645 resid=16711926.
kernel: ncr53c875-0-<0,0>: phase change 2-7 10@07fe0645 resid=4.
OTOH, a friend of mine with a plain *Ultra* chain put the drive on his
machine in the middle of the chain (between other devices), and is not
able to reproduce the problem.
He also used *my card* in his machine, without being able to reproduce the
problem.
It is *not* a software problem. I have spent a long time exchanging email
with the very helpful author of the driver, and we are sure it isn't
software. Plus, I tried installing FreeBSD, and got similar results.
Here is my current thinking on what the problem might be (some of these are
suggestions from comp.periphs.scsi):
(a) I am running my bus at too fast a rate (100 MHz). However, I have
been able to use other cards, even another older SCSI card on that bus
at that speed. Plus, there is no way (according to my FIC PA-2013
manual) to use my AMD K6-2/350 at proper speed without the 100 MHz
bus.
(b) bad PCI bus/bad PCI slot [0]
(c) bad terminator (being tested in my friend's machine now). I don't
think this is it, though, because the terminator was new.
(d) insufficient power (is there a way I can test this?) I don't think
this is a problem because I have an expensive server case with a good
power supply.
I believe I am doing things right. I hope you all can help me...
[0] can
--
Bradley M. Kuhn | bkuhn@ebb.org | http://www.ebb.org/bkuhn