[2124] in linux-scsi channel archive


DPT 2144UW w/EATA: Inexplicable SCSI bus resets [LONG]

daemon@ATHENA.MIT.EDU (Eu Hin Chua)
Tue Jul 8 10:22:53 1997

Date: 	Tue, 08 Jul 1997 22:11:13 +0800
To: linux-scsi@vger.rutgers.edu
From: Eu Hin Chua <emeritus@iinet.net.au>

Hi,

I have a rather vexing and frustrating problem. Using Redhat 4.2 (kernel
2.0.30), my SCSI bus unexpectedly resets during large file transfers, using
both the EATA and EATA_DMA drivers. 

The situation is extremely frustrating because my machine works fine in
Windows 95 and NT 4, with DPT Storage Manager reporting no excessive bus
resets.

First, my hardware setup:

o Pentium II 266, Asus KN97-X Motherboard (FX Chipset)
o DPT PM2144UW Smartcache IV PCI SCSI controller (without caching/RAID
module, SCSI ID 7)
o 2 IBM Ultrastar 2ES 4.3 gig (DCAS-34330W) Wide SCSI hard drives (SCSI ID
5 and 6)
o 1 EIDE Quantum Fireball (original) 1.2 gig drive (set to boot SCSI first
 through BIOS)
o 1 Sony CDU-415 12X SCSI-2 CDROM (SCSI ID 3)

o PCI Cards:
	Matrox Mystique 4 meg VGA
	Orchid Righteous 3D accelerator
	(and the DPT of course)
o ISA Cards:
	Soundblaster AWE32 PNP (with Yamaha Waveforce MIDI daughterboard)
	Practical Peripherals 28.8 internal modem (COM 4)

Next, the problem:

When attempting to install Redhat 4.2 off the CD, and using the EATA_DMA
driver, the setup program manages to get through at least 50% of the file
copying (say in the region of 30-60 megabytes) before the machine locks up
with messages like:

Jul  8 12:27:37 localhost kernel: CD-ROM I/O error: dev 0b:00, sector 224776
Jul  8 12:27:37 localhost kernel: CD-ROM I/O error: dev 0b:00, sector 224776

The only course of action is to reboot the machine and try again. I am
NEVER able to complete a full installation (installing all the packages I
want). However, I AM able, after several attempts, to install a MINIMAL
Linux setup (using the base packages).

I was then able to boot into this basic setup, install the kernel source,
gcc and other associated packages manually, and examine the problem
further. I attempted to compile two kernels, one using Dario's EATA driver
and the other Michael's EATA_DMA driver.

First, the EATA driver output at bootup:

localhost kernel: EATA0: address 0x1f0 in use, skipping probe.
localhost kernel: EATA0: 2.0C, PCI 0xe010, IRQ 12, NO DMA, SG 252, MB 64,
tc:y, lc:y, mq:16.
localhost kernel: EATA0: wide SCSI support enabled, max_id 16, max_lun 8.
localhost kernel: EATA0: SCSI channel 0 enabled, host target ID 7.
localhost kernel: EATA/DMA 2.0x: Copyright (C) 1994-1997 Dario Ballabio.
localhost kernel: scsi0 : EATA/DMA 2.0x rev. 3.00.09 
localhost kernel: scsi : 1 host.
localhost kernel:   Vendor: SONY      Model: CD-ROM CDU-415    Rev: 1.1i
localhost kernel:   Type:   CD-ROM                             ANSI SCSI
revision: 02
localhost kernel: Detected scsi CD-ROM sr0 at scsi0, channel 0, id 3, lun 0
localhost kernel:   Vendor: IBM       Model: DCAS-34330W       Rev: S61A
localhost kernel:   Type:   Direct-Access                      ANSI SCSI
revision: 02
localhost kernel: Detected scsi disk sda at scsi0, channel 0, id 5, lun 0
localhost kernel:   Vendor: IBM       Model: DCAS-34330W       Rev: S61A
localhost kernel:   Type:   Direct-Access                      ANSI SCSI
revision: 02
localhost kernel: Detected scsi disk sdb at scsi0, channel 0, id 6, lun 0
localhost kernel: EATA0: scsi0, channel 0, id 3, lun 0, cmds/lun 16, linked.
localhost kernel: EATA0: scsi0, channel 0, id 5, lun 0, cmds/lun 16,
linked, tagged.
localhost kernel: EATA0: scsi0, channel 0, id 6, lun 0, cmds/lun 16,
linked, tagged.
localhost kernel: scsi : detected 1 SCSI cdrom 2 SCSI disks total.
localhost kernel: SCSI device sda: hdwr sector= 512 bytes. Sectors= 8466688
[4134 MB] [4.1 GB]
localhost kernel: SCSI device sdb: hdwr sector= 512 bytes. Sectors= 8466688
[4134 MB] [4.1 GB]
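
As a quick sanity check on the capacity lines above, here is a minimal sketch
of the arithmetic, assuming the kernel's "MB" figure counts units of
1024*1024 bytes. The function name is made up for illustration.

```python
SECTOR_BYTES = 512       # "hdwr sector= 512 bytes" from the boot log
SECTORS = 8466688        # "Sectors= 8466688" from the boot log


def capacity(sectors, sector_bytes=SECTOR_BYTES):
    """Return (total bytes, whole binary megabytes) for a disk."""
    total = sectors * sector_bytes
    binary_mb = total // (1024 * 1024)
    return total, binary_mb


total_bytes, binary_mb = capacity(SECTORS)
decimal_gb = total_bytes / 1e9
print(total_bytes, "bytes")
print(binary_mb, "MB in 1024*1024-byte units")
print(round(decimal_gb, 2), "GB in 10**9-byte units")
```

The binary figure agrees with the 4134 MB the kernel prints, and the decimal
figure agrees with the "4.3 gig" size quoted in the hardware list, so both
numbers describe the same drive.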

My test was to tar up a bunch of files, totalling about 40 meg, delete the
tar file, and repeat the process until errors appeared. I am pretty sure
that the problem occurs either when the SCSI hard drives are being
accessed, or when data is written to them. I am able to use the Linux
system quite stably (I think) as long as the hard drive isn't grinding away
at something.
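
The tar-and-delete test described above can be sketched as a small script.
This is only an illustration under stated assumptions, not the exact procedure
used: the function name, the archive name "stress.tar" and the explicit fsync
(added so that the data actually reaches the disk instead of sitting in the
page cache) are all hypothetical.

```python
import os
import tarfile


def tar_stress(source_dir, scratch_dir, max_iterations):
    """Repeatedly archive source_dir into scratch_dir, then delete the archive.

    Returns the number of iterations that completed without an I/O error.
    """
    archive = os.path.join(scratch_dir, "stress.tar")
    completed = 0
    for _ in range(max_iterations):
        try:
            with open(archive, "wb") as raw:
                # tarfile does not close a file object it was handed,
                # so the file can still be flushed and synced afterwards.
                with tarfile.open(fileobj=raw, mode="w") as tar:
                    tar.add(source_dir, arcname="payload")
                raw.flush()
                os.fsync(raw.fileno())
            os.remove(archive)
        except OSError as exc:
            print("I/O error after", completed, "clean iterations:", exc)
            break
        completed += 1
    return completed
```

Pointing scratch_dir at a filesystem on the SCSI drive and source_dir at
roughly 40 MB of files would approximate the test; a return value smaller than
max_iterations means an error was reported to user space before the loop
finished.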

The CDROM does not seem to be an issue. I have tested this by running
multiple instances of dd if=/dev/scd0 of=/dev/null & at the same time, and
the CDROM reads perfectly well. In addition, I was also able to successfully
install RedHat on the EIDE drive, while reading installation files off the
SCSI CDROM.
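
For anyone wanting to repeat the parallel read test without juggling several
dd processes, a rough equivalent is sketched below. It assumes the device (or
any ordinary file) can be opened for reading by the current user; the function
names and the default of four readers are illustrative only.

```python
import threading


def read_all(path, chunk_size=64 * 1024):
    """Read path sequentially to the end and return the byte count."""
    total = 0
    with open(path, "rb", buffering=0) as dev:
        while True:
            chunk = dev.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def parallel_read(path, readers=4):
    """Run several sequential readers at once, like several dd jobs.

    Each entry of the result is a byte count, or the OSError that
    stopped that reader.
    """
    results = [None] * readers

    def worker(index):
        try:
            results[index] = read_all(path)
        except OSError as exc:
            results[index] = exc

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(readers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Calling parallel_read("/dev/scd0") would mirror the dd test; any OSError in
the returned list corresponds to a read that the kernel failed.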

After several iterations of the test, I would be confronted with an error
message like:

EATA0: ihdlr, mbox 59, err 0x6:0, target 0.6:0, pid 25006, reg 0x51, count
25019

In MOST cases, the kernel would not panic, and I would still be able to use
the machine.

According to Dario, this seems to indicate an "unexpected bus free"
error on the hard drive.

Now, I tried doing the same using the EATA_DMA driver. Output at bootup:

localhost kernel: EATA (Extended Attachment) driver version: 2.59b
localhost kernel: developed in co-operation with DPT
localhost kernel: (c) 1993-96 Michael Neuffer, mike@i-Connect.Net
localhost kernel: Registered HBAs:
localhost kernel: HBA no. Boardtype    Revis  EATA Bus  BaseIO IRQ DMA Ch ID Pr QS  S/G IS
localhost kernel: scsi0 : PM2144UW     v07H.1 2.0c PCI  0xe010  12 BMST 1 7  N  64 252 Y
localhost kernel: scsi0 : EATA (Extended Attachment) HBA driver
localhost kernel: scsi : 1 host.
localhost kernel:   Vendor: SONY      Model: CD-ROM CDU-415    Rev: 1.1i
localhost kernel:   Type:   CD-ROM                             ANSI SCSI
revision: 02
localhost kernel: Detected scsi CD-ROM sr0 at scsi0, channel 0, id 3, lun 0
localhost kernel:   Vendor: IBM       Model: DCAS-34330W       Rev: S61A
localhost kernel:   Type:   Direct-Access                      ANSI SCSI
revision: 02
localhost kernel: Detected scsi disk sda at scsi0, channel 0, id 5, lun 0
localhost kernel:   Vendor: IBM       Model: DCAS-34330W       Rev: S61A
localhost kernel:   Type:   Direct-Access                      ANSI SCSI
revision: 02
localhost kernel: Detected scsi disk sdb at scsi0, channel 0, id 6, lun 0
localhost kernel: scsi0: queue depth for target 3 on channel 0 set to 6
localhost kernel: scsi0: queue depth for target 5 on channel 0 set to 27
localhost kernel: scsi0: queue depth for target 6 on channel 0 set to 27
localhost kernel: scsi : detected 1 SCSI cdrom 2 SCSI disks total.
localhost kernel: SCSI device sda: hdwr sector= 512 bytes. Sectors= 8466688
[4134 MB] [4.1 GB]
localhost kernel: SCSI device sdb: hdwr sector= 512 bytes. Sectors= 8466688
[4134 MB] [4.1 GB]

After repeating the tar and delete process several times (in the region of
5-10 iterations), I would receive:

localhost kernel: scsi0 channel 0 : resetting for second half of retries.
localhost kernel: SCSI bus is being reset for host 0 channel 0.
localhost kernel: eata_reset called pid:32712 target: 6 lun: 0 reason 0
localhost kernel: eata_reset: slot 1 in reset, pid 32730.
localhost kernel: eata_reset: slot 3 in reset, pid 32732.
localhost kernel: eata_reset: slot 30 in reset, pid 32592.
localhost kernel: eata_reset: slot 40 in reset, pid 32602.
localhost kernel: eata_reset: slot 44 in reset, pid 32660.
localhost kernel: eata_reset: slot 50 in reset, pid 32718.
localhost kernel: eata_reset: slot 51 in reset, pid 32719.
localhost kernel: eata_reset: slot 53 in reset, pid 32721.
localhost kernel: eata_reset: slot 56 in reset, pid 32724.
localhost kernel: eata_reset: slot 59 in reset, pid 32675.
localhost kernel: eata_reset: board reset done, enabling interrupts.
localhost kernel: eata_reset: interrupts disabled again.
localhost kernel: eata_reset: slot 1 locked, DID_RESET, pid 32730 done.
localhost kernel: eata_reset: slot 3 locked, DID_RESET, pid 32732 done.
localhost kernel: eata_reset: slot 30 locked, DID_RESET, pid 32592 done.
localhost kernel: eata_reset: slot 40 locked, DID_RESET, pid 32602 done.
localhost kernel: eata_reset: slot 44 locked, DID_RESET, pid 32660 done.
localhost kernel: eata_reset: slot 50 locked, DID_RESET, pid 32718 done.
localhost kernel: eata_reset: slot 51 locked, DID_RESET, pid 32719 done.
localhost kernel: eata_reset: slot 53 locked, DID_RESET, pid 32721 done.
localhost kernel: eata_reset: slot 56 locked, DID_RESET, pid 32724 done.
localhost kernel: eata_reset: slot 59 locked, DID_RESET, pid 32675 done.
localhost kernel: eata_reset: exit, wakeup.
localhost kernel: eata_dma: int_handler, reseted command pid 32602 returned
localhost kernel: eata_dma: int_handler, reseted command pid 32675 returned
localhost kernel: eata_dma: int_handler, reseted command pid 32724 returned
localhost kernel: SCSI disk error : host 0 channel 0 id 6 lun 0 return code
= 27000002
localhost kernel: scsidisk I/O error: dev 08:17, sector 422394
localhost kernel: scsi : aborting command due to timeout : pid 32718,
scsi0, channel 0, id 6, lun 0 Write (10) 00 00 39 f2 43 00 00 f4 00 
localhost kernel: eata_abort called pid: 32718 target: 6 lun: 0 reason 3
localhost kernel: Returning: SCSI_ABORT_BUSY
localhost kernel: scsi : aborting command due to timeout : pid 32719,
scsi0, channel 0, id 6, lun 0 Write (10) 00 00 39 f3 37 00 00 f4 00 
localhost kernel: eata_abort called pid: 32719 target: 6 lun: 0 reason 3
localhost kernel: Returning: SCSI_ABORT_BUSY
localhost kernel: scsi : aborting command due to timeout : pid 32721,
scsi0, channel 0, id 6, lun 0 Write (10) 00 00 39 f5 1f 00 00 f4 00 
localhost kernel: eata_abort called pid: 32721 target: 6 lun: 0 reason 3
localhost kernel: Returning: SCSI_ABORT_BUSY
localhost kernel: scsi : aborting command due to timeout : pid 32592,
scsi0, channel 0, id 6, lun 0 Write (10) 00 00 39 7a 9f 00 00 02 00 
localhost kernel: eata_abort called pid: 32592 target: 6 lun: 0 reason 3
localhost kernel: Returning: SCSI_ABORT_BUSY
localhost kernel: scsi : aborting command due to timeout : pid 32730,
scsi0, channel 0, id 6, lun 0 Write (10) 00 00 3a 01 a7 00 00 f4 00 
localhost kernel: eata_abort called pid: 32730 target: 6 lun: 0 reason 3
localhost kernel: Returning: SCSI_ABORT_BUSY
localhost kernel: scsi : aborting command due to timeout : pid 32732,
scsi0, channel 0, id 6, lun 0 Write (10) 00 00 3a 03 8f 00 00 f4 00 
localhost kernel: eata_abort called pid: 32732 target: 6 lun: 0 reason 3
localhost kernel: Returning: SCSI_ABORT_BUSY
localhost kernel: scsi : aborting command due to timeout : pid 32660,
scsi0, channel 0, id 6, lun 0 Write (10) 00 00 39 bc e3 00 00 f4 00 
localhost kernel: eata_abort called pid: 32660 target: 6 lun: 0 reason 3
localhost kernel: Returning: SCSI_ABORT_BUSY
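
The "Write (10)" lines above can be decoded to see what was being written when
the commands timed out. The sketch below assumes that the kernel prints the
command name in place of the opcode byte, so the nine hex bytes that follow
are bytes 1 to 9 of the 10-byte command block (logical block address in
bytes 2 to 5, transfer length in bytes 7 and 8, both big-endian, as in the
SCSI-2 WRITE(10) layout). The function name is hypothetical.

```python
def decode_write10(tail_hex):
    """Decode the nine bytes logged after 'Write (10)'.

    Returns (logical block address, number of blocks).
    """
    tail = bytes.fromhex(tail_hex)
    if len(tail) != 9:
        raise ValueError("expected 9 bytes after the opcode")
    lba = int.from_bytes(tail[1:5], "big")     # command bytes 2-5
    blocks = int.from_bytes(tail[6:8], "big")  # command bytes 7-8
    return lba, blocks


# Three of the timed-out commands from the log, keyed by pid.
logged = {
    32718: "00 00 39 f2 43 00 00 f4 00",
    32719: "00 00 39 f3 37 00 00 f4 00",
    32721: "00 00 39 f5 1f 00 00 f4 00",
}
for pid, tail in logged.items():
    lba, blocks = decode_write10(tail)
    print(pid, "lba", lba, "blocks", blocks, "bytes", blocks * 512)
```

Decoded this way, most of the aborted commands are 244-block writes whose
start addresses follow on from one another, which is what one would expect
from a large sequential write such as the tar test.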

I would get a couple of screens more of these SCSI_ABORT_BUSY messages,
before another screenful of eata_resets for a whole bunch of slots. The
lines after those resets would be something like:

eata_reset: board reset done, enabling interrupts.
eata_dma: int_handler, reseted
command pid 32592 returned
command pid 33023 returned
kernel panic: SCSI_free trying to free unused memory

whereupon the system will lock solid.
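
Two of the numbers in the EATA_DMA log above can be unpacked mechanically:
the "return code = 27000002" and the "dev 08:17" in the disk error. The sketch
below assumes both are printed in hexadecimal, that the return code packs
status, message, host and driver bytes from low to high, and that SCSI disks
use major 8 with 16 minor numbers per disk. All function names are made up for
illustration, and the field meanings should be checked against
drivers/scsi/scsi.h in the kernel source.

```python
def split_result(result):
    """Split a 32-bit SCSI result word into its four bytes."""
    return {
        "status": result & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "host": (result >> 16) & 0xFF,
        "driver": (result >> 24) & 0xFF,
    }


def split_dev(dev):
    """Split a hex 'major:minor' string such as '08:17'."""
    major, minor = (int(part, 16) for part in dev.split(":"))
    return major, minor


def sd_name(minor):
    """Name a SCSI disk or partition, assuming 16 minors per disk."""
    disk = chr(ord("a") + (minor >> 4))
    part = minor & 0x0F
    return "sd" + disk + (str(part) if part else "")


fields = split_result(0x27000002)
major, minor = split_dev("08:17")
print({name: hex(value) for name, value in fields.items()})
print("major", major, "minor", minor, "->", sd_name(minor))
```

Under those assumptions the status byte is 0x02, which is CHECK CONDITION in
SCSI-2, the host byte is zero, and the failing device is the seventh partition
on sdb, i.e. the ID 6 drive.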

Here's what I did to try and diagnose the problem. Initially, I assumed it
was a hardware problem, so I checked the cabling and termination settings
on the controller. However, cabling and termination seem unlikely culprits,
as the system performs without a hitch in 95 and NT 4.

Next came the issue of a defective drive. I managed to replace the ID 6
drive with another drive (brand new, exact same model). However, the
problems persisted (due to the regularity of this phrase, I'll abbreviate
it to HTPP).

I then performed a low-level format on the drive, thinking that perhaps the
media was defective. HTPP. I then disconnected the ID 6 drive, and tried
installing Linux on the ID 5 drive. HTPP. I then also disconnected the EIDE
drive, disabled the IDE drive settings in motherboard BIOS, and recompiled
the kernel without IDE support. HTPP. I then removed the ID 5 drive, and
reconnected the ID 6 drive (now the only hard drive connected), and tried
again. HTPP.

I then turned to the SCSI card itself. I tried moving it to another PCI
slot. HTPP. I tried decreasing the PCI Latency Timer setting in the
motherboard BIOS settings. HTPP. I then tried disabling the Extended PCI
Req setting (controlling the duration of PCI Bus Request Signals) and
Tagged Command Queueing in the controller BIOS. HTPP.

For the EATA driver, I then tried disabling Linked Commands and Tagged
Queuing with the lc:n, tc:n parameters, and set the Max Queue size to 2
with the mc:2 parameter. HTPP.

As a last desperate measure, I tried editing scsi.c, changing any value
that referred to a timeout to an absurdly large value, recompiled with both
drivers and tried again. Since I have no idea about the SCSI code, this was
a shot in the dark. HTPP.

As a test, I reconnected all the drives, and installed Redhat to the EIDE
drive. During the install, the EATA_DMA driver was used to copy ~200
megabytes of distribution files from the SCSI CDROM to the EIDE drive
without a hitch. I now have a perfectly working (but painfully slow)
distribution of Linux on my EIDE drive.

I have not tried the following:

o using another Linux distribution.

o moving the controller, CDROM and a hard drive to another system and
seeing whether the problems persist (lack of resources).

o replacing the DPT 2144UW with another DPT 2144 UW controller.

o replacing the DPT 2144UW with a more "popular" card like the Adaptec 2940UW.

In the last two cases I am hamstrung because the dealer needs proof of
hardware malfunction before I can exchange hardware. It is rather hard to
prove that my controller card is defective when it works flawlessly in
Windows 95 and NT 4.

Now, I am a SCSI novice (all my experience with Linux has been with IDE),
but I have pondered several things:

o It does not seem to be an issue of defective media (sector wise). I have
used 3 separate drives (all of the identical model), and low-level formatted
and "high-level" formatted them with verification. There were no problems
with bad sectors.

o It does not seem to be an issue of defective SCSI electronics on the hard
drives or on the card, because then I would be experiencing problems in
Win95 and NT 4 as well (as a test, I have copied 500-800 megabytes of files
across drives without a hitch, in those environments).

o Is it then an issue of a flaw in the drivers? I'm not sure about this as
no one else seems to suffer these problems, and because the same error (bus
resetting) occurs in BOTH drivers.

o Is it then a flaw in Linux's SCSI handling code? I don't have the
knowledge to comment here, although when examining some of the eata driver
source code I came across references to bugs in the mid-level code.

o Is it then something about my particular setup (whether it be a
combination of motherboard, drive or card)? If it is, is it Linux specific,
because of the lack of problems with Microsoft OSs?

As I said, Linux runs fine on my IDE system, and accessing the SCSI CD-ROM
through it causes no problems. Most importantly, Win95 and NT 4 work fine too.

Frankly, I am out of ideas. As I said I am not in a position to return the
DPT card (nor do I really want to since barring this I am very pleased with
its performance), because it does not exhibit any defects in Microsoft
operating systems.

I would really really really appreciate anyone offering any suggestions or
experiences regarding this situation, as I am quite (no, make that VERY)
desperate. This is my first experience with SCSI on Linux, and it certainly
is one I won't forget in a hurry; I have been pondering a solution and
talking to people about it, for the best part of two weeks. I cannot help
but wonder whether, if I had gone the path of most students and stuck to
bog-standard EIDE drives I might have saved myself a lot of pain, but now
that I have tasted SCSI...

humble grovellings in advance,

Eu Hin

--
"When seagulls follow the trawler it is because they think sardines
 will be thrown into the sea" - Eric Cantona.

"If a Frenchman goes on about seagulls, trawlers and sardines he's
 called a philosopher; I'd just be called a short Scottish bum talking
 crap" - Gordon Strachan.

http://www.iinet.net.au/~emeritus

