[361] in linux-scsi channel archive

NCR troubles :(

daemon@ATHENA.MIT.EDU (Geoffrey Bennett)
Thu Jul 13 19:18:58 1995

From: Geoffrey Bennett <geoffrey@tafe.sa.edu.au>
To: linux-scsi@vger.rutgers.edu
Date: Fri, 14 Jul 1995 02:38:09 +0930 (CST)

I hope someone can help here...

I've got a 486DX4/100 PCI computer with an NCR53c810 controller and
a Seagate ST31230N drive.  If the computer is cold and I thrash the
hard disk a bit (cat /dev/sda > /dev/null & find / & etc.), it
eventually dies.  I've tried 1.2.10, 1.2.11+diffs.rel8, and
1.2.11+diffs.rel8+Stephen's small patch, and while the errors differ
between kernel versions, all of them eventually fail.

I've tried three different drives in the computer (all ST31230Ns,
though).  I went to my dealer and explained the problem (mentioned SCSI
timeouts, and the NCR driver not handling SCSI problems too well), and
he suggested I get the drive "AV Certified" (apparently this stops it
doing thermal recalibration, during which it is unable to give out
any data, so it may time out).  I didn't want to wait for the drive to
be shipped to Seagate, certified, then shipped back, so I ordered
another one and specified that it be AV-C (I was buying another
computer anyway and needed a disk for that).  I just got that drive
today and there is no difference (the kernel still dies).  However, I
can't find any sticker that tells me the drive is certified, so
I'll check back with my dealer to find out how I can tell that it
really is certified.

I've tried two (different brand) NCR53c810 controllers in the computer
and they both die the same way.  I've tried a different SCSI cable,
and it still gives the same problems.  The only thing that fixes the
problem (apart from keeping the computer warm) is changing the BIOS
option "CPU Clock/PCI Clock" from "1:1" to "1:1/2", but I don't really
like the sound of that, even if the disk transfer rate stays the
same.  Does this point to a motherboard problem (which the kernel
just happens not to handle gracefully)?

Anyway, the error messages go something like this for a 1.2.10 kernel
(errors and omissions excepted):

scsi0 : unexpected phase unknown at dsp = 0x1d0270
001d0270 : 0x0e000001 0x001cfa99
001d0278 : 0x48000000 0x00000000
scsi0 : DANGER: abort_connected() called
scsi0 : DMA FIFO not empty
scsi0 : DMA FIFO not empty

This then repeats for a while, and if I get too many of these messages
at once I get a "resetting for second half of retries", then
something "is a nop", and then it locks.

For 1.2.11+diffs.rel8, it varies a bit more, but along the lines of:

scsi0 : unexpected phase MSGOUT during select message out DSP = xxxx
DANGER, abort etc.
scsi0 : did this command ever run?
or
unexpected stuff...
general protection: 0000
EIP 10:00199540
swapper dying, dead, panic, lockup.

Putting in a bunch of printk's, I've tracked it down to line 4368 of
53c7,8xx.c where it's got the loop:

for (curr = (struct ... *) hostdata->issue_queue,
     prev = (struct ... **) &(hostdata->issue_queue);
     curr && curr->cmd != cmd; prev = (struct NCR53c7x0_cmd **)
     &(curr->next), curr = (struct ... *) curr->next);

(this is the general prot, not the "did this command ever run" dying
sequence)
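
For anyone not used to this loop shape, here is a minimal userspace
sketch of the curr / pointer-to-pointer prev walk.  The struct
demo_cmd type and the demo_unlink function are hypothetical stand-ins,
not the driver's own definitions, and the unlink step after the loop
is an assumption, since only the loop itself is quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's command structure; only the
 * two fields the loop touches are modelled here. */
struct demo_cmd {
    int id;                 /* plays the role of the ->cmd comparison key */
    struct demo_cmd *next;  /* singly linked queue */
};

/* Walk the queue with `curr` and a pointer-to-pointer `prev`, the same
 * shape as the quoted loop.  When a match is found, *prev is the link
 * that points at it, so unlinking needs no special case for the head.
 * Returns the removed node, or NULL if no node has the given id. */
static struct demo_cmd *demo_unlink(struct demo_cmd **queue, int id)
{
    struct demo_cmd *curr;
    struct demo_cmd **prev;

    assert(queue != NULL);
    for (curr = *queue, prev = queue;
         curr && curr->id != id;
         prev = &curr->next, curr = curr->next)
        ;                   /* all the work happens in the loop header */

    if (curr) {
        *prev = curr->next; /* assumed follow-up: splice the node out */
        curr->next = NULL;
    }
    return curr;
}
```

The loop trusts every next pointer it meets: the only stop conditions
are a NULL link or a match, so one bad link sends it off into
arbitrary memory.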

I expanded out the code a bit & put in some more printk's:

for (curr = ..., prev = ...; curr && .. != cmd ; ) {
  printk("%x %x %x %x\n", prev, curr, curr->next, &(curr->next));
  prev = ...
  curr = ...
}

and it showed:

1d06c0 1d2720 f000 1d2740
1d2740 f000 43b41943 f020
general protection: 0000
EIP: etc etc etc.
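
One way to read those two lines, sketched in userspace C.  The columns
are taken in printk order (prev, curr, curr->next, &(curr->next)).
The struct and function names are made up for illustration, and the
"plausible address" window passed to first_suspect_row is purely a
guess by the caller, not anything the driver defines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One line of the debug output, in the order the printk printed it. */
struct trace_row {
    uint32_t prev;       /* address of the link that led to curr */
    uint32_t curr;       /* node being examined */
    uint32_t next;       /* value stored in curr->next */
    uint32_t next_addr;  /* &(curr->next) */
};

/* The two rows printed before the fault, copied from the message. */
static const struct trace_row rows[] = {
    { 0x1d06c0, 0x1d2720, 0xf000,     0x1d2740 },
    { 0x1d2740, 0xf000,   0x43b41943, 0xf020   },
};

/* Offset of the `next` field within the structure, as implied by a row. */
static uint32_t next_offset(const struct trace_row *r)
{
    assert(r != NULL);
    return r->next_addr - r->curr;
}

/* 1 if row b is what the loop would print immediately after row a. */
static int rows_chain(const struct trace_row *a, const struct trace_row *b)
{
    assert(a != NULL && b != NULL);
    return b->prev == a->next_addr && b->curr == a->next;
}

/* Index of the first row whose `next` is non-NULL and falls outside
 * [lo, hi), or -1 if every row passes.  lo and hi are the caller's guess. */
static int first_suspect_row(const struct trace_row *r, size_t n,
                             uint32_t lo, uint32_t hi)
{
    size_t i;

    assert(r != NULL);
    for (i = 0; i < n; i++) {
        if (r[i].next != 0 && (r[i].next < lo || r[i].next >= hi))
            return (int)i;
    }
    return -1;
}
```

Both rows give an offset of 0x20 for next, and the second row chains
from the first, so the trace is self-consistent.  With a guessed window
of 0x100000 to 0x200000 (the other addresses sit near 0x1d0000),
first_suspect_row(rows, 2, 0x100000, 0x200000) flags row 0: the node
at 0x1d2720 already holds the odd-looking link 0xf000, and the loop
then follows it.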

I dunno if that helps any.  Sorry I don't have a full dump of
everything but 1) this message is too long already and 2) there's
so much to write down that there's no point in doing it if it
doesn't help.  If more kernel error messages would help, I can
produce them at will and write them down.

Seeing as you got this far, thanks for your time.

Regards,
--
 ___
/  __
\___|eoffrey D. Bennett!-)            geoffrey@tafe.sa.edu.au