[4] in linux-scsi channel archive


A very strange SCSI problem with 1.1.{62,72} (long)

daemon@ATHENA.MIT.EDU (Luca Maranzano)
Fri Dec 30 13:46:00 1994

From: liuk@etabeta.com.dist.unige.it (Luca Maranzano)
To: linux-kernel@vger.rutgers.edu
Date: Fri, 30 Dec 1994 14:28:56 +0100 (MET)
Cc: linux-scsi@vger.rutgers.edu


Hi all!

I have a very strange problem with Linux 1.1.{62,72} on my PC.

Here is the story.

Ever since I added the SCSI-2 ISA controller (2 months ago), the system
has occasionally hung for no apparent reason, most of the time during
access to the SCSI HDD. The odd thing is that it happens without any
regularity: the system is up 24 hours a day, and the hang may happen
twice in one day or not at all for more than a week of regular uptime.

Before going on, here is my configuration:

 - i486 DX 40 w/ 8 MB of RAM, 256k L2 Cache
 - VLB Mother Board 
 - Cirrus 5428 VLB Video Card
 - 2 Maxtor IDE HDD, 345M and 245M
 - 1 IBM SCSI 2 HDD, 574M, model 0662S
 - Tekram DC-300B SCSI 2 ISA controller w/ 512k of cache
 - 1 Mitsumi CD-Rom
 - SoundBlaster Compatible Audio Card
 - PPP and UUCP connection to The Net

Here is the salient output of dmesg:

...
snd2 <SoundBlaster 2.1> at 0x220 irq 5 drq 1
mcd=0x300,10: Mitsumi status, type and version : 00 D 4
Calibrating delay loop.. ok - 19.97 BogoMips
Configuring Adaptec at IO:330, IRQ 11, DMA priority 5
scsi0 : Adaptec 1542
scsi : 1 hosts.
  Vendor: IBMRAID   Model: 0662S089337       Rev: 1014
  Type:   Direct-Access                      ANSI SCSI revision: 02
Detected scsi disk sda at scsi0, id 0, lun 0
scsi : detected 1 SCSI disk total.
Memory: 7084k/8448k available (656k kernel code, 384k reserved, 324k data)
...
Linux version 1.1.62 (root@etabeta) (gcc version 2.5.8) #1 Tue Dec 6 21:37:04 MET 1994
Partition check:
  sda: sda1 sda2 < sda5 sda6 sda7 sda8 >
  hda: Maxtor 7345 AT, 329MB w/64KB Cache, CHS=790/15/57, MaxMult=32
  hda: hda1 hda2 hda3 hda4
  hdb: Maxtor 7245 AT, 234MB w/64KB Cache, CHS=967/16/31, MaxMult=32
  hdb: hdb1 hdb2 hdb3 < hdb5 hdb6 hdb7 >
VFS: Mounted root (ext2 filesystem) readonly.
Adding Swap: 25596k swap-space                 /* <- on /dev/sda1 */
Adding Swap: 20332k swap-space                 /* <- on /dev/hdb2 */
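
As a side note, the IDE sizes in the dmesg output (329MB and 234MB) look
smaller than the 345M and 245M drive labels, but they can be reproduced
from the reported CHS geometry. The sketch below assumes 512-byte sectors
and that the kernel reports truncated binary megabytes; both are
assumptions, not something the log states.

```python
# Recompute IDE disk capacity from the CHS geometry shown in dmesg.
# Assumption: 512-byte sectors; sizes truncated to whole binary megabytes.
SECTOR_BYTES = 512


def chs_capacity_bytes(cylinders, heads, sectors):
    """Total capacity in bytes for a cylinders/heads/sectors geometry."""
    return cylinders * heads * sectors * SECTOR_BYTES


def chs_capacity_mib(cylinders, heads, sectors):
    """Capacity in binary megabytes (2**20 bytes), truncated."""
    return chs_capacity_bytes(cylinders, heads, sectors) // 2**20


# Geometries copied from the dmesg lines for hda and hdb.
disks = {"hda": (790, 15, 57), "hdb": (967, 16, 31)}

for name, chs in disks.items():
    print(name, chs_capacity_bytes(*chs), "bytes =", chs_capacity_mib(*chs), "MB")
```

Under these assumptions hda comes out at 345,830,400 bytes (329 binary MB)
and hdb at 245,571,584 bytes (234 binary MB), so the drive labels and the
kernel figures describe the same capacity in decimal and binary units.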

Typically this happened in these situations:

 1- while rnews was running (I get comp.os.linux.* via UUCP, and I'm 
    a Fidonet Point)
 2- during kernel compilation 
 3- during vi sessions
 4- running 'mthread' (yes, I use trn for news reading :)
 5- during news expire

All these operations involve accesses to the SCSI disk, for the following
reasons:

 1- the news spool area is on /dev/sda5, while the /usr/lib/news stuff
    is on /dev/hda3
 2- the kernel sources are on /dev/sda6, while all the gcc stuff is
    on /dev/hda3
 3- the edited file was on a SCSI partition (my home)
 4- same as 1
 5- same as 1

It made no difference whether X was running or not (it happened in both
cases, without any particular preference :)). I had been up under X11 for
7 days without any problem, with rnews running several times every day,
until the last run yesterday evening, when the system suddenly froze :-(
Considering that under X11 the swap traffic is moderately high and that
a swap partition is on the SCSI disk, this adds more strangeness to the
story, doesn't it?

The good thing (the VERY good thing) is that after the reboot the data
loss was _ALWAYS_ minimal, just some .o files if it happened during kernel 
compilation or some articles if it happened during news operations.

This IMHO is a great thing. The ext2 fs and the e2fsck program seem
to be rock solid and very reliable, even in these critical conditions.

The not-so-good thing is that most of the time syslogd wasn't able to
report anything, even though the /var/adm directory is on /dev/hda4
(an IDE disk).

When it was able to write something, here is what it wrote:

Dec  1 10:01:06 etabeta kernel: scsi : aborting command due to timeout : pid 575, scsi0, id 0, lun 0 Write (6) 01 9b ef 24 00
Dec  1 10:01:06 etabeta kernel: SCSI host 0 abort() timed out - reseting
Dec  1 10:01:06 etabeta kernel: Sent BUS DEVICE RESET to target 0
Dec  1 10:01:06 etabeta kernel: Sending DID_RESET for target 0
Dec  1 10:01:06 etabeta kernel: Sending DID_RESET for target 0
Dec  1 10:01:06 etabeta kernel: aha1542_intr_handle: Unexpected interrupt
Dec  1 10:01:06 etabeta kernel: tarstat=0, hastat=0 idlun=10 ccb#=6
Dec  1 10:01:06 etabeta kernel: aha1542_intr_handle: Unexpected interrupt
Dec  1 10:01:06 etabeta kernel: tarstat=0, hastat=0 idlun=10 ccb#=7
Dec  1 10:01:06 etabeta kernel: scsi : aborting command due to timeout : pid 575, scsi0, id 0, lun 0 Write (6) 01 9b ef 24 00
Dec  1 10:01:06 etabeta kernel: SCSI host 0 abort() timed out - reseting
Dec  1 10:01:06 etabeta kernel: Sent BUS DEVICE RESET to target 0
Dec  1 10:01:06 etabeta kernel: Sending DID_RESET for target 0
Dec  1 10:01:06 etabeta last message repeated 2 times
Dec  1 10:01:06 etabeta kernel: aha1542_intr_handle: Unexpected interrupt
Dec  1 10:01:06 etabeta kernel: tarstat=0, hastat=0 idlun=10 ccb#=0
Dec  1 10:01:06 etabeta kernel: Sending DID_RESET for target 0
Dec  1 10:01:06 etabeta kernel: aha1542_intr_handle: Unexpected interrupt
Dec  1 10:01:06 etabeta kernel: tarstat=0, hastat=0 idlun=10 ccb#=1
Dec  1 10:01:06 etabeta kernel: Sending DID_RESET for target 0
.... etc etc
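
For anyone reading the log above, the bytes after "Write (6)" can be
decoded by hand. The sketch below assumes the kernel prints the five
command bytes that follow the opcode, and uses the standard 6-byte WRITE
layout (3-bit LUN plus 21-bit block address, one length byte where 0
means 256 blocks, one control byte). The function name is made up for
illustration.

```python
# Decode the five bytes printed after "Write (6)" in the timeout message.
# Assumption: these are the CDB bytes following the opcode byte.


def decode_write6(tail_hex):
    """Return LUN, logical block address and block count of a WRITE(6)."""
    b = [int(x, 16) for x in tail_hex.split()]
    if len(b) != 5:
        raise ValueError("expected 5 bytes after the opcode")
    lun = b[0] >> 5                                   # top 3 bits of byte 1
    lba = ((b[0] & 0x1F) << 16) | (b[1] << 8) | b[2]  # 21-bit block address
    blocks = b[3] if b[3] != 0 else 256               # 0 means 256 blocks
    return {"lun": lun, "lba": lba, "blocks": blocks, "control": b[4]}


# Bytes copied from the "aborting command due to timeout" line.
info = decode_write6("01 9b ef 24 00")
print(info)
```

With the bytes from the log this gives LUN 0, block address 105455 and a
36-block transfer. The same command (pid 575) shows up in both timeout
messages, so it is one stuck write being retried, not two different ones.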

Other times the system suddenly hangs without any message (not even
on the console), even though NumLock still responds and console switching
works. Ctrl-Alt-Del sometimes worked and the shutdown process started,
but it never finished properly.

I didn't count the times the problem happened; I'd say 12 to 15 in 2 months.

Now, on to my ideas about the possible causes:

 - My motherboard chipset is buggy, and doesn't support the bus-master
   controller very well.
 - The SCSI controller is not *fully* AHA 1540/1542 compatible as it
   claims to be (but I've read on the News about one Linuxer who is
   successfully using the VLB version without problems).
 - The SCSI disk doesn't work properly (but in that case, shouldn't the
   kernel be able to handle the malfunction?).
 - There is some strange bug in the SCSI part of the kernel.
   I've tried only 62 and 72, and I don't know whether it would be useful
   to try other versions, since the problem happens so irregularly!
   (sometimes twice a day, sometimes once a week :-( )

Please let me know if you have any ideas about this. I'm going crazy: the
system is quite unreliable and I really don't know how to track it down!

Thanks for the patience :-) and Happy New Year !

Luca

