[8148] in linux-scsi channel archive


recovery behaviour with 1 bad + 1 good drive (aic7xxx)

daemon@ATHENA.MIT.EDU (Matthias Andree)
Tue Feb 22 10:16:13 2000

To: linux-scsi@vger.rutgers.edu
From: Matthias Andree <ma@dt.e-technik.uni-dortmund.de>
Date:   22 Feb 2000 15:17:03 +0100
Message-ID: <m34sb1cof4.fsf@emma1.emma.line.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

Hi,

I am wondering if the behaviour I see is correct. The system becomes almost
unresponsive while one disk is breaking down. Any access to the
defective drive brings the bus down for quite some time, effectively
several minutes.



Involved configuration: small UP P-II machine with AHA2940UW, two
Seagate drives (see below). 

One drive (sdb, containing /usr/local, which is nice to have but not direly
needed) dies at 6:33. In the following hour, the system goes into
endless SCSI bus reset loops, which cannot help since the hardware is
broken. Eventually, after 7 o'clock, the driver degrades the bus until it
is finally down to 20 MB/sec, without the slightest hope that this helps
the drive out of its head crash.

During recovery, the bus resets crash right into read and write
operations to /dev/sda, which are aborted as well. Shortly before 8 o'clock,
the system has finally responded to its fstab edit and can finally be shot
out of the way and rebooted (no sync, no init: reboot -n -f), at the
expense of gory fscks on the intact drives.


What we see is an endless series of bus resets due to timeouts, domain
revalidations and aborts on the bus, taking the intact drive out of
operation as well.

Later, I tried to analyze the drive using sformat. I had to abandon the
check since testing /dev/sdb was bringing /dev/sda down again. 

I find that unbearable. Is there any way to prevent a defective
drive from bringing down the entire bus, so that the bus is degraded and its
performance fucked beyond all limits?

There simply *MUST* be a way to bring all processes into D state (or
throw them off with SIGIOT or SIGKILL or something) and to "dismiss"
/dev/sdb from the system, i. e. prevent further use entirely. There MUST
be a way to prevent these self-destructing bus aborts. 

There is no hope of bringing unflushed buffers back to a broken disk; it
would be no good anyhow.

It is unbearable that it takes something like 75 minutes to get the system
down to reboot.

I'd expect that after a couple of bus resets induced by only a single drive,
the system decides that that drive is broken and does something similar
to echo "scsi remove-single-device 0 0 1 0" >/proc/scsi/scsi or at least locks
that drive so that the other operations can continue properly.
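
The policy wished for above can be approximated from user space. The following is only a minimal sketch, not anything the kernel or the aic7xxx driver does: it counts "aborting command due to timeout" lines per target in the syslog format quoted further down and prints (without executing) the remove-single-device request for any target that reaches a threshold. The function names and the threshold of 2 are assumptions made for illustration.

```python
import re
from collections import Counter

# Matches the "aborting command due to timeout" syslog line as quoted in this mail.
TIMEOUT_RE = re.compile(
    r"aborting command due to timeout : pid \d+, "
    r"scsi(?P<host>\d+), channel (?P<channel>\d+), id (?P<id>\d+), lun (?P<lun>\d+)"
)

def count_timeouts(lines):
    """Count command timeouts per (host, channel, id, lun)."""
    counts = Counter()
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            key = tuple(int(m.group(k)) for k in ("host", "channel", "id", "lun"))
            counts[key] += 1
    return counts

def suspects(counts, threshold=2):
    """Targets with at least `threshold` timeouts (hypothetical policy)."""
    return sorted(t for t, n in counts.items() if n >= threshold)

def removal_command(target):
    """Format the /proc/scsi/scsi request for one target; it is NOT executed here."""
    return 'echo "scsi remove-single-device %d %d %d %d" > /proc/scsi/scsi' % target

sample = [
    "Feb 22 06:33:46 deadhost kernel: scsi : aborting command due to timeout : "
    "pid 233702, scsi0, channel 0, id 1, lun 0 Read (6) 17 73 0f 16 00",
    "Feb 22 06:33:47 deadhost kernel: scsi : aborting command due to timeout : "
    "pid 233705, scsi0, channel 0, id 1, lun 0 Read (6) 15 aa f3 02 00",
    "Feb 22 06:34:16 deadhost kernel: SCSI bus is being reset for host 0 channel 0.",
]

for target in suspects(count_timeouts(sample)):
    print(removal_command(target))
```

Counting timeouts alone is too naive for real use: once the resets start, commands to the intact drive time out as well, so a usable policy would have to look only at the timeouts that precede the first reset.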

I am annoyed by that Adaptec junk. Be it the 1542CF, be it the
2940UW. Expensive and useless if something goes wrong.

Any way to prevent this that I have not seen? No, buying RAIDs is not an
option. 



host adaptor and drives:
------------------------------------------------------------------------
(scsi0) <Adaptec AHA-294X Ultra SCSI host adapter> found at PCI 9/0
(scsi0) Wide Channel, SCSI ID=7, 16/255 SCBs
scsi0 : Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI) 5.1.21/3.2.4
       <Adaptec AHA-294X Ultra SCSI host adapter>
scsi : 1 host.

(scsi0:0:0:0) Synchronous at 40.0 Mbyte/sec, offset 8.
  Vendor: SEAGATE   Model: ST34572W          Rev: 0718
  Type:   Direct-Access                      ANSI SCSI revision: 02
Detected scsi disk sda at scsi0, channel 0, id 0, lun 0

(scsi0:0:1:0) Synchronous at 40.0 Mbyte/sec, offset 8.
  Vendor: SEAGATE   Model: ST39140W          Rev: 1281
  Type:   Direct-Access                      ANSI SCSI revision: 02
Detected scsi disk sdb at scsi0, channel 0, id 1, lun 0

SCSI device sda: hdwr sector= 512 bytes. Sectors= 8888924 [4340 MB] [4.3 GB]
SCSI device sdb: hdwr sector= 512 bytes. Sectors= 17783240 [8683 MB] [8.7 GB]
------------------------------------------------------------------------

Now, at 6:33, sdb dies while sda is still happy. 
Feb 22 06:33:46 deadhost kernel: scsi : aborting command due to timeout : pid 233702, scsi0, channel 0, id 1, lun 0 Read (6) 17 73 0f 16 00 
Feb 22 06:33:47 deadhost kernel: scsi : aborting command due to timeout : pid 233705, scsi0, channel 0, id 1, lun 0 Read (6) 15 aa f3 02 00 
Feb 22 06:34:16 deadhost kernel: SCSI host 0 abort (pid 233702) timed out - resetting
Feb 22 06:34:16 deadhost kernel: SCSI bus is being reset for host 0 channel 0.
Feb 22 06:34:18 deadhost kernel: SCSI host 0 channel 0 reset (pid 233702) timed out - trying harder
Feb 22 06:34:18 deadhost kernel: SCSI bus is being reset for host 0 channel 0.
Feb 22 06:34:21 deadhost kernel: (scsi0:0:1:0) Performing Domain validation.
Feb 22 06:34:22 deadhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 26030000
Feb 22 06:34:22 deadhost kernel: scsidisk I/O error: dev 08:12, sector 1263678
Feb 22 06:34:22 deadhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 26030000
Feb 22 06:34:22 deadhost kernel: scsidisk I/O error: dev 08:12, sector 1146914
Feb 22 06:34:22 deadhost kernel: EXT2-fs error (device sd(8,18)): ext2_write_inode: unable to read inode block - inode=142880, block=573457
Feb 22 06:34:23 deadhost kernel: scsi0 channel 0 : resetting for second half of retries.
Feb 22 06:34:23 deadhost kernel: SCSI bus is being reset for host 0 channel 0.
Feb 22 06:34:26 deadhost kernel: (scsi0:0:0:0) Synchronous at 40.0 Mbyte/sec, offset 8.
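
For reading the lines above: the return code is a 32-bit word printed in hex, and "dev 08:12" is a major:minor pair in hex. The sketch below splits both, assuming the field layout and constant values of the kernel's scsi.h of that era (status, message, host and driver byte, from low to high) and the usual 16 minors per SCSI disk. Only the constants needed for this log are listed; the helper names are made up for illustration.

```python
def decode_result(code):
    """Split a 32-bit SCSI result word into its four bytes."""
    return {
        "status": code & 0xFF,          # SCSI status byte from the target
        "msg": (code >> 8) & 0xFF,      # message byte
        "host": (code >> 16) & 0xFF,    # host adapter (DID_*) code
        "driver": (code >> 24) & 0xFF,  # mid-level driver code plus suggestion
    }

# Subset of the constants, as assumed from scsi.h.
HOST_BYTE = {0x00: "DID_OK", 0x03: "DID_TIME_OUT", 0x08: "DID_RESET"}
DRIVER_BYTE = {0x00: "DRIVER_OK", 0x06: "DRIVER_TIMEOUT"}
SUGGEST = {0x00: "", 0x20: "SUGGEST_ABORT"}

def describe_result(code):
    """Human-readable names for the host and driver bytes, where known."""
    r = decode_result(code)
    host = HOST_BYTE.get(r["host"], hex(r["host"]))
    driver = DRIVER_BYTE.get(r["driver"] & 0x0F, hex(r["driver"] & 0x0F))
    suggest = SUGGEST.get(r["driver"] & 0xF0, hex(r["driver"] & 0xF0))
    return host, driver, suggest

def decode_dev(dev):
    """Turn 'MM:mm' (hex major:minor) into (major, minor, sd name)."""
    major, minor = (int(x, 16) for x in dev.split(":"))
    disk_index, partition = divmod(minor, 16)  # 16 minors per SCSI disk
    name = "sd" + chr(ord("a") + disk_index)
    return major, minor, name + (str(partition) if partition else "")

print(describe_result(0x26030000))
print(decode_dev("08:12"))
```

Under those assumptions, 26030000 is a plain command timeout with a suggestion to abort, and 08:12 is the second partition of sdb, which agrees with the sd(8,18) in the EXT2-fs line.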

WOP. Finally, the bus reset succeeded. After that, I have TONS of these
in my syslog:

Feb 22 06:46:18 deadhost kernel: scsidisk I/O error: dev 08:12, sector 1263636
Feb 22 06:46:18 deadhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 26030000
Feb 22 06:46:18 deadhost kernel: scsidisk I/O error: dev 08:12, sector 1263006
Feb 22 06:46:18 deadhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 26030000
Feb 22 06:46:18 deadhost kernel: scsidisk I/O error: dev 08:12, sector 1263782
Feb 22 06:46:18 deadhost kernel: scsi0 channel 0 : resetting for second half of retries.
Feb 22 06:46:18 deadhost kernel: SCSI bus is being reset for host 0 channel 0.
Feb 22 06:46:18 deadhost kernel: (scsi0:0:0:0) Synchronous at 40.0 Mbyte/sec, offset 8.
Feb 22 06:46:18 deadhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 26030000
Feb 22 06:46:18 deadhost kernel: scsidisk I/O error: dev 08:12, sector 1263512
Feb 22 06:46:18 deadhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 26030000
[...]
Feb 22 06:46:18 deadhost kernel: (scsi0:0:0:0) Synchronous at 40.0 Mbyte/sec, offset 8.
Feb 22 06:46:18 deadhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 26030000
Feb 22 06:46:18 deadhost kernel: scsidisk I/O error: dev 08:12, sector 1263514
Feb 22 06:46:18 deadhost kernel: scsi0 channel 0 : resetting for second half of retries.
Feb 22 06:46:18 deadhost kernel: SCSI bus is being reset for host 0 channel 0.
Feb 22 06:46:18 deadhost kernel: (scsi0:0:0:0) Performing Domain validation.
Feb 22 06:46:18 deadhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 26030000
Feb 22 06:46:18 deadhost kernel: scsidisk I/O error: dev 08:12, sector 1149118
Feb 22 06:46:18 deadhost kernel: (scsi0:0:0:0) Successfully completed Domain validation.
Feb 22 06:46:18 deadhost kernel: (scsi0:0:0:0) Synchronous at 40.0 Mbyte/sec, offset 8.
Feb 22 06:46:18 deadhost kernel: (scsi0:0:0:0) Performing Domain validation.
Feb 22 06:46:18 deadhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 26030000
Feb 22 06:46:18 deadhost kernel: scsidisk I/O error: dev 08:12, sector 1263638
Feb 22 06:46:18 deadhost kernel: scsi0 channel 0 : resetting for second half of retries.
Feb 22 06:46:18 deadhost kernel: SCSI bus is being reset for host 0 channel 0.
Feb 22 06:46:18 deadhost kernel: (scsi0:0:0:0) Successfully completed Domain validation.



Eventually, these bus resets tampered with the intact drive:

Feb 22 06:46:18 deadhost kernel: scsi : aborting command due to timeout : pid 234510, scsi0, channel 0, id 0, lun 0 Write (6) 02 0e f1 06 00 
Feb 22 06:46:18 deadhost kernel: scsi : aborting command due to timeout : pid 234409, scsi0, channel 0, id 0, lun 0 Write (6) 02 00 4b 02 00 
Feb 22 06:46:18 deadhost kernel: scsi : aborting command due to timeout : pid 234392, scsi0, channel 0, id 0, lun 0 Write (6) 02 00 47 02 00 
Feb 22 06:46:18 deadhost kernel: scsi : aborting command due to timeout : pid 234802, scsi0, channel 0, id 0, lun 0 Write (6) 18 6b fa 02 00 
Feb 22 06:46:18 deadhost kernel: scsi : aborting command due to timeout : pid 234793, scsi0, channel 0, id 0, lun 0 Write (6) 17 2b a6 02 00 
Feb 22 06:46:18 deadhost kernel: scsi : aborting command due to timeout : pid 234574, scsi0, channel 0, id 0, lun 0 Write (6) 0a 80 55 04 00 
Feb 22 06:46:18 deadhost kernel: scsi : aborting command due to timeout : pid 234770, scsi0, channel 0, id 0, lun 0 Write (6) 17 2b 96 02 00 
Feb 22 06:46:18 deadhost kernel: scsi : aborting command due to timeout : pid 234745, scsi0, channel 0, id 0, lun 0 Write (6) 17 2b 86 04 00 
Feb 22 06:46:18 deadhost kernel: scsi : aborting command due to timeout : pid 234704, scsi0, channel 0, id 0, lun 0 Write (6) 15 2b 8c 02 00 
Feb 22 06:46:18 deadhost kernel: scsi : aborting command due to timeout : pid 234661, scsi0, channel 0, id 0, lun 0 Write (6) 15 2b 6e 02 00 
Feb 22 06:46:18 deadhost kernel: scsi : aborting command due to timeout : pid 234637, scsi0, channel 0, id 0, lun 0 Write (6) 10 ab 5c 04 00 
Feb 22 06:46:18 deadhost kernel: scsi : aborting command due to timeout : pid 234613, scsi0, channel 0, id 0, lun 0 Write (6) 0b 81 27 02 00 
Feb 22 06:46:18 deadhost kernel: scsi : aborting command due to timeout : pid 234601, scsi0, channel 0, id 0, lun 0 Write (6) 0a c0 59 04 00 

This game continues. 
Remarkable points after that:

Feb 22 06:57:40 deadhost kernel: (scsi0:0:0:0) Synchronous at 32.0 Mbyte/sec, offset 8.

Feb 22 07:06:05 deadhost kernel: (scsi0:0:0:0) Synchronous at 26.8 Mbyte/sec, offset 8.

Feb 22 07:19:58 deadhost kernel: (scsi0:0:0:0) Synchronous at 20.0 Mbyte/sec, offset 8.

*sigh* the bus degrades; the intact drive is down to 10 MXfers/s
(i.e. Fast SCSI) while it could still happily serve 20 wide MXfers/s.
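
The arithmetic behind that remark, as a small sketch: on a wide (16-bit) bus every transfer carries 2 bytes, so the Mbyte/sec figures the driver logs divide by 2 to give megatransfers per second. The function name is made up for illustration.

```python
def mega_transfers_per_sec(mbyte_per_sec, bus_width_bits=16):
    """Convert a logged Mbyte/sec rate into megatransfers per second."""
    bytes_per_transfer = bus_width_bits // 8
    return mbyte_per_sec / bytes_per_transfer

# Rates logged for (scsi0:0:0:0) at boot, 06:57, 07:06 and 07:19.
for rate in (40.0, 32.0, 26.8, 20.0):
    print("%.1f Mbyte/sec wide = %.1f MXfers/s" % (rate, mega_transfers_per_sec(rate)))
```

So 40.0 Mbyte/sec corresponds to the 20 MXfers/s of Ultra SCSI, and the final 20.0 Mbyte/sec to the 10 MXfers/s of Fast SCSI.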



Then, after /etc/fstab has been edited so that all sdb lines are
commented out and the machine has finally been rebooted, I am seeing
these half an hour after the reboot:

Feb 22 11:12:51 deadhost kernel: scsi : aborting command due to timeout : pid 156373, scsi0, channel 0, id 1, lun 0 Read (10) 00 00 a9 dc c9 0
Feb 22 11:12:51 deadhost kernel: (scsi0:0:1:0) Parity error during Command phase.
Feb 22 11:12:51 deadhost kernel: scsi : aborting command due to timeout : pid 156820, scsi0, channel 0, id 0, lun 0 Read (6) 1e 98 e4 18 00 
Feb 22 11:12:51 deadhost kernel: scsi : aborting command due to timeout : pid 156821, scsi0, channel 0, id 0, lun 0 Read (6) 1e 98 fe 14 00 
Feb 22 11:12:51 deadhost kernel: SCSI host 0 abort (pid 156373) timed out - resetting
Feb 22 11:12:51 deadhost kernel: SCSI bus is being reset for host 0 channel 0.
Feb 22 11:12:51 deadhost kernel: (scsi0:0:1:0) Performing Domain validation.
Feb 22 11:12:51 deadhost kernel: (scsi0:0:1:0) Successfully completed Domain validation.
Feb 22 11:12:51 deadhost kernel: (scsi0:0:0:0) Synchronous at 40.0 Mbyte/sec, offset 8.
Feb 22 11:12:51 deadhost kernel: (scsi0:0:1:0) Synchronous at 40.0 Mbyte/sec, offset 8.
Feb 22 11:12:51 deadhost kernel: (scsi0:0:1:0) Performing Domain validation.
Feb 22 11:12:51 deadhost kernel: (scsi0:0:1:0) Successfully completed Domain validation.
Feb 22 11:13:21 deadhost kernel: SCSI host 0 abort (pid 156373) timed out - resetting
Feb 22 11:13:21 deadhost kernel: SCSI bus is being reset for host 0 channel 0.
Feb 22 11:13:25 deadhost kernel: (scsi0:0:1:0) Synchronous at 40.0 Mbyte/sec, offset 8.
Feb 22 11:13:25 deadhost kernel: (scsi0:0:1:0) Performing Domain validation.
Feb 22 11:13:25 deadhost kernel: (scsi0:0:1:0) Successfully completed Domain validation.
Feb 22 11:13:56 deadhost kernel: SCSI host 0 channel 0 reset (pid 156373) timed out - trying harder
Feb 22 11:13:56 deadhost kernel: SCSI bus is being reset for host 0 channel 0.
Feb 22 11:13:59 deadhost kernel: (scsi0:0:0:0) Synchronous at 40.0 Mbyte/sec, offset 8.
Feb 22 11:13:59 deadhost kernel: (scsi0:0:1:0) Synchronous at 40.0 Mbyte/sec, offset 8.
Feb 22 11:14:29 deadhost kernel: SCSI host 0 abort (pid 156373) timed out - resetting
Feb 22 11:14:29 deadhost kernel: SCSI bus is being reset for host 0 channel 0.
Feb 22 11:14:33 deadhost kernel: (scsi0:0:1:0) Synchronous at 40.0 Mbyte/sec, offset 8.
[...]
------------------------------------------------------------------------

-- 
Matthias Andree

Hi! I'm the infamous .signature virus!
Copy me into your ~/.signature to help me spread!

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.rutgers.edu
