[5524] in linux-scsi channel archive


Re: more on scsi tape failures

daemon@ATHENA.MIT.EDU (Guest section DW)
Fri Jan 1 19:58:17 1999

Date: 	Sat, 2 Jan 1999 00:51:27 +0100 (MET)
From: dwguest@win.tue.nl (Guest section DW)
To: dwguest@win.tue.nl, james@dansfoods.com
Cc: linux-scsi@vger.rutgers.edu, lnz@dandelion.com

	From james@danix.dansfoods.com Fri Jan  1 23:24:51 1999
	From: James Rich <james@dansfoods.com>

SCSI tape stuff happens...

	st0: Buffer flushed, 1 EOF(s) written
	st0: Rewinding tape.
	st0: Block limits 1 - 65536 bytes.
	st0: Mode sense. Length 14, medium 0, WBS 10, BLL 8
	st0: Density 11, tape length: 0, drv buffer: 1
	st0: Block size: 1024, buffer size: 32768 (32 blocks).
	st0: Error: 28000002, cmd: 8 1 0 0 20 0 Len: 32768
	FMK Current error st09:00: sense key None
	st0: Sense: f0  0 80  0  0  0 14  6
	st0: EOF detected (12288 bytes read).
	st0: EOF up (1). Left 12288, needed 2048.
	st0: EOF/EOM flag up (1). Bytes 10240
	st0: EOF up (1). Left 10240, needed 10240.
	st0: Rewinding tape.
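
The tape read itself looks consistent, and that can be checked by hand
from the command and sense bytes in the log. Below is a minimal sketch,
not anything taken from st.c: the helper names are made up for
illustration. It assumes SCSI-2 fixed-format sense data, a
sequential-access READ(6) in fixed-block mode, and an information field
that holds the residue in blocks, treated as unsigned.

```python
def parse_hex_bytes(text):
    """Turn a log fragment such as 'f0  0 80  0  0  0 14  6' into ints."""
    return [int(tok, 16) for tok in text.split()]


def decode_fixed_sense(sense):
    """Decode the first 8 bytes of SCSI-2 fixed-format sense data."""
    return {
        "valid": bool(sense[0] & 0x80),       # information field is valid
        "error_code": sense[0] & 0x7F,        # 0x70 = current error
        "filemark": bool(sense[2] & 0x80),    # filemark encountered
        "eom": bool(sense[2] & 0x40),         # end of medium
        "ili": bool(sense[2] & 0x20),         # incorrect length indicator
        "sense_key": sense[2] & 0x0F,         # 0 = NO SENSE
        "information": int.from_bytes(bytes(sense[3:7]), "big"),
        "additional_length": sense[7],
    }


def decode_tape_read6(cdb):
    """Decode a sequential-access READ(6) CDB (opcode 0x08)."""
    if cdb[0] != 0x08:
        raise ValueError("not a READ(6) command")
    return {
        "fixed": bool(cdb[1] & 0x01),         # fixed-block mode bit
        "transfer_length": int.from_bytes(bytes(cdb[2:5]), "big"),
    }


def bytes_transferred(cdb, sense, block_size):
    """Bytes actually read, assuming fixed-block mode and a block residue."""
    cmd = decode_tape_read6(cdb)
    if not cmd["fixed"]:
        raise ValueError("this sketch only handles fixed-block reads")
    s = decode_fixed_sense(sense)
    residue = s["information"] if s["valid"] else 0
    return (cmd["transfer_length"] - residue) * block_size


if __name__ == "__main__":
    cdb = parse_hex_bytes("8 1 0 0 20 0")
    sense = parse_hex_bytes("f0  0 80  0  0  0 14  6")
    print(decode_fixed_sense(sense))
    print(bytes_transferred(cdb, sense, 1024))
```

With the bytes from the log, the request is for 0x20 = 32 blocks and the
residue is 0x14 = 20 blocks, so 12 blocks of 1024 bytes were read. That
agrees with the "EOF detected (12288 bytes read)" line, and the filemark
bit with sense key None is just an ordinary end of file.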

Now a timeout occurs. The routine internal_cmnd() in scsi.c has
registered scsi_old_times_out() as the routine to call upon timeout.
That routine lives in scsi_obsolete.c; it sees NORMAL_TIMEOUT and
calls scsi_abort(), which prints

	scsi : aborting command due to timeout : pid 3870, scsi0, channel 0, id 0, lun 0
	 Read (10) 00 00 24 5c 90 00 00 08 00 

But the command that timed out was addressed to the disk, not to the
tape. Strange...
[Maybe nothing is wrong with the disk, there never is...]

	scsi0: Aborting CCB #3883 to Target 0
	SCSI host 0 abort (pid 3870) timed out - resetting

The SCSI controller did not react to the abort, so the error recovery
code decides to reset the disk drive.
[If the problem is reproducible and you are using BusLogic.c, you
might check whether the controller returns
BusLogic_AbortedCommandNotFound. At first sight the driver seems to
ignore that return status. I think it should at least print a message
when SCSI error logging is enabled.]

	SCSI bus is being reset for host 0 channel 0.
	scsi0: Sending Bus Device Reset CCB #3885 to Target 0

The disk reset also times out, so a bus reset is attempted.

	SCSI host 0 channel 0 reset (pid 3870) timed out - trying harder
	SCSI bus is being reset for host 0 channel 0.
	scsi0: Resetting BusLogic BT-958 due to Target 0
	scsi0: Resetting BusLogic BT-958 Failed
	SCSI host 0 reset (pid 3870) timed out again -
	probably an unrecoverable SCSI bus or device hang.
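
The sequence in the log so far is an escalation ladder: abort the
command, then reset the device, then reset the bus and host adapter,
giving up when every step has timed out. The toy model below only
illustrates that control flow. It is not the code in scsi_obsolete.c,
and the step names and the try_step callback are invented for the
sketch.

```python
# Step names are descriptive labels for this sketch, not kernel identifiers.
ESCALATION = ["abort command", "reset device", "reset bus/host"]


def recover(try_step):
    """Walk the ladder until one step completes.

    try_step(name) is a caller-supplied callback returning True if the
    step completed before its timeout. Returns (step_that_worked or None,
    list_of_steps_attempted).
    """
    attempted = []
    for step in ESCALATION:
        attempted.append(step)
        if try_step(step):
            return step, attempted
    # Every step timed out: the "unrecoverable bus or device hang" case.
    return None, attempted


if __name__ == "__main__":
    # Model the log above: nothing ever completes.
    print(recover(lambda step: False))
```

In this model the log corresponds to a try_step that always returns
False, so all three steps are attempted and none succeeds.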

Here is the next disk request - it fails in the same way.

	scsi : aborting command due to timeout : pid 3871, scsi0, channel 0, id 0, lun 0
	 Write (6) 0f 90 14 02 00 
	scsi0: Unable to Abort Command to Target 0 - CCB Reset
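
Both timed-out commands are ordinary block I/O to the disk, which can
be confirmed by decoding the bytes printed after the command name. The
sketch below assumes the log prints the opcode as a name ("Read (10)",
"Write (6)") followed by the remaining CDB bytes. The function names
are hypothetical.

```python
def parse_hex_bytes(text):
    """Turn a log fragment such as '0f 90 14 02 00' into a list of ints."""
    return [int(tok, 16) for tok in text.split()]


def decode_rw_args(kind, args):
    """Decode the CDB bytes that follow the opcode of a disk Read/Write.

    kind is 6 or 10 (the CDB length). Returns (logical_block, block_count).
    """
    if kind == 6:
        if len(args) != 5:
            raise ValueError("a 6-byte CDB has 5 bytes after the opcode")
        # 21-bit block address: low 5 bits of the first byte, then 2 bytes.
        lba = ((args[0] & 0x1F) << 16) | (args[1] << 8) | args[2]
        blocks = args[3] or 256      # a length of 0 means 256 blocks
    elif kind == 10:
        if len(args) != 9:
            raise ValueError("a 10-byte CDB has 9 bytes after the opcode")
        lba = int.from_bytes(bytes(args[1:5]), "big")
        blocks = int.from_bytes(bytes(args[6:8]), "big")
    else:
        raise ValueError("only 6- and 10-byte CDBs are handled")
    return lba, blocks


if __name__ == "__main__":
    print(decode_rw_args(10, parse_hex_bytes("00 00 24 5c 90 00 00 08 00")))
    print(decode_rw_args(6, parse_hex_bytes("0f 90 14 02 00")))
```

Under that assumption the Read (10) asks for 8 blocks starting at block
0x245c90 and the Write (6) for 2 blocks starting at block 0x0f9014.
Nothing about either request is unusual.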

	I've put some printk's to st.c and found that it does indeed get stuck 
	while trying to do a close on the device.

	After this the system goes through cycles of responsiveness and no 
	response.  The load skyrockets.

Yes.

This ought to be good information to work from.

[Information as I see it: nothing is wrong with the disk.
 Nothing is wrong with the SCSI controller.
 But some error occurs somewhere, and the SCSI subsystem gets
 terminally confused.

 Now scsi_old_times_out() starts with spin_lock_irqsave();
 maybe the handling of interrupts or of io_request_lock is flawed.]

If you can reproduce it, even better.
No doubt Leonard Zubkoff will correct all that is wrong in the above.

Andries


