[5045] in linux-scsi channel archive

A bug in scsi code in 2.0.35 (36?) [long]

daemon@ATHENA.MIT.EDU (ishikawa)
Tue Nov 3 18:36:53 1998

Date: 	Wed, 04 Nov 1998 07:51:02 +0900
From: ishikawa <ishikawa@yk.rim.or.jp>
To: linux-scsi@vger.rutgers.edu

Hello,

The following problem, reported about two months ago,
still seems to be present in 2.0.36-preXX.

On the other hand, when I tried this on the 2.1.125 kernel,
the bug didn't show up in about a dozen trials of the
command sequence given below.

Given that the command sequence locked up 2.0.3{5,6-preXX}
solidly about 99% of the time, I tend to regard the problem as fixed
in 2.1.125.

However, 2.2 still seems to be at least a few months away.

Could someone figure out the real cause and
find a fix for the problem described below in the 2.0.35 kernel?

The problem:

We have a Nakamichi SCSI CD changer drive, the MBR-7.
This is a multi-LUN SCSI device with 7 CDs in it.
(This has to be treated as a single_lun device.)

The following sequence of commands locks up the 2.0.35 kernel.

dd if=/dev/scdX of=/dev/null &
dd if=/dev/scdY of=/dev/null

where /dev/scdX and /dev/scdY (X != Y) are two CD drives
in the said Nakamichi device.

Note 1. Richard Waltham noted that
by inserting sleep 1 between the two commands above, the
problem may not happen.
CI (me) noted that there was one occasion when, with sleep X
inserted, the system didn't lock up (this was after Waltham's and
CI's patch mentioned at the end was applied).  But in that single
instance, when he tried to interrupt the command by hitting Control-C,
the system hung again.

Because of the timing problem, we suggest that the tester
issue the above commands from a shell script.
If you type them at the console, I think the delay between
the two commands is enough to mask the problem.

Note 2. Richard Waltham noted that on his system the
problem may or may not occur depending on whether
	X > Y, or
	X < Y.

He attributed this possibly to the scan order of the devices in
the SCSI routine. (But this was before he realized that putting enough
delay between the two commands may hide the problem.)
CI (me) notes that issuing the two commands from shell scripts
with various X, Y settings didn't change the situation on his system:
the system locked up.

Note 3. CI reported that if X == Y, the problem didn't occur.

Note 4. CI checked that the lockup occurred with different SCSI cards
(Tekram DC390 and another one based on a Symbios 8xx chip) on different
PCs.
Between Waltham and CI, we cover three different(?)
(or possibly 4) cards, so the problem seems to be in
the SCSI code after all.
That 2.1.125 doesn't show the symptom easily also suggests that
the 2.0.35 code has a bug that was subsequently fixed in 2.1.125.

Below I attach a quote from a past posting: a few paragraphs I exchanged
with Waltham.

From: Richard Waltham <dormouse@farsrobt.demon.co.uk>
Message-Id: <199809090212.DAA31279@farsrobt.demon.co.uk>
Subject: Re: Panic in scsi.c (and a fix)
To: ishikawa@yk.rim.or.jp
Date: Wed, 9 Sep 1998 03:12:19 +0100 (BST)


> Ishikawa wrote:
> > 
> > Kurt Garloff wrote:
> > > 
> > > On Wed, Aug 19, 1998 at 01:52:03AM +0100, Richard Waltham wrote:
> > > > Hi,
> > > >
> > > > I can generate the following panic in scsi.c at will using a CD media
> > > > changer - Nakamichi MBR-7.
> > > >
> > > > Happens with kernel versions 2.0.35 and 2.0.36-pre6. I haven't checked any
> > > > others.
> > > >
> > > >       Attempt to allocate device channel 0, target 6, lun x
> > > >       Kernel Panic: No Device found in allocate_device().
> > > >
> > > > If I start the following two commands running in different vc's
> > > >
> > > > dd if=/dev/scdX of=/dev/null   (X = 1, 2 ...)
> > > >
> > > > dd if=/dev/scdY of=/dev/null   (Y = 0, 1 ...)
> > > >
> > > > and the second one started has Y < X I get the panic.
> > > >
> > > > eg
> > > >
> > > > dd if=/dev/scd1 of=/dev/null    - starting this first
> > > >
> > > > dd if=/dev/scd0 of=/dev/null    - then starting this
> > > >
> > > > generates the panic. Starting scd0 first and then scd1 is OK - but very
> > > > sloooooow as it's spending most of the time changing CDs;)
> > > >
> > > > I guess the panic is caused by the call to allocate_device from
> > > > do_sr_request in sr.c but don't know why.
> > > >
> > > > Anyone figure it out?
> > > 
	Kurt's comment
	about the single_lun handling being broken in 2.0.35 (patched in
	the pre-1x of 2.0.36) omitted.
	(CI's comment: The fix is in 2.0.36-pre1x now.)

> > A couple of weeks ago or so, there was a mention of 
> > reproducible system panic when the following
> > commands were issued against Nakamichi MBR SCSI CD changer (The above
> > exchange, that is.):
> > 
> > dd if=/dev/scdx of=/dev/null &
> > dd if=/dev/scdy of=/dev/null
> > 
> > (x>y)
> > 
> > I have the same (or similar) Nakamichi SCSI
> > CD changer and I found that I can reproduce the
> > same problem on my PC. The MBR7 SCSI CD changer is connected to Tekram DC390
> > SCSI host adaptor card. The DC390 driver version is 1.20s2.
> > 
> 
> That was me I guess. My SCSI controller is a Symbios SYM8751SP using the
> ncr53c8xx driver (and a new experimental version sym53x8xx v0.4) but both
> give the exact same results.
> 
> > [By the way, in my case, the system got hung even when x< y.  It could
> > be due to the fact my system had this "single_lun" problem patch
> > applied. But I don't know the real reason.]
> > 
> 
> No, this behaviour has nothing to do with the single_lun patch. But you will
> need the patch to get things working along with an additional patch I'll
> append at the end of this message.
> 
> I've looked into this. Adding a printk in the routine allocate_device in
> scsi.c displays the devices as the drive table is searched. On my system the
> high luns are scanned first going to the lowest lun last. This also ties in
> with the failure I was getting where starting a high lun first and then a
> lower lun caused a panic.
> 
> Other systems/drivers may order the devices the other way round, lowest luns
> first then scanning through to the higher luns. This would then give a
> failure when starting a lower lun before a higher lun.
> 
> The following patch to scsi.c will show the order devices are scanned if
> you're interested. It is not part of the fix.
> 
> --- linux-2.0.36-pre8/drivers/scsi/scsi.c~	Wed Sep  9 01:10:19 1998
> +++ linux/drivers/scsi/scsi.c	Wed Sep  9 01:12:24 1998
> @@ -1081,6 +1081,10 @@
>  	    target_busy = 0;
>  	    SCpnt = device->host->host_queue;
>  	    while(SCpnt){
> +	    printk("single_lun: (%d,%d,%d)\n",
> +	      SCpnt->channel,
> +	      SCpnt->target,
> +	      SCpnt->lun);
>  		if(SCpnt->channel == device->channel 
>                     && SCpnt->target == device->id) {
>  		    if (SCpnt->lun == device->lun) {
> 
> 
> > I tried to see what is going on and could produce some
> > printk() messages right before the panic occurs.
> > This might be helpful in deducing the cause of the bug and so
> > I am reporting this message.
> > 
> > In the following session log, which I recorded manually,
> > scd2 is the CD at id=6,lun=0, and 
> > scd3 is the CD at id=6,lun=1 if I am not mistaken.
> > 
> 
> Seems reasonable to me.
> 
> > The kernel I tested was 2.0.36pre7.
> > (I obtained the patches from the site mentioned in Alan's message
> > quoted in Linux Weekly News site.
> > Can I  simply run "patch -p1 < patch_for_pre7"
> > instead of running patch_for_pre1, then for pre2, etc. in order to
> > get the 2.0.36pre7 source tree?
> > When I tried to apply patches in sequence, I got
> > the dreaded "reverse patch detected" message and after looking
> > at the files, I figured that
> > each patch can be applied to the base 2.0.35 source tree in one operation to
> > get to the pre-NNN status. Correct me if I am wrong here.)
> > 
> 
> The 2.0.36-pre patches apply to a clean 2.0.35. _Do not_ try adding a 2.0.36
> pre-patch on top of another pre-patch.
> 
> 8< some text removed
> 
> > 
> > I don't know if this message
> > helps people in fixing the bug before 2.0.36 release, but
> > this bug has got to be fixed somehow in the 2.0.3x release, I think.
> > 
> 
> It may be fixed - try the appended patch and let me know. I share your
> concern.
> 
> > If someone wants to delve into this problem and would like me to
> > print more info by inserting printk() in the source files, 
> > just let me know.
> 
> I have, and have what I believe is a fix, so don't need any more printk's -
> besides there's no more room in my log files for any more messages as I've
> filled them up with my own;)
> 
> > 
> > By the way, is the original reporter of this problem
> > using DC390 or other SCSI cards?
> 
> Symbios SYM8751SP and SYM8951U.
> 
> > I don't think the DC390 driver is the cause of the problem, but
> > just wanted to make sure that the problem occurs with the combination of
> > Nakamichi SCSI CD changer and other SCSI cards.
> 
> Doesn't appear to be a driver problem though the failure depending on higher
> or lower lun starting first appears driver dependent.
> 
> > 
> > Finally, here is the manual recording of the messages shown on the console
> > after I typed the problematic commands and 
> > when the system panicked:
> > 
> 
> 8< text removed (sorry)
> 
> > 
> > I don't know if the problem is caused by the improper protection of
> > the various allocate routines in sr*.c files as mentioned in the
> > case of 2.1.1xx kernel lately.
> 
> It doesn't appear to be caused by any problems in sr.c but I've only had it
> running a couple of hours.
> 
> > But my understanding of SCSI subsystem of linux 2.0.35 is not good
> > enough to make any judgement now, and for that matter, producing
> > a patch for 2.0.35 based on the recent patch for protecting these functions
> > based on SMP lock/unlock functions in 2.1.1xx is beyond me now.
> > 
> 
> Mine's not very good either but I do understand SCSI and the errors that were
> introduced in the driver by this fault helped a lot. + lots of printk's.
> 
> > The output messages were produced by the insertion of printk in the
> > following places in the relevant files:
> 
> I'll save these just in case
> 
> 8< lots of code snipped
> 
> > 
> > Happy Hacking
> > 
> > Chiaki Ishikawa
> > 
> > PS: I am sorry that I only read linux-scsi mailing list...
> > 
> 
> Now for the patch which I think finally fixes these problems with at least
> the Nakamichi drives. This is to the allocate_device routine in scsi.c
> 
> The panic was being caused by the routine scanning the command queues for
> devices finishing the scan prematurely before the device being allocated had
> been scanned leaving the SCwait pointer set to NULL - the reason for
> the panic.
> 
> The scan order appears to depend on the ordering done by the different
> drivers so the failure could occur if a command with a high lun was started
> first before a command with low lun or vice versa. Between us we appear to
> cover both cases.
> 
> The fix forces the complete list to be scanned saving status on whether any
> of the relevant target luns are busy and then setting the state of SCpnt
> accordingly on completion. This guarantees that SCwait is set to a valid
> value for the device being allocated.
> 
> This needs more testing so it can hopefully get into 2.0.36 - is this a
> possibility Alan?
> 
> Check this out - it's against 2.0.36-pre8 but will probably apply equally
> well to earlier 2.0.36-pre versions (pre7 anyway). Let me know how you get
> on. If it's OK I'll pass it on to Alan if he's not read this and already done
> something about it.
> 
> Now on to try and fix 2.1.xx :)
> 
> Bye for now
> Richard
> 
> --- linux-2.0.36-pre8/drivers/scsi/scsi.c.orig	Tue Sep  8 19:52:46 1998
> +++ linux/drivers/scsi/scsi.c	Wed Sep  9 00:33:22 1998
> @@ -1045,6 +1045,7 @@
>      kdev_t dev;
>      struct request * req = NULL;
>      int tablesize;
> +    int target_busy;
>      unsigned long flags;
>      struct buffer_head * bh, *bhp;
>      struct Scsi_Host * host;
> @@ -1077,6 +1078,7 @@
>  		SCpnt = SCpnt->device_next;
>  	    }
>  	} else {
> +	    target_busy = 0;
>  	    SCpnt = device->host->host_queue;
>  	    while(SCpnt){
>  		if(SCpnt->channel == device->channel 
> @@ -1095,13 +1097,15 @@
>  			 * outstanding command per device - this is what tends
>                           * to trip up buggy firmware.
>  			 */
> -			found = NULL;
> -			break;
> +			target_busy = 1;
>  		    }
>  		}
>  		SCpnt = SCpnt->next;
>  	    }
> -	    SCpnt = found;
> +	    if (target_busy)
> +		SCpnt = NULL;
> +	    else
> +		SCpnt = found;
>  	}
>  
>  	save_flags(flags);
> 
> 

CI's additional comment.

I thought a similar fix was necessary for request_queueable(), since it
seems to have a very similar-looking loop.
But after I consulted Waltham, I think request_queueable() may not require
the modification.

However, after careful analysis of Waltham's patch, I thought it might
still have a problem: the variable found needs to be re-initialized
inside the loop even after Waltham's patch.

----------------------------------------
> 
> I think I found a potential problem in allocate_device().
> The variable found is initialized to NULL at the beginning of
> function.
> The general flow of the main loop of allocate_device() is
> as follows.
> I wonder if `found' ought to be re-initialized to NULL
> before entering the "while loop on SCpnt"?
> [It could be that the initialization value may have to 
> be dependent on the previous value of `found' and target_busy
> value in Waltham patch.]
> 

Waltham seems to agree that found needs to be re-initialized.

>     ... found = NULL; /* initial value */
> 
>     while(1==1) 
>     {
>        if(!single_lun)
>        {
>              ...
>        }
>        else
>        {
> 
>          /* Should we re-initialize `found' here? */
> 
>          while loop on SCpnt
>          {
> 
>           }
>     	...
>        }
> 	...
>     }
> 

But even with found re-initialized, we still
get the lockup.

Would anyone familiar with the SCSI code in 2.0.35
care to analyze what is going on?

Happy Hacking,

Chiaki Ishikawa

PS: I wish I could have reported this earlier. But when I tried to upgrade my
2.1.109 to 2.1.125 to see if the problem is in the later development kernel, I trashed
my working partition and needed to reload various libraries and programs.
At least now I can run X and xterms. But I think I would have problems going back to the
2.0.35 kernel right now... More tinkering next weekend.
