[584] in linux-scsi channel archive


Re: st.c driver errors can get lost!

daemon@ATHENA.MIT.EDU (Richard Waltham)
Sun Sep 3 08:11:40 1995

From: Richard Waltham <dormouse@farsrobt.demon.co.uk>
To: Kai.Makisara@metla.fi
Date: Sun, 3 Sep 1995 09:11:57 +0100 (BST)
Cc: linux-scsi@vger.rutgers.edu
In-Reply-To: <Pine.OSF.3.91.950902094747.9566B-100000@abies.metla.fi> from "Kai Makisara" at Sep 2, 95 10:12:16 am

> 
> On Fri, 1 Sep 1995, Richard Waltham wrote:
> 
> > 3. A much more serious problem is that in certain circumstances it appears 
> > possible to lose errors reported by a SCSI tape device during writing. 
> ...
> > The standard kernel distributions have the st.c driver set up to write
> > asynchronously. Write errors are checked for at the start of the next write
> > command when any errors in the last write are reported. If the tar is only 
> > one block long, in unbuffered variable block mode, or several/many blocks 
> > long in buffered mode, the error is not detected until the close (device 
> > release?) and the device close routine does not return an error.  Also the 
> > closing filemark(s) is written during the close routine so if there is an 
> > error during writing the filemark this is also not reported other than by 
> > another short message in /var/adm/messages. Looking at another driver it  
> > appears that release does not return any value. Is that correct? If so is 
>                ^^^^^^^
> I have been aware of this problem and the reason is, as you have noticed, 
> that there is no other way for the close function to return an error 
> than to log it. The release function is defined in include/linux/fs.h as 
> follows:
> 
> struct file_operations {
> ....
>         int (*open) (struct inode *, struct file *);
>         void (*release) (struct inode *, struct file *);
>         int (*fsync) (struct inode *, struct file *);
> ....
> };
> 

I've a real problem trying to understand why release doesn't have the
ability, or probably a need, to return an error code.

Can someone explain why, or recommend somewhere I can find a decent
explanation? I can only assume it is something to do with the way the file
handling system operates, although my instinct at the moment is to consider
its operation rather inadequate.

> > there any way round this or do we have to live with it? 
> > 
> As you note in your message, if you want utmost reliability in backups, you 
> should disable asynchronous writes and write buffering (you don't have to
> recompile the kernel; just use the MTSTOPTIONS ioctl). Another possible 
> problem with some programs is that async writes delay reporting errors
> and the program may not handle this correctly (it does not
> "know" that it was actually some earlier write that failed).
> On the other hand, async writes have a positive effect on performance in 
> some systems. How to balance these factors is up to each individual user.
> 
> 	Kai
> 
> *  Kai Makisara                      * email Kai.Makisara@metla.fi *
> |  Finnish Forest Research Institute | tel. +358-0-857 05 334      |
> |  Unioninkatu 40A                   | fax  +358-0-625 308         |
> *  FIN-00170 Helsinki, Finland       * GSM  +358-40-5533211        *
> 
> 
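
For anyone wanting to try the suggestion above, here is a minimal sketch of
turning off buffered and asynchronous writes from a program. It assumes the
option-setting call meant is the MTIOCTOP ioctl with the MTSETDRVBUFFER
operation, and it uses the MT_ST_CLEARBOOLEANS, MT_ST_BUFFER_WRITES and
MT_ST_ASYNC_WRITES flags as they appear in current <linux/mtio.h>; the header
shipped with 1995 kernels may differ. The function names are made up for
illustration.

```c
#include <linux/mtio.h>   /* struct mtop, MTIOCTOP, MTSETDRVBUFFER, MT_ST_* */
#include <sys/ioctl.h>    /* ioctl() */

/* Build the request that clears the buffered-write and async-write options.
 * Assumption: MT_ST_CLEARBOOLEANS switches off exactly the option bits that
 * are OR-ed in with it and leaves the other driver options alone. */
struct mtop sync_write_request(void)
{
    struct mtop op;

    op.mt_op = MTSETDRVBUFFER;
    op.mt_count = MT_ST_CLEARBOOLEANS | MT_ST_BUFFER_WRITES | MT_ST_ASYNC_WRITES;
    return op;
}

/* Apply the request to an already opened tape device.
 * Returns 0 on success, -1 with errno set on failure. */
int disable_deferred_writes(int tape_fd)
{
    struct mtop op = sync_write_request();

    return ioctl(tape_fd, MTIOCTOP, &op);
}
```

A backup program would call disable_deferred_writes() right after opening the
tape device and before the first write, so that each write() reports its own
error instead of an earlier one. Changing driver options may require
privileges on some systems.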

Error handling is definitely not easy. And testing error handling code is
even more difficult when devices refuse to give errors. Forcing drives into
generating error conditions is not easy.

I agree with your remarks on programs not understanding deferred or delayed
errors, although I would say most programs rather than just some. However, I
would expect them to at least be able to detect that an error had happened
and report it in some way, rather than the driver writing a message in a log
file. Unfortunately it doesn't seem much can be done about that without a 
change in the release function.


Richard

