[2040] in linux-scsi channel archive


Re: scsi-problem (phase change ?)

daemon@ATHENA.MIT.EDU (Gerard Roudier)
Sun Jun 22 05:52:43 1997

Date: 	Sun, 22 Jun 1997 11:47:11 +0200 (MET DST)
From: Gerard Roudier <groudier@club-internet.fr>
To: Hauke Johannknecht <ash@ash.ccc.de>
cc: linux-kernel@vger.rutgers.edu, linux-scsi@vger.rutgers.edu,
        ncr53c810@colorado.edu
In-Reply-To: <Pine.LNX.3.95.970622001050.1375A-100000@mdma.ash.de>



On Sun, 22 Jun 1997, Hauke Johannknecht wrote:

> i have a "evil" problem here ...
> it trashed some partitions 3 times now.
> 
> i am using kernel pre-2.0.31-1 updated with ncr-1.18f ...
> scsi-host is a no-probs-till-now-NCR-810 with
> 5 devices attached. 
> 
>  ID 0 -- IBM-DCRS 4.5 GB (new in the system, maybe the troublestarter)
>  ID 1 -- Quantum LPS 105 (just dont ask ...)
>  ID 3 -- Seagate ST1600N (OLD, but works ...)
>  ID 5 -- Sanyo Quad-CDROM
>  ID 6 -- HP 6020i (now take a guess why i keep the seagate ...)
> 
> the ST got some heat-problems. but i
> keep it for "buffer"-usage, most times
> its powered down. so no prob.
> 
> the system trashed data on the DCRS in the last two days.
> up to complete partition-corruption.
> 
> only relevant comment in syslog was something like
> 
> ncr53c810-0-<0,0>: phase change 2-7 6@00249c20 resid=2.
> ncr53c810-0-<0,0>: phase change 2-7 10@0024962c resid=4.

The SCSI controller saw a phase change from COMMAND phase to
MESSAGE_IN phase while part of the SCSI command (the residual) had not
yet been accepted by the drive. If we exclude a problem in the DCRS drive
itself, the most probable cause is a bad signal level on the SCSI bus that
corrupted data or broke the SCSI protocol.
We should probably expect such problems to be recovered. However,
error recovery is very hard to implement and to test and, in any case,
it is not possible, in my opinion, to recover from all kinds of errors.
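
To make the reading of those two syslog lines concrete, here is a small
sketch that decodes them. The phase numbers are the standard SCSI bus phase
codes (MSG, C/D and I/O signals as bits 2, 1 and 0). The meaning of the
other fields (length@address, then resid as the number of bytes not
transferred) is an assumption inferred from the explanation above, and the
names decode_phase_change and SCSI_PHASES are only illustrative; they are
not part of the driver.

```python
import re

# Standard SCSI bus phase codes: bit 2 = MSG, bit 1 = C/D, bit 0 = I/O.
# Codes 4 and 5 are reserved.
SCSI_PHASES = {
    0: "DATA_OUT",
    1: "DATA_IN",
    2: "COMMAND",
    3: "STATUS",
    6: "MESSAGE_OUT",
    7: "MESSAGE_IN",
}

# Assumed field layout: "phase change <old>-<new> <length>@<address> resid=<n>"
LINE_RE = re.compile(
    r"phase change (?P<old>[0-7])-(?P<new>[0-7]) "
    r"(?P<length>\d+)@(?P<addr>[0-9a-fA-F]+) resid=(?P<resid>\d+)"
)


def decode_phase_change(line):
    """Return a dict describing a 'phase change' log line, or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    length = int(m.group("length"))
    resid = int(m.group("resid"))
    return {
        "from": SCSI_PHASES.get(int(m.group("old")), "RESERVED"),
        "to": SCSI_PHASES.get(int(m.group("new")), "RESERVED"),
        "length": length,               # bytes the controller intended to move
        "resid": resid,                 # bytes left over (assumed meaning)
        "transferred": length - resid,  # bytes the drive actually took
    }


if __name__ == "__main__":
    for line in (
        "ncr53c810-0-<0,0>: phase change 2-7 6@00249c20 resid=2.",
        "ncr53c810-0-<0,0>: phase change 2-7 10@0024962c resid=4.",
    ):
        print(decode_phase_change(line))
```

Under that reading, the first line is a 6-byte command of which the drive
took 4 bytes before switching to MESSAGE_IN, and the second is a 10-byte
command of which it took 6.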

I think that mixing old and recent devices, or devices of very different
purpose and speed, on the same SCSI bus, or connecting too many
devices to the same SCSI bus, increases the probability of SCSI problems.

> seems to happen only if the system is running under
> heavy load AND the ST is powered up some time ...

Do you mean that you powered up the ST while the system was running?

> (can an overheated hdd data-kill another one via the scsi-bus ?)

Since the SCSI bus is a shared resource, any device on the bus can 
make the resource unusable.
 
> questions now:
> - WHAT are these errors ?
My response is above.

> - WHY  is it happening ?
You should send this question to Mr Murphy. :)

> - WHO  is responsible ?
Us.
You, because your SCSI bus configuration looks like one that is very likely
to run into problems, all the more if you used to switch your ST on under
heavy load.
And me, if it is possible to recover from such errors.

> - HOW  can i stop it ?
Trying to recover from such errors in the driver, if that is possible, would
perhaps cure the consequence but not repair the system if, as I think,
your SCSI system (all the components sharing the resource) uses a mix that
makes SCSI problems too probable.
It is better to try to fix the cause, in my opinion.

My recommendation is to use more than one SCSI bus and to distribute the
devices among the buses in a way that minimizes the risk of SCSI problems.
Two buses are generally enough for most systems.
Base the choice on the speed, purpose, age, quality, etc. of the SCSI devices.
That cannot hurt, at least for performance, when you are using two devices
of very different speed at the same time.

As an example, here is my SCSI system description:

- NCR53C810 that drives an IBM S12 narrow fast SCSI-2 HD and a Toshiba
  3401D SCSI-2 CD-ROM.

- NCR53C875 that drives an Atlas I Wide HD and an Atlas II Ultra Wide HD.

All of that with a bus as short as possible and only active
termination.
 
       Gerard.

