[3482] in SAPr3-news


Re: Striped Disk

daemon@ATHENA.MIT.EDU (Knut Hofmann)
Fri Aug 15 20:38:41 1997

To: sapr3-news@MIT.EDU
Date: 15 Aug 97 23:00:58 GMT
From: khofmann@icc.pop-frankfurt.com (Knut Hofmann)
Reply-To: khofmann@icc.pop-frankfurt.com

Distributing the disks across several controllers makes sense under high-load
conditions.
Attached is a good article as food for thought (in practice, though, things can
look different in individual cases, especially where the redo log is concerned).

Regards
Knut


Table of Contents

1. ABSTRACT

2. ORACLE7 AND RAID LEVELS

 2.1 RAID LEVELS
  2.1.1 RAID 0:  STRIPING WITH NO PARITY
  2.1.2 RAID 1:  SHADOWING
  2.1.3 RAID 0+1:  STRIPING AND SHADOWING
  2.1.4 RAID 3:  STRIPING WITH STATIC PARITY
  2.1.5 RAID 5:  STRIPING WITH ROTATING PARITY
   2.1.5.1 SUMMARY:  ORACLE7 AND RAID LEVELS

3. ORACLE7 AND CACHED I/OS

 3.1 TERMS AND BASICS

 3.2 TYPES OF DISK I/O CACHING

 3.3 ORACLE AS A DISK I/O CACHING PRODUCT

 3.4 USING ORACLE7 WITH DISK I/O CACHING PRODUCTS
  3.4.1 RELIABILITY
  3.4.2 MEMORY WASTE
  3.4.3 PERFORMANCE EXPECTATIONS
   3.4.3.1 ORACLE DBWR AND DISK I/O CACHING
   3.4.3.2 ORACLE7 LGWR AND DISK I/O CACHING
   3.4.3.3 SYSTEM-WIDE PERFORMANCE

 3.5 SUMMARY:  ORACLE7 AND DISK I/O CACHING PRODUCTS


1.   Abstract

This paper is meant to be an exhaustive treatment of the issues concerning
Oracle and disk I/O.  It covers the interactions between Oracle7 I/O and:

-    RAID
-    disk caching products (hardware and software)


2.   Oracle7 and RAID Levels

This section is a discussion of the various RAID levels, their advantages and
disadvantages, and their use with Oracle7.


2.1  RAID Levels


2.1.1     RAID 0:  Striping with No Parity

RAID 0 offers striping only.  It is not redundant (hence the name?); there is no
protection against drive failure at all.  It is simply a collection of drives in
a stripe configuration.

During an I/O, a single drive gets <chunksize> bytes of I/O before the I/O
continues onto the next drive in the set.  For I/Os that fit in a single chunk,
performance is the same as a single disk drive.  For I/Os that span more than
one chunk, there may be a slight performance improvement since disks are able to
do a little work in parallel.
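
The chunk-by-chunk behaviour described above can be sketched in a few lines of
Python.  This is only an illustration, assuming chunks are handed out to the
drives in simple round-robin order; the function names and the 64 KB chunk size
are hypothetical and not taken from any particular RAID product.

```python
def locate(offset, chunk_size, n_drives):
    """Map a logical byte offset to (drive index, byte offset on that drive)."""
    chunk = offset // chunk_size      # which chunk of the logical volume
    drive = chunk % n_drives          # chunks go round-robin across the drives
    stripe = chunk // n_drives        # full rows of chunks before this one
    return drive, stripe * chunk_size + offset % chunk_size


def drives_touched(offset, length, chunk_size, n_drives):
    """Return the set of drives that an I/O of `length` bytes at `offset` uses."""
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    return {c % n_drives for c in range(first, last + 1)}


CHUNK = 64 * 1024                                       # assumed chunk size
small_io = drives_touched(0, 8 * 1024, CHUNK, 4)        # fits in one chunk
large_io = drives_touched(0, 3 * CHUNK, CHUNK, 4)       # spans three chunks
```

An I/O that fits in one chunk lands on a single drive, so it behaves like a
single-disk I/O; only the larger I/O gives several drives work to do in
parallel.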

RAID 0 is useful with Oracle to reduce disk hot spots for Oracle data files.  It
is generally not recommended for other Oracle files.


2.1.2     RAID 1:  Shadowing

RAID 1 provides redundancy by duplicating an entire disk drive onto another.  It
provides complete protection against single drive failures.  It is also the most
expensive (in $) form of RAID since it maintains entire copies of disk drives
(perhaps even more than one copy).

During a read, any of the drives in the shadow set can be used.  During a write,
all drives will eventually be updated with the new data.

When all drives are functioning, reads complete slightly faster than a single
disk read since the controller will route the read to a free (not busy) disk.
Writes take slightly longer than a single disk write.  Performance
characteristics are not affected much during a single drive failure.  In the
worst case, performance is equivalent to a single disk.

RAID 1 is generally useful to Oracle (if the $ cost is acceptable).  RAID 1 can
be used for any Oracle file.  It is especially useful for Oracle redo log files
and control files; Oracle only has to issue one redo log I/O, saving code path
and context switching.  However, the DBA/system administrator must use the RAID
controller utilities to keep up with failed disks since the shadowing of the
file is hidden from Oracle.


2.1.3     RAID 0+1:  Striping and Shadowing

RAID 0+1 is often billed as a separate solution that offers the reduced hot spot
and performance benefits of striping (RAID 0) and the redundancy of shadowing
(RAID 1).  It is just as costly as RAID 1.

While RAID 0+1 can be used with Oracle data files, it should not be used with
redo log files.


2.1.4     RAID 3:  Striping with Static Parity

RAID 3 attempts to give the performance and redundancy of RAID 0+1 without the
high cost associated with RAID 1's 1-for-1 drive redundancy.  A number of drives
are ganged together in a RAID 0 stripe set.  An additional drive is used to keep
parity information for the stripe set.

During normal operation, RAID 3 gives performance similar to RAID 0.  Reads are
striped.  Writes require two I/Os, however: one for the data drive, and one for
the parity.  In the event of a single disk failure, the set continues to
function, albeit at reduced performance.  Disk blocks from the missing disk are
reconstructed by reading all remaining drives in the set and the parity drive.
RAID vendors typically include cache on board the RAID controller to increase
performance.  Note that the parity disk in RAID 3 can be a performance
bottleneck, which is why most RAID vendors go to RAID 5.
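
The reconstruction step can be illustrated with a short Python sketch.  It
assumes the parity is a byte-wise XOR of the data chunks, which is the usual
choice but is not stated in the text above; the chunk contents and the helper
name xor_blocks are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"ORCL", b"REDO", b"DATA"]   # one chunk per data drive (made-up contents)
parity = xor_blocks(data)            # what the parity drive would hold

# Drive 1 fails: rebuild its chunk from the surviving drives plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

Rebuilding a single chunk requires reading every surviving drive, which is why
the set runs at reduced performance while a disk is missing.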

RAID 3 is useful for Oracle data files, but not for redo log files.


2.1.5     RAID 5:  Striping with Rotating Parity

RAID 5 has performance and redundancy characteristics similar to those of RAID 3,
but the parity information is spread across all drives, which eliminates the
parity drive as a bottleneck.
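
To see why rotating the parity removes the bottleneck, consider the following
sketch.  The rotation rule used here (parity moves one drive to the left on each
successive stripe) is an assumption for illustration only; real controllers
differ in the order they use.

```python
def parity_drive(stripe, n_drives):
    """Drive holding the parity chunk for a stripe (assumed rotation order)."""
    return (n_drives - 1 - stripe) % n_drives


def layout(n_stripes, n_drives):
    """One row per stripe; 'P' marks the parity chunk, 'D' a data chunk."""
    rows = []
    for stripe in range(n_stripes):
        p = parity_drive(stripe, n_drives)
        rows.append(["P" if d == p else "D" for d in range(n_drives)])
    return rows


rows = layout(4, 4)
# Count how many parity chunks each drive holds over the four stripes.
parity_per_drive = [sum(row[d] == "P" for row in rows) for d in range(4)]
```

Over any run of n_drives consecutive stripes each drive holds exactly one parity
chunk, so parity writes are shared by all drives instead of queuing on one.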

RAID 5 is useful for Oracle data files, but not for redo log files.


2.1.5.1   Summary:  Oracle7 and RAID Levels

This is a summary of the various RAID levels and their use with Oracle7.  The
numbers in parentheses refer to the notes that follow the table.

RAID  Type of RAID          Control       Database      Redo Log      Archive Log
                            File          File          File          File

 0    Striping              avoid         OK            avoid         avoid

 1    Shadowing             recommended   OK            recommended   recommended

0+1   Striping +            OK            recommended   avoid         avoid
      Shadowing                           (1)

 3    Striping w/           OK            OK            avoid         avoid
      Static Parity

 5    Striping w/ Round-    OK            recommended   avoid         avoid
      robin Parity                        (2)

Notes:

1.   RAID 0+1 is recommended for database files because this avoids hot spots
     and gives the best possible performance during a disk failure.  This is a
     costly configuration though.
2.   RAID 5 is recommended for database files if RAID 0+1 is too expensive.



3.   Oracle7 and Cached I/Os

This section is meant to give perspective on disk caching products and the
Oracle RDBMS.  It covers both hardware-based (ESE20 solid state disk drive,
HSZ40 disk controller, Prestoserve disk caching memory module, etc.) and
software-based (UFS, AdvFS, VIOC, I/O Express, RAM disk, etc.) caching products.


3.1  Terms and Basics

In general terms, a cache:

-    is some small amount of expensive storage
-    replicates selected portions of some larger, cheaper storage
-    provides better (faster) access to the stored contents.

For example, a small paint can could be considered a cache compared to a
5-gallon drum of paint.  While we could walk back to the 5-gallon paint drum
between brush strokes, it makes more sense to carry a small can of paint with us
for quick, easy access to the paint.

In computing terms, a cache attempts to provide fast access to data that is held
on slow, cheap media by moving the actively used portions of it to faster, more
costly media.  For us, the slow, cheap media is disk drives and the faster, more
costly media is memory.  So a cache is some amount of memory that is used to
hold the selected contents of a disk drive so that the CPU has quicker access to
the information.


3.2  Types of Disk I/O Caching

Generally, there are two types of caching:  write-through and write-back.  These
types are differentiated by the policies used to maintain them.

Both types of caching treat read I/Os the same.  During a read, the memory cache
is checked first to see if the needed data is there.  If it is not, the read
completes from disk, and if appropriate, a copy is saved in the cache to save
the disk I/O on subsequent reads.  Performance characteristics for write-back
and write-through cached reads are the same too.  If the read is able to
complete from cache, the read will be fast.  If the read has to go to disk, the
read will be slow.

The difference between write-through and write-back caching is how they handle
data writes.  Both methods will write the data to the cache and the disk.  In
write-through caching, the write is not considered complete until the data makes
it to the disk.  In write-back caching, a write is complete when the data
makes it to the cache.  This difference has both performance and reliability
implications.

Write-back caching performs faster than write-through caching.  Because the
write-through cache write has to go to disk, it completes at disk speeds.  Since
the write-back cache write completes when the data gets to the cache, it
completes at memory speeds.  The increased write performance of write-back
caching comes at a price though.

Write-back caching has vulnerabilities to system failures that write-through
caching does not have.  Write-back caching is dependent upon memory.  Memory is
not persistent storage, i.e., when it loses power, it forgets everything.  So,
writes that an application was told were complete may not actually complete if
the system crashes before the write-back cache has a chance to dump its contents
back to disk.  This could leave the application's data in an inconsistent state.

Writes          Write-Through                   Write-Back

Complete when   data gets to disk               data gets to cache

Write speed     slow - have to wait for disk    fast - memory-to-memory copy

Vulnerability   none - writes to disk are       high - writes to memory aren't
                persistent                      persistent
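
The difference between the two policies can be made concrete with a toy model in
Python.  This is a deliberately simplified sketch: the CachedDisk class and its
methods are hypothetical, a dictionary stands in for the disk, and a crash is
modelled as simply losing the cache contents.

```python
class CachedDisk:
    """Toy model of a disk fronted by a volatile memory cache."""

    def __init__(self, write_back):
        self.write_back = write_back
        self.disk = {}       # persistent: survives a crash
        self.cache = {}      # volatile: lost on a crash
        self.dirty = set()   # cached blocks that are newer than the disk copy

    def write(self, block, data):
        self.cache[block] = data
        if self.write_back:
            self.dirty.add(block)     # acknowledged before reaching disk
        else:
            self.disk[block] = data   # acknowledged only after reaching disk

    def read(self, block):
        if block not in self.cache:   # miss: fetch from disk and keep a copy
            self.cache[block] = self.disk[block]
        return self.cache[block]

    def flush(self):
        """Copy every dirty block to disk (what a write-back cache does later)."""
        for block in self.dirty:
            self.disk[block] = self.cache[block]
        self.dirty.clear()

    def crash(self):
        self.cache.clear()
        self.dirty.clear()


wt = CachedDisk(write_back=False)
wb = CachedDisk(write_back=True)
for d in (wt, wb):
    d.write(7, "committed row")   # both caches report the write as complete
    d.crash()                     # power is lost before any flush
```

After the crash the write-through disk still holds block 7, while the write-back
disk never received it, even though the application was told the write was done.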

There are a number of variations on write-through and write-back caching, most
notably, write-behind caching.  Write-behind caching behaves like write-back
caching (with the same dangers), but with a time guarantee:  writes will get to
disk within N seconds after the write gets to the cache.  This is an interesting
twist to write-back caching in that it reduces the window of exposure, but the
exposure is still there.  OpenVMS' Spiralog sports write-behind caching
(discussed later).


3.3  Oracle as a Disk I/O Caching Product

This may be insulting to the Oracle7 developers, but it's true:  Oracle7 is a
fancy disk caching product that happens to understand SQL.  The cache is the
buffer cache portion of the SGA.  The portion of the disk being cached is the
database files.  Like any caching product, Oracle7 is trying to provide fast
access to data that is held on slow, cheap media (disk drives) by moving the
actively used portions of it to faster, more costly media (memory).

The Oracle7 buffer cache is maintained with a write-back algorithm.  A read will
be satisfied by the cache if possible.  If the data is not in the cache, the
read will be directed to disk and the results will be saved in the cache.
Writes change the block in the cache; they do not immediately go to disk.

Recall that while write-back caching has great performance characteristics on
writes, it also has reliability concerns during failures.  To ensure that no
writes to the database blocks are lost, a redo log is maintained.  The write to
the redo log includes a list of all database block changes and must occur before
commit is returned to the database user.  A redo log write is faster than a
random disk write since it is a spiral write, i.e., a sequential write to a disk
that should have no other activity on it.  The combination of a write-back
algorithm and redo logging provides Oracle7 the fastest possible performance
while maintaining complete data integrity and recoverability.
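
The interplay of a write-back buffer cache and a redo log can be sketched as
follows.  This is a toy model, not Oracle's implementation: the ToyDatabase
class is hypothetical, redo is stored as whole-block images rather than change
records, and recovery simply replays the entire log.

```python
class ToyDatabase:
    """Write-back buffer cache protected by an append-only redo log."""

    def __init__(self):
        self.datafile = {}       # persistent
        self.redo_log = []       # persistent, append-only
        self.buffer_cache = {}   # volatile

    def commit(self, changes):
        """changes: dict mapping block number -> new block contents."""
        self.redo_log.append(dict(changes))   # redo reaches disk first
        self.buffer_cache.update(changes)     # blocks change in cache only

    def write_dirty_blocks(self):
        """What a DBWR-like writer does later: copy cached blocks to disk."""
        self.datafile.update(self.buffer_cache)

    def crash(self):
        self.buffer_cache.clear()

    def recover(self):
        for changes in self.redo_log:         # replay redo in commit order
            self.datafile.update(changes)


db = ToyDatabase()
db.commit({1: "a"})
db.commit({1: "b", 2: "c"})
db.crash()                                    # no data file write happened yet
datafile_after_crash = dict(db.datafile)
db.recover()
```

Immediately after the crash the data file is empty, yet recovery restores both
committed transactions because their redo was persistent before commit returned.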

Note one of the differences between Oracle7 and other disk caching products.
Disk caching products allocate physical memory from the system for their cache.
Oracle7 does not.  It allocates the SGA from virtual memory (in order to
maintain portability across platforms among other reasons).  It is up to the DBA
and system administrator to ensure that there is enough physical memory
available on the system so that the operating system does not have to page or
swap Oracle7's cache.  The idea of paging a cache is self-defeating (think about
the purpose of a cache again, then think of the consequences of paging the
Oracle7 buffer cache to disk).  Further discussion of this is outside the scope
of this paper and is better left to Oracle7 database tuning guides.


3.4  Using Oracle7 with Disk I/O Caching Products

Disk caching services are provided by operating systems, and separate disk
caching products targeted at I/O-intense environments are available from
third-party vendors.  For OpenVMS, Digital has VIOC (Virtual I/O Cache) and
Executive Software makes I/O Express.  Digital UNIX has Prestoserve, AdvFS, and
UFS.  These products have good performance records and are based on well known
technology.

This leads to the question, "How does Oracle (a disk caching product of sorts)
behave when used with other disk caching products?"  There are several issues
that arise:  reliability, memory waste, and performance.


3.4.1     Reliability

If you want the performance improvements of a disk caching product, it is
important to understand its reliability characteristics.  Using unprotected
write-back caching with Oracle7 will probably lead to database corruption if a
system failure occurs.  Write-through caching does not cause these reliability
problems.  First we will see how these corruptions can happen in general terms,
then we will further define protected vs. unprotected.

Write-back cached database files.  Oracle knows at all times where the current
copy of a database block is:  either it is in the buffer cache or it is in the
database file (we will ignore Oracle Parallel Server for now, but the same
argument holds).  Even when the current copy of a database block is in the
buffer cache, Oracle knows how stale the disk block is and how much information
it needs to keep in order to bring the stale block on disk up to date again in
case of a system failure.  When write-back caching is used on database disks and
a system failure occurs, it is possible that Oracle's recovery mechanism will
find a disk database block to be more stale than expected, and have insufficient
information to bring the database block up to date.

Write-back cached redo log files.  When Oracle says a transaction has been
committed, this really means that Oracle has written the transaction's redo to a
persistent store -- the redo log file -- so that if the system crashes, Oracle
can regenerate the transaction.  When write-back caching is used on redo log
files, this redo log write is no longer persistent.  During recovery from a
system failure, it is possible that Oracle will not recover transactions that it
said were committed before the failure.

How can these corruptions happen?  In essence, Oracle has no idea that disk
caching software is running underneath it.  Most disk caching products are
implemented as device drivers or disk controllers and fool the layers of
hardware or software above themselves into thinking the I/O is really done.
Oracle optimizes both the amount of disk I/O it does and the amount of
information it keeps around to bring stale disk blocks up to date.  Oracle
depends on knowing what is on disk.  If the caching software does not get the
data to disk eventually, Oracle cannot recover from system failures.

The distinction between protected and unprotected write-back caching is as
follows.  Protected write-back caching ensures that the cached I/O will
eventually get to the disk drive.  Protected write-back caching is typically
battery-backed memory implemented as an I/O controller (HSZ40) or as a separate
memory module (Prestoserve).  Unprotected write-back caching simply uses the
computer's physical memory to implement the cache with no way to guarantee that
the I/O will get to disk in case of system failure.

Caution is in order even when using protected write-back caching.  In case of
system failure, protection against database corruption is only as good as the
battery that is keeping the write-back cache warm.  Make sure the write-back
cache hardware can complete the I/Os it said it would, especially after a
system failure.  Oracle cannot be held liable for database corruptions due to
write-back cache hardware problems.


3.4.2     Memory Waste

For software-based cache products, one problem that arises from combining them
with Oracle7 is that Oracle data may be doubly cached, once by Oracle and once
by the disk I/O caching product.  In the worst case, the user gets no
performance win from the disk I/O caching product and lots of memory is wasted
by storing the same information twice.  Some cache products recognize this
situation and allow the DBA or system administrator to disable disk I/O caching
by the product for selected files if desired.  OpenVMS' Spiralog and other
third-party products provide this selectivity.  Unfortunately, the current UNIX
filesystems, UFS and AdvFS, don't.

Memory waste is not an issue with hardware-based caching solutions since they do
not use system memory.


3.4.3     Performance Expectations

It is important to understand how the DBWR and LGWR algorithms work in order to
understand the effect disk caching products can have on Oracle's performance.
Beyond that, the larger needs of the system will determine which kind of caching
product, if any, should be used to attain optimal performance.  We start first
by restricting our view to Oracle DBWR and LGWR algorithm performance with
caching, then we will take the broader system-wide view.


3.4.3.1   Oracle DBWR and Disk I/O Caching

There are two notable problems with using disk caching with Oracle7's current
DBWR algorithm.  First is that caching disk writes may not necessarily make
Oracle run any faster.  Second, if caching does succeed in making DBWR run
faster, it may actually slow down users' ability to get read work done.

DBWR writes blocks in parallel batches, each referred to as a write batch.  DBWR
issues a batch of I/Os (asynchronous I/Os for both OpenVMS and Digital UNIX),
then waits for them all to complete before continuing processing, so the batch
as a whole behaves synchronously.  This means that the latency time before DBWR
begins doing useful work again is as long as the longest I/O in the batch.
Because of this, there is little reason to have a database disk farm with disks
of widely differing latency times.  In other words, there is no performance gain
to having part of a database on a solid state disk (like the ESE20) and another
part on a traditional disk (like an RZ28).
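
The point about mixed disk farms follows directly from taking the maximum over
the batch.  The sketch below makes this explicit; the latency figures are
illustrative assumptions, not measurements of an ESE20 or RZ28.

```python
def batch_latency(io_latencies_ms):
    """DBWR resumes only when the slowest I/O in the write batch completes."""
    return max(io_latencies_ms)


SSD_MS, DISK_MS = 0.5, 12.0   # assumed per-I/O latencies, for illustration only

all_disk = batch_latency([DISK_MS] * 8)
mixed = batch_latency([SSD_MS] * 4 + [DISK_MS] * 4)
all_ssd = batch_latency([SSD_MS] * 8)
```

With these figures the mixed batch takes just as long as the all-disk batch;
only moving every file in the batch to the faster device shortens the wait.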

Suppose a system-wide disk caching product is used, caching all DBWR writes.
This simply means that DBWR is able to get back to the work of cleaning the
Oracle7 SGA buffer cache more quickly.  Unless keeping the SGA clean is a
problem (and it rarely is), DBWR could be wasting processing time keeping the
cache too clean.


3.4.3.2   Oracle7 LGWR and Disk I/O Caching

The problem with caching LGWR writes is similar to the previous problem
mentioned with DBWR:  LGWR is able to get back to work too quickly.  The LGWR
design writes out batches of redo at a time, and uses the log write I/O latency
as a natural gating factor to determine the batch size.  This design allows the
LGWR algorithm to degrade gracefully under load.  As LGWR I/O latency becomes
smaller, so does its batching factor.  In the worst case, LGWR is doing a
separate write for each transaction on the system, causing the LGWR code path
executed per transaction to skyrocket.  It is possible (and we have seen it) for
LGWR to fire continuously when writing to a cached redo log file, consuming an
entire CPU in an SMP system.
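
A rough back-of-the-envelope model shows how the batching factor shrinks with
latency.  It assumes commits arrive at a steady rate and that every commit
arriving while one log write is in flight is folded into the next write; the
function name, the commit rate, and the latencies are all hypothetical.

```python
def lgwr_model(commits_per_sec, write_latency_sec):
    """Return (transactions per log write, log writes per second)."""
    # Commits that pile up during one write form the next batch; a batch
    # can never be smaller than a single transaction.
    batch = max(1.0, commits_per_sec * write_latency_sec)
    return batch, commits_per_sec / batch


slow = lgwr_model(1000, 0.010)    # uncached log disk, assumed 10 ms per write
fast = lgwr_model(1000, 0.0001)   # cached log file, assumed 0.1 ms per write
```

Under these assumptions the uncached log groups about ten transactions into each
of roughly 100 writes per second, while the cached log degenerates to one write
per transaction, so LGWR runs its code path about ten times as often.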


3.4.3.3   System-Wide Performance

The performance improvement attributable to a caching product is highly
dependent upon a number of factors:

-    whether the product is hardware or software based
-    whether the product is doing write-back or write-through caching
-    the size of the cache
-    the read vs. write mix on the system (heavy write would favor write-back
     caching, heavy read favors write-through)
-    the locality of reference of the I/Os
-    the performance requirements for different applications on the same system

In the end, whether or not a situation calls for disk caching software depends
on the performance needs and the situation itself.

Assume we have a system where Oracle is the only performance-critical
application on the system.  It does not make sense to use software-based disk
I/O caching.  Any physical memory that we would have used on the caching
software could be better utilized by Oracle7.  For additional performance, we
could consider adding more memory to the system (and giving it to Oracle7) or
perhaps using controller-based, protected write-back caching.

Now let's assume we have a system where both Oracle and another I/O-intense
application share the label "performance-critical".  We might consider a
software-based, write-through disk I/O caching product for this situation.  If
the disk caching product can be told to cache only the I/O-intense application,
then we will have two tuning "knobs" that we can use to fine-tune the
applications' performance -- one the size of the Oracle buffer cache, the other
the size of the caching product's cache.  If the disk caching product does not
have this selectivity, then we might want to give enough memory to it to make
both applications run well and use a very small Oracle buffer cache.




3.5  Summary:  Oracle7 and Disk I/O Caching Products

This is a summary of the use of disk caching products with Oracle7.  Numbers in
parentheses refer to the notes following the table.


Type of Caching   Control    Database   Redo Log   Archive Log
                    File       File       File         File

Write-Through        OK         OK       avoid        avoid

Write-Back,        never      never      never        never
Unprotected

Write-Back,        OK (1)     OK (1)   avoid (2)      avoid
Protected

Notes:

1.   Oracle cannot recommend using write-back caching.  While it may benefit
     control and database files, there are too many implementation issues that
     affect database integrity to make a wholesale endorsement.  If you choose
     to use protected write-back caching, test the cache's ability to recover
     from system failures before relying upon it in production systems.
2.   Write-back caching could cause the LGWR to work too hard and consume an
     entire CPU.



