Thursday, July 17, 2008

redo undo and data consistency

the following text was copied from http://www.freelists.org/archives/oracle-l/12-2005/msg00537.html

=========================

When modify a table with normal insert/update/delete, then


1) redo for corresponding UNDO block change for this update is placed in redolog buffer
2) corresponding UNDO block is changed
3) redo for DATA block change for this update is placed in redolog buffer
4) DATA block itself is changed

Note that during buffer cache block changes we record the RBA (redo byte address in redo stream) of last change in buffer headers (can be seen from X$BH). RBA is an address in redo stream, it can uniquely identify a redo record/change vector in redologs.

As long as we don't commit (with sync option) here, no redo has to be written onto disk.

1) If instance crashes now, we don't have neither the redo nor datablock changes persisting, thus no inconsistencies
2) If instance crashes after LGWR has written the redo to disk, but DBWR hasn't written the datablocks, then during instance recovery we:
a) read old versions of datablocks from disk
b) apply redo to the old versions - note that Oracle guarantees every change vector between old version on disk and current version until commit to be in online redologs - e.g. it doesn't allow any redolog to be overwritten before all corresponding datablocks are checkpointed to disk.
c) this above is called cache recover - we make sure that all datafiles are physically consistent. After this point if we still have inconsistent/fractured blocks, we have encountered a hardware fault or software bug
d) database can be opened now, if any server processes see uncommitted datablocks, they can roll these back on their own - leaving the blocks they don't need for SMON (this is called fast start rollback if memory serves correctly)
e) SMON goes through undo segments to see whether there are uncommitted transactions (the committed transactions have a commit record at their end in redologs) and rolls those back (or hands those over to parallel slaves to roll back depending on transaction sizes)

3) It is not possible to have a situation where DBWR writes modified buffers back to disk before their redo is written. This is called write-ahead logging protocol and is mentioned in documentation. It is achieved by marking RBA of latest change to buffer header (as mentioned above) and DBWR checks whether redo up to max(RBA) in write block batch is already on disk (from memory structure underlying X$KCCCP).
If it's not, DBWR will signal LGWR and won't write those buffers to disk until LGWR acknowledges that redo is on disk.

So, as long as the foundations on what Oracle relies (OS, hardware) works properly and we don't hit any bugs (and unless I'm wrong) Oracle ALWAYS guarantees the blocks to be consistent after cache recovery in case of instance failure. It doesn't matter whether only half of DBWR's block write succeeded, or 99% or 1%, Oracle's recovery mechanisms are conceptually reliable.

All of above was a simplification, especially from 10g where we can have some redo log buffers in shared pool (private strands), some undo data cached in shared pool (in-memory undo), out-of order physical redo (multiple redolog strands and redo allocation latches), but again, I'm sure Oracle has based their recovery mechanisms on solid theoretical foundations here - no vulnerabilities on race conditions nor dependencies on which order you execute your stuff.

Tanel.

No comments: