Redo Processing
What makes a good database is its ability to recover from crashes and from mistakes made by users or DBAs, and redo data is what Oracle uses to recover from these types of mishap. I have already discussed redo in my Oracle section, but in this section I go into a little more depth and explain where it fits in the Oracle Data Guard environment.
Recovery deals mainly with redo data, which recovery can use to reconstruct all the changes made to the database; it is essentially a replay. Below are the important concepts and components of Oracle recovery
Most databases, including Oracle, are protected by the ACID properties, which state that when a transaction is started it must follow these basic rules
Remember these rules apply to all databases, not just Oracle; ACID provides the guarantee of reliability and consistency of a database. I have also discussed ACID in my Oracle 10g section on transactions.
There are three types of failures
So what really happens when you create a transaction, as it generates its changes and produces redo, and the log writer process (LGWR) flushes that redo to disk? A good understanding goes a long way to helping you resolve problems.
Note that before the DBWR process flushes the database buffers to disk, the LGWR process must already have written the redo buffers to disk; this is the write-ahead logging protocol, which states that no change may appear in the datafiles that is not already in the redo log.
Eventually the transaction will be committed, which allocates an SCN and goes through the same transaction life cycle steps as above; the commit guarantees that the redo previously generated is flushed to disk, which is known as a log-force commit.
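As a simple illustration of the commit allocating an SCN (the table and row used here are hypothetical, the views are standard):

Watch the SCN around a commit | sql> select current_scn from v$database;                      -- SCN before the transaction
sql> update emp set sal = sal * 1.1 where empno = 7369;        -- hypothetical table and row
sql> commit;                                                   -- log-force commit: LGWR flushes the redo to disk
sql> select current_scn from v$database;                       -- SCN has advanced past the commit SCN
|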
Components of a Physical Standby
As mentioned in the architecture section, Data Guard can be categorized into three major components
The overall picture looks like the diagram below
There are a number of Oracle background processes that play a key role; first, on the primary database
The standby database will also have key processes
Standby Redo Logs were introduced for the following reasons
SRL files are configured on both the primary and the standby but are only active on the standby until a switchover/failover occurs. They should be configured as the same size as the ORL files, and it is also recommended to have n+1 SRL groups per instance defined on the standby, where n is the number of online redo log groups per thread on the primary site.
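A minimal sketch of adding SRL files on the standby (the file path, group number and size are only examples; the size should match your ORL files):

Add standby redo logs | sql> alter database add standby logfile thread 1 group 10 ('/u01/oradata/db/srl10.log') size 500m;   -- repeat until you have n+1 SRL groups per thread
sql> select group#, thread#, bytes/1024/1024 mb, status from v$standby_log;                                 -- confirm the SRL groups
|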
Here is a list of all the processes running on the standby
and the list of redo processes
Let's carry on with the life cycle of a transaction, with Data Guard now added to the equation. We left off with the LGWR just flushing the redo to disk, and we are going to presume that ASYNC transport and real-time apply (RTA) are configured.
The whole process above is continuously repeated until recovery is stopped or a role transition (switchover/failover) occurs. With real-time apply, when redo is received by the RFS on the standby, the process writes the redo to the archived redo logs or optionally to the SRL files. In some cases the redo rate becomes too high and the apply process is unable to keep up in real time; in this case the MRP (or LSP) automatically performs the redo apply using the archived log files, and when it has caught up the apply will again resume real-time apply using the SRL files. A diagram of the whole process is below.
To enable real-time apply
Enable real-time apply | sql> alter database recover managed standby database using current logfile disconnect; |
Determine if real-time apply is enabled | sql> select recovery_mode from v$archive_dest_status where dest_id = 2;

RECOVERY_MODE
--------------------------
MANAGED REAL-TIME APPLY
|
You can improve real-time apply performance by tuning the following
Parallel Media Recovery (PMR) allows you to keep up with the primary database. The MRP process performs a scan of the redo logs, then parses them and builds redo change segment maps; this is easily handled by the MRP process. Once the map is built the apply can begin, and this is where parallelism occurs. In Oracle 11g the PQ slave problem in 10g has been solved by leveraging Oracle kernel slave processes (KSV), which can be seen as PR0x processes; the MRP process uses these kernel processes to perform the actual recovery. Each PR0x process hashes to a segment map, reads from that map, and applies the redo to the standby database. The number of PR0x processes is dependent on the number of CPUs in the system.
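To get a feel for the parallelism available to recovery, the following sketch (run on the standby) shows the CPU count, which bounds the number of PR0x slaves, and the MRP coordinator:

Check recovery parallelism | sql> select value from v$parameter where name = 'cpu_count';                                         -- upper bound on the number of PR0x slaves
sql> select process, status, thread#, sequence# from v$managed_standby where process like 'MRP%';    -- the MRP coordinator process
|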
Tools and Views for Monitoring
There are a number of tools and views that are useful for identifying status and performance problems; first we start with the Data Guard views
Background processes | select process, client_process, thread#, sequence#, status from v$managed_standby; |
Information on Redo Data | select * from v$dataguard_stats; Note: this indirectly shows how much redo data could be lost if the primary db crashes |
Redo apply rate | select to_char(snapshot_time, 'dd-mon-rr hh24:mi:ss') snapshot_time, Note: this command can only run when the database is open |
Recovery operations | select to_char(start_time, 'dd-mon-rr hh24:mi:ss') start_time, item, round(sofar/1024,2) "MB/Sec" from v$recovery_progress where (item='Active Apply Rate' or item='Average Apply Rate'); |
You can use Oracle's statspack, which you should be familiar with if you have ever used Oracle before; it can be run on the primary to collect and store performance data from the standby database. However, the standby needs to be opened read only, which does require an additional license because you would have to use Active Data Guard.
Follow the steps below to set up statspack for a standby database
statspack for the standby database | # This is the newer version of statspack, all the scripts should reside in $ORACLE_HOME/rdbms/admin |
There are a number of scripts that can be used regarding statspack
Physical Standby Corruption Detection
Although this is an Oracle 11g feature, it can be used with standby databases. Oracle has a number of components that detect data block corruption (see my database corruption section for further information); the corruptions are recorded in the view v$database_block_corruption, and you can create an alert in Enterprise Manager to let you know of any new corrupted data blocks. You can enable an internal mechanism to provide even better data protection by setting the parameter db_ultra_safe (to data_only or data_and_index). This parameter includes the following checks
By setting the parameter db_ultra_safe you are implicitly setting the parameters db_block_checking, db_block_checksum and db_lost_write_protect; however, this parameter does have a performance impact, roughly 1-10% depending on the number of changes, and it will be higher on a physical standby than on the primary.
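A sketch of enabling it (db_ultra_safe is a static parameter, so it needs an spfile change and an instance restart; data_and_index is the strictest setting):

Enable db_ultra_safe | sql> alter system set db_ultra_safe = data_and_index scope=spfile;   -- takes effect after the restart
sql> show parameter db_block_checking                                                       -- after the restart the underlying parameters are raised accordingly
|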
The physical standby database's redo apply mechanism implicitly verifies redo headers for correct format and compares the version of the data block header with the tail of the block for accuracy. When db_block_checksum is set, it compares the current block checksum with the calculated value; checksums catch most data block inconsistencies. It also validates more internal data block structures such as Interested Transaction Lists (ITLs), free space and used space in the block.
When db_lost_write_protect is set to typical on the primary database, the instance logs buffer cache reads for read-write tablespaces in the redo log; when the parameter is set to full, the instance also logs redo data for read-only tablespaces. When set to either typical or full on a physical standby database, the instance performs lost write detection during media recovery. When set to typical on the primary, the buffer cache block reads recorded in the redo log can be used to detect lost writes on the standby database; this is done by comparing the SCN versions of the blocks stored on the standby with those in the incoming redo stream. If a block version discrepancy occurs, this implies that a lost write occurred on either the primary or the standby database.
Lost writes occur during hardware failures (disk, HBA) and can go undetected for a while as they are silent corruptions. If the SCN of a block on the primary database is lower than the SCN of the same block on the standby database, a lost write on the primary is detected and an internal error (ORA-752) is thrown; to recover you must failover to the standby and re-create the primary. If the SCN on the primary is higher, a lost write on the standby is detected and an internal error (ORA-600 [3020]) is thrown; to repair you must re-create the standby database or the affected files.
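Any corrupt blocks that have already been detected can be listed from the view mentioned earlier:

List corrupted blocks | sql> select file#, block#, blocks, corruption_change#, corruption_type from v$database_block_corruption; |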
Oracle recommends that you set the parameter db_lost_write_protect to typical, which gives good protection with minimal impact on your databases; if you want the best protection and don't mind a performance hit, then set db_ultra_safe.
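A minimal sketch of setting lost write protection on both sites (the parameter is dynamic, so no restart is needed):

Enable lost write detection | sql> alter system set db_lost_write_protect = typical scope=both;   -- run on the primary and on the standby
|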