Data Guard - Switchover and Failover

This section is what Data Guard is really all about, its about a standby database taking over the production database, and how to revert back when problems have been fixed.

Role transition is divide into two types switchover and failover, they also might be called switchback, failback but they all mean the the same thing to transition the state of Data Guard from one state to another.

Switchover

Switchover is the act of change the standby database into the primary but in a controlled manor, the planned event means that it is safe from data loss because the primary database must complete all redo generation on the production data before allowing the switchover to commence. The switchback does not exists as it is a switchover but in the reserve order, which would restore the database back on its original server. This planned event normally happens during a quiet period, the reason for the switchover might be DR testing, patch, hardware changes, implementing RAC, etc.

Once the switchover is complete the redo from the new primary will send it to the remaining standby servers, including the old primary, if using either grid control or the broker this should be all automatically do for you, but if you are using SQLPlus you have to performance some manual work.

You always start the switchover on the primary database, the actual switchover command is below whether you are using Grid Control, Broker or SQLPlus.

start the switchover (primary)

alter database commit to switchover to standby;

When the switchover command is executed the redo generation is stopped, all DML related cursors are invalidated and users are either prevented from executing transactions or terminated and he current redo log is archived for each tread. A special switchover marker called the EOR (end of redo) is then placed in the header of the next sequence for each thread, and the online redo files are archived a second time, sending the final sequences to the standby databases. At this point the physical standby database is closed and the final log switch is done without allowing the primary database to advance the sequence numbers for each thread.

After the EOR redo is sent to the standby databases, the original primary database is finalized as a standby and its control file backed up to the trace file and converted to the correct type of standby control file. In the case of a physical standby switchover the managed recovery process (MRP) is automatically started on the original primary to apply the final archive logs that contain the EOR so that all the redo ever generated is processed. The primary is then dismounted and must then be restarted as a standby database in at least the mount state.

The standby database must received this EOR redo otherwise the switchover cannot occur, once this redo has been received and applied to complete the switchover you run the following command, this will be automatic if you are using the Grid Control or the Broker

complete the switchover (new primary)

alter database commit to switchover to primary;

The physical standby switchover will wait for the MRP process to exit after processing the EOR redo and then convert the standby control file into a normal production control file. The final thing to do is to open the database for general production use

complete the switchover (new primary)

alter database open;

A logical standby also has to wait for the EOR redo from the primary to be applied and SQL apply to shut down before the switchover command can complete, once the EOR has been processed, the GUARD can be turned off and production processing can begin.

Failover

A failover is a unplanned event when something has happened to hardware, networking, etc. This is when you invoke you DR procedures (hopefully documented), and you will have full confidence in getting the new primary up and running as quickly as possible. Unlike the switchover which begins on the primary, no primary is involved which means you will not be able to get the redo from the primary. Depending on what protection mode you have chosen there may be data loss (less you have a Maximum Protection mode enabled), you start be telling Data Guard to apply the remaining redo that it can. Once the redo has been applied you run the same command that you do with a physical standby to switchover the standby to a primary

complete the switchover (new primary)

alter database commit to switchover to primary;

Once difference is when the switchover has completed the protection mode will be maximum performance regardless what it was before, to get it back to your original protection mode you must get a standby database back up and running, then manually execute the steps to get it into the protection mode you want.

change the protection mode

# Choose what level of protection you require
sql> alter database set standby to maximize performance;
sql> alter database set standby to maximize availability;
sql> alter database set standby to maximize protection;

If you are using a protection mode that may result in data, the received archive redo logs are merged into a a single thread and the sequence is sorted on the dependant transaction, this merged thread is then applied to the standby database up until the last redo. This may take sometime if using a RAC environment as the redo data has to be transfers from each instance.

Since the redo heartbeat is sent every 6 seconds or so, the general rule is that you may lose 6 seconds of redo during a failover but this is a best guess. At failover the merging thread will look at the last log of the disconnected thread and use the last heartbeat in it to define the consistent point, throwing away all the redo that the surviving nodes had been sending all along.

Flashback and Data Guard

I have discussed Flashback technologies in my Oracle section, you can enable flashback on the primary database, on one or more standby databases or on any combination of the two. You can use flashback in the following circumstances

On the primary database	to recover data in lieu of a delay on the apply processing.
On a physical standby database	to open a physical standby database in snapshot mode (read-write for test). Data Guard uses a guaranteed restore point (GRP) before opening the physical standby in read-write mode. The retention period is also not needed as the flashback logs will be maintained as long as necessary. Make sure that you size the flash area accordingly as this could stop the redo processing.
On the primary and all standby databases	If you want to flashback primary database, you must enable it on all standby databases as well. The retention period must be the same across all databases. If you perform a flashback recovery on the primary, you must stop the apply process on all standbys, when the open resetlogs has been executed, you must flashback each standby to a point in time prior to the primary. When the redo starts to be transmitted again the standby will recognize the change and recover the standby up to the point of the flashback and then start processing the new redo stream.
On the primary but after a failover	to be able to recover the database after a failover. Data Guard needs only a little bit of flashback log to move the failed primary back in time so that it can be converted to a standby database and resynchronized with the primary.

Physical Switchover

In this section we will perform a switchover, we will perform some checks before the switchover, then switchover. I will discuss the switchover using SQLPlus then how to do it using the Broker, learning the SQLPlus will give you a good understanding on how the whole processes works and what the broker does in the background.

To start with the check that you must do are the following

Make sure that all of the necessary Data Guard parameters have been configured
Make sure that the SRL files have been configured
Make sure that you have a temporary data file defined
Make sure that Flashback has been enabled (optional)

The reason for the checks is that if it all goes wrong it would easy be able to switchback to the primary.

First we verify that the standby has received all the redo

check redo has been received

## check the syn status, it should say yes (run on the standby)
select db_unique_name, protection_mode, synchronization_status, synchronized from v$archive_dest_status;

## if it says NO then lets make further checks (run on the standby)
select client_process, process, sequence#, status from v$managed_standby;

## now check on the primary we should be one in front (run on the primary)
select thread#, sequence#, status from v$log;

Note: if using a RAC environment make sure you check each instance

Once you have determined that the redo stream is current, we need to check that the redo has been applied

check that redo has been applied (physical)	## check that MRP (applying_log) matches the RFS process, if the MRP line is missing then you need to ## start the apply process, you also may see the status of wait_for_gap so wait until the gap have been ## resolved first select client_process, process, sequence#, status from v$managed_standby;
check that redo has been applied (logical)	## if you are using a logical standby then you need to check the following to confirm the redo has been ## applied select applied_scn, latest_scn, mining_scn from v$logstdby_progress; ## if the mining scn is behind you may have a gap check this by using the following select status from v$logstdby_process where type = 'READER';

Finally you must cancel any jobs and backups that are running, however there should be none as you will probably be performing the switchover during a quiet period

show any running jobs or backups

select process, operation, r.status, mbytes_processed pct, s.status from v$rman_status r, v$session s where r.sid=s.sid;

Once the checks have been carried out you are ready to perform the switchover, you may want to tail the alert logs as it will give you an idea on what is going in the background and you can formularize yourself with the messages that are displayed, you can increase the level of messaging for diagnostic purposes

increase logging level

alter system set log_archive_trace=8129;

## to turn it off again
alter system set log_archive_trace=0;

One point to make is that if you are using flashback make sure that you drop the GRP restore point on all databases after the switchover is complete, otherwise you will be generating permanent flashback database logs forever.

If you are running a Oracle RAC then you need to bring the RAC down to a single instance, because this is a planned it may be worth using shutdown immediate and try not you use shutdown abort, as oracle will have to recover first. You can check the session by running the below command

check active sessions	select switchover_status from v$database;
check the switchover status	select switchover_status from v$database;

Now it it is time to switchover

switchover

## on the primary

alter database commit to switchover to physical standby with session shutdown;

keep an eye on your alert logs, then check that first part of the switchover has completed

check the switchover status

select switchover_status from v$database;

to finish off the switchover

complete the switchover

alter database commit to switchover to primary with session shutdown;

open the new primary database

open the new primary

alter database open;

finish off the old primary and start it up as a standby

finish off the old primary

shutdown immediate;
startup mount;
alter database recover managed standby database using current logfile disconnect;

You can now startup auxiliary RAC instances on the primary and standby databases if you are using RAC.

Switching over to a Logical Standby

Switching over to a logical standby is simpler because the database is already open for reading and writing, but there is a second set of steps that guarantee the safety of your data if a failure occurs after the switchover but before the the new logical standby can process transactions. As of Oracle 11g you no longer need to shutdown any database instances to perform the switchover, older versions of Oracle you need to shutdown all auxiliary instances.

Firstly check the state of the primary

Check primary database state

## It should have a status of "to standby"

select switchover_status from v$database;

Unlike a physical standby which a exact copy of the primary, a logical standby could be different and once the roles are reversed the new logical standby needs to know what the new primary looks like so SQL Apply process can process the redo stream. So to prepare we run the following command on the primary

Prepare the primary standby

alter database prepare to switchover to logical standby;

## confirm that the prepare has started to happen, you should now see "preparing switchover"
select switchover_status from v$database;

Now go to the logical standby and tell it to send the preparation information to the primary in its redo stream

Prepare the logical standby

alter database prepare to switchover to primary;

## confirm that the prepare has started to happen, you should see "preparing dictionary"
select switchover_status from v$database;

## wait a while until the dictionary is built and sent and you should see "preparing switchover"
select switchover_status from v$database;

The logical will commence a dictionary build and put it into the redo stream of the logical standby, which will be sent to the primary. The reason for this step is that new logical standby knows how to apply the redo from the new primary the moment new primary transactions start to generate redo. If you skip this step you will be generating redo from business transactions that will be sent to your logical standby. The new dictionary would then be behind that redo, and if you had a failure of the new primary before the dictionary was sent to the new logical standby, a failover would result in data loss.

Now check the primary again and it should be ready to become the new logical standby, to confirm this use the following command

Check primary database state

## you should now see its in the state of "to logical standby"

select switchover_status from v$database;

At this point you can cancel the whole switchover as you are not committed yet

cancel the switchover

## On the primary
alter database prepare to switchover cancel;

## on the logical
alter database prepare to switchover cancel;

If you are happy to continue the switchover then on the primary lets tell it that its going to be a logical standby

switchover the primary to a logical standby

alter database commit to switchover to logical standby;

Your users will now have to wait until the switchover has completed, they may experience significant stalling. Once the switchover is completed and Data Guard enables the Guard on the primary to prevent more updates to the data, as this is now a logical standby database.

Now finish off the logical standby database

switchover the logical standby to a primary

## check that its ready to become the primary, you should see "to primary"

select switchover_status from v$database

## Complete the switchover
alter database commit to standby to primary;

Start the apply process on the new logical standby

start the apply process

alter database start logical standby apply immediate;

That's it, you should now have a successful switchover, you can use my monitoring section to check everything is working correctly.

That seemed a lot of work above to switchover, if you have a broker configured then it just one command, buts its good to learn what goes on behind the scenes when you use the broker.

Switchover using the broker

DGMGRL> switchover to prod1lr

Don't forget that when you are learning to tail the alert and broker logs when your are playing around with switching over, you get a better understanding on what Oracle is doing.

So the finals checks you need to make before the switchover are

Make sure that your configuration has been setup properly and has not been altered
Meet all the Data Prerequisites
Enable Flashback Database
Verify that redo is being sent to the standbys and is being applied
Ensure that the standby will not encounter problems during the switchover due to nologging DML on the primary if you have not enabled FORCE LOGGING

Performing a Failover

Hopefully you have prepared yourself as you never know when this will happen it could be in the middle of the night and you are awoken when your Boss calls you saying that the Production Data Center is no longer, but you will be ready as no chance of getting any redo from the primary as it is gone. Depending on the protection mode selected there may be some data loss but again everyone will be aware of this as it will be part of the DR documentation. Firstly you must choose a standby database to become the primary, if you have one standby in maximum protection mode then this would be the prime target as this is guaranteed to be synchronized with the primary database, below would be the choice of database to use

Mode	Order
Maximum Protection Mode database	1
Maximum Availability (SYNC)	2
Maximum Availability (ASYNC)	3
Maximum Performance (SYNC)	4
Maximum Performance (ASYNC)	5

Basically choose any maximum protection mode database first, then maximum availability second, etc. If you have two database with the same protection and using SYNC, then you need to determine which one has the most redo to do this run the below command

Check redo applied

## This will tell you the lag time

select name, value, time_computed from v$dataguard_stats where name like '%lag%';

## You can also use the SCN number

select thread#, sequence#, last_change#, last_time from v$standby_log;

Once you have a chosen a target we are ready to failover, firstly shutdown any other instances if you are using a RAC, hen tell the standby that it is to become the primary

the failover process (physical standby)

## Start by telling the apply process that this standby is going to be the new primary, and to apply all
## the redo that it has

alter database recover managed standby database cancel;
alter database recover managed standby database finish;

## At this point the protection mode is lowered

select protection_mode from v$database;

## Now issue the switchover command and then open the database

alter database commit to switchover to primary with session shutdown;
alter database open;

## Startup the other RAC instances if using RAC

## You can then raise the protection mode (if desired)

set standby database to maximum protection;

If you want to failover to a logical standby then it's one command, there is no need to shutdown any other instances if using a RAC environment

the failover process (logical standby)

alter database activate logical standby database finish apply;

The Old Primary

The data center has now been fixed and the old primary is back online and you need to do this quick to make sure that you are protected again. If yo are not using failback database then you have to delete the original database and re-create it using the RMAN procedure I used in setting up a physical standby. If you have enabled flashback database then we can easily bring back the primary

bring back the old primary (physical standby)

## Since redo is applied by SCN we need he failover SCN from the new primary

select to_char(standby_became_primary_scn) failover_scn from v$database;

FAILOVER_SCN
-----------------------------------------------
7658841

## Now flashback the old primary to this SCN and start in mount mode
startup mount;
flashback database to scn 7658841;
alter database convert to physical standby;
shutdown immediate;
startup mount;

## hopefully the old primary will start to resolve any gap issues at the next log switch, which means we can start the MRP
## process to get this standby going to catchup as fast as possible

alter database recover managed standby database using current logfile disconnect;

## eventually the missing redos will be sent to the standby and applied, bring us back to synchronization again.

bring back the old primary (logical standby)

## again we need to obtained the SCN
select merge_change# as flashback_scn, processed_change# as recovery_scn from dba_logstdby_history where stream_sequence# = (select max(stream_sequence#)-1 from dba_logstdby_history);

flashback_scn recovery_scn
---------------------------------------------------------
7658941 7659568

## Now flashback the old primary to this SCN and start in mount mode
startup mount;
flashback database to scn 7658841;
alter database convert to physical standby;
shutdown immediate;
startup mount;

## Now we need to hand feed the archive logs from the primary to the standby (old primary) into the MRP
## process, so lets get those logs (run on the primary)

select file_name from dba_logstdby_log where first_changed# <= recovery_scn and next_change# > flashback_scn;

## Now you will hopefully have a short list of the files you need, now you need to register them with
## the standby database (old primary)

alter database register logfile '<files from above list>';

## Now you can recover up to the SCN but not including the one you specify
recover managed standby database until change 7659568;

## Now the standby database becomes a logical standby as up to this point it has been a physical one.
alter database active standby database;

## Lastly you need tell your new logical standby to ask the primary for a new copy of the dictionary and
## all the redo in between. The SQL Apply will connect to the new primary using the database link and
## retrieve the LogMiner dictionary, once the dictionary has been built, SQL Apply will apply all the
## redo sent from the new primary and get itself synchronized

create public database link reinstatelogical connect to system identified by password using 'service_name_of_new_primary_database';

alter database start logical standby apply new primary reinstatelogical;

At a convenient time you will perform a switchover which will put the whole environment back to normal.

Again if you use the Broker it makes very light work of this

Failover using the Broker

DGMGRL> failover to prod1dr;

Again reinstating the old primary is much easier too

reinstating the old primary using the Broker

DGMGRL> reinstate database prod1;

Automatic Failover

Data Guard has implemented something called Fast-start Failover (FSFO) which uses the Broker to perform the failover actions when there is a problem. This architecture uses a third member quorum that ensures that the failover occurs only when everything meets the rules that you have defined, when the failover has happened the primary will never be allowed to open to avoid any split-brain scenarios, this would be a bit of a nightmare should both databases be open and processing transactions.

The third member is called the Observer and its job is to maintain a connection with the primary and target standby databases, monitoring there health and performing any failover's necessary, the Observer will also reinstate the old primary when it comes back on, if it can. The observer pings the primary database and that the first sign of trouble it will start to countdown (which you configure), if it does re-establish the connection it will make all the necessary checks before goes back to watch mode again, if the timer expires then it checks that the standby can take over and initiates a failover, this switchover will have all automatically and in the background using the Broker. If and when the primary comes back the Observer will reinstate the old primary as a standby database again using the Broker to achieve this.

It is import on where he Observer is placed in the network, only one observer per Data Guard installation can be installed, so this server must have access to both the primary and standby databases with as much redundant networking as possible. Next your thing is how much is the observer going to cost me, not much as it can run on most platforms and only required the Oracle Client Kit for the version of Oracle that you are running, you must setup the TNSNAMES on the observer to allow it to ping the databases. If the Observer was to crash it will have no impact on the current Oracle environment, the only impact is that FSFO will not be available until the Observer is up and running again. The Observer can monitoring the following

Database crash
System crash
The loss of the network
Complete site outage

You can also get FSFO to perform a shutdown abort on the primary when other issues arise such as

Datafile Offline
Corrupted Controlfile
Corrupted Dictionary
Inaccessible Logfile
Stuck Archiver

The tags above must be entered as they are above otherwise the Broker will not understand them

Monitor a specific condition via the Broker

DGMGRL> enable fast_start failover condition "Corrupted Controlfile";
DGMGRL> enable fast_start failover condition "Datafile Offline";

To display what you are monitoring use

Display conditions that are be monitored

DGMGRL> show fast_start failover;

Fast-Start Failover: DISABLED

Threshold: 30 seconds
Target: (none)
Observer: (none)
Lag Limit: 30 seconds
Shutdown Primary: TRUE
Auto-reinstate: TRUE

Configurable Failover Conditions
Health Conditions:
Corrupted Controlfile YES
Corrupted Dictionary YES
Inaccessible Logfile NO
Stuck Archiver NO
Datafile Offline YES

Oracle Error Conditions:
(none)

Now that you have an overview of FSFO it's time to set it up and test it, just a quick check before we progress, make sure that the following has been setup or configured

Use the Broker with all its prerequisites
Enable flashback database on both primary and standby
Setup the configuration correctly for the protection mode (standby redo logs files on both sides, redo transport setup the same in both directions)
Install the Observer system and configure TNSNAME

If you are using more than one standby you must let the Broker know which one you want to become the primary, if you only have one then the broker will know already

Select the standby to become the primary

DGMGRL> edit database prod1 set property FastStartFailoverTarget = 'prod1dr';
DGMGRL> edit database prod1dr set property FastStartFailoverTarget = 'prod1';

Now its time to discuss how long you should wait before you want to failover, you don't want it too short just in case you network blips, by default it is set to 30 seconds but you can go down to 6 seconds if you wish.

change threshold

DGMGRL> edit configuration set property FastStartFailoverTargetThreshold = 45;

You can control the amount of data loss, if using one of the lesser protection modes, the greater the lag limit set the greater the data loss, again the time is in seconds.

lag limit

DGMGRL> edit configuration set property FastStartFailoverLagLimit = 60;

If the data loss is less then the limit the failover will proceed, if more redo would be loss than the lag limit, the failover will not occur and nothing happens until the primary database either comes back and processing continues or you choose to failover manually, suffering the additional data loss. If you are using maximum protection mode then this property is ignored.

Here are two more additional properties that you can setup regarding the primary, one is to shutdown it down if it becomes hung and the other is to reinstate it if a failover does occur

abort primary if in a hung state	DGMGRL>edit configuration set property FastStartFailoverPmyShutdown = true;
reinstate primary after a failover	DGMGRL>edit configuration set property FastStartFailoverAutoReinstate = true;

Once you are happy with everything you can now enable the FSFO

Enable FSFO

DGMGRL> enable fast_start failover;

## Display the configuration

DGMGRL> show fast_start failover;

Once all setup you can test the FSFO by performing a shutdown abort on the primary, and checking that the failover occurs and that they primary is reinstated and with the amount of data loss expected if using the lesser protection modes. If you are using a test environment this is the time to experiment and play around with different settings. Again keep an eye on the log files including the Broker log file to see how Oracle handles the failover's and to become familiar with them.