Monitoring Data Guard

Proactive monitoring plays a big part in making sure that your Data Guard environment is working correctly and that potential problems are picked up quickly. This is one area the business never sees; it takes a lot of work to keep a system running 365 days a year without downtime. This section covers both physical and logical standby databases: we look at log files, statistical information, log gaps and redo transport, and we also take a closer look at the Data Guard Menu utility which I mentioned in my broker section.

We first start with the alert log file, which every DBA should be familiar with. I use a script supplied by the www.dataguardbook.com website; I have made a few changes and linked it into my monitoring system (Big Brother, or BB for short). The script compares the alert.log to the previous version that was checked and alerts on any new ORA errors. The script is called alert_log_monitor.ksh. You need to set up an environment first, which means creating a file called .ORACLE_BASE which you source first or put in your profile so that it gets sourced every time you log in

.ORACLE_BASE

export BASE_DIR=/home/oracle
export ORACLE_BASE=/home/oracle
export PATH=/usr/local/bin:/bin:/usr/bin:/usr/sbin:$PATH
export SH=$ORACLE_BASE/general/sh

The script uses sqlplus -V to determine whether you are using 10g or 11g and thus the correct location of your alert log, or you can just edit the script to point to your own location. When the script detects new ORA errors it calls the alert_notification.ksh script, which in turn uses alert_notification.ctl (which you edit to point to the email addresses that you want to alert).

Here is an example of the alert_notification.ctl file; notice the catch-all recipient in the last line. The script will pick up the following categories: MINOR, WARNING and CRITICAL.

alert_notification.ctl

blockmon.ksh          DB  CRITICAL dbasupport@datadisk.co.uk itsupport@datadisk.co.uk
logical_log_alert.ksh DB  CRITICAL dbasupport@datadisk.co.uk itsupport@datadisk.co.uk
alert_log_monitor.ksh OSE WARNING  dbasupport@datadisk.co.uk
archmon.ksh           OSE MINOR    dbasupport@datadisk.co.uk
*:                    OTH MINOR    dbasupport@datadisk.co.uk

You can also just check the alert log each day; I generally mail it to myself and have a look. This has the added benefit that you become familiar with it and can see what else Oracle is up to, though this really only applies if you have a small environment.

In an ideal world a log switch should occur every 30 minutes, and no more often than every 15 minutes; you can check this by viewing the v$log_history view. Try to size the logs so that you hit the 30 minute mark (remember that if you increase the primary log sizes you must increase the standby ones too). To work out how many standby redo logs you need, you can use the formula (number of ORLs per instance + 1) * number of threads.

So if you had a 3-node RAC with 3 ORLs per instance you would need (3 + 1) * 3 = 12 standby redo logs.
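
As a quick sanity check on how often switches are actually occurring, something along the lines of the below against v$log_history will show the minutes between consecutive log switches (a minimal sketch; the column alias is my own, and the first row per thread will show blank)

minutes between log switches

select thread#, sequence#,
       round((first_time - lag(first_time) over
             (partition by thread# order by first_time)) * 24 * 60, 1) "Mins Between"
  from v$log_history
 order by first_time desc;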

You can use the below to identify the peak archive times for a specified day

identify peak archive times

## You probably will have to increase your linesize
select to_char(trunc(first_time), 'Mon DD') "DG Date",
to_char(sum(decode(to_char(first_time,'HH24'),'00',1,0)),'9999') "12am",
to_char(sum(decode(to_char(first_time,'HH24'),'01',1,0)),'9999') "01am",
to_char(sum(decode(to_char(first_time,'HH24'),'02',1,0)),'9999') "02am",
to_char(sum(decode(to_char(first_time,'HH24'),'03',1,0)),'9999') "03am",
to_char(sum(decode(to_char(first_time,'HH24'),'04',1,0)),'9999') "04am",
to_char(sum(decode(to_char(first_time,'HH24'),'05',1,0)),'9999') "05am",
to_char(sum(decode(to_char(first_time,'HH24'),'06',1,0)),'9999') "06am",
to_char(sum(decode(to_char(first_time,'HH24'),'07',1,0)),'9999') "07am",
to_char(sum(decode(to_char(first_time,'HH24'),'08',1,0)),'9999') "08am",
to_char(sum(decode(to_char(first_time,'HH24'),'09',1,0)),'9999') "09am",
to_char(sum(decode(to_char(first_time,'HH24'),'10',1,0)),'9999') "10am",
to_char(sum(decode(to_char(first_time,'HH24'),'11',1,0)),'9999') "11am",
to_char(sum(decode(to_char(first_time,'HH24'),'12',1,0)),'9999') "12pm",
to_char(sum(decode(to_char(first_time,'HH24'),'13',1,0)),'9999') "01pm",
to_char(sum(decode(to_char(first_time,'HH24'),'14',1,0)),'9999') "02pm",
to_char(sum(decode(to_char(first_time,'HH24'),'15',1,0)),'9999') "03pm",
to_char(sum(decode(to_char(first_time,'HH24'),'16',1,0)),'9999') "04pm",
to_char(sum(decode(to_char(first_time,'HH24'),'17',1,0)),'9999') "05pm",
to_char(sum(decode(to_char(first_time,'HH24'),'18',1,0)),'9999') "06pm",
to_char(sum(decode(to_char(first_time,'HH24'),'19',1,0)),'9999') "07pm",
to_char(sum(decode(to_char(first_time,'HH24'),'20',1,0)),'9999') "08pm",
to_char(sum(decode(to_char(first_time,'HH24'),'21',1,0)),'9999') "09pm",
to_char(sum(decode(to_char(first_time,'HH24'),'22',1,0)),'9999') "10pm",
to_char(sum(decode(to_char(first_time,'HH24'),'23',1,0)),'9999') "11pm"
from v$log_history
group by trunc(first_time)
order by trunc(first_time) desc
;

Check the archive_lag_target parameter as this will have an effect on your log switching. By default it is set to 0, which means it will not influence log switching; if setting the parameter it is recommended to set it to 1800 (seconds), which is 30 minutes.
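
If you do decide to set it, this is just a normal dynamic parameter change; remember to set it on the standby as well so it is in place after a role transition

set archive_lag_target

alter system set archive_lag_target=1800 scope=both;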

Eventually you will get gaps in the archive log sequences, perhaps due to network issues; remember these occur when an archive log is generated on the primary database but not received at the standby site. Archive log gaps can be monitored by examining the low and high sequence numbers in the v$archive_gap view, and you can also use the DG Menu (which we discuss later). Also see my troubleshooting section regarding manually resolving gaps.

archive gap logs

## Use the thread# when using RAC to detect missing sequences

select thread#, low_sequence#, high_sequence# from v$archive_gap;
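
Once you have the low and high sequence numbers, you can locate the missing files on the primary with something like the below (the substitution variables are placeholders for the thread and the gap range reported above); the file names it returns are the ones you would copy across and register on the standby

locate the missing archive logs on the primary

select name
  from v$archived_log
 where thread# = &thrd
   and sequence# between &low_seq and &high_seq;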

You can identify delays in the redo transport by comparing the highest sequence numbers on the primary and the physical standby databases; you can also use the dg_archivelog_monitor.sh script to monitor the archive logs.

delays in redo transport

select max(sequence#), thread# from v$archived_log group by thread#;

## you can use the dg_archivelog_monitor.sh script, which accepts three parameters: the primary,
## the standby and the archive log threshold (# of archive logs)

dg_archivelog_monitor.sh <primary> <standby> <threshold>

Identify the missing logs on the primary

## On the primary run the below
select L.thread#, L.sequence#
  from (select thread#, sequence# from v$archived_log where dest_id=1) L
 where L.sequence# not in
       (select sequence# from v$archived_log where dest_id=2 and thread# = L.thread#);
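
On the standby itself you can also check the highest log sequence that has actually been applied, as opposed to just received, using the applied column

highest applied log on the standby

select thread#, max(sequence#) "Last Applied"
  from v$archived_log
 where applied = 'YES'
 group by thread#;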

If the archive log destinations are in ASM you can monitor the free space in ASM by using the dgmon_mon.ksh script

ASM free space

## The script uses three parameters: the alert notification threshold, the database name and the diskgroup

dgmon_mon.ksh <alert notification> <databasename> <diskgroup>
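
If you just want a quick manual look at the same information, the underlying view is v$asm_diskgroup (I am assuming this is what the script queries)

ASM diskgroup free space

select name, total_mb, free_mb,
       round(free_mb / total_mb * 100, 1) "Pct Free"
  from v$asm_diskgroup;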

You can monitor the average apply rate and the active apply rate by using the below

apply rate and active apply rate monitoring

select to_char(start_time, 'DD-MON-RR HH24:MI:SS') start_time, item , sofar
  from v$recovery_progress
  where item in ('Active Apply Rate', 'Average Apply Rate', 'Redo Applied')
;

Note: Redo Applied is measured in megabytes, while the Average Apply Rate and the Active Apply Rate are measured in kilobytes per second.

You can find out how far behind the redo transport and redo apply processes are. The transport lag indicates how far in time the shipment of redo from the primary to the standby database is behind; remember this is what you would lose if the primary were to go down. The apply lag indicates how far in time the apply is behind; this can be affected by the DELAY attribute of the log_archive_dest_n parameter, as you may want to apply with a delay (30 minutes, say). You can also use the dg_time_lag.ksh script to find out how far you are behind, with the added bonus of seeing an actual time.

transport and apply lag

col name for a13
col value for a13
col unit for a30
set lines 132

select name, value, unit, time_computed from v$dataguard_stats where name in ('transport lag', 'apply lag');


## use the dg_time_lag.ksh script
dg_time_lag.ksh
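
The value column comes back as a text interval; assuming it is in the usual '+DD HH:MI:SS' format, you can convert it into seconds for threshold checks with something like the below

lag in seconds

select name,
       extract(day    from to_dsinterval(value)) * 86400
     + extract(hour   from to_dsinterval(value)) * 3600
     + extract(minute from to_dsinterval(value)) * 60
     + extract(second from to_dsinterval(value)) "Lag Secs"
  from v$dataguard_stats
 where name in ('transport lag', 'apply lag');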

You can review the progress of a standby database in managed recovery mode and see the status of the redo apply and redo transport services

viewing the status of the managed recovery process

col client_pid for a10

select pid, process, status, client_process, client_pid, thread#, sequence#, block#, blocks from v$managed_standby;
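
Closely related is the Data Guard status view, which the DG Menu below checks in option 7; a minimal version of that check for errors and fatal messages is

check v$dataguard_status for errors

select timestamp, facility, severity, message
  from v$dataguard_status
 where severity in ('Error', 'Fatal')
 order by timestamp;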

Data Guard Utility Menu

The Data Guard Utility Menu (DG Menu) is a very useful tool for viewing reports on your current system; you can also see what the scripts are doing in the background and which tables are being used to look up this information. You can even add to or change the menu as long as you know how to script a little, and you can use the scripts in cron to report and email you on a regular basis.

I am not going to go into great detail here as the menu and options are all self-explanatory; the one thing you need to do after downloading the tool is to set up the configuration file called dg.conf. Here is a copy of my configuration to give you an idea of how to configure it

dg.conf

Note: the tool only uses SQL*Plus; it does not use the tnsnames.ora file

I get this tool working on all my database servers, as you never know which one may fail; at least it's ready to go if a failover occurs.

The tool has two menus: one I call the main menu and the other the standby menu, which is invoked from main menu option 20.

DG Menu Utility

DG Menu
========================================
# ------------------------------------------------------------------------- #
# #
# Data Guard Check List - primarydg01
# #
# 0. Review database information and status #
# 1. Check for password file #
# 2. Check for forced logging and unrecoverable activities #
# ------------------------------------------------------------------------- #
# 3. Check for archive log mode #
# 4. Check for standby redo logs #
# 5. Check current SCN on primary and standby databases #
# 6. Check archive log destinations #
# ------------------------------------------------------------------------- #
# 7. Check Data Guard Status View for errors and fatal messages #
# 8. Check Managed Recovery Process Status #
# 9. Check for missing archive logs #
# 10. Check archive log gaps on the standby database #
# 11. Check average apply rate / active apply rate #
# 12. Check transport / apply lag #
# 13. How far behind is my Data Guard in terms of time? #
# #
# ------------------------------------------------------------------------- #
# 20. Launch the Logical Standby Data Guard Submenu #
# ------------------------------------------------------------------------- #
# 21. Generate init.ora entries for primary database #
# 22. Generate init.ora entries for standby database #
# 23. Generate tnsnames.ora entries for primary and standby databases #
# 24. Generate SQL syntax to create standby redo logs #
# #
# ------------------------------------------------------------------------- #
# 30. Generate syntax to duplicate standby database from active database #
# #
# x. Exit #
# ------------------------------------------------------------------------- #
# Enter Task Number:


# ------------------------------------------------------------------------- #
# #
# Logical Standby Data Guard Check List - primarydg01
# #
# 1. Check Logical Progress - View Overall Progress Of SQL Apply #
# 2. Check Logical Events - History on Logical Standby Apply Activity #
# 3. Check Logical Events - Detailed View #
# 4. Check Logical Stats - Logical Standby Stats #
# 5. Check Logical Parameters - Logical Standby Parameters #
# 6. Look At What The Logical Standby Processes Are Doing #
# Coordinator, Reader, Builder, Preparer, Analyzer, Applier ... #
# 7. Look At The Status Codes For The Logical Standby Processes #
# 8. Look At Events The Applier Process Is Stuck On #
# ------------------------------------------------------------------------- #
# 10. Check the LCR - Look At Bytes Paged Out #
# 11. Generate Syntax To Skip Transactions #
# Based On MAX(EVENT_TIME) FROM DBA_LOGSTDBY_EVENTS #
# DO NOT SKIP DML STATEMENTS #
# 12. Diagnostic Script Per Metalink Note ID: 241512.1 #
# Look for output in logical_diag_[ORACLE_SID_MONDD_HHMM.out] format #
# ------------------------------------------------------------------------- #
# 20. Review What Is NOT Supported In Your Logical Standby Database #
# 21. Review Tables That Do NOT have Unique Identifiers #
# 22. Check Primary Database For Supplemental Logging #
# #
# ------------------------------------------------------------------------- #
# 30. Start Logical Standby Database #
# 40. Stop Logical Standby Database - PLEASE BE CAREFUL !!!!! #
# THIS WILL STOP THE LOGICAL STANDBY APPLY PROCESS #
# ------------------------------------------------------------------------- #
# #
# x. Exit #
# ------------------------------------------------------------------------- #
# Enter Task Number:

As you can see there are lots of options which report on all sorts of things, so have a play around, and don't forget to have a look at the scripts to get a better understanding of which tables are being accessed to supply the information.

Conclusion

There are many things that you can do to monitor your Data Guard environment; some people like menus, others like automatic cron jobs running checks and emailing. Myself, I have used the Data Guard Menu utility scripts and modified them to work in a monitoring tool called "Big Brother"; I edit the scripts to alert when a condition or a threshold has been met. I don't monitor too much, just enough to alert me when things are not working correctly. I have also built a complete test environment at home in my VMware setup, which gives me the practice I need when testing the DG environment and generally playing around with parameters, tuning, etc.