使用增量备份修复dataguard与日常备份策略的选择

浸心阁 2015-04-13

展开全文

一.前言

在我们生产环境中，dataguard通常用来提供容灾方案，因此有效地保护归档日志就显得尤为重要，否则一旦形成日志gap，且归档日志被误删除，进而导致DG会失去数据同步的能力，从这个角度来说DG似乎又显得十分脆弱。本文会简要地提到如何制定较好的备份策略，但是最重要的目的是来看看如何修复gap。或许有人想到了重建DG，特别是在11g可以在线重建，但是在真正的生产环境，重建的成本是很高的，是因为：

1. 生产环境通常数据库体积庞大，重建过程中需要传输全备文件到standby，即使使用了压缩，其体积可能仍然非常庞大。

2.大部分生产系统的DG是跨地域的，这意味着primary与standby端的带宽可能及其有限，通常正常传输速度达到1M/s已经算很不错。但是对于传输庞大的备份集，仍然是杯水车薪。还不算在传输期间primary产生的归档日志也是需要后续传输的。其时间可能远远长于仅传输全备的时间。

3.再说说11g的duplicate from active database,这个方法固然简单，但是此法在传输时是不能压缩的！在带宽有限，数据库体积庞大的情况下，基本上是不能在限定的时间内完成创建的。这个方法所需的时间预计会比手动压缩传输的方法长6倍以上！也是不可取的。

因此正确的修复gap的方式，是利用增量备份集，在standby端进行恢复。

二.备份策略因为增量备份的存在而更加灵活、实用

最佳的备份策略能在备份所需空间与恢复时间上达到平衡，具体根据实际业务情况而定，但是本质上就是利用全备+增量备份+日志备份的方式。

关于增量备份的官方详细内容，请参看另一篇博文 “Incremental Backups in Oracle 10.2.0.5” Translation & Notes

其中一个重要的概念就是：

  A level 1 incremental backup can be either of the following types:
    A differential backup, which backs up all blocks changed after the most recent incremental backup at level 1 or 0
    A cumulative backup, which backs up all blocks changed after the most recent incremental backup at level 0

但是我们必须知道，所谓增量，实际上是基于SCN判断的！

因此举例来说：differential backup适用于dailybackup，而cumulative backup适用于weeklybackup，而零级备份适用于monthly backup或者更长周期的全备。

因此我目前的情况，OLTP系统，日志产生比较频繁，比较倾向的一种的备份方式是：

一周以内：

Monday	differential incremental level 1	backup archivelog
Tuesday	differential incremental level 1	backup archivelog
Wednsday	differential incremental level 1	backup archivelog
Thusday	differential incremental level 1	backup archivelog
Friday	differential incremental level 1	backup archivelog
Saturday	differential incremental level 1	backup archivelog
Sumday	cumulative incremental level 1	backup archivelog

一个月以内：

每月的第一天做上个月的零级全备。

这样在任意时间需要恢复数据是，最多用到一个全备+一个周备+7个日备+一天之内的归档日志。

特别注意的是：如果生产库非常大，那么强烈建议使用block_change_tracking特性，能极大缩短增量备份时间，否则做一次增量备份的时间与做一次全备所需时间无异！

三.利用增量备份修复DG实例演示

Step1.场景模拟

OS: Linux5.8 x64 DB:Oracle 11.2.0.3 x64

主库：test3_A 备库test3_B 且正常运行中

test3_B:

SQL> archive log list
Database log mode           Archive Mode
Automatic archival           Enabled
Archive destination           /u01/archivelogs/test3
Oldest online log sequence     45
Next log sequence to archive   0
Current log sequence           59

SQL> alter database recover managed standby database cancel;

Database altered.

SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL>

test3_A:

SQL> archive log list
Database log mode           Archive Mode
Automatic archival           Enabled
Archive destination           /u01/archivelog/test3
Oldest online log sequence     57
Next log sequence to archive   59
Current log sequence           59
SQL> alter system switch logfile;

System altered.

SQL> /

System altered.

SQL> /

System altered.

SQL> archive log list
Database log mode           Archive Mode
Automatic archival           Enabled
Archive destination           /u01/archivelog/test3
Oldest online log sequence     60
Next log sequence to archive   62
Current log sequence           62

SQL> quit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
[oracle@redhat ~]$ rman target /

Recovery Manager: Release 11.2.0.3.0 - Production on Mon Jun 10 21:31:04 2013

Copyright (c) 1982, 2011, Oracle and/or its affiliates. All rights reserved.

connected to target database: TEST3 (DBID=915537947)

RMAN> list archivelog low sequence 60 high sequence 62;

using target database control file instead of recovery catalog
List of Archived Log Copies for database with db_unique_name TEST3_A
=====================================================================

Key     Thrd Seq     S Low Time
------- ---- ------- - ---------
468     1    60      A 10-JUN-13
        Name: /u01/archivelog/test3/1_60_800115694.dbf

469     1    61      A 10-JUN-13
        Name: /u01/archivelog/test3/1_61_800115694.dbf

RMAN> delete archivelog low sequence 60 high sequence 61;     (相比10g DG的改进是，没有被applied的归档，是不能仅仅被delete命令删除的，必须使用delete force删除)

released channel: ORA_DISK_1
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=55 device type=DISK
RMAN-08120: WARNING: archived log not deleted, not yet applied by standby
archived log file name=/u01/archivelog/test3/1_60_800115694.dbf thread=1 sequence=60
RMAN-08120: WARNING: archived log not deleted, not yet applied by standby
archived log file name=/u01/archivelog/test3/1_61_800115694.dbf thread=1 sequence=61

RMAN> delete force archivelog low sequence 60 high sequence 61;

released channel: ORA_DISK_1
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=55 device type=DISK
List of Archived Log Copies for database with db_unique_name TEST3_A
=====================================================================

Key     Thrd Seq     S Low Time
------- ---- ------- - ---------
468     1    60      A 10-JUN-13
        Name: /u01/archivelog/test3/1_60_800115694.dbf

469     1    61      A 10-JUN-13
        Name: /u01/archivelog/test3/1_61_800115694.dbf

Do you really want to delete the above objects (enter YES or NO)? yes
deleted archived log
archived log file name=/u01/archivelog/test3/1_60_800115694.dbf RECID=468 STAMP=817766882
deleted archived log
archived log file name=/u01/archivelog/test3/1_61_800115694.dbf RECID=469 STAMP=817766885
Deleted 2 objects

RMAN> quit

Recovery Manager complete.
[oracle@redhat ~]$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.3.0 Production on Mon Jun 10 21:48:31 2013

Copyright (c) 1982, 2011, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> alter system switch logfile;

System altered.

SQL> /

System altered.

test3_B:

SQL> startup
ORACLE instance started.

Total System Global Area 622149632 bytes
Fixed Size            2230912 bytes
Variable Size          180356480 bytes
Database Buffers      436207616 bytes
Redo Buffers            3354624 bytes
Database mounted.
Database opened.
SQL> alter database recover managed standby database using current logfile disconnect from session;

Database altered.

此时警告日志文件中出现：

Mon Jun 10 21:48:27 2013
FAL[client]: Failed to request gap sequence
GAP - thread 1 sequence 60-61
DBID 915537947 branch 800115694
FAL[client]: All defined FAL servers have been attempted.
------------------------------------------------------------
Check that the CONTROL_FILE_RECORD_KEEP_TIME initialization
parameter is defined to a value that's sufficiently large
enough to maintain adequate log switch information to resolve
archivelog gaps.
------------------------------------------------------------
Mon Jun 10 21:48:40 2013
RFS[2]: Selected log 5 for thread 1 sequence 64 dbid 915537947 branch 800115694
Mon Jun 10 21:48:40 2013
Archived Log entry 76 added for thread 1 sequence 63 ID 0x37014efb dest 1:
RFS[2]: Selected log 4 for thread 1 sequence 65 dbid 915537947 branch 800115694
Mon Jun 10 21:48:44 2013
Archived Log entry 77 added for thread 1 sequence 64 ID 0x37014efb dest 1:

然后查看一下日志gap的范围，发现正好是在primary强行删掉的两个归档日志：

SQL> select * from v$archive_gap;

THREAD# LOW_SEQUENCE# HIGH_SEQUENCE#
---------- ------------- --------------
1 60 61

Step2.查看GAP的起始SCN信息

归档日志的gap，具体来说就是从某个scn开始，备库没有可用的归档日志进行apply。但是尽管如此，后来的归档日志仍然会自动传输到standby端，但是却不能被应用，也就不能被rman的备份策略自动删除（force除外，但是我相信生产系统几乎不会用到它），进而导致空间占满问题；相应的primary端的日志同样不能被正常删除，也会遇到空间问题，然后可能hang住。这对于磁盘空间本不富裕，且日志产生频繁的系统就是一颗定时炸弹！

那么先来查看一下standby （test3_B）端截至目前的scn：

SQL> select current_scn from v$database;

CURRENT_SCN
-----------
3221814

那么意味着，我们必须至少将standby的scn继续向后推动到primary在gap之后的最早的一个日志（可能是在线日志，也可能是归档日志）的first_change#，才能继续应用日志！因此接下来需要在primary端做一次基于SCN起始的增量备份

Step3.定制增量备份集，并传输到standby端

在test3_A做基于SCN的增量备份，该备份包含了从3221814起，截至数据文件当前时间点，所有变化的数据块。如果此时primary没有全备也没关系。但是如果有block_change_tracking的话，能显著缩短这一过程的时间！

注意下面会同时生成一份最新的standby controlfile的备份集。这是一个自定义跨度的增量备份，因此differential 或者cumulative或者level都不需要指定了！

test3_A:

RMAN>run

{
ALLOCATE CHANNEL C1 DEVICE TYPE DISK format '/u01/rman/test3/db_%u_%T.bk';
backup as compressed backupset incremental from scn 3221814 database include current controlfile for standby;
RELEASE CHANNEL C1;

}

using target database control file instead of recovery catalog
allocated channel: C1
channel C1: SID=52 device type=DISK

Starting backup at 10-JUN-13

backup will be obsolete on date 17-JUN-13
archived logs will not be kept or backed up
channel C1: starting compressed full datafile backup set
channel C1: specifying datafile(s) in backup set
input datafile file number=00001 name=/u01/oradata/test3/system01.dbf
input datafile file number=00003 name=/u01/oradata/test3/sysaux01.dbf
input datafile file number=00002 name=/u01/oradata/test3/undotbs01.dbf
input datafile file number=00004 name=/u01/oradata/test3/users01.dbf
input datafile file number=00005 name=/u01/oradata/test3/example01.dbf
channel C1: starting piece 1 at 10-JUN-13
channel C1: finished piece 1 at 10-JUN-13
piece handle=/u01/rman/test3/db_25obsb96_20130610.bk tag=TAG20130610T221605 comment=NONE
channel C1: backup set complete, elapsed time: 00:00:15
channel C1: starting compressed full datafile backup set
channel C1: specifying datafile(s) in backup set
including standby control file in backup set
channel C1: starting piece 1 at 10-JUN-13
channel C1: finished piece 1 at 10-JUN-13
piece handle=/u01/rman/test3/db_26obsb9l_20130610.bk tag=TAG20130610T221605 comment=NONE
channel C1: backup set complete, elapsed time: 00:00:01

backup will be obsolete on date 17-JUN-13
archived logs will not be kept or backed up
channel C1: starting compressed full datafile backup set
channel C1: specifying datafile(s) in backup set
including current control file in backup set
channel C1: starting piece 1 at 10-JUN-13
channel C1: finished piece 1 at 10-JUN-13
piece handle=/u01/rman/test3/db_27obsb9p_20130610.bk tag=TAG20130610T221605 comment=NONE
channel C1: backup set complete, elapsed time: 00:00:01
Finished backup at 10-JUN-13

released channel: C1

RMAN> quit

Recovery Manager complete.

注：或者加上tag能够方便定位具体备份集

run
{
ALLOCATE CHANNEL C1 DEVICE TYPE DISK;
backup as compressed backupset incremental from scn 3221814 database format '/u01/rman/test3/db_%u_%T.bk' tag 'incr_datafile_gap';
backup current controlfile for standby FORMAT '/u01/rman/test3/standby_ctl_%u_%T.bk' tag 'standby_controlfile_gap';
RELEASE CHANNEL C1;
}

[oracle@redhat ~]$ cd /u01/rman/test3/
[oracle@redhat test3]$ ls -l
total 34564
-rw-r----- 1 oracle dba 11600384 Jun 10 22:17 block_tracking.log
-rw-r----- 1 oracle dba 10420224 Nov 23 2012 c-915537947-20121123-00
-rw-r----- 1 oracle dba 11051008 Jun 10 22:16 db_25obsb96_20130610.bk
-rw-r----- 1 oracle dba 1130496 Jun 10 22:16 db_26obsb9l_20130610.bk
-rw-r----- 1 oracle dba 1130496 Jun 10 22:16 db_27obsb9p_20130610.bk
[oracle@redhat test3]$ scp *20130610.bk oracle@10.0.0.21:/u01/rman/test3/
oracle@10.0.0.21's password:
db_25obsb96_20130610.bk                                                                                                                                        100%   11MB 10.5MB/s   00:01
db_26obsb9l_20130610.bk                                                                                                                                        100% 1104KB   1.1MB/s   00:00
db_27obsb9p_20130610.bk                                                                                                                                        100% 1104KB   1.1MB/s   00:00
[oracle@redhat test3]$

最后确定主库在此scn后是否添加过数据文件，如果有则需要在standby端也手动添加：

SQL> select FILE#,name from v$datafile where CREATION_CHANGE#> =3221814;

no rows selected

Step4.停止备库日志应用，在stanby端使用增量备份recover数据文件，并恢复standby controlfile以越过gap

SQL> alter database recover managed standby database cancel;

Database altered.

RMAN> shutdown immediate

using target database control file instead of recovery catalog
database closed
database dismounted
Oracle instance shut down

RMAN> startup mount

connected to target database (not started)
Oracle instance started
database mounted

Total System Global Area     622149632 bytes

Fixed Size                     2230912 bytes
Variable Size                180356480 bytes
Database Buffers             436207616 bytes
Redo Buffers                   3354624 bytes

RMAN> catalog start with '/u01/rman/test3';

searching for all files that match the pattern /u01/rman/test3

List of Files Unknown to the Database
=====================================
File Name: /u01/rman/test3/db_25obsb96_20130610.bk
File Name: /u01/rman/test3/db_27obsb9p_20130610.bk
File Name: /u01/rman/test3/db_36nm49ec_20120925.bk
File Name: /u01/rman/test3/c-915537947-20120925-00
File Name: /u01/rman/test3/c-915537947-20121030-01
File Name: /u01/rman/test3/c-915537947-20121106-00
File Name: /u01/rman/test3/db_37nm49go_20120925.bk
File Name: /u01/rman/test3/c-915537947-20121030-02
File Name: /u01/rman/test3/db_26obsb9l_20130610.bk
File Name: /u01/rman/test3/arc_35nm49e8_20120925.bk
File Name: /u01/rman/test3/arc_38nm49gr_20120925.bk

Do you really want to catalog the above files (enter YES or NO)? yes
cataloging files...
cataloging done

List of Cataloged Files
=======================
File Name: /u01/rman/test3/db_25obsb96_20130610.bk
File Name: /u01/rman/test3/db_27obsb9p_20130610.bk
File Name: /u01/rman/test3/db_36nm49ec_20120925.bk
File Name: /u01/rman/test3/c-915537947-20120925-00
File Name: /u01/rman/test3/c-915537947-20121030-01
File Name: /u01/rman/test3/c-915537947-20121106-00
File Name: /u01/rman/test3/db_37nm49go_20120925.bk
File Name: /u01/rman/test3/c-915537947-20121030-02
File Name: /u01/rman/test3/db_26obsb9l_20130610.bk
File Name: /u01/rman/test3/arc_35nm49e8_20120925.bk
File Name: /u01/rman/test3/arc_38nm49gr_20120925.bk

RMAN> recover database noredo;

Starting recover at 10-JUN-13
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=26 device type=DISK
channel ORA_DISK_1: starting incremental datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
destination for restore of datafile 00001: /u01/oradata/test3/system01.dbf
destination for restore of datafile 00002: /u01/oradata/test3/undotbs01.dbf
destination for restore of datafile 00003: /u01/oradata/test3/sysaux01.dbf
destination for restore of datafile 00004: /u01/oradata/test3/users01.dbf
destination for restore of datafile 00005: /u01/oradata/test3/example01.dbf
channel ORA_DISK_1: reading from backup piece /u01/rman/test3/db_25obsb96_20130610.bk
channel ORA_DISK_1: piece handle=/u01/rman/test3/db_25obsb96_20130610.bk tag=TAG20130610T221605
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:23

Finished recover at 10-JUN-13

RMAN> quit

Recovery Manager complete.

test3_B:
[oracle@redhat1 ~]$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.3.0 Production on Mon Jun 10 22:38:51 2013

Copyright (c) 1982, 2011, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> select current_scn from v$database;

CURRENT_SCN
-----------
3221814

test3_A:

SQL> select thread#,sequence#,first_change#,next_change# from v$archived_log where first_change#<3221814 and next_change#>3221814；

THREAD# SEQUENCE# FIRST_CHANGE# NEXT_CHANGE#
---------- ---------- ------------- ------------
1 59 3221114 3221815
1 59 3221114 3221815

test3_B:

RMAN> restore standby controlfile from '/u01/rman/test3/db_27obsb9p_20130610.bk'; （一定要恢复standby控制文件，否则standby DB不能跳过原有gap的信息）

Starting restore at 10-JUN-13
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=19 device type=DISK

channel ORA_DISK_1: restoring control file
channel ORA_DISK_1: restore complete, elapsed time: 00:00:07
output file name=/u02/test3/control01.ctl
output file name=/u03/test3/control02.ctl
Finished restore at 10-JUN-13

RMAN> quit

Recovery Manager complete.
[oracle@redhat1 ~]$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.3.0 Production on Mon Jun 10 23:13:10 2013

Copyright (c) 1982, 2011, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> alter database mount;

Database altered.

SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-10458: standby database requires recovery
ORA-01152: file 1 was not restored from a sufficiently old backup
ORA-01110: data file 1: '/u01/oradata/test3/system01.dbf'

SQL> recover standby database using backup controlfile;
ORA-00279: change 3230822 generated at 06/10/2013 22:16:06 needed for thread 1
ORA-00289: suggestion : /u01/archivelogs/test3/1_66_800115694.dbf
ORA-00280: change 3230822 for thread 1 is in sequence #66

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
auto
ORA-00279: change 3232949 generated at 06/10/2013 22:41:24 needed for thread 1
ORA-00289: suggestion : /u01/archivelogs/test3/1_67_800115694.dbf
ORA-00280: change 3232949 for thread 1 is in sequence #67
ORA-00278: log file '/u01/archivelogs/test3/1_66_800115694.dbf' no longer
needed for this recovery

ORA-00308: cannot open archived log '/u01/archivelogs/test3/1_67_800115694.dbf'
ORA-27037: unable to obtain file status
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3

SQL> alter database open;

Database altered.

SQL> alter database recover managed standby database using current logfile disconnect from session;

Database altered.

此时standby的警告日志信息，发现DG已经可以越过gap，开始从67号日志开始apply：

Mon Jun 10 23:17:22 2013
Primary database is in MAXIMUM PERFORMANCE mode
Changing standby controlfile to MAXIMUM PERFORMANCE mode
RFS[1]: Assigned to RFS process 26846
RFS[1]: Selected log 4 for thread 1 sequence 68 dbid 915537947 branch 800115694
Mon Jun 10 23:17:25 2013
RFS[2]: Assigned to RFS process 26850
RFS[2]: Selected log 5 for thread 1 sequence 67 dbid 915537947 branch 800115694
Mon Jun 10 23:17:28 2013
Archived Log entry 1 added for thread 1 sequence 67 ID 0x37014efb dest 1:
Mon Jun 10 23:18:59 2013
alter database recover managed standby database using current logfile disconnect from session
Attempt to start background Managed Standby Recovery process (test3)
Mon Jun 10 23:18:59 2013
MRP0 started with pid=27, OS id=26857
MRP0: Background Managed Standby Recovery process started (test3)
Serial Media Recovery started
Managed Standby Recovery starting Real Time Apply
Waiting for all non-current ORLs to be archived...
All non-current ORLs have been archived.
Media Recovery Log /u01/archivelogs/test3/1_67_800115694.dbf
Completed: alter database recover managed standby database using current logfile disconnect from session
Media Recovery Waiting for thread 1 sequence 68 (in transit)
Recovery of Online Redo Log: Thread 1 Group 4 Seq 68 Reading mem 0
Mem# 0: /u02/test3/standby_redo04a.log
Mem# 1: /u03/test3/standby_redo04b.log