
张乐奕 云和恩墨副总经理 Oracle ACE 总监 ITPUB Oracle数据库管理版版主、Oracle高可用版版主、ACOUG联合创始人
今天收到一个发过来请求帮助的 case,Oracle 数据库无法启动,请求帮助恢复。仔细阅读了发过来的告警日志,这是一个典型的“事情越弄越糟”的案例。
作为一个专业的DBA,在遇到问题时,一定要思考:如何保护现场,不让事情变得更糟。这是基本要求,保护现场以使得其他人接手工作时,可以从原有状态开始。
以下就来根据告警日志,一条一条地回顾这位 DBA 是如何将数据库弄到完全启动不了的。
故障最开始是从1月11日的凌晨 3:30 开始出现,数据库在归档的时候,意外发现某个控制文件的头块全部被清零了,这可能是存储本身的问题,并非人为。 Fri Jan 11 03:30:24 2013 Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc: ORA-00227: corrupt block detected in control file: (block 1, # blocks 1) ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl' Master background archival failure: 227 Fri Jan 11 03:31:24 2013 Hex dump of (file 0, block 1) in trace file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc Corrupt block relative dba: 0x00000001 (file 0, block 1) Completely zero block found during control file header read Fri Jan 11 03:31:24 2013 Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc: ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl' Fri Jan 11 03:31:24 2013 Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc: ORA-00227: corrupt block detected in control file: (block 1, # blocks 1) ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl' Fri Jan 11 03:30:24 2013 Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc: ORA-00227: corrupt block detected in control file: (block 1, # blocks 1) ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl' Fri Jan 11 03:30:24 2013 Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc: ORA-00227: corrupt block detected in control file: (block 1, # blocks 1) ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl' Fri Jan 11 03:30:24 2013 Errors in file /oracle/admin/dpdata/bdump/dpdata_arc1_3031.trc: ORA-00227: corrupt block detected in control file: (block 1, # blocks 1) ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl'
接下来,数据库痛苦地挣扎了半小时,期间不停地报相同的 ORA-00227 错误。一直到凌晨 4:01,终于 CKPT 进程也发现无法更新控制文件头部,于是强势地将数据库直接关闭了。 Fri Jan 11 04:01:25 2013 Hex dump of (file 0, block 1) in trace file /oracle/admin/dpdata/bdump/dpdata_ckpt_3007.trc Corrupt block relative dba: 0x00000001 (file 0, block 1) Completely zero block found during control file header read Fri Jan 11 04:01:25 2013 Errors in file /oracle/admin/dpdata/bdump/dpdata_ckpt_3007.trc: ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl' Fri Jan 11 04:01:25 2013 Errors in file /oracle/admin/dpdata/bdump/dpdata_ckpt_3007.trc: ORA-00227: corrupt block detected in control file: (block 1, # blocks 1) ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl' CKPT: terminating instance due to error 227 Fri Jan 11 04:01:25 2013 Errors in file /oracle/admin/dpdata/bdump/dpdata_pmon_2997.trc: ORA-00227: corrupt block detected in control file: (block , # blocks ) Fri Jan 11 04:01:26 2013 Errors in file /oracle/admin/dpdata/bdump/dpdata_psp0_2999.trc: ORA-00227: corrupt block detected in control file: (block , # blocks ) Instance terminated by CKPT, pid = 3007
接下来的5个小时,数据库静静地躺在机房里,没有人知道这个数据库已经挂掉了,一直到上午 DBA 来上班。他发现数据库无法访问,于是尝试重新启动数据库。 Fri Jan 11 09:15:51 2013 Starting ORACLE instance (normal) LICENSE_MAX_SESSION = 0 LICENSE_SESSIONS_WARNING = 0 Picked latch-free SCN scheme 3 Autotune of undo retention is turned on.
自然数据库无法正常启动,连 mount 状态都无法进入,因为某个控制文件头部已经损坏了。告警日志的信息明确地说明了无法读取 control03.ctl 文件的头块,因此在尝试 mount 数据库的时候报了 ORA-00205 错误。 Fri Jan 11 09:15:56 2013 ALTER DATABASE MOUNT Fri Jan 11 09:15:56 2013 ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl' ORA-27047: unable to read the header block of file Additional information: 2 Fri Jan 11 09:15:59 2013 ORA-205 signalled during: ALTER DATABASE MOUNT... Fri Jan 11 09:19:31 2013 Starting ORACLE instance (normal) Fri Jan 11 09:19:43 2013 alter database mount Fri Jan 11 09:19:43 2013 ORA-00202: control file: '/oracle/oradata/dpdata/control03.ctl' ORA-27047: unable to read the header block of file Additional information: 2 Fri Jan 11 09:19:43 2013 ORA-205 signalled during: alter database mount
接下来,这位 DBA 开始反复地关闭实例,又启动实例。这样的操作一直持续了1个小时,一直到上午的 10:28,可以想象这是多么纠结的一个小时。 忠告: 除非十分特殊的恢复案例,否则反复起停数据库实例是于事无补的。 Shutting down instance: further logons disabled Fri Jan 11 09:23:47 2013 Stopping background process CJQ0 …… Fri Jan 11 09:38:02 2013 Starting ORACLE instance (normal) …… Fri Jan 11 09:43:00 2013 Shutting down instance: further logons disabled …… Fri Jan 11 09:43:58 2013 Starting ORACLE instance (normal) …… Fri Jan 11 09:55:34 2013 ALTER DATABASE CLOSE NORMAL …… Fri Jan 11 09:56:55 2013 Starting ORACLE instance (normal) …… Fri Jan 11 10:28:10 2013 ALTER DATABASE CLOSE NORMAL
接下来 10:29 的再次重新启动数据库实例之前,DBA 终于意识到可能是控制文件出现了问题,因此修改了初始化参数,将报错的 control03.ctl 文件从初始化参数 control_files 中去掉了。这次数据库得以正常启动。 Fri Jan 11 10:29:20 2013 Starting ORACLE instance (normal) …… control_files = /data1/oradata/dpdata/control01.ctl, /data1/oradata/dpdata/control02.ctl …… Fri Jan 11 10:29:37 2013 Completed: ALTER DATABASE OPEN
而DBA也迅速地作了一次备份控制文件的操作,但是正是这个操作引导到了后面噩梦一般的结果。 Fri Jan 11 10:36:14 2013 alter database backup controlfile to trace Fri Jan 11 10:36:14 2013 Completed: alter database backup controlfile to trace
数据库又平稳地运行了一个上午,这种宁静到下午14:16的时候被打破,数据库开始报 ORA-600 错误,并且在 CKPT 进程作检查点的时候,报数据文件10和31的文件头部无法被正确读取。如果不是更深层次的原因,那么这可能仍然是跟凌晨时候控制文件意外损坏时候的故障一样,也许是存储子系统本身的问题。 Fri Jan 11 14:16:07 2013 Errors in file /oracle/admin/dpdata/udump/dpdata_ora_22240.trc: ORA-00600: internal error code, arguments: [6002], [0], [0], [3], [0], [], [], [] Fri Jan 11 14:19:44 2013 Errors in file /oracle/admin/dpdata/bdump/dpdata_ckpt_9579.trc: ORA-01171: datafile 10 going offline due to error advancing checkpoint ORA-01122: database file 10 failed verification check ORA-01110: data file 10: '/data2/DECTR_HIS2.dbf' ORA-01251: Unknown File Header Version read for file number 10 Fri Jan 11 14:19:59 2013 Errors in file /oracle/admin/dpdata/bdump/dpdata_ckpt_9579.trc: ORA-01171: datafile 31 going offline due to error advancing checkpoint ORA-01122: database file 31 failed verification check ORA-01110: data file 31: '/data3/ts2_dpcis.dbf' ORA-01251: Unknown File Header Version read for file number 31
紧接着,应用系统的某个 JOB 也由于数据文件无法访问,而开始报错。 Fri Jan 11 14:30:19 2013 Errors in file /oracle/admin/dpdata/bdump/dpdata_j001_12993.trc: ORA-12012: error on auto execute of job 88 ORA-00376: file 10 cannot be read at this time ORA-01110: data file 10: '/data2/DECTR_HIS2.dbf' ORA-06512: at 'DECTR.P_MOVE_CONTS_SHIP', line 77 ORA-06512: at line 1
相同的报错一直持续了四十多分钟,之后 DBA 又再次介入了。但是 DBA 很奇怪地断然执行了 offline 这两个数据文件的操作,并在2分多钟之后,又尝试将两个数据文件再次 online 。而由于文件损坏,自然在 online 的时候遇到了 ORA-1122 错误,而无法成功 online 。
Fri Jan 11 15:16:23 2013 alter database datafile '/data3/ts2_dpcis.dbf' offline Fri Jan 11 15:16:23 2013 Completed: alter database datafile '/data3/ts2_dpcis.dbf' offline Fri Jan 11 15:17:05 2013 alter database datafile '/data2/DECTR_HIS2.dbf' offline Fri Jan 11 15:17:05 2013 Completed: alter database datafile '/data2/DECTR_HIS2.dbf' offline Fri Jan 11 15:19:41 2013 alter database datafile '/data3/ts2_dpcis.dbf' online Fri Jan 11 15:19:41 2013 ORA-1122 signalled during: alter database datafile '/data3/ts2_dpcis.dbf' online... Fri Jan 11 15:21:10 2013 alter database datafile '/data2/DECTR_HIS2.dbf' online Fri Jan 11 15:21:10 2013 ORA-1122 signalled during: alter database datafile '/data2/DECTR_HIS2.dbf' online...
这才仅仅是噩梦的开始,接下来的一切属于危险动作,请勿轻易模仿。 遇到 ORA-1122 错误以后,DBA 考虑了 9 秒钟,再次断然地关闭了数据库,并随之又重新启动。由于仅仅是用户表空间数据文件损坏,并且之前也已经被 offline 了,因此数据库实例毫无障碍地 OPEN 成功。 Fri Jan 11 15:21:19 2013 Shutting down instance: further logons disabled Fri Jan 11 15:21:19 2013 Stopping background process QMNC Fri Jan 11 15:21:19 2013 Stopping background process CJQ0 Fri Jan 11 15:21:21 2013 Stopping background process MMNL Fri Jan 11 15:21:22 2013 Stopping background process MMON Fri Jan 11 15:21:23 2013 Shutting down instance (immediate) …… Fri Jan 11 15:22:59 2013 Starting ORACLE instance (normal) …… Fri Jan 11 15:23:13 2013 Completed: ALTER DATABASE OPEN
DBA再次尝试online数据文件的操作,同样的ORA-1122错误。 Fri Jan 11 15:23:31 2013 alter database datafile '/data3/ts2_dpcis.dbf' online Fri Jan 11 15:23:31 2013 ORA-1122 signalled during: alter database datafile '/data3/ts2_dpcis.dbf' online...
考虑了2分半钟之后,DBA 也许是想起上午的时候做过控制文件的备份,因此决定进行数据库恢复。重启数据库到 nomount 状态,并开始进行RECOVER DATABASE USING BACKUP CONTROLFILE,ORA-1507 错误的意思是告知如果要使用备份的控制文件进行数据库恢复,那么应该要先使用备份的控制文件将数据库启动到 mount 状态。 Fri Jan 11 15:25:05 2013 Shutting down instance: further logons disabled Fri Jan 11 15:25:05 2013 Stopping background process QMNC Fri Jan 11 15:25:05 2013 Stopping background process CJQ0 Fri Jan 11 15:25:07 2013 Stopping background process MMNL Fri Jan 11 15:25:08 2013 Stopping background process MMON Fri Jan 11 15:25:09 2013 Shutting down instance (immediate) …… Fri Jan 11 15:26:32 2013 Starting ORACLE instance (normal) …… Fri Jan 11 15:26:46 2013 ALTER DATABASE RECOVER database using backup controlfile until cancel Fri Jan 11 15:26:46 2013 ORA-1507 signalled during: ALTER DATABASE RECOVER database using backup controlfile until cancel ...
DBA 于是将数据库启动到 mount 状态,继续进行数据库恢复。这其中的几个ORA 错误都是正常的,ORA-279 提示需要一个归档文件来完成恢复,ORA-308提示打不开 1_87749_604491553.dbf 归档文件,根据前面的告警日志,可以知道实际上 87749 这个重做日志是当前日志,还没有归档,自然找不到。ORA-1547错误表示恢复已经完成,但是 OPEN RESETLOGS 的时候仍然要遇到错误。
Fri Jan 11 15:26:56 2013 alter database mount Fri Jan 11 15:27:00 2013 Setting recovery target incarnation to 2 Fri Jan 11 15:27:00 2013 Successful mount of redo thread 1, with mount id 560899584 Fri Jan 11 15:27:00 2013 Database mounted in Exclusive Mode Completed: alter database mount Fri Jan 11 15:27:10 2013 ALTER DATABASE RECOVER database using backup controlfile until cancel Media Recovery Start parallel recovery started with 3 processes ORA-279 signalled during: ALTER DATABASE RECOVER database using backup controlfile until cancel ... Fri Jan 11 15:27:28 2013 ALTER DATABASE RECOVER CONTINUE DEFAULT Fri Jan 11 15:27:28 2013 Media Recovery Log /soft/db_arch/1_87749_604491553.dbf Errors with log /soft/db_arch/1_87749_604491553.dbf ORA-308 signalled during: ALTER DATABASE RECOVER CONTINUE DEFAULT ... Fri Jan 11 15:27:28 2013 ALTER DATABASE RECOVER CONTINUE DEFAULT Fri Jan 11 15:27:28 2013 Media Recovery Log /soft/db_arch/1_87749_604491553.dbf Errors with log /soft/db_arch/1_87749_604491553.dbf ORA-308 signalled during: ALTER DATABASE RECOVER CONTINUE DEFAULT ... Fri Jan 11 15:27:28 2013 ALTER DATABASE RECOVER CANCEL ORA-1547 signalled during: ALTER DATABASE RECOVER CANCEL ...
DBA忽略了这个错误,尝试将数据库打开,很显然会遇到O RA-1589 错误,之后又尝试用 NORESTLOGS 方式 OPEN 数据库,这也很显然会遇到 ORA-1588 错误。不完全恢复的数据库必须要以 RESETLOGS 方式打开。 Fri Jan 11 15:29:52 2013 alter database open Fri Jan 11 15:29:52 2013 ORA-1589 signalled during: alter database open... Fri Jan 11 15:30:11 2013 alter database open NORESETLOGS Fri Jan 11 15:30:11 2013 ORA-1588 signalled during: alter database open NORESETLOGS
之后, DBA 作了一个艰难的决定,再次关闭并重启了数据库。又再次尝试相同的 OPEN 步骤。当然,Oracle 也给与了相同的报错。数据库仍然无法打开。至此,数据库无法提供服务已经1个多小时。 Fri Jan 11 15:30:42 2013 Shutting down instance: further logons disabled Fri Jan 11 15:30:42 2013 Stopping background process CJQ0 Fri Jan 11 15:30:42 2013 Stopping background process MMNL Fri Jan 11 15:30:43 2013 Stopping background process MMON Fri Jan 11 15:30:44 2013 Shutting down instance (immediate) …… Fri Jan 11 15:30:59 2013 Starting ORACLE instance (normal) …… Fri Jan 11 15:31:08 2013 ALTER DATABASE OPEN ORA-1589 signalled during: ALTER DATABASE OPEN... Fri Jan 11 15:31:28 2013 alter database open NORESETLOGS Fri Jan 11 15:31:28 2013 ORA-1588 signalled during: alter database open NORESETLOGS... Fri Jan 11 15:31:41 2013 alter database open RESETLOGS Fri Jan 11 15:31:41 2013 ORA-1122 signalled during: alter database open RESETLOGS...
再接下来,是一团混乱, DBA 多次重启数据库,尝试了多种恢复手段。offline数据文件,recover 数据文件,recover 数据库,online 数据文件,再recover,再 offline,再 open,但是一切尝试都是徒劳的。一直到晚上18:35,在数据库宕机4个多小时以后,开始求助我们帮助其恢复数据库。 Fri Jan 11 15:41:28 2013 alter database datafile '/data2/DECTR_HIS2.dbf' offline Completed: alter database datafile '/data2/DECTR_HIS2.dbf' offline Fri Jan 11 15:41:35 2013 alter database open Fri Jan 11 15:41:35 2013 ORA-1589 signalled during: alter database open... Fri Jan 11 15:42:20 2013 alter database open resetlogs Fri Jan 11 15:42:20 2013 ORA-1245 signalled during: alter database open resetlogs... Fri Jan 11 15:43:40 2013 ALTER DATABASE RECOVER datafile '/data3/ts2_dpcis.dbf' Fri Jan 11 15:43:40 2013 Media Recovery Start Fri Jan 11 15:43:40 2013 Media Recovery failed with error 1610 ORA-283 signalled during: ALTER DATABASE RECOVER datafile '/data3/ts2_dpcis.dbf' ... Fri Jan 11 15:46:09 2013 ALTER DATABASE RECOVER datafile 10 Fri Jan 11 15:46:09 2013 Media Recovery Start Fri Jan 11 15:46:09 2013 Media Recovery failed with error 1610 ORA-283 signalled during: ALTER DATABASE RECOVER datafile 10 ... …… Fri Jan 11 16:37:51 2013 ALTER DATABASE RECOVER database Fri Jan 11 16:37:51 2013 Media Recovery Start Fri Jan 11 16:37:51 2013 Media Recovery failed with error 1610 ORA-283 signalled during: ALTER DATABASE RECOVER database ... Fri Jan 11 16:39:29 2013 ALTER DATABASE RECOVER database using backup controlfile until cancel Fri Jan 11 16:39:29 2013 Media Recovery Start parallel recovery started with 3 processes ORA-279 signalled during: ALTER DATABASE RECOVER database using backup controlfile until cancel ... Fri Jan 11 16:39:43 2013 ALTER DATABASE RECOVER CANCEL Fri Jan 11 16:39:44 2013 ORA-1547 signalled during: ALTER DATABASE RECOVER CANCEL ... Fri Jan 11 16:39:44 2013 ALTER DATABASE RECOVER CANCEL ORA-1112 signalled during: ALTER DATABASE RECOVER CANCEL ... Fri Jan 11 16:40:15 2013 alter database datafile 10 online Fri Jan 11 16:40:15 2013 Completed: alter database datafile 10 online Fri Jan 11 16:40:25 2013 alter database datafile 31 online Completed: alter database datafile 31 online Fri Jan 11 16:40:47 2013 ALTER DATABASE RECOVER database using backup controlfile until cancel Fri Jan 11 16:40:47 2013 Media Recovery Start Fri Jan 11 16:40:47 2013 Media Recovery failed with error 1110 ORA-283 signalled during: ALTER DATABASE RECOVER database using backup controlfile until cancel ... Fri Jan 11 16:47:12 2013 WARNING: inbound connection timed out (ORA-3136) Fri Jan 11 17:44:47 2013 ALTER DATABASE RECOVER datafile 10 Fri Jan 11 17:44:47 2013 Media Recovery Start Fri Jan 11 17:44:47 2013 Media Recovery failed with error 1610 ORA-283 signalled during: ALTER DATABASE RECOVER datafile 10 ... Fri Jan 11 17:45:19 2013 ALTER DATABASE RECOVER database until cancel using backup controlfile Fri Jan 11 17:45:19 2013 Media Recovery Start Fri Jan 11 17:45:19 2013 Media Recovery failed with error 1110 ORA-283 signalled during: ALTER DATABASE RECOVER database until cancel using backup controlfile ... Fri Jan 11 17:46:39 2013 alter database datafile 10 offline Fri Jan 11 17:46:40 2013 Completed: alter database datafile 10 offline Fri Jan 11 17:47:18 2013 ALTER DATABASE RECOVER database until cancel Fri Jan 11 17:47:18 2013 Media Recovery Start Fri Jan 11 17:47:18 2013 Media Recovery failed with error 1610 ORA-283 signalled during: ALTER DATABASE RECOVER database until cancel ... Fri Jan 11 18:11:31 2013 alter database open Fri Jan 11 18:11:31 2013 ORA-1589 signalled during: alter database open... Fri Jan 11 18:35:29 2013 Starting ORACLE instance (normal) Fri Jan 11 18:35:43 2013 alter database open Fri Jan 11 18:35:43 2013 ORA-1589 signalled during: alter database open... 这是一个没有备份的数据库,实际上如果是存储字系统的问题导致了数据文件损坏,那么可能与DBA的关系并不大,但是在经过一下午的折腾,将一个其实仅仅是坏了2个数据文件而可以轻松 OPEN 的数据库恢复到无论如何也无法轻易打开的状态,这就与 DBA 有很大的关系了。
|