平台:Redhat Linux 7.6 数据库版本:Oracle 19.10 问题描述:进行OCR磁盘替换DD破坏实验,并删除原有的OCR磁盘组。中间过程中,因spfile文件以及asm密码文件没有进行迁移新的OCR磁盘。导致两个节点crs集群的其中一个节点只能手工进行启动。 注:模拟破坏OCR磁盘过程中,发现OCR备份在OCR磁盘组中,一旦OCR发生损坏,恢复过程中无法读取OCR磁盘组中的OCR备份。无法按照常规restore进行还原。(参考:2717306.1 2256199.1 1929673.1 2264675.1) 1.通过以下命令查看集群状态,发现组件crsd异常、ora.storage异常,其余都是正常的crsctl stat res -t -init 2.通过检查crsd日志发现问题该问题是此次事故的第一个具体原因。 3.通过查看gpnp,来进一步观察spfile文件位置通过以下可以看到文件位置是正常的。 4.通过init.ora文件将asm实例拉起[grid@bjyctzdb10 ~]$ cat init.ora +ASM1.__large_pool_size=20971520 +ASM2.__large_pool_size=20971520 +ASM1.__oracle_base=’/u02/app/grid’#ORACLE_BASE set from in memory value +ASM2.__oracle_base=’/u02/app/grid’#ORACLE_BASE set from in memory value +ASM1._asm_max_connected_clients=4 +ASM2._asm_max_connected_clients=4 *.asm_diskgroups='ARCH’,'DATA’,'OCR_NEW’#Manual Mount .asm_diskstring=’/dev/oracleasm/disks/’ *.asm_power_limit=1 *.large_pool_size=12M *.remote_login_passwordfile='EXCLUSIVE’ SQL> startup pfile=’/home/grid/init.ora’; ASM instance started Total System Global Area 1137173312 bytes Fixed Size 8905536 bytes Variable Size 1103101952 bytes ASM Cache 25165824 bytes ASM diskgroups mounted SQL> 5.查看集群中spfile位置描述#从这里发现并未指定spfile文件位置,可能是该原因找不到spfile文件,导致集群无法启动。 SQL> show parameter spfile NAME TYPE VALUE ------------------------------------------- spfile string 指定spfile文件位置。 SQL> alter system set spfile=’+OCR_NEW/bjyctzd-cluster/ASMPARAMETERFILE/registry.253.1072795999’; System altered. SQL> show parameter spfile NAME TYPE VALUE -------------------------------------------- spfile string +OCR_NEW/bjyctzd-cluster/ASMPA RAMETERFILE/registry.253.10727 95999 6.重启集群,再次观察–需要重启asm实例才可以和之前问题一样。 7.再次使用init.ora去启动asm磁盘SQL> startup pfile=’/home/grid/init.ora’; ASM instance started Total System Global Area 1137173312 bytes Fixed Size 8905536 bytes Variable Size 1103101952 bytes ASM Cache 25165824 bytes ASM diskgroups mounted SQL> show parameter spfile NAME TYPE VALUE -------------------------------------------- spfile string SQL> show parameter spfile NAME TYPE VALUE -------------------------------------------- spfile string SQL> alter system set spfile=’+OCR_NEW/bjyctzd-cluster/ASMPARAMETERFILE/registry.253.1072795999’; System altered. SQL> create pfile=’/home/grid/init.ora’ from spfile; File created. SQL> shutdown immediate ASM diskgroups dismounted ASM instance shutdown SQL> startup ASM instance started Total System Global Area 1137173312 bytes Fixed Size 8905536 bytes Variable Size 1103101952 bytes ASM Cache 25165824 bytes ASM diskgroups mounted SQL> show parameter spfile NAME TYPE VALUE -------------------------------------------- spfile string +OCR_NEW/bjyctzd-cluster/ASMPA RAMETERFILE/registry.253.10727 95999 SQL> 8.再次重启集群还是如上问题,asm实例需要手工启动、asm实例启动之后crsd集群组件才可以启动,通过查看日志发现问题9.通过命令查看asm密码文件位置发现密码文件和密码文件备份都在之前的OCR磁盘中,但是OCR已经dismount。 [grid@bjyctzdb10 ~]$ srvctl config asm ASM home: Password file: +OCR/orapwASM Backup of Password file: +OCR/orapwASM_backup ASM listener: LISTENER ASM instance count: 3 Cluster ASM listener: ASMNET1LSNR_ASM [grid@bjyctzdb10 ~]$ 修改asm密码文件位置以及备份位置。 [root@bjyctzdb10 ~]# srvctl modify asm -pwfile OCR_NEW [root@bjyctzdb10 ~]# srvctl config asm ASM home: Password file: OCR_NEW Backup of Password file: +OCR/orapwASM_backup ASM listener: LISTENER ASM instance count: 3 Cluster ASM listener: ASMNET1LSNR_ASM [root@bjyctzdb10 ~]# [root@bjyctzdb10 ~]# srvctl modify asm -pwfilebackup ocr_new [root@bjyctzdb10 ~]# srvctl config asm ASM home: Password file: OCR_NEW Backup of Password file: ocr_new ASM listener: LISTENER ASM instance count: 3 Cluster ASM listener: ASMNET1LSNR_ASM [root@bjyctzdb10 ~]# 10.再次尝试关闭集群以及启动集群11.通过查看日志,还是因为无法找到密码文件打开asm磁盘12.进一步观察asm密码文件从这里我们可以看到Password file:后面没有跟具体的密码文件。 [grid@bjyctzdb09 ~]$ srvctl config asm ASM home: Password file: OCR_NEW Backup of Password file: ocr_new ASM listener: LISTENER ASM instance count: 3 Cluster ASM listener: ASMNET1LSNR_ASM [grid@bjyctzdb09 ~]$ exit 13.进入asmcmd,查看是否有密码文件从下面我们可以发现没有asm密码文件。 ASMCMD> cd ocr_new ASMCMD> ls bjyctzd-cluster/ 14.重新生成一个密码文件提示已经存在一个密码文件。 [grid@bjyctzdb09 ~]$ orapwd file=’+ocr_new/orapwasm’ asm=y Enter password for SYS: OPW-00010: Could not create the password file. This resource has a Password File. 通过这个,我们可以看到密码文件是OCR_NEW。 [grid@bjyctzdb09 ~]$ srvctl config asm ASM home: Password file: OCR_NEW Backup of Password file: ocr_new ASM listener: LISTENER ASM instance count: 3 Cluster ASM listener: ASMNET1LSNR_ASM [grid@bjyctzdb09 ~]$ exit 15.根据上面提示,我们重新将asm密码文件位置设置为空[grid@bjyctzdb09 ~]$ srvctl modify asm -pwfile [grid@bjyctzdb09 ~]$ srvctl config asm ASM home: Password file: Backup of Password file: ocr_new ASM listener: LISTENER ASM instance count: 3 Cluster ASM listener: ASMNET1LSNR_ASM 16.重新生成密码文件,并再次查看密码文件位置这是我们可以发现,对应的密码文件节点1和节点2已经存在。 [grid@bjyctzdb09 ~]$ orapwd file=’+ocr_new/orapwasm’ asm=y Enter password for SYS: [grid@bjyctzdb09 ~]$ srvctl config asm ASM home: Password file: +OCR_NEW/orapwasm Backup of Password file: ocr_new ASM listener: LISTENER ASM instance count: 3 Cluster ASM listener: ASMNET1LSNR_ASM [grid@bjyctzdb10 ~]$ srvctl config asm ASM home: Password file: +OCR_NEW/orapwasm Backup of Password file: ocr_new ASM listener: LISTENER ASM instance count: 3 Cluster ASM listener: ASMNET1LSNR_ASM 这是我们发现我们的密码文件已经完整的回来了。 17.再次重启集群,观察是否可以自动将crsd资源拉起来失败,和之前的失败原因是一样的。 18.本次具体的报错信息alert.log日志报错。 2021-05-17 19:09:52.326 [CRSD(123152)]CRS-0804: Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage Storage layer error [Insufficient quorum to open OCR devices] [0]]. Details at (:CRSD00111:) in /u02/app/grid/diag/crs/bjyctzdb09/crs/trace/crsd.trc. 2021-05-17 19:09:52.533 [CRSD(123176)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 123176 2021-05-17 19:09:58.341 [CRSD(123176)]CRS-1013: The OCR location in an ASM disk group is inaccessible. Details in /u02/app/grid/diag/crs/bjyctzdb09/crs/trace/crsd.trc. 2021-05-17 19:09:58.346 [CRSD(123176)]CRS-0804: Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage Storage layer error [Insufficient quorum to open OCR devices] [0]]. Details at (:CRSD00111:) in /u02/app/grid/diag/crs/bjyctzdb09/crs/trace/crsd.trc. 2021-05-17 19:09:58.555 [CRSD(123220)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 123220 ohasd_orarootagent_root.trc报错信息。 2021-05-17 18:58:56.791 : USRTHRD:2394859264: [ INFO] {0:5:3} [ora.storage] Error [kgfoAl06] in [kgfokge] at kgfo.c:3180 2021-05-17 18:58:56.791 : USRTHRD:2394859264: [ INFO] {0:5:3} [ora.storage] ORA-01017: invalid username/password; logon denied 2021-05-17 18:58:56.791 : USRTHRD:2394859264: [ INFO] {0:5:3} [ora.storage] Category: 7 2021-05-17 18:58:56.791 : USRTHRD:2394859264: [ INFO] {0:5:3} [ora.storage] DepInfo: 1017 2021-05-17 18:58:56.791 : USRTHRD:2394859264: [ INFO] {0:5:3} [ora.storage] – trace dump end – 2021-05-17 18:58:56.791 : USRTHRD:2394859264: [ INFO] {0:5:3} Thread:kgfoCheckMountExt isRunning is reset to false here 2021-05-17 18:58:56.791 : USRTHRD:2394859264: [ INFO] {0:5:3} Thread:kgfoCheckMountExt isFinished set to true 2021-05-17 18:58:56.791 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Tasklet::doTask m_cv.timewait returned 2021-05-17 18:58:56.791 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Tasklet::doTask(kgfoCheckMountExt) executed in 3 seconds 2021-05-17 18:58:56.791 :CLSDYNAM:3566778112: [ora.storage]{0:5:3} [start] StorageAgent::parsekgforetcodes retcode = 7, kgfoCheckMount(OCR_NEW), flag 2 2021-05-17 18:58:56.791 :CLSDYNAM:3566778112: [ora.storage]{0:5:3} [start] (null) category: 7, operation: kgfoAl06, loc: kgfokge, OS error: 1017, other: ORA-01017: invalid username/password; logon denied 2021-05-17 18:58:56.791 :CLSDYNAM:3566778112: [ora.storage]{0:5:3} [start] StorageAgent::check 260 kgfo returncode 1 2021-05-17 18:58:56.791 :CLSDYNAM:3566778112: [ora.storage]{0:5:3} [start] (:CLSN00140:)StorageAgent::check 300 parsekgforretcodes OCR dgName OCR_NEW state 1 2021-05-17 18:58:56.791 :CLSDYNAM:3566778112: [ora.storage]{0:5:3} [start] Storage::start waiting for check to not return PARTIAL or UNPLANNED_OFFLINE 1 2021-05-17 18:58:57.791 :CLSDYNAM:3566778112: [ora.storage]{0:5:3} [start] StorageAgent::check 160 NODEROLE_HUB getOCRdetails 2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt thread constructor exit this:9c07c820 m_pThnd:0 m_thndMX:9c07c840, m_tintMX:9c07c890 &m_postMX:0x7f0a9c07c870 2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} TaskletObjManager::manageTasklets START 2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt thread destructor entry { this:0x7f0a9c03a8e0 m_pThnd:0x7f0a9c0851f0 m_thndMX:0x7f0a9c03a900, m_tintMX:0x7f0a9c03a950 &m_postMX:0x7f0a9c03a930 2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt thread destructor stop 2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt stop { m_pThnd:0x7f0a9c0851f0 tid:2394859264 running:0 alive:0 2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Thread stop:thread is not running:0 or alive:0 2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Thread stop:thread acquire m_pThnd:0x7f0a9c0851f0 m_thndMX:0x7f0a9c03a900 2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Thread stop:thread sltstjn 2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt stop release m_thndMX:0x7f0a9c03a900 m_tid:2394859264} 2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt thread destructor m_pThnd:(nil) exit } 2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} TaskletObjManager::manageTasklets END 2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Thread::start { acquire thndMX:9c07c840 2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Thread::start spawn pThnd:0x7f0a9c04dff0 thndType:1 2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Thread::start thread spawned tid:2394859264 2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Thread::start spawned release thndMX:9c07c840 } 2021-05-17 18:58:57.799 : CLSNS:2394859264: clsns_SetTraceLevel:trace level set to 1. 2021-05-17 18:58:57.827 : USRTHRD:2394859264: [ INFO] {0:5:3} [ora.storage] 9607 Error 4 querying length of attr ASM_DISCOVERY_ADDRESS 19.查看asm实例中的用户名以及权限信息从这里发现此次事故的第二个具体原因。 [root@bjyctzdb09 dbs]# asmcmd lspwusr Username sysdba sysoper sysasm SYS TRUE TRUE TRUE 对照正常的asm实例用户如下:(12c之后的版本都是如下–11g版本没有验证)。 20.添加对应的用户,并授予对应的权限[root@bjyctzdb09 dbs]# asmcmd orapwusr --add ASMSNMP Enter password: ******* [root@bjyctzdb09 dbs]# asmcmd orapwusr --grant sysdba ASMSNMP [root@bjyctzdb09 dbs]# asmcmd lspwusr Username sysdba sysoper sysasm SYS TRUE TRUE TRUE ASMSNMP TRUE FALSE FALSE [root@bjyctzdb09 dbs]# orapwusr --add CRSUSER__ASM_001 -bash: orapwusr: command not found [root@bjyctzdb09 dbs]# asmcmd orapwusr --add CRSUSER__ASM_001 Enter password: ****** [root@bjyctzdb09 dbs]# asmcmd orapwusr --grant sysdba CRSUSER__ASM_001 [root@bjyctzdb09 dbs]# asmcmd orapwusr --grant sysasm CRSUSER__ASM_001 从这里我们可以发现,用户已经完全正确,且对应的权限也没有什么问题。 [root@bjyctzdb09 dbs]# asmcmd lspwusr Username sysdba sysoper sysasm SYS TRUE TRUE TRUE ASMSNMP TRUE FALSE FALSE CRSUSER__ASM_001 TRUE FALSE TRUE 21.再次重启crs,观察是否能够自动启动报错,和之前一样的报错信息。 22.进一步查阅mos文档,发现最终问题数据库版本19.8或者之后的ru补丁后的数据库,带来一个新特性"asmcmd credverify" and “asmcmd credfix” 命令来创建asm密码文件。 执行如下命令。该命令OCR、OLR和密码文件中是否一致,以及验证是否存在连接集群的凭证。 [grid@bjyctzdb10 ~]$ asmcmd --nocp credverify credverify: Cannot connect using credentials for CRSUSER__ASM_001, please run 'credfix’ to fix the credentials. 根据上述提示,无法使用功能CRSUSER__ASM_001,并需要运行一下命令来修复凭证。 从下面可以看出删除了CRSUSER__ASM_001用户,并创建了新的用户CRSUSER__ASM_002,而且最后将连接集群的证书修复。 [grid@bjyctzdb10 ~]$ asmcmd --nocp credfix credfix: Failed to connect using credentials for CRSUSER__ASM_001. op=addcrscreds wrap=/tmp/creds0.xml credfix: Creating new credentials, no valid credentials in OCR. credfix: New user CRSUSER__ASM_002 created. credfix: Starting SSH session on node bjyctzdb09. credfix: OLR for bjyctzdb09 has been fixed if credentials were created incorrectly. Exiting SSH session. op=credimport wrap=/tmp/creds0.xml olr=true force=true credfix: OLR for bjyctzdb10 has been fixed if credentials were created incorrectly. op=delcrscreds crs_user=CRSUSER__ASM_001 credfix: Deleted CRSUSER__ASM_001 from OCR. credverify: starting SSH session on node bjyctzdb09 credverify: Credentials created correctly on bjyctzdb09. Exiting SSH session. credverify: Credentials created correctly on bjyctzdb10. credfix: Credentials have been fixed if they were created incorrectly 23.再次查看asm实例中的用户和密码[grid@bjyctzdb10 ~]$ asmcmd lspwusr Username sysdba sysoper sysasm SYS TRUE TRUE TRUE ASMSNMP TRUE FALSE FALSE CRSUSER__ASM_002 TRUE FALSE TRUE 24.再次重启集群,观察crs是否自动启动到此,问题得到完美解决! 墨天轮原文链接:https://www./db/65395(复制到浏览器或者点击“阅读原文”立即查看) |
|