alert日志报错信息: Wed Jun 18 09:03:51 2008 原因分析: DBA_2PC_PENDING Oracle会自动处理分布事务,保证分布事务的一致性,所有站点全部提交或全部回滚。一般情况下,处理过程在很短的时间内完成,根本无法察觉到。但是,如果在commit或rollback的时候,出现了连接中断或某个数据库站点CRASH的情况,则提交操作可能会无法继续,此时DBA_2PC_PENDING和DBA_2PC_NEIGHBORS中会包含尚未解决的分布事务。 对于绝大多数情况,当恢复连接或CRASH的数据库重新启动后,会自动解决分布式事务,不需要人工干预。只有分布事务锁住的对象急需被访问,锁住的回滚段阻止了其他事务的使用,网络故障或CRASH的数据库的恢复需要很长的时间等情况出现时,才使用人工操作的方式来维护分布式事务。
SQL> conn /as sysdba LOCAL_TRAN_ID STATE MIX A FAIL_TIME RETRY_TIME SQL> col LOCAL_TRAN_ID for a15 LOCAL_TRAN_ID IN_OUT DATABASE INTERFACE SQL> COMMIT FORCE '21.3.5270155';
SQL> select count(*) from pending_trans$; COUNT(*) SQL> select LOCAL_TRAN_ID,STATUS,STATE,TOP_DB_USER from pending_trans$; LOCAL_TRAN_ID S STATE TOP_DB_USER SQL> COMMIT FORCE '21.3.5270155'; SQL> COMMIT FORCE '21.3.5270155';
*
LOCAL_TRAN_ID
*
提交完成。 SQL> alter session set "_smu_debug_mode" = 4; 会话已更改。 SQL> EXECUTE DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('21.3.5270155'); PL/SQL 过程已成功完成。 SQL> SELECT LOCAL_TRAN_ID, GLOBAL_TRAN_ID FROM DBA_2PC_PENDING; 未选定行 SQL> SELECT LOCAL_TRAN_ID, IN_OUT, DATABASE, INTERFACE FROM DBA_2PC_NEIGHBORS; 未选定行 SQL> select LOCAL_TRAN_ID,STATE from dba_2pc_pending; 未选定行 SQL> commit; 提交完成。 SQL> exit
In this Document
--------------------------------------------------------------------------------
Applies to: Symptoms SQL> EXECUTE DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('5.23.2386'); BEGIN DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('5.23.2386'); END; SQL> execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('37.16.108');
If the remote database no longer exists then the transaction will have to be Follow the instructions on how to purge a 1. Identify the id of the transaction: SQL> SELECT LOCAL_TRAN_ID, GLOBAL_TRAN_ID FROM DBA_2PC_PENDING; 2. Purge the transaction: SQL> EXECUTE DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY(''); 3. Confirm that the transaction has been purged: SQL> SELECT LOCAL_TRAN_ID, GLOBAL_TRAN_ID FROM DBA_2PC_PENDING;
BEGIN DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('5.23.2386'); END;
Use the following workaround: You have to use local_tran_id..... Issue commit before alter system set "_smu_debug_mode" = 4; Follow the steps, SQL>commit; SQL> alter session set "_smu_debug_mode" = 4; SQL>execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('local_tran_id');
When executing the following procedure(dbms_transaction.purge_lost_db_entry) SQL> execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('37.16.108'); ==>For example.. Fix: The transaction to be deleted is in the prepared state and has to be either SQL> select LOCAL_TRAN_ID,STATE from dba_2pc_pending; LOCAL_TRAN_ID STATE SQL> rollback force '37.16.108'; ==>For example Rollback complete. SQL> select LOCAL_TRAN_ID,STATE from dba_2pc_pending; LOCAL_TRAN_ID STATE SQL>Commit; SQL>alter system set "_smu_debug_mode" = 4; SQL> exec dbms_transaction.purge_lost_db_entry('37.16.108'); ==>For example References ============ 其它参考: http://blog./u/12960/showart_457785.html 最近数据库突然出现RECO进程不停的报ORA-02068和ORA-03113的错误:
Errors in file /oracle/admin/UBISP/bdump/ubisp_reco_23401.trc:
ORA-02068: following severe error from DBNAME
ORA-03113: end-of-file on communication channel 检查trace文件发行tran号总是那几个,执行语句:
select * from dba_2pc_pending;
可以看到记录的所有LOCAL_TRAN_ID与trace中的一样,这太奇怪了,从现象上看只能问题RECO进程无法清除这些失败后的事物,这个问题在RAC环境有可能出现,但单节点上为什么出现这个问题还不清楚。解决方法如下:
1. select * from dba_2pc_pending;
2. alter system disable distributed recovery ; 3. alter session set "_smu_debug_mode" = 4 ;
4. execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('<transaction id>');
5. select * from dba_2pc_pending; 6. alter system enable distributed recovery; 如果有多个事物,需要在第四步后面执行commit;
_smu_debug_mode缺省是0,可以在完成之后改回来,查看该隐藏参数的命令是:
set linesize 132 column name format a30 column value format a25 select x.ksppinm name, y.ksppstvl value, y.ksppstdf isdefault, decode(bitand(y.ksppstvf,7),1,'MODIFIED',4,'SYSTEM_MOD','FALSE') ismod, decode(bitand(y.ksppstvf,2),2,'TRUE','FALSE') isadj from sys.x$ksppi x, sys.x$ksppcv y where x.inst_id = userenv('Instance') and y.inst_id = userenv('Instance') and x.indx = y.indx and x.ksppinm like '%_smu_debug_mode%' order by translate(x.ksppinm, ' _', ' '); 另外: http://www./dodd/2008/02/ora03113.html 大家知道,ORA-03113错误是Oracle数据库常见的错误,导致这个错误的原因比较复杂,各种各样的原因。可能是网络中断引起的、也可能是数据库本身出现了问题。
下面就一个案例,分析一下ORA-03113错误。
故障现象:
开始alert文件提示错误: Error 1013 trapped in 2PC on transaction 1.60.1257421. Cleaning up. Error stack returned to user: ORA-02050: transaction 1.60.1257421 rolled back, some remote DBs may be in-doubt ORA-01013: user requested cancel of current operation ORA-06553: PLS-103: Encountered the symbol "EXCEPTION" when expecting one of the following: begin case declare exit for function goto if loop mod null package pragma procedure raise return select separate type update while with <an identifier> <a double-quoted delimited-identifier> <a bind variable> << form table call close current define delete fetch lock Mon Feb 18 09:07:19 2008 DISTRIB TRAN SMSBOSS.09aad41c.1.60.1257421 is local tran 1.60.1257421 (hex=01.3c.132fcd) insert pending collecting tran, scn=8914343855672 (hex=81b.884c8638) 然后时不时的会提示下面错误: ERROR, tran=1.60.1257421, session#=1, ose=0: ORA-03113: end-of-file on communication channel *** 2008-02-18 09:45:25.919 ERROR, tran=1.60.1257421, session#=1, ose=0: ORA-03113: end-of-file on communication channel *** 2008-02-18 10:19:42.891
Oracle数据库只有这些错误提示,其余状态均正常。
从错误提示看,应该是由于分布事务由于人为cancel中止,引起的事务失败,下面查看相关信息:
SQL> select LOCAL_TRAN_ID,GLOBAL_TRAN_ID,STATE,MIXED,HOST,COMMIT# from dba_2pc_pending;
LOCAL_TRAN_ID GLOBAL_TRAN_ID STATE MIX HOST COMMIT# ---------------------- -------------------- ---------------- --- -------------------- ---------------- 1.60.1257421 SMSBOSS.09aad41c.1.60.1257421 collecting no WORKGROUP\LIUQING 8914343855672
SQL> select * from DBA_2PC_NEIGHBORS;
LOCAL_TRAN_ID IN_ DATABASE DBUSER_OWNER I DBID SESS# BRANCH ---------------------- --- --------------- ------------------------------ ----------------- ---------- -------------------- 1.60.1257421 in BOSSMGR N 1 0000
1.60.1257421 out SMSDBN BOSSMGR N cc3ddb9b 1 4
select * from DBA_2PC_PENDING@smsdbn;
no rows selected
select * from DBA_2PC_NEIGHBORS@smsdbn;
no rows selected
dba_2pc_pending视图记录等待恢复的分布式事务的信息 dba_2pc_neighbors视图记录未决的分布式事务的输入输出连接信息
有上述信息分析原因,1.60.1257421事务的状态为collecting,本机数据库数据流向为in,远端smsdbn数据库流向为out。
询问业务人员,确实运行过这么个一存储,中途手工中止了。并且是从smsdbn数据库里select数据然后update本地数据库。 这基本证实了我们的猜测。
下面尝试force commit或者 force rollback此事务,
SQL> commit force '1.60.1257421'; commit force '1.60.1257421' * ERROR at line 1: ORA-02058: no prepared transaction found with ID 1.60.1257421
上述错误的原因是由于collecting状态的事务不需要commit/rollback force。 我们现在需要做的就是:
(1) Disable分布式恢复 SQL> ALTER SYSTEM DISABLE DISTRIBUTED RECOVERY; System altered.
(2)Puege(清空)in-doubt transaction entry:
SQL> exec DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('1.60.1257421'); PL/SQL procedure successfully completed.
(3)然后enable 分布式恢复: SQL> ALTER SYSTEM ENABLE DISTRIBUTED RECOVERY;
参考信息/更多阅读:
Note:1012842.102 ORA-2019 ORA-2058 ORA-2068 ORA-2050: Failed Distributed Transactions
Note:100664.1 How to Troubleshoot Distributed Transactions
Note:274321.1
Note:126069.1 Manually Resolving In-Doubt Transactions: Different Scenarios |
|
来自: guolijiegg > 《oracle》