RAC 死锁原理

作者简介:孙显鹏,Oracle 十年从业经验,拥有11G ocp认证,精通内部原理,擅长调优,解决疑难问题,致力于帮助客户解决生产过程过出现的性能问题,提高生产效率!爱好书法!

概述:

死锁原理:(如下图)

上图表示两个enqueue结构,其中资源TM-43-0持有者为session1 模式 S,此时资源TM-320-0持有者为session 2 模式 S,资源TM-43-0 的请求者为session 2 模式为X,因为X和S不兼容session 2处于等待状态,在等待队列排队,此时资源TM-320-0的请求者又是session 1 模式为X,同样X和S不兼容,session 1 等待TM-320-0资源的TM锁。由于session 1 和session 2 互相等待对方资源造成死锁。Oracle发现死锁后回滚阻塞者解除死锁。

在RAC环境下发生死锁时GES信息会打印到lmd*.trc文件。下面我们看看如何快速通过trace文件快速定位死锁的根因。

实验步骤:

实验环境:11.2.0.4 RAC LINUX 5.8

TX deadlock in Exclusive(X) mode

node2:

node2 session1:

node1 session2:

node2 session 1:

node1 session2:

此时node2 session 1 发现死锁,回滚该操作,解除死锁:

node1 session2:

waiting 状态! 因为lock模式不兼容。

全局事物信息:

这里记住两个session的事务XID号,后面分析用!

死锁过程:

1:node2 lock ID=2 持有ID=2 TX

2:node1 lock ID=1 持有ID=1 TX

3:node2 req lock ID=1 NODE1持有ID=1 TX,NODE2 等待ID=1 tx

4:node1 req lock ID=2 NODE2持有ID=2 TX,NODE1 等待ID=2 tx

死锁产生,oracle 撤销第三步 node2 req lock ID=1 操作。

我们分析trace文件:

这里首先dump 本地BLOCKER/HOLDER:本地阻塞者,[0x7001f][0x1a64e],[TX] 是节点2的事物xid,eq是由资源和lock队列组成,下面首先是资源信息,后面对应lock信息主要持有者和请求者列表。

GRANTED_Q :持有链表,持有tx锁

信息:

CONVERT_Q: 请求列表

因为GES只打印本地节点enqueue所以node1 的enqueue没有打印,请求列表没有pid 信息,该信息可以在node1 trace看到。

注释:这一块就是阻塞者session信息和当前执行的SQL语句:

从hash value值可以看出下面这部分信息是相同的

后面的信息不打印了!

注意死锁中的锁模式5代表x级锁,0是null,也就是从0到5,这个和v$lock中的级别数字不同!

上面实验的WFG:(在trace文件中查找,我添加了部分注释)

WFG内容含义:

从上面看出都是 mode 5 (exclusive) TX lock,该类型原因为SQL的执行顺序问题,也就是事务控制问题导致。

更详细的步骤如下:

BLOCKER 0xea2111e8 表示第一步, TX 0x7001f.0x1a64e 已经获取id=2 的 X锁,node 2,pid 19635

BLOCKER 0xea1c0838 表示第二步, TX 0xb001c.0x1b000 已经获取id=1 的 X锁,node 1,pid 15

BLOCKED 0xea800c78 表示第三步, TX 0xb001c.0x1b000 请求id=1的X锁,等待,node 2,pid 19635

BLOCKED 0xea8079f8 表示第四步, TX 0x7001f.0x1a64e 请求id=2的X锁,等待,node 1,pid 15

知识延伸:

注意node2 和node1 修改相同行的 XID=0x7001f.0x1a64e相同,两个节点访问了相同的UNDO段头事务表和UNDO块!!回忆PCM锁,首先node 2 修改 id=2 块事务没有提交,接着node 1 修改 id=2 数据块,node1通过访问master节点2 的GRD(在SGA中)信息,发现node 2正在修改id=2数据块,通过id=2数据块的事务层(每个数据块有cache层事务层和数据层)读取到XID信息为TX 0x7001f.0x1a64e,那么此时修改GRD Pcm 锁信息,等待TX 0x7001f.0x1a64e事务,将自己添加到enqueue的CONVERT_Q请求列表中,处于等待状态。那么也就是说 事务的xid相同,表示修改相同行,分析WFG首先看XID信息可以确定资源。

处理方案:

应用调整事务控制流程,不能批量提交,单独提交。

该类死锁总结:

其实我们遇到实际问题时,不需要细致到关心哪个session先执行哪个session后执行,谁是阻塞者,谁是被阻塞者,我们关注的是导致死锁的原因就好了,找到SQL,依据根因解决问题就,模拟这个过程主要是了解死锁的详细步骤做到心中有数。比如该类死锁是由于事务控制问题导致,找到原因和SQL,然后和开发人员沟通,通过修改代码的事务控制逻辑该问题就可以解决。但是如果你不知道具体原因,也明白是死锁,却没有处理思路。

该类问题是最常见的死锁!

下面是MOS文档关于其他死锁的介绍:

  • TX deadlock in Share(S) mode

trace shows:

Corresponding to:

<BLOCKED|BLOCKER> <lockp> <cvt|held mode> <res name> <pid|did|txn_id> <node>

mode 3 is shared lock

The causes for TX deadlock in S mode wait can be:

a. ITL contention, eg: INITRANS setting for the object is too small, it can not handle the number of concurrent transactions.

The solution is to increase INITRANS setting for the object involved in the deadlock using “alter table” or “alter index” command

The SQL involved in the deadlock can be found in lmd0 or client trace. The object involved in the SQL should be checked including table and its associated index.

b. If the object involved is an unique key index, the wait could be caused by uniqueness validation. Application needs to be checked to avoid unique key violation.

c. If the object involved has a bitmap index, then the bitmap index should be dropped to accommodate concurrent DML. Please refer to Document 1496403.1 ORA-60 DEADLOCK

DUE TO BITMAP INDEX IN RAC.

  • TM deadlock

典型的外键无索引:

trace shows:

The deadlock is usually caused by missing index for foreign key constraint,

refer to Document 473124.1 – “Frequent GES: Potential Blocker (Pid=nnnn) On Resource TM-<id1>-<id2>” for more information.

Check dba_constraints and dba_index to verify if foreign key index is missing.

Also refer to Document 1019527.6 Script to Check for Foreign Key Locking Issues for a Specific User which will generate a report for all problem objects.

The solution is to create index for every foreign key constraint.

外键无索引 DELETE 主表会导致子表添加TM-S锁,(一般的DML给表添加TM-SX,SX和SX是兼容的,但是SX和S是不兼容的)。

  • Single resource deadlock for TX , TM or IV

trace shows:

a. For single resource deadlock on TX enqueue, often it is caused by using autonomous transaction in stored procedure or PL/SQL. It is a known issue that the use of autonomous transactions is vulnerable to deadlocks. Please check out Oracle? Database Concepts Overview of Autonomous Transactions for detail explanation. Since AUTONOMOUS transaction has been used in the stored procedure, the system would consider any DML statement under this transaction as a separate one (commit/rollback won’t affect the parent), and this would cause conflict if the same row is involved in the parent transaction (INSERT, UPDATE or DELETE), and hence deadlock is reported rightly. Usually the SQL involved in the deadlock is called from a stored procedure or PL/SQL with the following line:

To avoid such deadlock, please remove the autonomous transaction in the application code.

b. If there is no autonomous_transaction involved, please check out Document 6145177.8, it can also be caused by Bug 6145177 – Single resource deadlock with a zero DID

c. For single resource deadlock on TM enqueue, missing foreign key index is often the cause, please check case 3 for the solution.

d. For single resource deadlock type IV (Instance Validation), refer to Document 973178.1, as mentioned in Bug 8843816, this message can be ignored. Bug 8843816 has

been fixed in 11.1.

  • LB deadlock

For example: Bug 6475688 Concurrent rewrite and on-commit refresh can deadlock (library cache pin <–> lock) Document 6475688.8

The bug has been fixed in 11.1.0.7 and 11.2. Please apply patch accordingly.

  • Known Issues

For other deadlock type or known issues related to dead lock, refer to Document 554567.1 Summary Of Bugs Which Could Cause Deadlock In RAC Environment

  • Further Diagnosis

Please collect the following information for further diagnosis:

a. alert log lmd0, and trace mentioned in the alert log from all instances.

b. set the following event to collect systemstate dump ONLY if the information in trace files are insufficient:

It will cause a systemstate dump to be generated whenever a deadlock is reported. If there are constant deadlocks, it could cause a lot of trace files being generated, monitor the system carefully.

To turn off the trace:

总结:

当遇到死锁问题后台alert会打印死锁trace文件记录死锁的详细信息,查看trace文件查找Global Wait-For-Graph(WFG)图表,依据上面文档分析死锁类型,是TX-5还是TX-3等,依据该信息找到对应的user session,找到SQL语句,TX-5解决应用设计问题事务控制问题,ITL问题增大ITL,删除位图,外键加索引,bug? 具体问题具体处理。

未经允许不得转载:Oracle一体机用户组 » RAC 死锁原理

相关推荐