如何高效处理oracle hang

作者简介:孙显鹏,Oracle 十年从业经验,拥有11G ocp认证,精通内部原理,擅长调优,解决疑难问题,致力于帮助客户解决生产过程过出现的性能问题,提高生产效率!爱好书法!

概述:

hanganalyze 用于当数据库出现hang或者严重性能问题时,对数据库阻塞进程和被阻塞进程进行分析,因为人工分析可能比较复杂,那么借助hanganalyze工具可以快速发现数据库出现问题的根因,找出阻塞者,继而处理问题。另外oracle还提供了一个hangfg工具包,自动收集hang文件包括systemstate,在RAC下也可以使用,非常方便,但是这个脚本需要稍加修改,linux下报错,主要是shell路径问题,关于hangfg使用方法可以到mos上下载使用。建议还是不要使用,应该非常熟练hanganalyzesystemstate命令,这个是DBA基本技能。要不然就会觉得很low对不对!

为什么数据库会出现hang呢?比如TM/TX锁,也就是对象上面存在的锁以及死锁oracle会等待或者自动解除死锁。如果锁是发生在内存结构上出现死锁,也就是循环等待资源或者长时间等待资源不能释放,这种类型的锁或死锁oracle是不能自动解除。那么这种情况就会造成数据库HANG。有时候不是真正的hang可能只是性能问题,需要我们诊断。

oracle在出现hang或者急剧的性能问题会自动dump systemstatesystemstate就是SGA中对象的dump,也就是所有进程的dump。(这个有点类似weblogic JAVA dump 文件,可以分析javaGC机制和OOM问题,IBM有一个非常好的OOM分析工具大家可以使用)systemstate信息量较大人工分析困难,如果不熟练可能不好分析,可以借助ass.awk 工具进行分析,这个工具会把systemstate内容分类整理成一个比较容易看懂的文件。使用方法很简单:awk -f ass.awk systemstate文件名,具体命令如下: awk -f ass.awk /home/oracle/yyjc1_ora_20938.trc

具体处理方式:

下面系统HNAG的严重情况具体说明操作方法:

1:如果系统hang不能正常登录,也不能以sqlplus -prelim ‘/ as sysdba’ 方式登录,那么我们可以借助操作系统debug 工具来完成systemstate,步骤如下:

首先需要在系统查找一个oracle 进程,最好不要使用后台进程,然后依据操作系统执行下面操作即可,那么就会在oracle trace目录生成systemstate 文件,文件名为ora_PID.trc.

命令如下:

给一个linux下gdb 使用案例:

然后去 trace目录找trc文件。

2:对于hanganalyze 不能通过正常方式登录,但是可以使用sqlplus “preliminary connection” (sqlplus -prelim ‘/ as sysdba’)方式登录,操作步骤如下:

注意11.2.0.2 以后preliminary connection使用方法有变化:

NOTE: From 11.2.0.2 onwards, hanganalyze will not produce output under a sqlplus “preliminary connection” since it requires a process state object and a session state object. If a hanganalyze is attempted, although the hanganalyze will appear to be successful:

具体示例:

3:数据库能正常登录时处理Hanganalyze 的步骤如下:

单节点脚本:

RAC 脚本:

For 11g :

For 10g, run as following:

下面是对hanganalyze trace文件中各个状态的说明:

The following describes the important states to be considered:IN_HANG: This might be considered as the most critical STATE. Basically a node in this state is involved in a deadlock, or is hung.

Usually there will be another “adjacent node” in the same status. For example:

[nodenum]/cnode/sid/sess_srno/session/ospid/state/start/finish/[adjlist]/predecessor

[16]/0/17/154/0x24617be0/26800/IN_HANG/29/32/[185]/19

[185]/1/16/4966/0x24617270//IN_HANG/30/31/[16]/16

In this example the node [16] is waiting for node [185], and the other way around; this is a cyclical condition (deadlock).

LEAF and LEAF_NW: Leaf nodes are considered on top of the wait chain (usually blockers). They are considered “Blockers” when there is another session waiting. This can be easily identified using the “predecesor” field. If there is a node referenced in the ‘prdecessor’ field, the node is considered as “blocker”, otherwise it is considered as a “slow” session waiting for some resource.

The difference between LEAF and LEAF_NW is that LEAF nodes are waiting for something, while LEAF_NW are not waiting or may be using the CPU. A typical representation of these nodes when they are considered blockers is:

[ nodenum]/cnode/sid/sess_srno/session/ospid/state/start/finish/[adjlist]/predecessor

[16]/0/17/154/0x24617be0/26800/LEAF/29/30//19

[19]/0/20/13/0x24619830/26791/NLEAF/33/34/[16]/186

In this example, node [16] is blocking node [19]. Notice that node [16] has node [19] in the predecessor field.

Also notice that node [19] has node [16] in the adjacent list.

NLEAF : These sessions are usually considered as “stuck” sessions. It means that there is another session holding a resource needed by the session in this state. By using the adjlist, you can determine which node is the blocker of this process. When many sessions are found in this state,

it is likely the database is experiencing a performance problem rather than a hang problem.

IGN and IGN_DMP : Sessions in this state are usually considered as IDLE sessions, unless they reference a node in the ” adjlist” field.In this case, the node is waiting for another node, so it will be considered as a ‘stuck’ session as well.

Extending the previous example,

[nodenum]/cnode/sid/sess_srno/session/ospid/state/start/finish/[adjlist]/predecessor

[16]/0/17/154/0x24617be0/26800/LEAF/29/30//19

[19]/0/20/13/0x24619830/26791/NLEAF/33/34/[16]/186

[189]/1/20/36/0x24619830//IGN/95/96/[19]/none

[176]/1/7/1/0x24611d80//IGN/75/76//none

You may notice that node [189] is waiting for node [19] which in turn is waiting for node [16], while node [176] is an IDLE session .

This maybe the case when a session has a DML lock but never finished the transaction.

In Oracle9i, two new states were introduced to differentiate between LEAF nodes that have other nodes waiting behind them (i.e., LEAF nodes that are blockers) vs. LEAF nodes that are not affecting other nodes.SINGLE_NODE and SINGLE_NODE_NW:

This can be described the same as LEAF and LEAF_NW, except that they don’t have processes depending on them.

提供一个使用ass.awk工具分析systemstate的示例:

awk -f ass.awk /home/oracle/yyjc1_ora_20938.trc 结果如下:

Resource Holder State:资源状态

从上面结果很清楚看到大量的MUTEX 等待,并且存在SGA内存动态调整等待。

是不是非常清楚!

 

未经允许不得转载:Oracle一体机用户组 » 如何高效处理oracle hang

相关推荐