Original title: Every Cloud Has a Silver Lining: An In-Depth Analysis of Resolving an ASM Storage High-Availability Failure

Today we were testing IBM SVC storage synchronization, which required kicking all the disks out of the server and then adding them back. Doing so changes the disk names, so the disk names in ASM had to be changed as well. The procedure is actually quite simple:


1. Modify the ASM instance's default disk discovery path parameter asm_diskstring, using the following command:

Author | Jiang Jinsong, Oracle technical expert in the Expert Support Department at 云和恩墨 (Enmotech); Oracle OCP, MySQL OCP, and RHCE certified. Long experience serving mobile-carrier customers, with expertise in Oracle performance optimization, fault diagnosis, and special recovery. 23 years in IT; senior database and system software/hardware integration expert.

alter system set asm_diskstring='/dev/rhdisk*';
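Once the discovery string is changed, the disks that ASM can now see are easy to check from the ASM instance. A minimal verification sketch using the standard V$ASM_DISK view (the exact column list is just an illustration):

-- Run as SYSASM on the ASM instance; HEADER_STATUS is MEMBER for disks that
-- already belong to a diskgroup, and GROUP_NUMBER 0 means "not yet in a group".
SELECT group_number, disk_number, header_status, path
  FROM v$asm_disk
 ORDER BY group_number, disk_number;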

Experience developing and operating marketing and billing systems at the million-user scale; led operations for ten-million-user-scale electric-power marketing business systems across 11 provinces; designed and delivered more than 10 new-energy SaaS systems on the Alibaba Cloud platform. Has served as development engineer, project manager, technical director, project director, operations manager, and cloud platform architect.

2. Shut down the entire Cluster and wait for the storage team to remove and re-add the disks, then modify the following attributes. Mine is a RAC environment, so the following operations must be performed on all nodes.

Preface

Change the disk owner and group:

Oracle ASM stands for Automatic Storage Management, a feature introduced in Oracle 10g. It is a volume manager provided by Oracle to replace the LVM supplied by the operating system, and it supports both single-instance and multi-instance (RAC) configurations.

[rac11g2@root]# chown grid:asmadmin /dev/rhdisk[2-4]

ASM brings great convenience to Oracle DBAs: it manages disk groups automatically and provides data redundancy and optimization. ASM offers rich management and disaster-recovery capabilities, and with the right configuration it can deliver fast, database-level storage disaster recovery.

Change the disk permissions to 660:

This case study walks through the analysis and resolution of an issue at a customer site where ASM storage disaster recovery did not meet the expected goal, and explores, together with the reader, how to approach such unexpected problems.

[rac11g2@root]# chmod 660 /dev/rhdisk[2-4]

01 Problem Overview

Change the disk reservation (sharing) attribute:

Background:

[rac11g2@root]# lsattr -El hdisk2 | grep reserve_policy
reserve_policy  no_reserve  Reserve Policy  True

1. Oracle 12.2 RAC + ASM Normal Redundancy; database storage uses a dual-array redundant architecture to avoid service interruption and data loss caused by a single storage-array failure;

2. The ASM DiskGroup is designed with 2 Failgroups (FG): all disks of one FG reside on storage array 1, and all disks of the other FG reside on storage array 2;

3. The expectation is that if either storage array fails or loses power, the database instances are unaffected and no data is lost, and data resynchronizes automatically once the failed array comes back online.
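For reference, a dual-failgroup diskgroup of this kind is defined when the diskgroup is created. A minimal sketch, with illustrative disk paths rather than the actual LUN names from this environment:

CREATE DISKGROUP dg_data NORMAL REDUNDANCY
  FAILGROUP fg_storage1 DISK '/dev/rhdisk10', '/dev/rhdisk11'   -- LUNs presented by storage array 1
  FAILGROUP fg_storage2 DISK '/dev/rhdisk20', '/dev/rhdisk21';  -- LUNs presented by storage array 2

With every extent mirrored across the two failgroups, ASM can in principle keep the diskgroup mounted when all disks of one failgroup disappear at once.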

[rac11g2@root]# chdev -l hdisk2 -a reserve_policy=no_reserve

[rac11g2@root]# chdev -l hdisk3 -a reserve_policy=no_reserve

[rac11g2@root]# chdev -l hdisk4 -a reserve_policy=no_reserve

In the actual high-availability test, after one storage array was unplugged, the following was observed:

3. Now the Cluster can be started.

1. The CRS cluster is unaffected; OCR/votedisk fail over automatically;
2. The DB controlfile/redolog encounters I/O errors, and after core processes such as LGWR/CKPT are blocked for a long time, Oracle proactively restarts the DB instance (one or both instances); the database then returns to normal;

3. The database data is intact, and after the failed storage array is brought back online it resynchronizes normally.

[rac11g1@root]# crsctl start cluster -all

02 Test Process

Note: I once forgot to change the disk permissions to 660, and as a result the Database would not start; ORA-00600 errors appeared in the alert log, which gave me a scare. However, it was fairly easy to see from the log that this was a permissions problem; after fixing the disk permissions and restarting, everything was fine:

1) First test

Sweep [inc][393409]: completed
Sweep [inc2][393409]: completed
NOTE: Loaded library: System
ORA-15025: could not open disk "/dev/rhdisk4"
ORA-27041: unable to open file
IBM AIX RISC System/6000 Error: 13: Permission denied

Additional information: 11
SUCCESS: diskgroup DATA was mounted
Errors in file /soft/Oracle/diag/rdbms/nint/nint1/trace/nint1_ckpt_19136654.trc (incident=409793):
ORA-00600: internal error code, arguments: [kfioTranslateIO03], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /soft/oracle/diag/rdbms/nint/nint1/incident/incdir_409793/nint1_ckpt_19136654_i409793.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
NOTE: dependency between database nint and diskgroup resource ora.DATA.dg is established
ERROR: unrecoverable error ORA-600 raised in ASM I/O path; terminating process 19136654
Dumping diagnostic data in directory=[cdmp_20120302172201], requested by (instance=1, osid=19136654 (CKPT)), summary=[incident=409793].
Fri Mar 02 17:22:01 2012
PMON (ospid: 14156014): terminating the instance due to error 469
System state dump requested by (instance=1, osid=14156014 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /soft/oracle/diag/rdbms/nint/nint1/trace/nint1_diag_21168306.trc
Fri Mar 02 17:22:02 2012
ORA-1092 : opitsk aborting process
Fri Mar 02 17:22:02 2012
License high water mark = 1
Instance terminated by PMON, pid = 14156014
USER (ospid: 15335672): terminating the instance
Instance terminated by USER, pid = 15335672

1. Storage cables pulled: 16:56:05

2. Instances went down: 16:57:37-16:57:39

[Excerpt]

ASM log:

First, you can try to check the OS drive ownership, permissions and reserve_policy attribute on all nodes. Then restart the ASM instance.
  1) Make sure that the hdisk# is owned by the OS user who installed the ASM Oracle Home, and that the disk is mounted correctly (with the correct owner)
  2) Make sure that the permissions are set correctly at the disk level; 660 is normal, but if there are problems use 777 as a test
  ls -l /dev/rhdisk3 output:
  For 10gR2/11gR1 like:  crw-rw----  oracle:oinstall /dev/rhdisk3
  For 11gR2 like:        crw-rw----  grid:asmadmin /dev/rhdisk3

2018-08-01T16:57:41.712885+08:00

NOTE: ASM client node11:node1:node1-rac disconnected unexpectedly

  How to change the drive ownership and permission ?
  For 10gR2/11gR1:
    # chown -R oracle:oinstall /dev/rhdisk[3-10]
    # chmod -R 660 /dev/rhdisk[3-10]
  For 11gR2:
    # chown -R grid:asmadmin /dev/rhdisk[3-10]
    # chmod -R 660 /dev/rhdisk[3-10]

DB:

  3)Make sure that the reserve_policy attribute of the needed hdisk#
is no_reserve or no on all nodes.
    chdev -l hdisk# -a reserve_policy=no_reserve

2018-08-01T16:57:45.214182+08:00

Instance terminated by USER, pid = 10158

2018-08-01T16:57:36.704927+08:00

Errors in file /oracle/diag/rdbms/node1/node11/trace/node11_ckpt_10158.trc:

ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '+DG_DATA_FAB/NODE1/CONTROLFILE/current.265.981318275'
ORA-15081: failed to submit an I/O operation to a disk
ORA-15081: failed to submit an I/O operation to a disk
ORA-15064: communication failure with ASM instance

2018-08-01T16:57:36.705340+08:00

Errors in file /oracle/diag/rdbms/node1/node11/trace/node11_ckpt_10158.trc:

ORA-00221: error on write to control file
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '+DG_DATA_FAB/NODE1/CONTROLFILE/current.265.981318275'
ORA-15081: failed to submit an I/O operation to a disk
ORA-15081: failed to submit an I/O operation to a disk
ORA-15064: communication failure with ASM instance

If it still fails after the first step, you may try to set the Oracle ASM parameter ASM_DISKSTRING to /dev/* or /dev/rhdisk*. The steps are:
1) Back up the ASM instance pfile (parameter file) or spfile (server parameter file).
  These usually live in $ORACLE_HOME/dbs; the pfile name is like init+ASM1.ora. You can back it up with cp and view the content with vi.
  If you use an spfile, create a pfile from the spfile as the backup.
2) Set the ASM_DISKSTRING parameter.
  With a pfile:
    Add or edit the ASM_DISKSTRING line so that it reads *.ASM_DISKSTRING='/dev/rhdisk*', then start up the ASM instance using the pfile.

  With an spfile:
    $ ORACLE_SID=+ASM1; export ORACLE_SID

    $ sqlplus "/ as sysdba"
    or
    $ sqlplus "/ as sysasm"

    SQL> startup
    SQL> alter system set asm_diskstring='/dev/rhdisk*';
    SQL> select group_number,disk_number,path from v$asm_disk;
        -- You can check the disk info here; most disks' group_number should not be 0.

The Oracle CKPT process was blocked by control file I/O errors, which caused it to proactively restart the instance; in every test, instance termination began roughly 70 s after the timeout started.

If ASM_DISKSTRING is NOT SET … then the following default is used

We suspected that the ASM instance was too slow to offline the failed disks, and hoped to raise the CKPT blocking-time threshold to work around the problem, but no such parameter could be found.
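No CKPT-side timeout parameter turned up; what can at least be inspected on the ASM side are the diskgroup repair-time attributes, which govern how long an offlined disk is retained before being dropped. A minimal sketch against the standard views (not something the original test recorded):

SELECT dg.name AS diskgroup, a.name AS attribute, a.value
  FROM v$asm_diskgroup dg
  JOIN v$asm_attribute a ON a.group_number = dg.group_number
 WHERE a.name IN ('disk_repair_time', 'failgroup_repair_time');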

    Default ASM_DISKSTRING per OS

Since the problem lies with the controlfile, could it be that the DATA diskgroup contains many disks, making the offline detection take a long time?

    Operating System                    Default Search String
    =====================================================
    Solaris (32/64 bit)                 /dev/rdsk/*
    Windows NT/XP                       \\.\orcldisk*
    Linux (32/64 bit)                   /dev/raw/*

We tried moving the controlfile to the REDO DG, which has far fewer disks, but it still failed on the controlfile:
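For reference, relocating a controlfile into a different diskgroup generally means updating CONTROL_FILES and re-creating the copy while the instance is down; a minimal sketch, with the target diskgroup name assumed purely for illustration:

-- 1. Point the parameter at the new diskgroup (takes effect at the next startup):
ALTER SYSTEM SET control_files = '+DG_REDO' SCOPE = SPFILE SID = '*';
-- 2. Restart the instance in NOMOUNT and copy the current controlfile into the new
--    diskgroup, e.g. with RMAN: RESTORE CONTROLFILE FROM '<old controlfile path>';
-- 3. MOUNT and OPEN the database; V$CONTROLFILE should then show the new location.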

    LINUX (ASMLIB)                      ORCL:*
    LINUX (ASMLIB)                      /dev/oracleasm/disks/*  (as a workaround)

Systemstate dump file:

----- Beginning of Customized Incident Dump(s) -----
Process CKPT (ospid: 4693) is waiting for event 'control file sequential read'.
Process O009 (ospid: 5080) is the blocker of the wait chain.

===[ Wait Chain ]===
CKPT (ospid: 4693) waits for event 'control file sequential read'.
LGWR (ospid: 4691) waits for event 'KSV master wait'.
O009 (ospid: 5080) waits for event 'ASM file metadata operation'.

node1_lgwr_4691.trc

----- END DDE Actions Dump (total 0 csec) -----
ORA-15080: synchronous I/O operation failed to write block 1031 of disk 4 in disk group DG_REDO_MOD
ORA-27063: number of bytes read/written is incorrect
HPUX-ia64 Error: 11: Resource temporarily unavailable
Additional information: 4294967295
Additional information: 1024
NOTE: process _lgwr_node1 (4691) initiating offline of disk 4.4042263303 (DG_REDO_MOD_0004) with mask 0x7e in group 3 (DG_REDO_MOD) with client assisting

    HPUX                                /dev/rdsk/*
    HP-UX (Tru64)                       /dev/rdisk/*
    AIX                                 /dev/*

2) Second test


We tried multiplexing the controlfile:

1. Each storage array allocates one 10 GB LUN to the servers;

2、基于各类LUN创制1个DG,controlfile
multiplex到那2个DG中。
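A minimal sketch of that multiplexing setup (the diskgroup names are illustrative, not the ones actually used in the test environment):

-- One diskgroup per 10 GB LUN, each LUN coming from a different storage array.
-- Multiplex the controlfile across both, then re-create/restore the copies while in NOMOUNT:
ALTER SYSTEM SET control_files = '+DG_CTRL1', '+DG_CTRL2' SCOPE = SPFILE SID = '*';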

We then ran another simulated single-storage-failure test and found that the control file still became unreadable/unwritable and the instance was restarted!

In the Oracle documentation we found that only ASM failgroups can provide high availability here, because every control file copy must remain online; otherwise the instance is terminated outright!


Multiplex Control Files on Different
Disks

Every Oracle Database should have at
least two control files, each stored on a different physical disk. If
a control file is damaged due to a disk failure, the associated
instance must be shut down. Once the disk drive is repaired, the
damaged control file can be restored using the intact copy of the
control file from the other disk and the instance can be restarted. In
this case, no media recovery is required.

The behavior of multiplexed control
files is this:

The database writes to all filenames
listed for the initialization parameter CONTROL_FILES in the database
initialization parameter file.

The database reads only the first file
listed in the CONTROL_FILES parameter during database
operation.

If any of the control files become
unavailable during database operation, the instance becomes inoperable
and should be aborted.

Note:

Oracle strongly recommends that your
database has a minimum of two control files and that they are located
on separate physical disks.

So this multiplexing approach does not provide high availability for the controlfile!

3) Third test

Store the controlfile on a single storage array only (the RPT storage), to avoid the blocking caused by controlfile synchronization.

We found that the test sometimes succeeded, but sometimes errors during REDO LOG reads/writes caused the DB to restart!

4) Fourth test

Create 2 separate DGs pointing to the 2 different storage arrays, and multiplex the 2 members of each REDO GROUP across the 2 DGs.

The failover test succeeded: the ASM instance dismounts the failed DG and the database is completely unaffected!
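A minimal sketch of that redo layout (diskgroup names and group numbers are illustrative):

-- Each redo log group gets one member in each diskgroup, i.e. one copy per storage array.
ALTER DATABASE ADD LOGFILE THREAD 1 GROUP 11 ('+DG_REDO_A', '+DG_REDO_B') SIZE 1G;
ALTER DATABASE ADD LOGFILE THREAD 2 GROUP 12 ('+DG_REDO_A', '+DG_REDO_B') SIZE 1G;

When one array fails, LGWR keeps writing to the surviving member and ASM dismounts only the diskgroup on the failed array.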

From the tests above, the following observations emerge:

1. ASM failgroups handle database datafiles perfectly well; failover works as expected.

2. When the controlfile/redo logfiles sit in a normal-redundancy DG and a disk offline is performed, there is an abnormally long blocking period and the DB instance restarts itself; after the restart everything runs normally and data integrity is not affected!
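To confirm how the disks map to failgroups (and therefore to storage arrays) during such tests, a query along these lines can be used; the columns are standard V$ASM_DISK columns:

SELECT dg.name AS diskgroup, d.failgroup, COUNT(*) AS disk_count
  FROM v$asm_diskgroup dg
  JOIN v$asm_disk d ON d.group_number = dg.group_number
 GROUP BY dg.name, d.failgroup
 ORDER BY dg.name, d.failgroup;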

After repeated tests the problem appeared randomly, so we strongly suspected an Oracle bug. On MOS we found a similar one: "Bug 23179662 - ASM B-slave Process Blocking Fatal Background Process like LGWR producing ORA-29771 (Doc ID 23179662.8)". However, MOS states that the 20180417 PSU already fixes this bug, and the workaround is simply to restart the instance.

After a full week without resolving the problem, the following temporary workaround was adopted:
