
ASM Disk Header Corruption

A couple of ASM disk headers were accidentally wiped by the UNIX sysadmin team.

After the disk headers were wiped, the database kept running for some time because the data inside the disks was intact. The database came down only when the scheduled RMAN backup kicked off.

The OCR_VOTE disk group was not affected and remained mounted. All RAC processes except the database were running normally, as the OCR and voting disks were still available.
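A quick way to confirm the clusterware state at this point is the standard crsctl checks (illustrative; run as the Grid owner or root):

$ crsctl check crs              # CRS/CSS/EVM daemon status
$ crsctl status resource -t     # per-resource state, including the database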

Just after the database went down, the disk status was showing as ‘CANDIDATE’:
SQL> select GROUP_NUMBER,DISK_NUMBER,NAME,TOTAL_MB,FREE_MB,HEADER_STATUS from v$asm_disk;

GROUP_NUMBER DISK_NUMBER NAME                             TOTAL_MB    FREE_MB HEADER_STATU
------------ ----------- ------------------------------ ---------- ---------- ------------
           1           0 ARC_LOG_0000                       102400      56199 CANDIDATE
           2           5 DB_DATA_0005                       102400      22441 CANDIDATE
           2           4 DB_DATA_0004                       102400      22412 CANDIDATE
           2           3 DB_DATA_0003                       102400      22431 CANDIDATE
           2           2 DB_DATA_0002                       102400      22427 CANDIDATE
           2           1 DB_DATA_0001                       102400      22433 CANDIDATE
           2           0 DB_DATA_0000                       102400      22425 CANDIDATE
           3           0 OCR_VOTE_0000                       51199      50803 MEMBER

8 rows selected.
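The disk group state can be tracked the same way (an illustrative query; the affected groups later showed up as DISMOUNTED once ASM dropped them):

SQL> select name, state, total_mb, free_mb from v$asm_diskgroup;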
As shown above, the disks were showing as ‘CANDIDATE’ just after the database went down, but after some time the ASM instance dismounted the affected disk groups and took all of their disks offline. The ASM alert log showed the following messages:
WARNING: Disk 0 (DB_DATA_0000) in group 2 in mode 0x7f is now being taken offline on ASM inst 1
WARNING: Disk 1 (DB_DATA_0001) in group 2 in mode 0x7f is now being taken offline on ASM inst 1
WARNING: Disk 2 (DB_DATA_0002) in group 2 in mode 0x7f is now being taken offline on ASM inst 1
WARNING: Disk 3 (DB_DATA_0003) in group 2 in mode 0x7f is now being taken offline on ASM inst 1
WARNING: Disk 4 (DB_DATA_0004) in group 2 in mode 0x7f is now being taken offline on ASM inst 1
WARNING: Disk 5 (DB_DATA_0005) in group 2 in mode 0x7f is now being taken offline on ASM inst 1

NOTE: cache deleting context for group DB_DATA 2/0xbf00da29
GMON dismounting group 2 at 16 for pid 40, osid 17308
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
NOTE: Disk  in mode 0x8 marked for de-assignment
SUCCESS: diskgroup DB_DATA was dismounted
SUCCESS: alter diskgroup DB_DATA dismount force /* ASM SERVER */




ERROR: PST-initiated MANDATORY DISMOUNT of group DB_DATA
After the above error message appeared in the alert log, ASM was no longer able to locate the disks on the server:
[root@dbserver ~]# /etc/init.d/oracleasm scandisks
Scanning the system for Oracle ASMLib disks:               [  OK  ]

[root@dbserver ~]# /etc/init.d/oracleasm listdisks
OCR_VOTE01

 

Note that database backups cannot restore a corrupted ASM disk header: RMAN backs up the Oracle data files, not the physical ASM disks or their headers.
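Before attempting any repair, the damage itself can be confirmed with ‘kfed read’ against the physical device (a diagnostic sketch; kfed lives under the Grid Home bin directory, and the device name is the one from this environment). A healthy disk reports kfbh.type as KFBTYP_DISKHEAD, while a wiped header typically comes back as KFBTYP_INVALID:

$ $GRID_HOME/bin/kfed read /dev/mapper/mpath17 | grep -E 'kfbh.type|kfdhdb.dskname'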

 

The steps below helped us correct the situation:

 

Step 1: Finding the ASM-to-physical-disk mapping
The ASM-disk-to-physical-disk mapping information was already available to us; it had been collected earlier with the commands below.
A) $ /etc/init.d/oracleasm querydisk -d DB_DATA01
Disk "DB_DATA01" is a valid ASM disk on device /dev/sdl[8,176]
Here /dev/sdl is the actual disk path, and [8,176] is its major,minor device number pair.
 

B) Running the 'multipath -ll' command and looking for '/dev/sdl':
mpath17 (360060e8006d8e7000000d8e700000000) dm-7 HP,OPEN-V
[size=100G][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=0][active]
 \_ 4:0:0:10 sdaf 65:240 [active][ready]
 \_ 3:0:0:10 sdl  8:176  [active][ready]
Here:
>> 360060e8006d8e7000000d8e700000000 is the WWID, which should be the same on all RAC nodes when checked with ‘multipath -ll’.
>> /dev/mapper/mpath17 is the multipath disk device corresponding to DB_DATA01 on this node.
Similar information was available for the other data disks too.
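A small loop (a sketch; the disk labels are the ones from this environment) collects the same mapping for all disks in one pass:

# as root: print the backing device for each ASMLib disk label
for d in ARC_LOG01 DB_DATA01 DB_DATA02 DB_DATA03 DB_DATA04 DB_DATA05 DB_DATA06; do
  /etc/init.d/oracleasm querydisk -d "$d"
done
# then match the reported /dev/sdXX names against 'multipath -ll' output, as in B) above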

 

Step 2: Running the ‘kfed’ Oracle utility
ASM keeps a backup copy of the disk header inside the disk itself, so this kind of corrupted or missing disk header can be fixed by running kfed with the ‘repair’ command.
Because the ASM disks were no longer showing up under the ASMLib directory ‘/dev/oracleasm/disks’, we could not run ‘kfed repair’ against the ASM disk paths. Instead, we switched to the root user and ran ‘kfed repair’ against the corresponding physical (multipath) devices.
Since kfed is an Oracle utility, it is run from the GRID_HOME bin directory:
$ cd /u01/app/11.2.0/grid/bin

$ ./kfed repair /dev/mapper/mpath17
$ ./kfed repair /dev/mapper/mpath18
$ ./kfed repair /dev/mapper/mpath19
$ ./kfed repair /dev/mapper/mpath20
$ ./kfed repair /dev/mapper/mpath21
$ ./kfed repair /dev/mapper/mpath22
$ ./kfed repair /dev/mapper/mpath23
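After the repair, each header can be re-read to confirm it is intact again (a verification sketch using the same devices); kfbh.type should now be KFBTYP_DISKHEAD and kfdhdb.dskname should show the original ASM disk name:

$ for m in mpath17 mpath18 mpath19 mpath20 mpath21 mpath22 mpath23; do
>   ./kfed read /dev/mapper/$m | grep -E 'kfbh.type|kfdhdb.dskname'
> done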
After running the above commands, you may have to change the ownership of the disks under /dev/oracleasm/disks back to the Grid user.
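If that is needed (typically after the rescan below recreates the device files), something along these lines restores it; this assumes a grid:asmadmin ownership model, and the correct owner and group are whatever ‘oracleasm configure’ was set up with on your system:

# as root - adjust user:group to match your installation
$ chown grid:asmadmin /dev/oracleasm/disks/*
$ ls -l /dev/oracleasm/disks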
On each node, execute the following commands as the root user; they rescan the multipath disks and locate all ASM disks:
$ /etc/init.d/oracleasm scandisks
Scanning the system for Oracle ASMLib disks:               [  OK  ]


$ /etc/init.d/oracleasm listdisks

ARC_LOG01
DB_DATA01
DB_DATA02
DB_DATA03
DB_DATA04
DB_DATA05
DB_DATA06
OCR_VOTE01
The listing now showed all the required ASM disks.
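At this point the disks should report a ‘MEMBER’ header status again, and the disk groups can be mounted from the ASM instance if they do not mount automatically (illustrative; the disk group names are the ones from this environment):

SQL> select group_number, disk_number, name, header_status from v$asm_disk;
SQL> alter diskgroup DB_DATA mount;
SQL> alter diskgroup ARC_LOG mount;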

 

Step 3: Start the database and verify
$ srvctl start database -d <db_unique_name>
$ crsctl status resource -t
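Once the cluster resources show ONLINE, a final check from SQL*Plus confirms that all instances are open (illustrative):

SQL> select inst_id, instance_name, status from gv$instance;
SQL> select name, open_mode from v$database;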

 

Brijesh Gogia
