Archive for October, 2012

Oracle online patching for bug 13817131 (backport for ORA-7445 near ksfqfcrx()+X during RMAN backups)

Wednesday, October 3rd, 2012

Oracle 11.2.0.3.x sometimes seems to be affected by ORA-7445 during RMAN backups (as per my observations, the bug manifests itself only in highly utilized environments). The first symptoms look like this in the RMAN log:
channel t1: starting piece 1 at 17-SEP-12
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on t1 channel at 09/17/2012 16:01:23
RMAN-10038: database session for channel t1 terminated unexpectedly

The alert log contains information that an incident has been registered, and the most interesting entry is this:
ORA-07445: exception encountered: core dump [ksfqfcrx()+1143] [SIGSEGV] [ADDR:0x8] [PC:0xA3E2E3] [Address not mapped to object] []
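
To collect the full incident details (e.g. for an SR), the ADR command-line tool can be used. A minimal sketch, assuming the default diag destination; the homepath and the incident number 12345 are placeholders for whatever "show incident" reports on your system:

[oracle@labr2 ~]$ adrci
adrci> set homepath diag/rdbms/orcl/ORCL1
adrci> show incident
adrci> show incident -mode detail -p "incident_id=12345"
adrci> ips create package incident 12345
adrci> ips generate package 1 in /tmp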

Starting from this point, sooner or later all RMAN backups are going to fail; from what I have seen it is only a matter of time. RAC gives some additional headroom, because one can switch backups to the next node, but be advised that since some kind of memory corruption is going on, the same will eventually happen on the other nodes too. You need to patch it.
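
Until the patch is in place, backups can be pushed to a not-yet-affected instance by allocating the channel through a connect string. A rough sketch only; node2_bkp is a hypothetical TNS alias pointing at the surviving instance, and the device type has to match your configuration:

run {
  # hypothetical connection to the other RAC instance
  allocate channel t1 device type sbt connect 'sys/***@node2_bkp';
  backup database plus archivelog;
}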

The GDB backtrace shows exactly the failing function:

(gdb) where
#0 0x0000003513630285 in raise () from /lib64/libc.so.6
#1 0x0000003513631d30 in abort () from /lib64/libc.so.6
#2 0x0000000008130445 in skgdbgcra ()
#3 0x00000000067a67f1 in sksdbgcra ()
#4 0x0000000004977828 in ksdbgcra ()
#5 0x000000000679e3a5 in ssexhd ()
#6 <signal handler called>
#7 0x0000000000a3e2e3 in ksfqfcrx ()
#8 0x0000000007018463 in krbbOpenOutput ()
#9 0x000000000702afe7 in krbbpcint ()
#10 0x000000000702e393 in krbbpc ()
#11 0x0000000008a47cea in krbibpc ()
#12 0x00000000093f8131 in pevm_icd_call_common ()
#13 0x00000000093ed460 in pfrinstr_ICAL ()
#14 0x00000000093f10b9 in pfrrun_no_tool ()
#15 0x00000000093ef78d in pfrrun ()
[..]
#31 0x0000000001dc0d14 in ssthrdmain ()
#32 0x0000000000a0b271 in main ()
(gdb) ... several "up" commands to get to the #6 stack frame
(gdb) info registers rax r12
rax 0x0 0
r12 0x7fd3ac78bf20 140547108421408
(gdb)
(gdb) disas
[..]
0x0000000000a3e2d6 <ksfqfcrx+1130>: callq 0x4838ada <ksfq_ibuf>
0x0000000000a3e2db <ksfqfcrx+1135>: mov 0xcd0(%r12),%rax <--- r12
0x0000000000a3e2e3 <ksfqfcrx+1143>: orl $0x4,0x8(%rax) <--- crash, because %rax is 0 and the instruction writes to address %rax + 8 (SIGSEGV)
0x0000000000a3e2e7 <ksfqfcrx+1147>: jmpq 0xa3e0bc <ksfqfcrx+592>
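
For the record, the stack and registers above were obtained more or less like this (a sketch; the core file path is a placeholder, assuming a core dump was produced for the ORA-07445):

$ gdb $ORACLE_HOME/bin/oracle /path/to/core
(gdb) where                     # full backtrace, look for ksfqfcrx()
(gdb) up                        # repeat until the ksfqfcrx() frame is selected
(gdb) info registers rax r12    # rax == 0 confirms the NULL pointer
(gdb) disas                     # locate the faulting orl $0x4,0x8(%rax)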

More or less the issue is documented under Oracle Support note 1482775.1, and the only workaround offered is to upgrade to 12.1 or 11.2.0.4, which is a little joke (we all know that 12.1 is going to be such a great database version that it won’t have even a single bug… everything is fixed in 12.1 ;) ). In my opinion one has several options:

  1. you agree to lose backups of mission critical production envs (change the RMAN ARCHIVELOG DELETION POLICY to NONE?) or fall back to backups taken on a DataGuard standby (delete force archivelog is your friend… or enemy actually ;) ); see the sketch after this list
  2. fall back to the remaining RAC nodes and restart the affected one… (restarting instances seems to avoid the problem for some time)
  3. do nothing and accept the downtime (ARCH getting stuck once the FRA fills up)
  4. request a backport for the bug mentioned in note 1482775.1
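
For option 1, the standby fallback boils down to something like this (a sketch only; the deletion policy and where you run the backups have to match your Data Guard setup):

# on the primary: allow archivelogs to be purged once applied on the standby
RMAN> CONFIGURE ARCHIVELOG DELETION POLICY TO APPLIED ON ALL STANDBY;
# on the standby: take the backups there instead of on the affected primary
RMAN> BACKUP DATABASE PLUS ARCHIVELOG;
# emergency cleanup on the primary if the FRA keeps filling up
RMAN> DELETE FORCE NOPROMPT ARCHIVELOG ALL COMPLETED BEFORE 'SYSDATE-1';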

We went with options 2 and 4. Later I raised an escalation on the case we had opened initially. Several days later we had a patch ready to be applied; the nice thing about it (patch 13817131) is that it can be applied completely in online mode even on RAC (no need to shut down anything), just live memory patching! Here are the details from the lab:

[oracle@labr2 13817131]$ opatch apply online -connectString ORCL2:::labr1,ORCL1:::labr2
Oracle Interim Patch Installer version 11.2.0.3.0
Copyright (c) 2012, Oracle Corporation.  All rights reserved.

Oracle Home       : /u01/app/oracle/product/11.2.0/dbhome_1
Central Inventory : /u01/app/oraInventory
   from           : /u01/app/oracle/product/11.2.0/dbhome_1/oraInst.loc
OPatch version    : 11.2.0.3.0
OUI version       : 11.2.0.3.0
Log file location : /u01/app/oracle/product/11.2.0/dbhome_1/cfgtoollogs/opatch/13817131_Sep_27_2012_12_55_35/apply2012-09-27_12-55-35PM_1.log

The patch should be applied/rolled back in '-all_nodes' mode only.
Converting the RAC mode to '-all_nodes' mode.
Applying interim patch '13817131' to OH '/u01/app/oracle/product/11.2.0/dbhome_1'
Verifying environment and performing prerequisite checks...
All checks passed.
[..]
Backing up files...

Patching component oracle.rdbms, 11.2.0.3.0...
Installing and enabling the online patch 'bug13817131.pch', on database 'ORCL1'.

Verifying the update...

Patching in all-node mode.

Updating nodes 'labr1'
   Apply-related files are:
     FP = "/u01/app/oracle/product/11.2.0/dbhome_1/.patch_storage/13817131_Sep_25_2012_08_46_38/rac/copy_files.txt"
     DP = "/u01/app/oracle/product/11.2.0/dbhome_1/.patch_storage/13817131_Sep_25_2012_08_46_38/rac/copy_dirs.txt"
     MP = "/u01/app/oracle/product/11.2.0/dbhome_1/.patch_storage/13817131_Sep_25_2012_08_46_38/rac/make_cmds.txt"
     RC = "/u01/app/oracle/product/11.2.0/dbhome_1/.patch_storage/13817131_Sep_25_2012_08_46_38/rac/remote_cmds.txt"

Instantiating the file "/u01/app/oracle/product/11.2.0/dbhome_1/.patch_storage/13817131_Sep_25_2012_08_46_38/rac/copy_files.txt.instantiated" by replacing $ORACLE_HOME in "/u01/app/oracle/product/11.2.0/dbhome_1/.patch_storage/13817131_Sep_25_2012_08_46_38/rac/copy_files.txt" with actual path.
Propagating files to remote nodes...
Instantiating the file "/u01/app/oracle/product/11.2.0/dbhome_1/.patch_storage/13817131_Sep_25_2012_08_46_38/rac/copy_dirs.txt.instantiated" by replacing $ORACLE_HOME in "/u01/app/oracle/product/11.2.0/dbhome_1/.patch_storage/13817131_Sep_25_2012_08_46_38/rac/copy_dirs.txt" with actual path.
Propagating directories to remote nodes...
Installing and enabling the online patch 'bug13817131.pch', on database 'ORCL2' on node 'labr1'.

Patch 13817131 successfully applied
Log file location: /u01/app/oracle/product/11.2.0/dbhome_1/cfgtoollogs/opatch/13817131_Sep_27_2012_12_55_35/apply2012-09-27_12-55-35PM_1.log

OPatch succeeded.
[oracle@labr2 13817131]$
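
The -connectString format is SID:User:Password:Node (empty user and password, as above, means OS authentication). Should anything go wrong, the patch can be taken out online as well; a sketch, assuming the rollback syntax mirrors the apply and using the same lab names:

[oracle@labr2 13817131]$ opatch rollback -id 13817131 -connectString ORCL2:::labr1,ORCL1:::labr2
[oracle@labr2 13817131]$ opatch lsinventory | grep 13817131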

Meanwhile, in the alert log:
Thu Sep 27 12:56:09 2012
Patch file bug13817131.pch is out of sync with oracle binary; performing fixup
Patch file bug13817131.pch has been synced with oracle binary
Patch bug13817131.pch Installed - Update #1
Patch bug13817131.pch Enabled - Update #2
Thu Sep 27 12:56:11 2012
Online patch bug13817131.pch has been installed
Online patch bug13817131.pch has been enabled

Technical details follow; as you can see, the dynamic library is mapped into the running binaries without any interruption. Presumably some functions are also being redirected to this dynamic library:

[root@labr2 ~]# ps -ef | grep pmon
oracle    2618     1  0 11:29 ?        00:00:01 asm_pmon_+ASM1
oracle    3013     1  0 11:30 ?        00:00:01 ora_pmon_ORCL1
root      4947  3844  0 13:02 pts/0    00:00:00 grep pmon
[root@labr2 ~]# lsof -p 3013 |grep bug
oracle  3013 oracle  mem    REG             202,17    217221 1183112 /u01/app/oracle/product/11.2.0/dbhome_1/hpatch/bug13817131.so
[root@labr2 ~]#
[root@labr2 ~]# pmap -d 3013|grep bug
00007fcb51e8c000     168 r-x-- 0000000000000000 0ca:00011 bug13817131.so
00007fcb51eb6000    1020 ----- 000000000002a000 0ca:00011 bug13817131.so
00007fcb51fb5000      12 rwx-- 0000000000029000 0ca:00011 bug13817131.so
[root@labr2 ~]#
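
Out of curiosity, the replacement code shipped in the hot-patch library can be inspected with standard binutils as well (a sketch; which symbols actually show up obviously depends on the patch):

[root@labr2 ~]# nm -D /u01/app/oracle/product/11.2.0/dbhome_1/hpatch/bug13817131.so | grep -i ksfq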