OK, this is part II of the post on PowerHA failure scenarios when dealing with SAN-booted LPARs (see part I).
The first scenario we performed was to disable 100% of the MPIO storage paths to the active HACMP node by un-mapping the Virtual FibreChannel adapters (vadapters) on both VIOSes protecting the active node (LPAR). On both VIOSes we ran the following command:
$ vfcmap -vadapter vfchostXYZ -fcp
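As a side note: the right vfchost adapter for a given client LPAR can be looked up on the VIOS via its NPIV mappings, and mapped back after the test. A minimal sketch (vfchost0 and fcs0 are placeholder names here):
$ lsmap -all -npiv                     # list all virtual FC (NPIV) mappings on this VIOS
$ lsmap -vadapter vfchost0 -npiv       # show the client LPAR and backing FC port for one vadapter
$ vfcmap -vadapter vfchost0 -fcp       # un-map: detach vfchost0 from any physical FC port
$ vfcmap -vadapter vfchost0 -fcp fcs0  # re-map vfchost0 back to physical port fcs0 afterwards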
The vfchostXYZ here was the server-side (or rather VIOS-side) vadapter handling FC I/O traffic for this LPAR. The result? After some time the LPAR holding the active HACMP/PowerHA Resource Groups (jkwha001d) evicted itself from the HACMP cluster, and the formerly passive node jkwha002d became the active one (Oracle started there). The root cause is visible in the error report on the jkwha001d LPAR:
root@jkwha001d : /home/root :# errpt
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
AFA89905 0103071611 I O grpsvcs Group Services daemon started
97419D60 0103071611 I O topsvcs Topology Services daemon started
A6DF45AA 0103071111 I O RMCdaemon The daemon is started.
67145A39 0103070911 U S SYSDUMP SYSTEM DUMP
F48137AC 0103070911 U O minidump COMPRESSED MINIMAL DUMP
DC73C03A 0103070911 T S fscsi0 SOFTWARE PROGRAM ERROR
9DBCFDEE 0103070911 T O errdemon ERROR LOGGING TURNED ON
26623394 0103070511 T H fscsi1 COMMUNICATION PROTOCOL ERROR
DC73C03A 0103070511 T S fscsi1 SOFTWARE PROGRAM ERROR
26623394 0103070511 T H fscsi1 COMMUNICATION PROTOCOL ERROR
(..)
DC73C03A 0103070411 T S fscsi1 SOFTWARE PROGRAM ERROR
26623394 0103070411 T H fscsi1 COMMUNICATION PROTOCOL ERROR
DC73C03A 0103070411 T S fscsi1 SOFTWARE PROGRAM ERROR
DC73C03A 0103070411 T S fscsi1 SOFTWARE PROGRAM ERROR
DE3B8540 0103070411 P H hdisk4 PATH HAS FAILED
26623394 0103070411 T H fscsi1 COMMUNICATION PROTOCOL ERROR
DC73C03A 0103070411 T S fscsi1 SOFTWARE PROGRAM ERROR
26623394 0103070411 T H fscsi1 COMMUNICATION PROTOCOL ERROR
DC73C03A 0103070411 T S fscsi1 SOFTWARE PROGRAM ERROR
26623394 0103070411 T H fscsi1 COMMUNICATION PROTOCOL ERROR
DC73C03A 0103070411 T S fscsi1 SOFTWARE PROGRAM ERROR
26623394 0103070411 T H fscsi1 COMMUNICATION PROTOCOL ERROR
DC73C03A 0103070411 T S fscsi1 SOFTWARE PROGRAM ERROR
26623394 0103070411 T H fscsi1 COMMUNICATION PROTOCOL ERROR
DC73C03A 0103070411 T S fscsi1 SOFTWARE PROGRAM ERROR
26623394 0103070411 T H fscsi1 COMMUNICATION PROTOCOL ERROR
DC73C03A 0103070411 T S fscsi1 SOFTWARE PROGRAM ERROR
26623394 0103070411 T H fscsi1 COMMUNICATION PROTOCOL ERROR
DC73C03A 0103070411 T S fscsi1 SOFTWARE PROGRAM ERROR
26623394 0103070411 T H fscsi1 COMMUNICATION PROTOCOL ERROR
DC73C03A 0103070411 T S fscsi1 SOFTWARE PROGRAM ERROR
26623394 0103070411 T H fscsi1 COMMUNICATION PROTOCOL ERROR
DC73C03A 0103070411 T S fscsi1 SOFTWARE PROGRAM ERROR
DE3B8540 0103070411 P H hdisk8 PATH HAS FAILED
26623394 0103070411 T H fscsi1 COMMUNICATION PROTOCOL ERROR
DC73C03A 0103070411 T S fscsi1 SOFTWARE PROGRAM ERROR
DE3B8540 0103070411 P H hdisk6 PATH HAS FAILED
26623394 0103070411 T H fscsi1 COMMUNICATION PROTOCOL ERROR
DC73C03A 0103070411 T S fscsi1 SOFTWARE PROGRAM ERROR
DE3B8540 0103070411 P H hdisk3 PATH HAS FAILED
26623394 0103070411 T H fscsi1 COMMUNICATION PROTOCOL ERROR
DC73C03A 0103070411 T S fscsi1 SOFTWARE PROGRAM ERROR
DE3B8540 0103070411 P H hdisk2 PATH HAS FAILED
26623394 0103070411 T H fscsi1 COMMUNICATION PROTOCOL ERROR
DC73C03A 0103070411 T S fscsi1 SOFTWARE PROGRAM ERROR
DE3B8540 0103070411 P H hdisk0 PATH HAS FAILED
26623394 0103070411 T H fscsi1 COMMUNICATION PROTOCOL ERROR
DE3B8540 0103070411 P H hdisk5 PATH HAS FAILED
DC73C03A 0103070411 T S fscsi1 SOFTWARE PROGRAM ERROR
26623394 0103070411 T H fscsi1 COMMUNICATION PROTOCOL ERROR
DC73C03A 0103070411 T S fscsi1 SOFTWARE PROGRAM ERROR
(..)
A39F8A49 1210235210 T S syserrlg ERROR LOGGING BUFFER OVERFLOW
D5676F6F 1210235210 T H fscsi1 ATTACHED SCSI TARGET DEVICE ERROR
DE3B8540 1210235210 P H hdisk6 PATH HAS FAILED
D5676F6F 1210235210 T H fscsi0 ATTACHED SCSI TARGET DEVICE ERROR
DE3B8540 1210235210 P H hdisk1 PATH HAS FAILED
D5676F6F 1210235210 T H fscsi1 ATTACHED SCSI TARGET DEVICE ERROR
(..)
As you can see, AIX 6.1 generated a “SYSTEM DUMP”, which is actually a system panic indicator. Normally AIX saves the memory image of its kernel at runtime to the Logical Volumes configured with the “sysdump” type and then reboots. This allows investigating why the machine crashed, but here you won’t find anything: the LPAR lost FC storage connectivity (even to rootvg), so it couldn’t even save the dump. So how does it know a dump happened at all? Probably the crash state of the LPAR is saved somewhere in the POWERVM firmware area. It’s one of the RAS features of AIX/POWER. So far, so good: HACMP/PowerHA was able to recover the services…
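If you want to check where such a dump would normally land, the dump devices can be inspected on the LPAR (illustrative only; device names will differ per system):
root@jkwha001d:# sysdumpdev -l     # list the configured primary and secondary dump devices
root@jkwha001d:# sysdumpdev -L     # show details of the most recent system dump, if any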
OK, but we wanted real proof, so we performed a double-check. We suspected VIOS might have some magic way to communicate with the LPAR, and we wanted to exclude that factor, so we simulated disconnecting the FC at the storage level. The initial state was a stable HACMP cluster, with the RG active on jkwha002d (and jkwha001d passive); all MPIO paths to the Netapp cluster (storage1, storage2) were reported as “Enabled” by “lspath”. The igroup term in the Netapp world means “initiator group” and is responsible for things like LUN masking. If you remove some FC WWPNs from the igroups guarding access to the LUNs, AIX ends up with hdisks pointing to non-existent SCSI devices, and the AIX SCSI stack gets a lot of errors (FC is just a wrapper around SCSI).
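To see which initiator WWPNs are currently members of a group before removing them, Data ONTAP can list the igroup contents (7-mode syntax, matching the igroup names used below):
storage2> igroup show jkwha002d_boot
storage2> igroup show jkwha002d_cluster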
On the 2nd storage controller (storage2) the igroup jkwha002d_boot was controlling access to the OS-level LUNs (rootvg, etc.):
storage2> lun show -g jkwha002d_boot
/vol/jkwha002d_boot/boot/rootvg 40.0g (42953867264) (r/w, online, mapped)
storage2>
Also on the 2nd storage controller there was an igroup responsible for the SAP and Oracle LUNs:
storage2> lun show -g jkwha002d_cluster
/vol/sap_jw1_mirrlog/mirrlog/mirrlog01.lun 1g (1073741824) (r/w, online, mapped)
/vol/sap_jw1_oraarch/oraarch/oraarch01.lun 51g (54760833024) (r/w, online, mapped)
/vol/sap_jw1_sapdata/sapdata2/sapdata02.lun 15g (16106127360) (r/w, online, mapped)
storage2>
The same igroup was present on the 1st storage system, controlling the remaining LUNs (yes, this is an active-active configuration, where reads and writes are managed by AIX LVM volume groups built on LUNs from the two controllers):
storage1> lun show -g jkwha002d_cluster
/vol/clus_001d_002d_hb/hb/hbvg 20m (20971520) (r/w, online, mapped)
/vol/sap_jw1_origlog/origlog/origlog01.lun 1g (1073741824) (r/w, online, mapped)
/vol/sap_jw1_sapbin/sapbin/sapbin01.lun 100g (107374182400) (r/w, online, mapped)
/vol/sap_jw1_sapdata/sapdata1/sapdata01.lun 15g (16106127360) (r/w, online, mapped)
storage1>
Let’s do it:
storage1> igroup remove
usage:
igroup remove [ -f ] <initiator_group> <node> ...
- removes node(s) from an initiator group. The node may also
specified as its alias. See "fcp wwpn-alias" for help with aliases
By default, if lun maps exist for the initiator group, the node is
not removed from the group. The -f option can be used to force the
removal.
For more information, try 'man na_igroup'
storage1>
storage1> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:de
storage1> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:e0
storage2> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:de
storage2> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:e0
storage2> igroup remove -f jkwha002d_boot c0:50:76:00:61:20:00:de
storage2> igroup remove -f jkwha002d_boot c0:50:76:00:61:20:00:e0
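To roll the test back later, the same WWPNs can simply be re-added to the igroups (7-mode igroup add, sketched here for one group only):
storage1> igroup add jkwha002d_cluster c0:50:76:00:61:20:00:de
storage1> igroup add jkwha002d_cluster c0:50:76:00:61:20:00:e0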
.. and HACMP failover won’t work. The active LPAR node (jkwha002d) ends up in a zombie state. If you had SSH sessions open beforehand, everything seems to work: ls on /etc succeeds, some command-line utilities run too. But that’s only because those things are cached in AIX’s filesystem cache. Everything else is going to fail with I/O errors, and lspath will cry too (it won’t display all MPIO paths as failed, but that is a story for another post). Examples:
root@jkwha002d:# topas
bash: /usr/bin/topas: There is an input or output error.
root@jkwha002d:# nmon
bash: /usr/bin/nmon: There is an input or output error.
root@jkwha002d:# lspath
Failed hdisk0 fscsi1
Failed hdisk1 fscsi1
Enabled hdisk0 fscsi1
Failed hdisk1 fscsi1
Failed hdisk0 fscsi0
Failed hdisk1 fscsi0
Failed hdisk0 fscsi0
Enabled hdisk1 fscsi0
Failed hdisk2 fscsi0
Failed hdisk2 fscsi0
Failed hdisk2 fscsi0
Failed hdisk2 fscsi0
Enabled hdisk2 fscsi1
Failed hdisk2 fscsi1
Failed hdisk2 fscsi1
Failed hdisk2 fscsi1
Failed hdisk3 fscsi0
Failed hdisk4 fscsi0
Failed hdisk5 fscsi0
Failed hdisk6 fscsi0
Failed hdisk7 fscsi0
Failed hdisk8 fscsi0
Failed hdisk3 fscsi0
Failed hdisk4 fscsi0
Failed hdisk5 fscsi0
Failed hdisk6 fscsi0
Failed hdisk7 fscsi0
Failed hdisk8 fscsi0
Failed hdisk3 fscsi0
Failed hdisk4 fscsi0
Failed hdisk5 fscsi0
Failed hdisk6 fscsi0
Failed hdisk7 fscsi0
Failed hdisk8 fscsi0
Failed hdisk3 fscsi0
Failed hdisk4 fscsi0
Failed hdisk5 fscsi0
Failed hdisk6 fscsi0
Failed hdisk7 fscsi0
Failed hdisk8 fscsi0
Failed hdisk3 fscsi1
Failed hdisk4 fscsi1
Enabled hdisk5 fscsi1
Enabled hdisk6 fscsi1
Failed hdisk7 fscsi1
Failed hdisk8 fscsi1
Enabled hdisk3 fscsi1
Enabled hdisk4 fscsi1
Failed hdisk5 fscsi1
Failed hdisk6 fscsi1
Failed hdisk7 fscsi1
Failed hdisk8 fscsi1
Failed hdisk3 fscsi1
Failed hdisk4 fscsi1
Failed hdisk5 fscsi1
Failed hdisk6 fscsi1
Enabled hdisk7 fscsi1
Failed hdisk8 fscsi1
Failed hdisk3 fscsi1
Failed hdisk4 fscsi1
Failed hdisk5 fscsi1
Failed hdisk6 fscsi1
Failed hdisk7 fscsi1
Enabled hdisk8 fscsi1
Missing hdisk9 fscsi0
Missing hdisk9 fscsi0
Missing hdisk9 fscsi1
Missing hdisk9 fscsi1
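By the way, lspath can also select paths by status, which is handy when the full listing gets this long (standard lspath flags):
root@jkwha002d:# lspath -s failed      # show only the paths currently marked Failed
root@jkwha002d:# lspath -s enabled     # show only the paths still marked Enabled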
root@jkwha002d:# errpt | more
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
E86653C3 0103074511 P H LVDD I/O ERROR DETECTED BY LVM
A39F8A49 0103074511 T S syserrlg ERROR LOGGING BUFFER OVERFLOW
E86653C3 0103074511 P H LVDD I/O ERROR DETECTED BY LVM
A39F8A49 0103074511 T S syserrlg ERROR LOGGING BUFFER OVERFLOW
E86653C3 0103074511 P H LVDD I/O ERROR DETECTED BY LVM
A39F8A49 0103074511 T S syserrlg ERROR LOGGING BUFFER OVERFLOW
B6267342 0103074511 P H hdisk0 DISK OPERATION ERROR
A39F8A49 0103074511 T S syserrlg ERROR LOGGING BUFFER OVERFLOW
E86653C3 0103074511 P H LVDD I/O ERROR DETECTED BY LVM
A39F8A49 0103074511 T S syserrlg ERROR LOGGING BUFFER OVERFLOW
E86653C3 0103074511 P H LVDD I/O ERROR DETECTED BY LVM
A39F8A49 0103074511 T S syserrlg ERROR LOGGING BUFFER OVERFLOW
B6267342 0103074511 P H hdisk0 DISK OPERATION ERROR
A39F8A49 0103074511 T S syserrlg ERROR LOGGING BUFFER OVERFLOW
EA88F829 0103074511 I O SYSJ2 USER DATA I/O ERROR
A39F8A49 0103074511 T S syserrlg ERROR LOGGING BUFFER OVERFLOW
E86653C3 0103074511 P H LVDD I/O ERROR DETECTED BY LVM
A39F8A49 0103074511 T S syserrlg ERROR LOGGING BUFFER OVERFLOW
00B984B3 0103074511 U H hdisk0 UNDETERMINED ERROR
A39F8A49 0103074511 T S syserrlg ERROR LOGGING BUFFER OVERFLOW
B6267342 0103074511 P H hdisk0 DISK OPERATION ERROR
A39F8A49 0103074511 T S syserrlg ERROR LOGGING BUFFER OVERFLOW
E86653C3 0103074511 P H LVDD I/O ERROR DETECTED BY LVM
A39F8A49 0103074511 T S syserrlg ERROR LOGGING BUFFER OVERFLOW
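Each of those error identifiers can be expanded for detail, e.g. the LVM write failures (errpt -a prints the full entries, -j filters by the identifier from the first column):
root@jkwha002d:# errpt -a -j E86653C3 | more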
Oracle is going to panic (which is good behaviour in my opinion):
root@jkwha002d:# tail alert_JW1.log
Mon Jan 3 07:39:21 2011
Errors in file /oracle/JW1/saptrace/background/jw1_lgwr_831500.trc:
ORA-00345: redo log write error block 7 count 1
ORA-00312: online log 13 thread 1: '/oracle/JW1/origlogA/log_g13m1.dbf'
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: -1
Additional information: 512
Mon Jan 3 07:39:23 2011
Errors in file /oracle/JW1/saptrace/background/jw1_ckpt_368764.trc:
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogA/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogB/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
Mon Jan 3 07:39:23 2011
Errors in file /oracle/JW1/saptrace/background/jw1_ckpt_368764.trc:
ORA-00221: error on write to control file
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogA/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogB/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
Mon Jan 3 07:39:23 2011
CKPT: terminating instance due to error 221
Instance terminated by CKPT, pid = 368764
root@jkwha002d:#
But what’s most interesting is that Oracle asks AIX to write those messages to the alert log file, and they become readable by commands like “tail”, yet “cat” won’t work (you won’t be able to read the whole alert log file, because you don’t have I/O!). What’s even more interesting is that you won’t see those messages after rebooting, once the kernel memory is gone. If you don’t have I/O, how are you going to write/fsync this file? You aren’t: the writes land only in the filesystem cache in RAM and are never flushed to disk.
Another thing: the active HACMP node will still consider itself active, it will keep holding the Resource Groups, etc. Failover won’t happen. A possible solution to this problem would be a kernel-based check that verifies that at least /etc (or better, the raw rootvg devices) is accessible. Why kernel-based? Because you need some form of heartbeat pinned in memory (an AIX kernel module, or an uncachable binary always present in RAM and running at real-time scheduling priority) that would continuously probe storage. If the probe failed several times in a row, it should reboot the node, which would trigger Resource Group failover to the 2nd node in the cluster. A rough sketch of the idea follows below.
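Purely as an illustration of that idea, here is a user-space sketch (not tested on AIX; the device path, thresholds, and the Linux-style reboot(RB_AUTOBOOT) call are placeholder assumptions, and a real implementation would use the platform’s own reboot interface or live in the kernel). The probe reads a raw disk device so that the filesystem cache cannot mask dead storage:

/* io_watchdog.c - sketch of a RAM-pinned storage heartbeat.
   All names below (device path, limits) are illustrative assumptions. */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/mman.h>
#include <sys/reboot.h>

#define PROBE_DEV   "/dev/rhdisk0"  /* raw rootvg disk: reads bypass the filesystem cache */
#define MAX_FAILS   5               /* consecutive failures tolerated before rebooting */
#define INTERVAL_S  10              /* seconds between probes */

static int probe(void)
{
    char buf[512];
    int fd = open(PROBE_DEV, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, sizeof(buf));   /* one raw block: fails if all paths are dead */
    close(fd);
    return (n == (ssize_t)sizeof(buf)) ? 0 : -1;
}

int main(void)
{
    struct sched_param sp = { .sched_priority = 1 };
    int fails = 0;

    /* Run at real-time priority and pin every page in RAM, so the
       watchdog itself never needs disk I/O (paging) to keep running. */
    sched_setscheduler(0, SCHED_FIFO, &sp);
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return 1;
    }

    for (;;) {
        fails = (probe() == 0) ? 0 : fails + 1;
        if (fails >= MAX_FAILS)
            reboot(RB_AUTOBOOT);   /* placeholder: force a reboot -> RG failover */
        sleep(INTERVAL_S);
    }
}

The key point is that the binary locks itself in RAM and probes the raw device, so neither the check itself nor its verdict depends on anything cached by the filesystem layer.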
Note: typical HACMP scripting is, at least in theory, not enough; even if it forced running /sbin/halt, how can you be sure that /sbin/halt and all the required libc contents are already in memory?
Perhaps I’ll write a part III of this article.
-Jakub.