OK, this is part II of the post PowerHA failure scenario when dealing with SAN-booted LPARs – part I.
The first scenario we performed was to disable 100% of the MPIO storage paths to the active HACMP node by un-mapping the Virtual FibreChannel adapters (vadapters) on both VIOS protecting the active node (LPAR). On both VIOS we ran the following command:
$ vfcmap -vadapter vfchostXYZ -fcp
where vfchostXYZ is the server-side (or better: VIOS-side) vadapter handling the FC I/O traffic for this LPAR. The result? After some time the LPAR holding the active HACMP/PowerHA Resource Groups (jkwha001d) evicted itself from the HACMP cluster, and the old passive node jkwha002d became the active one (Oracle started there).
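For reference, the vfchost serving a particular LPAR can be identified on the VIOS before unmapping, and the mapping can later be restored by pointing vfcmap back at a physical FC port. A hedged sketch (the grep pattern and the fcsX adapter name are illustrative):

$ lsmap -all -npiv | grep -p jkwha001d
$ vfcmap -vadapter vfchostXYZ -fcp fcsX

Back on the failed node, the root cause is visible in errpt on jkwha001d: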
root@jkwha001d : /home/root :# errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
AFA89905   0103071611 I O grpsvcs        Group Services daemon started
97419D60   0103071611 I O topsvcs        Topology Services daemon started
A6DF45AA   0103071111 I O RMCdaemon      The daemon is started.
67145A39   0103070911 U S SYSDUMP        SYSTEM DUMP
F48137AC   0103070911 U O minidump       COMPRESSED MINIMAL DUMP
DC73C03A   0103070911 T S fscsi0         SOFTWARE PROGRAM ERROR
9DBCFDEE   0103070911 T O errdemon       ERROR LOGGING TURNED ON
26623394   0103070511 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070511 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070511 T H fscsi1         COMMUNICATION PROTOCOL ERROR
(..)
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk4         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk8         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk6         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk3         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk2         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk0         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DE3B8540   0103070411 P H hdisk5         PATH HAS FAILED
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
(..)
A39F8A49   1210235210 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
D5676F6F   1210235210 T H fscsi1         ATTACHED SCSI TARGET DEVICE ERROR
DE3B8540   1210235210 P H hdisk6         PATH HAS FAILED
D5676F6F   1210235210 T H fscsi0         ATTACHED SCSI TARGET DEVICE ERROR
DE3B8540   1210235210 P H hdisk1         PATH HAS FAILED
D5676F6F   1210235210 T H fscsi1         ATTACHED SCSI TARGET DEVICE ERROR
(..)
As you can see, AIX 6.1 generated a “SYSTEM DUMP”, which is actually a system panic indicator. Normally AIX saves a memory image of its kernel at run time to Logical Volumes configured with the “sysdump” type and then reboots, which allows investigating why the machine crashed. Here, however, you won’t find anything: FC storage connectivity was lost (even rootvg), so the dump could not even be written. So where does the entry come from? Probably the crash state of the LPAR is preserved somewhere in the POWERVM firmware area; it’s one of the RAS things in AIX/POWER. So far, so good: HACMP/PowerHA was able to recover the services…
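As a side note on the dump configuration mentioned above: where a dump would normally land can be checked with sysdumpdev. An illustrative session (the device names shown are typical defaults, not taken from this cluster):

root@jkwha001d:# sysdumpdev -l
primary              /dev/lg_dumplv
secondary            /dev/sysdumpnull
copy directory       /var/adm/ras
forced copy flag     TRUE
always allow dump    FALSE
dump compression     ON

Of course, if the primary dump device sits on the very same SAN-backed rootvg that just disappeared, there is nowhere for the dump to go.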
OK, but we wanted real proof, so we performed a double-check. We suspected that the VIOS might have some magic way to communicate with the LPAR. We wanted to exclude that factor, so we simulated disconnecting the FC at the storage level. The initial state was a stable HACMP cluster, with the RG active on jkwha002d (and jkwha001d being passive); all MPIO paths to the NetApp cluster (storage1, storage2) were reported as “Enabled” by lspath. The igroup term in NetApp parlance means “initiator group”, and it is responsible for things like LUN masking. If you remove some FC WWPNs from a LUN’s access list, you end up in a situation in which AIX has hdisks pointing at non-existing SCSI targets, and the AIX SCSI stack gets a lot of errors (FC is just a wrapper around SCSI).
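Before cutting anything, you can inspect which initiator WWPNs an igroup currently grants access to. A hedged illustration (the output layout is from memory of Data ONTAP 7-mode and the login ports are made up; the WWPNs are the NPIV client adapters used later in this test):

storage2> igroup show jkwha002d_boot
    jkwha002d_boot (FCP) (ostype: aix):
        c0:50:76:00:61:20:00:de (logged in on: 0c)
        c0:50:76:00:61:20:00:e0 (logged in on: 0d)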
On the 2nd storage controller (storage2) the igroup jkwha002d_boot was controlling access to the OS-level LUNs (rootvg, etc.):
storage2> lun show -g jkwha002d_boot
        /vol/jkwha002d_boot/boot/rootvg                40.0g (42953867264)   (r/w, online, mapped)
storage2>
Also on the 2nd storage controller there was an igroup responsible for the SAP and Oracle LUNs:
storage2> lun show -g jkwha002d_cluster
        /vol/sap_jw1_mirrlog/mirrlog/mirrlog01.lun      1g (1073741824)      (r/w, online, mapped)
        /vol/sap_jw1_oraarch/oraarch/oraarch01.lun     51g (54760833024)     (r/w, online, mapped)
        /vol/sap_jw1_sapdata/sapdata2/sapdata02.lun    15g (16106127360)     (r/w, online, mapped)
storage2>
The same igroup was present on the 1st storage system, controlling the remaining LUNs (yes, this is an active-active configuration, where reads/writes are managed by AIX LVM, with VGs consisting of LUNs from both controllers):
storage1> lun show -g jkwha002d_cluster
        /vol/clus_001d_002d_hb/hb/hbvg                 20m (20971520)        (r/w, online, mapped)
        /vol/sap_jw1_origlog/origlog/origlog01.lun      1g (1073741824)      (r/w, online, mapped)
        /vol/sap_jw1_sapbin/sapbin/sapbin01.lun       100g (107374182400)    (r/w, online, mapped)
        /vol/sap_jw1_sapdata/sapdata1/sapdata01.lun    15g (16106127360)     (r/w, online, mapped)
storage1>
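By the way, on the AIX side the LUN-to-hdisk relationship can be cross-checked from the host, if the NetApp Host Utilities are installed (sanlun ships with the Host Utilities, not with base AIX):

root@jkwha002d:# sanlun lun show

This prints each controller LUN path together with the local hdisk device it backs, so you know exactly which paths you are about to kill.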
Let’s do it:
storage1> igroup remove
usage: igroup remove [ -f ] <initiator_group> <node> ...
  - removes node(s) from an initiator group. The node may also
    specified as its alias. See "fcp wwpn-alias" for help with aliases
    By default, if lun maps exist for the initiator group, the node is
    not removed from the group. The -f option can be used to force the
    removal.
For more information, try 'man na_igroup'
storage1>
storage1> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:de
storage1> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:e0
storage2> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:de
storage2> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:e0
storage2> igroup remove -f jkwha002d_boot c0:50:76:00:61:20:00:de
storage2> igroup remove -f jkwha002d_boot c0:50:76:00:61:20:00:e0
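(For completeness: once the test was over, access could be restored simply by adding the same WWPNs back to each igroup; igroup add takes the same arguments as igroup remove:)

storage1> igroup add jkwha002d_cluster c0:50:76:00:61:20:00:de
storage1> igroup add jkwha002d_cluster c0:50:76:00:61:20:00:e0
storage2> igroup add jkwha002d_cluster c0:50:76:00:61:20:00:de
storage2> igroup add jkwha002d_cluster c0:50:76:00:61:20:00:e0
storage2> igroup add jkwha002d_boot c0:50:76:00:61:20:00:de
storage2> igroup add jkwha002d_boot c0:50:76:00:61:20:00:e0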
.. and HACMP failover won’t work. The active LPAR node (jkwha002d) is going to end up in a zombie state. If you had SSH sessions opened beforehand, everything would appear to be working: ls on /etc succeeds, and so do some command-line utilities, but only because those things are cached in AIX’s filesystem cache. Everything else is going to fail with I/O errors, and lspath will cry too (it won’t display all MPIO paths as failed, but that is a story for another post). Examples:
root@jkwha002d:# topas
bash: /usr/bin/topas: There is an input or output error.
root@jkwha002d:# nmon
bash: /usr/bin/nmon: There is an input or output error.
root@jkwha002d:# lspath
Failed  hdisk0 fscsi1
Failed  hdisk1 fscsi1
Enabled hdisk0 fscsi1
Failed  hdisk1 fscsi1
Failed  hdisk0 fscsi0
Failed  hdisk1 fscsi0
Failed  hdisk0 fscsi0
Enabled hdisk1 fscsi0
Failed  hdisk2 fscsi0
Failed  hdisk2 fscsi0
Failed  hdisk2 fscsi0
Failed  hdisk2 fscsi0
Enabled hdisk2 fscsi1
Failed  hdisk2 fscsi1
Failed  hdisk2 fscsi1
Failed  hdisk2 fscsi1
Failed  hdisk3 fscsi0
Failed  hdisk4 fscsi0
Failed  hdisk5 fscsi0
Failed  hdisk6 fscsi0
Failed  hdisk7 fscsi0
Failed  hdisk8 fscsi0
Failed  hdisk3 fscsi0
Failed  hdisk4 fscsi0
Failed  hdisk5 fscsi0
Failed  hdisk6 fscsi0
Failed  hdisk7 fscsi0
Failed  hdisk8 fscsi0
Failed  hdisk3 fscsi0
Failed  hdisk4 fscsi0
Failed  hdisk5 fscsi0
Failed  hdisk6 fscsi0
Failed  hdisk7 fscsi0
Failed  hdisk8 fscsi0
Failed  hdisk3 fscsi0
Failed  hdisk4 fscsi0
Failed  hdisk5 fscsi0
Failed  hdisk6 fscsi0
Failed  hdisk7 fscsi0
Failed  hdisk8 fscsi0
Failed  hdisk3 fscsi1
Failed  hdisk4 fscsi1
Enabled hdisk5 fscsi1
Enabled hdisk6 fscsi1
Failed  hdisk7 fscsi1
Failed  hdisk8 fscsi1
Enabled hdisk3 fscsi1
Enabled hdisk4 fscsi1
Failed  hdisk5 fscsi1
Failed  hdisk6 fscsi1
Failed  hdisk7 fscsi1
Failed  hdisk8 fscsi1
Failed  hdisk3 fscsi1
Failed  hdisk4 fscsi1
Failed  hdisk5 fscsi1
Failed  hdisk6 fscsi1
Enabled hdisk7 fscsi1
Failed  hdisk8 fscsi1
Failed  hdisk3 fscsi1
Failed  hdisk4 fscsi1
Failed  hdisk5 fscsi1
Failed  hdisk6 fscsi1
Failed  hdisk7 fscsi1
Enabled hdisk8 fscsi1
Missing hdisk9 fscsi0
Missing hdisk9 fscsi0
Missing hdisk9 fscsi1
Missing hdisk9 fscsi1
root@jkwha002d:# errpt | more
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
B6267342   0103074511 P H hdisk0         DISK OPERATION ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
B6267342   0103074511 P H hdisk0         DISK OPERATION ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
EA88F829   0103074511 I O SYSJ2          USER DATA I/O ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
00B984B3   0103074511 U H hdisk0         UNDETERMINED ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
B6267342   0103074511 P H hdisk0         DISK OPERATION ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
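A convenience one-liner to summarize path states per status (standard AIX/POSIX tools only; though on a zombie node like the above even awk may fail to load, which is rather the point of this post):

root@jkwha002d:# lspath | awk '{ print $1 }' | sort | uniq -c

This counts how many paths are Enabled, Failed, or Missing.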
Oracle is going to panic (which is good behaviour in my opinion):
root@jkwha002d:# tail alert_JW1.log
Mon Jan 3 07:39:21 2011
Errors in file /oracle/JW1/saptrace/background/jw1_lgwr_831500.trc:
ORA-00345: redo log write error block 7 count 1
ORA-00312: online log 13 thread 1: '/oracle/JW1/origlogA/log_g13m1.dbf'
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: -1
Additional information: 512
Mon Jan 3 07:39:23 2011
Errors in file /oracle/JW1/saptrace/background/jw1_ckpt_368764.trc:
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogA/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogB/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
Mon Jan 3 07:39:23 2011
Errors in file /oracle/JW1/saptrace/background/jw1_ckpt_368764.trc:
ORA-00221: error on write to control file
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogA/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogB/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
Mon Jan 3 07:39:23 2011
CKPT: terminating instance due to error 221
Instance terminated by CKPT, pid = 368764
root@jkwha002d:#
but what’s most interesting is that Oracle asks AIX to write all of this to the alert log file, and it is readable with commands like tail, yet cat won’t work (you won’t be able to read the whole alert log file, because you don’t have I/O!). What’s even more interesting is that you won’t see those messages after rebooting, once the kernel memory is gone! If you don’t have I/O, how are you going to write/fsync this file? You aren’t: the writes land only in the filesystem page cache as dirty pages that can never be flushed to disk, so they vanish with the reboot.
Another thing is that the active HACMP node will stay active: it will keep handling its Resource Groups, etc., and failover won’t happen. A possible solution to this problem would be a kernel-based check that verifies that at least /etc is accessible. Why kernel-based? Because you need some form of heartbeat pinned in memory (an AIX kernel module, or an unswappable binary always present in RAM and running at real-time scheduling priority) that would continuously probe the filesystem. If the probe failed several times in a row, it should reboot the node, which would trigger Resource Group failover to the 2nd node in the cluster.
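A minimal sketch of that idea follows. This is not tested AIX code: the probe file, the thresholds, and the Linux-style reboot(2)/mlockall/sched_setscheduler calls are all assumptions; on AIX the pinning and the forced halt would need platform-specific equivalents (e.g. a kernel extension).

/*
 * iohb.c -- sketch of a pinned-in-RAM I/O heartbeat (assumptions noted above).
 * Start it while the storage is still healthy; afterwards nothing new can
 * be loaded from disk.
 */
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/reboot.h>   /* Linux reboot(2); AIX would need another mechanism */

#define PROBE_FILE   "/etc/hosts"  /* assumption: any small file on rootvg */
#define MAX_FAILURES 5             /* assumption: tune for your environment */
#define INTERVAL_SEC 2

/* Return 0 if the file can be opened and read, -1 otherwise. Reopening on
 * every probe forces at least a path lookup; a fully honest probe would
 * bypass the page cache (e.g. O_DIRECT), omitted here for portability. */
static int probe(void)
{
    char buf[64];
    int fd = open(PROBE_FILE, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, sizeof(buf));
    close(fd);
    return n > 0 ? 0 : -1;
}

int main(void)
{
    int failures = 0;
    struct sched_param sp = { .sched_priority = 1 };

    /* Pin all current and future pages so the watchdog itself cannot be
     * paged out once the root disks disappear. */
    mlockall(MCL_CURRENT | MCL_FUTURE);

    /* Real-time priority so a dying I/O subsystem cannot starve us. */
    sched_setscheduler(0, SCHED_FIFO, &sp);

    for (;;) {
        if (probe() != 0)
            failures++;
        else
            failures = 0;

        if (failures >= MAX_FAILURES) {
            /* Deliberately NOT exec'ing /sbin/halt -- it may no longer be
             * readable. The reboot syscall needs only code that is already
             * in kernel memory. */
            sync();
            reboot(RB_AUTOBOOT);
        }
        sleep(INTERVAL_SEC);
    }
}

The important design choice is that once the probe starts failing, the watchdog must not touch the filesystem at all: everything it needs is already locked in RAM, and the reboot path lives in kernel memory.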
Note: typical HACMP scripting is, at least in theory, not enough: even if it forced running /sbin/halt, how can you be sure that /sbin/halt and all the libc contents it requires are in memory?
Perhaps I’ll write a part III of this article…
-Jakub.