Archive for the ‘PowerHA’ Category

AIX disk I/O pacing vs PowerHA

Wednesday, July 6th, 2011

Recently I’ve been playing with disk I/O pacing (technically, the minpout and maxpout parameters) on AIX 6.1. The investigation was triggered by some high I/O waits in production. Default AIX 6.1 and 7.1 systems come with maxpout set to 8193 and minpout to 4096. However, old deployments and HACMP <= 5.4.1 installation procedures still recommend a value of 33/24; just take a look at the old High Availability Cluster Multi-Processing Troubleshooting guide:
“Although the most efficient high- and low-water marks vary from system to system, an initial high-water mark of 33 and a low-water mark of 24 provides a good starting point. These settings only slightly reduce write times and consistently generate correct fallover behavior from the HACMP software.”

Additionally, IBM describes the whole disk I/O pacing mechanism in this link.
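If you want a quick sanity check of what a given box is actually running with, these tunables live as attributes of the sys0 device:

# lsattr -El sys0 -a maxpout -a minpout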

100% write results with maxpout=8193 minpout=4096 (defaults)
orion_20110706_0525_summary.txt:Maximum Large MBPS=213.55 @ Small=0 and Large=6
orion_20110706_0525_summary.txt:Maximum Small IOPS=32323 @ Small=2 and Large=0

100% write results with maxpout=33 minpout=24 (old HACMP)
orion_20110706_0552_summary.txt:Maximum Large MBPS=109.64 @ Small=0 and Large=10
orion_20110706_0552_summary.txt:Maximum Small IOPS=31203 @ Small=5 and Large=0

As you can see there is a drastic change when it comes to large sequential writes, but don’t change those parameters yet! You really need to read the following sentence from the IBM site: “The maxpout parameter specifies the number of pages that can be scheduled in the I/O state to a file before the threads are suspended.” Obviously these parameters exist to stop I/O-hungry processes from dominating the CPU. Given that the smallest page size is 4kB, and that Oracle sometimes automatically uses 64kB pages, in the best case you get 33*64kB = 2112kB of outstanding I/O per file, after which the thread doing the I/O is suspended! Ouch! And what is taking the CPU away from an active process called? An involuntary context switch…
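To repeat this math for your own workload, you can list the page sizes your particular system actually supports:

# pagesize -a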

Why is all of this important for HACMP/PowerHA? Because the software is implemented as normal shell scripts and some RSCT processes, mostly not running in kernel mode at all (at least everything below PowerHA 7.1)! The only way PowerHA/HACMP can give more CPU horsepower to the processes that need it is to give them a higher priority on the run queue; just take a look at the PRI column..

root@X: /home/root :# ps -efo "sched,nice,pid,pri,comm"|sort -n +3|head
SCH NI     PID PRI COMMAND
  - --  520286  31 hatsd
  - --  729314  38 hats_diskhb_nim
  - --  938220  38 hats_nim
  - --  622658  39 hagsd
  - 20  319502  39 clstrmgr
  - 20  348330  39 rmcd
  - 20  458874  39 gsclvmd
  - 20  508082  39 gsclvmd
  - 20  528554  39 gsclvmd
root@X: /home/root :#

.. but it appears that even that may not be enough in some situations. As you can see there is no easy answer here, and you would have to drive your exact system configuration to 99-100% CPU utilization on all POWER cores to see whether you start getting random failovers, because e.g. some HACMP process might handle its heartbeat too late due to CPU scheduling. So perhaps the way forward for 33/24 environments is to leave them somewhere between 33/24 and 8193/4096 (like the 513/256 mentioned in the IBM manuals). Another thing is that you should never have a system driven to > 80% CPU usage on all physical cores, but you should know that already :)
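If you decide to move a 33/24 environment towards that middle ground, the change itself is a one-liner against sys0 (a sketch only; benchmark the values against your own workload first, and remember that on JFS2 pacing can also be applied per filesystem via mount options):

# chdev -l sys0 -a maxpout=513 -a minpout=256
# lsattr -El sys0 -a maxpout -a minpout    # verify the new values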

PowerHA failure scenario when dealing with SAN-booted LPARs – part IV

Thursday, June 9th, 2011

If you are wondering how the story ended up, be sure to read the IBM Redbook IBM PowerHA SystemMirror 7.1 for AIX. It appears that full clustering support for LPARs having their rootvg on SAN only arrives with PowerHA 7.1 and higher…

The rootvg system event
PowerHA SystemMirror 7.1 introduces system events. These events are handled by a new subsystem called clevmgrdES. The rootvg system event allows for the monitoring of loss of access to the rootvg volume group. By default, in the case of loss of access, the event logs an entry in the system error log and reboots the system. If required, you can change this option in the SMIT menu to log only an event entry and not to reboot the system.

and later:
The rootvg system event
As discussed previously, event monitoring is now at the kernel level. The /usr/lib/drivers/phakernmgr kernel extension, which is loaded by the clevmgrdES subsystem, monitors these events for loss of rootvg. It can initiate a node restart operation if enabled to do so as shown in Figure 9-9.
PowerHA 7.1 has a new system event that is enabled by default. This new event allows for the monitoring of the loss of the rootvg volume group while the cluster node is up and running. Previous versions of PowerHA/HACMP were unable to monitor this type of loss. Also the cluster was unable to perform a failover action in the event of the loss of access to rootvg. An example is if you lose a SAN disk that is hosting the rootvg for this cluster node.
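If you have a PowerHA 7.1 cluster at hand, you can quickly check whether this machinery is in place on a node; both commands below are standard AIX/SRC tools, and clevmgrdES/phakernmgr are the names given in the Redbook quote above:

# lssrc -s clevmgrdES             # is the event manager subsystem active?
# genkex | grep phakernmgr        # is the kernel extension loaded?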

PowerHA failure scenario when dealing with SAN-booted LPARs – part III

Tuesday, April 12th, 2011

OK, this is part III of a rather long story. As shown in the previous parts the problem is really tricky: we cannot run anything in the event of losing storage (you cannot read binaries if you don’t have storage). So how does HACMP/PowerHA deal with it?

If you lose all Virtual FibreChannel storage connections, this is going to be reported as a loss of quorum for a VG in AIX’s error reporting facility. In theory this error should be propagated to the HACMP/PowerHA failover mechanism so that the RG is evacuated from the affected node. Sometimes this behaviour is called “selective failover behavior”. The technical details are pretty interesting, as they show the lack of real, solid integration between the AIX kernel and HACMP/PowerHA. The primary source of information is the AIX RAS reporting facility (see “man errpt” for more info) and the errdemon process. Errdemon takes actions based on the configuration of “errnotify” ODM objects. The installation of PowerHA adds special hooks so that messages about failing hardware/OS/probes (in this case, losing quorum on VGs) are propagated to the HACMP/PowerHA scripts. This can be verified as follows:

root@jkwha001d : /home/root :# odmget errnotify | tail -17

errnotify:
        en_pid = 0
        en_name = "clSF1"
        en_persistenceflg = 1
        en_label = "LVM_SA_QUORCLOSE"
        en_crcid = 0
        en_class = "H"
        en_type = "UNKN"
        en_alertflg = ""
        en_resource = "LVDD"
        en_rtype = "NONE"
        en_rclass = "NONE"
        en_symptom = ""
        en_err64 = ""
        en_dup = ""
        en_method = "/usr/es/sbin/cluster/diag/clreserror $9 $1 $6 $8"
root@jkwha001d : /home/root :#
root@jkwha001d : /home/root :# /usr/lib/errdemon -l
Error Log Attributes
--------------------------------------------
Log File                /var/adm/ras/errlog
Log Size                1048576 bytes
Memory Buffer Size      32768 bytes
Duplicate Removal       true
Duplicate Interval      10000 milliseconds
Duplicate Error Maximum 1000
root@jkwha001d : /home/root :#

.. so the method of recovering from losing SAN storage (also on SAN-booted hosts) seems to be the script /usr/es/sbin/cluster/diag/clreserror. Of course, in this particular case, that script is itself also located on SAN…

root@jkwha001d:# ls -l /usr/es/sbin/cluster/diag/clreserror
-r-x------    1 root     system        13813 Feb 24 2009  /usr/es/sbin/cluster/diag/clreserror
root@jkwha001d:# file /usr/es/sbin/cluster/diag/clreserror
/usr/es/sbin/cluster/diag/clreserror: shell script  - ksh (Korn shell)
root@jkwha001d:# head -45  /usr/es/sbin/cluster/diag/clreserror
#!/bin/ksh93
[..]
#
#  Name:        clreserror
#
#   Notify method in errnotify stanzas that are configured for Selective
#   Fallover triggered by (AIX) Error Notification.
#   This function is merely a wrapper for clRMupdate.
#
#   Argument validation is performed in the clRMapi.
#
#  Arguments:
#
#       $1      Error Label, corresponds to errnotify.en_label
#       $2      Sequence number of the AIX error log entry
#       $3      Resource Name, corresponds to errnotify.en_resource
#       $4      Resource class, corresponds to errnotify.en_class
#
#       Argument validation is performed in the clRMapi.
#
#       Environment:  PATH
#
root@jkwha001d:#

And rest assured that this script calls a lot of other scripts, all of which can of course be unavailable if rootvg sits on the same physical storage as the affected VG.
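You do not have to take my word for it; the header above admits the script is merely a wrapper for clRMupdate, and a quick grep shows where it hands off control:

# grep clRMupdate /usr/es/sbin/cluster/diag/clreserror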

There are actually two good findings here. The first is that if you lose all SAN-based hdisks you are going to be flooded with thousands of entries in the errpt facility. Some of those can go undetected by errdemon because the in-memory log buffer overflows. The workaround for this first issue seems trivial: just enlarge the error log buffer. This is documented here:

root@jkwha001d : /home/root :# /usr/lib/errdemon -B 1048576
0315-175 The error log memory buffer size you supplied will be rounded up
to a multiple of 4096 bytes.
root@jkwha001d : /home/root :# /usr/lib/errdemon -l
Error Log Attributes
--------------------------------------------
Log File                /var/adm/ras/errlog
Log Size                1048576 bytes
Memory Buffer Size      1048576 bytes
Duplicate Removal       true
Duplicate Interval      10000 milliseconds
Duplicate Error Maximum 1000
root@jkwha001d : /home/root :#

Additionally, it seems to be of some value to mirror some of the LVs on the affected VGs. This might add some stability to the detection of losing LVM quorum, i.e. here loglv04 is properly mirrored across 2 hdisks (PVs)…

root@jkwha001d:# lsvg -p jw1data1vg
jw1data1vg:
PV_NAME           PV STATE          TOTAL PPs   FREE PPs    FREE DISTRIBUTION
hdisk5            active            479         30          00..00..00..00..30
hdisk3            active            479         30          00..00..00..00..30
root@jkwha001d:# lsvg -l jw1data1vg
jw1data1vg:
LV NAME             TYPE       LPs     PPs     PVs  LV STATE      MOUNT POINT
jw1data1            jfs2       896     896     2    open/syncd    /oracle/JW1/sapdata1
loglv04             jfs2log    1       2       2    open/syncd    N/A
root@jkwha001d:# lslv loglv04
LOGICAL VOLUME:     loglv04                VOLUME GROUP:   jw1data1vg
LV IDENTIFIER:      0001d2c20000d9000000012b86756f99.2 PERMISSION:     read/write
VG STATE:           active/complete        LV STATE:       opened/syncd
TYPE:               jfs2log                WRITE VERIFY:   off
MAX LPs:            512                    PP SIZE:        32 megabyte(s)
COPIES:             2                      SCHED POLICY:   parallel
LPs:                1                      PPs:            2
STALE PPs:          0                      BB POLICY:      relocatable
INTER-POLICY:       minimum                RELOCATABLE:    yes
INTRA-POLICY:       middle                 UPPER BOUND:    128
MOUNT POINT:        N/A                    LABEL:          None
MIRROR WRITE CONSISTENCY: on/ACTIVE
EACH LP COPY ON A SEPARATE PV ?: yes
Serialize IO ?:     NO
root@jkwha001d:#
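For completeness: a mirrored JFS2 log LV like the one above can be created roughly as follows (a sketch using the names from this VG; -c 2 requests two copies, and the strict inter-PV allocation visible above as “EACH LP COPY ON A SEPARATE PV ?: yes” is the mklv default):

# mklv -t jfs2log -y loglv04 -c 2 jw1data1vg 1 hdisk3 hdisk5
# logform /dev/loglv04    # format the new jfs2log LV (asks for confirmation)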

The second finding is that even with all those changes there is a very high probability of failure (i.e. the PowerHA RG move won’t work). Personally, the risk is so high that to me it is nearly a guarantee. The only proper solution to this problem that I can see is to add a special handler to the err_method in the errdemon code, something like err_method = KILL_THE_NODE. KILL_THE_NODE would have to be implemented internally by the always-running errdemon process, with its memory protected from swapping (something like mlockall())… because currently it is not running that way.

root@jkwha001d:# svmon -P 143470
-------------------------------------------------------------------------------
     Pid Command          Inuse      Pin     Pgsp  Virtual 64-bit Mthrd  16MB
  143470 errdemon         24851    11028        0    24700      N     N     N

     PageSize                Inuse        Pin       Pgsp    Virtual
     s    4 KB                 323          4          0        172
     m   64 KB                1533        689          0       1533

    Vsid      Esid Type Description              PSize  Inuse   Pin Pgsp Virtual
   7c05d         d work fork tree                    m   1041   233    0    1041
                   children=4d6cdc, 0
    8002         0 work fork tree                    m    492   456    0     492
                   children=802760, 0
   4d0f7         2 work process private             sm    131     4    0     131
   35309         3 clnt mapped file,/dev/hd9var:69   s    107     0    -       -
   110c0         f work shared library data         sm     41     0    0      41
   1d023         1 clnt code,/dev/hd2:44058          s     25     0    -       -
   45115         - clnt /dev/hd2:63900               s     11     0    -       -
   3902a         - clnt /dev/hd9var:1015             s      8     0    -       -
root@jkwha001d:#

Hope IBM fixes that. There is no APAR because I’m lazy… I don’t even want to fight with the 1st line of support… BTW: I’m not the first one to notice this problem; please see here for a blog post about the same issue from Ricardo Gameiro.

-Jakub.

PowerHA failure scenario when dealing with SAN-booted LPARs – part II

Monday, January 24th, 2011

OK, this is part II of the post PowerHA failure scenario when dealing with SAN-booted LPARs – part I.

The first scenario we performed was to disable 100% of the MPIO storage paths to the active HACMP node by un-mapping the Virtual FibreChannel adapters (vadapters) on both VIOS protecting the active node (LPAR). On both VIOS we executed the following command:

$ vfcmap -vadapter vfchostXYZ -fcp

where vfchostXYZ was the server-side (or better: VIOS-side) vadapter handling the FC I/O traffic for this LPAR. The result? After some time, the LPAR with the active HACMP/PowerHA Resource Groups on it (jkwha001d) evicted itself from the HACMP cluster, and the old passive node jkwha002d became the active one (Oracle started there). The error log on jkwha001d shows the root cause:

root@jkwha001d : /home/root :# errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
AFA89905   0103071611 I O grpsvcs        Group Services daemon started
97419D60   0103071611 I O topsvcs        Topology Services daemon started
A6DF45AA   0103071111 I O RMCdaemon      The daemon is started.
67145A39   0103070911 U S SYSDUMP        SYSTEM DUMP
F48137AC   0103070911 U O minidump       COMPRESSED MINIMAL DUMP
DC73C03A   0103070911 T S fscsi0         SOFTWARE PROGRAM ERROR
9DBCFDEE   0103070911 T O errdemon       ERROR LOGGING TURNED ON
26623394   0103070511 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070511 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070511 T H fscsi1         COMMUNICATION PROTOCOL ERROR
(..)
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk4         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk8         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk6         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk3         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk2         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk0         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DE3B8540   0103070411 P H hdisk5         PATH HAS FAILED
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
(..)
A39F8A49   1210235210 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
D5676F6F   1210235210 T H fscsi1         ATTACHED SCSI TARGET DEVICE ERROR
DE3B8540   1210235210 P H hdisk6         PATH HAS FAILED
D5676F6F   1210235210 T H fscsi0         ATTACHED SCSI TARGET DEVICE ERROR
DE3B8540   1210235210 P H hdisk1         PATH HAS FAILED
D5676F6F   1210235210 T H fscsi1         ATTACHED SCSI TARGET DEVICE ERROR
(..)

As you can see, AIX 6.1 generated a “SYSTEM DUMP”, which is actually a system panic indicator. Normally AIX saves the memory image of its running kernel to Logical Volumes of the “sysdump” type and then reboots. This allows investigating why the machine crashed, but here you won’t see anything: the LPAR lost FC storage connectivity (even for rootvg), so it couldn’t even save the dump. So how does it know a dump happened at all? Probably the crash state of the LPAR is recorded somewhere in the PowerVM firmware area; it’s one of the RAS features of AIX/POWER. So far, so good: HACMP/PowerHA was able to recover the services…
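As a side note, you can check where such a dump would normally land with sysdumpdev; on a SAN-booted LPAR both the primary and the secondary dump device typically sit on the very storage that has just disappeared:

# sysdumpdev -l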

OK, but we wanted real proof, so we performed a double-check. We suspected that the VIOS might have some magic way to communicate with the LPAR, and we wanted to exclude that factor, so we simulated disconnecting the FC at the storage level. The initial state was a stable HACMP cluster, with the RG active on jkwha002d (jkwha001d being passive) and all MPIO paths to the Netapp cluster (storage1, storage2) reported as “Enabled” by “lspath”. In Netapp terminology, an igroup (“initiator group”) is responsible for things like LUN masking. If you remove access for some FC WWPNs on a LUN, you end up in a situation where AIX hdisks point to non-existent SCSI targets and the AIX SCSI stack gets a lot of errors (FC is just a wrapper around SCSI).

On the 2nd storage controller (storage2) the igroup jkwha002d_boot was controlling access to the OS-level LUNs (rootvg, etc.):

storage2> lun show -g jkwha002d_boot
        /vol/jkwha002d_boot/boot/rootvg   40.0g (42953867264)   (r/w, online, mapped)
storage2>

Also on the 2nd storage controller there was an igroup responsible for the SAP and Oracle LUNs:

storage2> lun show -g jkwha002d_cluster
        /vol/sap_jw1_mirrlog/mirrlog/mirrlog01.lun      1g (1073741824)    (r/w, online, mapped)
        /vol/sap_jw1_oraarch/oraarch/oraarch01.lun     51g (54760833024)   (r/w, online, mapped)
        /vol/sap_jw1_sapdata/sapdata2/sapdata02.lun     15g (16106127360)   (r/w, online, mapped)
storage2>

The same igroup was present on the 1st storage system, controlling the remaining LUNs (yes, this is an active-active configuration, where reads/writes are managed by AIX LVM with VGs consisting of LUNs from both controllers):

storage1> lun show -g jkwha002d_cluster
        /vol/clus_001d_002d_hb/hb/hbvg     20m (20971520)      (r/w, online, mapped)
        /vol/sap_jw1_origlog/origlog/origlog01.lun      1g (1073741824)    (r/w, online, mapped)
        /vol/sap_jw1_sapbin/sapbin/sapbin01.lun    100g (107374182400)  (r/w, online, mapped)
        /vol/sap_jw1_sapdata/sapdata1/sapdata01.lun     15g (16106127360)   (r/w, online, mapped)
storage1>

Let’s do it:

storage1> igroup remove
usage:
igroup remove [ -f ] <initiator_group> <node> ...
  - removes node(s) from an initiator group. The node may also
    specified as its alias. See "fcp wwpn-alias" for help with aliases
    By default, if lun maps exist for the initiator group, the node is
    not removed from the group. The -f option can be used to force the
    removal.
For more information, try 'man na_igroup'
storage1>
storage1> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:de
storage1> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:e0
storage2> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:de
storage2> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:e0
storage2> igroup remove -f jkwha002d_boot c0:50:76:00:61:20:00:de
storage2> igroup remove -f jkwha002d_boot c0:50:76:00:61:20:00:e0

.. and HACMP failover won’t work. The active LPAR node (jkwha002d) ends up in a zombie state. If you had SSH sessions open beforehand, everything would seem to indicate that the node is working: ls on /etc works, some command utilities too. But that is only because some things are cached in AIX’s filesystem cache. Everything else is going to fail with I/O errors, and lspath will cry too (although it won’t display all MPIO paths as failed, but that is a story for another post). Examples:

root@jkwha002d:# topas
bash: /usr/bin/topas: There is an input or output error.
root@jkwha002d:# nmon
bash: /usr/bin/nmon: There is an input or output error.
root@jkwha002d:# lspath
Failed  hdisk0 fscsi1
Failed  hdisk1 fscsi1
Enabled hdisk0 fscsi1
Failed  hdisk1 fscsi1
Failed  hdisk0 fscsi0
Failed  hdisk1 fscsi0
Failed  hdisk0 fscsi0
Enabled hdisk1 fscsi0
Failed  hdisk2 fscsi0
Failed  hdisk2 fscsi0
Failed  hdisk2 fscsi0
Failed  hdisk2 fscsi0
Enabled hdisk2 fscsi1
Failed  hdisk2 fscsi1
Failed  hdisk2 fscsi1
Failed  hdisk2 fscsi1
Failed  hdisk3 fscsi0
Failed  hdisk4 fscsi0
Failed  hdisk5 fscsi0
Failed  hdisk6 fscsi0
Failed  hdisk7 fscsi0
Failed  hdisk8 fscsi0
Failed  hdisk3 fscsi0
Failed  hdisk4 fscsi0
Failed  hdisk5 fscsi0
Failed  hdisk6 fscsi0
Failed  hdisk7 fscsi0
Failed  hdisk8 fscsi0
Failed  hdisk3 fscsi0
Failed  hdisk4 fscsi0
Failed  hdisk5 fscsi0
Failed  hdisk6 fscsi0
Failed  hdisk7 fscsi0
Failed  hdisk8 fscsi0
Failed  hdisk3 fscsi0
Failed  hdisk4 fscsi0
Failed  hdisk5 fscsi0
Failed  hdisk6 fscsi0
Failed  hdisk7 fscsi0
Failed  hdisk8 fscsi0
Failed  hdisk3 fscsi1
Failed  hdisk4 fscsi1
Enabled hdisk5 fscsi1
Enabled hdisk6 fscsi1
Failed  hdisk7 fscsi1
Failed  hdisk8 fscsi1
Enabled hdisk3 fscsi1
Enabled hdisk4 fscsi1
Failed  hdisk5 fscsi1
Failed  hdisk6 fscsi1
Failed  hdisk7 fscsi1
Failed  hdisk8 fscsi1
Failed  hdisk3 fscsi1
Failed  hdisk4 fscsi1
Failed  hdisk5 fscsi1
Failed  hdisk6 fscsi1
Enabled hdisk7 fscsi1
Failed  hdisk8 fscsi1
Failed  hdisk3 fscsi1
Failed  hdisk4 fscsi1
Failed  hdisk5 fscsi1
Failed  hdisk6 fscsi1
Failed  hdisk7 fscsi1
Enabled hdisk8 fscsi1
Missing hdisk9 fscsi0
Missing hdisk9 fscsi0
Missing hdisk9 fscsi1
Missing hdisk9 fscsi1
root@jkwha002d:# errpt | more
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
B6267342   0103074511 P H hdisk0         DISK OPERATION ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
B6267342   0103074511 P H hdisk0         DISK OPERATION ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
EA88F829   0103074511 I O SYSJ2          USER DATA I/O ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
00B984B3   0103074511 U H hdisk0         UNDETERMINED ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
B6267342   0103074511 P H hdisk0         DISK OPERATION ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW

Oracle is going to panic (which is good behaviour in my opinion):

root@jkwha002d:# tail alert_JW1.log
Mon Jan  3 07:39:21 2011
Errors in file /oracle/JW1/saptrace/background/jw1_lgwr_831500.trc:
ORA-00345: redo log write error block 7 count 1
ORA-00312: online log 13 thread 1: '/oracle/JW1/origlogA/log_g13m1.dbf'
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: -1
Additional information: 512
Mon Jan  3 07:39:23 2011
Errors in file /oracle/JW1/saptrace/background/jw1_ckpt_368764.trc:
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogA/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogB/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
Mon Jan  3 07:39:23 2011
Errors in file /oracle/JW1/saptrace/background/jw1_ckpt_368764.trc:
ORA-00221: error on write to control file
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogA/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogB/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
Mon Jan  3 07:39:23 2011
CKPT: terminating instance due to error 221
Instance terminated by CKPT, pid = 368764
root@jkwha002d:#

But what’s most interesting is that Oracle asks AIX to write all of this to the alert log file, and it is readable with commands like “tail”, while “cat” won’t work (you won’t be able to read the whole alert log file because you don’t have I/O!). What’s even more interesting is that you won’t see those messages after rebooting (once the kernel memory is gone!). If you don’t have I/O, how are you going to write/fsync this file? The writes apparently never got past the filesystem cache.

Another thing is that the active HACMP node will stay active; it will keep holding its Resource Groups, etc. Failover won’t happen. A possible solution to this problem would be a kernel-based check verifying that at least /etc is accessible. Why kernel-based? Because you have to have some form of heartbeat resident in memory (like an AIX kernel module, or an unswappable binary always present in RAM, running at real-time scheduling priority) that would continuously verify I/O health. If that check failed several times in a row, it should reboot the node (which would trigger a Resource Group failover to the 2nd node in the cluster).
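Just to illustrate the idea, a naive user-space sketch could look like the ksh93 script below. It is completely hypothetical and, as the note below explains, not good enough on its own: in a real storage blackout, ksh, dd and halt themselves may no longer be readable.

#!/bin/ksh93
# Hypothetical watchdog sketch - NOT sufficient on its own, because the
# shell, dd and halt binaries may be unreadable exactly when needed.
integer fails=0
while true; do
    # Try to read a small file from rootvg in the background; a storage
    # blackout often makes reads hang rather than fail immediately.
    dd if=/etc/hosts of=/dev/null bs=4k count=1 >/dev/null 2>&1 &
    ddpid=$!
    sleep 5
    if kill -0 $ddpid 2>/dev/null; then
        kill -9 $ddpid 2>/dev/null    # still hanging after 5s: count a failure
        wait $ddpid 2>/dev/null       # reap the killed dd
        (( fails+=1 ))
    elif wait $ddpid; then
        fails=0                       # read completed fine: reset the counter
    else
        (( fails+=1 ))                # read returned an I/O error
    fi
    if (( fails >= 3 )); then
        /usr/sbin/halt -q             # kill the node -> RG failover to the peer
    fi
    sleep 10
done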

Note: typical HACMP scripting – at least in theory – is not enough: even if it forced running /sbin/halt, how can you be sure that /sbin/halt and all the required libc contents are already in memory??

Perhaps I’ll write a part III of this article :)

-Jakub.

PowerHA failure scenario when dealing with SAN-booted LPARs – part I

Sunday, January 9th, 2011

Together with Jedrzej we’ve exposed a rather interesting weakness in the IBM PowerHA 5.5 solution (in the old days it was called HACMP). Normally you would assume that in case of a major cataclysm, such as *complete* storage disappearance on the active node, PowerHA or AIX has an internal mechanism to prevent downtime by switching the services to the next active node (as defined in the PowerHA policies/configuration). This gets really interesting when we start talking about SAN-booted AIX LPARs. As everybody knows, any form of assumption is bad (this is starting to be my mantra), and as we show here, avoiding this pitfall requires SOME PLANNING AND CONFIGURATION if you don’t want to end up with a long downtime…

The environment consisted of:

  • Two LPARs (jkwha001d = active; jkwha002d = passive) running on separate Power frames.
  • Each LPAR on each frame was protected by a separate pair of VIOS (to provide redundancy for the NPIV-enabled FC fabrics and to enable Shared Ethernet Failover).
  • AIX 6.1 TL3
  • Oracle 10gR2
  • PowerHA 5.5SP2
  • Netapp DataOntap 7.3 storage (Netapp cluster, 2 controllers) – yes, the AIX LPARs are also SAN-booted from this storage
  • 1 network-based (IPv4) heartbeat
  • 1 disk-based (on FC LUN) heartbeat
  • Properly configured & running scripts for starting and stopping Oracle
  • No application monitors configured

We’ve performed two experiments:

  1. Simulating total storage loss on jkwha001d only (the active node, hosting Oracle) by unmapping the Virtual Fibre Channel (NPIV) host adapters (vadapters) on both VIOS protecting jkwha001d.
  2. Simulating total storage loss on jkwha002d only (the active node hosting Oracle after the failover) by forcing disconnection of jkwha002d’s Fibre Channel initiators (WWPNs) on the Netapp cluster.

Both of these simulated the same thing: a virtual LPAR running without storage at all. More on the results in the next post…

-Jakub.

I’m IBM Certified Advanced Technical Expert now…

Sunday, January 9th, 2011

This is going to be a rather short blog post. On the 17th of December last year (2010) I passed all of the prerequisites (5 other IBM certificates) for becoming an “IBM Certified Advanced Technical Expert – Power Systems with AIX” (the highest [master-level] credential for people working with & designing solutions with AIX, VIOS, PowerHA/HACMP, PowerVM, (D)LPARs, Linux on Power, etc.).


The story is rather boring and there are only two lessons:
1) you won’t pass the required tests/certifications if you don’t know this stuff deeply (each technology, like PowerVM or PowerHA, at the expert level – doh! you want to be an ‘Advanced Technical Expert’!).
2) deep AIX knowledge alone is not enough. You need to get the big picture and be interested in the integration of everything that plays with AIX: applications, storage, high availability, disaster recovery, business requirements, and how it all interacts. You need a lot of experience (years of it).

After this whole journey I can say that AIX and its surrounding ecosystem on Power is the most rock-solid & mature IT solution I’ve ever worked with. There is not a lot on the market that can be compared to it, and this is the main reason why I decided to take this route. Now I’ll try to blog in the future (if time permits) about several interesting Power Systems and AIX cases…

-Jakub.