Archive for April, 2011

PowerHA failure scenario when dealing with SAN-booted LPARs – part III

Tuesday, April 12th, 2011

OK, this is part III of a rather long story. As the previous parts showed, the problem is really tricky: in the event of losing all storage we cannot run anything, because you cannot read binaries if you don't have storage. So how does HACMP/PowerHA deal with this?

If you lose all Virtual FibreChannel connections to storage, this is reported as a loss of quorum for a VG in AIX's error reporting facility. In theory this error should be propagated to the HACMP/PowerHA failover mechanism so that the RG is evacuated from the affected node. This behaviour is sometimes called "selective failover behavior". The technical details are pretty interesting, as they show the lack of real, solid integration between the AIX kernel and HACMP/PowerHA. The primary source of information is the AIX RAS reporting facility (see "man errpt" for more info) and the errdemon process. Errdemon takes the proper actions based on the configuration of "errnotify" ODM objects. The installation of PowerHA adds special hooks so that particular messages about failing hardware/OS/probes (in this case, losing quorum on VGs) are propagated to the HACMP/PowerHA scripts. This can be verified as follows:

root@jkwha001d : /home/root :# odmget errnotify | tail -17

        en_pid = 0
        en_name = "clSF1"
        en_persistenceflg = 1
        en_label = "LVM_SA_QUORCLOSE"
        en_crcid = 0
        en_class = "H"
        en_type = "UNKN"
        en_alertflg = ""
        en_resource = "LVDD"
        en_rtype = "NONE"
        en_rclass = "NONE"
        en_symptom = ""
        en_err64 = ""
        en_dup = ""
        en_method = "/usr/es/sbin/cluster/diag/clreserror $9 $1 $6 $8"
root@jkwha001d : /home/root :#
root@jkwha001d : /home/root :# /usr/lib/errdemon -l
Error Log Attributes
Log File                /var/adm/ras/errlog
Log Size                1048576 bytes
Memory Buffer Size      32768 bytes
Duplicate Removal       true
Duplicate Interval      10000 milliseconds
Duplicate Error Maximum 1000
root@jkwha001d : /home/root :#

.. so the method for recovering from the loss of SAN storage (also on SAN-booted hosts) appears to be the script /usr/es/sbin/cluster/diag/clreserror. Of course, in this particular case that script is itself located on the SAN…

root@jkwha001d:# ls -l /usr/es/sbin/cluster/diag/clreserror
-r-x------    1 root     system        13813 Feb 24 2009  /usr/es/sbin/cluster/diag/clreserror
root@jkwha001d:# file /usr/es/sbin/cluster/diag/clreserror
/usr/es/sbin/cluster/diag/clreserror: shell script  - ksh (Korn shell)
root@jkwha001d:# head -45  /usr/es/sbin/cluster/diag/clreserror
#  Name:        clreserror
#   Notify method in errnotify stanzas that are configured for Selective
#   Fallover triggered by (AIX) Error Notification.
#   This function is merely a wrapper for clRMupdate.
#   Argument validation is performed in the clRMapi.
#  Arguments:
#       $1      Error Label, corresponds to errnotify.en_label
#       $2      Sequence number of the AIX error log entry
#       $3      Resource Name, corresponds to errnotify.en_resource
#       $4      Resource class, corresponds to errnotify.en_class
#       Argument validation is performed in the clRMapi.
#       Environment:  PATH

And rest assured that this script calls a lot of other scripts, all of which can be unavailable if rootvg is on the same physical storage as the affected VG.

There are actually two good findings here. The first is that if you lose all SAN-based hdisks, you are going to be flooded with thousands of entries in the errpt facility. Some of those can go undetected by errdemon because the in-memory log overflows. The workaround for this case seems trivial: just enlarge the error log memory buffer:

root@jkwha001d : /home/root :# /usr/lib/errdemon -B 1048576
0315-175 The error log memory buffer size you supplied will be rounded up
to a multiple of 4096 bytes.
root@jkwha001d : /home/root :# /usr/lib/errdemon -l
Error Log Attributes
Log File                /var/adm/ras/errlog
Log Size                1048576 bytes
Memory Buffer Size      1048576 bytes
Duplicate Removal       true
Duplicate Interval      10000 milliseconds
Duplicate Error Maximum 1000
root@jkwha001d : /home/root :#

Additionally, it seems to be worth mirroring some of the LVs on the affected VGs. This might add some stability to the detection of losing LVM quorum; e.g., the output below shows the JFS2 log LV loglv04 properly mirrored across two hdisks (PVs):

root@jkwha001d:# lsvg -p jw1data1vg
jw1data1vg:
PV_NAME           PV STATE          TOTAL PPs   FREE PPs    FREE DISTRIBUTION
hdisk5            active            479         30          00..00..00..00..30
hdisk3            active            479         30          00..00..00..00..30
root@jkwha001d:# lsvg -l jw1data1vg
LV NAME             TYPE       LPs     PPs     PVs  LV STATE      MOUNT POINT
jw1data1            jfs2       896     896     2    open/syncd    /oracle/JW1/sapdata1
loglv04             jfs2log    1       2       2    open/syncd    N/A
root@jkwha001d:# lslv loglv04
LOGICAL VOLUME:     loglv04                VOLUME GROUP:   jw1data1vg
LV IDENTIFIER:      0001d2c20000d9000000012b86756f99.2 PERMISSION:     read/write
VG STATE:           active/complete        LV STATE:       opened/syncd
TYPE:               jfs2log                WRITE VERIFY:   off
MAX LPs:            512                    PP SIZE:        32 megabyte(s)
COPIES:             2                      SCHED POLICY:   parallel
LPs:                1                      PPs:            2
STALE PPs:          0                      BB POLICY:      relocatable
INTER-POLICY:       minimum                RELOCATABLE:    yes
INTRA-POLICY:       middle                 UPPER BOUND:    128
MOUNT POINT:        N/A                    LABEL:          None
Serialize IO ?:     NO

The second finding is that even with all those changes there is a very high probability of failure (i.e. the PowerHA RG move won't work). Personally, I'd say the risk is so high that it is nearly a guarantee. The only proper solution to this problem that I can see is to add a special handler to the err_method logic in the errdemon code, something like err_method = KILL_THE_NODE. This KILL_THE_NODE action would be implemented internally by the always-running errdemon process, which should run with its memory protected from swapping (something like mlockall())… because currently it does not run that way:

root@jkwha001d:# svmon -P 143470
     Pid Command          Inuse      Pin     Pgsp  Virtual 64-bit Mthrd  16MB
  143470 errdemon         24851    11028        0    24700      N     N     N

     PageSize                Inuse        Pin       Pgsp    Virtual
     s    4 KB                 323          4          0        172
     m   64 KB                1533        689          0       1533

    Vsid      Esid Type Description              PSize  Inuse   Pin Pgsp Virtual
   7c05d         d work fork tree                    m   1041   233    0    1041
                   children=4d6cdc, 0
    8002         0 work fork tree                    m    492   456    0     492
                   children=802760, 0
   4d0f7         2 work process private             sm    131     4    0     131
   35309         3 clnt mapped file,/dev/hd9var:69   s    107     0    -       -
   110c0         f work shared library data         sm     41     0    0      41
   1d023         1 clnt code,/dev/hd2:44058          s     25     0    -       -
   45115         - clnt /dev/hd2:63900               s     11     0    -       -
   3902a         - clnt /dev/hd9var:1015             s      8     0    -       -

I hope IBM fixes this. There is no APAR because I'm lazy… I don't even want to fight with the 1st line of support… BTW: I'm not the first one to notice this problem; please see Ricardo Gameiro's blog post about the same issue.