OK, this is part III of a rather long story. As shown in the previous part, the problem is really tricky: we cannot run anything in the event of losing storage (you cannot read binaries if you don't have storage). So how does HACMP/PowerHA deal with it?
If you lose all Virtual FibreChannel connections to storage, this is going to be reported as a loss of quorum for a VG in AIX's error reporting facility. In theory this error should be propagated to the HACMP/PowerHA failover mechanism so that the RG is evacuated from the affected node. Sometimes this behaviour is called "selective failover behavior". The technical details are pretty interesting, as they show the lack of real/solid integration between the AIX kernel and HACMP/PowerHA. The primary source of information is the AIX RAS reporting facility (see "man errpt" for more info) and the errdemon process. Errdemon takes the proper actions based on the configuration of "errnotify" ODM objects. The installation of PowerHA adds special hooks so that messages about failing hardware/OS/probes (in this case, losing quorum on VGs) are propagated to the HACMP/PowerHA scripts. This can be verified as follows:
root@jkwha001d : /home/root :# odmget errnotify | tail -17

errnotify:
        en_pid = 0
        en_name = "clSF1"
        en_persistenceflg = 1
        en_label = "LVM_SA_QUORCLOSE"
        en_crcid = 0
        en_class = "H"
        en_type = "UNKN"
        en_alertflg = ""
        en_resource = "LVDD"
        en_rtype = "NONE"
        en_rclass = "NONE"
        en_symptom = ""
        en_err64 = ""
        en_dup = ""
        en_method = "/usr/es/sbin/cluster/diag/clreserror $9 $1 $6 $8"
root@jkwha001d : /home/root :#
root@jkwha001d : /home/root :# /usr/lib/errdemon -l
Error Log Attributes
--------------------------------------------
Log File                /var/adm/ras/errlog
Log Size                1048576 bytes
Memory Buffer Size      32768 bytes
Duplicate Removal       true
Duplicate Interval      10000 milliseconds
Duplicate Error Maximum 1000
root@jkwha001d : /home/root :#
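Before relying on the notify hook, it is worth confirming that the quorum-loss event actually lands in the error log. A minimal sketch of such a check, with the errpt output simulated in a variable since it only exists on a live AIX host (the identifier and column values below are sample data; on a real host you would simply run `errpt -J LVM_SA_QUORCLOSE`):

```shell
# Simulated errpt listing; on AIX: errpt -J LVM_SA_QUORCLOSE
# The identifier/timestamp below are sample data, not from a real log.
errpt_sample='IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
CAD234BE   0101120011 U H LVDD           QUORUM LOST, VOLUME GROUP CLOSING'

# Count quorum-loss events coming from the LVM device driver (LVDD):
quorum_events=$(printf '%s\n' "$errpt_sample" | grep -c 'LVDD')
echo "$quorum_events"
```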
.. so the method of recovering from lost SAN storage (also on SAN-booted hosts) appears to be the script /usr/es/sbin/cluster/diag/clreserror. Of course, in this particular case that script is also located on the SAN…
root@jkwha001d:# ls -l /usr/es/sbin/cluster/diag/clreserror
-r-x------    1 root     system        13813 Feb 24 2009  /usr/es/sbin/cluster/diag/clreserror
root@jkwha001d:# file /usr/es/sbin/cluster/diag/clreserror
/usr/es/sbin/cluster/diag/clreserror: shell script - ksh (Korn shell)
root@jkwha001d:# head -45 /usr/es/sbin/cluster/diag/clreserror
#!/bin/ksh93
[..]
#
#  Name:        clreserror
#
#  Notify method in errnotify stanzas that are configured for Selective
#  Fallover triggered by (AIX) Error Notification.
#  This function is merely a wrapper for clRMupdate.
#
#  Argument validation is performed in the clRMapi.
#
#  Arguments:
#
#     $1      Error Label, corresponds to errnotify.en_label
#     $2      Sequence number of the AIX error log entry
#     $3      Resource Name, corresponds to errnotify.en_resource
#     $4      Resource class, corresponds to errnotify.en_class
#
#  Argument validation is performed in the clRMapi.
#
#  Environment:  PATH
#
root@jkwha001d:#
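The comment block above documents the argument contract that errdemon hands to any errnotify method. A hypothetical notify method of the same shape (the function name and the echo payload are mine, not from clreserror) would look something like:

```shell
# Hypothetical errnotify method mirroring clreserror's argument contract;
# errdemon passes errnotify fields as positional parameters.
notify_method() {
    label=$1      # errnotify.en_label, e.g. LVM_SA_QUORCLOSE
    seq=$2        # sequence number of the AIX error log entry
    resource=$3   # errnotify.en_resource, e.g. LVDD
    class=$4      # errnotify.en_class, e.g. H (hardware)
    echo "event=$label seq=$seq resource=$resource class=$class"
}

notify_method LVM_SA_QUORCLOSE 42 LVDD H
```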
And rest assured that this script calls a lot of other scripts, all of which can be unavailable if rootvg sits on the same physical storage as the affected VG.
There are actually two good findings here. The first one is that if you lose all SAN-based hdisks, you are going to be flooded with thousands of entries in the errpt facility. Those can go undetected by errdemon because the in-memory log buffer overflows. The workaround for this first case seems trivial: just enlarge the error log buffer. This is documented here:
- IZ93034: ERRLOG OVERFLOW RESULTS IN LACK OF HACMP FAILOVER
- IY75323: ERRORLOG BUFFER SIZE INCREASE NEEDED FOR LVM_SA_QUORCLOSE
root@jkwha001d : /home/root :# /usr/lib/errdemon -B 1048576
0315-175 The error log memory buffer size you supplied will be rounded up
to a multiple of 4096 bytes.
root@jkwha001d : /home/root :# /usr/lib/errdemon -l
Error Log Attributes
--------------------------------------------
Log File                /var/adm/ras/errlog
Log Size                1048576 bytes
Memory Buffer Size      1048576 bytes
Duplicate Removal       true
Duplicate Interval      10000 milliseconds
Duplicate Error Maximum 1000
root@jkwha001d : /home/root :#
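The 0315-175 message above is just the documented rounding of the buffer size up to a 4096-byte multiple. The same arithmetic in shell, useful for predicting what a requested size becomes:

```shell
# errdemon rounds the memory buffer size up to a multiple of 4096 bytes;
# the equivalent rounding in shell arithmetic:
requested=1000000
page=4096
rounded=$(( ( (requested + page - 1) / page ) * page ))
echo "$rounded"   # 1003520, the next multiple of 4096 above 1000000
```

(1048576, as used in the transcript, is already 256 * 4096, so it stays unchanged.)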
Additionally, it seems to have some value to mirror some of the LVs on the affected VGs. This might add some stability to the detection of losing LVM quorum; i.e. this shows a properly mirrored jfs2log LV (loglv04) across 2 hdisks (PVs)…
root@jkwha001d:# lsvg -p jw1data1vg
jw1data1vg:
PV_NAME           PV STATE          TOTAL PPs   FREE PPs    FREE DISTRIBUTION
hdisk5            active            479         30          00..00..00..00..30
hdisk3            active            479         30          00..00..00..00..30
root@jkwha001d:# lsvg -l jw1data1vg
jw1data1vg:
LV NAME             TYPE       LPs     PPs     PVs  LV STATE      MOUNT POINT
jw1data1            jfs2       896     896     2    open/syncd    /oracle/JW1/sapdata1
loglv04             jfs2log    1       2       2    open/syncd    N/A
root@jkwha001d:# lslv loglv04
LOGICAL VOLUME:     loglv04                VOLUME GROUP:   jw1data1vg
LV IDENTIFIER:      0001d2c20000d9000000012b86756f99.2 PERMISSION:     read/write
VG STATE:           active/complete        LV STATE:       opened/syncd
TYPE:               jfs2log                WRITE VERIFY:   off
MAX LPs:            512                    PP SIZE:        32 megabyte(s)
COPIES:             2                      SCHED POLICY:   parallel
LPs:                1                      PPs:            2
STALE PPs:          0                      BB POLICY:      relocatable
INTER-POLICY:       minimum                RELOCATABLE:    yes
INTRA-POLICY:       middle                 UPPER BOUND:    128
MOUNT POINT:        N/A                    LABEL:          None
MIRROR WRITE CONSISTENCY: on/ACTIVE
EACH LP COPY ON A SEPARATE PV ?: yes
Serialize IO ?:     NO
root@jkwha001d:#
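For reference, a mirrored jfs2log like the one above can be produced with mklvcopy plus syncvg. A hedged sketch, wrapped in a dry-run helper so it only prints the AIX commands instead of executing them (the hdisk name is taken from the listing above; drop DRYRUN on a real test box):

```shell
# Dry-run wrapper: with DRYRUN=1 the AIX commands are only echoed,
# so this sketch can be inspected safely on any machine.
DRYRUN=1
run() { if [ "$DRYRUN" = "1" ]; then echo "$@"; else "$@"; fi; }

run mklvcopy loglv04 2 hdisk3   # add a second copy of each LP on hdisk3
run syncvg -l loglv04           # synchronize the newly created copies
```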
The second finding is that even with all those changes there is a very high probability of failure (i.e. the PowerHA RG move won't work). Personally, the risk is so high that for me it is nearly a guarantee. The only proper solution to this problem that I am able to see is to add a special handler to the err_method in the errdemon code, something like err_method = KILL_THE_NODE. This KILL_THE_NODE would have to be implemented internally by the always-running errdemon process, and that process should run with its memory protected from swapping (something like mlockall())… because currently it does not run that way:
root@jkwha001d:# svmon -P 143470
-------------------------------------------------------------------------------
     Pid Command          Inuse      Pin     Pgsp  Virtual 64-bit Mthrd  16MB
  143470 errdemon         24851    11028        0    24700      N     N     N

     PageSize                Inuse        Pin       Pgsp    Virtual
     s    4 KB                 323          4          0        172
     m   64 KB                1533        689          0       1533

    Vsid      Esid Type Description              PSize  Inuse   Pin Pgsp Virtual
   7c05d         d work fork tree                    m   1041   233    0    1041
                   children=4d6cdc, 0
    8002         0 work fork tree                    m    492   456    0     492
                   children=802760, 0
   4d0f7         2 work process private            sm    131     4    0     131
   35309         3 clnt mapped file,/dev/hd9var:69  s    107     0    -       -
   110c0         f work shared library data        sm     41     0    0      41
   1d023         1 clnt code,/dev/hd2:44058         s     25     0    -       -
   45115         - clnt /dev/hd2:63900              s     11     0    -       -
   3902a         - clnt /dev/hd9var:1015            s      8     0    -       -
root@jkwha001d:#
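To illustrate the err_method = KILL_THE_NODE idea, here is a hypothetical notify method that would halt the node on quorum loss instead of walking through the event scripts (the halt command is only echoed in this sketch; and note that, as a script living on disk, it would still suffer from the very problem described above, which is why the real fix belongs inside errdemon itself):

```shell
# Hypothetical KILL_THE_NODE semantics: on quorum loss, take the node
# down immediately rather than calling scripts that may be unreadable.
kill_the_node() {
    # $1: error label passed by errdemon (errnotify.en_label)
    case $1 in
    LVM_SA_QUORCLOSE)
        # On a real node this would execute 'halt -q';
        # echoed here so the sketch stays side-effect free.
        echo "halt -q" ;;
    esac
}

kill_the_node LVM_SA_QUORCLOSE
```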
Hope IBM fixes that. There is no APAR because I'm lazy… I don't even want to fight with the 1st line of support… BTW: I'm not the first one who noticed this problem; please see here for a blog post about the same issue from Ricardo Gameiro.
-Jakub.
Hi,
I got an answer from IBM regarding this issue, stating that it is fixed in PowerHA 7.1, which, if you're wondering, is perfectly supported on AIX 6.1.
I still have to test it, hopefully on a new set of clusters that I will be setting up in the next couple of months.
Regards,
RG
Ricardo, thanks for the info.