PowerHA failure scenario when dealing with SAN-booted LPARs – part IV

If you are wondering how the story ended, be sure to read the IBM Redbook PowerHA SystemMirror 7.1 for AIX. It appears that full clustering support for LPARs having their rootvg on SAN only starts to arrive properly with PowerHA versions 7.1 and later…

The rootvg system event
PowerHA SystemMirror 7.1 introduces system events. These events are handled by a new subsystem called clevmgrdES. The rootvg system event allows for the monitoring of loss of access to the rootvg volume group. By default, in the case of loss of access, the event logs an entry in the system error log and reboots the system. If required, you can change this option in the SMIT menu to log only an event entry and not to reboot the system.
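
A quick way to look at this on a running node (a minimal sketch; exact SMIT menu labels may vary by PowerHA level):

    lssrc -s clevmgrdES    # check that the system event subsystem is active

    # the rootvg event response (log only vs. log and reboot) is changed here:
    # smitty sysmirror -> Custom Cluster Configuration -> Events -> System Events
    smitty sysmirror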

and later in the same book:
The rootvg system event
As discussed previously, event monitoring is now at the kernel level. The /usr/lib/drivers/phakernmgr kernel extension, which is loaded by the clevmgrdES subsystem, monitors these events for loss of rootvg. It can initiate a node restart operation if enabled to do so, as shown in Figure 9-9.

PowerHA 7.1 has a new system event that is enabled by default. This new event allows for the monitoring of the loss of the rootvg volume group while the cluster node is up and running. Previous versions of PowerHA/HACMP were unable to monitor this type of loss. Also, the cluster was unable to perform a failover action in the event of the loss of access to rootvg. An example is if you lose a SAN disk that is hosting the rootvg for this cluster node.
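
To confirm that the extension is actually loaded on a node, genkex (which lists the kernel extensions currently loaded on AIX) can be grepped for it:

    genkex | grep phakernmgr    # the PowerHA event-monitoring kernel extension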

4 Responses to “PowerHA failure scenario when dealing with SAN-booted LPARs – part IV”

  1. Joachim Gann says:

    hi,

    have you given the feature a try? I am currently testing this on a cluster with vscsi disks and an LVM-mirrored rootvg.

    the mechanism triggers a system dump and a reboot upon failure of the *first* rootvg disk (when the node could run perfectly well on the remaining mirror copy).

    Greetings
    Joachim

  2. admin says:

    Hi Joachim,

    thanks for the comment – the answer is… I'm not so brave! To be honest it smells like a good bug find; perhaps the kernel module is really reacting to an hdisk/PV failure under rootvg, not a real failure of the rootvg VG itself. For various reasons I would rather stay away from the bleeding edge on issues like these…

    BTW: how are you injecting the failure of the first hdisk? (offlining the LUN?)

    -Jakub.

  3. Joachim Gann says:

    hi Jakub,

    I am setting the VTDs to the Defined state on the redundant VIOS, one by one, which yields a path failure first, then a disk operation error, and then a system dump within an instant (roughly the commands sketched after this comment).

    anyway, glad to see the problem has been addressed at all, even if it is only sparsely documented in the Redbook. Your blog has the top 2 of 5 Google hits for clevmgrdES…

    Joachim
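
    A rough sketch of that injection on the VIOS side (vhost0 and vtscsi0 are placeholder names; list your own devices with lsmap):

        lsmap -vadapter vhost0      # find the VTD backing the client's rootvg path
        rmdev -dev vtscsi0 -ucfg    # set the VTD to Defined: the client loses this path
                                    # (repeat on the second VIOS to take the last path away)
        cfgdev -dev vtscsi0         # reconfigure the VTD to restore the path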

  4. Jamil Alam says:

    hi Joachim,

    Did you get a solution for your problem?

    I am currently facing the same problem when testing a storage failure.
    A system dump is triggered on the AIX server after the first copy of the LVM-mirrored rootvg fails (logically, we should continue working on the other copy of the LVM mirror, which sits on the other storage).

    Please advise.