Archive for the ‘SAN’ Category

I/O latency on REDO disks during storage array controller failover

Wednesday, January 18th, 2017

I’ve managed to create an interesting I/O latency graph (from iostat await data) in GNU R during a planned test of a storage array controller failure in a 2-way configuration (sorry, I cannot provide the vendor/model). In most 2-way storage arrays the write-back cache is enabled (and sync I/O from LGWR is really buffered) until one of the two controllers crashes, which disables the write-back cache. This increased latency may influence overall platform stability, especially with Oracle DBs doing thousands of I/Os per second (e.g. commits or other activity). The interesting thing about this I/O latency graph are the freezes (up to ~500 .. 900ms) caused by the failed/rebooted controller entering and leaving array processing:

Of course results may vary depending on many factors (storage array vendor, OS used, multipathing configuration, etc.).
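For reference, the raw samples behind such a graph can be captured with a trivial loop like the one below and then loaded into GNU R for plotting. This is only a sketch: the device name, interval and output file are assumptions, and the await column printed by iostat -x (Linux sysstat, gawk for strftime) is the per-request latency that ends up on the Y-axis.

# sample extended device statistics every second and keep only the line for the
# hypothetical REDO device "dm-redo01"; prepend a timestamp so the samples can be
# read into GNU R as a simple whitespace-separated table
iostat -x 1 | awk '/dm-redo01/ { print strftime("%Y-%m-%d %H:%M:%S"), $0; fflush() }' >> redo_latency.log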

How one can overcome this:

  • use 4-way controller storage arrays (a crash of 1 out of 4 controllers might not disable the write-back cache – depends on the vendor)
  • provide the required IOPS on RAID groups under the assumption of a write cache hit ratio of 0% (e.g. more disks in RAID10, a different RAID level, SSDs, Fusion-io, flash arrays) – I think this is an interesting voice in the discussion of whether we should use SSDs in storage arrays for Oracle’s REDO; see the back-of-the-envelope sketch below
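As a back-of-the-envelope illustration of that sizing assumption (all numbers below are made up for the example): once write-back caching is gone, every write hits the spindles with the full RAID penalty, so the spindle count has to be derived from back-end IOPS, for example:

# hypothetical sizing: 5000 host IOPS at a 70:30 read/write ratio, RAID10 (write
# penalty 2), ~180 IOPS per 15k FC spindle, write cache hit ratio assumed 0%
READS=$((5000 * 70 / 100))                  # 3500 read IOPS
WRITES=$((5000 * 30 / 100))                 # 1500 write IOPS
BACKEND=$((READS + WRITES * 2))             # 6500 back-end IOPS
echo "spindles needed: $(( (BACKEND + 179) / 180 ))"   # ~37 disks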

Ref:

  1. https://itpeernetwork.intel.com/should-you-put-oracle-database-redo-on-solid-state-disks-ssds/
  2. https://flashdba.com/2013/08/22/storage-myths-put-oracle-redo-on-ssd/
  3. https://www.pythian.com/blog/de-confusing-ssd-for-oracle-databases/

WAFL performance VS sequential reads: part III, FC LUN performance from AIX vs read_realloc

Thursday, July 7th, 2011

Continuing my series about LUN fragmentation in WAFL (part 1, part 2), I wanted to give the read_realloc option a try. Mostly the same storage system (still Data ONTAP 7.3.2) and settings were used, with the exception that this time I used an even more powerful AIX system with POWER7 CPU cores (it shouldn’t matter, as I was not constrained by CPU power at all during the benchmarks).
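For reference, read_realloc is a per-FlexVol option in Data ONTAP 7-mode (it shows up in the vol options output), so enabling it for the tested volume is a one-liner; the volume name below is taken from the log message at the end of this post:

X> vol options fc_fill_aggr_sm read_realloc on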

The methodology was simple:
1) Do a lot of random and sequential writes to a JFS2 filesystem on top of that LUN (on a FlexVol with read_realloc=off) to fragment the LUN as much as possible.
2) Start a loop of random large reads (simulating, let’s say, many random full table scans of up to 1MB each on an Oracle data file) on top of that JFS2 filesystem (BTW: this time without CIO); a rough sketch of such a loop is shown below.
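The read loop from step 2 can be as dumb as the ksh sketch below; the file name and its size in 1MB blocks are assumptions standing in for whatever was laid down during the write phase:

#!/bin/ksh
# hammer the filesystem with random 1MB reads, forever
FILE=/fullaggr/bigfile
BLOCKS=10000
while true; do
    OFFSET=$(( (RANDOM * RANDOM) % BLOCKS ))
    dd if=$FILE of=/dev/null bs=1048576 count=1 skip=$OFFSET 2>/dev/null
done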

The full measurement took about 18 hours (yes, the storage was hammered for 18 hours straight with just reads). The results are pretty interesting:

… it basically shows how Netapp storage can self-optimize and adapt based on the workload. Now I’m just wondering how it works under the hood (i.e. what is the trade-off or the worst possible scenario for this kind of setting). What was even more interesting is that a reallocate measure run after the whole cycle indicated that the LUN was really nicely defragmented:

Thu Jul  7 03:32:20 EDT [X: wafl.reallocate.check.value:info]: Allocation measurement check on '/vol/fc_fill_aggr_sm/lun' is 3.

WAFL performance VS sequential reads: part II, FC LUN defragmentation

Monday, June 27th, 2011

After part I – where I simulated a typical Oracle workload (generating a 70:30 read-to-write ratio on an FC LUN) and created snapshots – I wanted to try different performance tests. In order to start from the same performance characteristics, I deleted all my snapshots, so my FlexVol ended up again at 40% utilization:

X> snap list full_aggr_test
Volume full_aggr_test
working...

No snapshots exist.
X>
X> df -g full_aggr_test
Filesystem               total       used      avail capacity  Mounted on
/vol/full_aggr_test/       50GB       20GB       29GB      40%  /vol/full_aggr_test/
/vol/full_aggr_test/.snapshot        0GB        0GB        0GB     ---%  /vol/full_aggr_test/.snapshot
X>

Later I executed the Orion stress test in a way identical to part I, on the same environment. As you can see the LUN is still fragmented, because any kind of sequential read is impacted (maximum read observed ~17MB/s):

root@Y:# grep Maximum orion*
orion_20110627_1116_summary.txt:Maximum Large MBPS=17.07 @ Small=0 and Large=9
orion_20110627_1116_summary.txt:Maximum Small IOPS=683 @ Small=24 and Large=0
root@Y:#

So in order to fight this performance issue, one can first establish the root cause:

X> reallocate measure /vol/full_aggr_test
Reallocation scan will be started on '/vol/full_aggr_test'.
Monitor the system log for results.
X>

System log will reveal this:

Mon Jun 27 07:35:31 EDT [X: wafl.scan.start:info]: Starting WAFL layout measurement on volume full_aggr_test.
Mon Jun 27 07:35:32 EDT [X: wafl.reallocate.check.highAdvise:info]: Allocation check on '/vol/full_aggr_test' is 8, hotspot 0 (threshold 4), consider running reallocate.

This seems to be identical to running measure on the LUN:

X> reallocate measure  /vol/full_aggr_test/lun01
Reallocation scan will be started on '/vol/full_aggr_test/lun01'.
Monitor the system log for results.
X>

Log will show this:

Mon Jun 27 07:45:21 EDT [X: wafl.scan.start:info]: Starting WAFL layout measurement on volume full_aggr_test.
Mon Jun 27 07:45:21 EDT [X: wafl.reallocate.check.highAdvise:info]: Allocation check on '/vol/full_aggr_test/lun01' is 8, hotspot 0 (threshold 4), consider running reallocate.

So in both cases we were advised to defragment the LUN, but keep in mind that this is a rather resource-hungry operation, as it might involve reading and rewriting the full contents of the data!
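If I recall the 7-mode syntax correctly, the impact of that can be limited by scheduling the reallocation off-peak instead of kicking it off by hand (treat the exact schedule string below, meaning 23:00 on Saturdays, as an assumption). For this test, though, it was simply started manually:

X> reallocate schedule -s "0 23 * 6" /vol/full_aggr_test/lun01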

X> reallocate start -f -p /vol/full_aggr_test/lun01
Reallocation scan will be started on '/vol/full_aggr_test/lun01'.
Monitor the system log for results.
X>

Log will show that the operation has started …

Mon Jun 27 07:46:23 EDT [X: wafl.br.revert.slow:info]: The aggregate 'sm_aggr1' contains blocks that require redirection; 'revert_to' might take longer than expected.
Mon Jun 27 07:46:23 EDT [X: wafl.scan.start:info]: Starting file reallocating on volume full_aggr_test.

As you can see the CPU activity is rather low; however, the physical utilization of the disks is reported as high (don’t be fooled by the low write activity – this is a function of time, it performs a lot of writes later):

 CPU   NFS  CIFS  HTTP   Total    Net kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in   out   read  write  read write   age   hit time  ty util                 in   out    in   out
 10%     0     0     0     157     0     0  22372  19320     0     0    53s  94%  58%  :   97%    156     0   589   175     0     0
 10%     1     0     0     108     0     0  24884      0     0     0    53s  94%   0%  -   92%    106     0   256   585     0     0
  9%     0     0     0     101     0     0  25284     24     0     0    53s  94%   0%  -   93%    100     0   421   260     0     0
 12%     0     0     0     627    20    25  25620      8     0     0    53s  94%   0%  -   92%    511     0   297   132     0     0
 11%     0     0     0     792     0     0  22832      0     0     0    53s  94%   0%  -   90%    652     0   670   461     0     0
  6%     1     0     0      81     1     1  25232     24     0     0    53s  99%   0%  -   92%     78     0   233   253     0     0

One can monitor the progress by using the “status” command and in fact observe it finish:

X> reallocate status -v /vol/full_aggr_test/lun01
Reallocation scans are on
/vol/full_aggr_test/lun01:
        State: Reallocating: Block 1347456 of 5242880 (25%), updated 1346434
        Flags: doing_force,measure_only,repeat,keep_vvbn
    Threshold: 4
     Schedule: n/a
     Interval: 1 day
 Optimization: 8
  Measure Log: n/a
X>
[..]
X> reallocate status -v /vol/full_aggr_test/lun01
Reallocation scans are on
/vol/full_aggr_test/lun01:
        State: Idle
        Flags: measure_only,repeat
    Threshold: 4
     Schedule: n/a
     Interval: 1 day
 Optimization: 8
  Measure Log: n/a

X> sysstat -x 1
 CPU   NFS  CIFS  HTTP   Total    Net kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in   out   read  write  read write   age   hit time  ty util                 in   out    in   out
 53%     1     0     0     678     1     1  29428   1556     0     0     1   72%   9%  :   11%    573     0   311 21077     0     0
 34%     0     0     0     443     0     0  22028     32     0     0     1   78%   0%  -    5%    442     0  1068 20121     0     0
 40%     0     0     0     172     0     0  16360      0     0     0     1   77%   0%  -    4%    171     0   367 14450     0     0
 CTRL+C
X>

Later results indicate that sequential reads are indeed back to their top value (~42MB/s), which was our starting point on the fresh LUN inside the FlexVol in part I…

root@Y:# grep Maximum orion*
orion_20110627_1208_summary.txt:Maximum Large MBPS=42.73 @ Small=0 and Large=9
orion_20110627_1208_summary.txt:Maximum Small IOPS=645 @ Small=25 and Large=0
root@Y:#

In the next part I’ll try to investigate various AIX JFS2/CIO behaviours and, to some degree, the performance characteristics of Netapp storage and its options (e.g. the read_realloc option). Stay tuned…

WAFL performance VS sequential reads: part I, FC LUN performance from AIX vs FlexVol utilization

Friday, June 24th, 2011

Some time ago I started thinking about getting more serious about performance research on the AIX 6.1, PowerHA and Oracle stack on top of Netapp storage. One of the first things I wanted to measure was how Netapp’s WAFL handles long sequential reads in correlation to FlexVol space utilization. In theory long sequential reads (as in the Oracle data warehouses I’m interested in) could be affected by the fragmentation introduced by WAFL (Write Anywhere File Layout).

Some specs first:

  • Data ONTAP version 7.3.2; a single Netapp 3160 controller was tested (but running in a cluster).
  • The test was done using Oracle’s Orion 11.1.0.7 storage benchmarking tool on top of AIX (orion_aix_ppc64 -run advanced -num_disks 5 -cache_size 2048 -write 30 -duration 20 -type rand -matrix basic) – as you can see the read-to-write ratio was 70% to 30%, but only long reads are presented (I was not interested in the performance of 8kB reads/writes, just 1MB long reads).
  • AIX was connected via VFCs to two separate VIOS, each of which was connected using 2 x 8 Gbps FC links (but running in 4 Gbps mode because the Brocade SAN switches did not support 8 Gbps).
  • The Netapp controller had 4 Gbps FC ports.
  • AIX was using 6.1’s internal MPIO (round-robin) for the tested LUN.
  • The AIX hdisk for the LUN was set to the default value of 12 (as per the Netapp Host Attachment Kit).
  • The AIX JFS2 filesystem was mounted with Concurrent I/O to prevent AIX from caching and doing read-aheads (AIX still had 3GB of RAM allocated, but VMM should not use it).
  • The Netapp storage controller had 4 processors, 8GB RAM and 2GB of NVRAM (as indicated by sysconfig output; of course, as this is a cluster, only 1GB was usable).
  • The LUN size was 20GB, on top of a 50GB FlexVol on a RAID-DP aggregate with 5x FC 15k RPM disks:
X> vol options full_aggr_test
nosnap=off, nosnapdir=off, minra=off, no_atime_update=off, nvfail=off,
ignore_inconsistent=off, snapmirrored=off, create_ucode=on,
convert_ucode=off, maxdirsize=83804, schedsnapname=ordinal,
fs_size_fixed=off, compression=off, guarantee=volume, svo_enable=off,
svo_checksum=off, svo_allow_rman=off, svo_reject_errors=off,
no_i2p=off, fractional_reserve=100, extent=off, try_first=volume_grow,
read_realloc=off, snapshot_clone_dependency=off
X> df -Ag aggr_used
Aggregate                total       used      avail capacity
aggr_used               1102GB      668GB      433GB      61%
aggr_used/.snapshot         0GB        0GB        0GB     ---%
X>
X> snap list -A aggr_used
Aggregate aggr_used
working...

No snapshots exist.
X> snap sched -A aggr_used
Aggregate aggr_used: 0 0 0
X>

The WAFL aggregate was idle during that test and was configured and running as follows:

Aggregate aggr_used (online, raid_dp) (block checksums)
  Plex /aggr_used/plex0 (online, normal, active)
    RAID group /aggr_used/plex0/rg0 (normal)

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      dparity   2d.18   2d    1   2   FC:B   -  FCAL 15000 418000/856064000  420156/860480768
      parity    1d.19   1d    1   3   FC:A   -  FCAL 15000 418000/856064000  420156/860480768
      data      2d.21   2d    1   5   FC:B   -  FCAL 15000 418000/856064000  420156/860480768
      data      1d.22   1d    1   6   FC:A   -  FCAL 15000 418000/856064000  420156/860480768
      data      2d.23   2d    1   7   FC:B   -  FCAL 15000 418000/856064000  420156/860480768

[..]

X> aggr status aggr_used -v
           Aggr State           Status            Options
       aggr_used online          raid_dp, aggr     nosnap=off, raidtype=raid_dp,
                                                  raidsize=16,
                                                  ignore_inconsistent=off,
                                                  snapmirrored=off,
                                                  resyncsnaptime=60,
                                                  fs_size_fixed=off,
                                                  snapshot_autodelete=on,
                                                  lost_write_protect=on
[..]

X> df -g full_aggr_test
Filesystem               total       used      avail capacity  Mounted on
/vol/full_aggr_test/       50GB       20GB       29GB      40%  /vol/full_aggr_test/
/vol/full_aggr_test/.snapshot        0GB        0GB        0GB     ---%  /vol/full_aggr_test/.snapshot
X>
X> snap list full_aggr_test
Volume full_aggr_test
working...

No snapshots exist.
X>

From the AIX point of view, the filesystem was configured as follows (notice the big files for Orion’s use):

root@Y:# df -m .
Filesystem    MB blocks      Free %Used    Iused %Iused Mounted on
/dev/fslv00    19456.00   2099.70   90%        7     1% /fullaggr
root@Y:# du -sm *
10000.01        bigfile
7353.00 bigfile2
0.00    lost+found
0.00    orion.lun
root@Y:#

Results:

Methodology:
For each next attempt a snapshot was created to grow the space used inside the FlexVol until it was full (40%..100%). There was a single Orion execution after each snapshot was created. The Y-axis represents the maximum bandwidth observed for sequential 1MB reads (as reported by Orion). The Z-axis (depth), ranging from 1 to 10, represents the number of concurrent/parallel reads being done by Orion (to, let’s say, simulate multiple full table scans happening on the same LUN). As is visible from the graph, when FlexVol utilization is close or equal to 100%, more than a twofold performance impact can be observed (40-45 MB/s vs 10-15 MB/s). A sane FlexVol utilization maximum seems to be somewhere around 70% to avoid problems with fragmentation. The AIX system was mostly running with default settings without any more advanced optimizations (that was done on purpose, except for Concurrent I/O).
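A rough sketch of how such a measurement loop can be driven is shown below; the filer host name, the snapshot naming and the ssh access are assumptions, while the Orion command line is the one quoted in the specs above:

#!/bin/ksh
# after each filer snapshot (which, combined with Orion's 30% writes, pins more and
# more blocks and grows FlexVol utilization), re-run the same Orion profile
i=1
while [ $i -le 10 ]; do
    ssh filer "snap create full_aggr_test orion_run_$i"
    ./orion_aix_ppc64 -run advanced -num_disks 5 -cache_size 2048 \
        -write 30 -duration 20 -type rand -matrix basic
    i=$((i + 1))
done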

PowerHA failure scenario when dealing with SAN-booted LPARs – part IV

Thursday, June 9th, 2011

If you are wondering how the story ended up, just be sure to read IBM PowerHA SystemMirror 7.1 for AIX. It appears that full clustering support for LPARs having their rootvg on SAN started to be properly supported in PowerHA version 7.1 and higher…

The rootvg system event
PowerHA SystemMirror 7.1 introduces system events. These events are handled by a new
subsystem called clevmgrdES. The rootvg system event allows for the monitoring of loss of
access to the rootvg volume group. By default, in the case of loss of access, the event logs an
entry in the system error log and reboots the system. If required, you can change this option
in the SMIT menu to log only an event entry and not to reboot the system.

and later:
The rootvg system event
As discussed previously, event monitoring is now at the kernel level. The
/usr/lib/drivers/phakernmgr kernel extension, which is loaded by the clevmgrdES
subsystem, monitors these events for loss of rootvg. It can initiate a node restart operation if
enabled to do so as shown in Figure 9-9.
PowerHA 7.1 has a new system event that is enabled by default. This new event allows for the
monitoring of the loss of the rootvg volume group while the cluster node is up and running.
Previous versions of PowerHA/HACMP were unable to monitor this type of loss. Also the
cluster was unable to perform a failover action in the event of the loss of access to rootvg. An
example is if you lose a SAN disk that is hosting the rootvg for this cluster node.

PowerHA failure scenario when dealing with SAN-booted LPARs – part III

Tuesday, April 12th, 2011

Ok, this is part III of a rather long story. As shown in the last part, the problem is really tricky: we cannot run anything in the event of losing storage (you cannot read binaries if you don’t have storage). So how does HACMP/PowerHA deal with it?

If you lose all storage Virtual FibreChannel connections, this is going to be reported as a loss of quorum for a VG in AIX’s error reporting facility. In theory this error should be propagated to the HACMP/PowerHA failover mechanism so that the RG is evacuated from the affected node. Sometimes this behaviour is called “selective failover behavior”. The technical details are pretty interesting, as they show the lack of real/solid integration between the AIX kernel and HACMP/PowerHA. The primary source of information is the AIX RAS reporting facility (see “man errpt” for more info) and the errdemon process. Errdemon takes proper actions based on the configuration of “errnotify” ODM objects. The installation of PowerHA adds special hooks so that any messages about failing hardware/OS/probes (in this case loss of quorum on VGs) are propagated to the HACMP/PowerHA scripts. This can be verified as follows:

root@jkwha001d : /home/root :# odmget errnotify | tail -17

errnotify:
        en_pid = 0
        en_name = "clSF1"
        en_persistenceflg = 1
        en_label = "LVM_SA_QUORCLOSE"
        en_crcid = 0
        en_class = "H"
        en_type = "UNKN"
        en_alertflg = ""
        en_resource = "LVDD"
        en_rtype = "NONE"
        en_rclass = "NONE"
        en_symptom = ""
        en_err64 = ""
        en_dup = ""
        en_method = "/usr/es/sbin/cluster/diag/clreserror $9 $1 $6 $8"
root@jkwha001d : /home/root :#
root@jkwha001d : /home/root :# /usr/lib/errdemon -l
Error Log Attributes
--------------------------------------------
Log File                /var/adm/ras/errlog
Log Size                1048576 bytes
Memory Buffer Size      32768 bytes
Duplicate Removal       true
Duplicate Interval      10000 milliseconds
Duplicate Error Maximum 1000
root@jkwha001d : /home/root :#

.. so the method of recovering from losing SAN storage (also on SAN-booted hosts) seems to be the script /usr/es/sbin/cluster/diag/clreserror. Of course, in this particular case, this script is also located on SAN…

root@jkwha001d:# ls -l /usr/es/sbin/cluster/diag/clreserror
-r-x------    1 root     system        13813 Feb 24 2009  /usr/es/sbin/cluster/diag/clreserror
root@jkwha001d:# file /usr/es/sbin/cluster/diag/clreserror
/usr/es/sbin/cluster/diag/clreserror: shell script  - ksh (Korn shell)
root@jkwha001d:# head -45  /usr/es/sbin/cluster/diag/clreserror
#!/bin/ksh93
[..]
#
#  Name:        clreserror
#
#   Notify method in errnotify stanzas that are configured for Selective
#   Fallover triggered by (AIX) Error Notification.
#   This function is merely a wrapper for clRMupdate.
#
#   Argument validation is performed in the clRMapi.
#
#  Arguments:
#
#       $1      Error Label, corresponds to errnotify.en_label
#       $2      Sequence number of the AIX error log entry
#       $3      Resource Name, corresponds to errnotify.en_resource
#       $4      Resource class, corresponds to errnotify.en_class
#
#       Argument validation is performed in the clRMapi.
#
#       Environment:  PATH
#
root@jkwha001d:#

And rest assured that this script calls a lot of other scripts, which of course can be unavailable if rootvg is on the same physical storage as the affected VG.

There are actually two good findings here. The first one is that if you lose all SAN-based hdisks you are going to be flooded with thousands of entries in the errpt facility. Those can go undetected by errdemon because the in-memory log buffer overflows. The workaround for this first issue seems trivial: just enlarge the error log buffer. This is documented here:

root@jkwha001d : /home/root :# /usr/lib/errdemon -B 1048576
0315-175 The error log memory buffer size you supplied will be rounded up
to a multiple of 4096 bytes.
root@jkwha001d : /home/root :# /usr/lib/errdemon -l
Error Log Attributes
--------------------------------------------
Log File                /var/adm/ras/errlog
Log Size                1048576 bytes
Memory Buffer Size      1048576 bytes
Duplicate Removal       true
Duplicate Interval      10000 milliseconds
Duplicate Error Maximum 1000
root@jkwha001d : /home/root :#

Additionally, it seems to have some value to mirror some of the LVs on the affected VGs. This might add some stability to the detection of losing LVM quorum, i.e. this shows a properly mirrored LVM log LV (loglv04) across 2 hdisks (PVs)…

root@jkwha001d:# lsvg -p jw1data1vg
jw1data1vg:
PV_NAME           PV STATE          TOTAL PPs   FREE PPs    FREE DISTRIBUTION
hdisk5            active            479         30          00..00..00..00..30
hdisk3            active            479         30          00..00..00..00..30
root@jkwha001d:# lsvg -l jw1data1vg
jw1data1vg:
LV NAME             TYPE       LPs     PPs     PVs  LV STATE      MOUNT POINT
jw1data1            jfs2       896     896     2    open/syncd    /oracle/JW1/sapdata1
loglv04             jfs2log    1       2       2    open/syncd    N/A
root@jkwha001d:# lslv loglv04
LOGICAL VOLUME:     loglv04                VOLUME GROUP:   jw1data1vg
LV IDENTIFIER:      0001d2c20000d9000000012b86756f99.2 PERMISSION:     read/write
VG STATE:           active/complete        LV STATE:       opened/syncd
TYPE:               jfs2log                WRITE VERIFY:   off
MAX LPs:            512                    PP SIZE:        32 megabyte(s)
COPIES:             2                      SCHED POLICY:   parallel
LPs:                1                      PPs:            2
STALE PPs:          0                      BB POLICY:      relocatable
INTER-POLICY:       minimum                RELOCATABLE:    yes
INTRA-POLICY:       middle                 UPPER BOUND:    128
MOUNT POINT:        N/A                    LABEL:          None
MIRROR WRITE CONSISTENCY: on/ACTIVE
EACH LP COPY ON A SEPARATE PV ?: yes
Serialize IO ?:     NO
root@jkwha001d:#
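For reference, adding a second copy to a not-yet-mirrored log LV and synchronizing it looks roughly like this (the disk name is only an example; pick a PV different from the one holding the first copy):

root@jkwha001d:# mklvcopy loglv04 2 hdisk3
root@jkwha001d:# syncvg -l loglv04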

The second finding is that even with all those changes there is a very high probability of failure (i.e. the PowerHA RG move won’t work). Personally, the risk is so high that for me it is nearly a guarantee. The only proper solution to this problem that I am able to see is to add a special handler to the err_method in the errdemon code, something like err_method = KILL_THE_NODE. This KILL_THE_NODE action should be implemented internally by the always-running errdemon process. The process should be running with memory protected from swapping (something like mlockall())… because currently it is not running that way:

root@jkwha001d:# svmon -P 143470
-------------------------------------------------------------------------------
     Pid Command          Inuse      Pin     Pgsp  Virtual 64-bit Mthrd  16MB
  143470 errdemon         24851    11028        0    24700      N     N     N

     PageSize                Inuse        Pin       Pgsp    Virtual
     s    4 KB                 323          4          0        172
     m   64 KB                1533        689          0       1533

    Vsid      Esid Type Description              PSize  Inuse   Pin Pgsp Virtual
   7c05d         d work fork tree                    m   1041   233    0    1041
                   children=4d6cdc, 0
    8002         0 work fork tree                    m    492   456    0     492
                   children=802760, 0
   4d0f7         2 work process private             sm    131     4    0     131
   35309         3 clnt mapped file,/dev/hd9var:69   s    107     0    -       -
   110c0         f work shared library data         sm     41     0    0      41
   1d023         1 clnt code,/dev/hd2:44058          s     25     0    -       -
   45115         - clnt /dev/hd2:63900               s     11     0    -       -
   3902a         - clnt /dev/hd9var:1015             s      8     0    -       -
root@jkwha001d:#
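Short of such a change from IBM, the closest user-space approximation I can think of is registering one’s own errnotify stanza for LVM_SA_QUORCLOSE that calls a locally written “kill the node” script. This is only a sketch (the script path is hypothetical), and note that it suffers from exactly the same weakness: the method still has to be readable when the error fires.

root@jkwha001d:# cat /tmp/kill_the_node.add
errnotify:
        en_name = "killTheNode"
        en_persistenceflg = 1
        en_label = "LVM_SA_QUORCLOSE"
        en_class = "H"
        en_method = "/usr/local/sbin/kill_the_node.sh"
root@jkwha001d:# odmadd /tmp/kill_the_node.add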

Hope that IBM fixes that. There is no APAR because I’m lazy… I don’t even want to fight with the 1st line of support… BTW: I’m not the first one who noticed this problem, please see here for a blog post about the same issue from Ricardo Gameiro.

-Jakub.

PowerHA failure scenario when dealing with SAN-booted LPARs – part II

Monday, January 24th, 2011

Ok, this is part II of the post PowerHA failure scenario when dealing with SAN-booted LPARs – part I.

The first scenario we performed was to disable 100% of the MPIO storage paths to the active HACMP node by un-mapping the Virtual FibreChannel adapters (vadapters) on both VIOS protecting the active node (LPAR). On both VIOS we performed the following command:

$ vfcmap -vadapter vfchostXYZ -fcp

where vfchostXYZ was the server-side (or rather VIOS-side) vadapter handling FC I/O traffic for this LPAR. The result? The LPAR with the active HACMP/PowerHA Resource Groups on it (jkwha001d) after some time evicted itself from the HACMP cluster, and the old passive node jkwha002d became the active one (Oracle started there). The root cause of the jkwha001d LPAR eviction is the following:

root@jkwha001d : /home/root :# errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
AFA89905   0103071611 I O grpsvcs        Group Services daemon started
97419D60   0103071611 I O topsvcs        Topology Services daemon started
A6DF45AA   0103071111 I O RMCdaemon      The daemon is started.
67145A39   0103070911 U S SYSDUMP        SYSTEM DUMP
F48137AC   0103070911 U O minidump       COMPRESSED MINIMAL DUMP
DC73C03A   0103070911 T S fscsi0         SOFTWARE PROGRAM ERROR
9DBCFDEE   0103070911 T O errdemon       ERROR LOGGING TURNED ON
26623394   0103070511 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070511 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070511 T H fscsi1         COMMUNICATION PROTOCOL ERROR
(..)
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk4         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk8         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk6         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk3         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk2         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
DE3B8540   0103070411 P H hdisk0         PATH HAS FAILED
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DE3B8540   0103070411 P H hdisk5         PATH HAS FAILED
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
26623394   0103070411 T H fscsi1         COMMUNICATION PROTOCOL ERROR
DC73C03A   0103070411 T S fscsi1         SOFTWARE PROGRAM ERROR
(..)
A39F8A49   1210235210 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
D5676F6F   1210235210 T H fscsi1         ATTACHED SCSI TARGET DEVICE ERROR
DE3B8540   1210235210 P H hdisk6         PATH HAS FAILED
D5676F6F   1210235210 T H fscsi0         ATTACHED SCSI TARGET DEVICE ERROR
DE3B8540   1210235210 P H hdisk1         PATH HAS FAILED
D5676F6F   1210235210 T H fscsi1         ATTACHED SCSI TARGET DEVICE ERROR
(..)

As you can see, AIX 6.1 generated a “SYSTEM DUMP”, which is actually a system panic indicator. Normally AIX saves the memory image of its kernel at run time to the Logical Volumes configured with the “sysdump” type and then reboots. This allows investigating why the machine crashed, but here you won’t see anything: you’ve lost FC storage connectivity (even for rootvg), so it couldn’t even save the dump. So how does it know it crashed? Probably the state of the LPAR crash is saved somewhere in the POWERVM firmware area. It’s one of the RAS things in AIX/POWER. So far, so good: HACMP/PowerHA was able to recover the services…

OK, but we wanted to have real proof, so we performed a double-check. We suspected that the VIOS might have some magic way to communicate with the LPAR. We wanted to exclude that factor, so we simulated disconnecting the FC at the storage level. The initial state was a stable HACMP cluster, with the RG active on jkwha002d (and jkwha001d being passive); all MPIO paths to the Netapp cluster (storage1, storage2) were reported as “Enabled” by “lspath”. The igroup term in the Netapp world means “initiator group” and is responsible for things like LUN masking. If you remove access for some FC WWPNs on the LUN, you end up in a situation in which AIX has hdisks pointing to non-existing SCSI targets, and the AIX SCSI stack gets a lot of errors (FC is just a wrapper around SCSI).

On the 2nd storage controller (storage2) the igroup jkwha002d_boot was controlling access to the OS-level LUNs (rootvg, etc.):

storage2> lun show -g jkwha002d_boot
        /vol/jkwha002d_boot/boot/rootvg   40.0g (42953867264)   (r/w, online, mapped)
storage2>

Also on the 2nd storage controller there was an igroup responsible for the SAP and Oracle LUNs:

storage2> lun show -g jkwha002d_cluster
        /vol/sap_jw1_mirrlog/mirrlog/mirrlog01.lun      1g (1073741824)    (r/w, online, mapped)
        /vol/sap_jw1_oraarch/oraarch/oraarch01.lun     51g (54760833024)   (r/w, online, mapped)
        /vol/sap_jw1_sapdata/sapdata2/sapdata02.lun     15g (16106127360)   (r/w, online, mapped)
storage2>

The same igroup was present on the 1st storage system, controlling the remaining LUNs (yes, this is an active-active configuration, where reads/writes are managed by AIX LVM, with VGs consisting of LUNs on the two controllers):

storage1> lun show -g jkwha002d_cluster
        /vol/clus_001d_002d_hb/hb/hbvg     20m (20971520)      (r/w, online, mapped)
        /vol/sap_jw1_origlog/origlog/origlog01.lun      1g (1073741824)    (r/w, online, mapped)
        /vol/sap_jw1_sapbin/sapbin/sapbin01.lun    100g (107374182400)  (r/w, online, mapped)
        /vol/sap_jw1_sapdata/sapdata1/sapdata01.lun     15g (16106127360)   (r/w, online, mapped)
storage1>

Let’s do it:

storage1> igroup remove
usage:
igroup remove [ -f ] <initiator_group> <node> ...
  - removes node(s) from an initiator group. The node may also
    specified as its alias. See "fcp wwpn-alias" for help with aliases
    By default, if lun maps exist for the initiator group, the node is
    not removed from the group. The -f option can be used to force the
    removal.
For more information, try 'man na_igroup'
storage1>
storage1> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:de
storage1> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:e0
storage2> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:de
storage2> igroup remove -f jkwha002d_cluster c0:50:76:00:61:20:00:e0
storage2> igroup remove -f jkwha002d_boot c0:50:76:00:61:20:00:de
storage2> igroup remove -f jkwha002d_boot c0:50:76:00:61:20:00:e0

.. and the HACMP failover won’t work. The active LPAR node (jkwha002d) is going to end up in a zombie state. If you had opened SSH sessions before, everything would indicate that it is working: ls on /etc works, some command utilities too. But that is only because some things are cached in AIX’s filesystem cache. Everything else is going to fail with I/O errors; lspath will cry too (it won’t display all MPIO paths as failed, but that is a story for another post). Examples:

root@jkwha002d:# topas
bash: /usr/bin/topas: There is an input or output error.
root@jkwha002d:# nmon
bash: /usr/bin/nmon: There is an input or output error.
root@jkwha002d:# lspath
Failed  hdisk0 fscsi1
Failed  hdisk1 fscsi1
Enabled hdisk0 fscsi1
Failed  hdisk1 fscsi1
Failed  hdisk0 fscsi0
Failed  hdisk1 fscsi0
Failed  hdisk0 fscsi0
Enabled hdisk1 fscsi0
Failed  hdisk2 fscsi0
Failed  hdisk2 fscsi0
Failed  hdisk2 fscsi0
Failed  hdisk2 fscsi0
Enabled hdisk2 fscsi1
Failed  hdisk2 fscsi1
Failed  hdisk2 fscsi1
Failed  hdisk2 fscsi1
Failed  hdisk3 fscsi0
Failed  hdisk4 fscsi0
Failed  hdisk5 fscsi0
Failed  hdisk6 fscsi0
Failed  hdisk7 fscsi0
Failed  hdisk8 fscsi0
Failed  hdisk3 fscsi0
Failed  hdisk4 fscsi0
Failed  hdisk5 fscsi0
Failed  hdisk6 fscsi0
Failed  hdisk7 fscsi0
Failed  hdisk8 fscsi0
Failed  hdisk3 fscsi0
Failed  hdisk4 fscsi0
Failed  hdisk5 fscsi0
Failed  hdisk6 fscsi0
Failed  hdisk7 fscsi0
Failed  hdisk8 fscsi0
Failed  hdisk3 fscsi0
Failed  hdisk4 fscsi0
Failed  hdisk5 fscsi0
Failed  hdisk6 fscsi0
Failed  hdisk7 fscsi0
Failed  hdisk8 fscsi0
Failed  hdisk3 fscsi1
Failed  hdisk4 fscsi1
Enabled hdisk5 fscsi1
Enabled hdisk6 fscsi1
Failed  hdisk7 fscsi1
Failed  hdisk8 fscsi1
Enabled hdisk3 fscsi1
Enabled hdisk4 fscsi1
Failed  hdisk5 fscsi1
Failed  hdisk6 fscsi1
Failed  hdisk7 fscsi1
Failed  hdisk8 fscsi1
Failed  hdisk3 fscsi1
Failed  hdisk4 fscsi1
Failed  hdisk5 fscsi1
Failed  hdisk6 fscsi1
Enabled hdisk7 fscsi1
Failed  hdisk8 fscsi1
Failed  hdisk3 fscsi1
Failed  hdisk4 fscsi1
Failed  hdisk5 fscsi1
Failed  hdisk6 fscsi1
Failed  hdisk7 fscsi1
Enabled hdisk8 fscsi1
Missing hdisk9 fscsi0
Missing hdisk9 fscsi0
Missing hdisk9 fscsi1
Missing hdisk9 fscsi1
root@jkwha002d:# errpt | more
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
B6267342   0103074511 P H hdisk0         DISK OPERATION ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
B6267342   0103074511 P H hdisk0         DISK OPERATION ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
EA88F829   0103074511 I O SYSJ2          USER DATA I/O ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
00B984B3   0103074511 U H hdisk0         UNDETERMINED ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
B6267342   0103074511 P H hdisk0         DISK OPERATION ERROR
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW
E86653C3   0103074511 P H LVDD           I/O ERROR DETECTED BY LVM
A39F8A49   0103074511 T S syserrlg       ERROR LOGGING BUFFER OVERFLOW

Oracle is going to panic (which is good behaviour in my opinion):

root@jkwha002d:# tail alert_JW1.log
Mon Jan  3 07:39:21 2011
Errors in file /oracle/JW1/saptrace/background/jw1_lgwr_831500.trc:
ORA-00345: redo log write error block 7 count 1
ORA-00312: online log 13 thread 1: '/oracle/JW1/origlogA/log_g13m1.dbf'
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: -1
Additional information: 512
Mon Jan  3 07:39:23 2011
Errors in file /oracle/JW1/saptrace/background/jw1_ckpt_368764.trc:
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogA/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogB/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
Mon Jan  3 07:39:23 2011
Errors in file /oracle/JW1/saptrace/background/jw1_ckpt_368764.trc:
ORA-00221: error on write to control file
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogA/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/oracle/JW1/origlogB/cntrl/cntrlJW1.dbf'
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: 10
Additional information: 3
Mon Jan  3 07:39:23 2011
CKPT: terminating instance due to error 221
Instance terminated by CKPT, pid = 368764
root@jkwha002d:#

What’s most interesting is that Oracle asks AIX to write this to the alert log file – and it is available to read by commands like “tail”, but the “cat” command won’t work (you won’t be able to read the whole alert log file because you don’t have I/O!). What’s even more interesting is that you won’t see those messages after rebooting (once the kernel memory is gone!). If you don’t have I/O, how are you going to write/fsync this file?

Another thing is that the active HACMP node will still be active; it will keep handling Resource Groups, etc. Failover won’t happen. A possible solution to this problem would be a kernel-based check that verifies that at least /etc is accessible. Why kernel-based? Because you have to have some form of heartbeat in memory (like an AIX kernel module, or an uncacheable binary always present in RAM, running with real-time scheduling priority) that would continuously verify storage access. If it failed several times in a row, it should reboot the node (which would trigger a Resource Group failover to the 2nd node in the cluster).

Note: typical HACMP scripting – at least in theory – is not enough; even if it would force running /sbin/halt, how can you be sure that /sbin/halt and all the required libc contents are in memory?
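To illustrate the idea (and its limits), a crude user-space sketch is shown below: it reads straight from a raw rootvg hdisk (bypassing the filesystem cache) and halts the node after a few consecutive failures. The device name is an assumption, and as noted above, even this depends on ksh, dd and halt still being resident in memory, which is exactly why a kernel-based implementation would be preferable.

#!/bin/ksh
# crude storage heartbeat: a raw read from a rootvg disk bypasses the JFS2 cache;
# after 3 consecutive failures force a halt so the other PowerHA node takes over
FAILS=0
while true; do
    if dd if=/dev/rhdisk0 of=/dev/null bs=4096 count=1 >/dev/null 2>&1; then
        FAILS=0
    else
        FAILS=$((FAILS + 1))
        [ $FAILS -ge 3 ] && /usr/sbin/halt -q
    fi
    sleep 5
done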

Perhaps I’ll write a part III of this article :)

-Jakub.

PowerHA failure scenario when dealing with SAN-booted LPARs – part I

Sunday, January 9th, 2011

Together with Jedrzej we’ve exposed a rather interesting weakness in the IBM PowerHA 5.5 solution (in the old days it was called HACMP). Normally you would assume that in case of a major cataclysm, such as the *complete* disappearance of storage on the active node, PowerHA or AIX has an internal mechanism to prevent downtime by switching the services to the next active node (as defined in the PowerHA policies/configuration). This starts to be really interesting when we talk about SAN BOOT-ed AIX LPARs. As everybody knows, any form of assumption is bad (this is starting to be my mantra), and as we show here, avoiding this pitfall requires SOME PLANNING AND CONFIGURATION to avoid ending up in a long downtime….

The environment consisted of:

  • Two LPARs (jkwha001d = active; jkwha002d = passive) running on separate Power frames.
  • Each LPAR on each frame was protected by a separate pair of VIOS (to provide redundancy for the NPIV-enabled FC fabrics and to enable Shared Ethernet Failover).
  • AIX 6.1 TL3
  • Oracle 10gR2
  • PowerHA 5.5SP2
  • Netapp DataOntap 7.3 storage (Netapp cluster, 2 controllers) – yes, AIX LPARs are also SAN BOOT-ed from this storage
  • 1 network-based (IPv4) heartbeat
  • 1 disk-based (on FC LUN) heartbeat
  • Properly configured & running scripts for starting and stopping Oracle
  • No application monitors configured

We’ve performed two experiments:

  1. Simulating total storage loss on jkwha001d only (the active one, hosting Oracle) by unmapping Virtual Fibre Channel (NPIV) host adapters (vadapters) on both VIOS protecting jkwha001d.
  2. Simulating total storage loss on jkwha002d only (the active one, hosting Oracle – after failover) by forcing disconnection of jkwha002d’s Fibre Channel initiators (WWNs) on the Netapp cluster.

Both of these simulated the same thing: a virtual LPAR running without any storage at all. More on the results in the next post…

-Jakub.

Testing NetApp’s SyncMirror whole plex failure (on simulator)

Thursday, May 20th, 2010

Together with Lukasz Borek we’ve tested SyncMirror reliability under some stress…. In short, a basic NetApp deployment just uses RAID-DP (Double Parity, so you can have 2 disks fail before you lose data). If your data is critical, then you can additionally configure SyncMirror, which mirrors an aggregate (on top of which are the volumes) to two plexes, each consisting of RAID group(s). This behaves like RAID-1 on top of two RAID-DP groups (you have double the amount of disks protecting you). A simple demonstration follows (just before this we removed all spare disks for demonstration purposes).

So we have our aggregate aggr_mirr:

filerA> aggr status -r
Aggregate aggr_mirr (online, raid_dp, mirrored) (block checksums)
  Plex /aggr_mirr/plex0 (online, normal, active, pool0)
    RAID group /aggr_mirr/plex0/rg0 (normal)

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      dparity   v5.19   v5    ?   ?   FC:B   0  FCAL  N/A  1020/2089984      1027/2104448
      parity    v5.20   v5    ?   ?   FC:B   0  FCAL  N/A  1020/2089984      1027/2104448
      data      v5.24   v5    ?   ?   FC:B   0  FCAL  N/A  1020/2089984      1027/2104448

  Plex /aggr_mirr/plex1 (online, normal, active, pool0)
    RAID group /aggr_mirr/plex1/rg0 (normal)

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      dparity   v5.25   v5    ?   ?   FC:B   0  FCAL  N/A  1020/2089984      1027/2104448
      parity    v5.26   v5    ?   ?   FC:B   0  FCAL  N/A  1020/2089984      1027/2104448
      data      v5.27   v5    ?   ?   FC:B   0  FCAL  N/A  1020/2089984      1027/2104448

[..]

Pool1 spare disks (empty)

Pool0 spare disks (empty)

Broken disks

RAID Disk       Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
---------       ------  ------------- ---- ---- ---- ----- --------------    --------------
admin removed   v4.16   v4    ?   ?   FC:B   0  FCAL  N/A  1020/2089984      1027/2104448
admin removed   v4.17   v4    ?   ?   FC:B   0  FCAL  N/A  1020/2089984      1027/2104448
admin removed   v4.20   v4    ?   ?   FC:B   0  FCAL  N/A  1020/2089984      1027/2104448
admin removed   v4.21   v4    ?   ?   FC:B   0  FCAL  N/A  1020/2089984      1027/2104448
admin removed   v4.22   v4    ?   ?   FC:B   0  FCAL  N/A  1020/2089984      1027/2104448
admin removed   v4.24   v4    ?   ?   FC:B   0  FCAL  N/A  1020/2089984      1027/2104448
admin removed   v4.25   v4    ?   ?   FC:B   0  FCAL  N/A  1020/2089984      1027/2104448
admin removed   v4.26   v4    ?   ?   FC:B   0  FCAL  N/A  1020/2089984      1027/2104448
admin removed   v4.27   v4    ?   ?   FC:B   0  FCAL  N/A  1020/2089984      1027/2104448
admin removed   v4.28   v4    ?   ?   FC:B   0  FCAL  N/A  1020/2089984      1027/2104448
admin removed   v4.29   v4    ?   ?   FC:B   0  FCAL  N/A  1020/2089984      1027/2104448
admin removed   v4.32   v4    ?   ?   FC:B   0  FCAL  N/A  1020/2089984      1027/2104448
admin removed   v5.28   v5    ?   ?   FC:B   0  FCAL  N/A  1020/2089984      1027/2104448
admin removed   v5.29   v5    ?   ?   FC:B   0  FCAL  N/A  1020/2089984      1027/2104448
admin removed   v5.32   v5    ?   ?   FC:B   0  FCAL  N/A  1020/2089984      1027/2104448

We want to fail all of the disks (v5.19, v5.20, v5.24) that are part of /aggr_mirr/plex0/rg0 (that simulates losing the whole plex; we wanted to test that aggr_mirr is able to survive the crash of one of its plexes):

filerA> disk fail v5.19
*** You are about to prefail the following file system disk, ***
*** which will eventually result in it being failed ***
  Disk /aggr_mirr/plex0/rg0/v5.19

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      dparity   v5.19   v5    ?   ?   FC:B   0  FCAL  N/A  1020/2089984      1027/2104448
***
Really prefail disk v5.19?  y

WARNING! There is no spare disk available to which to copy.
Are you sure you want to continue with disk fail (y/n)? y
disk fail: The following disk was prefailed: v5.19
Disk v5.19 has been prefailed.  Its contents will be copied to a
replacement disk, and the prefailed disk will be failed out.
filerA>
Tue Apr 27 18:03:52 GMT [raid.rg.diskcopy.cant.start:warning]: /aggr_mirr/plex0/rg0: unable to start disk copy for v5.19: No block checksum disk of required type and size is available, targeting Pool0
filerA>
filerA> disk fail v5.20
*** You are about to prefail the following file system disk, ***
*** which will eventually result in it being failed ***
  Disk /aggr_mirr/plex0/rg0/v5.20

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      parity    v5.20   v5    ?   ?   FC:B   0  FCAL  N/A  1020/2089984      1027/2104448
***
Really prefail disk v5.20? yes

WARNING! There is no spare disk available to which to copy.
Are you sure you want to continue with disk fail (y/n)? y
disk fail: The following disk was prefailed: v5.20
Disk v5.20 has been prefailed.  Its contents will be copied to a
replacement disk, and the prefailed disk will be failed out.
filerA>
filerA>
filerA> disk fail v5.24
*** You are about to prefail the following file system disk, ***
*** which will eventually result in it being failed ***
  Disk /aggr_mirr/plex0/rg0/v5.24

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      data      v5.24   v5    ?   ?   FC:B   0  FCAL  N/A  1020/2089984      1027/2104448
***
Really prefail disk v5.24? y

WARNING! There is no spare disk available to which to copy.
Are you sure you want to continue with disk fail (y/n)? y
disk fail: The following disk was prefailed: v5.24
Disk v5.24 has been prefailed.  Its contents will be copied to a
replacement disk, and the prefailed disk will be failed out.
filerA>
filerA> disk simpull  v5.19
Tue Apr 27 18:06:08 GMT [raid.config.filesystem.disk.missing:info]: File system Disk /aggr_mirr/plex0/rg0/v5.19 Shelf ? Bay ? [NETAPP   VD-1000MB-FZ-520 0042] S/N [16402503] is missing.
filerA>
Tue Apr 27 18:06:08 GMT [raid.rg.recons.missing:notice]: RAID group /aggr_mirr/plex0/rg0 is missing 1 disk(s).
Tue Apr 27 18:06:08 GMT [raid.rg.recons.cantStart:warning]: The reconstruction cannot start in RAID group /aggr_mirr/plex0/rg0: No block checksum disk of required type and size is available, targeting Pool1
filerA>
filerA> disk simpull  v5.20
Tue Apr 27 18:06:16 GMT [raid.disk.missing:info]: Disk /aggr_mirr/plex0/rg0/v5.20 Shelf ? Bay ? [NETAPP   VD-1000MB-FZ-520 0042] S/N [16402504] is missing from the system
Tue Apr 27 18:06:16 GMT [raid.config.filesystem.disk.missing:info]: File system Disk /aggr_mirr/plex0/rg0/v5.20 Shelf ? Bay ? [NETAPP   VD-1000MB-FZ-520 0042] S/N [16402504] is missing.
filerA>
Tue Apr 27 18:06:17 GMT [raid.rg.recons.missing:notice]: RAID group /aggr_mirr/plex0/rg0 is missing 2 disk(s).
Tue Apr 27 18:06:17 GMT [raid.rg.recons.cantStart:warning]: The reconstruction cannot start in RAID group /aggr_mirr/plex0/rg0: No block checksum disk of required type and size is available, targeting Pool1
filerA>
filerA> disk simpull  v5.24
Tue Apr 27 18:06:20 GMT [raid.config.filesystem.disk.missing:info]: File system Disk /aggr_mirr/plex0/rg0/v5.24 Shelf ? Bay ? [NETAPP   VD-1000MB-FZ-520 0042] S/N [16402507] is missing.
Tue Apr 27 18:06:20 GMT [raid.vol.mirror.degraded:error]: Aggregate aggr_mirr is mirrored and one plex has failed. It is no longer protected by mirroring.

So we’ve destroyed the whole plex! Great success! ;) In the real world this would be more like a whole-shelf failure. The NetApp controller of course also tried to call home, because this is pretty serious:

Tue Apr 27 18:06:20 GMT [callhome.syncm.plex:CRITICAL]: Call home for SYNCMIRROR PLEX FAILED

Let’s verify again:

filerA> aggr status -r
Aggregate aggr_mirr (online, raid_dp, mirror degraded) (block checksums)
  Plex /aggr_mirr/plex0 (offline, failed, inactive, pool0)
    RAID group /aggr_mirr/plex0/rg0 (partial)

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      dparity   FAILED          N/A                        1020/2089984
      parity    FAILED          N/A                        1020/2089984
      data      FAILED          N/A                        1020/2089984
      Raid group is missing 3 disks.

  Plex /aggr_mirr/plex1 (online, normal, active, pool0)
    RAID group /aggr_mirr/plex1/rg0 (normal)

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      dparity   v1.25   v1    ?   ?   FC:A   0  FCAL  N/A  1020/2089984      1027/2104448
      parity    v5.26   v5    ?   ?   FC:B   0  FCAL  N/A  1020/2089984      1027/2104448
      data      v1.27   v1    ?   ?   FC:A   0  FCAL  N/A  1020/2089984      1027/2104448

[..]

During the whole plex destruction process we were continually writing to an NFS volume on this aggr_mirr, in synchronous mode (to avoid caching at the OS level and to always hit the [NV]RAM – of course there is no NVRAM in the simulator….). As you can see from the output below, there was not a single I/O error from the NFS client’s perspective:
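The client-side write loop behind the output below was essentially a timestamped dd in a loop; a sketch of it is shown here (the mount point, file name and the exact sync flag are assumptions; the point is only that every 2MB write is forced out synchronously):

# loop forever: timestamp, then a 2MB synchronous write to a file on the aggr_mirr NFS mount
while true; do
    date
    dd if=/dev/zero of=/mnt/aggr_mirr/testfile bs=1048576 count=2 conv=fsync
    sleep 1
done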

2097152 bytes (2.1 MB) copied, 0.498529 seconds, 4.2 MB/s
Thu Apr 29 16:10:32 CEST 2010
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.51502 seconds, 4.1 MB/s
Thu Apr 29 16:10:33 CEST 2010
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.514629 seconds, 4.1 MB/s
Thu Apr 29 16:10:35 CEST 2010
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.419478 seconds, 5.0 MB/s
Thu Apr 29 16:10:36 CEST 2010
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.411578 seconds, 5.1 MB/s
Thu Apr 29 16:10:38 CEST 2010
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.444393 seconds, 4.7 MB/s
Thu Apr 29 16:10:39 CEST 2010
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.43042 seconds, 4.9 MB/s
Thu Apr 29 16:10:41 CEST 2010
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.417388 seconds, 5.0 MB/s
Thu Apr 29 16:10:42 CEST 2010
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.504876 seconds, 4.2 MB/s
Thu Apr 29 16:10:44 CEST 2010
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.425559 seconds, 4.9 MB/s
Thu Apr 29 16:10:45 CEST 2010
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.453916 seconds, 4.6 MB/s
Thu Apr 29 16:10:46 CEST 2010
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.417439 seconds, 5.0 MB/s

Storage migration for ASM database deployed on Oracle Enterprise Linux in Oracle VM without downtime.

Thursday, May 21st, 2009

Suppose you wanted to migrate your database from storage array SAN1 to SAN2 without downtime. With Oracle databases using ASM this is possible. It was performed on the configuration described in more detail here. One note: the LUN can be visible through dom0 or directly by domU (by passing PCI hardware handling into our domU VM) – this post explains only the first scenario, as it is the more common one. Brief steps include:

  1. Prepare new LUNs on storage array (not described here)
  2. Attach new LUNs to the Oracle VM (not described here; simulated here by using a simple zero-filled file created with the dd utility; I assume this was performed earlier [SCSI bus rescanning and so on] or that the file was created in /OVS).
  3. Modifying the VM’s Xen config file.
  4. Online attaching block device to VM.
  5. Preparing new storage device from inside of the target VM.
  6. Discovering new LUN in ASM.
  7. Actual rebalance process…
  8. Verification.

Step 3: Modify vm.cfg file for additional storage.

This is straightforward, just add one line. Do NOT restart the VM. There is no need.

[root@quadovm 01_prac1]# cat vm.cfg
bootloader = '/usr/bin/pygrub'
disk = ['file:/OVS/running_pool/01_prac1/system.img,hda,w',
'file:/OVS/running_pool/01_prac1/oracle_software.img,hdd,w',
'file:/OVS/running_pool/prac.storage.raw,hde,w!',
'file:/OVS/running_pool/prac.ocr,hdf,w!',
'file:/OVS/running_pool/prac.voting,hdg,w!',
'file:/OVS/running_pool/prac.storage2.raw,hdi,w!',
]
memory = 1638
name = '01_prac1'
[..]
[root@quadovm 01_prac1]#

Step 4: Attach block device to the running VM.

[root@quadovm 01_prac1]# xm block-attach 01_prac1 file:///OVS/running_pool/prac.storage2.raw /dev/hdi 'w!'
[root@quadovm 01_prac1]#

Step 5: Prepare prac1 VM for new device.

The newly added storage should be automatically detected; this can be verified by checking the dmesg output:

[root@prac1 ~]# dmesg|grep hdi
hdi: unknown partition table
[root@prac1 ~]# ls -al /dev/hde /dev/hdi
brw-rw---- 1 root dba  33, 0 Jan 24 13:00 /dev/hde
brw-r----- 1 root disk 56, 0 Jan 24 12:59 /dev/hdi
[root@prac1 ~]#
[root@prac1 ~]# fdisk -l /dev/hd[ei] 2> /dev/null | grep GB
Disk /dev/hde: 15.7 GB, 15728640000 bytes
Disk /dev/hdi: 16.1 GB, 16106127360 bytes
[root@prac1 ~]#

As we can see, the new LUN is bigger (it should be bigger or equal in size, but I haven’t checked what happens if you add a smaller one). Now we have to assign the correct permissions so that ASM/the database can use the new /dev/hdi device without problems (this doesn’t cover modifying the udev rules in /etc/udev/, which is required to make the new device come up with the right permissions after a reboot – do your own homework :), or see the sketch after the commands below):

[root@prac1 ~]# chgrp dba /dev/hdi
[root@prac1 ~]# chmod g+w /dev/hdi
[root@prac1 ~]#
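As for that homework, a minimal sketch of making the permissions persistent with udev is shown below (EL5-era rule syntax; the rule file name, owner and group are assumptions):

# write a udev rule so /dev/hdi comes up owned by oracle:dba after a reboot
cat > /etc/udev/rules.d/99-oracle-asm.rules <<'EOF'
KERNEL=="hdi", OWNER="oracle", GROUP="dba", MODE="0660"
EOF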

Step 6: Preparing ASM for new disk.

Verification of the current diskgroups and setting the rebalance power for diskgroup DATA1 to zero.

SQL> col name format a20
SQL> SELECT name, type, total_mb, free_mb, required_mirror_free_mb req_mirr_free, usable_file_mb FROM V$ASM_DISKGROUP;

NAME                 TYPE     TOTAL_MB    FREE_MB REQ_MIRR_FREE USABLE_FILE_MB
-------------------- ------ ---------- ---------- ------------- --------------
DATA1                EXTERN      15000      14143             0          14143

SQL> ALTER DISKGROUP DATA1 REBALANCE POWER 0 WAIT;

Diskgroup altered.

SQL>

Next we have to force ASM to discover the new device by modifying the asm_diskstring parameter (I’m using an IFILE for ASM, so I have to manually edit the ASM pfile as well; if I don’t, ASM won’t remember the new setting after a restart).

SQL> show parameter string

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
asm_diskstring                       string      /dev/hde

SQL>
SQL> alter system set asm_diskstring='/dev/hde', '/dev/hdi' scope=memory;

System altered.

SQL> show parameter string

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
asm_diskstring                       string      /dev/hde, /dev/hdi

SQL>
[oracle@prac1 11.1.0]$ vi /u01/app/oracle/product/11.1.0/db_1/dbs/init+ASM1.ora
#Modify asm_diskstring here too
asm_diskstring='/dev/hde','/dev/hdi'

Step 7: The main part: ASM rebalance

SQL> ALTER DISKGROUP DATA1 ADD DISK '/dev/hdi';

Diskgroup altered.

SQL> SELECT GROUP_NUMBER, OPERATION, STATE FROM V$ASM_OPERATION;

GROUP_NUMBER OPERA STATE
------------ ----- --------------------
1 REBAL RUN

SQL> select name,path,state,failgroup from v$asm_disk;

NAME                 PATH            STATE                FAILGROUP
-------------------- --------------- -------------------- ----------
DATA1_0000     /dev/hde       NORMAL               DATA1_0000
DATA1_0001     /dev/hdi        NORMAL               DATA1_0001

SQL> ALTER DISKGROUP DATA1 DROP DISK DATA1_0000;

Diskgroup altered.

SQL> SELECT GROUP_NUMBER, OPERATION, STATE, EST_MINUTES FROM V$ASM_OPERATION;

GROUP_NUMBER OPERA STATE                EST_MINUTES
------------ ----- -------------------- -----------
           1 REBAL RUN                           32

SQL>

Typical snapshot of iostat right now (10 sec averages):

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
hde               0.00     0.00 340.80  1.10    14.20     0.00    85.08     1.43    4.18   0.35  12.04
hdi               0.00     0.00  0.40 357.40     0.01    14.28    81.77    11.52   32.22   2.40  86.04

From another normal SQL session:

SQL> insert into t2(id) values ('2');

1 row created.

SQL> commit;

Commit complete.

SQL>

Back to the ASM instance:

SQL> ALTER DISKGROUP DATA1 REBALANCE POWER 11;
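
Before detaching /dev/hde from the VM it is worth making sure the rebalance has really finished and the old disk is no longer a member of the diskgroup; a quick check (a sketch, not taken from the original session) could be:

SQL> SELECT GROUP_NUMBER, OPERATION, STATE, EST_MINUTES FROM V$ASM_OPERATION;
SQL> SELECT PATH, HEADER_STATUS FROM V$ASM_DISK WHERE PATH = '/dev/hde';

Once the first query returns no rows and the second reports HEADER_STATUS = FORMER, the old disk has left the diskgroup and /dev/hde can be safely detached.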

Step 8: Verification.

We’ll just execute some big, IO-intensive SQL statement to generate some IO (thanks to Tanel for blogging this query):

SQL> create table t4 as select rownum r from
(select rownum r from dual connect by rownum <= 1000) a,
(select rownum r from dual connect by rownum <= 1000) b,
(select rownum r from dual connect by rownum <= 1000) c
where rownum <= 100000000;

From iostat we can see that only hdi is being used, which assures us that the database is really using hdi.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
33.33    0.00   26.47    5.88    0.00   34.31

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
hde               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
hdi               0.00     0.00 107.84 351.96     1.60    12.35    62.14     4.96   10.30   0.91  41.96

My engineering work…

Wednesday, August 15th, 2007

As of May I’ve been very busy architecting and implementing a Java Enterprise Edition cluster on commodity hardware (mainly x86_32 based) for my engineering work, to obtain a BEng title. Our subject is:
“Web service based on scalable and highly available J2EE application cluster”. We have a team of four people, in which I’m responsible for all kinds of systems/hardware scaling, clusters, load balancing, databases, networking and tuning of everything :) . What kind of portal we are creating is to be decided by the developers (it will likely be some kind of Web 2.0 portal).
The rest of the team is dedicated to J2EE programming. We are mainly playing with the technology.
Currently the rock-solid base core cluster architecture looks like this:

Cluster architecture

We are utilizing:

  • Load balancers: Linux Virtual Servers with Direct Routing on CentOS 5 (configured as part of Red Hat Cluster Suite)
  • Database: Oracle 10g R2
  • Middleware: JBOSS 4.2.0 (EJB3) running in a cluster based on JGroups + Hibernate (JPA) + JBOSS Cache
  • Frontend: Apache2 web servers with Solaris Network Cache Accelerator and an AJP proxy to the JBOSS servers
  • Solaris Jumpstart to set up new systems really fast, together with our self-written PHP application for maintaining the systems.
  • NFS providing static content for the web servers from the Oracle server (yay! a dedicated NetApp would be great! ;) )
  • LDAP to synchronize admin accounts inside the cluster.
  • SNMPv2 (LVS, OSes, JBOSS, Oracle) to monitor everything with a single (self-written) Java application which graphs everything in real time.

As this is a basic configuration with the database as a single point of failure, in September I’m going to set up Data Guard for Oracle. I’m also testing more advanced scaling. Currently I’m in the process of setting up Solaris Cluster with Oracle RAC 10gR2 on iSCSI storage provided by a third node running Solaris Nevada with an iSCSI target, in order to test Transparent Application Failover. I’ve been scratching my head over this one for a while now. Yeah, it is real hardcore… moreover, that’s not the end of the story: Disaster Recovery with some other interesting bits of technology is going to be implemented later on… all on x86_32 commodity hardware :) We are also going to put C-JDBC (the Sequoia project) under stress…

Exporting simple file from Linux target to Solaris initiatior using iSCSI

Friday, March 23rd, 2007

A quick how-to on “exporting” a simple file via iSCSI from a Linux target (ietd) to Solaris:

The Linux target is running Debian 4.0 with a 2.6.18 kernel and iSCSI Enterprise Target (ietd) version 0.4.14. I wish it were a Solaris box, but my very old home SCSI controllers (DELL MegaRAID 428 (PERC2) and InitIO) aren’t supported by Solaris; there are some drivers, but only for Solaris 2.7-2.8, and after a small war with them I must say that I failed… even after playing with hardcore stuff in /etc/driver_aliases.

Installing an iSCSI target on Debian is discussed here: Unofficial ISCSI target installation. Some checks:

rac3:/etc/init.d# cat /proc/net/iet/volume
tid:2 name:iqn.2001-04.com.example:storage.disk2.sys1.xyz
lun:0 state:0 iotype:fileio iomode:wt path:/u01/iscsi.target

rac3:/etc/init.d# cat /proc/net/iet/session
tid:2 name:iqn.2001-04.com.example:storage.disk2.sys1.xyz

As you can see, /u01/iscsi.target is a normal file (created with dd(1)) on a MegaRAID RAID0 array. We will use it to do some testing from Solaris:


root@opensol:~# iscsiadm add static-config iqn.2001-04.com.example:storage.disk2.sys1.xyz,10.99.1.25
root@opensol:~# iscsiadm modify discovery --static enable
root@opensol:~# devfsadm -i iscsi
root@opensol:~# iscsiadm list target
Target: iqn.2001-04.com.example:storage.disk2.sys1.xyz
Alias: -
TPGT: 1
ISID: 4000002a0000
Connections: 1
root@opensol:~# format
Searching for disks...done

0. c1t0d0
/pci@0,0/pci1000,30@10/sd@0,0

1. c2t17d0
/iscsi/disk@0000iqn.200104.com.example%3Astorage.disk2.sys1.xyzFFFF,0
Specify disk (enter its number): CTRL+C


Okay, so we are now sure that iSCSI works. In several days I'm going to test exporting a SONY SDT-9000 (an old tape drive) via iSCSI :)

A few interesting links (Clustered Samba, an interesting CCIE.PL post, expensive SAN devices ;) )

Sunday, March 18th, 2007

Samba has never interested me much (since sharing storage with Windows boxes never interested me ;) ), although clustered Samba looks promising.

There is a very interesting post on the CCIE.PL forum (author: pjeter) which describes, roughly, grabbing the picture from a cable TV tuner and pushing it onto multicast in real time (with a codec) on an ordinary Linux PC, so that the TV picture is available to other computers on the local network…

SAN network architectures do interest me (e.g. link) … but unfortunately they are out of my budget’s reach:

  • An MDS 9216 costs about 120,000 PLN
  • better not to write about the MDS 95xx at all ;)