Archive for June, 2011
After part I – where I simulated a typical Oracle workload (generating a 70:30 read-to-write ratio on an FC LUN) and created snapshots – I wanted to try different performance tests. In order to start from the same performance characteristics, I deleted all my snapshots, so my FlexVol ended up at 40% utilization again:
X> snap list full_aggr_test
Volume full_aggr_test
working...

No snapshots exist.
X>
X> df -g full_aggr_test
Filesystem               total       used      avail capacity  Mounted on
/vol/full_aggr_test/      50GB       20GB       29GB      40%  /vol/full_aggr_test/
/vol/full_aggr_test/.snapshot    0GB        0GB        0GB    ---%  /vol/full_aggr_test/.snapshot
X>
Later I executed the Orion stress test in an identical way to part I, on the same environment. As you can see, the LUN is still fragmented, so any kind of sequential read is impacted (maximum read observed: ~17 MB/s):
root@Y:# grep Maximum orion*
orion_20110627_1116_summary.txt:Maximum Large MBPS=17.07 @ Small=0 and Large=9
orion_20110627_1116_summary.txt:Maximum Small IOPS=683 @ Small=24 and Large=0
root@Y:#
So, in order to fight this performance issue, one can first establish the root cause:
X> reallocate measure /vol/full_aggr_test
Reallocation scan will be started on '/vol/full_aggr_test'.
Monitor the system log for results.
X>
The system log will reveal this:
Mon Jun 27 07:35:31 EDT [X: wafl.scan.start:info]: Starting WAFL layout measurement on volume full_aggr_test.
Mon Jun 27 07:35:32 EDT [X: wafl.reallocate.check.highAdvise:info]: Allocation check on '/vol/full_aggr_test' is 8, hotspot 0 (threshold 4), consider running reallocate.
This seems to be identical to running the measurement directly on the LUN:
X> reallocate measure /vol/full_aggr_test/lun01
Reallocation scan will be started on '/vol/full_aggr_test/lun01'.
Monitor the system log for results.
X>
The log will show this:
Mon Jun 27 07:45:21 EDT [X: wafl.scan.start:info]: Starting WAFL layout measurement on volume full_aggr_test.
Mon Jun 27 07:45:21 EDT [X: wafl.reallocate.check.highAdvise:info]: Allocation check on '/vol/full_aggr_test/lun01' is 8, hotspot 0 (threshold 4), consider running reallocate.
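The decision logic in that log message (current layout value versus the threshold of 4) is easy to script against. Below is a small sketch of my own (the `needs_realloc` helper name and the parsing are mine; the message format is the one from the log lines above):

```shell
#!/bin/sh
# Hypothetical helper: given one wafl.reallocate.check.highAdvise message,
# decide whether a reallocate run is advised (layout value > threshold).

needs_realloc() {
    # $1 = the "Allocation check on '...' is N, ... (threshold M), ..." message
    echo "$1" | awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i == "is")         check = $(i + 1) + 0   # current layout value
                if ($i == "(threshold") thr   = $(i + 1) + 0   # advised threshold
            }
        }
        END { exit !(check > thr) }'   # exit 0 = reallocate advised
}

line="Allocation check on '/vol/full_aggr_test/lun01' is 8, hotspot 0 (threshold 4), consider running reallocate."
if needs_realloc "$line"; then
    echo "reallocate advised"
else
    echo "layout OK"
fi
```

For the value 8 against the threshold 4 shown above, this reports that a reallocate is advised.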
So in both cases we are advised to defragment the LUN. Keep in mind that this is a rather resource-hungry operation, as it may involve reading and rewriting the full contents of the data!
X> reallocate start -f -p /vol/full_aggr_test/lun01
Reallocation scan will be started on '/vol/full_aggr_test/lun01'.
Monitor the system log for results.
X>
The log will show that the operation has started…
Mon Jun 27 07:46:23 EDT [X: wafl.br.revert.slow:info]: The aggregate 'sm_aggr1' contains blocks that require redirection; 'revert_to' might take longer than expected.
Mon Jun 27 07:46:23 EDT [X: wafl.scan.start:info]: Starting file reallocating on volume full_aggr_test.
As you can see, CPU activity is rather low; however, physical utilization of the disks is reported as high (don't be fooled by the low write activity – it is a function of time; a lot of writes are performed later):
 CPU   NFS  CIFS  HTTP  Total   Net kB/s    Disk kB/s   Tape kB/s  Cache Cache  CP  CP  Disk   FCP iSCSI  FCP  kB/s  iSCSI kB/s
                                 in   out   read  write read write   age   hit time  ty util               in   out    in   out
 10%     0     0     0    157     0     0  22372  19320    0     0   53s   94%  58%   :  97%   156     0  589   175     0     0
 10%     1     0     0    108     0     0  24884      0    0     0   53s   94%   0%   -  92%   106     0  256   585     0     0
  9%     0     0     0    101     0     0  25284     24    0     0   53s   94%   0%   -  93%   100     0  421   260     0     0
 12%     0     0     0    627    20    25  25620      8    0     0   53s   94%   0%   -  92%   511     0  297   132     0     0
 11%     0     0     0    792     0     0  22832      0    0     0   53s   94%   0%   -  90%   652     0  670   461     0     0
  6%     1     0     0     81     1     1  25232     24    0     0   53s   99%   0%   -  92%    78     0  233   253     0     0
One can monitor the progress by using the “status” command, and in fact observe it completing:
X> reallocate status -v /vol/full_aggr_test/lun01
Reallocation scans are on
/vol/full_aggr_test/lun01:
        State: Reallocating: Block 1347456 of 5242880 (25%), updated 1346434
        Flags: doing_force,measure_only,repeat,keep_vvbn
    Threshold: 4
     Schedule: n/a
     Interval: 1 day
 Optimization: 8
  Measure Log: n/a
X>
[..]
X> reallocate status -v /vol/full_aggr_test/lun01
Reallocation scans are on
/vol/full_aggr_test/lun01:
        State: Idle
        Flags: measure_only,repeat
    Threshold: 4
     Schedule: n/a
     Interval: 1 day
 Optimization: 8
  Measure Log: n/a
X> sysstat -x 1
 CPU   NFS  CIFS  HTTP  Total   Net kB/s    Disk kB/s   Tape kB/s  Cache Cache  CP  CP  Disk   FCP iSCSI  FCP  kB/s  iSCSI kB/s
                                 in   out   read  write read write   age   hit time  ty util               in   out    in   out
 53%     1     0     0    678     1     1  29428   1556    0     0     1   72%   9%   :  11%   573     0  311 21077     0     0
 34%     0     0     0    443     0     0  22028     32    0     0     1   78%   0%   -   5%   442     0 1068 20121     0     0
 40%     0     0     0    172     0     0  16360      0    0     0     1   77%   0%   -   4%   171     0  367 14450     0     0
CTRL+C
X>
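Instead of re-running the status command by hand, the polling can be scripted. The sketch below is my own (the `FILER` variable and ssh access to the controller CLI are assumptions); the parsing follows the “State: …” line format visible in the output above:

```shell
#!/bin/sh
# Sketch: wait for a reallocation scan to finish by polling "reallocate status".
# FILER and ssh-based remote CLI access are my assumptions, not from the post.

parse_state() {
    # Read "reallocate status -v" output on stdin, print the State value
    awk -F': *' '/State:/ { print $2; exit }'
}

wait_for_idle() {
    lun=$1
    while :; do
        state=$(ssh "$FILER" reallocate status -v "$lun" | parse_state)
        [ "$state" = "Idle" ] && break
        echo "still busy: $state"
        sleep 60
    done
    echo "reallocation of $lun finished"
}

# Only attempt remote polling when a controller is configured:
if [ -n "$FILER" ]; then
    wait_for_idle /vol/full_aggr_test/lun01
fi
```

The `State:` line moves from `Reallocating: Block … (25%)` to `Idle` when the scan is done, which is exactly what the loop waits for.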
Later results indicate that sequential reads are indeed back to their top value (~42 MB/s), which was our starting point on a fresh LUN inside the FlexVol in part I…
root@Y:# grep Maximum orion*
orion_20110627_1208_summary.txt:Maximum Large MBPS=42.73 @ Small=0 and Large=9
orion_20110627_1208_summary.txt:Maximum Small IOPS=645 @ Small=25 and Large=0
root@Y:#
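The before/after figures can be compared directly; a quick sketch (the numbers are copied from the two Orion summaries quoted in this post):

```shell
# Compare sequential-read throughput before and after "reallocate start"
# (values taken from the two Orion summaries above).
before=17.07   # Maximum Large MBPS, fragmented LUN
after=42.73    # Maximum Large MBPS, after reallocation

awk -v b="$before" -v a="$after" \
    'BEGIN { printf "sequential reads improved %.1fx (%.2f -> %.2f MBPS)\n", a/b, b, a }'
```

which reports an improvement of roughly 2.5x.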
In the next series I'll try to investigate various AIX JFS2/CIO behaviours and, to some degree, the performance characteristics of NetApp storage and its options (e.g. the read_realloc option). Stay tuned…
A nice video in “Shift Happens” style:
… now I'm wondering how it works in reality.
Some time ago I started getting more serious about performance research on the AIX 6.1, PowerHA and Oracle stack on top of NetApp storage. One of the first things I wanted to measure was how NetApp's WAFL handles FlexVol utilization in correlation to space usage. In theory, long sequential reads (as in the Oracle data warehouses I'm interested in) could be affected by the fragmentation introduced by WAFL (Write *Anywhere* File Layout).
Some specs first:
- Data ONTAP version 7.3.2; a single NetApp 3160 controller was tested (but in a cluster).
- The test was done using Oracle's Orion storage benchmarking tool on top of AIX (orion_aix_ppc64 -run advanced -num_disks 5 -cache_size 2048 -write 30 -duration 20 -type rand -matrix basic). As you can see, the read-to-write ratio was 70:30, but only long reads are presented (I was not interested in the performance of 8 kB reads/writes, just 1 MB long reads).
- AIX was connected via virtual FC adapters to two separate VIOS, each of which was connected using two 8 Gbps FC links (but running in 4 Gbps mode, as the Brocade SAN switches did not support 8 Gbps).
- The NetApp controller had 4 Gbps ports.
- AIX used 6.1's internal MPIO (round-robin) for the tested LUN.
- The AIX hdisk for the LUN was left at the default queue depth of 12 (as per the NetApp Host Attachment Kit).
- The AIX JFS2 filesystem was mounted with Concurrent I/O to prevent AIX from caching and read-aheads (AIX still had 3 GB of RAM allocated, but VMM should not have used it).
- The NetApp storage controller had 4 processors, 8 GB RAM and 2 GB of NVRAM (as indicated by sysconfig output; of course, as this is a cluster, only 1 GB was available).
- LUN size was 20 GB, on top of a 50 GB FlexVol on a RAID-DP aggregate with 5 FC 15k RPM disks.
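The Concurrent I/O mount mentioned above would typically be configured via an /etc/filesystems stanza along these lines (a sketch only; the device name and mount point match the df output shown further below, the other attributes are plausible JFS2 defaults I've filled in, and `cio` is what disables JFS2 caching and read-ahead):

```
/fullaggr:
        dev             = /dev/fslv00
        vfs             = jfs2
        mount           = true
        options         = rw,cio
        account         = false
```

Alternatively, `mount -o cio /dev/fslv00 /fullaggr` achieves the same for a single mount.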
X> vol options full_aggr_test
nosnap=off, nosnapdir=off, minra=off, no_atime_update=off, nvfail=off,
ignore_inconsistent=off, snapmirrored=off, create_ucode=on,
convert_ucode=off, maxdirsize=83804, schedsnapname=ordinal,
fs_size_fixed=off, compression=off, guarantee=volume, svo_enable=off,
svo_checksum=off, svo_allow_rman=off, svo_reject_errors=off, no_i2p=off,
fractional_reserve=100, extent=off, try_first=volume_grow,
read_realloc=off, snapshot_clone_dependency=off
X> df -Ag aggr_used
Aggregate                total       used      avail capacity
aggr_used               1102GB      668GB      433GB      61%
aggr_used/.snapshot        0GB        0GB        0GB     ---%
X>
X> snap list -A aggr_used
Aggregate aggr_used
working...

No snapshots exist.
X> snap sched -A aggr_used
Aggregate aggr_used: 0 0 0
X>
The WAFL aggregate was idle during that test and configured and running as follows:
Aggregate aggr_used (online, raid_dp) (block checksums)
  Plex /aggr_used/plex0 (online, normal, active)
    RAID group /aggr_used/plex0/rg0 (normal)

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM   Used (MB/blks)     Phys (MB/blks)
      --------- ------  ------------- ---- ---- ----  ----- --------------     --------------
      dparity   2d.18   2d    1   2   FC:B  -    FCAL 15000  418000/856064000   420156/860480768
      parity    1d.19   1d    1   3   FC:A  -    FCAL 15000  418000/856064000   420156/860480768
      data      2d.21   2d    1   5   FC:B  -    FCAL 15000  418000/856064000   420156/860480768
      data      1d.22   1d    1   6   FC:A  -    FCAL 15000  418000/856064000   420156/860480768
      data      2d.23   2d    1   7   FC:B  -    FCAL 15000  418000/856064000   420156/860480768
[..]
X> aggr status aggr_used -v
           Aggr State           Status            Options
       aggr_used online          raid_dp, aggr    nosnap=off, raidtype=raid_dp, raidsize=16,
                                                  ignore_inconsistent=off, snapmirrored=off,
                                                  resyncsnaptime=60, fs_size_fixed=off,
                                                  snapshot_autodelete=on, lost_write_protect=on
[..]
X> df -g full_aggr_test
Filesystem               total       used      avail capacity  Mounted on
/vol/full_aggr_test/      50GB       20GB       29GB      40%  /vol/full_aggr_test/
/vol/full_aggr_test/.snapshot    0GB        0GB        0GB    ---%  /vol/full_aggr_test/.snapshot
X>
X> snap list full_aggr_test
Volume full_aggr_test
working...

No snapshots exist.
X>
From the AIX point of view, the filesystem was configured as follows (notice the big files for Orion's use):
root@Y:# df -m .
Filesystem    MB blocks      Free %Used    Iused %Iused Mounted on
/dev/fslv00    19456.00   2099.70   90%        7     1% /fullaggr
root@Y:# du -sm *
10000.01        bigfile
7353.00         bigfile2
0.00            lost+found
0.00            orion.lun
root@Y:#
For each subsequent attempt a snapshot was created, growing the space used inside the FlexVol until it was full (40%…100%). There was a single Orion execution after each snapshot was created. The Y-axis represents the maximum bandwidth observed for sequential 1 MB reads (as reported by Orion). The Z-axis (depth), ranging from 1 to 10, represents the number of concurrent/parallel reads performed by Orion (to simulate, say, multiple full table scans happening on the same LUN). As is visible from the graph, when FlexVol utilization is close or equal to 100%, a severe performance impact can be observed: throughput drops from 40-45 MB/s to 10-15 MB/s. A sane FlexVol utilization maximum seems to be around 70% to avoid problems with fragmentation. The AIX system mostly ran with default settings, without any more advanced optimizations (that was on purpose, except for Concurrent I/O).
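The measurement loop described above can be sketched roughly as follows. The harness is my reconstruction (filer name, ssh access and snapshot names are assumptions); the Orion command line is the one quoted earlier. Commands are only printed here, for safety:

```shell
#!/bin/sh
# Sketch of the test harness: for each step, create a snapshot on the filer
# (pinning the old blocks so FlexVol usage grows), then run one Orion pass.
# FILER, ssh access and snapshot names are my assumptions, not from the post.

FILER=X
VOL=full_aggr_test
RUN=echo              # dry-run: print commands; set RUN="" to execute for real

run_steps() {
    for i in 1 2 3 4 5 6; do
        # freeze the current blocks; rewritten data then consumes fresh space
        $RUN ssh "$FILER" snap create "$VOL" "step$i"
        # one Orion pass, same parameters as quoted earlier in this post
        $RUN ./orion_aix_ppc64 -run advanced -num_disks 5 -cache_size 2048 \
            -write 30 -duration 20 -type rand -matrix basic
    done
}

run_steps
```

Each iteration pushes FlexVol utilization higher, which is how the 40%…100% series on the graph was produced.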
If you are wondering how the story ended, be sure to read IBM PowerHA SystemMirror 7.1 for AIX. It appears that full clustering support for LPARs having their rootvg on SAN starts to be properly supported on versions of PowerHA higher than 7.1…
The rootvg system event
PowerHA SystemMirror 7.1 introduces system events. These events are handled by a new
subsystem called clevmgrdES. The rootvg system event allows for the monitoring of loss of
access to the rootvg volume group. By default, in the case of loss of access, the event logs an
entry in the system error log and reboots the system. If required, you can change this option
in the SMIT menu to log only an event entry and not to reboot the system.
As discussed previously, event monitoring is now at the kernel level. The
/usr/lib/drivers/phakernmgr kernel extension, which is loaded by the clevmgrdES
subsystem, monitors these events for loss of rootvg. It can initiate a node restart operation if
enabled to do so as shown in Figure 9-9.
PowerHA 7.1 has a new system event that is enabled by default. This new event allows for the
monitoring of the loss of the rootvg volume group while the cluster node is up and running.
Previous versions of PowerHA/HACMP were unable to monitor this type of loss. Also the
cluster was unable to perform a failover action in the event of the loss of access to rootvg. An
example is if you lose a SAN disk that is hosting the rootvg for this cluster node.