Archive for July, 2011

WAFL performance VS sequential reads: part III, FC LUN performance from AIX vs read_realloc

Thursday, July 7th, 2011

Continuing my series about LUN fragmentation in WAFL (part1, part2) I wanted to give the read_realloc option a try. Mostly the same storage system (still Data ONTAP 7.3.2) and settings were used, with the exception that this time I used an even more powerful AIX system with POWER7 CPU cores (which shouldn’t matter, as I was not CPU-constrained at all during the benchmarks).

The methodology was simple:
1) Do a lot of random and sequential writes to a JFS2 filesystem on top of that LUN, with read_realloc=off on the FlexVol, to fragment the LUN as much as possible
2) Start a loop of large random reads (simulating, say, many random full table scans of up to 1MB against an Oracle data file) on top of that JFS2 filesystem (BTW: this time without CIO); a rough sketch of the commands involved follows below
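Just to illustrate the idea, here is a sketch of the kind of commands involved. It is not the exact benchmark script: the volume name comes from the syslog message quoted further down, while the file size, paths and loop counts are made up for illustration.

# on the NetApp controller (Data ONTAP 7-mode): keep read reallocation off
# while fragmenting the LUN
vol options fc_fill_aggr_sm read_realloc off

# on the AIX host, step 1: fragment the JFS2 filesystem sitting on that LUN
# with a mix of sequential and random writes
dd if=/dev/zero of=/fcfs/bigfile bs=1024k count=20480      # ~20 GB sequential fill
i=0
while [ $i -lt 100000 ]; do
  # overwrite a random 64 kB block somewhere in the first ~20 GB
  dd if=/dev/zero of=/fcfs/bigfile bs=64k count=1 \
     seek=$((RANDOM * 10)) conv=notrunc 2>/dev/null
  i=$((i + 1))
done

# step 2: endless loop of large (up to 1 MB) random reads
while true; do
  dd if=/fcfs/bigfile of=/dev/null bs=1024k count=1 \
     skip=$((RANDOM % 20480)) 2>/dev/null
done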

The full measurement took about 18 hours (yes, the storage was hammered with nothing but reads for 18 hours straight). The results are pretty interesting:

… it basically shows how NetApp storage can self-optimize and adapt to the workload. Now I’m just wondering how it works under the hood (i.e. what the trade-off or worst possible scenario for this kind of setting is). What was even more interesting is that a reallocate measurement after the whole cycle indicated that the LUN was really nicely de-fragmented:

Thu Jul  7 03:32:20 EDT [X: wafl.reallocate.check.value:info]: Allocation measurement check on '/vol/fc_fill_aggr_sm/lun' is 3.
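For completeness, this is roughly how read reallocation is enabled and the layout is measured on the controller (7-mode syntax from memory, so treat it as a sketch rather than a verbatim session transcript):

# enable read reallocation on the FlexVol holding the LUN
vol options fc_fill_aggr_sm read_realloc on

# reallocation scans have to be enabled before measuring
reallocate on

# one-off allocation measurement of the LUN; the result (the lower the
# better) is reported via syslog, as in the message above
reallocate measure -o /vol/fc_fill_aggr_sm/lun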

AIX disk I/O pacing vs PowerHA

Wednesday, July 6th, 2011

Recently I’ve played with disk I/O pacing (technically the minpout and maxpout parameters) on AIX 6.1. The investigation was triggered by some high I/O waits in production. AIX 6.1 and 7.1 systems come by default with maxpout set to 8193 and minpout to 4096. However, old deployments and HACMP <= 5.4.1 installation procedures still recommend a value of 33/24; just take a look at an old High Availability Cluster Multi-Processing Troubleshooting guide:
“Although the most efficient high- and low-water marks vary from system to system, an initial high-water mark of 33 and a low-water mark of 24 provides a good starting point. These settings only slightly reduce write times and consistently generate correct fallover behavior from the HACMP software.”

Additionally, IBM describes the whole disk I/O pacing mechanism in this link.
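On AIX 6.1 the pacing thresholds are plain sys0 attributes, so checking what a box is currently running with is trivial (the per-filesystem mount option shown is a sketch; consult the documentation for your exact AIX level):

# system-wide I/O pacing thresholds (AIX 6.1/7.1 should report 8193/4096)
lsattr -El sys0 -a maxpout -a minpout

# pacing can also be applied per filesystem via mount options, e.g.:
mount -o maxpout=8193,minpout=4096 /somefs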

100% write results with maxpout=8193 minpout=4096 (defaults)
orion_20110706_0525_summary.txt:Maximum Large MBPS=213.55 @ Small=0 and Large=6
orion_20110706_0525_summary.txt:Maximum Small IOPS=32323 @ Small=2 and Large=0

100% write results with maxpout=33 minpout=24 (old HACMP)
orion_20110706_0552_summary.txt:Maximum Large MBPS=109.64 @ Small=0 and Large=10
orion_20110706_0552_summary.txt:Maximum Small IOPS=31203 @ Small=5 and Large=0
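For the record, the numbers above come from ORION runs roughly along these lines (the parameters are from memory and the disk count is illustrative; ORION also expects an orion.lun file listing the devices to test):

# 100% write test, repeated once per maxpout/minpout combination
./orion -run advanced -testname orion -num_disks 4 \
        -write 100 -matrix basic
# results end up in orion_<date>_<time>_summary.txt, grep'ed above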

As you can see there is a drastic change when it comes to large sequential writes, but don’t change those parameters yet! You really need to read the following sentence from the IBM site: “The maxpout parameter specifies the number of pages that can be scheduled in the I/O state to a file before the threads are suspended.” Obviously those parameters are there to keep I/O-hungry processes from dominating the CPU. Given that the smallest page size is 4 kB, and that Oracle sometimes automatically ends up on 64 kB pages, with maxpout=33 you get at best 33 * 64 kB = 2112 kB of outstanding I/O to a file, after which the thread doing the I/O is suspended! Ouch! So what is the mechanism for taking the CPU away from the active process? It is called an involuntary context switch…
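To make the arithmetic explicit, a throwaway calculation (pure illustration, using the 4 kB base page size and the 64 kB pages Oracle can end up on):

# kB a thread may have queued against one file before being suspended
for pagesize_kb in 4 64; do
  echo "maxpout=33:   $((33 * pagesize_kb)) kB"
  echo "maxpout=8193: $((8193 * pagesize_kb)) kB"
done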

Why is all this important for HACMP/PowerHA? Because the software is implemented as ordinary shell scripts plus some RSCT processes which mostly do not run in kernel mode at all (at least everything below PowerHA 7.1)! The only way PowerHA/HACMP can give more CPU horsepower to the processes that need it is to give them a higher priority on the run queue; just take a look at the PRI column:

root@X: /home/root :# ps -efo "sched,nice,pid,pri,comm"|sort -n +3|head
SCH NI     PID PRI COMMAND
  - --  520286  31 hatsd
  - --  729314  38 hats_diskhb_nim
  - --  938220  38 hats_nim
  - --  622658  39 hagsd
  - 20  319502  39 clstrmgr
  - 20  348330  39 rmcd
  - 20  458874  39 gsclvmd
  - 20  508082  39 gsclvmd
  - 20  528554  39 gsclvmd
root@X: /home/root :#

.. but it appears that even that might not be enough in some situations. As you can see there is no easy answer here: you would have to drive your exact system configuration to 99-100% CPU utilization on all POWER cores to see whether you start getting random failovers because, e.g., some HACMP process responds to a heartbeat too late due to CPU scheduling. So perhaps the way forward for 33/24 environments is to settle somewhere between 33/24 and 8193/4096 (like the 513/256 mentioned in the IBM manuals). Another thing is that you should never drive a system to more than 80% CPU usage on all physical cores, but you should know that already :)
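If you do decide to move an old 33/24 cluster towards that middle ground, the change itself is simple (513/256 is just the example value from the IBM manuals; validate it against your own workload first):

# set system-wide I/O pacing to the middle-ground values
chdev -l sys0 -a maxpout=513 -a minpout=256

# verify
lsattr -El sys0 -a maxpout -a minpout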