Archive for the ‘Performance’ Category

A very simple demonstration of how to limit latency impact (improve consumer experience) by switching to out-of-band/background validation connection checking for DB connections

Thursday, March 31st, 2016

This is a very simple demonstration of how to limit latency impact (improve consumer experience) by switching to out-of-band/background validation connection checking for DB connections:

Test setup: laptop <-- 0ms latency --> VM { WildFly/JBoss } <-- 6ms latency --> VM { Oracle DB }; latency was simulated using Linux tc netem (a 6ms delay applied in one direction only).
Benchmark generator settings: ApacheBench 2.4 launched with concurrency = 1, timeout = 5, 1000 requests to a simple JSP page; the test was repeated multiple times to warm up caches/etc.
Middle-tier: JBoss/WildFly Core 2.0.10.Final, a pool of 100 connections, validation connection checker explicitly set to “SELECT 1 FROM DUAL”
Application: a simple JSP page performing 1 DB call (1x INSERT), sketched below
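To make the workload concrete, here is a minimal sketch of what such a page could look like (the JNDI name and table are hypothetical; each request costs one pool checkout via .getConnection() plus the INSERT itself):

<%@ page import="javax.naming.InitialContext, javax.sql.DataSource, java.sql.Connection, java.sql.PreparedStatement" %>
<%
    // hypothetical JNDI name and table; one pool checkout + one INSERT per request
    try {
        DataSource ds = (DataSource) new InitialContext().lookup("java:jboss/datasources/OracleDS");
        try (Connection con = ds.getConnection();    // <- this is where any foreground validation latency hits
             PreparedStatement ps = con.prepareStatement("INSERT INTO test_tab (id) VALUES (?)")) {
            ps.setLong(1, System.nanoTime());
            ps.executeUpdate();
        }
    } catch (Exception e) {
        throw new javax.servlet.ServletException(e);
    }
%>
OK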

With 6ms RTT and the JDBC connection pool's default foreground connection check (actually 4 packets per request; 4*6 =~ 24ms):

Requests per second:    55.97 [#/sec] (mean)
Time per request:       17.867 [ms] (mean)
Time per request:       17.867 [ms] (mean, across all concurrent requests)
Transfer rate:          16.07 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:    13   18   9.8     16     218
Waiting:       13   17   9.8     16     218
Total:         13   18   9.8     17     218

Percentage of the requests served within a certain time (ms)
  50%     17
  66%     17
  75%     18
  80%     18
  90%     19
  95%     21
  98%     38
  99%     71
 100%    218 (longest request)

With 6ms RTT and the JDBC connection pool's background connection check (just 2 packets to the DB per request; 2*6 =~ 12ms):

Requests per second:    100.74 [#/sec] (mean)
Time per request:       9.927 [ms] (mean)
Time per request:       9.927 [ms] (mean, across all concurrent requests)
Transfer rate:          28.92 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:     7   10   2.8      9      41
Waiting:        7    9   2.8      9      41
Total:          7   10   2.9      9      42

Percentage of the requests served within a certain time (ms)
  50%      9
  66%     10
  75%     10
  80%     10
  90%     11
  95%     13
  98%     18
  99%     26
 100%     42 (longest request)

This demonstrates that the average consumer experience can be improved from ~18ms to ~10ms just by setting two options that move the connection checking “out-of-band” with respect to business transaction processing. Of course, the more often the application calls .getConnection() – even transparently, as in the JSP case – the bigger the impact of those settings.

p.s. I’ve used

<check-valid-connection-sql>select 1 from dual</check-valid-connection-sql>

in my experiment; however, you should really be using

<valid-connection-checker class-name="org.jboss.jca.adapters.jdbc.extensions.oracle.OracleValidConnectionChecker"/> 

as it calls the native OJDBC .isValid()/ping methods, which perform the check without the SQL layer (smaller packets).
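Putting both together, a minimal sketch of the relevant validation block in a WildFly datasource definition (element names per the WildFly/IronJacamar datasources schema; the 10-second interval is an arbitrary example – background-validation and background-validation-millis are the two options that move the checking out of band):

<datasource jndi-name="java:jboss/datasources/OracleDS" pool-name="OracleDS">
    <!-- connection-url, driver and pool sizing omitted -->
    <validation>
        <!-- native Oracle checker instead of an extra SQL round-trip -->
        <valid-connection-checker class-name="org.jboss.jca.adapters.jdbc.extensions.oracle.OracleValidConnectionChecker"/>
        <!-- validate idle connections from a background thread, not on every getConnection() -->
        <background-validation>true</background-validation>
        <background-validation-millis>10000</background-validation-millis>
    </validation>
</datasource>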

An excellent presentation about real-world scaling without unnecessary hype

Wednesday, September 30th, 2015

For those who are tired of bullshit^H^H^H^H^H^H^H^Hnew hypes and are more keen on real-world, solid, predictable and stable IT infrastructure, here you go:

https://plus.google.com/+KristianK%C3%B6hntopp/posts/4dme9PNZ1pa?pid=6197690782797826690&oid=117024231055768477646

For DBAs, the comments on pages 20-25 are especially interesting, though nothing you wouldn't know already. Page 19 is my personal favorite, due to how much thought they give to avoiding XA/2PC, even though they could just use it.

Part 5: Oracle SecureFile LOB is better than BasicFile, isn’t it?

Friday, May 9th, 2014

In this part I’ll try to discuss one of the biggest performance problems with BasicFile LOBs – “enq: HW – contention” (a wait event that belongs to the “Configuration” class). It may often be observed on systems with concurrent INSERTs into BasicFile LOBs with data bigger than ~3960 bytes (or with LOBs having DISABLE STORAGE IN ROW), and it is probably a common name for everything related to space allocation (allocating/formatting the next extent for the LOBSEGMENT itself OR looking for free space and reclaiming it). More research typically ends with reading the following engineering notes on MetaLink:

  • LOB HWM CONTENTION :Using AWR Reports to Identify the Problem; Confirm and Verify the Fix (Doc ID 837883.1)
  • Bug 9801919 “enq: HW – contention” against segments that add an extent frequently during high concurrency
  • ‘enq HW – contention’ For Busy LOB Segment (Doc ID 740075.1)
  • BUG:4867884 LOB HW LOCK CONTENTION WITH SPACE RECLAIMATION
  • Bug 6376915 HW enqueue contention for ASSM LOB segments
  • Bug 7319120 HW contention for BasicFile LOBs – special workaround

Especially in the last one, Bug 7319120, the workaround is stated as “The preferred option for such contention on basicfile LOBs is to migrate the LOBS to Securefiles in RDBMS 11g or later. Securefile LOBs have been designed for better concurrency“, but as I’ve presented in earlier posts, it is not as beautiful as it sounds (huge problems shown in previous posts with wasted space with SecureFiles on early 11.2.0.2.x releases and, surprisingly, the same on at least <= 11.2.0.4.2 as of this writing). Most of the remaining bugs are already fixed starting with 11.2.0.3.0, but as stated by Oracle, SecureFiles are "the way forward".

To the best of my knowledge, typical solutions to “enq: HW – contention” usually include one of the following:

  1. properly setting the LOB cache option, CACHE vs NOCACHE (the default), and tuning I/O – enq: HW for LOBs seems to be affected by storage operations, as will be shown later
  2. a proper (hash) partitioning scheme on the affected table/LOB to split the contention
  3. altering the LOB chunk size (often increasing the default 8k to something like 16k/32k gives slightly better results – but this is very workload specific)
  4. setting event 44951 (see the sketch right after this list)
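For reference, a sketch of how that event is typically set (the level is the maximum number of LOB chunks reclaimed per attempt, per the note quoted later in this post; as always with events, validate this with Oracle Support for your exact version first):

-- in memory, effective immediately:
ALTER SYSTEM SET EVENTS '44951 TRACE NAME CONTEXT FOREVER, LEVEL 3';
-- or persisted in the spfile (effective after restart):
ALTER SYSTEM SET EVENT='44951 TRACE NAME CONTEXT FOREVER, LEVEL 3' SCOPE=SPFILE;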

Test env specification
Armed with that knowledge, I decided to really test that interesting event for my specific artificial workload. By this I mean version 11.2.0.3.8, BasicFile LOB, NOCACHE, PCTVERSION 0, LOB sizes between 6000 and 10000 bytes, and 7 concurrent INSERT and DELETE jobs, preceded by prefilling the LOBs and then DELETing them – to simulate the non-ideal, unclean conditions of a table (ab)used as a queue. The DB was in NOARCHIVELOG mode and flashback was disabled. The only difference was the storage being a RAID1 of just 2 x 15k RPM SAS drives (shared between data and redo) – instead of a ramdisk (I needed that I/O latency for enq: HW to become visible in the Top 5).

HW contention root cause
An even more interesting fact is that an INSERT into the LOB (with the related LOB extent allocation) is masked by “enq: HW – contention”, which (when you look via V$WAIT_CHAINS) is actually waiting on “db file sequential read”:

CHAIN_SIGNATURE                                    WHO             WAIT_EVENT_TEXT              BLOCKED_BY              P1 P1_TEXT                 P2 P2_TEXT                 P3 P3_TEXT
-------------------------------------------------- --------------- ---------------------------- --------------- ---------- --------------- ---------- --------------- ---------- ---------------
'db file sequential read'<='enq: HW - contention'  129,571         enq: HW - contention         160,25          1213661190 name|mode                5 table space #     12583050 block
'db file sequential read'<='enq: HW - contention'  160,25          db file sequential read      ,                        3 file#                 4514 block#                   1 blocks
'db file sequential read'<='enq: HW - contention'  6,7             enq: HW - contention         160,25          1213661190 name|mode                5 table space #     12583050 block
'db file sequential read'<='enq: HW - contention'  33,9            enq: HW - contention         160,25          1213661190 name|mode                5 table space #     12583050 block
'db file sequential read'<='enq: HW - contention'  65,9            enq: HW - contention         160,25          1213661190 name|mode                5 table space #     12583050 block
'db file sequential read'<='enq: HW - contention'  190,13          enq: HW - contention         160,25          1213661190 name|mode                5 table space #     12583050 block
'db file sequential read'<='enq: HW - contention'  223,9           enq: HW - contention         160,25          1213661190 name|mode                5 table space #     12583050 block
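(Output like the above can be pulled with a query along these lines – V$WAIT_CHAINS columns as of 11.2, formatting mine:)

SELECT chain_signature,
       sid || ',' || sess_serial# AS who,
       wait_event_text,
       blocker_sid || ',' || blocker_sess_serial# AS blocked_by,
       p1, p1_text, p2, p2_text, p3, p3_text
  FROM v$wait_chains
 ORDER BY chain_signature;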

In the above case six sessions were waiting on “enq: HW – contention” (looking for free space for their INSERTs) behind the 7th session (SID=160, SERIAL#=25), which was actually searching for space/reclaiming the space – and that 7th session was doing a single-block (8kB) physical DB I/O read. The Top 5 section looked similar to this:

ELAPSED
---------------------------------------------------------------------------
+000000000 00:24:54.463960000 

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
enq: HW - contention                 62,717         918     15   32.2 Configurat
direct path write                    93,156         277      3    9.7 User I/O
db file sequential read              64,995         199      3    7.0 User I/O
DB CPU                                               91           3.2
buffer busy waits                    26,578           1      0     .0 Concurrenc

and the statistics most useful for us (look at the physical direct writes for LOBs):

Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
[..]
lob reads                                    63,000          188.8           1.0
lob writes                                  315,000          943.7           5.0
lob writes unaligned                        314,990          943.7           5.0
physical read IO requests                    64,851          194.3           1.0
physical read bytes                     531,267,584    1,591,684.2       8,428.1
physical read total IO requests              65,656          196.7           1.0
physical read total bytes               544,384,000    1,630,981.2       8,636.2
physical read total multi block                   0            0.0           0.0
physical reads                               64,852          194.3           1.0
physical reads cache                            558            1.7           0.0
physical reads cache prefetch                     1            0.0           0.0
physical reads direct                        64,294          192.6           1.0
physical reads direct temporary                   0            0.0           0.0
physical write IO requests                  105,628          316.5           1.7
physical write bytes                  1,036,320,768    3,104,829.8      16,440.4
physical write total IO requests            170,303          510.2           2.7
physical write total bytes            1,339,469,824    4,013,068.1      21,249.6
physical write total multi block                 43            0.1           0.0
physical writes                             126,504          379.0           2.0
physical writes direct                      125,324          375.5           2.0
physical writes direct (lob)                125,316          375.5           2.0
physical writes from cache                    1,180            3.5           0.0
physical writes non checkpoint              126,504          379.0           2.0
[..]

Below are the data after changing the LOB to CACHE, which resulted in a drastic reduction of “enq: HW – contention” (as the dependent I/O is faster, because it is usually cached); V$WAIT_CHAINS now shows many wait chains like these (notice the lack of “db file sequential read”):

  • 'log file parallel write' <= 'log buffer space' <= 'buffer busy waits'
  • 'log file parallel write' <= 'log buffer space'
  • 'enq: HW - contention' <= 'enq: HW - contention'

while the Top 5 shows a different picture (the main bottleneck has now moved to the redo subsystem):

ELAPSED
---------------------------------------------------------------------------
+000000000 00:06:29.795479000 

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
log buffer space                      1,542         116     75   15.6 Configurat
enq: HW - contention                 58,609         110      2   14.8 Configurat
buffer busy waits                   125,436          99      1   13.3 Concurrenc
DB CPU                                               91          12.3
log file switch (checkpoint in            6          10   1620    1.3 Configurat

Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
[..]
lob reads                                    63,000          347.3           1.0
lob writes                                  315,009        1,736.4           5.0
physical read IO requests                       629            3.5           0.0
physical read bytes                       5,619,712       30,976.6          89.1
physical read total IO requests               1,629            9.0           0.0
physical read total bytes                21,804,032      120,186.7         345.8
physical read total multi block                   0            0.0           0.0
physical reads                                  686            3.8           0.0
physical reads cache                            661            3.6           0.0
physical reads cache prefetch                    57            0.3           0.0
physical reads direct                            25            0.1           0.0
physical reads direct temporary                   0            0.0           0.0
physical write IO requests                   16,462           90.7           0.3
physical write bytes                    234,405,888    1,292,076.2       3,718.0
physical write total IO requests             19,116          105.4           0.3
physical write total bytes            1,399,786,496    7,715,808.2      22,202.6
physical write total multi block              2,341           12.9           0.0
physical writes                              28,614          157.7           0.5
physical writes direct                           25            0.1           0.0
physical writes direct (lob)                      0            0.0           0.0
physical writes from cache                   28,589          157.6           0.5
physical writes non checkpoint               16,153           89.0           0.3
[..]

So in this case changing to CACHE reduced the elapsed time of the 7 jobs from almost 25 minutes to about 6.5 minutes. You might also notice a rather huge decrease in direct physical random writes, which is almost always an excellent result for most disk-based systems (those blocks will certainly be flushed out by DBWR later using much more optimized/sequential writes).
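(For the record, the switch itself is a one-liner, here with hypothetical table/column names; switching back is the same statement with NOCACHE:)

ALTER TABLE t_lob MODIFY LOB (payload) (CACHE);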

The mysterious event 44951
The mentioned event is described as “causes ASSM LOB space to batch up reclaim instead of just reclaiming the requested/required number of LOB chunks. To enable this fix, set event 44951 to the maximum number of chunks that you would like to have reclaimed per attempt. The maximum allowed is 1024. Anything larger becomes 1024. However, if the requested amount of space is larger than the event’s value, the reclaim will be for the requested amount of space.”
It is interesting that several years ago people complained when they migrated LOBs from MSSM to ASSM; in a different source we find “With Manual Segment Space Management, this value cannot be altered and is fixed at 128.” Somewhere else it is written: “The fix for this problem introduces a *mandatory event* which must be set in order for the ASSM architectural change leading to more efficient allocation of LOB chunks during high load and transaction contention when extent allocation is needed. The event is *not* set even after you apply the patch.”

So here are the results with NOCACHE LOB:

event 44951 not set:

ELAPSED
---------------------------------------------------------------------------
+000000000 00:21:57.406259000

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
enq: HW - contention                 62,591         787     13   39.3 Configurat
direct path write                    90,263         252      3   12.6 User I/O
db file sequential read              64,616         172      3    8.6 User I/O
DB CPU                                              111           5.6
buffer busy waits                    24,388           1      0     .0 Concurrenc

event 44951 set to 3:

ELAPSED
---------------------------------------------------------------------------
+000000000 00:18:33.058539000  

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
enq: HW - contention                 41,254         585     14   47.3 Configurat
direct path write                    86,039         285      3   23.0 User I/O
db file sequential read              43,801         144      3   11.6 User I/O
DB CPU                                               88           7.1
buffer busy waits                    39,976           1      0     .1 Concurrenc

event 44951 set to 16:

ELAPSED
---------------------------------------------------------------------------
+000000000 00:22:31.832235000   

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
direct path write                   107,809         911      8   34.9 User I/O
db file sequential read              12,804         185     14    7.1 User I/O
enq: HW - contention                  3,376         177     52    6.8 Configurat
DB CPU                                               81           3.1
log file switch (private stran            4           1    323     .0 Configurat

event 44951 set to 32:

ELAPSED
---------------------------------------------------------------------------
+000000000 00:49:27.268629000   

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
direct path write                   111,440       1,435     13   24.5 User I/O
enq: HW - contention                  2,597       1,027    395   17.5 Configurat
db file sequential read              20,339         429     21    7.3 User I/O
DB CPU                                               79           1.4
log file switch (private stran            5           2    474     .0 Configurat

event 44951 set to 512:

ELAPSED
---------------------------------------------------------------------------
+000000000 00:38:45.176198000   

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
direct path write                   116,908       1,833     16   40.2 User I/O
db file sequential read              17,262         354     20    7.8 User I/O
DB CPU                                               78           1.7
enq: HW - contention                     23          64   2802    1.4 Configurat
log file switch (private stran            6           3    473     .1 Configurat

So apparently the most interesting result (for LOB reclamation) in my case is the one with the event set to 3 (chunks). I could have tested more, but time for this research was the main limiting constraint. When the tablespace and LOB are created with defaults this means an 8kB chunk, so in this case 8kB*3 = 24kB of I/O per reclamation. Apparently, the bigger the event level, the more I/O work for the potential reclamation is done by the session requiring more space; all of this is hidden under enq: HW contention on the LOB segment, but for *different sessions*. The CACHE LOB scenario with the event set to a properly tuned level would probably result in an even better processing time.

SUMMARY
My conclusion is that having “enq: HW – contention” in the single “Configuration” event class visible in Grid/Cloud Control is a little bit unfortunate, because it might also be related to storage I/O performance, and it can even be affected by the current DB cache contents (which may or may NOT contain the affected LOBSEGMENT blocks required for reclamation). The LRU cache is of course a function of time/workload, and the physical I/O from the DB itself might be served from the SAN cache.


Part 3: Oracle SecureFile LOB is better than BasicFile, isn’t it?

Thursday, April 24th, 2014

OK, so far in our previous parts/tests regarding Oracle LOBs we had:

  • part 1 – showing how SecureFiles are fairly superior to BasicFiles on DB version 11.2.0.3.5 in terms of space consumed and elapsed processing time when processing rather small rows (most of them stored inline)
  • part 2 – showing how buggy SecureFiles were as a new feature, and how much space they actually wasted (on the legacy version 11.2.0.2.3)

Part 3 is going to focus on showing which is better for storing rows/LOBs that are too big to fit into the row/table segment (the ~3900-byte limit) and must technically be stored in a separate LOB segment. The environment is very similar to the one from Part 1: basically OLTP, 4 concurrent jobs doing INSERTs to a single non-partitioned table in a PL/SQL loop, each doing INSERT LOB + DELETE LOB with ID-16 (a moving window for each job); the DB version is 11.2.0.3.8, the LOB chunk size is 8k, the tablespace block size is also 8k, and the LOB is set to CACHE. The storage underneath the DB is a ram disk (tmpfs to be exact); every time the tablespace is dropped and recreated, and everything is automated, including AWR snapshot collection. The main difference is that now we are using much larger LOB sizes (over 4k, to be exact) *AND* we changed the workload to prefill the table and LOB with garbage data, then delete it without coalescing the LOB segment itself. The last point, as you will see, is critical, and it reflects a typical usage pattern where data is removed (e.g. queues); a minimal sketch of the per-job loop is shown below.
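(Table and column names here are hypothetical; the real harness also handled prefilling, concurrency and AWR snapshots:)

DECLARE
  v_payload VARCHAR2(6000) := RPAD('x', 6000, 'x');  -- above the ~3900-byte inline limit
BEGIN
  FOR i IN 1 .. 100000 LOOP
    INSERT INTO t_lob (id, payload) VALUES (i, v_payload);
    DELETE FROM t_lob WHERE id = i - 16;  -- moving window: drop the row inserted 16 iterations ago
    COMMIT;
  END LOOP;
END;
/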

The results are below; first, the elapsed time, BasicFile vs SecureFile:

Second, space efficiency:

As you can see, the overall performance benefit of BasicFiles comes at the cost of huge storage waste. The workload profile in this particular test run also differs considerably between SecureFiles and BasicFiles; the best illustration is the difference in the AWR Top 5 events for the workload generating LOBs between 6k and 12k:


Later analysis reveals that the majority of the “buffer busy waits” are spent on “free list” management itself. Armed with this knowledge (11.2.0.3.8, SecureFile LOBs, buffer busy waits, free lists) we can start looking for more information, and we find some nice horror stories:

Summary:
At least on 11.2.0.3.8, SecureFile LOBs are still not as fast as BasicFile, but they come with the added benefit of reusing space (much better than on some old PSUs, e.g. 11.2.0.2.3). One can hit extreme “buffer busy waits” on SecureFiles, which might require rebuilding the table with a hidden parameter. It is written that this is fixed in the 11.2.0.4.x releases. In Part 4 or Part 5 I’m going to see how 11.2.0.4.x behaves under an identical workload.

-J.

RATE_LIMIT and Oracle RAC 11gR2?

Monday, November 4th, 2013

Oracle SQLNet/TNS connection rate limiting is an awesome way of increasing the stability of Oracle databases: it controls the insane behavior of starting application servers or middleware components that push hundreds of connections to the database at once, killing the CPU on the database server (and impacting other critical work). The way it works is also very important: it does not just refuse connections – it accept()s the socket but does not start processing (it just queues the work). For more detail see the official Oracle whitepaper, especially www.oracle.com/technetwork/database/enterprise-edition/oraclenetservices-connectionratelim-133050.pdf.

The way it works is important because rejecting the TCP connections (via TCP FIN or TCP RST) would cause the application to get “Unable to get connection” type errors, which in most cases is not something you want, for various reasons. And if this feature doesn’t work, you won’t be able to re-implement it in a different way – let’s say using Linux’s Netfilter – because there you can rate-limit TCP connections primarily by dropping them.
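(On a non-RAC listener, per the whitepaper above, enabling the feature looks roughly like this in listener.ora – CONNECTION_RATE_<listener_name> sets the allowed number of new connections per second, and RATE_LIMIT=YES marks the endpoints subject to that limit; the host name and the value 10 are arbitrary examples:)

CONNECTION_RATE_LISTENER = 10

LISTENER =
  (ADDRESS_LIST =
    (ADDRESS = (PROTOCOL = TCP)(HOST = dbhost1)(PORT = 1521)(RATE_LIMIT = YES)))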

You may be surprised to learn that rate limiting Oracle SQLNet/TNS connections is NOT supported on RAC installations, even after reading the official Oracle documentation (http://docs.oracle.com/cd/E11882_01/network.112/e10835/listener.htm#NETRF008). The official documentation itself is silent on this topic, but it has been confirmed via an Oracle SR with support. The probable primary reason is that in 11gR2 CRS manages both the normal and SCAN listeners, so you have no way of altering the Oracle-managed listener.ora and endpoints_listener.ora files, because they are overwritten by… and yet the documentation still asks you to edit them. Clearly a conflict here.

On RAC/CRS the listeners are supposed to be altered only via the “srvctl” command; srvctl is actually just a frontend, because since 11gR2 the oraagent.bin daemon takes care of monitoring and housekeeping the listeners. The problem is that srvctl does NOT have a way to enable RATE_LIMIT, and there is also no option to alter more advanced parameters like QUEUESIZE (fortunately, on Linux 2.6.x the backlog depth for listen() seems to default to SOMAXCONN = /proc/sys/net/core/somaxconn = 128), SDU, etc. (those are actually mentioned in Bug 11782958 “SRVCTL DOES NOT SUPPORT LISTENER CONFIGURATION LIKE MODIFYING QUEUESIZE, SEND/RE”, and also in Doc ID 1292915.1 “Setting Parameters for Scan and Node Listeners on RAC, Queuesize, SDU, Ports, etc”).

Sample srvctl options for 11gR2:

[oracle@racnode1 ~]$ aso srvctl config listener -a
Name: LISTENER
Network: 1, Owner: oracle
Home: <CRS home>
  /u01/app/11.2.0/grid11203 on node(s) racnode1, racnode2
End points: TCP:1521
[oracle@racnode1 ~]$ aso srvctl modify listener -h

Modifies the configuration for the listener.

Usage: srvctl modify listener [-l <lsnr_name>] [-o <oracle_home>] [-p "[TCP:]<port>[, ...][/IPC:<key>][/NMP:<pipe_name>][/TCPS:<s_port>] [/SDP:<port>]"] [-u <oracle_user>] [-k <net_num>]
    -l <lsnr_name>           Listener name (default name is LISTENER)
    -o <oracle_home>         ORACLE_HOME path
    -p "[TCP:]<port>[, ...][/IPC:<key>][/NMP:<pipe_name>][/TCPS:<s_port>] [/SDP:<port>]"       Comma separated tcp ports or listener endpoints
    -u <oracle_user>         Oracle user
    -k <net_num>             network number (default number is 1)
    -h                       Print usage
[oracle@racnode1 ~]$

Additionally, Doc ID 1568591.1 “11gR2 Listener With RATE_LIMIT Set: Slow Connect Time and Tnsping Response Time High with TNS-1158 Error” mentions BUG:16409926 “LISTENER MEMORY LEAK IF RATE_LIMIT IS USED AND ENFORCED FREQUENTLY” (the fix is still not present in the 11.2.0.2.x and 11.2.0.3.x PSUs, but it is fixed in 12.1.0.1 and included from the start in 11.2.0.4.0). Overall this feature doesn’t seem to be widely used and/or tested, and it doesn’t inspire confidence…

WAFL performance VS sequential reads: part III, FC LUN performance from AIX vs read_realloc

Thursday, July 7th, 2011

Continuing my series about LUN fragmentation in WAFL (part 1, part 2), I wanted to give the read_realloc option a try. Mostly the same storage system (still Data ONTAP 7.3.2) and settings were used, with the exception that this time I used an even more powerful AIX system with POWER7 CPU cores (which shouldn’t matter, as I was not CPU-constrained at all during the benchmarks).

The methodology was simple:
1) Do a lot of random and sequential writes to a JFS2 filesystem on top of that LUN, on a FlexVol having read_realloc=off, to fragment the LUN as much as possible
2) Start a loop of random large reads (simulating, let’s say, many random full table scans of up to 1MB each in an Oracle data file) on top of that JFS2 filesystem (BTW: this time without CIO)

The full measurement took about 18 hours (yes, the storage was hammered with reads for 18 hours straight). The results are pretty interesting:

…it basically shows how NetApp storage can self-optimize and adapt based on the workload. Now I’m just wondering how it works under the hood (i.e. what the trade-off or worst possible scenario for this kind of setting is). What was even more interesting is that, after the whole cycle, reallocate measure indicated that the LUN was really nicely defragmented:

Thu Jul  7 03:32:20 EDT [X: wafl.reallocate.check.value:info]: Allocation measurement check on '/vol/fc_fill_aggr_sm/lun' is 3.
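(For reference, a sketch of the Data ONTAP 7-mode commands involved – my assumption being that read_realloc was switched on for the read phase; volume/LUN names as in the message above. read_realloc is a per-volume option, and reallocate measure reports an optimization value where lower means less fragmented:)

X> vol options fc_fill_aggr_sm read_realloc on
X> reallocate measure /vol/fc_fill_aggr_sm/lun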

AIX disk I/O pacing vs PowerHA

Wednesday, July 6th, 2011

Recently I’ve played with disk I/O pacing (technically, the minpout and maxpout parameters) on AIX 6.1. The root cause of this investigation was some high I/O waits in production. Default AIX 6.1 and 7.1 systems come with maxpout set to 8193 and minpout to 4096, respectively. However, old deployments and HACMP <= 5.4.1 installation procedures still recommend a value of 33/20; just take a look at an old High Availability Cluster Multi-Processing troubleshooting guide:
“Although the most efficient high- and low-water marks vary from system to system, an initial high-water mark of 33 and a low-water mark of 24 provides a good starting point. These settings only slightly reduce write times and consistently generate correct fallover behavior from the HACMP software.”

Additionally, IBM described the whole disk I/O pacing mechanics in this link.
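For reference, these are sys0 attributes on AIX; a minimal sketch of inspecting and changing them (the values here are just the defaults mentioned above, and the lsattr output is abbreviated):

# lsattr -El sys0 -a maxpout -a minpout
maxpout 8193 HIGH water mark for pending write I/Os per file True
minpout 4096 LOW water mark for pending write I/Os per file  True
# chdev -l sys0 -a maxpout=8193 -a minpout=4096
sys0 changed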

100% write results with maxpout=8193 minpout=4096 (defaults)
orion_20110706_0525_summary.txt:Maximum Large MBPS=213.55 @ Small=0 and Large=6
orion_20110706_0525_summary.txt:Maximum Small IOPS=32323 @ Small=2 and Large=0

100% write results with maxpout=33 minpout=24 (old HACMP)
orion_20110706_0552_summary.txt:Maximum Large MBPS=109.64 @ Small=0 and Large=10
orion_20110706_0552_summary.txt:Maximum Small IOPS=31203 @ Small=5 and Large=0

As you can see there is a drastic change when it comes to large sequential writes, but don’t change those parameters yet! You really need to read the following sentence from the IBM site: “The maxpout parameter specifies the number of pages that can be scheduled in the I/O state to a file before the threads are suspended.” It is obvious that those parameters exist to prevent I/O-hungry processes from dominating the CPU. Given that the smallest page size is 4kB, with 64kB pages sometimes used automatically by Oracle, you get at best 33*64kB = 2112kB, after which the thread doing the I/O is suspended! Ouch! And what is the mechanism for taking the CPU away from an active process? It is called an involuntary context switch…

Why is all this important for HACMP/PowerHA software? Because it is implemented as normal shell scripts and some RSCT processes, which mostly do not run in kernel mode at all (at least everything below PowerHA 7.1)! The only way PowerHA/HACMP can give more CPU horsepower to the processes that need it is to give them a higher priority on the run queue; just take a look at the PRI column…

root@X: /home/root :# ps -efo "sched,nice,pid,pri,comm"|sort -n +3|head
SCH NI     PID PRI COMMAND
  - --  520286  31 hatsd
  - --  729314  38 hats_diskhb_nim
  - --  938220  38 hats_nim
  - --  622658  39 hagsd
  - 20  319502  39 clstrmgr
  - 20  348330  39 rmcd
  - 20  458874  39 gsclvmd
  - 20  508082  39 gsclvmd
  - 20  528554  39 gsclvmd
root@X: /home/root :#

…but it appears that even that might not be enough in some situations. As you can see there is no easy answer here: you would have to drive your exact system configuration to 99-100% CPU utilization on all POWER cores to see whether you start getting random failovers, because e.g. some HACMP process might handle a heartbeat too late due to CPU scheduling. So perhaps the way forward for 33/20 environments is to set the values somewhere between 33/20 and 8193/4096 (like the 513/256 mentioned in the IBM manuals). Another thing is that you should never drive a system to > 80% CPU usage on all physical cores, but you should know that already :)

Are you ready 2.0?

Monday, June 27th, 2011

LOL!

WAFL performance VS sequential reads: part I, FC LUN performance from AIX vs FlexVol utilization

Friday, June 24th, 2011

Some time ago I started thinking about getting more serious about performance research on the AIX 6.1, PowerHA and Oracle stack on top of NetApp storage. One of the first things I wanted to measure was how NetApp’s WAFL handles long sequential reads as FlexVol space utilization grows. In theory, long sequential reads (as in the Oracle data warehouses I’m interested in) could be affected by the fragmentation introduced by WAFL (Write *Anywhere* File Layout).

Some specs first:

  • Data ONTAP version 7.3.2; a single NetApp 3160 controller was tested (but in a cluster).
  • The test was done using Oracle’s Orion 11.1.0.7 storage benchmarking tool on top of AIX (orion_aix_ppc64 -run advanced -num_disks 5 -cache_size 2048 -write 30 -duration 20 -type rand -matrix basic) – as you can see the read vs write ratio was 70% to 30%, but only long reads are presented (I was not interested in the performance of 8kB reads/writes, just 1MB long reads)
  • AIX was connected via VFCs to two separate VIOS, each of which was connected using 2 x 8 Gbps FC links (but running in 4 Gbps mode because the Brocade SAN switches did not support 8 Gbps)
  • The NetApp controller had 4 Gbps FC ports
  • AIX was using 6.1's internal MPIO (round-robin) for the tested LUN
  • The AIX hdisk queue depth for the LUN was set to the default value of 12 (as per the NetApp Host Attachment Kit)
  • The AIX JFS2 filesystem was mounted with Concurrent I/O to prevent AIX from caching and read-aheads (AIX still had 3GB of RAM allocated, but VMM should not have used it)
  • The NetApp storage controller had 4 processors, 8GB RAM and 2GB of NVRAM (as indicated by sysconfig output; of course, as this is a cluster, only 1GB was usable)
  • The LUN size was 20GB, on top of a 50GB FlexVol on a RAID-DP aggregate with 5 x 15k RPM FC disks:

X> vol options full_aggr_test
nosnap=off, nosnapdir=off, minra=off, no_atime_update=off, nvfail=off,
ignore_inconsistent=off, snapmirrored=off, create_ucode=on,
convert_ucode=off, maxdirsize=83804, schedsnapname=ordinal,
fs_size_fixed=off, compression=off, guarantee=volume, svo_enable=off,
svo_checksum=off, svo_allow_rman=off, svo_reject_errors=off,
no_i2p=off, fractional_reserve=100, extent=off, try_first=volume_grow,
read_realloc=off, snapshot_clone_dependency=off
X> df -Ag aggr_used
Aggregate                total       used      avail capacity
aggr_used               1102GB      668GB      433GB      61%
aggr_used/.snapshot         0GB        0GB        0GB     ---%
X>
X> snap list -A aggr_used
Aggregate aggr_used
working...

No snapshots exist.
X> snap sched -A aggr_used
Aggregate aggr_used: 0 0 0
X>

The WAFL aggregate was idle during the test, and was configured and running as follows:

Aggregate aggr_used (online, raid_dp) (block checksums)
  Plex /aggr_used/plex0 (online, normal, active)
    RAID group /aggr_used/plex0/rg0 (normal)

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      dparity   2d.18   2d    1   2   FC:B   -  FCAL 15000 418000/856064000  420156/860480768
      parity    1d.19   1d    1   3   FC:A   -  FCAL 15000 418000/856064000  420156/860480768
      data      2d.21   2d    1   5   FC:B   -  FCAL 15000 418000/856064000  420156/860480768
      data      1d.22   1d    1   6   FC:A   -  FCAL 15000 418000/856064000  420156/860480768
      data      2d.23   2d    1   7   FC:B   -  FCAL 15000 418000/856064000  420156/860480768

[..]

X> aggr status aggr_used -v
           Aggr State           Status            Options
       aggr_used online          raid_dp, aggr     nosnap=off, raidtype=raid_dp,
                                                  raidsize=16,
                                                  ignore_inconsistent=off,
                                                  snapmirrored=off,
                                                  resyncsnaptime=60,
                                                  fs_size_fixed=off,
                                                  snapshot_autodelete=on,
                                                  lost_write_protect=on
[..]

X> df -g full_aggr_test
Filesystem               total       used      avail capacity  Mounted on
/vol/full_aggr_test/       50GB       20GB       29GB      40%  /vol/full_aggr_test/
/vol/full_aggr_test/.snapshot        0GB        0GB        0GB     ---%  /vol/full_aggr_test/.snapshot
X>
X> snap list full_aggr_test
Volume full_aggr_test
working...

No snapshots exist.
X>

From the AIX point of view, the filesystem was configured as follows (notice the big files for Orion’s use):

root@Y:# df -m .
Filesystem    MB blocks      Free %Used    Iused %Iused Mounted on
/dev/fslv00    19456.00   2099.70   90%        7     1% /fullaggr
root@Y:# du -sm *
10000.01        bigfile
7353.00 bigfile2
0.00    lost+found
0.00    orion.lun
root@Y:#
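(The Concurrent I/O mount mentioned above presumably boils down to something like the following – a sketch, with the device and mount point taken from the df output:)

root@Y:# mount -V jfs2 -o cio /dev/fslv00 /fullaggr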

Results:

Methodology:
For each successive attempt a snapshot was created to grow the space used inside the FlexVol until it was full (40%..100%), with a single Orion execution after each snapshot was created. The Y-axis represents the maximum bandwidth observed for sequential 1MB reads (as reported by Orion). The Z-axis (depth), ranging from 1 to 10, represents the number of concurrent/parallel reads being done by Orion (to simulate, let’s say, multiple full table scans happening on the same LUN). As is visible from the graph, when FlexVol utilization is close or equal to 100%, a more than twofold performance impact can be observed (40-45 MB/s vs 10-15 MB/s). A sane FlexVol utilization maximum seems to be somewhere around 70% to avoid fragmentation problems. The AIX system mostly ran with default settings, without any more advanced optimizations (that was on purpose, except for Concurrent I/O).
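(For reference, each iteration step on the filer presumably looked something like this before re-running Orion – the snapshot name is hypothetical; a snapshot pins the currently used blocks, so subsequent Orion writes allocate fresh space and FlexVol utilization grows, which df -g verifies before the next run:)

X> snap create full_aggr_test grow_step_1
X> df -g full_aggr_test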