Archive for the ‘XEN’ Category

MAA challenge lab, part #4

Friday, April 30th, 2010

No posts on my blog for a long time, need to change that. So I was trying to get the MAA (Maximum Availability Architecture by Oracle) lab back in shape for writing my Master of Science thesis…

Somewhere around January/February this year:

  • Primary VM RAC nodes prac1, prac2, prac3 are working again, but database db3 is not (unable to archive to db3dg on srac1, srac2). The main root cause was that experiments in April 2009 with the log_archive_min_succeed_dest=2 setting caused the primary and standby to lose sync.
  • Standby VM RAC nodes srac1, srac2 are up again, but without the db3dg database and ASM instances.

So some actions were performed (February/March):

  • ASM +DATA1 recreated on both new srac1 & srac2 VMs (on dedicated VG/LV: vg01/db3dg on physical synapse server, on dedicated 15k RPM SCSI drive[4th in system] named sdb [sda is RAID5 for OS/XEN/VMs]).
  • Fixed some XEN configuration files for the VMs: the “w!” disk mode (shared-writable, i.e. the exclusive-lock check is disabled) & the losetup bug in /etc/xen/scripts/block.
  • Standby rebuilt (from an RMAN duplicate backup + FAL gap fetching), so finally I’ve got a working MAA again :)

The problematic thing is that after a failover/switchover, due to the word-size difference between primary (64-bit) and standby (32-bit), I have to invalidate & recompile all PL/SQL packages (very time consuming on old hardware! and it seems that the Broker is unable to handle that case).
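For reference, that recompilation is just the standard word-size conversion procedure (utlirp.sql + utlrp.sql); a minimal single-instance sketch of what it boils down to on the new primary (RAC specifics such as stopping the other instances and setting cluster_database=false are omitted):

# Run on the new primary after a 32-bit/64-bit role change; this is the part that eats the time on old hardware.
sqlplus / as sysdba <<'EOF'
SHUTDOWN IMMEDIATE
STARTUP UPGRADE
-- utlirp.sql invalidates all PL/SQL so it can be recompiled for the new word size
@?/rdbms/admin/utlirp.sql
SHUTDOWN IMMEDIATE
STARTUP
-- utlrp.sql recompiles everything that is now invalid
@?/rdbms/admin/utlrp.sql
EOF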

So last week:

  • prac7, prac8 32-bit VMs were launched.
  • I’ve switched to using a FreeNX server instead of typical X11 SSH forwarding (much more interactive over slow OpenVPN!).
  • Also I’ve upgraded Grid Control (OMS) to 10.2.0.5 (the next step will be 11g), and the OMS database repository to 11.1.0.6 (from 10.1.0.4).
  • Next I deployed 32-bit Clusterware (11.1) on those nodes and played a little bit with OCR corruptions [Metalink note ID 399482.1] after hitting mysterious listener outages (OCR corruption was not the cause of it, it was a permission issue on a single directory – doh!).
  • Created clustered ASM, and created a 32-bit RAC database named “db4.lab1”.
  • I’ve also deployed OEM/GC agents on prac7, prac8 (10.2.0.5 versions) directly from GC (awesome for mass deployments!).
  • Yesterday I exported a single schema from the db3.lab1 database (64-bit RAC) using expdp and imported it into “db4.lab1” (a sketch of the commands is below)…
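The export/import itself was a plain schema-level Data Pump run, roughly along these lines (schema name and dump file name are placeholders; only the database names are real):

# On the 64-bit source (db3.lab1): export one schema to the default DATA_PUMP_DIR directory object.
expdp system@db3 schemas=SOMESCHEMA directory=DATA_PUMP_DIR dumpfile=someschema.dmp logfile=someschema_exp.log

# Copy the dump file to a directory visible from db4.lab1, then import it into the 32-bit database.
impdp system@db4 schemas=SOMESCHEMA directory=DATA_PUMP_DIR dumpfile=someschema.dmp logfile=someschema_imp.log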

Planned soon:

  • Upgrading GC to 11g and upgrading agents to 11g too.
  • Building MAA for “db4.lab1”; the plan is to create a DataGuard standby for it on the srac1, srac2 VMs (32-bit too, they already host the DataGuard standby for “db3.lab1”). But this one is going to use the DataGuard Broker to get Fast-Start Failover working (2 node primary RAC with 2 node standby RAC).
  • Extending the primary RAC to prac9 (to be created), so as to have a 3 node primary RAC for “db4.lab1” protected by DataGuard Broker with a 2 node standby RAC.
  • Fun with Snapshot standby and Real Application Testing
  • Detailed measurements of failover times with SYNC DataGuard and the Broker – orajdbcstat (my little utility for testing JDBC/FCF failover times; perhaps I’ll uplift it for UCP from 11.2).

Storage migration for ASM database deployed on Oracle Enterprise Linux in Oracle VM without downtime.

Thursday, May 21st, 2009

Suppose you wanted to migrate your database from storage array SAN1 to SAN2 without downtime. With Oracle databases using ASM this is possible. It was performed on the configuration described in more detail here. One note: the LUN can be made visible through dom0 or directly by the domU (by passing the PCI hardware through to our domU VM); this post explains only the first scenario, as it is the more common one. Brief steps include:

  1. Prepare new LUNs on storage array (not described here)
  2. Attach new LUNs to the Oracle VM (not described here; simulated by a simple zero-filled file created with the dd utility, see the sketch after this list; I assume this was performed earlier [SCSI bus rescanning and so on] or that the file was created in /OVS).
  3. Modifying the VM’s XEN config file.
  4. Online attaching block device to VM.
  5. Preparing new storage device from inside of the target VM.
  6. Discovering new LUN in ASM.
  7. Actual rebalance process…
  8. Verification.
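For completeness, the zero-filled file standing in for the new LUN in this walkthrough can be created in dom0 along these lines (the size matches the ~16 GB device seen in the fdisk output later on):

# Create a 15360 MB (16106127360 bytes) zero-filled file in dom0 to act as the new "LUN".
dd if=/dev/zero of=/OVS/running_pool/prac.storage2.raw bs=1M count=15360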

Step 3: Modify vm.cfg file for additional storage.

This is straightforward, just add one line. Do NOT restart the VM. There is no need.

[root@quadovm 01_prac1]# cat vm.cfg
bootloader = '/usr/bin/pygrub'
disk = ['file:/OVS/running_pool/01_prac1/system.img,hda,w',
'file:/OVS/running_pool/01_prac1/oracle_software.img,hdd,w',
'file:/OVS/running_pool/prac.storage.raw,hde,w!',
'file:/OVS/running_pool/prac.ocr,hdf,w!',
'file:/OVS/running_pool/prac.voting,hdg,w!',
'file:/OVS/running_pool/prac.storage2.raw,hdi,w!',
]
memory = 1638
name = '01_prac1'
[..]
[root@quadovm 01_prac1]#

Step 4: Attach block device to the running VM.

[root@quadovm 01_prac1]# xm block-attach 01_prac1 file:///OVS/running_pool/prac.storage2.raw /dev/hdi 'w!'
[root@quadovm 01_prac1]#

Step 5: Prepare prac1 VM for new device.

The newly added storage should be detected automatically; this can be verified by checking the dmesg output:

[root@prac1 ~]# dmesg|grep hdi
hdi: unknown partition table
[root@prac1 ~]# ls -al /dev/hde /dev/hdi
brw-rw---- 1 root dba  33, 0 Jan 24 13:00 /dev/hde
brw-r----- 1 root disk 56, 0 Jan 24 12:59 /dev/hdi
[root@prac1 ~]#
[root@prac1 ~]# fdisk -l /dev/hd[ei] 2> /dev/null | grep GB
Disk /dev/hde: 15.7 GB, 15728640000 bytes
Disk /dev/hdi: 16.1 GB, 16106127360 bytes
[root@prac1 ~]#

As we can see, the new LUN is bigger (it should be bigger or equal in size, but I haven’t checked what happens if you add a smaller one). Now we have to assign the correct permissions so that ASM/the database can use the new /dev/hdi device without problems (this doesn’t cover modifying the udev rules in /etc/udev/, which is required to make the new device come up with the right permissions after a reboot; do your own homework :), though a minimal sketch follows after the commands below):

[root@prac1 ~]# chgrp dba /dev/hdi
[root@prac1 ~]# chmod g+w /dev/hdi
[root@prac1 ~]#
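That said, a minimal sketch of such a udev rule for this particular device (the rule file name is hypothetical; syntax as used by the udev shipped with Enterprise Linux 5):

# Make /dev/hdi come up as root:dba with mode 0660 on every boot.
cat > /etc/udev/rules.d/99-oracle-asmdisk.rules <<'EOF'
KERNEL=="hdi", OWNER="root", GROUP="dba", MODE="0660"
EOF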

Step 6: Preparing ASM for new disk.

Verification of the current diskgroups and setting the rebalance power of diskgroup DATA1 to zero.

SQL> col name format a20
SQL> SELECT name, type, total_mb, free_mb, required_mirror_free_mb req_mirr_free, usable_file_mb FROM V$ASM_DISKGROUP;

NAME                 TYPE     TOTAL_MB    FREE_MB REQ_MIRR_FREE USABLE_FILE_MB
-------------------- ------ ---------- ---------- ------------- --------------
DATA1                EXTERN      15000      14143             0          14143

SQL> ALTER DISKGROUP DATA1 REBALANCE POWER 0 WAIT;

Diskgroup altered.

SQL>

Next we have to force ASM to discover the new devices by modifying the asm_diskstring parameter (I’m using IFILE for ASM, so I have to manually edit the ASM pfile as well. If I don’t, ASM won’t remember the new setting after a restart).

SQL> show parameter string

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
asm_diskstring                       string      /dev/hde

SQL>
SQL> alter system set asm_diskstring='/dev/hde', '/dev/hdi' scope=memory;

System altered.

SQL> show parameter string

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
asm_diskstring                       string      /dev/hde, /dev/hdi

SQL>
[oracle@prac1 11.1.0]$ vi /u01/app/oracle/product/11.1.0/db_1/dbs/init+ASM1.ora
#Modify asm_diskstring here too
asm_diskstring='/dev/hde','/dev/hdi'

Step 7: The main part: ASM rebalance

SQL> ALTER DISKGROUP DATA1 ADD DISK '/dev/hdi';

Diskgroup altered.

SQL> SELECT GROUP_NUMBER, OPERATION, STATE FROM V$ASM_OPERATION;

GROUP_NUMBER OPERA STATE
------------ ----- --------------------
           1 REBAL RUN

SQL> select name,path,state,failgroup from v$asm_disk;

NAME                 PATH            STATE                FAILGROUP
-------------------- --------------- -------------------- ----------
DATA1_0000           /dev/hde        NORMAL               DATA1_0000
DATA1_0001           /dev/hdi        NORMAL               DATA1_0001

SQL> ALTER DISKGROUP DATA1 DROP DISK DATA1_0000;

Diskgroup altered.

SQL> SELECT GROUP_NUMBER, OPERATION, STATE, EST_MINUTES FROM V$ASM_OPERATION;

GROUP_NUMBER OPERA STATE                EST_MINUTES
------------ ----- -------------------- -----------
           1 REBAL RUN                           32

SQL>

Typical snapshot of iostat right now (10 sec averages):

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
hde               0.00     0.00 340.80  1.10    14.20     0.00    85.08     1.43    4.18   0.35  12.04
hdi               0.00     0.00  0.40 357.40     0.01    14.28    81.77    11.52   32.22   2.40  86.04

From other normal SQL session:

SQL> insert into t2(id) values ('2');

1 row created.

SQL> commit;

Commit complete.

SQL>

Back to the ASM instance:

SQL> ALTER DISKGROUP DATA1 REBALANCE POWER 11;
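While this runs I simply keep polling V$ASM_OPERATION until it returns no rows; only then is the old disk really dropped and /dev/hde free to be detached from the VM. A trivial sketch of such a loop (assuming the shell environment already points at the +ASM1 instance):

# Poll the ASM instance every 30 seconds until the rebalance operation disappears.
while true; do
  ROWS=$(sqlplus -s / as sysdba <<'EOF'
set heading off feedback off pagesize 0
select operation || ' ' || state || ' est_min=' || est_minutes from v$asm_operation;
EOF
)
  if [ -z "$(echo "$ROWS" | tr -d '[:space:]')" ]; then
    echo "rebalance finished"
    break
  fi
  echo "$ROWS"
  sleep 30
done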

Step 8: Verification.

We’ll just execute some big, I/O-intensive SQL statement to generate some I/O (thanks to Tanel for blogging this query):

SQL> create table t4 as select rownum r from
(select rownum r from dual connect by rownum <= 1000) a,
(select rownum r from dual connect by rownum <= 1000) b,
(select rownum r from dual connect by rownum <= 1000) c
where rownum <= 100000000;

From iostat we can see that only hdi is being used, which assures us that the database is now really running on /dev/hdi.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
33.33    0.00   26.47    5.88    0.00   34.31

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
hde               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
hdi               0.00     0.00 107.84 351.96     1.60    12.35    62.14     4.96   10.30   0.91  41.96

MAA challenge lab, part #3

Wednesday, April 15th, 2009

11/04/2009:
1) Finally got some time for cleaning up Grid Control (dropping ora2 and ora3). Secured all agents (on VMs: gc, prac1, prac2). I’ve also cleaned those VMs up from the XEN dom0 (on quadvm); they are not needed anymore. db3.lab (RAC on prac1, prac2) is in GC. Installed the 10.2.0.5 32-bit agent on srac1 (single node standby).
2) Testing redo apply on the single-node RAC standby for differences in Standby Redo Log processing (verification performed by opening it in read-only mode).
3) LNS (ASYNC=buffers_number in LOG_ARCHIVE_DEST_2 parameter) performance fun.
Prepared srac2 for future RAC extension (to two nodes: srac1, srac2). Also installed GC agent on srac2 (10.2.0.5).
4) prac3: cloning and adding it into the Clusterware prac_cluster (addNode.sh from prac2). Deploying the GC 10.2.0.5 agent on this node (prac1 and prac2 are both 10.2.0.4; in the future I’ll try to upgrade them via GC). Later manually creating the +ASM3 and db33 instances (redo, undo, srvctl, etc.). It means that I now have a 3 node primary RAC :)
5) srac2: the plan is to add it to the srac_cluster and make it a 2 node standby RAC. +ASM2 was running, but more work is needed (mainly registrations in CRS/OCR).
6) The Flash Recovery Area on the standby ASM diskgroup +DATA1 was exhausted (thus MRP0 died), so I performed a full RMAN backup with archivelogs to the quadvm dom0's NFS and afterwards deleted the archivelogs to reclaim some space. On the SRAC standby I’ve changed the archivelog deletion policy (in RMAN) and then restarted MRP0.
Unfortunately I’ve lost my RAID5 array on synapse (the dom0 hosting srac_cluster: srac1, srac2; it’s an old HP LH 6000R server); 2 drives have failed, so my standby RAC is doomed until I rebuild synapse on new SCSI drives (to be ordered) :(
UPDATE: I’ve verified the backups of my srac1 and srac2 VMs, but the backups of ASM diskgroup +DATA1 failed. Also my OCR and voting disks are lost. It will be real fun & a challenge to recover this standby RAC environment (it will also be pretty much like restoring a non-DataGuarded RAC environment after a site crash). I believe I won’t have to rebuild my standby from the primary, because I backed up this standby earlier. The OCR can hopefully be restored from the Clusterware auto-backup location.
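For the record, the restore path I have in mind is the standard ocrconfig one (the backup path below is just a placeholder for whatever -showbackup reports; Clusterware has to be down on all nodes during the restore, commands run as root):

# List the automatic OCR backups that Clusterware keeps on one of the nodes.
ocrconfig -showbackup
# Restore from the chosen backup (placeholder path), then verify the OCR.
ocrconfig -restore /u01/app/crs/cdata/srac_cluster/backup00.ocr
ocrcheck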

15/03/2009:
Finally two node RAC {prac1,prac2} is being protected by DataGuard single-node standby RAC {srac1}.

21/02/2009:
Set up XEN shared storage and Oracle software storage on synapse for srac1 and configured it as /u01 on srac1.
Clusterware, ASM +DATA1 and database (RAC) installation on srac1 (x86_32).

Brief description of work on my MAA challenge lab… #1

Saturday, January 24th, 2009

The picture below (click to enlarge) shows what I’m planning to deploy in my home lab in order to prepare better for the OCP certification. It can be summarized as a full Maximum Availability Architecture implementation… Grid Control is being used to increase productivity, but I don’t want to integrate Oracle VM into the stack, just systems and databases:

MAA challenge lab

17/01/2009: Installation of and fight with Grid Control on the gc VM. Preparing a Linux VM template named 01_prac1 from which the other machines are going to be cloned (as simple as a recursive “cp” in dom0).

18/01/2009: Installation of & fight with Grid Control after I dropped the “emrep” database (the main GC repository database). This happened while I was playing with the “misc1” database cloned from “emrep”. I didn’t read the message while running “DROP DATABASE ..” from RMAN and I sent both to /dev/null, because the DBID was the same for the original one and the “misc1” clone (the primary reason was that I wanted misc1 cloned from emrep, but it failed :) ). Did I say that I’ve also deleted the backups? ;) After a new, successful fresh installation of GC, I’ve taken a full backup (from the XEN dom0) of the 01_gc VM for future purposes. I’m starting to regret that I haven’t used LVM in dom0 for snapshot/fast-backup purposes…

20/01/2009: Setting up the 09_ora1 virtual machine from the VM template 01_prac1. Installing a single 11g database named “db1.lab1” with a dedicated tablespace & user for sysbench version 0.4.8 (0.4.10 doesn’t work with Oracle).

23/01/2009: Cloning ora2 from ora1. Changing hostnames and IPs (trick: the same ethernet MACs but on different XEN bridges, changes performed from the console :)). Uninstalling the 10g Agent, removing the database on ora2. Setting up a DNS server on quadvm (in the XEN dom0) for the whole lab. Installing the GC agent on ora2 – something is wrong… the GC console doesn’t pick up the new targets (primarily I’m looking for the “Host” component). The agent itself is discovered by GC… starting from the beginning (rm -rf 08_ora2 :o ) and so on…
Finally got ora3 up and running instead of ora2. Then it turned out that Laurent Schneider had blogged about a Metalink note in which the procedure for agent removal is described. So finally I’ve got ora2 back in GC (together with gc, ora1 and ora3).

The next step was setting up host prac1 for a single instance non-RAC ASM database “db3.lab1”. First Clusterware was installed. I wanted the 32-bit version, because my future standby RAC hardware is only 32-bit capable, but it appeared that I would have to install 32-bit userland RPMs, so I decided to try, in the long term, a 64-bit primary RAC with a 32-bit standby RAC… Also the Grid Control agent was installed on prac1.

24/01/2009: Raised RAM for 01_prac1 to 1.6GB from 1.2GB (due to frequent swapping; I want the 11g memory_target for that fresh db3.lab1 database to be left at 450MB). Successful migration of ASM storage from /dev/hde to /dev/hdi (simulating a storage migration – thanks to ASM it is straightforward; I’m going to blog about it in my next post).

My article about Extended RAC is finally public (on OTN)

Tuesday, November 11th, 2008

In case you would like to experiment with Extended RAC cluster my article on OTN should be helpful. Enjoy!

Raising Oracle VM’s maximal number of interfaces in domU

Saturday, August 2nd, 2008

Just edit /boot/grub/menu.lst and add “netloop.nloopbacks=X”. Sample file after modification:

title Oracle VM Server vnull02
root (hd0,0)
kernel /xen.gz console=ttyS0,57600n8 console=tty dom0_mem=512M
module /vmlinuz-2.6.18-vnull02_8.1.6.0.18.el5xen ro root=/dev/md0 netloop.nloopbacks=8
module /initrd-2.6.18-vnull02_8.1.6.0.18.el5xen.img

OracleVM (XEN) network performance

Monday, March 31st, 2008

In OracleVM (the virtualization product for x86 and x86_64 from Oracle, which is based on the open-source XEN) one can pin individual virtual machines (later called just VMs) to dedicated CPU cores. This can give a great potential win if the XEN scheduler (dom0) doesn’t have to switch VMs between CPUs or cores. Also you can modify the default MTU (1500) setting for VMs, but more about that later.
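For reference, the pinning used in the tests below is done with xm vcpu-pin, one call per virtual CPU; a minimal sketch (domain names and cores match the xm vcpu-list output shown later):

# Pin VCPU 0 of each NFS guest to its own physical core (core 0 and core 1 respectively).
xm vcpu-pin 18_nfs1 0 0
xm vcpu-pin 21_nfs2 0 1
# Check the resulting CPU affinity.
xm vcpu-list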

I’ve performed some tests (on a PC: QuadCore Q6600 4×2.4GHz + 8GB RAM, 1GB RAM per nfsX VM, 2GB RAM per vmracX VM, 3 SATA2 10kRPM disks in RAID0); here are the results (OracleVM 2.1 with Oracle Enterprise Linux 5):

  • using defaults (without VCPU pinning, dynamic VirtualCPU selection by XEN scheduler)
    [root@nfs2 ~]# ./iperf -c 10.98.1.101 -i 1 -u -b 2048M
    ------------------------------------------------------------
    Client connecting to 10.98.1.101, UDP port 5001
    Sending 1470 byte datagrams
    UDP buffer size: 256 KByte (default)
    ------------------------------------------------------------
    [ 3] local 10.98.1.102 port 1030 connected with 10.98.1.101 port 5001
    [ 3] 0.0- 1.0 sec 209 MBytes 1.75 Gbits/sec
    [ 3] 1.0- 2.0 sec 206 MBytes 1.73 Gbits/sec
    [ 3] 2.0- 3.0 sec 206 MBytes 1.73 Gbits/sec
    [ 3] 3.0- 4.0 sec 216 MBytes 1.82 Gbits/sec
    [ 3] 4.0- 5.0 sec 231 MBytes 1.93 Gbits/sec
    [ 3] 5.0- 6.0 sec 230 MBytes 1.93 Gbits/sec
    [ 3] 6.0- 7.0 sec 228 MBytes 1.91 Gbits/sec
    [ 3] 7.0- 8.0 sec 231 MBytes 1.94 Gbits/sec
    [ 3] 8.0- 9.0 sec 230 MBytes 1.93 Gbits/sec
    [ 3] 9.0-10.0 sec 222 MBytes 1.86 Gbits/sec
    [ 3] 0.0-10.0 sec 2.16 GBytes 1.85 Gbits/sec
    [ 3] Sent 1576401 datagrams
    [ 3] Server Report:
    [ 3] 0.0-10.0 sec 1.94 GBytes 1.66 Gbits/sec 0.026 ms 160868/1576400 (10%)
    [ 3] 0.0-10.0 sec 1 datagrams received out-of-order
    [root@nfs2 ~]#
  • after pinning:

    [root@quad OVS]# xm vcpu-list
    Name       ID  VCPU  CPU  State  Time(s)  CPU Affinity
    18_nfs1     4     0    0  -b-      220.5  0
    21_nfs2     7     0    1  -b-      264.1  1
    24_vmrac1   8     0    2  -b-        4.7  any cpu
    24_vmrac1   8     1    2  -b-        5.9  any cpu
    Domain-0    0     0    1  -b-     1242.9  any cpu
    Domain-0    0     1    0  -b-      224.2  any cpu
    Domain-0    0     2    2  r--       71.8  any cpu
    Domain-0    0     3    3  -b-       60.2  any cpu

    Notice that 18_nfs1 and 21_nfs2 are pinned to different cores. At first glance you would expect that this will give better performance, but…
    [root@nfs2 ~]# ./iperf -c 10.98.1.101 -i 1 -u -b 2048M
    ------------------------------------------------------------
    Client connecting to 10.98.1.101, UDP port 5001
    Sending 1470 byte datagrams
    UDP buffer size: 256 KByte (default)
    ------------------------------------------------------------
    [ 3] local 10.98.1.102 port 1030 connected with 10.98.1.101 port 5001
    [ 3] 0.0- 1.0 sec 105 MBytes 883 Mbits/sec
    [ 3] 1.0- 2.0 sec 107 MBytes 894 Mbits/sec
    [ 3] 2.0- 3.0 sec 108 MBytes 908 Mbits/sec
    [ 3] 3.0- 4.0 sec 118 MBytes 988 Mbits/sec
    [ 3] 4.0- 5.0 sec 130 MBytes 1.09 Gbits/sec
    [ 3] 5.0- 6.0 sec 112 MBytes 937 Mbits/sec
    [ 3] 6.0- 7.0 sec 110 MBytes 922 Mbits/sec
    [ 3] 7.0- 8.0 sec 111 MBytes 928 Mbits/sec
    [ 3] 8.0- 9.0 sec 121 MBytes 1.01 Gbits/sec
    [ 3] 9.0-10.0 sec 121 MBytes 1.02 Gbits/sec
    [ 3] 0.0-10.0 sec 1.12 GBytes 958 Mbits/sec
    [ 3] Sent 814834 datagrams
    [ 3] Server Report:
    [ 3] 0.0-10.0 sec 1.11 GBytes 957 Mbits/sec 0.004 ms 1166/814833 (0.14%)
    [ 3] 0.0-10.0 sec 1 datagrams received out-of-order

    As you can see there is no performance win in such a scenario; the XEN scheduler knows better how to utilise the hardware.
  • The last test is the worst scenario that can happen under XEN: overloaded hardware. Pinning both nfs systems to one core (0) gives the following results:
    [root@quad OVS]# xm vcpu-list
    Name       ID  VCPU  CPU  State  Time(s)  CPU Affinity
    18_nfs1     4     0    0  -b-      226.1  0
    21_nfs2     7     0    0  -b-      268.7  0
    [..]

    again:

    [root@nfs2 ~]# ./iperf -c 10.98.1.101 -i 1 -u -b 2048M
    ------------------------------------------------------------
    Client connecting to 10.98.1.101, UDP port 5001
    Sending 1470 byte datagrams
    UDP buffer size: 256 KByte (default)
    ------------------------------------------------------------
    [ 3] local 10.98.1.102 port 1030 connected with 10.98.1.101 port 5001
    [ 3] 0.0- 1.0 sec 73.3 MBytes 615 Mbits/sec
    [ 3] 1.0- 2.0 sec 68.3 MBytes 573 Mbits/sec
    [ 3] 2.0- 3.0 sec 68.3 MBytes 573 Mbits/sec
    [ 3] 3.0- 4.0 sec 68.3 MBytes 573 Mbits/sec
    [ 3] 4.0- 5.0 sec 68.1 MBytes 572 Mbits/sec
    [ 3] 5.0- 6.0 sec 68.6 MBytes 575 Mbits/sec
    [ 3] 6.0- 7.0 sec 69.0 MBytes 579 Mbits/sec
    [ 3] 7.0- 8.0 sec 68.9 MBytes 578 Mbits/sec
    [ 3] 8.0- 9.0 sec 68.9 MBytes 578 Mbits/sec
    [ 3] 9.0-10.0 sec 66.6 MBytes 559 Mbits/sec
    [ 3] 0.0-10.0 sec 688 MBytes 577 Mbits/sec
    [ 3] Sent 490928 datagrams
    [ 3] Server Report:
    [ 3] 0.0-10.0 sec 680 MBytes 570 Mbits/sec 0.019 ms 6064/490927 (1.2%)
    [ 3] 0.0-10.0 sec 1 datagrams received out-of-order

WARNING: EXPERIMENTAL AND NOT VERY WELL TESTED (USE AT YOUR OWN RISK!):
MTU stands for Maximum Transmission Unit in networking terminology. The bigger the MTU, the less overhead from the TCP/IP stack, so it can give great network results, decreasing CPU utilisation for network-intensive operations between VMs (in XEN, packets between VMs traverse like this: domU_1 –> dom0(bridge) –> domU_2). Before altering the MTU for virtual machines you should be familiar with the way they work in XEN; go here for a very good article explaining the architecture of bridged interfaces in XEN.

Before you can change the MTU of the bridge (sanbr0 in my case) you must change the MTU of each vifX.Y interface in the XEN dom0 by running: ip link set dev vifX.Y mtu 9000. The list of those interfaces can be found by running: brctl show. Next you have to set the MTU of the bridge itself (in dom0): ip link set dev sanbr0 mtu 9000. Now you can use the larger MTU inside the VMs (a rough sketch of scripting the dom0 side is given after the test results below). The test was performed on the same Quad box mentioned earlier, but now from the vmrac2 VM node to one nfs VM node (yes, this vmrac2 node is running Oracle RAC on NFS, but it is idle – no transactions were performed during this test):

[root@vmrac2 ~]# cd /u03
[root@vmrac2 u03]# mkdir temp
[root@vmrac2 u03]# cd temp/
# used NFS mount options
[root@vmrac2 temp]# mount | grep /u03
10.98.1.102:/data on /u03 type nfs (rw,bg,hard,nointr,tcp,nfsvers=3,timeo=300,rsize=32768,wsize=32768,actimeo=0,addr=10.98.1.102)
[root@vmrac2 temp]# ip link ls dev eth2
5: eth2: mtu 1500 qdisc pfifo_fast qlen 1000
link/ether 00:16:3e:6c:e7:67 brd ff:ff:ff:ff:ff:ff
[root@vmrac2 temp]# dd if=/dev/zero of=test1 bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 14.0485 seconds, 14.9 MB/s
# now we change MTU
[root@vmrac2 temp]# ip link set dev eth2 mtu 9000
[root@vmrac2 temp]# rm -f test1
[root@vmrac2 temp]# dd if=/dev/zero of=test2 bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 2.28668 seconds, 91.7 MB/s
[root@vmrac2 temp]# rm test2
rm: remove regular file `test2'? y
# let's test again to be sure
[root@vmrac2 temp]# dd if=/dev/zero of=test3 bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 2.14852 seconds, 97.6 MB/s
[root@vmrac2 temp]# rm test3
rm: remove regular file `test3'? y
# switch back to MTU=1500 to exclude other factors
[root@vmrac2 temp]# ip link set dev eth2 mtu 1500
[root@vmrac2 temp]# dd if=/dev/zero of=test4 bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 10.3054 seconds, 20.4 MB/s
# and again to MTU=9000
[root@vmrac2 temp]# ip link set dev eth2 mtu 9000
[root@vmrac2 temp]# dd if=/dev/zero of=test4 bs=1M count=200
[root@vmrac2 temp]# rm test4
rm: remove regular file `test4'? y
[root@vmrac2 temp]# dd if=/dev/zero of=test5 bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 2.37787 seconds, 88.2 MB/s
[root@vmrac2 temp]#

As you can see, we’ve increased sequential NFS write performance from roughly ~20MB/s to ~90MB/s, with both the NFS server and the NFS client running in Oracle VM, just by switching to a larger MTU (I’ll try switching the MTU to 16k or even 32k to match the NFS rsize/wsize).
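For completeness, the dom0 side of the MTU change (every vif on the bridge plus the bridge itself) can be scripted roughly like this (a sketch only, assuming the bridge is named sanbr0 as above; bridge ports are read from sysfs instead of parsing brctl show output):

# Run in dom0: raise the MTU on every vif enslaved to the bridge, then on the bridge itself.
BRIDGE=sanbr0
for port in /sys/class/net/"$BRIDGE"/brif/vif*; do
    [ -e "$port" ] || continue          # skip if no vifs are attached yet
    ip link set dev "$(basename "$port")" mtu 9000
done
ip link set dev "$BRIDGE" mtu 9000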

One more note: this is experimental, so don’t try this on your OracleVM/XEN installations as it can be unsupported. I’m still experimenting with this, but I hope this trick won’t break anything ;)

p.s.#1 Simple iperf TCP bandwidth test on LAN with MTU=9000 (with 1500 it was ~1.9Gbps, as you could read earlier)
[root@nfs2 ~]# /root/iperf -c 10.98.1.101
------------------------------------------------------------
Client connecting to 10.98.1.101, TCP port 5001
TCP window size: 73.8 KByte (default)
------------------------------------------------------------
[ 3] local 10.98.1.102 port 37660 connected with 10.98.1.101 port 5001
[ 3] 0.0-10.0 sec 7.30 GBytes 6.27 Gbits/sec

p.s.#2 Yes, Oracle RAC 11g works on Oracle VM on NFS3 :)