OracleVM (XEN) network performance

Monday, March 31st, 2008

In OracleVM (Oracle's virtualization product for x86 and x86_64, based on the open-source XEN hypervisor) one can pin individual virtual machines (from now on called just VMs) to dedicated CPU cores. This can potentially be a great win if the XEN scheduler (dom0) doesn't have to switch VMs between CPUs or cores. You can also modify the default MTU (1500) setting for VMs, but more about that later.
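
For reference, here is a minimal sketch of how such pinning is typically done with the xm toolstack (the domain names match the vcpu-list output shown below; adjust names and core numbers to your own setup):

# pin VCPU 0 of domain 18_nfs1 to physical core 0 (takes effect immediately)
xm vcpu-pin 18_nfs1 0 0
# pin VCPU 0 of domain 21_nfs2 to physical core 1
xm vcpu-pin 21_nfs2 0 1
# verify the resulting affinity
xm vcpu-list
# alternatively, a cpus line in the domain's vm.cfg applies the pinning at domain start:
# cpus = "0"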

I’ve performed some tests on a PC (QuadCore Q6600 4×2.4GHz, 8GB RAM, 1GB RAM per nfsX VM, 2GB RAM per vmracX VM, 3 SATA2 10kRPM disks in RAID0) running OracleVM 2.1 with Oracle Enterprise Linux 5; here are the results:
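
For these UDP tests the receiving VM (nfs1, 10.98.1.101) runs iperf as a server; it would have been started with something like the following (iperf 2 syntax, only the client side is shown in the transcripts below):

# on nfs1: start the UDP server, listening on the default port 5001
./iperf -s -u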

  • using defaults (no VCPU pinning; virtual CPU placement handled dynamically by the XEN scheduler)
    [root@nfs2 ~]# ./iperf -c 10.98.1.101 -i 1 -u -b 2048M
    ------------------------------------------------------------
    Client connecting to 10.98.1.101, UDP port 5001
    Sending 1470 byte datagrams
    UDP buffer size: 256 KByte (default)
    ------------------------------------------------------------
    [ 3] local 10.98.1.102 port 1030 connected with 10.98.1.101 port 5001
    [ 3] 0.0- 1.0 sec 209 MBytes 1.75 Gbits/sec
    [ 3] 1.0- 2.0 sec 206 MBytes 1.73 Gbits/sec
    [ 3] 2.0- 3.0 sec 206 MBytes 1.73 Gbits/sec
    [ 3] 3.0- 4.0 sec 216 MBytes 1.82 Gbits/sec
    [ 3] 4.0- 5.0 sec 231 MBytes 1.93 Gbits/sec
    [ 3] 5.0- 6.0 sec 230 MBytes 1.93 Gbits/sec
    [ 3] 6.0- 7.0 sec 228 MBytes 1.91 Gbits/sec
    [ 3] 7.0- 8.0 sec 231 MBytes 1.94 Gbits/sec
    [ 3] 8.0- 9.0 sec 230 MBytes 1.93 Gbits/sec
    [ 3] 9.0-10.0 sec 222 MBytes 1.86 Gbits/sec
    [ 3] 0.0-10.0 sec 2.16 GBytes 1.85 Gbits/sec
    [ 3] Sent 1576401 datagrams
    [ 3] Server Report:
    [ 3] 0.0-10.0 sec 1.94 GBytes 1.66 Gbits/sec 0.026 ms 160868/1576400 (10%)
    [ 3] 0.0-10.0 sec 1 datagrams received out-of-order
    [root@nfs2 ~]#
  • after pinning:

    [root@quad OVS]# xm vcpu-list
    Name       ID VCPU CPU State   Time(s) CPU Affinity
    18_nfs1     4    0   0 -b-       220.5 0
    21_nfs2     7    0   1 -b-       264.1 1
    24_vmrac1   8    0   2 -b-         4.7 any cpu
    24_vmrac1   8    1   2 -b-         5.9 any cpu
    Domain-0    0    0   1 -b-      1242.9 any cpu
    Domain-0    0    1   0 -b-       224.2 any cpu
    Domain-0    0    2   2 r--        71.8 any cpu
    Domain-0    0    3   3 -b-        60.2 any cpu

    Notice that 18_nfs1 and 21_nfs2 are now pinned to different cores. At first glance you would expect this to give better performance, but…
    [root@nfs2 ~]# ./iperf -c 10.98.1.101 -i 1 -u -b 2048M
    ------------------------------------------------------------
    Client connecting to 10.98.1.101, UDP port 5001
    Sending 1470 byte datagrams
    UDP buffer size: 256 KByte (default)
    ------------------------------------------------------------
    [ 3] local 10.98.1.102 port 1030 connected with 10.98.1.101 port 5001
    [ 3] 0.0- 1.0 sec 105 MBytes 883 Mbits/sec
    [ 3] 1.0- 2.0 sec 107 MBytes 894 Mbits/sec
    [ 3] 2.0- 3.0 sec 108 MBytes 908 Mbits/sec
    [ 3] 3.0- 4.0 sec 118 MBytes 988 Mbits/sec
    [ 3] 4.0- 5.0 sec 130 MBytes 1.09 Gbits/sec
    [ 3] 5.0- 6.0 sec 112 MBytes 937 Mbits/sec
    [ 3] 6.0- 7.0 sec 110 MBytes 922 Mbits/sec
    [ 3] 7.0- 8.0 sec 111 MBytes 928 Mbits/sec
    [ 3] 8.0- 9.0 sec 121 MBytes 1.01 Gbits/sec
    [ 3] 9.0-10.0 sec 121 MBytes 1.02 Gbits/sec
    [ 3] 0.0-10.0 sec 1.12 GBytes 958 Mbits/sec
    [ 3] Sent 814834 datagrams
    [ 3] Server Report:
    [ 3] 0.0-10.0 sec 1.11 GBytes 957 Mbits/sec 0.004 ms 1166/814833 (0.14%)
    [ 3] 0.0-10.0 sec 1 datagrams received out-of-order

    As you can see, there is no performance win in this scenario; the XEN scheduler knows better how to utilise the hardware.
  • The last test is the worst scenario that can happen under XEN: overloaded hardware. Pinning both NFS systems to one core (0) gives the following results:
    [root@quad OVS]# xm vcpu-list
    Name       ID VCPU CPU State   Time(s) CPU Affinity
    18_nfs1     4    0   0 -b-       226.1 0
    21_nfs2     7    0   0 -b-       268.7 0
    [..]

    again:

    [root@nfs2 ~]# ./iperf -c 10.98.1.101 -i 1 -u -b 2048M
    ------------------------------------------------------------
    Client connecting to 10.98.1.101, UDP port 5001
    Sending 1470 byte datagrams
    UDP buffer size: 256 KByte (default)
    ------------------------------------------------------------
    [ 3] local 10.98.1.102 port 1030 connected with 10.98.1.101 port 5001
    [ 3] 0.0- 1.0 sec 73.3 MBytes 615 Mbits/sec
    [ 3] 1.0- 2.0 sec 68.3 MBytes 573 Mbits/sec
    [ 3] 2.0- 3.0 sec 68.3 MBytes 573 Mbits/sec
    [ 3] 3.0- 4.0 sec 68.3 MBytes 573 Mbits/sec
    [ 3] 4.0- 5.0 sec 68.1 MBytes 572 Mbits/sec
    [ 3] 5.0- 6.0 sec 68.6 MBytes 575 Mbits/sec
    [ 3] 6.0- 7.0 sec 69.0 MBytes 579 Mbits/sec
    [ 3] 7.0- 8.0 sec 68.9 MBytes 578 Mbits/sec
    [ 3] 8.0- 9.0 sec 68.9 MBytes 578 Mbits/sec
    [ 3] 9.0-10.0 sec 66.6 MBytes 559 Mbits/sec
    [ 3] 0.0-10.0 sec 688 MBytes 577 Mbits/sec
    [ 3] Sent 490928 datagrams
    [ 3] Server Report:
    [ 3] 0.0-10.0 sec 680 MBytes 570 Mbits/sec 0.019 ms 6064/490927 (1.2%)
    [ 3] 0.0-10.0 sec 1 datagrams received out-of-order

WARNING: EXPERIMENTAL AND NOT VERY WELL TESTED (USE AT YOUR OWN RISK!):
MTU stands for Maximum Transmission Unit in network terminology. The bigger the MTU, the less overhead from the TCP/IP stack, so a larger MTU can give great network results and lower CPU utilisation for network-intensive operations between VMs (in XEN, packets between VMs traverse like this: domU_1 -> dom0 (bridge) -> domU_2). Before altering the MTU for virtual machines you should be familiar with the way bridged interfaces work in XEN; go here for a very good article explaining their architecture.

Before you can change the MTU of the bridge (sanbr0 in my case) you must change the MTU of each vifX.Y interface in XEN dom0 by running: ip link set dev vifX.Y mtu 9000 (the list of those interfaces can be found by running: brctl show). Next you have to set the MTU of the bridge itself (in dom0): ip link set dev sanbr0 mtu 9000. Now you can use a larger MTU inside the VMs. The test was performed on the same Quad box mentioned earlier, but this time from the vmrac2 VM node to one NFS VM node (yes, the vmrac2 node is running Oracle RAC on NFS, but it was idle: no transactions were performed during this test):
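
Putting the dom0 part together, the sequence looks roughly like this (a sketch: the bridge name sanbr0 is from my setup and the vifX.Y names are only examples; take the real ones from brctl show):

# in dom0: list the vif interfaces attached to the bridge
brctl show sanbr0
# raise the MTU on every vif attached to that bridge (example interface names)
ip link set dev vif7.0 mtu 9000
ip link set dev vif8.0 mtu 9000
# then raise the MTU of the bridge itself
ip link set dev sanbr0 mtu 9000
# finally, inside each VM, raise the MTU of its interface (see the session below)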

[root@vmrac2 ~]# cd /u03
[root@vmrac2 u03]# mkdir temp
[root@vmrac2 u03]# cd temp/
# used NFS mount options
[root@vmrac2 temp]# mount | grep /u03
10.98.1.102:/data on /u03 type nfs (rw,bg,hard,nointr,tcp,nfsvers=3,timeo=300,rsize=32768,wsize=32768,actimeo=0,addr=10.98.1.102)
[root@vmrac2 temp]# ip link ls dev eth2
5: eth2: mtu 1500 qdisc pfifo_fast qlen 1000
link/ether 00:16:3e:6c:e7:67 brd ff:ff:ff:ff:ff:ff
[root@vmrac2 temp]# dd if=/dev/zero of=test1 bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 14.0485 seconds, 14.9 MB/s
# now we change MTU
[root@vmrac2 temp]# ip link set dev eth2 mtu 9000
[root@vmrac2 temp]# rm -f test1
[root@vmrac2 temp]# dd if=/dev/zero of=test2 bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 2.28668 seconds, 91.7 MB/s
[root@vmrac2 temp]# rm test2
rm: remove regular file `test2'? y
# let's test again to be sure
[root@vmrac2 temp]# dd if=/dev/zero of=test3 bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 2.14852 seconds, 97.6 MB/s
[root@vmrac2 temp]# rm test3
rm: remove regular file `test3'? y
# switch back to MTU=1500 to exclude other factors
[root@vmrac2 temp]# ip link set dev eth2 mtu 1500
[root@vmrac2 temp]# dd if=/dev/zero of=test4 bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 10.3054 seconds, 20.4 MB/s
# and again to MTU=9000
[root@vmrac2 temp]# ip link set dev eth2 mtu 9000
[root@vmrac2 temp]# dd if=/dev/zero of=test4 bs=1M count=200
[root@vmrac2 temp]# rm test4
rm: remove regular file `test4'? y
[root@vmrac2 temp]# dd if=/dev/zero of=test5 bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 2.37787 seconds, 88.2 MB/s
[root@vmrac2 temp]#

As you can see, we’ve increased sequential NFS write performance from about ~20 MB/s to ~90 MB/s, with both the NFS server and the NFS client running in Oracle VM, just by switching to a larger MTU (I’ll try switching the MTU to 16k or even 32k to match the NFS rsize/wsize).
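
To keep the larger MTU across guest reboots, the usual place on an Enterprise Linux 5 guest is the interface config file, and a non-fragmenting ping is a quick end-to-end check that jumbo frames really pass; a sketch (eth2 and 10.98.1.102 are from my setup, adjust to yours):

# inside the VM: make MTU=9000 persistent for eth2 (ifcfg syntax)
echo "MTU=9000" >> /etc/sysconfig/network-scripts/ifcfg-eth2
# verify that 9000-byte frames pass without fragmentation
# (8972 = 9000 - 20 bytes IP header - 8 bytes ICMP header)
ping -M do -s 8972 -c 3 10.98.1.102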

One more note: this is experimental, so don't try this on your OracleVM/XEN installations, as it may be unsupported. I'm still experimenting with this, but I hope this trick won't break anything ;)

p.s. #1: A simple iperf TCP bandwidth test on the LAN with MTU=9000 (with MTU=1500 it was ~1.9 Gbit/s, as you could read earlier):
[root@nfs2 ~]# /root/iperf -c 10.98.1.101
------------------------------------------------------------
Client connecting to 10.98.1.101, TCP port 5001
TCP window size: 73.8 KByte (default)
------------------------------------------------------------
[ 3] local 10.98.1.102 port 37660 connected with 10.98.1.101 port 5001
[ 3] 0.0-10.0 sec 7.30 GBytes 6.27 Gbits/sec

p.s. #2: Yes, Oracle RAC 11g works on Oracle VM over NFSv3 :)