Archive for the ‘Linux’ Category

Long remote Oracle/SQL*Net connection times due to IPv6 being enabled on RHEL 5.x

Sunday, January 9th, 2011

On one of my RAC clusters in the lab I noticed that it sometimes took 5-15 seconds to connect to the Oracle database instead of at most 1 second. It was happening on some old VMs dating back to 2008. The root-cause analysis showed that each RAC node was doing a DNS lookup for its own name, which is not something I would normally expect. Paying attention to the details showed that the Oracle RDBMS was performing DNS lookups on its own server name, and not for the A record but for the AAAA (IPv6) record. Yes, libc (the DNS resolver) was asking REMOTE DNS SERVERS for its own server name, in IPv6 form (AAAA), because it could not find the required information via /etc/hosts. This happens even with NETWORKING_IPV6=OFF in /etc/sysconfig/network.

The solution was pretty easy after establishing the root cause. Just ensure that:

  • all RAC nodes are in /etc/hosts
  • resolv.conf provides at least 2-3 DNS servers that are reachable within at most 50 ms (check each one with dig(1)); resolv.conf can also take options to perform round-robin across those [my environment was affected by this]
  • you have disabled IPv6 via NETWORKING_IPV6 in /etc/sysconfig/network
  • you have aliased IPv6 to “off” in the kernel module configuration to completely disable IPv6 kernel functionality, which in turn also disables any libc IPv6 activity [my environment was affected by this; just add "alias net-pf-10 off" to /etc/modprobe.conf]
  • lsmod does NOT display ipv6 (the kernel module is not loaded); a few quick checks are sketched after this list
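A quick way to double-check the last few points on each node, sketched below (the dig loop assumes the node name is resolvable through the configured name servers):

# IPv6 kernel module should not be loaded
lsmod | grep -w ipv6
# the alias that keeps it from being loaded on demand
grep net-pf-10 /etc/modprobe.conf
# each name server from resolv.conf should answer quickly (watch the query time)
for ns in $(awk '/^nameserver/ {print $2}' /etc/resolv.conf); do
    dig @$ns $(hostname) A +time=1 +tries=1 | grep "Query time"
done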

Of course, as you can see, I would not have been affected by this at all if the DNS name servers had not been misconfigured in the first place (my lab in 2008 looked different than it does now).

Storage migration for ASM database deployed on Oracle Enterprise Linux in Oracle VM without downtime.

Thursday, May 21st, 2009

Suppose you wanted to migrate your database from storage array SAN1 to SAN2 without downtime. With Oracle databases using ASM this is possible. It was performed on the configuration described in more detail here. One note: the LUN can be visible through dom0 or directly to domU (by passing PCI hardware handling into our domU VM); this post explains only the first scenario, as it is the more common one. Brief steps include:

  1. Prepare new LUNs on storage array (not described here)
  2. Attach new LUNs to the Oracle VM (not described here; simulated here by a simple zero-padded file created with the dd utility, see the sketch just after this list; I assume SCSI bus rescanning and so on was performed earlier, or that the file was created in /OVS).
  3. Modifying the VM’s Xen config file.
  4. Attaching the block device to the running VM online.
  5. Preparing the new storage device from inside the target VM.
  6. Discovering the new LUN in ASM.
  7. The actual rebalance process…
  8. Verification.
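Since the new LUN is only simulated by a plain file here (step 2, see above), a minimal sketch of how such a file could be created in dom0 follows; the path is the one used later in vm.cfg and the size matches the 16106127360 bytes reported for /dev/hdi further down, but both are of course up to you:

dd if=/dev/zero of=/OVS/running_pool/prac.storage2.raw bs=1M count=15360   # ~16 GB zero-padded file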

Step 3: Modify the vm.cfg file for the additional storage.

This is straightforward: just add one line. Do NOT restart the VM; there is no need.

[root@quadovm 01_prac1]# cat vm.cfg
bootloader = '/usr/bin/pygrub'
disk = ['file:/OVS/running_pool/01_prac1/system.img,hda,w',
'file:/OVS/running_pool/01_prac1/oracle_software.img,hdd,w',
'file:/OVS/running_pool/prac.storage.raw,hde,w!',
'file:/OVS/running_pool/prac.ocr,hdf,w!',
'file:/OVS/running_pool/prac.voting,hdg,w!',
'file:/OVS/running_pool/prac.storage2.raw,hdi,w!',
]
memory = 1638
name = '01_prac1'
[..]
[root@quadovm 01_prac1]#

Step 4: Attach block device to the running VM.

[root@quadovm 01_prac1]# xm block-attach 01_prac1 file:///OVS/running_pool/prac.storage2.raw /dev/hdi 'w!'
[root@quadovm 01_prac1]#
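From dom0 the attachment can be double-checked with xm block-list; a quick sketch (the exact output format depends on the Xen release):

xm block-list 01_prac1    # the new hdi backend/frontend pair should be listed here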

Step 5: Prepare prac1 VM for new device.

The newly added storage should be detected automatically; this can be verified by checking the dmesg output:

[root@prac1 ~]# dmesg|grep hdi
hdi: unknown partition table
[root@prac1 ~]# ls -al /dev/hde /dev/hdi
brw-rw---- 1 root dba  33, 0 Jan 24 13:00 /dev/hde
brw-r----- 1 root disk 56, 0 Jan 24 12:59 /dev/hdi
[root@prac1 ~]#
[root@prac1 ~]# fdisk -l /dev/hd[ei] 2> /dev/null | grep GB
Disk /dev/hde: 15.7 GB, 15728640000 bytes
Disk /dev/hdi: 16.1 GB, 16106127360 bytes
[root@prac1 ~]#

As we can see, the new LUN is bigger (it should be bigger or equal in size, but I haven’t checked what happens if you add a smaller one). Now we have to assign the correct permissions so that ASM and the database can use the new /dev/hdi device without problems (this doesn’t include modifying the udev rules in /etc/udev/, which is required to make the new device come up with the right permissions after a reboot; do your own homework :) ):

[root@prac1 ~]# chgrp dba /dev/hdi
[root@prac1 ~]# chmod g+w /dev/hdi
[root@prac1 ~]#
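To make these permissions persist across reboots, a udev rule along the following lines could be used; this is only a sketch for RHEL/OEL 5-style udev rules (the file name is arbitrary):

# /etc/udev/rules.d/99-oracle-asm.rules
KERNEL=="hd[ei]", OWNER="root", GROUP="dba", MODE="0660"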

Step 6: Preparing ASM for new disk.

Verification of the current diskgroups and setting the rebalance power for diskgroup DATA1 to zero.

SQL> col name format a20
SQL> SELECT name, type, total_mb, free_mb, required_mirror_free_mb req_mirr_free, usable_file_mb FROM V$ASM_DISKGROUP;

NAME                 TYPE     TOTAL_MB    FREE_MB REQ_MIRR_FREE USABLE_FILE_MB
-------------------- ------ ---------- ---------- ------------- --------------
DATA1                EXTERN      15000      14143             0          14143

SQL> ALTER DISKGROUP DATA1 REBALANCE POWER 0 WAIT;

Diskgroup altered.

SQL>

Next we have to force ASM to discover the new device by modifying the asm_diskstring parameter (I’m using an IFILE for ASM, so I have to edit the ASM pfile manually; if I don’t, ASM won’t remember the new setting after a restart).

SQL> show parameter string

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
asm_diskstring                       string      /dev/hde

SQL>
SQL> alter system set asm_diskstring='/dev/hde', '/dev/hdi' scope=memory;

System altered.

SQL> show parameter string

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
asm_diskstring                       string      /dev/hde, /dev/hdi

SQL>
[oracle@prac1 11.1.0]$ vi /u01/app/oracle/product/11.1.0/db_1/dbs/init+ASM1.ora
#Modify asm_diskstring here too
asm_diskstring='/dev/hde','/dev/hdi'

Step 7: The main part: ASM rebalance

SQL> ALTER DISKGROUP DATA1 ADD DISK '/dev/hdi';

Diskgroup altered.

SQL> SELECT GROUP_NUMBER, OPERATION, STATE FROM V$ASM_OPERATION;

GROUP_NUMBER OPERA STATE
------------ ----- --------------------
1 REBAL RUN

SQL> select name,path,state,failgroup from v$asm_disk;

NAME                 PATH            STATE                FAILGROUP
-------------------- --------------- -------------------- ----------
DATA1_0000     /dev/hde       NORMAL               DATA1_0000
DATA1_0001     /dev/hdi        NORMAL               DATA1_0001

SQL> ALTER DISKGROUP DATA1 DROP DISK DATA1_0000;

Diskgroup altered.

SQL> SELECT GROUP_NUMBER, OPERATION, STATE, EST_MINUTES FROM V$ASM_OPERATION;

GROUP_NUMBER OPERA STATE                EST_MINUTES
------------ ----- -------------------- -----------
           1 REBAL RUN                           32

SQL>

Typical snapshot of iostat right now (10 sec averages):

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
hde               0.00     0.00 340.80  1.10    14.20     0.00    85.08     1.43    4.18   0.35  12.04
hdi               0.00     0.00  0.40 357.40     0.01    14.28    81.77    11.52   32.22   2.40  86.04

From another normal SQL session:

SQL> insert into t2(id) values ('2');

1 row created.

SQL> commit;

Commit complete.

SQL>

Back to the ASM instance:

SQL> ALTER DISKGROUP DATA1 REBALANCE POWER 11;

Step 8: Verification.

We’ll just execute a big, I/O-heavy SQL statement to generate some I/O (thanks to Tanel for blogging this query):

SQL> create table t4 as select rownum r from
(select rownum r from dual connect by rownum <= 1000) a,
(select rownum r from dual connect by rownum <= 1000) b,
(select rownum r from dual connect by rownum <= 1000) c
where rownum <= 100000000;

From iostat we can see that only hdi is in use, which assures us that the database is really using hdi.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
33.33    0.00   26.47    5.88    0.00   34.31

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
hde               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
hdi               0.00     0.00 107.84 351.96     1.60    12.35    62.14     4.96   10.30   0.91  41.96

Fincore – how to monitor VM cache (A.K.A. what’s inside)

Thursday, May 21st, 2009

Back in November 2008 I asked on Kevin Closson’s blog how to monitor VM cache performance (cache hit ratio) under Linux. Now I’ve found a utility that shows pretty nicely what *is* in the VM cache. The utility is called “fincore”; it is written in Python and has some strange Python/Perl dependencies, but it works:

[oracle@xeno test]$ fincore -justsummarize *.dbf
page size: 4096 bytes
0 pages, 0.0  bytes in core for 10 files; 0.00 pages, 0.0  bytes per file.
[oracle@xeno test]$ echo $ORACLE_SID
oceperf
[oracle@xeno test]$ sqlplus "/ as sysdba"

SQL*Plus: Release 11.1.0.6.0 - Production on Tue Mar 17 10:35:01 2009

Copyright (c) 1982, 2007, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup;
ORACLE instance started.

Total System Global Area  535662592 bytes
Fixed Size                  1301112 bytes
Variable Size             348128648 bytes
Database Buffers          180355072 bytes
Redo Buffers                5877760 bytes
Database mounted.
Database opened.
SQL> show parameter filesystem

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
filesystemio_options                 string      none
SQL>


So let’s verify again (11g on Linux uses O_DIRECT IO by default).

[oracle@xeno test]$ fincore -justsummarize *.dbf
page size: 4096 bytes
0 pages, 0.0  bytes in core for 10 files; 0.00 pages, 0.0  bytes per file.
[oracle@xeno test]$

Pull those DBF files into the VM cache (a trick):

[oracle@xeno test]$ cat *.dbf > /dev/null
[oracle@xeno test]$ fincore -justsummarize *.dbf
page size: 4096 bytes
154788 pages, 604.6 Mbytes in core for 10 files; 15478.80 pages, 60.5 Mbytes per file.
[oracle@xeno test]$

But not all is cached:

[oracle@xeno test]$ du -sh *.dbf
5.1M    ble.dbf
431M    sysaux01.dbf
551M    system01.dbf
131M    temp01.dbf
3.2M    temp02.dbf
226M    undotbs01.dbf
33M     undotbs2.dbf
33M     undotbs3.dbf
21M     uniform2_01.dbf
348M    users01.dbf
[oracle@xeno test]$
[oracle@xeno test]$ fincore -justsummarize undotbs3.dbf
page size: 4096 bytes
3029 pages, 11.8 Mbytes in core for 1 file; 3029.00 pages, 11.8 Mbytes per file.
[oracle@xeno test]$ ls -alh undotbs3.dbf
-rw-r----- 1 oracle oinstall 33M Feb 27 14:34 undotbs3.dbf
[oracle@xeno test]$

And we have a total of ~917 MB cached in the VM cache:

[oracle@xeno test]$ grep ^Cached /proc/meminfo
Cached:         939208 kB
[oracle@xeno test]$

fincore uses the mincore(2) syscall, which appeared back in the days of 2.3.99pre1, so it should work on most of your old boxes, provided somebody gets rid of that Python dependency ;)

Oracle Clusterware 11g addNode.sh ONS bug (with GDB fun & solution)

Monday, January 26th, 2009

Yesterday I added a new node to one of my Clusterware installations. I decided to use the addNode.sh script. Unfortunately, after adding the new node (prac2) to the cluster, the kernel reported a segmentation fault of the ONS daemon every few seconds. Here’s how one can use some in-depth knowledge to solve such problems (OS: Oracle Enterprise Linux 5.1, Clusterware: 11.1.0.6, platform: x86_64).

Some messages after the Clusterware shutdown; the most important here are the first two lines.

[root@prac2 conf]# tail /var/log/messages
Jan 25 20:42:14 prac2 kernel: ons[1923]: segfault at 0000000000000060 rip 000000000040cb56 rsp 0000000043204028 error 4
Jan 25 20:42:14 prac2 kernel: ons[1927]: segfault at 0000000000000060 rip 000000000040cb56 rsp 0000000043204028 error 4
Jan 25 20:42:14 prac2 logger: Oracle clsomon shutdown successful.
Jan 25 20:42:14 prac2 logger: Oracle CSS family monitor shutting down gracefully.
Jan 25 20:42:15 prac2 logger: Oracle CSSD graceful shutdown
Jan 25 20:42:15 prac2 logger: Oprocd received graceful shutdown request. Shutting down.
[root@prac2 conf]#

Next, we are going to remove the limit on core-dump files (by default they are disabled). Then we create a special directory for centrally dumped cores and configure the kernel to use it.
[root@prac2 ~]# ulimit -c unlimited
[root@prac2 ~]# ulimit -c
unlimited
[root@prac2 ~]# mkdir /tmp/cores
[root@prac2 ~]# chmod 1777 /tmp/cores
[root@prac2 ~]# sysctl kernel.core_pattern='/tmp/cores/core.%e.%p'
kernel.core_pattern = /tmp/cores/core.%e.%p
[root@prac2 ~]#
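To make these settings persist across reboots, something along the following lines could be added; a sketch (limits.conf only affects sessions started after the change):

# /etc/security/limits.conf
*    soft    core    unlimited
*    hard    core    unlimited

# /etc/sysctl.conf
kernel.core_pattern = /tmp/cores/core.%e.%p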

After bringing up Clusterware (crsctl start crs) we are going to get some cores:
[root@prac2 cores]# ls
core.ons.4990 core.ons.5038 core.ons.5078 core.ons.5123 core.ons.5163 core.ons.5203
core.ons.5004 core.ons.5045 core.ons.5090 core.ons.5135 core.ons.5175 core.ons.5210
core.ons.5011 core.ons.5052 core.ons.5097 core.ons.5142 core.ons.5182 core.ons.5217
core.ons.5018 core.ons.5064 core.ons.5104 core.ons.5149 core.ons.5189 core.ons.5224
core.ons.5031 core.ons.5071 core.ons.5116 core.ons.5156 core.ons.5196
[root@prac2 cores]# /u01/app/crs/bin/crsctl stop crs
Stopping resources.
This could take several minutes.
Successfully stopped Oracle Clusterware resources
Stopping Cluster Synchronization Services.
Shutting down the Cluster Synchronization Services daemon.
Shutdown request successfully issued.
[root@prac2 cores]#

We must find out whether the ONS daemon dumps core at the same address every time (rip = instruction pointer); we’ll take the core from PID 4990:
[root@prac2 cores]# grep 'ons\[4990\]' /var/log/messages
Jan 25 20:07:12 prac2 kernel: ons[4990]: segfault at 0000000000000060 rip 000000000040cb56 rsp 0000000043204028 error 4
Jan 25 20:16:49 prac2 kernel: ons[4990]: segfault at 0000000000000060 rip 000000000040cb56 rsp 0000000043204028 error 4
Jan 25 20:34:13 prac2 kernel: ons[4990]: segfault at 0000000000000060 rip 000000000040cb56 rsp 0000000043204028 error 4
[root@prac2 cores]#

OK, so we’ve got the bad guy at 0x000000000040cb56. Now we have to find out why it happens (we want to find the real problem; only when we know the exact reason are we going to fix it, that’s my personal philosophy). Some output has been trimmed for clarity.
[root@prac2 cores]# file core.ons.4990
core.ons.4990: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'ons'
[root@prac2 cores]# gdb /u01/app/crs/opmn/bin/ons core.ons.4990
GNU gdb Red Hat Linux (6.5-25.el5rh)
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
[..]

Core was generated by `/u01/app/oracle/product/11.1.0/crs/opmn/bin/ons -d'.
Program terminated with signal 11, Segmentation fault.
#0 0x000000000040cb56 in opmnHttpFormatConnect ()
(gdb) thread apply all bt

Thread 7 (process 4990):
#0 0x0000003ea6c0a496 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x000000000041aa07 in main ()

Thread 6 (process 4998):
#0 0x0000003ea6c0a687 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000418666 in opmnWorkerEntry ()
#2 0x0000003ea6c062e7 in start_thread () from /lib64/libpthread.so.0
#3 0x0000003ea60ce3bd in clone () from /lib64/libc.so.6

Thread 5 (process 4999):
#0 0x0000003ea6c0a687 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000418666 in opmnWorkerEntry ()
#2 0x0000003ea6c062e7 in start_thread () from /lib64/libpthread.so.0
#3 0x0000003ea60ce3bd in clone () from /lib64/libc.so.6

Thread 4 (process 5000):
#0 0x0000003ea6c0a687 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000418666 in opmnWorkerEntry ()
#2 0x0000003ea6c062e7 in start_thread () from /lib64/libpthread.so.0
#3 0x0000003ea60ce3bd in clone () from /lib64/libc.so.6

---Type to continue, or q to quit---
Thread 3 (process 5001):
#0 0x0000003ea60c7922 in select () from /lib64/libc.so.6
#1 0x0000000000411399 in opmnListenerEntry ()
#2 0x0000003ea6c062e7 in start_thread () from /lib64/libpthread.so.0
#3 0x0000003ea60ce3bd in clone () from /lib64/libc.so.6

Thread 2 (process 5003):
#0 0x0000003ea6095451 in nanosleep () from /lib64/libc.so.6
#1 0x0000003ea6095274 in sleep () from /lib64/libc.so.6
#2 0x0000000000413fb6 in opmnPerformance ()
#3 0x0000003ea6c062e7 in start_thread () from /lib64/libpthread.so.0
#4 0x0000003ea60ce3bd in clone () from /lib64/libc.so.6

Thread 1 (process 5002):
#0 0x000000000040cb56 in opmnHttpFormatConnect ()
#1 0x0000000000408d19 in connectionActive ()
#2 0x0000000000407774 in opmnConnectionEntry ()
#3 0x0000003ea6c062e7 in start_thread () from /lib64/libpthread.so.0
#4 0x0000003ea60ce3bd in clone () from /lib64/libc.so.6

(gdb)
(gdb) disassemble
Dump of assembler code for function opmnHttpFormatConnect:
[..]
0x000000000040cb4b : mov 2121046(%rip),%r11 # 0x6128a8
0x000000000040cb52 : pushq 0x60(%r11)
0x000000000040cb56 : movzwl 0x60(%rbp),%r9d
[..]
0x000000000040cba6 : lea 826995(%rip),%rsi # 0x4d6a20 <__STRING.28>
0x000000000040cbad : mov 0x78(%rsp),%rdi
0x000000000040cbb2 : callq 0x4073d0
[..]
0x000000000040cbf5 : retq
[..]
(gdb)

From the output we can conclude that this function does not call any other functions or syscalls except sprintf. The instruction that SIGSEGVs the program, movzwl, is from the family of “extended move” instructions; it’s just an optimised copy instruction. Something is wrong: are we loading from a bad frame base pointer here? A bad memory pointer? NULL? Also, the format string for sprintf (for more info: man sprintf) is as follows (loaded by the lea instruction):

(gdb) x/s 0x4d6a20
0x4d6a20 <__STRING.28>: "POST /connect HTTP/1.1\r\nVersion: 3\r\nContent-Length: 0\r\nONStarget: %u,%hu\r\nONSsource: %u,%hu,%hu,%hu\r\nONSinfo: %s!%s!%u!%u!%hu!%hu!%hu\r\nhostName: %s\r\nclusterId: %s\r\nclusterName: %s\r\ninstanceId: %s\r\nins"...
(gdb)

As Oracle’s documentation specifies, ONS by default takes its parameters from the OCR. We can dump the OCR for further analysis:
[root@prac2 cores]# /u01/app/crs/bin/ocrdump
[root@prac2 cores]#

This generates a human-readable file named OCRDUMPFILE. After viewing it (grepping for the ONS string) it appears that the configuration for the prac2 node is not there (and it should be). This can be corrected from the prac1 node (where Clusterware is up):

[root@prac1 tmp]# /u01/app/crs/bin/racgons add_config prac2:6251
[root@prac1 tmp]# /u01/app/crs/bin/ocrdump
[root@prac1 tmp]# grep ONS OCRDUMPFILE
[DATABASE.ASM.prac1.+asm1.START_OPTIONS]
[DATABASE.ONS_HOSTS]
[DATABASE.ONS_HOSTS.prac1]
[DATABASE.ONS_HOSTS.prac1.PORT]
[DATABASE.ONS_HOSTS.prac2]
[DATABASE.ONS_HOSTS.prac2.PORT]
ORATEXT : CRS application for ONS on node
ORATEXT : CRS application for ONS on node
[root@prac1 tmp]#

Now we have corrected it. Let’s start Clusterware on prac2 and verify how that format function performs. This is somewhat tricky, because the main ONS daemon is a child of the parent ONS watchdog (21165) and that function is called only while initialising, so we’re going to kill the child and follow the code path into the newly spawned child:

[oracle@prac2 tmp]$ ps uaxw|grep opm
oracle 21165 0.0 0.0 15928 392 ? Ss 21:55 0:00 /u01/app/oracle/product/11.1.0/crs/opmn/bin/ons -d
oracle 21166 2.1 0.6 146920 10984 ? Sl 21:55 0:00 /u01/app/oracle/product/11.1.0/crs/opmn/bin/ons -d
oracle 21281 0.0 0.0 61112 700 pts/0 R+ 21:55 0:00 grep opm
[oracle@prac2 tmp]$ gdb -p 21165 # parent
GNU gdb Red Hat Linux (6.5-25.el5rh)
[..]
(gdb) set follow-fork-mode child
(gdb) b *opmnHttpFormatConnect
Breakpoint 1 at 0x40c9f2
(gdb) b *0x000000000040cbf5 # remember that retq address...?
Breakpoint 2 at 0x40cbf5
(gdb) cont
Continuing.

Now to the other session, where we kill the child (the parent will restart it):
[oracle@prac2 tmp]$ kill -TERM 21166

In the GDB session we'll get this:
[New process 22004]
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
[New LWP 22015]
[Switching to LWP 22015]

Breakpoint 1, 0x000000000040c9f2 in opmnHttpFormatConnect ()
(gdb) cont
Continuing.

Breakpoint 2, 0x000000000040cbf5 in opmnHttpFormatConnect ()

(gdb) info registers rax
rax 0x1ab38960 447973728
(gdb) x/s 0x1ab38960
0x1ab38960: "POST /connect HTTP/1.1\r\nVersion: 3\r\nContent-Length: 0\r\nONStarget: 174260502,6251\r\nONSsource: 174260501,6251,6150,0\r\nONSinfo: !!174260501!0!6251!0!6150\r\nhostName: prac1\r\nclusterId: databaseClusterId\r\nc"...
(gdb) quit
The program is running. Exit anyway? (y or n) y
[oracle@prac1 prac1]$

Summary: at breakpoint #2 we are at retq (return from subroutine) and the rax (64-bit) register holds the address of the memory containing the formatted string. This is the HTTP request sent by the child ONS program. We’ve verified that it works correctly :)

My article about Extended RAC is finally public (on OTN)

Tuesday, November 11th, 2008

In case you would like to experiment with an Extended RAC cluster, my article on OTN should be helpful. Enjoy!

Raising Oracle VM’s maximal number of interfaces in domU

Saturday, August 2nd, 2008

Just edit /boot/grub/menu.lst and add “netloop.nloopbacks=X” to the dom0 kernel (module) line. A sample file after modification:

title Oracle VM Server vnull02
root (hd0,0)
kernel /xen.gz console=ttyS0,57600n8 console=tty dom0_mem=512M
module /vmlinuz-2.6.18-vnull02_8.1.6.0.18.el5xen ro root=/dev/md0 netloop.nloopbacks=8
module /initrd-2.6.18-vnull02_8.1.6.0.18.el5xen.img
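After a reboot, a rough way to confirm the new limit from dom0 is to count the loopback pairs created by the netloop module; this is only a sketch and assumes they show up as the usual vethN interfaces:

ip link show | grep -c ': veth'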

OracleVM (XEN) network performance

Monday, March 31st, 2008

In Oracle VM (Oracle’s virtualization product for x86 and x86_64, based on the open-source Xen) one can pin individual virtual machines (later called just VMs) to dedicated CPU cores. This can potentially be a great win if the Xen scheduler (dom0) doesn’t have to switch VMs between CPUs or cores. You can also modify the default MTU (1500) settings for VMs, but more about that later.

I performed some tests (on a PC: QuadCore Q6600, 4×2.4 GHz, 8 GB RAM; 1 GB RAM per nfsX VM, 2 GB RAM per vmracX VM; 3 SATA2 10kRPM disks in RAID0). Here are the results (Oracle VM 2.1 with Oracle Enterprise Linux 5):

  • using defaults (without VCPU pinning, dynamic VirtualCPU selection by XEN scheduler)
    [root@nfs2 ~]# ./iperf -c 10.98.1.101 -i 1 -u -b 2048M
    ------------------------------------------------------------
    Client connecting to 10.98.1.101, UDP port 5001
    Sending 1470 byte datagrams
    UDP buffer size: 256 KByte (default)
    ------------------------------------------------------------
    [ 3] local 10.98.1.102 port 1030 connected with 10.98.1.101 port 5001
    [ 3] 0.0- 1.0 sec 209 MBytes 1.75 Gbits/sec
    [ 3] 1.0- 2.0 sec 206 MBytes 1.73 Gbits/sec
    [ 3] 2.0- 3.0 sec 206 MBytes 1.73 Gbits/sec
    [ 3] 3.0- 4.0 sec 216 MBytes 1.82 Gbits/sec
    [ 3] 4.0- 5.0 sec 231 MBytes 1.93 Gbits/sec
    [ 3] 5.0- 6.0 sec 230 MBytes 1.93 Gbits/sec
    [ 3] 6.0- 7.0 sec 228 MBytes 1.91 Gbits/sec
    [ 3] 7.0- 8.0 sec 231 MBytes 1.94 Gbits/sec
    [ 3] 8.0- 9.0 sec 230 MBytes 1.93 Gbits/sec
    [ 3] 9.0-10.0 sec 222 MBytes 1.86 Gbits/sec
    [ 3] 0.0-10.0 sec 2.16 GBytes 1.85 Gbits/sec
    [ 3] Sent 1576401 datagrams
    [ 3] Server Report:
    [ 3] 0.0-10.0 sec 1.94 GBytes 1.66 Gbits/sec 0.026 ms 160868/1576400 (10%)
    [ 3] 0.0-10.0 sec 1 datagrams received out-of-order
    [root@nfs2 ~]#
  • after pinning (the pinning commands themselves are sketched right after this list):

    [root@quad OVS]# xm vcpu-list
    Name ID VCPU CPU State Time(s) CPU Affinity
    18_nfs1 4 0 0 -b- 220.5 0
    21_nfs2 7 0 1 -b- 264.1 1
    24_vmrac1 8 0 2 -b- 4.7 any cpu
    24_vmrac1 8 1 2 -b- 5.9 any cpu
    Domain-0 0 0 1 -b- 1242.9 any cpu
    Domain-0 0 1 0 -b- 224.2 any cpu
    Domain-0 0 2 2 r-- 71.8 any cpu
    Domain-0 0 3 3 -b- 60.2 any cpu

    Notice that 18_nfs1 and 21_nfs2 are pinned to different cores. At first glance you would expect this to give better performance, but…
    [root@nfs2 ~]# ./iperf -c 10.98.1.101 -i 1 -u -b 2048M
    ------------------------------------------------------------
    Client connecting to 10.98.1.101, UDP port 5001
    Sending 1470 byte datagrams
    UDP buffer size: 256 KByte (default)
    ------------------------------------------------------------
    [ 3] local 10.98.1.102 port 1030 connected with 10.98.1.101 port 5001
    [ 3] 0.0- 1.0 sec 105 MBytes 883 Mbits/sec
    [ 3] 1.0- 2.0 sec 107 MBytes 894 Mbits/sec
    [ 3] 2.0- 3.0 sec 108 MBytes 908 Mbits/sec
    [ 3] 3.0- 4.0 sec 118 MBytes 988 Mbits/sec
    [ 3] 4.0- 5.0 sec 130 MBytes 1.09 Gbits/sec
    [ 3] 5.0- 6.0 sec 112 MBytes 937 Mbits/sec
    [ 3] 6.0- 7.0 sec 110 MBytes 922 Mbits/sec
    [ 3] 7.0- 8.0 sec 111 MBytes 928 Mbits/sec
    [ 3] 8.0- 9.0 sec 121 MBytes 1.01 Gbits/sec
    [ 3] 9.0-10.0 sec 121 MBytes 1.02 Gbits/sec
    [ 3] 0.0-10.0 sec 1.12 GBytes 958 Mbits/sec
    [ 3] Sent 814834 datagrams
    [ 3] Server Report:
    [ 3] 0.0-10.0 sec 1.11 GBytes 957 Mbits/sec 0.004 ms 1166/814833 (0.14%)
    [ 3] 0.0-10.0 sec 1 datagrams received out-of-order

    As you can see there is no performance win in such a scenario; the Xen scheduler knows better how to utilise the hardware.
  • The last test is the worst scenario that can happen under Xen: overloaded hardware. Pinning both NFS systems to one core (0) gives the following results:
    [root@quad OVS]# xm vcpu-list
    Name ID VCPU CPU State Time(s) CPU Affinity
    18_nfs1 4 0 0 -b- 226.1 0
    21_nfs2 7 0 0 -b- 268.7 0
    [..]

    again:

    [root@nfs2 ~]# ./iperf -c 10.98.1.101 -i 1 -u -b 2048M
    ------------------------------------------------------------
    Client connecting to 10.98.1.101, UDP port 5001
    Sending 1470 byte datagrams
    UDP buffer size: 256 KByte (default)
    ------------------------------------------------------------
    [ 3] local 10.98.1.102 port 1030 connected with 10.98.1.101 port 5001
    [ 3] 0.0- 1.0 sec 73.3 MBytes 615 Mbits/sec
    [ 3] 1.0- 2.0 sec 68.3 MBytes 573 Mbits/sec
    [ 3] 2.0- 3.0 sec 68.3 MBytes 573 Mbits/sec
    [ 3] 3.0- 4.0 sec 68.3 MBytes 573 Mbits/sec
    [ 3] 4.0- 5.0 sec 68.1 MBytes 572 Mbits/sec
    [ 3] 5.0- 6.0 sec 68.6 MBytes 575 Mbits/sec
    [ 3] 6.0- 7.0 sec 69.0 MBytes 579 Mbits/sec
    [ 3] 7.0- 8.0 sec 68.9 MBytes 578 Mbits/sec
    [ 3] 8.0- 9.0 sec 68.9 MBytes 578 Mbits/sec
    [ 3] 9.0-10.0 sec 66.6 MBytes 559 Mbits/sec
    [ 3] 0.0-10.0 sec 688 MBytes 577 Mbits/sec
    [ 3] Sent 490928 datagrams
    [ 3] Server Report:
    [ 3] 0.0-10.0 sec 680 MBytes 570 Mbits/sec 0.019 ms 6064/490927 (1.2%)
    [ 3] 0.0-10.0 sec 1 datagrams received out-of-order
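For reference, the pinning used in the scenarios above can be done on the fly with xm vcpu-pin (or made persistent with a cpus= line in the VM's config file); a sketch:

# each NFS VM on its own core
xm vcpu-pin 18_nfs1 0 0
xm vcpu-pin 21_nfs2 0 1
# worst case: both NFS VMs squeezed onto core 0
xm vcpu-pin 18_nfs1 0 0
xm vcpu-pin 21_nfs2 0 0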

WARNING: EXPERIMENTAL AND NOT VERY WELL TESTED (USE AT YOUR OWN RISK!):
MTU stands for Maximum Transmission Unit in network terminology. The bigger the MTU, the less overhead from the TCP/IP stack, so it can give great network results and decrease CPU utilisation for network-intensive operations between VMs (in Xen, packets between VMs traverse like this: domU_1 -> dom0 (bridge) -> domU_2). Before altering the MTU for virtual machines you should be familiar with the way they work in Xen; go here for a very good article explaining the architecture of bridged interfaces in Xen. Before you can change the MTU of the bridge (sanbr0 in my case) you must change the MTU of each vifX.Y interface in dom0 by running: ip link set dev vifX.Y mtu 9000 (the list of those interfaces can be found by running brctl show). Next you have to set the MTU of the bridge itself (in dom0): ip link set dev sanbr0 mtu 9000. Now you can use the larger MTU inside the VMs. The test itself was performed on the same Quad box mentioned earlier, but this time from the vmrac2 VM node to one NFS VM node (yes, the vmrac2 node is running Oracle RAC on NFS, but it was idle; no transactions were performed during this test).
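To put the dom0 steps together, here is a minimal sketch; it assumes the bridge is called sanbr0 (as in my setup) and that all the vifX.Y interfaces reported by brctl show belong to it, so adjust it for your environment:

# in dom0: raise the MTU on every vif, then on the bridge itself
for vif in $(brctl show | awk '{print $NF}' | grep '^vif'); do
    ip link set dev "$vif" mtu 9000
done
ip link set dev sanbr0 mtu 9000

With that in place, the measurements from inside the vmrac2 VM looked as follows: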

[root@vmrac2 ~]# cd /u03
[root@vmrac2 u03]# mkdir temp
[root@vmrac2 u03]# cd temp/
# used NFS mount options
[root@vmrac2 temp]# mount | grep /u03
10.98.1.102:/data on /u03 type nfs (rw,bg,hard,nointr,tcp,nfsvers=3,timeo=300,rsize=32768,wsize=32768,actimeo=0,addr=10.98.1.102)
[root@vmrac2 temp]# ip link ls dev eth2
5: eth2: mtu 1500 qdisc pfifo_fast qlen 1000
link/ether 00:16:3e:6c:e7:67 brd ff:ff:ff:ff:ff:ff
[root@vmrac2 temp]# dd if=/dev/zero of=test1 bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 14.0485 seconds, 14.9 MB/s
# now we change MTU
[root@vmrac2 temp]# ip link set dev eth2 mtu 9000
[root@vmrac2 temp]# rm -f test1
[root@vmrac2 temp]# dd if=/dev/zero of=test2 bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 2.28668 seconds, 91.7 MB/s
[root@vmrac2 temp]# rm test2
rm: remove regular file `test2'? y
# let's test again to be sure
[root@vmrac2 temp]# dd if=/dev/zero of=test3 bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 2.14852 seconds, 97.6 MB/s
[root@vmrac2 temp]# rm test3
rm: remove regular file `test3'? y
# switch back to MTU=1500 to exclude other factors
[root@vmrac2 temp]# ip link set dev eth2 mtu 1500
[root@vmrac2 temp]# dd if=/dev/zero of=test4 bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 10.3054 seconds, 20.4 MB/s
# and again to MTU=9000
[root@vmrac2 temp]# ip link set dev eth2 mtu 9000
[root@vmrac2 temp]# dd if=/dev/zero of=test4 bs=1M count=200
[root@vmrac2 temp]# rm test4
rm: remove regular file `test4'? y
[root@vmrac2 temp]# dd if=/dev/zero of=test5 bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 2.37787 seconds, 88.2 MB/s
[root@vmrac2 temp]#

As you can see, we’ve increased sequential NFS write performance from about ~20 MB/s to ~90 MB/s, with both the NFS server and the NFS client running in Oracle VM, just by switching to a larger MTU (I’ll try switching the MTU to 16k or even 32k to match the NFS rsize/wsize).

One more note: this is experimental, so don’t try it on your own Oracle VM/Xen installations, as it may be unsupported. I’m still experimenting with this, but I hope this trick won’t break anything ;)

p.s.#1 Simple iperf TCP bandwidth test on LAN with MTU=9000 (with 1500 it was ~1.9Gbps, as you could read earlier)
[root@nfs2 ~]# /root/iperf -c 10.98.1.101
------------------------------------------------------------
Client connecting to 10.98.1.101, TCP port 5001
TCP window size: 73.8 KByte (default)
------------------------------------------------------------
[ 3] local 10.98.1.102 port 37660 connected with 10.98.1.101 port 5001
[ 3] 0.0-10.0 sec 7.30 GBytes 6.27 Gbits/sec

p.s.#2 Yes, Oracle RAC 11g works on Oracle VM on NFS3 :)

Oracle’s EM Grid Control on CentOS5/RHEL5: libdb.so.2 issue and resolution

Sunday, November 25th, 2007

CentOS 5 and RHEL 5 come without the libdb.so.2 library (the old Sleepycat database software). The problem is that there is no RPM for this library in the repositories (the newest is compat-db-4.3, which only provides libdb-4.3.so). Oracle’s EM installer fails on the missing libdb.so.2 with:

libdb.so.2 missing on CentOS5/RHEL5

The resolution that worked for me was to install the Red Hat 7.3 db1 package(!). Binary compatibility seems to work:

[root@oemgc ~]# wget ftp://fr.rpmfind.net/linux/redhat/7.3/en/os/i386/RedHat/RPMS/db1-1.85-8.i386.rpm
--12:32:23-- ftp://fr.rpmfind.net/linux/redhat/7.3/en/os/i386/RedHat/RPMS/db1-1.85-8.i386.rpm
=> `db1-1.85-8.i386.rpm'
Resolving fr.rpmfind.net... 194.199.20.114
Connecting to fr.rpmfind.net|194.199.20.114|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD /linux/redhat/7.3/en/os/i386/RedHat/RPMS ... done.
==> SIZE db1-1.85-8.i386.rpm ... 42581
==> PASV ... done. ==> RETR db1-1.85-8.i386.rpm ... done.
Length: 42581 (42K)
100%[=============================================================>] 42,581 230K/s in 0.2s
12:32:25 (230 KB/s) - `db1-1.85-8.i386.rpm' saved [42581]
[root@oemgc ~]# rpm -Uhv db1-1.85-8.i386.rpm
warning: db1-1.85-8.i386.rpm: Header V3 DSA signature: NOKEY, key ID db42a60e
Preparing... ########################################### [100%]
1:db1 ########################################### [100%]
[root@oemgc ~]# rpm -ql db1
/usr/bin/db1_dump185
/usr/lib/libdb.so.2
/usr/lib/libdb1.so.2
/usr/share/doc/db1-1.85
/usr/share/doc/db1-1.85/LICENSE
/usr/share/doc/db1-1.85/README
/usr/share/doc/db1-1.85/changelog
[root@oemgc ~]#

My engineering work…

Wednesday, August 15th, 2007

Since May I’ve been very busy architecting and implementing a Java Enterprise Edition cluster on commodity hardware (mainly x86_32 based) for my engineering work, to obtain a BEng title. Our subject is:
“Web service based on a scalable and highly available J2EE application cluster”. We have a team of 4 people, in which I’m responsible for everything to do with systems/hardware scaling/clusters/load balancing/databases/networking/tuning :) . What kind of portal we are creating is to be decided by the developers (it will likely be some kind of Web 2.0 portal).
The rest of the team is dedicated to J2EE programming. We are mainly playing with the technology.
Currently the rock-solid base core cluster architecture looks like this:

Cluster architecture

We are utilizing:

  • Load balancers: Linux Virtual Servers with DirectRouting on CentOS5 (configured as a part of Redhat Cluster Suite)
  • Database: Oracle10g R2
  • Middleware: JBOSS 4.2.0 (EJB3) running in a cluster based on JGroups + Hibernate(JPA) + JBOSS Cache
  • Frontend: Apache2 webservers with Solaris Network Cache Accelerator and AJP proxy to JBOSS servers
  • Solaris Jumpstart to set up new systems really fast, plus our self-written PHP application for maintaining systems.
  • NFS to provide static content for the web servers from the Oracle server (yay! a dedicated NetApp would be great! ;) )
  • LDAP to synchronize admin accounts across the cluster.
  • SNMPv2 (LVS, OSes, JBoss, Oracle) to monitor everything with a single (self-written) Java application which graphs everything in real time.

As this is a basic configuration with the database as a single point of failure, in September I’m going to set up Data Guard for Oracle. I’m also testing more advanced scale-up. Currently I’m in the process of setting up Solaris Cluster with Oracle RAC 10gR2 on iSCSI storage provided by a third node based on Solaris Nevada with an iSCSI target, in order to test Transparent Application Failover. I’ve been scratching my head over this one for a while now. Yeah, it is real hardcore… and that’s not the end of the story: Disaster Recovery with some other interesting bits of technology is going to be implemented later on… all on x86_32 commodity hardware :) We are also going to put C-JDBC (the Sequoia project) under stress…

Solaris x86 customized Jumpstart from Linux NFS server — NFSv4 problem and solution

Friday, July 6th, 2007

There is some kind of incompatibility between the Linux 2.6 NFSv4 server and the Solaris 10 (U3) NFSv4 client. On an installed Solaris system you can put some variables into /etc/default/nfs and it should work, but when you are trying to bootstrap from a Linux NFS server using Jumpstart you have to look for another solution:

1) Build a new miniroot image with /etc/default/nfs altered?
2) Simpler: alter the Linux NFS server to provide, e.g., only the NFSv2 service.
This can be achieved by recompiling the kernel without NFSv4 or, as a much cleaner solution, by disabling the NFSv4 service at runtime.

Place the following in /etc/sysconfig/nfs (RHEL5/CentOS5 specific configuration file):
RPCMOUNTDOPTS="--no-tcp --no-nfs-version 4 --no-nfs-version 3"
RPCNFSDARGS="--no-tcp --no-nfs-version 4 --no-nfs-version 3"

Now execute
/etc/init.d/nfs restart
That’s all! :) Jumpstart problem solved!

For more info consider reading the man pages for rpc.nfsd and rpc.mountd. Internally those switches write “+2 -3 -4” to /proc/fs/nfsd/versions. The versions file can only be modified after stopping the [nfsd] kernel service (you’ll get an EBUSY errno when trying to change it while nfsd is running).
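A quick way to confirm what the Linux NFS server now advertises; a sketch:

rpcinfo -p localhost | grep nfs     # only version 2 should be registered
cat /proc/fs/nfsd/versions          # should show something like: +2 -3 -4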

Links for 23/04/2007

Monday, April 23rd, 2007

MySQL performance/scalability on Linux vs FreeBSD; or simply why Linux sucks when it comes to MySQL?

ESX performance tuning guide

CREST (Council of Registered Ethical Security Testers)

Forced linux shutdown

Friday, March 23rd, 2007

Some time ago I noticed huge corruption of / on one of my servers; /sbin/reboot was gone, along with /sbin/init, /sbin/poweroff and so on.

How to shut down the machine remotely? Simple:

echo 1 > /proc/sys/kernel/sysrq
echo o > /proc/sysrq-trigger

(this requires SysRq support compiled into the kernel; most distributions have it compiled into their kernels)
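Other SysRq triggers are handy in the same situation; a sketch of the classic sync / remount read-only / reboot sequence, instead of powering off with 'o' as above:

echo 1 > /proc/sys/kernel/sysrq
echo s > /proc/sysrq-trigger    # emergency sync
echo u > /proc/sysrq-trigger    # remount all filesystems read-only
echo b > /proc/sysrq-trigger    # immediate reboot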

Nice site: benchmarks of GFS,GNBD,OCFS2

Friday, March 23rd, 2007

A mostly English site dedicated to benchmarking clustered filesystems: DistributedMassStorage

Exporting a simple file from a Linux target to a Solaris initiator using iSCSI

Friday, March 23rd, 2007

A quick HOWTO on “exporting” a simple file via iSCSI from a Linux target (ietd) to Solaris:

The Linux target is running Debian 4.0 with a 2.6.18 kernel and iSCSI Enterprise Target (ietd) version 0.4.14. I wish it were a Solaris box, but my very old home SCSI controllers (DELL MegaRAID 428 (PERC2) and InitIO) aren’t supported by Solaris; there are some drivers, but only for Solaris 2.7-2.8, and after a small war with them I must admit that I failed… even after playing with hardcore stuff in /etc/driver_aliases.

Installing the iSCSI target on Debian is discussed here: Unofficial iSCSI target installation. Some checks:

rac3:/etc/init.d# cat /proc/net/iet/volume
tid:2 name:iqn.2001-04.com.example:storage.disk2.sys1.xyz
lun:0 state:0 iotype:fileio iomode:wt path:/u01/iscsi.target

rac3:/etc/init.d# cat /proc/net/iet/session
tid:2 name:iqn.2001-04.com.example:storage.disk2.sys1.xyz
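For reference, a file-backed LUN like this one is declared in /etc/ietd.conf roughly as follows; a sketch using the target name and path from the output above (further options may of course differ in your setup):

Target iqn.2001-04.com.example:storage.disk2.sys1.xyz
        Lun 0 Path=/u01/iscsi.target,Type=fileio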

As you can see, /u01/iscsi.target is a normal file (created with dd(1)) on the MegaRAID RAID0 array. We will use it to do some testing from Solaris:


root@opensol:~# iscsiadm add static-config iqn.2001-04.com.example:storage.disk2.sys1.xyz,10.99.1.25
root@opensol:~# iscsiadm modify discovery --static enable
root@opensol:~# devfsadm -i iscsi
root@opensol:~# iscsiadm list target
Target: iqn.2001-04.com.example:storage.disk2.sys1.xyz
Alias: -
TPGT: 1
ISID: 4000002a0000
Connections: 1
root@opensol:~# format
Searching for disks...done

0. c1t0d0
/pci@0,0/pci1000,30@10/sd@0,0

1. c2t17d0
/iscsi/disk@0000iqn.200104.com.example%3Astorage.disk2.sys1.xyzFFFF,0
Specify disk (enter its number): CTRL+C


Okay, so we are now sure that iSCSI works. In a few days I’m going to test exporting a SONY SDT-9000 (an old tape drive) via iSCSI :)

A few interesting links (clustered Samba, an interesting CCIE.PL post, expensive SAN devices ;) )

Sunday, March 18th, 2007

Samba has never interested me much (since I was never interested in sharing storage with Windows boxes ;) ), but clustered Samba looks promising.

On the CCIE.PL forum there is a very interesting post (author: pjeter) which describes, more or less, grabbing the picture from a cable-TV tuner and pushing it onto multicast in real time (with a codec) on an ordinary Linux PC, so that the TV picture is available to the other computers on the local network…

SAN architectures interest me (e.g. this link)… but unfortunately they are out of my budget’s range:

  • an MDS 9216 costs about 120,000 PLN
  • better not to even write about the MDS 95xx ;)