Archive for January, 2009

Oracle Clusterware 11g addNode.sh ONS bug (with GDB fun & solution)

Monday, January 26th, 2009

Yesterday I added a new node to one of my Clusterware installations; I decided to use the addNode.sh script. Unfortunately, after adding the new node (prac2) to the cluster, the kernel reported a segmentation fault of the ONS daemon every few seconds. Here’s how one can use some in-depth knowledge to solve such problems (OS: Oracle Enterprise Linux 5.1, Clusterware: 11.1.0.6, platform: x86_64).

Some messages after the Clusterware shutdown; the most important are the first two lines.

[root@prac2 conf]# tail /var/log/messages
Jan 25 20:42:14 prac2 kernel: ons[1923]: segfault at 0000000000000060 rip 000000000040cb56 rsp 0000000043204028 error 4
Jan 25 20:42:14 prac2 kernel: ons[1927]: segfault at 0000000000000060 rip 000000000040cb56 rsp 0000000043204028 error 4
Jan 25 20:42:14 prac2 logger: Oracle clsomon shutdown successful.
Jan 25 20:42:14 prac2 logger: Oracle CSS family monitor shutting down gracefully.
Jan 25 20:42:15 prac2 logger: Oracle CSSD graceful shutdown
Jan 25 20:42:15 prac2 logger: Oprocd received graceful shutdown request. Shutting down.
[root@prac2 conf]#

Next, we remove the limit on core-dump files (by default core dumps are disabled). Then we create a dedicated directory for collecting the cores and configure the kernel to use it.
[root@prac2 ~]# ulimit -c unlimited
[root@prac2 ~]# ulimit -c
unlimited
[root@prac2 ~]# mkdir /tmp/cores
[root@prac2 ~]# chmod 1777 /tmp/cores
[root@prac2 ~]# sysctl kernel.core_pattern='/tmp/cores/core.%e.%p'
kernel.core_pattern = /tmp/cores/core.%e.%p
[root@prac2 ~]#
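Note that both settings are per-boot. To make them persistent, a minimal sketch (assuming the stock OEL 5 locations /etc/sysctl.conf and /etc/security/limits.conf):

echo 'kernel.core_pattern = /tmp/cores/core.%e.%p' >> /etc/sysctl.conf
echo '*    soft    core    unlimited' >> /etc/security/limits.conf
sysctl -p    # reload settings from /etc/sysctl.conf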

After bringing Clusterware back up (crsctl start crs) we quickly collect some cores:
[root@prac2 cores]# ls
core.ons.4990 core.ons.5038 core.ons.5078 core.ons.5123 core.ons.5163 core.ons.5203
core.ons.5004 core.ons.5045 core.ons.5090 core.ons.5135 core.ons.5175 core.ons.5210
core.ons.5011 core.ons.5052 core.ons.5097 core.ons.5142 core.ons.5182 core.ons.5217
core.ons.5018 core.ons.5064 core.ons.5104 core.ons.5149 core.ons.5189 core.ons.5224
core.ons.5031 core.ons.5071 core.ons.5116 core.ons.5156 core.ons.5196
[root@prac2 cores]# /u01/app/crs/bin/crsctl stop crs
Stopping resources.
This could take several minutes.
Successfully stopped Oracle Clusterware resources
Stopping Cluster Synchronization Services.
Shutting down the Cluster Synchronization Services daemon.
Shutdown request successfully issued.
[root@prac2 cores]#

We need to check whether the ONS daemon dumps core at the same address (rip = instruction pointer) each time; we’ll take the core from PID 4990:
[root@prac2 cores]# grep 'ons\[4990\]' /var/log/messages
Jan 25 20:07:12 prac2 kernel: ons[4990]: segfault at 0000000000000060 rip 000000000040cb56 rsp 0000000043204028 error 4
Jan 25 20:16:49 prac2 kernel: ons[4990]: segfault at 0000000000000060 rip 000000000040cb56 rsp 0000000043204028 error 4
Jan 25 20:34:13 prac2 kernel: ons[4990]: segfault at 0000000000000060 rip 000000000040cb56 rsp 0000000043204028 error 4
[root@prac2 cores]#
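To double-check that every ONS crash, not just PID 4990, hits the same instruction pointer, one can extract the rip value from all segfault lines (a sketch, assuming the message format shown above):

grep 'ons\[' /var/log/messages | awk '{for(i=1;i<NF;i++) if($i=="rip") print $(i+1)}' | sort -u

A single line of output means a single faulting address.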

OK, so we’ve got the bad guy at 0x000000000040cb56. Now we have to find out why it happens (my personal philosophy: fix a problem only once you know its exact cause). Some output below has been trimmed for clarity.
[root@prac2 cores]# file core.ons.4990
core.ons.4990: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'ons'
[root@prac2 cores]# gdb /u01/app/crs/opmn/bin/ons core.ons.4990
GNU gdb Red Hat Linux (6.5-25.el5rh)
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
[..]

Core was generated by `/u01/app/oracle/product/11.1.0/crs/opmn/bin/ons -d'.
Program terminated with signal 11, Segmentation fault.
#0 0x000000000040cb56 in opmnHttpFormatConnect ()
(gdb) thread apply all bt

Thread 7 (process 4990):
#0 0x0000003ea6c0a496 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x000000000041aa07 in main ()

Thread 6 (process 4998):
#0 0x0000003ea6c0a687 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000418666 in opmnWorkerEntry ()
#2 0x0000003ea6c062e7 in start_thread () from /lib64/libpthread.so.0
#3 0x0000003ea60ce3bd in clone () from /lib64/libc.so.6

Thread 5 (process 4999):
#0 0x0000003ea6c0a687 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000418666 in opmnWorkerEntry ()
#2 0x0000003ea6c062e7 in start_thread () from /lib64/libpthread.so.0
#3 0x0000003ea60ce3bd in clone () from /lib64/libc.so.6

Thread 4 (process 5000):
#0 0x0000003ea6c0a687 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000418666 in opmnWorkerEntry ()
#2 0x0000003ea6c062e7 in start_thread () from /lib64/libpthread.so.0
#3 0x0000003ea60ce3bd in clone () from /lib64/libc.so.6

Thread 3 (process 5001):
#0 0x0000003ea60c7922 in select () from /lib64/libc.so.6
#1 0x0000000000411399 in opmnListenerEntry ()
#2 0x0000003ea6c062e7 in start_thread () from /lib64/libpthread.so.0
#3 0x0000003ea60ce3bd in clone () from /lib64/libc.so.6

Thread 2 (process 5003):
#0 0x0000003ea6095451 in nanosleep () from /lib64/libc.so.6
#1 0x0000003ea6095274 in sleep () from /lib64/libc.so.6
#2 0x0000000000413fb6 in opmnPerformance ()
#3 0x0000003ea6c062e7 in start_thread () from /lib64/libpthread.so.0
#4 0x0000003ea60ce3bd in clone () from /lib64/libc.so.6

Thread 1 (process 5002):
#0 0x000000000040cb56 in opmnHttpFormatConnect ()
#1 0x0000000000408d19 in connectionActive ()
#2 0x0000000000407774 in opmnConnectionEntry ()
#3 0x0000003ea6c062e7 in start_thread () from /lib64/libpthread.so.0
#4 0x0000003ea60ce3bd in clone () from /lib64/libc.so.6

(gdb)
(gdb) disassemble
Dump of assembler code for function opmnHttpFormatConnect:
[..]
0x000000000040cb4b <opmnHttpFormatConnect+345>: mov 2121046(%rip),%r11 # 0x6128a8
0x000000000040cb52 <opmnHttpFormatConnect+352>: pushq 0x60(%r11)
0x000000000040cb56 <opmnHttpFormatConnect+356>: movzwl 0x60(%rbp),%r9d
[..]
0x000000000040cba6 <opmnHttpFormatConnect+436>: lea 826995(%rip),%rsi # 0x4d6a20 <__STRING.28>
0x000000000040cbad <opmnHttpFormatConnect+443>: mov 0x78(%rsp),%rdi
0x000000000040cbb2 <opmnHttpFormatConnect+448>: callq 0x4073d0 <sprintf@plt>
[..]
0x000000000040cbf5 <opmnHttpFormatConnect+515>: retq
[..]
(gdb)

From the output we can conclude that this function calls no other functions or syscalls except sprintf. The instruction on which the program SIGSEGVs, movzwl, belongs to the “extended move” family; it’s just an optimized copy instruction. Something is wrong here: are we loading through a bad frame base pointer? A bad memory pointer? NULL? Also, the format string for sprintf (for more info: man sprintf), loaded by the lea instruction, is as follows:

(gdb) x/s 0x4d6a20
0x4d6a20 <__STRING.28>: "POST /connect HTTP/1.1\r\nVersion: 3\r\nContent-Length: 0\r\nONStarget: %u,%hu\r\nONSsource: %u,%hu,%hu,%hu\r\nONSinfo: %s!%s!%u!%u!%hu!%hu!%hu\r\nhostName: %s\r\nclusterId: %s\r\nclusterName: %s\r\ninstanceId: %s\r\nins"...
(gdb)
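Notice that the faulting address from the kernel log, 0000000000000060, is exactly the 0x60 offset used in movzwl 0x60(%rbp), which strongly suggests rbp was simply NULL. This can be confirmed directly against core.ons.4990 (a sketch of GDB commands; output omitted):

(gdb) frame 0
(gdb) info registers rbp    # 0x0 here means 0x60(%rbp) dereferenced address 0x60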

As Oracle’s documentation specifies, ONS by default takes its parameters from the OCR. We can dump the OCR for further analysis:
[root@prac2 cores]# /u01/app/crs/bin/ocrdump
[root@prac2 cores]#

This generates a human-readable file named OCRDUMPFILE. After viewing it (grepping for the ONS string), it appears that the configuration for node prac2 is not there (and it should be). This can be corrected from node prac1 (where Clusterware is up):

[root@prac1 tmp]# /u01/app/crs/bin/racgons add_config prac2:6251
[root@prac1 tmp]# /u01/app/crs/bin/ocrdump
[root@prac1 tmp]# grep ONS OCRDUMPFILE
[DATABASE.ASM.prac1.+asm1.START_OPTIONS]
[DATABASE.ONS_HOSTS]
[DATABASE.ONS_HOSTS.prac1]
[DATABASE.ONS_HOSTS.prac1.PORT]
[DATABASE.ONS_HOSTS.prac2]
[DATABASE.ONS_HOSTS.prac2.PORT]
ORATEXT : CRS application for ONS on node
ORATEXT : CRS application for ONS on node
[root@prac1 tmp]#
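For scripted node additions, a sanity check along these lines right after addNode.sh would catch the missing entry immediately (a sketch; the node name, port and paths match this installation but are otherwise placeholders):

/u01/app/crs/bin/ocrdump /tmp/ocr.dump    # dump OCR to an explicit file
grep -q 'ONS_HOSTS\.prac2' /tmp/ocr.dump || \
    /u01/app/crs/bin/racgons add_config prac2:6251    # re-add the missing ONS config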

Now we have corrected it. Let’s start Clusterware on prac2 and verify how that format function performs. This is somewhat tricky, because the main ONS daemon is a child of the parent ONS watchdog (PID 21165) and the function is called only during initialisation, so we’re going to kill the child and make GDB follow the code path into the newly spawned child:

[oracle@prac2 tmp]$ ps uaxw|grep opm
oracle 21165 0.0 0.0 15928 392 ? Ss 21:55 0:00 /u01/app/oracle/product/11.1.0/crs/opmn/bin/ons -d
oracle 21166 2.1 0.6 146920 10984 ? Sl 21:55 0:00 /u01/app/oracle/product/11.1.0/crs/opmn/bin/ons -d
oracle 21281 0.0 0.0 61112 700 pts/0 R+ 21:55 0:00 grep opm
[oracle@prac2 tmp]$ gdb -p 21165 # parent
GNU gdb Red Hat Linux (6.5-25.el5rh)
[..]
(gdb) set follow-fork-mode child
(gdb) b *opmnHttpFormatConnect
Breakpoint 1 at 0x40c9f2
(gdb) b *0x000000000040cbf5 # remember that retq address...?
Breakpoint 2 at 0x40cbf5
(gdb) cont
Continuing.

Now, in the other session, we kill the child (the parent will restart it):
[oracle@prac2 tmp]$ kill -TERM 21166

In the GDB session we'll see this:
[New process 22004]
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
[New LWP 22015]
[Switching to LWP 22015]

Breakpoint 1, 0x000000000040c9f2 in opmnHttpFormatConnect ()
(gdb) cont
Continuing.

Breakpoint 2, 0x000000000040cbf5 in opmnHttpFormatConnect ()

(gdb) info registers rax
rax 0x1ab38960 447973728
(gdb) x/s 0x1ab38960
0x1ab38960: "POST /connect HTTP/1.1\r\nVersion: 3\r\nContent-Length: 0\r\nONStarget: 174260502,6251\r\nONSsource: 174260501,6251,6150,0\r\nONSinfo: !!174260501!0!6251!0!6150\r\nhostName: prac1\r\nclusterId: databaseClusterId\r\nc"...
(gdb) quit
The program is running. Exit anyway? (y or n) y
[oracle@prac2 tmp]$

Summary: at breakpoint #2 we are at retq (return from subroutine), and the rax (64-bit) register holds the address of the memory containing the formatted string. This is the HTTP request that the child ONS process sends. We’ve verified that it now works correctly :)
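For the record, the whole check can also be driven non-interactively; a sketch (substitute the parent ONS PID; the breakpoint address is specific to this exact binary, and you still have to kill the child from another session to trigger it):

gdb -p <parent_pid> -batch \
    -ex 'set follow-fork-mode child' \
    -ex 'break *0x000000000040cbf5' \
    -ex 'continue' \
    -ex 'info registers rax'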

Quick memo for myself for starting up GridControl

Saturday, January 24th, 2009

Starting GC (as oracle software owner):

$OMA_HOME/bin/emctl start agent
$OMS_HOME/bin/emctl start oms
$OMS_HOME/opmn/bin/opmnctl startall

.. and stopping:

$OMS_HOME/opmn/bin/opmnctl stopall
$OMA_HOME/bin/emctl stop agent
$OMS_HOME/bin/emctl stop oms

$ORACLE_HOME/bin/sqlplus /nolog <<EOF1
connect / as sysdba
shutdown immediate
EOF1
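A tiny wrapper keeps the ordering straight (a sketch built only from the commands above; it assumes OMA_HOME and OMS_HOME are exported, and leaves the repository database shutdown as a separate step):

#!/bin/sh
# gc_ctl: start/stop Grid Control components in the order listed above
case "$1" in
start)
    "$OMA_HOME/bin/emctl" start agent
    "$OMS_HOME/bin/emctl" start oms
    "$OMS_HOME/opmn/bin/opmnctl" startall
    ;;
stop)
    "$OMS_HOME/opmn/bin/opmnctl" stopall
    "$OMA_HOME/bin/emctl" stop agent
    "$OMS_HOME/bin/emctl" stop oms
    ;;
*)
    echo "usage: $0 {start|stop}" >&2
    exit 1
    ;;
esac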

Brief description of work on my MAA challenge lab… #1

Saturday, January 24th, 2009

The picture below (click to enlarge) shows what I’m planning to deploy in my home lab in order to prepare better for OCP certification. It can be summarized as a full Maximum Availability Architecture implementation… Grid Control is being used to increase productivity, but I don’t want to integrate Oracle VM into the stack, just systems and databases:

MAA challenge lab

17/01/2009: Installation of, and fight with, Grid Control on the gc VM. Prepared a Linux VM template named 01_prac1 from which the other machines are going to be cloned (as simple as a recursive “cp” in dom0).

18/01/2009: Installation of, and fight with, Grid Control after I dropped the “emrep” database (the main GC repository database). This happened while I was playing with “misc1”, a database cloned from “emrep”. I didn’t read the message while running “DROP DATABASE ..” from RMAN and sent both databases to /dev/null, because the DBID was the same for the original and the “misc1” clone (the whole point was that I wanted misc1 cloned from emrep, but it failed :) ). Did I mention that I had also deleted the backups? ;) After a new, successful fresh installation of GC, I took a full backup (from the XEN dom0) of the 01_gc VM for future purposes. I’m starting to regret that I haven’t used LVM in dom0 for snapshot/fast-backup purposes…

20/01/2009: Set up the 09_ora1 virtual machine from the 01_prac1 VM template. Installed a single 11g database named “db1.lab1” with a dedicated tablespace & user for sysbench version 0.4.8 (0.4.10 doesn’t work with Oracle).

23/01/2009: Cloned ora2 from ora1. Changed hostnames and IPs (trick: the same Ethernet MACs, but on different XEN bridges; changes performed from the console :)). Uninstalled Agent10g and removed the database on ora2. Set up a DNS server on quadvm (in the XEN dom0) for the whole lab. Installed the GC agent on ora2, but something is wrong… the GC console doesn’t pick up the new targets (primarily I’m looking for the “Host” component), even though the agent itself is discovered by GC… so I started from the beginning (rm -rf 08_ora2 :o ) and so on…
Finally I got ora3 up and running instead of ora2. Then it turned out that Laurent Schneider had blogged about a Metalink note describing the agent removal procedure. So finally I got ora2 back in GC (together with gc, ora1 and ora3).

The next step was setting up host prac1 for a single-instance, non-RAC ASM database “db3.lab1”. First, Clusterware was installed. I wanted the 32-bit version, because my future standby RAC hardware is only 32-bit capable, but it turned out I would have to install 32-bit userland RPMs, so I decided to try, in the long term, a 64-bit primary RAC with a 32-bit standby RAC… The Grid Control agent was also installed on prac1.

24/01/2009: Raised the RAM for 01_prac1 from 1.2GB to 1.6GB (due to frequent swapping; I want the 11g memory_target for that fresh db3.lab1 database to stay at 450MB). Successful migration of ASM storage from /dev/hde to /dev/hdi (simulating a storage migration; thanks to ASM it is straightforward, and I’m going to blog about it in my next post).

Benchmarking PostgreSQL’s INSERT/COPY data loading performance (in one TCP stream)

Tuesday, January 6th, 2009

As everyone would expect, “COPY FROM” wins, but under real conditions the performance degradation seems to come from the network/IPv4 stack along with PostgreSQL overhead, not CPU or I/O. PostgreSQL 8.3.5 (client libraries were 8.2.x) ran on a Xen virtual machine with 2 VCPUs and 1GB RAM allocated from a Quad Core P4 (2.4GHz); no other Xen VM (except, of course, dom0) was running during the tests. The VM was on software RAID0 across three SATA2 7.2k RPM drives. Pgdata and xlog were both on the same RAID array, and PostgreSQL had fsync disabled (for performance). FS: ext3 with noatime, standard journal mode. The more important PostgreSQL parameters are below:

shared_buffers = 256MB
temp_buffers = 16MB
max_prepared_transactions = 10
work_mem = 4MB
maintenance_work_mem = 32MB
max_fsm_pages = 153600
fsync = off
wal_buffers = 512kB
archive_mode = off
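To confirm the server is really running with these values (fsync in particular), they can be queried at run time (a sketch; the database name benchdb is hypothetical):

for p in shared_buffers wal_buffers fsync; do
    psql -At -c "SHOW $p;" benchdb
done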

Real test #1: the benchmark machine (my workstation, 1.5GHz Sempron, running X11, mostly idle) was separated from the database by a 100mbps network (this is intended: to simulate a real environment on a MAN network); average bench.pl CPU usage was about 10-15% (reported by top). During the “INSERT” tests the benchmarked database was *IDLE* (idle CPU reported by vmstat was mostly 99-100 all the time), but network communication generated ~2-3k packets per second and ~1.5-1.9 Mbps. The bench.pl parameters were $items = 20000 (random items to INSERT/COPY) and $loops = 10 (repetitions of each test, to avoid skewed results).

postgresql_ins_bench_100mbps

Artificial test #2: the benchmark ran on the same machine as the database (loopback communication only). This one really stresses CPU and I/O, and it also takes some CPU cycles for the benchmarking itself (I have plans to include Gigabit Ethernet in my lab ;) ). The bench.pl settings were $loops = 10 and $items = 40000.

postgresql_ins_bench_localhost

To see the results, please click on the images above. Long story short: if you intend to perform more than several INSERTs into the same table, consider using the COPY statement (especially over slow networks, but also on high-bandwidth links, to reduce PostgreSQL, TCP and OS overheads).
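The contrast is easy to reproduce even without bench.pl (a sketch using psql; the table t(id int) and database benchdb are hypothetical):

# one INSERT per row, streamed over a single connection
seq 1 20000 | awk '{print "INSERT INTO t(id) VALUES (" $1 ");"}' | psql -q benchdb

# a single COPY for all rows
seq 1 20000 | psql -q benchdb -c "COPY t(id) FROM STDIN"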

Legend:

  • ins_adhoc – for every INSERT, a prepare+execute is done at the DBI level
  • ins_prep – a prepared INSERT statement is reused for each record
  • ins_func – for every record, a function performing the real INSERT is prepared & executed
  • ins_func_prep – the same as “ins_func”, but the statement executing the function is prepared once
  • copy – a single COPY statement is executed for all records