Deferrable primary/unique keys constraints still might result in duplicate values due to updates (even in recent 12c)

April 19th, 2016

I’m still surprised that Oracle PSU 12.1.0.2.160119 (January 2016) is apparently still bugged by old zombie bug of the past .

BTW: On most recent PostgreSQL 9.4 series (9.4.6) it works OK (fails with error).

Oracle Secure Backup/volume corruption due to tape auto-rewinding on /dev/st* access

April 14th, 2016

We’ve got interesting problem with RMAN+OSB 10.4 that resulted in observation that we were literally loosing backups. Co-worker discovered that our in-house solution for automatic Oracle database restores was sometimes failing due to missing backup pieces, later he nailed down the problem using RMAN’s crosschecks and discovered that backup pieces were in EXPIRED state without any pattern. Automatic restores most of the times worked due to RMAN failovers to previous backup, but with archivelog the situation was worse. We then tried to discover what can potentially cause it (across two DCs!) and somehow it was discovered that when we disabled Puppet and monitoring the problem magically disappeared. Generally speaking I do not like magic on our systems, so therefore I’ve used some spare time to explain the magic.

After analyzing some events and theories the problem was nailed down to corruption of running backups due to monitoring scripts (customized Zabbix agent with add-on scripts all of it managed by Puppet). One of the scripts contained lshw executions which under debugging session revealed opening /dev/st[0-9]+ tape devices via simple open(2) syscall. The man 4 st (Linux’s kernel SCSI tape device) describes the st device with the following characteristics:

Each device uses eight minor device numbers. The lowermost five bits in the minor numbers are assigned sequentially in the order of detection. In the 2.6 kernel, the bits above the eight lowermost bits are concatenated to the five lowermost bits to form the tape number. The minor numbers can be grouped into two sets of four numbers: the principal (auto-rewind) minor device numbers, n, and the “no-rewind” device numbers, (n + 128). Devices opened using the principal device number will be sent a REWIND command when they are closed. Devices opened using the “no-rewind” device number will not (..)

In order to test the theory I’ve performed set of simple experiments that was like this:

  1. Perform 1st single RMAN datafile backup, verify it’s integrity
  2. Perform 2nd single RMAN datafile backup, just to verify it hits the same OSB backup label (volume, barcode
  3. While no backups were running, execute lshw several times
  4. Perform 3rd single RMAN datafile backup, verify it’s integrity
  5. Start 4th single RMAN datafile backup, and it parallel start executing lshw
  6. Wait till 4th datafile backup finishes
  7. Perform 5th single RMAN datafile backup, verify it’s integrity

The test proved that backups #1-3 were OK, however execution of backup #5 raised my eyebrows.

oracle/12215.1: 2016/04/13.14:10:24 Info: original mount data for tape2 don’t match those read
oracle/12215.1: 2016/04/13.14:10:24 from the tape.
oracle/12215.1: 2016/04/13.14:10:24 (amh) qdv__automount_in_mh entered
oracle/12215.1: 2016/04/13.14:10:24 (amh) qdv__automount_in_mh tape2 at 2016/04/13.14:10:24, flags 0×100
oracle/12215.1: 2016/04/13.14:10:24 (amh) mount volume options list contains:
oracle/12215.1: 2016/04/13.14:10:24 (amh) vtype 3 (app), vid (null), vs_create 0, family XYZ-TE4-BACKUP, retain (null), size 0, mediainfo 2, scratch 0
oracle/12215.1: 2016/04/13.14:10:28 (amh) don’t preserve previous mh automount state
oracle/12215.1: 2016/04/13.14:10:31 (gep) getting reservation for element 0×4 (dte)
oracle/12215.1: 2016/04/13.14:10:31 (amh) pass 0 – trying to automount loaded volume
oracle/12215.1: 2016/04/13.14:10:31 (atv) qdv__automount_this_vol entered
oracle/12215.1: 2016/04/13.14:10:32 (atv) calling qdv__mount
oracle/12215.1: 2016/04/13.14:10:32 (mt) qdv__read_mount_db() succeeded, found vol_oid 0
oracle/12215.1: 2016/04/13.14:10:32 (mt) qdv__read_label() of 1048576 bytes failed – not at a tar file header (OB device mgr)
oracle/12215.1: 2016/04/13.14:10:32 (mt) qdb__label_event() returned vol_oid 839
oracle/12215.1: 2016/04/13.14:10:32 (mt) setting vol_oid in mount_info to 839
oracle/12215.1: 2016/04/13.14:10:32 (atv) qdv__mount succeeded
oracle/12215.1: 2016/04/13.14:10:32 (atv) qdv__automount_this_vol exited
oracle/12215.1: 2016/04/13.14:10:32 (amh) automount of loaded volume succeeded
oracle/12215.1: 2016/04/13.14:10:32 (amh) end of automount at 2016/04/13.14:10:32 (0×0)
oracle/12215.1: 2016/04/13.14:10:32 (amh) returning from qdv__automount_in_mh
oracle/12215.1: 2016/04/13.14:10:32 Info: volume in tape2 is usable for this operation.
oracle/12215.1: 2016/04/13.14:10:32 It must be overwritten at BOT.

oracle/12215.1: 2016/04/13.14:10:32 (pvfw) at BOT
oracle/12215.1: 2016/04/13.14:10:40 (pvfw) mounted ok
oracle/12215.1: 2016/04/13.14:10:40 (pvfw) pos_vol_cleanup not returning pstate
oracle/12215.1: Volume label:
oracle/12215.1: Volume tag: XYZ034L3E
oracle/12215.1: Volume UUID: ec287b98-e38d-1033-a482-e839352cd95c
oracle/12215.1: Volume ID: XYZ-TE4-BACKUP-000116
..

What’ was interesting was that there was some kind of desync between what OSB was expecting on the tape (obtar format) and what it got, then it proceeded directly into relabeling current tape in drive (XYZ-TE4-BACKUP-000115) to XYZ-TE4-BACKUP-000116 loosing all previous backup pieces on that volume with the same barcode. Any attempt to restore anything from XYZ-TE4-BACKUP-000115 failed from now on.

So it is clear that lshw opening /dev/st[0-9]+ devices even in read-only caused rewinding tape during backing up to that tape(!). The actual error messages may vary e.g. when tape is relabeled from different OSB media-family.

Testing revealed even more interesting testing results on different system (OL 6.7 with mhVTL to simulate SCSI tapes):

[root@db3 ~]# lshw -version
B.02.17
[root@db3 ~]# strace -ffe open /usr/sbin/lshw 2>&1 | grep /dev/st
[root@db3 ~]#

but afterwards after uninstaling lshw B02.17 (installed earlier from standard yum) and installing our older lshw B02.14 RPM (for OL5.11)

[root@db3 ~]# lshw -version
B.02.14
the latest version is B.02.17
[root@db3 ~]# strace -ffe open /usr/sbin/lshw 2>&1 | grep /dev/st
open("/dev/st0", O_RDONLY|O_NONBLOCK)   = 3
open("/dev/st0a", O_RDONLY|O_NONBLOCK)  = 3
[..]

So apparently there was bug in auto-rewind /dev/st[0-9]+ bug in lshw that was fixed. Actually lshw change-log didn’t clearly mention that (there was similar problem in years ago http://www.ezix.org/project/ticket/83), however comparing source code of scsi.cc between those lshw version reveals:

diff -r -u B.02.14/scsi.cc B.02.17/scsi.cc
--- B.02.14/scsi.cc     2008-05-07 23:23:31.000000000 +0200
+++ B.02.17/scsi.cc     2013-03-13 21:52:25.000000000 +0100
@@ -29,7 +29,7 @@

 __ID("@(#) $Id$");

-#define SG_X "/dev/sg%d"
+#define SG_X "/dev/sg*"
 #define SG_MAJOR 21

 #ifndef SCSI_IOCTL_GET_PCI
@@ -195,10 +195,10 @@
   "/dev/sr*",
   "/dev/cd*",
   "/dev/dvd*",
-  "/dev/st*", <-- our bugfix!!!
   "/dev/nst*",
   "/dev/nosst*",
-  "/dev/tape*",
+  "/dev/ntape*",
+  "/dev/nrtape*",
   NULL
 };
[..]

.. and the actually the code in older lshw with open() looks like this:

static void scan_devices()
{
  int fd = -1;
  int i = 0;
  size_t j = 0;
  My_scsi_idlun m_idlun;

  for (i = 0; devices[i] != NULL; i++)
  {
    glob_t entries;

    if(glob(devices[i], 0, NULL, &entries) == 0)
    {
      for(j=0; j < entries.gl_pathc; j++)
      {
        fd = open(entries.gl_pathv[j], O_RDONLY | O_NONBLOCK);
[..]

So the solution is fairly simple: upgrade your lshw RPM packge. Interesting question is which scsi/tape-related tools are safe and which are not. On OL 6.x the safe list contains at least those: lsscsi, obtool lsdev/dumpdev/lsvol/etc are safe, hpacucli (HP raid monitoring tool), sg_scan (sg3_utils)

But those might be not: obtool lsdev -ldg, sg_scan, mt -f /dev/st0 status, mtx, sg_inq.

Very simple demonstration how to limit latency impact (improve consumer experience) by switching to out-of-band/background validation connection checking for DB connections

March 31st, 2016

This is very simple demonstration on how to limit latency impact (improve consumer experience) by switching to out-of-band/background validation connection checking for DB connections:

Test setup: laptop <-- 0ms latency --> VM { Wildfly JBoss} <-- 6ms latency --> VM { Oracle DB }, latency was simulated using linux tc netem (6ms but only one way).
Benchmark generator settings: ApacheBenchmark 2.4 launched with concurrency = 1, timeout = 5, 1000 requests to simple JSP page, test was repeated multiple times to heat caches/etc.
Middle-tier: JBoss/WildFly Core 2.0.10.Final, 100 connection pool, validation connection checker explicitly set to “SELECT 1 FROM DUAL”
Application: simple JSP page performing 1 DB CALL (1x INSERT)

with 6ms RTT with JDBC Connection Pool with default foreground connection check (actually 4 packets; 4*6 =~ 24ms):

Requests per second:    55.97 [#/sec] (mean)
Time per request:       17.867 [ms] (mean)
Time per request:       17.867 [ms] (mean, across all concurrent requests)
Transfer rate:          16.07 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:    13   18   9.8     16     218
Waiting:       13   17   9.8     16     218
Total:         13   18   9.8     17     218

Percentage of the requests served within a certain time (ms)
  50%     17
  66%     17
  75%     18
  80%     18
  90%     19
  95%     21
  98%     38
  99%     71
 100%    218 (longest request)

with 6ms RTT with JDBC Connection Pool with background connection check (just 2 packets to DB; 2*6 =~ 12ms):

Requests per second:    100.74 [#/sec] (mean)
Time per request:       9.927 [ms] (mean)
Time per request:       9.927 [ms] (mean, across all concurrent requests)
Transfer rate:          28.92 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:     7   10   2.8      9      41
Waiting:        7    9   2.8      9      41
Total:          7   10   2.9      9      42

Percentage of the requests served within a certain time (ms)
  50%      9
  66%     10
  75%     10
  80%     10
  90%     11
  95%     13
  98%     18
  99%     26
 100%     42 (longest request)

This demonstrated that the average consumer experience can be improved from ~18ms to ~10ms just by setting two options that move the checking “out-of-band” for business transaction processing. Of course the more times the application calls .getConnection() even transparently like with the case of JSP, the more impact of those settings.

p.s. I’ve used

<check-valid-connection-sql>select 1 from dual</check-valid-connection-sql>

in my experiment however you should be really using

<valid-connection-checker class-name="org.jboss.jca.adapters.jdbc.extensions.oracle.OracleValidConnectionChecker"/> 

as it calls native OJDBC .isValid() or ping methods that perform the check even without SQL layer (smaller packets).

An excellent presentation about real-world scaling without unecessary hype

September 30th, 2015

For those who are tired of bullshit^H^H^H^H^H^H^Hnew hypes and are more keen on real-world, solid, predictable and stable IT infrastructure here you go:

https://plus.google.com/+KristianK%C3%B6hntopp/posts/4dme9PNZ1pa?pid=6197690782797826690&oid=117024231055768477646

For DBA especially comments on pages 20-25 are interesting, but nothing you wouldn’t know already. Page 19 is my personal favorite page due to how they really give a thought how to avoid XA/2PC, even if they could just use it.

Measuring I/O waste on incremental statistics gathering on partitioned tables on 11gR2 (STALE_PERCENT does not work with partitions)

July 15th, 2015

Inspiration to this article is this post: http://progeeking.com/2015/03/09/optimizer-statistics-stale-percentage-and-partitioning/ Please be sure to read it first. The main difference is looking at exactly what reads does the DBMS_STATS perform having INCREMENTAL=TRUE scenario.

Apparently it is well known fact that STALE_PERCENT does not work with (sub)partitions on 11gR2, however one might be interested in knowing exactly what additional I/O from DBMS_STATS can be generated due to this “bug”/missing feature. I’ve used the below SQL script to measure this on pretty recent 11.2.0.4.x edition:

-- prepare start
drop table stats_test purge;
exec dbms_stats.set_global_prefs('trace',to_char(0));

create table stats_test (
  id number primary key,
  merchant_id number not null,
  accounting_date date not null,
  somedata varchar2(128)
)
partition by range (merchant_id) interval (1)
subpartition by hash (accounting_date) SUBPARTITIONS 8
(partition p0 values less than (1))
;

insert into stats_test select rownum, MOD(rownum, 4), sysdate-(rownum/86400), 'blablabla'  from
  (select rownum r from dual connect by rownum <= 1000) a,
  (select rownum r from dual connect by rownum <= 1000) b,
  (select rownum r from dual connect by rownum <= 1000) c
where rownum <= 100000;

create index stats_test_idx1 on stats_test (somedata) local;

exec dbms_stats.set_table_prefs('JWARTAK', 'STATS_TEST', 'INCREMENTAL', 'TRUE');
-- just to be sure, it seems that ALL doesnt work with INCREMENTAL=TRUE
-- so with this AUTO setting you end up with no subpartition stats
-- you can tweak between AUTO and ALL here
exec dbms_stats.set_table_prefs('JWARTAK', 'STATS_TEST', 'GRANULARITY', 'ALL');
-- just to be sure
exec dbms_stats.set_table_prefs('JWARTAK', 'STATS_TEST', 'ESTIMATE_PERCENT', DBMS_STATS.AUTO_SAMPLE_SIZE); 

-- 10% by default but just to be sure
select DBMS_STATS.GET_PREFS ('STALE_PERCENT','JWARTAK','STATS_TEST') stale_percent from dual;

-- just in case
exec dbms_stats.gather_table_stats('JWARTAK', 'STATS_TEST', cascade=>true);
exec dbms_stats.gather_table_stats('JWARTAK', 'STATS_TEST', cascade=>true);
-- prepare end

---experiement start , see STALE_STATS
select PARTITION_NAME,SUBPARTITION_NAME,SUBPARTITION_POSITION, STALE_STATS, to_char(LAST_ANALYZED, 'HH24:MI DD/MM/YYYY') last_analyzed from user_tab_statistics where table_name='STATS_TEST' order by PARTITION_NAME, SUBPARTITION_POSITION;

select count(*) from stats_test;
select count(*) from stats_test where MERCHANT_ID=1;
-- just add 1 row (so 25000 -> 25001 rows in partition; 1 row just adds one row in subpartition - so now 1 out of 8 subpartitions has 3126 rows )
-- 3125 rows -> 3126 rows in subpartition is enough trigger whole partition dbms_stats scan
insert into stats_test (id, merchant_id, accounting_date, somedata) values (110012, 1, sysdate, 'blah');
commit;

-- just in case
exec DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO;

-- at this point all STALE_STATS = null (subpartitions) or =NO (undetected change)
select PARTITION_NAME,SUBPARTITION_NAME,SUBPARTITION_POSITION, STALE_STATS, to_char(LAST_ANALYZED, 'HH24:MI DD/MM/YYYY') last_analyzed from user_tab_statistics where table_name='STATS_TEST' order by PARTITION_NAME, SUBPARTITION_POSITION;

-- 1st run, you will get some dirty results
exec dbms_stats.gather_table_stats('JWARTAK', 'STATS_TEST');

-- another run to see the difference,
exec dbms_stats.gather_table_stats('JWARTAK', 'STATS_TEST');

-- enable debug mode
set serveroutput on
--exec dbms_stats.set_global_prefs('trace',to_char(1+4+8+16+64+2048+4096+8192));
exec dbms_stats.set_global_prefs('trace',to_char(4+8+16+64+2048+4096+8192));

-- repeat insert (now 25001 -> 25002 row in single subpartition)
insert into stats_test (id, merchant_id, accounting_date, somedata) values (110016, 1, sysdate, 'blah');
commit;
-- you can comment this out, at least on 11.2.0.4.x there is no difference
exec DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO;

-- 2nd run, >>> REAL RESULT <<<
exec dbms_stats.gather_table_stats('JWARTAK', 'STATS_TEST');

-- EXPERIMENT END

As you can see from the above there are two main variables here:
- dbms_stats granularity – you can set your own ALL, AUTO or any other to perform measurements
- trace level – based on http://www.pythian.com/blog/options-for-tracing-oracle-dbms_stats/ – the most interesting is where you want the output to go (diagnostic directory or via DBMS_OUTPUT buffer)
- FLUSH_DATABASE_MONITORING_INFO – you may want to flush it or not

The result is (at least on 11.2.0.4.x) that after adding just single row to single (sub)partition, even with INCREMENTAL_STATS=TRUE, you’ll get:

a) with GRANULARITY=AUTO (means table and partition but without dedicated subpartition stats, in most cases this will be “GLOBAL AND PARTITION”):

somewhere:~$ grep -i -e 'Started index ' -e stale -e 'need to be updated' -e 'gather index ' -e query /tmp/run    | grep -v Ending
DBMS_STATS: gather stats on partition SYS_P4061815: stats are stale;
DBMS_STATS: Starting query at 15-JUL-15 01.40.21.747485000 PM +02:00
DBMS_STATS: Starting query at 15-JUL-15 01.40.21.790532000 PM +02:00
DBMS_STATS: Started index JWARTAK.SYS_C005852548 at 15-JUL-15 01.40.21.899589000 PM +02:00 granularity: GLOBAL AND PARTITION gIdxGran:
DBMS_STATS: Starting query at 15-JUL-15 01.40.21.902620000 PM +02:00
DBMS_STATS: Started index JWARTAK.STATS_TEST_IDX1 at 15-JUL-15 01.40.21.974521000 PM +02:00 granularity: GLOBAL AND PARTITION gIdxGran:
DBMS_STATS: gather index stats STATS_TEST_IDX1onSYS_P4061834:
DBMS_STATS: Starting query at 15-JUL-15 01.40.21.978291000 PM +02:00
DBMS_STATS: Starting query at 15-JUL-15 01.40.22.000544000 PM +02:00
somewhere:~$

So this translates to:

  • 2x sampled FULL “PARTITION” SCANs against single partition where new row has been added
  • 1x sampled FULL INDEX SCAN on PK index
  • 2x sampled FULL INDEX SCANs against changed LOCAL index partition where new row has been added
  • certainly 1 new row out of ~25000 isn’t 10% change on partition

b) with GRANULARITY=ALL:

somewhere:~$ grep -i -e 'Started index ' -e stale -e 'need to be updated' -e 'gather index ' -e query /tmp/run_gALL    | grep -v Ending
DBMS_STATS: gather stats on subpartition SYS_SUBP4061878:stats are stale
DBMS_STATS: Starting query at 15-JUL-15 01.46.57.166410000 PM +02:00
DBMS_STATS: Starting query at 15-JUL-15 01.46.57.180291000 PM +02:00
DBMS_STATS: gather stats on partition SYS_P4061886: stats are stale;
DBMS_STATS: Starting query at 15-JUL-15 01.46.57.356143000 PM +02:00
DBMS_STATS: Starting query at 15-JUL-15 01.46.57.405171000 PM +02:00
DBMS_STATS: Started index JWARTAK.SYS_C005852574 at 15-JUL-15 01.46.57.522214000 PM +02:00 granularity: ALL gIdxGran:
DBMS_STATS: Starting query at 15-JUL-15 01.46.57.524947000 PM +02:00
DBMS_STATS: Started index JWARTAK.STATS_TEST_IDX1 at 15-JUL-15 01.46.57.593190000 PM +02:00 granularity: ALL gIdxGran:
DBMS_STATS: gather index statsSTATS_TEST_IDX1 on SYS_SUBP4061878:stats are stale
DBMS_STATS: Starting query at 15-JUL-15 01.46.57.596544000 PM +02:00
DBMS_STATS: gather index stats STATS_TEST_IDX1onSYS_P4061905:stats are stale
DBMS_STATS: Starting query at 15-JUL-15 01.46.57.601476000 PM +02:00
DBMS_STATS: Starting query at 15-JUL-15 01.46.57.622425000 PM +02:00
somewhere:~$
  • 2x sampled FULL “SUBPARTITION” SCANs against single subpartition where new raw has been added
  • 2x sampled FULL “PARTITION” SCANs against single partition where new raw has been added
  • sampled FULL INDEX SCAN on PK index
  • 1x sampled FULL INDEX SCAN against changed LOCAL index subpartition where new raw has been added
  • 2x sampled FULL INDEX SCANs against changed LOCAL index partition where new raw has been added
  • certainly 1 new row out of ~25000 isn’t 10% change on partition or subpartition here

SUMMARY: So with 1 single row added/deleted or just update, in some random subpartition, we can cause on 11gR2 an avalanche of I/O reads due to DBMS_STATS scanning unnecessary segments. When you start to think about table sizes that are above 1-2TB combined with weak I/O subsystem and rather “big” indexes (e.g. 100-200GB), you can quickly run out of MAINTENANCE_WINDOW schedule due to this overhead (in extreme it could be several hundredths GB scanned via DBMS_STATS due to 1 byte changed ;) ).

Some other interesting results:

  1. as was demonstrated STALE_PERCENT does not work against (sub)partitions in 11gR2
  2. since 12gR1 you have an access to INCREMENTAL_STATS + INCREMENTAL_STALENESS (new feature) combo, where you can just set INCREMENTAL_STALENESS to USE_STALE_PERCENT and trigger stats collection for (sub)partitions only after certain threshold of changed rows is reached
  3. apparently and contrary to documentation on 11.2.0.4.x series, INCREMENTAL_STATS=TRUE works with GRANULARITY=ALL (at least in this test)
  4. STALE_STATS column from {user|all|dba}_{tab|ind}_statistics is useless for partitioned tables (flushing does not help anyway), DBMS_STATS GATHER LIST is much better for this purpose
  5. there is interesting granularity option APPROX_GLOBAL AND PARTITION but I’m not sure is it 10.2, 11.1 history artifact as i did nott have enough time to test it in details

Insert with IGNORE_ROW_ON_DUPKEY_INDEX on Index Organized Table might result in lost data

July 8th, 2015

1. We create table red_table with PK on both columns ID1, ID2 with 10000 rows
3. We create table delta_table with PK on both columns ID1, ID2 with only 500 rows like in red_table (first 500 rows)
3. We load additional 100 completely new/different rows into delta_table
4. So as of now we have 600 rows in delta_table and 10000 in red_table. The 100 rows are completely different (one can use A MINUS B to check that)
5. Try to load rows from delta_table into red_table using +IGNORE_ROW_ON_DUPKEY_INDEX hint that do NOT yet exists in red_table.
HEAP) 100 rows inserted, 2nd insert adds 0 rows (correct)
IOT) 60 rows inserted, WTF/BUG due to 40 not loaded rows!
6. “analyze table RED_TABLE validate structure cascade” returns no errors so we have no corruption (in both HEAP and IOT cases)
7. So as of now we have
HEAP) 10100 rows in red_table, and delta_table MINUS red_table returns 0 (no differences)
IOT) 10060 rows in red_table, and delta_table MINUS red_table returns 40 (missing data!)

--1
drop table red_table purge;
create table red_table (id1 number, id2 number, constraint RED_TABLE_PK primary key(id1, id2)) organization index;
insert into red_table (id1, id2) select object_id, object_id from all_objects where rownum <= 10000;
commit;

--2 & 3
drop table delta_table;
create table delta_table (id1, id2) as select * from red_table where rownum <= 500;
insert into delta_table (id1, id2) select 10000000+object_id, object_id from all_objects where rownum <= 100;
commit;

-- 4
select (select count(*) from delta_table) delta, (select count(*) from red_table) red from dual;
--
select count(*) from (
select * from delta_table
minus
select * from red_table
);

-- 5
insert /*+ IGNORE_ROW_ON_DUPKEY_INDEX(RED_TABLE, RED_TABLE_PK) */ into red_table (id1, id2) select id1, id2 from delta_table;
commit;

-- 6
analyze table RED_TABLE validate structure cascade;

-- 7
select (select count(*) from delta_table) delta, (se1lect count(*) from red_table) from dual;

select count(*) from (
select * from delta_table
minus
select * from red_table
);

Reproduced on versions:
- 11.2.0.3.13
- 11.2.0.4.2
- reproduced also with primary key being only ID1 (so this is NOT related to the composite PK on ID1+ID2 on IOT)

Bugs possibly linked to this issue:

  • Bug 17397545 ORA-600 [kdtigetrow-2] / buffer cache errors from MERGE statement with LOG ERRORS INTO
  • Bug 11865420 Insert as Select with LOG ERRORS INTO slower than expected / corruption using IGNORE_ROW_ON_DUPKEY_INDEX hint – superseded
  • Bug 14512508 : PROBLEM WITH HINT “IGNORE_ROW_ON_DUPKEY_INDEX”
  • Bug 17748658 : BUG 11865420 IS MARKED AS FIXED IN 11.2.0.4 BUT ISN’T REALLY FIXED
  • Bug 13033715 : WRONG RESULTS WITH HINT IGNORE_ROW_ON_DUPKEY_INDEX

Summary: do not use IGNORE_ROW_ON_DUPKEY_INDEX hint . Another interesting story is documented here http://guyharrison.squarespace.com/blog/2010/1/1/the-11gr2-ignore_row_on_dupkey_index-hint.html where it is clear (due to performance reasons) it is best avoid IGNORE_ROW_ON_DUPKEY_INDEX hint.

Impact of latency on non-XA and XA (2PC) performance

December 17th, 2014

This is post about quick & dirty experiment to better understand how network round-trip time (RTT) affects single thread performance of Oracle JDBC operations.

Experiment details:

  • Non-XA transaction description: single INSERT to simple table.
  • XA transaction description: two XA connections to the same DB, each doing simple INSERT to simple table, no latency on transaction/2PC object store has been simulated (in reality fsync() costs! – no real Transaction Manager has been used)
  • Transaction initiator: custom written Java7 OraXAJDBC program for purposes of this test running on small VM.
  • DB: Oracle DB 11.2.0.4.2, RAC, servie running on single node
  • Latency has been introduced via netem qdisc on Linux (both ingress and egress).

Non-XA JDBC results:

XA JDBC results:

Increasing number of connections – not a solution
Common sense workaround seems to be increase threads/number of connections in connection pool, however this approach in longer term seems to be not scalable. Each connection in Oracle 11.2 is served by separate process on the DB server, that takes some resources (mainly memory – PGA). The more connections coming to the DB to server the more often CPUs need to switch between processes – process known as context-switch. Context-switch usually destroys L1-L3 CPU cache efficiency and thus degrade performance (introduces additional latency) when many processes compete for CPU core.


image is from http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html

As evidenced here OLTP Performance – Concurrent Mid-Tier Connections (must see for any serious DBA/Java/J2EE developer/architect) 2000 connections to the DB has avg trx respone time ~100ms but when Connection Pool is reduced to 96 connections (and thus queuing is introduced in the Middleware itself), system as a whole achieves much greater throughput with response time of ~3ms, because of reducing concurrency/connection-switches. Some potential option (which has NOT been tested) is new feature in Oracle 12.1 named “JDBC Support for Database Resident Connection Pool” (DRCP) which apparently seems to be implemented as Process Pool on the DB server side. As any new feature it might introduce strict requirements (incomparability) and bugs. This feature requires Oracle Universal Connection Pool. Please note that other problems/issues with no# of connections in CP were not investigated in this
document (failover/HA time, etc).

impdp: ORA-04030: out of process memory when trying to allocate 4088 bytes (PLS CGA hp,pdzgM64_New_Link)

November 14th, 2014

Even on Oracle 11.2.0.4.x there seems to be problem with “ORA-04030: out of process memory when trying to allocate 4088 bytes (PLS CGA hp,pdzgM64_New_Link)”. I’ve just hit it during DataPump import. Alert log states something like this:

ORA-04030: out of process memory when trying to allocate 1052696 bytes (pga heap,log read buffer)
ORA-04030: out of process memory when trying to allocate 7438704 bytes (pga heap,kgh stack)
ORA-04030: out of process memory when trying to allocate 4088 bytes (PLS CGA hp,pdzgM64_New_Link)
Dumping diagnostic data in directory=[cdmp_20141106193106], requested by (instance=2, osid=4893 (DW00)), summary=[incident=240603]

Detailed incident trace shows:

mmap(offset=207282176, len=8192) failed with errno=12 for the file ora_dw00_DBSIDNAME3
[..]
*** 2014-11-06 19:48:31.300
OPIRIP: Uncaught error 27102. Error stack:
ORA-27102: out of memory
Linux-x86_64 Error: 12: Cannot allocate memory
Additional information: 108
Additional information: 4521984
ORA-00448: normal completion of background process
ORA-31671: Worker process DW00 had an unhandled exception.
ORA-04030: out of process memory when trying to allocate 16048 bytes (session heap,kuxLpxAlloc)
ORA-06512: at "SYS.KUPW$WORKER", line 1887

/usr/include/asm-generic/errno-base.h states that 12 is exactly ENOMEM error code. Investigation for mmap() syscall shows that ENOMEM stands for “No memory is available, or the process’s maximum number of mappings would have been exceeded.”
In Linux apparently there seems to be max_map_count limit defined as follows:

The max_map_count file allows for the restriction of the number of VMAs (Virtual Memory Areas) that a particular process can own. A Virtual Memory Area is a contiguous area of virtual address space. These areas are created during the life of the process when the program attempts to memory map a file, links to a shared memory segment, or allocates heap space. Tuning this value limits the amount of these VMAs that a process can own. Limiting the amount of VMAs a process can own can lead to problematic application behavior because the system will return out of memory errors when a process reaches its VMA limit but can free up lowmem for other kernel uses. If your system is running low on memory in the NORMAL zone, then lowering this value will help free up memory for kernel use.

This seems to be simple check in http://lxr.free-electrons.com/source/mm/mmap.c and seemed to be switched to dynamic sysctl value at times of 2.6.5 kernels (https://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.5-rc3/2.6.5-rc3-mm3/broken-out/max_map_count-sysctl.patch) in do_mmap_pgoff() function.

Simple workaround of increasing this value via sysctl works like a harm. The interesting fact is that oracle-validated RPM does not change this value, but it could.

11.2.0.4.x, 12.1: Bug Alert 19384287 for function based indexes together with NVL()

August 17th, 2014

19384287 : WRONG NUMBER OF ROWS RETURNED ON FIRST EXECUTION

New exploits for recent Oracle DB releases…

June 22nd, 2014

Exploit probably for CVE-2013-5858 (according to Jan 2014 CPU’s Oracle Database Server Risk Matrix ) has been revealed in blog posts Is your database secure? Are you sure? Are you *really* sure? and here Major Data Exploit Patched by January 2014′s CPU

http://www.security-explorations.com/en/SE-2014-01-press.html – internal Oracle RDMS JRE is vulnerable, CREATE SESSION privilege is enough (so with just account without even CREATE TABLE one can takeover DBA privs), not fixed yet (just sent to Oracle), no workaround given yet; I think it is just matter of time once reproduces this… :)

Part 5: Oracle SecureFile LOB is better than BasicFile, isn’t it?

May 9th, 2014

In this part I’ll try to discuss one of the biggest performance problem with BasicFile LOBs – enq HW contention (wait event that belongs to “configuration” class). Often it may be observed on systems with concurrent INSERTs into BasicFile LOBs with data bigger tan ~3960 bytes (or with LOBs having DISABLED STORAGE ON ROW), it probably is common name for things related to space allocation (allocating/formating the next extent for LOBSEGMENT itself OR looking for free space and reclaiming it). More research typically ends up in reading the following engineering notes in Metalink:

  • LOB HWM CONTENTION :Using AWR Reports to Identify the Problem; Confirm and Verify the Fix (Doc ID 837883.1)
  • Bug 9801919 “enq: HW – contention” against segments that add an extent frequently during high concurrency
  • ‘enq HW – contention’ For Busy LOB Segment (Doc ID 740075.1)
  • BUG:4867884 LOB HW LOCK CONTENTION WITH SPACE RECLAIMATION
  • Bug 6376915 HW enqueue contention for ASSM LOB segments
  • Bug 7319120 HW contention for BasicFile LOBs – special workaround

Especially in the last Bug 7319120, the workaround is stated as “The preferred option for such contention on basicfile LOBs is to migrate the LOBS to Securefiles in RDBMS 11g or later. Securefile LOBs have been designed for better concurrency“, but as I’ve presented in earlier posts, it is not beatiful as it sounds (huge problems shown in previous posts with wasting space with SecureFiles on early 11.2.0.2.x releases and supprisgnly the same on at least <= 11.2.0.4.2 as of writing). Most of the remaining bugs are already fixed starting 11.2.0.3.0, but as stated by Oracle SecureFiles are "the way forward".

To best of my knowledge typical solutions to "enq HW - contention" include usually one of the following:

  1. properly setting LOB option cache vs nocache (default) and tuning I/O – the enq HW for LOBs seems to affected by storage operations as it will be shown later
  2. proper (hash) partitioning scheme on the affected table/LOB to split contention
  3. altering LOB option chunksize (often increasing the default 8k to something like 16k/32k gives some minimal better results – but this is very workload specific)
  4. setting event 44951

Test env specification
Armed with the knowledge I’ve decided to really test that interesting event for my specific artificial workload. By this i mean version 11.2.0.3.8, BasicFile LOB, NOCACHE, pctversion 0, lobs sizes between 6000 … 10000, 7 concurrent INSERT and DELETE jobs preceeded by prefilling LOBs and then DELETing — to simulate in non-ideal clean conditions for table (ab)used as queue. DB was in NOARCHIVELOG and flashback was disabled. The only difference was storage being RAID1 of just 2xSAS 15k RPM drives (shared between data and redo) – instead of ramdisk (I’ve needed that I/O latency for enq HW to become visible in Top5),

HW contention root cause
The even more interesting fact is that INSERTs into LOB (with related LOB extent allocation) is masked by enq HW contention which (when you look via V$wait_chains) is actually waiting for “db file sequential read” :

CHAIN_SIGNATURE                                    WHO             WAIT_EVENT_TEXT              BLOCKED_BY              P1 P1_TEXT                 P2 P2_TEXT                 P3 P3_TEXT
-------------------------------------------------- --------------- ---------------------------- --------------- ---------- --------------- ---------- --------------- ---------- ---------------
'db file sequential read'<='enq: HW - contention'  129,571         enq: HW - contention         160,25          1213661190 name|mode                5 table space #     12583050 block
'db file sequential read'<='enq: HW - contention'  160,25          db file sequential read      ,                        3 file#                 4514 block#                   1 blocks
'db file sequential read'<='enq: HW - contention'  6,7             enq: HW - contention         160,25          1213661190 name|mode                5 table space #     12583050 block
'db file sequential read'<='enq: HW - contention'  33,9            enq: HW - contention         160,25          1213661190 name|mode                5 table space #     12583050 block
'db file sequential read'<='enq: HW - contention'  65,9            enq: HW - contention         160,25          1213661190 name|mode                5 table space #     12583050 block
'db file sequential read'<='enq: HW - contention'  190,13          enq: HW - contention         160,25          1213661190 name|mode                5 table space #     12583050 block
'db file sequential read'<='enq: HW - contention'  223,9           enq: HW - contention         160,25          1213661190 name|mode                5 table space #     12583050 block

In the above case 6 sessions were waiting with enq TX contetnion (looking for free space for INSERT) for 7th session (sid=160,serial#=25) actually searching for space/reclaiming the space to finish – and actually the 7th session was doing single block (8kB) physical DB I/O read. The Top5 session look similiar to this:

ELAPSED
---------------------------------------------------------------------------
+000000000 00:24:54.463960000 

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
enq: HW - contention                 62,717         918     15   32.2 Configurat
direct path write                    93,156         277      3    9.7 User I/O
db file sequential read              64,995         199      3    7.0 User I/O
DB CPU                                               91           3.2
buffer busy waits                    26,578           1      0     .0 Concurrenc

and most useful for us statistics (look at physical direct writes for lobs):

Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
[..]
lob reads                                    63,000          188.8           1.0
lob writes                                  315,000          943.7           5.0
lob writes unaligned                        314,990          943.7           5.0
physical read IO requests                    64,851          194.3           1.0
physical read bytes                     531,267,584    1,591,684.2       8,428.1
physical read total IO requests              65,656          196.7           1.0
physical read total bytes               544,384,000    1,630,981.2       8,636.2
physical read total multi block                   0            0.0           0.0
physical reads                               64,852          194.3           1.0
physical reads cache                            558            1.7           0.0
physical reads cache prefetch                     1            0.0           0.0
physical reads direct                        64,294          192.6           1.0
physical reads direct temporary                   0            0.0           0.0
physical write IO requests                  105,628          316.5           1.7
physical write bytes                  1,036,320,768    3,104,829.8      16,440.4
physical write total IO requests            170,303          510.2           2.7
physical write total bytes            1,339,469,824    4,013,068.1      21,249.6
physical write total multi block                 43            0.1           0.0
physical writes                             126,504          379.0           2.0
physical writes direct                      125,324          375.5           2.0
physical writes direct (lob)                125,316          375.5           2.0
physical writes from cache                    1,180            3.5           0.0
physical writes non checkpoint              126,504          379.0           2.0
[..]

Below are data after changing the LOB to CACHE, which had resulted in drastic reduction of enq HW contention (as the dependend I/O is faster because it is usually cached) and V$WAIT_CHAINS results in many for example wait chains like those(notice lack of “-> db file sequential read”):

  • ‘log file parallel write’<='log buffer space'<='buffer busy waits'
  • ‘log file parallel write’<='log buffer space'
  • ‘enq: HW – contention’<='enq: HW - contention'

while the Top5 results in different stuff (the main bottleneck moved to the REDO subsystem now):

ELAPSED
---------------------------------------------------------------------------
+000000000 00:06:29.795479000 

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
log buffer space                      1,542         116     75   15.6 Configurat
enq: HW - contention                 58,609         110      2   14.8 Configurat
buffer busy waits                   125,436          99      1   13.3 Concurrenc
DB CPU                                               91          12.3
log file switch (checkpoint in            6          10   1620    1.3 Configurat
Statistic                                     Total     per Second     per Trans
-------------------------------- ------------------ -------------- -------------
[..]
lob reads                                    63,000          347.3           1.0
lob writes                                  315,009        1,736.4           5.0
physical read IO requests                       629            3.5           0.0
physical read bytes                       5,619,712       30,976.6          89.1
physical read total IO requests               1,629            9.0           0.0
physical read total bytes                21,804,032      120,186.7         345.8
physical read total multi block                   0            0.0           0.0
physical reads                                  686            3.8           0.0
physical reads cache                            661            3.6           0.0
physical reads cache prefetch                    57            0.3           0.0
physical reads direct                            25            0.1           0.0
physical reads direct temporary                   0            0.0           0.0
physical write IO requests                   16,462           90.7           0.3
physical write bytes                    234,405,888    1,292,076.2       3,718.0
physical write total IO requests             19,116          105.4           0.3
physical write total bytes            1,399,786,496    7,715,808.2      22,202.6
physical write total multi block              2,341           12.9           0.0
physical writes                              28,614          157.7           0.5
physical writes direct                           25            0.1           0.0
physical writes direct (lob)                      0            0.0           0.0
physical writes from cache                   28,589          157.6           0.5
physical writes non checkpoint               16,153           89.0           0.3
[..]

So in this case changing to CACHE results in reducing the work from 7 jobs from almost 25mins to 6 mins. Also you might notice a rather huge decrease of direct physical random writes which is almost always excellent results for most disk based systems (those blocks would be certianly flushed out by DBWR using much more optimized/sequential writes).

The mysterious event 44591
The mentioned event is described as “causes ASSM LOB space to batch up reclaim instead of just reclaiming the requested/required number of LOB chunks. To enable this fix, set event 44951 to the maximum number of chunks that you would like to have reclaimed per attempt. The maximum allowed is 1024. Anything larger becomes 1024. However, if the requested amount of space is larger than the event’s value, the reclaim will be for the requested amount of space.
It is interesting that several years ago people complained when migrated LOBs from MSSM to ASSM and voila and in different source we find “With Manual Segment Space Management, this value cannot be altered and is fixed at 128.“. Somewhere else it is written: “The fix for this problem introduces a *mandatory event* which must be set in order for the ASSM architectural change leading to more efficient allocation of LOB chunks during high load and transaction contention when Extent allocation is needed. The event is *not* set even after you apply the patch.

So here are the results with NOCACHE LOB:

event44951 not set:

ELAPSED                                                                                                                                              ---------------------------------------------------------------------------
+000000000 00:21:57.406259000

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
enq: HW - contention                 62,591         787     13   39.3 Configurat
direct path write                    90,263         252      3   12.6 User I/O
db file sequential read              64,616         172      3    8.6 User I/O
DB CPU                                              111           5.6
buffer busy waits                    24,388           1      0     .0 Concurrenc

event44951 setTo 3:

ELAPSED
---------------------------------------------------------------------------
+000000000 00:18:33.058539000  

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
enq: HW - contention                 41,254         585     14   47.3 Configurat
direct path write                    86,039         285      3   23.0 User I/O
db file sequential read              43,801         144      3   11.6 User I/O
DB CPU                                               88           7.1
buffer busy waits                    39,976           1      0     .1 Concurrenc

event44951 set to 16:

ELAPSED
---------------------------------------------------------------------------
+000000000 00:22:31.832235000   

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
direct path write                   107,809         911      8   34.9 User I/O
db file sequential read              12,804         185     14    7.1 User I/O
enq: HW - contention                  3,376         177     52    6.8 Configurat
DB CPU                                               81           3.1
log file switch (private stran            4           1    323     .0 Configurat

event44951 set to 32:

ELAPSED
---------------------------------------------------------------------------
+000000000 00:49:27.268629000   

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
direct path write                   111,440       1,435     13   24.5 User I/O
enq: HW - contention                  2,597       1,027    395   17.5 Configurat
db file sequential read              20,339         429     21    7.3 User I/O
DB CPU                                               79           1.4
log file switch (private stran            5           2    474     .0 Configurat

event44951 set to 512:

ELAPSED
---------------------------------------------------------------------------
+000000000 00:38:45.176198000   

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
direct path write                   116,908       1,833     16   40.2 User I/O
db file sequential read              17,262         354     20    7.8 User I/O
DB CPU                                               78           1.7
enq: HW - contention                     23          64   2802    1.4 Configurat
log file switch (private stran            6           3    473     .1 Configurat

So apparently the most interesting result (for LOB reclamation) is in my case the one with event set to 3 (chunks). I could test more, but time for this research was the most limiting constraint. When tablespace and LOB are created by default this mean 8kB, so in this case this would mean 8kB*3=24kB of I/O per reclamation. Apparently the bigger the event level/property the more I/O work for the potential reclamation is being done by the session requiring more space, all of this is hidden under enq HW contention for LOB segment, but for *different sessions*. The results in CACHE LOB scenario with event set to properly tuned level would probably result in even more optimal processing time.

SUMMARY
My conclusion is that enq HW contention being in single “Configuration event class” visible in Grid/Cloud Control, is a little bit unfortunate, because it might be also related to the I/O storage performance and this can be even affected by current db cache contents (which would or NOT contain the affected LOBSEGMENT blocks required for reclamation if possible). The LRU cache might be of course function of time/workload, also the physical I/O from the DB itself might be served by SAN cache.

Other interesting sources/posts/discussions:

Part 4: Oracle SecureFile LOB is better than BasicFile, isn’t it?

May 5th, 2014

In part 3 we have shown that SecureFiles were having much lower space utilization (they reused space) on 11.2.0.3.8. I’ve wanted to try the same workload on 11.2.0.4.2 (also on ramdisks, same ammount of concurrent jobs performing INSERTs and DELETEs, same ammount of data/variables for loops), the results are more or less VERY interesting:

Click to see in better resolution

Clearly there is some regression when it comes down to MB used afterwards (at the end of test “SELECT count(1) FROM t_blob” returns just 21 rows! so you really shouldn’t have ~700MB lost). The collected AWR snapshots also reveal lack of high “buffer busy waits” on 11.2.0.4.2 for securefiles when they were present on 11.2.0.3.8. AWR 11.2.0.3.8 is below:

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
DB CPU                                              412          41.8
buffer busy waits                   415,575         299      1   30.4 Concurrenc
latch: cache buffers chains          39,045           2      0     .2 Concurrenc
latch free                            3,906           1      0     .1 Other
latch: enqueue hash chains            5,658           0      0     .0 Other

While the new/broken 11.2.0.4.2 stats are here:

Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                            Tota    Wait   % DB
Event                                 Waits Time Avg(ms)   time Wait Class
------------------------------ ------------ ---- ------- ------ ----------
DB CPU                                        59           85.9
enq: TX - contention                    198  2.3      11    3.3 Other
buffer busy waits                    22,290  1.4       0    2.0 Concurrenc
local write wait                        292    1       3    1.4 User I/O
latch: object queue header ope          734   .2       0     .3 Other
latch: cache buffers chains           4,161   .1       0     .2 Concurrenc

As been metnioned before I/O has been elimiated by using ram disk (with filesystem_options=NONE due to the lack of proper O_DIRECT support on tmpfs). Much lower CPU utilization (412s to 59s) is very interesting, but as you can see this comes at price. Even more interesting is the old BASIC FILE AWR report on 11.2.0.4.2 :

Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                            Tota    Wait   % DB
Event                                 Waits Time Avg(ms)   time Wait Class
------------------------------ ------------ ---- ------- ------ ----------
DB CPU                                      34.8           94.4
log file switch (checkpoint in            6  1.7     275    4.5 Configurat
buffer busy waits                    18,247   .4       0    1.1 Concurrenc
latch: cache buffers chains           4,617   .1       0     .2 Concurrenc
log file switch completion               16   .1       4     .2 Configurat
enq: TX - index contention              246    0       0     .1 Concurrenc

as it shows that BASIC FILEs are faster on CPUs for this particular workload when you completley eliminate I/O ;)

Part 3: Oracle SecureFile LOB is better than BasicFile, isn’t it?

April 24th, 2014

OK, so far in our previous parts/tests regarding Oracle LOBs we had:

  • part 1 – showing how SecureFiles are fairly superior on DB version 11.2.0.3.5 against BasicFiles in terms of space consumed and elapsed processing times when processing rather very small rows (most of them inline)
  • part 2 – showing how buggy SecureFiles were as a new feature, and actually how much space they actually wasted (on some legacy version 11.2.0.2.3)

Part 3 is going to focus on showing which is better for storing rows/LOBs that are too big to fit into row/table segment (~3900 bytes limit) and must be technically stored in separate LOB segment. The enviorniment is very similiar to the one from Part 1, basically OLTP, 4 concurrent jobs doing INSERTs to single non-partitioned table in PL/SQL loop doing INSERT LOB + DELETE LOB with ID-16 (moving window for each job), DB version is on 11.2.0.3.8, LOB chunk size is 8k, tablespace block size is also 8k, LOB setting is to perform CACHING. The storage underneath the DB is on ram disk (tmpfs to be exact), every time tablespace is dropped and recreated, everything is automated, including AWR snapshot collections. The main difference is that now we are using a much larger LOB sizes (over 4k to be exact) *AND* we changed workload to now prefill the table and LOB with garbage data, then delete it without coallescing the LOB segment itself. The last statement as you will see is ciritical , and shows typical usage pattern where data is removed (e.g. queues).

The results are below, first time elapsed BasicFile vs SecureFile:

Second, space efficency:

As you can simply see – the overall performance benefit from BasicFiles – causes huge storage waste. Also workload profile in this particular test run is much different between SecureFiles and BasicFiles , the best image is the difference in AWR snapshots for the Top 5 events for workload generating between 6k and 12k LOBs:

Click to see more details

Later analysis reveals that majority of “buffer busy waits” are spent on “free list” managment itself. Armed with this knowledge (11.2.0.3.8, securefile LOBs, buffer busy waits, free lists) we can start looking for more information, and we find a nice horror stories:

Summary:
At least on 11.2.0.3.8 SecureFile LOBs are still not working as fast as BasicFile but with added benefit of reusing space (much better than on some old PSUs e.g. 11.2.0.2.3). One can hit an extreme “buffer busy waits” on SecureFile which might require rebuilding the table with hidden parameter. It is written that this is fixed in 11.2.0.4.x releases. In Part 4 or Part 5 I’m going to see how 11.2.0.4.x behaves under the identical workload.

-J.

Bug/ORA-00060 deadlock during concurrent Drop Table (purge) and INSERT

March 24th, 2014

I’ve just found a rather new and not known bug during analysis for one of hangs in production system. This research ended up in full reproducible test case against the following Oracle versions on x86_64:

  • 11.2.0.3.8
  • 11.2.0.4.0
  • 12.1.0.1.0

First, let’s prepare the experimental schema to trigger this bug, we create TRANS table and the CHILD one, the target objective is to drop CHILD table from the system to release space:

drop table trans CASCADE CONSTRAINTS purge;
drop table child CASCADE CONSTRAINTS purge;
drop sequence trans_seq;
drop sequence child_seq;

CREATE TABLE TRANS (
  ID NUMBER(10,0) NOT NULL PRIMARY KEY,
  CDATE DATE NOT NULL,
  TXT VARCHAR2(64)
) PARTITION BY RANGE (cdate) INTERVAL (NUMTOYMINTERVAL(1,'MONTH'))
(
   PARTITION p1 values LESS THAN (TO_DATE('01-NOV-2010','DD-MON-YYYY'))
);

CREATE TABLE CHILD (
  CHILD_ID NUMBER(10,0) NOT NULL PRIMARY KEY,
  TRANS_ID NUMBER(10,0) NOT NULL ENABLE,
  TXT VARCHAR2(64),
  CONSTRAINT "TRANS_ID_FK" FOREIGN KEY (TRANS_ID) REFERENCES TRANS (ID) ENABLE
);

create sequence trans_seq start with 1 increment by 1 cache 1000;
create sequence child_seq start with 269282 increment by 1 cache 1000;

create or replace procedure generate_load as
  i number;
begin
  for i in 1 .. 50000 loop
    insert into trans (id, cdate) values (trans_seq.nextval, to_date(trunc(sysdate-i)));
    commit write wait;
  end loop;
end;
/

exec generate_load;

-- load child with some vaules from parent (trans)
insert into child (child_id, trans_id) select child_seq.nextval, id from trans;
commit;

The trick is that CHILD has referential constraint to the TRANS table. Additional trick is that there are many (3-4) concurrent sessions doing concurrent simple INSERTs to that table followed immediately by COMMIT WRITE WAIT in PL/SQL loop. OK so let’s start our generator:

[oracle@t10050 fk_drop_table_locks]$ ./gen.sh
Mon Mar 24 14:44:50 CET 2014
Launching 3 workers with load.sql
...waiting to finish

When we attempt now to drop our CHILD table using this (what has been really performed by production DBA some time ago in our case):

alter session set ddl_lock_timeout=15;
drop table child purge;

In the experiment we have end up in INSERT sessions throwing famous ORA-00060 errors. You might wonder

BEGIN generate_load; END;
BEGIN generate_load; END;

*
*
ERROR at line 1:
ERROR at line 1:
ORA-00060: deadlock detected while waiting for resource
ORA-00060: deadlock detected while waiting for resource
ORA-06512: at "DBA1.GENERATE_LOAD", line 5
ORA-06512: at "DBA1.GENERATE_LOAD", line 5
ORA-06512: at line 1
ORA-06512: at line 1

Elapsed: 00:00:15.06
Elapsed: 00:00:15.06
BEGIN generate_load; END;

*
ERROR at line 1:
ORA-00060: deadlock detected while waiting for resource
ORA-06512: at "DBA1.GENERATE_LOAD", line 5
ORA-06512: at line 1

Elapsed: 00:00:15.10

Of course alert log will be full of deadlock errors and the sample single session deadlock graph looks similiar:

*** 2014-03-24 14:25:23.056
DEADLOCK DETECTED ( ORA-00060 )

[Transaction Deadlock]

The following deadlock is not an ORACLE error. It is a
deadlock due to user error in the design of an application
or from issuing incorrect ad-hoc SQL. The following
information may aid in determining the deadlock:

Deadlock graph:
                       ---------Blocker(s)--------  ---------Waiter(s)---------
Resource Name          process session holds waits  process session holds waits
TM-0000a079-00000000        20     138    SX             24       6           S
TM-0000a07c-00000000        24       6     X             20     138          SX

session 138: DID 0001-0014-00000BF2     session 6: DID 0001-0018-00000055
session 6: DID 0001-0018-00000055       session 138: DID 0001-0014-00000BF2 

Rows waited on:
  Session 138: obj - rowid = 0000A07C - AAAAAAAAAAAAAAAAAA
  (dictionary objn - 41084, file - 0, block - 0, slot - 0)
  Session 6: obj - rowid = 0000A079 - AAAAAAAAAAAAAAAAAA
  (dictionary objn - 41081, file - 0, block - 0, slot - 0)

From this we can say that two sessions {138,6} endup in rivilaization for single two TM enqueues but in incompatibile way for two tables. Session 138 was holding SX lock on TM enqueue for table 0000A079 (TRANS) – i.e. was protecting the table in shared exclusvie mode, because wanted to insert – and in parallel to that, session 6 was holding X lock on TM enqueue for table 0000a07c (CHILD) – i.e. was holding exclusive access to whole table , so it was that tried to DROP it. In the next processing cycles:

  • session 6 (the one with DROP TABLE PURGE) tried to accquire shared lock S on table TRANS due tue to FK constraint, and started waiting for it (S and SX are not compatibile according to MetaLink note Id Detecting and Resolving Locking Conflicts and Ora-00060 errors (Doc ID 15476.1))
  • session 138 (the one doing INSERT at the time) tried to accquire SX on the enqueue TM for table CHILD, but couldnt get it as CHILD has been eXclusivley locked by session 6
  • in the several seconds automatic DeadLock detection kicks in and kills sessions performing INSERTs causing downtime/error for application

Upon further investigation together with my DBA coworker Andrzej Nowicki and short trials and errors today, we’ve came to the conclusion that the main culprint of this error is the PURGE clause of the DROP TABLE statement. This has several interesting repercusions:

  • “DROP TABLE X PURGE” vs “DROP TABLE X” are technically two different things and the former actually does NOT results in bug.
  • Oracle parameter responsible for recyclebin affects whether “DROP TABLE X” results really in “DROP TABLE X PURGE”
  • This case is very similiar to the Bug 11693365: Concurrent Drop table and Select on Reference constraint table hangs (deadlock), however that one is stated to be fixed 12.1.0.1 (Base Release) and 11.2.0.3 (Server Patch Set), however the mentioned here bug is NOT.
  • The workarounds for bug 11693365 seem to also help here (“Disable and drop all the referential integrity constraints before dropping the child table.
    eg: alter table disable constraint ; / alter table drop constraint ; /drop table cascade constraints purge;”)

ORA-14765: when CREATE INDEX ONLINE is not really online…

March 14th, 2014

Several days ago we’ve hit in our production application an the interesting new (to me at least) Oracle error “ORA-14765: Cannot create a partition while doing a create index”. It wouldn’t be strange if it wouldn’t happen during an CREATE INDEX ONLINE operation, so therefore it indicates that ONLINE index creation is not really an ONLINE operation (sic!).

Note1: you cannot test it in SYS schema, as the defered segment creation feature on partitioned table seems to be not working for SYS (or SYSTEM tablespace) as it is dedicated for Oracle dictionary purposes only. Seems to me that this is similiar story to the enhaced-checksums always enabled for SYSTEM tablespace (hidden paramter _db_always_check_system_ts which stands for “Always perform block check and checksum for System tablespace”) etc.

Note2: the key in order to get the error is to have parameter deferred_segment_creation set to TRUE, which is pretty recommended if you want to avoid space waste, starting with 11.2g, as it allocates at minimum 8MB initial extent per (sub)partition segment.

Note3: tested only on 11.2.0.3.6 and 11.2.0.3.8.

After a little playing here are the steps to reproduce it:

DBA1@test3> create table t1 (
    id number primary key,
    ts	TIMESTAMP,
    txt varchar2(64)
) partition by range (ts)  interval (NUMTOYMINTERVAL(1, 'MONTH'))
(
  partition p1 values less than (TO_DATE('01.01.2000', 'DD.MM.YYYY')),
  partition p2000_02 values less than (TO_DATE('01.02.2000', 'DD.MM.YYYY')),
  partition p2000_03 values less than (TO_DATE('01.03.2000', 'DD.MM.YYYY')),
  partition p2000_04 values less than (TO_DATE('01.04.2000', 'DD.MM.YYYY')),
  partition p2000_05 values less than (TO_DATE('01.05.2000', 'DD.MM.YYYY')),
  partition p2000_06 values less than (TO_DATE('01.06.2000', 'DD.MM.YYYY'))
); 

Table created.

DBA1@test3> select table_name, segment_created from dba_tables where table_name='T1';

TABLE_NAME                     SEG
------------------------------ ---
T1                             N/A

DBA1@test3> select partition_name, segment_created from dba_tab_partitions where table_name='T1';

PARTITION_NAME                 SEGM
------------------------------ ----
P1                             NO
P2000_02                       NO
P2000_03                       NO
P2000_04                       NO
P2000_05                       NO
P2000_06                       NO

6 rows selected.

DBA1@test3>

Now add a very large single current partition that will take some time :

DBA1@test3> insert  /*+ APPEND NOLOGGING */ into t1 (id, ts, txt) SELECT level, sysdate, 'blah' FROM dual CONNECT BY LEVEL <= 2500000;

2500000 rows created.

DBA1@test3> commit;

Commit complete.

DBA1@test3>

Actually so far we have partition with segment_size approximatly of ~73MB in case of 8kB db blocks.

DBA1@test3> select p.partition_name, p.segment_created, s.bytes/1024 kb from dba_tab_partitions p left join dba_segments s on (p.partition_name=s.partition_name and p.table_name=s.segment_name) where p.table_name='T1' order by 2,1;

PARTITION_NAME                 SEGM         KB
------------------------------ ---- ----------
P1                             NO
P2000_02                       NO
P2000_03                       NO
P2000_04                       NO
P2000_05                       NO
P2000_06                       NO
SYS_P17066                     YES       73728

7 rows selected.

Now we start building online local index in our 1st session using “create index t1_i1 on t1 (ts, txt) local online”, immedietly followed by an attempt to insert single row into one of P2000_xx partition (which are without inititalized backing them segment at this point):

DBA1@test3> insert into t1 (id, ts, txt) values (3500000, TO_DATE('02.04.2000', 'DD.MM.YYYY'), 'testtesttest');
insert into t1 (id, ts, txt) values (3500000, TO_DATE('02.04.2000', 'DD.MM.YYYY'), 'testtesttest')
            *
ERROR at line 1:
ORA-14765: Cannot create a partition while doing a create index

DBA1@test3>

Summary: CREATE INDEX ONLINE is not fully compatibile with sessions performing DML transaction against (sub)partitions that doesn’t come with pre-allocated segments. There seems to be very likley possibility to run into ORA-14765 for Oracle transaction systems especially creating numerous automatically managed subpartitions based on some template (like Interval Range partitions that are subpartitioned by ID). Of course you may want to avoid defered segment creation (and sacrifise a lot of storage space) or live with the risks of hitting those errors during building indexes (hopefully application might just retry at later stage).