Archive for March, 2014

Bug/ORA-00060 deadlock during concurrent Drop Table (purge) and INSERT

Monday, March 24th, 2014

I’ve just found a rather new and not known bug during analysis for one of hangs in production system. This research ended up in full reproducible test case against the following Oracle versions on x86_64:

  • 11.2.0.3.8
  • 11.2.0.4.0
  • 12.1.0.1.0

First, let’s prepare the experimental schema to trigger this bug, we create TRANS table and the CHILD one, the target objective is to drop CHILD table from the system to release space:

drop table trans CASCADE CONSTRAINTS purge;
drop table child CASCADE CONSTRAINTS purge;
drop sequence trans_seq;
drop sequence child_seq;

CREATE TABLE TRANS (
  ID NUMBER(10,0) NOT NULL PRIMARY KEY,
  CDATE DATE NOT NULL,
  TXT VARCHAR2(64)
) PARTITION BY RANGE (cdate) INTERVAL (NUMTOYMINTERVAL(1,'MONTH'))
(
   PARTITION p1 values LESS THAN (TO_DATE('01-NOV-2010','DD-MON-YYYY'))
);

CREATE TABLE CHILD (
  CHILD_ID NUMBER(10,0) NOT NULL PRIMARY KEY,
  TRANS_ID NUMBER(10,0) NOT NULL ENABLE,
  TXT VARCHAR2(64),
  CONSTRAINT "TRANS_ID_FK" FOREIGN KEY (TRANS_ID) REFERENCES TRANS (ID) ENABLE
);

create sequence trans_seq start with 1 increment by 1 cache 1000;
create sequence child_seq start with 269282 increment by 1 cache 1000;

create or replace procedure generate_load as
  i number;
begin
  for i in 1 .. 50000 loop
    insert into trans (id, cdate) values (trans_seq.nextval, to_date(trunc(sysdate-i)));
    commit write wait;
  end loop;
end;
/

exec generate_load;

-- load child with some vaules from parent (trans)
insert into child (child_id, trans_id) select child_seq.nextval, id from trans;
commit;

The trick is that CHILD has referential constraint to the TRANS table. Additional trick is that there are many (3-4) concurrent sessions doing concurrent simple INSERTs to that table followed immediately by COMMIT WRITE WAIT in PL/SQL loop. OK so let’s start our generator:

[oracle@t10050 fk_drop_table_locks]$ ./gen.sh
Mon Mar 24 14:44:50 CET 2014
Launching 3 workers with load.sql
...waiting to finish

When we attempt now to drop our CHILD table using this (what has been really performed by production DBA some time ago in our case):

alter session set ddl_lock_timeout=15;
drop table child purge;

In the experiment we have end up in INSERT sessions throwing famous ORA-00060 errors. You might wonder

BEGIN generate_load; END;
BEGIN generate_load; END;

*
*
ERROR at line 1:
ERROR at line 1:
ORA-00060: deadlock detected while waiting for resource
ORA-00060: deadlock detected while waiting for resource
ORA-06512: at "DBA1.GENERATE_LOAD", line 5
ORA-06512: at "DBA1.GENERATE_LOAD", line 5
ORA-06512: at line 1
ORA-06512: at line 1

Elapsed: 00:00:15.06
Elapsed: 00:00:15.06
BEGIN generate_load; END;

*
ERROR at line 1:
ORA-00060: deadlock detected while waiting for resource
ORA-06512: at "DBA1.GENERATE_LOAD", line 5
ORA-06512: at line 1

Elapsed: 00:00:15.10

Of course alert log will be full of deadlock errors and the sample single session deadlock graph looks similiar:

*** 2014-03-24 14:25:23.056
DEADLOCK DETECTED ( ORA-00060 )

[Transaction Deadlock]

The following deadlock is not an ORACLE error. It is a
deadlock due to user error in the design of an application
or from issuing incorrect ad-hoc SQL. The following
information may aid in determining the deadlock:

Deadlock graph:
                       ---------Blocker(s)--------  ---------Waiter(s)---------
Resource Name          process session holds waits  process session holds waits
TM-0000a079-00000000        20     138    SX             24       6           S
TM-0000a07c-00000000        24       6     X             20     138          SX

session 138: DID 0001-0014-00000BF2     session 6: DID 0001-0018-00000055
session 6: DID 0001-0018-00000055       session 138: DID 0001-0014-00000BF2 

Rows waited on:
  Session 138: obj - rowid = 0000A07C - AAAAAAAAAAAAAAAAAA
  (dictionary objn - 41084, file - 0, block - 0, slot - 0)
  Session 6: obj - rowid = 0000A079 - AAAAAAAAAAAAAAAAAA
  (dictionary objn - 41081, file - 0, block - 0, slot - 0)

From this we can say that two sessions {138,6} endup in rivilaization for single two TM enqueues but in incompatibile way for two tables. Session 138 was holding SX lock on TM enqueue for table 0000A079 (TRANS) – i.e. was protecting the table in shared exclusvie mode, because wanted to insert – and in parallel to that, session 6 was holding X lock on TM enqueue for table 0000a07c (CHILD) – i.e. was holding exclusive access to whole table , so it was that tried to DROP it. In the next processing cycles:

  • session 6 (the one with DROP TABLE PURGE) tried to accquire shared lock S on table TRANS due tue to FK constraint, and started waiting for it (S and SX are not compatibile according to MetaLink note Id Detecting and Resolving Locking Conflicts and Ora-00060 errors (Doc ID 15476.1))
  • session 138 (the one doing INSERT at the time) tried to accquire SX on the enqueue TM for table CHILD, but couldnt get it as CHILD has been eXclusivley locked by session 6
  • in the several seconds automatic DeadLock detection kicks in and kills sessions performing INSERTs causing downtime/error for application

Upon further investigation together with my DBA coworker Andrzej Nowicki and short trials and errors today, we’ve came to the conclusion that the main culprint of this error is the PURGE clause of the DROP TABLE statement. This has several interesting repercusions:

  • “DROP TABLE X PURGE” vs “DROP TABLE X” are technically two different things and the former actually does NOT results in bug.
  • Oracle parameter responsible for recyclebin affects whether “DROP TABLE X” results really in “DROP TABLE X PURGE”
  • This case is very similiar to the Bug 11693365: Concurrent Drop table and Select on Reference constraint table hangs (deadlock), however that one is stated to be fixed 12.1.0.1 (Base Release) and 11.2.0.3 (Server Patch Set), however the mentioned here bug is NOT.
  • The workarounds for bug 11693365 seem to also help here (“Disable and drop all the referential integrity constraints before dropping the child table.
    eg: alter table disable constraint ; / alter table drop constraint ; /drop table cascade constraints purge;”)

ORA-14765: when CREATE INDEX ONLINE is not really online…

Friday, March 14th, 2014

Several days ago we’ve hit in our production application an the interesting new (to me at least) Oracle error “ORA-14765: Cannot create a partition while doing a create index”. It wouldn’t be strange if it wouldn’t happen during an CREATE INDEX ONLINE operation, so therefore it indicates that ONLINE index creation is not really an ONLINE operation (sic!).

Note1: you cannot test it in SYS schema, as the defered segment creation feature on partitioned table seems to be not working for SYS (or SYSTEM tablespace) as it is dedicated for Oracle dictionary purposes only. Seems to me that this is similiar story to the enhaced-checksums always enabled for SYSTEM tablespace (hidden paramter _db_always_check_system_ts which stands for “Always perform block check and checksum for System tablespace”) etc.

Note2: the key in order to get the error is to have parameter deferred_segment_creation set to TRUE, which is pretty recommended if you want to avoid space waste, starting with 11.2g, as it allocates at minimum 8MB initial extent per (sub)partition segment.

Note3: tested only on 11.2.0.3.6 and 11.2.0.3.8.

After a little playing here are the steps to reproduce it:

DBA1@test3> create table t1 (
    id number primary key,
    ts	TIMESTAMP,
    txt varchar2(64)
) partition by range (ts)  interval (NUMTOYMINTERVAL(1, 'MONTH'))
(
  partition p1 values less than (TO_DATE('01.01.2000', 'DD.MM.YYYY')),
  partition p2000_02 values less than (TO_DATE('01.02.2000', 'DD.MM.YYYY')),
  partition p2000_03 values less than (TO_DATE('01.03.2000', 'DD.MM.YYYY')),
  partition p2000_04 values less than (TO_DATE('01.04.2000', 'DD.MM.YYYY')),
  partition p2000_05 values less than (TO_DATE('01.05.2000', 'DD.MM.YYYY')),
  partition p2000_06 values less than (TO_DATE('01.06.2000', 'DD.MM.YYYY'))
); 

Table created.

DBA1@test3> select table_name, segment_created from dba_tables where table_name='T1';

TABLE_NAME                     SEG
------------------------------ ---
T1                             N/A

DBA1@test3> select partition_name, segment_created from dba_tab_partitions where table_name='T1';

PARTITION_NAME                 SEGM
------------------------------ ----
P1                             NO
P2000_02                       NO
P2000_03                       NO
P2000_04                       NO
P2000_05                       NO
P2000_06                       NO

6 rows selected.

DBA1@test3>

Now add a very large single current partition that will take some time :

DBA1@test3> insert  /*+ APPEND NOLOGGING */ into t1 (id, ts, txt) SELECT level, sysdate, 'blah' FROM dual CONNECT BY LEVEL <= 2500000;

2500000 rows created.

DBA1@test3> commit;

Commit complete.

DBA1@test3>

Actually so far we have partition with segment_size approximatly of ~73MB in case of 8kB db blocks.

DBA1@test3> select p.partition_name, p.segment_created, s.bytes/1024 kb from dba_tab_partitions p left join dba_segments s on (p.partition_name=s.partition_name and p.table_name=s.segment_name) where p.table_name='T1' order by 2,1;

PARTITION_NAME                 SEGM         KB
------------------------------ ---- ----------
P1                             NO
P2000_02                       NO
P2000_03                       NO
P2000_04                       NO
P2000_05                       NO
P2000_06                       NO
SYS_P17066                     YES       73728

7 rows selected.

Now we start building online local index in our 1st session using “create index t1_i1 on t1 (ts, txt) local online”, immedietly followed by an attempt to insert single row into one of P2000_xx partition (which are without inititalized backing them segment at this point):

DBA1@test3> insert into t1 (id, ts, txt) values (3500000, TO_DATE('02.04.2000', 'DD.MM.YYYY'), 'testtesttest');
insert into t1 (id, ts, txt) values (3500000, TO_DATE('02.04.2000', 'DD.MM.YYYY'), 'testtesttest')
            *
ERROR at line 1:
ORA-14765: Cannot create a partition while doing a create index

DBA1@test3>

Summary: CREATE INDEX ONLINE is not fully compatibile with sessions performing DML transaction against (sub)partitions that doesn’t come with pre-allocated segments. There seems to be very likley possibility to run into ORA-14765 for Oracle transaction systems especially creating numerous automatically managed subpartitions based on some template (like Interval Range partitions that are subpartitioned by ID). Of course you may want to avoid defered segment creation (and sacrifise a lot of storage space) or live with the risks of hitting those errors during building indexes (hopefully application might just retry at later stage).

A new completley unsupported way of increasing ORAPWD entries

Friday, March 14th, 2014

FYI, i’ve just “invented” a completley new way of increasing capacity of orapwd files online that does not require knowing the SYS password (this is required once you hit ORA-1996, after trying to grant addiitonal SYSDBA). As you probably know there is hard limit of entries set when creating new orapw file and in addition it requires changing SYS password, apparently this seems to work (TM) but dont try this at home or any production system:

[oracle@t10050 dbs]$ cp orapwdtest4  orapwdtest4.old # make backup
[oracle@t10050 dbs]$ ls -l orapwdtes4 # save the filesize in bytes as SIZE1
[oracle@t10050 dbs]$ orapwd file=/tmp/remove entries=20
Enter password for SYS: anything
[oracle@t10050 dbs]$ ls -l /tmp/remove # file size will be approx 3584 bytes on Linux x86_64, if you compare it to the default orapwd , you will realize it is bigger by 2048 bytes
[oracle@t10050 dbs]$ dd if=/dev/zero of=/tmp/empty.2k bs=1k count=2
[oracle@t10050 dbs]$ cat /tmp/empty.2k >> orapwdtest4

Done! Brief tests shows that it works and you can add more working SYSDBA accounts, removal works ok, but again DO NOT TRY IT ON PRODUCTION as the internal orapwd structure is unknown to me ;)