I’ve just found a rather new and not known bug during analysis for one of hangs in production system. This research ended up in full reproducible test case against the following Oracle versions on x86_64:
- 11.2.0.3.8
- 11.2.0.4.0
- 12.1.0.1.0
First, let’s prepare the experimental schema to trigger this bug, we create TRANS table and the CHILD one, the target objective is to drop CHILD table from the system to release space:
drop table trans CASCADE CONSTRAINTS purge; drop table child CASCADE CONSTRAINTS purge; drop sequence trans_seq; drop sequence child_seq; CREATE TABLE TRANS ( ID NUMBER(10,0) NOT NULL PRIMARY KEY, CDATE DATE NOT NULL, TXT VARCHAR2(64) ) PARTITION BY RANGE (cdate) INTERVAL (NUMTOYMINTERVAL(1,'MONTH')) ( PARTITION p1 values LESS THAN (TO_DATE('01-NOV-2010','DD-MON-YYYY')) ); CREATE TABLE CHILD ( CHILD_ID NUMBER(10,0) NOT NULL PRIMARY KEY, TRANS_ID NUMBER(10,0) NOT NULL ENABLE, TXT VARCHAR2(64), CONSTRAINT "TRANS_ID_FK" FOREIGN KEY (TRANS_ID) REFERENCES TRANS (ID) ENABLE ); create sequence trans_seq start with 1 increment by 1 cache 1000; create sequence child_seq start with 269282 increment by 1 cache 1000; create or replace procedure generate_load as i number; begin for i in 1 .. 50000 loop insert into trans (id, cdate) values (trans_seq.nextval, to_date(trunc(sysdate-i))); commit write wait; end loop; end; / exec generate_load; -- load child with some vaules from parent (trans) insert into child (child_id, trans_id) select child_seq.nextval, id from trans; commit;
The trick is that CHILD has referential constraint to the TRANS table. Additional trick is that there are many (3-4) concurrent sessions doing concurrent simple INSERTs to that table followed immediately by COMMIT WRITE WAIT in PL/SQL loop. OK so let’s start our generator:
[oracle@t10050 fk_drop_table_locks]$ ./gen.sh Mon Mar 24 14:44:50 CET 2014 Launching 3 workers with load.sql ...waiting to finish
When we attempt now to drop our CHILD table using this (what has been really performed by production DBA some time ago in our case):
alter session set ddl_lock_timeout=15; drop table child purge;
In the experiment we have end up in INSERT sessions throwing famous ORA-00060 errors. You might wonder
BEGIN generate_load; END; BEGIN generate_load; END; * * ERROR at line 1: ERROR at line 1: ORA-00060: deadlock detected while waiting for resource ORA-00060: deadlock detected while waiting for resource ORA-06512: at "DBA1.GENERATE_LOAD", line 5 ORA-06512: at "DBA1.GENERATE_LOAD", line 5 ORA-06512: at line 1 ORA-06512: at line 1 Elapsed: 00:00:15.06 Elapsed: 00:00:15.06 BEGIN generate_load; END; * ERROR at line 1: ORA-00060: deadlock detected while waiting for resource ORA-06512: at "DBA1.GENERATE_LOAD", line 5 ORA-06512: at line 1 Elapsed: 00:00:15.10
Of course alert log will be full of deadlock errors and the sample single session deadlock graph looks similiar:
*** 2014-03-24 14:25:23.056 DEADLOCK DETECTED ( ORA-00060 ) [Transaction Deadlock] The following deadlock is not an ORACLE error. It is a deadlock due to user error in the design of an application or from issuing incorrect ad-hoc SQL. The following information may aid in determining the deadlock: Deadlock graph: ---------Blocker(s)-------- ---------Waiter(s)--------- Resource Name process session holds waits process session holds waits TM-0000a079-00000000 20 138 SX 24 6 S TM-0000a07c-00000000 24 6 X 20 138 SX session 138: DID 0001-0014-00000BF2 session 6: DID 0001-0018-00000055 session 6: DID 0001-0018-00000055 session 138: DID 0001-0014-00000BF2 Rows waited on: Session 138: obj - rowid = 0000A07C - AAAAAAAAAAAAAAAAAA (dictionary objn - 41084, file - 0, block - 0, slot - 0) Session 6: obj - rowid = 0000A079 - AAAAAAAAAAAAAAAAAA (dictionary objn - 41081, file - 0, block - 0, slot - 0)
From this we can say that two sessions {138,6} endup in rivilaization for single two TM enqueues but in incompatibile way for two tables. Session 138 was holding SX lock on TM enqueue for table 0000A079 (TRANS) – i.e. was protecting the table in shared exclusvie mode, because wanted to insert – and in parallel to that, session 6 was holding X lock on TM enqueue for table 0000a07c (CHILD) – i.e. was holding exclusive access to whole table , so it was that tried to DROP it. In the next processing cycles:
- session 6 (the one with DROP TABLE PURGE) tried to accquire shared lock S on table TRANS due tue to FK constraint, and started waiting for it (S and SX are not compatibile according to MetaLink note Id Detecting and Resolving Locking Conflicts and Ora-00060 errors (Doc ID 15476.1))
- session 138 (the one doing INSERT at the time) tried to accquire SX on the enqueue TM for table CHILD, but couldnt get it as CHILD has been eXclusivley locked by session 6
- in the several seconds automatic DeadLock detection kicks in and kills sessions performing INSERTs causing downtime/error for application
Upon further investigation together with my DBA coworker Andrzej Nowicki and short trials and errors today, we’ve came to the conclusion that the main culprint of this error is the PURGE clause of the DROP TABLE statement. This has several interesting repercusions:
- “DROP TABLE X PURGE” vs “DROP TABLE X” are technically two different things and the former actually does NOT results in bug.
- Oracle parameter responsible for recyclebin affects whether “DROP TABLE X” results really in “DROP TABLE X PURGE”
- This case is very similiar to the Bug 11693365: Concurrent Drop table and Select on Reference constraint table hangs (deadlock), however that one is stated to be fixed 12.1.0.1 (Base Release) and 11.2.0.3 (Server Patch Set), however the mentioned here bug is NOT.
- The workarounds for bug 11693365 seem to also help here (“Disable and drop all the referential integrity constraints before dropping the child table.
eg: alter tabledisable constraint ; / alter table drop constraint ; /drop table cascade constraints purge;”)