Dear Readers,
Trust you are doing well.
Today I am going to write about an interesting issue I faced recently on a critical 3-node RAC database, where one of the DB instances crashed and, while it was rejoining the cluster, another instance crashed.
This situation created a lot of panic in the customer's environment, as two instances crashed one after another during peak hours.
Once all instances were back up and running, we started investigating from the available diagnostic data.
While we had complete evidence that instance 1 crashed due to an interconnect communication issue, the symptoms for the second instance failure were completely different.
Let's see exactly what the trace files showed before the second instance crashed.
DB alert log contents during the issue window:
2023-09-30T20:59:53.265661+05:30
Errors in file /ora19chome/app/orabase/diag/rdbms/testdb/testdbpr3/trace/testdbpr3_rms0_78084.trc (incident=3889014):
ORA-00240: control file enqueue held for more than 120 seconds
Incident details in: /ora19chome/app/orabase/diag/rdbms/testdb/testdbpr3/incident/incdir_3889014/testdbpr3_rms0_78084_i3889014.trc
2023-09-30T21:03:59.695719+05:30
Errors in file /ora19chome/app/orabase/diag/rdbms/testdb/testdbpr3/trace/testdbpr3_rms0_78084.trc (incident=3889015):
ORA-00240: control file enqueue held for more than 120 seconds
Incident details in: /ora19chome/app/orabase/diag/rdbms/testdb/testdbpr3/incident/incdir_3889015/testdbpr3_rms0_78084_i3889015.trc
2023-09-30T21:04:44.947609+05:30
LGWR (ospid: 78285) waits for event 'enq: CF - contention' for 80 secs.
2023-09-30T21:04:54.863484+05:30
Errors in file /ora19chome/app/orabase/diag/rdbms/testdb/testdbpr3/trace/testdbpr3_lmhb_78114.trc (incident=3889023):
ORA-29771: process O000 (OSID 70267) blocks LGWR (OSID 78285) for more than 70 seconds
Incident details in: /ora19chome/app/orabase/diag/rdbms/testdb/testdbpr3/incident/incdir_3889023/testdbpr3_lmhb_78114_i3889023.trc
2023-09-30T21:05:05.480553+05:30
O000 (ospid: 70267) is blocking LGWR (ospid: 78285) in a wait
LMHB (ospid: 78114) kills O000 (ospid: 70267).
Please check LMHB trace file for more detail.
2023-09-30T21:05:05.481290+05:30
Process termination requested for pid 70267 [source = rdbms], [info = 2] [request issued by pid: 78114, uid: 1003]
2023-09-30T21:05:05.961955+05:30
System state dump requested by (instance=0, osid=78084 (RMS0)), summary=[diagnostic dump request (kfncSlaveSub)].
System State dumped to trace file /ora19chome/app/orabase/diag/rdbms/testdb/testdbpr3/trace/testdbpr3_diag_77825_20230930210505.trc
2023-09-30T21:06:13.291480+05:30
LGWR (ospid: 78285) waits for event 'enq: CF - contention' for 165 secs.
2023-09-30T21:06:13.291674+05:30
LGWR (ospid: 78285) is hung in an acceptable location (cfio 0x11.00).
2023-09-30T21:06:19.099879+05:30
LGWR (ospid: 78285) waits for event 'enq: CF - contention' for 174 secs.
Errors in file /ora19chome/app/orabase/diag/rdbms/testdb/testdbpr3/trace/testdbpr3_lmhb_78114.trc (incident=3889024):
ORA-29770: global enqueue process LGWR (OSID 78285) is hung for more than 70 seconds
Incident details in: /ora19chome/app/orabase/diag/rdbms/testdb/testdbpr3/incident/incdir_3889024/testdbpr3_lmhb_78114_i3889024.trc
2023-09-30T21:06:26.242109+05:30
LOCK_DBGRP: GCR_SYSTEST debug event locked group GR+DB_testdb by memno 2
LMHB (ospid: 78114): terminating the instance due to ORA error 29770
Cause - 'ERROR: Some process(s) is not making progress.
LMHB (ospid: 78114) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior
ERROR: Some process(s'
2023-09-30T21:06:49.140207+05:30
Dumping diagnostic data in directory=[cdmp_20230930210627], requested by (instance=3, osid=78114 (LMHB)), summary=[abnormal instance termination].
2023-09-30T21:06:49.154944+05:30
2023-09-30T21:08:11.894248+05:30
Warning: 3 processes are still attached to shmid 41877560:
(size: 81920 bytes, creator pid: 73878, last attach/detach pid: 49980)
2023-09-30T21:08:12.823940+05:30
USER(prelim) (ospid: 46149): terminating the instance
2023-09-30T21:08:12.835477+05:30
Instance terminated by USER(prelim), pid = 46149
So from the alert log data we can see that LGWR was blocked by the RMS0 and ASM slave (O000) processes for a considerable time, and eventually LMHB terminated the instance.
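As a side note, if you catch such a hang while the instances are still up, the hang manager exposes the same blocking picture through V$WAIT_CHAINS. Below is a minimal sketch, run as SYSDBA from any surviving instance; the formatting commands and column selection are only illustrative, the view and its columns are the documented ones:

set linesize 200 pagesize 100
col chain_signature format a60
col wait_event_text format a35
select chain_id, instance, osid, pid,
       wait_event_text, in_wait_secs,
       blocker_is_valid, blocker_instance, blocker_osid,
       chain_signature
from   v$wait_chains
order  by chain_id, in_wait_secs desc;

In a situation like this one you would expect to see LGWR waiting on 'enq: CF - contention' with RMS0 as its blocker, and RMS0 in turn pointing at the O000 slave.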
LMHB trace file contents during the issue window:
===2023-09-30T21:04:39.319862+05:30
LGWR (ospid: 78285) has not moved for 79 sec (1696088078.1696087999)
kjgcr_ServiceGCR: KJGCR_METRICS: Local metric current grant 2-way, id 27 succeed
kjgcr_ServiceGCR: KJGCR_METRICS: Local metric current block 2-Way, id 37 succeed
===2023-09-30T21:04:42.968633+05:30
LGWR (ospid: 78285) has not moved for 83 sec (1696088082.1696087999)
: wait in cfio is not acceptable as it fails the global check.
kjgcr_ServiceGCR: KJGCR_METRICS: Local metric check lgwr heartbeat, id 19 failed
===2023-09-30T21:04:44.947354+05:30
LGWR (ospid: 78285) has not moved for 85 sec (1696088084.1696087999)
: wait in cfio is not acceptable as it fails the global check.
: heartbeat check status 6 (no-heartbeat) (threshold 70 sec)
=== LGWR (ospid: 78285) Heartbeat Report
LGWR (ospid: 78285) has no heartbeats for 85 sec. (threshold 70)
: heartbeat state 0x11.00 (cfio) pso-flag 0x0
: waiting for event 'enq: CF - contention' for 80 secs with wait_id 655800295.
===[ Wait Chain ]===
LGWR (ospid: 78285) waits for event 'enq: CF - contention'.
RMS0 (ospid: 78084) waits for event 'KSV master wait'.
O000 (ospid: 70267) waits for event 'ASM file metadata operation'.
===2023-09-30T21:06:26.583700+05:30
LMHB (ospid: 78114): terminating the instance due to ORA error 29770
Cause - 'ERROR: Some process(s) is not making progress.
LMHB (ospid: 78114) is terminating the instance.
Please check LMHB trace file for more details.
RMS0 trace file contents during the issue window:
*** 2023-09-30T21:05:31.130367+05:30
Process diagnostic dump for oracle@testdbpr3 (RMS0), OS id=56984, pid: 58, proc_ser: 6, sid: 22679, sess_ser: 61805
os thread scheduling delay history: (sampling every 1.000000 secs)
0.000000 secs at [ 21:05:30 ]
NOTE: scheduling delay has not been sampled for 0.477415 secs
0.000000 secs from [ 21:05:26 - 21:05:31 ], 5 sec avg
0.000000 secs from [ 21:04:31 - 21:05:31 ], 1 min avg
0.000000 secs from [ 21:00:31 - 21:05:31 ], 5 min avg
*** 2023-09-30T21:05:34.375615+05:30
loadavg : 13.12 32.21 30.83
System user time: 0.05 sys time: 0.05 context switch: 268765
Memory (Avail / Total) = 963768.18M / 1547589.52M
Swap (Avail / Total) = 102400.00M / 102400.00M
PL/SQL call stack:
Short stack dump:
ksedsts()+426<-ksdxfstk()+58<-ksdxcb()+872<-sspuser()+200<-__sighandler()<-read()+14<-sntpread()+28<-ntpfprd()+126<-nsbasic_brc()+399<-nioqrc()+438<-ttcdrv()+25744<-nioqwa()+61<-upirtrc()+1663<-upirtr()+165<-OCIPDispatch()+176<-kfnOpExecuteWithWaitInt()+1941<-kfnOpExecuteWithWait()+1729<-kfncSlaveFileRefresh()+799<-kfncPoolMain()+1444<-kfncExecute()+176<-kfncSlaveSubmitSlv2()+6370<-kfncSlaveSubmitSlv()+480<-kfncFileRefresh()+890<-kfioRefreshPriv()+281<-kfioIdentify()+2335<-ksfdafOpen()+425<-ksfdopn2()+17965<-ksfdopn1()+263<-ksfdopn()+260<-kcropn()+1151<-kcroio()+49<-kcrfro()+1595<-kcrabie()+7948<-ksb_act_run_int()+117<-ksb_act_run()+132<-ksbcti()+1862<-ksbabs()+2430<-ksbrdp()+1167<-opirip()+541<-opidrv()+581<-sou2o()+165<-opimai_real()+173<-ssthrdmain()+417<-main()+256<-__libc_start_main()+245
Kernel stack:
TOC00001-END]
[TOC00002]
========= Dump for incident 3889015 (ORA 240) ========
[TOC00003]
----- Beginning of Customized Incident Dump(s)
----- CONTROL FILE ENQUEUE HELD FOR TOO LONG
holding mode : S
enqueue holder : 'inst 3, osid 78084'
enqueue held time : 120 seconds
The current process 'inst 3, osid 78084' holds the control file enqueue
for more than 120 seconds.
Oracle Support Services triaging information: to find the root-cause, look
at the call stack of this process 'inst 3, osid 78084' below and ask
the developer that owns the first NON-service layer in the stack above
ksu_dispatch_tac to investigate. Common service layers are enqueues (ksq),
latches (ksl) library cache pins and locks (kgl), and row cache locks (kqr).
Internal debug info:
type : 0
acquired at : FILE kcrfr.c, LINE 1839
Dumping final blocker:
inst: 3, sid: 30502, ser: 8036 ====> ASM metadata process
There are 4658 sessions blocked by this session.
So from the available trace file and ASH data, I could see the following sequence of events before the second instance crashed (an ASH query sketch follows the list):
- RMS process reports ORA-00240: control file enqueue held for more than 120 seconds
- LGWR (ospid: 78285) waits for event 'enq: CF - contention' for 80 secs.
- ORA-29771: process O000 (OSID 70267) blocks LGWR (OSID 78285) for more than 70 seconds
- LMHB (ospid: 78114) kills O000 (ospid: 70267).
- LGWR (ospid: 78285) waits for event 'enq: CF - contention' for 165 secs.
- LGWR (ospid: 78285) waits for event 'enq: CF - contention' for 174 secs.
- ORA-29770: global enqueue process LGWR (OSID 78285) is hung for more than 70 seconds
- LMHB (ospid: 78114): terminating the instance due to ORA error 29770
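For completeness, this is roughly the kind of ASH query I use to reconstruct such a chain after the fact. It is only a sketch: the time window and event list below are taken from this incident, Diagnostics Pack licensing applies, and for a crashed instance you may have to fall back to DBA_HIST_ACTIVE_SESS_HISTORY, since its in-memory ASH buffer is lost with the crash:

select sample_time, inst_id, session_id, program, event,
       blocking_inst_id, blocking_session
from   gv$active_session_history
where  sample_time between timestamp '2023-09-30 21:00:00'
                       and timestamp '2023-09-30 21:07:00'
and    event in ('enq: CF - contention',
                 'KSV master wait',
                 'ASM file metadata operation')
order  by sample_time, inst_id, session_id;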
Conclusion:
LGWR on both instances 2 and 3 was waiting on 'enq: CF - contention' held by the RMS0 process, which in turn was blocked by O000 (an ASM slave process) on the event 'ASM file metadata operation'. Based on the ASM alert log, we can also see that the ASM slave process was continuously generating system state dumps.
Wait chain ==> 'enq: CF - contention' ==> 'KSV master wait' ==> 'ASM file metadata operation'
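If you hit this wait chain again while the instances are still alive, a global hanganalyze is the most useful evidence to collect for Oracle Support before LMHB steps in. A minimal SQL*Plus sketch (connect / as sysdba on any node; level 3 is the usual, low-overhead starting point):

oradebug setmypid
oradebug unlimit
oradebug -g all hanganalyze 3
-- wait a minute or two, then repeat so the chain can be seen as non-moving
oradebug -g all hanganalyze 3
-- optional and heavier: global system state dump
oradebug -g all dump systemstate 258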
Based on the available data, the issue was an exact match for BUG 35796102 - DB HUNG DURING ROLLING PATCH TO 19.20 WITH EVENT 'ASM FILE METADATA OPERATION' 'KSV MASTER WAIT', which was closed as a duplicate of BUG 31892120 - DBW0 HOLD A CF ENQUEUE FOR A LONG TIME DURING RESTARTING A FAILED INSTANCE. The RMS0 call stack matches this bug exactly.
This bug is applicable to RU 19.18 and above, and the fix has not yet been included in any DBRU.
In my opinion, this fix is good to have in all RAC environments running DBRU 19.18 and above.
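To check whether the one-off fix for bug 31892120 is already present in your Oracle home, you can read the OPatch inventory from SQL*Plus via DBMS_QOPATCH, or simply run "opatch lspatches" from each node. A sketch, assuming DBMS_QOPATCH is usable in your database:

set long 2000000 pagesize 0 linesize 200
select xmltransform(dbms_qopatch.get_opatch_lsinventory,
                    dbms_qopatch.get_opatch_xslt)
from   dual;
-- then search the output for patch 31892120 (or 35796102)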
Let me know if you have any questions or need further information, in the comments or on LinkedIn.
Regards
Adityanath.