Exadata image 18.1.5 in status failure due to Validation check ERROR – NOT RUNNING for service: dbserverd

Recently one of my client faced issue after upgrading Exadata image in DB server, image was showing its status as failure. I did review all patchmgr logs but didn’t see anything weird.


root@testserver1 ]# imageinfo
Kernel version: 4.1.12-94.8.4.el6uek.x86_64 #2 SMP Sat May 5 16:14:51 PDT 2018 x86_64
Image kernel version: 4.1.12-94.8.4.el6uek
Image version: 18.1.5.0.0.180506 
Image activated: 2018-05-29 18:03:57 +0200
Image status: failure ============================> Issue
System partition on device: /dev/mapper/VGExaDb-LVDbSys1

I asked customer to run validations manually as below:

/opt/oracle.cellos/validations/bin/vldrun.pl -quiet -all

Customer shared o/p of the command as below:


[root@testserver1 ]# /opt/oracle.cellos/validations/bin/vldrun.pl -quiet -all
Logging started to /var/log/cellos/validations.log
Command line is /opt/oracle.cellos/validations/bin/vldrun.pl -quiet -all
Run validation ipmisettings - PASSED
Run validation misceachboot - FAILED   ============================> Issue
Check log in /var/log/cellos/validations/misceachboot.log
Run validation biosbootorder - PASSED
Run validation oswatcher - PASSED
Run validation checkdeveachboot - PASSED
Run validation checkconfigs - BACKGROUND RUN
Run validation saveconfig - BACKGROUND RUN

After checking in misceachboot.log, I found below error:


-bash-4.4$ cat misceachboot.log | grep -i error
BIOS is Already Pause On Error on Adapter 0.
[1527609678][2018-05-29 18:03:53 +0200][ERROR][0-0][/opt/oracle.cellos/image_functions][image_functions_check_configured_services][] Validation check ERROR - NOT RUNNING for service: dbserverd
BIOS is Already Pause On Error on Adapter 0.
[1527678371][2018-05-30 13:06:56 +0200][ERROR][0-0][/opt/oracle.cellos/image_functions][image_functions_check_configured_services][] Validation check ERROR - NOT RUNNING for service: dbserverd

This shows something went wrong with service: dbserverd.

I asked him to check status of dbserverd services & to manually stop & start dbserverd services on affected server.

1. service dbserverd status

2. service dbserverd stop

3. service dbserverd start


[root@testserver1 ]# service dbserverd status
rsStatus: running
msStatus: stopped       ============================> Issue

[root@testserver1 ]# service dbserverd stop
Stopping the RS and MS services...
The SHUTDOWN of services was successful.

[root@testserver1 ]# service dbserverd start
Starting the RS services...
Getting the state of RS services... running
Starting MS services...
DBM-01513: DBMCLI request to Restart Server (RS) has timed out.
The STARTUP of MS services was not successful. Error: Unknown Error

This confirmed issue was with MS services. I asked customer to restart DB server but it didn’t resolve the issue.

Now I asked customer to reconfigure MS services as given below & check if it helps:


1. ssh to the node as root

2. Shutdown running RS and MS

DBMCLI>ALTER DBSERVER SHUTDOWN SERVICES ALL

see all the pids by “ps -ef | grep “dbserver.*dbms”, just kill them all.

3. re-deploy MS:
/opt/oracle/dbserver/dbms/deploy/scripts/unix/setup_dynamicDeploy DB -D

4. Restart RS and MS
DBMCLI>ALTER DBSERVER STARTUP SERVICES ALL


& this action plan resolved the issue:


[root@testserver1 ]# dbmcli
DBMCLI: Release - Production on Wed May 30 16:05:13 CEST 2018

Copyright (c) 2007, 2016, Oracle and/or its affiliates. All rights reserved.

DBMCLI> ALTER DBSERVER STARTUP SERVICES ALL
Starting the RS and MS services...
Getting the state of RS services... running
Starting MS services...
The STARTUP of MS services was successful.

DBMCLI> exit
quitting

[root@testserver1 ]# service dbserverd status
rsStatus: running
msStatus: running    ============================> Resolved 
[root@testserver1 ]#

Then we need to rerun validations to check if it is successful now:


[root@testserver1 ]# /opt/oracle.cellos/validations/bin/vldrun.pl -quiet -all
Logging started to /var/log/cellos/validations.log
Command line is /opt/oracle.cellos/validations/bin/vldrun.pl -quiet -all
Run validation ipmisettings - PASSED
Run validation misceachboot - PASSED ============================> Resolved 
Check log in /var/log/cellos/validations/misceachboot.log
Run validation biosbootorder - PASSED
Run validation oswatcher - PASSED
Run validation checkdeveachboot - PASSED
Run validation checkconfigs - BACKGROUND RUN
Run validation saveconfig - BACKGROUND RUN

Now you need to check image status:


[root@testserver1 ]# imageinfo
Kernel version: 4.1.12-94.8.4.el6uek.x86_64 #2 SMP Sat May 5 16:14:51 PDT 2018 x86_64
Image kernel version: 4.1.12-94.8.4.el6uek
Image version: 18.1.5.0.0.180506 
Image activated: 2018-05-29 18:03:57 +0200
Image status: success ============================> Resolved
System partition on device: /dev/mapper/VGExaDb-LVDbSys1

Sometimes this can still show status as failure where you can mark image status as success manually after checking with Oracle Support 🙂

Hope so u will find this post very useful 🙂

Cheers

Regards,
Adityanath

Advertisements