CELL-02630: There is a communication error between Management Server and Cell Server caused by a mismatch of security keys. Check that both servers have access to and use the same $OSSCONF/cellmskey.ora file.

IHAC (I have a customer) on Exadata image version 11.2.3.3.0.131014.1 who faced the issue below.

Every CellCLI command was failing with error CELL-02630:

CELL-02630: There is a communication error between Management Server and Cell Server caused by a mismatch of security keys. Check that both servers have access to and use the same $OSSCONF/cellmskey.ora file.

As the error description says, this is a communication error between the CELLSRV and MS processes caused by a mismatch in security keys.

I checked the current status of the cell processes and all were up and running.


[root@test01celadm01 ~]# service celld status
rsStatus: running
msStatus: running
cellsrvStatus: running

MS always creates the key file cellmskey.ora on startup if it does not exist. But in our case it was not present (not sure if someone deleted it manually).
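Before restarting MS, it can help to confirm the key file really is missing. A minimal sketch (the default $OSSCONF path below is an assumption; it varies by cell software version, so point it at your cell's config directory):

```shell
# Sketch: check whether the MS/CELLSRV shared key file exists.
# The default OSSCONF path is an assumption; adjust it for your cell image.
OSSCONF=${OSSCONF:-/opt/oracle/cell/cellsrv/deploy/config}
if [ -f "$OSSCONF/cellmskey.ora" ]; then
    echo "cellmskey.ora present"
else
    echo "cellmskey.ora missing - restarting MS should regenerate it"
fi
```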

I asked the customer to restart the MS process and check if it helped. After the MS restart, CellCLI commands started working as expected 🙂


CellCLI> alter cell restart services ms

Restarting MS services...
The RESTART of MS services was successful.

CellCLI> list celldisk
CD_00_test01celadm01 normal
CD_01_test01celadm01 normal
CD_02_test01celadm01 normal
CD_03_test01celadm01 normal
CD_04_test01celadm01 normal
CD_05_test01celadm01 normal
CD_06_test01celadm01 normal
CD_07_test01celadm01 normal
CD_08_test01celadm01 normal
CD_09_test01celadm01 normal
CD_10_test01celadm01 normal
CD_11_test01celadm01 normal
FD_00_test01celadm01 normal
FD_01_test01celadm01 normal
FD_02_test01celadm01 normal
FD_03_test01celadm01 normal
FD_04_test01celadm01 normal
FD_05_test01celadm01 normal
FD_06_test01celadm01 normal
FD_07_test01celadm01 normal
FD_08_test01celadm01 normal
FD_09_test01celadm01 normal
FD_10_test01celadm01 normal
FD_11_test01celadm01 normal
FD_12_test01celadm01 normal
FD_13_test01celadm01 normal
FD_14_test01celadm01 normal
FD_15_test01celadm01 normal

Hope you will find this post useful 🙂

Cheers

Regards,
Adityanath


New Exadata install getting Warning:Flash Cache size is not consistent for all storage nodes in the cluster.

Recently my customer faced the following issue: after completing an X7-2 Exadata install, the flash cache showed a different size on one cell node than on the other cells.

Everything went well with the OneCommand install until step 15, which raised this warning:

Warning:Flash Cache size is not consistent for all storage nodes in the cluster. Flash Cache on [celadm06.test.local] does not match with the Flash Cache size on the cell celadm01.test.local in cluster /u01/app/12.2.0.1/grid

We checked the flash cache size using dcli:


[root@celadm01 linux-x64]# dcli -g cell_group -l root cellcli -e "list flashcache detail" | grep size
celadm01: size: 23.28692626953125T
celadm02: size: 23.28692626953125T
celadm03: size: 23.28692626953125T
celadm04: size: 23.28692626953125T
celadm05: size: 23.28692626953125T
celadm06: size: 23.28680419921875T ==================> Smaller flashcache than other cells
celadm07: size: 23.28692626953125T
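Rather than eyeballing the dcli output, a small awk filter can flag any cell whose size differs from the majority value. A sketch (the sample file below just mimics the "cell: size: value" lines printed above):

```shell
# Sketch: flag cells whose flashcache size differs from the majority value.
# Sample input mimics the output of the dcli command shown above.
cat > /tmp/fc_sizes.txt <<'EOF'
celadm01: size: 23.28692626953125T
celadm02: size: 23.28692626953125T
celadm06: size: 23.28680419921875T
EOF
awk -F': *' '{count[$3]++; line[NR]=$0; sz[NR]=$3}
END { best = ""; bestc = 0
      for (s in count) if (count[s] > bestc) { bestc = count[s]; best = s }
      for (i = 1; i <= NR; i++) if (sz[i] != best) print line[i], "<== differs" }' /tmp/fc_sizes.txt
```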

All Flash disks were in a normal state and there was no hardware failure reported.

After investigating further through the sundiag report, I found the mismatch below.


name: FD_00_celadm06
comment: 
creationTime: 2018-07-22T14:11:18+00:00
deviceName: /dev/md310
devicePartition: /dev/md310
diskType: FlashDisk
errorCount: 0
freeSpace: 0 =================================================>>>>>>>>>>>>>>>>>>>>>>>>>> freeSpace is 0
id: ***********
physicalDisk: ***********
size: 5.8218994140625T
status: normal

name: FD_01_celadm06
comment: 
creationTime: 2018-07-22T14:11:18+00:00
deviceName: /dev/md304
devicePartition: /dev/md304
diskType: FlashDisk
errorCount: 0
freeSpace: 0 =================================================>>>>>>>>>>>>>>>>>>>>>>>>>> freeSpace is 0
id: ***********
physicalDisk: ***********
size: 5.8218994140625T
status: normal

name: FD_02_celadm06
comment: 
creationTime: 2018-07-22T14:11:18+00:00
deviceName: /dev/md305
devicePartition: /dev/md305
diskType: FlashDisk
errorCount: 0
freeSpace: 0 =================================================>>>>>>>>>>>>>>>>>>>>>>>>>> freeSpace is 0
id: ***********
physicalDisk: ***********
size: 5.8218994140625T
status: normal

name: FD_03_celadm06
comment: 
creationTime: 2018-07-23T19:31:59+00:00
deviceName: /dev/md306
devicePartition: /dev/md306
diskType: FlashDisk
errorCount: 0
freeSpace: 160M =================================================>>>>>>>>>>>>>>>>>>>>>>> freeSpace 160M is not released
id: ***********
physicalDisk: ***********
size: 5.8218994140625T
status: normal

So I found the culprit 🙂 The mismatch in flash cache size was caused by freeSpace not being released on one of the flash disks (FD_03_celadm06), as the output above shows.
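The same scan can be automated with a short awk filter that walks the attribute dump and reports any flash disk whose freeSpace is not 0. A sketch (the sample lines are a trimmed version of the detail output above):

```shell
# Sketch: list flash disks whose freeSpace is non-zero (space not released).
# Sample input is a trimmed version of the flash disk detail output above.
cat > /tmp/fd_detail.txt <<'EOF'
name: FD_02_celadm06
freeSpace: 0
name: FD_03_celadm06
freeSpace: 160M
EOF
awk '$1 == "name:" { n = $2 }
     $1 == "freeSpace:" && $2 != "0" { print n, "freeSpace =", $2 }' /tmp/fd_detail.txt
```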

I asked the customer to recreate the flash cache using the following procedure.


1) Check to make sure at least one mirror copy of the extents is available.

CellCLI> list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
– If asmdeactivationoutcome reports 'Yes' for all grid disks, continue to step 2

2) Manually flush the flashcache:
# cellcli -e alter flashcache all flush

In a second window, check the status of the flash cache flush.
The following command should return "working" for each flash disk on each cell while the cache is being flushed and "completed" when it is finished.
# cellcli -e "LIST CELLDISK ATTRIBUTES name, flushstatus, flusherror" | grep FD

3) Drop Flashlog:
# cellcli -e drop flashlog all

4) Drop flashcache:
# cellcli -e drop flashcache all

5) Recreate flashlog:
# cellcli -e create flashlog all

6) Recreate flashcache:
# cellcli -e create flashcache all

7) Finally, check the flash cache size to confirm it is now correct:
# cellcli -e list flashcache detail | grep size
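The flush in step 2 can take a while on large caches; a small polling loop saves re-running the status command by hand. A sketch of the pattern (check_flush is a stand-in for the real cellcli flush-status command above, stubbed here so the loop logic is clear):

```shell
# Sketch: poll until no flash disk still reports "working".
# check_flush is a stand-in for:
#   cellcli -e "LIST CELLDISK ATTRIBUTES name, flushstatus" | grep FD
check_flush() { cat /tmp/flush_status.txt; }
printf 'FD_00 completed\nFD_01 completed\n' > /tmp/flush_status.txt
while check_flush | grep -q working; do
    sleep 10   # re-check every 10 seconds while any disk is still flushing
done
echo "flush completed on all flash disks"
```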


The issue was resolved after dropping and recreating the flashlog and flashcache on that particular cell node 🙂

Hope you will find this post useful 🙂

Cheers

Regards,
Adityanath

Exadata image 18.1.5 in status failure due to Validation check ERROR – NOT RUNNING for service: dbserverd

Recently one of my clients faced an issue after upgrading the Exadata image on a DB server: the image showed its status as failure. I reviewed all the patchmgr logs but didn't see anything unusual.


root@testserver1 ]# imageinfo
Kernel version: 4.1.12-94.8.4.el6uek.x86_64 #2 SMP Sat May 5 16:14:51 PDT 2018 x86_64
Image kernel version: 4.1.12-94.8.4.el6uek
Image version: 18.1.5.0.0.180506 
Image activated: 2018-05-29 18:03:57 +0200
Image status: failure ============================> Issue
System partition on device: /dev/mapper/VGExaDb-LVDbSys1

I asked the customer to run the validations manually as below:

/opt/oracle.cellos/validations/bin/vldrun.pl -quiet -all

The customer shared the output of the command:


[root@testserver1 ]# /opt/oracle.cellos/validations/bin/vldrun.pl -quiet -all
Logging started to /var/log/cellos/validations.log
Command line is /opt/oracle.cellos/validations/bin/vldrun.pl -quiet -all
Run validation ipmisettings - PASSED
Run validation misceachboot - FAILED   ============================> Issue
Check log in /var/log/cellos/validations/misceachboot.log
Run validation biosbootorder - PASSED
Run validation oswatcher - PASSED
Run validation checkdeveachboot - PASSED
Run validation checkconfigs - BACKGROUND RUN
Run validation saveconfig - BACKGROUND RUN

After checking misceachboot.log, I found the error below:


-bash-4.4$ cat misceachboot.log | grep -i error
BIOS is Already Pause On Error on Adapter 0.
[1527609678][2018-05-29 18:03:53 +0200][ERROR][0-0][/opt/oracle.cellos/image_functions][image_functions_check_configured_services][] Validation check ERROR - NOT RUNNING for service: dbserverd
BIOS is Already Pause On Error on Adapter 0.
[1527678371][2018-05-30 13:06:56 +0200][ERROR][0-0][/opt/oracle.cellos/image_functions][image_functions_check_configured_services][] Validation check ERROR - NOT RUNNING for service: dbserverd

This showed that something went wrong with the dbserverd service.
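When the log is long, the failing service name can be pulled out directly instead of reading every ERROR line. A sketch (the sample log line is copied from the output above):

```shell
# Sketch: extract the failing service name from misceachboot.log ERROR lines.
# Sample input is one of the ERROR lines shown above.
cat > /tmp/misceachboot.log <<'EOF'
[1527609678][2018-05-29 18:03:53 +0200][ERROR][0-0][/opt/oracle.cellos/image_functions][image_functions_check_configured_services][] Validation check ERROR - NOT RUNNING for service: dbserverd
EOF
grep -o 'NOT RUNNING for service: .*' /tmp/misceachboot.log | awk -F': ' '{print $2}' | sort -u
```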

I asked the customer to check the status of the dbserverd service and to manually stop and start it on the affected server.

1. service dbserverd status

2. service dbserverd stop

3. service dbserverd start


[root@testserver1 ]# service dbserverd status
rsStatus: running
msStatus: stopped       ============================> Issue

[root@testserver1 ]# service dbserverd stop
Stopping the RS and MS services...
The SHUTDOWN of services was successful.

[root@testserver1 ]# service dbserverd start
Starting the RS services...
Getting the state of RS services... running
Starting MS services...
DBM-01513: DBMCLI request to Restart Server (RS) has timed out.
The STARTUP of MS services was not successful. Error: Unknown Error

This confirmed the issue was with the MS service. I asked the customer to restart the DB server, but that didn't resolve the issue.

Next I asked the customer to reconfigure the MS service as given below and check if it helped:


1. ssh to the node as root

2. Shutdown running RS and MS

DBMCLI>ALTER DBSERVER SHUTDOWN SERVICES ALL

Find any remaining pids with ps -ef | grep "dbserver.*dbms" and kill them all.

3. re-deploy MS:
/opt/oracle/dbserver/dbms/deploy/scripts/unix/setup_dynamicDeploy DB -D

4. Restart RS and MS
DBMCLI>ALTER DBSERVER STARTUP SERVICES ALL


This action plan resolved the issue:


[root@testserver1 ]# dbmcli
DBMCLI: Release - Production on Wed May 30 16:05:13 CEST 2018

Copyright (c) 2007, 2016, Oracle and/or its affiliates. All rights reserved.

DBMCLI> ALTER DBSERVER STARTUP SERVICES ALL
Starting the RS and MS services...
Getting the state of RS services... running
Starting MS services...
The STARTUP of MS services was successful.

DBMCLI> exit
quitting

[root@testserver1 ]# service dbserverd status
rsStatus: running
msStatus: running    ============================> Resolved 
[root@testserver1 ]#

Then we reran the validations to check whether they were successful:


[root@testserver1 ]# /opt/oracle.cellos/validations/bin/vldrun.pl -quiet -all
Logging started to /var/log/cellos/validations.log
Command line is /opt/oracle.cellos/validations/bin/vldrun.pl -quiet -all
Run validation ipmisettings - PASSED
Run validation misceachboot - PASSED ============================> Resolved 
Check log in /var/log/cellos/validations/misceachboot.log
Run validation biosbootorder - PASSED
Run validation oswatcher - PASSED
Run validation checkdeveachboot - PASSED
Run validation checkconfigs - BACKGROUND RUN
Run validation saveconfig - BACKGROUND RUN

Finally, check the image status:


[root@testserver1 ]# imageinfo
Kernel version: 4.1.12-94.8.4.el6uek.x86_64 #2 SMP Sat May 5 16:14:51 PDT 2018 x86_64
Image kernel version: 4.1.12-94.8.4.el6uek
Image version: 18.1.5.0.0.180506 
Image activated: 2018-05-29 18:03:57 +0200
Image status: success ============================> Resolved
System partition on device: /dev/mapper/VGExaDb-LVDbSys1

Sometimes this can still show the status as failure; in that case you can mark the image status as success manually after checking with Oracle Support 🙂

Hope you will find this post useful 🙂

Cheers

Regards,
Adityanath