Exadata checklist status

To find information about the different components in Exadata:
$ cd /opt/oracle.SupportTools/onecommand
$ cat dbm.dat


ILOM
There are two options: graphical (web interface) and command line.

You can use a web browser to access the ILOM web interface:
https://11.209......./ipages/ologing.asp
Log in as the root user with the root password.

From ILOM you can:
  • Identify hardware errors and faults
  • Remotely control the power of the node
  • View the graphical and non-graphical console of the host
  • View current status of sensors and indicators of the system
  • Identify the hardware configuration of the system
  • Receive alerts that are generated about system events

To check Clusterware resource status
./crsctl stat res -t

OS log
/var/log/messages

To check the syslog configuration
/etc/syslog.conf

Starting CellCLI
cellcli [port_number] [-n] [-m] [-xml] [-v | -vv | -vvv] [-x] [-e command]

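Examine top output for kswapd
Sustained CPU use by kswapd (the kernel page-reclaim daemon) in the top output indicates memory pressure.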


/opt/oracle.Exawatcher/osw/archive/oswtop 


Memory Utilization 


grep -E '^MemTotal:|^MemFree:|^Cached:' /proc/meminfo

MemTotal:        1540864 kB
MemFree:           71520 kB
Cached:           979324 kB
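
The three values above can be folded into one rough figure: MemFree plus Cached as a share of MemTotal. The sketch below is only an illustration. The function name mem_available_pct is hypothetical, and treating all page cache as reclaimable is a simplifying assumption.

```shell
#!/bin/sh
# mem_available_pct: read /proc/meminfo-style text on stdin and print
# (MemFree + Cached) as a percentage of MemTotal, to one decimal place.
# Hypothetical helper; assumes cached memory can be reclaimed.
mem_available_pct() {
    awk '
        $1 == "MemTotal:" { total  = $2 }
        $1 == "MemFree:"  { free   = $2 }
        $1 == "Cached:"   { cached = $2 }
        END {
            if (total > 0)
                printf "%.1f\n", (free + cached) * 100 / total
        }
    '
}

# Example run with the sample values shown above
mem_available_pct <<'EOF'
MemTotal:        1540864 kB
MemFree:           71520 kB
Cached:           979324 kB
EOF
```

On a live node the same function can be fed directly: mem_available_pct < /proc/meminfo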

To check HugePages
Note: Check both Compute and Cell nodes to ensure huge pages are configured.

# grep ^Huge /proc/meminfo
HugePages_Total: 22960
HugePages_Free: 2056
HugePages_Rsvd: 2016
HugePages_Surp: 0
Hugepagesize: 2048 kB
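
To read these counters quickly, it can help to derive how many huge pages are in use (Total minus Free) and how many are still uncommitted (Free minus Rsvd, since reserved pages are counted as free until they are allocated). This is a minimal sketch. The function name hugepage_summary and its output format are hypothetical.

```shell
#!/bin/sh
# hugepage_summary: read "grep ^Huge /proc/meminfo" output on stdin and print
# pages in use (Total - Free), pages not yet committed (Free - Rsvd),
# and the configured total. Hypothetical helper name and output format.
hugepage_summary() {
    awk '
        $1 == "HugePages_Total:" { total = $2 }
        $1 == "HugePages_Free:"  { free  = $2 }
        $1 == "HugePages_Rsvd:"  { rsvd  = $2 }
        END {
            printf "in_use=%d uncommitted=%d total=%d\n",
                   total - free, free - rsvd, total
        }
    '
}
```

Usage on a live node: grep ^Huge /proc/meminfo | hugepage_summary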


VMSTAT
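On a Compute node, go to /opt/oracle.Exawatcher/osw/archive/oswvmstat.
Zero swapping is needed to achieve stable and good system performance.
Note: On a healthy system the swpd column contains only 0s, and the si (swap-in) and so (swap-out) columns stay at 0. In the example below a small amount of swap is in use (swpd = 652) but si and so are 0, so the system is not actively swapping.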


vmstat 60
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0    652 116216 132176 1473568    0    0    81   177  462  928 20  6 69  5
 0  0    652 115492 132176 1473572    0    0     1    13  488 1184 18  7 57 17
 0  0    652 114624 132176 1473572    0    0     0     0  363  752 12  4 84  0
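
When scanning long vmstat archives, a small filter that prints only the samples with swap activity saves reading every line. The sketch below assumes the standard vmstat header shown above and locates the si and so columns from that header. The function name swap_activity is hypothetical.

```shell
#!/bin/sh
# swap_activity: read vmstat output on stdin and print every sample line
# whose si (swap-in) or so (swap-out) value is non-zero.
# Column positions are taken from the header row rather than hard-coded.
swap_activity() {
    awk '
        # Header row: remember which fields hold si and so
        $1 == "r" && $2 == "b" {
            for (i = 1; i <= NF; i++) {
                if ($i == "si") si = i
                if ($i == "so") so = i
            }
            next
        }
        # Data rows start with a number; the "procs ---" banner is skipped
        si && $1 ~ /^[0-9]+$/ && ($si > 0 || $so > 0) { print }
    '
}
```

Usage: vmstat 60 5 | swap_activity
Empty output means no swap-in or swap-out was recorded in the samples read.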

ILOM (Integrated Lights Out Manager)

ILOM is a dedicated service processor that is used to manage and monitor servers. Each Cell server, Compute node, and InfiniBand switch has a dedicated ILOM. There are several places to view errors and messages with ILOM. The first is the web management console: from within the web console, select “Open Problems.”


From the host, use ipmitool to check the last 10 events in the System Event Log:

ipmitool sel list last 10



Network Status
srvctl status vip -n node1

To check whether any interface is down:

dcli -l root -g ./all_group "ifconfig -a | grep DOWN" 


Disk Status 

# dcli -g all_group -l root /opt/MegaRAID/MegaCli/MegaCli64 AdpAllInfo -aALL | grep "Device Present" -A 8
slcb01db07: Device Present
slcb01db07: ================
slcb01db07: Virtual Drives
slcb01db07: Degraded
slcb01db07: Offline
slcb01db07: Physical Devices  : 5
slcb01db07: Disks           : 4
slcb01db07: Critical Disks  : 0
slcb01db07: Failed Disks
--
slcb01db08: Device Present
slcb01db08: ================
slcb01db08: Virtual Drives
slcb01db08: Degraded
slcb01db08: Offline
slcb01db08: Physical Devices  : 5
slcb01db08: Disks           : 4
slcb01db08: Critical Disks  : 0
slcb01db08: Failed Disks    : 0
--
slcb01cel12: Device Present
slcb01cel12: ================
slcb01cel12: Virtual Drives
slcb01cel12: Degraded
slcb01cel12: Offline
slcb01cel12: Physical Devices  : 14
slcb01cel12: Disks           : 12
slcb01cel12: Critical Disks  : 0
slcb01cel12: Failed Disks
--
slcb01cel13: Device Present
slcb01cel13: ================
slcb01cel13: Virtual Drives
slcb01cel13: Degraded
slcb01cel13: Offline
slcb01cel13: Physical Devices  : 14
slcb01cel13: Disks           : 12
slcb01cel13: Critical Disks  : 0
slcb01cel13: Failed Disks    : 0
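
With many nodes, the summary above is easier to act on if only the hosts with a problem are listed. The sketch below filters the dcli output for non-zero "Critical Disks" or "Failed Disks" counts. The function name failed_disk_hosts is hypothetical, and lines that carry no value after the label are skipped rather than guessed at.

```shell
#!/bin/sh
# failed_disk_hosts: read the dcli/MegaCli summary on stdin and print each
# host line that reports a non-zero "Critical Disks" or "Failed Disks" count.
# Lines with no numeric value after the label are ignored.
failed_disk_hosts() {
    awk -F':' '
        /Critical Disks|Failed Disks/ {
            host  = $1
            label = $2
            count = $3
            gsub(/^[ \t]+|[ \t]+$/, "", label)   # trim the label
            gsub(/[ \t]/, "", count)             # strip spaces from the value
            if (count ~ /^[0-9]+$/ && count + 0 > 0)
                print host ": " label " = " count
        }
    '
}
```

Usage: append "| failed_disk_hosts" to the dcli command above. No output means no node reported a critical or failed disk.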


CheckHWnFWProfile
CheckHWnFWProfile is a program that validates whether the hardware and firmware on the Compute nodes and Storage nodes are supported configurations. It takes only a few seconds to run and can help identify issues such as unsupported disks. Note that Exachk also executes this command to check for issues.
dcli -l root -g ./all_group "/opt/oracle.SupportTools/CheckHWnFWProfile"


Service

lsnrctl status LISTENER_SCAN2


Database Free Buffer Waits 

A very important metric to monitor is the time spent on the “free buffer waits” wait event. Free buffer waits indicate that a database process was not able to find a free buffer into which to perform a read operation. This occurs when the DBWR process cannot write blocks to storage fast enough, so it is an indication that the write rate of the I/O system is at or near its maximum. If this event appears in the top 5 wait events, take proactive action to reduce the write rate or increase the I/O capacity of storage.


Have Changes Occurred in the Environment? 
Change Management

  • Recent Oracle patching (Operating System, Database, Cell server, Clusterware, etc.)
  • Newly deployed applications
  • Code changes to existing applications
  • Other changes in usage (e.g. new users added)
  • Oracle configuration changes
  • Operating system configuration changes
  • Migration to a new platform
  • Expansion of the environment
  • Addition of other InfiniBand devices to the fabric
  • Changes in resource management plan

Use baseline data to troubleshoot issues
Compare configuration files

$ strings spfileemrep.ora > spfileemrep.ora.txt
$ strings spfileemrep.ora_072513_0100 > spfileemrep.ora_072513_0100.txt
$ diff spfileemrep.ora.txt spfileemrep.ora_072513_0100.txt

Checking changes to the kernel tunable parameters
dcli -l root -g ./dbs_group "sysctl -a > /tmp/sysctl.current; diff /root/<baseline kernel configuration file> /tmp/sysctl.current"

Note: It is normal for some parameters to change dynamically, so analyze the diff output carefully to determine whether each difference is relevant to the issue being experienced.
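
One way to cut down the noise from dynamically changing parameters is to drop them from both files before running diff. The sketch below is only an illustration. The function name sysctl_diff is hypothetical, and the default ignore pattern is an assumed starting list of counters that change on their own, to be adjusted per site.

```shell
#!/bin/sh
# sysctl_diff: diff a baseline sysctl dump against a current one, skipping
# parameters that match an ignore pattern (values that change on their own).
# Usage: sysctl_diff <baseline file> <current file> [ignore regex]
# The default pattern is an illustrative assumption, not an official list.
sysctl_diff() {
    baseline=$1
    current=$2
    ignore=$3
    [ -n "$ignore" ] || ignore='^(fs\.dentry-state|fs\.file-nr|fs\.inode-nr|fs\.inode-state|kernel\.random\.)'
    tmp_base=$(mktemp) || return 1
    tmp_cur=$(mktemp) || return 1
    # Remove ignored parameters and sort so ordering differences do not matter
    grep -Ev "$ignore" "$baseline" | sort > "$tmp_base"
    grep -Ev "$ignore" "$current"  | sort > "$tmp_cur"
    status=0
    diff "$tmp_base" "$tmp_cur" || status=$?
    rm -f "$tmp_base" "$tmp_cur"
    return $status
}
```

As with plain diff, the return status is 0 when the filtered files match and 1 when they differ.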


AWR Data 
You can compare two time periods with the AWR Compare Periods report script:
/u01/app/oracle/product/12.2.0.1/db_1/rdbms/admin
awrddrpt.sql

Compare metrics such as:
number of users
number of transactions
redo rate
physical reads per transaction
physical writes per transaction


Check if Compute node is CPU bound 

Evaluate load average per core = # of runnable processes per core
  • Question: Is load average of 80 high?
  • Answer: It depends.
      o X2-2, load/core = 80/12 ~= 6.67 runnable processes per core => yikes!
      o X2-8, load/core = 80/64 ~= 1.25 runnable processes per core => ok!
The 3 load-average values are the 1-minute, 5-minute, and 15-minute averages.


Example: Compute load/core = 283 / 12 ~= 23.6 runnable processes per core
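
The load-per-core arithmetic used in the examples above can be wrapped in a small helper. This is a minimal sketch. The function name load_per_core is hypothetical, and the core count must be supplied by the caller.

```shell
#!/bin/sh
# load_per_core: divide a load average by the number of CPU cores and print
# the result to two decimal places.
# Usage: load_per_core <load average> <cores>
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (cores <= 0) exit 1          # refuse to divide by zero
        printf "%.2f\n", load / cores
    }'
}

load_per_core 80 12     # X2-2 example from the text
load_per_core 80 64     # X2-8 example from the text
load_per_core 283 12    # Compute node example from the text
```

On a live node the 1-minute load average is the first field of /proc/loadavg. Note that nproc reports logical CPUs, which is higher than the physical core count when hyper-threading is enabled.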
Note that Compute nodes that are CPU bound will incorrectly show high I/O wait times, because the process that issues an I/O is not immediately rescheduled when the I/O completes, so CPU scheduling time is measured as part of the I/O wait time. I/O response times measured at the database level are therefore not accurate when the CPU is maxed out, which is why CPU contention must be ruled out first, as documented above.

top command
top - 20:44:36 up  2:28,  1 user,  load average: 0.02, 0.04, 0.05
Tasks: 176 total,   2 running, 174 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  1540864 total,    74352 free,   407024 used,  1059488 buff/cache
KiB Swap:  5300220 total,  5300192 free,       28 used.   864920 avail Mem


I/O Performance

Check if cells are I/O bound