My notes: Exadata checklist status

To find information about the different components in Exadata:
$ cd /opt/oracle.SupportTools/onecommand
$ cat dbm.dat

ILOM

There are two options: graphical (web interface) and command line.

You can use a web browser to access the ILOM web interface:

https://11.209......./ipages/ologing.asp

Log in as the root user with its password.

From the web interface you can:

Identify hardware errors and faults
Remotely control the power of the node
View the graphical and non-graphical console of the host
View the current status of the system's sensors and indicators
Identify the hardware configuration of the system
Receive alerts that are generated about system events

To check cluster resource status (run from the Grid Infrastructure bin directory)

./crsctl stat res -t

OS log

/var/log/messages

To check the syslog configuration

/etc/syslog.conf

Starting CellCLI

cellcli [port_number] [-n] [-m] [-xml] [-v | -vv | -vvv] [-x] [-e command]

Examine the archived top output for kswapd (high CPU usage by kswapd points to memory pressure)

/opt/oracle.Exawatcher/osw/archive/oswtop

Memory Utilization

grep -E '^(MemTotal|MemFree|Cached):' /proc/meminfo

MemTotal: 1540864 kB
MemFree: 71520 kB
Cached: 979324 kB
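
The three values above are easier to compare across nodes as percentages of MemTotal. A minimal sketch, assuming /proc/meminfo-style input on stdin; the function name mem_summary is made up for illustration:

```shell
#!/bin/sh
# Sketch: report MemFree and Cached as a percentage of MemTotal.
# Reads /proc/meminfo-style text on stdin. mem_summary is a hypothetical name.
mem_summary() {
  awk '
    /^MemTotal:/ { total  = $2 }
    /^MemFree:/  { free   = $2 }
    /^Cached:/   { cached = $2 }
    END {
      if (total > 0)
        printf "free=%.1f%% cached=%.1f%%\n", 100 * free / total, 100 * cached / total
    }'
}

# Usage on a live node:
#   mem_summary < /proc/meminfo
```

With the sample values above this prints free=4.6% cached=63.6%. A low MemFree is not a problem on its own when most of the remainder is page cache.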

To check HugePages
Note: both Compute and Cell nodes should be checked to ensure HugePages are configured.

# grep ^Huge /proc/meminfo
HugePages_Total: 22960
HugePages_Free: 2056
HugePages_Rsvd: 2016
HugePages_Surp: 0
Hugepagesize: 2048 kB
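
To turn those counters into something quicker to read, the allocated page count (Total minus Free) and the pool size in MB can be derived from the same fields. A minimal sketch, assuming /proc/meminfo-style input on stdin; hugepage_summary is a hypothetical name:

```shell
#!/bin/sh
# Sketch: summarize HugePages usage from /proc/meminfo-style text on stdin.
# "used" here means HugePages_Total minus HugePages_Free.
hugepage_summary() {
  awk '
    /^HugePages_Total:/ { total   = $2 }
    /^HugePages_Free:/  { free    = $2 }
    /^Hugepagesize:/    { size_kb = $2 }
    END {
      printf "used=%d of %d pages, pool=%d MB\n", total - free, total, total * size_kb / 1024
    }'
}

# Usage on a live node:
#   hugepage_summary < /proc/meminfo
```

For the output above this gives used=20904 of 22960 pages, pool=45920 MB.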

VMSTAT

On a Compute node, go to /opt/oracle.Exawatcher/osw/archive/oswvmstat.
Zero swapping is needed to achieve stable and good system performance.

Note: the example below shows a system that has swapped. On a healthy system the swpd column would contain only zeros.

vmstat 60

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff   cache   si   so    bi    bo   in   cs us sy id wa
 1  0    652 116216 132176 1473568    0    0    81   177  462  928 20  6 69  5
 0  0    652 115492 132176 1473572    0    0     1    13  488 1184 18  7 57 17
 0  0    652 114624 132176 1473572    0    0     0     0  363  752 12  4 84  0
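
Scanning long vmstat captures by eye is error prone, so the check can be scripted. A minimal sketch, assuming the default vmstat column order shown above (swpd in field 3, si in field 7, so in field 8); swap_check is a hypothetical name:

```shell
#!/bin/sh
# Sketch: count vmstat samples that show swap usage or swap activity.
# Assumes the default column order: swpd = field 3, si = field 7, so = field 8.
swap_check() {
  awk '
    $1 ~ /^[0-9]+$/ {                     # data rows only; header and blank lines are skipped
      samples++
      if ($3 > 0) swapped++               # swap space currently in use
      if ($7 > 0 || $8 > 0) active++      # pages swapped in or out during the interval
    }
    END { printf "samples=%d swpd_nonzero=%d si_so_nonzero=%d\n", samples, swapped, active }'
}

# Usage on a live node (five 60-second samples):
#   vmstat 60 5 | swap_check
```

A non-zero swpd with si and so at zero, as in the sample above, means the node swapped at some point in the past but is not actively swapping now.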

ILOM (Integrated Lights Out Manager)

ILOM is a dedicated service processor used to manage and monitor servers. Each Cell server, Compute node, and InfiniBand switch has its own ILOM. There are several places to view ILOM errors and messages. The first is the web management console: from within the web console, select “Open Problems.”

From the host OS, use ipmitool to check the last 10 events in the System Event Log:

ipmitool sel list last 10

Network Status
srvctl status vip -n node1

To check whether any interface is down

dcli -l root -g ./all_group "ifconfig -a | grep DOWN"

Disk Status

# dcli -g all_group -l root /opt/MegaRAID/MegaCli/MegaCli64 AdpAllInfo -aALL | grep "Device Present" -A 8

slcb01db07: Device Present
slcb01db07: ================
slcb01db07: Virtual Drives
slcb01db07: Degraded
slcb01db07: Offline
slcb01db07: Physical Devices  : 5
slcb01db07: Disks           : 4
slcb01db07: Critical Disks  : 0
slcb01db07: Failed Disks
--
slcb01db08: Device Present
slcb01db08: ================
slcb01db08: Virtual Drives
slcb01db08: Degraded
slcb01db08: Offline
slcb01db08: Physical Devices  : 5
slcb01db08: Disks           : 4
slcb01db08: Critical Disks  : 0
slcb01db08: Failed Disks    : 0
--
slcb01cel12: Device Present
slcb01cel12: ================
slcb01cel12: Virtual Drives
slcb01cel12: Degraded
slcb01cel12: Offline
slcb01cel12: Physical Devices  : 14
slcb01cel12: Disks           : 12
slcb01cel12: Critical Disks  : 0
slcb01cel12: Failed Disks
--
slcb01cel13: Device Present
slcb01cel13: ================
slcb01cel13: Virtual Drives
slcb01cel13: Degraded
slcb01cel13: Offline
slcb01cel13: Physical Devices  : 14
slcb01cel13: Disks           : 12
slcb01cel13: Critical Disks  : 0
slcb01cel13: Failed Disks    : 0
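
On a full rack this output runs to many screens, so it helps to print only the hosts that report a problem. A minimal sketch, assuming dcli output in the "host: Label : value" form shown above; disk_problems is a hypothetical name, and lines with no value are skipped rather than guessed at:

```shell
#!/bin/sh
# Sketch: list hosts whose MegaCli summary shows non-zero Critical or Failed disks.
# Reads dcli output ("host: Label : value") on stdin.
disk_problems() {
  awk -F':' '
    /Critical Disks|Failed Disks/ {
      host  = $1
      label = $2; gsub(/^ +| +$/, "", label)   # trim padding around the label
      count = $3; gsub(/ /, "", count)         # value may be missing in captured output
      if (count != "" && count + 0 > 0) print host ": " label " = " count
    }'
}

# Usage, reusing the command from these notes:
#   dcli -g all_group -l root /opt/MegaRAID/MegaCli/MegaCli64 AdpAllInfo -aALL | disk_problems
```

No output means every host that reported a value reported zero.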

CheckHWnFWProfile

CheckHWnFWProfile is a program that validates whether the hardware and firmware on the Compute nodes and Storage nodes are supported configurations. It takes only a few seconds to run and can help identify issues such as unsupported disks. Note that Exachk also runs this command to check for issues.

dcli -l root -g ./all_group "/opt/oracle.SupportTools/CheckHWnFWProfile"

Service

lsnrctl status LISTENER_SCAN2

Database Free Buffer Waits

A very important metric to monitor is the time spent on the “free buffer waits” wait event. Free buffer waits indicate that a database process could not find a free buffer into which to perform a read operation. This occurs when the DBWR process cannot write blocks to storage fast enough. Free buffer waits are therefore an indication that the write rate of the I/O system is at or near its maximum. If this event appears in the top 5 wait events, take proactive action to reduce the write rate or increase the I/O capacity of the storage.

Have Changes Occurred in the Environment?
Change Management

Recent Oracle patching (Operating System, Database, Cell server, Clusterware, etc.)
Newly deployed applications
Code changes to existing applications
Other changes in usage (e.g. new users added)
Oracle configuration changes
Operating system configuration changes
Migration to a new platform
Expansion of the environment
Addition of other InfiniBand devices to the fabric
Changes in resource management plan

Use baseline data to troubleshoot issues

Compare configuration files

$ strings spfileemrep.ora > spfileemrep.ora.txt

$ strings spfileemrep.ora_072513_0100 > spfileemrep.ora_072513_0100.txt

$ diff spfileemrep.ora.txt spfileemrep.ora_072513_0100.txt

Checking changes to the kernel tunable parameters

dcli -l root -g ./dbs_group "sysctl -a > /tmp/sysctl.current;diff /root/<baseline kernel configuration file> /tmp/sysctl.current"

note: It is normal for some parameters to change dynamically. So the above output should be carefully analyzed to determine if the delta from the diff output is relevant to the issues being experienced.
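
One way to cut down that noise is to drop parameters that are known to change at runtime before diffing. A minimal sketch, assuming both files are in the "key = value" form written by sysctl -a; the function name sysctl_delta and the IGNORE_RE list are illustrative assumptions, not an authoritative set of dynamic parameters:

```shell
#!/bin/sh
# Sketch: diff two sysctl dumps while ignoring keys expected to change at runtime.
# IGNORE_RE is an illustrative assumption; extend it for your own environment.
IGNORE_RE='^(fs\.dentry-state|fs\.file-nr|fs\.inode-nr|fs\.inode-state|kernel\.random\.)'

sysctl_delta() {
  # $1 = baseline file, $2 = current file
  tmp_a=$(mktemp) || return 2
  tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 2; }
  grep -Ev "$IGNORE_RE" "$1" | sort > "$tmp_a"
  grep -Ev "$IGNORE_RE" "$2" | sort > "$tmp_b"
  diff "$tmp_a" "$tmp_b"
  rc=$?                      # 0 = no relevant differences, 1 = differences found
  rm -f "$tmp_a" "$tmp_b"
  return $rc
}

# Usage (the baseline path is whatever file you saved earlier):
#   sysctl -a > /tmp/sysctl.current
#   sysctl_delta /path/to/baseline /tmp/sysctl.current
```

Sorting both sides first keeps the diff stable even if the kernel lists parameters in a different order between runs.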

AWR Data
You can compare two time periods with the AWR compare-periods report script awrddrpt.sql, located in
/u01/app/oracle/product/12.2.0.1/db_1/rdbms/admin
Look for changes between the two periods in:

number of users
number of transactions
redo rate
physical reads per transaction
physical writes per transaction

Check if Compute node is CPU bound

Evaluate load average per core = # of runnable processes per core

Question: Is load average of 80 high?
Answer: It depends.

o X2-2, load/core = 80/12 ~= 6.67 runnable processes per core => yikes!
o X2-8, load/core = 80/64 ~= 1.25 runnable processes per core => ok!
The 3 load-average values are the 1-minute, 5-minute, and 15-minute averages.

For example, with a load average of 283 on a 12-core Compute node: load/core = 283 / 12 ~= 23.6 runnable processes per core
Note that CPU-bound Compute nodes will misleadingly show high I/O wait times, because a process that issues an I/O is not immediately rescheduled when the I/O completes, so CPU scheduling time is measured as part of the I/O wait time. I/O response times measured at the database level are therefore not accurate when the CPU is maxed out, which is why CPU contention should be ruled out first, as described above.
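
The load-per-core arithmetic above can be wrapped in a small helper so it is applied the same way on every node. A minimal sketch; load_per_core is a hypothetical name, and the core count is passed in explicitly because nproc reports CPU threads, which may differ from the physical core counts used in these notes:

```shell
#!/bin/sh
# Sketch: divide a load average by a core count.
load_per_core() {
  # $1 = load average, $2 = number of cores
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Usage on a live node (1-minute load average; nproc counts CPU threads):
#   load_per_core "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
```

For the figures in these notes, load_per_core 80 12 prints 6.67 and load_per_core 80 64 prints 1.25.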

TOP command
top - 20:44:36 up 2:28, 1 user, load average: 0.02, 0.04, 0.05
Tasks: 176 total, 2 running, 174 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 1540864 total, 74352 free, 407024 used, 1059488 buff/cache
KiB Swap: 5300220 total, 5300192 free, 28 used. 864920 avail Mem

I/O Performance

Check if cells are I/O bound