I encountered the following error on Exadata.
Alert Summary
Maintenance A memory component suspected of causing a fault
Detailed Alert Information
A memory component is suspected of causing a fault with a 100% certainty.
Check CellCLI.
CellCLI> list alerthistory
No errors are listed.
Check ILOM.
-> show /SYS/MB/P0/D7
/SYS/MB/P0/D7
Targets:
PRSNT
SERVICE
Properties:
type = DIMM
ipmi_name = MB/P0/D7
fru_name = 32768MB DDR4 SDRAM DIMM
fru_manufacturer = Samsung
fru_part_number = 07075400,M393A4K40BB2-CTD
fru_rev_level = 01
fru_serial_number = ----------------------
fault_state = Faulted
clear_fault_action = (none)
Commands:
cd
set
show
fault_state = Faulted
ILOM reports the DIMM at /SYS/MB/P0/D7 as faulted.
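When several DIMMs need checking, reading the show output by eye gets tedious. Below is a minimal Python sketch that turns the "key = value" lines of the output into a dictionary and tests fault_state. It assumes the output has already been captured as text (for example, pasted from an SSH session to ILOM); the function names parse_ilom_properties and is_faulted are made up for this example and are not part of any Oracle tool.

```python
def parse_ilom_properties(show_output: str) -> dict:
    """Collect the 'key = value' lines of ILOM show output into a dict."""
    props = {}
    for line in show_output.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:  # skip lines such as 'Targets:' that carry no property
            props[key.strip()] = value.strip()
    return props


def is_faulted(show_output: str) -> bool:
    """Return True when fault_state is present and is anything other than OK."""
    state = parse_ilom_properties(show_output).get("fault_state")
    return state is not None and state != "OK"


# Shortened copy of the output shown above.
sample = """\
/SYS/MB/P0/D7
Properties:
type = DIMM
ipmi_name = MB/P0/D7
fault_state = Faulted
clear_fault_action = (none)
"""

print(parse_ilom_properties(sample)["fault_state"])
print(is_faulted(sample))
```

The same two functions can be run again on the output captured after the fault is cleared to confirm that fault_state has returned to OK.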
The fault may stem from a transient problem on the DIMM, but the recommended procedure is to replace the memory module immediately.
Note: If more than one DIMM has experienced multiple correctable errors (CEs), a qualified Oracle Support specialist must rule out other possible causes of the CEs before any DIMMs are replaced.
After the replacement, clear the fault as follows.
-> set /SYS/MB/P0/D7 clear_fault_action=true
-> show /SYS/MB/P0/D7
/SYS/MB/P0/D7
Targets:
PRSNT
SERVICE
Properties:
type = DIMM
ipmi_name = MB/P0/D7
fru_name = 32768MB DDR4 SDRAM DIMM
fru_manufacturer = Samsung
fru_part_number = 07075400,M393A4K40BB2-CTD
fru_rev_level = 01
fru_serial_number = -------------------
fault_state = OK
clear_fault_action = (none)
Commands:
cd
set
show
fault_state = OK
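If the same clear-and-verify steps have to be repeated for another DIMM, the two ILOM commands can be generated rather than typed by hand. The sketch below only builds the command strings; it does not connect to ILOM. It assumes other DIMMs follow the same /SYS/MB/P<processor>/D<slot> naming seen in this one example, and the helper names dimm_target and clear_fault_commands are hypothetical.

```python
def dimm_target(processor: int, slot: int) -> str:
    """Build an ILOM target path, assuming the /SYS/MB/P<n>/D<m> layout seen above."""
    return f"/SYS/MB/P{processor}/D{slot}"


def clear_fault_commands(target: str) -> list:
    """Return the clear command followed by the show command used to verify it."""
    return [
        f"set {target} clear_fault_action=true",
        f"show {target}",
    ]


# Reproduces the two commands used in this post for processor 0, DIMM slot 7.
for command in clear_fault_commands(dimm_target(0, 7)):
    print(command)
```

The commands are printed so they can be reviewed and pasted at the ILOM prompt manually.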
Resolved: Hardware Alert 1_2
Event Time | 2020-04-13T21:40:53+03:00
Description | A memory component fault has been cleared. Component Name: /SYS/MB/P0/D7. Trap Additional Info: fault.memory.intel.dimm_ce
Affected Server | Name: node6. Server Model: Oracle Corporation ORACLE SERVER X7-2. Chassis Serial Number: -----------------. Release Version: 19.2.9.0.0.191211.1
Recommended Action | Informational.
After this process, keep monitoring the DIMM to see whether the correctable error threshold (240 CEs) is reached again.
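The follow-up check can be sketched as a simple comparison against the threshold. This is only an illustration: the post does not show how to read per-DIMM CE counts, so the counts below are made-up values, the second DIMM path is hypothetical, and treating "reached" as "greater than or equal to 240" is an assumption.

```python
CE_THRESHOLD = 240  # threshold value quoted in this post


def dimms_at_threshold(ce_counts: dict, threshold: int = CE_THRESHOLD) -> list:
    """Return the DIMM names whose CE count has reached the threshold, sorted."""
    return sorted(name for name, count in ce_counts.items() if count >= threshold)


# Made-up counts for illustration only; real values would come from the server.
counts = {
    "/SYS/MB/P0/D7": 12,
    "/SYS/MB/P0/D3": 240,  # hypothetical second DIMM
}

print(dimms_at_threshold(counts))
```

An empty list means no DIMM in the supplied counts has reached the threshold.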
Great. Have a nice day.