Oracle Exadata X7 Error: A memory component is suspected of causing a fault with a 100% certainty

Oracle Exadata Troubleshooting

I encountered the following error on an Exadata X7 system.

Alert Summary

Maintenance: A memory component suspected of causing a fault

Detailed Alert Information

A memory component is suspected of causing a fault with a 100% certainty.

Check CellCLI.

CellCLI> list alerthistory

No error is listed.

Check ILOM.

-> show /SYS/MB/P0/D7

/SYS/MB/P0/D7
Targets:
PRSNT
SERVICE

Properties:
    type = DIMM
    ipmi_name = MB/P0/D7
    fru_name = 32768MB DDR4 SDRAM DIMM
    fru_manufacturer = Samsung
    fru_part_number = 07075400,M393A4K40BB2-CTD
    fru_rev_level = 01
    fru_serial_number = ----------------------
    fault_state = Faulted
    clear_fault_action = (none)

Commands:
    cd
    set
    show

fault_state = Faulted

The DIMM at MB/P0/D7 appears to be faulty.
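If you have to repeat this check for many DIMMs, the "show" output is easy to parse with a small script. The following is only a minimal Python sketch: it assumes the ILOM output has already been captured as text (how you collect it is up to you), and the function names parse_ilom_properties and is_faulted are made up for this example.

```python
def parse_ilom_properties(text):
    """Collect the 'key = value' lines of an ILOM `show` output into a dict."""
    props = {}
    for line in text.splitlines():
        if " = " in line:
            key, value = line.split(" = ", 1)
            props[key.strip()] = value.strip()
    return props


def is_faulted(text):
    """Return True when the captured output reports fault_state = Faulted."""
    state = parse_ilom_properties(text).get("fault_state", "")
    return state.lower() == "faulted"


# Shortened copy of the output shown above, used as sample input.
SAMPLE = """\
/SYS/MB/P0/D7
Targets:
PRSNT
SERVICE

Properties:
    type = DIMM
    ipmi_name = MB/P0/D7
    fault_state = Faulted
    clear_fault_action = (none)
"""

if __name__ == "__main__":
    print(is_faulted(SAMPLE))
```

The same function can be reused after the fault is cleared, where it should return False for an output containing fault_state = OK.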

The fault may be caused by a transient problem on the DIMM.
Even so, the recommended procedure is to replace the memory module immediately.

Note – If more than one memory DIMM has experienced multiple CEs (correctable errors), other possible causes of CEs must be ruled out by a qualified Oracle Support specialist before replacing any DIMMs.

Follow the procedure below to clear the error after the replacement.

-> set /SYS/MB/P0/D7 clear_fault_action=true

-> show /SYS/MB/P0/D7

/SYS/MB/P0/D7
Targets:
PRSNT
SERVICE

Properties:
    type = DIMM
    ipmi_name = MB/P0/D7
    fru_name = 32768MB DDR4 SDRAM DIMM
    fru_manufacturer = Samsung
    fru_part_number = 07075400,M393A4K40BB2-CTD
    fru_rev_level = 01
    fru_serial_number = -------------------
    fault_state = OK
    clear_fault_action = (none)

Commands:
    cd
    set
    show

fault_state = OK

Description

A memory component fault has been cleared. Component Name /SYS/MB/P0/D7 Trap Additional Info fault.memory.intel.dimm_ce

Resolved: Hardware Alert 1_2
Event Time: 2020-04-13T21:40:53+03:00
Description: A memory component fault has been cleared.
Component Name: /SYS/MB/P0/D7
Trap Additional Info: fault.memory.intel.dimm_ce
Affected Server Name: node6
Server Model: Oracle Corporation ORACLE SERVER X7-2
Chassis Serial Number: —————–
Release Version: 19.2.9.0.0.191211.1
Recommended Action: Informational.
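If these notifications are collected somewhere, the component name and the fault class can be pulled out of the description text. This is a rough Python sketch under the assumption that the description always contains the "Component Name" and "Trap Additional Info" labels as in the alert above; the function name parse_alert is hypothetical.

```python
import re

# Labels are taken from the alert text above; the optional colon is tolerated.
ALERT_PATTERN = re.compile(
    r"Component Name:?\s+(?P<component>\S+)\s+"
    r"Trap Additional Info:?\s+(?P<fault_class>\S+)"
)


def parse_alert(description):
    """Return (component, fault_class), or None if the labels are missing."""
    match = ALERT_PATTERN.search(description)
    if match is None:
        return None
    return match.group("component"), match.group("fault_class")


DESCRIPTION = (
    "A memory component fault has been cleared. "
    "Component Name /SYS/MB/P0/D7 "
    "Trap Additional Info fault.memory.intel.dimm_ce"
)

if __name__ == "__main__":
    print(parse_alert(DESCRIPTION))
```

For the alert in this post, the fault class ends in dimm_ce, which matches the correctable error case discussed above.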

After clearing the fault, monitor the DIMM and check whether the threshold of 240 CEs is reached again.
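The follow-up check can be sketched as below. The figure of 240 comes from this post; how the CE counts are obtained is not shown here, so the sketch simply assumes they are already available as a dictionary, and the name dimms_at_threshold is invented for illustration.

```python
CE_THRESHOLD = 240  # threshold mentioned in this post


def dimms_at_threshold(ce_counts, threshold=CE_THRESHOLD):
    """Return the DIMM paths whose CE count has reached the threshold."""
    return sorted(
        dimm for dimm, count in ce_counts.items() if count >= threshold
    )


if __name__ == "__main__":
    # Hypothetical counts, only to show the call.
    counts = {"/SYS/MB/P0/D7": 240, "/SYS/MB/P0/D6": 3}
    print(dimms_at_threshold(counts))
```

If the same DIMM shows up again after the fault was cleared, that is a strong hint the module should be replaced rather than cleared once more.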

Great, have a nice day…
