Oracle Exadata Storage Cell Rescue Using the internal Flash Drive

Oracle Exadata Troubleshooting

While I was working on an Exadata storage cell, an operation failed because of the internal USB drive in the cell. Oracle uses the internal USB drive to back up the Exadata storage cell automatically, so we do not have to back up the storage cell manually.

In this article, I will show you how to rescue an Exadata storage cell using its internal USB drive. You will receive automatic SMTP alerts similar to the following:

cellcli -e list alerthistory


2_1 2019-10-08T00:30:47+03:00 critical "A processor component is suspected of causing a fault with a 100% certainty. Component Name : /SYS/MB/P0 Fault class : fault.io.intel.iio.pcie-fatal Fault message : http://www.sun.com/msg/SPX86A-8002-RK"
2_2 2019-10-08T00:31:46+03:00 critical "A processor component is suspected of causing a fault with a 15% certainty. Component Name : /SYS/MB/P0 Fault class : fault.io.intel.iio.pcie-device-init-failed Fault message : http://www.sun.com/msg/SPX86A-800A-8S"
2_3 2019-10-08T01:00:09+03:00 clear "A processor component fault has been cleared. Component Name : /SYS/MB/P0 Trap Additional Info : fault.io.intel.iio.pcie-fatal"
3_1 2019-10-08T00:30:46+03:00 critical "A processor component is suspected of causing a fault with a 100% certainty. Component Name : /SYS/MB/PCIE11 Fault class : fault.io.intel.iio.pcie-fatal Fault message : http://www.sun.com/msg/SPX86A-8002-RK"
3_2 2019-10-08T01:00:02+03:00 clear "A generic component fault has been cleared. Component Name : /SYS/MB/PCIE11 Trap Additional Info : fault.io.intel.iio.pcie-fatal"
4_1 2019-10-08T00:31:46+03:00 critical "A processor component is suspected of causing a fault with a 80% certainty. Component Name : /SYS/MB/PCIE6 Fault class : fault.io.intel.iio.pcie-device-init-failed Fault message : http://www.sun.com/msg/SPX86A-800A-8S"
4_2 2019-10-08T01:00:34+03:00 clear "A generic component fault has been cleared. Component Name : /SYS/MB/PCIE6 Trap Additional Info : fault.io.intel.iio.pcie-device-init-failed"
5_1 2019-10-08T00:31:46+03:00 critical "A generic component is suspected of causing a fault with a 5% certainty. Component Name : /SYS/MB Fault class : fault.io.intel.iio.pcie-device-init-failed Fault message : http://www.sun.com/msg/SPX86A-800A-8S"
5_2 2019-10-08T01:00:34+03:00 clear "A generic component fault has been cleared. Component Name : /SYS/MB Trap Additional Info : fault.io.intel.iio.pcie-device-init-failed"
6 2019-10-08T00:30:46+03:00 critical "Critical interrupt detected: . Power cycle forced."
7 2019-10-08T00:30:46+03:00 critical "Critical interrupt detected: . Power cycle forced."
8 2019-10-08T00:30:46+03:00 critical "Critical interrupt detected: . Power cycle forced."
9_1 2019-10-08T00:46:05+03:00 critical "Flash disk failed. Status : FAILED - DROPPED FOR REPLACEMENT Manufacturer : Oracle Model Number : Flash Accelerator F640 PCIe Card Size : 2981GB Serial Number : PHLE747100WY6P4BGN-1 Firmware : QDV1RD24 Slot Number : PCI Slot: 6; FDOM: 1 Cell Disk : FD_03_ru02 Grid Disk : Not configured Board Tracer Number : PHLE747100WY6P4BGN"
9_2 2019-10-16T10:51:08+03:00 warning "Flash disk was removed. Status : FAILED - DROPPED FOR REPLACEMENT Manufacturer : Oracle Model Number : Flash Accelerator F640 PCIe Card Size : 2981GB Serial Number : PHLE747100WY6P4BGN-1 Firmware : QDV1RD24 Slot Number : PCI Slot: 6; FDOM: 1 Cell Disk : FD_03_ru02 Grid Disk : Not configured Flash Cache : Present Flash Log : Present"
9_3 2019-10-16T10:52:00+03:00 clear "Flash disk was replaced. Status : NORMAL Manufacturer : Oracle Model Number : Flash Accelerator F640 PCIe Card Size : 2981GB Serial Number : PHLE848601CG6P4BGN-1 Firmware : QDV1RD28 Slot Number : PCI Slot: 6; FDOM: 1 Cell Disk : FD_03_ru02 Grid Disk : Not configured Flash Cache : Present Flash Log : Present"
10_1 2019-10-08T00:46:06+03:00 critical "Flash disk failed. Status : FAILED - DROPPED FOR REPLACEMENT Manufacturer : Oracle Model Number : Flash Accelerator F640 PCIe Card Size : 2981GB Serial Number : PHLE747100WY6P4BGN-2 Firmware : QDV1RD24 Slot Number : PCI Slot: 6; FDOM: 2 Cell Disk : FD_03_ru02 Grid Disk : Not configured Board Tracer Number : PHLE747100WY6P4BGN"
10_2 2019-10-16T10:51:08+03:00 warning "Flash disk was removed. Status : FAILED - DROPPED FOR REPLACEMENT Manufacturer : Oracle Model Number : Flash Accelerator F640 PCIe Card Size : 2981GB Serial Number : PHLE747100WY6P4BGN-2 Firmware : QDV1RD24 Slot Number : PCI Slot: 6; FDOM: 2 Cell Disk : FD_03_ru02 Grid Disk : Not configured Flash Cache : Present Flash Log : Present"
10_3 2019-10-16T10:52:01+03:00 clear "Flash disk was replaced. Status : NORMAL Manufacturer : Oracle Model Number : Flash Accelerator F640 PCIe Card Size : 2981GB Serial Number : PHLE848601CG6P4BGN-2 Firmware : QDV1RD28 Slot Number : PCI Slot: 6; FDOM: 2 Cell Disk : FD_03_ru02 Grid Disk : Not configured Flash Cache : Present Flash Log : Present"
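
A listing this long is hard to scan by eye. As a minimal sketch, the helper below reads alert history text in the four-column shape shown above (name, timestamp, severity, message) and prints the alert groups that logged a critical entry but never a clear. The function name uncleared_alert_groups is hypothetical, and the sample lines are abbreviated from the listing above.

```shell
#!/bin/sh
# Print alert groups that raised a critical alert and never logged a clear.
# Input: text shaped like `cellcli -e list alerthistory` output on stdin:
#   <name> <timestamp> <severity> "<message>"
uncleared_alert_groups() {
  awk '
    {
      split($1, id, "_")                 # "9_2" -> group "9"; "6" -> group "6"
      if ($3 == "critical") crit[id[1]] = 1
      if ($3 == "clear")    cleared[id[1]] = 1
    }
    END {
      for (g in crit) if (!(g in cleared)) print g
    }
  ' | sort -n
}

# Abbreviated sample taken from the alert history above.
uncleared_alert_groups <<'EOF'
2_1 2019-10-08T00:30:47+03:00 critical "A processor component is suspected of causing a fault"
2_3 2019-10-08T01:00:09+03:00 clear "A processor component fault has been cleared."
6 2019-10-08T00:30:46+03:00 critical "Critical interrupt detected: . Power cycle forced."
9_1 2019-10-08T00:46:05+03:00 critical "Flash disk failed."
9_3 2019-10-16T10:52:00+03:00 clear "Flash disk was replaced."
EOF
```

On a cell, the same function could be fed directly with: cellcli -e list alerthistory | uncleared_alert_groups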

The celldisk and griddisks were not created automatically.

The fastest way is to reimage using USB rescue.

If the storage cell needs to be reimaged, follow these steps from the Java-based ILOM Remote Console. They assume that the internal USB drive is healthy.

Select the last line from the grub menu. It reads: CELL_USB_BOOT_CELLBOOT_usb_in_rescue_mode

When prompted, select (r)einstall or try to recover damaged system. Confirm your decision when prompted “Are you sure?”

When prompted whether to erase the data partition and disks, choose “no”.

Follow the remaining prompts.

Once the cell is up, the celldisks need to be imported. Run:
cellcli -e import celldisk all force
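
After the import it is worth confirming that every cell disk came back healthy. The sketch below assumes the two-column text that cellcli -e list celldisk attributes name,status prints (name, then status) and lists any disk whose status is not "normal". The function name and the sample disk names are hypothetical.

```shell
#!/bin/sh
# Print the name of every cell disk whose status is not exactly "normal".
# Input: "<name> <status...>" lines on stdin (assumed output layout).
not_normal_celldisks() {
  awk 'NF >= 2 {
    name = $1
    $1 = ""                  # drop the name, keep the (possibly multi-word) status
    sub(/^[ \t]+/, "")
    if ($0 != "normal") print name
  }'
}

# Hypothetical sample: one healthy disk, one missing disk.
not_normal_celldisks <<'EOF'
CD_00_cel01  normal
FD_03_cel01  not present
EOF
```

On a cell: cellcli -e list celldisk attributes name,status | not_normal_celldisks
An empty result means all imported cell disks report normal.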

Check if flashlog and flashcache are created:
cellcli -e list flashcache detail
cellcli -e list flashlog detail

Run the following and check if flashCacheMode matches that of a healthy cell:
cellcli -e list cell detail
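
To make the comparison with a healthy cell less error-prone, the value can be extracted from the detail output instead of being read by eye. This is a sketch that assumes the "attribute: value" layout of list cell detail; the function name flash_cache_mode and the sample values are hypothetical.

```shell
#!/bin/sh
# Extract the flashCacheMode value from `list cell detail` style text on stdin.
flash_cache_mode() {
  awk -F: '$1 ~ /^[ \t]*flashCacheMode[ \t]*$/ {
    gsub(/[ \t]/, "", $2)    # strip padding around the value
    print $2
  }'
}

# Hypothetical sample of two detail lines.
printf '\t name:\t cel01\n\t flashCacheMode:\t WriteBack\n' | flash_cache_mode
```

Run cellcli -e list cell detail | flash_cache_mode on both the reimaged cell and a healthy cell, and compare the two values.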

Manually add the griddisks to ASM. In the example below, xdbld1st06 is the name of the cell:

alter diskgroup DATA_XDBLD1 add failgroup xdbld1st06 disk 'o//DATA_XDBLDxdbld1st06' rebalance power 11;

alter diskgroup RECO_XDBLD1 add failgroup xdbld1st06 disk 'o//RECO_XDBLDxdbld1st06' rebalance power 11;

Alternatively, you can use the steps below, again shown as an example.

a. Check the status of the griddisks

sqlplus / as sysasm
col path format a59
set pagesi 200
set linesi 200
select path, name, header_status, mode_status, mount_status, state from v$asm_disk order by path;

  • The griddisks belonging to the reimaged cell should show up with a header_status of CANDIDATE.

b. For each diskgroup (DATA, RECO, DBFS_DG, etc), run:
alter diskgroup add disk '/*';

For example, for the DATA diskgroup, given a cell whose name is chsmchs00203 and whose InfiniBand is configured as active/active, hence two IPs:
alter diskgroup DATA add disk 'o/10.111.249.15;10.111.249.16/DATA*chsmsck00203';
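
Because the same statement has to be repeated for each diskgroup, it can help to generate it. The sketch below only prints the SQL and does not run it. It assumes, as in the DATA example above, that the griddisk name prefix equals the diskgroup name. The function name add_disk_sql is hypothetical.

```shell
#!/bin/sh
# Print an "alter diskgroup ... add disk" statement for one diskgroup.
#   $1 = diskgroup name (also used as the griddisk prefix: an assumption)
#   $2 = cell IP address(es), separated by ";" for active/active
#   $3 = cell name as it appears at the end of the griddisk names
add_disk_sql() {
  printf "alter diskgroup %s add disk 'o/%s/%s*%s';\n" "$1" "$2" "$1" "$3"
}

# Reproduces the DATA statement shown above.
add_disk_sql DATA '10.111.249.15;10.111.249.16' chsmsck00203
```

Review the printed statement before pasting it into sqlplus as sysasm, and repeat the call for RECO, DBFS_DG and any other diskgroup.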

Have a nice day.
