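The week before last I shared a troubleshooting case: "19C RAC: network not found when rebuilding the cluster after changing the private network IP". After root.sh was rerun in that environment, the cluster still reported an error during initialization. I made a round trip to Chongqing today and did not feel like reading in the evening, so I decided to analyze this failure instead. It took a little time, and here is the general line of reasoning: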

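Environment

The environment is an Oracle Arm build installed in a virtual machine on my own macOS machine. The version is 19.19, and no other patches are installed.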

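Reproducing the failure

Deconfiguring the cluster

To reproduce the whole failure, I first deconfigured the environment once. Note the key options -lastnode -force, which mean that deconfig removes the last remaining cluster configuration.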

[root@arm01 install]# ./rootcrs.sh  -deconfig -lastnode -force
.....
2025/09/02 22:06:11 CLSRSC-558: failed to deconfigure ASM
2025/09/02 22:06:11 CLSRSC-651: One or more deconfiguration steps failed, but the deconfiguration process continued because the -force option was specified.
Redirecting to /bin/systemctl restart rsyslog.service
2025/09/02 22:06:39 CLSRSC-4006: Removing Oracle Trace File Analyzer (TFA) Collector.
2025/09/02 22:08:29 CLSRSC-4007: Successfully removed Oracle Trace File Analyzer (TFA) Collector.
2025/09/02 22:09:00 CLSRSC-336: Successfully deconfigured Oracle Clusterware stack on this node
2025/09/02 22:09:00 CLSRSC-559: Ensure that the GPnP profile data under the 'gpnp' directory in /oracle/app/19.3.0/grid is deleted on each node before using the software in the current Grid Infrastructure home for reconfiguration.

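Ignoring the log lines in the middle, the final success message (CLSRSC-336) shows that the cluster was deconfigured successfully.

Running the root.sh script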

[root@arm01 install]# /oracle/app/19.3.0/grid/root.sh

2025/09/02 23:05:50 CLSRSC-594: Executing installation step 16 of 19: 'InitConfig'.
2025/09/02 23:06:36 CLSRSC-4002: Successfully installed Oracle Trace File Analyzer (TFA) Collector.

ASM has been created and started successfully.

[DBT-30022] Disk group arm_ocr mounted successfully.

2025/09/02 23:06:59 CLSRSC-428: Existing OCR configuration found, aborting the configuration. Rerun configuration setup after deinstall.
Died at /oracle/app/19.3.0/grid/crs/install/oraocr.pm line 1890.

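The earlier, normal part of the log has been omitted here. The key line is that ASM reports the disk group step as successful, which I took to mean that the old header information had been formatted, since otherwise the disk group could not have been created. Right after the disk group step completed, however, the CLSRSC-428 error was raised.

Here are the details from the trace log: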

2025-09-02 23:06:59: Executing the step [ocr_configFirstNode_step_2] to configure OCR on the first node
2025-09-02 23:06:59: Reuse Disk Group is set to 0
2025-09-02 23:06:59: Executing cmd: /oracle/app/19.3.0/grid/bin/ocrcheck -debug
2025-09-02 23:06:59: Command output:
>  Status of Oracle Cluster Registry is as follows :
>        Version                  :          4
>        Total space (kbytes)     :     901284
>        Used space (kbytes)      :      84400
>        Available space (kbytes) :     816884
>        ID                       : 1509093020
>        Device/File Name         :   +ARM_OCR
>                                      PROT-713: Device/File integrity check succeeded
>
>                                      PROT-710: Device/File not configured
>
>                                      PROT-710: Device/File not configured
>
>                                      PROT-710: Device/File not configured
>
>                                      PROT-710: Device/File not configured
>
>        PROT-707: Cluster registry integrity check succeeded
>
>        PROT-720: Logical corruption check succeeded
>
>End Command output
2025-09-02 23:06:59: checkOCR rc=0
2025-09-02 23:06:59: OCR check: passed
2025-09-02 23:06:59: Existing OCR configuration found, aborting the configuration. Rerun configuration setup after deinstall
2025-09-02 23:06:59: Executing cmd: /oracle/app/19.3.0/grid/bin/clsecho -p has -f clsrsc -m 428
2025-09-02 23:06:59: Executing cmd: /oracle/app/19.3.0/grid/bin/clsecho -p has -f clsrsc -m 428
2025-09-02 23:06:59: Command output:
>  CLSRSC-428: Existing OCR configuration found, aborting the configuration. Rerun configuration setup after deinstall.
>End Command output
2025-09-02 23:06:59: CLSRSC-428: Existing OCR configuration found, aborting the configuration. Rerun configuration setup after deinstall.
2025-09-02 23:06:59: ###### Begin DIE Stack Trace ######
2025-09-02 23:06:59:     Package         File                 Line Calling
2025-09-02 23:06:59:     --------------- -------------------- ---- ----------
2025-09-02 23:06:59:  1: main            rootcrs.pl            358 crsutils::dietrap
2025-09-02 23:06:59:  2: oraClusterwareComp::oraocr oraocr.pm            1890 main::__ANON__
2025-09-02 23:06:59:  3: oraClusterwareComp::oraocr oraocr.pm            1836 oraClusterwareComp::oraocr::configureOCR
2025-09-02 23:06:59:  4: oraClusterwareComp::oraocr oraocr.pm             245 oraClusterwareComp::oraocr::configSteps
2025-09-02 23:06:59:  5: oraClusterwareComp oraClusterwareComp.pm   91 oraClusterwareComp::oraocr::configureFirstNode
2025-09-02 23:06:59:  6: crsinstall      crsinstall.pm        2586 oraClusterwareComp::configureCurrentNode
2025-09-02 23:06:59:  7: crsinstall      crsinstall.pm        2427 crsinstall::perform_initial_config
2025-09-02 23:06:59:  8: crsinstall      crsinstall.pm        1085 crsinstall::perform_init_config
2025-09-02 23:06:59:  9: crsinstall      crsinstall.pm        1243 crsinstall::init_config
2025-09-02 23:06:59: 10: crsinstall      crsinstall.pm         487 crsinstall::CRSInstall
2025-09-02 23:06:59: 11: main            rootcrs.pl            559 crsinstall::new
2025-09-02 23:06:59: ####### End DIE Stack Trace #######

Note the following three lines:

2025-09-02 23:06:59: checkOCR rc=0
2025-09-02 23:06:59: OCR check: passed
2025-09-02 23:06:59: Existing OCR configuration found, aborting the configuration. Rerun configuration setup after deinstall

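checkOCR returned 0, which means the check passed, yet the "existing OCR configuration found" error was raised immediately afterwards. This looked a little odd at first. Read together with "Reuse Disk Group is set to 0", the log suggests that on the first node a passing OCR check is exactly what counts as "an OCR already exists", so the script aborts instead of overwriting it.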
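To see only the lines that drive this decision, the trace log can be filtered. The following is a minimal sketch, not an Oracle tool: the function name `ocr_decision_lines` is hypothetical, and the patterns are simply the message fragments quoted in the log above.

```shell
# Hypothetical helper (not part of Oracle's tooling): print the lines of a
# rootcrs trace log that show how the OCR configuration step reached its decision.
ocr_decision_lines() {
    log_file="$1"
    # Patterns are copied from the messages shown in the trace log above.
    grep -E 'Reuse Disk Group|checkOCR rc=|OCR check:|Existing OCR configuration found' "$log_file"
}

# Usage (the path is whatever root.sh printed as "The log of current session"):
#   ocr_decision_lines /path/to/rootcrs_<node>_<timestamp>.log
```

Seeing "OCR check: passed" directly followed by the abort message in the filtered output makes the ordering of the two events easier to spot than in the full trace.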

Analysis

Checking the disk group status

Disk Group Name  Fail Group         Path                              File Name                    Status                   Status         Status         TYPE      File Size (MB) Used Size (MB) Pct. Used
---------------- ------------------ --------------------------------- ---------------------------- ------------------------ -------------- -------------- --------- -------------- -------------- ---------
ARM_OCR          ARM_OCR_0000       /dev/nvme0n2                      ARM_OCR_0000                 MEMBER                   CACHED         ONLINE         REGULAR            5,120            320      6.25
                 ******************                                                                                                                                 -------------- --------------
                 TOTAL                                                                                                                                                       5,120            320

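The disk group status looks completely normal here: the three Status columns read MEMBER (header status), CACHED (mount status), and ONLINE (mode status).

Checking the disk group contents

Next, look at what is inside the disk group: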

[grid@arm01 ~]$ ocrconfig -showbackup


^CPROT-26: Oracle Cluster Registry backup locations were retrieved from a local copy

arm01     2025/08/22 02:08:51     +arm_ocr:/raccluster/OCRBACKUP/backup00.ocr.261.1209780531     0
arm01     2025/08/21 14:16:32     +arm_ocr:/raccluster/OCRBACKUP/backup01.ocr.258.1209737791     0
arm01     2025/08/21 10:16:31     +arm_ocr:/raccluster/OCRBACKUP/backup02.ocr.263.1209723391     0
arm01     2025/08/21 10:16:31     +arm_ocr:/raccluster/OCRBACKUP/day.ocr.259.1209723391     0
arm01     2025/08/21 10:16:31     +arm_ocr:/raccluster/OCRBACKUP/week.ocr.260.1209723391     0

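Even the earlier backup records are still listed? That left me a bit puzzled: where does this information come from? (The PROT-26 message says the backup locations were retrieved from a local copy, so this listing alone does not prove that the backup files still exist in the disk group.)

Root cause analysis

From the analysis above, a simple conclusion is that historical data still exists on the disk, and that this is what makes cluster initialization fail. But why did deconfig not fully format the disk group, and why could the disk group step still succeed afterwards? My rough guess is that in this version -lastnode only formats the disk group header and does not format the location of the cluster configuration files, so the cluster check reads the old cluster information and exits immediately. Two details in the logs are worth keeping in mind when weighing this guess: the deconfig run reported CLSRSC-558 (failed to deconfigure ASM) and only continued because of -force, and the failed root.sh run logged DBT-30022 (disk group mounted) whereas the successful run after the wipe logged DBT-30001 (disk groups created).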
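One way to test the "old data is still on the disk" hypothesis is to count non-zero bytes in different regions of the device before wiping it. The sketch below is read-only and assumes GNU dd on Linux. The function name `count_nonzero_bytes` and the offsets in the usage comments are illustrative assumptions; they do not reflect where ASM actually stores the OCR.

```shell
# Hypothetical helper: count the non-zero bytes in a window of a file or block
# device. Arguments: path, offset in MiB, length in MiB. Read-only.
count_nonzero_bytes() {
    path="$1"
    skip_mib="$2"
    len_mib="$3"
    # Read the window, delete every NUL byte, and count what is left.
    dd if="$path" bs=1M skip="$skip_mib" count="$len_mib" 2>/dev/null \
        | tr -d '\000' | wc -c | tr -d ' '
}

# Usage (as root; the offsets are arbitrary examples, not known OCR locations):
#   count_nonzero_bytes /dev/nvme0n2 0 1     # first MiB, where the disk header lives
#   count_nonzero_bytes /dev/nvme0n2 1 320   # the following 320 MiB
```

A result of 0 means the window is entirely zeroed. A large count beyond the first MiB after deconfig would be consistent with leftover data, although it cannot say which files that data belongs to.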

Solution

Manually wipe the contents of the disk. It feels like going straight back to the 10g days of cleaning disks by hand.

[root@arm01 install]# dd if=/dev/zero of=/dev/nvme0n2 bs=8192 count=1000000
dd: error writing '/dev/nvme0n2': No space left on device
^C

^C^C^C
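The "No space left on device" message above is expected: 8192 x 1,000,000 bytes is roughly 8.2 GB, more than the 5,120 MB disk, so dd simply writes until the device is full. A variant that sizes the write from the device itself might look like the sketch below. It assumes GNU dd (for `iflag=count_bytes`) and util-linux `blockdev`. The function name `zero_fill` is hypothetical, and the operation is destructive, so the target path must be checked carefully.

```shell
# Hypothetical helper: overwrite an entire file or block device with zeros,
# using its real size so dd stops cleanly. DESTRUCTIVE: double-check the path.
zero_fill() {
    target="$1"
    if [ -b "$target" ]; then
        size_bytes=$(blockdev --getsize64 "$target")   # block device size in bytes
    else
        size_bytes=$(stat -c %s "$target")             # regular file, for dry runs
    fi
    # iflag=count_bytes (GNU dd) makes count= a byte count instead of a block count;
    # conv=notrunc keeps a regular file at its size, fsync flushes before exit.
    dd if=/dev/zero of="$target" bs=1M count="$size_bytes" \
        iflag=count_bytes conv=notrunc,fsync 2>/dev/null
}

# Usage (as root, only on a disk that holds nothing you need):
#   zero_fill /dev/nvme0n2
```

Trying the function on a scratch file first is a cheap way to confirm it behaves as intended before pointing it at a real device.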

After root.sh was run again, cluster initialization succeeded. The log is as follows:

[root@arm01 ~]# tail -1000f /oracle/app/19.3.0/grid/install/root_arm01_2025-09-02_23-17-34-060803682.log
Performing root user operation.

The following environment variables are set as:
    ORACLE_OWNER= grid
    ORACLE_HOME=  /oracle/app/19.3.0/grid
   Copying dbhome to /usr/local/bin ...
   Copying oraenv to /usr/local/bin ...
   Copying coraenv to /usr/local/bin ...

Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Relinking oracle with rac_on option
Using configuration parameter file: /oracle/app/19.3.0/grid/crs/install/crsconfig_params
The log of current session can be found at:
  /oracle/app/grid/crsdata/arm01/crsconfig/rootcrs_arm01_2025-09-02_11-17-34PM.log
2025/09/02 23:17:36 CLSRSC-594: Executing installation step 1 of 19: 'ValidateEnv'.
2025/09/02 23:17:36 CLSRSC-363: User ignored prerequisites during installation
2025/09/02 23:17:36 CLSRSC-594: Executing installation step 2 of 19: 'CheckFirstNode'.
2025/09/02 23:17:37 CLSRSC-594: Executing installation step 3 of 19: 'GenSiteGUIDs'.
2025/09/02 23:17:37 CLSRSC-594: Executing installation step 4 of 19: 'SetupOSD'.
Redirecting to /bin/systemctl restart rsyslog.service
2025/09/02 23:17:37 CLSRSC-594: Executing installation step 5 of 19: 'CheckCRSConfig'.
2025/09/02 23:17:37 CLSRSC-594: Executing installation step 6 of 19: 'SetupLocalGPNP'.
2025/09/02 23:17:41 CLSRSC-594: Executing installation step 7 of 19: 'CreateRootCert'.
2025/09/02 23:17:42 CLSRSC-594: Executing installation step 8 of 19: 'ConfigOLR'.


2025/09/02 23:17:50 CLSRSC-594: Executing installation step 9 of 19: 'ConfigCHMOS'.
2025/09/02 23:17:50 CLSRSC-594: Executing installation step 10 of 19: 'CreateOHASD'.
2025/09/02 23:17:51 CLSRSC-594: Executing installation step 11 of 19: 'ConfigOHASD'.
2025/09/02 23:17:51 CLSRSC-330: Adding Clusterware entries to file 'oracle-ohasd.service'
2025/09/02 23:18:05 CLSRSC-594: Executing installation step 12 of 19: 'SetupTFA'.
2025/09/02 23:18:05 CLSRSC-594: Executing installation step 13 of 19: 'InstallAFD'.
2025/09/02 23:18:05 CLSRSC-594: Executing installation step 14 of 19: 'InstallACFS'.
2025/09/02 23:18:25 CLSRSC-594: Executing installation step 15 of 19: 'InstallKA'.
2025/09/02 23:18:26 CLSRSC-594: Executing installation step 16 of 19: 'InitConfig'.
2025/09/02 23:19:11 CLSRSC-4002: Successfully installed Oracle Trace File Analyzer (TFA) Collector.

ASM has been created and started successfully.

[DBT-30001] Disk groups created successfully. Check /oracle/app/grid/cfgtoollogs/asmca/asmca-250902PM111854.log for details.

2025/09/02 23:19:37 CLSRSC-482: Running command: '/oracle/app/19.3.0/grid/bin/ocrconfig -upgrade grid oinstall'
CRS-4256: Updating the profile
Successful addition of voting disk 38a1f1f25b454f55bfbeeb3f52abb8e3.
Successfully replaced voting disk group with +arm_ocr.
CRS-4256: Updating the profile
CRS-4266: Voting file(s) successfully replaced
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   38a1f1f25b454f55bfbeeb3f52abb8e3 (/dev/nvme0n2) [ARM_OCR]
Located 1 voting disk(s).
2025/09/02 23:20:09 CLSRSC-594: Executing installation step 17 of 19: 'StartCluster'.
2025/09/02 23:21:17 CLSRSC-343: Successfully started Oracle Clusterware stack
2025/09/02 23:21:17 CLSRSC-594: Executing installation step 18 of 19: 'ConfigNode'.
2025/09/02 23:21:55 CLSRSC-594: Executing installation step 19 of 19: 'PostConfig'.
2025/09/02 23:22:04 CLSRSC-325: Configure Oracle Grid Infrastructure for a Cluster ... succeeded

------------------About the author-----------------------

Name: 黃廷忠

Currently at: Oracle China Advanced Services team

Previously at: OceanBase, 雲和恩墨, 東方龍馬, and others