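The week before last I shared "Troubleshooting: 19C RAC reports that the network cannot be found when rebuilding the cluster after changing the private network IP". After root.sh was re-run in that environment, the cluster still reported an error during initialization. Today I made a round trip to Chongqing and did not feel like reading in the evening, so I decided on the spur of the moment to analyze this failure. It took a little time, and here is the general line of reasoning.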
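Environment
This environment is the Arm build of Oracle, installed in a virtual machine on my own macOS machine. The version is 19.19, with no other patches installed.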
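Reproducing the failure
Deconfiguring the cluster
To reproduce the whole failure, I first deconfigured the environment once. Note the key options -lastnode -force, which mean that deconfig removes the last of the cluster configuration information.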
[root@arm01 install]# ./rootcrs.sh -deconfig -lastnode -force
.....
2025/09/02 22:06:11 CLSRSC-558: failed to deconfigure ASM
2025/09/02 22:06:11 CLSRSC-651: One or more deconfiguration steps failed, but the deconfiguration process continued because the -force option was specified.
Redirecting to /bin/systemctl restart rsyslog.service
2025/09/02 22:06:39 CLSRSC-4006: Removing Oracle Trace File Analyzer (TFA) Collector.
2025/09/02 22:08:29 CLSRSC-4007: Successfully removed Oracle Trace File Analyzer (TFA) Collector.
2025/09/02 22:09:00 CLSRSC-336: Successfully deconfigured Oracle Clusterware stack on this node
2025/09/02 22:09:00 CLSRSC-559: Ensure that the GPnP profile data under the 'gpnp' directory in /oracle/app/19.3.0/grid is deleted on each node before using the software in the current Grid Infrastructure home for reconfiguration.
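Skipping the log output in the middle, the final success message (CLSRSC-336) shows that the cluster was deconfigured successfully. Note, though, that CLSRSC-558 reports that the ASM deconfiguration step failed, and the process continued only because -force was specified.
Running the root.sh script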
[root@arm01 install]# /oracle/app/19.3.0/grid/root.sh
2025/09/02 23:05:50 CLSRSC-594: Executing installation step 16 of 19: 'InitConfig'.
2025/09/02 23:06:36 CLSRSC-4002: Successfully installed Oracle Trace File Analyzer (TFA) Collector.
ASM has been created and started successfully.
[DBT-30022] Disk group arm_ocr mounted successfully.
2025/09/02 23:06:59 CLSRSC-428: Existing OCR configuration found, aborting the configuration. Rerun configuration setup after deinstall.
Died at /oracle/app/19.3.0/grid/crs/install/oraocr.pm line 1890.
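The normal log output before this point has been omitted. The key line is that ASM reports the disk group as successfully set up (the message itself says "mounted"). I took this to mean that the original disk header had been formatted, because otherwise the disk group could not have been created. Yet right after the disk group step completed, CLSRSC-428 was raised.
Checking the detailed log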
2025-09-02 23:06:59: Executing the step [ocr_configFirstNode_step_2] to configure OCR on the first node
2025-09-02 23:06:59: Reuse Disk Group is set to 0
2025-09-02 23:06:59: Executing cmd: /oracle/app/19.3.0/grid/bin/ocrcheck -debug
2025-09-02 23:06:59: Command output:
> Status of Oracle Cluster Registry is as follows :
> Version : 4
> Total space (kbytes) : 901284
> Used space (kbytes) : 84400
> Available space (kbytes) : 816884
> ID : 1509093020
> Device/File Name : +ARM_OCR
> PROT-713: Device/File integrity check succeeded
>
> PROT-710: Device/File not configured
>
> PROT-710: Device/File not configured
>
> PROT-710: Device/File not configured
>
> PROT-710: Device/File not configured
>
> PROT-707: Cluster registry integrity check succeeded
>
> PROT-720: Logical corruption check succeeded
>
>End Command output
2025-09-02 23:06:59: checkOCR rc=0
2025-09-02 23:06:59: OCR check: passed
2025-09-02 23:06:59: Existing OCR configuration found, aborting the configuration. Rerun configuration setup after deinstall
2025-09-02 23:06:59: Executing cmd: /oracle/app/19.3.0/grid/bin/clsecho -p has -f clsrsc -m 428
2025-09-02 23:06:59: Executing cmd: /oracle/app/19.3.0/grid/bin/clsecho -p has -f clsrsc -m 428
2025-09-02 23:06:59: Command output:
> CLSRSC-428: Existing OCR configuration found, aborting the configuration. Rerun configuration setup after deinstall.
>End Command output
2025-09-02 23:06:59: CLSRSC-428: Existing OCR configuration found, aborting the configuration. Rerun configuration setup after deinstall.
2025-09-02 23:06:59: ###### Begin DIE Stack Trace ######
2025-09-02 23:06:59: Package File Line Calling
2025-09-02 23:06:59: --------------- -------------------- ---- ----------
2025-09-02 23:06:59: 1: main rootcrs.pl 358 crsutils::dietrap
2025-09-02 23:06:59: 2: oraClusterwareComp::oraocr oraocr.pm 1890 main::__ANON__
2025-09-02 23:06:59: 3: oraClusterwareComp::oraocr oraocr.pm 1836 oraClusterwareComp::oraocr::configureOCR
2025-09-02 23:06:59: 4: oraClusterwareComp::oraocr oraocr.pm 245 oraClusterwareComp::oraocr::configSteps
2025-09-02 23:06:59: 5: oraClusterwareComp oraClusterwareComp.pm 91 oraClusterwareComp::oraocr::configureFirstNode
2025-09-02 23:06:59: 6: crsinstall crsinstall.pm 2586 oraClusterwareComp::configureCurrentNode
2025-09-02 23:06:59: 7: crsinstall crsinstall.pm 2427 crsinstall::perform_initial_config
2025-09-02 23:06:59: 8: crsinstall crsinstall.pm 1085 crsinstall::perform_init_config
2025-09-02 23:06:59: 9: crsinstall crsinstall.pm 1243 crsinstall::init_config
2025-09-02 23:06:59: 10: crsinstall crsinstall.pm 487 crsinstall::CRSInstall
2025-09-02 23:06:59: 11: main rootcrs.pl 559 crsinstall::new
2025-09-02 23:06:59: ####### End DIE Stack Trace #######
Note the following three lines:
2025-09-02 23:06:59: checkOCR rc=0
2025-09-02 23:06:59: OCR check: passed
2025-09-02 23:06:59: Existing OCR configuration found, aborting the configuration. Rerun configuration setup after deinstall
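checkOCR returned 0, which means the check passed, yet the error saying that an OCR configuration already exists was reported immediately afterwards. This looked a little odd at first.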
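One way to read these three lines: during first-node configuration no OCR should exist yet, so an ocrcheck that succeeds and reports a device is itself the evidence of a leftover configuration. The Python sketch below illustrates that reading on the output above. The function names and the decision rule are my own assumptions for illustration; they are not taken from oraocr.pm.

```python
import re

# Matches the "Device/File Name : <location>" lines of ocrcheck output.
DEVICE_RE = re.compile(r"Device/File Name\s*:\s*(\S+)")


def ocr_locations(ocrcheck_output: str) -> list:
    """Return every OCR location that ocrcheck reported."""
    return DEVICE_RE.findall(ocrcheck_output)


def looks_like_existing_ocr(ocrcheck_rc: int, ocrcheck_output: str) -> bool:
    """Assumed rule: a passing check plus a reported location means
    an OCR from an earlier configuration is still readable."""
    return ocrcheck_rc == 0 and bool(ocr_locations(ocrcheck_output))


# Excerpt of the ocrcheck output quoted in the log above.
sample = """\
> Version : 4
> ID : 1509093020
> Device/File Name : +ARM_OCR
> PROT-713: Device/File integrity check succeeded
> PROT-710: Device/File not configured
"""

print(ocr_locations(sample))
print(looks_like_existing_ocr(0, sample))
```

With the excerpt above, the sketch finds the +ARM_OCR location and flags it as an existing configuration, which matches the CLSRSC-428 abort.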
Analysis
Checking the disk group status
Disk Group Name  Fail Group    Path          File Name     Header Status  Mount Status  Mode Status  TYPE     File Size (MB)  Used Size (MB)  Pct. Used
---------------  ------------  ------------  ------------  -------------  ------------  -----------  -------  --------------  --------------  ---------
ARM_OCR          ARM_OCR_0000  /dev/nvme0n2  ARM_OCR_0000  MEMBER         CACHED        ONLINE       REGULAR           5,120             320       6.25
***************                                                                                               --------------  --------------
TOTAL                                                                                                                  5,120             320
The disk group status looks completely normal here.
Checking the disk group contents
Next, look at what is inside the disk group.
[grid@arm01 ~]$ ocrconfig -showbackup
^CPROT-26: Oracle Cluster Registry backup locations were retrieved from a local copy
arm01 2025/08/22 02:08:51 +arm_ocr:/raccluster/OCRBACKUP/backup00.ocr.261.1209780531 0
arm01 2025/08/21 14:16:32 +arm_ocr:/raccluster/OCRBACKUP/backup01.ocr.258.1209737791 0
arm01 2025/08/21 10:16:31 +arm_ocr:/raccluster/OCRBACKUP/backup02.ocr.263.1209723391 0
arm01 2025/08/21 10:16:31 +arm_ocr:/raccluster/OCRBACKUP/day.ocr.259.1209723391 0
arm01 2025/08/21 10:16:31 +arm_ocr:/raccluster/OCRBACKUP/week.ocr.260.1209723391 0
Even the earlier backup information is still there? This left me a bit confused: where does this information come from?
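To make the age of these entries explicit, here is a small Python sketch that parses the ocrconfig -showbackup lines and compares their timestamps with the time deconfig finished (2025/09/02 22:09:00, taken from the CLSRSC-336 line). The parser and its names are hypothetical helpers, and it assumes the column layout shown above. Note also that PROT-26 says the list was retrieved from a local copy, so the listing alone does not prove that the files still exist inside the disk group.

```python
from datetime import datetime


def parse_backups(text: str) -> list:
    """Parse 'node date time location ...' rows into (node, timestamp, location)."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Backup rows have an ASM location (starting with '+') in the 4th column.
        if len(parts) >= 4 and parts[3].startswith("+"):
            ts = datetime.strptime(parts[1] + " " + parts[2], "%Y/%m/%d %H:%M:%S")
            rows.append((parts[0], ts, parts[3]))
    return rows


# Excerpt of the ocrconfig -showbackup output quoted above.
listing = """\
^CPROT-26: Oracle Cluster Registry backup locations were retrieved from a local copy
arm01 2025/08/22 02:08:51 +arm_ocr:/raccluster/OCRBACKUP/backup00.ocr.261.1209780531 0
arm01 2025/08/21 10:16:31 +arm_ocr:/raccluster/OCRBACKUP/week.ocr.260.1209723391 0
"""

# Time at which CLSRSC-336 reported the deconfig as finished.
deconfig_done = datetime(2025, 9, 2, 22, 9, 0)

backups = parse_backups(listing)
stale = [b for b in backups if b[1] < deconfig_done]
print(len(backups), len(stale))
```

Every backup entry predates the deconfig, so the listing describes the old cluster, not anything created by the new root.sh run.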
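Root cause analysis
From the analysis above, it is reasonable to conclude that historical information still present on the disk caused the error during cluster initialization. But why did deconfig not wipe the disk group completely, and why could the disk group still be set up normally afterwards? My rough guess is that in this version -lastnode only wipes the disk group header and not the location where the cluster configuration files are stored, so the cluster check reads the old cluster information and exits immediately.
Solution
Wipe the disk manually. It feels like going back to the 10g days, when disk contents had to be cleaned by hand.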
[root@arm01 install]# dd if=/dev/zero of=/dev/nvme0n2 bs=8192 count=1000000
dd: error writing '/dev/nvme0n2': No space left on device
^C
^C^C^C
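The "No space left on device" message is expected here and is not a sign of failure: the dd command asks for more bytes than the device holds, so it stops when it reaches the end of the disk. A quick arithmetic check in Python, assuming the 5,120 MB shown in the disk listing is the device size in MiB:

```python
MIB = 1024 * 1024


def dd_requested_bytes(bs: int, count: int) -> int:
    """Total bytes dd tries to write for a given block size and count."""
    return bs * count


def blocks_to_cover(device_bytes: int, bs: int) -> int:
    """Smallest block count that covers the whole device (ceiling division)."""
    return -(-device_bytes // bs)


requested = dd_requested_bytes(8192, 1_000_000)  # values from the dd command above
device = 5120 * MIB                              # assumed device size

print(requested, device, requested > device)
print(blocks_to_cover(device, 8192))
```

The request is about 8.19 GB against a device of roughly 5.37 GB, so the whole disk was overwritten before dd ran out of space.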
After running root.sh again, the cluster initialized successfully. The log is as follows:
[root@arm01 ~]# tail -1000f /oracle/app/19.3.0/grid/install/root_arm01_2025-09-02_23-17-34-060803682.log
Performing root user operation.
The following environment variables are set as:
ORACLE_OWNER= grid
ORACLE_HOME= /oracle/app/19.3.0/grid
Copying dbhome to /usr/local/bin ...
Copying oraenv to /usr/local/bin ...
Copying coraenv to /usr/local/bin ...
Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Relinking oracle with rac_on option
Using configuration parameter file: /oracle/app/19.3.0/grid/crs/install/crsconfig_params
The log of current session can be found at:
/oracle/app/grid/crsdata/arm01/crsconfig/rootcrs_arm01_2025-09-02_11-17-34PM.log
2025/09/02 23:17:36 CLSRSC-594: Executing installation step 1 of 19: 'ValidateEnv'.
2025/09/02 23:17:36 CLSRSC-363: User ignored prerequisites during installation
2025/09/02 23:17:36 CLSRSC-594: Executing installation step 2 of 19: 'CheckFirstNode'.
2025/09/02 23:17:37 CLSRSC-594: Executing installation step 3 of 19: 'GenSiteGUIDs'.
2025/09/02 23:17:37 CLSRSC-594: Executing installation step 4 of 19: 'SetupOSD'.
Redirecting to /bin/systemctl restart rsyslog.service
2025/09/02 23:17:37 CLSRSC-594: Executing installation step 5 of 19: 'CheckCRSConfig'.
2025/09/02 23:17:37 CLSRSC-594: Executing installation step 6 of 19: 'SetupLocalGPNP'.
2025/09/02 23:17:41 CLSRSC-594: Executing installation step 7 of 19: 'CreateRootCert'.
2025/09/02 23:17:42 CLSRSC-594: Executing installation step 8 of 19: 'ConfigOLR'.
2025/09/02 23:17:50 CLSRSC-594: Executing installation step 9 of 19: 'ConfigCHMOS'.
2025/09/02 23:17:50 CLSRSC-594: Executing installation step 10 of 19: 'CreateOHASD'.
2025/09/02 23:17:51 CLSRSC-594: Executing installation step 11 of 19: 'ConfigOHASD'.
2025/09/02 23:17:51 CLSRSC-330: Adding Clusterware entries to file 'oracle-ohasd.service'
2025/09/02 23:18:05 CLSRSC-594: Executing installation step 12 of 19: 'SetupTFA'.
2025/09/02 23:18:05 CLSRSC-594: Executing installation step 13 of 19: 'InstallAFD'.
2025/09/02 23:18:05 CLSRSC-594: Executing installation step 14 of 19: 'InstallACFS'.
2025/09/02 23:18:25 CLSRSC-594: Executing installation step 15 of 19: 'InstallKA'.
2025/09/02 23:18:26 CLSRSC-594: Executing installation step 16 of 19: 'InitConfig'.
2025/09/02 23:19:11 CLSRSC-4002: Successfully installed Oracle Trace File Analyzer (TFA) Collector.
ASM has been created and started successfully.
[DBT-30001] Disk groups created successfully. Check /oracle/app/grid/cfgtoollogs/asmca/asmca-250902PM111854.log for details.
2025/09/02 23:19:37 CLSRSC-482: Running command: '/oracle/app/19.3.0/grid/bin/ocrconfig -upgrade grid oinstall'
CRS-4256: Updating the profile
Successful addition of voting disk 38a1f1f25b454f55bfbeeb3f52abb8e3.
Successfully replaced voting disk group with +arm_ocr.
CRS-4256: Updating the profile
CRS-4266: Voting file(s) successfully replaced
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. ONLINE 38a1f1f25b454f55bfbeeb3f52abb8e3 (/dev/nvme0n2) [ARM_OCR]
Located 1 voting disk(s).
2025/09/02 23:20:09 CLSRSC-594: Executing installation step 17 of 19: 'StartCluster'.
2025/09/02 23:21:17 CLSRSC-343: Successfully started Oracle Clusterware stack
2025/09/02 23:21:17 CLSRSC-594: Executing installation step 18 of 19: 'ConfigNode'.
2025/09/02 23:21:55 CLSRSC-594: Executing installation step 19 of 19: 'PostConfig'.
2025/09/02 23:22:04 CLSRSC-325: Configure Oracle Grid Infrastructure for a Cluster ... succeeded
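For long root.sh logs like this one, a small script can confirm that every installation step was reached and that the final success message is present. This is a hypothetical helper, not an Oracle tool; it relies only on the CLSRSC-594 and CLSRSC-325 message formats visible in the log above.

```python
import re

STEP_RE = re.compile(
    r"CLSRSC-594: Executing installation step (\d+) of (\d+): '([^']+)'"
)


def installation_steps(log_text: str) -> list:
    """Return (step number, total steps, step name) for each CLSRSC-594 line."""
    return [(int(n), int(total), name) for n, total, name in STEP_RE.findall(log_text)]


def missing_steps(log_text: str) -> list:
    """List step numbers that never appear in the log."""
    steps = installation_steps(log_text)
    if not steps:
        return []
    total = steps[0][1]
    seen = {n for n, _, _ in steps}
    return [n for n in range(1, total + 1) if n not in seen]


def succeeded(log_text: str) -> bool:
    """True if the final CLSRSC-325 success message is present."""
    return "CLSRSC-325" in log_text and "succeeded" in log_text


# Last few lines of the successful run, copied from the log above.
tail = """\
2025/09/02 23:20:09 CLSRSC-594: Executing installation step 17 of 19: 'StartCluster'.
2025/09/02 23:21:17 CLSRSC-594: Executing installation step 18 of 19: 'ConfigNode'.
2025/09/02 23:21:55 CLSRSC-594: Executing installation step 19 of 19: 'PostConfig'.
2025/09/02 23:22:04 CLSRSC-325: Configure Oracle Grid Infrastructure for a Cluster ... succeeded
"""

print(installation_steps(tail))
print(missing_steps(tail))
print(succeeded(tail))
```

Fed the complete log instead of this excerpt, missing_steps should return an empty list for a clean run; for the failed run it would show everything after step 16 as missing.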
------------------About the author-----------------------
Name: 黃廷忠
Current employer: Oracle China Advanced Services team
Previous employers: OceanBase, 雲和恩墨, 東方龍馬, and others