博客 / 詳情

返回

技術分享 | Oracle 19c RAC環境下備庫node1 ADG異常、asm異常分析及處理

本文為墨天輪數據庫管理服務團隊第146期技術分享,內容原創,作者為技術顧問達世德,如需轉載請聯繫小墨(VX:modb666)並註明來源。如需查看更多文章可關注【墨天輪】公眾號。

一、問題描述

巡檢發現:客户系統數據庫備庫ADG數據同步異常,具體信息如下:

查看複製錯誤信息:
image.png

查看複製相關的進程狀態:
image.png

二、分析過程

2.1、查看主節點alert日誌:

發現用户認證錯誤:

image.png

2.2、驗證SYS用户密碼,遠程訪問:

主庫-備庫:

image.png

備庫-主庫:

image.png

2.3 在備庫上查詢數據最後的同步時間:

select max(CHECKPOINT_TIME) from v$datafile_header;

image.png

在同一時間發生了網絡故障,因此初步判斷ADG的數據同步應該是網絡故障引發。

2.4、備庫rac節點1無法獲取磁盤組信息

發現當前節點無法獲取到磁盤組的信息:

image.png

2.5、查看磁盤組狀態信息:

通過命令驗證asm服務狀態。發現磁盤組無法連接,但是asm服務是正常運行的

image.png

2.6、分析asm alert日誌信息

日誌顯示 ORA-04031錯誤碼,提示shared memory太小,無法分配shared\_pool\_size 。

image.png

2.7、查詢node1實例啓動時間:

實例在2024-07-21進行了重啓操作,當前無法通過sqlplus / as sysasm連接到asm。

image.png

在node2(asm正常)上查看sga信息:

image.png

2.8、查看密碼文件

確認密碼文件位置:

image.png

可以看到密碼文件存儲在磁盤組中,但是node1節點的磁盤組當前法識別,因此導致密碼認證錯誤,ADG出現同步異常。

因此:ADG數據同步異常根本是備節點的ASM磁盤組異常,而ADG同步的密碼文件存儲在ASM中,所以數據同步發生認證錯誤,導致的異常,因此解決辦法即恢復備節點的ASM狀態即可。

三、問題處理

3.1 先解決node1節點的asm異常問題

3.1.1嘗試重啓node1的asm服務

# 
root用户執行:
srvctl status asm 
srvctl stop asm 
srvctl start asm

image.png

可以看到,asm服務正常運行,且無法停止。

3.1.2 重啓node1 的集羣服務

停止node1集羣:(請選擇在業務低峯期(23:00)執行如下操作),操作日誌如下:

# 
確認服務信息:
[root@testdb1 ~]# su - grid
testdb1:/home/grid(+ASM1)$asmcmd lsdg
Connected to an idle instance.
ASMCMD-8102: no connection to Oracle ASM; command requires Oracle ASM to run
testdb1:/home/grid(+ASM1)$srvctl status asm
ASM is running on testdb1,testdb2
# 
開始停止節點集羣(使用root用户執行):
testdb1:/home/grid(+ASM1)$exit
logout
[root@testdb1 ~]# cd /oracle/app/19.0.0/grid/bin/
[root@testdb1 bin]# ./crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'testdb1'
CRS-2673: Attempting to stop 'ora.crsd' on 'testdb1'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on server 'testdb1'
CRS-2673: Attempting to stop 'ora.dsg.dsg.acfs' on 'testdb1'
CRS-2673: Attempting to stop 'ora.chad' on 'testdb1'
CRS-2673: Attempting to stop 'ora.testdbstd.db' on 'testdb1'
Error unmounting '/dsg'. Possible busy file system. Verify the logs.
Retrying unmount
CRS-2677: Stop of 'ora.testdbstd.db' on 'testdb1' succeeded
CRS-33673: Attempting to stop resource group 'ora.asmgroup' on server 'testdb1'
CRS-2679: Attempting to clean 'ora.OCR_VOTE.dg' on 'testdb1'
CRS-2679: Attempting to clean 'ora.ARCHIVEDG.dg' on 'testdb1'
CRS-2679: Attempting to clean 'ora.DATADG1.dg' on 'testdb1'
CRS-2679: Attempting to clean 'ora.DATADG2.dg' on 'testdb1'
CRS-2679: Attempting to clean 'ora.DSG.dg' on 'testdb1'
CRS-2679: Attempting to clean 'ora.MGMT.dg' on 'testdb1'
CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'testdb1'
CRS-2673: Attempting to stop 'ora.LISTENER2.lsnr' on 'testdb1'
CRS-2673: Attempting to stop 'ora.LISTENER3.lsnr' on 'testdb1'
CRS-2673: Attempting to stop 'ora.LISTENER_DG.lsnr' on 'testdb1'
CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'testdb1' succeeded
CRS-2677: Stop of 'ora.LISTENER2.lsnr' on 'testdb1' succeeded
CRS-2677: Stop of 'ora.LISTENER3.lsnr' on 'testdb1' succeeded
CRS-2677: Stop of 'ora.LISTENER_DG.lsnr' on 'testdb1' succeeded
CRS-2673: Attempting to stop 'ora.testdb1.vip' on 'testdb1'
CRS-5017: The resource action "ora.ARCHIVEDG.dg clean" encountered the following error: 
ORA-04031: unable to allocate 816 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","KGLHD")
. For details refer to "(:CLSN00106:)" in "/oracle/app/grid/diag/crs/testdb1/crs/trace/crsd_oraagent_grid.trc".
CRS-2677: Stop of 'ora.testdb1.vip' on 'testdb1' succeeded
CRS-5017: The resource action "ora.ARCHIVEDG.dg check" encountered the following error: 
ORA-04031: unable to allocate 816 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","KGLHD")
. For details refer to "(:CLSN00109:)" in "/oracle/app/grid/diag/crs/testdb1/crs/trace/crsd_oraagent_grid.trc".
CRS-2680: Clean of 'ora.ARCHIVEDG.dg' on 'testdb1' failed
CRS-5017: The resource action "ora.OCR_VOTE.dg clean" encountered the following error: 
ORA-04031: unable to allocate 5184 bytes of shared memory ("shared pool","unknown object","SO private sga","kss private so")
. For details refer to "(:CLSN00106:)" in "/oracle/app/grid/diag/crs/testdb1/crs/trace/crsd_oraagent_grid.trc".
CRS-5017: The resource action "ora.DSG.dg clean" encountered the following error: 
ORA-04031: unable to allocate 5184 bytes of shared memory ("shared pool","unknown object","SO private sga","kss private so")
. For details refer to "(:CLSN00106:)" in "/oracle/app/grid/diag/crs/testdb1/crs/trace/crsd_oraagent_grid.trc".
CRS-5017: The resource action "ora.DATADG2.dg clean" encountered the following error: 
ORA-04031: unable to allocate 5184 bytes of shared memory ("shared pool","unknown object","SO private sga","kss private so")
. For details refer to "(:CLSN00106:)" in "/oracle/app/grid/diag/crs/testdb1/crs/trace/crsd_oraagent_grid.trc".
CRS-5017: The resource action "ora.DATADG1.dg clean" encountered the following error: 
ORA-04031: unable to allocate 5184 bytes of shared memory ("shared pool","unknown object","SO private sga","kss private so")
. For details refer to "(:CLSN00106:)" in "/oracle/app/grid/diag/crs/testdb1/crs/trace/crsd_oraagent_grid.trc".
CRS-5017: The resource action "ora.MGMT.dg clean" encountered the following error: 
ORA-04031: unable to allocate 5184 bytes of shared memory ("shared pool","unknown object","SO private sga","kss private so")
. For details refer to "(:CLSN00106:)" in "/oracle/app/grid/diag/crs/testdb1/crs/trace/crsd_oraagent_grid.trc".
CRS-5017: The resource action "ora.MGMT.dg check" encountered the following error: 
ORA-04031: unable to allocate 816 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","KGLHD")
. For details refer to "(:CLSN00109:)" in "/oracle/app/grid/diag/crs/testdb1/crs/trace/crsd_oraagent_grid.trc".
CRS-2680: Clean of 'ora.MGMT.dg' on 'testdb1' failed
CRS-5017: The resource action "ora.DATADG2.dg check" encountered the following error: 
ORA-04031: unable to allocate 568 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","KKSSP")
. For details refer to "(:CLSN00109:)" in "/oracle/app/grid/diag/crs/testdb1/crs/trace/crsd_oraagent_grid.trc".
CRS-2680: Clean of 'ora.DATADG2.dg' on 'testdb1' failed
CRS-5017: The resource action "ora.DSG.dg check" encountered the following error: 
ORA-04031: unable to allocate 12312 bytes of shared memory ("shared pool","unknown object","KKSSP^28","kglseshtTable")
. For details refer to "(:CLSN00109:)" in "/oracle/app/grid/diag/crs/testdb1/crs/trace/crsd_oraagent_grid.trc".
CRS-2680: Clean of 'ora.DSG.dg' on 'testdb1' failed
CRS-5017: The resource action "ora.DATADG1.dg check" encountered the following error: 
ORA-04031: unable to allocate 12312 bytes of shared memory ("shared pool","unknown object","KKSSP^2346","kglseshtTable")
. For details refer to "(:CLSN00109:)" in "/oracle/app/grid/diag/crs/testdb1/crs/trace/crsd_oraagent_grid.trc".
CRS-2680: Clean of 'ora.DATADG1.dg' on 'testdb1' failed
CRS-5017: The resource action "ora.OCR_VOTE.dg check" encountered the following error: 
ORA-04031: unable to allocate 12312 bytes of shared memory ("shared pool","unknown object","KKSSP^803","kglseshtTable")
. For details refer to "(:CLSN00109:)" in "/oracle/app/grid/diag/crs/testdb1/crs/trace/crsd_oraagent_grid.trc".
CRS-2680: Clean of 'ora.OCR_VOTE.dg' on 'testdb1' failed
CRS-2675: Stop of 'ora.dsg.dsg.acfs' on 'testdb1' failed
CRS-2679: Attempting to clean 'ora.dsg.dsg.acfs' on 'testdb1'
CRS-2675: Stop of 'ora.chad' on 'testdb1' failed
CRS-2679: Attempting to clean 'ora.chad' on 'testdb1'
CRS-2681: Clean of 'ora.chad' on 'testdb1' succeeded
Error unmounting '/dsg'. Possible busy file system. Verify the logs.
CRS-2681: Clean of 'ora.dsg.dsg.acfs' on 'testdb1' succeeded
CRS-2673: Attempting to stop 'ora.DSG.DSG.advm' on 'testdb1'
CRS-2677: Stop of 'ora.DSG.DSG.advm' on 'testdb1' succeeded
CRS-2673: Attempting to stop 'ora.proxy_advm' on 'testdb1'
CRS-2677: Stop of 'ora.proxy_advm' on 'testdb1' succeeded
CRS-2672: Attempting to start 'ora.testdb1.vip' on 'testdb2'
CRS-2676: Start of 'ora.testdb1.vip' on 'testdb2' succeeded
CRS-2799: Failed to shut down resource 'ora.ASMNET1LSNR_ASM.lsnr' on 'testdb1'
CRS-2794: Shutdown of Cluster Ready Services-managed resources on 'testdb1' has failed
CRS-2675: Stop of 'ora.crsd' on 'testdb1' failed
CRS-2799: Failed to shut down resource 'ora.crsd' on 'testdb1'
CRS-2795: Shutdown of Oracle High Availability Services-managed resources on 'testdb1' has failed
CRS-4687: Shutdown command has completed with errors.
CRS-4000: Command Stop failed, or completed with errors.
[root@testdb1 bin]# 
// 由於 ASM實例下 shared_pool_size 不足,導致集羣停止報錯誤:
# 
登錄grid用户驗證集羣狀態
[root@testdb1 bin]# su - grid 
Last login: Tue Sep  9 23:11:19 CST 2025
testdb1:/home/grid(+ASM1)$srvctl status asm
ASM is running on testdb2
testdb1:/home/grid(+ASM1)$   
#
 再次停止集羣服務,這次已經能夠正常停止(原因是第一次停止時部分服務停止成功,釋放了共享內存)
[root@testdb1 bin]# ./crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'testdb1'
CRS-2673: Attempting to stop 'ora.crsd' on 'testdb1'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on server 'testdb1'
CRS-2792: Shutdown of Cluster Ready Services-managed resources on 'testdb1' has completed
CRS-2677: Stop of 'ora.crsd' on 'testdb1' succeeded
CRS-2673: Attempting to stop 'ora.asm' on 'testdb1'
CRS-2673: Attempting to stop 'ora.crf' on 'testdb1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'testdb1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'testdb1'
CRS-2677: Stop of 'ora.asm' on 'testdb1' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'testdb1' succeeded
CRS-2677: Stop of 'ora.crf' on 'testdb1' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'testdb1'
CRS-2677: Stop of 'ora.mdnsd' on 'testdb1' succeeded
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'testdb1' succeeded
CRS-2673: Attempting to stop 'ora.ctssd' on 'testdb1'
CRS-2673: Attempting to stop 'ora.evmd' on 'testdb1'
CRS-2677: Stop of 'ora.ctssd' on 'testdb1' succeeded
CRS-2677: Stop of 'ora.evmd' on 'testdb1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'testdb1'
CRS-2677: Stop of 'ora.cssd' on 'testdb1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'testdb1'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'testdb1'
CRS-2677: Stop of 'ora.gipcd' on 'testdb1' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'testdb1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'testdb1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
[root@testdb1 bin]# 

驗證asm服務狀態:

image.png

檢查集羣狀態:

image.png

啓動集羣:

image.png

再次檢查集羣狀態,確保每個服務處於online狀態:

image.png

檢查監聽,確保ASM和數據庫實例均正常註冊到監聽:

image.png

驗證asm狀態:

image.png

登錄數據庫,檢查磁盤組信息:

image.png

檢查ADG數據同步狀態:

此時可以看到,數據開始同步,且目前數據傳輸延遲1小時53分鐘,數據重放延遲15小時25分鐘。

image.png

0.png)此時數據同步完成,數據仍在重放中:

image.png

此時數據已經同步、重放均已經完成

image.png

創建臨時表,測試數據複製:

主節點創建表,並寫入數據後提交:

image.png

備庫進行查詢:

image.png

四、總結

此次故障為備機node1節點ASM實例中 shared\_pool\_size 共享池內存不足導致磁盤組異常,無法調用,ADG數據同步異常則是,ADG認證的密碼文件存儲在磁盤組中,備節點磁盤組丟失,導致認證錯誤,錯誤日誌如下:

image.png

需要解決asm實例下共享內存不足的問題,否則後期還可能出現同樣的錯誤。

查看當前的share內存配置:

image.png

建議增加 shared\_pool\_size 的值到2G,無需重啓實例。

SQL> alter system set shared_pool_size = 2048M scope=both;

墨天輪從樂知樂享的數據庫技術社區蓄勢出發,全面升級,提供多類型數據庫管理服務。墨天輪數據庫管理服務旨在為用户構建信賴可託付的數據庫環境,併為數據庫廠商提供中立的生態支持。
墨天輪數據庫服務官網:https://www.modb.pro/service

user avatar
0 位用戶收藏了這個故事!

發佈 評論

Some HTML is okay.