本文為墨天輪數據庫管理服務團隊第146期技術分享,內容原創,作者為技術顧問達世德,如需轉載請聯繫小墨(VX:modb666)並註明來源。如需查看更多文章可關注【墨天輪】公眾號。
一、問題描述
巡檢發現:客户系統數據庫備庫ADG數據同步異常,具體信息如下:
查看複製錯誤信息:
查看複製相關的進程狀態:
二、分析過程
2.1、查看主節點alert日誌:
發現用户認證錯誤:
2.2、驗證SYS用户密碼,遠程訪問:
主庫-備庫:
備庫-主庫:
2.3 在備庫上查詢數據最後的同步時間:
select max(CHECKPOINT_TIME) from v$datafile_header;
在同一時間發生了網絡故障,因此初步判斷ADG的數據同步應該是網絡故障引發。
2.4、備庫rac節點1無法獲取磁盤組信息
發現當前節點無法獲取到磁盤組的信息:
2.5、查看磁盤組狀態信息:
通過命令驗證asm服務狀態。發現磁盤組無法連接,但是asm服務是正常運行的
2.6、分析asm alert日誌信息
日誌顯示 ORA-04031錯誤碼,提示shared memory太小,無法分配shared\_pool\_size 。
2.7、查詢node1實例啓動時間:
實例在2024-07-21進行了重啓操作,當前無法通過sqlplus / as sysasm連接到asm。
在node2(asm正常)上查看sga信息:
2.8、查看密碼文件
確認密碼文件位置:
可以看到密碼文件存儲在磁盤組中,但是node1節點的磁盤組當前法識別,因此導致密碼認證錯誤,ADG出現同步異常。
因此:ADG數據同步異常根本是備節點的ASM磁盤組異常,而ADG同步的密碼文件存儲在ASM中,所以數據同步發生認證錯誤,導致的異常,因此解決辦法即恢復備節點的ASM狀態即可。
三、問題處理
3.1 先解決node1節點的asm異常問題
3.1.1嘗試重啓node1的asm服務
#
root用户執行:
srvctl status asm
srvctl stop asm
srvctl start asm
可以看到,asm服務正常運行,且無法停止。
3.1.2 重啓node1 的集羣服務
停止node1集羣:(請選擇在業務低峯期(23:00)執行如下操作),操作日誌如下:
#
確認服務信息:
[root@testdb1 ~]# su - grid
testdb1:/home/grid(+ASM1)$asmcmd lsdg
Connected to an idle instance.
ASMCMD-8102: no connection to Oracle ASM; command requires Oracle ASM to run
testdb1:/home/grid(+ASM1)$srvctl status asm
ASM is running on testdb1,testdb2
#
開始停止節點集羣(使用root用户執行):
testdb1:/home/grid(+ASM1)$exit
logout
[root@testdb1 ~]# cd /oracle/app/19.0.0/grid/bin/
[root@testdb1 bin]# ./crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'testdb1'
CRS-2673: Attempting to stop 'ora.crsd' on 'testdb1'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on server 'testdb1'
CRS-2673: Attempting to stop 'ora.dsg.dsg.acfs' on 'testdb1'
CRS-2673: Attempting to stop 'ora.chad' on 'testdb1'
CRS-2673: Attempting to stop 'ora.testdbstd.db' on 'testdb1'
Error unmounting '/dsg'. Possible busy file system. Verify the logs.
Retrying unmount
CRS-2677: Stop of 'ora.testdbstd.db' on 'testdb1' succeeded
CRS-33673: Attempting to stop resource group 'ora.asmgroup' on server 'testdb1'
CRS-2679: Attempting to clean 'ora.OCR_VOTE.dg' on 'testdb1'
CRS-2679: Attempting to clean 'ora.ARCHIVEDG.dg' on 'testdb1'
CRS-2679: Attempting to clean 'ora.DATADG1.dg' on 'testdb1'
CRS-2679: Attempting to clean 'ora.DATADG2.dg' on 'testdb1'
CRS-2679: Attempting to clean 'ora.DSG.dg' on 'testdb1'
CRS-2679: Attempting to clean 'ora.MGMT.dg' on 'testdb1'
CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'testdb1'
CRS-2673: Attempting to stop 'ora.LISTENER2.lsnr' on 'testdb1'
CRS-2673: Attempting to stop 'ora.LISTENER3.lsnr' on 'testdb1'
CRS-2673: Attempting to stop 'ora.LISTENER_DG.lsnr' on 'testdb1'
CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'testdb1' succeeded
CRS-2677: Stop of 'ora.LISTENER2.lsnr' on 'testdb1' succeeded
CRS-2677: Stop of 'ora.LISTENER3.lsnr' on 'testdb1' succeeded
CRS-2677: Stop of 'ora.LISTENER_DG.lsnr' on 'testdb1' succeeded
CRS-2673: Attempting to stop 'ora.testdb1.vip' on 'testdb1'
CRS-5017: The resource action "ora.ARCHIVEDG.dg clean" encountered the following error:
ORA-04031: unable to allocate 816 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","KGLHD")
. For details refer to "(:CLSN00106:)" in "/oracle/app/grid/diag/crs/testdb1/crs/trace/crsd_oraagent_grid.trc".
CRS-2677: Stop of 'ora.testdb1.vip' on 'testdb1' succeeded
CRS-5017: The resource action "ora.ARCHIVEDG.dg check" encountered the following error:
ORA-04031: unable to allocate 816 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","KGLHD")
. For details refer to "(:CLSN00109:)" in "/oracle/app/grid/diag/crs/testdb1/crs/trace/crsd_oraagent_grid.trc".
CRS-2680: Clean of 'ora.ARCHIVEDG.dg' on 'testdb1' failed
CRS-5017: The resource action "ora.OCR_VOTE.dg clean" encountered the following error:
ORA-04031: unable to allocate 5184 bytes of shared memory ("shared pool","unknown object","SO private sga","kss private so")
. For details refer to "(:CLSN00106:)" in "/oracle/app/grid/diag/crs/testdb1/crs/trace/crsd_oraagent_grid.trc".
CRS-5017: The resource action "ora.DSG.dg clean" encountered the following error:
ORA-04031: unable to allocate 5184 bytes of shared memory ("shared pool","unknown object","SO private sga","kss private so")
. For details refer to "(:CLSN00106:)" in "/oracle/app/grid/diag/crs/testdb1/crs/trace/crsd_oraagent_grid.trc".
CRS-5017: The resource action "ora.DATADG2.dg clean" encountered the following error:
ORA-04031: unable to allocate 5184 bytes of shared memory ("shared pool","unknown object","SO private sga","kss private so")
. For details refer to "(:CLSN00106:)" in "/oracle/app/grid/diag/crs/testdb1/crs/trace/crsd_oraagent_grid.trc".
CRS-5017: The resource action "ora.DATADG1.dg clean" encountered the following error:
ORA-04031: unable to allocate 5184 bytes of shared memory ("shared pool","unknown object","SO private sga","kss private so")
. For details refer to "(:CLSN00106:)" in "/oracle/app/grid/diag/crs/testdb1/crs/trace/crsd_oraagent_grid.trc".
CRS-5017: The resource action "ora.MGMT.dg clean" encountered the following error:
ORA-04031: unable to allocate 5184 bytes of shared memory ("shared pool","unknown object","SO private sga","kss private so")
. For details refer to "(:CLSN00106:)" in "/oracle/app/grid/diag/crs/testdb1/crs/trace/crsd_oraagent_grid.trc".
CRS-5017: The resource action "ora.MGMT.dg check" encountered the following error:
ORA-04031: unable to allocate 816 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","KGLHD")
. For details refer to "(:CLSN00109:)" in "/oracle/app/grid/diag/crs/testdb1/crs/trace/crsd_oraagent_grid.trc".
CRS-2680: Clean of 'ora.MGMT.dg' on 'testdb1' failed
CRS-5017: The resource action "ora.DATADG2.dg check" encountered the following error:
ORA-04031: unable to allocate 568 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","KKSSP")
. For details refer to "(:CLSN00109:)" in "/oracle/app/grid/diag/crs/testdb1/crs/trace/crsd_oraagent_grid.trc".
CRS-2680: Clean of 'ora.DATADG2.dg' on 'testdb1' failed
CRS-5017: The resource action "ora.DSG.dg check" encountered the following error:
ORA-04031: unable to allocate 12312 bytes of shared memory ("shared pool","unknown object","KKSSP^28","kglseshtTable")
. For details refer to "(:CLSN00109:)" in "/oracle/app/grid/diag/crs/testdb1/crs/trace/crsd_oraagent_grid.trc".
CRS-2680: Clean of 'ora.DSG.dg' on 'testdb1' failed
CRS-5017: The resource action "ora.DATADG1.dg check" encountered the following error:
ORA-04031: unable to allocate 12312 bytes of shared memory ("shared pool","unknown object","KKSSP^2346","kglseshtTable")
. For details refer to "(:CLSN00109:)" in "/oracle/app/grid/diag/crs/testdb1/crs/trace/crsd_oraagent_grid.trc".
CRS-2680: Clean of 'ora.DATADG1.dg' on 'testdb1' failed
CRS-5017: The resource action "ora.OCR_VOTE.dg check" encountered the following error:
ORA-04031: unable to allocate 12312 bytes of shared memory ("shared pool","unknown object","KKSSP^803","kglseshtTable")
. For details refer to "(:CLSN00109:)" in "/oracle/app/grid/diag/crs/testdb1/crs/trace/crsd_oraagent_grid.trc".
CRS-2680: Clean of 'ora.OCR_VOTE.dg' on 'testdb1' failed
CRS-2675: Stop of 'ora.dsg.dsg.acfs' on 'testdb1' failed
CRS-2679: Attempting to clean 'ora.dsg.dsg.acfs' on 'testdb1'
CRS-2675: Stop of 'ora.chad' on 'testdb1' failed
CRS-2679: Attempting to clean 'ora.chad' on 'testdb1'
CRS-2681: Clean of 'ora.chad' on 'testdb1' succeeded
Error unmounting '/dsg'. Possible busy file system. Verify the logs.
CRS-2681: Clean of 'ora.dsg.dsg.acfs' on 'testdb1' succeeded
CRS-2673: Attempting to stop 'ora.DSG.DSG.advm' on 'testdb1'
CRS-2677: Stop of 'ora.DSG.DSG.advm' on 'testdb1' succeeded
CRS-2673: Attempting to stop 'ora.proxy_advm' on 'testdb1'
CRS-2677: Stop of 'ora.proxy_advm' on 'testdb1' succeeded
CRS-2672: Attempting to start 'ora.testdb1.vip' on 'testdb2'
CRS-2676: Start of 'ora.testdb1.vip' on 'testdb2' succeeded
CRS-2799: Failed to shut down resource 'ora.ASMNET1LSNR_ASM.lsnr' on 'testdb1'
CRS-2794: Shutdown of Cluster Ready Services-managed resources on 'testdb1' has failed
CRS-2675: Stop of 'ora.crsd' on 'testdb1' failed
CRS-2799: Failed to shut down resource 'ora.crsd' on 'testdb1'
CRS-2795: Shutdown of Oracle High Availability Services-managed resources on 'testdb1' has failed
CRS-4687: Shutdown command has completed with errors.
CRS-4000: Command Stop failed, or completed with errors.
[root@testdb1 bin]#
// 由於 ASM實例下 shared_pool_size 不足,導致集羣停止報錯誤:
#
登錄grid用户驗證集羣狀態
[root@testdb1 bin]# su - grid
Last login: Tue Sep 9 23:11:19 CST 2025
testdb1:/home/grid(+ASM1)$srvctl status asm
ASM is running on testdb2
testdb1:/home/grid(+ASM1)$
#
再次停止集羣服務,這次已經能夠正常停止(原因是第一次停止時部分服務停止成功,釋放了共享內存)
[root@testdb1 bin]# ./crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'testdb1'
CRS-2673: Attempting to stop 'ora.crsd' on 'testdb1'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on server 'testdb1'
CRS-2792: Shutdown of Cluster Ready Services-managed resources on 'testdb1' has completed
CRS-2677: Stop of 'ora.crsd' on 'testdb1' succeeded
CRS-2673: Attempting to stop 'ora.asm' on 'testdb1'
CRS-2673: Attempting to stop 'ora.crf' on 'testdb1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'testdb1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'testdb1'
CRS-2677: Stop of 'ora.asm' on 'testdb1' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'testdb1' succeeded
CRS-2677: Stop of 'ora.crf' on 'testdb1' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'testdb1'
CRS-2677: Stop of 'ora.mdnsd' on 'testdb1' succeeded
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'testdb1' succeeded
CRS-2673: Attempting to stop 'ora.ctssd' on 'testdb1'
CRS-2673: Attempting to stop 'ora.evmd' on 'testdb1'
CRS-2677: Stop of 'ora.ctssd' on 'testdb1' succeeded
CRS-2677: Stop of 'ora.evmd' on 'testdb1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'testdb1'
CRS-2677: Stop of 'ora.cssd' on 'testdb1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'testdb1'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'testdb1'
CRS-2677: Stop of 'ora.gipcd' on 'testdb1' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'testdb1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'testdb1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
[root@testdb1 bin]#
驗證asm服務狀態:
檢查集羣狀態:
啓動集羣:
再次檢查集羣狀態,確保每個服務處於online狀態:
檢查監聽,確保ASM和數據庫實例均正常註冊到監聽:
驗證asm狀態:
登錄數據庫,檢查磁盤組信息:
檢查ADG數據同步狀態:
此時可以看到,數據開始同步,且目前數據傳輸延遲1小時53分鐘,數據重放延遲15小時25分鐘。
0.png)此時數據同步完成,數據仍在重放中:
此時數據已經同步、重放均已經完成
創建臨時表,測試數據複製:
主節點創建表,並寫入數據後提交:
備庫進行查詢:
四、總結
此次故障為備機node1節點ASM實例中 shared\_pool\_size 共享池內存不足導致磁盤組異常,無法調用,ADG數據同步異常則是,ADG認證的密碼文件存儲在磁盤組中,備節點磁盤組丟失,導致認證錯誤,錯誤日誌如下:
需要解決asm實例下共享內存不足的問題,否則後期還可能出現同樣的錯誤。
查看當前的share內存配置:
建議增加 shared\_pool\_size 的值到2G,無需重啓實例。
SQL> alter system set shared_pool_size = 2048M scope=both;
墨天輪從樂知樂享的數據庫技術社區蓄勢出發,全面升級,提供多類型數據庫管理服務。墨天輪數據庫管理服務旨在為用户構建信賴可託付的數據庫環境,併為數據庫廠商提供中立的生態支持。
墨天輪數據庫服務官網:https://www.modb.pro/service