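At one customer site, the DN nodes of a GBase 8c distributed cluster kept crashing for no apparent reason and produced a large number of core dump files. The problem is analyzed from several angles below: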

1. The OS messages log shows a large number of errors:

Dec 16 00:04:01 GBase-111-cn13 systemd[1]: session-574301.scope: Succeeded.
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.926327] XFS (dm-4): Metadata CRC error detected at xfs_allocbt_read_verify+0x15/0xc0 [xfs], xfs_allocbt block 0x3d1cb7dc0 
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.927762] XFS (dm-4): Unmount and run xfs_repair
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.928504] XFS (dm-4): First 128 bytes of corrupted metadata buffer:
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.929348] 00000000a379a98d: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.930079] 0000000092508aef: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.930960] 00000000d64520b1: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.931575] 000000001bf7b614: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.932404] 000000004a5e75b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.933297] 000000007a59f66e: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.933945] 00000000f27de469: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.934679] 000000004bdfce18: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.935370] XFS (dm-4): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x3d1cb7dc0 len 8 error 74
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.935998] XFS (dm-4): page discard on page 0000000022e5a8e0, inode 0x18169b81d, offset 982663168.
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.936830] XFS (dm-4): Metadata CRC error detected at xfs_allocbt_read_verify+0x15/0xc0 [xfs], xfs_allocbt block 0x3d1cb7dc0 
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.938101] XFS (dm-4): Unmount and run xfs_repair
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.938753] XFS (dm-4): First 128 bytes of corrupted metadata buffer:
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.939466] 00000000a379a98d: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.940123] 0000000092508aef: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.940754] 00000000d64520b1: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.941364] 000000001bf7b614: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.941958] 000000004a5e75b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.942548] 000000007a59f66e: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.943120] 00000000f27de469: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.943687] 000000004bdfce18: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.944276] XFS (dm-4): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x3d1cb7dc0 len 8 error 74
Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.944911] XFS (dm-4): page discard on page 00000000349b82b9, inode 0x18169b81d, offset 982667264.
Dec 16 00:04:51 GBase-111-cn13 kernel: [20908993.422287] XFS (dm-4): Metadata CRC error detected at xfs_allocbt_read_verify+0x15/0xc0 [xfs], xfs_allocbt block 0x3d1cb7dc0 
Dec 16 00:04:51 GBase-111-cn13 kernel: [20908993.423623] XFS (dm-4): Unmount and run xfs_repair
Dec 16 00:04:51 GBase-111-cn13 kernel: [20908993.424320] XFS (dm-4): First 128 bytes of corrupted metadata buffer:
Dec 16 00:04:51 GBase-111-cn13 kernel: [20908993.425158] 0000000063bfb8df: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:51 GBase-111-cn13 kernel: [20908993.425814] 000000007ae16eb9: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:51 GBase-111-cn13 kernel: [20908993.426524] 000000001a89d43c: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:51 GBase-111-cn13 kernel: [20908993.427138] 000000006426849e: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:51 GBase-111-cn13 kernel: [20908993.427732] 0000000034c14767: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:51 GBase-111-cn13 kernel: [20908993.428324] 000000003b13ad6f: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:51 GBase-111-cn13 kernel: [20908993.428900] 000000000de6e352: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:51 GBase-111-cn13 kernel: [20908993.429467] 0000000086a48967: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:51 GBase-111-cn13 kernel: [20908993.430054] XFS (dm-4): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x3d1cb7dc0 len 8 error 74
Dec 16 00:04:51 GBase-111-cn13 kernel: [20908993.430675] XFS (dm-4): page discard on page 0000000022e5a8e0, inode 0x18169b81d, offset 982663168.
Dec 16 00:04:51 GBase-111-cn13 kernel: [20908993.432476] XFS (dm-4): Metadata CRC error detected at xfs_allocbt_read_verify+0x15/0xc0 [xfs], xfs_allocbt block 0x3d1cb7dc0 
Dec 16 00:04:51 GBase-111-cn13 kernel: [20908993.433742] XFS (dm-4): Unmount and run xfs_repair
Dec 16 00:04:51 GBase-111-cn13 kernel: [20908993.434399] XFS (dm-4): First 128 bytes of corrupted metadata buffer:
Dec 16 00:04:51 GBase-111-cn13 kernel: [20908993.435049] 0000000063bfb8df: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:52 GBase-111-cn13 kernel: [20908993.435701] 000000007ae16eb9: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:52 GBase-111-cn13 kernel: [20908993.436341] 000000001a89d43c: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:52 GBase-111-cn13 kernel: [20908993.436947] 000000006426849e: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:52 GBase-111-cn13 kernel: [20908993.437540] 0000000034c14767: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:52 GBase-111-cn13 kernel: [20908993.438130] 000000003b13ad6f: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:52 GBase-111-cn13 kernel: [20908993.438696] 000000000de6e352: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:52 GBase-111-cn13 kernel: [20908993.439268] 0000000086a48967: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Dec 16 00:04:52 GBase-111-cn13 kernel: [20908993.439852] XFS (dm-4): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x3d1cb7dc0 len 8 error 74
Dec 16 00:04:52 GBase-111-cn13 kernel: [20908993.440466] XFS (dm-4): page discard on page 00000000349b82b9, inode 0x18169b81d, offset 982667264.

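The keyword that identifies the disk filesystem error is: metadata I/O error in "xfs_trans_read_buf_map". Every occurrence reports the same device (dm-4) and the same block address (daddr 0x3d1cb7dc0), and the kernel itself advises "Unmount and run xfs_repair".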
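When the messages log is large, it can help to count these signatures per device and block address instead of reading them by eye. The following is a minimal sketch, not a vendor tool: the function name `summarize_xfs_errors` and the `SAMPLE` list are illustrative assumptions, and the regular expressions are written only against the kernel lines quoted above. On Linux, error 74 is EBADMSG, which XFS uses to report a bad metadata CRC.

```python
import re
from collections import Counter

# Patterns derived from the kernel messages quoted in this article.
CRC_RE = re.compile(
    r"XFS \((?P<dev>[^)]+)\): Metadata CRC error detected at (?P<verifier>\w+)"
)
IO_RE = re.compile(
    r'XFS \((?P<dev>[^)]+)\): metadata I/O error in "(?P<func>\w+)" '
    r"at daddr (?P<daddr>0x[0-9a-f]+) len (?P<len>\d+) error (?P<err>\d+)"
)


def summarize_xfs_errors(lines):
    """Count CRC errors per (device, verifier) and I/O errors per (device, daddr, errno)."""
    crc = Counter()
    io = Counter()
    for line in lines:
        m = CRC_RE.search(line)
        if m:
            crc[(m["dev"], m["verifier"])] += 1
            continue
        m = IO_RE.search(line)
        if m:
            io[(m["dev"], m["daddr"], int(m["err"]))] += 1
    return crc, io


# Three lines copied from the log above; in practice, pass an open file object
# for the messages log instead (hypothetical usage, path depends on the OS).
SAMPLE = [
    'Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.926327] XFS (dm-4): Metadata CRC error detected at xfs_allocbt_read_verify+0x15/0xc0 [xfs], xfs_allocbt block 0x3d1cb7dc0 ',
    'Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.927762] XFS (dm-4): Unmount and run xfs_repair',
    'Dec 16 00:04:12 GBase-111-cn13 kernel: [20908953.935370] XFS (dm-4): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x3d1cb7dc0 len 8 error 74',
]

if __name__ == "__main__":
    crc_counts, io_counts = summarize_xfs_errors(SAMPLE)
    for key, n in crc_counts.items():
        print("CRC error", key, n)
    for key, n in io_counts.items():
        print("I/O error", key, n)
```

If all I/O errors collapse onto a single (device, daddr) pair, as they do in the log above, that points to one damaged metadata block rather than widespread media failure, although only a filesystem check can confirm this.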

2. The DN node log also contains a large number of errors:

2025-12-15 22:52:03.102 69402090.11242 [unknown] [unknown] 140309750085376 dn13_1 0 dn13 00000 localhost 0 [DBL_WRT] LOG:  Date page recovered: buf_tag[rel 1663/8623351/53849552 blk 7139 fork 0]
2025-12-15 22:52:03.102 69402090.11242 [unknown] [unknown] 140309750085376 dn13_1 0 dn13 00000 localhost 0 [DBL_WRT] LOG:  Date page recovered: buf_tag[rel 1663/8623351/53849552 blk 7146 fork 0]
2025-12-15 22:52:03.102 69402090.11242 [unknown] [unknown] 140309750085376 dn13_1 0 dn13 00000 localhost 0 [DBL_WRT] LOG:  Date page recovered: buf_tag[rel 1663/8623351/53849552 blk 7148 fork 0]
2025-12-15 22:52:03.102 69402090.11242 [unknown] [unknown] 140309750085376 dn13_1 0 dn13 00000 localhost 0 [DBL_WRT] LOG:  Date page recovered: buf_tag[rel 1663/8623351/53849552 blk 7149 fork 0]
2025-12-15 22:52:03.102 69402090.11242 [unknown] [unknown] 140309750085376 dn13_1 0 dn13 00000 localhost 0 [DBL_WRT] LOG:  Date page recovered: buf_tag[rel 1663/8623351/53850317 blk 7149 fork 0]
2025-12-15 22:52:03.102 69402090.11242 [unknown] [unknown] 140309750085376 dn13_1 0 dn13 00000 localhost 0 [DBL_WRT] LOG:  Date page recovered: buf_tag[rel 1663/8623351/53850317 blk 7150 fork 0]
2025-12-15 22:52:03.124 69402084.1 [unknown] [unknown] 140364426257152 [unknown] 0 dn13 01000 localhost 0 [BACKEND] WARNING:  could not fork new process for connection due to PMstate PM_STARTUP
2025-12-15 22:52:03.155 69402090.11242 [unknown] [unknown] 140309750085376 dn13_1 0 dn13 00000 localhost 0 [DBL_WRT] LOG:  DW recovery state: "Batch end", file start page[dwn 31046, start 1], now access page 15783, current [page_id 15784, dwn 31046, checksum verify res is 1, page_num orig 0, page_num fixed 0]
2025-12-15 22:52:03.163 69402090.11242 [unknown] [unknown] 140309750085376 dn13_1 0 dn13 42809 localhost 0 [BACKEND] PANIC:  could not fsync file "base/8623351/53723907.3": Structure needs cleaning
2025-12-15 22:52:03.163 69402090.11242 [unknown] [unknown] 140309750085376 dn13_1 0 dn13 42809 localhost 0 [BACKEND] BACKTRACELOG:  tid[3077292]'s backtrace:
	/data1/soft/app/bin/gaussdb(+0xf30932) [0x5630af530932]
	/data1/soft/app/bin/gaussdb(_Z9errfinishiz+0x391) [0x5630af526ae1]
	/data1/soft/app/bin/gaussdb(_Z19ProcessSyncRequestsv+0x5c5) [0x5630b02c40a5]
	/data1/soft/app/bin/gaussdb(+0x19c7110) [0x5630affc7110]
	/data1/soft/app/bin/gaussdb(+0x19c7d5b) [0x5630affc7d5b]
	/data1/soft/app/bin/gaussdb(_Z14dw_enable_initv+0x32) [0x5630affcc122]
	/data1/soft/app/bin/gaussdb(_Z7dw_initb+0x105) [0x5630affcc335]
	/data1/soft/app/bin/gaussdb(_Z11StartupXLOGv+0x17cb) [0x5630b000dcbb]
	/data1/soft/app/bin/gaussdb(_Z18StartupProcessMainv+0x260) [0x5630afad54e0]
	/data1/soft/app/bin/gaussdb(_Z26GaussDbAuxiliaryThreadMainIL15knl_thread_role24EEiP14knl_thread_arg+0x11a) [0x5630afacac4a]
	/data1/soft/app/bin/gaussdb(_Z17GaussDbThreadMainIL15knl_thread_role24EEiP14knl_thread_arg+0x2fa) [0x5630afacaf5a]
	/data1/soft/app/bin/gaussdb(+0x14a1e55) [0x5630afaa1e55]
	/lib64/libpthread.so.0(+0x8f1b) [0x7fa925602f1b]
	/lib64/libc.so.6(clone+0x3f) [0x7fa92553a33f]
	Use addr2line to get pretty function name and line

A PANIC-level error appears during startup recovery: PANIC:  could not fsync file "base/8623351/53723907.3": Structure needs cleaning
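The PANIC line ties the database crash back to the filesystem: "Structure needs cleaning" is the Linux error text for EUCLEAN (errno 117), which XFS returns when it finds corrupted metadata. A minimal sketch for pulling such lines out of a DN log follows; the function name `find_fsync_panics` and the `DN_SAMPLE` list are illustrative assumptions, and the pattern is written only against the log line quoted above.

```python
import re

# Matches the PANIC message format seen in the DN log quoted in this article.
PANIC_RE = re.compile(
    r'PANIC:\s+could not fsync file "(?P<path>[^"]+)": (?P<reason>.+)$'
)


def find_fsync_panics(lines):
    """Return (timestamp, file path, OS error text) for each fsync PANIC line."""
    hits = []
    for line in lines:
        m = PANIC_RE.search(line)
        if m:
            # The first 23 characters hold the timestamp, e.g. "2025-12-15 22:52:03.163".
            hits.append((line[:23], m["path"], m["reason"].strip()))
    return hits


# One line copied from the DN log above.
DN_SAMPLE = [
    '2025-12-15 22:52:03.163 69402090.11242 [unknown] [unknown] 140309750085376 dn13_1 0 dn13 42809 localhost 0 [BACKEND] PANIC:  could not fsync file "base/8623351/53723907.3": Structure needs cleaning',
]

if __name__ == "__main__":
    for ts, path, reason in find_fsync_panics(DN_SAMPLE):
        print(ts, path, reason)
```

A reason of "Structure needs cleaning" (or "Input/output error") on an fsync failure suggests looking at the OS log for the same time window before suspecting the database itself.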

3. The coredump stack was parsed as follows:

[Figure: GBase 8c filesystem corruption fault case, screenshot of the parsed coredump stack]

The parsed stack, by contrast, does not make the root cause easy to see.

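As the analysis above shows, to locate this kind of problem you should start from the OS messages log and the DN node's runtime log and look for the key error messages. To fix it, first repair the filesystem corruption reported by the OS (the kernel log itself says "Unmount and run xfs_repair"), then restart the node and rebuild it; this restores the cluster node.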