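I: Background
1. The story
I've been out delivering takeout lately and it feels like ages since I last wrote anything. Today's post is another crash-type production incident. An old friend reached out on WeChat and asked me to find out why his program was crashing. With the dump in hand, we can dig straight in.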
II: Crash analysis
1. Why did it crash
Double-click the dump file to open it and WinDbg prints an overview of the crash:
Executable search path is:
Windows 10 Version 17763 MP (48 procs) Free x64
Product: Server, suite: TerminalServer DataCenter SingleUserTS
Edition build lab: 17763.1.amd64fre.rs5_release.180914-1434
Debug session time: Fri Oct 31 17:38:42.000 2025 (UTC + 8:00)
System Uptime: 14 days 2:42:29.643
Process Uptime: 0 days 0:00:58.000
................................................................
.......................................
Loading unloaded module list
.
This dump file has an exception of interest stored in it.
The stored exception information can be accessed via .ecxr.
(5a74.6250): Unknown exception - code c0000374 (first/second chance not available)
For analysis of this file, run !analyze -v
ntdll!NtWaitForMultipleObjects+0x14:
00007ffe`57baf0e4 c3 ret
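The output shows the exception code c0000374, which is STATUS_HEAP_CORRUPTION, meaning the NT heap has been corrupted. That narrows the search down considerably.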
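To make the reading of c0000374 a little more concrete, here is a minimal C sketch that splits an NTSTATUS value into its severity, facility and code fields and looks up a name. The helper names `split_ntstatus` and `status_name` are made up for this illustration, and the lookup table deliberately covers only a handful of well-known codes.

```c
#include <stdint.h>

/* Fields of a 32-bit NTSTATUS value: severity (bits 30-31),
   facility (bits 16-27) and code (bits 0-15). */
typedef struct {
    unsigned severity;  /* 0 success, 1 informational, 2 warning, 3 error */
    unsigned facility;
    unsigned code;
} StatusParts;

/* Hypothetical helper: break a status value into its fields. */
static StatusParts split_ntstatus(uint32_t status)
{
    StatusParts parts;
    parts.severity = (status >> 30) & 0x3u;
    parts.facility = (status >> 16) & 0xFFFu;
    parts.code     = status & 0xFFFFu;
    return parts;
}

/* Hypothetical helper: illustrative lookup for a few common crash codes. */
static const char *status_name(uint32_t status)
{
    switch (status) {
    case 0xC0000005u: return "STATUS_ACCESS_VIOLATION";
    case 0xC00000FDu: return "STATUS_STACK_OVERFLOW";
    case 0xC0000374u: return "STATUS_HEAP_CORRUPTION";
    default:          return "(not in this sketch's table)";
    }
}
```

For 0xC0000374 the severity field is 3 (error), the facility is 0 and the code is 0x374, which is why the debugger reports it as an exception rather than a warning.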
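2. Why is the NT heap corrupted
So why did the NT heap get corrupted? Use .ecxr to switch to the context at the time of the crash, then look at the call stack to see what the thread was doing.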
0:032> .ecxr
0:032> k
*** Stack trace for last set context - .thread/.cxr resets it
# Child-SP RetAddr Call Site
00 000000b4`8503ede0 00007ffe`57c0b313 ntdll!RtlReportFatalFailure+0x9
01 000000b4`8503ee30 00007ffe`57c13b9e ntdll!RtlReportCriticalFailure+0x97
02 000000b4`8503ef20 00007ffe`57c13eaa ntdll!RtlpHeapHandleError+0x12
03 000000b4`8503ef50 00007ffe`57bae109 ntdll!RtlpHpHeapHandleError+0x7a
04 000000b4`8503ef80 00007ffe`57bbbb0e ntdll!RtlpLogHeapFailure+0x45
05 000000b4`8503efb0 00007ffe`17d17b3f ntdll!RtlFreeHeap+0x9d3ce
06 000000b4`8503f050 00007ffe`541392af AcLayers!NS_FaultTolerantHeap::APIHook_RtlFreeHeap+0x41f
07 000000b4`8503f0b0 00007ffe`3773b17e KERNELBASE!LocalFree+0x2f
08 000000b4`8503f0f0 00007ffe`37661d12 mscorlib_ni+0x58b17e
09 000000b4`8503f1a0 00007ffd`e49fe127 mscorlib_ni!System.Runtime.InteropServices.Marshal.FreeHGlobal+0x22 [f:\dd\ndp\clr\src\BCL\system\runtime\interopservices\marshal.cs @ 1212]
...
0:032> !clrstack
OS Thread Id: 0x6250 (32)
Child SP IP Call Site
000000b48503f118 00007ffe57baf0e4 [InlinedCallFrame: 000000b48503f118] Microsoft.Win32.Win32Native.LocalFree(IntPtr)
000000b48503f118 00007ffe3773b17e [InlinedCallFrame: 000000b48503f118] Microsoft.Win32.Win32Native.LocalFree(IntPtr)
000000b48503f0f0 00007ffe3773b17e DomainNeutralILStubClass.IL_STUB_PInvoke(IntPtr)
000000b48503f1a0 00007ffe37661d12 System.Runtime.InteropServices.Marshal.FreeHGlobal(IntPtr) [f:\dd\ndp\clr\src\BCL\system\runtime\interopservices\marshal.cs @ 1212]
000000b48503f1e0 00007ffde49fe127 b.B+A.MoveNext()
000000b48503f240 00007ffe376b3423 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean) [f:\dd\ndp\clr\src\BCL\system\threading\executioncontext.cs @ 954]
000000b48503f310 00007ffe376b32b4 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean) [f:\dd\ndp\clr\src\BCL\system\threading\executioncontext.cs @ 902]
...
000000b48503f5c0 00007ffde49fb04e DomainBoundILStubClass.IL_STUB_ReversePInvoke(Int32, Int32, Int64)
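The output clearly shows that the NT heap failure was triggered by a call to FreeHGlobal inside b.B+A.MoveNext. With enough experience, the sight of FreeHGlobal on a crashing stack should immediately suggest a double free, which is a classic problem.
3. What is a double free
A double free means releasing the same block twice. The Windows RtlFreeHeap function validates the state of the block being released and, when the block turns out to be free already, reports it as a fatal heap failure. Next you will probably want to know the address of that block. You can find it with !heap -s: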
0:032> !heap -s
************************************************************************************************************************
NT HEAP STATS BELOW
************************************************************************************************************************
Details:
Heap address: 0000028c75bb0000
Error address: 0000028c786018a0
Error type: HEAP_FAILURE_BLOCK_NOT_BUSY
Details: The caller performed an operation (such as a free
or a size check) that is illegal on a free block.
Follow-up: Check the error's stack trace to find the culprit.
Stack trace:
Stack trace at 0x00007ffe57c72848
00007ffe57bae109: ntdll!RtlpLogHeapFailure+0x45
00007ffe57bbbb0e: ntdll!RtlFreeHeap+0x9d3ce
00007ffe17d17b3f: AcLayers!NS_FaultTolerantHeap::APIHook_RtlFreeHeap+0x41f
00007ffe541392af: KERNELBASE!LocalFree+0x2f
00007ffe3773b17e: mscorlib_ni+0x58b17e
00007ffe37661d12: mscorlib_ni!System.Runtime.InteropServices.Marshal.FreeHGlobal+0x22
00007ffde49fe127: +0xe49fe127
LFH Key : 0x765363a7204cf973
Termination on corruption : ENABLED
Heap Flags Reserv Commit Virt Free List UCR Virt Lock Fast
(k) (k) (k) (k) length blocks cont. heap
-------------------------------------------------------------------------------------
0000028c75bb0000 00000002 17920 9256 16364 2120 214 5 1 a LFH
External fragmentation 23 % (214 free blocks)
0000028c75b40000 00008000 64 4 64 2 1 1 0 0
0000028c75de0000 00001002 2636 132 1080 20 5 2 0 0 LFH
0000028c76190000 00001002 4680 2268 3124 1420 40 3 0 0 LFH
External fragmentation 62 % (40 free blocks)
0000028c76130000 00001002 2636 472 1080 5 27 2 0 0 LFH
0000028c767f0000 00041002 60 8 60 5 1 1 0 0
0000028c77020000 00041002 60 16 60 2 2 1 0 0
-------------------------------------------------------------------------------------
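In the output, Error address: 0000028c786018a0 is the address of the offending block, while Heap address: 0000028c75bb0000 is the heap it belongs to. Next, run !heap -x 0000028c786018a0 to check the state of that block. It is indeed already free.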
0:032> !heap -x 0000028c786018a0
Entry User Heap Segment Size PrevSize Unused Flags
-------------------------------------------------------------------------------------------------------------
0000028c786018a0 0000028c786018b0 0000028c75bb0000 0000028c785c80d0 e0 - 0 LFH;free
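The idea behind HEAP_FAILURE_BLOCK_NOT_BUSY can be illustrated with a toy model. The sketch below is not how ntdll implements its heap. It is a small fixed pool, with made-up names (`ToyHeap`, `toy_alloc`, `toy_free`), in which every block carries a busy flag and a free on a block that is not busy is rejected.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: a fixed pool whose blocks carry a "busy" flag,
   loosely mirroring the busy/free state that !heap -x reports. */
enum { BLOCK_COUNT = 4 };

typedef enum {
    FREE_OK,              /* block was busy and is now free */
    FREE_BLOCK_NOT_BUSY,  /* block was already free: a double free */
    FREE_BAD_HANDLE       /* handle does not name a block in the pool */
} FreeResult;

typedef struct {
    bool busy[BLOCK_COUNT];
} ToyHeap;

/* Returns the index of a newly allocated block, or -1 if the pool is full. */
static int toy_alloc(ToyHeap *heap)
{
    for (int i = 0; i < BLOCK_COUNT; i++) {
        if (!heap->busy[i]) {
            heap->busy[i] = true;
            return i;
        }
    }
    return -1;
}

/* Releases a block, refusing to release one that is not currently busy. */
static FreeResult toy_free(ToyHeap *heap, int block)
{
    if (block < 0 || block >= BLOCK_COUNT)
        return FREE_BAD_HANDLE;
    if (!heap->busy[block])
        return FREE_BLOCK_NOT_BUSY;
    heap->busy[block] = false;
    return FREE_OK;
}
```

The model also hints at why this kind of bug tends to be intermittent. If the block is handed out again between the two frees, the second free finds a busy block and succeeds, quietly releasing memory that another caller is still using.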
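At this point the cause of the crash is fully understood. What remains is to work backwards to the offending code.
4. Where is the offending code
As some readers will already have noticed, the problem lies in the b.B+A.MoveNext() method. Judging by the names, the project has been obfuscated, which is a bit of a nuisance and takes some squinting. See the screenshot below:
Going by the line IntPtr intPtr = Interlocked.Exchange(ref b.A, IntPtr.Zero);, b.A is a class-level field and intPtr is just the local copy taken from it. It looks as though several methods manipulate that class-level field without coordinating properly. To confirm this I went through the source code again and, sure enough, that is what happens. See the screenshot below:
That settles it. I asked my friend to modify the source so that access to this field is properly controlled.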
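One common way to keep such a shared pointer under control is to make "take ownership" and "clear the slot" a single atomic step, and to route every path that frees the buffer through that step. The sketch below shows the pattern in C11, mirroring what Interlocked.Exchange(ref b.A, IntPtr.Zero) does in C#. It is an illustration under assumptions, not the friend's actual fix: `SharedSlot`, `slot_replace` and `slot_release` are hypothetical names, and malloc/free stand in for AllocHGlobal/FreeHGlobal.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical shared slot standing in for the class-level field b.A. */
typedef struct {
    _Atomic(void *) buffer;
} SharedSlot;

/* Publishes a new buffer and returns the previous one, which the caller
   now owns and must release (it may be NULL). */
static void *slot_replace(SharedSlot *slot, void *new_buffer)
{
    return atomic_exchange(&slot->buffer, new_buffer);
}

/* Takes the buffer out of the slot and frees it. Returns 1 if this call
   freed a buffer, 0 if the slot was already empty. */
static int slot_release(SharedSlot *slot)
{
    /* After the exchange, no other caller can obtain the same pointer
       from the slot, so at most one caller reaches free() for it. */
    void *owned = atomic_exchange(&slot->buffer, NULL);
    if (owned == NULL)
        return 0;
    free(owned);
    return 1;
}
```

The pattern only holds if no code path frees a pointer that it merely read from the slot. A single method that copies the field and frees the copy without exchanging it is enough to bring the double free back.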
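III: Summary
This production incident is a fairly classic double-free problem. If you have never run into one, you may take a few detours before finding it, but for old hands like us, spotting one or two telltale signs is enough to know the problem is as good as solved!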