[Experiment Report] sglang, vllm, and transformers under forced serial inference
We now consider a forced serial requirement: each request must finish inference completely before the next one starts (a minimal harness sketch follows the setup list below).
- Libraries covered: transformers, vllm, sglang.
- With / without speculative sampling.
Speculative sampling here means eagle3. EAGLE heads trained on English corpora are easy to find. Note that an EAGLE head trained on an English corpus performs very poorly on Chinese prompts, but it still achieves accept-length > 1.
- The base model is qwen3-8b, running on a single L40.
(I was reduced to tears by the experiment results on huggingface that casually report several hundred tps.)
Inference runs with temperature \(\neq\) 0; the model outputs may differ across runs, but this obviously does not affect the tps statistics. All precision is 16-bit.
The main inference parameters should all be covered above, but in practice there are many confounding factors and it is hard to fully control variables; since this is a casual note, we will leave it at that.
- For metrics, we only look at tokens per second, mainly to get a feel for the order of magnitude; I was too lazy to compute mean \(\pm\) std.
- 11 prompts were fed in.
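Concretely, forced serial means feeding the prompts one at a time and waiting for each to finish before submitting the next. With vllm's offline API the harness looks roughly like the sketch below (model path, dtype, and sampling values are illustrative stand-ins, not the exact settings used):

```python
from vllm import LLM, SamplingParams

# Illustrative settings: temperature != 0 as stated above; max_tokens is a guess.
llm = LLM(model="Qwen/Qwen3-8B", dtype="bfloat16")
params = SamplingParams(temperature=0.7, max_tokens=4096)

prompts = ["prompt 1", "prompt 2"]  # stand-ins for the 11 prompts (elided)
for prompt in prompts:
    # One prompt per generate() call: the next request starts only after
    # this one has fully finished, i.e. forced serial.
    outputs = llm.generate([prompt], params)
```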
- transformers + eagle3
That is, clone the EAGLE github repo directly and generate with their eagenerate.
Since there is no official timing tool, tps is computed by timing the eagenerate call, counting how many tokens were generated, and dividing.
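In code, the measurement is roughly the sketch below. The EaModel / eagenerate usage follows the examples in the EAGLE repo README; the head path and prompt are placeholders, and exact signatures may differ across repo versions.

```python
import time
import torch
from eagle.model.ea_model import EaModel  # import path as in the EAGLE repo

# Placeholder paths: the eagle head must match the base model.
model = EaModel.from_pretrained(
    base_model_path="Qwen/Qwen3-8B",
    ea_model_path="<eagle3-head-for-qwen3-8b>",
    torch_dtype=torch.float16,
    device_map="auto",
).eval()

input_ids = model.tokenizer("some prompt", return_tensors="pt").input_ids.cuda()

torch.cuda.synchronize()  # start the clock only once the GPU is idle
start = time.time()
output_ids = model.eagenerate(input_ids, temperature=0.7, max_new_tokens=4096)
torch.cuda.synchronize()  # wait for all GPU work before stopping the clock
elapsed = time.time() - start

new_tokens = output_ids.shape[1] - input_ids.shape[1]  # generated tokens only
print(f"Generation time: {elapsed}s for {new_tokens} tokens, "
      f"speed: {new_tokens / elapsed} tokens/s")
```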
Generation time: 19.689866304397583s for 1128 tokens, speed: 57.28835242258957 tokens/s
Generation time: 24.006053924560547s for 1469 tokens, speed: 61.192897617257664 tokens/s
Generation time: 33.72415637969971s for 2217 tokens, speed: 65.73922784127895 tokens/s
Generation time: 24.192238330841064s for 1477 tokens, speed: 61.05263927220292 tokens/s
Generation time: 21.344391345977783s for 1268 tokens, speed: 59.40670686957521 tokens/s
Generation time: 16.566300868988037s for 1122 tokens, speed: 67.72785360311629 tokens/s
Generation time: 25.769388437271118s for 1559 tokens, speed: 60.49813730717671 tokens/s
Generation time: 35.69959473609924s for 2051 tokens, speed: 57.4516325790679 tokens/s
Generation time: 24.897949934005737s for 1422 tokens, speed: 57.11313597180247 tokens/s
Generation time: 16.427077054977417s for 855 tokens, speed: 52.04821266367253 tokens/s
Generation time: 26.052607536315918s for 1550 tokens, speed: 59.49500439982387 tokens/s
GPU memory usage is the standard base_model footprint + the eagle_head footprint + a kv-cache pre-allocated for max_length tokens.
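As a sanity check on the kv-cache term, some back-of-the-envelope arithmetic (assuming Qwen3-8B's published config of 36 layers, 8 KV heads, and head dim 128; verify against the model's config.json):

```python
# Rough size of the pre-allocated kv-cache at 16-bit precision.
layers, kv_heads, head_dim, bytes_per_val = 36, 8, 128, 2

per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # 2 = K and V
print(per_token)  # 147456 bytes, i.e. 144 KiB per token

max_length = 8192  # illustrative; use whatever max_length was configured
print(per_token * max_length / 2**30)  # ~1.125 GiB reserved for the kv-cache
```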
- vllm + nothing
[00:27<00:00, 27.20s/it, est. speed input: 103.04 toks/s, output: 42.86 toks/s]
[00:15<00:00, 15.15s/it, est. speed input: 166.97 toks/s, output: 42.70 toks/s]
[00:34<00:00, 34.53s/it, est. speed input: 82.83 toks/s, output: 42.89 toks/s]
[00:38<00:00, 38.48s/it, est. speed input: 80.00 toks/s, output: 42.73 toks/s]
[00:19<00:00, 19.70s/it, est. speed input: 147.42 toks/s, output: 42.59 toks/s]
[00:36<00:00, 36.50s/it, est. speed input: 102.99 toks/s, output: 42.27 toks/s]
[00:24<00:00, 24.61s/it, est. speed input: 107.77 toks/s, output: 42.91 toks/s]
[00:51<00:00, 51.44s/it, est. speed input: 57.66 toks/s, output: 42.77 toks/s]
[00:33<00:00, 33.09s/it, est. speed input: 89.75 toks/s, output: 42.76 toks/s]
[00:36<00:00, 36.10s/it, est. speed input: 89.97 toks/s, output: 42.60 toks/s]
[00:51<00:00, 51.08s/it, est. speed input: 55.79 toks/s, output: 42.83 toks/s]
- sglang + nothing
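The lines below are sglang's scheduler log. One way to reproduce the setup is the offline engine, roughly as sketched here (sampling values are illustrative):

```python
import sglang as sgl

llm = sgl.Engine(model_path="Qwen/Qwen3-8B")
params = {"temperature": 0.7, "max_new_tokens": 4096}

prompts = ["prompt 1", "prompt 2"]  # stand-ins for the 11 prompts (elided)
for prompt in prompts:
    out = llm.generate(prompt, params)  # one request at a time, as before
```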
Decode batch, #running-req: 1, #token: 3778, token usage: 0.46, cuda graph: True, gen throughput (token/s): 44.34
Decode batch, #running-req: 1, #token: 4284, token usage: 0.52, cuda graph: True, gen throughput (token/s): 44.16
Decode batch, #running-req: 1, #token: 4780, token usage: 0.58, cuda graph: True, gen throughput (token/s): 43.91
Decode batch, #running-req: 1, #token: 5326, token usage: 0.65, cuda graph: True, gen throughput (token/s): 43.71
Decode batch, #running-req: 1, #token: 4643, token usage: 0.57, cuda graph: True, gen throughput (token/s): 43.93
Decode batch, #running-req: 1, #token: 4403, token usage: 0.54, cuda graph: True, gen throughput (token/s): 44.13
Decode batch, #running-req: 1, #token: 4644, token usage: 0.57, cuda graph: True, gen throughput (token/s): 43.94
Decode batch, #running-req: 1, #token: 4403, token usage: 0.54, cuda graph: True, gen throughput (token/s): 44.07
Decode batch, #running-req: 1, #token: 4418, token usage: 0.54, cuda graph: True, gen throughput (token/s): 44.11
Decode batch, #running-req: 1, #token: 5092, token usage: 0.62, cuda graph: True, gen throughput (token/s): 43.81
Decode batch, #running-req: 1, #token: 5012, token usage: 0.61, cuda graph: True, gen throughput (token/s): 43.84
This looks about 1 tps better than vllm + nothing, which is not significant and may well just be sampling noise, so we ignore it.
- vllm + eagle3
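In recent vllm versions, EAGLE3 is enabled through speculative_config, roughly as below; the key names have shifted across vllm releases, and the draft-model path is a placeholder:

```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-8B",
    speculative_config={
        "method": "eagle3",
        "model": "<eagle3-head-for-qwen3-8b>",  # placeholder draft head
        "num_speculative_tokens": 4,            # illustrative draft length
    },
)
```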
[00:14<00:00, 14.67s/it, est. speed input: 225.82 toks/s, output: 56.59 toks/s]
[00:23<00:00, 23.12s/it, est. speed input: 107.67 toks/s, output: 59.56 toks/s]
[00:30<00:00, 30.37s/it, est. speed input: 75.89 toks/s, output: 62.78 toks/s]
[00:30<00:00, 30.48s/it, est. speed input: 85.35 toks/s, output: 60.75 toks/s]
[00:21<00:00, 21.64s/it, est. speed input: 142.03 toks/s, output: 60.78 toks/s]
[00:31<00:00, 31.46s/it, est. speed input: 108.05 toks/s, output: 69.17 toks/s]
[00:32<00:00, 32.62s/it, est. speed input: 95.64 toks/s, output: 62.65 toks/s]
[00:39<00:00, 39.54s/it, est. speed input: 83.36 toks/s, output: 61.13 toks/s]
[00:31<00:00, 31.13s/it, est. speed input: 106.44 toks/s, output: 61.07 toks/s]
[00:30<00:00, 30.32s/it, est. speed input: 101.60 toks/s, output: 62.59 toks/s]
With eagle3 enabled, tokens per second jumps from the earlier 42~43 to 59~62.
There are github issues asking about this optimization, because a gain of nearly 50% is actually far below expectations; compare the sglang + eagle numbers below.
- sglang + eagle
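In sglang, the speculative settings go into the engine/server arguments, roughly as below (the draft-head path and the step/topk/draft-token numbers are illustrative placeholders; check sglang's docs for the current argument names):

```python
import sglang as sgl

llm = sgl.Engine(
    model_path="Qwen/Qwen3-8B",
    speculative_algorithm="EAGLE3",
    speculative_draft_model_path="<eagle3-head-for-qwen3-8b>",
    speculative_num_steps=5,         # draft depth per round (illustrative)
    speculative_eagle_topk=4,        # branching factor (illustrative)
    speculative_num_draft_tokens=8,  # tokens verified per round (illustrative)
)
```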
Decode batch, #running-req: 1, #token: 4365, token usage: 0.53, accept len: 3.45, accept rate: 0.06, cuda graph: True, gen throughput (token/s): 80.28,
Decode batch, #running-req: 1, #token: 3652, token usage: 0.45, accept len: 3.23, accept rate: 0.05, cuda graph: True, gen throughput (token/s): 75.12,
Decode batch, #running-req: 1, #token: 4962, token usage: 0.61, accept len: 4.22, accept rate: 0.07, cuda graph: True, gen throughput (token/s): 97.56,
Decode batch, #running-req: 1, #token: 5539, token usage: 0.68, accept len: 3.08, accept rate: 0.05, cuda graph: True, gen throughput (token/s): 71.04,
Decode batch, #running-req: 1, #token: 5156, token usage: 0.63, accept len: 3.42, accept rate: 0.06, cuda graph: True, gen throughput (token/s): 79.04,
Decode batch, #running-req: 1, #token: 4107, token usage: 0.50, accept len: 4.38, accept rate: 0.07, cuda graph: True, gen throughput (token/s): 101.87,
Decode batch, #running-req: 1, #token: 4976, token usage: 0.61, accept len: 4.00, accept rate: 0.07, cuda graph: True, gen throughput (token/s): 92.61,
Decode batch, #running-req: 1, #token: 4957, token usage: 0.61, accept len: 4.40, accept rate: 0.07, cuda graph: True, gen throughput (token/s): 101.80,
Decode batch, #running-req: 1, #token: 4508, token usage: 0.55, accept len: 4.65, accept rate: 0.08, cuda graph: True, gen throughput (token/s): 108.06,
Decode batch, #running-req: 1, #token: 4950, token usage: 0.60, accept len: 4.10, accept rate: 0.07, cuda graph: True, gen throughput (token/s): 94.86,
Decode batch, #running-req: 1, #token: 5085, token usage: 0.62, accept len: 3.65, accept rate: 0.06, cuda graph: True, gen throughput (token/s): 84.59,
tps is now roughly 2x~2.5x the no-speculation baseline; the effect is truly outstanding.
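The logged accept length roughly explains the gain: each verification step of the base model commits accept-len tokens instead of 1, minus the drafting overhead. Plugging in numbers from the logs above (a crude estimate that ignores scheduling details):

```python
baseline_tps = 44.0  # sglang + nothing, from the logs above
spec_tps = 80.28     # first sglang + eagle line
accept_len = 3.45    # accept length on that same line

speedup = spec_tps / baseline_tps  # ~1.83x observed
ideal = accept_len                 # upper bound if drafting were free
overhead = ideal / speedup         # ~1.9: each draft+verify round costs about
print(speedup, ideal, overhead)    # 1.9 plain decode steps' worth of time
```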
For various reasons, we can also run at a concurrency of 2.
The EAGLE3 repo does not provide an implementation for batchsize \(\neq\) 1.
sglang/vllm + eagle3 handle batchsize=2 automatically, and, quite intuitively, sglang is noticeably faster than vllm.
vllm cannot even survive the stress test... speechless.