Dual-node inference of DeepSeek V3 with vllm-ascend
Serving the full-size DeepSeek requires two nodes (910B A2), which is far more involved than single-node serving; it took a long session together with the Huawei folks to get it working.
1. Images:
quay.io/ascend/vllm-ascend:v0.12.0rc1
quay.io/ascend/vllm-ascend:v0.12.0rc1-openeuler
quay.nju.edu.cn/ascend/vllm-ascend:v0.11.0rc2
2. Weights: edit config.json in the weight directory and set "torch_dtype": "float32"; in testing "bfloat16" also worked.
https://www.modelscope.cn/models/deepseek-ai/DeepSeek-V3.1/
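The dtype edit above can be scripted with python3's json module (jq is often absent from the image). A minimal sketch, demonstrated on a throwaway file; on a real node, point CONFIG at <weight-dir>/config.json instead:

```shell
# Demo on a throwaway copy; replace CONFIG with the real config.json path
# under your weight directory (e.g. /app1/models/DeepSeek-V3.1-w8a8).
CONFIG=/tmp/demo_config.json
printf '{"model_type": "deepseek_v3", "torch_dtype": "bfloat16"}\n' > "$CONFIG"
python3 - "$CONFIG" <<'PY'
import json, sys

path = sys.argv[1]
with open(path) as f:
    cfg = json.load(f)
cfg["torch_dtype"] = "float32"  # "bfloat16" also worked in testing
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
PY
grep '"torch_dtype"' "$CONFIG"
```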
3. Launch: multi-node inference is more involved than single-node, but simpler than MindIE. First start the container, then run the service launch script inside it; the head-node and worker-node service scripts differ only slightly.
- Container launch script start_docker.sh, identical on both nodes. Launch with: start_docker.sh vllm-ds
#!/bin/bash
# ID of the vllm-ascend image pulled above
IMAGE_ID=f49277a2e0de
# Require exactly one argument: the container name
if [ $# -ne 1 ]; then
    echo "Error: need exactly one argument for container name."
    exit 1
fi
# Container name, taken from the argument
CONTAINER_NAME="$1"
# Start the container
docker run \
--name "${CONTAINER_NAME}" \
-it -d \
--net=host \
--shm-size=500g \
--privileged=true \
-w /home \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
--entrypoint=bash \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /usr/local/sbin:/usr/local/sbin \
-v /app1:/app1 \
-v /tmp:/tmp \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro \
-e http_proxy="$http_proxy" \
-e https_proxy="$https_proxy" \
"${IMAGE_ID}"
- Head-node service script node0.sh; head-node host IP: 10.178.231.234
#!/bin/sh
# nic_name is the network interface (from ifconfig) that carries local_ip on this node
nic_name="bond0"
local_ip="10.178.231.234"
# HCCL_OP_EXPANSION_MODE="AIV" must NOT be set here
# export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_MLAPO=1
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
vllm serve /app1/models/DeepSeek-V3.1-w8a8 \
--host 0.0.0.0 \
--port 8000 \
--data-parallel-size 4 \
--data-parallel-size-local 2 \
--data-parallel-address 10.178.231.234 \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 4 \
--quantization ascend \
--seed 1024 \
--served-model-name deepseek_v3 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 16384 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--gpu-memory-utilization 0.94 \
--speculative-config '{"num_speculative_tokens":1,"method":"mtp"}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
--additional-config '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false}}'
- Worker-node service script node1.sh; worker-node host IP: 10.178.231.233
#!/bin/sh
# nic_name is the network interface (from ifconfig) that carries local_ip on this node
nic_name="bond0"
local_ip="10.178.231.233"
# HCCL_OP_EXPANSION_MODE="AIV" must NOT be set here
# export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_MLAPO=1
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
vllm serve /app1/models/DeepSeek-V3.1-w8a8 \
--host 0.0.0.0 \
--port 8000 \
--headless \
--data-parallel-size 4 \
--data-parallel-size-local 2 \
--data-parallel-start-rank 2 \
--data-parallel-address 10.178.231.234 \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 4 \
--quantization ascend \
--seed 1024 \
--served-model-name deepseek_v3 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 16384 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--gpu-memory-utilization 0.94 \
--speculative-config '{"num_speculative_tokens":1,"method":"mtp"}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
--additional-config '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false}}'
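The data-parallel flags are easier to read with the arithmetic spelled out. This is my reading of the layout, not from any official doc: four DP engines of TP=4 each cover the 16 NPUs; node0 hosts engine ranks 0-1, and node1, with --data-parallel-start-rank 2, hosts ranks 2-3:

```shell
# How the flags partition 2 nodes x 8 NPUs under DP=4, TP=4:
NODES=2; NPUS_PER_NODE=8
DP=4; DP_LOCAL=2; TP=4
echo "total NPUs:    $((NODES * NPUS_PER_NODE))"   # 16
echo "NPUs used:     $((DP * TP))"                 # 4 engines x 4 NPUs = 16
echo "NPUs per node: $((DP_LOCAL * TP))"           # 2 engines x 4 NPUs = 8
```

The used count must equal the total, and the per-node count must equal NPUS_PER_NODE, or startup fails with device-assignment errors.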
4. Test:
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
"model": "deepseek_v3",
"prompt": "The future of AI is",
"max_tokens": 50,
"temperature": 0
}'
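To pull just the generated text out of the JSON response, python3 works where jq is missing. RESP below is a hand-written sample with the same shape as a vLLM completions response (not real model output); in practice, pipe the curl output in instead:

```shell
# Extract choices[0].text from a completions-style response.
# RESP is an illustrative sample, not actual server output.
RESP='{"id":"cmpl-1","object":"text_completion","model":"deepseek_v3","choices":[{"index":0,"text":" bright.","finish_reason":"length"}]}'
echo "$RESP" | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["text"])'
```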
5. The big pitfall:
The scripts Huawei provided could not bring the service up no matter what. Following the vLLM website's guide the service would start, but it crashed on the first inference request. After a long investigation, just as we were about to give up and escalate to Huawei R&D, someone suggested trying an older image, and v0.11.0rc2 worked. The final conclusion: a compatibility problem between the host's CANN version and the image.
P.S.: npu-smi 25.2.0, paired with the 0.11.0rc2 image.
6. Optimization:
- To serve a 128k context, change a few parameters in node0.sh and node1.sh. A longer context needs more memory per engine, so the layout shifts to fewer data-parallel engines that each span more NPUs:
--data-parallel-size 2 \
--data-parallel-size-local 1 \
## node1 only; node0 does not take the data-parallel-start-rank parameter
--data-parallel-start-rank 1 \
--tensor-parallel-size 8 \
--max-model-len 131072 \
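A quick sanity check of the changed values (my reading of the layout, not from any official doc): the product DP x TP must still cover all 16 NPUs, now as two TP=8 engines, one engine per node:

```shell
# 128k layout: 2 DP engines x TP=8 = 16 NPUs, one engine per node.
DP=2; TP=8; DP_LOCAL=1
[ $((DP * TP)) -eq 16 ] && echo "layout ok: ${DP} engines x ${TP} NPUs"
[ $((DP_LOCAL * TP)) -eq 8 ] && echo "per node ok: ${DP_LOCAL} engine x ${TP} NPUs"
```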
7. References:
https://docs.vllm.ai/projects/ascend/en/latest/tutorials/DeepSeek-V3.1.html