Dual-node DeepSeek V3 inference with vllm-ascend

  Running the full DeepSeek model requires two machines (910B A2), which is considerably more involved than a single-node deployment; it took a long session with Huawei's engineers to get it working.

1. Images:

quay.io/ascend/vllm-ascend:v0.12.0rc1
quay.io/ascend/vllm-ascend:v0.12.0rc1-openeuler
quay.nju.edu.cn/ascend/vllm-ascend:v0.11.0rc2
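
Pulling an image is a plain docker pull; for example, the v0.11.0rc2 tag that ultimately worked (see section 5), then note its IMAGE ID for the start script below:

docker pull quay.nju.edu.cn/ascend/vllm-ascend:v0.11.0rc2
docker images | grep vllm-ascend   # note the IMAGE ID for start_docker.sh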

2. Weights: change "torch_dtype" in config.json under the weights directory to "float32"; in testing, "bfloat16" also works.

https://www.modelscope.cn/models/deepseek-ai/DeepSeek-V3.1/
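
A minimal sketch of that edit, assuming the weights live at /app1/models/DeepSeek-V3.1-w8a8 (the path used by the launch scripts below):

sed -i 's/"torch_dtype": *"[^"]*"/"torch_dtype": "float32"/' /app1/models/DeepSeek-V3.1-w8a8/config.json
grep torch_dtype /app1/models/DeepSeek-V3.1-w8a8/config.json   # verify the change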

3. Launch: multi-node inference is more complex than single-node, but simpler than MindIE. First start the container, then run the service launch script inside it; the master and worker launch scripts differ only slightly.

  • Container start script start_docker.sh, identical on both nodes. Launch command: start_docker.sh vllm-ds
#!/bin/bash

# Image ID of the vllm-ascend image (from docker images)
IMAGE_ID=f49277a2e0de

# Require exactly one argument (the container name)
if [ $# -ne 1 ]; then
    echo "Error: need exactly one argument for container name."
    exit 1
fi

# Container name (from the argument)
CONTAINER_NAME="$1"

# Start the Docker container
docker run \
    --name "${CONTAINER_NAME}" \
    -it -d \
    --net=host \
    --shm-size=500g \
    --privileged=true \
    -w /home \
    --device=/dev/davinci_manager \
    --device=/dev/hisi_hdc \
    --device=/dev/devmm_svm \
    --entrypoint=bash \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /usr/local/sbin:/usr/local/sbin \
    -v /app1:/app1 \
    -v /tmp:/tmp \
    -v /etc/hccn.conf:/etc/hccn.conf \
    -v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro \
    -e http_proxy="$http_proxy" \
    -e https_proxy="$https_proxy" \
    "${IMAGE_ID}"
  • Master-node service script node0.sh; master host IP: 10.178.231.234
#!/bin/sh

# Obtained via ifconfig:
# nic_name is the network interface corresponding to local_ip on this node
nic_name="bond0"
local_ip="10.178.231.234"

# AIV must NOT be set here
# export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_MLAPO=1
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0

vllm serve /app1/models/DeepSeek-V3.1-w8a8 \
  --host 0.0.0.0 \
  --port 8000 \
  --data-parallel-size 4 \
  --data-parallel-size-local 2 \
  --data-parallel-address 10.178.231.234 \
  --data-parallel-rpc-port 13389 \
  --tensor-parallel-size 4 \
  --quantization ascend \
  --seed 1024 \
  --served-model-name deepseek_v3 \
  --enable-expert-parallel \
  --max-num-seqs 16 \
  --max-model-len 16384 \
  --max-num-batched-tokens 4096 \
  --trust-remote-code \
  --gpu-memory-utilization 0.94 \
  --speculative-config '{"num_speculative_tokens":1,"method":"mtp"}' \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
  --additional-config '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false}}'
  • Worker-node service script node1.sh; worker host IP: 10.178.231.233
#!/bin/sh

# Obtained via ifconfig:
# nic_name is the network interface corresponding to local_ip on this node
nic_name="bond0"
local_ip="10.178.231.233"

# AIV must NOT be set here
# export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_MLAPO=1
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0

vllm serve /app1/models/DeepSeek-V3.1-w8a8 \
  --host 0.0.0.0 \
  --port 8000 \
  --headless \
  --data-parallel-size 4 \
  --data-parallel-size-local 2 \
  --data-parallel-start-rank 2 \
  --data-parallel-address 10.178.231.234 \
  --data-parallel-rpc-port 13389 \
  --tensor-parallel-size 4 \
  --quantization ascend \
  --seed 1024 \
  --served-model-name deepseek_v3 \
  --enable-expert-parallel \
  --max-num-seqs 16 \
  --max-model-len 16384 \
  --max-num-batched-tokens 4096 \
  --trust-remote-code \
  --gpu-memory-utilization 0.94 \
  --speculative-config '{"num_speculative_tokens":1,"method":"mtp"}' \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
  --additional-config '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false}}'
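
Start node0.sh on the master first, then node1.sh on the worker. Only the master exposes the API (the worker runs --headless); once both ranks have joined, a quick readiness check on the master is to list the served models:

curl http://localhost:8000/v1/models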

4. Test:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek_v3",
        "prompt": "The future of AI is",
        "max_tokens": 50,
        "temperature": 0
    }'
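
The same server also exposes the OpenAI-compatible chat endpoint:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek_v3",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 50
    }'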

5. The big pitfall:

  The scripts Huawei provided could not bring the service up no matter what we tried. Following the official vLLM guide, the service did start, but it crashed on the first inference request. After a long investigation we were about to give up and hand it over to Huawei's R&D, when they suggested trying an older image. With v0.11.0rc2 it worked. The final conclusion: a compatibility issue between the CANN version on the host and the image...

  P.S. npu-smi 25.2.0, paired with the v0.11.0rc2 image.
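
To compare the host driver against an image before picking a tag (the second file is the install record that start_docker.sh mounts into the container):

npu-smi info                    # driver version on the host
cat /etc/ascend_install.info    # Ascend driver install record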

6. Optimization:

  • For a 128k context, change a few parameters in node0.sh and node1.sh. Total parallelism stays at 16 NPUs (2 nodes x 8), but the layout shifts from DP 4 x TP 4 to DP 2 x TP 8; sharding the weights across a wider TP group leaves more memory per NPU for the longer context's KV cache. A merged sketch of the resulting command follows the list.
  --data-parallel-size 2 \
  --data-parallel-size-local 1 \
  ## node0 does not need the data-parallel-start-rank parameter
  --data-parallel-start-rank 1 \
  --tensor-parallel-size 8 \
  --max-model-len 131072 \
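
For clarity, node1.sh's serve command after these edits would look roughly like this (node0.sh is the same minus --headless and --data-parallel-start-rank):

vllm serve /app1/models/DeepSeek-V3.1-w8a8 \
  --host 0.0.0.0 \
  --port 8000 \
  --headless \
  --data-parallel-size 2 \
  --data-parallel-size-local 1 \
  --data-parallel-start-rank 1 \
  --data-parallel-address 10.178.231.234 \
  --data-parallel-rpc-port 13389 \
  --tensor-parallel-size 8 \
  --quantization ascend \
  --seed 1024 \
  --served-model-name deepseek_v3 \
  --enable-expert-parallel \
  --max-num-seqs 16 \
  --max-model-len 131072 \
  --max-num-batched-tokens 4096 \
  --trust-remote-code \
  --gpu-memory-utilization 0.94 \
  --speculative-config '{"num_speculative_tokens":1,"method":"mtp"}' \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
  --additional-config '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false}}'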

7. Reference:

https://docs.vllm.ai/projects/ascend/en/latest/tutorials/DeepSeek-V3.1.html