[vLLM Learning] RLHF - HyperAI超神經 Blog

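vLLM is a framework designed to accelerate large language model inference. It achieves near-zero waste of KV cache memory, which addresses the memory-management bottleneck.

More vLLM documentation and tutorials in Chinese are available at → vllm.hyper.ai/

* Run the vLLM getting-started tutorial online: a step-by-step guide for beginners

Source: examples/offline_inference/rlhf.py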


"""
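A simple demonstration of RLHF with vLLM, inspired by
the OpenRLHF framework https://github.com/OpenRLHF/OpenRLHF .
The design separates training processes from inference processes,
and they run on different GPUs.
Training processes send prompts to inference processes to generate data,
and they keep the model weights in sync by broadcasting the weights
from the training process to the inference process.
Note: this demo only covers the simple case of a single training
instance and a single inference instance.
In practice there may be multiple training instances and multiple
inference instances.
For a full implementation, see the OpenRLHF framework.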
"""
import os

import ray
import torch
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
from rlhf_utils import stateless_init_process_group
from transformers import AutoModelForCausalLM

from vllm import LLM, SamplingParams
from vllm.utils import get_ip, get_open_port


class MyLLM(LLM):

    def __init__(self, *args, **kwargs):
        # A hack to make the script work:
        # stop Ray from manipulating CUDA_VISIBLE_DEVICES
        # at the top level.
        os.environ.pop("CUDA_VISIBLE_DEVICES", None)
        super().__init__(*args, **kwargs)


"""
Start the training process. Here we use Hugging Face Transformers
as an example of holding the model on GPU 0.
"""

train_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
train_model.to("cuda:0")

"""
Start the inference process. Here we use vLLM on GPU 1 and GPU 2.
For details on how to use Ray, see the Ray documentation
https://docs.ray.io/en/latest/ .
"""
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"
ray.init()

pg_inference = placement_group([{"GPU": 1, "CPU": 0}] * 2)
ray.get(pg_inference.ready())
scheduling_inference = PlacementGroupSchedulingStrategy(
    placement_group=pg_inference,
    placement_group_capture_child_tasks=True,
    placement_group_bundle_index=0,
)

"""
Launch the vLLM inference engine.
Here we use `enforce_eager` to reduce the start-up time.
"""
llm = ray.remote(
    num_cpus=0,
    num_gpus=0,
    scheduling_strategy=scheduling_inference,
)(MyLLM).remote(
    model="facebook/opt-125m",
    enforce_eager=True,
    worker_extension_cls="rlhf_utils.WorkerExtension",
    tensor_parallel_size=2,
    distributed_executor_backend="ray",
)

# Generate text from the prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = SamplingParams(temperature=0)

outputs = ray.get(llm.generate.remote(prompts, sampling_params))

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, "
          f"Generated text: {generated_text!r}")

# Set up communication between the training process and the inference engine.
master_address = get_ip()
master_port = get_open_port()

handle = llm.collective_rpc.remote("init_weight_update_group",
                                   args=(master_address, master_port, 1, 3))

model_update_group = stateless_init_process_group(master_address, master_port,
                                                  0, 3, torch.device("cuda:0"))
ray.get(handle)

# Simulate training by modifying the model weights.
for name, p in train_model.named_parameters():
    p.data.zero_()

# Sync the weights from the training process to the inference engine.
for name, p in train_model.named_parameters():
    handle = llm.collective_rpc.remote("update_weight",
                                       args=(name, p.dtype, p.shape))
    model_update_group.broadcast(p, src=0, stream=torch.cuda.current_stream())
    ray.get(handle)

# Check that the weights were updated.
assert all(ray.get(llm.collective_rpc.remote("check_weights_changed")))

# Generate text with the updated model. The output will be nonsense
# because the weights are all zeros.
outputs_updated = ray.get(llm.generate.remote(prompts, sampling_params))
for output in outputs_updated:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, "
          f"Generated text: {generated_text!r}")
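The rank arguments in the script can be easy to misread, so here is a minimal sketch of how the three participants in the weight-update group might be numbered. It assumes that the `1` passed to `init_weight_update_group` is a rank offset added to each vLLM worker's own rank, and that the `3` is the world size (one training process plus `tensor_parallel_size=2` inference workers). `rlhf_utils` is not shown here, so this reading is an assumption, and the `rank_layout` helper and its labels are hypothetical names used only for illustration.

```python
def rank_layout(num_trainers: int, tensor_parallel_size: int):
    """Assign a global rank to every member of the weight-update group.

    Hypothetical helper: trainers take the lowest ranks, and each
    inference worker's rank is shifted by the number of trainers.
    """
    world_size = num_trainers + tensor_parallel_size
    rank_offset = num_trainers  # assumed meaning of the `1` in the script

    layout = {f"trainer-{i}": i for i in range(num_trainers)}
    for worker_rank in range(tensor_parallel_size):
        layout[f"vllm-worker-{worker_rank}"] = worker_rank + rank_offset
    return world_size, layout


# One trainer on cuda:0, two tensor-parallel vLLM workers on GPU 1 and GPU 2.
world_size, layout = rank_layout(num_trainers=1, tensor_parallel_size=2)
print(world_size)  # 3
print(layout)      # {'trainer-0': 0, 'vllm-worker-0': 1, 'vllm-worker-1': 2}
```

Under this reading, the training process is rank 0, which is consistent with the script broadcasting every parameter with `src=0`.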
