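How to Support New Models#

This document explains how to add support for new language models and multimodal large language models (MLLMs) in SGLang. It also covers how to test new models and how to register external implementations.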

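How to Support a New Language Model#

To support a new model in SGLang, you only need to add a single file under the SGLang models directory. You can learn from existing model implementations and create a new file for your model. For most models, you should be able to find a similar model to start from (for example, Llama). Also see the section below on porting a model from vLLM to SGLang.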

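How to Support a New Multimodal Large Language Model#

To support a new multimodal large language model (MLLM) in SGLang, several key components are needed in addition to standard LLM support:

- Register your new model as multimodal: extend is_multimodal_model in model_config.py so that it returns True for your model.
- Register a new chat template: this is only needed when your model's default chat template cannot accept images as input. In that case, register a new chat template in conversation.py together with the corresponding matching function.
- Multimodal data processor: define a new Processor class that inherits from BaseMultimodalProcessor, and register it as the dedicated processor for your model. See multimodal_processor.py for details.
- Handle multimodal tokens: implement a pad_input_ids function for your new model. In this function, the multimodal tokens in the prompt should be expanded (if necessary) and padded with multimodal-data hashes, so that SGLang can distinguish different multimodal data with RadixAttention.
- Handle image feature extraction: implement a get_image_feature function for your new model. It extracts image features from raw image data and converts them into the embeddings used by the language model.
- Adapt to vision attention: adapt the multi-head attention of the ViT with SGLang's VisionAttention.

You can refer to Qwen2VL or other MLLM implementations. These models demonstrate how to properly handle both multimodal and text inputs.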
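
The token-handling step can be pictured with a minimal sketch in plain Python. It assumes one placeholder token per image, a known number of embedding slots per image, and one hash-derived pad value per image. The function name expand_and_pad and all of its parameters are hypothetical and do not reflect the actual pad_input_ids signature in SGLang.

```python
from typing import List


def expand_and_pad(
    input_ids: List[int],
    image_token_id: int,
    tokens_per_image: List[int],
    pad_values: List[int],
) -> List[int]:
    """Replace each image placeholder with a run of that image's pad value.

    Hypothetical helper for illustration only; not SGLang's API.
    """
    padded: List[int] = []
    image_idx = 0
    for token in input_ids:
        if token == image_token_id:
            # Expand the single placeholder to the number of embedding slots
            # this image occupies, filled with a value derived from its hash.
            padded.extend([pad_values[image_idx]] * tokens_per_image[image_idx])
            image_idx += 1
        else:
            padded.append(token)
    return padded


# 999 is a made-up image token id; 700001 and 700002 stand in for hash values.
prompt = [1, 50, 999, 51, 999, 2]
print(expand_and_pad(prompt, 999, [3, 2], [700001, 700002]))
```

Because two different images yield different pad values, prompts that share the same text but carry different images produce different token sequences, which is what lets a prefix cache tell them apart.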

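Testing and Debugging#

Please note all of your testing and benchmarking results in the PR description.

Interactive Debugging#

For interactive debugging, compare the outputs of Hugging Face/Transformers and SGLang. The following two commands should give the same text output and very similar prefill logits:

Get the reference output: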

python3 scripts/playground/reference_hf.py --model-path [new model] --model-type {text,mllm}

Get the SGLang output:

python3 -m sglang.bench_one_batch --correct --model [new model]
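
To make "very similar prefill logits" concrete, the following minimal sketch compares two logit vectors copied from the two runs. The helper names and the 1e-2 tolerance are assumptions chosen for illustration; the source does not specify a threshold.

```python
from typing import Sequence


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the largest element-wise absolute difference between a and b."""
    if len(a) != len(b):
        raise ValueError("logit vectors must have the same length")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def logits_close(a: Sequence[float], b: Sequence[float], atol: float = 1e-2) -> bool:
    """Check that every logit differs by at most atol (an assumed tolerance)."""
    return max_abs_diff(a, b) <= atol
```

A large difference on only a few positions usually points at a specific layer or weight-loading issue, so printing max_abs_diff alongside the pass/fail result tends to be more useful than the boolean alone.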

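Add the Model to the Test Suite#

To make sure the new model stays well maintained, add it to the test suite by including it in the ALL_OTHER_MODELS list in the test_generation_models.py file. Test the new model on your local machine and report the results on representative benchmarks (GSM8K, MMLU, MMMU, MMMU-Pro, etc.) in your PR. For VLMs, also include a test in test_vision_openai_server_{x}.py (for example, test_vision_openai_server_a.py or test_vision_openai_server_b.py).

This is an example command for testing a new model on your local machine: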

ONLY_RUN=Qwen/Qwen2-1.5B python3 -m unittest test_generation_models.TestGenerationModels.test_others

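Benchmark#

- (Required) MMMU: follow the MMMU benchmark README.md to get an accuracy comparison between SGLang and HF Transformers. The accuracy score from the SGLang run should not be much lower than the score from the HF Transformers run. Similarly, follow https://docs.sglang.com.cn/developer_guide/benchmark_and_profiling.html to get a performance comparison: TTFT and throughput must meet or exceed the baseline (for example, HF Transformers).
- (Optional) Other evals: if you ran other evals, please note the results in the PR description.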
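
The acceptance criteria above can be written down as a small check. This is only a sketch: the RunResult fields, the function name, and the default allowed accuracy drop of 0.01 are assumptions, since the source says only that accuracy should not be "much lower" and that TTFT and throughput must meet or exceed the baseline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunResult:
    accuracy: float          # benchmark accuracy in [0, 1]
    ttft_ms: float           # time to first token, lower is better
    throughput_tok_s: float  # generated tokens per second, higher is better


def check_against_baseline(
    candidate: RunResult,
    baseline: RunResult,
    max_accuracy_drop: float = 0.01,  # assumed tolerance, not from the source
) -> List[str]:
    """Return a list of human-readable problems; an empty list means pass."""
    problems: List[str] = []
    if candidate.accuracy < baseline.accuracy - max_accuracy_drop:
        problems.append("accuracy is much lower than the baseline")
    if candidate.ttft_ms > baseline.ttft_ms:
        problems.append("TTFT is worse than the baseline")
    if candidate.throughput_tok_s < baseline.throughput_tok_s:
        problems.append("throughput is below the baseline")
    return problems
```

Returning the list of problems rather than a single boolean makes it easy to paste the outcome into the PR description.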

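Port a Model from vLLM to SGLang#

The vLLM models directory is a valuable resource, as vLLM covers many models. SGLang reuses vLLM's interface and some of its layers, which makes it easier to port models from vLLM to SGLang.

To port a model from vLLM to SGLang:

Compare these two files for guidance:
- SGLang Llama implementation
- vLLM Llama implementation
The major differences include:
- Replace vLLM's Attention with RadixAttention (make sure to pass layer_id to RadixAttention).
- Replace vLLM's LogitsProcessor with SGLang's LogitsProcessor.
- Replace the multi-head Attention of the ViT with SGLang's VisionAttention.
- Replace other vLLM layers (such as RMSNorm, SiluAndMul) with SGLang layers.
- Remove Sample.
- Change the forward() functions and add a forward_batch argument.
- Add EntryClass at the end.
- Make sure the new implementation uses only SGLang components and does not depend on any vLLM components.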
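
To show why the module-level EntryClass matters, here is a simplified illustration of how a registry could discover models from it. It assumes that each model file exposes EntryClass as either a single class or a list of classes, and that the registry is keyed by class name. The function collect_entry_classes and the stand-in module are hypothetical; this is not SGLang's actual loader code.

```python
import types
from typing import Dict, Iterable


def collect_entry_classes(modules: Iterable[types.ModuleType]) -> Dict[str, type]:
    """Build an architecture-name -> class mapping from module-level EntryClass."""
    registry: Dict[str, type] = {}
    for module in modules:
        entry = getattr(module, "EntryClass", None)
        if entry is None:
            continue  # this module does not expose a model
        # Accept either a single class or a list/tuple of classes.
        classes = entry if isinstance(entry, (list, tuple)) else [entry]
        for cls in classes:
            if cls.__name__ in registry:
                raise ValueError(f"duplicate model architecture: {cls.__name__}")
            registry[cls.__name__] = cls
    return registry


# Stand-in for a model file; a real file would define the full model.
class MyModelForCausalLM:
    pass


fake_module = types.ModuleType("my_model")
fake_module.EntryClass = MyModelForCausalLM

registry = collect_entry_classes([fake_module])
print(sorted(registry))
```

Under this assumption, the key in the registry is the class name, so it has to match the name listed in the checkpoint's architectures field.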

Note: make sure to add your new model to the supported models list in the supported models documentation.

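Registering an External Model Implementation#

In addition to the methods above, you can register your new model with the ModelRegistry before launching the server. This lets you integrate your model without modifying the source code.

For example: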

from sglang.srt.models.registry import ModelRegistry
from sglang.srt.entrypoints.http_server import launch_server

# For a single model, add it to the registry:
ModelRegistry.models[model_name] = model_class

# For multiple models, you can imitate the import_model_classes() function:
from functools import lru_cache

@lru_cache()
def import_new_model_classes():
    model_arch_name_to_cls = {}
    # Populate model_arch_name_to_cls with your new model classes.
    ...
    return model_arch_name_to_cls

ModelRegistry.models.update(import_new_model_classes())

# Launch the server with your server arguments:
launch_server(server_args)

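Example: Implementing and Serving a Llama Wrapper Model#

Below is an introductory, step-by-step walkthrough of how to implement a new model end-to-end in SGLang and then run it via the offline Engine.

Implementing Our Model#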

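To keep things simple, this new model is a thin wrapper around Llama 3.1-8B-Instruct. Our only goal is to bias the logits produced by each forward call by taking the square root of every positive logit; non-positive logits are left unchanged.

Let's start by defining our model in a file called llama_wrapper.py. The first step is to import the necessary libraries from SRT, which is SGLang's internal backend.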

# In the file `llama_wrapper.py`

import torch
from transformers import LlamaConfig
from typing import Optional
from sglang.srt.layers.logits_processor import LogitsProcessorOutput
from sglang.srt.layers.quantization.base_config import QuantizationConfig
from sglang.srt.model_executor.forward_batch_info import ForwardBatch, PPProxyTensors

from sglang.srt.models.llama import LlamaForCausalLM

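Next, we declare a new class for our model and have it inherit from LlamaForCausalLM, which gives our model access to LlamaForCausalLM's predefined modules and layers, such as LlamaAttention and LlamaMLP. Note that almost all model implementations take config and quant_config as arguments to their __init__ method; config and quant_config are passed in via model_loader/loader.py. Because we inherit from LlamaForCausalLM, we can pass our parameters directly to its constructor, which sets the member variables for us.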

class LlamaWrapper(LlamaForCausalLM):
    def __init__(
        self,
        config: LlamaConfig,
        quant_config: Optional[QuantizationConfig] = None,
        prefix: str = "",
    ) -> None:
        super().__init__(config=config, quant_config=quant_config, prefix=prefix)

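Now we define the forward method, which is what gets called at inference time. Note that the signature of forward is essentially the same for any model; you can look at the other models defined in the models directory for reference. To see exactly where forward is called inside the SGLang runtime, take a look at forward_decode and forward_extend in the ModelRunner class.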

    @torch.no_grad()
    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        forward_batch: ForwardBatch,
        pp_proxy_tensors: Optional[PPProxyTensors] = None,
        input_embeds: Optional[torch.Tensor] = None,
        get_embedding: bool = False,
    ) -> LogitsProcessorOutput:

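We now call the __call__ method of self.model (a member variable that LlamaForCausalLM defines in its __init__ method), which eventually calls the forward method of that inner model. After that, we feed the hidden_states into our model's LogitsProcessor (also defined in LlamaForCausalLM).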

        hidden_states = self.model(
            input_ids,
            positions,
            forward_batch,
            input_embeds,
            pp_proxy_tensors=pp_proxy_tensors,
        )

        res: LogitsProcessorOutput = self.logits_processor(
            input_ids,
            hidden_states,
            self.lm_head,
            forward_batch,
        )

After receiving the logits for the next token, we can finally perform our biasing step.

        orig_logits = res.next_token_logits
        res.next_token_logits = torch.where(
            orig_logits > 0,
            orig_logits.sqrt(),
            orig_logits
        )

        return res
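
To see what this biasing does to individual values, here is a plain-Python sketch of the same element-wise rule. The function name bias_logits is hypothetical and the code only mirrors the torch.where expression above on a list of floats. In the tensor version, sqrt is also evaluated on negative entries and produces NaN there, but torch.where discards those results and keeps the original values.

```python
import math
from typing import List


def bias_logits(logits: List[float]) -> List[float]:
    """Take the square root of positive logits; leave the rest unchanged."""
    return [math.sqrt(x) if x > 0 else x for x in logits]


print(bias_logits([4.0, 0.25, 0.0, -1.0]))  # prints [2.0, 0.5, 0.0, -1.0]
```

Note that logits above 1 shrink while logits between 0 and 1 grow, so the transformation compresses the positive range toward 1 rather than scaling it uniformly.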

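Now our LlamaWrapper model is created and ready to be served!

Serving Our Model via SGLang's Offline Engine#

The next step of this walkthrough is to host our new model offline, so that it can be served locally without an HTTP server.

First, create a new file called run.py. We must make sure that SGLang's ModelRegistry can find our model. To do this, we first download the model's configuration and weights from Hugging Face.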

# In the file `run.py`

import asyncio
from functools import lru_cache
from huggingface_hub import snapshot_download
from llama_wrapper import LlamaWrapper # Make sure to import our new model!
import sglang as sgl
from sglang.srt.models.registry import ModelRegistry

# Make sure to request access to this model on Huggingface, then export your
# `HF_TOKEN` to download the model snapshot
llama_dir = snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",
    local_dir="./llama_ckpt",
)

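Now that the model is on disk, we want to point it to LlamaWrapper by changing the architectures field in ./llama_ckpt/config.json to LlamaWrapper. That way, when we pass the path of our model checkpoint to SGLang, it knows that we want to use "LlamaWrapper" instead of "LlamaForCausalLM" as our model.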

{
  "architectures": [
    "LlamaWrapper"
  ],
  ...
}
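
If you prefer not to edit the file by hand, the same change can be scripted with the standard json module. This is a minimal sketch; the helper name set_architecture is hypothetical, and it assumes the config file is plain JSON with a top-level architectures list.

```python
import json
from pathlib import Path


def set_architecture(config_path: str, arch_name: str) -> list:
    """Overwrite the architectures field and return its previous value."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    previous = config.get("architectures", [])
    # Point the checkpoint at the new model class name; other fields are kept.
    config["architectures"] = [arch_name]
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous
```

In this walkthrough it could be called as set_architecture("./llama_ckpt/config.json", "LlamaWrapper") in run.py, after the snapshot download and before the Engine is created.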

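However, if we don't link our LlamaWrapper class to the "LlamaWrapper" registry keyword, SGLang won't be able to find our model. To register our LlamaWrapper, we follow the steps in the section above on registering an external model implementation.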

@lru_cache()
def import_new_model_classes():
    model_arch_name_to_cls = {"LlamaWrapper": LlamaWrapper}
    return model_arch_name_to_cls

ModelRegistry.models.update(import_new_model_classes())

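Lastly, when we create our Engine, we just pass in the path to the local model directory. Our LlamaWrapper is then ready to be served; for this walkthrough, we use the SGLang Engine's non-streaming asynchronous generation interface.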

def main():
    llm = sgl.Engine(model_path="./llama_ckpt")
    sampling_params = {"temperature": 0.2, "top_k": 5}
    prompts = [
        "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
        "Provide a concise factual statement about France’s capital city. The capital of France is",
        "Explain possible future trends in artificial intelligence. The future of AI is",
    ]

    asyncio.run(run_llm(llm, sampling_params, prompts))

    llm.shutdown()

async def run_llm(
    llm,
    sampling_params,
    prompts,
) -> None:
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")

if __name__ == "__main__":
    main()

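Now, when we run python run.py, we get the outputs of our newly created model!

Documentation#

Add the new model to the supported models table in generative_models.md or multimodal_language_models.md.

By following these guidelines, you can add support for new language models and multimodal large language models in SGLang and ensure that they are thoroughly tested and easy to integrate into the system.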