OpenAI APIs - Vision

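SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models. A complete reference for the API is available in the OpenAI API Reference. This tutorial covers the vision APIs for vision language models.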

SGLang supports various vision language models such as Llama 3.2, LLaVA-OneVision, Qwen2.5-VL, Gemma3 and more.

As an alternative to the OpenAI API, you can also use the SGLang offline engine.

Launch A Server

Launch the server in your terminal and wait for it to initialize.

[1]:
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process

vision_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --log-level warning
"""
)

wait_for_server(f"http://localhost:{port}")
[2025-12-30 02:26:39] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:26:39] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:26:39] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:26:45] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:26:45] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:26:45] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:26:47] INFO server_args.py:1564: Attention backend not specified. Use flashinfer backend by default.
[2025-12-30 02:26:47] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:26:49] Ignore import error when loading sglang.srt.multimodal.processors.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
[2025-12-30 02:26:53] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:26:53] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:26:53] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:27:00] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00,  1.51it/s]

Capturing batches (bs=1 avail_mem=60.84 GB): 100%|██████████| 3/3 [00:00<00:00,  3.74it/s]
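`wait_for_server` blocks until the server is ready to accept requests. If you start the server yourself, for example from a separate terminal, a comparable readiness check can be sketched with the standard library alone. This is a minimal sketch, not the implementation of `sglang.utils.wait_for_server`: it assumes the server answers `GET /v1/models` with HTTP 200 once it can serve requests, and the name `wait_until_ready` is hypothetical.

```python
import time
import urllib.request


def wait_until_ready(base_url: str, timeout: float = 300.0, interval: float = 1.0) -> None:
    """Poll GET {base_url}/v1/models until it returns HTTP 200 or `timeout` seconds pass.

    Hypothetical helper. Assumption: the server exposes the OpenAI-style
    /v1/models route once it is able to serve requests.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=interval) as resp:
                if resp.status == 200:
                    return
        except OSError:
            # URLError, HTTPError, connection refused and socket timeouts are all
            # OSError subclasses: the server is not ready yet, so keep polling.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{base_url} not ready after {timeout} seconds")
        time.sleep(interval)
```

Polling a cheap read-only route keeps the check free of side effects; a generous default timeout leaves room for loading checkpoint shards and capturing CUDA graphs, as in the log above.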


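Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
To reduce log length, we set the server's log level to warning; the default log level is info.
We run these notebooks in a CI environment, so the throughput is not representative of actual performance.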

Using cURL

Once the server is up, you can send test requests using curl or requests.

[2]:
import subprocess

curl_command = f"""
curl -s http://localhost:{port}/v1/chat/completions \\
  -H "Content-Type: application/json" \\
  -d '{{
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [
      {{
        "role": "user",
        "content": [
          {{
            "type": "text",
            "text": "What’s in this image?"
          }},
          {{
            "type": "image_url",
            "image_url": {{
              "url": "https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true"
            }}
          }}
        ]
      }}
    ],
    "max_tokens": 300
  }}'
"""

response = subprocess.check_output(curl_command, shell=True).decode()
print_highlight(response)
{"id":"9b4e1c7829544710b237c693ac26718c","object":"chat.completion","created":1767061634,"model":"Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The image shows a man standing at the back of a yellow taxi, holding an iron and ironing a pair of pants laid out on an ironing board. The taxi is parked on a city street, with other taxis and buildings in the background. The man appears to be carefully keeping his balance while performing the task.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":307,"total_tokens":373,"completion_tokens":66,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
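The curl cell writes the JSON body by hand inside an f-string, which is why every literal brace is doubled (`{{` and `}}`), and it runs the command through `shell=True`. A less fragile variant serializes the payload with `json.dumps` and hands curl its arguments as a list, so neither brace escaping nor shell quoting is involved. This is a sketch: `build_curl_argv` and `post_with_curl` are hypothetical names, and the URL assumes the server listens on `localhost` at `port`.

```python
import json
import subprocess


def build_curl_argv(port: int, payload: dict) -> list[str]:
    """Build curl's argument list for a POST to the chat completions endpoint."""
    url = f"http://localhost:{port}/v1/chat/completions"
    return [
        "curl", "-s", url,
        "-H", "Content-Type: application/json",
        # json.dumps emits valid JSON, so no brace doubling or manual escaping.
        "-d", json.dumps(payload),
    ]


def post_with_curl(port: int, payload: dict) -> str:
    # Without shell=True each list item reaches curl as a single argument,
    # so quotes or apostrophes inside the JSON body need no shell quoting.
    return subprocess.check_output(build_curl_argv(port, payload)).decode()
```

With the `data` dictionary from the Python Requests example below, `post_with_curl(port, data)` sends the same request as the hand-written command.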

Using Python Requests

[3]:
import requests

url = f"http://localhost:{port}/v1/chat/completions"

data = {
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true"
                    },
                },
            ],
        }
    ],
    "max_tokens": 300,
}

response = requests.post(url, json=data)
print_highlight(response.text)
{"id":"4ee441429b4e47f5a4134aca4cbe5b7c","object":"chat.completion","created":1767061636,"model":"Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The image shows a man standing at the back of a yellow taxi, holding an iron and ironing a pair of pants laid out on an ironing board. The taxi is parked on a city street, with other taxis and buildings in the background. The man appears to be carefully keeping his balance while performing the task.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":307,"total_tokens":373,"completion_tokens":66,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
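The response body is an OpenAI-style `chat.completion` object. The sketch below extracts the fields usually needed downstream: the reply text, the `finish_reason` (`"length"` means `max_tokens` truncated the answer), and the token counts. Because the text prompt is short, most of the 307 prompt tokens above presumably come from the image. `summarize_completion` is a hypothetical helper; with requests you can pass it `response.text`.

```python
import json


def summarize_completion(body: str) -> dict:
    """Extract reply text, finish reason and token counts from a chat.completion body."""
    completion = json.loads(body)
    choice = completion["choices"][0]
    usage = completion.get("usage") or {}
    return {
        "text": choice["message"]["content"],
        # "stop": the model ended on its own; "length": max_tokens cut it off.
        "finish_reason": choice.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
    }
```

Checking `finish_reason` before using the text is a cheap way to notice when `max_tokens=300` is too small for a detailed image description.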

Using OpenAI Python Client

[4]:
from openai import OpenAI

client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true"
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

print_highlight(response.choices[0].message.content)
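The image shows a man standing at the back of a yellow taxi, holding an iron and ironing a pair of pants laid out on an ironing board. The taxi is parked on a city street, with other taxis and buildings in the background. The man appears to be performing a highly unusual task, as ironing is typically done indoors.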
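These examples pass a public image URL. For an image on local disk, OpenAI-compatible vision APIs also accept a base64-encoded data URL in the same `image_url.url` field; that the SGLang server accepts this form for the model you serve is an assumption worth verifying. A minimal sketch with a hypothetical helper name:

```python
import base64
import mimetypes
from pathlib import Path


def image_file_to_data_url(path: str) -> str:
    """Encode a local image file as a data URL (RFC 2397) for an image_url part."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type for {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The returned string goes where the GitHub URL appears in the request, e.g. `{"type": "image_url", "image_url": {"url": image_file_to_data_url("my_image.png")}}`, where `my_image.png` stands in for your own file. Base64 inflates the payload by about a third, so large images make noticeably bigger requests.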

Multiple-Image Inputs

The server also supports multiple images and interleaved text and images, if the model supports it.

[5]:
from openai import OpenAI

client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true",
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png",
                    },
                },
                {
                    "type": "text",
                    "text": "I have two very different images. They are not related at all. "
                    "Please describe the first image in one sentence, and then describe the second image in another sentence.",
                },
            ],
        }
    ],
    temperature=0,
)

print_highlight(response.choices[0].message.content)
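The first image shows a man ironing clothes at the back of a taxi on a busy city street. The second image is a stylized logo featuring the letters "SGL", with a book and a computer icon incorporated into the design.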
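With several images and text segments, the `content` list grows quickly, and its order is the order the model sees. The sketch below builds it from an ordered sequence of parts; the `Image` wrapper and the `interleave` function are hypothetical names, not part of SGLang or the OpenAI client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    """Marks a part as an image; `url` may be an http(s) URL or a data URL."""
    url: str


def interleave(*parts) -> list[dict]:
    """Build an OpenAI-style content list, keeping the given order of parts."""
    content = []
    for part in parts:
        if isinstance(part, Image):
            content.append({"type": "image_url", "image_url": {"url": part.url}})
        elif isinstance(part, str):
            content.append({"type": "text", "text": part})
        else:
            raise TypeError(f"unsupported content part: {part!r}")
    return content
```

For example, `interleave(Image(first_url), Image(second_url), "Describe each image in one sentence.")` yields the same shape as the `content` list in the cell above. Wrapping images explicitly, rather than guessing from a string prefix, avoids misreading a text part that happens to start with `http`.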
[6]:
terminate_process(vision_process)
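`terminate_process` shuts down the server started in the first cell. In a script, an exception in a request step would skip that call and leave the server running. A small context manager guarantees the shutdown; this hypothetical helper takes the launch and shutdown steps as callables, so it can wrap `launch_server_cmd` and `terminate_process` without assuming anything about their internals.

```python
import contextlib
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


@contextlib.contextmanager
def managed(launch: Callable[[], T], shutdown: Callable[[T], None]) -> Iterator[T]:
    """Call launch(), yield its result, and always call shutdown(result) on exit."""
    handle = launch()
    try:
        yield handle
    finally:
        # Runs both on normal exit and when the with-block raises.
        shutdown(handle)
```

Usage with the notebook's helpers might look like `with managed(lambda: launch_server_cmd(cmd), lambda h: terminate_process(h[0])) as (proc, port): ...`, where `cmd` is the launch command string from the first cell.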