OpenAI APIs - Vision

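SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models. A complete reference for the API is available at the OpenAI API Reference. This tutorial covers the vision APIs for vision language models.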

SGLang supports various vision language models such as Llama 3.2, LLaVA-OneVision, Qwen2.5-VL, Gemma3, and more.

As an alternative to the OpenAI API, you can also use the SGLang offline engine.

Launch A Server

Launch the server in your terminal and wait for it to initialize.

[1]:
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

from sglang.utils import wait_for_server, print_highlight, terminate_process

vision_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct
"""
)

wait_for_server(f"http://localhost:{port}")
[2025-05-15 22:33:54] server_args=ServerArgs(model_path='Qwen/Qwen2.5-VL-7B-Instruct', tokenizer_path='Qwen/Qwen2.5-VL-7B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='Qwen/Qwen2.5-VL-7B-Instruct', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, host='127.0.0.1', port=35600, mem_fraction_static=0.88, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=515979705, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, 
enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=None, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, warmups=None, moe_dense_tp_size=None, n_share_experts_fusion=0, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, mm_attention_backend=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, disaggregation_transfer_backend='mooncake', disaggregation_ib_device=None, pdlb_url=None)
[2025-05-15 22:33:57] Infer the chat template name from the model path and obtain the result: qwen2-vl.
[2025-05-15 22:34:04] Attention backend not set. Use flashinfer backend by default.
[2025-05-15 22:34:04] Automatically reduce --mem-fraction-static to 0.792 because this is a multimodal model.
[2025-05-15 22:34:04] Automatically turn off --chunked-prefill-size for multimodal model.
[2025-05-15 22:34:04] Init torch distributed begin.
[2025-05-15 22:34:04] Init torch distributed ends. mem usage=0.00 GB
[2025-05-15 22:34:04] Load weight begin. avail mem=59.73 GB
[2025-05-15 22:34:05] Multimodal attention backend not set. Use sdpa.
[2025-05-15 22:34:05] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:02,  1.99it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:02<00:04,  1.45s/it]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:04<00:03,  1.82s/it]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:07<00:01,  1.97s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:09<00:00,  2.11s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:09<00:00,  1.89s/it]

[2025-05-15 22:34:15] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=43.74 GB, mem usage=15.99 GB.
[2025-05-15 22:34:15] KV Cache is allocated. #tokens: 20480, K size: 0.55 GB, V size: 0.55 GB
[2025-05-15 22:34:15] Memory pool end. avail mem=42.37 GB
2025-05-15 22:34:15,870 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[2025-05-15 22:34:18] max_total_num_tokens=20480, chunked_prefill_size=-1, max_prefill_tokens=16384, max_running_requests=200, context_len=128000
[2025-05-15 22:34:18] INFO:     Started server process [59076]
[2025-05-15 22:34:18] INFO:     Waiting for application startup.
[2025-05-15 22:34:18] INFO:     Application startup complete.
[2025-05-15 22:34:18] INFO:     Uvicorn running on http://127.0.0.1:35600 (Press CTRL+C to quit)
[2025-05-15 22:34:18] INFO:     127.0.0.1:49930 - "GET /v1/models HTTP/1.1" 200 OK
[2025-05-15 22:34:19] INFO:     127.0.0.1:49938 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-05-15 22:34:19] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
2025-05-15 22:34:20,472 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-15 22:34:20,493 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[2025-05-15 22:34:21] INFO:     127.0.0.1:49948 - "POST /generate HTTP/1.1" 200 OK
[2025-05-15 22:34:21] The server is fired up and ready to roll!


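NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and the notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
We run these notebooks in a parallel CI environment, so the throughput is not representative of actual performance.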
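
The startup log above shows a `GET /v1/models` request arriving just before the server reports that it is ready. The sketch below illustrates how a readiness wait of that kind could be written with the standard library only. It is not the implementation of `wait_for_server`; the function names `is_server_ready` and `wait_until_ready`, and the choice of `/v1/models` as the probe endpoint, are assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request


def is_server_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: treat as "not ready yet".
        return False


def wait_until_ready(probe, max_wait: float = 300.0, interval: float = 1.0) -> bool:
    """Call probe() repeatedly until it returns True or max_wait seconds elapse."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False
```

With a running server, a call such as `wait_until_ready(lambda: is_server_ready(f"http://localhost:{port}"))` would block until the endpoint responds or the time budget runs out. Passing the probe as a callable keeps the retry loop independent of the HTTP details.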

Using cURL

Once the server is up, you can send test requests using curl or requests.

[2]:
import subprocess

curl_command = f"""
curl -s http://localhost:{port}/v1/chat/completions \\
  -H "Content-Type: application/json" \\
  -d '{{
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [
      {{
        "role": "user",
        "content": [
          {{
            "type": "text",
            "text": "What’s in this image?"
          }},
          {{
            "type": "image_url",
            "image_url": {{
              "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
            }}
          }}
        ]
      }}
    ],
    "max_tokens": 300
  }}'
"""

response = subprocess.check_output(curl_command, shell=True).decode()
print_highlight(response)


response = subprocess.check_output(curl_command, shell=True).decode()
print_highlight(response)
[2025-05-15 22:34:24] Prefill batch. #new-seq: 1, #new-token: 307, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:34:25] Decode batch. #running-req: 1, #token: 340, token usage: 0.02, cuda graph: False, gen throughput (token/s): 5.38, #queue-req: 0
[2025-05-15 22:34:26] INFO:     127.0.0.1:53140 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{"id":"9a30f5176d8c4362b361e8e52749c69c","object":"chat.completion","created":1747348463,"model":"Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The image shows a man leaning over the back of a taxi wearing yellow. The taxi's rear shelf is elevated, and he appears to be ironing clothing on an improvised metal frame attached to the shelf. The man is wearing a yellow shirt, and the scene is set on a busy street where another taxi and urban buildings are visible in the background.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":307,"total_tokens":378,"completion_tokens":71,"prompt_tokens_details":null}}
[2025-05-15 22:34:26] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 306, token usage: 0.01, #running-req: 0, #queue-req: 0
[2025-05-15 22:34:26] Decode batch. #running-req: 1, #token: 309, token usage: 0.02, cuda graph: False, gen throughput (token/s): 36.70, #queue-req: 0
[2025-05-15 22:34:27] Decode batch. #running-req: 1, #token: 349, token usage: 0.02, cuda graph: False, gen throughput (token/s): 41.21, #queue-req: 0
[2025-05-15 22:34:28] INFO:     127.0.0.1:53142 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{"id":"3f2eca55de9e45a3bb41951e845871eb","object":"chat.completion","created":1747348466,"model":"Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The image shows a man wearing a yellow shirt leaning out of the passenger-side back door of a yellow taxi. He appears to be ironing clothes on a makeshift table propped up on撑, with the taxi parked or in motion on a city street. Another taxi can be seen passing in the background. The scene captures a humorous or atypical use of the vehicle for personal tasks.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":307,"total_tokens":385,"completion_tokens":78,"prompt_tokens_details":null}}
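
The raw responses above are JSON documents in the chat completion format. As a small sketch of how such a body might be post-processed, the helper below pulls out the assistant text, the finish reason, and the token counts. The helper name `summarize_chat_completion` is hypothetical, and the sample body is a shortened stand-in that only mirrors the field layout seen above.

```python
import json


def summarize_chat_completion(raw: str) -> dict:
    """Extract assistant text, finish reason and token counts from a response body."""
    body = json.loads(raw)
    choice = body["choices"][0]
    usage = body.get("usage") or {}
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
    }


# Shortened, illustrative sample with the same field layout as the responses above.
sample = json.dumps(
    {
        "object": "chat.completion",
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": "A man ironing behind a taxi."},
                "finish_reason": "stop",
            }
        ],
        "usage": {"prompt_tokens": 307, "total_tokens": 378, "completion_tokens": 71},
    }
)

summary = summarize_chat_completion(sample)
print(summary["text"])
```

In the notebook, the same helper could be applied to the `response` string returned by `subprocess.check_output(...).decode()`.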

Using Python Requests

[3]:
import requests

url = f"http://localhost:{port}/v1/chat/completions"

data = {
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
                    },
                },
            ],
        }
    ],
    "max_tokens": 300,
}

response = requests.post(url, json=data)
print_highlight(response.text)
[2025-05-15 22:34:28] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 306, token usage: 0.01, #running-req: 0, #queue-req: 0
[2025-05-15 22:34:28] Decode batch. #running-req: 1, #token: 311, token usage: 0.02, cuda graph: False, gen throughput (token/s): 37.09, #queue-req: 0
[2025-05-15 22:34:29] Decode batch. #running-req: 1, #token: 351, token usage: 0.02, cuda graph: False, gen throughput (token/s): 38.76, #queue-req: 0
[2025-05-15 22:34:30] Decode batch. #running-req: 1, #token: 391, token usage: 0.02, cuda graph: False, gen throughput (token/s): 42.69, #queue-req: 0
[2025-05-15 22:34:31] INFO:     127.0.0.1:53148 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{"id":"35582715c63441ddbc7576e8eaf37d3a","object":"chat.completion","created":1747348468,"model":"Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The image shows a man wearing a yellow shirt leaning into the open trunk of a yellow taxi to press or organize clothes that appear freshly laundered. The clothes are placed on a type of folding laundry board or stand attached to the open trunk of the vehicle. The scene takes place on a busy city street, with other taxis passing by in the background. It gives a humorous and unconventional impression as it juxtaposes laundry services with an unconventional mobile setup on a taxi for pressing or organizing clothes.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":307,"total_tokens":405,"completion_tokens":98,"prompt_tokens_details":null}}
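
The requests above reference the image by a public URL. For an image stored on disk, the OpenAI chat format also allows the `url` field to carry a base64 `data:` URL. The sketch below builds such a content part; it assumes the server accepts data URLs in `image_url` in the same way, which is worth verifying against your SGLang version. The helper names `to_data_url` and `local_image_part` are hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def local_image_part(path: str) -> dict:
    """Build an image_url content part from a local image file."""
    # Guess the MIME type from the file extension; fall back to PNG.
    mime_type, _ = mimetypes.guess_type(path)
    data_url = to_data_url(Path(path).read_bytes(), mime_type or "image/png")
    return {"type": "image_url", "image_url": {"url": data_url}}
```

The returned dictionary can take the place of the `image_url` entry in the `content` list of the request body shown above.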

Using OpenAI Python Client

[4]:
from openai import OpenAI

client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

print_highlight(response.choices[0].message.content)
[2025-05-15 22:34:31] Prefill batch. #new-seq: 1, #new-token: 292, #cached-token: 15, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:34:31] Decode batch. #running-req: 1, #token: 333, token usage: 0.02, cuda graph: False, gen throughput (token/s): 33.64, #queue-req: 0
[2025-05-15 22:34:32] Decode batch. #running-req: 1, #token: 373, token usage: 0.02, cuda graph: False, gen throughput (token/s): 36.58, #queue-req: 0
[2025-05-15 22:34:33] INFO:     127.0.0.1:54202 - "POST /v1/chat/completions HTTP/1.1" 200 OK
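The image shows a man interacting with a yellow taxi on a street. He is holding a blue shirt and an iron, and appears to be either ironing the shirt on the taxi's trunk or turning the back of the taxi into a makeshift ironing station. The scene appears to be set in an urban environment, likely a busy street, with other taxis in the background, which suggests this may be New York City, since the taxis resemble the iconic "yellow cabs".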

Multiple-Image Inputs

The server also supports multiple images and interleaved text and image inputs, provided the model supports them.

[5]:
from openai import OpenAI

client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true",
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png",
                    },
                },
                {
                    "type": "text",
                    "text": "I have two very different images. They are not related at all. "
                    "Please describe the first image in one sentence, and then describe the second image in another sentence.",
                },
            ],
        }
    ],
    temperature=0,
)

print_highlight(response.choices[0].message.content)
[2025-05-15 22:34:34] Prefill batch. #new-seq: 1, #new-token: 2532, #cached-token: 14, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:34:35] Decode batch. #running-req: 1, #token: 2548, token usage: 0.12, cuda graph: False, gen throughput (token/s): 17.00, #queue-req: 0
[2025-05-15 22:34:36] Decode batch. #running-req: 1, #token: 2588, token usage: 0.13, cuda graph: False, gen throughput (token/s): 38.32, #queue-req: 0
[2025-05-15 22:34:36] INFO:     127.0.0.1:54212 - "POST /v1/chat/completions HTTP/1.1" 200 OK
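The first image shows a man ironing clothes on the back of a taxi on a busy city street. The second image is a stylized logo containing the letters "SGL", with a design that incorporates book and computer icons.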
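
The request above places both images before the text. Interleaving works the same way, because the order of the entries in `content` is the order the model receives them in. As a minimal sketch, the hypothetical helper `build_content` below turns a sequence of tagged parts into the content list, which makes it easier to alternate text and images.

```python
from typing import Iterable, List, Tuple


def build_content(parts: Iterable[Tuple[str, str]]) -> List[dict]:
    """Convert ("text", str) and ("image", url) pairs into content parts, keeping order."""
    content = []
    for kind, value in parts:
        if kind == "text":
            content.append({"type": "text", "text": value})
        elif kind == "image":
            content.append({"type": "image_url", "image_url": {"url": value}})
        else:
            raise ValueError(f"unknown part kind: {kind!r}")
    return content


# Interleaved example: a text label in front of each image, then the question.
messages = [
    {
        "role": "user",
        "content": build_content(
            [
                ("text", "First image:"),
                ("image", "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"),
                ("text", "Second image:"),
                ("image", "https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png"),
                ("text", "Describe each image in one sentence."),
            ]
        ),
    }
]
```

The resulting `messages` list can be passed to `client.chat.completions.create(...)` in place of the literal list used above.
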
[6]:
terminate_process(vision_process)