SGLang Frontend Language#

The SGLang frontend language can be used to define simple and easy prompts in a convenient, structured way.

Launch a Server#

Launch the server in your terminal and wait for it to initialize.

[1]:
import requests
import os

from sglang import assistant_begin, assistant_end
from sglang import assistant, function, gen, system, user
from sglang import image
from sglang import RuntimeEndpoint, set_default_backend
from sglang.srt.utils import load_image
from sglang.test.test_utils import is_in_ci
from sglang.utils import print_highlight, terminate_process, wait_for_server

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd


server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0"
)

wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")
[2025-05-15 22:31:26] server_args=ServerArgs(model_path='Qwen/Qwen2.5-7B-Instruct', tokenizer_path='Qwen/Qwen2.5-7B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='Qwen/Qwen2.5-7B-Instruct', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, host='0.0.0.0', port=30831, mem_fraction_static=0.88, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=55613185, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=None, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, warmups=None, moe_dense_tp_size=None, n_share_experts_fusion=0, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, mm_attention_backend=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, disaggregation_transfer_backend='mooncake', disaggregation_ib_device=None, pdlb_url=None)
[2025-05-15 22:31:35] Attention backend not set. Use fa3 backend by default.
[2025-05-15 22:31:35] Init torch distributed begin.
[2025-05-15 22:31:36] Init torch distributed ends. mem usage=0.00 GB
[2025-05-15 22:31:36] Load weight begin. avail mem=78.60 GB
[2025-05-15 22:31:38] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:06<00:19,  6.62s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:36<00:40, 20.39s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:55<00:19, 19.76s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:09<00:00, 17.25s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:09<00:00, 17.26s/it]

[2025-05-15 22:32:48] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=33.65 GB, mem usage=44.95 GB.
[2025-05-15 22:32:48] KV Cache is allocated. #tokens: 20480, K size: 0.55 GB, V size: 0.55 GB
[2025-05-15 22:32:48] Memory pool end. avail mem=32.35 GB
[2025-05-15 22:32:48] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=32768
[2025-05-15 22:32:49] INFO:     Started server process [51693]
[2025-05-15 22:32:49] INFO:     Waiting for application startup.
[2025-05-15 22:32:49] INFO:     Application startup complete.
[2025-05-15 22:32:49] INFO:     Uvicorn running on http://0.0.0.0:30831 (Press CTRL+C to quit)
[2025-05-15 22:32:50] INFO:     127.0.0.1:36390 - "GET /v1/models HTTP/1.1" 200 OK
[2025-05-15 22:32:50] INFO:     127.0.0.1:36396 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-05-15 22:32:50] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:32:51] INFO:     127.0.0.1:36412 - "POST /generate HTTP/1.1" 200 OK
[2025-05-15 22:32:51] The server is fired up and ready to roll!


Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and the notebook code together, so their outputs are combined.
To improve clarity, the server logs are shown in the original black, while the notebook output is highlighted in blue.
We run these notebooks in a parallel CI environment, so the throughput is not representative of actual performance.
Server started on http://localhost:30831

Set the default backend. Note: besides the local server, you can also use OpenAI or other API endpoints; a sketch of the OpenAI alternative follows the cell below.

[2]:
set_default_backend(RuntimeEndpoint(f"http://localhost:{port}"))
[2025-05-15 22:32:55] INFO:     127.0.0.1:53288 - "GET /get_model_info HTTP/1.1" 200 OK
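
For reference, here is a minimal sketch of pointing the frontend language at the OpenAI API instead of the local endpoint. The model name is only an example, and this assumes the openai package is installed and the OPENAI_API_KEY environment variable is set.

from sglang import OpenAI

# Sketch: route all @function programs through the OpenAI API instead of the
# local RuntimeEndpoint. "gpt-4o-mini" is only an example model name.
set_default_backend(OpenAI("gpt-4o-mini"))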

Basic Usage#

The simplest way to use the SGLang frontend language is a basic question-and-answer dialogue between a user and an assistant.

[3]:
@function
def basic_qa(s, question):
    s += system(f"You are a helpful assistant than can answer questions.")
    s += user(question)
    s += assistant(gen("answer", max_tokens=512))
[4]:
state = basic_qa("List 3 countries and their capitals.")
print_highlight(state["answer"])
[2025-05-15 22:32:55] Prefill batch. #new-seq: 1, #new-token: 31, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:32:55] INFO:     127.0.0.1:53298 - "POST /generate HTTP/1.1" 200 OK
Here are three countries and their capitals:

1. France - Paris
2. Germany - Berlin
3. Japan - Tokyo

Multi-Turn Dialogue#

The SGLang frontend language can also be used to define multi-turn conversations.

[5]:
@function
def multi_turn_qa(s):
    s += system(f"You are a helpful assistant than can answer questions.")
    s += user("Please give me a list of 3 countries and their capitals.")
    s += assistant(gen("first_answer", max_tokens=512))
    s += user("Please give me another list of 3 countries and their capitals.")
    s += assistant(gen("second_answer", max_tokens=512))
    return s


state = multi_turn_qa()
print_highlight(state["first_answer"])
print_highlight(state["second_answer"])
[2025-05-15 22:32:55] Prefill batch. #new-seq: 1, #new-token: 18, #cached-token: 18, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:32:55] Decode batch. #running-req: 1, #token: 42, token usage: 0.00, cuda graph: False, gen throughput (token/s): 5.85, #queue-req: 0
[2025-05-15 22:32:56] INFO:     127.0.0.1:53300 - "POST /generate HTTP/1.1" 200 OK
Sure! Here is a list of three countries and their capitals:

1. France - Paris
2. Germany - Berlin
3. Japan - Tokyo
[2025-05-15 22:32:56] Prefill batch. #new-seq: 1, #new-token: 23, #cached-token: 67, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:32:56] Decode batch. #running-req: 1, #token: 104, token usage: 0.01, cuda graph: False, gen throughput (token/s): 62.94, #queue-req: 0
[2025-05-15 22:32:56] INFO:     127.0.0.1:53304 - "POST /generate HTTP/1.1" 200 OK
Sure! Here is another list of three countries and their capitals:

1. Italy - Rome
2. Canada - Ottawa
3. Australia - Canberra

Control Flow#

You can use any Python code within the function body to define more complex control flow.

[6]:
@function
def tool_use(s, question):
    s += assistant(
        "To answer this question: "
        + question
        + ". I need to use a "
        + gen("tool", choices=["calculator", "search engine"])
        + ". "
    )

    if s["tool"] == "calculator":
        s += assistant("The math expression is: " + gen("expression"))
    elif s["tool"] == "search engine":
        s += assistant("The key word to search is: " + gen("word"))


state = tool_use("What is 2 * 2?")
print_highlight(state["tool"])
print_highlight(state["expression"])
[2025-05-15 22:32:56] Prefill batch. #new-seq: 1, #new-token: 25, #cached-token: 8, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:32:56] INFO:     127.0.0.1:53318 - "POST /generate HTTP/1.1" 200 OK
[2025-05-15 22:32:56] Prefill batch. #new-seq: 1, #new-token: 2, #cached-token: 31, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:32:56] Prefill batch. #new-seq: 1, #new-token: 3, #cached-token: 31, token usage: 0.00, #running-req: 1, #queue-req: 0
[2025-05-15 22:32:56] INFO:     127.0.0.1:53320 - "POST /generate HTTP/1.1" 200 OK
calculator
[2025-05-15 22:32:56] Prefill batch. #new-seq: 1, #new-token: 13, #cached-token: 33, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:32:57] Decode batch. #running-req: 1, #token: 66, token usage: 0.00, cuda graph: False, gen throughput (token/s): 56.82, #queue-req: 0
[2025-05-15 22:32:57] Decode batch. #running-req: 1, #token: 1, token usage: 0.00, cuda graph: False, gen throughput (token/s): 67.83, #queue-req: 0
[2025-05-15 22:32:57] INFO:     127.0.0.1:53324 - "POST /generate HTTP/1.1" 200 OK
2 * 2.

- You don't necessarily need a calculator for such a simple multiplication.
- However, if you prefer to use a calculator, you can enter the numbers and the multiplication sign (*).

Result: 2 * 2 = 4

So, the answer is 4.

Parallelism#

Use fork to launch parallel prompts. Because sgl.gen is non-blocking, the for loop below issues two generation calls in parallel.

[7]:
@function
def tip_suggestion(s):
    s += assistant(
        "Here are two tips for staying healthy: "
        "1. Balanced Diet. 2. Regular Exercise.\n\n"
    )

    forks = s.fork(2)
    for i, f in enumerate(forks):
        f += assistant(
            f"Now, expand tip {i+1} into a paragraph:\n"
            + gen("detailed_tip", max_tokens=256, stop="\n\n")
        )

    s += assistant("Tip 1:" + forks[0]["detailed_tip"] + "\n")
    s += assistant("Tip 2:" + forks[1]["detailed_tip"] + "\n")
    s += assistant(
        "To summarize the above two tips, I can say:\n" + gen("summary", max_tokens=512)
    )


state = tip_suggestion()
print_highlight(state["summary"])
[2025-05-15 22:32:57] Prefill batch. #new-seq: 1, #new-token: 35, #cached-token: 14, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:32:57] Prefill batch. #new-seq: 1, #new-token: 35, #cached-token: 14, token usage: 0.00, #running-req: 1, #queue-req: 0
[2025-05-15 22:32:58] INFO:     127.0.0.1:53338 - "POST /generate HTTP/1.1" 200 OK
[2025-05-15 22:32:58] Decode batch. #running-req: 1, #token: 89, token usage: 0.00, cuda graph: False, gen throughput (token/s): 43.69, #queue-req: 0
[2025-05-15 22:32:59] Decode batch. #running-req: 1, #token: 129, token usage: 0.01, cuda graph: False, gen throughput (token/s): 65.95, #queue-req: 0
[2025-05-15 22:32:59] Decode batch. #running-req: 1, #token: 169, token usage: 0.01, cuda graph: False, gen throughput (token/s): 67.58, #queue-req: 0
[2025-05-15 22:33:00] Decode batch. #running-req: 1, #token: 209, token usage: 0.01, cuda graph: False, gen throughput (token/s): 68.06, #queue-req: 0
[2025-05-15 22:33:00] INFO:     127.0.0.1:53328 - "POST /generate HTTP/1.1" 200 OK
[2025-05-15 22:33:00] Prefill batch. #new-seq: 1, #new-token: 199, #cached-token: 39, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:33:01] Decode batch. #running-req: 1, #token: 273, token usage: 0.01, cuda graph: False, gen throughput (token/s): 65.23, #queue-req: 0
[2025-05-15 22:33:01] Decode batch. #running-req: 1, #token: 313, token usage: 0.02, cuda graph: False, gen throughput (token/s): 64.65, #queue-req: 0
[2025-05-15 22:33:02] Decode batch. #running-req: 1, #token: 353, token usage: 0.02, cuda graph: False, gen throughput (token/s): 65.12, #queue-req: 0
[2025-05-15 22:33:02] INFO:     127.0.0.1:41876 - "POST /generate HTTP/1.1" 200 OK
### Tips for Staying Healthy
1. **Balanced Diet**: Make sure your diet includes a variety of nutrient-rich foods such as fruits, vegetables, lean proteins, whole grains, and healthy fats. This helps provide your body with the essential vitamins and minerals it needs to function properly.

2. **Regular Exercise**: Aim for at least 150 minutes of moderate-intensity aerobic activity or 75 minutes of vigorous activity per week, plus muscle-strengthening exercises on two or more days per week. Regular exercise not only improves physical health but also boosts mental well-being by reducing stress and lifting your mood.

Constrained Decoding#

Use regex to specify a regular expression as a decoding constraint. This is only supported for local models.

[8]:
@function
def regular_expression_gen(s):
    s += user("What is the IP address of the Google DNS servers?")
    s += assistant(
        gen(
            "answer",
            temperature=0,
            regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?).){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
        )
    )


state = regular_expression_gen()
print_highlight(state["answer"])
[2025-05-15 22:33:02] Prefill batch. #new-seq: 1, #new-token: 18, #cached-token: 12, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:33:03] INFO:     127.0.0.1:41884 - "POST /generate HTTP/1.1" 200 OK
208.67.222.222

Use regex to define a JSON decoding schema.

[9]:
character_regex = (
    r"""\{\n"""
    + r"""    "name": "[\w\d\s]{1,16}",\n"""
    + r"""    "house": "(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)",\n"""
    + r"""    "blood status": "(Pure-blood|Half-blood|Muggle-born)",\n"""
    + r"""    "occupation": "(student|teacher|auror|ministry of magic|death eater|order of the phoenix)",\n"""
    + r"""    "wand": \{\n"""
    + r"""        "wood": "[\w\d\s]{1,16}",\n"""
    + r"""        "core": "[\w\d\s]{1,16}",\n"""
    + r"""        "length": [0-9]{1,2}\.[0-9]{0,2}\n"""
    + r"""    \},\n"""
    + r"""    "alive": "(Alive|Deceased)",\n"""
    + r"""    "patronus": "[\w\d\s]{1,16}",\n"""
    + r"""    "bogart": "[\w\d\s]{1,16}"\n"""
    + r"""\}"""
)


@function
def character_gen(s, name):
    s += user(
        f"{name} is a character in Harry Potter. Please fill in the following information about this character."
    )
    s += assistant(gen("json_output", max_tokens=256, regex=character_regex))


state = character_gen("Harry Potter")
print_highlight(state["json_output"])
[2025-05-15 22:33:03] Prefill batch. #new-seq: 1, #new-token: 24, #cached-token: 14, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:33:04] Decode batch. #running-req: 1, #token: 53, token usage: 0.00, cuda graph: False, gen throughput (token/s): 22.26, #queue-req: 0
[2025-05-15 22:33:04] Decode batch. #running-req: 1, #token: 93, token usage: 0.00, cuda graph: False, gen throughput (token/s): 65.59, #queue-req: 0
[2025-05-15 22:33:05] Decode batch. #running-req: 1, #token: 133, token usage: 0.01, cuda graph: False, gen throughput (token/s): 57.95, #queue-req: 0
[2025-05-15 22:33:06] INFO:     127.0.0.1:41892 - "POST /generate HTTP/1.1" 200 OK
{
"name": "Harry Potter",
"house": "Gryffindor",
"blood status": "Half-blood",
"occupation": "student",
"wand": {
"wood": "Chestnut",
"core": "Phoenix feather",
"length": 11.0
},
"alive": "Alive",
"patronus": "Stag",
"bogart": "Dementor"
}
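
Hand-writing a JSON regex quickly becomes unwieldy. As an alternative, newer SGLang versions also accept a JSON schema for constrained decoding; the sketch below derives the schema from a Pydantic model. The CharacterInfo model is made up for illustration, and the json_schema argument of gen is an assumption to verify against your SGLang version.

import json

from pydantic import BaseModel


class CharacterInfo(BaseModel):
    # Hypothetical schema covering a subset of the fields in character_regex.
    name: str
    house: str
    occupation: str


@function
def character_gen_with_schema(s, name):
    s += user(
        f"{name} is a character in Harry Potter. Please fill in the following information about this character."
    )
    # Assumption: gen() accepts a json_schema argument (structured-outputs API).
    s += assistant(
        gen(
            "json_output",
            max_tokens=256,
            json_schema=json.dumps(CharacterInfo.model_json_schema()),
        )
    )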

Batching#

Use run_batch to run a batch of prompts.

[10]:
@function
def text_qa(s, question):
    s += user(question)
    s += assistant(gen("answer", stop="\n"))


states = text_qa.run_batch(
    [
        {"question": "What is the capital of the United Kingdom?"},
        {"question": "What is the capital of France?"},
        {"question": "What is the capital of Japan?"},
    ],
    progress_bar=True,
)

for i, state in enumerate(states):
    print_highlight(f"Answer {i+1}: {states[i]['answer']}")
[2025-05-15 22:33:06] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 13, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:33:06] INFO:     127.0.0.1:41902 - "POST /generate HTTP/1.1" 200 OK
  0%|          | 0/3 [00:00<?, ?it/s]
[2025-05-15 22:33:06] Prefill batch. #new-seq: 1, #new-token: 11, #cached-token: 17, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:33:06] Prefill batch. #new-seq: 1, #new-token: 9, #cached-token: 17, token usage: 0.00, #running-req: 1, #queue-req: 0
[2025-05-15 22:33:06] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 19, token usage: 0.00, #running-req: 2, #queue-req: 0
[2025-05-15 22:33:06] Decode batch. #running-req: 3, #token: 51, token usage: 0.00, cuda graph: False, gen throughput (token/s): 57.74, #queue-req: 0
 67%|██████▋   | 2/3 [00:00<00:00,  8.55it/s]
[2025-05-15 22:33:06] INFO:     127.0.0.1:41934 - "POST /generate HTTP/1.1" 200 OK
[2025-05-15 22:33:06] INFO:     127.0.0.1:41948 - "POST /generate HTTP/1.1" 200 OK
100%|██████████| 3/3 [00:00<00:00, 13.97it/s]
[2025-05-15 22:33:06] INFO:     127.0.0.1:41918 - "POST /generate HTTP/1.1" 200 OK
Answer 1: The capital of the United Kingdom is London.
Answer 2: The capital of France is Paris.
Answer 3: The capital of Japan is Tokyo.
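
run_batch dispatches the requests concurrently on the client side. If you need to cap that concurrency, it can reportedly be controlled with the num_threads argument; treat this as an assumption to verify against your SGLang version.

# Sketch: cap client-side concurrency while batching (num_threads is assumed
# to be supported by run_batch in this SGLang version).
states = text_qa.run_batch(
    [
        {"question": "What is the capital of Germany?"},
        {"question": "What is the capital of Italy?"},
    ],
    num_threads=2,
    progress_bar=True,
)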

Streaming#

Use stream to stream the output to the user.

[11]:
@function
def text_qa(s, question):
    s += user(question)
    s += assistant(gen("answer", stop="\n"))


state = text_qa.run(
    question="What is the capital of France?", temperature=0.1, stream=True
)

for out in state.text_iter():
    print(out, end="", flush=True)
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
[2025-05-15 22:33:06] INFO:     127.0.0.1:41960 - "POST /generate HTTP/1.1" 200 OK
[2025-05-15 22:33:06] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 25, token usage: 0.00, #running-req: 0, #queue-req: 0
The capital of France is Paris.<|im_end|>
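
The iterator above yields the full templated prompt as well as the generation. If you only want the generated answer, text_iter can reportedly take the variable name to stream just that segment; the argument is an assumption to check against your SGLang version.

state = text_qa.run(
    question="What is the capital of Japan?", temperature=0.1, stream=True
)

# Assumption: passing the variable name streams only that variable's chunks.
for out in state.text_iter("answer"):
    print(out, end="", flush=True)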

Complex Prompts#

You may use {system|user|assistant}_{begin|end} to define complex prompts.

[12]:
@function
def chat_example(s):
    s += system("You are a helpful assistant.")
    # Same as: s += s.system("You are a helpful assistant.")

    with s.user():
        s += "Question: What is the capital of France?"

    s += assistant_begin()
    s += "Answer: " + gen("answer", max_tokens=100, stop="\n")
    s += assistant_end()


state = chat_example()
print_highlight(state["answer"])
[2025-05-15 22:33:06] Prefill batch. #new-seq: 1, #new-token: 17, #cached-token: 14, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:33:06] INFO:     127.0.0.1:41974 - "POST /generate HTTP/1.1" 200 OK
The capital of France is Paris.
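
The same begin/end pattern should also be available for the user role. Below is a sketch that assumes user_begin and user_end are exported alongside assistant_begin and assistant_end.

from sglang import user_begin, user_end


@function
def chat_example_user_begin_end(s):
    s += system("You are a helpful assistant.")
    # Assumption: user_begin/user_end mirror assistant_begin/assistant_end.
    s += user_begin()
    s += "Question: What is the capital of Japan?"
    s += user_end()
    s += assistant("Answer: " + gen("answer", max_tokens=100, stop="\n"))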
[13]:
terminate_process(server_process)
[2025-05-15 22:33:06] Child process unexpectedly failed with an exit code 9. pid=52335
[2025-05-15 22:33:06] Child process unexpectedly failed with an exit code 9. pid=52133

Multi-modal Generation#

You may use the SGLang frontend language to define multi-modal prompts. See here for supported models.

[14]:
server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --host 0.0.0.0"
)

wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")
[2025-05-15 22:33:12] server_args=ServerArgs(model_path='Qwen/Qwen2.5-VL-7B-Instruct', tokenizer_path='Qwen/Qwen2.5-VL-7B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='Qwen/Qwen2.5-VL-7B-Instruct', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, host='0.0.0.0', port=32910, mem_fraction_static=0.88, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=834899296, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=None, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, warmups=None, moe_dense_tp_size=None, n_share_experts_fusion=0, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, mm_attention_backend=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, disaggregation_transfer_backend='mooncake', disaggregation_ib_device=None, pdlb_url=None)
[2025-05-15 22:33:16] Infer the chat template name from the model path and obtain the result: qwen2-vl.
[2025-05-15 22:33:22] Attention backend not set. Use flashinfer backend by default.
[2025-05-15 22:33:22] Automatically reduce --mem-fraction-static to 0.792 because this is a multimodal model.
[2025-05-15 22:33:22] Automatically turn off --chunked-prefill-size for multimodal model.
[2025-05-15 22:33:22] Init torch distributed begin.
[2025-05-15 22:33:22] Init torch distributed ends. mem usage=0.00 GB
[2025-05-15 22:33:22] Load weight begin. avail mem=61.57 GB
[2025-05-15 22:33:23] Multimodal attention backend not set. Use sdpa.
[2025-05-15 22:33:23] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:01<00:07,  1.87s/it]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:08<00:13,  4.64s/it]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:15<00:11,  5.52s/it]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:21<00:05,  5.93s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:28<00:00,  6.16s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:28<00:00,  5.63s/it]

[2025-05-15 22:33:52] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=62.81 GB, mem usage=-1.23 GB.
[2025-05-15 22:33:52] KV Cache is allocated. #tokens: 20480, K size: 0.55 GB, V size: 0.55 GB
[2025-05-15 22:33:52] Memory pool end. avail mem=61.44 GB
2025-05-15 22:33:52,895 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[2025-05-15 22:33:55] max_total_num_tokens=20480, chunked_prefill_size=-1, max_prefill_tokens=16384, max_running_requests=200, context_len=128000
[2025-05-15 22:33:55] INFO:     Started server process [56085]
[2025-05-15 22:33:55] INFO:     Waiting for application startup.
[2025-05-15 22:33:55] INFO:     Application startup complete.
[2025-05-15 22:33:55] INFO:     Uvicorn running on http://0.0.0.0:32910 (Press CTRL+C to quit)
[2025-05-15 22:33:55] INFO:     127.0.0.1:33972 - "GET /v1/models HTTP/1.1" 200 OK
[2025-05-15 22:33:56] INFO:     127.0.0.1:33984 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-05-15 22:33:56] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
2025-05-15 22:33:57,547 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False


Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and the notebook code together, so their outputs are combined.
To improve clarity, the server logs are shown in the original black, while the notebook output is highlighted in blue.
We run these notebooks in a parallel CI environment, so the throughput is not representative of actual performance.
Server started on http://localhost:32910
[15]:
set_default_backend(RuntimeEndpoint(f"http://localhost:{port}"))
[2025-05-15 22:34:00] INFO:     127.0.0.1:55184 - "GET /get_model_info HTTP/1.1" 200 OK

Ask a question about an image.

[16]:
@function
def image_qa(s, image_file, question):
    s += user(image(image_file) + question)
    s += assistant(gen("answer", max_tokens=256))


image_url = "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
image_bytes, _ = load_image(image_url)
state = image_qa(image_bytes, "What is in the image?")
print_highlight(state["answer"])
2025-05-15 22:34:13,133 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[2025-05-15 22:34:13] Prefill batch. #new-seq: 1, #new-token: 307, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0
[2025-05-15 22:34:14] INFO:     127.0.0.1:34000 - "POST /generate HTTP/1.1" 200 OK
[2025-05-15 22:34:14] The server is fired up and ready to roll!
[2025-05-15 22:34:15] Decode batch. #running-req: 1, #token: 347, token usage: 0.02, cuda graph: False, gen throughput (token/s): 2.31, #queue-req: 0
[2025-05-15 22:34:16] Decode batch. #running-req: 1, #token: 387, token usage: 0.02, cuda graph: False, gen throughput (token/s): 41.87, #queue-req: 0
[2025-05-15 22:34:16] INFO:     127.0.0.1:55200 - "POST /generate HTTP/1.1" 200 OK
The image depicts a man actively ironing clothes while leaning over the trunk lid of a brightly colored taxi. The person is wearing a yellow shirt with printed text ("American International Taxi AD ragazzo") and is holding an iron. The clothes being ironed appear to be on hangers, draped over the trunk lid. Both taxis display American flag markings. The scene is set on a city street, with sidewalks and storefronts visible in the background.
[17]:
terminate_process(server_process)