SGLang Frontend Language#

The SGLang frontend language can be used to define simple and easy-to-use prompts in a convenient, structured way.

Launch a Server#

Launch the server in your terminal and wait for it to initialize.

[1]:
from sglang import assistant_begin, assistant_end
from sglang import assistant, function, gen, system, user
from sglang import image
from sglang import RuntimeEndpoint
from sglang.lang.api import set_default_backend
from sglang.srt.utils import load_image
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import print_highlight, terminate_process, wait_for_server

server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --log-level warning"
)

wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")
[2025-12-30 02:28:09] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:28:09] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:28:09] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:28:15] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:28:15] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:28:15] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:28:17] INFO server_args.py:1564: Attention backend not specified. Use fa3 backend by default.
[2025-12-30 02:28:17] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:28:23] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:28:23] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:28:23] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:28:24] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:28:24] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:28:24] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:28:29] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.24it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.25it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.24it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.33it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.30it/s]

Capturing batches (bs=1 avail_mem=62.72 GB): 100%|██████████| 3/3 [00:00<00:00, 10.09it/s]


Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and the notebook code together, so their outputs are combined.
To improve clarity, the server logs are shown in their original black, while the notebook output is highlighted in blue.
To reduce the log length, we set the server's log level to warning; the default log level is info.
We run these notebooks in a CI environment, so the throughput is not representative of actual performance.
Server started on http://localhost:35562

Set the default backend. Note: besides a local server, you can also use OpenAI or other API endpoints.
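For example, a minimal sketch of pointing the frontend at an OpenAI endpoint instead (this assumes the `OpenAI` backend shipped with sglang, an `OPENAI_API_KEY` in the environment, and an illustrative model name):

```python
from sglang import OpenAI, set_default_backend

# Illustrative only: requires the OpenAI extra dependencies installed and
# OPENAI_API_KEY set in the environment. The model name is an example.
set_default_backend(OpenAI("gpt-4o-mini"))
```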

[2]:
set_default_backend(RuntimeEndpoint(f"http://localhost:{port}"))
[2025-12-30 02:28:40] Endpoint '/get_model_info' is deprecated and will be removed in a future version. Please use '/model_info' instead.

Basic Usage#

The simplest way to use the SGLang frontend language is a simple question-and-answer dialogue between a user and an assistant.

[3]:
@function
def basic_qa(s, question):
    s += system("You are a helpful assistant that can answer questions.")
    s += user(question)
    s += assistant(gen("answer", max_tokens=512))
[4]:
state = basic_qa("List 3 countries and their capitals.")
print_highlight(state["answer"])
Here are three countries and their capitals:

1. France - Paris
2. Germany - Berlin
3. Japan - Tokyo

Multi-turn Dialogue#

The SGLang frontend language can also be used to define multi-turn conversations.

[5]:
@function
def multi_turn_qa(s):
    s += system("You are a helpful assistant that can answer questions.")
    s += user("Please give me a list of 3 countries and their capitals.")
    s += assistant(gen("first_answer", max_tokens=512))
    s += user("Please give me another list of 3 countries and their capitals.")
    s += assistant(gen("second_answer", max_tokens=512))
    return s


state = multi_turn_qa()
print_highlight(state["first_answer"])
print_highlight(state["second_answer"])
Sure! Here is a list of three countries and their capitals:

1. **France** - Paris
2. **Germany** - Berlin
3. **Italy** - Rome
Sure! Here is another list of three countries and their capitals:

1. **Spain** - Madrid
2. **Canada** - Ottawa
3. **Australia** - Canberra

Control Flow#

You can use any Python code within a function to define more complex control flow.

[6]:
@function
def tool_use(s, question):
    s += assistant(
        "To answer this question: "
        + question
        + ". I need to use a "
        + gen("tool", choices=["calculator", "search engine"])
        + ". "
    )

    if s["tool"] == "calculator":
        s += assistant("The math expression is: " + gen("expression"))
    elif s["tool"] == "search engine":
        s += assistant("The key word to search is: " + gen("word"))


state = tool_use("What is 2 * 2?")
print_highlight(state["tool"])
print_highlight(state["expression"])
calculator
2 * 2.

For this simple multiplication, you don't necessarily need a calculator, but I'll compute it for you:

2 * 2 = 4

So, 2 * 2 equals 4.

Parallelism#

Use fork to launch parallel prompts. Because sgl.gen is non-blocking, the for loop below issues two generation calls in parallel.

[7]:
@function
def tip_suggestion(s):
    s += assistant(
        "Here are two tips for staying healthy: "
        "1. Balanced Diet. 2. Regular Exercise.\n\n"
    )

    forks = s.fork(2)
    for i, f in enumerate(forks):
        f += assistant(
            f"Now, expand tip {i+1} into a paragraph:\n"
            + gen("detailed_tip", max_tokens=256, stop="\n\n")
        )

    s += assistant("Tip 1:" + forks[0]["detailed_tip"] + "\n")
    s += assistant("Tip 2:" + forks[1]["detailed_tip"] + "\n")
    s += assistant(
        "To summarize the above two tips, I can say:\n" + gen("summary", max_tokens=512)
    )


state = tip_suggestion()
print_highlight(state["summary"])
1. **Balanced Diet**
- Eat a variety of nutrient-rich foods from all food groups.
- Include plenty of fruits, vegetables, whole grains, lean proteins, and healthy fats.
- Limit processed foods, sugary drinks, excessive sodium, and unhealthy fats.
- A balanced diet helps maintain a healthy weight, boosts energy levels, strengthens immune function, and lowers the risk of chronic disease.

2. **Regular Exercise**
- Engage regularly in physical activities such as walking, running, cycling, swimming, or team sports.
- Aim for at least 150 minutes of moderate-intensity aerobic exercise or 75 minutes of vigorous exercise per week.
- Include muscle-strengthening exercises at least two days per week.
- Regular exercise improves cardiovascular health, strengthens muscles and bones, and enhances overall fitness. It also supports weight management and offers mental health benefits such as stress reduction, improved mood, and better sleep quality.

By combining these two tips, you can significantly improve your overall health.
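As a rough plain-Python analogy (not the SGLang API), fork behaves like running each continuation concurrently and then joining the results back into the parent prompt; the helper below is a hypothetical stand-in for the forked gen() calls:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the gen() call inside each forked branch.
def expand_tip(i: int) -> str:
    return f"expanded tip {i + 1}"

# Run both branches concurrently, like fork(2) with non-blocking gen().
with ThreadPoolExecutor(max_workers=2) as pool:
    detailed = list(pool.map(expand_tip, range(2)))

# Join the branch results back into the parent prompt.
summary_prompt = "\n".join(f"Tip {i + 1}: {tip}" for i, tip in enumerate(detailed))
print(summary_prompt)
```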

Constrained Decoding#

Use regex to specify a regular expression as a decoding constraint. This is only supported by local models.

[8]:
@function
def regular_expression_gen(s):
    s += user("What is the IP address of the Google DNS servers?")
    s += assistant(
        gen(
            "answer",
            temperature=0,
            regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
        )
    )


state = regular_expression_gen()
print_highlight(state["answer"])
208.67.222.222
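As a quick local sanity check, independent of the server, you can verify candidate outputs against the same kind of IP pattern with Python's re module (here the separator dot is escaped as \. so that only literal dots match between octets):

```python
import re

# IP-address pattern with the separator dot escaped so only literal
# dots are accepted between octets.
ip_pattern = r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)"

assert re.fullmatch(ip_pattern, "208.67.222.222") is not None
assert re.fullmatch(ip_pattern, "8.8.8.8") is not None
# Octets above 255 are rejected.
assert re.fullmatch(ip_pattern, "999.1.1.1") is None
```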

Use regex to define a JSON decoding schema.

[9]:
character_regex = (
    r"""\{\n"""
    + r"""    "name": "[\w\d\s]{1,16}",\n"""
    + r"""    "house": "(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)",\n"""
    + r"""    "blood status": "(Pure-blood|Half-blood|Muggle-born)",\n"""
    + r"""    "occupation": "(student|teacher|auror|ministry of magic|death eater|order of the phoenix)",\n"""
    + r"""    "wand": \{\n"""
    + r"""        "wood": "[\w\d\s]{1,16}",\n"""
    + r"""        "core": "[\w\d\s]{1,16}",\n"""
    + r"""        "length": [0-9]{1,2}\.[0-9]{0,2}\n"""
    + r"""    \},\n"""
    + r"""    "alive": "(Alive|Deceased)",\n"""
    + r"""    "patronus": "[\w\d\s]{1,16}",\n"""
    + r"""    "bogart": "[\w\d\s]{1,16}"\n"""
    + r"""\}"""
)


@function
def character_gen(s, name):
    s += user(
        f"{name} is a character in Harry Potter. Please fill in the following information about this character."
    )
    s += assistant(gen("json_output", max_tokens=256, regex=character_regex))


state = character_gen("Harry Potter")
print_highlight(state["json_output"])
{
    "name": "Harry Potter",
    "house": "Gryffindor",
    "blood status": "Half-blood",
    "occupation": "student",
    "wand": {
        "wood": "Holly",
        "core": "Phoenix feather",
        "length": 10.5
    },
    "alive": "Alive",
    "patronus": "Stag",
    "bogart": "Thestral"
}
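Because the pattern fixes every key and the overall layout, any output in this shape is valid JSON and can be parsed directly; the field values below are illustrative:

```python
import json

# An output in the shape enforced by character_regex (values illustrative).
output = """{
    "name": "Harry Potter",
    "house": "Gryffindor",
    "blood status": "Half-blood",
    "occupation": "student",
    "wand": {
        "wood": "Holly",
        "core": "Phoenix feather",
        "length": 10.5
    },
    "alive": "Alive",
    "patronus": "Stag",
    "bogart": "Thestral"
}"""

# The constrained output parses without any cleanup.
character = json.loads(output)
assert character["house"] == "Gryffindor"
assert character["wand"]["length"] == 10.5
```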

Batching#

Use run_batch to run a batch of prompts.

[10]:
@function
def text_qa(s, question):
    s += user(question)
    s += assistant(gen("answer", stop="\n"))


states = text_qa.run_batch(
    [
        {"question": "What is the capital of the United Kingdom?"},
        {"question": "What is the capital of France?"},
        {"question": "What is the capital of Japan?"},
    ],
    progress_bar=True,
)

for i, state in enumerate(states):
    print_highlight(f"Answer {i+1}: {states[i]['answer']}")
100%|██████████| 3/3 [00:00<00:00, 34.82it/s]
Answer 1: The capital of the United Kingdom is London.
Answer 2: The capital of France is Paris.
Answer 3: The capital of Japan is Tokyo.

Streaming#

Use stream to stream the output to the user.

[11]:
@function
def text_qa(s, question):
    s += user(question)
    s += assistant(gen("answer", stop="\n"))


state = text_qa.run(
    question="What is the capital of France?", temperature=0.1, stream=True
)

for out in state.text_iter():
    print(out, end="", flush=True)
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris.<|im_end|>

Complex Prompts#

You can use {system|user|assistant}_{begin|end} to define complex prompts.

[12]:
@function
def chat_example(s):
    s += system("You are a helpful assistant.")
    # Same as: s += s.system("You are a helpful assistant.")

    with s.user():
        s += "Question: What is the capital of France?"

    s += assistant_begin()
    s += "Answer: " + gen("answer", max_tokens=100, stop="\n")
    s += assistant_end()


state = chat_example()
print_highlight(state["answer"])
The capital of France is Paris.
[13]:
terminate_process(server_process)

Multi-modal Generation#

You can use the SGLang frontend language to define multimodal prompts. See here for supported models.

[14]:
server_process, port = launch_server_cmd(
    "python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --host 0.0.0.0 --log-level warning"
)

wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")
[2025-12-30 02:28:52] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:28:52] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:28:52] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:28:54] INFO server_args.py:1564: Attention backend not specified. Use flashinfer backend by default.
[2025-12-30 02:28:54] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:28:57] Ignore import error when loading sglang.srt.multimodal.processors.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
[2025-12-30 02:29:00] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:29:00] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:29:00] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:29:00] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:29:00] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:29:00] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:29:08] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:03,  1.23it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:01<00:02,  1.16it/s]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:02<00:01,  1.10it/s]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:03<00:00,  1.09it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00,  1.41it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00,  1.27it/s]

Capturing batches (bs=1 avail_mem=60.84 GB): 100%|██████████| 3/3 [00:00<00:00,  3.72it/s]


Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and the notebook code together, so their outputs are combined.
To improve clarity, the server logs are shown in their original black, while the notebook output is highlighted in blue.
To reduce the log length, we set the server's log level to warning; the default log level is info.
We run these notebooks in a CI environment, so the throughput is not representative of actual performance.
Server started on http://localhost:37668
[15]:
set_default_backend(RuntimeEndpoint(f"http://localhost:{port}"))
[2025-12-30 02:29:22] Endpoint '/get_model_info' is deprecated and will be removed in a future version. Please use '/model_info' instead.

Ask a question about an image.

[16]:
@function
def image_qa(s, image_file, question):
    s += user(image(image_file) + question)
    s += assistant(gen("answer", max_tokens=256))


image_url = "https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true"
image_bytes, _ = load_image(image_url)
state = image_qa(image_bytes, "What is in the image?")
print_highlight(state["answer"])
The image shows a man standing over the trunk/storage area behind the back seat of a taxi in New York City. He is wearing a bright yellow shirt and is doing something unusual: he is using an ironing board on a pair of jeans. The scene is on a city street, surrounded by the yellow taxis and brick buildings typical of New York architecture. Given the extremely cramped space and the reportedly steep slope of the hill, this behavior appears very unexpected and challenging. The police reportedly allow licensed carriers to cruise here, delivering padded goods as official uniforms twice a day to a New York Alzheimer's care facility operating out of a Queens dormitory facility.
[17]:
terminate_process(server_process)