SGLang Native APIs#

Apart from the OpenAI-compatible APIs, the SGLang Runtime also provides its native server APIs. We introduce the following APIs:

  • /generate (text generation model)

  • /get_model_info

  • /get_server_info

  • /health

  • /health_generate

  • /flush_cache

  • /update_weights

  • /encode (embedding model)

  • /v1/rerank (cross-encoder rerank model)

  • /classify (reward model)

  • /start_expert_distribution_record

  • /stop_expert_distribution_record

  • /dump_expert_distribution_record

  • /tokenize

  • /detokenize

A full list of these APIs can be found at http_server.py.

We mainly use requests to test these APIs in the following examples. You can also use curl.

Launch A Server#

[1]:
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process

server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --log-level warning"
)

wait_for_server(f"https://:{port}")
[2025-12-30 02:33:36] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:33:36] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:33:36] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:33:42] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:33:42] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:33:42] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:33:44] INFO server_args.py:1564: Attention backend not specified. Use fa3 backend by default.
[2025-12-30 02:33:44] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:33:51] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:33:51] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:33:51] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:33:51] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:33:51] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:33:51] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:33:56] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.03it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.02it/s]

Capturing batches (bs=1 avail_mem=76.96 GB): 100%|██████████| 3/3 [00:00<00:00,  7.31it/s]


Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are shown in their original black, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the server's log level to warning; the default log level is info.
We run these notebooks in a CI environment, so the throughput is not representative of actual performance.

Generate (text generation model)#

Generate completions. This is similar to /v1/completions in the OpenAI API. Detailed parameters can be found in the sampling parameters documentation.

[2]:
import requests

url = f"https://:{port}/generate"
data = {"text": "What is the capital of France?"}

response = requests.post(url, json=data)
print_highlight(response.json())
{'text': " The capital of France is Paris. Here's a rough breakdown:\n\n1. **1804** - Carlist revolution on the Marguerite d'Anjou estate in Amiens, France.\n2. **1814** - Napoleon's victory at the Battle of Waterloo.\n3. **1871** - Victory of Napoleon III over Abbé Thiers and Louis Philippe.\n4. **1875** - Napoleon III committed suicide.\n5. **1905** - Bonaparte, Sédois, and others declared a second帝國(第二次帝制)。Paris became their", 'output_ids': [576, 6722, 315, 9625, 374, 12095, 13, 5692, 594, 264, 11165, 29985, 1447, 16, 13, 3070, 16, 23, 15, 19, 334, 481, 3261, 1607, 13791, 389, 279, 23201, 8801, 632, 294, 6, 2082, 73, 283, 12394, 304, 3303, 79363, 11, 9625, 624, 17, 13, 3070, 16, 23, 16, 19, 334, 481, 69427, 594, 12560, 518, 279, 16115, 315, 76375, 624, 18, 13, 3070, 16, 23, 22, 16, 334, 481, 48327, 315, 69427, 14429, 916, 25973, 963, 663, 4813, 323, 11876, 66854, 624, 19, 13, 3070, 16, 23, 22, 20, 334, 481, 69427, 14429, 11163, 18144, 624, 20, 13, 3070, 16, 24, 15, 20, 334, 481, 13481, 391, 19840, 11, 328, 15083, 29048, 11, 323, 3800, 14275, 264, 2086, 100432, 99941, 9909, 106309, 100432, 43316, 74276, 59604, 6116, 862], 'meta_info': {'id': '3a135ab9e0204149805e1784f0b24fb3', 'finish_reason': {'type': 'length', 'length': 128}, 'prompt_tokens': 7, 'weight_version': 'default', 'total_retractions': 0, 'completion_tokens': 128, 'cached_tokens': 0, 'e2e_latency': 0.2956242561340332, 'response_sent_to_client_ts': 1767062045.468527}}
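
Beyond the minimal request above, /generate also accepts a sampling_params dictionary and a stream flag. The sketch below is not part of the original notebook; it assumes the field names described in the sampling parameters documentation (temperature, max_new_tokens) and the server-sent "data: ..." streaming format, so treat it as a rough illustration rather than a definitive reference.

import json

import requests

url = f"https://:{port}/generate"

# Non-streaming request with explicit sampling parameters.
data = {
    "text": "What is the capital of France?",
    "sampling_params": {"temperature": 0.0, "max_new_tokens": 32},
}
response = requests.post(url, json=data)
print_highlight(response.json()["text"])

# Streaming request: the server emits "data: {...}" chunks and ends with
# "data: [DONE]". The "text" field of each chunk is cumulative, so only the
# newly generated suffix is printed.
data["stream"] = True
prev = 0
with requests.post(url, json=data, stream=True) as response:
    for chunk in response.iter_lines(decode_unicode=True):
        if not chunk or not chunk.startswith("data:"):
            continue
        if chunk == "data: [DONE]":
            break
        output = json.loads(chunk[len("data:") :])["text"]
        print(output[prev:], end="", flush=True)
        prev = len(output)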

Get Model Info#

Get the information of the model.

  • model_path: The path/name of the model.

  • is_generation: Whether the model is used as a generation model or an embedding model.

  • tokenizer_path: The path/name of the tokenizer.

  • preferred_sampling_params: The default sampling params specified via --preferred-sampling-params. None is returned in this example because we did not explicitly configure it in the server args.

  • weight_version: The version of the model weights. This is often used to track changes or updates to the model's trained parameters.

  • has_image_understanding: Whether the model has image understanding capability.

  • has_audio_understanding: Whether the model has audio understanding capability.

  • model_type: The model type from the HuggingFace config (e.g., "qwen2", "llama").

  • architectures: The model architectures from the HuggingFace config (e.g., ["Qwen2ForCausalLM"]).

[3]:
url = f"https://:{port}/get_model_info"

response = requests.get(url)
response_json = response.json()
print_highlight(response_json)
assert response_json["model_path"] == "qwen/qwen2.5-0.5b-instruct"
assert response_json["is_generation"] is True
assert response_json["tokenizer_path"] == "qwen/qwen2.5-0.5b-instruct"
assert response_json["preferred_sampling_params"] is None
assert response_json.keys() == {
    "model_path",
    "is_generation",
    "tokenizer_path",
    "preferred_sampling_params",
    "weight_version",
    "has_image_understanding",
    "has_audio_understanding",
    "model_type",
    "architectures",
}
[2025-12-30 02:34:05] Endpoint '/get_model_info' is deprecated and will be removed in a future version. Please use '/model_info' instead.
{'model_path': 'qwen/qwen2.5-0.5b-instruct', 'tokenizer_path': 'qwen/qwen2.5-0.5b-instruct', 'is_generation': True, 'preferred_sampling_params': None, 'weight_version': 'default', 'has_image_understanding': False, 'has_audio_understanding': False, 'model_type': 'qwen2', 'architectures': ['Qwen2ForCausalLM']}

Get Server Info#

Gets the server information including CLI arguments, token limits, and memory pool sizes.

  • Note: get_server_info merges the following deprecated endpoints:

    • get_server_args

    • get_memory_pool_size

    • get_max_total_num_tokens

[4]:
url = f"https://:{port}/get_server_info"

response = requests.get(url)
print_highlight(response.text)
[2025-12-30 02:34:05] Endpoint '/get_server_info' is deprecated and will be removed in a future version. Please use '/server_info' instead.
{"model_path":"qwen/qwen2.5-0.5b-instruct","tokenizer_path":"qwen/qwen2.5-0.5b-instruct", ... (long JSON content omitted here) ... "memory_usage":{"weight":1.06,"kvcache":0.23,"token_capacity":20480,"graph":0.07}}],"version":"0.1.dev1+g26e17f907"}
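
The response is plain JSON, so individual fields can be read directly. A small sketch using two of the top-level keys visible in the (abbreviated) output above:

server_info = response.json()
# Two top-level fields shown in the abbreviated JSON above.
print_highlight(f"model_path: {server_info['model_path']}")
print_highlight(f"version: {server_info['version']}")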

Health Check#

  • /health: Check the health of the server.

  • /health_generate: Check the health of the server by generating one token.

[5]:
url = f"https://:{port}/health_generate"

response = requests.get(url)
print_highlight(response.text)
[6]:
url = f"https://:{port}/health"

response = requests.get(url)
print_highlight(response.text)
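
Because /health is cheap and requires no payload, it works well as a readiness probe. Below is a minimal, hypothetical helper (wait_until_healthy is not part of SGLang) that simply polls /health until it returns HTTP 200 or a timeout expires:

import time

import requests


def wait_until_healthy(base_url: str, timeout_s: float = 60.0) -> bool:
    # Poll /health until the server answers with HTTP 200 or the timeout expires.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=2).status_code == 200:
                return True
        except requests.exceptions.RequestException:
            pass  # server not reachable yet; retry
        time.sleep(1)
    return False


print_highlight(wait_until_healthy(f"https://:{port}"))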

Flush Cache#

Flush the radix cache. It is automatically triggered when the model weights are updated via the /update_weights API.

[7]:
url = f"https://:{port}/flush_cache"

response = requests.post(url)
print_highlight(response.text)
Cache flushed.
Please check backend logs for more details. (When there are running or waiting requests, the operation will not be performed.)

Update Weights From Disk#

Update model weights from disk without restarting the server. Only applicable to models with the same architecture and parameter size.

SGLang supports the update_weights_from_disk API for continuous evaluation during training (save a checkpoint to disk and update the weights from disk).

[8]:
# successful update with same architecture and size

url = f"https://:{port}/update_weights_from_disk"
data = {"model_path": "qwen/qwen2.5-0.5b-instruct"}

response = requests.post(url, json=data)
print_highlight(response.text)
assert response.json()["success"] is True
assert response.json()["message"] == "Succeeded to update model weights."
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.44it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.43it/s]

{"success":true,"message":"Succeeded to update model weights.","num_paused_requests":0}
[9]:
# failed update with different parameter size or wrong name

url = f"https://:{port}/update_weights_from_disk"
data = {"model_path": "qwen/qwen2.5-0.5b-instruct-wrong"}

response = requests.post(url, json=data)
response_json = response.json()
print_highlight(response_json)
assert response_json["success"] is False
assert response_json["message"] == (
    "Failed to get weights iterator: "
    "qwen/qwen2.5-0.5b-instruct-wrong"
    " (repository not found)."
)
[2025-12-30 02:34:07] Failed to get weights iterator: qwen/qwen2.5-0.5b-instruct-wrong (repository not found).
{'success': False, 'message': 'Failed to get weights iterator: qwen/qwen2.5-0.5b-instruct-wrong (repository not found).', 'num_paused_requests': 0}
[10]:
terminate_process(server_process)

Encode (embedding model)#

Encode text into embeddings. Note that this API is only available for embedding models and will raise an error for generation models. Therefore, we launch a new server to serve an embedding model.

[11]:
embedding_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-1.5B-instruct \
    --host 0.0.0.0 --is-embedding --log-level warning
"""
)

wait_for_server(f"https://:{port}")
[2025-12-30 02:34:14] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:34:14] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:34:14] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:34:16] INFO model_config.py:1015: Downcasting torch.float32 to torch.float16.
[2025-12-30 02:34:16] INFO server_args.py:1564: Attention backend not specified. Use fa3 backend by default.
[2025-12-30 02:34:16] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:34:22] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:34:22] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:34:22] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:34:22] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:34:22] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:34:22] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:34:28] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:02<00:02,  2.03s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.32s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.43s/it]



Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are shown in their original black, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the server's log level to warning; the default log level is info.
We run these notebooks in a CI environment, so the throughput is not representative of actual performance.
[12]:
# successful encode for embedding model

url = f"https://:{port}/encode"
data = {"model": "Alibaba-NLP/gte-Qwen2-1.5B-instruct", "text": "Once upon a time"}

response = requests.post(url, json=data)
response_json = response.json()
print_highlight(f"Text embedding (first 10): {response_json['embedding'][:10]}")
Text embedding (first 10): [-0.00023102760314941406, -0.04986572265625, -0.0032711029052734375, 0.011077880859375, -0.0140533447265625, 0.0159912109375, -0.01441192626953125, 0.0059051513671875, -0.0228424072265625, 0.0272979736328125]
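
Since /encode returns the vector as a plain list of floats under the embedding key, downstream similarity computations are straightforward. The sketch below is not part of the original notebook; it compares two texts by the cosine similarity of their embeddings:

import math

import requests


def embed(text: str) -> list:
    # Request an embedding from the /encode endpoint shown above.
    resp = requests.post(
        f"https://:{port}/encode",
        json={"model": "Alibaba-NLP/gte-Qwen2-1.5B-instruct", "text": text},
    )
    resp.raise_for_status()
    return resp.json()["embedding"]


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print_highlight(cosine(embed("Once upon a time"), embed("A long time ago")))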
[13]:
terminate_process(embedding_process)

v1/rerank (cross-encoder rerank model)#

Rerank a list of documents given a query using a cross-encoder model. Note that this API is only available for cross-encoder models, such as BAAI/bge-reranker-v2-m3, and requires the attention backend to be triton or torch_native.

[14]:
reranker_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path BAAI/bge-reranker-v2-m3 \
    --host 0.0.0.0 --disable-radix-cache --chunked-prefill-size -1 --attention-backend triton --is-embedding --log-level warning
"""
)

wait_for_server(f"https://:{port}")
[2025-12-30 02:34:44] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:34:44] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:34:44] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:34:46] INFO model_config.py:1015: Downcasting torch.float32 to torch.float16.
[2025-12-30 02:34:46] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:34:52] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:34:52] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:34:52] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:34:52] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:34:52] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:34:52] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:34:58] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.18it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.18it/s]



Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are shown in their original black, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the server's log level to warning; the default log level is info.
We run these notebooks in a CI environment, so the throughput is not representative of actual performance.
[15]:
# compute rerank scores for query and documents

url = f"https://:{port}/v1/rerank"
data = {
    "model": "BAAI/bge-reranker-v2-m3",
    "query": "what is panda?",
    "documents": [
        "hi",
        "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.",
    ],
}

response = requests.post(url, json=data)
response_json = response.json()
for item in response_json:
    print_highlight(f"Score: {item['score']:.2f} - Document: '{item['document']}'")
Score: 5.26 - Document: 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.'
Score: -8.19 - Document: 'hi'
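
Each item in the response carries both a score and the original document, so picking the best match is a one-liner. A small sketch:

# Pick the highest-scoring document from the rerank response above.
best = max(response_json, key=lambda item: item["score"])
print_highlight(f"Best match ({best['score']:.2f}): {best['document']}")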
[16]:
terminate_process(reranker_process)

Classify (reward model)#

SGLang Runtime also supports reward models. Here we use a reward model to classify (score) the quality of pairwise generations.

[17]:
# Note that SGLang now treats embedding models and reward models as the same type of models.
# This will be updated in the future.

reward_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 --host 0.0.0.0 --is-embedding --log-level warning
"""
)

wait_for_server(f"https://:{port}")
[2025-12-30 02:35:12] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:35:12] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:35:12] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:35:14] INFO server_args.py:1564: Attention backend not specified. Use flashinfer backend by default.
[2025-12-30 02:35:14] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:35:20] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:35:20] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:35:20] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:35:21] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:35:21] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:35:21] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:35:26] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:03<00:10,  3.56s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:07<00:07,  3.67s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:11<00:03,  3.72s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:11<00:00,  2.36s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:11<00:00,  2.84s/it]



Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are shown in their original black, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the server's log level to warning; the default log level is info.
We run these notebooks in a CI environment, so the throughput is not representative of actual performance.
[18]:
from transformers import AutoTokenizer

PROMPT = (
    "What is the range of the numeric output of a sigmoid node in a neural network?"
)

RESPONSE1 = "The output of a sigmoid node is bounded between -1 and 1."
RESPONSE2 = "The output of a sigmoid node is bounded between 0 and 1."

CONVS = [
    [{"role": "user", "content": PROMPT}, {"role": "assistant", "content": RESPONSE1}],
    [{"role": "user", "content": PROMPT}, {"role": "assistant", "content": RESPONSE2}],
]

tokenizer = AutoTokenizer.from_pretrained("Skywork/Skywork-Reward-Llama-3.1-8B-v0.2")
prompts = tokenizer.apply_chat_template(CONVS, tokenize=False, return_dict=False)

url = f"https://:{port}/classify"
data = {"model": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2", "text": prompts}

responses = requests.post(url, json=data).json()
for response in responses:
    print_highlight(f"reward: {response['embedding'][0]}")
reward: -24.125
reward: 1.171875
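
The reward scores can be compared directly to select the preferred answer; here RESPONSE2 (the correct sigmoid range of 0 to 1) receives the higher reward. A small sketch:

# Compare the two rewards returned above and report the preferred response.
rewards = [r["embedding"][0] for r in responses]
best_idx = max(range(len(rewards)), key=lambda i: rewards[i])
print_highlight(f"Preferred answer: RESPONSE{best_idx + 1} (reward={rewards[best_idx]})")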
[19]:
terminate_process(reward_process)

Capture expert selection distribution in MoE models#

SGLang Runtime supports recording the number of times each expert is selected during a MoE model run. This is useful when analyzing the throughput of the model and planning optimizations.

Note: We only print the first 10 lines of the csv below for better readability. Please adjust accordingly if you want to do a deeper analysis of the results.

[20]:
expert_record_server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path Qwen/Qwen1.5-MoE-A2.7B --host 0.0.0.0 --expert-distribution-recorder-mode stat --log-level warning"
)

wait_for_server(f"https://:{port}")
[2025-12-30 02:35:59] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:35:59] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:35:59] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:36:02] INFO server_args.py:1564: Attention backend not specified. Use flashinfer backend by default.
[2025-12-30 02:36:02] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:36:20] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:36:20] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:36:20] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:36:20] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:36:20] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:36:20] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:36:26] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards:   0% Completed | 0/8 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  12% Completed | 1/8 [00:01<00:07,  1.07s/it]
Loading safetensors checkpoint shards:  25% Completed | 2/8 [00:02<00:06,  1.16s/it]
Loading safetensors checkpoint shards:  38% Completed | 3/8 [00:03<00:05,  1.14s/it]
Loading safetensors checkpoint shards:  50% Completed | 4/8 [00:04<00:04,  1.16s/it]
Loading safetensors checkpoint shards:  62% Completed | 5/8 [00:05<00:03,  1.16s/it]
Loading safetensors checkpoint shards:  75% Completed | 6/8 [00:06<00:02,  1.11s/it]
Loading safetensors checkpoint shards:  88% Completed | 7/8 [00:07<00:01,  1.03s/it]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:07<00:00,  1.28it/s]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:07<00:00,  1.01it/s]

Capturing batches (bs=4 avail_mem=47.28 GB):   0%|          | 0/3 [00:00<?, ?it/s][2025-12-30 02:36:36] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /public_sglang_ci/runner-l3b-gpu-45/_work/sglang/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=60,N=1408,device_name=NVIDIA_H100_80GB_HBM3.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2025-12-30 02:36:36] Using MoE kernel config with down_moe=False. Performance might be sub-optimal! Config file not found at /public_sglang_ci/runner-l3b-gpu-45/_work/sglang/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=60,N=1408,device_name=NVIDIA_H100_80GB_HBM3_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
Capturing batches (bs=1 avail_mem=47.15 GB): 100%|██████████| 3/3 [00:02<00:00,  1.13it/s]


Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are shown in their original black, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the server's log level to warning; the default log level is info.
We run these notebooks in a CI environment, so the throughput is not representative of actual performance.
[21]:
response = requests.post(f"https://:{port}/start_expert_distribution_record")
print_highlight(response)

url = f"https://:{port}/generate"
data = {"text": "What is the capital of France?"}

response = requests.post(url, json=data)
print_highlight(response.json())

response = requests.post(f"https://:{port}/stop_expert_distribution_record")
print_highlight(response)

response = requests.post(f"https://:{port}/dump_expert_distribution_record")
print_highlight(response)
{'text': ' _______.\nA. Paris\nB. Washington\nC. Kenya\nD. Washington DC\n答案:\nA', 'output_ids': [32671, 62, 624, 32, 13, 12095, 198, 33, 13, 6515, 198, 34, 13, 36666, 198, 35, 13, 6515, 10922, 198, 102349, 510, 32, 151643], 'meta_info': {'id': 'cca45da79d104d94916759d9b1ab7e01', 'finish_reason': {'type': 'stop', 'matched': 151643}, 'prompt_tokens': 7, 'weight_version': 'default', 'total_retractions': 0, 'completion_tokens': 24, 'cached_tokens': 0, 'e2e_latency': 0.18297195434570312, 'response_sent_to_client_ts': 1767062205.3617764}}
[22]:
terminate_process(expert_record_server_process)

Tokenize/Detokenize Example (Round Trip)#

This example demonstrates how to use the /tokenize and /detokenize endpoints together. We first tokenize a string, then detokenize the resulting IDs to reconstruct the original text. This workflow is useful when you need to handle tokenization externally but still want to use the server for detokenization.

[23]:
tokenizer_free_server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct
"""
)

wait_for_server(f"https://:{port}")
[2025-12-30 02:36:51] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:36:51] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:36:51] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:36:53] INFO server_args.py:1564: Attention backend not specified. Use fa3 backend by default.
[2025-12-30 02:36:53] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:36:54] server_args=ServerArgs(model_path='qwen/qwen2.5-0.5b-instruct', tokenizer_path='qwen/qwen2.5-0.5b-instruct', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=34644, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.841, max_running_requests=128, max_queued_requests=None, max_total_tokens=20480, chunked_prefill_size=8192, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=1, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, stream_output=False, random_seed=761640815, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, served_model_name='qwen/qwen2.5-0.5b-instruct', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='fa3', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', disable_flashinfer_autotune=False, 
speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=4, cuda_graph_bs=[1, 2, 4], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=8192, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 
1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096, 4352, 4608, 4864, 5120, 5376, 5632, 5888, 6144, 6400, 6656, 6912, 7168, 7424, 7680, 7936, 8192], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, enable_fused_qk_norm_rope=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, disaggregation_decode_enable_fake_auto=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2025-12-30 02:36:54] Watchdog TokenizerManager initialized.
[2025-12-30 02:36:54] Using default HuggingFace chat template with detected content format: string
[2025-12-30 02:37:01] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:37:01] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:37:01] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:37:01] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:37:01] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:37:01] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:37:03] Watchdog DetokenizerManager initialized.
[2025-12-30 02:37:04] Init torch distributed begin.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:37:04] Init torch distributed ends. mem usage=0.00 GB
[2025-12-30 02:37:04] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2025-12-30 02:37:07] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
[2025-12-30 02:37:07] Load weight begin. avail mem=78.58 GB
[2025-12-30 02:37:07] Found local HF snapshot for qwen/qwen2.5-0.5b-instruct at /hf_home/hub/models--qwen--qwen2.5-0.5b-instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775; skipping download.
[2025-12-30 02:37:07] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.28it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.28it/s]

[2025-12-30 02:37:08] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=77.52 GB, mem usage=1.06 GB.
[2025-12-30 02:37:08] Using KV cache dtype: torch.bfloat16
[2025-12-30 02:37:08] KV Cache is allocated. #tokens: 20480, K size: 0.12 GB, V size: 0.12 GB
[2025-12-30 02:37:08] Memory pool end. avail mem=77.12 GB
[2025-12-30 02:37:08] Capture cuda graph begin. This can take up to several minutes. avail mem=77.02 GB
[2025-12-30 02:37:08] Capture cuda graph bs [1, 2, 4]
Capturing batches (bs=1 avail_mem=76.96 GB): 100%|██████████| 3/3 [00:00<00:00,  4.02it/s]
[2025-12-30 02:37:09] Capture cuda graph end. Time elapsed: 1.51 s. mem usage=0.07 GB. avail mem=76.95 GB.
[2025-12-30 02:37:10] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=128, context_len=32768, available_gpu_mem=76.95 GB
[2025-12-30 02:37:11] INFO:     Started server process [3849041]
[2025-12-30 02:37:11] INFO:     Waiting for application startup.
[2025-12-30 02:37:11] Using default chat sampling params from model generation config: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
[2025-12-30 02:37:11] Using default chat sampling params from model generation config: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
[2025-12-30 02:37:11] INFO:     Application startup complete.
[2025-12-30 02:37:11] INFO:     Uvicorn running on http://127.0.0.1:34644 (Press CTRL+C to quit)
[2025-12-30 02:37:11] INFO:     127.0.0.1:47404 - "GET /v1/models HTTP/1.1" 200 OK
[2025-12-30 02:37:12] INFO:     127.0.0.1:47414 - "GET /model_info HTTP/1.1" 200 OK
[2025-12-30 02:37:12] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-12-30 02:37:12] INFO:     127.0.0.1:47426 - "POST /generate HTTP/1.1" 200 OK
[2025-12-30 02:37:12] The server is fired up and ready to roll!


Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are shown in their original black, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the server's log level to warning; the default log level is info.
We run these notebooks in a CI environment, so the throughput is not representative of actual performance.
[24]:
import requests
from sglang.utils import print_highlight

base_url = f"https://:{port}"
tokenize_url = f"{base_url}/tokenize"
detokenize_url = f"{base_url}/detokenize"

model_name = "qwen/qwen2.5-0.5b-instruct"
input_text = "SGLang provides efficient tokenization endpoints."
print_highlight(f"Original Input Text:\n'{input_text}'")

# --- tokenize the input text ---
tokenize_payload = {
    "model": model_name,
    "prompt": input_text,
    "add_special_tokens": False,
}
try:
    tokenize_response = requests.post(tokenize_url, json=tokenize_payload)
    tokenize_response.raise_for_status()
    tokenization_result = tokenize_response.json()
    token_ids = tokenization_result.get("tokens")

    if not token_ids:
        raise ValueError("Tokenization returned empty tokens.")

    print_highlight(f"\nTokenized Output (IDs):\n{token_ids}")
    print_highlight(f"Token Count: {tokenization_result.get('count')}")
    print_highlight(f"Max Model Length: {tokenization_result.get('max_model_len')}")

    # --- detokenize the obtained token IDs ---
    detokenize_payload = {
        "model": model_name,
        "tokens": token_ids,
        "skip_special_tokens": True,
    }

    detokenize_response = requests.post(detokenize_url, json=detokenize_payload)
    detokenize_response.raise_for_status()
    detokenization_result = detokenize_response.json()
    reconstructed_text = detokenization_result.get("text")

    print_highlight(f"\nDetokenized Output (Text):\n'{reconstructed_text}'")

    if input_text == reconstructed_text:
        print_highlight(
            "\nRound Trip Successful: Original and reconstructed text match."
        )
    else:
        print_highlight(
            "\nRound Trip Mismatch: Original and reconstructed text differ."
        )

except requests.exceptions.RequestException as e:
    print_highlight(f"\nHTTP Request Error: {e}")
except Exception as e:
    print_highlight(f"\nAn error occurred: {e}")
Original Input Text:
'SGLang provides efficient tokenization endpoints.'
[2025-12-30 02:37:16] INFO:     127.0.0.1:47442 - "POST /tokenize HTTP/1.1" 200 OK

Tokenized Output (IDs):
[50, 3825, 524, 5707, 11050, 3950, 2022, 36342, 13]
Token Count: 9
Max Model Length: 131072
[2025-12-30 02:37:16] INFO:     127.0.0.1:47454 - "POST /detokenize HTTP/1.1" 200 OK

Detokenized Output (Text):
'SGLang provides efficient tokenization endpoints.'

Round Trip Successful: Original and reconstructed text match.
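
Since this server also serves the generation model, the IDs returned by /tokenize can be fed straight back into /generate. The sketch below is not part of the original notebook and assumes /generate accepts an input_ids field in place of text:

# Generate a short continuation from pre-tokenized input instead of raw text.
generate_payload = {
    "input_ids": token_ids,
    "sampling_params": {"max_new_tokens": 16, "temperature": 0.0},
}
gen_response = requests.post(f"{base_url}/generate", json=generate_payload)
gen_response.raise_for_status()
print_highlight(gen_response.json()["text"])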
[25]:
terminate_process(tokenizer_free_server_process)