Launch GLM-4.5 / GLM-4.6 / GLM-4.7 with SGLang#

Serving GLM-4.5 / GLM-4.6 FP8 models on 8xH100/H200 GPUs

python3 -m sglang.launch_server --model zai-org/GLM-4.6-FP8 --tp 8

Configuration Tips#

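--max-mamba-cache-size: Raise --max-mamba-cache-size to enlarge the mamba cache and allow more requests to run concurrently. The trade-off is less space for the KV cache, so tune the value to your workload.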

EAGLE Speculative Decoding#

Description: SGLang supports running GLM-4.5 / GLM-4.6 models with EAGLE speculative decoding.

Usage: Add the arguments --speculative-algorithm, --speculative-num-steps, --speculative-eagle-topk, and --speculative-num-draft-tokens to enable this feature. For example:

python3 -m sglang.launch_server \
  --model-path zai-org/GLM-4.6-FP8 \
  --tp-size 8 \
  --tool-call-parser glm45  \
  --reasoning-parser glm45  \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3  \
  --speculative-eagle-topk 1  \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.9 \
  --served-model-name glm-4.6-fp8 \
  --enable-custom-logit-processor

Note: For GLM-4.7, set --tool-call-parser to glm47. For GLM-4.5 and GLM-4.6, set it to glm45.
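The parser rule in the note above is easy to get wrong when one launch script serves several model versions. The following is a minimal sketch of a helper that picks the tool-call parser from the model path and assembles the same EAGLE launch command shown earlier. The names `TOOL_CALL_PARSERS`, `tool_call_parser_for`, and `build_launch_command` are hypothetical and not part of SGLang; the flags and values are copied from the example above, and the reasoning parser is left at glm45 because this page only mentions a change to the tool-call parser.

```python
import shlex

# Parser names come from the note above; this mapping helper is hypothetical.
TOOL_CALL_PARSERS = {"GLM-4.5": "glm45", "GLM-4.6": "glm45", "GLM-4.7": "glm47"}


def tool_call_parser_for(model_path: str) -> str:
    """Pick the --tool-call-parser value based on the model family in the path."""
    for family, parser in TOOL_CALL_PARSERS.items():
        if family in model_path:
            return parser
    raise ValueError(f"No known tool-call parser for {model_path!r}")


def build_launch_command(model_path: str, tp_size: int = 8) -> list[str]:
    """Assemble the EAGLE launch command using the flag values from the example above."""
    return [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp-size", str(tp_size),
        "--tool-call-parser", tool_call_parser_for(model_path),
        # Assumption: the source only mentions changing the tool-call parser,
        # so the reasoning parser is kept as in the example.
        "--reasoning-parser", "glm45",
        "--speculative-algorithm", "EAGLE",
        "--speculative-num-steps", "3",
        "--speculative-eagle-topk", "1",
        "--speculative-num-draft-tokens", "4",
    ]


if __name__ == "__main__":
    # Print a shell-ready command line for the FP8 model used on this page.
    print(shlex.join(build_launch_command("zai-org/GLM-4.6-FP8")))
```

Flags such as --mem-fraction-static and --served-model-name are omitted here to keep the sketch short; append them to the returned list as needed.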

Thinking Budget#

In SGLang, a thinking budget can be implemented with a CustomLogitProcessor.

Start the server with the --enable-custom-logit-processor flag enabled.

Sample request

import openai
from rich.pretty import pprint
from sglang.srt.sampling.custom_logit_processor import Glm4MoeThinkingBudgetLogitProcessor


client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="*")
response = client.chat.completions.create(
    model="zai-org/GLM-4.6",
    messages=[
        {
            "role": "user",
            "content": "Question: Is Paris the Capital of France?",
        }
    ],
    max_tokens=1024,
    extra_body={
        "custom_logit_processor": Glm4MoeThinkingBudgetLogitProcessor().to_str(),
        "custom_params": {
            "thinking_budget": 512,
        },
    },
)
pprint(response)
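To make the idea behind a thinking budget more concrete, here is a conceptual sketch of what a budget-enforcing logit processor could do at each decoding step: leave the logits alone while the model is within budget, and force the end-of-thinking token once the budget is used up. This is an illustration under stated assumptions, not the actual code of Glm4MoeThinkingBudgetLogitProcessor; the function name `apply_thinking_budget`, its parameters, and the token id in the usage example are all hypothetical.

```python
import math


def apply_thinking_budget(
    logits: list[float],
    num_thinking_tokens: int,
    thinking_budget: int,
    end_think_id: int,
) -> list[float]:
    """Return logits for the next token, forcing end-of-thinking once the budget is spent.

    Conceptual illustration only; all names and ids here are hypothetical.
    """
    if num_thinking_tokens < thinking_budget:
        # Still within budget: leave the model's distribution untouched.
        return list(logits)
    # Budget exhausted: mask every token except the end-of-thinking token.
    forced = [-math.inf] * len(logits)
    forced[end_think_id] = 0.0
    return forced


if __name__ == "__main__":
    step_logits = [0.1, 0.2, 0.3]
    # Within budget (2 of 5 thinking tokens used): logits pass through unchanged.
    print(apply_thinking_budget(step_logits, 2, 5, end_think_id=1))
    # Budget reached: only the hypothetical end-of-thinking token (id 1) stays possible.
    print(apply_thinking_budget(step_logits, 5, 5, end_think_id=1))
```

Masking at the logit level means the sampler can only pick the end-of-thinking token once the budget is reached, regardless of temperature or top-p settings.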
