Speculative Decoding#

SGLang now provides an EAGLE-based (EAGLE-2/EAGLE-3) speculative decoding option. Our implementation aims to maximize speed and efficiency and is considered one of the fastest among open-source LLM engines.

Performance Highlights#

See below for the large throughput improvement that EAGLE3 decoding brings for LLaMA-Instruct 3.1 8B on MT bench. For more details, please see the EAGLE3 paper.

| Method | Throughput (tokens/s) |
| --- | --- |
| SGLang (w/o speculative decoding, 1x H100) | 158.34 |
| SGLang + EAGLE-2 (1x H100) | 244.10 |
| SGLang + EAGLE-3 (1x H100) | 373.25 |

EAGLE Decoding#

Enabling EAGLE speculative decoding involves the following parameters:

  • speculative_draft_model_path: Specifies the draft model. This parameter is required.

  • speculative_num_steps: Depth of autoregressive drafting. Increasing it widens the speculation range but risks rejection cascades. Default is 5.

  • speculative_eagle_topk: Branching factor per step. A larger value improves candidate diversity and tends to raise the acceptance rate, but also increases memory/compute consumption. Default is 4.

  • speculative_num_draft_tokens: Maximum parallel verification capacity. A larger value allows deeper tree evaluation but increases GPU memory usage. Default is 8.

These parameters are shared by EAGLE-2 and EAGLE-3.

You can use bench_speculative.py to find the best combination of these parameters.

In the documentation below, we set --cuda-graph-max-bs to a small value to speed up engine startup. For your own workload, tune the parameters above together with --cuda-graph-max-bs, --max-running-requests, and --mem-fraction-static for the best performance.
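
The same knobs are also exposed by the offline engine API. The following is a minimal sketch, not a definitive recipe: it assumes the EAGLE-2 draft model used later on this page, and relies on sgl.Engine accepting the server flags as keyword arguments (dashes replaced by underscores).

import sglang as sgl  # offline engine API

# Speculative-decoding flags map to Engine keyword arguments.
llm = sgl.Engine(
    model_path="meta-llama/Llama-2-7b-chat-hf",
    speculative_algorithm="EAGLE",
    speculative_draft_model_path="lmsys/sglang-EAGLE-llama2-chat-7B",
    speculative_num_steps=3,          # drafting depth
    speculative_eagle_topk=4,         # branching factor per step
    speculative_num_draft_tokens=16,  # parallel verification budget
)
out = llm.generate("List 3 countries and their capitals.", {"temperature": 0, "max_new_tokens": 64})
print(out["text"])
llm.shutdown()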

EAGLE-2 Decoding#

You can enable EAGLE-2 decoding by setting --speculative-algorithm EAGLE and choosing an appropriate draft model.

[1]:
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process

import openai
[2025-12-30 02:31:07] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:31:07] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:31:07] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 3 \
    --speculative-eagle-topk 4 --speculative-num-draft-tokens 16 --cuda-graph-max-bs 8 --log-level warning
"""
)

wait_for_server(f"https://:{port}")
[2025-12-30 02:31:14] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:31:14] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:31:14] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:31:17] INFO server_args.py:1564: Attention backend not specified. Use flashinfer backend by default.
[2025-12-30 02:31:17] WARNING server_args.py:2016: Overlap scheduler is disabled when spec v2 is off or using unsupported speculative algorithm. You can set env SGLANG_ENABLE_SPEC_V2=True to enable the experimental overlap scheduler.
[2025-12-30 02:31:17] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:31:23] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:31:23] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:31:23] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:31:29] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
[2025-12-30 02:31:30] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:31:30] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:31:30] INFO utils.py:164: NumExpr defaulting to 16 threads.
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:07<00:07,  7.41s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:09<00:00,  4.31s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:09<00:00,  4.78s/it]

Capturing batches (bs=1 avail_mem=54.86 GB): 100%|██████████| 4/4 [00:00<00:00, 10.04it/s]
[2025-12-30 02:31:41] SPECULATIVE_MOE_RUNNER_BACKEND is not initialized, using auto backend
[2025-12-30 02:31:41] SPECULATIVE_MOE_A2A_BACKEND is not initialized, using none backend
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.81s/it]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.81s/it]

Capturing batches (bs=1 avail_mem=53.67 GB): 100%|██████████| 4/4 [00:05<00:00,  1.41s/it]
Capturing batches (bs=1 avail_mem=53.57 GB): 100%|██████████| 4/4 [00:00<00:00, 108.16it/s]


NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the log level to warning for the server; the default log level is info.
We are running these notebooks in a CI environment, so the throughput here is not representative of actual performance.
[3]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")
Response: ChatCompletion(id='39a3c1fc472943e0a3bc3545f9de793b', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=' Sure! Here are three countries and their capitals:\n\n1. Country: France\nCapital: Paris\n2. Country: Japan\nCapital: Tokyo\n3. Country: Brazil\nCapital: Brasília', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=2)], created=1767061917, model='meta-llama/Llama-2-7b-chat-hf', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=48, prompt_tokens=17, total_tokens=65, completion_tokens_details=None, prompt_tokens_details=None, reasoning_tokens=0), metadata={'weight_version': 'default'})
[4]:
terminate_process(server_process)

EAGLE-2 Decoding with torch.compile#

You can also enable torch.compile for further optimization and optionally set --torch-compile-max-bs:

[5]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 \
        --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.6 \
            --enable-torch-compile --torch-compile-max-bs 2 --log-level warning
"""
)

wait_for_server(f"https://:{port}")
[2025-12-30 02:32:02] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:32:02] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:32:02] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:32:05] INFO server_args.py:1564: Attention backend not specified. Use flashinfer backend by default.
[2025-12-30 02:32:05] WARNING server_args.py:2016: Overlap scheduler is disabled when spec v2 is off or using unsupported speculative algorithm. You can set env SGLANG_ENABLE_SPEC_V2=True to enable the experimental overlap scheduler.
[2025-12-30 02:32:05] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:32:11] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:32:11] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:32:11] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:32:11] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:32:11] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:32:11] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:32:16] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:02<00:02,  2.27s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00,  1.56s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00,  1.66s/it]

Capturing batches (bs=2 avail_mem=54.89 GB):  25%|██▌       | 1/4 [00:00<00:00,  6.92it/s]/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/functions.py:1692: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
  torch._dynamo.utils.warn_once(msg)
Capturing batches (bs=1 avail_mem=54.80 GB): 100%|██████████| 4/4 [00:17<00:00,  4.41s/it]
[2025-12-30 02:32:40] SPECULATIVE_MOE_RUNNER_BACKEND is not initialized, using auto backend
[2025-12-30 02:32:40] SPECULATIVE_MOE_A2A_BACKEND is not initialized, using none backend
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.35s/it]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.35s/it]

Capturing batches (bs=1 avail_mem=53.51 GB): 100%|██████████| 4/4 [00:06<00:00,  1.59s/it]
Capturing batches (bs=1 avail_mem=53.37 GB): 100%|██████████| 4/4 [00:00<00:00, 89.64it/s]


NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the log level to warning for the server; the default log level is info.
We are running these notebooks in a CI environment, so the throughput here is not representative of actual performance.
[6]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")
Response: ChatCompletion(id='f7f3c00416ae48589055d5c5d1d110b3', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=' Sure! Here are three countries and their capitals:\n\n1. Country: France\nCapital: Paris\n2. Country: Japan\nCapital: Tokyo\n3. Country: Brazil\nCapital: Brasília', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=2)], created=1767061975, model='meta-llama/Llama-2-7b-chat-hf', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=48, prompt_tokens=17, total_tokens=65, completion_tokens_details=None, prompt_tokens_details=None, reasoning_tokens=0), metadata={'weight_version': 'default'})
[7]:
terminate_process(server_process)

EAGLE-2 Decoding via Frequency-Ranked Speculative Sampling#

By using a truncated vocabulary of high-frequency tokens in the draft model, EAGLE speculative decoding reduces the computational overhead of lm_head, accelerating the pipeline without degrading quality. For more details, check out the paper.

In our implementation, set --speculative-token-map to enable this optimization. You can obtain the high-frequency tokens used by FR-Spec from this model, or download them directly from this repo.

Thanks to Weilin Zhao and Zhousx for the contribution.
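
To sanity-check a token map before passing it to --speculative-token-map, you can download and inspect it directly. This is a small sketch, assuming freq_32768.pt stores the 32768 most frequent token IDs; see the repo above for the authoritative format.

import torch
from huggingface_hub import hf_hub_download

# Assumption: the .pt file holds a sequence of high-frequency token IDs.
path = hf_hub_download("thunlp/LLaMA3-Instruct-8B-FR-Spec", "freq_32768.pt")
token_ids = torch.load(path, map_location="cpu")
print(len(token_ids))  # expected: 32768 entries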

[8]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3-8B-Instruct --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B --speculative-num-steps 5 \
    --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --speculative-token-map thunlp/LLaMA3-Instruct-8B-FR-Spec/freq_32768.pt \
    --mem-fraction 0.7 --cuda-graph-max-bs 2 --dtype float16  --log-level warning
"""
)

wait_for_server(f"https://:{port}")
[2025-12-30 02:33:01] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:33:01] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:33:01] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:33:03] WARNING model_config.py:1019: Casting torch.bfloat16 to torch.float16.
[2025-12-30 02:33:03] INFO server_args.py:1564: Attention backend not specified. Use flashinfer backend by default.
[2025-12-30 02:33:03] WARNING server_args.py:2016: Overlap scheduler is disabled when spec v2 is off or using unsupported speculative algorithm. You can set env SGLANG_ENABLE_SPEC_V2=True to enable the experimental overlap scheduler.
[2025-12-30 02:33:03] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:33:04] Casting torch.bfloat16 to torch.float16.
[2025-12-30 02:33:10] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:33:10] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:33:10] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:33:10] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:33:10] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:33:10] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:33:12] Casting torch.bfloat16 to torch.float16.
[2025-12-30 02:33:12] Casting torch.bfloat16 to torch.float16.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:33:15] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:05<00:16,  5.38s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:10<00:10,  5.24s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:15<00:05,  5.06s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:16<00:00,  3.61s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:16<00:00,  4.19s/it]

Capturing batches (bs=1 avail_mem=59.73 GB): 100%|██████████| 4/4 [00:00<00:00,  6.01it/s]
[2025-12-30 02:33:36] SPECULATIVE_MOE_RUNNER_BACKEND is not initialized, using auto backend
[2025-12-30 02:33:36] SPECULATIVE_MOE_A2A_BACKEND is not initialized, using none backend
[2025-12-30 02:33:36] Warning: Target model's context_length (8192) is greater than the derived context_length (2048). This may lead to incorrect model outputs or CUDA errors. Note that the derived context_length may differ from max_position_embeddings in the model's config.
[2025-12-30 02:33:36] Overriding the draft model's max_position_embeddings to 8192.
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.44s/it]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.44s/it]

Capturing batches (bs=1 avail_mem=58.40 GB): 100%|██████████| 4/4 [00:06<00:00,  1.67s/it]
Capturing batches (bs=1 avail_mem=58.26 GB): 100%|██████████| 4/4 [00:00<00:00, 102.99it/s]


NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the log level to warning for the server; the default log level is info.
We are running these notebooks in a CI environment, so the throughput here is not representative of actual performance.
[9]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")
Response: ChatCompletion(id='44f1cdf182c2480084e7e36aea90a5f2', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Here are 3 countries and their capitals:\n\n1. **France** - **Paris**\n2. **Japan** - **Tokyo**\n3. **Australia** - **Canberra**', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=128009)], created=1767062033, model='meta-llama/Meta-Llama-3-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=39, prompt_tokens=18, total_tokens=57, completion_tokens_details=None, prompt_tokens_details=None, reasoning_tokens=0), metadata={'weight_version': 'default'})
[10]:
terminate_process(server_process)

EAGLE-3 Decoding#

You can enable EAGLE-3 decoding by setting --speculative-algorithm EAGLE3 and choosing an appropriate draft model.

[11]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct  --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B --speculative-num-steps 5 \
        --speculative-eagle-topk 8 --speculative-num-draft-tokens 32 --mem-fraction 0.6 \
        --cuda-graph-max-bs 2 --dtype float16 --log-level warning
"""
)

wait_for_server(f"https://:{port}")
[2025-12-30 02:33:58] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:33:58] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:33:58] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:34:01] WARNING model_config.py:1019: Casting torch.bfloat16 to torch.float16.
[2025-12-30 02:34:01] INFO server_args.py:1564: Attention backend not specified. Use flashinfer backend by default.
[2025-12-30 02:34:01] WARNING server_args.py:2016: Overlap scheduler is disabled when spec v2 is off or using unsupported speculative algorithm. You can set env SGLANG_ENABLE_SPEC_V2=True to enable the experimental overlap scheduler.
[2025-12-30 02:34:01] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:34:01] Casting torch.bfloat16 to torch.float16.
[2025-12-30 02:34:08] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:34:08] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:34:08] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:34:08] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:34:08] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:34:08] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:34:10] Casting torch.bfloat16 to torch.float16.
[2025-12-30 02:34:10] Casting torch.bfloat16 to torch.float16.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:34:13] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:04<00:12,  4.29s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:08<00:08,  4.47s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:13<00:04,  4.43s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:14<00:00,  3.14s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:14<00:00,  3.61s/it]

Capturing batches (bs=1 avail_mem=55.56 GB): 100%|██████████| 4/4 [00:00<00:00, 12.72it/s]
[2025-12-30 02:34:30] SPECULATIVE_MOE_RUNNER_BACKEND is not initialized, using auto backend
[2025-12-30 02:34:30] SPECULATIVE_MOE_A2A_BACKEND is not initialized, using none backend
[2025-12-30 02:34:30] Warning: Target model's context_length (131072) is greater than the derived context_length (2048). This may lead to incorrect model outputs or CUDA errors. Note that the derived context_length may differ from max_position_embeddings in the model's config.
[2025-12-30 02:34:30] Overriding the draft model's max_position_embeddings to 131072.
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.46it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.46it/s]

Capturing batches (bs=1 avail_mem=53.16 GB): 100%|██████████| 4/4 [00:05<00:00,  1.37s/it]
Capturing batches (bs=1 avail_mem=58.03 GB): 100%|██████████| 4/4 [00:00<00:00, 24.76it/s]


NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the log level to warning for the server; the default log level is info.
We are running these notebooks in a CI environment, so the throughput here is not representative of actual performance.
[12]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")
Response: ChatCompletion(id='5d39fd8e35544f178de354c3569579de', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Here are 3 countries and their capitals:\n\n1. Country: Japan\n Capital: Tokyo\n\n2. Country: Australia\n Capital: Canberra\n\n3. Country: Brazil\n Capital: Brasília', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=128009)], created=1767062085, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=43, prompt_tokens=43, total_tokens=86, completion_tokens_details=None, prompt_tokens_details=None, reasoning_tokens=0), metadata={'weight_version': 'default'})
[13]:
terminate_process(server_process)

Multi Token Prediction#

We support MTP (multi-token prediction) in SGLang via speculative decoding. Here we use the XiaomiMiMo/MiMo-7B-RL model as an example (for DeepSeek MTP usage, refer to the DeepSeek docs).

[14]:
server_process, port = launch_server_cmd(
    """
    python3 -m sglang.launch_server --model-path XiaomiMiMo/MiMo-7B-RL --host 0.0.0.0 --trust-remote-code \
    --speculative-algorithm EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
    --mem-fraction 0.5 --log-level warning
"""
)

wait_for_server(f"https://:{port}")
[2025-12-30 02:34:51] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:34:51] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:34:51] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:34:54] INFO server_args.py:1564: Attention backend not specified. Use flashinfer backend by default.
[2025-12-30 02:34:54] WARNING server_args.py:2016: Overlap scheduler is disabled when spec v2 is off or using unsupported speculative algorithm. You can set env SGLANG_ENABLE_SPEC_V2=True to enable the experimental overlap scheduler.
[2025-12-30 02:34:54] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:35:00] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:35:00] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:35:00] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:35:00] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:35:00] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:35:00] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:35:07] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:02<00:07,  2.60s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:05<00:05,  2.61s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:07<00:02,  2.62s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:10<00:00,  2.45s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:10<00:00,  2.51s/it]

Capturing batches (bs=1 avail_mem=60.22 GB): 100%|██████████| 4/4 [00:00<00:00,  9.63it/s]
[2025-12-30 02:35:20] SPECULATIVE_MOE_RUNNER_BACKEND is not initialized, using auto backend
[2025-12-30 02:35:20] SPECULATIVE_MOE_A2A_BACKEND is not initialized, using none backend
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:00,  4.73it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:00<00:00, 10.81it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:00<00:00,  5.25it/s]

Capturing batches (bs=1 avail_mem=59.37 GB): 100%|██████████| 4/4 [00:00<00:00, 57.43it/s]


NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the log level to warning for the server; the default log level is info.
We are running these notebooks in a CI environment, so the throughput here is not representative of actual performance.
[15]:
import requests

url = f"https://:{port}/v1/chat/completions"

data = {
    "model": "XiaomiMiMo/MiMo-7B-RL",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

response = requests.post(url, json=data)
print_highlight(response.json())
{'id': '8e06d8f3bc854220813b733664ab1bac', 'object': 'chat.completion', 'created': 1767062131, 'model': 'XiaomiMiMo/MiMo-7B-RL', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': "\n好的,用户在问法国的首都是哪里。让我先回想一下关于法国地理的知识。法国是西欧的一个国家。我记得首都市巴黎。不过等等,我可能应该再次核实一下。有时人们会弄混首都,特别是与其他国家如德国或意大利。 \n\n让我思考一下法国的主要城市。巴黎是人口最多的城市,对吧?还有马赛、里昂、图卢兹和波尔多。但首都肯定是巴黎。埃菲尔铁塔是巴黎的地标,也是世界上最著名的建筑之一。这是一个值得提到的好点。此外,巴黎以卢浮宫等艺术博物馆而闻名,卢浮宫是《蒙娜丽莎》的所在地。 \n\n等一下,用户有可能感到困惑,因为他们可能听说过其他城市。例如,尼斯是一个热门的旅游景点,但它不是首都。斯特拉斯堡是东北部的一个城市,但那里的首府可能是别的什么。不,斯特拉斯堡是大东部大区的首府,但不是国家的首都。 \n\n所以,答案是巴黎。为了确保,我可以想一下政府建筑。法国议会位于巴黎,具体在西岱岛附近。法国总统也居住在巴黎。所以所有的主要政府机构都在那里。 \n\n另一个角度:行政划分。法国划分为大区,而巴黎是一个自治区,但它仍然是首都。没有其他城市拥有这样的地位。此外,巴黎是法国文化、商业和政治的主要枢纽。 \n\n我觉得这很可靠。答案是巴黎。也许可以添加一点额外信息,如埃菲尔铁塔或卢浮宫,使回答更有帮助。但由于用户只是询问首都,所以要保持简洁。是的,直接回答并提及一个地标会很好。 \n\n等下,也许用户是学生在做测验,所以他们需要确切的名字。再确认一次。是的,绝对是巴黎。没有错误。好了,准备回答。\n\n\n法国的首都是**巴黎** (Paris),它是全球文化、历史和政治的枢纽。这里拥有埃菲尔铁塔、卢浮宫博物馆和香榭丽舍大街等标志性地标。巴黎也是该国政府、经济和艺术界的中心。🇫🇷✨", 'reasoning_content': None, 'tool_calls': None}, 'logprobs': None, 'finish_reason': 'stop', 'matched_stop': 151645}], 'usage': {'prompt_tokens': 26, 'total_tokens': 546, 'completion_tokens': 520, 'prompt_tokens_details': None, 'reasoning_tokens': 0}, 'metadata': {'weight_version': 'default'}}
[16]:
terminate_process(server_process)

References#

The workflow of EAGLE is as follows:

  • In EAGLE, the draft model uses the feature sequence \((f_1, ..., f_k)\) and the token sequence \((t_2, ..., t_{k+1})\) to predict the next feature vector, i.e. the last hidden state of the original LLM.

  • The next token is then sampled from \(p_{k+2}=\text{LMHead}(f_{k+1})\). Afterwards, the two sequences are extended in a tree style, branching out multiple potential continuations (the branching factor per step is controlled by speculative_eagle_topk) to ensure a more coherent connection of context, and are fed in as input again.

  • EAGLE-2 additionally uses the draft model to evaluate how probable certain branches of the draft tree are, and dynamically stops expanding unlikely branches. After the expansion phase, reranking is employed to keep only the top speculative_num_draft_tokens leaf nodes as draft tokens.

  • EAGLE-3 removes the feature-prediction objective, incorporates low- and mid-layer features, and is trained in an on-policy manner.

This enhances drafting accuracy: operating on features instead of tokens gives more regular inputs, and additionally passing the token from the next time step minimizes the randomness introduced by sampling. Moreover, the dynamic adjustment of the draft tree and the selection of reranked leaf nodes further raise the acceptance rate of draft tokens. See the EAGLE-2 and EAGLE-3 papers for more details.
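
To make the accept/reject mechanics concrete, below is a toy, framework-free sketch of the generic draft-then-verify loop that EAGLE builds on. draft_next and target_next are hypothetical stand-ins for the draft and target models, and verification runs greedily token by token for clarity; real EAGLE drafts a top-k tree and verifies all candidates in one batched forward pass.

def speculative_step(prefix, draft_next, target_next, num_steps=5):
    # 1) Draft: the small model proposes num_steps tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(num_steps):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    # 2) Verify: keep the longest prefix of the draft that the target model
    #    would produce itself; on the first mismatch, take the target's token
    #    and stop, so the output always equals target-only decoding.
    accepted = list(prefix)
    for t in draft:
        t_target = target_next(accepted)
        accepted.append(t_target)
        if t_target != t:
            break
    return accepted

# Tiny usage example with fake "models" over integer tokens.
draft_next = lambda ctx: (ctx[-1] + 1) % 7   # hypothetical draft model
target_next = lambda ctx: (ctx[-1] + 1) % 5  # hypothetical target model
print(speculative_step([0], draft_next, target_next))  # -> [0, 1, 2, 3, 4, 0]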

For guidance on how to train your own EAGLE model, please see the EAGLE repo.