Speculative Decoding#

SGLang now provides an EAGLE-based (EAGLE-2/EAGLE-3) speculative decoding option. Our implementation aims to maximize speed and efficiency and is considered one of the fastest among open-source LLM engines.

Performance Highlights#

See below for the large throughput improvement that EAGLE3 decoding brings for LLaMA-Instruct 3.1 8B on MT bench. For more details, please see the EAGLE3 paper.

| Method | Throughput (tokens/s) |
| --- | --- |
| SGLang (w/o speculative decoding, 1x H100) | 158.34 |
| SGLang + EAGLE-2 (1x H100) | 244.10 |
| SGLang + EAGLE-3 (1x H100) | 373.25 |

EAGLE Decoding#

Enabling EAGLE speculative decoding involves the following parameters:

  • speculative_draft_model_path: Specifies the draft model. This parameter is required.

  • speculative_num_steps: Depth of autoregressive drafting. Increasing it widens the speculation range but risks rejection cascades. Default is 5.

  • speculative_eagle_topk: Branching factor per step. A larger value improves candidate diversity and tends to raise the acceptance rate, but also increases memory/compute consumption. Default is 4.

  • speculative_num_draft_tokens: Maximum parallel verification capacity. A larger value allows deeper tree evaluation but increases GPU memory usage. Default is 8.

These parameters are shared by EAGLE-2 and EAGLE-3.

You can use bench_speculative.py to find the best combination of these parameters.

In the documentation below, we set --cuda-graph-max-bs to a small value to speed up engine startup. For your own workload, tune the parameters above together with --cuda-graph-max-bs, --max-running-requests, and --mem-fraction-static for the best performance.
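
The same knobs are also exposed by the offline engine API. The following is a minimal sketch, not a definitive recipe: it assumes the EAGLE-2 draft model used later on this page, and relies on sgl.Engine accepting the server flags as keyword arguments (dashes replaced by underscores).

import sglang as sgl  # offline engine API

# Speculative-decoding flags map to Engine keyword arguments.
llm = sgl.Engine(
    model_path="meta-llama/Llama-2-7b-chat-hf",
    speculative_algorithm="EAGLE",
    speculative_draft_model_path="lmsys/sglang-EAGLE-llama2-chat-7B",
    speculative_num_steps=3,          # drafting depth
    speculative_eagle_topk=4,         # branching factor per step
    speculative_num_draft_tokens=16,  # parallel verification budget
)
out = llm.generate("List 3 countries and their capitals.", {"temperature": 0, "max_new_tokens": 64})
print(out["text"])
llm.shutdown()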

EAGLE-2 Decoding#

You can enable EAGLE-2 decoding by setting --speculative-algorithm EAGLE and choosing an appropriate draft model.

[1]:
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process

import openai
[2025-12-30 02:31:07] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:31:07] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:31:07] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 3 \
    --speculative-eagle-topk 4 --speculative-num-draft-tokens 16 --cuda-graph-max-bs 8 --log-level warning
"""
)

wait_for_server(f"https://:{port}")
[2025-12-30 02:31:14] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:31:14] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:31:14] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:31:17] INFO server_args.py:1564: Attention backend not specified. Use flashinfer backend by default.
[2025-12-30 02:31:17] WARNING server_args.py:2016: Overlap scheduler is disabled when spec v2 is off or using unsupported speculative algorithm. You can set env SGLANG_ENABLE_SPEC_V2=True to enable the experimental overlap scheduler.
[2025-12-30 02:31:17] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:31:23] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:31:23] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:31:23] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:31:29] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
[2025-12-30 02:31:30] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:31:30] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:31:30] INFO utils.py:164: NumExpr defaulting to 16 threads.
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:07<00:07,  7.41s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:09<00:00,  4.31s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:09<00:00,  4.78s/it]

Capturing batches (bs=1 avail_mem=54.86 GB): 100%|██████████| 4/4 [00:00<00:00, 10.04it/s]
[2025-12-30 02:31:41] SPECULATIVE_MOE_RUNNER_BACKEND is not initialized, using auto backend
[2025-12-30 02:31:41] SPECULATIVE_MOE_A2A_BACKEND is not initialized, using none backend
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.81s/it]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.81s/it]

Capturing batches (bs=1 avail_mem=53.67 GB): 100%|██████████| 4/4 [00:05<00:00,  1.41s/it]
Capturing batches (bs=1 avail_mem=53.57 GB): 100%|██████████| 4/4 [00:00<00:00, 108.16it/s]


NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the log level to warning for the server; the default log level is info.
We are running these notebooks in a CI environment, so the throughput here is not representative of actual performance.
[3]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")
Response: ChatCompletion(id='39a3c1fc472943e0a3bc3545f9de793b', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=' Sure! Here are three countries and their capitals:\n\n1. Country: France\nCapital: Paris\n2. Country: Japan\nCapital: Tokyo\n3. Country: Brazil\nCapital: Brasília', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=2)], created=1767061917, model='meta-llama/Llama-2-7b-chat-hf', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=48, prompt_tokens=17, total_tokens=65, completion_tokens_details=None, prompt_tokens_details=None, reasoning_tokens=0), metadata={'weight_version': 'default'})
[4]:
terminate_process(server_process)

EAGLE-2 Decoding with torch.compile#

You can also enable torch.compile for further optimization and optionally set --torch-compile-max-bs:

[5]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 \
        --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.6 \
            --enable-torch-compile --torch-compile-max-bs 2 --log-level warning
"""
)

wait_for_server(f"https://:{port}")
[2025-12-30 02:32:02] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:32:02] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:32:02] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:32:05] INFO server_args.py:1564: Attention backend not specified. Use flashinfer backend by default.
[2025-12-30 02:32:05] WARNING server_args.py:2016: Overlap scheduler is disabled when spec v2 is off or using unsupported speculative algorithm. You can set env SGLANG_ENABLE_SPEC_V2=True to enable the experimental overlap scheduler.
[2025-12-30 02:32:05] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:32:11] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:32:11] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:32:11] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:32:11] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:32:11] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:32:11] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:32:16] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:02<00:02,  2.27s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00,  1.56s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00,  1.66s/it]

Capturing batches (bs=2 avail_mem=54.89 GB):  25%|██▌       | 1/4 [00:00<00:00,  6.92it/s]/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/functions.py:1692: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
  torch._dynamo.utils.warn_once(msg)
Capturing batches (bs=1 avail_mem=54.80 GB): 100%|██████████| 4/4 [00:17<00:00,  4.41s/it]
[2025-12-30 02:32:40] SPECULATIVE_MOE_RUNNER_BACKEND is not initialized, using auto backend
[2025-12-30 02:32:40] SPECULATIVE_MOE_A2A_BACKEND is not initialized, using none backend
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.35s/it]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.35s/it]

Capturing batches (bs=1 avail_mem=53.51 GB): 100%|██████████| 4/4 [00:06<00:00,  1.59s/it]
Capturing batches (bs=1 avail_mem=53.37 GB): 100%|██████████| 4/4 [00:00<00:00, 89.64it/s]


NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the log level to warning for the server; the default log level is info.
We are running these notebooks in a CI environment, so the throughput here is not representative of actual performance.
[6]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")
Response: ChatCompletion(id='f7f3c00416ae48589055d5c5d1d110b3', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=' Sure! Here are three countries and their capitals:\n\n1. Country: France\nCapital: Paris\n2. Country: Japan\nCapital: Tokyo\n3. Country: Brazil\nCapital: Brasília', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=2)], created=1767061975, model='meta-llama/Llama-2-7b-chat-hf', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=48, prompt_tokens=17, total_tokens=65, completion_tokens_details=None, prompt_tokens_details=None, reasoning_tokens=0), metadata={'weight_version': 'default'})
[7]:
terminate_process(server_process)

EAGLE-2 Decoding via Frequency-Ranked Speculative Sampling#

By using a truncated vocabulary of high-frequency tokens in the draft model, EAGLE speculative decoding reduces the computational overhead of lm_head, accelerating the pipeline without degrading quality. For more details, check out the paper.

In our implementation, set --speculative-token-map to enable this optimization. You can obtain the high-frequency tokens used by FR-Spec from this model, or download them directly from this repo.

Thanks to Weilin Zhao and Zhousx for the contribution.
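
To sanity-check a token map before passing it to --speculative-token-map, you can download and inspect it directly. This is a small sketch, assuming freq_32768.pt stores the 32768 most frequent token IDs; see the repo above for the authoritative format.

import torch
from huggingface_hub import hf_hub_download

# Assumption: the .pt file holds a sequence of high-frequency token IDs.
path = hf_hub_download("thunlp/LLaMA3-Instruct-8B-FR-Spec", "freq_32768.pt")
token_ids = torch.load(path, map_location="cpu")
print(len(token_ids))  # expected: 32768 entries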

[8]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3-8B-Instruct --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B --speculative-num-steps 5 \
    --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --speculative-token-map thunlp/LLaMA3-Instruct-8B-FR-Spec/freq_32768.pt \
    --mem-fraction 0.7 --cuda-graph-max-bs 2 --dtype float16  --log-level warning
"""
)

wait_for_server(f"https://:{port}")
[2025-12-30 02:33:01] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:33:01] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:33:01] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:33:03] WARNING model_config.py:1019: Casting torch.bfloat16 to torch.float16.
[2025-12-30 02:33:03] INFO server_args.py:1564: Attention backend not specified. Use flashinfer backend by default.
[2025-12-30 02:33:03] WARNING server_args.py:2016: Overlap scheduler is disabled when spec v2 is off or using unsupported speculative algorithm. You can set env SGLANG_ENABLE_SPEC_V2=True to enable the experimental overlap scheduler.
[2025-12-30 02:33:03] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:33:04] Casting torch.bfloat16 to torch.float16.
[2025-12-30 02:33:10] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:33:10] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:33:10] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:33:10] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:33:10] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:33:10] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:33:12] Casting torch.bfloat16 to torch.float16.
[2025-12-30 02:33:12] Casting torch.bfloat16 to torch.float16.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:33:15] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:05<00:16,  5.38s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:10<00:10,  5.24s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:15<00:05,  5.06s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:16<00:00,  3.61s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:16<00:00,  4.19s/it]

Capturing batches (bs=1 avail_mem=59.73 GB): 100%|██████████| 4/4 [00:00<00:00,  6.01it/s]
[2025-12-30 02:33:36] SPECULATIVE_MOE_RUNNER_BACKEND is not initialized, using auto backend
[2025-12-30 02:33:36] SPECULATIVE_MOE_A2A_BACKEND is not initialized, using none backend
[2025-12-30 02:33:36] Warning: Target model's context_length (8192) is greater than the derived context_length (2048). This may lead to incorrect model outputs or CUDA errors. Note that the derived context_length may differ from max_position_embeddings in the model's config.
[2025-12-30 02:33:36] Overriding the draft model's max_position_embeddings to 8192.
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.44s/it]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.44s/it]

Capturing batches (bs=1 avail_mem=58.40 GB): 100%|██████████| 4/4 [00:06<00:00,  1.67s/it]
Capturing batches (bs=1 avail_mem=58.26 GB): 100%|██████████| 4/4 [00:00<00:00, 102.99it/s]


NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the log level to warning for the server; the default log level is info.
We are running these notebooks in a CI environment, so the throughput here is not representative of actual performance.
[9]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")
Response: ChatCompletion(id='44f1cdf182c2480084e7e36aea90a5f2', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Here are 3 countries and their capitals:\n\n1. **France** - **Paris**\n2. **Japan** - **Tokyo**\n3. **Australia** - **Canberra**', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=128009)], created=1767062033, model='meta-llama/Meta-Llama-3-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=39, prompt_tokens=18, total_tokens=57, completion_tokens_details=None, prompt_tokens_details=None, reasoning_tokens=0), metadata={'weight_version': 'default'})
[10]:
terminate_process(server_process)

EAGLE-3 Decoding#

You can enable EAGLE-3 decoding by setting --speculative-algorithm EAGLE3 and choosing an appropriate draft model.

[11]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct  --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B --speculative-num-steps 5 \
        --speculative-eagle-topk 8 --speculative-num-draft-tokens 32 --mem-fraction 0.6 \
        --cuda-graph-max-bs 2 --dtype float16 --log-level warning
"""
)

wait_for_server(f"https://:{port}")
[2025-12-30 02:33:58] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:33:58] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:33:58] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:34:01] WARNING model_config.py:1019: Casting torch.bfloat16 to torch.float16.
[2025-12-30 02:34:01] INFO server_args.py:1564: Attention backend not specified. Use flashinfer backend by default.
[2025-12-30 02:34:01] WARNING server_args.py:2016: Overlap scheduler is disabled when spec v2 is off or using unsupported speculative algorithm. You can set env SGLANG_ENABLE_SPEC_V2=True to enable the experimental overlap scheduler.
[2025-12-30 02:34:01] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:34:01] Casting torch.bfloat16 to torch.float16.
[2025-12-30 02:34:08] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:34:08] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:34:08] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:34:08] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:34:08] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:34:08] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:34:10] Casting torch.bfloat16 to torch.float16.
[2025-12-30 02:34:10] Casting torch.bfloat16 to torch.float16.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:34:13] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:04<00:12,  4.29s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:08<00:08,  4.47s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:13<00:04,  4.43s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:14<00:00,  3.14s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:14<00:00,  3.61s/it]

Capturing batches (bs=1 avail_mem=55.56 GB): 100%|██████████| 4/4 [00:00<00:00, 12.72it/s]
[2025-12-30 02:34:30] SPECULATIVE_MOE_RUNNER_BACKEND is not initialized, using auto backend
[2025-12-30 02:34:30] SPECULATIVE_MOE_A2A_BACKEND is not initialized, using none backend
[2025-12-30 02:34:30] Warning: Target model's context_length (131072) is greater than the derived context_length (2048). This may lead to incorrect model outputs or CUDA errors. Note that the derived context_length may differ from max_position_embeddings in the model's config.
[2025-12-30 02:34:30] Overriding the draft model's max_position_embeddings to 131072.
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.46it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.46it/s]

Capturing batches (bs=1 avail_mem=53.16 GB): 100%|██████████| 4/4 [00:05<00:00,  1.37s/it]
Capturing batches (bs=1 avail_mem=58.03 GB): 100%|██████████| 4/4 [00:00<00:00, 24.76it/s]


NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the log level to warning for the server; the default log level is info.
We are running these notebooks in a CI environment, so the throughput here is not representative of actual performance.
[12]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")
Response: ChatCompletion(id='5d39fd8e35544f178de354c3569579de', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Here are 3 countries and their capitals:\n\n1. Country: Japan\n Capital: Tokyo\n\n2. Country: Australia\n Capital: Canberra\n\n3. Country: Brazil\n Capital: Brasília', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=128009)], created=1767062085, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=43, prompt_tokens=43, total_tokens=86, completion_tokens_details=None, prompt_tokens_details=None, reasoning_tokens=0), metadata={'weight_version': 'default'})
[13]:
terminate_process(server_process)

Multi Token Prediction#

We support MTP (multi-token prediction) in SGLang via speculative decoding. Here we use the XiaomiMiMo/MiMo-7B-RL model as an example (for DeepSeek MTP usage, refer to the DeepSeek docs).

[14]:
server_process, port = launch_server_cmd(
    """
    python3 -m sglang.launch_server --model-path XiaomiMiMo/MiMo-7B-RL --host 0.0.0.0 --trust-remote-code \
    --speculative-algorithm EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
    --mem-fraction 0.5 --log-level warning
"""
)

wait_for_server(f"https://:{port}")
[2025-12-30 02:34:51] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:34:51] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:34:51] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:34:54] INFO server_args.py:1564: Attention backend not specified. Use flashinfer backend by default.
[2025-12-30 02:34:54] WARNING server_args.py:2016: Overlap scheduler is disabled when spec v2 is off or using unsupported speculative algorithm. You can set env SGLANG_ENABLE_SPEC_V2=True to enable the experimental overlap scheduler.
[2025-12-30 02:34:54] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:35:00] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:35:00] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:35:00] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:35:00] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:35:00] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:35:00] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:35:07] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:02<00:07,  2.60s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:05<00:05,  2.61s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:07<00:02,  2.62s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:10<00:00,  2.45s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:10<00:00,  2.51s/it]

Capturing batches (bs=1 avail_mem=60.22 GB): 100%|██████████| 4/4 [00:00<00:00,  9.63it/s]
[2025-12-30 02:35:20] SPECULATIVE_MOE_RUNNER_BACKEND is not initialized, using auto backend
[2025-12-30 02:35:20] SPECULATIVE_MOE_A2A_BACKEND is not initialized, using none backend
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:00,  4.73it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:00<00:00, 10.81it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:00<00:00,  5.25it/s]

Capturing batches (bs=1 avail_mem=59.37 GB): 100%|██████████| 4/4 [00:00<00:00, 57.43it/s]


NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the log level to warning for the server; the default log level is info.
We are running these notebooks in a CI environment, so the throughput here is not representative of actual performance.
[15]:
import requests

url = f"https://:{port}/v1/chat/completions"

data = {
    "model": "XiaomiMiMo/MiMo-7B-RL",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

response = requests.post(url, json=data)
print_highlight(response.json())
{'id': '8e06d8f3bc854220813b733664ab1bac', 'object': 'chat.completion', 'created': 1767062131, 'model': 'XiaomiMiMo/MiMo-7B-RL', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': "\n好的,用户在问法国的首都是哪里。让我先回想一下关于法国地理的知识。法国是西欧的一个国家。我记得首都市巴黎。不过等等,我可能应该再次核实一下。有时人们会弄混首都,特别是与其他国家如德国或意大利。 \n\n让我思考一下法国的主要城市。巴黎是人口最多的城市,对吧?还有马赛、里昂、图卢兹和波尔多。但首都肯定是巴黎。埃菲尔铁塔是巴黎的地标,也是世界上最著名的建筑之一。这是一个值得提到的好点。此外,巴黎以卢浮宫等艺术博物馆而闻名,卢浮宫是《蒙娜丽莎》的所在地。 \n\n等一下,用户有可能感到困惑,因为他们可能听说过其他城市。例如,尼斯是一个热门的旅游景点,但它不是首都。斯特拉斯堡是东北部的一个城市,但那里的首府可能是别的什么。不,斯特拉斯堡是大东部大区的首府,但不是国家的首都。 \n\n所以,答案是巴黎。为了确保,我可以想一下政府建筑。法国议会位于巴黎,具体在西岱岛附近。法国总统也居住在巴黎。所以所有的主要政府机构都在那里。 \n\n另一个角度:行政划分。法国划分为大区,而巴黎是一个自治区,但它仍然是首都。没有其他城市拥有这样的地位。此外,巴黎是法国文化、商业和政治的主要枢纽。 \n\n我觉得这很可靠。答案是巴黎。也许可以添加一点额外信息,如埃菲尔铁塔或卢浮宫,使回答更有帮助。但由于用户只是询问首都,所以要保持简洁。是的,直接回答并提及一个地标会很好。 \n\n等下,也许用户是学生在做测验,所以他们需要确切的名字。再确认一次。是的,绝对是巴黎。没有错误。好了,准备回答。\n\n\n法国的首都是**巴黎** (Paris),它是全球文化、历史和政治的枢纽。这里拥有埃菲尔铁塔、卢浮宫博物馆和香榭丽舍大街等标志性地标。巴黎也是该国政府、经济和艺术界的中心。🇫🇷✨", 'reasoning_content': None, 'tool_calls': None}, 'logprobs': None, 'finish_reason': 'stop', 'matched_stop': 151645}], 'usage': {'prompt_tokens': 26, 'total_tokens': 546, 'completion_tokens': 520, 'prompt_tokens_details': None, 'reasoning_tokens': 0}, 'metadata': {'weight_version': 'default'}}
[16]:
terminate_process(server_process)

References#

The workflow of EAGLE is as follows:

  • In EAGLE, the draft model uses the feature sequence \((f_1, ..., f_k)\) and the token sequence \((t_2, ..., t_{k+1})\) to predict the next feature vector, i.e. the last hidden state of the original LLM.

  • The next token is then sampled from \(p_{k+2}=\text{LMHead}(f_{k+1})\). Afterwards, the two sequences are extended in a tree style, branching out multiple potential continuations (the branching factor per step is controlled by speculative_eagle_topk) to ensure a more coherent connection of context, and are fed in as input again.

  • EAGLE-2 additionally uses the draft model to evaluate how probable certain branches of the draft tree are, and dynamically stops expanding unlikely branches. After the expansion phase, reranking is employed to keep only the top speculative_num_draft_tokens leaf nodes as draft tokens.

  • EAGLE-3 removes the feature-prediction objective, incorporates low- and mid-layer features, and is trained in an on-policy manner.

This enhances drafting accuracy: operating on features instead of tokens gives more regular inputs, and additionally passing the token from the next time step minimizes the randomness introduced by sampling. Moreover, the dynamic adjustment of the draft tree and the selection of reranked leaf nodes further raise the acceptance rate of draft tokens. See the EAGLE-2 and EAGLE-3 papers for more details.
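
To make the accept/reject mechanics concrete, below is a toy, framework-free sketch of the generic draft-then-verify loop that EAGLE builds on. draft_next and target_next are hypothetical stand-ins for the draft and target models, and verification runs greedily token by token for clarity; real EAGLE drafts a top-k tree and verifies all candidates in one batched forward pass.

def speculative_step(prefix, draft_next, target_next, num_steps=5):
    # 1) Draft: the small model proposes num_steps tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(num_steps):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    # 2) Verify: keep the longest prefix of the draft that the target model
    #    would produce itself; on the first mismatch, take the target's token
    #    and stop, so the output always equals target-only decoding.
    accepted = list(prefix)
    for t in draft:
        t_target = target_next(accepted)
        accepted.append(t_target)
        if t_target != t:
            break
    return accepted

# Tiny usage example with fake "models" over integer tokens.
draft_next = lambda ctx: (ctx[-1] + 1) % 7   # hypothetical draft model
target_next = lambda ctx: (ctx[-1] + 1) % 5  # hypothetical target model
print(speculative_step([0], draft_next, target_next))  # -> [0, 1, 2, 3, 4, 0]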

For guidance on how to train your own EAGLE model, please see the EAGLE repo.