Speculative Decoding#
SGLang now provides an EAGLE-based (EAGLE-2/EAGLE-3) speculative decoding option. Our implementation aims to maximize speed and efficiency and is considered among the fastest available in open-source LLM engines.
Performance Highlights#
See below for the large throughput gains that EAGLE-3 decoding brings to LLaMA-Instruct 3.1 8B on MT-Bench. For more details, please refer to the EAGLE-3 paper.
| Method | Throughput (tokens/s) |
|---|---|
| SGLang (w/o speculative decoding, 1x H100) | 158.34 tokens/s |
| SGLang + EAGLE-2 (1x H100) | 244.10 tokens/s |
| SGLang + EAGLE-3 (1x H100) | 373.25 tokens/s |
EAGLE Decoding#
To enable EAGLE speculative decoding, the following parameters are relevant:
- `speculative_draft_model_path`: Specifies the draft model. This parameter is required.
- `speculative_num_steps`: Depth of autoregressive drafting. Increasing it widens the speculation range but risks rejection cascades. Defaults to 5.
- `speculative_eagle_topk`: Branching factor per step. Raising it improves candidate diversity and can yield higher acceptance rates, but also increases memory/compute consumption. Defaults to 4.
- `speculative_num_draft_tokens`: Maximum parallel verification capacity. Allows deeper tree evaluation at the cost of higher GPU memory usage. Defaults to 8.
These parameters are the same for EAGLE-2 and EAGLE-3.
You can use bench_speculative.py to find the best combination of these parameters for your workload.
In the documentation below, we set --cuda-graph-max-bs to a small value to speed up engine startup. For your own workload, tune the parameters above together with --cuda-graph-max-bs, --max-running-requests, and --mem-fraction-static for the best performance. A short toy sketch of how the three speculative parameters interact follows below.
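To get a rough feel for how these knobs trade off, here is a back-of-the-envelope sketch. It is an illustration only: `draft_tree_budget` is a hypothetical helper, not part of SGLang, and the real EAGLE-2/3 tree is pruned dynamically rather than fully expanded. `speculative_num_steps` sets the drafting depth, `speculative_eagle_topk` the branching per step, and `speculative_num_draft_tokens` caps how many tree nodes the target model verifies per round.

```python
# Toy illustration only (not SGLang internals): the real EAGLE-2/3 draft tree
# is pruned dynamically, so the node count below is just an upper bound.

def draft_tree_budget(num_steps: int, eagle_topk: int, num_draft_tokens: int):
    # Nodes produced if every leaf branched into `eagle_topk` children at each
    # of the `num_steps` drafting steps (a simple geometric-series upper bound).
    unpruned_nodes = sum(eagle_topk**depth for depth in range(1, num_steps + 1))
    # Verification only ever checks up to `num_draft_tokens` nodes per round,
    # which caps both the GPU memory used and the tokens accepted per round.
    verified_nodes = min(unpruned_nodes, num_draft_tokens)
    return unpruned_nodes, verified_nodes


# Defaults quoted above: num_steps=5, topk=4, num_draft_tokens=8.
unpruned, verified = draft_tree_budget(num_steps=5, eagle_topk=4, num_draft_tokens=8)
print(f"unpruned draft-tree nodes: {unpruned}, nodes verified per round: {verified}")
```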
EAGLE-2 Decoding#
You can enable EAGLE-2 decoding by setting --speculative-algorithm EAGLE and choosing an appropriate model.
[1]:
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process
import openai
[2025-12-30 02:31:07] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:31:07] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:31:07] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2]:
server_process, port = launch_server_cmd(
"""
python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf --speculative-algorithm EAGLE \
--speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 3 \
--speculative-eagle-topk 4 --speculative-num-draft-tokens 16 --cuda-graph-max-bs 8 --log-level warning
"""
)
wait_for_server(f"https://:{port}")
[2025-12-30 02:31:14] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:31:14] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:31:14] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:31:17] INFO server_args.py:1564: Attention backend not specified. Use flashinfer backend by default.
[2025-12-30 02:31:17] WARNING server_args.py:2016: Overlap scheduler is disabled when spec v2 is off or using unsupported speculative algorithm. You can set env SGLANG_ENABLE_SPEC_V2=True to enable the experimental overlap scheduler.
[2025-12-30 02:31:17] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:31:23] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:31:23] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:31:23] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:31:29] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
[2025-12-30 02:31:30] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:31:30] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:31:30] INFO utils.py:164: NumExpr defaulting to 16 threads.
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:07<00:07, 7.41s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:09<00:00, 4.31s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:09<00:00, 4.78s/it]
Capturing batches (bs=1 avail_mem=54.86 GB): 100%|██████████| 4/4 [00:00<00:00, 10.04it/s]
[2025-12-30 02:31:41] SPECULATIVE_MOE_RUNNER_BACKEND is not initialized, using auto backend
[2025-12-30 02:31:41] SPECULATIVE_MOE_A2A_BACKEND is not initialized, using none backend
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.81s/it]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.81s/it]
Capturing batches (bs=1 avail_mem=53.67 GB): 100%|██████████| 4/4 [00:05<00:00, 1.41s/it]
Capturing batches (bs=1 avail_mem=53.57 GB): 100%|██████████| 4/4 [00:00<00:00, 108.16it/s]
Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are shown in their original black, while notebook outputs are highlighted in blue.
To shorten the logs, we set the server's log level to warning; the default log level is info.
We run these notebooks in a CI environment, so the throughput numbers are not representative of real performance.
[3]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")
response = client.chat.completions.create(
model="meta-llama/Llama-2-7b-chat-hf",
messages=[
{"role": "user", "content": "List 3 countries and their capitals."},
],
temperature=0,
max_tokens=64,
)
print_highlight(f"Response: {response}")
[4]:
terminate_process(server_process)
EAGLE-2 Decoding with torch.compile#
You can also enable torch.compile for further optimization, and optionally set --torch-compile-max-bs.
[5]:
server_process, port = launch_server_cmd(
"""
python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf --speculative-algorithm EAGLE \
--speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 \
--speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.6 \
--enable-torch-compile --torch-compile-max-bs 2 --log-level warning
"""
)
wait_for_server(f"https://:{port}")
[2025-12-30 02:32:02] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:32:02] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:32:02] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:32:05] INFO server_args.py:1564: Attention backend not specified. Use flashinfer backend by default.
[2025-12-30 02:32:05] WARNING server_args.py:2016: Overlap scheduler is disabled when spec v2 is off or using unsupported speculative algorithm. You can set env SGLANG_ENABLE_SPEC_V2=True to enable the experimental overlap scheduler.
[2025-12-30 02:32:05] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:32:11] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:32:11] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:32:11] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:32:11] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:32:11] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:32:11] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:32:16] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:02<00:02, 2.27s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00, 1.56s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00, 1.66s/it]
Capturing batches (bs=2 avail_mem=54.89 GB): 25%|██▌ | 1/4 [00:00<00:00, 6.92it/s]/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/functions.py:1692: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
torch._dynamo.utils.warn_once(msg)
Capturing batches (bs=1 avail_mem=54.80 GB): 100%|██████████| 4/4 [00:17<00:00, 4.41s/it]
[2025-12-30 02:32:40] SPECULATIVE_MOE_RUNNER_BACKEND is not initialized, using auto backend
[2025-12-30 02:32:40] SPECULATIVE_MOE_A2A_BACKEND is not initialized, using none backend
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.35s/it]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.35s/it]
Capturing batches (bs=1 avail_mem=53.51 GB): 100%|██████████| 4/4 [00:06<00:00, 1.59s/it]
Capturing batches (bs=1 avail_mem=53.37 GB): 100%|██████████| 4/4 [00:00<00:00, 89.64it/s]
Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are shown in their original black, while notebook outputs are highlighted in blue.
To shorten the logs, we set the server's log level to warning; the default log level is info.
We run these notebooks in a CI environment, so the throughput numbers are not representative of real performance.
[6]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")
response = client.chat.completions.create(
model="meta-llama/Llama-2-7b-chat-hf",
messages=[
{"role": "user", "content": "List 3 countries and their capitals."},
],
temperature=0,
max_tokens=64,
)
print_highlight(f"Response: {response}")
[7]:
terminate_process(server_process)
EAGLE-2 Decoding via Frequency-Ranked Speculative Sampling#
By employing a truncated high-frequency token vocabulary in the draft model, EAGLE speculative decoding reduces the computational overhead of lm_head and accelerates the pipeline without degrading quality. For more details, check out the paper.
In our implementation, set --speculative-token-map to enable this optimization. You can obtain the high-frequency tokens used by FR-Spec from this model, or download the token set directly from this repo. A sketch of how to inspect the token map is shown after this paragraph.
Thanks to Weilin Zhao and Zhousx for the contribution.
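If you want to sanity-check the token map before passing it to --speculative-token-map, a minimal sketch is shown below. It assumes the freq_32768.pt file stores a flat tensor or list of token IDs for the truncated high-frequency vocabulary (the repo and file name follow the launch command in the next cell); the snippet is illustrative and not part of SGLang.

```python
# Minimal sketch for inspecting the FR-Spec token map (assumption: the .pt file
# stores a 1-D tensor or list of token IDs for the truncated vocabulary).
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

# Download the same token map that --speculative-token-map points to below.
token_map_path = hf_hub_download(
    repo_id="thunlp/LLaMA3-Instruct-8B-FR-Spec", filename="freq_32768.pt"
)
token_ids = torch.load(token_map_path, map_location="cpu")
token_ids = token_ids.tolist() if hasattr(token_ids, "tolist") else list(token_ids)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(f"high-frequency vocabulary size: {len(token_ids)}")
# Peek at a few of the retained tokens.
print(tokenizer.convert_ids_to_tokens(token_ids[:10]))
```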
[8]:
server_process, port = launch_server_cmd(
"""
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3-8B-Instruct --speculative-algorithm EAGLE \
--speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B --speculative-num-steps 5 \
--speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --speculative-token-map thunlp/LLaMA3-Instruct-8B-FR-Spec/freq_32768.pt \
--mem-fraction 0.7 --cuda-graph-max-bs 2 --dtype float16 --log-level warning
"""
)
wait_for_server(f"https://:{port}")
[2025-12-30 02:33:01] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:33:01] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:33:01] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:33:03] WARNING model_config.py:1019: Casting torch.bfloat16 to torch.float16.
[2025-12-30 02:33:03] INFO server_args.py:1564: Attention backend not specified. Use flashinfer backend by default.
[2025-12-30 02:33:03] WARNING server_args.py:2016: Overlap scheduler is disabled when spec v2 is off or using unsupported speculative algorithm. You can set env SGLANG_ENABLE_SPEC_V2=True to enable the experimental overlap scheduler.
[2025-12-30 02:33:03] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:33:04] Casting torch.bfloat16 to torch.float16.
[2025-12-30 02:33:10] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:33:10] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:33:10] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:33:10] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:33:10] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:33:10] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:33:12] Casting torch.bfloat16 to torch.float16.
[2025-12-30 02:33:12] Casting torch.bfloat16 to torch.float16.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:33:15] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:05<00:16, 5.38s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:10<00:10, 5.24s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:15<00:05, 5.06s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:16<00:00, 3.61s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:16<00:00, 4.19s/it]
Capturing batches (bs=1 avail_mem=59.73 GB): 100%|██████████| 4/4 [00:00<00:00, 6.01it/s]
[2025-12-30 02:33:36] SPECULATIVE_MOE_RUNNER_BACKEND is not initialized, using auto backend
[2025-12-30 02:33:36] SPECULATIVE_MOE_A2A_BACKEND is not initialized, using none backend
[2025-12-30 02:33:36] Warning: Target model's context_length (8192) is greater than the derived context_length (2048). This may lead to incorrect model outputs or CUDA errors. Note that the derived context_length may differ from max_position_embeddings in the model's config.
[2025-12-30 02:33:36] Overriding the draft model's max_position_embeddings to 8192.
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.44s/it]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.44s/it]
Capturing batches (bs=1 avail_mem=58.40 GB): 100%|██████████| 4/4 [00:06<00:00, 1.67s/it]
Capturing batches (bs=1 avail_mem=58.26 GB): 100%|██████████| 4/4 [00:00<00:00, 102.99it/s]
Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are shown in their original black, while notebook outputs are highlighted in blue.
To shorten the logs, we set the server's log level to warning; the default log level is info.
We run these notebooks in a CI environment, so the throughput numbers are not representative of real performance.
[9]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3-8B-Instruct",
messages=[
{"role": "user", "content": "List 3 countries and their capitals."},
],
temperature=0,
max_tokens=64,
)
print_highlight(f"Response: {response}")
[10]:
terminate_process(server_process)
EAGLE-3 Decoding#
You can enable EAGLE-3 decoding by setting --speculative-algorithm EAGLE3 and choosing an appropriate model.
[11]:
server_process, port = launch_server_cmd(
"""
python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --speculative-algorithm EAGLE3 \
--speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B --speculative-num-steps 5 \
--speculative-eagle-topk 8 --speculative-num-draft-tokens 32 --mem-fraction 0.6 \
--cuda-graph-max-bs 2 --dtype float16 --log-level warning
"""
)
wait_for_server(f"https://:{port}")
[2025-12-30 02:33:58] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:33:58] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:33:58] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:34:01] WARNING model_config.py:1019: Casting torch.bfloat16 to torch.float16.
[2025-12-30 02:34:01] INFO server_args.py:1564: Attention backend not specified. Use flashinfer backend by default.
[2025-12-30 02:34:01] WARNING server_args.py:2016: Overlap scheduler is disabled when spec v2 is off or using unsupported speculative algorithm. You can set env SGLANG_ENABLE_SPEC_V2=True to enable the experimental overlap scheduler.
[2025-12-30 02:34:01] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:34:01] Casting torch.bfloat16 to torch.float16.
[2025-12-30 02:34:08] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:34:08] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:34:08] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:34:08] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:34:08] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:34:08] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:34:10] Casting torch.bfloat16 to torch.float16.
[2025-12-30 02:34:10] Casting torch.bfloat16 to torch.float16.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:34:13] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:04<00:12, 4.29s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:08<00:08, 4.47s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:13<00:04, 4.43s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:14<00:00, 3.14s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:14<00:00, 3.61s/it]
Capturing batches (bs=1 avail_mem=55.56 GB): 100%|██████████| 4/4 [00:00<00:00, 12.72it/s]
[2025-12-30 02:34:30] SPECULATIVE_MOE_RUNNER_BACKEND is not initialized, using auto backend
[2025-12-30 02:34:30] SPECULATIVE_MOE_A2A_BACKEND is not initialized, using none backend
[2025-12-30 02:34:30] Warning: Target model's context_length (131072) is greater than the derived context_length (2048). This may lead to incorrect model outputs or CUDA errors. Note that the derived context_length may differ from max_position_embeddings in the model's config.
[2025-12-30 02:34:30] Overriding the draft model's max_position_embeddings to 131072.
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.46it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.46it/s]
Capturing batches (bs=1 avail_mem=53.16 GB): 100%|██████████| 4/4 [00:05<00:00, 1.37s/it]
Capturing batches (bs=1 avail_mem=58.03 GB): 100%|██████████| 4/4 [00:00<00:00, 24.76it/s]
Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are shown in their original black, while notebook outputs are highlighted in blue.
To shorten the logs, we set the server's log level to warning; the default log level is info.
We run these notebooks in a CI environment, so the throughput numbers are not representative of real performance.
[12]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
messages=[
{"role": "user", "content": "List 3 countries and their capitals."},
],
temperature=0,
max_tokens=64,
)
print_highlight(f"Response: {response}")
[13]:
terminate_process(server_process)
Multi Token Prediction#
We support MTP (Multi-Token Prediction) in SGLang via speculative decoding. Here we use the XiaomiMiMo/MiMo-7B-RL model as an example (for DeepSeek MTP usage, refer to the DeepSeek documentation).
[14]:
server_process, port = launch_server_cmd(
"""
python3 -m sglang.launch_server --model-path XiaomiMiMo/MiMo-7B-RL --host 0.0.0.0 --trust-remote-code \
--speculative-algorithm EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
--mem-fraction 0.5 --log-level warning
"""
)
wait_for_server(f"https://:{port}")
[2025-12-30 02:34:51] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:34:51] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:34:51] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:34:54] INFO server_args.py:1564: Attention backend not specified. Use flashinfer backend by default.
[2025-12-30 02:34:54] WARNING server_args.py:2016: Overlap scheduler is disabled when spec v2 is off or using unsupported speculative algorithm. You can set env SGLANG_ENABLE_SPEC_V2=True to enable the experimental overlap scheduler.
[2025-12-30 02:34:54] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:35:00] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:35:00] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:35:00] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:35:00] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:35:00] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:35:00] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:35:07] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:02<00:07, 2.60s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:05<00:05, 2.61s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:07<00:02, 2.62s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:10<00:00, 2.45s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:10<00:00, 2.51s/it]
Capturing batches (bs=1 avail_mem=60.22 GB): 100%|██████████| 4/4 [00:00<00:00, 9.63it/s]
[2025-12-30 02:35:20] SPECULATIVE_MOE_RUNNER_BACKEND is not initialized, using auto backend
[2025-12-30 02:35:20] SPECULATIVE_MOE_A2A_BACKEND is not initialized, using none backend
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:00, 4.73it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:00<00:00, 10.81it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:00<00:00, 5.25it/s]
Capturing batches (bs=1 avail_mem=59.37 GB): 100%|██████████| 4/4 [00:00<00:00, 57.43it/s]
Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are shown in their original black, while notebook outputs are highlighted in blue.
To shorten the logs, we set the server's log level to warning; the default log level is info.
We run these notebooks in a CI environment, so the throughput numbers are not representative of real performance.
[15]:
import requests
url = f"https://:{port}/v1/chat/completions"
data = {
"model": "XiaomiMiMo/MiMo-7B-RL",
"messages": [{"role": "user", "content": "What is the capital of France?"}],
}
response = requests.post(url, json=data)
print_highlight(response.json())
[16]:
terminate_process(server_process)
References#
The EAGLE process is as follows:
- Within EAGLE, the draft model predicts the next feature vector, i.e. the last hidden state of the original LLM, using the feature sequence \((f_1, ..., f_k)\) and the token sequence \((t_2, ..., t_{k+1})\).
- The next token is then sampled from \(p_{k+2}=\text{LMHead}(f_{k+1})\). Afterwards, the two sequences are extended in a tree style, branching out into multiple potential continuations (the branching factor per step is controlled by the `speculative_eagle_topk` parameter) to ensure a more coherent connection with the context, and are fed in as input again.
- EAGLE-2 additionally uses the draft model to evaluate how probable certain branches of the draft tree are, and dynamically stops expanding unlikely branches. After the expansion phase, reranking is employed to keep only the top `speculative_num_draft_tokens` leaf nodes as draft tokens.
- EAGLE-3 removes the feature prediction objective, incorporates low- and mid-layer features, and is trained in an on-policy manner.
This enhances drafting accuracy by operating on features instead of tokens for more regular inputs, and by additionally passing the tokens from the next time step to minimize the randomness effects of sampling. Furthermore, the dynamic adjustment of the draft tree and the selection of reranked leaf nodes further increase the acceptance rate of draft tokens. For more details, see the EAGLE-2 and EAGLE-3 papers.
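The toy sketch below illustrates the expand-then-rerank flow described above; it is not SGLang's implementation (`toy_eagle_draft` is a hypothetical function and the branch probabilities are random stand-ins for the draft model's scores). Every leaf branches into `speculative_eagle_topk` continuations per step, each branch carries the product of its per-step probabilities, and after `speculative_num_steps` steps only the `speculative_num_draft_tokens` most probable nodes are kept as draft tokens for verification.

```python
# Illustrative toy version of the expand-and-rerank flow described above.
# Not SGLang's implementation: real EAGLE-2/3 uses the draft model's own
# confidences and already prunes unlikely branches during expansion.
import random

def toy_eagle_draft(num_steps=3, eagle_topk=2, num_draft_tokens=4, seed=0):
    rng = random.Random(seed)
    frontier = [((), 1.0)]  # each node: (token path, joint branch probability)
    all_nodes = []
    for step in range(num_steps):
        next_frontier = []
        for path, prob in frontier:
            # Pretend the draft model proposes `eagle_topk` continuations per node,
            # each with some probability; real scores come from the draft LM head.
            scores = sorted((rng.random() for _ in range(eagle_topk)), reverse=True)
            for rank, score in enumerate(scores):
                child = (path + (f"t{step}_{rank}",), prob * score)
                next_frontier.append(child)
                all_nodes.append(child)
        frontier = next_frontier
    # Rerank: keep only the most probable nodes as the draft tokens to verify.
    return sorted(all_nodes, key=lambda node: node[1], reverse=True)[:num_draft_tokens]

for path, prob in toy_eagle_draft():
    print(f"p={prob:.3f}  draft path: {' -> '.join(path)}")
```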
See the EAGLE repository for guidance on how to train your own EAGLE model.