Speculative Decoding#

SGLang now provides an EAGLE-based (EAGLE-2/EAGLE-3) speculative decoding option. Our implementation aims to maximize speed and efficiency and is considered to be among the fastest in open-source LLM engines. Note: Currently, speculative decoding in SGLang is compatible with radix cache and chunked prefill.

Performance Highlights#

See below for the large throughput improvements achieved with EAGLE3 decoding for the LLaMA-Instruct 3.1 8B model, tested on MT bench. For more details, please see the EAGLE3 paper.

Method                               Throughput (tokens/sec)
SGLang (w/o speculative, 1x H100)    158.34
SGLang + EAGLE-2 (1x H100)           244.10
SGLang + EAGLE-3 (1x H100)           373.25

EAGLE Decoding#

To enable EAGLE speculative decoding, the following parameters are relevant:

  • speculative_draft_model_path: Specifies the draft model. This parameter is required.

  • speculative_num_steps: The depth of autoregressive drafting. Increases the speculation range, but risks rejection cascades. The default is 5.

  • speculative_eagle_topk: The branching factor per step. Improves candidate diversity and leads to a higher acceptance rate, but also to higher memory/compute consumption. The default is 4.

  • speculative_num_draft_tokens: The maximum parallel verification capacity. Allows deeper tree evaluation, but leads to higher GPU memory usage. The default is 8.

These parameters are the same for EAGLE-2 and EAGLE-3. The sketch below shows how they map to the corresponding CLI flags.
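A minimal, illustrative launch command wiring these parameters to their CLI flags is shown below; the <target-model> and <draft-model> paths are placeholders, and the values are simply the defaults listed above (see the runnable cells in the following sections for concrete, matching model/draft pairs):

python3 -m sglang.launch_server --model <target-model> --speculative-algorithm EAGLE \
    --speculative-draft-model-path <draft-model> \
    --speculative-num-steps 5 --speculative-eagle-topk 4 --speculative-num-draft-tokens 8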

EAGLE-2 Decoding#

You can enable EAGLE-2 decoding by setting --speculative_algorithm EAGLE and choosing an appropriate model.

[1]:
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

from sglang.utils import wait_for_server, print_highlight, terminate_process

import openai
[2]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 3 \
    --speculative-eagle-topk 4 --speculative-num-draft-tokens 16 --cuda-graph-max-bs 8
"""
)

wait_for_server(f"http://localhost:{port}")
Overlap scheduler is disabled because of using eagle speculative decoding.
[2025-05-15 22:37:29] server_args=ServerArgs(model_path='meta-llama/Llama-2-7b-chat-hf', tokenizer_path='meta-llama/Llama-2-7b-chat-hf', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-2-7b-chat-hf', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, host='127.0.0.1', port=32059, mem_fraction_static=0.88, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=827322011, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm='EAGLE', speculative_draft_model_path='lmsys/sglang-EAGLE-llama2-chat-7B', speculative_num_steps=3, speculative_eagle_topk=4, speculative_num_draft_tokens=16, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_overlap_schedule=True, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=8, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, warmups=None, moe_dense_tp_size=None, n_share_experts_fusion=0, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, mm_attention_backend=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, disaggregation_transfer_backend='mooncake', disaggregation_ib_device=None, 
pdlb_url=None)
[2025-05-15 22:37:31] Infer the chat template name from the model path and obtain the result: llama-2.
[2025-05-15 22:37:37] Attention backend not set. Use flashinfer backend by default.
[2025-05-15 22:37:37] Init torch distributed begin.
[2025-05-15 22:37:37] Init torch distributed ends. mem usage=0.00 GB
[2025-05-15 22:37:37] Load weight begin. avail mem=34.53 GB
[2025-05-15 22:37:38] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:09<00:09,  9.58s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:21<00:00, 11.03s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:21<00:00, 10.82s/it]

[2025-05-15 22:38:00] Load weight end. type=LlamaForCausalLM, dtype=torch.float16, avail mem=21.87 GB, mem usage=12.66 GB.
[2025-05-15 22:38:00] KV Cache is allocated. #tokens: 20480, K size: 5.00 GB, V size: 5.00 GB
[2025-05-15 22:38:00] Memory pool end. avail mem=11.68 GB
2025-05-15 22:38:00,884 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[2025-05-15 22:38:01] Init torch distributed begin.
[2025-05-15 22:38:01] Init torch distributed ends. mem usage=0.00 GB
[2025-05-15 22:38:01] Load weight begin. avail mem=11.11 GB
[2025-05-15 22:38:01] Using model weights format ['*.bin']
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.39s/it]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.39s/it]

[2025-05-15 22:38:03] Load weight end. type=LlamaForCausalLMEagle, dtype=torch.float16, avail mem=10.18 GB, mem usage=0.93 GB.
[2025-05-15 22:38:03] KV Cache is allocated. #tokens: 20480, K size: 0.16 GB, V size: 0.16 GB
[2025-05-15 22:38:03] Memory pool end. avail mem=9.86 GB
[2025-05-15 22:38:03] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=4096
[2025-05-15 22:38:03] INFO:     Started server process [74315]
[2025-05-15 22:38:03] INFO:     Waiting for application startup.
[2025-05-15 22:38:03] INFO:     Application startup complete.
[2025-05-15 22:38:03] INFO:     Uvicorn running on http://127.0.0.1:32059 (Press CTRL+C to quit)
[2025-05-15 22:38:04] INFO:     127.0.0.1:56572 - "GET /v1/models HTTP/1.1" 200 OK
[2025-05-15 22:38:04] INFO:     127.0.0.1:56578 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-05-15 22:38:04] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
2025-05-15 22:38:05,519 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90


Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in their original black color, while the notebook outputs are highlighted in blue.
We are running these notebooks in a CI parallel environment, so the throughput is not representative of actual performance.
[3]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")
2025-05-15 22:38:56,487 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-05-15 22:38:56,494 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-15 22:39:12,241 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[2025-05-15 22:39:12] Prefill batch. #new-seq: 1, #new-token: 16, #cached-token: 1, token usage: 0.00, #running-req: 1, #queue-req: 0
2025-05-15 22:39:12,763 - INFO - flashinfer.jit: Loading JIT ops: cascade
2025-05-15 22:39:12,782 - INFO - flashinfer.jit: Finished loading JIT ops: cascade
2025-05-15 22:39:12,981 - INFO - flashinfer.jit: Loading JIT ops: batch_decode_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False
2025-05-15 22:39:30,180 - INFO - flashinfer.jit: Finished loading JIT ops: batch_decode_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False
2025-05-15 22:39:33,831 - INFO - flashinfer.jit: Loading JIT ops: quantization
2025-05-15 22:39:49,479 - INFO - flashinfer.jit: Finished loading JIT ops: quantization
[2025-05-15 22:39:49] INFO:     127.0.0.1:56592 - "POST /generate HTTP/1.1" 200 OK
[2025-05-15 22:39:49] The server is fired up and ready to roll!
[2025-05-15 22:39:54] INFO:     127.0.0.1:56604 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Response: ChatCompletion(id='fb22bdb1035246938e8f28ba8c49a0d8', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=' Sure! Here are three countries and their capitals:\n\n1. Country: France\nCapital: Paris\n2. Country: Japan\nCapital: Tokyo\n3. Country: Brazil\nCapital: Brasília', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=2)], created=1747348689, model='meta-llama/Llama-2-7b-chat-hf', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=48, prompt_tokens=17, total_tokens=65, completion_tokens_details=None, prompt_tokens_details=None))
[4]:
terminate_process(server_process)

EAGLE-2 Decoding with torch.compile#

You can also enable torch.compile for further optimization, and optionally set --torch-compile-max-bs:

[5]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 \
        --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.6 \
            --enable-torch-compile --torch-compile-max-bs 2
"""
)

wait_for_server(f"http://localhost:{port}")
Overlap scheduler is disabled because of using eagle speculative decoding.
[2025-05-15 22:40:00] server_args=ServerArgs(model_path='meta-llama/Llama-2-7b-chat-hf', tokenizer_path='meta-llama/Llama-2-7b-chat-hf', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-2-7b-chat-hf', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, host='127.0.0.1', port=36444, mem_fraction_static=0.6, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=113513447, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm='EAGLE', speculative_draft_model_path='lmsys/sglang-EAGLE-llama2-chat-7B', speculative_num_steps=5, speculative_eagle_topk=8, speculative_num_draft_tokens=64, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_overlap_schedule=True, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', enable_torch_compile=True, torch_compile_max_bs=2, cuda_graph_max_bs=None, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, warmups=None, moe_dense_tp_size=None, n_share_experts_fusion=0, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, mm_attention_backend=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, disaggregation_transfer_backend='mooncake', disaggregation_ib_device=None, 
pdlb_url=None)
[2025-05-15 22:40:01] Infer the chat template name from the model path and obtain the result: llama-2.
[2025-05-15 22:40:08] Attention backend not set. Use flashinfer backend by default.
[2025-05-15 22:40:08] Init torch distributed begin.
[2025-05-15 22:40:08] Init torch distributed ends. mem usage=0.00 GB
[2025-05-15 22:40:08] Load weight begin. avail mem=61.86 GB
[2025-05-15 22:40:10] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:04<00:04,  4.22s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:05<00:00,  2.75s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:05<00:00,  2.97s/it]

[2025-05-15 22:40:16] Load weight end. type=LlamaForCausalLM, dtype=torch.float16, avail mem=49.20 GB, mem usage=12.66 GB.
[2025-05-15 22:40:17] KV Cache is allocated. #tokens: 20480, K size: 5.00 GB, V size: 5.00 GB
[2025-05-15 22:40:17] Memory pool end. avail mem=39.01 GB
2025-05-15 22:40:17,264 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[2025-05-15 22:40:18] Init torch distributed begin.
[2025-05-15 22:40:18] Init torch distributed ends. mem usage=0.00 GB
[2025-05-15 22:40:18] Load weight begin. avail mem=38.44 GB
[2025-05-15 22:40:18] Using model weights format ['*.bin']
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.12s/it]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.12s/it]

[2025-05-15 22:40:19] Load weight end. type=LlamaForCausalLMEagle, dtype=torch.float16, avail mem=37.51 GB, mem usage=0.93 GB.
[2025-05-15 22:40:19] KV Cache is allocated. #tokens: 20480, K size: 0.16 GB, V size: 0.16 GB
[2025-05-15 22:40:19] Memory pool end. avail mem=37.19 GB
[2025-05-15 22:40:20] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=4096
[2025-05-15 22:40:20] INFO:     Started server process [77746]
[2025-05-15 22:40:20] INFO:     Waiting for application startup.
[2025-05-15 22:40:20] INFO:     Application startup complete.
[2025-05-15 22:40:20] INFO:     Uvicorn running on http://127.0.0.1:36444 (Press CTRL+C to quit)
[2025-05-15 22:40:21] INFO:     127.0.0.1:56090 - "GET /v1/models HTTP/1.1" 200 OK
[2025-05-15 22:40:21] INFO:     127.0.0.1:56096 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-05-15 22:40:21] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
2025-05-15 22:40:21,763 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-05-15 22:40:21,787 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-05-15 22:40:21,794 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-15 22:40:21,816 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-15 22:40:22,376 - INFO - flashinfer.jit: Loading JIT ops: batch_decode_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False
2025-05-15 22:40:22,396 - INFO - flashinfer.jit: Finished loading JIT ops: batch_decode_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False
2025-05-15 22:40:24,461 - INFO - flashinfer.jit: Loading JIT ops: quantization
2025-05-15 22:40:24,483 - INFO - flashinfer.jit: Finished loading JIT ops: quantization
[2025-05-15 22:40:24] INFO:     127.0.0.1:56110 - "POST /generate HTTP/1.1" 200 OK
[2025-05-15 22:40:24] The server is fired up and ready to roll!


Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in their original black color, while the notebook outputs are highlighted in blue.
We are running these notebooks in a CI parallel environment, so the throughput is not representative of actual performance.
[6]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")
[2025-05-15 22:40:26] Prefill batch. #new-seq: 1, #new-token: 16, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0
2025-05-15 22:40:26,315 - INFO - flashinfer.jit: Loading JIT ops: cascade
2025-05-15 22:40:26,335 - INFO - flashinfer.jit: Finished loading JIT ops: cascade
[2025-05-15 22:40:26] INFO:     127.0.0.1:56126 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Response: ChatCompletion(id='18456d5cc0024e7f8c7567a7d5def1d7', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=' Sure! Here are three countries and their capitals:\n\n1. Country: France\nCapital: Paris\n2. Country: Japan\nCapital: Tokyo\n3. Country: Brazil\nCapital: Brasília', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=2)], created=1747348826, model='meta-llama/Llama-2-7b-chat-hf', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=48, prompt_tokens=17, total_tokens=65, completion_tokens_details=None, prompt_tokens_details=None))
[7]:
terminate_process(server_process)
[2025-05-15 22:40:26] Child process unexpectedly failed with an exit code 9. pid=77958

EAGLE-2 Decoding via Frequency-Ranked Speculative Sampling#

By employing a truncated high-frequency vocabulary in the draft model, EAGLE speculative decoding reduces the lm_head computational overhead and accelerates the pipeline without degrading quality. For more details, please check out this paper.

In our implementation, set --speculative-token-map to enable this optimization. You can obtain the high-frequency tokens used in FR-Spec from this model, or download them directly from this repo; a conceptual sketch of the truncated-vocabulary idea follows below.

Thanks to Weilin Zhao and Zhousx for their contributions.
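Below is a conceptual sketch of the truncated-vocabulary idea, for illustration only. It is not SGLang's internal implementation: the tensor sizes are arbitrary, and the token map is modeled as a plain tensor of high-frequency token ids, which may differ from the actual freq_32768.pt format.

import torch

# Illustration only: the draft model's lm_head scores just a high-frequency subset
# of the vocabulary, shrinking the projection from vocab_size rows to freq_size rows.
vocab_size, hidden_size, freq_size = 128256, 4096, 32768
lm_head_weight = torch.randn(vocab_size, hidden_size)     # stand-in for the draft model's lm_head
freq_token_ids = torch.randperm(vocab_size)[:freq_size]   # stand-in for the frequency-ranked token map

hidden_state = torch.randn(1, hidden_size)                 # last hidden state from the draft model
# Score only the high-frequency rows: a (1, freq_size) matmul instead of (1, vocab_size).
logits_freq = hidden_state @ lm_head_weight[freq_token_ids].T
draft_token = freq_token_ids[logits_freq.argmax(dim=-1)]   # map back to a full-vocabulary token id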

[8]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3-8B-Instruct --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B --speculative-num-steps 5 \
    --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --speculative-token-map thunlp/LLaMA3-Instruct-8B-FR-Spec/freq_32768.pt \
    --mem-fraction 0.7 --cuda-graph-max-bs 2 --dtype float16
"""
)

wait_for_server(f"http://localhost:{port}")
Overlap scheduler is disabled because of using eagle speculative decoding.
[2025-05-15 22:40:33] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3-8B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='float16', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3-8B-Instruct', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, host='127.0.0.1', port=39148, mem_fraction_static=0.7, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=667996358, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm='EAGLE', speculative_draft_model_path='lmsys/sglang-EAGLE-LLaMA3-Instruct-8B', speculative_num_steps=5, speculative_eagle_topk=8, speculative_num_draft_tokens=64, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map='thunlp/LLaMA3-Instruct-8B-FR-Spec/freq_32768.pt', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_overlap_schedule=True, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=2, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, warmups=None, moe_dense_tp_size=None, n_share_experts_fusion=0, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, mm_attention_backend=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, 
disaggregation_transfer_backend='mooncake', disaggregation_ib_device=None, pdlb_url=None)
[2025-05-15 22:40:33] Casting torch.bfloat16 to torch.float16.
[2025-05-15 22:40:39] Casting torch.bfloat16 to torch.float16.
[2025-05-15 22:40:40] Casting torch.bfloat16 to torch.float16.
[2025-05-15 22:40:40] Attention backend not set. Use flashinfer backend by default.
[2025-05-15 22:40:40] Init torch distributed begin.
[2025-05-15 22:40:41] Init torch distributed ends. mem usage=0.00 GB
[2025-05-15 22:40:41] Load weight begin. avail mem=61.86 GB
[2025-05-15 22:40:42] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:08<00:24,  8.33s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:10<00:09,  4.75s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:18<00:06,  6.37s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:37<00:00, 11.29s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:37<00:00,  9.43s/it]

[2025-05-15 22:41:20] Load weight end. type=LlamaForCausalLM, dtype=torch.float16, avail mem=46.80 GB, mem usage=15.06 GB.
[2025-05-15 22:41:20] KV Cache is allocated. #tokens: 20480, K size: 1.25 GB, V size: 1.25 GB
[2025-05-15 22:41:20] Memory pool end. avail mem=44.10 GB
2025-05-15 22:41:20,619 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[2025-05-15 22:41:21] Warning: User-specified context_length (8192) is greater than the derived context_length (2048). This may lead to incorrect model outputs or CUDA errors.
[2025-05-15 22:41:21] Init torch distributed begin.
[2025-05-15 22:41:21] Init torch distributed ends. mem usage=0.00 GB
[2025-05-15 22:41:21] Load weight begin. avail mem=43.53 GB
[2025-05-15 22:41:22] Using model weights format ['*.bin']
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:04<00:00,  4.48s/it]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:04<00:00,  4.48s/it]

[2025-05-15 22:41:26] Load weight end. type=LlamaForCausalLMEagle, dtype=torch.float16, avail mem=41.83 GB, mem usage=1.70 GB.
[2025-05-15 22:41:26] KV Cache is allocated. #tokens: 20480, K size: 0.04 GB, V size: 0.04 GB
[2025-05-15 22:41:26] Memory pool end. avail mem=41.75 GB
[2025-05-15 22:41:27] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=8192
[2025-05-15 22:41:27] INFO:     Started server process [78858]
[2025-05-15 22:41:27] INFO:     Waiting for application startup.
[2025-05-15 22:41:27] INFO:     Application startup complete.
[2025-05-15 22:41:27] INFO:     Uvicorn running on http://127.0.0.1:39148 (Press CTRL+C to quit)
[2025-05-15 22:41:27] INFO:     127.0.0.1:54854 - "GET /v1/models HTTP/1.1" 200 OK
[2025-05-15 22:41:28] INFO:     127.0.0.1:54858 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-05-15 22:41:28] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
2025-05-15 22:41:29,501 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-05-15 22:41:29,525 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-05-15 22:41:29,533 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-15 22:41:29,554 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-15 22:41:30,598 - INFO - flashinfer.jit: Loading JIT ops: batch_decode_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False
2025-05-15 22:41:30,619 - INFO - flashinfer.jit: Finished loading JIT ops: batch_decode_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False


Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in their original black color, while the notebook outputs are highlighted in blue.
We are running these notebooks in a CI parallel environment, so the throughput is not representative of actual performance.
[9]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")
2025-05-15 22:41:33,960 - INFO - flashinfer.jit: Loading JIT ops: quantization
2025-05-15 22:41:33,981 - INFO - flashinfer.jit: Finished loading JIT ops: quantization
[2025-05-15 22:41:34] Prefill batch. #new-seq: 1, #new-token: 17, #cached-token: 1, token usage: 0.00, #running-req: 1, #queue-req: 0
2025-05-15 22:41:34,313 - INFO - flashinfer.jit: Loading JIT ops: cascade
2025-05-15 22:41:34,334 - INFO - flashinfer.jit: Finished loading JIT ops: cascade
[2025-05-15 22:41:37] INFO:     127.0.0.1:54874 - "POST /generate HTTP/1.1" 200 OK
[2025-05-15 22:41:37] The server is fired up and ready to roll!
[2025-05-15 22:41:37] INFO:     127.0.0.1:36496 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Response: ChatCompletion(id='a6f24bd733574077872d9423782faffc', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Here are 3 countries and their capitals:\n\n1. **France** - **Paris**\n2. **Japan** - **Tokyo**\n3. **Australia** - **Canberra**', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=128009)], created=1747348893, model='meta-llama/Meta-Llama-3-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=39, prompt_tokens=18, total_tokens=57, completion_tokens_details=None, prompt_tokens_details=None))
[10]:
terminate_process(server_process)
[2025-05-15 22:41:37] Child process unexpectedly failed with an exit code 9. pid=79007

EAGLE-3 Decoding#

You can enable EAGLE-3 decoding by setting --speculative_algorithm EAGLE3 and choosing an appropriate model.

[11]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct  --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B --speculative-num-steps 5 \
        --speculative-eagle-topk 8 --speculative-num-draft-tokens 32 --mem-fraction 0.6 \
        --cuda-graph-max-bs 2 --dtype float16
"""
)

wait_for_server(f"http://localhost:{port}")
Overlap scheduler is disabled because of using eagle speculative decoding.
[2025-05-15 22:41:43] server_args=ServerArgs(model_path='meta-llama/Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Llama-3.1-8B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='float16', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-3.1-8B-Instruct', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, host='127.0.0.1', port=37616, mem_fraction_static=0.6, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=135676897, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm='EAGLE3', speculative_draft_model_path='jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B', speculative_num_steps=5, speculative_eagle_topk=8, speculative_num_draft_tokens=32, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_overlap_schedule=True, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=2, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, warmups=None, moe_dense_tp_size=None, n_share_experts_fusion=0, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, mm_attention_backend=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, disaggregation_transfer_backend='mooncake', 
disaggregation_ib_device=None, pdlb_url=None)
[2025-05-15 22:41:44] Casting torch.bfloat16 to torch.float16.
[2025-05-15 22:41:50] Casting torch.bfloat16 to torch.float16.
[2025-05-15 22:41:51] Casting torch.bfloat16 to torch.float16.
[2025-05-15 22:41:51] Attention backend not set. Use flashinfer backend by default.
[2025-05-15 22:41:51] Init torch distributed begin.
[2025-05-15 22:41:51] Init torch distributed ends. mem usage=0.00 GB
[2025-05-15 22:41:51] Load weight begin. avail mem=61.86 GB
[2025-05-15 22:41:52] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:05<00:17,  5.98s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:10<00:10,  5.25s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:12<00:03,  3.58s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:18<00:00,  4.58s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:18<00:00,  4.61s/it]

[2025-05-15 22:42:11] Load weight end. type=LlamaForCausalLM, dtype=torch.float16, avail mem=46.75 GB, mem usage=15.11 GB.
[2025-05-15 22:42:11] KV Cache is allocated. #tokens: 20480, K size: 1.25 GB, V size: 1.25 GB
[2025-05-15 22:42:11] Memory pool end. avail mem=43.96 GB
2025-05-15 22:42:11,915 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[2025-05-15 22:42:12] Warning: User-specified context_length (131072) is greater than the derived context_length (2048). This may lead to incorrect model outputs or CUDA errors.
[2025-05-15 22:42:12] Init torch distributed begin.
[2025-05-15 22:42:12] Init torch distributed ends. mem usage=0.00 GB
[2025-05-15 22:42:12] Load weight begin. avail mem=43.39 GB
[2025-05-15 22:42:13] Using model weights format ['*.bin']
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.97it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.97it/s]

[2025-05-15 22:42:14] Load weight end. type=LlamaForCausalLMEagle3, dtype=torch.float16, avail mem=41.62 GB, mem usage=1.77 GB.
[2025-05-15 22:42:14] KV Cache is allocated. #tokens: 20480, K size: 0.04 GB, V size: 0.04 GB
[2025-05-15 22:42:14] Memory pool end. avail mem=41.53 GB
[2025-05-15 22:42:14] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=131072
[2025-05-15 22:42:15] INFO:     Started server process [80041]
[2025-05-15 22:42:15] INFO:     Waiting for application startup.
[2025-05-15 22:42:15] INFO:     Application startup complete.
[2025-05-15 22:42:15] INFO:     Uvicorn running on http://127.0.0.1:37616 (Press CTRL+C to quit)
[2025-05-15 22:42:15] INFO:     127.0.0.1:38460 - "GET /v1/models HTTP/1.1" 200 OK
[2025-05-15 22:42:16] INFO:     127.0.0.1:38466 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-05-15 22:42:16] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
2025-05-15 22:42:16,662 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-05-15 22:42:16,686 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-05-15 22:42:16,692 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-15 22:42:16,712 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-15 22:42:17,404 - INFO - flashinfer.jit: Loading JIT ops: batch_decode_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False
2025-05-15 22:42:17,423 - INFO - flashinfer.jit: Finished loading JIT ops: batch_decode_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False
2025-05-15 22:42:19,724 - INFO - flashinfer.jit: Loading JIT ops: quantization
2025-05-15 22:42:19,746 - INFO - flashinfer.jit: Finished loading JIT ops: quantization
[2025-05-15 22:42:20] INFO:     127.0.0.1:38476 - "POST /generate HTTP/1.1" 200 OK
[2025-05-15 22:42:20] The server is fired up and ready to roll!


Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in their original black color, while the notebook outputs are highlighted in blue.
We are running these notebooks in a CI parallel environment, so the throughput is not representative of actual performance.
[12]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")
[2025-05-15 22:42:20] Prefill batch. #new-seq: 1, #new-token: 42, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0
2025-05-15 22:42:20,856 - INFO - flashinfer.jit: Loading JIT ops: cascade
2025-05-15 22:42:20,877 - INFO - flashinfer.jit: Finished loading JIT ops: cascade
[2025-05-15 22:42:21] INFO:     127.0.0.1:37174 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Response: ChatCompletion(id='00bd2d76f68a496e8c6310041664bba6', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Here are 3 countries and their capitals:\n\n1. Country: Japan\n Capital: Tokyo\n\n2. Country: Australia\n Capital: Canberra\n\n3. Country: Brazil\n Capital: Brasília', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=128009)], created=1747348940, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=43, prompt_tokens=43, total_tokens=86, completion_tokens_details=None, prompt_tokens_details=None))
[13]:
terminate_process(server_process)
[2025-05-15 22:42:21] Child process unexpectedly failed with an exit code 9. pid=80190

References#

The EAGLE process is as follows:

  • Within EAGLE, the draft model predicts the next feature vector, i.e. the last hidden state of the original LLM, using the feature sequence \((f_1, ..., f_k)\) and the token sequence \((t_2, ..., t_{k+1})\).

  • The next token is then sampled from \(p_{k+2}=\text{LMHead}(f_{k+1})\). Afterwards, the two sequences are extended in a tree style (branching out into multiple possible continuations, with the branching factor per step controlled by the speculative_eagle_topk parameter) to ensure a more coherent connection of context, and are fed back as input again.

  • EAGLE-2 additionally uses the draft model to evaluate how probable certain branches in the draft tree are, dynamically stopping the expansion of unlikely branches. After the expansion phase, reranking is employed to keep only the top speculative_num_draft_tokens final nodes as draft tokens.

  • EAGLE-3 removes the feature prediction objective, incorporates low- and mid-layer features, and is trained in an on-policy manner.

Drafting accuracy is improved by operating on features rather than tokens, which gives more regular inputs, and by additionally passing the token from the next time step to minimize the randomness introduced by sampling. Furthermore, the dynamic adjustment of the draft tree and the selection of reranked final nodes further increase the acceptance rate of draft tokens. For more details, please see the EAGLE-2 and EAGLE-3 papers.
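Schematically, one step of the drafting recurrence described above can be written, using the same notation as in the bullets, as

\[
f_{k+1} = \text{Draft}\big(f_1, \dots, f_k;\; t_2, \dots, t_{k+1}\big), \qquad
p_{k+2} = \text{LMHead}(f_{k+1}), \qquad t_{k+2} \sim p_{k+2},
\]

and the tree expansion then repeats this step for each kept branch, up to speculative_num_steps times.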

For guidance on how to train your own EAGLE model, please see the EAGLE repo.