Speculative Decoding#
SGLang now provides an EAGLE-based (EAGLE-2/EAGLE-3) speculative decoding option. Our implementation aims to maximize speed and efficiency and is considered among the fastest of the open-source LLM engines. Note: Currently, speculative decoding in SGLang is compatible with radix cache and chunked prefill.
Performance Highlights#
See below for the large throughput improvements achieved via EAGLE3 decoding for the LLaMA-Instruct 3.1 8B model, tested on MT bench. For further details, please see the EAGLE3 paper.
| Method | Throughput (tokens/sec) |
|---|---|
| SGLang (w/o speculative decoding, 1x H100) | 158.34 tokens/sec |
| SGLang + EAGLE-2 (1x H100) | 244.10 tokens/sec |
| SGLang + EAGLE-3 (1x H100) | 373.25 tokens/sec |
EAGLE Decoding#
To enable EAGLE speculative decoding, the following parameters are relevant:
- `speculative_draft_model_path`: Specifies the draft model. This parameter is required.
- `speculative_num_steps`: Depth of autoregressive drafting. Increases the speculation range at the risk of rejection cascades. Default is 5.
- `speculative_eagle_topk`: Branching factor per step. Improves candidate diversity and leads to a higher acceptance rate, but also to higher memory/compute consumption. Default is 4.
- `speculative_num_draft_tokens`: Maximum parallel verification capacity. Allows deeper tree evaluation at the cost of higher GPU memory usage. Default is 8.
These parameters are the same for EAGLE-2 and EAGLE-3.
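As a rough, back-of-the-envelope sketch of how these knobs interact (the relationship below is an assumption drawn from the parameter descriptions above, not a statement about SGLang internals), the draft stage keeps up to `speculative_eagle_topk` candidates at each of the `speculative_num_steps` draft steps, and at most `speculative_num_draft_tokens` of the resulting candidates are verified in parallel by the target model:

# Hypothetical helper (not part of SGLang) that estimates the per-step draft/verify budget.
def draft_budget(num_steps: int, eagle_topk: int, num_draft_tokens: int) -> dict:
    # Assumed simplification: roughly num_steps * eagle_topk candidate tokens are drafted per decode step.
    drafted = num_steps * eagle_topk
    # Only up to num_draft_tokens of them (after reranking) are sent to the target model for parallel verification.
    verified = min(drafted, num_draft_tokens)
    return {"drafted_candidates": drafted, "verified_per_step": verified}

# Defaults listed above: steps=5, topk=4, draft_tokens=8.
print(draft_budget(5, 4, 8))   # {'drafted_candidates': 20, 'verified_per_step': 8}
# The EAGLE-2 example below uses steps=3, topk=4, draft_tokens=16.
print(draft_budget(3, 4, 16))  # {'drafted_candidates': 12, 'verified_per_step': 12}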
EAGLE-2 Decoding#
You can enable EAGLE-2 decoding by setting `--speculative_algorithm EAGLE` and choosing the appropriate model.
[1]:
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

from sglang.utils import wait_for_server, print_highlight, terminate_process
import openai
[2]:
server_process, port = launch_server_cmd(
"""
python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf --speculative-algorithm EAGLE \
--speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 3 \
--speculative-eagle-topk 4 --speculative-num-draft-tokens 16 --cuda-graph-max-bs 8
"""
)
wait_for_server(f"http://localhost:{port}")
Overlap scheduler is disabled because of using eagle speculative decoding.
[2025-05-15 22:37:29] server_args=ServerArgs(model_path='meta-llama/Llama-2-7b-chat-hf', tokenizer_path='meta-llama/Llama-2-7b-chat-hf', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-2-7b-chat-hf', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, host='127.0.0.1', port=32059, mem_fraction_static=0.88, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=827322011, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm='EAGLE', speculative_draft_model_path='lmsys/sglang-EAGLE-llama2-chat-7B', speculative_num_steps=3, speculative_eagle_topk=4, speculative_num_draft_tokens=16, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_overlap_schedule=True, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=8, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, warmups=None, moe_dense_tp_size=None, n_share_experts_fusion=0, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, mm_attention_backend=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, disaggregation_transfer_backend='mooncake', disaggregation_ib_device=None, 
pdlb_url=None)
[2025-05-15 22:37:31] Infer the chat template name from the model path and obtain the result: llama-2.
[2025-05-15 22:37:37] Attention backend not set. Use flashinfer backend by default.
[2025-05-15 22:37:37] Init torch distributed begin.
[2025-05-15 22:37:37] Init torch distributed ends. mem usage=0.00 GB
[2025-05-15 22:37:37] Load weight begin. avail mem=34.53 GB
[2025-05-15 22:37:38] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:09<00:09, 9.58s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:21<00:00, 11.03s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:21<00:00, 10.82s/it]
[2025-05-15 22:38:00] Load weight end. type=LlamaForCausalLM, dtype=torch.float16, avail mem=21.87 GB, mem usage=12.66 GB.
[2025-05-15 22:38:00] KV Cache is allocated. #tokens: 20480, K size: 5.00 GB, V size: 5.00 GB
[2025-05-15 22:38:00] Memory pool end. avail mem=11.68 GB
2025-05-15 22:38:00,884 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[2025-05-15 22:38:01] Init torch distributed begin.
[2025-05-15 22:38:01] Init torch distributed ends. mem usage=0.00 GB
[2025-05-15 22:38:01] Load weight begin. avail mem=11.11 GB
[2025-05-15 22:38:01] Using model weights format ['*.bin']
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.39s/it]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.39s/it]
[2025-05-15 22:38:03] Load weight end. type=LlamaForCausalLMEagle, dtype=torch.float16, avail mem=10.18 GB, mem usage=0.93 GB.
[2025-05-15 22:38:03] KV Cache is allocated. #tokens: 20480, K size: 0.16 GB, V size: 0.16 GB
[2025-05-15 22:38:03] Memory pool end. avail mem=9.86 GB
[2025-05-15 22:38:03] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=4096
[2025-05-15 22:38:03] INFO: Started server process [74315]
[2025-05-15 22:38:03] INFO: Waiting for application startup.
[2025-05-15 22:38:03] INFO: Application startup complete.
[2025-05-15 22:38:03] INFO: Uvicorn running on http://127.0.0.1:32059 (Press CTRL+C to quit)
[2025-05-15 22:38:04] INFO: 127.0.0.1:56572 - "GET /v1/models HTTP/1.1" 200 OK
[2025-05-15 22:38:04] INFO: 127.0.0.1:56578 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-05-15 22:38:04] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
2025-05-15 22:38:05,519 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and the notebook code together, so their outputs are combined.
To improve clarity, the server logs are shown in the original black, while the notebook outputs are highlighted in blue.
We run these notebooks in a CI parallel environment, so the throughput is not representative of actual performance.
[3]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")
2025-05-15 22:38:56,487 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-05-15 22:38:56,494 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-15 22:39:12,241 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[2025-05-15 22:39:12] Prefill batch. #new-seq: 1, #new-token: 16, #cached-token: 1, token usage: 0.00, #running-req: 1, #queue-req: 0
2025-05-15 22:39:12,763 - INFO - flashinfer.jit: Loading JIT ops: cascade
2025-05-15 22:39:12,782 - INFO - flashinfer.jit: Finished loading JIT ops: cascade
2025-05-15 22:39:12,981 - INFO - flashinfer.jit: Loading JIT ops: batch_decode_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False
2025-05-15 22:39:30,180 - INFO - flashinfer.jit: Finished loading JIT ops: batch_decode_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False
2025-05-15 22:39:33,831 - INFO - flashinfer.jit: Loading JIT ops: quantization
2025-05-15 22:39:49,479 - INFO - flashinfer.jit: Finished loading JIT ops: quantization
[2025-05-15 22:39:49] INFO: 127.0.0.1:56592 - "POST /generate HTTP/1.1" 200 OK
[2025-05-15 22:39:49] The server is fired up and ready to roll!
[2025-05-15 22:39:54] INFO: 127.0.0.1:56604 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[4]:
terminate_process(server_process)
EAGLE-2 Decoding with `torch.compile`#
You can also enable `torch.compile` for further optimization and optionally set `--torch-compile-max-bs`:
[5]:
server_process, port = launch_server_cmd(
"""
python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf --speculative-algorithm EAGLE \
--speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 \
--speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.6 \
--enable-torch-compile --torch-compile-max-bs 2
"""
)
wait_for_server(f"http://localhost:{port}")
Overlap scheduler is disabled because of using eagle speculative decoding.
[2025-05-15 22:40:00] server_args=ServerArgs(model_path='meta-llama/Llama-2-7b-chat-hf', tokenizer_path='meta-llama/Llama-2-7b-chat-hf', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-2-7b-chat-hf', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, host='127.0.0.1', port=36444, mem_fraction_static=0.6, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=113513447, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm='EAGLE', speculative_draft_model_path='lmsys/sglang-EAGLE-llama2-chat-7B', speculative_num_steps=5, speculative_eagle_topk=8, speculative_num_draft_tokens=64, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_overlap_schedule=True, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', enable_torch_compile=True, torch_compile_max_bs=2, cuda_graph_max_bs=None, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, warmups=None, moe_dense_tp_size=None, n_share_experts_fusion=0, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, mm_attention_backend=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, disaggregation_transfer_backend='mooncake', disaggregation_ib_device=None, 
pdlb_url=None)
[2025-05-15 22:40:01] Infer the chat template name from the model path and obtain the result: llama-2.
[2025-05-15 22:40:08] Attention backend not set. Use flashinfer backend by default.
[2025-05-15 22:40:08] Init torch distributed begin.
[2025-05-15 22:40:08] Init torch distributed ends. mem usage=0.00 GB
[2025-05-15 22:40:08] Load weight begin. avail mem=61.86 GB
[2025-05-15 22:40:10] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:04<00:04, 4.22s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:05<00:00, 2.75s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:05<00:00, 2.97s/it]
[2025-05-15 22:40:16] Load weight end. type=LlamaForCausalLM, dtype=torch.float16, avail mem=49.20 GB, mem usage=12.66 GB.
[2025-05-15 22:40:17] KV Cache is allocated. #tokens: 20480, K size: 5.00 GB, V size: 5.00 GB
[2025-05-15 22:40:17] Memory pool end. avail mem=39.01 GB
2025-05-15 22:40:17,264 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[2025-05-15 22:40:18] Init torch distributed begin.
[2025-05-15 22:40:18] Init torch distributed ends. mem usage=0.00 GB
[2025-05-15 22:40:18] Load weight begin. avail mem=38.44 GB
[2025-05-15 22:40:18] Using model weights format ['*.bin']
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.12s/it]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.12s/it]
[2025-05-15 22:40:19] Load weight end. type=LlamaForCausalLMEagle, dtype=torch.float16, avail mem=37.51 GB, mem usage=0.93 GB.
[2025-05-15 22:40:19] KV Cache is allocated. #tokens: 20480, K size: 0.16 GB, V size: 0.16 GB
[2025-05-15 22:40:19] Memory pool end. avail mem=37.19 GB
[2025-05-15 22:40:20] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=4096
[2025-05-15 22:40:20] INFO: Started server process [77746]
[2025-05-15 22:40:20] INFO: Waiting for application startup.
[2025-05-15 22:40:20] INFO: Application startup complete.
[2025-05-15 22:40:20] INFO: Uvicorn running on http://127.0.0.1:36444 (Press CTRL+C to quit)
[2025-05-15 22:40:21] INFO: 127.0.0.1:56090 - "GET /v1/models HTTP/1.1" 200 OK
[2025-05-15 22:40:21] INFO: 127.0.0.1:56096 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-05-15 22:40:21] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
2025-05-15 22:40:21,763 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-05-15 22:40:21,787 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-05-15 22:40:21,794 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-15 22:40:21,816 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-15 22:40:22,376 - INFO - flashinfer.jit: Loading JIT ops: batch_decode_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False
2025-05-15 22:40:22,396 - INFO - flashinfer.jit: Finished loading JIT ops: batch_decode_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False
2025-05-15 22:40:24,461 - INFO - flashinfer.jit: Loading JIT ops: quantization
2025-05-15 22:40:24,483 - INFO - flashinfer.jit: Finished loading JIT ops: quantization
[2025-05-15 22:40:24] INFO: 127.0.0.1:56110 - "POST /generate HTTP/1.1" 200 OK
[2025-05-15 22:40:24] The server is fired up and ready to roll!
Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and the notebook code together, so their outputs are combined.
To improve clarity, the server logs are shown in the original black, while the notebook outputs are highlighted in blue.
We run these notebooks in a CI parallel environment, so the throughput is not representative of actual performance.
[6]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")
[2025-05-15 22:40:26] Prefill batch. #new-seq: 1, #new-token: 16, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0
2025-05-15 22:40:26,315 - INFO - flashinfer.jit: Loading JIT ops: cascade
2025-05-15 22:40:26,335 - INFO - flashinfer.jit: Finished loading JIT ops: cascade
[2025-05-15 22:40:26] INFO: 127.0.0.1:56126 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[7]:
terminate_process(server_process)
[2025-05-15 22:40:26] Child process unexpectedly failed with an exit code 9. pid=77958
EAGLE-2 Decoding with Frequency-Ranked Speculative Sampling#
By employing a truncated high-frequency vocabulary in the draft model, EAGLE speculative decoding reduces the `lm_head` computational overhead and speeds up the pipeline without degrading quality. For more details, see this paper.
In our implementation, set `--speculative-token-map` to enable this optimization. You can obtain the high-frequency tokens used by FR-Spec from this model, or download them directly from this repo.
Thanks to Weilin Zhao and Zhousx for the contribution.
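To illustrate the idea, here is a minimal sketch only (not SGLang's actual code; the layout of the token-map file is assumed here to be a 1-D tensor of token ids): restricting the draft model's `lm_head` to a high-frequency subset of the vocabulary turns the final projection from `[hidden, |vocab|]` into the much smaller `[hidden, |subset|]`.

import torch

# Toy sizes for illustration; FR-Spec for Llama-3 truncates a ~128k-token vocab to 32768 ids.
hidden_size, full_vocab, subset_size = 8, 1000, 256
lm_head_weight = torch.randn(full_vocab, hidden_size)

# Assumed layout of a token map such as freq_32768.pt: a 1-D tensor of high-frequency token ids.
high_freq_ids = torch.randperm(full_vocab)[:subset_size]

hidden_state = torch.randn(1, hidden_size)

full_logits = hidden_state @ lm_head_weight.T                       # [1, full_vocab]
truncated_logits = hidden_state @ lm_head_weight[high_freq_ids].T   # [1, subset_size]

# A token drafted in the truncated space must be mapped back to its full-vocabulary id.
draft_token_id = high_freq_ids[truncated_logits.argmax(dim=-1)]
print(full_logits.shape, truncated_logits.shape, int(draft_token_id))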
[8]:
server_process, port = launch_server_cmd(
"""
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3-8B-Instruct --speculative-algorithm EAGLE \
--speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B --speculative-num-steps 5 \
--speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --speculative-token-map thunlp/LLaMA3-Instruct-8B-FR-Spec/freq_32768.pt \
--mem-fraction 0.7 --cuda-graph-max-bs 2 --dtype float16
"""
)
wait_for_server(f"http://localhost:{port}")
Overlap scheduler is disabled because of using eagle speculative decoding.
[2025-05-15 22:40:33] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3-8B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='float16', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3-8B-Instruct', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, host='127.0.0.1', port=39148, mem_fraction_static=0.7, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=667996358, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm='EAGLE', speculative_draft_model_path='lmsys/sglang-EAGLE-LLaMA3-Instruct-8B', speculative_num_steps=5, speculative_eagle_topk=8, speculative_num_draft_tokens=64, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map='thunlp/LLaMA3-Instruct-8B-FR-Spec/freq_32768.pt', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_overlap_schedule=True, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=2, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, warmups=None, moe_dense_tp_size=None, n_share_experts_fusion=0, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, mm_attention_backend=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, 
disaggregation_transfer_backend='mooncake', disaggregation_ib_device=None, pdlb_url=None)
[2025-05-15 22:40:33] Casting torch.bfloat16 to torch.float16.
[2025-05-15 22:40:39] Casting torch.bfloat16 to torch.float16.
[2025-05-15 22:40:40] Casting torch.bfloat16 to torch.float16.
[2025-05-15 22:40:40] Attention backend not set. Use flashinfer backend by default.
[2025-05-15 22:40:40] Init torch distributed begin.
[2025-05-15 22:40:41] Init torch distributed ends. mem usage=0.00 GB
[2025-05-15 22:40:41] Load weight begin. avail mem=61.86 GB
[2025-05-15 22:40:42] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:08<00:24, 8.33s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:10<00:09, 4.75s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:18<00:06, 6.37s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:37<00:00, 11.29s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:37<00:00, 9.43s/it]
[2025-05-15 22:41:20] Load weight end. type=LlamaForCausalLM, dtype=torch.float16, avail mem=46.80 GB, mem usage=15.06 GB.
[2025-05-15 22:41:20] KV Cache is allocated. #tokens: 20480, K size: 1.25 GB, V size: 1.25 GB
[2025-05-15 22:41:20] Memory pool end. avail mem=44.10 GB
2025-05-15 22:41:20,619 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[2025-05-15 22:41:21] Warning: User-specified context_length (8192) is greater than the derived context_length (2048). This may lead to incorrect model outputs or CUDA errors.
[2025-05-15 22:41:21] Init torch distributed begin.
[2025-05-15 22:41:21] Init torch distributed ends. mem usage=0.00 GB
[2025-05-15 22:41:21] Load weight begin. avail mem=43.53 GB
[2025-05-15 22:41:22] Using model weights format ['*.bin']
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:04<00:00, 4.48s/it]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:04<00:00, 4.48s/it]
[2025-05-15 22:41:26] Load weight end. type=LlamaForCausalLMEagle, dtype=torch.float16, avail mem=41.83 GB, mem usage=1.70 GB.
[2025-05-15 22:41:26] KV Cache is allocated. #tokens: 20480, K size: 0.04 GB, V size: 0.04 GB
[2025-05-15 22:41:26] Memory pool end. avail mem=41.75 GB
[2025-05-15 22:41:27] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=8192
[2025-05-15 22:41:27] INFO: Started server process [78858]
[2025-05-15 22:41:27] INFO: Waiting for application startup.
[2025-05-15 22:41:27] INFO: Application startup complete.
[2025-05-15 22:41:27] INFO: Uvicorn running on http://127.0.0.1:39148 (Press CTRL+C to quit)
[2025-05-15 22:41:27] INFO: 127.0.0.1:54854 - "GET /v1/models HTTP/1.1" 200 OK
[2025-05-15 22:41:28] INFO: 127.0.0.1:54858 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-05-15 22:41:28] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
2025-05-15 22:41:29,501 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-05-15 22:41:29,525 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-05-15 22:41:29,533 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-15 22:41:29,554 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-15 22:41:30,598 - INFO - flashinfer.jit: Loading JIT ops: batch_decode_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False
2025-05-15 22:41:30,619 - INFO - flashinfer.jit: Finished loading JIT ops: batch_decode_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False
Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and the notebook code together, so their outputs are combined.
To improve clarity, the server logs are shown in the original black, while the notebook outputs are highlighted in blue.
We run these notebooks in a CI parallel environment, so the throughput is not representative of actual performance.
[9]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")
2025-05-15 22:41:33,960 - INFO - flashinfer.jit: Loading JIT ops: quantization
2025-05-15 22:41:33,981 - INFO - flashinfer.jit: Finished loading JIT ops: quantization
[2025-05-15 22:41:34] Prefill batch. #new-seq: 1, #new-token: 17, #cached-token: 1, token usage: 0.00, #running-req: 1, #queue-req: 0
2025-05-15 22:41:34,313 - INFO - flashinfer.jit: Loading JIT ops: cascade
2025-05-15 22:41:34,334 - INFO - flashinfer.jit: Finished loading JIT ops: cascade
[2025-05-15 22:41:37] INFO: 127.0.0.1:54874 - "POST /generate HTTP/1.1" 200 OK
[2025-05-15 22:41:37] The server is fired up and ready to roll!
[2025-05-15 22:41:37] INFO: 127.0.0.1:36496 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[10]:
terminate_process(server_process)
[2025-05-15 22:41:37] Child process unexpectedly failed with an exit code 9. pid=79007
EAGLE-3 Decoding#
You can enable EAGLE-3 decoding by setting `--speculative_algorithm EAGLE3` and choosing the appropriate model.
[11]:
server_process, port = launch_server_cmd(
"""
python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --speculative-algorithm EAGLE3 \
--speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B --speculative-num-steps 5 \
--speculative-eagle-topk 8 --speculative-num-draft-tokens 32 --mem-fraction 0.6 \
--cuda-graph-max-bs 2 --dtype float16
"""
)
wait_for_server(f"http://localhost:{port}")
Overlap scheduler is disabled because of using eagle speculative decoding.
[2025-05-15 22:41:43] server_args=ServerArgs(model_path='meta-llama/Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Llama-3.1-8B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='float16', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-3.1-8B-Instruct', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, host='127.0.0.1', port=37616, mem_fraction_static=0.6, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=135676897, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm='EAGLE3', speculative_draft_model_path='jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B', speculative_num_steps=5, speculative_eagle_topk=8, speculative_num_draft_tokens=32, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_overlap_schedule=True, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=2, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, warmups=None, moe_dense_tp_size=None, n_share_experts_fusion=0, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, mm_attention_backend=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, disaggregation_transfer_backend='mooncake', 
disaggregation_ib_device=None, pdlb_url=None)
[2025-05-15 22:41:44] Casting torch.bfloat16 to torch.float16.
[2025-05-15 22:41:50] Casting torch.bfloat16 to torch.float16.
[2025-05-15 22:41:51] Casting torch.bfloat16 to torch.float16.
[2025-05-15 22:41:51] Attention backend not set. Use flashinfer backend by default.
[2025-05-15 22:41:51] Init torch distributed begin.
[2025-05-15 22:41:51] Init torch distributed ends. mem usage=0.00 GB
[2025-05-15 22:41:51] Load weight begin. avail mem=61.86 GB
[2025-05-15 22:41:52] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:05<00:17, 5.98s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:10<00:10, 5.25s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:12<00:03, 3.58s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:18<00:00, 4.58s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:18<00:00, 4.61s/it]
[2025-05-15 22:42:11] Load weight end. type=LlamaForCausalLM, dtype=torch.float16, avail mem=46.75 GB, mem usage=15.11 GB.
[2025-05-15 22:42:11] KV Cache is allocated. #tokens: 20480, K size: 1.25 GB, V size: 1.25 GB
[2025-05-15 22:42:11] Memory pool end. avail mem=43.96 GB
2025-05-15 22:42:11,915 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[2025-05-15 22:42:12] Warning: User-specified context_length (131072) is greater than the derived context_length (2048). This may lead to incorrect model outputs or CUDA errors.
[2025-05-15 22:42:12] Init torch distributed begin.
[2025-05-15 22:42:12] Init torch distributed ends. mem usage=0.00 GB
[2025-05-15 22:42:12] Load weight begin. avail mem=43.39 GB
[2025-05-15 22:42:13] Using model weights format ['*.bin']
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.97it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.97it/s]
[2025-05-15 22:42:14] Load weight end. type=LlamaForCausalLMEagle3, dtype=torch.float16, avail mem=41.62 GB, mem usage=1.77 GB.
[2025-05-15 22:42:14] KV Cache is allocated. #tokens: 20480, K size: 0.04 GB, V size: 0.04 GB
[2025-05-15 22:42:14] Memory pool end. avail mem=41.53 GB
[2025-05-15 22:42:14] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=131072
[2025-05-15 22:42:15] INFO: Started server process [80041]
[2025-05-15 22:42:15] INFO: Waiting for application startup.
[2025-05-15 22:42:15] INFO: Application startup complete.
[2025-05-15 22:42:15] INFO: Uvicorn running on http://127.0.0.1:37616 (Press CTRL+C to quit)
[2025-05-15 22:42:15] INFO: 127.0.0.1:38460 - "GET /v1/models HTTP/1.1" 200 OK
[2025-05-15 22:42:16] INFO: 127.0.0.1:38466 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-05-15 22:42:16] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
2025-05-15 22:42:16,662 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-05-15 22:42:16,686 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-05-15 22:42:16,692 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-15 22:42:16,712 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-15 22:42:17,404 - INFO - flashinfer.jit: Loading JIT ops: batch_decode_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False
2025-05-15 22:42:17,423 - INFO - flashinfer.jit: Finished loading JIT ops: batch_decode_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False
2025-05-15 22:42:19,724 - INFO - flashinfer.jit: Loading JIT ops: quantization
2025-05-15 22:42:19,746 - INFO - flashinfer.jit: Finished loading JIT ops: quantization
[2025-05-15 22:42:20] INFO: 127.0.0.1:38476 - "POST /generate HTTP/1.1" 200 OK
[2025-05-15 22:42:20] The server is fired up and ready to roll!
Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and the notebook code together, so their outputs are combined.
To improve clarity, the server logs are shown in the original black, while the notebook outputs are highlighted in blue.
We run these notebooks in a CI parallel environment, so the throughput is not representative of actual performance.
[12]:
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")
[2025-05-15 22:42:20] Prefill batch. #new-seq: 1, #new-token: 42, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0
2025-05-15 22:42:20,856 - INFO - flashinfer.jit: Loading JIT ops: cascade
2025-05-15 22:42:20,877 - INFO - flashinfer.jit: Finished loading JIT ops: cascade
[2025-05-15 22:42:21] INFO: 127.0.0.1:37174 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[13]:
terminate_process(server_process)
[2025-05-15 22:42:21] Child process unexpectedly failed with an exit code 9. pid=80190
References#
The EAGLE process works as follows:
In EAGLE, the draft model predicts the next feature vector, i.e. the last hidden state of the original LLM, from the feature sequence \((f_1, ..., f_k)\) and the token sequence \((t_2, ..., t_{k+1})\).
The next token is then sampled from \(p_{k+2}=\text{LMHead}(f_{k+1})\). Afterwards, the two sequences are extended in a tree style, branching out into multiple potential continuations with the branching factor per step controlled by the `speculative_eagle_topk` parameter, to ensure a more coherent connection of context, and are fed in as input again. EAGLE-2 additionally uses the draft model to evaluate how likely certain branches of the draft tree are, dynamically stopping the expansion of unlikely branches. After the expansion phase, reranking is employed to keep only the top `speculative_num_draft_tokens` final nodes as draft tokens. EAGLE-3 removes the feature prediction objective, incorporates low- and mid-layer features, and is trained in an on-policy manner.
Operating on features instead of tokens yields more regular inputs, and additionally passing the token from the next timestep minimizes the randomness introduced by sampling; both improve drafting accuracy. Furthermore, the dynamic adjustment of the draft tree and the selection of reranked final nodes further increase the acceptance rate of the draft tokens. See the EAGLE-2 and EAGLE-3 papers for more details.
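Below is a toy, self-contained sketch of the draft-then-verify idea described above. It is a minimal, chain-shaped greedy draft only, not EAGLE itself: the real method conditions the draft model on hidden features and verifies a whole draft tree in one batched pass, and the two "models" here are hypothetical stand-ins.

from typing import Callable, List

def speculative_step(
    target_next: Callable[[List[int]], int],  # target model: prefix -> next token (greedy)
    draft_next: Callable[[List[int]], int],   # draft model:  prefix -> next token (greedy)
    prefix: List[int],
    num_draft_steps: int,
) -> List[int]:
    # 1) Draft: the cheap model proposes num_draft_steps tokens autoregressively.
    draft = []
    for _ in range(num_draft_steps):
        draft.append(draft_next(prefix + draft))

    # 2) Verify: accept the longest prefix of drafted tokens the target agrees with,
    #    then append the target's own next token, so each step emits at least one token.
    accepted = []
    for tok in draft:
        if tok == target_next(prefix + accepted):
            accepted.append(tok)
        else:
            break
    accepted.append(target_next(prefix + accepted))
    return prefix + accepted

# Tiny demo with hand-rolled "models": the draft matches the target only when the
# last token is even, so most draft tokens after the first mismatch are discarded.
target = lambda p: (p[-1] + 1) % 10
draft = lambda p: (p[-1] + 1) % 10 if p[-1] % 2 == 0 else 0
print(speculative_step(target, draft, [0], num_draft_steps=4))  # [0, 1, 2]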
For guidance on how to train your own EAGLE model, please see the EAGLE repo.