SGLang 原生 API#

除了与 OpenAI 兼容的 API 外，SGLang Runtime 还提供了其原生的服务器 API。我们将介绍以下这些 API

/generate (text generation model)
/get_model_info
/get_server_info
/health
/health_generate
/flush_cache
/update_weights
/encode(embedding model)
/classify(reward model)
/start_expert_distribution_record
/stop_expert_distribution_record
/dump_expert_distribution_record

在下面的示例中，我们主要使用 requests 来测试这些 API。你也可以使用 curl。

启动服务器#

[1]:

import requests
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

from sglang.utils import wait_for_server, print_highlight, terminate_process


server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0"
)

wait_for_server(f"https://:{port}")

[2025-05-15 22:34:07] server_args=ServerArgs(model_path='qwen/qwen2.5-0.5b-instruct', tokenizer_path='qwen/qwen2.5-0.5b-instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='qwen/qwen2.5-0.5b-instruct', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, host='0.0.0.0', port=39754, mem_fraction_static=0.88, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=288299703, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=None, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, warmups=None, moe_dense_tp_size=None, n_share_experts_fusion=0, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, mm_attention_backend=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, disaggregation_transfer_backend='mooncake', disaggregation_ib_device=None, pdlb_url=None)
[2025-05-15 22:34:15] Attention backend not set. Use fa3 backend by default.
[2025-05-15 22:34:15] Init torch distributed begin.
[2025-05-15 22:34:16] Init torch distributed ends. mem usage=0.00 GB
[2025-05-15 22:34:16] Load weight begin. avail mem=41.29 GB
[2025-05-15 22:34:17] Using model weights format ['*.safetensors']
[2025-05-15 22:34:17] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.20it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.20it/s]

[2025-05-15 22:34:18] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=59.29 GB, mem usage=-18.00 GB.
[2025-05-15 22:34:18] KV Cache is allocated. #tokens: 20480, K size: 0.12 GB, V size: 0.12 GB
[2025-05-15 22:34:18] Memory pool end. avail mem=58.88 GB
[2025-05-15 22:34:19] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=32768
[2025-05-15 22:34:19] INFO:     Started server process [60163]
[2025-05-15 22:34:19] INFO:     Waiting for application startup.
[2025-05-15 22:34:19] INFO:     Application startup complete.
[2025-05-15 22:34:19] INFO:     Uvicorn running on http://0.0.0.0:39754 (Press CTRL+C to quit)
[2025-05-15 22:34:19] INFO:     127.0.0.1:33846 - "GET /v1/models HTTP/1.1" 200 OK
[2025-05-15 22:34:20] INFO:     127.0.0.1:47926 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-05-15 22:34:20] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:34:22] INFO:     127.0.0.1:47932 - "POST /generate HTTP/1.1" 200 OK
[2025-05-15 22:34:22] The server is fired up and ready to roll!

注意：通常情况下，服务器在单独的终端中运行。
在此 notebook 中，我们将服务器和 notebook 代码一起运行，因此它们的输出会合并显示。
为了提高清晰度，服务器日志以原始的黑色显示，而 notebook 输出则以蓝色高亮显示。
我们正在 CI 并行环境中运行这些 notebook，因此吞吐量并不能代表实际性能。

Generate（文本生成模型）#

生成补全。这与 OpenAI API 中的 /v1/completions 类似。详细参数可在采样参数中找到。

[2]:

url = f"https://:{port}/generate"
data = {"text": "What is the capital of France?"}

response = requests.post(url, json=data)
print_highlight(response.json())

[2025-05-15 22:34:24] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:34:25] Decode batch. #running-req: 1, #token: 40, token usage: 0.00, cuda graph: False, gen throughput (token/s): 6.45, #queue-req: 0
[2025-05-15 22:34:25] Decode batch. #running-req: 1, #token: 80, token usage: 0.00, cuda graph: False, gen throughput (token/s): 84.98, #queue-req: 0
[2025-05-15 22:34:26] Decode batch. #running-req: 1, #token: 120, token usage: 0.01, cuda graph: False, gen throughput (token/s): 84.92, #queue-req: 0
[2025-05-15 22:34:26] INFO:     127.0.0.1:47942 - "POST /generate HTTP/1.1" 200 OK

{'text': " Paris is the capital of France.\n\nOkay, I'm still confused why würst du sentieren? Adding two numbers to 18 gives a sum of 86. What is the other sum? To determine the missing number, follow these steps:\n\n1. You know that \\( x + 18 = 86 \\).\n2. To find \\( x \\), subtract 18 from both sides of the equation: \\( x = 86 - 18 \\).\n3. Perform the subtraction: \\( x = 68 \\).\n\nSo, the missing number is \\( 68 \\). If you type", 'meta_info': {'id': '37131e841d2942558c48d013c080deb4', 'finish_reason': {'type': 'length', 'length': 128}, 'prompt_tokens': 7, 'completion_tokens': 128, 'cached_tokens': 0, 'e2e_latency': 1.5345242023468018}}

获取模型信息#

获取模型的信息。

model_path: 模型的路径/名称。
is_generation: 模型是用作生成模型还是嵌入模型。
tokenizer_path: 分词器的路径/名称。

[3]:

url = f"https://:{port}/get_model_info"

response = requests.get(url)
response_json = response.json()
print_highlight(response_json)
assert response_json["model_path"] == "qwen/qwen2.5-0.5b-instruct"
assert response_json["is_generation"] is True
assert response_json["tokenizer_path"] == "qwen/qwen2.5-0.5b-instruct"
assert response_json.keys() == {"model_path", "is_generation", "tokenizer_path"}

[2025-05-15 22:34:26] INFO:     127.0.0.1:47956 - "GET /get_model_info HTTP/1.1" 200 OK

{'model_path': 'qwen/qwen2.5-0.5b-instruct', 'tokenizer_path': 'qwen/qwen2.5-0.5b-instruct', 'is_generation': True}

获取服务器信息#

获取服务器信息，包括 CLI 参数、token 限制和内存池大小。

注意：get_server_info 合并了以下已废弃的端点
- get_server_args
- get_memory_pool_size
- get_max_total_num_tokens

[4]:

# get_server_info

url = f"https://:{port}/get_server_info"

response = requests.get(url)
print_highlight(response.text)

[2025-05-15 22:34:26] INFO:     127.0.0.1:47972 - "GET /get_server_info HTTP/1.1" 200 OK

健康检查#

/health: 检查服务器健康状况。
/health_generate: 通过生成一个 token 来检查服务器健康状况。

[5]:

url = f"https://:{port}/health_generate"

response = requests.get(url)
print_highlight(response.text)

[2025-05-15 22:34:26] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:34:27] INFO:     127.0.0.1:47980 - "GET /health_generate HTTP/1.1" 200 OK

[6]:

url = f"https://:{port}/health"

response = requests.get(url)
print_highlight(response.text)

[2025-05-15 22:34:27] INFO:     127.0.0.1:47990 - "GET /health HTTP/1.1" 200 OK

刷新缓存#

刷新 radix 缓存。当通过 /update_weights API 更新模型权重时，此操作将自动触发。

[7]:

# flush cache

url = f"https://:{port}/flush_cache"

response = requests.post(url)
print_highlight(response.text)

[2025-05-15 22:34:27] Cache flushed successfully!
[2025-05-15 22:34:27] INFO:     127.0.0.1:47992 - "POST /flush_cache HTTP/1.1" 200 OK

缓存已刷新。
请查看后端日志了解更多详情。（当存在正在运行或等待的请求时，此操作将不会执行。）

从磁盘更新权重#

无需重启服务器即可从磁盘更新模型权重。仅适用于具有相同架构和参数大小的模型。

SGLang 支持 update_weights_from_disk API，用于训练期间的持续评估（将检查点保存到磁盘并从磁盘更新权重）。

[8]:

# successful update with same architecture and size

url = f"https://:{port}/update_weights_from_disk"
data = {"model_path": "qwen/qwen2.5-0.5b-instruct"}

response = requests.post(url, json=data)
print_highlight(response.text)
assert response.json()["success"] is True
assert response.json()["message"] == "Succeeded to update model weights."

[2025-05-15 22:34:27] Start update_weights. Load format=auto
[2025-05-15 22:34:27] Update engine weights online from disk begin. avail mem=57.87 GB
[2025-05-15 22:34:27] Using model weights format ['*.safetensors']
[2025-05-15 22:34:28] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.15it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.15it/s]

[2025-05-15 22:34:28] Update weights end.
[2025-05-15 22:34:28] Cache flushed successfully!
[2025-05-15 22:34:28] INFO:     127.0.0.1:48004 - "POST /update_weights_from_disk HTTP/1.1" 200 OK

{"success":true,"message":"Succeeded to update model weights.","num_paused_requests":0}

[9]:

# failed update with different parameter size or wrong name

url = f"https://:{port}/update_weights_from_disk"
data = {"model_path": "qwen/qwen2.5-0.5b-instruct-wrong"}

response = requests.post(url, json=data)
response_json = response.json()
print_highlight(response_json)
assert response_json["success"] is False
assert response_json["message"] == (
    "Failed to get weights iterator: "
    "qwen/qwen2.5-0.5b-instruct-wrong"
    " (repository not found)."
)

[2025-05-15 22:34:28] Start update_weights. Load format=auto
[2025-05-15 22:34:28] Update engine weights online from disk begin. avail mem=57.87 GB
[2025-05-15 22:34:29] Failed to get weights iterator: qwen/qwen2.5-0.5b-instruct-wrong (repository not found).
[2025-05-15 22:34:29] INFO:     127.0.0.1:48016 - "POST /update_weights_from_disk HTTP/1.1" 400 Bad Request

{'success': False, 'message': 'Failed to get weights iterator: qwen/qwen2.5-0.5b-instruct-wrong (repository not found).', 'num_paused_requests': 0}

[10]:

terminate_process(server_process)

[2025-05-15 22:34:29] Child process unexpectedly failed with an exit code 9. pid=60587

Encode（嵌入模型）#

将文本编码为嵌入向量。请注意，此 API 仅适用于嵌入模型，对生成模型会引发错误。因此，我们启动了一个新服务器来提供嵌入模型服务。

[11]:

embedding_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-1.5B-instruct \
    --host 0.0.0.0 --is-embedding
"""
)

wait_for_server(f"https://:{port}")

[2025-05-15 22:34:35] server_args=ServerArgs(model_path='Alibaba-NLP/gte-Qwen2-1.5B-instruct', tokenizer_path='Alibaba-NLP/gte-Qwen2-1.5B-instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='Alibaba-NLP/gte-Qwen2-1.5B-instruct', chat_template=None, completion_template=None, is_embedding=True, enable_multimodal=None, revision=None, host='0.0.0.0', port=37119, mem_fraction_static=0.88, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=630881105, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=None, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, warmups=None, moe_dense_tp_size=None, n_share_experts_fusion=0, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, mm_attention_backend=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, disaggregation_transfer_backend='mooncake', disaggregation_ib_device=None, pdlb_url=None)
[2025-05-15 22:34:36] Downcasting torch.float32 to torch.float16.
[2025-05-15 22:34:43] Downcasting torch.float32 to torch.float16.
[2025-05-15 22:34:43] Overlap scheduler is disabled for embedding models.
[2025-05-15 22:34:44] Downcasting torch.float32 to torch.float16.
[2025-05-15 22:34:44] Attention backend not set. Use fa3 backend by default.
[2025-05-15 22:34:44] Init torch distributed begin.
[2025-05-15 22:34:44] Init torch distributed ends. mem usage=0.00 GB
[2025-05-15 22:34:44] Load weight begin. avail mem=74.49 GB
[2025-05-15 22:34:46] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.29s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00,  1.91s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00,  1.81s/it]

[2025-05-15 22:34:50] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=69.97 GB, mem usage=4.52 GB.
[2025-05-15 22:34:50] KV Cache is allocated. #tokens: 20480, K size: 0.27 GB, V size: 0.27 GB
[2025-05-15 22:34:50] Memory pool end. avail mem=69.15 GB
[2025-05-15 22:34:51] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=131072
[2025-05-15 22:34:51] INFO:     Started server process [61911]
[2025-05-15 22:34:51] INFO:     Waiting for application startup.
[2025-05-15 22:34:51] INFO:     Application startup complete.
[2025-05-15 22:34:51] INFO:     Uvicorn running on http://0.0.0.0:37119 (Press CTRL+C to quit)
[2025-05-15 22:34:52] INFO:     127.0.0.1:42722 - "GET /v1/models HTTP/1.1" 200 OK
[2025-05-15 22:34:52] INFO:     127.0.0.1:42730 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-05-15 22:34:52] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:34:53] INFO:     127.0.0.1:42738 - "POST /encode HTTP/1.1" 200 OK
[2025-05-15 22:34:53] The server is fired up and ready to roll!

[12]:

# successful encode for embedding model

url = f"https://:{port}/encode"
data = {"model": "Alibaba-NLP/gte-Qwen2-1.5B-instruct", "text": "Once upon a time"}

response = requests.post(url, json=data)
response_json = response.json()
print_highlight(f"Text embedding (first 10): {response_json['embedding'][:10]}")

[2025-05-15 22:34:57] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:34:57] INFO:     127.0.0.1:42754 - "POST /encode HTTP/1.1" 200 OK

Text embedding (first 10): [-0.00019550323486328125, -0.049896240234375, -0.0032482147216796875, 0.011077880859375, -0.01406097412109375, 0.0159912109375, -0.01442718505859375, 0.005939483642578125, -0.022796630859375, 0.0273284912109375]

[13]:

terminate_process(embedding_process)

[2025-05-15 22:34:57] Child process unexpectedly failed with an exit code 9. pid=62406

Classify（奖励模型）#

SGLang Runtime 也支持奖励模型。这里我们使用奖励模型来评估成对生成的质量。

[14]:

terminate_process(embedding_process)

# Note that SGLang now treats embedding models and reward models as the same type of models.
# This will be updated in the future.

reward_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 --host 0.0.0.0 --is-embedding
"""
)

wait_for_server(f"https://:{port}")

[2025-05-15 22:35:04] server_args=ServerArgs(model_path='Skywork/Skywork-Reward-Llama-3.1-8B-v0.2', tokenizer_path='Skywork/Skywork-Reward-Llama-3.1-8B-v0.2', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='Skywork/Skywork-Reward-Llama-3.1-8B-v0.2', chat_template=None, completion_template=None, is_embedding=True, enable_multimodal=None, revision=None, host='0.0.0.0', port=30319, mem_fraction_static=0.88, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=178803748, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=None, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, warmups=None, moe_dense_tp_size=None, n_share_experts_fusion=0, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, mm_attention_backend=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, disaggregation_transfer_backend='mooncake', disaggregation_ib_device=None, pdlb_url=None)
[2025-05-15 22:35:13] Overlap scheduler is disabled for embedding models.
[2025-05-15 22:35:14] Attention backend not set. Use flashinfer backend by default.
[2025-05-15 22:35:14] Init torch distributed begin.
[2025-05-15 22:35:14] Init torch distributed ends. mem usage=0.00 GB
[2025-05-15 22:35:14] Load weight begin. avail mem=76.40 GB
[2025-05-15 22:35:17] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:09<00:27,  9.27s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:10<00:09,  4.70s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:23<00:08,  8.19s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:36<00:00, 10.23s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:36<00:00,  9.12s/it]

[2025-05-15 22:35:54] Load weight end. type=LlamaForSequenceClassification, dtype=torch.bfloat16, avail mem=48.08 GB, mem usage=28.32 GB.
[2025-05-15 22:35:54] KV Cache is allocated. #tokens: 20480, K size: 1.25 GB, V size: 1.25 GB
[2025-05-15 22:35:54] Memory pool end. avail mem=45.29 GB
2025-05-15 22:35:54,214 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[2025-05-15 22:35:54] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=131072
[2025-05-15 22:35:55] INFO:     Started server process [64904]
[2025-05-15 22:35:55] INFO:     Waiting for application startup.
[2025-05-15 22:35:55] INFO:     Application startup complete.
[2025-05-15 22:35:55] INFO:     Uvicorn running on http://0.0.0.0:30319 (Press CTRL+C to quit)
[2025-05-15 22:35:55] INFO:     127.0.0.1:56158 - "GET /v1/models HTTP/1.1" 200 OK
[2025-05-15 22:35:56] INFO:     127.0.0.1:56162 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-05-15 22:35:56] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
2025-05-15 22:35:56,700 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90

[15]:

from transformers import AutoTokenizer

PROMPT = (
    "What is the range of the numeric output of a sigmoid node in a neural network?"
)

RESPONSE1 = "The output of a sigmoid node is bounded between -1 and 1."
RESPONSE2 = "The output of a sigmoid node is bounded between 0 and 1."

CONVS = [
    [{"role": "user", "content": PROMPT}, {"role": "assistant", "content": RESPONSE1}],
    [{"role": "user", "content": PROMPT}, {"role": "assistant", "content": RESPONSE2}],
]

tokenizer = AutoTokenizer.from_pretrained("Skywork/Skywork-Reward-Llama-3.1-8B-v0.2")
prompts = tokenizer.apply_chat_template(CONVS, tokenize=False)

url = f"https://:{port}/classify"
data = {"model": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2", "text": prompts}

responses = requests.post(url, json=data).json()
for response in responses:
    print_highlight(f"reward: {response['embedding'][0]}")

2025-05-15 22:36:47,588 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-05-15 22:36:47,595 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-15 22:36:47,618 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[2025-05-15 22:36:47] Prefill batch. #new-seq: 2, #new-token: 136, #cached-token: 2, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:36:47] INFO:     127.0.0.1:56164 - "POST /encode HTTP/1.1" 200 OK
[2025-05-15 22:36:47] The server is fired up and ready to roll!
2025-05-15 22:36:47,924 - INFO - flashinfer.jit: Loading JIT ops: cascade
2025-05-15 22:37:03,964 - INFO - flashinfer.jit: Finished loading JIT ops: cascade
[2025-05-15 22:37:03] INFO:     127.0.0.1:51290 - "POST /classify HTTP/1.1" 200 OK

reward: -24.125

reward: 1.0625

[16]:

terminate_process(reward_process)

捕获 MoE 模型中的专家选择分布#

SGLang Runtime 支持记录 MoE 模型运行中每个专家被选中的次数。这对于分析模型的吞吐量和规划优化很有用。

注意：为了提高可读性，我们只打印下面 csv 的前 10 行。如果你想深入分析结果，请相应调整。

[17]:

expert_record_server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path Qwen/Qwen1.5-MoE-A2.7B --host 0.0.0.0"
)

wait_for_server(f"https://:{port}")

[2025-05-15 22:37:10] server_args=ServerArgs(model_path='Qwen/Qwen1.5-MoE-A2.7B', tokenizer_path='Qwen/Qwen1.5-MoE-A2.7B', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='Qwen/Qwen1.5-MoE-A2.7B', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, host='0.0.0.0', port=33938, mem_fraction_static=0.88, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=123970934, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=None, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, warmups=None, moe_dense_tp_size=None, n_share_experts_fusion=0, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, mm_attention_backend=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, disaggregation_transfer_backend='mooncake', disaggregation_ib_device=None, pdlb_url=None)
[2025-05-15 22:37:17] Attention backend not set. Use flashinfer backend by default.
[2025-05-15 22:37:17] Init torch distributed begin.
[2025-05-15 22:37:18] Init torch distributed ends. mem usage=0.00 GB
[2025-05-15 22:37:18] Load weight begin. avail mem=61.91 GB
[2025-05-15 22:37:21] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/8 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  12% Completed | 1/8 [00:00<00:03,  2.00it/s]
Loading safetensors checkpoint shards:  25% Completed | 2/8 [00:06<00:23,  3.86s/it]
Loading safetensors checkpoint shards:  38% Completed | 3/8 [00:12<00:23,  4.71s/it]
Loading safetensors checkpoint shards:  50% Completed | 4/8 [00:19<00:22,  5.53s/it]
Loading safetensors checkpoint shards:  62% Completed | 5/8 [00:26<00:18,  6.02s/it]
Loading safetensors checkpoint shards:  75% Completed | 6/8 [00:33<00:12,  6.32s/it]
Loading safetensors checkpoint shards:  88% Completed | 7/8 [00:38<00:06,  6.09s/it]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:44<00:00,  5.96s/it]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:44<00:00,  5.54s/it]

[2025-05-15 22:38:06] Load weight end. type=Qwen2MoeForCausalLM, dtype=torch.bfloat16, avail mem=10.19 GB, mem usage=51.72 GB.
[2025-05-15 22:38:06] max_total_tokens=20480 is larger than the profiled value 15085. Use the profiled value instead.
[2025-05-15 22:38:06] KV Cache is allocated. #tokens: 15085, K size: 1.38 GB, V size: 1.38 GB
[2025-05-15 22:38:06] Memory pool end. avail mem=7.31 GB
2025-05-15 22:38:06,703 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[2025-05-15 22:38:07] max_total_num_tokens=15085, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=8192
[2025-05-15 22:38:07] INFO:     Started server process [73009]
[2025-05-15 22:38:07] INFO:     Waiting for application startup.
[2025-05-15 22:38:07] INFO:     Application startup complete.
[2025-05-15 22:38:07] INFO:     Uvicorn running on http://0.0.0.0:33938 (Press CTRL+C to quit)
[2025-05-15 22:38:08] INFO:     127.0.0.1:45924 - "GET /v1/models HTTP/1.1" 200 OK
[2025-05-15 22:38:08] INFO:     127.0.0.1:45932 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-05-15 22:38:08] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
2025-05-15 22:38:09,620 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-05-15 22:38:09,644 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-05-15 22:38:09,655 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-05-15 22:38:09,675 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[2025-05-15 22:38:09] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /public_sglang_ci/runner-kd-gpu-1/_work/sglang/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/E=60,N=1408,device_name=NVIDIA_H100_80GB_HBM3.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
2025-05-15 22:38:10,213 - INFO - flashinfer.jit: Loading JIT ops: batch_decode_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False

[18]:

response = requests.post(f"https://:{port}/start_expert_distribution_record")
print_highlight(response)

url = f"https://:{port}/generate"
data = {"text": "What is the capital of France?"}

response = requests.post(url, json=data)
print_highlight(response.json())

response = requests.post(f"https://:{port}/stop_expert_distribution_record")
print_highlight(response)

response = requests.post(f"https://:{port}/dump_expert_distribution_record")
print_highlight(response)

import glob

output_file = glob.glob("expert_distribution_*.csv")[0]
with open(output_file, "r") as f:
    print_highlight("\n| Layer ID | Expert ID | Count |")
    print_highlight("|----------|-----------|--------|")
    next(f)
    for i, line in enumerate(f):
        if i < 9:
            layer_id, expert_id, count = line.strip().split(",")
            print_highlight(f"| {layer_id:8} | {expert_id:9} | {count:6} |")

2025-05-15 22:38:25,779 - INFO - flashinfer.jit: Finished loading JIT ops: batch_decode_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False
[2025-05-15 22:38:25] Resetting expert distribution record...
[2025-05-15 22:38:25] INFO:     127.0.0.1:47982 - "POST /start_expert_distribution_record HTTP/1.1" 200 OK

[2025-05-15 22:38:25] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0
[2025-05-15 22:38:26] INFO:     127.0.0.1:45938 - "POST /generate HTTP/1.1" 200 OK
[2025-05-15 22:38:26] The server is fired up and ready to roll!
[2025-05-15 22:38:27] Decode batch. #running-req: 1, #token: 46, token usage: 0.00, cuda graph: False, gen throughput (token/s): 2.22, #queue-req: 0
[2025-05-15 22:38:29] Decode batch. #running-req: 1, #token: 86, token usage: 0.01, cuda graph: False, gen throughput (token/s): 28.68, #queue-req: 0
[2025-05-15 22:38:30] Decode batch. #running-req: 1, #token: 126, token usage: 0.01, cuda graph: False, gen throughput (token/s): 28.81, #queue-req: 0
[2025-05-15 22:38:30] INFO:     127.0.0.1:35202 - "POST /generate HTTP/1.1" 200 OK

{'text': " Options: - wine - diplomat - paris france - posted - statue of liberty Let's do it gradually: Paris France is the capital city of France. Paris is also known as the city of love. Paris is also famous for the Eiffel tower.... So the answer is paris france. What does a person skip when they arelazy? Options: - golf - feel tired - exercise - tiredness - walk Let's do it gradually: When people are lazy, they do not do anything that would tire them out. Walking uses body energy and thinking also uses brain energy. Lazy people don't do anything that would make them tired.... So", 'meta_info': {'id': '3d0c2971cfca47f6a94edff5ec15821b', 'finish_reason': {'type': 'length', 'length': 128}, 'prompt_tokens': 7, 'completion_tokens': 128, 'cached_tokens': 0, 'e2e_latency': 5.14442777633667}}

[2025-05-15 22:38:30] INFO:     127.0.0.1:48412 - "POST /stop_expert_distribution_record HTTP/1.1" 200 OK

[2025-05-15 22:38:30] Resetting expert distribution record...
[2025-05-15 22:38:30] INFO:     127.0.0.1:48420 - "POST /dump_expert_distribution_record HTTP/1.1" 200 OK

| Layer ID | Expert ID | Count |

|----------|-----------|--------|

| 0 | 0 | 44 |

| 0 | 26 | 14 |

| 0 | 33 | 14 |

| 0 | 51 | 10 |

| 0 | 10 | 2 |

| 0 | 9 | 6 |

| 0 | 1 | 12 |

| 0 | 24 | 14 |

| 0 | 57 | 4 |

[19]:

terminate_process(expert_record_server_process)

SGLang 原生 API

目录