Reasoning Parser#

SGLang supports parsing reasoning content out of "normal" content for reasoning models such as DeepSeek R1.

Supported Models & Parsers#

| Model | Reasoning tags | Parser | Notes |
| --- | --- | --- | --- |
| DeepSeek‑R1 series | <think></think> | deepseek-r1 | Supports all variants (R1, R1-0528, R1-Distill) |
| DeepSeek‑V3 series | <think></think> | deepseek-v3 | Includes DeepSeek‑V3.2; supports the thinking parameter |
| Standard Qwen3 models | <think></think> | qwen3 | Supports the enable_thinking parameter |
| Qwen3-Thinking models | <think></think> | qwen3 or qwen3-thinking | Always generates reasoning content |
| Kimi models | ◁think▷◁/think▷ | kimi | Uses special reasoning delimiters |
| GPT OSS | <\|channel\|>analysis<\|message\|><\|end\|> | gpt-oss | |

Model-Specific Behaviors#

DeepSeek-R1 Family

  • DeepSeek-R1: emits no opening <think> tag and jumps directly into the reasoning content

  • DeepSeek-R1-0528: generates both the opening <think> and closing </think> tags

  • Both are handled by the same deepseek-r1 parser, as the sketch below illustrates
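The following sketch shows this reasoning-first behavior using the ReasoningParser helper demonstrated later in this document; the input string and the expected outputs in the comments are illustrative only.

from sglang.srt.parser.reasoning_parser import ReasoningParser

# The deepseek-r1 parser treats model output as reasoning-first, so text
# with no opening <think> tag is still split at the closing </think> tag.
parser = ReasoningParser("deepseek-r1")
reasoning_text, text = parser.parse_non_stream(
    "I need to compute 1 + 3.</think>The answer is 4."
)
print(reasoning_text)  # I need to compute 1 + 3.
print(text)  # The answer is 4.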

DeepSeek-V3 Family

  • DeepSeek-V3.1/V3.2: hybrid models supporting both reasoning and non-reasoning modes; use the deepseek-v3 parser with the thinking parameter (note: not enable_thinking)

Qwen3 Family

  • Standard Qwen3 (e.g., Qwen3-2507): uses the qwen3 parser; supports enable_thinking in the chat template (see the request sketch at the end of this section)

  • Qwen3-Thinking (e.g., Qwen3-235B-A22B-Thinking-2507): uses the qwen3 or qwen3-thinking parser; always reasons

Kimi

  • Kimi: uses the special ◁think▷◁/think▷ tags

GPT OSS

  • GPT OSS: uses the special <|channel|>analysis<|message|><|end|> tags
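
To illustrate the thinking/enable_thinking parameters mentioned above, the request below passes them through chat_template_kwargs on the OpenAI-compatible API. This is a minimal sketch, assuming a Qwen3 model already served on port 30000 with --reasoning-parser qwen3; the chat_template_kwargs pass-through may vary between SGLang versions.

from openai import OpenAI

# Assumptions: a Qwen3 model is served on port 30000 with --reasoning-parser qwen3.
client = OpenAI(api_key="None", base_url="http://0.0.0.0:30000/v1")
response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{"role": "user", "content": "What is 1+3?"}],
    extra_body={
        # Qwen3 uses enable_thinking; DeepSeek-V3.1/V3.2 uses thinking instead.
        "chat_template_kwargs": {"enable_thinking": True},
        "separate_reasoning": True,
    },
)
print(response.choices[0].message.reasoning_content)
print(response.choices[0].message.content)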

Usage#

Launch the Server#

Specify the --reasoning-parser option.

[1]:
import requests
from openai import OpenAI
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process

server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --host 0.0.0.0 --reasoning-parser deepseek-r1 --log-level warning"
)

wait_for_server(f"https://:{port}")
[2025-12-30 02:21:15] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:21:15] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:21:15] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:21:21] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:21:21] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:21:21] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:21:24] INFO server_args.py:1564: Attention backend not specified. Use fa3 backend by default.
[2025-12-30 02:21:24] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:21:30] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:21:30] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:21:30] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:21:30] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:21:30] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:21:30] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:21:36] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.34s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.39s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.38s/it]

Capturing batches (bs=1 avail_mem=62.64 GB): 100%|██████████| 3/3 [00:00<00:00,  9.87it/s]


Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and the notebook code together, so their outputs are combined.
To improve clarity, server logs are shown in plain black, while notebook outputs are highlighted in blue.
To shorten the logs, we set the server's log level to warning; the default level is info.
We run these notebooks in a CI environment, so the throughput is not representative of actual performance.

Note that --reasoning-parser defines the parser used to interpret the responses.

OpenAI Compatible API#

When using the OpenAI-compatible API, the protocol follows the DeepSeek API design established with the release of DeepSeek-R1:

  • reasoning_content: the content of the chain of thought (CoT).

  • content: the content of the final answer.

[2]:
# Initialize OpenAI-like client
client = OpenAI(api_key="None", base_url=f"http://0.0.0.0:{port}/v1")
model_name = client.models.list().data[0].id

messages = [
    {
        "role": "user",
        "content": "What is 1+3?",
    }
]

Non-Streaming Request#

[3]:
response_non_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.6,
    top_p=0.95,
    stream=False,  # Non-streaming
    extra_body={"separate_reasoning": True},
)
print_highlight("==== Reasoning ====")
print_highlight(response_non_stream.choices[0].message.reasoning_content)

print_highlight("==== Text ====")
print_highlight(response_non_stream.choices[0].message.content)
==== Reasoning ====
First, I recognize that the question asks for the sum of 1 and 3.

Next, I add the two numbers together.

Finally, I calculate the result, finding that 1 plus 3 equals 4.
==== Text ====
Sure! Let's solve the problem step by step.

**Question:** What is \(1 + 3\)?

**Solution:**

1. **Start with the first number:**
\[
1
\]

2. **Add the second number:**
\[
1 + 3
\]

3. **Calculate the sum:**
\[
1 + 3 = 4
\]

**Final Answer:**
\[
\boxed{4}
\]

Streaming Request#

[4]:
response_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.6,
    top_p=0.95,
    stream=True,  # Streaming
    extra_body={"separate_reasoning": True},
)

reasoning_content = ""
content = ""
for chunk in response_stream:
    if chunk.choices[0].delta.content:
        content += chunk.choices[0].delta.content
    if chunk.choices[0].delta.reasoning_content:
        reasoning_content += chunk.choices[0].delta.reasoning_content

print_highlight("==== Reasoning ====")
print_highlight(reasoning_content)

print_highlight("==== Text ====")
print_highlight(content)
==== Reasoning ====
First, I recognize that the question asks for the sum of 1 and 3.

Next, I add the two numbers: 1 plus 3 equals 4.

Therefore, the final answer is 4.
==== Text ====


**Solution:**

We need to find the sum of 1 and 3.

1. **Add the numbers:**

\[
1 + 3 = 4
\]

2. **Final Answer:**

\[
\boxed{4}
\]

(Optional) You can buffer the reasoning content until the last reasoning chunk (or the first chunk after the reasoning content).

[5]:
response_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.6,
    top_p=0.95,
    stream=True,  # Streaming
    extra_body={"separate_reasoning": True, "stream_reasoning": False},
)

reasoning_content = ""
content = ""
for chunk in response_stream:
    if chunk.choices[0].delta.content:
        content += chunk.choices[0].delta.content
    if chunk.choices[0].delta.reasoning_content:
        reasoning_content += chunk.choices[0].delta.reasoning_content

print_highlight("==== Reasoning ====")
print_highlight(reasoning_content)

print_highlight("==== Text ====")
print_highlight(content)
==== Reasoning ====
First, I identify the two numbers in the problem: 1 and 3.

Next, I perform the addition by combining the two numbers.

Finally, I calculate the sum to obtain the result.
==== Text ====


Sure! Let's solve the problem step by step.

**Question:** What is \(1 + 3\)?

**Solution:**

1. **Identify the numbers to add:**
\[
1 \quad \text{and} \quad 3
\]

2. **Perform the addition:**
\[
1 + 3 = 4
\]

**Answer:**
\[
\boxed{4}
\]

Reasoning separation is enabled by default when a reasoning parser is specified. To disable it, set the ``separate_reasoning`` option to ``False`` in the request.

[6]:
response_non_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.6,
    top_p=0.95,
    stream=False,  # Non-streaming
    extra_body={"separate_reasoning": False},
)

print_highlight("==== Original Output ====")
print_highlight(response_non_stream.choices[0].message.content)
==== Original Output ====
First, I recognize that the question asks for the sum of the numbers 1 and 3.

Next, I add the two numbers together to get the total.

Finally, I conclude that the result of 1 plus 3 is 4.


**Solution:**

We need to calculate the sum of the numbers 1 and 3.

\[
1 + 3 = 4
\]

Therefore, the final answer is \(\boxed{4}\).

SGLang Native API#

[7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
input = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, return_dict=False
)

gen_url = f"https://:{port}/generate"
gen_data = {
    "text": input,
    "sampling_params": {
        "skip_special_tokens": False,
        "max_new_tokens": 1024,
        "temperature": 0.6,
        "top_p": 0.95,
    },
}
gen_response = requests.post(gen_url, json=gen_data).json()["text"]

print_highlight("==== Original Output ====")
print_highlight(gen_response)

parse_url = f"https://:{port}/separate_reasoning"
separate_reasoning_data = {
    "text": gen_response,
    "reasoning_parser": "deepseek-r1",
}
separate_reasoning_response_json = requests.post(
    parse_url, json=separate_reasoning_data
).json()
print_highlight("==== Reasoning ====")
print_highlight(separate_reasoning_response_json["reasoning_text"])
print_highlight("==== Text ====")
print_highlight(separate_reasoning_response_json["text"])
==== Original Output ====
First, I recognize that the question asks for the sum of 1 and 3.

Next, I add the two numbers together to get the total.

Finally, I conclude that the result of 1 plus 3 is 4.


Sure! Let's solve the problem step by step.

**Question:** What is \(1 + 3\)?

**Solution:**

1. **Start with the first number:**
\(1\)

2. **Add the second number:**
\(1 + 3\)

3. **Calculate the sum:**
\(1 + 3 = 4\)

**Answer:**
\(\boxed{4}\)
==== Reasoning ====
First, I recognize that the question asks for the sum of 1 and 3.

Next, I add the two numbers together to get the total.

Finally, I conclude that the result of 1 plus 3 is 4.
==== Text ====
Sure! Let's solve the problem step by step.

**Question:** What is \(1 + 3\)?

**Solution:**

1. **Start with the first number:**
\(1\)

2. **Add the second number:**
\(1 + 3\)

3. **Calculate the sum:**
\(1 + 3 = 4\)

**Answer:**
\(\boxed{4}\)
[8]:
terminate_process(server_process)

Offline Engine API#

[9]:
import sglang as sgl
from sglang.srt.parser.reasoning_parser import ReasoningParser
from sglang.utils import print_highlight

llm = sgl.Engine(model_path="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
input = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, return_dict=False
)
sampling_params = {
    "max_new_tokens": 1024,
    "skip_special_tokens": False,
    "temperature": 0.6,
    "top_p": 0.95,
}
result = llm.generate(prompt=input, sampling_params=sampling_params)

generated_text = result["text"]  # Assume there is only one prompt

print_highlight("==== Original Output ====")
print_highlight(generated_text)

parser = ReasoningParser("deepseek-r1")
reasoning_text, text = parser.parse_non_stream(generated_text)
print_highlight("==== Reasoning ====")
print_highlight(reasoning_text)
print_highlight("==== Text ====")
print_highlight(text)
[2025-12-30 02:21:55] INFO server_args.py:1564: Attention backend not specified. Use fa3 backend by default.
[2025-12-30 02:21:55] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:21:55] INFO engine.py:153: server_args=ServerArgs(model_path='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B', tokenizer_path='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.835, max_running_requests=128, max_queued_requests=None, max_total_tokens=20480, chunked_prefill_size=8192, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=1, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, stream_output=False, random_seed=352716589, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, custom_sigquit_handler=None, log_level='error', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, served_model_name='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='fa3', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', nsa_prefill_backend='flashmla_sparse', 
nsa_decode_backend='fa3', disable_flashinfer_autotune=False, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method=None, kt_cpuinfer=None, kt_threadpool_count=None, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=4, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, 
piecewise_cuda_graph_max_tokens=8192, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096, 4352, 4608, 4864, 5120, 5376, 5632, 5888, 6144, 6400, 6656, 6912, 7168, 7424, 7680, 7936, 8192], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, enable_fused_qk_norm_rope=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, disaggregation_decode_enable_fake_auto=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.66s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00,  1.54s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00,  1.55s/it]

Capturing batches (bs=1 avail_mem=43.72 GB): 100%|██████████| 20/20 [00:01<00:00, 18.22it/s]
==== Original Output ====
I need to calculate the sum of 1 plus 3.

First, I identify the two numbers involved: 1 and 3.

Next, I add these numbers together.

Finally, the sum of 1 plus 3 is 4.


Sure! Let's solve the problem step by step.

**Question:** What is \(1 + 3\)?

**Solution:**

1. **Identify the numbers to add:**
\[
1 \quad \text{and} \quad 3
\]

2. **Add the numbers:**
\[
1 + 3 = 4
\]

**Final Answer:**
\[
\boxed{4}
\]
==== Reasoning ====
I need to calculate the sum of 1 plus 3.

First, I identify the two numbers involved: 1 and 3.

Next, I add these numbers together.

Finally, the sum of 1 plus 3 is 4.
==== Text ====
Sure! Let's solve the problem step by step.

**Question:** What is \(1 + 3\)?

**Solution:**

1. **Identify the numbers to add:**
\[
1 \quad \text{and} \quad 3
\]

2. **Add the numbers:**
\[
1 + 3 = 4
\]

**Final Answer:**
\[
\boxed{4}
\]
[10]:
llm.shutdown()

Supporting New Reasoning Model Schemas#

For future reasoning models, you can implement a reasoning parser as a subclass of BaseReasoningFormatDetector in python/sglang/srt/parser/reasoning_parser.py (the module imported earlier in this document) and specify that reasoning parser for the new reasoning model schema accordingly.
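
The sketch below outlines such a subclass. It is a minimal sketch, assuming the BaseReasoningFormatDetector constructor takes the opening and closing reasoning tokens plus force_reasoning/stream_reasoning flags, and that ReasoningParser keeps a DetectorMap registry of parser names; check the source file for the exact signatures in your version. The <reason> tags and the "my-model" parser name are hypothetical.

from sglang.srt.parser.reasoning_parser import (
    BaseReasoningFormatDetector,
    ReasoningParser,
)


class MyModelDetector(BaseReasoningFormatDetector):
    # Hypothetical detector for a model that wraps its chain of thought
    # in <reason> ... </reason> tags.
    def __init__(self, stream_reasoning: bool = True, force_reasoning: bool = False):
        # Assumed base-class signature: opening token, closing token,
        # plus force_reasoning / stream_reasoning flags.
        super().__init__(
            "<reason>",
            "</reason>",
            force_reasoning=force_reasoning,
            stream_reasoning=stream_reasoning,
        )


# Assumed registration point: ReasoningParser.DetectorMap maps parser names
# (as passed to --reasoning-parser) to detector classes.
ReasoningParser.DetectorMap["my-model"] = MyModelDetector

parser = ReasoningParser("my-model")
reasoning_text, text = parser.parse_non_stream(
    "<reason>Add 1 and 3.</reason>The answer is 4."
)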