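Offline Engine API#
SGLang provides a direct inference engine that does not require an HTTP server. It is especially useful when an additional HTTP server would add unnecessary complexity or overhead. Two common use cases are:
- Offline batch inference
- Building a custom server on top of the engine
This document focuses on offline batch inference and demonstrates four inference modes:
- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation
You can also build a custom server on top of the SGLang offline engine. A detailed example that runs as a Python script can be found in custom_server.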
Nest Asyncio#
Note that if you want to use the Offline Engine in IPython, or in any other code that already runs inside an event loop, you need to add the following code:
import nest_asyncio
nest_asyncio.apply()
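The need for this patch can be reproduced with the standard library alone: `asyncio.run` refuses to start while another event loop is already running in the same thread, which is the situation inside IPython or Jupyter. The sketch below triggers that error without SGLang; the names `inner` and `outer` are illustrative only and are not part of any library.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: expected to raise RuntimeError
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return None


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` first patches asyncio so that such nested calls are allowed.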
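Advanced Usage#
The engine supports VLM inference as well as extracting hidden states.
See the examples for more use cases.
Offline Batch Inference#
The SGLang offline engine supports batch inference with efficient scheduling.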
[1]:
# launch the offline engine
import asyncio
import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge
llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
Non-streaming Synchronous Generation#
[2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
# A list of prompts returns one output dict per prompt, in the same order.
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
===============================
Prompt: Hello, my name is
Generated text: David and I am a Master’s student at the University of Illinois at Urbana-Champaign. I am a Computer Science major and I am working on my thesis on the Cyber-Prediction of Contaminated Waterways. The cyber-attack on the Nemo National Wildlife Refuge (NWR) in 2013 was one of the largest cyber-attacks on a protected national park in the world. The attack had gone undetected for a long time, and the hackers were able to steal valuable datasets from the NWR’s GIS data management system. The success of the attack provided a new perspective on the need to build
===============================
Prompt: The president of the United States is
Generated text: selling tickets to the annual chess tournament. He has 250 tickets to sell. He wants to make sure that for every ticket he sells, he sells a ticket for a different category of tickets, namely a child ticket, a senior citizen ticket, and an adult ticket. If he sells 15 child tickets, 12 senior citizen tickets, and the rest are adult tickets, what is the total number of tickets he will sell?
To determine the total number of tickets the president of the United States will sell, we need to follow these steps:
1. Identify the total number of tickets sold.
2. Identify the number
===============================
Prompt: The capital of France is
Generated text: ______. A. Paris B. Nice C. Paris D. Nice
A. Paris
The capital of France is Paris, which is the largest city in France and the country's capital. Nice is the second largest city in France, located in the Loire Valley region. Paris is known for its rich history, art, and cuisine, while Nice is known for its romantic beaches and charming architecture. The other cities listed are smaller and are not the capital of France.
===============================
Prompt: The future of AI is
Generated text: here. It has entered the age of deep learning, and with the advancement of data and the development of AI, how does the future of AI affect the future of the world? We know that the future of AI is in the hands of its users, and they decide how it is used. It is important to develop ethical AI that helps make the world a better place. How do we get to that future? We need to focus on developing AI that is safe, effective, and sustainable. We need to learn from the past, embrace new technologies, and stay up-to-date with the latest developments. We need to prioritize user safety and
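For readers unfamiliar with the two keys in `sampling_params`, the following is a conceptual sketch of what temperature scaling and top-p (nucleus) filtering do to a next-token distribution. It is plain Python for illustration and is not SGLang's sampling implementation; the function names and the toy logits are assumptions.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (lower = sharper)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


toy_probs = [0.5, 0.3, 0.15, 0.05]
nucleus = top_p_filter(toy_probs, top_p=0.9)
sharp = apply_temperature([2.0, 1.0, 0.0], temperature=0.2)
smooth = apply_temperature([2.0, 1.0, 0.0], temperature=0.8)
```

In this toy case the least likely token is dropped by the top-p filter, and the lower temperature concentrates more probability on the top token.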
Streaming Synchronous Generation#
[3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]
sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}
print("\n=== Testing synchronous streaming generation with overlap removal ===\n")
for prompt in prompts:
    print(f"Prompt: {prompt}")
    # stream_and_merge consumes the stream and returns the merged text.
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
=== Testing synchronous streaming generation with overlap removal ===
Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text: [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? As an AI language model, I don't have a physical presence, but I can assist you with any questions or tasks you may have. How can I help you today? Let's get started! [Name] [Job Title] [Company Name] [Company Address] [Company Phone Number] [Company Email] [Company Website] [Company LinkedIn Profile] [Company Twitter Profile] [Company Facebook Profile] [Company GitHub Profile] [Company LinkedIn
Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text: Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Roman Empire and the Middle Ages. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also famous for its fashion industry, with Paris Fashion Week being one of the largest in the world. Paris is a popular tourist destination and a cultural hub for France. It is home to many world-renowned museums, art galleries, and theaters. The city is also known for its food scene, with many famous restaurants and cafes serving up delicious cuisine
Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text: likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:
1. Increased automation: AI is expected to become more and more integrated into various industries, from manufacturing to healthcare to transportation. This automation will likely lead to increased efficiency, reduced costs, and improved productivity.
2. AI ethics and privacy: As AI becomes more integrated into our daily lives, there will be a growing concern about its ethical implications and potential privacy risks. There will be a need for regulations and guidelines to ensure that AI is used in a responsible and ethical manner.
3
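The "overlap removal" mentioned in the cell above can be pictured as follows: when consecutive streamed chunks repeat part of the text that has already been received, only the new suffix of each chunk is appended. This is a minimal sketch of that idea under the assumption that overlaps occur at chunk boundaries; the helper names are hypothetical, the chunk strings are made up, and the actual `stream_and_merge` utility may work differently.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_chunks(["Hello", "lo wor", "world!"])
print(merged)
```

The quadratic suffix search is acceptable for short chunks; a production helper might bound the overlap window.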
Non-streaming Asynchronous Generation#
[4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
print("\n=== Testing asynchronous batch generation ===")
async def main():
    # Await the whole batch; outputs come back in prompt order.
    outputs = await llm.async_generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")
asyncio.run(main())
=== Testing asynchronous batch generation ===
Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text: [Name]. I'm a [Career/Activity] at [Company Name]. If you have any questions, feel free to reach out. 🎓💼💼 #SelfIntroduction
That's a great start! Can you tell me more about yourself and your career? What do you enjoy about your job? #JobSeeker
Absolutely! My name is [Name], and I'm a [Career/Activity] at [Company Name]. I'm passionate about [Your Career Goal]. My favorite part of my job is [Your Favorite Thing About Your Job], which makes me really happy every day. 🌟 #CareerAdvice
Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text: Paris, which is located in the northern part of the country. It is the largest city in France and the second largest city in Europe, after Madrid. The city is known for its rich history, beautiful architecture, and diverse cultural scene. Paris is home to numerous museums, art galleries, theaters, and restaurants, making it a popular tourist destination. The city is also home to the Eiffel Tower, the Louvre Museum, and the Sacré-Cœur Basilica, among other landmarks. In addition to its cultural attractions, Paris is also a major economic hub, with the headquarters of many large companies and institutions. The city is
Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text: vast and varied, with many potential trends that could shape its development and impact on society. Some possibilities include:
1. Increased automation and efficiency: As AI becomes more sophisticated, it is likely to be integrated into everyday life, from manufacturing to healthcare. This could lead to increased automation and efficiency, as machines can perform tasks that would previously require human intervention.
2. Enhanced human-computer interaction: AI is increasingly being used to simulate human behavior and emotions. This could lead to more natural and conversational interactions between humans and machines, as well as better understanding of human cognition.
3. Personalized and adaptive AI: AI is being used to
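Because `async_generate` is awaitable, it composes with ordinary asyncio tools. The sketch below shows the pattern of awaiting several batches together with `asyncio.gather`. To keep it runnable without a GPU it uses `FakeEngine`, a hypothetical stand-in that only mimics the call shape used above; how the real engine schedules concurrent calls is not shown here and should be treated as an assumption.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for an engine; NOT part of SGLang."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return [{"text": f"<completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Start one awaitable per batch and wait for all of them together.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["The capital of France is"],
    ["The future of AI is", "Hello, my name is"],
]
results = asyncio.run(
    run_batches(FakeEngine(), batches, {"temperature": 0.8, "top_p": 0.95})
)
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```

`asyncio.gather` returns results in the order the awaitables were passed, so batches and outputs stay aligned.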
Streaming Asynchronous Generation#
[5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
print("\n=== Testing asynchronous streaming generation (no repeats) ===")
async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)
        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)
        print()  # New line after each prompt
asyncio.run(main())
=== Testing asynchronous streaming generation (no repeats) ===
Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text: [Your Name], and I'm a [Job Title] at [Company Name]. I'm always looking for ways to grow and learn, and I'm passionate about being a part of a team that values diversity and inclusion. My core values are empathy, creativity, and a love for music. I believe that through our work together, we can inspire people to overcome obstacles and achieve great things. How can I help you today? [Your Name] is looking to make a difference in [Company Name] through your skills and expertise. [Your Name] is excited to learn more about [Company Name] and find out more about [Your
Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text: Paris.
Paris is the largest city in France and its capital. The city has a rich history dating back to ancient times and is home to many landmarks such as the Eiffel Tower, the Louvre Museum, the Notre-Dame Cathedral, and the Palace of Versailles. The city is also known for its diverse culture, cuisine, and vibrant nightlife. Paris is a UNESCO World Heritage site and a major tourist destination worldwide. It's a cultural and historical melting pot that hosts numerous international events and festivals throughout the year. According to the latest statistics, Paris has a population of over 2.8 million people. It's also considered
Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text: likely to be shaped by the rapid pace of technological advancement and the ongoing integration of various technologies. Some potential future trends in AI include:
1. Increased integration of AI with other technologies: The integration of AI with other technologies, such as voice recognition and natural language processing, is expected to accelerate in the coming years. This integration will enable AI systems to perform a wider range of tasks, including speech recognition, language translation, and facial recognition.
2. AI-driven automation: The growth of AI has already transformed many industries, such as manufacturing and transportation. The integration of AI with automation technologies is likely to continue, enabling machines to perform tasks that
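The loop above streams the prompts one after another. If several streams should be consumed at the same time, one possible pattern is to wrap each `async for` loop in its own coroutine and gather them. In this sketch `fake_stream` is a hypothetical async generator standing in for `async_stream_and_merge`, so the example runs without an engine; the chunk contents are placeholders.

```python
import asyncio


async def fake_stream(prompt, n_chunks=3):
    """Hypothetical stand-in for a streaming call; yields placeholder chunks."""
    for i in range(n_chunks):
        await asyncio.sleep(0.01)  # simulate waiting for the next chunk
        yield f"[{prompt}:{i}]"


async def collect(prompt):
    # Consume one stream to completion and return the joined text.
    parts = []
    async for chunk in fake_stream(prompt):
        parts.append(chunk)
    return prompt, "".join(parts)


async def stream_all(prompts):
    # Each stream runs in its own coroutine, so their waits interleave.
    pairs = await asyncio.gather(*(collect(p) for p in prompts))
    return dict(pairs)


texts = asyncio.run(stream_all(["a", "b"]))
print(texts)
```

Printing chunks live from concurrent streams would interleave their output, which is why this sketch buffers each stream before returning it.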
[6]:
llm.shutdown()