OpenAI API - Completions#

SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models. For a complete reference, see the OpenAI API Reference.

This tutorial covers the following popular APIs:

  • chat/completions

  • completions

  • batches

Check out other tutorials to learn about the vision APIs for vision-language models and the embedding APIs for embedding models.

Launch A Server#

Launch the server in your terminal and wait for it to initialize.

[1]:
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

from sglang.utils import wait_for_server, print_highlight, terminate_process


server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --mem-fraction-static 0.8"
)

wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")
[2025-05-15 22:38:46] server_args=ServerArgs(model_path='qwen/qwen2.5-0.5b-instruct', tokenizer_path='qwen/qwen2.5-0.5b-instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='qwen/qwen2.5-0.5b-instruct', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, host='0.0.0.0', port=34325, mem_fraction_static=0.8, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=395425750, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=None, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, warmups=None, moe_dense_tp_size=None, n_share_experts_fusion=0, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, mm_attention_backend=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, disaggregation_transfer_backend='mooncake', disaggregation_ib_device=None, pdlb_url=None)
[2025-05-15 22:38:55] Attention backend not set. Use fa3 backend by default.
[2025-05-15 22:38:55] Init torch distributed begin.
[2025-05-15 22:38:55] Init torch distributed ends. mem usage=0.00 GB
[2025-05-15 22:38:55] Load weight begin. avail mem=37.01 GB
[2025-05-15 22:38:57] Using model weights format ['*.safetensors']
[2025-05-15 22:38:57] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.11it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.10it/s]

[2025-05-15 22:38:58] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=35.95 GB, mem usage=1.06 GB.
[2025-05-15 22:38:58] KV Cache is allocated. #tokens: 20480, K size: 0.12 GB, V size: 0.12 GB
[2025-05-15 22:38:58] Memory pool end. avail mem=35.54 GB
[2025-05-15 22:38:59] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=32768
[2025-05-15 22:39:00] INFO:     Started server process [76120]
[2025-05-15 22:39:00] INFO:     Waiting for application startup.
[2025-05-15 22:39:00] INFO:     Application startup complete.
[2025-05-15 22:39:00] INFO:     Uvicorn running on http://0.0.0.0:34325 (Press CTRL+C to quit)
[2025-05-15 22:39:00] INFO:     127.0.0.1:36378 - "GET /v1/models HTTP/1.1" 200 OK
[2025-05-15 22:39:01] INFO:     127.0.0.1:51360 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-05-15 22:39:01] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:39:02] INFO:     127.0.0.1:51374 - "POST /generate HTTP/1.1" 200 OK
[2025-05-15 22:39:02] The server is fired up and ready to roll!


Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are shown in their original black, while the notebook outputs are highlighted in blue.
We are running these notebooks in a CI parallel environment, so the throughput is not representative of actual performance.
Server started on http://localhost:34325

Chat Completions#

Usage#

The server fully implements the OpenAI API. It will automatically apply the chat template if one is specified in the Hugging Face tokenizer. You can also specify a custom chat template with --chat-template when launching the server.
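For illustration, here is a minimal launch sketch with a custom template. The file path below is hypothetical; --chat-template also accepts the name of a template registered with SGLang, and the exact file formats accepted are described in SGLang's custom chat template docs.

# Sketch: launch the server with a custom chat template.
# "./my_chat_template.jinja" is a placeholder path, not a file shipped with SGLang.
server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server"
    " --model-path qwen/qwen2.5-0.5b-instruct"
    " --chat-template ./my_chat_template.jinja"
    " --host 0.0.0.0"
)
wait_for_server(f"http://localhost:{port}")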

[2]:
import openai

client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")
[2025-05-15 22:39:05] Prefill batch. #new-seq: 1, #new-token: 37, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:39:05] Decode batch. #running-req: 1, #token: 70, token usage: 0.00, cuda graph: False, gen throughput (token/s): 6.36, #queue-req: 0
[2025-05-15 22:39:06] INFO:     127.0.0.1:51376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Response: ChatCompletion(id='3fbe235868f14f31b010300d21ebfbea', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Sure, here are three countries and their respective capitals:\n\n1. **United States** - Washington, D.C.\n2. **Canada** - Ottawa\n3. **Australia** - Canberra', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=151645)], created=1747348745, model='qwen/qwen2.5-0.5b-instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=39, prompt_tokens=37, total_tokens=76, completion_tokens_details=None, prompt_tokens_details=None))

Parameters#

The chat completions API accepts the parameters of the OpenAI Chat Completions API. Refer to the OpenAI Chat Completions API for more details.

SGLang extends the standard API with the extra_body parameter, allowing for additional customization. One key option within extra_body is chat_template_kwargs, which can be used to pass arguments to the chat template processor.

Enabling Model Thinking/Reasoning#

You can use chat_template_kwargs to enable or disable the output of the model's internal thinking or reasoning process. Setting "enable_thinking": True in chat_template_kwargs includes the reasoning steps in the response. This requires launching the server with a compatible reasoning parser (e.g., --reasoning-parser qwen3 for Qwen3 models).

Here is an example demonstrating how to enable thinking and retrieve the reasoning content separately (using separate_reasoning: True):

# Ensure the server is launched with a compatible reasoning parser, e.g.:
# python3 -m sglang.launch_server --model-path QwQ/Qwen3-32B-250415 --reasoning-parser qwen3 ...

from openai import OpenAI

# Modify OpenAI's API key and API base to use SGLang's API server.
openai_api_key = "EMPTY"
openai_api_base = f"http://127.0.0.1:{port}/v1" # Use the correct port

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "QwQ/Qwen3-32B-250415" # Use the model loaded by the server
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "chat_template_kwargs": {"enable_thinking": True},
        "separate_reasoning": True
    }
)

print("response.choices[0].message.reasoning_content: \n", response.choices[0].message.reasoning_content)
print("response.choices[0].message.content: \n", response.choices[0].message.content)

Example output:

response.choices[0].message.reasoning_content:
 Okay, so I need to figure out which number is greater between 9.11 and 9.8. Hmm, let me think. Both numbers start with 9, right? So the whole number part is the same. That means I need to look at the decimal parts to determine which one is bigger.
...
Therefore, after checking multiple methods—aligning decimals, subtracting, converting to fractions, and using a real-world analogy—it's clear that 9.8 is greater than 9.11.

response.choices[0].message.content:
 To determine which number is greater between **9.11** and **9.8**, follow these steps:
...
**Answer**:
9.8 is greater than 9.11.

设置 "enable_thinking": False(或省略)将导致 reasoning_contentNone

Here is an example of a detailed chat completion request using standard OpenAI parameters:

[3]:
response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a knowledgeable historian who provides concise responses.",
        },
        {"role": "user", "content": "Tell me about ancient Rome"},
        {
            "role": "assistant",
            "content": "Ancient Rome was a civilization centered in Italy.",
        },
        {"role": "user", "content": "What were their major achievements?"},
    ],
    temperature=0.3,  # Lower temperature for more focused responses
    max_tokens=128,  # Reasonable length for a concise response
    top_p=0.95,  # Slightly higher for better fluency
    presence_penalty=0.2,  # Mild penalty to avoid repetition
    frequency_penalty=0.2,  # Mild penalty for more natural language
    n=1,  # Single response is usually more stable
    seed=42,  # Keep for reproducibility
)

print_highlight(response.choices[0].message.content)
[2025-05-15 22:39:06] Prefill batch. #new-seq: 1, #new-token: 49, #cached-token: 5, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:39:06] Decode batch. #running-req: 1, #token: 88, token usage: 0.00, cuda graph: False, gen throughput (token/s): 74.67, #queue-req: 0
[2025-05-15 22:39:06] Decode batch. #running-req: 1, #token: 128, token usage: 0.01, cuda graph: False, gen throughput (token/s): 84.66, #queue-req: 0
[2025-05-15 22:39:07] Decode batch. #running-req: 1, #token: 168, token usage: 0.01, cuda graph: False, gen throughput (token/s): 83.24, #queue-req: 0
[2025-05-15 22:39:07] INFO:     127.0.0.1:51376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Ancient Rome was a major civilization that played an important role in the development of Western civilization. Some of their major achievements include:

1. The construction of the Colosseum: built in 70 AD, this amphitheater was one of the largest and most impressive structures in the world at the time.

2. The construction of the Pantheon: built in 206 AD, this temple is considered one of the finest examples of Roman architecture.

3. The construction of aqueducts: these were essential to Rome's water supply and were built to carry water from the mountains to the city.

4.

Streaming mode is also supported.

[4]:
stream = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
[2025-05-15 22:39:07] INFO:     127.0.0.1:51376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-05-15 22:39:07] Prefill batch. #new-seq: 1, #new-token: 10, #cached-token: 24, token usage: 0.00, #running-req: 0, #queue-req: 0
Yes, that is a test. I am designed to provide information and assistance to users, not to engage in any form of[2025-05-15 22:39:07] Decode batch. #running-req: 1, #token: 60, token usage: 0.00, cuda graph: False, gen throughput (token/s): 81.18, #queue-req: 0
 testing or investigation. If you have any questions or need help, feel free to ask, and I'll do my best to assist you based on the best information available.

Completions#

Usage#

The completions API is similar to the chat completions API, but without the messages parameter or chat templates.

[5]:
response = client.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    prompt="List 3 countries and their capitals.",
    temperature=0,
    max_tokens=64,
    n=1,
    stop=None,
)

print_highlight(f"Response: {response}")
[2025-05-15 22:39:08] Prefill batch. #new-seq: 1, #new-token: 8, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:39:08] Decode batch. #running-req: 1, #token: 14, token usage: 0.00, cuda graph: False, gen throughput (token/s): 76.04, #queue-req: 0
[2025-05-15 22:39:08] Decode batch. #running-req: 1, #token: 54, token usage: 0.00, cuda graph: False, gen throughput (token/s): 87.85, #queue-req: 0
[2025-05-15 22:39:09] INFO:     127.0.0.1:51376 - "POST /v1/completions HTTP/1.1" 200 OK
Response: Completion(id='c7047436377647779b3d4b6957568373', choices=[CompletionChoice(finish_reason='length', index=0, logprobs=None, text=' 1. United States - Washington, D.C.\n2. Canada - Ottawa\n3. France - Paris\n4. Germany - Berlin\n5. Japan - Tokyo\n6. Italy - Rome\n7. Spain - Madrid\n8. United Kingdom - London\n9. Australia - Canberra\n10. New', matched_stop=None)], created=1747348748, model='qwen/qwen2.5-0.5b-instruct', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=64, prompt_tokens=8, total_tokens=72, completion_tokens_details=None, prompt_tokens_details=None))

Parameters#

The completions API accepts the parameters of the OpenAI Completions API. Refer to the OpenAI Completions API for more details.

Here is an example of a detailed completions request:

[6]:
response = client.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    prompt="Write a short story about a space explorer.",
    temperature=0.7,  # Moderate temperature for creative writing
    max_tokens=150,  # Longer response for a story
    top_p=0.9,  # Balanced diversity in word choice
    stop=["\n\n", "THE END"],  # Multiple stop sequences
    presence_penalty=0.3,  # Encourage novel elements
    frequency_penalty=0.3,  # Reduce repetitive phrases
    n=1,  # Generate one completion
    seed=123,  # For reproducible results
)

print_highlight(f"Response: {response}")
[2025-05-15 22:39:09] Prefill batch. #new-seq: 1, #new-token: 9, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:39:09] Decode batch. #running-req: 1, #token: 31, token usage: 0.00, cuda graph: False, gen throughput (token/s): 80.35, #queue-req: 0
[2025-05-15 22:39:09] INFO:     127.0.0.1:51376 - "POST /v1/completions HTTP/1.1" 200 OK
Response: Completion(id='562efb2814c647ceb53b5effa24c9b20', choices=[CompletionChoice(finish_reason='stop', index=0, logprobs=None, text=' Once upon a time, there was a space explorer named Alex. Alex was an adventurous soul who loved to explore new places and uncover hidden secrets. One day, he embarked on a mission to the moon, hoping to find new life forms and learn more about the universe.', matched_stop='\n\n')], created=1747348749, model='qwen/qwen2.5-0.5b-instruct', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=54, prompt_tokens=9, total_tokens=63, completion_tokens_details=None, prompt_tokens_details=None))

Structured Outputs (JSON, Regex, EBNF)#

For the OpenAI-compatible structured outputs API, refer to Structured Outputs for more details.
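As a quick illustration (a minimal sketch; the schema name and fields below are made up for this example), the OpenAI-style response_format parameter can constrain a chat completion to a JSON schema:

import json

# Sketch: constrain the chat completion to a JSON schema via response_format.
# The "capital_info" schema is illustrative only.
response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {
            "role": "user",
            "content": "Give the capital of France as JSON with keys 'country' and 'capital'.",
        }
    ],
    temperature=0,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "capital_info",
            "schema": {
                "type": "object",
                "properties": {
                    "country": {"type": "string"},
                    "capital": {"type": "string"},
                },
                "required": ["country", "capital"],
            },
        },
    },
)

print(json.loads(response.choices[0].message.content))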

Batches#

The batches API is also supported for chat completions and completions. You can upload your requests in jsonl files, create a batch job, and retrieve the results once the batch job is completed (this takes longer but costs less).

The batches APIs are:

  • batches

  • batches/{batch_id}/cancel

  • batches/{batch_id}

Here is an example of a batch job for chat completions; batch jobs for completions are similar.

[7]:
import json
import time
from openai import OpenAI

client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

requests = [
    {
        "custom_id": "request-1",
        "method": "POST",
        "url": "/chat/completions",
        "body": {
            "model": "qwen/qwen2.5-0.5b-instruct",
            "messages": [
                {"role": "user", "content": "Tell me a joke about programming"}
            ],
            "max_tokens": 50,
        },
    },
    {
        "custom_id": "request-2",
        "method": "POST",
        "url": "/chat/completions",
        "body": {
            "model": "qwen/qwen2.5-0.5b-instruct",
            "messages": [{"role": "user", "content": "What is Python?"}],
            "max_tokens": 50,
        },
    },
]

input_file_path = "batch_requests.jsonl"

with open(input_file_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

with open(input_file_path, "rb") as f:
    file_response = client.files.create(file=f, purpose="batch")

batch_response = client.batches.create(
    input_file_id=file_response.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print_highlight(f"Batch job created with ID: {batch_response.id}")
[2025-05-15 22:39:09] INFO:     127.0.0.1:51390 - "POST /v1/files HTTP/1.1" 200 OK
[2025-05-15 22:39:09] INFO:     127.0.0.1:51390 - "POST /v1/batches HTTP/1.1" 200 OK
Batch job created with ID: batch_6b33bea1-0b87-4945-988c-88df7b880d40
[2025-05-15 22:39:09] Prefill batch. #new-seq: 2, #new-token: 20, #cached-token: 48, token usage: 0.00, #running-req: 0, #queue-req: 0
[8]:
while batch_response.status not in ["completed", "failed", "cancelled"]:
    time.sleep(3)
    print(f"Batch job status: {batch_response.status}...trying again in 3 seconds...")
    batch_response = client.batches.retrieve(batch_response.id)

if batch_response.status == "completed":
    print("Batch job completed successfully!")
    print(f"Request counts: {batch_response.request_counts}")

    result_file_id = batch_response.output_file_id
    file_response = client.files.content(result_file_id)
    result_content = file_response.read().decode("utf-8")

    results = [
        json.loads(line) for line in result_content.split("\n") if line.strip() != ""
    ]

    for result in results:
        print_highlight(f"Request {result['custom_id']}:")
        print_highlight(f"Response: {result['response']}")

    print_highlight("Cleaning up files...")
    # Only delete the result file ID since file_response is just content
    client.files.delete(result_file_id)
else:
    print_highlight(f"Batch job failed with status: {batch_response.status}")
    if hasattr(batch_response, "errors"):
        print_highlight(f"Errors: {batch_response.errors}")
[2025-05-15 22:39:10] Decode batch. #running-req: 2, #token: 60, token usage: 0.00, cuda graph: False, gen throughput (token/s): 58.40, #queue-req: 0
[2025-05-15 22:39:10] Decode batch. #running-req: 2, #token: 140, token usage: 0.01, cuda graph: False, gen throughput (token/s): 158.99, #queue-req: 0
Batch job status: validating...trying again in 3 seconds...
[2025-05-15 22:39:12] INFO:     127.0.0.1:51390 - "GET /v1/batches/batch_6b33bea1-0b87-4945-988c-88df7b880d40 HTTP/1.1" 200 OK
Batch job completed successfully!
Request counts: BatchRequestCounts(completed=2, failed=0, total=2)
[2025-05-15 22:39:12] INFO:     127.0.0.1:51390 - "GET /v1/files/backend_result_file-cdad1509-5b8b-4558-9141-a3dabf2c4e60/content HTTP/1.1" 200 OK
Request request-1:
Response: {'status_code': 200, 'request_id': 'batch_6b33bea1-0b87-4945-988c-88df7b880d40-req_0', 'body': {'id': 'batch_6b33bea1-0b87-4945-988c-88df7b880d40-req_0', 'object': 'chat.completion', 'created': 1747348749, 'model': 'qwen/qwen2.5-0.5b-instruct', 'choices': {'index': 0, 'message': {'role': 'assistant', 'content': 'Sure, here\'s a programming joke for you:\nWhy don\'t scientists trust atoms?\nBecause they make up everything.\nThis is a play on words, as in "make up" rather than "formulate." It\'s also a classic example of a', 'tool_calls': None, 'reasoning_content': None}, 'logprobs': None, 'finish_reason': 'length', 'matched_stop': None}, 'usage': {'prompt_tokens': 35, 'completion_tokens': 50, 'total_tokens': 85}, 'system_fingerprint': None}}
Request request-2:
Response: {'status_code': 200, 'request_id': 'batch_6b33bea1-0b87-4945-988c-88df7b880d40-req_1', 'body': {'id': 'batch_6b33bea1-0b87-4945-988c-88df7b880d40-req_1', 'object': 'chat.completion', 'created': 1747348749, 'model': 'qwen/qwen2.5-0.5b-instruct', 'choices': {'index': 0, 'message': {'role': 'assistant', 'content': 'Python is a high-level, interpreted programming language developed by Guido van Rossum. Here are some key points about Python:\n\n1. Community: Python has a large and active community, with over 230, 000 repositories on', 'tool_calls': None, 'reasoning_content': None}, 'logprobs': None, 'finish_reason': 'length', 'matched_stop': None}, 'usage': {'prompt_tokens': 33, 'completion_tokens': 50, 'total_tokens': 83}, 'system_fingerprint': None}}
Cleaning up files...
[2025-05-15 22:39:12] INFO:     127.0.0.1:51390 - "DELETE /v1/files/backend_result_file-cdad1509-5b8b-4558-9141-a3dabf2c4e60 HTTP/1.1" 200 OK

It takes some time for a batch job to complete. You can use these two APIs to retrieve the batch job status or cancel a batch job:

  1. batches/{batch_id}: Retrieve the batch job status.

  2. batches/{batch_id}/cancel: Cancel a batch job.

Here is an example of checking the batch job status.

[9]:
import json
import time
from openai import OpenAI

client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

requests = []
for i in range(20):
    requests.append(
        {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/chat/completions",
            "body": {
                "model": "qwen/qwen2.5-0.5b-instruct",
                "messages": [
                    {
                        "role": "system",
                        "content": f"{i}: You are a helpful AI assistant",
                    },
                    {
                        "role": "user",
                        "content": "Write a detailed story about topic. Make it very long.",
                    },
                ],
                "max_tokens": 64,
            },
        }
    )

input_file_path = "batch_requests.jsonl"
with open(input_file_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

with open(input_file_path, "rb") as f:
    uploaded_file = client.files.create(file=f, purpose="batch")

batch_job = client.batches.create(
    input_file_id=uploaded_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print_highlight(f"Created batch job with ID: {batch_job.id}")
print_highlight(f"Initial status: {batch_job.status}")

time.sleep(10)

max_checks = 5
for i in range(max_checks):
    batch_details = client.batches.retrieve(batch_id=batch_job.id)

    print_highlight(
        f"Batch job details (check {i+1} / {max_checks}) // ID: {batch_details.id} // Status: {batch_details.status} // Created at: {batch_details.created_at} // Input file ID: {batch_details.input_file_id} // Output file ID: {batch_details.output_file_id}"
    )
    print_highlight(
        f"<strong>Request counts: Total: {batch_details.request_counts.total} // Completed: {batch_details.request_counts.completed} // Failed: {batch_details.request_counts.failed}</strong>"
    )

    time.sleep(3)
[2025-05-15 22:39:12] INFO:     127.0.0.1:54056 - "POST /v1/files HTTP/1.1" 200 OK
[2025-05-15 22:39:12] INFO:     127.0.0.1:54056 - "POST /v1/batches HTTP/1.1" 200 OK
Created batch job with ID: batch_09c588d1-fed1-4c89-9280-7846413922a2
Initial status: validating
[2025-05-15 22:39:12] Prefill batch. #new-seq: 9, #new-token: 270, #cached-token: 27, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-05-15 22:39:12] Prefill batch. #new-seq: 11, #new-token: 340, #cached-token: 33, token usage: 0.01, #running-req: 9, #queue-req: 0
[2025-05-15 22:39:13] Decode batch. #running-req: 18, #token: 1227, token usage: 0.06, cuda graph: False, gen throughput (token/s): 258.14, #queue-req: 0
[2025-05-15 22:39:22] INFO:     127.0.0.1:48984 - "GET /v1/batches/batch_09c588d1-fed1-4c89-9280-7846413922a2 HTTP/1.1" 200 OK
Batch job details (check 1 / 5) // ID: batch_09c588d1-fed1-4c89-9280-7846413922a2 // Status: completed // Created at: 1747348752 // Input file ID: backend_input_file-011c72d5-0c54-4c36-829f-09dc78f4ee21 // Output file ID: backend_result_file-55bc9f26-9d48-41d0-b9dc-e633ca750816
Request counts: Total: 20 // Completed: 20 // Failed: 0
[2025-05-15 22:39:25] INFO:     127.0.0.1:48984 - "GET /v1/batches/batch_09c588d1-fed1-4c89-9280-7846413922a2 HTTP/1.1" 200 OK
Batch job details (check 2 / 5) // ID: batch_09c588d1-fed1-4c89-9280-7846413922a2 // Status: completed // Created at: 1747348752 // Input file ID: backend_input_file-011c72d5-0c54-4c36-829f-09dc78f4ee21 // Output file ID: backend_result_file-55bc9f26-9d48-41d0-b9dc-e633ca750816
Request counts: Total: 20 // Completed: 20 // Failed: 0
[2025-05-15 22:39:28] INFO:     127.0.0.1:48984 - "GET /v1/batches/batch_09c588d1-fed1-4c89-9280-7846413922a2 HTTP/1.1" 200 OK
Batch job details (check 3 / 5) // ID: batch_09c588d1-fed1-4c89-9280-7846413922a2 // Status: completed // Created at: 1747348752 // Input file ID: backend_input_file-011c72d5-0c54-4c36-829f-09dc78f4ee21 // Output file ID: backend_result_file-55bc9f26-9d48-41d0-b9dc-e633ca750816
Request counts: Total: 20 // Completed: 20 // Failed: 0
[2025-05-15 22:39:32] INFO:     127.0.0.1:48984 - "GET /v1/batches/batch_09c588d1-fed1-4c89-9280-7846413922a2 HTTP/1.1" 200 OK
Batch job details (check 4 / 5) // ID: batch_09c588d1-fed1-4c89-9280-7846413922a2 // Status: completed // Created at: 1747348752 // Input file ID: backend_input_file-011c72d5-0c54-4c36-829f-09dc78f4ee21 // Output file ID: backend_result_file-55bc9f26-9d48-41d0-b9dc-e633ca750816
Request counts: Total: 20 // Completed: 20 // Failed: 0
[2025-05-15 22:39:35] INFO:     127.0.0.1:48984 - "GET /v1/batches/batch_09c588d1-fed1-4c89-9280-7846413922a2 HTTP/1.1" 200 OK
Batch job details (check 5 / 5) // ID: batch_09c588d1-fed1-4c89-9280-7846413922a2 // Status: completed // Created at: 1747348752 // Input file ID: backend_input_file-011c72d5-0c54-4c36-829f-09dc78f4ee21 // Output file ID: backend_result_file-55bc9f26-9d48-41d0-b9dc-e633ca750816
Request counts: Total: 20 // Completed: 20 // Failed: 0

Here is an example of cancelling a batch job.

[10]:
import json
import time
from openai import OpenAI
import os

client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

requests = []
for i in range(5000):
    requests.append(
        {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/chat/completions",
            "body": {
                "model": "qwen/qwen2.5-0.5b-instruct",
                "messages": [
                    {
                        "role": "system",
                        "content": f"{i}: You are a helpful AI assistant",
                    },
                    {
                        "role": "user",
                        "content": "Write a detailed story about topic. Make it very long.",
                    },
                ],
                "max_tokens": 128,
            },
        }
    )

input_file_path = "batch_requests.jsonl"
with open(input_file_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

with open(input_file_path, "rb") as f:
    uploaded_file = client.files.create(file=f, purpose="batch")

batch_job = client.batches.create(
    input_file_id=uploaded_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print_highlight(f"Created batch job with ID: {batch_job.id}")
print_highlight(f"Initial status: {batch_job.status}")

time.sleep(10)

try:
    cancelled_job = client.batches.cancel(batch_id=batch_job.id)
    print_highlight(f"Cancellation initiated. Status: {cancelled_job.status}")
    assert cancelled_job.status == "cancelling"

    # Monitor the cancellation process
    while cancelled_job.status not in ["failed", "cancelled"]:
        time.sleep(3)
        cancelled_job = client.batches.retrieve(batch_job.id)
        print_highlight(f"Current status: {cancelled_job.status}")

    # Verify final status
    assert cancelled_job.status == "cancelled"
    print_highlight("Batch job successfully cancelled")

except Exception as e:
    print_highlight(f"Error during cancellation: {e}")
    raise e

finally:
    try:
        del_response = client.files.delete(uploaded_file.id)
        if del_response.deleted:
            print_highlight("Successfully cleaned up input file")
        if os.path.exists(input_file_path):
            os.remove(input_file_path)
            print_highlight("Successfully deleted local batch_requests.jsonl file")
    except Exception as e:
        print_highlight(f"Error cleaning up: {e}")
        raise e
[2025-05-15 22:39:38] INFO:     127.0.0.1:55100 - "POST /v1/files HTTP/1.1" 200 OK
[2025-05-15 22:39:38] INFO:     127.0.0.1:55100 - "POST /v1/batches HTTP/1.1" 200 OK
Created batch job with ID: batch_7c6fe916-bf21-4279-bdeb-400dc229c835
Initial status: validating
[2025-05-15 22:39:39] Prefill batch. #new-seq: 8, #new-token: 8, #cached-token: 256, token usage: 0.01, #running-req: 0, #queue-req: 0
[2025-05-15 22:39:39] Prefill batch. #new-seq: 123, #new-token: 3342, #cached-token: 869, token usage: 0.03, #running-req: 8, #queue-req: 76
[2025-05-15 22:39:39] Prefill batch. #new-seq: 30, #new-token: 900, #cached-token: 150, token usage: 0.27, #running-req: 128, #queue-req: 4839
[2025-05-15 22:39:39] Decode batch. #running-req: 128, #token: 6408, token usage: 0.31, cuda graph: False, gen throughput (token/s): 82.88, #queue-req: 4839
[2025-05-15 22:39:40] Prefill batch. #new-seq: 7, #new-token: 210, #cached-token: 35, token usage: 0.39, #running-req: 157, #queue-req: 4832
[2025-05-15 22:39:40] Prefill batch. #new-seq: 2, #new-token: 60, #cached-token: 10, token usage: 0.40, #running-req: 163, #queue-req: 4830
[2025-05-15 22:39:40] Prefill batch. #new-seq: 1, #new-token: 30, #cached-token: 5, token usage: 0.41, #running-req: 164, #queue-req: 4829
[2025-05-15 22:39:40] Decode batch. #running-req: 163, #token: 13085, token usage: 0.64, cuda graph: False, gen throughput (token/s): 9803.97, #queue-req: 4829
[2025-05-15 22:39:41] Decode batch. #running-req: 161, #token: 19371, token usage: 0.95, cuda graph: False, gen throughput (token/s): 10877.52, #queue-req: 4829
[2025-05-15 22:39:41] Decode out of memory happened. #retracted_reqs: 24, #new_token_ratio: 0.5997 -> 0.9271
[2025-05-15 22:39:41] Decode out of memory happened. #retracted_reqs: 17, #new_token_ratio: 0.9081 -> 1.0000
[2025-05-15 22:39:41] Prefill batch. #new-seq: 9, #new-token: 270, #cached-token: 45, token usage: 0.88, #running-req: 120, #queue-req: 4861
[2025-05-15 22:39:41] Prefill batch. #new-seq: 120, #new-token: 3600, #cached-token: 600, token usage: 0.02, #running-req: 9, #queue-req: 4741
[2025-05-15 22:39:41] Decode batch. #running-req: 129, #token: 4605, token usage: 0.22, cuda graph: False, gen throughput (token/s): 8371.74, #queue-req: 4741
[2025-05-15 22:39:41] Prefill batch. #new-seq: 3, #new-token: 90, #cached-token: 15, token usage: 0.29, #running-req: 128, #queue-req: 4738
[2025-05-15 22:39:42] Prefill batch. #new-seq: 2, #new-token: 60, #cached-token: 10, token usage: 0.40, #running-req: 130, #queue-req: 4736
[2025-05-15 22:39:42] Prefill batch. #new-seq: 1, #new-token: 30, #cached-token: 5, token usage: 0.42, #running-req: 131, #queue-req: 4735
[2025-05-15 22:39:42] Prefill batch. #new-seq: 2, #new-token: 60, #cached-token: 10, token usage: 0.45, #running-req: 131, #queue-req: 4733
[2025-05-15 22:39:42] Prefill batch. #new-seq: 1, #new-token: 30, #cached-token: 5, token usage: 0.45, #running-req: 132, #queue-req: 4732
[2025-05-15 22:39:42] Decode batch. #running-req: 133, #token: 9801, token usage: 0.48, cuda graph: False, gen throughput (token/s): 8235.78, #queue-req: 4732
[2025-05-15 22:39:42] Prefill batch. #new-seq: 1, #new-token: 30, #cached-token: 5, token usage: 0.51, #running-req: 132, #queue-req: 4731
[2025-05-15 22:39:42] Prefill batch. #new-seq: 1, #new-token: 31, #cached-token: 4, token usage: 0.66, #running-req: 132, #queue-req: 4730
[2025-05-15 22:39:42] Decode batch. #running-req: 133, #token: 14998, token usage: 0.73, cuda graph: False, gen throughput (token/s): 9075.88, #queue-req: 4730
[2025-05-15 22:39:43] Prefill batch. #new-seq: 6, #new-token: 180, #cached-token: 30, token usage: 0.90, #running-req: 124, #queue-req: 4724
[2025-05-15 22:39:43] Decode batch. #running-req: 130, #token: 19054, token usage: 0.93, cuda graph: False, gen throughput (token/s): 9073.01, #queue-req: 4724
[2025-05-15 22:39:43] Prefill batch. #new-seq: 112, #new-token: 3498, #cached-token: 422, token usage: 0.08, #running-req: 17, #queue-req: 4612
[2025-05-15 22:39:43] Prefill batch. #new-seq: 16, #new-token: 495, #cached-token: 65, token usage: 0.31, #running-req: 126, #queue-req: 4596
[2025-05-15 22:39:44] Prefill batch. #new-seq: 2, #new-token: 60, #cached-token: 10, token usage: 0.42, #running-req: 141, #queue-req: 4594
[2025-05-15 22:39:44] Prefill batch. #new-seq: 2, #new-token: 60, #cached-token: 10, token usage: 0.44, #running-req: 141, #queue-req: 4592
[2025-05-15 22:39:44] Prefill batch. #new-seq: 1, #new-token: 30, #cached-token: 5, token usage: 0.45, #running-req: 142, #queue-req: 4591
[2025-05-15 22:39:44] Decode batch. #running-req: 143, #token: 9447, token usage: 0.46, cuda graph: False, gen throughput (token/s): 7352.93, #queue-req: 4591
[2025-05-15 22:39:44] Prefill batch. #new-seq: 1, #new-token: 31, #cached-token: 4, token usage: 0.46, #running-req: 142, #queue-req: 4590
[2025-05-15 22:39:44] Prefill batch. #new-seq: 3, #new-token: 90, #cached-token: 15, token usage: 0.47, #running-req: 141, #queue-req: 4587
[2025-05-15 22:39:44] Prefill batch. #new-seq: 2, #new-token: 60, #cached-token: 10, token usage: 0.52, #running-req: 142, #queue-req: 4585
[2025-05-15 22:39:44] Prefill batch. #new-seq: 1, #new-token: 30, #cached-token: 5, token usage: 0.57, #running-req: 141, #queue-req: 4584
[2025-05-15 22:39:44] Prefill batch. #new-seq: 1, #new-token: 30, #cached-token: 5, token usage: 0.59, #running-req: 141, #queue-req: 4583
[2025-05-15 22:39:44] Decode batch. #running-req: 140, #token: 14160, token usage: 0.69, cuda graph: False, gen throughput (token/s): 8175.19, #queue-req: 4583
[2025-05-15 22:39:45] Decode batch. #running-req: 139, #token: 19614, token usage: 0.96, cuda graph: False, gen throughput (token/s): 9867.55, #queue-req: 4583
[2025-05-15 22:39:45] Prefill batch. #new-seq: 101, #new-token: 3188, #cached-token: 347, token usage: 0.18, #running-req: 28, #queue-req: 4482
[2025-05-15 22:39:45] Prefill batch. #new-seq: 23, #new-token: 712, #cached-token: 93, token usage: 0.41, #running-req: 127, #queue-req: 4459
[2025-05-15 22:39:46] Prefill batch. #new-seq: 20, #new-token: 614, #cached-token: 86, token usage: 0.33, #running-req: 135, #queue-req: 4439
[2025-05-15 22:39:46] Prefill batch. #new-seq: 3, #new-token: 90, #cached-token: 15, token usage: 0.45, #running-req: 154, #queue-req: 4436
[2025-05-15 22:39:46] Decode batch. #running-req: 154, #token: 9011, token usage: 0.44, cuda graph: False, gen throughput (token/s): 7703.92, #queue-req: 4436
[2025-05-15 22:39:46] Prefill batch. #new-seq: 3, #new-token: 90, #cached-token: 15, token usage: 0.45, #running-req: 155, #queue-req: 4433
[2025-05-15 22:39:46] Prefill batch. #new-seq: 2, #new-token: 60, #cached-token: 10, token usage: 0.47, #running-req: 156, #queue-req: 4431
[2025-05-15 22:39:46] Prefill batch. #new-seq: 1, #new-token: 31, #cached-token: 4, token usage: 0.48, #running-req: 157, #queue-req: 4430
[2025-05-15 22:39:46] Prefill batch. #new-seq: 2, #new-token: 60, #cached-token: 10, token usage: 0.50, #running-req: 154, #queue-req: 4428
[2025-05-15 22:39:46] Prefill batch. #new-seq: 1, #new-token: 30, #cached-token: 5, token usage: 0.54, #running-req: 155, #queue-req: 4427
[2025-05-15 22:39:46] Prefill batch. #new-seq: 2, #new-token: 60, #cached-token: 10, token usage: 0.55, #running-req: 153, #queue-req: 4425
[2025-05-15 22:39:46] Decode batch. #running-req: 153, #token: 13819, token usage: 0.67, cuda graph: False, gen throughput (token/s): 8793.67, #queue-req: 4425
[2025-05-15 22:39:47] Decode batch. #running-req: 152, #token: 19815, token usage: 0.97, cuda graph: False, gen throughput (token/s): 10227.22, #queue-req: 4425
[2025-05-15 22:39:47] Decode out of memory happened. #retracted_reqs: 23, #new_token_ratio: 0.6217 -> 0.9976
[2025-05-15 22:39:47] Prefill batch. #new-seq: 96, #new-token: 3072, #cached-token: 288, token usage: 0.23, #running-req: 32, #queue-req: 4352
[2025-05-15 22:39:48] Prefill batch. #new-seq: 36, #new-token: 1108, #cached-token: 152, token usage: 0.20, #running-req: 106, #queue-req: 4316
[2025-05-15 22:39:48] INFO:     127.0.0.1:50096 - "POST /v1/batches/batch_7c6fe916-bf21-4279-bdeb-400dc229c835/cancel HTTP/1.1" 200 OK
Cancellation initiated. Status: cancelling
[2025-05-15 22:39:48] Prefill batch. #new-seq: 1, #new-token: 30, #cached-token: 5, token usage: 0.26, #running-req: 131, #queue-req: 4315
[2025-05-15 22:39:51] INFO:     127.0.0.1:50096 - "GET /v1/batches/batch_7c6fe916-bf21-4279-bdeb-400dc229c835 HTTP/1.1" 200 OK
Current status: cancelled
Batch job successfully cancelled
[2025-05-15 22:39:51] INFO:     127.0.0.1:50096 - "DELETE /v1/files/backend_input_file-9092419b-e398-448b-923c-0e486804c9c5 HTTP/1.1" 200 OK
Successfully cleaned up input file
Successfully deleted local batch_requests.jsonl file
[11]:
terminate_process(server_process)