OpenAI APIs - Completions#

SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models. A complete reference for the API is available in the OpenAI API Reference.

This tutorial covers the following popular APIs:

  • chat/completions

  • completions

Check out other tutorials to learn about the vision APIs for vision-language models and the embedding APIs for embedding models.

Launch A Server#

Launch the server in your terminal and wait for it to initialize.

[1]:
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process

server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --log-level warning"
)

wait_for_server(f"https://:{port}")
print(f"Server started on https://:{port}")
[2025-12-30 02:25:17] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:25:17] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:25:17] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:25:23] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:25:23] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:25:23] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:25:25] INFO server_args.py:1564: Attention backend not specified. Use fa3 backend by default.
[2025-12-30 02:25:25] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:25:32] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:25:32] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:25:32] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:25:32] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:25:32] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:25:32] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:25:38] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.08it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.08it/s]

Capturing batches (bs=1 avail_mem=76.96 GB): 100%|██████████| 3/3 [00:00<00:00,  8.53it/s]


Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in plain black, while the notebook outputs are highlighted in blue.
To shorten the logs, we set the server's log level to warning; the default level is info.
We run these notebooks in a CI environment, so the throughput numbers are not representative of actual performance.
Server started on http://localhost:33848

Chat Completions#

Usage#

The server fully implements the OpenAI API. It will automatically apply the chat template if one is specified in the Hugging Face tokenizer. You can also specify a custom chat template with --chat-template when launching the server.
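
For example, a launch with a custom template might look like this (a sketch; the template path is hypothetical, and --chat-template also accepts the name of a built-in template):

python3 -m sglang.launch_server \
    --model-path qwen/qwen2.5-0.5b-instruct \
    --chat-template /path/to/custom_template.jinja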

[2]:
import openai

client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")
Response: ChatCompletion(id='e98698e0c79549e2884ab8d08b244028', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Sure, here are three countries and their respective capitals:\n\n1. **United States** - Washington, D.C.\n2. **Canada** - Ottawa\n3. **Australia** - Canberra', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=151645)], created=1767061548, model='qwen/qwen2.5-0.5b-instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=39, prompt_tokens=37, total_tokens=76, completion_tokens_details=None, prompt_tokens_details=None, reasoning_tokens=0), metadata={'weight_version': 'default'})

Model Thinking/Reasoning Support#

Some models support an internal reasoning or thinking process that can be exposed in API responses. SGLang provides unified support for a variety of reasoning models through the chat_template_kwargs request parameter and compatible reasoning parsers.

Supported Models and Configuration#

| Model Family | Chat Template Parameter | Reasoning Parser | Notes |
|---|---|---|---|
| DeepSeek-R1 (R1, R1-0528, R1-Distill) | enable_thinking | --reasoning-parser deepseek-r1 | Standard reasoning models |
| DeepSeek-V3.1 | thinking | --reasoning-parser deepseek-v3 | Hybrid model (thinking/non-thinking modes) |
| Qwen3 (standard) | enable_thinking | --reasoning-parser qwen3 | Hybrid model (thinking/non-thinking modes) |
| Qwen3-Thinking | N/A (always enabled) | --reasoning-parser qwen3-thinking | Always generates reasoning |
| Kimi | N/A (always enabled) | --reasoning-parser kimi | Kimi thinking models |
| Gpt-Oss | N/A (always enabled) | --reasoning-parser gpt-oss | Gpt-Oss thinking models |

Basic Usage#

To enable reasoning output, you need to:

  1. Launch the server with the appropriate reasoning parser

  2. Set the model-specific parameter in chat_template_kwargs

  3. Optional: set separate_reasoning: False to keep the reasoning content inline instead of separated (default is True)

Note for Qwen3-Thinking models: These models always generate thinking content and do not support the enable_thinking parameter. Use --reasoning-parser qwen3-thinking or --reasoning-parser qwen3 to parse the thinking content.

Example: Qwen3 Models#

# Launch server:
# python3 -m sglang.launch_server --model Qwen/Qwen3-4B --reasoning-parser qwen3

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url=f"http://127.0.0.1:30000/v1",
)

model = "Qwen/Qwen3-4B"
messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "chat_template_kwargs": {"enable_thinking": True},
        "separate_reasoning": True
    }
)

print("Reasoning:", response.choices[0].message.reasoning_content)
print("-"*100)
print("Answer:", response.choices[0].message.content)

Example Output

Reasoning: Okay, so the user is asking how many 'r's are in the word 'strawberry'. Let me think. First, I need to make sure I have the word spelled correctly. Strawberry... S-T-R-A-W-B-E-R-R-Y. Wait, is that right? Let me break it down.

Starting with 'strawberry', let's write out the letters one by one. S, T, R, A, W, B, E, R, R, Y. Hmm, wait, that's 10 letters. Let me check again. S (1), T (2), R (3), A (4), W (5), B (6), E (7), R (8), R (9), Y (10). So the letters are S-T-R-A-W-B-E-R-R-Y.
...
Therefore, the answer should be three R's in 'strawberry'. But I need to make sure I'm not counting any other letters as R. Let me check again. S, T, R, A, W, B, E, R, R, Y. No other R's. So three in total. Yeah, that seems right.

----------------------------------------------------------------------------------------------------
Answer: The word "strawberry" contains **three** letters 'r'. Here's the breakdown:

1. **S-T-R-A-W-B-E-R-R-Y**
   - The **third letter** is 'R'.
   - The **eighth and ninth letters** are also 'R's.

Thus, the total count is **3**.

**Answer:** 3.

Note: Setting "enable_thinking": False (or omitting the parameter) results in reasoning_content being None. Qwen3-Thinking models always generate reasoning content and do not support the enable_thinking parameter.
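
Two quick variations on the Qwen3 example above, as a sketch (it reuses client, model, and messages from that example; the exact behavior of separate_reasoning may vary by version):

# Variation 1: with thinking disabled, the parser has nothing to split out
response = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.reasoning_content)  # None

# Variation 2: with separate_reasoning disabled, the thinking text stays inline
response = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "chat_template_kwargs": {"enable_thinking": True},
        "separate_reasoning": False,
    },
)
print(response.choices[0].message.content)  # includes the raw thinking text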

Logit Bias Support#

SGLang supports the logit_bias parameter in both the chat completions and completions APIs. This parameter lets you modify the likelihood of specific tokens appearing in the generated output by adding a bias value to their logits. Bias values range from -100 to 100, where:

  • Positive values (0 to 100) increase the likelihood of the token being selected

  • Negative values (-100 to 0) decrease the likelihood of the token being selected

  • -100 effectively blocks the token from being generated

The logit_bias parameter accepts a dictionary whose keys are token IDs (as strings) and whose values are bias amounts (as floats).

Getting Token IDs#

To use logit_bias effectively, you need the token IDs of the words you want to bias. Here is how to find them:

# Get tokenizer to find token IDs
import tiktoken

# For OpenAI models, use the appropriate encoding
tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")  # or your model

# Get token IDs for specific words
word = "sunny"
token_ids = tokenizer.encode(word)
print(f"Token IDs for '{word}': {token_ids}")

# For the model served by SGLang, load its own Hugging Face tokenizer
# to look up token IDs (see the sketch below)

Important: The logit_bias parameter uses token IDs as string keys, not the actual words.
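
Because the model served in this tutorial is a Qwen model rather than an OpenAI one, its token IDs come from its own tokenizer. Here is a minimal sketch using the Hugging Face tokenizer (assuming the model files are available locally or from the Hub):

from transformers import AutoTokenizer

# Load the tokenizer of the model actually being served
tokenizer = AutoTokenizer.from_pretrained("qwen/qwen2.5-0.5b-instruct")

word = "sunny"
token_ids = tokenizer.encode(word, add_special_tokens=False)
print(f"Token IDs for '{word}': {token_ids}")

# logit_bias expects string keys, so convert the IDs before use
logit_bias = {str(tid): 50.0 for tid in token_ids}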

Example: DeepSeek-V3 Models#

DeepSeek-V3 models support thinking mode through the thinking parameter:

# Launch server:
# python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.1 --tp 8  --reasoning-parser deepseek-v3

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url=f"http://127.0.0.1:30000/v1",
)

model = "deepseek-ai/DeepSeek-V3.1"
messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "chat_template_kwargs": {"thinking": True},
        "separate_reasoning": True
    }
)

print("Reasoning:", response.choices[0].message.reasoning_content)
print("-"*100)
print("Answer:", response.choices[0].message.content)

Example Output

Reasoning: First, the question is: "How many r's are in 'strawberry'?"

I need to count the number of times the letter 'r' appears in the word "strawberry".

Let me write out the word: S-T-R-A-W-B-E-R-R-Y.

Now, I'll go through each letter and count the 'r's.
...
So, I have three 'r's in "strawberry".

I should double-check. The word is spelled S-T-R-A-W-B-E-R-R-Y. The letters are at positions: 3, 8, and 9 are 'r's. Yes, that's correct.

Therefore, the answer should be 3.
----------------------------------------------------------------------------------------------------
Answer: The word "strawberry" contains **3** instances of the letter "r". Here's a breakdown for clarity:

- The word is spelled: S-T-R-A-W-B-E-R-R-Y
- The "r" appears at the 3rd, 8th, and 9th positions.

Note: DeepSeek-V3 models use the thinking parameter (not enable_thinking) to control reasoning output.

[3]:
# Example with logit_bias parameter
# Note: You need to get the actual token IDs from your tokenizer
# For demonstration, we'll use some example token IDs
response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {"role": "user", "content": "Complete this sentence: The weather today is"}
    ],
    temperature=0.7,
    max_tokens=20,
    logit_bias={
        "12345": 50,  # Increase likelihood of token ID 12345
        "67890": -50,  # Decrease likelihood of token ID 67890
        "11111": 25,  # Slightly increase likelihood of token ID 11111
    },
)

print_highlight(f"Response with logit bias: {response.choices[0].message.content}")
Response with logit bias: privacy privacy privacy privacy privacy privacy privacy privacy privacy privacy privacy privacy privacy privacy privacy privacy privacy privacy privacy privacy

Parameters#

The chat completions API accepts the parameters of the OpenAI Chat Completions API. Refer to the OpenAI Chat Completions API for details.

SGLang extends the standard API with the extra_body parameter, which allows additional customization. One key option within extra_body is chat_template_kwargs, which can be used to pass arguments to the chat template processor.
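
For instance, a request can carry SGLang-specific options alongside the standard ones, as a sketch (top_k is an SGLang extension outside the OpenAI schema; chat_template_kwargs is shown empty as a placeholder):

response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={
        "top_k": 20,  # SGLang sampling parameter not in the OpenAI schema
        "chat_template_kwargs": {},  # arguments forwarded to the chat template
    },
)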

[4]:
response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a knowledgeable historian who provides concise responses.",
        },
        {"role": "user", "content": "Tell me about ancient Rome"},
        {
            "role": "assistant",
            "content": "Ancient Rome was a civilization centered in Italy.",
        },
        {"role": "user", "content": "What were their major achievements?"},
    ],
    temperature=0.3,  # Lower temperature for more focused responses
    max_tokens=128,  # Reasonable length for a concise response
    top_p=0.95,  # Slightly higher for better fluency
    presence_penalty=0.2,  # Mild penalty to avoid repetition
    frequency_penalty=0.2,  # Mild penalty for more natural language
    n=1,  # Single response is usually more stable
    seed=42,  # Keep for reproducibility
)

print_highlight(response.choices[0].message.content)
Ancient Roman civilization is known for its wide-ranging achievements, including:

1. Architecture: The Romans are famous for their grand architecture, including the Colosseum, the Pantheon, and the Pantheon within Rome.

2. Law and governance: They developed a sophisticated legal system that influenced modern law. The Roman Republic and Empire had highly structured systems of government.

3. Literature: The Romans are known for their literary works, including those of Virgil, Ovid, and Horace.

4. Art: They are renowned for their art, including sculpture, painting, and architecture.

5. Religion: The Romans were poly

Streaming mode is also supported.

Logit Bias Support#

The completions API also supports the logit_bias parameter, which works the same way as described in the chat completions section above.

[5]:
stream = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
Yes, I am designed to process text inputs and provide responses based on the information provided. If you have any specific questions or need assistance with a particular topic, feel free to ask, and I'll do my best to help you.
[6]:
# Example with logit_bias parameter for completions API
# Note: You need to get the actual token IDs from your tokenizer
# For demonstration, we'll use some example token IDs
response = client.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    prompt="The best programming language for AI is",
    temperature=0.7,
    max_tokens=20,
    logit_bias={
        "12345": 75,  # Strongly favor token ID 12345
        "67890": -100,  # Completely avoid token ID 67890
        "11111": -25,  # Slightly discourage token ID 11111
    },
)

print_highlight(f"Response with logit bias: {response.choices[0].text}")
Response with logit bias: privacy privacy privacy privacy privacy privacy privacy privacy privacy privacy privacy privacy privacy privacy privacy privacy privacy privacy privacy privacy

Completions#

Usage#

The completions API is similar to the chat completions API, but without the messages parameter or chat templates.

[7]:
response = client.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    prompt="List 3 countries and their capitals.",
    temperature=0,
    max_tokens=64,
    n=1,
    stop=None,
)

print_highlight(f"Response: {response}")
Response: Completion(id='ee7a8ead16d743f49128d84800ecd956', choices=[CompletionChoice(finish_reason='length', index=0, logprobs=None, text=' 1. United States - Washington D.C.\n2. Canada - Ottawa\n3. France - Paris\n4. Germany - Berlin\n5. Japan - Tokyo\n6. Italy - Rome\n7. Spain - Madrid\n8. United Kingdom - London\n9. Australia - Canberra\n10. New Zealand', matched_stop=None)], created=1767061548, model='qwen/qwen2.5-0.5b-instruct', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=64, prompt_tokens=8, total_tokens=72, completion_tokens_details=None, prompt_tokens_details=None, reasoning_tokens=0), metadata={'weight_version': 'default'})

Parameters#

The completions API accepts the parameters of the OpenAI Completions API. Refer to the OpenAI Completions API for details.

Here is an example of a detailed completions request:

[8]:
response = client.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    prompt="Write a short story about a space explorer.",
    temperature=0.7,  # Moderate temperature for creative writing
    max_tokens=150,  # Longer response for a story
    top_p=0.9,  # Balanced diversity in word choice
    stop=["\n\n", "THE END"],  # Multiple stop sequences
    presence_penalty=0.3,  # Encourage novel elements
    frequency_penalty=0.3,  # Reduce repetitive phrases
    n=1,  # Generate one completion
    seed=123,  # For reproducible results
)

print_highlight(f"Response: {response}")
Response: Completion(id='28cd7bec59424af2b7437fac456359dd', choices=[CompletionChoice(finish_reason='stop', index=0, logprobs=None, text=' Once upon a time, there was a space explorer named John who was on a mission to explore the stars. He was a very talented and experienced astronaut, but he always felt like he was missing something important.', matched_stop='\n\n')], created=1767061548, model='qwen/qwen2.5-0.5b-instruct', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=42, prompt_tokens=9, total_tokens=51, completion_tokens_details=None, prompt_tokens_details=None, reasoning_tokens=0), metadata={'weight_version': 'default'})

Structured Outputs (JSON, Regex, EBNF)#

For OpenAI-compatible structured outputs APIs, refer to Structured Outputs for more details.

Using LoRA Adapters#

SGLang supports LoRA (Low-Rank Adaptation) adapters in its OpenAI-compatible APIs. You can specify which adapter to use directly in the model parameter using the base-model:adapter-name syntax.

Server setup

python -m sglang.launch_server \
    --model-path qwen/qwen2.5-0.5b-instruct \
    --enable-lora \
    --lora-paths adapter_a=/path/to/adapter_a adapter_b=/path/to/adapter_b

For more details on LoRA serving configuration, see the LoRA documentation.

API calls

(Recommended) Use the model:adapter syntax to specify which adapter to use:

response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct:adapter_a",  # ← base-model:adapter-name
    messages=[{"role": "user", "content": "Convert to SQL: show all users"}],
    max_tokens=50,
)

Backward compatibility: using extra_body

For backward compatibility, the older extra_body method is still supported:

# Backward compatible method
response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[{"role": "user", "content": "Convert to SQL: show all users"}],
    extra_body={"lora_path": "adapter_a"},  # ← old method
    max_tokens=50,
)

Note: When both model:adapter and extra_body["lora_path"] are specified, the model:adapter syntax takes precedence.
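
A sketch of this precedence rule, using the adapter names registered in the server setup above:

# Both forms are given here: the model:adapter suffix wins, so adapter_a is used
response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct:adapter_a",
    messages=[{"role": "user", "content": "Convert to SQL: show all users"}],
    extra_body={"lora_path": "adapter_b"},  # ignored in favor of adapter_a
    max_tokens=50,
)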

[9]:
terminate_process(server_process)