LoRA Serving#
SGLang enables the use of LoRA adapters on top of a base model. By incorporating techniques from S-LoRA and Punica, SGLang can efficiently serve multiple LoRA adapters for different sequences within a single input batch.
Arguments for LoRA Serving#
The following server arguments are relevant for multi-LoRA serving:
enable_lora: Enable LoRA support for the model. For backward compatibility, this argument is automatically set to True if --lora-paths is provided.
lora_paths: The list of LoRA adapters to load. Each adapter must be specified in one of the following formats: <name>=<path>, <path>, or a JSON object with the schema {"lora_name":str,"lora_path":str,"pinned":bool}.
max_loras_per_batch: Maximum number of adapters used by each batch. This argument affects the amount of GPU memory reserved for multi-LoRA serving, so it should be set to a smaller value when memory is scarce. Defaults to 8.
max_loaded_loras: If specified, limits the maximum number of LoRA adapters loaded into CPU memory at a time. The value must be greater than or equal to max-loras-per-batch.
lora_eviction_policy: The LoRA adapter eviction policy used when the GPU memory pool is full. lru: Least Recently Used (default, better cache efficiency). fifo: First-In-First-Out.
lora_backend: The backend for running the GEMM kernels of LoRA modules. We currently support the Triton LoRA backend (triton) and the Chunked SGMV backend (csgmv). Faster backends based on Cutlass or CUDA kernels will be added in the future.
max_lora_rank: The maximum LoRA rank that should be supported. If not specified, it is inferred automatically from the adapters provided in --lora-paths. This argument is needed when you expect to dynamically load adapters with a larger LoRA rank after server startup.
lora_target_modules: The union of all target modules where LoRA should be applied (e.g., q_proj, k_proj, gate_proj). If not specified, it is inferred automatically from the adapters provided in --lora-paths. This argument is needed when you expect to dynamically load adapters with different target modules after server startup. You can also set it to all to enable LoRA on all supported modules; however, enabling LoRA on additional modules introduces a small performance overhead. If your application is performance-sensitive, we recommend specifying only the modules for which you plan to load adapters.
max_lora_chunk_size: The maximum chunk size for the Chunked SGMV LoRA backend. Only used when --lora-backend is csgmv. Choosing a larger value may improve performance; tune this value based on your hardware and workload. Defaults to 16.
tp_size: SGLang supports LoRA serving together with tensor parallelism. tp_size controls the number of GPUs used for tensor parallelism. More details on the tensor sharding strategy can be found in the S-LoRA paper.
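For illustration only, here is a minimal launch sketch that combines several of these arguments and shows the three --lora-paths formats side by side. The adapter names (adapter_a, adapter_b, adapter_c) and the /path/to/... paths are placeholders rather than real adapters; adjust them to your own setup.
from sglang.test.doc_patch import launch_server_cmd

# Hypothetical launch sketch: adapter names and paths below are placeholders.
# adapter_a uses the <name>=<path> format, adapter_b the bare <path> format,
# and adapter_c the JSON format (pinned to a GPU pool slot).
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --enable-lora \
    --lora-paths adapter_a=/path/to/adapter_a \
        /path/to/adapter_b \
        {"lora_name":"adapter_c","lora_path":"/path/to/adapter_c","pinned":true} \
    --max-loras-per-batch 4 \
    --max-lora-rank 64 \
    --lora-target-modules all \
    --lora-backend csgmv \
    --max-lora-chunk-size 16
"""
)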
On the client side, users provide a list of strings as the input batch, together with a list of adapter names, one per input sequence.
Usage#
Serving Single Adapter#
NOTE: SGLang supports LoRA adapters through two APIs:
OpenAI-Compatible APIs (/v1/chat/completions, /v1/completions): use the model:adapter-name syntax. See Using OpenAI API with LoRA for examples.
Native API (/generate): pass lora_path in the request body (shown below).
[1]:
import json
import requests
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, terminate_process
[2025-12-30 02:18:33] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:18:33] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:18:33] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2]:
server_process, port = launch_server_cmd(
    # Here we set max-loras-per-batch to 2: one slot for the adapter and another one for the base model
"""
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--enable-lora \
--lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \
--max-loras-per-batch 2 \
--log-level warning \
"""
)
wait_for_server(f"https://:{port}")
[2025-12-30 02:18:40] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:18:40] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:18:40] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:18:42] INFO server_args.py:1564: Attention backend not specified. Use fa3 backend by default.
[2025-12-30 02:18:42] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:18:51] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:18:51] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:18:51] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:18:51] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:18:51] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:18:51] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:18:57] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.22it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.13it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.12it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.50it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.35it/s]
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 126.85it/s]
Capturing batches (bs=1 avail_mem=18.27 GB): 100%|██████████| 3/3 [00:00<00:00, 3.56it/s]
NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in their original black color, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the server log level to warning; the default log level is info.
We are running these notebooks in a CI environment, so the throughput is not representative of actual performance.
[3]:
url = f"http://127.0.0.1:{port}"
json_data = {
"text": [
"List 3 countries and their capitals.",
"List 3 countries and their capitals.",
],
"sampling_params": {"max_new_tokens": 32, "temperature": 0},
# The first input uses lora0, and the second input uses the base model
"lora_path": ["lora0", None],
}
response = requests.post(
url + "/generate",
json=json_data,
)
print(f"Output 0: {response.json()[0]['text']}")
print(f"Output 1: {response.json()[1]['text']}")
Output 0: Each country and capital should be on a new line.
Country: France
Capital: Paris
Country: Japan
Capital: Tokyo
Country: Australia
Output 1: 1. 2. 3.
1. United States - Washington D.C. 2. Japan - Tokyo 3. Australia -
[4]:
terminate_process(server_process)
Serving Multiple Adapters#
[5]:
server_process, port = launch_server_cmd(
"""
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--enable-lora \
--lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \
lora1=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16 \
--max-loras-per-batch 2 \
--log-level warning \
"""
)
wait_for_server(f"https://:{port}")
[2025-12-30 02:19:17] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:19:17] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:19:17] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:19:21] INFO server_args.py:1564: Attention backend not specified. Use fa3 backend by default.
[2025-12-30 02:19:21] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:19:28] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:19:28] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:19:28] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:19:28] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:19:28] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:19:28] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:19:34] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.32it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.18it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.14it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.53it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.38it/s]
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 126.72it/s]
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 98.30it/s]
Capturing batches (bs=1 avail_mem=59.88 GB): 100%|██████████| 3/3 [00:00<00:00, 3.56it/s]
NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in their original black color, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the server log level to warning; the default log level is info.
We are running these notebooks in a CI environment, so the throughput is not representative of actual performance.
[6]:
url = f"http://127.0.0.1:{port}"
json_data = {
"text": [
"List 3 countries and their capitals.",
"List 3 countries and their capitals.",
],
"sampling_params": {"max_new_tokens": 32, "temperature": 0},
# The first input uses lora0, and the second input uses lora1
"lora_path": ["lora0", "lora1"],
}
response = requests.post(
url + "/generate",
json=json_data,
)
print(f"Output 0: {response.json()[0]['text']}")
print(f"Output 1: {response.json()[1]['text']}")
Output 0: Each country and capital should be on a new line.
Country: France
Capital: Paris
Country: Japan
Capital: Tokyo
Country: Australia
Output 1: Give the countries and capitals in the correct order.
Countries: Japan, Brazil, Australia
Capitals: Tokyo, Brasilia, Canberra
1. Japan -
[7]:
terminate_process(server_process)
Dynamic LoRA Loading#
In addition to specifying all adapters at server startup via --lora-paths, you can also load and unload LoRA adapters dynamically through the /load_lora_adapter and /unload_lora_adapter APIs.
When using dynamic LoRA loading, it is recommended to explicitly specify --max-lora-rank and --lora-target-modules at startup. For backward compatibility, SGLang infers these values from --lora-paths if they are not explicitly provided. In that case, however, you must ensure that all dynamically loaded adapters have the same shape (rank and target modules) as those in the initial --lora-paths, or are strictly "smaller".
[8]:
lora0 = "Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16" # rank - 4, target modules - q_proj, k_proj, v_proj, o_proj, gate_proj
lora1 = "algoprog/fact-generation-llama-3.1-8b-instruct-lora" # rank - 64, target modules - q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
lora0_new = "philschmid/code-llama-3-1-8b-text-to-sql-lora" # rank - 256, target modules - q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
# The `--lora-target-modules` param below is technically not needed, as the server will infer it from lora0 which already has all the target modules specified.
# We are adding it here just to demonstrate usage.
server_process, port = launch_server_cmd(
"""
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--enable-lora \
--cuda-graph-max-bs 2 \
--max-loras-per-batch 2 \
--max-lora-rank 256 \
--lora-target-modules all \
--log-level warning
"""
)
url = f"http://127.0.0.1:{port}"
wait_for_server(url)
[2025-12-30 02:19:55] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:19:55] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:19:55] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:19:57] INFO server_args.py:1564: Attention backend not specified. Use fa3 backend by default.
[2025-12-30 02:19:57] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:19:58] LoRA backend 'csgmv' does not yet support embedding or lm_head layers; dropping 'embed_tokens' and 'lm_head' from --lora-target-modules=all. To apply LoRA to these, use --lora-backend triton.
[2025-12-30 02:20:04] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:20:04] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:20:04] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:20:04] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:20:04] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:20:04] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:20:10] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.26it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.12it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.08it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.46it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.32it/s]
Capturing batches (bs=1 avail_mem=16.16 GB): 100%|██████████| 3/3 [00:00<00:00, 3.53it/s]
NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in their original black color, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the server log level to warning; the default log level is info.
We are running these notebooks in a CI environment, so the throughput is not representative of actual performance.
Load adapter lora0:
[9]:
response = requests.post(
url + "/load_lora_adapter",
json={
"lora_name": "lora0",
"lora_path": lora0,
},
)
if response.status_code == 200:
print("LoRA adapter loaded successfully.", response.json())
else:
print("Failed to load LoRA adapter.", response.json())
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 92.88it/s]
LoRA adapter loaded successfully. {'rid': None, 'http_worker_ipc': None, 'success': True, 'error_message': '', 'loaded_adapters': {'lora0': 'Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16'}}
Load adapter lora1:
[10]:
response = requests.post(
url + "/load_lora_adapter",
json={
"lora_name": "lora1",
"lora_path": lora1,
},
)
if response.status_code == 200:
print("LoRA adapter loaded successfully.", response.json())
else:
print("Failed to load LoRA adapter.", response.json())
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 118.81it/s]
LoRA adapter loaded successfully. {'rid': None, 'http_worker_ipc': None, 'success': True, 'error_message': '', 'loaded_adapters': {'lora0': 'Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16', 'lora1': 'algoprog/fact-generation-llama-3.1-8b-instruct-lora'}}
Check inference output:
[11]:
url = f"http://127.0.0.1:{port}"
json_data = {
"text": [
"List 3 countries and their capitals.",
"List 3 countries and their capitals.",
],
"sampling_params": {"max_new_tokens": 32, "temperature": 0},
# The first input uses lora0, and the second input uses lora1
"lora_path": ["lora0", "lora1"],
}
response = requests.post(
url + "/generate",
json=json_data,
)
print(f"Output from lora0: \n{response.json()[0]['text']}\n")
print(f"Output from lora1 (updated): \n{response.json()[1]['text']}\n")
Output from lora0:
Give the capital of each country.
Country 1: Japan
Capital: Tokyo
Country 2: Australia
Capital: Canberra
Country 3: Brazil
Output from lora1:
Each country and capital should be on a new line.
France, Paris
Japan, Tokyo
Brazil, Brasília
List 3 countries and their capitals
Unload lora0 and replace it with a different adapter:
[12]:
response = requests.post(
url + "/unload_lora_adapter",
json={
"lora_name": "lora0",
},
)
response = requests.post(
url + "/load_lora_adapter",
json={
"lora_name": "lora0",
"lora_path": lora0_new,
},
)
if response.status_code == 200:
print("LoRA adapter loaded successfully.", response.json())
else:
print("Failed to load LoRA adapter.", response.json())
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 83.77it/s]
LoRA adapter loaded successfully. {'rid': None, 'http_worker_ipc': None, 'success': True, 'error_message': '', 'loaded_adapters': {'lora1': 'algoprog/fact-generation-llama-3.1-8b-instruct-lora', 'lora0': 'philschmid/code-llama-3-1-8b-text-to-sql-lora'}}
Check output again:
[13]:
url = f"http://127.0.0.1:{port}"
json_data = {
"text": [
"List 3 countries and their capitals.",
"List 3 countries and their capitals.",
],
"sampling_params": {"max_new_tokens": 32, "temperature": 0},
# The first input uses lora0, and the second input uses lora1
"lora_path": ["lora0", "lora1"],
}
response = requests.post(
url + "/generate",
json=json_data,
)
print(f"Output from lora0: \n{response.json()[0]['text']}\n")
print(f"Output from lora1 (updated): \n{response.json()[1]['text']}\n")
Output from lora0 (updated):
Country 1: Japan, Capital: Tokyo. Country 2: Australia, Capital: Canberra. Country 3: Brazil, Capital: Brasília.
A
Output from lora1:
Each country and capital should be on a new line.
France, Paris
Japan, Tokyo
Brazil, Brasília
List 3 countries and their capitals
Using LoRA via OpenAI-Compatible APIs#
You can use LoRA adapters through the OpenAI-compatible APIs by specifying the adapter in the model field with the base-model:adapter-name syntax (e.g., qwen/qwen2.5-0.5b-instruct:adapter_a). For more details and examples, see the "Using LoRA adapters" section of the OpenAI API documentation: openai_api_completions.ipynb.
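As a minimal sketch (not executed in this notebook), a chat-completions request against a server like the one above could look as follows. The choice of lora1, the prompt, and the payload fields are illustrative and assume the standard OpenAI chat-completions schema.
import requests

# Hypothetical sketch: query the OpenAI-compatible chat API using the
# base-model:adapter-name syntax described above. Assumes the server from the
# previous cells is still running on `port` and that an adapter named `lora1`
# is loaded.
url = f"http://127.0.0.1:{port}"
payload = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct:lora1",
    "messages": [
        {"role": "user", "content": "List 3 countries and their capitals."}
    ],
    "max_tokens": 32,
}
response = requests.post(url + "/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])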
[14]:
terminate_process(server_process)
LoRA GPU Pinning#
Another advanced option is to mark an adapter as pinned when it is loaded. When an adapter is pinned, it is permanently assigned to one of the available GPU pool slots (configured by --max-loras-per-batch) and is never evicted from GPU memory at runtime. Instead, it stays resident until it is explicitly unloaded.
This can improve performance in scenarios where the same adapter is used frequently across requests, by avoiding repeated memory transfers and re-initialization overhead. However, since GPU pool slots are limited, pinning adapters reduces the system's flexibility to dynamically load other adapters on demand. Pinning too many adapters can degrade performance or, in the most extreme case (number of pinned adapters == max-loras-per-batch), stall all requests that do not use pinned adapters. Therefore, SGLang currently caps the maximum number of pinned adapters at max-loras-per-batch - 1 to prevent accidental resource starvation.
In the example below, we launch a server with lora0 loaded as a pinned adapter, and lora1 and lora2 loaded as regular (non-pinned) adapters. Note that we deliberately specify lora1 and lora2 in two different formats to demonstrate that both are supported.
[15]:
server_process, port = launch_server_cmd(
"""
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--enable-lora \
--cuda-graph-max-bs 8 \
--max-loras-per-batch 3 \
--max-lora-rank 256 \
--lora-target-modules all \
--lora-paths \
{"lora_name":"lora0","lora_path":"Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16","pinned":true} \
{"lora_name":"lora1","lora_path":"algoprog/fact-generation-llama-3.1-8b-instruct-lora"} \
lora2=philschmid/code-llama-3-1-8b-text-to-sql-lora \
--log-level warning
"""
)
url = f"http://127.0.0.1:{port}"
wait_for_server(url)
[2025-12-30 02:20:33] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:20:33] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:20:33] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:20:36] INFO server_args.py:1564: Attention backend not specified. Use fa3 backend by default.
[2025-12-30 02:20:36] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:20:36] LoRA backend 'csgmv' does not yet support embedding or lm_head layers; dropping 'embed_tokens' and 'lm_head' from --lora-target-modules=all. To apply LoRA to these, use --lora-backend triton.
[2025-12-30 02:20:43] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:20:43] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:20:43] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:20:43] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:20:43] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:20:43] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:20:49] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.32it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.19it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.15it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.54it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.39it/s]
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 99.83it/s]
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 125.82it/s]
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 82.92it/s]
Capturing batches (bs=1 avail_mem=30.93 GB): 100%|██████████| 3/3 [00:00<00:00, 3.40it/s]
NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in their original black color, while the notebook outputs are highlighted in blue.
To reduce the log length, we set the server log level to warning; the default log level is info.
We are running these notebooks in a CI environment, so the throughput is not representative of actual performance.
You can also mark an adapter as pinned during dynamic adapter loading. In the example below, we reload lora1 as a pinned adapter:
[16]:
response = requests.post(
url + "/unload_lora_adapter",
json={
"lora_name": "lora1",
},
)
response = requests.post(
url + "/load_lora_adapter",
json={
"lora_name": "lora1",
"lora_path": "algoprog/fact-generation-llama-3.1-8b-instruct-lora",
"pinned": True, # Pin the adapter to GPU
},
)
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 123.44it/s]
Verify that the results are as expected:
[17]:
url = f"http://127.0.0.1:{port}"
json_data = {
"text": [
"List 3 countries and their capitals.",
"List 3 countries and their capitals.",
"List 3 countries and their capitals.",
],
"sampling_params": {"max_new_tokens": 32, "temperature": 0},
    # The three inputs use lora0, lora1, and lora2, respectively
"lora_path": ["lora0", "lora1", "lora2"],
}
response = requests.post(
url + "/generate",
json=json_data,
)
print(f"Output from lora0 (pinned): \n{response.json()[0]['text']}\n")
print(f"Output from lora1 (pinned): \n{response.json()[1]['text']}\n")
print(f"Output from lora2 (not pinned): \n{response.json()[2]['text']}\n")
Output from lora0 (pinned):
Give the capital of each country.
Country 1: Japan
Capital: Tokyo
Country 2: Australia
Capital: Canberra
Country 3: Brazil
Output from lora1 (pinned):
Each country and capital should be on a new line.
Country: France
Capital: Paris
Country: Japan
Capital: Tokyo
Country: Australia
Output from lora2 (not pinned):
Country 1 has a capital of Bogor? No, that's not correct. The capital of Country 1 is actually Bogor is not the capital,
[18]:
terminate_process(server_process)
Selecting a LoRA Backend#
SGLang supports two LoRA backends, which you can select with the --lora-backend argument:
triton: The default Triton-based baseline backend.
csgmv: The Chunked SGMV backend, optimized for high-concurrency scenarios.
The csgmv backend was introduced recently to improve performance, especially under high concurrency. Our benchmarks show latency improvements of 20% to 80% over the baseline triton backend. It is currently in preview, and we expect to make it the default LoRA backend in a future release. Until then, you can adopt it by manually setting the --lora-backend server argument.
[19]:
server_process, port = launch_server_cmd(
"""
python3 -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--enable-lora \
--lora-backend csgmv \
--max-loras-per-batch 16 \
--lora-paths lora1=path/to/lora1 lora2=path/to/lora2
"""
)
[20]:
terminate_process(server_process)
Future Work#
The development roadmap for LoRA-related features can be found in this issue. Other features, including the embedding layer, unified paging, and a Cutlass backend, are still under development.