LoRA Serving#

SGLang supports LoRA adapters on top of a base model. By combining techniques from S-LoRA and Punica, SGLang can efficiently serve multiple LoRA adapters for different sequences within a single input batch.

LoRA Serving Arguments#

The following server arguments are relevant for multi-LoRA serving (a combined example follows the list):

  • enable_lora: Enables LoRA support for the model. For backward compatibility, this argument is automatically set to True if --lora-paths is provided.

  • lora_paths: The list of LoRA adapters to load. Each adapter must be specified in one of the following formats: <lora_name>=<lora_path>, or a JSON dict with schema {"lora_name": str, "lora_path": str, "pinned": bool}.

  • max_loras_per_batch: Maximum number of adapters used per batch. This argument affects the amount of GPU memory reserved for multi-LoRA serving, so it should be set to a smaller value when memory is scarce. Defaults to 8.

  • max_loaded_loras: If specified, limits the maximum number of LoRA adapters loaded in CPU memory at a time. The value must be greater than or equal to max-loras-per-batch.

  • lora_eviction_policy: LoRA adapter eviction policy used when the GPU memory pool is full. lru: least recently used (default, better cache efficiency). fifo: first-in, first-out.

  • lora_backend: The backend that runs the GEMM kernels for LoRA modules. Currently, the Triton LoRA backend (triton) and the Chunked SGMV backend (csgmv) are supported. Faster backends based on Cutlass or CUDA kernels will be added in the future.

  • max_lora_rank: The maximum LoRA rank that should be supported. If not specified, it is inferred automatically from the adapters provided in --lora-paths. This argument is needed when you expect to dynamically load adapters with a larger LoRA rank after server startup.

  • lora_target_modules: The union of all target modules where LoRA should be applied (e.g., q_proj, k_proj, gate_proj). If not specified, it is inferred automatically from the adapters provided in --lora-paths. This argument is needed when you expect to dynamically load adapters with different target modules after server startup. You can also set it to all to enable LoRA on all supported modules; however, enabling LoRA on extra modules introduces a small performance overhead. If your application is performance-sensitive, we recommend specifying only the modules for which you plan to load adapters.

  • max_lora_chunk_size: Maximum chunk size for the Chunked SGMV LoRA backend. Only used when --lora-backend is csgmv. Larger values may improve performance; tune this value as needed for your hardware and workload. Defaults to 16.

  • tp_size: SGLang supports LoRA serving together with Tensor Parallelism. tp_size controls the number of GPUs used for tensor parallelism. More details on the tensor-sharding strategy can be found in the S-LoRA paper.
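
For illustration, here is a hypothetical launch command that combines several of these arguments. The adapter names and paths are placeholders rather than real adapters, and depending on your shell the JSON form of --lora-paths may need quoting:

python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --enable-lora \
    --lora-paths adapter_a=/path/to/adapter_a \
        '{"lora_name":"adapter_b","lora_path":"/path/to/adapter_b","pinned":true}' \
    --max-loras-per-batch 4 \
    --max-lora-rank 64 \
    --lora-target-modules all \
    --lora-backend csgmv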

On the client side, users provide a list of strings as the input batch, along with a list of adapter names, one for each input sequence.

Usage#

Serving a Single Adapter#

Note: SGLang supports LoRA adapters through two APIs:

  1. OpenAI-compatible APIs (/v1/chat/completions, /v1/completions): use the model:adapter-name syntax. See Using OpenAI APIs with LoRA for examples.

  2. Native API (/generate): pass lora_path in the request body (as shown below).

[1]:
import json
import requests

from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, terminate_process
[2025-12-30 02:18:33] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:18:33] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:18:33] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2]:
server_process, port = launch_server_cmd(
    # Here we set max-loras-per-batch to 2: one slot for the adapter and another one for the base model
    """
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --enable-lora \
    --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \
    --max-loras-per-batch 2 \
    --log-level warning \
"""
)

wait_for_server(f"https://:{port}")
[2025-12-30 02:18:40] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:18:40] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:18:40] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:18:42] INFO server_args.py:1564: Attention backend not specified. Use fa3 backend by default.
[2025-12-30 02:18:42] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:18:51] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:18:51] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:18:51] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:18:51] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:18:51] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:18:51] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:18:57] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.22it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.13it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.12it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.50it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.35it/s]

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 126.85it/s]

Capturing batches (bs=1 avail_mem=18.27 GB): 100%|██████████| 3/3 [00:00<00:00,  3.56it/s]


Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in plain black, while the notebook outputs are highlighted in blue.
To reduce the length of the logs, we set the server's log level to warning; the default log level is info.
We are running these notebooks in a CI environment, so the throughput is not representative of actual performance.
[3]:
url = f"http://127.0.0.1:{port}"
json_data = {
    "text": [
        "List 3 countries and their capitals.",
        "List 3 countries and their capitals.",
    ],
    "sampling_params": {"max_new_tokens": 32, "temperature": 0},
    # The first input uses lora0, and the second input uses the base model
    "lora_path": ["lora0", None],
}
response = requests.post(
    url + "/generate",
    json=json_data,
)
print(f"Output 0: {response.json()[0]['text']}")
print(f"Output 1: {response.json()[1]['text']}")
Output 0:  Each country and capital should be on a new line.
Country: France
Capital: Paris
Country: Japan
Capital: Tokyo
Country: Australia

Output 1:  1. 2. 3.
1.  United States - Washington D.C. 2.  Japan - Tokyo 3.  Australia -
[4]:
terminate_process(server_process)

Serving Multiple Adapters#

[5]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --enable-lora \
    --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \
    lora1=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16 \
    --max-loras-per-batch 2 \
    --log-level warning \
"""
)

wait_for_server(f"https://:{port}")
[2025-12-30 02:19:17] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:19:17] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:19:17] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:19:21] INFO server_args.py:1564: Attention backend not specified. Use fa3 backend by default.
[2025-12-30 02:19:21] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:19:28] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:19:28] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:19:28] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:19:28] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:19:28] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:19:28] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:19:34] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.32it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.18it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.14it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.53it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.38it/s]

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 126.72it/s]

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 98.30it/s]

Capturing batches (bs=1 avail_mem=59.88 GB): 100%|██████████| 3/3 [00:00<00:00,  3.56it/s]


Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in plain black, while the notebook outputs are highlighted in blue.
To reduce the length of the logs, we set the server's log level to warning; the default log level is info.
We are running these notebooks in a CI environment, so the throughput is not representative of actual performance.
[6]:
url = f"http://127.0.0.1:{port}"
json_data = {
    "text": [
        "List 3 countries and their capitals.",
        "List 3 countries and their capitals.",
    ],
    "sampling_params": {"max_new_tokens": 32, "temperature": 0},
    # The first input uses lora0, and the second input uses lora1
    "lora_path": ["lora0", "lora1"],
}
response = requests.post(
    url + "/generate",
    json=json_data,
)
print(f"Output 0: {response.json()[0]['text']}")
print(f"Output 1: {response.json()[1]['text']}")
Output 0:  Each country and capital should be on a new line.
Country: France
Capital: Paris
Country: Japan
Capital: Tokyo
Country: Australia

Output 1:  Give the countries and capitals in the correct order.
Countries: Japan, Brazil, Australia
Capitals: Tokyo, Brasilia, Canberra
1. Japan -
[7]:
terminate_process(server_process)

Dynamic LoRA Loading#

In addition to specifying all adapters via --lora-paths at server startup, you can also load and unload LoRA adapters dynamically through the /load_lora_adapter and /unload_lora_adapter APIs.

When using dynamic LoRA loading, it is recommended to explicitly specify --max-lora-rank and --lora-target-modules at startup. For backward compatibility, SGLang infers these values from --lora-paths if they are not explicitly provided. In that case, however, you must ensure that all dynamically loaded adapters have shapes (rank and target modules) identical to, or strictly "smaller" than, those of the adapters in the initial --lora-paths.

[8]:
lora0 = "Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16"  # rank - 4, target modules - q_proj, k_proj, v_proj, o_proj, gate_proj
lora1 = "algoprog/fact-generation-llama-3.1-8b-instruct-lora"  # rank - 64, target modules - q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
lora0_new = "philschmid/code-llama-3-1-8b-text-to-sql-lora"  # rank - 256, target modules - q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj


# The `--lora-target-modules` param below is technically not needed, as the server will infer it from lora0 which already has all the target modules specified.
# We are adding it here just to demonstrate usage.
server_process, port = launch_server_cmd(
    """
    python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --enable-lora \
    --cuda-graph-max-bs 2 \
    --max-loras-per-batch 2 \
    --max-lora-rank 256 \
    --lora-target-modules all \
    --log-level warning
    """
)

url = f"http://127.0.0.1:{port}"
wait_for_server(url)
[2025-12-30 02:19:55] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:19:55] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:19:55] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:19:57] INFO server_args.py:1564: Attention backend not specified. Use fa3 backend by default.
[2025-12-30 02:19:57] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:19:58] LoRA backend 'csgmv' does not yet support embedding or lm_head layers; dropping 'embed_tokens' and 'lm_head' from --lora-target-modules=all. To apply LoRA to these, use --lora-backend triton.
[2025-12-30 02:20:04] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:20:04] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:20:04] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:20:04] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:20:04] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:20:04] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:20:10] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.26it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.12it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.08it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.46it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.32it/s]

Capturing batches (bs=1 avail_mem=16.16 GB): 100%|██████████| 3/3 [00:00<00:00,  3.53it/s]


Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in plain black, while the notebook outputs are highlighted in blue.
To reduce the length of the logs, we set the server's log level to warning; the default log level is info.
We are running these notebooks in a CI environment, so the throughput is not representative of actual performance.

Load adapter lora0:

[9]:
response = requests.post(
    url + "/load_lora_adapter",
    json={
        "lora_name": "lora0",
        "lora_path": lora0,
    },
)

if response.status_code == 200:
    print("LoRA adapter loaded successfully.", response.json())
else:
    print("Failed to load LoRA adapter.", response.json())
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 92.88it/s]

LoRA adapter loaded successfully. {'rid': None, 'http_worker_ipc': None, 'success': True, 'error_message': '', 'loaded_adapters': {'lora0': 'Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16'}}

Load adapter lora1:

[10]:
response = requests.post(
    url + "/load_lora_adapter",
    json={
        "lora_name": "lora1",
        "lora_path": lora1,
    },
)

if response.status_code == 200:
    print("LoRA adapter loaded successfully.", response.json())
else:
    print("Failed to load LoRA adapter.", response.json())
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 118.81it/s]

LoRA adapter loaded successfully. {'rid': None, 'http_worker_ipc': None, 'success': True, 'error_message': '', 'loaded_adapters': {'lora0': 'Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16', 'lora1': 'algoprog/fact-generation-llama-3.1-8b-instruct-lora'}}

Check the inference output:

[11]:
url = f"http://127.0.0.1:{port}"
json_data = {
    "text": [
        "List 3 countries and their capitals.",
        "List 3 countries and their capitals.",
    ],
    "sampling_params": {"max_new_tokens": 32, "temperature": 0},
    # The first input uses lora0, and the second input uses lora1
    "lora_path": ["lora0", "lora1"],
}
response = requests.post(
    url + "/generate",
    json=json_data,
)
print(f"Output from lora0: \n{response.json()[0]['text']}\n")
print(f"Output from lora1 (updated): \n{response.json()[1]['text']}\n")
Output from lora0:
 Give the capital of each country.
Country 1: Japan
Capital: Tokyo
Country 2: Australia
Capital: Canberra
Country 3: Brazil

Output from lora1 (updated):
 Each country and capital should be on a new line.
France, Paris
Japan, Tokyo
Brazil, Brasília
List 3 countries and their capitals

Unload lora0 and replace it with a different adapter:

[12]:
response = requests.post(
    url + "/unload_lora_adapter",
    json={
        "lora_name": "lora0",
    },
)

response = requests.post(
    url + "/load_lora_adapter",
    json={
        "lora_name": "lora0",
        "lora_path": lora0_new,
    },
)

if response.status_code == 200:
    print("LoRA adapter loaded successfully.", response.json())
else:
    print("Failed to load LoRA adapter.", response.json())
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 83.77it/s]

LoRA adapter loaded successfully. {'rid': None, 'http_worker_ipc': None, 'success': True, 'error_message': '', 'loaded_adapters': {'lora1': 'algoprog/fact-generation-llama-3.1-8b-instruct-lora', 'lora0': 'philschmid/code-llama-3-1-8b-text-to-sql-lora'}}

Check the output again:

[13]:
url = f"http://127.0.0.1:{port}"
json_data = {
    "text": [
        "List 3 countries and their capitals.",
        "List 3 countries and their capitals.",
    ],
    "sampling_params": {"max_new_tokens": 32, "temperature": 0},
    # The first input uses lora0, and the second input uses lora1
    "lora_path": ["lora0", "lora1"],
}
response = requests.post(
    url + "/generate",
    json=json_data,
)
print(f"Output from lora0: \n{response.json()[0]['text']}\n")
print(f"Output from lora1 (updated): \n{response.json()[1]['text']}\n")
Output from lora0:
 Country 1: Japan, Capital: Tokyo. Country 2: Australia, Capital: Canberra. Country 3: Brazil, Capital: Brasília.
A

Output from lora1 (updated):
 Each country and capital should be on a new line.
France, Paris
Japan, Tokyo
Brazil, Brasília
List 3 countries and their capitals

OpenAI-Compatible API Usage#

You can use LoRA adapters through the OpenAI-compatible APIs by specifying the adapter in the model field with the base-model:adapter-name syntax (e.g., qwen/qwen2.5-0.5b-instruct:adapter_a). For more details and examples, see the "Using LoRA adapters" section of the OpenAI API documentation: openai_api_completions.ipynb
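
As a minimal sketch (assuming the server launched above is still running on port and that an adapter named lora0 is registered on it), such a request could look like the following with the openai Python client:

import openai

# Point the client at the local SGLang server; "None" is a dummy API key.
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

# Select the LoRA adapter via the "<base-model>:<adapter-name>" syntax in the model field.
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct:lora0",
    messages=[{"role": "user", "content": "List 3 countries and their capitals."}],
    max_tokens=32,
    temperature=0,
)
print(response.choices[0].message.content)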

[14]:
terminate_process(server_process)

LoRA GPU Pinning#

Another advanced option is to mark an adapter as pinned when loading it. When an adapter is pinned, it is permanently assigned to one of the available GPU pool slots (configured by --max-loras-per-batch) and is never evicted from GPU memory at runtime. Instead, it stays resident until it is explicitly unloaded.

In scenarios where the same adapter is used frequently across requests, this can improve performance by avoiding repeated memory transfers and re-initialization overhead. However, because GPU pool slots are limited, pinning adapters reduces the system's flexibility to dynamically load other adapters on demand. Pinning too many adapters can degrade performance, or in the most extreme case (number of pinned adapters == max-loras-per-batch) stall all requests that use unpinned adapters. For this reason, SGLang currently caps the number of pinned adapters at max-loras-per-batch - 1 to prevent accidental resource starvation.

In the example below, we launch a server that loads lora0 as a pinned adapter, and lora1 and lora2 as regular (non-pinned) adapters. Note that we intentionally specify the adapters in two different formats (JSON and <name>=<path>) to demonstrate that both are supported.

[15]:
server_process, port = launch_server_cmd(
    """
    python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --enable-lora \
    --cuda-graph-max-bs 8 \
    --max-loras-per-batch 3 \
    --max-lora-rank 256 \
    --lora-target-modules all \
    --lora-paths \
        {"lora_name":"lora0","lora_path":"Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16","pinned":true} \
        {"lora_name":"lora1","lora_path":"algoprog/fact-generation-llama-3.1-8b-instruct-lora"} \
        lora2=philschmid/code-llama-3-1-8b-text-to-sql-lora \
    --log-level warning
    """
)


url = f"http://127.0.0.1:{port}"
wait_for_server(url)
[2025-12-30 02:20:33] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:20:33] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:20:33] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:20:36] INFO server_args.py:1564: Attention backend not specified. Use fa3 backend by default.
[2025-12-30 02:20:36] INFO server_args.py:2442: Set soft_watchdog_timeout since in CI
[2025-12-30 02:20:36] LoRA backend 'csgmv' does not yet support embedding or lm_head layers; dropping 'embed_tokens' and 'lm_head' from --lora-target-modules=all. To apply LoRA to these, use --lora-backend triton.
[2025-12-30 02:20:43] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:20:43] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:20:43] INFO utils.py:164: NumExpr defaulting to 16 threads.
[2025-12-30 02:20:43] INFO utils.py:148: Note: detected 112 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2025-12-30 02:20:43] INFO utils.py:151: Note: NumExpr detected 112 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2025-12-30 02:20:43] INFO utils.py:164: NumExpr defaulting to 16 threads.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-30 02:20:49] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.10/dist-packages/transformers/__init__.py)
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.32it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.19it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.15it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.54it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.39it/s]

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 99.83it/s]

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 125.82it/s]

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 82.92it/s]

Capturing batches (bs=1 avail_mem=30.93 GB): 100%|██████████| 3/3 [00:00<00:00,  3.40it/s]


Note: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in plain black, while the notebook outputs are highlighted in blue.
To reduce the length of the logs, we set the server's log level to warning; the default log level is info.
We are running these notebooks in a CI environment, so the throughput is not representative of actual performance.

You can also mark an adapter as pinned during dynamic adapter loading. In the example below, we reload lora1 as a pinned adapter:

[16]:
response = requests.post(
    url + "/unload_lora_adapter",
    json={
        "lora_name": "lora1",
    },
)

response = requests.post(
    url + "/load_lora_adapter",
    json={
        "lora_name": "lora1",
        "lora_path": "algoprog/fact-generation-llama-3.1-8b-instruct-lora",
        "pinned": True,  # Pin the adapter to GPU
    },
)
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 123.44it/s]

Verify that the results are as expected:

[17]:
url = f"http://127.0.0.1:{port}"
json_data = {
    "text": [
        "List 3 countries and their capitals.",
        "List 3 countries and their capitals.",
        "List 3 countries and their capitals.",
    ],
    "sampling_params": {"max_new_tokens": 32, "temperature": 0},
    # The three inputs use lora0, lora1, and lora2, respectively
    "lora_path": ["lora0", "lora1", "lora2"],
}
response = requests.post(
    url + "/generate",
    json=json_data,
)
print(f"Output from lora0 (pinned): \n{response.json()[0]['text']}\n")
print(f"Output from lora1 (pinned): \n{response.json()[1]['text']}\n")
print(f"Output from lora2 (not pinned): \n{response.json()[2]['text']}\n")
Output from lora0 (pinned):
 Give the capital of each country.
Country 1: Japan
Capital: Tokyo
Country 2: Australia
Capital: Canberra
Country 3: Brazil

Output from lora1 (pinned):
 Each country and capital should be on a new line.
Country: France
Capital: Paris
Country: Japan
Capital: Tokyo
Country: Australia


Output from lora2 (not pinned):
 Country 1 has a capital of Bogor? No, that's not correct. The capital of Country 1 is actually Bogor is not the capital,

[18]:
terminate_process(server_process)

Choosing a LoRA Backend#

SGLang supports two LoRA backends, which you can select with the --lora-backend argument:

  • triton: The default, baseline Triton-based backend.

  • csgmv: The Chunked SGMV backend, optimized for high-concurrency scenarios.

The csgmv backend was introduced recently to improve performance, especially under high concurrency. Our benchmarks show latency improvements of 20% to 80% over the baseline triton backend. It is currently in preview, and we expect to make it the default LoRA backend in a future release. Until then, you can opt in by manually setting the --lora-backend server argument.

[19]:
server_process, port = launch_server_cmd(
    """
    python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --enable-lora \
    --lora-backend csgmv \
    --max-loras-per-batch 16 \
    --lora-paths lora1=path/to/lora1 lora2=path/to/lora2
    """
)
[20]:
terminate_process(server_process)

Future Work#

The development roadmap for LoRA-related features can be found in this issue. Additional features, including the embedding layer, unified paging, and the Cutlass backend, are still under development.