Benchmark and Profiling#

Benchmark#

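Benchmark the latency of running a single static batch without a server. The arguments are the same as for launch_server.py. Note that this is a simplified test script without a dynamic batching server, so it may run out of memory (OOM) at a batch size that a real server can handle. A real server splits the prefill into several batches, while this simplified script does not.
- Without a server (no server launch needed)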
```
python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch 32 --input-len 256 --output-len 32
```
- With a server (first launch a server with sglang.launch_server, then run the following command)
```
python -m sglang.bench_one_batch_server --base-url http://127.0.0.1:30000 --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch-size 32 --input-len 256 --output-len 32
```

Benchmark offline throughput. This script launches an offline engine and runs the benchmark.

python3 -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10

Benchmark online serving. First launch a server with sglang.launch_server, then run the following command.
```
python3 -m sglang.bench_serving --backend sglang --num-prompts 10
```

Profile with PyTorch Profiler#

PyTorch Profiler is a convenient basic tool for inspecting kernel execution time, call stacks, and kernel overlap and occupancy.

Profile a server with `sglang.bench_serving`#

# set trace path
export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log

# start server
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct

# send profiling request from client
python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --sharegpt-output-len 100 --profile

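Make sure SGLANG_TORCH_PROFILER_DIR is set on both the server side and the client side; otherwise the trace files cannot be generated correctly. A reliable way to do this is to set SGLANG_TORCH_PROFILER_DIR in your shell's .*rc file (for example, ~/.bashrc for bash).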
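
A fail-fast check can catch a missing variable before a long benchmark run. The following is a minimal sketch, not part of SGLang: the helper name `resolve_profiler_dir` is hypothetical, and it only verifies that the variable is present in the given environment.

```python
import os
from pathlib import Path

ENV_VAR = "SGLANG_TORCH_PROFILER_DIR"


def resolve_profiler_dir(environ=None):
    """Return the trace directory from the environment, or raise if it is unset.

    Hypothetical helper for illustration; SGLang itself does not provide it.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(ENV_VAR)
    if not value:
        raise RuntimeError(
            f"{ENV_VAR} is not set; export it in the shell that launches "
            "the server and in the shell that runs the client"
        )
    return Path(value)
```

Calling `resolve_profiler_dir()` at the top of a launch script, on both the server and the client machine, surfaces the problem immediately instead of after the run.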

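For more details, refer to the serving benchmark guide.

Profile in PD disaggregation mode#

When profiling in PD disaggregation mode, the prefill and decode workers must be profiled separately because of a torch profiler limitation. The bench_serving command provides dedicated options for this.

Profile prefill workers#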

# set trace path
export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log

# start prefill and decode servers (see PD disaggregation docs for setup)
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode prefill
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode decode --port 30001 --base-gpu-id 1

# start router
python -m sglang_router.launch_router --pd-disaggregation --prefill http://127.0.0.1:30000 --decode http://127.0.0.1:30001 --host 0.0.0.0 --port 8000

# send profiling request targeting prefill workers
python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --sharegpt-output-len 100 --profile --pd-separated --profile-prefill-url http://127.0.0.1:30000

Profile decode workers#

# send profiling request targeting decode workers
python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --sharegpt-output-len 100 --profile --pd-separated --profile-decode-url http://127.0.0.1:30001

Important notes#

--profile-prefill-url and --profile-decode-url are mutually exclusive: you cannot profile both at the same time.

Both options accept multiple worker URLs for multi-instance setups.

# Profile multiple prefill workers
python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --profile --pd-separated --profile-prefill-url http://127.0.0.1:30000 http://127.0.0.1:30002

# Profile multiple decode workers
python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --profile --pd-separated --profile-decode-url http://127.0.0.1:30001 http://127.0.0.1:30003

Make sure SGLANG_TORCH_PROFILER_DIR is set on all worker nodes before starting the servers.
For more details on setting up PD disaggregation, see the PD disaggregation guide.
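
The two rules above (mutually exclusive options, each accepting one or more URLs) map naturally onto an argparse mutually exclusive group. The sketch below is a hypothetical stand-alone parser that only mirrors the flag names from this section; it is not the actual bench_serving implementation.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flag names used in this section."""
    parser = argparse.ArgumentParser(prog="profile_target_demo")
    parser.add_argument("--pd-separated", action="store_true")
    # Only one of the two URL lists may be given on a single command line.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--profile-prefill-url", nargs="+", default=None)
    group.add_argument("--profile-decode-url", nargs="+", default=None)
    return parser


def profile_targets(argv):
    """Return (worker_kind, urls) for the workers selected for profiling."""
    args = build_parser().parse_args(argv)
    if args.profile_prefill_url:
        return "prefill", args.profile_prefill_url
    if args.profile_decode_url:
        return "decode", args.profile_decode_url
    return None, []
```

Passing both options makes argparse exit with a usage error, which matches the rule that prefill and decode workers cannot be profiled in the same run.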

Profile a server with `sglang.bench_offline_throughput`#

export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log

# profile one batch with bench_one_batch.py
# batch size can be controlled with --batch argument
python3 -m sglang.bench_one_batch --model-path meta-llama/Llama-3.1-8B-Instruct --batch 32 --input-len 1024 --output-len 10 --profile

# profile multiple batches with bench_offline_throughput.py
python -m sglang.bench_offline_throughput --model-path meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompts 10 --profile --mem-frac=0.8

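Profile a server with `sglang.profiler`#

While the server is running (for example, while it is processing a decoding request), you can start live profiling immediately by sending a profile request to the server.

You can do this by running python3 -m sglang.profiler. For example: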

# Terminal 1: Send a generation request
python3 -m sglang.test.send_one

# Terminal 2: Before the above request finishes, quickly launch the following command in a separate terminal.
# It will generate a profile of the above request for several decoding batches.
python3 -m sglang.profiler

You can also combine the steps above into a single command:

python3 -m sglang.test.send_one --profile

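Profile a server with HTTP API endpoints#

SGLang provides HTTP API endpoints to control profiling on a running server. This lets you start and stop profiling programmatically, which is useful for capturing specific workload patterns.

Using the `/start_profile` endpoint#

The /start_profile endpoint starts profiling on the server. The following parameters control when profiling begins and how long it runs:

Basic usage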

# Start profiling immediately for 10 steps
curl -X POST http://127.0.0.1:30000/start_profile \
  -H "Content-Type: application/json" \
  -d '{
    "num_steps": 10
  }'

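Parameters

- output_dir (optional): Directory where profiling traces are saved. If not specified, the SGLANG_TORCH_PROFILER_DIR environment variable is used, or /tmp as the default.
- num_steps (optional): Number of steps to profile. If not specified, profiling continues until it is stopped manually with /end_profile.
- start_step (optional): Step number at which profiling starts (inclusive). Useful for skipping warmup iterations.
- activities (optional): List of activities to profile, for example ["CPU", "GPU"]. Defaults to ["CPU", "GPU"].
- merge_profiles (optional): Whether to merge distributed traces. Defaults to false.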
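
The same request can be built from Python. The following is a minimal sketch using only the standard library; the function name `build_start_profile_request` is hypothetical, and it assumes the server accepts a JSON body in which unset optional fields are simply omitted, as in the curl examples.

```python
import json
import urllib.request


def build_start_profile_request(base_url, output_dir=None, num_steps=None,
                                start_step=None, activities=None,
                                merge_profiles=None):
    """Build (but do not send) a POST request for /start_profile.

    Hypothetical helper; optional parameters left as None are omitted so
    that the server-side defaults described above apply.
    """
    fields = {
        "output_dir": output_dir,
        "num_steps": num_steps,
        "start_step": start_step,
        "activities": activities,
        "merge_profiles": merge_profiles,
    }
    payload = {key: value for key, value in fields.items() if value is not None}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/start_profile",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it against a running server, pass the result to `urllib.request.urlopen`, for example `urllib.request.urlopen(build_start_profile_request("http://127.0.0.1:30000", num_steps=10))`.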

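Note on step ranges: profiling starts at start_step (inclusive) and continues for num_steps iterations. For example, with start_step=3 and num_steps=10, profiling captures steps 3, 4, 5, 6, 7, 8, 9, 10, 11, and 12 (10 steps in total, starting from step 3).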
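
The range arithmetic can be written as a one-line Python function. This sketch only restates the rule above; `profiled_steps` is a hypothetical name, not an SGLang API.

```python
def profiled_steps(start_step, num_steps):
    """Steps captured when profiling starts at start_step (inclusive)
    and runs for num_steps iterations."""
    return range(start_step, start_step + num_steps)


steps = profiled_steps(start_step=3, num_steps=10)
print(steps[0], steps[-1], len(steps))  # prints: 3 12 10
```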

Advanced usage with start_step

# Wait 5 steps (warmup), then profile for 10 steps
curl -X POST http://127.0.0.1:30000/start_profile \
  -H "Content-Type: application/json" \
  -d '{
    "output_dir": "/tmp/profiles",
    "start_step": 5,
    "num_steps": 10,
    "activities": ["CPU", "GPU"]
  }'

Continuous profiling (manual stop)

# Start profiling without num_steps - must manually stop with /end_profile
curl -X POST http://127.0.0.1:30000/start_profile

Using the `/end_profile` endpoint#

The /end_profile endpoint stops an ongoing profiling session and saves the trace files.

# Stop profiling and save traces
curl -X POST http://127.0.0.1:30000/end_profile

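This is only needed if you started profiling without specifying num_steps. If num_steps was specified, profiling stops automatically once that many steps have run.

Example workflow#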

# Terminal 1: Start the server
export SGLANG_TORCH_PROFILER_DIR=/tmp/profiles
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct

# Terminal 2: Start continuous profiling
curl -X POST http://127.0.0.1:30000/start_profile \
  -H "Content-Type: application/json" \
  -d '{
    "start_step": 3
  }'

# Terminal 3: Send requests to generate load
python -m sglang.bench_serving --backend sglang --num-prompts 100

# Terminal 2: Stop profiling when done
curl -X POST http://127.0.0.1:30000/end_profile
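
The start/stop pairing in this workflow can be wrapped in a Python context manager so that /end_profile is always called, even if the load generator fails. This is a sketch under stated assumptions: `post_json` and `profiling` are hypothetical helpers, and the endpoints are assumed to behave as in the curl commands above (a body-less POST is accepted when no parameters are given).

```python
import contextlib
import json
import urllib.request


def post_json(url, payload=None):
    """POST an optional JSON payload and return the HTTP status code."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {} if payload is None else {"Content-Type": "application/json"}
    req = urllib.request.Request(url, data=data, headers=headers, method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.status


@contextlib.contextmanager
def profiling(base_url, start_step=None, post=post_json):
    """Start continuous profiling on entry and stop it on exit."""
    payload = None if start_step is None else {"start_step": start_step}
    post(f"{base_url}/start_profile", payload)
    try:
        yield
    finally:
        # Without num_steps, profiling only ends when /end_profile is called.
        post(f"{base_url}/end_profile")
```

A typical use is `with profiling("http://127.0.0.1:30000", start_step=3): run_load()`, where `run_load` is whatever sends requests to the server. The `post` parameter exists so the helper can be exercised without a live server.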

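Profiler trace merger for distributed traces#

SGLang now supports automatically merging profiling traces from distributed setups that use multiple parallelism types (TP, DP, PP, EP). This is particularly useful for analyzing performance across distributed runs.

Multi-node profiling and shared storage considerations#

Merging profiling output from a single node is fully supported. When profiling in a distributed environment that spans multiple nodes, all nodes should have access to shared storage (for example, NFS or Lustre) as the output directory so that the trace files can be merged.

If no shared storage is accessible across nodes, automatic merging of trace files during profiling is not currently supported.

HTTP API usage#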

# Start profiling with automatic trace merging enabled.
# Replace <BASE_URL> with the server address, e.g. http://127.0.0.1:30000.
# output_dir: where to store profile traces
# merge_profiles: optional, merges profile traces (default: false)
curl -X POST <BASE_URL>/start_profile \
  -H "Content-Type: application/json" \
  -d '{
    "output_dir": "/tmp/profiles",
    "num_steps": 10,
    "activities": ["CPU", "GPU"],
    "merge_profiles": true
  }'

Command-line usage#

# Start profiling with merge enabled
python -m sglang.profiler \
  --num-steps 10 \
  --cpu \
  --gpu \
  --output-dir /tmp/profiles \
  --merge-profiles # optional argument to merge profile traces (default=False)

Output files#

The profile merger generates:

- Per-rank trace files: {profile_id}-TP-{tp}-DP-{dp}-PP-{pp}-EP-{ep}.trace.json.gz
- Merged trace file: merged-{profile_id}.trace.json.gz
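
When post-processing an output directory, it can help to tell per-rank traces from the merged trace by name. The sketch below parses the two patterns listed above. It assumes that all four rank fields are present and are non-negative integers, which is an assumption based only on the pattern shown here; `classify_trace` is a hypothetical helper.

```python
import re

SUFFIX = ".trace.json.gz"
RANK_TRACE = re.compile(
    r"^(?P<profile_id>.+)-TP-(?P<tp>\d+)-DP-(?P<dp>\d+)"
    r"-PP-(?P<pp>\d+)-EP-(?P<ep>\d+)\.trace\.json\.gz$"
)


def classify_trace(filename):
    """Return a dict describing a trace file name, or None if it does not match."""
    if filename.startswith("merged-") and filename.endswith(SUFFIX):
        return {"kind": "merged",
                "profile_id": filename[len("merged-"):-len(SUFFIX)]}
    match = RANK_TRACE.match(filename)
    if match is None:
        return None
    info = {"kind": "rank", "profile_id": match["profile_id"]}
    # Convert the rank indices to integers for easy sorting and grouping.
    info.update({key: int(match[key]) for key in ("tp", "dp", "pp", "ep")})
    return info
```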

Possible PyTorch bug#

If you encounter the following error in any case (for example, when using Qwen 2.5 VL):

RuntimeError: !stack.empty() INTERNAL ASSERT FAILED at "/pytorch/torch/csrc/autograd/profiler_python.cpp":983, please report a bug to PyTorch. Python replay stack is empty.

This is likely the PyTorch bug reported in "Bug: vLLM Profiler" and "Bug: torch.profiler.profile". As a workaround, you can disable with_stack through an environment variable, as follows:

export SGLANG_PROFILE_WITH_STACK=False
python -m sglang.bench_offline_throughput --model-path meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompts 10 --profile --mem-frac=0.8

View traces#

Trace files can be loaded and visualized from:

- https://ui.perfetto.dev/ (any browser)
- chrome://tracing (Chrome browser only)
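
For a quick look without a browser, a trace can also be summarized in Python. The sketch below assumes the file is in the Chrome trace event format that the viewers above read (a JSON object with a `traceEvents` list, or a bare list, where complete events have `"ph": "X"` and a `dur` in microseconds). `top_events` is a hypothetical helper, not an SGLang tool.

```python
import gzip
import json
from collections import Counter


def top_events(trace_path, k=5):
    """Return the k event names with the largest total duration (microseconds)."""
    opener = gzip.open if str(trace_path).endswith(".gz") else open
    with opener(trace_path, "rt", encoding="utf-8") as f:
        trace = json.load(f)
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = Counter()
    for event in events:
        # Only "complete" events carry a duration; skip metadata and markers.
        if event.get("ph") == "X" and "dur" in event:
            totals[event.get("name", "?")] += event["dur"]
    return totals.most_common(k)
```

Note that this loads the whole trace into memory, so it is best suited to the small traces discussed in this section.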

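If the browser cannot open a trace because the file is too large, the client can generate a smaller trace file (<100MB) by limiting the number of prompts and the output length. For example, when profiling a server: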

python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 2 --sharegpt-output-len 100 --profile

This command sets the number of prompts to 2 with --num-prompts and limits the output sequence length to 100 with --sharegpt-output-len, which produces a trace file small enough for the browser to open.
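
To find out which traces exceed that size before trying to open them, a short directory scan is enough. This is a minimal sketch; `oversized_traces` is a hypothetical helper, and the 100 MB default simply reuses the figure mentioned above.

```python
from pathlib import Path


def oversized_traces(trace_dir, limit_mb=100):
    """Return the names of *.trace.json.gz files at or above limit_mb megabytes."""
    limit_bytes = limit_mb * 1024 * 1024
    return sorted(
        path.name
        for path in Path(trace_dir).glob("*.trace.json.gz")
        if path.stat().st_size >= limit_bytes
    )
```

For example, `oversized_traces("/root/sglang/profile_log")` lists the files that are likely to need a rerun with fewer prompts or a shorter output length.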

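In addition, if you want to locate the SGLang Python source code from a CUDA kernel in the trace, you need to disable CUDA Graph when starting the service. You can do this by adding the --disable-cuda-graph flag to the launch command.

Profile with Nsight#

Nsight Systems is an advanced tool that exposes more profiling details, such as register and shared memory usage, annotated code regions, and low-level CUDA APIs and events.

Prerequisites

Install it with apt, or run inside an NVIDIA Docker container or the SGLang Docker container.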

# install nsys
# https://docs.nvda.net.cn/nsight-systems/InstallationGuide/index.html
apt update
apt install -y --no-install-recommends gnupg
echo "deb http://developer.download.nvidia.com/devtools/repos/ubuntu$(source /etc/lsb-release; echo "$DISTRIB_RELEASE" | tr -d .)/$(dpkg --print-architecture) /" | tee /etc/apt/sources.list.d/nvidia-devtools.list
apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
apt update
apt install nsight-systems-cli

To profile a single batch, use:

nsys profile --trace-fork-before-exec=true --cuda-graph-trace=node python3 -m sglang.bench_one_batch --model meta-llama/Meta-Llama-3-8B --batch-size 64 --input-len 512

To profile a server, for example:

# launch the server, set the delay and duration times according to needs
# after the duration time has been used up, server will be killed by nsys

nsys profile --trace-fork-before-exec=true --cuda-graph-trace=node -o sglang.out --delay 60 --duration 70 python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disable-radix-cache

# client
python3 -m sglang.bench_serving --backend sglang --num-prompts 1000 --dataset-name random --random-input 1024 --random-output 512

In practice, we recommend setting --duration to a large value. Whenever you want the server to stop profiling, first run:

nsys sessions list

to get the session ID in the form profile-XXXXX, then run:

nsys stop --session=profile-XXXXX

to stop the profiler manually and generate the nsys-rep file immediately.

Use NVTX to annotate code regions, for example to see their execution time.

# install nvtx
pip install nvtx

# code snippet (Python)
import nvtx

with nvtx.annotate("description", color="blue"):
    # some critical code goes here
    pass
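
If the annotated code also has to run on machines where the nvtx package is not installed, the import can be made optional. The following is a minimal sketch, not an SGLang utility: the wrapper `annotate` and the example function `critical_section` are hypothetical, and the no-op fallback is a design choice made for this illustration.

```python
import contextlib

try:
    import nvtx  # optional dependency: pip install nvtx
except ImportError:
    nvtx = None


def annotate(message, color="blue"):
    """Return an NVTX range context manager, or a no-op if nvtx is missing."""
    if nvtx is None:
        return contextlib.nullcontext()
    return nvtx.annotate(message, color=color)


def critical_section(values):
    # The range appears in the Nsight Systems timeline under this name.
    with annotate("sum_of_squares", color="green"):
        return sum(v * v for v in values)
```

With this wrapper the same code runs unchanged with or without a profiler attached.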

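Layer-wise NVTX profiling with Nsight Systems#

SGLang provides built-in layer-wise NVTX annotations that can be combined with the CUDA Profiler for detailed per-layer profiling in Nsight Systems. This is particularly useful for identifying performance bottlenecks at the layer level.

Using `--enable-layerwise-nvtx-marker` with Nsight Systems and `/start_profile`#

The --enable-layerwise-nvtx-marker flag automatically adds NVTX markers to every layer in the model. This is especially powerful when combined with Nsight Systems profiling to see detailed per-layer performance.

Method 1: Use /start_profile with CUDA_PROFILER (for programmatic control)

This method lets you control exactly when profiling starts and stops through the HTTP API while Nsight Systems is running.

Launch the server under Nsight Systems with layer-wise NVTX enabled: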

# Terminal 1: Start server with nsys and capture-range option
nsys profile --trace-fork-before-exec=true \
  --cuda-graph-trace=node \
  --capture-range=cudaProfilerApi \
  --capture-range-end=stop \
  -o layerwise_profile \
  python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --enable-layerwise-nvtx-marker \
    --disable-cuda-graph

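Note: NVTX markers are not emitted for kernel launches captured by CUDA graphs. Use --disable-cuda-graph to ensure that all layer-wise NVTX markers are emitted in the trace.

In another terminal, control profiling through /start_profile with the CUDA_PROFILER activity: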

# Terminal 2: Wait for server to be ready, then start CUDA profiling
# Wait 3 steps for warmup, then profile for 10 steps
curl -X POST http://127.0.0.1:30000/start_profile \
  -H "Content-Type: application/json" \
  -d '{
    "start_step": 3,
    "num_steps": 10,
    "activities": ["CUDA_PROFILER"]
  }'

Send requests to generate load:

# Terminal 3: Generate workload
python -m sglang.bench_serving --backend sglang --num-prompts 100

Profiling stops automatically after 10 steps (because num_steps is set to 10). If you did not specify num_steps, you need to stop it manually:
```
# Terminal 2: Only needed if num_steps was not specified
curl -X POST http://127.0.0.1:30000/end_profile
```

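The --capture-range=cudaProfilerApi option tells Nsight Systems to capture data only between the cudaProfilerStart() and cudaProfilerStop() calls (triggered by /start_profile and /end_profile), which reduces overhead and file size. The start_step parameter skips the first 3 steps to avoid capturing warmup overhead.

Method 2: A simpler approach without the /start_profile API

For simpler use cases that do not need fine-grained control over when profiling starts and stops, you can capture the entire workload with Nsight Systems: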

# Terminal 1: Start server with layerwise NVTX
# Note: --disable-cuda-graph ensures all NVTX markers are emitted
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --enable-layerwise-nvtx-marker \
  --disable-cuda-graph

# Terminal 2: Profile the benchmarking client
nsys profile --trace-fork-before-exec=true \
  --cuda-graph-trace=node \
  -o layerwise_profile \
  python -m sglang.bench_serving --backend sglang --num-prompts 10

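This method profiles the entire client execution, including all server interactions. The layer-wise NVTX markers will be visible in the Nsight Systems timeline.

View the profiling results

Open the generated .qdrep file with Nsight Systems (recent Nsight Systems versions write a .nsys-rep file instead, so adjust the file name accordingly):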

nsys-ui layerwise_profile.qdrep

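In the Nsight Systems GUI, you will see:

- NVTX ranges: Each layer appears as a labeled range on the timeline, with details in the marker metadata
- CUDA kernels: All GPU kernels are shown together with the layer annotations
- Layer hierarchy: The full module path (for example, meta-llama/Meta-Llama-3.1-8B-Instruct.model.layers.0.self_attn.qkv_proj) helps identify specific layers. The prefix is the full model path from --model-path.
- Tensor shapes: Input/output dimensions and parameter shapes are included in the NVTX marker data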
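
When marker names are exported for scripting, the model-path prefix has to be stripped before the module path can be split, because the model name itself may contain dots (as in "3.1"). The sketch below works only from the label format shown above; `parse_layer_marker` is a hypothetical helper, and the `layers.<index>` naming is taken from the example label rather than guaranteed for every model.

```python
import re


def parse_layer_marker(label, model_path):
    """Split an NVTX label into (module_path, layer_index).

    layer_index is None when the module path has no 'layers.<n>' component.
    """
    prefix = model_path + "."
    if not label.startswith(prefix):
        raise ValueError(f"label does not start with {prefix!r}")
    module_path = label[len(prefix):]
    match = re.search(r"(?:^|\.)layers\.(\d+)(?:\.|$)", module_path)
    layer_index = int(match.group(1)) if match else None
    return module_path, layer_index
```

Grouping ranges by `layer_index` then gives a per-layer view, and grouping by the last path component gives a per-operator view.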

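Benefits of layer-wise NVTX profiling

- Granular visibility: See exactly which layers take the most time
- Memory tracking: Identify layers with large memory allocations
- Bottleneck identification: Quickly locate inefficient operations
- Communication overhead: In multi-GPU setups, see per-layer communication costs
- Development debugging: Verify that model architecture changes have the expected performance impact

Other tips#

You can benchmark a model with dummy (random) weights by providing only the config.json file. This allows quick testing of model variants without training. To do so, add --load-format dummy to the commands above; you then only need a correct config.json in the checkpoint folder.

You can benchmark a model with a modified config (for example, fewer layers) by using --json-model-override-args. For example, you can benchmark a model with only 1 layer and 1 KV head using: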

python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch 32 --input-len 256 --output-len 32 --load-format dummy --json-model-override-args '{"num_hidden_layers": 1, "num_key_value_heads": 1}'
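
When sweeping over several override configurations, building the command in Python avoids hand-quoting the JSON. The sketch below only assembles the command line shown above from its parts; `bench_one_batch_command` is a hypothetical helper, and every flag in it is copied from this document.

```python
import json
import shlex


def bench_one_batch_command(model_path, overrides, batch=32,
                            input_len=256, output_len=32):
    """Return the argument list for a dummy-weight bench_one_batch run."""
    return [
        "python", "-m", "sglang.bench_one_batch",
        "--model-path", model_path,
        "--batch", str(batch),
        "--input-len", str(input_len),
        "--output-len", str(output_len),
        "--load-format", "dummy",
        # json.dumps guarantees valid JSON; shlex.join adds the shell quoting.
        "--json-model-override-args", json.dumps(overrides),
    ]


args = bench_one_batch_command(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    {"num_hidden_layers": 1, "num_key_value_heads": 1},
)
print(shlex.join(args))
```

The argument list can be passed directly to `subprocess.run(args)`, in which case no shell quoting is needed at all.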

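You can use --python-backtrace=cuda to see the Python call stack for all CUDA kernels, as in PyTorch Profiler. (Caution: this can cause inaccurately long kernel runtimes when timing is based on CUDA events.)
For more arguments, see the Nsight Systems User Guide.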
