DeepSeek V3.2 使用指南#

DeepSeek-V3.2 模型系列通过持续训练，为 DeepSeek-V3.1-Terminus 配备了 DeepSeek 稀疏注意力（DSA）机制。借助由闪电索引器（lightning indexer）驱动的细粒度稀疏注意力机制 DSA，DeepSeek-V3.2 在长上下文场景中实现了效率提升。

有关报告问题或跟踪即将推出的功能，请参阅此路线图（Roadmap）。

注：本文档最初是为 DeepSeek-V3.2-Exp 模型的使用而编写的。DeepSeek-V3.2 或 DeepSeek-V3.2-Speciale 的用法与 DeepSeek-V3.2-Exp 相同，仅工具调用解析器（tool call parser）有所不同。

安装#

Docker#

# H200/B200
docker pull lmsysorg/sglang:latest

# MI350/MI355
docker pull lmsysorg/sglang:dsv32-rocm

# NPUs
docker pull lmsysorg/sglang:dsv32-a2
docker pull lmsysorg/sglang:dsv32-a3

从源码编译#

# Install SGLang
git clone https://github.com/sgl-project/sglang
cd sglang
pip3 install pip --upgrade
pip3 install -e "python"

使用 SGLang 启动 DeepSeek V3.2#

在 8xH200/B200 GPU 上部署 DeepSeek-V3.2-Exp

# Launch with TP + DP (Recommended)
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention

# Launch with EP + DP
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --ep 8 --dp 8 --enable-dp-attention

# Launch with Pure TP
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8

配置技巧#

DP 注意力（推荐）：对于 DeepSeek V3.2 模型，算子针对 dp_size=8 的用例进行了定制，因此 DP 注意力（--dp 8 --enable-dp-attention）是推荐配置，可获得更好的稳定性和性能。所有测试案例默认均使用此配置。
纯 TP 模式：也支持使用纯 TP（不带 --dp 和 --enable-dp-attention）启动。请注意，此模式在 PD 分离场景中尚未经过充分验证。
短序列 MHA Prefill（自适应）：对于短 prefill 序列（默认阈值：2048 tokens），NSA 后端会自动使用标准 MHA（无需额外标志）。在 H200 (SM90) 上，此路径使用 FlashAttention 变长算子；在 B200 (SM100) 上，它使用 TRT-LLM ragged MHA。MHA 使用 MHA_ONE_SHOT 以获得最佳性能。MHA_ONE_SHOT 在单次算子调用中计算所有 token（包括缓存的前缀和新扩展的 token）的多头注意力，避免了分块 KV 缓存处理的开销。对于总序列长度在分块容量限制内的短序列，这可以实现最佳吞吐量。
注意力算子选择：DeepSeek V3.2 模型的注意力后端会自动设置为 nsa。在该后端中，实现了用于稀疏 prefilling/decoding 的不同算子，可以通过 --nsa-prefill-backend 和 --nsa-decode-backend 服务器参数指定。nsa prefill/decode 注意力算子的选项包括：
- flashmla_sparse：来自 flash_mla 库的 flash__mla_sparse_fwd 算子。可在 Hopper 和 Blackwell GPU 上运行。需要 bf16 格式的 q, kv 输入。
- flashmla_kv：来自 flash_mla 库的 flash_mla_with_kvcache 算子。可在 Hopper 和 Blackwell GPU 上运行。需要 bf16 格式的 q 和 fp8 格式的 k_cache 输入。
- fa3：来自 flash_attn 库的 flash_attn_with_kvcache 算子。仅能在 Hopper GPU 上运行。需要 bf16 格式的 q, kv 输入。
- tilelang：可在 GPU、HPU 和 NPU 上运行的 tilelang 实现。
- aiter：AMD HPU 上的 Aiter 算子。仅能用作 decode 算子。
基于性能基准测试，H200 和 B200 的默认配置设置如下：
- H200：flashmla_sparse prefill 注意力（短序列 prefill 通过 FlashAttention varlen 使用 MHA），fa3 decode 注意力，bf16 KV 缓存数据类型。
- B200：flashmla_auto prefill 注意力（短序列 prefill 通过 TRT-LLM ragged 使用 MHA），flashmla_kv decode 注意力，fp8_e4m3 KV 缓存数据类型。flashmla_auto 允许根据 KV 缓存数据类型、硬件和启发式算法自动选择 flashmla_sparse 或 flashmla_kv 算子进行 prefill。当启用 FP8 KV 缓存且 total_kv_tokens < total_q_tokens * 512 时，它使用 flashmla_sparse 算子；否则，它将退而使用 flashmla_kv 算子。如果 flashmla_sparse 或 flashmla_kv 算子的性能发生显著变化，可能需要调整这些启发式规则。

多 Token 预测#

SGLang 为 DeepSeek V3.2 实现了基于 EAGLE 投机采样的多 Token 预测 (MTP)。通过此优化，在小 Batch Size 下解码速度可以得到显著提升。更多信息请查看此 PR。

搭配 DP 注意力的使用示例

python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4

搭配纯 TP 的使用示例

python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4

--speculative-num-steps、--speculative-eagle-topk 和 --speculative-num-draft-tokens 的最佳配置可以使用 bench_speculative.py 脚本针对特定 Batch Size 进行搜索。最低配置为 --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2，这可以在较大的 Batch Size 下实现加速。
对于 MTP，--max-running-requests 的默认值设为 48。对于更大的 Batch Size，应将此值增加到默认值以上。

提示

要为 EAGLE 投机采样启用实验性的重叠调度器（overlap scheduler），请设置环境变量 SGLANG_ENABLE_SPEC_V2=1。通过在草稿（draft）和验证（verification）阶段之间实现重叠调度，这可以提高性能。

函数调用与推理解析器#

函数调用和推理解析器的用法与 DeepSeek V3.1 相同。请参考推理解析器和工具解析器文档。

启动带函数调用和推理解析器的 DeepSeek-V3.2-Exp

注意：建议指定 chat-template，并确保您位于 sglang 的根目录下。

python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3.2-Exp \
  --trust-remote-code \
  --tp-size 8 --dp-size 8 --enable-dp-attention \
  --tool-call-parser deepseekv31 \
  --reasoning-parser deepseek-v3 \
  --chat-template ./examples/chat_template/tool_chat_template_deepseekv32.jinja

启动带函数调用和推理解析器的 DeepSeek-V3.2

python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3.2 \
  --trust-remote-code \
  --tp-size 8 --dp-size 8 --enable-dp-attention \
  --tool-call-parser deepseekv32 \
  --reasoning-parser deepseek-v3

DeepSeek-V3.2-Speciale 不支持工具调用，因此只能使用推理解析器启动

python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3.2-Speciale \
  --trust-remote-code \
  --tp-size 8 --dp-size 8 --enable-dp-attention \
  --reasoning-parser deepseek-v3

PD 分离#

Prefill 命令

python -m sglang.launch_server \
        --model-path deepseek-ai/DeepSeek-V3.2-Exp \
        --disaggregation-mode prefill \
        --host $LOCAL_IP \
        --port $PORT \
        --tp 8 \
        --dp 8 \
        --enable-dp-attention \
        --dist-init-addr ${HOST}:${DIST_PORT} \
        --trust-remote-code \
        --disaggregation-bootstrap-port 8998 \
        --mem-fraction-static 0.9 \

Decode 命令

python -m sglang.launch_server \
        --model-path deepseek-ai/DeepSeek-V3.2-Exp \
        --disaggregation-mode decode \
        --host $LOCAL_IP \
        --port $PORT \
        --tp 8 \
        --dp 8 \
        --enable-dp-attention \
        --dist-init-addr ${HOST}:${DIST_PORT} \
        --trust-remote-code \
        --mem-fraction-static 0.9 \

Router 命令

python -m sglang_router.launch_router --pd-disaggregation \
  --prefill $PREFILL_ADDR 8998 \
  --decode $DECODE_ADDR \
  --host 127.0.0.1 \
  --port 8000 \

如果您需要更高级的部署方法或生产级部署方法（例如基于 RBG 或 LWS 的部署），请参考 references/multi_node_deployment/rbg_pd/deepseekv32_pd.md。此外，您还可以在上述文档中找到基于 DeepEP 的 EP 并行启动命令。

基准测试结果#

使用 `gsm8k` 进行准确率测试#

可以使用 gsm8k 数据集进行简单的准确率基准测试

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319

结果为 0.956，符合我们的预期

Accuracy: 0.956
Invalid: 0.000
Latency: 25.109 s
Output throughput: 5226.235 token/s

要测试长上下文准确率，请使用 --num-shots 20 运行 gsm8k。结果与 8 shots 的结果非常接近

Accuracy: 0.956
Invalid: 0.000
Latency: 29.545 s
Output throughput: 4418.617 token/s

使用 `gpqa-diamond` 进行准确率测试#

可以在 GPQA-diamond 数据集上测试长上下文的准确率基准，启用长输出 token 和思考过程

python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 120000 --repeat 8 --thinking-mode deepseek-v3

8 次运行的平均准确率为 0.797，与官方技术报告中的 79.9 一致。

Repeat: 8, mean: 0.797
Scores: ['0.808', '0.798', '0.808', '0.798', '0.783', '0.788', '0.803', '0.793']

使用 `aime 2025` 进行准确率测试#

通过在 docker 或您自己的虚拟环境中安装 NeMo-Skills 来准备环境

pip install git+https://github.com/NVIDIA/NeMo-Skills.git --ignore-installed blinker

然后启动 SGLang 服务器

python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention

对于 DeepSeek-V3.2 和 DeepSeek-V3.2-Speciale:

python3 -m sglang.launch_server   --model-path deepseek-ai/DeepSeek-V3.2   --trust-remote-code   --tp-size 8 --dp-size 8 --enable-dp-attention   --tool-call-parser deepseekv32   --reasoning-parser deepseek-v3

运行以下脚本评估 AIME 2025

#! /bin/bash
export NEMO_SKILLS_DISABLE_UNCOMMITTED_CHANGES_CHECK=1

ns prepare_data aime25

PORT=30000
BACKEND=sglang
MODEL="deepseek-ai/DeepSeek-V3.2-Exp" # Should be changed to the model name
MODEL_NAME="dsv32-fp8"

echo "Starting AIME25 evaluation with model $MODEL on port $PORT using backend $BACKEND..."
ns eval \
  --benchmarks=aime25:4 \
  --server_type=$BACKEND \
  --model=$MODEL \
  --server_address=https://:${PORT}/v1 \
  --output_dir=nemo_skills_aime25_${MODEL_NAME}_output_${BACKEND}_$(date +%Y%m%d_%H%M%S) \
  ++chat_template_kwargs.thinking=true \
  ++inference.temperature=1.0 \
  ++inference.top_p=0.95 \
  ++inference.tokens_to_generate=64000
  # ++inference.tokens_to_generate=120000 for Speciale model

测试结果 (8*B200)

DeepSeek-V3.2-Exp：

评估模式 (evaluation_mode)	条目数量 (num_entries)	平均 token 数 (avg_tokens)	生成时长（秒）(gen_seconds)	符号正确性 (symbolic_correct)	无回答 (no_answer)
pass@1 [4次平均]	30	15040	1673	87.50% ± 1.67%	0.00%
majority@4 (4选多数)	30	15040	1673	90.00%	0.00%
pass@4	30	15040	1673	90.00%	0.00%

DeepSeek-V3.2

评估模式 (evaluation_mode)	条目数量 (num_entries)	平均 token 数 (avg_tokens)	生成时长（秒）(gen_seconds)	符号正确性 (symbolic_correct)	无回答 (no_answer)
pass@1 [4次平均]	30	13550	1632	92.50% ± 1.67%	0.00%
majority@4 (4选多数)	30	13550	1632	94.71%	0.00%
pass@4	30	13550	1632	96.67%	0.00%

DeepSeek-V3.2-Speciale

评估模式 (evaluation_mode)	条目数量 (num_entries)	平均 token 数 (avg_tokens)	生成时长（秒）(gen_seconds)	符号正确性 (symbolic_correct)	无回答 (no_answer)
pass@1 [4次平均]	30	24155	3583	95.00% ± 1.92%	0.00%
majority@4 (4选多数)	30	24155	3583	95.83%	0.00%
pass@4	30	24155	3583	100.00%	0.00%

DSA 长序列上下文并行优化（实验性）#

可以在 GPQA-diamond 数据集上测试长上下文的准确率基准，启用长输出 token 和思考过程

使用示例

# Launch with EP + DP
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp  --tp 8 --ep 8 --dp 2 --enable-dp-attention --enable-nsa-prefill-context-parallel --max-running-requests 32

上下文并行提示#

CP_size 复用了 atten_tp_size，它等于 TP_size / DP_size。目前仍有一些功能不支持。

多 Batch Prefill：目前，prefill 过程中仅支持单请求处理。
分离：P/D 分离。
跨机支持：目前仅在单机（TP=8, EP=8）上进行了测试。
其他参数：目前仅支持 moe_dense_tp_size=1, kv_cache_dtype = “bf16”, moe_a2a_backend = “deepep”,
DP_size：CP_size 复用了 atten_tp_size，它等于 TP_size / DP_size。为了使 cp 功能正常工作，TP_size 必须能被 DP_size 整除，且 TP_size / DP_size > 1（以确保 CP_size > 1）。
详细设计参考：https://github.com/sgl-project/sglang/pull/12065