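Speed Benchmark
This section reports efficiency results for the Qwen3 series models (original and quantized), covering inference speed (tokens/s) and GPU memory usage (MB) at different context lengths.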
Environment
Hugging Face Transformers
Hardware:
NVIDIA H20 96GB
Software (without AutoAWQ):
PyTorch 2.6.0
Flash Attention 2.7.4
Transformers 4.51.3
GPTQModel 2.2.0+cu128torch2.6
Software (with AutoAWQ):
PyTorch 2.6.0+cu124
Transformers 4.51.3
AutoAWQ 0.2.9
AutoAWQ_kernels 0.0.9
SGLang
Hardware:
NVIDIA H20 96GB
Software:
PyTorch 2.6.0+cu124
Transformers 4.51.3
SGLang 0.4.6.post1
SGL-kernel 0.1.0
vLLM 0.7.2 (required by SGLang for AWQ quantization)
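Notes
Inference speed (tokens/s) is computed as:
\[\text{Speed} = \frac{\text{tokens}_{\text{prompt}} + \text{tokens}_{\text{generation}}}{\text{time}}\]
The batch size is set to 1, and each run uses as few GPUs as possible.
We measure speed and GPU memory usage when generating 2048 tokens, with input lengths of 1, 6144, 14336, 30720, 63488, and 129024 tokens (where supported by the model).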
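The formula and the input lengths above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `tokens_per_second` and the 10-second timing are assumptions made for the example, not values from the benchmark. It also shows that each input length plus the 2048 generated tokens lands on a round total context size.

```python
GENERATED_TOKENS = 2048
INPUT_LENGTHS = [1, 6144, 14336, 30720, 63488, 129024]


def tokens_per_second(prompt_tokens: int, generated_tokens: int, elapsed_seconds: float) -> float:
    """Speed = (prompt tokens + generated tokens) / elapsed time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (prompt_tokens + generated_tokens) / elapsed_seconds


# Total context length for each benchmarked input length.
total_lengths = [n + GENERATED_TOKENS for n in INPUT_LENGTHS]

# Hypothetical timing: a 6144-token prompt plus 2048 generated tokens in 10 s.
example_speed = tokens_per_second(6144, GENERATED_TOKENS, 10.0)

print(total_lengths)
print(example_speed)
```

Because prompt tokens count toward the numerator, reported speeds rise with input length: prefill processes many tokens in parallel, so long prompts add tokens faster than they add time.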
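For SGLang:
Memory usage is not reported, because SGLang pre-allocates GPU memory up front. By default, we set `mem_fraction_static=0.85`. We set `context_length=140000` and `enable_mixed_chunk=True`. For AWQ quantization, we use the `awq_marlin` backend.
We set `skip_tokenizer_init=True` and generate from `input_ids` rather than raw text prompts.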
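FP8 performance in Transformers: inference speed in FP8 mode under Transformers is currently suboptimal and needs further optimization.
GPTQ-INT4 performance in SGLang: GPTQ-INT4 performance in SGLang also needs improvement; the SGLang team is working on it.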
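The SGLang settings listed in the notes could be collected as follows. This is a minimal sketch, not the benchmark script: the helper names `build_engine_kwargs` and `run_once`, the model path, and the assumption that these keys are passed to SGLang's offline `Engine` (with `tp_size` for tensor parallelism and `ignore_eos` to force a fixed generation length) are illustrative choices. Only the option values themselves come from the notes.

```python
from typing import Any, Dict, List, Optional


def build_engine_kwargs(
    model_path: str,
    quantization: Optional[str] = None,
    tp_size: int = 1,
) -> Dict[str, Any]:
    """Collect the settings described in the notes into one dict."""
    kwargs: Dict[str, Any] = {
        "model_path": model_path,
        "mem_fraction_static": 0.85,   # share of GPU memory pre-allocated
        "context_length": 140000,
        "enable_mixed_chunk": True,
        "skip_tokenizer_init": True,   # generate from token ids, not text
        "tp_size": tp_size,            # assumed name of the tensor-parallel option
    }
    if quantization is not None:
        kwargs["quantization"] = quantization  # e.g. "awq_marlin" for AWQ models
    return kwargs


def run_once(kwargs: Dict[str, Any], input_ids: List[int], max_new_tokens: int = 2048):
    """Assumed usage of the offline engine; requires SGLang and a GPU."""
    import sglang as sgl  # imported lazily so the helper above runs anywhere

    engine = sgl.Engine(**kwargs)
    try:
        return engine.generate(
            input_ids=input_ids,
            sampling_params={"max_new_tokens": max_new_tokens, "ignore_eos": True},
        )
    finally:
        engine.shutdown()


# Placeholder model path, used only to show the AWQ configuration.
awq_kwargs = build_engine_kwargs("Qwen/Qwen3-8B-AWQ", quantization="awq_marlin")
print(awq_kwargs)
```

Keeping the configuration in a plain dict makes it easy to log alongside each measurement, so every reported number can be traced back to its settings.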
Results
Qwen3-0.6B (SGLang)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-0.6B | 1 | BF16 | 1 | 414.17 | |
| Qwen3-0.6B | 1 | FP8 | 1 | 458.03 | |
| Qwen3-0.6B | 1 | GPTQ-Int8 | 1 | 344.92 | |
| Qwen3-0.6B | 6144 | BF16 | 1 | 1426.46 | |
| Qwen3-0.6B | 6144 | FP8 | 1 | 1572.95 | |
| Qwen3-0.6B | 6144 | GPTQ-Int8 | 1 | 1234.29 | |
| Qwen3-0.6B | 14336 | BF16 | 1 | 2478.02 | |
| Qwen3-0.6B | 14336 | FP8 | 1 | 2689.08 | |
| Qwen3-0.6B | 14336 | GPTQ-Int8 | 1 | 2198.82 | |
| Qwen3-0.6B | 30720 | BF16 | 1 | 3577.42 | |
| Qwen3-0.6B | 30720 | FP8 | 1 | 3819.86 | |
| Qwen3-0.6B | 30720 | GPTQ-Int8 | 1 | 3342.06 | |
Qwen3-0.6B (Transformers)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-0.6B | 1 | BF16 | 1 | 58.57 | 1394 |
| Qwen3-0.6B | 1 | FP8 | 1 | 24.60 | 1217 |
| Qwen3-0.6B | 1 | GPTQ-Int8 | 1 | 26.56 | 986 |
| Qwen3-0.6B | 6144 | BF16 | 1 | 154.82 | 2066 |
| Qwen3-0.6B | 6144 | FP8 | 1 | 73.96 | 1943 |
| Qwen3-0.6B | 6144 | GPTQ-Int8 | 1 | 93.84 | 1658 |
| Qwen3-0.6B | 14336 | BF16 | 1 | 168.48 | 2963 |
| Qwen3-0.6B | 14336 | FP8 | 1 | 104.99 | 2839 |
| Qwen3-0.6B | 14336 | GPTQ-Int8 | 1 | 219.61 | 2554 |
| Qwen3-0.6B | 30720 | BF16 | 1 | 175.93 | 4755 |
| Qwen3-0.6B | 30720 | FP8 | 1 | 132.78 | 4632 |
| Qwen3-0.6B | 30720 | GPTQ-Int8 | 1 | 345.71 | 4347 |
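The memory column can be summarized as a growth rate per input token. The sketch below uses the BF16 rows of the Qwen3-0.6B (Transformers) table above; the helper name `mb_per_input_token` is an assumption for illustration, and the straight-line estimate between the shortest and longest inputs is a simplification rather than part of the benchmark.

```python
# (input length -> GPU memory in MB) for Qwen3-0.6B BF16, copied from the table above.
bf16_memory_mb = {1: 1394, 6144: 2066, 14336: 2963, 30720: 4755}


def mb_per_input_token(points: dict) -> float:
    """Average memory growth per input token between the shortest and longest input."""
    lengths = sorted(points)
    first, last = lengths[0], lengths[-1]
    return (points[last] - points[first]) / (last - first)


slope = mb_per_input_token(bf16_memory_mb)
print(f"{slope:.4f} MB per input token")
```

The result is roughly 0.11 MB per input token, which is the kind of near-linear growth one would expect if the extra memory is dominated by the key-value cache.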
Qwen3-1.7B (SGLang)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-1.7B | 1 | BF16 | 1 | 227.80 | |
| Qwen3-1.7B | 1 | FP8 | 1 | 333.90 | |
| Qwen3-1.7B | 1 | GPTQ-Int8 | 1 | 257.40 | |
| Qwen3-1.7B | 6144 | BF16 | 1 | 838.28 | |
| Qwen3-1.7B | 6144 | FP8 | 1 | 1198.20 | |
| Qwen3-1.7B | 6144 | GPTQ-Int8 | 1 | 945.91 | |
| Qwen3-1.7B | 14336 | BF16 | 1 | 1525.71 | |
| Qwen3-1.7B | 14336 | FP8 | 1 | 2095.61 | |
| Qwen3-1.7B | 14336 | GPTQ-Int8 | 1 | 1707.63 | |
| Qwen3-1.7B | 30720 | BF16 | 1 | 2439.03 | |
| Qwen3-1.7B | 30720 | FP8 | 1 | 3165.32 | |
| Qwen3-1.7B | 30720 | GPTQ-Int8 | 1 | 2706.16 | |
Qwen3-1.7B (Transformers)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-1.7B | 1 | BF16 | 1 | 59.83 | 3412 |
| Qwen3-1.7B | 1 | FP8 | 1 | 23.83 | 2726 |
| Qwen3-1.7B | 1 | GPTQ-Int8 | 1 | 28.06 | 2229 |
| Qwen3-1.7B | 6144 | BF16 | 1 | 238.53 | 4213 |
| Qwen3-1.7B | 6144 | FP8 | 1 | 90.87 | 3462 |
| Qwen3-1.7B | 6144 | GPTQ-Int8 | 1 | 110.82 | 2901 |
| Qwen3-1.7B | 14336 | BF16 | 1 | 352.59 | 5109 |
| Qwen3-1.7B | 14336 | FP8 | 1 | 153.37 | 4359 |
| Qwen3-1.7B | 14336 | GPTQ-Int8 | 1 | 222.78 | 3798 |
| Qwen3-1.7B | 30720 | BF16 | 1 | 418.13 | 6902 |
| Qwen3-1.7B | 30720 | FP8 | 1 | 235.61 | 6151 |
| Qwen3-1.7B | 30720 | GPTQ-Int8 | 1 | 386.85 | 5590 |
Qwen3-4B (SGLang)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-4B | 1 | BF16 | 1 | 133.13 | |
| Qwen3-4B | 1 | FP8 | 1 | 200.61 | |
| Qwen3-4B | 1 | AWQ-INT4 | 1 | 199.71 | |
| Qwen3-4B | 6144 | BF16 | 1 | 466.19 | |
| Qwen3-4B | 6144 | FP8 | 1 | 662.26 | |
| Qwen3-4B | 6144 | AWQ-INT4 | 1 | 640.07 | |
| Qwen3-4B | 14336 | BF16 | 1 | 789.25 | |
| Qwen3-4B | 14336 | FP8 | 1 | 1066.23 | |
| Qwen3-4B | 14336 | AWQ-INT4 | 1 | 1006.23 | |
| Qwen3-4B | 30720 | BF16 | 1 | 1165.75 | |
| Qwen3-4B | 30720 | FP8 | 1 | 1467.71 | |
| Qwen3-4B | 30720 | AWQ-INT4 | 1 | 1358.84 | |
| Qwen3-4B | 63488 | BF16 | 1 | 1423.98 | |
| Qwen3-4B | 63488 | FP8 | 1 | 1660.67 | |
| Qwen3-4B | 63488 | AWQ-INT4 | 1 | 1513.97 | |
| Qwen3-4B | 129024 | BF16 | 1 | 1371.04 | |
| Qwen3-4B | 129024 | FP8 | 1 | 1497.27 | |
| Qwen3-4B | 129024 | AWQ-INT4 | 1 | 1375.71 | |
Qwen3-4B (Transformers)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-4B | 1 | BF16 | 1 | 45.94 | 7973 |
| Qwen3-4B | 1 | FP8 | 1 | 17.33 | 5281 |
| Qwen3-4B | 1 | AWQ-INT4 | 1 | 51.57 | 2915 |
| Qwen3-4B | 6144 | BF16 | 1 | 159.95 | 8860 |
| Qwen3-4B | 6144 | FP8 | 1 | 60.55 | 6144 |
| Qwen3-4B | 6144 | AWQ-INT4 | 1 | 183.04 | 3881 |
| Qwen3-4B | 14336 | BF16 | 1 | 195.31 | 10012 |
| Qwen3-4B | 14336 | FP8 | 1 | 96.81 | 7297 |
| Qwen3-4B | 14336 | AWQ-INT4 | 1 | 265.22 | 5151 |
| Qwen3-4B | 30720 | BF16 | 1 | 217.97 | 12317 |
| Qwen3-4B | 30720 | FP8 | 1 | 138.84 | 9611 |
| Qwen3-4B | 30720 | AWQ-INT4 | 1 | 481.69 | 7742 |
Qwen3-8B (SGLang)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-8B | 1 | BF16 | 1 | 81.73 | |
| Qwen3-8B | 1 | FP8 | 1 | 150.25 | |
| Qwen3-8B | 1 | AWQ-INT4 | 1 | 144.11 | |
| Qwen3-8B | 6144 | BF16 | 1 | 296.25 | |
| Qwen3-8B | 6144 | FP8 | 1 | 516.64 | |
| Qwen3-8B | 6144 | AWQ-INT4 | 1 | 477.89 | |
| Qwen3-8B | 14336 | BF16 | 1 | 524.70 | |
| Qwen3-8B | 14336 | FP8 | 1 | 859.92 | |
| Qwen3-8B | 14336 | AWQ-INT4 | 1 | 770.44 | |
| Qwen3-8B | 30720 | BF16 | 1 | 832.67 | |
| Qwen3-8B | 30720 | FP8 | 1 | 1242.24 | |
| Qwen3-8B | 30720 | AWQ-INT4 | 1 | 1075.91 | |
| Qwen3-8B | 63488 | BF16 | 1 | 1112.78 | |
| Qwen3-8B | 63488 | FP8 | 1 | 1476.46 | |
| Qwen3-8B | 63488 | AWQ-INT4 | 1 | 1254.91 | |
| Qwen3-8B | 129024 | BF16 | 1 | 1173.32 | |
| Qwen3-8B | 129024 | FP8 | 1 | 1393.21 | |
| Qwen3-8B | 129024 | AWQ-INT4 | 1 | 1198.06 | |
Qwen3-8B (Transformers)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-8B | 1 | BF16 | 1 | 45.32 | 15947 |
| Qwen3-8B | 1 | FP8 | 1 | 15.46 | 9323 |
| Qwen3-8B | 1 | AWQ-INT4 | 1 | 51.33 | 6177 |
| Qwen3-8B | 6144 | BF16 | 1 | 146.12 | 16811 |
| Qwen3-8B | 6144 | FP8 | 1 | 55.07 | 10187 |
| Qwen3-8B | 6144 | AWQ-INT4 | 1 | 163.23 | 7113 |
| Qwen3-8B | 14336 | BF16 | 1 | 183.29 | 17963 |
| Qwen3-8B | 14336 | FP8 | 1 | 89.64 | 11340 |
| Qwen3-8B | 14336 | AWQ-INT4 | 1 | 242.97 | 8409 |
| Qwen3-8B | 30720 | BF16 | 1 | 208.98 | 20267 |
| Qwen3-8B | 30720 | FP8 | 1 | 130.93 | 13644 |
| Qwen3-8B | 30720 | AWQ-INT4 | 1 | 438.62 | 11001 |
Qwen3-14B (SGLang)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-14B | 1 | BF16 | 1 | 47.10 | |
| Qwen3-14B | 1 | FP8 | 1 | 97.11 | |
| Qwen3-14B | 1 | AWQ-INT4 | 1 | 96.49 | |
| Qwen3-14B | 6144 | BF16 | 1 | 174.85 | |
| Qwen3-14B | 6144 | FP8 | 1 | 342.95 | |
| Qwen3-14B | 6144 | AWQ-INT4 | 1 | 321.62 | |
| Qwen3-14B | 14336 | BF16 | 1 | 317.56 | |
| Qwen3-14B | 14336 | FP8 | 1 | 587.33 | |
| Qwen3-14B | 14336 | AWQ-INT4 | 1 | 525.74 | |
| Qwen3-14B | 30720 | BF16 | 1 | 525.80 | |
| Qwen3-14B | 30720 | FP8 | 1 | 880.72 | |
| Qwen3-14B | 30720 | AWQ-INT4 | 1 | 744.74 | |
| Qwen3-14B | 63488 | BF16 | 1 | 742.36 | |
| Qwen3-14B | 63488 | FP8 | 1 | 1089.04 | |
| Qwen3-14B | 63488 | AWQ-INT4 | 1 | 884.06 | |
| Qwen3-14B | 129024 | BF16 | 1 | 826.15 | |
| Qwen3-14B | 129024 | FP8 | 1 | 1049.64 | |
| Qwen3-14B | 129024 | AWQ-INT4 | 1 | 857.56 | |
Qwen3-14B (Transformers)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-14B | 1 | BF16 | 1 | 40.66 | 28402 |
| Qwen3-14B | 1 | FP8 | 1 | 13.02 | 16012 |
| Qwen3-14B | 1 | AWQ-INT4 | 1 | 44.67 | 9962 |
| Qwen3-14B | 6144 | BF16 | 1 | 108.52 | 29495 |
| Qwen3-14B | 6144 | FP8 | 1 | 44.86 | 16972 |
| Qwen3-14B | 6144 | AWQ-INT4 | 1 | 128.08 | 11020 |
| Qwen3-14B | 14336 | BF16 | 1 | 136.36 | 30775 |
| Qwen3-14B | 14336 | FP8 | 1 | 71.96 | 18253 |
| Qwen3-14B | 14336 | AWQ-INT4 | 1 | 220.62 | 12438 |
| Qwen3-14B | 30720 | BF16 | 1 | 155.38 | 33336 |
| Qwen3-14B | 30720 | FP8 | 1 | 102.63 | 20813 |
| Qwen3-14B | 30720 | AWQ-INT4 | 1 | 363.25 | 15323 |
Qwen3-32B (SGLang)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-32B | 1 | BF16 | 1 | 20.72 | |
| Qwen3-32B | 1 | FP8 | 1 | 46.17 | |
| Qwen3-32B | 1 | AWQ-INT4 | 1 | 47.67 | |
| Qwen3-32B | 6144 | BF16 | 1 | 77.82 | |
| Qwen3-32B | 6144 | FP8 | 1 | 165.71 | |
| Qwen3-32B | 6144 | AWQ-INT4 | 1 | 159.99 | |
| Qwen3-32B | 14336 | BF16 | 1 | 143.08 | |
| Qwen3-32B | 14336 | FP8 | 1 | 287.60 | |
| Qwen3-32B | 14336 | AWQ-INT4 | 1 | 260.44 | |
| Qwen3-32B | 30720 | BF16 | 1 | 240.75 | |
| Qwen3-32B | 30720 | FP8 | 1 | 436.59 | |
| Qwen3-32B | 30720 | AWQ-INT4 | 1 | 366.84 | |
| Qwen3-32B | 63488 | BF16 | 1 | 342.96 | |
| Qwen3-32B | 63488 | FP8 | 1 | 532.18 | |
| Qwen3-32B | 63488 | AWQ-INT4 | 1 | 425.23 | |
| Qwen3-32B | 129024 | BF16 | 2 | 711.40 | TP=2 |
| Qwen3-32B | 129024 | FP8 | 1 | 491.45 | |
| Qwen3-32B | 129024 | AWQ-INT4 | 1 | 395.96 | |
Qwen3-32B (Transformers)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-32B | 1 | BF16 | 1 | 26.24 | 62751 |
| Qwen3-32B | 1 | FP8 | 1 | 7.37 | 33379 |
| Qwen3-32B | 1 | AWQ-INT4 | 1 | 41.8 | 19109 |
| Qwen3-32B | 6144 | BF16 | 1 | 51.41 | 64583 |
| Qwen3-32B | 6144 | FP8 | 1 | 23.57 | 34915 |
| Qwen3-32B | 6144 | AWQ-INT4 | 1 | 68.71 | 20795 |
| Qwen3-32B | 14336 | BF16 | 1 | 62.41 | 66632 |
| Qwen3-32B | 14336 | FP8 | 1 | 36.30 | 36963 |
| Qwen3-32B | 14336 | AWQ-INT4 | 1 | 107.02 | 23105 |
| Qwen3-32B | 30720 | BF16 | 1 | 69.16 | 70728 |
| Qwen3-32B | 30720 | FP8 | 1 | 49.44 | 41060 |
| Qwen3-32B | 30720 | AWQ-INT4 | 1 | 188.11 | 27718 |
Qwen3-30B-A3B (SGLang)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-30B-A3B | 1 | BF16 | 1 | 137.18 | |
| Qwen3-30B-A3B | 1 | FP8 | 1 | 155.55 | |
| Qwen3-30B-A3B | 1 | GPTQ-INT4 | 1 | 31.29 | GPTQ-Marlin |
| Qwen3-30B-A3B | 6144 | BF16 | 1 | 490.10 | |
| Qwen3-30B-A3B | 6144 | FP8 | 1 | 551.34 | |
| Qwen3-30B-A3B | 6144 | GPTQ-INT4 | 1 | 120.13 | GPTQ-Marlin |
| Qwen3-30B-A3B | 14336 | BF16 | 1 | 849.62 | |
| Qwen3-30B-A3B | 14336 | FP8 | 1 | 945.13 | |
| Qwen3-30B-A3B | 14336 | GPTQ-INT4 | 1 | 227.27 | GPTQ-Marlin |
| Qwen3-30B-A3B | 30720 | BF16 | 1 | 1283.94 | |
| Qwen3-30B-A3B | 30720 | FP8 | 1 | 1405.91 | |
| Qwen3-30B-A3B | 30720 | GPTQ-INT4 | 1 | 404.45 | GPTQ-Marlin |
| Qwen3-30B-A3B | 63488 | BF16 | 1 | 1538.79 | |
| Qwen3-30B-A3B | 63488 | FP8 | 1 | 1647.89 | |
| Qwen3-30B-A3B | 63488 | GPTQ-INT4 | 1 | 617.09 | GPTQ-Marlin |
| Qwen3-30B-A3B | 129024 | BF16 | 1 | 1385.65 | |
| Qwen3-30B-A3B | 129024 | FP8 | 1 | 1442.14 | |
| Qwen3-30B-A3B | 129024 | GPTQ-INT4 | 1 | 704.82 | GPTQ-Marlin |
Qwen3-30B-A3B (Transformers)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) | Note |
|---|---|---|---|---|---|---|
| Qwen3-30B-A3B | 1 | BF16 | 1 | 1.89 | 58462 | |
| Qwen3-30B-A3B | 1 | FP8 | 1 | 0.44 | 30296 | |
| Qwen3-30B-A3B | 1 | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |
| Qwen3-30B-A3B | 6144 | BF16 | 1 | 7.45 | 59037 | |
| Qwen3-30B-A3B | 6144 | FP8 | 1 | 1.77 | 30872 | |
| Qwen3-30B-A3B | 6144 | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |
| Qwen3-30B-A3B | 14336 | BF16 | 1 | 14.47 | 59806 | |
| Qwen3-30B-A3B | 14336 | FP8 | 1 | 3.5 | 31641 | |
| Qwen3-30B-A3B | 14336 | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |
| Qwen3-30B-A3B | 30720 | BF16 | 1 | 27.03 | 61342 | |
| Qwen3-30B-A3B | 30720 | FP8 | 1 | 6.86 | 33177 | |
| Qwen3-30B-A3B | 30720 | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |
Qwen3-235B-A22B (SGLang)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-235B-A22B | 1 | BF16 | 8 | 74.50 | TP=8 |
| Qwen3-235B-A22B | 1 | FP8 | 4 | 71.65 | TP=4 |
| Qwen3-235B-A22B | 1 | GPTQ-INT4 | 4 | 14.69 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 6144 | BF16 | 8 | 289.03 | TP=8 |
| Qwen3-235B-A22B | 6144 | FP8 | 4 | 275.16 | TP=4 |
| Qwen3-235B-A22B | 6144 | GPTQ-INT4 | 4 | 56.97 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 14336 | BF16 | 8 | 546.73 | TP=8 |
| Qwen3-235B-A22B | 14336 | FP8 | 4 | 514.23 | TP=4 |
| Qwen3-235B-A22B | 14336 | GPTQ-INT4 | 4 | 109.13 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 30720 | BF16 | 8 | 979.41 | TP=8 |
| Qwen3-235B-A22B | 30720 | FP8 | 4 | 887.90 | TP=4 |
| Qwen3-235B-A22B | 30720 | GPTQ-INT4 | 4 | 198.99 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 63488 | BF16 | 8 | 1493.91 | TP=8 |
| Qwen3-235B-A22B | 63488 | FP8 | 4 | 1269.34 | TP=4 |
| Qwen3-235B-A22B | 63488 | GPTQ-INT4 | 4 | 422.77 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 129024 | BF16 | 8 | 1639.54 | TP=8 |
| Qwen3-235B-A22B | 129024 | FP8 | 4 | 1319.66 | TP=4 |
| Qwen3-235B-A22B | 129024 | GPTQ-INT4 | 4 | 552.28 | TP=4, GPTQ-Marlin |
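Because the Qwen3-235B-A22B rows use different GPU counts, raw tokens/s figures are not directly comparable across quantization formats. One way to read them is to divide speed by the number of GPUs. The sketch below does this for the rows at input length 1; this per-GPU normalization is an illustrative derived metric, not something the benchmark itself reports.

```python
# (quantization, GPU count, speed in tokens/s) at input length 1, from the table above.
rows = [
    ("BF16", 8, 74.50),
    ("FP8", 4, 71.65),
    ("GPTQ-INT4", 4, 14.69),
]

# Tokens/s contributed per GPU for each configuration.
per_gpu = {quant: speed / gpus for quant, gpus, speed in rows}

for quant, value in per_gpu.items():
    print(f"{quant}: {value:.2f} tokens/s per GPU")
```

Read this way, FP8 on four GPUs delivers close to the BF16 total while using half the hardware, which is a different picture from comparing the raw speed column alone.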