Efficiency Evaluation

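This section reports efficiency test results for the Qwen3 series models (original and quantized), covering inference speed (tokens/s) and GPU memory usage (MB) at different context lengths.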

Environment

Hugging Face Transformers

  • Hardware:

    • NVIDIA H20 96GB

  • Software environment (without AutoAWQ):

    • PyTorch 2.6.0

    • Flash Attention 2.7.4

    • Transformers 4.51.3

    • GPTQModel 2.2.0+cu128torch2.6

  • Software environment (with AutoAWQ):

    • PyTorch 2.6.0+cu124

    • Transformers 4.51.3

    • AutoAWQ 0.2.9

    • AutoAWQ_kernels 0.0.9

SGLang

  • Hardware:

    • NVIDIA H20 96GB

  • Software environment:

    • PyTorch 2.6.0+cu124

    • Transformers 4.51.3

    • SGLang 0.4.6.post1

    • SGL-kernel 0.1.0

    • vLLM 0.7.2 (required by SGLang for AWQ quantization)
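To reproduce the numbers, it can help to confirm that the local environment matches the versions listed above. The following is a minimal sketch using only the Python standard library. The distribution names in `EXPECTED_SGLANG` (for example `sgl-kernel`) are assumptions about how the packages are published and may need adjusting, and `installed_versions` and `mismatches` are hypothetical helper names.

```python
from importlib import metadata

# Versions taken from the SGLang environment list above.
# The distribution names (dict keys) are assumptions, not confirmed by this document.
EXPECTED_SGLANG = {
    "torch": "2.6.0",
    "transformers": "4.51.3",
    "sglang": "0.4.6.post1",
    "sgl-kernel": "0.1.0",
    "vllm": "0.7.2",
}


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def mismatches(expected, installed):
    """Return {name: (expected, found)} for packages that are missing or differ.

    A prefix match is used so that local build tags such as "2.6.0+cu124"
    still count as version "2.6.0".
    """
    return {
        name: (want, installed.get(name))
        for name, want in expected.items()
        if installed.get(name) is None or not installed[name].startswith(want)
    }


if __name__ == "__main__":
    found = installed_versions(EXPECTED_SGLANG)
    for name, (want, got) in mismatches(EXPECTED_SGLANG, found).items():
        print(f"{name}: expected {want}, found {got}")
```

An empty result from `mismatches` suggests the environment is close to the one used here. It does not check the GPU model or the CUDA driver.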

Notes

  • Inference speed (tokens/s) is calculated as:

    \[\text{Speed} = \frac{\text{tokens}_{\text{prompt}} + \text{tokens}_{\text{generation}}}{\text{time}}\]

  • The batch size is set to 1, and as few GPUs as possible are used.

  • We measure speed and GPU memory usage when generating 2048 tokens, with input lengths of 1, 6144, 14336, 30720, 63488, and 129024 tokens (where the model supports them).

  • For SGLang:

    • Memory usage is not reported, because SGLang pre-allocates all GPU memory. By default, we set mem_fraction_static=0.85.

    • We set context_length=140000 and enable enable_mixed_chunk=True.

    • For AWQ quantization, we use the awq_marlin backend.

    • We set skip_tokenizer_init=True and generate from input_ids rather than from raw text prompts.

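  • FP8 performance in Transformers: inference speed in FP8 mode with Transformers is currently suboptimal and needs further optimization.

  • GPTQ-INT4 performance in SGLang: GPTQ-INT4 performance in SGLang also needs improvement, and the SGLang team is working on it.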
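The speed formula in the notes can be sketched in a few lines of Python. The helper name `tokens_per_second` and the 4.0 s timing in the usage line are hypothetical, chosen only for illustration. The input lengths and the 2048-token generation length are the ones stated above.

```python
GENERATED_TOKENS = 2048  # generation length used in every test, per the notes
INPUT_LENGTHS = [1, 6144, 14336, 30720, 63488, 129024]  # prompt lengths, per the notes


def tokens_per_second(prompt_tokens: int, generated_tokens: int, elapsed_s: float) -> float:
    """Speed = (prompt tokens + generated tokens) / elapsed time in seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (prompt_tokens + generated_tokens) / elapsed_s


if __name__ == "__main__":
    # Total sequence length processed in each test configuration.
    for n in INPUT_LENGTHS:
        print(f"input {n:>6} + {GENERATED_TOKENS} generated = {n + GENERATED_TOKENS} tokens")

    # Hypothetical timing: a 6144-token prompt plus 2048 generated tokens in 4.0 s.
    print(tokens_per_second(6144, GENERATED_TOKENS, 4.0))  # 2048.0
```

Because prompt tokens are counted in the numerator, the reported speed is an end-to-end throughput rather than a pure decoding rate. Apart from the single-token case, each input length plus 2048 generated tokens gives a power-of-two total (8192 up to 131072).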

Results

Qwen3-0.6B (SGLang)

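| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-0.6B | 1 | BF16 | 1 | 414.17 | |
| Qwen3-0.6B | 1 | FP8 | 1 | 458.03 | |
| Qwen3-0.6B | 1 | GPTQ-Int8 | 1 | 344.92 | |
| Qwen3-0.6B | 6144 | BF16 | 1 | 1426.46 | |
| Qwen3-0.6B | 6144 | FP8 | 1 | 1572.95 | |
| Qwen3-0.6B | 6144 | GPTQ-Int8 | 1 | 1234.29 | |
| Qwen3-0.6B | 14336 | BF16 | 1 | 2478.02 | |
| Qwen3-0.6B | 14336 | FP8 | 1 | 2689.08 | |
| Qwen3-0.6B | 14336 | GPTQ-Int8 | 1 | 2198.82 | |
| Qwen3-0.6B | 30720 | BF16 | 1 | 3577.42 | |
| Qwen3-0.6B | 30720 | FP8 | 1 | 3819.86 | |
| Qwen3-0.6B | 30720 | GPTQ-Int8 | 1 | 3342.06 | |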

Qwen3-0.6B (Transformers)

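| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-0.6B | 1 | BF16 | 1 | 58.57 | 1394 |
| Qwen3-0.6B | 1 | FP8 | 1 | 24.60 | 1217 |
| Qwen3-0.6B | 1 | GPTQ-Int8 | 1 | 26.56 | 986 |
| Qwen3-0.6B | 6144 | BF16 | 1 | 154.82 | 2066 |
| Qwen3-0.6B | 6144 | FP8 | 1 | 73.96 | 1943 |
| Qwen3-0.6B | 6144 | GPTQ-Int8 | 1 | 93.84 | 1658 |
| Qwen3-0.6B | 14336 | BF16 | 1 | 168.48 | 2963 |
| Qwen3-0.6B | 14336 | FP8 | 1 | 104.99 | 2839 |
| Qwen3-0.6B | 14336 | GPTQ-Int8 | 1 | 219.61 | 2554 |
| Qwen3-0.6B | 30720 | BF16 | 1 | 175.93 | 4755 |
| Qwen3-0.6B | 30720 | FP8 | 1 | 132.78 | 4632 |
| Qwen3-0.6B | 30720 | GPTQ-Int8 | 1 | 345.71 | 4347 |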

Qwen3-1.7B (SGLang)

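| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-1.7B | 1 | BF16 | 1 | 227.80 | |
| Qwen3-1.7B | 1 | FP8 | 1 | 333.90 | |
| Qwen3-1.7B | 1 | GPTQ-Int8 | 1 | 257.40 | |
| Qwen3-1.7B | 6144 | BF16 | 1 | 838.28 | |
| Qwen3-1.7B | 6144 | FP8 | 1 | 1198.20 | |
| Qwen3-1.7B | 6144 | GPTQ-Int8 | 1 | 945.91 | |
| Qwen3-1.7B | 14336 | BF16 | 1 | 1525.71 | |
| Qwen3-1.7B | 14336 | FP8 | 1 | 2095.61 | |
| Qwen3-1.7B | 14336 | GPTQ-Int8 | 1 | 1707.63 | |
| Qwen3-1.7B | 30720 | BF16 | 1 | 2439.03 | |
| Qwen3-1.7B | 30720 | FP8 | 1 | 3165.32 | |
| Qwen3-1.7B | 30720 | GPTQ-Int8 | 1 | 2706.16 | |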

Qwen3-1.7B (Transformers)

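| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-1.7B | 1 | BF16 | 1 | 59.83 | 3412 |
| Qwen3-1.7B | 1 | FP8 | 1 | 23.83 | 2726 |
| Qwen3-1.7B | 1 | GPTQ-Int8 | 1 | 28.06 | 2229 |
| Qwen3-1.7B | 6144 | BF16 | 1 | 238.53 | 4213 |
| Qwen3-1.7B | 6144 | FP8 | 1 | 90.87 | 3462 |
| Qwen3-1.7B | 6144 | GPTQ-Int8 | 1 | 110.82 | 2901 |
| Qwen3-1.7B | 14336 | BF16 | 1 | 352.59 | 5109 |
| Qwen3-1.7B | 14336 | FP8 | 1 | 153.37 | 4359 |
| Qwen3-1.7B | 14336 | GPTQ-Int8 | 1 | 222.78 | 3798 |
| Qwen3-1.7B | 30720 | BF16 | 1 | 418.13 | 6902 |
| Qwen3-1.7B | 30720 | FP8 | 1 | 235.61 | 6151 |
| Qwen3-1.7B | 30720 | GPTQ-Int8 | 1 | 386.85 | 5590 |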

Qwen3-4B (SGLang)

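| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-4B | 1 | BF16 | 1 | 133.13 | |
| Qwen3-4B | 1 | FP8 | 1 | 200.61 | |
| Qwen3-4B | 1 | AWQ-INT4 | 1 | 199.71 | |
| Qwen3-4B | 6144 | BF16 | 1 | 466.19 | |
| Qwen3-4B | 6144 | FP8 | 1 | 662.26 | |
| Qwen3-4B | 6144 | AWQ-INT4 | 1 | 640.07 | |
| Qwen3-4B | 14336 | BF16 | 1 | 789.25 | |
| Qwen3-4B | 14336 | FP8 | 1 | 1066.23 | |
| Qwen3-4B | 14336 | AWQ-INT4 | 1 | 1006.23 | |
| Qwen3-4B | 30720 | BF16 | 1 | 1165.75 | |
| Qwen3-4B | 30720 | FP8 | 1 | 1467.71 | |
| Qwen3-4B | 30720 | AWQ-INT4 | 1 | 1358.84 | |
| Qwen3-4B | 63488 | BF16 | 1 | 1423.98 | |
| Qwen3-4B | 63488 | FP8 | 1 | 1660.67 | |
| Qwen3-4B | 63488 | AWQ-INT4 | 1 | 1513.97 | |
| Qwen3-4B | 129024 | BF16 | 1 | 1371.04 | |
| Qwen3-4B | 129024 | FP8 | 1 | 1497.27 | |
| Qwen3-4B | 129024 | AWQ-INT4 | 1 | 1375.71 | |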

Qwen3-4B (Transformers)

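| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-4B | 1 | BF16 | 1 | 45.94 | 7973 |
| Qwen3-4B | 1 | FP8 | 1 | 17.33 | 5281 |
| Qwen3-4B | 1 | AWQ-INT4 | 1 | 51.57 | 2915 |
| Qwen3-4B | 6144 | BF16 | 1 | 159.95 | 8860 |
| Qwen3-4B | 6144 | FP8 | 1 | 60.55 | 6144 |
| Qwen3-4B | 6144 | AWQ-INT4 | 1 | 183.04 | 3881 |
| Qwen3-4B | 14336 | BF16 | 1 | 195.31 | 10012 |
| Qwen3-4B | 14336 | FP8 | 1 | 96.81 | 7297 |
| Qwen3-4B | 14336 | AWQ-INT4 | 1 | 265.22 | 5151 |
| Qwen3-4B | 30720 | BF16 | 1 | 217.97 | 12317 |
| Qwen3-4B | 30720 | FP8 | 1 | 138.84 | 9611 |
| Qwen3-4B | 30720 | AWQ-INT4 | 1 | 481.69 | 7742 |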

Qwen3-8B (SGLang)

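| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-8B | 1 | BF16 | 1 | 81.73 | |
| Qwen3-8B | 1 | FP8 | 1 | 150.25 | |
| Qwen3-8B | 1 | AWQ-INT4 | 1 | 144.11 | |
| Qwen3-8B | 6144 | BF16 | 1 | 296.25 | |
| Qwen3-8B | 6144 | FP8 | 1 | 516.64 | |
| Qwen3-8B | 6144 | AWQ-INT4 | 1 | 477.89 | |
| Qwen3-8B | 14336 | BF16 | 1 | 524.70 | |
| Qwen3-8B | 14336 | FP8 | 1 | 859.92 | |
| Qwen3-8B | 14336 | AWQ-INT4 | 1 | 770.44 | |
| Qwen3-8B | 30720 | BF16 | 1 | 832.67 | |
| Qwen3-8B | 30720 | FP8 | 1 | 1242.24 | |
| Qwen3-8B | 30720 | AWQ-INT4 | 1 | 1075.91 | |
| Qwen3-8B | 63488 | BF16 | 1 | 1112.78 | |
| Qwen3-8B | 63488 | FP8 | 1 | 1476.46 | |
| Qwen3-8B | 63488 | AWQ-INT4 | 1 | 1254.91 | |
| Qwen3-8B | 129024 | BF16 | 1 | 1173.32 | |
| Qwen3-8B | 129024 | FP8 | 1 | 1393.21 | |
| Qwen3-8B | 129024 | AWQ-INT4 | 1 | 1198.06 | |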

Qwen3-8B (Transformers)

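| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-8B | 1 | BF16 | 1 | 45.32 | 15947 |
| Qwen3-8B | 1 | FP8 | 1 | 15.46 | 9323 |
| Qwen3-8B | 1 | AWQ-INT4 | 1 | 51.33 | 6177 |
| Qwen3-8B | 6144 | BF16 | 1 | 146.12 | 16811 |
| Qwen3-8B | 6144 | FP8 | 1 | 55.07 | 10187 |
| Qwen3-8B | 6144 | AWQ-INT4 | 1 | 163.23 | 7113 |
| Qwen3-8B | 14336 | BF16 | 1 | 183.29 | 17963 |
| Qwen3-8B | 14336 | FP8 | 1 | 89.64 | 11340 |
| Qwen3-8B | 14336 | AWQ-INT4 | 1 | 242.97 | 8409 |
| Qwen3-8B | 30720 | BF16 | 1 | 208.98 | 20267 |
| Qwen3-8B | 30720 | FP8 | 1 | 130.93 | 13644 |
| Qwen3-8B | 30720 | AWQ-INT4 | 1 | 438.62 | 11001 |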

Qwen3-14B (SGLang)

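| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-14B | 1 | BF16 | 1 | 47.10 | |
| Qwen3-14B | 1 | FP8 | 1 | 97.11 | |
| Qwen3-14B | 1 | AWQ-INT4 | 1 | 96.49 | |
| Qwen3-14B | 6144 | BF16 | 1 | 174.85 | |
| Qwen3-14B | 6144 | FP8 | 1 | 342.95 | |
| Qwen3-14B | 6144 | AWQ-INT4 | 1 | 321.62 | |
| Qwen3-14B | 14336 | BF16 | 1 | 317.56 | |
| Qwen3-14B | 14336 | FP8 | 1 | 587.33 | |
| Qwen3-14B | 14336 | AWQ-INT4 | 1 | 525.74 | |
| Qwen3-14B | 30720 | BF16 | 1 | 525.80 | |
| Qwen3-14B | 30720 | FP8 | 1 | 880.72 | |
| Qwen3-14B | 30720 | AWQ-INT4 | 1 | 744.74 | |
| Qwen3-14B | 63488 | BF16 | 1 | 742.36 | |
| Qwen3-14B | 63488 | FP8 | 1 | 1089.04 | |
| Qwen3-14B | 63488 | AWQ-INT4 | 1 | 884.06 | |
| Qwen3-14B | 129024 | BF16 | 1 | 826.15 | |
| Qwen3-14B | 129024 | FP8 | 1 | 1049.64 | |
| Qwen3-14B | 129024 | AWQ-INT4 | 1 | 857.56 | |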

Qwen3-14B (Transformers)

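| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-14B | 1 | BF16 | 1 | 40.66 | 28402 |
| Qwen3-14B | 1 | FP8 | 1 | 13.02 | 16012 |
| Qwen3-14B | 1 | AWQ-INT4 | 1 | 44.67 | 9962 |
| Qwen3-14B | 6144 | BF16 | 1 | 108.52 | 29495 |
| Qwen3-14B | 6144 | FP8 | 1 | 44.86 | 16972 |
| Qwen3-14B | 6144 | AWQ-INT4 | 1 | 128.08 | 11020 |
| Qwen3-14B | 14336 | BF16 | 1 | 136.36 | 30775 |
| Qwen3-14B | 14336 | FP8 | 1 | 71.96 | 18253 |
| Qwen3-14B | 14336 | AWQ-INT4 | 1 | 220.62 | 12438 |
| Qwen3-14B | 30720 | BF16 | 1 | 155.38 | 33336 |
| Qwen3-14B | 30720 | FP8 | 1 | 102.63 | 20813 |
| Qwen3-14B | 30720 | AWQ-INT4 | 1 | 363.25 | 15323 |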

Qwen3-32B (SGLang)

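| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-32B | 1 | BF16 | 1 | 20.72 | |
| Qwen3-32B | 1 | FP8 | 1 | 46.17 | |
| Qwen3-32B | 1 | AWQ-INT4 | 1 | 47.67 | |
| Qwen3-32B | 6144 | BF16 | 1 | 77.82 | |
| Qwen3-32B | 6144 | FP8 | 1 | 165.71 | |
| Qwen3-32B | 6144 | AWQ-INT4 | 1 | 159.99 | |
| Qwen3-32B | 14336 | BF16 | 1 | 143.08 | |
| Qwen3-32B | 14336 | FP8 | 1 | 287.60 | |
| Qwen3-32B | 14336 | AWQ-INT4 | 1 | 260.44 | |
| Qwen3-32B | 30720 | BF16 | 1 | 240.75 | |
| Qwen3-32B | 30720 | FP8 | 1 | 436.59 | |
| Qwen3-32B | 30720 | AWQ-INT4 | 1 | 366.84 | |
| Qwen3-32B | 63488 | BF16 | 1 | 342.96 | |
| Qwen3-32B | 63488 | FP8 | 1 | 532.18 | |
| Qwen3-32B | 63488 | AWQ-INT4 | 1 | 425.23 | |
| Qwen3-32B | 129024 | BF16 | 2 | 711.40 | TP=2 |
| Qwen3-32B | 129024 | FP8 | 1 | 491.45 | |
| Qwen3-32B | 129024 | AWQ-INT4 | 1 | 395.96 | |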

Qwen3-32B (Transformers)

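| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-32B | 1 | BF16 | 1 | 26.24 | 62751 |
| Qwen3-32B | 1 | FP8 | 1 | 7.37 | 33379 |
| Qwen3-32B | 1 | AWQ-INT4 | 1 | 41.8 | 19109 |
| Qwen3-32B | 6144 | BF16 | 1 | 51.41 | 64583 |
| Qwen3-32B | 6144 | FP8 | 1 | 23.57 | 34915 |
| Qwen3-32B | 6144 | AWQ-INT4 | 1 | 68.71 | 20795 |
| Qwen3-32B | 14336 | BF16 | 1 | 62.41 | 66632 |
| Qwen3-32B | 14336 | FP8 | 1 | 36.30 | 36963 |
| Qwen3-32B | 14336 | AWQ-INT4 | 1 | 107.02 | 23105 |
| Qwen3-32B | 30720 | BF16 | 1 | 69.16 | 70728 |
| Qwen3-32B | 30720 | FP8 | 1 | 49.44 | 41060 |
| Qwen3-32B | 30720 | AWQ-INT4 | 1 | 188.11 | 27718 |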

Qwen3-30B-A3B (SGLang)

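| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-30B-A3B | 1 | BF16 | 1 | 137.18 | |
| Qwen3-30B-A3B | 1 | FP8 | 1 | 155.55 | |
| Qwen3-30B-A3B | 1 | GPTQ-INT4 | 1 | 31.29 | GPTQ-Marlin |
| Qwen3-30B-A3B | 6144 | BF16 | 1 | 490.10 | |
| Qwen3-30B-A3B | 6144 | FP8 | 1 | 551.34 | |
| Qwen3-30B-A3B | 6144 | GPTQ-INT4 | 1 | 120.13 | GPTQ-Marlin |
| Qwen3-30B-A3B | 14336 | BF16 | 1 | 849.62 | |
| Qwen3-30B-A3B | 14336 | FP8 | 1 | 945.13 | |
| Qwen3-30B-A3B | 14336 | GPTQ-INT4 | 1 | 227.27 | GPTQ-Marlin |
| Qwen3-30B-A3B | 30720 | BF16 | 1 | 1283.94 | |
| Qwen3-30B-A3B | 30720 | FP8 | 1 | 1405.91 | |
| Qwen3-30B-A3B | 30720 | GPTQ-INT4 | 1 | 404.45 | GPTQ-Marlin |
| Qwen3-30B-A3B | 63488 | BF16 | 1 | 1538.79 | |
| Qwen3-30B-A3B | 63488 | FP8 | 1 | 1647.89 | |
| Qwen3-30B-A3B | 63488 | GPTQ-INT4 | 1 | 617.09 | GPTQ-Marlin |
| Qwen3-30B-A3B | 129024 | BF16 | 1 | 1385.65 | |
| Qwen3-30B-A3B | 129024 | FP8 | 1 | 1442.14 | |
| Qwen3-30B-A3B | 129024 | GPTQ-INT4 | 1 | 704.82 | GPTQ-Marlin |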

Qwen3-30B-A3B (Transformers)

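| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) | Note |
|---|---|---|---|---|---|---|
| Qwen3-30B-A3B | 1 | BF16 | 1 | 1.89 | 58462 | |
| Qwen3-30B-A3B | 1 | FP8 | 1 | 0.44 | 30296 | |
| Qwen3-30B-A3B | 1 | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |
| Qwen3-30B-A3B | 6144 | BF16 | 1 | 7.45 | 59037 | |
| Qwen3-30B-A3B | 6144 | FP8 | 1 | 1.77 | 30872 | |
| Qwen3-30B-A3B | 6144 | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |
| Qwen3-30B-A3B | 14336 | BF16 | 1 | 14.47 | 59806 | |
| Qwen3-30B-A3B | 14336 | FP8 | 1 | 3.5 | 31641 | |
| Qwen3-30B-A3B | 14336 | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |
| Qwen3-30B-A3B | 30720 | BF16 | 1 | 27.03 | 61342 | |
| Qwen3-30B-A3B | 30720 | FP8 | 1 | 6.86 | 33177 | |
| Qwen3-30B-A3B | 30720 | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |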

Qwen3-235B-A22B (SGLang)

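| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-235B-A22B | 1 | BF16 | 8 | 74.50 | TP=8 |
| Qwen3-235B-A22B | 1 | FP8 | 4 | 71.65 | TP=4 |
| Qwen3-235B-A22B | 1 | GPTQ-INT4 | 4 | 14.69 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 6144 | BF16 | 8 | 289.03 | TP=8 |
| Qwen3-235B-A22B | 6144 | FP8 | 4 | 275.16 | TP=4 |
| Qwen3-235B-A22B | 6144 | GPTQ-INT4 | 4 | 56.97 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 14336 | BF16 | 8 | 546.73 | TP=8 |
| Qwen3-235B-A22B | 14336 | FP8 | 4 | 514.23 | TP=4 |
| Qwen3-235B-A22B | 14336 | GPTQ-INT4 | 4 | 109.13 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 30720 | BF16 | 8 | 979.41 | TP=8 |
| Qwen3-235B-A22B | 30720 | FP8 | 4 | 887.90 | TP=4 |
| Qwen3-235B-A22B | 30720 | GPTQ-INT4 | 4 | 198.99 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 63488 | BF16 | 8 | 1493.91 | TP=8 |
| Qwen3-235B-A22B | 63488 | FP8 | 4 | 1269.34 | TP=4 |
| Qwen3-235B-A22B | 63488 | GPTQ-INT4 | 4 | 422.77 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 129024 | BF16 | 8 | 1639.54 | TP=8 |
| Qwen3-235B-A22B | 129024 | FP8 | 4 | 1319.66 | TP=4 |
| Qwen3-235B-A22B | 129024 | GPTQ-INT4 | 4 | 552.28 | TP=4, GPTQ-Marlin |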