# Speed Benchmark

We report the speed performance of bfloat16 models and quantized models (including FP8, GPTQ, and AWQ) of the Qwen3 series. Specifically, we report the inference speed (tokens/s) and the GPU memory footprint (MB) under different context lengths.

## Environments

### Hugging Face Transformers

- **Hardware**:
  - NVIDIA H20 96GB
- **Software for Non-AutoAWQ**:
  - PyTorch 2.6.0
  - Flash Attention 2.7.4
  - Transformers 4.51.3
  - GPTQModel 2.2.0+cu128torch2.6
- **Software for AutoAWQ**:
  - PyTorch 2.6.0+cu124
  - Transformers 4.51.3
  - AutoAWQ 0.2.9
  - AutoAWQ_kernels 0.0.9

### SGLang

- **Hardware**:
  - NVIDIA H20 96GB
- **Software**:
  - PyTorch 2.6.0+cu124
  - Transformers 4.51.3
  - SGLang 0.4.6.post1
  - SGL-kernel 0.1.0
  - vLLM 0.7.2 (required by SGLang for AWQ quantization)

## Notes

- **Inference Speed (tokens/s)** is calculated as:

  ```{math}
  \text{Speed} = \frac{\text{tokens}_{\text{prompt}} + \text{tokens}_{\text{generation}}}{\text{time}}
  ```

- We use a **batch size of 1** and the **minimum number of GPUs** possible for evaluation.
- We test the **speed and memory usage** when generating **2048 tokens**, with input lengths of `1`, `6144`, `14336`, `30720`, `63488`, and `129024` tokens.
- **For SGLang**:
  - **Memory usage** is not reported because SGLang pre-allocates all GPU memory. By default, we set `mem_fraction_static=0.85`.
  - We configure `context_length=140000` and set `enable_mixed_chunk=True`.
  - For **AWQ quantization**, we use the **awq_marlin** backend.
  - We set `skip_tokenizer_init=True` and perform generation using `input_ids` instead of raw text prompts.
- **FP8 Performance in Transformers**: FP8 inference in Transformers is currently not optimal (it is slower than BF16 in every configuration reported here) and requires further optimization.
- **GPTQ-INT4 Performance in SGLang**: GPTQ-INT4 in SGLang also needs improvement (it is slower than FP8 at every input length reported here), and we are actively working with the team to enhance it.

## Results

### Qwen3-0.6B (SGLang)
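
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
| --- | --- | --- | --- | --- | --- |
| Qwen3-0.6B | 1 | BF16 | 1 | 414.17 | |
| Qwen3-0.6B | 1 | FP8 | 1 | 458.03 | |
| Qwen3-0.6B | 1 | GPTQ-Int8 | 1 | 344.92 | |
| Qwen3-0.6B | 6144 | BF16 | 1 | 1426.46 | |
| Qwen3-0.6B | 6144 | FP8 | 1 | 1572.95 | |
| Qwen3-0.6B | 6144 | GPTQ-Int8 | 1 | 1234.29 | |
| Qwen3-0.6B | 14336 | BF16 | 1 | 2478.02 | |
| Qwen3-0.6B | 14336 | FP8 | 1 | 2689.08 | |
| Qwen3-0.6B | 14336 | GPTQ-Int8 | 1 | 2198.82 | |
| Qwen3-0.6B | 30720 | BF16 | 1 | 3577.42 | |
| Qwen3-0.6B | 30720 | FP8 | 1 | 3819.86 | |
| Qwen3-0.6B | 30720 | GPTQ-Int8 | 1 | 3342.06 | |
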
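
To relate these throughput figures to wall-clock time, the speed definition from the Notes can be inverted. The sketch below is illustrative only: the function names are hypothetical and not part of the benchmark scripts, and it assumes the full 2048 tokens were generated in each run, as the Notes describe.

```python
GENERATED_TOKENS = 2048  # generation length used throughout this benchmark


def speed_tokens_per_s(prompt_tokens: int, generated_tokens: int, elapsed_s: float) -> float:
    """Speed = (prompt tokens + generated tokens) / wall-clock time."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (prompt_tokens + generated_tokens) / elapsed_s


def implied_elapsed_s(prompt_tokens: int, speed: float,
                      generated_tokens: int = GENERATED_TOKENS) -> float:
    """Invert the formula: wall-clock seconds implied by a reported speed."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return (prompt_tokens + generated_tokens) / speed


# Qwen3-0.6B, SGLang, BF16, input length 6144: reported 1426.46 tokens/s.
elapsed = implied_elapsed_s(6144, 1426.46)
print(f"{elapsed:.2f} s")  # prints "5.74 s"
```

Because prompt tokens count toward the numerator, the metric rises with input length even when per-token decoding speed does not; it is an end-to-end throughput figure, not a pure decoding rate.
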
### Qwen3-0.6B (Transformers)
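
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
| --- | --- | --- | --- | --- | --- |
| Qwen3-0.6B | 1 | BF16 | 1 | 58.57 | 1394 |
| Qwen3-0.6B | 1 | FP8 | 1 | 24.60 | 1217 |
| Qwen3-0.6B | 1 | GPTQ-Int8 | 1 | 26.56 | 986 |
| Qwen3-0.6B | 6144 | BF16 | 1 | 154.82 | 2066 |
| Qwen3-0.6B | 6144 | FP8 | 1 | 73.96 | 1943 |
| Qwen3-0.6B | 6144 | GPTQ-Int8 | 1 | 93.84 | 1658 |
| Qwen3-0.6B | 14336 | BF16 | 1 | 168.48 | 2963 |
| Qwen3-0.6B | 14336 | FP8 | 1 | 104.99 | 2839 |
| Qwen3-0.6B | 14336 | GPTQ-Int8 | 1 | 219.61 | 2554 |
| Qwen3-0.6B | 30720 | BF16 | 1 | 175.93 | 4755 |
| Qwen3-0.6B | 30720 | FP8 | 1 | 132.78 | 4632 |
| Qwen3-0.6B | 30720 | GPTQ-Int8 | 1 | 345.71 | 4347 |
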
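
For a quick side-by-side reading of the two Qwen3-0.6B tables, the BF16 speeds can be divided row by row. This is a minimal sketch: the dictionary and function names are hypothetical, the numbers are copied from the tables above, and the two frameworks run on different software stacks (see Environments), so the ratio describes these particular setups rather than the frameworks in general.

```python
# BF16 speeds (tokens/s) for Qwen3-0.6B, keyed by input length,
# copied from the SGLang and Transformers tables above.
sglang_bf16 = {1: 414.17, 6144: 1426.46, 14336: 2478.02, 30720: 3577.42}
transformers_bf16 = {1: 58.57, 6144: 154.82, 14336: 168.48, 30720: 175.93}


def speed_ratio(numerator: dict, denominator: dict) -> dict:
    """Per-input-length ratio of two speed tables keyed by input length."""
    return {n: numerator[n] / denominator[n] for n in numerator if n in denominator}


ratios = speed_ratio(sglang_bf16, transformers_bf16)
for input_length, ratio in ratios.items():
    print(f"input length {input_length:>5}: SGLang / Transformers = {ratio:.2f}x")
```

The gap widens with input length in these numbers, from roughly 7x at a single input token to roughly 20x at 30720 tokens.
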
### Qwen3-1.7B (SGLang)
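
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
| --- | --- | --- | --- | --- | --- |
| Qwen3-1.7B | 1 | BF16 | 1 | 227.80 | |
| Qwen3-1.7B | 1 | FP8 | 1 | 333.90 | |
| Qwen3-1.7B | 1 | GPTQ-Int8 | 1 | 257.40 | |
| Qwen3-1.7B | 6144 | BF16 | 1 | 838.28 | |
| Qwen3-1.7B | 6144 | FP8 | 1 | 1198.20 | |
| Qwen3-1.7B | 6144 | GPTQ-Int8 | 1 | 945.91 | |
| Qwen3-1.7B | 14336 | BF16 | 1 | 1525.71 | |
| Qwen3-1.7B | 14336 | FP8 | 1 | 2095.61 | |
| Qwen3-1.7B | 14336 | GPTQ-Int8 | 1 | 1707.63 | |
| Qwen3-1.7B | 30720 | BF16 | 1 | 2439.03 | |
| Qwen3-1.7B | 30720 | FP8 | 1 | 3165.32 | |
| Qwen3-1.7B | 30720 | GPTQ-Int8 | 1 | 2706.16 | |
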
### Qwen3-1.7B (Transformers)
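
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
| --- | --- | --- | --- | --- | --- |
| Qwen3-1.7B | 1 | BF16 | 1 | 59.83 | 3412 |
| Qwen3-1.7B | 1 | FP8 | 1 | 23.83 | 2726 |
| Qwen3-1.7B | 1 | GPTQ-Int8 | 1 | 28.06 | 2229 |
| Qwen3-1.7B | 6144 | BF16 | 1 | 238.53 | 4213 |
| Qwen3-1.7B | 6144 | FP8 | 1 | 90.87 | 3462 |
| Qwen3-1.7B | 6144 | GPTQ-Int8 | 1 | 110.82 | 2901 |
| Qwen3-1.7B | 14336 | BF16 | 1 | 352.59 | 5109 |
| Qwen3-1.7B | 14336 | FP8 | 1 | 153.37 | 4359 |
| Qwen3-1.7B | 14336 | GPTQ-Int8 | 1 | 222.78 | 3798 |
| Qwen3-1.7B | 30720 | BF16 | 1 | 418.13 | 6902 |
| Qwen3-1.7B | 30720 | FP8 | 1 | 235.61 | 6151 |
| Qwen3-1.7B | 30720 | GPTQ-Int8 | 1 | 386.85 | 5590 |
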
### Qwen3-4B (SGLang)
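
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
| --- | --- | --- | --- | --- | --- |
| Qwen3-4B | 1 | BF16 | 1 | 133.13 | |
| Qwen3-4B | 1 | FP8 | 1 | 200.61 | |
| Qwen3-4B | 1 | AWQ-INT4 | 1 | 199.71 | |
| Qwen3-4B | 6144 | BF16 | 1 | 466.19 | |
| Qwen3-4B | 6144 | FP8 | 1 | 662.26 | |
| Qwen3-4B | 6144 | AWQ-INT4 | 1 | 640.07 | |
| Qwen3-4B | 14336 | BF16 | 1 | 789.25 | |
| Qwen3-4B | 14336 | FP8 | 1 | 1066.23 | |
| Qwen3-4B | 14336 | AWQ-INT4 | 1 | 1006.23 | |
| Qwen3-4B | 30720 | BF16 | 1 | 1165.75 | |
| Qwen3-4B | 30720 | FP8 | 1 | 1467.71 | |
| Qwen3-4B | 30720 | AWQ-INT4 | 1 | 1358.84 | |
| Qwen3-4B | 63488 | BF16 | 1 | 1423.98 | |
| Qwen3-4B | 63488 | FP8 | 1 | 1660.67 | |
| Qwen3-4B | 63488 | AWQ-INT4 | 1 | 1513.97 | |
| Qwen3-4B | 129024 | BF16 | 1 | 1371.04 | |
| Qwen3-4B | 129024 | FP8 | 1 | 1497.27 | |
| Qwen3-4B | 129024 | AWQ-INT4 | 1 | 1375.71 | |
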
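
The input lengths listed in the Notes (`1` through `129024`) look less arbitrary once the 2048 generated tokens are added. The sketch below checks that arithmetic; reading the totals as "8K to 128K total context" is an observation from the numbers, not something the benchmark states, and the variable names are hypothetical.

```python
GENERATED_TOKENS = 2048
INPUT_LENGTHS = [1, 6144, 14336, 30720, 63488, 129024]  # from the Notes


def is_power_of_two(n: int) -> bool:
    """A positive integer is a power of two when exactly one bit is set."""
    return n > 0 and (n & (n - 1)) == 0


totals = [n + GENERATED_TOKENS for n in INPUT_LENGTHS]
for n, total in zip(INPUT_LENGTHS, totals):
    label = f"{total // 1024}K" if is_power_of_two(total) else "-"
    print(f"{n:>6} + {GENERATED_TOKENS} = {total:>6}  {label}")
```

Every input length except the single-token case sums to a power of two, and the longest total (131072) stays below the `context_length=140000` configured for SGLang.
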
### Qwen3-4B (Transformers)
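
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
| --- | --- | --- | --- | --- | --- |
| Qwen3-4B | 1 | BF16 | 1 | 45.94 | 7973 |
| Qwen3-4B | 1 | FP8 | 1 | 17.33 | 5281 |
| Qwen3-4B | 1 | AWQ-INT4 | 1 | 51.57 | 2915 |
| Qwen3-4B | 6144 | BF16 | 1 | 159.95 | 8860 |
| Qwen3-4B | 6144 | FP8 | 1 | 60.55 | 6144 |
| Qwen3-4B | 6144 | AWQ-INT4 | 1 | 183.04 | 3881 |
| Qwen3-4B | 14336 | BF16 | 1 | 195.31 | 10012 |
| Qwen3-4B | 14336 | FP8 | 1 | 96.81 | 7297 |
| Qwen3-4B | 14336 | AWQ-INT4 | 1 | 265.22 | 5151 |
| Qwen3-4B | 30720 | BF16 | 1 | 217.97 | 12317 |
| Qwen3-4B | 30720 | FP8 | 1 | 138.84 | 9611 |
| Qwen3-4B | 30720 | AWQ-INT4 | 1 | 481.69 | 7742 |
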
### Qwen3-8B (SGLang)
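
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
| --- | --- | --- | --- | --- | --- |
| Qwen3-8B | 1 | BF16 | 1 | 81.73 | |
| Qwen3-8B | 1 | FP8 | 1 | 150.25 | |
| Qwen3-8B | 1 | AWQ-INT4 | 1 | 144.11 | |
| Qwen3-8B | 6144 | BF16 | 1 | 296.25 | |
| Qwen3-8B | 6144 | FP8 | 1 | 516.64 | |
| Qwen3-8B | 6144 | AWQ-INT4 | 1 | 477.89 | |
| Qwen3-8B | 14336 | BF16 | 1 | 524.70 | |
| Qwen3-8B | 14336 | FP8 | 1 | 859.92 | |
| Qwen3-8B | 14336 | AWQ-INT4 | 1 | 770.44 | |
| Qwen3-8B | 30720 | BF16 | 1 | 832.67 | |
| Qwen3-8B | 30720 | FP8 | 1 | 1242.24 | |
| Qwen3-8B | 30720 | AWQ-INT4 | 1 | 1075.91 | |
| Qwen3-8B | 63488 | BF16 | 1 | 1112.78 | |
| Qwen3-8B | 63488 | FP8 | 1 | 1476.46 | |
| Qwen3-8B | 63488 | AWQ-INT4 | 1 | 1254.91 | |
| Qwen3-8B | 129024 | BF16 | 1 | 1173.32 | |
| Qwen3-8B | 129024 | FP8 | 1 | 1393.21 | |
| Qwen3-8B | 129024 | AWQ-INT4 | 1 | 1198.06 | |
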
### Qwen3-8B (Transformers)
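
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
| --- | --- | --- | --- | --- | --- |
| Qwen3-8B | 1 | BF16 | 1 | 45.32 | 15947 |
| Qwen3-8B | 1 | FP8 | 1 | 15.46 | 9323 |
| Qwen3-8B | 1 | AWQ-INT4 | 1 | 51.33 | 6177 |
| Qwen3-8B | 6144 | BF16 | 1 | 146.12 | 16811 |
| Qwen3-8B | 6144 | FP8 | 1 | 55.07 | 10187 |
| Qwen3-8B | 6144 | AWQ-INT4 | 1 | 163.23 | 7113 |
| Qwen3-8B | 14336 | BF16 | 1 | 183.29 | 17963 |
| Qwen3-8B | 14336 | FP8 | 1 | 89.64 | 11340 |
| Qwen3-8B | 14336 | AWQ-INT4 | 1 | 242.97 | 8409 |
| Qwen3-8B | 30720 | BF16 | 1 | 208.98 | 20267 |
| Qwen3-8B | 30720 | FP8 | 1 | 130.93 | 13644 |
| Qwen3-8B | 30720 | AWQ-INT4 | 1 | 438.62 | 11001 |
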
### Qwen3-14B (SGLang)
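
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
| --- | --- | --- | --- | --- | --- |
| Qwen3-14B | 1 | BF16 | 1 | 47.10 | |
| Qwen3-14B | 1 | FP8 | 1 | 97.11 | |
| Qwen3-14B | 1 | AWQ-INT4 | 1 | 96.49 | |
| Qwen3-14B | 6144 | BF16 | 1 | 174.85 | |
| Qwen3-14B | 6144 | FP8 | 1 | 342.95 | |
| Qwen3-14B | 6144 | AWQ-INT4 | 1 | 321.62 | |
| Qwen3-14B | 14336 | BF16 | 1 | 317.56 | |
| Qwen3-14B | 14336 | FP8 | 1 | 587.33 | |
| Qwen3-14B | 14336 | AWQ-INT4 | 1 | 525.74 | |
| Qwen3-14B | 30720 | BF16 | 1 | 525.80 | |
| Qwen3-14B | 30720 | FP8 | 1 | 880.72 | |
| Qwen3-14B | 30720 | AWQ-INT4 | 1 | 744.74 | |
| Qwen3-14B | 63488 | BF16 | 1 | 742.36 | |
| Qwen3-14B | 63488 | FP8 | 1 | 1089.04 | |
| Qwen3-14B | 63488 | AWQ-INT4 | 1 | 884.06 | |
| Qwen3-14B | 129024 | BF16 | 1 | 826.15 | |
| Qwen3-14B | 129024 | FP8 | 1 | 1049.64 | |
| Qwen3-14B | 129024 | AWQ-INT4 | 1 | 857.56 | |
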
### Qwen3-14B (Transformers)
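
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
| --- | --- | --- | --- | --- | --- |
| Qwen3-14B | 1 | BF16 | 1 | 40.66 | 28402 |
| Qwen3-14B | 1 | FP8 | 1 | 13.02 | 16012 |
| Qwen3-14B | 1 | AWQ-INT4 | 1 | 44.67 | 9962 |
| Qwen3-14B | 6144 | BF16 | 1 | 108.52 | 29495 |
| Qwen3-14B | 6144 | FP8 | 1 | 44.86 | 16972 |
| Qwen3-14B | 6144 | AWQ-INT4 | 1 | 128.08 | 11020 |
| Qwen3-14B | 14336 | BF16 | 1 | 136.36 | 30775 |
| Qwen3-14B | 14336 | FP8 | 1 | 71.96 | 18253 |
| Qwen3-14B | 14336 | AWQ-INT4 | 1 | 220.62 | 12438 |
| Qwen3-14B | 30720 | BF16 | 1 | 155.38 | 33336 |
| Qwen3-14B | 30720 | FP8 | 1 | 102.63 | 20813 |
| Qwen3-14B | 30720 | AWQ-INT4 | 1 | 363.25 | 15323 |
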
### Qwen3-32B (SGLang)
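
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
| --- | --- | --- | --- | --- | --- |
| Qwen3-32B | 1 | BF16 | 1 | 20.72 | |
| Qwen3-32B | 1 | FP8 | 1 | 46.17 | |
| Qwen3-32B | 1 | AWQ-INT4 | 1 | 47.67 | |
| Qwen3-32B | 6144 | BF16 | 1 | 77.82 | |
| Qwen3-32B | 6144 | FP8 | 1 | 165.71 | |
| Qwen3-32B | 6144 | AWQ-INT4 | 1 | 159.99 | |
| Qwen3-32B | 14336 | BF16 | 1 | 143.08 | |
| Qwen3-32B | 14336 | FP8 | 1 | 287.60 | |
| Qwen3-32B | 14336 | AWQ-INT4 | 1 | 260.44 | |
| Qwen3-32B | 30720 | BF16 | 1 | 240.75 | |
| Qwen3-32B | 30720 | FP8 | 1 | 436.59 | |
| Qwen3-32B | 30720 | AWQ-INT4 | 1 | 366.84 | |
| Qwen3-32B | 63488 | BF16 | 1 | 342.96 | |
| Qwen3-32B | 63488 | FP8 | 1 | 532.18 | |
| Qwen3-32B | 63488 | AWQ-INT4 | 1 | 425.23 | |
| Qwen3-32B | 129024 | BF16 | 2 | 711.40 | TP=2 |
| Qwen3-32B | 129024 | FP8 | 1 | 491.45 | |
| Qwen3-32B | 129024 | AWQ-INT4 | 1 | 395.96 | |
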
### Qwen3-32B (Transformers)
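
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
| --- | --- | --- | --- | --- | --- |
| Qwen3-32B | 1 | BF16 | 1 | 26.24 | 62751 |
| Qwen3-32B | 1 | FP8 | 1 | 7.37 | 33379 |
| Qwen3-32B | 1 | AWQ-INT4 | 1 | 41.8 | 19109 |
| Qwen3-32B | 6144 | BF16 | 1 | 51.41 | 64583 |
| Qwen3-32B | 6144 | FP8 | 1 | 23.57 | 34915 |
| Qwen3-32B | 6144 | AWQ-INT4 | 1 | 68.71 | 20795 |
| Qwen3-32B | 14336 | BF16 | 1 | 62.41 | 66632 |
| Qwen3-32B | 14336 | FP8 | 1 | 36.30 | 36963 |
| Qwen3-32B | 14336 | AWQ-INT4 | 1 | 107.02 | 23105 |
| Qwen3-32B | 30720 | BF16 | 1 | 69.16 | 70728 |
| Qwen3-32B | 30720 | FP8 | 1 | 49.44 | 41060 |
| Qwen3-32B | 30720 | AWQ-INT4 | 1 | 188.11 | 27718 |
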
### Qwen3-30B-A3B (SGLang)
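
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
| --- | --- | --- | --- | --- | --- |
| Qwen3-30B-A3B | 1 | BF16 | 1 | 137.18 | |
| Qwen3-30B-A3B | 1 | FP8 | 1 | 155.55 | |
| Qwen3-30B-A3B | 1 | GPTQ-INT4 | 1 | 31.29 | GPTQ-Marlin |
| Qwen3-30B-A3B | 6144 | BF16 | 1 | 490.10 | |
| Qwen3-30B-A3B | 6144 | FP8 | 1 | 551.34 | |
| Qwen3-30B-A3B | 6144 | GPTQ-INT4 | 1 | 120.13 | GPTQ-Marlin |
| Qwen3-30B-A3B | 14336 | BF16 | 1 | 849.62 | |
| Qwen3-30B-A3B | 14336 | FP8 | 1 | 945.13 | |
| Qwen3-30B-A3B | 14336 | GPTQ-INT4 | 1 | 227.27 | GPTQ-Marlin |
| Qwen3-30B-A3B | 30720 | BF16 | 1 | 1283.94 | |
| Qwen3-30B-A3B | 30720 | FP8 | 1 | 1405.91 | |
| Qwen3-30B-A3B | 30720 | GPTQ-INT4 | 1 | 404.45 | GPTQ-Marlin |
| Qwen3-30B-A3B | 63488 | BF16 | 1 | 1538.79 | |
| Qwen3-30B-A3B | 63488 | FP8 | 1 | 1647.89 | |
| Qwen3-30B-A3B | 63488 | GPTQ-INT4 | 1 | 617.09 | GPTQ-Marlin |
| Qwen3-30B-A3B | 129024 | BF16 | 1 | 1385.65 | |
| Qwen3-30B-A3B | 129024 | FP8 | 1 | 1442.14 | |
| Qwen3-30B-A3B | 129024 | GPTQ-INT4 | 1 | 704.82 | GPTQ-Marlin |
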
### Qwen3-30B-A3B (Transformers)
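
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) | Note |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3-30B-A3B | 1 | BF16 | 1 | 1.89 | 58462 | |
| Qwen3-30B-A3B | 1 | FP8 | 1 | 0.44 | 30296 | |
| Qwen3-30B-A3B | 1 | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |
| Qwen3-30B-A3B | 6144 | BF16 | 1 | 7.45 | 59037 | |
| Qwen3-30B-A3B | 6144 | FP8 | 1 | 1.77 | 30872 | |
| Qwen3-30B-A3B | 6144 | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |
| Qwen3-30B-A3B | 14336 | BF16 | 1 | 14.47 | 59806 | |
| Qwen3-30B-A3B | 14336 | FP8 | 1 | 3.5 | 31641 | |
| Qwen3-30B-A3B | 14336 | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |
| Qwen3-30B-A3B | 30720 | BF16 | 1 | 27.03 | 61342 | |
| Qwen3-30B-A3B | 30720 | FP8 | 1 | 6.86 | 33177 | |
| Qwen3-30B-A3B | 30720 | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |
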
### Qwen3-235B-A22B (SGLang)
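
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
| --- | --- | --- | --- | --- | --- |
| Qwen3-235B-A22B | 1 | BF16 | 8 | 74.50 | TP=8 |
| Qwen3-235B-A22B | 1 | FP8 | 4 | 71.65 | TP=4 |
| Qwen3-235B-A22B | 1 | GPTQ-INT4 | 4 | 14.69 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 6144 | BF16 | 8 | 289.03 | TP=8 |
| Qwen3-235B-A22B | 6144 | FP8 | 4 | 275.16 | TP=4 |
| Qwen3-235B-A22B | 6144 | GPTQ-INT4 | 4 | 56.97 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 14336 | BF16 | 8 | 546.73 | TP=8 |
| Qwen3-235B-A22B | 14336 | FP8 | 4 | 514.23 | TP=4 |
| Qwen3-235B-A22B | 14336 | GPTQ-INT4 | 4 | 109.13 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 30720 | BF16 | 8 | 979.41 | TP=8 |
| Qwen3-235B-A22B | 30720 | FP8 | 4 | 887.90 | TP=4 |
| Qwen3-235B-A22B | 30720 | GPTQ-INT4 | 4 | 198.99 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 63488 | BF16 | 8 | 1493.91 | TP=8 |
| Qwen3-235B-A22B | 63488 | FP8 | 4 | 1269.34 | TP=4 |
| Qwen3-235B-A22B | 63488 | GPTQ-INT4 | 4 | 422.77 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 129024 | BF16 | 8 | 1639.54 | TP=8 |
| Qwen3-235B-A22B | 129024 | FP8 | 4 | 1319.66 | TP=4 |
| Qwen3-235B-A22B | 129024 | GPTQ-INT4 | 4 | 552.28 | TP=4, GPTQ-Marlin |
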
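
For readers who want to approximate the SGLang setup behind these rows, the settings quoted in the Notes can be gathered into one set of keyword arguments. This is a sketch under stated assumptions, not the benchmark's actual launch script: the helper `engine_kwargs` is hypothetical, the Hugging Face model id and the `tp_size` argument name are assumptions on our part, and only the four shared settings are taken from the Notes.

```python
# Settings quoted in the Notes for all SGLang runs.
SGLANG_SETTINGS = {
    "mem_fraction_static": 0.85,
    "context_length": 140000,
    "enable_mixed_chunk": True,
    "skip_tokenizer_init": True,
}


def engine_kwargs(model_path: str, tp_size: int = 1) -> dict:
    """Merge the shared settings with per-model values (hypothetical helper)."""
    if tp_size < 1:
        raise ValueError("tp_size must be at least 1")
    return {"model_path": model_path, "tp_size": tp_size, **SGLANG_SETTINGS}


# BF16 row of the table above: 8 GPUs with tensor parallelism (TP=8).
# The model id is an assumption; substitute the checkpoint you actually use.
kwargs = engine_kwargs("Qwen/Qwen3-235B-A22B", tp_size=8)

# Assumption: these would be forwarded to the SGLang offline engine, e.g.
#   import sglang as sgl
#   engine = sgl.Engine(**kwargs)
# Verify argument names against the SGLang version you have installed.
print(kwargs)
```

Because `skip_tokenizer_init=True` is set, generation would be driven with token ids rather than text prompts, as the Notes describe.
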