# Speed Benchmark
We report the inference speed of the Qwen3 series for bfloat16 models and quantized models (FP8, GPTQ, and AWQ).
Specifically, we report the inference speed (tokens/s) and, for Hugging Face Transformers, the GPU memory footprint (MB) under different context lengths.
## Environments
### Hugging Face Transformers
- **Hardware**:
  - NVIDIA H20 96GB
- **Software for Non-AutoAWQ**:
  - PyTorch 2.6.0
  - Flash Attention 2.7.4
  - Transformers 4.51.3
  - GPTQModel 2.2.0+cu128torch2.6
- **Software for AutoAWQ**:
  - PyTorch 2.6.0+cu124
  - Transformers 4.51.3
  - AutoAWQ 0.2.9
  - AutoAWQ_kernels 0.0.9
### SGLang
- **Hardware**:
  - NVIDIA H20 96GB
- **Software**:
  - PyTorch 2.6.0+cu124
  - Transformers 4.51.3
  - SGLang 0.4.6.post1
  - SGL-kernel 0.1.0
  - vLLM 0.7.2 (required by SGLang for AWQ quantization)
## Notes
- **Inference Speed (tokens/s)** is calculated as:
  ```{math}
  \text{Speed} = \frac{\text{tokens}_{\text{prompt}} + \text{tokens}_{\text{generation}}}{\text{time}}
  ```
- We use a **batch size of 1** and the **minimum number of GPUs** possible for evaluation.
- We test the **speed and memory usage** when generating **2048 tokens**, with input lengths of
`1`, `6144`, `14336`, `30720`, `63488`, and `129024` tokens.
- **For SGLang**:
  - **Memory usage** is not reported because SGLang pre-allocates GPU memory. By default, we set `mem_fraction_static=0.85`.
  - We set `context_length=140000` and `enable_mixed_chunk=True`.
  - For **AWQ quantization**, we use the **awq_marlin** backend.
  - We set `skip_tokenizer_init=True` and run generation from `input_ids` instead of raw text prompts.
- **FP8 Performance in Transformers**: FP8 inference in Transformers is currently slower than BF16 in every configuration reported below and requires further optimization.
- **GPTQ-INT4 Performance in SGLang**: The performance of GPTQ-INT4 in SGLang also needs improvement, and we are actively working with the team to enhance it.
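
The speed formula in the notes above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark script itself: `inference_speed`, `timed_speed`, and the `generate` callable are hypothetical names introduced here, and `generate` is assumed to return only the newly generated token IDs. The constants come from the notes above.

```python
import time
from typing import Callable, Sequence

# Values taken from the notes above.
GENERATION_LENGTH = 2048
INPUT_LENGTHS = (1, 6144, 14336, 30720, 63488, 129024)


def inference_speed(prompt_tokens: int, generated_tokens: int, elapsed_seconds: float) -> float:
    """Return (prompt tokens + generated tokens) / time, in tokens/s."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (prompt_tokens + generated_tokens) / elapsed_seconds


def timed_speed(
    generate: Callable[[Sequence[int], int], Sequence[int]],
    input_ids: Sequence[int],
    max_new_tokens: int = GENERATION_LENGTH,
) -> float:
    """Time one call to a hypothetical `generate(input_ids, max_new_tokens)`.

    Assumption: `generate` blocks until generation has finished and returns
    only the new token ids. With asynchronous GPU execution, the device would
    need to be synchronized before the timer is stopped.
    """
    start = time.perf_counter()
    new_ids = generate(input_ids, max_new_tokens)
    elapsed = time.perf_counter() - start
    return inference_speed(len(input_ids), len(new_ids), elapsed)


# Total sequence length (prompt + generation) for each benchmarked input length.
TOTAL_LENGTHS = [n + GENERATION_LENGTH for n in INPUT_LENGTHS]
```

As an illustrative figure (not a measurement), a run with 6144 prompt tokens and 2048 generated tokens that takes 8.0 seconds gives `inference_speed(6144, 2048, 8.0) == 1024.0` tokens/s. Because prompt tokens count toward the numerator, the reported speed grows with input length even when the per-token decoding rate does not.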
## Results
### Qwen3-0.6B (SGLang)
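| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
| --- | --- | --- | --- | --- | --- |
| Qwen3-0.6B | 1 | BF16 | 1 | 414.17 | |
| Qwen3-0.6B | 1 | FP8 | 1 | 458.03 | |
| Qwen3-0.6B | 1 | GPTQ-Int8 | 1 | 344.92 | |
| Qwen3-0.6B | 6144 | BF16 | 1 | 1426.46 | |
| Qwen3-0.6B | 6144 | FP8 | 1 | 1572.95 | |
| Qwen3-0.6B | 6144 | GPTQ-Int8 | 1 | 1234.29 | |
| Qwen3-0.6B | 14336 | BF16 | 1 | 2478.02 | |
| Qwen3-0.6B | 14336 | FP8 | 1 | 2689.08 | |
| Qwen3-0.6B | 14336 | GPTQ-Int8 | 1 | 2198.82 | |
| Qwen3-0.6B | 30720 | BF16 | 1 | 3577.42 | |
| Qwen3-0.6B | 30720 | FP8 | 1 | 3819.86 | |
| Qwen3-0.6B | 30720 | GPTQ-Int8 | 1 | 3342.06 | |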
### Qwen3-0.6B (Transformers)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
| --- | --- | --- | --- | --- | --- |
| Qwen3-0.6B | 1 | BF16 | 1 | 58.57 | 1394 |
| Qwen3-0.6B | 1 | FP8 | 1 | 24.60 | 1217 |
| Qwen3-0.6B | 1 | GPTQ-Int8 | 1 | 26.56 | 986 |
| Qwen3-0.6B | 6144 | BF16 | 1 | 154.82 | 2066 |
| Qwen3-0.6B | 6144 | FP8 | 1 | 73.96 | 1943 |
| Qwen3-0.6B | 6144 | GPTQ-Int8 | 1 | 93.84 | 1658 |
| Qwen3-0.6B | 14336 | BF16 | 1 | 168.48 | 2963 |
| Qwen3-0.6B | 14336 | FP8 | 1 | 104.99 | 2839 |
| Qwen3-0.6B | 14336 | GPTQ-Int8 | 1 | 219.61 | 2554 |
| Qwen3-0.6B | 30720 | BF16 | 1 | 175.93 | 4755 |
| Qwen3-0.6B | 30720 | FP8 | 1 | 132.78 | 4632 |
| Qwen3-0.6B | 30720 | GPTQ-Int8 | 1 | 345.71 | 4347 |
### Qwen3-1.7B (SGLang)
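| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
| --- | --- | --- | --- | --- | --- |
| Qwen3-1.7B | 1 | BF16 | 1 | 227.80 | |
| Qwen3-1.7B | 1 | FP8 | 1 | 333.90 | |
| Qwen3-1.7B | 1 | GPTQ-Int8 | 1 | 257.40 | |
| Qwen3-1.7B | 6144 | BF16 | 1 | 838.28 | |
| Qwen3-1.7B | 6144 | FP8 | 1 | 1198.20 | |
| Qwen3-1.7B | 6144 | GPTQ-Int8 | 1 | 945.91 | |
| Qwen3-1.7B | 14336 | BF16 | 1 | 1525.71 | |
| Qwen3-1.7B | 14336 | FP8 | 1 | 2095.61 | |
| Qwen3-1.7B | 14336 | GPTQ-Int8 | 1 | 1707.63 | |
| Qwen3-1.7B | 30720 | BF16 | 1 | 2439.03 | |
| Qwen3-1.7B | 30720 | FP8 | 1 | 3165.32 | |
| Qwen3-1.7B | 30720 | GPTQ-Int8 | 1 | 2706.16 | |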
### Qwen3-1.7B (Transformers)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
| --- | --- | --- | --- | --- | --- |
| Qwen3-1.7B | 1 | BF16 | 1 | 59.83 | 3412 |
| Qwen3-1.7B | 1 | FP8 | 1 | 23.83 | 2726 |
| Qwen3-1.7B | 1 | GPTQ-Int8 | 1 | 28.06 | 2229 |
| Qwen3-1.7B | 6144 | BF16 | 1 | 238.53 | 4213 |
| Qwen3-1.7B | 6144 | FP8 | 1 | 90.87 | 3462 |
| Qwen3-1.7B | 6144 | GPTQ-Int8 | 1 | 110.82 | 2901 |
| Qwen3-1.7B | 14336 | BF16 | 1 | 352.59 | 5109 |
| Qwen3-1.7B | 14336 | FP8 | 1 | 153.37 | 4359 |
| Qwen3-1.7B | 14336 | GPTQ-Int8 | 1 | 222.78 | 3798 |
| Qwen3-1.7B | 30720 | BF16 | 1 | 418.13 | 6902 |
| Qwen3-1.7B | 30720 | FP8 | 1 | 235.61 | 6151 |
| Qwen3-1.7B | 30720 | GPTQ-Int8 | 1 | 386.85 | 5590 |
### Qwen3-4B (SGLang)
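| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
| --- | --- | --- | --- | --- | --- |
| Qwen3-4B | 1 | BF16 | 1 | 133.13 | |
| Qwen3-4B | 1 | FP8 | 1 | 200.61 | |
| Qwen3-4B | 1 | AWQ-INT4 | 1 | 199.71 | |
| Qwen3-4B | 6144 | BF16 | 1 | 466.19 | |
| Qwen3-4B | 6144 | FP8 | 1 | 662.26 | |
| Qwen3-4B | 6144 | AWQ-INT4 | 1 | 640.07 | |
| Qwen3-4B | 14336 | BF16 | 1 | 789.25 | |
| Qwen3-4B | 14336 | FP8 | 1 | 1066.23 | |
| Qwen3-4B | 14336 | AWQ-INT4 | 1 | 1006.23 | |
| Qwen3-4B | 30720 | BF16 | 1 | 1165.75 | |
| Qwen3-4B | 30720 | FP8 | 1 | 1467.71 | |
| Qwen3-4B | 30720 | AWQ-INT4 | 1 | 1358.84 | |
| Qwen3-4B | 63488 | BF16 | 1 | 1423.98 | |
| Qwen3-4B | 63488 | FP8 | 1 | 1660.67 | |
| Qwen3-4B | 63488 | AWQ-INT4 | 1 | 1513.97 | |
| Qwen3-4B | 129024 | BF16 | 1 | 1371.04 | |
| Qwen3-4B | 129024 | FP8 | 1 | 1497.27 | |
| Qwen3-4B | 129024 | AWQ-INT4 | 1 | 1375.71 | |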
### Qwen3-4B (Transformers)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
| --- | --- | --- | --- | --- | --- |
| Qwen3-4B | 1 | BF16 | 1 | 45.94 | 7973 |
| Qwen3-4B | 1 | FP8 | 1 | 17.33 | 5281 |
| Qwen3-4B | 1 | AWQ-INT4 | 1 | 51.57 | 2915 |
| Qwen3-4B | 6144 | BF16 | 1 | 159.95 | 8860 |
| Qwen3-4B | 6144 | FP8 | 1 | 60.55 | 6144 |
| Qwen3-4B | 6144 | AWQ-INT4 | 1 | 183.04 | 3881 |
| Qwen3-4B | 14336 | BF16 | 1 | 195.31 | 10012 |
| Qwen3-4B | 14336 | FP8 | 1 | 96.81 | 7297 |
| Qwen3-4B | 14336 | AWQ-INT4 | 1 | 265.22 | 5151 |
| Qwen3-4B | 30720 | BF16 | 1 | 217.97 | 12317 |
| Qwen3-4B | 30720 | FP8 | 1 | 138.84 | 9611 |
| Qwen3-4B | 30720 | AWQ-INT4 | 1 | 481.69 | 7742 |
### Qwen3-8B (SGLang)
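| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
| --- | --- | --- | --- | --- | --- |
| Qwen3-8B | 1 | BF16 | 1 | 81.73 | |
| Qwen3-8B | 1 | FP8 | 1 | 150.25 | |
| Qwen3-8B | 1 | AWQ-INT4 | 1 | 144.11 | |
| Qwen3-8B | 6144 | BF16 | 1 | 296.25 | |
| Qwen3-8B | 6144 | FP8 | 1 | 516.64 | |
| Qwen3-8B | 6144 | AWQ-INT4 | 1 | 477.89 | |
| Qwen3-8B | 14336 | BF16 | 1 | 524.70 | |
| Qwen3-8B | 14336 | FP8 | 1 | 859.92 | |
| Qwen3-8B | 14336 | AWQ-INT4 | 1 | 770.44 | |
| Qwen3-8B | 30720 | BF16 | 1 | 832.67 | |
| Qwen3-8B | 30720 | FP8 | 1 | 1242.24 | |
| Qwen3-8B | 30720 | AWQ-INT4 | 1 | 1075.91 | |
| Qwen3-8B | 63488 | BF16 | 1 | 1112.78 | |
| Qwen3-8B | 63488 | FP8 | 1 | 1476.46 | |
| Qwen3-8B | 63488 | AWQ-INT4 | 1 | 1254.91 | |
| Qwen3-8B | 129024 | BF16 | 1 | 1173.32 | |
| Qwen3-8B | 129024 | FP8 | 1 | 1393.21 | |
| Qwen3-8B | 129024 | AWQ-INT4 | 1 | 1198.06 | |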
### Qwen3-8B (Transformers)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
| --- | --- | --- | --- | --- | --- |
| Qwen3-8B | 1 | BF16 | 1 | 45.32 | 15947 |
| Qwen3-8B | 1 | FP8 | 1 | 15.46 | 9323 |
| Qwen3-8B | 1 | AWQ-INT4 | 1 | 51.33 | 6177 |
| Qwen3-8B | 6144 | BF16 | 1 | 146.12 | 16811 |
| Qwen3-8B | 6144 | FP8 | 1 | 55.07 | 10187 |
| Qwen3-8B | 6144 | AWQ-INT4 | 1 | 163.23 | 7113 |
| Qwen3-8B | 14336 | BF16 | 1 | 183.29 | 17963 |
| Qwen3-8B | 14336 | FP8 | 1 | 89.64 | 11340 |
| Qwen3-8B | 14336 | AWQ-INT4 | 1 | 242.97 | 8409 |
| Qwen3-8B | 30720 | BF16 | 1 | 208.98 | 20267 |
| Qwen3-8B | 30720 | FP8 | 1 | 130.93 | 13644 |
| Qwen3-8B | 30720 | AWQ-INT4 | 1 | 438.62 | 11001 |
### Qwen3-14B (SGLang)
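| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
| --- | --- | --- | --- | --- | --- |
| Qwen3-14B | 1 | BF16 | 1 | 47.10 | |
| Qwen3-14B | 1 | FP8 | 1 | 97.11 | |
| Qwen3-14B | 1 | AWQ-INT4 | 1 | 96.49 | |
| Qwen3-14B | 6144 | BF16 | 1 | 174.85 | |
| Qwen3-14B | 6144 | FP8 | 1 | 342.95 | |
| Qwen3-14B | 6144 | AWQ-INT4 | 1 | 321.62 | |
| Qwen3-14B | 14336 | BF16 | 1 | 317.56 | |
| Qwen3-14B | 14336 | FP8 | 1 | 587.33 | |
| Qwen3-14B | 14336 | AWQ-INT4 | 1 | 525.74 | |
| Qwen3-14B | 30720 | BF16 | 1 | 525.80 | |
| Qwen3-14B | 30720 | FP8 | 1 | 880.72 | |
| Qwen3-14B | 30720 | AWQ-INT4 | 1 | 744.74 | |
| Qwen3-14B | 63488 | BF16 | 1 | 742.36 | |
| Qwen3-14B | 63488 | FP8 | 1 | 1089.04 | |
| Qwen3-14B | 63488 | AWQ-INT4 | 1 | 884.06 | |
| Qwen3-14B | 129024 | BF16 | 1 | 826.15 | |
| Qwen3-14B | 129024 | FP8 | 1 | 1049.64 | |
| Qwen3-14B | 129024 | AWQ-INT4 | 1 | 857.56 | |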
### Qwen3-14B (Transformers)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
| --- | --- | --- | --- | --- | --- |
| Qwen3-14B | 1 | BF16 | 1 | 40.66 | 28402 |
| Qwen3-14B | 1 | FP8 | 1 | 13.02 | 16012 |
| Qwen3-14B | 1 | AWQ-INT4 | 1 | 44.67 | 9962 |
| Qwen3-14B | 6144 | BF16 | 1 | 108.52 | 29495 |
| Qwen3-14B | 6144 | FP8 | 1 | 44.86 | 16972 |
| Qwen3-14B | 6144 | AWQ-INT4 | 1 | 128.08 | 11020 |
| Qwen3-14B | 14336 | BF16 | 1 | 136.36 | 30775 |
| Qwen3-14B | 14336 | FP8 | 1 | 71.96 | 18253 |
| Qwen3-14B | 14336 | AWQ-INT4 | 1 | 220.62 | 12438 |
| Qwen3-14B | 30720 | BF16 | 1 | 155.38 | 33336 |
| Qwen3-14B | 30720 | FP8 | 1 | 102.63 | 20813 |
| Qwen3-14B | 30720 | AWQ-INT4 | 1 | 363.25 | 15323 |
### Qwen3-32B (SGLang)
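| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
| --- | --- | --- | --- | --- | --- |
| Qwen3-32B | 1 | BF16 | 1 | 20.72 | |
| Qwen3-32B | 1 | FP8 | 1 | 46.17 | |
| Qwen3-32B | 1 | AWQ-INT4 | 1 | 47.67 | |
| Qwen3-32B | 6144 | BF16 | 1 | 77.82 | |
| Qwen3-32B | 6144 | FP8 | 1 | 165.71 | |
| Qwen3-32B | 6144 | AWQ-INT4 | 1 | 159.99 | |
| Qwen3-32B | 14336 | BF16 | 1 | 143.08 | |
| Qwen3-32B | 14336 | FP8 | 1 | 287.60 | |
| Qwen3-32B | 14336 | AWQ-INT4 | 1 | 260.44 | |
| Qwen3-32B | 30720 | BF16 | 1 | 240.75 | |
| Qwen3-32B | 30720 | FP8 | 1 | 436.59 | |
| Qwen3-32B | 30720 | AWQ-INT4 | 1 | 366.84 | |
| Qwen3-32B | 63488 | BF16 | 1 | 342.96 | |
| Qwen3-32B | 63488 | FP8 | 1 | 532.18 | |
| Qwen3-32B | 63488 | AWQ-INT4 | 1 | 425.23 | |
| Qwen3-32B | 129024 | BF16 | 2 | 711.40 | TP=2 |
| Qwen3-32B | 129024 | FP8 | 1 | 491.45 | |
| Qwen3-32B | 129024 | AWQ-INT4 | 1 | 395.96 | |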
### Qwen3-32B (Transformers)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
| --- | --- | --- | --- | --- | --- |
| Qwen3-32B | 1 | BF16 | 1 | 26.24 | 62751 |
| Qwen3-32B | 1 | FP8 | 1 | 7.37 | 33379 |
| Qwen3-32B | 1 | AWQ-INT4 | 1 | 41.8 | 19109 |
| Qwen3-32B | 6144 | BF16 | 1 | 51.41 | 64583 |
| Qwen3-32B | 6144 | FP8 | 1 | 23.57 | 34915 |
| Qwen3-32B | 6144 | AWQ-INT4 | 1 | 68.71 | 20795 |
| Qwen3-32B | 14336 | BF16 | 1 | 62.41 | 66632 |
| Qwen3-32B | 14336 | FP8 | 1 | 36.30 | 36963 |
| Qwen3-32B | 14336 | AWQ-INT4 | 1 | 107.02 | 23105 |
| Qwen3-32B | 30720 | BF16 | 1 | 69.16 | 70728 |
| Qwen3-32B | 30720 | FP8 | 1 | 49.44 | 41060 |
| Qwen3-32B | 30720 | AWQ-INT4 | 1 | 188.11 | 27718 |
### Qwen3-30B-A3B (SGLang)
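| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
| --- | --- | --- | --- | --- | --- |
| Qwen3-30B-A3B | 1 | BF16 | 1 | 137.18 | |
| Qwen3-30B-A3B | 1 | FP8 | 1 | 155.55 | |
| Qwen3-30B-A3B | 1 | GPTQ-INT4 | 1 | 31.29 | GPTQ-Marlin |
| Qwen3-30B-A3B | 6144 | BF16 | 1 | 490.10 | |
| Qwen3-30B-A3B | 6144 | FP8 | 1 | 551.34 | |
| Qwen3-30B-A3B | 6144 | GPTQ-INT4 | 1 | 120.13 | GPTQ-Marlin |
| Qwen3-30B-A3B | 14336 | BF16 | 1 | 849.62 | |
| Qwen3-30B-A3B | 14336 | FP8 | 1 | 945.13 | |
| Qwen3-30B-A3B | 14336 | GPTQ-INT4 | 1 | 227.27 | GPTQ-Marlin |
| Qwen3-30B-A3B | 30720 | BF16 | 1 | 1283.94 | |
| Qwen3-30B-A3B | 30720 | FP8 | 1 | 1405.91 | |
| Qwen3-30B-A3B | 30720 | GPTQ-INT4 | 1 | 404.45 | GPTQ-Marlin |
| Qwen3-30B-A3B | 63488 | BF16 | 1 | 1538.79 | |
| Qwen3-30B-A3B | 63488 | FP8 | 1 | 1647.89 | |
| Qwen3-30B-A3B | 63488 | GPTQ-INT4 | 1 | 617.09 | GPTQ-Marlin |
| Qwen3-30B-A3B | 129024 | BF16 | 1 | 1385.65 | |
| Qwen3-30B-A3B | 129024 | FP8 | 1 | 1442.14 | |
| Qwen3-30B-A3B | 129024 | GPTQ-INT4 | 1 | 704.82 | GPTQ-Marlin |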
### Qwen3-30B-A3B (Transformers)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) | Note |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3-30B-A3B | 1 | BF16 | 1 | 1.89 | 58462 | |
| Qwen3-30B-A3B | 1 | FP8 | 1 | 0.44 | 30296 | |
| Qwen3-30B-A3B | 1 | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |
| Qwen3-30B-A3B | 6144 | BF16 | 1 | 7.45 | 59037 | |
| Qwen3-30B-A3B | 6144 | FP8 | 1 | 1.77 | 30872 | |
| Qwen3-30B-A3B | 6144 | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |
| Qwen3-30B-A3B | 14336 | BF16 | 1 | 14.47 | 59806 | |
| Qwen3-30B-A3B | 14336 | FP8 | 1 | 3.5 | 31641 | |
| Qwen3-30B-A3B | 14336 | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |
| Qwen3-30B-A3B | 30720 | BF16 | 1 | 27.03 | 61342 | |
| Qwen3-30B-A3B | 30720 | FP8 | 1 | 6.86 | 33177 | |
| Qwen3-30B-A3B | 30720 | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |
### Qwen3-235B-A22B (SGLang)
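| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
| --- | --- | --- | --- | --- | --- |
| Qwen3-235B-A22B | 1 | BF16 | 8 | 74.50 | TP=8 |
| Qwen3-235B-A22B | 1 | FP8 | 4 | 71.65 | TP=4 |
| Qwen3-235B-A22B | 1 | GPTQ-INT4 | 4 | 14.69 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 6144 | BF16 | 8 | 289.03 | TP=8 |
| Qwen3-235B-A22B | 6144 | FP8 | 4 | 275.16 | TP=4 |
| Qwen3-235B-A22B | 6144 | GPTQ-INT4 | 4 | 56.97 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 14336 | BF16 | 8 | 546.73 | TP=8 |
| Qwen3-235B-A22B | 14336 | FP8 | 4 | 514.23 | TP=4 |
| Qwen3-235B-A22B | 14336 | GPTQ-INT4 | 4 | 109.13 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 30720 | BF16 | 8 | 979.41 | TP=8 |
| Qwen3-235B-A22B | 30720 | FP8 | 4 | 887.90 | TP=4 |
| Qwen3-235B-A22B | 30720 | GPTQ-INT4 | 4 | 198.99 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 63488 | BF16 | 8 | 1493.91 | TP=8 |
| Qwen3-235B-A22B | 63488 | FP8 | 4 | 1269.34 | TP=4 |
| Qwen3-235B-A22B | 63488 | GPTQ-INT4 | 4 | 422.77 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 129024 | BF16 | 8 | 1639.54 | TP=8 |
| Qwen3-235B-A22B | 129024 | FP8 | 4 | 1319.66 | TP=4 |
| Qwen3-235B-A22B | 129024 | GPTQ-INT4 | 4 | 552.28 | TP=4, GPTQ-Marlin |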