Speed Benchmark
Attention: To be updated for Qwen2.5.
This section reports the speed performance of BF16 models and quantized models (GPTQ-Int4, GPTQ-Int8, and AWQ) of the Qwen2 series. Specifically, we report the inference speed (tokens/s) and the memory footprint (GB) under different context lengths.
The environment for the evaluation with Hugging Face transformers is:
- NVIDIA A100 80GB
- CUDA 11.8
- PyTorch 2.1.2+cu118
- Flash Attention 2.3.3
- Transformers 4.38.2
- AutoGPTQ 0.7.1
- AutoAWQ 0.2.4
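For reference, the following is a minimal sketch of how such a measurement can be made with transformers. It is not the official benchmark script: the model id, the dummy prompt built from placeholder token ids, and the use of `min_new_tokens` to force exactly 2048 generated tokens are assumptions for illustration.

```python
# Rough sketch: generate 2048 tokens from a dummy prompt of a fixed length,
# then report decoding speed (tokens/s) and peak GPU memory (GB).
import time
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-0.5B-Instruct",             # assumed model id; swap in the model under test
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda",
)

input_len = 6144                             # one of the benchmarked input lengths
input_ids = torch.zeros((1, input_len), dtype=torch.long, device="cuda")  # dummy prompt tokens
attention_mask = torch.ones_like(input_ids)

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
output = model.generate(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=2048,
    min_new_tokens=2048,                     # force exactly 2048 generated tokens
    do_sample=False,
)
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - input_len
print(f"speed: {new_tokens / elapsed:.2f} tokens/s")
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```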
The environment for the evaluation with vLLM is:
- NVIDIA A100 80GB
- CUDA 11.8
- PyTorch 2.3.0+cu118
- Flash Attention 2.5.6
- Transformers 4.40.1
- vLLM 0.4.2
Note:
- We use a batch size of 1 and as few GPUs as possible for the evaluation.
- We test the speed and memory of generating 2048 tokens with input lengths of 1, 6144, 14336, 30720, 63488, and 129024 tokens (input lengths above 32K are only available for Qwen2-72B-Instruct and Qwen2-7B-Instruct).
- For vLLM, the memory usage is not reported because it pre-allocates all GPU memory. By default, we use `gpu_memory_utilization=0.9 max_model_len=32768 enforce_eager=False`.
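A minimal sketch of the vLLM measurement under these default engine arguments could look like the following. Again, this is not the official benchmark script: the model id and the dummy prompt token ids are placeholders, and `ignore_eos=True` is used here simply to force exactly 2048 generated tokens.

```python
# Rough sketch: generate exactly 2048 tokens with vLLM and report tokens/s.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-0.5B-Instruct",        # assumed model id; swap in the model under test
    gpu_memory_utilization=0.9,
    max_model_len=32768,
    enforce_eager=False,
)

prompt_token_ids = [[0] * 6144]              # dummy prompt of one benchmarked input length
params = SamplingParams(temperature=0.0, max_tokens=2048, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=params)
elapsed = time.perf_counter() - start

generated = len(outputs[0].outputs[0].token_ids)
print(f"speed: {generated / elapsed:.2f} tokens/s")
```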
0.5B (Transformer)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) |
|---|---|---|---|---|---|
| Qwen2-0.5B-Instruct | 1 | BF16 | 1 | 49.94 | 1.17 |
| | | GPTQ-Int8 | 1 | 36.35 | 0.85 |
| | | GPTQ-Int4 | 1 | 49.56 | 0.68 |
| | | AWQ | 1 | 38.78 | 0.68 |
| | 6144 | BF16 | 1 | 50.83 | 6.42 |
| | | GPTQ-Int8 | 1 | 36.56 | 6.09 |
| | | GPTQ-Int4 | 1 | 49.63 | 5.93 |
| | | AWQ | 1 | 38.73 | 5.92 |
| | 14336 | BF16 | 1 | 49.56 | 13.48 |
| | | GPTQ-Int8 | 1 | 36.23 | 13.15 |
| | | GPTQ-Int4 | 1 | 48.68 | 12.97 |
| | | AWQ | 1 | 38.94 | 12.99 |
| | 30720 | BF16 | 1 | 49.25 | 27.61 |
| | | GPTQ-Int8 | 1 | 34.61 | 27.28 |
| | | GPTQ-Int4 | 1 | 48.18 | 27.12 |
| | | AWQ | 1 | 38.19 | 27.11 |
0.5B (vLLM)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) |
|---|---|---|---|---|
| Qwen2-0.5B-Instruct | 1 | BF16 | 1 | 270.49 |
| | | GPTQ-Int8 | 1 | 235.95 |
| | | GPTQ-Int4 | 1 | 240.07 |
| | | AWQ | 1 | 233.31 |
| | 6144 | BF16 | 1 | 256.16 |
| | | GPTQ-Int8 | 1 | 224.30 |
| | | GPTQ-Int4 | 1 | 226.41 |
| | | AWQ | 1 | 222.83 |
| | 14336 | BF16 | 1 | 108.89 |
| | | GPTQ-Int8 | 1 | 108.10 |
| | | GPTQ-Int4 | 1 | 106.51 |
| | | AWQ | 1 | 104.16 |
| | 30720 | BF16 | 1 | 97.20 |
| | | GPTQ-Int8 | 1 | 94.49 |
| | | GPTQ-Int4 | 1 | 93.94 |
| | | AWQ | 1 | 92.23 |
1.5B (Transformer)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) |
|---|---|---|---|---|---|
| Qwen2-1.5B-Instruct | 1 | BF16 | 1 | 40.89 | 3.44 |
| | | GPTQ-Int8 | 1 | 31.51 | 2.31 |
| | | GPTQ-Int4 | 1 | 42.47 | 1.67 |
| | | AWQ | 1 | 33.62 | 1.64 |
| | 6144 | BF16 | 1 | 40.86 | 8.74 |
| | | GPTQ-Int8 | 1 | 31.31 | 7.59 |
| | | GPTQ-Int4 | 1 | 42.78 | 6.95 |
| | | AWQ | 1 | 32.90 | 6.92 |
| | 14336 | BF16 | 1 | 40.08 | 15.92 |
| | | GPTQ-Int8 | 1 | 31.19 | 14.79 |
| | | GPTQ-Int4 | 1 | 42.25 | 14.14 |
| | | AWQ | 1 | 33.24 | 14.12 |
| | 30720 | BF16 | 1 | 34.09 | 30.31 |
| | | GPTQ-Int8 | 1 | 28.52 | 29.18 |
| | | GPTQ-Int4 | 1 | 31.30 | 28.54 |
| | | AWQ | 1 | 32.16 | 28.51 |
1.5B (vLLM)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) |
|---|---|---|---|---|
| Qwen2-1.5B-Instruct | 1 | BF16 | 1 | 175.55 |
| | | GPTQ-Int8 | 1 | 172.28 |
| | | GPTQ-Int4 | 1 | 184.58 |
| | | AWQ | 1 | 170.87 |
| | 6144 | BF16 | 1 | 166.23 |
| | | GPTQ-Int8 | 1 | 164.32 |
| | | GPTQ-Int4 | 1 | 174.04 |
| | | AWQ | 1 | 162.81 |
| | 14336 | BF16 | 1 | 83.67 |
| | | GPTQ-Int8 | 1 | 98.63 |
| | | GPTQ-Int4 | 1 | 97.65 |
| | | AWQ | 1 | 92.48 |
| | 30720 | BF16 | 1 | 77.69 |
| | | GPTQ-Int8 | 1 | 86.42 |
| | | GPTQ-Int4 | 1 | 87.49 |
| | | AWQ | 1 | 82.88 |
7B (Transformer)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) |
|---|---|---|---|---|---|
| Qwen2-7B-Instruct | 1 | BF16 | 1 | 37.97 | 14.92 |
| | | GPTQ-Int8 | 1 | 30.85 | 8.97 |
| | | GPTQ-Int4 | 1 | 36.17 | 6.06 |
| | | AWQ | 1 | 33.08 | 5.93 |
| | 6144 | BF16 | 1 | 34.74 | 20.26 |
| | | GPTQ-Int8 | 1 | 31.13 | 14.31 |
| | | GPTQ-Int4 | 1 | 33.34 | 11.40 |
| | | AWQ | 1 | 30.86 | 11.27 |
| | 14336 | BF16 | 1 | 26.63 | 27.71 |
| | | GPTQ-Int8 | 1 | 24.58 | 21.76 |
| | | GPTQ-Int4 | 1 | 25.81 | 18.86 |
| | | AWQ | 1 | 27.61 | 18.72 |
| | 30720 | BF16 | 1 | 17.49 | 42.62 |
| | | GPTQ-Int8 | 1 | 16.69 | 36.67 |
| | | GPTQ-Int4 | 1 | 17.17 | 33.76 |
| | | AWQ | 1 | 17.87 | 33.63 |
7B (vLLM)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) |
|---|---|---|---|---|
| Qwen2-7B-Instruct | 1 | BF16 | 1 | 80.45 |
| | | GPTQ-Int8 | 1 | 114.32 |
| | | GPTQ-Int4 | 1 | 143.40 |
| | | AWQ | 1 | 96.65 |
| | 6144 | BF16 | 1 | 76.41 |
| | | GPTQ-Int8 | 1 | 107.02 |
| | | GPTQ-Int4 | 1 | 131.55 |
| | | AWQ | 1 | 91.38 |
| | 14336 | BF16 | 1 | 66.54 |
| | | GPTQ-Int8 | 1 | 89.72 |
| | | GPTQ-Int4 | 1 | 97.93 |
| | | AWQ | 1 | 76.87 |
| | 30720 | BF16 | 1 | 55.83 |
| | | GPTQ-Int8 | 1 | 71.58 |
| | | GPTQ-Int4 | 1 | 81.48 |
| | | AWQ | 1 | 63.62 |
| | 63488 | BF16 | 1 | 41.20 |
| | | GPTQ-Int8 | 1 | 49.37 |
| | | GPTQ-Int4 | 1 | 54.12 |
| | | AWQ | 1 | 45.89 |
| | 129024 | BF16 | 1 | 25.01 |
| | | GPTQ-Int8 | 1 | 27.73 |
| | | GPTQ-Int4 | 1 | 29.39 |
| | | AWQ | 1 | 27.13 |
57B-A14B (Transformer)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) |
|---|---|---|---|---|---|
| Qwen2-57B-A14B-Instruct | 1 | BF16 | 2 | 4.76 | 110.29 |
| | | GPTQ-Int4 | 1 | 5.55 | 30.38 |
| | 6144 | BF16 | 2 | 4.90 | 117.80 |
| | | GPTQ-Int4 | 1 | 5.44 | 35.67 |
| | 14336 | BF16 | 2 | 4.58 | 128.17 |
| | | GPTQ-Int4 | 1 | 5.31 | 43.11 |
| | 30720 | BF16 | 2 | 4.12 | 163.77 |
| | | GPTQ-Int4 | 1 | 4.72 | 58.01 |
57B-A14B (vLLM)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) |
|---|---|---|---|---|
| Qwen2-57B-A14B-Instruct | 1 | BF16 | 2 | 31.44 |
| | 6144 | BF16 | 2 | 31.77 |
| | 14336 | BF16 | 2 | 21.25 |
| | 30720 | BF16 | 2 | 20.24 |
Note: Compared with dense models, MoE models achieve higher throughput at large batch sizes, as shown below:
| Model | Quantization | # Prompts | QPS | Tokens/s |
|---|---|---|---|---|
| Qwen1.5-32B-Chat | BF16 | 100 | 6.68 | 7343.56 |
| Qwen2-57B-A14B-Instruct | BF16 | 100 | 4.81 | 5291.15 |
| Qwen1.5-32B-Chat | BF16 | 1000 | 7.99 | 8791.35 |
| Qwen2-57B-A14B-Instruct | BF16 | 1000 | 5.18 | 5698.37 |
The results are obtained from vLLM throughput benchmarking scripts, which can be reproduced by:
python vllm/benchmarks/benchmark_throughput.py --input-len 1000 --output-len 100 --model <model_path> --num-prompts <number of prompts> --enforce-eager -tp 2
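Here QPS is completed requests per second, and Tokens/s counts both prompt and generated tokens: with `--input-len 1000` and `--output-len 100`, each request contributes roughly 1100 tokens, so Tokens/s ≈ QPS × 1100 (for example, 6.68 × 1100 ≈ 7348, matching the 7343.56 above).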
72B (Transformer)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) |
|---|---|---|---|---|---|
| Qwen2-72B-Instruct | 1 | BF16 | 2 | 7.45 | 134.74 |
| | | GPTQ-Int8 | 2 | 7.30 | 71.00 |
| | | GPTQ-Int4 | 1 | 9.05 | 41.80 |
| | | AWQ | 1 | 9.96 | 41.31 |
| | 6144 | BF16 | 2 | 5.99 | 144.38 |
| | | GPTQ-Int8 | 2 | 5.93 | 80.60 |
| | | GPTQ-Int4 | 1 | 6.79 | 47.90 |
| | | AWQ | 1 | 7.49 | 47.42 |
| | 14336 | BF16 | 3 | 4.12 | 169.93 |
| | | GPTQ-Int8 | 2 | 4.43 | 95.14 |
| | | GPTQ-Int4 | 1 | 4.87 | 57.79 |
| | | AWQ | 1 | 5.23 | 57.30 |
| | 30720 | BF16 | 3 | 2.86 | 209.03 |
| | | GPTQ-Int8 | 2 | 2.83 | 124.20 |
| | | GPTQ-Int4 | 2 | 3.02 | 107.94 |
| | | AWQ | 2 | 1.85 | 88.60 |
72B (vLLM)
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Setting |
|---|---|---|---|---|---|
| Qwen2-72B-Instruct | 1 | BF16 | 2 | 17.68 | [Setting 1] |
| | | BF16 | 4 | 30.01 | |
| | | GPTQ-Int8 | 2 | 27.56 | |
| | | GPTQ-Int4 | 1 | 29.60 | [Setting 2] |
| | | GPTQ-Int4 | 2 | 42.82 | |
| | | AWQ | 2 | 27.73 | |
| | 6144 | BF16 | 4 | 27.98 | |
| | | GPTQ-Int8 | 2 | 25.46 | |
| | | GPTQ-Int4 | 1 | 25.16 | [Setting 3] |
| | | GPTQ-Int4 | 2 | 38.23 | |
| | | AWQ | 2 | 25.77 | |
| | 14336 | BF16 | 4 | 21.81 | |
| | | GPTQ-Int8 | 2 | 22.71 | |
| | | GPTQ-Int4 | 2 | 26.54 | |
| | | AWQ | 2 | 21.50 | |
| | 30720 | BF16 | 4 | 19.43 | |
| | | GPTQ-Int8 | 2 | 18.69 | |
| | | GPTQ-Int4 | 2 | 23.12 | |
| | | AWQ | 2 | 18.09 | |
| | 63488 | BF16 | 4 | 17.46 | |
| | | GPTQ-Int8 | 2 | 15.30 | |
| | | GPTQ-Int4 | 2 | 13.23 | |
| | | AWQ | 2 | 13.14 | |
| | 129024 | BF16 | 4 | 11.70 | |
| | | GPTQ-Int8 | 4 | 12.94 | |
| | | GPTQ-Int4 | 2 | 8.33 | |
| | | AWQ | 2 | 7.78 | |
Rows with no setting listed use the [Default Setting].

[Default Setting] = `gpu_memory_utilization=0.9 max_model_len=32768 enforce_eager=False`
[Setting 1] = `gpu_memory_utilization=0.98 max_model_len=4096 enforce_eager=True`
[Setting 2] = `gpu_memory_utilization=1.0 max_model_len=4096 enforce_eager=True`
[Setting 3] = `gpu_memory_utilization=1.0 max_model_len=8192 enforce_eager=True`
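For illustration, [Setting 2] (used for the single-GPU GPTQ-Int4 run at input length 1) maps to vLLM engine arguments roughly as sketched below. The model id is an assumption, and multi-GPU rows would additionally set `tensor_parallel_size` to the listed GPU number.

```python
# Rough illustration of [Setting 2] expressed as vLLM engine arguments.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2-72B-Instruct-GPTQ-Int4",  # assumed model id
    gpu_memory_utilization=1.0,
    max_model_len=4096,
    enforce_eager=True,
    tensor_parallel_size=1,                     # set to the listed GPU number for multi-GPU rows
)
```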