Qwen2.5 Speed Benchmark
This section reports the speed performance of the BF16 and quantized models (GPTQ-Int4, GPTQ-Int8, and AWQ) of the Qwen2.5 series. Specifically, we report the inference speed (tokens/s) as well as the memory footprint (GB) across different context lengths.
The environment for the evaluation with Hugging Face transformers is:

- NVIDIA A100 80GB
- CUDA 12.1
- PyTorch 2.3.1
- Flash Attention 2.5.8
- Transformers 4.46.0
- AutoGPTQ 0.7.1+cu121 (compiled from source)
- AutoAWQ 0.2.6
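For orientation, a single transformers measurement under this environment could look like the following. This is a minimal sketch, not the actual benchmark harness: the checkpoint and prompt are illustrative placeholders, while the fixed batch size and the 2048 generated tokens follow the notes below.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda",
)

# Batch size 1; the benchmark fixes the input length and generates 2048 tokens.
inputs = tokenizer("An example prompt.", return_tensors="pt").to(model.device)
torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
output = model.generate(
    **inputs, max_new_tokens=2048, min_new_tokens=2048, do_sample=False
)
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"speed: {new_tokens / elapsed:.2f} tokens/s")
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```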
The environment for the evaluation with vLLM is:

- NVIDIA A100 80GB
- CUDA 12.1
- vLLM 0.6.3
- PyTorch 2.4.0
- Flash Attention 2.6.3
- Transformers 4.46.0
Notes:

- We use a batch size of 1 and the smallest possible number of GPUs for each evaluation.
- We test the speed and memory of generating 2048 tokens with input lengths of 1, 6144, 14336, 30720, 63488, and 129024 tokens.
- For vLLM, memory usage is not reported, because vLLM pre-allocates all GPU memory. By default, we use gpu_memory_utilization=0.9 max_model_len=32768 enforce_eager=False (see the sketch below).
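These defaults map directly onto the arguments of vLLM's offline LLM constructor; a minimal sketch, with the checkpoint name as a placeholder:

```python
from vllm import LLM

# Default benchmark setting from the note above.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # any Qwen2.5 checkpoint
    gpu_memory_utilization=0.9,        # pre-allocate 90% of GPU memory
    max_model_len=32768,
    enforce_eager=False,               # False enables CUDA graphs
)
```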
0.5B (Transformer)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) | Note |
|---|---|---|---|---|---|---|
| Qwen2.5-0.5B-Instruct | 1 | BF16 | 1 | 47.40 | 0.97 | |
| | | GPTQ-Int8 | 1 | 35.17 | 0.64 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 50.60 | 0.48 | |
| | | AWQ | 1 | 37.09 | 0.68 | |
| | 6144 | BF16 | 1 | 47.45 | 1.23 | |
| | | GPTQ-Int8 | 1 | 36.47 | 0.90 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 48.89 | 0.73 | |
| | | AWQ | 1 | 37.04 | 0.72 | |
| | 14336 | BF16 | 1 | 47.11 | 1.60 | |
| | | GPTQ-Int8 | 1 | 35.44 | 1.26 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 48.26 | 1.10 | |
| | | AWQ | 1 | 37.14 | 1.10 | |
| | 30720 | BF16 | 1 | 47.16 | 2.34 | |
| | | GPTQ-Int8 | 1 | 36.25 | 2.01 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 49.22 | 1.85 | |
| | | AWQ | 1 | 36.90 | 1.84 | |
0.5B (vLLM)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) |
|---|---|---|---|---|
| Qwen2.5-0.5B-Instruct | 1 | BF16 | 1 | 311.55 |
| | | GPTQ-Int8 | 1 | 257.07 |
| | | GPTQ-Int4 | 1 | 260.93 |
| | | AWQ | 1 | 261.95 |
| | 6144 | BF16 | 1 | 304.79 |
| | | GPTQ-Int8 | 1 | 254.10 |
| | | GPTQ-Int4 | 1 | 257.33 |
| | | AWQ | 1 | 259.80 |
| | 14336 | BF16 | 1 | 290.28 |
| | | GPTQ-Int8 | 1 | 243.69 |
| | | GPTQ-Int4 | 1 | 247.01 |
| | | AWQ | 1 | 249.58 |
| | 30720 | BF16 | 1 | 264.51 |
| | | GPTQ-Int8 | 1 | 223.86 |
| | | GPTQ-Int4 | 1 | 226.50 |
| | | AWQ | 1 | 229.84 |
1.5B (Transformer)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) | Note |
|---|---|---|---|---|---|---|
| Qwen2.5-1.5B-Instruct | 1 | BF16 | 1 | 39.68 | 2.95 | |
| | | GPTQ-Int8 | 1 | 32.62 | 1.82 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 43.33 | 1.18 | |
| | | AWQ | 1 | 31.70 | 1.51 | |
| | 6144 | BF16 | 1 | 40.88 | 3.43 | |
| | | GPTQ-Int8 | 1 | 31.46 | 2.30 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 43.96 | 1.66 | |
| | | AWQ | 1 | 32.30 | 1.63 | |
| | 14336 | BF16 | 1 | 40.43 | 4.16 | |
| | | GPTQ-Int8 | 1 | 31.06 | 3.03 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 43.66 | 2.39 | |
| | | AWQ | 1 | 32.39 | 2.36 | |
| | 30720 | BF16 | 1 | 38.59 | 5.62 | |
| | | GPTQ-Int8 | 1 | 31.04 | 4.49 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 35.68 | 3.85 | |
| | | AWQ | 1 | 31.95 | 3.82 | |
1.5B (vLLM)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) |
|---|---|---|---|---|
| Qwen2.5-1.5B-Instruct | 1 | BF16 | 1 | 183.33 |
| | | GPTQ-Int8 | 1 | 201.67 |
| | | GPTQ-Int4 | 1 | 217.03 |
| | | AWQ | 1 | 213.74 |
| | 6144 | BF16 | 1 | 176.68 |
| | | GPTQ-Int8 | 1 | 192.83 |
| | | GPTQ-Int4 | 1 | 206.63 |
| | | AWQ | 1 | 203.64 |
| | 14336 | BF16 | 1 | 168.69 |
| | | GPTQ-Int8 | 1 | 183.69 |
| | | GPTQ-Int4 | 1 | 195.88 |
| | | AWQ | 1 | 192.64 |
| | 30720 | BF16 | 1 | 152.04 |
| | | GPTQ-Int8 | 1 | 162.82 |
| | | GPTQ-Int4 | 1 | 173.57 |
| | | AWQ | 1 | 170.20 |
3B (Transformer)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) | Note |
|---|---|---|---|---|---|---|
| Qwen2.5-3B-Instruct | 1 | BF16 | 1 | 30.80 | 5.95 | |
| | | GPTQ-Int8 | 1 | 25.69 | 3.38 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 35.21 | 2.06 | |
| | | AWQ | 1 | 25.29 | 2.50 | |
| | 6144 | BF16 | 1 | 32.20 | 6.59 | |
| | | GPTQ-Int8 | 1 | 24.69 | 3.98 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 34.47 | 2.67 | |
| | | AWQ | 1 | 24.86 | 2.62 | |
| | 14336 | BF16 | 1 | 31.72 | 7.47 | |
| | | GPTQ-Int8 | 1 | 24.70 | 4.89 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 34.36 | 3.58 | |
| | | AWQ | 1 | 25.19 | 3.54 | |
| | 30720 | BF16 | 1 | 25.37 | 9.30 | |
| | | GPTQ-Int8 | 1 | 21.67 | 6.72 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 23.60 | 5.41 | |
| | | AWQ | 1 | 24.56 | 5.37 | |
3B (vLLM)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) |
|---|---|---|---|---|
| Qwen2.5-3B-Instruct | 1 | BF16 | 1 | 127.61 |
| | | GPTQ-Int8 | 1 | 150.02 |
| | | GPTQ-Int4 | 1 | 168.20 |
| | | AWQ | 1 | 165.50 |
| | 6144 | BF16 | 1 | 123.15 |
| | | GPTQ-Int8 | 1 | 143.09 |
| | | GPTQ-Int4 | 1 | 159.85 |
| | | AWQ | 1 | 156.38 |
| | 14336 | BF16 | 1 | 117.35 |
| | | GPTQ-Int8 | 1 | 135.50 |
| | | GPTQ-Int4 | 1 | 149.35 |
| | | AWQ | 1 | 147.75 |
| | 30720 | BF16 | 1 | 105.88 |
| | | GPTQ-Int8 | 1 | 118.38 |
| | | GPTQ-Int4 | 1 | 129.28 |
| | | AWQ | 1 | 127.19 |
7B (Transformer)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) | Note |
|---|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 1 | BF16 | 1 | 40.38 | 14.38 | |
| | | GPTQ-Int8 | 1 | 31.55 | 8.42 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 43.10 | 5.52 | |
| | | AWQ | 1 | 32.03 | 5.39 | |
| | 6144 | BF16 | 1 | 38.76 | 15.38 | |
| | | GPTQ-Int8 | 1 | 31.26 | 9.43 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 38.27 | 6.52 | |
| | | AWQ | 1 | 32.37 | 6.39 | |
| | 14336 | BF16 | 1 | 29.78 | 16.91 | |
| | | GPTQ-Int8 | 1 | 26.86 | 10.96 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 28.70 | 8.05 | |
| | | AWQ | 1 | 30.23 | 7.92 | |
| | 30720 | BF16 | 1 | 18.83 | 19.97 | |
| | | GPTQ-Int8 | 1 | 17.59 | 14.01 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 18.45 | 11.11 | |
| | | AWQ | 1 | 19.11 | 10.98 | |
7B (vLLM)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 1 | BF16 | 1 | 84.28 | |
| | | GPTQ-Int8 | 1 | 122.01 | |
| | | GPTQ-Int4 | 1 | 154.05 | |
| | | AWQ | 1 | 148.10 | |
| | 6144 | BF16 | 1 | 80.70 | |
| | | GPTQ-Int8 | 1 | 112.38 | |
| | | GPTQ-Int4 | 1 | 141.98 | |
| | | AWQ | 1 | 137.64 | |
| | 14336 | BF16 | 1 | 77.69 | |
| | | GPTQ-Int8 | 1 | 105.25 | |
| | | GPTQ-Int4 | 1 | 129.35 | |
| | | AWQ | 1 | 124.91 | |
| | 30720 | BF16 | 1 | 70.33 | |
| | | GPTQ-Int8 | 1 | 90.71 | |
| | | GPTQ-Int4 | 1 | 108.30 | |
| | | AWQ | 1 | 104.66 | |
| | 63488 | BF16 | 1 | 50.86 | Setting-64k |
| | | GPTQ-Int8 | 1 | 60.52 | Setting-64k |
| | | GPTQ-Int4 | 1 | 67.97 | Setting-64k |
| | | AWQ | 1 | 66.42 | Setting-64k |
| | 129024 | BF16 | 1 | 28.94 | vllm==0.6.2, new sample config |
| | | GPTQ-Int8 | 1 | 25.97 | vllm==0.6.2, new sample config |
| | | GPTQ-Int4 | 1 | 26.37 | vllm==0.6.2, new sample config |
| | | AWQ | 1 | 26.57 | vllm==0.6.2, new sample config |
[Setting-64k] = (gpu_memory_utilization=0.9 max_model_len=65536 enforce_eager=False)

[new sample config] = for vLLM, set the following sampling parameters: SamplingParams(temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1, presence_penalty=0, frequency_penalty=0, max_tokens=out_length)
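For reference, the [new sample config] note corresponds to constructing vLLM sampling parameters as below; a minimal sketch in which out_length stands for the 2048 tokens generated per run and the checkpoint is a placeholder:

```python
from vllm import LLM, SamplingParams

out_length = 2048  # the benchmark generates 2048 tokens per run

# Sampling parameters listed in the [new sample config] note above.
params = SamplingParams(
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    repetition_penalty=1,
    presence_penalty=0,
    frequency_penalty=0,
    max_tokens=out_length,
)

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder checkpoint/setup
outputs = llm.generate(["An example prompt."], params)
print(outputs[0].outputs[0].text)
```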
14B (Transformer)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) | Note |
|---|---|---|---|---|---|---|
| Qwen2.5-14B-Instruct | 1 | BF16 | 1 | 24.74 | 28.08 | |
| | | GPTQ-Int8 | 1 | 18.84 | 16.11 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 25.89 | 9.94 | |
| | | AWQ | 1 | 19.23 | 9.79 | |
| | 6144 | BF16 | 1 | 20.51 | 29.50 | |
| | | GPTQ-Int8 | 1 | 17.80 | 17.61 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 20.06 | 11.36 | |
| | | AWQ | 1 | 19.21 | 11.22 | |
| | 14336 | BF16 | 1 | 13.92 | 31.95 | |
| | | GPTQ-Int8 | 1 | 12.66 | 19.98 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 13.79 | 13.81 | |
| | | AWQ | 1 | 14.17 | 13.67 | |
| | 30720 | BF16 | 1 | 8.20 | 36.85 | |
| | | GPTQ-Int8 | 1 | 7.77 | 24.88 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 8.14 | 18.71 | |
| | | AWQ | 1 | 8.31 | 18.57 | |
14B (vLLM)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen2.5-14B-Instruct | 1 | BF16 | 1 | 46.30 | |
| | | GPTQ-Int8 | 1 | 70.40 | |
| | | GPTQ-Int4 | 1 | 98.02 | |
| | | AWQ | 1 | 92.66 | |
| | 6144 | BF16 | 1 | 43.83 | |
| | | GPTQ-Int8 | 1 | 64.33 | |
| | | GPTQ-Int4 | 1 | 86.10 | |
| | | AWQ | 1 | 83.11 | |
| | 14336 | BF16 | 1 | 41.91 | |
| | | GPTQ-Int8 | 1 | 59.21 | |
| | | GPTQ-Int4 | 1 | 76.85 | |
| | | AWQ | 1 | 74.03 | |
| | 30720 | BF16 | 1 | 37.18 | |
| | | GPTQ-Int8 | 1 | 49.23 | |
| | | GPTQ-Int4 | 1 | 60.91 | |
| | | AWQ | 1 | 59.01 | |
| | 63488 | BF16 | 1 | 26.85 | Setting-64k |
| | | GPTQ-Int8 | 1 | 32.83 | Setting-64k |
| | | GPTQ-Int4 | 1 | 37.67 | Setting-64k |
| | | AWQ | 1 | 36.71 | Setting-64k |
| | 129024 | BF16 | 1 | 14.53 | vllm==0.6.2, new sample config |
| | | GPTQ-Int8 | 1 | 15.10 | vllm==0.6.2, new sample config |
| | | GPTQ-Int4 | 1 | 15.13 | vllm==0.6.2, new sample config |
| | | AWQ | 1 | 15.25 | vllm==0.6.2, new sample config |
[Setting-64k] = (gpu_memory_utilization=0.9 max_model_len=65536 enforce_eager=False)

[new sample config] = for vLLM, set the following sampling parameters: SamplingParams(temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1, presence_penalty=0, frequency_penalty=0, max_tokens=out_length)
32B (Transformer)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) | Note |
|---|---|---|---|---|---|---|
| Qwen2.5-32B-Instruct | 1 | BF16 | 1 | 17.54 | 61.58 | |
| | | GPTQ-Int8 | 1 | 14.52 | 33.56 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 19.20 | 18.94 | |
| | | AWQ | 1 | 14.60 | 18.67 | |
| | 6144 | BF16 | 1 | 12.49 | 63.72 | |
| | | GPTQ-Int8 | 1 | 11.61 | 35.86 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 13.42 | 21.09 | |
| | | AWQ | 1 | 13.81 | 20.81 | |
| | 14336 | BF16 | 1 | 8.95 | 67.31 | |
| | | GPTQ-Int8 | 1 | 8.53 | 39.28 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 9.48 | 24.67 | |
| | | AWQ | 1 | 9.71 | 24.39 | |
| | 30720 | BF16 | 1 | 5.59 | 74.47 | |
| | | GPTQ-Int8 | 1 | 5.42 | 46.45 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 5.79 | 31.84 | |
| | | AWQ | 1 | 5.85 | 31.56 | |
32B (vLLM)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen2.5-32B-Instruct | 1 | BF16 | 1 | 22.13 | Setting 1 |
| | | GPTQ-Int8 | 1 | 37.57 | |
| | | GPTQ-Int4 | 1 | 55.83 | |
| | | AWQ | 1 | 51.92 | |
| | 6144 | BF16 | 1 | 21.05 | Setting 1 |
| | | GPTQ-Int8 | 1 | 34.67 | |
| | | GPTQ-Int4 | 1 | 49.96 | |
| | | AWQ | 1 | 46.68 | |
| | 14336 | BF16 | 1 | 19.91 | Setting 1 |
| | | GPTQ-Int8 | 1 | 31.89 | |
| | | GPTQ-Int4 | 1 | 44.79 | |
| | | AWQ | 1 | 41.83 | |
| | 30720 | BF16 | 2 | 31.82 | |
| | | GPTQ-Int8 | 1 | 26.88 | |
| | | GPTQ-Int4 | 1 | 35.66 | |
| | | AWQ | 1 | 33.75 | |
| | 63488 | BF16 | 2 | 24.45 | Setting-64k |
| | | GPTQ-Int8 | 1 | 18.60 | Setting-64k |
| | | GPTQ-Int4 | 1 | 22.72 | Setting-64k |
| | | AWQ | 1 | 21.79 | Setting-64k |
| | 129024 | BF16 | 2 | 14.31 | vllm==0.6.2, new sample config |
| | | GPTQ-Int8 | 1 | 9.77 | vllm==0.6.2, new sample config |
| | | GPTQ-Int4 | 1 | 10.39 | vllm==0.6.2, new sample config |
| | | AWQ | 1 | 10.34 | vllm==0.6.2, new sample config |
For the context length of 129024 tokens, the model needs to be run with "model_max_length"=131072 in its configuration; a sketch of patching this follows the setting definitions below.
[Default Setting] = (gpu_memory_utilization=0.9 max_model_len=32768 enforce_eager=False)

[Setting 1] = (gpu_memory_utilization=1.0 max_model_len=32768 enforce_eager=True)

[Setting-64k] = (gpu_memory_utilization=0.9 max_model_len=65536 enforce_eager=False)

[new sample config] = for vLLM, set the following sampling parameters: SamplingParams(temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1, presence_penalty=0, frequency_penalty=0, max_tokens=out_length)
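A minimal sketch of the "model_max_length"=131072 change mentioned above. The note does not say which file carries the key; in Qwen2.5 checkpoints it normally sits in tokenizer_config.json, and the local path here is a placeholder:

```python
import json

# Hypothetical local checkpoint directory; adjust to your download location.
cfg_path = "./Qwen2.5-32B-Instruct/tokenizer_config.json"

with open(cfg_path) as f:
    cfg = json.load(f)

cfg["model_max_length"] = 131072  # required for the 129024-token runs

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```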
72B (Transformer)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) | Note |
|---|---|---|---|---|---|---|
| Qwen2.5-72B-Instruct | 1 | BF16 | 2 | 8.73 | 136.20 | |
| | | GPTQ-Int8 | 2 | 8.66 | 72.61 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 11.07 | 39.91 | |
| | | AWQ | 1 | 11.50 | 39.44 | |
| | 6144 | BF16 | 2 | 6.39 | 140.00 | |
| | | GPTQ-Int8 | 2 | 6.39 | 77.81 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 7.56 | 42.50 | |
| | | AWQ | 1 | 8.17 | 42.13 | |
| | 14336 | BF16 | 3 | 4.25 | 149.14 | |
| | | GPTQ-Int8 | 2 | 4.66 | 82.55 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 5.27 | 46.86 | |
| | | AWQ | 1 | 5.57 | 46.38 | |
| | 30720 | BF16 | 3 | 2.94 | 164.79 | |
| | | GPTQ-Int8 | 2 | 2.94 | 94.75 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 2 | 3.14 | 62.57 | |
| | | AWQ | 2 | 3.23 | 61.64 | |
72B (vLLM)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen2.5-72B-Instruct | 1 | BF16 | 2 | 18.19 | Setting 1 |
| | | BF16 | 4 | 31.37 | Default |
| | | GPTQ-Int8 | 2 | 31.40 | Default |
| | | GPTQ-Int4 | 1 | 16.47 | Default |
| | | GPTQ-Int4 | 2 | 46.30 | Setting 2 |
| | | AWQ | 2 | 44.30 | Default |
| | 6144 | BF16 | 4 | 29.90 | Default |
| | | GPTQ-Int8 | 2 | 29.37 | Default |
| | | GPTQ-Int4 | 1 | 13.88 | Default |
| | | GPTQ-Int4 | 2 | 42.50 | Setting 3 |
| | | AWQ | 2 | 40.67 | Default |
| | 14336 | BF16 | 4 | 30.10 | Default |
| | | GPTQ-Int8 | 2 | 27.20 | Default |
| | | GPTQ-Int4 | 2 | 38.10 | Default |
| | | AWQ | 2 | 36.63 | Default |
| | 30720 | BF16 | 4 | 27.53 | Default |
| | | GPTQ-Int8 | 2 | 23.32 | Default |
| | | GPTQ-Int4 | 2 | 30.98 | Default |
| | | AWQ | 2 | 30.02 | Default |
| | 63488 | BF16 | 4 | 20.74 | Setting 4 |
| | | GPTQ-Int8 | 2 | 16.27 | Setting 4 |
| | | GPTQ-Int4 | 2 | 19.84 | Setting 4 |
| | | AWQ | 2 | 19.32 | Setting 4 |
| | 129024 | BF16 | 4 | 12.68 | Setting 5 |
| | | GPTQ-Int8 | 4 | 14.11 | Setting 5 |
| | | GPTQ-Int4 | 2 | 10.11 | Setting 5 |
| | | AWQ | 2 | 9.88 | Setting 5 |
[Default Setting] = (gpu_memory_utilization=0.9 max_model_len=32768 enforce_eager=False)

[Setting 1] = (gpu_memory_utilization=0.98 max_model_len=4096 enforce_eager=True)

[Setting 2] = (gpu_memory_utilization=1.0 max_model_len=4096 enforce_eager=True)

[Setting 3] = (gpu_memory_utilization=1.0 max_model_len=8192 enforce_eager=True)

[Setting 4] = (gpu_memory_utilization=0.9 max_model_len=65536 enforce_eager=False)

[Setting 5] = (gpu_memory_utilization=0.9 max_model_len=131072 enforce_eager=False)
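The multi-GPU vLLM rows above correspond to running the model across several devices; a minimal sketch, assuming the "GPU Num" column maps to vLLM's tensor_parallel_size (the checkpoint and setting shown are one row's worth of placeholders):

```python
from vllm import LLM

# E.g. the 72B BF16 rows with GPU Num = 4 under the Default Setting.
llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",
    tensor_parallel_size=4,      # assumed mapping for the "GPU Num" column
    gpu_memory_utilization=0.9,
    max_model_len=32768,
    enforce_eager=False,
)
```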