# Efficiency Evaluation

This section reports efficiency test results for the Qwen2.5 series (both original and quantized models), covering inference speed (tokens/s) and GPU memory footprint (GB) at different context lengths.
Environment for testing with HuggingFace transformers:

- NVIDIA A100 80GB
- CUDA 12.1
- PyTorch 2.3.1
- Flash Attention 2.5.8
- Transformers 4.46.0
- AutoGPTQ 0.7.1+cu121 (compiled from source)
- AutoAWQ 0.2.6
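The speed figures in the tables below are decoding throughput: generated tokens divided by the wall-clock time of the generation call. The original benchmark script is not reproduced in this document; the following is only a minimal, framework-agnostic sketch of the measurement (the helper name `measure_tokens_per_second` is ours, not from the benchmark):

```python
import time

def measure_tokens_per_second(generate_fn, num_new_tokens):
    """Time one generation call and return decoding speed in tokens/s.

    generate_fn: zero-argument callable that produces `num_new_tokens` tokens,
    e.g. `lambda: model.generate(**inputs, max_new_tokens=2048)`.
    """
    start = time.perf_counter()
    generate_fn()
    elapsed = time.perf_counter() - start
    return num_new_tokens / elapsed

# Illustrative use with a stand-in for a real generate call:
speed = measure_tokens_per_second(lambda: time.sleep(0.05), num_new_tokens=2048)
```

In the real benchmark, `generate_fn` would wrap a `model.generate(...)` call producing 2048 new tokens, as described in the notes below.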
Environment for testing with vLLM:

- NVIDIA A100 80GB
- CUDA 12.1
- vLLM 0.6.3
- PyTorch 2.4.0
- Flash Attention 2.6.3
- Transformers 4.46.0
Note:

- The batch size is 1, using as few GPUs as possible.
- We measure speed and memory when generating 2048 tokens, with input lengths of 1, 6144, 14336, 30720, 63488, and 129024 tokens. (Input lengths beyond 32K are covered only in the vLLM tests below.)
- For vLLM, actual memory usage is hard to assess because it pre-allocates GPU memory. By default, all tests use `gpu_memory_utilization=0.9 max_model_len=32768 enforce_eager=False`.
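The default vLLM settings above map directly onto the engine arguments of vLLM's offline `LLM` API. A minimal configuration sketch (not runnable without a GPU; the model name is just one of the checkpoints tested below, and the prompt is a placeholder):

```python
from vllm import LLM, SamplingParams

# Default setting used throughout this benchmark.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    gpu_memory_utilization=0.9,
    max_model_len=32768,
    enforce_eager=False,
)

# Generate 2048 tokens, matching the benchmark's output length.
params = SamplingParams(max_tokens=2048)
outputs = llm.generate(["placeholder prompt"], params)
```

The `[Setting N]` entries noted in some tables change only these three engine arguments.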
## 0.5B (Transformers)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) | Note |
|---|---|---|---|---|---|---|
| Qwen2.5-0.5B-Instruct | 1 | BF16 | 1 | 47.40 | 0.97 | |
| | | GPTQ-Int8 | 1 | 35.17 | 0.64 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 50.60 | 0.48 | |
| | | AWQ | 1 | 37.09 | 0.68 | |
| | 6144 | BF16 | 1 | 47.45 | 1.23 | |
| | | GPTQ-Int8 | 1 | 36.47 | 0.90 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 48.89 | 0.73 | |
| | | AWQ | 1 | 37.04 | 0.72 | |
| | 14336 | BF16 | 1 | 47.11 | 1.60 | |
| | | GPTQ-Int8 | 1 | 35.44 | 1.26 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 48.26 | 1.10 | |
| | | AWQ | 1 | 37.14 | 1.10 | |
| | 30720 | BF16 | 1 | 47.16 | 2.34 | |
| | | GPTQ-Int8 | 1 | 36.25 | 2.01 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 49.22 | 1.85 | |
| | | AWQ | 1 | 36.90 | 1.84 | |
## 0.5B (vLLM)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) |
|---|---|---|---|---|
| Qwen2.5-0.5B-Instruct | 1 | BF16 | 1 | 311.55 |
| | | GPTQ-Int8 | 1 | 257.07 |
| | | GPTQ-Int4 | 1 | 260.93 |
| | | AWQ | 1 | 261.95 |
| | 6144 | BF16 | 1 | 304.79 |
| | | GPTQ-Int8 | 1 | 254.10 |
| | | GPTQ-Int4 | 1 | 257.33 |
| | | AWQ | 1 | 259.80 |
| | 14336 | BF16 | 1 | 290.28 |
| | | GPTQ-Int8 | 1 | 243.69 |
| | | GPTQ-Int4 | 1 | 247.01 |
| | | AWQ | 1 | 249.58 |
| | 30720 | BF16 | 1 | 264.51 |
| | | GPTQ-Int8 | 1 | 223.86 |
| | | GPTQ-Int4 | 1 | 226.50 |
| | | AWQ | 1 | 229.84 |
## 1.5B (Transformers)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) | Note |
|---|---|---|---|---|---|---|
| Qwen2.5-1.5B-Instruct | 1 | BF16 | 1 | 39.68 | 2.95 | |
| | | GPTQ-Int8 | 1 | 32.62 | 1.82 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 43.33 | 1.18 | |
| | | AWQ | 1 | 31.70 | 1.51 | |
| | 6144 | BF16 | 1 | 40.88 | 3.43 | |
| | | GPTQ-Int8 | 1 | 31.46 | 2.30 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 43.96 | 1.66 | |
| | | AWQ | 1 | 32.30 | 1.63 | |
| | 14336 | BF16 | 1 | 40.43 | 4.16 | |
| | | GPTQ-Int8 | 1 | 31.06 | 3.03 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 43.66 | 2.39 | |
| | | AWQ | 1 | 32.39 | 2.36 | |
| | 30720 | BF16 | 1 | 38.59 | 5.62 | |
| | | GPTQ-Int8 | 1 | 31.04 | 4.49 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 35.68 | 3.85 | |
| | | AWQ | 1 | 31.95 | 3.82 | |
## 1.5B (vLLM)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) |
|---|---|---|---|---|
| Qwen2.5-1.5B-Instruct | 1 | BF16 | 1 | 183.33 |
| | | GPTQ-Int8 | 1 | 201.67 |
| | | GPTQ-Int4 | 1 | 217.03 |
| | | AWQ | 1 | 213.74 |
| | 6144 | BF16 | 1 | 176.68 |
| | | GPTQ-Int8 | 1 | 192.83 |
| | | GPTQ-Int4 | 1 | 206.63 |
| | | AWQ | 1 | 203.64 |
| | 14336 | BF16 | 1 | 168.69 |
| | | GPTQ-Int8 | 1 | 183.69 |
| | | GPTQ-Int4 | 1 | 195.88 |
| | | AWQ | 1 | 192.64 |
| | 30720 | BF16 | 1 | 152.04 |
| | | GPTQ-Int8 | 1 | 162.82 |
| | | GPTQ-Int4 | 1 | 173.57 |
| | | AWQ | 1 | 170.20 |
## 3B (Transformers)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) | Note |
|---|---|---|---|---|---|---|
| Qwen2.5-3B-Instruct | 1 | BF16 | 1 | 30.80 | 5.95 | |
| | | GPTQ-Int8 | 1 | 25.69 | 3.38 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 35.21 | 2.06 | |
| | | AWQ | 1 | 25.29 | 2.50 | |
| | 6144 | BF16 | 1 | 32.20 | 6.59 | |
| | | GPTQ-Int8 | 1 | 24.69 | 3.98 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 34.47 | 2.67 | |
| | | AWQ | 1 | 24.86 | 2.62 | |
| | 14336 | BF16 | 1 | 31.72 | 7.47 | |
| | | GPTQ-Int8 | 1 | 24.70 | 4.89 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 34.36 | 3.58 | |
| | | AWQ | 1 | 25.19 | 3.54 | |
| | 30720 | BF16 | 1 | 25.37 | 9.30 | |
| | | GPTQ-Int8 | 1 | 21.67 | 6.72 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 23.60 | 5.41 | |
| | | AWQ | 1 | 24.56 | 5.37 | |
## 3B (vLLM)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) |
|---|---|---|---|---|
| Qwen2.5-3B-Instruct | 1 | BF16 | 1 | 127.61 |
| | | GPTQ-Int8 | 1 | 150.02 |
| | | GPTQ-Int4 | 1 | 168.20 |
| | | AWQ | 1 | 165.50 |
| | 6144 | BF16 | 1 | 123.15 |
| | | GPTQ-Int8 | 1 | 143.09 |
| | | GPTQ-Int4 | 1 | 159.85 |
| | | AWQ | 1 | 156.38 |
| | 14336 | BF16 | 1 | 117.35 |
| | | GPTQ-Int8 | 1 | 135.50 |
| | | GPTQ-Int4 | 1 | 149.35 |
| | | AWQ | 1 | 147.75 |
| | 30720 | BF16 | 1 | 105.88 |
| | | GPTQ-Int8 | 1 | 118.38 |
| | | GPTQ-Int4 | 1 | 129.28 |
| | | AWQ | 1 | 127.19 |
## 7B (Transformers)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) | Note |
|---|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 1 | BF16 | 1 | 40.38 | 14.38 | |
| | | GPTQ-Int8 | 1 | 31.55 | 8.42 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 43.10 | 5.52 | |
| | | AWQ | 1 | 32.03 | 5.39 | |
| | 6144 | BF16 | 1 | 38.76 | 15.38 | |
| | | GPTQ-Int8 | 1 | 31.26 | 9.43 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 38.27 | 6.52 | |
| | | AWQ | 1 | 32.37 | 6.39 | |
| | 14336 | BF16 | 1 | 29.78 | 16.91 | |
| | | GPTQ-Int8 | 1 | 26.86 | 10.96 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 28.70 | 8.05 | |
| | | AWQ | 1 | 30.23 | 7.92 | |
| | 30720 | BF16 | 1 | 18.83 | 19.97 | |
| | | GPTQ-Int8 | 1 | 17.59 | 14.01 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 18.45 | 11.11 | |
| | | AWQ | 1 | 19.11 | 10.98 | |
## 7B (vLLM)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 1 | BF16 | 1 | 84.28 | |
| | | GPTQ-Int8 | 1 | 122.01 | |
| | | GPTQ-Int4 | 1 | 154.05 | |
| | | AWQ | 1 | 148.10 | |
| | 6144 | BF16 | 1 | 80.70 | |
| | | GPTQ-Int8 | 1 | 112.38 | |
| | | GPTQ-Int4 | 1 | 141.98 | |
| | | AWQ | 1 | 137.64 | |
| | 14336 | BF16 | 1 | 77.69 | |
| | | GPTQ-Int8 | 1 | 105.25 | |
| | | GPTQ-Int4 | 1 | 129.35 | |
| | | AWQ | 1 | 124.91 | |
| | 30720 | BF16 | 1 | 70.33 | |
| | | GPTQ-Int8 | 1 | 90.71 | |
| | | GPTQ-Int4 | 1 | 108.30 | |
| | | AWQ | 1 | 104.66 | |
| | 63488 | BF16 | 1 | 50.86 | [Setting 3] |
| | | GPTQ-Int8 | 1 | 60.52 | [Setting 3] |
| | | GPTQ-Int4 | 1 | 67.97 | [Setting 3] |
| | | AWQ | 1 | 66.42 | [Setting 3] |
| | 129024 | BF16 | 1 | 28.94 | vllm==0.6.2, new sample config |
| | | GPTQ-Int8 | 1 | 25.97 | vllm==0.6.2, new sample config |
| | | GPTQ-Int4 | 1 | 26.37 | vllm==0.6.2, new sample config |
| | | AWQ | 1 | 26.57 | vllm==0.6.2, new sample config |

[Default Setting] = `gpu_memory_utilization=0.9 max_model_len=32768 enforce_eager=False`

[Setting 3] = `gpu_memory_utilization=1.0 max_model_len=8192 enforce_eager=True`

[new sample config]: for vLLM, set the sampling parameters `SamplingParams(temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1, presence_penalty=0, frequency_penalty=0, max_tokens=out_length)`
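The "new sample config" footnote above is a direct transcription of vLLM sampling parameters; written out as code (`out_length` is the 2048-token generation length used in this benchmark):

```python
from vllm import SamplingParams

out_length = 2048  # tokens generated per request in this benchmark
params = SamplingParams(
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    repetition_penalty=1,
    presence_penalty=0,
    frequency_penalty=0,
    max_tokens=out_length,
)
```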
## 14B (Transformers)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) | Note |
|---|---|---|---|---|---|---|
| Qwen2.5-14B-Instruct | 1 | BF16 | 1 | 24.74 | 28.08 | |
| | | GPTQ-Int8 | 1 | 18.84 | 16.11 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 25.89 | 9.94 | |
| | | AWQ | 1 | 19.23 | 9.79 | |
| | 6144 | BF16 | 1 | 20.51 | 29.50 | |
| | | GPTQ-Int8 | 1 | 17.80 | 17.61 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 20.06 | 11.36 | |
| | | AWQ | 1 | 19.21 | 11.22 | |
| | 14336 | BF16 | 1 | 13.92 | 31.95 | |
| | | GPTQ-Int8 | 1 | 12.66 | 19.98 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 13.79 | 13.81 | |
| | | AWQ | 1 | 14.17 | 13.67 | |
| | 30720 | BF16 | 1 | 8.20 | 36.85 | |
| | | GPTQ-Int8 | 1 | 7.77 | 24.88 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 8.14 | 18.71 | |
| | | AWQ | 1 | 8.31 | 18.57 | |
## 14B (vLLM)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen2.5-14B-Instruct | 1 | BF16 | 1 | 46.30 | |
| | | GPTQ-Int8 | 1 | 70.40 | |
| | | GPTQ-Int4 | 1 | 98.02 | |
| | | AWQ | 1 | 92.66 | |
| | 6144 | BF16 | 1 | 43.83 | |
| | | GPTQ-Int8 | 1 | 64.33 | |
| | | GPTQ-Int4 | 1 | 86.10 | |
| | | AWQ | 1 | 83.11 | |
| | 14336 | BF16 | 1 | 41.91 | |
| | | GPTQ-Int8 | 1 | 59.21 | |
| | | GPTQ-Int4 | 1 | 76.85 | |
| | | AWQ | 1 | 74.03 | |
| | 30720 | BF16 | 1 | 37.18 | |
| | | GPTQ-Int8 | 1 | 49.23 | |
| | | GPTQ-Int4 | 1 | 60.91 | |
| | | AWQ | 1 | 59.01 | |
| | 63488 | BF16 | 1 | 26.85 | [Setting 3] |
| | | GPTQ-Int8 | 1 | 32.83 | [Setting 3] |
| | | GPTQ-Int4 | 1 | 37.67 | [Setting 3] |
| | | AWQ | 1 | 36.71 | [Setting 3] |
| | 129024 | BF16 | 1 | 14.53 | vllm==0.6.2, new sample config |
| | | GPTQ-Int8 | 1 | 15.10 | vllm==0.6.2, new sample config |
| | | GPTQ-Int4 | 1 | 15.13 | vllm==0.6.2, new sample config |
| | | AWQ | 1 | 15.25 | vllm==0.6.2, new sample config |

[Default Setting] = `gpu_memory_utilization=0.9 max_model_len=32768 enforce_eager=False`

[Setting 3] = `gpu_memory_utilization=1.0 max_model_len=8192 enforce_eager=True`

[new sample config]: for vLLM, set the sampling parameters `SamplingParams(temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1, presence_penalty=0, frequency_penalty=0, max_tokens=out_length)`
## 32B (Transformers)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) | Note |
|---|---|---|---|---|---|---|
| Qwen2.5-32B-Instruct | 1 | BF16 | 1 | 17.54 | 61.58 | |
| | | GPTQ-Int8 | 1 | 14.52 | 33.56 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 19.20 | 18.94 | |
| | | AWQ | 1 | 14.60 | 18.67 | |
| | 6144 | BF16 | 1 | 12.49 | 63.72 | |
| | | GPTQ-Int8 | 1 | 11.61 | 35.86 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 13.42 | 21.09 | |
| | | AWQ | 1 | 13.81 | 20.81 | |
| | 14336 | BF16 | 1 | 8.95 | 67.31 | |
| | | GPTQ-Int8 | 1 | 8.53 | 39.28 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 9.48 | 24.67 | |
| | | AWQ | 1 | 9.71 | 24.39 | |
| | 30720 | BF16 | 1 | 5.59 | 74.47 | |
| | | GPTQ-Int8 | 1 | 5.42 | 46.45 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 5.79 | 31.84 | |
| | | AWQ | 1 | 5.85 | 31.56 | |
## 32B (vLLM)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen2.5-32B-Instruct | 1 | BF16 | 1 | 22.13 | [Setting 3] |
| | | GPTQ-Int8 | 1 | 37.57 | |
| | | GPTQ-Int4 | 1 | 55.83 | |
| | | AWQ | 1 | 51.92 | |
| | 6144 | BF16 | 1 | 21.05 | [Setting 3] |
| | | GPTQ-Int8 | 1 | 34.67 | |
| | | GPTQ-Int4 | 1 | 49.96 | |
| | | AWQ | 1 | 46.68 | |
| | 14336 | BF16 | 1 | 19.91 | [Setting 3] |
| | | GPTQ-Int8 | 1 | 31.89 | |
| | | GPTQ-Int4 | 1 | 44.79 | |
| | | AWQ | 1 | 41.83 | |
| | 30720 | BF16 | 2 | 31.82 | |
| | | GPTQ-Int8 | 1 | 26.88 | |
| | | GPTQ-Int4 | 1 | 35.66 | |
| | | AWQ | 1 | 33.75 | |
| | 63488 | BF16 | 2 | 24.45 | [Setting 3] |
| | | GPTQ-Int8 | 1 | 18.60 | [Setting 3] |
| | | GPTQ-Int4 | 1 | 22.72 | [Setting 3] |
| | | AWQ | 1 | 21.79 | [Setting 3] |
| | 129024 | BF16 | 2 | 14.31 | vllm==0.6.2, new sample config |
| | | GPTQ-Int8 | 1 | 9.77 | vllm==0.6.2, new sample config |
| | | GPTQ-Int4 | 1 | 10.39 | vllm==0.6.2, new sample config |
| | | AWQ | 1 | 10.34 | vllm==0.6.2, new sample config |

For context length 129024, the model must be run with `"model_max_length": 131072` in its config.

[Default Setting] = `gpu_memory_utilization=0.9 max_model_len=32768 enforce_eager=False`

[Setting 3] = `gpu_memory_utilization=1.0 max_model_len=8192 enforce_eager=True`

[new sample config]: for vLLM, set the sampling parameters `SamplingParams(temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1, presence_penalty=0, frequency_penalty=0, max_tokens=out_length)`
## 72B (Transformers)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) | Note |
|---|---|---|---|---|---|---|
| Qwen2.5-72B-Instruct | 1 | BF16 | 2 | 8.73 | 136.20 | |
| | | GPTQ-Int8 | 2 | 8.66 | 72.61 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 11.07 | 39.91 | |
| | | AWQ | 1 | 11.50 | 39.44 | |
| | 6144 | BF16 | 2 | 6.39 | 140.00 | |
| | | GPTQ-Int8 | 2 | 6.39 | 77.81 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 7.56 | 42.50 | |
| | | AWQ | 1 | 8.17 | 42.13 | |
| | 14336 | BF16 | 3 | 4.25 | 149.14 | |
| | | GPTQ-Int8 | 2 | 4.66 | 82.55 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 1 | 5.27 | 46.86 | |
| | | AWQ | 1 | 5.57 | 46.38 | |
| | 30720 | BF16 | 3 | 2.94 | 164.79 | |
| | | GPTQ-Int8 | 2 | 2.94 | 94.75 | auto_gptq==0.6.0+cu1210 |
| | | GPTQ-Int4 | 2 | 3.14 | 62.57 | |
| | | AWQ | 2 | 3.23 | 61.64 | |
## 72B (vLLM)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen2.5-72B-Instruct | 1 | BF16 | 2 | 18.19 | [Setting 3] |
| | | BF16 | 4 | 31.37 | Default |
| | | GPTQ-Int8 | 2 | 31.40 | Default |
| | | GPTQ-Int4 | 1 | 16.47 | Default |
| | | GPTQ-Int4 | 2 | 46.30 | [Setting 2] |
| | | AWQ | 2 | 44.30 | Default |
| | 6144 | BF16 | 4 | 29.90 | Default |
| | | GPTQ-Int8 | 2 | 29.37 | Default |
| | | GPTQ-Int4 | 1 | 13.88 | Default |
| | | GPTQ-Int4 | 2 | 42.50 | [Setting 3] |
| | | AWQ | 2 | 40.67 | Default |
| | 14336 | BF16 | 4 | 30.10 | Default |
| | | GPTQ-Int8 | 2 | 27.20 | Default |
| | | GPTQ-Int4 | 2 | 38.10 | Default |
| | | AWQ | 2 | 36.63 | Default |
| | 30720 | BF16 | 4 | 27.53 | Default |
| | | GPTQ-Int8 | 2 | 23.32 | Default |
| | | GPTQ-Int4 | 2 | 30.98 | Default |
| | | AWQ | 2 | 30.02 | Default |
| | 63488 | BF16 | 4 | 20.74 | [Setting 3] |
| | | GPTQ-Int8 | 2 | 16.27 | [Setting 3] |
| | | GPTQ-Int4 | 2 | 19.84 | [Setting 3] |
| | | AWQ | 2 | 19.32 | [Setting 3] |
| | 129024 | BF16 | 4 | 12.68 | [Setting 3] |
| | | GPTQ-Int8 | 4 | 14.11 | [Setting 3] |
| | | GPTQ-Int4 | 2 | 10.11 | [Setting 3] |
| | | AWQ | 2 | 9.88 | [Setting 3] |

Default = `gpu_memory_utilization=0.9 max_model_len=32768 enforce_eager=False`

[Setting 1] = `gpu_memory_utilization=0.98 max_model_len=4096 enforce_eager=True`

[Setting 2] = `gpu_memory_utilization=1.0 max_model_len=4096 enforce_eager=True`

[Setting 3] = `gpu_memory_utilization=1.0 max_model_len=8192 enforce_eager=True`