HF Transformers Inference
This section reports the performance of bf16 models and Int4 quantized models (including GPTQ and AWQ) of the Qwen1.5 series. Specifically, we report the inference speed (tokens/s) and memory footprint (GB) under different context lengths. For inference speed, we report numbers both with and without Flash Attention 2.
The environment of the performance evaluation is:
NVIDIA A100 80GB
CUDA 12.3
PyTorch 2.1.2+cu118
Flash Attention 2.5.6
Note that we use a batch size of 1 and the smallest possible number of GPUs for the evaluation. We test the speed and memory of generating 2048 tokens with input lengths of 1, 6144, 14336, and 30720 tokens.
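The measurement setup above can be sketched as follows. This is a hypothetical benchmark script, not the official one: the dummy prompt construction and the `benchmark` helper name are our own assumptions, and running it requires a CUDA GPU with `torch`, `transformers`, and `flash-attn` installed.

```python
# Hypothetical sketch of the benchmark described above: batch size 1,
# 2048 generated tokens, speed in tokens/s, peak memory in GB.
import time


def tokens_per_second(new_tokens: int, elapsed_s: float) -> float:
    """Decoding speed as reported in the tables below."""
    return new_tokens / elapsed_s


def benchmark(model_id: str, input_len: int, use_fa2: bool = True,
              max_new_tokens: int = 2048):
    # Heavy imports live inside the function so the pure helper above
    # stays importable without a GPU environment.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2" if use_fa2 else "eager",
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # Dummy prompt of the desired length; the exact prompts used for
    # the tables are not specified in this document.
    input_ids = torch.full((1, input_len), tokenizer.eos_token_id,
                           dtype=torch.long, device=model.device)

    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    out = model.generate(input_ids, max_new_tokens=max_new_tokens,
                         min_new_tokens=max_new_tokens, do_sample=False)
    elapsed = time.perf_counter() - start

    speed = tokens_per_second(out.shape[1] - input_len, elapsed)
    mem_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return speed, mem_gb
```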
0.5B:
| Model | Num. GPU | Input Length | Speed (tokens/s, w/wo FA2) | Memory (GB) |
|---|---|---|---|---|
| Qwen1.5-0.5B-Chat | 1 | 1 | 58.54 / 61.34 | 1.46 |
| Qwen1.5-0.5B-Chat | 1 | 6144 | 57.93 / 63.57 | 6.87 |
| Qwen1.5-0.5B-Chat | 1 | 14336 | 59.48 / 60.18 | 14.59 |
| Qwen1.5-0.5B-Chat | 1 | 30720 | 47.65 / 35.44 | 30.04 |
| Qwen1.5-0.5B-Chat-GPTQ-Int4 | 1 | 1 | 57.18 / 63.67 | 1.03 |
| Qwen1.5-0.5B-Chat-GPTQ-Int4 | 1 | 6144 | 57.47 / 63.31 | 6.44 |
| Qwen1.5-0.5B-Chat-GPTQ-Int4 | 1 | 14336 | 57.57 / 52.19 | 14.16 |
| Qwen1.5-0.5B-Chat-GPTQ-Int4 | 1 | 30720 | 41.84 / 32.58 | 29.60 |
| Qwen1.5-0.5B-Chat-AWQ | 1 | 1 | 45.04 / 48.54 | 1.02 |
| Qwen1.5-0.5B-Chat-AWQ | 1 | 6144 | 43.30 / 47.62 | 6.43 |
| Qwen1.5-0.5B-Chat-AWQ | 1 | 14336 | 42.98 / 48.05 | 14.15 |
| Qwen1.5-0.5B-Chat-AWQ | 1 | 30720 | 42.18 / 33.58 | 29.59 |
1.8B:
| Model | Num. GPU | Input Length | Speed (tokens/s, w/wo FA2) | Memory (GB) |
|---|---|---|---|---|
| Qwen1.5-1.8B-Chat | 1 | 1 | 51.82 / 54.01 | 4.60 |
| Qwen1.5-1.8B-Chat | 1 | 6144 | 51.56 / 51.45 | 10.21 |
| Qwen1.5-1.8B-Chat | 1 | 14336 | 45.17 / 30.53 | 18.69 |
| Qwen1.5-1.8B-Chat | 1 | 30720 | 29.21 / 16.70 | 35.67 |
| Qwen1.5-1.8B-Chat-GPTQ-Int4 | 1 | 1 | 58.83 / 65.21 | 2.91 |
| Qwen1.5-1.8B-Chat-GPTQ-Int4 | 1 | 6144 | 54.82 / 46.31 | 8.52 |
| Qwen1.5-1.8B-Chat-GPTQ-Int4 | 1 | 14336 | 41.56 / 28.64 | 17.01 |
| Qwen1.5-1.8B-Chat-GPTQ-Int4 | 1 | 30720 | 26.88 / 16.13 | 33.98 |
| Qwen1.5-1.8B-Chat-AWQ | 1 | 1 | 45.78 / 48.02 | 2.89 |
| Qwen1.5-1.8B-Chat-AWQ | 1 | 6144 | 44.95 / 47.64 | 8.50 |
| Qwen1.5-1.8B-Chat-AWQ | 1 | 14336 | 42.44 / 29.48 | 16.98 |
| Qwen1.5-1.8B-Chat-AWQ | 1 | 30720 | 28.34 / 16.38 | 33.96 |
4B:
| Model | Num. GPU | Input Length | Speed (tokens/s, w/wo FA2) | Memory (GB) |
|---|---|---|---|---|
| Qwen1.5-4B-Chat | 1 | 1 | 30.32 / 32.59 | 9.59 |
| Qwen1.5-4B-Chat | 1 | 6144 | 30.72 / 28.61 | 16.19 |
| Qwen1.5-4B-Chat | 1 | 14336 | 23.46 / 16.96 | 27.08 |
| Qwen1.5-4B-Chat | 1 | 30720 | 14.76 / 9.19 | 48.85 |
| Qwen1.5-4B-Chat-GPTQ-Int4 | 1 | 1 | 33.63 / 36.67 | 5.65 |
| Qwen1.5-4B-Chat-GPTQ-Int4 | 1 | 6144 | 33.93 / 30.66 | 12.25 |
| Qwen1.5-4B-Chat-GPTQ-Int4 | 1 | 14336 | 25.01 / 17.48 | 23.14 |
| Qwen1.5-4B-Chat-GPTQ-Int4 | 1 | 30720 | 15.28 / 9.35 | 44.91 |
| Qwen1.5-4B-Chat-AWQ | 1 | 1 | 28.09 / 28.64 | 5.19 |
| Qwen1.5-4B-Chat-AWQ | 1 | 6144 | 28.00 / 27.83 | 11.79 |
| Qwen1.5-4B-Chat-AWQ | 1 | 14336 | 22.95 / 16.49 | 22.67 |
| Qwen1.5-4B-Chat-AWQ | 1 | 30720 | 14.50 / 9.06 | 44.45 |
MoE-A2.7B:
| Model | Num. GPU | Input Length | Speed (tokens/s, w/wo FA2) | Memory (GB) |
|---|---|---|---|---|
| Qwen1.5-MoE-A2.7B-Chat | 1 | 1 | 8.49 / 8.52 | 27.82 |
| Qwen1.5-MoE-A2.7B-Chat | 1 | 6144 | 8.73 / 8.41 | 33.43 |
| Qwen1.5-MoE-A2.7B-Chat | 1 | 14336 | 8.30 / 7.43 | 41.91 |
| Qwen1.5-MoE-A2.7B-Chat | 1 | 30720 | 7.40 / 6.34 | 58.89 |
| Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 | 1 | 1 | 8.17 / 8.67 | 9.23 |
| Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 | 1 | 6144 | 8.64 / 8.30 | 14.84 |
| Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 | 1 | 14336 | 8.16 / 7.39 | 23.32 |
| Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 | 1 | 30720 | 7.11 / 6.16 | 40.30 |
7B:
| Model | Num. GPU | Input Length | Speed (tokens/s, w/wo FA2) | Memory (GB) |
|---|---|---|---|---|
| Qwen1.5-7B-Chat | 1 | 1 | 37.07 / 40.05 | 16.90 |
| Qwen1.5-7B-Chat | 1 | 6144 | 29.29 / 26.95 | 24.37 |
| Qwen1.5-7B-Chat | 1 | 14336 | 19.93 / 16.18 | 37.01 |
| Qwen1.5-7B-Chat | 1 | 30720 | 12.04 / 8.89 | 62.29 |
| Qwen1.5-7B-Chat-GPTQ-Int4 | 1 | 1 | 38.73 / 46.46 | 8.78 |
| Qwen1.5-7B-Chat-GPTQ-Int4 | 1 | 6144 | 34.33 / 30.76 | 16.26 |
| Qwen1.5-7B-Chat-GPTQ-Int4 | 1 | 14336 | 22.04 / 17.46 | 28.90 |
| Qwen1.5-7B-Chat-GPTQ-Int4 | 1 | 30720 | 12.82 / 9.26 | 54.17 |
| Qwen1.5-7B-Chat-AWQ | 1 | 1 | 32.59 / 36.74 | 8.02 |
| Qwen1.5-7B-Chat-AWQ | 1 | 6144 | 29.13 / 26.91 | 15.49 |
| Qwen1.5-7B-Chat-AWQ | 1 | 14336 | 19.98 / 16.14 | 28.13 |
| Qwen1.5-7B-Chat-AWQ | 1 | 30720 | 12.10 / 8.86 | 53.40 |
14B:
| Model | Num. GPU | Input Length | Speed (tokens/s, w/wo FA2) | Memory (GB) |
|---|---|---|---|---|
| Qwen1.5-14B-Chat | 1 | 1 | 26.89 / 31.36 | 30.18 |
| Qwen1.5-14B-Chat | 1 | 6144 | 19.17 / 18.03 | 39.91 |
| Qwen1.5-14B-Chat | 1 | 14336 | 12.91 / 11.01 | 57.05 |
| Qwen1.5-14B-Chat | 2 | 30720 | 7.68 / 6.09 | 101.65 |
| Qwen1.5-14B-Chat-GPTQ-Int4 | 1 | 1 | 32.79 / 36.88 | 13.87 |
| Qwen1.5-14B-Chat-GPTQ-Int4 | 1 | 6144 | 23.30 / 21.49 | 23.59 |
| Qwen1.5-14B-Chat-GPTQ-Int4 | 1 | 14336 | 14.69 / 12.21 | 40.74 |
| Qwen1.5-14B-Chat-GPTQ-Int4 | 2 | 30720 | 8.14 / 7.68 | - |
| Qwen1.5-14B-Chat-AWQ | 1 | 1 | 27.51 / 29.50 | 12.88 |
| Qwen1.5-14B-Chat-AWQ | 1 | 6144 | 20.37 / 19.03 | 22.61 |
| Qwen1.5-14B-Chat-AWQ | 1 | 14336 | 13.50 / 11.35 | 39.76 |
| Qwen1.5-14B-Chat-AWQ | 2 | 30720 | 7.74 / 6.03 | - |
72B:
| Model | Num. GPU | Input Length | Speed (tokens/s, w/wo FA2) | Memory (GB) |
|---|---|---|---|---|
| Qwen1.5-72B-Chat | 2 | 1 | 7.24 / 8.13 | 142.39 |
| Qwen1.5-72B-Chat | 3 | 6144 | 4.89 / 4.82 | 174.66 |
| Qwen1.5-72B-Chat | 4 | 14336 | 3.37 / 3.13 | 233.00 |
| Qwen1.5-72B-Chat | 5 | 30720 | 2.17 / 2.00 | 344.17 |
| Qwen1.5-72B-Chat-GPTQ-Int4 | 1 | 1 | 9.32 / 10.25 | 50.09 |
| Qwen1.5-72B-Chat-GPTQ-Int4 | 2 | 6144 | 5.87 / 5.84 | 97.38 |
| Qwen1.5-72B-Chat-GPTQ-Int4 | 3 | 14336 | 3.86 / 3.60 | 146.17 |
| Qwen1.5-72B-Chat-GPTQ-Int4 | 4 | 30720 | 2.31 / 2.06 | 238.17 |
| Qwen1.5-72B-Chat-AWQ | 1 | 1 | 10.59 / 12.06 | 49.68 |
| Qwen1.5-72B-Chat-AWQ | 2 | 6144 | 6.47 / 6.41 | - |
| Qwen1.5-72B-Chat-AWQ | 3 | 14336 | 4.09 / 3.78 | - |
| Qwen1.5-72B-Chat-AWQ | 4 | 30720 | 2.35 / 2.10 | - |
(Note: we encountered problems measuring the memory footprint of AWQ models on multiple devices, so those results are not reported. The memory footprint of Qwen1.5-14B with a context of 32768 tokens is also inconsistent with our expectations, so we do not report it either. Additionally, due to the implementation in our HF code, the MoE model runs much slower than expected. Instead, we advise users to deploy the MoE model with vLLM.)
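For the vLLM deployment suggested above, one option is vLLM's OpenAI-compatible server. A minimal launch command, assuming vLLM is installed (`pip install vllm`) on a machine with a sufficiently large CUDA GPU:

```shell
# Serve the MoE model through vLLM's OpenAI-compatible API server
# (default port 8000); clients can then use the standard OpenAI SDK.
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen1.5-MoE-A2.7B-Chat
```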