HF Transformers Inference
This section reports the performance of bf16 models and Int4 quantized models (including GPTQ and AWQ) of the Qwen1.5 series. Specifically, we report the inference speed (tokens/s) and memory footprint (GB) under different context lengths. For inference speed, we report numbers both with and without Flash Attention 2.
The environment of the performance evaluation is:
NVIDIA A100 80GB
CUDA 12.3
PyTorch 2.1.2+cu118
Flash Attention 2.5.6
Note that we use a batch size of 1 and the smallest possible number of GPUs for the evaluation. We test the speed and memory of generating 2048 tokens with input lengths of 1, 6144, 14336, and 30720 tokens.
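The measurement setup above can be sketched as follows. This is a hypothetical benchmark script, not the official one: the dummy prompt construction and the `benchmark` helper name are our own assumptions, and running it requires a CUDA GPU with `torch`, `transformers`, and `flash-attn` installed.

```python
# Hypothetical sketch of the benchmark described above: batch size 1,
# 2048 generated tokens, speed in tokens/s, peak memory in GB.
import time


def tokens_per_second(new_tokens: int, elapsed_s: float) -> float:
    """Decoding speed as reported in the tables below."""
    return new_tokens / elapsed_s


def benchmark(model_id: str, input_len: int, use_fa2: bool = True,
              max_new_tokens: int = 2048):
    # Heavy imports live inside the function so the pure helper above
    # stays importable without a GPU environment.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2" if use_fa2 else "eager",
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # Dummy prompt of the desired length; the exact prompts used for
    # the tables are not specified in this document.
    input_ids = torch.full((1, input_len), tokenizer.eos_token_id,
                           dtype=torch.long, device=model.device)

    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    out = model.generate(input_ids, max_new_tokens=max_new_tokens,
                         min_new_tokens=max_new_tokens, do_sample=False)
    elapsed = time.perf_counter() - start

    speed = tokens_per_second(out.shape[1] - input_len, elapsed)
    mem_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return speed, mem_gb
```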
0.5B:
| Model | Num. GPU | Input Length | Speed (tokens/s, w/wo FA2) | Memory (GB) |
|---|---|---|---|---|
| Qwen1.5-0.5B-Chat | 1 | 1 | 58.54 / 61.34 | 1.46 |
| Qwen1.5-0.5B-Chat | 1 | 6144 | 57.93 / 63.57 | 6.87 |
| Qwen1.5-0.5B-Chat | 1 | 14336 | 59.48 / 60.18 | 14.59 |
| Qwen1.5-0.5B-Chat | 1 | 30720 | 47.65 / 35.44 | 30.04 |
| Qwen1.5-0.5B-Chat-GPTQ-Int4 | 1 | 1 | 57.18 / 63.67 | 1.03 |
| Qwen1.5-0.5B-Chat-GPTQ-Int4 | 1 | 6144 | 57.47 / 63.31 | 6.44 |
| Qwen1.5-0.5B-Chat-GPTQ-Int4 | 1 | 14336 | 57.57 / 52.19 | 14.16 |
| Qwen1.5-0.5B-Chat-GPTQ-Int4 | 1 | 30720 | 41.84 / 32.58 | 29.60 |
| Qwen1.5-0.5B-Chat-AWQ | 1 | 1 | 45.04 / 48.54 | 1.02 |
| Qwen1.5-0.5B-Chat-AWQ | 1 | 6144 | 43.30 / 47.62 | 6.43 |
| Qwen1.5-0.5B-Chat-AWQ | 1 | 14336 | 42.98 / 48.05 | 14.15 |
| Qwen1.5-0.5B-Chat-AWQ | 1 | 30720 | 42.18 / 33.58 | 29.59 |
1.8B:
| Model | Num. GPU | Input Length | Speed (tokens/s, w/wo FA2) | Memory (GB) |
|---|---|---|---|---|
| Qwen1.5-1.8B-Chat | 1 | 1 | 51.82 / 54.01 | 4.60 |
| Qwen1.5-1.8B-Chat | 1 | 6144 | 51.56 / 51.45 | 10.21 |
| Qwen1.5-1.8B-Chat | 1 | 14336 | 45.17 / 30.53 | 18.69 |
| Qwen1.5-1.8B-Chat | 1 | 30720 | 29.21 / 16.70 | 35.67 |
| Qwen1.5-1.8B-Chat-GPTQ-Int4 | 1 | 1 | 58.83 / 65.21 | 2.91 |
| Qwen1.5-1.8B-Chat-GPTQ-Int4 | 1 | 6144 | 54.82 / 46.31 | 8.52 |
| Qwen1.5-1.8B-Chat-GPTQ-Int4 | 1 | 14336 | 41.56 / 28.64 | 17.01 |
| Qwen1.5-1.8B-Chat-GPTQ-Int4 | 1 | 30720 | 26.88 / 16.13 | 33.98 |
| Qwen1.5-1.8B-Chat-AWQ | 1 | 1 | 45.78 / 48.02 | 2.89 |
| Qwen1.5-1.8B-Chat-AWQ | 1 | 6144 | 44.95 / 47.64 | 8.50 |
| Qwen1.5-1.8B-Chat-AWQ | 1 | 14336 | 42.44 / 29.48 | 16.98 |
| Qwen1.5-1.8B-Chat-AWQ | 1 | 30720 | 28.34 / 16.38 | 33.96 |
4B:
| Model | Num. GPU | Input Length | Speed (tokens/s, w/wo FA2) | Memory (GB) |
|---|---|---|---|---|
| Qwen1.5-4B-Chat | 1 | 1 | 30.32 / 32.59 | 9.59 |
| Qwen1.5-4B-Chat | 1 | 6144 | 30.72 / 28.61 | 16.19 |
| Qwen1.5-4B-Chat | 1 | 14336 | 23.46 / 16.96 | 27.08 |
| Qwen1.5-4B-Chat | 1 | 30720 | 14.76 / 9.19 | 48.85 |
| Qwen1.5-4B-Chat-GPTQ-Int4 | 1 | 1 | 33.63 / 36.67 | 5.65 |
| Qwen1.5-4B-Chat-GPTQ-Int4 | 1 | 6144 | 33.93 / 30.66 | 12.25 |
| Qwen1.5-4B-Chat-GPTQ-Int4 | 1 | 14336 | 25.01 / 17.48 | 23.14 |
| Qwen1.5-4B-Chat-GPTQ-Int4 | 1 | 30720 | 15.28 / 9.35 | 44.91 |
| Qwen1.5-4B-Chat-AWQ | 1 | 1 | 28.09 / 28.64 | 5.19 |
| Qwen1.5-4B-Chat-AWQ | 1 | 6144 | 28.00 / 27.83 | 11.79 |
| Qwen1.5-4B-Chat-AWQ | 1 | 14336 | 22.95 / 16.49 | 22.67 |
| Qwen1.5-4B-Chat-AWQ | 1 | 30720 | 14.50 / 9.06 | 44.45 |
MoE-A2.7B:
| Model | Num. GPU | Input Length | Speed (tokens/s, w/wo FA2) | Memory (GB) |
|---|---|---|---|---|
| Qwen1.5-MoE-A2.7B-Chat | 1 | 1 | 8.49 / 8.52 | 27.82 |
| Qwen1.5-MoE-A2.7B-Chat | 1 | 6144 | 8.73 / 8.41 | 33.43 |
| Qwen1.5-MoE-A2.7B-Chat | 1 | 14336 | 8.30 / 7.43 | 41.91 |
| Qwen1.5-MoE-A2.7B-Chat | 1 | 30720 | 7.40 / 6.34 | 58.89 |
| Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 | 1 | 1 | 8.17 / 8.67 | 9.23 |
| Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 | 1 | 6144 | 8.64 / 8.30 | 14.84 |
| Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 | 1 | 14336 | 8.16 / 7.39 | 23.32 |
| Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 | 1 | 30720 | 7.11 / 6.16 | 40.30 |
7B:
| Model | Num. GPU | Input Length | Speed (tokens/s, w/wo FA2) | Memory (GB) |
|---|---|---|---|---|
| Qwen1.5-7B-Chat | 1 | 1 | 37.07 / 40.05 | 16.90 |
| Qwen1.5-7B-Chat | 1 | 6144 | 29.29 / 26.95 | 24.37 |
| Qwen1.5-7B-Chat | 1 | 14336 | 19.93 / 16.18 | 37.01 |
| Qwen1.5-7B-Chat | 1 | 30720 | 12.04 / 8.89 | 62.29 |
| Qwen1.5-7B-Chat-GPTQ-Int4 | 1 | 1 | 38.73 / 46.46 | 8.78 |
| Qwen1.5-7B-Chat-GPTQ-Int4 | 1 | 6144 | 34.33 / 30.76 | 16.26 |
| Qwen1.5-7B-Chat-GPTQ-Int4 | 1 | 14336 | 22.04 / 17.46 | 28.90 |
| Qwen1.5-7B-Chat-GPTQ-Int4 | 1 | 30720 | 12.82 / 9.26 | 54.17 |
| Qwen1.5-7B-Chat-AWQ | 1 | 1 | 32.59 / 36.74 | 8.02 |
| Qwen1.5-7B-Chat-AWQ | 1 | 6144 | 29.13 / 26.91 | 15.49 |
| Qwen1.5-7B-Chat-AWQ | 1 | 14336 | 19.98 / 16.14 | 28.13 |
| Qwen1.5-7B-Chat-AWQ | 1 | 30720 | 12.10 / 8.86 | 53.40 |
14B:
| Model | Num. GPU | Input Length | Speed (tokens/s, w/wo FA2) | Memory (GB) |
|---|---|---|---|---|
| Qwen1.5-14B-Chat | 1 | 1 | 26.89 / 31.36 | 30.18 |
| Qwen1.5-14B-Chat | 1 | 6144 | 19.17 / 18.03 | 39.91 |
| Qwen1.5-14B-Chat | 1 | 14336 | 12.91 / 11.01 | 57.05 |
| Qwen1.5-14B-Chat | 2 | 30720 | 7.68 / 6.09 | 101.65 |
| Qwen1.5-14B-Chat-GPTQ-Int4 | 1 | 1 | 32.79 / 36.88 | 13.87 |
| Qwen1.5-14B-Chat-GPTQ-Int4 | 1 | 6144 | 23.30 / 21.49 | 23.59 |
| Qwen1.5-14B-Chat-GPTQ-Int4 | 1 | 14336 | 14.69 / 12.21 | 40.74 |
| Qwen1.5-14B-Chat-GPTQ-Int4 | 2 | 30720 | 8.14 / 7.68 | - |
| Qwen1.5-14B-Chat-AWQ | 1 | 1 | 27.51 / 29.50 | 12.88 |
| Qwen1.5-14B-Chat-AWQ | 1 | 6144 | 20.37 / 19.03 | 22.61 |
| Qwen1.5-14B-Chat-AWQ | 1 | 14336 | 13.50 / 11.35 | 39.76 |
| Qwen1.5-14B-Chat-AWQ | 2 | 30720 | 7.74 / 6.03 | - |
72B:
| Model | Num. GPU | Input Length | Speed (tokens/s, w/wo FA2) | Memory (GB) |
|---|---|---|---|---|
| Qwen1.5-72B-Chat | 2 | 1 | 7.24 / 8.13 | 142.39 |
| Qwen1.5-72B-Chat | 3 | 6144 | 4.89 / 4.82 | 174.66 |
| Qwen1.5-72B-Chat | 4 | 14336 | 3.37 / 3.13 | 233.00 |
| Qwen1.5-72B-Chat | 5 | 30720 | 2.17 / 2.00 | 344.17 |
| Qwen1.5-72B-Chat-GPTQ-Int4 | 1 | 1 | 9.32 / 10.25 | 50.09 |
| Qwen1.5-72B-Chat-GPTQ-Int4 | 2 | 6144 | 5.87 / 5.84 | 97.38 |
| Qwen1.5-72B-Chat-GPTQ-Int4 | 3 | 14336 | 3.86 / 3.60 | 146.17 |
| Qwen1.5-72B-Chat-GPTQ-Int4 | 4 | 30720 | 2.31 / 2.06 | 238.17 |
| Qwen1.5-72B-Chat-AWQ | 1 | 1 | 10.59 / 12.06 | 49.68 |
| Qwen1.5-72B-Chat-AWQ | 2 | 6144 | 6.47 / 6.41 | - |
| Qwen1.5-72B-Chat-AWQ | 3 | 14336 | 4.09 / 3.78 | - |
| Qwen1.5-72B-Chat-AWQ | 4 | 30720 | 2.35 / 2.10 | - |
(Note: we encountered problems measuring the memory footprint of AWQ models on multiple devices, so those results are not reported. The memory footprint of Qwen1.5-14B with a context of 32768 tokens is also inconsistent with our expectations, so we do not report it either. Additionally, due to the implementation in our HF code, the MoE model runs much slower than expected. Instead, we advise users to deploy the MoE model with vLLM.)
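For the vLLM deployment suggested above, one option is vLLM's OpenAI-compatible server. A minimal launch command, assuming vLLM is installed (`pip install vllm`) on a machine with a sufficiently large CUDA GPU:

```shell
# Serve the MoE model through vLLM's OpenAI-compatible API server
# (default port 8000); clients can then use the standard OpenAI SDK.
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen1.5-MoE-A2.7B-Chat
```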