HF Transformers Inference

This section reports the performance of bf16 models and Int4-quantized models (including GPTQ and AWQ) of the Qwen1.5 series. Specifically, we report the inference speed (tokens/s) and memory footprint (GB) under different context lengths. For inference speed, we report results both with and without Flash Attention 2.

The environment of the performance evaluation is:

  • NVIDIA A100 80GB

  • CUDA 12.3

  • PyTorch 2.1.2+cu118

  • Flash Attention 2.5.6

Note that we use a batch size of 1 and the fewest GPUs possible for the evaluation. We measure the speed and memory of generating 2048 tokens with input lengths of 1, 6144, 14336, and 30720 tokens (a minimal measurement sketch follows).
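The exact benchmark harness is not included here, but a minimal sketch of an equivalent measurement with the standard transformers and PyTorch APIs might look like the following. The model name, the dummy-prompt construction, and the timing details are illustrative assumptions, not the script used to produce the tables:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative settings; the tables cover more models and all four input lengths.
model_name = "Qwen/Qwen1.5-7B-Chat"
input_length = 6144       # one of 1 / 6144 / 14336 / 30720
new_tokens = 2048         # every run generates 2048 tokens

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # drop this line for the "w/o FA2" numbers
    device_map="auto",
)

# Dummy prompt of the desired length (assumption: a real harness may use actual text).
pad_id = tokenizer.pad_token_id or tokenizer.eos_token_id
input_ids = torch.full((1, input_length), pad_id, dtype=torch.long, device=model.device)

torch.cuda.reset_peak_memory_stats()
start = time.time()
model.generate(
    input_ids,
    max_new_tokens=new_tokens,
    min_new_tokens=new_tokens,  # force exactly 2048 generated tokens
    do_sample=False,
)
elapsed = time.time() - start

speed = new_tokens / elapsed                             # tokens/s, as in the tables below
memory_gb = torch.cuda.max_memory_allocated() / 1024**3  # peak memory on the first device
print(f"speed: {speed:.2f} tokens/s, memory: {memory_gb:.2f} GB")
```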

  • 0.5B:

| Model | Num. GPUs | Input Length | Speed (tokens/s, w/ / w/o FA2) | Memory (GB) |
|---|---|---|---|---|
| Qwen1.5-0.5B-Chat | 1 | 1 | 58.54 / 61.34 | 1.46 |
| Qwen1.5-0.5B-Chat | 1 | 6144 | 57.93 / 63.57 | 6.87 |
| Qwen1.5-0.5B-Chat | 1 | 14336 | 59.48 / 60.18 | 14.59 |
| Qwen1.5-0.5B-Chat | 1 | 30720 | 47.65 / 35.44 | 30.04 |
| Qwen1.5-0.5B-Chat-GPTQ-Int4 | 1 | 1 | 57.18 / 63.67 | 1.03 |
| Qwen1.5-0.5B-Chat-GPTQ-Int4 | 1 | 6144 | 57.47 / 63.31 | 6.44 |
| Qwen1.5-0.5B-Chat-GPTQ-Int4 | 1 | 14336 | 57.57 / 52.19 | 14.16 |
| Qwen1.5-0.5B-Chat-GPTQ-Int4 | 1 | 30720 | 41.84 / 32.58 | 29.60 |
| Qwen1.5-0.5B-Chat-AWQ | 1 | 1 | 45.04 / 48.54 | 1.02 |
| Qwen1.5-0.5B-Chat-AWQ | 1 | 6144 | 43.30 / 47.62 | 6.43 |
| Qwen1.5-0.5B-Chat-AWQ | 1 | 14336 | 42.98 / 48.05 | 14.15 |
| Qwen1.5-0.5B-Chat-AWQ | 1 | 30720 | 42.18 / 33.58 | 29.59 |

  • 1.8B:

| Model | Num. GPUs | Input Length | Speed (tokens/s, w/ / w/o FA2) | Memory (GB) |
|---|---|---|---|---|
| Qwen1.5-1.8B-Chat | 1 | 1 | 51.82 / 54.01 | 4.60 |
| Qwen1.5-1.8B-Chat | 1 | 6144 | 51.56 / 51.45 | 10.21 |
| Qwen1.5-1.8B-Chat | 1 | 14336 | 45.17 / 30.53 | 18.69 |
| Qwen1.5-1.8B-Chat | 1 | 30720 | 29.21 / 16.70 | 35.67 |
| Qwen1.5-1.8B-Chat-GPTQ-Int4 | 1 | 1 | 58.83 / 65.21 | 2.91 |
| Qwen1.5-1.8B-Chat-GPTQ-Int4 | 1 | 6144 | 54.82 / 46.31 | 8.52 |
| Qwen1.5-1.8B-Chat-GPTQ-Int4 | 1 | 14336 | 41.56 / 28.64 | 17.01 |
| Qwen1.5-1.8B-Chat-GPTQ-Int4 | 1 | 30720 | 26.88 / 16.13 | 33.98 |
| Qwen1.5-1.8B-Chat-AWQ | 1 | 1 | 45.78 / 48.02 | 2.89 |
| Qwen1.5-1.8B-Chat-AWQ | 1 | 6144 | 44.95 / 47.64 | 8.50 |
| Qwen1.5-1.8B-Chat-AWQ | 1 | 14336 | 42.44 / 29.48 | 16.98 |
| Qwen1.5-1.8B-Chat-AWQ | 1 | 30720 | 28.34 / 16.38 | 33.96 |

  • 4B:

| Model | Num. GPUs | Input Length | Speed (tokens/s, w/ / w/o FA2) | Memory (GB) |
|---|---|---|---|---|
| Qwen1.5-4B-Chat | 1 | 1 | 30.32 / 32.59 | 9.59 |
| Qwen1.5-4B-Chat | 1 | 6144 | 30.72 / 28.61 | 16.19 |
| Qwen1.5-4B-Chat | 1 | 14336 | 23.46 / 16.96 | 27.08 |
| Qwen1.5-4B-Chat | 1 | 30720 | 14.76 / 9.19 | 48.85 |
| Qwen1.5-4B-Chat-GPTQ-Int4 | 1 | 1 | 33.63 / 36.67 | 5.65 |
| Qwen1.5-4B-Chat-GPTQ-Int4 | 1 | 6144 | 33.93 / 30.66 | 12.25 |
| Qwen1.5-4B-Chat-GPTQ-Int4 | 1 | 14336 | 25.01 / 17.48 | 23.14 |
| Qwen1.5-4B-Chat-GPTQ-Int4 | 1 | 30720 | 15.28 / 9.35 | 44.91 |
| Qwen1.5-4B-Chat-AWQ | 1 | 1 | 28.09 / 28.64 | 5.19 |
| Qwen1.5-4B-Chat-AWQ | 1 | 6144 | 28.00 / 27.83 | 11.79 |
| Qwen1.5-4B-Chat-AWQ | 1 | 14336 | 22.95 / 16.49 | 22.67 |
| Qwen1.5-4B-Chat-AWQ | 1 | 30720 | 14.50 / 9.06 | 44.45 |

  • MoE-A2.7B:

| Model | Num. GPUs | Input Length | Speed (tokens/s, w/ / w/o FA2) | Memory (GB) |
|---|---|---|---|---|
| Qwen1.5-MoE-A2.7B-Chat | 1 | 1 | 8.49 / 8.52 | 27.82 |
| Qwen1.5-MoE-A2.7B-Chat | 1 | 6144 | 8.73 / 8.41 | 33.43 |
| Qwen1.5-MoE-A2.7B-Chat | 1 | 14336 | 8.30 / 7.43 | 41.91 |
| Qwen1.5-MoE-A2.7B-Chat | 1 | 30720 | 7.40 / 6.34 | 58.89 |
| Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 | 1 | 1 | 8.17 / 8.67 | 9.23 |
| Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 | 1 | 6144 | 8.64 / 8.30 | 14.84 |
| Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 | 1 | 14336 | 8.16 / 7.39 | 23.32 |
| Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 | 1 | 30720 | 7.11 / 6.16 | 40.30 |

  • 7B:

| Model | Num. GPUs | Input Length | Speed (tokens/s, w/ / w/o FA2) | Memory (GB) |
|---|---|---|---|---|
| Qwen1.5-7B-Chat | 1 | 1 | 37.07 / 40.05 | 16.90 |
| Qwen1.5-7B-Chat | 1 | 6144 | 29.29 / 26.95 | 24.37 |
| Qwen1.5-7B-Chat | 1 | 14336 | 19.93 / 16.18 | 37.01 |
| Qwen1.5-7B-Chat | 1 | 30720 | 12.04 / 8.89 | 62.29 |
| Qwen1.5-7B-Chat-GPTQ-Int4 | 1 | 1 | 38.73 / 46.46 | 8.78 |
| Qwen1.5-7B-Chat-GPTQ-Int4 | 1 | 6144 | 34.33 / 30.76 | 16.26 |
| Qwen1.5-7B-Chat-GPTQ-Int4 | 1 | 14336 | 22.04 / 17.46 | 28.90 |
| Qwen1.5-7B-Chat-GPTQ-Int4 | 1 | 30720 | 12.82 / 9.26 | 54.17 |
| Qwen1.5-7B-Chat-AWQ | 1 | 1 | 32.59 / 36.74 | 8.02 |
| Qwen1.5-7B-Chat-AWQ | 1 | 6144 | 29.13 / 26.91 | 15.49 |
| Qwen1.5-7B-Chat-AWQ | 1 | 14336 | 19.98 / 16.14 | 28.13 |
| Qwen1.5-7B-Chat-AWQ | 1 | 30720 | 12.10 / 8.86 | 53.40 |

  • 14B:

| Model | Num. GPUs | Input Length | Speed (tokens/s, w/ / w/o FA2) | Memory (GB) |
|---|---|---|---|---|
| Qwen1.5-14B-Chat | 1 | 1 | 26.89 / 31.36 | 30.18 |
| Qwen1.5-14B-Chat | 1 | 6144 | 19.17 / 18.03 | 39.91 |
| Qwen1.5-14B-Chat | 1 | 14336 | 12.91 / 11.01 | 57.05 |
| Qwen1.5-14B-Chat | 2 | 30720 | 7.68 / 6.09 | 101.65 |
| Qwen1.5-14B-Chat-GPTQ-Int4 | 1 | 1 | 32.79 / 36.88 | 13.87 |
| Qwen1.5-14B-Chat-GPTQ-Int4 | 1 | 6144 | 23.30 / 21.49 | 23.59 |
| Qwen1.5-14B-Chat-GPTQ-Int4 | 1 | 14336 | 14.69 / 12.21 | 40.74 |
| Qwen1.5-14B-Chat-GPTQ-Int4 | 2 | 30720 | 8.14 / 7.68 | - |
| Qwen1.5-14B-Chat-AWQ | 1 | 1 | 27.51 / 29.50 | 12.88 |
| Qwen1.5-14B-Chat-AWQ | 1 | 6144 | 20.37 / 19.03 | 22.61 |
| Qwen1.5-14B-Chat-AWQ | 1 | 14336 | 13.50 / 11.35 | 39.76 |
| Qwen1.5-14B-Chat-AWQ | 2 | 30720 | 7.74 / 6.03 | - |

  • 72B:

| Model | Num. GPUs | Input Length | Speed (tokens/s, w/ / w/o FA2) | Memory (GB) |
|---|---|---|---|---|
| Qwen1.5-72B-Chat | 2 | 1 | 7.24 / 8.13 | 142.39 |
| Qwen1.5-72B-Chat | 3 | 6144 | 4.89 / 4.82 | 174.66 |
| Qwen1.5-72B-Chat | 4 | 14336 | 3.37 / 3.13 | 233.00 |
| Qwen1.5-72B-Chat | 5 | 30720 | 2.17 / 2.00 | 344.17 |
| Qwen1.5-72B-Chat-GPTQ-Int4 | 1 | 1 | 9.32 / 10.25 | 50.09 |
| Qwen1.5-72B-Chat-GPTQ-Int4 | 2 | 6144 | 5.87 / 5.84 | 97.38 |
| Qwen1.5-72B-Chat-GPTQ-Int4 | 3 | 14336 | 3.86 / 3.60 | 146.17 |
| Qwen1.5-72B-Chat-GPTQ-Int4 | 4 | 30720 | 2.31 / 2.06 | 238.17 |
| Qwen1.5-72B-Chat-AWQ | 1 | 1 | 10.59 / 12.06 | 49.68 |
| Qwen1.5-72B-Chat-AWQ | 2 | 6144 | 6.47 / 6.41 | - |
| Qwen1.5-72B-Chat-AWQ | 3 | 14336 | 4.09 / 3.78 | - |
| Qwen1.5-72B-Chat-AWQ | 4 | 30720 | 2.35 / 2.10 | - |

(Note: we had problems collecting memory-footprint statistics for AWQ models running on multiple devices, so those results are not reported. The memory footprint of Qwen1.5-14B at a context length of 32768 tokens was also inconsistent with our expectations, so it is not reported either. Additionally, due to the current implementation in our HF code, the MoE model runs much slower than expected; we advise users to deploy the MoE model with vLLM instead.)
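As a rough illustration of the vLLM deployment suggested above, the following sketch uses vLLM's offline inference API; the prompt and sampling parameters are arbitrary examples, not recommended settings:

```python
from vllm import LLM, SamplingParams

# Minimal offline-inference sketch with vLLM; sampling values are illustrative only.
llm = LLM(model="Qwen/Qwen1.5-MoE-A2.7B-Chat")
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=512)

outputs = llm.generate(
    ["Give me a short introduction to large language models."],
    sampling_params,
)
for output in outputs:
    print(output.outputs[0].text)
```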