Speed Benchmark

Attention: To be updated for Qwen2.5.

This section reports the speed performance of the Qwen2 series, covering BF16 models and quantized models (GPTQ-Int8, GPTQ-Int4, and AWQ). Specifically, we report the inference speed (tokens/s) and the memory footprint (GB) at different context lengths.

The evaluation environment with HuggingFace Transformers is:

  • NVIDIA A100 80GB

  • CUDA 11.8

  • PyTorch 2.1.2+cu118

  • Flash Attention 2.3.3

  • Transformers 4.38.2

  • AutoGPTQ 0.7.1

  • AutoAWQ 0.2.4

The evaluation environment with vLLM is:

  • NVIDIA A100 80GB

  • CUDA 11.8

  • PyTorch 2.3.0+cu118

  • Flash Attention 2.5.6

  • Transformers 4.40.1

  • vLLM 0.4.2

Note:

  • We use a batch size of 1 and as few GPUs as possible for the evaluation.

  • We test the speed and memory of generating 2048 tokens with input lengths of 1, 6144, 14336, 30720, 63488, and 129024 tokens (input lengths greater than 32K are only available for Qwen2-72B-Instruct and Qwen2-7B-Instruct); a minimal measurement sketch follows these notes.

  • For vLLM, the memory usage is not reported because it pre-allocates all GPU memory. We use gpu_memory_utilization=0.9, max_model_len=32768, and enforce_eager=False by default.
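
For reference, the Transformers numbers can be approximated with a loop like the one below. This is a minimal sketch under the stated setup (batch size 1, 2048 generated tokens, BF16 with Flash Attention 2), not the exact script used to produce the tables; the model ID and input length are placeholders to vary per row.

```python
# Minimal sketch: measure decoding speed (tokens/s) and peak GPU memory with
# HuggingFace Transformers. model_id and context_length are placeholders.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B-Instruct"   # any Qwen2 model from the tables below
context_length = 6144                 # one of the tested input lengths
new_tokens = 2048                     # tokens generated per measurement

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

# Dummy prompt of exactly `context_length` tokens, batch size 1.
input_ids = torch.full(
    (1, context_length), tokenizer.eos_token_id, dtype=torch.long, device=model.device
)

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
model.generate(
    input_ids, max_new_tokens=new_tokens, min_new_tokens=new_tokens, do_sample=False
)
elapsed = time.perf_counter() - start

print(f"speed: {new_tokens / elapsed:.2f} tokens/s")
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```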

  • 0.5B (Transformer)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) |
| --- | --- | --- | --- | --- | --- |
| Qwen2-0.5B-Instruct | 1 | BF16 | 1 | 49.94 | 1.17 |
| | | GPTQ-Int8 | 1 | 36.35 | 0.85 |
| | | GPTQ-Int4 | 1 | 49.56 | 0.68 |
| | | AWQ | 1 | 38.78 | 0.68 |
| | 6144 | BF16 | 1 | 50.83 | 6.42 |
| | | GPTQ-Int8 | 1 | 36.56 | 6.09 |
| | | GPTQ-Int4 | 1 | 49.63 | 5.93 |
| | | AWQ | 1 | 38.73 | 5.92 |
| | 14336 | BF16 | 1 | 49.56 | 13.48 |
| | | GPTQ-Int8 | 1 | 36.23 | 13.15 |
| | | GPTQ-Int4 | 1 | 48.68 | 12.97 |
| | | AWQ | 1 | 38.94 | 12.99 |
| | 30720 | BF16 | 1 | 49.25 | 27.61 |
| | | GPTQ-Int8 | 1 | 34.61 | 27.28 |
| | | GPTQ-Int4 | 1 | 48.18 | 27.12 |
| | | AWQ | 1 | 38.19 | 27.11 |

  • 0.5B (vLLM)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) |
| --- | --- | --- | --- | --- |
| Qwen2-0.5B-Instruct | 1 | BF16 | 1 | 270.49 |
| | | GPTQ-Int8 | 1 | 235.95 |
| | | GPTQ-Int4 | 1 | 240.07 |
| | | AWQ | 1 | 233.31 |
| | 6144 | BF16 | 1 | 256.16 |
| | | GPTQ-Int8 | 1 | 224.30 |
| | | GPTQ-Int4 | 1 | 226.41 |
| | | AWQ | 1 | 222.83 |
| | 14336 | BF16 | 1 | 108.89 |
| | | GPTQ-Int8 | 1 | 108.10 |
| | | GPTQ-Int4 | 1 | 106.51 |
| | | AWQ | 1 | 104.16 |
| | 30720 | BF16 | 1 | 97.20 |
| | | GPTQ-Int8 | 1 | 94.49 |
| | | GPTQ-Int4 | 1 | 93.94 |
| | | AWQ | 1 | 92.23 |

  • 1.5B (Transformer)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) |
| --- | --- | --- | --- | --- | --- |
| Qwen2-1.5B-Instruct | 1 | BF16 | 1 | 40.89 | 3.44 |
| | | GPTQ-Int8 | 1 | 31.51 | 2.31 |
| | | GPTQ-Int4 | 1 | 42.47 | 1.67 |
| | | AWQ | 1 | 33.62 | 1.64 |
| | 6144 | BF16 | 1 | 40.86 | 8.74 |
| | | GPTQ-Int8 | 1 | 31.31 | 7.59 |
| | | GPTQ-Int4 | 1 | 42.78 | 6.95 |
| | | AWQ | 1 | 32.90 | 6.92 |
| | 14336 | BF16 | 1 | 40.08 | 15.92 |
| | | GPTQ-Int8 | 1 | 31.19 | 14.79 |
| | | GPTQ-Int4 | 1 | 42.25 | 14.14 |
| | | AWQ | 1 | 33.24 | 14.12 |
| | 30720 | BF16 | 1 | 34.09 | 30.31 |
| | | GPTQ-Int8 | 1 | 28.52 | 29.18 |
| | | GPTQ-Int4 | 1 | 31.30 | 28.54 |
| | | AWQ | 1 | 32.16 | 28.51 |

  • 1.5B (vLLM)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) |
| --- | --- | --- | --- | --- |
| Qwen2-1.5B-Instruct | 1 | BF16 | 1 | 175.55 |
| | | GPTQ-Int8 | 1 | 172.28 |
| | | GPTQ-Int4 | 1 | 184.58 |
| | | AWQ | 1 | 170.87 |
| | 6144 | BF16 | 1 | 166.23 |
| | | GPTQ-Int8 | 1 | 164.32 |
| | | GPTQ-Int4 | 1 | 174.04 |
| | | AWQ | 1 | 162.81 |
| | 14336 | BF16 | 1 | 83.67 |
| | | GPTQ-Int8 | 1 | 98.63 |
| | | GPTQ-Int4 | 1 | 97.65 |
| | | AWQ | 1 | 92.48 |
| | 30720 | BF16 | 1 | 77.69 |
| | | GPTQ-Int8 | 1 | 86.42 |
| | | GPTQ-Int4 | 1 | 87.49 |
| | | AWQ | 1 | 82.88 |

  • 7B (Transformer)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) |
| --- | --- | --- | --- | --- | --- |
| Qwen2-7B-Instruct | 1 | BF16 | 1 | 37.97 | 14.92 |
| | | GPTQ-Int8 | 1 | 30.85 | 8.97 |
| | | GPTQ-Int4 | 1 | 36.17 | 6.06 |
| | | AWQ | 1 | 33.08 | 5.93 |
| | 6144 | BF16 | 1 | 34.74 | 20.26 |
| | | GPTQ-Int8 | 1 | 31.13 | 14.31 |
| | | GPTQ-Int4 | 1 | 33.34 | 11.40 |
| | | AWQ | 1 | 30.86 | 11.27 |
| | 14336 | BF16 | 1 | 26.63 | 27.71 |
| | | GPTQ-Int8 | 1 | 24.58 | 21.76 |
| | | GPTQ-Int4 | 1 | 25.81 | 18.86 |
| | | AWQ | 1 | 27.61 | 18.72 |
| | 30720 | BF16 | 1 | 17.49 | 42.62 |
| | | GPTQ-Int8 | 1 | 16.69 | 36.67 |
| | | GPTQ-Int4 | 1 | 17.17 | 33.76 |
| | | AWQ | 1 | 17.87 | 33.63 |

  • 7B (vLLM)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) |
| --- | --- | --- | --- | --- |
| Qwen2-7B-Instruct | 1 | BF16 | 1 | 80.45 |
| | | GPTQ-Int8 | 1 | 114.32 |
| | | GPTQ-Int4 | 1 | 143.40 |
| | | AWQ | 1 | 96.65 |
| | 6144 | BF16 | 1 | 76.41 |
| | | GPTQ-Int8 | 1 | 107.02 |
| | | GPTQ-Int4 | 1 | 131.55 |
| | | AWQ | 1 | 91.38 |
| | 14336 | BF16 | 1 | 66.54 |
| | | GPTQ-Int8 | 1 | 89.72 |
| | | GPTQ-Int4 | 1 | 97.93 |
| | | AWQ | 1 | 76.87 |
| | 30720 | BF16 | 1 | 55.83 |
| | | GPTQ-Int8 | 1 | 71.58 |
| | | GPTQ-Int4 | 1 | 81.48 |
| | | AWQ | 1 | 63.62 |
| | 63488 | BF16 | 1 | 41.20 |
| | | GPTQ-Int8 | 1 | 49.37 |
| | | GPTQ-Int4 | 1 | 54.12 |
| | | AWQ | 1 | 45.89 |
| | 129024 | BF16 | 1 | 25.01 |
| | | GPTQ-Int8 | 1 | 27.73 |
| | | GPTQ-Int4 | 1 | 29.39 |
| | | AWQ | 1 | 27.13 |

  • 57B-A14B (Transformer)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) |
| --- | --- | --- | --- | --- | --- |
| Qwen2-57B-A14B-Instruct | 1 | BF16 | 2 | 4.76 | 110.29 |
| | | GPTQ-Int4 | 1 | 5.55 | 30.38 |
| | 6144 | BF16 | 2 | 4.90 | 117.80 |
| | | GPTQ-Int4 | 1 | 5.44 | 35.67 |
| | 14336 | BF16 | 2 | 4.58 | 128.17 |
| | | GPTQ-Int4 | 1 | 5.31 | 43.11 |
| | 30720 | BF16 | 2 | 4.12 | 163.77 |
| | | GPTQ-Int4 | 1 | 4.72 | 58.01 |

  • 57B-A14B (vLLM)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) |
| --- | --- | --- | --- | --- |
| Qwen2-57B-A14B-Instruct | 1 | BF16 | 2 | 31.44 |
| | 6144 | BF16 | 2 | 31.77 |
| | 14336 | BF16 | 2 | 21.25 |
| | 30720 | BF16 | 2 | 20.24 |

Note: Compared with dense models, MoE models deliver higher throughput when the batch size is large, as shown below:

| Model | Quantization | # Prompts | QPS | Tokens/s |
| --- | --- | --- | --- | --- |
| Qwen1.5-32B-Chat | BF16 | 100 | 6.68 | 7343.56 |
| Qwen2-57B-A14B-Instruct | BF16 | 100 | 4.81 | 5291.15 |
| Qwen1.5-32B-Chat | BF16 | 1000 | 7.99 | 8791.35 |
| Qwen2-57B-A14B-Instruct | BF16 | 1000 | 5.18 | 5698.37 |

The results above are obtained with the vLLM throughput benchmarking script and can be reproduced by:

python vllm/benchmarks/benchmark_throughput.py --input-len 1000 --output-len 100 --model <model_path> --num-prompts <number of prompts> --enforce-eager -tp 2

  • 72B (Transformer)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) |
| --- | --- | --- | --- | --- | --- |
| Qwen2-72B-Instruct | 1 | BF16 | 2 | 7.45 | 134.74 |
| | | GPTQ-Int8 | 2 | 7.30 | 71.00 |
| | | GPTQ-Int4 | 1 | 9.05 | 41.80 |
| | | AWQ | 1 | 9.96 | 41.31 |
| | 6144 | BF16 | 2 | 5.99 | 144.38 |
| | | GPTQ-Int8 | 2 | 5.93 | 80.60 |
| | | GPTQ-Int4 | 1 | 6.79 | 47.90 |
| | | AWQ | 1 | 7.49 | 47.42 |
| | 14336 | BF16 | 3 | 4.12 | 169.93 |
| | | GPTQ-Int8 | 2 | 4.43 | 95.14 |
| | | GPTQ-Int4 | 1 | 4.87 | 57.79 |
| | | AWQ | 1 | 5.23 | 57.30 |
| | 30720 | BF16 | 3 | 2.86 | 209.03 |
| | | GPTQ-Int8 | 2 | 2.83 | 124.20 |
| | | GPTQ-Int4 | 2 | 3.02 | 107.94 |
| | | AWQ | 2 | 1.85 | 88.60 |

  • 72B (vLLM)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Setting |
| --- | --- | --- | --- | --- | --- |
| Qwen2-72B-Instruct | 1 | BF16 | 2 | 17.68 | [Setting 1] |
| | | BF16 | 4 | 30.01 | |
| | | GPTQ-Int8 | 2 | 27.56 | |
| | | GPTQ-Int4 | 1 | 29.60 | [Setting 2] |
| | | GPTQ-Int4 | 2 | 42.82 | |
| | | AWQ | 2 | 27.73 | |
| | 6144 | BF16 | 4 | 27.98 | |
| | | GPTQ-Int8 | 2 | 25.46 | |
| | | GPTQ-Int4 | 1 | 25.16 | [Setting 3] |
| | | GPTQ-Int4 | 2 | 38.23 | |
| | | AWQ | 2 | 25.77 | |
| | 14336 | BF16 | 4 | 21.81 | |
| | | GPTQ-Int8 | 2 | 22.71 | |
| | | GPTQ-Int4 | 2 | 26.54 | |
| | | AWQ | 2 | 21.50 | |
| | 30720 | BF16 | 4 | 19.43 | |
| | | GPTQ-Int8 | 2 | 18.69 | |
| | | GPTQ-Int4 | 2 | 23.12 | |
| | | AWQ | 2 | 18.09 | |
| | 63488 | BF16 | 4 | 17.46 | |
| | | GPTQ-Int8 | 2 | 15.30 | |
| | | GPTQ-Int4 | 2 | 13.23 | |
| | | AWQ | 2 | 13.14 | |
| | 129024 | BF16 | 4 | 11.70 | |
| | | GPTQ-Int8 | 4 | 12.94 | |
| | | GPTQ-Int4 | 2 | 8.33 | |
| | | AWQ | 2 | 7.78 | |

  • [Default Setting] = (gpu_memory_utilization=0.9, max_model_len=32768, enforce_eager=False)

  • [Setting 1] = (gpu_memory_utilization=0.98, max_model_len=4096, enforce_eager=True)

  • [Setting 2] = (gpu_memory_utilization=1.0, max_model_len=4096, enforce_eager=True)

  • [Setting 3] = (gpu_memory_utilization=1.0, max_model_len=8192, enforce_eager=True)
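
For orientation, the settings above correspond to arguments of vLLM's LLM constructor. The snippet below is a sketch only, showing how [Setting 1]-style values would be passed; the model path and prompt are placeholders, and this is not the script used to produce the numbers above.

```python
# Sketch: passing the benchmark settings to vLLM's offline LLM API.
# Placeholders: model path and prompt text.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-72B-Instruct",   # or a GPTQ/AWQ checkpoint
    tensor_parallel_size=2,            # the "GPU Num" column
    gpu_memory_utilization=0.98,       # [Setting 1]
    max_model_len=4096,                # [Setting 1]
    enforce_eager=True,                # [Setting 1]
)

outputs = llm.generate(
    ["Give me a short introduction to large language models."],
    SamplingParams(max_tokens=2048),
)
print(outputs[0].outputs[0].text)
```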