Speed Benchmark
=========================

This section reports the speed performance of the BF16 models and the quantized models (GPTQ-Int4, GPTQ-Int8, and AWQ) of the Qwen2 series. Specifically, we report the inference speed (tokens/s) and the memory footprint (GB) at different context lengths.

The environment of the evaluation with Hugging Face Transformers is:

- NVIDIA A100 80GB
- CUDA 11.8
- PyTorch 2.1.2+cu118
- Flash Attention 2.3.3
- Transformers 4.38.2
- AutoGPTQ 0.7.1
- AutoAWQ 0.2.4

The environment of the evaluation with vLLM is:

- NVIDIA A100 80GB
- CUDA 11.8
- PyTorch 2.3.0+cu118
- Flash Attention 2.5.6
- Transformers 4.40.1
- vLLM 0.4.2

Note:

- We use a batch size of 1 and the smallest possible number of GPUs for the evaluation.
- We test the speed and memory of generating 2048 tokens with input lengths of 1, 6144, 14336, 30720, 63488, and 129024 tokens (input lengths over 32k are only available for Qwen2-72B-Instruct and Qwen2-7B-Instruct).
- For vLLM, the memory usage is not reported because it pre-allocates all GPU memory. We use ``gpu_memory_utilization=0.9 max_model_len=32768 enforce_eager=False`` by default.

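The input lengths in the notes above, together with the 2048 generated tokens, add up to round context sizes (8K, 16K, and so on). The following is a minimal sketch of that arithmetic and of one plausible way to turn a timed run into a tokens/s figure. The exact timing procedure used for this benchmark is not described here, so ``measure_speed`` and its ``generate_fn`` argument are hypothetical names used only for illustration.

```python
import time
from typing import Callable

# Input lengths and generation length as stated in the notes above.
INPUT_LENGTHS = [1, 6144, 14336, 30720, 63488, 129024]
NEW_TOKENS = 2048


def total_context(input_len: int, new_tokens: int = NEW_TOKENS) -> int:
    """Prompt tokens plus generated tokens for a single run."""
    return input_len + new_tokens


def measure_speed(generate_fn: Callable[[int, int], int],
                  input_len: int,
                  new_tokens: int = NEW_TOKENS) -> float:
    """Time one generation call and return generated tokens per second.

    ``generate_fn(input_len, new_tokens)`` is a hypothetical callable that
    runs one batch-size-1 generation and returns the number of tokens it
    actually produced. Assumption: speed = generated tokens / wall time.
    """
    start = time.perf_counter()
    produced = generate_fn(input_len, new_tokens)
    elapsed = time.perf_counter() - start
    return produced / elapsed


if __name__ == "__main__":
    for n in INPUT_LENGTHS:
        print(f"input={n:>6}  input+output={total_context(n)}")
```

In practice ``generate_fn`` would wrap a call into Transformers or vLLM; a stub that sleeps and returns ``new_tokens`` is enough to check the helper itself.
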
- 0.5B (Transformer)

+---------------------+--------------+--------------+---------+-----------------+----------------+
| Model               | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) |
+=====================+==============+==============+=========+=================+================+
| Qwen2-0.5B-Instruct | 1            | BF16         | 1       | 49.94           | 1.17           |
+                     +              +--------------+---------+-----------------+----------------+
|                     |              | GPTQ-Int8    | 1       | 36.35           | 0.85           |
+                     +              +--------------+---------+-----------------+----------------+
|                     |              | GPTQ-Int4    | 1       | 49.56           | 0.68           |
+                     +              +--------------+---------+-----------------+----------------+
|                     |              | AWQ          | 1       | 38.78           | 0.68           |
+                     +--------------+--------------+---------+-----------------+----------------+
|                     | 6144         | BF16         | 1       | 50.83           | 6.42           |
+                     +              +--------------+---------+-----------------+----------------+
|                     |              | GPTQ-Int8    | 1       | 36.56           | 6.09           |
+                     +              +--------------+---------+-----------------+----------------+
|                     |              | GPTQ-Int4    | 1       | 49.63           | 5.93           |
+                     +              +--------------+---------+-----------------+----------------+
|                     |              | AWQ          | 1       | 38.73           | 5.92           |
+                     +--------------+--------------+---------+-----------------+----------------+
|                     | 14336        | BF16         | 1       | 49.56           | 13.48          |
+                     +              +--------------+---------+-----------------+----------------+
|                     |              | GPTQ-Int8    | 1       | 36.23           | 13.15          |
+                     +              +--------------+---------+-----------------+----------------+
|                     |              | GPTQ-Int4    | 1       | 48.68           | 12.97          |
+                     +              +--------------+---------+-----------------+----------------+
|                     |              | AWQ          | 1       | 38.94           | 12.99          |
+                     +--------------+--------------+---------+-----------------+----------------+
|                     | 30720        | BF16         | 1       | 49.25           | 27.61          |
+                     +              +--------------+---------+-----------------+----------------+
|                     |              | GPTQ-Int8    | 1       | 34.61           | 27.28          |
+                     +              +--------------+---------+-----------------+----------------+
|                     |              | GPTQ-Int4    | 1       | 48.18           | 27.12          |
+                     +              +--------------+---------+-----------------+----------------+
|                     |              | AWQ          | 1       | 38.19           | 27.11          |
+---------------------+--------------+--------------+---------+-----------------+----------------+

- 0.5B (vLLM)

+---------------------+--------------+--------------+---------+-----------------+
| Model               | Input Length | Quantization | GPU Num | Speed(tokens/s) |
+=====================+==============+==============+=========+=================+
| Qwen2-0.5B-Instruct | 1            | BF16         | 1       | 270.49          |
+                     +              +--------------+---------+-----------------+
|                     |              | GPTQ-Int8    | 1       | 235.95          |
+                     +              +--------------+---------+-----------------+
|                     |              | GPTQ-Int4    | 1       | 240.07          |
+                     +              +--------------+---------+-----------------+
|                     |              | AWQ          | 1       | 233.31          |
+                     +--------------+--------------+---------+-----------------+
|                     | 6144         | BF16         | 1       | 256.16          |
+                     +              +--------------+---------+-----------------+
|                     |              | GPTQ-Int8    | 1       | 224.30          |
+                     +              +--------------+---------+-----------------+
|                     |              | GPTQ-Int4    | 1       | 226.41          |
+                     +              +--------------+---------+-----------------+
|                     |              | AWQ          | 1       | 222.83          |
+                     +--------------+--------------+---------+-----------------+
|                     | 14336        | BF16         | 1       | 108.89          |
+                     +              +--------------+---------+-----------------+
|                     |              | GPTQ-Int8    | 1       | 108.10          |
+                     +              +--------------+---------+-----------------+
|                     |              | GPTQ-Int4    | 1       | 106.51          |
+                     +              +--------------+---------+-----------------+
|                     |              | AWQ          | 1       | 104.16          |
+                     +--------------+--------------+---------+-----------------+
|                     | 30720        | BF16         | 1       | 97.20           |
+                     +              +--------------+---------+-----------------+
|                     |              | GPTQ-Int8    | 1       | 94.49           |
+                     +              +--------------+---------+-----------------+
|                     |              | GPTQ-Int4    | 1       | 93.94           |
+                     +              +--------------+---------+-----------------+
|                     |              | AWQ          | 1       | 92.23           |
+---------------------+--------------+--------------+---------+-----------------+

- 1.5B (Transformer)

+---------------------+--------------+--------------+---------+-----------------+----------------+
| Model               | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) |
+=====================+==============+==============+=========+=================+================+
| Qwen2-1.5B-Instruct | 1            | BF16         | 1       | 40.89           | 3.44           |
+                     +              +--------------+---------+-----------------+----------------+
|                     |              | GPTQ-Int8    | 1       | 31.51           | 2.31           |
+                     +              +--------------+---------+-----------------+----------------+
|                     |              | GPTQ-Int4    | 1       | 42.47           | 1.67           |
+                     +              +--------------+---------+-----------------+----------------+
|                     |              | AWQ          | 1       | 33.62           | 1.64           |
+                     +--------------+--------------+---------+-----------------+----------------+
|                     | 6144         | BF16         | 1       | 40.86           | 8.74           |
+                     +              +--------------+---------+-----------------+----------------+
|                     |              | GPTQ-Int8    | 1       | 31.31           | 7.59           |
+                     +              +--------------+---------+-----------------+----------------+
|                     |              | GPTQ-Int4    | 1       | 42.78           | 6.95           |
+                     +              +--------------+---------+-----------------+----------------+
|                     |              | AWQ          | 1       | 32.90           | 6.92           |
+                     +--------------+--------------+---------+-----------------+----------------+
|                     | 14336        | BF16         | 1       | 40.08           | 15.92          |
+                     +              +--------------+---------+-----------------+----------------+
|                     |              | GPTQ-Int8    | 1       | 31.19           | 14.79          |
+                     +              +--------------+---------+-----------------+----------------+
|                     |              | GPTQ-Int4    | 1       | 42.25           | 14.14          |
+                     +              +--------------+---------+-----------------+----------------+
|                     |              | AWQ          | 1       | 33.24           | 14.12          |
+                     +--------------+--------------+---------+-----------------+----------------+
|                     | 30720        | BF16         | 1       | 34.09           | 30.31          |
+                     +              +--------------+---------+-----------------+----------------+
|                     |              | GPTQ-Int8    | 1       | 28.52           | 29.18          |
+                     +              +--------------+---------+-----------------+----------------+
|                     |              | GPTQ-Int4    | 1       | 31.30           | 28.54          |
+                     +              +--------------+---------+-----------------+----------------+
|                     |              | AWQ          | 1       | 32.16           | 28.51          |
+---------------------+--------------+--------------+---------+-----------------+----------------+

- 1.5B (vLLM)

+---------------------+--------------+--------------+---------+-----------------+
| Model               | Input Length | Quantization | GPU Num | Speed(tokens/s) |
+=====================+==============+==============+=========+=================+
| Qwen2-1.5B-Instruct | 1            | BF16         | 1       | 175.55          |
+                     +              +--------------+---------+-----------------+
|                     |              | GPTQ-Int8    | 1       | 172.28          |
+                     +              +--------------+---------+-----------------+
|                     |              | GPTQ-Int4    | 1       | 184.58          |
+                     +              +--------------+---------+-----------------+
|                     |              | AWQ          | 1       | 170.87          |
+                     +--------------+--------------+---------+-----------------+
|                     | 6144         | BF16         | 1       | 166.23          |
+                     +              +--------------+---------+-----------------+
|                     |              | GPTQ-Int8    | 1       | 164.32          |
+                     +              +--------------+---------+-----------------+
|                     |              | GPTQ-Int4    | 1       | 174.04          |
+                     +              +--------------+---------+-----------------+
|                     |              | AWQ          | 1       | 162.81          |
+                     +--------------+--------------+---------+-----------------+
|                     | 14336        | BF16         | 1       | 83.67           |
+                     +              +--------------+---------+-----------------+
|                     |              | GPTQ-Int8    | 1       | 98.63           |
+                     +              +--------------+---------+-----------------+
|                     |              | GPTQ-Int4    | 1       | 97.65           |
+                     +              +--------------+---------+-----------------+
|                     |              | AWQ          | 1       | 92.48           |
+                     +--------------+--------------+---------+-----------------+
|                     | 30720        | BF16         | 1       | 77.69           |
+                     +              +--------------+---------+-----------------+
|                     |              | GPTQ-Int8    | 1       | 86.42           |
+                     +              +--------------+---------+-----------------+
|                     |              | GPTQ-Int4    | 1       | 87.49           |
+                     +              +--------------+---------+-----------------+
|                     |              | AWQ          | 1       | 82.88           |
+---------------------+--------------+--------------+---------+-----------------+

- 7B (Transformer)

+-------------------+--------------+--------------+---------+-----------------+----------------+
| Model             | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) |
+===================+==============+==============+=========+=================+================+
| Qwen2-7B-Instruct | 1            | BF16         | 1       | 37.97           | 14.92          |
+                   +              +--------------+---------+-----------------+----------------+
|                   |              | GPTQ-Int8    | 1       | 30.85           | 8.97           |
+                   +              +--------------+---------+-----------------+----------------+
|                   |              | GPTQ-Int4    | 1       | 36.17           | 6.06           |
+                   +              +--------------+---------+-----------------+----------------+
|                   |              | AWQ          | 1       | 33.08           | 5.93           |
+                   +--------------+--------------+---------+-----------------+----------------+
|                   | 6144         | BF16         | 1       | 34.74           | 20.26          |
+                   +              +--------------+---------+-----------------+----------------+
|                   |              | GPTQ-Int8    | 1       | 31.13           | 14.31          |
+                   +              +--------------+---------+-----------------+----------------+
|                   |              | GPTQ-Int4    | 1       | 33.34           | 11.40          |
+                   +              +--------------+---------+-----------------+----------------+
|                   |              | AWQ          | 1       | 30.86           | 11.27          |
+                   +--------------+--------------+---------+-----------------+----------------+
|                   | 14336        | BF16         | 1       | 26.63           | 27.71          |
+                   +              +--------------+---------+-----------------+----------------+
|                   |              | GPTQ-Int8    | 1       | 24.58           | 21.76          |
+                   +              +--------------+---------+-----------------+----------------+
|                   |              | GPTQ-Int4    | 1       | 25.81           | 18.86          |
+                   +              +--------------+---------+-----------------+----------------+
|                   |              | AWQ          | 1       | 27.61           | 18.72          |
+                   +--------------+--------------+---------+-----------------+----------------+
|                   | 30720        | BF16         | 1       | 17.49           | 42.62          |
+                   +              +--------------+---------+-----------------+----------------+
|                   |              | GPTQ-Int8    | 1       | 16.69           | 36.67          |
+                   +              +--------------+---------+-----------------+----------------+
|                   |              | GPTQ-Int4    | 1       | 17.17           | 33.76          |
+                   +              +--------------+---------+-----------------+----------------+
|                   |              | AWQ          | 1       | 17.87           | 33.63          |
+-------------------+--------------+--------------+---------+-----------------+----------------+

- 7B (vLLM)

+-------------------+--------------+--------------+---------+-----------------+
| Model             | Input Length | Quantization | GPU Num | Speed(tokens/s) |
+===================+==============+==============+=========+=================+
| Qwen2-7B-Instruct | 1            | BF16         | 1       | 80.45           |
+                   +              +--------------+---------+-----------------+
|                   |              | GPTQ-Int8    | 1       | 114.32          |
+                   +              +--------------+---------+-----------------+
|                   |              | GPTQ-Int4    | 1       | 143.40          |
+                   +              +--------------+---------+-----------------+
|                   |              | AWQ          | 1       | 96.65           |
+                   +--------------+--------------+---------+-----------------+
|                   | 6144         | BF16         | 1       | 76.41           |
+                   +              +--------------+---------+-----------------+
|                   |              | GPTQ-Int8    | 1       | 107.02          |
+                   +              +--------------+---------+-----------------+
|                   |              | GPTQ-Int4    | 1       | 131.55          |
+                   +              +--------------+---------+-----------------+
|                   |              | AWQ          | 1       | 91.38           |
+                   +--------------+--------------+---------+-----------------+
|                   | 14336        | BF16         | 1       | 66.54           |
+                   +              +--------------+---------+-----------------+
|                   |              | GPTQ-Int8    | 1       | 89.72           |
+                   +              +--------------+---------+-----------------+
|                   |              | GPTQ-Int4    | 1       | 97.93           |
+                   +              +--------------+---------+-----------------+
|                   |              | AWQ          | 1       | 76.87           |
+                   +--------------+--------------+---------+-----------------+
|                   | 30720        | BF16         | 1       | 55.83           |
+                   +              +--------------+---------+-----------------+
|                   |              | GPTQ-Int8    | 1       | 71.58           |
+                   +              +--------------+---------+-----------------+
|                   |              | GPTQ-Int4    | 1       | 81.48           |
+                   +              +--------------+---------+-----------------+
|                   |              | AWQ          | 1       | 63.62           |
+                   +--------------+--------------+---------+-----------------+
|                   | 63488        | BF16         | 1       | 41.20           |
+                   +              +--------------+---------+-----------------+
|                   |              | GPTQ-Int8    | 1       | 49.37           |
+                   +              +--------------+---------+-----------------+
|                   |              | GPTQ-Int4    | 1       | 54.12           |
+                   +              +--------------+---------+-----------------+
|                   |              | AWQ          | 1       | 45.89           |
+                   +--------------+--------------+---------+-----------------+
|                   | 129024       | BF16         | 1       | 25.01           |
+                   +              +--------------+---------+-----------------+
|                   |              | GPTQ-Int8    | 1       | 27.73           |
+                   +              +--------------+---------+-----------------+
|                   |              | GPTQ-Int4    | 1       | 29.39           |
+                   +              +--------------+---------+-----------------+
|                   |              | AWQ          | 1       | 27.13           |
+-------------------+--------------+--------------+---------+-----------------+

- 57B-A14B (Transformer)

+--------------------------+--------------+--------------+---------+-----------------+----------------+
| Model                    | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) |
+==========================+==============+==============+=========+=================+================+
| Qwen2-57B-A14B-Instruct  | 1            | BF16         | 2       | 4.76            | 110.29         |
+                          +              +--------------+---------+-----------------+----------------+
|                          |              | GPTQ-Int4    | 1       | 5.55            | 30.38          |
+                          +--------------+--------------+---------+-----------------+----------------+
|                          | 6144         | BF16         | 2       | 4.90            | 117.80         |
+                          +              +--------------+---------+-----------------+----------------+
|                          |              | GPTQ-Int4    | 1       | 5.44            | 35.67          |
+                          +--------------+--------------+---------+-----------------+----------------+
|                          | 14336        | BF16         | 2       | 4.58            | 128.17         |
+                          +              +--------------+---------+-----------------+----------------+
|                          |              | GPTQ-Int4    | 1       | 5.31            | 43.11          |
+                          +--------------+--------------+---------+-----------------+----------------+
|                          | 30720        | BF16         | 2       | 4.12            | 163.77         |
+                          +              +--------------+---------+-----------------+----------------+
|                          |              | GPTQ-Int4    | 1       | 4.72            | 58.01          |
+--------------------------+--------------+--------------+---------+-----------------+----------------+

- 57B-A14B (vLLM)

+--------------------------+--------------+--------------+---------+-----------------+
| Model                    | Input Length | Quantization | GPU Num | Speed(tokens/s) |
+==========================+==============+==============+=========+=================+
| Qwen2-57B-A14B-Instruct  | 1            | BF16         | 2       | 31.44           |
+                          +--------------+--------------+---------+-----------------+
|                          | 6144         | BF16         | 2       | 31.77           |
+                          +--------------+--------------+---------+-----------------+
|                          | 14336        | BF16         | 2       | 21.25           |
+                          +--------------+--------------+---------+-----------------+
|                          | 30720        | BF16         | 2       | 20.24           |
+--------------------------+--------------+--------------+---------+-----------------+

Note: Compared with dense models, MoE models have higher throughput when the batch size is large, as shown in the following table:

+--------------------------+--------------+-------------+------+----------+
| Model                    | Quantization | # Prompts   | QPS  | Tokens/s |
+==========================+==============+=============+======+==========+
| Qwen1.5-32B-Chat         | BF16         | 100         | 6.68 | 7343.56  |
+--------------------------+--------------+-------------+------+----------+
| Qwen2-57B-A14B-Instruct  | BF16         | 100         | 4.81 | 5291.15  |
+--------------------------+--------------+-------------+------+----------+
| Qwen1.5-32B-Chat         | BF16         | 1000        | 7.99 | 8791.35  |
+--------------------------+--------------+-------------+------+----------+
| Qwen2-57B-A14B-Instruct  | BF16         | 1000        | 5.18 | 5698.37  |
+--------------------------+--------------+-------------+------+----------+

These results were obtained with the vLLM throughput benchmarking script and can be reproduced with:

``python vllm/benchmarks/benchmark_throughput.py --input-len 1000 --output-len 100 --model --num-prompts --enforce-eager -tp 2``

- 72B (Transformer)

+--------------------+--------------+--------------+---------+-----------------+----------------+
| Model              | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) |
+====================+==============+==============+=========+=================+================+
| Qwen2-72B-Instruct | 1            | BF16         | 2       | 7.45            | 134.74         |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | GPTQ-Int8    | 2       | 7.30            | 71.00          |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | GPTQ-Int4    | 1       | 9.05            | 41.80          |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | AWQ          | 1       | 9.96            | 41.31          |
+                    +--------------+--------------+---------+-----------------+----------------+
|                    | 6144         | BF16         | 2       | 5.99            | 144.38         |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | GPTQ-Int8    | 2       | 5.93            | 80.60          |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | GPTQ-Int4    | 1       | 6.79            | 47.90          |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | AWQ          | 1       | 7.49            | 47.42          |
+                    +--------------+--------------+---------+-----------------+----------------+
|                    | 14336        | BF16         | 3       | 4.12            | 169.93         |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | GPTQ-Int8    | 2       | 4.43            | 95.14          |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | GPTQ-Int4    | 1       | 4.87            | 57.79          |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | AWQ          | 1       | 5.23            | 57.30          |
+                    +--------------+--------------+---------+-----------------+----------------+
|                    | 30720        | BF16         | 3       | 2.86            | 209.03         |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | GPTQ-Int8    | 2       | 2.83            | 124.20         |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | GPTQ-Int4    | 2       | 3.02            | 107.94         |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | AWQ          | 2       | 1.85            | 88.60          |
+--------------------+--------------+--------------+---------+-----------------+----------------+

- 72B (vLLM)

+--------------------+--------------+--------------+---------+-----------------+----------------+
| Model              | Input Length | Quantization | GPU Num | Speed(tokens/s) | Setting        |
+====================+==============+==============+=========+=================+================+
| Qwen2-72B-Instruct | 1            | BF16         | 2       | 17.68           | [Setting 1]    |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | BF16         | 4       | 30.01           | -              |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | GPTQ-Int8    | 2       | 27.56           | -              |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | GPTQ-Int4    | 1       | 29.60           | [Setting 2]    |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | GPTQ-Int4    | 2       | 42.82           | -              |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | AWQ          | 2       | 27.73           | -              |
+                    +--------------+--------------+---------+-----------------+----------------+
|                    | 6144         | BF16         | 4       | 27.98           | -              |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | GPTQ-Int8    | 2       | 25.46           | -              |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | GPTQ-Int4    | 1       | 25.16           | [Setting 3]    |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | GPTQ-Int4    | 2       | 38.23           | -              |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | AWQ          | 2       | 25.77           | -              |
+                    +--------------+--------------+---------+-----------------+----------------+
|                    | 14336        | BF16         | 4       | 21.81           | -              |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | GPTQ-Int8    | 2       | 22.71           | -              |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | GPTQ-Int4    | 2       | 26.54           | -              |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | AWQ          | 2       | 21.50           | -              |
+                    +--------------+--------------+---------+-----------------+----------------+
|                    | 30720        | BF16         | 4       | 19.43           | -              |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | GPTQ-Int8    | 2       | 18.69           | -              |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | GPTQ-Int4    | 2       | 23.12           | -              |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | AWQ          | 2       | 18.09           | -              |
+                    +--------------+--------------+---------+-----------------+----------------+
|                    | 63488        | BF16         | 4       | 17.46           | -              |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | GPTQ-Int8    | 2       | 15.30           | -              |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | GPTQ-Int4    | 2       | 13.23           | -              |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | AWQ          | 2       | 13.14           | -              |
+                    +--------------+--------------+---------+-----------------+----------------+
|                    | 129024       | BF16         | 4       | 11.70           | -              |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | GPTQ-Int8    | 4       | 12.94           | -              |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | GPTQ-Int4    | 2       | 8.33            | -              |
+                    +              +--------------+---------+-----------------+----------------+
|                    |              | AWQ          | 2       | 7.78            | -              |
+--------------------+--------------+--------------+---------+-----------------+----------------+

* [Default Setting]=(gpu_memory_utilization=0.9 max_model_len=32768 enforce_eager=False)
* [Setting 1]=(gpu_memory_utilization=0.98 max_model_len=4096 enforce_eager=True)
* [Setting 2]=(gpu_memory_utilization=1.0 max_model_len=4096 enforce_eager=True)
* [Setting 3]=(gpu_memory_utilization=1.0 max_model_len=8192 enforce_eager=True)
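
The named settings listed above map directly onto keyword arguments of the vLLM engine. The following is a minimal sketch that stores them in a dictionary and builds the argument set for one run. ``VLLM_SETTINGS``, ``engine_kwargs``, and ``build_engine`` are hypothetical names introduced for illustration, and mapping the "GPU Num" column to ``tensor_parallel_size`` is an assumption rather than something this page states.

```python
from typing import Any, Dict

# Values copied from the setting list above; the dictionary keys are
# illustrative labels, not identifiers defined by vLLM.
VLLM_SETTINGS: Dict[str, Dict[str, Any]] = {
    "default": {"gpu_memory_utilization": 0.9, "max_model_len": 32768, "enforce_eager": False},
    "setting_1": {"gpu_memory_utilization": 0.98, "max_model_len": 4096, "enforce_eager": True},
    "setting_2": {"gpu_memory_utilization": 1.0, "max_model_len": 4096, "enforce_eager": True},
    "setting_3": {"gpu_memory_utilization": 1.0, "max_model_len": 8192, "enforce_eager": True},
}


def engine_kwargs(model: str, setting: str = "default", num_gpus: int = 1) -> Dict[str, Any]:
    """Return keyword arguments for one benchmark configuration.

    Assumption: the number of GPUs is passed as tensor_parallel_size.
    """
    kwargs = dict(VLLM_SETTINGS[setting])  # copy so the shared table is not mutated
    kwargs["model"] = model
    kwargs["tensor_parallel_size"] = num_gpus
    return kwargs


def build_engine(model: str, setting: str = "default", num_gpus: int = 1):
    """Create a vLLM engine; vLLM is imported lazily so the helpers above
    can be used on a machine without vLLM or a GPU."""
    from vllm import LLM
    return LLM(**engine_kwargs(model, setting, num_gpus))


if __name__ == "__main__":
    # The model path below is a placeholder, not a value taken from this page.
    print(engine_kwargs("path/to/model", "setting_3", num_gpus=1))
```

Keeping the settings in one table makes it easy to see which rows of the 72B (vLLM) results deviate from the default configuration.
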