Qwen2.5 Speed Benchmark
=======================

This section reports the speed performance of the Qwen2.5 series, covering the BF16 models and the quantized models (GPTQ-Int4, GPTQ-Int8, and AWQ). Specifically, we report the inference speed (tokens/s) and the memory footprint (GB) under different context lengths.

The environment for the evaluation with Hugging Face Transformers is:

- NVIDIA A100 80GB
- CUDA 12.1
- PyTorch 2.3.1
- Flash Attention 2.5.8
- Transformers 4.46.0
- AutoGPTQ 0.7.1+cu121 (compiled from source)
- AutoAWQ 0.2.6

The environment for the evaluation with vLLM is:

- NVIDIA A100 80GB
- CUDA 12.1
- vLLM 0.6.3
- PyTorch 2.4.0
- Flash Attention 2.6.3
- Transformers 4.46.0

Notes:

- We use a batch size of 1 and as few GPUs as possible for the evaluation.
- We test the speed and memory of generating 2048 tokens with input lengths of 1, 6144, 14336, 30720, 63488, and 129024 tokens.
- For vLLM, memory usage is not reported because it pre-allocates GPU memory. We use ``gpu_memory_utilization=0.9 max_model_len=32768 enforce_eager=False`` by default.
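For the Transformers runs, the measurement procedure can be sketched roughly as follows. This is a minimal illustration, not the released benchmark script; ``benchmark_generate`` and ``summarize`` are hypothetical names, and the speed here is simply generated tokens divided by wall-clock generation time, with peak memory taken from the CUDA allocator.

```python
import time


def summarize(new_tokens: int, elapsed_s: float, peak_bytes: int):
    """Turn raw measurements into the reported metrics:
    speed in tokens/s and peak memory in GB."""
    return new_tokens / elapsed_s, peak_bytes / 1024 ** 3


def benchmark_generate(model, prompt_ids, max_new_tokens=2048):
    """Generate with batch size 1 and record decoding speed and peak GPU memory.

    `model` is a Hugging Face causal LM already placed on GPU;
    `prompt_ids` is a (1, input_length) tensor of token ids.
    """
    import torch  # imported here so `summarize` stays usable without a GPU stack

    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    output = model.generate(prompt_ids, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    new_tokens = output.shape[1] - prompt_ids.shape[1]
    return summarize(new_tokens, elapsed, torch.cuda.max_memory_allocated())
```

Generating 2048 tokens in 20 s with a 1 GiB peak allocation would report as 102.4 tokens/s and 1.0 GB.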
- 0.5B (Transformer)

+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
| Model                 | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | Note                    |
+=======================+==============+==============+=========+=================+================+=========================+
| Qwen2.5-0.5B-Instruct | 1            | BF16         | 1       | 47.40           | 0.97           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 1       | 35.17           | 0.64           | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 50.60           | 0.48           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 37.09           | 0.68           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       | 6144         | BF16         | 1       | 47.45           | 1.23           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 1       | 36.47           | 0.90           | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 48.89           | 0.73           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 37.04           | 0.72           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       | 14336        | BF16         | 1       | 47.11           | 1.60           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 1       | 35.44           | 1.26           | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 48.26           | 1.10           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 37.14           | 1.10           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       | 30720        | BF16         | 1       | 47.16           | 2.34           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 1       | 36.25           | 2.01           | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 49.22           | 1.85           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 36.90           | 1.84           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+

- 0.5B (vLLM)

+-----------------------+--------------+--------------+---------+-----------------+
| Model                 | Input Length | Quantization | GPU Num | Speed(tokens/s) |
+=======================+==============+==============+=========+=================+
| Qwen2.5-0.5B-Instruct | 1            | BF16         | 1       | 311.55          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | GPTQ-Int8    | 1       | 257.07          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | GPTQ-Int4    | 1       | 260.93          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | AWQ          | 1       | 261.95          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       | 6144         | BF16         | 1       | 304.79          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | GPTQ-Int8    | 1       | 254.10          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | GPTQ-Int4    | 1       | 257.33          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | AWQ          | 1       | 259.80          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       | 14336        | BF16         | 1       | 290.28          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | GPTQ-Int8    | 1       | 243.69          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | GPTQ-Int4    | 1       | 247.01          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | AWQ          | 1       | 249.58          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       | 30720        | BF16         | 1       | 264.51          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | GPTQ-Int8    | 1       | 223.86          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | GPTQ-Int4    | 1       | 226.50          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | AWQ          | 1       | 229.84          |
+-----------------------+--------------+--------------+---------+-----------------+

- 1.5B (Transformer)

+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
| Model                 | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | Note                    |
+=======================+==============+==============+=========+=================+================+=========================+
| Qwen2.5-1.5B-Instruct | 1            | BF16         | 1       | 39.68           | 2.95           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 1       | 32.62           | 1.82           | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 43.33           | 1.18           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 31.70           | 1.51           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       | 6144         | BF16         | 1       | 40.88           | 3.43           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 1       | 31.46           | 2.30           | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 43.96           | 1.66           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 32.30           | 1.63           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       | 14336        | BF16         | 1       | 40.43           | 4.16           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 1       | 31.06           | 3.03           | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 43.66           | 2.39           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 32.39           | 2.36           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       | 30720        | BF16         | 1       | 38.59           | 5.62           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 1       | 31.04           | 4.49           | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 35.68           | 3.85           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 31.95           | 3.82           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+

- 1.5B (vLLM)

+-----------------------+--------------+--------------+---------+-----------------+
| Model                 | Input Length | Quantization | GPU Num | Speed(tokens/s) |
+=======================+==============+==============+=========+=================+
| Qwen2.5-1.5B-Instruct | 1            | BF16         | 1       | 183.33          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | GPTQ-Int8    | 1       | 201.67          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | GPTQ-Int4    | 1       | 217.03          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | AWQ          | 1       | 213.74          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       | 6144         | BF16         | 1       | 176.68          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | GPTQ-Int8    | 1       | 192.83          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | GPTQ-Int4    | 1       | 206.63          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | AWQ          | 1       | 203.64          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       | 14336        | BF16         | 1       | 168.69          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | GPTQ-Int8    | 1       | 183.69          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | GPTQ-Int4    | 1       | 195.88          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | AWQ          | 1       | 192.64          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       | 30720        | BF16         | 1       | 152.04          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | GPTQ-Int8    | 1       | 162.82          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | GPTQ-Int4    | 1       | 173.57          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | AWQ          | 1       | 170.20          |
+-----------------------+--------------+--------------+---------+-----------------+

- 3B (Transformer)

+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
| Model                 | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | Note                    |
+=======================+==============+==============+=========+=================+================+=========================+
| Qwen2.5-3B-Instruct   | 1            | BF16         | 1       | 30.80           | 5.95           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 1       | 25.69           | 3.38           | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 35.21           | 2.06           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 25.29           | 2.50           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       | 6144         | BF16         | 1       | 32.20           | 6.59           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 1       | 24.69           | 3.98           | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 34.47           | 2.67           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 24.86           | 2.62           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       | 14336        | BF16         | 1       | 31.72           | 7.47           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 1       | 24.70           | 4.89           | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 34.36           | 3.58           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 25.19           | 3.54           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       | 30720        | BF16         | 1       | 25.37           | 9.30           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 1       | 21.67           | 6.72           | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 23.60           | 5.41           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 24.56           | 5.37           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+

- 3B (vLLM)

+-----------------------+--------------+--------------+---------+-----------------+
| Model                 | Input Length | Quantization | GPU Num | Speed(tokens/s) |
+=======================+==============+==============+=========+=================+
| Qwen2.5-3B-Instruct   | 1            | BF16         | 1       | 127.61          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | GPTQ-Int8    | 1       | 150.02          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | GPTQ-Int4    | 1       | 168.20          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | AWQ          | 1       | 165.50          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       | 6144         | BF16         | 1       | 123.15          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | GPTQ-Int8    | 1       | 143.09          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | GPTQ-Int4    | 1       | 159.85          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | AWQ          | 1       | 156.38          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       | 14336        | BF16         | 1       | 117.35          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | GPTQ-Int8    | 1       | 135.50          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | GPTQ-Int4    | 1       | 149.35          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | AWQ          | 1       | 147.75          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       | 30720        | BF16         | 1       | 105.88          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | GPTQ-Int8    | 1       | 118.38          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | GPTQ-Int4    | 1       | 129.28          |
+-----------------------+--------------+--------------+---------+-----------------+
|                       |              | AWQ          | 1       | 127.19          |
+-----------------------+--------------+--------------+---------+-----------------+

- 7B (Transformer)

+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
| Model                 | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | Note                    |
+=======================+==============+==============+=========+=================+================+=========================+
| Qwen2.5-7B-Instruct   | 1            | BF16         | 1       | 40.38           | 14.38          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 1       | 31.55           | 8.42           | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 43.10           | 5.52           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 32.03           | 5.39           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       | 6144         | BF16         | 1       | 38.76           | 15.38          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 1       | 31.26           | 9.43           | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 38.27           | 6.52           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 32.37           | 6.39           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       | 14336        | BF16         | 1       | 29.78           | 16.91          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 1       | 26.86           | 10.96          | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 28.70           | 8.05           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 30.23           | 7.92           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       | 30720        | BF16         | 1       | 18.83           | 19.97          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 1       | 17.59           | 14.01          | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 18.45           | 11.11          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 19.11           | 10.98          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+

- 7B (vLLM)

+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
| Model                 | Input Length | Quantization | GPU Num | Speed(tokens/s) | Note                           |
+=======================+==============+==============+=========+=================+================================+
| Qwen2.5-7B-Instruct   | 1            | BF16         | 1       | 84.28           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int8    | 1       | 122.01          |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int4    | 1       | 154.05          |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | AWQ          | 1       | 148.10          |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       | 6144         | BF16         | 1       | 80.70           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int8    | 1       | 112.38          |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int4    | 1       | 141.98          |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | AWQ          | 1       | 137.64          |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       | 14336        | BF16         | 1       | 77.69           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int8    | 1       | 105.25          |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int4    | 1       | 129.35          |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | AWQ          | 1       | 124.91          |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       | 30720        | BF16         | 1       | 70.33           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int8    | 1       | 90.71           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int4    | 1       | 108.30          |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | AWQ          | 1       | 104.66          |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       | 63488        | BF16         | 1       | 50.86           | setting-64k                    |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int8    | 1       | 60.52           | setting-64k                    |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int4    | 1       | 67.97           | setting-64k                    |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | AWQ          | 1       | 66.42           | setting-64k                    |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       | 129024       | BF16         | 1       | 28.94           | vllm==0.6.2, new sample config |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int8    | 1       | 25.97           | vllm==0.6.2, new sample config |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int4    | 1       | 26.37           | vllm==0.6.2, new sample config |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | AWQ          | 1       | 26.57           | vllm==0.6.2, new sample config |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+

* [setting-64k]: ``gpu_memory_utilization=0.9 max_model_len=65536 enforce_eager=False``
* [new sample config]: for vLLM, set the sampling parameters
  ``SamplingParams(temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1, presence_penalty=0, frequency_penalty=0, max_tokens=out_length)``

- 14B (Transformer)

+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
| Model                 | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | Note                    |
+=======================+==============+==============+=========+=================+================+=========================+
| Qwen2.5-14B-Instruct  | 1            | BF16         | 1       | 24.74           | 28.08          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 1       | 18.84           | 16.11          | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 25.89           | 9.94           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 19.23           | 9.79           |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       | 6144         | BF16         | 1       | 20.51           | 29.50          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 1       | 17.80           | 17.61          | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 20.06           | 11.36          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 19.21           | 11.22          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       | 14336        | BF16         | 1       | 13.92           | 31.95          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 1       | 12.66           | 19.98          | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 13.79           | 13.81          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 14.17           | 13.67          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       | 30720        | BF16         | 1       | 8.20            | 36.85          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 1       | 7.77            | 24.88          | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 8.14            | 18.71          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 8.31            | 18.57          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+

- 14B (vLLM)

+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
| Model                 | Input Length | Quantization | GPU Num | Speed(tokens/s) | Note                           |
+=======================+==============+==============+=========+=================+================================+
| Qwen2.5-14B-Instruct  | 1            | BF16         | 1       | 46.30           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int8    | 1       | 70.40           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int4    | 1       | 98.02           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | AWQ          | 1       | 92.66           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       | 6144         | BF16         | 1       | 43.83           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int8    | 1       | 64.33           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int4    | 1       | 86.10           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | AWQ          | 1       | 83.11           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       | 14336        | BF16         | 1       | 41.91           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int8    | 1       | 59.21           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int4    | 1       | 76.85           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | AWQ          | 1       | 74.03           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       | 30720        | BF16         | 1       | 37.18           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int8    | 1       | 49.23           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int4    | 1       | 60.91           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | AWQ          | 1       | 59.01           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       | 63488        | BF16         | 1       | 26.85           | setting-64k                    |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int8    | 1       | 32.83           | setting-64k                    |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int4    | 1       | 37.67           | setting-64k                    |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | AWQ          | 1       | 36.71           | setting-64k                    |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       | 129024       | BF16         | 1       | 14.53           | vllm==0.6.2, new sample config |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int8    | 1       | 15.10           | vllm==0.6.2, new sample config |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int4    | 1       | 15.13           | vllm==0.6.2, new sample config |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | AWQ          | 1       | 15.25           | vllm==0.6.2, new sample config |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+

* [setting-64k]: ``gpu_memory_utilization=0.9 max_model_len=65536 enforce_eager=False``
* [new sample config]: for vLLM, set the sampling parameters
  ``SamplingParams(temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1, presence_penalty=0, frequency_penalty=0, max_tokens=out_length)``

- 32B (Transformer)

+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
| Model                 | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | Note                    |
+=======================+==============+==============+=========+=================+================+=========================+
| Qwen2.5-32B-Instruct  | 1            | BF16         | 1       | 17.54           | 61.58          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 1       | 14.52           | 33.56          | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 19.20           | 18.94          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 14.60           | 18.67          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       | 6144         | BF16         | 1       | 12.49           | 63.72          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 1       | 11.61           | 35.86          | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 13.42           | 21.09          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 13.81           | 20.81          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       | 14336        | BF16         | 1       | 8.95            | 67.31          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 1       | 8.53            | 39.28          | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 9.48            | 24.67          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 9.71            | 24.39          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       | 30720        | BF16         | 1       | 5.59            | 74.47          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 1       | 5.42            | 46.45          | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 5.79            | 31.84          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 5.85            | 31.56          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+

- 32B (vLLM)

+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
| Model                 | Input Length | Quantization | GPU Num | Speed(tokens/s) | Note                           |
+=======================+==============+==============+=========+=================+================================+
| Qwen2.5-32B-Instruct  | 1            | BF16         | 1       | 22.13           | setting1                       |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int8    | 1       | 37.57           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int4    | 1       | 55.83           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | AWQ          | 1       | 51.92           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       | 6144         | BF16         | 1       | 21.05           | setting1                       |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int8    | 1       | 34.67           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int4    | 1       | 49.96           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | AWQ          | 1       | 46.68           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       | 14336        | BF16         | 1       | 19.91           | setting1                       |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int8    | 1       | 31.89           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int4    | 1       | 44.79           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | AWQ          | 1       | 41.83           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       | 30720        | BF16         | 2       | 31.82           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int8    | 1       | 26.88           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int4    | 1       | 35.66           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | AWQ          | 1       | 33.75           |                                |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       | 63488        | BF16         | 2       | 24.45           | setting-64k                    |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int8    | 1       | 18.60           | setting-64k                    |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int4    | 1       | 22.72           | setting-64k                    |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | AWQ          | 1       | 21.79           | setting-64k                    |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       | 129024       | BF16         | 2       | 14.31           | vllm==0.6.2, new sample config |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int8    | 1       | 9.77            | vllm==0.6.2, new sample config |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | GPTQ-Int4    | 1       | 10.39           | vllm==0.6.2, new sample config |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+
|                       |              | AWQ          | 1       | 10.34           | vllm==0.6.2, new sample config |
+-----------------------+--------------+--------------+---------+-----------------+--------------------------------+

* For context length 129024, the model config needs ``"model_max_length": 131072``.
* [default setting]: ``gpu_memory_utilization=0.9 max_model_len=32768 enforce_eager=False``
* [setting1]: ``gpu_memory_utilization=1.0 max_model_len=32768 enforce_eager=True``
* [setting-64k]: ``gpu_memory_utilization=0.9 max_model_len=65536 enforce_eager=False``
* [new sample config]: for vLLM, set the sampling parameters
  ``SamplingParams(temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1, presence_penalty=0, frequency_penalty=0, max_tokens=out_length)``

- 72B (Transformer)

+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
| Model                 | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | Note                    |
+=======================+==============+==============+=========+=================+================+=========================+
| Qwen2.5-72B-Instruct  | 1            | BF16         | 2       | 8.73            | 136.20         |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 2       | 8.66            | 72.61          | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 11.07           | 39.91          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 11.50           | 39.44          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       | 6144         | BF16         | 2       | 6.39            | 140.00         |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 2       | 6.39            | 77.81          | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 7.56            | 42.50          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 8.17            | 42.13          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       | 14336        | BF16         | 3       | 4.25            | 149.14         |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int8    | 2       | 4.66            | 82.55          | auto_gptq==0.6.0+cu1210 |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | GPTQ-Int4    | 1       | 5.27            | 46.86          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       |              | AWQ          | 1       | 5.57            | 46.38          |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
|                       | 30720        | BF16         | 3       | 2.94            | 164.79         |                         |
+-----------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
GPTQ-Int8 | 2 | 2.94 | 94.75 | auto_gptq==0.6.0+cu1210 | + + +--------------+---------+-----------------+----------------+-------------------------------------------+ | | | GPTQ-Int4 | 2 | 3.14 | 62.57 | | + + +--------------+---------+-----------------+----------------+-------------------------------------------+ | | | AWQ | 2 | 3.23 | 61.64 | | +-----------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ - 72B (vLLM) +------------------------------+--------------+--------------+---------+-----------------+-------------------------------------------+ | Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | Note | +==============================+==============+==============+=========+=================+===========================================+ | Qwen2.5-72B-Instruct | 1 | BF16 | 2 | 18.19 | Setting 1 | + +--------------+--------------+---------+-----------------+-------------------------------------------+ | | | BF16 | 4 | 31.37 | Default | + +--------------+--------------+---------+-----------------+-------------------------------------------+ | | | GPTQ-Int8 | 2 | 31.40 | Default | + +--------------+--------------+---------+-----------------+-------------------------------------------+ | | | GPTQ-Int4 | 1 | 16.47 | Default | + +--------------+--------------+---------+-----------------+-------------------------------------------+ | | | GPTQ-Int4 | 2 | 46.30 | Setting 2 | + +--------------+--------------+---------+-----------------+-------------------------------------------+ | | | AWQ | 2 | 44.30 | Default | + +--------------+--------------+---------+-----------------+-------------------------------------------+ | | 6144 | BF16 | 4 | 29.90 | Default | + +--------------+--------------+---------+-----------------+-------------------------------------------+ | | | GPTQ-Int8 | 2 | 29.37 | Default | + 
+--------------+--------------+---------+-----------------+-------------------------------------------+ | | | GPTQ-Int4 | 1 | 13.88 | Default | + +--------------+--------------+---------+-----------------+-------------------------------------------+ | | | GPTQ-Int4 | 2 | 42.50 | Setting 3 | + +--------------+--------------+---------+-----------------+-------------------------------------------+ | | | AWQ | 2 | 40.67 | Default | + +--------------+--------------+---------+-----------------+-------------------------------------------+ | | 14336 | BF16 | 4 | 30.10 | Default | + +--------------+--------------+---------+-----------------+-------------------------------------------+ | | | GPTQ-Int8 | 2 | 27.20 | Default | + +--------------+--------------+---------+-----------------+-------------------------------------------+ | | | GPTQ-Int4 | 2 | 38.10 | Default | + +--------------+--------------+---------+-----------------+-------------------------------------------+ | | | AWQ | 2 | 36.63 | Default | + +--------------+--------------+---------+-----------------+-------------------------------------------+ | | 30720 | BF16 | 4 | 27.53 | Default | + +--------------+--------------+---------+-----------------+-------------------------------------------+ | | | GPTQ-Int8 | 2 | 23.32 | Default | + +--------------+--------------+---------+-----------------+-------------------------------------------+ | | | GPTQ-Int4 | 2 | 30.98 | Default | + +--------------+--------------+---------+-----------------+-------------------------------------------+ | | | AWQ | 2 | 30.02 | Default | + +--------------+--------------+---------+-----------------+-------------------------------------------+ | | 63488 | BF16 | 4 | 20.74 | Setting 4 | + +--------------+--------------+---------+-----------------+-------------------------------------------+ | | | GPTQ-Int8 | 2 | 16.27 | Setting 4 | + +--------------+--------------+---------+-----------------+-------------------------------------------+ | | | 
GPTQ-Int4 | 2 | 19.84 | Setting 4 | + +--------------+--------------+---------+-----------------+-------------------------------------------+ | | | AWQ | 2 | 19.32 | Setting 4 | + +--------------+--------------+---------+-----------------+-------------------------------------------+ | | 129024 | BF16 | 4 | 12.68 | Setting 5 | + +--------------+--------------+---------+-----------------+-------------------------------------------+ | | | GPTQ-Int8 | 4 | 14.11 | Setting 5 | + +--------------+--------------+---------+-----------------+-------------------------------------------+ | | | GPTQ-Int4 | 2 | 10.11 | Setting 5 | + +--------------+--------------+---------+-----------------+-------------------------------------------+ | | | AWQ | 2 | 9.88 | Setting 5 | +------------------------------+--------------+--------------+---------+-----------------+-------------------------------------------+ * [Default Setting]=(gpu_memory_utilization=0.9 max_model_len=32768 enforce_eager=False) * [Setting 1]=(gpu_memory_utilization=0.98 max_model_len=4096 enforce_eager=True) * [Setting 2]=(gpu_memory_utilization=1.0 max_model_len=4096 enforce_eager=True) * [Setting 3]=(gpu_memory_utilization=1.0 max_model_len=8192 enforce_eager=True) * [Setting 4]=(gpu_memory_utilization=0.9 max_model_len=65536 enforce_eager=False) * [Setting 5]=(gpu_memory_utilization=0.9 max_model_len=131072 enforce_eager=False)
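The bracketed settings above correspond to keyword arguments of the vLLM engine (``gpu_memory_utilization``, ``max_model_len``, and ``enforce_eager`` are all ``vllm.LLM`` constructor parameters). The sketch below shows one way to translate a setting label plus a GPU count from the tables into an engine-argument dict; the ``SETTINGS`` table and ``engine_kwargs`` helper are illustrative conveniences, not part of Qwen's or vLLM's tooling.

```python
# Illustrative mapping from the setting labels above to vLLM engine arguments.
# The keys gpu_memory_utilization, max_model_len, and enforce_eager are real
# vllm.LLM() keyword arguments; SETTINGS and engine_kwargs are hypothetical.
SETTINGS = {
    "Default": dict(gpu_memory_utilization=0.9, max_model_len=32768, enforce_eager=False),
    "Setting 1": dict(gpu_memory_utilization=0.98, max_model_len=4096, enforce_eager=True),
    "Setting 2": dict(gpu_memory_utilization=1.0, max_model_len=4096, enforce_eager=True),
    "Setting 3": dict(gpu_memory_utilization=1.0, max_model_len=8192, enforce_eager=True),
    "Setting 4": dict(gpu_memory_utilization=0.9, max_model_len=65536, enforce_eager=False),
    "Setting 5": dict(gpu_memory_utilization=0.9, max_model_len=131072, enforce_eager=False),
}

def engine_kwargs(model: str, setting: str = "Default", tensor_parallel_size: int = 1) -> dict:
    """Build the keyword arguments for vllm.LLM(**engine_kwargs(...))."""
    kwargs = {"model": model, "tensor_parallel_size": tensor_parallel_size}
    kwargs.update(SETTINGS[setting])
    return kwargs

# Example: the 72B BF16 row at input length 129024 (Setting 5, 4 GPUs).
kw = engine_kwargs("Qwen/Qwen2.5-72B-Instruct", "Setting 5", tensor_parallel_size=4)
```

With vLLM installed, the result would be passed as ``LLM(**kw)``, and generation would then use the ``SamplingParams`` values given in the notes for the long-context rows.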