GPTQ¶

Attention

To be updated for Qwen3.

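GPTQ is a quantization method for GPT-like large language models that performs one-shot weight quantization based on approximate second-order information. In this document, we show how to load and run quantized models with the transformers library, and how to quantize your own model with AutoGPTQ.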
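To make "weight quantization" more concrete, the following is a minimal sketch of plain group-wise round-to-nearest quantization. It is not the GPTQ algorithm: GPTQ additionally uses approximate second-order information to compensate for the rounding error, which this sketch omits. The helper names (quantize_group, dequantize_group, quantize_row) and the signed-integer encoding are assumptions made for illustration only.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1  # largest positive code, e.g. 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest integer code and clamp to the valid range.
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def quantize_row(weights, bits=4, group_size=128):
    """Quantize a row of weights group by group; each group gets its own scale."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


row = [0.1 * i - 0.8 for i in range(16)]
groups = quantize_row(row, bits=4, group_size=8)
```

Smaller groups mean more scales to store but a tighter fit per group, which is the trade-off behind a setting such as a group size of 128.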

Usage of GPTQ Models with Hugging Face transformers¶

Note

To use the official Qwen2.5 GPTQ models with transformers, please ensure that optimum>=1.20.0 and compatible versions of transformers and auto_gptq are installed.

You can do that by running:

pip install -U "optimum>=1.20.0"

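transformers now officially supports AutoGPTQ, which means you can use quantized models directly with transformers. The following is a very simple code snippet showing how to run Qwen2.5-7B-Instruct-GPTQ-Int4 (note that for each size of Qwen2.5 model, we provide both Int4 and Int8 quantized versions):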

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4"

model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
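The list comprehension near the end of the snippet above removes the prompt tokens from each generated sequence, because generate returns the prompt followed by the newly generated tokens. A small sketch with made-up token ids (the helper name strip_prompt and the ids are hypothetical) shows the effect:

```python
def strip_prompt(input_ids_batch, output_ids_batch):
    """Drop the leading prompt tokens from each generated sequence."""
    return [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(input_ids_batch, output_ids_batch)
    ]


# Toy ids: each output starts with its prompt, followed by two new tokens.
prompts = [[101, 7, 8], [101, 9, 10]]
outputs = [[101, 7, 8, 42, 43], [101, 9, 10, 44, 45]]
new_tokens = strip_prompt(prompts, outputs)
```

Only the new tokens are then passed to batch_decode, so the decoded response does not repeat the prompt.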

Usage of GPTQ Models with vLLM¶

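vLLM supports GPTQ, so you can directly use the GPTQ models we provide or models quantized with AutoGPTQ. We recommend using the latest version of vLLM. Where possible, it automatically uses the more efficient GPTQ Marlin implementation.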

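In practice, using a GPTQ model is the same as the basic usage of vLLM. We provide a simple example showing how to launch an OpenAI-API-compatible service with vLLM and the Qwen2.5-7B-Instruct-GPTQ-Int4 model: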

Run the following command in a terminal to start the OpenAI-compatible API service:

vllm serve Qwen2.5-7B-Instruct-GPTQ-Int4

Then you can call the API like this:

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen2.5-7B-Instruct-GPTQ-Int4",
  "messages": [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "Tell me something about large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "repetition_penalty": 1.05,
  "max_tokens": 512
}'

Alternatively, you can use the API client from the openai Python package, as shown below:

from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen2.5-7B-Instruct-GPTQ-Int4",
    messages=[
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
    extra_body={
        "repetition_penalty": 1.05,
    },
)
print("Chat response:", chat_response)
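If installing the openai package is not an option, the same request can be assembled with the Python standard library. The sketch below mirrors the curl call shown earlier; the helper names build_chat_request and extract_reply are hypothetical, and the reply is read from choices[0].message.content, following the OpenAI chat completions response format.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, **sampling):
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {"model": model, "messages": messages, **sampling}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_json):
    """Return the assistant text from a parsed chat completions response."""
    return response_json["choices"][0]["message"]["content"]


request = build_chat_request(
    "http://localhost:8000/v1",
    "Qwen2.5-7B-Instruct-GPTQ-Int4",
    [{"role": "user", "content": "Tell me something about large language models."}],
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.05,
    max_tokens=512,
)

# Sending the request requires the vLLM server to be running:
# with urllib.request.urlopen(request) as resp:
#     print(extract_reply(json.load(resp)))
```

With the openai client, the equivalent of extract_reply is chat_response.choices[0].message.content.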

Quantize Your Own Model with AutoGPTQ¶

If you want to quantize your own model into a GPTQ-quantized model, we recommend using AutoGPTQ. We suggest installing the latest version of the package from source:

git clone https://github.com/AutoGPTQ/AutoGPTQ
cd AutoGPTQ
pip install -e .

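Suppose you have fine-tuned a model based on Qwen2.5-7B with your own dataset, e.g., Alpaca, and named it Qwen2.5-7B-finetuned. To build your own GPTQ-quantized model, you need to use the training data for calibration. Below is a simple demonstration for you to run: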

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

# Specify paths and hyperparameters for quantization
model_path = "your_model_path"
quant_path = "your_quantized_model_path"
quantize_config = BaseQuantizeConfig(
    bits=8, # 4 or 8
    group_size=128,
    damp_percent=0.01,
    desc_act=False,  # setting this to False can significantly speed up inference, but perplexity may be slightly worse
    static_groups=False,
    sym=True,
    true_sequential=True,
    model_name_or_path=None,
    model_file_base_name="model"
)
max_len = 8192

# Load your tokenizer and model with AutoGPTQ
# To learn about loading model to multiple GPUs,
# visit https://github.com/AutoGPTQ/AutoGPTQ/blob/main/docs/tutorial/02-Advanced-Model-Loading-and-Best-Practice.md
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoGPTQForCausalLM.from_pretrained(model_path, quantize_config)

However, if you would like to load the model on multiple GPUs, you need to use max_memory instead of device_map. Here is an example:

model = AutoGPTQForCausalLM.from_pretrained(
    model_path,
    quantize_config,
    max_memory={i: "20GB" for i in range(4)}
)
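The max_memory argument is a mapping from GPU index to a memory budget. A small sketch of building that mapping for an arbitrary number of GPUs follows; the helper name build_max_memory is hypothetical, and the 20GB budget is just the value from the example above, so adjust it to your hardware.

```python
def build_max_memory(num_gpus, per_gpu="20GB"):
    """Build a max_memory mapping that gives each GPU the same memory budget."""
    return {gpu_index: per_gpu for gpu_index in range(num_gpus)}


# Equivalent to the dictionary used in the example above.
# In practice, num_gpus could come from torch.cuda.device_count().
max_memory = build_max_memory(4)
```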

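Next, prepare the data for calibration. All you need to do is put the samples into a list, where each sample is a piece of text. Since we use the fine-tuning data directly for calibration, we first format it with the ChatML template. For example: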

import torch

data = []
for msg in dataset:
    text = tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=False)
    model_inputs = tokenizer([text])
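    # input_ids has shape (1, seq_len); truncate along the sequence dimension, not the batch dimension
    input_ids = torch.tensor(model_inputs.input_ids, dtype=torch.int)[:, :max_len]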
    data.append(dict(input_ids=input_ids, attention_mask=input_ids.ne(tokenizer.pad_token_id)))

where each msg is a typical chat message list, as shown below:

[
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "Tell me who you are."},
    {"role": "assistant", "content": "I am a large language model named Qwen..."}
]
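For intuition about what the ChatML formatting step produces, the sketch below renders a message list by hand. It is a simplified approximation under the assumption that each turn is wrapped in <|im_start|> and <|im_end|> markers; the real template is defined by the tokenizer and may add details such as a default system prompt, so apply_chat_template should still be used in practice. The function name render_chatml is hypothetical.

```python
def render_chatml(messages):
    """Render chat messages in a simplified ChatML layout."""
    return "".join(
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    )


msg = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me who you are."},
    {"role": "assistant", "content": "I am a large language model named Qwen..."},
]
text = render_chatml(msg)
```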

Then run the calibration process with a single line of code:

import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)
model.quantize(data, cache_examples_on_gpu=False)

Finally, save the quantized model:

model.save_quantized(quant_path, use_safetensors=True)
tokenizer.save_pretrained(quant_path)

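Unfortunately, the save_quantized method does not support sharding. To shard the model, you need to load it first and then save it with the save_pretrained method from transformers, which supports sharding. Apart from that, everything is straightforward. Enjoy!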
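The resharding step described above could look like the following sketch. It assumes the quantized model can be reloaded through transformers (as in the first section) and uses the max_shard_size parameter of save_pretrained; the helper name reshard_quantized_model, the sharded_path argument, and the 4GB shard size are illustrative choices, not values from this document.

```python
def reshard_quantized_model(quant_path, sharded_path, max_shard_size="4GB"):
    """Reload a saved GPTQ model with transformers and save it again in shards."""
    # Imported inside the function so that defining this helper does not
    # require transformers to be loaded.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(quant_path)

    # save_pretrained splits the checkpoint into files of at most max_shard_size.
    model.save_pretrained(sharded_path, max_shard_size=max_shard_size)
    tokenizer.save_pretrained(sharded_path)
```

Calling reshard_quantized_model(quant_path, "your_sharded_model_path") would then write the sharded checkpoint next to a copy of the tokenizer files.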

Known Issues¶

Qwen2.5-72B-Instruct-GPTQ-Int4 cannot stop generation properly¶

Model:

Qwen2.5-72B-Instruct-GPTQ-Int4

Framework:

vLLM, AutoGPTQ (including Hugging Face transformers)

Description:

Generation does not stop properly. The model keeps generating past the point where it should stop and then produces repeated text, which may be a single character, a phrase, or whole paragraphs.

Workaround:

Either of the following workarounds could be considered:

Using the original model in 16-bit floating point
Using the AWQ variants or llama.cpp-based models for reduced chances of abnormal generation

Qwen2.5-32B-Instruct-GPTQ-Int4 broken with vLLM on multiple GPUs¶

Model:

Qwen2.5-32B-Instruct-GPTQ-Int4

Framework:

vLLM

Description:

When the model is deployed on multiple GPUs, only garbled text such as !!!!!!!!!!!!!!!!!! is generated.

Workaround:

Any of the following workarounds could be considered:

Using the AWQ or GPTQ-Int8 variants
Using a single GPU
Using Hugging Face transformers if latency and throughput are not major concerns
