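GPTQ
Attention
To be updated for Qwen3.
GPTQ is a quantization method for GPT-like large language models that performs one-shot weight quantization based on approximate second-order information. In this document, we show you how to load and use quantized models with the transformers library, and how to quantize your own model with AutoGPTQ.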
Usage of GPTQ Models with Hugging Face transformers
Note
To use the official Qwen2.5 GPTQ models with transformers, please ensure that optimum>=1.20.0 and compatible versions of transformers and auto_gptq are installed.
You can do that by
pip install -U "optimum>=1.20.0"
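transformers now officially supports AutoGPTQ, which means you can use quantized models directly with transformers. The following is a very simple code snippet showing how to run Qwen2.5-7B-Instruct-GPTQ-Int4 (note that for each size of Qwen2.5 model, we provide both Int4 and Int8 quantized versions):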
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
# Drop the prompt tokens so that only the newly generated tokens are decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
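The list comprehension near the end of the snippet above is needed because, for decoder-only models, generate returns the prompt tokens followed by the newly generated ones. The following sketch reproduces that slicing on plain Python lists. The token IDs are invented for illustration and are not real tokenizer output.

```python
# Toy illustration of stripping the prompt tokens from the output of generate().
# The IDs below are made up for this example; real values come from the tokenizer.
prompt_ids = [[101, 7, 8, 9]]                # one prompt of 4 tokens
full_output_ids = [[101, 7, 8, 9, 42, 43]]   # the prompt followed by 2 new tokens

new_token_ids = [
    output[len(prompt):]                     # keep only what follows the prompt
    for prompt, output in zip(prompt_ids, full_output_ids)
]
print(new_token_ids)
```

Only the two trailing IDs remain, so decoding them yields the reply without the echoed prompt.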
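Usage of GPTQ Models with vLLM
vLLM supports GPTQ, so you can directly use the GPTQ models we provide or models quantized with AutoGPTQ. We recommend using the latest version of vLLM, which automatically uses the more efficient GPTQ Marlin implementation when possible.
Using GPTQ models with vLLM is in fact the same as the basic usage of vLLM. We provide a simple example that launches an OpenAI-compatible API with vLLM and the Qwen2.5-7B-Instruct-GPTQ-Int4 model:
Run the following command in a terminal to start the OpenAI-compatible API: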
vllm serve Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --served-model-name Qwen2.5-7B-Instruct-GPTQ-Int4
Then, you can call the API as follows:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen2.5-7B-Instruct-GPTQ-Int4",
  "messages": [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "Tell me something about large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "repetition_penalty": 1.05,
  "max_tokens": 512
}'
Alternatively, you can use the API client from the openai Python package as shown below:
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen2.5-7B-Instruct-GPTQ-Int4",
    messages=[
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
    extra_body={
        "repetition_penalty": 1.05,
    },
)
print("Chat response:", chat_response)
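If the openai package is not available, the same request can also be sent with the Python standard library alone. The following is a minimal sketch, assuming the server started above is listening on localhost:8000. The helper names build_payload and chat are hypothetical and are not part of vLLM or the openai package.

```python
import json
import urllib.request

# Assumption: the OpenAI-compatible server from the command above is running locally.
API_URL = "http://localhost:8000/v1/chat/completions"


def build_payload(user_message, model="Qwen2.5-7B-Instruct-GPTQ-Int4"):
    """Build the same request body as the curl example (hypothetical helper)."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "temperature": 0.7,
        "top_p": 0.8,
        "repetition_penalty": 1.05,
        "max_tokens": 512,
    }


def chat(user_message, url=API_URL):
    """POST the payload and return the assistant's reply text (hypothetical helper)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(user_message)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-compatible responses carry the reply under choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

Once the server is up, calling chat("Tell me something about large language models.") should return only the reply text instead of the whole response object.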
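Quantize Your Own Model with AutoGPTQ
If you want to quantize your own model into a GPTQ quantized model, we advise you to use AutoGPTQ. We recommend installing the latest version of the package from source: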
git clone https://github.com/AutoGPTQ/AutoGPTQ
cd AutoGPTQ
pip install -e .
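Suppose you have fine-tuned a model based on Qwen2.5-7B with your own dataset, e.g., Alpaca, and named it Qwen2.5-7B-finetuned. To build your own GPTQ quantized model, you need to use the training data for calibration. Below is a simple demonstration for you to run: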
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

# Specify paths and hyperparameters for quantization
model_path = "your_model_path"
quant_path = "your_quantized_model_path"
quantize_config = BaseQuantizeConfig(
    bits=8,  # 4 or 8
    group_size=128,
    damp_percent=0.01,
    desc_act=False,  # setting this to False can significantly speed up inference, but perplexity may be slightly worse
    static_groups=False,
    sym=True,
    true_sequential=True,
    model_name_or_path=None,
    model_file_base_name="model",
)
max_len = 8192

# Load your tokenizer and model with AutoGPTQ
# To learn about loading a model onto multiple GPUs,
# visit https://github.com/AutoGPTQ/AutoGPTQ/blob/main/docs/tutorial/02-Advanced-Model-Loading-and-Best-Practice.md
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoGPTQForCausalLM.from_pretrained(model_path, quantize_config)
However, if you would like to load the model on multiple GPUs, you need to use max_memory instead of device_map. Here is an example:
model = AutoGPTQForCausalLM.from_pretrained(
    model_path,
    quantize_config,
    max_memory={i: "20GB" for i in range(4)},
)
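Next, prepare the data for calibration. All you need to do is put the samples into a list, where each sample is a piece of text. Because we use the fine-tuning data directly for calibration, we first format it with the ChatML template. For example: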
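import torch

# `dataset` is your own list of chat conversations (see the example message below)
data = []
for msg in dataset:
    text = tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=False)
    model_inputs = tokenizer([text])
    # Shape (1, seq_len); truncate along the sequence dimension to at most max_len tokens
    input_ids = torch.tensor(model_inputs.input_ids, dtype=torch.int)[:, :max_len]
    data.append(dict(input_ids=input_ids, attention_mask=input_ids.ne(tokenizer.pad_token_id)))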
where each msg is a typical chat message as shown below:
[
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "Tell me who you are."},
    {"role": "assistant", "content": "I am a large language model named Qwen..."}
]
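To make the ChatML formatting step more concrete, the following sketch renders such a message list into ChatML-style text in plain Python. It is only an approximation for illustration: render_chatml is a hypothetical helper, and the chat template bundled with the tokenizer (used through apply_chat_template) remains the authoritative version and may differ in details such as default system prompts or tool handling.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Approximate ChatML layout: one <|im_start|>role ... <|im_end|> block per message."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open an assistant turn for the model to complete (not wanted for calibration text)
        text += "<|im_start|>assistant\n"
    return text


msg = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "Tell me who you are."},
    {"role": "assistant", "content": "I am a large language model named Qwen..."},
]
print(render_chatml(msg))
```

For calibration, add_generation_prompt stays False because each sample already contains the assistant turn.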
Then run the calibration process with a single line of code:
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)
model.quantize(data, cache_examples_on_gpu=False)
Finally, save the quantized model:
model.save_quantized(quant_path, use_safetensors=True)
tokenizer.save_pretrained(quant_path)
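Unfortunately, the save_quantized method does not support sharding. To shard the model, you need to load it first and then use the save_pretrained method from the transformers library to save and shard it. Apart from that, everything is straightforward. Enjoy!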
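The sharding step described above could look roughly like the following sketch. It assumes that the quantized checkpoint can be loaded through transformers (which requires optimum and auto_gptq, as noted earlier) and that accelerate is available for device_map. The function name shard_quantized_model, the output path argument, and the 4GB shard size are illustrative choices, not values from this guide.

```python
def shard_quantized_model(quant_path, sharded_path, max_shard_size="4GB"):
    """Reload a GPTQ checkpoint with transformers and save it again in shards (sketch)."""
    # Imported lazily so that defining this helper does not require the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(quant_path)

    # max_shard_size caps the size of each weight file written by save_pretrained.
    model.save_pretrained(sharded_path, max_shard_size=max_shard_size)
    tokenizer.save_pretrained(sharded_path)
```

The helper would be called with the quant_path used in save_quantized and a new output directory of your choice.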
Known Issues
Qwen2.5-72B-Instruct-GPTQ-Int4 cannot stop generation properly
- Model: Qwen2.5-72B-Instruct-GPTQ-Int4
- Framework: vLLM, AutoGPTQ (including Hugging Face transformers)
- Description: Generation does not stop properly. The model keeps generating past the point where it should stop and then produces repeated text, which may be a single character, a phrase, or whole paragraphs.
- Workaround: Either of the following workarounds can be considered:
  - Use the original model in 16-bit floating point.
  - Use the AWQ variants or llama.cpp-based models, which are less likely to generate abnormally.
Qwen2.5-32B-Instruct-GPTQ-Int4 broken with vLLM on multiple GPUs
- Model: Qwen2.5-32B-Instruct-GPTQ-Int4
- Framework: vLLM
- Description: When the model is deployed on multiple GPUs, only garbled text such as !!!!!!!!!!!!!!!!!! is generated.
- Workaround: Any of the following workarounds can be considered:
  - Use the AWQ or GPTQ-Int8 variants.
  - Use a single GPU.
  - Use Hugging Face transformers if latency and throughput are not major concerns.
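Troubleshooting
With transformers and auto_gptq, the logs report CUDA extension not installed. and inference is slow.
auto_gptq fails to find a fused CUDA kernel compatible with your environment and therefore falls back to a plain implementation. Follow its installation guide to install a pre-built wheel, or try installing auto_gptq from source.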
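Self-quantized Qwen2.5-72B-Instruct-GPTQ with vllm raises ValueError: ... must be divisible by ... . The intermediate size of the self-quantized model differs from that of the official Qwen2.5-72B-Instruct-GPTQ models.
After quantization, the size of the quantized weights is divided by the group size, which is typically 128. The intermediate size of the FFN blocks in Qwen2.5-72B is 29568. Unfortunately, \(29568 \div 128 = 231\), which is an odd number and therefore cannot be split evenly across 2, 4, or 8 GPUs. Since the number of attention heads and the dimensions of the weights must be divisible by the tensor parallel size, you can only run the quantized model with tensor_parallel_size=1, that is, on a single GPU.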
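One workaround is to make the intermediate size divisible by \(128 \times 8 = 1024\). To achieve that, the weights should be padded with zeros. Although zero-padding the weights is mathematically equivalent to leaving them unpadded, the results may differ slightly in practice.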
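The arithmetic behind this workaround can be checked with a few lines of Python. This is a simplified sketch that only models the requirement that the number of quantization groups be divisible by the tensor parallel size. The candidate sizes 1, 2, 4, and 8 are an assumption for illustration.

```python
intermediate_size = 29568   # FFN intermediate size stated above
group_size = 128            # typical GPTQ group size
tp_sizes = (1, 2, 4, 8)     # assumed candidate tensor parallel sizes

# Before padding: which tensor parallel sizes divide the number of groups?
num_groups = intermediate_size // group_size
usable_before = [tp for tp in tp_sizes if num_groups % tp == 0]

# Round up to the next multiple of group_size * 8 so that up to 8 GPUs divide the groups evenly.
multiple = group_size * max(tp_sizes)
padded_size = -(-intermediate_size // multiple) * multiple   # ceiling division
pad_size = padded_size - intermediate_size
usable_after = [tp for tp in tp_sizes if (padded_size // group_size) % tp == 0]

print(num_groups, usable_before)
print(padded_size, pad_size, usable_after)
```

With these inputs, the padded intermediate size comes out as 29696, which means 128 zero-valued channels have to be added.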
Try the following:
import torch
from torch.nn import functional as F
from transformers import AutoModelForCausalLM

# must use AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-72B-Instruct", torch_dtype="auto")

# this size is Qwen2.5-72B only
pad_size = 128

sd = model.state_dict()

for i, k in enumerate(sd):
    v = sd[k]
    print(k, i)
    # interleaving the padded zeros
    if ('mlp.up_proj.weight' in k) or ('mlp.gate_proj.weight' in k):
        prev_v = F.pad(v.unsqueeze(1), (0, 0, 0, 1, 0, 0)).reshape(29568*2, -1)[:pad_size*2]
        new_v = torch.cat([prev_v, v[pad_size:]], dim=0)
        sd[k] = new_v
    elif 'mlp.down_proj.weight' in k:
        prev_v = F.pad(v.unsqueeze(2), (0, 1)).reshape(8192, 29568*2)[:, :pad_size*2]
        new_v = torch.cat([prev_v, v[:, pad_size:]], dim=1)
        sd[k] = new_v

# this is a very large file; make sure your RAM is enough to load the model
torch.save(sd, '/path/to/padded_model/pytorch_model.bin')
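This saves the padded checkpoint to the specified directory. Then, copy the other files from the original checkpoint to the new directory and change intermediate_size in config.json to 29696. Finally, you can quantize the saved model checkpoint.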
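The claim that zero padding leaves the FFN output unchanged can be checked on a toy example. The sketch below uses plain Python lists and assumes a gated FFN of the form down(silu(gate(x)) * up(x)), matching the gate_proj, up_proj, and down_proj weights padded above. All weight values and sizes are arbitrary and chosen only for illustration.

```python
import math


def silu(z):
    return z / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def gated_mlp(gate, up, down, x):
    """Compute down @ (silu(gate @ x) * (up @ x)) for small list-based matrices."""
    hidden = [silu(g) * u for g, u in zip(matvec(gate, x), matvec(up, x))]
    return matvec(down, hidden)


# Toy weights: hidden size 2, intermediate size 3 (arbitrary values).
gate = [[0.5, -1.0], [1.5, 0.25], [-0.75, 2.0]]
up = [[1.0, 0.5], [-0.5, 1.0], [0.25, -1.5]]
down = [[1.0, -2.0, 0.5], [0.5, 1.0, -1.0]]
x = [0.3, -0.7]

# Insert a zero row after the first row of gate/up and a zero column
# at the same position of down, mimicking the interleaved padding.
zero_row = [0.0, 0.0]
gate_padded = [gate[0], zero_row] + gate[1:]
up_padded = [up[0], zero_row] + up[1:]
down_padded = [[row[0], 0.0] + row[1:] for row in down]

original = gated_mlp(gate, up, down, x)
padded = gated_mlp(gate_padded, up_padded, down_padded, x)
print(all(math.isclose(a, b) for a, b in zip(original, padded)))
```

The padded intermediate channel evaluates to silu(0) * 0 = 0 and is multiplied by a zero column in down, so it contributes nothing to the output.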