AWQ
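Note
This page still needs to be updated for Qwen3.
AWQ (Activation-aware Weight Quantization) is a hardware-friendly approach to low-bit weight-only quantization for LLMs.
AutoAWQ is an easy-to-use toolkit for 4-bit quantized models. Compared to FP16, AutoAWQ can speed up models by 3x and cut memory requirements to one third. AutoAWQ implements the AWQ algorithm for quantizing LLMs.
In this document, we show how to use quantized models with Hugging Face transformers and how to quantize your own model.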
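The memory claim can be sanity-checked with a back-of-the-envelope estimate. The sketch below is illustrative only: it assumes 4-bit weights stored in groups of 128, each group carrying one 16-bit scale and one 4-bit zero point, and a round figure of 7e9 parameters. The helper names are made up for this example.

```python
def quantized_weight_gib(n_params, w_bit=4, group_size=128,
                         scale_bits=16, zero_bits=4):
    """Estimate storage for group-quantized weights, in GiB."""
    # Every group of `group_size` weights shares one scale and one zero point.
    bits_per_weight = w_bit + (scale_bits + zero_bits) / group_size
    return n_params * bits_per_weight / 8 / 1024**3


def fp16_weight_gib(n_params):
    """Storage for the same weights kept in FP16, in GiB."""
    return n_params * 16 / 8 / 1024**3


n_params = 7e9  # assumed round figure, not an exact parameter count
fp16 = fp16_weight_gib(n_params)
awq = quantized_weight_gib(n_params)
print(f"FP16: {fp16:.1f} GiB, 4-bit: {awq:.1f} GiB, ratio: {fp16 / awq:.2f}x")
```

This is an idealized upper bound on the weight savings. Real savings are typically smaller, since some parts of a model usually stay in higher precision and activations and the KV cache are not covered by weight-only quantization.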
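Using AWQ-quantized models with Hugging Face transformers
transformers now officially supports AutoAWQ, which means you can use AWQ-quantized models directly in transformers. The following is a very simple code snippet showing how to run the quantized model Qwen2.5-7B-Instruct-AWQ: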
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct-AWQ"

# Load the quantized model and its tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
# Render the conversation with the chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
# Drop the prompt tokens so only the newly generated tokens remain
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
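The slicing step near the end of the snippet is easy to misread, so here is a minimal sketch of the same idea on plain Python lists. The token IDs are toy values and `strip_prompt_tokens` is a hypothetical helper name, not part of transformers.

```python
def strip_prompt_tokens(input_ids, output_ids):
    """Return only the newly generated tokens for each sequence in a batch."""
    # generate() returns prompt + continuation, so cut off the prompt length.
    return [out[len(inp):] for inp, out in zip(input_ids, output_ids)]


# Toy token IDs standing in for real tensors
prompt_batch = [[101, 7, 8], [101, 9]]
full_batch = [[101, 7, 8, 42, 43], [101, 9, 50]]

new_tokens = strip_prompt_tokens(prompt_batch, full_batch)
print(new_tokens)  # [[42, 43], [50]]
```

Without this step, decoding would return the prompt text followed by the model's answer.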
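Using AWQ-quantized models with vLLM
vLLM supports AWQ, so you can directly use the AWQ-quantized models we provide or models quantized with AutoAWQ. We recommend the latest version of vLLM (vllm>=0.6.1), which improves efficiency for AWQ-quantized models; with older versions, inference may not be well optimized and may even be slower than with the unquantized model.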
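As a rough way to check the installed version against that recommendation, the following stdlib-only sketch compares release numbers. The helper names are invented for illustration, and the parsing deliberately ignores suffixes such as `.post1` or `rc1`, so treat it as an approximation rather than a full version parser.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Parse the leading numeric release part, e.g. '0.6.1.post1' -> (0, 6, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum="0.6.1"):
    """True if the installed release is at least the recommended minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_vllm(minimum="0.6.1"):
    """Describe whether the local vllm install meets the recommendation."""
    try:
        installed = metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return "vllm is not installed"
    status = "OK" if meets_minimum(installed, minimum) else "consider upgrading"
    return f"vllm {installed}: {status}"


print(check_vllm())
```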
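In fact, using AWQ models works the same way as basic vLLM usage. Here is a simple example that starts an OpenAI-compatible API with vLLM, serving the Qwen2.5-7B-Instruct-AWQ model:
Run the following command in a terminal to start the OpenAI-compatible API: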
vllm serve Qwen/Qwen2.5-7B-Instruct-AWQ
You can then call the API like this:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen2.5-7B-Instruct-AWQ",
  "messages": [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "Tell me something about large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "repetition_penalty": 1.05,
  "max_tokens": 512
}'
Alternatively, you can use the API client from the openai Python package, as shown below:
from openai import OpenAI

# Point the client at the local vLLM server; the key is a placeholder
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct-AWQ",
    messages=[
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
    # Parameters outside the standard OpenAI set are passed via extra_body
    extra_body={
        "repetition_penalty": 1.05,
    },
)
print("Chat response:", chat_response)
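If the openai package is not available, the same request can be sketched with the Python standard library alone. This mirrors the curl call above; `build_chat_request` and `extract_reply` are hypothetical helper names, and the actual send is left commented out because it needs the server started with `vllm serve` to be running.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "temperature": 0.7,
        "top_p": 0.8,
        "repetition_penalty": 1.05,
        "max_tokens": 512,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of a chat completions response dict."""
    return response_body["choices"][0]["message"]["content"]


request = build_chat_request(
    "http://localhost:8000/v1",
    "Qwen/Qwen2.5-7B-Instruct-AWQ",
    "Tell me something about large language models.",
)

# Sending requires a running server:
# with urllib.request.urlopen(request) as resp:
#     print(extract_reply(json.load(resp)))
```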
Quantizing your own model with AutoAWQ
If you want to quantize your own model with AWQ, we recommend using AutoAWQ.
pip install "autoawq<0.2.7"
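Suppose you have fine-tuned Qwen2.5-7B on your own dataset, such as Alpaca, and named the result Qwen2.5-7B-finetuned. To build your own AWQ-quantized model, you need to use the training data for calibration. Below is a simple demonstration you can run: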
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
# Specify paths and hyperparameters for quantization
model_path = "your_model_path"
quant_path = "your_quantized_model_path"
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
# Load your tokenizer and model with AutoAWQ
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto", safetensors=True)
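To make the `quant_config` fields more concrete, the sketch below shows plain group-wise asymmetric quantization in pure Python: `w_bit` sets the integer range, `q_group_size` sets how many weights share one scale, and `zero_point` means each group also stores an integer offset. This is only an illustration of those three ideas. It is not AutoAWQ's implementation and leaves out the activation-aware scaling that gives AWQ its name; all function names are made up.

```python
def quantize_group(weights, w_bit=4):
    """Quantize one group of floats to integers in [0, 2**w_bit - 1]."""
    q_max = 2**w_bit - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max
    if scale == 0:
        scale = 1.0  # constant group: any scale works
    # The zero point is the integer that represents 0.0, kept in range.
    zero = min(q_max, max(0, round(-w_min / scale)))
    q = [min(q_max, max(0, round(w / scale) + zero)) for w in weights]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    """Map the stored integers back to approximate float weights."""
    return [(v - zero) * scale for v in q]


def quantize_row(row, group_size=128, w_bit=4):
    """Split a row into groups; each group gets its own scale and zero point."""
    groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
    return [quantize_group(group, w_bit) for group in groups]


# Toy row with a tiny group size so the result is easy to inspect
row = [-0.8, 0.1, 0.3, 0.7, -0.2, 0.0, 0.5, -0.6]
quantized = quantize_row(row, group_size=4)
restored = [w for q, s, z in quantized for w in dequantize_group(q, s, z)]
print([round(w, 3) for w in restored])
```

Smaller groups track the local weight range more closely at the cost of storing more scales and zero points.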
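Next, prepare the calibration data. All you need to do is put the samples into a list, where each sample is a piece of text. Since we use the fine-tuning data directly for calibration, we first format it with the ChatML template. For example: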
# `dataset` is your fine-tuning data: an iterable of chat message lists
data = []
for msg in dataset:
    text = tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=False)
    data.append(text.strip())
where each msg is a typical chat message list, as shown below:
[
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "Tell me who you are."},
    {"role": "assistant", "content": "I am a large language model named Qwen..."}
]
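Malformed samples are easier to catch before calibration than during it. The following is a minimal sketch of a structural check for one conversation in the format above. `find_problems` is a hypothetical helper, and the allowed role set is an assumption based on the example; extend it if your data uses other roles.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}  # assumed from the example above


def find_problems(msg):
    """Return a list of human-readable problems found in one conversation."""
    if not isinstance(msg, list) or not msg:
        return ["conversation must be a non-empty list of turns"]
    problems = []
    for index, turn in enumerate(msg):
        if not isinstance(turn, dict):
            problems.append(f"turn {index}: not a dict")
            continue
        if turn.get("role") not in ALLOWED_ROLES:
            problems.append(f"turn {index}: unexpected role {turn.get('role')!r}")
        if not isinstance(turn.get("content"), str):
            problems.append(f"turn {index}: content must be a string")
    return problems


good = [
    {"role": "user", "content": "Tell me who you are."},
    {"role": "assistant", "content": "I am a large language model named Qwen..."},
]
bad = [{"role": "narrator", "content": None}]

print(find_problems(good))
print(find_problems(bad))
```

Running such a check over the whole dataset before formatting lets you drop or fix broken samples up front.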
Then run the calibration process with a single line of code:
model.quantize(tokenizer, quant_config=quant_config, calib_data=data)
Finally, save the quantized model:
model.save_quantized(quant_path, safetensors=True, shard_size="4GB")
tokenizer.save_pretrained(quant_path)
You now have an AWQ-quantized model that is ready for deployment. Have fun!