AWQ

Attention

To be updated for Qwen3.

For quantized models, we recommend using AWQ together with AutoAWQ.

AWQ (Activation-aware Weight Quantization) is a hardware-friendly approach to low-bit weight quantization for LLMs.

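AutoAWQ is an easy-to-use toolkit for 4-bit model quantization. Compared with FP16, AutoAWQ speeds up models by 3x and reduces memory requirements to one third. AutoAWQ implements the AWQ algorithm for quantizing LLMs.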
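As a rough back-of-the-envelope check on the memory figure, the sketch below estimates storage for the weights only. It assumes 4-bit weights with one FP16 scale and one 4-bit zero point per group of 128 weights; the function names and the 7-billion parameter count are illustrative assumptions, not values taken from AutoAWQ.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3


def awq_bits_per_weight(w_bit=4, group_size=128, scale_bits=16, zero_bits=4):
    """Effective bits per weight when each group stores one scale and one zero point."""
    return w_bit + (scale_bits + zero_bits) / group_size


num_params = 7_000_000_000  # illustrative parameter count, not an exact model size
fp16_gib = weight_memory_gib(num_params, 16)
awq_gib = weight_memory_gib(num_params, awq_bits_per_weight())

print(f"FP16 weights:  {fp16_gib:.2f} GiB")
print(f"4-bit weights: {awq_gib:.2f} GiB")
print(f"ratio:         {fp16_gib / awq_gib:.2f}x")
```

This counts quantized weights only. Whole-process memory also includes parts that are typically not quantized to 4 bits (for example activations and the KV cache), so end-to-end savings will generally be smaller than the weights-only ratio.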

In this document, we show you how to use quantized models with Hugging Face transformers and how to quantize your own model.

Using AWQ-Quantized Models with Hugging Face transformers

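transformers now officially supports AutoAWQ, which means you can use AWQ-quantized models directly in transformers. The following is a very simple code snippet showing how to run the quantized model Qwen2.5-7B-Instruct-AWQ: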

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct-AWQ"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Using AWQ-Quantized Models with vLLM

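vLLM supports AWQ, so you can directly use the AWQ-quantized models we provide or models quantized with AutoAWQ. We recommend the latest version of vLLM (vllm>=0.6.1), which improves efficiency for AWQ-quantized models; with older versions, inference may not be well optimized and can even be slower than with the non-quantized model.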
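One way to check the installed vLLM against the recommended minimum is sketched below using only the standard library. The helper names are hypothetical, and the parser only compares leading numeric release components; it is not a full PEP 440 implementation.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string):
    """Collect the leading integer components, e.g. '0.6.1.post1' gives (0, 6, 1)."""
    parts = []
    for piece in version_string.split("+")[0].split("."):
        if not piece.isdigit():
            break  # stop at the first non-numeric component (rc, post, dev, ...)
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum="0.6.1"):
    """Return True if the installed release is at least the minimum release."""
    return parse_release(installed) >= parse_release(minimum)


def check_vllm(minimum="0.6.1"):
    """Describe whether the locally installed vLLM satisfies the recommended minimum."""
    try:
        installed = version("vllm")
    except PackageNotFoundError:
        return "vllm is not installed"
    status = "OK" if meets_minimum(installed, minimum) else "older than recommended"
    return f"vllm {installed}: {status}"


print(check_vllm())
```

For stricter comparisons (pre-releases, local versions), a dedicated version-parsing library is likely a better fit than this simplified tuple comparison.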

Using an AWQ model with vLLM works the same way as basic vLLM usage. Here is a simple example that launches an OpenAI-compatible API with vLLM using the Qwen2.5-7B-Instruct-AWQ model:

Run the following command in a terminal to start the OpenAI-compatible API:

vllm serve Qwen/Qwen2.5-7B-Instruct-AWQ

You can then call the API like this:

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen2.5-7B-Instruct-AWQ",
  "messages": [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "Tell me something about large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "repetition_penalty": 1.05,
  "max_tokens": 512
}'

Alternatively, you can use the API client from the openai Python package, as shown below:

from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct-AWQ",
    messages=[
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
    extra_body={
        "repetition_penalty": 1.05,
    },
)
print("Chat response:", chat_response)

Quantizing Your Own Model with AutoAWQ

If you want to quantize your own model with AWQ, we recommend AutoAWQ. Install it with:

pip install "autoawq<0.2.7"

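Suppose you have fine-tuned a model based on Qwen2.5-7B with your own dataset, e.g. Alpaca, and named it Qwen2.5-7B-finetuned. To build your own AWQ-quantized model, you need to use the training data for calibration. Below is a simple demonstration you can run: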

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Specify paths and hyperparameters for quantization
model_path = "your_model_path"
quant_path = "your_quantized_model_path"
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load your tokenizer and model with AutoAWQ
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto", safetensors=True)
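To make the `quant_config` fields above more concrete, here is a toy, pure-Python sketch of group-wise asymmetric quantization with `w_bit=4`, `q_group_size=128`, and a zero point. It is an illustration under simplifying assumptions, not AutoAWQ's implementation: it omits AWQ's activation-aware scaling and the packing of 4-bit codes, and all function names are hypothetical.

```python
import random


def quantize_group(weights, w_bit=4):
    """Asymmetric (zero-point) quantization of one group of float weights."""
    q_max = 2**w_bit - 1  # 15 for 4-bit codes
    w_min, w_max = min(weights), max(weights)
    # The floor avoids division by zero when every weight in the group is equal.
    scale = max((w_max - w_min) / q_max, 1e-5)
    zero = min(q_max, max(0, round(-w_min / scale)))
    codes = [min(q_max, max(0, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [(c - zero) * scale for c in codes]


def quantize_row(row, q_group_size=128, w_bit=4):
    """Quantize a row of weights group by group; each group gets its own scale and zero."""
    return [
        quantize_group(row[start:start + q_group_size], w_bit)
        for start in range(0, len(row), q_group_size)
    ]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]  # hypothetical weight row
groups = quantize_row(row, q_group_size=128, w_bit=4)
restored = [w for codes, scale, zero in groups for w in dequantize_group(codes, scale, zero)]
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(groups)} groups, max abs error {max_error:.5f}")
```

Smaller groups give each scale fewer weights to cover, which tends to lower the rounding error at the cost of storing more scales and zero points.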

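Next, prepare the data for calibration. All you need to do is put the samples into a list, where each sample is a piece of text. Since we use the fine-tuning data directly for calibration, we first format it with the ChatML template. For example: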

data = []
for msg in dataset:
    text = tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=False)
    data.append(text.strip())

Here each msg is a typical chat message list, as shown below:

[
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "Tell me who you are."},
    {"role": "assistant", "content": "I am a large language model named Qwen..."}
]
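If the fine-tuning data is stored as Alpaca-style records, the `dataset` of message lists used in the loop above could be built along the following lines. This is a minimal sketch: it assumes each record has `instruction`, `input`, and `output` fields, and the helper name and sample records are hypothetical.

```python
SYSTEM_PROMPT = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."


def alpaca_to_messages(record, system_prompt=SYSTEM_PROMPT):
    """Convert one Alpaca-style record into a list of chat messages."""
    user_content = record["instruction"]
    if record.get("input"):
        # Append the optional input below the instruction.
        user_content = f"{user_content}\n\n{record['input']}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": record["output"]},
    ]


# Hypothetical records, for illustration only
records = [
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"},
    {"instruction": "Tell me who you are.", "input": "", "output": "I am a large language model named Qwen..."},
]
dataset = [alpaca_to_messages(r) for r in records]
print(dataset[0])
```

Each element of `dataset` can then be passed to `tokenizer.apply_chat_template` as one `msg`.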

Then run the calibration process with a single line of code:

model.quantize(tokenizer, quant_config=quant_config, calib_data=data)

Finally, save the quantized model:

model.save_quantized(quant_path, safetensors=True, shard_size="4GB")
tokenizer.save_pretrained(quant_path)
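As a quick sanity check after saving, you could inspect the `config.json` in the output directory. The sketch below assumes the saved checkpoint records its settings under a `quantization_config` key, as transformers-format AWQ checkpoints typically do; the helper name is hypothetical and the exact keys may vary between versions.

```python
import json
from pathlib import Path


def read_quantization_config(model_dir):
    """Return the `quantization_config` dict from config.json, or None if absent."""
    config_file = Path(model_dir) / "config.json"
    with config_file.open(encoding="utf-8") as f:
        config = json.load(f)
    return config.get("quantization_config")


quant_path = Path("your_quantized_model_path")
if (quant_path / "config.json").exists():
    # Only runs when the placeholder path has been replaced with a real directory.
    print(read_quantization_config(quant_path))
```

A result of `None` would suggest the directory holds a non-quantized checkpoint or that saving did not complete as expected.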

You now have an AWQ-quantized model that is ready for deployment. Enjoy!