GGUF

Running LLMs locally has become popular in the community, and running GGUF files with llama.cpp is a typical example. With llama.cpp, you can not only build GGUF files for your models but also perform low-bit quantization. You can quantize your models directly without calibration, apply the AWQ scales for better quality, or use an importance matrix (imatrix) computed from calibration data. In this document, we demonstrate the simplest way to quantize your model as well as how to apply the AWQ scales to your Qwen model quantization.

Quantize Your Models and Make GGUF Files

Before you move on to quantization, make sure you have followed the instructions and can already run llama.cpp. The following guidance will NOT cover installation and building. Now, suppose you would like to quantize Qwen1.5-7B-Chat. You first need to make a GGUF file for the fp16 model as shown below:

python convert-hf-to-gguf.py Qwen/Qwen1.5-7B-Chat --outfile models/7B/qwen1_5-7b-chat-fp16.gguf

where the first argument refers to the path to the HF model directory or the HF model name, and the second argument refers to the path of your output GGUF file (here I just put it under the directory models/7B; remember to create the directory first, e.g., with mkdir -p models/7B, before you run the command). In this way, you have generated a GGUF file for your fp16 model, and you then need to quantize it to low bits based on your requirements. An example of quantizing the model to 4 bits is shown below:

./quantize models/7B/qwen1_5-7b-chat-fp16.gguf models/7B/qwen1_5-7b-chat-q4_0.gguf q4_0

where q4_0 specifies the 4-bit quantization method. At this point, you have quantized the model to 4 bits and packed it into a GGUF file, which can be run directly with llama.cpp.
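
As a quick sanity check, you can run the quantized file with llama.cpp's main example. This is only a minimal sketch: the prompt is arbitrary, and note that newer llama.cpp builds name this binary llama-cli instead of main:

# -m: model file, -p: prompt, -n: number of tokens to generate
./main -m models/7B/qwen1_5-7b-chat-q4_0.gguf -p "Give me a short introduction to large language models." -n 256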

Quantize Your Models With AWQ Scales

To improve the quality of your quantized models, one possible solution is to apply the AWQ scales, following this script. First of all, when you run model.quantize() with AutoAWQ, remember to add export_compatible=True as shown below:

...
# With export_compatible=True, quantize() computes and applies the AWQ scales
# but skips the actual low-bit packing, so the model stays in fp16.
model.quantize(
    tokenizer,
    quant_config=quant_config,
    export_compatible=True
)
# Saves the fp16 model with the AWQ scales applied.
model.save_quantized(quant_path)
...

With model.save_quantized() as shown above, an fp16 model with the AWQ scales applied is saved. Then, when you run convert-hf-to-gguf.py, remember to replace the model path with the path to this fp16 model, e.g.,

python convert-hf-to-gguf.py ${quant_path} --outfile models/7B/qwen1_5-7b-chat-fp16-awq.gguf
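
Then quantize the AWQ-scaled fp16 GGUF to low bits just as before. The output filename below is only a suggestion:

./quantize models/7B/qwen1_5-7b-chat-fp16-awq.gguf models/7B/qwen1_5-7b-chat-q4_0-awq.gguf q4_0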

In this way, you can apply the AWQ scales to your quantized models in GGUF format, which helps improve the model quality.

We usually quantize the fp16 model to 2, 3, 4, 5, 6, and 8-bit models. To perform a different low-bit quantization, just replace the quantization method in your command. For example, if you want to quantize your model to a 2-bit model, replace q4_0 with q2_k as demonstrated below:

./quantize models/7B/qwen1_5-7b-chat-fp16.gguf models/7B/qwen1_5-7b-chat-q2_k.gguf q2_k
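The exact set of quantization types depends on your build of llama.cpp. Depending on the version, running the quantize tool without the required arguments prints a usage message that lists the allowed quantization types:

# prints usage, including the allowed quantization types, then exits
./quantize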

We now provide GGUF models in the following quantization levels: q2_k, q3_k_m, q4_0, q4_k_m, q5_0, q5_k_m, q6_k, and q8_0. For more information, please visit llama.cpp.
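
Finally, the introduction also mentioned quantizing with an importance matrix (imatrix) computed from calibration data. This path is not covered in detail here; the following is only a rough sketch, assuming a llama.cpp build that includes the imatrix tool and a plain-text calibration file of your own (calibration.txt below is a placeholder):

# compute an importance matrix from calibration text
./imatrix -m models/7B/qwen1_5-7b-chat-fp16.gguf -f calibration.txt -o imatrix.dat

# quantize with the importance matrix applied
./quantize --imatrix imatrix.dat models/7B/qwen1_5-7b-chat-fp16.gguf models/7B/qwen1_5-7b-chat-q4_k_m.gguf q4_k_m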