llama.cpp¶
llama.cpp is a C++ library for LLM inference with minimal setup, and it lets you run Qwen on your local machine. It is a plain C/C++ implementation without dependencies, with AVX, AVX2 and AVX512 support for x86 architectures. It provides 2-, 3-, 4-, 5-, 6-, and 8-bit quantization for faster inference and a reduced memory footprint. It also supports CPU+GPU hybrid inference, which partially accelerates models larger than the total VRAM capacity. llama.cpp runs models stored in the GGUF (GPT-Generated Unified Format) format. For more information, please refer to the official GitHub repo. Here we demonstrate how to run Qwen with llama.cpp.
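To get a feel for how the bit width affects the memory footprint, the sketch below estimates the weight-only size of a model as parameters × bits / 8. The helper name `estimate_gib` and the 7-billion-parameter count are illustrative assumptions, and the estimate ignores quantization metadata, the KV cache, and runtime buffers, so real GGUF files and memory use are somewhat larger.

```shell
#!/bin/sh
# Rough weight-only size estimate: parameters * bits / 8 bytes, shown in GiB.
# Assumption: per-block scales, the KV cache, and runtime buffers are ignored.
estimate_gib() {
    params_billion=$1   # e.g. 7 for a hypothetical 7B-parameter model
    bits=$2             # nominal bits per weight
    awk -v p="$params_billion" -v b="$bits" \
        'BEGIN { printf "%.1f\n", p * 1e9 * b / 8 / (1024 * 1024 * 1024) }'
}

# Compare the quantization widths listed above with unquantized 16-bit weights.
for bits in 2 3 4 5 6 8 16; do
    printf '%2s-bit: ~%s GiB\n' "$bits" "$(estimate_gib 7 "$bits")"
done
```

Under these assumptions, 4-bit weights for a 7B model take roughly 3.3 GiB, compared with about 13.0 GiB at 16 bits.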
Prerequisites¶
This example targets Linux and macOS. First, clone the repo and enter the directory:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
Then build with make:
make
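On a multi-core machine the build can usually be sped up with make's standard `-j` option. The following is a minimal sketch, assuming `getconf _NPROCESSORS_ONLN` is available to report the core count (it is on typical Linux and macOS systems); the variable name `jobs` is our own.

```shell
#!/bin/sh
# Detect the number of online CPU cores; fall back to 1 if detection fails.
jobs=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Only build when a Makefile is present, i.e. when run from the llama.cpp directory.
if [ -f Makefile ]; then
    make -j"$jobs"
else
    echo "No Makefile here; run this from the llama.cpp directory." >&2
fi
```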
Then you can run GGUF files with llama.cpp.
Running Qwen GGUF Files¶
We provide a series of GGUF models in our Hugging Face organization. To find the one you need, search for repo names containing -GGUF. Download the GGUF file you want with huggingface-cli (install it first with pip install huggingface_hub):
huggingface-cli download <model_repo> <gguf_file> --local-dir <local_dir> --local-dir-use-symlinks False
For example:
huggingface-cli download Qwen/Qwen1.5-7B-Chat-GGUF qwen1_5-7b-chat-q5_k_m.gguf --local-dir . --local-dir-use-symlinks False
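If you download several models or quantization levels, it can help to derive the repo and file names in a script. The sketch below assumes the naming pattern seen in the example above (lowercase model name, dots replaced by underscores, followed by the quantization suffix); this pattern is inferred from that single example and may not hold for every repo, so check the file list on Hugging Face first. The helper `gguf_file` is hypothetical, and the script only prints the download command rather than running it.

```shell
#!/bin/sh
# Derive a GGUF file name from a model name and a quantization type.
# Assumption: files follow the pattern of the example above, e.g.
#   Qwen1.5-7B-Chat + q5_k_m -> qwen1_5-7b-chat-q5_k_m.gguf
gguf_file() {
    model=$1
    quant=$2
    printf '%s-%s.gguf\n' "$(printf '%s' "$model" | tr 'A-Z.' 'a-z_')" "$quant"
}

model="Qwen1.5-7B-Chat"
quant="q5_k_m"

# Print the download command so it can be inspected before running it.
echo huggingface-cli download "Qwen/${model}-GGUF" "$(gguf_file "$model" "$quant")" \
    --local-dir . --local-dir-use-symlinks False
```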
Then you can run the model with the following command:
./main -m qwen1_5-7b-chat-q5_k_m.gguf -n 512 --color -i -cml -f prompts/chat-with-qwen.txt
Here, -n sets the maximum number of tokens to generate. Other options are available; run ./main -h to list them.
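As an illustration of adjusting these options, the sketch below assembles the same command with the token limit and the number of GPU-offloaded layers as parameters. It assumes a GPU-enabled build for `-ngl` (the option for offloading layers to the GPU) to have any effect, and the helper name `build_cmd` and the layer count of 20 are our own choices, not recommended values. The script prints the command instead of running it.

```shell
#!/bin/sh
# Assemble a ./main invocation from a few commonly adjusted options.
# Assumption: the model path contains no spaces.
build_cmd() {
    model=$1
    max_tokens=${2:-512}   # -n: maximum number of tokens to generate
    gpu_layers=${3:-0}     # -ngl: layers to offload to the GPU (0 = CPU only)
    echo "./main -m $model -n $max_tokens -ngl $gpu_layers --color -i -cml -f prompts/chat-with-qwen.txt"
}

# Example: offload 20 layers (an arbitrary illustrative value) and keep -n at 512.
build_cmd qwen1_5-7b-chat-q5_k_m.gguf 512 20
```

To run the printed command from the llama.cpp directory, you could wrap the call as `eval "$(build_cmd qwen1_5-7b-chat-q5_k_m.gguf 512 20)"`.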
Make Your GGUF Files¶
We describe how to create and quantize GGUF files in quantization/llama.cpp. Refer to that document for more information.
Perplexity Evaluation¶
llama.cpp provides a tool for evaluating the perplexity of GGUF models. To use it, you need to prepare a dataset, say “wiki test”. Here we demonstrate an example run.
First, download the dataset:
wget "https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research" -O wikitext-2-raw-v1.zip
unzip wikitext-2-raw-v1.zip
Then you can run the test with the following command:
./perplexity -m models/7B/ggml-model-q4_0.gguf -f wiki.test.raw
The output looks like this:
perplexity : calculating perplexity over 655 chunks
24.43 seconds per pass - ETA 4.45 hours
[1]4.5970,[2]5.1807,[3]6.0382,...
Once the run finishes, you will get the perplexity of the model.
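Perplexity is the exponential of the average negative log-likelihood per token, so lower values are better. The sketch below pulls the last bracketed value out of output shaped like the sample above. It assumes each `[i]value` pair is a running estimate over the chunks processed so far, so that the last pair reflects the whole dataset; the helper name `final_ppl` and the log file `ppl.log` are hypothetical.

```shell
#!/bin/sh
# Read perplexity output on stdin and print the value of the last "[i]value" pair.
# Assumption: each pair is a running estimate, so the last one is the overall result.
final_ppl() {
    grep -o '\[[0-9][0-9]*][0-9][0-9.]*' | tail -n 1 | sed 's/^\[[0-9]*]//'
}

# Demonstration on the first values from the sample output above.
printf '%s\n' '[1]4.5970,[2]5.1807,[3]6.0382,' | final_ppl

# With a real run, save the output to a file first, for example:
#   ./perplexity -m models/7B/ggml-model-q4_0.gguf -f wiki.test.raw > ppl.log
#   final_ppl < ppl.log
```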
Use GGUF with LM Studio¶
If you still find it difficult to use llama.cpp, we recommend trying LM Studio, a platform for searching for and running local LLMs. Qwen1.5 is already officially part of LM Studio. Have fun!