vLLM
We recommend trying vLLM for your deployment of Qwen. It is simple to use and fast, with state-of-the-art serving throughput, efficient management of attention key-value memory with PagedAttention, continuous batching of input requests, optimized CUDA kernels, etc. To learn more about vLLM, please refer to the paper and documentation.
Installation
By default, you can install vLLM with pip:
pip install "vllm>=0.3.0"
If you are using CUDA 11.8, check the note in the official installation
documentation
(link)
for help. We also advise you to install ray with pip install ray
for distributed serving.
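After installing, it can help to confirm which versions ended up in the environment. The snippet below is a minimal sketch using only the Python standard library; the helper names `installed_version` and `report` are illustrative and not part of vLLM.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(packages=("vllm", "ray")) -> dict:
    """Map each package name to its installed version (None when missing)."""
    return {name: installed_version(name) for name in packages}


# Print one line per package so a missing install is easy to spot.
for name, version in report().items():
    print(f"{name}: {version or 'not installed'}")
```

If vllm is reported as not installed, or its version is older than 0.3.0, rerun the pip command above.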
Offline Batched Inference
Models supported by the Qwen2 code, e.g., Qwen1.5, are supported by vLLM. The simplest way to use vLLM is offline batched inference, as demonstrated below.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B-Chat")
# Pass the default decoding hyperparameters of Qwen1.5-7B-Chat
# max_tokens is for the maximum length for generation.
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512)
# Input the model name or path. Can be GPTQ or AWQ models.
llm = LLM(model="Qwen/Qwen1.5-7B-Chat")
# Prepare your prompts
prompt = "Tell me something about large language models."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# generate outputs
outputs = llm.generate([text], sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
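The example above sends a single prompt, but `llm.generate` accepts a list, which is where batching pays off. The following is a sketch of the same flow for several prompts, assuming the same model and sampling settings; `build_messages` and `generate_batch` are hypothetical helper names introduced here for illustration.

```python
def build_messages(prompt, system="You are a helpful assistant."):
    """Wrap one user prompt in the chat message format used above."""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
    ]


def generate_batch(prompts, model="Qwen/Qwen1.5-7B-Chat"):
    """Generate one reply per prompt with a single llm.generate call."""
    # Imported lazily so that build_messages stays usable without a GPU.
    from transformers import AutoTokenizer
    from vllm import LLM, SamplingParams

    tokenizer = AutoTokenizer.from_pretrained(model)
    # Apply the chat template to every prompt before batching.
    texts = [
        tokenizer.apply_chat_template(
            build_messages(p), tokenize=False, add_generation_prompt=True
        )
        for p in prompts
    ]
    sampling_params = SamplingParams(
        temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512
    )
    llm = LLM(model=model)
    outputs = llm.generate(texts, sampling_params)
    # Keep only the generated text of the first candidate for each prompt.
    return [output.outputs[0].text for output in outputs]
```

Calling `generate_batch(["What is vLLM?", "What is PagedAttention?"])` would load the model once and process both prompts together.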
OpenAI-API Compatible API Service
It is easy to build an OpenAI-API compatible API service with vLLM,
which can be deployed as a server that implements the OpenAI API protocol.
By default, it starts the server at http://localhost:8000. You can
specify the address with the --host and --port arguments. Run the
command as shown below:
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen1.5-7B-Chat
You don't need to worry about the chat template, as the server uses the chat template provided by the tokenizer by default.
Then, you can use the chat completions interface to communicate with Qwen:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen1.5-7B-Chat",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me something about large language models."}
]
}'
Alternatively, you can use the Python client from the openai Python
package, as shown below:
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
    model="Qwen/Qwen1.5-7B-Chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ]
)
print("Chat response:", chat_response)
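The client above prints the whole response object; in an OpenAI-style chat completion the reply text sits under `choices[0].message.content`. As a sketch of what the curl command and the client do over the wire, the same request can be built with only the standard library. The helper names `build_chat_request` and `ask` are illustrative, and the server address assumes the default shown earlier.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000/v1",
                       model="Qwen/Qwen1.5-7B-Chat"):
    """Build the POST request that the curl command above sends."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the request and return only the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    # OpenAI-style responses put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("Tell me something about large language models."))` would print just the generated answer.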
Multi-GPU Distributed Serving
To scale up your serving throughput, distributed serving lets you
leverage more GPU devices. Moreover, large models like
Qwen1.5-72B-Chat cannot be served on a single GPU.
Here, we demonstrate how to run Qwen1.5-72B-Chat with tensor
parallelism just by passing in the argument tensor_parallel_size:
from vllm import LLM, SamplingParams
llm = LLM(model="Qwen/Qwen1.5-72B-Chat", tensor_parallel_size=4)
You can run multi-GPU serving by passing in the argument
--tensor-parallel-size:
python -m vllm.entrypoints.api_server \
--model Qwen/Qwen1.5-72B-Chat \
--tensor-parallel-size 4
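A back-of-the-envelope calculation shows why tensor parallelism is needed here. The sketch below assumes a nominal 72e9 parameters stored in 16-bit precision (2 bytes each) and counts weights only; the KV cache and activations need additional memory, so real requirements are higher. The function name `weight_memory_gib` is illustrative.

```python
def weight_memory_gib(num_params, bytes_per_param=2, tensor_parallel_size=1):
    """Approximate weight memory per GPU in GiB under tensor parallelism."""
    total_bytes = num_params * bytes_per_param
    # Tensor parallelism shards the weights roughly evenly across GPUs.
    return total_bytes / tensor_parallel_size / 2**30


# Assumption: 72e9 parameters stored in 16-bit precision (2 bytes each).
single_gpu = weight_memory_gib(72e9)
per_gpu_tp4 = weight_memory_gib(72e9, tensor_parallel_size=4)
print(f"1 GPU: {single_gpu:.1f} GiB, 4 GPUs: {per_gpu_tp4:.1f} GiB each")
```

Under these assumptions the weights alone take roughly 134 GiB, well above the 80 GB of an A100-80GB card, while splitting across four GPUs brings the share per device to about 33.5 GiB.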
Serving Quantized Models
vLLM supports different types of quantized models, including AWQ, GPTQ,
SqueezeLLM, etc. Here we show how to deploy AWQ and GPTQ models. The
usage is almost the same as above except for an additional argument for
quantization. For example, to run an AWQ model, e.g.,
Qwen1.5-7B-Chat-AWQ:
from vllm import LLM, SamplingParams
llm = LLM(model="Qwen/Qwen1.5-7B-Chat-AWQ", quantization="awq")
or GPTQ models like Qwen1.5-7B-Chat-GPTQ-Int8:
llm = LLM(model="Qwen/Qwen1.5-7B-Chat-GPTQ-Int8", quantization="gptq")
Similarly, you can start the server with the additional argument
--quantization, as shown below:
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen1.5-7B-Chat-AWQ \
--quantization awq
or
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen1.5-7B-Chat-GPTQ-Int8 \
--quantization gptq
Additionally, vLLM supports the combination of AWQ or GPTQ models with KV cache quantization, namely FP8 E5M2 KV Cache. For example:
llm = LLM(model="Qwen/Qwen1.5-7B-Chat-GPTQ-Int8", quantization="gptq", kv_cache_dtype="fp8_e5m2")
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen1.5-7B-Chat-GPTQ-Int8 \
--quantization gptq \
--kv-cache-dtype fp8_e5m2
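To see what an 8-bit KV cache saves, the per-token cache size can be estimated from the model shape. This is a rough sketch: the shape values below (32 layers, 32 key-value heads, head size 128) are assumptions for a 7B-class model and should be checked against the model's config, and `kv_cache_bytes_per_token` is an illustrative helper, not a vLLM function.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed shape for a 7B-class model: 32 layers, 32 KV heads, head size 128.
SHAPE = dict(num_layers=32, num_kv_heads=32, head_dim=128)

fp16 = kv_cache_bytes_per_token(**SHAPE, bytes_per_value=2)
fp8 = kv_cache_bytes_per_token(**SHAPE, bytes_per_value=1)
print(f"FP16: {fp16 / 2**20:.2f} MiB/token, FP8: {fp8 / 2**20:.2f} MiB/token")
```

Under these assumptions the cache shrinks from 0.50 MiB to 0.25 MiB per token, so the same memory budget holds about twice as many tokens.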
Troubleshooting
You may encounter out-of-memory (OOM) issues. We recommend two
arguments that can help. The first is
--max-model-len. The default max_position_embeddings we provide is
32768, so the maximum sequence length for serving is also this
value, which raises the memory requirement. Reducing it to a length
that suits your use case often resolves the OOM issue. The other argument
is --gpu-memory-utilization. By default it is
0.9, and you can raise it to tackle the OOM problem. This is also
why a vLLM service always appears to take so much memory: vLLM
pre-allocates that fraction of GPU memory.
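The same two settings are available for offline inference as the `max_model_len` and `gpu_memory_utilization` keyword arguments of `LLM`. The sketch below validates them before constructing the engine; the helpers `memory_kwargs` and `load_llm` are hypothetical, and 8192 is only an example length, not a recommended value.

```python
def memory_kwargs(max_model_len=8192, gpu_memory_utilization=0.9):
    """Validate and return the two memory-related engine arguments."""
    if max_model_len <= 0:
        raise ValueError("max_model_len must be positive")
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return {
        "max_model_len": max_model_len,
        "gpu_memory_utilization": gpu_memory_utilization,
    }


def load_llm(model="Qwen/Qwen1.5-7B-Chat", **overrides):
    """Create an LLM with a shorter context window to reduce memory pressure."""
    from vllm import LLM  # imported lazily; requires a GPU environment

    return LLM(model=model, **memory_kwargs(**overrides))
```

For instance, `load_llm(max_model_len=4096, gpu_memory_utilization=0.95)` would cap sequences at 4096 tokens and let vLLM use 95% of GPU memory.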