Quickstart¶
This guide helps you quickly start using Qwen1.5. We provide examples of Hugging Face Transformers as well as ModelScope, and vLLM for deployment.
Hugging Face Transformers & ModelScope¶
To get a quick start with Qwen1.5, we advise you to try with the
inference with transformers
first. Make sure that you have installed
transformers>=4.37.0
. The following is a very simple code snippet
showing how to run Qwen1.5-Chat, with an example of Qwen1.5-7B-Chat:
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # the device to load the model onto
# Now you do not need to add "trust_remote_code=True"
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen1.5-7B-Chat",
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B-Chat")
# Instead of using model.chat(), we directly use model.generate()
# But you need to use tokenizer.apply_chat_template() to format your inputs as shown below
prompt = "Give me a short introduction to large language model."
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)
# Directly use generate() and tokenizer.decode() to get the output.
# Use `max_new_tokens` to control the maximum output length.
generated_ids = model.generate(
model_inputs.input_ids,
max_new_tokens=512
)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
Previously, we use model.chat()
(see modeling_qwen.py
in
previous Qwen models for more information). Now, we follow the practice
of transformers
and directly use model.generate()
with
apply_chat_template()
in tokenizer.
If you would like to apply Flash Attention 2, you can load the model as shown below:
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen1.5-7B-Chat",
torch_dtype="auto",
device_map="auto",
attn_implementation="flash_attention_2",
)
To tackle with downloading issues, we advise you to try with from ModelScope, just changing the first line of code above to the following:
from modelscope import AutoModelForCausalLM, AutoTokenizer
Streaming mode for model chat is simple with the help of
TextStreamer
. Below we show you an example of how to use it:
...
# Reuse the code before `model.generate()` in the last code snippet
from transformers import TextStreamer
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
generated_ids = model.generate(
model_inputs.input_ids,
max_new_tokens=512,
streamer=streamer,
)
vLLM for Deployment¶
To deploy Qwen1.5, we advise you to use vLLM. vLLM is a fast and easy-to-use framework for LLM inference and serving. In the following, we demonstrate how to build a OpenAI-API compatible API service with vLLM.
First, make sure you have installed vLLM>=0.3.0
:
pip install vllm
Run the following code to build up a vllm service. Here we take Qwen1.5-7B-Chat as an example:
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen1.5-7B-Chat
Then, you can use the create chat interface to communicate with Qwen:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen1.5-7B-Chat",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me something about large language models."}
],
}'
or you can use python client with openai
python package as shown
below:
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
model="Qwen/Qwen1.5-7B-Chat",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me something about large language models."},
]
)
print("Chat response:", chat_response)
Next Step¶
Now, you can have fun with Qwen models. Would love to know more about its usages? Feel free to check other documents in this documentation.