Quickstart
==========

This guide helps you get started with Qwen1.5 quickly. We provide examples for
`Hugging Face Transformers <https://github.com/huggingface/transformers>`__
and `ModelScope <https://github.com/modelscope/modelscope>`__, as well as
`vLLM <https://github.com/vllm-project/vllm>`__ for deployment.

Hugging Face Transformers & ModelScope
--------------------------------------

To get started with Qwen1.5, we recommend trying inference with
``transformers`` first. Make sure that you have installed
``transformers>=4.37.0``. The following simple snippet shows how to run
a Qwen1.5 chat model, using Qwen1.5-7B-Chat as an example:

.. code:: python

   from transformers import AutoModelForCausalLM, AutoTokenizer
   device = "cuda" # the device to load the model onto

   # Now you do not need to add "trust_remote_code=True"
   model = AutoModelForCausalLM.from_pretrained(
       "Qwen/Qwen1.5-7B-Chat",
       torch_dtype="auto",
       device_map="auto"
   )
   tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B-Chat")

   # Instead of using model.chat(), we directly use model.generate()
   # But you need to use tokenizer.apply_chat_template() to format your inputs as shown below
   prompt = "Give me a short introduction to large language model."
   messages = [
       {"role": "system", "content": "You are a helpful assistant."},
       {"role": "user", "content": prompt}
   ]
   text = tokenizer.apply_chat_template(
       messages,
       tokenize=False,
       add_generation_prompt=True
   )
   model_inputs = tokenizer([text], return_tensors="pt").to(device)

   # Directly use generate() and tokenizer.decode() to get the output.
   # Use `max_new_tokens` to control the maximum output length.
   generated_ids = model.generate(
       model_inputs.input_ids,
       max_new_tokens=512
   )
   generated_ids = [
       output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
   ]

   response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
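
For decoder-only models, ``generate()`` returns the prompt tokens followed
by the newly generated tokens, which is why the snippet above slices off the
first ``len(input_ids)`` entries of each output. A minimal sketch of that
slicing with plain Python lists (the token IDs are made-up placeholders, not
real Qwen1.5 IDs):

.. code:: python

   # Hypothetical token IDs, for illustration only.
   prompt_batch = [[101, 7, 8, 9]]
   # What generate() would return: the prompt tokens plus the new tokens.
   output_batch = [[101, 7, 8, 9, 42, 43, 2]]

   # Keep only the tokens that come after the prompt in each sequence.
   new_tokens = [
       output_ids[len(input_ids):]
       for input_ids, output_ids in zip(prompt_batch, output_batch)
   ]
   print(new_tokens)  # [[42, 43, 2]]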

Previously, we used ``model.chat()`` (see ``modeling_qwen.py`` in
previous Qwen models for more information). Now, we follow the practice
of ``transformers`` and directly use ``model.generate()`` together with
the tokenizer's ``apply_chat_template()``.
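
To make the role of ``apply_chat_template()`` more concrete, the sketch below
approximates the ChatML-style text it builds from a list of messages. The
helper ``render_chatml`` is hypothetical and written only for illustration;
the authoritative template ships with the tokenizer and may differ in details
(for example, in how a missing system message is handled).

.. code:: python

   def render_chatml(messages, add_generation_prompt=True):
       """Approximate the ChatML prompt text (illustrative, not the real template)."""
       parts = [
           f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
           for message in messages
       ]
       if add_generation_prompt:
           # Open the assistant turn so that the model continues from here.
           parts.append("<|im_start|>assistant\n")
       return "".join(parts)


   messages = [
       {"role": "system", "content": "You are a helpful assistant."},
       {"role": "user", "content": "Give me a short introduction to large language model."},
   ]
   print(render_chatml(messages))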

If you would like to apply Flash Attention 2 (this requires the ``flash-attn`` package to be installed), you can load the model as shown below:

.. code:: python

   model = AutoModelForCausalLM.from_pretrained(
       "Qwen/Qwen1.5-7B-Chat",
       torch_dtype="auto",
       device_map="auto",
       attn_implementation="flash_attention_2",
   )

If you run into problems downloading the model from Hugging Face, we
recommend using ModelScope instead. Simply replace the ``transformers``
import in the first snippet with the following:

.. code:: python

   from modelscope import AutoModelForCausalLM, AutoTokenizer

Streaming mode for model chat is simple with the help of
``TextStreamer``. Below we show you an example of how to use it:

.. code:: python

   ...
   # Reuse the code before `model.generate()` in the last code snippet
   from transformers import TextStreamer
   streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
   generated_ids = model.generate(
       model_inputs.input_ids,
       max_new_tokens=512,
       streamer=streamer,
   )
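
``TextStreamer`` prints to standard output. If you need the text chunks in
your own code (for example, to forward them to a web client),
``transformers`` also provides ``TextIteratorStreamer``, which can be
iterated over while ``generate()`` runs in a background thread. A minimal
sketch, in which ``stream_generate`` is a hypothetical helper name and not
part of any library:

.. code:: python

   from threading import Thread


   def stream_generate(model, streamer, **generate_kwargs):
       """Run model.generate() in a background thread and yield text chunks.

       `streamer` is expected to behave like transformers.TextIteratorStreamer:
       generate() feeds it, and iterating over it yields decoded text.
       """
       thread = Thread(
           target=model.generate,
           kwargs={**generate_kwargs, "streamer": streamer},
       )
       thread.start()
       try:
           for chunk in streamer:
               yield chunk
       finally:
           # Wait for generation to finish before returning.
           thread.join()


   # Intended usage (assumes `model`, `tokenizer`, and `model_inputs` from the
   # snippets above):
   #
   #   from transformers import TextIteratorStreamer
   #   streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
   #   for chunk in stream_generate(model, streamer, **model_inputs, max_new_tokens=512):
   #       print(chunk, end="", flush=True)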

vLLM for Deployment
-------------------

To deploy Qwen1.5, we recommend vLLM, a fast and easy-to-use framework
for LLM inference and serving. Below, we demonstrate how to build an
OpenAI-API-compatible service with vLLM.

First, make sure you have installed ``vllm>=0.3.0``:

.. code:: bash

   pip install "vllm>=0.3.0"

Run the following command to start a vLLM service. Here we take
Qwen1.5-7B-Chat as an example:

.. code:: bash

   python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen1.5-7B-Chat

Then, you can use the `create chat
interface <https://platform.openai.com/docs/api-reference/chat/completions/create>`__
to communicate with Qwen:

.. code:: bash

   curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "Qwen/Qwen1.5-7B-Chat",
     "messages": [
       {"role": "system", "content": "You are a helpful assistant."},
       {"role": "user", "content": "Tell me something about large language models."}
     ]
   }'
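
The same request can be sent from Python without extra dependencies. The
sketch below uses only the standard library and assumes the server from the
previous step is listening on ``localhost:8000``; ``build_chat_request`` is a
hypothetical helper name. Serializing the payload with ``json.dumps`` avoids
hand-written JSON mistakes such as trailing commas.

.. code:: python

   import json
   import urllib.request


   def build_chat_request(model, messages, base_url="http://localhost:8000/v1"):
       """Build a POST request for the OpenAI-compatible chat completions endpoint."""
       payload = json.dumps({"model": model, "messages": messages}).encode("utf-8")
       return urllib.request.Request(
           f"{base_url}/chat/completions",
           data=payload,
           headers={"Content-Type": "application/json"},
           method="POST",
       )


   request = build_chat_request(
       "Qwen/Qwen1.5-7B-Chat",
       [
           {"role": "system", "content": "You are a helpful assistant."},
           {"role": "user", "content": "Tell me something about large language models."},
       ],
   )

   # Sending the request requires a running server, so it is left commented out:
   # with urllib.request.urlopen(request) as response:
   #     print(json.load(response)["choices"][0]["message"]["content"])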

Alternatively, you can use the Python client from the ``openai`` package,
as shown below:

.. code:: python

   from openai import OpenAI
   # Set OpenAI's API key and API base to use vLLM's API server.
   openai_api_key = "EMPTY"
   openai_api_base = "http://localhost:8000/v1"

   client = OpenAI(
       api_key=openai_api_key,
       base_url=openai_api_base,
   )

   chat_response = client.chat.completions.create(
       model="Qwen/Qwen1.5-7B-Chat",
       messages=[
           {"role": "system", "content": "You are a helpful assistant."},
           {"role": "user", "content": "Tell me something about large language models."},
       ]
   )
   print("Chat response:", chat_response)
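
Printing ``chat_response`` shows the whole response object. In the ``openai``
Python package (version 1.x, which the ``OpenAI`` client class above
implies), the generated text is available at ``choices[0].message.content``.
A small sketch, with ``extract_reply`` as a hypothetical helper name and a
stand-in object used in place of a real response so that it runs without a
server:

.. code:: python

   from types import SimpleNamespace


   def extract_reply(chat_response):
       """Return the assistant text of the first choice in a chat completion."""
       return chat_response.choices[0].message.content


   # Stand-in with the same attribute layout as a chat completion response.
   fake_response = SimpleNamespace(
       choices=[
           SimpleNamespace(
               message=SimpleNamespace(role="assistant", content="Hello!")
           )
       ]
   )
   print(extract_reply(fake_response))  # Hello!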

Next Step
---------

Now you can have fun with Qwen models. Want to know more about their
usage? Feel free to check the other documents in this documentation.