Using Transformers to Chat
==========================

The most important, and also the simplest, way to use Qwen1.5 is to chat
with it through the ``transformers`` library. This document shows how to
chat with ``Qwen1.5-7B-Chat``, with or without streaming.

Basic Usage
-----------

You can chat with Qwen1.5-Chat in just a few lines of ``transformers``
code. Essentially, we build the tokenizer and the model with the
``from_pretrained`` method, and we use the ``generate`` method to chat,
with the help of the chat template provided by the tokenizer.
Below is an example of how to chat with Qwen1.5-7B-Chat:

.. code:: python

   from transformers import AutoModelForCausalLM, AutoTokenizer
   device = "cuda" # the device to load the model onto

   # Now you do not need to add "trust_remote_code=True"
   model = AutoModelForCausalLM.from_pretrained(
       "Qwen/Qwen1.5-7B-Chat",
       torch_dtype="auto",
       device_map="auto"
   )
   tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B-Chat")

   # Instead of using model.chat(), we directly use model.generate()
   # But you need to use tokenizer.apply_chat_template() to format your inputs as shown below
   prompt = "Give me a short introduction to large language model."
   messages = [
       {"role": "system", "content": "You are a helpful assistant."},
       {"role": "user", "content": prompt}
   ]
   text = tokenizer.apply_chat_template(
       messages,
       tokenize=False,
       add_generation_prompt=True
   )
   model_inputs = tokenizer([text], return_tensors="pt").to(device)

   # Directly use generate() and tokenizer.decode() to get the output.
   # Use `max_new_tokens` to control the maximum output length.
   # Unpack model_inputs so the attention mask is passed along with the input IDs.
   generated_ids = model.generate(
       **model_inputs,
       max_new_tokens=512
   )
   generated_ids = [
       output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
   ]

   response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
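
The list comprehension in the example is needed because, for decoder-only
models, ``generate()`` returns the prompt tokens followed by the newly
generated tokens. The following is a minimal sketch of the same slicing on
plain Python lists. The token IDs are made up purely for illustration:

.. code:: python

   # Hypothetical token IDs; real values come from the tokenizer and the model.
   prompt_batch = [[101, 102, 103]]            # stands in for model_inputs.input_ids
   output_batch = [[101, 102, 103, 7, 8, 9]]   # stands in for what generate() returns

   # Drop the first len(prompt) tokens of each output row to keep only the reply.
   reply_batch = [
       output_ids[len(input_ids):]
       for input_ids, output_ids in zip(prompt_batch, output_batch)
   ]
   print(reply_batch)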

If you would like to use Flash Attention 2, which requires the ``flash-attn`` package to be installed, you can load the model as shown below:

.. code:: python

   model = AutoModelForCausalLM.from_pretrained(
       "Qwen/Qwen1.5-7B-Chat",
       torch_dtype="auto",
       device_map="auto",
       attn_implementation="flash_attention_2",
   )

Note that the ``chat()`` method from the original Qwen repo is now
replaced by ``generate()``. The ``apply_chat_template()`` function
converts the messages into a format that the model can understand. Its
``add_generation_prompt`` argument appends a generation prompt, namely
``<|im_start|>assistant\n``, to the input. As in our previous practice,
chat models use the ChatML template. The ``max_new_tokens`` argument sets
the maximum length of the response, and ``tokenizer.batch_decode()``
decodes the response. On the input side, ``messages`` above is an example
of how to format your dialog history and system prompt. If you do not
specify a system prompt, ``You are a helpful assistant.`` is used by
default.
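
To make the ChatML layout concrete, here is a rough sketch of the text that
the description above implies. ``render_chatml`` is a hypothetical helper
written for illustration only. It is not part of ``transformers``, and the
template shipped with the tokenizer remains the authoritative source for
details such as whitespace and the exact default system prompt:

.. code:: python

   # Default system prompt as stated in the text above (an assumption here).
   DEFAULT_SYSTEM_PROMPT = "You are a helpful assistant."

   def render_chatml(messages, add_generation_prompt=True):
       """Hypothetical helper: approximate the ChatML text for a list of messages."""
       # Prepend the default system prompt when the caller did not provide one.
       if not messages or messages[0]["role"] != "system":
           messages = [{"role": "system", "content": DEFAULT_SYSTEM_PROMPT}] + list(messages)
       text = ""
       for message in messages:
           # Each turn is wrapped in <|im_start|>{role}\n ... <|im_end|>\n
           text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
       if add_generation_prompt:
           # Open the assistant turn so that the model continues from here.
           text += "<|im_start|>assistant\n"
       return text

   print(render_chatml([{"role": "user", "content": "Hello!"}]))

In real code, prefer ``tokenizer.apply_chat_template()``, which always
matches the template the model was trained with.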

Streaming Mode
--------------

With the help of ``TextStreamer``, you can switch your chat with Qwen to
streaming mode. Below is an example of how to use it:

.. code:: python

   # Repeat the code above before model.generate()
   # Starting here, we add streamer for text generation.
   from transformers import TextStreamer
   streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

   # This will print the output in streaming mode.
   # Unpack model_inputs so that input_ids and attention_mask are passed as keyword arguments.
   generated_ids = model.generate(
       **model_inputs,
       max_new_tokens=512,
       streamer=streamer,
   )

Besides ``TextStreamer``, we can also use ``TextIteratorStreamer``,
which stores print-ready text in a queue so that a downstream
application can consume it as an iterator:

.. code:: python

   # Repeat the code above before model.generate()
   # Starting here, we add streamer for text generation.
   from transformers import TextIteratorStreamer
   streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

   from threading import Thread
   generation_kwargs = dict(model_inputs, streamer=streamer, max_new_tokens=512)
   thread = Thread(target=model.generate, kwargs=generation_kwargs)

   thread.start()
   generated_text = ""
   for new_text in streamer:
       generated_text += new_text
   print(generated_text)
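
The queue-plus-thread pattern can be tried without loading a model. The
sketch below assumes a hypothetical ``QueueStreamer`` class and a
``fake_generate`` function that stand in for ``TextIteratorStreamer`` and
``model.generate``. It only mirrors the idea and is not the library's
implementation:

.. code:: python

   from queue import Queue
   from threading import Thread

   class QueueStreamer:
       """Hypothetical stand-in: an iterator fed by a producer thread through a queue."""

       _STOP = object()  # sentinel that marks the end of the stream

       def __init__(self):
           self._queue = Queue()

       def put(self, text):
           self._queue.put(text)

       def end(self):
           self._queue.put(self._STOP)

       def __iter__(self):
           return self

       def __next__(self):
           item = self._queue.get()  # blocks until the producer supplies text
           if item is self._STOP:
               raise StopIteration
           return item

   def fake_generate(streamer):
       # Stands in for model.generate(); emits fixed chunks instead of model output.
       for chunk in ["Hello", ", ", "world", "!"]:
           streamer.put(chunk)
       streamer.end()

   streamer = QueueStreamer()
   thread = Thread(target=fake_generate, args=(streamer,))
   thread.start()
   generated_text = "".join(streamer)  # consume chunks as they arrive
   thread.join()
   print(generated_text)

Running generation in a separate thread is what lets the main thread
iterate over the streamer while text is still being produced.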

Next Step
---------

Now you can chat with Qwen1.5 with or without streaming. Continue
reading the documentation to explore more advanced usages of
model inference!