# Quickstart

This guide helps you get started with Qwen2 quickly.
We provide examples using [Hugging Face Transformers](https://github.com/huggingface/transformers) and [ModelScope](https://github.com/modelscope/modelscope) for inference, and [vLLM](https://github.com/vllm-project/vllm) for deployment.

You can find Qwen2 models in the [Qwen2 collection](https://huggingface.co/collections/Qwen/qwen2-6659360b33528ced941e557f).

## Hugging Face Transformers & ModelScope

The quickest way to get started with Qwen2 is to run inference with `transformers`.
Make sure that you have installed `transformers>=4.40.0`.
We advise you to use Python 3.8 or higher, and PyTorch 2.2 or higher.

:::{dropdown} Install ``transformers``
* Install with ``pip``:

    ```bash
    pip install transformers -U
    ```

* Install with ``conda``:

    ```bash
    conda install conda-forge::transformers
    ```

* Install from source:

    ```bash
    pip install git+https://github.com/huggingface/transformers
    ```
:::
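
To confirm that the installed versions meet the requirements above, a quick check along the following lines may help. This is a minimal sketch using only the standard library; `parse_version`, `meets_minimum`, and `check` are hypothetical helper names, and the comparison looks only at the leading numeric components, so pre-release and local suffixes such as `.dev0` or `+cu121` are ignored.

```python
import re
from importlib import metadata


def parse_version(version):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break  # stop at the first non-numeric component, e.g. "dev0"
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two versions numerically, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check(package, minimum):
    """Describe whether `package` is installed at version `minimum` or newer."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package}: not installed (need >= {minimum})"
    if meets_minimum(installed, minimum):
        return f"{package} {installed}: OK"
    return f"{package} {installed}: too old (need >= {minimum})"


if __name__ == "__main__":
    print(check("transformers", "4.40.0"))
    print(check("torch", "2.2"))
```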

The following minimal snippet shows how to run a Qwen2 instruct model, using Qwen2-7B-Instruct as an example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Now you do not need to add "trust_remote_code=True"
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

# Instead of using model.chat(), we directly use model.generate()
# But you need to use tokenizer.apply_chat_template() to format your inputs as shown below
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Directly use generate() and tokenizer.decode() to get the output.
# Use `max_new_tokens` to control the maximum output length.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
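
The list comprehension near the end of the snippet is needed because, for decoder-only models, `generate()` returns each prompt followed by the newly generated tokens. The following sketch reproduces that slicing step on plain Python lists; the token IDs are made up for illustration, and `strip_prompt_tokens` is a hypothetical helper name.

```python
def strip_prompt_tokens(input_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [
        output[len(prompt):]
        for prompt, output in zip(input_ids, generated_ids)
    ]


# Two prompts of three tokens each, followed by two new tokens per sequence.
prompts = [[101, 7, 8], [101, 9, 10]]
outputs = [[101, 7, 8, 42, 43], [101, 9, 10, 44, 45]]

print(strip_prompt_tokens(prompts, outputs))  # [[42, 43], [44, 45]]
```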

Previously, we used `model.chat()` (see `modeling_qwen.py` in previous Qwen models for more information).
Now, we follow the practice of `transformers` and directly use `model.generate()` together with the tokenizer's `apply_chat_template()`.
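
To get a feel for what `apply_chat_template()` returns when `tokenize=False`, the sketch below renders messages in the ChatML layout that Qwen2 instruct models use. It is only an approximation written for illustration: `render_chatml` is a hypothetical function, and the template shipped with the tokenizer remains authoritative (for example, it may handle a missing system message differently).

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render chat messages as ChatML text (an approximation of the real template)."""
    text = ""
    for message in messages:
        # Each turn is wrapped in <|im_start|>ROLE ... <|im_end|> markers.
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        text += "<|im_start|>assistant\n"
    return text


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi"},
]
print(render_chatml(messages))
```

Printing the real `text` variable from the snippet above is the most reliable way to see the exact prompt the model receives.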


### Streaming Generation

Streaming mode for model chat is simple with the help of `TextStreamer`. 
Below we show you an example of how to use it:

```python
...
# Reuse the code before `model.generate()` in the last code snippet
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    streamer=streamer,
)
```

The text is printed to the console as it is generated.
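
`TextStreamer` writes to standard output. If you would rather receive the chunks in your own code, for example to forward them to a web UI, `transformers` also provides `TextIteratorStreamer`. The following is a minimal sketch, assuming `model` and `tokenizer` were loaded as in the first snippet; `stream_chat` is a hypothetical helper name.

```python
from threading import Thread


def stream_chat(model, tokenizer, messages, max_new_tokens=512):
    """Yield decoded text chunks while generation runs in a background thread."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import TextIteratorStreamer

    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    generation_kwargs = dict(
        model_inputs, max_new_tokens=max_new_tokens, streamer=streamer
    )

    # generate() blocks until it finishes, so it runs in a separate thread
    # while this generator consumes the streamer.
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()
    for chunk in streamer:
        yield chunk
    thread.join()
```

Usage would look like `for chunk in stream_chat(model, tokenizer, messages): print(chunk, end="", flush=True)`.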

### ModelScope

To work around download issues, we advise you to try [ModelScope](https://github.com/modelscope/modelscope).
Before starting, install `modelscope` with `pip`.

`modelscope` adopts a programmatic interface similar (but not identical) to `transformers`.
For basic usage, you can simply change the import line of the `transformers` snippet above to the following:

```python
from modelscope import AutoModelForCausalLM, AutoTokenizer
```

For more information, please refer to [the documentation of `modelscope`](https://www.modelscope.cn/docs).

## vLLM for Deployment

To deploy Qwen2, we advise you to use vLLM, a fast and easy-to-use framework for LLM inference and serving.
Below, we demonstrate how to build an OpenAI-API-compatible service with vLLM.

First, make sure you have installed `vllm>=0.4.0`:

```bash
pip install vllm
```

Run the following command to start a vLLM service.
Here we take Qwen2-7B-Instruct as an example:

```bash
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-7B-Instruct
```

With `vllm>=0.5.3`, you can also use:

```bash
vllm serve Qwen/Qwen2-7B-Instruct
```
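
Loading the model weights can take some time, so the server may not accept requests immediately after the command starts. The sketch below polls the server until it responds. It assumes the default address `http://localhost:8000` used in this guide, that the OpenAI-compatible `/v1/models` endpoint is reachable without an API key, and `wait_for_server` is a hypothetical helper name.

```python
import time
import urllib.request


def wait_for_server(
    base_url="http://localhost:8000/v1",
    timeout=300.0,
    interval=2.0,
    request_timeout=5.0,
):
    """Poll `<base_url>/models` until it returns HTTP 200 or `timeout` seconds pass."""
    url = base_url.rstrip("/") + "/models"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=request_timeout) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Covers refused connections, timeouts, and HTTP error statuses.
            pass
        time.sleep(interval)
    return False


if __name__ == "__main__":
    print("ready" if wait_for_server(timeout=10.0) else "not ready yet")
```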

Then, you can use the [create chat interface](https://platform.openai.com/docs/api-reference/chat/completions/create) to communicate with Qwen:

```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen2-7B-Instruct",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me something about large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "repetition_penalty": 1.05,
  "max_tokens": 512
}'
```

Alternatively, you can use the Python client from the `openai` package, as shown below:

```python
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
    extra_body={
        "repetition_penalty": 1.05,
    },
)
print("Chat response:", chat_response)
```
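
If you prefer not to install the `openai` package, the same request can be sent with the Python standard library. The sketch below mirrors the `curl` call above and pulls the assistant's text out of the JSON response; `build_chat_request`, `extract_reply`, and `chat` are hypothetical helper names, and the server address is assumed to be the one used throughout this guide.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, **sampling):
    """Build a POST request for the chat completions endpoint."""
    payload = {"model": model, "messages": messages, **sampling}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_json):
    """Return the assistant text of the first choice in a chat completion response."""
    return response_json["choices"][0]["message"]["content"]


def chat(base_url, model, messages, **sampling):
    """Send the request to a running server and return the assistant's reply."""
    request = build_chat_request(base_url, model, messages, **sampling)
    with urllib.request.urlopen(request) as response:
        return extract_reply(json.load(response))


if __name__ == "__main__":
    # Requires a running vLLM server at the address below.
    reply = chat(
        "http://localhost:8000/v1",
        "Qwen/Qwen2-7B-Instruct",
        [{"role": "user", "content": "Tell me something about large language models."}],
        temperature=0.7,
        top_p=0.8,
        repetition_penalty=1.05,
        max_tokens=512,
    )
    print(reply)
```

With the `openai` client, the equivalent of `extract_reply` is `chat_response.choices[0].message.content`.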

For more information, please refer to [the documentation of `vllm`](https://docs.vllm.ai/en/stable/).

## Next Step

Now, you can have fun with Qwen2 models.
Want to learn more about how to use them?
Feel free to check the other documents in this documentation.