Quickstart
This guide helps you quickly start using Qwen3. We provide examples using Hugging Face Transformers and ModelScope, as well as vLLM and SGLang for deployment.
You can find Qwen3 models in the Qwen3 collection on Hugging Face Hub or the Qwen3 collection on ModelScope.
Transformers
To get started with Qwen3 quickly, we recommend trying inference with transformers first. Make sure that transformers>=4.51.0 is installed. We recommend Python 3.10 or later and PyTorch 2.6 or later.
Important
Qwen3-Instruct-2507 supports only non-thinking mode and does not generate <think></think> blocks in its output.
Unlike Qwen3-2504, specifying enable_thinking=False is no longer required or supported.
The following code snippet shows how to use Qwen3-235B-A22B-Instruct-2507 to generate content from a given input.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-235B-A22B-Instruct-2507"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)
print("content:", content)
Note
We recommend temperature=0.7, top_p=0.8, top_k=20, and min_p=0 for Qwen3-Instruct-2507 models.
For supported frameworks, adjust presence_penalty between 0 and 2 to reduce repetitions.
However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
Note
Qwen3-Instruct-2507 may use CoT (chain of thought) automatically for complex tasks. We recommend using an output length of 16,384 tokens for most queries.
Important
Qwen3-Thinking-2507 supports only thinking mode.
Additionally, to enforce model thinking, the default chat template automatically includes <think>.
Therefore, it is normal for the model’s output to contain only </think> without an explicit opening <think> tag.
The following code snippet shows how to use Qwen3-235B-A22B-Thinking-2507 to generate content from a given input.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-235B-A22B-Thinking-2507"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)  # no opening <think> tag
print("content:", content)
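The parsing step above can be isolated into a small helper, which makes it easy to check without loading a model. This is a minimal sketch: the function name `split_thinking` is hypothetical, and the token ID 151668 for `</think>` is taken from the snippet above rather than looked up from the tokenizer.

```python
END_THINK_ID = 151668  # token ID of </think>, as used in the snippet above


def split_thinking(output_ids, end_think_id=END_THINK_ID):
    """Split generated token IDs into (thinking_ids, content_ids).

    The split point is right after the last occurrence of end_think_id.
    If the marker is absent, everything is treated as content.
    """
    try:
        # position just past the last </think> token
        index = len(output_ids) - output_ids[::-1].index(end_think_id)
    except ValueError:
        index = 0
    return output_ids[:index], output_ids[index:]


# Made-up token IDs, used only to illustrate the split.
thinking_ids, content_ids = split_thinking([11, 12, 151668, 21, 22])
print(thinking_ids)  # [11, 12, 151668]
print(content_ids)   # [21, 22]
```

Each part can then be passed to `tokenizer.decode(..., skip_special_tokens=True)` as in the snippet above.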
Note
We recommend temperature=0.6, top_p=0.95, top_k=20, and min_p=0 for Qwen3-Thinking-2507 models.
For supported frameworks, adjust presence_penalty between 0 and 2 to reduce repetitions.
However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
Note
Qwen3-Thinking-2507 features increased thinking depth. We strongly recommend its use in highly complex reasoning tasks with adequate maximum generation length.
The following is a very simple code snippet showing how to run a Qwen3 model:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"

# load the tokenizer and the model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parse thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
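Qwen3 will think before responding, similar to QwQ models. This means the model uses its reasoning abilities to improve the quality of its responses. The model first generates its thinking content wrapped in a <think>...</think> block, followed by the final response.
Hard switch: To strictly disable the model's thinking behavior, so that it functions like the previous Qwen2.5-Instruct models, set enable_thinking=False when formatting the text:
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # Setting enable_thinking=False disables thinking mode
)
This is especially useful in scenarios where disabling thinking is needed to improve efficiency.
Soft switch: Qwen3 also understands user instructions about its thinking behavior, in particular the soft switches /think and /no_think. You can add them to user or system messages to switch the model's thinking mode flexibly from turn to turn. In multi-turn conversations, the model follows the most recent instruction.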
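To make the "most recent instruction wins" rule concrete, here is a minimal client-side sketch that scans a conversation for soft switches and reports which mode the next reply is expected to use. The helper name `effective_thinking_mode` is hypothetical, and the code is only bookkeeping on the caller's side; the model itself interprets the tags.

```python
import re

# Matches /think or /no_think; \b keeps words such as "/thinking" from matching.
SWITCH_PATTERN = re.compile(r"/(no_think|think)\b")


def effective_thinking_mode(messages, default=True):
    """Return True if thinking is expected to be on for the next reply."""
    mode = default
    for message in messages:
        # Soft switches are only meaningful in user and system messages.
        if message["role"] not in ("user", "system"):
            continue
        for match in SWITCH_PATTERN.finditer(message["content"]):
            mode = match.group(1) == "think"
    return mode


messages = [
    {"role": "user", "content": "How many r's are in strawberry? /no_think"},
    {"role": "assistant", "content": "There are three."},
    {"role": "user", "content": "Are you sure? /think"},
]
print(effective_thinking_mode(messages[:1]))  # False
print(effective_thinking_mode(messages))      # True
```

The same `messages` list can be passed unchanged to `tokenizer.apply_chat_template`, since the switches are ordinary text inside the message content.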
Note
For thinking mode, use Temperature=0.6, TopP=0.95, TopK=20, and MinP=0 (the default setting in generation_config.json).
DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions.
For non-thinking mode, we suggest using Temperature=0.7, TopP=0.8, TopK=20, and MinP=0.
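The recommended settings can be kept in one place and passed to `model.generate` as keyword arguments. The sketch below assumes the standard Transformers sampling arguments (`do_sample`, `temperature`, `top_p`, `top_k`, `min_p`); the names `RECOMMENDED_SAMPLING` and `sampling_kwargs` are hypothetical.

```python
# Values copied from the note above.
RECOMMENDED_SAMPLING = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},
}


def sampling_kwargs(enable_thinking=True):
    """Return sampling keyword arguments for the chosen mode."""
    key = "thinking" if enable_thinking else "non_thinking"
    # do_sample=True requests sampling instead of greedy decoding.
    return {"do_sample": True, **RECOMMENDED_SAMPLING[key]}


print(sampling_kwargs(enable_thinking=False))
```

With the model and inputs prepared as in the snippet above, a call could look like `model.generate(**model_inputs, max_new_tokens=32768, **sampling_kwargs(enable_thinking=True))`.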
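ModelScope
To work around download issues, we recommend downloading from ModelScope. Before starting, install modelscope with pip.
modelscope adopts a programming interface similar (but not identical) to that of transformers. For basic usage, you only need to change the first line of the code above as follows:
from modelscope import AutoModelForCausalLM, AutoTokenizer
For more information, please refer to the modelscope documentation.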
OpenAI API Compatibility
You can serve Qwen3 via OpenAI-compatible APIs using frameworks such as vLLM and SGLang, and interact with the API using common HTTP clients or the OpenAI SDKs.
Here we take Qwen3-235B-A22B-Instruct-2507 as an example to start the API:
SGLang (sglang>=0.4.6.post1 is required):
python -m sglang.launch_server --model-path Qwen/Qwen3-235B-A22B-Instruct-2507 --port 8000 --tp 8 --context-length 262144
vLLM (vllm>=0.9.0 is recommended):
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 --port 8000 --tensor-parallel-size 8 --max-model-len 262144
Note
Consider adjusting the context length according to the available GPU memory.
Here we take Qwen3-235B-A22B-Thinking-2507 as an example to start the API:
SGLang (sglang>=0.4.6.post1 is required):
python -m sglang.launch_server --model-path Qwen/Qwen3-235B-A22B-Thinking-2507 --port 8000 --tp 8 --context-length 262144 --reasoning-parser deepseek-r1
vLLM (vllm>=0.9.0 is recommended):
vllm serve Qwen/Qwen3-235B-A22B-Thinking-2507 --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --enable-reasoning --reasoning-parser deepseek_r1
Note
Consider adjusting the context length according to the available GPU memory.
Important
We are currently working on adapting the qwen3 reasoning parsers to the new behavior.
Please follow the command above at the moment.
Here we take Qwen3-8B as an example to start the API:
SGLang (sglang>=0.4.6.post1 is required):
python -m sglang.launch_server --model-path Qwen/Qwen3-8B --port 8000 --reasoning-parser qwen3
vLLM (vllm>=0.9.0 is recommended):
vllm serve Qwen/Qwen3-8B --port 8000 --enable-reasoning --reasoning-parser qwen3
Then, you can use the "create chat" interface to communicate with Qwen:
Here we show the basic command to interact with the chat completion API using Qwen3-235B-A22B-Instruct-2507.
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen3-235B-A22B-Instruct-2507",
"messages": [
{"role": "user", "content": "Give me a short introduction to large language models."}
],
"temperature": 0.7,
"top_p": 0.8,
"top_k": 20,
"max_tokens": 16384
}'
You can use the client from the openai Python SDK as shown below:
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Instruct-2507",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."},
    ],
    max_tokens=16384,
    temperature=0.7,
    top_p=0.8,
    extra_body={
        "top_k": 20,
    }
)
print("Chat response:", chat_response)
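Since the endpoint is plain HTTP, the same request can also be made without the SDK. The following is a minimal sketch using only the Python standard library; the helper name `build_chat_request` is hypothetical, and the request is only constructed here, not sent. Note that in the raw JSON body `top_k` sits at the top level, as in the curl command, whereas the SDK needs it inside `extra_body`.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, **params):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **params,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000/v1",
    "Qwen/Qwen3-235B-A22B-Instruct-2507",
    "Give me a short introduction to large language models.",
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    max_tokens=16384,
)
print(request.full_url)

# To actually send it (requires a running server):
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
#     print(body["choices"][0]["message"]["content"])
```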
Here we show the basic command to interact with the chat completion API using Qwen3-235B-A22B-Thinking-2507.
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen3-235B-A22B-Thinking-2507",
"messages": [
{"role": "user", "content": "Give me a short introduction to large language models."}
],
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"max_tokens": 32768
}'
You can use the client from the openai Python SDK as shown below:
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Thinking-2507",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."},
    ],
    max_tokens=32768,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
    }
)
print("Chat response:", chat_response)
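When the server is started with a reasoning parser, the thinking text is usually returned separately from the final answer. The sketch below assumes the parsed thinking appears in a `reasoning_content` field of the message, which is an assumption about the server's response format and may differ between framework versions; the helper name `split_reasoning` and the sample response are hypothetical.

```python
def split_reasoning(response_json):
    """Return (reasoning, answer) from a chat completion response dict."""
    message = response_json["choices"][0]["message"]
    # reasoning_content is assumed to be filled by the server-side reasoning
    # parser; it may be missing or None when no parser is configured.
    reasoning = message.get("reasoning_content") or ""
    answer = message.get("content") or ""
    return reasoning.strip(), answer.strip()


# Hand-written sample response, used only to exercise the helper.
sample = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "reasoning_content": "The user wants a brief overview.\n",
                "content": "\n\nLarge language models are ...",
            }
        }
    ]
}
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

The helper works on the JSON body returned by the curl command once it is parsed with `json.loads`; the SDK response object should be convertible to the same shape with `chat_response.model_dump()`.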
Here we show the basic command to interact with the chat completion API using Qwen3-8B.
The default is with thinking enabled:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen3-8B",
"messages": [
{"role": "user", "content": "Give me a short introduction to large language models."}
],
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"max_tokens": 32768
}'
You can use the client from the openai Python SDK as shown below:
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."},
    ],
    max_tokens=32768,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
    }
)
print("Chat response:", chat_response)
To disable thinking, you can use the soft switch (e.g., appending /no_think to the user query).
The hard switch can also be used as follows:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen3-8B",
"messages": [
{"role": "user", "content": "Give me a short introduction to large language models."}
],
"temperature": 0.7,
"top_p": 0.8,
"top_k": 20,
"max_tokens": 8192,
"presence_penalty": 1.5,
"chat_template_kwargs": {"enable_thinking": false}
}'
You can use the client from the openai Python SDK as shown below:
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."},
    ],
    max_tokens=8192,
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"enable_thinking": False},
    }
)
print("Chat response:", chat_response)
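The thinking and non-thinking requests above differ only in a handful of fields, so the choice can be wrapped in one function. This is a minimal sketch with a hypothetical helper name, `chat_completion_kwargs`; the values are copied from the two Qwen3-8B examples above, and the non-standard fields (`top_k`, `chat_template_kwargs`) go into `extra_body` because the SDK does not accept them as named arguments.

```python
def chat_completion_kwargs(model, prompt, enable_thinking=True):
    """Build keyword arguments for client.chat.completions.create()."""
    if enable_thinking:
        sampling = {"temperature": 0.6, "top_p": 0.95, "max_tokens": 32768}
        extra_body = {"top_k": 20}
    else:
        sampling = {
            "temperature": 0.7,
            "top_p": 0.8,
            "max_tokens": 8192,
            "presence_penalty": 1.5,
        }
        # Hard switch: passed through to the chat template by the server.
        extra_body = {"top_k": 20, "chat_template_kwargs": {"enable_thinking": False}}
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **sampling,
        "extra_body": extra_body,
    }


kwargs = chat_completion_kwargs("Qwen/Qwen3-8B", "Hello!", enable_thinking=False)
print(kwargs["extra_body"])
```

With a client created as above, the request becomes `client.chat.completions.create(**kwargs)`.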
For more usage, please refer to our documentation on SGLang and vLLM.
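Thinking Budget
Qwen3 supports configuring a thinking budget. It works by ending the thinking process once the budget is reached and guiding the model to generate a "summary" with an early-stopping prompt.
Because this feature involves customization specific to the model, it is currently not available in open-source frameworks and is implemented only by the Alibaba Cloud Model Studio API.
However, with existing open-source frameworks, it can be achieved by generating twice, as follows:
1. For the first generation, generate tokens up to the thinking budget and check whether the thinking process has finished. If it has not, append the early-stopping prompt.
2. For the second generation, continue generating until the content ends or the length limit is reached.
The following snippet shows an implementation with Hugging Face Transformers: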
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"

thinking_budget = 16
max_new_tokens = 32768

# load the tokenizer and the model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
input_length = model_inputs.input_ids.size(-1)

# first generation until thinking budget
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=thinking_budget
)
output_ids = generated_ids[0][input_length:].tolist()

# check if the generation has already finished (151645 is <|im_end|>)
if 151645 not in output_ids:
    # check if the thinking process has finished (151668 is </think>)
    # and prepare the second model input
    if 151668 not in output_ids:
        print("thinking budget is reached")
        early_stopping_text = "\n\nConsidering the limited time by the user, I have to give the solution based on the thinking directly now.\n</think>\n\n"
        early_stopping_ids = tokenizer([early_stopping_text], return_tensors="pt", return_attention_mask=False).input_ids.to(model.device)
        input_ids = torch.cat([generated_ids, early_stopping_ids], dim=-1)
    else:
        input_ids = generated_ids
    attention_mask = torch.ones_like(input_ids, dtype=torch.int64)

    # second generation
    # max_new_tokens could be negative if max_new_tokens is not large enough
    # (early stopping text is 24 tokens)
    generated_ids = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=input_length + max_new_tokens - input_ids.size(-1)
    )
    output_ids = generated_ids[0][input_length:].tolist()

# parse thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
You should see output similar to the following in the console:
thinking budget is reached
thinking content: <think>
Okay, the user is asking for a short introduction to large language models
Considering the limited time by the user, I have to give the solution based on the thinking directly now.
</think>
content: Large language models (LLMs) are advanced artificial intelligence systems trained on vast amounts of text data to understand and generate human-like language. They can perform tasks such as answering questions, writing stories, coding, and translating languages. LLMs are powered by deep learning techniques and have revolutionized natural language processing by enabling more context-aware and versatile interactions with text. Examples include models like GPT, BERT, and others developed by companies like OpenAI and Alibaba.
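Note
For demonstration purposes, thinking_budget is set to 16. In practice, it should not be set that low. We recommend tuning thinking_budget to the latency users can accept and setting it higher than 1024 for meaningful improvements across tasks.
If thinking is not desired at all, developers should use the hard switch instead.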
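The comment in the snippet above points out that the token count left for the second generation can become negative when max_new_tokens is small. A minimal sketch of a guard is shown below; the helper name `second_pass_budget` is hypothetical, and the prompt length of 20 tokens in the example is made up for illustration.

```python
def second_pass_budget(input_length, max_new_tokens, current_length):
    """Tokens left for the second generation, never below zero.

    input_length:   number of prompt tokens
    max_new_tokens: overall cap on newly generated tokens
    current_length: prompt + first-pass output (+ early-stopping text, if appended)
    """
    remaining = input_length + max_new_tokens - current_length
    return max(remaining, 0)


# Prompt of 20 tokens, 16 thinking tokens, 24 tokens of early-stopping text.
print(second_pass_budget(20, 32768, 20 + 16 + 24))  # 32728
print(second_pass_budget(20, 32, 20 + 16 + 24))     # 0
```

When the result is 0, there is no room left for an answer, so the caller would skip the second `model.generate` call or raise max_new_tokens.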
Next Steps
Now you can explore the many uses of Qwen3 models. To learn more, feel free to check the other content in this documentation.