SGLang¶
SGLang is a fast serving framework for large language models and vision language models.
To learn more about SGLang, please refer to the documentation.
Environment Setup¶
By default, you can install sglang with pip in a clean environment:
pip install "sglang[all]>=0.4.6.post1"
If you encounter issues during installation, please feel free to check the official document for installation (link).
API Service¶
It is easy to build an OpenAI-compatible API service with SGLang, which can be deployed as a server that implements the OpenAI API protocol.
By default, it starts the server at http://localhost:30000. You can specify the address with the --host and --port arguments.
Run the command as shown below:
python -m sglang.launch_server --model-path Qwen/Qwen3-8B
By default, if the --model-path does not point to a valid local directory, it will download the model files from the Hugging Face Hub.
To download models from ModelScope, set the following environment variable before running the above command:
export SGLANG_USE_MODELSCOPE=true
For distributed inference with tensor parallelism, it is as simple as
python -m sglang.launch_server --model-path Qwen/Qwen3-8B --tensor-parallel-size 4
The above command will use tensor parallelism on 4 GPUs. You should change the number of GPUs according to your needs.
Basic Usage¶
Then, you can use the chat completions endpoint to communicate with Qwen:
curl http://localhost:30000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "max_tokens": 32768
}'
You can use the API client with the openai Python SDK as shown below:
from openai import OpenAI

# Set OpenAI's API key and API base to use SGLang's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:30000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."},
    ],
    max_tokens=32768,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
    },
)
print("Chat response:", chat_response)
Tip
While the default sampling parameters work well most of the time in thinking mode, it is recommended to adjust them according to your application, and to always pass them explicitly to the API.
Thinking & Non-Thinking Modes¶
Qwen3 models think before responding. This behavior can be controlled by either the hard switch, which disables thinking completely, or the soft switch, where the model follows the user's instruction on whether it should think.
The hard switch is available in SGLang through the following configuration in the API call. To disable thinking, use:
curl http://localhost:30000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "top_k": 20,
  "max_tokens": 8192,
  "presence_penalty": 1.5,
  "chat_template_kwargs": {"enable_thinking": false}
}'
You can use the API client with the openai Python SDK as shown below:
from openai import OpenAI

# Set OpenAI's API key and API base to use SGLang's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:30000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."},
    ],
    max_tokens=8192,
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
print("Chat response:", chat_response)
Note
Please note that passing enable_thinking is not OpenAI API compatible. The exact method may differ among frameworks.
Tip
To completely disable thinking, you could use a custom chat template when starting the model:
python -m sglang.launch_server --model-path Qwen/Qwen3-8B --chat-template ./qwen3_nonthinking.jinja
The chat template prevents the model from generating thinking content, even if the user instructs the model to do so with /think.
Tip
It is recommended to set sampling parameters differently for thinking and non-thinking modes.
Parsing Thinking Content¶
SGLang supports parsing the thinking content from the model generation into structured messages:
python -m sglang.launch_server --model-path Qwen/Qwen3-8B --reasoning-parser qwen3
The response message will have a field named reasoning_content, in addition to content, containing the thinking content generated by the model.
Note
Please note that this feature is not OpenAI API compatible.
Important
enable_thinking=False may not be compatible with this feature. If you need to pass enable_thinking=False to the API, please consider disabling the parsing of thinking content.
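As a rough sketch of how the parsed output can be consumed (assuming the server was started with --reasoning-parser qwen3 and the same local endpoint as in the examples above), the extra field can be read from the response message. Since reasoning_content is an SGLang extension rather than part of the OpenAI schema, it may only be available as an extra attribute depending on the SDK version:
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:30000/v1",
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."},
    ],
    max_tokens=32768,
    temperature=0.6,
    top_p=0.95,
    extra_body={"top_k": 20},
)

message = chat_response.choices[0].message
# reasoning_content is an SGLang extension; fall back to None if it is absent.
print("Thinking:", getattr(message, "reasoning_content", None))
print("Answer:", message.content)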
Parsing Tool Calls¶
SGLang supports parsing the tool calling content from the model generation into structured messages:
python -m sglang.launch_server --model-path Qwen/Qwen3-8B --tool-call-parser qwen25
For more information, please refer to our guide on Function Calling.
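As a minimal sketch (the get_current_weather tool definition below is hypothetical and only for illustration), tool definitions can be passed through the OpenAI-compatible API, and with the tool-call parser enabled the parsed calls appear in the tool_calls field of the response message:
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:30000/v1",
)

# Hypothetical tool definition for illustration only.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    },
]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "What is the weather like in Beijing?"}],
    tools=tools,
)

# With --tool-call-parser enabled, tool calls are returned as structured objects
# instead of raw text in the content field.
print(chat_response.choices[0].message.tool_calls)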
Structured/JSON Output¶
SGLang supports structured/JSON output; please refer to SGLang's documentation. In addition, it is recommended to instruct the model to generate the specific format in the system message or in your prompt.
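As an illustrative sketch (assuming the server accepts the OpenAI-style json_schema response format; check SGLang's documentation for the exact options supported by your version), a schema can be supplied via response_format while the prompt also describes the expected output:
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:30000/v1",
)

# Example schema for illustration; describing the format in the prompt as well
# tends to improve the quality of the generated fields.
json_schema = {
    "name": "capital_info",
    "schema": {
        "type": "object",
        "properties": {
            "country": {"type": "string"},
            "capital": {"type": "string"},
        },
        "required": ["country", "capital"],
    },
}

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[
        {"role": "user", "content": "Give the capital of France as JSON with keys country and capital."},
    ],
    response_format={"type": "json_schema", "json_schema": json_schema},
)

print(chat_response.choices[0].message.content)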
Serving Quantized models¶
Qwen3 comes with two types of pre-quantized models, FP8 and AWQ.
The commands for serving these models are the same as for the original models, except for the name change:
# For FP8 quantized model
python -m sglang.launch_server --model-path Qwen/Qwen3-8B-FP8
# For AWQ quantized model
python -m sglang.launch_server --model-path Qwen/Qwen3-8B-AWQ
Context Length¶
The context length for Qwen3 models in pretraining is up to 32,768 tokens. To handle context lengths substantially exceeding 32,768 tokens, RoPE scaling techniques should be applied. We have validated the performance of YaRN, a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.
SGLang supports YaRN, which can be configured as
python -m sglang.launch_server --model-path Qwen/Qwen3-8B --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}' --context-length 131072
Note
SGLang implements static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts.
We advise adding the rope_scaling configuration only when processing long contexts is required. It is also recommended to modify the factor as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set factor as 2.0.
Note
The default max_position_embeddings in config.json is set to 40,960, which is used by SGLang. This allocation includes reserving 32,768 tokens for outputs and 8,192 tokens for typical prompts, which is sufficient for most scenarios involving short text processing and leaves adequate room for model thinking.
If the average context length does not exceed 32,768 tokens, we do not recommend enabling YaRN in this scenario, as it may potentially degrade model performance.