OpenLLM¶
OpenLLM allows developers to run Qwen2.5 models of different sizes as OpenAI-compatible APIs with a single command. It features a built-in chat UI, state-of-the-art inference backends, and a simplified workflow for creating enterprise-grade cloud deployments with Qwen2.5. Visit the OpenLLM repository to learn more.
Installation¶
Install OpenLLM using pip:
pip install openllm
Verify the installation and display the help information:
openllm --help
Quickstart¶
Before you run any Qwen2.5 model, ensure your model repository is up to date by syncing it with OpenLLM’s latest official repository.
openllm repo update
List the supported Qwen2.5 models:
openllm model list --tag qwen2.5
The output also shows the GPU RAM each model requires and the platforms it supports:
model    version                repo     required GPU RAM  platforms
-------  ---------------------  -------  ----------------  ---------
qwen2.5  qwen2.5:0.5b           default  12G               linux
         qwen2.5:1.5b           default  12G               linux
         qwen2.5:3b             default  12G               linux
         qwen2.5:7b             default  24G               linux
         qwen2.5:14b            default  80G               linux
         qwen2.5:14b-ggml-q4    default                    macos
         qwen2.5:14b-ggml-q8    default                    macos
         qwen2.5:32b            default  80G               linux
         qwen2.5:32b-ggml-fp16  default                    macos
         qwen2.5:72b            default  80Gx2             linux
         qwen2.5:72b-ggml-q4    default                    macos
To start a server with one of the models, use openllm serve like this:
openllm serve qwen2.5:7b
By default, the server starts at http://localhost:3000/.
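OpenLLM servers are built with BentoML, so you can typically confirm the server is ready by querying BentoML's standard readiness endpoint (an assumption about the default deployment; adjust the port if you changed it):
curl http://localhost:3000/readyz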
Interact with the model server¶
With the model server up and running, you can call its APIs in the following ways:
Send an HTTP request to its /generate endpoint via curl:
curl -X 'POST' \
    'http://localhost:3000/api/generate' \
    -H 'accept: text/event-stream' \
    -H 'Content-Type: application/json' \
    -d '{
    "prompt": "Tell me something about large language models.",
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "max_tokens": 2048,
    "stop": null
}'
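You can issue the same request from Python. Here is a minimal sketch using the third-party requests library; it assumes the endpoint streams server-sent events line by line, as the accept: text/event-stream header requests:
import requests

# Stream the /generate response and print each event line as it arrives
response = requests.post(
    "http://localhost:3000/api/generate",
    headers={"accept": "text/event-stream", "Content-Type": "application/json"},
    json={
        "prompt": "Tell me something about large language models.",
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "max_tokens": 2048,
        "stop": None,
    },
    stream=True,
)
for line in response.iter_lines(decode_unicode=True):
    if line:  # skip the blank keep-alive lines between events
        print(line)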
Call the OpenAI-compatible endpoints with frameworks and tools that support the OpenAI API protocol. Here is an example:
from openai import OpenAI

client = OpenAI(base_url='http://localhost:3000/v1', api_key='na')

# Uncomment the following lines to list the available models
# model_list = client.models.list()
# print(model_list)

chat_completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Tell me something about large language models."
        }
    ],
    stream=True,
)
for chunk in chat_completion:
    print(chunk.choices[0].delta.content or "", end="")
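For a single, non-streaming reply, the same client call works with stream=False (a minimal variant of the example above, reusing the client defined there):
# Non-streaming variant: the full reply arrives in one response object
chat_completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "Tell me something about large language models."}
    ],
    stream=False,
)
print(chat_completion.choices[0].message.content)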
OpenLLM provides a chat UI at the /chat endpoint for the LLM server at http://localhost:3000/chat.

Model repository¶
A model repository in OpenLLM represents a catalog of available LLMs. You can add your own repository to OpenLLM with custom Qwen2.5 variants for your specific needs. See our documentation for details.
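As a quick sketch, registering a custom repository follows the same openllm repo subcommand pattern shown earlier (assuming the repo add subcommand available in recent OpenLLM releases); the repository name and URL below are placeholders for your own:
openllm repo add my-repo https://github.com/my-org/my-openllm-models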