dstack¶

dstack is an open-source alternative to Kubernetes and Slurm, designed to simplify GPU allocation and AI workload orchestration for ML teams across top clouds, on-prem clusters, and accelerators.

Prerequisites¶

Before you start, install dstack by following the installation instructions. Once the dstack server is running, initialize your workspace as shown below:

mkdir dstack-qwen-deploy && cd dstack-qwen-deploy
dstack init

Deploy Qwen3-30B-A3B¶

Deploy Qwen3-30B-A3B on instances from the cloud providers configured in your ~/.dstack/server/config.yml file.

You can use SGLang, TGI, or vLLM to serve the model. This example uses SGLang.

Create a service configuration file named serve-30b.dstack.yml with the following content:

type: service
name: qwen3-30b-a3b

image: lmsysorg/sglang:latest
env:
  - MODEL_ID=Qwen/Qwen3-30B-A3B

commands:
  - python3 -m sglang.launch_server
        --model-path $MODEL_ID
        --port 8000
        --trust-remote-code

port: 8000
model: Qwen/Qwen3-30B-A3B

resources:
  gpu: 80GB:1
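The `80GB:1` request can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes roughly 30 billion total parameters (read off the model name, not an exact count) stored as 16-bit weights; both figures are assumptions. Real usage is higher, because the KV cache, activations, and runtime overhead also need memory.

```python
# Rough estimate of the GPU memory taken by the model weights alone.
# ASSUMPTIONS (not from the dstack docs): ~30e9 total parameters, taken from
# the "30B" in the model name, and 2 bytes per parameter (16-bit weights).
# For a mixture-of-experts model, every expert is normally kept in memory,
# so the total parameter count is used rather than the active count.


def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Return the size of the weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def headroom_gb(gpu_memory_gb: float, num_params: float, bytes_per_param: int = 2) -> float:
    """Return the memory left for KV cache and overhead after loading the weights."""
    return gpu_memory_gb - weight_memory_gb(num_params, bytes_per_param)


if __name__ == "__main__":
    weights = weight_memory_gb(30e9)
    print(f"weights: ~{weights:.0f} GB, headroom on an 80 GB GPU: ~{headroom_gb(80, 30e9):.0f} GB")
```

Under these assumptions the weights fit on a single 80 GB GPU with some room to spare, which is consistent with the `gpu: 80GB:1` request above; a quantized checkpoint would change the `bytes_per_param` figure.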

Note

For other inference backends such as vLLM or TGI, visit the dstack Inference Examples documentation.

Go ahead and apply the service configuration:

dstack apply -f serve-30b.dstack.yml

Access the Service¶

Once the service is deployed, you can reach its endpoint in the following ways:

CURL

Access through service endpoint at <dstack server URL>/proxy/services/<project name>/<run name>/

curl http://localhost:3000/proxy/services/main/qwen3-30b-a3b/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <dstack token>' \
    -d '{
        "model": "Qwen/Qwen3-30B-A3B",
        "messages": [
            {
                "role": "user",
                "content": "Compose a poem that explains the concept of recursion in programming."
            }
        ]
}'
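The same request can be issued from Python using only the standard library. This is a minimal sketch, not an official dstack client: the function names are illustrative, and the response parsing assumes the backend returns an OpenAI-style chat completion body, which the `/v1/chat/completions` path suggests but this page does not spell out.

```python
import json
import urllib.request


def service_url(server_url: str, project: str, run_name: str) -> str:
    """Build <dstack server URL>/proxy/services/<project name>/<run name>/."""
    return f"{server_url.rstrip('/')}/proxy/services/{project}/{run_name}/"


def build_chat_request(
    server_url: str, project: str, run_name: str, token: str, model: str, prompt: str
) -> urllib.request.Request:
    """Prepare (but do not send) the POST request that the curl command issues."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        service_url(server_url, project, run_name) + "v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def ask(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the first reply's text.

    ASSUMPTION: the response follows the OpenAI chat completion layout
    (choices[0].message.content).
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

To use it, build a request with your own server URL, project, run name, and token, then pass it to `ask`. Keeping request construction separate from sending makes the URL and headers easy to inspect before anything goes over the network.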

Note

When the dstack server starts, it automatically generates an admin token and prints it:

The admin token is "bbae0f28-d3dd-4820-bf61-8f4bb40815da"
The server is running at http://127.0.0.1:3000/
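If you capture the server's startup output, the token and URL can be extracted from these two lines programmatically. This is a minimal sketch that assumes the log lines keep exactly the wording shown above; `parse_server_banner` is an illustrative name, not part of dstack.

```python
import re
from typing import Optional, Tuple

# Patterns mirror the two startup lines shown above; if the server's wording
# changes, these need to change too.
TOKEN_PATTERN = re.compile(r'The admin token is "([^"]+)"')
URL_PATTERN = re.compile(r"The server is running at (\S+)")


def parse_server_banner(output: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (admin_token, server_url); each is None if its line is missing."""
    token_match = TOKEN_PATTERN.search(output)
    url_match = URL_PATTERN.search(output)
    token = token_match.group(1) if token_match else None
    url = url_match.group(1) if url_match else None
    return token, url
```

Treat the extracted token as a secret: avoid printing it or committing it to version control.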

Chat UI

Access through dstack’s Chat UI at <dstack server URL>/projects/<project name>/models/<run name>/

Screenshot of the Chat UI: https://dstack.ai/static-assets/static-assets/images//dstack-qwen-ui.png

Replicas and Auto Scaling¶

You can autoscale the service by adding two settings to serve-30b.dstack.yml:

Set replicas: min..max to define the minimum and maximum number of replicas.
Configure scaling rules to determine when to scale up or down.

Below is a complete configuration example with auto-scaling enabled:

type: service
name: qwen3-30b-a3b

image: lmsysorg/sglang:latest
env:
  - MODEL_ID=Qwen/Qwen3-30B-A3B

commands:
  - python3 -m sglang.launch_server
        --model-path $MODEL_ID
        --port 8000
        --trust-remote-code

port: 8000
model: Qwen/Qwen3-30B-A3B

resources:
  gpu: 80GB:1

# Minimum and maximum number of replicas
replicas: 1..4
scaling:
  # Requests per second
  metric: rps
  # Target metric value
  target: 10

Note

The scaling property requires a gateway to be set up.
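To build intuition for how `replicas: 1..4` and `target: 10` interact, the sketch below computes a replica count from an observed request rate. It assumes the target is a per-replica requests-per-second figure and that the count is simply the rounded-up ratio clamped to the configured range; dstack's actual autoscaler may use averaging windows, delays, or a different rule, so treat this as an illustration only. Both function names are hypothetical.

```python
import math
from typing import Tuple


def parse_replicas(spec: str) -> Tuple[int, int]:
    """Parse a 'min..max' range such as '1..4'; a bare number means a fixed count."""
    low, _, high = spec.partition("..")
    minimum = int(low)
    maximum = int(high) if high else minimum
    if minimum < 0 or maximum < minimum:
        raise ValueError(f"invalid replicas range: {spec!r}")
    return minimum, maximum


def desired_replicas(observed_rps: float, target_rps: float, spec: str) -> int:
    """Estimate replicas needed to keep each one at or below target_rps.

    ASSUMPTION: target_rps is per replica and the result is clamped to the
    configured range; this is not taken from dstack's implementation.
    """
    minimum, maximum = parse_replicas(spec)
    needed = math.ceil(observed_rps / target_rps)
    return max(minimum, min(maximum, needed))
```

Under these assumptions, 25 requests per second against a target of 10 calls for 3 replicas, an idle service stays at the minimum of 1, and any load above 40 requests per second is capped at the maximum of 4.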
