dstack ======== `dstack `__ is an open-source alternative to Kubernetes and Slurm, designed to simplify GPU allocation and AI workload orchestration for ML teams across top clouds, on-prem clusters, and accelerators. Prerequisites ---------------- Before you start, install dstack by following the `installation instructions `__. Once dstack server is up, you can initialize your workspace as shown below: .. code:: bash mkdir dstack-qwen-deploy && cd dstack-qwen-deploy dstack init Deploy Qwen3-30B-A3B ----------------------------------------------- Deploy ``Qwen3-30B-A3B`` on instances available with cloud providers configured in your ``~/.dstack/server/config.yml`` file. You can use ``SgLang``, ``TGI`` or ``vLLM`` to serve the model. Here we use ``SgLang`` as an example. Create a `service `__ configuration file named ``serve-30b.dstack.yml`` with the following content: .. code:: yaml type: service name: qwen3-30b-a3b image: lmsysorg/sglang:latest env: - MODEL_ID=Qwen/Qwen3-30B-A3B commands: - python3 -m sglang.launch_server --model-path $MODEL_ID --port 8000 --trust-remote-code port: 8000 model: Qwen/Qwen3-30B-A3B resources: gpu: 80GB:1 .. note:: For other inference backends such as vLLM or TGI, visit the `dstack Inference Examples `__ documentation. Go ahead and apply the service configuration: .. code:: bash dstack apply -f serve-30b.dstack.yml Access the Service -------------------- After the service is successfully deployed, you can access the service's endpoint in the following ways: .. tab-set:: .. tab-item:: CURL Access through service endpoint at ``/proxy/services///`` .. code:: bash curl http://localhost:3000/proxy/services/main/qwen3-30b-a3b/v1/chat/completions \ -H 'Content-Type: application/json' \ -H 'Authorization: Bearer ' \ -d '{ "model": "Qwen/Qwen3-30B-A3B", "messages": [ { "role": "user", "content": "Compose a poem that explains the concept of recursion in programming." } ] }' .. note:: When starting the dstack server, an admin token is automatically generated: .. code:: bash The admin token is "bbae0f28-d3dd-4820-bf61-8f4bb40815da" The server is running at http://127.0.0.1:3000/ .. tab-item:: Chat UI Access through dstack's Chat UI at ``/projects//models//`` .. image:: https://dstack.ai/static-assets/static-assets/images//dstack-qwen-ui.png .. dropdown:: Gateway :icon: info :animate: fade-in Running services for development purposes doesn't require setting up a gateway. However, you'll need a gateway in the following cases: * To use auto-scaling or rate limits * To enable HTTPS for the endpoint and map it to your domain * If your service requires WebSockets * If your service cannot work with a path prefix For detailed information about gateway configuration and usage, refer to the `dstack documentation on gateways `__. Replicas and Auto Scaling ---------------------------------------- You can auto scale the service by specifying additional configurations in the ``serve-30b.dstack.yml``. - Set ``replicas: min..max`` to define the minimum and maximum number of replicas - Configure ``scaling`` rules to determine when to scale up or down Below is a complete configuration example with auto-scaling enabled: .. code:: yaml type: service name: qwen3-30b-a3b image: lmsysorg/sglang:latest env: - MODEL_ID=Qwen/Qwen3-30B-A3B commands: - python3 -m sglang.launch_server --model-path $MODEL_ID --port 8000 --trust-remote-code port: 8000 model: Qwen/Qwen3-30B-A3B resources: gpu: 80GB:1 # Minimum and maximum number of replicas replicas: 1..4 scaling: # Requests per seconds metric: rps # Target metric value target: 10 .. note:: The scaling property requires a gateway to be set up. See also ------------ - **Fleets**: Create cloud and on-prem clusters using `Fleets `__. - **Dev Environments**: Experiment and test before deploying to production using `Dev Environments `__. - **Tasks**: Schedule single node or distributed training using `Tasks `__. - **Services**: Deploy models as secure, auto-scaling OpenAI-compatible endpoints using `Services `__. - **Metrics**: Monitor performance with automatically tracked metrics via CLI or UI using `Metrics `__.