TGI

Attention

This page still needs to be updated for Qwen3.

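Hugging Face's Text Generation Inference (TGI) is a production-ready framework designed for deploying and serving large language models (LLMs). TGI offers a smooth deployment experience and provides stable support for the following features:

Speculative Decoding: speeds up generation.
Tensor Parallelism: enables efficient multi-GPU deployment.
Token Streaming: returns generated text incrementally as tokens are produced.
Flexible hardware support: works seamlessly with AMD, Gaudi, and AWS Inferentia.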

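Installation

The easiest way to use TGI is through the TGI docker image. This guide focuses on the docker usage of TGI.

TGI can also be installed on bare metal through Conda or built as a service. Refer to the Installation Guide and the CLI tool for detailed instructions.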

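Deploy Qwen2.5 with TGI

Choose a Qwen2.5 model: pick a model from the Qwen2.5 collection.
Deploy the TGI service: run the following commands in a terminal, replacing model with the ID of the chosen Qwen2.5 model and volume with the path to your local data directory: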

model=Qwen/Qwen2.5-7B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model
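The later sections reuse this same command with extra launcher flags appended. As a minimal sketch, the Python helper below assembles the argument list so that the variants differ only in `extra_args`. The names `build_tgi_command`, `TGI_IMAGE`, `host_port`, and `extra_args` are hypothetical and are not part of TGI. The path `/tmp/data` is a placeholder.

```python
import shlex

# Image tag copied from the command above; adjust it to the release you use.
TGI_IMAGE = "ghcr.io/huggingface/text-generation-inference:2.0"


def build_tgi_command(model_id, volume, host_port=8080, extra_args=()):
    """Return the docker argv list for serving `model_id` with TGI.

    Hypothetical helper: it only mirrors the shell command shown above.
    """
    return [
        "docker", "run", "--gpus", "all", "--shm-size", "1g",
        "-p", f"{host_port}:80",      # host port -> container port 80
        "-v", f"{volume}:/data",      # shared volume so weights are cached
        TGI_IMAGE,
        "--model-id", model_id,
        *extra_args,                  # e.g. ["--num-shard", "2"]
    ]


cmd = build_tgi_command("Qwen/Qwen2.5-7B-Instruct", "/tmp/data")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` on a machine with Docker and GPUs available. Printing it first is a cheap way to check the command before launching.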

Using the TGI API

Once deployed, the API is served on the mapped host port (8080).

TGI offers a simple, direct API that supports streaming generation:

curl http://localhost:8080/generate_stream -H 'Content-Type: application/json' \
        -d '{"inputs":"Tell me something about large language models.","parameters":{"max_new_tokens":512}}'
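The streaming endpoint replies with server-sent events, one `data:` line per generated token. The sketch below shows one way a client might extract the token text from such lines. It assumes each event is a JSON object with a `token` field containing `text` and `special`. That shape is an assumption, so check it against the TGI API documentation for your version. The sample lines are hand-written for illustration, not captured server output, and `parse_stream_line` is a hypothetical helper.

```python
import json


def parse_stream_line(line):
    """Return the token text carried by one SSE line, or None if there is none."""
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank separator lines or other SSE fields
    event = json.loads(line[len("data:"):])
    token = event.get("token") or {}
    if token.get("special"):
        return None  # skip special tokens such as end-of-sequence markers
    return token.get("text")


# Hand-written sample events (assumed shape, for illustration only).
sample_lines = [
    'data:{"index":1,"token":{"id":1,"text":"Large","logprob":-0.5,"special":false},"generated_text":null,"details":null}',
    "",
    'data:{"index":2,"token":{"id":2,"text":" models","logprob":-0.3,"special":false},"generated_text":null,"details":null}',
    'data:{"index":3,"token":{"id":3,"text":"<|im_end|>","logprob":-0.1,"special":true},"generated_text":"Large models","details":null}',
]

pieces = [parse_stream_line(line) for line in sample_lines]
text = "".join(p for p in pieces if p is not None)
print(text)
```

Against a live server, the same function could be applied to each line read from the HTTP response body.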

TGI can also be accessed through an OpenAI-style API:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "",
  "messages": [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "Tell me something about large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "repetition_penalty": 1.05,
  "max_tokens": 512
}'

Note

The model field in the JSON is not used by TGI; you can pass any value.

For the complete API documentation, see the TGI Swagger UI.

You can also access the API from Python:

from openai import OpenAI

# initialize the client but point it to TGI
client = OpenAI(
   base_url="http://localhost:8080/v1/",  # replace with your endpoint url
   api_key="-",  # not checked by a local TGI server; a non-empty placeholder avoids an empty Authorization header
)
chat_completion = client.chat.completions.create(
   model="",  # it is not used by TGI, you can put anything
   messages=[
      {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
      {"role": "user", "content": "Tell me something about large language models."},
   ],
   stream=True,
   temperature=0.7,
   top_p=0.8,
   max_tokens=512,
)

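# iterate over the stream and print each text delta
for message in chat_completion:
   # some chunks (e.g. the final one) may carry no content, so fall back to ""
   print(message.choices[0].delta.content or "", end="")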

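Quantization

Data-dependent quantization (GPTQ and AWQ)

Both GPTQ and AWQ rely on data for quantization. Pre-quantized models are available in the Qwen2.5 collection. You can also quantize a model yourself with your own dataset to get better results in your use case.

The following commands deploy Qwen2.5-7B-Instruct-GPTQ-Int4 with TGI: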

model=Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --quantize gptq

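If the model is quantized with AWQ, e.g. Qwen/Qwen2.5-7B-Instruct-AWQ, use --quantize awq instead.

Data-agnostic quantization

EETQ is a data-agnostic quantization method that can be applied to any model directly. Note that the original (unquantized) model must be passed in, together with the --quantize eetq flag.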

model=Qwen/Qwen2.5-7B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --quantize eetq
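The rule for picking the `--quantize` value in this section can be sketched as a small function. It assumes the quantization scheme can be read from the model ID, as with the `-GPTQ-Int4` and `-AWQ` suffixes used above. That naming convention is an assumption about the pre-quantized checkpoints, and `quantize_args` and `data_free` are hypothetical names.

```python
def quantize_args(model_id, data_free=None):
    """Return the launcher arguments selecting a quantization scheme.

    Pre-quantized GPTQ/AWQ checkpoints are detected from the model ID
    (assumed naming convention). For an original model, `data_free` may
    name a data-agnostic scheme such as "eetq".
    """
    name = model_id.upper()
    if "GPTQ" in name:
        return ["--quantize", "gptq"]
    if "AWQ" in name:
        return ["--quantize", "awq"]
    if data_free:
        return ["--quantize", data_free]
    return []  # serve the model without quantization


print(quantize_args("Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4"))
print(quantize_args("Qwen/Qwen2.5-7B-Instruct-AWQ"))
print(quantize_args("Qwen/Qwen2.5-7B-Instruct", data_free="eetq"))
```

The returned list is meant to be appended to the `docker run` command after `--model-id $model`.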

Multi-GPU deployment

Use --num-shard to specify the number of GPUs. Be sure to also pass --shm-size 1g so that NCCL can perform at its best (explanation):

model=Qwen/Qwen2.5-7B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --num-shard 2

Speculative Decoding

Speculative decoding reduces the time spent per token by speculating on upcoming tokens in advance. Use --speculate to set the number of tokens to speculate (the default is 0, which disables speculation):

model=Qwen/Qwen2.5-7B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --speculate 2

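The speed-up from speculative decoding depends on the type of task; it is more noticeable for code or highly repetitive text.

More details can be found in this documentation.

Zero-code deployment with HF Inference Endpoints

Deploying with Hugging Face Inference Endpoints takes very little effort: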

GUI interface: https://huggingface.co/inference-endpoints/dedicated
Coding interface: https://huggingface.co/blog/tgi-messages-api

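Once deployed, the endpoint can be used in the same way as a local service.

Common issues

Qwen2.5 supports long contexts, so set --max-batch-prefill-tokens, --max-total-tokens, and --max-input-tokens carefully to avoid running out of memory (OOM). If an OOM occurs, you will get an error message when TGI starts up. The following example shows how to set these parameters: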

model=Qwen/Qwen2.5-7B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --max-batch-prefill-tokens 4096 --max-total-tokens 4096 --max-input-tokens 2048
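Before launching, it can help to check that the three limits are mutually consistent. The sketch below encodes two constraints: the input limit must be smaller than the total limit, so there is room left for generated tokens, and the prefill budget must be able to hold at least one full-length input. These constraints reflect an understanding of the launcher's start-up checks rather than a quote from its documentation, so treat them as assumptions. `check_token_limits` is a hypothetical helper.

```python
def check_token_limits(max_input_tokens, max_total_tokens, max_batch_prefill_tokens):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    if max_input_tokens >= max_total_tokens:
        # total = input + generated, so no tokens would be left for generation
        problems.append("max_input_tokens must be smaller than max_total_tokens")
    if max_batch_prefill_tokens < max_input_tokens:
        # a single full-length prompt would not fit in one prefill batch
        problems.append("max_batch_prefill_tokens must be at least max_input_tokens")
    return problems


# Values from the command above.
print(check_token_limits(2048, 4096, 4096))
# An inconsistent setting: input limit equal to the total limit.
print(check_token_limits(4096, 4096, 4096))
```

Passing this check does not guarantee the settings fit in GPU memory. It only catches combinations that contradict each other.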