ms-swift

Introduction to ms-swift

ms-swift is the official large model and multimodal model training and deployment framework provided by the ModelScope community.

GitHub repository: ms-swift

The RL script in ms-swift has the following features:

  • Support single-GPU and multi-GPU training

  • Support full-parameter tuning, LoRA, Q-LoRA, and Dora

  • Supports multiple RL algorithms including GRPO, DAPO, PPO, DPO, KTO, ORPO, CPO, and SimPO

  • Supports both large language models (LLM) and multimodal models (MLLM)

For detailed support information, please refer to: Supported Features

Environment Setup

  1. Follow the instructions of ms-swift, and build the environment.

  2. Install these packages (Optional):

    pip install deepspeed
    pip install math_verify==0.5.2
    pip install flash-attn --no-build-isolation
    pip install vllm
    

Data Preparation

ms-swift has built-in preprocessing logic for several datasets, which can be directly used for training via the --dataset parameter. For supported datasets, please refer to: Supported Datasets

You can also use local custom datasets by providing the local dataset path to the --dataset parameter.

Example Dataset Formats:

# llm
{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}]}
{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}]}
{"messages": [{"role": "user", "content": "What is your name?"}]}

# mllm
{"messages": [{"role": "user", "content": "<image>What is the difference between the two images?"}], "images": ["/xxx/x.jpg"]}
{"messages": [{"role": "user", "content": "<image><image>What is the difference between the two images?"}], "images": ["/xxx/y.jpg", "/xxx/z.png"]}

Notes on Dataset Requirements

  1. Reward Function Calculation: Depending on the reward function being used, additional columns may be required in the dataset. For example:

    When using the built-in accuracy/cosine reward, the dataset must include a solution column to compute accuracy. The other columns in the dataset will also be passed to the kwargs of the reward function.

  2. Customizing the Reward Function: To tailor the reward function to your specific needs, you can refer to the following resource: external reward plugin

GRPO Training Examples

Single-GPU Configuration

LLM (Qwen2.5-7B):

# 42G
CUDA_VISIBLE_DEVICES=0 \
nohup swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-7B \
    --vllm_gpu_memory_utilization 0.5 \
    --use_vllm true \
    --sleep_level 1 \
    --offload_model true \
    --offload_optimizer true \
    --gc_collect_after_offload true \
    --reward_funcs accuracy format \
    --train_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --torch_dtype bfloat16 \
    --dataset 'AI-MO/NuminaMath-TIR' \
    --max_completion_length 1024 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --learning_rate 1e-5 \
    --gradient_accumulation_steps 1 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --num_generations 4 \
    --temperature 0.9 \
    --system 'examples/train/grpo/prompt.txt' \
    --log_completions true

MLLM (Qwen2.5-VL-7B-Instruct):

# 55G
CUDA_VISIBLE_DEVICES=0 \
MAX_PIXELS=602112 \
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --vllm_gpu_memory_utilization 0.5 \
    --use_vllm true \
    --sleep_level 1 \
    --offload_model true \
    --offload_optimizer true \
    --gc_collect_after_offload true \
    --external_plugins examples/train/grpo/plugin/plugin.py \
    --reward_funcs external_r1v_acc format \
    --train_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --torch_dtype bfloat16 \
    --dataset 'lmms-lab/multimodal-open-r1-8k-verified' \
    --vllm_max_model_len 4196 \
    --max_completion_length 1024 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --learning_rate 1e-5 \
    --gradient_accumulation_steps 1 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --num_generations 4 \
    --temperature 0.9 \
    --system 'examples/train/grpo/prompt.txt' \
    --log_completions true

Multi-GPU Training

LLM Example with DeepSpeed:

# 60G*8
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-7B \
    --reward_funcs accuracy format \
    --use_vllm true \
    --vllm_device auto \
    --vllm_gpu_memory_utilization 0.7 \
    --vllm_max_model_len 8192 \
    --num_infer_workers 8 \
    --train_type lora \
    --torch_dtype bfloat16 \
    --dataset 'AI-MO/NuminaMath-TIR' \
    --max_completion_length 2048 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 2 \
    --eval_steps 200 \
    --save_steps 200 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 4096 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --num_generations 8 \
    --temperature 0.9 \
    --system 'examples/train/grpo/prompt.txt' \
    --deepspeed zero2 \
    --log_completions true \
    --sleep_level 1 \
    --offload_model true \
    --offload_optimizer true \
    --gc_collect_after_offload true \
    --log_completions true

MLLM Example with DeepSpeed:

# 60G*8
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
nohup swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --external_plugins examples/train/grpo/plugin/plugin.py \
    --reward_funcs external_r1v_acc format \
    --use_vllm true \
    --vllm_device auto \
    --vllm_gpu_memory_utilization 0.7 \
    --vllm_max_model_len 8192 \
    --num_infer_workers 8 \
    --train_type lora \
    --torch_dtype bfloat16 \
    --dataset 'lmms-lab/multimodal-open-r1-8k-verified' \
    --max_completion_length 2048 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 2 \
    --eval_steps 200 \
    --save_steps 200 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --vllm_max_model_len 4196 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --num_generations 8 \
    --temperature 0.9 \
    --system 'examples/train/grpo/prompt.txt' \
    --deepspeed zero2 \
    --log_completions true \
    --sleep_level 1 \
    --offload_model true \
    --offload_optimizer true \
    --gc_collect_after_offload true \
    --log_completions true

Model Export

Merge LoRA Adapters:

swift export \
    --adapters output/checkpoint-xxx \
    --merge_lora true

Push to ModelScope Hub:

swift export \
    --adapters output/checkpoint-xxx \
    --push_to_hub true \
    --hub_model_id '<your-namespace>/<model-name>' \
    --hub_token '<your-access-token>'