Publisher on askill
Implement GRPO (Group Relative Policy Optimization) fine-tuning for vision-language models on small datasets. Use when SFT underperforms or training data is limited (<1000 examples).