Reinforcement Learning from Human Feedback (RLHF) has become the go-to technique for refining large language models (LLMs), but it faces significant challenges in multi-task learning (MTL), particularly reward hacking and the difficulty of optimizing multiple, often conflicting, objectives.
To address these challenges, a research team from Meta GenAI and FAIR, in the new paper The Perfect Blend: Redefining RLHF with Mixture of Judges, developed Constrained Generative Policy Optimization (CGPO), a more structured approach to RLHF that advances the performance of general-purpose LLMs.
At the heart of CGPO is the Mixture of Judges (MoJ) mechanism, which uses cost-efficient constrained policy optimization and stratification. This innovation improves the RLHF process by balancing objectives and ensuring principled tuning, achieving strong empirical results backed by theoretical guarantees. CGPO is also highly adaptable and requires minimal hyper-parameter adjustments, making it compatible with typical post-training pipelines. Its ability to detect and address reward hacking ensures it reaches Pareto-optimal solutions even when balancing a wide range of objectives.
A key advancement in CGPO is its novel strategy for combating reward hacking in multi-task LLM post-training. This is accomplished through a primal-type constrained reinforcement learning (RL) method that introduces three new optimizers: Calibrated-Regularized Policy Gradient (CRPG), Constrained Online Direct Preference Optimization (CODPO), and Calibrated-Regularized Reward Ranking Finetuning (CRRAFT). These optimizers are designed to be both scalable and easy to integrate.
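The interplay between reward signals and judge verdicts is easiest to see in a toy form. The sketch below illustrates a constraint-gated, calibrated policy-gradient step in the general spirit of primal-type constrained RL; it is not the paper's implementation, and the names (`calibrate`, `constrained_pg_loss`, the binary `judge_ok` signal) are hypothetical stand-ins.

```python
import torch

def calibrate(raw_rewards: torch.Tensor) -> torch.Tensor:
    # Toy calibration: squash raw reward-model scores into (0, 1) so their scale
    # is comparable across tasks; the paper's actual calibration differs.
    return torch.sigmoid(raw_rewards)

def constrained_pg_loss(logprobs: torch.Tensor,
                        raw_rewards: torch.Tensor,
                        judge_ok: torch.Tensor) -> torch.Tensor:
    """Constraint-gated, REINFORCE-style surrogate loss (illustrative only).

    logprobs:    (B,) summed token log-probs of each sampled response
    raw_rewards: (B,) reward-model scores for those responses
    judge_ok:    (B,) 1.0 if the mixture of judges found no violation, else 0.0
    """
    rewards = calibrate(raw_rewards)
    # Responses flagged by any judge contribute no positive learning signal.
    advantages = judge_ok * (rewards - rewards.mean())
    # Advantages are treated as constants; gradients flow only through logprobs.
    return -(advantages.detach() * logprobs).mean()
```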
To further support constrained RL within CGPO, the team developed two types of evaluators: a rule-based judge and an LLM-based judge. These judges assess whether an LLM’s output adheres to constraints across a variety of natural language processing (NLP) tasks.
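To make the distinction concrete, here is a minimal sketch of the two judge flavors. The class names, the rubric prompt, and the pass/fail protocol are assumptions for illustration, not the paper's actual interfaces.

```python
import re
from typing import Callable, Protocol

class Judge(Protocol):
    """Common interface: return True if the response violates the constraint."""
    def violates(self, prompt: str, response: str) -> bool: ...

class RuleBasedJudge:
    """Checks mechanically verifiable constraints, e.g. length limits or banned phrases."""
    def __init__(self, max_words: int = 300,
                 banned_pattern: str = r"(?i)as an ai language model"):
        self.max_words = max_words
        self.banned = re.compile(banned_pattern)

    def violates(self, prompt: str, response: str) -> bool:
        return len(response.split()) > self.max_words or bool(self.banned.search(response))

class LLMBasedJudge:
    """Asks a separate grader LLM to judge the response against a natural-language rubric."""
    def __init__(self, grader: Callable[[str], str], rubric: str):
        self.grader = grader  # any text-in, text-out model call
        self.rubric = rubric

    def violates(self, prompt: str, response: str) -> bool:
        verdict = self.grader(
            f"Rubric: {self.rubric}\nPrompt: {prompt}\nResponse: {response}\n"
            "Answer PASS or FAIL:"
        )
        return "FAIL" in verdict.upper()
```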
In addition, the researchers introduced a multi-objective RLHF strategy that allows each task to be treated individually. Each task is optimized with its own reward models, mixture of judges, and tailored hyperparameters, making this the first multi-task RLHF system to expand the Pareto frontier across numerous metrics.
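One way to picture this per-task treatment is as a set of independent recipes, each bundling its own reward model, judge mix, and hyperparameters. The sketch below is purely illustrative; every field name and value is invented rather than drawn from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TaskRecipe:
    name: str
    reward_model: str          # identifier of this task's dedicated reward model
    judges: List[str]          # names of the judges that gate this task's updates
    optimizer: str = "CRPG"    # e.g. CRPG, CODPO, or CRRAFT
    kl_coeff: float = 0.05     # per-task KL-regularization strength

# Hypothetical recipes for three of the tasks mentioned in the paper's setup.
recipes = [
    TaskRecipe("general_chat",   "rm-helpfulness", ["llm_factuality_judge"]),
    TaskRecipe("math_reasoning", "rm-correctness", ["rule_answer_checker"], optimizer="CRRAFT"),
    TaskRecipe("safety",         "rm-safety",      ["rule_refusal_check", "llm_safety_judge"], kl_coeff=0.1),
]
```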
The efficacy of CGPO was demonstrated in a challenging multi-task post-training setup involving five tasks: general conversation, instruction following, math and coding reasoning, engagement, and safety. Even with conflicting objectives, CGPO consistently outperformed traditional RLHF methods like PPO and DPO. Notably, the experiments were conducted using the Llama 3.0 70B pre-trained model and open-source data, showcasing CGPO's robust performance across all benchmarks and tasks.
The paper The Perfect Blend: Redefining RLHF with Mixture of Judges is on arXiv.
Author: Hecate He | Editor: Chain Zhang