RLHF for LLMs: Translation and Interpretation of "A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More"
Translation and Interpretation of "A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More"
Address | Paper: https://arxiv.org/abs/2407.16216
Date | July 23, 2024
Authors | Zhichao Wang*, Bin Bi*, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Zixu (James) Zhu, Xiang-Bo Mao, Sitaram Asur, Na (Claire) Cheng (Salesforce)
Summary | Background and pain points: Despite advances in self-supervised learning and instruction fine-tuning, large language models (LLMs) can still produce untruthful, toxic, or unhelpful responses that diverge from human intent, because the quality of their training data is uneven. Existing metrics such as BLEU, ROUGE, and BERTScore do not capture human preferences over LLM outputs well, so LLMs need to be aligned with human values to avoid generating inappropriate content.
Proposed solutions: Reinforcement Learning from Human Feedback (RLHF) adjusts the model with human feedback so that its outputs better match human expectations; a human preference dataset (triplets of prompt, desired response, and undesired response) is collected to train a reward model and an RL policy. Reinforcement Learning from AI Feedback (RLAIF) uses AI-generated feedback to reduce the cost of collecting human feedback.
Core ideas and steps (the standard formulas behind this pipeline are sketched right after this summary):
>> Reward model: score generated responses with an explicit or implicit reward model; rewards can be assigned at the response level or the token level. A pointwise reward function rφ(x, y) is trained on human preference data via the Bradley-Terry model, predicting, for a prompt x and response y, the probability that the response is the one humans prefer.
>> Feedback: collect preference feedback or binary feedback, in pairwise or listwise form, provided by humans or by AI.
>> RL policy: reference-based RL with control over output length; different divergence measures such as KL divergence; online or offline policies. The LLM acts as the agent and the reward model as the environment; the objective is to maximize reward while minimizing KL divergence and avoiding the "alignment tax" (degraded performance on downstream tasks). The survey covers different reward models (explicit/implicit, pointwise/preference-wise, etc.), feedback types (preference/binary, pairwise/listwise, etc.), RL objectives (reference-based/reference-free, etc.), and optimization schemes (online/offline, etc.).
>> Optimization: iterative/online versus non-iterative/offline preference optimization; keeping instruction fine-tuning and alignment separate or merging them.
Advantages: folding human preferences directly into fine-tuning makes LLMs more consistent with human intent. RLHF models such as InstructGPT outperform baselines like GPT-3 on truthfulness and harmlessness, and the many extensions of the RLHF framework lay the groundwork for further alignment research.
>> Cost efficiency: RLAIF reduces reliance on expensive human feedback.
>> Flexibility: multiple choices of feedback type and reward model fit different application scenarios.
>> Safety and reliability: the alignment process lowers the risk of generating inappropriate content.
Overall, the survey systematically reviews the main advances in LLM alignment techniques over the past two years, summarizing the challenges, the proposed solutions, and their strengths and weaknesses, and provides a comprehensive overview for follow-up research in this field. |
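For reference, these are the two standard formulas behind the reward-model-plus-RL pipeline described above, in the notation commonly used by RLHF papers (σ is the logistic function, π_ref the frozen reference policy, β the KL weight; individual methods vary in the exact objective they optimize):

```latex
% Bradley-Terry preference model and the reward-model training loss
P\bigl(y_w \succ y_l \mid x\bigr) = \sigma\bigl(r_\phi(x, y_w) - r_\phi(x, y_l)\bigr),
\qquad
\mathcal{L}_R(\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\bigl[\log \sigma\bigl(r_\phi(x, y_w) - r_\phi(x, y_l)\bigr)\bigr]

% KL-regularized RL objective maximized by the aligned policy
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\bigl[r_\phi(x, y)\bigr]
\;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\bigl[\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\bigr]
```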
Abstract
With advancements in self-supervised learning, the availability of trillions of tokens in a pre-training corpus, instruction fine-tuning, and the development of large Transformers with billions of parameters, large language models (LLMs) are now capable of generating factual and coherent responses to human queries. However, the mixed quality of training data can lead to the generation of undesired responses, presenting a significant challenge. Over the past two years, various methods have been proposed from different perspectives to enhance LLMs, particularly in aligning them with human expectations. Despite these efforts, there has not been a comprehensive survey paper that categorizes and details these approaches. In this work, we aim to address this gap by categorizing these papers into distinct topics and providing detailed explanations of each alignment method, thereby helping readers gain a thorough understanding of the current state of the field.
1 Introduction
Over the past decades, the pretraining of LLMs through self-supervised learning [1] has seen significant advancements. These improvements have been driven by the development of larger decoder-only Transformers, the utilization of trillions of tokens, and the parallelization of computations across multiple GPUs. Following the pretraining phase, instruction tuning was employed to guide LLMs in responding to human queries. Despite these advancements, a critical issue remains unresolved: LLMs can generate undesired responses, such as providing instructions on how to commit illegal activities. To mitigate this risk, it is essential to align LLMs with human values. Reinforcement Learning from Human Feedback (RLHF) [2, 3] has emerged as a groundbreaking technique for aligning LLMs. This approach has led to the development of powerful models such as GPT-4 [4], Claude [5], and Gemini [6]. Following the introduction of RLHF, numerous studies have explored various approaches to further align LLMs. However, there has not yet been a comprehensive review of methods for aligning LLMs with human preferences. This paper aims to fill that gap by categorically reviewing existing literature and providing detailed analyses of individual papers.

In this paper, we have structured our review into four main topics: 1. Reward Model; 2. Feedback; 3. Reinforcement Learning (RL); and 4. Optimization. Each topic was further divided into subtopics as shown in Figure 1. For the Reward Model, the subtopics were: 1. Explicit Reward Model vs. Implicit Reward Model; 2. Pointwise Reward Model vs. Preference Model; 3. Response-Level Reward vs. Token-Level Reward; and 4. Negative Preference Optimization. Regarding Feedback, the subtopics included: 1. Preference Feedback vs. Binary Feedback; 2. Pairwise Feedback vs. Listwise Feedback; and 3. Human Feedback vs. AI Feedback. In the RL section, the subtopics were: 1. Reference-Based RL vs. Reference-Free RL; 2. Length-Control RL; 3. Different Divergences in RL; and 4. On-Policy RL vs. Off-Policy RL. For Optimization, the subtopics were: 1. Online/Iterative Preference Optimization vs. Offline/Non-iterative Preference Optimization; and 2. Separating SFT and Alignment vs. Merging SFT and Alignment. Table 1 provided a detailed analysis of all the reviewed papers using these 13 evaluation metrics.
Figure 1: The 13 categorical directions for xPO to align an LLM with human preference
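As a compact reference, the four topics and 13 subtopics above can be laid out as a plain Python dictionary (the names are copied from the introduction; the data structure itself is only an illustration, nothing defined by the survey):

```python
# The survey's 4 main topics and 13 categorical subtopics, arranged as a dictionary
# for quick lookup. Names follow Figure 1 and the list in the introduction above.
ALIGNMENT_TAXONOMY = {
    "Reward Model": [
        "Explicit vs. Implicit Reward Model",
        "Pointwise Reward Model vs. Preference Model",
        "Response-Level vs. Token-Level Reward",
        "Negative Preference Optimization",
    ],
    "Feedback": [
        "Preference vs. Binary Feedback",
        "Pairwise vs. Listwise Feedback",
        "Human vs. AI Feedback",
    ],
    "Reinforcement Learning": [
        "Reference-Based vs. Reference-Free RL",
        "Length-Control RL",
        "Different Divergences in RL",
        "On-Policy vs. Off-Policy RL",
    ],
    "Optimization": [
        "Online/Iterative vs. Offline/Non-iterative Preference Optimization",
        "Separating vs. Merging SFT and Alignment",
    ],
}

# Sanity check: 4 + 3 + 4 + 2 = 13 categorical directions.
assert sum(len(subtopics) for subtopics in ALIGNMENT_TAXONOMY.values()) == 13
```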

4 Future Directions
Based on the analysis of the reviewed papers, several research problems have been identified for further exploration.
4.1 General Tasks for Alignment Evaluation
When reviewing various papers, different tasks were used to evaluate the performance of these methods. However, some tasks, like GSM8K [65], which focused more on reasoning, might not be suitable for assessing alignment performance. In contrast, tasks like TruthfulQA [45] or those addressing toxicity should be prioritized for evaluating the toxicity of fine-tuned LLMs. There should be an effort to combine these tasks and create a unified leaderboard for alignment evaluation.
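One possible shape for such a unified leaderboard is sketched below: each model receives normalized scores on alignment-oriented dimensions (for example, truthfulness via TruthfulQA and a toxicity benchmark), which are aggregated into a single ranking score. The dimensions, weights, and numbers are hypothetical placeholders, not values from the survey.

```python
# Per-dimension weights for an aggregate alignment score; dimensions, weights,
# and model scores below are illustrative placeholders only.
WEIGHTS = {"truthfulness": 0.5, "non_toxicity": 0.5}

def alignment_score(scores):
    """Weighted average of normalized (0-1) per-dimension scores for one model."""
    return sum(WEIGHTS[task] * scores[task] for task in WEIGHTS) / sum(WEIGHTS.values())

leaderboard = {
    "model_a": alignment_score({"truthfulness": 0.62, "non_toxicity": 0.88}),
    "model_b": alignment_score({"truthfulness": 0.71, "non_toxicity": 0.79}),
}
for name, score in sorted(leaderboard.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```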
4.2 Apply Implicit Reward Models, Listwise Preference and Nash Learning to Larger Scale LMs
Currently, implicit reward model methods have been applied only to models with up to 70B parameters. Extending these methods to even larger models, such as those the size of GPT-4 and Claude-3, can provide insights into their effectiveness compared to RLHF/PPO. Similarly, the listwise preference model warrants further investigation. In RLHF, preference datasets were collected using listwise preference but were subsequently transformed into multiple pairs of pairwise preferences. The potential issues associated with applying listwise preference models at larger scales remain to be addressed. Lastly, Nash learning can address the inconsistency among human labelers. Incorporating a Nash learning model into larger-scale LLMs can demonstrate its ability to capture the complexity of human nature.
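The listwise-to-pairwise transformation mentioned above is mechanical; here is a minimal sketch, assuming the responses for each prompt are already sorted from most to least preferred (the function name and data layout are ours, not the survey's):

```python
from itertools import combinations

def listwise_to_pairwise(prompt, ranked_responses):
    """Expand one listwise ranking (best response first) into all implied pairwise
    preferences: a ranking of K responses yields K*(K-1)/2 (prompt, chosen, rejected)
    triples, the form typically used for reward-model or DPO-style training."""
    return [
        (prompt, chosen, rejected)
        for chosen, rejected in combinations(ranked_responses, 2)
    ]

# A ranking of 3 responses produces 3 pairwise comparisons.
pairs = listwise_to_pairwise("Explain KL divergence.", ["best answer", "okay answer", "poor answer"])
assert len(pairs) == 3 and pairs[0] == ("Explain KL divergence.", "best answer", "okay answer")
```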
4.3 Experiments on Binary Feedback
Both KTO and DRO utilized binary feedback mechanisms, such as "thumbs up" and "thumbs down", instead of pairwise preferences. These binary feedbacks were derived from preference datasets, where desired responses were marked as positive and undesired responses as negative. Further research is needed on realistic binary datasets. Additionally, binary datasets are easier to collect compared to pairwise preference data, making it feasible to use larger-scale binary feedback datasets for alignment. However, the noise in binary feedback may be more pronounced than in preference datasets, raising the intriguing question of how to effectively filter out noisy data.
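A minimal sketch of how binary feedback is typically derived from a pairwise preference dataset, as described above (the field names and data layout are assumptions for illustration, not taken from any specific library):

```python
def preferences_to_binary(preference_triples):
    """Convert (prompt, chosen, rejected) triples into KTO/DRO-style binary feedback:
    the desired response becomes a positive ("thumbs up") example and the undesired
    response a negative ("thumbs down") one."""
    binary_examples = []
    for prompt, chosen, rejected in preference_triples:
        binary_examples.append({"prompt": prompt, "response": chosen, "label": True})
        binary_examples.append({"prompt": prompt, "response": rejected, "label": False})
    return binary_examples

examples = preferences_to_binary([("Summarize the article.", "faithful summary", "off-topic reply")])
assert [e["label"] for e in examples] == [True, False]
```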
4.4 Experiments on Helpful AI Feedback
Current AI feedback primarily includes harmless feedback in RLAIF and feedback ranking in iterative DPO. However, in RLAIF, helpful feedback is still provided by human labelers. This approach is reasonable, as generating helpful responses is significantly more challenging than identifying harmful ones. An intriguing future direction involves using LLMs to generate helpful feedback, thereby enabling LLMs to self-improve.
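A hedged sketch of what LLM-generated helpfulness feedback could look like: a judge model ranks candidate responses so that the ranking can feed an iterative DPO-style loop. The `query_llm` function is a hypothetical stand-in for whatever model or API is available, not a real client:

```python
def query_llm(prompt):
    """Placeholder for an actual LLM call (API client, local model, etc.)."""
    raise NotImplementedError("plug in a real LLM backend here")

def rank_by_helpfulness(user_prompt, candidates):
    """Ask a judge LLM to order candidate responses from most to least helpful."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    judge_prompt = (
        "Rate the candidate answers for helpfulness only (ignore style).\n"
        f"Question: {user_prompt}\n"
        f"Candidates:\n{numbered}\n"
        "Reply with the candidate numbers from most to least helpful, comma-separated."
    )
    order = [int(token) - 1 for token in query_llm(judge_prompt).split(",")]
    return [candidates[i] for i in order]
```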
4.5 Speeding up Nash Learning
The proposed Nash learning method effectively modeled pairwise preferences and addressed inconsistencies arising from human labeling. However, it necessitated multiple iterations to converge to the optimal policy. Although the authors did not specify the time required for alignment, it was presumed to be significantly slower compared to implicit reward models such as DPO. This area warrants further research attention to speed up the Nash learning process.
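For context, the Nash-learning objective is usually written as the equilibrium of a two-player preference game, which is why finding it requires iterative, self-play style updates rather than a single maximization; the notation below is a sketch in standard form, not a formula quoted from this survey:

```latex
% Sketch (standard notation): the aligned policy is the Nash equilibrium of a
% two-player game played against the preference model P, typically reached by
% iterating best-response / mirror-descent style updates.
\pi^{*} \;=\; \arg\max_{\pi}\; \min_{\pi'}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)}
\bigl[\, \mathcal{P}\bigl(y \succ y' \mid x\bigr) \,\bigr]
```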
4.6 Termination of Iterative/Online Learning
When applying iterative or online training, determining when to terminate the iteration is crucial. Previous research has noted that iterative learning can sometimes degrade the performance of LLMs on specific tasks, which can be a sign of overfitting. However, identifying a reasonable epoch for stopping the iteration remains an unexplored area.
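A simple early-stopping rule along these lines is sketched below; the training and evaluation routines are passed in as placeholders, since the survey does not prescribe any particular ones:

```python
def align_with_early_stopping(policy, train_one_iteration, evaluate_on_tasks,
                              max_iterations=10, patience=2):
    """Run iterative/online preference optimization, but stop once held-out task
    performance has degraded for `patience` consecutive iterations (a crude proxy
    for the overfitting described above)."""
    best_score, best_policy, bad_iterations = float("-inf"), policy, 0
    for _ in range(max_iterations):
        policy = train_one_iteration(policy)   # one round of preference optimization
        score = evaluate_on_tasks(policy)      # held-out alignment / task score
        if score > best_score:
            best_score, best_policy, bad_iterations = score, policy, 0
        else:
            bad_iterations += 1
            if bad_iterations >= patience:     # sustained drop: likely overfitting
                break
    return best_policy
```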
4.7 Simplify SFT + Alignment
Current methodologies typically implemented SFT and alignment in a consecutive manner. However, this approach often resulted in catastrophic forgetting and rendered the training process laborious. The PAFT method mitigated catastrophic forgetting by fine-tuning SFT and alignment separately before merging them, albeit at the cost of increased complexity. Conversely, the ORPO technique integrated both processes simultaneously, but this led to a decline in performance. Thus, the challenge of effectively combining SFT and alignment to achieve high performance while maintaining efficiency remains unresolved.
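For concreteness, here is a sketch of the kind of single-stage objective that ORPO-style merging optimizes, written in standard notation (λ weights the odds-ratio term against the SFT negative log-likelihood); this is a from-the-literature sketch rather than a formula reproduced from this survey:

```latex
% Merged SFT + alignment objective in the ORPO style (standard notation): the SFT
% negative log-likelihood on the preferred response is combined with an odds-ratio
% term that pushes the policy away from the rejected response.
\mathcal{L}(\theta) \;=\;
\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\Bigl[ -\log \pi_\theta(y_w \mid x)
\;-\; \lambda \log \sigma\!\Bigl(\log \tfrac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)}\Bigr) \Bigr],
\qquad
\mathrm{odds}_\theta(y \mid x) \;=\; \frac{\pi_\theta(y \mid x)}{1 - \pi_\theta(y \mid x)}
```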