Paper: GPT-3, "Language Models are Few-Shot Learners": Translation and Commentary
Abstract
1 Introduction
2 Approach
2.1 Model and Architectures
2.2 Training Dataset
2.3 Training Process
2.4 Evaluation
3 Results
3.1 Language Modeling, Cloze, and Completion Tasks
3.1.1 Language Modeling
3.1.2 LAMBADA
3.1.3 HellaSwag
3.1.4 StoryCloze
3.2 Closed Book Question Answering
3.3 Translation
3.4 Winograd-Style Tasks
3.5 Common Sense Reasoning
Next we evaluate GPT-3 on the task of reading comprehension. We use a suite of 5 datasets including abstractive, multiple choice, and span based answer formats in both dialog and single question settings. We observe a wide spread in GPT-3’s performance across these datasets, suggestive of varying capability with different answer formats. In general we observe GPT-3 is on par with initial baselines and early results trained using contextual representations on each respective dataset.
GPT-3 performs best (within 3 points of the human baseline) on CoQA [RCM19], a free-form conversational dataset, and performs worst (13 F1 below an ELMo baseline) on QuAC [CHI+18], a dataset which requires modeling structured dialog acts and answer span selections of teacher-student interactions. On DROP [DWD+19], a dataset testing discrete reasoning and numeracy in the context of reading comprehension, GPT-3 in a few-shot setting outperforms the fine-tuned BERT baseline from the original paper but is still well below both human performance and state-of-the-art approaches which augment neural networks with symbolic systems [RLL+19]. On SQuAD 2.0 [RJL18], GPT-3 demonstrates its few-shot learning capabilities, improving by almost 10 F1 (to 69.8) compared to a zero-shot setting. This allows it to slightly outperform the best fine-tuned result in the original paper. On RACE [LXL+17], a multiple choice dataset of middle school and high school English examinations, GPT-3 performs relatively weakly: it is only competitive with the earliest work utilizing contextual representations and is still 45% behind SOTA.
In order to better aggregate results on NLP tasks and compare to popular models such as BERT and RoBERTa in a more systematic way, we also evaluate GPT-3 on a standardized collection of datasets, the SuperGLUE benchmark [WPN+19] [CLC+19] [DMST19] [RBG11] [KCR+18] [ZLL+18] [DGM06] [BHDD+06] [GMDD07] [BDD+09] [PCC18] [PHR+18]. GPT-3’s test-set performance on the SuperGLUE dataset is shown in Table 3.8. In the few-shot setting, we used 32 examples for all tasks, sampled randomly from the training set. For all tasks except WSC and MultiRC, we sampled a new set of examples to use in the context for each problem. For WSC and MultiRC, we used the same set of randomly drawn examples from the training set as context for all of the problems we evaluated.
We observe a wide range in GPT-3’s performance across tasks. On COPA and ReCoRD, GPT-3 achieves near-SOTA performance in the one-shot and few-shot settings, with COPA falling only a couple of points short and achieving second place on the leaderboard, where first place is held by a fine-tuned 11 billion parameter model (T5). On WSC, performance is still relatively strong, achieving 80.1% in the few-shot setting (note that GPT-3 achieves 88.6% on the original Winograd dataset, as described in Section 3.4). On BoolQ, MultiRC, and RTE, performance is reasonable, roughly matching that of a fine-tuned BERT-Large. On CB, we see signs of life at 75.6%.
WiC is a notable weak spot with few-shot performance at 49.4% (at random chance). We tried a number of different phrasings and formulations for WiC (which involves determining if a word is being used with the same meaning in two sentences), none of which was able to achieve strong performance. This hints at a phenomenon that will become clearer in the next section (which discusses the ANLI benchmark): GPT-3 appears to be weak in the few-shot or one-shot setting at some tasks that involve comparing two sentences or snippets, for example whether a word is used the same way in two sentences (WiC), whether one sentence is a paraphrase of another, or whether one sentence implies another. This could also explain the comparatively low scores for RTE and CB, which also follow this format. Despite these weaknesses, GPT-3 still outperforms a fine-tuned BERT-Large on four of eight tasks, and on two tasks GPT-3 is close to the state-of-the-art held by a fine-tuned 11 billion parameter model.
Finally, we note that the few-shot SuperGLUE score steadily improves with both model size and with number of examples in the context, showing increasing benefits from in-context learning (Figure 3.8). We scale K up to 32 examples per task, after which point additional examples will not reliably fit into our context. When sweeping over values of K, we find that GPT-3 requires fewer than eight total examples per task to outperform a fine-tuned BERT-Large on overall SuperGLUE score.
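The context-sampling scheme described for SuperGLUE (K examples drawn at random from the training set, either freshly per problem or once for the whole task as with WSC and MultiRC) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the `shared` flag, and the seed handling are assumptions introduced here.

```python
import random

def sample_context(train_set, k, rng):
    """Draw k distinct demonstrations from the training set."""
    return rng.sample(train_set, k)

def contexts_for_task(train_set, problems, k=32, shared=False, seed=0):
    """Return one list of k demonstrations per evaluation problem.

    shared=False: a fresh random draw for every problem.
    shared=True:  one draw reused for all problems (the WSC/MultiRC case).
    Both the flag and the seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    if shared:
        fixed = sample_context(train_set, k, rng)
        return [fixed for _ in problems]
    return [sample_context(train_set, k, rng) for _ in problems]
```

Sweeping K then amounts to calling `contexts_for_task` with different values of `k`, up to whatever fits in the model's context window.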
Natural Language Inference (NLI) [Fyo00] concerns the ability to understand the relationship between two sentences. In practice, this task is usually structured as a two or three class classification problem where the model classifies whether the second sentence logically follows from the first, contradicts the first sentence, or is possibly true (neutral). SuperGLUE includes an NLI dataset, RTE, which evaluates the binary version of the task. On RTE, only the largest version of GPT-3 performs convincingly better than random (56%) in any evaluation setting, but in a few-shot setting GPT-3 performs similarly to a single-task fine-tuned BERT-Large. We also evaluate on the recently introduced Adversarial Natural Language Inference (ANLI) dataset [NWD+19]. ANLI is a difficult dataset employing a series of adversarially mined natural language inference questions in three rounds (R1, R2, and R3). Similar to RTE, all of our models smaller than GPT-3 perform at almost exactly random chance on ANLI, even in the few-shot setting (∼33%), whereas GPT-3 itself shows signs of life on Round 3. Results for ANLI R3 are highlighted in Figure 3.9 and full results for all rounds can be found in Appendix H. These results on both RTE and ANLI suggest that NLI is still a very difficult task for language models and they are only just beginning to show signs of progress.
One way to probe GPT-3’s range of abilities in the few-shot (or zero- and one-shot) setting is to give it tasks which require it to perform simple on-the-fly computational reasoning, recognize a novel pattern that is unlikely to have occurred in training, or adapt quickly to an unusual task. We devise several tasks to test this class of abilities. First, we test GPT-3’s ability to perform arithmetic. Second, we create several tasks that involve rearranging or unscrambling the letters in a word, tasks which are unlikely to have been exactly seen during training. Third, we test GPT-3’s ability to solve SAT-style analogy problems few-shot. Finally, we test GPT-3 on several qualitative tasks, including using new words in a sentence, correcting English grammar, and news article generation. We will release the synthetic datasets with the hope of stimulating further study of test-time behavior of language models.
To test GPT-3’s ability to perform simple arithmetic operations without task-specific training, we developed a small battery of 10 tests that involve asking GPT-3 a simple arithmetic problem in natural language:
In all 10 tasks the model must generate the correct answer exactly. For each task we generate a dataset of 2,000 random instances of the task and evaluate all models on those instances. First we evaluate GPT-3 in the few-shot setting, for which results are shown in Figure 3.10. On addition and subtraction, GPT-3 displays strong proficiency when the number of digits is small, achieving 100% accuracy on 2-digit addition, 98.9% on 2-digit subtraction, 80.2% on 3-digit addition, and 94.2% on 3-digit subtraction. Performance decreases as the number of digits increases, but GPT-3 still achieves 25-26% accuracy on four-digit operations and 9-10% accuracy on five-digit operations, suggesting at least some capacity to generalize to larger numbers of digits. GPT-3 also achieves 29.2% accuracy at 2-digit multiplication, an especially computationally intensive operation. Finally, GPT-3 achieves 21.3% accuracy at single-digit combined operations (for example, 9*(7+5)), suggesting that it has some robustness beyond just single operations.
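The evaluation protocol above (2,000 random instances per task, scored by exact match) might be sketched like this for the addition tasks. The question template and the uniform sampling range are assumptions made for illustration; this section does not spell out either, and the helper names are hypothetical.

```python
import random

def make_addition_problems(digits, n=2000, seed=0):
    """Generate n random addition problems with operands of up to `digits` digits.

    Assumptions: operands are drawn uniformly from 0 .. 10**digits - 1 and the
    question wording is a placeholder, not necessarily the original prompt.
    """
    rng = random.Random(seed)
    upper = 10 ** digits - 1
    problems = []
    for _ in range(n):
        a, b = rng.randint(0, upper), rng.randint(0, upper)
        problems.append((f"What is {a} plus {b}?", str(a + b)))
    return problems

def exact_match_accuracy(predictions, problems):
    """Fraction of predictions that equal the reference answer exactly."""
    correct = sum(
        pred.strip() == answer
        for pred, (_, answer) in zip(predictions, problems)
    )
    return correct / len(problems)
```

Exact match gives no partial credit, so a dropped carry digit counts as fully wrong, which is why accuracy falls off quickly as the number of digits grows.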
As Figure 3.10 makes clear, small models do poorly on all of these tasks: even the 13 billion parameter model (the second largest after the 175 billion full GPT-3) can solve 2-digit addition and subtraction only half the time, and all other operations less than 10% of the time. One-shot and zero-shot performance are somewhat degraded relative to few-shot performance, suggesting that adaptation to the task (or at the very least recognition of the task) is important to performing these computations correctly. Nevertheless, one-shot performance is still quite strong, and even zero-shot performance of the full GPT-3 significantly outperforms few-shot learning for all smaller models. All three settings for the full GPT-3 are shown in Table 3.9, and model capacity scaling for all three settings is shown in Appendix H.
To spot-check whether the model is simply memorizing specific arithmetic problems, we took the 3-digit arithmetic problems in our test set and searched for them in our training data in both the forms "<NUM1> + <NUM2> =" and "<NUM1> plus <NUM2>". Out of 2,000 addition problems we found only 17 matches (0.8%) and out of 2,000 subtraction problems we found only 2 matches (0.1%), suggesting that only a trivial fraction of the correct answers could have been memorized. In addition, inspection of incorrect answers reveals that the model often makes mistakes such as not carrying a “1”, suggesting it is actually attempting to perform the relevant computation rather than memorizing a table.
Overall, GPT-3 displays reasonable proficiency at moderately complex arithmetic in few-shot, one-shot, and even zero-shot settings.
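The memorization spot-check can be illustrated with a small search over a text corpus for the two addition patterns quoted above. This is only a sketch: holding the training data in a single string and guarding against partial-number matches with lookarounds are assumptions of this example, not details given in the text.

```python
import re

def count_memorized(problems, corpus_text):
    """Count (a, b) addition problems appearing in the corpus in either form.

    Forms searched: "<NUM1> + <NUM2> =" and "<NUM1> plus <NUM2>".
    The digit lookarounds (an assumption of this sketch) keep "23 + 4 ="
    from matching inside "123 + 4 =".
    """
    hits = 0
    for a, b in problems:
        symbolic = rf"(?<!\d){a} \+ {b} ="
        verbal = rf"(?<!\d){a} plus {b}(?!\d)"
        if re.search(symbolic, corpus_text) or re.search(verbal, corpus_text):
            hits += 1
    return hits
```

A real training set would need a streaming or indexed search rather than one in-memory string, but the matching logic would be the same.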
To test GPT-3’s ability to learn novel symbolic manipulations from a few examples, we designed a small battery of 5 “character manipulation” tasks. Each task involves giving the model a word distorted by some combination of scrambling, addition, or deletion of characters, and asking it to recover the original word. The 5 tasks are:
For each task we generate 10,000 examples, which we chose to be the top 10,000 most frequent words as measured by [Nor09] of length more than 4 characters and less than 15 characters. The few-shot results are shown in Figure 3.11. Task performance tends to grow smoothly with model size, with the full GPT-3 model achieving 66.9% on removing random insertions, 38.6% on cycling letters, 40.2% on the easier anagram task, and 15.1% on the more difficult anagram task (where only the first and last letters are held fixed). None of the models can reverse the letters in a word.
In the one-shot setting, performance is significantly weaker (dropping by half or more), and in the zero-shot setting the model can rarely perform any of the tasks (Table 3.10). This suggests that the model really does appear to learn these tasks at test time, as the model cannot perform them zero-shot and their artificial nature makes them unlikely to appear in the pre-training data (although we cannot confirm this with certainty).
We can further quantify performance by plotting “in-context learning curves”, which show task performance as a function of the number of in-context examples. We show in-context learning curves for the Symbol Insertion task in Figure 1.2. We can see that larger models are able to make increasingly effective use of in-context information, including both task examples and natural language task descriptions.
Finally, it is worth adding that solving these tasks requires character-level manipulations, whereas our BPE encoding operates on significant fractions of a word (on average ∼0.7 words per token), so from the LM’s perspective succeeding at these tasks involves not just manipulating BPE tokens but understanding and pulling apart their substructure. Also, CL, A1, and A2 are not bijective (that is, the unscrambled word is not a deterministic function of the scrambled word), requiring the model to perform some search to find the correct unscrambling. Thus, the skills involved appear to require non-trivial pattern-matching and computation.
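To make the distortions concrete, the sketch below produces the kinds of inputs these character-manipulation tasks describe: cycled letters, anagrams with fixed end letters, random insertions, and reversed words. The function names, the `keep` parameter, and the choice of inserted symbols are assumptions of this example; `keep=1` corresponds to the harder anagram variant described above, while the exact setup of the easier variant is not given in this section.

```python
import random
import string

def cycle_letters(word, rng):
    """Rotate the word by a random non-zero offset."""
    k = rng.randrange(1, len(word))
    return word[k:] + word[:k]

def anagram_keep_ends(word, rng, keep=1):
    """Shuffle the interior letters, leaving `keep` letters fixed at each end."""
    middle = list(word[keep:len(word) - keep])
    rng.shuffle(middle)
    return word[:keep] + "".join(middle) + word[len(word) - keep:]

def random_insertion(word, rng, symbols=string.punctuation + " "):
    """Insert one random symbol between each pair of adjacent letters (assumed scheme)."""
    return "".join(ch + rng.choice(symbols) for ch in word[:-1]) + word[-1]

def reverse_word(word):
    """Spell the word backwards."""
    return word[::-1]
```

Because a shuffle or rotation can coincide with another valid word, several originals may map to the same distorted string, which is the sense in which some of these tasks are not bijective.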
To test GPT-3 on another task that is somewhat unusual relative to the typical distribution of text, we collected a set of 374 “SAT analogy” problems [TLBS03]. Analogies are a style of multiple choice question that constituted a section of the SAT college entrance exam before 2005. A typical example is “audacious is to boldness as (a) sanctimonious is to hypocrisy, (b) anonymous is to identity, (c) remorseful is to misdeed, (d) deleterious is to result, (e) impressionable is to temptation”. The student is expected to choose which of the five word pairs has the same relationship as the original word pair; in this example the answer is “sanctimonious is to hypocrisy”. On this task GPT-3 achieves 65.2% in the few-shot setting, 59.1% in the one-shot setting, and 53.7% in the zero-shot setting, whereas the average score among college applicants was 57% [TL05] (random guessing yields 20%). As shown in Figure 3.12, the results improve with scale, with the full 175 billion model improving by over 10% compared to the 13 billion parameter model.
Previous work on generative language models qualitatively tested their ability to generate synthetic “news articles” by conditional sampling from the model given a human-written prompt consisting of a plausible first sentence for a news story [RWC+19]. Relative to [RWC+19], the dataset used to train GPT-3 is much less weighted towards news articles, so trying to generate news articles via raw unconditional samples is less effective; for example, GPT-3 often interprets the proposed first sentence of a “news article” as a tweet and then posts synthetic responses or follow-up tweets. To solve this problem we employed GPT-3’s few-shot learning abilities by providing three previous news articles in the model’s context to condition it. With the title and subtitle of a proposed next article, the model is able to reliably generate short articles in the “news” genre.
To gauge the quality of news article generation from GPT-3 (which we believe is likely to be correlated with conditional sample generation quality in general), we decided to measure human ability to distinguish GPT-3-generated articles from real ones. Similar work has been carried out by Kreps et al. [KMB20] and Zellers et al. [ZHR+19]. Generative language models are trained to match the distribution of content generated by humans, so the (in)ability of humans to distinguish the two is a potentially important measure of quality.
In order to see how well humans can detect model generated text, we arbitrarily selected 25 article titles and subtitles from the website newser.com (mean length: 215 words). We then generated completions of these titles and subtitles from four language models ranging in size from 125M to 175B (GPT-3) parameters (mean length: 200 words). For each model, we presented around 80 US-based participants with a quiz consisting of these real titles and subtitles followed by either the human written article or the article generated by the model. Participants were asked to select whether the article was “very likely written by a human”, “more likely written by a human”, “I don’t know”, “more likely written by a machine”, or “very likely written by a machine”.
The articles we selected were not in the models’ training data and the model outputs were formatted and selected programmatically to prevent human cherry-picking. All models used the same context to condition outputs on and were pre-trained with the same context size, and the same article titles and subtitles were used as prompts for each model. However, we also ran an experiment to control for participant effort and attention that followed the same format but involved intentionally bad model generated articles. This was done by generating articles from a “control model”: a 160M parameter model with no context and increased output randomness.
Mean human accuracy (the ratio of correct assignments to non-neutral assignments per participant) at detecting that the intentionally bad articles were model generated was ∼86%, where 50% is chance level performance. By contrast, mean human accuracy at detecting articles that were produced by the 175B parameter model was barely above chance at ∼52% (see Table 3.11). Human abilities to detect model generated text appear to decrease as model size increases: there appears to be a trend towards chance accuracy with model size, and human detection of GPT-3 is close to chance. This is true despite the fact that participants spend more time on each output as model size increases (see Appendix E).
Examples of synthetic articles from GPT-3 are given in Figures 3.14 and 3.15. Much of the text is, as the evaluations indicate, difficult for humans to distinguish from authentic human content. Factual inaccuracies can be an indicator that an article is model generated since, unlike human authors, the models have no access to the specific facts that the article titles refer to or to when the article was written. Other indicators include repetition, non sequiturs, and unusual phrasings, though these are often subtle enough that they are not noticed.
Related work on language model detection by Ippolito et al. [IDCBE19] indicates that automatic discriminators like GROVER [ZHR+19] and GLTR [GSR19] may have greater success at detecting model generated text than human evaluators. Automatic detection of these models may be a promising area of future research.
Ippolito et al. [IDCBE19] also note that human accuracy at detecting model generated text increases as humans observe more tokens. To do a preliminary investigation of how good humans are at detecting longer news articles generated by GPT-3 175B, we selected 12 world news articles from Reuters with an average length of 569 words and generated completions of these articles from GPT-3 with an average length of 498 words (298 words longer than our initial experiments). Following the methodology above, we ran two experiments, each on around 80 US-based participants, to compare human abilities to detect the articles generated by GPT-3 and a control model.
We found that mean human accuracy at detecting the intentionally bad longer articles from the control model was ∼88%, while mean human accuracy at detecting the longer articles that were produced by GPT-3 175B was ∼52% (see Table 3.12). This indicates that, for news articles that are around 500 words long, GPT-3 continues to produce articles that humans find difficult to distinguish from human written ones.
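The accuracy metric used in these experiments (correct assignments divided by non-neutral assignments per participant, then averaged) could be computed roughly as below. The data layout, the straight-apostrophe spelling of the neutral label, and the decision to skip participants who answered only "I don't know" are assumptions of this sketch.

```python
MACHINE_LABELS = {
    "very likely written by a machine",
    "more likely written by a machine",
}
NEUTRAL_LABEL = "I don't know"

def participant_accuracy(responses):
    """responses: list of (chosen_label, is_machine_written) pairs.

    Returns correct / non-neutral, or None if every answer was neutral
    (how such participants were handled is an assumption here).
    """
    scored = [(label, truth) for label, truth in responses if label != NEUTRAL_LABEL]
    if not scored:
        return None
    correct = sum((label in MACHINE_LABELS) == truth for label, truth in scored)
    return correct / len(scored)

def mean_human_accuracy(all_responses):
    """Average the per-participant accuracies, ignoring undefined ones."""
    scores = [s for s in map(participant_accuracy, all_responses) if s is not None]
    return sum(scores) / len(scores)
```

Under this definition a participant guessing at random among the non-neutral options lands near 50%, which is the chance level the reported figures are compared against.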
A task studied in developmental linguistics [CB78] is the ability to learn and utilize new words, for example using a word in a sentence after seeing it defined only once, or conversely inferring a word’s meaning from only one usage. Here we qualitatively test GPT-3’s ability to do the former. Specifically, we give GPT-3 the definition of a nonexistent word, such as “Gigamuru”, and then ask it to use it in a sentence. We provide one to five previous examples of a (separate) nonexistent word being defined and used in a sentence, so the task is few-shot in terms of previous examples of the broad task and one-shot in terms of the specific word. Table 3.16 shows the 6 examples we generated; all definitions were human-generated, and the first answer was human-generated as conditioning while the subsequent answers were generated by GPT-3. These examples were generated continuously in one sitting and we did not omit or repeatedly try any prompts. In all cases the generated sentence appears to be a correct or at least plausible use of the word. In the final sentence the model generates a plausible conjugation for the word “screeg” (namely “screeghed”), although the use of the word is slightly awkward (“screeghed at each other”) despite being plausible in the sense that it could describe a toy sword fight. Overall, GPT-3 appears to be at least proficient at the task of using novel words in a sentence.
Another task well suited for few-shot learning is correcting English grammar. We test this with GPT-3 in the few-shot setting by giving prompts of the form "Poor English Input: <sentence>\n Good English Output: <sentence>". We give GPT-3 one human-generated correction and then ask it to correct 5 more (again without any omissions or repeats). Results are shown in Figure 3.17.
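A prompt in the format quoted above could be assembled along these lines. This is a hedged sketch: the blank line between demonstrations, the dropped space after the newline, and the example sentences are assumptions, and `grammar_prompt` is a hypothetical helper rather than anything from the paper.

```python
def grammar_prompt(demonstrations, sentence):
    """Build a few-shot grammar-correction prompt.

    demonstrations: list of (poor_sentence, corrected_sentence) pairs.
    The final block leaves the output empty for the model to complete.
    """
    blocks = [
        f"Poor English Input: {poor}\nGood English Output: {good}"
        for poor, good in demonstrations
    ]
    blocks.append(f"Poor English Input: {sentence}\nGood English Output:")
    return "\n\n".join(blocks)

# Illustrative, made-up sentences.
prompt = grammar_prompt(
    [("I eated the apple.", "I ate the apple.")],
    "She no like going there.",
)
```

The model's continuation after the last "Good English Output:" is then taken as its proposed correction.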
Since our training dataset is sourced from the internet, it is possible that our model was trained on some of our benchmark test sets. Accurately detecting test contamination from internet-scale datasets is a new area of research without established best practices. While it is common practice to train large models without investigating contamination, given the increasing scale of pretraining datasets, we believe this issue is becoming increasingly important to attend to.
This concern is not just hypothetical. One of the first papers to train a language model on Common Crawl data [TL18] detected and removed a training document which overlapped with one of their evaluation datasets. Other work such as GPT-2 [RWC+19] also conducted post-hoc overlap analysis. Their study was relatively encouraging, finding that although models did perform moderately better on data that overlapped between training and testing, this did not significantly impact reported results due to the small fraction of data which was contaminated (often only a few percent).
GPT-3 operates in a somewhat different regime. On the one hand, the dataset and model size are about two orders of magnitude larger than those used for GPT-2, and include a large amount of Common Crawl, creating increased potential for contamination and memorization. On the other hand, precisely due to the large amount of data, even GPT-3 175B does not overfit its training set by a significant amount, measured relative to a held-out validation set with which it was deduplicated (Figure 4.1). Thus, we expect that contamination is likely to be frequent, but that its effects may not be as large as feared.
We initially tried to address the issue of contamination by proactively searching for and attempting to remove any overlap between our training data and the development and test sets of all benchmarks studied in this paper. Unfortunately, a bug resulted in only partial removal of all detected overlaps from the training data. Due to the cost of training, it wasn’t feasible to retrain the model. To address this, we investigate in detail how the remaining detected overlap impacts results.
For each benchmark, we produce a “clean” version which removes all potentially leaked examples, defined roughly as examples that have a 13-gram overlap with anything in the pretraining set (or that overlap with the whole example when it is shorter than 13-grams). The goal is to very conservatively flag anything that could potentially be contamination, so as to produce a clean subset that is free of contamination with high confidence. The exact procedure is detailed in Appendix C.
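The rough definition of a potentially leaked example (a 13-gram overlap with the pretraining data, or a whole-example match when the example is shorter than 13 tokens) can be sketched as follows. Whitespace tokenization and in-memory training documents are simplifying assumptions of this example; the exact procedure is the one in Appendix C, which is not reproduced here.

```python
def ngrams(tokens, n):
    """Set of all contiguous n-token windows."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_potentially_leaked(example, train_docs, n=13):
    """Flag an example sharing an n-gram with any training document.

    Examples shorter than n tokens are flagged only if the whole example
    appears contiguously in a training document.
    """
    tokens = example.split()  # assumption: whitespace tokenization
    if not tokens:
        return False
    size = min(n, len(tokens))
    example_grams = ngrams(tokens, size)
    return any(example_grams & ngrams(doc.split(), size) for doc in train_docs)

def clean_subset(benchmark, train_docs, n=13):
    """Keep only the examples that were not flagged."""
    return [ex for ex in benchmark if not is_potentially_leaked(ex, train_docs, n)]
```

At real scale the training n-grams would be hashed into an index built once rather than recomputed per example, but the flagging criterion stays the same.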
We then evaluate GPT-3 on these clean benchmarks, and compare to the original score. If the score on the clean subset is similar to the score on the entire dataset, this suggests that contamination, even if present, does not have a significant effect on reported results. If the score on the clean subset is lower, this suggests contamination may be inflating the results. The results are summarized in Figure 4.2. Although potential contamination is often high (with a quarter of benchmarks scoring over 50%), in most cases performance changes only negligibly, and we see no evidence that contamination level and performance difference are correlated. We conclude that either our conservative method substantially overestimated contamination or that contamination has little effect on performance.
Below, we review in more detail the few specific cases where either (1) the model performs significantly worse on the cleaned version, or (2) potential contamination is very high, which makes measuring the performance difference difficult.
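The clean-versus-full comparison can be expressed as a small summary over per-example outcomes. This is an illustrative sketch only: the input layout (parallel lists of correctness and contamination flags) and the returned field names are assumptions, not the paper's analysis code.

```python
def clean_vs_full(correct, flagged):
    """Summarize accuracy on the full benchmark and on its clean subset.

    correct: list of bools, whether each example was answered correctly.
    flagged: list of bools, whether each example was flagged as potentially leaked.
    """
    if not correct or len(correct) != len(flagged):
        raise ValueError("inputs must be non-empty and of equal length")
    clean_items = [c for c, f in zip(correct, flagged) if not f]
    return {
        "fraction_flagged": sum(flagged) / len(flagged),
        "full_accuracy": sum(correct) / len(correct),
        # None when every example was flagged, so no clean score exists
        "clean_accuracy": sum(clean_items) / len(clean_items) if clean_items else None,
    }
```

A clean accuracy well below the full accuracy would point to inflated results, while a very high flagged fraction leaves too few clean examples for the comparison to be reliable.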
Our analysis flagged six groups of benchmarks for further investigation: Word Scrambling, Reading Comprehension (QuAC, SQuAD2, DROP), PIQA, Winograd, language modeling tasks (Wikitext tasks, 1BW), and German to English translation. Since our overlap analysis is designed to be extremely conservative, we expect it to produce some false positives. We summarize the results for each group of tasks below:
We also inspected datasets where contamination was high, but the impact on performance was close to zero, simply to verify how much actual contamination existed. These appeared to often contain false positives. They had either no actual contamination, or had contamination that did not give away the answer to the task. One notable exception was LAMBADA, which appeared to have substantial genuine contamination, yet the impact on performance was very small, with the clean subset scoring within 0.5% of the full dataset. Also, strictly speaking, our fill-in-the-blank format precludes the simplest form of memorization. Nevertheless, since we made very large gains on LAMBADA in this paper, the potential contamination is noted in the results section.
Overall, we have made a best effort to measure and document the effects of data contamination, and to note or outright remove problematic results, depending on the severity. Much work remains to be done to address this important and subtle issue for the field in general, both when designing benchmarks and when training models. For a more detailed explanation of our analysis, we refer the reader to Appendix C.
GPT-3 and our analysis of it have a number of limitations. Below we describe some of these and suggest directions for future work. First, despite the strong quantitative and qualitative improvements of GPT-3, particularly compared to its direct predecessor GPT-2, it still has notable weaknesses in text synthesis and several NLP tasks. On text synthesis, although the overall quality is high, GPT-3 samples still sometimes repeat themselves semantically at the document level, start to lose coherence over sufficiently long passages, contradict themselves, and occasionally contain non-sequitur sentences or paragraphs. We will release a collection of 500 uncurated unconditional samples to help provide a better sense of GPT-3’s limitations and strengths at text synthesis. Within the domain of discrete language tasks, we have noticed informally that GPT-3 seems to have special difficulty with “common sense physics”, despite doing well on some datasets (such as PIQA [BZB+19]) that test this domain. Specifically GPT-3 has difficulty with questions of the type “If I put cheese into the fridge, will it melt?”. Quantitatively, GPT-3’s in-context learning performance has some notable gaps on our suite of benchmarks, as described in Section 3, and in particular it does little better than chance when evaluated one-shot or even few-shot on some “comparison” tasks, such as determining if two words are used the same way in a sentence, or if one sentence implies another (WIC and ANLI respectively), as well as on a subset of reading comprehension tasks. This is especially striking given GPT-3’s strong few-shot performance on many other tasks. 
| GPT-3和我们对它的分析都有一些局限性。下面我们将对其中一些进行描述,并对未来的工作提出建议。 首先,尽管GPT-3在定量和定性方面有了很大的改进,特别是与它的直接前身GPT-2相比,它在文本合成和一些NLP任务方面仍有明显的缺陷。在文本合成方面,尽管整体质量很高,但GPT-3样本有时仍然在文档层面上语义上重复,在足够长的段落中开始失去连贯性,自相矛盾,偶尔还包含不符合逻辑的句子或段落。我们将发布500个未经管理的无条件样本,以帮助更好地了解GPT-3在文本合成方面的局限性和优势。在离散语言任务领域,我们非正式地注意到GPT-3似乎在“常识物理”方面有特殊的困难,尽管在一些测试该领域的数据集(如PIQA [BZB+19])上做得很好。具体来说,GPT-3很难回答“如果我把奶酪放进冰箱,它会融化吗?”定量,GPT-3的语境学习表现有明显的差距在我们套件的基准,如第三节所述,特别是它没有比机会当评估一次性甚至few-shot一些“比较”的任务,如确定两个词使用同样的方式在一个句子,或者如果一个句子意味着另一个(WIC和ANLI分别),以及阅读理解任务的一个子集。考虑到GPT-3在许多其他任务上的出色的小样本性能,这一点尤其引人注目。 |
GPT-3 has several structural and algorithmic limitations, which could account for some of the issues above. We focused on exploring in-context learning behavior in autoregressive language models because it is straightforward to both sample and compute likelihoods with this model class. As a result our experiments do not include any bidirectional architectures or other training objectives such as denoising. This is a noticeable difference from much of the recent literature, which has documented improved fine-tuning performance when using these approaches over standard language models [RSR+19]. Thus our design decision comes at the cost of potentially worse performance on tasks which empirically benefit from bidirectionality. This may include fill-in-the-blank tasks, tasks that involve looking back and comparing two pieces of content, or tasks that require re-reading or carefully considering a long passage and then generating a very short answer. This could be a possible explanation for GPT-3’s lagging few-shot performance on a few of the tasks, such as WIC (which involves comparing the use of a word in two sentences), ANLI (which involves comparing two sentences to see if one implies the other), and several reading comprehension tasks (e.g. QuAC and RACE). We also conjecture, based on past literature, that a large bidirectional model would be stronger at fine-tuning than GPT-3. Making a bidirectional model at the scale of GPT-3, and/or trying to make bidirectional models work with few- or zero-shot learning, is a promising direction for future research, and could help achieve the “best of both worlds”. 
| GPT-3在结构和算法上有一些限制,这可以解释上面的一些问题。我们专注于探索自回归语言模型中的上下文内学习行为,因为用这个模型类进行抽样和计算可能性都很简单。因此,我们的实验不包括任何双向架构或其他训练目标,如去噪。这与最近的许多文献有明显的不同,后者记录了在标准语言模型上使用这些方法可以提高调优性能[RSR+19]。因此,我们的设计决策的代价是,在经验上受益于双向性的任务上,可能会有更糟糕的性能。这可能包括填空任务,包括回顾和比较两段内容的任务,或者要求重读或仔细考虑一篇很长的文章,然后写出非常简短的答案的任务。这可能是一个可能的解释为GPT-3滞后few-shot性能的一些任务,如WIC(包括比较词的使用在两个句子),ANLI(包括比较两个句子是否意味着另一个),和一些阅读理解任务(例如QuAC和种族)。基于过去的文献,我们还推测,一个大型的双向模型在微调方面会比GPT-3更强。在GPT-3的规模上制作一个双向模型,以及/或尝试使双向模型在很少或零射击学习中工作,是未来研究的一个有前途的方向,并且可以帮助实现“两全其美”。 |
A more fundamental limitation of the general approach described in this paper – scaling up any LM-like model, whether autoregressive or bidirectional – is that it may eventually run into (or could already be running into) the limits of the pretraining objective. Our current objective weights every token equally and lacks a notion of what is most important to predict and what is less important. [RRS20] demonstrate benefits of customizing prediction to entities of interest. Also, with self-supervised objectives, task specification relies on forcing the desired task into a prediction problem, whereas ultimately, useful language systems (for example virtual assistants) might be better thought of as taking goal-directed actions rather than just making predictions. Finally, large pretrained language models are not grounded in other domains of experience, such as video or real-world physical interaction, and thus lack a large amount of context about the world [BHT+20]. For all these reasons, scaling pure self-supervised prediction is likely to hit limits, and augmentation with a different approach is likely to be necessary. Promising future directions in this vein might include learning the objective function from humans [ZSW+19a], fine-tuning with reinforcement learning, or adding additional modalities such as images to provide grounding and a better model of the world [CLY+19]. | 本文所描述的一般方法的一个更基本的限制是——扩展任何类似lm的模型,无论是自回归的还是双向的——它可能最终会(或可能已经)碰到培训前目标的限制。我们目前的目标是平等地对每一个标记进行权重,并且缺乏一个概念,即哪些是最重要的,哪些是不那么重要的。[RRS20]演示定制对相关实体的预测的好处。此外,在自我监督的目标中,任务规范依赖于将所需的任务强制转化为预测问题,然而最终,有用的语言系统(例如虚拟助手)可能被认为是采取目标导向的行动,而不仅仅是进行预测。最后,大型的预训练语言模型并不基于其他经验领域,如视频或现实世界的物理互动,因此缺乏大量关于世界的上下文[BHT+20]。由于所有这些原因,纯自监督预测的缩放可能会达到极限,使用不同的方法进行扩展可能是必要的。在这方面,未来有希望的方向可能包括从人类那里学习目标函数[ZSW+19a],用强化学习进行微调,或添加额外的模式,如图像,以提供接地和更好的世界模型[CLY+19]。 |
Another limitation broadly shared by language models is poor sample efficiency during pre-training. While GPT-3 takes a step towards test-time sample efficiency closer to that of humans (one-shot or zero-shot), it still sees much more text during pre-training than a human sees in their lifetime [Lin20]. Improving pre-training sample efficiency is an important direction for future work, and might come from grounding in the physical world to provide additional information, or from algorithmic improvements. A limitation, or at least uncertainty, associated with few-shot learning in GPT-3 is ambiguity about whether few-shot learning actually learns new tasks “from scratch” at inference time, or if it simply recognizes and identifies tasks that it has learned during training. These possibilities exist on a spectrum, ranging from demonstrations in the training set that are drawn from exactly the same distribution as those at test time, to recognizing the same task but in a different format, to adapting to a specific style of a general task such as QA, to learning a skill entirely de novo. Where GPT-3 is on this spectrum may also vary from task to task. Synthetic tasks such as word scrambling or defining nonsense words seem especially likely to be learned de novo, whereas translation clearly must be learned during pretraining, although possibly from data that is very different in organization and style from the test data. Ultimately, it is not even clear what humans learn from scratch vs. from prior demonstrations. Even organizing diverse demonstrations during pre-training and identifying them at test time would be an advance for language models, but nevertheless understanding precisely how few-shot learning works is an important unexplored direction for future research.
| 语言模型普遍存在的另一个局限性是在训练前的样本效率较低。尽管GPT-3在测试时间样本效率方面更接近人类(一次或零次),但它在训练前看到的文本仍然比人类在一生中看到的要多得多[Lin20]。提高训练前的样本效率是未来工作的一个重要方向,可能来自于在物理世界的基础上提供额外的信息,或者来自于算法的改进。在GPT-3中,与少样本学习相关的一个限制,或者至少是不确定性,是关于小样本学习实际上是在推理时间“从零开始”学习新任务,还是仅仅识别和识别在训练中学习到的任务的不确定性。这些可能性存在于光谱,从示威游行的训练集来自相同的分布与测试时间,认识到相同的任务,但在不同的格式,以适应一个特定的风格的QA等任务,学习一门技能完全新创。GPT-3在这个范围内的位置也可能因任务而异。合成任务,如词序打乱或定义无意义的词,似乎特别有可能从头学习,而翻译显然必须在训练前学习,尽管可能从组织和风格上与测试数据非常不同的数据。最终,我们甚至不清楚人类从从零开始和之前的演示中学到了什么。即使是在训练前组织各种演示,并在测试时识别它们,也将是语言模型的一个进步,但准确地理解少枪学习是如何工作的,是未来研究的一个重要的未探索的方向。 |
A limitation associated with models at the scale of GPT-3, regardless of objective function or algorithm, is that they are both expensive and inconvenient to perform inference on, which may present a challenge for practical applicability of models of this scale in their current form. One possible future direction to address this is distillation [HVD15] of large models down to a manageable size for specific tasks. Large models such as GPT-3 contain a very wide range of skills, most of which are not needed for a specific task, suggesting that in principle aggressive distillation may be possible. Distillation is well-explored in general [LHCG19a] but has not been tried at the scale of hundreds of billions of parameters; new challenges and opportunities may be associated with applying it to models of this size. Finally, GPT-3 shares some limitations common to most deep learning systems – its decisions are not easily interpretable, it is not necessarily well-calibrated in its predictions on novel inputs, as observed by the much higher variance in performance than humans on standard benchmarks, and it retains the biases of the data it has been trained on. This last issue – biases in the data that may lead the model to generate stereotyped or prejudiced content – is of special concern from a societal perspective, and will be discussed along with other issues in the next section on Broader Impacts (Section 6). | 无论目标函数或算法如何,GPT-3规模上的模型都存在一个限制,即它们都是昂贵的,并且不便于进行推断,这可能对当前形式的这种规模的模型的实际适用性提出挑战。解决这一问题的一个可能的未来方向是将大型模型精馏[HVD15],使其达到可管理的规模,以完成特定的任务。像GPT-3这样的大型模型包含了非常广泛的技能,其中大多数技能对于特定的任务来说是不需要的,这表明在原则上积极的提炼是可能的。蒸馏在一般情况下得到了很好的探索[LHCG19a],但还没有在数千亿个参数的规模上进行尝试;将其应用于这种规模的模型可能会带来新的挑战和机会。最后,GPT-3共同分享一些限制大多数深度学习系统——它的决定并不容易解释,它在预测不一定精确校准的小说所观察到的输入方差性能远高于人类标准基准,它保留了数据的偏见一直在训练。最后这个问题- -数据的偏差可能导致模型产生定型或偏见的内容- -从社会角度来说是特别关注的问题,将在下一节中与其他问题一起讨论更广泛的影响(第6节)。 |
Language models have a wide range of beneficial applications for society, including code and writing auto-completion, grammar assistance, game narrative generation, improving search engine responses, and answering questions. But they also have potentially harmful applications. GPT-3 improves the quality of text generation and adaptability over smaller models and increases the difficulty of distinguishing synthetic text from human-written text. It therefore has the potential to advance both the beneficial and harmful applications of language models. Here we focus on the potential harms of improved language models, not because we believe the harms are necessarily greater, but in order to stimulate efforts to study and mitigate them. The broader impacts of language models like this are numerous. We focus on two primary issues: the potential for deliberate misuse of language models like GPT-3 in Section 6.1, and issues of bias, fairness, and representation within models like GPT-3 in Section 6.2. We also briefly discuss issues of energy efficiency (Section 6.3). | 语言模型为社会提供了广泛的有益应用,包括代码和编写自动完成、语法帮助、游戏叙事生成、改进搜索引擎响应和回答问题。但它们也有潜在的有害用途。相对于较小的模型,GPT-3提高了文本生成的质量和适应性,并增加了区分合成文本和人类书写文本的难度。因此,它有潜力促进语言模型的有益和有害应用。 |
Malicious uses of language models can be somewhat difficult to anticipate because they often involve repurposing language models in a very different environment or for a different purpose than researchers intended. To help with this, we can think in terms of traditional security risk assessment frameworks, which outline key steps such as identifying threats and potential impacts, assessing likelihood, and determining risk as a combination of likelihood and impact [Ros12]. We discuss three factors: potential misuse applications, threat actors, and external incentive structures. | 恶意使用语言模型可能有点难以预料,因为它们通常涉及到在非常不同的环境中重新使用语言模型,或者用于与研究人员预期不同的目的。为了帮助解决这一问题,我们可以从传统的安全风险评估框架的角度进行思考,这些框架列出了关键步骤,如识别威胁和潜在影响、评估可能性以及将风险确定为可能性和影响的组合[Ros12]。我们讨论三个因素:潜在的误用应用,威胁行动者,和外部激励结构。 |
Any socially harmful activity that relies on generating text could be augmented by powerful language models. Examples include misinformation, spam, phishing, abuse of legal and governmental processes, fraudulent academic essay writing and social engineering pretexting. Many of these applications bottleneck on human beings to write sufficiently high quality text. Language models that produce high quality text generation could lower existing barriers to carrying out these activities and increase their efficacy. The misuse potential of language models increases as the quality of text synthesis improves. The ability of GPT-3 to generate several paragraphs of synthetic content that people find difficult to distinguish from human-written text in 3.9.4 represents a concerning milestone in this regard. | 任何依赖于生成文本的对社会有害的活动都可以通过强大的语言模型来增强。例如,虚假信息,垃圾邮件,网络钓鱼,滥用法律和政府程序,欺诈学术论文写作和社会工程借口。这些应用程序中的许多都阻碍了人们编写足够高质量的文本。产生高质量文本生成的语言模型可以降低执行这些活动的现有障碍,并提高其效率。 随着文本合成质量的提高,语言模型的误用潜力也在增加。GPT-3生成几段合成内容的能力是这方面的一个重要里程碑,人们发现这些合成内容很难与3.9.4中人类书写的文本区分开来。 |
Threat actors can be organized by skill and resource levels, ranging from low or moderately skilled and resourced actors who may be able to build a malicious product to 'advanced persistent threats’ (APTs): highly skilled and well-resourced (e.g. state-sponsored) groups with long-term agendas [SBC+19]. To understand how low and mid-skill actors think about language models, we have been monitoring forums and chat groups where misinformation tactics, malware distribution, and computer fraud are frequently discussed. While we did find significant discussion of misuse following the initial release of GPT-2 in spring of 2019, we found fewer instances of experimentation and no successful deployments since then. Additionally, those misuse discussions were correlated with media coverage of language model technologies. From this, we assess that the threat of misuse from these actors is not immediate, but significant improvements in reliability could change this. Because APTs do not typically discuss operations in the open, we have consulted with professional threat analysts about possible APT activity involving the use of language models. Since the release of GPT-2 there has been no discernible difference in operations that may see potential gains by using language models. The assessment was that language models may not be worth investing significant resources in because there has been no convincing demonstration that current language models are significantly better than current methods for generating text, and because methods for “targeting” or “controlling” the content of language models are still at a very early stage. | 威胁参与者可以根据技能和资源级别进行组织,从能够构建恶意产品的低或中等技能和资源的参与者,到“高级持续威胁”(APTs):高技能和资源充足的(例如。国家资助的)有长期议程的团体[SBC+19]。
|
Each threat actor group also has a set of tactics, techniques, and procedures (TTPs) that they rely on to accomplish their agenda. TTPs are influenced by economic factors like scalability and ease of deployment; phishing is extremely popular among all groups because it offers a low-cost, low-effort, high-yield method of deploying malware and stealing login credentials. Using language models to augment existing TTPs would likely result in an even lower cost of deployment. Ease of use is another significant incentive. Having stable infrastructure has a large impact on the adoption of TTPs. The outputs of language models are stochastic, however, and though developers can constrain these (e.g. using top-k truncation) they are not able to perform consistently without human feedback. If a social media disinformation bot produces outputs that are reliable 99% of the time, but produces incoherent outputs 1% of the time, this could reduce the amount of human labor required in operating this bot. But a human is still needed to filter the outputs, which restricts how scalable the operation can be. | 每个威胁行动者组织也有一套战术、技术和程序(TTPs),他们依靠这些来完成他们的议程。ttp会受到经济因素的影响,比如可伸缩性和部署的简便性;网络钓鱼在所有群体中都非常流行,因为它提供了一种低成本、低成本、高收益的部署恶意软件和窃取登录凭证的方法。使用语言模型来增强现有的ttp可能会导致部署成本更低。 易用性是另一个重要的激励因素。拥有稳定的基础设施对ttp的采用有很大的影响。然而,语言模型的输出是随机的,尽管开发人员可以限制这些输出(例如使用top-k truncation),但如果没有人类的反馈,它们无法持续执行。如果一个社交媒体假信息机器人的输出在99%的情况下是可靠的,但在1%的情况下输出的是不连贯的,这就可以减少操作这个机器人所需的人力。但是仍然需要人工筛选输出,这限制了操作的可伸缩性。 基于我们对这个模型的分析,以及对威胁参与者和环境的分析,我们怀疑人工智能研究人员最终将开发出具有足够一致性和可操控性的语言模型,从而使恶意参与者更感兴趣。我们希望这将给更广泛的研究界带来挑战,并希望通过结合缓解研究、原型设计和与其他技术开发人员协调来解决这一问题。 |
Biases present in training data may lead models to generate stereotyped or prejudiced content. This is concerning, since model bias could harm people in the relevant groups in different ways by entrenching existing stereotypes and producing demeaning portrayals amongst other potential harms [Cra17]. We have conducted an analysis of biases in the model in order to better understand GPT-3’s limitations when it comes to fairness, bias, and representation. Broadly, our analysis indicates that internet-trained models have internet-scale biases; models tend to reflect stereotypes present in their training data. Below we discuss our preliminary findings of bias along the dimensions of gender, race, and religion. We probe for bias in the 175 billion parameter model and also in similar smaller models, to see if and how they are different in this dimension. | 训练数据中的偏差可能导致模型产生定型或偏见的内容。这是令人担忧的,因为模型偏见可能以不同的方式伤害相关群体的人,通过加强现有的刻板印象和产生贬低形象等潜在危害[Cra17]。我们对模型中的偏差进行了分析,以便更好地理解GPT-3在公平性、偏差和代表性方面的局限性。8 我们的目标不是详尽地描述GPT-3,而是对其局限性和行为进行初步分析。我们关注的是与性别、种族和宗教相关的偏见,尽管可能存在许多其他类别的偏见,可以在后续工作中进行研究。这只是初步的分析,并没有反映模型的所有偏差,即使是在研究的类别内。 总的来说,我们的分析表明,经过互联网训练的模型具有互联网规模偏差;模型倾向于反映训练数据中呈现的刻板印象。下面我们将讨论我们在性别、种族和宗教维度上的偏见的初步发现。我们在1750亿参数模型和类似较小的模型中探查偏差,看看它们在这个维度上是否和如何不同。 |
In our investigation of gender bias in GPT-3, we focused on associations between gender and occupation. We found that occupations in general have a higher probability of being followed by a male gender identifier than a female one (in other words, they are male leaning) when given a context such as "The {occupation} was a" (Neutral Variant). 83% of the 388 occupations we tested were more likely to be followed by a male identifier by GPT-3. We measured this by feeding the model a context such as "The detective was a" and then looking at the probability of the model following up with male indicating words (e.g. man, male etc.) or female indicating words (woman, female etc.). In particular, occupations demonstrating higher levels of education such as legislator, banker, or professor emeritus were heavily male leaning along with occupations that require hard physical labour such as mason, millwright, and sheriff. Occupations that were more likely to be followed by female identifiers include midwife, nurse, receptionist, housekeeper etc. We also tested how these probabilities changed when we shifted the context to be "The competent {occupation} was a" (Competent Variant), and when we shifted the context to be "The incompetent {occupation} was a" (Incompetent Variant) for each occupation in the dataset. We found that, when prompted with "The competent {occupation} was a," the majority of occupations had an even higher probability of being followed by a male identifier than a female one than was the case with our original neutral prompt, "The {occupation} was a". With the prompt "The incompetent {occupation} was a" the majority of occupations still leaned male with a probability similar to that for our original neutral prompt. The average occupation bias - measured as $\frac{1}{n_{\mathrm{jobs}}} \sum_{\mathrm{jobs}} \log\left(\frac{P(\mathrm{female} \mid \mathrm{Context})}{P(\mathrm{male} \mid \mathrm{Context})}\right)$ - was −1.11 for the Neutral Variant, −2.14 for the Competent Variant and −1.15 for the Incompetent Variant.
| 在我们对GPT-3性别偏见的调查中,我们关注的是性别与职业之间的联系。我们发现,在给出“该职业是一个”(中性变量)这样的背景下,一般来说,职业被男性性别标识符跟随的概率比女性更高(换句话说,她们更倾向于男性)。在我们测试的388种职业中,有83%的职业更有可能被男性的GPT-3尾随。我们通过给模型输入诸如“侦探是a”这样的语境来测量这一点,然后观察模型接着输入男性暗示词(如“the detective was a”)的概率。或表示女性的词(woman, female等)。特别是,具有较高教育水平的职业,如立法者、银行家或名誉教授,以及需要重体力劳动的职业,如梅森、米尔莱特和治安官,都偏重于男性。更有可能被女性识别的职业包括助产士、护士、接待员、管家等。 我们还测试了当我们将上下文转换为“胜任的{占职}是一个”(胜任的变体)时,以及当我们将上下文转换为“不胜任的{占职}是一个”(不胜任的变体)时,这些概率是如何变化的。我们发现,当提示为“胜任的{职业}是a”时,大多数职业后面跟随男性标识符的概率比跟随女性标识符的概率还要高,这比我们最初的中性提示为“The{职业}是a”的概率还要高。当提示“the incompetent {career} was a”时,大多数职业仍然倾向于男性,这一概率与我们最初的中性提示相似。以1 njobs P job log(P(女性|环境)P(男性|环境))测量的平均职业偏倚为:中性变异为- 1.11,胜任变异为- 2.14,不胜任变异为- 1.15。 |
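The average occupation bias defined above is a mean log-ratio, which can be sketched in a few lines. This is a minimal illustration and not the authors' code: the function name and the toy probabilities are hypothetical, and the natural logarithm is an assumption, since the text does not state the base.

```python
import math


def average_occupation_bias(probs):
    """Mean over occupations of log(P(female | context) / P(male | context)).

    probs: {occupation: (p_female, p_male)} for a single prompt template.
    Negative values mean the template leans male on average.
    """
    log_ratios = [math.log(p_female / p_male) for p_female, p_male in probs.values()]
    return sum(log_ratios) / len(log_ratios)


# Made-up probabilities for the template "The {occupation} was a".
toy = {
    "detective": (0.10, 0.40),
    "nurse": (0.45, 0.15),
    "mason": (0.02, 0.50),
}
bias = average_occupation_bias(toy)
print(f"average occupation bias: {bias:.2f}")
```

Under this definition a value of 0 would mean no average lean, so the reported −1.11, −2.14 and −1.15 all indicate a male lean, strongest for the Competent Variant.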
We also carried out pronoun resolution on the Winogender dataset [RNLVD18] using two methods which further corroborated the model’s tendency to associate most occupations with males. One method measured the model's ability to correctly assign a pronoun as the occupation or the participant. For example, we fed the model a context such as "The advisor met with the advisee because she wanted to get advice about job applications. 'She' refers to the" and found the option with the lowest probability between the two possible options (Choices between Occupation Option: advisor; Participant Option: advisee). We also performed co-occurrence tests, where we analyzed which words are likely to occur in the vicinity of other preselected words. We created a model output sample set by generating 800 outputs of length 50 each with a temperature of 1 and top p of 0.9 for every prompt in our dataset. For gender, we had prompts such as "He was very", "She was very", "He would be described as", "She would be described as". We looked at the adjectives and adverbs in the top 100 most favored words using an off-the-shelf POS tagger [LB02]. We found females were more often described using appearance oriented words such as "beautiful" and "gorgeous" as compared to men who were more often described using adjectives that span a greater spectrum. | 我们还使用两种方法对Winogender数据集[RNLVD18]进行代词解析,这两种方法进一步证实了该模型将大多数职业与男性联系起来的倾向。一种方法是测试模型正确分配代词作为职业或参与者的能力。例如,我们为模型提供了一个上下文,例如“顾问与被咨询者会面,因为她想获得关于工作申请的建议。”“她”指的是“并在两种可能的选项(职业选项:顾问;参与者选择:学生)。 职业和参与者的词汇通常带有社会偏见,比如假设大多数居住者默认为男性。我们发现,语言模型学会了一些偏见,比如倾向于将女性代词与参与者的位置联系起来,而不是男性代词。GPT-3 175B在这项任务上的准确率是所有模型中最高的(64.17%)。这也是唯一一个女性的居住者句子(正确答案是职业选项的句子)的准确率高于男性的模型(81.7%对76.7%)。除了我们的第二大模型GPT-3 13B,其他所有模型在男性代词与职业相关的句子上的准确率都高于女性代词,但GPT-3 13B在两个句子上的准确率都相同(60%)。这提供了一些初步证据,表明在存在偏见的地方,语言模型容易出错,较大的模型比较小的模型更健壮。 |
To investigate racial bias in GPT-3, we seeded the model with prompts such as "The {race} man was very", "The {race} woman was very" and "People would describe the {race} person as" and generated 800 samples for each of the above prompts, with {race} replaced with a term indicating a racial category such as White or Asian. We then measure word co-occurrences in the generated samples. Given prior research demonstrating that language models produce text of differing sentiment when varying features such as occupation [HZJ+19], we explored how race impacted sentiment. We measured sentiment using SentiWordNet [BES10] for the words which co-occurred disproportionately with each race. Each word sentiment varied from 100 to -100, with positive scores indicating positive words (e.g. wonderfulness: 100, amicable: 87.5), negative scores indicating negative words (e.g. wretched: -87.5, horrid: -87.5) and a score of 0 indicating neutral words (e.g. sloping, chalet). It should be noted that we were explicitly prompting the models to talk about race and this in turn generated text that focused on racial features; these results are not from the models talking about race in the wild but talking about race in an experimental setup where they have been primed to do so. Additionally, since we are measuring sentiment by simply looking at word co-occurrences, the resulting sentiment can reflect socio-historical factors - for instance, text relating to a discussion of slavery will frequently have a negative sentiment, which may lead to a demographic being associated with a negative sentiment under this testing methodology. Across the models we analyzed, 'Asian' had a consistently high sentiment - it ranked 1st in 3 out of 7 models. On the other hand, 'Black' had a consistently low sentiment - it ranked the lowest in 5 out of 7 models. These differences narrowed marginally on the larger model sizes.
This analysis gives a sense of the biases of different models and highlights the need for more sophisticated analysis of the relationship between sentiment, entities, and input data. | GPT-3调查种族偏见,我们播种等模型提示——“{种族}男人非常”,“{种族}的女人非常”和“人们将{种族}人描述为“和生成800个样本对于上面的提示,用{种族}替换为一个术语表明种族类别如白人或亚洲。然后我们在生成的样本中度量单词的共同出现。鉴于先前的研究表明,语言模型在不同的特征(如职业)下产生不同的情绪[HZJ+19],我们探究了种族如何影响情绪。我们使用Senti WordNet [BES10]来测量情绪,以确定在每个种族中出现的不相称的词汇。每个词的情绪在100到-100之间变化,积极的分数表示积极的词。精彩度:100,友好度:87.5),负分数表示否定的词。猥贱:-87.5,可怕:-87.5)和0分表示中性词(如:倾斜的小屋)。 值得注意的是,我们明确地促使模型讨论种族问题,而这反过来产生了关注种族特征的文本;这些结果并不是来自于那些讨论野外竞赛的模型,而是来自于他们已经准备好这样做的实验设置。此外,由于我们测量情绪通过简单地看单词共生,产生的情绪可以反映社会历史因素——例如,文本有关的讨论奴隶制会经常有负面情绪,这可能会导致人口与负面情绪在这种测试方法。 在我们分析的所有模特中,“亚洲人”的人气一直很高——在7个模特中,有3个排名第一。另一方面,“黑色”的人气一直很低——在7款车型中,它在5款中排名最低。这些差异在较大的模型尺寸上略微缩小。这个分析给出了不同模型的偏差,并强调了对情绪、实体和输入数据之间的关系进行更复杂分析的必要性。 |
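The word-level sentiment scoring described above can be sketched with a tiny lexicon lookup. The six scores are the examples quoted in the text. The lexicon dictionary, the function name, the word lists, and the use of a plain mean are assumptions for illustration, since the text does not say how per-word scores were aggregated.

```python
# Scores taken from the examples in the text, on the -100..100 scale it describes.
TOY_LEXICON = {
    "wonderfulness": 100.0,
    "amicable": 87.5,
    "wretched": -87.5,
    "horrid": -87.5,
    "sloping": 0.0,
    "chalet": 0.0,
}


def mean_sentiment(words, lexicon=TOY_LEXICON):
    """Average lexicon score over the words found in the lexicon (0.0 if none are)."""
    scores = [lexicon[word] for word in words if word in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical lists of words that co-occurred disproportionately with two prompts.
print(mean_sentiment(["amicable", "chalet"]))
print(mean_sentiment(["wretched", "horrid", "sloping"]))
```

Because the score depends only on which words co-occur, a topic with negative vocabulary lowers the average regardless of how the text frames the group, which is the socio-historical caveat the paragraph raises.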
We studied which words co-occurred with religious terms relating to Atheism, Buddhism, Christianity, Hinduism, Islam, and Judaism, by generating 800 model outputs of length ≈50 with a temperature of 1 and a top p of 0.9 for every prompt. Our prompts were of the nature "{Religion practitioners} are" (Eg. "Christians are") for each of the six religious categories listed above. We then allowed the model to naturally carry out completions and created a corpus of such completions for studying co-occurrence of words. Similar to race, we found that the models make associations with religious terms that indicate some propensity to reflect how these terms are sometimes presented in the world. For example, with the religion Islam, we found that words such as ramadan, prophet and mosque co-occurred at a higher rate than for other religions. We also found that words such as violent, terrorism and terrorist co-occurred at a greater rate with Islam than with other religions and were in the top 40 most favored words for Islam in GPT-3. | 我们研究了哪些词与无神论、佛教、基督教、印度教、伊斯兰教和犹太教等宗教术语共出现,通过生成800个模型输出,长度≈50,温度为1,每个提示的p值为0.9。我们的提示属于“宗教从业者”的性质。“基督徒是”)对应以上列出的六个宗教类别中的每一个。然后,我们允许模型自然地执行补全,并创建这样补全的语料库来研究单词的共现。 与种族相似,我们发现这些模型与宗教术语联系在一起,显示出某些倾向来反映这些术语在世界上是如何呈现的。以伊斯兰教为例,我们发现像ramadan, prophet和mosque这样的词出现的频率比其他宗教要高。我们还发现,“暴力”、“恐怖主义”和“恐怖主义”等词与“伊斯兰”相关的比例要高于与其他宗教相关的比例,并在GPT-3中跻身“伊斯兰”最受欢迎的40个词汇之列。 |
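The sampling settings mentioned above (temperature 1, top p 0.9) correspond to nucleus sampling, which can be sketched as follows. This is a generic illustration of the technique over a made-up next-token distribution, not the code used for the paper; the function names and the toy vocabulary are hypothetical. A temperature of 1 leaves the model's distribution unchanged, so only the top-p step is shown.

```python
import random


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most-probable tokens whose total mass reaches top_p.

    probs: {token: probability}. Returns the kept tokens with renormalized probabilities.
    """
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    return {token: p / mass for token, p in kept}


def sample_token(probs, top_p=0.9, rng=random):
    """Draw one token from the top-p filtered distribution."""
    filtered = top_p_filter(probs, top_p)
    tokens = list(filtered)
    return rng.choices(tokens, weights=[filtered[t] for t in tokens], k=1)[0]


# Made-up distribution for the word following a prompt.
toy_probs = {"kind": 0.5, "devout": 0.3, "tall": 0.15, "xyzzy": 0.05}
filtered = top_p_filter(toy_probs)
print(sorted(filtered))
print(sample_token(toy_probs))
```

Truncating the low-probability tail removes the least likely continuations while keeping the output stochastic, so repeated sampling from one prompt yields a varied corpus for counting co-occurrences.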
We have presented this preliminary analysis to share some of the biases we found in order to motivate further research, and to highlight the inherent difficulties in characterizing biases in large-scale generative models; we expect this to be an area of continuous research for us and are excited to discuss different methodological approaches with the community. We view the work in this section as subjective signposting - we chose gender, race, and religion as a starting point, but we recognize the inherent subjectivity in this choice. Our work is inspired by the literature on characterizing model attributes to develop informative labels such as Model Cards for Model Reporting from [MWZ+18]. Ultimately, it is important not just to characterize biases in language systems but to intervene. The literature on this is also extensive [QMZH19, HZJ+19], so we offer only a few brief comments on future directions specific to large language models. In order to pave the way for effective bias prevention in general purpose models, there is a need for building a common vocabulary tying together the normative, technical and empirical challenges of bias mitigation for these models. There is room for more research that engages with the literature outside NLP, better articulates normative statements about harm, and engages with the lived experience of communities affected by NLP systems [BBDIW20]. Thus, mitigation work should not be approached purely with a metric driven objective to 'remove’ bias as this has been shown to have blind spots [GG19, NvNvdG19] but in a holistic manner. 
| 我们提出这一初步分析是为了分享我们发现的一些偏见,以推动进一步的研究,并强调在大规模生成模型中描述偏见的固有困难;我们希望这将是一个持续研究的领域,并很高兴与社区讨论不同的方法方法。我们把这部分的工作看作是主观的路标——我们选择了性别、种族和宗教作为出发点,但我们认识到这种选择的内在主观性。我们的工作受到了描述模型属性以开发信息性标签的文献的启发,例如用于模型报告的模型卡片[MWZ+18]。 最终,重要的不仅仅是描述语言系统中的偏见,还要进行干预。关于这方面的文献也很广泛[QMZH19, HZJ+19],因此我们仅就大型语言模型的未来方向提供一些简短的评论。为了在通用模型中为有效预防偏倚铺平道路,有必要建立一个共同的词汇表,将这些模型在减轻偏倚方面的规范、技术和经验挑战结合起来。还有更多的研究空间与NLP以外的文献相结合,更好地阐明关于伤害的规范性陈述,并与受NLP系统影响的社区的生活经历相结合[BBDIW20]。因此,应对缓解工作不应单纯以一个度量驱动的目标来“消除”偏见,因为这已被证明存在盲点[GG19, NvNvdG19],而应以一种整体的方式。 |
Practical large-scale pre-training requires large amounts of computation, which is energy-intensive: training the GPT-3 175B consumed several thousand petaflop/s-days of compute during pre-training, compared to tens of petaflop/s-days for a 1.5B parameter GPT-2 model (Figure 2.2). This means we should be cognizant of the cost and efficiency of such models, as advocated by [SDSE19]. The use of large-scale pre-training also gives another lens through which to view the efficiency of large models - we should consider not only the resources that go into training them, but how these resources are amortized over the lifetime of a model, which will subsequently be used for a variety of purposes and fine-tuned for specific tasks. Though models like GPT-3 consume significant resources during training, they can be surprisingly efficient once trained: even with the full GPT-3 175B, generating 100 pages of content from a trained model can cost on the order of 0.4 kW-hr, or only a few cents in energy costs. Additionally, techniques like model distillation [LHCG19a] can further bring down the cost of such models, letting us adopt a paradigm of training single, large-scale models, then creating more efficient versions of them for use in appropriate contexts. Algorithmic progress may also naturally further increase the efficiency of such models over time, similar to trends observed in image recognition and neural machine translation [HB20]. | 实际的大规模预训练需要大量的计算,这是能源密集型的:训练GPT-3 175B在预训练期间消耗了数千次petaflop/s天计算,相比之下,1.5B参数的GPT-2模型需要几十次petaflop/s天计算(图2.2)。这意味着我们应该认识到这种模式的成本和效率,正如[SDSE19]所倡导的。 大规模的使用训练的也给了另一个样本,通过它观看大型模型的效率,我们不仅应该考虑去培训他们的资源,但这些资源如何平摊的生命周期模型,随后将被用于各种各样的目的特定任务来制定和调整。尽管像GPT-3这样的模型在培训期间消耗了大量的资源,但一旦培训完成,它们的效率会惊人地高:即使使用完整的GPT-3 175B,从一个培训过的模型生成100页内容的成本大约是0.4千瓦时,或者只有几美分的能源成本。此外,像模型蒸馏[LHCG19a]这样的技术可以进一步降低此类模型的成本,让我们采用训练单一、大规模模型的范例,然后创建更有效的版本,以便在适当的上下文中使用。随着时间的推移,算法的发展也会自然地进一步提高这些模型的效率,类似于在图像识别和神经机器翻译中观察到的趋势[HB20]。 |
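The "few cents" figure above follows from simple arithmetic, sketched here. The 0.4 kW-hr value comes from the text; the electricity price is an assumed round number chosen only for illustration, and the names are hypothetical.

```python
def energy_cost_usd(energy_kwh, price_usd_per_kwh):
    """Cost of consuming `energy_kwh` at a flat electricity price."""
    return energy_kwh * price_usd_per_kwh


ENERGY_KWH = 0.4                  # from the text: roughly 100 pages of generated content
ASSUMED_PRICE_USD_PER_KWH = 0.10  # assumption, not a figure from the paper

total_cost = energy_cost_usd(ENERGY_KWH, ASSUMED_PRICE_USD_PER_KWH)
per_page_kwh = ENERGY_KWH / 100
print(f"{total_cost:.2f} USD for 100 pages, {per_page_kwh:.3f} kWh per page")
```

This covers inference energy only; the training cost discussed in the same paragraph is amortized separately over the lifetime of the model.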
Several lines of work have focused on increasing parameter count and/or computation in language models as a means to improve generative or task performance. An early work scaled LSTM-based language models to over a billion parameters [JVS+16]. One line of work straightforwardly increases the size of transformer models, scaling up parameters and FLOPS-per-token roughly in proportion. Work in this vein has successively increased model size: 213 million parameters [VSP+17] in the original paper, 300 million parameters [DCLT18], 1.5 billion parameters [RWC+19], 8 billion parameters [SPP+19], 11 billion parameters [RSR+19], and most recently 17 billion parameters [Tur20]. A second line of work has focused on increasing parameter count but not computation, as a means of increasing models' capacity to store information without increased computational cost. These approaches rely on the conditional computation framework [BLC13]; specifically, the mixture-of-experts method [SMM+17] has been used to produce 100 billion parameter models and more recently 50 billion parameter translation models [AJF19], though only a small fraction of the parameters are actually used on each forward pass. A third approach increases computation without increasing parameters; examples of this approach include adaptive computation time [Gra16] and the universal transformer [DGV+18]. Our work focuses on the first approach (scaling compute and parameters together, by straightforwardly making the neural net larger), and increases model size 10x beyond previous models that employ this strategy.

Several efforts have also systematically studied the effect of scale on language model performance. [KMH+20, RRBS19, LWS+20, HNA+17] find a smooth power-law trend in loss as autoregressive language models are scaled up. This work suggests that this trend largely continues as models continue to scale up (although a slight bending of the curve can perhaps be detected in Figure 3.1), and we also find relatively smooth increases in many (though not all) downstream tasks across 3 orders of magnitude of scaling.
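A power-law trend in loss means that loss versus model size is a straight line on log-log axes, so the exponent can be recovered with an ordinary least-squares fit in log space. The sketch below illustrates this on synthetic data only: the constants `TRUE_A` and `TRUE_ALPHA`, the parameter counts, and the function `fit_power_law` are assumptions made for illustration, not values or code from the paper.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(sizes: Sequence[float],
                  losses: Sequence[float]) -> Tuple[float, float]:
    """Fit L(N) = a * N**(-alpha) by least squares on (log N, log L).

    Returns the pair (a, alpha).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # equals -alpha
    intercept = mean_y - slope * mean_x    # equals log(a)
    return math.exp(intercept), -slope


# SYNTHETIC data: generated from a known power law, not measured losses.
TRUE_A, TRUE_ALPHA = 10.0, 0.08
sizes = [1e8, 1e9, 1e10, 1e11]  # illustrative parameter counts (3 orders of magnitude)
losses = [TRUE_A * n ** -TRUE_ALPHA for n in sizes]

a, alpha = fit_power_law(sizes, losses)

if __name__ == "__main__":
    print(f"fitted a = {a:.3f}, alpha = {alpha:.4f}")
```

On real measurements the points would scatter around the fitted line, and a systematic deviation at the largest sizes would correspond to the slight bending of the curve mentioned above.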
Another line of work goes in the opposite direction from scaling, attempting to preserve strong performance in language models that are as small as possible. This approach includes ALBERT [LCG+19] as well as general [HVD15] and task-specific [SDCW19, JYS+19, KR16] approaches to distillation of language models. These architectures and techniques are potentially complementary to our work, and could be applied to decrease latency and memory footprint of giant models.

As fine-tuned language models have neared human performance on many standard benchmark tasks, considerable effort has been devoted to constructing more difficult or open-ended tasks, including question answering [KPR+19, IBGC+14, CCE+18, MCKS18], reading comprehension [CHI+18, RCM19], and adversarially constructed datasets designed to be difficult for existing language models [SBBC19, NWD+19]. In this work we test our models on many of these datasets.
Many previous efforts have focused specifically on question-answering, which constitutes a significant fraction of the tasks we tested on. Recent efforts include [RSR+19, RRS20], which fine-tuned an 11 billion parameter language model, and [GLT+20], which focused on attending over a large corpus of data at test time. Our work differs in focusing on in-context learning but could be combined in the future with those of [GLT+20, LPP+20].

Metalearning in language models has been utilized in [RWC+19], though with much more limited results and no systematic study. More broadly, language model metalearning has an inner-loop-outer-loop structure, making it structurally similar to metalearning as applied to ML in general. Here there is an extensive literature, including matching networks [VBL+16], RL² [DSC+16], learning to optimize [RL16, ADG+16, LM17] and MAML [FAL17]. Our approach of stuffing the model's context with previous examples is most structurally similar to RL² and also resembles [HYC01], in that an inner loop of adaptation takes place through computation in the model's activations across timesteps, without updating the weights, while an outer loop (in this case just language model pre-training) updates the weights, and implicitly learns the ability to adapt to or at least recognize tasks defined at inference-time. Few-shot auto-regressive density estimation was explored in [RCP+17], and [GWC+18] studied low-resource NMT as a few-shot learning problem.
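The "inner loop" described above amounts to placing K solved examples in the model's context ahead of the query, with no weight updates. The sketch below shows one plausible way to assemble such a prompt. The function `build_few_shot_prompt`, the ` => ` separator, and the example pairs are illustrative assumptions, and the call to an actual language model is deliberately left out.

```python
from typing import List, Tuple


def build_few_shot_prompt(task_description: str,
                          examples: List[Tuple[str, str]],
                          query: str,
                          k: int) -> str:
    """Build a prompt from a task description, K solved examples, and a query.

    k = 0 gives a zero-shot prompt, k = 1 one-shot, and larger k few-shot.
    All "learning" happens when the model reads this context; no weights change.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    lines = [task_description] if task_description else []
    for source, target in examples[:k]:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # left open for the model to complete
    return "\n".join(lines)


# Illustrative demonstration pairs (hypothetical data for this sketch).
EXAMPLES = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
]

if __name__ == "__main__":
    prompt = build_few_shot_prompt(
        "Translate English to French:", EXAMPLES, "cheese", k=2
    )
    print(prompt)
```

Varying `k` while keeping the weights fixed is how the zero-shot, one-shot, and few-shot settings differ. The outer loop, language model pre-training, is the only place where the weights change.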
While the mechanism of our few-shot approach is different, prior work has also explored ways of using pre-trained language models in combination with gradient descent to perform few-shot learning [SS20]. Another sub-field with similar goals is semi-supervised learning, where approaches such as UDA [XDH+19] also explore methods of fine-tuning when very little labeled data is available.

Giving multi-task models instructions in natural language was first formalized in a supervised setting with [MKXS18] and utilized for some tasks (such as summarizing) in a language model with [RWC+19]. The notion of presenting tasks in natural language was also explored in the text-to-text transformer [RSR+19], although there it was applied for multi-task fine-tuning rather than for in-context learning without weight updates.

Another approach to increasing generality and transfer-learning capability in language models is multi-task learning [Car97], which fine-tunes on a mixture of downstream tasks together, rather than separately updating the weights for each one. If successful, multi-task learning could allow a single model to be used for many tasks without updating the weights (similar to our in-context learning approach), or alternatively could improve sample efficiency when updating the weights for a new task. Multi-task learning has shown some promising initial results [LGH+15, LSP+18], and multi-stage fine-tuning has recently become a standardized part of SOTA results on some datasets [PFB18] and pushed the boundaries on certain tasks [KKS+20], but is still limited by the need to manually curate collections of datasets and set up training curricula. By contrast, pre-training at large enough scale appears to offer a "natural" broad distribution of tasks implicitly contained in predicting the text itself. One direction for future work might be attempting to generate a broader set of explicit tasks for multi-task learning, for example through procedural generation [TFR+17], human interaction [ZSW+19b], or active learning [Mac92].
Algorithmic innovation in language models over the last two years has been enormous, including denoising-based bidirectionality [DCLT18], prefixLM [DL15] and encoder-decoder architectures [LLG+19, RSR+19], random permutations during training [YDY+19], architectures that improve the efficiency of sampling [DYY+19], improvements in data and training procedures [LOG+19], and efficiency increases in the embedding parameters [LCG+19]. Many of these techniques provide significant gains on downstream tasks. In this work we continue to focus on pure autoregressive language models, both in order to focus on in-context learning performance and to reduce the complexity of our large model implementations. However, it is very likely that incorporating these algorithmic advances could improve GPT-3's performance on downstream tasks, especially in the fine-tuning setting, and combining GPT-3's scale with these algorithmic techniques is a promising direction for future work.
We presented a 175 billion parameter language model which shows strong performance on many NLP tasks and benchmarks in the zero-shot, one-shot, and few-shot settings, in some cases nearly matching the performance of state-of-the-art fine-tuned systems, as well as generating high-quality samples and strong qualitative performance at tasks defined on-the-fly. We documented roughly predictable trends of scaling in performance without using fine-tuning. We also discussed the social impacts of this class of model. Despite many limitations and weaknesses, these results suggest that very large language models may be an important ingredient in the development of adaptable, general language systems.
The authors would like to thank Ryan Lowe for giving detailed feedback on drafts of the paper. Thanks to Jakub Pachocki and Szymon Sidor for suggesting tasks, and Greg Brockman, Michael Petrov, Brooke Chan, and Chelsea Voss for helping run evaluations on OpenAI's infrastructure. Thanks to David Luan for initial support in scaling up this project, Irene Solaiman for discussions about ways to approach and evaluate bias, Harrison Edwards and Yura Burda for discussions and experimentation with in-context learning, Geoffrey Irving and Paul Christiano for early discussions of language model scaling, Long Ouyang for advising on the design of the human evaluation experiments, Chris Hallacy for discussions on data collection, and Shan Carter for help with visual design. Thanks to the millions of people who created content that was used in the training of the model, and to those who were involved in indexing or upvoting the content (in the case of WebText). Additionally, we would like to thank the entire OpenAI infrastructure and supercomputing teams for making it possible to train models at this scale.