
Paper:GPT-3《 Language Models are Few-Shot Learners》的翻译与解读

 处女座的程序猿 2021-09-28



《GPT-3: Language Models are Few-Shot Learners》的翻译与解读

作者

OpenAI

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei

原文：https://arxiv.org/abs/2005.14165
GitHub：https://github.com/openai/gpt-3

Abstract 摘要

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

最近的研究表明,通过在大规模文本语料上进行预训练,再针对特定任务进行微调,可以在许多NLP任务和基准上取得实质性进展。虽然这种方法在架构上通常与任务无关,但它仍然需要包含成千上万个样例的特定任务微调数据集。相比之下,人类通常只需几个例子或简单的指令就能执行一项新的语言任务——这是目前的NLP系统在很大程度上仍难以做到的。在这里,我们展示了扩大语言模型的规模可以极大地提高与任务无关的少样本性能,有时甚至可以与先前最先进的微调方法相媲美。具体来说,我们训练了GPT-3,这是一个拥有1750亿参数的自回归语言模型,参数量是以往任何非稀疏语言模型的10倍,并测试其在少样本设置下的性能。对于所有任务,GPT-3的应用都不需要任何梯度更新或微调,任务和少样本演示完全通过与模型的文本交互来指定。GPT-3在许多NLP数据集上取得了强劲的性能,包括翻译、问答和完形填空任务,以及一些需要即时推理或领域适应的任务,如整理乱序单词、在句子中使用新词或执行3位数算术。与此同时,我们也发现了一些GPT-3的少样本学习仍然吃力的数据集,以及一些GPT-3面临与在大型网络语料上训练相关的方法论问题的数据集。最后,我们发现GPT-3可以生成新闻文章样本,人类评估者难以将其与人类撰写的文章区分开来。我们讨论了这一发现以及GPT-3总体上更广泛的社会影响。

1 Introduction 介绍

Recent years have featured a trend towards pre-trained language representations in NLP systems, applied in increasingly  flexible and task-agnostic ways for downstream transfer. First, single-layer representations were learned using word  vectors [MCCD13, PSM14] and fed to task-specific architectures, then RNNs with multiple layers of representations  and contextual state were used to form stronger representations [DL15, MBXS17, PNZtY18] (though still applied to  task-specific architectures), and more recently pre-trained recurrent or transformer language models [VSP+17] have  been directly fine-tuned, entirely removing the need for task-specific architectures [RNSS18, DCLT18, HR18].

This last paradigm has led to substantial progress on many challenging NLP tasks such as reading comprehension,  question answering, textual entailment, and many others, and has continued to advance based on new architectures  and algorithms [RSR+19, LOG+19, YDY+19, LCG+19]. However, a major limitation to this approach is that while  the architecture is task-agnostic, there is still a need for task-specific datasets and task-specific fine-tuning: to achieve  strong performance on a desired task typically requires fine-tuning on a dataset of thousands to hundreds of thousands  of examples specific to that task. Removing this limitation would be desirable, for several reasons.  
First, from a practical perspective, the need for a large dataset of labeled examples for every new task limits the  applicability of language models. There exists a very wide range of possible useful language tasks, encompassing  anything from correcting grammar, to generating examples of an abstract concept, to critiquing a short story. For many  of these tasks it is difficult to collect a large supervised training dataset, especially when the process must be repeated  for every new task.

近年来,NLP系统中出现了一种预训练语言表示的趋势,并以越来越灵活、与任务无关的方式应用于下游迁移。首先,人们学习单层表示的词向量[MCCD13, PSM14]并将其输入特定于任务的架构;随后,具有多层表示和上下文状态的RNN被用来形成更强的表示[DL15, MBXS17, PNZtY18](尽管仍应用于特定任务的架构);最近,预训练的循环或Transformer语言模型[VSP+17]被直接微调,完全消除了对特定任务架构的需求[RNSS18, DCLT18, HR18]。最后一种范式在许多具有挑战性的NLP任务(如阅读理解、问答、文本蕴涵等)上取得了实质性的进展,并在新的架构和算法的基础上继续前进[RSR+19, LOG+19, YDY+19, LCG+19]。然而,这种方法的一个主要限制在于:虽然架构与任务无关,但仍然需要特定于任务的数据集和特定于任务的微调——要在目标任务上获得强劲表现,通常需要在包含成千上万个该任务专用样例的数据集上进行微调。出于几个原因,消除这一限制是可取的。首先,从实践的角度来看,每一个新任务都需要大量带标签的示例数据集,这限制了语言模型的适用性。可能有用的语言任务范围非常广泛,从纠正语法、生成某个抽象概念的例子,到评论一篇短篇小说。对于许多这样的任务来说,很难收集到大型的监督训练数据集,特别是当这个过程必须为每个新任务重复时。
Second, the potential to exploit spurious correlations in training data fundamentally grows with the expressiveness  of the model and the narrowness of the training distribution. This can create problems for the pre-training plus  fine-tuning paradigm, where models are designed to be large to absorb information during pre-training, but are then  fine-tuned on very narrow task distributions. For instance [HLW+20] observe that larger models do not necessarily  generalize better out-of-distribution. There is evidence that suggests that the generalization achieved under this paradigm  can be poor because the model is overly specific to the training distribution and does not generalize well outside it  [YdC+19, MPL19]. Thus, the performance of fine-tuned models on specific benchmarks, even when it is nominally at  human-level, may exaggerate actual performance on the underlying task [GSL+18, NK19].  
Third, humans do not require large supervised datasets to learn most language tasks – a brief directive in natural  language (e.g. “please tell me if this sentence describes something happy or something sad”) or at most a tiny number  of demonstrations (e.g. “here are two examples of people acting brave; please give a third example of bravery”) is often sufficient to enable a human to perform a new task to at least a reasonable degree of competence. Aside from pointing  to a conceptual limitation in our current NLP techniques, this adaptability has practical advantages – it allows humans  to seamlessly mix together or switch between many tasks and skills, for example performing addition during a lengthy  dialogue. To be broadly useful, we would someday like our NLP systems to have this same fluidity and generality.
其次,利用训练数据中虚假相关性的可能性,会随着模型表达能力的增强和训练分布的收窄而从根本上增加。这可能会给预训练加微调的范式带来问题:在这种范式中,模型被设计得很大,以便在预训练期间吸收信息,但随后却在非常狭窄的任务分布上进行微调。例如[HLW+20]观察到,较大的模型不一定能更好地进行分布外泛化。有证据表明,在这种范式下实现的泛化可能很差,因为模型过于专门化于训练分布,在其之外不能很好地泛化[YdC+19, MPL19]。因此,经过微调的模型在特定基准上的表现,即使名义上达到了人类水平,也可能夸大了其在底层任务上的实际能力[GSL+18, NK19]。第三,人类学习大多数语言任务并不需要大型监督数据集——一条简短的自然语言指令(例如"请告诉我这句话描述的是快乐还是悲伤")或者至多极少量的演示(例如"这里有两个人表现勇敢的例子;请给出第三个勇敢的例子"),通常就足以使一个人至少以合理的胜任程度完成一项新任务。除了指出我们目前NLP技术在概念上的局限性外,这种适应性还具有实际优势——它使人类能够无缝地混合多种任务和技能或在它们之间切换,例如在冗长的对话中穿插做加法。我们希望我们的NLP系统有朝一日也能具有同样的流畅性和通用性,从而得到广泛应用。
One potential route towards addressing these issues is meta-learning1 – which in the context of language models means  the model develops a broad set of skills and pattern recognition abilities at training time, and then uses those abilities  at inference time to rapidly adapt to or recognize the desired task (illustrated in Figure 1.1). Recent work [RWC+19]  attempts to do this via what we call “in-context learning”, using the text input of a pretrained language model as a form  of task specification: the model is conditioned on a natural language instruction and/or a few demonstrations of the task  and is then expected to complete further instances of the task simply by predicting what comes next.  
While it has shown some initial promise, this approach still achieves results far inferior to fine-tuning – for example  [RWC+19] achieves only 4% on Natural Questions, and even its 55 F1 CoQa result is now more than 35 points behind  the state of the art. Meta-learning clearly requires substantial improvement in order to be viable as a practical method of  solving language tasks.  
Another recent trend in language modeling may offer a way forward. In recent years the capacity of transformer  language models has increased substantially, from 100 million parameters [RNSS18], to 300 million parameters  [DCLT18], to 1.5 billion parameters [RWC+19], to 8 billion parameters [SPP+19], 11 billion parameters [RSR+19],  and finally 17 billion parameters [Tur20]. Each increase has brought improvements in text synthesis and/or downstream  NLP tasks, and there is evidence suggesting that log loss, which correlates well with many downstream tasks, follows a  smooth trend of improvement with scale [KMH+20]. Since in-context learning involves absorbing many skills and  tasks within the parameters of the model, it is plausible that in-context learning abilities might show similarly strong  gains with scale.

解决这些问题的一条潜在路线是元学习(meta-learning)——在语言模型的语境下,这意味着模型在训练时发展出广泛的技能和模式识别能力,然后在推理时利用这些能力快速适应或识别所需的任务(如图1.1所示)。最近的工作[RWC+19]试图通过我们所称的"上下文内学习(in-context learning)"来做到这一点:将预训练语言模型的文本输入作为一种任务规范——模型以一条自然语言指令和/或若干任务演示为条件,然后只需预测接下来的内容即可完成该任务的更多实例。虽然这种方法显示出了一些初步的希望,但其结果仍远不及微调——例如[RWC+19]在Natural Questions上仅取得4%的成绩,甚至其55 F1的CoQA结果如今也落后于最先进水平35分以上。元学习显然需要大幅改进,才能成为解决语言任务的可行实用方法。语言建模的另一个最新趋势可能提供了一条前进的道路。近年来,Transformer语言模型的容量大幅增加,从1亿个参数[RNSS18],到3亿个参数[DCLT18],再到15亿个参数[RWC+19],再到80亿个参数[SPP+19]、110亿个参数[RSR+19],最后是170亿个参数[Tur20]。每一次增长都带来了文本合成和/或下游NLP任务的改进,并且有证据表明,与许多下游任务相关性良好的对数损失(log loss)随着规模的增大呈现平滑的改善趋势[KMH+20]。由于上下文内学习涉及在模型参数中吸收许多技能和任务,因此上下文内学习能力随规模呈现类似的强劲增长是合理的。
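For reference, the smooth scaling trend of the log loss cited above is what [KMH+20] fit with a power law in the number of non-embedding parameters N; the approximate form and constants below are quoted from that paper, not re-estimated from GPT-3 itself:

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad \alpha_N \approx 0.076,\quad N_c \approx 8.8\times 10^{13}
```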

In this paper, we test this hypothesis by training a 175 billion parameter autoregressive language model, which we call  GPT-3, and measuring its in-context learning abilities. Specifically, we evaluate GPT-3 on over two dozen NLP datasets,  as well as several novel tasks designed to test rapid adaptation to tasks unlikely to be directly contained in the training  set. For each task, we evaluate GPT-3 under 3 conditions: (a) “few-shot learning”, or in-context learning where we  allow as many demonstrations as will fit into the model’s context window (typically 10 to 100), (b) “one-shot learning”,  where we allow only one demonstration, and (c) “zero-shot” learning, where no demonstrations are allowed and only  an instruction in natural language is given to the model. GPT-3 could also in principle be evaluated in the traditional  fine-tuning setting, but we leave this to future work.  
Figure 1.2 illustrates the conditions we study, and shows few-shot learning of a simple task requiring the model to  remove extraneous symbols from a word. Model performance improves with the addition of a natural language task  description, and with the number of examples in the model’s context, K. Few-shot learning also improves dramatically  with model size. Though the results in this case are particularly striking, the general trends with both model size and  number of examples in-context hold for most tasks we study. We emphasize that these “learning” curves involve no  gradient updates or fine-tuning, just increasing numbers of demonstrations given as conditioning.  
Broadly, on NLP tasks GPT-3 achieves promising results in the zero-shot and one-shot settings, and in the few-shot setting is sometimes competitive with or even occasionally surpasses state-of-the-art (despite state-of-the-art being held by fine-tuned models). For example, GPT-3 achieves 81.5 F1 on CoQA in the zero-shot setting, 84.0 F1 on CoQA in the one-shot setting, 85.0 F1 in the few-shot setting. Similarly, GPT-3 achieves 64.3% accuracy on TriviaQA in the zero-shot setting, 68.0% in the one-shot setting, and 71.2% in the few-shot setting, the last of which is state-of-the-art relative to fine-tuned models operating in the same closed-book setting.
GPT-3 also displays one-shot and few-shot proficiency at tasks designed to test rapid adaption or on-the-fly reasoning,  which include unscrambling words, performing arithmetic, and using novel words in a sentence after seeing them  defined only once. We also show that in the few-shot setting, GPT-3 can generate synthetic news articles which human  evaluators have difficulty distinguishing from human-generated articles.  
At the same time, we also find some tasks on which few-shot performance struggles, even at the scale of GPT-3. This  includes natural language inference tasks like the ANLI dataset, and some reading comprehension datasets like RACE  or QuAC. By presenting a broad characterization of GPT-3’s strengths and weaknesses, including these limitations, we  hope to stimulate study of few-shot learning in language models and draw attention to where progress is most needed.  
A heuristic sense of the overall results can be seen in Figure 1.3, which aggregates the various tasks (though it should  not be seen as a rigorous or meaningful benchmark in itself).
在本文中,我们通过训练一个拥有1750亿参数的自回归语言模型(我们称之为GPT-3),并测量其上下文内学习能力,来检验这一假设。具体地说,我们在二十多个NLP数据集上评估GPT-3,并设计了若干新任务来测试模型对训练集中不太可能直接包含的任务的快速适应能力。对于每个任务,我们在三种条件下评估GPT-3:(a)"少样本学习",即上下文内学习,我们允许在模型的上下文窗口内放入尽可能多的演示(通常为10到100个);(b)"单样本学习",只允许一个演示;以及(c)"零样本"学习,不允许任何演示,只向模型提供一条自然语言指令。原则上,GPT-3也可以在传统的微调设置中进行评估,但我们将其留给未来的工作。图1.2说明了我们所研究的条件,并展示了一个简单任务的少样本学习,该任务要求模型从一个单词中去除无关的符号。模型性能随着自然语言任务描述的加入而提高,也随着模型上下文中的示例数量K的增加而提高;少样本学习的表现还随模型规模显著提升。虽然这个例子的结果特别引人注目,但模型规模和上下文示例数量带来的总体趋势对我们研究的大多数任务都成立。我们强调,这些"学习"曲线不涉及任何梯度更新或微调,只是增加作为条件提供的演示数量。总的来说,在NLP任务上,GPT-3在零样本和单样本设置中取得了有希望的结果,在少样本设置中有时可以与最先进水平竞争,甚至偶尔超越(尽管最先进水平由经过微调的模型保持)。例如,GPT-3在零样本设置下的CoQA上达到81.5 F1,在单样本设置下达到84.0 F1,在少样本设置下达到85.0 F1。同样,在TriviaQA上,GPT-3在零样本设置下达到64.3%的准确率,单样本设置下为68.0%,少样本设置下为71.2%,其中最后一项相对于在同样闭卷设置下运行的微调模型而言是最先进水平。在旨在测试快速适应或即时推理的任务上,GPT-3也表现出单样本和少样本的熟练程度,这些任务包括整理乱序单词、执行算术运算,以及在句子中使用只见过一次定义的新词。我们还表明,在少样本设置中,GPT-3可以生成人类评估者难以与人类撰写的文章区分的合成新闻文章。与此同时,我们也发现了一些即使在GPT-3的规模下少样本性能仍然吃力的任务,包括像ANLI这样的自然语言推理任务,以及像RACE或QuAC这样的阅读理解数据集。通过对GPT-3的优势和弱点(包括这些局限性)进行广泛刻画,我们希望促进对语言模型少样本学习的研究,并引起人们对最需要取得进展之处的关注。图1.3给出了总体结果的一个启发式概览,它汇总了各项任务(尽管其本身不应被视为严格或有意义的基准)。
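As a minimal illustration of the few-shot condition described above (demonstrations are packed into the prompt until the context window is full, with no gradient updates), the sketch below greedily fits formatted demonstrations into an approximate token budget; the whitespace-based token count and the helper name are assumptions for illustration, not code from the paper:

```python
# Illustrative sketch: pack as many demonstrations as fit into the context
# window, as in the "few-shot" condition described above. Token counting is
# approximated by whitespace splitting; the real model uses BPE tokens and an
# nctx = 2048 window.

def pack_demonstrations(demos, query, context_window=2048, reserve=64):
    """Greedily keep demonstrations while they fit the (approximate) budget,
    reserving room for the query and the expected completion."""
    budget = context_window - reserve - len(query.split())
    packed = []
    for demo in demos:
        cost = len(demo.split())
        if cost > budget:
            break
        packed.append(demo)
        budget -= cost
    return "\n".join(packed + [query])

demos = ["s.u!c/c!e.s s i/o/n = succession"] * 20   # toy demonstrations
print(pack_demonstrations(demos, "c;o-m:p,l.e;t*e ="))
```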
We also undertake a systematic study of “data contamination” – a growing problem when training high capacity models  on datasets such as Common Crawl, which can potentially include content from test datasets simply because such  content often exists on the web. In this paper we develop systematic tools to measure data contamination and quantify  its distorting effects. Although we find that data contamination has a minimal effect on GPT-3’s performance on most  datasets, we do identify a few datasets where it could be inflating results, and we either do not report results on these  datasets or we note them with an asterisk, depending on the severity.  
In addition to all the above, we also train a series of smaller models (ranging from 125 million parameters to 13 billion  parameters) in order to compare their performance to GPT-3 in the zero, one and few-shot settings. Broadly, for most  tasks we find relatively smooth scaling with model capacity in all three settings; one notable pattern is that the gap  between zero-, one-, and few-shot performance often grows with model capacity, perhaps suggesting that larger models  are more proficient meta-learners.  
Finally, given the broad spectrum of capabilities displayed by GPT-3, we discuss concerns about bias, fairness, and  broader societal impacts, and attempt a preliminary analysis of GPT-3’s characteristics in this regard.  
The remainder of this paper is organized as follows. In Section 2, we describe our approach and methods for training  GPT-3 and evaluating it. Section 3 presents results on the full range of tasks in the zero-, one- and few-shot settings.  Section 4 addresses questions of data contamination (train-test overlap). Section 5 discusses limitations of GPT-3.  Section 6 discusses broader impacts. Section 7 reviews related work and Section 8 concludes.
我们还对"数据污染"进行了系统的研究——当在Common Crawl这类数据集上训练高容量模型时,这是一个日益严重的问题,因为测试数据集的内容经常存在于网络上,从而可能被包含在训练数据中。在本文中,我们开发了系统的工具来测量数据污染并量化其扭曲效应。尽管我们发现数据污染对GPT-3在大多数数据集上的性能影响很小,但我们确实识别出少数几个结果可能被夸大的数据集,并视严重程度或者不报告这些数据集的结果,或者用星号加以标注。除了以上这些,我们还训练了一系列较小的模型(从1.25亿参数到130亿参数不等),以便在零样本、单样本和少样本设置中与GPT-3进行性能比较。总的来说,对于大多数任务,我们发现在所有三种设置中,性能随模型容量的扩展都相对平滑;一个值得注意的模式是,零样本、单样本和少样本之间的性能差距往往随模型容量的增大而增大,这或许表明较大的模型是更熟练的元学习者。最后,鉴于GPT-3展现出的广泛能力,我们讨论了关于偏见、公平性和更广泛社会影响的担忧,并尝试就此对GPT-3的特性进行初步分析。本文的其余部分组织如下。在第2节中,我们描述训练GPT-3并对其进行评估的方法。第3节给出零样本、单样本和少样本设置下全部任务的结果。第4节讨论数据污染(训练集与测试集重叠)的问题。第5节讨论GPT-3的局限性。第6节讨论更广泛的影响。第7节回顾相关工作,第8节作总结。
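The contamination study mentioned above boils down to checking whether test examples share long token sequences with the training corpus. The sketch below only illustrates that idea; the n-gram length, normalization, and tooling actually used in the paper are not reproduced here:

```python
# Minimal sketch of an n-gram overlap check for "data contamination": flag any
# test example that shares at least one n-gram with the training text. This is
# an illustration of the concept, not the paper's actual measurement tool.

def ngrams(text, n):
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def flag_contaminated(test_examples, train_text, n=8):
    train = ngrams(train_text, n)
    return [ex for ex in test_examples if ngrams(ex, n) & train]

# toy usage
train_text = "the quick brown fox jumps over the lazy dog and then went home"
tests = ["yesterday the quick brown fox jumps over the lazy dog appeared",
         "an unrelated sentence about language models"]
print(flag_contaminated(tests, train_text, n=5))   # flags only the first example
```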

2 Approach 方法

Our basic pre-training approach, including model, data, and training, is similar to the process described in [RWC+19], with relatively straightforward scaling up of the model size, dataset size and diversity, and length of training. Our use of in-context learning is also similar to [RWC+19], but in this work we systematically explore different settings for learning within the context. Therefore, we start this section by explicitly defining and contrasting the different settings that we will be evaluating GPT-3 on or could in principle evaluate GPT-3 on. These settings can be seen as lying on a spectrum of how much task-specific data they tend to rely on. Specifically, we can identify at least four points on this spectrum (see Figure 2.1 for an illustration):我们的基本预训练方法,包括模型、数据和训练,类似于[RWC+19]中描述的过程,即相对简单地增加模型大小、数据集大小和多样性,以及训练长度。我们对上下文内学习的使用也类似于[RWC+19],但在这项工作中,我们系统地探索了上下文内学习的不同设置。因此,在本节开始时,我们将显式定义并对比我们将在其上评估GPT-3或原则上可以在其上评估GPT-3的不同设置。这些设置可以看作取决于它们倾向于依赖多少特定于任务的数据。具体来说,我们可以在这个频谱中确定至少四个点(参见图2.1):
  • Fine-Tuning (FT) has been the most common approach in recent years, and involves updating the weights of a pre-trained model by training on a supervised dataset specific to the desired task. Typically thousands to hundreds of thousands of labeled examples are used. The main advantage of fine-tuning is strong performance on many benchmarks. The main disadvantages are the need for a new large dataset for every task, the potential for poor generalization out-of-distribution [MPL19], and the potential to exploit spurious features of the training data [GSL+18, NK19], potentially resulting in an unfair comparison with human performance. In this work we do not fine-tune GPT-3 because our focus is on task-agnostic performance, but GPT-3 can be fine-tuned in principle and this is a promising direction for future work.
  • Few-Shot (FS) is the term we will use in this work to refer to the setting where the model is given a few demonstrations of the task at inference time as conditioning [RWC+19], but no weight updates are allowed. As shown in Figure 2.1, for a typical dataset an example has a context and a desired completion (for example an English sentence and the French translation), and few-shot works by giving K examples of context and completion, and then one final example of context, with the model expected to provide the completion. We typically set K in the range of 10 to 100 as this is how many examples can fit in the model’s context window (nctx = 2048). The main advantages of few-shot are a major reduction in the need for task-specific data and reduced potential to learn an overly narrow distribution from a large but narrow fine-tuning dataset. The main disadvantage is that results from this method have so far been much worse than state-of-the-art fine-tuned models. Also, a small amount of task specific data is still required. As indicated by the name, few-shot learning as described here for language models is related to few-shot learning as used in other contexts in ML [HYC01, VBL+16] – both involve learning based on a broad distribution of tasks (in this case implicit in the pre-training data) and then rapidly adapting to a new task.
  • One-Shot (1S) is the same as few-shot except that only one demonstration is allowed, in addition to a natural language description of the task, as shown in Figure 1. The reason to distinguish one-shot from few-shot and zero-shot (below) is that it most closely matches the way in which some tasks are communicated to humans. For example, when asking humans to generate a dataset on a human worker service (for example Mechanical Turk), it is common to give one demonstration of the task. By contrast it is sometimes difficult to communicate the content or format of a task if no examples are given.
  • Zero-Shot (0S) is the same as one-shot except that no demonstrations are allowed, and the model is only given a natural language instruction describing the task. This method provides maximum convenience, potential for robustness, and avoidance of spurious correlations (unless they occur very broadly across the large corpus of pre-training data), but is also the most challenging setting. In some cases it may even be difficult for humans to understand the format of the task without prior examples, so this setting is in some cases “unfairly hard”. For example, if someone is asked to “make a table of world records for the 200m dash”, this request can be ambiguous, as it may not be clear exactly what format the table should have or what should be included (and even with careful clarification, understanding precisely what is desired can be difficult). Nevertheless, for at least some settings zero-shot is closest to how humans perform tasks – for example, in the translation example in Figure 2.1, a human would likely know what to do from just the text instruction.
  • 微调(FT)是近年来最常见的方法,它通过在特定于目标任务的监督数据集上训练来更新预训练模型的权重,通常使用成千上万个带标签的样例。微调的主要优点是在许多基准上表现强劲;主要缺点是每个任务都需要一个新的大型数据集、分布外泛化可能较差[MPL19],以及可能利用训练数据的虚假特征[GSL+18, NK19],从而导致与人类表现的比较不公平。在这项工作中,我们没有对GPT-3进行微调,因为我们关注的是与任务无关的性能,但GPT-3原则上可以微调,这是未来工作中一个有希望的方向。
  • 少样本(FS)是我们在这项工作中使用的术语,指在推理时给模型提供若干任务演示作为条件[RWC+19],但不允许进行权重更新的设置。如图2.1所示,对于一个典型的数据集,一个样例包含上下文和期望的补全(例如一个英语句子及其法语翻译);少样本的做法是给出K个"上下文+补全"的示例,再给出最后一个只有上下文的示例,期望模型给出补全。我们通常将K设置在10到100之间,因为这是模型上下文窗口(nctx = 2048)所能容纳的示例数。少样本的主要优点是大大减少了对特定任务数据的需求,并降低了从一个大而窄的微调数据集中学到过窄分布的可能性;主要缺点是这种方法迄今为止的结果远逊于最先进的微调模型,并且仍然需要少量特定于任务的数据。顾名思义,这里针对语言模型描述的少样本学习与机器学习其他语境中使用的少样本学习[HYC01, VBL+16]有关——两者都涉及基于广泛任务分布的学习(在本例中隐含在预训练数据中),然后快速适应新任务。
  • 单样本(1S)与少样本相同,只是除了任务的自然语言描述之外,只允许一个演示,如图1所示。将单样本与少样本和零样本(见下文)区分开来的原因是,它最接近某些任务传达给人类的方式。例如,当要求人类在众包平台(例如Mechanical Turk)上生成数据集时,通常会给出一个任务演示。相比之下,如果不给出示例,有时很难传达任务的内容或格式。
  • 零样本(0S)与单样本相同,只是不允许任何演示,只给模型一条描述任务的自然语言指令。这种方法提供了最大的便利性、潜在的鲁棒性,并能避免虚假相关性(除非它们在庞大的预训练语料中广泛出现),但也是最具挑战性的设置。在某些情况下,如果没有先前的示例,人类甚至很难理解任务的格式,因此这种设置在某些情况下"难得不公平"。例如,如果要求某人"制作一张200米短跑世界纪录表",这个请求可能是含糊的,因为表格应该采用什么格式、应该包含什么内容可能并不清楚(即使经过仔细澄清,准确理解需求也可能很困难)。尽管如此,至少在某些设置中,零样本最接近人类执行任务的方式——例如,在图2.1的翻译示例中,人类可能仅凭文本指令就知道该做什么。

Figure 2.1 shows the four methods using the example of translating English to French. In this paper we focus on zero-shot, one-shot and few-shot, with the aim of comparing them not as competing alternatives, but as different problem settings which offer a varying trade-off between performance on specific benchmarks and sample efficiency. We especially highlight the few-shot results as many of them are only slightly behind state-of-the-art fine-tuned models. Ultimately, however, one-shot, or even sometimes zero-shot, seem like the fairest comparisons to human performance, and are important targets for future work.

Sections 2.1-2.3 below give details on our models, training data, and training process respectively. Section 2.4 discusses the details of how we do few-shot, one-shot, and zero-shot evaluations.

图2.1以英译法为例展示了这四种方法。在本文中,我们关注零样本、单样本和少样本,目的不是将它们作为相互竞争的备选方案进行比较,而是将它们视为不同的问题设置,在特定基准上的性能和样本效率之间提供不同的权衡。我们特别强调少样本的结果,因为其中许多仅稍逊于最先进的微调模型。但归根结底,单样本,有时甚至是零样本,似乎才是与人类表现最公平的比较,也是未来工作的重要目标。下面的2.1-2.3节分别给出我们的模型、训练数据和训练过程的细节。第2.4节讨论我们如何进行少样本、单样本和零样本评估。
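To make the evaluated settings concrete, the sketch below builds the corresponding prompts for the English-to-French example of Figure 2.1; the exact wording and separators are paraphrased assumptions, and fine-tuning (the fourth setting) is omitted because it updates weights rather than the prompt:

```python
# Sketch of the zero-, one- and few-shot prompt formats discussed above, using
# the English-to-French example of Figure 2.1 (wording paraphrased). In every
# case the model simply predicts the continuation of the text; no weights change.

instruction = "Translate English to French:"
demos = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]
query = "cheese"

zero_shot = f"{instruction}\n{query} =>"
one_shot = f"{instruction}\n{demos[0][0]} => {demos[0][1]}\n{query} =>"
few_shot = "\n".join([instruction]
                     + [f"{en} => {fr}" for en, fr in demos]
                     + [f"{query} =>"])

print(few_shot)
```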

2.1 Model and Architectures 模型和架构

We use the same model and architecture as GPT-2 [RWC+19], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer [CGRS19]. To study the dependence of ML performance on model size, we train 8 different sizes of model, ranging over three orders of magnitude from 125 million parameters to 175 billion parameters, with the last being the model we call GPT-3. Previous work [KMH+20] suggests that with enough training data, scaling of validation loss should be approximately a smooth power law as a function of size; training models of many different sizes allows us to test this hypothesis both for validation loss and for downstream language tasks.

我们使用与GPT-2 [RWC+19]相同的模型和架构,包括其中描述的修改版初始化、预归一化和可逆分词,不同之处在于我们在Transformer的各层中交替使用稠密和局部带状稀疏的注意力模式,类似于Sparse Transformer [CGRS19]。为了研究机器学习性能对模型规模的依赖关系,我们训练了8种不同大小的模型,从1.25亿参数到1750亿参数,跨越三个数量级,其中最后一个就是我们称为GPT-3的模型。先前的工作[KMH+20]表明,在训练数据足够的情况下,验证损失随规模的变化应近似遵循平滑的幂律;训练多种不同大小的模型使我们能够针对验证损失和下游语言任务来检验这一假设。

Table 2.1 shows the sizes and architectures of our 8 models. Here nparams is the total number of trainable parameters, nlayers is the total number of layers, dmodel is the number of units in each bottleneck layer (we always have the feedforward layer four times the size of the bottleneck layer, dff = 4 ∗ dmodel), and dhead is the dimension of each attention head. All models use a context window of nctx = 2048 tokens. We partition the model across GPUs along both the depth and width dimension in order to minimize data-transfer between nodes. The precise architectural parameters for each model are chosen based on computational efficiency and load-balancing in the layout of models across GPU's. Previous work [KMH+20] suggests that validation loss is not strongly sensitive to these parameters within a reasonably broad range.

表2.1显示了我们8个模型的大小和架构。这里nparams是可训练参数的总数,nlayers是总层数,dmodel是每个瓶颈层的单元数(我们的前馈层始终是瓶颈层的四倍大,即dff = 4∗dmodel),dhead是每个注意力头的维度。所有模型都使用nctx = 2048个token的上下文窗口。我们沿深度和宽度两个维度将模型划分到多块GPU上,以尽量减少节点间的数据传输。每个模型的精确架构参数是根据计算效率和模型在GPU布局中的负载均衡来选择的。先前的工作[KMH+20]表明,在一个相当宽的范围内,验证损失对这些参数并不强烈敏感。
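The quantities above (nlayers, dmodel, dff = 4·dmodel) determine the parameter count up to embeddings and biases; the back-of-the-envelope estimate below is a standard approximation rather than a formula from the paper, and the 175B shape used in the example (96 layers, dmodel = 12288) is the one published in Table 2.1:

```python
# Rough sketch: non-embedding parameter count of a GPT-style transformer from
# the quantities named above. Per layer: ~4*d_model^2 for the attention
# projections (Q, K, V, output) and ~8*d_model^2 for the feed-forward block
# when d_ff = 4*d_model, i.e. roughly 12 * n_layers * d_model^2 in total.

def approx_params(n_layers: int, d_model: int) -> int:
    attention = 4 * d_model * d_model            # Q, K, V, output projections
    feed_forward = 2 * d_model * (4 * d_model)   # d_model -> 4*d_model -> d_model
    return n_layers * (attention + feed_forward)

# GPT-3 175B shape from Table 2.1 (96 layers, d_model = 12288):
print(f"~{approx_params(96, 12288) / 1e9:.0f}B non-embedding parameters")  # ~174B
```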

2.2 Training Dataset 训练数据集

Datasets for language models have rapidly expanded, culminating in the Common Crawl dataset2 [RSR+19] constituting nearly a trillion words. This size of dataset is sufficient to train our largest models without ever updating on the same sequence twice. However, we have found that unfiltered or lightly filtered versions of Common Crawl tend to have lower quality than more curated datasets. Therefore, we took 3 steps to improve the average quality of our datasets: (1) we downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora, (2) we performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting, and (3) we also added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity.

Details of the first two points (processing of Common Crawl) are described in Appendix A. For the third, we added several curated high-quality datasets, including an expanded version of the WebText dataset [RWC+19], collected by scraping links over a longer period of time, and first described in [KMH+20], two internet-based books corpora (Books1 and Books2) and English-language Wikipedia.

用于语言模型的数据集迅速扩张,最终出现了近一万亿词的Common Crawl数据集[RSR+19]。这样规模的数据集足以训练我们最大的模型,而无需对同一序列更新两次。然而,我们发现未过滤或轻度过滤版本的Common Crawl的质量往往低于更精心整理的数据集。因此,我们采取了3个步骤来提高数据集的平均质量:(1)我们下载了CommonCrawl的一个版本,并根据其与一系列高质量参考语料的相似度进行过滤;(2)我们在数据集内部和数据集之间于文档级别执行模糊去重,以防止冗余,并保持我们留出的验证集的完整性,使其成为过拟合的准确度量;(3)我们还将已知的高质量参考语料加入训练混合中,以增强CommonCrawl并增加其多样性。

前两点(Common Crawl的处理)的细节见附录A。对于第三点,我们加入了若干精选的高质量数据集,包括WebText数据集的扩展版本[RWC+19](通过在更长时间内抓取链接收集,最早在[KMH+20]中描述)、两个基于互联网的图书语料库(Books1和Books2)以及英文维基百科。

Table 2.2 shows the final mixture of datasets that we used in training. The CommonCrawl data was downloaded from 41 shards of monthly CommonCrawl covering 2016 to 2019, constituting 45TB of compressed plaintext before filtering and 570GB after filtering, roughly equivalent to 400 billion byte-pair-encoded tokens. Note that during training, datasets are not sampled in proportion to their size, but rather datasets we view as higher-quality are sampled more frequently, such that CommonCrawl and Books2 datasets are sampled less than once during training, but the other datasets are sampled 2-3 times. This essentially accepts a small amount of overfitting in exchange for higher quality training data.表2.2显示了我们在训练中使用的最终混合数据集。common抓取数据从2016年到2019年的每月41个shards中下载,即过滤前压缩明文45TB,过滤后压缩明文570GB,大致相当于4000亿个字节对编码的令牌。需要注意的是,在训练过程中,对数据集的采样并不是按照数据集的大小进行的,而是我们认为质量较高的数据集的采样频率更高,例如common抓取和Books2数据集在训练过程中采样次数少于一次,而对其他数据集的采样次数为2-3次。这本质上接受了少量的过拟合,以换取更高质量的训练数据。
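The sampling scheme described above can be sketched as a weighted choice over sources, so that small curated corpora are seen for 2-3 epochs while CommonCrawl is seen less than once; the weights below are illustrative placeholders, not the exact figures of Table 2.2:

```python
# Sketch of weighted mixture sampling during training: each document is drawn
# from a source with probability proportional to a quality-based weight rather
# than to raw size. The weights below are illustrative placeholders.
import random

mixture = [
    ("CommonCrawl (filtered)", 0.60),   # huge, but sampled for < 1 epoch
    ("WebText2",               0.22),
    ("Books1",                 0.08),
    ("Books2",                 0.08),
    ("Wikipedia",              0.02),   # small, but sampled for several epochs
]

def sample_source(mixture):
    names, weights = zip(*mixture)
    return random.choices(names, weights=weights, k=1)[0]

print(sample_source(mixture))
```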
A major methodological concern with language models pretrained on a broad swath of internet data, particularly large models with the capacity to memorize vast amounts of content, is potential contamination of downstream tasks by having their test or development sets inadvertently seen during pre-training. To reduce such contamination, we searched for and attempted to remove any overlaps with the development and test sets of all benchmarks studied in this paper. Unfortunately, a bug in the filtering caused us to ignore some overlaps, and due to the cost of training it was not feasible to retrain the model. In Section 4 we characterize the impact of the remaining overlaps, and in future work we will more aggressively remove data contamination.在广泛的互联网数据上预先训练过的语言模型,特别是具有记忆大量内容能力的大型模型,主要关注的方法是,在培训前无意中看到测试或开发集,可能会污染下游任务。为了减少这种污染,我们搜索并试图消除与本文研究的所有基准的开发和测试集的重叠。不幸的是,过滤中的一个bug导致我们忽略了一些重叠部分,并且由于训练的代价,对模型进行再训练是不可行的。在第4节中,我们描述了剩余重叠的影响,在未来的工作中,我们将更积极地消除数据污染。

2.3 Training Process 训练过程

As found in [KMH+20, MKAT18], larger models can typically use a larger batch size, but require a smaller learning rate. We measure the gradient noise scale during training and use it to guide our choice of batch size [MKAT18]. Table 2.1 shows the parameter settings we used. To train the larger models without running out of memory, we use a mixture of model parallelism within each matrix multiply and model parallelism across the layers of the network. All models were trained on V100 GPU’s on part of a high-bandwidth cluster provided by Microsoft. Details of the training process and hyperparameter settings are described in Appendix B.正如在[KMH+20, MKAT18]中发现的,较大的模型通常可以使用较大的批大小,但需要较小的学习速度。我们在训练期间测量梯度噪声尺度,并使用它来指导我们批量大小的选择[MKAT18]。表2.1显示了我们使用的参数设置。为了训练更大的模型而不耗尽内存,我们在每个矩阵乘法中混合使用模型并行性和跨网络层的模型并行性。所有的模型都是在微软提供的高带宽集群的V100 GPU上进行训练的。详细的训练过程和超参数设置在附录B中描述。
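For reference, the gradient noise scale used above to guide the batch size is, in the simplified form given by [MKAT18], the ratio of the trace of the per-example gradient covariance Σ to the squared norm of the true gradient G (quoted here as background, not re-derived from GPT-3's training):

```latex
\mathcal{B}_{\mathrm{simple}} \;=\; \frac{\operatorname{tr}(\Sigma)}{\lVert G \rVert^{2}}
```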

2.4 Evaluation  评估

For few-shot learning, we evaluate each example in the evaluation set by randomly drawing K examples from that task’s training set as conditioning, delimited by 1 or 2 newlines depending on the task. For LAMBADA and Storycloze there is no supervised training set available so we draw conditioning examples from the development set and evaluate on the test set. For Winograd (the original, not SuperGLUE version) there is only one dataset, so we draw conditioning examples directly from it.

对于少样本学习,我们通过从该任务的训练集中随机抽取K个样例作为条件(根据任务不同,用1或2个换行符分隔),来评估评估集中的每个样例。对于LAMBADA和StoryCloze,没有可用的监督训练集,因此我们从开发集中抽取条件样例,并在测试集上进行评估。对于Winograd(原始版本,而非SuperGLUE版本),只有一个数据集,所以我们直接从中抽取条件样例。

K can be any value from 0 to the maximum amount allowed by the model’s context window, which is nctx = 2048 for all models and typically fits 10 to 100 examples. Larger values of K are usually but not always better, so when a separate development and test set are available, we experiment with a few values of K on the development set and then run the best value on the test set. For some tasks (see Appendix G) we also use a natural language prompt in addition to (or for K = 0, instead of) demonstrations.

On tasks that involve choosing one correct completion from several options (multiple choice), we provide K examples of context plus correct completion, followed by one example of context only, and compare the LM likelihood of each completion. For most tasks we compare the per-token likelihood (to normalize for length), however on a small number of datasets (ARC, OpenBookQA, and RACE) we gain additional benefit as measured on the development set by normalizing by the unconditional probability of each completion, by computing P(completion|context) / P(completion|answer_context), where answer_context is the string "Answer: " or "A: " and is used to prompt that the completion should be an answer but is otherwise generic.

K可以取从0到模型上下文窗口允许的最大数量之间的任意值;所有模型的上下文窗口均为nctx = 2048,通常可容纳10到100个示例。较大的K值通常但不总是更好,因此当有独立的开发集和测试集可用时,我们先在开发集上试验几个K值,然后在测试集上使用最佳值。对于某些任务(参见附录G),除了演示之外(或当K = 0时,代替演示),我们还使用自然语言提示。

对于涉及从多个选项中选择一个正确补全的任务(多项选择),我们提供K个"上下文+正确补全"的示例,然后给出一个只有上下文的示例,并比较语言模型对每个补全的似然。对于大多数任务,我们比较每个token的平均似然(以对长度进行归一化);然而在少数数据集(ARC、OpenBookQA和RACE)上,用每个补全的无条件概率进行归一化——即计算 P(completion|context) / P(completion|answer_context),其中answer_context是字符串"Answer: "或"A: ",用于提示补全应当是一个答案,但在其他方面是通用的——能在开发集上带来额外收益。
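The multiple-choice scoring described above can be sketched as follows; the `logprob(prefix, continuation)` helper (total log-likelihood of the continuation given the prefix) is an assumed interface for illustration, not part of any released API:

```python
# Sketch of multiple-choice scoring: compare the LM likelihood of each option,
# either per token (length-normalized) or normalized by the option's likelihood
# under a generic "Answer: " context, as described above for ARC/OpenBookQA/RACE.
# `logprob(prefix, continuation)` is an assumed helper returning a total log prob.

def score_options(context, options, logprob, answer_context="Answer: ",
                  normalize_unconditional=False):
    scores = []
    for option in options:
        lp = logprob(context, option)
        if normalize_unconditional:
            # log [ P(completion|context) / P(completion|answer_context) ]
            lp -= logprob(answer_context, option)
        else:
            # per-token (approximated here as per-word) likelihood
            lp /= max(len(option.split()), 1)
        scores.append(lp)
    best = max(range(len(options)), key=lambda i: scores[i])
    return options[best]
```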

On tasks that involve binary classification, we give the options more semantically meaningful names (e.g. “True” or “False” rather than 0 or 1) and then treat the task like multiple choice; we also sometimes frame the task similar to what is done by [RSR+19] (see Appendix G) for details.

On tasks with free-form completion, we use beam search with the same parameters as [RSR+19]: a beam width of 4 and a length penalty of α = 0.6. We score the model using F1 similarity score, BLEU, or exact match, depending on what is standard for the dataset at hand.

Final results are reported on the test set when publicly available, for each model size and learning setting (zero-, one-, and few-shot). When the test set is private, our model is often too large to fit on the test server, so we report results on the development set. We do submit to the test server on a small number of datasets (SuperGLUE, TriviaQA, PiQa) where we were able to make submission work, and we submit only the 200B few-shot results, and report development set results for everything else.

对于涉及二分类的任务,我们给选项赋予语义上更有意义的名称(例如"True"或"False",而不是0或1),然后像多项选择一样处理该任务;我们有时也会以类似于[RSR+19]的方式来构造任务(详见附录G)。

对于自由形式补全的任务,我们使用与[RSR+19]相同参数的束搜索:束宽为4,长度惩罚α = 0.6。我们使用F1相似度、BLEU或精确匹配对模型进行评分,具体取决于相应数据集的标准度量。
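For completeness, a common formulation of beam-search scoring with a length penalty α is the GNMT-style penalty shown below; whether this exact variant is the one used by [RSR+19] and here is an assumption:

```latex
\text{score}(Y) \;=\; \frac{\log P(Y \mid X)}{\mathrm{lp}(Y)},
\qquad \mathrm{lp}(Y) \;=\; \frac{(5 + |Y|)^{\alpha}}{(5 + 1)^{\alpha}},
\quad \alpha = 0.6
```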


对于每种模型规模和学习设置(零样本、单样本和少样本),当测试集公开可用时,我们在测试集上报告最终结果。当测试集是私有的时,我们的模型往往太大而无法在测试服务器上运行,因此我们报告开发集上的结果。我们确实在少数能够完成提交的数据集(SuperGLUE、TriviaQA、PiQA)上向测试服务器提交了结果,但只提交200B模型的少样本结果,其余均报告开发集结果。

3 Results 结果

In Figure 3.1 we display training curves for the 8 models described in Section 2. For this graph we also include 6  additional extra-small models with as few as 100,000 parameters. As observed in [KMH+20], language modeling  performance follows a power-law when making efficient use of training compute. After extending this trend by two  more orders of magnitude, we observe only a slight (if any) departure from the power-law. One might worry that these  improvements in cross-entropy loss come only from modeling spurious details of our training corpus. However, we will  see in the following sections that improvements in cross-entropy loss lead to consistent performance gains across a  broad spectrum of natural language tasks.  

Below, we evaluate the 8 models described in Section 2 (the 175 billion parameter parameter GPT-3 and 7 smaller  models) on a wide range of datasets. We group the datasets into 9 categories representing roughly similar tasks.  

在图3.1中,我们展示了第2节中描述的8个模型的训练曲线。在这个图中,我们还包括了6个额外的超小型模型,这些模型只有100,000个参数。正如在[KMH+20]中观察到的,在高效使用训练计算时,语言建模性能遵循幂律。在将这一趋势扩展两个数量级之后,我们只观察到与幂律有轻微的背离。人们可能会担心这些交叉熵损失的改进仅仅来自于我们训练语料库的虚假细节建模。然而,在接下来的章节中,我们将看到交叉熵损失的改进可以在广泛的自然语言任务中带来一致的性能提升。 

下面,我们在广泛的数据集上评估第2节中描述的8个模型(1750亿参数GPT-3和7个较小的模型)。我们将数据集分成9个类别,这些类别代表大致相似的任务。 

In Section 3.1 we evaluate on traditional language modeling tasks and tasks that are similar to language modeling,  such as Cloze tasks and sentence/paragraph completion tasks. In Section 3.2 we evaluate on “closed book” question  answering tasks: tasks which require using the information stored in the model’s parameters to answer general  knowledge questions. In Section 3.3 we evaluate the model’s ability to translate between languages (especially one-shot  and few-shot). In Section 3.4 we evaluate the model’s performance on Winograd Schema-like tasks. In Section 3.5 we  evaluate on datasets that involve commonsense reasoning or question answering. In Section 3.6 we evaluate on reading  comprehension tasks, in Section 3.7 we evaluate on the SuperGLUE benchmark suite, and in 3.8 we briefly explore  NLI. Finally, in Section 3.9, we invent some additional tasks designed especially to probe in-context learning abilities –  these tasks focus on on-the-fly reasoning, adaptation skills, or open-ended text synthesis. We evaluate all tasks in the  few-shot, one-shot, and zero-shot settings.在3.1节中,我们评估了传统的语言建模任务和类似于语言建模的任务,如完形填空任务和句子/段落完成任务。在第3.2节中,我们对“闭卷”问题回答任务进行评估,即需要使用模型参数中存储的信息来回答一般知识问题的任务。在第3.3节中,我们评估了模型在不同语言之间的翻译能力(特别是一次翻译和少次翻译)。在第3.4节中,我们评估了该模型在Winograd类模式任务上的性能。在第3.5节中,我们对涉及常识推理或问题回答的数据集进行评估。在第3.6节中,我们评估了阅读理解任务;在第3.7节中,我们评估了SuperGLUE基准套件;在3.8节中,我们简要探讨了NLI。最后,在3.9节中,我们特别设计了一些额外的任务来探究上下文中的学习能力——这些任务侧重于即时推理、适应技巧或开放式的文本合成。我们在“少拍”、“一次拍”和“零拍”设置中评估所有的任务。

3.1 Language Modeling, Cloze, and Completion Tasks 语言建模、完形填空和完成任务

In this section we test GPT-3’s performance on the traditional task of language modeling, as well as related tasks that involve predicting a single word of interest, completing a sentence or paragraph, or choosing between possible completions of a piece of text.

在本节中,我们将测试GPT-3在传统的语言建模任务以及相关任务上的性能,这些任务包括预测感兴趣的单个单词、完成句子或段落,或在可能完成的一段文本之间进行选择。

3.1.1 Language Modeling   语言建模

We calculate zero-shot perplexity on the Penn Tree Bank (PTB) [MKM+94] dataset measured in [RWC+19]. We omit the 4 Wikipedia-related tasks in that work because they are entirely contained in our training data, and we also omit the one-billion word benchmark due to a high fraction of the dataset being contained in our training set. PTB escapes these issues due to predating the modern internet. Our largest model sets a new SOTA on PTB by a substantial margin of 15 points, achieving a perplexity of 20.50. Note that since PTB is a traditional language modeling dataset it does not have a clear separation of examples to define one-shot or few-shot evaluation around, so we measure only zero-shot.

我们在[RWC+19]所测量的Penn Tree Bank (PTB) [MKM+94]数据集上计算零样本困惑度。我们省略了该工作中与维基百科相关的4项任务,因为它们完全包含在我们的训练数据中;由于one-billion word基准的很大一部分也包含在我们的训练集中,我们同样将其省略。PTB早于现代互联网,因此避开了这些问题。我们最大的模型在PTB上以15分的显著优势创造了新的SOTA,困惑度达到20.50。注意,由于PTB是一个传统的语言建模数据集,没有清晰的样例划分来定义单样本或少样本评估,因此我们只测量零样本。
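The reported PTB perplexity of 20.50 can be read via the standard definition for an autoregressive language model, i.e. the exponentiated average negative log-likelihood per token:

```latex
\mathrm{PPL} \;=\; \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log p_\theta\!\left(x_i \mid x_{<i}\right)\right)
```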

3.1.2 LAMBADA 数据集

The LAMBADA dataset [PKL+16] tests the modeling of long-range dependencies in text – the model is asked to predict the last word of sentences which require reading a paragraph of context. It has recently been suggested that the continued scaling of language models is yielding diminishing returns on this difficult benchmark. [BHT+20] reflect on the small 1.5% improvement achieved by a doubling of model size between two recent state of the art results ([SPP+19] and [Tur20]) and argue that “continuing to expand hardware and data sizes by orders of magnitude is not the path forward”. We find that path is still promising and in a zero-shot setting GPT-3 achieves 76% on LAMBADA, a gain of 8% over the previous state of the art.

LAMBADA is also a demonstration of the flexibility of few-shot learning as it provides a way to address a problem that  classically occurs with this dataset. Although the completion in LAMBADA is always the last word in a sentence, a  standard language model has no way of knowing this detail. It thus assigns probability not only to the correct ending but  also to other valid continuations of the paragraph. This problem has been partially addressed in the past with stop-word  filters [RWC+19] (which ban “continuation” words). The few-shot setting instead allows us to “frame” the task as a  cloze-test and allows the language model to infer from examples that a completion of exactly one word is desired. We  use the following fill-in-the-blank format:

LAMBADA数据集[PKL+16]测试文本中远程依赖的建模——模型被要求预测需要阅读一段上下文的句子的最后一个单词。最近有研究表明,语言模型的不断扩大在这个困难的基准上产生的收益正在减少。[BHT+20]反思了在两个最新的研究结果([SPP+19]和[Tur20])之间,模型尺寸增加了一倍,仅提高了1.5%,并认为“继续以数量级扩展硬件和数据尺寸并不是前进的道路”。我们发现这条道路仍然很有希望,在零杆的情况下,LAMBADA的GPT-3实现了76%,比之前的技术水平提高了8%。
LAMBADA还展示了少样本学习的灵活性,因为它提供了一种方法来解决这个数据集上经典存在的一个问题。尽管LAMBADA中的补全总是句子的最后一个词,但标准语言模型无从知晓这一细节,因此它不仅会给正确的结尾分配概率,也会给段落的其他有效延续分配概率。过去这个问题曾用停用词过滤器[RWC+19](禁止"延续"词)部分解决。而少样本设置则允许我们把任务"框定"为完形填空,并让语言模型从示例中推断出所需的补全恰好是一个词。我们使用以下填空格式:
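A sketch of such a fill-in-the-blank prompt is shown below; the sentences are illustrative stand-ins rather than the paper's own examples, but the format is the point: each demonstration ends with its single missing word after an arrow, and the final line leaves the blank for the model to fill:

```python
# Illustrative LAMBADA-style cloze prompt for the few-shot setting. Only the
# format matters: demonstrations show the blank already resolved to one word,
# so the model infers that exactly one word is wanted for the final blank.
lambada_prompt = (
    "Alice was friends with Bob. Alice went to visit her friend, ____. -> Bob\n"
    "George bought some baseball gear: a ball, a glove, and a ____. ->"
)
# The expected continuation here is a single word such as "bat".
```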

When presented with examples formatted this way, GPT-3 achieves 86.4% accuracy in the few-shot setting, an increase  of over 18% from the previous state-of-the-art. We observe that few-shot performance improves strongly with model  size. While this setting decreases the performance of the smallest model by almost 20%, for GPT-3 it improves accuracy  by 10%. Finally, the fill-in-blank method is not effective one-shot, where it always performs worse than the zero-shot  setting. Perhaps this is because all models still require several examples to recognize the pattern.

One note of caution is that an analysis of test set contamination identified that a significant minority of the LAMBADA dataset appears to be present in our training data – however analysis performed in Section 4 suggests negligible impact on performance.

当以这种方式呈现样例时,GPT-3在小样本设置中达到了86.4%的精度,比之前的最先进水平提高了18%以上。我们观察到,随着模型尺寸的增大,小样本性能有了很大的提高。虽然这个设置将最小模型的性能降低了近20%,但对于GPT-3,它将精度提高了10%。最后,空白填充法并不是一种有效的一次性方法,它的效果总是比零填充法差。这可能是因为所有模型仍然需要几个示例来识别模式。
需要注意的一点是,对测试集污染的分析发现,LAMBADA数据集中的少数似乎出现在我们的训练数据中——然而,在第4节中执行的分析表明,对性能的影响可以忽略不计。

3.1.3 HellaSwag  数据集

The HellaSwag dataset [ZHB+19] involves picking the best ending to a story or set of instructions. The examples were adversarially mined to be difficult for language models while remaining easy for humans (who achieve 95.6% accuracy). GPT-3 achieves 78.1% accuracy in the one-shot setting and 79.3% accuracy in the few-shot setting, outperforming the 75.4% accuracy of a fine-tuned 1.5B parameter language model [ZHR+19] but still a fair amount lower than the overall SOTA of 85.6% achieved by the fine-tuned multi-task model ALUM.

HellaSwag数据集[ZHB+19]要求为一个故事或一组指令挑选最佳结尾。这些样例经过对抗式挖掘,对语言模型而言很困难,而对人类来说却很容易(人类准确率达95.6%)。GPT-3在单样本设置中达到78.1%的准确率,在少样本设置中达到79.3%,超过了微调的15亿参数语言模型[ZHR+19]的75.4%,但仍明显低于微调多任务模型ALUM取得的85.6%的整体SOTA。

3.1.4 StoryCloze  数据集

We next evaluate GPT-3 on the StoryCloze 2016 dataset [MCH+16], which involves selecting the correct ending  sentence for five-sentence long stories. Here GPT-3 achieves 83.2% in the zero-shot setting and 87.7% in the few-shot  setting (with K = 70). This is still 4.1% lower than the fine-tuned SOTA using a BERT based model [LDL19] but  improves over previous zero-shot results by roughly 10%.接下来,我们对StoryCloze 2016数据集[MCH+16]上的GPT-3进行评估,包括为五句话长的故事选择正确的结尾句。在这里,GPT-3在零样本设置中达到83.2%,在小样本设置(K = 70)中达到87.7%。这仍然比使用基于BERT模型[LDL19]进行微调的SOTA低4.1%,但比之前的零射击结果提高了约10%。

3.2 Closed Book Question Answering  闭卷回答任务

In this section we measure GPT-3’s ability to answer questions about broad factual knowledge. Due to the immense  amount of possible queries, this task has normally been approached by using an information retrieval system to find  relevant text in combination with a model which learns to generate an answer given the question and the retrieved  text. Since this setting allows a system to search for and condition on text which potentially contains the answer it  is denoted “open-book”. [RRS20] recently demonstrated that a large language model can perform surprisingly well  directly answering the questions without conditioning on auxilliary information. They denote this more restrictive  evaluation setting as “closed-book”. Their work suggests that even higher-capacity models could perform even better  and we test this hypothesis with GPT-3. We evaluate GPT-3 on the 3 datasets in [RRS20]: Natural Questions [KPR+19],  WebQuestions [BCFL13], and TriviaQA [JCWZ17], using the same splits. Note that in addition to all results being in  the closed-book setting, our use of few-shot, one-shot, and zero-shot evaluations represent an even stricter setting than  previous closed-book QA work: in addition to external content not being allowed, fine-tuning on the Q&A dataset itself  is also not permitted.

在本节中,我们将测量GPT-3回答有关广泛事实知识的问题的能力。由于可能的查询量巨大,这个任务通常是通过使用信息检索系统查找相关文本,并结合学习生成给定问题和检索文本的答案的模型来完成的。由于该设置允许系统搜索并对可能包含答案的文本进行条件设置,因此称为“open-book”。[RRS20]最近证明,一个大型语言模型可以在不依赖辅助信息的情况下直接回答问题,表现得令人惊讶地好。他们将这种更严格的评估设置称为“闭卷”。他们的工作表明,更高容量的模型可以表现得更好,我们用GPT-3测试了这一假设。我们在[RRS20]中的3个数据集上评估GPT-3: Natural Questions [KPR+19]、WebQuestions [BCFL13]和TriviaQA [JCWZ17],使用相同的分割。注意,除了所有的结果都在闭卷设置中之外,我们使用的少样本、一次小样本和零小样本的评估代表了比以前的闭卷QA工作更严格的设置:除了不允许外部内容外,也不允许对Q&A数据集本身进行微调。

The results for GPT-3 are shown in Table 3.3. On TriviaQA, we achieve 64.3% in the zero-shot setting, 68.0% in the  one-shot setting, and 71.2% in the few-shot setting. The zero-shot result already outperforms the fine-tuned T5-11B by  14.2%, and also outperforms a version with Q&A tailored span prediction during pre-training by 3.8%. The one-shot  result improves by 3.7% and matches the SOTA for an open-domain QA system which not only fine-tunes but also  makes use of a learned retrieval mechanism over a 15.3B parameter dense vector index of 21M documents [LPP+20].  GPT-3’s few-shot result further improves performance another 3.2% beyond this.  
On WebQuestions (WebQs), GPT-3 achieves 14.4% in the zero-shot setting, 25.3% in the one-shot setting, and 41.5%  in the few-shot setting. This compares to 37.4% for fine-tuned T5-11B, and 44.7% for fine-tuned T5-11B+SSM,  which uses a Q&A-specific pre-training procedure. GPT-3 in the few-shot setting approaches the performance of  state-of-the-art fine-tuned models. Notably, compared to TriviaQA, WebQS shows a much larger gain from zero-shot to  few-shot (and indeed its zero-shot and one-shot performance are poor), perhaps suggesting that the WebQs questions and/or the style of their answers are out-of-distribution for GPT-3. Nevertheless, GPT-3 appears able to adapt to this  distribution, recovering strong performance in the few-shot setting.


GPT-3的结果如表3.3所示。在TriviaQA上,我们在零样本设置中达到64.3%,在单样本设置中达到68.0%,在少样本设置中达到71.2%。零样本结果已经比微调的T5-11B高出14.2%,也比在预训练中加入针对问答的片段预测的版本高出3.8%。单样本结果又提高了3.7%,与一个开放域QA系统的SOTA持平——该系统不仅进行了微调,还利用了在包含2100万文档、153亿参数的稠密向量索引上学习得到的检索机制[LPP+20]。GPT-3的少样本结果在此基础上进一步将性能提高了3.2%。
在WebQuestions (WebQs)上,GPT-3在零样本设置中达到14.4%,在单样本设置中达到25.3%,在少样本设置中达到41.5%。相比之下,微调的T5-11B为37.4%,使用针对问答的预训练程序的微调T5-11B+SSM为44.7%。GPT-3在少样本设置中接近最先进微调模型的性能。值得注意的是,与TriviaQA相比,WebQs从零样本到少样本的增益要大得多(事实上其零样本和单样本表现很差),这或许表明WebQs的问题和/或其答案的风格对GPT-3来说是分布外的。尽管如此,GPT-3似乎能够适应这种分布,在少样本设置中恢复了强劲的表现。

On Natural Questions (NQs) GPT-3 achieves 14.6% in the zero-shot setting, 23.0% in the one-shot setting, and 29.9% in  the few-shot setting, compared to 36.6% for fine-tuned T5 11B+SSM. Similar to WebQS, the large gain from zero-shot  to few-shot may suggest a distribution shift, and may also explain the less competitive performance compared to  TriviaQA and WebQS. In particular, the questions in NQs tend towards very fine-grained knowledge on Wikipedia  specifically which could be testing the limits of GPT-3’s capacity and broad pretraining distribution.  

Overall, on one of the three datasets GPT-3’s one-shot matches the open-domain fine-tuning SOTA. On the other two  datasets it approaches the performance of the closed-book SOTA despite not using fine-tuning. On all 3 datasets, we  find that performance scales very smoothly with model size (Figure 3.3 and Appendix H Figure H.7), possibly reflecting  the idea that model capacity translates directly to more 'knowledge’ absorbed in the parameters of the model.

在自然问题(NQs)中,GPT-3在零杆设置中达到了14.6%,在单杆设置中达到了23.0%,在少杆设置中达到了29.9%,而在经过微调的T5 11B+SSM中达到了36.6%。与WebQS类似,从零杆到少杆的巨大增益可能意味着分布的转移,这也可能解释了与TriviaQA和WebQS相比竞争力较差的原因。特别是,NQs的问题倾向于维基百科上非常精细的知识,可以测试GPT-3的能力极限和广泛的培训前分布。 

总的来说,在三个数据集之一上,GPT-3的单样本结果与开放域微调的SOTA持平。在另外两个数据集上,尽管没有使用微调,其性能也接近闭卷SOTA。在所有3个数据集上,我们发现性能随模型规模扩展得非常平滑(图3.3和附录H图H.7),这可能反映了模型容量直接转化为模型参数中吸收的更多"知识"这一观点。

3.3 Translation  翻译任务

For GPT-2 a filter was used on a multilingual collection of documents to produce an English only dataset due to capacity  concerns. Even with this filtering GPT-2 showed some evidence of multilingual capability and performed non-trivially  when translating between French and English despite only training on 10 megabytes of remaining French text. Since we  increase the capacity by over two orders of magnitude from GPT-2 to GPT-3, we also expand the scope of the training  dataset to include more representation of other languages, though this remains an area for further improvement. As  discussed in 2.2 the majority of our data is derived from raw Common Crawl with only quality-based filtering. Although  GPT-3’s training data is still primarily English (93% by word count), it also includes 7% of text in other languages.  These languages are documented in the supplemental material. In order to better understand translation capability, we  also expand our analysis to include two additional commonly studied languages, German and Romanian.  
Existing unsupervised machine translation approaches often combine pretraining on a pair of monolingual datasets  with back-translation [SHB15] to bridge the two languages in a controlled way. By contrast, GPT-3 learns from a  blend of training data that mixes many languages together in a natural way, combining them on a word, sentence,  and document level. GPT-3 also uses a single training objective which is not customized or designed for any task in  particular. However, our one / few-shot settings aren’t strictly comparable to prior unsupervised work since they make  use of a small amount of paired examples (1 or 64). This corresponds to up to a page or two of in-context training data.  Results are shown in Table 3.4. Zero-shot GPT-3, which only receives on a natural language description of the task,  still underperforms recent unsupervised NMT results. However, providing only a single example demonstration for each translation task improves performance by over 7 BLEU and nears competitive performance with prior work.  GPT-3 in the full few-shot setting further improves another 4 BLEU resulting in similar average performance to prior  unsupervised NMT work. GPT-3 has a noticeable skew in its performance depending on language direction. For the  three input languages studied, GPT-3 significantly outperforms prior unsupervised NMT work when translating into  English but underperforms when translating in the other direction. Performance on En-Ro is a noticeable outlier at  over 10 BLEU worse than prior unsupervised NMT work. This could be a weakness due to reusing the byte-level BPE  tokenizer of GPT-2 which was developed for an almost entirely English training dataset. For both Fr-En and De-En,  few shot GPT-3 outperforms the best supervised result we could find but due to our unfamiliarity with the literature and  the appearance that these are un-competitive benchmarks we do not suspect those results represent true state of the art.  For Ro-En, few shot GPT-3 performs within 0.5 BLEU of the overall SOTA which is achieved by a combination of  unsupervised pretraining, supervised finetuning on 608K labeled examples, and backtranslation [LHCG19b].  
对于GPT-2,出于容量方面的考虑,曾对多语言文档集合使用过滤器来生成仅含英语的数据集。即便经过这种过滤,GPT-2仍显示出一定的多语言能力,尽管只在剩余的10兆字节法语文本上训练过,在法语和英语之间的翻译上也有不俗的表现。由于我们将容量从GPT-2到GPT-3提高了两个数量级以上,我们也扩展了训练数据集的范围,纳入了更多其他语言的内容,尽管这仍是一个有待进一步改进的领域。正如2.2节所讨论的,我们的大部分数据来自仅经过基于质量过滤的原始Common Crawl。尽管GPT-3的训练数据仍以英语为主(按词数计占93%),但也包含7%的其他语言文本。这些语言记录在补充材料中。为了更好地理解翻译能力,我们还将分析扩展到另外两种常被研究的语言:德语和罗马尼亚语。
现有的无监督机器翻译方法通常将在一对单语数据集上的预训练与回译[SHB15]结合起来,以可控的方式桥接两种语言。相比之下,GPT-3从以自然方式混合多种语言的训练数据中学习,在词、句子和文档层面将它们组合在一起。GPT-3也只使用单一的训练目标,并未针对任何特定任务进行定制或设计。然而,我们的单样本/少样本设置并不能与先前的无监督工作严格比较,因为它们使用了少量成对的样例(1个或64个),这相当于一到两页的上下文内训练数据。结果如表3.4所示。只接收任务自然语言描述的零样本GPT-3,仍然不如近期的无监督NMT结果。然而,仅为每个翻译任务提供一个示例演示,就能将性能提高7个BLEU以上,接近先前工作的竞争水平。完整少样本设置下的GPT-3又提高了4个BLEU,平均性能与先前的无监督NMT工作相近。GPT-3的性能因语言方向不同而有明显偏斜:对于所研究的三种输入语言,GPT-3在译入英语时显著优于先前的无监督NMT工作,但在反方向翻译时表现不佳。En-Ro上的表现是一个明显的异常值,比先前的无监督NMT工作差10个BLEU以上,这可能是由于复用了GPT-2的字节级BPE分词器,而该分词器是为几乎全英语的训练数据集开发的。对于Fr-En和De-En,少样本GPT-3优于我们能找到的最佳监督结果,但由于我们对相关文献不熟悉,并且这些基准似乎竞争不激烈,我们并不认为这些结果代表真正的最先进水平。对于Ro-En,少样本GPT-3与整体SOTA的差距在0.5 BLEU以内,而后者是通过无监督预训练、在60.8万标注样例上的监督微调以及回译的组合实现的[LHCG19b]。
Finally, across all language pairs and across all three settings (zero-, one-, and few-shot), there is a smooth trend of  improvement with model capacity. This is shown in Figure 3.4 in the case of few-shot results, and scaling for all three  settings is shown in Appendix H.最后,通过所有语言对和所有三种设置(零-、一-和少-shot),模型容量有一个平稳的提高趋势。图3.4中显示的是较少拍摄的结果,附录H中显示了所有三种设置的缩放情况。

3.4 Winograd-Style Tasks  Winograd风格任务

The Winograd Schemas Challenge [LDM12] is a classical task in NLP that involves determining which word a pronoun  refers to, when the pronoun is grammatically ambiguous but semantically unambiguous to a human. Recently fine-tuned  language models have achieved near-human performance on the original Winograd dataset, but more difficult versions such as the adversarially-mined Winogrande dataset [SBBC19] still significantly lag human performance. We test  GPT-3’s performance on both Winograd and Winogrande, as usual in the zero-, one-, and few-shot setting.  
On Winograd we test GPT-3 on the original set of 273 Winograd schemas, using the same “partial evaluation” method  described in [RWC+19]. Note that this setting differs slightly from the WSC task in the SuperGLUE benchmark, which  is presented as binary classification and requires entity extraction to convert to the form described in this section. On  Winograd GPT-3 achieves 88.3%, 89.7%, and 88.6% in the zero-shot, one-shot, and few-shot settings, showing no clear  in-context learning but in all cases achieving strong results just a few points below state-of-the-art and estimated human  performance. We note that contamination analysis found some Winograd schemas in the training data but this appears  to have only a small effect on results (see Section 4).  

Winograd Schemas Challenge [LDM12]是NLP中的一项经典任务,当一个代词在语法上有歧义,但在语义上对人来说没有歧义时,该任务涉及确定该代词指的是哪个词。最近,经过微调的语言模型在原始Winograd数据集上取得了接近人类的性能,但是更困难的版本,比如反向挖掘的Winogrande数据集[SBBC19],仍然显著落后于人类的性能。我们测试了GPT-3在Winograd和Winogrande上的性能,通常是在零杆、一杆和少杆设置下。 
在Winograd上,我们使用[RWC+19]中描述的相同的“部分求值”方法,在原始的273个Winograd模式集上测试GPT-3。请注意,此设置与SuperGLUE基准中的WSC任务略有不同,后者以二进制分类的形式表示,需要提取实体来转换为本节中描述的形式。Winograd的GPT-3在零杆、一杆和少杆设置中取得了88.3%、89.7%和88.6%的成绩,没有显示出明确的上下文学习,但在所有情况下都取得了较好的成绩,仅比最先进的和估计的人类性能低几个点。我们注意到,污染分析在训练数据中发现了一些Winograd模式,但这似乎只对结果有很小的影响(见第4节)。

On the more difficult Winogrande dataset, we do find gains to in-context learning: GPT-3 achieves 70.2% in the  zero-shot setting, 73.2% in the one-shot setting, and 77.7% in the few-shot setting. For comparison a fine-tuned  RoBERTA model achieves 79%, state-of-the-art is 84.6% achieved with a fine-tuned high capacity model (T5), and  human performance on the task as reported by [SBBC19] is 94.0%.在更困难的Winogrande数据集上,我们确实发现了上下文学习的进步:GPT-3在零样本设置中实现了70.2%,在单样本设置中实现了73.2%,在少小样本设置中实现了77.7%。相比之下,经过微调的RoBERTA模型实现了79%,使用经过微调的高容量模型(T5),最先进的实现了84.6%,而根据[SBBC19]报告的人类在该任务上的性能是94.0%。

3.5 Common Sense Reasoning  常识推理任务

Next we consider three datasets which attempt to capture physical or scientific reasoning, as distinct from sentence completion, reading comprehension, or broad knowledge question answering. The first, PhysicalQA (PIQA) [BZB+19], asks common sense questions about how the physical world works and is intended as a probe of grounded understanding of the world. GPT-3 achieves 81.0% accuracy zero-shot, 80.5% accuracy one-shot, and 82.8% accuracy few-shot (the last measured on PIQA's test server). This compares favorably to the 79.4% accuracy prior state-of-the-art of a fine-tuned RoBERTa. PIQA shows relatively shallow scaling with model size and is still over 10% worse than human performance, but GPT-3's few-shot and even zero-shot result outperform the current state-of-the-art. Our analysis flagged PIQA for a potential data contamination issue (despite hidden test labels), and we therefore conservatively mark the result with an asterisk. See Section 4 for details.

接下来,我们考虑三个试图捕捉物理或科学推理的数据集,它们不同于句子补全、阅读理解或宽泛的知识问答。第一个是PhysicalQA (PIQA) [BZB+19],它提出关于物理世界如何运作的常识问题,旨在探测模型对世界的落地理解。GPT-3的零样本准确率为81.0%,单样本为80.5%,少样本为82.8%(最后一项在PIQA的测试服务器上测得),优于此前最先进的微调RoBERTa的79.4%。PIQA上的性能随模型规模的提升相对有限,仍比人类表现差10%以上,但GPT-3的少样本乃至零样本结果已超过当前最先进水平。我们的分析将PIQA标记为存在潜在数据污染问题(尽管测试标签是隐藏的),因此我们保守地用星号标注该结果。详见第4节。


ARC [CCE+18] is a dataset of multiple-choice questions collected from 3rd to 9th grade science exams. On the  “Challenge” version of the dataset which has been filtered to questions which simple statistical or information retrieval  methods are unable to correctly answer, GPT-3 achieves 51.4% accuracy in the zero-shot setting, 53.2% in the one-shot  setting, and 51.5% in the few-shot setting. This is approaching the performance of a fine-tuned RoBERTa baseline  (55.9%) from UnifiedQA [KKS+20]. On the “Easy” version of the dataset (questions which either of the mentioned  baseline approaches answered correctly), GPT-3 achieves 68.8%, 71.2%, and 70.1% which slightly exceeds a fine-tuned  RoBERTa baseline from [KKS+20]. However, both of these results are still much worse than the overall SOTAs  achieved by the UnifiedQA which exceeds GPT-3’s few-shot results by 27% on the challenge set and 22% on the easy  set.  

On OpenBookQA [MCKS18], GPT-3 improves significantly from zero to few shot settings but is still over 20 points  short of the overall SOTA. GPT-3’s few-shot performance is similar to a fine-tuned BERT Large baseline on the  leaderboard.  
Overall, in-context learning with GPT-3 shows mixed results on commonsense reasoning tasks, with only small and  inconsistent gains observed in the one and few-shot learning settings for both PIQA and ARC, but a significant  improvement is observed on OpenBookQA. GPT-3 sets SOTA on the new PIQA dataset in all evaluation settings.


ARC [CCE+18]是一个从3至9年级科学考试中收集的多项选择题数据集。在数据集的“挑战”版本(已筛选为简单统计或信息检索方法无法正确回答的问题)上,GPT-3在零样本、单样本和少样本设置中的准确率分别为51.4%、53.2%和51.5%。这接近UnifiedQA [KKS+20]中经过微调的RoBERTa基线(55.9%)的性能。在数据集的“简单”版本(上述任一基线方法都能正确回答的问题)上,GPT-3取得了68.8%、71.2%和70.1%的成绩,略微超过了[KKS+20]中微调的RoBERTa基线。然而,这两个结果仍然远逊于UnifiedQA取得的整体SOTA,后者在挑战集上比GPT-3的少样本结果高出27%,在简单集上高出22%。 
在OpenBookQA [MCKS18]上,GPT-3从零样本到少样本设置有显著提高,但仍比整体SOTA低20多分。GPT-3的少样本性能与排行榜上经过微调的BERT Large基线相当。 
总的来说,GPT-3的上下文学习在常识推理任务上表现好坏参半:在PIQA和ARC上,单样本和少样本设置只带来微小且不一致的提升,但在OpenBookQA上观察到显著改进。GPT-3在所有评估设置下都在较新的PIQA数据集上创造了SOTA。


3.6 Reading Comprehension  阅读理解任务

Next we evaluate GPT-3 on the task of reading comprehension. We use a suite of 5 datasets including abstractive,  multiple choice, and span based answer formats in both dialog and single question settings. We observe a wide spread  in GPT-3’s performance across these datasets suggestive of varying capability with different answer formats. In general  we observe GPT-3 is on par with initial baselines and early results trained using contextual representations on each  respective dataset.  

接下来我们评估GPT-3在阅读理解任务上的表现。我们使用了一组共5个数据集,涵盖对话和单问题两种设置下的摘要式、多项选择和基于文本片段(span)的答案格式。我们观察到GPT-3在这些数据集上的性能差异很大,表明其对不同答案格式的能力有所不同。总体而言,GPT-3与各数据集上使用上下文表示训练的初始基线和早期结果大体相当。 

GPT-3 performs best (within 3 points of the human baseline) on CoQA [RCM19] a free-form conversational dataset  and performs worst (13 F1 below an ELMo baseline) on QuAC [CHI+18] a dataset which requires modeling structured  dialog acts and answer span selections of teacher-student interactions. On DROP [DWD+19], a dataset testing discrete  reasoning and numeracy in the context of reading comprehension, GPT-3 in a few-shot setting outperforms the fine-tuned  BERT baseline from the original paper but is still well below both human performance and state-of-the-art approaches  which augment neural networks with symbolic systems [RLL+19]. On SQuAD 2.0 [RJL18], GPT-3 demonstrates its  few-shot learning capabilities, improving by almost 10 F1 (to 69.8) compared to a zero-shot setting. This allows it to  slightly outperform the best fine-tuned result in the original paper. On RACE [LXL+17], a multiple choice dataset of  middle school and high school english examinations, GPT-3 performs relatively weakly and is only competitive with  the earliest work utilizing contextual representations and is still 45% behind SOTA.

GPT-3在自由形式对话数据集CoQA [RCM19]上表现最好(与人类基线相差不到3分),而在QuAC [CHI+18]上表现最差(比ELMo基线低13个F1),后者需要对结构化的对话行为和师生互动中的答案片段选择进行建模。在DROP [DWD+19](一个在阅读理解语境下测试离散推理和数值计算能力的数据集)上,GPT-3在少样本设置下超过了原论文中微调的BERT基线,但仍远低于人类表现以及用符号系统增强神经网络的最先进方法[RLL+19]。在SQuAD 2.0 [RJL18]上,GPT-3展示了它的少样本学习能力,与零样本设置相比提高了近10个F1(达到69.8),从而略微超过了原论文中最好的微调结果。在RACE [LXL+17](一个来自初中和高中英语考试的多项选择数据集)上,GPT-3表现相对较弱,仅能与最早使用上下文表示的工作相竞争,仍落后SOTA 45%。


3.7 SuperGLUE  对比

In order to better aggregate results on NLP tasks and compare to popular models such as BERT and RoBERTa in a more systematic way, we also evaluate GPT-3 on a standardized collection of datasets, the SuperGLUE benchmark [WPN+19] [CLC+19] [DMST19] [RBG11] [KCR+18] [ZLL+18] [DGM06] [BHDD+06] [GMDD07] [BDD+09] [PCC18] [PHR+18]. GPT-3’s test-set performance on the SuperGLUE dataset is shown in Table 3.8. In the few-shot setting, we used 32 examples for all tasks, sampled randomly from the training set. For all tasks except WSC and MultiRC, we sampled a new set of examples to use in the context for each problem. For WSC and MultiRC, we used the same set of randomly drawn examples from the training set as context for all of the problems we evaluated.  
We observe a wide range in GPT-3’s performance across tasks. On COPA and ReCoRD GPT-3 achieves near-SOTA  performance in the one-shot and few-shot settings, with COPA falling only a couple points short and achieving  second place on the leaderboard, where first place is held by a fine-tuned 11 billion parameter model (T5). On WSC,  performance is still relatively strong, achieving 80.1% in the few-shot setting (note that GPT-3 achieves 88.6% on the  original Winograd dataset as described in Section 3.4). On BoolQ, MultiRC, and RTE, performance is reasonable,  roughly matching that of a fine-tuned BERT-Large. On CB, we see signs of life at 75.6% in the few-shot setting. 

为了更好地汇总NLP任务上的结果,并以更系统的方式与BERT和RoBERTa等流行模型进行比较,我们还在一个标准化的数据集合,即SuperGLUE基准[WPN+19] [CLC+19] [DMST19] [RBG11] [KCR+18] [ZLL+18] [DGM06] [BHDD+06] [GMDD07] [BDD+09] [PCC18] [PHR+18]上对GPT-3进行了评估。GPT-3在SuperGLUE数据集上的测试集性能如表3.8所示。在少样本设置中,我们对所有任务使用了32个示例,从训练集中随机采样。对于除WSC和MultiRC之外的所有任务,我们为每个问题采样一组新的示例放入上下文。对于WSC和MultiRC,我们对所有被评估的问题使用同一组从训练集中随机抽取的示例作为上下文。 

我们观察到GPT-3在不同任务上的表现差异很大。在COPA和ReCoRD上,GPT-3在单样本和少样本设置中取得了接近SOTA的表现,其中COPA仅差几个百分点,在排行榜上位列第二,第一名由一个微调的110亿参数模型(T5)保持。在WSC上,性能仍然相对较强,在少样本设置中达到80.1%(请注意,如3.4节所述,GPT-3在原始Winograd数据集上达到88.6%)。在BoolQ、MultiRC和RTE上,性能尚可,大致与经过微调的BERT-Large相当。在CB上,我们在少样本设置中看到了75.6%的“生命迹象”。
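
A rough sketch of how the K = 32 few-shot prompts described above could be assembled from a training split. The `format_example` template and field names ("premise", "hypothesis", "label") are assumptions for illustration; each SuperGLUE task has its own schema and the paper's exact prompt wording is not reproduced here.

```python
import random

def format_example(ex: dict, with_answer: bool = True) -> str:
    # Hypothetical template for an RTE-style entailment item.
    text = f"{ex['premise']}\nquestion: {ex['hypothesis']} true or false?\nanswer:"
    if with_answer:
        text += f" {ex['label']}"
    return text

def build_few_shot_prompt(train_set: list[dict], test_ex: dict,
                          k: int = 32, seed: int = 0) -> str:
    rng = random.Random(seed)
    demos = rng.sample(train_set, k)            # K demonstrations drawn at random
    blocks = [format_example(d) for d in demos]
    blocks.append(format_example(test_ex, with_answer=False))  # query, answer left blank
    return "\n\n".join(blocks)
```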

WiC is a notable weak spot with few-shot performance at 49.4% (at random chance). We tried a number of different  phrasings and formulations for WiC (which involves determining if a word is being used with the same meaning in two  sentences), none of which was able to achieve strong performance. This hints at a phenomenon that will become clearer  in the next section (which discusses the ANLI benchmark) – GPT-3 appears to be weak in the few-shot or one-shot  setting at some tasks that involve comparing two sentences or snippets, for example whether a word is used the same  way in two sentences (WiC), whether one sentence is a paraphrase of another, or whether one sentence implies another.  This could also explain the comparatively low scores for RTE and CB, which also follow this format. Despite these  weaknesses, GPT-3 still outperforms a fine-tuned BERT-large on four of eight tasks and on two tasks GPT-3 is close to  the state-of-the-art held by a fine-tuned 11 billion parameter model.

Finally, we note that the few-shot SuperGLUE score steadily improves with both model size and with number of  examples in the context showing increasing benefits from in-context learning (Figure 3.8). We scale K up to 32  examples per task, after which point additional examples will not reliably fit into our context. When sweeping over  values of K, we find that GPT-3 requires less than eight total examples per task to outperform a fine-tuned BERT-Large  on overall SuperGLUE score.

WiC是一个明显的弱点,其少样本性能仅为49.4%(相当于随机水平)。我们为WiC(该任务要求判断一个单词在两个句子中是否以相同含义使用)尝试了许多不同的措辞和提示形式,都未能取得较好的效果。这暗示了一个在下一节(讨论ANLI基准)中会更加清晰的现象:在一些涉及比较两个句子或片段的任务上,例如判断一个词在两个句子中的用法是否相同(WiC)、一个句子是否是另一个句子的复述、或一个句子是否蕴含另一个句子,GPT-3在少样本或单样本设置下似乎较弱。这也可以解释同样采用这种格式的RTE和CB得分相对较低的原因。尽管存在这些弱点,GPT-3仍在8个任务中的4个上优于经过微调的BERT-Large,并且在两个任务上,GPT-3接近由微调的110亿参数模型保持的最先进水平。

最后,我们注意到,随着模型规模和上下文中示例数量的增加,少样本SuperGLUE得分稳步提高,显示出上下文学习带来的收益越来越大(图3.8)。我们将K扩展到每个任务32个示例,超过这一数量后,额外的示例将无法可靠地放入我们的上下文。在对K值进行扫描时,我们发现GPT-3每个任务总共只需不到8个示例,即可在SuperGLUE总分上超过经过微调的BERT-Large。
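
The K-sweep behind these in-context learning curves can be sketched as below. `accuracy_at_k` is a placeholder for an evaluation loop (for example, building prompts as in the previous sketch and checking the model's answers); no real scores are assumed.

```python
def accuracy_at_k(task: str, k: int) -> float:
    raise NotImplementedError("evaluate the model with k in-context demonstrations")

def sweep_k(task: str, baseline: float, ks=(0, 1, 2, 4, 8, 16, 32)):
    """Trace one in-context learning curve and report the smallest K (if any)
    at which the few-shot score clears a fixed baseline score."""
    curve = {k: accuracy_at_k(task, k) for k in ks}
    beats_baseline = [k for k, acc in curve.items() if acc > baseline]
    return curve, (min(beats_baseline) if beats_baseline else None)
```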

3.8 NLI  自然语言推理任务

Natural Language Inference (NLI) [Fyo00] concerns the ability to understand the relationship between two sentences.  In practice, this task is usually structured as a two or three class classification problem where the model classifies whether the second sentence logically follows from the first, contradicts the first sentence, or is possibly true (neutral).  SuperGLUE includes an NLI dataset, RTE, which evaluates the binary version of the task. On RTE, only the largest  version of GPT-3 performs convincingly better than random (56%) in any evaluation setting, but in a few-shot setting  GPT-3 performs similarly to a single-task fine-tuned BERT Large. We also evaluate on the recently introduced  Adversarial Natural Language Inference (ANLI) dataset [NWD+19]. ANLI is a difficult dataset employing a series of  adversarially mined natural language inference questions in three rounds (R1, R2, and R3). Similar to RTE, all of our  models smaller than GPT-3 perform at almost exactly random chance on ANLI, even in the few-shot setting (∼ 33%),  whereas GPT-3 itself shows signs of life on Round 3. Results for ANLI R3 are highlighted in Figure 3.9 and full results  for all rounds can be found in Appendix H. These results on both RTE and ANLI suggest that NLI is still a very difficult  task for language models and they are only just beginning to show signs of progress.

自然语言推理(NLI) [Fyo00]关注理解两个句子之间关系的能力。在实践中,这个任务通常被构造成二分类或三分类问题,由模型判断第二个句子是从第一个句子逻辑上推出、与第一个句子矛盾,还是可能为真(中立)。SuperGLUE包含一个NLI数据集RTE,它评估该任务的二分类版本。在RTE上,只有最大版本的GPT-3在任一评估设置中才显著优于随机水平(56%),但在少样本设置中,GPT-3的表现与单任务微调的BERT Large相近。我们还在最近提出的对抗性自然语言推理(ANLI)数据集[NWD+19]上进行了评估。ANLI是一个困难的数据集,包含三轮(R1、R2和R3)以对抗方式挖掘的自然语言推理问题。与RTE类似,所有小于GPT-3的模型在ANLI上的表现几乎完全等同于随机水平,即使在少样本设置中也是如此(约33%),而GPT-3本身在第3轮上显示出一些起色。ANLI R3的结果在图3.9中突出显示,所有轮次的完整结果见附录H。RTE和ANLI上的这些结果表明,NLI对语言模型来说仍然是一项非常困难的任务,它们才刚刚开始显示出进步的迹象。

3.9 Synthetic and Qualitative Tasks  合成与定性任务

One way to probe GPT-3’s range of abilities in the few-shot (or zero- and one-shot) setting is to give it tasks which  require it to perform simple on-the-fly computational reasoning, recognize a novel pattern that is unlikely to have  occurred in training, or adapt quickly to an unusual task. We devise several tasks to test this class of abilities. First, we  test GPT-3’s ability to perform arithmetic. Second, we create several tasks that involve rearranging or unscrambling the  letters in a word, tasks which are unlikely to have been exactly seen during training. Third, we test GPT-3’s ability to  solve SAT-style analogy problems few-shot. Finally, we test GPT-3 on several qualitative tasks, including using new  words in a sentence, correcting English grammar, and news article generation. We will release the synthetic datasets  with the hope of stimulating further study of test-time behavior of language models.  

要考察GPT-3在少样本(或零样本和单样本)设置下的能力范围,一种方法是给它一些任务,要求它执行简单的即时计算推理、识别训练中不太可能出现过的新模式,或快速适应不寻常的任务。我们设计了若干任务来测试这类能力。首先,我们测试GPT-3执行算术运算的能力。其次,我们创建了几个涉及重新排列或还原单词中字母的任务,这些任务不太可能在训练中以完全相同的形式出现过。第三,我们测试GPT-3在少样本下解决SAT风格类比题的能力。最后,我们在几个定性任务上测试GPT-3,包括在句子中使用新单词、纠正英语语法和生成新闻文章。我们将发布这些合成数据集,希望能促进对语言模型测试时行为的进一步研究。 

3.9.1 Arithmetic  算术

To test GPT-3’s ability to perform simple arithmetic operations without task-specific training, we developed a small  battery of 10 tests that involve asking GPT-3 a simple arithmetic problem in natural language:

  • 2 digit addition (2D+) – The model is asked to add two integers sampled uniformly from [0, 100), phrased in  the form of a question, e.g. “Q: What is 48 plus 76? A: 124.”  
  • 2 digit subtraction (2D-) – The model is asked to subtract two integers sampled uniformly from [0, 100); the  answer may be negative. Example: “Q: What is 34 minus 53? A: -19”.  
  • 3 digit addition (3D+) – Same as 2 digit addition, except numbers are uniformly sampled from [0, 1000).
  • 3 digit subtraction (3D-) – Same as 2 digit subtraction, except numbers are uniformly sampled from [0, 1000).
  • 4 digit addition (4D+) – Same as 3 digit addition, except uniformly sampled from [0, 10000).
  • 4 digit subtraction (4D-) – Same as 3 digit subtraction, except uniformly sampled from [0, 10000).
  • 5 digit addition (5D+) – Same as 3 digit addition, except uniformly sampled from [0, 100000).
  • 5 digit subtraction (5D-) – Same as 3 digit subtraction, except uniformly sampled from [0, 100000).
  • 2 digit multiplication (2Dx) – The model is asked to multiply two integers sampled uniformly from [0, 100), e.g. “Q: What is 24 times 42? A: 1008”.
  • One-digit composite (1DC) – The model is asked to perform a composite operation on three 1 digit numbers, with parentheses around the last two. For example, “Q: What is 6+(4*8)? A: 38”. The three 1 digit numbers are selected uniformly on [0, 10) and the operations are selected uniformly from {+,-,*}.

为了测试GPT-3在没有针对特定任务训练的情况下执行简单算术运算的能力,我们设计了一小组共10项测试,即用自然语言向GPT-3提出简单的算术问题:

  • 2位数加法(2D+)——要求模型将从[0, 100)中均匀采样的两个整数相加,以问题的形式表述,例如:“Q: What is 48 plus 76? A: 124.”
  • 2位数减法(2D-)——要求模型对从[0, 100)中均匀采样的两个整数做减法;答案可能为负数。例如:“Q: What is 34 minus 53? A: -19”。
  • 3位数加法(3D+)——与2位数加法相同,只是数字从[0, 1000)中均匀采样。
  • 3位数减法(3D-)——与2位数减法相同,只是数字从[0, 1000)中均匀采样。
  • 4位数加法(4D+)——与3位数加法相同,只是从[0, 10000)中均匀采样。
  • 4位数减法(4D-)——与3位数减法相同,只是从[0, 10000)中均匀采样。
  • 5位数加法(5D+)——与3位数加法相同,只是从[0, 100000)中均匀采样。
  • 5位数减法(5D-)——与3位数减法相同,只是从[0, 100000)中均匀采样。
  • 2位数乘法(2Dx)——要求模型将从[0, 100)中均匀采样的两个整数相乘,例如:“Q: What is 24 times 42? A: 1008”。
  • 一位数复合运算(1DC)——要求模型对三个1位数执行复合运算,并在后两个数外加括号。例如:“Q: What is 6+(4*8)? A: 38”。三个1位数字从[0, 10)中均匀选取,运算符从{+,-,*}中均匀选取。
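
A small sketch of how such arithmetic probes could be generated, following the task descriptions above: operands drawn uniformly from half-open ranges and phrased as natural-language questions. The exact prompt wording used in the paper may differ.

```python
import random

def make_addition(rng: random.Random, digits: int) -> tuple[str, str]:
    hi = 10 ** digits
    a, b = rng.randrange(hi), rng.randrange(hi)       # uniform over [0, 10^digits)
    return f"Q: What is {a} plus {b}?", f"A: {a + b}"

def make_subtraction(rng: random.Random, digits: int) -> tuple[str, str]:
    hi = 10 ** digits
    a, b = rng.randrange(hi), rng.randrange(hi)
    return f"Q: What is {a} minus {b}?", f"A: {a - b}"  # answers may be negative

def make_one_digit_composite(rng: random.Random) -> tuple[str, str]:
    a, b, c = (rng.randrange(10) for _ in range(3))
    op1, op2 = rng.choice("+-*"), rng.choice("+-*")
    expr = f"{a}{op1}({b}{op2}{c})"                   # parentheses around the last two
    return f"Q: What is {expr}?", f"A: {eval(expr)}"   # eval is safe: digits and +-* only

rng = random.Random(0)
dataset = [make_addition(rng, 2) for _ in range(2000)]  # e.g. the 2D+ task
```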

In all 10 tasks the model must generate the correct answer exactly. For each task we generate a dataset of 2,000 random  instances of the task and evaluate all models on those instances.  First we evaluate GPT-3 in the few-shot setting, for which results are shown in Figure 3.10. On addition and subtraction,  GPT-3 displays strong proficiency when the number of digits is small, achieving 100% accuracy on 2 digit addition,  98.9% at 2 digit subtraction, 80.2% at 3 digit addition, and 94.2% at 3-digit subtraction. Performance decreases as the  number of digits increases, but GPT-3 still achieves 25-26% accuracy on four digit operations and 9-10% accuracy on  five digit operations, suggesting at least some capacity to generalize to larger numbers of digits. GPT-3 also achieves  29.2% accuracy at 2 digit multiplication, an especially computationally intensive operation. Finally, GPT-3 achieves  21.3% accuracy at single digit combined operations (for example, 9*(7+5)), suggesting that it has some robustness  beyond just single operations.  

在所有10项任务中,模型必须准确生成正确的答案。对于每项任务,我们生成一个包含2,000个随机实例的数据集,并在这些实例上评估所有模型。首先,我们在少样本设置中评估GPT-3,结果如图3.10所示。在加减法方面,GPT-3在位数较少时表现出很强的熟练度:2位数加法的准确率为100%,2位数减法为98.9%,3位数加法为80.2%,3位数减法为94.2%。随着位数的增加,性能有所下降,但GPT-3在四位数运算上仍能达到25-26%的准确率,在五位数运算上达到9-10%,表明它至少具备一定的向更多位数泛化的能力。GPT-3在2位数乘法(一项计算量特别大的运算)上也达到了29.2%的准确率。最后,GPT-3在一位数复合运算(例如9*(7+5))上达到了21.3%的准确率,表明它在单一运算之外也具有一定的稳健性。 
 

As Figure 3.10 makes clear, small models do poorly on all of these tasks – even the 13 billion parameter model (the  second largest after the 175 billion full GPT-3) can solve 2 digit addition and subtraction only half the time, and all  other operations less than 10% of the time.  

One-shot and zero-shot performance are somewhat degraded relative to few-shot performance, suggesting that adaptation  to the task (or at the very least recognition of the task) is important to performing these computations correctly.  Nevertheless, one-shot performance is still quite strong, and even zero-shot performance of the full GPT-3 significantly outperforms few-shot learning for all smaller models. All three settings for the full GPT-3 are shown in Table 3.9, and  model capacity scaling for all three settings is shown in Appendix H.

To spot-check whether the model is simply memorizing specific arithmetic problems, we took the 3-digit arithmetic  problems in our test set and searched for them in our training data in both the forms "<NUM1> + <NUM2> =" and  "<NUM1> plus <NUM2>". Out of 2,000 addition problems we found only 17 matches (0.8%) and out of 2,000  subtraction problems we found only 2 matches (0.1%), suggesting that only a trivial fraction of the correct answers  could have been memorized. In addition, inspection of incorrect answers reveals that the model often makes mistakes  such as not carrying a “1”, suggesting it is actually attempting to perform the relevant computation rather than  memorizing a table.  
Overall, GPT-3 displays reasonable proficiency at moderately complex arithmetic in few-shot, one-shot, and even  zero-shot settings.

如图3.10所示,小模型在所有这些任务上表现都很差——即使是130亿参数的模型(仅次于1750亿参数的完整GPT-3),也只能在约一半的情况下解决2位数加减法,其他运算的正确率不到10%。单样本和零样本的性能相对于少样本有所下降,这表明适应任务(或至少识别出任务)对于正确执行这些计算很重要。尽管如此,单样本性能仍然相当强,甚至完整GPT-3的零样本性能也显著优于所有较小模型的少样本学习。完整GPT-3在三种设置下的结果见表3.9,三种设置下的模型容量扩展情况见附录H。
为了抽查模型是否只是在记忆特定的算术题,我们取测试集中的3位数算术题,并在训练数据中以“<NUM1> + <NUM2> =”和“<NUM1> plus <NUM2>”两种形式进行搜索。在2,000道加法题中只找到17个匹配(0.8%),在2,000道减法题中只找到2个匹配(0.1%),这表明只有极小一部分正确答案可能被记忆过。此外,对错误答案的检查发现,模型经常犯诸如没有进位“1”之类的错误,表明它实际上是在尝试执行相关计算,而不是记忆一张表格。总的来说,GPT-3在少样本、单样本甚至零样本设置中,对中等复杂度的算术表现出相当不错的熟练度。
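
The spot-check described above can be sketched as a simple substring search over the training text for both surface forms of each problem. `training_corpus` is a stand-in; the real check would stream over the full pretraining data rather than an in-memory list.

```python
def spot_check(problems: list[tuple[int, int, str]], training_corpus: list[str]) -> int:
    """problems: (a, b, op) triples; returns how many appear verbatim in the corpus."""
    hits = 0
    for a, b, op in problems:
        word = "plus" if op == "+" else "minus"
        # The two surface forms searched for in the training data.
        patterns = (f"{a} {op} {b} =", f"{a} {word} {b}")
        if any(p in doc for doc in training_corpus for p in patterns):
            hits += 1
    return hits
```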

3.9.2 Word Scrambling and Manipulation Tasks  拼字和操作任务

To test GPT-3’s ability to learn novel symbolic manipulations from a few examples, we designed a small battery of  5 “character manipulation” tasks. Each task involves giving the model a word distorted by some combination of  scrambling, addition, or deletion of characters, and asking it to recover the original word. The 5 tasks are:

  • Cycle letters in word (CL) – The model is given a word with its letters cycled, then the “=” symbol, and is expected to generate the original word. For example, it might be given “lyinevitab” and should output “inevitably”.
  • Anagrams of all but first and last characters (A1) – The model is given a word where every letter except the first and last have been scrambled randomly, and must output the original word. Example: criroptuon = corruption.
  • Anagrams of all but first and last 2 characters (A2) – The model is given a word where every letter except the first 2 and last 2 have been scrambled randomly, and must recover the original word. Example: opoepnnt → opponent.
  • Random insertion in word (RI) – A random punctuation or space character is inserted between each letter of a word, and the model must output the original word. Example: s.u!c/c!e.s s i/o/n = succession.
  • Reversed words (RW) – The model is given a word spelled backwards, and must output the original word. Example: stcejbo → objects.

为了测试GPT-3从少量示例中学习新的符号操作的能力,我们设计了一小组共5个“字符操作”任务。每个任务都是给模型一个经过字符打乱、添加或删除等方式扭曲的单词,并要求它恢复原来的单词。这5项任务是:

  • 单词字母循环(CL)——给模型一个字母被循环移位的单词,后跟“=”符号,期望它生成原始单词。例如,给定“lyinevitab”,应输出“inevitably”。
  • 除首尾字母外的字谜(A1)——给模型一个除第一个和最后一个字母外其余字母被随机打乱的单词,模型必须输出原始单词。例如:criroptuon = corruption。
  • 除首尾各两个字母外的字谜(A2)——给模型一个除前两个和后两个字母外其余字母被随机打乱的单词,模型必须恢复原始单词。例如:opoepnnt → opponent。
  • 单词中随机插入(RI)——在单词的每个字母之间插入随机的标点或空格字符,模型必须输出原始单词。例如:s.u!c/c!e.s s i/o/n = succession。
  • 反向单词(RW)——给模型一个倒序拼写的单词,模型必须输出原始单词。例如:stcejbo → objects。
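
Illustrative generators for the five character-manipulation tasks listed above. They follow the task definitions but are not the paper's actual data-generation code; the filler characters and shuffling details are assumptions.

```python
import random
import string

rng = random.Random(0)

def cycle_letters(word: str) -> str:             # CL: rotate the letters (len > 1 assumed)
    k = rng.randrange(1, len(word))
    return word[k:] + word[:k]

def anagram_inner(word: str, keep: int) -> str:  # A1 (keep=1) / A2 (keep=2)
    head, mid, tail = word[:keep], list(word[keep:-keep]), word[-keep:]
    rng.shuffle(mid)                             # scramble only the interior letters
    return head + "".join(mid) + tail

def random_insertion(word: str) -> str:          # RI: punctuation/space between letters
    fillers = string.punctuation + " "
    return "".join(ch + rng.choice(fillers) for ch in word[:-1]) + word[-1]

def reverse_word(word: str) -> str:              # RW: spell the word backwards
    return word[::-1]

# e.g. anagram_inner("opponent", keep=2) might return "opoepnnt"
```
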
For each task we generate 10,000 examples, which we chose to be the top 10,000 most frequent words as measured by  [Nor09] of length more than 4 characters and less than 15 characters. The few-shot results are shown in Figure 3.11.  Task performance tends to grow smoothly with model size, with the full GPT-3 model achieving 66.9% on removing random insertions, 38.6% on cycling letters, 40.2% on the easier anagram task, and 15.1% on the more difficult anagram  task (where only the first and last letters are held fixed). None of the models can reverse the letters in a word.  
In the one-shot setting, performance is significantly weaker (dropping by half or more), and in the zero-shot setting the  model can rarely perform any of the tasks (Table 3.10). This suggests that the model really does appear to learn these  tasks at test time, as the model cannot perform them zero-shot and their artificial nature makes them unlikely to appear  in the pre-training data (although we cannot confirm this with certainty).  
对于每项任务,我们生成10,000个示例,所用单词取自[Nor09]统计的最高频的10,000个单词中长度大于4个字符且小于15个字符的词。少样本结果如图3.11所示。任务性能随模型规模平稳增长,完整的GPT-3模型在去除随机插入任务上达到66.9%,在字母循环任务上达到38.6%,在较容易的字谜任务上达到40.2%,在较难的字谜任务(只固定首尾字母)上达到15.1%。没有任何一个模型能够把倒序单词的字母顺序还原回来。 
在单样本设置中,性能明显较弱(下降一半或更多),而在零样本设置中,模型几乎无法执行任何任务(表3.10)。这表明模型确实是在测试时学会了这些任务,因为模型无法以零样本方式完成它们,而且这些任务的人为构造性质使它们不太可能出现在预训练数据中(尽管我们无法完全确定这一点)。 

We can further quantify performance by plotting “in-context learning curves”, which show task performance as a  function of the number of in-context examples. We show in-context learning curves for the Symbol Insertion task  in Figure 1.2. We can see that larger models are able to make increasingly effective use of in-context information,  including both task examples and natural language task descriptions.

Finally, it is worth adding that solving these tasks requires character-level manipulations, whereas our BPE encoding  operates on significant fractions of a word (on average ∼ 0.7 words per token), so from the LM’s perspective succeeding  at these tasks involves not just manipulating BPE tokens but understanding and pulling apart their substructure. Also,  CL, A1, and A2 are not bijective (that is, the unscrambled word is not a deterministic function of the scrambled word),  requiring the model to perform some search to find the correct unscrambling. Thus, the skills involved appear to require  non-trivial pattern-matching and computation.

我们可以通过绘制“上下文内学习曲线”来进一步量化绩效,该曲线将任务绩效显示为上下文内例子数量的函数。我们在图1.2中展示了用于符号插入任务的上下文内学习曲线。我们可以看到,更大的模型能够越来越有效地使用上下文信息,包括任务示例和自然语言任务描述。
最后,值得补充的是,解决这些任务需要字符级的操作,而我们的BPE编码作用于单词的较大片段(平均每个token约0.7个单词),因此从语言模型的角度看,要在这些任务上成功,不仅要操作BPE token,还要理解并拆解它们的子结构。另外,CL、A1和A2不是双射的(也就是说,还原后的单词并不是打乱后单词的确定性函数),这要求模型执行一定的搜索才能找到正确的还原结果。因此,所涉及的技能似乎需要非平凡的模式匹配和计算。
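
To illustrate the non-bijectivity point, the sketch below recovers a word from an A1/A2-style scramble by searching a vocabulary for words with the same fixed ends and the same multiset of inner letters; several words can in principle share one scrambled form. The small vocabulary is a stand-in for the [Nor09] word list.

```python
VOCAB = {"opponent", "corruption", "inevitably", "succession", "objects"}

def unscramble_inner(scrambled: str, keep: int = 1) -> list[str]:
    """Return all vocabulary words consistent with an A1/A2-style scrambling."""
    head, mid, tail = scrambled[:keep], scrambled[keep:-keep], scrambled[-keep:]
    matches = []
    for w in VOCAB:
        if (len(w) == len(scrambled) and w[:keep] == head and w[-keep:] == tail
                and sorted(w[keep:-keep]) == sorted(mid)):
            matches.append(w)
    return matches

print(unscramble_inner("criroptuon"))   # -> ['corruption']
```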

3.9.3 SAT Analogies 类比

To test GPT-3 on another task that is somewhat unusual relative to the typical distribution of text, we collected a set of 374 “SAT analogy” problems [TLBS03]. Analogies are a style of multiple choice question that constituted a section of the SAT college entrance exam before 2005. A typical example is “audacious is to boldness as (a) sanctimonious is to hypocrisy, (b) anonymous is to identity, (c) remorseful is to misdeed, (d) deleterious is to result, (e) impressionable is to temptation”. The student is expected to choose which of the five word pairs has the same relationship as the original word pair; in this example the answer is “sanctimonious is to hypocrisy”. On this task GPT-3 achieves 65.2% in the few-shot setting, 59.1% in the one-shot setting, and 53.7% in the zero-shot setting, whereas the average score among college applicants was 57% [TL05] (random guessing yields 20%). As shown in Figure 3.12, the results improve with scale, with the full 175 billion model improving by over 10% compared to the 13 billion parameter model.为了在另一个相对于典型文本分布而言有些不寻常的任务上测试GPT-3,我们收集了374道“SAT类比”题[TLBS03]。类比题是一种多项选择题,曾是2005年之前SAT大学入学考试的一个部分。一个典型的例子是“audacious之于boldness,如同 (a) sanctimonious之于hypocrisy,(b) anonymous之于identity,(c) remorseful之于misdeed,(d) deleterious之于result,(e) impressionable之于temptation”。考生需要从五个词对中选出与原词对具有相同关系的一个;在这个例子中答案是“sanctimonious之于hypocrisy”。在这项任务中,GPT-3在少样本、单样本和零样本设置中分别取得65.2%、59.1%和53.7%的成绩,而大学申请者的平均得分为57% [TL05](随机猜测为20%)。如图3.12所示,结果随规模提升而改善,完整的1750亿参数模型比130亿参数模型提高了10%以上。
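
One plausible way to score such multiple-choice analogies with a language model is to compare the likelihood of each completed analogy, as sketched below. `log_prob` is the same kind of hypothetical scoring helper as before, and the "X is to Y as" phrasing is an assumption rather than the paper's exact template.

```python
def log_prob(prefix: str, continuation: str) -> float:
    raise NotImplementedError("plug in a language model of your choice")

def score_analogy(stem: tuple[str, str], options: list[tuple[str, str]]) -> int:
    a, b = stem
    prefix = f"{a} is to {b} as"
    scores = [log_prob(prefix, f" {c} is to {d}") for c, d in options]
    return max(range(len(options)), key=scores.__getitem__)  # index of best option

# score_analogy(("audacious", "boldness"),
#               [("sanctimonious", "hypocrisy"), ("anonymous", "identity"),
#                ("remorseful", "misdeed"), ("deleterious", "result"),
#                ("impressionable", "temptation")])
```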

3.9.4 News Article Generation  新闻文章生成

Previous work on generative language models qualitatively tested their ability to generate synthetic “news articles” by  conditional sampling from the model given a human-written prompt consisting of a plausible first sentence for a news  story [RWC+19]. Relative to [RWC+19], the dataset used to train GPT-3 is much less weighted towards news articles,  so trying to generate news articles via raw unconditional samples is less effective – for example GPT-3 often interprets  the proposed first sentence of a “news article” as a tweet and then posts synthetic responses or follow-up tweets. To  solve this problem we employed GPT-3’s few-shot learning abilities by providing three previous news articles in the  model’s context to condition it. With the title and subtitle of a proposed next article, the model is able to reliably  generate short articles in the “news” genre.  
To gauge the quality of news article generation from GPT-3 (which we believe is likely to be correlated with conditional sample generation quality in general), we decided to measure human ability to distinguish GPT-3-generated articles from real ones. Similar work has been carried out by Kreps et al. [KMB20] and Zellers et al. [ZHR+19]. Generative language models are trained to match the distribution of content generated by humans, so the (in)ability of humans to distinguish the two is a potentially important measure of quality.


之前关于生成式语言模型的工作曾定性地测试过它们生成合成“新闻文章”的能力:给定一条由人撰写的、看似新闻报道开头第一句的提示,从模型中进行条件采样[RWC+19]。与[RWC+19]相比,用于训练GPT-3的数据集中新闻文章所占比重要小得多,因此试图通过原始的无条件采样来生成新闻文章的效果较差——例如,GPT-3经常把给出的“新闻文章”第一句话理解为一条推文,然后生成合成的回复或后续推文。为了解决这个问题,我们利用GPT-3的少样本学习能力,在模型的上下文中提供之前的三篇新闻文章作为条件。给定拟生成的下一篇文章的标题和副标题,模型就能够可靠地生成“新闻”体裁的短文。 

为了衡量GPT-3生成新闻文章的质量(我们认为这很可能与有条件的样本生成质量总体上相关),我们决定衡量人类区分GPT-3生成的文章与真实文章的能力。Kreps等人[KMB20]和Zellers等人[ZHR+19]也进行了类似的工作。生成语言模型被训练来匹配人类生成的内容的分布,所以人类区分这两者的能力是质量的一个潜在的重要衡量标准
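
A rough sketch of the conditioning setup described above: three earlier human-written articles in the context, followed by the title and subtitle of the article to be generated. `generate` stands in for whatever sampling routine is used, and the "Title:/Subtitle:/Article:" labels are assumed formatting, not the paper's exact template.

```python
def generate(prompt: str, max_tokens: int = 400) -> str:
    raise NotImplementedError("call a language model to sample a completion")

def news_prompt(prior_articles: list[dict], title: str, subtitle: str) -> str:
    blocks = [
        f"Title: {a['title']}\nSubtitle: {a['subtitle']}\nArticle: {a['body']}"
        for a in prior_articles[:3]              # three human-written articles as context
    ]
    blocks.append(f"Title: {title}\nSubtitle: {subtitle}\nArticle:")
    return "\n\n".join(blocks)

# completion = generate(news_prompt(context_articles, proposed_title, proposed_subtitle))
```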

In order to see how well humans can detect model generated text, we arbitrarily selected 25 article titles and subtitles from the website newser.com (mean length: 215 words). We then generated completions of these titles and subtitles from four language models ranging in size from 125M to 175B (GPT-3) parameters (mean length: 200 words). For each model, we presented around 80 US-based participants with a quiz consisting of these real titles and subtitles followed by either the human written article or the article generated by the model. Participants were asked to select whether the article was “very likely written by a human”, “more likely written by a human”, “I don’t know”, “more likely written by a machine”, or “very likely written by a machine”. 
The articles we selected were not in the models’ training data and the model outputs were formatted and selected  programmatically to prevent human cherry-picking. All models used the same context to condition outputs on and were  pre-trained with the same context size and the same article titles and subtitles were used as prompts for each model.  However, we also ran an experiment to control for participant effort and attention that followed the same format but  involved intentionally bad model generated articles. This was done by generating articles from a “control model”: a  160M parameter model with no context and increased output randomness.

为了考察人类检测模型生成文本的能力,我们从newser.com网站上任意选择了25篇文章的标题和副标题(平均长度:215个单词)。然后,我们用参数规模从1.25亿(125M)到1750亿(175B,即GPT-3)不等的四个语言模型,为这些标题和副标题生成续写(平均长度:200个单词)。对于每个模型,我们向大约80名美国参与者展示了一个测验,其中包含这些真实的标题和副标题,后接人工撰写的文章或由该模型生成的文章。参与者被要求选择文章是“很可能由人撰写”、“更可能由人撰写”、“不知道”、“更可能由机器撰写”还是“很可能由机器撰写”。

我们选择的文章不在模型的训练数据中,模型输出经过程序化的格式化和选择,以防止人为挑选。所有模型都使用相同的上下文作为输出的条件,并以相同的上下文长度进行预训练,每个模型都使用相同的文章标题和副标题作为提示。不过,我们还进行了一项对照实验来控制参与者的投入程度和注意力:该实验采用相同的形式,但使用故意生成得很差的文章。这些文章由一个“对照模型”生成:一个没有上下文且输出随机性被调高的1.6亿(160M)参数模型。

Mean human accuracy (the ratio of correct assignments to non-neutral assignments per participant) at detecting that the intentionally bad articles were model generated was ∼ 86% where 50% is chance level performance. By contrast, mean human accuracy at detecting articles that were produced by the 175B parameter model was barely above chance at ∼ 52% (see Table 3.11). Human abilities to detect model generated text appear to decrease as model size increases: there appears to be a trend towards chance accuracy with model size, and human detection of GPT-3 is close to chance. This is true despite the fact that participants spend more time on each output as model size increases (see Appendix E).  
Examples of synthetic articles from GPT-3 are given in Figures 3.14 and 3.15. Much of the text is—as indicated by the evaluations—difficult for humans to distinguish from authentic human content. Factual inaccuracies can be an indicator that an article is model generated since, unlike human authors, the models have no access to the specific facts that the article titles refer to or when the article was written. Other indicators include repetition, non sequiturs, and unusual phrasings, though these are often subtle enough that they are not noticed.  
Related work on language model detection by Ippolito et al. [IDCBE19] indicates that automatic discriminators like GROVER [ZHR+19] and GLTR [GSR19] may have greater success at detecting model generated text than human evaluators. Automatic detection of these models may be a promising area of future research.  

Ippolito et al. [IDCBE19] also note that human accuracy at detecting model generated text increases as humans observe  more tokens. To do a preliminary investigation of how good humans are at detecting longer news articles generated  by GPT-3 175B, we selected 12 world news articles from Reuters with an average length of 569 words and generated  completions of these articles from GPT-3 with an average length of 498 words (298 words longer than our initial  experiments). Following the methodology above, we ran two experiments, each on around 80 US-based participants, to  compare human abilities to detect the articles generated by GPT-3 and a control model.  
We found that mean human accuracy at detecting the intentionally bad longer articles from the control model was  ∼ 88%, while mean human accuracy at detecting the longer articles that were produced by GPT-3 175B was still barely  above chance at ∼ 52% (see Table 3.12). This indicates that, for news articles that are around 500 words long, GPT-3  continues to produce articles that humans find difficult to distinguish from human written news articles.

在检测故意生成得很差的文章是否由模型生成时,人类的平均准确率(每位参与者做出的正确判断占非中立判断的比例)约为86%,其中50%为随机水平。相比之下,人类检测由175B参数模型生成的文章的平均准确率仅略高于随机水平,约为52%(见表3.11)。人类检测模型生成文本的能力似乎随着模型规模的增大而下降:随着模型规模增大,准确率呈现趋向随机水平的趋势,人类对GPT-3的检测已接近随机。尽管随着模型规模增大,参与者在每篇输出上花费的时间更多(见附录E),情况依然如此。
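
The accuracy metric quoted above (correct assignments divided by non-neutral assignments, averaged over participants) can be computed as in this small sketch; the response strings mirror the five quiz options but the data layout is an assumption.

```python
MACHINE = {"very likely written by a machine", "more likely written by a machine"}
NEUTRAL = "I don't know"

def participant_accuracy(responses: list[tuple[str, bool]]) -> float:
    """responses: (rating, article_was_model_generated) pairs for one participant."""
    correct = non_neutral = 0
    for rating, is_model in responses:
        if rating == NEUTRAL:
            continue                             # neutral answers are excluded
        non_neutral += 1
        if (rating in MACHINE) == is_model:
            correct += 1
    return correct / non_neutral if non_neutral else float("nan")

def mean_accuracy(all_participants: list[list[tuple[str, bool]]]) -> float:
    accs = [participant_accuracy(p) for p in all_participants]
    accs = [a for a in accs if a == a]           # drop participants with no non-neutral answers
    return sum(accs) / len(accs)
```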

图3.14和图3.15给出了GPT-3生成的合成文章示例。正如评估结果所显示的,其中大部分文本对人类来说很难与真实的人类内容区分开来。事实性错误可能是文章由模型生成的一个标志,因为与人类作者不同,模型无法获知文章标题所指的具体事实或文章的写作时间。其他标志包括重复、前后不连贯以及不寻常的措辞,不过这些往往足够细微而不被注意到。 

Ippolito等人[IDCBE19]在语言模型检测方面的相关工作表明,像GROVER [ZHR+19]和GLTR [GSR19]这样的自动判别器在检测模型生成文本方面可能比人类评估者更成功。对这些模型的自动检测可能是未来研究的一个有前景的方向。

Ippolito等人[IDCBE19]还指出,随着人们观察到的token越多,人类检测模型生成文本的准确率也会提高。为了初步考察人类检测GPT-3 175B生成的较长新闻文章的能力,我们从路透社选取了12篇平均长度为569个单词的世界新闻文章,并用GPT-3为这些文章生成平均长度为498个单词的续写(比我们最初的实验长298个单词)。按照上述方法,我们进行了两个实验,每个实验约有80名美国参与者,以比较人类检测GPT-3与对照模型所生成文章的能力。 

我们发现,人类检测对照模型故意生成的较差长文章的平均准确率约为88%,而检测GPT-3 175B生成的长文章的平均准确率仍仅略高于随机水平,约为52%(见表3.12)。这表明,对于长度在500词左右的新闻文章,GPT-3仍然能够生成人类难以与人类撰写的新闻文章区分开的文章。

3.9.5 Learning and Using Novel Words  学习和使用新单词

A task studied in developmental linguistics [CB78] is the ability to learn and utilize new words, for example using a  word in a sentence after seeing it defined only once, or conversely inferring a word’s meaning from only one usage. Here  we qualitatively test GPT-3’s ability to do the former. Specifically, we give GPT-3 the definition of a nonexistent word,  such as “Gigamuru”, and then ask it to use it in a sentence. We provide one to five previous examples of a (separate) nonexistent word being defined and used in a sentence, so the task is few-shot in terms of previous examples of the  broad task and one-shot in terms of the specific word. Table 3.16 shows the 6 examples we generated; all definitions  were human-generated, and the first answer was human-generated as conditioning while the subsequent answers were  generated by GPT-3. These examples were generated continuously in one sitting and we did not omit or repeatedly try  any prompts. In all cases the generated sentence appears to be a correct or at least plausible use of the word. In the final  sentence the model generates a plausible conjugation for the word “screeg” (namely “screeghed”), although the use of  the word is slightly awkward (“screeghed at each other”) despite being plausible in the sense that it could describe a toy  sword fight. Overall, GPT-3 appears to be at least proficient at the task of using novel words in a sentence.

发展语言学[CB78]研究的一项任务是学习和使用新单词的能力,例如在只见过一次某个单词的定义之后就在句子中使用它,或者反过来仅从一次用法中推断出单词的含义。在这里,我们定性地测试GPT-3完成前一种任务的能力。具体来说,我们给GPT-3一个不存在的单词的定义,比如“Gigamuru”,然后要求它在句子中使用这个词。我们提供1到5个此前其他不存在的单词被定义并在句子中使用的示例,因此就这个宽泛任务的先前示例而言是少样本的,而就具体单词而言是单样本的。表3.16展示了我们生成的6个示例;所有定义都由人工给出,第一个答案由人工给出作为条件(conditioning),其余答案由GPT-3生成。这些示例是一次性连续生成的,我们没有省略或反复尝试任何提示。在所有情况下,生成的句子看起来都是对该词正确的或至少说得通的使用。在最后一句中,模型为单词“screeg”生成了一个貌似合理的变位形式(即“screeghed”),尽管该词的用法略显别扭(“screeghed at each other”),但就描述一场玩具剑比试而言还算说得通。总的来说,GPT-3在用新单词造句这项任务上看起来至少是熟练的。
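
A sketch of how the novel-word prompt could be laid out: one or more demonstrations of a different made-up word being defined and used, then the definition of the target word with the sentence left to the model. The demonstration content and exact phrasing are illustrative assumptions.

```python
def novel_word_prompt(demos: list[tuple[str, str, str]], word: str, definition: str) -> str:
    blocks = [
        f'A "{w}" is {d}. An example of a sentence that uses the word {w} is:\n{sentence}'
        for w, d, sentence in demos
    ]
    # Target word: definition given, sentence left for the model to produce.
    blocks.append(
        f'A "{word}" is {definition}. '
        f"An example of a sentence that uses the word {word} is:"
    )
    return "\n\n".join(blocks)

# Hypothetical usage with a made-up demonstration word:
# prompt = novel_word_prompt(
#     [("whatpu", "a small, furry animal native to Tanzania",
#       "We were traveling in Africa and we saw these very cute whatpus.")],
#     "Gigamuru", "a type of Japanese musical instrument")
```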

3.9.6 Correcting English Grammar  修改英语语法

Another task well suited for few-shot learning is correcting English grammar. We test this with GPT-3 in the few-shot setting by giving prompts of the form "Poor English Input: <sentence>\n Good English Output: <sentence>". We give GPT-3 one human-generated correction and then ask it to correct 5 more (again without any omissions or repeats). Results are shown in Figure 3.17.另一项非常适合少样本学习的任务是纠正英语语法。我们在少样本设置中用GPT-3测试这一点,给出形如"Poor English Input: <sentence>\n Good English Output: <sentence>"的提示。我们先给GPT-3一个人工撰写的修正示例,然后要求它再修正5个句子(同样没有任何省略或重复)。结果如图3.17所示。
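
A minimal sketch of the few-shot grammar-correction prompt using the quoted "Poor English Input: ... Good English Output: ..." format; the demonstration sentence pair is made up for illustration.

```python
def grammar_prompt(demos: list[tuple[str, str]], query: str) -> str:
    lines = []
    for bad, good in demos:
        lines.append(f"Poor English Input: {bad}\nGood English Output: {good}")
    lines.append(f"Poor English Input: {query}\nGood English Output:")
    return "\n".join(lines)

print(grammar_prompt(
    [("I eated the purple berries.", "I ate the purple berries.")],
    "The mentioned changes have done."))
```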

4 Measuring and Preventing Memorization Of Benchmarks  测量和防止记忆基准

Since our training dataset is sourced from the internet, it is possible that our model was trained on some of our  benchmark test sets. Accurately detecting test contamination from internet-scale datasets is a new area of research  without established best practices. While it is common practice to train large models without investigating contamination,  given the increasing scale of pretraining datasets, we believe this issue is becoming increasingly important to attend to.  
This concern is not just hypothetical. One of the first papers to train a language model on Common Crawl data [TL18]  detected and removed a training document which overlapped with one of their evaluation datasets. Other work such  as GPT-2 [RWC+19] also conducted post-hoc overlap analysis. Their study was relatively encouraging, finding that although models did perform moderately better on data that overlapped between training and testing, this did not  significantly impact reported results due to the small fraction of data which was contaminated (often only a few percent).

由于我们的训练数据集来自互联网,所以我们的模型可能是在一些基准测试集上训练的。从互联网规模的数据集中准确地检测测试污染是一个新的研究领域,没有建立最佳实践。虽然在训练大型模型时不调查污染是常见的做法,但考虑到训练前数据集规模的不断扩大,我们相信这个问题正变得越来越重要。 

这种担忧不仅仅是假设。最早在Common Crawl数据上训练语言模型的论文之一[TL18]检测并移除了一份与其某个评估数据集重叠的训练文档。GPT-2 [RWC+19]等其他工作也进行了事后重叠分析。他们的研究结果相对令人鼓舞:尽管模型在训练与测试重叠的数据上确实表现得稍好一些,但由于被污染的数据比例很小(通常只有几个百分点),这并没有对报告的结果产生显著影响。

GPT-3 operates in a somewhat different regime. On the one hand, the dataset and model size are about two orders of  magnitude larger than those used for GPT-2, and include a large amount of Common Crawl, creating increased potential  for contamination and memorization. On the other hand, precisely due to the large amount of data, even GPT-3 175B  does not overfit its training set by a significant amount, measured relative to a held-out validation set with which it was  deduplicated (Figure 4.1). Thus, we expect that contamination is likely to be frequent, but that its effects may not be as  large as feared.  
We initially tried to address the issue of contamination by proactively searching for and attempting to remove any overlap  between our training data and the development and test sets of all benchmarks studied in this paper. Unfortunately, a  bug resulted in only partial removal of all detected overlaps from the training data. Due to the cost of training, it wasn’t  feasible to retrain the model. To address this, we investigate in detail how the remaining detected overlap impacts  results.  
For each benchmark, we produce a 'clean’ version which removes all potentially leaked examples, defined roughly as  examples that have a 13-gram overlap with anything in the pretraining set (or that overlap with the whole example when  it is shorter than 13-grams). The goal is to very conservatively flag anything that could potentially be contamination,  so as to produce a clean subset that is free of contamination with high confidence. The exact procedure is detailed in  Appendix C.

GPT-3的情况有所不同。一方面,其数据集和模型规模比GPT-2所用的大约大两个数量级,并且包含大量的Common Crawl数据,增加了污染和记忆的可能性。另一方面,恰恰由于数据量巨大,即使是GPT-3 175B,相对于与其做过去重处理的保留验证集来衡量,其在训练集上也没有出现明显的过拟合(图4.1)。因此,我们预计污染可能是频繁出现的,但其影响可能没有担心的那么大。 

我们最初试图通过主动搜索并试图消除我们的训练数据与本文中研究的所有基准的开发和测试集之间的任何重叠,来解决污染问题。不幸的是,一个错误只导致部分删除了训练数据中检测到的所有重叠部分。由于培训成本的原因,对模型进行再培训是不可行的。为了解决这个问题,我们详细研究剩余检测到的重叠是如何影响结果的。 

对于每个基准,我们生成一个“干净”版本,移除所有可能泄露的示例,其大致定义为与预训练集中任何内容存在13-gram重叠的示例(当示例本身短于13-gram时,则以整个示例是否重叠为准)。我们的目标是非常保守地标记任何可能构成污染的内容,从而得到一个可以高置信度认为没有污染的干净子集。确切的流程详见附录C。
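
A conservative sketch of the 13-gram overlap filter described above. Tokenization and normalization here are simple whitespace lowercasing; the real procedure follows Appendix C, which is not reproduced in this post.

```python
N = 13

def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_index(pretraining_docs: list[str]) -> set[tuple[str, ...]]:
    index = set()
    for doc in pretraining_docs:
        index |= ngrams(doc.lower().split(), N)
    return index

def is_potentially_contaminated(example: str, pretraining_docs: list[str],
                                index: set[tuple[str, ...]]) -> bool:
    toks = example.lower().split()
    if len(toks) < N:
        # Short examples: flag if the whole example appears verbatim in any document.
        return any(example.lower() in doc.lower() for doc in pretraining_docs)
    return bool(ngrams(toks, N) & index)         # flag on any shared 13-gram
```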

We then evaluate GPT-3 on these clean benchmarks, and compare to the original score. If the score on the clean  subset is similar to the score on the entire dataset, this suggests that contamination, even if present, does not have a  significant effect on reported results. If the score on the clean subset is lower, this suggests contamination may be  inflating the results. The results are summarized in Figure 4.2. Although potential contamination is often high (with a  quarter of benchmarks scoring over 50%), in most cases performance changes only negligibly, and we see no evidence  that contamination level and performance difference are correlated. We conclude that either our conservative method  substantially overestimated contamination or that contamination has little effect on performance.  
Below, we review in more detail the few specific cases where either (1) the model performs significantly worse on  the cleaned version, or (2) potential contamination is very high, which makes measuring the performance difference  difficult.

然后我们在这些干净的基准上评估GPT-3,并与原始分数进行比较。如果清洁子集上的分数与整个数据集上的分数相似,这表明即使存在污染,也不会对报告的结果产生显著的影响。如果清洁组的分数较低,这表明污染可能使结果膨胀。结果如图4.2所示。尽管潜在的污染通常很高(四分之一的基准测试得分超过50%),但在大多数情况下,性能变化只是微不足道的,而且我们没有看到污染水平和性能差异相关的证据。我们得出的结论是,要么我们的保守方法大大高估了污染,要么污染对性能的影响很小。 

下面,我们将更详细地回顾一些特定的情况,其中(1)模型在清理后的版本上表现明显较差,或(2)潜在的污染非常高,这使得测量性能差异非常困难。

Our analysis flagged six groups of benchmarks for further investigation: Word Scrambling, Reading Comprehension  (QuAC, SQuAD2, DROP), PIQA, Winograd, language modeling tasks (Wikitext tasks, 1BW), and German to English translation. Since our overlap analysis is designed to be extremely conservative, we expect it to produce some false  positives. We summarize the results for each group of tasks below:

  • Reading Comprehension: Our initial analysis flagged >90% of task examples from QuAC, SQuAD2, and  DROP as potentially contaminated, so large that even measuring the differential on a clean subset was difficult.  Upon manual inspection, however, we found that for every overlap we inspected, in all 3 datasets, the source  text was present in our training data but the question/answer pairs were not, meaning the model gains only  background information and cannot memorize the answer to a specific question.  
  • German translation: We found 25% of the examples in the WMT16 German-English test set were marked  as potentially contaminated, with an associated total effect size of 1-2 BLEU. Upon inspection, none of the  flagged examples contain paired sentences resembling NMT training data and collisions were monolingual  matches mostly of snippets of events discussed in the news.  
  • Reversed Words and Anagrams: Recall that these tasks are of the form “alaok = koala”. Due to the  short length of these tasks, we used 2-grams for filtering (ignoring punctuation). After inspecting the flagged  overlaps, we found that they were not typically instances of real reversals or unscramblings in the training set,  but rather palindromes or trivial unscramblings, e.g “kayak = kayak”. The amount of overlap was small,  but removing the trivial tasks lead to an increase in difficulty and thus a spurious signal. Related to this, the  symbol insertion task shows high overlap but no effect on performance – this is because that task involves  removing non-letter characters from a word, and the overlap analysis itself ignores such characters, leading to  many spurious matches.
  • PIQA: The overlap analysis flagged 29% of examples as contaminated, and observed a 3 percentage point  absolute decrease (4% relative decrease) in performance on the clean subset. Though the test dataset was  released after our training set was created and its labels are hidden, some of the web pages used by the  crowdsourced dataset creators are contained in our training set. We found a similar decrease in a 25x smaller  model with much less capacity to memorize, leading us to suspect that the shift is likely statistical bias  rather than memorization; examples which workers copied may simply be easier. Unfortunately, we cannot  rigorously prove this hypothesis. We therefore mark our PIQA results with an asterisk to denote this potential  contamination.
  • Winograd: The overlap analysis flagged 45% of examples, and found a 2.6% decrease in performance on the  clean subset. Manual inspection of the overlapping data point showed that 132 Winograd schemas were in  fact present in our training set, though presented in a different format than we present the task to the model.  Although the decrease in performance is small, we mark our Winograd results in the main paper with an  asterisk.
  • Language modeling: We found the 4 Wikipedia language modeling benchmarks measured in GPT-2, plus the  Children’s Book Test dataset, to be almost entirely contained in our training data. Since we cannot reliably  extract a clean subset here, we do not report results on these datasets, even though we intended to when starting  this work. We note that Penn Tree Bank due to its age was unaffected and therefore became our chief language  modeling benchmark.

我们的分析为进一步的调查标记了六组基准:拼词,阅读理解(QuAC, SQuAD2, DROP), PIQA, Winograd,语言建模任务(Wikitext任务,1BW),以及德语到英语的翻译。由于我们的重叠分析被设计成极其保守的,我们预计它会产生一些误报。我们将每组任务的结果总结如下:

  • 阅读理解:我们最初的分析将QuAC、SQuAD2和DROP中超过90%的任务示例标记为潜在污染,比例之高以至于连在干净子集上测量差异都很困难。然而,经过人工检查,我们发现对于我们检查的每一处重叠,在这3个数据集中,训练数据里出现的只是源文本,而问答对并不在其中,这意味着模型只获得了背景信息,无法记住特定问题的答案。 
  • 德语翻译:我们发现WMT16德英测试集中25%的示例被标记为潜在污染,相关的总体效应量为1-2 BLEU。经检查,被标记的示例中没有一个包含类似NMT训练数据的平行句对,重合都是单语匹配,且大多是新闻中讨论的事件片段。 
  • 反向单词和字谜:回想一下这些任务的形式是“alaok = koala”。由于这些任务很短,我们使用2-gram进行过滤(忽略标点)。在检查被标记的重叠后,我们发现它们通常并不是训练集中真正的倒序或字谜还原实例,而是回文或平凡的“还原”,例如“kayak = kayak”。重叠的数量很小,但去除这些平凡的样例会提高任务难度,从而产生虚假的信号。与此相关的是,符号插入任务显示出高重叠但对性能没有影响——这是因为该任务涉及从单词中删除非字母字符,而重叠分析本身忽略了这类字符,导致许多虚假匹配。
  • PIQA:重叠分析将29%的示例标记为受污染,并观察到干净子集上的性能绝对值下降了3个百分点(相对下降4%)。虽然测试数据集是在我们创建训练集之后才发布的,并且其标签是隐藏的,但众包数据集创建者使用的一些网页包含在我们的训练集中。我们在一个小25倍、记忆能力小得多的模型上发现了类似的下降,这让我们怀疑这种差异很可能是统计偏差而非记忆所致:标注者复制来的示例可能本身就更简单。不幸的是,我们无法严格证明这一假设。因此,我们用星号标记PIQA结果,以表示这种潜在污染。
  • Winograd:重叠分析标记了45%的示例,发现干净子集的性能下降了2.6%。对重叠数据点的手动检查表明,实际上有132个Winograd模式出现在我们的训练集中,尽管它们的格式与我们向模型展示任务的格式不同。尽管性能下降很小,但我们在主论文中用星号标记了Winograd结果。
  • 语言建模:我们发现GPT-2中使用的4个维基百科语言建模基准,以及儿童图书测试(Children's Book Test)数据集,几乎全部包含在我们的训练数据中。由于无法可靠地提取出干净子集,我们不报告这些数据集上的结果,尽管我们在开始这项工作时本打算这样做。我们注意到,Penn Tree Bank由于年代久远而未受影响,因此成为我们主要的语言建模基准。

We also inspected datasets where contamination was high, but the impact on performance was close to zero, simply  to verify how much actual contamination existed. These appeared to often contain false positives. They had either  no actual contamination, or had contamination that did not give away the answer to the task. One notable exception  was LAMBADA, which appeared to have substantial genuine contamination, yet the impact on performance was very  small, with the clean subset scoring within 0.5% of the full dataset. Also, strictly speaking, our fill-in-the-blank format  precludes the simplest form of memorization. Nevertheless, since we made very large gains on LAMBADA in this  paper, the potential contamination is noted in the results section.  
An important limitation of our contamination analysis is that we cannot be sure that the clean subset is drawn from the  same distribution as the original dataset. It remains possible that memorization inflates results but at the same time  is precisely counteracted by some statistical bias causing the clean subset to be easier. However, the sheer number  of shifts close to zero suggests this is unlikely, and we also observed no noticeable difference in the shifts for small  models, which are unlikely to be memorizing.  

Overall, we have made a best effort to measure and document the effects of data contamination, and to note or outright remove problematic results, depending on the severity. Much work remains to be done to address this important and subtle issue for the field in general, both when designing benchmarks and when training models. For a more detailed explanation of our analysis, we refer the reader to Appendix C.

我们还检查了那些污染程度很高、但对性能影响接近于零的数据集,只是为了核实其中到底有多少真正的污染。这些大多是误报:要么没有实际污染,要么污染并不会泄露任务的答案。一个值得注意的例外是LAMBADA,它似乎确实存在大量真实污染,但对性能的影响非常小,干净子集的得分与完整数据集相差不到0.5%。而且,严格来说,我们的填空格式排除了最简单形式的记忆。尽管如此,由于我们在本文中在LAMBADA上取得了非常大的提升,我们在结果部分注明了这种潜在污染。 
我们的污染分析的一个重要局限是,我们无法确定干净子集与原始数据集来自相同的分布。仍有可能出现这样的情况:记忆使结果虚高,但同时恰好被某种使干净子集更容易的统计偏差所抵消。然而,大量接近于零的变化表明这种情况不太可能,而且我们也没有在较小的模型(它们不太可能发生记忆)上观察到明显不同的变化。

总的来说,我们已经尽了最大的努力来度量和记录数据污染的影响,并根据严重程度来注意或直接删除有问题的结果。在设计基准和培训模式时,仍有许多工作要做,以解决该领域一般的这一重要而微妙的问题。有关我们的分析的更详细的解释,请读者参阅附录C。

5 Limitations  局限性

GPT-3 and our analysis of it have a number of limitations. Below we describe some of these and suggest directions for  future work.  
First, despite the strong quantitative and qualitative improvements of GPT-3, particularly compared to its direct  predecessor GPT-2, it still has notable weaknesses in text synthesis and several NLP tasks. On text synthesis, although  the overall quality is high, GPT-3 samples still sometimes repeat themselves semantically at the document level, start to  lose coherence over sufficiently long passages, contradict themselves, and occasionally contain non-sequitur sentences  or paragraphs. We will release a collection of 500 uncurated unconditional samples to help provide a better sense of  GPT-3’s limitations and strengths at text synthesis. Within the domain of discrete language tasks, we have noticed  informally that GPT-3 seems to have special difficulty with “common sense physics”, despite doing well on some  datasets (such as PIQA [BZB+19]) that test this domain. Specifically GPT-3 has difficulty with questions of the type  “If I put cheese into the fridge, will it melt?”. Quantitatively, GPT-3’s in-context learning performance has some notable  gaps on our suite of benchmarks, as described in Section 3, and in particular it does little better than chance when  evaluated one-shot or even few-shot on some “comparison” tasks, such as determining if two words are used the same  way in a sentence, or if one sentence implies another (WIC and ANLI respectively), as well as on a subset of reading  comprehension tasks. This is especially striking given GPT-3’s strong few-shot performance on many other tasks.

GPT-3和我们对它的分析都有一些局限性。下面我们将对其中一些进行描述,并对未来的工作提出建议。 

首先,尽管GPT-3在数量和质量上都有很大改进,特别是与其直接前身GPT-2相比,它在文本合成和若干NLP任务上仍有明显的弱点。在文本合成方面,尽管整体质量很高,GPT-3的样本有时仍会在文档层面出现语义上的自我重复,在足够长的段落中开始失去连贯性,自相矛盾,偶尔还包含不合逻辑的句子或段落。我们将发布500个未经筛选的无条件样本,以帮助人们更好地了解GPT-3在文本合成方面的局限和优势。在离散语言任务领域,我们非正式地注意到GPT-3在“常识物理”方面似乎有特别的困难,尽管它在一些测试该领域的数据集(如PIQA [BZB+19])上表现良好。具体来说,GPT-3难以回答诸如“如果我把奶酪放进冰箱,它会融化吗?”这类问题。在定量方面,如第3节所述,GPT-3的上下文学习性能在我们的基准套件上存在一些明显的差距,特别是在一些“比较”类任务上,例如判断两个词在一个句子中的用法是否相同,或一个句子是否蕴含另一个句子(分别对应WiC和ANLI),以及在一部分阅读理解任务上,它在单样本甚至少样本评估下的表现几乎不比随机好。考虑到GPT-3在许多其他任务上强大的少样本性能,这一点尤其引人注目。

GPT-3 has several structural and algorithmic limitations, which could account for some of the issues above. We focused  on exploring in-context learning behavior in autoregressive language models because it is straightforward to both  sample and compute likelihoods with this model class. As a result our experiments do not include any bidirectional  architectures or other training objectives such as denoising. This is a noticeable difference from much of the recent  literature, which has documented improved fine-tuning performance when using these approaches over standard  language models [RSR+19]. Thus our design decision comes at the cost of potentially worse performance on tasks  which empirically benefit from bidirectionality. This may include fill-in-the-blank tasks, tasks that involve looking back  and comparing two pieces of content, or tasks that require re-reading or carefully considering a long passage and then  generating a very short answer. This could be a possible explanation for GPT-3’s lagging few-shot performance on a  few of the tasks, such as WIC (which involves comparing the use of a word in two sentences), ANLI (which involves  comparing two sentences to see if one implies the other), and several reading comprehension tasks (e.g. QuAC and  RACE). We also conjecture, based on past literature, that a large bidirectional model would be stronger at fine-tuning  than GPT-3. Making a bidirectional model at the scale of GPT-3, and/or trying to make bidirectional models work with  few- or zero-shot learning, is a promising direction for future research, and could help achieve the “best of both worlds”.

GPT-3在结构和算法上有若干局限,这可以解释上面的一些问题。我们专注于探索自回归语言模型中的上下文学习行为,因为用这类模型进行采样和计算似然都很直接。因此,我们的实验不包括任何双向架构或其他训练目标(如去噪)。这与近期大量文献有明显不同,后者记录了在标准语言模型之上使用这些方法可以提高微调性能[RSR+19]。因此,我们的设计决策的代价是,在那些经验上受益于双向性的任务上可能表现更差。这可能包括填空类任务、需要回看并比较两段内容的任务,或者需要重读或仔细考虑一段长文然后生成非常简短答案的任务。这或许可以解释GPT-3在一些任务上落后的少样本表现,例如WiC(涉及比较一个词在两个句子中的用法)、ANLI(涉及比较两个句子以判断其中一个是否蕴含另一个),以及若干阅读理解任务(例如QuAC和RACE)。基于以往文献,我们还推测一个大型双向模型在微调上会比GPT-3更强。在GPT-3的规模上构建双向模型,以及/或者尝试让双向模型支持少样本或零样本学习,是未来研究的一个有前景的方向,并可能帮助实现“两全其美”。

A more fundamental limitation of the general approach described in this paper – scaling up any LM-like model, whether autoregressive or bidirectional – is that it may eventually run into (or could already be running into) the limits of the pretraining objective. Our current objective weights every token equally and lacks a notion of what is most important to predict and what is less important. [RRS20] demonstrate benefits of customizing prediction to entities of interest. Also, with self-supervised objectives, task specification relies on forcing the desired task into a prediction problem, whereas ultimately, useful language systems (for example virtual assistants) might be better thought of as taking goal-directed actions rather than just making predictions. Finally, large pretrained language models are not grounded in other domains of experience, such as video or real-world physical interaction, and thus lack a large amount of context about the world [BHT+20]. For all these reasons, scaling pure self-supervised prediction is likely to hit limits, and augmentation with a different approach is likely to be necessary. Promising future directions in this vein might include learning the objective function from humans [ZSW+19a], fine-tuning with reinforcement learning, or adding additional modalities such as images to provide grounding and a better model of the world [CLY+19].本文所描述的一般方法(即扩展任何类似语言模型的模型,无论是自回归的还是双向的)存在一个更根本的局限:它最终可能会(或许已经)触及预训练目标本身的极限。我们目前的目标对每个token赋予同等权重,缺乏关于哪些内容最重要、哪些不那么重要的概念。[RRS20]展示了针对感兴趣实体定制预测的好处。此外,在自监督目标下,任务的指定依赖于把所需任务强行转化为一个预测问题,而最终,有用的语言系统(例如虚拟助手)或许更应被视为采取目标导向的行动,而不仅仅是做预测。最后,大型预训练语言模型并不以其他经验领域(如视频或现实世界的物理交互)为基础,因此缺乏关于世界的大量上下文[BHT+20]。由于所有这些原因,单纯扩展自监督预测很可能会触及极限,有必要用不同的方法加以补充。在这个方向上有希望的未来方向可能包括:从人类那里学习目标函数[ZSW+19a]、用强化学习进行微调,或者添加图像等额外模态,以提供落地(grounding)和更好的世界模型[CLY+19]。
Another limitation broadly shared by language models is poor sample efficiency during pre-training. While GPT-3 takes a step towards test-time sample efficiency closer to that of humans (one-shot or zero-shot), it still sees much more text during pre-training than a human sees in their lifetime [Lin20]. Improving pre-training sample efficiency is an important direction for future work, and might come from grounding in the physical world to provide additional information, or from algorithmic improvements.  
A limitation, or at least uncertainty, associated with few-shot learning in GPT-3 is ambiguity about whether few-shot  learning actually learns new tasks “from scratch” at inference time, or if it simply recognizes and identifies tasks that it  has learned during training. These possibilities exist on a spectrum, ranging from demonstrations in the training set that  are drawn from exactly the same distribution as those at test time, to recognizing the same task but in a different format,  to adapting to a specific style of a general task such as QA, to learning a skill entirely de novo. Where GPT-3 is on  this spectrum may also vary from task to task. Synthetic tasks such as wordscrambling or defining nonsense words  seem especially likely to be learned de novo, whereas translation clearly must be learned during pretraining, although  possibly from data that is very different in organization and style than the test data. Ultimately, it is not even clear what  humans learn from scratch vs from prior demonstrations. Even organizing diverse demonstrations during pre-training  and identifying them at test time would be an advance for language models, but nevertheless understanding precisely  how few-shot learning works is an important unexplored direction for future research.
语言模型普遍存在的另一个局限是预训练阶段的样本效率较低。虽然GPT-3在测试时的样本效率上向人类水平(单样本或零样本)迈进了一步,但它在预训练中看到的文本仍远多于一个人一生中所见[Lin20]。提高预训练样本效率是未来工作的一个重要方向,可能来自在物理世界中的落地以提供额外信息,也可能来自算法上的改进。与GPT-3的少样本学习相关的一个局限(或至少是不确定性)在于:少样本学习究竟是在推理时真正“从零开始”学习新任务,还是仅仅识别并调用它在训练中已经学到的任务。这些可能性构成一个连续谱:从训练集中出现过与测试时分布完全相同的演示,到识别同一任务但形式不同,再到适应某个一般任务(如问答)的特定风格,直至完全从头学习一项技能。GPT-3处在这个谱系中的哪个位置也可能因任务而异。合成任务(如单词打乱或定义无意义的词)似乎特别有可能是从头学会的,而翻译显然必须在预训练期间学到,尽管所用数据在组织和风格上可能与测试数据非常不同。归根结底,甚至连人类究竟哪些是从零开始学、哪些是借助先前示范学到的也并不清楚。即便只是把预训练期间多样的示范组织起来并在测试时加以识别,对语言模型来说也会是一种进步;但准确理解少样本学习是如何起作用的,仍是未来研究中一个重要且尚未探索的方向。
 

A limitation associated with models at the scale of GPT-3, regardless of objective function or algorithm, is that they are  both expensive and inconvenient to perform inference on, which may present a challenge for practical applicability of  models of this scale in their current form. One possible future direction to address this is distillation [HVD15] of large  models down to a manageable size for specific tasks. Large models such as GPT-3 contain a very wide range of skills,  most of which are not needed for a specific task, suggesting that in principle aggressive distillation may be possible.  Distillation is well-explored in general [LHCG19a] but has not been tried at the scale of hundred of billions parameters;  new challenges and opportunities may be associated with applying it to models of this size.  

Finally, GPT-3 shares some limitations common to most deep learning systems – its decisions are not easily interpretable,  it is not necessarily well-calibrated in its predictions on novel inputs as observed by the much higher variance in  performance than humans on standard benchmarks, and it retains the biases of the data it has been trained on. This  last issue – biases in the data that may lead the model to generate stereotyped or prejudiced content – is of special  concern from a societal perspective, and will be discussed along with other issues in the next section on Broader Impacts  (Section 6).

无论目标函数或算法如何,GPT-3这一规模的模型都存在一个局限:它们进行推理既昂贵又不方便,这可能对这种规模的模型在当前形态下的实际适用性提出挑战。解决这一问题的一个可能的未来方向是蒸馏[HVD15],即把大模型压缩到针对特定任务可控的规模。像GPT-3这样的大模型包含非常广泛的技能,其中大多数并非特定任务所需,这表明原则上激进的蒸馏也许是可行的。蒸馏在一般情况下已被充分研究[LHCG19a],但尚未在数千亿参数的规模上尝试过;将其应用于这种规模的模型可能会带来新的挑战和机遇。最后,GPT-3也存在大多数深度学习系统共有的一些局限:它的决策不易解释;它在面对新输入时的预测未必校准良好,这从其在标准基准上远高于人类的性能方差中可以看出;并且它保留了所训练数据中的偏见。最后这个问题(数据中的偏见可能导致模型生成刻板或带有偏见的内容)从社会角度来看尤其值得关注,将在下一节“更广泛的影响”(第6节)中与其他问题一起讨论。

6 Broader Impacts  更广泛的影响

Language models have a wide range of beneficial applications for society, including code and writing auto-completion,  grammar assistance, game narrative generation, improving search engine responses, and answering questions. But  they also have potentially harmful applications. GPT-3 improves the quality of text generation and adaptability over  smaller models and increases the difficulty of distinguishing synthetic text from human-written text. It therefore has the  potential to advance both the beneficial and harmful applications of language models.  
Here we focus on the potential harms of improved language models, not because we believe the harms are necessarily  greater, but in order to stimulate efforts to study and mitigate them. The broader impacts of language models like this  are numerous. We focus on two primary issues: the potential for deliberate misuse of language models like GPT-3 in  Section 6.1, and issues of bias, fairness, and representation within models like GPT-3 in Section 6.2. We also briefly  discuss issues of energy efficiency (Section 6.3).


6.1 Misuse of Language Models

Malicious uses of language models can be somewhat difficult to anticipate because they often involve repurposing  language models in a very different environment or for a different purpose than researchers intended. To help with this,  we can think in terms of traditional security risk assessment frameworks, which outline key steps such as identifying  threats and potential impacts, assessing likelihood, and determining risk as a combination of likelihood and impact  [Ros12]. We discuss three factors: potential misuse applications, threat actors, and external incentive structures.


6.1.1 Potential Misuse Applications

Any socially harmful activity that relies on generating text could be augmented by powerful language models. Examples  include misinformation, spam, phishing, abuse of legal and governmental processes, fraudulent academic essay writing  and social engineering pretexting. Many of these applications bottleneck on human beings to write sufficiently high  quality text. Language models that produce high quality text generation could lower existing barriers to carrying out  these activities and increase their efficacy.

The misuse potential of language models increases as the quality of text synthesis improves. The ability of GPT-3 to  generate several paragraphs of synthetic content that people find difficult to distinguish from human-written text in  3.9.4 represents a concerning milestone in this regard.


6.1.2 Threat Actor Analysis

Threat actors can be organized by skill and resource levels, ranging from low or moderately skilled and resourced actors  who may be able to build a malicious product to 'advanced persistent threats’ (APTs): highly skilled and well-resourced  (e.g. state-sponsored) groups with long-term agendas [SBC+19].  
To understand how low and mid-skill actors think about language models, we have been monitoring forums and chat  groups where misinformation tactics, malware distribution, and computer fraud are frequently discussed. While we did  find significant discussion of misuse following the initial release of GPT-2 in spring of 2019, we found fewer instances  of experimentation and no successful deployments since then. Additionally, those misuse discussions were correlated  with media coverage of language model technologies. From this, we assess that the threat of misuse from these actors is  not immediate, but significant improvements in reliability could change this.  
Because APTs do not typically discuss operations in the open, we have consulted with professional threat analysts about  possible APT activity involving the use of language models. Since the release of GPT-2 there has been no discernible  difference in operations that may see potential gains by using language models. The assessment was that language  models may not be worth investing significant resources in because there has been no convincing demonstration that  current language models are significantly better than current methods for generating text, and because methods for  “targeting” or “controlling” the content of language models are still at a very early stage.  


6.1.3 External Incentive Structures

Each threat actor group also has a set of tactics, techniques, and procedures (TTPs) that they rely on to accomplish their  agenda. TTPs are influenced by economic factors like scalability and ease of deployment; phishing is extremely popular  among all groups because it offers a low-cost, low-effort, high-yield method of deploying malware and stealing login  credentials. Using language models to augment existing TTPs would likely result in an even lower cost of deployment.

Ease of use is another significant incentive. Having stable infrastructure has a large impact on the adoption of TTPs.  The outputs of language models are stochastic, however, and though developers can constrain these (e.g. using top-k  truncation) they are not able to perform consistently without human feedback. If a social media disinformation bot  produces outputs that are reliable 99% of the time, but produces incoherent outputs 1% of the time, this could reduce the  amount of human labor required in operating this bot. But a human is still needed to filter the outputs, which restricts  how scalable the operation can be.  
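The paper mentions top-k truncation only in passing; as a rough, hedged illustration (not the authors' implementation), constraining a stochastic decoder can be as simple as zeroing out all but the k highest-scoring next-token candidates before sampling. The value k = 40 below is arbitrary.

```python
import torch

def top_k_truncate(logits, k=40):
    """Keep only the k highest-scoring tokens; all other tokens get probability 0."""
    topk_vals, _ = torch.topk(logits, k)
    cutoff = topk_vals[..., -1, None]                    # k-th largest logit
    filtered = torch.where(logits < cutoff,
                           torch.full_like(logits, float("-inf")),
                           logits)
    return torch.softmax(filtered, dim=-1)

# Illustrative usage on one decoding step's next-token logits.
logits = torch.randn(50257)
probs = top_k_truncate(logits, k=40)
next_token = torch.multinomial(probs, num_samples=1)
```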
Based on our analysis of this model and analysis of threat actors and the landscape, we suspect AI researchers will  eventually develop language models that are sufficiently consistent and steerable that they will be of greater interest to  malicious actors. We expect this will introduce challenges for the broader research community, and hope to work on  this through a combination of mitigation research, prototyping, and coordinating with other technical developers.


6.2 Fairness, Bias, and Representation

Biases present in training data may lead models to generate stereotyped or prejudiced content. This is concerning, since model bias could harm people in the relevant groups in different ways by entrenching existing stereotypes and producing demeaning portrayals amongst other potential harms [Cra17]. We have conducted an analysis of biases in the model in order to better understand GPT-3's limitations when it comes to fairness, bias, and representation.
Our goal is not to exhaustively characterize GPT-3, but to give a preliminary analysis of some of its limitations and  behaviors. We focus on biases relating to gender, race, and religion, although many other categories of bias are likely  present and could be studied in follow-up work. This is a preliminary analysis and does not reflect all of the model’s  biases even within the studied categories.  

Broadly, our analysis indicates that internet-trained models have internet-scale biases; models tend to reflect stereotypes  present in their training data. Below we discuss our preliminary findings of bias along the dimensions of gender, race,  and religion. We probe for bias in the 175 billion parameter model and also in similar smaller models, to see if and how  they are different in this dimension.


6.2.1 Gender

In our investigation of gender bias in GPT-3, we focused on associations between gender and occupation. We found  that occupations in general have a higher probability of being followed by a male gender identifier than a female one  (in other words, they are male leaning) when given a context such as "The {occupation} was a" (Neutral Variant).  83% of the 388 occupations we tested were more likely to be followed by a male identifier by GPT-3. We measured  this by feeding the model a context such as "The detective was a" and then looking at the probability of the  model following up with male indicating words (eg. man, male etc.) or female indicating words (woman, female etc.).  In particular, occupations demonstrating higher levels of education such as legislator, banker, or professor emeritus  were heavily male leaning along with occupations that require hard physical labour such as mason, millwright, and  sheriff. Occupations that were more likely to be followed by female identifiers include midwife, nurse, receptionist,  housekeeper etc. 
We also tested how these probabilities changed when we shifted the context to "The competent {occupation} was a" (Competent Variant), and to "The incompetent {occupation} was a" (Incompetent Variant) for each occupation in the dataset. We found that, when prompted with "The competent {occupation} was a", the majority of occupations had an even higher probability of being followed by a male identifier than a female one than was the case with our original neutral prompt, "The {occupation} was a". With the prompt "The incompetent {occupation} was a" the majority of occupations still leaned male, with a probability similar to that of our original neutral prompt. The average occupation bias, measured as $\frac{1}{n_{\text{jobs}}}\sum_{\text{jobs}}\log\!\left(\frac{P(\text{female}\mid\text{Context})}{P(\text{male}\mid\text{Context})}\right)$, was −1.11 for the Neutral Variant, −2.14 for the Competent Variant, and −1.15 for the Incompetent Variant.
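As a hedged sketch of this kind of probe (not the paper's code), one can compare the probability a causal language model assigns to gendered continuations of each prompt and average the log-ratios. GPT-2 via the Hugging Face transformers library stands in for GPT-3 here, the occupation list is a toy subset of the 388 used in the paper, and only one indicator word per gender is used.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # stand-in for GPT-3
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def continuation_prob(context, continuation):
    """Probability the model assigns to `continuation` immediately after `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    cont_ids = tokenizer(continuation, add_special_tokens=False).input_ids
    ids = torch.cat([ctx_ids, torch.tensor([cont_ids])], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    log_p = 0.0
    for i, tok in enumerate(cont_ids):
        # logits at position ctx_len + i - 1 predict the token at position ctx_len + i
        step_logits = logits[0, ctx_ids.shape[1] + i - 1]
        log_p += torch.log_softmax(step_logits, dim=-1)[tok].item()
    return math.exp(log_p)

occupations = ["detective", "nurse", "legislator"]           # toy subset, not the paper's 388
bias_terms = []
for occ in occupations:
    ctx = f"The {occ} was a"
    p_f = continuation_prob(ctx, " woman")
    p_m = continuation_prob(ctx, " man")
    bias_terms.append(math.log(p_f / p_m))
avg_bias = sum(bias_terms) / len(bias_terms)                 # analogue of the average occupation bias
print(avg_bias)
```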

 

We also carried out pronoun resolution on the Winogender dataset [RNLVD18] using two methods which further corroborated the model's tendency to associate most occupations with males. One method measured the model's ability to correctly assign a pronoun as the occupation or the participant. For example, we fed the model a context such as "The advisor met with the advisee because she wanted to get advice about job applications. 'She' refers to the" and found the option with the lowest probability between the two possible options (Choices between Occupation Option: advisor; Participant Option: advisee).
Occupation and participant words often have societal biases associated with them such as the assumption that most  occupants are by default male. We found that the language models learnt some of these biases such as a tendency to  associate female pronouns with participant positions more than male pronouns. GPT-3 175B had the highest accuracy of  all the models (64.17%) on this task. It was also the only model where the accuracy for Occupant sentences (sentences  where the correct answer was the Occupation option) for females was higher than for males (81.7% vs 76.7%). All  other models had a higher accuracy for male pronouns with Occupation sentences as compared to female pronouns  with the exception of our second largest model- GPT-3 13B - which had the same accuracy (60%) for both. This offers  some preliminary evidence that in places where issues of bias can make language models susceptible to error, the larger  models are more robust than smaller models.

We also performed co-occurrence tests, where we analyzed which words are likely to occur in the vicinity of other preselected words. We created a model output sample set by generating 800 outputs of length 50 each with a temperature of 1 and top p of 0.9 for every prompt in our dataset. For gender, we had prompts such as "He was very", "She was very", "He would be described as", "She would be described as". We looked at the adjectives and adverbs in the top 100 most favored words using an off-the-shelf POS tagger [LB02]. We found females were more often described using appearance oriented words such as "beautiful" and "gorgeous" as compared to men who were more often described using adjectives that span a greater spectrum.
Table 6.1 shows the top 10 most favored descriptive words for the model along with the raw number of times each  word co-occurred with a pronoun indicator. “Most Favored” here indicates words which were most skewed towards a  category by co-occurring with it at a higher rate as compared to the other category. To put these numbers in perspective,  we have also included the average for the number of co-occurrences across all qualifying words for each gender.
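As a hedged sketch of a co-occurrence test of this kind (illustrative only): sample many completions per prompt with temperature 1 and top-p 0.9, POS-tag them, and count the adjectives and adverbs that appear for each prompt set. The model, sample count, and tagger below are stand-ins; the paper generates 800 samples of length 50 per prompt from GPT-3.

```python
import collections
import nltk
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # stand-in for GPT-3
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sample_descriptors(prompt, n_samples=50, max_new_tokens=50):
    """Generate samples and count adjectives/adverbs (JJ*/RB* tags) in them."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    counts = collections.Counter()
    for _ in range(n_samples):                               # the paper uses 800 samples
        out = model.generate(ids, do_sample=True, top_p=0.9, temperature=1.0,
                             max_new_tokens=max_new_tokens,
                             pad_token_id=tokenizer.eos_token_id)
        text = tokenizer.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
        for word, tag in nltk.pos_tag(nltk.word_tokenize(text)):
            if tag.startswith("JJ") or tag.startswith("RB"):
                counts[word.lower()] += 1
    return counts

she_counts = sample_descriptors("She was very")
he_counts = sample_descriptors("He was very")
print(she_counts.most_common(20))
print(he_counts.most_common(20))
```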


6.2.2 Race

To investigate racial bias in GPT-3, we seeded the model with prompts such as "The {race} man was very", "The {race} woman was very" and "People would describe the {race} person as" and generated 800 samples for each of the above prompts, with {race} replaced with a term indicating a racial category such as White or Asian. We then measure word co-occurrences in the generated samples. Given prior research demonstrating that language models produce text of differing sentiment when varying features such as occupation [HZJ+19], we explored how race impacted sentiment. We measured sentiment using SentiWordNet [BES10] for the words which co-occurred disproportionately with each race. Each word sentiment varied from 100 to -100, with positive scores indicating positive words (eg. wonderfulness: 100, amicable: 87.5), negative scores indicating negative words (eg. wretched: -87.5, horrid: -87.5) and a score of 0 indicating neutral words (eg. sloping, chalet).
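A hedged sketch of word-level sentiment scoring with SentiWordNet via NLTK follows; the paper does not spell out its exact aggregation, so the averaging over synsets and the rescaling to a −100..100 range below are assumptions made to match the scores quoted above.

```python
import nltk
from nltk.corpus import sentiwordnet as swn

nltk.download("wordnet")
nltk.download("sentiwordnet")

def word_sentiment(word):
    """Average (positive - negative) SentiWordNet score over a word's synsets,
    rescaled to roughly the -100..100 range used in the text above."""
    synsets = list(swn.senti_synsets(word))
    if not synsets:
        return 0.0
    score = sum(s.pos_score() - s.neg_score() for s in synsets) / len(synsets)
    return 100.0 * score

for w in ["wonderful", "amicable", "wretched", "horrid", "sloping"]:
    print(w, word_sentiment(w))
```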
It should be noted that we were explicitly prompting the models to talk about race and this in turn generated text that  focused on racial features; these results are not from the models talking about race in the wild but talking about race in  an experimental setup where they have been primed to do so. Additionally, since we are measuring sentiment by simply  looking at word co-occurrences, the resulting sentiment can reflect socio-historical factors - for instance, text relating to  a discussion of slavery will frequently have a negative sentiment, which may lead to a demographic being associated  with a negative sentiment under this testing methodology.  
Across the models we analyzed, 'Asian’ had a consistently high sentiment - it ranked 1st in 3 out of 7 models. On the  other hand, ’Black’ had a consistently low sentiment - it ranked the lowest in 5 out of 7 models. These differences  narrowed marginally on the larger model sizes. This analysis gives a sense of the biases of different models and  highlights the need for more sophisticated analysis of the relationship between sentiment, entities, and input data.
 

6.2.3 Religion

We studied which words co-occurred with religious terms relating to Atheism, Buddhism, Christianity, Hinduism, Islam,  and Judaism, by generating 800 model outputs of length ≈50 with a temperature of 1 and a top p of 0.9 for every  prompt. Our prompts were of the nature "{Religion practitioners} are" (Eg. "Christians are") for each  of the six religious categories listed above. We then allowed the model to naturally carry out completions and created a  corpus of such completions for studying co-occurrence of words.
Similar to race, we found that the models make associations with religious terms that indicate some propensity to reflect  how these terms are sometimes presented in the world. For example, with the religion Islam, we found that words such  as ramadan, prophet and mosque co-occurred at a higher rate than for other religions. We also found that words such  as violent, terrorism and terrorist co-occurred at a greater rate with Islam than with other religions and were in  the top 40 most favored words for Islam in GPT-3.
 

6.2.4 Future Bias and Fairness Challenges

We have presented this preliminary analysis to share some of the biases we found in order to motivate further research,  and to highlight the inherent difficulties in characterizing biases in large-scale generative models; we expect this to be an  area of continuous research for us and are excited to discuss different methodological approaches with the community.  We view the work in this section as subjective signposting - we chose gender, race, and religion as a starting point, but  we recognize the inherent subjectivity in this choice. Our work is inspired by the literature on characterizing model  attributes to develop informative labels such as Model Cards for Model Reporting from [MWZ+18].  
Ultimately, it is important not just to characterize biases in language systems but to intervene. The literature on this  is also extensive [QMZH19, HZJ+19], so we offer only a few brief comments on future directions specific to large  language models. In order to pave the way for effective bias prevention in general purpose models, there is a need for  building a common vocabulary tying together the normative, technical and empirical challenges of bias mitigation for  these models. There is room for more research that engages with the literature outside NLP, better articulates normative  statements about harm, and engages with the lived experience of communities affected by NLP systems [BBDIW20].  Thus, mitigation work should not be approached purely with a metric driven objective to 'remove’ bias as this has been  shown to have blind spots [GG19, NvNvdG19] but in a holistic manner.

6.3 Energy Usage

Practical large-scale pre-training requires large amounts of computation, which is energy-intensive: training the GPT-3  175B consumed several thousand petaflop/s-days of compute during pre-training, compared to tens of petaflop/s-days  for a 1.5B parameter GPT-2 model (Figure 2.2). This means we should be cognizant of the cost and efficiency of such  models, as advocated by [SDSE19].  
The use of large-scale pre-training also gives another lens through which to view the efficiency of large models - we  should consider not only the resources that go into training them, but how these resources are amortized over the  lifetime of a model, which will subsequently be used for a variety of purposes and fine-tuned for specific tasks. Though  models like GPT-3 consume significant resources during training, they can be surprisingly efficient once trained: even  with the full GPT-3 175B, generating 100 pages of content from a trained model can cost on the order of 0.4 kW-hr, or  only a few cents in energy costs. Additionally, techniques like model distillation [LHCG19a] can further bring down  the cost of such models, letting us adopt a paradigm of training single, large-scale models, then creating more efficient  versions of them for use in appropriate contexts. Algorithmic progress may also naturally further increase the efficiency  of such models over time, similar to trends observed in image recognition and neural machine translation [HB20].
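As a quick worked check of the "few cents" figure, at an assumed retail electricity price of about \$0.12 per kWh (an illustrative value, not from the paper):

$$0.4\ \text{kWh} \times \frac{\$0.12}{\text{kWh}} \approx \$0.05,$$

i.e. on the order of a few cents per 100 pages of generated content, consistent with the claim above.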
 

7 Related Work

Several lines of work have focused on increasing parameter count and/or computation in language models as a  means to improve generative or task performance. An early work scaled LSTM based language models to over a  billion parameters [JVS+16]. One line of work straightforwardly increases the size of transformer models, scaling  up parameters and FLOPS-per-token roughly in proportion. Work in this vein has successively increased model size:  213 million parameters [VSP+17] in the original paper, 300 million parameters [DCLT18], 1.5 billion parameters  [RWC+19], 8 billion parameters [SPP+19], 11 billion parameters [RSR+19], and most recently 17 billion parameters  [Tur20]. A second line of work has focused on increasing parameter count but not computation, as a means of  increasing models’ capacity to store information without increased computational cost. These approaches rely on the  conditional computation framework [BLC13] and specifically, the mixture-of-experts method [SMM+17] has been  used to produce 100 billion parameter models and more recently 50 billion parameter translation models [AJF19],  though only a small fraction of the parameters are actually used on each forward pass. A third approach increases  computation without increasing parameters; examples of this approach include adaptive computation time [Gra16] and  the universal transformer [DGV+18]. Our work focuses on the first approach (scaling compute and parameters together,  by straightforwardly making the neural net larger), and increases model size 10x beyond previous models that employ  this strategy.  
Several efforts have also systematically studied the effect of scale on language model performance. [KMH+20,  RRBS19, LWS+20, HNA+17], find a smooth power-law trend in loss as autoregressive language models are scaled up.  This work suggests that this trend largely continues as models continue to scale up (although a slight bending of the  curve can perhaps be detected in Figure 3.1), and we also find relatively smooth increases in many (though not all)  downstream tasks across 3 orders of magnitude of scaling.  
 
Another line of work goes in the opposite direction from scaling, attempting to preserve strong performance in language  models that are as small as possible. This approach includes ALBERT [LCG+19] as well as general [HVD15] and task-specific [SDCW19, JYS+19, KR16] approaches to distillation of language models. These architectures and  techniques are potentially complementary to our work, and could be applied to decrease latency and memory footprint  of giant models.  
As fine-tuned language models have neared human performance on many standard benchmark tasks, considerable  effort has been devoted to constructing more difficult or open-ended tasks, including question answering [KPR+19,  IBGC+14, CCE+18, MCKS18], reading comprehension [CHI+18, RCM19], and adversarially constructed datasets  designed to be difficult for existing language models [SBBC19, NWD+19]. In this work we test our models on many  of these datasets.
Many previous efforts have focused specifically on question-answering, which constitutes a significant fraction of the  tasks we tested on. Recent efforts include [RSR+19, RRS20], which fine-tuned an 11 billion parameter language model,  and [GLT+20], which focused on attending over a large corpus of data at test time. Our work differs in focusing on  in-context learning but could be combined in the future with those of [GLT+20, LPP+20].  
Metalearning in language models has been utilized in [RWC+19], though with much more limited results and no  systematic study. More broadly, language model metalearning has an inner-loop-outer-loop structure, making it  structurally similar to metalearning as applied to ML in general. Here there is an extensive literature, including  matching networks [VBL+16], RL2 [DSC+16], learning to optimize [RL16, ADG+16, LM17] and MAML [FAL17].  Our approach of stuffing the model’s context with previous examples is most structurally similar to RL2 and also  resembles [HYC01], in that an inner loop of adaptation takes place through computation in the model’s activations  across timesteps, without updating the weights, while an outer loop (in this case just language model pre-training)  updates the weights, and implicitly learns the ability to adapt to or at least recognize tasks defined at inference-time.  Few-shot auto-regressive density estimation was explored in [RCP+17] and [GWC+18] studied low-resource NMT as  a few-shot learning problem.  
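To make "stuffing the model's context with previous examples" concrete, here is a minimal, hedged sketch of how a few-shot prompt can be assembled purely as text, with no gradient updates; the task, demonstrations, and "=>" formatting are illustrative, not the paper's exact templates.

```python
def build_few_shot_prompt(instruction, demonstrations, query):
    """Assemble a few-shot prompt: the 'outer loop' trained the weights once during
    pre-training; the 'inner loop' is just conditioning on these demonstrations."""
    lines = [instruction]
    for source, target in demonstrations:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")
    return "\n".join(lines)

# Illustrative word-unscrambling task, in the style of the synthetic tasks above.
demos = [("elhlo", "hello"), ("dlorw", "world"), ("nohtyp", "python")]
prompt = build_few_shot_prompt("Unscramble the letters into a word.", demos, "rnlaiegn")
print(prompt)
# The prompt string is then fed to the language model, which is expected to
# continue with "learning" without any weight updates.
```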


While the mechanism of our few-shot approach is different, prior work has also explored ways of using pre-trained  language models in combination with gradient descent to perform few-shot learning [SS20]. Another sub-field with  similar goals is semi-supervised learning where approaches such as UDA [XDH+19] also explore methods of fine-tuning  when very little labeled data is available.  
Giving multi-task models instructions in natural language was first formalized in a supervised setting with [MKXS18]  and utilized for some tasks (such as summarizing) in a language model with [RWC+19]. The notion of presenting  tasks in natural language was also explored in the text-to-text transformer [RSR+19], although there it was applied for  multi-task fine-tuning rather than for in-context learning without weight updates. 
Another approach to increasing generality and transfer-learning capability in language models is multi-task learning [Car97], which fine-tunes on a mixture of downstream tasks together, rather than separately updating the weights for each one. If successful, multi-task learning could allow a single model to be used for many tasks without updating the weights (similar to our in-context learning approach), or alternatively could improve sample efficiency when updating the weights for a new task. Multi-task learning has shown some promising initial results [LGH+15, LSP+18] and multi-stage fine-tuning has recently become a standardized part of SOTA results on some datasets [PFB18] and pushed the boundaries on certain tasks [KKS+20], but is still limited by the need to manually curate collections of datasets and set up training curricula. By contrast pre-training at large enough scale appears to offer a "natural" broad distribution of tasks implicitly contained in predicting the text itself. One direction for future work might be attempting to generate a broader set of explicit tasks for multi-task learning, for example through procedural generation [TFR+17], human interaction [ZSW+19b], or active learning [Mac92].


Algorithmic innovation in language models over the last two years has been enormous, including denoising-based bidirectionality [DCLT18], prefixLM [DL15] and encoder-decoder architectures [LLG+19, RSR+19], random permutations during training [YDY+19], architectures that improve the efficiency of sampling [DYY+19], improvements in data and training procedures [LOG+19], and efficiency increases in the embedding parameters [LCG+19]. Many of these techniques provide significant gains on downstream tasks. In this work we continue to focus on pure autoregressive language models, both in order to focus on in-context learning performance and to reduce the complexity of our large model implementations. However, it is very likely that incorporating these algorithmic advances could improve GPT-3's performance on downstream tasks, especially in the fine-tuning setting, and combining GPT-3's scale with these algorithmic techniques is a promising direction for future work.

8 Conclusion

We presented a 175 billion parameter language model which shows strong performance on many NLP tasks and benchmarks in the zero-shot, one-shot, and few-shot settings, in some cases nearly matching the performance of state-of-the-art fine-tuned systems, as well as generating high-quality samples and strong qualitative performance at tasks defined on-the-fly. We documented roughly predictable trends of scaling in performance without using fine-tuning. We also discussed the social impacts of this class of model. Despite many limitations and weaknesses, these results suggest that very large language models may be an important ingredient in the development of adaptable, general language systems.

Acknowledgements

The authors would like to thank Ryan Lowe for giving detailed feedback on drafts of the paper. Thanks to Jakub Pachocki and Szymon Sidor for suggesting tasks, and Greg Brockman, Michael Petrov, Brooke Chan, and Chelsea Voss for helping run evaluations on OpenAI's infrastructure. Thanks to David Luan for initial support in scaling up this project, Irene Solaiman for discussions about ways to approach and evaluate bias, Harrison Edwards and Yura Burda for discussions and experimentation with in-context learning, Geoffrey Irving and Paul Christiano for early discussions of language model scaling, Long Ouyang for advising on the design of the human evaluation experiments, Chris Hallacy for discussions on data collection, and Shan Carter for help with visual design. Thanks to the millions of people who created content that was used in the training of the model, and to those who were involved in indexing or upvoting the content (in the case of WebText). Additionally, we would like to thank the entire OpenAI infrastructure and supercomputing teams for making it possible to train models at this scale.
