Google's DeepMind Announces It Can Predict Protein Structures, a Major Breakthrough in Biology (Full Text)

AndLib 2020-12-02

Google's DeepMind says it can accurately predict a protein's structure from its amino acid sequence, a major breakthrough in biology.

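LONDON: Alphabet-owned DeepMind has developed artificial intelligence software that can accurately predict, within a matter of days, the structure a protein will fold into. The work solves a 50-year-old “grand challenge” and could pave the way for a better understanding of diseases and for drug discovery.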

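Every living cell contains thousands of different proteins that keep it alive and well. Predicting the shape a protein folds into matters because that shape determines the protein's function, and nearly all diseases, including cancer and dementia, are related to how proteins function.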

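“Proteins are the most beautiful, gorgeous structures, and the ability to predict exactly how they fold up is really very, very challenging and has occupied many people over many years,” Professor Janet Thornton of the European Bioinformatics Institute told reporters on a call.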

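AlphaFold, the AI system built by the British research lab DeepMind, entered a competition organized by a group called CASP (Critical Assessment of Structure Prediction), a community experiment whose mission is to accelerate solutions to one problem: how to compute the three-dimensional structure of protein molecules. CASP has monitored progress in the field for the past 25 years, comparing entries against an experimental “gold standard.” On Monday, the organization said DeepMind's AlphaFold system had achieved unparalleled accuracy in protein structure prediction.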

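“DeepMind has leaped ahead,” Professor John Moult, the chair of CASP, said at a press briefing ahead of the announcement. “A 50-year-old grand challenge in computer science has been to a large degree solved.”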

Moult added: “There are some significant implications for drug design and for the emerging field of protein design.”

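With roughly 1,000 employees and very little revenue, DeepMind has become an expensive company for Alphabet, Google's parent, to support. It has nevertheless emerged as one of the leaders in the global AI race, alongside Facebook AI Research, Microsoft and OpenAI.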

Google CEO Sundar Pichai welcomed the breakthrough on Twitter.

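“The ultimate vision behind DeepMind has always been to build general AI, and then to use it to help us better understand the world around us by greatly accelerating the pace of scientific discovery,” DeepMind co-founder and CEO Demis Hassabis said on the call.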

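Google acquired the company in 2014 for $600 million. DeepMind is best known for developing AI systems that can play games such as Space Invaders and the ancient Chinese board game Go. It has always said, however, that it wants to have a greater scientific impact.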

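“Games are a great proving ground for efficiently developing and testing general algorithms that we hope one day to transfer to real-world domains, such as scientific problems,” Hassabis said. “We think AlphaFold is the first proof point of this thesis. These algorithms are now mature and powerful enough to be applied to really challenging scientific problems.”

DeepMind also took part in the CASP protein folding competition in 2018. Although the results at the time were impressive, John Jumper, who leads AlphaFold at DeepMind, said the team knew it still had some way to go before producing something with “really strong biological relevance or being competitive with experiment.”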

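This year's competition was not all smooth sailing, however. Jumper said DeepMind went three months without making any progress. “We were sitting there worrying whether we had exhausted the data,” he said.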

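Even as the competition deadline approached, Jumper and his team still worried that they might have made a mistake. “There are always bugs in machine learning systems,” he said.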

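But their efforts appear to have paid off. “We really think we have built a system that gives experimental biologists correct and actionable information,” he said. “The reason you have a structure is to understand something about the natural world and then ask more questions. We think we have built a system that will really help people do that.”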

The full text of DeepMind's announcement follows (the English original is authoritative).

Proteins are essential to life, supporting practically all its functions. They are large complex molecules, made up of chains of amino acids, and what a protein does largely depends on its unique 3D structure. Figuring out what shapes proteins fold into is known as the “protein folding problem”, and has stood as a grand challenge in biology for the past 50 years. In a major scientific advance, the latest version of our AI system AlphaFold has been recognised as a solution to this grand challenge by the organisers of the biennial Critical Assessment of protein Structure Prediction (CASP). This breakthrough demonstrates the impact AI can have on scientific discovery and its potential to dramatically accelerate progress in some of the most fundamental fields that explain and shape our world.

A protein’s shape is closely linked with its function, and the ability to predict this structure unlocks a greater understanding of what it does and how it works. Many of the world’s greatest challenges, like developing treatments for diseases or finding enzymes that break down industrial waste, are fundamentally tied to proteins and the role they play.

We have been stuck on this one problem – how do proteins fold up – for nearly 50 years. To see DeepMind produce a solution for this, having worked personally on this problem for so long and after so many stops and starts, wondering if we’d ever get there, is a very special moment.

PROFESSOR JOHN MOULT

CO-FOUNDER AND CHAIR OF CASP, UNIVERSITY OF MARYLAND

This has been a focus of intensive scientific research for many years, using a variety of experimental techniques to examine and determine protein structures, such as nuclear magnetic resonance and X-ray crystallography. These techniques, as well as newer methods like cryo-electron microscopy, depend on extensive trial and error, which can take years of painstaking and laborious work per structure, and require the use of multi-million dollar specialised equipment.

The ‘protein folding problem’

In his acceptance speech for the 1972 Nobel Prize in Chemistry, Christian Anfinsen famously postulated that, in theory, a protein’s amino acid sequence should fully determine its structure. This hypothesis sparked a five-decade quest to computationally predict a protein’s 3D structure based solely on its 1D amino acid sequence, as a complementary alternative to these expensive and time-consuming experimental methods. A major challenge, however, is that the number of ways a protein could theoretically fold before settling into its final 3D structure is astronomical. In 1969 Cyrus Levinthal noted that it would take longer than the age of the known universe to enumerate all possible configurations of a typical protein by brute force calculation – Levinthal estimated 10^300 possible conformations for a typical protein. Yet in nature, proteins fold spontaneously, some within milliseconds – a dichotomy sometimes referred to as Levinthal’s paradox.
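
Levinthal's argument can be checked with rough arithmetic. The sketch below takes the 10^300 figure quoted above and assumes a sampling rate of 10^13 conformations per second and a universe age of about 13.8 billion years. Neither assumption comes from the announcement, and the sampling rate in particular is only an illustrative guess.

```python
# Back-of-the-envelope check of Levinthal's argument.
# Assumptions (not from the announcement): a hypothetical search that visits
# 10**13 conformations per second, and a universe about 13.8 billion years old.

CONFORMATIONS = 10**300               # Levinthal's estimate quoted in the text
SAMPLES_PER_SECOND = 10**13           # hypothetical sampling rate
SECONDS_PER_YEAR = 365 * 24 * 3600    # ignoring leap years
UNIVERSE_AGE_YEARS = 138 * 10**8      # roughly 13.8 billion years


def brute_force_years(conformations: int, rate_per_second: int) -> int:
    """Whole years needed to visit every conformation once at a fixed rate."""
    return conformations // (rate_per_second * SECONDS_PER_YEAR)


def order_of_magnitude(n: int) -> int:
    """Power of ten of a positive integer (number of digits minus one)."""
    return len(str(n)) - 1


years = brute_force_years(CONFORMATIONS, SAMPLES_PER_SECOND)
ratio = years // UNIVERSE_AGE_YEARS

print(f"exhaustive search: about 10^{order_of_magnitude(years)} years")
print(f"about 10^{order_of_magnitude(ratio)} times the age of the universe")
```

Changing the assumed rate by several orders of magnitude barely changes the picture, which is the point of the paradox: proteins cannot be finding their folds by exhaustive search.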

Results from the CASP14 assessment

In 1994, Professor John Moult and Professor Krzysztof Fidelis founded CASP as a biennial blind assessment to catalyse research, monitor progress, and establish the state of the art in protein structure prediction. It is both the gold standard for assessing predictive techniques and a unique global community built on shared endeavour. Crucially, CASP chooses protein structures that have only very recently been experimentally determined (some were still awaiting determination at the time of the assessment) to be targets for teams to test their structure prediction methods against; they are not published in advance. Participants must blindly predict the structure of the proteins, and these predictions are subsequently compared to the ground truth experimental data when they become available. We’re indebted to CASP’s organisers and the whole community, not least the experimentalists whose structures enable this kind of rigorous assessment.

The main metric used by CASP to measure the accuracy of predictions is the Global Distance Test (GDT) which ranges from 0-100. In simple terms, GDT can be approximately thought of as the percentage of amino acid residues (beads in the protein chain) within a threshold distance from the correct position. According to Professor Moult, a score of around 90 GDT is informally considered to be competitive with results obtained from experimental methods.

In the results from the 14th CASP assessment, released today, our latest AlphaFold system achieves a median score of 92.4 GDT overall across all targets. This means that our predictions have an average error (RMSD) of approximately 1.6 Angstroms, which is comparable to the width of an atom (or 0.1 of a nanometer). Even for the very hardest protein targets, those in the most challenging free-modelling category, AlphaFold achieves a median score of 87.0 GDT (data available here).
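
To make the two measures concrete, here is a simplified sketch of a GDT-style score and of RMSD. It assumes the predicted and experimental coordinates (one point per residue, in Angstroms) have already been superimposed, and it averages over the 1, 2, 4 and 8 Angstrom cutoffs used by the common GDT_TS variant. The official CASP evaluation also searches over superpositions, which this sketch omits, and the function names are illustrative only.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def residue_distances(pred: Sequence[Point], ref: Sequence[Point]) -> List[float]:
    """Per-residue distance between predicted and reference positions."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("structures must be non-empty and of equal length")
    return [math.dist(p, r) for p, r in zip(pred, ref)]


def gdt_ts(pred: Sequence[Point], ref: Sequence[Point],
           cutoffs: Sequence[float] = (1.0, 2.0, 4.0, 8.0)) -> float:
    """Simplified GDT_TS: mean percentage of residues within each cutoff.

    Assumes the two structures are already superimposed.
    """
    dists = residue_distances(pred, ref)
    fractions = [sum(d <= c for d in dists) / len(dists) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


def rmsd(pred: Sequence[Point], ref: Sequence[Point]) -> float:
    """Root-mean-square deviation over all residues."""
    dists = residue_distances(pred, ref)
    return math.sqrt(sum(d * d for d in dists) / len(dists))


# Toy four-residue example with errors of 0.5, 1.5, 3.0 and 9.0 Angstroms.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 9.0, 0.0)]

print(f"GDT_TS = {gdt_ts(predicted, reference):.2f}")   # 56.25
print(f"RMSD   = {rmsd(predicted, reference):.2f} A")
```

In the toy example a single badly placed residue dominates the RMSD, while the GDT-style score only counts it as a miss at every cutoff. That is one reason the two numbers are reported side by side.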

These exciting results open up the potential for biologists to use computational structure prediction as a core tool in scientific research. Our methods may prove especially helpful for important classes of proteins, such as membrane proteins, that are very difficult to crystallise and therefore challenging to experimentally determine.

This computational work represents a stunning advance on the protein-folding problem, a 50-year-old grand challenge in biology. It has occurred decades before many people in the field would have predicted. It will be exciting to see the many ways in which it will fundamentally change biological research.

PROFESSOR VENKI RAMAKRISHNAN

NOBEL LAUREATE AND PRESIDENT OF THE ROYAL SOCIETY

Our approach to the protein folding problem

We first entered CASP13 in 2018 with our initial version of AlphaFold, which achieved the highest accuracy among participants. Afterwards, we published a paper on our CASP13 methods in Nature with associated code, which has gone on to inspire other work and community-developed open source implementations. Now, new deep learning architectures we’ve developed have driven changes in our methods for CASP14, enabling us to achieve unparalleled levels of accuracy. These methods draw inspiration from the fields of biology, physics, and machine learning, as well as of course the work of many scientists in the protein folding field over the past half-century.

A folded protein can be thought of as a “spatial graph”, where residues are the nodes and edges connect the residues in close proximity. This graph is important for understanding the physical interactions within proteins, as well as their evolutionary history. For the latest version of AlphaFold, used at CASP14, we created an attention-based neural network system, trained end-to-end, that attempts to interpret the structure of this graph, while reasoning over the implicit graph that it’s building. It uses evolutionarily related sequences, multiple sequence alignment (MSA), and a representation of amino acid residue pairs to refine this graph.
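
The “spatial graph” idea can be illustrated with a small sketch that builds such a graph from residue coordinates. This is not AlphaFold's code. It assumes one representative point per residue and an 8 Angstrom cutoff for “close proximity”, a conventional choice that the announcement does not specify, and the function name is hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def contact_graph(coords: Sequence[Point], cutoff: float = 8.0) -> Dict[int, List[int]]:
    """Build an undirected residue graph as an adjacency list.

    Nodes are residue indices; an edge joins residues i and j when their
    representative points lie within `cutoff` Angstroms of each other.
    The 8.0 default is an assumption, not a value from the announcement.
    """
    graph: Dict[int, List[int]] = {i: [] for i in range(len(coords))}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                graph[i].append(j)
                graph[j].append(i)
    return graph


# Toy chain of six residues that runs along x and then folds back on itself.
toy_coords = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
    (11.4, 0.0, 0.0), (11.4, 5.0, 0.0), (7.6, 5.0, 0.0),
]

graph = contact_graph(toy_coords)
for residue, neighbours in graph.items():
    print(residue, neighbours)
```

In the toy chain, residue 5 ends up connected to residue 1 even though they are four positions apart in the sequence. Contacts like this, between residues that are distant in sequence but close in space, are what distinguish a folded structure from the bare chain.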

By iterating this process, the system develops strong predictions of the underlying physical structure of the protein and is able to determine highly-accurate structures in a matter of days. Additionally, AlphaFold can predict which parts of each predicted protein structure are reliable using an internal confidence measure.

We trained this system on publicly available data consisting of ~170,000 protein structures from the Protein Data Bank together with large databases containing protein sequences of unknown structure. It uses approximately 128 TPUv3 cores (roughly equivalent to ~100-200 GPUs) run over a few weeks, which is a relatively modest amount of compute in the context of most large state-of-the-art models used in machine learning today. As with our CASP13 AlphaFold system, we are preparing a paper on our system to submit to a peer-reviewed journal in due course.

The potential for real-world impact

When DeepMind started a decade ago, we hoped that one day AI breakthroughs would help serve as a platform to advance our understanding of fundamental scientific problems. Now, after 4 years of effort building AlphaFold, we’re starting to see that vision realised, with implications for areas like drug design and environmental sustainability.

Professor Andrei Lupas, Director of the Max Planck Institute for Developmental Biology and a CASP assessor, let us know that, “AlphaFold’s astonishingly accurate models have allowed us to solve a protein structure we were stuck on for close to a decade, relaunching our effort to understand how signals are transmitted across cell membranes.”

We’re optimistic about the impact AlphaFold can have on biological research and the wider world, and excited to collaborate with others to learn more about its potential in the years ahead. Alongside working on a peer-reviewed paper, we’re exploring how best to provide broader access to the system in a scalable way.

In the meantime, we’re also looking into how protein structure predictions could contribute to our understanding of specific diseases with a small number of specialist groups, for example by helping to identify proteins that have malfunctioned and to reason about how they interact. These insights could enable more precise work on drug development, complementing existing experimental methods to find promising treatments faster.

AlphaFold is a once in a generation advance, predicting protein structures with incredible speed and precision. This leap forward demonstrates how computational methods are poised to transform research in biology and hold much promise for accelerating the drug discovery process.

ARTHUR D. LEVINSON

PHD, FOUNDER & CEO CALICO, FORMER CHAIRMAN & CEO, GENENTECH

We’ve also seen signs that protein structure prediction could be useful in future pandemic response efforts, as one of many tools developed by the scientific community. Earlier this year, we predicted several protein structures of the SARS-CoV-2 virus, including ORF3a, whose structures were previously unknown. At CASP14, we predicted the structure of another coronavirus protein, ORF8. Impressively quick work by experimentalists has now confirmed the structures of both ORF3a and ORF8. Despite their challenging nature and having very few related sequences, we achieved a high degree of accuracy on both of our predictions when compared to their experimentally determined structures.

As well as accelerating understanding of known diseases, we’re excited about the potential for these techniques to explore the hundreds of millions of proteins we don’t currently have models for – a vast terrain of unknown biology. Since DNA specifies the amino acid sequences that comprise protein structures, the genomics revolution has made it possible to read protein sequences from the natural world at massive scale – with 180 million protein sequences and counting in the Universal Protein database (UniProt). In contrast, given the experimental work needed to go from sequence to structure, only around 170,000 protein structures are in the Protein Data Bank (PDB). Among the undetermined proteins may be some with new and exciting functions and – just as a telescope helps us see deeper into the unknown universe – techniques like AlphaFold may help us find them.

Unlocking new possibilities

AlphaFold is one of our most significant advances to date but, as with all scientific research, there are still many questions to answer. Not every structure we predict will be perfect. There’s still much to learn, including how multiple proteins form complexes, how they interact with DNA, RNA, or small molecules, and how we can determine the precise location of all amino acid side chains. In collaboration with others, there’s also much to learn about how best to use these scientific discoveries in the development of new medicines, ways to manage the environment, and more.

For all of us working on computational and machine learning methods in science, systems like AlphaFold demonstrate the stunning potential for AI as a tool to aid fundamental discovery. Just as 50 years ago Anfinsen laid out a challenge far beyond science’s reach at the time, there are many aspects of our universe that remain unknown. The progress announced today gives us further confidence that AI will become one of humanity’s most useful tools in expanding the frontiers of scientific knowledge, and we’re looking forward to the many years of hard work and discovery ahead!