LLM Provers: Translation and Commentary on "DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition"
Overview: DeepSeek-Prover-V2 proposes a new approach to formal theorem proving that combines the reasoning power of large language models with the rigor of formal verification systems. Through subgoal decomposition, recursive solving, curriculum learning, and reinforcement learning, it substantially improves performance across mathematical benchmarks, narrows the gap between informal reasoning and formal proof, and lays groundwork for future research in automated theorem proving.
>> Background and Pain Points
● Reasoning effectiveness: Large language models (LLMs) have made remarkable progress in mathematical problem solving, driven largely by inference-time scaling, especially natural-language chain-of-thought (CoT) reasoning.
● The challenge of formal theorem proving: Despite the success of natural-language reasoning on competition-level math problems, applying it to formal theorem proving remains fundamentally challenging.
● Informal weaknesses vs. formal requirements: The natural-language reasoning of LLMs is inherently informal, relying on heuristics, approximations, and data-driven guessing patterns that often lack the rigorous structure required by formal verification. Proof assistants such as Lean, Isabelle, and Coq operate on strict logical foundations: every proof step must be explicitly constructed and formally verified, with no room for ambiguity, implicit assumptions, or omitted details.
● Bridging the two: Closing the gap between informal, high-level reasoning and the syntactic rigor of formal verification systems is a long-standing challenge in neural theorem proving.
>> Proposed Solution
● Introduces DeepSeek-Prover-V2, an open-source large language model for formal theorem proving in Lean 4.
● Collects initialization data through a recursive theorem-proving pipeline powered by DeepSeek-V3. The cold-start training procedure begins by prompting DeepSeek-V3 to decompose complex problems into a series of subgoals. Proofs of resolved subgoals are synthesized into a chain-of-thought process and combined with DeepSeek-V3's step-by-step reasoning to create an initial cold start for reinforcement learning.
● Integrates informal and formal mathematical reasoning into a single unified model.
● Builds a simple recursive theorem-proving pipeline that uses DeepSeek-V3 as a unified tool for both subgoal decomposition and formalization.
● Prompts DeepSeek-V3 to decompose theorems into high-level proof sketches while formalizing those proof steps as a sequence of subgoals in Lean 4. A smaller 7B model handles the proof search for each subgoal, reducing the computational burden.
● Introduces a curriculum learning framework that uses the decomposed subgoals to generate conjectural theorems, progressively increasing task difficulty to better guide the model's learning. Complete step-by-step formal proofs are paired with DeepSeek-V3's corresponding chain-of-thought to create cold-start reasoning data.
● Applies a reinforcement learning stage to further strengthen the connection between informal mathematical reasoning and formal proof construction.
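As a rough sketch of the recursive decompose-and-solve loop described above (the function names and stubs here are hypothetical placeholders for DeepSeek-V3, the 7B prover, and the Lean 4 verifier, not the paper's actual implementation):

```python
def prove(theorem, decompose, search_proof, verify, max_depth=2):
    """Recursive theorem-proving loop: try the prover directly; on
    failure, decompose into subgoals, solve each one recursively,
    then assemble the subgoal proofs into a complete proof.
    `decompose`, `search_proof`, and `verify` are hypothetical stubs
    for DeepSeek-V3, the 7B prover, and the Lean checker."""
    proof = search_proof(theorem, premises=[])
    if proof is not None and verify(theorem, proof):
        return proof
    if max_depth == 0:
        return None
    premises = []
    for subgoal in decompose(theorem):  # Lean statements with `sorry`
        sub_proof = prove(subgoal, decompose, search_proof, verify,
                          max_depth - 1)
        if sub_proof is None:
            return None  # an unresolved subgoal blocks the whole proof
        premises.append((subgoal, sub_proof))
    # Assemble: each `sorry` placeholder gets its subgoal's proof.
    return [p for _, p in premises]
```

In the real pipeline, assembly means substituting verified subgoal proofs back into the Lean 4 sketch; the list returned here merely stands in for that step.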
>> Core Steps
● Subgoal decomposition: Use DeepSeek-V3 to break a complex theorem into smaller, manageable subgoals (lemmas).
● Formal sketching: Translate the decomposed subgoals into formal Lean 4 statements, omitting proof details with `sorry` placeholders.
● Recursive solving: Use the smaller 7B prover model to solve each subgoal recursively, with earlier subgoals available as premises.
● Proof assembly: Combine the subgoal proofs into a complete formal proof of the original theorem.
● Cold-start data generation: Pair the complete formal proofs with DeepSeek-V3's chain-of-thought process to create high-quality cold-start training data.
● Curriculum learning: Use the subgoals to generate training tasks of increasing difficulty, gradually guiding the prover toward more challenging problems.
● Reinforcement learning: Use binary correct/incorrect feedback as the reward signal, plus a consistency reward that keeps the generated proof's structure aligned with the chain-of-thought decomposition.
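The decompose-then-solve workflow above can be pictured as a schematic Lean 4 sketch (the theorem and hypothesis names here are hypothetical, not drawn from the paper): DeepSeek-V3 emits each subgoal as a `have` step with a `sorry` placeholder, and the 7B prover later fills each `sorry` in.

```lean
-- Hypothetical example of a decomposition sketch.
theorem odd_square (n : ℕ) (h : n % 2 = 1) : (n * n) % 2 = 1 := by
  -- Subgoal 1: express the odd number n as 2k + 1
  have h1 : ∃ k, n = 2 * k + 1 := by sorry
  -- Subgoal 2: the square of any 2k + 1 is odd
  have h2 : ∀ k, (2 * k + 1) * (2 * k + 1) % 2 = 1 := by sorry
  -- Assembly: combine the resolved subgoals
  obtain ⟨k, hk⟩ := h1
  rw [hk]
  exact h2 k
```

Each `sorry` becomes a standalone proof-search task, with previously proved subgoals available as premises.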
>> Advantages
● DeepSeek-Prover-V2-671B achieves state-of-the-art performance in neural theorem proving: an 88.9% pass ratio on MiniF2F-test, 49 of 658 problems solved on PutnamBench, and 6 of the 15 AIME problems in ProverBench.
● Narrows the gap between formal and informal mathematical reasoning in large language models.
● The pipeline pairing a general-purpose LLM with a lightweight specialized 7B prover reaches a 90.2% success rate on miniF2F-valid.
● The CoT reasoning mode shows a significant performance advantage over the non-CoT mode in formal mathematical reasoning.
● The 7B model in non-CoT generation mode performs surprisingly well on PutnamBench, solving 13 problems that the 671B version did not.
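The pass rates above are reported as pass@k (did any of k sampled proofs verify?). As a side note, not from the paper itself: the standard unbiased pass@k estimator used in the sampling-based evaluation literature can be computed as follows, where n proofs are sampled and c of them pass the Lean verifier:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n generations, of which c
    are verified correct, passes."""
    if n - c < k:
        return 1.0  # too few failures to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 8192 attempts, 12 verified, estimate pass@32
estimate = pass_at_k(8192, 12, 32)
```

Python's arbitrary-precision integers make the binomial ratio exact even for large n, so no log-space trick is needed here.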
>> Conclusions and Takeaways
● Synthesizing cold-start reasoning data is an effective way to improve formal theorem-proving capability.
● A recursive theorem-proving framework combining subgoal decomposition and formalization is a promising approach.
● Curriculum learning and reinforcement learning can further strengthen a model's formal theorem-proving ability.
● High-capacity models may internalize and externalize intermediate reasoning even without explicit CoT prompting.
● Future work should scale this paradigm to an AlphaProof-like system aimed at IMO-level problems at the frontier of automated theorem proving.
● Further work should explore how to use the informal reasoning abilities of LLMs to guide the construction of formal proofs.
● More effective reward functions should be studied to encourage well-structured, human-readable formal proofs.
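The reward design mentioned above (binary verifier feedback plus a structural consistency bonus) might be sketched as follows; the weighting scheme and the substring-based consistency check are illustrative assumptions, not the paper's exact formulation:

```python
def proof_reward(lean_verified: bool,
                 sketch_lemmas: list,
                 proof_text: str,
                 consistency_weight: float = 0.1) -> float:
    """Combine binary correctness feedback from the Lean verifier with
    a consistency bonus for proofs whose structure follows the
    chain-of-thought decomposition (hypothetical weights)."""
    correctness = 1.0 if lean_verified else 0.0
    if sketch_lemmas:
        # Fraction of sketched lemma statements that appear in the proof.
        matched = sum(1 for lemma in sketch_lemmas if lemma in proof_text)
        consistency = matched / len(sketch_lemmas)
    else:
        consistency = 0.0
    return correctness + consistency_weight * consistency
```

A consistency term of this kind discourages the policy from verifying a theorem by an unrelated shortcut proof that ignores the planned decomposition.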
Paper | [2504.21801] DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition | Date | April 30, 2025 | Authors | DeepSeek-AI |
Abstract
We introduce DeepSeek-Prover-V2, an open-source large language model designed for formal theorem proving in Lean 4, with initialization data collected through a recursive theorem proving pipeline powered by DeepSeek-V3. The cold-start training procedure begins by prompting DeepSeek-V3 to decompose complex problems into a series of subgoals. The proofs of resolved subgoals are synthesized into a chain-of-thought process, combined with DeepSeek-V3's step-by-step reasoning, to create an initial cold start for reinforcement learning. This process enables us to integrate both informal and formal mathematical reasoning into a unified model. The resulting model, DeepSeek-Prover-V2-671B, achieves state-of-the-art performance in neural theorem proving, reaching 88.9% pass ratio on the MiniF2F-test and solving 49 out of 658 problems from PutnamBench. In addition to standard benchmarks, we introduce ProverBench, a collection of 325 formalized problems, to enrich our evaluation, including 15 selected problems from the recent AIME competitions (years 24-25). Further evaluation on these 15 AIME problems shows that the model successfully solves 6 of them. In comparison, DeepSeek-V3 solves 8 of these problems using majority voting, highlighting that the gap between formal and informal mathematical reasoning in large language models is substantially narrowing.
Figure 1: Benchmark performance of DeepSeek-Prover-V2. On the AIME benchmark, DeepSeek-V3 is evaluated using the standard find-answer task for natural-language reasoning, while prover models generate Lean code to construct formal proofs for a given correct answer.
1. Introduction
The emergence of reasoning capabilities in large language models (LLMs) has revolutionized numerous areas of artificial intelligence, particularly in the domain of mathematical problem solving (DeepSeek-AI, 2025). These advancements are largely enabled by the paradigm of inference-time scaling, most notably through natural language chain-of-thought reasoning (Jaech et al., 2024). Rather than relying solely on a single forward pass to arrive at an answer, LLMs can reflect on intermediate reasoning steps, improving both accuracy and interpretability. Despite the success of natural language reasoning in solving competition-level mathematical problems, its application to formal theorem proving remains fundamentally challenging. LLMs perform natural language reasoning in an inherently informal manner, relying on heuristics, approximations, and data-driven guessing patterns that often lack the rigorous structure required by formal verification systems. In contrast, proof assistants such as Lean (Moura and Ullrich, 2021), Isabelle (Paulson, 1994), and Coq (Barras et al., 1999) operate on strict logical foundations, where every proof step must be explicitly constructed and formally verified. These systems permit no ambiguity, implicit assumptions, or omission of details. Bridging the gap between informal, high-level reasoning and the syntactic rigor of formal verification systems remains a longstanding research challenge in neural theorem proving (Yang et al., 2024).

To harness the strengths of informal mathematical reasoning in support of formal theorem proving, a classical approach is to hierarchically decompose formal proofs based on the guidance of natural-language proof sketches. Jiang et al. (2023) proposed a framework, called Draft, Sketch, and Prove (DSP), that leverages a large language model to generate proof sketches in natural language, which are subsequently translated into formal proof steps.
This informal-to-formal theorem proving paradigm closely mirrors the concept of subgoals in hierarchical reinforcement learning (Barto and Mahadevan, 2003; Nachum et al., 2018; Eppe et al., 2022), where complex tasks are broken down into a hierarchy of simpler subtasks that can be solved independently to progressively achieve the overarching objective. In formal theorem proving, a subgoal is typically an intermediate proposition or lemma that contributes to the proof of a larger theorem (Zhao et al., 2023, 2024). This hierarchical decomposition aligns with human problem-solving strategies and supports modularity, reusability, and more efficient proof search (Wang et al., 2024b; Zheng et al., 2024). Recent studies have extended this paradigm by employing multi-tiered hierarchies for structured proof generation (Wang et al., 2024a), and by leveraging reinforcement learning techniques to optimize the decomposition of complex theorems into manageable subgoals (Dong et al., 2024).

In this paper, we develop a reasoning model for subgoal decomposition, leveraging a suite of synthetic cold-start data and large-scale reinforcement learning to enhance its performance. To construct the cold-start dataset, we develop a simple yet effective pipeline for recursive theorem proving, utilizing DeepSeek-V3 (DeepSeek-AI, 2024) as a unified tool for both subgoal decomposition and formalization. We prompt DeepSeek-V3 to decompose theorems into high-level proof sketches while simultaneously formalizing these proof steps in Lean 4, resulting in a sequence of subgoals. Since the subgoal decomposition is powered by a large general-purpose model, we use a smaller 7B model to handle the proof search for each subgoal, thereby reducing the associated computational burden. Additionally, we introduce a curriculum learning framework that leverages the decomposed subgoals to generate conjectural theorems, progressively increasing the difficulty of training tasks to better guide the model's learning process. Once the decomposed steps of a challenging problem are resolved, we pair the complete step-by-step formal proof with the corresponding chain-of-thought from DeepSeek-V3 to create cold-start reasoning data. Based on the cold start, a subsequent reinforcement learning stage is applied to further strengthen the connection between informal mathematical reasoning and formal proof construction. Our experiments show that reinforcement learning starting from the cold start of informal reasoning in task decomposition significantly enhances the model's capabilities in formal theorem proving. The resulting DeepSeek-Prover-V2-671B model establishes a new state-of-the-art in neural theorem proving across multiple benchmarks. On MiniF2F-test, it achieves 82.4% accuracy with Pass@32, improving to 88.9% with Pass@8192.

The model shows strong generalization capabilities to college-level theorem proving, solving 37.1% of ProofNet-test problems with Pass@1024 and tackling 49 out of 658 challenging PutnamBench problems. Additionally, we contribute ProverBench, a benchmark dataset containing 325 formalized problems to advance neural theorem proving research, including 15 from the prestigious AIME competitions (years 24-25). DeepSeek-Prover-V2-671B successfully solves 6 of these 15 challenging AIME problems, further demonstrating its sophisticated mathematical reasoning capabilities.
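One way to read the curriculum idea concretely (an illustrative sketch under stated assumptions, not the paper's implementation): subgoals extracted from hard problems become standalone conjectural training theorems, ordered so the prover sees easier statements first. Here subgoal count serves as a crude difficulty proxy, which is an assumption for illustration only.

```python
def build_curriculum(decompositions):
    """Turn decomposed subgoals into a difficulty-ordered training
    curriculum. `decompositions` maps each source theorem to its list
    of subgoal statements; subgoal count is used as a rough difficulty
    proxy (a hypothetical metric, not the paper's)."""
    tasks = []
    for theorem, subgoals in decompositions.items():
        difficulty = len(subgoals)
        # Each subgoal is itself a conjectural training theorem,
        # assumed one step easier than its parent.
        for sub in subgoals:
            tasks.append((difficulty - 1, sub))
        tasks.append((difficulty, theorem))
    tasks.sort(key=lambda t: t[0])  # easiest conjectures first
    return [statement for _, statement in tasks]
```

Training on the resulting sequence lets the prover accumulate solved lemmas before facing the parent theorems that depend on them.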
Conclusion
In this work, we propose a comprehensive pipeline for synthesizing cold-start reasoning data to advance formal theorem proving. Our data construction process is grounded in a recursive theorem-proving framework, wherein DeepSeek-V3 serves as a unified model for both subgoal decomposition and lemma formalization within the Lean 4 proof assistant. Our approach combines high-level proof sketches with formal steps, creating a sequence of manageable subgoals that can be efficiently solved using a smaller 7B model, significantly reducing computational requirements. The curriculum learning framework we developed uses these decomposed subgoals to generate increasingly difficult training tasks, creating a more effective learning progression. By pairing complete formal proofs with DeepSeek-V3’s chain-of-thought reasoning, we established valuable cold-start reasoning data that bridges informal mathematical thinking with formal proof structures. The subsequent reinforcement learning stage substantially enhanced this connection, leading to significant improvements in formal theorem proving capabilities. The resulting model, DeepSeek-Prover-V2-671B, consistently outperforms all baselines across a range of benchmarks, spanning both high-school competition problems and undergraduate-level mathematics. Our future work will focus on scaling this paradigm to an AlphaProof-like system with the ultimate aim of tackling IMO-level mathematical problems that represent the frontier of automated theorem proving challenges. 