LLMs之Benchmark:《FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models》翻译与解读
导读:该论文提出了FormalMATH,一个大规模的Lean4形式化数学推理基准测试,旨在解决现有基准测试的局限性。通过人机协作的自动形式化流程,FormalMATH构建了一个包含5,560个问题的丰富数据集,并对现有LLM定理证明器进行了评估,揭示了它们在形式化数学推理方面的局限性,为未来的研究方向提供了重要启示。
>> 背景与痛点
● 形式化数学推理(FMR)的挑战性:形式化数学推理依赖于严格的公理化框架,对人工智能来说仍然是一个难题。
● 现有基准测试的局限性:现有的形式化数学基准测试(如MiniF2F、ProofNet)在范围、规模和实用性上存在局限性。具体表现为:
●● 范围狭窄:例如,MiniF2F主要关注高中水平的代数和数论,ProofNet则侧重于本科水平的分析和代数。
●● 数据集规模小:MiniF2F测试集仅包含244个问题,ProofNet仅包含186个问题,限制了基准测试的鲁棒性和FMR系统的泛化能力。
●● 性能饱和:现有最优定理证明器在现有基准测试上已达到较高的成功率(超过80.7%),表明这些基准测试可能接近其效用极限。
● 自动形式化的困难:将自然语言描述的数学问题转化为精确的形式化语言(如Lean4)存在挑战,尤其是在高级数学领域,需要严格的语义保留。
● 验证形式化陈述的挑战:验证形式化陈述需要确保语法正确性(通过Lean4编译器检查)以及与原始问题的语义一致性,这既耗时又需要专业知识。
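上面提到的“通过Lean4编译器检查语法正确性”,通常可以用如下方式实现:把一条形式化陈述写入临时文件,再在已配置好Mathlib依赖的Lean工程中调用编译命令。下面是一个极简的Python示意(`lake env lean` 是Lean4工程的常规检查方式,但 `project_dir` 等参数与整体封装均为假设,论文的实际工程实现可能不同):

```python
import pathlib
import subprocess
import tempfile

def lean_compiles(statement: str, project_dir: str) -> bool:
    """把一条 Lean4 陈述写入临时文件,在已配置好 Mathlib 依赖的工程目录下
    调用 `lake env lean` 做语法与类型检查;返回码为 0 即视为通过(陈述可以 sorry 结尾)。"""
    source = "import Mathlib\n\n" + statement + "\n"
    with tempfile.NamedTemporaryFile("w", suffix=".lean", dir=project_dir,
                                     delete=False, encoding="utf-8") as f:
        f.write(source)
        path = f.name
    try:
        result = subprocess.run(["lake", "env", "lean", path],
                                cwd=project_dir, capture_output=True, text=True)
        return result.returncode == 0
    finally:
        pathlib.Path(path).unlink(missing_ok=True)
```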
>> 解决方案:FormalMATH基准测试
● FormalMATH的提出:为了解决现有基准测试的局限性,论文提出了FormalMATH,一个大规模的Lean4基准测试,包含5,560个经过形式验证的数学陈述。
● FormalMATH的特点:
●● 范围广泛:涵盖代数、几何、微积分、数论、离散数学等多个数学领域。
●● 难度多样:包含从高中奥林匹克竞赛题到本科水平定理的各种难度。
●● 规模庞大:比MiniF2F基准测试大22.8倍。
>> 核心思路与步骤:人机协作的自动形式化流程
● 人机协作框架:论文提出了一个人机协作框架(如图1所示),以构建FormalMATH基准测试,该框架旨在减少生成形式化数学陈述所需的人工标注工作。
● 关键步骤:
●● 集成多LLM的自动形式化:利用多个专门的大型语言模型(LLM),通过best-of-N抽样将自然语言问题自动形式化为Lean4陈述。
●● 自动验证:采用三层验证流程确保正确性——编译器语法验证、多LLM语义验证,以及基于否定的反证(利用现成的LLM证明器尝试证明陈述的否定,从而过滤掉不可证明的陈述)。
●● 人工验证:招募国际数学奥林匹克奖牌获得者级别的人工专家,手动检查自然语言陈述和Lean4形式化之间的语义对齐保真度。
● 数据过滤与保留率:通过上述流程,在人工验证之前保留了72.09%的翻译陈述,从而最大限度地减少了人工验证的工作量(整个流程的一个简化代码示意见下文)。
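下面给出上述“多LLM自动形式化 + 三层自动验证”流程的一个极简Python草图,仅用于说明各步骤如何衔接;其中 `formalize`、`lean_compiles`、`semantically_consistent`、`prove`、`negate` 等函数均为假设的占位接口,并非论文的实际实现:

```python
from dataclasses import dataclass
from typing import Callable, List

# 以下函数参数均为假设的占位接口(并非论文或任何真实库的 API):
#   formalize(model, problem)                  -> 调用某个 LLM,把自然语言题目翻译成一条 Lean4 陈述
#   lean_compiles(statement)                   -> 调用 Lean4 编译器检查语法/类型是否通过
#   semantically_consistent(judge, problem, s) -> 某个 LLM 判断形式化陈述与原题语义是否一致
#   prove(prover, statement)                   -> 用现成的 LLM 证明器尝试证明陈述,返回是否成功
#   negate(statement)                          -> 构造陈述的否定形式

@dataclass
class Candidate:
    problem: str     # 自然语言题目
    statement: str   # 对应的 Lean4 形式化陈述

def autoformalize_best_of_n(problem: str, formalizer_models: List, n_per_model: int,
                            formalize: Callable, lean_compiles: Callable) -> List[Candidate]:
    """集成多 LLM 的 best-of-N 自动形式化:多个模型各采样若干条,仅保留能通过编译的候选。"""
    candidates = []
    for model in formalizer_models:
        for _ in range(n_per_model):
            stmt = formalize(model, problem)
            if lean_compiles(stmt):          # 第一层:编译器语法验证
                candidates.append(Candidate(problem, stmt))
    return candidates

def filter_candidates(candidates: List[Candidate], judge_models: List, prover,
                      semantically_consistent: Callable, prove: Callable,
                      negate: Callable) -> List[Candidate]:
    """第二层:多 LLM 语义验证;第三层:基于否定的反证过滤。剩余条目再交由人工专家核对。"""
    kept = []
    for c in candidates:
        # 多个 LLM 均认为语义对齐才保留("全体同意"只是示意性的聚合规则)
        if not all(semantically_consistent(j, c.problem, c.statement) for j in judge_models):
            continue
        # 若证明器能证明该陈述的否定,说明原陈述不可证,予以过滤
        if prove(prover, negate(c.statement)):
            continue
        kept.append(c)
    return kept
```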
>> 优势
● 大规模和全面的Lean4基准测试:FormalMATH提供了一个包含5,560个形式验证的数学陈述的基准测试,覆盖了不同的子领域。
● 有效的人机协作自动形式化流程:该框架集成了多LLM自动形式化、多LLM语义验证和基于否定的反证策略,以自动化形式化,同时在人工审查前保持72.09%的准确率。
● LLM在语义理解上的优势:将Lean4陈述回译成自然语言比反向的形式化过程更容易,而通用LLM擅长理解和比对自然语言,因此可借助多个LLM在自然语言层面完成语义一致性验证。
● 高效的Lean4验证器实现:利用树状结构的并行方法,集中处理Mathlib4库的导入操作,避免重复计算开销,确保可扩展和高效地利用计算资源。
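下面是对“集中处理Mathlib4导入、分组并行验证”这一思路的简化示意:每个工作线程只加载一次Mathlib,然后复用该环境批量检查多条证明。其中 `LeanSession` 及其方法均为假设的占位接口(实际可基于某种Lean REPL封装实现),并非论文的真实实现:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

class LeanSession:
    """假设的占位类:表示一个常驻的 Lean4 交互进程,启动时只执行一次 `import Mathlib`,
    之后可在同一环境里反复检查相互独立的定理,避免每条陈述都重新导入 Mathlib。"""

    def __init__(self):
        self._start_and_import_mathlib_once()

    def _start_and_import_mathlib_once(self):
        raise NotImplementedError("占位:启动 Lean 进程并加载 Mathlib,依赖具体的 REPL 封装")

    def check(self, lean_code: str) -> bool:
        raise NotImplementedError("占位:在已加载的环境中检查一段 Lean4 代码是否通过")

def verify_in_parallel(proofs: List[str], num_workers: int = 8) -> List[bool]:
    """把待验证的证明按 worker 分块:块间并行,块内串行复用同一个已加载 Mathlib 的会话。"""
    chunks = [proofs[i::num_workers] for i in range(num_workers)]

    def verify_chunk(chunk: List[str]) -> List[bool]:
        session = LeanSession()              # 昂贵的 Mathlib 导入在每个 worker 中只发生一次
        return [session.check(code) for code in chunk]

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        per_chunk = list(pool.map(verify_chunk, chunks))

    # 把分块结果还原成与输入相同的顺序
    results: List[bool] = [False] * len(proofs)
    for w, chunk_results in enumerate(per_chunk):
        for j, ok in enumerate(chunk_results):
            results[w + j * num_workers] = ok
    return results
```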
>> 实验结果与分析
● LLM定理证明器的局限性:在FormalMATH基准测试中,现有最优的LLM定理证明器表现出显著的局限性。
●● 成功率低:即使是性能最佳的模型,在实际抽样预算下(pass@32),在FormalMATH-Full数据集上的成功率也仅为16.46%;这类pass@k指标的一种常见计算方式见本节末尾的示例。
●● 领域偏差:现有证明器表现出严重的领域偏差,在代数等领域表现良好,但在微积分等领域表现不佳。
●● 过度依赖简化策略:证明器经常将多步推理简化为单一策略调用,例如“aesop”和“linarith”,从而绕过了必要的演绎严谨性。
●● 自然语言指导的负面影响:在CoT推理场景中,添加自然语言解决方案会降低成功率,表明这种指导引入了歧义而不是清晰性。
● 测试时扩展的性能提升有限:对FormalMATH-Lite进行测试时扩展表明,在形式定理证明中应用测试时扩展的回报有限。
● CoT可以增强模型在形式数学推理方面的能力:通过在FormalMATH-Lite上进行测试时扩展,发现朴素CoT优于自然语言指导。
● 常见错误模式:对现有定理证明器进行系统分析,发现的常见错误模式包括不完整的证明、无法处理复杂的非线性不等式、不恰当使用自动化策略以及引入冗余假设。
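论文中的Pass@32、Pass@1×32×100等指标衡量的是在给定采样预算内至少得到一条可通过Lean4验证的证明的问题比例。下面给出pass@k的一种常见无偏估计写法(沿用代码生成评测中的通行做法,仅作参考;当n=k时它退化为“k次采样中是否至少命中一次”的直接统计,论文也可能就是这样计算的):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """对单个问题:总共采样 n 条证明,其中 c 条通过验证,
    估计任取 k 条至少有一条通过的概率:pass@k = 1 - C(n - c, k) / C(n, k)。"""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 示例:某题采样 32 条证明,其中 2 条通过,则 pass@32 = 1.0(n = k 时只要 c > 0 即为 1)
print(pass_at_k(n=32, c=2, k=32))
# 基准整体的 Pass@k 是对所有问题的 pass@k 取平均:例如 16.46% 表示平均而言,
# 约有 16.46% 的问题能在 32 次采样内找到可通过验证的证明。
```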
>> 结论与观点
● FormalMATH的价值:FormalMATH提供了一个可靠的基准,用于评估形式化数学推理。
● LLM定理证明器的改进方向:LLM定理证明器在鲁棒性、泛化性和推理复杂性方面有很大的改进空间。
● 未来研究方向:未来的研究可以集中在改进LLM在复杂定理空间的探索和技能获取、解决领域偏差、提高推理的演绎严谨性等方面。
目录
《FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models》翻译与解读
地址:论文地址:[2505.02735] FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models
时间:2025年5月5日
作者:香港中文大学、中南大学、西湖大学、加州大学洛杉矶分校等
Abstract
Formal mathematical reasoning remains a critical challenge for artificial intelligence, hindered by limitations of existing benchmarks in scope and scale. To address this, we present FormalMATH, a large-scale Lean4 benchmark comprising 5,560 formally verified problems spanning from high-school Olympiad challenges to undergraduate-level theorems across diverse domains (e.g., algebra, applied mathematics, calculus, number theory, and discrete mathematics). To mitigate the inefficiency of manual formalization, we introduce a novel human-in-the-loop autoformalization pipeline that integrates: (1) specialized large language models (LLMs) for statement autoformalization, (2) multi-LLM semantic verification, and (3) negation-based disproof filtering strategies using off-the-shelf LLM-based provers. This approach reduces expert annotation costs by retaining 72.09% of statements before manual verification while ensuring fidelity to the original natural-language problems. Our evaluation of state-of-the-art LLM-based theorem provers reveals significant limitations: even the strongest models achieve only 16.46% success rate under practical sampling budgets, exhibiting pronounced domain bias (e.g., excelling in algebra but failing in calculus) and over-reliance on simplified automation tactics. Notably, we identify a counterintuitive inverse relationship between natural-language solution guidance and proof success in chain-of-thought reasoning scenarios, suggesting that human-written informal reasoning introduces noise rather than clarity in the formal reasoning settings. We believe that FormalMATH provides a robust benchmark for benchmarking formal mathematical reasoning. | 形式化数学推理仍然是人工智能面临的一项关键挑战,现有基准在范围和规模上的局限性阻碍了其发展。为解决这一问题,我们推出了 FormalMATH,这是一个大规模的 Lean4 基准测试集,包含 5560 个经过形式验证的问题,涵盖了从高中奥林匹克竞赛难题到大学水平定理的多个领域(例如代数、应用数学、微积分、数论和离散数学)。为缓解手动形式化效率低下的问题,我们引入了一种新颖的人机协作自动形式化流程,该流程整合了:(1)用于陈述自动形式化的专业大型语言模型(LLM),(2)多 LLM 语义验证,以及(3)基于否定的反证过滤策略,使用现成的基于 LLM 的证明器。这种方法在人工验证前保留了 72.09% 的陈述,同时确保了对原始自然语言问题的忠实度。我们对最先进的基于大语言模型的定理证明器的评估揭示了显著的局限性:即使是最强大的模型,在实际抽样预算下也仅能实现 16.46%的成功率,表现出明显的领域偏差(例如,在代数方面表现出色但在微积分方面却失败)以及对简化自动化策略的过度依赖。值得注意的是,我们发现了一个反直觉的逆向关系,即在链式思维推理场景中,自然语言解决方案指导与证明成功之间呈负相关,这表明人类编写的非正式推理在形式推理环境中引入了噪声而非清晰性。我们认为,FormalMATH 为形式数学推理的基准测试提供了一个可靠的基准。 |
Figure 1: A human-in-the-loop pipeline for formal mathematical statement creation and filtering.图 1:用于形式化数学陈述创建与筛选的人机协作(human-in-the-loop)流程。

Figure 2: (a) Performance comparison of existing theorem provers on the full FormalMATH benchmark. Results show Pass@1×32×100 accuracy for best-first-search-based (BFS) methods, including BFS-Prover and InternLM-Prover, and Pass@32 accuracy via single-pass generations (SPG) for the other provers, including Kimina-Prover, STP, Goedel-Prover, DeepSeek-V1.5-RL and DeepSeek-V1.5-SFT. (b) Funnel chart illustrating the percentage of data that is preserved after each filtering stage in our human-in-the-loop autoformalization pipeline.图 2:(a)现有定理证明器在完整 FormalMATH 基准测试中的性能比较。图中给出了基于最佳优先搜索(BFS)的方法(包括 BFS-Prover 和 InternLM-Prover)的 Pass@1×32×100 准确率,以及其他证明器(包括 Kimina-Prover、STP、Goedel-Prover、DeepSeek-V1.5-RL 和 DeepSeek-V1.5-SFT)通过单次生成(SPG)得到的 Pass@32 准确率。(b)漏斗图展示了在我们的“人在回路”自动形式化流程中,经过每个过滤阶段后保留的数据百分比。

1、Introduction
Formal mathematical reasoning (FMR) [YPH+24] represents a specialized form of mathematical practice grounded in formal systems [Lei10, Mat20, BBC+97], which provides a rigorous axiomatic framework essential for automated proof validation. However, FMR is inherently challenging for humans. For instance, the Liquid Tensor Experiment [Sch22] and the Polynomial Freiman-Ruzsa Conjecture [Tao23] have taken years of effort by human experts to formalize and yet remain incomplete. Recent works have leveraged self-supervised learning [PS20], chain-of-thought (CoT) finetuning [XRS+24], and scalable tree-search [XXY+25] to explore complex proof strategies, demonstrating the significant potential of large language models (LLMs) for FMR. While there are several formal mathematics benchmarks, such as MiniF2F [ZHP21] and ProofNet [APS+23] that are widely used to evaluate the FMR capabilities of LLMs, they still present a few critical limitations: (1) Scope limitation: Existing benchmarks are narrowly scoped. For instance, MiniF2F is restricted to high school-level algebra and number theory, while ProofNet focuses narrowly on undergraduate-level analysis and algebra. Their narrow scopes limit the capacity to evaluate holistic FMR capabilities across diverse mathematical domains. (2) Dataset size: Formal mathematics benchmarks remain relatively small in scale. MiniF2F contains merely 244 problems in its test set, and ProofNet includes only 186. This constrains benchmarking robustness and hinders the development of generalizable FMR systems. (3) Performance Saturation: State-of-the-art theorem provers, such as Kimina-Prover [WUL+25], now achieve success rates exceeding 80.7%, signaling that existing benchmarks may be nearing their practical utility limits. | 形式数学推理(FMR)[YPH+24] 是一种基于形式系统[Lei10, Mat20, BBC+97] 的专业化数学实践,它提供了一个严格的公理框架,对于自动化证明验证至关重要。然而,FMR 对人类来说具有固有的挑战性。例如,液体张量实验 [Sch22] 和多项式弗雷曼 - 鲁萨猜想 [Tao23] 即使经过人类专家多年的努力进行形式化,也仍未完成。近期的研究利用了自监督学习 [PS20]、链式思维(CoT)微调 [XRS+24] 以及可扩展树搜索 [XXY+25] 来探索复杂的证明策略,展示了大型语言模型(LLMs)在 FMR 方面的巨大潜力。尽管存在诸如 MiniF2F [ZHP21] 和 ProofNet [APS+23] 等广泛用于评估 LLMs 的 FMR 能力的形式数学基准,但它们仍存在一些关键局限性:(1)范围狭窄:现有的基准范围狭窄。例如,MiniF2F 仅限于高中水平的代数和数论,而 ProofNet 则仅专注于大学水平的分析和代数。这种狭窄的范围限制了对跨不同数学领域的整体形式化数学推理(FMR)能力进行评估。(2)数据集规模:形式化数学基准在规模上仍然相对较小。MiniF2F 的测试集仅包含 244 个问题,而 ProofNet 仅有 186 个。这限制了基准测试的稳健性,并阻碍了通用形式化数学推理系统的开发。(3)性能饱和:诸如 Kimina-Prover [WUL+25] 这样的最先进的定理证明器如今的成功率已超过 80.7%,这表明现有的基准可能已接近其实际效用的极限。 | To address these limitations, we introduce FormalMATH — a large-scale Lean4 [MU21]-based benchmark containing 5,560 formally verified mathematical statements. FormalMATH includes a broad spectrum of mathematical domains, such as algebra, geometry, calculus, number theory, discrete mathematics, and more, while simultaneously spanning multiple difficulty levels, ranging from high school olympiad problems to undergraduate-level theorems (see Figure 3 for an overview). The development of FormalMATH presents two primary challenges: (1) Autoformalization difficulty: limited concurrent tools open-sourced for robustly translating natural-language problems into precise Lean4 statements, especially for advanced mathematical domains requiring strict semantic preservation, (2) Validating formal statements requires ensuring syntactic correctness (via Lean4 compiler checks) and semantic alignment with original problems—a dual requirement that remains technically demanding and time-intensive even for human experts.
Motivated by these challenges, we propose a human-in-the-loop framework (Figure 1) for constructing the FormalMATH benchmark. Our framework substantially reduces the manual annotation effort required to generate formal mathematical statements by integrating: (1) Ensemble-based autoformalization: multi-LLMs for autoformalization via best-of-N sampling [WWS+22a] and (2) Automated validation: A three-tiered pipeline ensures correctness — compiler syntax validation [Lea23], Multi-LLMs semantic verification, and negation-based disproof to filter unprovable theorems. This strategy minimizes human verification while achieving high accuracy, preserving 72.09% of translated statements in FormalMATH. | 为了解决这些局限性,我们引入了 FormalMATH——一个基于 Lean4 [MU21] 的大规模基准测试集,包含 5560 个经过形式化验证的数学陈述。FormalMATH 涵盖了广泛的数学领域,如代数、几何、微积分、数论、离散数学等,同时涵盖了从高中奥林匹克竞赛问题到大学水平定理的多个难度级别(见图 3 概览)。FormalMATH 的开发面临两个主要挑战:(1)自动形式化难度:目前可用于将自然语言问题稳健地转换为精确 Lean4 陈述的开源工具十分有限,尤其是在需要严格语义保留的高级数学领域;(2)验证形式化陈述需要确保语法正确性(通过 Lean4 编译器检查)以及与原始问题的语义一致性——这一双重要求即使对于人类专家来说,在技术上也具有挑战性且耗时。 受这些挑战的激励,我们提出了一个“人在回路”框架(图 1),用于构建 FormalMATH 基准。我们的框架通过整合以下内容大幅减少了生成形式化数学陈述所需的人工标注工作:(1)基于集成的自动形式化:使用多个 LLM 通过 best-of-N 采样进行自动形式化 [WWS+22a];(2)自动化验证:一个三层管道确保正确性——编译器语法验证 [Lea23]、多 LLM 语义验证以及基于否定的反证来过滤无法证明的定理。这种策略在实现高精度的同时将人工验证降至最低,在 FormalMATH 中保留了 72.09% 的翻译陈述。 | We evaluate state-of-the-art LLM-based theorem provers on the FormalMATH benchmark, revealing significant challenges for these systems. For instance, the best-performing model — Kimina-Prover [WUL+25] achieves only 16.46% on the FormalMATH-Full dataset under the pass@32 metric, while BFS-Prover [XXY+25] attains just 11.13% using a best-first search with a sampling budget of 1 × 32 × 100. Our analysis of these results yields several intriguing insights. First, existing provers exhibit a pronounced domain bias, excelling primarily in high-school-level algebra and applied mathematics while struggling with other mathematical domains. This highlights critical gaps in their cross-domain generalizability. Second, the provers frequently reduce multi-step reasoning to single-tactic invocations (e.g., “aesop” [LF23] and “linarith”), bypassing necessary deductive rigor. Third, while CoT reasoning [WWS+22b] enhances performance on FormalMATH statements, adding natural language solutions reduces success rates, suggesting such guidance introduces ambiguity rather than clarity. Our contributions include: ● A Large and Comprehensive Lean4 Benchmark: We present FormalMATH, a benchmark of 5,560 formally verified mathematical statements spanning diverse subdomains. This dataset is 22.8× larger than the widely used MiniF2F benchmark. ● An Effective Human-in-the-Loop Autoformalization Pipeline: We introduce a framework (Figure 1) integrating multi-LLM autoformalization, multi-LLM semantic verification, and negation-based disproof strategies to automate formalization while retaining 72.09% accuracy before manual review. This reduces reliance on costly expert annotation and enables scalable Lean4 dataset construction. ● Comprehensive Evaluation of LLM-based Theorem Provers: Our systematic evaluation reveals fundamental limitations in state-of-the-art theorem provers: (1) even the best-performing model achieves only a 16.46% success rate on FormalMATH, (2) existing provers exhibit severe domain bias, performing well in areas like algebra but poorly in others such as calculus, (3) counterintuitive inverse relationship where providing natural language solution guidance decreased proof success rates in CoT scenarios.
These limitations suggest important potential directions for improving LLM-based provers. | 我们在 FormalMATH 基准测试中评估了最先进的基于 LLM 的定理证明器,揭示了这些系统面临的重大挑战。例如,表现最佳的模型——Kimina-Prover [WUL+25] 在 FormalMATH-Full 数据集上按 pass@32 指标计算的成功率仅为 16.46%,而 BFS-Prover [XXY+25] 使用最佳优先搜索和 1×32×100 的采样预算时仅达到 11.13%。对这些结果的分析为我们带来了几个有趣的见解。首先,现有的证明器表现出明显的领域偏差,主要在高中水平的代数和应用数学方面表现出色,而在其他数学领域则表现不佳。这凸显了它们在跨领域泛化能力方面的关键差距。其次,这些证明器经常将多步推理简化为单个策略调用(例如,“aesop” [LF23] 和“linarith”),从而绕过了必要的演绎严谨性。第三,虽然 CoT 推理 [WWS+22b] 提高了在 FormalMATH 陈述上的性能,但添加自然语言解决方案却降低了成功率,这表明这种指导引入了歧义而非清晰性。我们的贡献包括:● 一个庞大且全面的 Lean4 基准:我们推出了 FormalMATH,这是一个包含 5560 个经过形式验证的数学陈述的基准,涵盖了不同的子领域。该数据集比广泛使用的 MiniF2F 基准大 22.8 倍。 ● 一种有效的人机协作自动形式化流程:我们介绍了一个框架(图 1),该框架整合了多 LLM 自动形式化、多 LLM 语义验证以及基于否定的反证策略,以实现自动化形式化,同时在人工审核前保持 72.09% 的准确率。这减少了对昂贵专家标注的依赖,并使 Lean4 数据集的构建具有可扩展性。 ● 基于 LLM 的定理证明器的全面评估:我们的系统评估揭示了最先进的定理证明器存在的根本局限性:(1)即使表现最佳的模型在 FormalMATH 上的成功率也只有 16.46%,(2)现有的证明器存在严重的领域偏差,在代数等领域表现良好,但在微积分等其他领域表现不佳,(3)存在一个反直觉的逆向关系:在 CoT 场景中,提供自然语言解决方案指导反而降低了证明成功率。这些局限性表明了改进基于大型语言模型的证明器的重要潜在方向。 |
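为了更直观地说明“把自然语言题目形式化为Lean4陈述”以及“基于否定的反证过滤”分别是什么样子,下面给出一个极简的假设性Lean4示例(由编者构造,并非FormalMATH数据集中的真实条目;该类基准只提供陈述,证明由被评测的证明器补全):

```lean
import Mathlib

/- 自然语言题目(假设):证明对任意实数 a、b,有 a² + b² ≥ 2ab。
   数据集中通常只给出形式化陈述,证明部分留空(sorry),由证明器来补全;
   这里顺便给出一个简短证明,说明该陈述是可证的。 -/
theorem toy_statement (a b : ℝ) : a ^ 2 + b ^ 2 ≥ 2 * a * b := by
  nlinarith [sq_nonneg (a - b)]

/- “基于否定的反证”过滤的示意:假设自动形式化出错,把“≥”写成了“>”,
   得到一条在 a = b 时不成立、因而不可证的陈述。
   若现成的证明器能证明它的否定(如下),这条形式化结果就会被过滤掉。 -/
theorem wrong_statement_is_disprovable :
    ¬ (∀ a b : ℝ, a ^ 2 + b ^ 2 > 2 * a * b) := by
  intro h
  have := h 1 1     -- 代入 a = b = 1,得到 2 > 2
  norm_num at this
```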
6 Concluding Remarks
We introduce FormalMATH, a novel and extensive benchmark for evaluating the formal mathematical reasoning capabilities of LLMs. Comprising 5,560 formally verified statements in Lean4, FormalMATH spans a wide range of mathematical domains, including algebra, number theory, calculus, and discrete mathematics, encompassing problems from high-school Olympiad level to undergraduate curricula. We propose a simple yet effective human-in-the-loop autoformalization pipeline to construct FormalMATH. This pipeline integrates specialized LLMs for initial Lean4 statement formalization, multi-LLM semantic verification to ensure fidelity to the original natural-language problems, and a negation-based disproof strategy for filtering invalid statements, which extensively reduces the effort for subsequent manual review by human experts, while achieving a high pre-verification preservation rate of 72.09%. Our comprehensive evaluation of state-of-the-art LLM-based theorem provers on FormalMATH reveals significant limitations in current systems. Even the most capable models demonstrate modest success rates under practical sampling budgets, with the top performer achieving only 16.46% accuracy. Our analysis further identifies pronounced domain biases, wherein models excel in certain domains like algebra but struggle considerably in other domains such as calculus. Additionally, our findings indicate an over-reliance on simplified automation tactics and, counterintuitively, a negative impact of natural-language solution guidance on proof success in CoT scenarios. These results highlight the challenging nature of the FormalMATH benchmark and pose critical open problems for future research in enhancing robustness, generalizability, and reasoning complexity of automatic theorem provers. | 我们推出了 FormalMATH,这是一个用于评估大型语言模型(LLM)形式化数学推理能力的全新且广泛的基准测试。FormalMATH 包含 5560 个在 Lean4 中经过形式验证的陈述,涵盖了广泛的数学领域,包括代数、数论、微积分和离散数学,涵盖了从高中奥林匹克竞赛水平到大学课程的问题。我们提出了一种简单而有效的“人在回路”自动形式化流程来构建 FormalMATH。该流程整合了专门用于初步 Lean4 陈述形式化的 LLM、多 LLM 语义验证以确保对原始自然语言问题的忠实度,以及基于否定的反证策略来过滤无效陈述,这极大地减少了后续人工专家审查的工作量,同时实现了 72.09% 的高预验证保留率。 我们对最先进的基于 LLM 的定理证明器在 FormalMATH 上进行了全面评估,结果揭示了当前系统存在的显著局限性。即使是最强大的模型在实际抽样预算下也仅取得了适度的成功率,表现最佳的模型准确率也只有 16.46%。我们的分析进一步揭示了明显的领域偏差,即模型在代数等某些领域表现出色,但在微积分等其他领域则表现不佳。此外,我们的研究结果表明,模型过度依赖简化的自动化策略,而且出人意料的是,在 CoT 场景中,自然语言解决方案指导会对证明成功产生负面影响。这些结果凸显了 FormalMATH 基准测试的挑战性,并为未来研究在增强自动定理证明器的稳健性、泛化能力和推理复杂性方面提出了关键的开放性问题。 |