Construction-Verification: A Benchmark for Applied Mathematics in Lean 4
Bowen Yang, Yi Yuan, Chenyi Li, Ziyu Wang, Liangqi Li, Bo Zhang, Zhe Li, Zaiwen Wen
cs.LO
Feb 1, 2026 · v1
TL;DR
Introduces AMBER, a Lean 4 construction-verification benchmark for applied mathematics tasks.
Abstract
Recent advances in large language models have demonstrated impressive capabilities in mathematical formalization. However, existing benchmarks focus on logical verification of declarative propositions, often neglecting the task of explicitly synthesizing solutions. This limitation is particularly acute in applied mathematics domains, where the goal is frequently to derive concrete values or executable algorithms rather than solely proving theorems. To address this, we introduce a Lean 4 framework that enforces a construction-verification workflow, compelling the agent to define explicit solutions before proving their correctness. We curate a comprehensive benchmark AMBER (Applied Mathematics BEnchmark for Reasoning) spanning core domains of applied mathematics, including convex analysis, optimization, numerical algebra, and high-dimensional probability. Aside from theorem proving, our benchmark features complex tasks such as evaluation, algorithm design, and representation transformation. Experiments reveal that current models face significant difficulties with these constructive tasks. Notably, we observe that general-purpose reasoning models consistently outperform specialized theorem provers. We attribute this to a degradation of instruction following capabilities in specialized models. Fine-tuning on proof corpora appears to induce ``tactical overfitting", compromising the ability to adhere to complex constructive requirements, whereas general models retain the versatility needed for multi-task formal reasoning.
Problem
Existing formal mathematics benchmarks focus on logical verification of declarative propositions, neglecting the task of explicitly synthesizing solutions. In applied mathematics, the goal is frequently to derive concrete values or executable algorithms rather than solely proving theorems.
Approach
The authors introduce AMBER (Applied Mathematics BEnchmaRk), a Lean 4 benchmark that shifts evaluation from existential verification to explicit solution construction. Tasks span numerical evaluation, algorithm design, and representation transformation across convex analysis, optimization, numerical linear algebra, and probability. Problems are sourced from classical textbooks and require agents to synthesize computable definitions and perform structural modeling before verification.
Results
At pass@16, the best models solve only a small fraction of problems: DeepSeek-V3.2-Thinking achieves 3/24 Easy problems in optimization and 5/26 in convex analysis, with 0 Hard problems solved across all domains. Specialized theorem provers (Goedel Prover-32B, Kimina Prover-72B) perform no better than general reasoning models, exhibiting a tactical overfitting phenomenon where proof-closing skills do not transfer to constructive tasks.
| Model | Opt Easy | Opt Hard | Convex Easy | Convex Hard | NumAlg Easy | NumAlg Hard | Prob Easy | Prob Hard |
|---|
| DeepSeek-V3.2-Thinking | 3/24 | 0/26 | 5/26 | 0/24 | 3/25 | 0/25 | 1/25 | 0/25 |
| GPT-5.1 | 2/24 | 0/26 | 3/26 | 0/24 | 3/25 | 0/25 | 1/25 | 0/25 |
| Gemini-3.0 Pro | 3/24 | 0/26 | 4/26 | 0/24 | 4/25 | 0/25 | 1/25 | 0/25 |
Pass@16 results across domains