Introduces Lean4PHYS with PhysLib and the LeanPhysBench benchmark of 200 Lean4 college-physics statements.
Abstract
We present **Lean4PHYS**, a comprehensive reasoning framework for college-level physics problems in Lean4. **Lean4PHYS** includes *LeanPhysBench*, a college-level benchmark for formal physics reasoning in Lean4, which contains 200 hand-crafted and peer-reviewed statements derived from university textbooks and physics competition problems. To establish a solid foundation for formal reasoning in physics, we also introduce *PhysLib*, a community-driven repository containing fundamental unit systems and theorems essential for formal physics reasoning. Based on the benchmark and Lean4 repository we composed in **Lean4PHYS**, we report baseline results using major expert Math Lean4 provers and state-of-the-art closed-source models, with the best performance of DeepSeek-Prover-V2-7B achieving only 16% and Claude-Sonnet-4 achieving 35%. We also conduct a detailed analysis showing that our *PhysLib* can achieve an average improvement of 11.75% in model performance. This demonstrates the challenging nature of our *LeanPhysBench* and the effectiveness of *PhysLib*. To the best of our knowledge, this is the first study to provide a physics benchmark in Lean4.
Problem
Formal-reasoning benchmarks target mathematics; no benchmark existed for college-level physics reasoning in Lean4.
Approach
Lean4PHYS provides PhysLib, a community-driven Lean4 repository of unit systems and fundamental theorems for formal physics, and LeanPhysBench, 200 hand-crafted, peer-reviewed Lean4 statements drawn from textbooks and competitions across difficulty levels.
Figure 1: Overview of the Lean4PHYS framework: (a) Benchmark and Library Construction: The benchmark and library are developed using a bottom-up principle. We first establish the foundational units and SI system of PhysLib , then compose the calculation support, and finally construct the field-related theorems. For the benchmark, NL questions are transformed from a question-answering format to pro
Results
The best prover, DeepSeek-Prover-V2-7B, reaches only 16% and Claude-Sonnet-4 reaches 35%, showing the benchmark is hard; PhysLib improves model performance by an average of 11.75%.