← All papers
First page of FANS -- Formal Answer Selection for Natural Language Math Reasoning Using Lean4

FANS -- Formal Answer Selection for Natural Language Math Reasoning Using Lean4

Jiarui Yao, Ruida Wang, Tong Zhang

cs.CL Mar 5, 2025 · v2 cs.AI
FANS uses Lean4 to formally verify candidate answers for natural-language math reasoning and improve answer selection.
Large Language Models (LLMs) have displayed astonishing abilities in various tasks, especially in text generation, classification, question answering, etc. However, the reasoning ability of LLMs still faces many debates. The inherent ambiguity of Natural Language (NL) limits LLMs' ability to perform verifiable reasoning, making its answers lack coherence and trustworthy support. To tackle the above problems, we propose a novel framework named FANS: Formal ANswer Selection for Natural Language Math Reasoning Using Lean4. To the best of our knowledge, it is the first framework that utilizes Lean4 to enhance LLMs' NL math reasoning ability. In particular, given an NL math question and LLM-generated answers, FANS first translates it into Lean4 theorem statements. Then it tries to prove it using a Lean4 prover and verify it by Lean4. Finally, it uses the FL result to assist in answer selection. It enhances LLMs' NL math ability in providing a computer-verifiable solution for its correct answer and proposes an alternative method for answer selection beyond the reward model. Extensive experiments indicate the effectiveness of our framework. It can improve the accuracy rate of reward model enhanced LLMs in the MATH-500 dataset by at most 1.91% and AMC-23 by at most 8.33% on strong reward-model baselines. In some particular fields like number theory that Lean4 experts in, we can even select all correct solutions. The qualitative analysis also shows our framework can make NL results formally backed by Lean4 proofs. As a pioneering work in the corresponding field, we will open-source all our models and datasets to further boost the development of the field.

LLM natural-language math reasoning lacks verifiable support, so answer selection relies on heuristics like majority voting or reward models.

FANS translates an NL question and LLM-generated answers into Lean4 theorem statements, proves them with a Lean4 prover, and uses the verification result to assist answer selection, providing a formally backed alternative to reward-model scoring.

Figure 2: FANS Framework: The framework shown in the upper part first passes the Natural Language math questions and the LLM-generated answers to our Long CoT NL-to-FL translator. Subsequently, it invokes a prover to prove the translated Lean4 statements and uses the verifier to check whether the proofs are correct. The correct outputs are used for further answer selection as a verifiable foundati

FANS improves reward-model-enhanced selection accuracy by up to 1.91% on MATH-500 and 8.33% on AMC-23, and in number theory can select all correct solutions; relative gains reach 7.9% on MATH500 number theory and 20% on AMC23.

Figure 1: Comparision between FANS and majority vote, together with ORM and ORM + FANS method. From the results, we could see FANS based on ORM achieve the highest accuracies consistently across different base models and different test sets. In particular, FANS performs well on the sub-fields of number theory and algebra, which are better supported by Lean4 with its existing libraries.