LongCat-Flash-Prover, a 560B-parameter MoE, performs native formal reasoning in Lean4 via agentic tool-integrated RL.
Abstract
We introduce LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of- Experts (MoE) model that advances Native Formal Reasoning in Lean4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities, i.e., auto-formalization, sketching, and proving. To facilitate these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, including generating a formal statement based on a given informal problem, producing a whole-proof directly from the statement, or a lemma-style sketch. During agentic RL, we present a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which aims to stabilize the MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for the policy staleness and the inherent train-inference engine discrepancies at both sequence and token levels. Additionally, we also incorporate theorem consistency and legality detection mechanisms to eliminate reward hacking issues. Extensive evaluations show that our LongCat-Flash-Prover sets a new state-of-the-art for open-weights models in both auto-formalization and theorem proving. Demonstrating remarkable sample efficiency, it achieves a 97.1% pass rate on MiniF2F-Test using only 72 inference budget per problem. On more challenging benchmarks, it solves 70.8% of ProverBench and 41.5% of PutnamBench with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.
Problem
Open-weights LLMs still struggle with native formal reasoning in Lean4, spanning autoformalization, proof sketching, and proving.
Approach
LongCat-Flash-Prover decomposes native formal reasoning into auto-formalization, sketching, and proving, expanded with a Hybrid-Experts Iteration Framework that synthesizes verified trajectories. Agentic RL uses Hierarchical Importance Sampling Policy Optimization (HisPO) with gradient masking, plus theorem-consistency and legality checks to curb reward hacking.
Figure 2: The overview of our hybrid-experts tool-integration synthesis pipeline. In this pipeline, we iteratively optimize three different experts (i.e., auto-formalizer, lemma-style sketcher, and prover), and use these experts to synthesize trajectories based on pre-defined formal reasoning capabilities. Given one problem in natural language, we first transform it into a formal statement in Lean
Results
The model sets a new open-weights state of the art, reaching 97.1% on MiniF2F-Test with a 72-attempt budget, and solving 70.8% of ProverBench and 41.5% of PutnamBench under no more than 220 attempts per problem.