← All papers

On Compositional Learning Behaviours in Formal Mathematics

Kevin Yandoka Denamganaï

cs.CL May 27, 2026 · v1

theorem-proving-ml benchmark

Read PDF arXiv abstract

TL;DR

Cross-evaluates ten Lean 4 theorem provers on compositional learning behaviours and miniF2F whole-proof performance.

Abstract

Self-evolving scientific agents capable of conquering the hard tail of formal mathematics require Compositional Learning Behaviours (CLBs) -- the capacity to ground and recombine novel symbolic structures in context, beyond mere recombination of prelearned atoms. We propose \textbf{S2B-LM}, an adaptation of the Symbolic Behaviour Benchmark that removes numerical processing as a confound and adds chain-of-thought scaffolding to elicit rather than merely probe latent CLB competency. Cross-evaluating ten Lean~4 theorem provers on CLB competency (adj-ZSCT) and miniF2F whole-proof performance, exact permutation tests establish a hierarchical necessity structure: search-heavy models cover the tractable bulk without detectable CLBs, yet every model breaking into the Olympiad-level tier (miniF2F $>75\%$) is among the five highest CLB scorers ($p=0.004$). After ruling out model scale as a confound, our results show that CLB competency is \emph{necessary but not sufficient} for the hard tail of formal mathematical verification.

Problem

Self-evolving agents for formal mathematics need Compositional Learning Behaviours (CLBs), the capacity to ground and recombine novel symbolic structures in context. It is unclear whether CLB competency is necessary for strong performance on formal theorem proving benchmarks.

Approach

The authors propose S2B-LM, an adaptation of the Symbolic Behaviour Benchmark that replaces continuous numerical stimuli with categorical domains and adds chain-of-thought scaffolding. They cross-evaluate ten Lean 4 theorem provers on CLB competency (adj-ZSCT) and miniF2F whole-proof performance using exact permutation tests.

Results

Search-heavy models cover the tractable bulk of miniF2F without detectable CLBs, but every model breaking into the Olympiad-level tier (miniF2F >75%) is among the five highest CLB scorers (p=0.004). Model scale does not explain this pattern (p_scale=0.220). CLB competency is necessary but not sufficient for elite formal verification performance.

Model	Size	Adj. ZSCT	miniF2F Pass@32
DeepSeek-Prover-V1	7B	0.0	46.1%
Goedel-Prover-SFT	7B	16.3	57.6%
Goedel-Prover-V2-8B	8B	50.0	84.6%

Model CLB and miniF2F results

Papers With

On Compositional Learning Behaviours in Formal Mathematics