← All papers
First page of On Compositional Learning Behaviours in Formal Mathematics

On Compositional Learning Behaviours in Formal Mathematics

Kevin Yandoka Denamganaï

cs.CL May 27, 2026 · v1
Cross-evaluates ten Lean 4 theorem provers on compositional learning behaviours and miniF2F whole-proof performance.
Self-evolving scientific agents capable of conquering the hard tail of formal mathematics require Compositional Learning Behaviours (CLBs) -- the capacity to ground and recombine novel symbolic structures in context, beyond mere recombination of prelearned atoms. We propose \textbf{S2B-LM}, an adaptation of the Symbolic Behaviour Benchmark that removes numerical processing as a confound and adds chain-of-thought scaffolding to elicit rather than merely probe latent CLB competency. Cross-evaluating ten Lean~4 theorem provers on CLB competency (adj-ZSCT) and miniF2F whole-proof performance, exact permutation tests establish a hierarchical necessity structure: search-heavy models cover the tractable bulk without detectable CLBs, yet every model breaking into the Olympiad-level tier (miniF2F $>75\%$) is among the five highest CLB scorers ($p=0.004$). After ruling out model scale as a confound, our results show that CLB competency is \emph{necessary but not sufficient} for the hard tail of formal mathematical verification.

Self-evolving agents for formal mathematics need Compositional Learning Behaviours (CLBs), the capacity to ground and recombine novel symbolic structures in context. It is unclear whether CLB competency is necessary for strong performance on formal theorem proving benchmarks.

The authors propose S2B-LM, an adaptation of the Symbolic Behaviour Benchmark that replaces continuous numerical stimuli with categorical domains and adds chain-of-thought scaffolding. They cross-evaluate ten Lean 4 theorem provers on CLB competency (adj-ZSCT) and miniF2F whole-proof performance using exact permutation tests.

Search-heavy models cover the tractable bulk of miniF2F without detectable CLBs, but every model breaking into the Olympiad-level tier (miniF2F >75%) is among the five highest CLB scorers (p=0.004). Model scale does not explain this pattern (p_scale=0.220). CLB competency is necessary but not sufficient for elite formal verification performance.

ModelSizeAdj. ZSCTminiF2F Pass@32
DeepSeek-Prover-V17B0.046.1%
Goedel-Prover-SFT7B16.357.6%
Goedel-Prover-V2-8B8B50.084.6%
Model CLB and miniF2F results