Fine-tunes a 2B model for NL-to-Lean4 autoformalization with a GRPO cycle-consistency reward.
Abstract
Autoformalization - automatically translating natural language mathematical texts into formal proof language such as Lean4 - can help accelerate AI-assisted mathematical research, be it via proof verification or proof search. I fine-tune Qwen3.5-2B with LoRA for natural language to Lean4 formalization on FineLeanCorpus and consider three training regimes: supervised fine-tuning (SFT) with curriculum learning (difficulty 1 to 10), SFT without curriculum ordering, and reinforcement learning using group relative policy optimization (GRPO) with a cycle consistency reward. Cycle consistency measures how well the meaning of a statement is preserved through a NL to Lean4 to NL' loop, computed as cosine similarity of off-the-shelf sentence embeddings. On an unseen subset of FineLeanCorpus (FLC) and on PutnamBench, RL substantially outperforms both SFT variants (mean cycle consistency 0.669 vs. 0.513 on FLC; 0.561 vs. 0.422 on PutnamBench), while increasing cross-entropy loss by only 0.011 nats, with minimal impact on formalization quality. Curriculum ordering provides no measurable benefit over shuffled training.
Problem
Autoformalization models often produce Lean4 code that compiles but is semantically incorrect, and it is unclear whether curriculum ordering or RL helps small models.
Approach
Qwen3.5-2B is fine-tuned with LoRA on FineLeanCorpus under three regimes: SFT with curriculum (difficulty 1-10), SFT without curriculum, and GRPO reinforcement learning with a cycle-consistency reward measuring NL->Lean4->NL' fidelity via sentence-embedding cosine similarity.
Results
RL substantially outperforms both SFT variants (mean cycle consistency 0.669 vs 0.513 on the FLC held-out set; 0.561 vs 0.422 on PutnamBench) while raising cross-entropy by only 0.011 nats. Curriculum ordering gives no measurable benefit over shuffled training.
Figure 2 : Mean cycle consistency by model and evaluation set. Error bars show \pm 1 std.