Builds the FormInv invariance benchmark on 103 Lean4-verified Mathlib4 theorems to measure LLM consistency across paraphrases.
Abstract
A paraphrase-quality audit of MathCheck (ICLR 2025) detected 4 semantically incorrect paraphrases in 129 groups (3.1%); removing them drops GPT-4o from rank 2 to rank 4 and elevates Claude Haiku and DeepSeek V3 above it; these ranking changes are invisible to any single-model evaluation. Cross-model unanimity found these errors automatically (>= 3/4 models for MathCheck; >= 6/9 for our primary evaluation) for under $10; in our own dataset the same protocol found that 47% of auto-generated connective-variation paraphrases were semantically incorrect. That flaw compounds a deeper measurement gap: Claude Haiku 4.5 achieves 86% accuracy yet SCR=50%, meaning half its theorems are answered differently under semantically equivalent restatements, while aggregate accuracy across 9 models spans only 86-96% yet Semantic Consistency Rates (SCR) span 50-82% -- a 32-point gap invisible to standard benchmarks. Formally, for any target ranking over 9 frontier models there exists a weighting over paraphrase families that realizes it (No-Free-Benchmark corollary), because no model Pareto-dominates all families -- so benchmark designers who select families are implicitly choosing which model wins. FormInv supplies the audit protocol (replicated on external benchmarks at 100% recall), SCR and per-theorem Cochran's Q as primary invariance measures evaluated on 9 models across 366-811 items (on Lean4-verified theorems), and FormInvSelector for regime-aware model selection.
Problem
Mathematical-reasoning benchmarks evaluate accuracy on canonical statements but do not test whether models answer consistently across semantically equivalent restatements, and programmatically generated paraphrases may themselves be wrong.
Approach
FormInv uses a curated substrate of 103 Lean4-verified Mathlib4 theorems across seven domains, expanded into paraphrase families, and measures a per-theorem Invariance Gap, Semantic Consistency Rate (SCR), and Cochran's Q. A cross-model unanimity protocol flags semantically incorrect paraphrases automatically.
Results
Aggregate accuracy across nine models spans 86-96% while SCR spans 50-82%, a 32-point gap invisible to standard accuracy. Auditing MathCheck flagged 4 of 129 groups (3.1%) as incorrect; removing them dropped GPT-4o from rank 2 to 4.
Figure 3: Per-theorem IG distribution across all 9 models and 103 theorems. The distribution is strongly bimodal: 79% of theorems have IG =0 (consistent responses across all paraphrase families) and 21% have IG \approx 0.42 (fail on 1–3 families).
Model
Accuracy
Mean IG
SCR
DeepSeek V3
96.4%
6.9%
82%
GPT-4o
94.2%
7.0%
82%
Claude Sonnet 4.6
93.2%
10.7%
74%
Claude Haiku 4.5
85.7%
—
50%
Accuracy vs Semantic Consistency Rate, selected models