← All papers
First page of FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks

FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks

Nishal Thomas, Noel Thomas

cs.LG May 27, 2026 · v1 cs.AI
Builds the FormInv invariance benchmark on 103 Lean4-verified Mathlib4 theorems to measure LLM consistency across paraphrases.
A paraphrase-quality audit of MathCheck (ICLR 2025) detected 4 semantically incorrect paraphrases in 129 groups (3.1%); removing them drops GPT-4o from rank 2 to rank 4 and elevates Claude Haiku and DeepSeek V3 above it; these ranking changes are invisible to any single-model evaluation. Cross-model unanimity found these errors automatically (>= 3/4 models for MathCheck; >= 6/9 for our primary evaluation) for under $10; in our own dataset the same protocol found that 47% of auto-generated connective-variation paraphrases were semantically incorrect. That flaw compounds a deeper measurement gap: Claude Haiku 4.5 achieves 86% accuracy yet SCR=50%, meaning half its theorems are answered differently under semantically equivalent restatements, while aggregate accuracy across 9 models spans only 86-96% yet Semantic Consistency Rates (SCR) span 50-82% -- a 32-point gap invisible to standard benchmarks. Formally, for any target ranking over 9 frontier models there exists a weighting over paraphrase families that realizes it (No-Free-Benchmark corollary), because no model Pareto-dominates all families -- so benchmark designers who select families are implicitly choosing which model wins. FormInv supplies the audit protocol (replicated on external benchmarks at 100% recall), SCR and per-theorem Cochran's Q as primary invariance measures evaluated on 9 models across 366-811 items (on Lean4-verified theorems), and FormInvSelector for regime-aware model selection.

Mathematical-reasoning benchmarks evaluate accuracy on canonical statements but do not test whether models answer consistently across semantically equivalent restatements, and programmatically generated paraphrases may themselves be wrong.

FormInv uses a curated substrate of 103 Lean4-verified Mathlib4 theorems across seven domains, expanded into paraphrase families, and measures a per-theorem Invariance Gap, Semantic Consistency Rate (SCR), and Cochran's Q. A cross-model unanimity protocol flags semantically incorrect paraphrases automatically.

Aggregate accuracy across nine models spans 86-96% while SCR spans 50-82%, a 32-point gap invisible to standard accuracy. Auditing MathCheck flagged 4 of 129 groups (3.1%) as incorrect; removing them dropped GPT-4o from rank 2 to 4.

Figure 3: Per-theorem IG distribution across all 9 models and 103 theorems. The distribution is strongly bimodal: 79% of theorems have IG =0 (consistent responses across all paraphrase families) and 21% have IG \approx 0.42 (fail on 1–3 families).
ModelAccuracyMean IGSCR
DeepSeek V396.4%6.9%82%
GPT-4o94.2%7.0%82%
Claude Sonnet 4.693.2%10.7%74%
Claude Haiku 4.585.7%50%
Accuracy vs Semantic Consistency Rate, selected models