Learning an Effective Premise Retrieval Model for Efficient Mathematical Formalization
Yicheng Tao, Haotian Liu, Shanwen Wang, Hongteng Xu
cs.CL
Jan 21, 2025 · v1
TL;DR
Trains a premise retrieval model on Mathlib data that embeds Lean proof states and premises for mathematical formalization.
Abstract
Formalized mathematics has recently garnered significant attention for its ability to assist mathematicians across various fields. Premise retrieval, as a common step in mathematical formalization, has been a challenge, particularly for inexperienced users. Existing retrieval methods that facilitate natural language queries require a certain level of mathematical expertise from users, while approaches based on formal languages (e.g., Lean) typically struggle with the scarcity of training data, hindering the training of effective and generalizable retrieval models. In this work, we introduce a novel method that leverages data extracted from Mathlib to train a lightweight and effective premise retrieval model. In particular, the proposed model embeds queries (i.e., proof state provided by Lean) and premises in a latent space, featuring a tokenizer specifically trained on formal corpora. The model is learned in a contrastive learning framework, in which a fine-grained similarity calculation method and a re-ranking module are applied to enhance the retrieval performance. Experimental results demonstrate that our model outperforms existing baselines, achieving higher accuracy while maintaining a lower computational load. We have released an open-source search engine based on our retrieval model at
https://premise-search.com/. The source code and the trained model can be found at
https://github.com/ruc-ai4math/Premise-Retrieval.
Problem
Premise retrieval is a common step in mathematical formalization that is challenging for inexperienced users. Existing retrieval methods either require mathematical expertise for natural language queries or struggle with scarcity of formal-language training data.
Approach
The authors introduce a premise retrieval model for Lean 4 trained on data extracted from Mathlib4. The system uses a two-stage architecture: a BERT-based retrieval model computes similarity between proof states and premise embeddings (using arguments and goals), followed by a re-ranking model that reorders the top candidates. The retrieval model uses contrastive learning with hard negatives, and the re-ranking model uses a cross-encoder architecture. Pre-training on concatenated states and premises provides domain adaptation.
Results
The retrieval model selects top-100 candidates which are reordered by the re-ranker to produce final top-20 results. The method outperforms previous models on the Lean premise retrieval task. The authors developed a search engine based on the model and note that results with stronger provers (e.g., DeepSeek-Prover-V2) would further demonstrate the effect of premise selection on automated theorem proving.