Combining Textual and Structural Information for Premise Selection in Lean
Premise selection in large formal libraries like Lean's Mathlib is a key bottleneck, and existing language-based methods treat premises in isolation, ignoring the dependency structure connecting them.
The authors present a graph-augmented approach that combines dense text embeddings of Lean formalizations (using ByT5) with graph neural networks over a heterogeneous dependency graph capturing state-premise and premise-premise relations. The GNN propagates structural information to refine text embeddings before scoring premise relevance.
On the LeanDojo Benchmark, the method outperforms the ReProver language-based baseline by over 25% across standard retrieval metrics (Recall@1, Recall@10, MRR). Ablation suggests the improvement may partly stem from different training paradigm choices rather than purely from graph structure utilization.
