Rosetta-PL: Propositional Logic as a Benchmark for Large Language Model Reasoning

Shaun Baek, Shaun Esua-Mensah, Cyrus Tsui, Sejan Vigneswaralingam, Abdullah Alali, Michael Lu, Vasu Sharma, Sean O'Brien, Kevin Zhu

cs.CL Mar 25, 2025 · v1

benchmark dataset logic theorem-proving-ml

Read PDF arXiv abstract

TL;DR

Builds the Rosetta-PL benchmark by translating a dataset of logical propositions sourced from Lean into a custom propositional language.

Abstract

Large Language Models (LLMs) are primarily trained on high-resource natural languages, limiting their effectiveness in low-resource settings and in tasks requiring deep logical reasoning. This research introduces Rosetta-PL, a benchmark designed to evaluate LLMs' logical reasoning and generalization capabilities in a controlled environment. We construct Rosetta-PL by translating a dataset of logical propositions from Lean into a custom logical language, which is then used to fine-tune an LLM (e.g., GPT-4o). Our experiments analyze the impact of the size of the dataset and the translation methodology on the performance of the model. Our results indicate that preserving logical relationships in the translation process significantly boosts precision, with accuracy plateauing beyond roughly 20,000 training samples. These insights provide valuable guidelines for optimizing LLM training in formal reasoning tasks and improving performance in various low-resource language applications.

Problem

LLMs are primarily trained on high-resource natural languages, limiting their effectiveness in low-resource settings and in tasks requiring deep logical reasoning. It is unclear whether performance on logical tasks is driven by genuine reasoning or surface-level pattern matching.

Approach

Rosetta-PL is a benchmark that translates logical propositions from Lean into a custom propositional language, then fine-tunes GPT-4o on this constructed language. Two translation keys are tested: one preserving relational integrity of logical operators and one using arbitrary mappings. The approach isolates logical reasoning from natural language complexity by evaluating in a controlled symbolic environment.

Results

Fine-tuned GPT-4o achieves 95.97% accuracy on seen data in the custom propositional language versus 76.0% on untranslated Lean. On unseen data, Translation Key 1 (preserving logical relationships) yields 80.5% accuracy compared to 63.1% for Translation Key 2 (arbitrary mapping). Performance saturates around 20,000 training examples.

Papers With

Rosetta-PL: Propositional Logic as a Benchmark for Large Language Model Reasoning