Towards Automated Formal Verification of Backend Systems with LLMs
Kangping Xu, Yifan Luo, Yang Yuan, Andrew Chi-Chih Yao
cs.SE
Apr 13, 2025 · v1
TL;DR
Translates Scala backend code into formal Lean representations, auto-generates behavioral theorems, and verifies them with LLM-based provers.
Abstract
Software testing plays a critical role in ensuring that systems behave as intended. However, existing automated testing approaches struggle to match the capabilities of human engineers due to key limitations such as test locality, lack of general reliability, and business logic blindness. In this work, we propose a novel framework that leverages functional programming and type systems to translate Scala backend code into formal Lean representations. Our pipeline automatically generates theorems that specify the intended behavior of APIs and database operations, and uses LLM-based provers to verify them. When a theorem is proved, the corresponding logic is guaranteed to be correct and no further testing is needed. If the negation of a theorem is proved instead, it confirms a bug. In cases where neither can be proved, human intervention is required. We evaluate our method on realistic backend systems and find that it can formally verify over 50% of the test requirements, which suggests that half of a testing engineer's workload can be automated. Additionally, with an average cost of only $2.19 per API, LLM-based verification is significantly more cost-effective than manual testing and can be scaled easily through parallel execution. Our results indicate a promising direction for scalable, AI-powered software testing, with the potential to greatly improve engineering productivity as models continue to advance.
Problem
Software testing of backend systems requires extensive manual labor, suffers from test locality, and is blind to business logic. Existing automated testing approaches cannot match human engineers in coverage or reliability for verifying API and database correctness properties.
Approach
The framework translates Scala backend code into formal Lean representations using functional programming and type systems. It automatically generates theorems specifying intended behavior of APIs and database operations via an LLM-based formalizer, then uses LLM-based provers to verify these properties. Both positive theorems (expected behavior holds) and negative theorems (incorrect behavior fails) are generated.
Results
Across four backend projects (UserAuth, BankAccount, Email, TaxiApp), the pipeline achieves 100% accuracy in table and dependency formalization, and 94.6-100% accuracy in API formalization. Theorem generation achieves 93.5-100% accuracy for API requirement theorems and 81.7-100% for table property theorems.
| Project | Table Form. Acc% | API Form. Acc% | API Thm Acc% | Table Thm Acc% |
|---|
| UserAuth | 100.0 | 100.0 | 100.0 | 95.8 |
| BankAccount | 100.0 | 100.0 | 94.4 | 81.7 |
| Email | 100.0 | 94.6 | 98.4 | 84.2 |
| TaxiApp | 100.0 | 96.0 | 93.5 | 84.0 |
Formalization and theorem generation accuracy across projects