02 [RAG] Coding Agents Fail at Real-World Optimization—and Current Benchmarks Can’t Even See It
Existing code benchmarks grade agents on whether code runs correctly, not whether it runs well. That distinction matters at the repository level, where the bottleneck is almost never correctness—it’s throughput, memory, and runtime under realistic workloads. Binary pass/fail signals are blind to this entirely.
FormulaCode exposes the gap with a benchmark built from 957 real performance bottlenecks mined from scientific Python repositories on GitHub. Each task is paired with an expert-authored patch and, on average, 264.6 community-maintained performance workloads: the actual execution profiles the original developers used to validate their own optimizations, not a synthetic test suite. Multi-objective metrics track runtime, memory consumption, and throughput simultaneously, so an agent that speeds up a function at the cost of a memory explosion scores accordingly. This is the first benchmark where “did it get faster?” has a precise, multi-dimensional answer tied to real-world code.
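To see why multi-dimensional scoring matters, here is a minimal sketch of what a multi-objective comparison can look like. The `Measurement` fields and ratio formulas are illustrative assumptions of mine, not FormulaCode's published scoring function:

```python
# Illustrative multi-objective scorer -- NOT FormulaCode's actual metric.
# Compares a patched run against the baseline on three axes instead of
# collapsing everything to pass/fail.
from dataclasses import dataclass

@dataclass
class Measurement:
    runtime_s: float    # wall-clock seconds for the workload
    peak_mem_mb: float  # peak resident memory in MB
    throughput: float   # items processed per second

def score(baseline: Measurement, patched: Measurement) -> dict:
    """Per-axis improvement ratios; values > 1.0 mean the patch helped."""
    return {
        "runtime": baseline.runtime_s / patched.runtime_s,
        "memory": baseline.peak_mem_mb / patched.peak_mem_mb,
        "throughput": patched.throughput / baseline.throughput,
    }

# A patch that halves runtime but triples memory shows up as a tradeoff,
# not a win:
ratios = score(
    Measurement(runtime_s=2.0, peak_mem_mb=100.0, throughput=500.0),
    Measurement(runtime_s=1.0, peak_mem_mb=300.0, throughput=900.0),
)
print(ratios)  # {'runtime': 2.0, 'memory': 0.33..., 'throughput': 1.8}
```

Under a binary harness, the patch above simply "passes"; a per-axis report is what makes the memory regression visible.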
The results are sobering. Current large language model (LLM) coding agents struggle on FormulaCode in ways that synthetic benchmarks never surface: agents frequently propose correct patches that fail to move the performance needle, or optimize one metric while degrading another. The benchmark’s fine-grained scoring makes those tradeoffs visible. For teams building or evaluating coding agents for production use, including code review automation, performance regression detection, and repository-level refactoring, FormulaCode provides a credibility test that SWE-bench-style correctness evaluation cannot provide.
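The "correct but not faster" failure mode is easy to picture. This toy example is my own illustration, not a FormulaCode task: two patches pass the same correctness check, but only one would be rewarded by a performance-aware harness.

```python
# Hypothetical illustration of the failure mode: both patches pass the
# same correctness assertion, but only one changes the performance profile.
import timeit
import numpy as np

def baseline(xs):
    out = []
    for x in xs:
        out.append(x * x)
    return out

# "Correct" agent patch: passes tests, but it is still O(n) work in the
# Python interpreter, so a pass/fail benchmark scores it the same as a
# real optimization.
def agent_patch(xs):
    return [x * x for x in xs]

# Actual optimization: pushes the loop into NumPy's compiled internals.
def optimized(xs):
    return (np.asarray(xs) ** 2).tolist()

xs = list(range(100_000))
assert agent_patch(xs) == optimized(xs) == baseline(xs)  # all "correct"

# Only a timed workload distinguishes them:
print(timeit.timeit(lambda: agent_patch(xs), number=20))
print(timeit.timeit(lambda: optimized(xs), number=20))
```

A correctness-only evaluation accepts `agent_patch` and `optimized` equally; a workload-based one does not.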
One limitation to flag: the benchmark draws from scientific Python repositories specifically, which skew toward numerical computing and array operations. Performance optimization patterns in web services, database access layers, or systems code may not be well represented. Agents that score well here aren’t guaranteed to transfer.
Key takeaways:
- 957 real performance bottlenecks from GitHub, each evaluated against 264.6 workloads on average, with multi-objective metrics across runtime, memory, and throughput, making agent tradeoffs measurable for the first time
- Binary correctness evaluation systematically hides the most common failure mode in production coding agents: code that passes tests but doesn’t actually improve performance
- Teams evaluating LLM coding agents for performance-sensitive applications should run FormulaCode before trusting benchmark numbers from correctness-only evaluations
Source: Evaluating Agentic Optimization on Large Codebases