ScatterAI
Issue #8 · March 18, 2026

Coding Agents Fail at Real-World Optimization—and Current Benchmarks Can't Even See It

Research


Existing code benchmarks grade agents on whether code runs correctly, not whether it runs well. That distinction matters at the repository level, where the bottleneck is almost never correctness—it’s throughput, memory, and runtime under realistic workloads. Binary pass/fail signals are blind to this entirely.
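To make the blindness concrete, here is a minimal, hypothetical sketch (not drawn from FormulaCode itself): two implementations of the same function both pass a correctness-only check, so a pass/fail benchmark cannot distinguish them, while a simple timing comparison immediately can. The function names and workload size are illustrative assumptions.

```python
import time

def sum_squares_naive(n):
    # O(n) Python-level loop: correct, but slow for large n.
    total = 0
    for i in range(n):
        total += i * i
    return total

def sum_squares_fast(n):
    # Closed form for sum of squares 0..n-1: (n-1)*n*(2n-1)/6.
    return (n - 1) * n * (2 * n - 1) // 6

n = 200_000

# Correctness-only check: both variants pass, so a binary benchmark
# scores them identically.
assert sum_squares_naive(n) == sum_squares_fast(n)

# A performance-aware check exposes the gap the pass/fail signal hides.
t0 = time.perf_counter(); sum_squares_naive(n); t_naive = time.perf_counter() - t0
t0 = time.perf_counter(); sum_squares_fast(n); t_fast = time.perf_counter() - t0
print(f"naive: {t_naive:.4f}s  fast: {t_fast:.6f}s")
```

Both versions are "correct"; only the second is an optimization, and only a benchmark that measures execution profiles can say so.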

FormulaCode exposes the gap with a benchmark built from 957 real performance bottlenecks mined from scientific Python repositories on GitHub. Each task is paired with an expert-authored patch and an average of 264.6 community-maintained performance workloads per task: the actual execution profiles the original developers used to validate their own optimizations, not a synthetic test suite. Multi-objective metrics track runtime, memory consumption, and throughput simultaneously, so an agent that speeds up a function at the cost of memory explosion scores accordingly. This is the first benchmark where “did it get faster?” has a precise, multi-dimensional answer tied to real-world code.
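The shape of such multi-objective scoring can be sketched in a few lines. This is a hypothetical illustration of the idea, not FormulaCode's actual scoring formula or API: `Profile` and `score_patch` are invented names, and the relative-improvement metric is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Profile:
    """Execution profile of one workload run (illustrative fields)."""
    runtime_s: float
    peak_mem_mb: float
    throughput_ops: float

def score_patch(baseline: Profile, patched: Profile) -> dict:
    # Relative improvement per metric; positive means better.
    # Runtime and memory improve when they shrink; throughput when it grows.
    return {
        "runtime": (baseline.runtime_s - patched.runtime_s) / baseline.runtime_s,
        "memory": (baseline.peak_mem_mb - patched.peak_mem_mb) / baseline.peak_mem_mb,
        "throughput": (patched.throughput_ops - baseline.throughput_ops) / baseline.throughput_ops,
    }

base = Profile(runtime_s=2.0, peak_mem_mb=100.0, throughput_ops=500.0)
patch = Profile(runtime_s=1.0, peak_mem_mb=400.0, throughput_ops=900.0)
scores = score_patch(base, patch)
# Runtime halved (+0.5) and throughput up (+0.8), but peak memory
# quadrupled (-3.0): a single speedup number would hide the regression.
print(scores)
```

Tracking each dimension separately is what lets the benchmark penalize a patch that buys speed with a memory explosion.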

The results are sobering. Current LLM coding agents struggle on FormulaCode in ways that synthetic benchmarks never surface: they frequently propose correct patches that fail to move the performance needle, or optimize one metric while degrading another. The benchmark's fine-grained scoring makes those tradeoffs visible. For teams building or evaluating coding agents for production use (code review automation, performance regression detection, repository-level refactoring), FormulaCode provides a credibility test that SWE-bench-style correctness evaluation cannot provide.

One limitation to flag: the benchmark draws from scientific Python repositories specifically, which skew toward numerical computing and array operations. Performance optimization patterns in web services, database access layers, or systems code may not be well represented. Agents that score well here aren’t guaranteed to transfer.

Key takeaways:

- Pass/fail correctness benchmarks are blind to repository-level performance; FormulaCode scores runtime, memory, and throughput together.
- Tasks come from 957 real bottlenecks in scientific Python repositories, each paired with an expert patch and the community's own performance workloads rather than synthetic tests.
- Current agents often produce correct patches that yield no speedup, or improve one metric while regressing another.
- Scope caveat: the scientific-Python focus means strong scores may not transfer to web services, database layers, or systems code.

Source: Evaluating Agentic Optimization on Large Codebases