02 [RAG] LLMs That Ace Math Olympiads Collapse on Real Cryptographic Code Proofs
Competition math benchmarks have become the default signal for reasoning capability. A new benchmark built from AWS’s production cryptographic library exposes a structural gap: models that perform well on abstract theorem proving consistently fail when asked to reason about real assembly code.
The benchmark derives from s2n-bignum, an industrial cryptographic library whose assembly routines are formally verified in HOL Light (a proof assistant for machine-verified mathematics). The verification task splits into two subtasks: writing precise behavioral specifications for low-level assembly routines, then constructing correctness proofs against those specs. Both require grounding symbolic reasoning in the messy specifics of real implementation — register states, memory layouts, arithmetic overflow behavior — rather than idealized mathematical objects.
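The actual artifacts are HOL Light specifications and proofs over real ARM and x86 machine code. As a rough intuition for that spec-then-prove structure (not the benchmark's harness, and every name, the 4-limb layout, and the randomized check below are illustrative assumptions), here is a minimal Python sketch: an executable behavioral specification for a 256-bit addition, a limb-level model standing in for the assembly's register and carry-flag behavior, and a check of one against the other.

```python
# Illustrative sketch only: a Python analogue of the spec-then-verify workflow.
# s2n-bignum's real specs and proofs are HOL Light statements about machine
# code; all names and the 4-limb layout here are hypothetical.
import random

LIMB_BITS = 64
LIMB_MASK = (1 << LIMB_BITS) - 1
NUM_LIMBS = 4  # a 256-bit bignum stored as four little-endian 64-bit limbs


def bignum_from_limbs(limbs):
    """Behavioral meaning of the memory layout: little-endian 64-bit limbs."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def spec_add(a, b):
    """Specification: return (a + b) mod 2^256 together with the carry-out bit."""
    total = a + b
    return total % (1 << (LIMB_BITS * NUM_LIMBS)), total >> (LIMB_BITS * NUM_LIMBS)


def model_add(x_limbs, y_limbs):
    """Limb-by-limb model of what the assembly does: 64-bit adds with a carry
    propagated through a flag, mirroring register-level behavior."""
    out, carry = [], 0
    for x, y in zip(x_limbs, y_limbs):
        s = x + y + carry
        out.append(s & LIMB_MASK)   # value left in the destination register
        carry = s >> LIMB_BITS      # carry flag consumed by the next limb
    return out, carry


# The proof obligation, reduced here to a concrete randomized check: the
# limb-level model must agree with the specification on every input.
for _ in range(1000):
    x = [random.getrandbits(LIMB_BITS) for _ in range(NUM_LIMBS)]
    y = [random.getrandbits(LIMB_BITS) for _ in range(NUM_LIMBS)]
    got_limbs, got_carry = model_add(x, y)
    want_value, want_carry = spec_add(bignum_from_limbs(x), bignum_from_limbs(y))
    assert (bignum_from_limbs(got_limbs), got_carry) == (want_value, want_carry)
```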
This is one library, one verification framework, one domain. Generalization claims beyond cryptographic assembly code need separate evidence. For teams evaluating LLMs on code reasoning tasks, this benchmark closes a gap that competition math datasets leave wide open: strong Pass@1 on AIME or MATH does not predict whether a model can reason about what a specific assembly routine actually does.
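For reference, Pass@1 here is the usual completion-level metric: the probability that a single sampled attempt solves the problem. A common way to estimate it (and pass@k generally) is the unbiased estimator from Chen et al.'s code-evaluation work; the sketch below assumes that convention rather than anything specific to this benchmark.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n attempts of which c passed, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k = 1 this reduces to the empirical success rate c / n.
assert pass_at_k(n=8, c=2, k=1) == 0.25
```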
Key takeaways:
- Neurosymbolic reasoning strong enough for competition math breaks down on industrial code verification, because real implementation proofs require grounding in concrete computational behavior rather than abstract structure
- Success on mathematics benchmarks measures a capability distinct from low-level code reasoning, and the two should not be treated as proxies for each other
- Teams using theorem-proving performance as a signal for code reasoning ability should validate directly on implementation-level tasks; s2n-bignum-bench offers a production-derived test case for that evaluation
Source: s2n-bignum-bench: A Practical Benchmark for Evaluating Low-Level Code Reasoning of LLMs