ScatterAI
Issue #7 · March 17, 2026

Industrial Crypto Benchmark Exposes the Gap Between Theorem Proving and Real Code Reasoning

Research

LLMs that solve Olympiad-level theorems still can’t reliably verify assembly routines. Neurosymbolic systems have logged strong results on competition-style mathematics benchmarks, but those benchmarks test abstract proof construction, not reasoning about what a specific piece of real-world code does at the machine level.

s2n-bignum-bench closes that gap by pulling directly from an industrial cryptographic library already deployed at AWS. The library, s2n-bignum, provides assembly routines for cryptographic operations whose correctness is formally verified in HOL Light, a proof assistant used for machine-checked mathematics. The benchmark tasks models with two distinct sub-problems: writing precise behavioral specifications for assembly routines, and constructing the formal proofs that those specifications hold. Both are required in real industrial verification workflows, and neither appears in standard theorem-proving benchmarks. The library’s assembly routines are low-level, performance-optimized, and exhibit behavior that diverges significantly from the structured algebraic reasoning that dominates math competition datasets.
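To make "behavioral specification" concrete, here is a minimal, hypothetical sketch in Python (not taken from the paper or from s2n-bignum itself): a toy two-limb addition routine alongside the property a specification would pin down. In the benchmark, such properties are stated in HOL Light over the machine-level semantics of the actual assembly (register contents, carry flags, memory layout), and the second sub-task is to prove that the routine satisfies them.

```python
# Illustrative only: a toy 2-limb (128-bit) addition, not an actual
# s2n-bignum routine, showing the kind of property a behavioral
# specification pins down.

MASK64 = (1 << 64) - 1

def bignum_add_2(x: list[int], y: list[int]) -> tuple[list[int], int]:
    """Reference behavior: z = x + y over two 64-bit limbs, plus carry-out."""
    carry, z = 0, []
    for xi, yi in zip(x, y):
        s = xi + yi + carry
        z.append(s & MASK64)
        carry = s >> 64
    return z, carry

def spec_holds(x: list[int], y: list[int], z: list[int], carry: int) -> bool:
    """The specification as a checkable property: the output limbs plus the
    carry-out must equal the mathematical sum of the inputs."""
    x_val = x[0] | (x[1] << 64)
    y_val = y[0] | (y[1] << 64)
    z_val = z[0] | (z[1] << 64)
    return z_val + (carry << 128) == x_val + y_val

if __name__ == "__main__":
    x, y = [MASK64, 7], [1, 0]   # forces a carry to propagate across limbs
    z, c = bignum_add_2(x, y)
    assert spec_holds(x, y, z, c)
    print(z, c)  # [0, 8] 0
```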

The limitation is both practical and conceptual. Models that excel at AIME or Lean-formalized mathematics haven’t learned to read assembly, reason about register states, or translate C-style memory semantics into formal logic — skills that are prerequisites here, not side effects of general reasoning ability. For teams building or evaluating code reasoning systems, this benchmark is worth running before claiming generalization to production software. Systems whose proof generation was trained or evaluated primarily on mathematical corpora should expect significant performance gaps on this task class.

Key takeaways:

- s2n-bignum-bench evaluates LLMs on two industrial verification tasks: writing behavioral specifications for assembly routines from AWS's s2n-bignum library, and proving those specifications in HOL Light.
- Strong performance on competition math and theorem-proving benchmarks does not transfer: reading assembly, tracking register state, and formalizing memory semantics are distinct skills the benchmark exposes.
- Teams claiming that code reasoning generalizes to production software should evaluate on this task class before making the claim.

Source: s2n-bignum-bench: A Practical Benchmark for Evaluating Low-Level Code Reasoning of LLMs