ScatterAI
Issue #5 · March 15, 2026

LLMs That Ace Math Olympiads Collapse on Real Cryptographic Code Proofs

Research


Competition math benchmarks have become the default signal for reasoning capability. A new benchmark built from AWS’s production cryptographic library exposes a structural gap: models that perform well on abstract theorem proving consistently fail when asked to reason about real assembly code.

The benchmark derives from s2n-bignum, an industrial cryptographic library whose assembly routines are formally verified in HOL Light (a proof assistant for machine-verified mathematics). The verification task splits into two subtasks: writing precise behavioral specifications for low-level assembly routines, then constructing correctness proofs against those specs. Both require grounding symbolic reasoning in the messy specifics of real implementation — register states, memory layouts, arithmetic overflow behavior — rather than idealized mathematical objects.
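To make concrete what "grounding symbolic reasoning in real implementation details" means, here is a minimal illustrative sketch (not from the benchmark itself): a simulated 64-bit add-with-carry step, the kind of primitive that bignum assembly routines are built from, checked against a precise behavioral specification relating machine arithmetic to exact integer arithmetic. All names here are hypothetical.

```python
MASK64 = (1 << 64) - 1  # width of a 64-bit machine register

def adc64(a: int, b: int, carry_in: int):
    """Simulate a 64-bit add-with-carry: returns (result, carry_out)."""
    total = a + b + carry_in
    return total & MASK64, total >> 64  # low 64 bits, overflow bit

def spec_holds(a: int, b: int, carry_in: int) -> bool:
    """Behavioral spec: result + 2^64 * carry_out == a + b + carry_in."""
    result, carry_out = adc64(a, b, carry_in)
    return result + (carry_out << 64) == a + b + carry_in

# Exercise the spec at the overflow boundary, where idealized integer
# reasoning and real register arithmetic diverge.
for a, b, c in [(MASK64, MASK64, 1), (MASK64, 0, 0), (2**63, 2**63, 0)]:
    assert spec_holds(a, b, c)
```

A HOL Light proof of a real s2n-bignum routine must establish exactly this kind of correspondence, but for full register states and memory layouts rather than a single arithmetic step.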

This is one library, one verification framework, one domain. Generalization claims beyond cryptographic assembly code need separate evidence. For teams evaluating LLMs on code reasoning tasks, this benchmark closes a gap that competition math datasets leave wide open: strong Pass@1 on AIME or MATH does not predict whether a model can reason about what a specific assembly routine actually does.

Key takeaways:

- Strong scores on competition math benchmarks (AIME, MATH) do not predict the ability to reason about what real assembly code actually does.
- The benchmark derives from s2n-bignum, AWS's production cryptographic library, whose routines are formally verified in HOL Light.
- Both subtasks, writing behavioral specifications and constructing correctness proofs, require grounding in register states, memory layouts, and overflow behavior.
- Scope caveat: one library, one verification framework, one domain. Generalization beyond cryptographic assembly needs separate evidence.

Source: s2n-bignum-bench: A Practical Benchmark for Evaluating Low-Level Code Reasoning of LLMs