Setup
Vision-Language Models (VLMs) like GPT-4o or Claude 3.5 Sonnet often struggle with precise spatial reasoning: questions like “Where is the glass relative to the plate?” or “How many objects are to the left of the chair?”. While they can describe images fluently, their grasp of coordinates and relative positions is imprecise, leading to layout errors in complex scenes.
What They Found
The “CoCo” (Code-Conditioned) research demonstrates that requiring a VLM to generate code (such as SVG or a Python coordinate map) as an intermediate “scratchpad” step before answering a spatial question improves performance by a staggering 68.8%. Forcing the model to express relationships in a formal language mitigates the spatial fuzziness inherent in natural language descriptions.
How It Works
Instead of asking “Is the ball inside the box?”, the CoCo prompt instructs the model to first “Write a Python dictionary mapping all objects to their estimated bounding boxes” and then “Check if the ball’s coordinates are within the box’s coordinates.” This two-step process leverages the model’s strong code-generation capabilities to create an explicit mental model of the scene, which then constrains and informs the final natural language answer.
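To make that concrete, here is a minimal sketch of the kind of scratchpad code the model is prompted to produce. The object names and coordinates are illustrative placeholders a VLM might estimate for a given image, not values from the paper; the containment check itself is ordinary Python.

```python
# Step 1: map objects to estimated bounding boxes as (x_min, y_min, x_max, y_max).
# These coordinates are hypothetical estimates the model would emit for one image.
boxes = {
    "box":   (120, 200, 480, 520),
    "ball":  (210, 310, 290, 390),
    "plate": (500, 430, 700, 560),
}

def contains(outer, inner):
    """Return True if the inner box lies fully inside the outer box."""
    ox1, oy1, ox2, oy2 = outer
    ix1, iy1, ix2, iy2 = inner
    return ox1 <= ix1 and oy1 <= iy1 and ix2 <= ox2 and iy2 <= oy2

# Step 2: answer the spatial question from the explicit coordinates.
print("Is the ball inside the box?", contains(boxes["box"], boxes["ball"]))
```

Because the final answer is read off explicit coordinates rather than free-form description, the geometric relationship is checked rather than guessed.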
Why It Matters
This reveals that “reasoning” in VLMs is highly sensitive to the representational format used. Code acts as scaffolding that enforces logical and geometric consistency. For developers building robotics or UI-automation agents, this provides a practical “prompting primitive”: never ask a VLM to reason about space directly; always ask it to code the scene first. Companies deploying vision systems for warehouse automation or industrial inspection can adopt the technique immediately to reduce spatial-reasoning errors in production pipelines. The idea also generalizes beyond vision: any domain requiring explicit constraint satisfaction may benefit from intermediate code generation as a reasoning layer, potentially unlocking similar gains on other multimodal reasoning benchmarks.
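As a rough illustration of that prompting primitive, the sketch below builds a code-first prompt for a spatial question. The helper name, instruction wording, and chat-message format are assumptions for illustration, not the paper’s exact prompt; in a real multimodal call you would attach the image and pass the messages to whatever VLM client you use.

```python
# Hypothetical helper showing the "code the scene first" prompting primitive.
# It only constructs the prompt messages; it does not call any specific API.

def code_first_spatial_prompt(question: str) -> list[dict]:
    """Build a two-step, code-conditioned prompt for a spatial question."""
    scratchpad_instruction = (
        "Step 1: Write a Python dictionary mapping every visible object to an "
        "estimated bounding box (x_min, y_min, x_max, y_max) in pixel coordinates.\n"
        "Step 2: Using only those coordinates, write the Python check needed to "
        "answer the question, then state the answer."
    )
    return [
        {"role": "system",
         "content": "You answer spatial questions about images by reasoning in code."},
        {"role": "user",
         "content": f"{scratchpad_instruction}\n\nQuestion: {question}"},
    ]

messages = code_first_spatial_prompt("Is the ball inside the box?")
```

The key design choice is that the natural-language answer is only requested after the model has committed to explicit coordinates, so the formal representation constrains the final response.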