ScatterAI
Issue #12 · March 26, 2026

Long video QA breaks when models ignore what the video is already telling them

Research

01 [Video QA] Long video QA breaks when models ignore what the video is already telling them

Query-only retrieval has a structural flaw. When a long video understanding system looks for relevant segments, it asks one question of each segment, "does this segment match the query?", and stops there. That ignores the video’s own internal logic: scenes that share visual context, temporal transitions that signal narrative shifts, and segments that are relevant only because they bridge other relevant segments, without ever directly mentioning the query.
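To make the flaw concrete, here is a minimal sketch of the query-only baseline, assuming precomputed embeddings and cosine similarity; the function name and setup are illustrative, not from the paper. The key property is that each segment is scored in isolation.

```python
import numpy as np

def query_only_retrieval(query_emb: np.ndarray,
                         segment_embs: np.ndarray,
                         k: int = 5) -> np.ndarray:
    # Score each segment against the query alone. Relations between
    # segments never enter the computation -- this is the flaw.
    q = query_emb / np.linalg.norm(query_emb)
    s = segment_embs / np.linalg.norm(segment_embs, axis=1, keepdims=True)
    scores = s @ q                        # (n,) cosine similarity per segment
    return np.argsort(scores)[::-1][:k]   # top-k by query match, nothing else
```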

VideoDetective reframes segment retrieval as a graph problem. Each video segment becomes a node; edges encode two signals simultaneously: visual similarity between segments and temporal proximity. The system then runs a Hypothesis-Verification-Refinement (HVR) loop. It forms an initial relevance hypothesis from the query, verifies it against the graph’s inter-segment affinity structure, then refines which segments to surface. The loop lets the model propagate relevance across the graph, so a segment adjacent (visually or temporally) to a high-relevance segment gets its score lifted even if the query alone wouldn’t flag it. This is the intrinsic structure the query-only baseline misses entirely.
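Below is a hedged sketch of that idea, not the paper’s implementation. The affinity definitions are assumptions (cosine similarity for the visual edge weight, an exponentially decaying function of index distance for the temporal one), and a personalized-PageRank-style diffusion stands in for the verification and refinement steps of the HVR loop: each iteration blends a segment’s own query score with the scores of its graph neighbors.

```python
import numpy as np

def build_segment_graph(segment_embs: np.ndarray,
                        alpha: float = 0.5,
                        tau: float = 2.0) -> np.ndarray:
    # Affinity matrix mixing two signals: visual similarity between
    # segment embeddings and temporal proximity of segment indices.
    s = segment_embs / np.linalg.norm(segment_embs, axis=1, keepdims=True)
    visual = s @ s.T                                    # (n, n) cosine sims
    idx = np.arange(len(s))
    temporal = np.exp(-np.abs(idx[:, None] - idx[None, :]) / tau)
    A = alpha * visual + (1 - alpha) * temporal
    np.fill_diagonal(A, 0.0)                            # no self-edges
    return A / A.sum(axis=1, keepdims=True)             # row-normalize

def hvr_retrieval(query_scores: np.ndarray,
                  A: np.ndarray,
                  damping: float = 0.7,
                  iters: int = 10,
                  k: int = 5) -> np.ndarray:
    # Hypothesis: the raw query scores. Verify/refine: repeatedly diffuse
    # relevance over the graph, so a segment adjacent (visually or
    # temporally) to a high-relevance segment gets lifted even if the
    # query alone would not flag it.
    r = query_scores.copy()
    for _ in range(iters):
        r = damping * query_scores + (1 - damping) * (A @ r)
    return np.argsort(r)[::-1][:k]
```

With damping near 1 the result collapses back to the query-only ranking; lowering it lets the graph’s structure dominate, which is the tradeoff any propagation scheme like this has to tune.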

The practical implication is direct: for teams building long video QA pipelines, retrieval quality is the ceiling, and query-only retrieval leaves graph-structured evidence on the table. The HVR loop adds inference-time cost, but it replaces a structurally incomplete retrieval pass with one that uses the video’s own geometry.

Source: VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance fo