01 [Video Gen] Long video QA breaks when models ignore what the video is already telling them
Query-only retrieval has a structural flaw. When a long video understanding system looks for relevant segments, it asks, “does this segment match the query?” and stops there. This ignores the video’s own internal logic: scenes that share visual context, temporal transitions that signal narrative shifts, and segments that are relevant because they bridge other relevant segments, even if they do not directly match the query.
VideoDetective reframes segment retrieval as a graph problem. Each video segment becomes a node; edges encode two signals simultaneously: visual similarity between segments and temporal proximity. The system then runs a Hypothesis-Verification-Refinement (HVR) loop. It forms an initial relevance hypothesis from the query, verifies it against the graph’s inter-segment affinity structure, then refines which segments to surface. The loop lets the model propagate relevance across the graph, so a segment adjacent (visually or temporally) to a high-relevance segment gets its score lifted even if the query alone wouldn’t flag it. This is the intrinsic structure the query-only baseline misses entirely.
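The propagation idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's HVR implementation: it stands in a simple iterative score-propagation step (in the style of label propagation) for the verification and refinement stages, and the names `build_affinity`, `propagate`, `alpha`, `tau`, and `lam` are hypothetical. Segment embeddings and query scores are toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def build_affinity(embeddings, alpha=0.5, tau=2.0):
    """Row-normalized affinity matrix over segments.

    Each edge mixes visual similarity (cosine of embeddings) with
    temporal proximity (exponential decay in index distance).
    alpha and tau are assumed hyperparameters, not from the source.
    """
    n = len(embeddings)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-loops
            visual = max(0.0, cosine(embeddings[i], embeddings[j]))
            temporal = math.exp(-abs(i - j) / tau)
            W[i][j] = alpha * visual + (1 - alpha) * temporal
        row_sum = sum(W[i])
        if row_sum > 0:
            W[i] = [w / row_sum for w in W[i]]
    return W

def propagate(query_scores, W, lam=0.5, iters=20):
    """Blend each segment's query score with its neighbors' scores.

    s <- (1 - lam) * query_scores + lam * W @ s, repeated iters times.
    """
    s = list(query_scores)
    for _ in range(iters):
        s = [(1 - lam) * q + lam * sum(w * sj for w, sj in zip(row, s))
             for q, row in zip(query_scores, W)]
    return s

# Toy video: segments 1 and 2 share visual context; only segment 2
# matches the query directly.
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
query_scores = [0.0, 0.0, 1.0, 0.0, 0.0]

W = build_affinity(embeddings)
refined = propagate(query_scores, W)
ranking = sorted(range(len(refined)), key=lambda i: refined[i], reverse=True)
print(ranking)
```

In this toy setup, segment 1 receives no score from the query alone, but it ends up ranked above the visually dissimilar segments because it is both adjacent to and visually similar to segment 2. A real system would replace the toy embeddings with features from a vision encoder and tune the mixing weights.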
The practical implication is direct: for teams building long video QA pipelines, retrieval quality is the ceiling on answer quality, and query-only retrieval leaves graph-structured evidence on the table. The HVR loop adds inference-time cost, but it replaces a structurally incomplete retrieval pass with one that uses the video’s own visual and temporal structure.
- Query-to-segment relevance and inter-segment affinity are solved jointly through a visual-temporal graph and an HVR loop; neither signal alone is sufficient for sparse clue localization in long video.
- Query-only retrieval assumes relevant segments are independently identifiable, which fails when relevance is distributed across a narrative arc or shared visual context.
- Teams building long video retrieval systems should audit whether their segment scorer has access to inter-segment structure, not just query similarity; the graph signal is cheap to construct relative to the retrieval errors it prevents.
Source: VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance fo