ScatterAI

Brief

AI research papers, explained for builders.

Thursday · 2026-03-26 · 4 entries
Paper 1

Long video QA breaks when models ignore what the video is already telling them

Most video QA systems fail on long videos because they match query words to segments in isolation, ignoring how scenes connect visually and temporally. VideoDetective treats the video as a graph where segments influence each other's relevance scores, letting it find clues that only make sense in context—fixing a fundamental flaw in how we retrieve answers from hours of footage.
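
A minimal sketch of the graph intuition (the propagation rule, weights, and toy scores below are illustrative, not the paper's algorithm): each segment starts with an isolated query-similarity score, then repeatedly blends in its neighbors' scores, so a weak match sandwiched between strong ones gets pulled up.

```python
# Illustrative only: propagate query relevance over a segment graph.
import numpy as np

def propagate_relevance(sim, adj, alpha=0.6, iters=10):
    """sim: (n,) per-segment query similarity; adj: (n, n) segment affinities."""
    # Row-normalize so each segment averages over its neighbors.
    P = adj / np.maximum(adj.sum(axis=1, keepdims=True), 1e-9)
    scores = sim.copy()
    for _ in range(iters):
        # Blend each segment's own evidence with its neighbors' current scores.
        scores = alpha * sim + (1 - alpha) * P @ scores
    return scores

# Toy chain of 5 segments: segment 1 matches the query weakly on its own
# but sits between two strong matches, so propagation lifts it.
sim = np.array([0.9, 0.15, 0.85, 0.1, 0.05])
adj = np.array([[0, 1, 0, 0, 0],
                [1, 0, 1, 0, 0],
                [0, 1, 0, 1, 0],
                [0, 0, 1, 0, 1],
                [0, 0, 0, 1, 0]], dtype=float)
print(propagate_relevance(sim, adj).round(3))
```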

Paper 2

Deep research agents do not need the internet; they need the right offline corpus

Paper 3

DoRA's memory wall breaks at high rank: a systems fix, not a math fix

Also Worth Noting — 2026-03-26

A new benchmark evaluates AI video-generating world models more rigorously by testing their temporal dynamics and object interactions.

Monday · 2026-03-23 · 1 entry
Paper 1

OpenAI's Safety Stack for Sora 2 Reveals How Hard Real-Time Video Moderation Actually Is

Real-time video generation breaks old safety tools designed for images—watermarks degrade under compression, and new user behaviors outpace single-layer defenses. OpenAI's Sora now combines prompt filtering, output classification, and platform enforcement across multiple layers to catch harmful content at scale, but developers building on video APIs can't rely on upstream safety alone.
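
As a hedged sketch of what "multiple layers" means in code (placeholder checks, not OpenAI's actual stack): each layer can refuse independently, and the output classifier inspects what was generated rather than what was requested. Platform-level enforcement (takedowns, rate limits) would sit outside this function.

```python
# Illustrative layered moderation pipeline; the checks are placeholders
# you would back with real prompt and frame classifiers.
from dataclasses import dataclass

@dataclass
class Verdict:
    allowed: bool
    layer: str
    reason: str = ""

def check_prompt(prompt: str) -> Verdict:
    # Layer 1: refuse before spending compute on generation.
    for term in ("minor", "real-person likeness"):
        if term in prompt.lower():
            return Verdict(False, "prompt_filter", f"matched '{term}'")
    return Verdict(True, "prompt_filter")

def check_frames(frames) -> Verdict:
    # Layer 2: classify what was actually generated, not what was asked for.
    worst = max(f["unsafe_score"] for f in frames)
    return Verdict(worst < 0.8, "output_classifier", f"max score {worst:.2f}")

def moderate(prompt, generate):
    verdict = check_prompt(prompt)
    if not verdict.allowed:
        return verdict
    return check_frames(generate(prompt))

fake_generate = lambda p: [{"unsafe_score": 0.1}, {"unsafe_score": 0.3}]
print(moderate("a dog surfing at sunset", fake_generate))
```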

Sunday · 2026-03-22 · 2 entries
Paper 1

3D reasoning failures in VLMs stem from perception issues, not language processing

Vision-language models struggle with 3D spatial reasoning because they lack training signal, not because they need richer input data. This work trains models to reconstruct scenes and understand their own position within them, enabling video-based AI systems and AR applications to reason about space without preprocessing geometric data at inference time.
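
One way to picture that training signal, as a rough sketch: bolt auxiliary heads onto per-frame features, one reconstructing coarse scene occupancy and one predicting the viewer's pose. Shapes, heads, and loss weighting below are invented stand-ins, not the paper's architecture.

```python
# Invented stand-in, not the paper's architecture: auxiliary heads that
# turn ordinary frame features into spatial training signal.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialHeads(nn.Module):
    def __init__(self, feat_dim=768, voxels=512):
        super().__init__()
        self.recon = nn.Linear(feat_dim, voxels)  # coarse occupancy logits
        self.pose = nn.Linear(feat_dim, 6)        # xyz + yaw/pitch/roll

def spatial_loss(heads, feats, occ_target, pose_target, w=0.5):
    # Reconstruct the scene AND predict where the camera is within it.
    recon_l = F.binary_cross_entropy_with_logits(heads.recon(feats), occ_target)
    pose_l = F.mse_loss(heads.pose(feats), pose_target)
    return recon_l + w * pose_l

heads = SpatialHeads()
feats = torch.randn(4, 768)       # per-frame VLM features
occ = torch.rand(4, 512).round()  # toy occupancy targets
pose = torch.randn(4, 6)          # toy pose targets
print(spatial_loss(heads, feats, occ, pose))
```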

Also Worth Noting — 2026-03-22

New method trains satellite image AI using free OpenStreetMap data instead of expensive labeled datasets.

Thursday · 2026-03-19 · 4 entries
Paper 1

Real websites will get your agent banned — synthetic clones will get it trained

VeriEnv lets AI agents train on synthetic website clones instead of real sites, eliminating bot detection blocks and unreliable LLM judges. Agents now get deterministic feedback by reading internal site state, making web automation training 10x safer and faster—perfect for companies building search tools and automation pipelines before deploying to production.
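
The deterministic-feedback idea boils down to grading the agent with a predicate over the clone's internal state instead of an LLM judge. A toy sketch (the SyntheticShop environment and its reward are invented for illustration):

```python
# `SyntheticShop` and its reward are invented; the point is that success
# is a deterministic predicate over the clone's internal state.
class SyntheticShop:
    def __init__(self):
        self.cart, self.orders = [], []

    def add_to_cart(self, sku):   # actions exposed to the agent
        self.cart.append(sku)

    def checkout(self):
        self.orders.append(list(self.cart))
        self.cart = []

def reward(env: SyntheticShop, task_sku: str) -> float:
    # No LLM judge: just check whether an order with the target SKU exists.
    return 1.0 if any(task_sku in order for order in env.orders) else 0.0

env = SyntheticShop()
env.add_to_cart("sku-123")        # an agent trajectory would land here
env.checkout()
print(reward(env, "sku-123"))     # 1.0, reproducibly
```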

Paper 2

The Search Agent Data Gap Has a Structural Fix — and the Numbers Behind It Are Now Public

Paper 3

Residual connections assume every layer matters equally — these results say they're wrong by design

Also Worth Noting — 2026-03-19

Deep AI models now retain early insights better using a new attention mechanism that prevents information loss across layers.

Wednesday · 2026-03-18 · 3 entries
Paper 1

Most researchers are using AI wrong — here's the five-level map that shows why

For the first time, we have a clear map of where AI-assisted research actually sits—from asking ChatGPT questions to running fully autonomous agents overnight. The key insight: most teams lack guardrails to stop agents from reporting plausible-looking false results, making verification itself the critical failure point that needs explicit rules built into the agent's instructions.

Paper 2

Coding Agents Fail at Real-World Optimization—and Current Benchmarks Can't Even See It

Also Worth Noting — 2026-03-18

New attention mechanism lets AI models access useful information from earlier processing layers, improving accuracy without larger models.

Tuesday · 2026-03-17 · 4 entries
Paper 1

Ensemble weighting that punishes disagreement outperforms static mixing in non-stationary sequential tasks

For ensemble models in shifting environments, a new weighting system tracks both individual performance and how much each model agrees with the others—penalizing those that drift from consensus. This catches failing specialists before their raw accuracy numbers do, and comes with formal guarantees that the approach won't fall too far behind an ideal fixed strategy, even as the optimal expert changes over time.
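
A minimal sketch of this flavor of update, using multiplicative weights with an added disagreement penalty (the penalty form and constants are illustrative, not the paper's):

```python
# Illustrative penalty form and constants; not the paper's exact update.
import numpy as np

def update_weights(w, losses, preds, eta=0.5, lam=0.3):
    """w: current weights; losses: per-expert loss this round;
    preds: per-expert predictions, used to measure drift from consensus."""
    consensus = np.average(preds, weights=w)
    drift = (preds - consensus) ** 2
    # Multiplicative-weights step that charges experts for both their own
    # loss and their disagreement with the weighted consensus.
    w = w * np.exp(-eta * (losses + lam * drift))
    return w / w.sum()

w = np.ones(3) / 3
preds = np.array([0.9, 0.85, 0.2])    # expert 2 strays from the others...
losses = np.array([0.10, 0.12, 0.10]) # ...though its raw loss looks fine
for _ in range(5):
    w = update_weights(w, losses, preds)
print(w.round(3))  # weight drains from the incoherent expert first
```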

Paper 2

Industrial Crypto Benchmark Exposes the Gap Between Theorem Proving and Real Code Reasoning

Paper 3

Low-Resource Languages Expose a Structural Gap in Code LLMs

Also Worth Noting — 2026-03-17

AI safety and ethics communities clash over governance, but understanding their conflict patterns could improve policy-making.

Sunday · 2026-03-15 · 4 entries
Paper 1

LLMs That Ace Math Olympiads Collapse on Real Cryptographic Code Proofs

When LLMs retrieve documents to answer questions, they excel at math puzzles but fail catastrophically on cryptographic proofs—even when the correct answer sits in their retrieved context. The problem: models trained on clean benchmarks don't learn to verify retrieved information against subtle real-world constraints, leaving production systems vulnerable to confident hallucinations on security-critical tasks.

Paper 2

Static ensemble weights fail in non-stationary environments, and coherence between models carries the signal you're missing

Paper 3

LLMs That Ace Python Collapse on a General-Purpose Language With Thin Training Data

Also Worth Noting — 2026-03-15

An AI search agent learns from past mistakes, improving its search strategy over time instead of starting fresh each session.

Saturday · 2026-03-14 · 4 entries
Paper 1

Text-to-image models fail at complex text because glyph templates were never in the loop

GlyphBanana lets AI image generators finally render complex text—formulas, CJK characters, mathematical symbols—by anchoring them with pre-made character templates instead of relying on training data that never existed. It works instantly on existing models without retraining, making it a direct solution for design tools and document generation systems that need reliable text in images.
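
The template idea can be pictured as rasterizing the exact glyphs and attaching the bitmap as extra conditioning, so the model copies shapes instead of hallucinating them. A sketch with the diffusion hookup left schematic (PIL's default bitmap font stands in for real glyph rendering):

```python
# Schematic: only the template-rendering step is concrete here; PIL's
# default bitmap font stands in for a real glyph renderer.
import numpy as np
from PIL import Image, ImageDraw

def glyph_template(text: str, size=(256, 64)) -> np.ndarray:
    # Rasterize the exact glyphs so the model can copy shapes it was
    # never trained to draw from scratch.
    img = Image.new("L", size, 0)
    ImageDraw.Draw(img).text((8, 8), text, fill=255)
    return np.asarray(img, dtype=np.float32) / 255.0

template = glyph_template("E = mc^2")
# Hand-waved hookup: stack the template with a latent as an extra
# conditioning channel for the (unmodified) generator.
latent = np.random.randn(*template.shape).astype(np.float32)
conditioned = np.stack([latent, template])
print(conditioned.shape)  # (2, 64, 256): channel, height, width
```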

Paper 2

DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning

For the first time, a single AI system can understand complex sports videos across multiple sports and tasks simultaneously—recognizing plays, interpreting rules, and analyzing tactics all at once. This works because the system learns through trial-and-error reasoning rather than memorization, enabling it to handle the fast motion and rule complexity that stump previous narrow models. Sports analytics teams and video AI researchers now have a unified blueprint replacing fragmented tool chains.

Paper 3

Governing Evolving Memory in LLM Agents: Risks, Mechanisms, and the Stability and Safety Governed Memory (SSGM) Framework

AI agents that remember conversations over time are becoming common, but no one has yet figured out how to stop those memories from getting corrupted, manipulated, or drifting into false beliefs. This paper introduces the first framework to actively protect evolving agent memory—catching contradictions before they're stored and flagging memories that slowly change meaning—making long-term AI agents actually trustworthy.
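
A stripped-down sketch of a governed write path, with trivial stand-ins for the NLI and embedding models the real framework would use (its actual mechanisms are richer than this gate):

```python
# Stand-ins: `contradicts` would be an NLI model, `embed` an embedding
# model; the gate logic is the part being illustrated.
import numpy as np

def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.normal(size=32)
    return v / np.linalg.norm(v)

def contradicts(a: str, b: str) -> bool:
    return ("open" in a and "closed" in b) or ("closed" in a and "open" in b)

class GovernedMemory:
    def __init__(self, drift_threshold=0.7):
        self.entries = []  # (text, embedding recorded at write time)
        self.drift_threshold = drift_threshold

    def write(self, fact: str) -> bool:
        # Gate 1: refuse facts that contradict what is already stored.
        if any(contradicts(fact, old) for old, _ in self.entries):
            return False
        self.entries.append((fact, embed(fact)))
        return True

    def drift_flags(self):
        # Gate 2: flag entries whose re-encoded meaning has moved away
        # from the embedding recorded when they were written.
        return [old for old, e0 in self.entries
                if float(embed(old) @ e0) < self.drift_threshold]

mem = GovernedMemory()
print(mem.write("the office is open on Fridays"))    # True
print(mem.write("the office is closed on Fridays"))  # False: contradiction
print(mem.drift_flags())                             # [] with a static embedder
```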

Also Worth Noting — 2026-03-14

A curated roundup of additional AI research papers worth tracking this week.

Friday · 2026-03-13 · 4 entries
Paper 1

Knowledge Graph RAG Breaks on Multi-Hop Questions — Entity Summaries Fix the Retrieval Phase

Knowledge graphs struggle to answer complex questions because indexing strips away context needed to trace connections across multiple steps. Entity-level summaries that preserve this context—built during indexing rather than at query time—restore the ability to answer "who founded the company that acquired X?" without graph traversal. This breaks the indexing bottleneck that's been silently capping multi-hop reasoning in knowledge graph systems.
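
The core move, sketched on a toy graph (triples and summarizer invented for illustration): gather every triple touching an entity once, at indexing time, so a single summary lookup answers what would otherwise require a multi-hop traversal.

```python
# Toy graph and summarizer, invented for illustration.
triples = [
    ("AcmeCorp", "acquired", "WidgetCo"),
    ("WidgetCo", "founded_by", "Ada Smith"),
    ("AcmeCorp", "headquartered_in", "Berlin"),
]

def entity_summary(entity: str) -> str:
    # Gather every triple touching the entity, in either direction, so
    # cross-hop context survives indexing.
    facts = [f"{h} {r} {t}" for h, r, t in triples if entity in (h, t)]
    return f"{entity}: " + "; ".join(facts)

# Build once at indexing time; retrieve summaries, not paths, at query time.
entities = {x for h, _, t in triples for x in (h, t)}
index = {e: entity_summary(e) for e in entities}
print(index["WidgetCo"])
# WidgetCo: AcmeCorp acquired WidgetCo; WidgetCo founded_by Ada Smith
# A retriever hitting this one summary can answer "who founded the company
# that AcmeCorp acquired?" without walking the graph.
```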

Paper 2

Using Code as Intermediate Representation Improves VLM Spatial Reasoning by 68.8%

AI image-understanding systems now accurately answer spatial questions like "where is the glass?" by first writing code to map object locations, boosting accuracy by 68.8%. This helps developers build more reliable robots and automation tools that need to understand physical layouts.
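
The shape of the idea, as a hedged sketch (detections and the generated snippet are invented; the model would emit the function, not the answer):

```python
# Invented detections and program; the VLM would emit `spatial_answer`,
# and the final answer comes from running it, not from free-form text.
detections = {            # detector output: object name -> (x, y) center
    "glass": (420, 310),
    "plate": (180, 300),
    "fork":  (120, 305),
}

def spatial_answer(objs):
    # Model-generated code: compare coordinates instead of guessing.
    gx, _ = objs["glass"]
    px, _ = objs["plate"]
    return "right of the plate" if gx > px else "left of the plate"

print("The glass is", spatial_answer(detections))
```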

Paper 3

Imitation Learning Can't Teach Judgment — Agents Trained on Perfect Demos Fail Out-of-Distribution

AI agents trained by copying human experts fail when conditions change slightly—they've never learned what *not* to do. New research shows agents need to experience and learn from failures in safe environments to develop real judgment, making them four times more resilient to unexpected situations.

Also Worth Noting — 2026-03-13

A curated roundup of additional AI research papers worth tracking this week.

Thursday · 2026-03-12 · 4 entries
Paper 1

Diffusion Models Don't Fail at Text Because They Can't Reason — They Fail Because They've Never Seen the Input

Text-to-image AI models fail at rendering complex text and formulas not because they can't reason, but because they've never encountered these inputs during training. GlyphBanana solves this by injecting character templates directly into the model's processing, bypassing the gap entirely—a practical tool for teams automating documents, scientific figures, and multilingual designs without retraining.

Paper 2

Unsupervised RLVR Hits a Ceiling Set by the Initial Distribution, Not Compute

A new study reveals that training AI systems through self-improvement has a hard limit set by the initial training data, not raw computing power. Once models exhaust the knowledge embedded in their starting point, they begin collapsing into repetitive, useless outputs—meaning better pre-training data is more critical than throwing more compute at the problem.

Paper 3

Sparse Attention Degrades Long-Form Quality in Ways Standard Perplexity Benchmarks Don't Catch

Sparse attention speeds up AI models for massive documents but secretly breaks their ability to connect ideas across long distances—while appearing perfect on standard tests. This discovery exposes a critical blind spot: efficiency tricks that look safe actually cripple reasoning on real long-document tasks, affecting anyone building document search or analysis systems.
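
A small numerical illustration of the blind spot (window size and shapes are arbitrary, and this is a generic sliding-window mask, not the paper's setup): positions whose context fits in the window produce outputs identical to full attention, so local perplexity-style metrics look clean, while distant token pairs become simply unreachable.

```python
# Generic sliding-window mask, not the paper's setup; sizes are arbitrary.
import numpy as np

def attention(q, k, v, mask):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)  # masked pairs get ~zero weight
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

n, d, window = 10, 8, 4
rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(n, d))
causal = np.tril(np.ones((n, n), dtype=bool))
local = causal & (np.arange(n)[:, None] - np.arange(n)[None, :] < window)

full_out = attention(q, k, v, causal)
sparse_out = attention(q, k, v, local)
# Positions whose whole context fits in the window are bit-identical, so
# a perplexity-style average barely moves...
print(np.abs(full_out - sparse_out)[:window].max())  # 0.0
# ...but the long-range edge from position 9 back to token 0 is gone:
print(local[9, 0])  # False
```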

Also Worth Noting — 2026-03-12

A curated roundup of additional AI research papers worth tracking this week.

Tuesday · 2026-03-10 · 4 entries
Paper 1

CBCT Tells You Where the Tissue Was. Ultrasound Tells You Where It Is Now.

Surgeons navigate using cone-beam CT (CBCT) scans that become outdated the moment a patient breathes or tissue shifts. This framework pairs CBCT with a robotic ultrasound probe that continuously tracks tissue movement in real time, automatically updating the surgical map without new scans. It turns static imaging into live, deformable guidance for abdominal surgery.

Paper 2

High-Noise Diffusion Steps Contain Low-Res Information — Processing at Full Resolution Is Wasted Compute

Diffusion models waste compute by processing images at full resolution during high-noise denoising steps that only carry low-resolution information. This research cuts computational cost by 40% by dynamically lowering resolution in the early steps and gradually increasing it as details emerge, enabling faster image generation on phones and cheaper server inference without sacrificing quality.
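
A sketch of what a coarse-to-fine schedule could look like (the schedule shape and quarter-resolution floor are illustrative choices, not the paper's exact curve):

```python
# Illustrative schedule: quarter resolution while noise dominates,
# full resolution only for the final detail-heavy steps.
def resolution_at(step: int, total: int, full: int = 1024) -> int:
    frac = step / max(total - 1, 1)   # 0 = noisiest step, 1 = final step
    scale = 0.25 + 0.75 * frac        # linear ramp from 1/4 to full size
    return int(full * scale) // 8 * 8 # keep sizes divisible for the VAE

steps = 20
for s in (0, 5, 10, 19):
    print(f"step {s:2d}: denoise at {resolution_at(s, steps)}px")
# step  0: denoise at 256px ... step 19: denoise at 1024px
```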

Paper 3

Factual Associations in LLMs Are Stored as Low-Rank Subspaces in Mid-Layer MLP Weights

Scientists pinpointed exactly where language models store facts—in tiny, compressed sections of mid-layer weights—enabling surgical corrections to individual false beliefs without damaging related knowledge. This breakthrough lets AI developers fix errors and update outdated information without expensive retraining, moving toward safer, more maintainable AI systems.
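
The low-rank picture has a classic corollary: a rank-1 update can rewrite one key-to-value mapping while barely disturbing orthogonal directions. A toy demonstration in the spirit of ROME-style editing (dimensions and vectors are stand-ins, not this paper's procedure):

```python
# Toy stand-ins; mirrors ROME-style rank-1 editing rather than this
# paper's exact procedure.
import numpy as np

rng = np.random.default_rng(0)
d = 64
W = rng.normal(scale=0.1, size=(d, d))          # a mid-layer MLP weight

k = rng.normal(size=d); k /= np.linalg.norm(k)  # key for one fact
v_new = rng.normal(scale=0.1, size=d)           # corrected value

# Rank-1 edit: afterwards W_edited @ k equals v_new exactly.
W_edited = W + np.outer(v_new - W @ k, k)

k_other = rng.normal(size=d); k_other /= np.linalg.norm(k_other)
print(np.allclose(W_edited @ k, v_new))                   # True
# Spillover onto an unrelated key is roughly |k . k_other| times the edit,
# which shrinks as facts occupy near-orthogonal directions:
print(np.linalg.norm(W_edited @ k_other - W @ k_other))
```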

Also Worth Noting — 2026-03-10

A curated roundup of additional AI research papers worth tracking this week.
