ScatterAI
Issue #4 · March 14, 2026

DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning

Research

Setup

Current Multimodal Large Language Models (MLLMs) for sports video understanding are narrow by design—limited to single sports, single tasks, or zero-shot approaches that never actually train on the domain. No existing end-to-end trained model handles the combination of high-speed motion, complex rule sets, and long temporal reasoning across multiple sports simultaneously. DeepSport fills this gap as the first end-to-end MLLM trained for multi-task, multi-sport video reasoning.

What They Found

How It Works

DeepSport is built on a multimodal foundation model extended with an agentic reinforcement learning framework: the model learns to decompose complex sports queries into reasoning steps and is rewarded based on answer correctness across tasks. Rather than fine-tuning separately on labeled examples for each task, the RL loop trains the model to plan, retrieve relevant temporal context from the video, and synthesize rule knowledge into a coherent answer. This agentic approach lets the model handle variable-length video inputs and open-ended question types without task-specific heads or pipelines.
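The plan–retrieve–answer–reward loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not DeepSport's actual implementation: the function names (`plan_steps`, `retrieve_context`, `answer_reward`, `run_episode`) and the simple binary correctness reward are hypothetical stand-ins for the paper's components.

```python
# Toy sketch of an agentic RL episode for sports video QA.
# All names and signatures here are illustrative assumptions,
# not DeepSport's real API.

def plan_steps(query: str) -> list[str]:
    """Decompose a sports query into reasoning steps (toy heuristic)."""
    return [
        f"locate relevant clip for: {query}",
        f"apply rule knowledge to: {query}",
        f"synthesize final answer for: {query}",
    ]

def retrieve_context(step: str, video_frames: list[str]) -> str:
    """Stub retrieval: return the first frame tag mentioned in the step,
    else fall back to the first frame."""
    for frame in video_frames:
        if frame in step:
            return frame
    return video_frames[0]

def answer_reward(predicted: str, gold: str) -> float:
    """Outcome-based reward: 1.0 if the answer matches, else 0.0."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0

def run_episode(query, video_frames, policy, gold):
    """One RL episode: plan, retrieve context per step, answer, score."""
    steps = plan_steps(query)
    contexts = [retrieve_context(s, video_frames) for s in steps]
    answer = policy(query, contexts)
    return answer, answer_reward(answer, gold)

# Example with a trivially correct stand-in policy:
answer, reward = run_episode(
    "Was the goal offside?",
    ["frame_012", "frame_087"],
    policy=lambda q, ctx: "yes",
    gold="yes",
)
```

In a real training loop, the scalar reward would feed a policy-gradient update of the model's weights; the sketch only shows how a single episode produces that reward from answer correctness rather than from per-task labels.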

Why It Matters