ScatterAI
Issue #12 · March 26, 2026

Deep research agents do not need the internet; they need the right offline corpus

Research

02 [RAG] Deep research agents do not need the internet; they need the right offline corpus

Training a deep research agent on live web search creates a quiet tax: proprietary API costs accumulate, rate limits interrupt trajectory synthesis at scale, and the whole pipeline becomes impossible to reproduce. Most teams accept this as the cost of doing business. OpenResearcher runs the entire search-and-browse loop offline and matches web-connected baselines.

The architecture separates two concerns that most pipelines conflate. Corpus bootstrapping happens once: 15 million documents, indexed offline. After that, trajectory synthesis runs entirely through three explicit browser primitives (search, open, and find) against that static corpus. No live API calls, no rate limits, no per-query cost. GPT-OSS-120B (a large teacher model) generates over 97K trajectories, including a long-horizon tail in which individual trajectories exceed 100 tool calls. Supervised fine-tuning (additional training on specific task examples) of a 30B-A3B sparse MoE (Mixture of Experts) backbone on this data produces a research agent that matches or exceeds web-connected systems on deep research benchmarks, without touching the internet at inference time.
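To make the three primitives concrete, here is a minimal sketch of an offline browser over a static corpus. It is an illustration only, not OpenResearcher's implementation: the `OfflineBrowser` class, the in-memory dict corpus, the term-overlap ranking, and the `tool_calls` counter are all assumptions made for this example. A real pipeline at 15M documents would use a proper retrieval index.

```python
import re
from dataclasses import dataclass
from typing import Optional


def tokenize(text: str) -> set:
    """Lowercase word tokens; a deliberately crude stand-in for a real index."""
    return set(re.findall(r"\w+", text.lower()))


@dataclass
class OfflineBrowser:
    """Hypothetical search/open/find environment over a static corpus."""
    corpus: dict                      # doc_id -> full text (assumed layout)
    current_doc: Optional[str] = None
    tool_calls: int = 0               # counted so trajectory length can be measured

    def search(self, query: str, k: int = 3) -> list:
        """Return up to k doc ids ranked by the number of shared query terms."""
        self.tool_calls += 1
        terms = tokenize(query)
        scored = []
        for doc_id, text in self.corpus.items():
            score = len(terms & tokenize(text))
            if score > 0:
                scored.append((score, doc_id))
        # Highest score first; ties broken by doc id for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [doc_id for _, doc_id in scored[:k]]

    def open(self, doc_id: str) -> str:
        """Load a document from the corpus and make it the current page."""
        self.tool_calls += 1
        self.current_doc = doc_id
        return self.corpus[doc_id]

    def find(self, pattern: str) -> list:
        """Return sentences of the current page that contain the pattern."""
        self.tool_calls += 1
        if self.current_doc is None:
            return []
        sentences = self.corpus[self.current_doc].split(".")
        return [s.strip() for s in sentences if pattern.lower() in s.lower()]


# Usage: every call hits local memory, so there are no API costs or rate limits.
demo_corpus = {
    "doc-a": "Sparse mixture of experts models route tokens. Only a few experts run per token.",
    "doc-b": "Legal text changes slowly. Statutes are revised on a fixed schedule.",
}
browser = OfflineBrowser(demo_corpus)
hits = browser.search("mixture of experts")
if hits:
    browser.open(hits[0])
    print(browser.find("experts"))
print("tool calls so far:", browser.tool_calls)
```

Because the corpus is fixed, replaying the same sequence of calls returns the same results, which is what makes trajectories synthesized this way reproducible.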

The trade-off is freshness: a static 15M-document corpus goes stale. For domains where recency matters (competitive intelligence, breaking research fronts, live markets), offline synthesis has a ceiling that no amount of trajectory volume can fix. The approach is strongest for domains with stable knowledge bases (scientific literature, legal text, technical documentation) where freshness pressure is lower. For practitioners, the more immediate value may be the open pipeline itself rather than the specific model weights: reproducible trajectory synthesis at this scale, with a documented long-horizon tail, is a reusable scaffold for anyone training research agents on domain-specific corpora.

Source: OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis
