ScatterAI
Issue #8 · March 18, 2026

Most researchers are using AI wrong — here's the five-level map that shows why

Research

01 [Agent] Most researchers are using AI wrong — here’s the five-level map that shows why

The gap between “I use ChatGPT to fix code” and “I run autonomous research agents overnight” is enormous, but there’s no shared map for where any given workflow sits on that spectrum. Most practitioners operate somewhere in the middle without a clear vocabulary for what they’re doing, what risks they’re taking, or what the next level of integration actually looks like.

This guide structures AI-assisted research as a five-level taxonomy, moving from Level 1 (single-turn query answering) through Level 5 (fully autonomous multi-day research loops). The levels aren't arbitrary — each step transfers more epistemic responsibility to the agent and introduces qualitatively different failure modes. The framework targets CLI (Command-Line Interface) coding agents such as Claude Code, Codex CLI, and OpenCode, and converts them into autonomous research assistants via methodological rules formulated as agent prompts. The rules encode researcher intent as structured constraints: what the agent may modify, how it should report uncertainty, when it must halt and verify. Case studies span deep learning experiments and formal mathematics, two domains with very different ground-truth verification structures.
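To make "researcher intent as structured constraints" concrete, here is a minimal sketch of what such a rules layer could look like in code. All names and fields (`MethodologyRules`, `may_modify`, `must_halt_on`) are illustrative assumptions, not the guide's actual format — the guide expresses these rules as agent prompts rather than code.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch

# Hypothetical encoding of the three constraint types named in the guide:
# what the agent may modify, how it reports uncertainty, when it must halt.
@dataclass
class MethodologyRules:
    may_modify: list = field(default_factory=list)    # glob patterns the agent may edit
    must_halt_on: list = field(default_factory=list)  # events requiring human sign-off
    uncertainty_note: str = "Label every claim as verified or unverified."

    def allows_edit(self, path: str) -> bool:
        """True only if the path matches an allowed pattern."""
        return any(fnmatch(path, pat) for pat in self.may_modify)

    def requires_halt(self, event: str) -> bool:
        """True if the researcher flagged this event for verification."""
        return event in self.must_halt_on

rules = MethodologyRules(
    may_modify=["experiments/*.py", "notes/*.md"],
    must_halt_on=["commit_result", "delete_data"],
)

print(rules.allows_edit("experiments/train.py"))  # True
print(rules.allows_edit("paper/main.tex"))        # False
print(rules.requires_halt("commit_result"))       # True
```

The point of the dataclass form is only that the constraints are machine-checkable at each step of the agent loop, rather than buried in free-text instructions the agent can drift away from.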

The practical wedge here is the methodology rules layer. Raw CLI agents will happily run experiments, overwrite files, and generate plausible-looking LaTeX proofs with no epistemic safeguards. The prompt-level guardrails act as a lightweight institutional review process embedded directly in the agent loop, catching cases where the agent is about to commit a result it hasn't actually verified. For mathematics, verification is formal and binary. For ML, it's murkier: an agent that reruns an experiment until it gets a favorable number is doing something the field has not yet settled on a name for.
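The asymmetry between the two domains can be sketched as a verification gate. This is a hypothetical illustration, not the guide's implementation: the function name, the `evidence` fields, and the rerun heuristic are all invented for the example.

```python
# Hypothetical verification gate sitting between "agent has a result"
# and "result gets committed", reflecting the two domains' differing
# ground-truth structures.
def gate_result(domain: str, evidence: dict) -> str:
    """Return 'commit' or 'halt' for a proposed result."""
    if domain == "math":
        # Formal verification is binary: the proof either checks or it doesn't.
        return "commit" if evidence.get("proof_checked") else "halt"
    if domain == "ml":
        # Murkier: flag the rerun-until-favorable pattern for human review
        # instead of silently accepting the best run.
        runs = evidence.get("runs", [])
        if len(runs) > 1 and evidence.get("reported") == max(runs):
            return "halt"
        return "commit" if runs else "halt"
    return "halt"  # unknown domain: default to human verification

print(gate_result("math", {"proof_checked": True}))   # commit
print(gate_result("math", {"proof_checked": False}))  # halt
print(gate_result("ml", {"runs": [0.71, 0.74, 0.79], "reported": 0.79}))  # halt
```

The `halt` default on the unknown-domain branch mirrors the guide's framing: when the agent cannot establish ground truth itself, the safe action is to stop and hand the decision back to the researcher.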

The limitation is real and the authors acknowledge it: this is a practitioner guide, not an empirical study. There are no controlled comparisons between taxonomy levels, no quantified productivity gains, no ablations on which guardrail rules matter most. The value is conceptual scaffolding and reproducible tooling, not benchmark numbers.

For teams already running agentic coding workflows, the taxonomy provides a diagnostic. If your current setup has no explicit rules governing when the agent halts for human verification, you’re probably at Level 3 or below regardless of how autonomous it feels.

Source: The Agentic Researcher: A Practical Guide to AI-Assisted Research in Mathematics and Machine Learning