Most researchers are using AI wrong — here's the five-level map that shows why

01 [Agent] Most researchers are using AI wrong — here’s the five-level map that shows why

The gap between “I use ChatGPT to fix code” and “I run autonomous research agents overnight” is enormous, but there’s no shared map for where any given workflow sits on that spectrum. Most practitioners operate somewhere in the middle without a clear vocabulary for what they’re doing, what risks they’re taking, or what the next level of integration actually looks like.

This guide structures AI-assisted research as a five-level taxonomy, moving from Level 1 (single-turn query answering) through Level 5 (fully autonomous multi-day research loops). The levels aren’t arbitrary — each step transfers more epistemic responsibility to the agent and introduces qualitatively different failure modes. The framework targets CLI (Command-Line Interface) coding agents such as Claude Code, Codex CLI, and OpenCode, converting them into autonomous research assistants via methodological rules formulated as agent prompts. The rules encode researcher intent as structured constraints: what the agent may modify, how it should report uncertainty, when it must halt and verify. Case studies span deep learning experiments and formal mathematics, two domains with very different ground-truth verification structures.

The practical wedge here is the methodology rules layer. Raw CLI agents will happily run experiments, overwrite files, and generate plausible-looking LaTeX proofs with no epistemic safeguards. The prompt-level guardrails act as a lightweight institutional review process embedded directly in the agent loop, catching cases where the agent is about to commit a result it hasn’t actually verified. For mathematics, verification is formal and binary. For ML, it’s murkier: an agent that reruns an experiment until it gets a favorable number is doing something the field hasn’t settled how to call.

The limitation is real and the authors acknowledge it: this is a practitioner guide, not an empirical study. There are no controlled comparisons between taxonomy levels, no quantified productivity gains, no ablations on which guardrail rules matter most. The value is conceptual scaffolding and reproducible tooling, not benchmark numbers.

For teams already running agentic coding workflows, the taxonomy provides a diagnostic. If your current setup has no explicit rules governing when the agent halts for human verification, you’re probably at Level 3 or below regardless of how autonomous it feels.

Key takeaways:

A five-level taxonomy maps AI research integration from passive query tools to autonomous multi-day agents; each level transfers more epistemic responsibility and introduces distinct failure modes requiring distinct guardrails
The real risk in agentic research workflows is unguarded verification: agents optimizing toward plausible outputs rather than true ones, with no structural mechanism to stop them
Teams running CLI agents on research tasks should audit whether their current prompt setup includes explicit halt-and-verify rules; if not, the open-source framework here is a ready starting point

Source: The Agentic Researcher: A Practical Guide to AI-Assisted Research in Mathematics and Machine Learning

Source: The Agentic Researcher: A Practical Guide to AI-Assisted Research in Mathematics