ScatterAI
Issue #3 · March 13, 2026

Imitation Learning Can't Teach Judgment — Agents Trained on Perfect Demos Fail Out-of-Distribution

Research

Setup

Imitation Learning (IL), in which an agent is trained to mimic human experts, is the standard way to build “foundation agents” for web browsing, software use, and robotics. The assumption is that if an agent sees enough “perfect” demonstrations, it will internalize the expert’s underlying logic. In practice, however, these agents often collapse when they encounter situations even slightly different from their training data.
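
To make the standard IL objective concrete, here is a minimal behavioral-cloning sketch in PyTorch, assuming a discrete action space; PolicyNet and behavioral_cloning_step are hypothetical names, not anything from the paper. Note what the loss sees: only the expert’s chosen action, never the cost of the alternatives.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Hypothetical policy: maps an observation to action logits."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

def behavioral_cloning_step(policy, optimizer, obs, expert_actions):
    """One IL update: push the policy toward the expert's action.

    The loss is supervised on correct actions only, so the policy
    receives no signal about how bad the other actions would be.
    """
    logits = policy(obs)  # (batch, n_actions)
    loss = nn.functional.cross_entropy(logits, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```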

What They Found

This paper identifies a fundamental “Judgment Gap” in IL-trained agents. Because they are trained only on correct actions, they never learn to distinguish a “good” action from a “catastrophic” one. When they drift off the expert trajectory and go out-of-distribution, they have no internal mechanism for evaluating which of their possible next steps is safest. In contrast, agents trained with Reinforcement Learning (RL) develop “judgment” because they have experienced failures and been penalized for them.
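
One way to picture the missing mechanism is at action-selection time. The sketch below is illustrative rather than the paper’s method: it assumes the agent also carries a critic that assigns each action a success-logit score (matching the training sketch later in this section), and the names act_with_judgment and q_floor are invented here. A pure-IL agent has only the policy’s argmax; the critic supplies the veto.

```python
import torch

@torch.no_grad()
def act_with_judgment(policy, critic, obs, q_floor: float = 0.0):
    """Pick the policy's preferred action, vetoing candidates the
    critic predicts are likely to fail. With a logit-valued critic,
    q_floor = 0.0 corresponds to a 50% predicted success probability.
    """
    logits = policy(obs)      # IL preference over actions, shape (n_actions,)
    scores = critic(obs)      # learned cost model, one score per action
    safe = scores >= q_floor  # mask out predicted-catastrophic actions
    if safe.any():            # if everything looks bad, fall back to raw argmax
        logits = logits.masked_fill(~safe, float("-inf"))
    return int(logits.argmax())
```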

How It Works

The researchers compared agents trained on 10,000 perfect trajectories against agents trained on 5,000 trajectories plus a “Critic” network that evaluated 5,000 failed attempts. Despite having less “expert” data, the RL-hybrid agents were 4x more resilient to unexpected UI changes. The authors conclude that learning “what not to do” contributes more to out-of-distribution generalization than learning “what to do” in a static environment.
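
The paper’s exact critic objective is not spelled out here, but a plausible minimal version is a per-action success classifier trained on both expert steps and steps drawn from failed attempts. In this sketch, CriticNet, critic_step, and the binary succeeded label (1.0 for expert data, 0.0 for failures) are all assumptions.

```python
import torch
import torch.nn as nn

class CriticNet(nn.Module):
    """Hypothetical critic: one success score (logit) per action."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

def critic_step(critic, optimizer, obs, actions, succeeded):
    """One update on mixed data: `succeeded` is 1.0 for steps from
    expert trajectories, 0.0 for steps from failed attempts. The
    critic learns 'what not to do' from exactly the data IL discards.
    """
    # Score of the action actually taken in each transition.
    scores = critic(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.binary_cross_entropy_with_logits(scores, succeeded)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```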

Why It Matters

This is a critical insight for companies building autonomous agents. Relying solely on “Golden Path” datasets will create agents that are brittle and dangerous in the real world. To build robust judgment, agents must be allowed to fail in sandboxed environments (RL) so they can build a cost model of their actions. Perfect demonstrations are a starting point, not the finish line. The implications are especially significant for safety-critical deployments, where agents operate in dynamic, unpredictable environments and purely offline learning approaches will fundamentally underperform.