
Why Agents Declare Victory Too Early

How to Prevent Premature Hand-ins

1. Externalize Termination Judgment

The agent should not be the one to judge whether it is done. The harness must run termination validation independently, taking runtime signals as input rather than the agent's confidence. State this explicitly in CLAUDE.md:

## Definition of Done
- Feature complete = end-to-end verification passed, not "code is written"
- Required verification levels:
  1. Unit tests pass
  2. Integration tests pass
  3. End-to-end flow verification passes
- Do not proceed to level 2 if level 1 fails
- Do not proceed to level 3 if level 2 fails
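
On the harness side, the same rule can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: `AgentReport`, `Verdict`, and `decide_done` are hypothetical names, and each verification level is assumed to be a zero-argument callable that returns True on success. The agent's self-reported confidence is accepted but never consulted.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class AgentReport:
    claims_done: bool
    confidence: float  # self-reported, 0.0 to 1.0; deliberately never consulted


@dataclass
class Verdict:
    done: bool
    reason: str
    premature_claim: bool  # the agent said "done" but verification disagreed


# A level is a name plus a zero-argument callable that returns True on pass.
Level = tuple[str, Callable[[], bool]]


def decide_done(report: AgentReport, levels: Sequence[Level]) -> Verdict:
    """Run the levels in order and stop at the first failure."""
    for name, check in levels:
        if not check():
            return Verdict(
                done=False,
                reason=f"{name} failed; later levels were not run",
                premature_claim=report.claims_done,
            )
    return Verdict(done=True, reason="all levels passed", premature_claim=False)
```

Because the loop returns at the first failing level, level 2 never runs when level 1 fails, which matches the ordering rule in the Definition of Done.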

2. Build a Three-Layer Termination Validation

  • Layer 1: Syntax and Static Analysis. Lowest cost, least information, but must pass. This is the bare minimum check—you must spell the words right before we look at anything else.
  • Layer 2: Runtime Behavior Verification. Test execution, app startup checks, critical path validation. This is the core evidence of completion. It's not enough to just write it; it must run.
  • Layer 3: System-Level Confirmation. End-to-end testing, integration validation, user scenario simulation. The final line of defense against premature declarations. It's not enough to run; it must run correctly.
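
The three layers can be illustrated on a toy target. In the sketch below, the "feature" is assumed to be a small Python script passed in as a string, and an exact match on its standard output stands in for system-level confirmation; a real harness would substitute its own linters, test runners, and end-to-end scenarios. The function name `validate` and the 30-second timeout are illustrative choices.

```python
import ast
import subprocess
import sys


def validate(source: str, expected_stdout: str) -> str:
    """Return the first failing layer, or "passed" if all three layers pass."""
    # Layer 1: static check. Cheapest, so it runs first.
    try:
        ast.parse(source)
    except SyntaxError:
        return "layer 1: syntax"

    # Layer 2: runtime behavior. The code must actually run to completion.
    proc = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=30,
    )
    if proc.returncode != 0:
        return "layer 2: runtime"

    # Layer 3: correctness. Running is not enough; the result must be right.
    if proc.stdout.strip() != expected_stdout.strip():
        return "layer 3: behavior"
    return "passed"
```

Each layer is more expensive and more informative than the one before it, so ordering them this way rejects cheap failures early.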

3. Design Good "Red Pen Markups" for Agents

OpenAI described a particularly effective pattern from its work with Codex: error messages written for agents should include fix instructions. Don't just draw a big red cross like a lazy grader; be like a good teacher and write "here's how you should change this" in the margins. Instead of "Test failed", write "Test failed: POST /api/reset-password returned 500. Check that the email service config exists in environment variables. The template file should be at templates/reset-email.html." Feedback this specific and actionable lets the agent correct itself without human intervention.
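
One way to make such messages routine is to have every check return a structured failure that carries its own fix hints. The sketch below assumes a hypothetical `CheckFailure` type; the field names and output layout are illustrative, not taken from any particular harness.

```python
from dataclasses import dataclass, field


@dataclass
class CheckFailure:
    summary: str                                    # what went wrong
    hints: list[str] = field(default_factory=list)  # how to fix it

    def render(self) -> str:
        """Format the failure as text that can be fed back to the agent."""
        lines = [f"Test failed: {self.summary}"]
        lines += [f"  Fix hint {i}: {hint}" for i, hint in enumerate(self.hints, 1)]
        return "\n".join(lines)


failure = CheckFailure(
    summary="POST /api/reset-password returned 500",
    hints=[
        "Check that the email service config exists in environment variables.",
        "The template file should be at templates/reset-email.html.",
    ],
)
print(failure.render())
```

Keeping the hints as a separate field makes it easy to notice checks that still report a bare failure with no guidance.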

4. Capture Runtime Signals

Effective runtime signals include:

  • Did the application successfully start and reach a ready state?
  • Did the critical feature paths execute successfully at runtime?
  • Were database writes, file operations, and other side effects correct?
  • Were temporary resources cleaned up?

Real-World Case

Task: Implement user password reset functionality. Involves database operations, email sending, and API endpoint modifications.

Premature hand-in path: The agent modifies the database schema, writes the API endpoint, adds the email template, runs the unit tests (they pass), and declares completion. The exam paper is completely filled out.

Actual point deductions: (1) End-to-end flow untested—the actual sending and verification of the reset link was never confirmed. (2) Database migration failed after partial execution, causing schema inconsistency. (3) Email service config was missing in the target environment.

Harness intervention: The harness enforced termination validation: (1) start the full app and verify that the reset endpoint is reachable; (2) execute the full reset flow; (3) verify database state consistency. All three defects surfaced within the session, where fixing them costs an estimated 5-10x less than fixing them after hand-in. The independent grader found the real issues.
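
Step (3), the database consistency check, is what catches a partially applied migration. As a rough sketch, assuming a SQLite database and a hypothetical `users` table (the case does not specify either; the column names here are invented for illustration), the harness can compare the columns that actually exist against the ones the migration was supposed to produce:

```python
import sqlite3

# Hypothetical target schema for the password-reset feature.
EXPECTED_COLUMNS = {
    "users": {"id", "email", "reset_token", "reset_token_expires_at"},
}


def missing_columns(
    conn: sqlite3.Connection, expected: dict[str, set[str]]
) -> dict[str, set[str]]:
    """Return, per table, the expected columns that are absent from the database."""
    missing = {}
    for table, columns in expected.items():
        # Table names come from the trusted constant above, not from user input.
        rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
        present = {row[1] for row in rows}  # row[1] is the column name
        gap = columns - present
        if gap:
            missing[table] = gap
    return missing
```

An empty result means the schema matches; anything else gives the agent a precise list of what the migration failed to create.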

Takeaways

  • Agents are systematically overconfident—confidence calibration bias is an objective reality. Filling out the exam paper doesn't mean you got it right.
  • Completion judgment must be externalized—the harness verifies independently; don't trust the agent's "feelings". Students cannot grade their own exams.
  • All three layers of validation are essential—syntax passing, behavior passing, system passing, progressing layer by layer.
  • Error messages should be like a good teacher's red pen markup—include specific fix steps so the agent can self-correct.
  • No refactoring until core functionality is verified—the completion priority constraint is the key to preventing premature optimization.

Exercises

  1. Termination Validation Function Design: Design a complete termination validation function for a task involving a database migration and an API modification. List the required runtime signals and the pass/fail criteria for each signal. Run it on a real task and record what hidden issues it finds.

  2. Calibration Bias Measurement: Choose 10 different types of coding tasks, and record the agent's self-reported completion confidence vs. the actual completion quality. Calculate the bias value and analyze its relationship with task complexity.

  3. Multi-Layer Defense Experiment: Run three configurations on the same set of tasks—(a) static analysis only, (b) add unit testing, (c) full three-layer validation. Compare the proportion of premature completion declarations and the number of uncaught defects.
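
For exercise 2, the text does not define the bias value. One simple option, assumed here, is the mean signed difference between self-reported confidence and measured quality, with both scored on a 0-to-1 scale. A positive result indicates overconfidence. The function name `calibration_bias` is illustrative.

```python
from statistics import mean


def calibration_bias(records: list[tuple[float, float]]) -> float:
    """Mean of (self-reported confidence - measured quality) across tasks.

    Each record is (confidence, quality), both assumed to lie in [0, 1].
    """
    if not records:
        raise ValueError("need at least one task record")
    return mean(confidence - quality for confidence, quality in records)
```

Grouping the records by task complexity and computing the bias per group is one way to approach the second half of the exercise.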
