How to Prevent Premature Hand-ins
1. Externalize Termination Judgment
The agent should not be the one to judge completion. The harness must run termination validation independently, taking runtime signals as input rather than the agent's confidence. State this explicitly in CLAUDE.md:
## Definition of Done
- Feature complete = end-to-end verification passed, not "code is written"
- Required verification levels:
  1. Unit tests pass
  2. Integration tests pass
  3. End-to-end flow verification passes
- Do not proceed to level 2 if level 1 fails
- Do not proceed to level 3 if level 2 fails
2. Build a Three-Layer Termination Validation
- Layer 1: Syntax and Static Analysis. Lowest cost, least information, but must pass. This is the bare minimum check—you must spell the words right before we look at anything else.
- Layer 2: Runtime Behavior Verification. Test execution, app startup checks, critical path validation. This is the core evidence of completion. It's not enough to just write it; it must run.
- Layer 3: System-Level Confirmation. End-to-end testing, integration validation, user scenario simulation. The final line of defense against premature declarations. It's not enough to run; it must run correctly.
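One way a harness might enforce this ordering is sketched below. It is a minimal illustration under stated assumptions, not a prescribed implementation: `run_termination_validation`, `is_done`, and `LayerResult` are hypothetical names, and each check is a placeholder callable standing in for a real linter, test runner, or end-to-end script.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A check returns (passed, detail). In a real harness each check would wrap
# a linter, a test runner, or an end-to-end script; these are placeholders.
Check = Callable[[], Tuple[bool, str]]


@dataclass
class LayerResult:
    layer: str
    passed: bool
    detail: str


def run_termination_validation(layers: List[Tuple[str, Check]]) -> List[LayerResult]:
    """Run the layers in order and stop at the first failure."""
    results: List[LayerResult] = []
    for name, check in layers:
        passed, detail = check()
        results.append(LayerResult(name, passed, detail))
        if not passed:
            break  # later layers are not attempted
    return results


def is_done(results: List[LayerResult], expected_layers: int) -> bool:
    """Done only if every layer ran and passed; the agent's claim is not an input."""
    return len(results) == expected_layers and all(r.passed for r in results)


if __name__ == "__main__":
    # Hypothetical layer outcomes, hard-coded for illustration.
    layers = [
        ("static analysis", lambda: (True, "lint clean")),
        ("runtime behavior", lambda: (False, "app did not reach ready state")),
        ("system-level", lambda: (True, "e2e passed")),
    ]
    results = run_termination_validation(layers)
    for r in results:
        print(f"{r.layer}: {'PASS' if r.passed else 'FAIL'} ({r.detail})")
    print("done:", is_done(results, expected_layers=len(layers)))
```

Note that `is_done` takes only the recorded results. Nothing the agent says about its own progress reaches the decision, which is the point of externalizing it.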
3. Design Good "Red Pen Markups" for Agents
OpenAI introduced a particularly effective pattern in its Codex work: error messages for agents should include fix instructions. Don't just draw a big red cross like a lazy grader; be like a good teacher and write "here's how you should change this" in the margins. Instead of "Test failed", write "Test failed: POST /api/reset-password returned 500. Check that the email service config exists in environment variables. The template file should be at templates/reset-email.html." Specific, actionable feedback like this lets the agent self-correct without human intervention.
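A harness could produce messages of this shape by keeping the observed failure and the fix hints together in one structure. The sketch below is an assumption about how that might look; `CheckFailure` and its fields are hypothetical names, not part of any tool mentioned above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CheckFailure:
    what_failed: str  # the observed symptom
    fix_hints: List[str] = field(default_factory=list)  # concrete next steps

    def render(self) -> str:
        """Format the failure as a message the agent can act on."""
        if not self.fix_hints:
            return f"Test failed: {self.what_failed}."
        lines = [f"Test failed: {self.what_failed}.", "How to fix:"]
        lines += [f"  - {hint}" for hint in self.fix_hints]
        return "\n".join(lines)


if __name__ == "__main__":
    failure = CheckFailure(
        what_failed="POST /api/reset-password returned 500",
        fix_hints=[
            "Check that the email service config exists in environment variables.",
            "The template file should be at templates/reset-email.html.",
        ],
    )
    print(failure.render())
```

Keeping the hints next to the check that can fail makes it harder to ship a bare "Test failed" by accident: a check with no hints is easy to spot in review.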
4. Capture Runtime Signals
Effective runtime signals include:
- Did the application successfully start and reach a ready state?
- Did the critical feature paths execute successfully at runtime?
- Were database writes, file operations, and other side effects correct?
- Were temporary resources cleaned up?
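The four questions above can be recorded as explicit fields so the harness decides on evidence rather than on the agent's report. The following is a minimal sketch: `RuntimeSignals` and `wait_until_ready` are hypothetical names, and the readiness probe is left as a caller-supplied function (for example, one that requests a health endpoint) because the source does not specify how readiness is exposed.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class RuntimeSignals:
    app_ready: bool
    critical_path_ok: bool
    side_effects_ok: bool
    temp_resources_cleaned: bool

    def all_ok(self) -> bool:
        return all((self.app_ready, self.critical_path_ok,
                    self.side_effects_ok, self.temp_resources_cleaned))


def wait_until_ready(probe: Callable[[], bool],
                     timeout_s: float = 10.0,
                     interval_s: float = 0.2) -> bool:
    """Poll `probe` until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # treat probe errors (e.g. connection refused) as "not ready yet"
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    # Stand-in probe that reports ready on its third call.
    attempts = {"n": 0}

    def fake_probe() -> bool:
        attempts["n"] += 1
        return attempts["n"] >= 3

    ready = wait_until_ready(fake_probe, timeout_s=2.0, interval_s=0.01)
    signals = RuntimeSignals(app_ready=ready, critical_path_ok=True,
                             side_effects_ok=True, temp_resources_cleaned=False)
    print("ready:", ready, "all signals ok:", signals.all_ok())
```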
Real-World Case
Task: Implement user password reset functionality. Involves database operations, email sending, and API endpoint modifications.
Premature hand-in path: The agent modifies the database schema, writes the API endpoint, adds the email template, runs the unit tests (they pass), and declares completion. The exam paper is completely filled out.
Actual point deductions: (1) End-to-end flow untested—the actual sending and verification of the reset link was never confirmed. (2) Database migration failed after partial execution, causing schema inconsistency. (3) Email service config was missing in the target environment.
Harness intervention: The harness enforced termination validation in three steps: (1) start the full app and verify that the reset endpoint is reachable; (2) execute the full reset flow; (3) verify database state consistency. All defects were found within the session, saving 5-10x the cost of fixing them later. The independent grader found the real issues.
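The database consistency step could be as simple as comparing the live schema against the columns the migration was supposed to create. The sketch below assumes SQLite purely for illustration; the case does not say which database was used, and the `users` table and its column names are hypothetical.

```python
import sqlite3
from typing import Dict, List, Set


def missing_columns(conn: sqlite3.Connection,
                    expected: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Return, per table, the expected columns that are absent from the schema."""
    problems: Dict[str, List[str]] = {}
    for table, columns in expected.items():
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        # A table that does not exist yields no rows, so all its columns count as missing.
        # Table names come from trusted harness config here, not from user input.
        rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        present = {row[1] for row in rows}
        absent = sorted(columns - present)
        if absent:
            problems[table] = absent
    return problems


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Simulate a migration that stopped before adding the reset-token columns.
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    expected = {"users": {"id", "email", "reset_token", "reset_token_expires_at"}}
    print(missing_columns(conn, expected))
```

An empty result means the schema matches expectations. Anything else is a concrete list the harness can hand back to the agent as a fix instruction.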