Why End-to-End Testing Changes Results

Key principle: Enforce invariants, don't micromanage implementation. For example, require that "data is parsed at the boundary," but don't dictate which library to use. Error messages must include fix instructions: rather than only reporting a violation, they should tell the agent exactly what to change.

Source: OpenAI: Harness engineering: leveraging Codex in an agent-first world

1. Harness Must Include an End-to-End Layer

Make it explicit in your validation flow: for tasks involving cross-component changes, passing end-to-end tests is a prerequisite for completion:

## Validation Hierarchy
- Level 1: Unit tests (Must pass)
- Level 2: Integration tests (Must pass)
- Level 3: End-to-end tests (Must pass when cross-component changes are involved)
- Skipping any required level = Not Complete
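One way to make this hierarchy enforceable rather than advisory is to encode it as a small gate that the harness runs before accepting a task as done. The following is a minimal sketch: the `TaskReport` shape, the `crossComponent` flag, and the function names are assumptions made for illustration, not part of any particular harness.

```typescript
// Hypothetical completion gate: all names and shapes here are illustrative.
type Level = "unit" | "integration" | "e2e";
type Status = "passed" | "failed" | "skipped";

interface TaskReport {
  crossComponent: boolean;         // does the change touch more than one component?
  results: Record<Level, Status>;  // outcome of each validation level
}

// Returns the levels that block completion; an empty list means "complete".
function blockingLevels(report: TaskReport): Level[] {
  const required: Level[] = ["unit", "integration"];
  if (report.crossComponent) {
    required.push("e2e");
  }
  // A required level that was skipped blocks completion just like a failure.
  return required.filter((level) => report.results[level] !== "passed");
}

function isComplete(report: TaskReport): boolean {
  return blockingLevels(report).length === 0;
}
```

With this sketch, a cross-component task whose end-to-end level was skipped is reported as not complete, while the same results on a single-component task pass the gate.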

2. Turn Architectural Rules into Executable Checks

Every architectural constraint should have a corresponding test or lint rule:

# Fail the check if the renderer process requires or imports Node's fs module directly
if grep -rnE "require\(['\"]fs['\"]\)|from ['\"]fs['\"]" src/renderer/; then exit 1; fi
echo "OK: no direct fs access in renderer"
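The same rule can also be written as a small function, which makes it easier to unit-test the check itself and to report the file and line of each hit. This is a sketch under assumptions: the `Violation` shape and `findFsImports` are hypothetical names, and the function takes file contents as input, so reading the files from disk is left to the caller.

```typescript
// Hypothetical lint-style check; the types and function name are assumptions.
interface Violation {
  file: string;
  line: number;  // 1-based line number
  text: string;  // the offending source line, trimmed
}

// Matches require('fs'), require("node:fs"), and `import ... from 'fs'` forms.
const FS_IMPORT = /require\(\s*['"](?:node:)?fs['"]\s*\)|from\s+['"](?:node:)?fs['"]/;

// `sources` maps a file path to its contents, so the check needs no I/O.
function findFsImports(sources: Record<string, string>, dir: string): Violation[] {
  const violations: Violation[] = [];
  for (const file of Object.keys(sources)) {
    if (file.indexOf(dir) !== 0) {
      continue; // only files under the restricted directory are checked
    }
    sources[file].split("\n").forEach((text, index) => {
      if (FS_IMPORT.test(text)) {
        violations.push({ file, line: index + 1, text: text.trim() });
      }
    });
  }
  return violations;
}
```

Calling `findFsImports(sources, "src/renderer/")` returns one entry per offending line; a harness step could exit non-zero whenever the returned list is non-empty.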

3. Design Agent-Oriented Error Messages

Failure messages should contain three elements: what went wrong, why, and how to fix it:

ERROR: Found direct import of 'fs' in src/renderer/App.tsx:12
WHY: Renderer process has no access to Node.js APIs for security
FIX: Move file operations to src/preload/file-ops.ts and call via window.api.readFile()
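A check can be prevented from emitting a bare "violation" by routing every failure through one formatter that insists on all three parts. The sketch below assumes a hypothetical `AgentError` shape and `formatAgentError` helper; neither comes from the source.

```typescript
// Hypothetical message builder; field and function names are assumptions.
interface AgentError {
  what: string;  // what went wrong, with file and line where available
  why: string;   // the constraint behind the check
  fix: string;   // a concrete next action for the agent
}

function formatAgentError(e: AgentError): string {
  // Refuse to build a message that leaves out any of the three elements.
  const parts: Array<[string, string]> = [
    ["what", e.what],
    ["why", e.why],
    ["fix", e.fix],
  ];
  for (const [name, value] of parts) {
    if (value.trim() === "") {
      throw new Error(`agent error is missing its "${name}" part`);
    }
  }
  return [`ERROR: ${e.what}`, `WHY: ${e.why}`, `FIX: ${e.fix}`].join("\n");
}
```

Passing the three strings from the example above reproduces the same three-line message, and a check author who forgets the fix instruction gets an exception instead of shipping an unhelpful error.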

4. Establish a Review Feedback Promotion Process

Every time a new type of agent error is found during code review, turn it into an automated check. After a month of doing this, your harness will be significantly stronger than when you started. It's like rehearsal notes for a choir: issues found in each rehearsal are recorded so they can be checked before the next one. Over time, common errors decrease, and the music becomes more harmonious.
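Promotion is cheapest when checks are data rather than one-off scripts, so that turning a review comment into a check means appending one entry. The following is a rough sketch of that idea; the `PromotedRule` shape, the `promote` and `check` functions, and the line-by-line regex matching are all assumptions for illustration.

```typescript
// Hypothetical rule table; nothing here comes from a real project.
interface PromotedRule {
  id: string;
  pattern: RegExp;  // the code pattern reviewers kept flagging
  why: string;
  fix: string;
}

const rules: PromotedRule[] = [];

// Promoting a review finding appends one rule; ids must stay unique.
function promote(rule: PromotedRule): void {
  if (rules.some((r) => r.id === rule.id)) {
    throw new Error(`rule ${rule.id} already exists`);
  }
  rules.push(rule);
}

// Runs every promoted rule against one file's contents, line by line.
function check(file: string, content: string): string[] {
  const messages: string[] = [];
  content.split("\n").forEach((line, i) => {
    for (const rule of rules) {
      rule.pattern.lastIndex = 0; // keeps global-flag patterns stateless
      if (rule.pattern.test(line)) {
        messages.push(
          `ERROR: ${rule.id} at ${file}:${i + 1}\nWHY: ${rule.why}\nFIX: ${rule.fix}`
        );
      }
    }
  });
  return messages;
}
```

Each call to `promote` corresponds to one category of review feedback, so the length of `rules` is a rough measure of how much review knowledge the harness has absorbed.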

Real-World Case

Task: Implement a file export feature in an Electron app. The change involves the renderer process UI, a preload script that proxies filesystem access, and service layer data transformation.

Singing parts individually (unit tests passed): the renderer component tests passed with file operations mocked, the preload script tests passed with the filesystem mocked, and the service layer tests passed with the data source mocked. The agent declared the task complete.

Singing together (Defects revealed by End-to-End tests):

| Defect | Description | Unit Test | E2E |
| --- | --- | --- | --- |
| Interface Mismatch | Inconsistent file path format | Missed | Caught |
| State Propagation | Export progress not sent back to UI via IPC | Missed | Caught |
| Resource Leak | Large file export handles not released | Missed | Caught |
| Permission Issue | Different permissions in packaged environment | Missed | Caught |
| Error Propagation | Service layer exceptions didn't reach UI layer | Missed | Caught |

All five defects were caught by end-to-end tests, while unit tests caught none. The cost was an increase in test time from 2 seconds to 15 seconds, which is completely acceptable in an agent workflow. No matter how well each part sings individually, it can't beat a full ensemble rehearsal.
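An interface mismatch of the kind described above can be reproduced in miniature. In the sketch below, both helper functions and the specific mismatch (a `file://` URL on one side, a plain absolute path on the other) are assumptions chosen for illustration; the case description does not say which path formats were actually involved.

```typescript
// Hypothetical renderer-side helper: builds the export target as a file:// URL.
function buildExportTarget(dir: string, name: string): string {
  return `file://${dir}/${name}`;
}

// Hypothetical preload-side helper: accepts only absolute POSIX-style paths.
function validateExportPath(path: string): boolean {
  return path.charAt(0) === "/";
}

// Unit-style checks: each side is tested against its own idea of a valid value.
const rendererUnitOk =
  buildExportTarget("/tmp", "report.csv") === "file:///tmp/report.csv";
const preloadUnitOk = validateExportPath("/tmp/report.csv");

// End-to-end-style check: feed the renderer's real output into the preload side.
const endToEndOk = validateExportPath(buildExportTarget("/tmp", "report.csv"));
```

Here `rendererUnitOk` and `preloadUnitOk` are both `true`, yet `endToEndOk` is `false`: each side is correct against its own hand-written fixture, and the defect exists only where the two meet.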

Takeaways

Unit tests are systematically blind to defects at component boundaries: the isolation they are designed around is exactly what prevents them from detecting interaction issues. Every singer hitting the right notes does not guarantee that the choir is in tune.
End-to-end testing not only detects defects, it changes agent coding behavior—making it focus more on integration and boundaries.
Architectural rules must be executable—not written in a document waiting to be read, but automatically checked on every commit.
Error messages must be designed for agents—including specific steps on "how to fix it" to form a self-correcting loop.
Review feedback promotion makes the harness automatically stronger—every category of captured defect becomes a permanent line of defense.

Exercises

Cross-Component Defect Detection: Pick a modification task involving at least three components. First, run only unit tests and record the results, then run end-to-end tests. Analyze which type of cross-layer interaction issue each additionally discovered defect belongs to.
Architectural Rule Automation: Pick an architectural constraint from your project and turn it into an executable check (with an agent-oriented error message). Integrate it into the harness and verify its effectiveness with a baseline task.
Review Feedback Promotion: Find a recurring comment type from your code review history and convert it into an automated check using the five-step process. Compare the frequency of the issue before and after the promotion.
