📖 Lesson content
Summary
When building prompt evaluation workflows, grading systems provide objective signals about output quality. A grader takes a model's output and returns measurable feedback, typically a score from 1 to 10, where 10 is the highest quality and 1 is the lowest.
Types of Graders

There are three main approaches to grading model outputs:
- Code graders - Programmatically evaluate outputs using custom code
- Model graders - Use another AI model to assess the quality
- Human graders - Have people manually review and score outputs
Code Graders
Code graders let you implement any programmatic check you can imagine. Common uses include:
- Checking output length
- Verifying output does or doesn't contain certain words
- Syntax validation for JSON, Python, or regex
- Readability scores to ensure appropriate reading levels
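As a rough illustration of the first three checks, here is a minimal sketch using only the Python standard library. The function names, the 500-character limit, and the all-or-nothing 10/1 scoring are assumptions made for this example, not part of the lesson. Readability scoring is left out because it typically needs a third-party library.

```python
import ast
import json
import re


def grade_length(output: str, max_chars: int = 500) -> int:
    """Return 10 if the output fits within max_chars (assumed limit), otherwise 1."""
    return 10 if len(output) <= max_chars else 1


def grade_banned_words(output: str, banned: list[str]) -> int:
    """Return 1 if any banned word appears as a case-insensitive substring, otherwise 10."""
    lowered = output.lower()
    return 1 if any(word.lower() in lowered for word in banned) else 10


def grade_syntax(output: str, fmt: str) -> int:
    """Return 10 if output parses as fmt ("json", "python", or "regex"), otherwise 1."""
    try:
        if fmt == "json":
            json.loads(output)
        elif fmt == "python":
            ast.parse(output)
        elif fmt == "regex":
            re.compile(output)
        else:
            # An unknown format is a caller error, not a grading failure.
            raise ValueError(f"unsupported format: {fmt}")
    except (json.JSONDecodeError, SyntaxError, re.error):
        return 1
    return 10
```

For example, `grade_syntax('{"a": 1}', "json")` returns 10, while `grade_syntax("[a-z", "regex")` returns 1 because the character set is never closed.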
Model Graders
Model graders send the output to a model in an additional API call and ask it to evaluate the result. This makes them very flexible. They're useful for assessing:
- Response quality
- Quality of instruction following
- Completeness
- Helpfulness
- Safety
Human Graders
Human graders provide the most flexibility but come with significant downsides. While humans can evaluate responses for any criteria imaginable, the process is time-consuming and tedious.
Defining Evaluation Criteria

Before implementing any grader, you need clear evaluation criteria. For a code generation prompt, you might focus on:
- Format - The response should contain only Python, JSON, or regex, with no explanation
- Valid Syntax - Produced code should have valid syntax
- Task Following - Response should directly address the user's task with accurate code

The first two criteria work well with code graders, while task following is better suited for model graders due to their flexibility.
Implementing a Model Grader
Model graders are often the easiest to implement. Here's a basic structure:
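```python
import json

# add_user_message, add_assistant_message, and chat are helpers defined outside this snippet.
def grade_by_model(test_case, output):
    # eval_prompt is the grading prompt; it must embed test_case and output (see the checklist below).
    messages = []
    add_user_message(messages, eval_prompt)
    # Prefill the assistant turn so the model continues with raw JSON...
    add_assistant_message(messages, "```json")
    # ...and stop at the closing fence so only the JSON text comes back.
    eval_text = chat(messages, stop_sequences=["```"])
    return json.loads(eval_text)
```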

The grading prompt should be comprehensive and include:
- Clear role definition for the grader
- The original task
- The AI-generated solution to evaluate
- Specific output format requirements
Ask for more than just a score. Request strengths, weaknesses, and reasoning alongside the numerical score. This discourages the model from defaulting to a middling score such as 6 and pushes it toward a more thoughtful evaluation.
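A grading prompt covering the four parts listed above, plus the request for strengths, weaknesses, and reasoning, might be assembled as in the sketch below. The function name `build_eval_prompt`, the XML-style tags, and the exact wording are assumptions for illustration, not the lesson's actual prompt.

```python
def build_eval_prompt(task: str, solution: str) -> str:
    """Assemble a grading prompt (hypothetical wording) from the task and the model's solution."""
    return f"""You are an expert code reviewer. Evaluate the AI-generated solution below.

Original task:
<task>
{task}
</task>

Solution to evaluate:
<solution>
{solution}
</solution>

Respond with a JSON object containing exactly these keys:
- "strengths": an array of 1-3 key strengths
- "weaknesses": an array of 1-3 key areas for improvement
- "reasoning": a concise explanation of your assessment
- "score": an integer from 1 to 10
"""
```

The returned string could serve as the `eval_prompt` passed to the grader, with the task taken from the test case and the solution taken from the model output.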
Integrating Graders into Your Workflow
Once you have a grader function, integrate it into your test case runner:
```python
def run_test_case(test_case):
    output = run_prompt(test_case)

    # Call the model grader
    model_grade = grade_by_model(test_case, output)
    score = model_grade["score"]
    reasoning = model_grade["reasoning"]

    return {
        "output": output,
        "test_case": test_case,
        "score": score,
        "reasoning": reasoning,
    }
```
After running all test cases, calculate an average score to get an objective metric for your prompt's performance:
```python
from statistics import mean

def run_eval(dataset):
    results = []
    for test_case in dataset:
        result = run_test_case(test_case)
        results.append(result)

    average_score = mean([result["score"] for result in results])
    print(f"Average score: {average_score}")

    return results
```
This gives you a concrete number to focus on improving. Model graders can be somewhat capricious and may need more detailed guidance in the grading prompt, but they provide a starting point for objective evaluation that you can iterate on and improve.
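Because a model grader's reply is not guaranteed to be well formed, one optional safeguard is to validate it before averaging. The sketch below assumes the reply is a JSON object with the `score` and `reasoning` keys used by the test case runner and a score in the 1 to 10 range. The function name `parse_grade` is hypothetical.

```python
import json


def parse_grade(eval_text: str) -> dict:
    """Parse the grader's JSON reply and check the fields the runner relies on."""
    grade = json.loads(eval_text)
    if not isinstance(grade, dict):
        raise ValueError("grader reply is not a JSON object")

    score = grade.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError(f"score is not a number: {score!r}")
    if not 1 <= score <= 10:
        raise ValueError(f"score out of range 1-10: {score}")

    if not isinstance(grade.get("reasoning"), str):
        raise ValueError("reasoning is missing or not a string")
    return grade
```

A runner could call this in place of a bare `json.loads` and decide whether to retry or skip a test case when a `ValueError` is raised.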
📚 Source & attribution
- Original Anthropic Academy lesson: https://anthropic.skilljar.com/claude-with-google-vertex/289168
- © 2025 Anthropic. Educational fair-use only.