📖 Lesson content
Summary
When building prompt evaluation workflows, grading systems provide objective signals about output quality. A grader takes model output and returns some kind of measurable feedback - typically a number between 1 and 10, where 10 represents high quality and 1 represents poor quality.
Types of Graders

There are three main approaches to grading model outputs:
- Code graders - Programmatically evaluate outputs using custom logic
- Model graders - Use another AI model to assess the quality
- Human graders - Have people manually review and score outputs
Code Graders
Code graders let you implement any programmatic check you can imagine. Common uses include:
- Checking output length
- Verifying output does/doesn't have certain words
- Syntax validation for JSON, Python, or regex
- Readability scores
The only requirement is that your code returns some usable signal - usually a number between 1 and 10.
Model Graders
Model graders feed your original output into another API call for evaluation. This approach offers tremendous flexibility for assessing:
- Response quality
- Quality of instruction following
- Completeness
- Helpfulness
- Safety
Human Graders
Human graders provide the most flexibility but are time-consuming and tedious. They're useful for evaluating:
- General response quality
- Comprehensiveness
- Depth
- Conciseness
- Relevance
Defining Evaluation Criteria

Before implementing any grader, you need clear evaluation criteria. For a code generation prompt, you might focus on:
- Format - Should return only Python, JSON, or Regex without explanation
- Valid Syntax - Produced code should have valid syntax
- Task Following - Response should directly address the user's task with accurate code

The first two criteria work well with code graders, while task following is better suited for model graders due to their flexibility.
Implementing a Model Grader
Here's how to build a model grader function:
def grade_by_model(test_case, output):
# Create evaluation prompt
eval_prompt = """
You are an expert code reviewer. Evaluate this AI-generated solution.
Task: {task}
Solution: {solution}
Provide your evaluation as a structured JSON object with:
- "strengths": An array of 1-3 key strengths
- "weaknesses": An array of 1-3 key areas for improvement
- "reasoning": A concise explanation of your assessment
- "score": A number between 1-10
"""
messages = []
add_user_message(messages, eval_prompt)
add_assistant_message(messages, "```json")
eval_text = chat(messages, stop_sequences=["```"])
return json.loads(eval_text)
The key insight is asking for strengths, weaknesses, and reasoning alongside the score. Without this context, models tend to default to middling scores around 6. ## Integrating Grading into Your Workflow Update your test case runner to call the grader: def run_test_case(test_case): output = run_prompt(test_case) # Grade the output model_grade = grade_by_model(test_case, output) score = model_grade["score"] reasoning = model_grade["reasoning"] return { "output": output, "test_case": test_case, "score": score, "reasoning": reasoning } Finally, calculate an average score across all test cases: ``` from statistics import mean def run_eval(dataset): results = [] for test_case in dataset: result = run_test_case(test_case) results.append(result) average_score = mean([result["score"] for result in results]) print(f"Average score: {average_score}") return results ``` ` This gives you an objective metric to track as you iterate on your prompt. While model graders can be somewhat capricious, they provide a consistent baseline for measuring improvements. ` ````
`` ` #### Downloads - [001_prompt_evals_grader.ipynb](https://cc.sj-cdn.net/instructor/4hdejjwplbrm-anthropic/assets/1762977624/001_prompt_evals_grader.ipynb?response-content-disposition=attachment&Expires=1776933925&Signature=A8r5JH-ZF2J4LXvlPC2hk2WsuRCQFiskGAuStui8eWCfus2WPBhw8VWvvSg0WjjJrsoNS1nQkAsIq5LZg0ADgm0alnj9CerIWagiL43tRW-5iENv-QnYGyRvjzBbvBSesPIngk0MIPvWuEycDyg4vgIh8mfZ9nzJaIPyYAKepXLdevmJQh1UYuNXI8bk2-Mwx9NfH6hfCZL7KHC6HmVahtK746RFkyYTAVis29ncJQj16sfrV2viNwWYzpOS-Qs52jCypaCxQWdj2N0~z8SR-ufbwYn8xfKD4Xa0R1Iq~AGRQz4htwnmoMRC3MGUkht9FCkyCCOxknSHruG44wlmZA__&Key-Pair-Id=APKAI3B7HFD2VYJQK4MQ) ` ``
## 🔁 Related lessons
- **Next:** [Code based grading](../21-code-based-grading/en.md)
- **Previous:** [Running the eval](../19-running-the-eval/en.md)
- **Same section:** [Making a request](../05-making-a-request/en.md) · [Multi-Turn conversations](../06-multi-turn-conversations/en.md) · [Chat exercise](../07-chat-exercise/en.md)
- **Part of paths:** [Path C](../../../../../docs/learning-paths.md)
- **Reference docs:** [Glossary](../../../../../docs/glossary.md) · [Skills atlas](../../../../../docs/skills-atlas.md) · [By use-case](../../../../../docs/by-use-case.md)
## 📚 Source & attribution
- Original Anthropic Academy lesson: [https://anthropic.skilljar.com/claude-with-the-anthropic-api/287742](https://anthropic.skilljar.com/claude-with-the-anthropic-api/287742)
- © 2025 Anthropic. Educational fair-use only.
- Crawled: 2026-04-23 · Standardized: 2026-05-01