Model-based grading

Summary

When building prompt evaluation workflows, graders provide an objective signal about output quality. A grader takes a model's output and returns measurable feedback, typically a score from 1 to 10, where 10 is the highest quality and 1 the lowest.

Types of Graders

There are three main approaches to grading model outputs:

Code graders - Programmatically evaluate outputs using custom code
Model graders - Use another AI model to assess the quality
Human graders - Have people manually review and score outputs

Code Graders

Code graders let you implement any programmatic check you can imagine. Common uses include:

Checking output length
Verifying output does or doesn't contain certain words
Syntax validation for JSON, Python, or regex
Readability scores to ensure appropriate reading levels
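
As a minimal sketch of the first two checks, the functions below return a score on the 1 to 10 scale described earlier. The function names, the default length bounds, and the all-or-nothing scoring are assumptions made for illustration, not part of the lesson's notebook.

```python
def grade_length(output: str, min_chars: int = 1, max_chars: int = 500) -> int:
    """Return 10 if the output length is within the bounds, otherwise 1."""
    return 10 if min_chars <= len(output) <= max_chars else 1


def grade_forbidden_words(output: str, forbidden: list[str]) -> int:
    """Return 1 if any forbidden term appears as a substring, otherwise 10.

    Matching is case-insensitive and substring-based (hypothetical policy).
    """
    lowered = output.lower()
    return 1 if any(term.lower() in lowered for term in forbidden) else 10


print(grade_length("short answer"))
print(grade_forbidden_words("Sorry, I cannot help.", ["sorry"]))
```

Real graders may prefer partial credit over a pass/fail score; the binary version is only the simplest starting point.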

Model Graders

Model graders evaluate an output by sending it to a model in an additional API call. This makes them highly flexible and well suited to assessing subjective qualities such as:

Response quality
Quality of instruction following
Completeness
Helpfulness
Safety

Human Graders

Human graders provide the most flexibility but come with significant downsides. While humans can evaluate responses for any criteria imaginable, the process is time-consuming and tedious.

Defining Evaluation Criteria

Before implementing any grader, you need clear evaluation criteria. For a code generation prompt, you might focus on:

Format - Response should contain only Python, JSON, or regex, with no explanation
Valid Syntax - Produced code should have valid syntax
Task Following - Response should directly address the user's task with accurate code

The first two criteria work well with code graders, while task following is better suited for model graders due to their flexibility.
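
The syntax criterion can be sketched with the standard library alone. The code below assumes each test case records which format is expected (the `expected_format` argument is hypothetical) and scores 10 for output that parses and 1 for output that does not. Because any surrounding explanation makes parsing fail, the same check also catches most format violations.

```python
import ast
import json
import re


def validate_json(text: str) -> int:
    """Score 10 if the text parses as JSON, otherwise 1."""
    try:
        json.loads(text.strip())
        return 10
    except json.JSONDecodeError:
        return 1


def validate_python(text: str) -> int:
    """Score 10 if the text parses as Python source, otherwise 1."""
    try:
        ast.parse(text.strip())
        return 10
    except (SyntaxError, ValueError):
        return 1


def validate_regex(text: str) -> int:
    """Score 10 if the text compiles as a regular expression, otherwise 1."""
    try:
        re.compile(text.strip())
        return 10
    except re.error:
        return 1


def grade_syntax(output: str, expected_format: str) -> int:
    """Dispatch to the validator for the expected format (assumed key names)."""
    validators = {
        "json": validate_json,
        "python": validate_python,
        "regex": validate_regex,
    }
    if expected_format not in validators:
        raise ValueError(f"unknown format: {expected_format!r}")
    return validators[expected_format](output)
```

Note that these checks only confirm the output parses. Whether the code actually solves the task is the part left to the model grader.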

Implementing a Model Grader

Model graders are often the easiest to implement. Here's a basic structure:

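import json

def grade_by_model(test_case, output):
    # eval_prompt is the grading prompt string, built from test_case and output.
    # add_user_message, add_assistant_message, and chat are helper functions
    # assumed to be defined elsewhere in the notebook.
    messages = []
    add_user_message(messages, eval_prompt)
    # Prefill the assistant turn so the model starts its reply inside a JSON block
    add_assistant_message(messages, "```json")
    # Stop at the closing fence so only the JSON body is returned
    eval_text = chat(messages, stop_sequences=["```"])
    return json.loads(eval_text)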
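
The function above hands whatever the model returns straight to `json.loads`. A slightly more defensive parser might also confirm the score is usable before it reaches the rest of the workflow. The sketch below is an assumption layered on top of the lesson: the `parse_grade` name is hypothetical, and it assumes the grader returns a JSON object with an integer `score` from 1 to 10.

```python
import json


def parse_grade(eval_text: str) -> dict:
    """Parse the grader's JSON reply and check that the score is usable."""
    grade = json.loads(eval_text)
    if not isinstance(grade, dict):
        raise ValueError("grader response must be a JSON object")
    score = grade.get("score")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 10:
        raise ValueError(f"score must be an integer from 1 to 10, got {score!r}")
    return grade


# Example text as it might come back after the JSON prefill and stop sequence
sample = '\n{"score": 8, "reasoning": "Correct, but lacks edge-case handling."}\n'
print(parse_grade(sample)["score"])
```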

The grading prompt should be comprehensive and include:

Clear role definition for the grader
The original task
The AI-generated solution to evaluate
Specific output format requirements

Ask for more than just a score. Request strengths, weaknesses, and reasoning alongside the numerical score. Without them, the model tends to default to middling scores like 6; asking it to justify the score forces a more thoughtful evaluation.
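
One way to assemble such a prompt is sketched below. The wording, the XML-style tags, and the `build_eval_prompt` name are illustrative assumptions rather than the lesson's exact prompt. It covers the four elements listed above and asks for strengths, weaknesses, and reasoning along with the score.

```python
def build_eval_prompt(task: str, solution: str) -> str:
    """Build a grading prompt (hypothetical wording) for a code-generation task."""
    return f"""You are an expert code reviewer. Evaluate the AI-generated solution below.

Original task:
<task>
{task}
</task>

Solution to evaluate:
<solution>
{solution}
</solution>

Respond with a JSON object using exactly this structure:
{{
    "strengths": ["1-3 key strengths"],
    "weaknesses": ["1-3 key areas for improvement"],
    "reasoning": "a concise explanation of the score",
    "score": <integer from 1 to 10>
}}"""


print(build_eval_prompt("Write a function that reverses a string", "def rev(s): return s[::-1]"))
```

The doubled braces are f-string escapes, so the prompt the model receives contains single braces around the JSON template.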

Integrating Graders into Your Workflow

Once you have a grader function, integrate it into your test case runner:

def run_test_case(test_case):
    output = run_prompt(test_case)
    
    # Call the model grader
    model_grade = grade_by_model(test_case, output)
    score = model_grade["score"]
    reasoning = model_grade["reasoning"]
    
    return {
        "output": output, 
        "test_case": test_case, 
        "score": score,
        "reasoning": reasoning
    }

After running all test cases, calculate an average score to get an objective metric for your prompt's performance:

from statistics import mean

def run_eval(dataset):
    results = []
    for test_case in dataset:
        result = run_test_case(test_case)
        results.append(result)
    
    average_score = mean([result["score"] for result in results])
    print(f"Average score: {average_score}")
    
    return results

This gives you a concrete number to focus on improving. Model graders can score inconsistently and often benefit from more detailed guidance in the grading prompt, but they provide a starting point for objective evaluation that you can iterate on.
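
When iterating, it can help to look past the average at the weakest cases and the grader's reasoning for them. The sketch below assumes results shaped like the dictionaries returned by `run_test_case`; the `summarize_results` name and the sample data are hypothetical.

```python
from statistics import mean


def summarize_results(results: list[dict], worst_n: int = 3) -> dict:
    """Return the average, min, and max score plus the lowest-scoring results."""
    if not results:
        raise ValueError("results is empty; nothing to summarize")
    scores = [r["score"] for r in results]
    # Sort ascending by score so the weakest test cases come first
    worst = sorted(results, key=lambda r: r["score"])[:worst_n]
    return {
        "average": mean(scores),
        "min": min(scores),
        "max": max(scores),
        "worst": worst,
    }


# Made-up results for illustration only
sample_results = [
    {"test_case": {"task": "a"}, "output": "...", "score": 8, "reasoning": "Solid"},
    {"test_case": {"task": "b"}, "output": "...", "score": 4, "reasoning": "Ignores part of the task"},
    {"test_case": {"task": "c"}, "output": "...", "score": 9, "reasoning": "Complete"},
]

summary = summarize_results(sample_results, worst_n=1)
print(summary["average"], summary["worst"][0]["reasoning"])
```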

Downloads

001_prompt_evals_grader.ipynb

📚 Source & attribution

Original Anthropic Academy lesson: https://anthropic.skilljar.com/claude-with-google-vertex/289168