
Evaluate prompts in the Anthropic Console

TL;DR

  • The Anthropic Workbench has new features designed to streamline the development and deployment of high-quality prompts for Claude.
  • Key improvements include a prompt generator that uses Claude 3.5 Sonnet to convert high-level tasks into detailed prompts.
  • The workbench also provides automated test data generation and a comprehensive evaluation system to test, refine, and compare prompt performance efficiently.

Takeaways

  • The Anthropic Workbench now includes a prompt generator that uses Claude 3.5 Sonnet to automatically create detailed prompt templates from a high-level description of a task.
  • Users can leverage Claude to automatically generate realistic input data or test cases for their prompts, significantly reducing the time spent on creating test sets.
  • The evaluate feature allows users to set up a test suite with numerous test cases, which can be generated by Claude, uploaded via CSV, or customized directly.
  • After running the prompts, outputs can be graded, and prompts can be iteratively refined (e.g., adjusting justification length) based on feedback from the test results.
  • The workbench supports rerunning refined prompts against existing test suites to quickly verify the impact of changes.
  • Users can compare new results against old results side-by-side to visually assess improvements in prompt quality and performance.
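As a rough illustration of the test-suite idea in the takeaways above, the sketch below loads test cases from CSV text and fills a prompt template with each one. It is not how the workbench is implemented: the `{{SUPPORT_REQUEST}}` placeholder name, the template wording, the sample rows, and the helper functions `load_test_cases` and `fill_template` are all assumptions made for this example, and no model is called.

```python
import csv
import io
import re

# Hypothetical prompt template. The {{SUPPORT_REQUEST}} placeholder name
# and the wording are assumptions, not the prompt shown in the video.
PROMPT_TEMPLATE = (
    "You are triaging customer support requests.\n"
    "Request: {{SUPPORT_REQUEST}}\n"
    "Give a one sentence justification, then a triage decision."
)

# Stand-in for an uploaded CSV file: one column per template variable,
# one row per test case. The rows are invented sample data.
CSV_TEXT = """SUPPORT_REQUEST
My invoice shows a double charge for March.
The app crashes when I open the settings page.
"""


def load_test_cases(csv_text):
    """Parse CSV text into a list of dicts, one dict per test case."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def fill_template(template, variables):
    """Replace each {{NAME}} placeholder with its value from `variables`."""
    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"no value for template variable {name!r}")
        return variables[name]

    return re.sub(r"\{\{(\w+)\}\}", substitute, template)


test_cases = load_test_cases(CSV_TEXT)
prompts = [fill_template(PROMPT_TEMPLATE, case) for case in test_cases]

for prompt in prompts:
    print(prompt)
    print("---")
```

Each filled prompt would then be sent to the model and its output graded. Keeping the test cases separate from the template is what makes it possible to rerun a revised prompt against the same suite.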

Vocabulary

  • Anthropic Workbench — A development environment provided by Anthropic for creating, testing, and deploying prompts for their AI models.
  • prompt generator — A tool within the workbench that uses an AI model to automatically create detailed prompt templates from a high-level task description.
  • Claude 3.5 Sonnet — A specific, high-performance AI model developed by Anthropic, often used for tasks requiring advanced reasoning.
  • prompt template — A structured and detailed set of instructions provided to an AI model to guide its output for a specific task.
  • triage — In the context of customer support, the process of assessing and prioritizing incoming requests to determine the appropriate action.
  • input data — The specific examples or information fed into a prompt to test how the AI model responds.
  • test case — A specific scenario or set of inputs designed to evaluate a prompt's performance under particular conditions.
  • evaluate feature — A component of the workbench designed to systematically test prompts against multiple test cases and analyze their outputs.
  • test suite — A collection of multiple test cases used together to thoroughly assess the quality and robustness of a prompt.
  • deploy to production — The act of releasing a developed and tested prompt (or software feature) for live use by end-users.

Transcript

We've recently made a number of improvements to the Anthropic Workbench that make it easier to develop and deploy high-quality prompts for Claude. Let's see how it works by taking a look at our recently updated prompt generator. You can use the prompt generator to take a high-level description of a task and convert it into a detailed prompt template using Claude 3.5 Sonnet. In this case, let's imagine we need to triage customer support requests. As you can see, Claude immediately starts writing a prompt based off of our task. It's detailed and specific and looks like it should work. But before we deploy to production, we should really test to see how it performs with realistic customer data.
Coming up with realistic test data can be time-consuming and can take longer than writing the prompt itself. You can now use Claude to automatically generate realistic input data based off of your prompt. In this case, we can generate a customer support request. This one looks good, so let's see how the prompt works with this particular support request. This seems pretty good. It's providing a justification and a triage decision. But how do we know that we didn't get lucky? How do we know that this prompt is actually going to work in a broad range of scenarios?
That's where the new Evaluate feature comes in. You can use the Evaluate page to set up as many test cases as you want. Let's keep generating a broad range of representative test cases. You can also upload test cases from a CSV if you happen to have the test data on hand. Test case generation logic is highly customizable and adapts to your existing test set. If you have highly specific requirements, you can directly edit the generation logic yourself. Once you have enough test cases ready, you can generate results for your new test suite. All right, these results look pretty good. So let's go and grade their quality.
Maybe we decide while evaluating them that the justifications were a bit brief, and we'd like them to be a bit longer. Well, we can go back to the prompt, find the section where it specified a one sentence justification, and update it to a two sentence justification. We can rerun the prompt and, just as we'd hoped, we're seeing a two sentence justification. So let's go back to the Evaluate tab. Thankfully, our test suite is still there, so we can rerun the new prompt against the old test set data. And just as we hoped, they're all just a little bit longer. We can go and grade these new outputs. We're happier with these ones. But just to be sure, we can actually compare these new results against the old results. And here we can see side by side that the results are longer. We're still getting similar triage decisions, but our grading on average is better.
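The closing comparison, where grading "on average is better", comes down to simple arithmetic over two sets of grades for the same test cases. A minimal sketch follows; the 1 to 5 grading scale, the grade values, and the `compare_runs` helper are assumptions made for illustration, not data or code from the workbench.

```python
from statistics import mean

# Hypothetical grades on an assumed 1-5 scale, one per test case,
# listed in the same test-case order for both prompt versions.
old_grades = [3, 4, 3, 4]  # one sentence justification
new_grades = [4, 4, 5, 4]  # two sentence justification


def compare_runs(old, new):
    """Summarize how grades changed between two runs of the same test suite."""
    if len(old) != len(new):
        raise ValueError("both runs must cover the same test cases")
    return {
        "old_mean": mean(old),
        "new_mean": mean(new),
        "improved": sum(n > o for o, n in zip(old, new)),
        "regressed": sum(n < o for o, n in zip(old, new)),
    }


summary = compare_runs(old_grades, new_grades)
print(summary)
```

Counting improved and regressed cases alongside the means guards against a higher average that hides a few test cases getting worse.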
