AI Evaluation Framework

AI Workflows · Advanced · Works with: ChatGPT, Claude
You are an AI quality assurance specialist. Build an evaluation framework to assess AI output quality for [USE CASE].

Use case: [WHAT THE AI IS DOING]
AI model in production: [MODEL]
Volume of outputs: [HOW MANY PER DAY/WEEK]
Consequences of bad output: [LOW / MEDIUM / HIGH — describe worst case]
Current QA process: [NONE / MANUAL REVIEW / BASIC FILTERS]

Evaluation framework:
1. Evaluation dimensions (4-6 criteria specific to this use case):
For each: name, what it measures, how to score it (rubric), weight in overall score
2. Golden test set: 10 test inputs with expected ideal outputs (write 3 of them explicitly)
3. Automated checks: regex or programmatic tests for obvious failures
4. Human review sample: what % to review manually and what to look for
5. Threshold for intervention: what score triggers prompt revision or human escalation
6. Regression testing: how to know if a model update made things worse

Rule: evaluation criteria must be specific enough that two different people would give the same score to the same output.
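To ground the framework above, the sketches below illustrate items 1-6 in code. Everything in them is a hypothetical placeholder, not part of the prompt itself. First, items 1 and 5: a minimal weighted rubric scorer with an intervention threshold, assuming invented dimension names, weights, a 1-5 rubric scale, and a 3.5 cutoff.

```python
# Sketch of items 1 and 5. Dimension names, weights, and the
# threshold are hypothetical; substitute the criteria the prompt
# generates for your use case.

RUBRIC_WEIGHTS = {            # weights must sum to 1.0
    "accuracy": 0.35,
    "completeness": 0.25,
    "tone": 0.20,
    "formatting": 0.20,
}
INTERVENTION_THRESHOLD = 3.5  # hypothetical: below this, escalate

def overall_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension rubric scores (1-5 scale)."""
    assert set(dimension_scores) == set(RUBRIC_WEIGHTS), "score every dimension"
    return sum(RUBRIC_WEIGHTS[d] * s for d, s in dimension_scores.items())

def needs_intervention(dimension_scores: dict[str, float]) -> bool:
    """True if the output should trigger prompt revision or human review."""
    return overall_score(dimension_scores) < INTERVENTION_THRESHOLD

# Example: a reviewer scores one output against the rubric.
scores = {"accuracy": 4, "completeness": 3, "tone": 5, "formatting": 4}
print(round(overall_score(scores), 2), needs_intervention(scores))  # 3.95 False
```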
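Item 2's golden test set can live as plain data. The cases below are invented stand-ins for the 10 inputs the prompt asks for; the "must_include" field gives automated checks something concrete to assert against.

```python
# Sketch of item 2: a golden test set as plain data. All inputs and
# expected outputs here are invented placeholders.

GOLDEN_SET = [
    {
        "id": "golden-001",
        "input": "Summarize this refund policy for a customer email.",
        "expected": "A 2-3 sentence summary covering the 30-day window and receipt requirement.",
        "must_include": ["30-day", "receipt"],
    },
    {
        "id": "golden-002",
        "input": "Classify this support ticket: 'My invoice total is wrong.'",
        "expected": "billing",
        "must_include": ["billing"],
    },
    # ... 8 more cases in the same shape
]
```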
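For item 3, cheap programmatic checks catch obvious failures before any human looks at an output. The specific patterns below (leftover placeholders, AI-disclaimer leaks, an arbitrary length cap) are assumptions; tailor them to your use case's actual failure modes.

```python
import re

def automated_checks(output: str) -> list[str]:
    """Return failure labels; an empty list means all checks passed."""
    failures = []
    if not output.strip():
        failures.append("empty_output")
    if re.search(r"\[[A-Z_ ]+\]", output):
        failures.append("unfilled_placeholder")   # e.g. a leftover [USE CASE]
    if re.search(r"as an ai (language )?model", output, re.IGNORECASE):
        failures.append("ai_disclaimer_leak")
    if len(output) > 4000:
        failures.append("over_length")            # hypothetical limit
    return failures

print(automated_checks("As an AI model, I cannot... [USE CASE]"))
# ['unfilled_placeholder', 'ai_disclaimer_leak']
```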
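Item 4's review sample should be random but reproducible, so two reviewers pulling "today's 5%" get the same batch. The 5% rate and fixed seed here are placeholder settings, not recommendations.

```python
import random

REVIEW_RATE = 0.05  # hypothetical: review 5% of outputs manually

def review_sample(output_ids: list[str], seed: int = 0) -> list[str]:
    """Deterministic random sample of output IDs for manual review."""
    k = max(1, round(len(output_ids) * REVIEW_RATE))
    return random.Random(seed).sample(output_ids, k)
```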
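Finally, item 6: re-score the golden set after a model update and compare per-dimension means against the baseline. The 0.25-point tolerance is an assumed setting; pick one that matches the consequences of bad output you described above.

```python
import statistics

REGRESSION_TOLERANCE = 0.25  # max acceptable drop in mean rubric score

def regressions(baseline: dict[str, list[float]],
                candidate: dict[str, list[float]]) -> dict[str, float]:
    """Map each dimension to its score drop, keeping only real regressions."""
    drops = {}
    for dim, old_scores in baseline.items():
        drop = statistics.mean(old_scores) - statistics.mean(candidate[dim])
        if drop > REGRESSION_TOLERANCE:
            drops[dim] = round(drop, 2)
    return drops

# Example: golden-set scores before and after a model update.
before = {"accuracy": [4, 5, 4], "tone": [5, 5, 4]}
after  = {"accuracy": [3, 4, 3], "tone": [5, 5, 4]}
print(regressions(before, after))  # {'accuracy': 1.0}
```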

How to use this prompt

1. Click Copy Prompt above
2. Open ChatGPT, Claude, or Gemini
3. Paste the prompt and replace all [BRACKETED] text with your details
4. Send it and refine the output as needed