AI Evaluation Framework

AI Workflows · Advanced · Works with: ChatGPT, Claude
You are an AI quality assurance specialist. Build an evaluation framework to assess AI output quality for [USE CASE].

Use case: [WHAT THE AI IS DOING]
AI model in production: [MODEL]
Volume of outputs: [HOW MANY PER DAY/WEEK]
Consequences of bad output: [LOW / MEDIUM / HIGH — describe worst case]
Current QA process: [NONE / MANUAL REVIEW / BASIC FILTERS]

Evaluation framework:
1. Evaluation dimensions (4-6 criteria specific to this use case):
For each: name, what it measures, how to score it (rubric), weight in overall score
2. Golden test set: 10 test inputs with expected ideal outputs (write 3 of them explicitly)
3. Automated checks: regex or programmatic tests for obvious failures
4. Human review sample: what % to review manually and what to look for
5. Threshold for intervention: what score triggers prompt revision or human escalation
6. Regression testing: how to know if a model update made things worse

Rule: evaluation criteria must be specific enough that two different people would give the same score to the same output.
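To ground the framework above, the sketches below illustrate items 1-6 in code. Everything in them is a hypothetical placeholder, not part of the prompt itself. First, items 1 and 5: a minimal weighted rubric scorer with an intervention threshold, assuming invented dimension names, weights, a 1-5 rubric scale, and a 3.5 cutoff.

```python
# Sketch of items 1 and 5. Dimension names, weights, and the
# threshold are hypothetical; substitute the criteria the prompt
# generates for your use case.

RUBRIC_WEIGHTS = {            # weights must sum to 1.0
    "accuracy": 0.35,
    "completeness": 0.25,
    "tone": 0.20,
    "formatting": 0.20,
}
INTERVENTION_THRESHOLD = 3.5  # hypothetical: below this, escalate

def overall_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension rubric scores (1-5 scale)."""
    assert set(dimension_scores) == set(RUBRIC_WEIGHTS), "score every dimension"
    return sum(RUBRIC_WEIGHTS[d] * s for d, s in dimension_scores.items())

def needs_intervention(dimension_scores: dict[str, float]) -> bool:
    """True if the output should trigger prompt revision or human review."""
    return overall_score(dimension_scores) < INTERVENTION_THRESHOLD

# Example: a reviewer scores one output against the rubric.
scores = {"accuracy": 4, "completeness": 3, "tone": 5, "formatting": 4}
print(round(overall_score(scores), 2), needs_intervention(scores))  # 3.95 False
```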
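Item 2's golden test set can live as plain data. The cases below are invented stand-ins for the 10 inputs the prompt asks for; the "must_include" field gives automated checks something concrete to assert against.

```python
# Sketch of item 2: a golden test set as plain data. All inputs and
# expected outputs here are invented placeholders.

GOLDEN_SET = [
    {
        "id": "golden-001",
        "input": "Summarize this refund policy for a customer email.",
        "expected": "A 2-3 sentence summary covering the 30-day window and receipt requirement.",
        "must_include": ["30-day", "receipt"],
    },
    {
        "id": "golden-002",
        "input": "Classify this support ticket: 'My invoice total is wrong.'",
        "expected": "billing",
        "must_include": ["billing"],
    },
    # ... 8 more cases in the same shape
]
```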
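For item 3, cheap programmatic checks catch obvious failures before any human looks at an output. The specific patterns below (leftover placeholders, AI-disclaimer leaks, an arbitrary length cap) are assumptions; tailor them to your use case's actual failure modes.

```python
import re

def automated_checks(output: str) -> list[str]:
    """Return failure labels; an empty list means all checks passed."""
    failures = []
    if not output.strip():
        failures.append("empty_output")
    if re.search(r"\[[A-Z_ ]+\]", output):
        failures.append("unfilled_placeholder")   # e.g. a leftover [USE CASE]
    if re.search(r"as an ai (language )?model", output, re.IGNORECASE):
        failures.append("ai_disclaimer_leak")
    if len(output) > 4000:
        failures.append("over_length")            # hypothetical limit
    return failures

print(automated_checks("As an AI model, I cannot... [USE CASE]"))
# ['unfilled_placeholder', 'ai_disclaimer_leak']
```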
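Item 4's review sample should be random but reproducible, so two reviewers pulling "today's 5%" get the same batch. The 5% rate and fixed seed here are placeholder settings, not recommendations.

```python
import random

REVIEW_RATE = 0.05  # hypothetical: review 5% of outputs manually

def review_sample(output_ids: list[str], seed: int = 0) -> list[str]:
    """Deterministic random sample of output IDs for manual review."""
    k = max(1, round(len(output_ids) * REVIEW_RATE))
    return random.Random(seed).sample(output_ids, k)
```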
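Finally, item 6: re-score the golden set after a model update and compare per-dimension means against the baseline. The 0.25-point tolerance is an assumed setting; pick one that matches the consequences of bad output you described above.

```python
import statistics

REGRESSION_TOLERANCE = 0.25  # max acceptable drop in mean rubric score

def regressions(baseline: dict[str, list[float]],
                candidate: dict[str, list[float]]) -> dict[str, float]:
    """Map each dimension to its score drop, keeping only real regressions."""
    drops = {}
    for dim, old_scores in baseline.items():
        drop = statistics.mean(old_scores) - statistics.mean(candidate[dim])
        if drop > REGRESSION_TOLERANCE:
            drops[dim] = round(drop, 2)
    return drops

# Example: golden-set scores before and after a model update.
before = {"accuracy": [4, 5, 4], "tone": [5, 5, 4]}
after  = {"accuracy": [3, 4, 3], "tone": [5, 5, 4]}
print(regressions(before, after))  # {'accuracy': 1.0}
```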

How to use this prompt

1. Click Copy Prompt above
2. Open ChatGPT, Claude, or Gemini
3. Paste the prompt and replace all [BRACKETED] text with your details
4. Send it and refine the output as needed