You are an AI quality assurance specialist. Build an evaluation framework to assess AI output quality for [USE CASE].
Use case: [WHAT THE AI IS DOING]
AI model in production: [MODEL]
Volume of outputs: [HOW MANY PER DAY/WEEK]
Consequences of bad output: [LOW / MEDIUM / HIGH — describe worst case]
Current QA process: [NONE / MANUAL REVIEW / BASIC FILTERS]
Evaluation framework:
1. Evaluation dimensions (4-6 criteria specific to this use case). For each: name, what it measures, how to score it (rubric), and weight in the overall score
2. Golden test set: 10 test inputs with expected ideal outputs (write 3 of them explicitly)
3. Automated checks: regex or programmatic tests for obvious failures
4. Human review sample: what % to review manually and what to look for
5. Threshold for intervention: what score triggers prompt revision or human escalation
6. Regression testing: how to know if a model update made things worse
Rule: evaluation criteria must be specific enough that two different people would give the same score to the same output.
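Once the model returns your framework, the automated checks and weighted scoring can be wired up in a few lines. Below is a minimal sketch, assuming illustrative dimension names, weights, a 1-5 rubric, and an intervention threshold of 3.5; all of these are placeholders to replace with the values the framework produces for your use case.

```python
import re

# Hypothetical dimensions and weights -- replace with your framework's criteria.
# Per-dimension scores use a 1-5 rubric; weights must sum to 1.0.
WEIGHTS = {"accuracy": 0.4, "completeness": 0.3, "tone": 0.2, "format": 0.1}

# Automated checks: regex tests for obvious failures.
FAILURE_PATTERNS = [
    re.compile(r"as an ai (language )?model", re.IGNORECASE),  # refusal boilerplate
    re.compile(r"\[[A-Z_ ]+\]"),                               # unfilled template slots
]

def automated_check(output: str) -> list[str]:
    """Return a list of failure reasons; an empty list means the output passes."""
    failures = [p.pattern for p in FAILURE_PATTERNS if p.search(output)]
    if not output.strip():
        failures.append("empty output")
    return failures

def overall_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension rubric scores."""
    return sum(WEIGHTS[d] * s for d, s in dimension_scores.items())

INTERVENTION_THRESHOLD = 3.5  # below this, escalate to a human or revise the prompt

scores = {"accuracy": 4, "completeness": 3, "tone": 5, "format": 4}
needs_escalation = overall_score(scores) < INTERVENTION_THRESHOLD
```

For regression testing, run the same golden test set through the old and new model versions and compare `overall_score` distributions before promoting an update.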
How to use this prompt
1. Click Copy Prompt above
2. Open ChatGPT, Claude, or Gemini
3. Paste the prompt and replace all [BRACKETED] text with your details
4. Send it and refine the output as needed
Want a custom version?
Use the Prompt Builder — fill in a form and we'll assemble a prompt tailored to your exact situation.