
Multi-Pass Judge Prompts: AI QA with AI

Use AI to evaluate AI outputs with multi-pass judge prompts. Includes prompt templates, scoring systems, and workflow integration.

AgentMastery Team • February 15, 2025 • 8 min read

Updated Oct 2025

Quick Answer

Key Takeaway: Run every AI-generated draft through one or more judge prompts: a second model that scores the output against explicit criteria such as accuracy, brand voice, and completeness. Only low-scoring items go to human review, which keeps QA costs at a few cents per post and typically saves 70-80% of review time.

AI Testing • Judge Prompts • QA • Automation • Workflow

Article Outline

Introduction

  • The paradox: Using AI to check AI
  • Why it works: Different models, different temperatures, different prompts
  • When judge prompts save time vs manual review
  • Real use cases at scale (100+ outputs/week)

What Are Judge Prompts?

Definition: A secondary AI prompt that evaluates the first output for quality, accuracy, or compliance with requirements.

The workflow:

Step 1: Generate content (GPT-3.5)
Step 2: Judge output (GPT-4 or Claude)
Step 3: Accept, refine, or regenerate

Why this works:

  • Different model = fresh evaluation
  • Explicit criteria in judge prompt = systematic checking
  • Lower temperature for judge = more reliable scoring
  • Cheaper than human review for high volume
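
A minimal sketch of that three-step loop in Python, assuming the openai Python SDK (v1+), an OPENAI_API_KEY in the environment, and illustrative model names; swap in whatever models you actually use:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def complete(prompt: str, model: str, temperature: float) -> str:
    # One chat completion; returns the text of the first choice.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content

# Step 1: generate with a cheap, creative model
draft = complete(
    "Write a 500-word blog post about AI testing tools",
    model="gpt-3.5-turbo", temperature=0.7,
)

# Step 2: judge with a stronger model at low temperature
verdict = complete(
    "Review this post for factual accuracy. Score 1-10 and flag suspicious claims.\n\n" + draft,
    model="gpt-4o", temperature=0.2,
)

# Step 3: accept, refine, or regenerate based on the verdict
print(verdict)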

Types of Judge Prompts

Type 1: Accuracy Judge

Review this AI-generated article for factual accuracy.

Check for:
- Unverified statistics or data points
- Made-up citations or sources
- Internal contradictions
- Claims that sound suspicious

Score accuracy 1-10 and list any flagged claims.

Type 2: Brand Voice Judge

Evaluate if this content matches our brand voice.

Our brand: [Professional, data-driven, no fluff, founder wisdom]

Rate 1-10:
- Tone match
- Vocabulary appropriateness
- Sentence structure fit
- Overall brand alignment

Flag specific phrases that miss the mark.

Type 3: Completeness Judge

Check if this content addresses all requirements from the brief.

Brief requirements:
- [List requirements]

For each requirement:
- ✅ Fully addressed
- ⚠️ Partially addressed
- ❌ Missing

Overall completeness score: X/10

Type 4: SEO Quality Judge

Evaluate this content for SEO quality.

Check:
- Primary keyword in first 100 words?
- H2s include semantic keywords?
- Internal linking opportunities (list them)?
- Meta description compelling?
- Readability for target audience?

Score each 1-10 + overall SEO score.

Type 5: Compliance Judge

Review for legal/compliance issues.

Flag:
- Unsubstantiated claims
- Medical/legal advice (prohibited)
- Competitor disparagement
- Trademark usage
- Privacy concerns
- Accessibility issues

Risk level: Low/Medium/High

Multi-Pass Judge Workflow

Pass One: Generate

Prompt: "Write a 500-word blog post about AI testing tools"
Model: GPT-3.5 (fast, cheap)
Temperature: 0.7
Output: Draft content

Pass Two: Judge (Accuracy)

Prompt: "Review this for factual accuracy. Score 1-10 and flag suspicious claims."
Model: GPT-4 (more reliable)
Temperature: 0.2 (near-deterministic scoring)
Output: Accuracy score + flagged items

Pass Three: Judge (Brand Voice)

Prompt: "Does this match [brand voice]? Score 1-10 and flag off-brand phrases."
Model: Claude Opus (strong at nuance)
Temperature: 0.2
Output: Voice score + specific feedback

Pass Four: Conditional Refinement

IF accuracy < 7 OR voice < 7:
    Regenerate with feedback incorporated
ELSE:
    Manual polish + publish

Cost example (500-word post):

  • Generate: $0.003
  • Judge (accuracy): $0.02
  • Judge (voice): $0.02
  • Total: $0.043 vs $20-50 for human QA
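
Pass Four in code is just score extraction plus a threshold. A sketch, assuming each judge reply contains an "N/10" figure; the sample replies below are illustrative:

import re

def extract_score(judge_reply: str) -> int:
    # Pull the first "N/10" figure out of the judge's reply.
    match = re.search(r"(\d+)\s*/\s*10", judge_reply)
    if not match:
        raise ValueError("no score found in judge reply")
    return int(match.group(1))

# Illustrative judge replies from Pass Two and Pass Three
accuracy_reply = "Accuracy: 6/10. Flagged: the 2024 adoption statistic is unverified."
voice_reply = "Voice match: 8/10. Two phrases read as generic filler."

accuracy = extract_score(accuracy_reply)
voice = extract_score(voice_reply)

if accuracy < 7 or voice < 7:
    print("Regenerate with the judges' feedback appended to the prompt")
else:
    print("Accept: manual polish + publish")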

Judge Prompt Engineering

Bad judge prompt:

Is this good?

Problem: Vague, subjective, inconsistent scoring.

Good judge prompt:

Evaluate this content on these specific criteria:

• Factual Accuracy (1-10):
   - All claims verifiable?
   - No hallucinated citations?
   - Internal consistency?

• Readability (1-10):
   - Clear sentence structure?
   - Appropriate vocabulary for audience?
   - Logical flow?

• Completeness (1-10):
   - Addresses all brief requirements?
   - No missing sections?
   - Sufficient depth?

For each dimension, provide:
- Score (1-10)
- Reasoning (1 sentence)
- Specific issues (if score < 7)

Overall recommendation: Accept / Refine / Regenerate

Key elements:

  • Specific dimensions to evaluate
  • Numeric scores for consistency
  • Reasoning required to validate judgment
  • Action recommendation for next step
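
To keep scores parseable across hundreds of runs, you can also ask the judge to reply in JSON and parse it directly. The field names below are an assumption to adapt to your own rubric:

import json

# Append this to any judge prompt so the reply is machine-readable
JUDGE_SUFFIX = """
Respond with JSON only, in this exact shape:
{"accuracy": 0, "readability": 0, "completeness": 0,
 "issues": [], "recommendation": "accept | refine | regenerate"}
"""

# Illustrative judge reply
reply = '{"accuracy": 8, "readability": 7, "completeness": 9, "issues": ["intro repeats the title"], "recommendation": "refine"}'

try:
    scores = json.loads(reply)
except json.JSONDecodeError:
    scores = None  # judge ignored the format; fall back to manual review

if scores and scores["recommendation"] != "accept":
    print("Needs another pass:", scores["issues"])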

Scoring Systems

Simple Pass/Fail:

Does this meet requirements? Yes/No
If No, list what's missing.

Best for: Binary quality gates

Numeric Scale (1-10):

Score this output 1-10 on [criterion].
• 1-3 = Unacceptable
• 4-6 = Needs improvement
• 7-8 = Good
• 9-10 = Excellent

Best for: Nuanced evaluation

Weighted Dimensions:

Accuracy (40%): 8/10
Clarity (30%): 7/10
Brand Voice (20%): 9/10
SEO (10%): 6/10

Overall: (8*0.4 + 7*0.3 + 9*0.2 + 6*0.1) = 7.7/10

Best for: Balancing multiple priorities
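
The roll-up itself is trivial to automate; a small helper (reproducing the numbers above) keeps weights and scores in one place:

WEIGHTS = {"accuracy": 0.4, "clarity": 0.3, "brand_voice": 0.2, "seo": 0.1}

def weighted_score(scores: dict[str, float]) -> float:
    # Weighted average over the dimensions defined in WEIGHTS
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

print(round(weighted_score({"accuracy": 8, "clarity": 7, "brand_voice": 9, "seo": 6}), 1))  # 7.7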

Rubric-Based:

For each criterion, select:
- Exceeds expectations (3 pts)
- Meets expectations (2 pts)
- Below expectations (1 pt)
- Fails (0 pts)

Total score: X/15

Best for: Consistent team evaluation

When Judge Prompts Work Best

✅ High-volume content

  • Publishing 50+ pieces/month
  • Need systematic QA at scale
  • Human review is bottleneck

✅ Objective criteria

  • Fact-checking
  • SEO compliance
  • Structural requirements
  • Word count, formatting

✅ First-pass filtering

  • Eliminate obvious failures before human review
  • Route: AI judge → humans review only flagged items
  • Saves 70-80% of QA time

❌ When to skip judge prompts

  • Highly subjective creative work
  • Legal/medical content (human expert required)
  • Low volume (<10 pieces/month)
  • Brand-critical content (CEO blog, etc.)

Advanced Techniques

Multi-Judge Consensus:

Judge 1 (GPT-4): Score 8/10
Judge 2 (Claude): Score 7/10
Judge 3 (Gemini): Score 9/10

Average: 8/10
Variance: Low → High confidence in quality
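
The consensus check needs nothing beyond the standard library; the variance cutoff of 1.0 below is an assumption to tune on your own data:

from statistics import mean, pvariance

judge_scores = {"gpt-4": 8, "claude": 7, "gemini": 9}  # illustrative scores

avg = mean(judge_scores.values())
spread = pvariance(judge_scores.values())

if spread <= 1.0:
    print(f"Consensus score {avg:.1f}: high confidence")
else:
    print(f"Judges disagree (variance {spread:.1f}): send to human review")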

Chain-of-Thought Judging:

Evaluate this step-by-step:

- First, list all factual claims made
- Then, assess if each claim is verifiable
- Next, check for internal contradictions
- Finally, provide an overall accuracy score

Think through each step explicitly before scoring.

Self-Critique (Same Model):

Pass One: Generate content (GPT-4, temp 0.7)
Pass Two: Same model critiques (GPT-4, temp 0.2)

Benefit: Model knows its own weaknesses
Risk: May miss systematic model biases

Human-in-the-Loop:

- AI generates content
- AI judges content → flags issues
- Human reviews only flagged items
- Human decides: Accept / Edit / Regenerate

Result: 80% auto-accepted, 20% human review
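
A sketch of the routing step: everything at or above a 7/10 judge score is auto-accepted, the rest goes into a review queue for the human pass (filenames and scores below are illustrative):

import csv

scored_posts = [("post-01.md", 9), ("post-02.md", 6), ("post-03.md", 8)]  # illustrative
THRESHOLD = 7

flagged = [(name, score) for name, score in scored_posts if score < THRESHOLD]

# Write a review queue for the human pass; everything else is auto-accepted
with open("review_queue.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["file", "judge_score"])
    writer.writerows(flagged)

print(f"{len(scored_posts) - len(flagged)} auto-accepted, {len(flagged)} queued for review")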

Real Workflow Example

Use case: SEO blog content factory (100 posts/month)

Step One: Generate (GPT-3.5)

  • 100 posts × $0.02 each = $2.00

Step Two: Judge (GPT-4)

  • 100 posts × $0.03 judging = $3.00
  • Flags 30 posts as below threshold

Step Three: Human Review

  • 30 flagged posts × 10 min review = 5 hours
  • 70 posts auto-accepted

Result:

  • $5 AI cost
  • 5 hours human time (vs 16.7 hours for all posts)
  • 70% time savings
  • Consistent quality gates

Building a Judge Library

Create reusable judge prompts for common needs:

/judges
  /accuracy.txt - Fact-checking template
  /brand-voice.txt - Voice consistency template
  /seo.txt - SEO quality template
  /completeness.txt - Brief requirements template
  /compliance.txt - Legal risk template
  /readability.txt - Audience fit template

Usage:

content = generate(prompt)
accuracy_score = judge(content, judges/accuracy.txt)
brand_score = judge(content, judges/brand-voice.txt)

if accuracy_score < 7 or brand_score < 7:
    refine(content, feedback)

Integration with Tools

Manual (spreadsheet):

  • Column A: Original content
  • Column B: Judge prompt
  • Column C: Judge output
  • Column D: Scores extracted
  • Column E: Accept/Reject decision

Automated (scripts):

  • Python/Node.js scripts
  • Loop through content files
  • Run judge prompts
  • Log scores
  • Route based on thresholds
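
A condensed sketch of such a script, assuming the openai Python SDK (v1+), the /judges folder from the previous section, a content/ folder of Markdown drafts, and judge replies that end with an "Overall: N/10" line; all of those are conventions to adapt, not fixed APIs:

import re
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set
THRESHOLD = 7

def run_judge(content: str, template_path: Path) -> str:
    # Fill the judge template with the content and get a low-temperature verdict.
    prompt = template_path.read_text() + "\n\nContent to evaluate:\n" + content
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return response.choices[0].message.content

def overall_score(reply: str) -> int:
    match = re.search(r"Overall:\s*(\d+)\s*/\s*10", reply)
    return int(match.group(1)) if match else 0  # treat unparseable replies as failures

accuracy_judge = Path("judges/accuracy.txt")

for post in Path("content").glob("*.md"):
    reply = run_judge(post.read_text(), accuracy_judge)
    score = overall_score(reply)
    decision = "accept" if score >= THRESHOLD else "flag for review"
    print(f"{post.name}: {score}/10 -> {decision}")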

Platform integration:

  • Outranking: Built-in SEO scoring
  • Custom CMS: Add judge step to publish workflow
  • QA tools: Integrate with existing review process

Limitations & Pitfalls

❌ Judge models hallucinate too

  • They can flag correct content as wrong
  • They can miss actual errors
  • Always spot-check judge decisions

❌ Garbage in, garbage out

  • Bad judge prompts = useless feedback
  • Vague criteria = inconsistent scores
  • Test and refine judge prompts

❌ Cost can add up

  • Judging with GPT-4 = 10-20x generation cost
  • For low-value content, may not be worth it
  • Use cheaper models for simple checks

❌ Not a replacement for humans

  • Final quality decisions need human judgment
  • Complex/subjective topics need expert review
  • Legal/medical/financial = always human

Next Steps

  • Pick one quality dimension - Start with accuracy or brand voice
  • Write a judge prompt - Use templates above
  • Test on 5 outputs - See if scores make sense
  • Refine the prompt - Improve scoring consistency
  • Automate - Add to your content workflow

Conclusion

  • Judge prompts scale QA for high-volume AI content
  • Most effective for objective criteria (facts, SEO, structure)
  • Multi-pass workflow catches more issues than single review
  • Cost-effective: $0.05/post vs $20-50 human QA
  • Best as first filter, not replacement for human judgment

CTAs


Note to writer: When expanding:

  • Include full judge prompt templates (copy-paste ready)
  • Provide scoring spreadsheet template
  • Add code examples for automation (Python/Node.js)
  • Include case study with before/after metrics
  • Show real judge outputs with scores
