
Multi-Pass Judge Prompts: AI QA with AI

Use AI to evaluate AI outputs with multi-pass judge prompts. Includes prompt templates, scoring systems, and workflow integration.

AgentMastery Team • February 15, 2025 • 8 min read

Updated Oct 2025

Quick Answer

Key Takeaway: Run every AI-generated draft through one or more judge prompts: a second model that scores the output against explicit criteria such as accuracy, brand voice, and completeness. Only low-scoring items go to human review, which keeps QA costs at a few cents per post and typically saves 70-80% of review time.

AI Testing • Judge Prompts • QA • Automation • Workflow

Article Outline

Introduction

  • The paradox: Using AI to check AI
  • Why it works: Different models, different temperatures, different prompts
  • When judge prompts save time vs manual review
  • Real use cases at scale (100+ outputs/week)

What Are Judge Prompts?

Definition: A secondary AI prompt that evaluates the first output for quality, accuracy, or compliance with requirements.

The workflow:

Step 1: Generate content (GPT-3.5)
Step 2: Judge output (GPT-4 or Claude)
Step 3: Accept, refine, or regenerate

Why this works:

  • Different model = fresh evaluation
  • Explicit criteria in judge prompt = systematic checking
  • Lower temperature for judge = more reliable scoring
  • Cheaper than human review for high volume
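
A minimal sketch of that three-step loop in Python, assuming the openai Python SDK (v1+), an OPENAI_API_KEY in the environment, and illustrative model names; swap in whatever models you actually use:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def complete(prompt: str, model: str, temperature: float) -> str:
    # One chat completion; returns the text of the first choice.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content

# Step 1: generate with a cheap, creative model
draft = complete(
    "Write a 500-word blog post about AI testing tools",
    model="gpt-3.5-turbo", temperature=0.7,
)

# Step 2: judge with a stronger model at low temperature
verdict = complete(
    "Review this post for factual accuracy. Score 1-10 and flag suspicious claims.\n\n" + draft,
    model="gpt-4o", temperature=0.2,
)

# Step 3: accept, refine, or regenerate based on the verdict
print(verdict)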

Types of Judge Prompts

Type 1: Accuracy Judge

Review this AI-generated article for factual accuracy.

Check for:
- Unverified statistics or data points
- Made-up citations or sources
- Internal contradictions
- Claims that sound suspicious

Score accuracy 1-10 and list any flagged claims.

Type 2: Brand Voice Judge

Evaluate if this content matches our brand voice.

Our brand: [Professional, data-driven, no fluff, founder wisdom]

Rate 1-10:
- Tone match
- Vocabulary appropriateness
- Sentence structure fit
- Overall brand alignment

Flag specific phrases that miss the mark.

Type 3: Completeness Judge

Check if this content addresses all requirements from the brief.

Brief requirements:
- [List requirements]

For each requirement:
- ✅ Fully addressed
- ⚠️ Partially addressed
- ❌ Missing

Overall completeness score: X/10

Type 4: SEO Quality Judge

Evaluate this content for SEO quality.

Check:
- Primary keyword in first 100 words?
- H2s include semantic keywords?
- Internal linking opportunities (list them)?
- Meta description compelling?
- Readability for target audience?

Score each 1-10 + overall SEO score.

Type 5: Compliance Judge

Review for legal/compliance issues.

Flag:
- Unsubstantiated claims
- Medical/legal advice (prohibited)
- Competitor disparagement
- Trademark usage
- Privacy concerns
- Accessibility issues

Risk level: Low/Medium/High

Multi-Pass Judge Workflow

Pass One: Generate

Prompt: "Write a 500-word blog post about AI testing tools"
Model: GPT-3.5 (fast, cheap)
Temperature: 0.7
Output: Draft content

Pass Two: Judge (Accuracy)

Prompt: "Review this for factual accuracy. Score 1-10 and flag suspicious claims."
Model: GPT-4 (more reliable)
Temperature: 0.2 (near-deterministic scoring)
Output: Accuracy score + flagged items

Pass Three: Judge (Brand Voice)

Prompt: "Does this match [brand voice]? Score 1-10 and flag off-brand phrases."
Model: Claude Opus (strong at nuance)
Temperature: 0.2
Output: Voice score + specific feedback

Pass Four: Conditional Refinement

IF accuracy < 7 OR voice < 7:
    Regenerate with feedback incorporated
ELSE:
    Manual polish + publish

Cost example (500-word post):

  • Generate: $0.003
  • Judge (accuracy): $0.02
  • Judge (voice): $0.02
  • Total: $0.043 vs $20-50 for human QA
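
Pass Four in code is just score extraction plus a threshold. A sketch, assuming each judge reply contains an "N/10" figure; the sample replies below are illustrative:

import re

def extract_score(judge_reply: str) -> int:
    # Pull the first "N/10" figure out of the judge's reply.
    match = re.search(r"(\d+)\s*/\s*10", judge_reply)
    if not match:
        raise ValueError("no score found in judge reply")
    return int(match.group(1))

# Illustrative judge replies from Pass Two and Pass Three
accuracy_reply = "Accuracy: 6/10. Flagged: the 2024 adoption statistic is unverified."
voice_reply = "Voice match: 8/10. Two phrases read as generic filler."

accuracy = extract_score(accuracy_reply)
voice = extract_score(voice_reply)

if accuracy < 7 or voice < 7:
    print("Regenerate with the judges' feedback appended to the prompt")
else:
    print("Accept: manual polish + publish")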

Judge Prompt Engineering

Bad judge prompt:

Is this good?

Problem: Vague, subjective, inconsistent scoring.

Good judge prompt:

Evaluate this content on these specific criteria:

• Factual Accuracy (1-10):
   - All claims verifiable?
   - No hallucinated citations?
   - Internal consistency?

• Readability (1-10):
   - Clear sentence structure?
   - Appropriate vocabulary for audience?
   - Logical flow?

• Completeness (1-10):
   - Addresses all brief requirements?
   - No missing sections?
   - Sufficient depth?

For each dimension, provide:
- Score (1-10)
- Reasoning (1 sentence)
- Specific issues (if score < 7)

Overall recommendation: Accept / Refine / Regenerate

Key elements:

  • Specific dimensions to evaluate
  • Numeric scores for consistency
  • Reasoning required to validate judgment
  • Action recommendation for next step
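
To keep scores parseable across hundreds of runs, you can also ask the judge to reply in JSON and parse it directly. The field names below are an assumption to adapt to your own rubric:

import json

# Append this to any judge prompt so the reply is machine-readable
JUDGE_SUFFIX = """
Respond with JSON only, in this exact shape:
{"accuracy": 0, "readability": 0, "completeness": 0,
 "issues": [], "recommendation": "accept | refine | regenerate"}
"""

# Illustrative judge reply
reply = '{"accuracy": 8, "readability": 7, "completeness": 9, "issues": ["intro repeats the title"], "recommendation": "refine"}'

try:
    scores = json.loads(reply)
except json.JSONDecodeError:
    scores = None  # judge ignored the format; fall back to manual review

if scores and scores["recommendation"] != "accept":
    print("Needs another pass:", scores["issues"])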

Scoring Systems

Simple Pass/Fail:

Does this meet requirements? Yes/No
If No, list what's missing.

Best for: Binary quality gates

Numeric Scale (1-10):

Score this output 1-10 on [criterion].
• 1-3 = Unacceptable
• 4-6 = Needs improvement
• 7-8 = Good
• 9-10 = Excellent

Best for: Nuanced evaluation

Weighted Dimensions:

Accuracy (40%): 8/10
Clarity (30%): 7/10
Brand Voice (20%): 9/10
SEO (10%): 6/10

Overall: (8*0.4 + 7*0.3 + 9*0.2 + 6*0.1) = 7.7/10

Best for: Balancing multiple priorities
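
The roll-up itself is trivial to automate; a small helper (reproducing the numbers above) keeps weights and scores in one place:

WEIGHTS = {"accuracy": 0.4, "clarity": 0.3, "brand_voice": 0.2, "seo": 0.1}

def weighted_score(scores: dict[str, float]) -> float:
    # Weighted average over the dimensions defined in WEIGHTS
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

print(round(weighted_score({"accuracy": 8, "clarity": 7, "brand_voice": 9, "seo": 6}), 1))  # 7.7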

Rubric-Based:

For each criterion, select:
- Exceeds expectations (3 pts)
- Meets expectations (2 pts)
- Below expectations (1 pt)
- Fails (0 pts)

Total score: X/15

Best for: Consistent team evaluation

When Judge Prompts Work Best

✅ High-volume content

  • Publishing 50+ pieces/month
  • Need systematic QA at scale
  • Human review is bottleneck

✅ Objective criteria

  • Fact-checking
  • SEO compliance
  • Structural requirements
  • Word count, formatting

✅ First-pass filtering

  • Eliminate obvious failures before human review
  • Route: AI judge → humans review only flagged items
  • Saves 70-80% of QA time

❌ When to skip judge prompts

  • Highly subjective creative work
  • Legal/medical content (human expert required)
  • Low volume (<10 pieces/month)
  • Brand-critical content (CEO blog, etc.)

Advanced Techniques

Multi-Judge Consensus:

Judge 1 (GPT-4): Score 8/10
Judge 2 (Claude): Score 7/10
Judge 3 (Gemini): Score 9/10

Average: 8/10
Variance: Low → High confidence in quality
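
The consensus check needs nothing beyond the standard library; the variance cutoff of 1.0 below is an assumption to tune on your own data:

from statistics import mean, pvariance

judge_scores = {"gpt-4": 8, "claude": 7, "gemini": 9}  # illustrative scores

avg = mean(judge_scores.values())
spread = pvariance(judge_scores.values())

if spread <= 1.0:
    print(f"Consensus score {avg:.1f}: high confidence")
else:
    print(f"Judges disagree (variance {spread:.1f}): send to human review")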

Chain-of-Thought Judging:

Evaluate this step-by-step:

- First, list all factual claims made
- Then, assess if each claim is verifiable
- Next, check for internal contradictions
- Finally, provide an overall accuracy score

Think through each step explicitly before scoring.

Self-Critique (Same Model):

Pass One: Generate content (GPT-4, temp 0.7)
Pass Two: Same model critiques (GPT-4, temp 0.2)

Benefit: Model knows its own weaknesses
Risk: May miss systematic model biases

Human-in-the-Loop:

- AI generates content
- AI judges content → flags issues
- Human reviews only flagged items
- Human decides: Accept / Edit / Regenerate

Result: 80% auto-accepted, 20% human review
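
A sketch of the routing step: everything at or above a 7/10 judge score is auto-accepted, the rest goes into a review queue for the human pass (filenames and scores below are illustrative):

import csv

scored_posts = [("post-01.md", 9), ("post-02.md", 6), ("post-03.md", 8)]  # illustrative
THRESHOLD = 7

flagged = [(name, score) for name, score in scored_posts if score < THRESHOLD]

# Write a review queue for the human pass; everything else is auto-accepted
with open("review_queue.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["file", "judge_score"])
    writer.writerows(flagged)

print(f"{len(scored_posts) - len(flagged)} auto-accepted, {len(flagged)} queued for review")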

Real Workflow Example

Use case: SEO blog content factory (100 posts/month)

Step One: Generate (GPT-3.5)

  • 100 posts × $0.02 each = $2.00

Step Two: Judge (GPT-4)

  • 100 posts × $0.03 judging = $3.00
  • Flags 30 posts as below threshold

Step Three: Human Review

  • 30 flagged posts × 10 min review = 5 hours
  • 70 posts auto-accepted

Result:

  • $5 AI cost
  • 5 hours human time (vs 16.7 hours for all posts)
  • 70% time savings
  • Consistent quality gates

Building a Judge Library

Create reusable judge prompts for common needs:

/judges
  /accuracy.txt - Fact-checking template
  /brand-voice.txt - Voice consistency template
  /seo.txt - SEO quality template
  /completeness.txt - Brief requirements template
  /compliance.txt - Legal risk template
  /readability.txt - Audience fit template

Usage:

content = generate(prompt)
accuracy_score = judge(content, judges/accuracy.txt)
brand_score = judge(content, judges/brand-voice.txt)

if accuracy_score < 7 or brand_score < 7:
    refine(content, feedback)

Integration with Tools

Manual (spreadsheet):

  • Column A: Original content
  • Column B: Judge prompt
  • Column C: Judge output
  • Column D: Scores extracted
  • Column E: Accept/Reject decision

Automated (scripts):

  • Python/Node.js scripts
  • Loop through content files
  • Run judge prompts
  • Log scores
  • Route based on thresholds
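
A condensed sketch of such a script, assuming the openai Python SDK (v1+), the /judges folder from the previous section, a content/ folder of Markdown drafts, and judge replies that end with an "Overall: N/10" line; all of those are conventions to adapt, not fixed APIs:

import re
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set
THRESHOLD = 7

def run_judge(content: str, template_path: Path) -> str:
    # Fill the judge template with the content and get a low-temperature verdict.
    prompt = template_path.read_text() + "\n\nContent to evaluate:\n" + content
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return response.choices[0].message.content

def overall_score(reply: str) -> int:
    match = re.search(r"Overall:\s*(\d+)\s*/\s*10", reply)
    return int(match.group(1)) if match else 0  # treat unparseable replies as failures

accuracy_judge = Path("judges/accuracy.txt")

for post in Path("content").glob("*.md"):
    reply = run_judge(post.read_text(), accuracy_judge)
    score = overall_score(reply)
    decision = "accept" if score >= THRESHOLD else "flag for review"
    print(f"{post.name}: {score}/10 -> {decision}")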

Platform integration:

  • Outranking: Built-in SEO scoring
  • Custom CMS: Add judge step to publish workflow
  • QA tools: Integrate with existing review process

Limitations & Pitfalls

❌ Judge models hallucinate too

  • They can flag correct content as wrong
  • They can miss actual errors
  • Always spot-check judge decisions

❌ Garbage in, garbage out

  • Bad judge prompts = useless feedback
  • Vague criteria = inconsistent scores
  • Test and refine judge prompts

❌ Cost can add up

  • Judging with GPT-4 = 10-20x generation cost
  • For low-value content, may not be worth it
  • Use cheaper models for simple checks

❌ Not a replacement for humans

  • Final quality decisions need human judgment
  • Complex/subjective topics need expert review
  • Legal/medical/financial = always human

Next Steps

  • Pick one quality dimension - Start with accuracy or brand voice
  • Write a judge prompt - Use templates above
  • Test on 5 outputs - See if scores make sense
  • Refine the prompt - Improve scoring consistency
  • Automate - Add to your content workflow

Conclusion

  • Judge prompts scale QA for high-volume AI content
  • Most effective for objective criteria (facts, SEO, structure)
  • Multi-pass workflow catches more issues than single review
  • Cost-effective: $0.05/post vs $20-50 human QA
  • Best as first filter, not replacement for human judgment

CTAs


Note to writer: When expanding:

  • Include full judge prompt templates (copy-paste ready)
  • Provide scoring spreadsheet template
  • Add code examples for automation (Python/Node.js)
  • Include case study with before/after metrics
  • Show real judge outputs with scores
