Multi-Pass Judge Prompts: AI QA with AI
Use AI to evaluate AI outputs with multi-pass judge prompts. Includes prompt templates, scoring systems, and workflow integration.
Updated Oct 2025
Article Outline
Introduction
- The paradox: Using AI to check AI
- Why it works: Different models, different temperatures, different prompts
- When judge prompts save time vs manual review
- Real use cases at scale (100+ outputs/week)
What Are Judge Prompts?
Definition: A secondary AI prompt that evaluates the first output for quality, accuracy, or compliance with requirements.
The workflow, sketched in code below:
Step 1: Generate content (GPT-3.5)
Step 2: Judge output (GPT-4 or Claude)
Step 3: Accept, refine, or regenerate
Why this works:
- Different model = fresh evaluation
- Explicit criteria in judge prompt = systematic checking
- Lower temperature for judge = more reliable scoring
- Cheaper than human review for high volume
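Here is a minimal sketch of that three-step loop in Python, assuming the OpenAI Python SDK (openai >= 1.x) and an OPENAI_API_KEY in the environment; the model names and the 7/10 acceptance threshold are illustrative placeholders, not recommendations.

import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(model: str, prompt: str, temperature: float) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

# Step 1: generate with a fast, cheap model at a creative temperature.
draft = ask("gpt-3.5-turbo", "Write a 500-word blog post about AI testing tools", 0.7)

# Step 2: judge with a different model at a low temperature.
judge_prompt = (
    "Review the article below for factual accuracy. "
    "Put a single 1-10 score on the first line, then list any flagged claims.\n\n" + draft
)
verdict = ask("gpt-4", judge_prompt, 0.2)

# Step 3: accept, refine, or regenerate based on the parsed score.
match = re.search(r"\d+", verdict)
score = int(match.group()) if match else 0  # treat an unparseable reply as a fail
print(score, "accept" if score >= 7 else "regenerate")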
Types of Judge Prompts
Type 1: Accuracy Judge
Review this AI-generated article for factual accuracy.
Check for:
- Unverified statistics or data points
- Made-up citations or sources
- Internal contradictions
- Claims that sound suspicious
Score accuracy 1-10 and list any flagged claims.
Type 2: Brand Voice Judge
Evaluate if this content matches our brand voice.
Our brand: [Professional, data-driven, no fluff, founder wisdom]
Rate 1-10:
- Tone match
- Vocabulary appropriateness
- Sentence structure fit
- Overall brand alignment
Flag specific phrases that miss the mark.
Type 3: Completeness Judge
Check if this content addresses all requirements from the brief.
Brief requirements:
- [List requirements]
For each requirement:
- ✅ Fully addressed
- ⚠️ Partially addressed
- ❌ Missing
Overall completeness score: X/10
Type 4: SEO Quality Judge
Evaluate this content for SEO quality.
Check:
- Primary keyword in first 100 words?
- H2s include semantic keywords?
- Internal linking opportunities (list them)?
- Meta description compelling?
- Readability for target audience?
Score each 1-10 + overall SEO score.
Type 5: Compliance Judge
Review for legal/compliance issues.
Flag:
- Unsubstantiated claims
- Medical/legal advice (prohibited)
- Competitor disparagement
- Trademark usage
- Privacy concerns
- Accessibility issues
Risk level: Low/Medium/High
Multi-Pass Judge Workflow
Pass One: Generate
Prompt: "Write a 500-word blog post about AI testing tools"
Model: GPT-3.5 (fast, cheap)
Temperature: 0.7
Output: Draft content
Pass Two: Judge (Accuracy)
Prompt: "Review this for factual accuracy. Score 1-10 and flag suspicious claims."
Model: GPT-4 (more reliable)
Temperature: 0.2 (near-deterministic, more consistent scoring)
Output: Accuracy score + flagged items
Pass Three: Judge (Brand Voice)
Prompt: "Does this match [brand voice]? Score 1-10 and flag off-brand phrases."
Model: Claude Opus (strong at nuance)
Temperature: 0.2
Output: Voice score + specific feedback
Pass Four: Conditional Refinement
IF accuracy < 7 OR voice < 7:
Regenerate with feedback incorporated
ELSE:
Manual polish + publish
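A hedged sketch of that gate in Python; judge_accuracy, judge_voice, and regenerate_with_feedback are assumed helper functions (not a real library), and the threshold of 7 follows the example above.

THRESHOLD = 7  # assumed quality bar from the example

def quality_gate(draft, judge_accuracy, judge_voice, regenerate_with_feedback):
    accuracy, accuracy_notes = judge_accuracy(draft)  # e.g. (6, "two unverified stats")
    voice, voice_notes = judge_voice(draft)           # e.g. (8, "on brand")

    if accuracy < THRESHOLD or voice < THRESHOLD:
        feedback = f"Accuracy notes: {accuracy_notes}\nVoice notes: {voice_notes}"
        return regenerate_with_feedback(draft, feedback), "regenerated"
    return draft, "ready for manual polish + publish"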
Cost example (500-word post):
- Generate: $0.003
- Judge (accuracy): $0.02
- Judge (voice): $0.02
- Total: $0.043 vs $20-50 for human QA
Judge Prompt Engineering
Bad judge prompt:
Is this good?
Problem: Vague, subjective, inconsistent scoring.
Good judge prompt:
Evaluate this content on these specific criteria:
• Factual Accuracy (1-10):
- All claims verifiable?
- No hallucinated citations?
- Internal consistency?
• Readability (1-10):
- Clear sentence structure?
- Appropriate vocabulary for audience?
- Logical flow?
• Completeness (1-10):
- Addresses all brief requirements?
- No missing sections?
- Sufficient depth?
For each dimension, provide:
- Score (1-10)
- Reasoning (1 sentence)
- Specific issues (if score < 7)
Overall recommendation: Accept / Refine / Regenerate
Key elements:
- Specific dimensions to evaluate
- Numeric scores for consistency
- Reasoning required to validate judgment
- Action recommendation for next step
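One way to make the numeric-score requirement machine-checkable is to ask the judge to answer in JSON and parse it: append JUDGE_FORMAT to any judge prompt, then parse the reply. This is an illustrative sketch, not a prescribed format; real code should handle replies that are not valid JSON (retry, or extract the first {...} block before parsing).

import json

JUDGE_FORMAT = """Return your evaluation as JSON only, exactly this shape:
{"accuracy": {"score": 0, "reasoning": "", "issues": []},
 "readability": {"score": 0, "reasoning": "", "issues": []},
 "completeness": {"score": 0, "reasoning": "", "issues": []},
 "recommendation": "accept | refine | regenerate"}"""

def parse_judgment(raw_reply: str) -> dict:
    verdict = json.loads(raw_reply)  # raises ValueError if the model added extra text
    verdict["flagged_dimensions"] = [
        name for name, detail in verdict.items()
        if isinstance(detail, dict) and detail.get("score", 10) < 7
    ]
    return verdict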
Scoring Systems
Simple Pass/Fail:
Does this meet requirements? Yes/No
If No, list what's missing.
Best for: Binary quality gates
Numeric Scale (1-10):
Score this output 1-10 on [criterion].
• 1-3 = Unacceptable
• 4-6 = Needs improvement
• 7-8 = Good
• 9-10 = Excellent
Best for: Nuanced evaluation
Weighted Dimensions:
Accuracy (40%): 8/10
Clarity (30%): 7/10
Brand Voice (20%): 9/10
SEO (10%): 6/10
Overall: (8*0.4 + 7*0.3 + 9*0.2 + 6*0.1) = 7.7/10
Best for: Balancing multiple priorities
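The weighted total above is just a dot product; a few lines of Python keep the arithmetic consistent across posts (the scores and weights here are the example values, not recommendations).

def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1.0"
    return sum(scores[dim] * weights[dim] for dim in weights)

scores = {"accuracy": 8, "clarity": 7, "brand_voice": 9, "seo": 6}
weights = {"accuracy": 0.4, "clarity": 0.3, "brand_voice": 0.2, "seo": 0.1}
print(round(weighted_score(scores, weights), 1))  # 7.7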
Rubric-Based:
For each criterion, select:
- Exceeds expectations (3 pts)
- Meets expectations (2 pts)
- Below expectations (1 pt)
- Fails (0 pts)
Total score: X/15
Best for: Consistent team evaluation
When Judge Prompts Work Best
✅ High-volume content
- Publishing 50+ pieces/month
- Need systematic QA at scale
- Human review is bottleneck
✅ Objective criteria
- Fact-checking
- SEO compliance
- Structural requirements
- Word count, formatting
✅ First-pass filtering
- Eliminate obvious failures before human review
- Route: AI judge → humans review only flagged items
- Saves 70-80% of QA time
❌ When to skip judge prompts
- Highly subjective creative work
- Legal/medical content (human expert required)
- Low volume (<10 pieces/month)
- Brand-critical content (CEO blog, etc.)
Advanced Techniques
Multi-Judge Consensus:
Judge 1 (GPT-4): Score 8/10
Judge 2 (Claude): Score 7/10
Judge 3 (Gemini): Score 9/10
Average: 8/10
Variance: Low → High confidence in quality
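A standard-library sketch of the consensus check; the scores are the example values above and the 1.0 spread cutoff is an assumption to tune on your own data.

from statistics import mean, pstdev

judge_scores = {"gpt-4": 8, "claude": 7, "gemini": 9}  # illustrative scores

avg = mean(judge_scores.values())
spread = pstdev(judge_scores.values())  # low spread = judges agree

confidence = "high" if spread <= 1.0 else "low (route to human review)"
print(f"average={avg:.1f}, spread={spread:.2f}, confidence={confidence}")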
Chain-of-Thought Judging:
Evaluate this step-by-step:
- First, list all factual claims made
- Then, assess if each claim is verifiable
- Next, check for internal contradictions
- Finally, provide an overall accuracy score
Think through each step explicitly before scoring.
Self-Critique (Same Model):
Pass One: Generate content (GPT-4, temp 0.7)
Pass Two: Same model critiques (GPT-4, temp 0.2)
Benefit: Simplest setup; a low-temperature critique pass still catches many of the model's own slips
Risk: May miss systematic model biases
Human-in-the-Loop:
- AI generates content
- AI judges content → flags issues
- Human reviews only flagged items
- Human decides: Accept / Edit / Regenerate
Result: 80% auto-accepted, 20% human review
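A sketch of that routing split; judge_score is an assumed helper that wraps a judge prompt call, and the 80/20 split is an observed outcome of the threshold, not something the code enforces.

def route_drafts(drafts: list[str], judge_score, threshold: int = 7):
    auto_accepted, needs_human_review = [], []
    for draft in drafts:
        score = judge_score(draft)  # assumed helper: returns a 1-10 score
        (auto_accepted if score >= threshold else needs_human_review).append(draft)
    return auto_accepted, needs_human_review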
Real Workflow Example
Use case: SEO blog content factory (100 posts/month)
Step One: Generate (GPT-3.5)
- 100 posts × $0.02 each = $2.00
Step Two: Judge (GPT-4)
- 100 posts × $0.03 judging = $3.00
- Flags 30 posts as below threshold
Step Three: Human Review
- 30 flagged posts × 10 min review = 5 hours
- 70 posts auto-accepted
Result:
- $5 AI cost
- 5 hours human time (vs 16.7 hours for all posts)
- 70% time savings
- Consistent quality gates
Building a Judge Library
Create reusable judge prompts for common needs:
/judges
/accuracy.txt - Fact-checking template
/brand-voice.txt - Voice consistency template
/seo.txt - SEO quality template
/completeness.txt - Brief requirements template
/compliance.txt - Legal risk template
/readability.txt - Audience fit template
Usage:
content = generate(prompt)
accuracy_score, accuracy_feedback = judge(content, "judges/accuracy.txt")
brand_score, brand_feedback = judge(content, "judges/brand-voice.txt")
if accuracy_score < 7 or brand_score < 7:
    content = refine(content, accuracy_feedback + "\n" + brand_feedback)
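One possible judge() helper behind that snippet, a minimal sketch assuming the OpenAI Python SDK (openai >= 1.x); it expects each template file to end with an instruction to put the 1-10 score on the first line, and the default model name is a placeholder.

import re
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(content: str, template_path: str, model: str = "gpt-4") -> tuple[int, str]:
    template = Path(template_path).read_text()
    prompt = f"{template}\n\n---\nContent to evaluate:\n{content}"
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # low temperature for more consistent scoring
    ).choices[0].message.content
    match = re.search(r"\d+", reply)            # first number in the reply is the score
    score = int(match.group()) if match else 0  # treat an unparseable reply as a fail
    return score, reply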
Integration with Tools
Manual (spreadsheet):
- Column A: Original content
- Column B: Judge prompt
- Column C: Judge output
- Column D: Scores extracted
- Column E: Accept/Reject decision
Automated (scripts):
- Python/Node.js scripts
- Loop through content files
- Run judge prompts
- Log scores
- Route based on thresholds
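A sketch of that loop using only the standard library plus the judge() helper sketched above; the directory name, file glob, threshold, and CSV columns are all assumptions to adapt.

import csv
from pathlib import Path

THRESHOLD = 7  # assumed quality gate

def run_batch(content_dir: str = "drafts", log_path: str = "judge_scores.csv") -> None:
    rows = []
    for path in sorted(Path(content_dir).glob("*.md")):
        score, _feedback = judge(path.read_text(), "judges/accuracy.txt")
        rows.append({
            "file": path.name,
            "accuracy": score,
            "decision": "auto-accept" if score >= THRESHOLD else "human-review",
        })
    with open(log_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["file", "accuracy", "decision"])
        writer.writeheader()
        writer.writerows(rows)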
Platform integration:
- Outranking: Built-in SEO scoring
- Custom CMS: Add judge step to publish workflow
- QA tools: Integrate with existing review process
Limitations & Pitfalls
❌ Judge models hallucinate too
- They can flag correct content as wrong
- They can miss actual errors
- Always spot-check judge decisions
❌ Garbage in, garbage out
- Bad judge prompts = useless feedback
- Vague criteria = inconsistent scores
- Test and refine judge prompts
❌ Cost can add up
- Judging with GPT-4 = 10-20x generation cost
- For low-value content, may not be worth it
- Use cheaper models for simple checks
❌ Not a replacement for humans
- Final quality decisions need human judgment
- Complex/subjective topics need expert review
- Legal/medical/financial = always human
Next Steps
- Pick one quality dimension - Start with accuracy or brand voice
- Write a judge prompt - Use templates above
- Test on 5 outputs - See if scores make sense
- Refine the prompt - Improve scoring consistency
- Automate - Add to your content workflow
Conclusion
- Judge prompts scale QA for high-volume AI content
- Most effective for objective criteria (facts, SEO, structure)
- Multi-pass workflow catches more issues than single review
- Cost-effective: $0.05/post vs $20-50 human QA
- Best as first filter, not replacement for human judgment
CTAs
- Test AI outputs with our accuracy calculator
- Learn AI QA workflow for marketers
- Compare AI models for best judge performance
Note to writer: When expanding:
- Include full judge prompt templates (copy-paste ready)
- Provide scoring spreadsheet template
- Add code examples for automation (Python/Node.js)
- Include case study with before/after metrics
- Show real judge outputs with scores
Free Tools & Resources
AI Prompt Engineering Field Guide (2025)
Master prompt engineering with proven patterns, real-world examples, and role-based frameworks.
Cold Email ROI Calculator
Estimate revenue uplift from email improvements and optimize your outbound strategy
List Your AI Tool
Get discovered by thousands of decision-makers searching for AI solutions.
From $250 • Featured listings available