Building a Lightweight AI Eval Harness in Google Sheets
Create a no-code AI evaluation system using Google Sheets. Track prompts, outputs, scores, and costs across models. Includes template and formulas.
Updated Oct 2025
Key Takeaway
You don't need a paid eval platform to test AI systematically: a Google Sheet that logs prompts, outputs, scores, and costs per model covers most solo and small-team use cases.
Article Outline
Introduction
- Why you don't need expensive platforms to test AI
- Google Sheets as a lightweight eval harness
- What you can track: prompts, outputs, scores, costs, models
- Who this is for: solo creators, small teams, and budget-conscious testers
What is an Eval Harness?
Definition: A system to systematically test AI outputs against criteria, track results, and measure improvement over time.
Enterprise versions (expensive):
- LangSmith ($99-$999/month)
- PromptLayer ($49-$299/month)
- Custom-built evaluation pipelines
DIY version (free):
- Google Sheets + manual/semi-automated testing
- Good enough for most use cases
- Scales to a few hundred tests
The Spreadsheet Structure
Sheet 1: Test Cases
| Test ID | Prompt | Expected Output Type | Use Case | Priority |
|---|---|---|---|---|
| TC-001 | "Summarize this article..." | Concise summary | Blog summarization | High |
| TC-002 | "Write a cold email..." | Personalized email | Outreach | Medium |
Sheet 2: Outputs
| Test ID | Model | Temperature | Timestamp | Output Text | Token Count | Cost |
|---|---|---|---|---|---|---|
| TC-001 | GPT-4 | 0.7 | 2025-01-15 | [output] | 250 | $0.0075 |
| TC-001 | GPT-3.5 | 0.7 | 2025-01-15 | [output] | 230 | $0.00034 |
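A sketch for the Cost column, assuming you keep a small Pricing tab (hypothetical) with model names in column A and price per 1,000 tokens in column B; with the model in B2 and the token count in F2:
=F2/1000 * VLOOKUP(B2, Pricing!A:B, 2, FALSE)
Note this applies one blended rate to all tokens; track input and output tokens separately if you need exact costs.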
Sheet 3: Scores
| Test ID | Model | Accuracy (1-10) | Clarity (1-10) | Usefulness (1-10) | Overall | Reviewer | Notes |
|---|---|---|---|---|---|---|---|
| TC-001 | GPT-4 | 9 | 8 | 9 | 8.67 | Sarah | Strong summary |
| TC-001 | GPT-3.5 | 7 | 7 | 7 | 7.00 | Sarah | Missing context |
Sheet 4: Analysis Dashboard
| Model | Avg Overall Score | Total Cost | Tests Run | Cost Per Test | Winner Count |
|---|---|---|---|---|---|
| GPT-4 | 8.5 | $2.45 | 50 | $0.049 | 35 |
| GPT-3.5 | 7.2 | $0.18 | 50 | $0.0036 | 15 |
Step-by-Step Setup
Step 1: Create test cases
- List all scenarios you want to test
- Write clear prompts
- Define expected output characteristics
- Prioritize (High/Medium/Low)
Step 2: Generate outputs
For each test case:
For each model (GPT-4, GPT-3.5, Claude, etc.):
- Run the prompt
- Paste output into Sheet 2
- Record: model, temperature, timestamp, tokens, cost
Step 3: Score outputs
- Define scoring dimensions (accuracy, clarity, brand voice, etc.)
- Score each output 1-10 per dimension
- Calculate overall score (average or weighted)
- Add notes for context
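For the Overall column (dimension scores in columns C to E of the Scores sheet), a simple average works; switch to the SUMPRODUCT approach shown in the rubric section below if you weight dimensions:
=AVERAGE(C2:E2)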
Step 4: Analyze results
- Use formulas to aggregate scores by model
- Calculate cost efficiency (score per dollar)
- Identify best model for each use case
- Track improvement over time (as you refine prompts)
Key Formulas
Average score by model:
=AVERAGEIF(Scores!B:B, "GPT-4", Scores!F:F)
Total cost by model:
=SUMIF(Outputs!B:B, "GPT-4", Outputs!G:G)
Cost per point (efficiency metric) = Total Cost / (Average Score × Number of Tests); in the Analysis Dashboard layout above (Avg Overall Score in B, Total Cost in C, Tests Run in D):
=C2/(B2*D2)
Winner count (how many times a model scored highest on a test): add a helper column to the Scores sheet that flags each test's top score, e.g. in I2:
=IF(F2=MAXIFS(F:F, A:A, A2), 1, 0)
then count wins per model:
=SUMIFS(Scores!I:I, Scores!B:B, "GPT-4")
Score improvement over time (for a specific test, assuming you add a run-date column to the Scores sheet, e.g. column J; a positive slope means scores are rising):
=SLOPE(FILTER(Scores!F:F, Scores!A:A="TC-001"), FILTER(Scores!J:J, Scores!A:A="TC-001"))
Scoring Rubrics
Simple scoring (1-10):
- 1-3: Unacceptable (would not use)
- 4-6: Needs significant editing
- 7-8: Good with minor edits
- 9-10: Publish-ready
Dimensional scoring:
| Dimension | Weight | GPT-4 Score | GPT-3.5 Score |
|---|---|---|---|
| Accuracy | 40% | 9 | 7 |
| Clarity | 30% | 8 | 7 |
| Brand Voice | 20% | 8 | 6 |
| SEO | 10% | 7 | 6 |
| Weighted Overall | 100% | 8.3 | 6.7 |
Formula: =SUMPRODUCT(weights, scores), e.g. =SUMPRODUCT(B2:B5, C2:C5) if the weights sit in B2:B5 and one model's scores in C2:C5
Pass/Fail with notes:
- ✅ Pass: Meets all requirements
- ⚠️ Pass with edits: Meets most, needs minor fixes
- ❌ Fail: Below threshold, regenerate
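If you want the pass/fail status computed from the numeric Overall score (column F of the Scores sheet), here is a sketch using example thresholds tied to the simple rubric above:
=IFS(F2>=9, "✅ Pass", F2>=7, "⚠️ Pass with edits", TRUE, "❌ Fail")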
Tracking Multiple Variables
What to track:
Model variables:
- Model name (GPT-4, Claude, etc.)
- Temperature setting
- Max tokens
- System prompt
Prompt variables:
- Prompt version (v1, v2, v3...)
- Prompt length
- Context provided
- Instruction clarity
Output metrics:
- Token count
- Cost
- Generation time
- Format compliance
Quality scores:
- Accuracy
- Clarity
- Brand voice
- Usefulness
- Overall
Business metrics:
- Use case category
- Priority level
- Actual usage (did you use this output?)
- Revision time needed
Advanced: Conditional Formatting
Highlight best performers:
Green: Scores ≥ 8
Yellow: Scores 6-7.9
Red: Scores < 6
Cost efficiency indicators:
Green: Cost per point < $0.01
Yellow: Cost per point $0.01-$0.05
Red: Cost per point > $0.05
Model comparison heatmap:
For each test case, color:
Green = best model
Yellow = middle
Red = worst
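One way to build the heatmap: select the Overall column of the Scores sheet (e.g. F2:F1000), open Format → Conditional formatting, choose "Custom formula is", and use a rule that highlights a row when its score is the highest for its Test ID:
=$F2=MAXIFS($F:$F, $A:$A, $A2)
Add a second rule with MINIFS for the worst model, and leave the middle rows uncolored or yellow.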
Example Use Cases
Use Case 1: Blog summarization
Test Cases: 10 blog posts (varying length/complexity)
Models: GPT-4, GPT-3.5, Claude Haiku
Scoring: Accuracy, completeness, conciseness
Result: Claude Haiku wins (best speed-to-quality)
Use Case 2: Cold email generation
Test Cases: 5 persona variations
Models: GPT-4, GPT-3.5
Scoring: Personalization, clarity, CTA strength
Result: GPT-4 wins (better personalization)
Use Case 3: SEO content briefs
Test Cases: 8 keyword-based briefs
Models: GPT-4, GPT-3.5 + SERP context
Scoring: Heading relevance, keyword coverage, intent match
Result: GPT-3.5 with SERP data wins (good enough + cheap)
Automation (Optional)
Google Apps Script integration:
// Calls the OpenAI Chat Completions API and logs one row to the Outputs sheet.
// Assumes the API key is stored in Script Properties as OPENAI_API_KEY;
// generateTestID() and calculateCost() are helpers you define for your own
// ID scheme and pricing table.
function runTest(prompt, model) {
  const apiKey = PropertiesService.getScriptProperties().getProperty('OPENAI_API_KEY');
  const res = UrlFetchApp.fetch('https://api.openai.com/v1/chat/completions', {
    method: 'post',
    contentType: 'application/json',
    headers: { Authorization: 'Bearer ' + apiKey },
    payload: JSON.stringify({
      model: model,
      temperature: 0.7,
      messages: [{ role: 'user', content: prompt }],
    }),
  });
  const data = JSON.parse(res.getContentText());
  // Log to Outputs sheet (column order must match Sheet 2)
  const sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('Outputs');
  sheet.appendRow([
    generateTestID(),                              // your own test ID scheme
    model,
    0.7,                                           // temperature
    new Date(),                                    // timestamp
    data.choices[0].message.content,               // output text
    data.usage.total_tokens,                       // token count
    calculateCost(data.usage.total_tokens, model)  // your pricing table
  ]);
}
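For the one-click execution listed under the benefits below, a small sketch that adds a custom menu; runSelectedTest is a hypothetical wrapper that would read the prompt and model from the active row and call runTest():
function onOpen() {
  // Adds an "Eval Harness" menu when the spreadsheet opens
  SpreadsheetApp.getUi()
    .createMenu('Eval Harness')
    .addItem('Run selected test', 'runSelectedTest') // hypothetical wrapper around runTest()
    .addToUi();
}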
Benefits:
- One-click test execution
- Automatic cost calculation
- Timestamp tracking
- Reduces manual copy-paste
Limitations:
- Requires coding knowledge
- API key management
- Rate limits
Workflow Integration
Weekly evaluation cycle:
Monday: Define 5-10 new test cases
Tuesday: Generate outputs across models
Wednesday: Score outputs (20-30 min)
Thursday: Analyze results, identify improvements
Friday: Refine prompts based on learnings
Continuous improvement:
1. Baseline test (v1 prompts)
2. Identify weak scores
3. Refine prompts (v2)
4. Re-test same cases
5. Compare v1 vs v2
6. Repeat
Tracking Prompt Evolution
Sheet 5: Prompt Versions
| Prompt ID | Version | Prompt Text | Change Log | Avg Score |
|---|---|---|---|---|
| P-001 | v1 | "Write a blog post about..." | Initial | 6.5 |
| P-001 | v2 | "Write a 500-word blog post... Include..." | Added constraints | 7.8 |
| P-001 | v3 | "Write a 500-word blog post... Tone: [X]..." | Added tone guidance | 8.2 |
Insight: Track which prompt changes improve scores most.
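If you also log Prompt ID and Prompt Version in the Scores sheet (hypothetical extra columns, say K and L), the Avg Score column here can be computed instead of typed; with the Prompt ID in A2 and the version in B2 of this sheet:
=AVERAGEIFS(Scores!F:F, Scores!K:K, A2, Scores!L:L, B2)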
Cost Analysis
Monthly budget tracking:
Total API spend: $45.23
Tests run: 150
Average cost per test: $0.30
Cost per quality point: $0.037
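The spend figures can be pulled straight from the Outputs sheet (timestamps in column D, costs in column G); the January 2025 boundaries here are just an example:
=SUMIFS(Outputs!G:G, Outputs!D:D, ">="&DATE(2025,1,1), Outputs!D:D, "<"&DATE(2025,2,1))
Divide by a matching COUNTIFS over the same date range to get average cost per test.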
Budget allocation:
- GPT-4 (high-stakes): $25 (55%)
- GPT-3.5 (volume): $15 (33%)
- Claude (testing): $5.23 (12%)
ROI calculation:
Before AI:
- Content creation: 2 hours/piece × $50/hr = $100
- Volume: 10 pieces/month = $1000
With AI (tested workflow):
- AI generation: $0.50/piece
- Human editing: 30 min × $50/hr = $25
- Total per piece: $25.50
- Volume: 40 pieces/month = $1020
Result: 4x volume for same budget
Sharing & Collaboration
Team evaluation:
- Share Google Sheet with team
- Each person scores different dimensions
- Aggregate scores for consensus
- Use comments for qualitative feedback
Client reporting:
- Create read-only dashboard tab
- Show: model comparison, cost efficiency, quality trends
- Demonstrate systematic testing approach
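A sketch for the read-only view: mirror the dashboard into a separate client-facing spreadsheet with IMPORTRANGE (the URL below is a placeholder; Sheets will ask you to allow access the first time):
=IMPORTRANGE("https://docs.google.com/spreadsheets/d/YOUR_SHEET_ID", "Analysis Dashboard!A1:F10")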
Limitations of Sheets
What works:
- Manual testing (up to ~200 tests)
- Simple scoring rubrics
- Cost tracking
- Model comparison
- Prompt version history
What doesn't scale:
- High-volume testing (1000s of tests)
- Real-time API integration (slow)
- Complex statistical analysis
- Automated regression testing
When to graduate to platforms:
- Running >500 tests/month
- Need automated testing pipelines
- Require advanced analytics (A/B testing, statistical significance)
- Multiple team members testing simultaneously
Template Download
What's included:
- Pre-built sheets (Test Cases, Outputs, Scores, Analysis)
- Formula examples
- Conditional formatting rules
- Sample test cases
- Scoring rubrics
- Cost calculation formulas
How to use:
- Copy template to your Google Drive
- Customize scoring dimensions for your use case
- Add your test cases
- Start logging outputs and scores
- Review Analysis dashboard weekly
Next Steps
- Download our template - Start with pre-built structure
- Define 5 test cases - Real scenarios from your workflow
- Run first comparison - GPT-4 vs GPT-3.5 or Claude
- Score outputs - Use consistent rubric
- Review analysis - Which model wins? At what cost?
Conclusion
- You don't need expensive tools to test AI systematically
- Google Sheets provides enough structure for most use cases
- Key: Consistent test cases, scoring rubrics, cost tracking
- Track improvements over time as you refine prompts
- Graduate to platforms only when volume demands it
CTAs
Note to writer: When expanding:
- Provide actual Google Sheets template (shareable link)
- Include screenshot walkthrough of each sheet
- Show real examples with actual scores and costs
- Provide Apps Script code snippets for automation
- Add video tutorial (optional, high engagement)
Related Articles
Accuracy vs Speed: When to Trade Creativity for Reliability
Decision framework for choosing between fast AI models (GPT-3.5, Claude Haiku) and accurate models (GPT-4, Claude Opus). Includes cost analysis and use case matrix.
AI Content QA for Marketers: From Draft to Publish in 10 Minutes
Practical SOP for marketing teams to quality-check AI-generated content in 10 minutes or less. Includes checklists, tools, and real workflow examples.
A Comprehensive Comparison of AI Copywriting Tools for Video Content Creation
Discover the best AI copywriting tools for creating engaging video content with our in-depth comparison and actionable tips.