Evaluation Criteria Design Framework - Precise Quality Measurement | AI Skill Library
Learn how to create objective, measurable criteria for assessing AI system outputs and ensuring consistent quality evaluation.
What is Evaluation Criteria Design?
Evaluation criteria design is the practice of creating precise, operational definitions for measuring output quality. It transforms abstract judgments like "good" or "accurate" into specific, observable characteristics that can be consistently assessed.
An evaluation criterion specifies what property to measure, how to measure it, and what constitutes success. Criteria serve as the bridge between desired outcomes and actual outputs, providing a systematic way to assess performance across multiple instances.
Why This Skill Matters
Without explicit evaluation criteria, quality assessment becomes subjective and inconsistent. Different evaluators apply different standards. The same evaluator applies different standards at different times. This variation makes it impossible to track improvements, compare outputs, or ensure reliability.
Implicit criteria cause misalignment. You might judge outputs based on unwritten expectations, leading to repeated failures and frustration. The system cannot meet standards that aren't clearly defined. Implicit criteria also make it difficult to diagnose problems—is the issue the approach or the evaluation?
Poor evaluation criteria mask real issues. Vague measures like "high quality" catch nothing because everything passes, or catch everything because nothing passes. Without objective criteria, you cannot distinguish acceptable outputs from unacceptable ones. Random acceptance and rejection create noise rather than signal.
Strong evaluation criteria enable systematic improvement. They reveal specific failure modes, measure progress quantitatively, and guide refinement efforts. Each criterion targets a specific dimension of quality. Breaking down quality into measurable components makes improvement tractable.
Core Principles
Measurability A criterion must produce consistent, repeatable results across evaluators and time. Measurable criteria use observable properties, not subjective judgments. "Engaging" is subjective; "contains at least three questions" is measurable. If multiple evaluators would produce different scores, the criterion needs refinement.
Binary vs. Graded Criteria Binary criteria are pass/fail—output either satisfies the requirement or doesn't. Graded criteria assign scores along a continuum. Binary criteria work for clear requirements; graded criteria work for smooth quality dimensions. Choose based on whether intermediate quality levels are meaningful.
Criterion Independence Each criterion should measure a distinct quality dimension. Overlapping criteria double-count the same attribute and create weighting issues. Independent criteria provide clearer diagnostic information when outputs fail.
Reference Standards Objective benchmarks that define quality levels. Reference standards might be expert-created examples, validated datasets, or established metrics. They ground criteria in reality rather than opinion.
Operationalization Translating abstract concepts into concrete measurements. "Accurate" becomes "contains no factual errors when checked against reference source X." "Concise" becomes "under 200 words." Each operational definition creates an objective test.
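As a minimal sketch (function names and thresholds here are illustrative, not from any specific library), each operational definition can be written directly as an objective test:

```python
def is_concise(text: str, max_words: int = 200) -> bool:
    """Operationalizes "concise" as: under the word limit."""
    return len(text.split()) <= max_words

def is_accurate(claims: list[str], reference: set[str]) -> bool:
    """Operationalizes "accurate" as: every claim appears in the reference source."""
    return all(claim in reference for claim in claims)
```

Once stated as code, the criterion produces the same verdict for every evaluator, every time.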
Step-by-Step Guide
1. Define Success Dimensions
Start by identifying what quality means for your use case. What makes an output acceptable or unacceptable? List all dimensions that matter:
- Accuracy (correctness, factual validity)
- Completeness (covers required points)
- Clarity (understandability, readability)
- Format compliance (structure, syntax, schema)
- Style consistency (tone, voice, conventions)
- Appropriateness (fit for audience, purpose)
Relate this to Task Scoping—your scope defines what success looks like. Each success criterion from scoping becomes an evaluation dimension.
Example: For a summarization task, dimensions might be: factual accuracy, coverage of key points, conciseness, readability, and logical flow.
2. Operationalize Each Dimension
Transform abstract dimensions into measurable criteria. For each dimension, ask: "How can I observe this objectively?"
- Accuracy → "Contains no factual errors when checked against source text"
- Completeness → "Includes at least 80% of key points from reference list"
- Clarity → "Flesch-Kincaid grade level under 12"
- Format → "Valid JSON complying with schema"
- Consistency → "Uses consistent terminology (no synonyms for same concept)"
This connects to Abstraction—break complex concepts into observable components.
Example: "Clarity" decomposes into reading grade level, sentence complexity, jargon density, and definition presence.
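One of those components, reading grade level, can be approximated in a few lines. This is a rough sketch: the syllable count uses a vowel-group heuristic rather than a dictionary, so treat the output as approximate:

```python
import re

def fk_grade(text: str) -> float:
    """Approximate Flesch-Kincaid grade level (vowel-group syllable heuristic)."""
    words = text.split()
    if not words:
        return 0.0
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

def passes_clarity(text: str, max_grade: float = 12.0) -> bool:
    """Binary wrapper for the "grade level under 12" criterion."""
    return fk_grade(text) <= max_grade
```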
3. Choose Measurement Type
For each criterion, decide whether to use binary or graded measurement:
Binary criteria (pass/fail) work for:
- Clear requirements (schema compliance, presence checks)
- Threshold conditions (length limits, minimum counts)
- Format validation (syntax, structure)
Graded criteria (scoring) work for:
- Smooth quality dimensions (accuracy percentages, coverage fractions)
- Comparative assessment (better/worse rankings)
- Diagnostic feedback (partial credit)
Example: Use binary for "valid JSON format" but graded for "coverage percentage of key points."
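The two measurement types can be made explicit in code. This sketch (class and function names are illustrative) pairs the binary example with the graded one:

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class BinaryCriterion:
    name: str
    test: Callable[[str], bool]  # pass/fail

def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

def coverage(summary: str, key_points: list[str]) -> float:
    """Graded: fraction of key points mentioned in the summary."""
    hits = sum(1 for point in key_points if point.lower() in summary.lower())
    return hits / len(key_points)

json_ok = BinaryCriterion("valid_json", is_valid_json)
```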
4. Define Measurement Methods
Specify how each criterion will be assessed:
- Automated checks: Schema validation, regex patterns, length counts, reference comparisons
- Manual review: Rubrics, rating scales, comparison to examples
- Hybrid approaches: Automated flagging + human verification
Document procedures to ensure repeatability. Include examples of what constitutes pass vs. fail for each criterion.
Example: "Spelling errors will be detected using tool X, with results verified manually for false positives."
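A hybrid pipeline can be sketched as automated checks that route borderline cases to humans. The rules and thresholds below are illustrative stand-ins for real checkers:

```python
import re

def automated_checks(text: str) -> dict[str, bool]:
    # Illustrative rules; real deployments would call dedicated tools here.
    return {
        "length_ok": len(text.split()) >= 50,
        "no_placeholder": not re.search(r"\b(TODO|TBD|lorem)\b", text, re.I),
    }

def route(text: str) -> str:
    """Automated flagging + human verification: clean outputs pass,
    placeholder text fails outright, everything else is queued for review."""
    results = automated_checks(text)
    if all(results.values()):
        return "pass"
    if results["no_placeholder"]:
        return "manual_review"
    return "fail"
```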
5. Establish Thresholds and Scoring Scales
For binary criteria, define clear pass/fail boundaries:
- What exactly constitutes compliance?
- Are there exceptions or edge cases?
For graded criteria, define what each score level means:
- Five-point scale: Excellent (5), Good (4), Satisfactory (3), Needs Work (2), Unacceptable (1)
- Ten-point scale with defined thresholds
- Continuous scores with target ranges
Base thresholds on actual requirements, not arbitrary divisions.
Example: "Coverage scoring: 90%+ = Excellent, 75-89% = Good, 60-74% = Satisfactory, less than 60% = Unacceptable."
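The example scale maps directly to code. A sketch of those coverage thresholds:

```python
def coverage_label(pct: float) -> str:
    """Maps a coverage percentage to the quality level in the example scale."""
    if pct >= 90:
        return "Excellent"
    if pct >= 75:
        return "Good"
    if pct >= 60:
        return "Satisfactory"
    return "Unacceptable"
```

Writing the boundaries down forces edge cases to be decided explicitly: here 75.0 is "Good", not "Satisfactory".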
6. Validate on Sample Outputs
Test criteria on diverse examples before finalizing:
- Apply criteria to known good outputs—do they pass?
- Apply criteria to known bad outputs—do they fail?
- Test edge cases and borderline situations
- Have multiple evaluators assess same outputs—do scores agree?
Refine criteria based on validation results. Eliminate or revise criteria that produce inconsistent results.
Example: Test coverage criterion on 20 summaries. If expert-assessed "good" summaries consistently score less than 60%, the threshold is too strict.
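Validation itself can be automated as agreement with expert labels. A sketch, assuming you have (output, expert_verdict) pairs; the sample criterion and data are illustrative:

```python
def agreement(test, labeled_samples) -> float:
    """Fraction of samples where the criterion agrees with the expert label."""
    matches = sum(1 for output, expected in labeled_samples if test(output) == expected)
    return matches / len(labeled_samples)

min_length = lambda text: len(text.split()) >= 3
samples = [("a b c d", True), ("a b", False), ("a b c", True), ("x", False)]
```

A criterion that agrees with experts on known-good and known-bad outputs is ready to use; low agreement means the definition or threshold needs refinement.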
7. Document with Examples
Create documentation that includes:
- Criterion name and definition
- Measurement method (automated/manual)
- Scoring scale or threshold
- Examples of passing outputs
- Examples of failing outputs
- Borderline cases illustrating limits
Examples create shared understanding and reduce interpretation drift.
Example: Show a summary with 95% coverage (Excellent) and one with 55% coverage (Unacceptable) to anchor the scale.
8. Implement and Monitor
Deploy criteria in your evaluation workflow:
- Automate what can be automated
- Train human evaluators on rubrics and procedures
- Establish review processes for disagreements
- Track results over time
Monitor for:
- Consistency across evaluators
- Failure rates by criterion
- Criteria that never fail (too lenient?) or always fail (too strict?)
- Evaluation throughput and bottlenecks
Refine criteria between evaluation cycles based on data.
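The never-fail/always-fail monitoring above can be sketched as a per-criterion failure report (names and flags are illustrative):

```python
from collections import Counter

def failure_report(results: list[dict[str, bool]]) -> dict[str, tuple[float, str]]:
    """results: one dict per evaluated output, {criterion_name: passed}."""
    fails, totals = Counter(), Counter()
    for result in results:
        for name, passed in result.items():
            totals[name] += 1
            fails[name] += not passed
    report = {}
    for name in totals:
        rate = fails[name] / totals[name]
        flag = "too lenient?" if rate == 0 else "too strict?" if rate == 1 else "ok"
        report[name] = (rate, flag)
    return report
```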
Common Mistakes
Subjective criteria: Using terms that require judgment creates inconsistency. "Professional," "engaging," and "intuitive" mean different things to different people. Replace with specific, observable characteristics. Define exactly what "professional" looks like in measurable terms.
Overlapping criteria: Measuring the same attribute multiple times creates weighting problems. If both "clarity" and "readability" score sentence complexity, that dimension gets double-weighted unintentionally. Ensure each criterion measures distinct territory.
Unmeasurable criteria: Creating criteria that cannot be consistently assessed wastes effort. "Creative" cannot be measured reliably. "Novel" lacks a stable reference point. If you cannot create an objective test, eliminate the criterion or reformulate it.
Too many criteria: Excessive criteria create evaluation burden and dilute focus. Each additional criterion adds assessment cost without proportional value. Focus on dimensions that truly differentiate quality. Combine related criteria into composite measures when possible.
Changing criteria mid-evaluation: Modifying evaluation standards during assessment invalidates comparisons. Establish criteria upfront, apply them consistently, and refine between evaluation cycles rather than during them.
Measuring Success
Quality Checklist
Your evaluation criteria are working well when:
- Inter-Rater Reliability: Multiple evaluators produce consistent scores (correlation >0.8 is excellent, >0.7 is acceptable)
- Test-Retest Reliability: Same evaluator scores same output consistently across time
- Discriminative Power: Criteria distinguish clearly between good and bad outputs
- Coverage: All important quality dimensions are measured by at least one criterion
- Independence: Criterion scores show low correlation with each other (r < 0.5)
- Actionability: Failure patterns reveal specific improvement directions
- Efficiency: Evaluation cost is proportional to decision value
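Inter-rater reliability can be checked with a plain Pearson correlation between two raters' scores. A minimal sketch (it assumes both raters' scores vary; constant scores would divide by zero):

```python
from statistics import mean

def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

rater_a = [5, 4, 3, 2, 1]
rater_b = [5, 4, 4, 2, 1]
```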
Red Flags
Warning signs that your criteria need improvement:
- High Disagreement: Evaluators consistently disagree on scores (>30% variance)
- Uniform Scoring: All outputs receive similar scores (criteria don't differentiate)
- Never-Fail Criteria: Some criteria always pass (too lenient or irrelevant)
- Always-Fail Criteria: Some criteria always fail (too strict or impossible)
- High Correlation: Different criteria produce nearly identical scores (redundancy)
- Ambiguous Results: Scores don't align with expert judgments
- Interpretation Drift: Scores change over time without output quality changes
Success Metrics
Track these metrics to validate your criteria design:
- Consistency: Cronbach's alpha > 0.7 for multi-criterion evaluation
- Validity: Correlation with expert assessments > 0.7
- Efficiency: Evaluation time < 25% of generation time
- Coverage: At least 80% of specified requirements checked by criteria
- Actionability: Failure diagnosis leads to successful fix in >70% of cases
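The consistency metric, Cronbach's alpha, can be computed directly from a score matrix. A sketch treating rows as evaluated outputs and columns as criterion scores:

```python
from statistics import pvariance

def cronbach_alpha(scores: list[list[float]]) -> float:
    """scores[i][j] = score of output i on criterion j (needs >= 2 criteria)."""
    k = len(scores[0])
    item_vars = sum(pvariance([row[j] for row in scores]) for j in range(k))
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_vars / total_var)
```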
When to Use This Skill
Ideal Scenarios
Benchmarking and Comparison When comparing multiple approaches, systems, or configurations, objective criteria ensure fair assessment. Use evaluation criteria design when you need to distinguish real performance differences from evaluation variation. This is essential for A/B testing, model selection, and optimization.
Automated Quality Control When implementing continuous integration, batch processing, or production monitoring, you need criteria that can be implemented as automated tests. Evaluation criteria design translates quality requirements into executable checks that scale.
Iterative Improvement Projects When tracking progress over time or measuring the impact of changes, you need objective, repeatable measures. Criteria create baseline measurements and enable valid before/after comparisons that guide refinement efforts.
Multi-Evaluator Teams When multiple people will assess quality, shared criteria prevent fragmentation and ensure consistency. Teams need common ground to reduce training overhead and align evaluation standards.
Contractual or Compliance Requirements When outputs must meet documented standards for service-level agreements, regulatory compliance, or quality assurance certifications, evaluation criteria become the contract language that specifies requirements.
Not Ideal For
Exploratory Work When exploring possibilities or brainstorming, rigid criteria can constrain creativity. Early exploration benefits from flexibility rather than formal measurement.
Simple Binary Checks When a task has a single, obvious pass/fail condition, formal criteria design may be overkill. Simple validation rules suffice.
Rapid Prototyping When quickly testing ideas, extensive criteria design slows iteration. Use lightweight checks until you confirm the approach is worth formalizing.
Purely Subjective Domains When quality genuinely resists operationalization (e.g., artistic merit, philosophical depth), forcing criteria creates false precision. Accept subjectivity and use qualitative approaches.
Common Use Cases
Model Selection and Comparison
A research team comparing three language models for document summarization needs fair, objective assessment. They design evaluation criteria measuring:
- Factual Accuracy: Percentage of claims verifiable in source text
- Coverage: Fraction of key points from source included in summary
- Conciseness: Summary length relative to source (target: 10-20%)
- Readability: Flesch-Kincaid grade level score
- Structure: Presence of logical organization (binary)
Each criterion is measurable, independent, and maps to a quality dimension that matters for their use case. The criteria reveal that Model A excels at accuracy but produces verbose summaries, Model B is concise but misses key points, and Model C balances both well—leading to an informed decision.
Content Quality Assurance
A content platform needs automated quality control for user-generated articles. They design criteria including:
- Minimum Length: 500+ words (binary, automated)
- Spelling/Grammar: Zero errors via automated checker (binary, automated)
- Structure Presence: Contains introduction, body, conclusion (binary, automated)
- Citation Quality: At least 3 reputable sources (count, automated)
- Originality Score: >80% via plagiarism detection (threshold, automated)
Content failing any criterion requires manual review. Passing content proceeds to publication. The criteria scale to thousands of submissions while maintaining quality floors.
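The automated gate can be sketched as a chain of binary checks. The word-count, citation, and originality rules below mirror the example criteria; in practice, spelling and plagiarism scores would come from dedicated services:

```python
def qa_gate(article: str, citation_count: int, originality: float) -> str:
    """Routes an article: all checks pass -> publish, any failure -> review."""
    checks = {
        "min_length": len(article.split()) >= 500,
        "citations": citation_count >= 3,
        "originality": originality > 0.80,
    }
    return "publish" if all(checks.values()) else "manual_review"
```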
Iterative Prompt Optimization
A developer improving a code-generation prompt needs measurable progress tracking. They design criteria:
- Syntax Validity: Code parses without errors (binary, automated)
- Test Success Rate: Percentage of generated functions passing unit tests (graded, automated)
- Complexity Score: Cyclomatic complexity under 10 (threshold, automated)
- Documentation Coverage: Percentage of functions with docstrings (graded, manual)
- Style Compliance: Adherence to linting rules (percentage, automated)
Each prompt iteration is evaluated against these criteria. Progress is tracked quantitatively: "Iteration 3 improved test success rate from 60% to 85% while maintaining complexity scores." The criteria guide refinement efforts toward measurable improvement.
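Two of these criteria are cheap to automate with Python's standard library. A sketch using `ast` (this covers syntax validity and docstring coverage only; test execution and linting need their own tooling):

```python
import ast

def syntax_valid(code: str) -> bool:
    """Binary criterion: the generated code parses without errors."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def docstring_coverage(code: str) -> float:
    """Graded criterion: fraction of function definitions with a docstring."""
    funcs = [n for n in ast.walk(ast.parse(code))
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    if not funcs:
        return 1.0
    return sum(1 for f in funcs if ast.get_docstring(f)) / len(funcs)
```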
How This Skill Connects to Other Skills
Specification writing defines requirements that evaluation criteria assess. Specifications state what outputs should be; criteria determine whether they achieved it. They're two sides of the same coin—requirements and verification.
Constraint encoding provides rules that become evaluation criteria. Each constraint translates to a pass/fail test. Constraints encode requirements; criteria check compliance.
Validation strategies implement evaluation criteria through automated tests, manual review processes, or hybrid approaches. Criteria define what to validate; strategies define how.
Error analysis uses evaluation criteria to categorize and diagnose failures. Each criterion maps to specific failure types. Strong criteria make it possible to track failure patterns and identify systematic issues.
Decomposition helps break complex quality concepts into multiple criteria. "High quality" decomposes into accuracy, clarity, completeness, and formatting criteria. Each component gets its own evaluation metric.
Skill Boundaries
Evaluation criteria cannot compensate for fundamentally flawed approaches. Perfect measurement of terrible outputs still produces terrible scores. Criteria assess quality, they don't create it.
Criteria don't guarantee validity. A criterion can be objective, measurable, and consistent while still measuring the wrong thing. Content validity—whether the criterion matters—requires domain judgment separate from measurement design.
Over-specification creates evaluation burden. More criteria aren't automatically better. Each criterion adds assessment cost. Focus on dimensions that meaningfully differentiate quality and drive decisions.
Static criteria fail in evolving contexts. If requirements change, evaluation criteria must update. Stale criteria measure irrelevant qualities. Criteria maintenance is ongoing work.
Criteria don't substitute for expert judgment in complex cases. Some quality dimensions resist operationalization. Nuanced judgments like "subtle" or "insightful" may resist decomposition. Recognize when criteria reach their limits and human evaluation becomes necessary.
Related Skills
Note: This skill is not yet in the main relationship map. Relationships will be defined as the skill library evolves.
Prerequisite Skills
Task Scoping: Evaluation criteria design requires well-scoped tasks to define what success looks like and which quality dimensions matter.
Specification Writing: Clear specifications provide the requirements that evaluation criteria transform into measurable checks.
Complementary Skills
Output Validation: Evaluation criteria design creates the measures that output validation implements through automated tests and manual review.
Constraint Encoding: Constraints become binary evaluation criteria—each constraint translates to a pass/fail test.
Failure Case Analysis: Failure patterns revealed through analysis inform what evaluation criteria are necessary and reveal gaps in current measurement.
Abstraction: Breaking complex quality concepts into measurable components relies on abstraction skills to decompose abstract dimensions.
Quick Reference
Criterion Type Selection
| Criterion Type | Best For | Example | Measurement |
|---|---|---|---|
| Binary | Clear requirements, format compliance | "Valid JSON" | Pass/fail test |
| Threshold | Minimum standards, length limits | "> 500 words" | Count vs. threshold |
| Graded | Quality dimensions, comparisons | "Coverage: 0-100%" | Percentage score |
| Ordinal | Comparative ranking | "Better/worse than baseline" | Ranked comparison |
Operationalization Patterns
| Abstract Concept | → | Observable Measure |
|---|---|---|
| Accuracy | → | % claims verifiable in reference |
| Completeness | → | % required points present |
| Clarity | → | Flesch-Kincaid grade level |
| Conciseness | → | Word count / compression ratio |
| Consistency | → | % consistent term usage |
| Structure | → | Presence of required sections |
| Style | → | % sentences following style guide |
Criteria Design Templates
Binary Criterion Template
- Criterion: [Name]
- Definition: Output [specific observable condition]
- Test: [Automated check / Manual rubric]
- Pass: [Exact pass condition]
- Fail: [Exact fail condition]
- Edge Cases: [Known exceptions]
Graded Criterion Template
- Criterion: [Name]
- Definition: [Quality dimension] measured by [metric]
- Scale: [Scoring range, e.g., 0-100 or 1-5]

Levels:
- Excellent (5): [Specific description with examples]
- Good (4): [Specific description with examples]
- Satisfactory (3): [Specific description with examples]
- Needs Work (2): [Specific description with examples]
- Unacceptable (1): [Specific description with examples]
Pro Tips
- Start with Task Scoping—success criteria become evaluation dimensions
- Use Abstraction to decompose complex qualities into measurable components
- Validate criteria on 10-20 diverse outputs before finalizing
- Aim for 5-10 criteria covering distinct quality dimensions
- Automate everything objective; reserve manual review for subjective judgments
- Create example sets showing excellent, acceptable, and unacceptable outputs
- Re-validate criteria quarterly or when requirements change
- Track inter-rater reliability to catch ambiguous criteria
FAQ
Q: How many criteria should I have?
A: Aim for 5-10 criteria covering distinct quality dimensions. Fewer than 5 risks missing important aspects; more than 10 creates evaluation burden. Quality matters more than quantity—ensure each criterion measures something unique and meaningful.
Q: Should I use binary or graded criteria?
A: Use binary criteria for clear requirements (format compliance, presence checks) and graded criteria for smooth quality dimensions (accuracy percentages, coverage fractions). Binary is simpler and faster; graded provides more nuanced feedback for improvement.
Q: How do I handle subjective qualities like "engaging" or "creative"?
A: Either eliminate the criterion (if it's not essential) or decompose it into observable components. "Engaging" might become "contains at least 3 questions," "uses active voice," and "includes examples." Accept that some genuinely subjective qualities resist operationalization.
Q: What if evaluators disagree on scores?
A: Disagreement indicates the criterion needs refinement. First, check if the criterion definition is clear enough. Add examples and edge cases. If disagreement persists, consider splitting the criterion into more specific measures or replacing it with something more observable.
Q: How often should I update my criteria?
A: Re-validate criteria whenever requirements change, evaluation patterns shift, or you notice issues (high disagreement, uniform scoring). For stable use cases, quarterly review ensures criteria stay aligned with needs. Between cycles, track failure patterns to identify gaps.
Q: Can I combine multiple criteria into a single score?
A: Yes, but be careful. Weighted aggregation can hide individual criterion failures. Report individual criterion scores first, then combine if needed for high-level summaries. Ensure weights reflect actual importance, not arbitrary values.
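A sketch of that cautious aggregation: keep the per-criterion scores and compute the weighted composite separately (weights here are illustrative):

```python
def combined_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of criterion scores; report individual scores first."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in scores) / total_weight
```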
Q: How do I know if my criteria are measuring the right things?
A: Validate against expert assessments—correlate criterion scores with expert judgments (>0.7 correlation indicates good validity). Also check whether criteria catch failures that matter in practice and whether passing outputs actually work in your use case.