Failure Case Analysis Framework - Systematic Error Diagnosis | AI Skill Library
Learn how to systematically analyze AI system failures to identify root causes and prevent recurrence.
What Is Failure Case Analysis?
Failure case analysis is the systematic examination of incorrect or unsatisfactory outputs to understand why they occurred and how to prevent them. It transforms failures from random setbacks into diagnostic data that reveals weaknesses in specifications, constraints, or approaches.
The practice identifies what went wrong, determines why it went wrong, and extracts actionable insights. Each failure provides information about edge cases, ambiguous requirements, or missing constraints that success cases cannot reveal.
Why This Skill Matters
Without structured failure analysis, you treat symptoms rather than causes. You might fix a specific error instance while the underlying pattern remains. The same failure recurs in slightly different forms, leading to an endless cycle of corrections that never address root problems.
Superficial analysis causes wasted effort. If you misdiagnose a failure, you apply the wrong fix. You might tighten constraints when the real issue is ambiguous requirements, or restructure prompts when the problem is inadequate evaluation criteria. Misdirected effort consumes time without improving results.
Poor failure analysis creates brittleness. Overfitting to specific failure examples makes systems fragile. You add endless special-case handling without understanding the systematic issue. The solution becomes a collection of patches rather than a robust approach.
Systematic analysis reveals patterns. Failures cluster in specific categories: edge cases, ambiguous instructions, constraint conflicts, or capability limits. Identifying patterns enables structural fixes rather than surface corrections. One insight from failure analysis might prevent dozens of related failures.
Core Concepts
Classification: Categorizing failures by type to identify patterns. Common categories include format violations, content omissions, incorrect interpretations, logical errors, and style inconsistencies. Classification reveals whether failures are isolated incidents or symptoms of systemic issues.
Root Cause Analysis: Tracing failures backward to origin points. The visible error might be a symptom of deeper issues. Format violations could stem from missing constraints, unclear specifications, or conflicting requirements. This analysis draws heavily on Reasoning Skills.
Reproducibility: Determining whether failures are consistent or intermittent. Reproducible failures indicate deterministic issues in instructions or constraints. Intermittent failures suggest stochastic variation or context-dependent behaviors. Different failure types require different remediation strategies.
Severity Assessment: Evaluating the impact of different failure types. Some failures are inconsequential deviations; others are critical defects. Severity assessment prioritizes which failures demand immediate attention and which can be tolerated or addressed later.
Corrective vs. Preventive Action: Corrective actions fix specific failures; preventive actions eliminate entire failure classes. Failure analysis should identify both—immediate fixes for current issues and structural changes to prevent recurrence.
Step-by-Step Guide
1. Establish Failure Collection System
Create a structured system for capturing failures as they occur. Each failure record should include:
- Input: What prompt or context caused the failure
- Output: The incorrect or unsatisfactory result
- Defect Description: What makes this output incorrect
- Context: Model, parameters, timestamp, any relevant conditions
- Impact: Severity (critical/major/minor) and consequences
Without organized records, you cannot identify patterns or track whether fixes work. Use spreadsheets, databases, or dedicated tools—whatever ensures consistent documentation.
Example: A failure log for code generation might include: task description, generated code, error message from compiler, model version, temperature setting.
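The record structure above can be sketched as a small Python dataclass. Field names and the example values are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class FailureRecord:
    """One captured failure; field names are illustrative, not a standard."""
    input_text: str    # prompt or context that produced the failure
    output_text: str   # the incorrect or unsatisfactory result
    defect: str        # what makes this output incorrect
    severity: str      # "critical" | "major" | "minor"
    context: dict = field(default_factory=dict)  # model, parameters, etc.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

failure_log: list = []

def log_failure(record: FailureRecord) -> None:
    """Append to the running log; a real system might write to a database."""
    failure_log.append(record)

log_failure(FailureRecord(
    input_text="Generate a JSON user object",
    output_text="{name: 'Ada'}",  # unquoted key, single quotes
    defect="Output is not valid JSON",
    severity="major",
    context={"model": "example-model", "temperature": 0.7},
))
```

A spreadsheet works just as well; the point is that every record carries the same fields so patterns are comparable later.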
2. Classify Failures by Type
Develop a consistent taxonomy for categorizing failures. Common categories include:
- Format violations: Structure, syntax, schema non-compliance
- Content omissions: Missing required information or sections
- Incorrect interpretations: Misunderstanding requirements or intent
- Logical errors: Flawed reasoning or invalid conclusions
- Style inconsistencies: Tone, voice, or convention violations
- Edge cases: Unusual inputs that break normal patterns
Classification creates structure from raw failure data. Look for clustering—are most failures format issues, or do they concentrate in specific content areas?
Example: If 70% of failures are format violations, focus on Constraint Encoding. If 60% are content omissions, examine Specification Writing.
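A quick way to surface clustering is a category tally. The sample counts below are made up for illustration:

```python
from collections import Counter

# Each logged failure has been tagged with one taxonomy category.
failure_categories = ["format"] * 14 + ["content"] * 4 + ["logic"] * 2

counts = Counter(failure_categories)
total = sum(counts.values())
shares = {category: n / total for category, n in counts.items()}

# The dominant category tells you where to focus first.
dominant = counts.most_common(1)[0][0]
```

With 70% of failures in one category, a structural fix there pays off far more than chasing individual instances.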
3. Analyze Root Causes
For each failure pattern, trace back to origin points. Use the "Five Whys" technique:
- Why did this failure occur? → Direct cause (e.g., "Wrong data format")
- Why did that cause exist? → Deeper cause (e.g., "Format not specified in constraints")
- Why was it not specified? → Systemic cause (e.g., "No template for this output type")
- Why was there no template? → Process cause (e.g., "New use case added without updating specs")
- Why was process not followed? → Root cause (e.g., "No specification review workflow")
Root causes often point to: missing constraints, unclear specifications, conflicting requirements, insufficient context, or process gaps. Connect this to Reasoning Skills.
Example: Content omissions might trace back to incomplete specifications, which trace back to missing requirements gathering, which traces back to no standardized scoping process.
4. Determine Reproducibility
Test each failure pattern across multiple attempts:
- Consistent failures: Occur every time with same input → Deterministic issue in specifications or constraints
- Intermittent failures: Occur sometimes with same input → Stochastic variation or context issue
- Context-dependent: Occur only with certain parameter combinations or context configurations
Reproducibility indicates the fix type:
- Consistent → Fix specifications/constraints
- Intermittent → Adjust parameters, add examples, or set expectations
- Context-dependent → Modify context management or add constraints for specific scenarios
Example: If JSON format errors occur 100% of the time, add schema constraints. If they occur 30% of the time, consider few-shot examples or lower temperature.
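The reproducibility buckets above can be approximated with a small rerun harness. Here `run_once` is a stand-in for a real model call and returns True when the output fails validation:

```python
import random

def classify_reproducibility(run_once, trials=20, seed=0):
    """Re-run the same input many times and bucket the failure pattern."""
    rng = random.Random(seed)
    failures = sum(run_once(rng) for _ in range(trials))
    rate = failures / trials
    if rate == 1.0:
        return "consistent", rate    # deterministic: fix specs/constraints
    if rate == 0.0:
        return "not reproduced", rate
    return "intermittent", rate      # stochastic: examples, parameters

# Stand-in generators; a real harness would call the model instead.
always_fails = lambda rng: True
sometimes_fails = lambda rng: rng.random() < 0.3  # ~30% failure rate
```

The thresholds are simplistic on purpose: in practice, a 95% failure rate still points to a specification problem, not sampling noise.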
5. Assess Severity and Prioritize
Not all failures deserve equal attention. Prioritize by:
- Frequency: How often does this failure type occur?
- Impact: How severe are the consequences?
- Fix cost: How expensive is the remedy?
- Fix effectiveness: How likely is the fix to work?
Create a priority matrix:
- High frequency + High impact → Immediate attention
- High frequency + Low impact → Batch fixes when convenient
- Low frequency + High impact → Address soon, after the high-frequency, high-impact items
- Low frequency + Low impact → Document and monitor
Example: Critical data corruption (high impact) even if rare (low frequency) takes priority over frequent but inconsequential formatting quirks.
6. Design Corrective and Preventive Actions
For each prioritized failure pattern, design two types of actions:
Corrective Actions (immediate fixes):
- Add specific constraints
- Clarify ambiguous instructions
- Add validation checks
- Provide examples
- Adjust parameters
Preventive Actions (systemic fixes):
- Improve Specification Writing process
- Add Constraint Encoding templates
- Update Evaluation Criteria
- Establish review workflows
- Create standard operating procedures
Example: For JSON format failures, corrective action might be "Add schema constraint." Preventive action might be "Create JSON template library for common use cases."
7. Test Fixes on Held-Out Cases
Apply corrective actions to failure examples that weren't used in diagnosis:
- Hold out test set: Reserve 20-30% of failure cases for validation
- Apply fixes: Test whether corrective actions resolve failures
- Check for regressions: Ensure fixes don't break previously working cases
- Verify generalization: Confirm fixes work on related failure types
This prevents overfitting to specific instances. If a fix works on training failures but fails on test failures, it's overfit—look for a more general solution.
Example: If adding a constraint fixes 95% of test failures and introduces no new issues, the fix is robust and ready for deployment.
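The hold-out workflow might look like this in Python. The 25% split and 90% robustness threshold are assumptions consistent with the ranges above:

```python
import random

def split_failures(failures, holdout_frac=0.25, seed=42):
    """Reserve a fraction of failure cases for validating fixes."""
    rng = random.Random(seed)
    shuffled = failures[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_frac))
    return shuffled[:cut], shuffled[cut:]  # (diagnosis set, held-out set)

def fix_is_robust(fix_resolves, test_set, threshold=0.9):
    """A fix generalizes if it resolves most held-out cases."""
    resolved = sum(fix_resolves(case) for case in test_set)
    return resolved / len(test_set) >= threshold

cases = list(range(100))  # stand-ins for logged failure records
diagnose, held_out = split_failures(cases)
```

Here `fix_resolves` would re-run a held-out case with the fix applied and return True on success; a fix that only passes the diagnosis set is overfit.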
8. Update Specifications and Documentation
Incorporate analysis findings into your knowledge base:
- Add missing requirements to specifications
- Clarify ambiguous language in instructions
- Add new constraints to prevent recurrence
- Update evaluation criteria to catch new failure types
- Document edge cases with examples
- Create playbooks for common failure patterns
Documentation prevents forgetting hard-won lessons. Connect this to Iteration—each analysis cycle should permanently improve the system.
Example: Create a "Known Issues and Solutions" document that catalogs failure patterns, root causes, and proven fixes for future reference.
When to Use This Skill
Ideal Scenarios
High Error Rates
When error rates exceed acceptable thresholds (>5-10%), systematic issues exist that random fixes cannot address. Failure case analysis identifies patterns and root causes, enabling targeted improvements rather than scattered corrections.
Iterative Improvement Cycles
When systematically improving outputs over time, failure analysis reveals which changes will have the most impact. Without understanding what's failing and why, improvements are random rather than targeted.
Complex Multi-Component Systems
When failures occur in systems with multiple interacting parts (chained operations, intricate requirements, multi-stage workflows), failure modes are difficult to diagnose intuitively. Systematic analysis cuts through complexity.
Production Debugging
When failures occur in deployed systems, rapid and accurate diagnosis is essential. Analysis skills prevent firefighting and enable lasting fixes rather than temporary patches.
Quality Assurance Processes
When building QA workflows, failure analysis improves specifications and constraints. Each defect reveals weaknesses in requirements or validation criteria, creating a quality improvement loop.
Not Ideal For
One-Time Failures
When a failure occurs once and never recurs, deep analysis may not be warranted. Document it, but focus analysis effort on recurrent patterns.
Obvious Issues
When the failure cause is immediately apparent (e.g., a typo in a prompt), formal analysis is overkill. Fix it and move on.
Exploratory Work
When experimenting and exploring, failures are expected and informative. Analysis becomes valuable once you've settled on an approach to optimize.
Capability-Limited Tasks
When a task genuinely exceeds system capabilities, analysis will reveal the pattern but cannot eliminate failures. In these cases, focus on redefining requirements rather than perfecting the approach.
Common Use Cases
Debugging Code Generation Failures
A development team experiences a 30% failure rate in their code generation system. They apply failure case analysis:
Collection: Log 200 failed generations with inputs, outputs, and compiler errors
Classification: Categorize failures:
- 45%: Syntax errors (missing brackets, semicolons)
- 30%: Type mismatches
- 15%: Missing imports
- 10%: Logic errors
Root Cause Analysis:
- Syntax errors → Missing format constraints in prompt
- Type mismatches → Incomplete type specifications
- Missing imports → No context about available libraries
- Logic errors → Insufficient test case examples
Corrective Actions: Add schema constraints, improve type specifications, include library documentation, add few-shot examples
Result: Failure rate drops from 30% to 8% in two weeks
Content Moderation System
A content platform struggles with inconsistent moderation decisions. They analyze failure cases where inappropriate content was approved or appropriate content was rejected.
Classification reveals patterns:
- 60% of failures: Context-dependent sarcasm misunderstood
- 25% of failures: Cultural references not recognized
- 15% of failures: Ambiguous policy language
Root causes: Insufficient context capture, no cultural sensitivity training, vague moderation guidelines
Preventive actions: Redesign context collection, add cultural context annotations, rewrite guidelines with specific examples
Result: Moderation consistency improves by 40%, and the appeal rate drops by half
API Response Format Issues
An API consistently returns malformed JSON responses, breaking client integrations.
Analysis of 500 failure instances:
- 70%: Unescaped special characters in text fields
- 20%: Missing required fields
- 10%: Truncated responses from length limits
Root causes: No sanitization constraints, incomplete schema validation, no explicit length handling
Corrective actions: Add character escaping rules, implement schema validation, add length truncation rules with error handling
Result: API response validity improves from 85% to 99.9%
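A minimal validator along these lines can distinguish the three failure modes found in the analysis. The required field names are hypothetical:

```python
import json

def validate_api_payload(text, required_fields=("id", "name")):
    """Classify a response: malformed JSON (often unescaped characters
    or truncation), missing required fields, or OK."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return "malformed"  # unescaped characters or a truncated response
    missing = [f for f in required_fields if f not in payload]
    if missing:
        return f"missing fields: {missing}"
    return "ok"
```

Running every response through a check like this turns silent client-side breakage into categorized failure records for the log.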
Common Mistakes
Blaming the system: Attributing failures to system limitations prevents diagnosing the real issue. Many failures stem from specifications, constraints, or instructions rather than inherent capability limits. Focus on what you control before assuming capability constraints.
Overfitting to specific examples: Adding fixes for individual failure instances without identifying underlying patterns creates fragile systems. You accumulate special-case handling rather than addressing systematic issues. Always look for the pattern behind specific failures.
Treating symptoms: Fixing visible errors without identifying root causes ensures recurrence. A format violation might indicate missing constraints, unclear specifications, or conflicting requirements. Treat the disease, not the symptom.
Ignoring successful outputs: Focusing exclusively on failures misses what's working. Understanding why certain outputs succeed provides complementary information. Compare successful and failed cases to identify discriminating factors.
Insufficient documentation: Not recording failure details, context, and attempted fixes prevents learning. You repeat the same analysis cycles and forget what worked. Maintain structured records to build institutional knowledge.
Confirmation bias: Only seeing failures that confirm existing hypotheses while missing contradictory evidence. Actively look for failure types that don't fit your current understanding. Anomalies often reveal the most valuable insights.
Measuring Success
Quality Checklist
Your failure case analysis is effective when:
- Pattern Recognition: You can identify recurring failure types and their characteristics
- Root Cause Identification: You trace failures to underlying causes, not just symptoms
- Fix Effectiveness: Corrective actions reduce failure rates by >50% for targeted patterns
- Prevention: New failure types decrease as specifications and constraints improve
- Knowledge Building: Documentation captures lessons learned and prevents repeat mistakes
- Efficiency: Analysis time decreases as you develop playbooks for common failures
Red Flags
Warning signs that your analysis needs improvement:
- Symptom Treatment: Fixing visible errors without addressing root causes
- High Recurrence: Same failure types keep recurring after "fixes"
- Overfitting: Accumulating special-case handling without understanding patterns
- Blaming the System: Attributing failures to capability limits without evidence
- Missing Patterns: Unable to categorize failures or identify clusters
- Incomplete Documentation: Losing track of what you've tried and what worked
Success Metrics
Track these to validate analysis effectiveness:
- Failure Rate Reduction: Target >50% reduction in top 3 failure categories
- Fix Durability: >80% of fixes remain effective after 30 days
- Pattern Coverage: >70% of failures fall into identified categories
- Analysis ROI: Time saved by prevention > time spent on analysis
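The first three metrics reduce to simple ratios. A sketch, using the code-generation case study's 30% → 8% figures as sample inputs:

```python
def failure_rate_reduction(before: float, after: float) -> float:
    """Fractional reduction in a category's failure rate after fixes."""
    return (before - after) / before

def pattern_coverage(classified: int, total: int) -> float:
    """Share of logged failures that fall into identified categories."""
    return classified / total

reduction = failure_rate_reduction(0.30, 0.08)  # well above the 50% target
coverage = pattern_coverage(85, 100)            # above the 70% target
```

Tracking these per category, rather than globally, shows which fixes actually worked and which categories still lack a diagnosis.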
Related Skills
Note: This skill is not yet in the main relationship map. Relationships will be defined as the skill library evolves.
Prerequisite Skills
Output Validation: Failure case analysis requires validation processes to identify and categorize failures systematically.
Evaluation Criteria Design: Well-defined criteria provide the standard against which failures are identified and classified.
Complementary Skills
Specification Writing: Failure analysis reveals missing requirements and ambiguous instructions that need to be added to specifications.
Constraint Encoding: Failure patterns guide the creation of new constraints and the refinement of existing ones.
Error Recovery: Understanding failure patterns enables better error recovery strategies and fallback mechanisms.
Iteration: Each analysis cycle should permanently improve the system through iterative refinement based on findings.
Reasoning Skills: Root cause analysis relies on strong reasoning to trace failures backward to their origins.
Quick Reference
Failure Classification Taxonomy
| Category | Sub-Types | Typical Root Causes | Priority Fixes |
|---|---|---|---|
| Format | Schema violations, syntax errors, structure issues | Missing/inadequate constraints | Add Constraint Encoding |
| Content | Omissions, inaccuracies, hallucinations | Incomplete specifications | Improve Specification Writing |
| Logic | Flawed reasoning, invalid conclusions | Insufficient context/examples | Add few-shot examples, clarify instructions |
| Style | Tone inconsistencies, voice variations | No style guidelines | Create style guides, add examples |
| Edge Cases | Unusual inputs, boundary conditions | Missing test coverage | Expand test cases, add edge case constraints |
Five Whys Template
FAILURE: [Describe what went wrong]
1. Why did this failure occur?
→ [Direct cause]
2. Why did [direct cause] happen?
→ [Deeper cause]
3. Why did [deeper cause] exist?
→ [Systemic cause]
4. Why was [systemic cause] present?
→ [Process/structural cause]
5. Why did [process/structural cause] exist?
→ [ROOT CAUSE]
FIX: Address root cause, not symptoms
Corrective vs. Preventive Actions
| Action Type | Purpose | Examples | Timeline |
|---|---|---|---|
| Corrective | Fix immediate failures | Add constraints, clarify instructions, adjust parameters | Immediate |
| Preventive | Prevent recurrence | Improve specifications, create templates, establish workflows | Short-term |
Pro Tips
- Start with Output Validation to systematically identify failures
- Use Evaluation Criteria Design to create objective failure classification
- Apply the Five Whys technique to reach root causes, not just symptoms
- Prioritize failures by frequency × impact matrix
- Test fixes on held-out cases to prevent overfitting
- Document every failure pattern and solution in a knowledge base
- Schedule quarterly reviews of failure patterns to identify emerging issues
- Share analysis findings across teams to prevent repeat mistakes
FAQ
Q: How many failures do I need to collect before analyzing?
A: Start analysis once you have 20-30 failures of a given type. Smaller samples can reveal obvious patterns, but statistical significance requires more data. For rare failure types, analyze each instance individually rather than waiting for large samples.
Q: Should I analyze every failure or just patterns?
A: Focus on patterns, not individual instances. One-time failures happen; recurrent patterns indicate systemic issues. Log every failure, but prioritize analysis of categories that represent >80% of failures by frequency or impact.
Q: How do I know if a failure is due to capability limits vs. specifications?
A: Test systematically. If improving specifications, constraints, and context doesn't reduce failure rates after multiple iterations, you may have hit capability limits. But exhaust all specification and constraint improvements first—most attributed "capability limits" are actually specification gaps.
Q: What's the difference between corrective and preventive actions?
A: Corrective actions fix specific failures you've already seen (e.g., "Add constraint to prevent this specific error"). Preventive actions eliminate entire failure classes to prevent future occurrences (e.g., "Create specification template to prevent all similar omissions"). You need both—corrective for immediate relief, preventive for long-term improvement.
Q: How often should I update my failure analysis?
A: Continuously. Log failures as they occur, review patterns weekly, and conduct comprehensive analysis monthly. After implementing fixes, track for 2-4 weeks to measure effectiveness. If new failure types emerge, analyze immediately rather than waiting for scheduled reviews.
Q: Can failure analysis help with tasks that are mostly successful?
A: Yes. Even low failure rates (1-5%) benefit from analysis. The few failures that occur often reveal edge cases or ambiguous requirements that, once fixed, improve overall robustness. Perfect quality requires understanding why the small percentage fails, not just accepting "mostly works."
Q: How do I avoid overfitting to specific failure examples?
A: Always test fixes on held-out failure cases not used in diagnosis. Reserve 20-30% of failures as a test set. If a fix works on training failures but fails on test failures, it's overfit—look for a more general solution that addresses the underlying pattern, not the specific instances.