Learn how to systematically analyze AI system failures to identify root causes and prevent recurrence.
Failure case analysis is the systematic examination of incorrect or unsatisfactory outputs to understand why they occurred and how to prevent them. It transforms failures from random setbacks into diagnostic data that reveals weaknesses in specifications, constraints, or approaches.
The practice identifies what went wrong, determines why it went wrong, and extracts actionable insights. Each failure provides information about edge cases, ambiguous requirements, or missing constraints that success cases cannot reveal.
Without structured failure analysis, you treat symptoms rather than causes. You might fix a specific error instance while the underlying pattern remains. The same failure recurs in slightly different forms, leading to an endless cycle of corrections that never address root problems.
Superficial analysis causes wasted effort. If you misdiagnose a failure, you apply the wrong fix. You might tighten constraints when the real issue is ambiguous requirements, or restructure prompts when the problem is inadequate evaluation criteria. Misdirected effort consumes time without improving results.
Poor failure analysis creates brittleness. Overfitting to specific failure examples makes systems fragile. You add endless special-case handling without understanding the systematic issue. The solution becomes a collection of patches rather than a robust approach.
Systematic analysis reveals patterns. Failures cluster in specific categories: edge cases, ambiguous instructions, constraint conflicts, or capability limits. Identifying patterns enables structural fixes rather than surface corrections. One insight from failure analysis might prevent dozens of related failures.
Classification: Categorizing failures by type to identify patterns. Common categories include format violations, content omissions, incorrect interpretations, logical errors, and style inconsistencies. Classification reveals whether failures are isolated incidents or symptoms of systemic issues.
Root Cause Analysis: Tracing failures backward to origin points. The visible error might be a symptom of deeper issues. Format violations could stem from missing constraints, unclear specifications, or conflicting requirements. This analysis draws on Reasoning Skills.
Reproducibility: Determining whether failures are consistent or intermittent. Reproducible failures indicate deterministic issues in instructions or constraints. Intermittent failures suggest stochastic variation or context-dependent behaviors. Different failure types require different remediation strategies.
Severity Assessment: Evaluating the impact of different failure types. Some failures are inconsequential deviations; others are critical defects. Severity assessment prioritizes which failures demand immediate attention and which can be tolerated or addressed later.
Corrective vs. Preventive Action: Corrective actions fix specific failures; preventive actions eliminate entire failure classes. Failure analysis should identify both—immediate fixes for current issues and structural changes to prevent recurrence.
Create a structured system for capturing failures as they occur. Each failure record should include the input, the output produced, the expected result, the observed error, and the configuration context in which the failure happened.
Without organized records, you cannot identify patterns or track whether fixes work. Use spreadsheets, databases, or dedicated tools—whatever ensures consistent documentation.
Example: A failure log for code generation might include: task description, generated code, error message from compiler, model version, temperature setting.
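A failure log like the one in the example can be as simple as a list of structured records. The sketch below is a minimal illustration for a code-generation setup; the field names are assumptions, not a standard schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# A minimal failure record; field names follow the code-generation example above.
@dataclass
class FailureRecord:
    task: str            # task description
    output: str          # generated code
    error: str           # error message from the compiler
    model_version: str   # configuration at the time of failure
    temperature: float
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

failure_log: list[FailureRecord] = []

def log_failure(record: FailureRecord) -> None:
    """Append a failure to the structured log for later pattern analysis."""
    failure_log.append(record)

log_failure(FailureRecord(
    task="Generate a sort function",
    output="def sort(xs): return xs",
    error="AssertionError: output not sorted",
    model_version="v2",
    temperature=0.7,
))
```

A spreadsheet or database serves the same purpose; what matters is that every record carries the same fields so patterns can be compared later.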
Develop a consistent taxonomy for categorizing failures. Common categories include format violations, content omissions, incorrect interpretations, logical errors, and style inconsistencies.
Classification creates structure from raw failure data. Look for clustering—are most failures format issues, or do they concentrate in specific content areas?
Example: If 70% of failures are format violations, focus on Constraint Encoding. If 60% are content omissions, examine Specification Writing.
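Checking for clustering, as in the example above, amounts to tallying categories over the log. A quick sketch (the classified entries here are made up for illustration):

```python
from collections import Counter

# Hypothetical classified log: (failure_id, category) pairs.
classified = [
    (1, "format"), (2, "format"), (3, "content"),
    (4, "format"), (5, "logic"), (6, "format"), (7, "format"),
]

counts = Counter(category for _, category in classified)
total = len(classified)
shares = {cat: round(n / total, 2) for cat, n in counts.items()}

# If one category dominates, that is where structural fixes pay off.
dominant, share = max(shares.items(), key=lambda kv: kv[1])
print(dominant, share)  # format 0.71
```

With 71% of failures in one category, the data points at a structural fix (here, constraint work) rather than case-by-case patches.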
For each failure pattern, trace back to origin points using the "Five Whys" technique: ask why the failure occurred, then why that cause existed, and repeat until you reach a process-level or structural cause.
Root causes often point to: missing constraints, unclear specifications, conflicting requirements, insufficient context, or process gaps. Connect this to Reasoning Skills.
Example: Content omissions might trace back to incomplete specifications, which trace back to missing requirements gathering, which traces back to no standardized scoping process.
Test each failure pattern across multiple attempts to determine whether it reproduces consistently.
Reproducibility indicates the fix type: consistent failures point to deterministic issues in instructions or constraints, while intermittent failures suggest stochastic variation or context-dependent behavior.
Example: If JSON format errors occur 100% of the time, add schema constraints. If they occur 30% of the time, consider few-shot examples or lower temperature.
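The decision rule above can be sketched as a small harness. This is a toy illustration: the 95% threshold is an arbitrary assumption, and the simulated task stands in for a real generation call.

```python
import random

def classify_reproducibility(failure_rate: float) -> str:
    """Rough heuristic: near-certain failure suggests a deterministic cause;
    partial failure suggests stochastic or context-dependent behavior."""
    if failure_rate >= 0.95:
        return "deterministic"
    if failure_rate > 0.0:
        return "intermittent"
    return "not reproduced"

def measure_failure_rate(run_once, trials: int = 20) -> float:
    """Re-run the same task and count how often it fails."""
    failures = sum(1 for _ in range(trials) if not run_once())
    return failures / trials

# Simulated task that fails roughly 30% of the time (a stand-in for a real call).
random.seed(0)
rate = measure_failure_rate(lambda: random.random() > 0.3)
print(classify_reproducibility(rate))  # intermittent
```

A deterministic result argues for a hard fix such as a schema constraint; an intermittent one argues for few-shot examples or parameter changes, as the example above suggests.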
Not all failures deserve equal attention. Prioritize by frequency and impact.
Create a priority matrix that plots frequency against impact to rank failure patterns.
Example: Critical data corruption (high impact) even if rare (low frequency) takes priority over frequent but inconsequential formatting quirks.
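One simple matrix is a frequency-times-impact score. The entries and weights below are invented for illustration; note how a rare but critical failure can still outrank a frequent cosmetic one, matching the example above:

```python
# Priority = frequency x impact; all numbers are illustrative assumptions.
failure_patterns = [
    {"name": "data corruption",  "frequency": 0.02, "impact": 100},  # rare, critical
    {"name": "format quirk",     "frequency": 0.40, "impact": 1},    # frequent, cosmetic
    {"name": "content omission", "frequency": 0.20, "impact": 5},
]

for p in failure_patterns:
    p["priority"] = p["frequency"] * p["impact"]

ranked = sorted(failure_patterns, key=lambda p: p["priority"], reverse=True)
print([p["name"] for p in ranked])
# ['data corruption', 'content omission', 'format quirk']
```

The multiplicative score is a design choice: it keeps a high-impact failure visible even at low frequency, whereas a frequency-only ranking would bury it.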
For each prioritized failure pattern, design two types of actions. Corrective actions are immediate fixes that address the specific failures you have observed; preventive actions are systemic changes that eliminate the entire failure class.
Example: For JSON format failures, corrective action might be "Add schema constraint." Preventive action might be "Create JSON template library for common use cases."
Apply corrective actions to failure examples that weren't used in diagnosis, holding out a portion of the failure log as a test set.
This prevents overfitting to specific instances. If a fix works on training failures but fails on test failures, it's overfit—look for a more general solution.
Example: If adding a constraint fixes 95% of test failures and introduces no new issues, the fix is robust and ready for deployment.
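The held-out check can be sketched as follows. The split fraction and the toy "fix" predicate are assumptions for illustration:

```python
import random

def validate_fix(failures, fix_resolves, holdout_fraction=0.25, seed=42):
    """Split failures into diagnosis and held-out sets, then measure the fix
    only on cases it was not designed around."""
    rng = random.Random(seed)
    shuffled = failures[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    diagnosis, holdout = shuffled[:cut], shuffled[cut:]
    resolved = sum(1 for case in holdout if fix_resolves(case))
    return resolved / len(holdout)

# Stand-in: failures are integers; the hypothetical "fix" resolves all
# cases except multiples of 17.
failures = list(range(100))
rate = validate_fix(failures, lambda case: case % 17 != 0)
print(f"held-out resolution rate: {rate:.0%}")
```

If the held-out rate is high and no new failure types appear, the fix generalizes; if it only works on the diagnosis set, it is overfit to those instances.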
Incorporate analysis findings into your knowledge base so that each resolved failure pattern is documented with its root cause and proven fix.
Documentation prevents forgetting hard-won lessons. Connect this to Iteration—each analysis cycle should permanently improve the system.
Example: Create a "Known Issues and Solutions" document that catalogs failure patterns, root causes, and proven fixes for future reference.
High Error Rates: When error rates exceed acceptable thresholds (>5-10%), systematic issues exist that random fixes cannot address. Failure case analysis identifies patterns and root causes, enabling targeted improvements rather than scattered corrections.
Iterative Improvement Cycles: When systematically improving outputs over time, failure analysis reveals which changes will have the most impact. Without understanding what's failing and why, improvements are random rather than targeted.
Complex Multi-Component Systems: When failures occur in systems with multiple interacting parts (chained operations, intricate requirements, multi-stage workflows), failure modes are difficult to diagnose intuitively. Systematic analysis cuts through complexity.
Production Debugging: When failures occur in deployed systems, rapid and accurate diagnosis is essential. Analysis skills prevent firefighting and enable lasting fixes rather than temporary patches.
Quality Assurance Processes: When building QA workflows, failure analysis improves specifications and constraints. Each defect reveals weaknesses in requirements or validation criteria, creating a quality improvement loop.
One-Time Failures: When a failure occurs once and never recurs, deep analysis may not be warranted. Document it, but focus analysis effort on recurrent patterns.
Obvious Issues: When the failure cause is immediately apparent (e.g., a typo in the prompt), formal analysis is overkill. Fix it and move on.
Exploratory Work: When experimenting and exploring, failures are expected and informative. Analysis becomes valuable once you've settled on an approach to optimize.
Capability-Limited Tasks: When a task genuinely exceeds system capabilities, analysis will reveal the pattern but cannot eliminate failures. In these cases, focus on redefining requirements rather than perfecting the approach.
A development team experiences 30% failure rate in their code generation system. They apply failure case analysis:
Collection: Log 200 failed generations with inputs, outputs, and compiler errors
Classification: Categorize the 200 failures by type to find the dominant patterns.
Root Cause Analysis: Trace each dominant category back to its origin, such as missing schema constraints, vague type specifications, or absent library context.
Corrective Actions: Add schema constraints, improve type specifications, include library documentation, add few-shot examples
Result: Failure rate drops from 30% to 8% in two weeks
A content platform struggles with inconsistent moderation decisions. They analyze failure cases where inappropriate content was approved or appropriate content was rejected.
Classification reveals patterns in which categories of content are consistently misjudged.
Root causes: Insufficient context capture, no cultural sensitivity training, vague moderation guidelines
Preventive actions: Redesign context collection, add cultural context annotations, rewrite guidelines with specific examples
Result: Moderation consistency improves by 40%, appeal rate drops by half
An API consistently returns malformed JSON responses, breaking client integrations.
They analyze 500 failure instances to identify the dominant malformation patterns.
Root causes: No sanitization constraints, incomplete schema validation, no explicit length handling
Corrective actions: Add character escaping rules, implement schema validation, add length truncation rules with error handling
Result: API response validity improves from 85% to 99.9%
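A validation pass matching the corrective actions above can be sketched with the standard library. The required field names and length budget are assumptions for illustration, not part of the case study:

```python
import json

MAX_LEN = 10_000  # assumed response length budget; real limits vary

def check_response(raw: str) -> list[str]:
    """Classify a raw API response against the failure modes from the analysis:
    over-length output, invalid JSON, and missing required fields."""
    problems = []
    if len(raw) > MAX_LEN:
        problems.append("over length budget")
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return problems + ["invalid JSON"]
    # Minimal schema check; the required fields here are illustrative.
    for field in ("id", "status"):
        if field not in payload:
            problems.append(f"missing field: {field}")
    return problems

print(check_response('{"id": 1, "status": "ok"}'))  # []
print(check_response('{"id": 1, "status": "ok"'))   # ['invalid JSON']
```

Running every response through a check like this both catches failures before they reach clients and produces the categorized log that the analysis depends on.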
Blaming the system: Attributing failures to system limitations prevents diagnosing the real issue. Many failures stem from specifications, constraints, or instructions rather than inherent capability limits. Focus on what you control before assuming capability constraints.
Overfitting to specific examples: Adding fixes for individual failure instances without identifying underlying patterns creates fragile systems. You accumulate special-case handling rather than addressing systematic issues. Always look for the pattern behind specific failures.
Treating symptoms: Fixing visible errors without identifying root causes ensures recurrence. A format violation might indicate missing constraints, unclear specifications, or conflicting requirements. Treat the disease, not the symptom.
Ignoring successful outputs: Focusing exclusively on failures misses what's working. Understanding why certain outputs succeed provides complementary information. Compare successful and failed cases to identify discriminating factors.
Insufficient documentation: Not recording failure details, context, and attempted fixes prevents learning. You repeat the same analysis cycles and forget what worked. Maintain structured records to build institutional knowledge.
Confirmation bias: Only seeing failures that confirm existing hypotheses while missing contradictory evidence. Actively look for failure types that don't fit your current understanding. Anomalies often reveal the most valuable insights.
Your failure case analysis is effective when failure rates decline over successive cycles, fixes generalize beyond the specific examples that prompted them, and previously diagnosed root causes stop recurring.
Warning signs that your analysis needs improvement: the same failure patterns resurface in slightly different forms, and fixes accumulate as special-case patches rather than structural changes.
To validate analysis effectiveness, track the overall failure rate, the distribution of failures across categories, and fix durability (whether a fixed pattern stays fixed in subsequent weeks).
Output Validation: Failure case analysis requires validation processes to identify and categorize failures systematically.
Evaluation Criteria Design: Well-defined criteria provide the standard against which failures are identified and classified.
Specification Writing: Failure analysis reveals missing requirements and ambiguous instructions that need to be added to specifications.
Constraint Encoding: Failure patterns guide the creation of new constraints and the refinement of existing ones.
Error Recovery: Understanding failure patterns enables better error recovery strategies and fallback mechanisms.
Iteration: Each analysis cycle should permanently improve the system through iterative refinement based on findings.
Reasoning Skills: Root cause analysis relies on strong reasoning to trace failures backward to their origins.
| Category | Sub-Types | Typical Root Causes | Priority Fixes |
|---|---|---|---|
| Format | Schema violations, syntax errors, structure issues | Missing/inadequate constraints | Add Constraint Encoding |
| Content | Omissions, inaccuracies, hallucinations | Incomplete specifications | Improve Specification Writing |
| Logic | Flawed reasoning, invalid conclusions | Insufficient context/examples | Add few-shot examples, clarify instructions |
| Style | Tone inconsistencies, voice variations | No style guidelines | Create style guides, add examples |
| Edge Cases | Unusual inputs, boundary conditions | Missing test coverage | Expand test cases, add edge case constraints |
FAILURE: [Describe what went wrong]
1. Why did this failure occur?
→ [Direct cause]
2. Why did [direct cause] happen?
→ [Deeper cause]
3. Why did [deeper cause] exist?
→ [Systemic cause]
4. Why was [systemic cause] present?
→ [Process/structural cause]
5. Why did [process/structural cause] exist?
→ [ROOT CAUSE]
FIX: Address root cause, not symptoms
| Action Type | Purpose | Examples | Timeline |
|---|---|---|---|
| Corrective | Fix immediate failures | Add constraints, clarify instructions, adjust parameters | Immediate |
| Preventive | Prevent recurrence | Improve specifications, create templates, establish workflows | Short-term |
Q: How many failures do I need to collect before analyzing?
A: Start analysis once you have 20-30 failures of a given type. Smaller samples can reveal obvious patterns, but statistical significance requires more data. For rare failure types, analyze each instance individually rather than waiting for large samples.
Q: Should I analyze every failure or just patterns?
A: Focus on patterns, not individual instances. One-time failures happen; recurrent patterns indicate systemic issues. Log every failure, but prioritize analysis of categories that represent >80% of failures by frequency or impact.
Q: How do I know if a failure is due to capability limits vs. specifications?
A: Test systematically. If improving specifications, constraints, and context doesn't reduce failure rates after multiple iterations, you may have hit capability limits. But exhaust all specification and constraint improvements first—most attributed "capability limits" are actually specification gaps.
Q: What's the difference between corrective and preventive actions?
A: Corrective actions fix specific failures you've already seen (e.g., "Add constraint to prevent this specific error"). Preventive actions eliminate entire failure classes to prevent future occurrences (e.g., "Create specification template to prevent all similar omissions"). You need both—corrective for immediate relief, preventive for long-term improvement.
Q: How often should I update my failure analysis?
A: Continuously. Log failures as they occur, review patterns weekly, and conduct comprehensive analysis monthly. After implementing fixes, track for 2-4 weeks to measure effectiveness. If new failure types emerge, analyze immediately rather than waiting for scheduled reviews.
Q: Can failure analysis help with tasks that are mostly successful?
A: Yes. Even low failure rates (1-5%) benefit from analysis. The few failures that occur often reveal edge cases or ambiguous requirements that, once fixed, improve overall robustness. Perfect quality requires understanding why the small percentage fails, not just accepting "mostly works."
Q: How do I avoid overfitting to specific failure examples?
A: Always test fixes on held-out failure cases not used in diagnosis. Reserve 20-30% of failures as a test set. If a fix works on training failures but fails on test failures, it's overfit—look for a more general solution that addresses the underlying pattern, not the specific instances.