Error Recovery - Building Resilient AI Workflows | AI Skill Library

Learn error recovery to design AI workflows that detect failures, adapt strategies, and complete objectives despite unexpected obstacles.

intermediate
20 min read

What This Skill Is

Error Recovery is the practice of designing AI workflows that can detect when something goes wrong, diagnose the problem, and take corrective action without human intervention. Instead of failing completely or producing incorrect results, systems employing error recovery recognize failure states and attempt alternative approaches.

This skill operates on the principle that failures are inevitable in any complex system. Networks disconnect, APIs return errors, data arrives malformed, tools time out, and assumptions prove false. Error Recovery acknowledges these realities and builds workflows that anticipate and respond to failures rather than pretending they won't occur. This relates to Failure Case Analysis for understanding what might fail.

Error Recovery transforms brittle workflows into resilient ones. A brittle workflow attempts each step once and aborts on any error. A resilient workflow validates each step, recognizes when expectations aren't met, and either retries with modifications, switches to fallback strategies, or fails gracefully with meaningful context about what went wrong.

Why This Skill Matters

Without Error Recovery, AI workflows become fragile assemblies where any single failure breaks the entire process. A request fails due to rate limiting, and the workflow aborts leaving partial state. An API changes its response format, and downstream steps receive malformed data.

These failures waste resources and time. Long-running workflows execute for minutes before encountering an error that should have been detected immediately. Re-running workflows from scratch after late-stage failures is inefficient.

Error Recovery enables robust automation. When workflows can handle common failures automatically, human oversight shifts from constant monitoring to exception handling. You intervene only for genuinely unusual problems.

The quality of AI outputs improves with proper error recovery. When workflows detect and correct errors rather than propagating corrupted data, final outputs reflect actual successful operations.

Trust in AI-assisted workflows depends on error recovery. If workflows fail unpredictably, you learn not to trust them. If workflows recognize errors and handle them appropriately, you can rely on them even in complex scenarios.

Core Concepts

Failure Detection

Failure detection is recognizing when something hasn't worked as intended. This goes beyond explicit error codes. Detection includes recognizing when outputs don't match expected schemas, when results are incomplete, when operations take too long, or when assumptions are violated.

Effective detection requires explicit validation criteria. What does success look like? What are the acceptable ranges? Without defined success criteria, failures go undetected and propagate downstream, which is why Evaluation is critical.
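Explicit validation criteria can be expressed as a small check run after each step. The sketch below is illustrative, assuming a step that returns a dictionary; the field names and status values are hypothetical, not part of any real API.

```python
# Hypothetical success criteria for one workflow step: the result must
# exist, contain these fields, and report an acceptable status.
EXPECTED_FIELDS = {"id", "status", "payload"}

def detect_failure(result):
    """Return a failure description, or None if the result looks valid."""
    if result is None:
        return "no result produced"
    missing = EXPECTED_FIELDS - result.keys()
    if missing:
        return f"missing fields: {sorted(missing)}"
    if result["status"] not in ("ok", "partial"):
        return f"unexpected status: {result['status']}"
    return None  # all checks passed
```

Returning a description rather than a boolean means the failure reason is available for classification and reporting downstream.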

Error Classification

Error classification categorizes failures to determine appropriate responses. Not all errors should be handled the same way. Transient failures like network timeouts might warrant retries. Permanent errors like invalid credentials require different handling.

Classification distinguishes recoverable from unrecoverable errors. Recoverable errors allow the workflow to continue with correction. Unrecoverable errors require aborting and reporting.
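One minimal way to encode this distinction is a classifier that maps error types to recovery actions. The mapping below is an assumption for illustration; which exceptions count as transient depends on your actual tools and APIs.

```python
def classify(error):
    """Map an exception to a recovery action: 'retry' or 'abort'."""
    # Assumed transient: the underlying system may recover on its own.
    if isinstance(error, (TimeoutError, ConnectionError)):
        return "retry"
    # Assumed permanent: retrying the same call cannot succeed.
    if isinstance(error, (PermissionError, ValueError)):
        return "abort"
    # Unknown errors default to the safe, unrecoverable path.
    return "abort"
```

Defaulting unknown errors to "abort" is a deliberate design choice: treating an unclassified failure as recoverable risks infinite retry loops.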

Retry Strategies

Retry strategies define how and when to reattempt failed operations. Immediate retries might overwhelm a struggling system. Exponential backoff gives systems time to recover. Retries with different parameters might succeed where identical retries fail.

Effective retry requires understanding failure causes. If a parameter was invalid, retrying with the same parameter will fail again. If a system is overloaded, waiting before retrying might help.
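A common shape for this is exponential backoff with a retry cap. This is a minimal sketch, assuming the transient failures surface as `TimeoutError` or `ConnectionError`; real systems often add jitter to avoid synchronized retries.

```python
import time

def retry_with_backoff(operation, max_attempts=4, base_delay=0.5):
    """Retry an operation on transient errors, doubling the wait each time."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: escalate instead of looping forever
            time.sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
```

Note that only transient error types are caught: a `PermissionError` or `ValueError` propagates immediately, because retrying with identical inputs cannot fix it.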

Fallback Mechanisms

Fallback mechanisms provide alternative approaches when primary methods fail. When an API is unavailable, fall back to cached data. When a specific tool fails, switch to a different tool. When a detailed analysis fails, fall back to a simpler approximation.

Fallbacks require pre-planning. You cannot fall back to an alternative you haven't identified. Effective fallback design involves considering what could fail and what alternatives exist before the workflow executes.
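The cached-data case can be sketched as a wrapper that tries the primary source and, on failure, uses the pre-identified alternative. The function names are placeholders; tagging the result with its source lets downstream steps know they received degraded data.

```python
def fetch_with_fallback(fetch_live, read_cache):
    """Try the primary source; fall back to cached data if it fails."""
    try:
        return fetch_live(), "live"
    except ConnectionError:
        # Degraded but still useful; the tag records which path was taken.
        return read_cache(), "cache"
```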

Graceful Degradation

Graceful degradation accepts reduced functionality rather than total failure. When a perfect result isn't achievable, workflows should produce the best possible result given constraints. This might involve Output Validation to determine what's acceptable.

Degradation requires defining what "good enough" means. Specifying minimum acceptable outcomes allows workflows to deliver value even when obstacles prevent ideal execution.
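"Good enough" can be made explicit as a minimum threshold the degraded path must meet. The sketch below is one way to encode that, assuming a detailed analyzer, a simpler fallback, and a coverage ratio as the acceptance criterion; all three are hypothetical.

```python
def analyze(records, detailed, simple, min_coverage=0.5):
    """Prefer detailed analysis; degrade to the simple one only if it
    covers at least min_coverage of the input."""
    try:
        return {"quality": "full", "result": detailed(records)}
    except Exception:
        result = simple(records)
        coverage = len(result) / len(records) if records else 0.0
        if coverage >= min_coverage:
            return {"quality": "degraded", "result": result}
        # Below the defined minimum: a degraded answer here would mislead.
        raise RuntimeError("result below minimum acceptable coverage")
```

Labeling the output with its quality level keeps the degradation visible to consumers instead of silently passing off a partial result as complete.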

How This Skill Is Used

Error Recovery transforms linear execution paths into adaptive workflows that respond to actual conditions.

Design workflows with explicit validation points. After each major operation, verify that results meet expectations. Check that data structures are valid, that required fields are present, and that operations completed successfully.

Define error types and their meanings. Categorize possible failures into transient issues (retry), permanent problems (abort), data errors (correct), and logic errors (fix).

Implement retry logic appropriate to each error type. Transient failures benefit from retries with exponential backoff. Parameter errors might succeed if modified. Each error category has an optimal recovery approach.

Identify fallback options before execution. For each critical operation, determine what alternatives exist if it fails. Document these options during Planning rather than discovering them during failure.

Establish abort conditions and cleanup procedures. When workflows must abort, define what cleanup is required. What partial state needs reversal? What resources need release? Proper abort handling prevents messy failures.
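The cleanup requirement can be sketched as tracking each completed step together with its undo action, then reversing them in order when the workflow aborts. The step triples here are an illustrative structure, not a prescribed API.

```python
def run_workflow(steps):
    """Run (name, do, undo) steps in order; on failure, undo completed
    steps in reverse so no partial state is left behind."""
    completed = []
    try:
        for name, do, undo in steps:
            do()
            completed.append((name, undo))
        return "ok"
    except Exception:
        for name, undo in reversed(completed):
            undo()  # release resources / reverse partial state
        raise  # re-raise so the caller still sees the original failure
```

Undoing in reverse order mirrors how dependencies were created: the last resource acquired is the first released.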

Monitor error patterns and refine responses. Track which errors occur frequently and whether recovery strategies work. Adjust retry parameters and improve error messages.

Common Mistakes

Mistake: Assuming Success

Proceeding through workflows without validating that each step succeeded. Relying on the absence of explicit errors rather than actively checking for expected conditions.

Add explicit validation after each operation. Check that results match expected schemas and that required data is present. Don't assume success—verify it.

Mistake: Misclassifying Errors

Treating all errors the same way regardless of their nature. Retrying permanent errors like authentication failures, or aborting on transient issues like temporary timeouts.

Classify errors to determine appropriate responses. Transient errors warrant retries. Permanent errors require different approaches.

Mistake: Infinite Retry Loops

Retrying failed operations indefinitely without termination conditions. Workflows enter loops where they repeatedly attempt the same failing operation, consuming resources without progress. This is where Iteration must recognize when to stop.

Define retry limits and backoff strategies. After a certain number of attempts or elapsed time, escalate to alternative approaches or abort. Don't retry forever hoping for different results.

Mistake: Inadequate Fallbacks

Having fallback plans that are themselves unreliable or unavailable. Attempting to fall back to a cached value that doesn't exist, or switching to an alternative tool with similar failure modes.

Test fallback paths alongside primary paths. Ensure alternatives are actually available and functional.

Mistake: Failing Without Context

Aborting workflows when unrecoverable errors occur without preserving diagnostic information. You know something failed, but not what operation, what error, or what state existed at failure.

Capture and report error context. Document what operation failed, what error occurred, and what state existed.
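One way to preserve that context is to wrap the raw exception in a richer error that carries the operation name and a snapshot of workflow state. The class and field names below are illustrative assumptions.

```python
class WorkflowError(Exception):
    """Failure wrapper carrying operation name, state snapshot, and cause."""
    def __init__(self, operation, state, cause):
        self.operation, self.state, self.cause = operation, state, cause
        super().__init__(f"{operation} failed ({cause!r}); state={state}")

def run_step(operation_name, func, state):
    try:
        return func()
    except Exception as exc:
        # dict(state) snapshots the state at failure time; 'from exc'
        # keeps the original traceback chained for diagnosis.
        raise WorkflowError(operation_name, dict(state), exc) from exc
```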

When This Skill Is Needed

Error Recovery becomes essential when workflows involve multiple steps, external dependencies, or unattended operation. The more complex the workflow, the more important error recovery becomes.

Automated workflows require error recovery. When workflows execute without human oversight, they must handle common failures automatically.

Long-running processes require error recovery. Multi-hour workflows that fail after 90 minutes waste significant time if they must restart from the beginning. Error recovery allows workflows to save partial progress.
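Saving partial progress can be as simple as writing a checkpoint after each completed item and resuming from it on restart. This is a minimal sketch using a JSON file; the checkpoint format is an assumption, and a production version would write atomically.

```python
import json
import os

def process_with_checkpoints(items, process, path):
    """Process items in order, resuming from the last recorded index."""
    start = 0
    if os.path.exists(path):
        with open(path) as f:
            start = json.load(f)["next_index"]
    for i in range(start, len(items)):
        process(items[i])
        # Record progress only after the item fully succeeds.
        with open(path, "w") as f:
            json.dump({"next_index": i + 1}, f)
```

If the run fails at item N, a restart skips items 0 through N-1 instead of redoing them, which is what turns a 90-minute loss into a resume.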

Production systems require error recovery. In production environments, failures are inevitable. Error Recovery creates systems that remain operational despite routine failures.

External integrations require error recovery. APIs fail, databases disconnect, networks time out. Any workflow depending on external systems must handle the failures those systems produce.

Data processing pipelines require error recovery. When processing datasets, some records may be malformed or files corrupted. Error recovery allows pipelines to log problematic items and continue processing valid data.
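The log-and-continue pattern can be sketched as a loop that collects results and records failed records instead of aborting. Which exceptions count as "malformed record" (here `ValueError` and `KeyError`) is an assumption that depends on the transform.

```python
def process_pipeline(records, transform):
    """Transform each record; log and skip malformed ones."""
    results, errors = [], []
    for i, record in enumerate(records):
        try:
            results.append(transform(record))
        except (ValueError, KeyError) as exc:
            # Record enough context to reprocess the item later.
            errors.append({"index": i, "error": repr(exc)})
    return results, errors
```

Returning the error log alongside the results lets a later pass repair and reprocess just the failed records.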

User-facing applications require error recovery. When users depend on AI workflows, failures should be handled gracefully. Show meaningful error messages and preserve work where possible.

How This Skill Connects to Other Skills

Error Recovery integrates with multiple capabilities to create robust AI workflows.

Error Recovery builds on Task Decomposition. Recoverable workflows are decomposed into discrete steps that can be validated individually. Decomposition creates the boundaries where errors are detected and where recovery actions can be applied.

Error Recovery requires Reasoning. Determining why something failed and what to do about it requires reasoning capabilities. Without reasoning, error recovery devolves into blind retry loops.

Error Recovery enhances Tool Use. Tools fail, and Tool Use without Error Recovery propagates those failures. Error Recovery wraps tool invocation with detection, classification, and response.

Error Recovery relies on Abstraction. Effective error recovery operates at the right abstraction level—handling errors in terms of workflow impacts rather than implementation details.

Error Recovery supports Planning. Plans assume certain operations will succeed. Error Recovery provides the contingency handling when reality doesn't match the plan.

Skill Boundaries

Error Recovery has limitations.

Error Recovery cannot fix fundamentally broken implementations. If code is buggy or logic is flawed, retrying won't help. Error Recovery handles operational failures, not logic errors.

Error Recovery cannot compensate for missing capabilities. If a workflow requires functionality that doesn't exist, error recovery cannot conjure it. Lacking fallbacks is a failure of planning.

Error Recovery has performance costs. Validation checks, retry logic, and fallback paths add overhead. For simple operations, complex error recovery may cost more than the failures it prevents.

Error Recovery cannot eliminate all failures. Some problems are unrecoverable by design. If authentication credentials are invalid, retrying won't help. Error Recovery handles recoverable errors and gracefully fails unrecoverable ones.

Error Recovery cannot guess correct parameters. If an operation fails due to invalid parameters, retrying with the same parameters fails again. Don't expect error recovery to magically discover correct inputs.

Prerequisite Skills

Iteration: Error recovery is a specialized form of iteration based on failure evaluation.

Complementary Skills

Evaluation: Recovery starts with evaluation of what went wrong.

Tool Use: Error recovery makes tool use robust by handling tool failures systematically.
