Overview
The resilience module provides retry and fallback mechanisms for building fault-tolerant workflows. It includes exponential backoff with jitter for retries and sequential fallback patterns for graceful degradation.
Classes
RetryPolicy
Configuration for retry behavior.
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| `max_attempts` | `int` | `3` | Maximum number of retry attempts |
| `initial_delay` | `float` | `1.0` | Initial delay in seconds before first retry |
| `max_delay` | `float` | `60.0` | Maximum delay in seconds (caps exponential backoff) |
| `exponential_base` | `float` | `2.0` | Base for exponential backoff (2.0 = double each time) |
| `jitter` | `float` | `1.0` | Random jitter range (0 to jitter seconds) added to delay |
| `retry_on_exceptions` | `tuple[Type[Exception], ...]` | `(Exception,)` | Exception types to retry on |
| `retry_on_status` | `tuple[str, ...]` | `()` | Node status codes to retry on |
Delay Calculation
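The delay between retries starts at initial_delay and is multiplied by exponential_base after each attempt, up to max_delay, with random jitter added. With the default settings:
- Attempt 1: 1.0s + jitter
- Attempt 2: 2.0s + jitter
- Attempt 3: 4.0s + jitter
- Attempt 4: 8.0s + jitter
- Attempt 5: 16.0s + jitter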
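The schedule above is consistent with the following calculation. This is a minimal sketch inferred from the parameter table, not the module's actual code; `compute_delay` is a hypothetical helper name, and applying the cap before adding jitter is an assumption.

```python
import random


def compute_delay(attempt, initial_delay=1.0, max_delay=60.0,
                  exponential_base=2.0, jitter=1.0):
    """Return the delay in seconds before retry number `attempt` (1-based).

    Hypothetical helper: the real module may order these steps differently.
    """
    # Exponential growth: initial_delay * base^(attempt - 1)
    backoff = initial_delay * exponential_base ** (attempt - 1)
    # Assumption: cap the exponential part first, then add uniform jitter
    return min(backoff, max_delay) + random.uniform(0.0, jitter)


# With jitter disabled the schedule is deterministic
schedule = [compute_delay(n, jitter=0.0) for n in range(1, 8)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
```

The seventh value shows the cap: the uncapped backoff would be 64.0s, which `max_delay` limits to 60.0s.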
RetryableNode
Base class for nodes with automatic retry logic.
Constructor Parameters
| Name | Type | Required | Description |
|---|---|---|---|
| `node_id` | `str` | Yes | Unique node identifier |
| `node_type` | `str` | Yes | Node type identifier |
| `name` | `str` | Yes | Human-readable node name |
| `config` | `dict[str, Any]` | Yes | Node configuration |
| `retry_policy` | `RetryPolicy \| None` | No | Retry policy (defaults to RetryPolicy()) |
Methods
execute
Main execution method with retry logic (do not override). Calls _execute_with_retry() internally.
_execute_with_retry
Override this method with your node logic (without retry handling).
Returns: NodeResult - Your node’s execution result
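The split between the two methods can be pictured as a retry loop wrapped around the user-supplied logic. The sketch below is a standalone illustration, not the module's implementation: `run_with_retry` and its `sleep` parameter are hypothetical, `max_attempts` is assumed to count total attempts, and re-raising on exhaustion is an assumption (the real `execute` may return an error NodeResult instead).

```python
import random
import time


def run_with_retry(operation, max_attempts=3, initial_delay=1.0, max_delay=60.0,
                   exponential_base=2.0, jitter=1.0,
                   retry_on_exceptions=(Exception,), sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is reached.

    Returns (result, attempts_used). Re-raises the last error on exhaustion.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except retry_on_exceptions:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the last error
            backoff = min(initial_delay * exponential_base ** (attempt - 1), max_delay)
            sleep(backoff + random.uniform(0.0, jitter))


# Usage: an operation that fails twice with a transient error, then succeeds
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result, attempts = run_with_retry(flaky, sleep=lambda seconds: None)  # skip real sleeping
print(result, attempts)  # ok 3
```

Here `flaky` plays the role of the overridden node logic, which stays free of any retry handling.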
Retry Metadata
When retries occur, the returned NodeResult includes metadata:
FallbackNode
Node that tries multiple fallback options sequentially.
Constructor Parameters
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `node_id` | `str` | Yes | - | Unique node identifier |
| `nodes` | `list[BaseNode]` | Yes | - | List of nodes to try (in order) |
| `handle_exceptions` | `tuple[Type[Exception], ...]` | No | `(Exception,)` | Exceptions to handle with fallback |
| `pass_error_context` | `bool` | No | `True` | Pass previous errors to next node |
| `name` | `str` | No | `"Fallback"` | Node name |
| `config` | `dict[str, Any] \| None` | No | `None` | Node configuration |
Methods
execute
Execute nodes sequentially until one succeeds.
- Try the first node
- If successful, return its result
- If it fails, try the next node
- Repeat until success or all nodes are exhausted
- Return the last error if all fail
Returns: NodeResult - Result from first successful node, or error if all fail
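The sequence above can be sketched as a plain loop over callables. Everything here is illustrative: `Outcome` is a hypothetical stand-in for NodeResult, `run_with_fallback` is not part of the module, and passing earlier errors as a list argument is only one possible reading of `pass_error_context`.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Outcome:
    """Hypothetical stand-in for NodeResult; the real class may differ."""
    success: bool
    value: Any = None
    errors: list = field(default_factory=list)


def run_with_fallback(options, handle_exceptions=(Exception,), pass_error_context=True):
    """Try each option in order and return the first success."""
    errors = []
    for option in options:
        try:
            # Optionally hand the errors collected so far to the next option
            context = list(errors) if pass_error_context else []
            return Outcome(True, option(context), errors)
        except handle_exceptions as exc:
            errors.append(f"{type(exc).__name__}: {exc}")
    # All options failed; errors[-1] holds the last error
    return Outcome(False, None, errors)


def primary(previous_errors):
    raise TimeoutError("primary timed out")


def backup(previous_errors):
    return f"backup ok after {len(previous_errors)} error(s)"


outcome = run_with_fallback([primary, backup])
print(outcome.success, outcome.value)  # True backup ok after 1 error(s)
```

Exceptions outside `handle_exceptions` are not caught in this sketch, so they would stop the chain rather than trigger the next option.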
Fallback Metadata
The returned NodeResult includes fallback information:
Usage Patterns
Basic Retry
Custom Retry Exceptions
Retry on Status Codes
Multi-Model Fallback
Combining Retry and Fallback
Error Context Propagation
Error Handling
Retry Exhaustion
When all retry attempts fail:
All Fallbacks Failed
When all fallback nodes fail:
Non-Retryable Exceptions
Exceptions not in retry_on_exceptions cause immediate failure:
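One plausible way to express this check is an `isinstance` test against the configured tuple. The helper below is a hypothetical sketch, not the module's code; it also shows that subclasses of a listed exception type would match under this approach.

```python
def should_retry(exc, retry_on_exceptions=(Exception,)):
    """Return True if `exc` is an instance of any configured exception type."""
    return isinstance(exc, retry_on_exceptions)


transient = (ConnectionError, TimeoutError)

print(should_retry(TimeoutError("slow"), transient))           # True
print(should_retry(ValueError("bad input"), transient))        # False: fails immediately
print(should_retry(ConnectionResetError("reset"), transient))  # True: subclass of ConnectionError
```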
Best Practices
Set Appropriate Max Attempts
Don’t retry indefinitely. Set reasonable limits:
Use Jitter to Prevent Thundering Herd
Jitter prevents all instances from retrying simultaneously:
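A small simulation makes the effect visible. This is an illustrative sketch only: `retry_times` is a hypothetical helper that models several instances failing at the same moment and computing their next retry time from a shared base delay.

```python
import random


def retry_times(instances, base_delay, jitter, seed=0):
    """Return the retry delay chosen by each of `instances` clients."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    return [base_delay + rng.uniform(0.0, jitter) for _ in range(instances)]


no_jitter = retry_times(5, base_delay=2.0, jitter=0.0)
with_jitter = retry_times(5, base_delay=2.0, jitter=1.0)

print(len(set(no_jitter)))    # 1: every instance retries at the same moment
print(len(set(with_jitter)))  # 5: retries are spread across the jitter window
```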
Specify Retry Exceptions
Only retry exceptions that are transient:
Order Fallbacks by Preference
Put best options first, cheapest/most reliable last:
Combine Strategies
Use both retry (for transient issues) and fallback (for persistent failures):
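The composition can be sketched with plain functions: each option is retried a few times for transient errors, and only when its retries are exhausted does the next option run. `with_retry` and `first_success` are hypothetical helpers written for this illustration, with backoff delays omitted for brevity; they are not the module's RetryableNode or FallbackNode.

```python
def with_retry(operation, max_attempts=3, retry_on=(ConnectionError, TimeoutError)):
    """Wrap `operation` so that transient errors are retried (no delays here)."""
    def wrapped():
        for attempt in range(1, max_attempts + 1):
            try:
                return operation()
            except retry_on:
                if attempt == max_attempts:
                    raise  # retries exhausted: let the fallback layer take over
    return wrapped


def first_success(options):
    """Return the result of the first option that does not raise."""
    last_error = None
    for option in options:
        try:
            return option()
        except Exception as exc:
            last_error = exc
    raise RuntimeError("all fallbacks failed") from last_error


attempts = {"primary": 0}


def primary():
    attempts["primary"] += 1
    raise ConnectionError("primary is down")


def secondary():
    return "secondary result"


result = first_success([with_retry(primary), secondary])
print(result, attempts["primary"])  # secondary result 3
```

The primary option is attempted three times before the fallback runs, which matches the intent of using retry for transient issues and fallback for persistent ones.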
Monitor Retry Metrics
Track retry patterns to optimize policies:
See Also
- Callbacks - Monitor retry events
- Caching - Reduce need for retries
- Rate Limiting - Prevent rate limit errors