Retry

Overview

The Retry pattern is a fundamental resilience mechanism that allows an application to automatically re-execute an operation that has failed due to a transient error, such as a temporary network glitch or a brief service unavailability. This prevents temporary issues from escalating into hard failures, significantly improving the stability and reliability of your service.

The Athomic retry module is built on the robust and battle-tested tenacity library. It provides a highly configurable, policy-based approach to retries through a simple @retry decorator.

Key Features

Declarative Use: Apply complex retry logic with a single decorator.
Policy-Based Configuration: Centrally define multiple named retry policies (e.g., "fast_retry", "slow_retry") in your configuration files.
Exponential Backoff & Jitter: Built-in support for exponential backoff with random jitter to prevent "thundering herd" problems.
Live Configuration: Retry policies can be tuned and updated at runtime without restarting the application.
Full Observability: Every retry attempt is logged and instrumented with traces and metrics.

How It Works

The retry mechanism is centered around Retry Policies. A policy is a set of rules that defines the retry behavior:

attempts: The maximum number of times to retry the operation.
wait_min_seconds / wait_max_seconds: The minimum and maximum delay between retries.
backoff: A multiplier for the delay, enabling exponential backoff.
jitter: A random factor added to the delay to de-synchronize retries from multiple clients.
exceptions: A list of specific exception types that should trigger a retry.
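
To make the interaction of these fields concrete, the following is a minimal sketch of one plausible way a delay could be derived from them. The function name `compute_delay` and the formula itself (exponential growth from `wait_min_seconds`, plus uniform jitter, clamped to `wait_max_seconds`) are assumptions for illustration only; the real calculation is delegated to tenacity and may differ.

```python
import random


def compute_delay(
    retry_index: int,
    wait_min_seconds: float,
    wait_max_seconds: float,
    backoff: float,
    jitter: float = 0.0,
) -> float:
    """Return the delay before retry number `retry_index` (0-based).

    Hypothetical formula, not taken from the Athomic source:
    grow exponentially from wait_min_seconds, add up to `jitter`
    seconds of random noise, then clamp to the configured bounds.
    """
    base = wait_min_seconds * (backoff ** retry_index)
    noisy = base + random.uniform(0.0, jitter)
    return max(wait_min_seconds, min(noisy, wait_max_seconds))


# Without jitter the schedule is deterministic.
schedule = [compute_delay(i, 1.0, 10.0, 2.0) for i in range(5)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 10.0]
```

With a non-zero `jitter`, two clients that fail at the same moment would be unlikely to retry at the same moment, which is the de-synchronization described above.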

The @retry decorator applies a policy to a function. When the decorated function is called, a RetryHandler manages the execution. If the function raises one of the configured exceptions, the handler waits for the calculated delay and then re-executes the function, up to the maximum number of attempts.
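
The control flow described above can be sketched with the standard library alone. This is not the RetryHandler implementation (which builds on tenacity); `run_with_retry` and its parameters are hypothetical, and the sketch assumes that `attempts` counts the total number of calls, including the first one.

```python
import asyncio
from typing import Any, Awaitable, Callable, Tuple, Type


async def run_with_retry(
    fn: Callable[..., Awaitable[Any]],
    *args: Any,
    attempts: int = 3,
    wait_seconds: float = 0.01,
    backoff: float = 2.0,
    exceptions: Tuple[Type[BaseException], ...] = (ConnectionError,),
    **kwargs: Any,
) -> Any:
    """Call `fn` until it succeeds or `attempts` calls have been made."""
    for attempt in range(1, attempts + 1):
        try:
            return await fn(*args, **kwargs)
        except exceptions:
            if attempt == attempts:
                raise  # Budget exhausted: surface the last error.
            # Wait longer after each failure (exponential backoff).
            await asyncio.sleep(wait_seconds * backoff ** (attempt - 1))


calls = 0


async def flaky() -> str:
    """Fail twice with a transient error, then succeed."""
    global calls
    calls += 1
    if calls < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = asyncio.run(run_with_retry(flaky))
print(result, calls)  # ok 3
```

Exceptions that are not listed in `exceptions` are not caught, so they propagate to the caller on the first failure.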

Usage Example

You can apply a named retry policy directly to a function. The decorator supports both synchronous and asynchronous functions; this example uses an asynchronous one.

from nala.athomic.resilience import retry, RetryFactory

# In a real app, the factory would be a singleton.
retry_factory = RetryFactory()
# Get a pre-configured policy from your settings.toml
fast_retry_policy = retry_factory.create_policy(name="fast_retry")

@retry(policy=fast_retry_policy)
async def call_flaky_service(request_data: dict) -> dict:
    """
    This function will be retried up to 3 times with a short,
    exponential backoff if it fails with an HTTPException.
    """
    # `http_client` is assumed to be an async HTTP client created elsewhere
    # in the application. This call might fail temporarily.
    response = await http_client.post("/external-api", json=request_data)
    response.raise_for_status()
    return response.json()

Configuration

You define all your retry policies in settings.toml under the [default.resilience.retry] section. You can specify a default_policy and a dictionary of named policies for different use cases.

[default.resilience.retry]
enabled = true

  # This policy will be used if no specific policy is requested.
  [default.resilience.retry.default_policy]
  attempts = 3
  wait_min_seconds = 1.0
  wait_max_seconds = 10.0
  backoff = 2.0 # Wait time will be ~1s, 2s, 4s...
  exceptions = ["HTTPRequestError", "HTTPTimeoutError"]

  # A dictionary of named, reusable policies.
  [default.resilience.retry.policies]

    # A policy for quick, internal retries.
    [default.resilience.retry.policies.fast_retry]
    attempts = 3
    wait_min_seconds = 0.1
    wait_max_seconds = 1.0
    backoff = 1.5
    exceptions = ["HTTPException"]

    # A policy for long-running background tasks.
    [default.resilience.retry.policies.long_retry]
    attempts = 10
    wait_min_seconds = 5.0
    wait_max_seconds = 300.0 # 5 minutes
    backoff = 2.0
    exceptions = ["ConnectionError"]
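
The `exceptions` entries above are plain strings, so at some point they have to be mapped to exception classes. How Athomic performs this lookup is not described here; the sketch below shows one possible approach. The `CUSTOM_EXCEPTIONS` registry, the stand-in `HTTPException` class, and `resolve_exceptions` are all hypothetical names.

```python
import builtins
from typing import Dict, List, Tuple, Type


class HTTPException(Exception):
    """Stand-in for the application's HTTP error type (hypothetical)."""


# Hypothetical registry of non-builtin exception types known to the app.
CUSTOM_EXCEPTIONS: Dict[str, Type[BaseException]] = {
    "HTTPException": HTTPException,
}


def resolve_exceptions(names: List[str]) -> Tuple[Type[BaseException], ...]:
    """Map exception names from configuration to exception classes."""
    resolved = []
    for name in names:
        # Prefer application-registered types, then fall back to builtins.
        candidate = CUSTOM_EXCEPTIONS.get(name) or getattr(builtins, name, None)
        if not (isinstance(candidate, type) and issubclass(candidate, BaseException)):
            raise ValueError(f"Unknown exception type in retry policy: {name!r}")
        resolved.append(candidate)
    return tuple(resolved)


print(resolve_exceptions(["HTTPException", "ConnectionError"]))
```

Rejecting unknown names when the configuration is loaded means a typo in settings.toml fails loudly instead of silently producing a policy that never retries.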

Live Configuration

Because the RetrySettings model is a LiveConfigModel, you can change any of these values (e.g., attempts, wait_max_seconds) in your live configuration source (like Consul), and the RetryFactory will use the new values for all subsequent operations without requiring an application restart.
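
As an illustration of "new values for all subsequent operations", the sketch below shows a factory that reads its settings every time a policy is created. `LiveSettings`, `PolicySnapshot`, and `SketchRetryFactory` are hypothetical stand-ins, not Athomic classes, and the real LiveConfigModel mechanism may work differently.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class PolicySnapshot:
    """Immutable view of the settings at the moment of creation."""
    attempts: int
    wait_max_seconds: float


class LiveSettings:
    """Hypothetical stand-in for a live configuration source."""

    def __init__(self, values: Dict[str, Any]) -> None:
        self._values = dict(values)

    def update(self, **changes: Any) -> None:
        self._values.update(changes)

    def get(self, key: str) -> Any:
        return self._values[key]


class SketchRetryFactory:
    """Reads the settings on every call instead of caching them."""

    def __init__(self, settings: LiveSettings) -> None:
        self._settings = settings

    def create_policy(self) -> PolicySnapshot:
        return PolicySnapshot(
            attempts=self._settings.get("attempts"),
            wait_max_seconds=self._settings.get("wait_max_seconds"),
        )


settings = LiveSettings({"attempts": 3, "wait_max_seconds": 10.0})
factory = SketchRetryFactory(settings)

before = factory.create_policy()
settings.update(attempts=5)  # Simulates a change in the live config source.
after = factory.create_policy()

print(before.attempts, after.attempts)  # 3 5
```

In this sketch, a policy object created before the update keeps its old values. Whether already-created Athomic policies observe later changes is not stated in this document, so it is worth verifying before relying on it for policies created once at import time.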

API Reference

`nala.athomic.resilience.retry.decorator.retry(*, policy=None, operation_name=None, on_retry=None, on_fail=None, circuit_breaker_hook=None, logger=None, tracer=None)`

A decorator to add retry logic to synchronous or asynchronous functions, with optional hooks for retry, failure, circuit breaker, logging, and tracing.

Args:

policy (Optional[RetryPolicy]): The retry policy to use. If None, a default policy may be applied.
operation_name (Optional[str]): An optional name for the operation, used for logging or tracing.
on_retry (Optional[Callable]): Optional callback invoked on each retry attempt.
on_fail (Optional[Callable]): Optional callback invoked when all retry attempts fail.
circuit_breaker_hook (Optional[Callable[[BaseException], None]]): Optional callback invoked when a circuit breaker event occurs.
logger (Optional[Callable]): Optional logger function for logging retry events.
tracer (Optional[Callable]): Optional tracer function for tracing retry events.

Returns:

Callable: A decorator that wraps the target function with retry logic, supporting both sync and async functions.

Usage:

@retry(policy=my_policy, on_retry=my_on_retry)
def my_function(...): ...
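
One common way for a single decorator to support both sync and async functions is to inspect the target and return a matching wrapper. The sketch below illustrates that technique only; `retry_sketch` and the callback signatures `on_retry(attempt, exc)` and `on_fail(exc)` are assumptions, not the actual Athomic signatures, and the sketch omits the wait between attempts for brevity.

```python
import asyncio
import functools
import inspect
from typing import Any, Callable, Optional


def retry_sketch(
    *,
    attempts: int = 3,
    on_retry: Optional[Callable[[int, BaseException], None]] = None,
    on_fail: Optional[Callable[[BaseException], None]] = None,
) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Hypothetical decorator that retries sync or async callables."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        def should_retry(attempt: int, exc: BaseException) -> bool:
            """Fire the hooks and report whether another call is allowed."""
            if attempt >= attempts:
                if on_fail is not None:
                    on_fail(exc)
                return False
            if on_retry is not None:
                on_retry(attempt, exc)
            return True

        if inspect.iscoroutinefunction(fn):
            @functools.wraps(fn)
            async def async_wrapper(*args: Any, **kwargs: Any) -> Any:
                attempt = 0
                while True:
                    attempt += 1
                    try:
                        return await fn(*args, **kwargs)
                    except Exception as exc:
                        if not should_retry(attempt, exc):
                            raise

            return async_wrapper

        @functools.wraps(fn)
        def sync_wrapper(*args: Any, **kwargs: Any) -> Any:
            attempt = 0
            while True:
                attempt += 1
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    if not should_retry(attempt, exc):
                        raise

        return sync_wrapper

    return decorator


retried = []


@retry_sketch(attempts=3, on_retry=lambda n, exc: retried.append(n))
def sometimes_fails() -> str:
    """Fail until two retries have been recorded, then succeed."""
    if len(retried) < 2:
        raise TimeoutError("try again")
    return "done"


@retry_sketch(attempts=2)
async def async_ok() -> str:
    return "async done"


print(sometimes_fails(), retried)  # done [1, 2]
print(asyncio.run(async_ok()))     # async done
```

Sharing the hook logic in one helper keeps the two wrappers identical apart from the `await`, so sync and async callers see the same retry behavior.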

`nala.athomic.resilience.retry.policy.RetryPolicy`

Defines the retry behavior: exceptions, attempts, delays, backoff, jitter, etc.

`nala.athomic.resilience.retry.factory.RetryFactory`

Factory class for creating retry policies and decorators for resilient operations.

`create_policy(name=None)`

Creates a RetryPolicy instance based on a named policy from settings.

In MESH mode, this returns a NoOpRetryPolicy (single attempt).

`create_retry_decorator(*, policy=None, operation_name=None, logger=None, tracer=None, on_retry=None, on_fail=None, circuit_breaker_hook=None, **policy_overrides)`

Creates the retry decorator.

`create_retry_handler(policy_name=None, operation_name=None, **kwargs)`

Creates a RetryHandler instance configured with a specific named policy.

`nala.athomic.resilience.retry.handler.RetryHandler`

Manages the retry logic for synchronous and asynchronous functions based on a defined policy.

This class encapsulates the retry behavior, using the tenacity library, and integrates framework features such as exponential backoff, observability (logging, tracing, metrics), and circuit breaker integration.

`__init__(policy, operation_name=None, on_retry=None, on_fail=None, circuit_breaker_hook=None, tracer=None, logger=None)`

Initializes the RetryHandler.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `policy` | `RetryPolicy` | The retry policy defining behavior (attempts, delays, exceptions). | required |
| `operation_name` | `Optional[str]` | A descriptive name for the operation, used in logs and metrics. | `None` |
| `on_retry` | `Optional[RetryCallback]` | Hook executed before sleep on each failed retry attempt. | `None` |
| `on_fail` | `Optional[FailCallback]` | Hook executed when all retry attempts fail permanently. | `None` |
| `circuit_breaker_hook` | `Optional[Callable[[], bool]]` | Hook function that returns True if the circuit is open, aborting retries. | `None` |
| `tracer` | `Optional[Tracer]` | OpenTelemetry tracer instance. | `None` |
| `logger` | `Optional[Logger]` | Logger instance for instrumentation. | `None` |
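
The `circuit_breaker_hook` row describes a zero-argument callable that returns True when the circuit is open. As a rough sketch of that contract, the loop below consults such a hook before every call. `run_guarded` and `CircuitOpenError` are hypothetical names, and the exact point at which the real RetryHandler consults the hook is an assumption.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Hypothetical error raised when the hook aborts the retries."""


def run_guarded(
    fn: Callable[[], T],
    attempts: int,
    circuit_breaker_hook: Optional[Callable[[], bool]] = None,
) -> T:
    """Retry `fn` on ConnectionError unless the circuit is reported open."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[BaseException] = None
    for _ in range(attempts):
        # Ask the hook before every call; True means "stop retrying now".
        if circuit_breaker_hook is not None and circuit_breaker_hook():
            raise CircuitOpenError("circuit is open; retries aborted") from last_error
        try:
            return fn()
        except ConnectionError as exc:
            last_error = exc
    assert last_error is not None
    raise last_error


failures = 0


def unreachable() -> str:
    """Always fail, counting how many times it was called."""
    global failures
    failures += 1
    raise ConnectionError("service down")


try:
    run_guarded(unreachable, attempts=5, circuit_breaker_hook=lambda: failures >= 2)
except CircuitOpenError as exc:
    print(f"aborted after {failures} calls: {exc}")
# aborted after 2 calls: circuit is open; retries aborted
```

Here the hook cuts the budget of five attempts down to two calls, which avoids adding load to a dependency that is already known to be failing.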

`arun(fn, *args, **kwargs)` `async`

Executes an asynchronous function with the configured retry policy.

`run(fn, *args, **kwargs)`

Executes a synchronous function with the configured retry policy.
