
Exponential Backoff

Overview

Exponential Backoff is a standard resilience strategy used to gradually increase the delay between consecutive actions. This is useful in two primary scenarios:

  1. Polling Idle Resources: When a background service is polling for work (like the OutboxPublisher checking for new events) and finds none, it's inefficient to poll again immediately. Exponential backoff increases the wait time between polls, reducing CPU and network usage during idle periods.
  2. Retrying Failed Operations: When retrying a failed call to a downstream service, applying an increasing delay between attempts gives the struggling service time to recover.

The Athomic implementation provides a stateful BackoffHandler that manages this logic based on configurable, named policies.

Key Features

  • Policy-Based: Define multiple named backoff policies with different timings (min_delay, max_delay, factor) for various use cases.
  • Stateful Handler: The BackoffHandler tracks the current delay for you, increasing it after each wait() and returning it to the minimum when reset() is called.
  • Live Configuration: Backoff policies can be tuned in real-time without an application restart.

How It Works

The system is composed of three main components:

  1. BackoffPolicy: A simple data object that holds the rules for a backoff strategy:

    • min_delay: The initial and minimum wait time.
    • max_delay: The maximum time to wait, which caps the exponential growth.
    • factor: The multiplier used to increase the delay after each wait (e.g., a factor of 1.5 will increase the wait time by 50% on each step).
  2. BackoffHandler: The stateful object that orchestrates the backoff logic. Its key methods are:

    • wait(): Asynchronously sleeps for the current_delay period and then calculates the next, longer delay.
    • reset(): Resets the current_delay back to the policy's min_delay.
  3. BackoffFactory: A factory used to create configured BackoffHandler instances based on named policies from your settings.toml.
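
The interaction between the policy and the handler can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Athomic source: SketchPolicy and SketchBackoffHandler are hypothetical stand-ins, and the injectable sleep argument exists only so the sketch can be exercised without real waiting.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass(frozen=True)
class SketchPolicy:
    """Hypothetical stand-in for BackoffPolicy."""
    min_delay: float
    max_delay: float
    factor: float


class SketchBackoffHandler:
    """Hypothetical stand-in for BackoffHandler; not the Athomic implementation."""

    def __init__(
        self,
        policy: SketchPolicy,
        sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
    ) -> None:
        self._policy = policy
        self._sleep = sleep  # assumption: injectable so tests need not really sleep
        self.current_delay = policy.min_delay

    async def wait(self) -> None:
        # Sleep for the current delay, then grow it, capped at max_delay.
        await self._sleep(self.current_delay)
        self.current_delay = min(
            self.current_delay * self._policy.factor,
            self._policy.max_delay,
        )

    def reset(self) -> None:
        # Return to the policy's minimum delay.
        self.current_delay = self._policy.min_delay
```

The key design point is that the cap is applied when the next delay is computed, so the handler never sleeps longer than max_delay.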


Use Case: The OutboxPublisher Polling Loop

The OutboxPublisher is a perfect example of this pattern in action:

  • In its main loop, it polls the database for new events.
  • If no events are found, it calls backoff_handler.wait(). With a min_delay of 1 second and a factor of 1.5, the first wait is 1 second, the next 1.5 seconds, then 2.25, and so on, up to the configured max_delay. This keeps database and CPU load low while the service is idle.
  • As soon as it finds and processes an event, it immediately calls backoff_handler.reset(). This resets the delay to the minimum, ensuring the service becomes highly responsive as soon as there is work to do.
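
The sequence of waits described above can be traced with a small pure function. This is an illustrative sketch only: simulate_waits is a hypothetical helper, not part of Athomic, and it assumes the delay grows by multiplication and is capped at max_delay as described in "How It Works".

```python
from typing import Iterable, List


def simulate_waits(
    poll_results: Iterable[bool],
    min_delay: float,
    max_delay: float,
    factor: float,
) -> List[float]:
    """Return the delay slept after each idle poll (True means work was found)."""
    current = min_delay
    waits: List[float] = []
    for found_work in poll_results:
        if found_work:
            current = min_delay  # reset(): become responsive again
        else:
            waits.append(current)  # wait(): sleep for the current delay...
            current = min(current * factor, max_delay)  # ...then grow it, capped
    return waits


# Three idle polls, one poll that finds work, then one more idle poll.
print(simulate_waits([False, False, False, True, False], 1.0, 30.0, 1.5))
# → [1.0, 1.5, 2.25, 1.0]
```

The last value shows the effect of reset(): after work is found, the next idle wait starts again from the minimum.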

Usage Example

You can use the BackoffHandler in any custom polling loop.

import asyncio
from nala.athomic.resilience.backoff import BackoffFactory

# Get a handler configured with the "my_worker_policy" from settings.toml
backoff_factory = BackoffFactory()
backoff_handler = backoff_factory.create_handler(policy_name="my_worker_policy")

async def poll_for_work() -> bool:
    # Placeholder: replace with your own polling logic.
    # Return True if work was found and processed, False otherwise.
    return False

async def my_polling_worker():
    while True:
        work_done = await poll_for_work()

        if work_done:
            # We found work, so reset the delay to be responsive
            backoff_handler.reset()
        else:
            # No work found, wait with an increasing delay
            print("No work found, backing off...")
            await backoff_handler.wait()

if __name__ == "__main__":
    # Runs until the process is stopped.
    asyncio.run(my_polling_worker())

Configuration

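You define backoff policies in your settings.toml under the resilience.backoff section. The example below places that section in the default environment, which is why its tables are named [default.resilience.backoff].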

[default.resilience.backoff]
enabled = true

  # A default policy if no specific one is requested.
  [default.resilience.backoff.default_policy]
  min_delay_seconds = 1.0
  max_delay_seconds = 30.0
  factor = 1.5

  # A dictionary of named, reusable policies.
  [default.resilience.backoff.policies]

    # A policy for an aggressive, fast-polling worker.
    [default.resilience.backoff.policies.outbox_publisher_polling]
    min_delay_seconds = 0.1
    max_delay_seconds = 5.0
    factor = 1.2

    # A policy for a slow, infrequent background job.
    [default.resilience.backoff.policies.daily_cleanup_job]
    min_delay_seconds = 60.0
    max_delay_seconds = 3600.0 # 1 hour
    factor = 2.0
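
When choosing values, it can help to know how many consecutive idle waits it takes before a policy reaches its cap. The sketch below is a hypothetical helper, not part of Athomic; it assumes the delay is multiplied by factor after every wait and capped at max_delay, and it uses the values from the example configuration.

```python
def waits_before_cap(min_delay: float, max_delay: float, factor: float) -> int:
    """Count the waits after which the next delay is capped at max_delay."""
    if factor <= 1.0 or min_delay <= 0:
        raise ValueError("the delay only grows when factor > 1 and min_delay > 0")
    delay, steps = min_delay, 0
    while delay < max_delay:
        delay *= factor
        steps += 1
    return steps


# (min_delay_seconds, max_delay_seconds, factor) copied from the example settings.
policies = {
    "default_policy": (1.0, 30.0, 1.5),
    "outbox_publisher_polling": (0.1, 5.0, 1.2),
    "daily_cleanup_job": (60.0, 3600.0, 2.0),
}
for name, (low, high, factor) in policies.items():
    print(name, waits_before_cap(low, high, factor))
```

For the example values this prints 9, 22, and 6 respectively. A small factor such as 1.2 ramps up gently, while a factor of 2.0 reaches the cap in a handful of steps.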

Live Configuration

Because BackoffSettings is a LiveConfigModel, you can change any of these policy values in your live configuration source (e.g., Consul), and the changes will be reflected in the BackoffHandler instances without requiring a restart.


API Reference

nala.athomic.resilience.backoff.handler.BackoffHandler

Manages the state and logic for an exponential backoff strategy.

This handler keeps track of the current delay, dynamically increasing it based on the configured policy parameters (min_delay, max_delay, factor) and providing hooks for error handling.

__init__(policy, operation_name='unknown', on_error=None)

Initializes the BackoffHandler.

Parameters:

  • policy (BackoffPolicy, required): The immutable policy defining the backoff rules.
  • operation_name (str, default 'unknown'): A descriptive name for the operation being managed, used for dedicated logging.
  • on_error (Optional[ErrorCallback], default None): An asynchronous callback executed immediately after an error occurs but before sleeping.

reset()

Resets the internal delay counter back to the minimum policy value.

This should be called after a successful operation to immediately resume high-frequency polling or operation attempts.

wait() async

Waits for the current delay period (typical usage for an idle/polling state) and then increases the delay for the next cycle.

wait_after_error(exc) async

Executes the on_error callback (if configured) and then waits for the current delay period, increasing it for the next cycle.

Parameters:

  • exc (Exception, required): The exception that triggered the delay.
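
For the retry scenario, wait_after_error() and reset() can be combined into a small retry loop. The following is a sketch under stated assumptions: call_with_backoff and max_attempts are hypothetical and not part of Athomic, and SupportsBackoff simply names the two handler methods documented here so that any object providing them can be passed in.

```python
import asyncio
from typing import Awaitable, Callable, Protocol, TypeVar

T = TypeVar("T")


class SupportsBackoff(Protocol):
    """The two BackoffHandler methods this sketch relies on."""

    def reset(self) -> None: ...

    async def wait_after_error(self, exc: Exception) -> None: ...


async def call_with_backoff(
    operation: Callable[[], Awaitable[T]],
    handler: SupportsBackoff,
    max_attempts: int = 5,  # hypothetical limit, not an Athomic setting
) -> T:
    """Hypothetical helper: retry `operation`, backing off between failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = await operation()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            await handler.wait_after_error(exc)  # runs on_error, then sleeps
        else:
            handler.reset()  # success: the next failure starts from min_delay
            return result
    raise ValueError("max_attempts must be at least 1")
```

Bounding the number of attempts is a choice made for this sketch; the handler itself only manages delays and does not decide when to give up.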

nala.athomic.resilience.backoff.policy.BackoffPolicy dataclass

A Value Object holding the configuration for an exponential backoff strategy.

This policy defines the parameters used by the BackoffHandler to control the delay between consecutive polling cycles or retry attempts, ensuring the system waits increasingly longer after failures or during idle periods.

Attributes:

  • min_delay (float): The initial and minimum delay in seconds.
  • max_delay (float): The maximum delay in seconds, capping the exponential increase.
  • factor (float): The multiplier used to increase the delay after each step.

nala.athomic.resilience.backoff.factory.BackoffFactory

Factory class responsible for creating configured instances of BackoffHandler.

It resolves configuration policies (default or named override) and constructs a runnable handler ready for use in polling loops or retry mechanisms.

__init__(settings=None)

Initializes the factory with resolved application settings for backoff.

Parameters:

  • settings (Optional[BackoffSettings], default None): Explicit settings instance. If None, loads from global application settings.

create_handler(policy_name=None, operation_name='default', on_error=None)

Creates a new BackoffHandler instance configured with the specified policy.

The method resolves the correct configuration: prioritizing a named policy if provided and found, or falling back to the default policy otherwise.

Parameters:

  • policy_name (Optional[str], default None): The name of a policy defined in settings (e.g., 'high_contention').
  • operation_name (str, default 'default'): A descriptive name for the operation using this handler, used for logging and metrics.
  • on_error (Optional[ErrorCallback], default None): An asynchronous callable hook executed before sleeping after a failure.

Returns:

  • BackoffHandler: A fully configured handler instance.
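
The policy resolution rule described for create_handler (use the named policy if it is provided and found, otherwise fall back to the default) might look roughly like the sketch below. resolve_policy is a hypothetical function operating on plain dictionaries; how the real factory stores and looks up its settings is not shown in this document.

```python
from typing import Mapping, Optional

PolicyValues = Mapping[str, float]


def resolve_policy(
    default_policy: PolicyValues,
    named_policies: Mapping[str, PolicyValues],
    policy_name: Optional[str] = None,
) -> PolicyValues:
    """Hypothetical resolver: named policy if given and found, else the default."""
    if policy_name is not None and policy_name in named_policies:
        return named_policies[policy_name]
    return default_policy


# Values mirror the example settings.toml.
default = {"min_delay_seconds": 1.0, "max_delay_seconds": 30.0, "factor": 1.5}
named = {
    "outbox_publisher_polling": {
        "min_delay_seconds": 0.1,
        "max_delay_seconds": 5.0,
        "factor": 1.2,
    },
}

print(resolve_policy(default, named, "outbox_publisher_polling")["factor"])  # → 1.2
print(resolve_policy(default, named, "missing_policy")["factor"])  # → 1.5
print(resolve_policy(default, named)["factor"])  # → 1.5
```

Note that under this rule a misspelled policy name falls back to the default policy without raising an error, so it is worth double-checking names against your settings.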