
Bulkhead

Overview

The Bulkhead pattern is a resilience mechanism that isolates failures in one part of an application so they cannot spread to the rest. It works by limiting the number of concurrent executions of a specific operation.

Imagine a ship's hull, which is divided into isolated compartments (bulkheads). If one compartment is breached and floods, the bulkheads prevent the water from sinking the entire ship. Similarly, in a software system, if a downstream service becomes slow, the bulkhead pattern prevents that slowness from consuming all available application resources (like worker threads or connections), which would otherwise cause a cascading failure across your entire service.

The Athomic implementation uses asyncio.Semaphore to enforce these concurrency limits on any asynchronous function via the @bulkhead decorator.

Key Features

  • Concurrency Limiting: Restrict how many instances of a function can run simultaneously.
  • Policy-Based: Define named policies with different concurrency limits for different operations in your configuration.
  • Fail-Fast Mechanism: When a bulkhead's limit is reached, new calls don't wait in a queue. They are rejected immediately, raising a BulkheadRejectedError. This "fail-fast" behavior is crucial for shedding load and protecting system stability.
  • Full Observability: All bulkhead activity is instrumented with Prometheus metrics, tracking concurrent requests, accepted calls, and rejections for each policy (an illustrative sketch follows this list).
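
The metric and label names below are illustrative only, not the library's actual metric names. The sketch uses the standard prometheus_client package to show the kind of instrumentation this feature describes: a gauge for in-flight calls plus counters for accepted and rejected calls, all labelled by policy.

from prometheus_client import Counter, Gauge

# Illustrative metric names -- not Athomic's real ones.
BULKHEAD_IN_FLIGHT = Gauge(
    "bulkhead_concurrent_requests", "Calls currently holding a bulkhead slot", ["policy"]
)
BULKHEAD_ACCEPTED = Counter(
    "bulkhead_accepted_total", "Calls that acquired a bulkhead slot", ["policy"]
)
BULKHEAD_REJECTED = Counter(
    "bulkhead_rejected_total", "Calls rejected because the bulkhead was full", ["policy"]
)

def record_accepted(policy: str) -> None:
    BULKHEAD_ACCEPTED.labels(policy=policy).inc()
    BULKHEAD_IN_FLIGHT.labels(policy=policy).inc()

def record_released(policy: str) -> None:
    BULKHEAD_IN_FLIGHT.labels(policy=policy).dec()

def record_rejected(policy: str) -> None:
    BULKHEAD_REJECTED.labels(policy=policy).inc()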

How It Works

  1. Policies & Semaphores: You define named policies in your configuration, each with a specific concurrency limit (e.g., payment_api: 5). The singleton BulkheadService manages an asyncio.Semaphore for each policy.
  2. @bulkhead Decorator: You apply the @bulkhead(policy="...") decorator to an async function.
  3. Acquisition Attempt: When the decorated function is called, it attempts to acquire a "slot" from the semaphore associated with its policy. This is a non-blocking check.
  4. Execution or Rejection:
    • If a slot is available, the function executes normally. When it completes (or fails), the slot is released.
    • If no slots are available (the limit is reached), the acquisition fails immediately, and a BulkheadRejectedError is raised without executing the function (a minimal sketch of this flow follows the list).
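
To make this flow concrete, here is a minimal sketch built directly on asyncio.Semaphore. It is not the Athomic implementation (SketchBulkheadService and _ACQUIRE_TIMEOUT are invented names); it only mirrors the behaviour described above: one lazily created semaphore per policy, a near-zero-timeout acquisition, and an immediate BulkheadRejectedError when no slot is free.

import asyncio
from contextlib import asynccontextmanager
from typing import Dict

class BulkheadRejectedError(Exception):
    """Raised when a call is rejected because the bulkhead is full."""

_ACQUIRE_TIMEOUT = 0.001  # near-zero timeout: acquisition is effectively non-blocking

class SketchBulkheadService:
    def __init__(self, limits: Dict[str, int], default_limit: int = 20):
        self._limits = limits
        self._default_limit = default_limit
        self._semaphores: Dict[str, asyncio.Semaphore] = {}

    def _semaphore_for(self, policy: str) -> asyncio.Semaphore:
        # Lazily create one semaphore per policy name.
        if policy not in self._semaphores:
            limit = self._limits.get(policy, self._default_limit)
            self._semaphores[policy] = asyncio.Semaphore(limit)
        return self._semaphores[policy]

    @asynccontextmanager
    async def acquire(self, policy: str):
        sem = self._semaphore_for(policy)
        try:
            # Fail fast: do not queue behind callers that already hold slots.
            await asyncio.wait_for(sem.acquire(), timeout=_ACQUIRE_TIMEOUT)
        except asyncio.TimeoutError:
            raise BulkheadRejectedError(f"Bulkhead '{policy}' is full") from None
        try:
            yield  # the protected function runs while the slot is held
        finally:
            sem.release()

# Usage: async with SketchBulkheadService({"video_processing": 2}).acquire("video_processing"): ...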

Usage Example

Let's say you have two functions: one is a highly resource-intensive video processing task, and the other is a fast metadata lookup. You can use separate bulkheads to ensure the slow video task can't block the fast metadata lookups.

from nala.athomic.resilience.bulkhead import bulkhead, BulkheadRejectedError

# This policy is defined in settings.toml with a low limit (e.g., 2)
@bulkhead(policy="video_processing")
async def generate_video_thumbnail(video_id: str):
    # Very slow and resource-intensive I/O operation
    await process_video(video_id)

# This policy has a higher limit (e.g., 50)
@bulkhead(policy="metadata_lookup")
async def fetch_user_metadata(user_id: str):
    # Fast database call
    await db.fetch_user(user_id)

async def handle_request(video_id: str):
    try:
        # If 2 video tasks are already running, this call will fail immediately
        await generate_video_thumbnail(video_id)
    except BulkheadRejectedError:
        # You can now handle the rejection gracefully, e.g., by returning a 429 Too Many Requests
        print("The system is currently busy processing other videos. Please try again later.")

Configuration

You configure your bulkhead policies in settings.toml under the resilience.bulkhead section (shown here under the default environment):

[default.resilience.bulkhead]
enabled = true

# The default concurrency limit to apply if a policy is not found.
default_limit = 20

  # A dictionary of named policies and their specific concurrency limits.
  [default.resilience.bulkhead.policies]
  video_processing = 2
  metadata_lookup = 50
  external_api_calls = 10
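
The sketch below is only a plain-Python mirror of the table above, showing how default_limit interacts with named policies; BulkheadSettingsSketch and limit_for are invented names for illustration, not the library's BulkheadSettings API.

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class BulkheadSettingsSketch:
    enabled: bool = True
    default_limit: int = 20
    policies: Dict[str, int] = field(default_factory=dict)

    def limit_for(self, policy: str) -> int:
        # Named policies win; anything not listed falls back to default_limit.
        return self.policies.get(policy, self.default_limit)

settings = BulkheadSettingsSketch(
    policies={"video_processing": 2, "metadata_lookup": 50, "external_api_calls": 10},
)
assert settings.limit_for("video_processing") == 2
assert settings.limit_for("unknown_policy") == 20  # default_limit fallback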

API Reference

nala.athomic.resilience.bulkhead.decorator.bulkhead(policy)

Decorator to protect an asynchronous function with a concurrency limit (Bulkhead).

This pattern isolates resource usage by limiting the number of concurrent executions for the decorated function, preventing cascading failures. It uses the BulkheadService to acquire a slot before execution.

Parameters:

  • policy (str, required): The unique name of the bulkhead policy to apply, as defined in the application configuration.

Returns:

  • Callable[..., Any]: The decorator function.

Raises:

  • TypeError: If the decorated function is not asynchronous.

nala.athomic.resilience.bulkhead.service.BulkheadService

Manages all bulkhead policies and their corresponding semaphores.

This service enforces concurrency limits on asynchronous operations based on predefined policies, preventing cascading failures by isolating resource usage. It lazily creates an asyncio.Semaphore for each unique policy name.

Attributes:

  • settings (BulkheadSettings): The bulkhead configuration settings.
  • _semaphores (Dict[str, Semaphore]): Cache of active semaphores, keyed by policy name.

__init__(settings)

Initializes the BulkheadService.

Parameters:

  • settings (BulkheadSettings, required): The application's bulkhead configuration.

acquire(policy) async

Acquires a slot from the bulkhead for a given policy, using an async context manager.

If the bulkhead is full, it raises BulkheadRejectedError immediately (non-blocking acquisition attempt with a minimal timeout). Observability metrics are updated for both accepted and rejected calls.

Parameters:

  • policy (str, required): The name of the bulkhead policy to enforce.

Raises:

  • BulkheadRejectedError: If the acquisition times out (meaning the bulkhead is full).
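
Since acquire is described as an async context manager, direct use of the service (without the decorator) would presumably look like the hedged sketch below; how the BulkheadService instance is obtained and the send_to_partner helper are assumptions for illustration.

from nala.athomic.resilience.bulkhead.service import BulkheadService

async def call_partner_api(service: BulkheadService, payload: dict):
    # The slot is released automatically when the block exits, even on error.
    async with service.acquire(policy="external_api_calls"):
        return await send_to_partner(payload)  # send_to_partner is a placeholder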

nala.athomic.resilience.bulkhead.exceptions.BulkheadRejectedError

Bases: Exception

Raised when a call is rejected because the bulkhead is full.