Retry & Dead Letter Queue (DLQ)
Overview
The Retry and Dead Letter Queue (DLQ) mechanism is a critical resilience pattern that prevents message loss when a consumer fails to process a message. It provides an automated, configurable way to handle transient (temporary) and permanent processing failures gracefully.
- Retry: When a consumer handler fails with a transient error, the message is automatically republished to the original topic to be processed again. This is handled by the
DLQHandler. - Dead Letter Queue (DLQ): If a message fails to be processed after a configured number of retry attempts, it is considered a "poison pill" and is moved to a separate, dedicated topic called the Dead Letter Queue. This removes the failing message from the main processing queue, preventing it from blocking other messages.
The Failure Handling Flow
When a decorated consumer handler (@subscribe_to) raises an exception, the following automated flow is triggered:
- Failure Detection: The
BaseMessageProcessorcatches the exception. - DLQ Handler Invocation: It packages all information about the failure (the original message, headers, exception details) into a
FailureContextobject and passes it to theDLQHandler. - Retry Check: The
DLQHandlerinspects the message's headers to check the current retry attempt count (using thex-retry-attemptsheader). - Decision:
- If the attempt count is less than
max_attempts(from your configuration), the handler republishes the message back to its original topic with an incremented retry header. - If the attempt count has reached
max_attempts, the handler constructs a detailed "DLQ envelope" (a JSON object with the original message and error details) and publishes it to the configured DLQ topic.
- If the attempt count is less than
The DLQ Processor
Once a message is in the DLQ topic, it needs to be handled. Athomic provides a dedicated background service, the DeadLetterProcessorService, which automatically subscribes to your DLQ topic.
This service consumes the DLQ envelopes and passes them to a configured Dead Letter Strategy for final disposition.
Available Strategies
logging_only(Default): This strategy simply logs the full details of the failed message envelope as a structuredERRORlog. This is the safest default, ensuring that failing messages are recorded for manual inspection without taking further automated action.republish_to_parking_lot: This strategy republishes the failed message to yet another topic, a "parking lot", which can be used for long-term storage or offline analysis.
Configuration
The retry and DLQ mechanism is configured under the [integration.messaging.dlq] section in your settings.toml.
[default.integration.messaging.dlq]
# A master switch to enable or disable the entire retry/DLQ flow.
enabled = true
# The maximum number of times to retry processing a message before sending it to the DLQ.
max_attempts = 3
# The name of the Dead Letter Queue topic where failed messages will be sent.
topic = "platform.dead-letter-queue.v1"
# The strategy for handling messages that arrive in the DLQ.
# Can be "logging_only" or "republish_to_parking_lot".
strategy = "logging_only"
# (Optional) The topic to use if the 'republish_to_parking_lot' strategy is chosen.
# parking_lot_topic = "platform.parking-lot.v1"
API Reference
nala.athomic.integration.messaging.retry.handler.DLQHandler
Orchestrates the handling of consumption failures, implementing the retry mechanism (republishing) and the final dead-letter queue (DLQ) publishing.
This class serves as the decision point: - If attempts < max_attempts and it's a single message: Republish for retry. - If attempts >= max_attempts or it's a batch failure: Send to DLQ.
__init__(producer, retry_policy, serializer, topic, max_attempts)
Initializes the DLQ handler with all necessary dependencies.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
producer
|
ProducerProtocol
|
The producer used to send messages (for retry or DLQ). |
required |
retry_policy
|
MaxAttemptsPolicy
|
The policy defining how retry headers and counts are managed. |
required |
serializer
|
SerializerProtocol
|
The serializer used for general data conversion. |
required |
topic
|
str
|
The destination topic name for the Dead Letter Queue. |
required |
max_attempts
|
int
|
The maximum number of times to retry a message before sending to DLQ. |
required |
handle_failure(context)
async
Orchestrates failure handling using an explicit context object.
- Single messages will be retried up to max_attempts.
- Batch messages will be sent directly to the DLQ without retry attempts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
context
|
FailureContext
|
The |
required |
Returns:
| Name | Type | Description |
|---|---|---|
bool |
bool
|
True if the message was handled (retried or sent to DLQ), False otherwise. |
nala.athomic.integration.messaging.retry.processor_service.DeadLetterProcessorService
Bases: BaseService
A dedicated background service that manages the lifecycle of a consumer subscribed to the Dead Letter Queue (DLQ).
This service is backend-agnostic as it receives a pre-configured consumer instance via dependency injection (DIP). Its purpose is purely to ensure the DLQ consumer is started and stopped gracefully with the application.
__init__(consumer)
Initializes the service with a pre-configured DLQ consumer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
consumer
|
BaseConsumer
|
A consumer instance that is already configured to subscribe to the DLQ topic and process its messages. This consumer is typically created by a Factory and injected here. |
required |
nala.athomic.integration.messaging.retry.strategies.protocol.DeadLetterStrategyProtocol
Bases: Protocol
Defines the contract for strategies that manage and handle messages that have permanently failed processing (Dead Letter Queue - DLQ).
Any class implementing this protocol must provide a concrete implementation for the asynchronous 'handle' method.
handle(failure_details)
async
Processes a message that has permanently failed and requires a final disposition (e.g., logging, manual queue, or permanent storage).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
failure_details
|
Dict[str, Any]
|
A dictionary containing all relevant information about the failed message, including payload, exception details, and retry history. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
ProcessingOutcome |
ProcessingOutcome
|
An enum indicating the result of the DLQ handling process. - ACK: Indicates the handling strategy was successful and the message can be removed from the DLQ. - REQUEUE: Indicates a temporary failure in the handling process, and the message should be requeued to the DLQ for another attempt. |