Dead letter queues in practice
A dead letter queue (DLQ) is one of those patterns that looks trivial on a whiteboard and turns into a quiet source of pain in production. The idea is simple: when a message can't be processed after N attempts, stop retrying it and move it somewhere out of the way so it doesn't block the rest of the stream. The trouble is everything around that "somewhere".
A DLQ is not a garbage can
The most common failure I've seen is treating the DLQ as write-only. Messages go in, nobody looks at them, and six weeks later someone notices a downstream table is missing 0.3% of its rows. If nothing consumes or alerts on the DLQ, you haven't built resilience — you've built a slow data-loss machine with good branding.
Minimum viable DLQ hygiene, in order of importance:
- An alert on DLQ depth > 0 (or above a small threshold for noisy streams).
- Enough context on each dead message to reproduce the failure: original topic/partition/offset, the exception, and a timestamp.
- A documented, boring way to replay.
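The first item on that list can be as small as a threshold check run on a schedule. Here is a minimal sketch, assuming you can already read the current queue depth from your broker's metrics. `check_dlq_depth` and its parameters are hypothetical names for illustration, not part of any library.

```python
from typing import Optional


def check_dlq_depth(queue: str, depth: int, threshold: int = 0) -> Optional[str]:
    """Return an alert message if the DLQ depth exceeds the threshold, else None.

    `depth` is assumed to come from your broker's metrics; fetching it is
    out of scope for this sketch.
    """
    if depth < 0:
        raise ValueError("depth cannot be negative")
    if depth > threshold:
        return f"DLQ {queue!r} has {depth} message(s), threshold is {threshold}"
    return None
```

With the default threshold of 0, a single dead message is enough to produce an alert. For a noisy stream you would pass a small positive `threshold` instead.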
Keep the failure reason with the message
Replaying a message is useless if you don't know why it died. I wrap the original payload rather than mutating it:
{
  "reason": "SchemaValidationError: field 'amount' is null",
  "attempts": 5,
  "source": {"topic": "orders", "partition": 3, "offset": 918273},
  "first_seen": "2026-01-18T09:12:44Z",
  "payload": { "...": "original message, untouched" }
}
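Building that envelope is only a few lines. The sketch below is one way to do it, assuming the consumer has the source coordinates and the caught exception to hand. The function names `wrap_dead_letter` and `unwrap_for_replay` are hypothetical, and `first_seen` is assumed to be a timezone-aware datetime.

```python
from datetime import datetime, timezone
from typing import Any, Optional, Tuple


def wrap_dead_letter(
    payload: Any,
    error: Exception,
    attempts: int,
    topic: str,
    partition: int,
    offset: int,
    first_seen: Optional[datetime] = None,
) -> dict:
    """Build a DLQ envelope around the original payload without modifying it."""
    # Default to "now" in UTC when the caller did not track the first failure.
    seen = first_seen or datetime.now(timezone.utc)
    return {
        "reason": f"{type(error).__name__}: {error}",
        "attempts": attempts,
        "source": {"topic": topic, "partition": partition, "offset": offset},
        "first_seen": seen.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "payload": payload,  # stored as-is, never rewritten
    }


def unwrap_for_replay(envelope: dict) -> Tuple[str, Any]:
    """Return (topic, payload) so a replay tool can republish the original message."""
    return envelope["source"]["topic"], envelope["payload"]
```

Because the payload is carried through untouched, replay reduces to reading `source.topic` and republishing `payload`. The metadata stays behind in the DLQ record for whoever investigates.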
Poison messages vs. transient failures
These need different handling and people constantly conflate them. A transient failure (downstream 503, a lock timeout) should be retried with backoff and will usually resolve itself. A poison message (malformed payload, a null where the schema promised a value) will fail identically forever — retrying it just burns CPU and delays everything behind it. The DLQ is for poison. If your DLQ is full of transient failures, your retry policy is wrong, not your consumers.
My rule of thumb: retry transient errors in-process with capped exponential backoff, and only send a message to the DLQ when its error is deterministic or the attempt budget is exhausted. Dead-lettering anything else just moves the problem.
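That rule of thumb can be sketched as a small retry loop. This is an illustration under assumptions, not a drop-in consumer: `TransientError` and `PoisonError` are hypothetical exception classes standing in for however your code classifies failures, `handler` and `send_to_dlq` are callables you supply, and the backoff has no jitter.

```python
import time
from typing import Any, Callable


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (503s, lock timeouts)."""


class PoisonError(Exception):
    """Hypothetical marker for deterministic failures (malformed payloads)."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Capped exponential backoff: base * 2**(attempt - 1), never above cap."""
    return min(cap, base * 2 ** (attempt - 1))


def process_with_retry(
    message: Any,
    handler: Callable[[Any], None],
    send_to_dlq: Callable[[Any, Exception, int], None],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if handled, False if the message was dead-lettered.

    Exceptions other than the two marker classes propagate to the caller.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return True
        except PoisonError as exc:
            # Deterministic failure: retrying cannot help, so dead-letter now.
            send_to_dlq(message, exc, attempt)
            return False
        except TransientError as exc:
            if attempt == max_attempts:
                # Attempt budget exhausted: stop blocking the stream.
                send_to_dlq(message, exc, attempt)
                return False
            sleep(backoff_delay(attempt))
    return False  # not reached; the loop always returns
```

With the defaults, a transient failure waits 0.5 s, 1 s, 2 s, and 4 s between its five attempts, while a poison message is dead-lettered on the first attempt without sleeping at all. Passing `sleep` as a parameter keeps the loop testable without real delays.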