Building a Dead Letter Queue with PostgreSQL
Learn how to build a production-ready dead letter queue with PostgreSQL: schema design, retry logic with exponential backoff, and monitoring failed jobs.
Postgres
This post was written by an engineer at QueryPlane. QueryPlane is an app builder for your database: bring your own postgres db and you can create interactive applications to share with other developers, coworkers or even your customers. If you’re interested in trying it out, get started here.
When processing events or background jobs, failures are inevitable. Network timeouts, third-party API errors, invalid data—these issues can cause jobs to fail silently or repeatedly. A Dead Letter Queue (DLQ) captures these failed events so you can inspect, debug, and retry them later.
While tools like Kafka and RabbitMQ have built-in DLQ support, many teams find that using PostgreSQL as a DLQ offers better visibility and simpler operations. This post covers how to build a production-ready DLQ using PostgreSQL.
In this post, we’ll cover:
- Why PostgreSQL for DLQs - Benefits over message broker DLQs
- Table schema design - Storing failed events with metadata
- Indexing strategy - Optimizing for common query patterns
- Retry mechanism - Safely reprocessing failed events
- Operational considerations - Monitoring and maintenance
Why PostgreSQL for a Dead Letter Queue
Traditional message brokers like Kafka store failed events in separate topics. While functional, this approach has drawbacks. Filtering by failure reason or event type requires additional tooling. You can’t easily run SQL queries to understand failure patterns. And you have another system to monitor and maintain.
PostgreSQL solves these problems. You can filter, aggregate, and analyze failures with familiar SQL. Failed events are captured atomically with your application state through transactional consistency. There’s no additional infrastructure to deploy since you’re already running PostgreSQL. And you can use any PostgreSQL client to inspect and manage failures.
As one engineering team noted, moving from Kafka DLQs to PostgreSQL transformed their failure handling: failures became “an expected part of the system with a clear, auditable recovery path” rather than opaque messages in a separate topic.
Who uses this approach
Several companies have successfully used PostgreSQL as a job queue and failure handling system in production.
RudderStack scaled their PostgreSQL-based event queue to handle 100,000 events per second over six years of production use. They chose PostgreSQL over specialized tools like Kafka because it’s flexible, reliable for transactions, and easier to debug.
Holistics built a multi-tenant job queue on PostgreSQL specifically because they needed persistent job storage with full visibility into job status—something fire-and-forget queue systems couldn’t provide.
Instagram famously uses PostgreSQL at scale, demonstrating that startups don’t need specialized infrastructure for most use cases. The “use Postgres for everything” philosophy has gained traction because it reduces operational complexity. Having Redis, Kafka, and MongoDB means three potential points of failure. With just PostgreSQL, you consolidate risk and expertise.
Table schema design
A well-designed DLQ table captures everything needed to understand and retry failures:
CREATE TABLE dlq_events (
id BIGSERIAL PRIMARY KEY,
event_type VARCHAR(255) NOT NULL,
payload JSONB NOT NULL,
error_reason TEXT NOT NULL,
error_stacktrace TEXT,
status VARCHAR(20) NOT NULL DEFAULT 'PENDING',
retry_count INT NOT NULL DEFAULT 0,
retry_after TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
);
The JSONB payload column preserves the original event without rigid schema requirements—you can store any event type in the same table. The explicit status field tracks the lifecycle of each failed event through states like PENDING, PROCESSING, SUCCEEDED, and FAILED_PERMANENTLY. The retry_after timestamp prevents retry storms by scheduling when an event becomes eligible for retry. Storing both the error reason (for filtering) and full stacktrace (for debugging) gives you the context needed to diagnose issues. Timestamps enable time-based analysis and auditing.
Indexing strategy
Create indexes based on your query patterns:
-- Fast lookup of events ready for retry
CREATE INDEX idx_dlq_status ON dlq_events(status);
-- Efficient retry scheduling queries
CREATE INDEX idx_dlq_status_retry_after ON dlq_events(status, retry_after);
-- Filter failures by event type
CREATE INDEX idx_dlq_event_type ON dlq_events(event_type);
-- Time-based analysis and cleanup
CREATE INDEX idx_dlq_created_at ON dlq_events(created_at);
The composite index on (status, retry_after) is particularly important—it allows the retry scheduler to efficiently find events that are both pending and ready for retry without scanning the entire table.
Retry mechanism
The retry mechanism uses PostgreSQL’s FOR UPDATE SKIP LOCKED to safely process events across multiple workers:
-- Find and lock events ready for retry
WITH eligible_events AS (
SELECT id
FROM dlq_events
WHERE status = 'PENDING'
AND retry_after <= NOW()
ORDER BY retry_after
LIMIT 50
FOR UPDATE SKIP LOCKED
)
UPDATE dlq_events
SET status = 'PROCESSING',
updated_at = NOW()
WHERE id IN (SELECT id FROM eligible_events)
RETURNING *;
The SKIP LOCKED clause ensures that multiple scheduler instances can run simultaneously without blocking each other—each worker picks up different rows. This is the same technique that powers libraries like pg-boss, created by Tim Jones, and Graphile Worker, maintained by Benjie Gillam.
After processing, update the event based on the outcome:
-- On success
UPDATE dlq_events
SET status = 'SUCCEEDED',
updated_at = NOW()
WHERE id = $1;
-- On failure (schedule for retry with exponential backoff)
UPDATE dlq_events
SET status = 'PENDING',
retry_count = retry_count + 1,
retry_after = NOW() + (INTERVAL '1 minute' * POWER(2, retry_count)),
error_reason = $2,
updated_at = NOW()
WHERE id = $1;
Retry configuration
For retry scheduling, process 50-100 events per batch to balance throughput and transaction size. Run the scheduler every few minutes to hours depending on urgency. Set a maximum retry limit (10-20 attempts is common) before marking events as permanently failed. Use exponential backoff to prevent overwhelming failing downstream services—the query above doubles the wait time after each failure.
Querying failures
One of the biggest advantages of a PostgreSQL DLQ is ad-hoc analysis:
-- Failures by error reason in the last 24 hours
SELECT error_reason, COUNT(*) as count
FROM dlq_events
WHERE created_at > NOW() - INTERVAL '24 hours'
GROUP BY error_reason
ORDER BY count DESC;
-- Failure rate by event type
SELECT event_type,
COUNT(*) FILTER (WHERE status = 'SUCCEEDED') as succeeded,
COUNT(*) FILTER (WHERE status = 'PENDING') as pending,
COUNT(*) FILTER (WHERE status = 'FAILED_PERMANENTLY') as failed
FROM dlq_events
GROUP BY event_type;
-- Events stuck in retry loop
SELECT *
FROM dlq_events
WHERE retry_count > 5
AND status = 'PENDING'
ORDER BY created_at;
Operational considerations
Table bloat
High-throughput DLQs can cause table bloat due to frequent updates. PostgreSQL’s MVCC model creates new row versions on each update, and these dead tuples accumulate until autovacuum cleans them up. Tune autovacuum by increasing autovacuum_vacuum_scale_factor for the DLQ table. Move succeeded and permanently-failed events to a history table periodically. Monitor bloat using tools like pgstattuple or pg_bloat_check to track table size over time.
Monitoring
Set up alerts for queue depth (pending events exceeding threshold), retry failures (events hitting max retry count), and processing time (events sitting in PROCESSING status too long, which may indicate crashed workers).
Cleanup
Implement a retention policy to prevent unbounded table growth:
-- Archive old succeeded events
INSERT INTO dlq_events_archive
SELECT * FROM dlq_events
WHERE status = 'SUCCEEDED'
AND updated_at < NOW() - INTERVAL '30 days';
DELETE FROM dlq_events
WHERE status = 'SUCCEEDED'
AND updated_at < NOW() - INTERVAL '30 days';
When to use a hosted solution
Building a PostgreSQL DLQ works well for many use cases, but consider a hosted job queue platform if you need sub-second latency and high throughput (thousands of jobs per second), complex workflow orchestration with dependencies, built-in observability dashboards and alerting, or managed infrastructure with guaranteed SLAs.
See our companion post on hosted job queue platforms for options like Hatchet, Inngest, Trigger.dev, and Temporal.
Wrapping up
A PostgreSQL-based DLQ provides visibility and control that’s difficult to achieve with message broker DLQs. You get SQL queryability, transactional consistency, and no additional infrastructure. The FOR UPDATE SKIP LOCKED pattern enables safe concurrent processing, and standard PostgreSQL tooling handles monitoring and maintenance.
For teams already running PostgreSQL, this approach reduces operational complexity while improving failure handling. Start simple, add indexes based on actual query patterns, and implement monitoring as your queue grows.