Webhook Failure Recovery 2026: Patterns for B2B Digital-Goods APIs
Webhook delivery in B2B digital-goods integrations is a critical path: a missed order.delivered event means the gift-card code never reaches the customer and support is flooded with tickets. This article covers the production patterns we see at top FoxReload partners running 99.95%+ uptime.
1. Exponential backoff with jitter: the math
A naive fixed 30-second retry can overwhelm your receiver when deliveries resume after an incident on our side, because every delayed event is retried at the same moment (thundering herd). The correct formula:
function nextRetryDelay(attempt: number): number {
  const base = 30_000; // 30s
  const cap = 6 * 60 * 60 * 1000; // 6h
  const exp = Math.min(base * Math.pow(2, attempt), cap);
  // Symmetric jitter in [-30%, +30%] of the capped delay
  const jitter = (Math.random() * 2 - 1) * exp * 0.3;
  return exp + jitter;
}
FoxReload uses exactly this algorithm: attempts 1 to 8 spread across a 30s to 24h window. The 6-hour cap prevents a single webhook from monopolising a worker, and ±30% jitter smooths retry spikes across the fleet.
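To make the schedule concrete, the sketch below prints the nominal (jitter-free) delays produced by the formula with the 30s base and 6h cap from the text. The names `nominalDelayMs` and `schedule` are illustrative, and numbering attempts from 0 is an assumption; shift the range if your counter starts at 1.

```typescript
// Constants taken from the text; everything else here is illustrative.
const BASE_MS = 30_000;            // 30s
const CAP_MS = 6 * 60 * 60 * 1000; // 6h

// Capped exponential delay without jitter.
function nominalDelayMs(attempt: number): number {
  return Math.min(BASE_MS * Math.pow(2, attempt), CAP_MS);
}

// Assumption: attempts are numbered from 0.
function schedule(attempts: number): number[] {
  return Array.from({ length: attempts }, (_, i) => nominalDelayMs(i));
}

const delays = schedule(8);
const totalMs = delays.reduce((sum, d) => sum + d, 0);

console.log(delays.map((d) => `${d / 1000}s`).join(', '));
console.log(`total before jitter: ${totalMs / 1000}s`);
```

Under these assumptions the eight nominal delays sum to roughly 2.1 hours, and the 6-hour cap is first reached at attempt 10. A different base or attempt count changes both figures.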
2. Dead-letter queue (DLQ)
After 8 failed attempts, the event must go to a DLQ, not be lost. The production pattern is a dedicated queue with manual replay:
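// Express + BullMQ (ioredis client). req.rawBody is assumed to be populated
// by a body-parser verify hook; Express does not set it by default.
app.post('/webhook/foxreload', async (req, res) => {
  const sig = req.header('X-Foxreload-Signature');
  if (!sig || !verifyHmac(req.rawBody, sig, process.env.WEBHOOK_SECRET)) {
    return res.sendStatus(401);
  }
  const eventId = req.header('X-Foxreload-Event-Id');
  if (!eventId) return res.sendStatus(400);
  // SET ... NX returns null when the key already exists (duplicate delivery)
  const fresh = await redis.set(`evt:${eventId}`, '1', 'EX', 86400, 'NX');
  if (!fresh) return res.sendStatus(200); // already accepted
  try {
    await queue.add('process-event', req.body, {
      attempts: 8,
      backoff: { type: 'exponential', delay: 30000 },
      removeOnFail: false, // keep failed jobs for DLQ inspection
    });
  } catch (err) {
    // Enqueue failed: release the key so the sender's retry is not dropped
    await redis.del(`evt:${eventId}`);
    return res.sendStatus(503);
  }
  res.sendStatus(200);
});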
DLQ events are reviewed by an on-call engineer: either replayed via POST /v1/webhooks/{id}/replay or closed out manually in admin.
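The handler above calls `verifyHmac` without showing it. A minimal sketch follows, assuming the signature header carries a hex-encoded HMAC-SHA256 of the raw request body. That scheme is an assumption made for illustration, not a documented FoxReload format, so check the provider documentation before relying on it.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Hypothetical helper: assumes a hex-encoded HMAC-SHA256 over the raw body.
function verifyHmac(
  rawBody: Buffer | string,
  signature: string | undefined,
  secret: string | undefined,
): boolean {
  if (!signature || !secret) return false;
  const expected = createHmac('sha256', secret).update(rawBody).digest();
  const received = Buffer.from(signature, 'hex');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (received.length !== expected.length) return false;
  return timingSafeEqual(received, expected);
}

// Usage with a made-up secret and payload.
const secret = 'test-secret';
const body = '{"type":"order.delivered"}';
const goodSig = createHmac('sha256', secret).update(body).digest('hex');

console.log(verifyHmac(body, goodSig, secret));
console.log(verifyHmac(body, 'deadbeef', secret));
```

The constant-time comparison avoids leaking, through response timing, how many leading bytes of a forged signature were correct.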
3. Idempotency keys: non-negotiable
Webhook delivery is at-least-once, never exactly-once. Without idempotency, a single order.delivered event could decrement inventory twice. Use X-Foxreload-Event-Id as a natural idempotency key:
| Storage | Latency | TTL | Cost / 1M events |
|---|---|---|---|
| Redis SETNX | <2ms | 24h | $0.40 |
| Postgres UNIQUE index | 5–8ms | forever | $0.10 |
| DynamoDB ConditionExpression | 8–12ms | 24h | $1.25 |
For most partners Redis SETNX is optimal: cheap, fast, and the TTL covers the FoxReload retry window (24h).
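The claim-once semantics behind all three options can be sketched with an in-memory stand-in for `SET key value NX EX ttl`. The `IdempotencyStore` class, the `onOrderDelivered` handler, and the `inventory` counter are hypothetical names for illustration. A real deployment would use a shared store such as Redis so that every receiver instance sees the same keys.

```typescript
// In-memory stand-in for a set-if-absent-with-TTL operation.
class IdempotencyStore {
  private readonly expiresAt = new Map<string, number>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Returns true only for the first claim of an event ID within the TTL.
  claim(eventId: string, now: number = Date.now()): boolean {
    const expiry = this.expiresAt.get(eventId);
    if (expiry !== undefined && expiry > now) return false; // duplicate
    this.expiresAt.set(eventId, now + this.ttlMs);
    return true;
  }
}

// Hypothetical side effect guarded by the store.
let inventory = 10;
const store = new IdempotencyStore(24 * 60 * 60 * 1000); // 24h, as in the text

function onOrderDelivered(eventId: string): void {
  if (!store.claim(eventId)) return; // already handled, skip the side effect
  inventory -= 1;
}

onOrderDelivered('evt_123');
onOrderDelivered('evt_123'); // redelivery of the same event
console.log(inventory);
```

The second call is a no-op, so the inventory is decremented once even though the event arrived twice.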
4. Alerting on >1% loss
The metric you must monitor is a rolling 5-minute webhook delivery success rate. If it drops below 99%, that's an incident. Prometheus rule:
- alert: WebhookDeliveryDegraded
  expr: |
    (sum(rate(webhook_received_total[5m]))
      - sum(rate(webhook_failed_total[5m])))
    / sum(rate(webhook_received_total[5m])) < 0.99
  for: 2m
  labels: { severity: page }
The alert routes to PagerDuty or Opsgenie, and the on-call engineer inspects the DLQ and receiver logs. In 80% of cases the root cause is a recent receiver deploy with a regression, and a rollback resolves it in 5 minutes.
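The arithmetic in the rule's expression can be checked by hand with a small sketch. The function names are illustrative, and returning `null` for a window with no traffic is a design choice of this sketch so that an idle receiver is not reported as degraded.

```typescript
// Mirrors the rule's arithmetic: (received - failed) / received.
// Returns null when there is no traffic, where the ratio is undefined.
function successRatio(received: number, failed: number): number | null {
  if (received <= 0) return null;
  return (received - failed) / received;
}

// True when the ratio exists and falls below the threshold (99% by default).
function isDegraded(received: number, failed: number, threshold = 0.99): boolean {
  const ratio = successRatio(received, failed);
  return ratio !== null && ratio < threshold;
}

console.log(isDegraded(2000, 10)); // ratio 0.995
console.log(isDegraded(2000, 30)); // ratio 0.985
console.log(isDegraded(0, 0));     // no traffic in the window
```

With 2000 events in the window, 10 failures stay above the threshold while 30 failures cross it, so only the second case would count toward the 2-minute `for` condition.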
Full FoxReload webhook documentation, replay endpoints, and Prometheus metrics are available after onboarding: request API access.
