orders.create webhook hits a 60-line Express service. The service does two things: it verifies the HMAC signature using the Shopify shared secret, then pushes the payload onto a BullMQ queue on Redis. Total receiver-side latency is 18 ms median, and the service acknowledges Shopify within 200 ms regardless of whether the OMS is healthy.
```javascript
// Shopify webhook receiver: the entire production code
import express from 'express';
import { Queue } from 'bullmq';
import crypto from 'crypto';

const app = express();
// Keep the body as a raw Buffer: the HMAC must cover the exact bytes Shopify sent.
app.use(express.raw({ type: 'application/json' }));

const queue = new Queue('shopify-orders', {
  connection: { host: 'redis', port: 6379 }
});
const SECRET = process.env.SHOPIFY_WEBHOOK_SECRET;

app.post('/webhook/shopify-order', async (req, res) => {
  const sig = Buffer.from(req.get('X-Shopify-Hmac-Sha256') || '', 'base64');
  const expected = crypto.createHmac('sha256', SECRET)
    .update(req.body)
    .digest();
  // Constant-time comparison. timingSafeEqual throws on unequal lengths,
  // so the length is checked first.
  if (sig.length !== expected.length || !crypto.timingSafeEqual(sig, expected)) {
    return res.status(401).end();
  }
  // Failed jobs are retried up to 5 times, with exponential backoff from 2 s.
  await queue.add('process-order', JSON.parse(req.body.toString('utf8')), {
    attempts: 5,
    backoff: { type: 'exponential', delay: 2000 }
  });
  res.status(200).end();
});

app.listen(8080);
```
The OMS worker pool drains the queue at whatever rate it can sustain — typically 4-6 orders/sec, with bursts to 12. During BBD peak the queue depth hit 38k briefly; it drained inside 2 hours of the spike subsiding. No order was lost.
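The consuming side is not shown in this article. As a minimal sketch, assuming BullMQ's `Worker` class and a hypothetical `saveOrder` function standing in for the OMS write, a worker for this queue might look like the following. The concurrency value and every function name here are assumptions for illustration, not the production worker.

```javascript
// Hypothetical consumer for the 'shopify-orders' queue. saveOrder and the
// concurrency value are assumptions, not the production worker.

// Builds the job processor. saveOrder is injected so the logic can be
// exercised without Redis or a database.
function makeProcessor(saveOrder) {
  return async function processOrder(job) {
    const order = job.data;
    if (!order || order.id == null) {
      // A payload without an order ID cannot be stored; skip it rather
      // than retrying something that will never succeed.
      return { status: 'skipped' };
    }
    // If saveOrder throws, the job is marked failed and BullMQ retries it
    // using the attempts/backoff options set when it was enqueued.
    await saveOrder(order);
    return { status: 'stored', orderId: order.id };
  };
}

// Starts the worker. bullmq is loaded lazily so the processor above can be
// tested without the dependency installed.
async function startWorker(saveOrder) {
  const { Worker } = await import('bullmq');
  return new Worker('shopify-orders', makeProcessor(saveOrder), {
    connection: { host: 'redis', port: 6379 },
    concurrency: 4, // assumed; tune to what the database can sustain
  });
}
```

Because a retried job can run again after a partial failure, `saveOrder` should be idempotent, for example an upsert keyed on the Shopify order ID.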
## Piece 2: Read-replica POS Postgres
The OMS sits on a Hetzner CCX33 (8 vCPU, 32 GB RAM) running Postgres 16. The 2024 outage had a secondary cause — the founder's "live dashboard" was running a 6-second aggregation query every 30 seconds, which alone consumed 40% of CPU during peak. Order writes started timing out.
The fix: Postgres logical replication to a CCX23 (4 vCPU, 16 GB RAM) read replica. Every reporting query — Grafana, accounting exports, ops "where is order X" lookups — went to the replica. The primary did writes only. CPU on the primary peaked at 67% during BBD; in 2024 it had pinned at 100% for 4 hours.
Replication lag during peak: 800 ms-2 sec, well within the founder's tolerance for dashboard freshness. The replica also doubles as a warm standby for failover — we tested cutover to replica in the rehearsal week (took 41 seconds).
## Piece 3: The three dashboards (the ones the founder actually watches)
1. Razorpay payment-success rate drops below 90% for 5+ min
2. BullMQ queue depth exceeds 5,000
3. Redis memory utilisation exceeds 80%
4. Postgres primary CPU exceeds 85% for 3+ min
5. Replica lag exceeds 30 seconds
6. Order-create latency p95 exceeds 4 seconds
7. Inventory decrement deadlock detected in logs
8. Shopify webhook signature failures spike (potential forged requests or a secret mismatch)
9. Single-SKU inventory burns to zero (out-of-stock cascade)
10. Worker pool processing rate drops below 2/sec
11. Grafana dashboard fails to load (founder cannot watch)
12. Postgres primary failover required (rehearsed runbook)
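
The threshold-style triggers in the list above can be expressed as data and checked against a metrics sample. This is a rough sketch, not the alerting setup used on the project: the metric field names are hypothetical, and only the numeric thresholds come from the list.

```javascript
// Threshold rules taken from the list above. The metric field names are
// hypothetical; map them to whatever your metrics source exposes.
const RULES = [
  { name: 'queue-depth', breached: (m) => m.queueDepth > 5000 },
  { name: 'redis-memory', breached: (m) => m.redisMemoryPct > 80 },
  { name: 'replica-lag', breached: (m) => m.replicaLagSec > 30 },
  { name: 'order-latency-p95', breached: (m) => m.orderCreateP95Sec > 4 },
  { name: 'worker-rate', breached: (m) => m.workerRatePerSec < 2 },
];

// Returns the names of every rule the current sample breaches.
function evaluate(sample) {
  return RULES.filter((r) => r.breached(sample)).map((r) => r.name);
}

// For rules with a duration ("for 3+ min"): true only if the last
// `minutes` samples (one per minute, oldest first) all breach.
function sustained(samples, minutes, breached) {
  if (samples.length < minutes) return false;
  return samples.slice(-minutes).every(breached);
}
```

The duration-based triggers fit the second helper: `sustained(cpuSamples, 3, (pct) => pct > 85)` for primary CPU, or `sustained(successRates, 5, (pct) => pct < 90)` for the Razorpay success rate.
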
- Day 1 (3 Oct): 12k orders, smooth, queue depth never exceeded 2k
- Day 2: 9k, smooth
- Day 3: 14k, queue depth peaked at 8,400 around 21:00 IST, drained in 90 min
- Day 4: 13k, smooth, Razorpay had a 6-min wobble at 14:47 (recovered)
- Day 5: 11k, smooth, GST 2.0 invoice for one borderline-12% item flagged manually
- Day 6: 8k, post-peak day, smooth
- Day 7 (9 Oct): 4k, BBD ends, smooth
- Total: 71k orders over the 7 days listed. Zero outages. 1 manual intervention (GST flag).
wal_keep_size on the primary; consider a synchronous_commit=off setting if your tolerance permits (most retail clients do).
Symptom: "founder is watching the wrong number." Cause: dashboard has 7 panels and the founder watches the bottom one (which lags). Fix: cut to 3 panels max. Put the most actionable on top. We rebuilt our boards twice in week 2.
Symptom: "GST 2.0 invoice rejection on borderline-12% items." Cause: pre-22-Sep mapping table is hardcoded. Fix: move the mapping to a Postgres table or Google Sheet so the finance team can flag exceptions in real time. We did this on 18 September.
Symptom: "engineers panic during peak even though metrics are green." Cause: no formal end-of-shift handover. Fix: add a 5-line shift summary at end of every on-call shift, posted to a #peak-incidents Slack channel. Pattern recognition over a week becomes obvious.
## The mini case study — what changed for the team after Q4
After Diwali week ended on 5 November, the engineering team kept the queue tier and the read replica permanently. Steady-state infra cost rose by ₹4,200/month over the 2024 baseline. The founder said in our 17 November debrief that the cost was the cheapest insurance he had ever bought. The 24-page runbook is still in use; engineers add new incidents whenever a novel failure mode shows up.
For the smaller-scale variant of this pattern, see our Shopify + Tally daily-close flow — same Shopify backbone, different end-of-day reporting goal. For the build playbook on a fresh stack, see our Radiant Finance case — same architectural patterns applied to a financial services site.
## When NOT to invest in this
Skip this if (a) your peak is under 3x your baseline: vertical scaling and a few targeted indexes will get you through, and the queue tier is overkill; (b) your team is under 4 engineers: you cannot run an on-call rotation through 9 days of BBD week without burnout, so hire a partner instead; or (c) your founder does not watch dashboards during peak: the third piece (the live dashboards) is by far the most expensive in human terms and you may not get the value. We turned down two Q4 projects in 2025 for reason (a); gentle traffic increases do not justify the queue tier.
For the founder perspective on Q4 spend decisions, see Vivek's writeup on insurance-spend timing for Indian D2C founders. As Hrishikesh, our CTO, said in the postmortem: 4 weeks of paranoia in September is cheaper than 4 hours of outage in October.
## FAQ
### How long does it take to build this Q4 survival kit?
For us, 4 weeks with a 4-engineer team. The first week is audit + the queue tier. Weeks 2-3 are the replica + dashboards + load test. Week 4 is cutover + runbook + on-call setup. Faster than 4 weeks is not realistic if the audit reveals real architectural debt (which it usually does).
### What is the typical project cost for an Indian D2C aggregator at this scale?
₹5-8 lakh for a 14-brand setup like this client. Smaller setups (3-5 brands) come in at ₹2.5-4 lakh. The big swing factor is whether the existing OMS is on a shared host (cheap fix) or a custom stack (expensive audit + rewrite of write paths).
### Can the queue tier be skipped if the OMS is already on Kubernetes with HPA?
Sometimes. If the OMS is truly horizontally scalable and the bottleneck is upstream of the database, autoscaling solves the same problem. If the bottleneck is database write throughput (which it is in 80% of the OMS deployments we have audited), the queue tier still helps because it lets the database commit at its sustainable rate while the queue absorbs the spike.
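
To make the "queue absorbs the spike" arithmetic concrete, here is a small sketch that steps a queue forward minute by minute at a fixed drain rate. The traffic profile passed in is hypothetical, and the model ignores retries and burst capacity.

```javascript
// Simulates queue depth minute by minute. arrivalsPerMin is a hypothetical
// traffic profile; drainPerSec is the rate the workers (and therefore the
// database) can sustain.
function simulateBacklog(arrivalsPerMin, drainPerSec, initialDepth = 0) {
  const drainPerMin = drainPerSec * 60;
  let depth = initialDepth;
  let peak = initialDepth;
  for (const arrivals of arrivalsPerMin) {
    // Depth cannot go negative: an idle minute does not bank capacity.
    depth = Math.max(0, depth + arrivals - drainPerMin);
    peak = Math.max(peak, depth);
  }
  // Minutes needed to clear what is left, assuming no further arrivals.
  const minutesToDrain = Math.ceil(depth / drainPerMin);
  return { peak, remaining: depth, minutesToDrain };
}
```

For example, `simulateBacklog(Array(10).fill(900), 5)` models ten minutes at 15 orders/sec against a 5 orders/sec drain: the backlog reaches 6,000 and takes 20 minutes to clear. Using figures from this article, `simulateBacklog([], 6, 38000)` gives about 106 minutes to clear a 38,000-deep queue at 6 orders/sec, provided nothing new arrives.
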
### Why BullMQ specifically and not SQS or RabbitMQ?
BullMQ runs on Redis, which the client already had. No new dependency, no new ops. SQS adds AWS dependency for an India-hosted client. RabbitMQ is great but the ops team did not know it. We pick the queue technology the client can already operate.
### How did GST 2.0 affect the build?
We had to re-test invoice generation for every brand on the new 5%/18%/40% slabs. Two brands had borderline-12% items that moved to 18% — the founder personally approved the price-change communication to customers. The mapping table moved from hardcoded constants to a Postgres lookup so future slab changes are configuration, not code.
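
As an illustration of "configuration, not code", the sketch below resolves a GST rate from rows that could be loaded from a lookup table at runtime. The table, column, and category names are hypothetical, and the real schema is not shown in this article.

```javascript
// Hypothetical shape: overrideRows as they might come back from a Postgres
// lookup table (for example: SELECT sku, gst_rate FROM gst_rate_overrides).
// Table, column, and category names are illustrative only.
function buildRateLookup(overrideRows, categoryDefaults) {
  const overrides = new Map(overrideRows.map((r) => [r.sku, r.gst_rate]));
  return function rateFor(item) {
    // A per-SKU override, flagged by finance, wins over the category default.
    if (overrides.has(item.sku)) return overrides.get(item.sku);
    const rate = categoryDefaults[item.category];
    if (rate === undefined) {
      // Fail loudly rather than invoicing at a guessed rate.
      throw new Error(`No GST rate configured for ${item.sku}`);
    }
    return rate;
  };
}
```

Reloading the rows on a timer or on each invoice batch would let a finance-side change take effect without a deploy. Whether that refresh interval is acceptable depends on how quickly exceptions need to apply.
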
### What is the right team size for on-call during BBD week?
Two engineers per 8-hour shift, three shifts per day, rotating. Six engineers total for the week, plus the CTO on standby. We tried 4 engineers in 2024 and it was visibly thin — the second BBD outage happened during a shift handover gap.
### Can we monitor the dashboards from a phone?
Yes. Grafana's mobile view is acceptable. The founder used an iPad propped up at his desk. We considered building a custom mobile app but the build effort was not worth the gain — the founder already trusted Grafana from the desktop view.
Want this festive-season scaling kit on your stack?
We ship the order-spike survival kit — queue-buffered webhooks, read replica, runbook, on-call rotation through peak — in 4 weeks for ₹5-8 lakh depending on stack complexity. Suitable for any 5+ brand D2C aggregator on Shopify Plus or custom OMS doing 8k+ orders/week baseline.
