We Built an Order-Spike Survival Kit for a 14-Brand D2C Aggregator — A ₹6.8L Q4 Investment That Saved ₹40L in Lost Sales | Softechinfra Blog

A Mumbai-based D2C aggregator runs 14 brands across personal care, home fragrance, and pet food on a shared Shopify Plus + custom OMS stack. Q3 traffic ran ~12k orders/week. They projected Flipkart Big Billion Days (3-9 October 2025) plus Diwali week (29 October - 5 November) to push them to 60k orders/week peak — a 5x spike. Their stack had crashed during Onam week 2024. The founder green-lit a ₹6.8 lakh "survival kit" on 1 September with a hard deadline of 28 September. We shipped on 26 September. They processed 71k orders in BBD week alone. Estimated saved revenue: ₹40 lakh.

₹6.8L

Total project cost (4 weeks)

Order volume vs Q3 baseline

71k

Orders shipped in BBD week (3-9 Oct)

₹40L

Estimated saved revenue (vs 2024 outage)

## TL;DR: what the survival kit actually is

Three pieces:

1. A queue-buffered webhook tier. Shopify webhooks land in BullMQ on Redis instead of hitting the OMS directly, so a 5x spike does not knock the OMS over.
2. A read-replica Postgres for the POS reporting dashboard. Heavy queries no longer compete with order writes.
3. Three Grafana dashboards the founder watches in real time during peak: order velocity, payment-success rate, and inventory burn per SKU.

Total infra cost during peak: ₹38,000 for 9 days; reverted to ₹14,000/month afterward.

## Why this matters now: October 2025

Two retail signals pushed this off the backlog. Flipkart Big Billion Days 2025 ran 3-9 October, and D2C brands on the platform reported 4-8x normal volume. Amazon Great Indian Festival overlapped. GST 2.0 went live on 22 September, right before the festive window, so every brand had to re-test invoice generation under the new 5%/18%/40% slabs while the spike was happening. The combination of "5x volume" and "hot tax-rate change two weeks before" is the worst possible Q4 setup, and it was exactly this client's situation.

A recent r/IndiaBusiness thread from a 6-brand D2C aggregator shows the same pattern: Onam/Ganesh/Navratri/Diwali stacked, GST 2.0 cutover in the middle, OMS bottleneck on writes. The architecture they shipped (queue + replica + dashboards) is the same as ours.

## The architecture (3 pieces, in order of impact)

1. Queue-buffered webhooks. Shopify orders.create webhook → tiny Express receiver → BullMQ on Redis → OMS worker pool. Decouples the spike from OMS write throughput.
2. Read-replica POS. A Postgres logical replica handles all reporting reads. Founder's dashboard, ops queries, accounting exports: none touch the primary.
3. Three live dashboards. Grafana: order velocity (per minute), payment-success rate (rolling 5-min window), inventory burn per top SKU. Founder watches on a phone during peak.
4. Runbook + on-call rotation. A 24-page runbook with 12 named incidents from 2024. Two engineers on-call per shift, rotating across BBD week.

## Piece 1: The queue-buffered webhook tier

The 2024 Onam outage was diagnosed as OMS write contention. Shopify fired 28 webhooks/second at peak; the OMS could ingest 4/sec on its happy path. The queue hit 400k entries, then Shopify started rejecting our endpoint, then orders silently dropped. The fix is well-known but rarely shipped before it is needed: insert a queue between the source and the consumer.

The architecture we shipped: the Shopify orders.create webhook hits a 60-line Express service. The service does two things: it verifies the HMAC signature using the Shopify shared secret, then pushes the payload onto a BullMQ queue on Redis. Total receiver-side latency: 18 ms median. It acknowledges Shopify within 200 ms regardless of whether the OMS is healthy.

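```javascript
// Shopify webhook receiver: the entire production code
import express from 'express';
import { Queue } from 'bullmq';
import crypto from 'crypto';

const app = express();
// Keep the body as a raw Buffer: the HMAC must cover the exact bytes Shopify sent.
app.use(express.raw({ type: 'application/json' }));

const queue = new Queue('shopify-orders', {
  connection: { host: 'redis', port: 6379 }
});

const SECRET = process.env.SHOPIFY_WEBHOOK_SECRET;

app.post('/webhook/shopify-order', async (req, res) => {
  const sig = Buffer.from(req.get('X-Shopify-Hmac-Sha256') ?? '');
  const expected = Buffer.from(
    crypto.createHmac('sha256', SECRET).update(req.body).digest('base64')
  );
  // timingSafeEqual throws on unequal lengths, so compare lengths first.
  if (sig.length !== expected.length || !crypto.timingSafeEqual(sig, expected)) {
    return res.status(401).end();
  }

  try {
    // Retry a failed job up to 5 times, backing off exponentially from 2 s.
    await queue.add('process-order', JSON.parse(req.body.toString('utf8')), {
      attempts: 5,
      backoff: { type: 'exponential', delay: 2000 }
    });
    res.status(200).end();
  } catch (err) {
    // A non-2xx response lets Shopify retry the delivery instead of losing the order.
    res.status(500).end();
  }
});

app.listen(8080);
```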
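The mismatch that motivated the queue (28 webhooks/sec arriving, 4/sec ingested) can be put into numbers. What follows is a back-of-envelope sketch, not part of the shipped code; the function names are illustrative, and it assumes constant arrival and drain rates, which real traffic never has.

```javascript
// Back-of-envelope queue model. Function names are illustrative only.

// Backlog accumulated while arrivals outpace the consumer.
function backlogAfter(arrivalPerSec, drainPerSec, seconds) {
  const netGrowth = Math.max(arrivalPerSec - drainPerSec, 0);
  return netGrowth * seconds;
}

// Seconds needed to empty a backlog once arrivals fall below the drain rate.
function secondsToDrain(depth, drainPerSec, arrivalPerSec = 0) {
  const netDrain = drainPerSec - arrivalPerSec;
  if (netDrain <= 0) return Infinity; // the queue never empties at these rates
  return depth / netDrain;
}

// 2024 numbers from above: 28 webhooks/sec in, 4/sec ingested by the OMS.
const tenMinuteBacklog = backlogAfter(28, 4, 10 * 60);
console.log(tenMinuteBacklog); // 14400 orders waiting after ten minutes of peak

// Draining that backlog at 5/sec with no new arrivals:
console.log(secondsToDrain(tenMinuteBacklog, 5) / 60); // 48 (minutes)
```

The point of the model is that the buffer only has to hold the difference between the two rates for as long as the spike lasts. The consumer never needs to match the peak.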

The OMS worker pool drains the queue at whatever rate it can sustain: typically 4-6 orders/sec, with bursts to 12. During BBD peak the queue depth hit 38k briefly; it drained inside 2 hours of the spike subsiding. No order was lost.

## Piece 2: Read-replica POS Postgres

The OMS sits on a Hetzner CCX33 (8 vCPU, 32 GB RAM) running Postgres 16. The 2024 outage had a secondary cause: the founder's "live dashboard" was running a 6-second aggregation query every 30 seconds, which alone consumed 40% of CPU during peak. Order writes started timing out.

The fix: Postgres logical replication to a CCX23 (4 vCPU, 16 GB RAM) read replica. Every reporting query (Grafana, accounting exports, ops "where is order X" lookups) went to the replica. The primary did writes only. CPU on the primary peaked at 67% during BBD; in 2024 it had pinned at 100% for 4 hours.

Replication lag during peak: 800 ms to 2 sec, well within the founder's tolerance for dashboard freshness. The replica also doubles as a warm standby for failover; we tested cutover to the replica in the rehearsal week (it took 41 seconds).

## Piece 3: The three dashboards (the ones the founder actually watches)

Order velocity per minute: single line chart, last 60 minutes. A threshold line at 50/min triggers a colour change. The founder learned to read the slope, not the absolute number.

Payment-success rate (5-min rolling): rolling window of the last 300 sec. If it drops under 92% the dashboard turns red. We saw two dips during BBD; both were Razorpay-side and recovered in 8 minutes.

Inventory burn, top 12 SKUs: bar chart of remaining qty for the top 12 revenue SKUs. Bars below 100 turn red. The founder pulled inventory from a sister brand twice during BBD based on this view.

Queue depth (engineering view): engineers watch this; the founder does not. Shows BullMQ depth + processing rate. If depth rises and rate falls, scale workers.
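The payment-success panel is a rolling-window ratio with a red threshold at 92%. Here is a minimal sketch of that calculation in plain JavaScript. The class and its names are hypothetical, it is not how the Grafana panel itself is built, and it assumes attempts are recorded in time order.

```javascript
// Rolling payment-success rate over a fixed time window (300 s here).
class RollingSuccessRate {
  constructor(windowMs = 300_000) {
    this.windowMs = windowMs;
    this.events = []; // { at: epoch ms, ok: boolean }, oldest first
  }

  // Record one payment attempt; `at` defaults to now.
  record(ok, at = Date.now()) {
    this.events.push({ at, ok });
  }

  // Success rate in [0, 1] for attempts inside the window, or null if none.
  rate(now = Date.now()) {
    const cutoff = now - this.windowMs;
    // Drop attempts that have aged out of the window.
    while (this.events.length > 0 && this.events[0].at <= cutoff) {
      this.events.shift();
    }
    if (this.events.length === 0) return null;
    const succeeded = this.events.filter((e) => e.ok).length;
    return succeeded / this.events.length;
  }
}

const RED_BELOW = 0.92; // threshold from the dashboard description

const tracker = new RollingSuccessRate();
const t0 = 1_000_000;
for (let i = 0; i < 9; i++) tracker.record(true, t0 + i * 1000);
tracker.record(false, t0 + 9000);

const current = tracker.rate(t0 + 10_000); // 9 of 10 attempts succeeded
console.log(current, current < RED_BELOW ? 'RED' : 'ok'); // 0.9 RED
```

Returning null for an empty window matters in practice: "no payments in the last five minutes" is a different signal from "0% success" and should not paint the panel red on its own.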

## The cost: actual numbers from BBD week

Total during peak: roughly ₹21,200 over 9 days. Reverted to ₹14,000/month steady-state by 15 October (replica was kept; worker pool downsized; Redis returned to single-node). The ₹6.8 lakh project cost was 4 engineers × 4 weeks of build + on-call retainer through Diwali.

## The 4-week build plan (what we actually shipped)

Week 1 (1-7 Sep): audit + queue-buffered webhook tier

Re-read 2024 Onam postmortem. Built the BullMQ receiver. Switched 1 of 14 brands to it as a canary. Watched for 48h.

Week 2 (8-14 Sep): read-replica + dashboard prototypes

Spun up Postgres logical replica. Migrated 3 dashboard queries off primary. Built Grafana boards. Founder reviewed and demanded simpler views — we cut 4 of 7 panels.

Week 3 (15-21 Sep): GST 2.0 cutover + load test

Re-mapped invoice GST slabs for the 22 Sep changeover. Ran a synthetic load test simulating 5x peak — found a deadlock in inventory decrement, fixed.

Week 4 (22-26 Sep): cutover all 14 brands + runbook

Migrated remaining 13 brands to the queue tier. Wrote 24-page runbook with 12 named incidents. Set on-call rotation. Shipped 2 days early.
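The week 3 GST re-mapping boils down to an effective-dated lookup: the rate for an item depends on the invoice date, not on a constant in code. A minimal sketch of that idea follows, with in-memory rows standing in for the Postgres table. The category names, the row dates before 22 September, and both function names are made up for illustration; only the 22 September 2025 changeover and the 12% to 18% move come from this project.

```javascript
// Effective-dated GST rate lookup. In-memory rows stand in for the Postgres
// table; category names and rows are hypothetical.
const gstRates = [
  { category: 'sample-item-a', ratePct: 12, effectiveFrom: '2024-04-01' },
  { category: 'sample-item-a', ratePct: 18, effectiveFrom: '2025-09-22' },
  { category: 'sample-item-b', ratePct: 5, effectiveFrom: '2025-09-22' },
];

// Return the rate in force for a category on an invoice date (ISO yyyy-mm-dd).
// ISO date strings sort chronologically, so plain string comparison is enough.
function gstRateFor(category, invoiceDate, rows = gstRates) {
  const applicable = rows
    .filter((r) => r.category === category && r.effectiveFrom <= invoiceDate)
    .sort((a, b) => (a.effectiveFrom < b.effectiveFrom ? 1 : -1)); // newest first
  if (applicable.length === 0) {
    throw new Error(`No GST rate for ${category} on ${invoiceDate}`);
  }
  return applicable[0].ratePct;
}

// Split a tax-inclusive price into taxable value and GST, rounded to paise.
function splitInclusivePrice(inclusive, ratePct) {
  const taxable = Math.round((inclusive / (1 + ratePct / 100)) * 100) / 100;
  const gst = Math.round((inclusive - taxable) * 100) / 100;
  return { taxable, gst };
}

console.log(gstRateFor('sample-item-a', '2025-09-21')); // 12
console.log(gstRateFor('sample-item-a', '2025-09-22')); // 18
console.log(splitInclusivePrice(590, 18)); // { taxable: 500, gst: 90 }
```

Throwing on an unmapped category is deliberate in this sketch: an invoice with a guessed rate is worse than one that gets flagged for manual review.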

## The runbook — 12 named incidents (the ones we wrote responses for ahead of time)

1. Razorpay payment-success rate drops below 90% for 5+ min
2. BullMQ queue depth exceeds 5,000
3. Redis memory utilisation exceeds 80%
4. Postgres primary CPU exceeds 85% for 3+ min
5. Replica lag exceeds 30 seconds
6. Order-create latency p95 exceeds 4 seconds
7. Inventory decrement deadlock detected in logs
8. Shopify webhook signature failures spike (potential replay attack)
9. Single-SKU inventory burns to zero (out-of-stock cascade)
10. Worker pool processing rate drops below 2/sec
11. Grafana dashboard fails to load (founder cannot watch)
12. Postgres primary failover required (rehearsed runbook)

Each incident has a named first-responder, an escalation path, and 2-4 commands to run. The runbook lives in a Notion page; on-call engineers read it before their shift.

The single most useful runbook entry was #11 — "Founder cannot see the dashboard." The dashboard is the founder's sense-of-control tool during peak. If Grafana dies, the founder calls the CTO and panics. We wrote a 4-step rebuild and put it on page 1 of the runbook. We never used it. But the founder told us it was the entry he checked first — knowing the answer existed was the point.
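Several of the 12 incidents are plain thresholds, so they can be written down as data and checked against a metrics snapshot. This is a sketch of that shape, not the alerting setup we ran: the metric field names are hypothetical, and the duration-based incidents (#1 and #4) are left out because they need a time window rather than a single reading.

```javascript
// Threshold incidents from the runbook expressed as data. Metric field names
// are hypothetical; incident numbers match the list above.
const rules = [
  { incident: 2, name: 'BullMQ queue depth', breached: (m) => m.queueDepth > 5000 },
  { incident: 3, name: 'Redis memory utilisation', breached: (m) => m.redisMemoryPct > 80 },
  { incident: 5, name: 'Replica lag', breached: (m) => m.replicaLagSec > 30 },
  { incident: 6, name: 'Order-create latency p95', breached: (m) => m.orderCreateP95Sec > 4 },
  { incident: 10, name: 'Worker processing rate', breached: (m) => m.workerRatePerSec < 2 },
];

// Return the incident numbers whose threshold the snapshot crosses.
function breachedIncidents(metrics) {
  return rules.filter((r) => r.breached(metrics)).map((r) => r.incident);
}

// A queue depth of 8,400 with otherwise healthy values; the other numbers are invented.
const snapshot = {
  queueDepth: 8400,
  redisMemoryPct: 61,
  replicaLagSec: 1.4,
  orderCreateP95Sec: 2.2,
  workerRatePerSec: 5,
};
console.log(breachedIncidents(snapshot)); // [ 2 ]
```

Keeping the incident number in each rule means an alert can point the on-call engineer straight at the matching runbook entry.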

## What actually happened during BBD week 3-9 October

Day 1 (3 Oct): 12k orders, smooth, queue depth never exceeded 2k
Day 2: 9k, smooth
Day 3: 14k, queue depth peaked at 8,400 around 21:00 IST, drained in 90 min
Day 4: 13k, smooth, Razorpay had a 6-min wobble at 14:47 (recovered)
Day 5: 11k, smooth, GST 2.0 invoice for one borderline-12% item flagged manually
Day 6: 8k, post-peak day, smooth
Day 7 (9 Oct): 4k, BBD ends, smooth
Total: 71k orders over 7 days. Zero outages. 1 manual intervention (GST flag).

Diwali week (29 Oct - 5 Nov) ran similarly: peak day 16k orders, no outage, two manual GST flags on borderline items.

## Common mistakes: symptoms first

Symptom: "queue is buffered but the OMS still falls over." Cause: workers ramp up and the OMS hits the same bottleneck inside the worker. Fix: rate-limit workers, not just receivers. Set a max-concurrent of 6 inside BullMQ; beyond that, payments race the inventory decrement and you deadlock.

Symptom: "replica lag spikes during peak and dashboards show stale numbers." Cause: long-running write transactions on the primary hold up the replication stream, because logical replication by default ships a transaction's changes only after it commits. Fix: keep write transactions short; increase wal_keep_size on the primary; consider synchronous_commit=off if you can tolerate losing the last few commits when the primary crashes (most retail clients can).

Symptom: "founder is watching the wrong number." Cause: the dashboard has 7 panels and the founder watches the bottom one (which lags). Fix: cut to 3 panels max. Put the most actionable on top. We rebuilt our boards twice in week 2.

Symptom: "GST 2.0 invoice rejection on borderline-12% items." Cause: the pre-22-Sep mapping table is hardcoded. Fix: move the mapping to a Postgres table or Google Sheet so the finance team can flag exceptions in real time. We did this on 18 September.

Symptom: "engineers panic during peak even though metrics are green." Cause: no formal end-of-shift handover. Fix: add a 5-line shift summary at the end of every on-call shift, posted to a #peak-incidents Slack channel. Over a week, the patterns become obvious.

## The mini case study: what changed for the team after Q4

After Diwali week ended on 5 November, the engineering team kept the queue tier and the read replica permanently. Steady-state infra cost rose by ₹4,200/month over the 2024 baseline. The founder said in our 17 November debrief that the cost was the cheapest insurance he had ever bought. The 24-page runbook is still in use; engineers add new incidents whenever a novel failure mode shows up.

For the smaller-scale variant of this pattern, see our Shopify + Tally daily-close flow: same Shopify backbone, different end-of-day reporting goal. For the build playbook on a fresh stack, see our Radiant Finance case: same architectural patterns applied to a financial services site.

## When NOT to invest in this

Skip this if:

(a) your peak is under 3x your baseline. Vertical scaling and a few targeted indexes will get you through; the queue tier is overkill.
(b) your team is under 4 engineers. You cannot run an on-call rotation through 9 days of BBD week without burnout; hire a partner instead.
(c) your founder does not watch dashboards during peak. The third piece (the live dashboards) is by far the most expensive in human terms and you may not get the value.

We turned down two Q4 projects in 2025 for reason (a): gentle traffic increases do not justify the queue tier.

For the founder perspective on Q4 spend decisions, see Vivek's writeup on insurance-spend timing for Indian D2C founders.

As Hrishikesh, our CTO, said in the postmortem: 4 weeks of paranoia in September is cheaper than 4 hours of outage in October.

## FAQ

### How long does it take to build this Q4 survival kit?

For us, 4 weeks with a 4-engineer team. The first week is audit + the queue tier. Weeks 2-3 are the replica + dashboards + load test. Week 4 is cutover + runbook + on-call setup. Faster than 4 weeks is not realistic if the audit reveals real architectural debt (which it usually does).

### What is the typical project cost for an Indian D2C aggregator at this scale?

₹5-8 lakh for a 14-brand setup like this client's. Smaller setups (3-5 brands) come in at ₹2.5-4 lakh. The big swing factor is whether the existing OMS is on a shared host (cheap fix) or a custom stack (expensive audit + rewrite of write paths).

### Can the queue tier be skipped if the OMS is already on Kubernetes with HPA?

Sometimes. If the OMS is truly horizontally scalable and the bottleneck is upstream of the database, autoscaling solves the same problem. If the bottleneck is database write throughput (which it is in 80% of the OMSs we have audited), the queue tier still helps because it lets the database commit at its sustainable rate while the queue absorbs the spike.

### Why BullMQ specifically and not SQS or RabbitMQ?

BullMQ runs on Redis, which the client already had. No new dependency, no new ops. SQS adds an AWS dependency for an India-hosted client. RabbitMQ is great but the ops team did not know it. We pick the queue technology the client can already operate.

### How did GST 2.0 affect the build?

We had to re-test invoice generation for every brand on the new 5%/18%/40% slabs. Two brands had borderline-12% items that moved to 18%; the founder personally approved the price-change communication to customers. The mapping table moved from hardcoded constants to a Postgres lookup so future slab changes are configuration, not code.

### What is the right team size for on-call during BBD week?

Two engineers per 8-hour shift, three shifts per day, rotating. Six engineers total for the week, plus the CTO on standby. We tried 4 engineers in 2024 and it was visibly thin; the second BBD outage happened during a shift handover gap.

### Can we monitor the dashboards from a phone?

Yes. Grafana's mobile view is acceptable. The founder used an iPad propped up at his desk. We considered building a custom mobile app but the build effort was not worth the gain, since the founder already trusted Grafana from the desktop view.

Want this festive-season scaling kit on your stack?

We ship the order-spike survival kit — queue-buffered webhooks, read replica, runbook, on-call rotation through peak — in 4 weeks for ₹5-8 lakh depending on stack complexity. Suitable for any 5+ brand D2C aggregator on Shopify Plus or custom OMS doing 8k+ orders/week baseline.

For the cost-of-doing-nothing perspective and a longer view on Q4 ops decisions, see our Razorpay Black Friday stack-check post. For the broader engineering services context behind these projects, see our AI automation team page.

Tags: D2C, Shopify, Festive, Diwali, Architecture, Scaling, GST, Postgres

Hrishikesh Baidya

CTO at Softechinfra specializing in Python, system architecture, and building secure, scalable software solutions.
