A Hindi-First Voice Bot for a Tier-2 Insurance Agent: Twilio + Sarvam + Claude Sonnet 4.5
Vernacular voice AI for an Indore insurance agent — Twilio India + Sarvam Hindi TTS + Claude Sonnet 4.5. Real cost-per-call math and three latency hacks for Indian 4G.
Hrishikesh Baidya
November 17, 202513 min read
0%
An LIC + HDFC Life agent in Indore handles 180 inbound calls a day across 6 languages — most callers wanting policy status, premium reminders, or "kab kaata hai paisa." His two-person team was missing 40% of calls after 6pm. We built him a Hindi-first voice bot that picks up after 3 rings, handles policy lookup in Hindi/English/Marathi, and books a callback with the human agent if the call gets complex. Stack: Twilio India, Sarvam Bulbul-v2 TTS, faster-whisper STT, Claude Sonnet 4.5. Cost per call: ₹3.20 all-in. This post is the build, the latency budget, and three hacks that made it work on Indian 4G.
₹3.20
All-in cost per 90-second call (India)
1.1s
p50 voice round-trip on 4G
11
Indian languages Sarvam Bulbul handles
68%
Inbound calls fully resolved by bot
## TL;DR
We built a Hindi-first voice bot for a Tier-2 insurance agent in Indore using Twilio Programmable Voice (₹0.35/min inbound India), Sarvam Bulbul-v2 for Hindi TTS (₹15 per 10,000 chars), faster-whisper for STT, and Claude Sonnet 4.5 for the conversation brain. p50 voice latency is 1.1 seconds on 4G — slower than English-only stacks because Sarvam STT runs once on raw audio AND Claude responses are post-processed for proper Devanagari rendering before TTS. Cost per 90-second call: ₹3.20 all-in. Bot resolves 68% of inbound, transfers the rest.
## Why this matters now
Indian regulators (IRDAI for insurance, RBI for lending) have tightened call-recording and language-disclosure rules in 2025. An English-only IVR is no longer compliant for vernacular markets. Meanwhile, [Sarvam AI](https://www.sarvam.ai/apis/text-to-speech) released Bulbul-v2 in October 2025 with 25+ voices across 11 Indian languages — and importantly, sub-300ms first-byte streaming. That changed the math for Hindi voice bots: until Bulbul-v2, the only options were ElevenLabs (English-leaning, expensive in INR) or Google Cloud TTS (poor Indic accent). The vernacular voice-AI gap is now closeable for SMBs.
## The client (specific details)
- Sector: LIC + HDFC Life agent (POSP), insurance distribution
- Location: Indore, Madhya Pradesh
- Languages: Hindi (primary), English, Marathi (~12% of calls)
- Volume: 180 inbound calls/day, 60% after 5pm
- Pain: Missing 40% of after-hours calls; 2-person team; founder taking calls at 9pm
- Budget: ₹35,000 setup + ₹8,000/month run
- Trigger: IRDAI audit flagged inadequate Hindi service availability
## The stack (versions and prices, Nov 2025)
📞
Twilio Programmable Voice (India)
₹0.35/min inbound on India local number, ₹0.65/min outbound. Mumbai-region SIP for low-latency PSTN. Twilio handles DTMF, recording, transcription pass-through.
🎙️
faster-whisper medium (Hindi)
Self-hosted on an L4 GPU pod (Lambda Labs, ₹26/hour spot). Hindi WER ~12% on 8kHz telephony audio. Runs at ~140ms first-chunk latency.
🧠
Claude Sonnet 4.5 streaming
Strongest Hindi/Hinglish reasoning in our tests. ₹250 per 1M input, ₹1,250 per 1M output. We use the Anthropic SDK with prompt caching on the policy-lookup context.
🔊
Sarvam Bulbul-v2 (Hindi TTS)
₹15 per 10,000 characters. Voice "Manan" for the agent persona. Streams MP3 within ~250ms of first text. Native Devanagari handling — no romanization needed.
## The latency budget (where every ms goes)
We measured 1,200 calls on a Reliance Jio 4G handset in Indore. p50 voice-to-voice round trip: 1.1 seconds. Slower than the 800ms TalkDrill English stack because Hindi adds two delays — Devanagari token expansion in the LLM output, and Sarvam's first-chunk TTS being slower than ElevenLabs.
## 3 latency hacks that made this work on Indian 4G
### Hack #1 — Pre-warm a Devanagari TTS prefix
Problem: Sarvam's first-chunk latency is 250ms when you POST cold. For a back-and-forth conversation, that compounds.
Fix: Pre-warm the connection by streaming a 1-character "filler" (a comma) the moment Whisper finishes transcription, before Claude has even started generating. The TCP/TLS handshake completes during Claude's TTFT. When Claude's first sentence arrives, Sarvam streams it immediately. Saves ~120ms p50.
### Hack #2 — Run Whisper in "VAD-trigger streaming" mode, not full-utterance
Problem: Naive Whisper waits for the user to stop speaking, then transcribes the whole utterance. On a 6-second user message, that's a 6-second baseline before the bot can think.
Fix: Silero VAD chunks the audio at sentence boundaries (silence > 380ms) and streams each chunk to Whisper. Claude starts generating against partial transcripts. Risk: occasional re-generation when the second half of the utterance changes intent. We accept the cost — overall p50 drops by ~400ms.
### Hack #3 — Mumbai-region everything
Problem: Default Twilio SIP routes through Singapore for Indian numbers. Default Anthropic API hits us-east-1. Adds 200-300ms RTT each.
Fix: Twilio Mumbai SIP termination (request via support, free for Programmable Voice). Anthropic API has no India region as of Nov 2025, but you can co-locate your orchestrator in AWS Mumbai (ap-south-1) and use HTTP/2 connection pooling — keeps TCP open, saves the handshake on every turn. Net: ~180ms reduction.
Production gotcha: Sarvam Bulbul-v2 occasionally drops the last syllable of long Hindi sentences (>40 words). Symptom: caller hears "...karne ke liye" trail off mid-word. Fix: insert a soft pause (200ms silence) at the end of every TTS request. Sarvam team is aware; tracked on their changelog.
## Cost per call (real numbers)
A typical call lasts 90 seconds, with 4 user turns and 4 bot responses. Per-call breakdown for the Indore client, averaged over 4,200 calls in November 2025:
Compare with a human call-center desk — a Tier-2 city Hindi-speaking agent costs ₹22,000–₹28,000/month loaded. The bot is roughly 35% the cost of one agent and operates 24/7. Two human agents to handle the same 5,400 monthly call volume after-hours would cost ₹52,000+ loaded. Payback on the ₹35,000 setup: under 4 weeks.
## The build (step by step)
1
Day 1 — Twilio India number + Mumbai SIP
Buy an Indian local number on Twilio (₹110/month). KYC requires a copy of GSTIN + address proof. Approval takes 2-5 days. While waiting, request Mumbai SIP termination from Twilio support — free, takes ~24 hours. Configure a TwiML <Connect><Stream> webhook to point at your orchestrator.
2
Day 2 — Whisper on L4 + Silero VAD
Spin up a Lambda Labs L4 instance (₹26/hour spot). Install faster-whisper==1.0.3 with the medium model — large-v3 is more accurate but adds 60ms which we cannot afford. Wire Silero VAD to chunk the Twilio media stream at silence boundaries.
3
Day 3 — Claude Sonnet 4.5 with prefix caching
Put the full insurance domain context (policy types, premium calculation rules, common FAQs) in the prompt with cache_control: {"type": "ephemeral"}. Cache hit-rate in our production: 89%. Saves ~120ms per turn and ~70% of input token cost on follow-up turns.
4
Day 4 — Sarvam Bulbul integration
Use Sarvam's streaming TTS endpoint. Voice "Manan" for our agent persona (warm, professional, Indore-friendly accent). Pipe each Claude sentence chunk to Sarvam as it arrives — never wait for full LLM completion. Apply the pre-warm hack from above.
5
Day 5 — Policy lookup integration
Wrap the agent's existing CRM (a custom MS Access database, naturally) with a thin Postgres + REST API. The bot calls the API for policy status, premium dates, and claim status. We never let Claude generate policy data — only retrieve and read aloud.
6
Day 6 — Human handoff via warm transfer
When the bot's confidence drops below 0.55 OR the caller asks for "agent" / "manager" / "asli admi", Twilio dials the human agent's mobile, plays a brief context summary in English, then bridges the call. Human picks up with the conversation already understood.
7
Day 7 — IRDAI compliance recording + disclosure
Open every call with a Hindi disclosure: "Yeh call recording ke liye monitor ki ja rahi hai." Twilio call recordings stored in S3 ap-south-1 with 90-day retention (IRDAI minimum). Add a "Press 9 to opt out of recording" path — required by 2024 TRAI guidelines.
## The system prompt (Hindi-aware)
code
Aap [AGENT_NAME] ke insurance office ke virtual assistant ho. Aap LIC aur HDFC Life policies handle karte ho.
Niyam:
- User jis bhasha mein baat kar raha ho usi mein jawab do (Hindi, English, Marathi).
- Har jawab maximum 2 vakya. Phone par lambi baatein nahin.
- Policy details ke liye hamesha policy_lookup tool use karo. Khud se number mat banao.
- Agar premium ka calculation puchha jaye, exact amount tool se lo. Approximate kabhi mat batao.
- Agar user 'agent', 'manager' ya 'asli admi' bole — turant manav agent ko transfer karo.
- Claim ke liye hamesha manav agent ko transfer karo. Bot ko claims process nahin karne hain.
- Agar samajh nahin aata, "Maaf kijiye, main yeh nahin samajh saka — ek agent se baat karwata hoon" bolkar transfer karo.
Available tools:
- policy_lookup(policy_number)
- premium_calculator(plan_id, age, sum_assured)
- book_callback(name, phone, time_slot)
- transfer_to_human(reason)
Three details that matter. The "exact amount, approximate kabhi mat batao" instruction prevents Claude from making up premiums under pressure. The explicit Hindi keywords for handoff ("agent", "manager", "asli admi") cover real caller behaviour. Claims are always escalated — IRDAI requires human handling for claim intimation, and we wanted zero ambiguity.
## Pre-launch checklist (Indian voice bot)
Twilio India number with Mumbai SIP termination requested
IRDAI/TRAI Hindi recording disclosure on every call
Press-9 opt-out path for recording
Whisper Hindi WER tested on 50 real call samples — under 15%
Claude system prompt forbids inventing policy numbers, premiums, dates
Tool-call audit log written to Postgres (every retrieval logged)
Human handoff under 3 seconds with warm context summary
Call recordings in S3 ap-south-1, 90-day retention minimum
Sarvam pause-injection at end of every TTS response
Smoke test from 5 different mobile carriers before go-live
## When NOT to build a voice bot
Skip this if (a) your call volume is under 30/day — the operational overhead is not worth it, use a missed-call WhatsApp callback instead, (b) your sector is medical advice / legal counsel / mental health — the regulatory and liability exposure is too high, (c) your callers are mostly elderly users in deep rural areas — voice quality on 2G/edge networks degrades the bot below human-acceptable. We turned down a Bihar microfinance project for reason (c) — recommended a Hindi-trained human team instead.
## What we'd ship differently if starting today
Three changes if we re-built this in November 2025 from scratch.
Use Sarvam Saaras-v2 for STT instead of faster-whisper. Sarvam's STT model trained on Indian languages has noticeably lower WER on accented Hindi than Whisper medium. We tested it side-by-side in October 2025 — Saaras-v2 cut errors by ~30% on Marathi-accented Hindi callers. The latency is comparable.
Move orchestrator to AWS Mumbai (ap-south-1) instead of US-East. When we built this stack the orchestrator was on us-east-1 because Anthropic's lowest-latency endpoint was there. By Nov 2025 the network round-trip is the bigger factor; ap-south-1 saves 80-120ms even though the API call still crosses the Pacific.
Replace Twilio with Plivo or Exotel for India-only deployments. Twilio is the gold standard internationally, but Plivo and Exotel offer ~30% lower per-minute rates for India-only inbound and equivalent quality. The lock-in is similar. For a client with no international calling needs, the savings compound.
## Real outcomes — Indore client, 90-day window
- Missed-call rate before bot: 40% after 5pm
- Missed-call rate after bot: 4% (only when both bot AND human transfer fail)
- Bot resolution rate: 68% of inbound fully resolved
- Human-transfer rate: 22% (claims, complex underwriting, multi-policy queries)
- Caller hang-up before bot reply: 6% (mostly first-time callers surprised by AI)
- CSAT (post-call IVR survey): 4.1/5 on 1,400 surveyed calls
- Hindi callers vs English callers: 78% Hindi, 18% English, 4% Marathi
The agent has since added two more numbers to the same bot setup — one for his Bhopal branch and one for a partner in Ujjain. Same architecture, different policy databases.
## Why this matters for the broader work
We use this exact pipeline as the baseline for vernacular voice projects across our AI automation team. The lessons compound — the same Sarvam + Whisper + Claude tuning we did for this insurance agent applies to a Coimbatore D2C voice bot, a Surat textile-trader IVR, and a Pune NBFC's loan-recovery callback system. The latency hacks are the same. The Hindi disclosure templates are the same.
This work also feeds back into our in-house voice product TalkDrill — an English-fluency app for Indian adults with 5,000+ active users — which gave us the production-grade voice infrastructure we lean on for client builds. For the deeper architectural notes, see how TalkDrill hits 800ms voice round-trip latency.
Reddit threads worth bookmarking: [r/india](https://www.reddit.com/r/india/) for vernacular UX feedback, [r/MachineLearning](https://www.reddit.com/r/MachineLearning/) for STT quality benchmarks, and [Sarvam's Discord](https://www.sarvam.ai/) for the Bulbul-v2 issue tracker.
## FAQ
### Why Claude Sonnet 4.5 instead of GPT-4o for Hindi?
In our internal evals on 200 Hindi/Hinglish customer-service prompts, Claude Sonnet 4.5 produced more grammatically correct Devanagari output and made fewer "Hindi-but-actually-English-translated" responses. GPT-4o was 8% faster on TTFT but had a higher rate of code-switching mid-sentence. For a vernacular-first product, Claude wins. For English-only, GPT-4o-mini wins on cost.
### What does Sarvam Bulbul-v2 sound like compared to ElevenLabs?
Bulbul-v2's Hindi voices are noticeably more natural for Indian listeners — proper accent, correct pronunciation of common Hindi names, cleaner Devanagari handling. ElevenLabs Hindi sounds like an English speaker reading Hindi phonetically. For Hindi production work, Sarvam wins by a clear margin.
### Can you do this with on-device models?
For an outbound IVR, no — you're on the PSTN, so the audio always lands on a server. For a mobile-app voice bot (think a customer-support assistant inside an app), yes, with Phi-3 mini or Gemma 2B + on-device Whisper tiny. Quality is markedly worse but works offline.
### What's the cheapest version of this stack?
Replace Twilio with Plivo (~₹0.25/min inbound), use Whisper-tiny on a CPU instead of L4 (saves ₹19,000/month at low volume), and switch Claude Sonnet 4.5 for Sonnet 3.5 Haiku (₹62 per 1M input). End-to-end you can hit ₹2.10 per call. Quality drops noticeably on Whisper-tiny — we'd only do this for a strict-budget pilot.
### How does compliance work for IRDAI insurance bots?
IRDAI requires (a) clear disclosure that you're talking to a recording / automated system, (b) the bot cannot quote final premiums or process claims (must transfer to a licensed agent), (c) recordings retained for 90 days minimum, (d) audit logs for every policy lookup. We wrap all of this into the system prompt and tool-permission layer.
### What if the caller speaks a language Sarvam doesn't support?
Bulbul-v2 covers Hindi, Bengali, Tamil, Telugu, Kannada, Malayalam, Marathi, Gujarati, Punjabi, Odia, and Assamese. Anything else — a Konkani-only caller in Goa, for example — falls back to English. The bot detects the language from Whisper's language-detection output and switches the TTS voice accordingly. Unsupported = English fallback + human handoff offer.
### Can the bot do outbound calls (premium reminders)?
Yes, with explicit consent. We added an outbound flow for the Indore client where existing customers receive a reminder 7 days before premium due date. ₹0.65/min outbound on Twilio India, ~45 second average call. Cost per reminder: ₹0.85. Replacement for human-dialed reminders that previously cost ₹6-8 per call in agent time.
Want a Vernacular Voice Bot for Your Insurance, NBFC, or Lending Business?
We ship Hindi/Tamil/Marathi/Telugu voice bots for Indian SMBs in 10–14 working days. Full IRDAI/RBI/TRAI compliance, real-time human handoff, integrated with your existing CRM. Typical project: ₹35,000–₹95,000 fixed scope. Per-call run cost from ₹2.10. First call is technical — with the engineer who would lead your build.