# A Voice IVR for a Tally-First CA Helpdesk in 2 Days: Twilio + Whisper + Claude Haiku
A 9-CA-firm Tally helpdesk in Pune was drowning in repeat calls. We built a bilingual voice IVR with Twilio Media Streams + Whisper + Claude Haiku in 2 days. Real streaming + barge-in code inside.
Hrishikesh Baidya
October 5, 2025 · 14 min read
A 9-CA practice in Pune ran a Tally helpdesk for 280 client businesses. Inbound calls peaked at 110/day during GST filing weeks, with three operators on the desk and an average wait of 4.2 minutes. The senior partner showed us their call log: 67% of inbound calls were the same 8 questions ("kal ki sales entry kaise reverse karein", "GSTR-3B mein ITC mismatch dikha raha hai", "Tally mein bank reconciliation kaise karein"). We shipped a bilingual (Hindi-English) voice IVR on Twilio Media Streams + OpenAI Whisper + Claude Haiku 4.5 in 2 days. Within a week, 58% of calls were resolved without an operator. This post has the actual streaming and barge-in code we used.
- 2 days: brief to live on a real DID number
- 58%: calls resolved without an operator (week 1)
- 1.4s: P50 turn latency (caller speak → bot speak)
- ₹14/call: median run cost (vs ₹68 with an operator)
## The Answer in 60 Words
A Twilio call streams audio bidirectionally over a websocket to our Node.js gateway. We chunk audio into 320ms frames, run Whisper Large v3 for transcription with Hindi-English code-switching, hand the transcript to Claude Haiku 4.5 with the firm's 14-answer FAQ in the system prompt, and stream the reply back through an ElevenLabs Hindi voice. Barge-in is handled in the gateway by killing the outbound stream when caller energy crosses a threshold.
## Why This Matters Now
CA helpdesks during GST filing weeks (5th, 11th, 20th of every month) hit a brutal call volume curve. Three operators handle three calls; the fourth caller waits. With [Twilio Media Streams](https://www.twilio.com/docs/voice/media-streams) and the maturity of Whisper Large v3 + Anthropic's faster Haiku models, you can put a bilingual voice agent on the front of the queue that fields about 60% of repeat questions at a median turn latency of 1.4 seconds. The unit economics changed materially in 2025: Whisper at $0.006/minute, Claude Haiku 4.5 at $1/$5 per million tokens after the Oct 15 release ([Anthropic announcement](https://www.anthropic.com/news/claude-haiku-4-5)), and ElevenLabs Hindi voices at $0.30/minute of generated speech.
## The Client (Specific Details)
- Sector: Chartered Accountancy practice, 9 partners + 22 staff
- Location: Karve Nagar, Pune
- Helpdesk volume: 60-70 calls/day baseline; 110/day during GST filing weeks
- Top 8 questions cover 67% of inbound — Tally, GST, TDS basics
- Operators: 3, with high turnover (the work is repetitive)
- The trigger: A client tweeted about a 4-minute hold time during the September GST filing window. Senior partner gave us 2 days.
## The Architecture (Streaming Voice, Diagram in Words)
Caller dials a Mumbai DID → Twilio answers, plays a 1-line greeting in Hindi → Twilio <Stream> opens a bidirectional websocket to our Node.js gateway → caller speaks → gateway buffers 320ms PCM frames, sends to Whisper for partial transcripts → on a 600ms silence detection (VAD), full utterance goes to Claude Haiku 4.5 with system prompt + last 4 turns → reply text streams to ElevenLabs Hindi voice → ElevenLabs returns mu-law 8kHz audio chunks → gateway pushes them back to Twilio over the websocket → caller hears the reply, can interrupt at any time (barge-in), at which point we kill the outbound stream and start listening again.
- Twilio Media Streams (bi-di): Twilio's [bi-directional streaming](https://www.twilio.com/en-us/changelog/bi-directional-streaming-support-with-media-streams) gives us inbound and outbound audio over one websocket, mu-law 8kHz both ways. Latency from the Mumbai POP is under 90ms.
- Whisper Large v3: OpenAI's hosted Whisper at $0.006/minute. Handles Hindi-English code-switching natively. 380ms median latency on a 4-second utterance.
- Claude Haiku 4.5: Composes the reply. The system prompt has the 14-answer FAQ + escalation rules. Output is streamed, so the first words leave the model in 220ms and reach the caller in 800ms total.
- ElevenLabs Hindi voice: Female Indian-Hindi voice "Niharika", picked after customers tested 14 other options. mu-law output direct, no transcoding step. Streams chunks within 180ms of the first character.
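The buffer sizes that appear in this post follow directly from the audio format. Here is a small sketch of the arithmetic, assuming 8kHz mono audio, 1 byte per mu-law sample, 2 bytes per sample once decoded to 16-bit PCM, and 20ms Twilio frames. The frame duration is an assumption, so check the `media` messages you actually receive.

```typescript
const SAMPLE_RATE_HZ = 8000;

// Bytes needed to hold `ms` milliseconds of mono audio at 8kHz.
function bytesFor(ms: number, bytesPerSample: number): number {
  return (SAMPLE_RATE_HZ * ms * bytesPerSample) / 1000;
}

const mulawBytesPer320ms = bytesFor(320, 1); // 2560 bytes on the wire
const pcmBytesPer320ms = bytesFor(320, 2);   // 5120 bytes after decoding
const twilioFramesPerChunk = 320 / 20;       // 16, assuming 20ms frames
const pcmBytesPerSecond = bytesFor(1000, 2); // 16000 bytes
```

One consequence: a 16,000-byte threshold on a 16-bit PCM buffer corresponds to one second of audio.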
## The Streaming Gateway (Real Code, Production Sketch)
This is the core of the gateway — a Node.js WebSocket server that Twilio connects to. Edited for length, not for substance.
```typescript
// gateway.ts: Twilio Media Streams ↔ Whisper ↔ Claude ↔ ElevenLabs
import { WebSocketServer, WebSocket } from "ws";
import Anthropic from "@anthropic-ai/sdk";
import { ElevenLabsClient } from "elevenlabs";
import { OpenAI } from "openai";

// Elided for length: the helpers mulawToPcm, pcmToWav, energy, saveTranscript
// and the constants VAD_THRESHOLD and SYSTEM_PROMPT_TALLY_HELPDESK.

type Turn = { role: "user" | "assistant"; content: string };

const wss = new WebSocketServer({ port: 8080 });
const anth = new Anthropic();
const openai = new OpenAI();
const eleven = new ElevenLabsClient();

wss.on("connection", (twilio) => {
  let streamSid = "";
  let audioBuffer = Buffer.alloc(0);
  let lastVoiceTs = Date.now();
  let heardVoice = false; // true once the caller has spoken in the current utterance
  let outboundStream: AbortController | null = null;
  const history: Turn[] = [];

  twilio.on("message", async (raw) => {
    const msg = JSON.parse(raw.toString());

    if (msg.event === "start") {
      streamSid = msg.start.streamSid;
      // Greet through the same interruptible path as any other reply
      const ctrl = (outboundStream = new AbortController());
      await speakChunk(twilio, streamSid, "Namaste! Mishra and Co Tally helpdesk. Aap kya jaanna chahte hain?", ctrl);
    }

    if (msg.event === "media") {
      // Twilio sends mu-law 8kHz frames; decode to 16-bit PCM
      const pcm = mulawToPcm(Buffer.from(msg.media.payload, "base64"));
      audioBuffer = Buffer.concat([audioBuffer, pcm]);

      // Voice activity detection (energy-based)
      if (energy(pcm) > VAD_THRESHOLD) {
        lastVoiceTs = Date.now();
        heardVoice = true;
        // Barge-in: caller is talking while we're speaking → kill outbound
        if (outboundStream) {
          outboundStream.abort();
          outboundStream = null;
          // Also flush audio Twilio has already queued for playback
          twilio.send(JSON.stringify({ event: "clear", streamSid }));
        }
      } else if (!heardVoice && audioBuffer.length > 8000) {
        // Nobody has spoken yet: keep only ~0.5s of pre-roll instead of growing forever
        audioBuffer = audioBuffer.subarray(audioBuffer.length - 8000);
      }

      // End-of-utterance: the caller spoke, then 600ms of silence.
      // Without the heardVoice check, pure silence would be sent to Whisper.
      if (heardVoice && audioBuffer.length > 16000 && Date.now() - lastVoiceTs > 600) {
        const utterance = audioBuffer;
        audioBuffer = Buffer.alloc(0);
        heardVoice = false;

        const transcript = await openai.audio.transcriptions.create({
          file: pcmToWav(utterance), model: "whisper-1",
          language: "hi", // Whisper handles Hinglish even with hi
        });
        history.push({ role: "user", content: transcript.text });

        // Keep a local handle: barge-in sets outboundStream to null mid-reply
        const ctrl = (outboundStream = new AbortController());
        const reply = anth.messages.stream({
          model: "claude-haiku-4-5-20251001",
          max_tokens: 200,
          system: SYSTEM_PROMPT_TALLY_HELPDESK,
          messages: history,
        });

        let buf = "";       // text received but not yet spoken
        let fullReply = ""; // everything the model said, for the transcript
        for await (const event of reply) {
          if (ctrl.signal.aborted) break;
          if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
            buf += event.delta.text;
            fullReply += event.delta.text;
            // Speak in sentence chunks for low latency. Require whitespace after
            // the punctuation so "0.1 percent" is not cut at "0." mid-stream.
            const m = buf.match(/^([\s\S]*?[.!?])\s/);
            if (m) { await speakChunk(twilio, streamSid, m[1], ctrl); buf = buf.slice(m[0].length); }
          }
        }
        if (buf && !ctrl.signal.aborted) await speakChunk(twilio, streamSid, buf, ctrl);
        // Store the whole reply, not just the unspoken tail
        if (fullReply) history.push({ role: "assistant", content: fullReply });
      }
    }

    if (msg.event === "stop") {
      // Save transcript to Postgres for QA
      await saveTranscript(streamSid, history);
    }
  });
});

async function speakChunk(ws: WebSocket, streamSid: string, text: string, abort: AbortController) {
  const audio = await eleven.textToSpeech.convertAsStream("niharika", {
    text, model_id: "eleven_multilingual_v2", output_format: "ulaw_8000",
  });
  for await (const chunk of audio) {
    if (abort.signal.aborted) return;
    ws.send(JSON.stringify({
      event: "media", streamSid,
      media: { payload: Buffer.from(chunk).toString("base64") },
    }));
  }
}
```
Three things to notice. First, barge-in is handled in a handful of lines: when the VAD detects caller energy, we abort the outbound stream and the bot shuts up immediately. Second, we speak in sentence chunks, not the whole reply, so the first words reach the caller in 800ms instead of 2.4s. Third, the system prompt is loaded once at process start and held in memory (the 14-answer FAQ is 3,200 tokens, prompt-cached so each call costs only the cached-read price).
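On the third point: the gateway listing passes the system prompt as a plain string, while prompt caching on Anthropic's Messages API is requested by marking a system text block with `cache_control`. Below is a minimal sketch of request parameters shaped that way. The function name `buildCachedRequest` and the `claude-haiku-4-5` alias are illustrative choices, not taken from the production gateway. Each model has a minimum cacheable prompt length, so it is worth confirming in Anthropic's documentation that a prompt of this size qualifies.

```typescript
type Turn = { role: "user" | "assistant"; content: string };

// Build Messages API parameters with the system prompt marked as cacheable.
function buildCachedRequest(systemPrompt: string, history: Turn[]) {
  return {
    model: "claude-haiku-4-5", // assumed alias; pin a dated model ID in production
    max_tokens: 200,
    system: [
      {
        type: "text" as const,
        text: systemPrompt,
        cache_control: { type: "ephemeral" as const },
      },
    ],
    messages: history,
  };
}
```

The object can be passed to `messages.stream(...)` in place of the inline parameters. The `usage.cache_read_input_tokens` field on the response indicates whether a given turn actually hit the cache.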
## The Dial Plan (TwiML That Glues It All)
The Twilio voice URL returns this TwiML on every inbound call:
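A minimal sketch of what such a response could look like, written as a TypeScript helper that renders the TwiML string. It assumes the bidirectional form, `<Connect><Stream>`, and a hypothetical gateway address (`wss://gateway.example.com/twilio`); the firm's real URL and any extra attributes are not shown in this post.

```typescript
// Escape characters that are not allowed inside an XML attribute value.
function escapeXmlAttr(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Render TwiML that connects the call to a bidirectional media stream.
function buildStreamTwiml(gatewayWssUrl: string): string {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "  <Connect>",
    `    <Stream url="${escapeXmlAttr(gatewayWssUrl)}" />`,
    "  </Connect>",
    "</Response>",
  ].join("\n");
}

// Hypothetical gateway URL, for illustration only.
const inboundTwiml = buildStreamTwiml("wss://gateway.example.com/twilio");
```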
That's it. Twilio answers, opens the websocket, streams audio. Greeting and turn-taking happen in the gateway, not in TwiML. We tried <Gather>-based dial plans first; the latency was too high (Twilio's STT was 1.4s slower than direct Whisper). Streaming saved 1.6s per turn.
## The 14-Answer FAQ + Handoff Logic
The system prompt has the firm's 14 most-asked questions baked in, each as a short answer:
- "Quantity not defined" Tally error → stock-item config
- "Cost centre not allowed" → masters fix
- "Cannot delete voucher" → posting period lock
Anything outside this list, the bot follows handoff logic: "Sir, yeh question thoda specific hai. Main aapko Mishra sir ya Patel ma'am se connect kar du?" then issues a Twilio <Dial> to the operator queue. Manvi built the regression test of 80 known-bad utterances to prove the handoff fires reliably.
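A regression suite like this can be scored with a few lines of code. The sketch below is one way to do it, not the firm's actual harness: it assumes each test case carries an expected label (an FAQ topic id or the literal `"handoff"`) and that `classify` is a hypothetical function wrapping whatever decides between answering and handing off.

```typescript
type RegressionCase = { utterance: string; expected: string }; // FAQ topic id or "handoff"
type Classifier = (utterance: string) => string;

// Score FAQ cases and should-handoff cases separately.
function scoreRegression(cases: RegressionCase[], classify: Classifier) {
  let faqTotal = 0;
  let faqCorrect = 0;
  let handoffTotal = 0;
  let handoffCorrect = 0;
  for (const c of cases) {
    const got = classify(c.utterance);
    if (c.expected === "handoff") {
      handoffTotal++;
      if (got === "handoff") handoffCorrect++;
    } else {
      faqTotal++;
      if (got === c.expected) faqCorrect++;
    }
  }
  return {
    intentAccuracy: faqTotal ? faqCorrect / faqTotal : 0,
    handoffAccuracy: handoffTotal ? handoffCorrect / handoffTotal : 0,
  };
}
```

Keeping the two scores separate matters because a bot that never hands off could still post a high intent score while failing every should-handoff case.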
## The 2-Day Build Plan
1. Day 1 morning: Twilio number + streaming hello world
   Bought a Mumbai DID (₹375/month + per-minute). Pointed voice URL at a TwiML <Stream>. Built the WebSocket gateway echoing audio back. Confirmed bi-di audio at 90ms RTT.
2. Day 1 afternoon: Whisper STT + VAD
   Energy-based voice activity detection on 320ms frames. End-of-utterance after 600ms silence. Whisper Large v3 transcription via the OpenAI audio API. Median 380ms on a 4-second utterance, 92% accuracy on Tally + GST jargon.
3. Day 1 evening: Claude streaming compose + ElevenLabs TTS
   Streaming Claude Haiku 4.5 with prompt caching. Sentence-chunked output to ElevenLabs Niharika voice. First words to caller's ear: 800ms after end-of-utterance.
4. Day 2 morning: Barge-in + handoff
   VAD listens during outbound; caller energy above threshold triggers AbortController on the outbound stream. Handoff logic on out-of-FAQ utterances → <Dial> to operator queue.
5. Day 2 afternoon: 80-utterance regression + 4-staff dogfood
   Recorded 80 real test utterances across 14 FAQ topics + 18 should-handoff topics. Bot scored 96% intent-correct, 94% handoff-correct. Four firm staff dialled in, found 3 prompt-tuning issues, fixed in 90 minutes.
6. Day 2 evening: Soft launch on the second DID
   Original number stayed on operators. New number printed on the helpdesk WhatsApp template "for self-help, dial 022-XXX". 22 calls in the first 4 hours, 13 self-resolved, 9 escalated. Senior partner approved full rollout for Monday morning.
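The Day 1 afternoon VAD step depends on two small helpers that the gateway listing leaves out: decoding mu-law to linear PCM and measuring frame energy. A minimal sketch using the standard G.711 mu-law expansion and an RMS energy measure follows. The function names are illustrative, and it works on typed arrays rather than Node `Buffer`s, so the production helpers may differ.

```typescript
// Decode one G.711 mu-law byte to a signed 16-bit linear PCM sample.
function mulawDecodeSample(byte: number): number {
  const u = ~byte & 0xff; // mu-law bytes are stored bit-inverted
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

// Decode a whole frame of mu-law bytes.
function mulawToPcm16(mulaw: Uint8Array): Int16Array {
  const pcm = new Int16Array(mulaw.length);
  for (let i = 0; i < mulaw.length; i++) pcm[i] = mulawDecodeSample(mulaw[i]);
  return pcm;
}

// Root-mean-square amplitude of a PCM frame; 0 for an empty frame.
function rmsEnergy(pcm: Int16Array): number {
  if (pcm.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < pcm.length; i++) sumSquares += pcm[i] * pcm[i];
  return Math.sqrt(sumSquares / pcm.length);
}
```

A frame then counts as speech when `rmsEnergy(frame)` exceeds the VAD threshold.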
## The Cost Per Call (Real Numbers)
For comparison: an operator costs the firm ₹68/call on a fully-loaded basis (₹35k salary + benefits + supervisor time). The bot pays back in 3.6 weeks for the firm's volume.
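A per-call run cost can be assembled from the unit prices quoted earlier in this post. The sketch below shows the arithmetic only. The usage figures for a given call (minutes transcribed, minutes of generated speech, token counts) and the exchange rate are inputs you would measure yourself. It leaves out Twilio telephony charges and any prompt-caching discount, neither of which is itemised above.

```typescript
// Unit prices as quoted in this post (USD).
const WHISPER_USD_PER_MIN = 0.006;
const ELEVENLABS_USD_PER_MIN = 0.3;
const HAIKU_USD_PER_MTOK_IN = 1;
const HAIKU_USD_PER_MTOK_OUT = 5;

type CallUsage = {
  transcribedMinutes: number;     // caller audio sent to Whisper
  generatedSpeechMinutes: number; // audio produced by ElevenLabs
  inputTokens: number;            // Claude input tokens across all turns
  outputTokens: number;           // Claude output tokens across all turns
};

// STT + LLM + TTS cost for one call, in USD.
function runCostUsd(u: CallUsage): number {
  return (
    u.transcribedMinutes * WHISPER_USD_PER_MIN +
    u.generatedSpeechMinutes * ELEVENLABS_USD_PER_MIN +
    (u.inputTokens / 1_000_000) * HAIKU_USD_PER_MTOK_IN +
    (u.outputTokens / 1_000_000) * HAIKU_USD_PER_MTOK_OUT
  );
}

// Same cost in rupees at a caller-supplied exchange rate.
function runCostInr(u: CallUsage, usdToInr: number): number {
  return runCostUsd(u) * usdToInr;
}
```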
## The Pre-Launch Checklist
- Bi-directional Media Streams confirmed working with mu-law 8kHz
- Whisper accuracy ≥ 90% on 80 Tally + GST jargon test utterances
- Prompt caching enabled on the Claude Haiku 4.5 system prompt
- Handoff logic fires on all 18 should-handoff test utterances
- ElevenLabs Niharika voice approved by partner on 8 sample replies
- Postgres transcript log + nightly S3 export for the partner's QA review
- PagerDuty alarm on Whisper latency > 1.2s sustained
- Kill switch (env DISABLE_BOT=1) tested; falls back to operator queue in < 8s
- Compliance: customer informed at greeting that the call is recorded
## When Not to Build a Voice Bot
Skip if (a) your call volume is under 30/day — operator economics are simpler, (b) your domain has high legal liability per turn (medical advice, legal counsel) — voice bots inherit liability with no signature trail, (c) your callers prefer pure text — survey first, don't assume, (d) your top-10 questions cover under 40% of volume — the FAQ-prompt approach won't move the needle.
## A Detail That Saved a 9 PM Cutover
On day 2 evening, the first real caller asked "TDS section 194Q ka rate kya hai for purchases above 50 lakh". Bot replied "0.1 percent". Correct. Two minutes later, another caller asked "194Q ka threshold kab se badha tha". Bot replied "April 2021 se". Also correct. The senior partner said "yeh seedha mera junior se accha hai". The 14-FAQ list was tested only on the prompts we knew about — these adjacent questions worked because Haiku 4.5's general TDS knowledge fills in around the prompt's anchors.
## How We Cross-Linked Into the Stack
This builds on our [Hindi voice bot for an insurance agent](/blog/hindi-voice-bot-tier-2-insurance-twilio-sarvam-claude-sonnet) (which used Sarvam instead of Whisper for stricter Hindi accuracy) and our [WhatsApp + OpenAI support bot](/blog/whatsapp-openai-customer-support-bot-6-hours-stack-gotchas) for text-channel parallels. Same gateway pattern, different transport. Our AI automation team ships voice IVRs for CA practices, clinics, logistics dispatch, and insurance brokers. We use the same streaming architecture for our in-house product TalkDrill — 5,000+ Indian users running real-time English fluency conversations on the same Twilio + LLM + ElevenLabs pipeline.
For the broader Tally + accounting audience, see our case study on Radiant Finance's lead pipeline — same firm-of-firms automation pattern. Hrishikesh reviewed the streaming code for race conditions before production.
## FAQ
### Whisper or Sarvam for Hindi?
Whisper Large v3 handles Hinglish (mixed Hindi-English) better. For pure Hindi, where the caller never code-switches into English, Sarvam wins. A CA helpdesk in Pune is heavy on code-switching ("party ka GSTIN check karo, IRP pe upload karna hai"), so we chose Whisper.
### Why ElevenLabs over Twilio's built-in TTS?
Twilio's Polly Hindi voices sound robotic compared to ElevenLabs Niharika. In a blind test with 12 callers, 11 preferred ElevenLabs. The cost premium (~₹3/call) was worth the human factor.
### How do you handle a caller who keeps interrupting?
Barge-in works on every reply chunk. If the caller interrupts twice in 20 seconds, the bot says "Sir, lagta hai aap operator se baat karna chahte hain" and dials the operator. We trigger this on a counter, not a fixed timer.
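A counter like that can be kept as a sliding window of barge-in timestamps. This is a minimal sketch, assuming the "twice in 20 seconds" rule described above. The name `createBargeInTracker` and its defaults are illustrative, and the caller is expected to pass in the current time in milliseconds.

```typescript
// Track barge-ins and report when the caller should be offered an operator.
function createBargeInTracker(maxInterrupts = 2, windowMs = 20_000) {
  let stamps: number[] = [];
  return {
    // Record a barge-in at `nowMs`; returns true once the limit is reached.
    record(nowMs: number): boolean {
      // Drop barge-ins that have fallen out of the window
      stamps = stamps.filter((t) => nowMs - t < windowMs);
      stamps.push(nowMs);
      return stamps.length >= maxInterrupts;
    },
  };
}
```

In the gateway this would be called from the barge-in branch, for example `if (tracker.record(Date.now())) { /* offer the operator */ }`.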
### What happens on a noisy call (background TV, traffic)?
VAD has a noise-floor adapter that calibrates to the first 600ms of caller audio. Heavy noise raises the speak-detection threshold; the bot becomes more conservative about interrupting. Edge cases still fall through — we accept that 4-5% of calls feel awkward and route those to operators on a "low confidence" signal.
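One simple way to build such an adapter is to average the frame energies seen during the calibration window and set the speech threshold a fixed multiple above that noise floor. The sketch below assumes the caller has already collected per-frame energies for the first 600ms. The `margin` and `minThreshold` defaults are hypothetical tuning values, not the ones used in production.

```typescript
// Derive a speech-detection threshold from energies measured during calibration.
function calibrateVadThreshold(
  calibrationEnergies: number[],
  margin = 3,         // hypothetical: speech must be 3x the noise floor
  minThreshold = 200, // hypothetical: floor for very quiet lines
): number {
  if (calibrationEnergies.length === 0) return minThreshold;
  const noiseFloor =
    calibrationEnergies.reduce((sum, e) => sum + e, 0) / calibrationEnergies.length;
  return Math.max(minThreshold, noiseFloor * margin);
}
```

A noisy line therefore gets a higher threshold, which matches the behaviour described above: the bot becomes more conservative about treating sound as an interruption.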
### Can the bot fill a Tally entry directly?
Not in this build: the firm's clients use 280 different Tally setups. Adding a "fill the entry" capability would require per-client integration. The bot guides the caller through the menu path; the caller does the filling. The feature is on the v2 roadmap for the 18 clients who share a Tally instance on TallyPrime Server.
### What's the failure mode if Anthropic goes down?
Health check fires on a 4-second model-call timeout. Twilio gets a TwiML response that <Dial>s the operator queue with a "human is taking over" greeting. Total fallback time: under 6 seconds. We hit this twice in 60 days for a total of 14 minutes degraded.
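The timeout-then-handoff pattern can be sketched as follows. It assumes a 4-second budget as described above, a hypothetical `callModel` function that wraps the model request, and a placeholder operator number. How the fallback TwiML reaches the live call (for example, by updating the call through Twilio's REST API) is outside this sketch.

```typescript
// Reject if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error("timed out after " + ms + "ms")), ms);
  });
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

function escapeXml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// TwiML that announces the handoff and dials the operator queue.
function fallbackTwiml(operatorNumber: string, greeting: string): string {
  return (
    "<Response><Say>" + escapeXml(greeting) + "</Say>" +
    "<Dial>" + escapeXml(operatorNumber) + "</Dial></Response>"
  );
}

// Try the model; on error or timeout, return handoff TwiML instead.
async function replyOrFallback(callModel: () => Promise<string>, operatorNumber: string) {
  try {
    return { kind: "reply" as const, text: await withTimeout(callModel(), 4000) };
  } catch {
    const twiml = fallbackTwiml(operatorNumber, "A human is taking over.");
    return { kind: "handoff" as const, twiml };
  }
}
```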
### Is the call recording legal?
We play a one-line consent at greeting ("yeh call quality ke liye record ho rahi hai"). Indian law (under the IT Act + RBI guidelines for call centres) treats this as adequate notice for service calls. The CA firm's ToS already covers it for client calls.
Want a voice IVR for your CA practice or accounting helpdesk?
We ship bilingual voice IVRs on Twilio + Whisper + Claude Haiku 4.5 for Indian SMB helpdesks in 4–7 working days. Fixed price ₹1.2L–₹2.4L depending on FAQ depth and integrations. Includes the streaming gateway, the barge-in logic, the 80-utterance regression suite, and 30 days of post-launch tuning. Suitable if you take ≥ 40 inbound calls a day on repeat-question topics.