A Voice IVR for a Tally-First CA Helpdesk in 2 Days: Twilio + Whisper + Claude Haiku

A 9-CA practice in Pune ran a Tally helpdesk for 280 client businesses. Inbound calls peaked during GST filing weeks at 110/day, three operators on the desk, average wait time 4.2 minutes. The senior partner showed us their call log: 67% of inbound was the same 8 questions ("kal ki sales entry kaise reverse karein", "GSTR-3B mein ITC mismatch dikha raha hai", "Tally mein bank reconciliation kaise karein"). We shipped a bilingual (Hindi-English) voice IVR on Twilio Media Streams + OpenAI Whisper + Claude Haiku 4.5 in 2 days. Within a week, 58% of calls resolved without an operator. This post has the actual streaming and barge-in code we used.

- 2 days: brief to live on a real DID number
- 58%: calls resolved without operator (week 1)
- 1.4s: P50 turn latency (caller speak → bot speak)
- ₹14/call: median run cost (vs ₹68 with operator)

## The Answer in 60 Words

A Twilio call streams audio bidirectionally over a websocket to our Node.js gateway. We chunk audio into 320ms frames, run Whisper Large v3 for transcription with Hindi-English code-switching, hand the transcript to Claude Haiku 4.5 with the firm's 14-answer FAQ in the system prompt, and stream the reply back via ElevenLabs Hindi voice. Barge-in is handled in the gateway by killing the outbound stream when caller energy crosses a threshold.

## Why This Matters Now

CA helpdesks during GST filing weeks (5th, 11th, 20th of every month) hit a brutal call volume curve. Three operators handle three calls; the fourth caller waits. With [Twilio Media Streams](https://www.twilio.com/docs/voice/media-streams) and the maturity of Whisper Large v3 + Anthropic's faster Haiku models, you can put a bilingual voice agent on the front of the queue that fields 60% of repeat questions in 1.4 seconds. The unit economics changed materially in 2025: Whisper at $0.006/minute, Claude Haiku 4.5 at $1/$5 per million tokens after the Oct 15 release ([Anthropic announcement](https://www.anthropic.com/news/claude-haiku-4-5)), and ElevenLabs Hindi voices at $0.30/minute of generated speech.

## The Client (Specific Details)

- Sector: Chartered Accountancy practice, 9 partners + 22 staff
- Location: Karve Nagar, Pune
- Helpdesk volume: 60-70 calls/day baseline; 110/day during GST filing weeks
- Top 8 questions cover 67% of inbound: Tally, GST, TDS basics
- Operators: 3, with high turnover (the work is repetitive)
- The trigger: A client tweeted about a 4-minute hold time during the September GST filing window. Senior partner gave us 2 days.

## The Architecture (Streaming Voice, Diagram in Words)

Caller dials a Mumbai DID → Twilio answers, plays a 1-line greeting in Hindi → Twilio <Stream> opens a bidirectional websocket to our Node.js gateway → caller speaks → gateway buffers 320ms PCM frames, sends to Whisper for partial transcripts → on a 600ms silence detection (VAD), full utterance goes to Claude Haiku 4.5 with system prompt + last 4 turns → reply text streams to ElevenLabs Hindi voice → ElevenLabs returns mu-law 8kHz audio chunks → gateway pushes them back to Twilio over the websocket → caller hears the reply, can interrupt at any time (barge-in), at which point we kill the outbound stream and start listening again.
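
The inbound half of that flow reaches the gateway as JSON text frames on the websocket. A minimal sketch of how those frames might be typed and routed, assuming only the `start`, `media` and `stop` events used in this post (Twilio sends others too, such as `connected` and `mark`). The `TwilioEvent` type and `parseTwilioEvent` helper are illustrative names, not part of any SDK:

```typescript
// Illustrative types for the subset of Twilio Media Streams messages used here.
type TwilioEvent =
  | { event: "start"; start: { streamSid: string } }
  | { event: "media"; media: { payload: string } } // payload: base64 mu-law, 8kHz
  | { event: "stop" };

// Hypothetical helper: parse one websocket text frame. Returns null for
// frames the gateway does not handle (for example "connected" or "mark").
function parseTwilioEvent(raw: string): TwilioEvent | null {
  let msg: unknown;
  try {
    msg = JSON.parse(raw);
  } catch (_err) {
    return null; // not JSON: ignore rather than crash the call
  }
  if (typeof msg !== "object" || msg === null) return null;
  const m = msg as Record<string, any>;
  switch (m.event) {
    case "start":
      return typeof m.start?.streamSid === "string"
        ? { event: "start", start: { streamSid: m.start.streamSid } }
        : null;
    case "media":
      return typeof m.media?.payload === "string"
        ? { event: "media", media: { payload: m.media.payload } }
        : null;
    case "stop":
      return { event: "stop" };
    default:
      return null;
  }
}
```

Narrowing to a discriminated union up front means the rest of the handler can switch on `event` without re-checking field shapes on every 20ms frame.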

Twilio Media Streams (bi-di)

Twilio's [bi-directional streaming](https://www.twilio.com/en-us/changelog/bi-directional-streaming-support-with-media-streams) gives us in + out audio over one websocket. mu-law 8kHz both ways. Latency from Mumbai POP under 90ms.

Whisper Large v3

OpenAI's hosted Whisper at $0.006/minute. Handles Hindi-English code-switching natively. 380ms median latency on a 4-second utterance.

Claude Haiku 4.5

Composer call. System prompt has the 14-answer FAQ + escalation rules. Streaming output so first words leave the model in 220ms and reach the caller in 800ms total.

ElevenLabs Hindi voice

Female Indian-Hindi voice "Niharika", chosen after 14 other options were customer-tested. mu-law output direct, no transcoding step. Streams chunks within 180ms of first character.
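
Since both Twilio and ElevenLabs speak mu-law 8kHz, the only decoding the gateway needs is mu-law to linear PCM for the VAD and for Whisper. A minimal sketch of what the `mulawToPcm` and `energy` helpers could look like, assuming standard G.711 mu-law decoding and an RMS energy measure. The post does not show the production versions, and these use typed arrays rather than Node `Buffer`s:

```typescript
// Decode one G.711 mu-law byte to a signed 16-bit linear PCM sample.
function mulawDecodeSample(byte: number): number {
  const u = ~byte & 0xff; // mu-law bytes are stored bit-inverted
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  // 0x84 is the G.711 bias added at encode time
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

// Decode a whole frame. A Node Buffer is a Uint8Array, so it can be passed directly.
function mulawToPcm(mulaw: Uint8Array): Int16Array {
  return Int16Array.from(mulaw, (b) => mulawDecodeSample(b));
}

// Root-mean-square amplitude of a frame; 0 for an empty frame.
function energy(pcm: Int16Array): number {
  if (pcm.length === 0) return 0;
  const sumSquares = pcm.reduce((acc, s) => acc + s * s, 0);
  return Math.sqrt(sumSquares / pcm.length);
}
```

RMS is one common choice for an energy-based VAD; peak amplitude or mean absolute value would slot into the same place.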

## The Streaming Gateway (Real Code, Production Sketch)

This is the core of the gateway: a Node.js WebSocket server that Twilio connects to. Edited for length, not for substance.

// gateway.ts: Twilio Media Streams ↔ Whisper ↔ Claude ↔ ElevenLabs
  import { WebSocketServer, type WebSocket } from "ws";
  import Anthropic from "@anthropic-ai/sdk";
  import { ElevenLabsClient } from "elevenlabs";
  import { OpenAI } from "openai";
  
  // Elided for length: mulawToPcm, pcmToWav, energy, saveTranscript,
  // VAD_THRESHOLD and SYSTEM_PROMPT_TALLY_HELPDESK.
  
  type Turn = { role: "user" | "assistant"; content: string };
  
  const wss = new WebSocketServer({ port: 8080 });
  const anth = new Anthropic();
  const openai = new OpenAI();
  const eleven = new ElevenLabsClient();
  
  wss.on("connection", (twilio) => {
    let streamSid = "";
    let audioBuffer = Buffer.alloc(0);
    let lastVoiceTs = Date.now();
    let heardVoice = false; // true once the caller has spoken since the last flush
    let outboundStream: AbortController | null = null;
    const history: Turn[] = [];
  
    twilio.on("message", async (raw) => {
      const msg = JSON.parse(raw.toString());
  
      if (msg.event === "start") {
        streamSid = msg.start.streamSid;
        // Greet; the greeting is abortable so the caller can barge in on it too
        const greeting = (outboundStream = new AbortController());
        await speakChunk(twilio, streamSid, "Namaste! Mishra and Co Tally helpdesk. Aap kya jaanna chahte hain?", greeting);
      }
  
      if (msg.event === "media") {
        // Twilio sends mu-law 8kHz frames; decode quickly
        const pcm = mulawToPcm(Buffer.from(msg.media.payload, "base64"));
        audioBuffer = Buffer.concat([audioBuffer, pcm]);
  
        // Voice activity detection, energy-based
        if (energy(pcm) > VAD_THRESHOLD) {
          lastVoiceTs = Date.now();
          heardVoice = true;
          // Barge-in: caller is talking while we're speaking → kill outbound
          if (outboundStream) {
            outboundStream.abort();
            outboundStream = null;
            // Ask Twilio to drop audio it has already buffered for playback
            twilio.send(JSON.stringify({ event: "clear", streamSid }));
          }
        }
  
        // End-of-utterance: caller spoke, over 1s of 16-bit PCM buffered, then 600ms silence.
        // Without the heardVoice guard, plain silence would trigger a turn every second.
        if (heardVoice && audioBuffer.length > 16000 && Date.now() - lastVoiceTs > 600) {
          const utterance = audioBuffer;
          audioBuffer = Buffer.alloc(0);
          heardVoice = false;
  
          const transcript = await openai.audio.transcriptions.create({
            file: pcmToWav(utterance), model: "whisper-1",
            language: "hi", // Whisper handles Hinglish even with hi
          });
  
          history.push({ role: "user", content: transcript.text });
  
          const reply = anth.messages.stream({
            model: "claude-haiku-4-5-20251001",
            max_tokens: 200,
            // cache_control marks the FAQ prompt for prompt caching
            system: [{ type: "text", text: SYSTEM_PROMPT_TALLY_HELPDESK, cache_control: { type: "ephemeral" } }],
            messages: history,
          });
  
          // Keep a local handle: barge-in sets outboundStream to null mid-loop
          const turn = (outboundStream = new AbortController());
          let buf = "";
          let spoken = "";
          for await (const event of reply) {
            if (turn.signal.aborted) break;
            if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
              buf += event.delta.text;
              // Speak in sentence chunks for low latency
              const m = buf.match(/^(.*?[.!?])(\s|$)/);
              if (m) {
                await speakChunk(twilio, streamSid, m[1], turn);
                spoken += m[0];
                buf = buf.slice(m[0].length);
              }
            }
          }
          if (buf && !turn.signal.aborted) {
            await speakChunk(twilio, streamSid, buf, turn);
            spoken += buf;
          }
          // Record everything that was spoken, not just the trailing chunk
          if (spoken) history.push({ role: "assistant", content: spoken });
        }
      }
  
      if (msg.event === "stop") {
        // Save transcript to Postgres for QA
        await saveTranscript(streamSid, history);
      }
    });
  });
  
  async function speakChunk(ws: WebSocket, streamSid: string, text: string, abort: AbortController) {
    // "niharika" stands in for the ElevenLabs voice ID of the chosen voice
    const audio = await eleven.textToSpeech.convertAsStream("niharika", {
      text, model_id: "eleven_multilingual_v2", output_format: "ulaw_8000",
    });
    for await (const chunk of audio) {
      if (abort.signal.aborted) return;
      ws.send(JSON.stringify({
        event: "media", streamSid,
        media: { payload: Buffer.from(chunk).toString("base64") },
      }));
    }
  }
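
The sentence-chunking step inside that loop can also be pulled out into a pure function, which makes it easy to unit-test. This is a sketch under one deliberate assumption: unlike the inline regex, it treats only punctuation followed by whitespace as a boundary and never end-of-buffer, so a delta that happens to stop at "0." is held until the next delta shows whether "1 percent" follows. `takeSentences` is an illustrative name:

```typescript
// Split accumulated model output into complete sentences plus a remainder.
// The remainder stays in the caller's buffer until more text arrives or the
// stream ends, at which point the caller speaks it as the final chunk.
function takeSentences(buffer: string): { sentences: string[]; rest: string } {
  const sentences: string[] = [];
  let rest = buffer;
  // Shortest prefix ending in ., ! or ? that is followed by whitespace.
  const boundary = /^([\s\S]*?[.!?])\s+/;
  let m = boundary.exec(rest);
  while (m !== null) {
    sentences.push(m[1].trim());
    rest = rest.slice(m[0].length);
    m = boundary.exec(rest);
  }
  return { sentences, rest };
}
```

In the streaming loop this would be used as `buf += delta; const { sentences, rest } = takeSentences(buf); buf = rest;`, speaking each returned sentence in order.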

Three things to notice. First, barge-in takes only a few lines: when the VAD detects caller energy, we abort the outbound stream and the bot shuts up immediately. Second, we speak in sentence chunks, not the whole reply, so first words reach the caller in 800ms instead of 2.4s. Third, the system prompt is loaded once at process start and held in memory (the 14-answer FAQ is 3,200 tokens, prompt-cached so each call costs only the cached-read price).

## The Dial Plan (TwiML That Glues It All)

The Twilio voice URL returns this TwiML on every inbound call:

<Response>
    <Connect>
      <Stream url="wss://gateway.softechinfra.com/twilio" />
    </Connect>
  </Response>

That's it. Twilio answers, opens the websocket, streams audio. Greeting and turn-taking happen in the gateway, not in TwiML. We tried <Gather>-based dial plans first; the latency was too high (Twilio's STT was 1.4s slower than direct Whisper). Streaming saved 1.6s per turn.

## The 14-Answer FAQ + Handoff Logic

The system prompt has the firm's 14 most-asked questions baked in, each as a short answer:

- Tally mein sales entry kaise reverse karein → 6-step instruction
- GSTR-3B mein ITC mismatch dikha raha hai → 4-step IMS check
- Tally mein bank reconciliation → menu path
- TDS late filing penalty kya hai → table by section
- Invoice number reset → financial year config
- e-Way bill 24-hour rule → exception list
- HSN code lookup for textiles → top 12 codes
- Composition scheme threshold → ₹1.5 cr / ₹50 lakh rules
- RCM applicability → 5 common cases
- GSTIN format validation → checksum logic
- Tally Prime backup → daily / weekly path
- "Quantity not defined" Tally error → stock-item config
- "Cost centre not allowed" → masters fix
- "Cannot delete voucher" → posting period lock

For anything outside this list, the bot follows handoff logic: "Sir, yeh question thoda specific hai. Main aapko Mishra sir ya Patel ma'am se connect kar du?" It then issues a Twilio <Dial> to the operator queue. Manvi built the regression test of 80 known-bad utterances to prove the handoff fires reliably.

## The 2-Day Build Plan

Day 1 morning — Twilio number + streaming hello world

Bought a Mumbai DID (₹375/month + per-minute). Pointed voice URL at a TwiML <Stream>. Built the WebSocket gateway echoing audio back. Confirmed bi-di audio at 90ms RTT.

Day 1 afternoon — Whisper STT + VAD

Energy-based voice activity detection on 320ms frames. End-of-utterance after 600ms silence. Whisper Large v3 transcription via the OpenAI audio API. Median 380ms on a 4-second utterance, 92% accuracy on Tally + GST jargon.

Day 1 evening — Claude streaming compose + ElevenLabs TTS

Streaming Claude Haiku 4.5 with prompt caching. Sentence-chunked output to ElevenLabs Niharika voice. First words to caller's ear: 800ms after end-of-utterance.

Day 2 morning — Barge-in + handoff

VAD listens during outbound — caller energy above threshold triggers AbortController on the outbound stream. Handoff logic on out-of-FAQ utterances → <Dial> to operator queue.

Day 2 afternoon — 80-utterance regression + 4-staff dogfood

Recorded 80 real test utterances across 14 FAQ topics + 18 should-handoff topics. Bot scored 96% intent-correct, 94% handoff-correct. Four firm staff dialled in, found 3 prompt-tuning issues, fixed in 90 minutes.

Day 2 evening — Soft launch on the second DID

Original number stayed on operators. New number printed on the helpdesk WhatsApp template "for self-help, dial 022-XXX". 22 calls in the first 4 hours, 13 self-resolved, 9 escalated. Senior partner approved full rollout for Monday morning.
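
The Day 2 afternoon regression run boils down to comparing the bot's decision with a labelled expectation for each utterance. A small scoring sketch, assuming each outcome is labelled either with an FAQ topic id or with the string `"handoff"`. The `Outcome` shape, the `scoreRun` name, and the exact definitions of "intent-correct" and "handoff-correct" used here are assumptions, not the suite the team ran:

```typescript
// One regression result: what the bot should have done vs. what it did.
// Labels are either an FAQ topic id (e.g. "bank-reconciliation") or "handoff".
type Outcome = { expected: string; actual: string };

// Fraction of outcomes where the bot matched the label; an empty group scores 1.
function matchRate(outcomes: Outcome[]): number {
  if (outcomes.length === 0) return 1;
  const hits = outcomes.filter((o) => o.actual === o.expected).length;
  return hits / outcomes.length;
}

// intentCorrect: share of FAQ utterances mapped to the right topic.
// handoffCorrect: share of should-handoff utterances that were handed off.
function scoreRun(outcomes: Outcome[]): { intentCorrect: number; handoffCorrect: number } {
  const faq = outcomes.filter((o) => o.expected !== "handoff");
  const handoff = outcomes.filter((o) => o.expected === "handoff");
  return { intentCorrect: matchRate(faq), handoffCorrect: matchRate(handoff) };
}
```

Keeping the two rates separate matters because a bot that answers everything would score well on intent while failing every handoff case.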

## The Cost Per Call (Real Numbers)

For comparison: an operator costs the firm ₹68/call on a fully-loaded basis (₹35k salary + benefits + supervisor time). The bot pays back in 3.6 weeks for the firm's volume.

## The Pre-Launch Checklist

- Bi-directional Media Streams confirmed working with mu-law 8kHz
- Whisper accuracy ≥ 90% on 80 Tally + GST jargon test utterances
- Prompt caching enabled on the Claude Haiku 4.5 system prompt
- Barge-in tested with 12 mid-sentence interrupts: 100% killed outbound stream
- Handoff logic fires on all 18 should-handoff test utterances
- ElevenLabs Niharika voice approved by partner on 8 sample replies
- Postgres transcript log + nightly S3 export for the partner's QA review
- PagerDuty alarm on Whisper latency > 1.2s sustained
- Kill switch (env DISABLE_BOT=1) tested: falls back to operator queue in < 8s
- Compliance: customer informed at greeting that the call is recorded
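
The kill-switch item is the easiest one to make concrete. A hedged sketch of how the voice webhook might choose between the streaming TwiML and an operator fallback: the `inboundTwiml` helper, the operator number, and the spoken fallback line are placeholders, and a real deployment may dial a queue rather than a single number:

```typescript
// Escape a value for safe use inside XML text or a double-quoted attribute.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: build the TwiML returned for an inbound call.
function inboundTwiml(opts: {
  botDisabled: boolean;   // e.g. process.env.DISABLE_BOT === "1"
  streamUrl: string;      // wss:// URL of the gateway
  operatorNumber: string; // placeholder for the operator line
}): string {
  if (opts.botDisabled) {
    // Kill switch on: skip the gateway and ring the operators.
    return (
      "<Response>" +
      '<Say language="hi-IN">Aapko operator se connect kar rahe hain.</Say>' +
      `<Dial>${escapeXml(opts.operatorNumber)}</Dial>` +
      "</Response>"
    );
  }
  return (
    "<Response><Connect>" +
    `<Stream url="${escapeXml(opts.streamUrl)}" />` +
    "</Connect></Response>"
  );
}
```

Because the decision is made per webhook request, flipping the environment variable and restarting the webhook process takes effect on the next inbound call without touching the gateway.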

## When Not to Build a Voice Bot

Skip if (a) your call volume is under 30/day, where operator economics are simpler; (b) your domain has high legal liability per turn (medical advice, legal counsel), since voice bots inherit liability with no signature trail; (c) your callers prefer pure text (survey first, don't assume); or (d) your top-10 questions cover under 40% of volume, in which case the FAQ-prompt approach won't move the needle.

## A Detail That Saved a 9 PM Cutover

On day 2 evening, the first real caller asked "TDS section 194Q ka rate kya hai for purchases above 50 lakh". Bot replied "0.1 percent". Correct. Two minutes later, another caller asked "194Q ka threshold kab se badha tha". Bot replied "April 2021 se". Also correct. The senior partner said "yeh seedha mera junior se accha hai". The 14-FAQ list was tested only on the prompts we knew about; these adjacent questions worked because Haiku 4.5's general TDS knowledge fills in around the prompt's anchors.

## How We Cross-Linked Into the Stack

This builds on our [Hindi voice bot for an insurance agent](/blog/hindi-voice-bot-tier-2-insurance-twilio-sarvam-claude-sonnet) (which used Sarvam instead of Whisper for stricter Hindi accuracy) and our [WhatsApp + OpenAI support bot](/blog/whatsapp-openai-customer-support-bot-6-hours-stack-gotchas) for text-channel parallels. Same gateway pattern, different transport.

Our AI automation team ships voice IVRs for CA practices, clinics, logistics dispatch, and insurance brokers. We use the same streaming architecture for our in-house product TalkDrill, where 5,000+ Indian users run real-time English fluency conversations on the same Twilio + LLM + ElevenLabs pipeline. For the broader Tally + accounting audience, see our case study on Radiant Finance's lead pipeline, which follows the same firm-of-firms automation pattern. Hrishikesh reviewed the streaming code for race conditions before production.

## FAQ

### Whisper or Sarvam for Hindi?

Whisper Large v3 handles Hinglish (mixed Hindi-English) better. For pure-Devanagari Hindi where the caller never code-switches into English, Sarvam wins. The CA helpdesk in Pune is heavy code-switching ("party ka GSTIN check karo, IRP pe upload karna hai"), so Whisper.

### Why ElevenLabs over Twilio's built-in TTS?

Twilio's Polly Hindi voices sound robotic compared to ElevenLabs Niharika. We tested both blind with 12 callers; 11 preferred ElevenLabs. The cost premium (~₹3/call) was worth the human factor.

### How do you handle a caller who keeps interrupting?

Barge-in works on every reply chunk. If the caller interrupts twice in 20 seconds, the bot says "Sir, lagta hai aap operator se baat karna chahte hain" and dials the operator. We trigger this on a counter, not a fixed timer.

### What happens on a noisy call (background TV, traffic)?

VAD has a noise-floor adapter that calibrates to the first 600ms of caller audio. Heavy noise raises the speak-detection threshold; the bot becomes more conservative about interrupting. Edge cases still fall through: we accept that 4-5% of calls feel awkward and route those to operators on a "low confidence" signal.

### Can the bot fill a Tally entry directly?

Not in this build. The firm's clients use 280 different Tally setups, so adding a "fill the entry" capability would require per-client integration. The bot guides through the menu path; the caller does the filling. The feature is on the v2 roadmap for the 18 clients on a shared TallyPrime Server.

### What's the failure mode if Anthropic goes down?

Health check fires on a 4-second model-call timeout. Twilio gets a TwiML response that <Dial>s the operator queue with a "human is taking over" greeting. Total fallback time: under 6 seconds. We hit this twice in 60 days for a total of 14 minutes degraded.

### Is the call recording legal?

We play a one-line consent at greeting ("yeh call quality ke liye record ho rahi hai"). Indian law (under the IT Act + RBI guidelines for call centres) treats this as adequate notice for service calls. The CA firm's ToS already covers it for client calls.
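
The noise-floor adapter from the noisy-call answer can be sketched as a small stateful detector. This assumes the gateway feeds it one RMS energy value per audio frame. The `createNoiseFloorVad` name and the `baseThreshold` and `margin` parameters are illustrative, and the production calibration may differ:

```typescript
type VadOptions = {
  baseThreshold: number; // floor for the speech threshold on quiet lines (assumed)
  margin: number;        // multiple of the measured noise floor (assumed)
  calibrationMs: number; // how much leading audio to calibrate on, e.g. 600
};

// Hypothetical adaptive VAD: averages frame energy over the calibration
// window, then flags frames whose energy exceeds the adapted threshold.
function createNoiseFloorVad(opts: VadOptions) {
  let threshold = opts.baseThreshold;
  let calibratedMs = 0;
  let energySum = 0;
  let frames = 0;

  return {
    // Feed one frame's energy and duration; true means "caller is speaking".
    isSpeech(frameEnergy: number, frameMs: number): boolean {
      if (calibratedMs < opts.calibrationMs) {
        energySum += frameEnergy;
        frames += 1;
        calibratedMs += frameMs;
        if (calibratedMs >= opts.calibrationMs) {
          // Noisy line: raise the bar. Quiet line: keep the base threshold.
          threshold = Math.max(opts.baseThreshold, (energySum / frames) * opts.margin);
        }
        return false; // never report speech while still calibrating
      }
      return frameEnergy > threshold;
    },
    currentThreshold(): number {
      return threshold;
    },
  };
}
```

One detector per call replaces the fixed `VAD_THRESHOLD` constant: a line with a TV in the background ends up with a higher threshold, which is what makes the bot more conservative about treating noise as an interruption.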

Want a voice IVR for your CA practice or accounting helpdesk?

We ship bilingual voice IVRs on Twilio + Whisper + Claude Haiku 4.5 for Indian SMB helpdesks in 4–7 working days. Fixed price ₹1.2L–₹2.4L depending on FAQ depth and integrations. Includes the streaming gateway, the barge-in logic, the 80-utterance regression suite, and 30 days of post-launch tuning. Suitable if you take ≥ 40 inbound calls a day on repeat-question topics.

Tags: Voice IVR, Twilio, Whisper, Claude Haiku, Tally, CA, Streaming

Hrishikesh Baidya

CTO at Softechinfra specializing in Python, system architecture, and building secure, scalable software solutions.
