Build a Voice IVR for Your Tally Helpdesk: Twilio + Sarvam + Whisper in a Weekend


A Bangalore CA practice with 280 small-business clients was drowning in inbound calls about Tally support: license renewal, GST return queries, "data corrupt ho gaya" ("the data got corrupted"), password reset. Two staff members handled 90+ calls a day, and both were close to burnout. We built them a voice IVR over a weekend that routes by language (Hindi, Tamil, English), classifies intent, creates a Tally Helpdesk ticket, and escalates only the genuinely complex cases. Stack: Twilio India, Sarvam Bulbul-v2, faster-whisper, Claude Sonnet 4.5. This post is the end-to-end build with real streaming + barge-in code.

- 280: clients on the CA practice's books
- 73%: calls fully resolved by IVR (no human)
- ₹2.80: cost per 75-second IVR call
- 2 days: weekend build time

## TL;DR: what this post delivers

A voice IVR for an SMB helpdesk that handles trilingual routing (Hindi/Tamil/English), STT-based intent classification (no DTMF press-1-for-X menus), ticket creation in Tally Helpdesk via REST, and human warm-transfer for complex cases. Streaming TTS + barge-in (the caller can interrupt the bot). Built on Twilio Programmable Voice (₹0.35/min India inbound), Sarvam Bulbul-v2 for Hindi/Tamil TTS, faster-whisper medium for STT, and Claude Sonnet 4.5 for the brain. Weekend build, 73% resolution rate, ₹2.80 per call.

## Why a voice IVR (and not just WhatsApp)

Two reasons. First, the CA practice's clients are mostly small-business owners aged 35-65, and many prefer phone calls over typing on WhatsApp, especially when stressed about a Tally crash. Second, a voice IVR lets you start with intent classification before forcing the caller down a menu tree. In our testing on Indian SMB clients, the traditional "press 1 for renewal, press 2 for support" menu had a 32% abandonment rate; an open-ended "what brings you to us today?" with intent classification had a 91% answer rate.

## The full call flow

Step 1: Greeting + language detect

"Hello, [practice name] mein swagat hai. Aap kaisi madad chahte hain?" ("Hello, welcome to [practice name]. What kind of help would you like?") A bilingual opener. The first 3 seconds of the caller's reply tell us their preferred language.

Step 2: Intent classification

Whisper transcribes the caller's free-form statement. Claude classifies it into one of 9 intents: license_renewal, gst_query, data_corrupt, password_reset, install_help, e_invoice, training, transfer_to_human, other.

Step 3: Resolution OR ticket

The bot answers if it's a known FAQ (license renewal steps, password reset link). Otherwise it creates a Tally Helpdesk ticket with full context.

Step 4: Human transfer (when needed)

For "data corrupt" or any explicit "agent" request, the call is warm-transferred to the human staff with a brief context summary. Average transfer time: 8 seconds.
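The branching in Steps 3 and 4 can be sketched as a small routing function. This is a minimal illustration, not the production code: the `Action` enum and `route` name are hypothetical, the FAQ set is the four intents the build log later names as directly resolvable, and the transfer set follows Step 4.

```python
from enum import Enum


class Action(Enum):
    RESOLVE_FAQ = "resolve_faq"      # bot answers directly
    CREATE_TICKET = "create_ticket"  # Tally Helpdesk ticket with context
    WARM_TRANSFER = "warm_transfer"  # hand off to staff with a summary


# Intents the bot can answer on its own (per the build log)
FAQ_INTENTS = {"license_renewal", "password_reset", "install_help", "training"}
# Intents that always go to a human (per Step 4)
TRANSFER_INTENTS = {"data_corrupt", "transfer_to_human"}


def route(intent: str) -> Action:
    """Map a classified intent to the next step in the call flow."""
    if intent in TRANSFER_INTENTS:
        return Action.WARM_TRANSFER
    if intent in FAQ_INTENTS:
        return Action.RESOLVE_FAQ
    # gst_query, e_invoice, other, and anything unrecognised get a ticket
    return Action.CREATE_TICKET
```

Checking the transfer set first means an urgent intent can never be swallowed by an FAQ path if the two sets are ever edited to overlap.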

## The streaming + barge-in code (Python)

This is the part most IVR tutorials skip. Without barge-in, the caller can't interrupt the bot, and on a slow Indian-English-speaking bot, that gets frustrating fast. With streaming, the bot speaks as Claude generates, not after.

### Twilio TwiML for media streaming

xml
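A minimal sketch of TwiML that would feed the WebSocket handler in the next section, built in Python with the standard library so it can be returned from a voice webhook. The `wss://` URL is a placeholder and `build_stream_twiml` is a hypothetical helper. The `from` parameter name is inferred from the handler's `customParameters['from']` lookup, and the caller number is assumed to come from the `From` field Twilio posts to the webhook.

```python
import xml.etree.ElementTree as ET


def build_stream_twiml(ws_url: str, caller_number: str) -> str:
    """Return TwiML that connects the call to a bidirectional media stream."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    stream = ET.SubElement(connect, "Stream", url=ws_url)
    # Custom parameters arrive in the 'start' message under customParameters
    ET.SubElement(stream, "Parameter", name="from", value=caller_number)
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>' + body


if __name__ == "__main__":
    print(build_stream_twiml("wss://example.com/twilio-stream", "+919876543210"))
```

`<Connect><Stream>` is used here rather than `<Start><Stream>` because the bot has to send audio back to the caller, which needs the bidirectional form.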

### The Python WebSocket handler (FastAPI)

python

import asyncio
import base64

from fastapi import FastAPI, WebSocket
from voice_pipeline import VoicePipeline

app = FastAPI()


@app.websocket('/twilio-stream')
async def twilio_stream(ws: WebSocket):
    await ws.accept()
    pipeline = VoicePipeline()
    run_task = None  # keep a reference so the task is not garbage-collected

    try:
        async for msg in ws.iter_json():
            event = msg.get('event')
            if event == 'start':
                start = msg['start']
                pipeline.call_sid = start['callSid']
                # streamSid must be echoed on every outbound media message
                pipeline.stream_sid = start['streamSid']
                # Custom <Parameter> values from the TwiML; may be absent
                pipeline.from_number = start.get('customParameters', {}).get('from')
                run_task = asyncio.create_task(pipeline.run(ws))

            elif event == 'media':
                # Inbound audio chunk from caller (base64-encoded 8 kHz mu-law)
                audio_chunk = base64.b64decode(msg['media']['payload'])
                await pipeline.on_audio_in(audio_chunk)

            elif event == 'stop':
                break
    except Exception as e:
        print(f'Stream error: {e}')
    finally:
        # Runs on 'stop', on disconnect, and on error
        await pipeline.shutdown()

### The voice pipeline (with barge-in)

python

import asyncio
import base64
from typing import Optional

from anthropic import AsyncAnthropic
# Assumed to be project-local wrapper modules; they are not shown in this post
from sarvam_client import SarvamTTS
from whisper_client import StreamingWhisper

SENTENCE_END = '.!?।'  # includes the Devanagari danda for Hindi
# Step 4 of the call flow: these intents always go to a human
TRANSFER_INTENTS = {'transfer_to_human', 'data_corrupt'}


class VoicePipeline:
    # classify_intent is shown in the next section; generate_reply,
    # transfer, and shutdown are part of this class but not shown here.

    def __init__(self):
        self.whisper = StreamingWhisper(model='medium', languages=['hi', 'ta', 'en'])
        self.tts = SarvamTTS(voice='manan')
        self.claude = AsyncAnthropic()
        self.tts_task: Optional[asyncio.Task] = None
        self.user_speaking = False
        # Filled in by the WebSocket handler on Twilio's 'start' event
        self.call_sid: Optional[str] = None
        self.stream_sid: Optional[str] = None
        self.from_number: Optional[str] = None
        self.ws = None

    async def on_audio_in(self, audio_chunk: bytes):
        # VAD: is the user speaking?
        if self.whisper.is_speech(audio_chunk):
            if not self.user_speaking:
                self.user_speaking = True
                # BARGE-IN: cancel any in-flight TTS
                if self.tts_task and not self.tts_task.done():
                    self.tts_task.cancel()
                    # Also drop audio Twilio has already buffered for playback
                    if self.ws is not None:
                        await self.ws.send_json({'event': 'clear', 'streamSid': self.stream_sid})
                    print('Bot interrupted by caller')

        await self.whisper.feed(audio_chunk)

    async def run(self, ws):
        self.ws = ws
        # Greeting
        await self.speak(ws, 'Namaste, Tally Helpdesk mein swagat hai. Aapki kya samasya hai?')

        while True:
            # Wait for caller to finish speaking
            transcript = await self.whisper.get_completed_utterance()
            if not transcript:
                continue

            self.user_speaking = False
            detected_lang = self.whisper.last_detected_language

            # Classify intent + generate reply
            intent = await self.classify_intent(transcript, detected_lang)

            if intent in TRANSFER_INTENTS:
                await self.transfer(ws, summary=transcript)
                break

            reply_stream = await self.generate_reply(transcript, intent, detected_lang)

            # Stream TTS while Claude generates
            self.tts_task = asyncio.create_task(self.stream_tts(ws, reply_stream, detected_lang))
            try:
                await self.tts_task
            except asyncio.CancelledError:
                print('TTS cancelled by barge-in')

    async def stream_tts(self, ws, text_stream, language: str):
        buffer = ''
        async for chunk in text_stream:
            buffer += chunk
            # Flush up to the last sentence boundary for natural cadence;
            # any trailing partial sentence stays in the buffer
            cut = max(buffer.rfind(p) for p in SENTENCE_END) + 1
            if cut > 0:
                audio = await self.tts.synthesize(buffer[:cut], language=language)
                await self.send_audio(ws, audio)
                buffer = buffer[cut:]
        if buffer.strip():
            audio = await self.tts.synthesize(buffer, language=language)
            await self.send_audio(ws, audio)

    async def speak(self, ws, text: str, language: str = 'hi'):
        audio = await self.tts.synthesize(text, language=language)
        await self.send_audio(ws, audio)

    async def send_audio(self, ws, audio_bytes: bytes):
        # Twilio plays back base64-encoded 8 kHz mu-law audio, so
        # audio_bytes must already be in that format
        payload = base64.b64encode(audio_bytes).decode()
        await ws.send_json({
            'event': 'media',
            'streamSid': self.stream_sid,
            'media': {'payload': payload},
        })
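The cancellation pattern in `on_audio_in` can be tried in isolation, without Twilio or a TTS engine. The sketch below is a toy under stated assumptions: `speak_sentences` stands in for the TTS task, an `asyncio.Event` stands in for the first detected speech frame, and both names are hypothetical.

```python
import asyncio


async def speak_sentences(sentences, spoken, started):
    """Toy TTS task: 'speaks' one sentence per loop iteration."""
    for sentence in sentences:
        spoken.append(sentence)
        started.set()           # signal that the bot has begun talking
        await asyncio.sleep(0)  # yield point, standing in for synth + send


async def demo_barge_in():
    spoken = []
    started = asyncio.Event()
    tts_task = asyncio.create_task(
        speak_sentences(["Sentence one.", "Sentence two.", "Sentence three."], spoken, started)
    )
    # The caller starts talking as soon as the bot has begun its reply
    await started.wait()
    tts_task.cancel()
    try:
        await tts_task
    except asyncio.CancelledError:
        pass  # expected: the reply was interrupted
    return spoken, tts_task.cancelled()


if __name__ == "__main__":
    print(asyncio.run(demo_barge_in()))
```

Cancellation only takes effect at an `await`, so a TTS coroutine that does long synchronous work between awaits will feel slow to interrupt no matter how fast the VAD fires.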

Three details that matter.

- The barge-in cancellation happens AT THE FIRST detected speech-frame, not after the user finishes. That's what makes the bot feel polite ("oh, you wanted to say something").
- The TTS streaming flushes at sentence boundaries (using । for Hindi punctuation). Buffering until the full Claude response would add 1-2 seconds of perceived delay.
- The language detection runs continuously through Whisper, so a caller switching from Hindi to English mid-call gets the right TTS voice.

## The intent classifier (Claude prompt)

python

# The nine labels the prompt allows; anything else falls back to 'other'
VALID_INTENTS = {
    'license_renewal', 'gst_query', 'data_corrupt', 'password_reset',
    'install_help', 'e_invoice', 'training', 'transfer_to_human', 'other',
}

INTENT_PROMPT = """You are an intent classifier for a Tally Helpdesk voice bot.

The caller said: "{transcript}"
Caller's language: {language}

Classify into ONE of these intents:
- license_renewal: questions about Tally license renewal, expiry, activation
- gst_query: GST return preparation, GSTR filing, GST configuration in Tally
- data_corrupt: Tally data file corruption, "data not opening", recovery requests
- password_reset: forgotten password, lock/unlock issues
- install_help: installation problems, version upgrade, setup
- e_invoice: e-invoice generation, IRN issues, GSTN integration
- training: requests for Tally training, "kaise sikhayenge"
- transfer_to_human: explicit request for human, "agent", "manager", complex multi-issue
- other: doesn't fit above categories

Respond with ONLY the intent name. No explanation.
"""


# Method of VoicePipeline (shown on its own here for readability)
async def classify_intent(self, transcript: str, language: str) -> str:
    response = await self.claude.messages.create(
        model='claude-sonnet-4-5',
        max_tokens=20,
        messages=[{'role': 'user', 'content': INTENT_PROMPT.format(
            transcript=transcript, language=language,
        )}],
    )
    intent = response.content[0].text.strip()
    # Guard against anything outside the taxonomy
    return intent if intent in VALID_INTENTS else 'other'
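Even with "respond with ONLY the intent name", a reply can come back with different casing or a trailing full stop, and a strict membership check would then route it to `other`. A minimal sketch of a more forgiving check follows; `normalize_intent` is a hypothetical helper, and the label set is copied from the prompt above.

```python
import re

# The nine labels listed in the intent prompt
VALID_INTENTS = frozenset({
    "license_renewal", "gst_query", "data_corrupt", "password_reset",
    "install_help", "e_invoice", "training", "transfer_to_human", "other",
})


def normalize_intent(raw: str) -> str:
    """Coerce a model reply into one of VALID_INTENTS, else 'other'."""
    token = raw.strip().lower().replace("-", "_").replace(" ", "_")
    # Drop quotes, full stops, and any other stray characters
    token = re.sub(r"[^a-z_]", "", token)
    return token if token in VALID_INTENTS else "other"
```

Anything that is not an exact label after clean-up still falls back to `other`, so a chatty reply such as a full sentence is not guessed at.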

## The Tally Helpdesk integration

Tally Solutions exposes a REST API for ticket creation through their TallyPrime Server. We POST a JSON payload with caller name (looked up from phone number against the practice's CRM), intent, transcript, and detected language.

python

import os

import httpx


# Method of VoicePipeline (shown on its own here for readability)
async def create_tally_ticket(self, intent: str, transcript: str, language: str):
    # Lookup caller from CRM (None if the number is not on file)
    customer = await self.lookup_customer_by_phone(self.from_number)

    async with httpx.AsyncClient(timeout=10.0) as client:
        response = await client.post(
            'https://tally-helpdesk.softechinfra.com/api/tickets',
            headers={'Authorization': f'Bearer {os.environ["TALLY_TOKEN"]}'},
            json={
                'customer_id': customer['id'] if customer else None,
                'phone': self.from_number,
                'intent_category': intent,
                'description': transcript,
                'language': language,
                'source': 'voice_ivr',
                'priority': 'high' if intent == 'data_corrupt' else 'normal',
                'call_sid': self.call_sid,
            },
        )
        # Fail loudly on 4xx/5xx instead of parsing an error body as a ticket
        response.raise_for_status()
        ticket = response.json()
        return ticket['ticket_id']
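`lookup_customer_by_phone` is not shown in this post. A minimal sketch of one way it could work, assuming the CRM can be queried by an E.164 number: the in-memory `crm` dict and both function names are hypothetical, and the normaliser only handles 10-digit Indian national numbers.

```python
import re
from typing import Optional


def to_e164_india(raw: str) -> Optional[str]:
    """Normalise an Indian phone number to +91XXXXXXXXXX, or None if invalid."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 12 and digits.startswith("91"):
        digits = digits[2:]  # strip country code
    elif len(digits) == 11 and digits.startswith("0"):
        digits = digits[1:]  # strip trunk prefix
    if len(digits) != 10:
        return None
    return "+91" + digits


def lookup_customer_by_phone(crm: dict, raw_number: str) -> Optional[dict]:
    """Return the CRM record for a caller, or None if unknown."""
    key = to_e164_india(raw_number)
    return crm.get(key) if key else None
```

Normalising before the lookup matters because the same client may be stored as `98765 43210` in the CRM and arrive from the carrier as `+919876543210`.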

## The build (hour by hour)

Saturday morning (4 hrs) — Twilio + WebSocket scaffolding

Buy Twilio India local number (₹110/month, KYC takes 2-5 days — start mid-week). Configure TwiML for media streaming. Stand up FastAPI WebSocket handler on Hetzner CCX13. Test with a hardcoded "hello" reply.

Saturday afternoon (4 hrs) — Whisper + Sarvam pipeline

Spin up faster-whisper medium on a Lambda Labs L4 (₹26/hour spot). Wire Sarvam Bulbul-v2 for TTS. Test full STT → TTS round trip with a static sentence. Initial p50: 1,400ms — too slow.

Saturday evening (3 hrs) — Streaming + barge-in

Add streaming TTS (flush at sentence boundary). Add VAD-triggered barge-in (cancel TTS task on first detected user speech). p50 drops to 1,100ms; conversation feels natural.

Sunday morning (3 hrs) — Intent classifier + Tally integration

Wire Claude Sonnet 4.5 intent classification with the 9-intent taxonomy. Build the customer-lookup-by-phone against the practice's existing CRM. POST tickets to Tally Helpdesk API. Smoke-test with real-sounding queries.

Sunday afternoon (3 hrs) — FAQ resolution paths

For 4 of the 9 intents (license_renewal, password_reset, install_help, training), the bot can resolve directly with a scripted answer + SMS link. Wire each FAQ flow with a Claude-generated dynamic response.

Sunday evening (2 hrs) — Human warm-transfer + go-live

Twilio Dial verb to warm-transfer the call to staff mobile, with a brief Claude-generated context summary played to the staff before bridging. Run 8 end-to-end test calls from different phones. Ship at 8pm.
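The warm transfer can be sketched as two small TwiML documents: one that dials the staff mobile, and one that Twilio runs on the staff member's leg after they answer but before the call is bridged. This is a sketch under assumptions: the helper names and the whisper URL are placeholders, and getting the first document onto a call that is currently inside a media stream (for example by updating the live call through Twilio's REST API) is not shown.

```python
import xml.etree.ElementTree as ET


def build_transfer_twiml(staff_number: str, whisper_url: str) -> str:
    """TwiML for the caller's leg: dial staff, run whisper_url before bridging."""
    response = ET.Element("Response")
    dial = ET.SubElement(response, "Dial")
    # The url attribute on <Number> points at TwiML played to the called party
    number = ET.SubElement(dial, "Number", url=whisper_url)
    number.text = staff_number
    return ET.tostring(response, encoding="unicode")


def build_whisper_twiml(summary: str) -> str:
    """TwiML served at whisper_url: read the context summary to staff."""
    response = ET.Element("Response")
    say = ET.SubElement(response, "Say")
    say.text = summary
    return ET.tostring(response, encoding="unicode")
```

Building the XML with ElementTree rather than string formatting means a summary containing `&` or `<` is escaped instead of breaking the document.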

## Pre-launch checklist

- Twilio India number with Mumbai SIP termination requested (free, 24h)
- HTTPS + valid SSL on the WebSocket endpoint (mandatory for Twilio Streams)
- Whisper VAD threshold tuned to local accent (380-450ms silence works for India)
- TTS streaming flushes at sentence boundary, not full-response
- Barge-in cancels TTS task at first detected user speech
- Intent classifier prompt includes "transfer_to_human" as explicit option
- Customer lookup by phone before ticket creation
- Recording disclosure played at call start (TRAI compliance)
- Warm-transfer plays context summary to staff before bridging
- Smoke test from at least 5 carrier+device combinations before go-live
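The silence threshold in the checklist is easiest to reason about as a frame counter. The sketch below is an illustration rather than the pipeline's actual endpointing: the `Endpointer` class is hypothetical, it takes per-frame speech/non-speech decisions from whatever VAD is in use, and the 20ms frame size is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Endpointer:
    """Declares an utterance finished after enough consecutive silence."""
    silence_ms: int = 380  # threshold from the checklist
    frame_ms: int = 20     # assumed duration of one audio frame
    _in_speech: bool = field(default=False, init=False, repr=False)
    _silent_ms: int = field(default=0, init=False, repr=False)

    def push(self, is_speech: bool) -> bool:
        """Feed one VAD decision; return True when an utterance just ended."""
        if is_speech:
            self._in_speech = True
            self._silent_ms = 0
            return False
        if not self._in_speech:
            return False  # silence before the caller has said anything
        self._silent_ms += self.frame_ms
        if self._silent_ms >= self.silence_ms:
            self._in_speech = False
            self._silent_ms = 0
            return True
        return False


def ends_during_pause(threshold_ms: int, pause_ms: int, frame_ms: int = 20) -> bool:
    """Would a mid-sentence pause of pause_ms be mistaken for end of utterance?"""
    ep = Endpointer(silence_ms=threshold_ms, frame_ms=frame_ms)
    ep.push(True)  # caller starts speaking
    return any(ep.push(False) for _ in range(pause_ms // frame_ms))
```

With these numbers, a 300ms thinking pause ends the utterance at a 250ms threshold but not at 380ms, which is the "bot keeps cutting people off" symptom in miniature.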

## Common mistakes (and the fix)

- Symptom: "the bot keeps cutting people off." Cause: VAD threshold too aggressive. Fix: increase the silence threshold from 250ms to 380ms+ for Indian English/Hindi cadence.
- Symptom: "barge-in feels delayed." Cause: VAD running on whole 200ms chunks. Fix: switch to frame-level VAD that fires at the first 30ms of speech.
- Symptom: "Tamil callers get Hindi TTS." Cause: language detection running only at the start of the call. Fix: re-detect language on every utterance and switch TTS voice mid-call.
- Symptom: "tickets created without context." Cause: Claude classification fired before the full transcript was available. Fix: only classify after Whisper signals utterance complete (silence > 380ms).

## When NOT to build a voice IVR

Skip this if (a) call volume is under 30/day, since your existing staff can absorb it, (b) callers are mostly elderly users in deep rural areas, where voice quality on 2G/edge degrades the experience past usable, or (c) your business is in a regulated sector with strict call-handling protocols you can't translate to AI (insurance claims processing, mental health support, legal advice). For (c), an AI receptionist that just routes + creates a ticket is fine; AI doing the work is not.

## Real outcomes: Bangalore CA practice (60 days in)

- Inbound calls handled per day: 90 → 90 (volume same, but distribution changed)
- Calls fully resolved by IVR (no human): 73%
- Calls warm-transferred to staff: 22%
- Calls dropped (caller hung up): 5%
- Average call duration: 4.2 min (with humans) → 1.3 min (with IVR)
- Staff time on phone: 6.5 hrs/day → 1.8 hrs/day
- Tickets created automatically: 28 per day
- Cost per call: ₹2.80 all-in
- Monthly run cost (2,700 calls): ₹7,560

The two staff members redirected the 4.7 hours/day of saved phone time to higher-value work (client review meetings, advisory calls).
Practice owner reported revenue per staff hour up roughly 38%, not because they're working harder, but because they're not stuck on the phone explaining password reset for the fifth time that day.

## What we'd ship differently today

Three changes if rebuilding now in December 2025.

Use Sarvam Saaras-v2 STT instead of faster-whisper. Sarvam's STT model trained on Indian languages now beats Whisper medium on accented Hindi/Tamil by ~25% lower WER. We tested it last week and are switching this weekend.

Use Claude Opus 4.5 medium-effort for intent classification on hard cases. Sonnet 4.5 misclassifies about 4% of multi-intent calls ("renewal AND gst question" gets routed to one). Opus 4.5 medium reduces this to under 1%. Worth the cost difference for the practice.

Add a callback-scheduling intent. Sometimes the caller's question is best handled by a 30-min scheduled call with the senior partner. Currently we route those to a human, but a "book a callback for tomorrow at 3pm" intent would resolve them without staff time.

## How this connects to other work

This pipeline is the IVR cousin of the Hindi voice bot for the Indore insurance agent we shipped in November. Same Twilio + Sarvam + Whisper + Claude stack, different domain (helpdesk vs. customer service) and different intent taxonomy. The architecture is the lesson; the prompts are the per-client work.

We use the same voice infrastructure that powers our in-house product TalkDrill (English fluency app, 5,000+ active users), covered in detail in our deep-dive on how TalkDrill hits 800ms voice round-trip latency. The TalkDrill latency lessons port directly into client IVR builds via our AI automation team.

Communities worth bookmarking: [r/twilio](https://www.reddit.com/r/twilio/) for streaming media gotchas, [r/MachineLearning](https://www.reddit.com/r/MachineLearning/) for STT model comparisons, and [Sarvam's Discord](https://www.sarvam.ai/) for the Bulbul-v2 issue tracker.

## FAQ

### Can this work without barge-in?
Technically yes, but the user experience is markedly worse. Without barge-in, the caller must wait for the full bot response before speaking, even if they realised mid-bot-reply that the bot misunderstood. Barge-in is 60% of the "feels natural" magic.

### Why Twilio over Plivo or Exotel for India?

Twilio's Programmable Voice + Streams API is the most mature for media-streaming use cases. Plivo and Exotel are ~30% cheaper but have smaller ecosystems for the streaming primitive. For a build-and-iterate weekend, Twilio's docs save you 4-6 hours.

### How do you handle DTMF (touch-tone) input?

We don't, by default. Modern voice IVRs work better with open-ended speech + intent classification. If a caller MUST use touch-tone (a wired landline with no keypad letters, for instance), Twilio's `<Gather>` primitive handles DTMF and we fall back to a 4-option menu.

### What's the resolution rate floor for this approach?

In our experience across 6 client builds, intent-based IVRs land at 60-78% resolution depending on the FAQ coverage and the callers' familiarity with the bot. The CA practice hit 73% with 4 FAQ-resolved intents. A practice with 8 FAQ-resolved intents would likely hit 80%+.

### Can I deploy this on-prem instead of Hetzner cloud?

Yes. Replace Hetzner with any Linux box that has 4GB RAM and a public-routable HTTPS endpoint. The L4 GPU for Whisper is the only hard requirement, and you can swap Whisper for OpenAI's hosted STT (slightly slower, 200ms more latency) if you can't run a GPU.

### How do you handle accents the bot consistently mishears?

Add the misheard transcripts as training data for prompt-level disambiguation. We track Whisper outputs that human staff later corrected, then add a "common variations" section to the intent prompt: "Note: 'jeem-essay-tee' usually means GST." Improves classification by 3-5pp without retraining the STT model.

### What does the recording disclosure say?

"Yeh call recording aur quality monitoring ke liye record ho rahi hai. Agar aap ise opt out karna chahte hain, kripya 9 dabayein." ("This call is being recorded for record-keeping and quality monitoring. If you would like to opt out, please press 9.") TRAI requires the disclosure; the opt-out path is best practice. Recordings are stored in S3 ap-south-1 with 90-day retention.
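For the accent-handling answer above, the "common variations" section can be generated from a small mapping and appended to the intent prompt. A minimal sketch: `variations_section` is a hypothetical helper, and the single entry is the example from that answer. In practice the mapping would be filled from staff-corrected transcripts.

```python
def variations_section(variations: dict) -> str:
    """Render misheard phrases as a prompt section; empty string if none."""
    if not variations:
        return ""
    lines = [
        f"- '{heard}' usually means {meant}."
        for heard, meant in sorted(variations.items())
    ]
    return "Common variations (from staff-corrected transcripts):\n" + "\n".join(lines)


# The one example given in the FAQ answer
COMMON_VARIATIONS = {"jeem-essay-tee": "GST"}

if __name__ == "__main__":
    print(variations_section(COMMON_VARIATIONS))
```

Keeping the mapping as data rather than hand-editing the prompt makes it easy to review which corrections were added and when.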

Want a Voice IVR for Your CA Practice or Accounting Software Helpdesk?

We ship voice IVRs for Indian SMB helpdesks (Tally, Zoho Books, BUSY, custom billing apps) in 7-14 working days. Trilingual routing, intent classification, automatic ticket creation, warm transfer to your team. Typical project: ₹65,000-₹1,40,000 fixed scope. Per-call run cost from ₹2.10. First call is technical — with the engineer who would lead your build.

Tags: Voice AI, IVR, Twilio, Sarvam AI, Whisper, Tally, Helpdesk, Indian SMB

Hrishikesh Baidya

CTO at Softechinfra specializing in Python, system architecture, and building secure, scalable software solutions.
