TalkDrill Voice AI: How We Cut Round-Trip Latency to 740ms on Indian 4G
Streaming-STT, KV-cache reuse, Mumbai-region edge-compute, and WebRTC tweaks that cut TalkDrill's voice round-trip from 1,820ms to 740ms on Airtel/Jio 4G. The latency budget, the packet-capture proof, the rebuild that actually shipped.
Hrishikesh Baidya
December 17, 2025 · 16 min read
In June 2025, our average voice round-trip latency on TalkDrill — measured from "user finished speaking" to "first phoneme of AI response audible" — was 1,820 ms on Airtel 4G in Mumbai. Users called it laggy. They were right. Six months later, the same measurement on the same network is 740 ms. The fix was not one big rewrite. It was four streaming-pipeline changes, one infrastructure relocation, and a hundred small client-side tweaks. This post is the budget, the breakdown, and the production runbook.
- 1,820 ms: average round-trip latency, June 2025
- 740 ms: average round-trip latency, Dec 2025
- 59%: reduction across all network conditions
- 5,000+: active users across Indian carriers
## The Answer in 60 Words
We cut TalkDrill's round-trip latency from 1,820 ms to 740 ms by streaming partial transcripts (no waiting for end-of-utterance), reusing the KV-cache across user turns, moving inference to AWS ap-south-1 (Mumbai), pre-warming TTS chunks before the LLM finishes, and running VAD client-side in WebAssembly. Each change was verified with packet captures from actual Airtel/Jio 4G phones in Mumbai and Bangalore.
## Why This Matters Now
[OpenAI's published latency-budget guidance](https://openai.com/index/delivering-low-latency-voice-ai-at-scale/) puts conversational voice AI in the 300-800 ms range to feel natural. Above 1,000 ms, users start interrupting the model. Above 1,500 ms, they assume something is broken. Indian 4G networks add 180-220 ms of round-trip just to reach AWS Mumbai before any inference happens, and 900-1,100 ms p95 if you route to a US region. This is non-negotiable infrastructure context for any voice-AI product serving Indian users.
[TalkDrill](https://talkdrill.com), our in-house English-speaking app for Indian adults with 5,000+ active users, runs voice conversations as the core product loop. Every extra 100 ms of latency is felt by every user on every turn. The 1,080 ms we cut between June and December is the difference between "the app feels broken" and "the app feels like a real conversation."
## The Latency Budget (Where Every Millisecond Goes)
This is the production budget for a typical 740 ms round-trip on Airtel 4G in Mumbai. It is from a December 2025 packet capture across 200 sample turns.
The biggest line item is LLM time-to-first-token. The biggest cut from June to December was in ASR processing, where streaming partial transcripts removed a 600+ ms wait for end-of-utterance.
## The 4 Streaming-Pipeline Changes That Mattered
1. Stream the ASR transcript. Switched from "wait for end-of-utterance" to streaming partial transcripts via Faster-Whisper with VAD chunking. Each ~300 ms audio chunk produces a partial that the LLM starts processing speculatively.
2. KV-cache reuse across turns. The conversation history doesn't change between turns. We keep the KV-cache warm on the inference server, indexed by conversation ID. New turns only pay for the new tokens, not the full history. Saves ~120 ms TTFT.
3. Speculative TTS pre-warm. As the LLM streams its first 8-12 tokens, we start TTS on the prefix in parallel. By the time the LLM finishes, the first audio chunk is already on the wire. Removes the 240-280 ms TTS-after-LLM wait.
4. Client-side VAD in WebAssembly. Voice Activity Detection runs in the browser via a WASM-compiled Silero model. End-of-speech is detected client-side in 50-80 ms instead of the 200-350 ms server-side latency we had before.
## The Infrastructure Move (Mumbai Region)
This was the largest network-side gain in the rebuild.
In June, our inference (Whisper + LLM + TTS) ran on AWS us-east-1 (Virginia). Round-trip from a Mumbai 4G phone to us-east-1: 218 ms p95 (just for the network). Plus TLS handshake on cold connections: another 90-150 ms.
We migrated to AWS ap-south-1 (Mumbai) in August 2025. Round-trip from the same Mumbai 4G phone: 18 ms p95. TLS handshake: 28 ms. The network alone shaved ~200 ms off the budget. The cost: we had to provision GPU spot instances in ap-south-1, which were ~22% more expensive than the equivalent in us-east-1. The latency tradeoff was worth it; the cost increase was a fraction of the latency win in dollar terms.
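The "~200 ms" figure follows directly from the numbers quoted above. A minimal sketch of that arithmetic, using only the measurements in this section (the variable names are illustrative):

```python
# Figures quoted in this section, all in milliseconds.
US_EAST_RTT_P95 = 218          # Mumbai 4G phone -> us-east-1
MUMBAI_RTT_P95 = 18            # same phone -> ap-south-1
US_EAST_TLS_COLD = (90, 150)   # cold TLS handshake range, us-east-1
MUMBAI_TLS = 28                # TLS handshake, ap-south-1

# Saving on every network round trip after the move.
rtt_saving = US_EAST_RTT_P95 - MUMBAI_RTT_P95

# Additional saving that applies only when the connection is cold.
tls_saving = tuple(t - MUMBAI_TLS for t in US_EAST_TLS_COLD)

print(f"RTT saving per round trip: {rtt_saving} ms")
print(f"Extra saving on cold connections: {tls_saving[0]}-{tls_saving[1]} ms")
```

The TLS saving only shows up on cold connections, which is why it is kept separate from the per-turn round-trip saving.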
For a deeper explanation of why region choice matters this much for Indian users, see the [comparative analysis on voice-AI for India vs global platforms](https://www.caller.digital/blog/voice-ai-india-vs-global-platforms) — the published latency benchmarks confirm the same patterns we measured.
## The Streaming ASR Change (How It Actually Works)
Before the rebuild, our ASR pipeline did this:
1. Capture full audio chunk after VAD detected end-of-speech
2. Upload entire chunk
3. Run Whisper on full chunk
4. Return final transcript
5. Send transcript to LLM
Total ASR-side latency: ~620 ms median, ~1,100 ms p95.
After the rebuild:
1. Stream audio chunks (~300 ms each) as they are captured
2. Run Whisper on each chunk in parallel as it arrives
3. Emit partial transcripts with confidence scores
4. LLM starts processing on partials (with a buffer for revisions)
5. Final transcript commits when VAD detects end-of-speech
Total ASR-side latency: ~180 ms median (the time from end-of-speech to final transcript commit). The LLM has been working on partials for the previous 1.2-2 seconds and only needs to confirm or revise its plan.
The key library is [Faster-Whisper](https://github.com/SYSTRAN/faster-whisper) — the CTranslate2-optimised Whisper implementation. We run large-v3 on a g5.xlarge in ap-south-1 with batch size 1 and beam size 1. Latency per 300 ms chunk: ~80 ms.
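The after-the-rebuild flow can be sketched roughly as follows. This is a simplified illustration, not our production code: `transcribe_chunk` is a hypothetical stand-in for the ASR call, the 16 kHz 16-bit mono format is an assumption, and the sketch ignores partial revisions (it only appends).

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional

SAMPLE_RATE = 16_000      # assumed: 16 kHz mono PCM
BYTES_PER_SAMPLE = 2      # assumed: 16-bit samples
CHUNK_MS = 300            # chunk size quoted in the post


def chunk_audio(pcm: bytes, chunk_ms: int = CHUNK_MS) -> Iterator[bytes]:
    """Yield fixed-duration slices of raw PCM; the last slice may be shorter."""
    step = SAMPLE_RATE * chunk_ms // 1000 * BYTES_PER_SAMPLE
    for start in range(0, len(pcm), step):
        yield pcm[start:start + step]


@dataclass
class StreamingTranscript:
    """Accumulates partials; `final` is set only when end-of-speech is signalled."""
    transcribe_chunk: Callable[[bytes], str]   # hypothetical ASR call
    partials: List[str] = field(default_factory=list)
    final: Optional[str] = None

    def feed(self, chunk: bytes, end_of_speech: bool = False) -> str:
        """Transcribe one chunk and return the transcript so far."""
        text = self.transcribe_chunk(chunk)
        if text:
            self.partials.append(text)
        current = " ".join(self.partials)
        if end_of_speech:
            self.final = current   # commit: downstream can stop revising
        return current
```

Each call to `feed` returns a partial the LLM can start on. Only the call carrying `end_of_speech=True` commits the final transcript, which is why the ASR cost after end-of-speech shrinks to a single chunk.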
## The KV-Cache Reuse Trick
Most LLM inference servers re-process the entire conversation history on every turn. For a 12-turn conversation with 4,000 tokens of history, that is 4,000 tokens of pre-fill on each turn — ~120 ms of TTFT just for processing the history.
We cache the KV state of the conversation history on the GPU between turns, indexed by conversation ID. When the next turn arrives, only the new user message gets pre-filled. The 120 ms wait disappears.
Implementation gotchas:
- Cache eviction. GPU memory is limited. We evict caches that have been idle >5 minutes. Each cache is ~80-200 MB depending on conversation length.
- Cache invalidation. If the conversation history is edited (a "regenerate" button), invalidate the cache and rebuild.
- Multi-tenant isolation. Caches are per-conversation, not per-user, but a security audit confirmed there is no cross-tenant leakage path.
This is not a custom inference-server feature — it is a vLLM configuration. We use vLLM for our local LLM (a fine-tuned Llama 3 8B for the conversational layer) with the prefix-caching feature enabled.
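The eviction and invalidation rules in the gotchas above can be illustrated with a small bookkeeping sketch. vLLM does its own block-level bookkeeping when prefix caching is on, so nothing like this class is needed to get the reuse itself. The class and method names are hypothetical, and it tracks only sizes and timestamps, not tensors.

```python
import time
from collections import OrderedDict
from typing import List, Optional

IDLE_TTL_S = 5 * 60   # idle eviction window quoted in the post


class ConversationCacheIndex:
    """Hypothetical index of warm per-conversation caches (sizes only)."""

    def __init__(self, budget_mb: float, idle_ttl_s: float = IDLE_TTL_S):
        self.budget_mb = budget_mb
        self.idle_ttl_s = idle_ttl_s
        self._entries = OrderedDict()   # conv_id -> (size_mb, last_seen)

    def used_mb(self) -> float:
        return sum(size for size, _ in self._entries.values())

    def __contains__(self, conv_id: str) -> bool:
        return conv_id in self._entries

    def invalidate(self, conv_id: str) -> None:
        """Drop a cache whose history was edited (e.g. a regenerate)."""
        self._entries.pop(conv_id, None)

    def evict_idle(self, now: float) -> List[str]:
        """Remove every cache idle for longer than the TTL."""
        stale = [cid for cid, (_, seen) in self._entries.items()
                 if now - seen > self.idle_ttl_s]
        for cid in stale:
            del self._entries[cid]
        return stale

    def touch(self, conv_id: str, size_mb: float,
              now: Optional[float] = None) -> List[str]:
        """Record a turn for conv_id; return the IDs evicted to make room."""
        now = time.monotonic() if now is None else now
        self._entries.pop(conv_id, None)
        self._entries[conv_id] = (size_mb, now)   # most recent goes last
        evicted = self.evict_idle(now)
        # Memory pressure: drop least-recently-used caches until under budget.
        while self.used_mb() > self.budget_mb and len(self._entries) > 1:
            oldest, _ = self._entries.popitem(last=False)
            evicted.append(oldest)
        return evicted
```

Evicting on memory pressure as well as on idle time is the design choice worth copying: idle-only eviction still allows an out-of-memory condition when many conversations are active at once.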
## The Speculative TTS Pre-Warm
After the LLM emits its first 8-12 tokens, we start TTS on those tokens in parallel with the rest of LLM generation. By the time the LLM is done, the first audio chunk is already on the wire to the client.
This is a small implementation but a big perceptual gain. Before: user finishes speaking, ~600 ms of silence, then LLM response audible. After: user finishes speaking, ~340 ms of silence, then LLM response audible. The 260 ms cut is exactly the TTS-after-LLM wait we removed.
Caveats: speculative TTS sometimes has to revise itself if the LLM's first 8-12 tokens didn't match the final response shape. This happens ~3% of the time. We accept the rare audio jitter because the latency win across all turns is worth it.
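The overlap between generation and synthesis can be sketched with `asyncio`. Everything here is a stand-in: `fake_llm` and `fake_tts` are hypothetical placeholders for the real streaming calls, the prefix length of 8 is the low end of the range above, and the revision path for the ~3% case is not modelled.

```python
import asyncio
from typing import AsyncIterator, List

PREFIX_TOKENS = 8   # the post starts TTS after the first 8-12 tokens


async def fake_llm(tokens: List[str]) -> AsyncIterator[str]:
    """Stand-in for a streaming LLM: yields one token per scheduler tick."""
    for tok in tokens:
        await asyncio.sleep(0)
        yield tok


async def fake_tts(text: str, log: List[str]) -> bytes:
    """Stand-in for a TTS call; records when synthesis started."""
    log.append(f"tts_start:{len(text.split())}")
    await asyncio.sleep(0)
    return text.encode()


async def respond(tokens: List[str]) -> List[str]:
    """Run generation and prefix synthesis concurrently; return the event log."""
    log: List[str] = []
    seen: List[str] = []
    prefix_task = None
    async for tok in fake_llm(tokens):
        seen.append(tok)
        if prefix_task is None and len(seen) == PREFIX_TOKENS:
            # Start synthesising the prefix while generation continues.
            prefix_task = asyncio.create_task(fake_tts(" ".join(seen), log))
    log.append("llm_done")
    if prefix_task is None:
        # Short reply: generation ended before the prefix threshold.
        prefix_task = asyncio.create_task(fake_tts(" ".join(seen), log))
    await prefix_task
    rest = seen[PREFIX_TOKENS:]
    if rest:
        await fake_tts(" ".join(rest), log)
    return log
```

Running `asyncio.run(respond(tokens))` with a reply longer than the prefix shows the first `tts_start` entry logged before `llm_done`, which is the overlap that removes the TTS-after-LLM wait.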
## The WebAssembly VAD
Voice Activity Detection — telling when the user has finished speaking — used to run server-side after we received the audio. Latency: 200-350 ms.
We compiled the [Silero VAD model](https://github.com/snakers4/silero-vad) to WebAssembly and ship it in the browser bundle (~280 KB compressed). It runs on a worker thread with the same accuracy as the server-side version.
End-of-speech detection now happens client-side in 50-80 ms. The audio chunk is sent to the server with an "EOF" flag attached, so the server knows immediately to commit the final transcript. We saved another 150-280 ms on the budget.
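The endpointing decision itself is simple once the model emits a speech probability per frame. The sketch below shows that logic in Python for readability, although the production version runs in the browser. The frame length, threshold, and hangover values are assumptions for illustration, not our tuned settings.

```python
from typing import Iterable, Optional

FRAME_MS = 32            # assumed frame duration
SPEECH_THRESHOLD = 0.5   # assumed probability cut-off for "speech"
HANGOVER_MS = 64         # assumed silence needed to declare end-of-speech


def detect_end_of_speech(probs: Iterable[float]) -> Optional[int]:
    """Return the time in ms at which end-of-speech is declared, or None.

    `probs` holds one speech probability per frame, as a VAD model would emit.
    """
    needed = -(-HANGOVER_MS // FRAME_MS)   # ceiling division: frames of silence
    heard_speech = False
    silent_run = 0
    for i, p in enumerate(probs):
        if p >= SPEECH_THRESHOLD:
            heard_speech = True
            silent_run = 0            # any speech frame resets the hangover
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return (i + 1) * FRAME_MS
    return None
```

The hangover is the main tuning knob: a shorter value detects end-of-speech sooner but cuts off users who pause mid-sentence.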
## The 12-Step Rebuild Plan (We Ran This Sequentially)
1. Baseline measurement. Packet capture across Airtel + Jio + Vi 4G in Mumbai, Bangalore, Pune. 200 sample turns per network. Latency budget broken down by stage.
2. Migrate inference to ap-south-1. Spin up g5.xlarge GPU instances in Mumbai region. Re-test inference latency. Single biggest win: 200 ms off the budget.
3. Switch ASR to Faster-Whisper streaming. Replace the wait-for-end pattern with streaming partials. Tune chunk size (300 ms), beam size (1), VAD aggressiveness.
4. Enable vLLM prefix-caching. Configure vLLM with prefix-caching. Reuse KV-cache across turns indexed by conversation ID. Set per-cache eviction at 5 minutes idle.
5. Speculative TTS pre-warm. Wire the LLM-output stream to TTS so first 8-12 tokens kick off speech synthesis in parallel with continued generation.
6. Compile Silero VAD to WASM. Compile model, ship in client bundle, run on worker thread, signal EOF to server.
7. WebRTC tuning for Indian 4G. Bump audio bitrate to Opus 24 kbps stereo, enable jitter buffer adaptation, configure ICE for cellular preference. Reduces packet loss on Vi 4G specifically.
8. CDN for static client assets. Move all client assets (including the WASM VAD) to CloudFront Edge in India. Initial app load drops from 4.2 s to 1.8 s.
9. Connection pre-warm. Open the WebRTC + WebSocket connection 3 seconds before the user is expected to start speaking. Removes TLS handshake from the per-turn latency.
10. Per-network latency dashboard. Build a Grafana dashboard tracking p50/p90/p99 latency per carrier per city per turn. Catch regressions within 24 hours.
11. A/B test per change. Roll out each change to 10% of users for 5 days. Compare turn-completion rate, session length, satisfaction rating. Roll back if any metric degrades.
12. Final packet-capture validation. Re-run the baseline measurement with the same Mumbai 4G phones, same script, same time of day. Compare each line item to the original budget.
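For the 10% rollout in step 11, one common approach is deterministic hash bucketing, so a user stays in the same group for the full 5 days. The sketch below is one way to do that under stated assumptions: the function name and the `change_name` salt are hypothetical, not our actual rollout service.

```python
import hashlib

ROLLOUT_PERCENT = 10   # rollout share quoted in step 11


def in_rollout(user_id: str, change_name: str,
               percent: int = ROLLOUT_PERCENT) -> bool:
    """Deterministically assign a user to the treatment group for one change."""
    # Salting with the change name gives each change an independent 10% slice.
    digest = hashlib.sha256(f"{change_name}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent
```

Because the assignment depends only on the user ID and the change name, it needs no stored state and gives the same answer on every server.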
## The Production Pre-Deploy Checklist (Run This Before Promoting Any Latency Change)
- Packet capture from at least 3 Indian carriers (Airtel, Jio, Vi)
- Latency p50/p90/p99 per stage compared against the baseline
- Turn-completion rate stable or improving
- Session-length metric stable or improving
- No increase in transcription error rate (WER) on the validation set
- No increase in TTS audio jitter complaints in the support queue
- KV-cache eviction tested under sustained load (no GPU OOM)
- WebAssembly VAD tested on Android Chrome, iOS Safari, desktop Chrome/Safari/Firefox
- Rollback procedure tested in staging in the last 7 days
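The per-stage percentile comparison in the checklist is easy to automate. A minimal sketch using only the standard library follows. The stage names and the zero default tolerance are assumptions, and the input is expected to be raw per-turn latencies in milliseconds for each stage.

```python
import statistics
from typing import Dict, List


def percentiles(samples_ms: List[float]) -> Dict[str, float]:
    """Return p50/p90/p99 for one stage's latency samples."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}


def regressed_stages(baseline: Dict[str, List[float]],
                     candidate: Dict[str, List[float]],
                     tolerance_ms: float = 0.0) -> Dict[str, List[str]]:
    """Map each regressed stage to the percentiles that got worse."""
    out: Dict[str, List[str]] = {}
    for stage, base_samples in baseline.items():
        base = percentiles(base_samples)
        cand = percentiles(candidate[stage])
        worse = [k for k in base if cand[k] > base[k] + tolerance_ms]
        if worse:
            out[stage] = worse
    return out
```

An empty result means no stage regressed beyond the tolerance. Anything else names the stage and percentile to investigate before promoting the change.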
## The Per-Carrier Differences We Found (And Had to Handle)
| Carrier | Median (p50) latency, Mumbai | Notable behaviour |
|---|---|---|
| Airtel 4G | 740 ms | Most stable, lowest packet loss. Our reference network. |
| Jio 4G | 790 ms | Slightly higher RTT but lower jitter. Best for sustained calls. |
| Vi 4G | 910 ms | Higher packet loss in Tier-2 cities; we tuned the jitter buffer aggressively for Vi. |
| Airtel 5G NSA | 520 ms | Where 5G coverage works, latency drops 30%. Coverage still patchy. |
| Wi-Fi (home) | 410 ms | Best case. Sits near the low end of the OpenAI guidance range for natural conversation. |
The Vi tuning was specifically a jitter-buffer adjustment plus a more aggressive Opus FEC (forward error correction) setting. Without it, Vi 4G users experienced 8-12% audio glitches per turn. With it: 1-2%.
## Common Mistakes Other Teams Make On Voice-AI Latency
Symptom: TTFT looks fine in lab tests but bad in production. Cause: lab tests are on Wi-Fi or low-RTT networks. Production users are on 4G with 90-200 ms RTT to your servers. Fix: always benchmark on representative cellular networks.
Symptom: latency improves by 100 ms but users complain more. Cause: speculative TTS pre-warm sometimes produces audio that has to revise. Even rare audio jitter is more noticeable than the latency win. Fix: monitor jitter and audio-quality metrics, not just latency.
Symptom: KV-cache reuse causes OOM under load. Cause: caches not evicted aggressively enough. Fix: monitor GPU memory; evict caches on memory pressure, not just on idle time.
Symptom: WebRTC works in dev but fails in production over corporate networks. Cause: corporate firewalls block UDP, forcing TURN. Fix: deploy a TURN server in the same region; expect ~80 ms additional latency for TURN-relayed traffic.
Symptom: latency is great in city centres but terrible in suburbs. Cause: 4G backhaul congestion. Fix: nothing application-side — this is a carrier-network problem. Surface it in the UI with a "weak network" indicator.
## A Question We Get From Founders Building Voice AI
Why not just use a managed voice-AI platform like [Sarvam's voice agents](https://www.sarvam.ai/) or [Bolna](https://www.bolna.ai/)?
For most product teams, you should. The work above takes ~3 months of senior engineering and assumes you have the in-house expertise to debug WebRTC and tune vLLM. Managed platforms ship with India-region inference and reasonable defaults. We built our own because TalkDrill's product loop requires fine-grained control over the streaming protocol and the prompt structure for the language-learning rubric — neither is well-served by a generic voice-agent platform. For a sales-coaching tool or a customer-support bot, the calculus tips the other way.
For deeper coverage of the architecture our team uses across voice-AI client work, see our [companion piece on TalkDrill's pronunciation scoring engine](/blog/talkdrill-indian-english-pronunciation-scoring-without-punishing-accents), our 2025 deep-dives on [MLOps for production AI systems](/blog/ai-operations-mlops) and [Kubernetes best practices we apply to inference clusters](/blog/kubernetes-2025-best-practices), and our [AI & automation service line](/services/ai-automation) where this stack ships to clients. The TalkDrill case study lives at [/projects/talkdrill](/projects/talkdrill).
If you'd rather we just build the voice AI stack for you, [we ship it as a fixed-scope 12-week engagement →](/contact?service=ai).
## FAQ
### What's the realistic latency floor for voice AI on Indian 4G?
Below 600 ms is hard without sacrificing audio quality or transcription accuracy. Below 400 ms requires 5G or Wi-Fi. The published [comparative voice-AI guidance](https://comparevoiceai.com/blog/latency-optimisation-voice-agent) confirms 600-800 ms is the realistic window for production conversational AI on Indian networks.
### Why not run inference on-device instead of in Mumbai?
For pronunciation feedback, on-device works (we ship a small WASM model for low-stakes tasks). For full conversational AI with a 4-8B parameter LLM, on-device is not feasible on the Android handsets most of our users have. Server-side in ap-south-1 is the right tradeoff.
### Does prompt caching help with voice AI latency?
Yes, especially for system prompts that don't change. Anthropic's prompt caching cuts cached-token cost by 90% and saves 50-80 ms on TTFT for cached prefixes. We use it for all client-side LLM calls.
### How do you measure latency in production accurately?
Server-side timestamps at 4 points (audio received, ASR final, LLM TTFT, TTS first chunk sent) plus a client-side timestamp at "audio first audible". The client-side timestamp is the user-perceptible one and the only one that matters for product decisions.
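Turning those timestamps into a per-stage breakdown is a few subtractions. In the sketch below the field and stage names are hypothetical, and the client-side `end_of_speech` timestamp is added so the round trip matches the "user finished speaking" to "first audio audible" definition used in this post.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TurnTimestamps:
    """Timestamps for one turn, in milliseconds."""
    end_of_speech: float          # client: user finished speaking
    audio_received: float         # server
    asr_final: float              # server
    llm_first_token: float        # server
    tts_first_chunk_sent: float   # server
    audio_first_audible: float    # client


def stage_breakdown(t: TurnTimestamps) -> Dict[str, float]:
    """Split one turn into stage durations plus the user-perceived round trip."""
    return {
        "uplink": t.audio_received - t.end_of_speech,
        "asr": t.asr_final - t.audio_received,
        "llm_ttft": t.llm_first_token - t.asr_final,
        "tts": t.tts_first_chunk_sent - t.llm_first_token,
        "downlink_and_playout": t.audio_first_audible - t.tts_first_chunk_sent,
        "round_trip": t.audio_first_audible - t.end_of_speech,
    }
```

The `round_trip` value uses only client timestamps, so it is immune to clock skew between phone and server. The `uplink` and `downlink_and_playout` values mix the two clocks and are only meaningful if the offset has been estimated and removed.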
### What about WebSocket overhead vs WebRTC?
WebRTC has lower overhead for sustained audio streams because it uses UDP. WebSocket adds TCP retransmit overhead which spikes latency on lossy networks (Vi 4G in particular). We use WebRTC for the audio path and WebSocket for the control path.
### How do you handle network handovers (4G to Wi-Fi mid-session)?
WebRTC supports ICE restart for connectivity changes. We trigger ICE restart on detected network change and reconnect within 1-2 seconds. The user experiences a brief ~1 second pause but the session continues; transcript history is preserved server-side.
### What does the GPU bill look like for this setup?
For 5,000 active users with ~3 sessions/week of 8 minutes each: roughly $1,800/month in GPU spot instances plus $400/month in network egress. Per-active-user cost: ~₹37/month for inference. We pass a portion of that into the Pro tier pricing of ₹299/month.
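The per-user figure can be reproduced from the numbers above. The exchange rate is an assumption (about ₹84 per US dollar), as is converting weekly sessions to a monthly figure with 52/12 weeks per month.

```python
GPU_USD_PER_MONTH = 1_800
EGRESS_USD_PER_MONTH = 400
ACTIVE_USERS = 5_000
SESSIONS_PER_WEEK = 3
MINUTES_PER_SESSION = 8
INR_PER_USD = 84.0   # assumed exchange rate, not stated in the post

total_usd = GPU_USD_PER_MONTH + EGRESS_USD_PER_MONTH
per_user_usd = total_usd / ACTIVE_USERS
per_user_inr = per_user_usd * INR_PER_USD

# Monthly voice minutes across all users (52 weeks spread over 12 months).
minutes_per_month = ACTIVE_USERS * SESSIONS_PER_WEEK * MINUTES_PER_SESSION * 52 / 12
inr_per_minute = total_usd * INR_PER_USD / minutes_per_month

print(f"Per active user: ${per_user_usd:.2f} (about ₹{per_user_inr:.0f}) per month")
print(f"Per conversation minute: about ₹{inr_per_minute:.2f}")
```

At the assumed rate, the ~₹37 figure corresponds to GPU plus egress together, not GPU alone.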
## A Detail That Saved Us In November
In November, our latency dashboard flagged a regression: p90 jumped from 920 ms to 1,400 ms over a single afternoon. We had not deployed anything. The root cause turned out to be Anthropic's API region routing — a small percentage of our traffic was being routed to us-east-1 instead of staying in ap-south-1, adding 200+ ms to the budget. The fix was a configuration change in our Anthropic SDK call to pin the region. We added a regression test that asserts p90 latency stays within 100 ms of the published budget for any deployed change, run as a synthetic from a Mumbai test phone every 5 minutes.
The lesson: latency drifts from infrastructure changes you do not control. Synthetic monitoring from real networks in real cities is the only reliable signal. Without the dashboard, we would have shipped a regression to users for 3-4 days before noticing.
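A synthetic check of this kind can be sketched as a rolling p90 over the most recent probe results. The class name, the window of 36 probes (three hours at one probe every 5 minutes), and the minimum sample count are assumptions. The 100 ms tolerance is the one described above.

```python
import statistics
from collections import deque


class SyntheticLatencyMonitor:
    """Flags a breach when the rolling p90 drifts too far above the budget."""

    def __init__(self, budget_p90_ms: float, tolerance_ms: float = 100.0,
                 window: int = 36, min_samples: int = 10):
        self.budget_p90_ms = budget_p90_ms
        self.tolerance_ms = tolerance_ms
        self.min_samples = min_samples
        self.samples = deque(maxlen=window)   # oldest probes fall off the end

    def record(self, latency_ms: float) -> bool:
        """Add one probe result; return True if the rolling p90 is in breach."""
        self.samples.append(latency_ms)
        if len(self.samples) < self.min_samples:
            return False   # not enough data for a stable percentile
        p90 = statistics.quantiles(self.samples, n=10, method="inclusive")[8]
        return p90 > self.budget_p90_ms + self.tolerance_ms
```

With a budget of 920 ms, a run of probes near 1,400 ms trips the check after a handful of samples rather than after days of user complaints.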
We cross-checked our findings against [Reddit discussions in r/MachineLearning on real-time voice AI](https://www.reddit.com/r/MachineLearning/) and the [community pulse on r/voiceAI](https://www.reddit.com/r/voiceai/): Indian engineers building voice products report similar bottlenecks and similar fixes. The architecture choices in this post are not novel; the discipline of measuring them on Indian networks specifically is what made the latency win real.
Want a Voice-AI Mobile App Built on This Stack?
We build voice-AI products for Indian users — pronunciation, conversation, support agents — on the streaming + Mumbai-region + WebRTC stack described here. Fixed-scope engagements from ₹14 lakh, shipped in 10-14 working weeks. The first call is a technical scoping with the engineer who would lead your build.