```json
{
  "model": "saaras-v3",
  "language": "en-IN",
  "biasing": {
    "phrases": [
      {"phrase": "office", "boost": 1.4},
      {"phrase": "school", "boost": 1.4},
      {"phrase": "very", "boost": 1.3},
      // ... 6,400 entries total, weighted by frequency
    ],
    "pronunciation_lexicon": "s3://talkdrill-bucket/lexicons/hindi-en-tier2-v3.csv",
    "lexicon_weight": 0.6
  },
  "diarization": false,
  "punctuation": true
}
```
Two settings matter here. biasing.phrases is a per-phrase boost: it tells the decoder that "office" is more probable than "off-his" in our context. pronunciation_lexicon points to the per-word phoneme variant list. The lexicon_weight is critical: set it too high and the decoder over-biases toward variants even when speakers use the standard pronunciation. We tested 0.3, 0.5, 0.6, and 0.8; 0.6 gave the lowest WER on a held-out validation set.
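The weight sweep can be sketched as follows. This is a minimal illustration, not our production harness: it assumes you have already transcribed the validation set once per candidate weight and collected (reference, hypothesis) pairs. The function names and the `runs` structure are hypothetical.

```python
from typing import Dict, List, Tuple


def word_errors(reference: str, hypothesis: str) -> Tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs: List[Tuple[str, str]]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = words = 0
    for ref, hyp in pairs:
        e, n = word_errors(ref, hyp)
        errors += e
        words += n
    return errors / words if words else 0.0


def pick_best_weight(runs: Dict[float, List[Tuple[str, str]]]) -> float:
    """Return the candidate weight whose validation transcripts have the lowest WER."""
    return min(runs, key=lambda w: corpus_wer(runs[w]))
```

Here `runs` maps each candidate weight (0.3, 0.5, 0.6, 0.8 in our case) to the validation pairs produced at that weight. Keeping the test set out of this loop is what makes the final WER number trustworthy.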
## The Results, Broken Down By Phoneme Class
Two observations. The /v-w/ and /j-z/ classes saw the biggest relative improvements (50% and 42%) because they had the cleanest variant patterns. The /sh-s/ class improved least (27%) because the speaker-to-speaker variation was higher — some speakers consistently used /s/, others /sh/, others alternated. A second iteration of the lexicon would split this class by speaker region.
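For readers who want to reproduce this kind of breakdown, relative improvement per class is the fraction of baseline error removed. The sketch below uses placeholder class names and WER values purely for illustration; they are not our measured results.

```python
from typing import Dict


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline error removed; 0.5 means errors were halved."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


def per_class_report(baseline: Dict[str, float],
                     tuned: Dict[str, float]) -> Dict[str, float]:
    """Relative improvement for every phoneme class present in the baseline."""
    return {cls: relative_improvement(baseline[cls], tuned[cls])
            for cls in baseline}


# Placeholder values for illustration only, not the measured results.
example_baseline = {"class_a": 0.30, "class_b": 0.20}
example_tuned = {"class_a": 0.15, "class_b": 0.15}
report = per_class_report(example_baseline, example_tuned)
```

Reporting per class rather than as one average makes a weak class visible even when the overall number looks healthy.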
## What Failed (Worth Documenting)
Failed #1 — Fine-tuning Whisper. We tried a 12-hour fine-tune of Whisper-medium on our 600-utterance corpus + augmented data. WER moved from 24.1% to 22.8% — marginal. The problem: 600 utterances is too small for fine-tuning to outperform a hand-crafted lexicon. We would need ~10,000 utterances to make fine-tuning worthwhile.
Failed #2 — Noise reduction preprocessing. Hypothesis: tier-2 city recordings have more background noise; reduce noise, improve WER. Reality: aggressive noise reduction also stripped pronunciation information that the decoder needed. WER got worse by 1.4 points. We use mild noise reduction now (RNNoise at 0.4 strength) only when SNR is below 12 dB.
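The gating rule is simple enough to sketch. The version below is an assumption-laden illustration: it estimates SNR from a speech segment and a noise-only segment (which we assume a voice activity detector supplies), and it only decides the denoise strength. The call into RNNoise itself is deliberately left out, and the function names are hypothetical.

```python
import math
from typing import Sequence

SNR_THRESHOLD_DB = 12.0  # below this, apply mild noise reduction
MILD_STRENGTH = 0.4      # the mild setting mentioned above


def mean_square(samples: Sequence[float]) -> float:
    """Average power of a block of audio samples."""
    return sum(s * s for s in samples) / len(samples)


def estimate_snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """Rough SNR in dB from a speech segment and a noise-only segment.

    The speech segment still contains noise, so this slightly overestimates
    the true SNR. It is good enough for a coarse on/off decision.
    """
    noise_power = mean_square(noise)
    if noise_power == 0:
        return math.inf  # silent background: never denoise
    return 10 * math.log10(mean_square(speech) / noise_power)


def denoise_strength_for(snr_db: float) -> float:
    """Return the denoise strength to use, or 0.0 to skip denoising."""
    return MILD_STRENGTH if snr_db < SNR_THRESHOLD_DB else 0.0
```

Clean recordings pass through untouched, which preserves the pronunciation detail that aggressive noise reduction was stripping.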
Failed #3 — N-gram language model rescoring. We tried building a domain-specific 4-gram LM on our app's transcripts and rescoring the ASR n-best output. WER barely moved (-0.3 points). Saaras V3's own LM is already strong for English; our LM was redundant.
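For context, n-best rescoring generally looks like the sketch below: each hypothesis keeps its ASR score and gets a weighted LM score added on top. This is a generic illustration, not our experiment code. The `lm_weight` value is arbitrary, and the toy unigram scorer is a stand-in for the 4-gram LM.

```python
from typing import Callable, Dict, List, Tuple


def rescore_nbest(nbest: List[Tuple[str, float]],
                  lm_logprob: Callable[[str], float],
                  lm_weight: float = 0.3) -> str:
    """Pick the hypothesis with the best combined ASR + weighted LM log score."""
    best_text, _ = max(
        nbest,
        key=lambda item: item[1] + lm_weight * lm_logprob(item[0]),
    )
    return best_text


def make_unigram_scorer(word_logprobs: Dict[str, float],
                        unknown_logprob: float = -6.0) -> Callable[[str], float]:
    """Toy LM: sum of per-word log-probabilities (stand-in for a real n-gram LM)."""
    def score(text: str) -> float:
        return sum(word_logprobs.get(w, unknown_logprob) for w in text.split())
    return score


# Toy example: the ASR slightly prefers the wrong hypothesis.
toy_lm = make_unigram_scorer(
    {"i": -1.0, "went": -1.0, "to": -1.0, "office": -2.0, "off": -4.0, "his": -4.0}
)
toy_nbest = [("i went to off his", -4.0), ("i went to office", -4.2)]
chosen = rescore_nbest(toy_nbest, toy_lm)
```

When the ASR's built-in LM already ranks the right hypothesis first, as it mostly did for us, the second term rarely changes the winner. That is consistent with the small WER movement we saw.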
## The Pre-Ship Checklist (Domain-Specific ASR)
- Test corpus collected from production-realistic conditions (not studio)
- Per-speaker diversity — minimum 20 speakers across 3+ regional contexts
- Hand-transcribed by 2 reviewers + adjudicator, ≥ 0.92 inter-annotator agreement
- Baseline WER established on stock ASR, broken down by phoneme class
- Top 4-5 error categories identified and prioritised by frequency × severity
- Lexicon entries written by someone with IPA training (not vibes)
- Lexicon weight tuned on held-out validation, not the test set
- Per-class WER tracked over time, not just average WER
- Production fallback if lexicon load fails — never serve a 500
- Quarterly re-collection of test corpus as user base shifts
v3.1.4) and route 5% of traffic to new versions before promoting.
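The "never serve a 500" item from the checklist can be sketched as a config builder that degrades to stock recognition when the lexicon is unreachable. This is a hypothetical illustration: `check_lexicon` is an assumed probe that raises on failure, and we assume the ASR request is still valid when the lexicon keys are omitted.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger("asr.lexicon")


def build_biasing(lexicon_uri: str,
                  check_lexicon: Callable[[str], None],
                  weight: float = 0.6) -> Dict[str, Any]:
    """Return lexicon biasing settings, or an empty dict if the lexicon is unavailable."""
    try:
        check_lexicon(lexicon_uri)  # hypothetical probe; raises on any failure
    except Exception as exc:  # broad on purpose: any failure means "degrade"
        logger.warning("Lexicon %s unavailable, using stock ASR: %s",
                       lexicon_uri, exc)
        return {}
    return {"pronunciation_lexicon": lexicon_uri, "lexicon_weight": weight}
```

The caller merges the result into the request's biasing section. A failed lexicon load then costs some accuracy for that request instead of failing the request outright.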
## When NOT To Build A Custom Lexicon
Skip this work if (a) your stock-ASR WER is already below 12% — diminishing returns, (b) your audience is metro-tier-1 only — Saaras / Whisper handle this well, or (c) you cannot collect a test corpus — without ground truth you are guessing. The collection cost (₹12,000 in our case for the corpus, plus 30 person-hours) is the smallest part of the budget but the largest part of the value.
## Real Example — The First Patna User Who Came Back
In late August, before the lexicon was deployed, a 26-year-old user from Patna left a one-star review: "Bot does not understand my English. I left." We saw the review, traced her sessions, found 40+ examples of /v-w/ misrecognition. After the lexicon went live, we sent her a polite email asking if she would try again. She did. Her WER on the same prompts dropped from 31% to 14%. She is a paying user as of October. Speech recognition that fails for tier-2 speakers is not "high-quality" in any meaningful sense — it is unfit for the Indian market.
## A Detail That Saved Us On Day 23
On day 23, a developer noticed our WER measurements were drifting upward day-over-day in production. Investigation found the lexicon was being re-loaded from S3 on every cold start of our ASR worker, and the cache TTL was too short — workers were spending 12% of their CPU on lexicon parsing. Fix: load once on worker startup, hold in memory, refresh only on explicit lexicon-version-bump signal. CPU dropped, WER stabilised. The lesson: production ML systems leak in operational ways, not just in model accuracy.
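The fix amounts to a version-keyed in-memory cache. The sketch below is a simplified illustration rather than our worker code: `loader` is a hypothetical function that fetches and parses the lexicon for a given version, and the class name is made up.

```python
import threading
from typing import Callable, Dict, List, Optional

Lexicon = Dict[str, List[str]]  # word -> pronunciation variants (assumed shape)


class LexiconCache:
    """Holds one parsed lexicon in memory and reloads only on a version change."""

    def __init__(self, loader: Callable[[str], Lexicon]) -> None:
        self._loader = loader
        self._lock = threading.Lock()
        self._version: Optional[str] = None
        self._lexicon: Lexicon = {}

    def get(self, version: str) -> Lexicon:
        """Return the lexicon for `version`, fetching and parsing at most once per version."""
        with self._lock:
            if version != self._version:
                self._lexicon = self._loader(version)  # the expensive fetch + parse
                self._version = version
            return self._lexicon
```

The worker calls `get` with the currently promoted version on each request. Repeated calls with the same version return the cached object, and a version bump triggers exactly one reload.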
## How This Connects To The Wider TalkDrill Stack
The lexicon is one piece. Voice tutoring on TalkDrill also depends on streaming WebRTC (covered in our 800 ms latency post), pronunciation scoring (covered in our pronunciation scoring post), and the Claude Sonnet 4.5 conversational layer with prefix caching. The lexicon is the boring infrastructure piece — but it is the piece that determines whether tier-2 speakers feel respected by the product or not. We have a similar discipline for the Hindi essay grader on PenLeap, our in-house edtech product.
## FAQ
### Did you consider Whisper-Hindi v2 instead?
Yes. Whisper-Hindi v2 hits ~5% WER on FLEURS, but FLEURS is clean Hindi. Our problem is Hindi-influenced English in noisy conditions. Different problem; different solution.
### Why not use IndicWhisper or AI4Bharat models?
We tested AI4Bharat IndicConformer at the start of the project. WER was 28%, slightly worse than Saaras V3 baseline. The model is excellent for pure Hindi; for code-mixed and Hindi-influenced English, Saaras V3 + lexicon was better.
### How much did Sarvam cost vs alternatives?
Saaras V3 was ₹0.41 per minute at our volume. Whisper-medium self-hosted on a g4dn.xlarge was ₹0.18 per minute compute + the operational overhead. We chose Saaras for the API ergonomics; the cost difference was acceptable at our scale.
### How do you handle code-switching (Hindi-English mixing)?
Saaras V3 handles code-mixed audio without explicit configuration. Our lexicon biases the English half. The Hindi half goes through Saaras's native Hindi recognition. We have not had to do anything special.
### What about other Indian languages?
We started with Hindi because it covers the largest speaker base. The same lexicon-build methodology works for Tamil, Bengali, Telugu, Marathi — each language has its own characteristic English-influence patterns. Building a Tamil lexicon would be the next 6-week project.
### Did the lexicon reduce response latency?
No. It increased latency slightly, by about 40 ms per call. That is an acceptable trade-off for the quality lift, but if you are streaming and every millisecond matters, weigh the cost.
### What happens when ASR vendors update their models?
We re-baseline whenever Saaras releases a new version. Sometimes the lexicon helps less (the new model handles more on its own); sometimes more. We have a quarterly re-baselining cadence in our ops calendar.
### Where can I learn more about Indic ASR research?
Start with the [Sarvam evaluation post](https://www.sarvam.ai/blogs/evaluating-indian-language-asr) (their honest discussion of WER limitations) and the [Voice of India benchmark paper](https://arxiv.org/html/2604.19151v1) for the academic baseline.
Need a domain-specific speech evaluation engine?
We build production speech pipelines for Indian-language voice products: ASR + lexicon + pronunciation scoring + LLM downstream. Fixed-scope projects, ₹3-7 lakh depending on language and corpus size. First call is with the engineer who would lead your build. Email contact@softechinfra.com or use the contact form.
