TalkDrill Hindi Voice Bot: 31% Lower Word-Error Rate Using a Custom Tier-2 City Pronunciation Lexicon

On July 14, 2025, our voice tutor on TalkDrill still failed catastrophically on one specific user cohort: tier-2 city Hindi speakers learning English. A user in Lucknow saying "I went to the office at nine" got transcribed as "I vent to the office at nine." The bot then "corrected" her perfect English. After eight weeks of work, our Hindi-aware ASR pipeline drops word-error rate from 24.1% to 16.6% on a 600-utterance test corpus from Lucknow, Patna, Kanpur, and Indore — a 31% relative improvement. The fix was not a new model. It was a 1,420-entry custom pronunciation lexicon, four phoneme-level rewrites, and a careful rebuild of our test corpus. This post is what we did, what failed, and the engineering lessons for any team building Indian-English speech features.

- 24.1% → 16.6%: WER on tier-2 Hindi-influenced English
- 1,420: entries in our custom pronunciation lexicon
- 600: held-out utterances in our test corpus
- 4: phoneme groups responsible for 62% of misrecognitions

## The Answer in 60 Words

We built a domain-specific pronunciation lexicon by recording 600 utterances from speakers in 4 tier-2 cities, identified 4 phoneme groups (/v-w/, /th/, /sh-s/, /j-z/) responsible for 62% of misrecognitions, and registered phoneme-level rewrites with our ASR provider's bias system. Combined with a 6,400-token context biasing list of common English words pronounced with Hindi influence, we cut WER from 24.1% to 16.6%, a 31% relative reduction.

## Why This Matters Now

English-fluency apps in India serve a population where ~70% of learners come from non-metro cities: Patna, Kanpur, Lucknow, Indore, Vijayawada, Coimbatore, Bhopal. The dominant ASR providers (Whisper, Sarvam, Deepgram, Google Cloud STT) all train primarily on US/UK English plus standard Indian English from metro speakers. Sarvam's own Saaras V3 benchmarks report ~19% WER on the IndicVoices benchmark, but that is averaged across speakers; the tier-2 cohort sits significantly above the average. Whisper-Hindi v2 reports ~5% WER on cleanly recorded Hindi (FLEURS), but Hindi-influenced English in noisy conditions is a different problem. The gap is real, and for our learner base it is the single most important quality metric.

## What "Tier-2 City Pronunciation" Actually Means

When a Hindi-first speaker from Patna pronounces English, four sound substitutions dominate. Not all four happen for every speaker, but at least two happen for almost every speaker.

### /v/ ↔ /w/ Confusion

Hindi has a single labiodental approximant that English ASR maps inconsistently to /v/ or /w/. "Vent" / "went", "vine" / "wine", "very" / "wary". 21% of misrecognitions in our corpus.

### /θ/ → /t̪/ Substitution

English voiceless dental fricative ("think", "three") is pronounced as a Hindi dental stop. ASR hears "tink", "tree". 19% of errors.

### /ʃ/ ↔ /s/ Drift

"Ship" pronounced as "sip" or vice versa, especially before high-front vowels. Common for Bhojpuri / Maithili speakers. 12% of errors.

### /ʒ/ ↔ /z/ ↔ /j/ Triangle

"Pleasure" pronounced as "plejer" or "pleasur", "zoo" as "joo". The voiced postalveolar fricative does not exist in Hindi. 10% of errors.

These four phoneme groups account for 62% of misrecognitions in our test corpus. Address them and you address the bulk of the problem. Note: this is not a story about people speaking "wrong" English; it is a story about ASR systems trained on the wrong distribution of speakers.

## The Test Corpus (Where We Started)

Without a representative test set, you are flying blind. We spent the first 10 days of the project building one. Here is the recipe:

Step 1 — Recruit 30 speakers across 4 cities

8 from Lucknow, 8 from Patna, 7 from Kanpur, 7 from Indore. Mix of gender (16 F / 14 M), age 18-42, all self-identified as "comfortable but not fluent" English speakers from Hindi-first homes. Recruited via local TalkDrill power users. ₹400 per speaker.

Step 2 — Curate 20 prompts per speaker

Mix of the 4 problematic phoneme classes, common English words used in conversation, and natural sentences from our app's tutor scripts. Total: 600 utterances.

Step 3 — Record on phones, not studio

Each speaker recorded on their own phone in their normal environment — fan noise, family in next room, occasional auto-rickshaw outside. This is the production environment, not a clean lab.

Step 4 — Hand-transcribe by 2 reviewers + adjudicate

Two reviewers transcribe each utterance independently, third adjudicates disagreements. We track per-utterance reviewer agreement; below 0.92 we re-record.

Step 5 — Establish baseline WER on stock ASR

Run all 600 through our then-current ASR (Sarvam Saaras V3 with default settings). Baseline: 24.1% WER. Per-phoneme-class breakdown gave us the prioritisation list above.
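For readers who want to reproduce the baseline measurement in Step 5, here is a minimal sketch of corpus-level WER in Python. It assumes the standard definition (word-level edit distance divided by the number of reference words) and a simple lowercase, punctuation-stripping normalisation. The function names and the normalisation rule are our illustration, not the exact scoring script used on the project.

```python
import re
from typing import Iterable, List, Tuple


def normalise(text: str) -> List[str]:
    """Lowercase, drop punctuation (keeping apostrophes), split on whitespace."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Minimum number of word substitutions, deletions and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """WER over (reference, hypothesis) pairs, pooled across the corpus."""
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref_tokens = normalise(reference)
        errors += word_edit_distance(ref_tokens, normalise(hypothesis))
        ref_words += len(ref_tokens)
    if ref_words == 0:
        raise ValueError("reference transcripts contain no words")
    return errors / ref_words


def relative_reduction(before: float, after: float) -> float:
    """Relative improvement, e.g. from a baseline WER to a new WER."""
    return (before - after) / before


# Two illustrative utterances: 3 word errors over 11 reference words.
pairs = [
    ("I went to the office at nine.", "I vent to the office at nine"),
    ("Think about three things", "tink about tree things"),
]
print(f"WER: {corpus_wer(pairs):.1%}")
print(f"Relative reduction: {relative_reduction(24.1, 16.6):.1%}")
```

Pooling errors across the corpus, rather than averaging per-utterance WER, keeps short utterances from dominating the number. The same functions can be run per phoneme class by filtering `pairs` first.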

## The Lexicon Build (1,420 Entries, 4 Weeks)

A pronunciation lexicon is just a CSV: one column for the canonical English spelling, one or more columns for IPA pronunciation variants. Most modern ASR systems accept lexicon files at request time as a "biasing" hint. The trick is choosing which words to include.

We started with a 200-word seed list: the words our test corpus showed misrecognised at a >40% rate. We added pronunciation variants by listening to the corpus recordings and writing down what we actually heard. A linguistics-trained intern (₹35K for 4 weeks, recruited from a Delhi University M.A. programme) did the IPA transcription.

| Lexicon entry | English spelling | Standard IPA | Our additional variant(s) |
|---|---|---|---|
| "very" | very | /ˈvɛri/ | /ˈwɛri/, /ˈbɛri/ |
| "think" | think | /θɪŋk/ | /t̪ɪŋk/, /t̪iŋk/ |
| "ship" | ship | /ʃɪp/ | /sɪp/, /ʃiːp/ |
| "pleasure" | pleasure | /ˈplɛʒər/ | /ˈplɛʤər/, /ˈplɛjar/ |
| "office" | office | /ˈɒfɪs/ | /ˈɒphɪs/ (aspiration), /ˈɔːfɪs/ |
| "school" | school | /skuːl/ | /ɪskuːl/ (epenthetic vowel) |
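To make the "just a CSV" point concrete, here is a small sketch of parsing such a file into a word-to-pronunciations map. The column names (`word`, `standard_ipa`, `variants`) and the pipe separator between variants are assumptions for illustration; the post does not specify the exact file layout, and your ASR provider may require a different one.

```python
import csv
import io
from typing import Dict, List

# Hypothetical layout: one row per word, variants separated by "|".
SAMPLE_CSV = """word,standard_ipa,variants
very,ˈvɛri,ˈwɛri|ˈbɛri
think,θɪŋk,t̪ɪŋk|t̪iŋk
school,skuːl,ɪskuːl
"""


def parse_lexicon(csv_text: str) -> Dict[str, List[str]]:
    """Return {word: [standard pronunciation, variant, ...]} without duplicates."""
    lexicon: Dict[str, List[str]] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        word = (row.get("word") or "").strip().lower()
        if not word:
            continue  # skip blank or malformed rows
        candidates = [(row.get("standard_ipa") or "").strip()]
        candidates += [v.strip() for v in (row.get("variants") or "").split("|")]
        entry = lexicon.setdefault(word, [])
        for pronunciation in candidates:
            if pronunciation and pronunciation not in entry:
                entry.append(pronunciation)
    return lexicon


lexicon = parse_lexicon(SAMPLE_CSV)
for word, pronunciations in lexicon.items():
    print(word, "->", ", ".join(f"/{p}/" for p in pronunciations))
```

Keeping the standard pronunciation first in each list means a consumer can always fall back to it if it chooses to ignore the variants.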

Note the "school" entry in the table above. Hindi phonotactics disallow word-initial /sk/ clusters, so speakers insert an epenthetic /ɪ/: "ischool". This is not a deficit; it is a regular phonological process. The lexicon needs to know about it.

## The Bias Configuration (How We Wired It Up)

We use Sarvam Saaras V3 in production. The provider exposes two configuration knobs that mattered:

```jsonc
{
  "model": "saaras-v3",
  "language": "en-IN",
  "biasing": {
    "phrases": [
      {"phrase": "office", "boost": 1.4},
      {"phrase": "school", "boost": 1.4},
      {"phrase": "very", "boost": 1.3}
      // ... 6,400 entries total, weighted by frequency
    ],
    "pronunciation_lexicon": "s3://talkdrill-bucket/lexicons/hindi-en-tier2-v3.csv",
    "lexicon_weight": 0.6
  },
  "diarization": false,
  "punctuation": true
}
```

Two settings. `biasing.phrases` is a per-word boost, telling the decoder that "office" is more probable than "off-his" in our context. `pronunciation_lexicon` is the per-word phoneme variant list. The `lexicon_weight` is critical: too high and you over-bias toward variants even when speakers use standard pronunciation. We tested 0.3, 0.5, 0.6, and 0.8; 0.6 had the best WER on a held-out validation set.

## The Results, Broken Down By Phoneme Class

Two observations. The /v-w/ and /j-z/ classes saw the biggest relative improvements (50% and 42%) because they had the cleanest variant patterns. The /sh-s/ class improved least (27%) because the speaker-to-speaker variation was higher: some speakers consistently used /s/, others /sh/, others alternated. A second iteration of the lexicon would split this class by speaker region.

## What Failed (Worth Documenting)

Failed #1: Fine-tuning Whisper. We tried a 12-hour fine-tune of Whisper-medium on our 600-utterance corpus + augmented data. WER moved from 24.1% to 22.8%, a marginal gain. The problem: 600 utterances is too small for fine-tuning to outperform a hand-crafted lexicon. We would need ~10,000 utterances to make fine-tuning worthwhile.

Failed #2: Noise reduction preprocessing. Hypothesis: tier-2 city recordings have more background noise; reduce noise, improve WER. Reality: aggressive noise reduction also stripped pronunciation information that the decoder needed. WER got worse by 1.4 points. We use mild noise reduction now (RNNoise at 0.4 strength) only when SNR is below 12 dB.

Failed #3: N-gram language model rescoring. We tried building a domain-specific 4-gram LM on our app's transcripts and rescoring the ASR n-best output. WER barely moved (-0.3 points). Saaras V3's own LM is already strong for English; our LM was redundant.

## The Pre-Ship Checklist (Domain-Specific ASR)

- Test corpus collected from production-realistic conditions (not studio)
- Per-speaker diversity: minimum 20 speakers across 3+ regional contexts
- Hand-transcribed by 2 reviewers + adjudicator, ≥ 0.92 inter-annotator agreement
- Baseline WER established on stock ASR, broken down by phoneme class
- Top 4-5 error categories identified and prioritised by frequency × severity
- Lexicon entries written by someone with IPA training (not vibes)
- Lexicon weight tuned on held-out validation, not the test set
- Per-class WER tracked over time, not just average WER
- Production fallback if lexicon load fails; never serve a 500
- Quarterly re-collection of test corpus as user base shifts
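The "production fallback" item in the checklist above is the easiest to skip, so here is a hedged sketch of one way to do it: build the request config with the lexicon when it is available, and degrade to the stock configuration when it is not. The field names mirror the config shown earlier in this post; `build_asr_config` and the `resolve_lexicon` callable are hypothetical helpers, not part of any provider SDK.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger("asr.config")


def build_asr_config(resolve_lexicon: Callable[[], str],
                     lexicon_weight: float = 0.6) -> Dict[str, Any]:
    """Return a request config, with lexicon biasing only if it resolves.

    resolve_lexicon is a hypothetical callable that checks the lexicon file
    is reachable and returns its location, raising on any failure.
    """
    config: Dict[str, Any] = {
        "model": "saaras-v3",
        "language": "en-IN",
        "diarization": False,
        "punctuation": True,
    }
    try:
        lexicon_uri = resolve_lexicon()
    except Exception:  # deliberately broad: any failure means "serve stock ASR"
        logger.warning("lexicon unavailable, using stock ASR", exc_info=True)
        return config
    config["biasing"] = {
        "pronunciation_lexicon": lexicon_uri,
        "lexicon_weight": lexicon_weight,
    }
    return config


def broken_resolver() -> str:
    raise FileNotFoundError("lexicon object missing")


def working_resolver() -> str:
    return "s3://talkdrill-bucket/lexicons/hindi-en-tier2-v3.csv"


degraded = build_asr_config(broken_resolver)
biased = build_asr_config(working_resolver)
print("biasing" in degraded, "biasing" in biased)
```

The trade-off is explicit: a degraded request gives the user stock-ASR accuracy for a while, which is worse than the lexicon but far better than an error page. Alerting on the warning log keeps the degradation from going unnoticed.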

## Common Mistakes (Each One Hurts)

Symptom: "WER got better on test but worse in production." Cause: test corpus too narrow. Fix: collect from a broader speaker pool, including speakers you did not target.

Symptom: "Some words are now over-corrected." Cause: lexicon weight too high. Fix: lower from 0.6 to 0.4 and re-test.

Symptom: "Latency per call went up." Cause: lexicon biasing has a small per-call latency cost. Sarvam adds ~40 ms per call with our 6,400-phrase list. Acceptable for our use case; might not be for streaming.

Symptom: "WER different for different cities." Cause: regional phoneme patterns vary. Fix: per-region lexicon variants if your scale justifies it. We do not; the unified lexicon is good enough at 5,000 users.

Symptom: "Lexicon updates are scary." Cause: no A/B mechanism. Fix: version the lexicon (we use semver: v3.1.4) and route 5% of traffic to new versions before promoting.

## When NOT To Build A Custom Lexicon

Skip this work if (a) your stock-ASR WER is already below 12% (diminishing returns), (b) your audience is metro tier-1 only (Saaras and Whisper handle this well), or (c) you cannot collect a test corpus (without ground truth you are guessing). The collection cost (₹12,000 in our case for the corpus, plus 30 person-hours) is the smallest part of the budget but the largest part of the value.

## Real Example: The First Patna User Who Came Back

In late August, before the lexicon was deployed, a 26-year-old user from Patna left a one-star review: "Bot does not understand my English. I left." We saw the review, traced her sessions, and found 40+ examples of /v-w/ misrecognition. After the lexicon went live, we sent her a polite email asking if she would try again. She did. Her WER on the same prompts dropped from 31% to 14%. She is a paying user as of October. Speech recognition that fails for tier-2 speakers is not "high-quality" in any meaningful sense; it is unfit for the Indian market.
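Returning to the "Lexicon updates are scary" fix in the common mistakes above: one simple way to route a fixed share of traffic to a new lexicon version is deterministic hashing on a stable user identifier. This is a sketch under our own assumptions; the function name, the use of SHA-256, and the canary version string `v3.2.0` are illustrative and not a description of the production router.

```python
import hashlib


def lexicon_version_for(user_id: str,
                        stable: str = "v3.1.4",
                        canary: str = "v3.2.0",  # hypothetical next version
                        canary_percent: float = 5.0) -> str:
    """Pick a lexicon version for a user, stable across sessions.

    Hashing the user id (instead of random sampling per request) keeps each
    user on one version, so per-user WER comparisons stay meaningful.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return canary if bucket < canary_percent * 100 else stable


# Rough check of the split on synthetic user ids.
users = [f"user-{n}" for n in range(10_000)]
canary_share = sum(lexicon_version_for(u) == "v3.2.0" for u in users) / len(users)
print(f"canary share: {canary_share:.1%}")
```

Promotion then becomes a config change: raise `canary_percent` in steps while watching per-class WER, and set it back to 0 to roll back.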
## A Detail That Saved Us On Day 23

On day 23, a developer noticed our WER measurements were drifting upward day-over-day in production. Investigation found the lexicon was being re-loaded from S3 on every cold start of our ASR worker, and the cache TTL was too short: workers were spending 12% of their CPU on lexicon parsing. Fix: load once on worker startup, hold in memory, refresh only on an explicit lexicon-version-bump signal. CPU dropped, WER stabilised. The lesson: production ML systems leak in operational ways, not just in model accuracy.

## How This Connects To The Wider TalkDrill Stack

The lexicon is one piece. Voice tutoring on TalkDrill also depends on streaming WebRTC (covered in our 800 ms latency post), pronunciation scoring (covered in our pronunciation scoring post), and the Claude Sonnet 4.5 conversational layer with prefix caching. The lexicon is the boring infrastructure piece, but it is the piece that determines whether tier-2 speakers feel respected by the product or not. We have a similar discipline for the Hindi essay grader on PenLeap, our in-house edtech product.

## FAQ

### Did you consider Whisper-Hindi v2 instead?

Yes. Whisper-Hindi v2 hits ~5% WER on FLEURS, but FLEURS is clean Hindi. Our problem is Hindi-influenced English in noisy conditions. Different problem; different solution.

### Why not use IndicWhisper or AI4Bharat models?

We tested AI4Bharat IndicConformer at the start of the project. WER was 28%, slightly worse than the Saaras V3 baseline. The model is excellent for pure Hindi; for code-mixed and Hindi-influenced English, Saaras V3 + lexicon was better.

### How much did Sarvam cost vs alternatives?

Saaras V3 was ₹0.41 per minute at our volume. Whisper-medium self-hosted on a g4dn.xlarge was ₹0.18 per minute compute + the operational overhead. We chose Saaras for the API ergonomics; the cost difference was acceptable at our scale.

### How do you handle code-switching (Hindi-English mixing)?

Saaras V3 handles code-mixed audio without explicit configuration. Our lexicon biases the English half. The Hindi half goes through Saaras's native Hindi recognition. We have not had to do anything special.

### What about other Indian languages?

We started with Hindi because it covers the largest speaker base. The same lexicon-build methodology works for Tamil, Bengali, Telugu, Marathi; each language has its own characteristic English-influence patterns. Building a Tamil lexicon would be the next 6-week project.

### Did the lexicon reduce response latency?

No. It increased latency slightly, by about 40 ms per call. That is an acceptable trade-off for the quality lift. If you are streaming and every ms matters, weigh the cost.

### What happens when ASR vendors update their models?

We re-baseline whenever Saaras releases a new version. Sometimes the lexicon helps less (the new model handles more on its own); sometimes more. We have a quarterly re-baselining cadence in our ops calendar.

### Where can I learn more about Indic ASR research?

Start with the [Sarvam evaluation post](https://www.sarvam.ai/blogs/evaluating-indian-language-asr) (their honest discussion of WER limitations) and the [Voice of India benchmark paper](https://arxiv.org/html/2604.19151v1) for the academic baseline.

Need a domain-specific speech evaluation engine?

We build production speech pipelines for Indian-language voice products: ASR + lexicon + pronunciation scoring + LLM downstream. Fixed-scope projects, ₹3-7 lakh depending on language and corpus size. First call is with the engineer who would lead your build. Email contact@softechinfra.com or use the contact form.

Tags: Voice AI, TalkDrill, ASR, Hindi, Sarvam, Pronunciation Lexicon, Indic NLP

Hrishikesh Baidya

CTO at Softechinfra specializing in Python, system architecture, and building secure, scalable software solutions.
