When a TalkDrill user from Hyderabad says "very good" and the model marks /v/ as an error because it heard /w/, we are punishing a Telugu speaker for being a Telugu speaker. Last quarter we logged 4,700 such false positives across 5,000+ active users — every one of them an unhappy session. Fixing this is not a model swap. It is a phoneme-level audit, a regionalised acoustic profile per user, and a human-rater calibration loop that never lets the model drift back. This post is exactly how we did it.
- 4,700: False-positive accent flags in Q3 2025
- 11: L1-language acoustic profiles maintained
- 0.91: Pearson correlation with human MOS raters
- 68: Phoneme substitution rules in the production table
## The Answer in 60 Words
TalkDrill scores pronunciation using phoneme-level forced alignment of the user's audio against the target phoneme sequence. Each phoneme gets a confidence score; substitutions known to be regional accent variants are not penalised. The rules come from a substitution table we built from documented L1 interference patterns. Human MOS raters re-calibrate the table monthly. Pearson correlation against rater scores: 0.91.
## Why This Matters Now
[Sarvam AI's Saaras V3 model](https://www.sarvam.ai/blogs/asr) — released in early 2026 — was benchmarked on Svarah, a 9.6-hour dataset capturing 117 speakers across 65 districts in 19 Indian states. Their conclusion: Western-trained ASR models systematically penalise Indian-English speakers, and the gap widens for Tier-2/3 city accents. The same dynamic applies one step downstream — pronunciation scoring on top of any ASR model that wasn't trained on Indian phonology.
[TalkDrill](https://talkdrill.com) — our in-house English-speaking app for Indian adults, with 5,000+ active users — saw this exact problem in Q3 2025. The model said "wrong" and the user said "but every Bangalorean says it that way." The fix was not a louder model; it was a lookup table the model consulted before scoring.
## The Shape of the Problem (One Diagram, Not Three)
A pronunciation scorer takes (a) the user's audio, (b) the target text, and (c) the target phoneme sequence (a CMU-style transcription of the text). It produces a per-phoneme score. The score asks: how confident is the acoustic model that the audio at position N matched the expected phoneme at position N?
A naive scorer fails for Indian-English speakers because the acoustic model was trained predominantly on US/UK speakers, and the expected phoneme sequence was generated from a US/UK pronunciation dictionary. Both ends of the comparison are wrong for our users. Two fixes in series:
- Acoustic-model fine-tuning on Indian-accented audio so the model is not surprised by /v/-as-/w/.
- Substitution rules so that when a detected phoneme is a known regional substitution, the score is preserved.
## The Phoneme Substitution Table (The Most Useful Artefact in This Post)
This is a slimmed extract of the production table. We maintain 68 rows; below are the most-fired ones. The "fire rate" column is how often each rule triggered across our last 100k scoring calls in October 2025.
| Source phoneme (target) | Substituted phoneme (heard) | L1 language(s) | Example word | Fire rate (per 1k calls) |
|---|---|---|---|---|
| /v/ | /w/ | Telugu, Kannada | "very" → "wery" | 38 |
| /w/ | /v/ | Hindi, Bengali | "with" → "vith" | 27 |
| /θ/ (think) | /t̪/ (dental t) | Hindi, Marathi, Bengali | "think" → "tink" | 52 |
| /ð/ (this) | /d̪/ (dental d) | Hindi, Marathi, Gujarati | "this" → "dis" | 49 |
| /z/ | /dʒ/ | Hindi, Punjabi | "zoo" → "joo" | 21 |
| /ʃ/ | /s/ | Bengali (regional), Tamil | "ship" → "sip" | 18 |
| /r/ (alveolar) | /ɾ/ (tap) | Hindi, Punjabi, Marathi | "red" → tapped /r/ | 44 |
| /p/ | /pʰ/ (aspirated) | Hindi, Marathi (initial position) | "pen" → "phen" | 14 |
| schwa /ə/ | /a/ (open) | Tamil, Malayalam | "about" → "ahbout" | 33 |
| final /d/ | /t̪/ | Hindi, Punjabi | "good" → "goot" | 17 |
| /æ/ (cat) | /ɛ/ (bet) | Bengali, Odia | "cat" → "ket" | 22 |
| /ɒ/ (pot) | /o/ | Most Indian L1s | "pot" → "poht" | 29 |
The table is sourced from three places: (1) published phonological-interference research on Indian English, (2) our own log analysis across Q2 and Q3 2025, and (3) the human-rater calibration loop that runs monthly. It is not derived from the model. It is what the model consults when deciding "is this a real error or an acceptable regional variant?"
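To make the "consults" step concrete, here is a minimal sketch of the lookup, assuming each row is indexed by (target, heard, L1). The `SubstitutionRule` layout and the `RULES` list are illustrative assumptions, not the production schema; the rows are copied from the extract above.

```python
from typing import NamedTuple


class SubstitutionRule(NamedTuple):
    target: str       # phoneme the dictionary expects
    heard: str        # phoneme the acoustic model detected
    l1s: frozenset    # L1 languages for which the variant is accepted


# A few rows from the extract above; the layout itself is an assumption.
RULES = [
    SubstitutionRule("v", "w", frozenset({"telugu", "kannada"})),
    SubstitutionRule("w", "v", frozenset({"hindi", "bengali"})),
    SubstitutionRule("z", "dʒ", frozenset({"hindi", "punjabi"})),
    SubstitutionRule("ʃ", "s", frozenset({"bengali", "tamil"})),
]

# Flatten once so each scoring call is a constant-time set lookup.
_INDEX = {(r.target, r.heard, l1) for r in RULES for l1 in r.l1s}


def is_known_substitution(target: str, substituted: str, l1: str) -> bool:
    """Return True if target -> substituted is an accepted variant for this L1."""
    return (target, substituted, l1.strip().lower()) in _INDEX
```

Because the rule is keyed on L1, the same /v/ → /w/ substitution is accepted for a Telugu speaker and still flagged for a Hindi speaker, which mirrors the L1 column in the table.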
## The Scoring Pipeline (Walk-Through)
1. Capture + VAD: Browser/RN client captures 16 kHz mono audio. WebAssembly VAD trims silence client-side so we never pay to transcribe quiet. Saves ~22% of inference cost.
2. ASR + alignment: Faster-Whisper large-v3 fine-tuned on ~140 hours of Indian-accented audio we labelled in 2024-25. Output includes per-token timestamps used downstream.
3. Forced phoneme alignment: Montreal Forced Aligner (MFA) maps the audio to the target phoneme sequence using a CMU-style dictionary we extended for Indian English variants.
4. Per-phoneme confidence: For each aligned phoneme, the acoustic model emits a posterior probability. Below a per-phoneme threshold, it is flagged for review against the substitution table.
5. Substitution check: If the substitution matches a row in the table for the user's declared L1, the score is preserved. Otherwise, the phoneme is marked as an error and surfaced in the UI.
6. MOS aggregation: Per-phoneme scores aggregate to per-word, then per-sentence, then a final 1-5 MOS-style score using a weighted formula tuned against human raters.
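The aggregation in the last step can be sketched as below. The production formula and its weights are not given in this post, so the plain means, the optional word weights, and the linear mapping onto 1-5 are all assumptions chosen for illustration.

```python
from statistics import fmean
from typing import Optional, Sequence


def word_score(phoneme_scores: Sequence[float]) -> float:
    """Mean of per-phoneme scores in [0, 1] (a plain mean is an assumption)."""
    return fmean(phoneme_scores)


def sentence_score(
    word_scores: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted mean of word scores; equal weights unless supplied."""
    if weights is None:
        weights = [1.0] * len(word_scores)
    return sum(s * w for s, w in zip(word_scores, weights)) / sum(weights)


def to_mos(score: float) -> float:
    """Map a [0, 1] score linearly onto the 1-5 range (illustrative mapping)."""
    clamped = min(max(score, 0.0), 1.0)
    return round(1.0 + 4.0 * clamped, 2)


# Two words, each a list of per-phoneme scores.
words = [[0.9, 0.8, 0.78], [0.95, 0.6]]
mos = to_mos(sentence_score([word_score(w) for w in words]))
```

A tuned formula would likely replace the equal weights (for example, weighting content words more heavily), but the per-phoneme to per-word to per-sentence shape stays the same.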
## The MOS Calibration Loop (The Bit Without Which the Whole Thing Drifts)
Mean Opinion Score (MOS) is a 1-5 rating commonly used in speech evaluation. The benchmark is human raters listening to the same audio and scoring it. Our model's output has to track theirs.
Every month, we sample 200 audio clips at random from production usage. Three trained human raters — two are linguists, one is an English teacher who works with adult learners — score each clip on the same 1-5 scale. We compute Pearson correlation between the model's score and the rater median. The current run: 0.91.
When correlation drops below 0.85, we trigger a model retrain plus a substitution-table review. Drift typically comes from three sources:
- A new regional L1 entering production (we added Odia and Assamese in August 2025; correlation initially dropped to 0.81 because the substitution table had only three Odia rows).
- A model upgrade (the Whisper large-v3 upgrade in July 2025 changed the posterior distribution; thresholds had to be re-tuned).
- Audio-condition drift (a meaningful shift in the noise profile from new users on cheaper Android handsets).
The full calibration runbook is 6 pages and ships as a versioned PDF in our internal wiki. We rerun it the first Monday of every month.
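The core arithmetic of the monthly check is small. The sketch below computes Pearson correlation between model scores and the median of three rater scores, then applies the 0.85 trigger described above. The function names and the input layout are assumptions for illustration.

```python
from math import sqrt
from statistics import fmean, median

RETRAIN_THRESHOLD = 0.85  # from the post: below this, retrain and review the table


def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def calibration_check(model_scores, rater_scores):
    """rater_scores holds one (r1, r2, r3) tuple per clip.

    Returns (correlation, needs_retrain).
    """
    medians = [median(r) for r in rater_scores]
    r = pearson(model_scores, medians)
    return r, r < RETRAIN_THRESHOLD
```

Using the rater median rather than the mean keeps a single outlying rater from moving the reference score for a clip.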
## How Indian-English Phonology Differs (and Why ASR Models Punish It)
The published research on Indian-English phonology converges on a small set of well-documented variants. The shorter the audio sample, the more these matter — single-word drills are where false-positives spike.
- Retroflex vs alveolar consonants: Indian English speakers replace alveolar /t/ and /d/ with retroflex /ʈ/ and /ɖ/, especially mid-word. Most ASR models trained on US/UK audio miss this and confidence scores collapse.
- Aspirated initial stops: Hindi/Marathi-L1 speakers add aspiration to /p/, /t/, /k/ in initial position. The model hears /pʰ/ instead of /p/ and either mis-transcribes or scores the phoneme as low confidence.
- Final consonant cluster simplification: "asked" → "ast" with the /k/ dropped, common across many South Indian L1s. The model's expected phoneme sequence has 4 phonemes (/æ s k t/); the user produces 3. A naive scorer marks one phoneme as missing.
- Schwa colouring: The English schwa /ə/ collapses to a more open vowel for Tamil/Malayalam speakers. The model marks the phoneme as wrong; the user sounds completely intelligible to a Bangalore listener.
- Dental vs interdental: /θ/ (think) and /ð/ (this) are not present in most Indian L1s. Speakers substitute dental /t̪/ and /d̪/. This is the #1 false-positive in our logs.
The root pattern: ASR models trained on US/UK audio assume a phoneme set and a spectral distribution that does not match Indian English. Without explicit handling, the model marks 30-50% of fluent Indian-English speech as "errors". Our internal target: false-positive rate below 4% on the regional-substitution categories.
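The cluster-simplification case is easy to see symbolically. The sketch below uses Python's standard `difflib` to diff an expected phoneme sequence against a produced one; it is only an illustration of what a naive scorer "sees", not how a forced aligner works internally. The ARPAbet transcription of "asked" and the helper name `diff_phonemes` are assumptions for the example.

```python
from difflib import SequenceMatcher


def diff_phonemes(expected, produced):
    """Return (op, expected_slice, produced_slice) for every non-matching span."""
    matcher = SequenceMatcher(None, expected, produced, autojunk=False)
    return [
        (op, expected[i1:i2], produced[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


expected = ["AE", "S", "K", "T"]  # "asked" in CMU-style ARPAbet
produced = ["AE", "S", "T"]       # final cluster simplified, /k/ dropped

print(diff_phonemes(expected, produced))
# [('delete', ['K'], [])]
```

A deletion like this, or a `replace` such as V → W, is exactly the kind of span that can be checked against the substitution table before it is reported as an error.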
## The Implementation in Code (Real Sketch)
This is the per-phoneme scoring function — annotated, slightly trimmed.
```python
def score_phoneme(
    aligned_phoneme: AlignedPhoneme,
    target_phoneme: str,
    user_l1: str,
    threshold: float = 0.62,
) -> PhonemeScore:
    """Score a single aligned phoneme against its target."""
    # 1. Posterior probability from the acoustic model
    confidence = aligned_phoneme.posterior_prob
    if confidence >= threshold:
        # Phoneme matched cleanly. Done.
        return PhonemeScore(value=confidence, status="correct")

    # 2. Below threshold: could be error, could be regional substitution
    detected_phoneme = aligned_phoneme.argmax_phoneme

    # Check the substitution table for this user's L1
    if is_known_substitution(
        target=target_phoneme,
        substituted=detected_phoneme,
        l1=user_l1,
    ):
        # Acceptable regional variant. Award the target score with a small
        # confidence haircut so we still distinguish from a perfect production.
        return PhonemeScore(
            value=max(confidence, 0.78),
            status="acceptable_variant",
            note=f"{target_phoneme} → {detected_phoneme} (regional)",
        )

    # 3. Genuine error
    return PhonemeScore(value=confidence, status="error")
```
The function is under 30 lines including comments. The complexity is in the data: the substitution table, the L1 declaration, and the posterior threshold tuning per phoneme class. We refresh the substitution table monthly from the calibration loop output and ship it as a versioned JSON.
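As a rough picture of what a versioned JSON table could look like on disk, here is a sketch with a loader that rejects duplicate rules. The field names (`version`, `rules`, `target`, `heard`, `l1`) and the version string are hypothetical; the production file format is not shown in this post.

```python
import json

# Hypothetical file layout; field names are assumptions, not the production schema.
EXAMPLE_TABLE_JSON = """
{
  "version": "2025-10",
  "rules": [
    {"target": "v", "heard": "w", "l1": ["telugu", "kannada"]},
    {"target": "z", "heard": "dʒ", "l1": ["hindi", "punjabi"]}
  ]
}
"""


def load_substitution_table(text: str):
    """Parse the JSON and return (version, set of (target, heard, l1) triples)."""
    doc = json.loads(text)
    triples = set()
    for rule in doc["rules"]:
        for l1 in rule["l1"]:
            triple = (rule["target"], rule["heard"], l1.lower())
            if triple in triples:
                # Duplicates usually mean a bad merge of two table revisions.
                raise ValueError(f"duplicate rule: {triple}")
            triples.add(triple)
    return doc["version"], triples


version, table = load_substitution_table(EXAMPLE_TABLE_JSON)
```

Carrying the version alongside the rules makes it possible to log which table revision produced a given score, which helps when a calibration run needs to be traced back.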
## The Calibration Runbook (Condensed to One Side of A4)
1. Sample 200 audio clips across L1s: A stratified sample, so every L1 with at least 50 users that month gets at least 12 clips. Pulled from a Postgres view that filters out clips users marked as "I want to delete this".
2. Distribute to 3 raters via a private dashboard: Each rater scores blind; they do not see the model's score. Each clip is scored by all three raters. Inter-rater agreement (Krippendorff's alpha) is tracked across calibrations; below 0.7 we re-train raters before trusting the run.
3. Compute Pearson correlation against model output: Per L1 and overall. Plot it. Stash in S3 with the run timestamp. Below 0.85 overall = trigger investigation.
4. Drill into outliers: Cases where rater median was ≥4 and model scored ≤2 are the false-positive class. Cases where rater median was ≤2 and model scored ≥4 are the false-negative class. Both sets get reviewed by our linguist (usually 8-12 clips per category).
5. Update the substitution table or thresholds: Add new substitution rules if the linguist finds a missing rule. Tune posterior thresholds per phoneme if the model is systematically off in one direction. Both changes go through a code review with regression tests against the held-out clip set.
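Steps 1 and 4 of the runbook can be sketched in a few lines. The allocation below gives every eligible L1 the 12-clip floor and spreads the rest in proportion to user counts. The proportional split and the exclusion of L1s under 50 users are assumptions, since the runbook only states the floor. The outlier rule follows the thresholds given in step 4.

```python
def allocate_clips(users_per_l1, total=200, min_users=50, floor=12):
    """Return {l1: clip_count} summing to `total` across eligible L1s."""
    eligible = {l1: n for l1, n in users_per_l1.items() if n >= min_users}
    if floor * len(eligible) > total:
        raise ValueError("floor is too high for the requested total")
    alloc = {l1: floor for l1 in eligible}
    user_total = sum(eligible.values())
    if user_total == 0:
        return alloc
    remaining = total - floor * len(eligible)
    shares = {l1: remaining * n / user_total for l1, n in eligible.items()}
    for l1, share in shares.items():
        alloc[l1] += int(share)
    # Largest-remainder rounding so the counts add up exactly.
    leftover = total - sum(alloc.values())
    by_fraction = sorted(shares, key=lambda k: shares[k] - int(shares[k]), reverse=True)
    for l1 in by_fraction[:leftover]:
        alloc[l1] += 1
    return alloc


def classify_outlier(rater_median, model_score):
    """Label a clip per step 4; returns None when it is not an outlier."""
    if rater_median >= 4 and model_score <= 2:
        return "false_positive"
    if rater_median <= 2 and model_score >= 4:
        return "false_negative"
    return None
```

For example, `allocate_clips({"hindi": 300, "telugu": 100, "odia": 40})` leaves Odia out (under 50 users) and splits the 200 clips between Hindi and Telugu with each getting at least 12.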
## When Not to Build a Custom Pronunciation Scorer
Skip this whole stack if (a) your users speak a single, well-represented accent — say, all-US adult learners on a corporate training programme, where Whisper out-of-the-box scores well above 0.9 correlation, (b) you can route audio to a managed API like [AssemblyAI's pronunciation feature](https://www.assemblyai.com/blog/best-api-models-for-real-time-speech-recognition-and-transcription) with acceptable per-minute cost, or (c) your product cares about fluency at the conversational level and not phoneme-level pronunciation. For TalkDrill, phoneme-level was non-negotiable because that is what learners ask for: "tell me which sound I got wrong." For a sales-coaching product, sentence-level is enough.
## A Real Calibration: October 2025 Run
In October's calibration, the overall Pearson correlation was 0.91. Two L1s under-performed: Bengali at 0.86 and Assamese at 0.79. The Assamese drop was expected — we had added the L1 only seven weeks earlier and the substitution table had three rows for Assamese where for Hindi we have eleven. The linguist reviewed 14 outlier clips, identified four new substitution patterns specific to Assamese, and we shipped the table update in mid-November. November's correlation for Assamese: 0.88.
The Bengali drop was a surprise — Bengali had been live for a year. The linguist spotted that the issue was not phoneme-level. Three clips had heavy regional dialect variants from rural West Bengal where the spectral envelope was different enough to confuse the acoustic model entirely. We added 22 hours of rural Bengali audio to the next fine-tune; correlation moved back to 0.90 in November.
For community discussion of this kind of pattern, see [r/MachineLearning's ASR threads](https://www.reddit.com/r/MachineLearning/). Indian engineers building speech systems are the most candid source on what fails and why.
## Common Mistakes (Each One Costs Real Users)
Symptom: 30%+ of phonemes flagged as errors for a fluent user. Cause: substitution table missing the user's L1 or posterior threshold set too high. Fix: drop threshold to 0.55, check the L1 has at least 8 substitution rules in the table.
Symptom: model upgrade tanks correlation. Cause: posterior distribution shifted. Fix: re-tune thresholds per phoneme class against the held-out clip set before promoting the new model to production. We learned this the hard way in July 2025.
Symptom: rater agreement (Krippendorff's alpha) below 0.7. Cause: rater confusion about scoring criteria, or audio quality so poor that no rater can score consistently. Fix: re-train raters with a fresh 30-clip annotated set; filter audio quality at ingestion (drop clips below SNR 18 dB).
Symptom: false-positive rate climbs over weeks. Cause: a new user demographic — different handsets, different network conditions — is changing the audio distribution. Fix: stratify the next calibration by handset model and SIM operator. Often the fix is in audio preprocessing, not in the scoring layer.
Symptom: schwa-marked words always score badly. Cause: schwa is acoustically weak and confidence scores are inherently low. Fix: lower the threshold for schwa specifically (we use 0.42 for /ə/ vs 0.62 for stops). This is a standard phonetic-modelling concession — schwa is famously difficult for acoustic models.
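Per-class thresholds can be expressed as a small lookup that feeds the `threshold` argument of the scoring function. The 0.42 and 0.62 values come from the fix above; the class membership map and the default for unlisted phonemes are assumptions for illustration.

```python
# Values for schwa and stops are from the post; everything else is assumed.
CLASS_THRESHOLDS = {"schwa": 0.42, "stop": 0.62}
DEFAULT_THRESHOLD = 0.62

# Hypothetical phoneme-to-class map covering only the classes named above.
PHONEME_CLASS = {
    "ə": "schwa",
    "p": "stop", "b": "stop",
    "t": "stop", "d": "stop",
    "k": "stop", "g": "stop",
}


def threshold_for(phoneme: str) -> float:
    """Return the posterior threshold for a phoneme, falling back to the default."""
    phoneme_class = PHONEME_CLASS.get(phoneme)
    return CLASS_THRESHOLDS.get(phoneme_class, DEFAULT_THRESHOLD)
```

A caller would then pass `threshold=threshold_for(target_phoneme)` instead of relying on the single default.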
## A Question We Get from Edtech Founders
Why not just call a managed API like Sarvam or [Deepgram's Indian-English model](https://deepgram.com/) and skip the work?
We have evaluated both. [Sarvam's Saaras V3](https://www.sarvam.ai/blogs/asr) produces excellent ASR for Indian English and is now part of our consideration set for new pipelines. The reason we still own the scoring layer: pronunciation scoring is not just transcription. It is the comparison between what was said and what was expected, with a per-phoneme verdict. Managed APIs give you a transcript and a confidence number; they do not give you "the /θ/ in 'think' was rendered as /t̪/ which is a Hindi-L1 variant we accept." That logic lives in the substitution table and the calibration loop, and that is the bit we will not outsource.
For deeper coverage of how we run voice AI infrastructure end-to-end, see our [companion piece on cutting TalkDrill's round-trip latency to 740 ms on Indian 4G](/blog/talkdrill-voice-ai-740ms-latency-indian-4g), our 2025 deep-dive on [MLOps for production AI systems](/blog/ai-operations-mlops), and our [mobile development service line](/services/mobile-development) where this scoring engine ships as part of the React Native app.
If you'd rather we just build the scoring layer for your domain, [we ship it as a fixed-scope 6-week engagement →](/contact?service=ai).
## FAQ
### How accurate is TalkDrill's pronunciation scoring compared to a human teacher?
Pearson correlation against a panel of three trained human raters: 0.91 overall in October 2025. Per L1, it ranged from 0.79 (Assamese, added only seven weeks before that run) to 0.94 (Hindi-L1 with high audio quality); Bengali sat at 0.86. We rerun the calibration monthly and surface drift the same week.
### Why not just use OpenAI Whisper or Google's Speech-to-Text directly?
Both give a transcript and overall confidence. Neither gives per-phoneme scores aligned to the target sequence. To get pronunciation scoring you need forced alignment plus an acoustic posterior, which means you keep the model in your stack — at minimum a fine-tuned [Faster-Whisper](https://github.com/SYSTRAN/faster-whisper) plus [Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/).
### What's the false-positive rate on the substitution table?
In production, false-positives on the regional-substitution categories sit below 4%. False-positives on novel substitution patterns (a user from a rare L1 we have not seen before) can spike to 20%+ during their first session — we surface a "tell us your native language" prompt early to help the table apply.
### How do you handle code-switching between English and Hindi?
The forced aligner expects a target phoneme sequence per language. Code-switched audio is split at language boundaries by a short phone-level language ID model, then aligned per-segment. Scores are computed per language and aggregated. Accuracy on code-switched audio is ~6 percentage points below pure English and is an ongoing area of work.
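The splitting step can be pictured as grouping consecutive phones by their language label. The sketch below assumes the language-ID model has already tagged each phone with a label such as `"en"` or `"hi"`; that input format and the function name are hypothetical.

```python
from itertools import groupby


def split_by_language(phones):
    """Group time-ordered (phone, lang) pairs into contiguous per-language segments."""
    return [
        (lang, [phone for phone, _ in group])
        for lang, group in groupby(phones, key=lambda item: item[1])
    ]


tagged = [("DH", "en"), ("IH", "en"), ("S", "en"), ("k", "hi"), ("a", "hi"), ("G", "en")]
segments = split_by_language(tagged)
# [('en', ['DH', 'IH', 'S']), ('hi', ['k', 'a']), ('en', ['G'])]
```

Each segment can then go to the aligner with the matching dictionary, and the per-segment scores are aggregated afterwards.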
### Does the system work for accents we haven't trained on?
It degrades gracefully. The substitution table only fires for L1s with rules; for an unseen L1 the user gets a stricter (closer to US/UK reference) score. We surface this honestly in the UI: "we are still learning your accent — pronunciation feedback may be over-strict."
### What does monthly calibration cost in human time?
Three raters at 90 minutes each = 4.5 hours of rater time. Linguist review of outliers = 2-3 hours. Engineering time to update tables and ship the new release = 4-6 hours. Total: ~10.5-13.5 hours per month, distributed across one calibration window.
### Can a single dev team replicate this for a different language pair?
Yes — we have built an analogous pipeline for Hindi pronunciation in PenLeap's read-aloud feature for younger learners. The architecture transfers; only the substitution table and the rater pool change. Allow ~6 months for the substitution table to mature.
## A Detail That Saved Us In August
In August 2025, we added Odia as a supported L1. The substitution table had three Odia rows we had inferred from published phonology. After 10 days, our calibration sample showed Odia speakers were getting consistently lower scores than Hindi speakers for objectively similar pronunciation. The linguist reviewed 22 outlier clips and identified that Odia speakers retain a distinct retroflex /ʐ/ in some words (e.g., place names) where Hindi speakers do not. We added the rule, retrained the acoustic model with 18 hours of donated Odia-accented English, and the gap closed within four weeks. The lesson: published phonology gets you 70% of the way; production data gets you the rest.
The analogous pattern shows up in voice AI for any culturally diverse user base — you cannot ship a substitution table from a textbook and call it done. You build it, you measure it, and you keep updating it.
## Need a Domain-Specific Speech Evaluation Engine?
We build pronunciation scoring, fluency analysis, and voice-AI feedback engines for Indian English, Hindi, and regional-language audio. Fixed-scope engagements from ₹6.8 lakh, shipped in 6-10 working weeks. The first call is a technical scoping with the engineer who would lead your build.