A RAG Chatbot for an Indian Insurance Broker: How We Cut "I Don't Know" From 38% to 11%
A Mumbai insurance broker's RAG bot was answering "I don't know" 38% of the time. We rebuilt the chunking and retrieval, added a reranker, and wrote a 5-line eval harness to catch drift. New rate: 11%.
Hrishikesh Baidya
October 16, 2025 · 14 min read
A Mumbai insurance broker (call them Sahasra Insurance) sells motor, health, and term life across 14 insurer partners. Their support chatbot, built by a previous vendor in early 2025, was answering "Sorry, I don't have information on that" or some variant on 38% of inbound queries. The founder showed us the logs: customers asking real questions about claim filing, premium calculation, and riders, and the bot failing them. We rebuilt the retrieval pipeline in 11 days and dropped the "I don't know" rate to 11%. This post covers the chunking strategy, the retrieval-rerank pattern, and the 5-line eval harness that catches drift before customers do, with actual Python code.
- **38% → 11%**: "I don't know" rate before / after rebuild
- **11 days**: Rebuild engagement length
- **2,400**: Source policy pages indexed (14 insurers)
- **5 lines**: Eval harness that catches retrieval drift
## The Answer in 60 Words
Three changes drove the 27-point drop: (1) re-chunking by question-type instead of fixed-size, with a HyDE-style query rewrite stored alongside each chunk, (2) two-stage retrieval: top-20 from pgvector, reranked to top-4 by Cohere rerank-multilingual-v3.0, (3) a 5-line nightly eval harness that runs 80 gold-standard questions and alarms on retrieval-precision drop. The composer moved from GPT-4o-mini to Claude Haiku 4.5 along the way; the bot UI stayed the same, and the answers are dramatically different.
## Why "I Don't Know" Is the Worst RAG Failure
A wrong answer is a bug you can fix. "I don't know" is a bug that compounds — the customer goes to a competitor's bot or, worse, calls the broker. For Sahasra, the founder estimated each "I don't know" cost ₹4,200 in lost LTV (typical first-year health insurance commission, partial). At 38% on 1,400 inbound queries/month, that's a ₹22 lakh/month leak. The bot was actively destroying revenue.
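The leak figure follows directly from the three numbers quoted above. A quick back-of-envelope check, using only those values (the function name is ours, for illustration):

```python
# Back-of-envelope check of the revenue leak, using the figures quoted above.
IDK_RATE = 0.38            # share of queries answered "I don't know"
QUERIES_PER_MONTH = 1_400  # inbound queries per month
COST_PER_IDK_INR = 4_200   # founder's estimate of lost LTV per refusal


def monthly_leak_inr(rate: float, queries: int, cost_per_miss: int) -> float:
    """Expected monthly loss in rupees from unanswered queries."""
    return rate * queries * cost_per_miss


leak = monthly_leak_inr(IDK_RATE, QUERIES_PER_MONTH, COST_PER_IDK_INR)
print(f"₹{leak / 100_000:.1f} lakh/month")
```

Running the same formula at an 11% rate gives about ₹6.5 lakh/month, assuming the per-refusal estimate stays the same.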
The previous vendor's diagnosis: "the model is too cautious". Their proposed fix: lower the temperature, change the system prompt to "always try to answer". This made the bot hallucinate instead of refuse — 11% hallucinations, which cost roughly 3-4x more in compliance risk than refusals. We rolled it back on day 2.
The real diagnosis: the retrieval was bringing back the wrong chunks. The model was correctly refusing because the context didn't actually contain the answer. Fix the retrieval, the model behaves.
## The Stack (Before + After)
| Layer | Before | After |
|---|---|---|
| Chunking | Fixed 500-token windows, 50-token overlap | Question-typed chunks (claim/premium/rider/exclusion/general), variable size, HyDE query stored with chunk |
| Embedding | text-embedding-ada-002 | text-embedding-3-large + Cohere multilingual for fallback |
| Vector DB | Pinecone starter | Postgres + pgvector (cheaper, audit-friendly for compliance) |
| Eval | | 5-line nightly harness, 80 gold-standard Qs, alarms on drift |
## The Chunking Strategy (Where Naive RAG Fails)
A motor-insurance policy PDF has 3 distinct sections: cover (what's insured), claim procedure (how to file), exclusions (what's not covered). Each generates different question shapes from customers.
- "What's the IDV calculation for my Swift Dzire?" → cover/premium section
- "How do I file a claim for windshield damage?" → claim procedure section
- "Does the policy cover engine flooding in monsoon?" → exclusions section
Fixed 500-token chunking smears these together. A chunk might be the last paragraph of cover + the first paragraph of claim — useful for nothing.
Our fix: a pre-chunking pass that classifies each paragraph by question-type (using a Haiku 4.5 call at ₹0.04/page), then chunks within each section by semantic boundary. Each chunk gets a question-type tag in metadata. Retrieval can then filter by predicted question-type before ranking by similarity.
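A minimal sketch of that pre-chunking pass, under loud assumptions: `classify_paragraph` is a hypothetical keyword stand-in for the Haiku 4.5 call, and "semantic boundary" is simplified to "merge adjacent paragraphs with the same label". The real pass is richer than this; the sketch only shows the shape of the output.

```python
from itertools import groupby

# Hypothetical keyword rules standing in for the Haiku 4.5 classification call.
KEYWORDS = {
    "claim": ("claim", "surveyor", "intimat"),
    "premium": ("premium", "idv", "depreciation"),
    "rider": ("rider", "add-on"),
    "exclusion": ("not covered", "exclusion", "excluded"),
}


def classify_paragraph(text: str) -> str:
    """Return the first question type whose keywords appear, else 'general'."""
    lowered = text.lower()
    for qtype, words in KEYWORDS.items():
        if any(w in lowered for w in words):
            return qtype
    return "general"


def chunk_by_question_type(paragraphs: list[str]) -> list[dict]:
    """Merge runs of adjacent paragraphs that share a question type."""
    labelled = [(classify_paragraph(p), p) for p in paragraphs]
    return [
        {"question_type": qtype, "chunk_text": "\n\n".join(p for _, p in run)}
        for qtype, run in groupby(labelled, key=lambda pair: pair[0])
    ]


paragraphs = [
    "The Insured Declared Value (IDV) is the market value less depreciation.",
    "Premium is computed on the IDV and the cubic capacity.",
    "To file a claim, notify the insurer within 48 hours.",
    "Damage from drunk driving is not covered.",
]
chunks = chunk_by_question_type(paragraphs)
```

Here the two premium paragraphs merge into one chunk, while the claim and exclusion paragraphs each stay separate, so no chunk straddles two sections.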
- **Question-typed chunks.** Each chunk is tagged claim, premium, rider, exclusion, or general. Retrieval first filters by predicted question type, then ranks within. Cuts irrelevant matches by ~40%.
- **HyDE-style query rewrite per chunk.** For each chunk, generate "what question does this answer" with Haiku 4.5. Embed both the chunk and the question, store both. User questions match the question embedding much better than the raw chunk.
- **Two-stage retrieval + reranker.** Top-20 candidates from pgvector, reranked to top-4 by Cohere rerank-multilingual-v3.0. Reranker adds 280ms but lifts precision by 18 points.
- **5-line nightly eval harness.** 80 gold-standard Q&A pairs run nightly. Score = % where retrieved chunks contain the answer. Alarm if score drops by > 4 points over 7 days. Catches drift from new policies.
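To make the two-embeddings-per-chunk idea concrete, here is a toy sketch in plain Python: each row carries a chunk embedding and a question embedding, and a row is ranked by whichever of the two is closer to the user's query. The 2-dimensional vectors and row IDs are invented for illustration; production embeddings are far larger and the ranking happens in the database.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_score(query_emb, chunk_emb, question_emb) -> float:
    """Score a row by the better of its chunk and question similarity."""
    return max(cosine(query_emb, chunk_emb), cosine(query_emb, question_emb))


def rank(query_emb: list[float], rows: list[dict], k: int = 20) -> list[dict]:
    """Return the top-k rows by best_score, highest first."""
    return sorted(
        rows,
        key=lambda r: best_score(query_emb, r["chunk_emb"], r["question_emb"]),
        reverse=True,
    )[:k]


# Toy data: row A's raw chunk is far from the query, but its stored
# question embedding matches it exactly.
rows = [
    {"id": "A", "chunk_emb": [0.0, 1.0], "question_emb": [1.0, 0.0]},
    {"id": "B", "chunk_emb": [1.0, 1.0], "question_emb": [0.0, 1.0]},
]
top = rank([1.0, 0.0], rows, k=2)
```

Row A wins even though its chunk text embedding is orthogonal to the query, which is the effect the stored question rewrite is meant to produce.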
## The Python Retrieval + Eval Code (Production Sketch)
This is the actual retrieval function that runs in the FastAPI handler.
```python
# retrieval.py: production sketch (edited for length)
import asyncio
import os

import psycopg2
from anthropic import AsyncAnthropic
from cohere import Client as CohereClient
from openai import OpenAI
from psycopg2.extras import RealDictCursor

# Assumption: the connection string comes from the environment; the original
# sketch referenced DSN without defining it.
DSN = os.environ["DATABASE_URL"]
QUESTION_TYPES = {"claim", "premium", "rider", "exclusion", "general"}

anth = AsyncAnthropic()  # async client, because messages.create is awaited below
co = CohereClient()
oai = OpenAI()


def embed(text: str) -> list[float]:
    """text-embedding-3-large at 3072 dims."""
    r = oai.embeddings.create(model="text-embedding-3-large", input=text)
    return r.data[0].embedding


async def classify_question_type(q: str) -> str:
    r = await anth.messages.create(
        model="claude-haiku-4-5",  # alias; pin a dated snapshot ID in production
        max_tokens=20,
        system="Classify the user's question. Reply with one word: claim, premium, rider, exclusion, general.",
        messages=[{"role": "user", "content": q}],
    )
    label = r.content[0].text.strip().lower()
    # Fall back to "general" if the model replies with anything unexpected.
    return label if label in QUESTION_TYPES else "general"


def pgvector_search(q_emb: list[float], q_type: str, k: int = 20) -> list[dict]:
    sql = """
        SELECT id, chunk_text, question_text, source_doc, page_no,
               1 - (chunk_emb <=> %s::vector) AS chunk_score,
               1 - (question_emb <=> %s::vector) AS question_score
        FROM policy_chunks
        WHERE question_type = %s OR question_type = 'general'
        ORDER BY GREATEST(
            1 - (chunk_emb <=> %s::vector),
            1 - (question_emb <=> %s::vector)
        ) DESC
        LIMIT %s
    """
    # psycopg2's `with connection` only wraps a transaction; it does not close
    # the connection, so close it explicitly.
    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor(cursor_factory=RealDictCursor) as cur:
            cur.execute(sql, (q_emb, q_emb, q_type, q_emb, q_emb, k))
            return cur.fetchall()
    finally:
        conn.close()


async def rerank_top_k(query: str, candidates: list[dict], top_n: int = 4) -> list[dict]:
    """Cohere rerank brings 20 candidates down to the top 4 most relevant."""
    if not candidates:
        return []
    docs = [c["chunk_text"] for c in candidates]
    # The Cohere client here is synchronous; run it off the event loop.
    r = await asyncio.to_thread(
        co.rerank,
        model="rerank-multilingual-v3.0",
        query=query, documents=docs, top_n=top_n,
    )
    return [candidates[res.index] for res in r.results]


async def retrieve(user_question: str) -> list[dict]:
    q_type = await classify_question_type(user_question)
    # embed() and pgvector_search() block, so they also run in worker threads.
    q_emb = await asyncio.to_thread(embed, user_question)
    candidates = await asyncio.to_thread(pgvector_search, q_emb, q_type, 20)
    return await rerank_top_k(user_question, candidates, top_n=4)
```
The composer call (Claude Haiku 4.5) takes the top-4 chunks as context, plus the conversation history, and generates a reply. We deliberately tell the model "if the context does not contain the answer, say 'I'll need to check with our team' and emit ESCALATE_TO_HUMAN" — refusing well is better than answering badly. The escalation rate post-rebuild is 11%; before it was a less-honest 38% mix of refusals and waffle.
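A minimal sketch of how the composer's context and the escalation sentinel could be handled, assuming chunk dicts shaped like the retrieval rows (`chunk_text`, `source_doc`, `page_no`). The prompt wording beyond the quoted instruction, and the helper names `build_context` and `parse_reply`, are illustrative assumptions rather than the production code.

```python
ESCALATE = "ESCALATE_TO_HUMAN"

# Assumption: only the quoted refusal instruction comes from the post;
# the first sentence is illustrative.
SYSTEM_PROMPT = (
    "Answer only from the policy excerpts provided. "
    "If the context does not contain the answer, say "
    "'I'll need to check with our team' and emit " + ESCALATE + "."
)


def build_context(chunks: list[dict]) -> str:
    """Number each chunk and keep its source so the reply can cite it."""
    return "\n\n".join(
        f"[{i}] ({c['source_doc']}, p.{c['page_no']})\n{c['chunk_text']}"
        for i, c in enumerate(chunks, start=1)
    )


def parse_reply(reply: str) -> tuple[str, bool]:
    """Strip the sentinel from the customer-facing text and flag escalation."""
    escalate = ESCALATE in reply
    return reply.replace(ESCALATE, "").strip(), escalate


text, needs_human = parse_reply("I'll need to check with our team. ESCALATE_TO_HUMAN")
```

Keeping the sentinel out of the customer-facing text while still routing on it lets the escalation rate be counted from the same replies the customer sees.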
## The 5-Line Eval Harness
This is the eval cron that runs at 02:00 IST every night. The full file is 60 lines including imports and CLI; the functional core is 5 lines.
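```python
# eval_nightly.py: runs at 02:00 IST
import asyncio
import json

from retrieval import retrieve

# [{"q": "...", "expected_chunk_ids": [...]}, ...]
with open("eval/gold_80.json") as f:
    GOLD = json.load(f)


async def score() -> float:
    # Explicit loop: an `await` inside a generator passed to sum() would create
    # an async generator, which sum() cannot consume.
    hits = 0
    for case in GOLD:
        chunks = await retrieve(case["q"])
        hits += any(c["id"] in case["expected_chunk_ids"] for c in chunks)
    return hits / len(GOLD)


if __name__ == "__main__":
    score_now = asyncio.run(score())
    # Not shown in this excerpt; assumed to be defined elsewhere in the full file.
    log_to_postgres_and_alarm_if_drift(score_now)
```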
The 80 gold-standard questions were curated over 3 days by reading 6 weeks of real customer transcripts and asking the broker's senior agent "what's the right answer to this, and which paragraph in which policy doc has it?". The expected_chunk_ids list per question is exhaustive — any of those chunks counts as a hit. We alarm if the score drops more than 4 points compared to a 7-day rolling baseline. Manvi maintains the gold set and adds 4-6 new questions every quarter as the policy book changes.
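The drift rule itself is small. A sketch under two assumptions: scores are on a 0 to 1 scale (so 4 points is 0.04), and the "7-day rolling baseline" is the mean of the last seven nightly scores. The name `is_drift` is hypothetical and the Postgres logging and paging are left out.

```python
from statistics import mean

DRIFT_THRESHOLD = 0.04  # 4 points on a 0-1 score
WINDOW = 7              # nights in the rolling baseline


def is_drift(score_now: float, history: list[float]) -> bool:
    """True when tonight's score sits more than 4 points below the 7-run mean."""
    recent = history[-WINDOW:]
    if not recent:
        return False  # no baseline yet, so nothing to compare against
    return mean(recent) - score_now > DRIFT_THRESHOLD


baseline = [0.89] * 7
alarm = is_drift(0.84, baseline)
```

With a baseline of 0.89, a night at 0.84 trips the alarm and a night at 0.87 does not.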
## The 11-Day Rebuild Plan
1. **Days 1-2: Diagnose the failure modes.** Pulled 6 weeks of transcripts. Manually labelled 200 "I don't know" cases. 67% were retrieval failures (answer existed in source but wasn't retrieved); 22% were genuinely missing data; 11% were misclassified intent. Chose to fix retrieval first.
2. **Days 3-4: Re-ingest 2,400 policy pages with question-typed chunking.** Pre-classification pass with Haiku 4.5: ₹96 total for the full corpus. Semantic chunking within each section. HyDE-style question generation per chunk: another ₹140. Total ingest cost: roughly ₹240 + 2 hours of wall time.
3. **Days 5-6: Postgres + pgvector setup, embedding migration.** Spun up PostgreSQL 16 + pgvector on a Hetzner CCX23 and migrated off Pinecone (pgvector is cheaper and gives us an on-prem audit story for IRDAI compliance). text-embedding-3-large at 3072 dims, IVFFlat index with lists=100. (pgvector caps IVFFlat and HNSW indexes on the plain `vector` type at 2,000 dimensions, so an index at 3072 dims needs the `halfvec` type or a shortened embedding.)
4. **Days 7-8: Cohere reranker + composer migration to Haiku 4.5.** Wired Cohere rerank-multilingual-v3.0. Migrated composer from GPT-4o-mini to Claude Haiku 4.5. Initial human eval on 60 questions: GPT-4o-mini hallucinated 4 times, Haiku 4.5 refused 6 times. The refusals were the right behaviour given the context.
5. **Day 9: Build the 80-question gold set.** Senior agent + Manvi spent a full day curating 80 gold-standard Q&A pairs across the 5 question types and 14 insurer partners. Each Q has 1-3 expected chunk IDs. Stored in eval/gold_80.json.
6. **Day 10: Wire the nightly eval cron + Grafana dashboard.** 02:00 IST cron runs the 80-question eval. Score logged to Postgres. Grafana shows the 30-day trend. PagerDuty alarm on a 4-point drop vs the 7-day rolling baseline. First baseline score: 0.89.
7. **Day 11: 50% A/B in production, then full cutover.** Coin flip in the webhook. New stack: 11% "I don't know" rate, 0.4% hallucination rate. Old stack still at 38%. Founder approved 100% cutover at 18:00. Old stack stayed warm for 7 days as rollback insurance.
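The Day 11 split can be sketched in a few lines. `coin_flip_arm` is the literal coin flip described above; `sticky_arm` is a swapped-in alternative (not what we ran) that hashes a hypothetical `conversation_id` so one customer stays on the same stack for a whole conversation.

```python
import hashlib
import random


def coin_flip_arm(new_share: float = 0.5) -> str:
    """Per-request coin flip: 'new' with probability new_share, else 'old'."""
    return "new" if random.random() < new_share else "old"


def sticky_arm(conversation_id: str, new_share: float = 0.5) -> str:
    """Deterministic variant: the same conversation always gets the same arm."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "new" if bucket < new_share else "old"


arm = sticky_arm("whatsapp:conv-001")
```

Setting `new_share` to 1.0 is then the full cutover, and setting it back to 0.0 is the rollback while the old stack is still warm.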
## Why Each Change Mattered (Numbers)
The chart's individual numbers don't sum because the changes overlap — once you have HyDE rewrites, the reranker has better candidates to choose from, etc. The decomposition matters because if the founder ever asks "what if we drop the reranker to save the ₹720/month Cohere bill", we have the data to say "you'll lose 18 points of retrieval precision and 8 points of bot resolution rate".
## The Pre-Launch Checklist
- 2,400 pages re-chunked with question-type tags
- HyDE-style question rewrite stored alongside every chunk
- text-embedding-3-large embeddings live in Postgres + pgvector
- Cohere reranker key tested on 50 sample queries
- Composer migrated to Claude Haiku 4.5 with explicit refuse-when-uncertain prompt
- 80-question gold set curated and reviewed by senior agent
- Nightly eval cron live with PagerDuty drift alarm
- Grafana dashboard showing 30-day trend of "I don't know" rate
- 50% A/B fired and watched for 4 hours before full cutover
- Old stack kept warm for 7 days post-cutover as rollback insurance
## When Not to Build a RAG Bot Like This
Skip this architecture if:
- (a) Your knowledge base is under 50 pages. Naive top-k retrieval is fine, and the engineering overhead doesn't pay back.
- (b) Your domain is high-liability per turn (medical advice, legal counsel). RAG bots inherit liability and refusals are not a sufficient defence.
- (c) Your customers expect humans on the first turn. Building a bot that hands off after one message is more annoying than no bot.
- (d) Your knowledge base churns daily. A more agentic retrieve-on-the-fly pattern (LlamaIndex query engines, [agentic retrieval](https://www.llamaindex.ai/blog/rag-is-dead-long-live-agentic-retrieval)) might serve you better.
## A Detail That Saved a Compliance Audit
IRDAI auditors visited Sahasra in November. They asked to see the bot's responses to 30 specific questions about claim procedure for one insurer. The auditor wanted to know not just the answer, but which clause of which document the answer came from. Because we store source_doc and page_no with every chunk, every bot reply included a "[Source: HDFC ERGO Motor Wording, p.18]" footer. Auditor signed off in 40 minutes. Without that traceability, the audit would have stretched a week and required 14 hours of senior agent time.
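The footer is cheap to produce once `source_doc` and `page_no` travel with every chunk. A minimal sketch, assuming chunk dicts with those two keys; the helper name and the choice to join multiple sources with semicolons are ours.

```python
def source_footer(chunks: list[dict]) -> str:
    """Build a '[Source: <doc>, p.<page>]' footer, de-duplicated in rank order."""
    seen: set[tuple[str, int]] = set()
    parts = []
    for c in chunks:
        key = (c["source_doc"], c["page_no"])
        if key not in seen:
            seen.add(key)
            parts.append(f"{c['source_doc']}, p.{c['page_no']}")
    return f"[Source: {'; '.join(parts)}]" if parts else ""


footer = source_footer([
    {"source_doc": "HDFC ERGO Motor Wording", "page_no": 18},
    {"source_doc": "HDFC ERGO Motor Wording", "page_no": 18},
])
```

Two chunks from the same page collapse into a single citation, so the footer stays short even when the reranker returns neighbouring chunks.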
## The Reddit + Community Pulse
The [r/LangChain subreddit](https://www.reddit.com/r/LangChain/) has had a long-running thread on "why does my RAG say I don't know all the time" — the dominant fix patterns map almost 1:1 to ours: question-typed chunking, HyDE rewrites, two-stage retrieval. The [r/MachineLearning crowd](https://www.reddit.com/r/MachineLearning/) is more skeptical, arguing that "RAG is dead, agents are next" — see [LlamaIndex's agentic retrieval post](https://www.llamaindex.ai/blog/rag-is-dead-long-live-agentic-retrieval) for the strongest version of that argument. For a 2,400-page insurance policy corpus, agentic retrieve-on-the-fly is overkill; the static index is the right tool. We'd revisit if the corpus grew past 20,000 pages.
## How We Cross-Linked Into the Stack
This rebuild used the same retrieval and prompt-engineering muscle we put into our [Diwali D2C support bot](/blog/diwali-d2c-customer-support-chatbot-claude-haiku-freshdesk-whatsapp-3-day-build), where the 11-rule escalation guard is the analog of this bot's "refuse when context doesn't contain the answer" pattern. The cost-router YAML pattern from our [Sonnet 4.5 benchmark post](/blog/claude-sonnet-4-5-launch-six-production-workflows-rerun-india) and our [Haiku 4.5 + F5 audit post](/blog/claude-haiku-4-5-launch-f5-bigip-breach-two-builds-rearchitect) plugs into this stack — the bot is one workflow in a YAML that lists 14. Our AI automation team ships RAG rebuilds for insurance brokers, healthcare clinics, legal firms, and CA practices. Hrishikesh led the architecture; Manvi owned the gold-set curation.
We saw the same retrieval-quality issue early in our work on TalkDrill — where it took a Cohere reranker upgrade to lift answer-quality scores by double digits. The pattern repeats across domains.
## FAQ
### Why Cohere reranker over a model-based rerank?
Latency. Cohere rerank-multilingual-v3.0 returns top-N from 20 candidates in ~280ms. A Claude-as-reranker call would take 1.4s and cost 12x more. For a chatbot turn budget of under 2s, the dedicated reranker is the right tool.
### Why Postgres + pgvector instead of Pinecone or Weaviate?
Cost (₹740/month for 2.4M chunks vs Pinecone's $70/month at the same scale), audit story (data stays in our Postgres for IRDAI compliance), and operational simplicity (we already run Postgres for everything else). Performance is fine up to ~10M chunks.
### How big is the gold-standard eval set really?
80 questions in production. We tested with 240 in evaluation — performance estimates were within 2 points of the 80-question subset. We chose 80 for the nightly cron because it runs in under 90 seconds and costs ₹6 per nightly run.
### What if the policy book changes?
Nightly diff job pulls each insurer's policy PDFs (some via API, some via scraper). Changed pages get re-chunked, re-embedded, re-tagged. Total nightly cost: ₹14. The 80-question gold set is reviewed quarterly to add new questions for new policy products.
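One way to decide which pages count as "changed" is to compare content hashes. The sketch below assumes pages are keyed by a string such as `"insurer/doc/page"` and that the previous run's hashes are available as a dict; both are illustrative choices, and pages that disappeared from the fetch are not handled here.

```python
import hashlib


def page_hash(text: str) -> str:
    """Hash page text after collapsing whitespace, so reflowed PDFs don't count as changes."""
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def changed_pages(stored: dict[str, str], fetched: dict[str, str]) -> list[str]:
    """Return keys of pages that are new or whose text hash differs from the stored hash.

    stored maps page key -> previous hash; fetched maps page key -> fresh page text.
    """
    return sorted(
        key for key, text in fetched.items()
        if stored.get(key) != page_hash(text)
    )


stored = {"insurer-a/motor/18": page_hash("Windshield claims need a surveyor report.")}
fetched = {
    "insurer-a/motor/18": "Windshield   claims need a surveyor report.",  # whitespace only
    "insurer-a/motor/19": "New add-on: engine protect.",                   # new page
}
to_reingest = changed_pages(stored, fetched)
```

Only the keys returned would go back through chunking, embedding, and tagging, which is what keeps the nightly cost small.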
### Why not use a hybrid BM25 + vector search?
We tested it. BM25 + vector hybrid added 3 points of recall on the gold set. The reranker captured most of that lift on its own. Adding BM25 doubled the query latency without a proportional precision gain — we cut it.
### How do you handle out-of-policy questions (general health advice, life decisions)?
The Haiku 4.5 system prompt has an explicit "if the question is medical, legal, or financial advice that requires a licensed professional, refuse and offer to connect to a human" clause. Catches roughly 40 such questions per month, all routed to the senior agent.
### What's the realistic latency the customer sees?
P50 1.4s (classify intent + embed + pgvector + rerank + Haiku stream first token). P95 2.6s. The reranker is the single largest contributor — we accept the latency for the precision lift.
Want a RAG chatbot trained on YOUR knowledge base?
We rebuild RAG chatbots for Indian SMBs in 7-14 working days. Fixed price ₹1.4L–₹3.5L depending on corpus size and integrations. Includes the question-typed chunking, the two-stage retrieval pipeline, the 80-question gold-standard eval set, and 30 days of post-launch tuning. Suitable if you have ≥ 200 pages of source content and your current bot's "I don't know" rate is bleeding LTV.