An Internal Knowledge Bot for a 220-Person CA Firm: Claude Sonnet 4.5 + LlamaIndex + 14 Years of GST Circulars
A Hyderabad CA firm (220 staff, 14 years of GST circulars) shipped an internal RAG bot in 6 weeks. Claude Sonnet 4.5 + LlamaIndex + Qdrant. The ingest pipeline, the citation-required prompt, and 71% adoption by week 8.
Hrishikesh Baidya
October 26, 2025 · 14 min read
A 220-person CA firm in Hyderabad — 4 partners, 18 managers, 198 associates and articled clerks across 7 service lines — was losing ~14 person-hours a week to "what does the May 2018 GST circular say about input credit on capital goods" type questions. Senior managers were the bottleneck. We built an internal knowledge bot in 6 weeks on Claude Sonnet 4.5 + LlamaIndex + Qdrant, ingesting 14 years of GST circulars (2,840 documents), 1,420 internal SOPs, and 18,400 client memos. Eight weeks post-launch, 156 of 220 staff (71%) used it weekly; manager-question volume dropped 38%. Here is the ingest pipeline, the citation-required prompt pattern, and what we did about hallucinations.
- 220: Staff (4 partners + 18 managers + 198 associates)
- 2,840: GST circulars ingested (2011–2025)
- 71%: Weekly active staff at week 8
- 38%: Drop in manager-question volume
## TL;DR (60 Words)
We ingested 14 years of CBIC GST circulars + internal SOPs + client memos into a Qdrant vector store via LlamaIndex. Claude Sonnet 4.5 handles the answer generation with a strict citation-required prompt: every claim must cite a circular ID + section. Without a citation, the bot says "I do not know — ask a senior manager." Weekly active rate at 8 weeks: 71%. Cost: ₹4.2 lakh build + ₹38,000/month inference.
## Why This Matters Now
Claude Sonnet 4.5 was released on September 29, 2025, and Anthropic positioned it as state-of-the-art on agentic tasks and software engineering. For a knowledge bot ingesting Indian regulatory text (which is dense, cross-referenced, and frequently amended), the gains over earlier models on long-context retrieval and citation-following accuracy are substantial. The LlamaIndex framework added explicit support for Sonnet 4.5 within days of release. The combination is cheap to operate (~₹38k/month at this firm's volume) and ergonomic for a small in-house team to maintain. We have shipped 3 similar bots in the last 6 months for Indian professional-services firms, each a variant of the same pattern.
## The Client (Specific Details)
- Sector: Mid-tier CA firm (audit + tax + GST + transfer pricing + statutory compliance)
- Location: Hyderabad HQ + 1 satellite office in Vizag
- Size: 220 staff (4 partners, 18 managers, 198 associates and articled clerks)
- Knowledge corpus: 2,840 CBIC GST circulars + 1,420 internal SOPs + 18,400 client engagement memos (anonymised)
- Stack on day 0: A shared Google Drive with a folder hierarchy. Search worked when the file name was right. Otherwise, you Slack a senior manager.
- Trigger: The senior manager who answers most GST queries took 4 weeks of leave. Productivity collapsed across 3 service lines. The managing partner drew a line.
## The Architecture
- Ingest: PDF parser + LlamaIndex SentenceSplitter (450-token chunks, 50-token overlap)
- Embeddings: Voyage AI voyage-3-large (Indian English + legal jargon strong)
- Vector store: Qdrant (self-hosted on Hetzner CCX23, 32 GB RAM)
- Retrieval: top-12 hybrid (vector + BM25), then Cohere reranker top-4
- LLM: Claude Sonnet 4.5 with citation-required prompt
- UI: Next.js chat interface with citation pills + thumbs feedback
## The Ingest Pipeline (Where Most Bots Fail)
Ingesting CBIC GST circulars is harder than ingesting clean Markdown. Circulars are PDFs with mixed Hindi+English text, tables, footnote references, and explicit cross-references like "as per circular 87/06/2019 dated 02.01.2019". A naive chunk-and-embed loses the cross-reference context.
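To make the cross-reference problem concrete, here is a minimal sketch of pulling "circular X dated Y" references out of parsed text so they can travel with the chunk as metadata. The regex and the function name `extract_cross_references` are illustrative assumptions, not the firm's production extractor; real circulars use more phrasings than this pattern covers.

```python
import re
from datetime import date

# Hypothetical pattern for references such as
# "circular 87/06/2019 dated 02.01.2019" or "Circular No. 87/06/2019-GST dated 02.01.2019".
XREF_RE = re.compile(
    r"circular\s+(?:no\.?\s*)?(?P<cid>\d{1,3}/\d{1,2}/\d{4})"
    r"(?:-GST)?\s+dated\s+(?P<d>\d{2})\.(?P<m>\d{2})\.(?P<y>\d{4})",
    re.IGNORECASE,
)


def extract_cross_references(text: str) -> list[tuple[str, date]]:
    """Return (circular_id, issue_date) for every reference found in the text."""
    refs = []
    for match in XREF_RE.finditer(text):
        issued = date(int(match["y"]), int(match["m"]), int(match["d"]))
        refs.append((match["cid"], issued))
    return refs


sample = "Refer to the clarification as per circular 87/06/2019 dated 02.01.2019."
print(extract_cross_references(sample))
```

Storing the extracted pairs next to each chunk keeps the reference available even when the sentence that contained it ends up in a different chunk.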
1. PDF parsing with metadata preservation
   unstructured.io for the parse, then a custom regex pass to extract circular ID, date, subject from the header. Each chunk inherits these as metadata. Cross-references are flagged for the post-process step.
2. Semantic chunking on legal-text boundaries
   LlamaIndex SentenceSplitter at 450 tokens with 50-token overlap. We tried 1,000-token chunks; recall was worse because the embedding averaged across multiple distinct rules. Smaller chunks won decisively for legal text.
3. Cross-reference graph as a sidecar
   Every "circular X dated Y" reference becomes an edge in a Neo4j sidecar. At retrieval time, we expand the result set with documents 1-hop from the matched chunks. Improved recall on amendment-heavy queries by 23%.
4. Hybrid retrieval + Cohere reranker
   Vector (Voyage AI) + BM25 in parallel. Top-12 from each merged, then Cohere rerank-3.5 returns top-4. The reranker is what made the bot trustworthy enough to roll out: pure vector retrieval missed exact-phrase queries about specific circular numbers.
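The merge-then-rerank step can be sketched in plain Python. This is an illustration under assumptions: `merge_candidates` and `rerank_top_k` are hypothetical names, the interleaving order is our choice, and `toy_score` stands in for the Cohere reranker call, which is not shown here.

```python
from itertools import zip_longest
from typing import Callable


def merge_candidates(vector_hits: list[str], bm25_hits: list[str],
                     per_source: int = 12) -> list[str]:
    """Union the top `per_source` chunk IDs from each retriever, deduplicated."""
    merged: list[str] = []
    seen: set[str] = set()
    # Interleave the two lists so neither retriever dominates the head.
    for pair in zip_longest(vector_hits[:per_source], bm25_hits[:per_source]):
        for chunk_id in pair:
            if chunk_id is not None and chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged


def rerank_top_k(query: str, candidates: list[str],
                 score_fn: Callable[[str, str], float], k: int = 4) -> list[str]:
    """Keep the k candidates the scoring function rates highest."""
    return sorted(candidates, key=lambda cid: score_fn(query, cid), reverse=True)[:k]


# Toy corpus and scorer (word overlap) purely for demonstration.
chunks = {
    "c1": "general note on input credit",
    "c2": "see circular 87/06/2019 dated 02.01.2019",
    "c3": "office leave policy",
}


def toy_score(query: str, chunk_id: str) -> float:
    return len(set(query.lower().split()) & set(chunks[chunk_id].lower().split()))


merged = merge_candidates(["c1", "c3"], ["c2", "c1"])
top = rerank_top_k("circular 87/06/2019", merged, toy_score, k=2)
print(merged, top)
```

In the toy run the exact-number match `c2` only enters the candidate set through the keyword list, which is the failure mode that pure vector retrieval showed on circular numbers.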
## The Citation-Required Prompt (The 80% of Hallucination Defence)
The single most important prompt change: the bot is allowed to refuse to answer. We made "I do not know — ask a senior manager" a celebrated outcome.
```python
SYSTEM_PROMPT = """
You are an internal knowledge assistant for a Chartered Accountancy firm.
You help associates and articled clerks find answers in Indian GST circulars,
internal SOPs, and historical client memos.

CRITICAL RULES:
1. Every factual claim MUST be backed by a citation in the form
   [circular_id, section] or [SOP_ID] or [CLIENT_MEMO_ID].
2. If the retrieved context does not contain a clear answer, respond with EXACTLY:
   "I do not have a confident answer. Please consult a senior manager.
   Closest related circulars: [list any retrieved IDs]."
3. Do NOT speculate. Do NOT extrapolate from general tax knowledge.
4. If the question concerns a circular issued AFTER the most recent ingest date
   ({last_ingest_date}), respond:
   "This circular is more recent than my knowledge. Please check the CBIC portal
   directly: cbic-gst.gov.in"
5. Quote the source verbatim where possible. Use Indian English. Use INR for amounts.
6. If the user asks for an opinion or recommendation, decline. You answer
   factual questions about what the circulars and SOPs say.

CONTEXT (from retrieval, ranked top-4):
{retrieved_chunks_with_citations}

USER QUESTION: {user_question}
"""
```
In the first 4 weeks of operation we logged every "I do not have a confident answer" response. There were 412. Of these, 380 were genuine knowledge gaps the senior manager confirmed. 32 were false negatives where the bot had the answer but did not surface confidence. The 8% false-negative rate was acceptable to the partners — far preferable to confident hallucinations.
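Rule 1 can also be checked after generation. The sketch below is an assumption about how such a guard might look, not a description of the deployed system: it pulls bracketed citations out of an answer and flags any ID that was not among the retrieved chunks. The IDs `SOP-020` and `SOP-114` are made up for the example.

```python
import re

# Matches any bracketed citation such as [87/06/2019, para 3] or [SOP-020].
CITATION_RE = re.compile(r"\[([^\[\]]+)\]")


def cited_ids(answer: str) -> set[str]:
    """Return the ID part of each citation (the text before the first comma)."""
    return {m.group(1).split(",")[0].strip() for m in CITATION_RE.finditer(answer)}


def unsupported_citations(answer: str, retrieved_ids: set[str]) -> set[str]:
    """IDs the model cited that were never in the retrieved context."""
    return cited_ids(answer) - retrieved_ids


answer = "The position is set out in [87/06/2019, para 3] and [SOP-114]."
retrieved = {"87/06/2019", "SOP-020"}
print(unsupported_citations(answer, retrieved))
```

An answer with a non-empty result could be replaced by the standard refusal text or routed to the weekly review queue; which of those is right depends on how much friction the firm is willing to accept.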
## The Cost of Operation
Operating cost is about ₹38,000/month at this firm's volume (~14k queries/month). Compared to the alternative, one extra senior manager on retainer to handle these queries (₹6+ lakh/month fully loaded), the ₹4.2 lakh build pays back in under a month. Even at 3x usage, monthly cost stays under ₹1 lakh.
## The 6-Week Build Plan
1. Week 1: Discovery + corpus inventory
   Catalogued the Google Drive corpus. Identified 2,840 GST circulars (2011-2025), 1,420 SOPs (mostly current), 18,400 client memos. Removed ~3,400 memos that contained PII without anonymisation. Got the partner team to sign off on the corpus list.
2. Week 2: Ingest pipeline + Qdrant setup
   unstructured.io for PDF parsing. Custom metadata extractor for circular ID/date. LlamaIndex SentenceSplitter at 450 tokens. Qdrant on Hetzner CCX23 with daily snapshots. First full ingest took 38 hours; nightly delta runs in 12 minutes.
3. Week 3: Retrieval + reranker tuning
   Built the hybrid retrieval (Voyage AI embeddings + BM25). Added Cohere reranker on top. Built an evaluation set of 100 questions with known-good answers. Tuned chunk size, top-k, rerank-k. Final: chunk 450, top-12 retrieved, top-4 reranked.
4. Week 4: Claude Sonnet 4.5 + citation-required prompt
   Claude Sonnet 4.5 via the Anthropic API, with the citation-required prompt above. Iterated 11 times with the senior manager rating responses. Final eval: 87% answers fully cited, 8% "I do not know" (acceptable), 5% needed prompt fix.
5. Week 5: Next.js UI + thumbs feedback loop
   Chat interface with citation pills (click to view the cited chunk in context). Thumbs up/down on every answer. Down-thumbed answers feed a weekly review queue for the senior manager.
6. Week 6: Pilot + full rollout
   Day 1-3: pilot with the 18 managers. Day 4: rolled out to all 220 staff. WhatsApp announcement from the managing partner ("use this before Slack-ing me"). 92 staff used it on day 1. By week 4, 156 weekly active.
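The Week 3 tuning step amounts to a small grid search over the evaluation set. The sketch below assumes a hit-rate metric (did the known-good source appear in the final context?) and a pluggable `retrieve` callable; the article does not say which metric was used, and `fake_retrieve` is a stub so the example runs on its own.

```python
from itertools import product
from typing import Callable

EvalItem = tuple[str, str]  # (question, ID of the source that answers it)


def hit_rate(retrieve: Callable[..., list[str]], eval_set: list[EvalItem],
             **params: int) -> float:
    """Fraction of questions whose gold source appears in the retrieved IDs."""
    hits = sum(1 for question, gold in eval_set if gold in retrieve(question, **params))
    return hits / len(eval_set)


def grid_search(retrieve: Callable[..., list[str]], eval_set: list[EvalItem],
                grid: dict[str, list[int]]) -> tuple[dict[str, int], float]:
    """Try every parameter combination and return the first best-scoring one."""
    best_params: dict[str, int] = {}
    best_score = -1.0
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        params = dict(zip(keys, values))
        if params["rerank_k"] > params["top_k"]:
            continue  # cannot keep more chunks than were retrieved
        score = hit_rate(retrieve, eval_set, **params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Stub pipeline: a fixed ranking per question, truncated to rerank_k.
RANKINGS = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}


def fake_retrieve(question: str, chunk_size: int, top_k: int, rerank_k: int) -> list[str]:
    return RANKINGS[question][:min(top_k, rerank_k)]


eval_set = [("q1", "b"), ("q2", "z")]
grid = {"chunk_size": [450, 1000], "top_k": [8, 12], "rerank_k": [2, 4]}
print(grid_search(fake_retrieve, eval_set, grid))
```

In a real run `retrieve` would re-chunk or select a pre-built index per `chunk_size`, so the expensive part is building the indexes, not the loop itself.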
## What Drove The 71% Adoption Rate
We have shipped 3 internal RAG bots; this one had the highest adoption of the three. Three things made the difference:
A. The managing partner used it publicly
In the first all-hands after launch, the managing partner asked the bot a question on screen and showed the citation pills. Adoption spiked 28% in the next week. Top-down endorsement matters more than any feature.
B. "I do not know" was never punished
The first time the bot refused to answer a question (legitimately), the senior manager could have said "useless tool". Instead they Slack-replied "Good — that one needs my judgement." That message was screenshotted and shared. Trust built fast.
C. Citations were one-click verifiable
Every cited circular pill opens the source PDF at the right page in a side panel. Associates verified the first 5-6 answers themselves. Once trust was established, they verified less. The verification path was always one click.
## Pre-Launch Checklist (RAG Bot Edition)
- Corpus inventory signed off by the partner team (no PII, no client-confidential leaks)
- Ingest pipeline tested on a 100-document sample with citation extraction verified
- Evaluation set of 100 known-good Q&A pairs scored by the senior subject matter expert
- Chunk size + top-k + rerank-k tuned with grid search on the eval set
- Citation-required prompt iterated to under 10% false-negative rate
- "I do not know" response framing approved by partner team (this is non-trivial)
- UI shows citation pills with one-click access to the source document
- Thumbs feedback loop wired to a weekly senior-manager review queue
- Cost monitoring with daily alert if spend trends >₹2,000/day
- Audit log of every query + answer + citations for the partner team
## Common Mistakes (Each One Hurts)
Symptom: "Bot confidently cites a circular that does not exist." Cause: chunk metadata not propagated through retrieval. Fix: every chunk in the vector store carries the source circular ID; the prompt template demands the bot use only those IDs.
Symptom: "Bot misses obvious matches on circular numbers." Cause: pure vector retrieval misses exact phrase matches. Fix: add BM25 in parallel; merge top-12 from each before reranking. Hybrid retrieval is non-negotiable.
Symptom: "Bot is slow: 14 seconds per answer." Cause: serialised retrieval + rerank + LLM. Fix: run the vector and BM25 lookups in parallel and stream the LLM response as soon as the reranked context is ready. Time-to-first-token drops to 1.4 seconds; a full answer in 6-8 seconds is acceptable.
Symptom: "Bot adoption stalls at 30%." Cause: refusing legitimate questions too aggressively. Fix: false-negative analysis on the "I do not know" responses; tune the confidence threshold downward until refusal rate is 8-12%.
Symptom: "Bot costs spiral after week 4." Cause: associates discover it answers personal-finance questions too. Fix: domain detection in the prompt; off-domain questions get redirected to the firm's CSR FAQ link.
## When NOT To Build a Knowledge Bot
Skip the build if (a) your corpus is under ~200 documents — search + senior expert is faster, (b) your domain changes faster than your re-ingest cycle (daily-changing rates, intraday securities) — the bot will lag and erode trust, or (c) you do not have a senior subject matter expert who can curate the prompt and review thumbs-down responses weekly. We have walked away from 2 RAG projects where the SME was unavailable. The bot needs an owner.
## A Detail That Saved Us On Day 9
On day 9 a senior associate asked: "What is the GST rate on a particular agricultural input post the September 2025 GST Council meeting?" The September 2025 meeting was 18 days before our last ingest, but the relevant circular was published 4 days after the ingest cycle. The bot correctly said "I do not have a confident answer — this may be a recent circular not yet in my corpus." The associate Slack-ed the senior manager, who was impressed. Without the recency-aware prompt rule, the bot would have answered with the older (wrong) rate. The lesson: every RAG prompt for fast-changing regulatory text needs an explicit "I might be stale" exit. Build it from day one.
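The "I might be stale" exit only works if the prompt is rendered with the real ingest date on every request. A minimal sketch of that rendering step follows; the `Chunk` dataclass, the `render_prompt` helper, and the shortened `TEMPLATE` are assumptions for illustration, and only the three placeholder names are taken from the prompt shown earlier.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Chunk:
    citation: str  # e.g. "87/06/2019, para 3" (hypothetical format)
    text: str


# Shortened stand-in for the full system prompt; only the placeholders matter here.
TEMPLATE = (
    "If the question concerns a circular issued AFTER the most recent ingest date "
    "({last_ingest_date}), say that your knowledge may be stale.\n"
    "CONTEXT (from retrieval, ranked top-4):\n"
    "{retrieved_chunks_with_citations}\n"
    "USER QUESTION: {user_question}\n"
)


def render_prompt(chunks: list[Chunk], question: str, last_ingest: date) -> str:
    """Fill the template, prefixing each chunk with its citation label."""
    context = "\n".join(f"[{chunk.citation}] {chunk.text}" for chunk in chunks)
    return TEMPLATE.format(
        last_ingest_date=last_ingest.isoformat(),
        retrieved_chunks_with_citations=context,
        user_question=question,
    )


demo_chunks = [Chunk("SOP-020", "Escalate unresolved GST queries to a manager.")]
prompt = render_prompt(demo_chunks, "Who handles escalations?", date(2025, 10, 1))
print(prompt)
```

Reading the date from the ingest job's own record, rather than hard-coding it, keeps the rule honest when a nightly run fails silently.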
## The Reddit Pulse on RAG-for-Law / Compliance Bots
The thread on r/CharteredAccountants from October 2025 has 38 comments from CA firms experimenting with internal LLMs. The top concern raised: hallucination on circular numbers. The second: spiralling costs. The pattern we shipped (Sonnet 4.5 + citation-required + Qdrant) addresses both. We have seen the same pattern hold up on a transfer-pricing bot for a different firm. Worth comparing approaches in the thread before you build.
## FAQ
### Why Claude Sonnet 4.5 instead of GPT-4 or Gemini?
We benchmarked all three on a 100-question eval set drawn from the firm's internal knowledge. Sonnet 4.5 had the lowest hallucination rate (4% vs 9% for GPT-4 Turbo and 11% for Gemini 1.5 Pro), the best citation-following accuracy (94% vs 82% / 79%), and a sensible price point (~₹2.40 per query at this corpus size). Sonnet 4.5 also handled the long-context retrieval gracefully when we needed to pass 4 reranked chunks of legal text.
### Why Qdrant instead of Pinecone or Weaviate?
Self-hosting on a single Hetzner CCX23 (₹6,800/mo) is cheaper than any managed equivalent for this corpus size. Qdrant has a stable Python client, hybrid search (BM25 + vector) built-in, and a sensible operations footprint. We have run it in production on 3 client projects with no operational drama.
### How often do you re-ingest the corpus?
GST circulars: nightly cron pulls from the CBIC portal. Internal SOPs: weekly. Client memos: monthly batch with anonymisation pass. Re-ingest of changed-only documents takes 12 minutes per cycle.
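One common way to get a changed-only re-ingest is to keep a manifest of content hashes from the previous run and compare against it. The article does not say how changes are detected, so treat this as an assumed approach; `content_hash` and `changed_documents` are hypothetical helper names.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Stable fingerprint of a document's bytes."""
    return hashlib.sha256(data).hexdigest()


def changed_documents(current: dict[str, bytes], manifest: dict[str, str]) -> list[str]:
    """IDs that are new, or whose bytes differ from the hash stored at last ingest."""
    return sorted(
        doc_id
        for doc_id, data in current.items()
        if manifest.get(doc_id) != content_hash(data)
    )


# Manifest written by the previous run, then tonight's download.
manifest = {"doc-a": content_hash(b"v1"), "doc-b": content_hash(b"v1")}
current = {"doc-a": b"v1", "doc-b": b"v2", "doc-c": b"brand new"}
print(changed_documents(current, manifest))
```

Only the returned IDs need to be re-parsed, re-chunked, and re-embedded; deletions would need a separate pass over IDs present in the manifest but missing from the download.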
### What about data privacy — does Anthropic see the firm's data?
The query and the retrieved context are sent to Anthropic for inference. The corpus itself never leaves our infrastructure. Under Anthropic's commercial terms, API inputs and outputs are not used to train its models by default. For higher assurance, the firm can move to Amazon Bedrock with VPC routing; we have prepared but not enabled that path.
### How do you handle cross-references between circulars?
A Neo4j sidecar stores cross-reference edges. At retrieval time, we expand the result set with documents 1-hop from the matched chunks. Improved recall on amendment-heavy queries by 23%.
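The 1-hop expansion can be illustrated without a graph database. The sketch below uses an in-memory adjacency dict as a stand-in for the Neo4j sidecar and treats edges as undirected, so that both "cites" and "is cited by" neighbours are pulled in; the direction handling and the `limit` cap are our assumptions, and the document IDs are placeholders.

```python
def expand_one_hop(matched: list[str], edges: dict[str, set[str]],
                   limit: int = 8) -> list[str]:
    """Append documents one reference away from any matched document."""
    # Build an undirected view so amendments surface from either side.
    neighbours: dict[str, set[str]] = {}
    for source, targets in edges.items():
        for target in targets:
            neighbours.setdefault(source, set()).add(target)
            neighbours.setdefault(target, set()).add(source)

    expanded = list(matched)
    seen = set(matched)
    for doc_id in matched:
        for neighbour in sorted(neighbours.get(doc_id, ())):
            if neighbour not in seen and len(expanded) < len(matched) + limit:
                seen.add(neighbour)
                expanded.append(neighbour)
    return expanded


# "circ-B" references "circ-A"; "circ-C" references "circ-B".
edges = {"circ-B": {"circ-A"}, "circ-C": {"circ-B"}}
print(expand_one_hop(["circ-B"], edges))
```

Capping the expansion matters because heavily cited circulars would otherwise flood the candidate set before reranking.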
### What is the cost per query?
About ₹2.70 per query at this firm's volume (~14k queries/month). Breakdown: Claude Sonnet 4.5 inference ₹1.70, embeddings (already-cached) ₹0.10, retrieval + rerank ₹0.40, Qdrant + Hetzner amortised ₹0.50.
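The breakdown can be turned into a small cost model to sanity-check the monthly figure and the 3x-usage claim. The sketch assumes the ₹0.50 infrastructure share is a fixed monthly cost (it is amortised over ~14k queries) while the other three components scale with volume; that split is our reading, not something the breakdown states. Amounts are held in paise to avoid floating-point drift.

```python
# Per-query components in paise (₹1 = 100 paise), taken from the breakdown above.
VARIABLE_PAISE = {"llm": 170, "embeddings": 10, "retrieval_rerank": 40}
INFRA_PAISE_PER_QUERY = 50      # Qdrant + Hetzner, amortised
BASELINE_QUERIES = 14_000       # approximate monthly volume


def monthly_cost_inr(queries: int) -> float:
    """Variable cost scales with volume; infrastructure is assumed fixed."""
    variable = sum(VARIABLE_PAISE.values()) * queries
    fixed = INFRA_PAISE_PER_QUERY * BASELINE_QUERIES
    return (variable + fixed) / 100


per_query_inr = (sum(VARIABLE_PAISE.values()) + INFRA_PAISE_PER_QUERY) / 100
print(per_query_inr)                            # all-in cost per query at baseline
print(monthly_cost_inr(BASELINE_QUERIES))       # current volume
print(monthly_cost_inr(3 * BASELINE_QUERIES))   # 3x usage
```

Under these assumptions the baseline month comes to ₹37,800 and a 3x month to ₹99,400, which matches the ~₹38k/month figure and stays just under ₹1 lakh.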
### What was the team for this build?
One senior backend engineer (LLM + RAG pipeline), one full-stack engineer (Next.js UI + auth), and one ML-savvy engineer for the eval-set tuning. Our QA lead Manvi contributed 0.4 FTE on the response-quality eval set, and our CTO Hrishikesh 0.3 FTE on the prompt-engineering iterations and the partner reviews.
### How do you compare this to building on PenLeap-style real-time AI feedback?
Our edtech product PenLeap uses a similar pattern (rubric-based scoring is also a citation-required generation problem). The infrastructure layer is shared between our consulting work and our in-house products. That is part of why we can ship these RAG projects in 6 weeks rather than 14.
## Want an Internal Knowledge Bot for Your Firm?
We build internal RAG bots for Indian professional-services firms (CA, law, consulting) in the 50-500 staff range. Typical engagement: 5-7 weeks, fixed-price ₹3.8L-₹6.4L plus ~₹40k/month operating cost. Our standard pattern: Claude Sonnet 4.5 + LlamaIndex + Qdrant. First call is with the engineer who would lead your build.