Our 30-person Softechinfra ops team had 1,400 internal wiki pages spread across Notion, Google Docs, and a legacy Confluence. Onboarding a new engineer took two full weeks of "where is the X document?" pinging in Slack. We built a RAG chatbot over the wiki in a weekend using Claude Sonnet 4.5, LlamaIndex, and Postgres pgvector. After three iterations, accuracy went from 61% to 87%. This post is the complete code walkthrough — chunking strategy, embedding setup, retrieval, the eval harness, and the three prompts that made the biggest accuracy jump.
- 61% → 87%: eval accuracy across iterations
- 2 days: weekend build (Saturday + Sunday)
- ₹2,400: monthly run cost (40 engineers)
## TL;DR — what this post delivers
A copy-runnable RAG chatbot over a Confluence/Notion/Google Docs internal wiki, using LlamaIndex for ingestion + retrieval, Postgres pgvector for storage, OpenAI text-embedding-3-large for embeddings, and Claude Sonnet 4.5 for answer generation. Eval harness with 80 ground-truth Q&A pairs. Three specific prompt changes drove accuracy from 61% to 87% — adding source citations, forcing "I don't know" on low-confidence retrievals, and chunking by semantic boundary instead of fixed tokens. Total weekend build: 2 days. Monthly cost at 40 active engineers: ₹2,400.
## Why build this yourself
Off-the-shelf RAG products (Glean, Notion AI, Slack AI) are fine for the 80% case, but they cost ₹400-800 per user per month and lock you into their indexing strategy. For an SMB engineering team whose wiki spans multiple tools, a 2-day in-house build pays back in 3-4 months at 40 users. The bigger win is control: when accuracy stalls at 75% on your team's specific terminology, you can fix it yourself. With Glean you file a ticket and wait 6 months.
## The stack (versions, December 2025)
- 📚 LlamaIndex 0.12.x: Open-source RAG framework. Best ingestion connectors (Confluence, Notion, Google Docs) and the cleanest retrieval primitives. Per the [official docs](https://developers.llamaindex.ai/python/framework/integrations/vector_stores/postgres/), pgvector integration is first-class.
- 🐘 Postgres 16 + pgvector 0.7: HNSW index with m=16, ef_construction=64. Self-hosted on a Hetzner CCX13 (₹1,200/month). Holds 1,400 docs × roughly 8 chunks each = 11,200 vectors comfortably.
- 🧠 OpenAI text-embedding-3-large: 3072-dim embeddings. $0.13 per 1M tokens at list price. Higher quality than text-embedding-3-small at this scale. One-time embedding cost for our 1,400-doc wiki: ₹220.
- 💬 Claude Sonnet 4.5: Best long-context reasoning at this price tier. ₹250/M input, ₹1,250/M output. Average answer cost: ₹0.42 per query at our wiki size.
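To make "comfortably" concrete, here is a back-of-envelope sketch of the raw vector footprint using the figures above. It assumes 4 bytes per dimension (single-precision floats) and ignores index and row overhead, so treat it as a lower bound rather than a measurement.

```python
# Back-of-envelope storage estimate for the vector table.
# Figures from the post: 1,400 docs, ~8 chunks per doc, 3072-dim embeddings.
# Assumption: 4 bytes per dimension; index and row overhead are ignored.

DOCS = 1_400
CHUNKS_PER_DOC = 8
DIMENSIONS = 3_072
BYTES_PER_FLOAT = 4


def raw_vector_bytes(docs: int, chunks_per_doc: int, dims: int) -> int:
    """Return the raw size in bytes of all embedding vectors."""
    return docs * chunks_per_doc * dims * BYTES_PER_FLOAT


total_vectors = DOCS * CHUNKS_PER_DOC
total_bytes = raw_vector_bytes(DOCS, CHUNKS_PER_DOC, DIMENSIONS)
print(f"{total_vectors:,} vectors, {total_bytes / 1024**2:.1f} MiB of raw vector data")
```

At roughly 131 MiB of raw vectors, the data fits in memory on a small VM, which is why a single self-hosted Postgres is enough at this scale.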
## The 3 prompts that drove the biggest accuracy jump
Before code, the headline insight: most of the accuracy gain came from prompt engineering, not from changing the model or the retrieval algorithm. Three changes mattered.
### Prompt change #1 — Force inline citations
Before: "Answer the question using the context below."
After: "Answer the question using ONLY the context below. After every factual claim, cite the source document in square brackets like [doc: deployment-runbook.md]. If the context doesn't contain the answer, say 'I don't have that information in the wiki.'"
Effect: accuracy 61% → 74%. Citations forced the model to ground claims in specific documents. Hallucinations dropped from 17% to 4%.
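Once the model is asked to cite in a fixed format, the citations can also be checked mechanically. The sketch below is an illustration rather than part of the pipeline described here: it extracts `[doc: ...]` markers from an answer and flags any that do not match a retrieved source. The helper names and the file names other than `deployment-runbook.md` are hypothetical.

```python
import re

# Matches the citation format requested in the prompt: [doc: some-file.md]
CITATION_RE = re.compile(r"\[doc:\s*([^\]]+?)\s*\]")


def extract_citations(answer: str) -> list[str]:
    """Return cited document names in order of first appearance, deduplicated."""
    seen: list[str] = []
    for name in CITATION_RE.findall(answer):
        if name not in seen:
            seen.append(name)
    return seen


def unsupported_citations(answer: str, retrieved_sources: list[str]) -> list[str]:
    """Return citations that do not match any retrieved source document."""
    allowed = set(retrieved_sources)
    return [name for name in extract_citations(answer) if name not in allowed]


# Hypothetical answer and source list, for illustration only.
answer = (
    "Deploys go through the staging gate first [doc: deployment-runbook.md]. "
    "Rollbacks use the previous image tag [doc: rollback-guide.md]."
)
sources = ["deployment-runbook.md", "oncall-handbook.md"]

print(extract_citations(answer))
print(unsupported_citations(answer, sources))
```

A non-empty result from `unsupported_citations` is a cheap signal that the answer cites something the retriever never surfaced, which is worth logging or counting as a hallucination in the eval.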
### Prompt change #2 — Confidence-based abstention
Before: Model always tried to answer.
After: Added retrieval-confidence threshold. If the top-3 retrieved chunks all score below 0.65 cosine similarity, return "I don't have a confident answer for this — try rephrasing or check directly with the relevant team."
Effect: accuracy 74% → 81%. Lower coverage (the bot answers 12% fewer questions) but the answers it does give are dramatically more reliable.
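The gate itself is a few lines. This is a minimal standalone sketch of the rule as stated above (abstain when the top 3 scores are all below 0.65); the function name and the sample scores are illustrative, and the threshold is something to tune against your own eval set.

```python
def should_abstain(scores: list[float], threshold: float = 0.65, top_n: int = 3) -> bool:
    """Abstain when none of the top_n retrieval scores reaches the threshold.

    An empty score list (nothing retrieved) also results in abstention.
    """
    top = sorted(scores, reverse=True)[:top_n]
    return all(score < threshold for score in top)


# Hypothetical cosine-similarity scores for three queries.
print(should_abstain([0.62, 0.58, 0.41, 0.30]))  # weak retrieval
print(should_abstain([0.71, 0.50, 0.40]))        # one strong hit
print(should_abstain([]))                        # nothing retrieved
```

Because the scores are sorted in descending order, checking only the best score against the threshold gives the same decision; the explicit top-3 form just mirrors the rule as written.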
### Prompt change #3 — Semantic chunking, not fixed-token chunking
Before: 512-token fixed chunks with 50-token overlap.
After: LlamaIndex's SemanticSplitterNodeParser with breakpoint_percentile_threshold=85. Chunks split at semantic boundaries (paragraphs of related content), with variable size of 200-800 tokens.
Effect: accuracy 81% → 87%. Retrieval surfaces more topically-coherent chunks; Claude has fewer fragmented contexts to reconcile.
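For intuition about what a percentile breakpoint does, here is a simplified, dependency-free sketch. It is not LlamaIndex's implementation: it takes precomputed distances between adjacent sentences (hypothetical numbers here; in practice they would come from embedding similarity) and starts a new chunk wherever a distance exceeds the 85th percentile.

```python
def percentile(values: list[float], pct: float) -> float:
    """Linear-interpolation percentile (pct in 0..100) over a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def split_at_breakpoints(sentences: list[str], distances: list[float], pct: float = 85):
    """Group sentences into chunks.

    distances[i] is the semantic distance between sentences[i] and
    sentences[i + 1]; a new chunk starts when it exceeds the cut-off.
    """
    cutoff = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > cutoff:
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks


# Hypothetical sentences and distances: one clear topic shift in the middle.
sentences = ["Deploy step 1.", "Deploy step 2.", "Deploy step 3.",
             "Leave policy A.", "Leave policy B.", "Leave policy C."]
distances = [0.10, 0.12, 0.80, 0.15, 0.11]

chunks = split_at_breakpoints(sentences, distances)
print(chunks)
```

A higher percentile produces fewer, larger chunks; a lower one splits more aggressively. That is the knob the 85 setting controls.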
## The complete code (Python)
This is the trimmed-down reference. Production version has tracing, deduplication, and access control on top.
### Step 1 — Ingest from Confluence + Notion + Google Docs
```python
import os
from datetime import datetime, timezone

from llama_index.readers.confluence import ConfluenceReader
from llama_index.readers.google import GoogleDocsReader
from llama_index.readers.notion import NotionPageReader


def load_wiki_documents():
    """Pull documents from all three wiki systems and tag each with its origin."""
    # Confluence (legacy)
    confluence_reader = ConfluenceReader(
        base_url=os.environ['CONFLUENCE_URL'],
        user_name=os.environ['CONFLUENCE_USER'],
        api_token=os.environ['CONFLUENCE_TOKEN'],
    )
    # Notion (current primary)
    notion_reader = NotionPageReader(integration_token=os.environ['NOTION_TOKEN'])
    # Google Docs (drift docs)
    gdocs_reader = GoogleDocsReader()

    # load_gdoc_ids() is a project helper (not shown here) that returns
    # the list of Google Doc IDs to ingest.
    batches = {
        'confluence': confluence_reader.load_data(space_key='ENG'),
        'notion': notion_reader.load_data(database_id=os.environ['NOTION_WIKI_DB']),
        'google_docs': gdocs_reader.load_data(document_ids=load_gdoc_ids()),
    }

    # Tag every doc with its source system for citation and filtering
    ingested_at = datetime.now(timezone.utc).isoformat()
    docs = []
    for source_system, batch in batches.items():
        for d in batch:
            d.metadata.setdefault('source_system', source_system)
            d.metadata['ingested_at'] = ingested_at
            docs.append(d)
    return docs
```
### Step 2 — Chunk with semantic splitter
```python
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model='text-embedding-3-large', dimensions=3072)

splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=85,
    embed_model=embed_model,
)


def chunk_documents(docs):
    nodes = splitter.get_nodes_from_documents(docs)
    # Carry metadata into chunks for citation
    for node in nodes:
        node.metadata['source_doc'] = node.metadata.get('title', 'unknown')
    return nodes
```
### Step 3 — Store in pgvector
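```python
import os

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore

# Caveat: pgvector can only build an HNSW index on the `vector` type for
# up to 2,000 dimensions. With 3072-dim embeddings, either store them as
# halfvec (check whether your PGVectorStore version exposes an option for
# this) or request fewer dimensions from the embedding model.
vector_store = PGVectorStore.from_params(
    database='wiki_rag',
    host='postgres.internal',
    password=os.environ['PG_PASSWORD'],
    port=5432,
    user='wiki_rag',
    table_name='wiki_embeddings',
    embed_dim=3072,
    hnsw_kwargs={
        'hnsw_m': 16,
        'hnsw_ef_construction': 64,
        'hnsw_ef_search': 40,
        'hnsw_dist_method': 'vector_cosine_ops',
    },
)

storage_context = StorageContext.from_defaults(vector_store=vector_store)


def build_index(nodes):
    # embed_model is the OpenAIEmbedding instance created in Step 2
    return VectorStoreIndex(
        nodes=nodes,
        storage_context=storage_context,
        embed_model=embed_model,
    )
```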
### Step 4 — Retrieval + Claude answer generation
```python
from anthropic import Anthropic
from llama_index.core.retrievers import VectorIndexRetriever

anthropic = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

RAG_PROMPT = """You are an internal-wiki assistant for the Softechinfra engineering team.
Answer the user's question using ONLY the context below. Rules:
- After every factual claim, cite the source document in square brackets like [doc: deployment-runbook.md]
- If the context doesn't contain the answer, say "I don't have that information in the wiki — try rephrasing or check directly with the relevant team"
- Keep answers under 150 words unless the question explicitly asks for detail
- Never invent code, configuration values, URLs, or people's names

Context:
{context}

Question: {question}
"""


def answer_question(question: str, index):
    retriever = VectorIndexRetriever(index=index, similarity_top_k=5)
    retrieved = retriever.retrieve(question)

    # Confidence gate. Results are ranked by similarity, so if the best
    # chunk is below the threshold, every other chunk is too.
    top_score = (retrieved[0].score or 0.0) if retrieved else 0.0
    if top_score < 0.65:
        return {
            'answer': "I don't have a confident answer. Try rephrasing or ask the relevant team directly.",
            'sources': [],
            'confidence': top_score,
        }

    # Label each chunk with its source document so the model can cite it
    context = '\n\n'.join(
        f"[{n.metadata.get('source_doc', 'unknown')}]\n{n.text}" for n in retrieved
    )
    response = anthropic.messages.create(
        model='claude-sonnet-4-5',
        max_tokens=600,
        messages=[{'role': 'user', 'content': RAG_PROMPT.format(context=context, question=question)}],
    )
    return {
        'answer': response.content[0].text,
        'sources': [n.metadata.get('source_doc', 'unknown') for n in retrieved],
        'confidence': top_score,
    }
```
### Step 5 — The eval harness
This is the part most weekend RAG builds skip — and then they don't know if their changes are helping or hurting.
```python
import json

# eval_set.json: 80 ground-truth Q&A pairs hand-curated from real Slack questions
with open('eval_set.json') as f:
    EVAL_SET = json.load(f)


def llm_judge(question, predicted, expected):
    """Use Claude to judge if predicted answer captures the expected meaning."""
    prompt = f"""You are evaluating a Q&A bot. Question: {question}
Expected answer: {expected}
Bot's answer: {predicted}
Does the bot's answer correctly address the question's intent? Answer ONLY 'yes' or 'no'.
"""
    # `anthropic` is the client created in Step 4
    r = anthropic.messages.create(
        model='claude-sonnet-4-5',
        max_tokens=10,
        messages=[{'role': 'user', 'content': prompt}],
    )
    # startswith() tolerates replies like "Yes." or "yes\n"
    return r.content[0].text.strip().lower().startswith('yes')


def run_eval(index):
    results = []
    for item in EVAL_SET:
        predicted = answer_question(item['question'], index)
        correct = llm_judge(item['question'], predicted['answer'], item['expected'])
        results.append({
            'question': item['question'],
            'correct': correct,
            'confidence': predicted['confidence'],
        })
    accuracy = sum(r['correct'] for r in results) / len(results)
    avg_confidence = sum(r['confidence'] for r in results) / len(results)
    print(f"Accuracy: {accuracy:.1%}  Avg confidence: {avg_confidence:.2f}")
    return results
```
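Since the confidence gate trades coverage for reliability, it can help to report abstention alongside accuracy. The sketch below is one possible summary function, not part of the harness as shipped: it assumes each result dict carries an extra boolean `abstained` field (hypothetical; for example, set when the bot returned no sources).

```python
def summarize(results: list[dict]) -> dict:
    """Summarize eval results that carry 'correct' and 'abstained' booleans."""
    total = len(results)
    if total == 0:
        raise ValueError("results must not be empty")
    answered = [r for r in results if not r["abstained"]]
    return {
        "accuracy": sum(r["correct"] for r in results) / total,
        "abstention_rate": (total - len(answered)) / total,
        # Accuracy restricted to questions the bot actually attempted
        "accuracy_when_answered": (
            sum(r["correct"] for r in answered) / len(answered) if answered else 0.0
        ),
    }


# Hypothetical results for four eval questions.
sample = [
    {"correct": True, "abstained": False},
    {"correct": True, "abstained": False},
    {"correct": False, "abstained": False},
    {"correct": False, "abstained": True},
]
summary = summarize(sample)
print(summary)
```

Tracking `accuracy_when_answered` separately makes it visible whether a threshold change improved the answers or merely reduced how many questions get one.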
## The weekend build (hour by hour)
1. Saturday morning (4 hrs): Stack setup + ingestion. Provision Hetzner CCX13. Install Postgres 16 + pgvector. Set up Python venv with LlamaIndex 0.12. Configure Confluence/Notion/Google Docs API tokens. Run the ingestion script. 1,400 docs took 22 minutes to pull and embed.
2. Saturday afternoon (5 hrs): First answer + eval set. Wire up the retrieval + Claude answer generator. Get the first end-to-end answer in 1.5 hours. Spend the rest curating 80 ground-truth Q&As from real Slack questions over the past 30 days. The eval set is the part that pays back the most.
3. Saturday evening (2 hrs): First eval run. Run eval. Baseline: 61% accuracy. Identify failure modes: hallucinated config values, confidently wrong on policy questions, retrieved chunks too fragmented for nuanced questions.
4. Sunday morning (3 hrs): Prompt iteration #1, citations. Add inline citation requirement and "say I don't know" instruction. Re-run eval: 74% (+13pp). Hallucination rate drops from 17% to 4%.
5. Sunday afternoon (2 hrs): Prompt iteration #2, confidence gate. Add the 0.65 cosine-similarity threshold. Re-run eval: 81% (+7pp). 12% of questions now get the abstention message; that's the price.
6. Sunday evening (3 hrs): Semantic chunking + Slack bot. Switch from fixed-token chunking to SemanticSplitterNodeParser. Re-embed (45 min). Re-run eval: 87% (+6pp). Wire a thin Slack bot for the team. Ship at 9pm Sunday.
## Pre-launch checklist
- Eval set with ≥50 ground-truth Q&A pairs (curated from real questions)
- Citations required in every answer for traceability
- Confidence gate to abstain when retrieval is weak
- Semantic chunking, not fixed-token (or at least tested both)
- Source-system metadata preserved on every chunk for filtering
- Access control — engineers shouldn't query HR/finance docs
- Daily re-ingest cron for changed wiki pages
- Slack bot wrapper for low-friction team usage
- Cost monitoring (per-day query count + token spend)
- Feedback button (thumbs-up/down) wired to a Postgres table for ongoing eval
## Common mistakes (and the fix)
Symptom: bot makes up plausible-sounding configuration values. Cause: no "never invent" instruction in prompt. Fix: explicit prompt rule + low-confidence abstention.
Symptom: eval accuracy is 92% but real users are unhappy. Cause: eval set drawn from same wiki the bot retrieves from — overfit. Fix: include questions whose answer is NOT in the wiki; the bot should abstain on those.
Symptom: retrieval surfaces irrelevant chunks for short queries. Cause: short queries don't have enough semantic signal. Fix: query rewriting — use a tiny LLM to expand the query before embedding ("how do we deploy" → "what is the production deployment process for our backend services").
Symptom: re-embedding the wiki takes hours. Cause: re-embedding everything on every change. Fix: hash-based diff — only re-embed chunks whose source content changed. We track chunk content hash in a separate table.
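The hash-based diff from the last fix can be sketched in a few lines. This version keeps the hashes in plain dicts to stay self-contained (in the setup described above they live in a separate Postgres table), and it assumes chunk IDs are stable between runs. The function and variable names are illustrative.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 hex digest of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reembedding(current_chunks: dict[str, str], stored_hashes: dict[str, str]):
    """Compare current chunk text against previously stored hashes.

    Returns (to_embed, to_delete): IDs of new or changed chunks, and IDs
    of chunks that no longer exist in the wiki.
    """
    to_embed = sorted(
        chunk_id
        for chunk_id, text in current_chunks.items()
        if stored_hashes.get(chunk_id) != content_hash(text)
    )
    to_delete = sorted(set(stored_hashes) - set(current_chunks))
    return to_embed, to_delete


# Hypothetical state: what was embedded last night vs. today's wiki content.
stored = {
    "a": content_hash("alpha"),
    "b": content_hash("beta"),
    "c": content_hash("gamma"),
}
current = {"a": "alpha", "b": "beta v2", "d": "delta"}

to_embed, to_delete = plan_reembedding(current, stored)
print(to_embed, to_delete)
```

One caveat: with semantic chunking, an edit can shift chunk boundaries, so hashing at the document level first and re-chunking only changed documents may be the more practical granularity.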
## When NOT to build this yourself
Skip the DIY if (a) your wiki is under 100 docs — Glean/Notion AI's free tier or even just better-organized docs is sufficient, (b) your team is under 10 engineers — the build payback period is too long, (c) you have strict compliance requirements that need vendor SOC2 (we self-host but on a single VM; an enterprise audit might want managed). For (c), check if your existing AWS or GCP contract includes Bedrock Agents or Vertex AI Search — both are credible alternatives.
## Real outcomes — Softechinfra ops team (90 days in)
- Onboarding time for new engineers: 14 days → 6 days (measured by "time to first PR merged")
- Slack #where-is-this channel volume: 38 questions/week → 4 questions/week
- Bot accuracy at 90 days: 87% (no degradation, eval re-runs weekly)
- Bot abstention rate: 12% (the questions where it says "I don't know")
- Query volume: 280 queries/week across 30 engineers
- Average answer time: 2.4 seconds
- Cost per query: ₹0.42 (LLM) + amortized infra negligible at our volume
We've since added a feedback widget — every answer has a thumbs-up/down. Down-votes flow into our eval set as future test cases. The eval set has grown from 80 to 240 Q&A pairs in 90 days.
## What we'd add today if rebuilding
Three improvements we're shipping in Q1 2026.
Hybrid retrieval (BM25 + vector). Pure vector retrieval misses queries with very specific terminology ("error code 4451"). BM25 sparse retrieval catches those. Combining the two via reciprocal rank fusion is worth ~3-4pp accuracy.
Conversational memory. Right now each query is independent. Adding a 5-turn conversation buffer with question reformulation handles "and what about for staging?" follow-ups.
Multimodal ingest. Our wiki has 240 architecture diagrams as PNGs. We currently skip them. Adding GPT-4o vision to extract text + structure from diagrams would cover ~12% of currently-unanswerable questions.
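For the hybrid retrieval item, reciprocal rank fusion is simple enough to sketch independently of any retriever. Each input is a ranked list of document IDs (one from vector search, one from BM25), and each document scores the sum of 1 / (k + rank) across lists. The value k=60 is the conventional default; the document IDs below are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Ties are broken alphabetically so the output is deterministic.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the query "error code 4451".
vector_hits = ["runbook", "oncall", "errors"]
bm25_hits = ["errors", "runbook", "billing"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Documents that appear in both lists rise to the top even if neither retriever ranked them first, which is the property that helps with exact-term queries.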
## How this connects to the broader work
We use this same RAG pattern in client builds via our AI automation team. The Pune logistics agent we built (covered in our Bedrock AgentCore comparison post) uses the same chunking + retrieval + citation pattern, just with AgentCore Memory underneath. The patterns port across stacks.
For the WhatsApp version of this same pattern (RAG over a product catalog instead of an internal wiki), see our WhatsApp + OpenAI bot walkthrough from November. Same architecture, different surface area.
Worth following: [r/LangChain](https://www.reddit.com/r/LangChain/) for retrieval pattern debates and [r/MachineLearning](https://www.reddit.com/r/MachineLearning/) for embedding-quality benchmarks. For a deeper take on Postgres + open-source models, see [Christopher Gs' production-RAG essay](https://christophergs.com/blog/production-rag-with-postgres-vector-store-open-source-models).
## FAQ
### Why LlamaIndex instead of LangChain?
For RAG specifically, LlamaIndex's ingestion connectors (Confluence, Notion, Google Docs, Slack) are more mature and the retrieval primitives are cleaner. LangChain is more general-purpose and better for full agent orchestration. For a pure RAG-over-wiki use case, LlamaIndex is the right tool.
### Why pgvector instead of Pinecone or Qdrant?
At 11k vectors, pgvector on a ₹1,200/month Postgres handles it comfortably with no managed-service cost. Pinecone or Qdrant make sense above ~500k vectors or when you need cross-region replication. For team-sized wikis, pgvector wins on simplicity.
### How do you handle wiki updates?
Daily cron runs the ingestion pipeline, computes a hash of each chunk's content, and only re-embeds chunks whose hash changed. Re-embedding cost on a typical day with 5-15 doc changes: under ₹2.
### Why text-embedding-3-large over text-embedding-3-small?
In our eval, large gave +4pp accuracy at our wiki size. The cost difference is trivial (₹220 one-time vs ₹70 one-time). At larger scales (>100k vectors), the math shifts toward 3-small for storage cost reasons.
### How do you handle access control on internal docs?
Each chunk carries the source-system access level in metadata. Before retrieval, we filter the vector search to only chunks the requesting user is authorized to see. This requires syncing user/group memberships from your IdP — adds an evening of work.
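As a rough illustration of the authorization check, here is an in-memory sketch. It is a simplification: in the setup described above the filter is applied inside the vector search rather than on a Python list, and the `Chunk` class, group names, and field names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source_doc: str
    allowed_groups: frozenset[str]  # groups permitted to read this chunk


def visible_chunks(chunks: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    """Keep only chunks that share at least one group with the user."""
    return [c for c in chunks if c.allowed_groups & user_groups]


# Hypothetical chunks and IdP group memberships.
chunks = [
    Chunk("Deploy via the staging gate.", "deployment-runbook.md", frozenset({"engineering"})),
    Chunk("Salary bands for 2025.", "compensation.md", frozenset({"hr", "finance"})),
    Chunk("Office Wi-Fi setup.", "it-basics.md", frozenset({"engineering", "hr"})),
]

engineer_view = visible_chunks(chunks, {"engineering"})
print([c.source_doc for c in engineer_view])
```

Filtering before retrieval, rather than after, matters: a post-filter can leave the model with fewer than the intended number of chunks, or none at all.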
### Why Claude Sonnet 4.5 instead of GPT-4o for the answer generation?
Sonnet 4.5 follows the citation instruction more reliably and is less prone to confabulating beyond the retrieved context. GPT-4o is comparable on quality but a bit more verbose. For citation-heavy production RAG, Sonnet is our default.
### Can I run this on Bedrock instead of direct Anthropic API?
Yes: change the Anthropic SDK call to a Bedrock invocation. If you're already on AWS, that's a 30-line change. The rest of the stack works unchanged. The recent Opus 4.5 launch (covered in our migration guide) is also relevant: for some hard wiki questions, Opus 4.5 medium-effort gives noticeably better answers at acceptable cost.