Build a RAG Chatbot for Your Internal Wiki in a Weekend (Claude Sonnet 4.5 + LlamaIndex + Postgres pgvector)

Our 30-person Softechinfra ops team had 1,400 internal wiki pages spread across Notion, Google Docs, and a legacy Confluence. Onboarding a new engineer took two full weeks of "where is the X document?" pinging in Slack. We built a RAG chatbot over the wiki in a weekend using Claude Sonnet 4.5, LlamaIndex, and Postgres pgvector. After three iterations, accuracy went from 61% to 87%. This post is the complete code walkthrough — chunking strategy, embedding setup, retrieval, the eval harness, and the three prompts that made the biggest accuracy jump.

- Eval accuracy across iterations: 61% → 87%
- Wiki docs ingested: 1,400
- Weekend build (Saturday + Sunday): 2 days
- Monthly run cost (40 engineers): ₹2,400

## TL;DR: what this post delivers

A copy-runnable RAG chatbot over a Confluence/Notion/Google Docs internal wiki, using LlamaIndex for ingestion + retrieval, Postgres pgvector for storage, OpenAI text-embedding-3-large for embeddings, and Claude Sonnet 4.5 for answer generation. Eval harness with 80 ground-truth Q&A pairs. Three specific prompt changes drove accuracy from 61% to 87%: adding source citations, forcing "I don't know" on low-confidence retrievals, and chunking by semantic boundary instead of fixed tokens. Total weekend build: 2 days. Monthly cost at 40 active engineers: ₹2,400.

## Why build this yourself

Off-the-shelf RAG products (Glean, Notion AI, Slack AI) are fine for the 80% case but cost ₹400-800 per-user-per-month and lock you into their indexing strategy. For an SMB engineering team where the wiki spans multiple tools, a 2-day in-house build pays back in 3-4 months at 40 users.

The bigger win is control: when accuracy hits 75% on your specific terminology, you can fix it. With Glean you file a ticket and wait 6 months.

## The stack (versions, December 2025)

- LlamaIndex 0.12.x: Open-source RAG framework. Best ingestion connectors (Confluence, Notion, Google Docs) and the cleanest retrieval primitives. Per the [official docs](https://developers.llamaindex.ai/python/framework/integrations/vector_stores/postgres/), pgvector integration is first-class.
- Postgres 16 + pgvector 0.7: HNSW index with m=16, ef_construction=64. Self-hosted on a Hetzner CCX13 (₹1,200/month). Holds 1,400 docs × 8 chunks = 11,200 vectors comfortably.
- OpenAI text-embedding-3-large: 3072-dim embeddings. US$0.13 per 1M tokens (the list price is in dollars, not rupees). Higher quality than text-embedding-3-small at this scale. One-time embedding cost for our 1,400-doc wiki: ₹220.
- Claude Sonnet 4.5: Best long-context reasoning at this price tier. ₹250/M input, ₹1,250/M output. Average answer cost: ₹0.42 per query at our wiki size.
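As a rough sanity check on the figures above, the per-query cost and the raw vector storage can be reproduced with simple arithmetic. The token counts per query are illustrative assumptions (the post does not state them); the prices and the vector count are the ones quoted above.

```python
# Back-of-envelope check of the stack numbers quoted above.
# ASSUMPTION: roughly 1,200 input tokens and 100 output tokens per query.

INPUT_PRICE_PER_M = 250.0    # rupees per 1M input tokens (from the post)
OUTPUT_PRICE_PER_M = 1250.0  # rupees per 1M output tokens (from the post)


def query_cost_inr(input_tokens: int, output_tokens: int) -> float:
    """LLM cost of a single query in rupees."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_float: int = 4) -> int:
    """Raw embedding storage, ignoring index and row overhead."""
    return num_vectors * dims * bytes_per_float


cost = query_cost_inr(input_tokens=1_200, output_tokens=100)
storage_mib = raw_vector_bytes(1_400 * 8, 3072) / (1024 ** 2)
print(f"about ₹{cost:.3f} per query, about {storage_mib:.0f} MiB of raw vectors")
```

Under these assumed token counts the estimate lands close to the ₹0.42 per query quoted above, and the raw vectors fit easily in memory on a small VM.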

## The 3 prompts that drove the biggest accuracy jump

Before code, the headline insight: most of the accuracy gain came from prompt engineering, not from changing the model or the retrieval algorithm. Three changes mattered.

### Prompt change #1: Force inline citations

Before: "Answer the question using the context below."

After: "Answer the question using ONLY the context below. After every factual claim, cite the source document in square brackets like [doc: deployment-runbook.md]. If the context doesn't contain the answer, say 'I don't have that information in the wiki.'"

Effect: accuracy 61% → 74%. Citations forced the model to ground claims in specific documents. Hallucinations dropped from 17% to 4%.

### Prompt change #2: Confidence-based abstention

Before: Model always tried to answer.

After: Added retrieval-confidence threshold. If the top-3 retrieved chunks all score below 0.65 cosine similarity, return "I don't have a confident answer for this. Try rephrasing or check directly with the relevant team."

Effect: accuracy 74% → 81%. Lower coverage (the bot answers 12% fewer questions) but the answers it does give are dramatically more reliable.

### Prompt change #3: Semantic chunking, not fixed-token chunking

Before: 512-token fixed chunks with 50-token overlap.

After: LlamaIndex's SemanticSplitterNodeParser with breakpoint_percentile_threshold=85. Chunks split at semantic boundaries (paragraphs of related content), variable size 200-800 tokens.

Effect: accuracy 81% → 87%. Retrieval surfaces more topically-coherent chunks; Claude has fewer fragmented contexts to reconcile.

## The complete code (Python)

This is the trimmed-down reference. Production version has tracing, deduplication, and access control on top.

### Step 1: Ingest from Confluence + Notion + Google Docs

```python
import os
from datetime import datetime, timezone

from llama_index.readers.confluence import ConfluenceReader
from llama_index.readers.google import GoogleDocsReader
from llama_index.readers.notion import NotionPageReader


def tag_source(docs, source_system):
    """Record which system each document came from (used for citation and filtering)."""
    for d in docs:
        d.metadata['source_system'] = source_system
    return docs


def load_wiki_documents():
    docs = []

    # Confluence (legacy)
    confluence_reader = ConfluenceReader(
        base_url=os.environ['CONFLUENCE_URL'],
        user_name=os.environ['CONFLUENCE_USER'],
        api_token=os.environ['CONFLUENCE_TOKEN'],
    )
    docs.extend(tag_source(confluence_reader.load_data(space_key='ENG'), 'confluence'))

    # Notion (current primary)
    notion_reader = NotionPageReader(integration_token=os.environ['NOTION_TOKEN'])
    notion_docs = notion_reader.load_data(database_id=os.environ['NOTION_WIKI_DB'])
    docs.extend(tag_source(notion_docs, 'notion'))

    # Google Docs (drift docs). load_gdoc_ids() is a helper not shown in this
    # post; it is expected to return a list of Google Doc IDs.
    gdocs_reader = GoogleDocsReader()
    gdocs = gdocs_reader.load_data(document_ids=load_gdoc_ids())
    docs.extend(tag_source(gdocs, 'google_docs'))

    # Stamp every doc with a timezone-aware UTC ingestion time
    ingested_at = datetime.now(timezone.utc).isoformat()
    for d in docs:
        d.metadata['ingested_at'] = ingested_at

    return docs
```
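The ingestion function calls `load_gdoc_ids()`, which the post does not show. A minimal sketch, assuming the document IDs are kept one per line in a plain-text file (the `gdoc_ids.txt` file name and its format are hypothetical, not taken from the original build):

```python
from pathlib import Path


def load_gdoc_ids(path: str = 'gdoc_ids.txt') -> list[str]:
    """Return Google Doc IDs listed one per line.

    Blank lines and lines starting with '#' are skipped.
    """
    ids = []
    for line in Path(path).read_text(encoding='utf-8').splitlines():
        line = line.strip()
        if line and not line.startswith('#'):
            ids.append(line)
    return ids
```

Keeping the list in a file under version control makes it easy to review which stray Google Docs are part of the index; pulling the IDs from a Drive folder listing would be another reasonable choice.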

### Step 2: Chunk with semantic splitter

```python
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model='text-embedding-3-large', dimensions=3072)

# Break a document where the embedding distance between neighbouring
# sentence groups exceeds the 85th percentile for that document.
splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=85,
    embed_model=embed_model,
)


def chunk_documents(docs):
    nodes = splitter.get_nodes_from_documents(docs)
    # Carry metadata into chunks for citation
    for node in nodes:
        node.metadata['source_doc'] = node.metadata.get('title', 'unknown')
    return nodes
```
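To make the percentile idea concrete, here is a simplified, dependency-free sketch of percentile-based breakpoint detection: compute the cosine distance between neighbouring sentence embeddings and start a new chunk wherever that distance exceeds the chosen percentile. It illustrates the general technique only and is not LlamaIndex's actual implementation; the toy 2-D vectors stand in for real embeddings.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values, pct):
    """Linear-interpolation percentile, with pct between 0 and 100."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def split_at_breakpoints(sentences, embeddings, pct=85):
    """Group sentences into chunks, breaking where the adjacent distance is above the percentile."""
    if len(sentences) < 2:
        return [list(sentences)]
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks


sentences = [
    "Deploys run from main.",
    "Deploys need two approvals.",
    "Leave requests go to HR.",
    "HR replies within a day.",
]
toy_embeddings = [(1.0, 0.0), (0.9, 0.1), (0.1, 0.9), (0.0, 1.0)]
chunks = split_at_breakpoints(sentences, toy_embeddings)
print(chunks)
```

In this toy input the only large jump in distance is between the deployment sentences and the HR sentences, so the split falls there. A lower percentile produces more, smaller chunks; a higher one produces fewer, larger chunks.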

### Step 3: Store in pgvector

```python
import os

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore

# Caveat: pgvector's HNSW index on the plain `vector` type is limited to
# 2,000 dimensions. For 3072-dim embeddings, verify that your setup stores
# them as `halfvec` (pgvector 0.7+), or request fewer dimensions from the
# embedding model.
vector_store = PGVectorStore.from_params(
    database='wiki_rag',
    host='postgres.internal',
    password=os.environ['PG_PASSWORD'],
    port=5432,
    user='wiki_rag',
    table_name='wiki_embeddings',
    embed_dim=3072,
    hnsw_kwargs={
        'hnsw_m': 16,
        'hnsw_ef_construction': 64,
        'hnsw_ef_search': 40,
        'hnsw_dist_method': 'vector_cosine_ops',
    },
)

storage_context = StorageContext.from_defaults(vector_store=vector_store)


def build_index(nodes):
    # embed_model is the OpenAIEmbedding instance defined in Step 2
    return VectorStoreIndex(
        nodes=nodes,
        storage_context=storage_context,
        embed_model=embed_model,
    )
```

### Step 4: Retrieval + Claude answer generation

```python
from anthropic import Anthropic
from llama_index.core.retrievers import VectorIndexRetriever

anthropic = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

RAG_PROMPT = """You are an internal-wiki assistant for Softechinfra engineering team.

Answer the user's question using ONLY the context below. Rules:
- After every factual claim, cite the source document in square brackets like [doc: deployment-runbook.md]
- If the context doesn't contain the answer, say "I don't have that information in the wiki. Try rephrasing or check directly with the relevant team"
- Keep answers under 150 words unless the question explicitly asks for detail
- Never invent code, configuration values, URLs, or people's names

Context:
{context}

Question: {question}
"""

MIN_TOP_SCORE = 0.65  # cosine-similarity threshold for the confidence gate


def answer_question(question: str, index):
    retriever = VectorIndexRetriever(index=index, similarity_top_k=5)
    retrieved = retriever.retrieve(question)

    # Confidence gate: results come back sorted by score, so if the best
    # chunk is below the threshold, every chunk is.
    top_score = (retrieved[0].score or 0.0) if retrieved else 0.0
    if top_score < MIN_TOP_SCORE:
        return {
            'answer': "I don't have a confident answer. Try rephrasing or ask the relevant team directly.",
            'sources': [],
            'confidence': top_score,
        }

    # Label each chunk in the same format the prompt asks the model to cite
    context = '\n\n'.join(
        f"[doc: {n.metadata.get('source_doc', 'unknown')}]\n{n.text}" for n in retrieved
    )

    response = anthropic.messages.create(
        model='claude-sonnet-4-5',
        max_tokens=600,
        messages=[{'role': 'user', 'content': RAG_PROMPT.format(context=context, question=question)}],
    )

    return {
        'answer': response.content[0].text,
        'sources': [n.metadata.get('source_doc', 'unknown') for n in retrieved],
        'confidence': top_score,
    }
```
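Because the prompt asks for citations in a fixed `[doc: ...]` format, an answer can be checked mechanically before it reaches the user. The following sketch is not part of the original build; it assumes citations follow the exact bracket format from the prompt and flags any cited document that was not among the retrieved sources.

```python
import re

# Matches "[doc: some-file.md]" and captures the document name.
CITATION_RE = re.compile(r"\[doc:\s*([^\]]+?)\s*\]")


def check_citations(answer: str, sources: list[str]) -> dict:
    """Compare the documents cited in an answer against the retrieved sources."""
    cited = {name.strip() for name in CITATION_RE.findall(answer)}
    allowed = set(sources)
    return {
        'cited': sorted(cited),
        'unknown': sorted(cited - allowed),  # cited but never retrieved
        'has_citation': bool(cited),
    }


report = check_citations(
    "Deploys go through the staging gate [doc: deployment-runbook.md].",
    ["deployment-runbook.md", "oncall-guide.md"],
)
print(report)
```

An answer with no citation at all, or with a citation to a document that was never retrieved, is a reasonable candidate for logging or for falling back to the abstention message.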

### Step 5: The eval harness

This is the part most weekend RAG builds skip, and then they don't know if their changes are helping or hurting.

```python
import json

# eval_set.json: 80 ground-truth Q&A pairs hand-curated from real Slack questions
with open('eval_set.json') as f:
    EVAL_SET = json.load(f)


def llm_judge(question, predicted, expected):
    """Use Claude to judge if predicted answer captures the expected meaning."""
    prompt = f"""You are evaluating a Q&A bot. Question: {question}
Expected answer: {expected}
Bot's answer: {predicted}

Does the bot's answer correctly address the question's intent? Answer ONLY 'yes' or 'no'.
"""
    # `anthropic` is the client created in Step 4
    r = anthropic.messages.create(
        model='claude-sonnet-4-5',
        max_tokens=10,
        messages=[{'role': 'user', 'content': prompt}],
    )
    # Tolerate replies such as "Yes." by checking the prefix only
    return r.content[0].text.strip().lower().startswith('yes')


def run_eval(index):
    results = []
    for item in EVAL_SET:
        predicted = answer_question(item['question'], index)
        correct = llm_judge(item['question'], predicted['answer'], item['expected'])
        results.append({
            'question': item['question'],
            'correct': correct,
            'confidence': predicted['confidence'],
        })

    accuracy = sum(r['correct'] for r in results) / len(results)
    avg_confidence = sum(r['confidence'] for r in results) / len(results)
    print(f"Accuracy: {accuracy:.1%}  Avg confidence: {avg_confidence:.2f}")
    return results
```
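One extension worth considering is to score unanswerable questions separately, so that a correct abstention counts as a pass instead of a miss. The sketch below assumes each eval item carries a hypothetical `answerable` flag and that abstentions can be recognised by the fixed phrases the bot returns; neither is part of the harness above.

```python
# Phrases the bot uses when it declines to answer (lower-cased).
ABSTAIN_MARKERS = (
    "i don't have a confident answer",
    "i don't have that information",
)


def is_abstention(answer: str) -> bool:
    """True if the answer contains one of the known abstention phrases."""
    text = answer.lower()
    return any(marker in text for marker in ABSTAIN_MARKERS)


def score_item(answerable: bool, answer: str, judged_correct: bool) -> bool:
    """An unanswerable question passes only if the bot abstains."""
    if not answerable:
        return is_abstention(answer)
    return judged_correct and not is_abstention(answer)


def summarize(rows):
    """rows: iterable of (answerable, answer, judged_correct) tuples."""
    rows = list(rows)
    passed = sum(score_item(*row) for row in rows)
    abstained = sum(is_abstention(row[1]) for row in rows)
    return {
        'accuracy': passed / len(rows),
        'abstention_rate': abstained / len(rows),
    }


rows = [
    (True, "Use the runbook [doc: deployment-runbook.md]", True),
    (False, "I don't have a confident answer. Try rephrasing.", False),
    (False, "The VPN password is rotated weekly.", True),
    (True, "I don't have that information in the wiki.", False),
]
summary = summarize(rows)
print(summary)
```

Reporting accuracy and abstention rate side by side makes the trade-off from the confidence gate visible on every eval run.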

## The weekend build (hour by hour)

Saturday morning (4 hrs) — Stack setup + ingestion

Provision Hetzner CCX13. Install Postgres 16 + pgvector. Set up Python venv with LlamaIndex 0.12. Configure Confluence/Notion/Google Docs API tokens. Run the ingestion script. 1,400 docs took 22 minutes to pull and embed.

Saturday afternoon (5 hrs) — First answer + eval set

Wire up the retrieval + Claude answer generator. Get the first end-to-end answer in 1.5 hours. Spend the rest curating 80 ground-truth Q&As from real Slack questions over the past 30 days. The eval set is the part that pays back the most.

Saturday evening (2 hrs) — First eval run

Run eval. Baseline: 61% accuracy. Identify failure modes: hallucinated config values, confidently wrong on policy questions, retrieved chunks too fragmented for nuanced questions.

Sunday morning (3 hrs) — Prompt iteration #1: citations

Add inline citation requirement and "say I don't know" instruction. Re-run eval: 74% (+13pp). Hallucination rate drops from 17% to 4%.

Sunday afternoon (2 hrs) — Prompt iteration #2: confidence gate

Add the 0.65 cosine-similarity threshold. Re-run eval: 81% (+7pp). 12% of questions now get the abstention message — that's the price.

Sunday evening (3 hrs) — Semantic chunking + Slack bot

Switch from fixed-token chunking to SemanticSplitterNodeParser. Re-embed (45 min). Re-run eval: 87% (+6pp). Wire a thin Slack bot for the team. Ship at 9pm Sunday.

## Pre-launch checklist

- Eval set with ≥50 ground-truth Q&A pairs (curated from real questions)
- Citations required in every answer for traceability
- Confidence gate to abstain when retrieval is weak
- Semantic chunking, not fixed-token (or at least tested both)
- Source-system metadata preserved on every chunk for filtering
- Access control: engineers shouldn't query HR/finance docs
- Daily re-ingest cron for changed wiki pages
- Slack bot wrapper for low-friction team usage
- Cost monitoring (per-day query count + token spend)
- Feedback button (thumbs-up/down) wired to a Postgres table for ongoing eval

## Common mistakes (and the fix)

- Symptom: bot makes up plausible-sounding configuration values. Cause: no "never invent" instruction in prompt. Fix: explicit prompt rule + low-confidence abstention.
- Symptom: eval accuracy is 92% but real users are unhappy. Cause: eval set drawn from same wiki the bot retrieves from, so it is overfit. Fix: include questions whose answer is NOT in the wiki; the bot should abstain on those.
- Symptom: retrieval surfaces irrelevant chunks for short queries. Cause: short queries don't have enough semantic signal. Fix: query rewriting. Use a tiny LLM to expand the query before embedding ("how do we deploy" → "what is the production deployment process for our backend services").
- Symptom: re-embedding the wiki takes hours. Cause: re-embedding everything on every change. Fix: hash-based diff. Only re-embed chunks whose source content changed. We track chunk content hash in a separate table.

## When NOT to build this yourself

Skip the DIY if (a) your wiki is under 100 docs (Glean/Notion AI's free tier or even just better-organized docs is sufficient), (b) your team is under 10 engineers (the build payback period is too long), (c) you have strict compliance requirements that need vendor SOC2 (we self-host but on a single VM; an enterprise audit might want managed). For (c), check if your existing AWS or GCP contract includes Bedrock Agents or Vertex AI Search; both are credible alternatives.
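The hash-based diff listed under Common mistakes can be sketched as follows. This is a minimal illustration that holds chunk hashes in a dict keyed by a chunk ID; the post keeps them in a separate Postgres table whose schema is not shown, so the ID scheme here is an assumption.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode('utf-8')).hexdigest()


def diff_chunks(current: dict[str, str], stored_hashes: dict[str, str]):
    """Return (ids to re-embed, ids to delete).

    current:       chunk_id -> chunk text from today's ingestion run
    stored_hashes: chunk_id -> hash recorded at the previous run
    """
    to_embed = [
        cid for cid, text in current.items()
        if stored_hashes.get(cid) != content_hash(text)
    ]
    to_delete = [cid for cid in stored_hashes if cid not in current]
    return to_embed, to_delete


stored = {
    'runbook#0': content_hash('Deploy from main.'),
    'runbook#1': content_hash('Two approvals needed.'),
    'old-page#0': content_hash('Deprecated.'),
}
today = {
    'runbook#0': 'Deploy from main.',
    'runbook#1': 'Three approvals needed.',
    'new-page#0': 'Fresh content.',
}
to_embed, to_delete = diff_chunks(today, stored)
print(to_embed, to_delete)
```

One caveat: with semantic chunking, an edit can shift chunk boundaries within a document, so position-based chunk IDs may not line up between runs. Hashing at document level first, and re-chunking only changed documents, is a simpler variant.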
## Real outcomes: Softechinfra ops team (90 days in)

- Onboarding time for new engineers: 14 days → 6 days (measured by "time to first PR merged")
- Slack #where-is-this channel volume: 38 questions/week → 4 questions/week
- Bot accuracy at 90 days: 87% (no degradation, eval re-runs weekly)
- Bot abstention rate: 12% (the questions where it says "I don't know")
- Query volume: 280 queries/week across 30 engineers
- Average answer time: 2.4 seconds
- Cost per query: ₹0.42 (LLM) + amortized infra negligible at our volume

We've since added a feedback widget: every answer has a thumbs-up/down. Down-votes flow into our eval set as future test cases. The eval set has grown from 80 to 240 Q&A pairs in 90 days.

## What we'd add today if rebuilding

Three improvements we're shipping in Q1 2026.

- Hybrid retrieval (BM25 + vector). Pure vector retrieval misses queries with very specific terminology ("error code 4451"). BM25 sparse retrieval catches those. Combining the two via reciprocal rank fusion is worth ~3-4pp accuracy.
- Conversational memory. Right now each query is independent. Adding a 5-turn conversation buffer with question reformulation handles "and what about for staging?" follow-ups.
- Multimodal ingest. Our wiki has 240 architecture diagrams as PNGs. We currently skip them. Adding GPT-4o vision to extract text + structure from diagrams would cover ~12% of currently-unanswerable questions.

## How this connects to the broader work

We use this same RAG pattern in client builds via our AI automation team. The Pune logistics agent we built (covered in our Bedrock AgentCore comparison post) uses the same chunking + retrieval + citation pattern, just with AgentCore Memory underneath. The patterns port across stacks.

For the WhatsApp version of this same pattern (RAG over a product catalog instead of an internal wiki), see our WhatsApp + OpenAI bot walkthrough from November. Same architecture, different surface area.
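Reciprocal rank fusion, mentioned above for hybrid retrieval, is simple enough to sketch without any library. Each document's fused score is the sum of 1 / (k + rank) over every ranking it appears in. The constant k = 60 is a commonly used default and the document names are made up for illustration; this is not the code being shipped.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    rankings: list of lists, each ordered best-first.
    Returns document IDs sorted by fused score, best first
    (ties broken alphabetically for determinism).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the same query from two retrievers
bm25_hits = ["error-codes.md", "deploy-runbook.md", "oncall.md"]
vector_hits = ["deploy-runbook.md", "oncall.md", "error-codes.md"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Because the method uses only ranks, it needs no calibration between BM25 scores and cosine similarities, which is why it is a popular first choice for combining sparse and dense results.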
Reddit threads worth following: [r/LangChain](https://www.reddit.com/r/LangChain/) for retrieval pattern debates, [r/MachineLearning](https://www.reddit.com/r/MachineLearning/) for embedding-quality benchmarks, and [Christopher Gs' production-RAG essay](https://christophergs.com/blog/production-rag-with-postgres-vector-store-open-source-models) for a deeper take on Postgres + open-source models.

## FAQ

### Why LlamaIndex instead of LangChain?

For RAG specifically, LlamaIndex's ingestion connectors (Confluence, Notion, Google Docs, Slack) are more mature and the retrieval primitives are cleaner. LangChain is more general-purpose and better for full agent orchestration. For a pure RAG-over-wiki use case, LlamaIndex is the right tool.

### Why pgvector instead of Pinecone or Qdrant?

At 11k vectors, pgvector on a ₹1,200/month Postgres handles it comfortably with no managed-service cost. Pinecone or Qdrant make sense above ~500k vectors or when you need cross-region replication. For team-sized wikis, pgvector wins on simplicity.

### How do you handle wiki updates?

Daily cron runs the ingestion pipeline, computes a hash of each chunk's content, and only re-embeds chunks whose hash changed. Re-embedding cost on a typical day with 5-15 doc changes: under ₹2.

### Why text-embedding-3-large over text-embedding-3-small?

In our eval, large gave +4pp accuracy at our wiki size. The cost difference is trivial (₹220 one-time vs ₹70 one-time). At larger scales (>100k vectors), the math shifts toward 3-small for storage cost reasons.

### How do you handle access control on internal docs?

Each chunk carries the source-system access level in metadata. Before retrieval, we filter the vector search to only chunks the requesting user is authorized to see. This requires syncing user/group memberships from your IdP, which adds an evening of work.

### Why Claude Sonnet 4.5 instead of GPT-4o for the answer generation?

Sonnet 4.5 follows the citation instruction more reliably and is less prone to confabulating beyond the retrieved context. GPT-4o is comparable on quality but a bit more verbose. For citation-heavy production RAG, Sonnet is our default.

### Can I run this on Bedrock instead of direct Anthropic API?

Yes: change the Anthropic SDK call to a Bedrock invocation. If you're already on AWS, that's a 30-line change. The rest of the stack works unchanged. The recent Opus 4.5 launch (covered in our migration guide) is also relevant: for some hard wiki questions, Opus 4.5 medium-effort gives noticeably better answers at acceptable cost.
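For the access-control question in the FAQ, the core logic is mapping a user's groups to the access levels they may read, then restricting retrieval to chunks tagged with those levels. The sketch below shows that logic in plain Python as a post-filter. The group names, access levels, and the `access_level` metadata key are hypothetical; in production the same condition would be pushed into the vector-store query as a metadata filter so that unauthorized chunks are never retrieved at all.

```python
# Hypothetical mapping from IdP group to readable access levels.
GROUP_ACCESS = {
    'engineering': {'public', 'engineering'},
    'hr': {'public', 'hr'},
    'finance': {'public', 'finance'},
}


def allowed_levels(user_groups):
    """Union of access levels granted by the user's groups."""
    levels = {'public'}
    for group in user_groups:
        levels |= GROUP_ACCESS.get(group, set())
    return levels


def filter_chunks(chunks, user_groups):
    """Keep only chunks the user may see.

    chunks: list of dicts with a 'metadata' dict.
    Chunks with no access level are denied by default.
    """
    levels = allowed_levels(user_groups)
    return [c for c in chunks if c['metadata'].get('access_level') in levels]


chunks = [
    {'text': 'Deploy steps', 'metadata': {'access_level': 'engineering'}},
    {'text': 'Salary bands', 'metadata': {'access_level': 'hr'}},
    {'text': 'Office hours', 'metadata': {'access_level': 'public'}},
    {'text': 'Untagged note', 'metadata': {}},
]
visible = filter_chunks(chunks, ['engineering'])
print([c['text'] for c in visible])
```

Denying untagged chunks by default is the safer failure mode: a missing tag then shows up as a gap in answers, not as a leak.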

Want a Custom RAG Chatbot Trained on YOUR Internal Docs?

We build production RAG chatbots over Confluence, Notion, Google Docs, SharePoint, or your custom CMS in 7-14 working days. Includes ingestion pipelines, eval harness with your real questions, citation tracing, and Slack/Teams integration. Typical project: ₹65,000-₹1,40,000 fixed scope. First call is technical — with the engineer who'd lead your build.


Tags: RAG, LlamaIndex, pgvector, Postgres, Claude Sonnet 4.5, Internal Wiki, Python


Hrishikesh Baidya

CTO at Softechinfra specializing in Python, system architecture, and building secure, scalable software solutions.
