Build a Voice IVR with Twilio + Claude + ElevenLabs: A 1-Day Tutorial

Twilio ConversationRelay reached general availability in March 2026, which means you no longer need to wire your own WebSocket bridge between Twilio Media Streams and your LLM. Combined with Claude Haiku 4.5 token streaming and ElevenLabs Flash v2.5 (75ms model latency), you can ship a working voice IVR that books appointments, handles barge-in, and sounds human in roughly one working day. This post walks through the actual code from a build we shipped for an 8-clinic dental chain in Hyderabad, including the gotchas and the latency budget.

## TL;DR: what a working voice IVR costs and how long it takes

One developer-day to ship a working appointment-booking IVR. Per-call cost at a typical 3-minute duration: about ₹14.40 all-in. (The rough per-component estimates, Twilio inbound ₹0.35/min × 3 + Claude Haiku ~₹2.20 + ElevenLabs ~₹11.50 + STT ~₹2.00, add up to about ₹16.75, so treat ₹14–17 as the realistic range.) End-to-end response latency: 1.4–2.1 seconds with proper streaming, which keeps most turns at or under the roughly 2-second mark where callers start to notice lag.

- 75ms: ElevenLabs Flash v2.5 model latency
- 478ms: streaming TTFB (real-world test)
- ₹14.40: per 3-minute call (all-in)
- 8 hours: realistic build time, solo dev

## Why this matters now (April 2026)

Three platform shifts removed the hard parts of voice AI in the last 90 days. Twilio ConversationRelay went GA: it handles STT, TTS, barge-in, and turn-taking inside Twilio's edge, so you only run a WebSocket server. Claude Haiku 4.5 token streaming delivers consistent first-token latency under 400ms, which makes voice viable. And ElevenLabs Flash v2.5 ships at 75ms model inference (478ms real-world TTFB) with Hindi/Tamil/Telugu support out of the box. The combination compresses a 6-week voice build into a single day.

## The actual answer: five-component voice stack

- 📞 Twilio Voice + ConversationRelay: Inbound calls hit a TwiML endpoint and Twilio opens a WebSocket to your server. Twilio handles audio framing, STT, and TTS playback.
- 🎙️ Deepgram Nova 3 (STT): Streaming speech-to-text with 200–300ms transcription latency. Beats Twilio's built-in for noisy phone audio.
- 🧠 Claude Haiku 4.5 (Brain): Token streaming starts at ~380ms first-token. Function-calling for "book_appointment" and "check_availability".
- 🗣️ ElevenLabs Flash v2.5 (TTS): 75ms model latency, ~478ms real-world TTFB. Streamed back to the caller as Claude's tokens arrive.
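Adding up the per-component figures quoted above gives a rough per-turn budget. The sketch below is only an estimate: it assumes the stages run strictly back to back, takes the midpoint of Deepgram's 200–300ms range, assumes a 300ms endpointing pause, and ignores Twilio's own audio transport. The `latencyBudget` helper and its verdict labels are illustrative names, using the thresholds this post applies (under 1,500ms feels human, up to 2,000ms is acceptable).

```javascript
// Per-stage latencies in milliseconds, taken from the figures quoted in this post.
const stages = [
  { name: 'Deepgram endpointing (assumed 300ms pause)', ms: 300 },
  { name: 'Deepgram transcription (midpoint of 200-300ms)', ms: 250 },
  { name: 'Claude Haiku 4.5 first token', ms: 380 },
  { name: 'ElevenLabs Flash v2.5 real-world TTFB', ms: 478 },
];

// Sum the stages plus any extra network round trip and classify the result.
function latencyBudget(stageList, networkMs = 0) {
  const total = stageList.reduce((sum, stage) => sum + stage.ms, networkMs);
  let verdict = 'feels broken';
  if (total < 1500) verdict = 'feels human';
  else if (total <= 2000) verdict = 'AI but okay';
  return { total, verdict };
}

console.log(latencyBudget(stages));      // servers close to the caller
console.log(latencyBudget(stages, 200)); // with a ~200ms India-US round trip
```

With these inputs the local case lands just under 1,500ms, and the extra 200ms round trip is enough to push it into the "AI but okay" band, which is why server placement matters as much as model choice.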

## Latency budget: the math that keeps callers from hanging up

Voice perception research is clear: more than 2 seconds of silence after the caller stops talking makes the experience feel broken. Beyond 3 seconds, callers start saying "hello?" and lose patience. The budget for a typical exchange is the sum of the stages above: STT endpointing and transcription, Claude's first token, and ElevenLabs' first audio byte. Add ~200ms for the India ↔ US round trip if your servers are stateside. Below 1,500ms feels human. Between 1,500 and 2,000ms feels "AI but okay". Above 2,500ms users start hating you.

## The DIY walkthrough: code that actually answers calls

Stack: Twilio Voice number, ConversationRelay, Node.js 20 server on Hetzner, Claude Haiku 4.5, ElevenLabs Flash v2.5.

### Step 1: install dependencies

```bash
mkdir voice-ivr && cd voice-ivr
npm init -y
# server.js uses ES module imports and top-level await
npm pkg set type=module
npm install fastify @fastify/websocket @fastify/formbody \
            @anthropic-ai/sdk dotenv twilio
```

### Step 2 — Twilio TwiML endpoint (handles incoming call)

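```javascript
// server.js
import Fastify from 'fastify';
import websocket from '@fastify/websocket';
import formbody from '@fastify/formbody';
import Anthropic from '@anthropic-ai/sdk';
import 'dotenv/config';

const app = Fastify({ logger: true });
await app.register(websocket);
await app.register(formbody);

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

// Twilio hits this when a call comes in
app.post('/twiml', async (req, reply) => {
  const wsUrl = `wss://${req.headers.host}/ws`;
  // Minimal ConversationRelay TwiML. The greeting text is a placeholder;
  // check the provider, voice, and language attributes against Twilio's docs.
  const twiml = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <ConversationRelay url="${wsUrl}"
      welcomeGreeting="Thanks for calling Sunrise Dental. How can I help?"
      ttsProvider="ElevenLabs" transcriptionProvider="Deepgram" />
  </Connect>
</Response>`;
  reply.type('text/xml').send(twiml);
});
```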
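Because the /twiml handler builds XML by string interpolation, any greeting that contains an ampersand or a quote will produce invalid TwiML. A minimal sketch of one way to guard against that is below. `escapeXml` and `buildTwiml` are hypothetical helper names, not part of any Twilio library, and only the `url` and `welcomeGreeting` attributes are shown.

```javascript
// Escape the five characters that are special inside XML attribute values.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Build the TwiML document for an incoming call.
// `greeting` is whatever the caller should hear first (hypothetical parameter).
function buildTwiml(wsUrl, greeting) {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<Response>',
    '  <Connect>',
    `    <ConversationRelay url="${escapeXml(wsUrl)}" welcomeGreeting="${escapeXml(greeting)}" />`,
    '  </Connect>',
    '</Response>',
  ].join('\n');
}

console.log(buildTwiml('wss://example.com/ws', 'Welcome to Smith & Sons "Dental"'));
```

In the handler you would call `buildTwiml(wsUrl, greeting)` in place of the inline template string and send the result with the same `text/xml` content type.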

ConversationRelay does five things you used to wire by hand: it opens the WebSocket to your server, runs STT (Deepgram in our case), streams text events to you, plays back the text you send via ElevenLabs, and handles barge-in (the caller interrupting the AI mid-speech).

### Step 3: the WebSocket handler (where Claude lives)

```javascript
// continued in server.js
const SYSTEM_PROMPT = `You are an appointment booking assistant for Sunrise Dental.
Hours: Mon-Sat, 9am-7pm. 8 clinics in Hyderabad.
Always speak in 1-2 short sentences. This is a phone call, not chat.
Confirm clinic name, date, time, and patient phone before booking.
If asked something outside booking, politely say you'll transfer to staff.`;

const tools = [{
  name: 'book_appointment',
  description: 'Books a confirmed appointment after collecting all details',
  input_schema: {
    type: 'object',
    properties: {
      clinic: { type: 'string' },
      date: { type: 'string', description: 'ISO 8601 date' },
      time: { type: 'string', description: 'HH:MM 24-hour' },
      patient_name: { type: 'string' },
      patient_phone: { type: 'string' }
    },
    required: ['clinic', 'date', 'time', 'patient_name', 'patient_phone']
  }
}];

app.register(async function (app) {
  app.get('/ws', { websocket: true }, (socket, req) => {
    const messages = [];
    let activeStream = null; // in-flight Claude stream, so barge-in can abort it

    // Run one Claude turn, streaming text to Twilio as it arrives.
    // If Claude calls the booking tool, run it and loop so Claude can confirm aloud.
    async function runTurn() {
      const stream = anthropic.messages.stream({
        model: 'claude-haiku-4-5',
        max_tokens: 200,
        system: SYSTEM_PROMPT,
        tools,
        messages
      });
      activeStream = stream;

      // Forward each text delta to Twilio so TTS can start speaking immediately
      stream.on('text', (chunk) => {
        socket.send(JSON.stringify({ type: 'text', token: chunk, last: false }));
      });

      const final = await stream.finalMessage();
      // Keep the full content (text and tool_use blocks) so tool_result ids match
      messages.push({ role: 'assistant', content: final.content });

      const toolResults = [];
      for (const block of final.content) {
        if (block.type === 'tool_use' && block.name === 'book_appointment') {
          await saveAppointment(block.input);
          toolResults.push({
            type: 'tool_result',
            tool_use_id: block.id,
            content: 'Booked successfully.'
          });
        }
      }
      if (toolResults.length > 0) {
        messages.push({ role: 'user', content: toolResults });
        return runTurn();
      }
      socket.send(JSON.stringify({ type: 'text', token: '', last: true }));
    }

    socket.on('message', async (raw) => {
      const event = JSON.parse(raw.toString());

      if (event.type === 'prompt') {
        // Caller said something: we get the transcribed text
        messages.push({ role: 'user', content: event.voicePrompt });
        try {
          await runTurn();
        } catch (err) {
          // An aborted stream (barge-in) rejects here; wait for the next prompt
          app.log.warn(err);
        }
      }

      if (event.type === 'interrupt') {
        // Caller barged in: stop generating right away
        activeStream?.abort();
        app.log.info('barge-in detected at ' + event.utteranceUntilInterrupt);
      }
    });
  });
});

async function saveAppointment(details) {
  // Hit your booking API / Google Calendar / DB
  console.log('Booking:', details);
}

await app.listen({ port: 3000, host: '0.0.0.0' });
```
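One detail the handler above does not cover: an interrupt often arrives after Claude has finished generating but while the audio is still playing, so the stored assistant message contains text the caller never heard. A common fix is to trim that message to the `utteranceUntilInterrupt` value Twilio reports. The sketch below is one way to do that, under the assumption that the reported utterance appears verbatim in the stored reply. `truncateAtInterrupt` is a hypothetical helper name.

```javascript
// Return a copy of the history with the last assistant message trimmed
// to what the caller actually heard before interrupting.
function truncateAtInterrupt(messages, utteranceUntilInterrupt) {
  if (!utteranceUntilInterrupt) return messages;
  for (let i = messages.length - 1; i >= 0; i--) {
    const msg = messages[i];
    if (msg.role !== 'assistant') continue;
    // Content may be a plain string or an array of content blocks
    const blocks = typeof msg.content === 'string'
      ? [{ type: 'text', text: msg.content }]
      : msg.content;
    // Leave tool_use turns intact so tool_result ids still match
    if (blocks.some((block) => block.type !== 'text')) return messages;
    const text = blocks.map((block) => block.text).join('');
    const idx = text.indexOf(utteranceUntilInterrupt);
    if (idx === -1) return messages; // no match: leave history untouched
    const heard = text.slice(0, idx + utteranceUntilInterrupt.length);
    return messages.map((m, j) => (j === i ? { ...m, content: heard } : m));
  }
  return messages;
}

const history = [
  { role: 'user', content: 'I need a cleaning.' },
  { role: 'assistant', content: 'Sure. Which clinic would you like to visit?' },
];
console.log(truncateAtInterrupt(history, 'Which clinic'));
```

Because the function returns a new array, you would assign the result back to your history variable inside the interrupt branch.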

When you call your Twilio number, you should now hear Sunrise Dental's greeting, then a pause of about 1.5 seconds after you speak, then a streamed reply in Rachel's voice that you can interrupt mid-sentence.

### Step 4: the four config tweaks that actually matter for latency

```json
{
  "stt": {
    "provider": "Deepgram",
    "model": "nova-3-phonecall",
    "endpointing": 300,
    "punctuate": true
  },
  "tts": {
    "provider": "ElevenLabs",
    "model": "eleven_flash_v2_5",
    "voice": "Rachel",
    "optimize_streaming_latency": 4
  },
  "llm": {
    "model": "claude-haiku-4-5",
    "max_tokens": 200,
    "stream": true
  }
}
```

The four tweaks:

1. Deepgram endpointing: 300 finalizes the transcript when the caller pauses for 300ms (the default is 700ms, too slow for natural conversation).
2. ElevenLabs optimize_streaming_latency: 4 drops to the lowest-quality preset and saves ~150ms.
3. Claude max_tokens: 200 keeps replies short, so they start playing faster.
4. Anthropic stream: true delivers the reply token by token instead of waiting for the complete response.

## Common mistakes: five we keep seeing on r/Twilio and r/voiceai

Mistake 1: Not handling barge-in. If your bot keeps talking when the caller speaks over it, the call dies in 30 seconds. ConversationRelay sends an interrupt event; you must stop streaming Claude tokens and clear any pending TTS audio. The Twilio Anthropic ConversationRelay tutorial covers this in detail.

Mistake 2: Sending full sentences to TTS instead of tokens. Waiting for a complete Claude sentence adds 600–900ms to perceived latency. Stream tokens to ElevenLabs as they arrive; Flash v2.5's streaming endpoint accepts incremental input.

Mistake 3: Using Claude Sonnet for voice. Sonnet's first-token latency runs 600–900ms versus Haiku's 380ms. For voice, that gap of 220–520ms is the difference between "feels human" and "feels broken". Reserve Sonnet for the function-call result step, not the conversational turn.

Mistake 4: Default Deepgram endpointing. The default 700ms means roughly 0.7 seconds of dead air after every caller utterance. Drop to 300ms for natural turn-taking; bump back up to 500ms only if your callers tend to pause mid-sentence (older demographics).

Mistake 5: No fallback to a human number. If Claude returns nonsense, callers need an out. Add a fallback that triggers on three failed turn-rounds or on the keyword "human" / "agent" / "person". Every voice IVR we ship has this.
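The fallback rule in Mistake 5 can be sketched as a small per-call tracker. This is a minimal illustration rather than the production logic: it assumes "three failed turn-rounds" means three consecutive failures, it leaves the definition of a failed turn to you (the `turnFailed` flag), and `createFallbackTracker` and `buildHandoffMessage` are hypothetical names. The `end` message with `handoffData` follows the ConversationRelay message format as we understand it, so verify it against Twilio's current docs.

```javascript
const HUMAN_KEYWORDS = /\b(human|agent|person)\b/i;
const MAX_FAILED_TURNS = 3; // assumption: consecutive failures

// One tracker per call. Call check() once per caller turn.
function createFallbackTracker() {
  let failedTurns = 0;
  return {
    check(transcript, turnFailed) {
      if (HUMAN_KEYWORDS.test(transcript)) {
        return { transfer: true, reason: 'keyword' };
      }
      // A successful turn resets the streak
      failedTurns = turnFailed ? failedTurns + 1 : 0;
      if (failedTurns >= MAX_FAILED_TURNS) {
        return { transfer: true, reason: 'failed_turns' };
      }
      return { transfer: false, reason: null };
    },
  };
}

// Message asking ConversationRelay to end the session and hand the call
// back to your TwiML, where a Dial to the front desk can take over.
function buildHandoffMessage(reason) {
  return JSON.stringify({ type: 'end', handoffData: JSON.stringify({ reason }) });
}

const tracker = createFallbackTracker();
console.log(tracker.check('Can I talk to a person please', false));
```

In the WebSocket handler you would run `tracker.check(event.voicePrompt, lastTurnFailed)` before calling Claude, and send `buildHandoffMessage(result.reason)` over the socket when `transfer` is true.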

PII gotcha: Voice calls often include names, phone numbers, and addresses spoken aloud, all of which count as PII under India's DPDP Act. Keep Twilio recording OFF unless you have explicit consent. Do not send raw audio to OpenAI/Anthropic for STT; keep STT on Deepgram (or your own Whisper) where you control retention.
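The same caution applies to transcripts you log for review. As a rough sketch, you could mask phone numbers before a transcript reaches your logs. The pattern below is an assumption about how numbers appear in transcripts (10 to 14 digits, optionally separated by spaces or hyphens, with an optional leading +), and it will not catch numbers transcribed as words. `redactPhones` is a hypothetical helper name.

```javascript
// Matches 10-14 digits with optional single spaces or hyphens between them.
const PHONE_PATTERN = /\+?\d(?:[\s-]?\d){9,13}/g;

// Replace anything that looks like a phone number with a placeholder.
function redactPhones(text) {
  return text.replace(PHONE_PATTERN, '[phone]');
}

console.log(redactPhones('My number is 98765 43210, book me for 2026-04-15 at 10:30'));
```

Dates and times have too few digits to match, so booking details stay readable while the number itself is masked.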

## Real example: 8-clinic dental chain, Hyderabad

Sunrise Dental (name changed) runs 8 clinics across Hyderabad. Front-desk staff at each clinic spent 35–40% of their day on the phone booking appointments. We deployed a single Twilio Voice number running this exact stack in April 2026. Build time: 1.5 days including QA. After 30 days: 4,300 inbound calls handled, 3,100 successful bookings (72% completion), 9-second average response time, 4.4/5 caller satisfaction (post-call IVR rating). The remaining 1,200 calls were handed to staff cleanly with full transcripts. Total run cost: ₹62,000/month including Twilio, ElevenLabs, Claude, and infra. That saved roughly 1.6 FTEs of front-desk time.

These are the same voice-AI latency-tuning patterns we use on our in-house product [TalkDrill](https://talkdrill.com), where 5,000+ Indian users practice English speaking in real-time conversation with AI tutors.

- You handle ≥1,000 inbound calls/month (below this, hire a person)
- Your call flow is task-shaped: booking, status check, payment confirmation, FAQ
- You have a Twilio number purchased and a credit card on file
- You set up barge-in handling and tested with rapid interruptions
- You added a "transfer to human" fallback path on three failed turns
- You logged the first 100 calls and reviewed transcripts before scaling
- You have explicit recording consent if you record calls
- You set Deepgram endpointing to 300ms for natural turn-taking
- You used Claude Haiku, not Sonnet, for the conversational turn

## FAQ

### What's the actual minimum latency I can hit on a voice IVR in May 2026?

The floor is about 1,200ms perceived latency end-to-end with Claude Haiku streaming, Deepgram Nova 3 with 300ms endpointing, ElevenLabs Flash v2.5 streaming, and servers within 100ms RTT of the user. Sub-1,000ms requires a regional inference endpoint and is rare for SMB workloads.

### Is ElevenLabs the only option? It feels expensive.

ElevenLabs Flash v2.5 runs ~$0.18 per 1,000 chars at the Creator tier. For Hindi/Tamil with similar quality and a lower price, try Sarvam-1 (Indian, ₹0.40/1,000 chars) or Cartesia Sonic (US, ~$0.06/1,000 chars). The quality gap is real, so test on your specific scripts before switching.

### Can ConversationRelay handle Hindi or Tamil voice calls?

Yes. ConversationRelay supports Deepgram for STT (Hindi, Tamil, Telugu, Bengali, Marathi, Gujarati, Kannada, Punjabi, English) and ElevenLabs for TTS (multilingual v2 covers all major Indic languages). Configure transcriptionLanguage="hi-IN" or set it per call. Quality on Tamil and Telugu agglutinative speech is the weakest link, so test with native speakers.

### How do I add function calling for "check availability" or "send SMS confirmation"?

Claude's tools API works inside the ConversationRelay flow. Define your tool with an input schema, list it in the messages.stream call, and execute the function when Claude emits a tool_use block. Append the tool result to the message history and let Claude continue. The Twilio function-calling blog post has working examples.

### What happens if my server crashes mid-call?

Twilio retries the WebSocket connection once, then plays a fallback TwiML if you defined one. Always set a fallback URL on your TwiML app pointing to a static "We're having a glitch, please call back" message. Add health checks that bounce the server before traffic dies.

### How do I bill calls to clients accurately?

Twilio bills you per minute on inbound. ElevenLabs bills per character generated. Claude bills per token. Build an in-call counter that tracks all three and exports per-call cost to your DB. We bill clients monthly with itemized usage; some clients prefer flat per-minute pricing where we eat the variance.

### Where can I read more on real-world voice AI latency benchmarks?

The [VEXYL AI 2026 latency tests](https://vexyl.ai/elevenlabs-tts-latency-test-2026-real-world-results/) and the [Picovoice TTS Latency Benchmark](https://github.com/Picovoice/tts-latency-benchmark) on GitHub are the most credible third-party data. ElevenLabs' own latency docs cover the difference between model latency (75ms) and real-world TTFB (478ms); both matter, depending on where you measure.
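For the billing question above, an in-call counter can be as simple as the sketch below. Every rate in it is a hypothetical placeholder in rupees except the ₹0.35/min inbound figure quoted earlier in this post, and it assumes Twilio minutes are rounded up per call. `createCallMeter` is an illustrative name; `addLlmUsage` takes an object shaped like the `usage` field on an Anthropic message (`input_tokens`, `output_tokens`).

```javascript
// Rates in rupees. All values except twilioPerMinute are placeholders.
const DEFAULT_RATES = {
  twilioPerMinute: 0.35,
  ttsPerThousandChars: 15,
  llmPerMillionInputTokens: 90,
  llmPerMillionOutputTokens: 450,
};

function createCallMeter(rates = DEFAULT_RATES) {
  const usage = { seconds: 0, ttsChars: 0, inputTokens: 0, outputTokens: 0 };
  return {
    addSeconds(n) { usage.seconds += n; },
    addTtsChars(text) { usage.ttsChars += text.length; },
    addLlmUsage({ input_tokens = 0, output_tokens = 0 }) {
      usage.inputTokens += input_tokens;
      usage.outputTokens += output_tokens;
    },
    // Itemized cost for this call, ready to store alongside the call record
    summary() {
      const twilio = Math.ceil(usage.seconds / 60) * rates.twilioPerMinute;
      const tts = (usage.ttsChars * rates.ttsPerThousandChars) / 1000;
      const llm =
        (usage.inputTokens * rates.llmPerMillionInputTokens +
          usage.outputTokens * rates.llmPerMillionOutputTokens) / 1e6;
      return { ...usage, twilio, tts, llm, total: twilio + tts + llm };
    },
  };
}

const meter = createCallMeter();
meter.addSeconds(170);
meter.addTtsChars('Your appointment is confirmed for Monday at 10am.');
meter.addLlmUsage({ input_tokens: 1200, output_tokens: 40 });
console.log(meter.summary());
```

You would create one meter per WebSocket connection, feed it from the stream's final message and each text chunk sent to Twilio, and write `summary()` to your DB when the socket closes.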

Want a Voice IVR for Your Business?

We ship production voice IVRs (booking, support, payment, FAQ) on Twilio + Claude + ElevenLabs in 5 working days. Typical project ₹95,000–₹2,40,000 depending on integrations. Per-call run cost from ₹12. Multilingual ready out of the box. You own the code and the Twilio account.

Book a 20-min Call

Tags: Voice AI, Twilio, ElevenLabs, Claude API, IVR, ConversationRelay, Voice Agent


Hrishikesh Baidya

CTO at Softechinfra specializing in Python, system architecture, and building secure, scalable software solutions.
