Tools & APIs · 2026-06-22 · 12 min read · WildRun AI Engineering

ElevenLabs TTS Integration for Voice Agents (2026)

Build a production ElevenLabs TTS integration for voice agents in TypeScript — streaming, WebSocket pipelines, latency tuning, and deployment gotchas.

Intermediate · Tools: ElevenLabs, Vapi, Cloudflare Workers, Anthropic SDK

Text-to-speech is where most voice agent projects quietly fall apart. The pipeline logic works, the LLM responses are solid, and then the audio arrives 800ms late with a cadence that kills the caller's patience. ElevenLabs solves the quality problem convincingly — but wiring it into a production voice agent requires understanding a handful of non-obvious constraints around streaming, audio formats, and latency budgets.

This guide walks through a complete TypeScript integration, from a basic HTTP streaming call to a low-latency WebSocket pipeline that pipes LLM tokens directly into ElevenLabs as they arrive. All code is runnable against the live API. If you are still evaluating the overall voice agent architecture, read the Vapi voice agent build guide first.

The ElevenLabs Model Landscape

ElevenLabs publishes several TTS models. For voice agents, only two are worth considering in 2026:

eleven_flash_v2_5 — The real-time model. Targets approximately 75ms audio generation latency and supports 32 languages. Use this for any conversational agent where a caller is waiting on a response.
eleven_multilingual_v2 — Higher fidelity, but 250–400ms generation latency. Better for asynchronous use cases: voicemail synthesis, pre-recorded IVR prompts, outbound notification clips.

A third model, eleven_v3, is available for expressive long-form narration but is not suited for real-time telephony. Its latency profile makes it a poor fit for live voice agents where callers are waiting.

One number to internalize before touching any code: human conversations have natural gaps of 100–300ms between speakers. Once your total pipeline latency — STT processing plus LLM first-token plus TTS first-audio — exceeds 500ms, callers begin to notice. At 800ms, they say "hello?" and the interaction derails. Model choice is the single biggest lever you have on that number, and eleven_flash_v2_5 is the right default for live agents. Independent 2026 benchmarks put its real-world P50 time-to-first-audio at roughly 288ms from US regions — fast enough to feel natural in most conversational contexts.
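The budget above is simple addition, and it can help to encode it so every turn is classified the same way in logs. The sketch below is a hypothetical helper, not part of any SDK: the names `StageLatenciesMs` and `classifyTurnLatency` are invented for illustration, and the 150ms STT and 200ms LLM figures in the usage note are placeholder assumptions rather than measurements.

```typescript
// Hypothetical helper: sums the three stage latencies and classifies the total
// against the thresholds quoted above (over 500ms noticeable, 800ms derailing).
interface StageLatenciesMs {
  stt: number;           // STT processing time
  llmFirstToken: number; // LLM time-to-first-token
  ttsFirstAudio: number; // TTS time-to-first-audio
}

type LatencyVerdict = "natural" | "noticeable" | "derailing";

function classifyTurnLatency(
  stages: StageLatenciesMs
): { totalMs: number; verdict: LatencyVerdict } {
  const totalMs = stages.stt + stages.llmFirstToken + stages.ttsFirstAudio;
  let verdict: LatencyVerdict = "natural";
  if (totalMs >= 800) verdict = "derailing";
  else if (totalMs > 500) verdict = "noticeable";
  return { totalMs, verdict };
}
```

With an assumed 150ms for STT, 200ms for LLM first token, and the 288ms P50 quoted above for TTS, `classifyTurnLatency({ stt: 150, llmFirstToken: 200, ttsFirstAudio: 288 })` totals 638ms and lands in the "noticeable" band, which is why the later sections focus on overlapping the LLM and TTS stages instead of running them back to back.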

Architecture Overview

A production ElevenLabs voice agent has four layers that must hand off without buffering delays. The key architectural decision is where you perform the STT-to-LLM-to-TTS hand-off — on a long-lived server process or at the edge.

Caller audio (PCM / μ-law)
        │
        ▼
  ┌───────────┐
  │    STT    │  (Deepgram, Vapi native, or AssemblyAI)
  │           │──── transcript text ────┐
  └───────────┘                         │
                                        ▼
                                 ┌─────────────┐
                                 │     LLM     │  (Claude / GPT-4o)
                                 │  streaming  │──── token stream ────┐
                                 └─────────────┘                      │
                                                                      ▼
                                                            ┌──────────────────┐
                                                            │  ElevenLabs TTS  │
                                                            │  WebSocket input │
                                                            └────────┬─────────┘
                                                                     │ audio chunks (PCM/mp3)
                                                                     ▼
                                                             Caller audio out

The critical insight is that ElevenLabs does not wait for a complete sentence. The WebSocket input stream accepts partial text and begins audio generation after accumulating enough phonemic context — typically 3–5 words. Piping LLM tokens directly as they stream (rather than waiting for a complete response) is what compresses perceived latency below 400ms in practice. This is the difference between a voice agent that feels like a conversation and one that feels like a slow phone tree.

Setting Up the SDK

Install the official elevenlabs-js SDK and configure your client. Voice IDs are stable identifiers — hardcode them in a constants file rather than looking them up at runtime. The voices list API call costs ~80ms you cannot afford in a real-time pipeline.

npm install @elevenlabs/elevenlabs-js

// client.ts
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";

export const elevenlabs = new ElevenLabsClient({
  apiKey: process.env.ELEVENLABS_API_KEY!,
});

// Pin voice IDs — these are stable and do not change across API versions.
export const VOICES = {
  rachel: "21m00Tcm4TlvDq8ikWAM",  // warm, US English — good for front-desk agents
  adam:   "pNInz6obpgDQGcFmaJgB",  // neutral, US English — good for transactional flows
  matilda: "XrExE9yKIg1WjnnlVkGX", // conversational, works well for appointment booking
} as const;

export const TTS_CONFIG = {
  model: "eleven_flash_v2_5",
  outputFormat: "pcm_22050" as const,
  optimizeStreamingLatency: 3,       // 0–4; 3 is the best latency/quality tradeoff
  voiceSettings: {
    stability: 0.4,
    similarityBoost: 0.8,
    style: 0,
    useSpeakerBoost: true,
  },
} as const;

HTTP Streaming: The Baseline

For non-real-time use cases — generating audio clips, voicemail content, or pre-recording IVR branches — the HTTP streaming endpoint is simpler and sufficient. It delivers audio as a Node.js readable stream that you can pipe directly to a file, an S3 upload stream, or a response object.

Use pcm_22050 as the output format when you control the downstream audio pipeline. It has the lowest time-to-first-byte of any format because there is no codec header to write before the audio data begins. Use mp3_44100_128 only when writing a self-contained file for browser playback.

import { elevenlabs, VOICES, TTS_CONFIG } from "./client";

export async function synthesizeToBuffer(text: string): Promise<Buffer> {
  const stream = await elevenlabs.textToSpeech.stream(VOICES.rachel, {
    text,
    modelId: TTS_CONFIG.model,
    outputFormat: TTS_CONFIG.outputFormat,
    voiceSettings: TTS_CONFIG.voiceSettings,
    optimizeStreamingLatency: TTS_CONFIG.optimizeStreamingLatency,
  });

  const chunks: Buffer[] = [];
  for await (const chunk of stream) {
    chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk));
  }
  return Buffer.concat(chunks);
}

// For Twilio: request ulaw_8000 directly and skip the transcode step entirely.
export async function synthesizeForTwilio(text: string): Promise<Buffer> {
  const stream = await elevenlabs.textToSpeech.stream(VOICES.rachel, {
    text,
    modelId: TTS_CONFIG.model,
    outputFormat: "ulaw_8000",  // Twilio Media Streams native format
    voiceSettings: TTS_CONFIG.voiceSettings,
    optimizeStreamingLatency: TTS_CONFIG.optimizeStreamingLatency,
  });

  const chunks: Buffer[] = [];
  for await (const chunk of stream) {
    chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk));
  }
  return Buffer.concat(chunks);
}
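
Because raw PCM carries no header, the consumer has to know the sample rate and sample width out of band. A minimal sketch of the arithmetic, assuming pcm_22050 is 16-bit mono (that sample layout is an assumption here, so confirm it against the format documentation); `pcmDurationMs` is a hypothetical helper name:

```typescript
// Assumption: pcm_22050 is signed 16-bit mono, i.e. 2 bytes per sample.
const PCM_SAMPLE_RATE_HZ = 22050;
const PCM_BYTES_PER_SAMPLE = 2;

// Converts a raw PCM byte length into playback duration in milliseconds.
function pcmDurationMs(byteLength: number): number {
  const samples = byteLength / PCM_BYTES_PER_SAMPLE;
  return (samples / PCM_SAMPLE_RATE_HZ) * 1000;
}
```

Under that assumption, one second of audio is 44,100 bytes, so `pcmDurationMs(buffer.length)` on the result of synthesizeToBuffer gives a quick sanity check that a clip is about as long as expected.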

WebSocket Streaming: The LLM Token Pipe

The real latency gain comes from the WebSocket input endpoint, which accepts text tokens incrementally as the LLM generates them. The full round-trip flow is: LLM emits a token → your code sends it to the ElevenLabs WebSocket → ElevenLabs starts generating audio mid-sentence → audio chunks stream back before the LLM has even finished the response.

Below is a complete implementation that streams Claude's output token-by-token into ElevenLabs and surfaces audio chunks via a callback as they arrive. The chunkLengthSchedule parameter is the most important tuning knob: it controls how many characters ElevenLabs accumulates before generating the first audio chunk. Starting at 120 characters means ElevenLabs begins speaking after roughly one sentence. Too small (below 50 characters) causes audible prosody errors; too large and you are back to waiting for a paragraph before the caller hears anything.

import Anthropic from "@anthropic-ai/sdk";
import WebSocket from "ws";
import { VOICES, TTS_CONFIG } from "./client";

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! });

interface ElevenLabsMessage {
  audio?: string;   // base64-encoded PCM audio chunk
  isFinal?: boolean;
  error?: string;
}

export async function streamLLMToTTS(
  userMessage: string,
  onAudioChunk: (pcmBase64: string) => void,
  onComplete: () => void,
  onError: (err: Error) => void
): Promise<void> {
  const wsUrl =
    `wss://api.elevenlabs.io/v1/text-to-speech/${VOICES.rachel}/stream-input` +
    `?model_id=${TTS_CONFIG.model}` +
    `&output_format=${TTS_CONFIG.outputFormat}` +
    `&optimize_streaming_latency=${TTS_CONFIG.optimizeStreamingLatency}`;

  const ws = new WebSocket(wsUrl, {
    headers: { "xi-api-key": process.env.ELEVENLABS_API_KEY! },
  });

  let socketReady = false;

  const socketOpen = new Promise<void>((resolve) => {
    ws.on("open", () => {
      // setImmediate avoids a race on some Node versions where the TLS
      // handshake is not fully settled at the "open" event.
      setImmediate(() => {
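        ws.send(JSON.stringify({
          text: " ",  // required beginning-of-stream marker
          // The raw WebSocket protocol expects snake_case keys, unlike the SDK's
          // camelCase options (chunk_length_schedule is chunkLengthSchedule on the wire).
          voice_settings: {
            stability: TTS_CONFIG.voiceSettings.stability,
            similarity_boost: TTS_CONFIG.voiceSettings.similarityBoost,
            style: TTS_CONFIG.voiceSettings.style,
            use_speaker_boost: TTS_CONFIG.voiceSettings.useSpeakerBoost,
          },
          generation_config: {
            chunk_length_schedule: [120, 160, 250, 290],
          },
        }));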
        socketReady = true;
        resolve();
      });
    });
  });

  ws.on("message", (data: Buffer) => {
    const msg: ElevenLabsMessage = JSON.parse(data.toString());
    if (msg.error) {
      onError(new Error(`ElevenLabs error: ${msg.error}`));
      ws.close();
      return;
    }
    if (msg.audio && msg.audio.length > 0) {
      // Skip empty frames — they can cause a faint pop on some telephony codecs.
      onAudioChunk(msg.audio);
    }
    if (msg.isFinal) {
      onComplete();
      ws.close();
    }
  });

  ws.on("error", (err) => {
    onError(err);
    ws.close();
  });

  await socketOpen;

  // Stream Claude tokens directly into the WebSocket as they arrive.
  let buffer = "";
  const FLUSH_THRESHOLD = 120;

  try {
    const stream = anthropic.messages.stream({
      model: "claude-haiku-4-5-20251001",
      max_tokens: 280,
      system: "You are a helpful receptionist. Give concise, conversational answers under 60 words.",
      messages: [{ role: "user", content: userMessage }],
    });

    for await (const event of stream) {
      if (
        event.type === "content_block_delta" &&
        event.delta.type === "text_delta"
      ) {
        buffer += event.delta.text;

        // Flush on sentence boundaries or when buffer exceeds the threshold.
        if (buffer.length >= FLUSH_THRESHOLD || /[.!?]\s/.test(buffer)) {
          if (socketReady) ws.send(JSON.stringify({ text: buffer }));
          buffer = "";
        }
      }
    }

    // Flush any remaining text and send the end-of-stream marker.
    if (buffer.length > 0 && socketReady) {
      ws.send(JSON.stringify({ text: buffer }));
    }
    if (socketReady) ws.send(JSON.stringify({ text: "" }));

  } catch (err) {
    onError(err as Error);
    ws.close();
  }
}
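
The flush rule inside the token loop is easy to get subtly wrong, so it can be worth isolating as a pure function that is testable without a socket. The sketch below mirrors the condition used above (length threshold or a sentence terminator followed by whitespace); `nextFlush` and `chunkTokens` are hypothetical names introduced only for illustration.

```typescript
const DEFAULT_FLUSH_THRESHOLD = 120;

// Decides whether the accumulated buffer should be sent now.
// Returns the text to flush (or null) plus whatever stays buffered.
function nextFlush(
  buffer: string,
  threshold: number = DEFAULT_FLUSH_THRESHOLD
): { flush: string | null; rest: string } {
  const shouldFlush = buffer.length >= threshold || /[.!?]\s/.test(buffer);
  return shouldFlush ? { flush: buffer, rest: "" } : { flush: null, rest: buffer };
}

// Replays a token sequence and returns the payloads that would be sent.
function chunkTokens(
  tokens: string[],
  threshold: number = DEFAULT_FLUSH_THRESHOLD
): string[] {
  const sent: string[] = [];
  let buffer = "";
  for (const token of tokens) {
    const { flush, rest } = nextFlush(buffer + token, threshold);
    if (flush !== null) sent.push(flush);
    buffer = rest;
  }
  if (buffer.length > 0) sent.push(buffer); // final flush at end of stream
  return sent;
}
```

One property this makes visible: because the pattern requires whitespace after the punctuation, a sentence is only flushed once the first token of the next sentence arrives. For the tokens `["Hello", " there.", " How", " can", " I", " help", "?"]` the payloads are `"Hello there. How"` and `" can I help?"`, so a few characters of the following sentence travel with the previous one.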

Connecting to Vapi

If you are building on Vapi rather than a raw WebSocket pipeline, ElevenLabs is a supported voice provider configured at the assistant level. Vapi handles the WebSocket lifecycle, audio transcoding to the telephony format, concurrent stream management, and connection pooling across calls. That connection pool is why the Vapi-managed path often beats a DIY WebSocket implementation on latency — the 50–200ms connection establishment overhead is amortized across thousands of calls rather than paid per turn.

The Vapi and Cloudflare Workers deployment guide covers the full webhook server setup. The assistant configuration for an ElevenLabs voice is straightforward:

import axios from "axios";

const assistant = await axios.post(
  "https://api.vapi.ai/assistant",
  {
    name: "Front Desk Agent — Bend OR",
    voice: {
      provider: "11labs",
      voiceId: "21m00Tcm4TlvDq8ikWAM",   // Rachel
      model: "eleven_flash_v2_5",
      stability: 0.4,
      similarityBoost: 0.8,
      optimizeStreamingLatency: 3,
    },
    model: {
      provider: "anthropic",
      model: "claude-haiku-4-5-20251001",
      systemPrompt:
        "You are the front desk receptionist for a local business in Bend, Oregon. " +
        "Answer concisely, confirm caller needs, and offer to schedule appointments.",
      maxTokens: 250,
    },
    firstMessage: "Thanks for calling. How can I help you today?",
    endCallFunctionEnabled: true,
    endCallPhrases: ["goodbye", "thank you, bye", "that's all I needed"],
  },
  {
    headers: {
      Authorization: `Bearer ${process.env.VAPI_API_KEY}`,
      "Content-Type": "application/json",
    },
  }
);

console.log("Assistant ID:", assistant.data.id);

Deploying on Cloudflare Workers

If you are running your voice agent webhook handler on Cloudflare Workers, two constraints affect the TTS pipeline directly. First, the Workers runtime has a 30-second CPU limit per request. That limit counts CPU time rather than wall-clock time, so awaiting a slow LLM response does not consume it by itself, but the JSON parsing and audio chunk handling across a long turn does. Keep max_tokens at 200–300 for conversational agents and use fast models like Claude Haiku or GPT-4o-mini so each turn stays short.

Second, the Workers runtime does not support the Node.js ws package. You must use the native Workers WebSocket API, which has a different interface from the Node.js ws module:

// In a Cloudflare Worker — the `ws` npm package is NOT available.
// Use the native fetch-based WebSocket upgrade instead.

interface Env {
  ELEVENLABS_API_KEY: string;
}

export async function openElevenLabsSocket(
  voiceId: string,
  env: Env
): Promise<WebSocket> {
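  // Workers' fetch() only accepts http(s) URLs; the Upgrade header below turns
  // this https:// request into a WebSocket connection.
  const url =
    `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream-input` +
    `?model_id=eleven_flash_v2_5&output_format=pcm_22050&optimize_streaming_latency=3`;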

  const resp = await fetch(url, {
    headers: {
      Upgrade: "websocket",
      Connection: "Upgrade",
      "xi-api-key": env.ELEVENLABS_API_KEY,
    },
  });

  // Workers expose the WebSocket on the response, not via a constructor.
  const ws = (resp as any).webSocket as WebSocket | null;
  if (!ws) throw new Error("Expected WebSocket upgrade — check API key and voice ID");

  ws.accept(); // Required in Workers before any send/receive calls.
  return ws;
}

Production Gotchas

Audio format mismatch on Twilio

Twilio Media Streams send and receive audio as μ-law (mulaw) at 8kHz. ElevenLabs defaults to MP3 or PCM at 22050Hz or 44100Hz. The fix: pass outputFormat: "ulaw_8000" directly in your ElevenLabs request. ElevenLabs will encode to Twilio's native format server-side and skip the transcode step entirely on your end. Attempting to use ffmpeg inside a serverless function for real-time transcoding is expensive and fragile — avoid it.
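
With ulaw_8000 requested, the base64 audio from the ElevenLabs WebSocket can in principle be forwarded to Twilio without decoding. A minimal sketch of the wrapping step, assuming the Twilio Media Streams outbound "media" message shape and a `streamSid` captured from Twilio's "start" event; `toTwilioMediaMessage` is a hypothetical helper name, and the message shape should be checked against Twilio's current documentation:

```typescript
// Shape of an outbound Twilio Media Streams "media" message (assumed here).
interface TwilioMediaMessage {
  event: "media";
  streamSid: string;
  media: { payload: string }; // base64-encoded mu-law 8kHz audio
}

// Wraps one base64 mu-law chunk in the JSON envelope Twilio expects.
function toTwilioMediaMessage(streamSid: string, ulawBase64: string): string {
  const message: TwilioMediaMessage = {
    event: "media",
    streamSid,
    media: { payload: ulawBase64 },
  };
  return JSON.stringify(message);
}
```

In the streaming pipeline this would slot into the audio callback, for example `(chunk) => twilioSocket.send(toTwilioMediaMessage(streamSid, chunk))`, where `twilioSocket` is whatever socket your server holds open to Twilio.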

WebSocket "not open" errors under load

The WebSocket open event fires before the TLS handshake is fully settled on some Node.js versions. Sending the BOS (beginning-of-stream) marker immediately inside the open handler causes a "WebSocket is not open" error on the first send for roughly 1–3% of connections. The fix is the setImmediate wrapper shown in the streaming example above. This defers the first send by one event loop tick, which is enough for the handshake to complete.

Stability and similarity settings add latency

Setting stability above 0.6 with eleven_flash_v2_5 causes the model to re-sample more aggressively and adds 20–40ms per audio chunk. The sweet spot for live voice agents is stability: 0.3–0.45. This keeps the voice consistent enough for conversation without a noticeable latency penalty. Higher stability (0.6+) is appropriate for voicemail or pre-recorded IVR branches where a few extra milliseconds do not matter.

Rate limits during call volume spikes

The ElevenLabs Starter plan allows 2 concurrent WebSocket streams. Creator allows 5, Pro allows 10. Exceeding concurrent stream limits returns a 429 with a specific error code rather than queuing the request. Handle this with a circuit breaker that falls back to a pre-synthesized "Please hold for a moment" audio clip rather than letting the call drop to silence. Silence for more than 2 seconds causes most callers to hang up. Build the fallback clip from the same voice during onboarding and cache it.
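
One way to approximate the fallback described above is a local admission gate that refuses new streams once the plan ceiling is reached and returns the cached hold clip instead. This is a simplified sketch, not a full circuit breaker (there is no cool-down or half-open state); `StreamGate` and `speakOrHold` are hypothetical names, and a single-process counter is assumed.

```typescript
// Counts in-flight TTS streams in this process and refuses work past the limit.
class StreamGate {
  private active = 0;
  private readonly maxConcurrent: number;

  constructor(maxConcurrent: number) {
    this.maxConcurrent = maxConcurrent;
  }

  tryAcquire(): boolean {
    if (this.active >= this.maxConcurrent) return false;
    this.active += 1;
    return true;
  }

  release(): void {
    if (this.active > 0) this.active -= 1;
  }

  get inUse(): number {
    return this.active;
  }
}

// Runs live TTS when a slot is free; otherwise (or on failure, such as a 429)
// returns the pre-synthesized hold clip so the caller never hears silence.
async function speakOrHold<T>(
  gate: StreamGate,
  liveTts: () => Promise<T>,
  holdClip: T
): Promise<T> {
  if (!gate.tryAcquire()) return holdClip;
  try {
    return await liveTts();
  } catch {
    return holdClip;
  } finally {
    gate.release(); // always free the slot
  }
}
```

If calls are spread across several processes or Workers, a per-process counter undercounts, so the gate limit would need to be set below the plan ceiling or moved to shared state.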

Empty audio frames at stream start

ElevenLabs sends one or two empty audio frames (base64 strings with length 0) before real audio arrives. Forwarding these empty frames to Twilio or WebRTC can cause a faint pop artifact on some telephony codecs. Filter them: skip any chunk where msg.audio.length === 0 before passing to the audio output layer.

Unclosed sockets leaking against your concurrent stream quota

If your LLM stream throws an error mid-generation and your catch block does not explicitly call ws.close(), the ElevenLabs WebSocket stays open and counts against your concurrent stream limit indefinitely. Always close the socket in both your error handler and your finally block. A dangling socket on the Creator plan (5 concurrent max) during a call spike will silently push new calls into 429 territory.
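
A small wrapper makes the "always close" rule hard to forget. The sketch below assumes only that the socket exposes a `close()` method that is safe to call on an already-closed connection; `Closable` and `withSocket` are hypothetical names, not part of the ws package.

```typescript
// Anything with a close() method: a ws socket, a Workers WebSocket, or a test fake.
interface Closable {
  close(): void;
}

// Runs `work` with the socket and guarantees close() afterwards,
// whether the work resolves, throws, or returns early.
async function withSocket<S extends Closable, T>(
  socket: S,
  work: (socket: S) => Promise<T>
): Promise<T> {
  try {
    return await work(socket);
  } finally {
    socket.close();
  }
}
```

The LLM streaming loop would go inside `work`, so an exception thrown mid-generation still releases the stream slot before the error propagates to the caller.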

chunkLengthSchedule and prosody at sentence boundaries

Setting chunkLengthSchedule[0] below 50 characters causes ElevenLabs to generate audio from incomplete phrases, which produces unnatural prosody: the voice rises at the end of every short chunk as if asking a question. If your LLM tends to write short sentences (under 50 characters), increase the first chunk threshold to 80–100 characters and accept the slightly longer first-audio delay. Because the schedule is measured in characters, log the character length of your typical LLM sentences and responses in staging before setting this value for production.

When NOT to Build This Yourself

Rolling your own ElevenLabs TTS pipeline makes sense when you have tight control over the audio path — custom telephony infrastructure, a WebRTC browser client, or a hardware device where you control the codec stack end-to-end. For most small-business voice agent deployments, it is the wrong starting point.

Skip the DIY integration when:

You are deploying a business phone agent over standard PSTN or Twilio. Platforms like Vapi or Retell handle the full WebSocket lifecycle, audio transcoding, DTMF detection, call transfer, and concurrent stream management. Building those yourself takes weeks and produces fragile infrastructure that you own forever.
You need call recording and compliance logging. Orchestration platforms handle dual-channel recording and caller consent prompts. DIY implementations frequently miss one of these steps and create legal exposure, particularly in states with two-party consent rules.
Your team does not have production Node.js WebSocket experience. Backpressure handling, partial frames, connection timeouts, and reconnect logic all require real runtime experience to get right. An incorrect implementation will be less reliable than the vendor-managed path and harder to debug under load.
You are still evaluating voice quality. ElevenLabs voices are a one-click configuration in Vapi. Test the voice quality and latency there with zero code. If the results are solid and you have a specific integration requirement the platform cannot satisfy, then build the custom pipeline. Otherwise, you are engineering a solution to a problem you have not confirmed yet.

Monitoring in Production

Track these four metrics from the first deploy. They are the difference between "it works in staging" and a system you can actually debug when a client in Bend, OR calls at 8am on a Monday with a problem you've never seen.

TTS TTFA (Time to First Audio) — the wall-clock gap between your first ws.send() and the first non-empty audio chunk received. Alert if P95 exceeds 350ms on eleven_flash_v2_5.
WebSocket error rate — the percentage of TTS streams that end with an error rather than a clean isFinal: true. Healthy target is below 0.5% across all concurrent streams.
Concurrent stream count — if you are on the Creator plan (5 concurrent), alert at 4. Hitting the ceiling silently degrades call quality rather than failing loudly.
Total turn latency — wall-clock time from the end of the caller's last word to the first audio byte out from your agent. Target: under 700ms end-to-end for a live business receptionist.

ElevenLabs does not expose a usage metrics API for WebSocket streams, so you need to instrument these yourself. A simple Map<callId, number> that stores the send timestamp and computes the delta on the first audio receive event is enough to start. Feed it into your existing observability stack — Datadog, Grafana, or even a daily log query — before you go live, not after you start getting complaints.
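
A minimal sketch of that Map-based instrumentation follows. The class name `TtfaTracker` and its methods are hypothetical; timestamps are injectable so the logic can be tested without real sockets, and the percentile uses the nearest-rank method as one reasonable choice among several.

```typescript
// Records time-to-first-audio per turn and reports percentiles over the samples.
class TtfaTracker {
  // key -> send timestamp, or null once first audio has been recorded
  private readonly sentAt = new Map<string, number | null>();
  private readonly samplesMs: number[] = [];

  // Call at each ws.send(); only the first send of a turn starts the clock.
  markSend(turnKey: string, nowMs: number = Date.now()): void {
    if (!this.sentAt.has(turnKey)) this.sentAt.set(turnKey, nowMs);
  }

  // Call on each non-empty audio chunk; only the first one yields a sample.
  markAudio(turnKey: string, nowMs: number = Date.now()): number | undefined {
    const start = this.sentAt.get(turnKey);
    if (start === undefined || start === null) return undefined;
    this.sentAt.set(turnKey, null);
    const deltaMs = nowMs - start;
    this.samplesMs.push(deltaMs);
    return deltaMs;
  }

  // Call when the turn finishes so the key can be reused and memory is freed.
  endTurn(turnKey: string): void {
    this.sentAt.delete(turnKey);
  }

  // Nearest-rank percentile over all recorded samples.
  percentile(p: number): number | undefined {
    if (this.samplesMs.length === 0) return undefined;
    const sorted = [...this.samplesMs].sort((a, b) => a - b);
    const rank = Math.ceil((p / 100) * sorted.length);
    return sorted[Math.max(0, rank - 1)];
  }
}
```

Keying by call ID works if `endTurn` is called after every turn; otherwise a per-turn key (call ID plus a turn counter) avoids mixing turns. The P95 alert on TTFA described above would then compare `tracker.percentile(95)` against 350.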

Frequently asked questions

What is the best ElevenLabs model for real-time voice agents?

Use eleven_flash_v2_5, which targets approximately 75ms audio generation latency and supports 32 languages. It is the only ElevenLabs model fast enough for live conversational agents where callers are waiting. eleven_multilingual_v2 is better for pre-recorded IVR prompts and voicemail where latency tolerance is higher.

How do I reduce latency in my ElevenLabs TTS integration?

Use the WebSocket streaming endpoint instead of HTTP, set outputFormat to pcm_22050, pass optimizeStreamingLatency: 3, and set chunkLengthSchedule starting at 120 characters. Most importantly, pipe LLM tokens directly into the ElevenLabs WebSocket as they arrive rather than waiting for the complete LLM response.

Does ElevenLabs work with Vapi?

Yes, ElevenLabs is a native voice provider in Vapi. You configure it at the assistant level with the eleven_flash_v2_5 model and a voice ID. Vapi manages the WebSocket connection pool, which often produces lower latency than a custom integration because connection establishment overhead is amortized across many calls.

What audio format should I use when connecting ElevenLabs to Twilio?

Request outputFormat: "ulaw_8000" directly from ElevenLabs. This matches the native mu-law 8kHz format of Twilio Media Streams, so ElevenLabs does the encoding and you skip the transcode step on your own server entirely. Running ffmpeg in a serverless function for real-time audio conversion adds latency and fails unpredictably under load.

How many concurrent ElevenLabs TTS streams can I run?

It depends on your plan: Starter allows 2 concurrent WebSocket streams, Creator allows 5, Pro allows 10, and Scale goes higher. Exceeding the limit returns a 429 error immediately. Build a fallback to a pre-synthesized hold clip so callers hear something instead of silence when the limit is reached.

Can I use ElevenLabs TTS on Cloudflare Workers?

Yes, but the Node.js ws npm package is not available in the Workers runtime. Use the native Cloudflare Workers WebSocket API instead: make a fetch() call with an Upgrade: websocket header, then call ws.accept() on the returned webSocket object before sending or receiving data. Also account for the 30-second CPU limit per Worker request.

Written by

Thom Wilson

Founder & AI Engineer, Wild Run AI

SEO consultant turned AI engineer. Built WildRun after years getting small businesses found online — custom AI voice agents, sales and operations automation, and AI-era SEO, deployed on Cloudflare and managed end-to-end.

Last reviewed: June 2026