Architecture · 2026-06-25 · 9 min read · WildRun AI Engineering

Voice AI Latency Optimization: Achieve Sub-500ms Responses

Learn how to cut voice AI latency below 500ms: streaming pipelines, turn detection tuning, and the right STT/LLM/TTS model choices for production agents.

Advanced · Tools: Vapi, ElevenLabs, Deepgram, AssemblyAI, Cloudflare Workers, Anthropic SDK

A voice agent that takes 2.5 seconds to respond does not feel like a conversation — it feels like hold music with better vocabulary. The threshold where callers stop noticing the pause is around 800ms end-to-end. The threshold where the interaction feels natural, closer to how humans actually talk, is below 500ms. Getting there requires optimizing every stage of the pipeline simultaneously: speech-to-text, the LLM, text-to-speech, and the orchestration glue between them.

This guide covers the specific configuration changes, model selections, and architecture decisions that move a stock Vapi agent from 2–3 seconds down to the 465–700ms range achievable in 2026 production systems. Every technique here has been validated against real benchmark data — not marketing copy from vendor websites.

Why the 800ms ceiling matters

Human turn-taking in conversation operates on 200–500ms gaps between speakers. Research on telephone conversations consistently shows that callers begin to perceive a pause as unnatural at around 700–800ms, and begin doubting whether the system understood them by 1.5 seconds. By 3 seconds, abandonment rates climb sharply regardless of how accurate or helpful the eventual response is.

Most default configurations of voice AI platforms — including Vapi with out-of-the-box settings — land between 1.5 and 3 seconds end-to-end. That gap is not primarily a hardware problem. It is a configuration problem. The same underlying infrastructure can hit sub-700ms with deliberate choices at every stage. The largest single culprit is not the LLM or the TTS — it is the endpointing settings that determine when the system decides the user has finished speaking.

Sub-500ms matters most for high-volume outbound campaigns and front-desk replacement agents where call quality directly affects conversion. For internal tools or low-stakes use cases, the investment in latency optimization may not be worth the maintenance overhead — more on that in the final section.

The three-stage pipeline and where time goes

Every traditional voice AI pipeline has three serial stages. A millisecond saved in any one stage reduces total round-trip time for the caller. Understanding each stage independently is the prerequisite for knowing which to optimize first.

Stage 1: Speech-to-text

STT latency has two distinct components. The first is endpointing — detecting that the user has stopped speaking. The second is recognition latency — the time to produce an accurate transcript from that audio. With naive settings, endpointing alone can add 1.5 seconds by waiting for a silence timeout. Streaming STT emits partial transcripts every 50ms, letting downstream stages start before the full sentence is complete.
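To make the cost of a silence timeout concrete, the sketch below models a naive endpointer over voice-activity frames. The function name, the 10ms frame size, and the sample frames are illustrative assumptions, not any vendor's implementation.

```typescript
// Naive silence-timeout endpointing over VAD frames (illustrative model only).
// Each frame is true when speech was detected and false when it was silent.
function endOfTurnDelayMs(
  frames: boolean[],
  frameMs: number,
  silenceTimeoutMs: number,
): number | null {
  let lastSpeechIndex = -1;
  let silentRunMs = 0;
  for (let i = 0; i < frames.length; i++) {
    if (frames[i]) {
      lastSpeechIndex = i;
      silentRunMs = 0;
    } else if (lastSpeechIndex >= 0) {
      silentRunMs += frameMs;
      if (silentRunMs >= silenceTimeoutMs) {
        // Delay between the last speech frame and the end-of-turn decision.
        return (i - lastSpeechIndex) * frameMs;
      }
    }
  }
  return null; // the timeout never elapsed within the frames provided
}

// 300ms of speech followed by 2s of silence, in 10ms frames.
const vadFrames: boolean[] = [
  ...new Array<boolean>(30).fill(true),
  ...new Array<boolean>(200).fill(false),
];

const slowDelay = endOfTurnDelayMs(vadFrames, 10, 1500); // 1500
const fastDelay = endOfTurnDelayMs(vadFrames, 10, 100);  // 100
console.log({ slowDelay, fastDelay });
```

In this model the caller waits for the full silence window on every turn before any downstream stage can commit, which is why the timeout value dominates the budget.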

Deepgram Nova-3 with Flux delivers 200–400ms streaming latency with built-in end-of-turn detection for English, making it the standard recommendation for most agents in 2026. AssemblyAI Universal-3 Pro Streaming fires a native end_of_turn flag at 300–600ms median, removing the need for a separate endpointing model. Standard Whisper is the wrong choice for real-time use — it was designed for batch transcription and lacks native streaming support.

Stage 2: LLM inference

LLM latency is measured as time-to-first-token (TTFT): the gap between receiving the complete transcript and the first output token arriving. For GPT-4o, TTFT is typically 250–600ms. For GPT-4o mini, it drops to 50–120ms. For Gemini 2.0 Flash, 40–100ms. This difference matters because TTS cannot start synthesizing audio until the first LLM token arrives, so TTFT adds directly to end-to-end latency and cannot be hidden behind another stage.

Two additional factors compound LLM latency beyond model choice. System prompt length increases prefill time: a 2,000-token system prompt takes measurably longer to process than a 400-token one. Unconstrained max_tokens means longer TTS runs that the caller must sit through. Keeping max tokens at 100–200 for a conversational turn shortens both the LLM run and the downstream audio generation.

Stage 3: Text-to-speech

TTS latency is measured as time-to-first-byte (TTFB) of audio — the gap between the first input token arriving and the first audio chunk being emitted. ElevenLabs Flash v2.5 benchmarks at approximately 75ms TTFB under normal load, making it the fastest high-quality TTS option available in 2026. The critical requirement is that TTS must operate in streaming mode: receiving LLM tokens as they arrive and generating audio chunks in parallel, rather than waiting for the complete response text before starting synthesis.

The streaming architecture: overlapping every stage

The single highest-leverage architectural change is switching from a sequential pipeline (wait for STT to complete, then query the LLM, then wait for the full LLM response, then synthesize audio) to a fully streamed pipeline where each stage starts as soon as the previous one emits its first output. The time to first audio collapses from the sum of every stage's full duration to roughly the sum of each stage's time to first output, plus handoff overhead.

User speaks
    |
    v
[VAD + Endpointing] <-- Deepgram Flux / AssemblyAI EOT
    |  partial transcripts every ~50ms
    v
[STT streaming] ------ Deepgram Nova-3 (~150ms to first partial)
    |  tokens arrive before sentence completes
    v
[LLM streaming] ------ GPT-4o mini / Gemini Flash (~50-120ms TTFT)
    |  tokens emitted; flushed at sentence boundaries to TTS
    v
[TTS streaming] ------ ElevenLabs Flash v2.5 (~75ms TTFB)
    |  first audio chunk triggers playback immediately
    v
[Caller hears response]

Optimized total:     465-700ms end-to-end
Default sequential:  1,500-3,000ms
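
As a rough illustration of where the two totals come from, the sketch below adds up per-stage contributions to time-to-first-audio in both modes. Every number in it is an assumption chosen to fall inside the ranges quoted in this article, not a measurement, and the type and function names are hypothetical.

```typescript
// Per-stage contribution to time-to-first-audio, in milliseconds.
// All numbers are illustrative assumptions, not measurements.
interface StageBudget {
  name: string;
  sequentialMs: number; // next stage waits for this one to finish (default settings)
  streamedMs: number;   // next stage starts on this stage's first output (tuned settings)
}

const stageBudgets: StageBudget[] = [
  { name: "endpointing", sequentialMs: 500, streamedMs: 100 },
  { name: "stt", sequentialMs: 300, streamedMs: 150 },
  { name: "llm", sequentialMs: 900, streamedMs: 200 },
  { name: "tts", sequentialMs: 600, streamedMs: 75 },
];

function timeToFirstAudioMs(
  budgets: StageBudget[],
  mode: "sequentialMs" | "streamedMs",
): number {
  return budgets.reduce((total, stage) => total + stage[mode], 0);
}

const sequentialTotal = timeToFirstAudioMs(stageBudgets, "sequentialMs"); // 2300
const streamedTotal = timeToFirstAudioMs(stageBudgets, "streamedMs");     // 525
console.log({ sequentialTotal, streamedTotal });
```

Swapping in your own measured values per stage shows quickly which stage is worth tuning first.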

When running custom orchestration middleware on Cloudflare Workers, keep the per-request CPU time limit in mind (30 seconds by default on paid plans). The limit counts CPU time rather than wall-clock time, so individual voice turns never come close to it, but long calls need session state managed outside the Worker itself. Vapi handles this transparently when used as the orchestration layer, which is why it remains the recommended starting point even for custom latency-optimized setups.

Configuring a latency-optimized Vapi agent

If you are starting from scratch, see how to build a Vapi voice agent from scratch for the foundation. The configuration below focuses exclusively on the settings that move the latency needle. The most impactful single change is the endpointing value on the transcriber — the default is 500ms, but 100ms is sufficient for natural English and eliminates 400ms of dead time on every single turn.

import { VapiClient } from "@vapi-ai/server-sdk";

// The server SDK exports VapiClient and authenticates with a `token` option
const vapi = new VapiClient({ token: process.env.VAPI_API_KEY! });

const assistant = await vapi.assistants.create({
  name: "OptimizedAgent",

  // STT: Deepgram Nova-3 with aggressive endpointing
  transcriber: {
    provider: "deepgram",
    model: "nova-3",
    language: "en-US",
    smartFormat: false,    // skip post-processing pass; saves ~30ms
    endpointing: 100,      // ms of silence before end-of-turn; default is 500
  },

  // LLM: fast model, short output budget
  model: {
    provider: "openai",
    model: "gpt-4o-mini",  // ~50-120ms TTFT vs ~250-600ms for gpt-4o
    messages: [
      {
        role: "system",
        content: process.env.SYSTEM_PROMPT!,
      },
    ],
    temperature: 0,
    maxTokens: 150,         // shorter response = faster TTS; enough for one turn
    stream: true,
  },

  // TTS: ElevenLabs Flash v2.5 -- 75ms TTFB
  voice: {
    provider: "11labs",
    voiceId: process.env.ELEVENLABS_VOICE_ID!,
    model: "eleven_flash_v2_5",
    optimizeStreamingLatency: 4,  // maximum latency optimization setting
    stability: 0.5,
    similarityBoost: 0.75,
  },

  // Eliminate artificial delays -- defaults silently add 300-500ms
  responseDelaySeconds: 0,
  llmRequestDelaySeconds: 0,
  silenceTimeoutSeconds: 20,
  maxDurationSeconds: 600,
});

console.log("Assistant ID:", assistant.id);

Tuning turn detection: the hidden 1.5-second penalty

The second-largest source of latency waste is turn detection configuration. Vapi's default smart endpointing uses a text-based fallback that waits 1.5 seconds when no punctuation is detected at the end of a transcript. This is a common case in natural speech — callers often trail off or pause mid-thought without producing punctuation. That single default setting can negate every other optimization in your stack.

The right fix depends on which transcriber you are using. Deepgram with Flux handles end-of-turn natively and is the simplest path for English. AssemblyAI also fires a native end_of_turn flag. If you need to use a different transcriber, LiveKit smart endpointing with a tuned waitFunction is the next best option. If you must use Vapi's text-based endpointing, the minimum change is reducing onNoPunctuationSeconds from the 1.5s default down to 0.3s.

// Option A: Deepgram Flux -- best for English, native EOT, no extra config
// Vapi uses Deepgram's EOT signal directly; omit smartEndpointingPlan entirely
const deepgramConfig = {
  transcriber: {
    provider: "deepgram",
    model: "nova-3",
    language: "en-US",
    endpointing: 100,   // 100ms silence window; practical floor for English
    // No smartEndpointingPlan -- Flux handles EOT natively
  },
};

// Option B: AssemblyAI -- built-in end_of_turn flag, ~300-600ms median
const assemblyConfig = {
  transcriber: {
    provider: "assembly-ai",
    model: "best",
    // Fires end_of_turn natively; no separate endpointing plan needed
  },
};

// Option C: LiveKit with tuned waitFunction (for non-Deepgram transcribers)
const livekitConfig = {
  smartEndpointingPlan: {
    provider: "livekit",
    // p = probability (0-1) that the user has finished speaking
    waitFunction: `
      if (p >= 0.7) return 0;
      if (p >= 0.4) return 100;
      return 200;
    `,
  },
};

// Option D: Override Vapi text-based defaults directly
// Reduces the 1.5s no-punctuation penalty to 300ms
const vapiEndpointingConfig = {
  smartEndpointingPlan: {
    provider: "vapi",
    onPunctuationSeconds: 0.05,     // was 0.1 default
    onNumberSeconds: 0.1,           // was 0.5 default
    onNoPunctuationSeconds: 0.3,    // was 1.5 default -- the main culprit
  },
};
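
To see why the no-punctuation case dominates, here is a simplified model of the text-based rule described above. It is not Vapi's implementation; the function name and the regular expressions are assumptions, and only the three settings and their default values come from this article.

```typescript
// Simplified model of punctuation-based endpointing (NOT Vapi's actual code).
// It picks a wait time from how the current transcript ends.
interface TextEndpointingPlan {
  onPunctuationSeconds: number;
  onNumberSeconds: number;
  onNoPunctuationSeconds: number;
}

function waitSecondsFor(transcript: string, plan: TextEndpointingPlan): number {
  const text = transcript.trim();
  if (/[.!?]$/.test(text)) return plan.onPunctuationSeconds;
  if (/\d$/.test(text)) return plan.onNumberSeconds;
  return plan.onNoPunctuationSeconds;
}

const defaultPlan: TextEndpointingPlan = {
  onPunctuationSeconds: 0.1,
  onNumberSeconds: 0.5,
  onNoPunctuationSeconds: 1.5,
};

const tunedPlan: TextEndpointingPlan = {
  onPunctuationSeconds: 0.05,
  onNumberSeconds: 0.1,
  onNoPunctuationSeconds: 0.3,
};

// A caller trailing off without punctuation pays the largest wait.
const trailingOff = "I was hoping to book something for next week";
console.log(waitSecondsFor(trailingOff, defaultPlan)); // 1.5
console.log(waitSecondsFor(trailingOff, tunedPlan));   // 0.3
```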

Sentence-boundary flushing for faster TTS start

Even with streaming TTS, a practical constraint remains: most TTS providers need a syntactically coherent chunk of text before they can synthesize audio that sounds natural. Passing tokens one-by-one produces choppy output. Waiting for the full LLM response adds hundreds of milliseconds. The solution is flushing to TTS at sentence boundaries — the first complete sentence starts audio playback while the LLM is still generating the rest of the response.

When building custom orchestration on Cloudflare Workers — for example, a relay layer between Vapi webhooks and a specialized LLM endpoint — this pattern applies directly. The key is detecting punctuation in the accumulated token buffer and flushing as soon as a sentence ends rather than waiting for the full completion.

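// workers/voice-relay.ts
interface Env {
  OPENAI_API_KEY: string;
  SYSTEM_PROMPT: string;
}

export default {
  async fetch(req: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    const { transcript } = await req.json<{ transcript: string }>();

    // Fire LLM request immediately -- no buffering
    const llmRes = await fetch("https://api.openai.com/v1/chat/completions", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${env.OPENAI_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "gpt-4o-mini",
        messages: [
          { role: "system", content: env.SYSTEM_PROMPT },
          { role: "user", content: transcript },
        ],
        stream: true,
        max_tokens: 150,
      }),
    });
    if (!llmRes.ok || !llmRes.body) {
      return new Response("Upstream LLM request failed", { status: 502 });
    }

    const { readable, writable } = new TransformStream<Uint8Array, Uint8Array>();
    const writer = writable.getWriter();
    const enc = new TextEncoder();
    const dec = new TextDecoder();

    let sseBuffer = "";  // partial SSE line carried over between network chunks
    let textBuffer = ""; // tokens not yet flushed to TTS
    // Sentence end: ., ! or ? followed by whitespace, so "3.5" is not split
    const SENTENCE_BOUNDARY = /[.!?]["')\]]?\s/g;

    const pump = async () => {
      const reader = llmRes.body!.getReader();
      try {
        while (true) {
          const { done, value } = await reader.read();
          if (done) break;

          // stream: true keeps multi-byte characters intact across chunks
          sseBuffer += dec.decode(value, { stream: true });
          const lines = sseBuffer.split("\n");
          sseBuffer = lines.pop() ?? ""; // last element may be an incomplete line

          for (const line of lines) {
            if (!line.startsWith("data: ")) continue;
            const data = line.slice(6).trim();
            if (data === "[DONE]") continue;

            try {
              textBuffer += JSON.parse(data).choices[0]?.delta?.content ?? "";
            } catch (_) {
              continue; // Malformed SSE payload -- skip
            }

            // Flush everything up to the last complete sentence;
            // the first flush lets TTS start while the LLM keeps generating
            let cut = 0;
            while (SENTENCE_BOUNDARY.exec(textBuffer) !== null) {
              cut = SENTENCE_BOUNDARY.lastIndex;
            }
            if (cut > 0) {
              await writer.write(enc.encode(textBuffer.slice(0, cut)));
              textBuffer = textBuffer.slice(cut);
            }
          }
        }
        // Flush whatever remains once the LLM stream ends
        if (textBuffer.trim()) await writer.write(enc.encode(textBuffer));
        await writer.close();
      } catch (err) {
        // Propagate the failure to the reader instead of closing cleanly
        writer.abort(err).catch(() => {});
      }
    };
    ctx.waitUntil(pump());

    return new Response(readable, {
      headers: { "Content-Type": "text/plain; charset=utf-8" },
    });
  },
} satisfies ExportedHandler<Env>;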
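
The flushing rule itself can be isolated into a pure function, which makes it easy to unit test away from the network. The sketch below is one possible formulation, with a hypothetical function name; it treats a sentence as complete only when the terminal punctuation is followed by whitespace, so decimals such as "3.5" are not split.

```typescript
// Split accumulated LLM text into complete sentences plus a trailing remainder.
// A boundary is ., ! or ? (optionally followed by a closing quote or bracket)
// and then whitespace.
function takeCompleteSentences(buffer: string): { ready: string; rest: string } {
  const boundary = /[.!?]["')\]]?\s+/g;
  let cut = 0;
  while (boundary.exec(buffer) !== null) {
    cut = boundary.lastIndex; // end of the most recent complete sentence
  }
  return { ready: buffer.slice(0, cut), rest: buffer.slice(cut) };
}

const partial = takeCompleteSentences("Sure. Your total is 3.5 dol");
console.log(partial); // { ready: "Sure. ", rest: "Your total is 3.5 dol" }
```

The trade-off of requiring trailing whitespace is that a sentence is released one token later than with an end-of-buffer match, so the caller of this function still needs a final flush when the LLM stream ends.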

Measuring pipeline latency in production

You cannot optimize what you do not measure. Vapi's webhook events contain enough timing information to reconstruct per-stage latency for every call. The key events to track are speech-update when the user stops speaking, assistant-request when the LLM query fires, and the bot's first speech-update started when audio reaches the caller. The gap between the first two events is your endpointing overhead. The gap between the second and third is your combined LLM plus TTS latency.

// app/api/vapi-webhook/route.ts  (Next.js App Router)
export async function POST(req: Request) {
  const event = await req.json();
  const msg = event.message;
  if (!msg) return new Response("ok");

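  // Vapi nests the call object inside `message`; fall back to the top level just in case
  const callId = msg.call?.id ?? event.call?.id;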

  // User finished speaking -- start of endpointing window
  if (msg.type === "speech-update" &&
      msg.status === "stopped" &&
      msg.role === "user") {
    console.log(JSON.stringify({ event: "user_speech_end", callId, ts: Date.now() }));
  }

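  // LLM query fired -- end of endpointing window
  // NOTE: Vapi documents assistant-request as the call-start request for
  // assistant configuration; confirm it fires per turn in your setup before
  // relying on this timestamp.
  if (msg.type === "assistant-request") {
    console.log(JSON.stringify({
      event: "llm_query_start",
      callId,
      ts: Date.now(),
      // Whitespace-separated word count, a rough proxy for transcript length
      transcriptWords: msg.transcript?.split(" ").length ?? 0,
    }));
  }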

  // Bot audio starts playing -- LLM + TTS fully cleared
  if (msg.type === "speech-update" &&
      msg.status === "started" &&
      msg.role === "bot") {
    console.log(JSON.stringify({ event: "bot_speech_start", callId, ts: Date.now() }));
  }

  return new Response("ok");
}

Feed these logs into any aggregation tool grouped by callId. Compute llm_query_start minus user_speech_end for endpointing overhead and bot_speech_start minus llm_query_start for combined LLM and TTS time. P50 numbers tell you what typical callers experience; P95 tells you what the worst 5% experience. High P50 usually points to endpointing configuration. High P95 with acceptable P50 usually indicates TTS latency spikes under load or cold LLM starts on infrequently-used endpoints.
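
If you do not have an aggregation tool in place yet, the same numbers can be computed directly from the log lines. The sketch below assumes the three event names emitted by the handler above and uses a nearest-rank percentile; the function names and sample timestamps are illustrative.

```typescript
interface TimingLog {
  event: "user_speech_end" | "llm_query_start" | "bot_speech_start";
  callId: string;
  ts: number; // epoch milliseconds
}

// Nearest-rank percentile; p is in the range 0-100.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index] ?? NaN;
}

// Pair events per call, turn by turn, into the two gaps described above.
function turnGaps(logs: TimingLog[]): { endpointingMs: number[]; llmTtsMs: number[] } {
  const endpointingMs: number[] = [];
  const llmTtsMs: number[] = [];
  const lastSpeechEnd = new Map<string, number>();
  const lastQueryStart = new Map<string, number>();

  for (const log of [...logs].sort((a, b) => a.ts - b.ts)) {
    if (log.event === "user_speech_end") {
      lastSpeechEnd.set(log.callId, log.ts);
    } else if (log.event === "llm_query_start") {
      const start = lastSpeechEnd.get(log.callId);
      if (start !== undefined) endpointingMs.push(log.ts - start);
      lastSpeechEnd.delete(log.callId);
      lastQueryStart.set(log.callId, log.ts);
    } else {
      const start = lastQueryStart.get(log.callId);
      if (start !== undefined) llmTtsMs.push(log.ts - start);
      lastQueryStart.delete(log.callId);
    }
  }
  return { endpointingMs, llmTtsMs };
}

const sampleLogs: TimingLog[] = [
  { event: "user_speech_end", callId: "a", ts: 0 },
  { event: "llm_query_start", callId: "a", ts: 120 },
  { event: "bot_speech_start", callId: "a", ts: 400 },
  { event: "user_speech_end", callId: "b", ts: 1000 },
  { event: "llm_query_start", callId: "b", ts: 1300 },
  { event: "bot_speech_start", callId: "b", ts: 1650 },
];

const gaps = turnGaps(sampleLogs);
console.log({
  endpointingP50: percentile(gaps.endpointingMs, 50),
  endpointingP95: percentile(gaps.endpointingMs, 95),
  llmTtsP50: percentile(gaps.llmTtsMs, 50),
  llmTtsP95: percentile(gaps.llmTtsMs, 95),
});
```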

Production gotchas

Vapi's 10-second first-response timeout. Vapi drops a call if no audio response is generated within 10 seconds of the user speaking. Under optimized latency this never triggers, but cold-start LLM spikes during low-traffic hours or after a fresh deploy can exceed this ceiling. Add a short fallback utterance in your system prompt instructions so the agent has something to say while the LLM warms up, or intercept the assistant-request webhook to inject a filler response when LLM latency exceeds a threshold you set.

Endpointing below 80ms causes false end-of-turn triggers. Reducing the Deepgram endpointing parameter below 80ms starts cutting off users mid-sentence, particularly for speakers who pause between clauses or take a breath mid-thought. 100–120ms is the practical floor for conversational English. Go lower and you will see the agent interrupting callers, which consistently scores worse in caller satisfaction than a 200ms slower response.

ElevenLabs Flash v2.5 TTFB degrades under concurrent load. The 75ms benchmark holds at low concurrency. At high call volume, TTFB can spike 3–5x as the ElevenLabs API queues requests. Build a fallback TTS configuration — ElevenLabs Turbo v2 or Cartesia Sonic are both reasonable alternatives — that activates when Flash latency exceeds a threshold you detect via webhook timing. Do not discover this failure mode during a high-traffic campaign.
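
One way to decide when to switch is a small rolling-window monitor fed from your timing logs. The sketch below is an assumption about how such a trigger could look; the class name, window size, and thresholds are hypothetical, and wiring the result into an actual voice configuration change is left out.

```typescript
// Rolling-window monitor that signals when recent TTS first-byte latency is
// repeatedly above a threshold. All thresholds here are illustrative.
class TtfbMonitor {
  private samples: number[] = [];
  private readonly windowSize: number;
  private readonly thresholdMs: number;
  private readonly minBreaches: number;

  constructor(windowSize: number, thresholdMs: number, minBreaches: number) {
    this.windowSize = windowSize;
    this.thresholdMs = thresholdMs;
    this.minBreaches = minBreaches;
  }

  record(ttfbMs: number): void {
    this.samples.push(ttfbMs);
    if (this.samples.length > this.windowSize) this.samples.shift();
  }

  shouldFallback(): boolean {
    const breaches = this.samples.filter((ms) => ms > this.thresholdMs).length;
    return breaches >= this.minBreaches;
  }
}

// Fall back when 3 of the last 5 turns exceeded 250ms TTFB.
const monitor = new TtfbMonitor(5, 250, 3);
[80, 90, 300].forEach((ms) => monitor.record(ms));
const afterOneSpike = monitor.shouldFallback(); // false
[320, 400].forEach((ms) => monitor.record(ms));
const afterSustainedSpike = monitor.shouldFallback(); // true
console.log({ afterOneSpike, afterSustainedSpike });
```

Requiring several breaches inside the window avoids flapping between providers on a single slow request.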

Smart endpointing and Deepgram Flux can conflict. If you set a smartEndpointingPlan while using Deepgram as the transcriber, Vapi may apply the text-based smart endpointing logic instead of Deepgram Flux's native EOT signal. When using Deepgram, explicitly omit the smartEndpointingPlan field entirely to ensure Flux's built-in EOT handling takes precedence.

Cloudflare Workers enforce a CPU time limit per request (30 seconds by default on paid plans). The limit counts CPU time, not wall-clock time, so relaying a streamed LLM response rarely approaches it, even when an open-ended question produces a long reply. The more likely failure is an upstream request that stalls and leaves the caller in silence. Cap max_tokens at the LLM level and pass signal: AbortSignal.timeout(8000) to your LLM fetch calls inside Workers so they fail fast rather than hang silently.

Prompt caching only helps after the cache is warm. Both the Anthropic and OpenAI APIs support prefix caching for repeated system prompts, which can cut TTFT by 30–50% for cached prefixes. But a cache miss, whether on the first request or after a deploy that changes the system prompt, produces normal uncached latency. Do not set P50 latency targets from cache-hit TTFT alone. Measure cached and uncached distributions separately so spikes are visible.

Regional routing matters more than most teams realize. Running your LLM proxy Worker in a US West Cloudflare datacenter while routing to an OpenAI endpoint served from US East adds 60–80ms of round-trip that no code optimization can remove. Workers run in the datacenter nearest the caller by default, so enable Smart Placement or otherwise place the proxy close to your primary LLM API endpoint region. For agents serving Bend, Oregon and the broader Pacific Northwest, US-West-2 colocation consistently outperforms cross-region routing by 50–100ms on P50 measurements.

When NOT to build this yourself

If you are building a proof-of-concept or an internal tool with under 10 concurrent callers, the default Vapi configuration is sufficient. The 1.5–3 second latency range is noticeable but tolerable for most non-public-facing use cases. The optimizations above require active monitoring, per-model cost accounting, fallback logic, and ongoing configuration maintenance as providers update their APIs. That overhead is hard to justify at small scale.

If your deployment will handle more than 50 concurrent calls, the streaming architecture puts real pressure on LLM API rate limits. GPT-4o mini allows 500 requests per minute on Tier 1. At 50 concurrent calls with 5 turns per minute each, you are already at 250 requests per minute, half the ceiling, and retries or a burst toward 100 concurrent calls use up the rest. Managed platforms that own their own LLM inference infrastructure handle rate limits transparently and may be a better fit than a self-managed stack at that volume.
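
The rate-limit budget is simple arithmetic, and it is worth computing for your own traffic shape. The sketch below uses the 500 requests-per-minute figure from the paragraph above and assumes one LLM request per conversational turn; the function names are illustrative.

```typescript
// One LLM request per turn is an assumption; tool calls or retries add more.
function requestsPerMinute(concurrentCalls: number, turnsPerMinutePerCall: number): number {
  return concurrentCalls * turnsPerMinutePerCall;
}

// Largest number of concurrent calls that fits under the limit.
function maxConcurrentCalls(limitRpm: number, turnsPerMinutePerCall: number): number {
  return Math.floor(limitRpm / turnsPerMinutePerCall);
}

const limitRpm = 500;
const usedRpm = requestsPerMinute(50, 5);        // 250
const utilization = usedRpm / limitRpm;          // 0.5
const callCeiling = maxConcurrentCalls(limitRpm, 5); // 100
console.log({ usedRpm, utilization, callCeiling });
```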

If voice quality matters more than response speed for your specific use case — a high-empathy intake flow, a legal consultation screener, or any agent where an unhurried tone builds trust — optimizing to sub-500ms may actively hurt the experience. ElevenLabs Flash v2.5 at 75ms produces slightly less expressive audio than the full Multilingual v2 model. Measure caller satisfaction outcomes alongside latency metrics, not latency alone.

If the team deploying this agent does not have dedicated capacity to monitor P95 latency and respond to TTS provider outages, a fully optimized custom stack is more fragile than a well-configured managed service. A predictable 700ms latency from a managed vendor is more reliable in practice than a hand-tuned 465ms setup that occasionally spikes to 3 seconds when a dependency degrades. Know your operational capacity before committing to this level of infrastructure ownership.

Frequently asked questions

What is end-to-end latency in a voice AI agent?

End-to-end latency is the total elapsed time from when a user finishes speaking to when the AI agent begins playing audio back. It spans four stages: speech detection (endpointing), speech-to-text transcription, LLM inference, and TTS synthesis. All four must complete or begin streaming before the caller hears a response.

What end-to-end latency is achievable with Vapi in 2026?

With Deepgram Nova-3 (endpointing: 100ms), GPT-4o mini (stream: true, maxTokens: 150), and ElevenLabs Flash v2.5, production teams consistently achieve 465-700ms end-to-end. Default Vapi configuration with no tuning typically lands between 1.5 and 3 seconds.

Why does Vapi's default configuration add so much latency?

The primary cause is Vapi's default smart endpointing, which waits 1.5 seconds when no punctuation is detected at the end of a transcript. Setting Deepgram endpointing to 100ms and either relying on Deepgram's native end-of-turn signal or, with text-based endpointing, lowering onNoPunctuationSeconds to 0.3 eliminates most of this overhead. Together these are the highest-impact configuration changes.

Is ElevenLabs Flash v2.5 good enough quality for customer-facing agents?

For most business phone use cases such as appointment booking, lead qualification, and basic triage, yes. The quality difference between Flash v2.5 (75ms TTFB) and higher-quality models is noticeable in direct comparison but rarely flagged by callers in real conversations. For high-empathy use cases like medical intake or legal consultation, the slower but more expressive models may be worth the added latency.

What is the practical minimum end-to-end latency for a voice AI agent?

Speech-to-speech models, which bypass the STT to LLM to TTS pipeline, can achieve 160-400ms. For a traditional cascaded pipeline using Vapi with API-based services, the practical floor is around 400-465ms with current infrastructure. Below that requires dedicated GPU-hosted inference rather than third-party API calls.

Should I use AssemblyAI or Deepgram for low-latency STT?

Both are strong choices for production. Deepgram Nova-3 with Flux delivers 200-400ms with integrated end-of-turn detection and is the most commonly recommended for English agents. AssemblyAI Universal-3 Pro Streaming fires a native end_of_turn flag at 300-600ms median and is preferable if you are already using AssemblyAI elsewhere. Avoid standard Whisper for real-time voice as it was designed for batch transcription and lacks native streaming.

Written by

Thom Wilson

Founder & AI Engineer, Wild Run AI

SEO consultant turned AI engineer. Built WildRun after years getting small businesses found online — custom AI voice agents, sales and operations automation, and AI-era SEO, deployed on Cloudflare and managed end-to-end.

Last reviewed: June 2026