Best LLM for Voice Agents in 2026: A Developer's Comparison
Compare GPT-4.1, Claude Sonnet 4.6, Gemini 2.5 Flash, and Groq for voice AI. Real TTFT data, function-calling grades, and TypeScript you can deploy today.
The STT and TTS layers in a voice pipeline are largely commoditized — Deepgram Nova-3 and ElevenLabs Flash v2.5 are the clear defaults for most teams in 2026. The variable that actually separates a voice agent that feels natural from one that feels laggy is the LLM in the middle. Pick wrong and you lose 400ms on every turn, tool calls start misfiring on a meaningful share of conversations, or your bill triples at scale.
This post compares the five models worth considering for production voice agents today: GPT-4.1, Claude Sonnet 4.6, Gemini 2.5 Flash, Groq + Llama 3.3 70B, and Claude Haiku 4.5. Real TTFT numbers, function-calling grades, cost at scale, and the TypeScript to wire each one into a Vapi pipeline.
The constraint set that makes voice different
Voice agents operate inside a latency budget that chat interfaces never face. From the moment STT emits its final transcript, your LLM has roughly 200–700ms before the TTS pipeline can produce audio. Anything past 700ms TTFT registers as an unnatural pause to most callers — delays over 1 second consistently cause frustration and higher hang-up rates.
That budget breaks down roughly like this:
- STT finalization: ~100ms (Deepgram Nova-3)
- LLM time to first token: 150–500ms — the variable you control
- TTS first audio chunk: ~80ms (ElevenLabs Flash v2.5)
- Telephony and network round-trip: ~60ms
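As a quick sanity check, the breakdown above can be summed in a few lines. This is a minimal sketch: the `Stage` type and `totalBudget` helper are illustrative names, and the figures are the approximate ones quoted in the list.

```typescript
// Approximate per-stage latencies from the breakdown above, in milliseconds.
// Only the LLM stage is a range, because it is the one that varies by model.
interface Stage {
  name: string;
  minMs: number;
  maxMs: number;
}

const stages: Stage[] = [
  { name: "STT finalization", minMs: 100, maxMs: 100 },
  { name: "LLM time to first token", minMs: 150, maxMs: 500 },
  { name: "TTS first audio chunk", minMs: 80, maxMs: 80 },
  { name: "Telephony and network", minMs: 60, maxMs: 60 },
];

// Sum the best-case and worst-case latency across all stages.
function totalBudget(list: Stage[]): { minMs: number; maxMs: number } {
  return list.reduce(
    (acc, s) => ({ minMs: acc.minMs + s.minMs, maxMs: acc.maxMs + s.maxMs }),
    { minMs: 0, maxMs: 0 }
  );
}

const total = totalBudget(stages);
console.log(`End-to-end: ${total.minMs}-${total.maxMs}ms`); // End-to-end: 390-740ms
```

With the fixed stages costing about 240ms combined, every millisecond of slack comes out of the LLM's share.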
Two traps reliably kill TTFT in production. First: reasoning mode. Every frontier model in 2026 ships a reasoning or thinking toggle. For voice, that toggle must stay off, without exception. Reasoning TTFT runs from 8 seconds to over 200 seconds on complex prompts. That is not a slow turn; it is a dropped call. Second: unbounded context growth. Voice transcripts accumulate fast. Feeding 8,000 tokens of conversation history into every LLM call adds 50–150ms compared to a trimmed 1,500-token window. Keep turn-by-turn context lean and start summarizing aggressively at turn 10.
Vapi enforces a 10-second first-response budget before it considers the turn failed. You want median TTFT well under 1 second and P95 under 2 seconds — that leaves headroom for slow network conditions without Vapi timing out the turn.
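To check those targets against your own logs, something like the following could work. It is a sketch that assumes you already collect per-turn TTFT samples in milliseconds; `percentile` uses the nearest-rank method, and `ttftReport` is a hypothetical helper name.

```typescript
// Nearest-rank percentile over TTFT samples (milliseconds).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

interface TtftReport {
  p50: number;
  p95: number;
  tailRatio: number; // P95 divided by P50
  withinTargets: boolean;
}

// Targets from the text: median under 1 second, P95 under 2 seconds.
function ttftReport(samples: number[]): TtftReport {
  const p50 = percentile(samples, 50);
  const p95 = percentile(samples, 95);
  return {
    p50,
    p95,
    tailRatio: p95 / p50,
    withinTargets: p50 < 1000 && p95 < 2000,
  };
}

const report = ttftReport([400, 450, 500, 420, 480, 900, 430, 460, 440, 470]);
console.log(report.p50, report.p95, report.tailRatio); // 450 900 2
```

The `tailRatio` field is the same P95/P50 measure used to compare providers later in this post.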
The five models worth evaluating
GPT-4.1 — the production default
TTFT: ~450ms median, ~800ms P95. Throughput: ~125 tok/s. Cost: $2.00 input / $8.00 output per million tokens.
GPT-4.1 is the model most production Vapi deployments run on in 2026, and the reasons are mundane: it works reliably. Instruction following on noisy ASR transcripts — filler words, disfluencies, mid-sentence restarts — is more consistent here than on other frontier models. Function calling fires correctly on the first attempt at a rate that matters at scale. Tool call failures in a voice agent mean dead air while the agent retries, and GPT-4.1 keeps that failure rate low. The 1M-token context window is overkill for single calls but useful if you inject lengthy service catalogs or knowledge bases into your system prompt.
The weakness is cost. At $8.00/M output tokens, a 10-minute call generating around 4,000 output tokens per minute runs to roughly $0.32/call on LLM alone — before STT, TTS, Vapi platform fees ($0.05/min), and telephony. That math shifts your unit economics quickly at volume.
Claude Sonnet 4.6 — best tail-latency predictability
TTFT: ~500ms median, ~900ms P95. Throughput: ~90 tok/s. Cost: $3.00 input / $15.00 output per million tokens.
Sonnet 4.6's median TTFT is slightly slower than GPT-4.1, but its P95/P50 ratio — 1.8× — is the best among frontier providers tested in mid-2026. In practice that means fewer calls where the agent suddenly pauses for 2+ seconds. For voice, tail latency matters more than median latency: one bad turn can end the call. Sonnet 4.6 also handles complex multi-step reasoning inside a single turn well — dynamic intake questions that require branching on caller responses, or instructions that require reading between the lines of ambiguous speech.
Output cost is the limiting factor: $15.00/M output tokens is nearly double GPT-4.1. For a typical intake agent generating 300 output tokens per turn × 20 turns per call, that is ~$0.09/call on LLM output alone. Fine at low volume; significant at 10,000 calls/month.
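The per-call arithmetic is easy to reproduce for your own traffic shape. The sketch below uses only the output prices quoted in this post; the `PRICING` table keys and the `outputCostPerCall` helper are illustrative names, not part of any SDK.

```typescript
interface Pricing {
  inputPerMillion: number; // USD per million input tokens
  outputPerMillion: number; // USD per million output tokens
}

// Prices as quoted in this post.
const PRICING: Record<string, Pricing> = {
  "gpt-4.1": { inputPerMillion: 2.0, outputPerMillion: 8.0 },
  "claude-sonnet-4.6": { inputPerMillion: 3.0, outputPerMillion: 15.0 },
};

// Output-token cost of one call, in USD.
function outputCostPerCall(model: string, tokensPerTurn: number, turns: number): number {
  const price = PRICING[model];
  if (!price) throw new Error(`unknown model: ${model}`);
  return (tokensPerTurn * turns * price.outputPerMillion) / 1e6;
}

const sonnetCost = outputCostPerCall("claude-sonnet-4.6", 300, 20);
const gptCost = outputCostPerCall("gpt-4.1", 300, 20);
console.log(sonnetCost.toFixed(3), gptCost.toFixed(3)); // 0.090 0.048
console.log((sonnetCost * 10000).toFixed(0)); // 900 (USD per month at 10,000 calls)
```

At 10,000 calls per month the same intake workload costs about $900 in Sonnet 4.6 output tokens versus about $480 on GPT-4.1.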
Gemini 2.5 Flash — the cost-performance option
TTFT: ~600ms median (endpoint-dependent). Throughput: ~204 tok/s — fastest of any tested provider. Cost: $0.075 input / $0.30 output per million tokens.
Gemini 2.5 Flash changes the unit economics significantly. At $0.30/M output tokens, it is approximately 27× cheaper than GPT-4.1 ($8.00/M) and 50× cheaper than Claude Sonnet 4.6 ($15.00/M) on output. For high-volume commodity tasks such as appointment confirmation, FAQ deflection, and basic lead triage, the quality difference between Gemini Flash and GPT-4.1 rarely justifies the cost gap. The 204 tok/s throughput also means shorter waits even at higher output lengths.
The caveats matter. TTFT at P95 can reach 1,800ms on the Google AI Studio public endpoint during US business hours. Using Vertex AI with a regional endpoint — us-west1 is the right choice for deployments serving Central Oregon — tightens P95 to approximately 900ms. Function calling accuracy on complex nested tool schemas also lags behind GPT-4.1. Test your specific tool definitions before committing at production scale.
Groq + Llama 3.3 70B — the speed outlier
TTFT: ~180ms median (8B model), ~250ms median (70B). Throughput: 500+ tok/s. Cost: $0.59 input / $0.79 output per million tokens (Llama 3.3 70B).
Groq's LPU (Language Processing Unit) hardware produces inference speeds that GPU-based providers cannot match. At 180ms median TTFT on the 8B model, Groq creates genuine headroom in your latency budget — you can add a CRM lookup before the LLM call and still hit under 700ms total end-to-end. The 70B Llama 3.3 model follows instructions well enough for structured workflows like appointment booking and FAQ routing.
The production constraints are real. Groq rate limits default to around 30 RPM for the 70B model on standard tiers, which limits concurrent call capacity. Complex parallel tool calls are also less consistent on open-weight models than on GPT-4.1. Use Groq for simple, single-tool workflows or as a speed-optimized fallback path — not as the primary model for complex intake flows.
Claude Haiku 4.5 — the economical Anthropic option
TTFT: Sub-600ms median. Cost: $0.80 input / $4.00 output per million tokens.
Haiku 4.5 sits between Sonnet 4.6 and Gemini Flash in both capability and cost. It inherits Anthropic's strong instruction-following on ASR transcripts and is the right tier when you need Anthropic API coverage — HIPAA BAA, consistent behavior guarantees — but cannot absorb Sonnet 4.6 pricing at scale. Function calling is reliable on simple single-tool schemas; less consistent on deeply nested multi-tool flows where Sonnet 4.6 is the safer choice.
Pipeline architecture
The most cost-efficient production deployments do not pick a single model for every call — they route by turn complexity. A caller asking to reschedule an appointment is structurally different from one who needs you to look up their insurance details, validate the policy number, and branch on three possible eligibility outcomes.
Caller speech
│
▼
[Deepgram Nova-3 STT] ~100ms
│ transcript + metadata
▼
[Complexity Classifier] <20ms (regex/keyword — no LLM)
├─ simple FAQ / booking ──► Gemini 2.5 Flash ~600ms TTFT
├─ multi-step / tool-heavy ─► GPT-4.1 ~450ms TTFT
└─ HIPAA intake / complex ──► Claude Sonnet 4.6 ~500ms TTFT
│
▼
[LLM response + tool calls]
│
▼
[ElevenLabs Flash v2.5 TTS] ~80ms
│
▼
Caller hears audio
Total target: <750ms end-to-end
The classifier itself must be cheap and fast: a regex or keyword pass over the transcript keeps classification under 20ms. Do not route the routing decision through a second LLM call.
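A keyword classifier of that kind might look like the sketch below. The patterns are hypothetical placeholders rather than a tested rule set; tune them against real transcripts before trusting the routing.

```typescript
type Complexity = "simple" | "complex" | "sensitive";

// Hypothetical keyword patterns. No /g flag, so test() keeps no state between calls.
const SENSITIVE = /\b(insurance|policy number|diagnosis|prescription|medical record)\b/i;
const COMPLEX = /\b(eligib\w*|billing|dispute|refund)\b/i;

// Pure regex pass over the transcript: no network call, no LLM.
function classifyTranscript(transcript: string): Complexity {
  if (SENSITIVE.test(transcript)) return "sensitive";
  if (COMPLEX.test(transcript)) return "complex";
  return "simple";
}

console.log(classifyTranscript("I need to reschedule my appointment")); // simple
console.log(classifyTranscript("Can you look up my insurance details")); // sensitive
```

Checking the sensitive patterns first means an ambiguous transcript falls to the more capable tier rather than the cheaper one.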
Connecting models in Vapi
Vapi's assistant configuration accepts a model object that lets you swap providers with a single field change. Here is the base pattern covering all three primary models, with a routing function that selects based on call complexity:
import Vapi from "@vapi-ai/server-sdk";

const vapi = new Vapi({ token: process.env.VAPI_API_KEY! });

const SYSTEM_PROMPT = `You are a helpful scheduling assistant.
Be concise: this is a phone call, not a chat.
Speak in complete sentences. Never read back JSON or code.`;

// GPT-4.1: production default for complex flows
const gpt41Config = {
  provider: "openai" as const,
  model: "gpt-4.1",
  temperature: 0.3,
  maxTokens: 400,
  messages: [{ role: "system" as const, content: SYSTEM_PROMPT }],
};

// Gemini 2.5 Flash: high-volume, cost-sensitive paths
const geminiFlashConfig = {
  provider: "google" as const,
  model: "gemini-2.5-flash",
  temperature: 0.2,
  maxTokens: 300,
  messages: [{ role: "system" as const, content: SYSTEM_PROMPT }],
};

// Claude Sonnet 4.6: HIPAA intake, complex branching logic
const sonnet46Config = {
  provider: "anthropic" as const,
  model: "claude-sonnet-4-6",
  temperature: 0.2,
  maxTokens: 500,
  messages: [{ role: "system" as const, content: SYSTEM_PROMPT }],
};

type Complexity = "simple" | "complex" | "sensitive";

function selectModel(complexity: Complexity) {
  if (complexity === "sensitive") return sonnet46Config;
  if (complexity === "complex") return gpt41Config;
  return geminiFlashConfig;
}

export async function createVoiceAssistant(complexity: Complexity) {
  return vapi.assistants.create({
    name: `intake-agent-${complexity}`,
    model: selectModel(complexity),
    voice: {
      provider: "elevenlabs" as const,
      voiceId: "21m00Tcm4TlvDq8ikWAM",
      model: "eleven_flash_v2_5",
      stability: 0.5,
      similarityBoost: 0.75,
    },
    transcriber: {
      provider: "deepgram" as const,
      model: "nova-3",
      language: "en-US",
    },
    firstMessageMode: "assistant-speaks-first",
    firstMessage: "Thanks for calling. How can I help you today?",
  });
}
Function calling: where models diverge most
Latency benchmarks get most of the attention, but function calling reliability is the real failure mode in production. A voice agent that says "One moment while I check that" and then silently fails to call the tool — because the LLM emitted malformed JSON or picked the wrong function — produces a dead-air moment that callers interpret as the line dropping. That failure is invisible in your latency metrics.
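One way to make that failure visible is to validate every tool call before executing it and to log the rejections. The sketch below assumes the call reaches your code as a function name plus a JSON string of arguments; `parseToolCall` and the `ParsedCall` type are hypothetical names.

```typescript
type ParsedCall =
  | { ok: true; name: string; args: Record<string, unknown> }
  | { ok: false; reason: string };

// Rejects unknown function names and malformed argument payloads
// instead of letting them fail silently downstream.
function parseToolCall(name: string, rawArguments: string, knownTools: Set<string>): ParsedCall {
  if (!knownTools.has(name)) return { ok: false, reason: `unknown tool: ${name}` };
  try {
    const parsed: unknown = JSON.parse(rawArguments);
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
      return { ok: false, reason: "arguments are not a JSON object" };
    }
    return { ok: true, name, args: parsed as Record<string, unknown> };
  } catch {
    return { ok: false, reason: "arguments are not valid JSON" };
  }
}

const knownTools = new Set(["check_appointment_availability"]);
const result = parseToolCall("check_appointment_availability", '{"requested_date": ', knownTools);
console.log(result); // { ok: false, reason: 'arguments are not valid JSON' }
```

Counting the `ok: false` results per model gives you a tool-call failure rate that sits alongside your latency metrics.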
The tool definition schema you write matters as much as the model. Keep schemas flat. Name parameters with full English words, not abbreviations. Include an example value in every description field. All three top models perform meaningfully better with those practices in place.
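Those three practices are simple enough to lint for. The sketch below is a crude heuristic, not a standard tool: `lintToolParameters` is a hypothetical helper, and the "shorter than three characters" rule for spotting abbreviations will produce false positives on names like `id`.

```typescript
interface ParamSpec {
  type: string;
  description?: string;
  properties?: Record<string, ParamSpec>;
}

// Returns one warning per violated practice: flat schema, full-word names, example values.
function lintToolParameters(properties: Record<string, ParamSpec>): string[] {
  const warnings: string[] = [];
  for (const [name, spec] of Object.entries(properties)) {
    if (spec.type === "object" || spec.properties) {
      warnings.push(`${name}: nested object, consider flattening`);
    }
    // Crude abbreviation check: any underscore-separated part under 3 characters.
    if (name.split("_").some((part) => part.length < 3)) {
      warnings.push(`${name}: looks abbreviated, prefer full words`);
    }
    if (!/example/i.test(spec.description ?? "")) {
      warnings.push(`${name}: description has no example value`);
    }
  }
  return warnings;
}

console.log(lintToolParameters({ dt: { type: "string" } }));
// [ 'dt: looks abbreviated, prefer full words', 'dt: description has no example value' ]
```

Running a check like this in CI catches schema drift before it shows up as misfired tool calls.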
// This schema shape produces consistent tool calls on GPT-4.1, Sonnet 4.6, and Gemini 2.5 Flash
const checkAvailabilityTool = {
  type: "function" as const,
  function: {
    name: "check_appointment_availability",
    description:
      "Check whether a specific appointment slot is open. " +
      "Call this whenever the caller asks about scheduling, rescheduling, or availability. " +
      "If the caller mentions a day without a specific time, ask for the time before calling.",
    parameters: {
      type: "object" as const,
      properties: {
        requested_date: {
          type: "string",
          description:
            "The date the caller wants, in YYYY-MM-DD format. Example: '2026-07-15'",
        },
        requested_time: {
          type: "string",
          description:
            "The time slot in HH:MM 24-hour format. Example: '14:00'",
        },
        service_type: {
          type: "string",
          enum: ["cleaning", "exam", "emergency", "consultation"],
          description:
            "Type of appointment. Infer from context if not explicitly stated.",
        },
      },
      required: ["requested_date", "requested_time", "service_type"],
    },
  },
};

export async function createAgentWithTools(complexity: Complexity) {
  return vapi.assistants.create({
    name: `intake-agent-tools-${complexity}`,
    model: {
      ...selectModel(complexity),
      tools: [checkAvailabilityTool],
      toolChoice: "auto",
    },
    voice: {
      provider: "elevenlabs" as const,
      voiceId: "21m00Tcm4TlvDq8ikWAM",
      model: "eleven_flash_v2_5",
    },
    transcriber: {
      provider: "deepgram" as const,
      model: "nova-3",
      language: "en-US",
    },
    firstMessageMode: "assistant-speaks-first",
    firstMessage: "Thanks for calling. How can I help you today?",
  });
}
Dynamic routing via Vapi webhook
Rather than hardcoding one model per assistant, you can route at the start of each call using Vapi's assistant-request webhook event. When this event fires, you receive the caller's number and any metadata attached to the call — enough to pick a model tier before the first LLM call executes.
import { Hono } from "hono";

// Assumes Complexity and createVoiceAssistant from the earlier listing are in scope.

interface VapiWebhookPayload {
  message: {
    type: string;
    call: {
      customer?: { number?: string };
      metadata?: Record<string, string>;
    };
  };
}

const app = new Hono();

// Classify complexity in <5ms; no LLM needed here
function classifyComplexity(
  callerNumber: string,
  metadata: Record<string, string>
): Complexity {
  // Known patients flagged in your PMS metadata → HIPAA-eligible model
  if (metadata.patientStatus === "existing") return "sensitive";

  // After-hours calls in Central Oregon tend toward urgent or non-routine.
  // PDT is UTC-7, so business hours roughly map to 15:00–01:00 UTC and
  // after hours is 01:00–15:00 UTC. Shift by one hour during PST (UTC-8).
  const utcHour = new Date().getUTCHours();
  const isAfterHours = utcHour >= 1 && utcHour < 15;
  if (isAfterHours) return "complex";

  return "simple";
}

app.post("/vapi/webhook", async (c) => {
  const payload = await c.req.json<VapiWebhookPayload>();
  if (payload.message.type === "assistant-request") {
    const { call } = payload.message;
    const complexity = classifyComplexity(
      call.customer?.number ?? "",
      call.metadata ?? {}
    );
    // Note: this creates a new assistant on every call. Creating one assistant
    // per tier up front and caching its ID would avoid that round-trip.
    const assistant = await createVoiceAssistant(complexity);
    return c.json({ assistantId: assistant.id });
  }
  return c.json({ received: true });
});

export default app;
Production gotchas
Reasoning mode silently wrecks latency. OpenAI exposes a reasoning_effort parameter; Anthropic uses a thinking block in the request body. If you copy config from a coding or analysis project, verify neither is set. Even reasoning_effort: "low" adds 2–4 seconds of TTFT. Log your actual TTFT per turn in production; do not rely on local testing, which rarely reproduces production prompts, context sizes, and network conditions.
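A startup guard can catch a copied-over reasoning setting before it reaches a live call. This is a sketch: the two field names are the ones mentioned above, but where they sit in the request depends on your SDK and version, so treat the flat lookup as an assumption.

```typescript
// Field names taken from the text; the exact request shape is an assumption
// and should be checked against the SDK version you deploy.
const REASONING_FIELDS = ["reasoning_effort", "thinking"];

// Throws if a voice model config carries any reasoning or thinking setting.
function assertNoReasoning(config: Record<string, unknown>): void {
  const offenders = REASONING_FIELDS.filter((key) => config[key] !== undefined);
  if (offenders.length > 0) {
    throw new Error(`Reasoning settings present in voice config: ${offenders.join(", ")}`);
  }
}

assertNoReasoning({ model: "gpt-4.1", temperature: 0.3 }); // passes silently
```

Calling this once per config at deploy time is cheaper than discovering the problem from a TTFT graph.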
Gemini regional variance is significant. Gemini 2.5 Flash on the public AI Studio endpoint can hit P95 TTFT of 1,800ms during US business hours. Vertex AI us-west1 runs closer to 900ms P95. If you benchmark on AI Studio and deploy to Vertex — or vice versa — your production latency profile will not match your test results. Pick one surface and stay on it.
Streaming mode changes your effective latency. When streaming is enabled, Vapi can begin the TTS pipeline as soon as the first non-tool-call tokens arrive from the LLM. If streaming is off, Vapi waits for the complete LLM response. Check your Vapi plan and explicitly set stream: true in any direct API calls outside Vapi's SDK. This single setting can recover 100–200ms of perceived latency per turn.
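If you run your own LLM-to-TTS hop outside Vapi, the same idea can be approximated by releasing text one sentence at a time as tokens stream in. The `SentenceChunker` class below is a hypothetical sketch, not Vapi's implementation, and its punctuation rule will split incorrectly on abbreviations such as "Dr. Smith".

```typescript
// Buffers streamed text deltas and releases complete sentences, so a TTS
// stage could start speaking before the full response has arrived.
class SentenceChunker {
  private buffer = "";

  // Add a streamed delta; returns any sentences completed by it.
  push(delta: string): string[] {
    this.buffer += delta;
    const sentences: string[] = [];
    // A sentence ends at ".", "!" or "?" followed by whitespace.
    const boundary = /[.!?]\s+/;
    let match = boundary.exec(this.buffer);
    while (match !== null) {
      sentences.push(this.buffer.slice(0, match.index + 1).trim());
      this.buffer = this.buffer.slice(match.index + match[0].length);
      match = boundary.exec(this.buffer);
    }
    return sentences;
  }

  // Call when the stream ends to release whatever text is left.
  flush(): string {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest;
  }
}

const chunker = new SentenceChunker();
console.log(chunker.push("Sure, I can")); // []
console.log(chunker.push(" help. What day")); // [ 'Sure, I can help.' ]
console.log(chunker.push(" works?")); // []
console.log(chunker.flush()); // What day works?
```

The first sentence reaches TTS while the model is still generating the second, which is where the perceived-latency saving comes from.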
Context window creep is a hidden latency tax. A 20-turn call accumulates 3,000–6,000 tokens of transcript. At turn 20, you are paying to re-process the entire call history on every LLM invocation. Implement rolling summarization at turn 10: compress turns 1–8 into a single summary message and drop the raw turn objects. This alone typically recovers 80–150ms of TTFT on long calls with any of these models.
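The rolling summarization described above can be sketched as follows. The `summarize` callback is hypothetical; in practice it might be a cheap LLM call made between turns, off the critical path. The defaults mirror the numbers in the text: trigger at turn 10, keep the last two turns, compress the rest.

```typescript
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

// Once the dialogue reaches `triggerTurns`, replace all but the most recent
// `keepRecentTurns` with a single summary message.
function compactHistory(
  history: Message[],
  summarize: (old: Message[]) => string,
  triggerTurns = 10,
  keepRecentTurns = 2
): Message[] {
  const system = history.filter((m) => m.role === "system");
  const dialogue = history.filter((m) => m.role !== "system");
  // One turn = one user message plus one assistant reply.
  const turns = Math.floor(dialogue.length / 2);
  if (turns < triggerTurns) return history;

  const cut = dialogue.length - keepRecentTurns * 2;
  const summary: Message = {
    role: "system",
    content: `Summary of earlier conversation: ${summarize(dialogue.slice(0, cut))}`,
  };
  return [...system, summary, ...dialogue.slice(cut)];
}
```

At turn 10 with the defaults, turns 1–8 collapse into one summary message and turns 9–10 stay verbatim, so the model still sees the caller's most recent wording.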
Anthropic's BAA is not automatic compliance. Signing the Business Associate Agreement with Anthropic makes your Claude-based deployment HIPAA-eligible — it does not make it HIPAA compliant. Your call transcript logging, storage, retention, and access controls must also be configured correctly. Vapi stores call transcripts on its own infrastructure by default; review Vapi's data handling agreements separately from your LLM provider's BAA.
Groq rate limits are per-model, not per-account. Default limits for Llama 3.3 70B sit around 30 RPM on standard tiers. At 50 concurrent voice calls, you will hit this ceiling and start seeing throttling errors, which in a voice pipeline manifest as TTFT spikes past Vapi's 10-second timeout. Build a fallback path that switches to GPT-4.1 mini when Groq returns throttle errors, or negotiate higher limits before launch.
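A fallback of that kind can be a thin wrapper around two completion functions. In this sketch, `Complete` stands in for whatever client call you use, and `isThrottle` assumes the thrown error exposes the HTTP status (429 is the standard "Too Many Requests" code); check how your SDK actually surfaces it.

```typescript
type Complete = (prompt: string) => Promise<string>;

// Assumed error shape: an object with a numeric `status` property.
function isThrottle(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { status?: number }).status === 429;
}

// Try the fast provider first; switch to the fallback only on throttle errors.
async function withFallback(primary: Complete, fallback: Complete, prompt: string): Promise<string> {
  try {
    return await primary(prompt);
  } catch (err) {
    if (isThrottle(err)) return fallback(prompt);
    throw err; // Anything else is a real failure and should surface.
  }
}
```

Falling back immediately, rather than retrying the throttled provider with backoff, keeps the caller from sitting through dead air while the rate-limit window resets.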
When NOT to build this yourself
Multi-model LLM routing adds real operational complexity. You now maintain three provider SDKs, three sets of API keys, three billing dashboards, and a routing layer that can fail silently. Before committing to this architecture, answer these questions honestly:
Is your call volume under 3,000 calls/month? At that volume, the cost difference between GPT-4.1 and Gemini Flash is under $200/month. That margin does not justify the engineering time to build and maintain a routing layer. Pick one model, tune your prompt, and ship.
Are your tool schemas simple? If your agent books appointments and does nothing else, cheaper models call a single flat tool reliably enough that GPT-4.1's function-calling advantage does not translate into a meaningfully better caller experience. Routing between tiers buys you little in that case.
Do you have monitoring in place? Multi-model routing fails silently in ways that single-model deployments do not. If your classifier is sending 15% of calls to the wrong tier and you cannot see that in your dashboards, the routing layer is actively degrading quality. Build routing only after you have per-turn TTFT logging and tool-call success rate tracking already running.
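As a starting point for that monitoring, per-turn logs can be rolled up by routing tier. The `TurnLog` shape below is an assumption about what you record, not a Vapi schema, and `statsByTier` is a hypothetical helper.

```typescript
type Tier = "simple" | "complex" | "sensitive";

interface TurnLog {
  tier: Tier;
  ttftMs: number;
  toolCalls: number;
  toolFailures: number;
}

interface TierStats {
  turns: number;
  meanTtftMs: number;
  toolSuccessRate: number | null; // null when the tier made no tool calls
}

// Aggregate turn count, mean TTFT, and tool-call success rate per tier.
function statsByTier(logs: TurnLog[]): Map<Tier, TierStats> {
  const out = new Map<Tier, TierStats>();
  const tiers: Tier[] = ["simple", "complex", "sensitive"];
  for (const tier of tiers) {
    const rows = logs.filter((l) => l.tier === tier);
    if (rows.length === 0) continue;
    const calls = rows.reduce((n, l) => n + l.toolCalls, 0);
    const failures = rows.reduce((n, l) => n + l.toolFailures, 0);
    out.set(tier, {
      turns: rows.length,
      meanTtftMs: rows.reduce((n, l) => n + l.ttftMs, 0) / rows.length,
      toolSuccessRate: calls === 0 ? null : (calls - failures) / calls,
    });
  }
  return out;
}
```

A tier whose share of turns or tool success rate shifts suddenly is an early hint that the classifier is misrouting calls.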
For most small businesses — medical practices, law firms, home services contractors across Central Oregon — a single well-configured GPT-4.1 agent on Vapi with a tested system prompt outperforms a poorly instrumented multi-model setup. See the complete voice agent latency optimization guide for the full pipeline tuning approach before adding LLM routing complexity.
Recommended starting configuration
For new production deployments launching in mid-2026:
- STT: Deepgram Nova-3 via Vapi built-in
- LLM default: GPT-4.1 — reliable function calling, predictable latency, 1M context window
- LLM at scale (>3,000 calls/month): Gemini 2.5 Flash on Vertex AI us-west1, after validating your tool schemas
- LLM for HIPAA intake: Claude Sonnet 4.6 — best tail-latency predictability plus Anthropic BAA coverage
- LLM for speed-critical simple paths: Groq + Llama 3.3 70B with a GPT-4.1 fallback on rate-limit errors
- TTS: ElevenLabs Flash v2.5 — sub-100ms TTS latency in isolation
- Context: Rolling summarization at turn 10, maximum 1,500 tokens of active turn history
If you are building a voice pipeline for a client rather than your own business, the multi-model routing complexity is often better managed by a team that has already worked through the production edge cases. Book a call — we build and deploy these pipelines for practices and agencies across Central Oregon and can short-circuit your LLM selection process.
Frequently asked questions
What is the best LLM for voice agents in 2026?
GPT-4.1 is the safest default for most production voice agents: reliable function calling, approximately 450ms median TTFT, and a 1M-token context window. For cost-sensitive high-volume deployments, Gemini 2.5 Flash on Vertex AI is roughly 27 times cheaper than GPT-4.1 on output tokens, with comparable quality on simple conversational workflows.
How much latency budget does the LLM have in a voice pipeline?
Roughly 200 to 700ms from when STT finalizes the transcript to when TTS can start producing audio. That window includes your LLM time to first token. Anything above 700ms TTFT consistently registers as an unnatural pause, and above 1 second reliably increases hang-up rates.
Can I use reasoning models for voice agents?
No. Reasoning and thinking modes add 8 to 200 seconds of TTFT depending on prompt complexity, making them unusable for real-time voice. Disable reasoning on any model you deploy for voice and verify it is off explicitly — some API client library defaults vary by version.
Is Groq production-ready for voice agents?
For simple workflows with a single tool call, yes. Groq delivers approximately 180ms median TTFT on Llama 3 models — the fastest inference available as of mid-2026. The limitations are tighter rate limits compared to OpenAI and Anthropic, and less consistent multi-tool function calling on open-weight models.
Which LLM should I use for HIPAA-eligible healthcare voice agents?
Claude Sonnet 4.6 is the most common choice for HIPAA-adjacent deployments. Anthropic offers a Business Associate Agreement covering Sonnet 4.6, and the model has the best tail-latency predictability among frontier providers. Note that signing the BAA makes your deployment HIPAA-eligible — your logging, storage, and access controls must also be configured correctly.
When does multi-model LLM routing make sense for voice agents?
Once your call volume exceeds roughly 3,000 calls per month and you have per-turn latency and tool-call-success monitoring in place. Below that volume, the cost savings do not justify the added complexity of maintaining multiple provider integrations and a routing layer that can fail silently.