Architecture · 2026-06-28 · 12 min read · WildRun AI Engineering

How to Test AI Voice Agents: An Evaluation Framework

A practitioner's guide to testing AI voice agents — latency measurement, LLM judging, simulated callers, and CI regression, with runnable TypeScript.

Intermediate · Tools: Vapi, ElevenLabs, Anthropic SDK, Cloudflare Workers, Hamming, Coval

You shipped a voice agent last week. The demo went well — the assistant booked appointments, handled transfers, didn't hallucinate the business hours. Then someone called and said "um, actually, can you just—" and the agent kept talking right over them, booked the wrong time, and hung up. Production failures in voice AI don't look like stack traces. They sound like silence, crosstalk, or a confused caller who just dialed a competitor.

Testing voice agents requires a framework that covers four distinct failure surfaces: audio quality and latency, conversational accuracy, behavioral edge cases, and regression over time. This guide walks through each layer with runnable TypeScript you can drop into your CI pipeline today.

Why voice agents need a dedicated test harness

Traditional API testing catches the wrong things. A unit test can verify your bookAppointment() tool returns the right JSON — but it cannot tell you whether the agent called that tool at the right moment in a real conversation, or whether the caller had already hung up 200ms before the response arrived.

Voice adds three variables that text-based agents don't have: latency compounds (ASR + LLM + TTS stack in series), audio quality degrades under real call conditions (background noise, codec artifacts, spotty connections), and turn-taking is ambiguous — end-of-utterance detection is genuinely hard. Your test harness must cover all three.

The industry has converged on a four-layer evaluation model. Each layer tests a different failure mode with a different testing method. Automate roughly 80% of evaluation — latency metrics, Word Error Rate, tool-call verification, and task completion. Reserve human review for calibrating your LLM judge scores and auditing the 10–20% of calls that land in ambiguous territory.

Layer 1: Infrastructure — latency and audio quality

Time to First Audio (TTFA) is the single metric that determines whether your agent feels natural to callers. The tolerance ceiling is 800ms end-to-end — beyond that, silence registers as a dropped call. The budget typically breaks down as: ~100ms ASR, ~300ms LLM first token, ~300ms TTS first audio chunk from ElevenLabs, and ~100ms network transit. Any production hiccup in that chain and you're over budget.
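As a rough sketch of the budget arithmetic above, the per-stage allowances can be written down once and checked against a measured call. The stage names and the `overBudgetStages` helper are illustrative assumptions, not part of any SDK.

```typescript
// Per-stage latency budget from the text, in milliseconds.
const STAGE_BUDGET_MS = { asr: 100, llm: 300, tts: 300, network: 100 } as const;

type Stage = keyof typeof STAGE_BUDGET_MS;
type StageTimings = Record<Stage, number>;

// The end-to-end ceiling is the sum of the stage budgets (800ms here).
const TOTAL_BUDGET_MS = Object.values(STAGE_BUDGET_MS).reduce(
  (sum: number, ms: number) => sum + ms,
  0
);

// Returns the stages whose measured time exceeded their allowance.
function overBudgetStages(measured: StageTimings): Stage[] {
  return (Object.keys(STAGE_BUDGET_MS) as Stage[]).filter(
    (stage) => measured[stage] > STAGE_BUDGET_MS[stage]
  );
}
```

Checking stages individually shows which one consumed the slack when the total goes over, which a single end-to-end number cannot tell you.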

Track P50, P90, P95, and P99 latency percentiles, never just averages. A P99 of 3,000ms means one in a hundred callers waits at least three seconds and assumes the call dropped. Your SLA target should be P90 under 800ms. Vapi exposes a call.analyzed webhook event with per-call timing metadata you can capture directly:

import type { VapiCallAnalyzedPayload } from "@vapi-ai/web";

interface LatencyRecord {
  callId: string;
  ttfa_ms: number;
  asr_ms: number;
  llm_ms: number;
  tts_ms: number;
  e2e_ms: number;
  timestamp: number;
}

export async function handleCallAnalyzed(
  payload: VapiCallAnalyzedPayload
): Promise<void> {
  const { call } = payload;
  const timings = call.artifact?.timings;
  if (!timings) return;

  const record: LatencyRecord = {
    callId: call.id,
    ttfa_ms: timings.firstSpeakMs ?? 0,
    asr_ms: timings.asrDurationMs ?? 0,
    llm_ms: timings.llmFirstTokenMs ?? 0,
    tts_ms: timings.ttsFirstAudioMs ?? 0,
    e2e_ms: timings.firstSpeakMs ?? 0,
    timestamp: Date.now(),
  };

  if (record.e2e_ms > 800) {
    console.warn(`TTFA breach: ${record.callId} — ${record.e2e_ms}ms`);
  }

  await fetch("/api/metrics/latency", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(record),
  });
}
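To turn stored latency records into the percentiles discussed above, a nearest-rank calculation is usually enough. This is a minimal sketch; `percentile` and `summarizeLatency` are hypothetical helper names, and the input is assumed to be an array of end-to-end latencies in milliseconds.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing to keep the rank computation in integers where possible.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarizes a batch of end-to-end latencies; the mean is included only for comparison.
function summarizeLatency(e2eMs: number[]) {
  return {
    p50: percentile(e2eMs, 50),
    p90: percentile(e2eMs, 90),
    p95: percentile(e2eMs, 95),
    p99: percentile(e2eMs, 99),
    mean: e2eMs.reduce((sum, v) => sum + v, 0) / e2eMs.length,
  };
}
```

Reporting the mean next to the percentiles makes it easy to see how much a long tail is being hidden by the average.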

For audio quality, the ITU-T P.800 standard uses Mean Opinion Score (MOS) — blinded listeners rate naturalness on a 1–5 scale across at least 30 calls. Near-human TTS averages MOS 4.3–4.5 under clean conditions. Word Error Rate for ASR should stay below 5% on clean audio; budget up to 15% for call-center audio with background noise. Leading ASR models report error rates of up to 17.7% on noisy call-center recordings, so test with audio from your actual deployment environment, not a quiet studio.
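Word Error Rate is the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The sketch below assumes English text and a simple tokenizer that lowercases and strips punctuation; `wordErrorRate` is an illustrative helper, and a production pipeline would likely apply heavier text normalization (numbers, contractions) first.

```typescript
// WER = (substitutions + deletions + insertions) / reference word count.
function wordErrorRate(reference: string, hypothesis: string): number {
  const tokenize = (s: string): string[] =>
    s
      .toLowerCase()
      .replace(/[^a-z0-9'\s]/g, "")
      .split(/\s+/)
      .filter(Boolean);

  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) throw new Error("reference transcript is empty");

  // Rolling-row Levenshtein distance over words.
  // prev[j] is the distance between the reference prefix so far and the first j hypothesis words.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, so treat it as a ratio rather than a bounded percentage.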

Layer 2: Accuracy — ASR, NLU, and tool-call verification

Accuracy testing answers: did the agent understand the caller, and did it take the right action? The two failure modes are orthogonal — good transcription with bad intent classification, or bad transcription that the LLM self-corrected. Test them separately, or you won't know which layer is breaking.

For tool-call verification, define expected outcomes as a test spec, then run an LLM-as-judge over the actual transcript. The Anthropic SDK works well here — claude-sonnet-4-6 is fast enough for synchronous eval pipelines without blowing your cost budget:

import Anthropic from "@anthropic-ai/sdk";

interface TranscriptTurn {
  role: "user" | "assistant";
  content: string;
  toolCalls?: { name: string; arguments: Record<string, unknown> }[];
}

interface TestCase {
  callId: string;
  transcript: TranscriptTurn[];
  expectedTools: string[];
}

interface JudgeResult {
  callId: string;
  score: number;
  toolCallsMatched: boolean;
  reasoning: string;
}

const client = new Anthropic();

export async function judgeConversation(test: TestCase): Promise<JudgeResult> {
  const actualTools = test.transcript
    .flatMap((t) => t.toolCalls ?? [])
    .map((tc) => tc.name);

  const toolCallsMatched = test.expectedTools.every((name) =>
    actualTools.includes(name)
  );

  const transcriptText = test.transcript
    .map((t) => `${t.role.toUpperCase()}: ${t.content}`)
    .join("\n");

  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 256,
    temperature: 0, // reduce run-to-run variance in judge scores
    messages: [
      {
        role: "user",
        content: `You are an expert voice AI evaluator. Score this conversation 1-5 on task completion accuracy.

TRANSCRIPT:
${transcriptText}

EXPECTED TOOL CALLS: ${test.expectedTools.join(", ")}
ACTUAL TOOL CALLS: ${actualTools.join(", ")}

Respond with JSON only: {"score": <1-5>, "reasoning": "<one sentence>"}`,
      },
    ],
  });

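  const first = response.content[0];
  const text = first?.type === "text" ? first.text : "";
  // Judge output is not guaranteed to be valid JSON. Record a zero score
  // with the raw text instead of throwing in the middle of a suite run.
  let parsed: { score: number; reasoning: string };
  try {
    parsed = JSON.parse(text);
  } catch {
    parsed = { score: 0, reasoning: `Unparseable judge output: ${text}` };
  }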

  return {
    callId: test.callId,
    score: parsed.score,
    toolCallsMatched,
    reasoning: parsed.reasoning,
  };
}

Set a calibration baseline before you rely on judge scores. Run 50 golden calls (calls your team has manually scored) through the judge and record the agreement rate. Aim for greater than 85% agreement between judge and human on pass/fail decisions. A drift of more than 2% in task completion rate across 50+ calls is your signal that a prompt change or model update has shifted accuracy. Consult your LLM provider's benchmark data when switching judge models, because scoring rubric shifts are invisible unless you calibrate against ground truth first.
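The calibration check described above reduces to a small calculation over the golden set. In this sketch the pass cutoff (a score of 4 or higher) is an assumption, as are the `GoldenCall` shape and the helper names; use whatever cutoff your team applies when scoring by hand.

```typescript
interface GoldenCall {
  callId: string;
  humanScore: number; // 1-5, assigned by your team
  judgeScore: number; // 1-5, returned by the LLM judge
}

// Assumption: scores of 4 and 5 count as a pass.
const PASS_THRESHOLD = 4;

// Share of golden calls where judge and human agree on pass/fail.
function agreementRate(
  calls: GoldenCall[],
  passThreshold: number = PASS_THRESHOLD
): number {
  if (calls.length === 0) throw new Error("no golden calls");
  const agreed = calls.filter(
    (c) => c.humanScore >= passThreshold === c.judgeScore >= passThreshold
  ).length;
  return agreed / calls.length;
}

// The text's target: strictly greater than 85% agreement.
function isJudgeCalibrated(calls: GoldenCall[], minAgreement = 0.85): boolean {
  return agreementRate(calls) > minAgreement;
}
```

Comparing pass/fail decisions rather than raw scores keeps the check tolerant of one-point disagreements that do not change the outcome.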

Layer 3: Behavioral testing — simulated callers and edge cases

This is the layer most teams skip and then regret. Behavioral testing means simulating real caller patterns: hesitations, interruptions, topic changes mid-sentence, very short utterances like "yeah" or "uh-huh," and frustrated callers who repeat themselves. Vapi's Evals framework (launched 2025) lets you define a conversation as structured data and run it against your live agent without consuming real call minutes:

import Vapi from "@vapi-ai/server-sdk";

interface ConversationTurn {
  callerSays: string;
  waitMs?: number;
}

interface Assertion {
  type: "tool_called" | "contains_text" | "not_contains";
  value: string;
}

interface ConversationScript {
  agentId: string;
  turns: ConversationTurn[];
  assertions: Assertion[];
}

interface TestResult {
  passed: boolean;
  failures: string[];
  transcript: string[];
  durationMs: number;
}

export async function runSimulatedCall(
  script: ConversationScript,
  apiKey: string
): Promise<TestResult> {
  const vapi = new Vapi({ token: apiKey });
  const transcript: string[] = [];
  const toolsCalled: string[] = [];
  const failures: string[] = [];
  const start = Date.now();

  // Chat mode: text-only, no audio codec overhead — ideal for CI
  const chat = await vapi.calls.createChat({
    assistantId: script.agentId,
  });

  for (const turn of script.turns) {
    if (turn.waitMs) await new Promise((r) => setTimeout(r, turn.waitMs));

    const response = await vapi.calls.sendMessage(chat.id, {
      message: { role: "user", content: turn.callerSays },
    });

    transcript.push(`USER: ${turn.callerSays}`);
    transcript.push(`AGENT: ${response.message?.content ?? ""}`);

    if (response.toolCalls) {
      toolsCalled.push(...response.toolCalls.map((tc) => tc.function.name));
    }
  }

  for (const assertion of script.assertions) {
    const needle = assertion.value.toLowerCase();
    const found = transcript.some((line) =>
      line.toLowerCase().includes(needle)
    );
    if (
      assertion.type === "tool_called" &&
      !toolsCalled.includes(assertion.value)
    ) {
      failures.push(`Expected tool "${assertion.value}" was not called`);
    }
    if (assertion.type === "contains_text" && !found) {
      failures.push(`Expected phrase "${assertion.value}" not found`);
    }
    // The Assertion type declares "not_contains", so handle it as well.
    if (assertion.type === "not_contains" && found) {
      failures.push(`Forbidden phrase "${assertion.value}" was found`);
    }
  }

  return {
    passed: failures.length === 0,
    failures,
    transcript,
    durationMs: Date.now() - start,
  };
}

Build a minimum caller persona matrix before you ship any agent to production:

Fluent speaker — clear, complete sentences. Your happy path. If this breaks, nothing else matters.
Hesitant speaker — "So I was thinking... um... maybe... Tuesday?" Tests VAD threshold tuning and barge-in recovery.
Frustrated caller — repeats themselves, explicitly asks to speak to a human. Tests the escalation path. The most common complaint in production voice AI is agents that don't escalate when they should.
Out-of-scope caller — asks something the agent is not designed to handle. Tests graceful deflection vs. silent failure.

Run the hesitant speaker persona specifically against your VAD configuration. The default silence threshold on most platforms is 500ms, which cuts off callers who pause mid-thought. Tune to 700–900ms for conversational agents — then test that the adjusted threshold doesn't make the agent feel unresponsive on short, definitive utterances like "yes" or "cancel it."
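The four personas above can be captured as data, so each one becomes a repeatable script. This is a sketch only: the tool names (`bookAppointment`, `transferToHuman`) and the asserted phrases are placeholders for your own agent's tools and wording. Text-mode scripts like these exercise conversational logic; VAD thresholds still need voice-mode runs.

```typescript
interface ConversationTurn {
  callerSays: string;
  waitMs?: number;
}

interface Assertion {
  type: "tool_called" | "contains_text" | "not_contains";
  value: string;
}

interface PersonaScript {
  persona: "fluent" | "hesitant" | "frustrated" | "out_of_scope";
  turns: ConversationTurn[];
  assertions: Assertion[];
}

// Placeholder tool names and phrases; substitute your agent's own.
const personaMatrix: PersonaScript[] = [
  {
    persona: "fluent",
    turns: [{ callerSays: "I'd like to book an appointment for Tuesday at 3pm." }],
    assertions: [{ type: "tool_called", value: "bookAppointment" }],
  },
  {
    persona: "hesitant",
    turns: [
      { callerSays: "So I was thinking... um..." },
      { callerSays: "maybe... Tuesday?", waitMs: 800 },
    ],
    assertions: [{ type: "contains_text", value: "what time" }],
  },
  {
    persona: "frustrated",
    turns: [
      { callerSays: "I already told you, Tuesday." },
      { callerSays: "I already told you, Tuesday. Let me speak to a human." },
    ],
    assertions: [{ type: "tool_called", value: "transferToHuman" }],
  },
  {
    persona: "out_of_scope",
    turns: [{ callerSays: "Can you help me file my taxes?" }],
    assertions: [{ type: "not_contains", value: "booked" }],
  },
];
```

Keeping personas as plain data means the same matrix can feed both a chat-mode runner in CI and a voice-mode run when you are tuning turn-taking.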

Layer 4: Regression suite and CI/CD integration

A voice agent that passes all tests today can fail silently after a prompt edit, a model version bump, or a third-party dependency update. Regression testing means running your full eval suite on every deploy and failing the build when metrics drift outside tolerance bands: latency ±10%, task completion rate ±3%, WER ±2%.

interface BaselineMetrics {
  p90LatencyMs: number;
  taskCompletionRate: number;
}

interface RegressionReport {
  passed: boolean;
  details: Record<
    string,
    { baseline: number; current: number; deltaPercent: number; ok: boolean }
  >;
}

const TOLERANCE = { latency: 0.1, taskCompletion: 0.03 };

export async function runRegressionSuite(
  agentId: string,
  baseline: BaselineMetrics,
  testCases: ConversationScript[]
): Promise<RegressionReport> {
  console.log(`Running ${testCases.length} regression tests against ${agentId}...`);

  // Serial execution with delay to avoid WebSocket connection churn
  const results: TestResult[] = [];
  for (const tc of testCases) {
    results.push(await runSimulatedCall(tc, process.env.VAPI_API_KEY!));
    await new Promise((r) => setTimeout(r, 500));
  }

  const passedCount = results.filter((r) => r.passed).length;
  const currentCompletion = passedCount / testCases.length;
  const currentP90 = await fetchP90Latency(agentId);

  const report: RegressionReport = {
    passed: true,
    details: {
      taskCompletion: {
        baseline: baseline.taskCompletionRate,
        current: currentCompletion,
        deltaPercent:
          ((currentCompletion - baseline.taskCompletionRate) /
            baseline.taskCompletionRate) *
          100,
        ok:
          Math.abs(currentCompletion - baseline.taskCompletionRate) <=
          TOLERANCE.taskCompletion,
      },
      p90Latency: {
        baseline: baseline.p90LatencyMs,
        current: currentP90,
        deltaPercent:
          ((currentP90 - baseline.p90LatencyMs) / baseline.p90LatencyMs) * 100,
        ok:
          Math.abs(currentP90 - baseline.p90LatencyMs) /
            baseline.p90LatencyMs <=
          TOLERANCE.latency,
      },
    },
  };

  const failedDimensions = Object.entries(report.details).filter(
    ([, v]) => !v.ok
  );
  if (failedDimensions.length > 0) {
    report.passed = false;
    console.error(
      "Regression failures:",
      failedDimensions
        .map(([k, v]) => `${k}: ${v.deltaPercent.toFixed(1)}% drift`)
        .join(", ")
    );
    process.exit(1);
  }

  return report;
}

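async function fetchP90Latency(agentId: string): Promise<number> {
  // fetch in Node needs an absolute URL. METRICS_BASE_URL is an assumed
  // environment variable pointing at your metrics service.
  const url = new URL("/api/metrics/latency/p90", process.env.METRICS_BASE_URL);
  url.searchParams.set("agentId", agentId);
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Metrics request failed: ${res.status}`);
  const data = (await res.json()) as { p90Ms: number };
  return data.p90Ms;
}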

Wire this runner as a CI step after deploy and before traffic is shifted. If it exits non-zero, roll back. One important constraint: Cloudflare Workers has a 30-second CPU limit per request. Your regression orchestration, which runs many simulated calls one after another, should live in a long-running Node.js or Deno process, not a Worker. The production agent itself runs fine on Workers; the test harness cannot. For more on how to structure the underlying agent stack, see building a voice agent with Vapi from scratch.
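The tolerance bands named at the top of this section (latency ±10%, task completion ±3%, WER ±2%) can also be expressed as a small table-driven check. In this sketch latency is treated as a relative band and the two rates as absolute bands, mirroring the runner above; treating WER as an absolute band is an assumption, and the metric names are illustrative.

```typescript
interface Band {
  kind: "relative" | "absolute";
  tolerance: number;
}

// Assumed interpretation: latency drifts relative to baseline, rates drift in absolute points.
const BANDS: Record<string, Band> = {
  p90LatencyMs: { kind: "relative", tolerance: 0.1 },
  taskCompletionRate: { kind: "absolute", tolerance: 0.03 },
  wer: { kind: "absolute", tolerance: 0.02 },
};

// Returns the names of metrics that moved outside their band.
function driftedMetrics(
  baseline: Record<string, number>,
  current: Record<string, number>
): string[] {
  return Object.keys(BANDS).filter((name) => {
    const band = BANDS[name];
    const delta = Math.abs(current[name] - baseline[name]);
    const drift = band.kind === "relative" ? delta / baseline[name] : delta;
    return drift > band.tolerance;
  });
}
```

A CI step could fail the build whenever `driftedMetrics` returns a non-empty list, and print the names so the failing dimension is obvious in the log.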

Load testing your voice agent

Load testing is distinct from regression testing. It answers: what happens when 50 callers reach this agent simultaneously? Most voice agents fail in two ways under concurrent load: the LLM provider starts rate-limiting (watch your 429 error rate in logs), and WebSocket connections to the telephony provider start dropping during rapid connection churn.

Establish your baseline metrics under single-caller conditions first. Then ramp to 10× concurrent calls and track where TTFA degrades. Third-party API rate limits often become your bottleneck under load, not your own infrastructure. If your agent makes a CRM lookup on every call and your CRM allows 10 requests per second, 10 calls starting in the same second already consume the whole quota, and any further lookup is throttled before a voice infrastructure constraint kicks in. Test with realistic tool-call volumes and patterns, not just raw audio throughput.

Use chat mode for initial load testing to isolate LLM and tool-call performance from audio pipeline variables. Switch to actual voice calls only once the text-layer load tests pass cleanly — that way you know exactly which layer any degradation comes from.
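A ramp like the one described above can be sketched as a loop over concurrency steps, recording the share of rate-limited calls and the worst TTFA at each step. `CallFn` is a hypothetical signature for whatever function starts one simulated chat-mode call in your harness; nothing here is a Vapi API.

```typescript
interface CallOutcome {
  status: number; // HTTP-style status from the call attempt, e.g. 200 or 429
  ttfaMs: number; // time to first response for successful calls
}

// Hypothetical: starts one simulated call and resolves with its outcome.
type CallFn = () => Promise<CallOutcome>;

interface StepSummary {
  concurrency: number;
  rateLimitedShare: number;
  worstTtfaMs: number;
}

// Pure summary of one concurrency step.
function summarizeStep(concurrency: number, outcomes: CallOutcome[]): StepSummary {
  const rateLimited = outcomes.filter((o) => o.status === 429).length;
  const served = outcomes.filter((o) => o.status !== 429);
  return {
    concurrency,
    rateLimitedShare: outcomes.length ? rateLimited / outcomes.length : 0,
    worstTtfaMs: served.length ? Math.max(...served.map((o) => o.ttfaMs)) : 0,
  };
}

// Runs each step's calls in parallel, then moves to the next concurrency level.
async function rampLoad(callAgent: CallFn, steps: number[]): Promise<StepSummary[]> {
  const summaries: StepSummary[] = [];
  for (const concurrency of steps) {
    const outcomes = await Promise.all(
      Array.from({ length: concurrency }, () => callAgent())
    );
    summaries.push(summarizeStep(concurrency, outcomes));
  }
  return summaries;
}
```

Calling `rampLoad(callAgent, [1, 5, 10, 25, 50])` would produce one summary per step, so the concurrency level where 429s first appear or TTFA climbs is visible at a glance.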

Production gotchas

End-of-utterance detection tuning. The default silence threshold on most platforms is 500ms — too aggressive for natural conversational speech with mid-thought pauses. Callers who think aloud ("so I was thinking... maybe Tuesday?") get cut off mid-sentence. Tune your VAD threshold to 700–900ms for conversational agents. Too high and the agent feels slow; too low and it interrupts constantly. Test both extremes explicitly before settling on a value, and document which threshold you're running so future deploys don't silently revert it.

Transcript timing normalization. ASR providers return transcripts with word-level timestamps that are relative to the audio chunk start time, not the wall clock. When you correlate transcripts with Vapi webhook events in your eval framework, normalize everything to Unix milliseconds at ingest time. Miss this and your latency calculations will be off by the chunk buffer duration — typically 200–400ms — making P90 numbers look better than reality.
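A minimal sketch of that normalization, assuming you record the wall-clock time at which each audio chunk started (`chunkStartUnixMs` is a hypothetical field you would capture at ingest, and the interfaces are illustrative rather than any provider's schema):

```typescript
interface AsrWord {
  text: string;
  startMs: number; // relative to the start of the audio chunk
  endMs: number;
}

interface AsrChunk {
  chunkStartUnixMs: number; // wall-clock time the chunk began, recorded at ingest
  words: AsrWord[];
}

interface NormalizedWord {
  text: string;
  startUnixMs: number;
  endUnixMs: number;
}

// Shift chunk-relative word timestamps onto the wall clock.
function normalizeChunk(chunk: AsrChunk): NormalizedWord[] {
  return chunk.words.map((w) => ({
    text: w.text,
    startUnixMs: chunk.chunkStartUnixMs + w.startMs,
    endUnixMs: chunk.chunkStartUnixMs + w.endMs,
  }));
}

// TTFA on a single clock: agent's first audio minus the end of the caller's last word.
function ttfaFromChunk(callerChunk: AsrChunk, agentFirstAudioUnixMs: number): number {
  const words = normalizeChunk(callerChunk);
  if (words.length === 0) throw new Error("chunk has no words");
  return agentFirstAudioUnixMs - words[words.length - 1].endUnixMs;
}
```

Doing the shift once at ingest means every downstream comparison with webhook timestamps happens on the same clock.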

LLM judge calibration drift. Your LLM judge is itself an LLM. When the provider rolls a silent model update, the judge's implicit scoring rubric shifts. A judge that rated 4/5 for a given call quality in January may rate the same quality 3/5 in June — not because performance changed but because the model's internal calibration shifted. Run a calibration pass on a fixed set of 50 golden calls any time you change the judge model, and log which model version produced which scores so you can detect retrospective drift.

Vapi's 10-second first-response budget. If your agent makes tool calls before speaking its first word — CRM lookup, calendar availability check, knowledge base retrieval — you have a hard 10-second window before Vapi treats the call as stalled and disconnects. In practice, all pre-speech tool calls need to complete within 6–7 seconds to leave buffer for LLM inference and TTS startup. Test this explicitly in your Layer 1 eval. Discover it there, not when a slow CRM query kills your first impression under real traffic.
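One way to test the pre-speech budget explicitly is to wrap the tool calls in a timeout and run independent lookups in parallel, so the budget covers the slowest call and not the sum. This is a sketch under assumptions: `withBudget` and `runPreSpeechTools` are hypothetical helpers, and the 6,000ms default takes the lower end of the 6 to 7 second range above.

```typescript
// Rejects if `work` has not settled within `budgetMs`.
function withBudget<T>(work: Promise<T>, budgetMs: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} exceeded ${budgetMs}ms`)),
      budgetMs
    );
  });
  // Clear the timer on either outcome so it does not keep the process alive.
  return Promise.race([work, timeout]).then(
    (value) => {
      clearTimeout(timer);
      return value;
    },
    (err) => {
      clearTimeout(timer);
      throw err;
    }
  );
}

// Assumed budget for everything that must finish before the agent's first word.
const PRE_SPEECH_BUDGET_MS = 6000;

// Runs independent lookups in parallel under one shared budget.
function runPreSpeechTools<T>(
  tools: Array<() => Promise<T>>,
  budgetMs: number = PRE_SPEECH_BUDGET_MS
): Promise<T[]> {
  return withBudget(
    Promise.all(tools.map((tool) => tool())),
    budgetMs,
    "pre-speech tool calls"
  );
}
```

In an eval run, a rejected promise here is the failure you want to see: it means a slow dependency would have pushed the first response past the window.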

Audio codec mismatches. Vapi uses Opus codec by default. If you're recording calls for your eval corpus and your transcription service expects PCM or μ-law audio, the silent conversion step adds latency and potential quality loss you won't see in local testing. Nail down your full audio pipeline — codec, sample rate, channel count — before building eval infrastructure around recorded call data.

WebSocket connection churn in CI. Simulated caller tests create real WebSocket connections to Vapi. Many tests running in rapid succession create a connection open/close cycle rate that can hit provider-side limits faster than production traffic does (production calls are spread over time). Add at least 500ms between test call teardown and the next call initiation. For large suites of 50+ test cases, use Vapi's chat testing mode rather than voice mode to avoid audio WebSocket limits entirely.

When NOT to build this yourself

A full eval harness is a 2–4 week engineering investment. Before you start, check whether the ROI justifies it for your situation.

If you're running fewer than 500 calls per month, manual spot-checking of 10% of calls gives you most of the quality signal at a fraction of the cost. Dedicated platforms like Hamming and Coval offer turnkey eval infrastructure — voice simulation, scoring dashboards, CI integration — that would take weeks to rebuild. At low call volumes, buying beats building.

If your agent handles compliance-sensitive conversations — HIPAA, legal intake, financial advice — don't route call recordings through a DIY eval pipeline before reviewing your data handling obligations. Eval infrastructure that stores full call audio and transcripts is subject to the same compliance requirements as your production system. The compliance review alone can take longer than building the harness, and getting it wrong is a more expensive problem than slow testing.

If you're deploying for a single client rather than a multi-tenant platform, a lightweight monitoring setup — call recordings, manual QA on flagged calls, latency alerts — gets you to the same quality bar with far less engineering overhead. Build the full harness when you're maintaining 5+ distinct agent deployments. That's when the fixed investment amortizes quickly across deployments.

If your primary concern is voice naturalness rather than task accuracy, no automated framework fully replaces human listening panels. MOS scores are useful averages but they hide specific scenarios where synthetic voice sounds off — laughter, surprise, domain-specific technical terms, or emotional subtext. Budget for human eval even if you automate everything else.

If you're building voice agents for client deployments and want a second opinion on what eval infrastructure makes sense for your scale, book a call with our engineering team.

Architecture

Caller Simulator --> Vapi WebRTC --> LLM Backend --> Tool APIs
     |                  |               |              |
     |              ASR/Audio       Response        CRM/DB
     |                  |               |              |
     +---------> Eval Harness <-----------------------+
                     |
           +---------+-----------+
      Latency Tracker       LLM Judge
      (P50/P90/P99)         (Score 1-5)
           |                     |
           +---------+-----------+
                     |
              Regression CI
            (pass/fail gate)

Frequently asked questions

What is TTFA and why does it matter more than average latency?

TTFA (Time to First Audio) is the elapsed time from when the caller stops speaking to when your agent plays its first audio chunk. It matters more than average latency because callers experience silence as failure — a gap over 800ms registers as a dropped call. Averages mask long tails; always track P90 and P99.

How many test cases do I need for a reliable regression suite?

A minimum viable regression suite for a single-purpose agent covers the happy path, two edge cases per major flow branch, and all four caller personas: roughly 20–30 test cases. Scale to 50+ once you have production call data to mine for real failure patterns.

Can I use Vapi chat mode for all testing instead of voice mode?

For regression testing, accuracy scoring, and load testing, yes — chat mode is faster, cheaper, and isolates logic from audio variables. Use voice mode only when testing audio-specific behaviors: barge-in handling, VAD threshold tuning, MOS scoring, or TTS voice quality evaluation.

How do I prevent LLM judge scores from drifting over time?

Set temperature=0 on judge model calls to minimize variance. Maintain a fixed set of 50 golden calls with human-verified ground truth, and re-run them after any judge model change. For borderline cases scoring 2 or 3 out of 5, run the judge three times and take the median score.
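The rerun-and-take-the-median rule can be sketched as follows. `judgeOnce` is a hypothetical function that returns a single 1-5 score for a call, for example a thin wrapper around your judge prompt.

```typescript
// Median of a non-empty list of scores.
function median(scores: number[]): number {
  if (scores.length === 0) throw new Error("no scores");
  const sorted = [...scores].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Scores once; only borderline results (2 or 3) trigger extra judge runs.
async function stableScore(
  judgeOnce: () => Promise<number>,
  runs = 3
): Promise<number> {
  const first = await judgeOnce();
  if (first !== 2 && first !== 3) return first;
  const scores = [first];
  while (scores.length < runs) scores.push(await judgeOnce());
  return median(scores);
}
```

Limiting reruns to borderline scores keeps the extra judge cost proportional to the ambiguous calls, not to the whole suite.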

What should I do when the 10-second first-response budget is too tight for my tool calls?

Pre-fetch data before the call connects using the phone.ringing webhook event — you get a 1-2 second head start on CRM lookups triggered by the caller phone number. Cache frequently-needed data at agent startup rather than fetching it per call.

How do I test barge-in handling without real callers?

Use Vapi voice testing mode with a simulated caller agent configured to interrupt at a fixed timing offset — 1 second after the agent starts speaking. Assert that the agent stops its current utterance, processes the interruption transcript, and responds to the new input rather than continuing its original response.
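If your harness records a timeline of speech events for each test call, the first of those assertions (the agent stops its current utterance) can be checked mechanically. The event shape below is a hypothetical log format, not a Vapi payload, and the 500ms stop tolerance is an assumption to tune for your agent.

```typescript
type CallEvent =
  | { type: "agent_speech_start"; atMs: number }
  | { type: "caller_speech_start"; atMs: number }
  | { type: "agent_speech_stop"; atMs: number };

// Assumed tolerance for how quickly the agent should go quiet once interrupted.
const MAX_STOP_DELAY_MS = 500;

// Returns a list of failures; an empty list means the barge-in was handled in time.
function checkBargeIn(
  events: CallEvent[],
  maxStopDelayMs: number = MAX_STOP_DELAY_MS
): string[] {
  const agentStart = events.find((e) => e.type === "agent_speech_start");
  if (!agentStart) return ["Agent never started speaking"];

  const interrupt = events.find(
    (e) => e.type === "caller_speech_start" && e.atMs > agentStart.atMs
  );
  if (!interrupt) return ["No caller interruption found in the event log"];

  const stop = events.find(
    (e) => e.type === "agent_speech_stop" && e.atMs >= interrupt.atMs
  );
  if (!stop) return ["Agent never stopped speaking after the interruption"];

  const delay = stop.atMs - interrupt.atMs;
  return delay > maxStopDelayMs ? [`Agent took ${delay}ms to stop`] : [];
}
```

Whether the agent then responds to the new input is a content question, which fits better with transcript assertions or the LLM judge than with a timing check.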

Written by

Thom Wilson

Founder & AI Engineer, Wild Run AI

SEO consultant turned AI engineer. Built WildRun after years getting small businesses found online — custom AI voice agents, sales and operations automation, and AI-era SEO, deployed on Cloudflare and managed end-to-end.

Last reviewed: June 2026
