Prompting · 2026-06-19 · 12 min read · WildRun AI Engineering

Prompt Engineering for Voice AI Agents: Complete Guide

Practical prompt engineering for voice AI agents: control response latency, prevent hallucinations, handle interruptions, and wire tool calls in Vapi.

Intermediate · Tools: Vapi, ElevenLabs, Cloudflare Workers, Anthropic SDK

The single biggest reason voice AI agents fail in production has nothing to do with TTS quality or which LLM you pick. It is the system prompt. Most teams write a detailed chatbot prompt, drop it into their Vapi assistant, and wonder why callers hang up after 30 seconds. Voice conversations operate under completely different constraints than text chat, and your prompt engineering has to reflect that from the ground up.

This guide covers the specific techniques that make voice system prompts work: controlling response length for streaming TTS, handling barge-in and interruptions, grounding the model to prevent hallucinations, and wiring tool calls so the agent knows when to transfer or end the call. All examples use TypeScript and Vapi with ElevenLabs for voice synthesis.

Why text prompts break in voice

A well-crafted chatbot system prompt often produces terrible voice agent behavior because the failure modes are invisible during text testing. Three things break immediately when you move to phone calls.

Markdown renders as noise. When your prompt does not explicitly forbid it, the LLM generates responses with bold text, bullet dashes, and numbered lists. The TTS engine reads every character literally -- "asterisk asterisk key differences asterisk asterisk" -- and callers hear that in the first 10 seconds.

LLMs default to long-form answers. Without explicit length constraints, a capable LLM produces a 150-word answer to a simple question about business hours. At an average speaking rate of 130 words per minute, that is a monologue of roughly 70 seconds. Callers disengage or hang up well before the agent finishes its first turn.

Turn-taking is not explicit. In text chat, the UI enforces conversational turns. In voice, the caller needs an explicit cue that the agent has finished speaking and is waiting for them. Prompts that do not instruct the agent to always end with a question leave callers in silence, uncertain whether the agent is still processing.
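These three failure modes can be caught in a test suite before any call is placed. The sketch below is a hypothetical helper, not part of Vapi: `lintVoiceResponse`, its result shape, and the markdown pattern are all assumptions made for illustration, and the 130 words-per-minute rate is the figure used in this guide.

```typescript
// Hypothetical helper: flags the three text-to-voice failure modes on a candidate response.
interface VoiceLintResult {
  hasMarkdown: boolean;      // bold markers, backticks, list dashes, numbered items, headings
  estimatedSeconds: number;  // approximate speaking time at WORDS_PER_MINUTE
  endsWithQuestion: boolean; // explicit turn-taking cue for the caller
}

// Assumed average speaking rate (the figure used in this guide).
const WORDS_PER_MINUTE = 130;

function lintVoiceResponse(text: string): VoiceLintResult {
  // A deliberately simple pattern; it will not catch every markdown construct.
  const markdownPattern = /(\*\*|__|`|^\s*[-*]\s+|^\s*\d+\.\s+|^\s*#+\s+)/m;
  const wordCount = text.trim().split(/\s+/).filter(Boolean).length;
  return {
    hasMarkdown: markdownPattern.test(text),
    estimatedSeconds: Math.round((wordCount / WORDS_PER_MINUTE) * 60),
    endsWithQuestion: text.trim().endsWith("?"),
  };
}
```

Running sample model outputs through a check like this during prompt iteration surfaces regressions that are invisible when you only read the transcript.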

Anatomy of a voice system prompt

Every effective voice system prompt has five sections, in this order. Each serves a specific purpose -- do not merge them or skip them. Vapi injects additional context around your prompt, and the positioning of each section affects how reliably the model follows the instructions.

1. Identity -- Two to three sentences. Who the agent is, which business it represents, and the caller context. Keep it tight: Vapi injects call metadata alongside your prompt, so avoid repetition.

2. Scope -- Explicit list of what the agent can and cannot help with. This is your primary anti-hallucination boundary. Anything not listed here should trigger an escalation, not a guess.

3. Speaking style rules -- Hard constraints on formatting and response length. No markdown, no lists, two sentences maximum per turn, always end with a question. Negative constraints are easier for the model to follow consistently than positive ones.

4. Knowledge block -- Static facts: business hours, pricing, FAQs. Keep this under 2,000 tokens. Beyond that threshold, first-response latency increases as the model processes a larger context before generating its first token.

5. Tool instructions -- When to invoke transferCall, endCall, or custom functions. Describe trigger conditions in plain language, not just function signatures. A model that knows only that it can transfer calls will do so inconsistently; explicit trigger conditions make the behavior reliable.

const SYSTEM_PROMPT = `
## Identity
You are Jordan, a scheduling assistant for Cascade Physical Therapy in Bend, Oregon.
You handle inbound calls during business hours (Mon-Fri, 8am-6pm Pacific).

## Scope
You CAN: schedule appointments, confirm existing appointments, answer questions
about services and the insurance plans we accept, and transfer urgent calls to staff.
You CANNOT: provide medical advice, discuss billing disputes, or access patient records.
If asked about anything outside this scope, say a staff member will follow up
and offer to transfer the call.

## Speaking style
- Never use bullet points, numbered lists, bold, italics, or any markdown formatting.
- Keep every response to two sentences or fewer.
- Always end your response with a direct question to keep the conversation moving.
- Do not say "Great!", "Absolutely!", or "Of course!" - they sound scripted on a phone call.
- Speak phone numbers as individual digits: "five-four-one, five-five-five, one-two-three-four".

## Knowledge
Business hours: Monday through Friday, 8am to 6pm Pacific.
Address: 123 Main Street, Bend, Oregon.
Insurance accepted: Blue Cross Blue Shield, Providence Health, PacificSource, Medicare.
New patient appointments: typically 60 minutes. Most plans require a referral.

## Tool use
- Use transferCall when: the caller asks to speak with a person, mentions a medical
  emergency, or the conversation requires records you cannot access.
- Use endCall after: the caller confirms their appointment is booked, or explicitly
  says goodbye. Do not end the call while any question is unanswered.
`.trim();

Controlling response length for streaming TTS

Vapi operates with roughly a 10-second first-response budget -- the time between when the caller finishes speaking and when the agent starts speaking -- before callers perceive a significant hang. That window spans STT processing (approximately 150-200ms for Deepgram), LLM first-token generation, and TTS synthesis startup. ElevenLabs Turbo v2.5 begins streaming audio at approximately 200ms after receiving text; the standard multilingual v2 model averages closer to 300ms.

Two model settings directly control whether you stay inside this budget. First, set temperature to 0.1-0.2 for voice agents. At temperature 0.7, the same "two sentences max" instruction randomly produces four or five sentence responses -- the instruction competes with stochastic sampling and loses unpredictably. Second, set maxTokens to 150. That is approximately 115 spoken words, enough for a complete helpful response while imposing a hard ceiling that prevents runaway answers.

import { VapiClient } from "@vapi-ai/server-sdk";

const vapi = new VapiClient({ token: process.env.VAPI_API_KEY! });

const assistant = await vapi.assistants.create({
  name: "Cascade PT Scheduler",
  model: {
    provider: "anthropic",
    model: "claude-sonnet-4-6",
    temperature: 0.15,        // Low variance keeps response length predictable
    maxTokens: 150,           // Hard cap: ~115 spoken words at 130wpm
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
    ],
  },
  voice: {
    provider: "11labs",
    voiceId: "21m00Tcm4TlvDq8ikWAM",
    stability: 0.5,
    similarityBoost: 0.75,
    model: "eleven_turbo_v2_5",   // ~200ms streaming vs ~300ms for standard v2
  },
  firstMessage: "Thanks for calling Cascade Physical Therapy, this is Jordan. Are you calling to schedule a new appointment, or do you have a question about an existing one?",
  endCallFunctionEnabled: true,
  recordingEnabled: true,
  silenceTimeoutSeconds: 10,
  silenceTimeoutMessage: "I am still here -- can you hear me okay?",
});
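The arithmetic behind the maxTokens comment can be made explicit. This is a rough sketch: the 0.75 words-per-token ratio is an assumed rule of thumb for English text, not a Vapi or Anthropic figure, and both function names are hypothetical.

```typescript
// Assumed rule of thumb: about 0.75 English words per token. Real ratios vary by tokenizer.
const WORDS_PER_TOKEN = 0.75;

// Approximate number of spoken words a token cap allows.
function spokenWordsForTokens(maxTokens: number): number {
  return Math.round(maxTokens * WORDS_PER_TOKEN);
}

// Worst-case speaking time, in seconds, if the model uses the whole cap.
function worstCaseSpeakingSeconds(maxTokens: number, wordsPerMinute = 130): number {
  return Math.round((spokenWordsForTokens(maxTokens) / wordsPerMinute) * 60);
}
```

Under these assumptions a cap of 150 tokens still permits close to a minute of speech, which is why the cap works as a backstop while the two-sentence rule in the prompt does the everyday work.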

Turn-taking and interruption handling

Vapi's Voice Activity Detection triggers a stop_speaking event the moment it detects the caller talking while the agent is mid-sentence. The TTS audio is cut immediately. The STT then processes what the caller said -- including context fragments from what the agent was saying -- and feeds the transcript to the LLM as the next user turn.

The problem is context contamination. The LLM sees a fragmented transcript and without explicit instructions tends to finish its interrupted thought before addressing what the caller actually said. The caller redirects the conversation, and the agent continues with its original sentence anyway. A barge-in recovery instruction in your prompt eliminates this by explicitly overriding the model's default completion behavior.

For detailed server-side handling of stop_speaking and other Vapi call events, the Vapi webhook architecture guide covers the full event payload structure and routing patterns.

const INTERRUPTION_INSTRUCTIONS = `
## Handling interruptions
If the caller interrupts you mid-sentence, do not finish your previous thought.
Acknowledge what they just said and respond to that directly.
Never say "As I was saying" or reference the sentence you were in the middle of.
If you cannot clearly understand what they said, ask one short clarifying question.
`.trim();

// Appended after the Tool use section, so these rules close the prompt.
// Instructions near the start and end of a system prompt carry more weight
// than instructions buried in the middle -- position this deliberately.
const FULL_SYSTEM_PROMPT = SYSTEM_PROMPT + "\n\n" + INTERRUPTION_INSTRUCTIONS;

Anti-hallucination guardrails

Voice agents hallucinate for a predictable reason: the system prompt instructs the model to be helpful and conversational, which the model interprets as permission to fill gaps with plausible-sounding information. The fix is making the scope boundary more authoritative than the helpfulness instruction -- phrasing it as a hard rule, not a preference.

Reported figures from voice AI deployments suggest that proper grounding reduces hallucination rates from approximately 27% to under 5%. The two highest-leverage changes are an explicit knowledge boundary instruction with a clear path for out-of-scope queries, and a RAG retrieval similarity threshold of 0.65 or higher to avoid injecting irrelevant context chunks that confuse the model.

function buildSystemPrompt(retrievedContext: string): string {
  return `
## Identity
You are Jordan, scheduling assistant for Cascade Physical Therapy in Bend, Oregon.

## Scope and knowledge boundary
You MUST only answer questions using the information in ## Retrieved Knowledge below.
If the answer is not there, say: "I do not have that information available.
Let me connect you with a staff member who can help." Then call transferCall.
Do not infer, estimate, or draw on general medical or insurance knowledge.

## Speaking style
Two sentences maximum. No markdown. Always end with a question.

## Retrieved Knowledge
${retrievedContext || "No relevant information was found for this query."}

## Tool use
- transferCall: anything outside scope, caller requests human, or medical question.
- endCall: only after confirmed booking or explicit caller goodbye.
`.trim();
}

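// vectorSearch stands in for your vector store client; this signature is a placeholder.
declare function vectorSearch(
  query: string,
  opts: { topK: number; minScore: number },
): Promise<Array<{ text: string; score: number }>>;

async function getContextForQuery(query: string): Promise<string> {
  // Only inject context that clears the similarity threshold
  const results = await vectorSearch(query, { topK: 3, minScore: 0.65 });
  return results.map((r) => r.text).join("\n\n");
}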

async function handleTurn(callerMessage: string) {
  const context = await getContextForQuery(callerMessage);
  const systemPrompt = buildSystemPrompt(context);
  // Pass systemPrompt to your Vapi model config for this turn
}
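To show what the minScore threshold does, here is a minimal in-memory sketch of thresholded retrieval. It assumes cosine similarity over precomputed embeddings; the `Chunk` shape and `searchInMemory` are hypothetical and stand in for whatever vector store sits behind vectorSearch.

```typescript
// Hypothetical shapes for illustration; a real vector store defines its own.
interface Chunk { text: string; embedding: number[]; }
interface ScoredChunk { text: string; score: number; }

// Cosine similarity of two equal-length vectors; returns 0 if either has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk, drop those below minScore, and keep the topK best.
function searchInMemory(
  queryEmbedding: number[],
  chunks: Chunk[],
  opts: { topK: number; minScore: number },
): ScoredChunk[] {
  return chunks
    .map((c) => ({ text: c.text, score: cosineSimilarity(queryEmbedding, c.embedding) }))
    .filter((c) => c.score >= opts.minScore)
    .sort((x, y) => y.score - x.score)
    .slice(0, opts.topK);
}
```

When nothing clears the threshold the result is an empty array, so the joined context is an empty string and the prompt falls back to its "No relevant information was found" line instead of a weak match.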

Tool calling: transferCall and endCall

Vapi exposes transferCall and endCall as built-in functions the model can invoke. The most common mistake is defining only the function signature without specifying trigger conditions in the description. A model that knows it can transfer calls will do so inconsistently -- sometimes for simple questions, sometimes not at all for genuine escalations. Explicit conditions in the function description make behavior reliable and auditable.

const assistantTools = [
  {
    type: "transferCall" as const,
    destinations: [
      {
        type: "number" as const,
        number: "+15415550100",
        message: "One moment while I connect you with our front desk team.",
      },
    ],
    function: {
      name: "transferCall",
      // The LLM reads this description to decide when to invoke the function
      description: `Transfer the call to a human staff member. Use this when:
1. The caller explicitly asks to speak with a person or the front desk.
2. The caller mentions a medical emergency or urgent clinical concern.
3. The caller wants to dispute a bill or access their patient records.
4. You have asked a clarifying question twice and still cannot understand the caller.
5. The caller expresses frustration or repeats the same request three times in a row.
Do NOT use this just because a question is complex - try to answer it first.`,
      parameters: { type: "object" as const, properties: {}, required: [] },
    },
  },
];

ElevenLabs TTS optimization

ElevenLabs' pronunciation dictionary is one of the most underused features in voice deployments. By default, the TTS engine makes phonetic guesses at unfamiliar terms: regional place names, medical terminology, and insurance acronyms all get creative treatments. Upload a pronunciation dictionary through the ElevenLabs dashboard before launch. For Central Oregon deployments, entries for "Deschutes," "Ochoco," and local clinic names prevent the most common mispronunciations before your first live caller hears them.

For model selection, the Turbo v2 and Turbo v2.5 tiers begin streaming audio at approximately 200ms -- roughly 100ms faster than the standard multilingual v2 model. The trade-off is slightly reduced expressivity on emotional range, which is acceptable for business phone contexts where clarity and consistency matter more than nuance. Use Turbo unless emotional expressivity is a specific product requirement for your deployment.

Avoid SSML tags in LLM outputs unless you have tested them against your specific voice configuration. Malformed SSML causes the engine to fall back to plain-text parsing, introducing a 400-600ms penalty on re-processing. For pause control, instruct the model to use ellipses and natural sentence breaks instead of SSML markup.
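If the model emits SSML anyway, one defensive option is to strip it before the text reaches the TTS engine. The sketch below is an assumption about how such a filter could look, not an ElevenLabs or Vapi feature; `stripSsmlForTts` is a hypothetical name, and the choice to map break tags to an ellipsis simply mirrors the advice above.

```typescript
// Hypothetical filter: removes SSML-style tags from LLM output before synthesis.
function stripSsmlForTts(text: string): string {
  return text
    .replace(/<break\b[^>]*\/?>/gi, " ... ") // pause tags become an ellipsis
    .replace(/<\/?[a-zA-Z][^>]*>/g, "")      // drop any remaining tags, keep their contents
    .replace(/\s+/g, " ")                    // collapse the whitespace left behind
    .trim();
}
```

Text that contains no tags passes through unchanged apart from whitespace normalization, so the filter is safe to leave in the pipeline permanently.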

Production gotchas

Vapi wraps your prompt -- you do not control the full context window. Vapi injects call metadata, tool definitions, and internal instructions alongside your system prompt. Your text is not the only thing the model sees. Near a context limit, Vapi's injected content takes priority and your knowledge block can be truncated silently. Keep your total system prompt under 3,000 tokens. Verify actual token counts with the Anthropic SDK token counter before deploying.
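A cheap pre-deploy guard can catch prompts that have clearly outgrown the budget before you reach for an exact count. This sketch uses a crude characters-per-token heuristic, which is an assumption and not a substitute for the provider's token counter; `estimateTokens` and `checkPromptBudget` are hypothetical names.

```typescript
// Crude heuristic assumed for illustration: about 4 characters per token for English text.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

interface BudgetReport {
  estimatedTokens: number;
  limit: number;
  withinBudget: boolean;
}

// Compare the estimate against a limit; 3,000 is the ceiling recommended in this guide.
function checkPromptBudget(prompt: string, limit = 3000): BudgetReport {
  const estimatedTokens = estimateTokens(prompt);
  return { estimatedTokens, limit, withinBudget: estimatedTokens <= limit };
}
```

Treat a result near the limit as a signal to run an exact count, since the heuristic can be off in either direction.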

Cloudflare Workers has a 30-second CPU time limit. That limit counts active CPU time, not time spent waiting on network I/O, but a webhook handler that blocks on a slow downstream API call -- to a booking system, CRM, or calendar -- still holds the response open while the caller waits in silence. Keep CPU-heavy work out of the request path. For events that do not need a result, return 200 OK immediately and process asynchronously where possible.
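The acknowledge-then-process pattern can be sketched as follows. Only `waitUntil` is a real Workers API here; the `ExecutionContextLike` and `Ack` types and the `acknowledgeThenProcess` function are hypothetical, and the plain `Ack` object stands in for the `Response` a real Worker would return.

```typescript
// Minimal stand-in for the Workers ExecutionContext; only waitUntil is used.
interface ExecutionContextLike {
  waitUntil(promise: Promise<unknown>): void;
}

// Stand-in for the HTTP response; in a Worker, wrap this in new Response(body, { status }).
interface Ack {
  status: number;
  body: string;
}

function acknowledgeThenProcess<E>(
  event: E,
  ctx: ExecutionContextLike,
  processEvent: (event: E) => Promise<void>,
): Ack {
  // Hand the slow work to the runtime so it continues after the response is sent.
  // Errors are caught and logged so a failed downstream call cannot reject unhandled.
  ctx.waitUntil(
    processEvent(event).catch((err) => {
      console.error("deferred webhook processing failed", err);
    }),
  );
  return { status: 200, body: JSON.stringify({ received: true }) };
}
```

This fits fire-and-forget events such as status updates and end-of-call reports. A tool call whose result the model needs before it can speak cannot be deferred this way, so those handlers should stay fast instead.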

STT transcripts contain errors your prompt must handle defensively. The STT layer sees real phone audio -- background noise, regional accents, cellular compression artifacts. Common errors: "two" vs "to" vs "too," proper nouns mangled, phone digits transposed. Instruct the model to repeat critical information -- names, dates, callback numbers -- for verbal confirmation before taking any action. This prevents a large class of downstream booking and routing errors.
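When a read-back is produced by your own code, for example in a tool response, formatting the number for speech avoids a second round of misrecognition. A minimal sketch, assuming 10-digit North American numbers grouped 3-3-4; `speakPhoneNumber` and `confirmationPrompt` are hypothetical helpers.

```typescript
const DIGIT_WORDS = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"];

// Spell a run of digits as hyphen-joined words, e.g. "541" -> "five-four-one".
function spellDigits(digits: string): string {
  return digits.split("").map((d) => DIGIT_WORDS[Number(d)]).join("-");
}

// Assumes a 10-digit number grouped 3-3-4; anything else is spelled as one run.
function speakPhoneNumber(raw: string): string {
  const digits = raw.replace(/\D/g, "");
  if (digits.length !== 10) return spellDigits(digits);
  return [digits.slice(0, 3), digits.slice(3, 6), digits.slice(6)].map(spellDigits).join(", ");
}

// Builds a confirmation line that ends with a question, matching the speaking style rules.
function confirmationPrompt(name: string, phone: string): string {
  return `I have ${name} at ${speakPhoneNumber(phone)}. Is that correct?`;
}
```

For example, `speakPhoneNumber("541-555-1234")` yields "five-four-one, five-five-five, one-two-three-four", the same format the speaking style section asks the model to use.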

Temperature above 0.3 breaks response length predictability. The "two sentences max" instruction competes with stochastic sampling at higher temperatures. At 0.7, the model occasionally produces four or five sentence responses despite the constraint. Set temperature to 0.1-0.2 and treat it as a hard operational requirement, not a soft suggestion. This applies when using Claude via Vapi's model config -- the temperature field is respected and makes a measurable difference.

The firstMessage field is separate from your system prompt. Do not include greeting language inside the system prompt body -- Vapi renders firstMessage independently. If greeting text appears in both places, the agent greets the caller twice at the start of every call, which signals immediately that something is misconfigured.

Silence handling requires explicit configuration. If a caller goes quiet for 5-10 seconds -- distracted, poor connection, accidentally muted -- Vapi will not prompt them without configuration. Set silenceTimeoutSeconds and silenceTimeoutMessage in the assistant config, and add a prompt instruction for how the agent handles a second silence event before gracefully ending the call.

When NOT to build this yourself

Custom prompt engineering for a voice AI agent makes sense when you need tight integration with proprietary business logic, tool calls into APIs that purpose-built platforms do not support, or conversation flows that off-the-shelf solutions cannot handle. In several common situations, the custom build is the wrong call.

If your deployment handles fewer than 200 calls per month, the engineering time to tune prompts, handle edge cases, maintain the STT/LLM/TTS pipeline, and monitor for regressions typically costs more than a managed voice AI service running for 12 months. The economics of custom builds rarely work out below that call volume threshold.

If your use case involves clinical intake, legal consultations, or financial advice with regulatory retention requirements, prompt engineering alone is insufficient. You need a platform that provides call recording with compliant storage, transcript retention policies, and audit trails. Building that infrastructure custom -- and keeping it compliant as regulations evolve -- is a significant ongoing commitment beyond the system prompt.

If your conversation flow has more than 15 distinct branches -- different appointment types, insurance verification paths, multi-step intake, emergency triage -- a single-prompt approach becomes fragile. Instructions from earlier in the prompt get ignored as the context fills during long calls. At that complexity level, a state-machine architecture with separate prompts per conversation state is more reliable than what is covered here.
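For a sense of what that alternative looks like, here is a minimal sketch of a state machine with one short prompt per state. The state names, prompt text, and transition table are invented for illustration and are not a Vapi feature.

```typescript
// Hypothetical conversation states for a scheduling agent.
type CallState = "greeting" | "scheduling" | "insurance" | "wrapUp";

// One short, focused instruction per state instead of one large prompt.
const STATE_PROMPTS: Record<CallState, string> = {
  greeting: "Find out whether the caller wants to schedule or has an insurance question.",
  scheduling: "Collect the preferred day and time, then confirm them back to the caller.",
  insurance: "Answer only from the accepted insurance list; otherwise offer a transfer.",
  wrapUp: "Confirm nothing else is needed, then end the call.",
};

// Allowed moves between states; anything else is rejected.
const TRANSITIONS: Record<CallState, CallState[]> = {
  greeting: ["scheduling", "insurance"],
  scheduling: ["insurance", "wrapUp"],
  insurance: ["scheduling", "wrapUp"],
  wrapUp: [],
};

// Move to the requested state only if the transition is allowed; otherwise stay put.
function nextState(current: CallState, requested: CallState): CallState {
  return TRANSITIONS[current].includes(requested) ? requested : current;
}

// Shared rules (identity, speaking style) go first, followed by the state-specific task.
function promptFor(state: CallState, sharedRules: string): string {
  return `${sharedRules}\n\n## Current task\n${STATE_PROMPTS[state]}`;
}
```

Because each state carries only its own instructions, the model never has to hold every branch in context at once, which is the failure a single long prompt runs into.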

For a complete walkthrough of the Vapi deployment stack -- from provisioning a phone number through production monitoring -- see how to build a voice AI agent with Vapi. That guide covers the infrastructure setup this post assumes is already in place.

Architecture

Caller (Phone)
      |
      v
[Phone Number / Twilio]
      |
      v
 [Vapi Orchestration]
  /       |        \
 v        v         v
STT      LLM       TTS
Deep-   Claude    Eleven-
gram    temp=0.15  Labs
~150ms  max=150   Turbo
        tokens    ~200ms
         |
   [System Prompt]
   1. Identity
   2. Scope
   3. Style Rules
   4. Knowledge
   5. Tool Triggers
         |
    [Tool Call?]
     /         \
   Yes           No
    |             |
    v             v
[Webhook]    [Next Turn]
[CF Worker]
 30s limit
    |
    v
[Business APIs]
Booking / CRM

Frequently asked questions

What is the most important difference between a voice AI system prompt and a chatbot prompt?

Voice prompts must explicitly forbid markdown formatting (which TTS reads aloud literally), enforce strict response length limits of two sentences or fewer, and always end with a question to signal the caller's turn. Text chat prompts need none of these constraints because the UI enforces turns and renders formatting visually.

How do I prevent my Vapi agent from hallucinating facts about my business?

Add an explicit knowledge boundary instruction to your system prompt: 'You MUST only answer using information in the Retrieved Knowledge section below. If the answer is not there, transfer the call.' Pair this with a RAG retrieval similarity threshold of 0.65 or higher to avoid injecting irrelevant context. This combination reduces hallucination rates from roughly 27% to under 5%.

What temperature should I set for a voice AI agent using Claude or GPT?

Set temperature between 0.1 and 0.2. At higher temperatures, length constraints like 'two sentences max' compete with stochastic sampling and lose unpredictably -- the same instruction can produce a 1-sentence or a 5-sentence response at temperature 0.7, making call quality inconsistent across sessions.

How should I handle caller interruptions in my voice AI system prompt?

Add a barge-in recovery section to your prompt: when interrupted mid-sentence, the agent should drop its previous thought entirely and respond directly to what the caller just said. Without this instruction, the LLM defaults to completing its interrupted sentence before addressing the new input, which sounds unnatural and wastes turn time.

What is Vapi's 10-second first-response budget and why does it matter?

Vapi allows roughly 10 seconds from when a caller finishes speaking to when the agent starts responding before callers perceive a significant hang. This budget covers STT processing (~150-200ms), LLM first-token generation, and TTS streaming startup (~200ms for ElevenLabs Turbo). Keeping maxTokens at 150 and temperature at 0.15 helps reliably stay inside this window.

When should I use transferCall versus endCall in my Vapi tool definitions?

Use transferCall when the conversation requires human judgment: medical questions, billing disputes, repeated caller frustration, or anything outside your agent's defined scope. Use endCall only after the caller's goal is confirmed complete -- appointment booked, question answered -- or the caller explicitly says goodbye. Never end a call while a question is still open.

Written by

Thom Wilson

Founder & AI Engineer, WildRun AI

SEO consultant turned AI engineer. Built WildRun after years getting small businesses found online — custom AI voice agents, sales and operations automation, and AI-era SEO, deployed on Cloudflare and managed end-to-end.

Last reviewed: June 2026