Barge-In Voice AI: Handling Interruptions That Kill Calls
Learn how to implement barge-in voice AI so your agent stops mid-sentence when a caller interrupts—with TypeScript, Vapi config, and production gotchas.
Barge-in occurs in roughly one in five voice agent calls. When a caller says "wait" or "actually" or simply starts talking over the agent, they expect it to stop immediately — the same way a human would. When it does not, the caller repeats themselves, gets frustrated, and trust in the interaction collapses. What follows is how to implement barge-in that works reliably in production, using Vapi as the orchestration layer and real TypeScript you can adapt today.
Why barge-in is harder than it looks
The naive approach, stopping TTS the moment any audio is detected, creates an immediate problem: the agent interrupts itself every time the caller says "mm-hmm" or "yeah okay." This is called a false barge-in, and it is as disruptive as having no barge-in at all. The agent chops its own response into fragments, and the caller hears what should be a fluid conversation turn into a broken loop.
Three problems must be solved simultaneously:
- Voice activity detection (VAD): Is there audio energy in the incoming stream, and is it speech?
- Backchannel vs. true interruption: Did the caller say something requiring the agent to yield the floor, or are they just signaling they are listening?
- TTS flush speed: How quickly can you stop the current utterance and route new audio? The 2026 production standard is under 60ms TTS flush time, with end-to-end turn-taking gaps of 200–400ms.
Get any of these wrong and you end up with an agent that either talks over callers or stutters to a halt on every filler word. Neither is acceptable in a production system meant to represent a business on the phone.
The duplex audio pipeline
Barge-in requires a fully duplex audio architecture: STT and TTS must run on independent streams simultaneously, never alternating turns. This is the architectural dividing line between a voice agent that supports real interruption and one that only pretends to. If your pipeline serializes — STT then LLM then TTS, repeat — there is no mechanism to detect caller speech while the agent is talking.
The pipeline at a high level:
Caller audio (PCMU/8kHz)
        |
        v
+------------------+   VAD score    +------------------+
|   Echo Cancel    | -------------> |    VAD Engine    |
+------------------+                | (Silero/WebRTC)  |
        |                           +--------+---------+
        |                                    | speech_start / speech_end
        v                                    v
+------------------+   transcript   +------------------------+
|    STT Stream    | -------------> |   Barge-in Decision    |
|    (Deepgram)    |                |  backchannel vs true   |
+------------------+                +------------+-----------+
                                                 | interrupt event
                                                 v
                                    +------------------------+
                                    |    TTS Flush + LLM     |
                                    |     Context Reset      |
                                    +------------------------+
                                                 | new TTS audio
                                                 v
                                    +------------------------+
                                    |     ElevenLabs TTS     |
                                    |   ~300ms first chunk   |
                                    +------------------------+
Echo cancellation is not optional. Without it, the agent's own TTS audio bleeds into the microphone input and triggers its VAD — creating phantom barge-in events on every agent utterance. PSTN telephony providers typically handle acoustic echo cancellation (AEC) at the hardware level. For WebRTC deployments, you need server-side AEC or the browser's built-in AEC. End-to-end delay across this pipeline compounds quickly; for a full treatment of where latency accumulates, see the voice agent latency optimization guide.
Voice activity detection options
VAD is the first gating layer. It decides whether incoming audio contains speech before anything reaches the STT engine or barge-in logic. Two options dominate production deployments in 2026:
WebRTC VAD
WebRTC VAD is a Gaussian Mixture Model classifier running at 10ms frames. It is fast — under 1ms per frame — and ships inside every modern browser and many telephony SDKs. The tradeoff is accuracy: a false-positive rate of 8–15% on noisy phone audio, particularly with background music or call-center ambient noise. Acceptable for quiet environments; unreliable in noisier conditions.
Silero VAD
Silero VAD is a small LSTM-based neural network trained on multilingual speech data. Its false-positive rate drops to 5–8% on noisy phone audio. It runs in ONNX format inside a Node.js worker thread at roughly 15ms per 30ms audio window — fast enough for real-time use if kept off the main event loop. As of mid-2026, trained audio-based VAD models are achieving 86% precision and 100% recall at 500ms overlap windows in controlled conditions, though production numbers on real phone traffic run somewhat lower.
For most teams using Vapi: skip the custom VAD. Vapi runs its own energy and voice classifier internally. Custom VAD is only worth implementing if you are on a raw WebRTC stack or have specific noise-environment requirements that the platform defaults cannot handle.
The backchannel problem
A backchannel is a short vocal acknowledgment that signals listening without intending to take the floor: "yeah," "uh huh," "okay," "I see," "mm-hmm," "right." Your callers will produce these constantly. If your barge-in logic treats every one as a true interruption, the agent will cut itself off dozens of times per call.
The production approach uses a two-stage filter:
- Duration guard: Any speech segment under 400ms that does not contain a content word is classified as a backchannel. The agent continues speaking.
- Content-word detection: If the interim STT transcript contains a content word — noun, verb, or question word — treat it as a true interruption regardless of duration. Words like "wait," "stop," "actually," "that is wrong," and question words are reliable interrupt signals.
Vapi handles this classification with a proprietary model trained on real call data. When building custom pipelines, a keyword list plus duration gating replicates most of the value without requiring an ML model.
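For a custom pipeline, the two-stage filter above can be sketched as a single pure function. This is a minimal illustration, not Vapi's classifier: the names classifySegment, BACKCHANNEL_MAX_MS, and INTERRUPT_TRIGGERS are hypothetical, and the trigger list is a placeholder you would tune against your own transcripts.

```typescript
// Hypothetical names throughout; the 400ms threshold mirrors the duration guard above.
type SegmentVerdict = "backchannel" | "interruption";

const BACKCHANNEL_MAX_MS = 400;

// Illustrative trigger list (an assumption); extend it from real call data.
const INTERRUPT_TRIGGERS = ["wait", "stop", "actually", "hold on", "what", "why", "how"];

// Word-boundary matching so "how" does not fire inside "show".
const TRIGGER_RE = new RegExp(`\\b(?:${INTERRUPT_TRIGGERS.join("|")})\\b`, "i");

function classifySegment(durationMs: number, interimTranscript: string): SegmentVerdict {
  // Content-word detection wins regardless of duration.
  if (TRIGGER_RE.test(interimTranscript)) return "interruption";
  // Duration guard: short speech with no content word is a backchannel.
  if (durationMs < BACKCHANNEL_MAX_MS) return "backchannel";
  // Sustained speech without a known trigger still takes the floor.
  return "interruption";
}
```

Keeping the function pure (duration and transcript in, verdict out) makes it easy to replay recorded segments through it when tuning the threshold or the trigger list.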
Implementing barge-in with Vapi
For most production deployments, Vapi handles the hard parts of barge-in internally. Your job is to configure it correctly and handle speech-update events on the client side. Barge-in behavior is controlled through three config namespaces on your assistant object.
Assistant configuration
import Vapi from "@vapi-ai/web";
const vapi = new Vapi(process.env.VAPI_PUBLIC_KEY!);
const assistant = {
  transcriber: {
    provider: "deepgram",
    model: "nova-3",
    endpointing: {
      // Silence required before treating speech as complete (default: 500ms)
      // Lower = faster turn detection, higher = fewer false triggers on natural pauses
      silenceThresholdMs: 500,
    },
  },
  voice: {
    provider: "elevenlabs",
    voiceId: "your-voice-id",
    // ElevenLabs streams first chunk at ~300ms; keep chunk size small to reduce bleed after flush
    speed: 1.0,
  },
  model: {
    provider: "anthropic",
    // Haiku for fast re-entry after interruption; Sonnet if response accuracy matters more
    model: "claude-haiku-4-5-20251001",
    messages: [{ role: "system", content: "..." }],
  },
  // Removes background noise before VAD and STT process the audio
  backgroundDenoisingEnabled: true,
  // Events needed for client-side barge-in logic
  clientMessages: [
    "speech-update", // VAD state transitions: started / stopped
    "transcript", // interim transcripts for content-word detection
    "hang",
  ],
};
await vapi.start(assistant);
Speech-update event handler
The speech-update event fires whenever the caller's or the assistant's VAD state changes. The handler below implements a 300ms duration guard (a tighter setting than the 400ms threshold described above) and content-word early detection, which together separate backchannels from true interruptions.
type SpeechUpdate = {
  type: "speech-update";
  status: "started" | "stopped";
  role: "user" | "assistant";
};
type TranscriptMessage = {
  type: "transcript";
  transcriptType: "partial" | "final";
  transcript: string;
};
let interruptTimer: ReturnType<typeof setTimeout> | null = null;
let agentSpeaking = false;
function clearInterruptTimer() {
  if (interruptTimer) { clearTimeout(interruptTimer); interruptTimer = null; }
}
// Caution: in @vapi-ai/web, stop() ends the whole call, not just the current
// utterance. Swap in whatever interruption control your SDK version exposes.
function flushAgentSpeech(reason: string) {
  clearInterruptTimer();
  vapi.stop();
  agentSpeaking = false;
  console.log(`[barge-in] ${reason}`);
}
// The web SDK delivers clientMessages through the generic "message" event,
// so dispatch on message.type instead of subscribing to each type by name.
vapi.on("message", (message: SpeechUpdate | TranscriptMessage) => {
  if (message.type === "speech-update") {
    if (message.role === "assistant") {
      agentSpeaking = message.status === "started";
      return;
    }
    if (message.status === "started" && agentSpeaking) {
      // Start the 300ms duration guard, clearing any stale timer first
      clearInterruptTimer();
      interruptTimer = setTimeout(
        () => flushAgentSpeech("sustained speech, flushing TTS"),
        300,
      );
    }
    if (message.status === "stopped") {
      // Ended under 300ms: a backchannel, so cancel the pending interrupt
      clearInterruptTimer();
    }
    return;
  }
  if (
    message.type === "transcript" &&
    message.transcriptType === "partial" &&
    agentSpeaking &&
    hasContentWord(message.transcript)
  ) {
    // Content word detected before 300ms elapsed: interrupt immediately
    flushAgentSpeech(`content word detected: ${message.transcript}`);
  }
});
const CONTENT_TRIGGERS = [
  "wait", "stop", "actually", "no", "hold on",
  "what", "when", "where", "why", "how",
  "can you", "i need", "i want", "wrong", "incorrect",
  "that's not", "you said",
];
// Match on word boundaries so "no" does not fire inside "know" or "how" inside "show"
const CONTENT_TRIGGER_RE = new RegExp(`\\b(?:${CONTENT_TRIGGERS.join("|")})\\b`, "i");
function hasContentWord(text: string): boolean {
  return CONTENT_TRIGGER_RE.test(text);
}
Custom barge-in on a raw WebRTC stack
If you are running your own media server — using LiveKit or a custom SIP bridge — you need to manage VAD and the barge-in state machine yourself. The controller below serializes state across async event sources to prevent concurrent interrupt sequences from corrupting conversation context.
import type { Readable } from "node:stream";
type ConvState = "listening" | "agent_speaking" | "interrupted" | "processing";
// Short acknowledgments that should never take the floor on their own
const BACKCHANNELS = new Set(["yeah", "uh huh", "okay", "i see", "mm-hmm", "right"]);
class BargeInController {
  private state: ConvState = "listening";
  // Readable rather than NodeJS.ReadableStream, so destroy() is available
  private ttsStream: Readable | null = null;
  private onInterrupt: (context: string) => void;
  private agentUtteranceSoFar = "";
  private guardTimer: ReturnType<typeof setTimeout> | null = null;
  constructor(onInterrupt: (context: string) => void) {
    this.onInterrupt = onInterrupt;
  }
  agentStarted(stream: Readable, utterance: string) {
    this.state = "agent_speaking";
    this.ttsStream = stream;
    this.agentUtteranceSoFar = utterance;
  }
  onVadEvent(event: "speech_start" | "speech_end") {
    if (event === "speech_end") {
      // Speech ended inside the 300ms guard: treat it as a backchannel
      this.clearGuard();
      return;
    }
    if (this.state !== "agent_speaking") return;
    this.clearGuard();
    this.guardTimer = setTimeout(() => {
      this.guardTimer = null;
      if (this.state === "agent_speaking") this.triggerInterrupt();
    }, 300);
  }
  onTranscript(text: string, isFinal: boolean) {
    if (this.state !== "agent_speaking") return;
    if (!isFinal && hasContentWord(text)) { this.triggerInterrupt(); return; }
    // Final transcripts interrupt unless they are a bare backchannel
    const normalized = text.trim().toLowerCase().replace(/[.,!?]+$/, "");
    if (isFinal && normalized.length > 3 && !BACKCHANNELS.has(normalized)) {
      this.triggerInterrupt();
    }
  }
  private clearGuard() {
    if (this.guardTimer) { clearTimeout(this.guardTimer); this.guardTimer = null; }
  }
  private triggerInterrupt() {
    // State gate: ignore duplicate triggers for the same barge-in
    if (this.state === "interrupted" || this.state === "processing") return;
    this.clearGuard();
    this.state = "interrupted";
    if (this.ttsStream) { this.ttsStream.destroy(); this.ttsStream = null; }
    // Pass interrupted context so LLM can acknowledge it gracefully
    this.onInterrupt(this.agentUtteranceSoFar);
    this.state = "processing";
  }
  agentDone() {
    if (this.state === "agent_speaking") { this.clearGuard(); this.state = "listening"; }
  }
  responseReady() { this.state = "listening"; }
}
Passing agentUtteranceSoFar to the interrupt handler matters. When the LLM re-enters after an interruption, it should know what the agent said before being cut off, so it can respond naturally instead of restarting from scratch. This context threading is what makes interrupted conversations feel continuous rather than reset.
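One way to thread that context into the next LLM call is sketched below. The ChatMessage shape, the buildReentryMessages helper, and the "[interrupted by caller]" marker are all assumptions made for illustration; adapt them to whatever message format your LLM client expects.

```typescript
// Hypothetical message shape, modeled on common chat-completion APIs.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Append the cut-off agent utterance and the caller's interruption to the history.
function buildReentryMessages(
  history: ChatMessage[],
  agentUtteranceSoFar: string,
  callerInterruption: string,
): ChatMessage[] {
  const messages: ChatMessage[] = [...history];
  const spoken = agentUtteranceSoFar.trim();
  if (spoken.length > 0) {
    // The marker text is an assumption; it tells the model the turn was cut off.
    messages.push({ role: "assistant", content: `${spoken} [interrupted by caller]` });
  }
  messages.push({ role: "user", content: callerInterruption.trim() });
  return messages;
}
```

The function copies the history rather than mutating it, so a retry after a failed LLM call starts from the same state.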
Production gotchas
In-flight TTS chunks still play after flush. Calling vapi.stop() or destroying your TTS stream does not immediately silence the caller's speaker. Audio chunks already queued in the WebRTC jitter buffer will play out over the next 80–120ms. Minimize bleed by keeping your TTS buffer shallow — request one chunk at a time rather than pre-buffering multiple seconds of audio ahead of playback.
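A shallow buffer can be approximated with a bounded queue that refuses to accept more audio than it is allowed to hold. The sketch below is illustrative only: ShallowChunkQueue is a hypothetical class, and a depth of one chunk is an assumed starting point, not a measured optimum.

```typescript
// Hypothetical bounded queue for TTS audio chunks awaiting playback.
class ShallowChunkQueue<T> {
  private chunks: T[] = [];
  private maxDepth: number;

  constructor(maxDepth = 1) {
    this.maxDepth = maxDepth;
  }

  // True when the producer may request another chunk from the TTS provider.
  wantsMore(): boolean {
    return this.chunks.length < this.maxDepth;
  }

  // Returns false (and drops nothing) when the queue is already full.
  push(chunk: T): boolean {
    if (!this.wantsMore()) return false;
    this.chunks.push(chunk);
    return true;
  }

  // Next chunk to hand to the playback layer, if any.
  next(): T | undefined {
    return this.chunks.shift();
  }

  // On barge-in: discard everything queued and report how many chunks were dropped.
  flush(): number {
    const dropped = this.chunks.length;
    this.chunks = [];
    return dropped;
  }
}
```

With a depth of one, the most audio that can play after a flush from this queue is a single chunk, on top of whatever the jitter buffer already holds.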
DTMF tones trigger VAD. Telephone keypad presses generate tones at 697–1633Hz. Both Silero and WebRTC VAD misclassify DTMF as speech at moderate confidence levels. Add a DTMF event listener at the telephony layer and suppress VAD responses for 200ms after any tone event. Without this, callers pressing menu keys trigger false barge-ins mid-sentence.
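The suppression window can be sketched as a small gate that the VAD handler consults before acting. VadSuppressor and its method names are hypothetical, and timestamps are passed in explicitly so the logic stays testable; how you receive DTMF events depends on your telephony provider.

```typescript
// Hypothetical gate: ignore VAD events for a fixed window after a DTMF tone.
class VadSuppressor {
  private suppressUntilMs = 0;
  private windowMs: number;

  constructor(windowMs = 200) {
    this.windowMs = windowMs;
  }

  // Call from the telephony layer's DTMF event (the hook itself is an assumption).
  onDtmf(nowMs: number): void {
    // Repeated key presses extend the window rather than shortening it.
    this.suppressUntilMs = Math.max(this.suppressUntilMs, nowMs + this.windowMs);
  }

  // Call before treating a VAD speech_start as a possible barge-in.
  shouldIgnoreVad(nowMs: number): boolean {
    return nowMs < this.suppressUntilMs;
  }
}
```

The same gate can back the 200ms post-TTS-start fallback for AEC gaps by calling onDtmf-style logic from the TTS start event instead.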
AEC gaps on SIP trunks. Some SIP carriers disable acoustic echo cancellation or apply it inconsistently. The symptom is the agent repeatedly interrupting itself with no real caller input — its own TTS audio echoing back through the trunk. Diagnose by comparing VAD speech_start timestamps to TTS start timestamps in your logs: if they fire within 50ms of each other, you have an AEC gap. Fix at the carrier level or apply a 200ms VAD suppression window after each TTS start event as a fallback.
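The log comparison described above can be sketched as a pure function over two timestamp lists. findSuspectedEchoStarts is a hypothetical helper; it assumes both lists are in milliseconds on the same clock, and it uses an absolute difference to tolerate small logging skew.

```typescript
// Return VAD speech_start timestamps that fall within windowMs of any TTS start.
// A high share of such events suggests the agent is hearing its own audio.
function findSuspectedEchoStarts(
  ttsStartsMs: number[],
  vadStartsMs: number[],
  windowMs = 50,
): number[] {
  return vadStartsMs.filter((vad) =>
    ttsStartsMs.some((tts) => Math.abs(vad - tts) <= windowMs),
  );
}

// Share of VAD starts that look like echo (0 when there are no VAD events).
function suspectedEchoRatio(ttsStartsMs: number[], vadStartsMs: number[]): number {
  if (vadStartsMs.length === 0) return 0;
  return findSuspectedEchoStarts(ttsStartsMs, vadStartsMs).length / vadStartsMs.length;
}
```

Running this per trunk or per carrier makes it easier to see whether the problem is isolated to one route before escalating.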
LLM re-entry latency dominates perceived silence. Barge-in detection adds 200–400ms. But the caller hears nothing for another 800–2,000ms while the LLM generates tokens and ElevenLabs begins streaming its first chunk. Pass the interrupted utterance context to the LLM rather than issuing a clean prompt — this cuts time-to-first-token substantially. If you are running on Cloudflare Workers, watch the 30-second CPU limit: a stalled streaming LLM call inside a barge-in retry loop will hit it.
Concurrent async events corrupt state. VAD events, interim STT transcripts, and TTS stream callbacks arrive on different async channels. Without explicit state gating, two concurrent interrupt sequences can fire for a single barge-in event. Guard every state transition with an early return if already in the target or post-target state, as shown in BargeInController.triggerInterrupt() above.
Cold start on serverless VAD. Loading the Silero ONNX model from disk takes 200–400ms on a cold Lambda or Workers instance. Pre-warm the model during your initialization path, not on first call. Route cold-start traffic to a pre-warmed instance pool or accept the first-call penalty explicitly.
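Pre-warming usually comes down to starting the load once, at initialization, and letting every caller share the same promise. The sketch below is generic: createWarmLoader is a hypothetical helper, and the loader you pass in (for example, a function that creates your ONNX inference session) is assumed rather than shown.

```typescript
// Wrap an async loader so it runs at most once and all callers share the result.
function createWarmLoader<T>(load: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | null = null;
  return () => {
    if (pending === null) {
      pending = load().catch((err: unknown) => {
        // Do not cache failures: let the next caller retry the load.
        pending = null;
        throw err;
      });
    }
    return pending;
  };
}

// Usage sketch (loadVadModel is a placeholder for your own loader):
//   const getVadModel = createWarmLoader(loadVadModel);
//   void getVadModel();                 // kick off during initialization
//   const model = await getVadModel();  // per call: already resolved when warm
```

Calling the getter during module initialization moves the 200–400ms load off the first caller's critical path without changing how call handlers obtain the model.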
Testing your barge-in implementation
Unit tests are not enough here because correctness depends on timing across multiple async streams. You need scenario-level call tests that inject pre-recorded audio and verify event timing against expected outcomes. The five baseline scenarios every implementation must cover:
- True interruption: Caller says "wait, stop" during agent speech. Agent must halt within 500ms of speech onset.
- Short backchannel: Caller says "yeah" mid-sentence. Agent must continue without interruption.
- Noise burst: Inject 200ms of white noise. No barge-in should fire.
- DTMF input: Send a DTMF tone during agent speech. No false barge-in.
- Silence timeout recovery: Caller goes silent for 5+ seconds after interrupting. Agent must re-prompt and recover cleanly.
Target metrics for a production-ready implementation: barge-in detection latency under 400ms, false barge-in rate under 2%, missed true interruptions under 1%. Log every VAD event with millisecond timestamps — you cannot tune what you do not measure. A scenario replay framework using recorded call audio will surface systematic failures far faster than manual testing.
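Scoring a scenario run against those targets might look like the sketch below. The ScenarioResult shape and both function names are hypothetical, and the rate denominators (scenarios that should not fire, and scenarios that should) are one reasonable definition rather than an established standard.

```typescript
// Hypothetical record produced by one replayed call scenario.
type ScenarioResult = {
  name: string;
  expectInterrupt: boolean;          // should the agent have yielded?
  interrupted: boolean;              // did it?
  detectionLatencyMs: number | null; // speech onset to interrupt, if one fired
};

type Summary = { falseBargeInRate: number; missedRate: number; maxLatencyMs: number | null };

function summarizeScenarios(results: ScenarioResult[]): Summary {
  const shouldFire = results.filter((r) => r.expectInterrupt);
  const shouldNotFire = results.filter((r) => !r.expectInterrupt);
  const falseCount = shouldNotFire.filter((r) => r.interrupted).length;
  const missedCount = shouldFire.filter((r) => !r.interrupted).length;
  // Latency is only meaningful for interruptions that were expected and detected.
  const latencies = shouldFire
    .filter((r) => r.interrupted && r.detectionLatencyMs !== null)
    .map((r) => r.detectionLatencyMs as number);
  return {
    falseBargeInRate: shouldNotFire.length > 0 ? falseCount / shouldNotFire.length : 0,
    missedRate: shouldFire.length > 0 ? missedCount / shouldFire.length : 0,
    maxLatencyMs: latencies.length > 0 ? Math.max(...latencies) : null,
  };
}

// Targets from the text: latency under 400ms, false rate under 2%, missed under 1%.
function meetsTargets(s: Summary): boolean {
  const latencyOk = s.maxLatencyMs === null || s.maxLatencyMs < 400;
  return latencyOk && s.falseBargeInRate < 0.02 && s.missedRate < 0.01;
}
```

A handful of scenarios cannot demonstrate a 2% rate on their own; the same summary becomes meaningful once it is fed hundreds of replayed recordings.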
When NOT to build this yourself
If your use case is a standard inbound phone agent — appointment booking, patient intake, basic lead qualification — there is no practical reason to implement a custom barge-in layer. Vapi's built-in interruption handling covers the common cases, and the configuration above is all you need to tune it for your call traffic patterns.
Custom VAD and barge-in handling is worth the investment when:
- You are running a custom WebRTC or SIP media server that cannot use Vapi's telephony layer
- You need sub-200ms interrupt detection and Vapi's defaults cannot reach that threshold for your call type
- Your audio environment has unusual noise conditions — high-volume restaurants in Bend, outdoor trades job sites, call centers with heavy background chatter — where a custom model trained on your specific audio substantially outperforms general-purpose VAD
- You are building a multi-party conference agent where the speaker turn model is more complex than a standard two-party call
For everyone else: configure, do not build. The right VAD and backchannel model is the one already trained on millions of real calls — not the one being written from scratch this sprint.
If you want to see how barge-in behaves on real calls before committing to an architecture, book a demo and we can walk through call recordings with interruption events highlighted.
Frequently asked questions
What is barge-in in voice AI?
Barge-in is the ability of a voice agent to detect when a caller starts speaking while the agent is talking and immediately stop its own speech to yield the floor. Without it, the agent talks over callers and the interaction feels robotic.
What causes false barge-in in voice agents?
False barge-in happens when the agent incorrectly stops speaking due to short caller vocalizations like 'yeah' or 'uh huh' (backchannels), background noise bursts, DTMF keypad tones, or the agent's own TTS audio echoing back through the microphone. Good implementations use duration guards and content-word detection to filter these out.
How does Vapi handle interruptions?
Vapi uses voice activity detection and a proprietary classifier to distinguish true interruptions from backchannels. You tune sensitivity via transcriber.endpointing.silenceThresholdMs and subscribe to speech-update events on the client side to add custom interrupt logic on top of Vapi's built-in handling.
What is the difference between a backchannel and a true interruption?
A backchannel is a short acknowledgment — 'yeah,' 'mm-hmm,' 'okay' — that signals listening without taking the floor. A true interruption contains new content: a question, correction, or directive to stop. Production systems distinguish them using a duration guard (under 400ms = likely backchannel) combined with content-word detection in the interim transcript.
How fast should barge-in detection be?
The 2026 production standard is barge-in detection latency under 400ms from speech onset, false barge-in rate below 2%, and missed true interruptions below 1%. TTS flush time should be under 60ms. Total perceived silence after an interruption — including LLM generation and TTS re-entry — typically runs 1–2 seconds.
Do I need custom VAD for barge-in to work?
Not if you are using Vapi. Vapi includes built-in VAD and backchannel detection that cover the common cases. Custom VAD — typically Silero running in an ONNX worker thread — is only worth building if you have a custom WebRTC or SIP stack, unusual noise environments, or need sub-200ms interrupt detection that Vapi's defaults cannot reach.