vadConfidence live
What it is. The probability threshold above which a single audio frame is classified as speech by the Silero VAD model. Lower is more sensitive (picks up quiet speech but also noise); higher is stricter (ignores noise but may miss soft-spoken callers).
Runtime. Carried into the VADParams object at base_service.py:1546-1551 → VADParams.confidence.
- When to change: raise toward 0.8 if the agent barges in on background speech or TV noise; lower toward 0.5 if it misses soft-spoken callers.
Symptom it fixes: "it jumps in when there's people talking behind me" → raise vadConfidence. "it doesn't hear me when I talk quietly" → lower it.
vadMinVolume live
What it is. A secondary gate on raw audio energy. A frame must clear both vadConfidence and vadMinVolume to count as speech, filtering low-energy hiss the model might otherwise score as speech.
Runtime. Same VADParams block, base_service.py:1546-1551 → VADParams.min_volume.
- When to change: raise if line hiss or static triggers false speech; lower if quiet callers get dropped.
Symptom it fixes: "static on the line opens a turn" → raise vadMinVolume.
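The two-gate rule for vadConfidence and vadMinVolume can be sketched in a few lines. This is a minimal illustration, not the pipeline's code: the `VadGates` class, the `frame_is_speech` helper, the 0..1 volume scale, and the threshold values are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class VadGates:
    # Hypothetical stand-in for the two thresholds.
    # Values are illustrative, not documented defaults.
    confidence: float = 0.7
    min_volume: float = 0.6


def frame_is_speech(model_prob: float, volume: float, gates: VadGates) -> bool:
    """A frame counts as speech only if it clears BOTH gates."""
    return model_prob >= gates.confidence and volume >= gates.min_volume


gates = VadGates()
print(frame_is_speech(0.9, 0.8, gates))  # confident and loud enough
print(frame_is_speech(0.9, 0.2, gates))  # model says speech, energy too low (hiss)
print(frame_is_speech(0.4, 0.8, gates))  # loud, but the model is not convinced
```

Only the first frame passes. Raising either threshold can only shrink the set of frames that count as speech, which is why both knobs trade noise rejection against sensitivity to quiet callers.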
vadStartSecs live
What it is. How much continuous speech must accrue before VAD declares "speech started" — a debounce against blips (clicks, coughs, a single transient).
Runtime. base_service.py:1546-1551 → VADParams.start_secs (specifically base_service.py:1549).
- When to change: raise to ignore short noise bursts; lower for snappier detection of the caller's first word.
Symptom it fixes: "a cough makes it think I started talking" → raise vadStartSecs.
vadStopSecs live the core endpointing knob
What it is. How much silence must pass before VAD declares "speech ended." This is the single most important endpointing knob: too low and a natural mid-sentence pause is mistaken for turn-end (the agent jumps in); too high and the agent feels laggy.
Runtime. base_service.py:1546-1551 → VADParams.stop_secs (specifically base_service.py:1550).
- When to change: raise to 0.4–0.6 s to stop cutting off callers who pause mid-sentence; lower for snappier turn-handoff in clean, fast conversations.
Symptom it fixes: "it cuts me off when I pause for a breath" → raise vadStopSecs. This is the #1 fix for mid-sentence cutoffs.
The VAD playground below lets you feel exactly this tradeoff — a caller utterance with a 0.30 s mid-sentence pause, and a slider that decides whether that pause reads as "still thinking" or "done talking."
Try it — VAD playground
A fixed caller utterance: speech, a 0.30 s mid-sentence pause, more speech, then trailing silence. The waveform is fixed; your sliders move the decision markers. Teaching moment: raise vadStopSecs past 0.30 s and the agent stops cutting the caller off at the pause.
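The same experiment can be approximated offline. The sketch below is a simplified debounce model, not the Silero VAD or the pipeline's own logic: it assumes 10 ms frames that are already labelled speech or silence, and the segment lengths other than the 0.30 s pause are made up for illustration. `build_utterance` and `detect_turns` are hypothetical helper names.

```python
FRAME_SECS = 0.01  # assumed frame hop, for illustration only


def build_utterance():
    """Speech, a 0.30 s pause, more speech, then trailing silence."""
    def frames(secs, is_speech):
        return [is_speech] * round(secs / FRAME_SECS)
    return (frames(1.0, True) + frames(0.30, False)
            + frames(1.0, True) + frames(1.0, False))


def detect_turns(frames, start_secs, stop_secs):
    """Emit ("started", t) / ("stopped", t) events using simple run-length debouncing."""
    start_frames = max(1, round(start_secs / FRAME_SECS))
    stop_frames = max(1, round(stop_secs / FRAME_SECS))
    events, speaking = [], False
    run_speech = run_silence = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            run_speech, run_silence = run_speech + 1, 0
        else:
            run_speech, run_silence = 0, run_silence + 1
        t = round((i + 1) * FRAME_SECS, 2)
        if not speaking and run_speech >= start_frames:
            speaking = True
            events.append(("started", t))
        elif speaking and run_silence >= stop_frames:
            speaking = False
            events.append(("stopped", t))
    return events


utterance = build_utterance()
print(detect_turns(utterance, start_secs=0.2, stop_secs=0.2))  # pause splits the turn
print(detect_turns(utterance, start_secs=0.2, stop_secs=0.4))  # pause is bridged
```

With a stop value of 0.2 s the model emits two "stopped" events, the first one in the middle of the sentence. With 0.4 s it emits one, after the trailing silence, at the cost of declaring the end 0.2 s later.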
userTurnStopTimeout live the safety net
What it is. A backstop timer: the maximum time the agent waits after the last speech-stopped event before it force-commits the turn, even if the turn-end strategies have not fired. It is the hard ceiling, not the usual trigger.
Runtime. Field at base_service.py:1508, applied in turn logic at base_service.py:1573 → LLMUserAggregatorParams.user_turn_stop_timeout. This is turn-taking logic, not part of VADParams.
The effective timeout is clamped to at least waitSeconds + 0.1 (base_service.py:1515-1523) so the safety timer can't beat the stop strategy and close a turn while STT is still in flight. If a tiny timeout doesn't seem to take effect, this clamp is why.
- When to change: raise alongside vadStopSecs for very pause-heavy callers; lower for a snappier commit when the VAD stop is reliable.
Symptom it fixes: "after I stop it takes forever, or it ends my turn even when VAD should still be listening" → the backstop interacting with vadStopSecs.
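The waitSeconds + 0.1 clamp mentioned above behaves like a floor on the configured value. A minimal sketch, assuming the clamp is a plain `max`; the actual code at base_service.py:1515-1523 is not shown here and may differ, and `effective_stop_timeout` is a hypothetical name.

```python
def effective_stop_timeout(user_turn_stop_timeout: float, wait_seconds: float) -> float:
    """Assumed clamp: the backstop can never fire sooner than waitSeconds + 0.1."""
    return max(user_turn_stop_timeout, wait_seconds + 0.1)


# A tiny configured timeout is raised to the floor...
print(effective_stop_timeout(0.2, wait_seconds=0.5))
# ...while a generous one passes through unchanged.
print(effective_stop_timeout(2.0, wait_seconds=0.5))
```

If a dashboard value below the floor appears to have no effect, comparing it against waitSeconds + 0.1 is the first thing to check.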
The applied VAD params are logged at base_service.py:1552, right after the VADParams object is built at :1546-1551. Grep the boot logs for the VAD params line: those printed values are the ones actually applied, so it's the fastest way to confirm a dashboard change propagated.
How the VAD knobs interact
Think of one caller utterance as a four-stage timeline. Each knob owns one stage:
- Speech start: VAD waits vadStartSecs of speech above vadConfidence / vadMinVolume before saying "started."
- Speech stop: VAD waits vadStopSecs of silence before saying "ended."
- Turn commit: if something stalls, userTurnStopTimeout force-ends the turn.
- Reply gate: the bot then waits waitSeconds (Part 5) before speaking.
So a mid-sentence cutoff is usually vadStopSecs too low (it called "ended" on a breath); userTurnStopTimeout is the ceiling, not the usual trigger. Tune vadStopSecs first; touch the timeout only for pathological cases. Step through the four stages below.
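The four stages can be laid out as timestamps. This is a deliberately simplified, additive model for building intuition, not the pipeline's scheduling logic: it assumes each delay simply stacks on the previous event. The `Knobs` class, the `timeline` helper, and the numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Knobs:
    # Illustrative values only, not documented defaults.
    vad_start_secs: float = 0.2
    vad_stop_secs: float = 0.4
    user_turn_stop_timeout: float = 1.0
    wait_seconds: float = 0.5


def timeline(speech_onset: float, last_speech: float, k: Knobs) -> dict:
    """Return when each stage would fire under the simplified additive model."""
    started = speech_onset + k.vad_start_secs   # stage 1: speech start
    stopped = last_speech + k.vad_stop_secs     # stage 2: speech stop
    return {
        "speech_started": started,
        "speech_stopped": stopped,
        # stage 3: the backstop counts from the speech-stopped event
        "force_commit_deadline": stopped + k.user_turn_stop_timeout,
        # stage 4: the bot waits waitSeconds before replying
        "earliest_reply": stopped + k.wait_seconds,
    }


for name, t in timeline(speech_onset=0.0, last_speech=2.0, k=Knobs()).items():
    print(f"{name}: {t:.2f} s")
```

In this model, raising vadStopSecs shifts every later stage by the same amount, while the force-commit deadline only matters when the normal path stalls.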
botSpeechGraceSecs live echo suppression
What it is. A short window at the start of each bot utterance during which user-turn-start is suppressed, so the bot's own audio leaking back through the line (AEC residue) doesn't register as the caller barging in.
Runtime. boot_steps.py:2557-2573 → CallActionProcessor(echo_grace_secs=...).
- When to change: raise to ~0.2–0.4 on speakerphone / hands-free / poor-AEC lines where the bot interrupts itself. Keep small — too large and you ignore genuine fast interruptions.
Symptom it fixes: "it stops itself right after it starts talking" / "half a second of its own echo triggers a barge-in" → set botSpeechGraceSecs to ~0.3.
The echo-grace timeline below shows a bot utterance with an echo blip looping back at 0.15 s. Widen the grace window to swallow the blip, but watch the verdict — too wide and you start blocking real callers.
Try it — echo-grace timeline
The bot starts speaking at t=0. An echo blip loops back at 0.15 s. The shaded window is botSpeechGraceSecs: any audio inside it is ignored as echo. Cover the blip and the false barge-in is suppressed; leave it short and the bot interrupts itself.
Echo grace vs. barge-in (the tension)
What it is. botSpeechGraceSecs and numberOfWords pull in opposite directions: grace suppresses early interrupts to defeat echo; a low numberOfWords invites interrupts. A larger grace suppresses more echo but also delays legitimate barge-ins — the caller literally cannot interrupt during the grace window.
The rule. On a speakerphone deployment, raise grace (~0.3) AND keep numberOfWords ≥ 2, so real interruptions still land (after the grace window, with enough words) while echo in the first ~300 ms is ignored. Measure your worst-case echo delay (often 0.1–0.2 s), then set grace just past it; going wider buys nothing but lost responsiveness. On a clean headset, leave it at 0.
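The tension can be made concrete with a toy check. This sketch assumes the grace window is a simple time comparison against the start of the bot utterance; `turn_start_allowed` is a hypothetical helper, and the 0.5 s interruption time is invented for the example (only the 0.15 s echo blip comes from the demo above).

```python
def turn_start_allowed(secs_since_bot_started: float, grace_secs: float) -> bool:
    """User-turn-start is ignored while still inside the grace window."""
    return secs_since_bot_started >= grace_secs


ECHO_BLIP_AT = 0.15       # echo loops back 0.15 s into the bot utterance
REAL_INTERRUPT_AT = 0.5   # hypothetical genuine barge-in by the caller

for grace in (0.0, 0.2, 0.6):
    echo_fires = turn_start_allowed(ECHO_BLIP_AT, grace)
    caller_heard = turn_start_allowed(REAL_INTERRUPT_AT, grace)
    print(f"grace={grace:.1f}s  echo triggers barge-in: {echo_fires}  "
          f"real interruption lands: {caller_heard}")
```

A grace of 0.0 lets the echo through, 0.2 swallows the echo while still hearing the caller, and 0.6 blocks both. That is the reason to set the window just past the measured echo delay and no wider.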
boot_steps.py:2557-2573 reads botSpeechGraceSecs into CallActionProcessor(echo_grace_secs=...). Default is 0.0 (no suppression), correct for headset/handset calls; raise to 0.2–0.4 s for speakerphone. Confirm the applied value via the boot logs.
Checkpoint (speakerphone caller): the bot starts a sentence, then immediately stops as if interrupted, but nobody spoke. Which knob, and roughly what value?
botSpeechGraceSecs, set to about 0.2–0.4. The bot's own first syllables are echoing back through the speakerphone and the AEC isn't fully removing them, so the pipeline reads it as a barge-in. The grace window ignores user-turn-start for that opening fraction of a second. Verify with the echo-grace demo above: a grace window past 0.15 s swallows the blip. Pair with WebRTC APM (Part 6) for the underlying echo.