Part 7 — STT: VAD / endpointing + echo suppression

Step 34

vadConfidence live

What it is. The probability threshold above which a single audio frame is classified as speech by the Silero VAD model. Lower is more sensitive (picks up quiet speech but also noise); higher is stricter (ignores noise but may miss soft-spoken callers).

Field. sttConfig.additionalSettings.vadConfidence, range 0.05–1.0, default 0.7.

Runtime. Carried into the VADParams object at base_service.py:1546-1551 → VADParams.confidence.

When to change: raise toward 0.8 if the agent barges in on background speech or TV noise; lower toward 0.5 if it misses soft-spoken callers.

Symptom it fixes: "it jumps in when there's people talking behind me" → raise vadConfidence. "it doesn't hear me when I talk quietly" → lower it.

Step 35

vadMinVolume live

What it is. A secondary gate on raw audio energy. A frame must clear both vadConfidence and vadMinVolume to count as speech, filtering low-energy hiss the model might otherwise score as speech.

Field. sttConfig.additionalSettings.vadMinVolume, range 0.0–1.0, default 0.4.

Runtime. Same VADParams block, base_service.py:1546-1551 → VADParams.min_volume.

When to change: raise if line hiss or static triggers false speech; lower if quiet callers get dropped.

Symptom it fixes: "static on the line opens a turn" → raise vadMinVolume.
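The two-gate rule can be sketched in a few lines. This is a minimal illustration, not the service's code: VadGate and is_speech_frame are hypothetical names, the sample scores are made up, and treating a value exactly at the threshold as passing is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VadGate:
    """Hypothetical stand-in for the two per-frame thresholds."""
    confidence: float = 0.7   # vadConfidence default
    min_volume: float = 0.4   # vadMinVolume default


def is_speech_frame(model_score: float, volume: float, gate: VadGate) -> bool:
    """A frame counts as speech only if it clears BOTH thresholds.

    Assumption: a value equal to the threshold passes.
    """
    return model_score >= gate.confidence and volume >= gate.min_volume


gate = VadGate()
# (model score, volume) pairs are illustrative values, not measurements.
frames = {
    "clear speech": (0.92, 0.65),
    "line hiss": (0.75, 0.10),        # model is fooled, energy gate rejects it
    "loud non-speech": (0.30, 0.80),  # energy is high, model gate rejects it
}
results = {name: is_speech_frame(s, v, gate) for name, (s, v) in frames.items()}
print(results)
```

Only the first frame passes both gates, which is why raising vadMinVolume helps with hiss even when the model score is high.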

Step 36

vadStartSecs live

What it is. How much continuous speech must accrue before VAD declares "speech started" — a debounce against blips (clicks, coughs, a single transient).

Field. sttConfig.additionalSettings.vadStartSecs, range 0.05–1.0, default 0.2.

Runtime. base_service.py:1546-1551 → VADParams.start_secs (specifically base_service.py:1549).

When to change: raise to ignore short noise bursts; lower for snappier detection of the caller's first word.

Symptom it fixes: "a cough makes it think I started talking" → raise vadStartSecs.
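To make the debounce concrete, here is a rough sketch that works on per-frame speech/non-speech flags. The function name, the 20 ms frame size, and the cough pattern are all assumptions for illustration; the real analyzer may count time differently.

```python
def speech_start_ms(frames, frame_ms, start_secs):
    """Return the time (ms) at which "speech started" would be declared.

    frames: per-frame booleans (True = frame classified as speech).
    Any non-speech frame resets the run, so short blips never accumulate.
    Returns None if no run of speech is long enough.
    """
    start_ms = round(start_secs * 1000)
    run_ms = 0
    for i, is_speech in enumerate(frames):
        run_ms = run_ms + frame_ms if is_speech else 0
        if run_ms >= start_ms:
            return (i + 1) * frame_ms
    return None


# Hypothetical audio: a 100 ms cough, 200 ms of silence, then 300 ms of speech.
frames = [True] * 5 + [False] * 10 + [True] * 15

at_default = speech_start_ms(frames, frame_ms=20, start_secs=0.2)
at_twitchy = speech_start_ms(frames, frame_ms=20, start_secs=0.05)
print(at_default, at_twitchy)
```

With the default 0.2 s the cough is ignored and the start lands inside the real speech; at 0.05 s the cough itself opens the turn.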

Step 37

vadStopSecs live: the core endpointing knob

What it is. How much silence must pass before VAD declares "speech ended." This is the single most important endpointing knob: too low and a natural mid-sentence pause is mistaken for turn-end (the agent jumps in); too high and the agent feels laggy.

Field. sttConfig.additionalSettings.vadStopSecs, range 0.05–1.0, default 0.2.

Runtime. base_service.py:1546-1551 → VADParams.stop_secs (specifically base_service.py:1550).

When to change: raise (0.4–0.6) to stop cutting off callers who pause mid-sentence; lower for snappier turn-handoff in clean, fast conversations.

Symptom it fixes: "it cuts me off when I pause for a breath" → raise vadStopSecs. This is the #1 fix for mid-sentence cutoffs.
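The cutoff can be reproduced with a small simulation. This is a sketch under stated assumptions: vad_events is a hypothetical helper, frames are 10 ms, and the utterance (1.0 s of speech, a 0.30 s pause, 1.0 s of speech, 1.0 s of trailing silence) is invented for illustration.

```python
def vad_events(frames, frame_ms, start_secs, stop_secs):
    """Replay per-frame speech flags and return (event, time_ms) pairs."""
    start_ms = round(start_secs * 1000)
    stop_ms = round(stop_secs * 1000)
    events = []
    speaking = False
    speech_run = 0   # ms of uninterrupted speech
    silence_run = 0  # ms of uninterrupted silence
    for i, is_speech in enumerate(frames):
        now = (i + 1) * frame_ms
        if is_speech:
            speech_run += frame_ms
            silence_run = 0
        else:
            silence_run += frame_ms
            speech_run = 0
        if not speaking and speech_run >= start_ms:
            speaking = True
            events.append(("started", now))
        elif speaking and silence_run >= stop_ms:
            speaking = False
            events.append(("stopped", now))
    return events


# 1.0 s speech, 0.30 s pause, 1.0 s speech, 1.0 s silence (10 ms frames).
utterance = [True] * 100 + [False] * 30 + [True] * 100 + [False] * 100

short_stop = vad_events(utterance, 10, start_secs=0.2, stop_secs=0.2)
long_stop = vad_events(utterance, 10, start_secs=0.2, stop_secs=0.4)
print(short_stop)
print(long_stop)
```

At 0.2 s the pause produces a "stopped" event mid-sentence and the utterance splits into two turns; at 0.4 s the pause is bridged and there is a single "stopped" event, 0.4 s after the caller actually finishes. That extra delay is the latency cost of the higher setting.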

The VAD playground below lets you feel exactly this tradeoff — a caller utterance with a 0.30 s mid-sentence pause, and a slider that decides whether that pause reads as "still thinking" or "done talking."

Try it — VAD playground

Watch the decision markers move

A fixed caller utterance: speech, a 0.30 s mid-sentence pause, more speech, then trailing silence. The waveform is fixed; your sliders move the decision markers. Teaching moment: raise vadStopSecs past 0.30 s and the agent stops cutting the caller off at the pause.

vadConfidence 0.70

vadMinVolume 0.40

vadStartSecs 0.20s

vadStopSecs 0.20s

userTurnStopTimeout 1.0s


Step 38

userTurnStopTimeout live: the safety net

What it is. A backstop timer: the maximum time the agent waits after the last speech-stopped event before it force-commits the turn, even if the turn-end strategies have not fired. It is the hard ceiling, not the usual trigger.

Field. sttConfig.additionalSettings.userTurnStopTimeout, range 0.5–5.0, default 1.0.

Runtime. Field at base_service.py:1508, applied in turn logic at base_service.py:1573 → LLMUserAggregatorParams.user_turn_stop_timeout. This is turn-taking logic, not part of VADParams.

Chat-mode clamp. In chat/text mode this is clamped to at least waitSeconds + 0.1 (base_service.py:1515-1523) so the safety timer can't beat the stop strategy and close a turn while STT is still in flight. If a tiny timeout doesn't seem to take effect, this clamp is why.
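The clamp amounts to a one-line max(). The sketch below only restates the rule described above; effective_stop_timeout is a hypothetical name, and leaving voice mode unclamped is an assumption drawn from the wording, not from the code at base_service.py:1515-1523.

```python
def effective_stop_timeout(user_turn_stop_timeout: float,
                           wait_seconds: float,
                           chat_mode: bool) -> float:
    """Return the timeout that would actually apply.

    In chat/text mode the value is raised to at least wait_seconds + 0.1.
    Assumption: outside chat mode the configured value is used as-is.
    """
    if chat_mode:
        return max(user_turn_stop_timeout, wait_seconds + 0.1)
    return user_turn_stop_timeout


clamped = effective_stop_timeout(0.5, wait_seconds=1.0, chat_mode=True)
untouched = effective_stop_timeout(2.0, wait_seconds=1.0, chat_mode=True)
voice = effective_stop_timeout(0.5, wait_seconds=1.0, chat_mode=False)
print(clamped, untouched, voice)
```

A configured 0.5 s with waitSeconds at 1.0 resolves to about 1.1 s in chat mode, which is the "my tiny timeout did nothing" case.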

When to change: raise alongside vadStopSecs for very pause-heavy callers; lower for a snappier commit when the VAD stop is reliable.

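Symptom it fixes: "after I stop it takes forever" → the backstop is likely what commits the turn; lower userTurnStopTimeout. "it ends my turn even when VAD should still be listening" → the backstop is likely firing too early; raise it alongside vadStopSecs.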

Ask Claude Code: "Show me the VAD params log line the agent prints at boot so I can confirm my change took effect." → The resolved params are logged at base_service.py:1552, right after the VADParams object is built at :1546-1551. Grep the boot logs for the VAD params line — those printed values are the ones actually applied, so it's the fastest way to confirm a dashboard change propagated.
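As a rough picture of what "resolved params" means, the sketch below reads the four VAD fields from a settings dict, falls back to the defaults listed in this part, and logs the result. Everything here is illustrative: ResolvedVad and resolve_vad are hypothetical names, clamping out-of-range values is an assumption (the real service may reject them instead), and the log line format is not the one base_service.py prints.

```python
import logging
from dataclasses import asdict, dataclass

# (default, low, high) per field, copied from the ranges in this part.
VAD_FIELDS = {
    "vadConfidence": (0.7, 0.05, 1.0),
    "vadMinVolume": (0.4, 0.0, 1.0),
    "vadStartSecs": (0.2, 0.05, 1.0),
    "vadStopSecs": (0.2, 0.05, 1.0),
}


@dataclass(frozen=True)
class ResolvedVad:
    """Hypothetical mirror of the four VADParams fields named above."""
    confidence: float
    min_volume: float
    start_secs: float
    stop_secs: float


def resolve_vad(additional_settings: dict) -> ResolvedVad:
    def pick(key: str) -> float:
        default, low, high = VAD_FIELDS[key]
        value = float(additional_settings.get(key, default))
        return min(max(value, low), high)  # assumption: clamp, don't reject

    return ResolvedVad(
        confidence=pick("vadConfidence"),
        min_volume=pick("vadMinVolume"),
        start_secs=pick("vadStartSecs"),
        stop_secs=pick("vadStopSecs"),
    )


logging.basicConfig(level=logging.INFO)
params = resolve_vad({"vadStopSecs": 0.5, "vadConfidence": 1.7})
logging.getLogger("vad_sketch").info("VAD params (illustrative): %s", asdict(params))
```

The point of checking the logged values rather than the dashboard is visible here: what gets applied is the resolved value, which can differ from what was typed.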

Step 39

How the VAD knobs interact

Think of one caller utterance as a four-stage timeline. Each knob owns one stage:

Speech start — VAD waits vadStartSecs of speech above vadConfidence/vadMinVolume before saying "started."
Speech stop — VAD waits vadStopSecs of silence before saying "ended."
Turn commit — if something stalls, userTurnStopTimeout force-ends the turn.
Reply gate — the bot then waits waitSeconds (Part 5) before speaking.

So a mid-sentence cutoff is usually vadStopSecs too low (it called "ended" on a breath); userTurnStopTimeout is the ceiling, not the usual trigger. Tune vadStopSecs first; touch the timeout only for pathological cases. Step through the four stages below.
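A back-of-the-envelope calculation of the stall case (the turn-end strategies never fire and the backstop commits the turn) can help when budgeting latency. The sketch follows the stage order as listed above and treats each stage as a simple additive delay, which is a simplification; TurnKnobs and worst_case_timeline are hypothetical names, and the 0.5 s waitSeconds value is a placeholder, since its default is not stated in this part.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnKnobs:
    vad_start_secs: float = 0.2
    vad_stop_secs: float = 0.2
    user_turn_stop_timeout: float = 1.0
    wait_seconds: float = 0.5  # placeholder value, see Part 5 for the real one


def worst_case_timeline(speech_begin: float, speech_end: float,
                        knobs: TurnKnobs) -> dict:
    """Latest time (s) each stage can occur if only the backstop fires."""
    started = speech_begin + knobs.vad_start_secs
    stopped = speech_end + knobs.vad_stop_secs
    commit_deadline = stopped + knobs.user_turn_stop_timeout
    latest_reply = commit_deadline + knobs.wait_seconds
    return {
        "started": started,
        "stopped": stopped,
        "commit_deadline": commit_deadline,
        "latest_reply": latest_reply,
    }


# Caller speaks from t=0.0 s to t=2.0 s.
timeline = worst_case_timeline(0.0, 2.0, TurnKnobs())
print(timeline)
```

Under these assumptions the ceiling is 1.7 s after the caller stops, and only 0.2 s of that comes from vadStopSecs; in the normal case the turn commits well before the deadline.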

Four-stage turn timeline

Step 40

botSpeechGraceSecs live: echo suppression

What it is. A short window at the start of each bot utterance during which user-turn-start is suppressed, so the bot's own audio leaking back through the line (AEC residue) doesn't register as the caller barging in.

Field. sttConfig.additionalSettings.botSpeechGraceSecs, range 0.0–1.0, default 0.0 (off).

Runtime. boot_steps.py:2557-2573 → CallActionProcessor(echo_grace_secs=...).

When to change: raise to ~0.2–0.4 on speakerphone / hands-free / poor-AEC lines where the bot interrupts itself. Keep it small: too large and you ignore genuine fast interruptions.

Symptom it fixes: "it stops itself right after it starts talking" / "half a second of its own echo triggers a barge-in" → set botSpeechGraceSecs to ~0.3.
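The grace window reduces to a single time comparison. The sketch below is an assumption about the shape of the check, not the CallActionProcessor code: user_turn_start_allowed is a hypothetical name, the echo blip at 0.15 s and the real interruption at 0.8 s are example timings, and allowing a start exactly at the window edge is a guess.

```python
def user_turn_start_allowed(secs_since_bot_started: float,
                            grace_secs: float) -> bool:
    """True if a user-turn-start at this moment would be honored.

    Anything inside the first grace_secs of the bot utterance is treated
    as echo and suppressed.
    """
    return secs_since_bot_started >= grace_secs


ECHO_BLIP = 0.15       # example: bot audio leaking back 150 ms after it starts
REAL_INTERRUPT = 0.80  # example: caller genuinely cuts in at 800 ms

for grace in (0.0, 0.3, 1.0):
    print(
        grace,
        "echo passes:", user_turn_start_allowed(ECHO_BLIP, grace),
        "caller passes:", user_turn_start_allowed(REAL_INTERRUPT, grace),
    )
```

At 0.0 the echo gets through (self-interruption), at 0.3 the echo is swallowed and the caller still lands, and at 1.0 both are blocked, which is the over-wide case to avoid.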

The echo-grace timeline below shows a bot utterance with an echo blip looping back at 0.15 s. Widen the grace window to swallow the blip, but watch the verdict — too wide and you start blocking real callers.

Try it — echo-grace timeline

Suppress the blip without blocking the caller

The bot starts speaking at t=0. An echo blip loops back at 0.15 s. The shaded window is botSpeechGraceSecs: any audio inside it is ignored as echo. Cover the blip and the false barge-in is suppressed; leave it short and the bot interrupts itself.

botSpeechGraceSecs 0.00s


Step 41

Echo grace vs. barge-in (the tension)

What it is. botSpeechGraceSecs and numberOfWords pull in opposite directions: grace suppresses early interrupts to defeat echo; a low numberOfWords invites interrupts. A larger grace suppresses more echo but also delays legitimate barge-ins — the caller literally cannot interrupt during the grace window.

conceptual — no single field

The rule. On a speakerphone deployment, raise grace (~0.3) AND keep numberOfWords ≥ 2, so real interruptions still land (after the grace window, with enough words) while echo in the first ~300 ms is ignored. Measure your worst-case echo delay (often 0.1–0.2 s), then set grace just past it; going wider buys nothing but lost responsiveness. On a clean headset, leave it at 0.
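The rule can be turned into two small helpers: one that sizes the grace window from measured echo delays, and one that combines the window with the word-count gate. Both are sketches under assumptions: the names are hypothetical, the 0.05 s margin is an arbitrary illustration of "just past it", and numberOfWords is modeled simply as a minimum count of words heard.

```python
def recommend_grace(echo_delays, margin=0.05, max_grace=1.0):
    """Suggest botSpeechGraceSecs: worst measured echo delay plus a margin.

    Returns 0.0 when no echo was measured (clean headset case).
    The result is capped at the field's documented maximum of 1.0.
    """
    if not echo_delays:
        return 0.0
    return round(min(max(echo_delays) + margin, max_grace), 2)


def barge_in_lands(secs_since_bot_started, words_heard,
                   grace_secs, number_of_words):
    """An interruption counts only after the grace window AND with enough words."""
    return secs_since_bot_started >= grace_secs and words_heard >= number_of_words


# Example measurements (seconds) from a speakerphone line.
grace = recommend_grace([0.12, 0.18, 0.15])
print(grace)

print(barge_in_lands(0.10, 3, grace, number_of_words=2))  # inside the window
print(barge_in_lands(0.60, 3, grace, number_of_words=2))  # real interruption
print(barge_in_lands(0.60, 1, grace, number_of_words=2))  # too few words
```

Sizing from the worst measured delay keeps the window as narrow as the line allows, so the caller loses as little interruption time as possible.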

Ask Claude Code: "Where in the pipeline is the bot-speech grace window applied, and what's its default?" → It's wired in at boot_steps.py:2557-2573, reading botSpeechGraceSecs into CallActionProcessor(echo_grace_secs=...). Default is 0.0 (no suppression), correct for headset/handset calls; raise to 0.2–0.4 s for speakerphone. Confirm the applied value via the boot logs.

Checkpoint — speakerphone caller: the bot starts a sentence, then immediately stops as if interrupted, but nobody spoke. Which knob, and roughly what value?

botSpeechGraceSecs, set to about 0.2–0.4. The bot's own first syllables are echoing back through the speakerphone and the AEC isn't fully removing them, so the pipeline reads the echo as a barge-in. The grace window ignores user-turn-start for that opening fraction of a second. Verify with the echo-grace demo above: a grace window past 0.15 s swallows the blip. Pair it with WebRTC APM (Part 6) to reduce the underlying echo itself.