Part 1

The voice pipeline mental model

Before any single knob makes sense, you need the path of one conversational turn, because every later setting lives at one stage of this path (or, occasionally, a small set of stages).

Step 1

The path of one turn

A single back-and-forth runs through one ordered chain of processors, assembled in pipecat-agent src/core/boot_steps.py:2648-2670 as a single Pipeline([...]). The caller's audio is handed from one stage to the next like a relay baton:

  1. Caller audio in
  2. DTMF aggregation (keypad digits)
  3. Audio-in filter (WebRTC APM: noise suppression, AGC, echo cancel)
  4. VAD (is this speech, and when did it start and stop)
  5. STT (audio to words plus per-word confidence)
  6. Voicemail detect/gate
  7. User turn aggregation (the start/stop speaking strategies and smart-turn)
  8. LLM / workflow (routing plus response generation)
  9. Echo grace / CallAction step
  10. TTS (text to speech, with normalization and substitutions)
  11. Background sound mix
  12. Caller audio out

Nothing skips ahead and nothing runs out of order.

Two cross-cutting controllers sit beside the chain, not in it. The silence/idle handler (src/core/processors/idle_handler.py) watches for the caller going quiet, and the call-duration monitor (call_timeout_monitor.py, an observer) caps total call length. They are not stages the audio flows through; they observe the call from the side.

Keep this diagram in your head. When a customer reports a problem, your first move is to locate which stage is misbehaving, then which knob tunes that stage. Because a setting can only affect the stages downstream of where it lives, knowing where a knob sits tells you the range of symptoms it can possibly fix.
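The "downstream only" rule can be sketched in a few lines of Python. This is an illustration, not code from pipecat-agent: the stage labels are the ones used in this guide rather than real processor class names, and stages_affected_by and can_affect are hypothetical helpers.

```python
# Stage labels follow the order described in this guide. They are
# illustrative names, not the actual processor classes in boot_steps.py.
PIPELINE_STAGES = [
    "caller audio in",
    "DTMF aggregation",
    "audio-in filter",
    "VAD",
    "STT",
    "voicemail detect/gate",
    "user turn aggregation",
    "LLM / workflow",
    "echo grace / CallAction",
    "TTS",
    "background sound mix",
    "caller audio out",
]


def stages_affected_by(stage: str) -> list[str]:
    """Return the stage itself plus every stage downstream of it."""
    index = PIPELINE_STAGES.index(stage)  # raises ValueError for unknown names
    return PIPELINE_STAGES[index:]


def can_affect(knob_stage: str, symptom_stage: str) -> bool:
    """True if a knob living at knob_stage could influence symptom_stage."""
    return symptom_stage in stages_affected_by(knob_stage)


if __name__ == "__main__":
    print(can_affect("VAD", "LLM / workflow"))  # a VAD knob sits upstream of the LLM
    print(can_affect("TTS", "VAD"))             # a TTS knob cannot reach back to VAD
```

Used this way, the list doubles as a quick filter: if the misbehaving stage is upstream of a knob, that knob is ruled out before you touch it.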

Pipeline stage map

Each stage maps to a settings group and to the pipecat-agent file:line where that processor is built.

Ask Claude Code: "Open src/core/boot_steps.py around line 2648 and show me the exact ordered list of processors the Pipeline([...]) is constructed from. Confirm where VAD, STT, turn aggregation, the LLM, the echo-grace CallAction, and TTS each get inserted, and quote the line numbers so I can cite them."
Step 2

Which settings group lives at which stage

Every voice setting belongs to one stage (or a small set of stages) in that chain. The dashboard scatters these across its panels, but the mapping below is the real structure. Learn it once and the dashboard stops feeling random.

Pipeline stage | Settings group | Guide part
Audio-in filter | STT audio filters (webrtcApmEnabled, aicEnabled) | Part 6
VAD | STT VAD / endpointing (vadConfidence, vadStartSecs, ...) | Part 7
STT | STT core + keyterm biasing | Parts 5, 8
Turn aggregation | TTS speaking plans + smart-turn | Parts 3, 6
Echo grace | botSpeechGraceSecs | Part 7
TTS | TTS voice / model + normalization | Parts 2, 4
Background mix | backgroundSoundUrl | Part 4
Idle handler | call-level silence timeout | Part 9
Duration monitor | call-level max duration | Part 9
DTMF / Voicemail | call-level controls | Part 9

The labeling quirk to internalize early: turn-taking and barge-in live under the TTS panel (the "Start/Stop Speaking Plan"), even though they are fundamentally about listening to the caller. They sit there because they govern when the bot's speech stops. Do not go hunting in STT for barge-in tuning.
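One way to keep the stage-to-settings mapping at hand is a small lookup, sketched below. It only includes the setting names quoted in the table; the dictionary and the stage_for_setting helper are hypothetical and do not exist in pipecat-agent.

```python
from typing import Optional

# Only settings named explicitly in the table are listed. The VAD group
# has more knobs than shown here (the table elides them with "...").
SETTINGS_BY_STAGE = {
    "audio-in filter": ["webrtcApmEnabled", "aicEnabled"],
    "VAD": ["vadConfidence", "vadStartSecs"],
    "echo grace": ["botSpeechGraceSecs"],
    "background mix": ["backgroundSoundUrl"],
}


def stage_for_setting(setting: str) -> Optional[str]:
    """Return the pipeline stage a setting belongs to, or None if unlisted."""
    for stage, settings in SETTINGS_BY_STAGE.items():
        if setting in settings:
            return stage
    return None


if __name__ == "__main__":
    print(stage_for_setting("botSpeechGraceSecs"))
    print(stage_for_setting("someUnknownSetting"))
```

Returning None for an unlisted name, rather than guessing, is deliberate: a setting missing from the map is a prompt to go and find its stage, not to assume one.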
Step 3

Vocabulary

Six terms come up in every part of this guide. You do not need to memorize them now; this is the reference to return to when you hit "turn detection" in Part 7 or "logit bias" in Part 8.

Core terms
  • VAD (Voice Activity Detection): a model (here, Silero) that answers "is this frame speech or not?" frame by frame. It does not transcribe; it gates.
  • Endpointing: deciding when a turn ends — how much silence means "the caller is done talking." Driven by vadStopSecs plus the turn-stop timeout.
  • Barge-in: the caller interrupting the bot while it is still speaking. Governed by the Stop Speaking Plan.
  • Turn detection / turn-taking: the broader decision of whose turn it is. "Smart turn" is an AI model that predicts end-of-turn from prosody instead of pure silence timing.
  • Logit bias / keyterm boost: nudging the STT decoder toward specific words (e.g. a brand name, "Fibabanka") by adding bias to those tokens' logits.
  • Normalization: rewriting text before TTS so it is spoken naturally ("user@x.com" becomes "user at x dot com"; a TC number grouped into speakable chunks).
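
To make the last term concrete, here is a minimal sketch of email normalization in Python. It reproduces the "user@x.com" example above and nothing more; the regular expression and the speak_emails function are assumptions for illustration, not the normalizer the TTS stage actually uses.

```python
import re

# Deliberately simple pattern: a local part, "@", and a dotted domain.
# Real-world email matching is more involved; this is only a sketch.
EMAIL_RE = re.compile(r"\b([\w.+-]+)@([\w-]+(?:\.[\w-]+)+)\b")


def speak_emails(text: str) -> str:
    """Rewrite email addresses so a TTS engine reads them naturally."""

    def _spell(match: re.Match) -> str:
        local = match.group(1).replace(".", " dot ")
        domain = match.group(2).replace(".", " dot ")
        return f"{local} at {domain}"

    return EMAIL_RE.sub(_spell, text)


if __name__ == "__main__":
    print(speak_emails("user@x.com"))
```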
Step 4

A worked example: "it cuts me off mid-sentence"

Trace the symptom through the diagram. The caller is mid-sentence; the bot starts talking over them. That is a turn-end mistake — the pipeline decided the caller's turn ended too early. The candidate stages are all upstream of the LLM:

  1. VAD stop too eager: vadStopSecs too low, so a natural pause reads as "done." (Part 7)
  2. Turn-stop timeout too short: userTurnStopTimeout fires before the caller resumes. (Part 7)
  3. Smart-turn off when the caller pauses a lot: pure-silence endpointing cannot tell a "thinking pause" from "done." (Part 6)

Notice the fix is never "rewrite the prompt." The complaint mentions the bot talking, so it is tempting to reach for the LLM prompt or the TTS voice, but neither of those decided when to start — the decision was made upstream, the moment VAD called silence and turn aggregation handed a completed turn to the LLM. This symptom-to-stage discipline is the core lesson the whole guide reinforces. (For the opposite symptom — the bot will not let the caller interrupt — you would look at the Stop Speaking Plan in Part 3.) The timeline below makes the handoff concrete.

Turn timeline

One caller utterance on a 0–5s axis: a speech band, a silence gap, then the moment VAD declares the turn ended and the moment the bot starts speaking.

Static, illustrative values (vadStopSecs 0.2, waitSeconds 0.4). Later parts make these knobs movable so you can watch the markers slide.
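
The same timeline can be approximated in code. The sketch below is a toy model under loud assumptions: the turn ends once silence has lasted vadStopSecs, and the bot starts speaking waitSeconds after that. The real pipeline also involves the turn-stop timeout and smart-turn, which are ignored here, and turn_markers is a hypothetical helper, not a pipecat-agent function.

```python
from dataclasses import dataclass


@dataclass
class TurnMarkers:
    turn_end: float   # when silence-based endpointing declares the turn over
    bot_start: float  # when the bot begins speaking (assumed: turn_end + wait)
    cut_off: bool     # True if the caller still had speech to come


def turn_markers(speech_segments, vad_stop_secs=0.2, wait_seconds=0.4):
    """Place the two timeline markers for one caller utterance.

    speech_segments is an ordered list of (start, end) times in seconds;
    a gap between two segments is a pause inside the caller's sentence.
    """
    if not speech_segments:
        raise ValueError("need at least one speech segment")
    for i, (_, end) in enumerate(speech_segments):
        is_last = i == len(speech_segments) - 1
        # A pause at least as long as vad_stop_secs ends the turn early.
        if is_last or speech_segments[i + 1][0] - end >= vad_stop_secs:
            turn_end = end + vad_stop_secs
            return TurnMarkers(turn_end, turn_end + wait_seconds, not is_last)


if __name__ == "__main__":
    # The caller pauses for about 0.3 s mid-sentence (2.0 s to 2.3 s).
    utterance = [(0.5, 2.0), (2.3, 3.5)]
    print(turn_markers(utterance))                     # illustrative defaults
    print(turn_markers(utterance, vad_stop_secs=0.5))  # more patient endpointing
```

With the illustrative defaults, the 0.3 s pause exceeds vadStopSecs and the model reports cut_off=True; raising vad_stop_secs above the pause length lets the caller finish, at the cost of a later bot_start.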

Ask Claude Code: "A caller says the agent cuts them off mid-sentence. Using the pipeline order in boot_steps.py:2648-2670, map that symptom to a stage and tell me which knobs to look at. I expect the answer to land on the VAD plus turn-detection stage (base_service.py:1546-1551 for VAD params, the user-turn strategies in create_user_turn_params), not on the LLM or TTS. Confirm and point me at the specific vadStopSecs stop-delay setting."