Part 4

TTS: speech normalization, substitutions, background sound

How raw text becomes speakable text before it ever reaches the voice model: rewriting emails and URLs, optionally speaking digit strings the Turkish way, your own regex find/replace rules, and a background audio bed. This part owns "it spelled out the URL", "it read my TC number as one huge number", and "it mispronounces our brand name".

Where these run. TTS text passes through an ordered filter chain built in base_service.py:842-860: MarkdownTextFilter() -> SpeechTextFilter(language, substitutions, 6 toggles) -> LanguageTextFilter (unless language is multi). The six normalization toggles live under ttsConfig.speechNormalization (zod speechNormalizationSchema, validators.ts:87-94). The layers run in order: structural email/URL/emoji handling, then your custom substitutions, then number normalization.
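As a quick reference for the six toggles and their defaults, here is a minimal Python sketch. The class name and snake_case field names are hypothetical stand-ins for the camelCase keys under ttsConfig.speechNormalization; the real schema is the zod definition in validators.ts, and only the defaults stated in this part are assumed.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SpeechNormalization:
    """Hypothetical Python mirror of ttsConfig.speechNormalization."""

    # Language-agnostic toggles, on by default.
    emails: bool = True
    urls: bool = True
    emojis: bool = True
    # Turkish-only toggles, off by default.
    phone_numbers: bool = False
    identity_numbers: bool = False
    general_numbers: bool = False


defaults = SpeechNormalization()
# asdict keeps field declaration order, so this lists the default-on toggles.
enabled = [name for name, on in asdict(defaults).items() if on]
print(enabled)
```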

Step 17

The always-on filter chain (live)

What it is. Every TTS utterance is rewritten before synthesis. This is not one toggle but the pipeline that hosts all the toggles below. Markdown is stripped first, then SpeechTextFilter applies normalization and substitutions, then per-language fixups.

ttsConfig.speechNormalization.* · type: object of bools · chain: markdown → speech → language

Runtime. Chain assembled in base_service.py:842-860; the speech filter implementation lives in speech_text_filter.py. Because filtering happens before synthesis, no TTS provider sees the raw "https://" or the bare emoji.

Symptom it owns: any "the bot spoke the literal characters" complaint traces back to a normalizer being off (or the text being shaped so the regex did not match).
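The ordering described in this step can be pictured as a list of text-to-text functions applied in sequence. This is a minimal sketch, not the code in base_service.py: the three functions are toy stand-ins for MarkdownTextFilter, SpeechTextFilter, and LanguageTextFilter, and build_chain and run_chain are hypothetical names.

```python
import re
from typing import Callable, List

TextFilter = Callable[[str], str]


def strip_markdown(text: str) -> str:
    """Toy stand-in for MarkdownTextFilter: drops heading and emphasis markers."""
    text = re.sub(r"^#+\s*", "", text, flags=re.MULTILINE)
    return re.sub(r"[*_`]+", "", text)


def speech_normalize(text: str) -> str:
    """Toy stand-in for SpeechTextFilter: only strips a URL scheme here."""
    return re.sub(r"https?://", "", text)


def language_fixups(text: str) -> str:
    """Toy stand-in for LanguageTextFilter: only trims whitespace here."""
    return text.strip()


def build_chain(language: str) -> List[TextFilter]:
    chain: List[TextFilter] = [strip_markdown, speech_normalize]
    # The per-language filter is skipped when language is "multi".
    if language != "multi":
        chain.append(language_fixups)
    return chain


def run_chain(text: str, chain: List[TextFilter]) -> str:
    # Each filter receives the output of the previous one.
    for text_filter in chain:
        text = text_filter(text)
    return text


print(run_chain("**Visit** https://example.com ", build_chain("tr")))
```

Because every filter sees the previous filter's output, a normalizer that is off or fails to match passes the literal characters straight through to synthesis.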

Step 18

Emails, URLs, emojis (live)

What it is. Rewrite emails (user@example.com → "user at example dot com") and URLs (strip the scheme, turn "." into "dot"), and strip emojis so the model does not try to vocalize them.

ttsConfig.speechNormalization.emails, ttsConfig.speechNormalization.urls, ttsConfig.speechNormalization.emojis · type: bool · default: all true

Runtime. base_service.py:849-851 wires the flags; implementation at speech_text_filter.py:139-141. These three are language-agnostic and on by default.

Symptom it fixes: "it literally said h-t-t-p-s colon slash slash" → confirm URL normalization is on and the text is shaped so the matcher catches it.
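To make "shaped so the matcher catches it" concrete, here is a rough sketch of the three language-agnostic rewrites. The regexes and function names are assumptions made for illustration; the real patterns live in speech_text_filter.py and may match differently (for example on bare domains or on a wider emoji range).

```python
import re

EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
# Assumption: only text starting with a scheme or "www." counts as a URL.
URL_RE = re.compile(r"\b(?:https?://|www\.)[^\s<>]+", re.IGNORECASE)
# Rough emoji coverage: a few common Unicode blocks, not the full set.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")


def speak_email(match: re.Match) -> str:
    local, domain = match.group(0).split("@", 1)
    return f"{local.replace('.', ' dot ')} at {domain.replace('.', ' dot ')}"


def speak_url(match: re.Match) -> str:
    raw = match.group(0)
    core = raw.rstrip(".,;:!?")  # keep sentence punctuation out of the URL
    tail = raw[len(core):]
    core = re.sub(r"^https?://", "", core, flags=re.IGNORECASE)
    return core.replace(".", " dot ") + tail


def normalize(text: str, *, emails: bool = True, urls: bool = True,
              emojis: bool = True) -> str:
    if emails:
        text = EMAIL_RE.sub(speak_email, text)
    if urls:
        text = URL_RE.sub(speak_url, text)
    if emojis:
        text = EMOJI_RE.sub("", text)
    # Collapse whitespace left behind by removed emojis.
    return re.sub(r"\s{2,}", " ", text).strip()


print(normalize("Mail user@example.com or see https://example.com/help \U0001F642"))
```

In this sketch a URL written without a scheme or "www." would not match at all, which is one way text can slip past a normalizer that is switched on.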

Step 19

Phone / TC-identity / general numbers (live, Turkish-only)

What it is. Speak digit strings naturally instead of as one run: phone numbers grouped 3-3-2-2, 11-digit TC identity numbers grouped 3-3-3-2, and general numbers via Turkish num2words.

ttsConfig.speechNormalization.phoneNumbers, ttsConfig.speechNormalization.identityNumbers, ttsConfig.speechNormalization.generalNumbers · type: bool · default: all false

Runtime. base_service.py:852-854; implementation speech_text_filter.py:142-144 via _tr_digits_to_words. The runtime only implements Turkish, and the dashboard disables these three toggles unless language === "tr" (tts-config-panel.tsx:250-251).

Off by default for a reason. The previous always-on behavior ate leading zeros inside TC and phone groups (see the comment in validators.ts:85). Enable these deliberately, only for flows that read numbers back to the caller, and only with language = tr.

Symptom it fixes: "it read my TC number as one enormous number / dropped the leading zero" → set language = tr and enable identityNumbers.
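The grouping and the leading-zero pitfall can be sketched as follows. This is an illustration under assumptions, not the behavior of _tr_digits_to_words: the helper names are hypothetical, the small word table replaces num2words, and voicing each leading zero as "sıfır" is one possible way to avoid dropping it.

```python
ONES = ["", "bir", "iki", "üç", "dört", "beş", "altı", "yedi", "sekiz", "dokuz"]
TENS = ["", "on", "yirmi", "otuz", "kırk", "elli", "altmış", "yetmiş", "seksen", "doksan"]

PHONE_GROUPS = (3, 3, 2, 2)   # 10 digits
TC_ID_GROUPS = (3, 3, 3, 2)   # 11 digits


def tr_under_1000(n: int) -> str:
    """Turkish words for 0..999."""
    if n == 0:
        return "sıfır"
    hundreds, rest = divmod(n, 100)
    tens, ones = divmod(rest, 10)
    parts = []
    if hundreds:
        # 100 is "yüz"; 200 is "iki yüz".
        parts.append("yüz" if hundreds == 1 else f"{ONES[hundreds]} yüz")
    if tens:
        parts.append(TENS[tens])
    if ones:
        parts.append(ONES[ones])
    return " ".join(parts)


def speak_group(group: str) -> str:
    """Speak one group, voicing leading zeros that int() would drop."""
    stripped = group.lstrip("0")
    words = ["sıfır"] * (len(group) - len(stripped))
    if stripped:
        words.append(tr_under_1000(int(stripped)))
    return " ".join(words)


def speak_digits(digits: str, sizes: tuple) -> str:
    if not (digits.isascii() and digits.isdigit()) or len(digits) != sum(sizes):
        return digits  # leave anything unexpected untouched
    groups, pos = [], 0
    for size in sizes:
        groups.append(speak_group(digits[pos:pos + size]))
        pos += size
    return ", ".join(groups)


print(speak_digits("10203040506", TC_ID_GROUPS))
print(speak_digits("5320012345", PHONE_GROUPS))
```

Converting a group such as "030" with int() alone yields 30, which is the kind of silent leading-zero loss the warning above describes.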


Step 20

Custom text substitutions (live)

What it is. Your own ordered regex find/replace rules, applied before TTS — e.g. force "Fibabanka" → "Fiba banka", or expand an abbreviation the model mangles.

ttsConfig.textSubstitutions[] · shape: {pattern, replacement, caseInsensitive?} · type: array · default: language-specific

Runtime. Wired at base_service.py:847; compiled at speech_text_filter.py:179-221 and applied sequentially (top to bottom) at :308-323. Uses the ReDoS-safe regex engine with a 0.1s-per-rule timeout, so a runaway pattern cannot hang the call.

Symptom it fixes: "it mispronounces our company name / a product term" → add a substitution. Fastest pronunciation fix, no model change.
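A minimal sketch of ordered substitution rules, assuming the {pattern, replacement, caseInsensitive?} shape above. It uses Python's standard re module, so the 0.1s-per-rule ReDoS guard is not reproduced here; compile_rules and apply_rules are hypothetical names, and collecting errors instead of raising is an assumption about how bad patterns might be handled.

```python
import re
from typing import List, Tuple, TypedDict


class Rule(TypedDict, total=False):
    pattern: str
    replacement: str
    caseInsensitive: bool


def compile_rules(rules: List[Rule]) -> Tuple[List[Tuple[re.Pattern, str]], List[str]]:
    """Compile rules in order; collect errors instead of raising."""
    compiled, errors = [], []
    for index, rule in enumerate(rules):
        flags = re.IGNORECASE if rule.get("caseInsensitive") else 0
        try:
            compiled.append((re.compile(rule["pattern"], flags), rule["replacement"]))
        except re.error as exc:
            errors.append(f"rule {index}: {exc}")
    return compiled, errors


def apply_rules(text: str, compiled: List[Tuple[re.Pattern, str]]) -> str:
    # Top to bottom: each rule sees the output of the rules before it.
    for pattern, replacement in compiled:
        text = pattern.sub(replacement, text)
    return text


rules: List[Rule] = [
    {"pattern": r"\bFibabanka\b", "replacement": "Fiba banka", "caseInsensitive": True},
    {"pattern": "(", "replacement": "x"},  # invalid on purpose
    {"pattern": r"\bvb\.", "replacement": "ve benzeri"},
]
compiled, errors = compile_rules(rules)
print(apply_rules("FIBABANKA kredi, mevduat vb.", compiled))
print(errors)
```

Since rules run top to bottom, a later rule can match text that an earlier rule produced, so order matters when patterns overlap.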


Ask Claude Code: "In pipecat-agent, show me speech_text_filter.py:179-221 and :308-323: how are textSubstitutions compiled, and what enforces the 0.1s-per-rule timeout?"

Step 21

Background sound (live)

What it is. A looping ambient audio bed (e.g. a subtle call-center murmur) mixed under the bot's voice so the silence between sentences does not feel sterile or obviously synthetic.

backgroundSoundUrl · scope: top-level agentConfig · type: string | null · max: 200 chars · default: null

Runtime. base_service.py:1386-1392 constructs a ResamplingSoundfileMixer(volume=1.2) that loops the file under the TTS output. Note this is top-level, not under ttsConfig. The audio is uploaded via dedicated audio-asset endpoints, not pasted raw.

Symptom it fixes: "the total silence between sentences feels unnatural / makes it obvious it's a bot."
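Conceptually, mixing a looping bed under the voice means adding a wrapped-around bed sample to every output sample, including the silent ones. The sketch below shows that idea on plain 16-bit integer samples. It does not use or describe the ResamplingSoundfileMixer API; the function name, the offset parameter, and the clipping strategy are assumptions, and only the 1.2 volume figure comes from the text above.

```python
from typing import List, Sequence

INT16_MIN, INT16_MAX = -32768, 32767


def mix_with_looping_bed(voice: Sequence[int], bed: Sequence[int],
                         volume: float = 1.2, offset: int = 0) -> List[int]:
    """Add a looping, volume-scaled bed under 16-bit voice samples."""
    if not bed:
        return list(voice)
    mixed = []
    for i, sample in enumerate(voice):
        # Wrap around the bed so a short clip loops indefinitely.
        bed_sample = bed[(offset + i) % len(bed)]
        value = int(round(sample + bed_sample * volume))
        # Clip to the int16 range instead of overflowing.
        mixed.append(max(INT16_MIN, min(INT16_MAX, value)))
    return mixed


voice = [0, 0, 1000, 0, 32767]   # zeros model the silence between sentences
bed = [100, -100]                # two-sample bed, looped
print(mix_with_looping_bed(voice, bed))
```

The zero-valued voice samples still come out non-zero, which is the effect the feature is after: gaps between sentences carry ambience rather than digital silence.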

Ask Claude Code: "Find every read of backgroundSoundUrl in pipecat-agent and confirm the ResamplingSoundfileMixer volume, then show me the dashboard validator that caps it at 200 chars."

Checkpoint

Customer says the agent reads back IBANs and TC numbers as a single huge number. Two things to set?

1. Set ttsConfig.language = "tr". The number normalizers are gated on Turkish, both in the UI and at runtime.
2. Enable speechNormalization.identityNumbers (and generalNumbers for the IBAN's long numeric run). Both are off/unset by default.

If language is not tr, these toggles are disabled in the UI (tts-config-panel.tsx:250-251) and ignored at runtime, so the language change is the prerequisite, not optional.
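The checkpoint answer can be restated as a small gating function. This is a sketch of the rule as described in this part, not the runtime code: effective_normalization is a hypothetical helper, and the config is modeled as a plain dict using the key names from the text.

```python
NUMERIC_TOGGLES = ("phoneNumbers", "identityNumbers", "generalNumbers")


def effective_normalization(tts_config: dict) -> dict:
    """Return the toggles that would actually apply for this config."""
    toggles = dict(tts_config.get("speechNormalization", {}))
    if tts_config.get("language") != "tr":
        # The numeric normalizers are Turkish-only, so they are treated as off.
        for name in NUMERIC_TOGGLES:
            toggles[name] = False
    return toggles


wanted = {"identityNumbers": True, "generalNumbers": True}
print(effective_normalization({"language": "en", "speechNormalization": wanted}))
print(effective_normalization({"language": "tr", "speechNormalization": wanted}))
```

With language set to anything other than tr, enabling the toggles changes nothing, which is why the language setting comes first.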