Speech Synthesis Markup Language (SSML)

Speech Synthesis Markup Language (SSML) is an XML-based markup language that gives you granular control over speech output. With SSML tags you can adjust pitch, rate, volume, pauses, emphasis, and pronunciation so that the audio sounds more natural and engaging.

Wrap every SSML document in a root <speak> element, which encloses the content to be synthesized:

<speak>Your content to be synthesized here</speak>

Escaping Characters

Transforming text into SSML requires escaping certain characters to ensure correct interpretation:

Character	Escaped Form
`&`	`&amp;`
`>`	`&gt;`
`<`	`&lt;`
`"`	`&quot;`
`'`	`&apos;`

<!-- Original: Some "text" with 5 < 6 & 4 > 8 in it -->
<speak>Some &quot;text&quot; with 5 &lt; 6 &amp; 4 &gt; 8 in it</speak>
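The escaping step can also be done programmatically. The following is a minimal Python sketch using the standard library's `xml.sax.saxutils.escape`; the helper name `to_ssml` is an assumption made for illustration and is not part of any Speechify SDK.

```python
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; quotes must be mapped explicitly.
_QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def to_ssml(text: str) -> str:
    """Escape the five reserved characters and wrap the result in <speak>.

    Hypothetical helper for illustration only.
    """
    return f"<speak>{escape(text, _QUOTE_ENTITIES)}</speak>"


print(to_ssml('Some "text" with 5 < 6 & 4 > 8 in it'))
```

Escape plain text before adding any SSML tags around it; running the helper over text that already contains tags would escape the tags as well.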

Supported SSML Tags

prosody

The prosody tag controls the expressiveness of synthesized speech by manipulating pitch, rate, and volume.

<speak>
    This is a normal speech pattern.
    <prosody pitch="high" rate="fast" volume="+20%">
        I'm speaking with a higher pitch, faster than usual, and louder!
    </prosody>
    Back to normal speech pattern.
</speak>

Parameters

pitch

string

Adjusts the pitch of speech delivery.

Values:

x-low, low, medium (default), high, x-high
Percentage adjustments: -83% to +100% (e.g., +20%, -30%)

rate

string

Alters speech speed.

Values:

x-slow, slow, medium (default), fast, x-fast
Percentage adjustments: -50% to +9900% (e.g., +20%, -30%)

volume

string

Controls speech loudness.

Values:

silent, x-soft, medium (default), loud, x-loud
Decibel adjustments: Number with dB suffix (e.g., -6dB)
Percentage adjustments (e.g., +20%, -30%)
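The value rules above can be checked on the client before a request is sent. The sketch below is one possible way to do that in Python; all function and constant names are hypothetical. It assumes that percentages always carry an explicit sign, as in the examples, and that the sign on a decibel value is optional. Neither assumption is confirmed by the reference above.

```python
import re

# Keyword values as listed in the parameter reference above.
PITCH_KEYWORDS = {"x-low", "low", "medium", "high", "x-high"}
RATE_KEYWORDS = {"x-slow", "slow", "medium", "fast", "x-fast"}
VOLUME_KEYWORDS = {"silent", "x-soft", "medium", "loud", "x-loud"}

# Assumption: percentages are signed ("+20%"); the dB sign is optional.
_PERCENT = re.compile(r"[+-]\d+(?:\.\d+)?%")
_DECIBEL = re.compile(r"[+-]?\d+(?:\.\d+)?dB")


def _percent_in_range(value: str, low: float, high: float) -> bool:
    """True if value looks like '+20%' and lies within [low, high]."""
    return bool(_PERCENT.fullmatch(value)) and low <= float(value[:-1]) <= high


def is_valid_pitch(value: str) -> bool:
    return value in PITCH_KEYWORDS or _percent_in_range(value, -83, 100)


def is_valid_rate(value: str) -> bool:
    return value in RATE_KEYWORDS or _percent_in_range(value, -50, 9900)


def is_valid_volume(value: str) -> bool:
    # No percentage bounds are listed for volume, so only the format is checked.
    return (
        value in VOLUME_KEYWORDS
        or bool(_DECIBEL.fullmatch(value))
        or bool(_PERCENT.fullmatch(value))
    )
```

For example, `is_valid_pitch("-90%")` is rejected because it falls below the documented lower bound of -83%, while `is_valid_rate("+9900%")` is accepted.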

break

The break tag inserts a pause between words, following the W3C SSML specification.

<speak>
    Sometimes it can be useful to add a longer pause at the end of the sentence.
    <break strength="medium" />
    Or <break time="100ms" /> sometimes in the <break time="1s" /> middle.
</speak>

Parameters

strength

string

Specifies pause strength.

Values:

none: 0ms
x-weak: 250ms
weak: 500ms
medium: 750ms
strong: 1000ms
x-strong: 1250ms

time

string

Specifies pause duration (0-10 seconds).

Values:

Milliseconds: ms suffix (e.g., 100ms)
Seconds: s suffix (e.g., 1s)
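As an illustration of the two parameters, the Python sketch below stores the strength-to-duration mapping listed above and formats a `time` value while enforcing the 0 to 10 second range. The names `STRENGTH_MS`, `MAX_BREAK_MS`, and `break_tag` are assumptions introduced for this example.

```python
# Pause lengths for each strength keyword, copied from the list above.
STRENGTH_MS = {
    "none": 0,
    "x-weak": 250,
    "weak": 500,
    "medium": 750,
    "strong": 1000,
    "x-strong": 1250,
}

MAX_BREAK_MS = 10_000  # documented upper bound of 10 seconds


def break_tag(milliseconds: int) -> str:
    """Return a <break> element for a pause given in milliseconds.

    Hypothetical helper for illustration only.
    """
    if not 0 <= milliseconds <= MAX_BREAK_MS:
        raise ValueError("break time must be between 0 and 10 seconds")
    # Use the shorter 's' form for whole seconds, otherwise the 'ms' form.
    if milliseconds and milliseconds % 1000 == 0:
        return f'<break time="{milliseconds // 1000}s" />'
    return f'<break time="{milliseconds}ms" />'


print(break_tag(100))
print(break_tag(STRENGTH_MS["strong"]))
```

Passing a value outside the range raises `ValueError` instead of silently clamping, so mistakes surface before the SSML is submitted.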

emphasis

The emphasis tag adds or removes emphasis from text, modifying speech similarly to prosody but without setting individual attributes.

<speak>
    I already told you I <emphasis level="strong">really like</emphasis> that person.
</speak>

Parameters

level

string

Specifies emphasis level.

Values:

reduced
moderate
strong

sub

The sub tag replaces the pronunciation of the text it contains, following the W3C SSML specification.

<speak>
    For detailed information, please read the <sub alias="Frequently Asked Questions">FAQ</sub> section.
</speak>

Parameters

alias

string (required)

Specifies text to be spoken instead of enclosed text.
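When a text contains many abbreviations, the sub tags can be generated from a lookup table. The following Python sketch shows one way to do this with the standard library; `expand_abbreviations` and the alias dictionary are hypothetical, and the function assumes the input is plain text that has not been escaped yet.

```python
import re
from xml.sax.saxutils import escape, quoteattr

_QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def expand_abbreviations(text: str, aliases: dict[str, str]) -> str:
    """Wrap each known abbreviation in <sub alias="...">, escaping the rest.

    Hypothetical helper for illustration only.
    """
    if not aliases:
        return escape(text, _QUOTE_ENTITIES)
    # Longest keys first so a longer abbreviation wins over its prefix.
    keys = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    # With one capturing group, split() alternates plain text and matches.
    parts = pattern.split(text)
    out = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            alias = quoteattr(aliases[part])  # adds quotes and escapes the value
            out.append(f"<sub alias={alias}>{escape(part, _QUOTE_ENTITIES)}</sub>")
        else:
            out.append(escape(part, _QUOTE_ENTITIES))
    return "".join(out)


print(expand_abbreviations(
    "Please read the FAQ section.",
    {"FAQ": "Frequently Asked Questions"},
))
```

The result still needs to be wrapped in a <speak> element before it is sent for synthesis.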

speechify:style

The speechify:style tag controls the emotion and cadence (speed) of the voice.

<speak>
    <speechify:style emotion="angry" cadence="fast">How many times can you ask me this?</speechify:style>
</speak>

Parameters

emotion

string

Changes voice emotion.

Values:

angry, cheerful, sad, terrified, relaxed
fearful, surprised, calm, assertive, energetic
warm, direct, bright

cadence

string

Experimental: only supported by the simba-turbo model.

Adjusts speech speed using AI for natural-sounding results.

Values:

slow, medium (default), fast
Percentage adjustments: -50% to +40% (e.g., +20%, -30%)

For consistent audio rate changes, <prosody rate=".."> is more suitable, though it may sound less natural.

Examples

Basic SSML

<speak>Welcome to Speechify's Text-to-Speech service.</speak>