Speech Synthesis Markup Language (SSML)
Control speech synthesis with markup language
Speech Synthesis Markup Language (SSML) is an XML-based markup language that gives you granular control over speech output. With SSML, you can leverage XML tags to craft audio content that delivers a more natural and engaging listening experience.
Begin every SSML document with the foundational <speak> tag to enclose your synthesized speech content.
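As a minimal sketch of that rule, the snippet below wraps a piece of plain text in the root element. The helper name wrap_speak is hypothetical and used only for illustration; it assumes the text passed in contains no characters that need escaping.

```python
def wrap_speak(body: str) -> str:
    """Wrap SSML content in the root <speak> element.

    Hypothetical helper: it assumes `body` is already valid SSML
    (plain text with no special XML characters, or nested SSML tags).
    """
    return f"<speak>{body}</speak>"


ssml = wrap_speak("Hello, world!")
print(ssml)  # <speak>Hello, world!</speak>
```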
Escaping Characters
Transforming text into SSML requires escaping the characters that are special in XML (&, <, >, ", and ') so that they are interpreted as text rather than markup.
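One way to do the escaping, sketched with Python's standard library: xml.sax.saxutils.escape replaces &, <, and > by default, and the extra mapping passed here covers both quote characters using the standard XML entities. The helper name escape_ssml is an assumption for illustration, not part of any SDK.

```python
from xml.sax.saxutils import escape


def escape_ssml(text: str) -> str:
    """Escape the five characters that are special in XML.

    escape() handles &, < and > on its own; the second argument
    adds the two quote characters.
    """
    return escape(text, {'"': "&quot;", "'": "&apos;"})


raw = 'Tom & Jerry say "2 < 3"'
print(f"<speak>{escape_ssml(raw)}</speak>")
# <speak>Tom &amp; Jerry say &quot;2 &lt; 3&quot;</speak>
```

Escape the raw text first and add SSML tags afterwards; escaping a string that already contains tags would turn the tags themselves into literal text.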
Supported SSML Tags
prosody
The prosody tag controls the expressiveness of synthesized speech by manipulating pitch, rate, and volume.
Parameters
pitch: Adjusts the pitch of speech delivery.
Values:
- x-low, low, medium (default), high, x-high
- Percentage adjustments: -83% to +100% (e.g., +20%, -30%)
rate: Alters speech speed.
Values:
- x-slow, slow, medium (default), fast, x-fast
- Percentage adjustments: -50% to +9900% (e.g., +20%, -30%)
volume: Controls speech loudness.
Values:
- silent, x-soft, medium (default), loud, x-loud
- Decibel adjustments: number with dB suffix (e.g., -6dB)
- Percentage adjustments (e.g., +20%, -30%)
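The three prosody parameters can be combined on one element. The sketch below builds such an element; it assumes the attribute names pitch, rate, and volume from the W3C SSML specification, and the helper name prosody is hypothetical. Only the attributes that are passed are emitted.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody(text, pitch=None, rate=None, volume=None):
    """Wrap text in a <prosody> element, emitting only the attributes given."""
    attrs = {"pitch": pitch, "rate": rate, "volume": volume}
    attr_str = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in attrs.items()
        if value is not None
    )
    return f"<prosody{attr_str}>{escape(text)}</prosody>"


slow_and_quiet = prosody("This part is slower and quieter.", rate="slow", volume="-6dB")
print(f"<speak>{slow_and_quiet}</speak>")
# <speak><prosody rate="slow" volume="-6dB">This part is slower and quieter.</prosody></speak>
```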
break
The break tag controls pausing between words, following the W3C specification.
Parameters
strength: Specifies pause strength.
Values:
- none: 0ms
- x-weak: 250ms
- weak: 500ms
- medium: 750ms
- strong: 1000ms
- x-strong: 1250ms
time: Specifies pause duration (0-10 seconds).
Values:
- Milliseconds: ms suffix (e.g., 100ms)
- Seconds: s suffix (e.g., 1s)
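A short sketch of inserting a timed pause, with a client-side check of the 0-10 second limit stated above. It assumes the W3C attribute name time and the self-closing form of the element; the helper name break_tag is hypothetical, and accepting only whole-number durations is a simplification made for this example.

```python
import re


def break_tag(time: str) -> str:
    """Return a <break> element after checking the 0-10 second limit.

    Accepts whole-number durations such as "100ms" or "1s"
    (a simplifying assumption for this sketch).
    """
    match = re.fullmatch(r"(\d+)(ms|s)", time)
    if match is None:
        raise ValueError(f"expected a duration like '100ms' or '1s', got {time!r}")
    value, unit = int(match.group(1)), match.group(2)
    seconds = value / 1000 if unit == "ms" else value
    if seconds > 10:
        raise ValueError("pause duration must be between 0 and 10 seconds")
    return f'<break time="{time}"/>'


print(f"<speak>Wait for it{break_tag('500ms')} there it is.</speak>")
# <speak>Wait for it<break time="500ms"/> there it is.</speak>
```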
emphasis
The emphasis tag adds or removes emphasis from text, modifying speech similarly to prosody but without setting individual attributes.
Parameters
level: Specifies emphasis level.
Values:
- reduced
- moderate
- strong
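For illustration, the sketch below wraps a phrase in an emphasis element and rejects levels outside the three listed above. The attribute name level comes from the W3C SSML specification; the helper name emphasis is hypothetical.

```python
from xml.sax.saxutils import escape

# The three levels listed in this section.
EMPHASIS_LEVELS = {"reduced", "moderate", "strong"}


def emphasis(text: str, level: str) -> str:
    """Wrap text in an <emphasis> element with a validated level."""
    if level not in EMPHASIS_LEVELS:
        raise ValueError(f"unsupported emphasis level: {level!r}")
    return f'<emphasis level="{level}">{escape(text)}</emphasis>'


print(f"<speak>I {emphasis('really', 'strong')} mean it.</speak>")
# <speak>I <emphasis level="strong">really</emphasis> mean it.</speak>
```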
sub
The sub tag replaces pronunciation for contained text, following the W3C specification.
Parameters
alias: Specifies the text to be spoken instead of the enclosed text.
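A typical use is expanding an abbreviation. The sketch below assumes the W3C attribute name alias; the helper name sub is hypothetical. The written form stays in the document, while the alias is what gets spoken.

```python
from xml.sax.saxutils import escape, quoteattr


def sub(text: str, alias: str) -> str:
    """Wrap text in a <sub> element so that `alias` is spoken instead."""
    return f"<sub alias={quoteattr(alias)}>{escape(text)}</sub>"


print(f"<speak>The {sub('W3C', 'World Wide Web Consortium')} publishes the SSML standard.</speak>")
# <speak>The <sub alias="World Wide Web Consortium">W3C</sub> publishes the SSML standard.</speak>
```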
speechify:style
The speechify:style tag controls emotion and cadence (speed) of the voice.
Parameters
emotion: Changes voice emotion.
Values:
- angry, cheerful, sad, terrified, relaxed
- fearful, surprised, calm, assertive, energetic
- warm, direct, bright
simba-turbo model
cadence: Adjusts speech speed using AI for natural-sounding results.
Values:
- slow, medium (default), fast
- Percentage adjustments: -50% to +40% (e.g., +20%, -30%)
<prosody rate=".."> is more suitable, though it may sound less natural.
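A sketch of building the style element follows. The attribute names emotion and cadence are assumptions inferred from the parameter descriptions in this section, and the helper name style is hypothetical. The section does not say whether both attributes may appear on the same element, so the example sets them separately.

```python
from xml.sax.saxutils import escape, quoteattr


def style(text, emotion=None, cadence=None):
    """Wrap text in a <speechify:style> element.

    Attribute names `emotion` and `cadence` are assumed from the
    parameter descriptions; only the attributes given are emitted.
    """
    attrs = {"emotion": emotion, "cadence": cadence}
    attr_str = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in attrs.items()
        if value is not None
    )
    return f"<speechify:style{attr_str}>{escape(text)}</speechify:style>"


print(f"<speak>{style('Great news, everyone!', emotion='cheerful')}</speak>")
# <speak><speechify:style emotion="cheerful">Great news, everyone!</speechify:style></speak>
print(f"<speak>{style('Let me say that again.', cadence='slow')}</speak>")
# <speak><speechify:style cadence="slow">Let me say that again.</speechify:style></speak>
```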