Speech Synthesis Markup Language (SSML)
Speech Synthesis Markup Language (SSML) is an XML-based markup language that gives you granular control over speech output. With SSML, you can leverage XML tags to craft audio content that delivers a more natural and engaging listening experience.
Wrap every SSML document in a root <speak> tag, which encloses the content to be synthesized.
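As a minimal sketch, the Python snippet below wraps a piece of text in a <speak> root and parses the result to confirm it is well-formed XML. The helper name `wrap_in_speak` is hypothetical and is not part of any Speechify SDK; it assumes the body passed in is already valid SSML content.

```python
import xml.etree.ElementTree as ET


def wrap_in_speak(body: str) -> str:
    """Enclose already-escaped SSML content in a <speak> root element."""
    return f"<speak>{body}</speak>"


ssml = wrap_in_speak("Hello, world!")

# Parsing confirms the document is well-formed and rooted at <speak>.
root = ET.fromstring(ssml)
print(ssml)      # <speak>Hello, world!</speak>
print(root.tag)  # speak
```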
Escaping Characters
Because SSML is XML, characters that XML reserves for markup, such as & and <, must be escaped when plain text is converted to SSML so that they are read as text rather than as tags.
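One way to do the escaping, sketched in Python with the standard library's `xml.sax.saxutils.escape`. By default it handles &, < and >; the quote entities are passed in explicitly here. The helper name `text_to_ssml` is hypothetical, and escaping all five predefined XML entities is a conservative assumption rather than a documented requirement.

```python
from xml.sax.saxutils import escape

# escape() converts &, < and > by default; quotes are added explicitly.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def text_to_ssml(text: str) -> str:
    """Escape XML-reserved characters, then wrap the text in <speak>."""
    return f"<speak>{escape(text, QUOTE_ENTITIES)}</speak>"


escaped = text_to_ssml('Tom & Jerry say "2 < 3"')
print(escaped)
# <speak>Tom &amp; Jerry say &quot;2 &lt; 3&quot;</speak>
```

Escaping must happen before any SSML tags are added around the text; otherwise the tags themselves would be escaped too.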
Supported SSML Tags
prosody
The prosody tag controls the expressiveness of synthesized speech by manipulating pitch, rate, and volume.
Parameters
pitch
Adjusts the pitch of speech delivery.
Values:
- x-low, low, medium (default), high, x-high
- Percentage adjustments: -83% to +100% (e.g., +20%, -30%)
rate
Alters speech speed.
Values:
- x-slow, slow, medium (default), fast, x-fast
- Percentage adjustments: -50% to +9900% (e.g., +20%, -30%)
volume
Controls speech loudness.
Values:
- silent, x-soft, medium (default), loud, x-loud
- Decibel adjustments: a number with a dB suffix (e.g., -6dB)
- Percentage adjustments (e.g., +20%, -30%)
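The sketch below shows how the three prosody settings might be combined in one tag. It assumes the attributes are named `pitch`, `rate`, and `volume`, after the properties the tag controls; the `prosody` helper is hypothetical and performs no range checking.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def prosody(text: str, pitch: Optional[str] = None,
            rate: Optional[str] = None, volume: Optional[str] = None) -> str:
    """Wrap text in a <prosody> tag, emitting only the attributes that are set."""
    attrs = {"pitch": pitch, "rate": rate, "volume": volume}
    rendered = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in attrs.items()
        if value is not None
    )
    return f"<prosody{rendered}>{escape(text)}</prosody>"


prosody_ssml = f"<speak>{prosody('Slower and quieter.', rate='slow', volume='-6dB')}</speak>"
print(prosody_ssml)
# <speak><prosody rate="slow" volume="-6dB">Slower and quieter.</prosody></speak>
```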
break
The break tag controls pauses between words, following the W3C SSML specification.
Parameters
strength
Specifies pause strength.
Values:
- none: 0ms
- x-weak: 250ms
- weak: 500ms
- medium: 750ms
- strong: 1000ms
- x-strong: 1250ms
time
Specifies pause duration (0-10 seconds).
Values:
- Milliseconds: a number with an ms suffix (e.g., 100ms)
- Seconds: a number with an s suffix (e.g., 1s)
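A rough sketch of inserting a timed pause, with a client-side check against the 0-10 second limit. The attribute name `time` is taken from the W3C SSML specification that this tag follows. The helpers `parse_pause` and `pause` are hypothetical, and accepting decimal values such as 1.5s is an assumption.

```python
import re

MAX_PAUSE_SECONDS = 10.0  # upper bound given for the pause duration


def parse_pause(value: str) -> float:
    """Convert a duration such as '100ms' or '1s' to seconds, validating the range."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s)", value)
    if match is None:
        raise ValueError(f"not a valid break time: {value!r}")
    number, unit = float(match.group(1)), match.group(2)
    seconds = number / 1000 if unit == "ms" else number
    if seconds > MAX_PAUSE_SECONDS:
        raise ValueError(f"pause exceeds {MAX_PAUSE_SECONDS} seconds: {value!r}")
    return seconds


def pause(time: str) -> str:
    """Render a self-closing <break> tag after validating the duration."""
    parse_pause(time)
    return f'<break time="{time}"/>'


break_ssml = f"<speak>Wait for it{pause('500ms')} there it is.</speak>"
print(break_ssml)
# <speak>Wait for it<break time="500ms"/> there it is.</speak>
```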
emphasis
The emphasis tag adds or removes emphasis from text, modifying speech similarly to prosody but without setting individual attributes.
Parameters
level
Specifies emphasis level.
Values:
- reduced
- moderate
- strong
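A small illustration of applying emphasis to a single word. It assumes the attribute is named `level`, as in the W3C SSML specification; the `emphasize` helper and its default of moderate are hypothetical choices for this sketch.

```python
from xml.sax.saxutils import escape

# The three levels listed for the emphasis tag.
EMPHASIS_LEVELS = ("reduced", "moderate", "strong")


def emphasize(text: str, level: str = "moderate") -> str:
    """Wrap text in an <emphasis> tag, rejecting unknown levels."""
    if level not in EMPHASIS_LEVELS:
        raise ValueError(f"unsupported emphasis level: {level!r}")
    return f'<emphasis level="{level}">{escape(text)}</emphasis>'


emphasis_ssml = f"<speak>I {emphasize('really', 'strong')} mean it.</speak>"
print(emphasis_ssml)
# <speak>I <emphasis level="strong">really</emphasis> mean it.</speak>
```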
sub
The sub tag substitutes a different pronunciation for the text it contains, following the W3C SSML specification.
Parameters
alias
Specifies the text to be spoken in place of the enclosed text.
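For instance, an abbreviation can be expanded when spoken while the written form stays in the document. The sketch assumes the attribute is named `alias`, as in the W3C SSML specification; the `substitute` helper is hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def substitute(text: str, alias: str) -> str:
    """Render a <sub> tag so that `alias` is spoken in place of `text`."""
    return f"<sub alias={quoteattr(alias)}>{escape(text)}</sub>"


sub_ssml = f"<speak>The {substitute('W3C', 'World Wide Web Consortium')} publishes SSML.</speak>"
print(sub_ssml)
# <speak>The <sub alias="World Wide Web Consortium">W3C</sub> publishes SSML.</speak>
```

`quoteattr` is used for the alias because it both quotes the value and escapes reserved characters inside it.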
speechify:style
The speechify:style tag controls the emotion of the voice.
Parameters
emotion
Changes voice emotion.
Values:
- angry, cheerful, sad, terrified, relaxed
- fearful, surprised, calm, assertive, energetic
- warm, direct, bright
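The following sketch builds a styled passage and rejects emotions outside the list above. The attribute name `emotion` is an assumption inferred from the parameter description, and the `style` helper is hypothetical; check the attribute name against the service before relying on it.

```python
from xml.sax.saxutils import escape

# Emotion values listed for the speechify:style tag.
EMOTIONS = frozenset({
    "angry", "cheerful", "sad", "terrified", "relaxed",
    "fearful", "surprised", "calm", "assertive", "energetic",
    "warm", "direct", "bright",
})


def style(text: str, emotion: str) -> str:
    """Wrap text in a speechify:style tag (attribute name `emotion` is assumed)."""
    if emotion not in EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion!r}")
    return f'<speechify:style emotion="{emotion}">{escape(text)}</speechify:style>'


style_ssml = f"<speak>{style('What a lovely surprise!', 'cheerful')}</speak>"
print(style_ssml)
# <speak><speechify:style emotion="cheerful">What a lovely surprise!</speechify:style></speak>
```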