SSML
Mastering SSML with the Speechify API
Speech Synthesis Markup Language (SSML) is an XML-based markup language that gives you granular control over speech output. With SSML tags, you can shape how your text is spoken and produce a more natural and engaging listening experience. The Speechify API is in beta and currently supports a curated selection of SSML tags, with more planned. Begin every SSML document with the <speak> tag, which encloses the content to be synthesized.
<speak>Your content to be synthesized here</speak>
Escaping Characters: A Primer
Before you place text in SSML, escape the following characters so that the markup is interpreted correctly:
& -> &amp;
> -> &gt;
< -> &lt;
" -> &quot;
' -> &apos;
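The following function performs this transformation:
// Escape & first so the entities produced by the later replacements are not escaped again.
const escapeSSMLChars = (text: string) =>
  text
    .replaceAll('&', '&amp;')
    .replaceAll('<', '&lt;')
    .replaceAll('>', '&gt;')
    .replaceAll('"', '&quot;')
    .replaceAll('\'', '&apos;')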
For example, Some "text" with 5 < 6 & 4 > 8 in it
-> <speak>Some &quot;text&quot; with 5 &lt; 6 &amp; 4 &gt; 8 in it</speak>
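When markup and user-supplied text are mixed in one string, it helps to escape only the text and leave the tags alone. The sketch below does this with a tagged template: the literal parts are treated as trusted markup and every interpolated value is escaped. The names escapeSsmlText and ssml are hypothetical helpers for illustration, not part of the Speechify API.

```typescript
// Hypothetical helpers for illustration; not part of any Speechify SDK.
const escapeSsmlText = (text: string): string =>
  text
    .replace(/&/g, "&amp;") // must run first to avoid double-escaping
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");

// Tagged template: literal parts are kept as markup, interpolated values are escaped.
const ssml = (parts: TemplateStringsArray, ...values: unknown[]): string =>
  parts.reduce(
    (out, part, i) =>
      out + part + (i < values.length ? escapeSsmlText(String(values[i])) : ""),
    ""
  );

const userText = 'Tom & Jerry say "5 < 6"';
const doc = ssml`<speak>${userText}</speak>`;
console.log(doc);
```

The <speak> tags survive untouched, while the ampersand, quotes, and angle bracket inside userText are replaced with their entities.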
<prosody>
The <prosody> tag controls the expressiveness of synthesized speech. It lets you adjust three attributes of the spoken text: pitch, rate, and volume. For example:
<speak>
This is a normal speech pattern.
<prosody pitch="high" rate="fast" volume="+20%">
I'm speaking with a higher pitch, faster than usual, and louder!
</prosody>
Back to normal speech pattern.
</speak>
Attributes
pitch
Adjusts the pitch at which the speech is delivered. Valid values include:
- x-low
- low
- medium (default)
- high
- x-high
- A percentage, expressed as a number preceded by + (optional) or - and followed by % (e.g., +20%, -30%). The valid range is between -83% and +100%, but it could be lower or higher when pitch is used in combination with rate.
<speak>
<prosody pitch="high">Hello! I am a cheerful character.</prosody>
<prosody pitch="-50%">And I am a more serious character.</prosody>
</speak>
rate
Alters the speed at which the speech is spoken. It allows the following values:
- x-slow
- slow
- medium (default)
- fast
- x-fast
- A percentage, expressed as a number preceded by + (optional) or - and followed by % (e.g., +20%, -30%). The valid range is between -50% and +9900%.
<speak>
This is spoken at a <prosody rate="slow">slower rate</prosody>, while this is <prosody rate="fast">much faster</prosody> or <prosody rate="500%">insanely fast.</prosody>
</speak>
volume
Controls the loudness of the speech. In addition to the enumerated levels, it supports decibel and percentage adjustments.
- silent
- x-soft
- medium (default)
- loud
- x-loud
- A number preceded by "+" or "-" and immediately followed by "dB"
- A percentage, expressed as a number preceded by + (optional) or - and followed by % (e.g., +20%, -30%)
<speak>
<prosody volume="-6dB">Sometimes</prosody> it can be useful to
<prosody volume="loud">increase the volume for a specific speech.</prosody>
</speak>
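The value rules for pitch, rate, and volume can be checked before a request is sent. The following is a minimal sketch of such a check, not an official client: the helper names are hypothetical, the accepted levels and ranges are copied from the lists above, and it assumes percentages are whole numbers and that volume percentages have no documented limit. Because the pitch range can shift when pitch is combined with rate, the pitch check here is deliberately conservative.

```typescript
// Hypothetical helpers for illustration; not part of any Speechify SDK.
type ProsodyOptions = { pitch?: string; rate?: string; volume?: string };

const PITCH_LEVELS = ["x-low", "low", "medium", "high", "x-high"];
const RATE_LEVELS = ["x-slow", "slow", "medium", "fast", "x-fast"];
const VOLUME_LEVELS = ["silent", "x-soft", "medium", "loud", "x-loud"];

// Matches "+20%", "-30%" or "500%" and captures the signed number.
// Assumption: only whole-number percentages are accepted.
const PERCENT = /^([+-]?\d+)%$/;
// Matches a signed decibel value such as "+6dB" or "-6dB".
const DECIBEL = /^[+-]\d+(\.\d+)?dB$/;

const percentInRange = (value: string, min: number, max: number): boolean => {
  const match = PERCENT.exec(value);
  if (match === null) return false;
  const percent = Number(match[1]);
  return percent >= min && percent <= max;
};

const isValidPitch = (v: string): boolean =>
  PITCH_LEVELS.indexOf(v) !== -1 || percentInRange(v, -83, 100);
const isValidRate = (v: string): boolean =>
  RATE_LEVELS.indexOf(v) !== -1 || percentInRange(v, -50, 9900);
const isValidVolume = (v: string): boolean =>
  VOLUME_LEVELS.indexOf(v) !== -1 || DECIBEL.test(v) || PERCENT.test(v);

// Builds a <prosody> element; the text must already be escaped.
const prosody = (escapedText: string, options: ProsodyOptions): string => {
  const checks: Array<[string, string | undefined, (v: string) => boolean]> = [
    ["pitch", options.pitch, isValidPitch],
    ["rate", options.rate, isValidRate],
    ["volume", options.volume, isValidVolume],
  ];
  let attributes = "";
  for (const [name, value, isValid] of checks) {
    if (value === undefined) continue;
    if (!isValid(value)) throw new RangeError(`Invalid ${name} value: ${value}`);
    attributes += ` ${name}="${value}"`;
  }
  return `<prosody${attributes}>${escapedText}</prosody>`;
};

console.log(prosody("Louder and faster!", { rate: "fast", volume: "+20%" }));
```

A call such as prosody("text", { rate: "-60%" }) throws a RangeError, because -60% is below the documented minimum of -50%.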
<break>
The <break> tag controls pauses or other prosodic boundaries between words. It follows the W3C specification. The length of the break is specified either as a strength value or as an absolute time. Example usage:
<speak>
Sometimes it can be useful to add a longer pause at the end of the sentence.
<break strength="medium" />
Or <break time="100ms" /> sometimes in the <break time="1s" /> middle.
</speak>
Attributes
strength
Specifies the strength of the pause, influencing its duration. Supported values include:
- none: 0ms
- x-weak: 250ms
- weak: 500ms
- medium: 750ms
- strong: 1000ms
- x-strong: 1250ms
time
Specifies the pause duration as an absolute time. Must be between 0 and 10 seconds. Supported values include:
- Milliseconds, specified with the ms suffix
- Seconds, specified with the s suffix
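The two ways of specifying a pause can be wrapped in small helpers. This is a sketch under stated assumptions: the function names are hypothetical, the strength durations are copied from the list above, and nearestStrength is only an illustrative convenience for picking the named strength closest to a desired pause.

```typescript
// Hypothetical helpers for illustration; not part of any Speechify SDK.
// Durations per strength, as listed in the strength attribute section.
const STRENGTH_MS = {
  none: 0,
  "x-weak": 250,
  weak: 500,
  medium: 750,
  strong: 1000,
  "x-strong": 1250,
} as const;
type BreakStrength = keyof typeof STRENGTH_MS;

// Builds a timed break; rejects values outside the documented 0 to 10 second range.
const breakForMs = (ms: number): string => {
  if (!(ms >= 0 && ms <= 10000)) {
    throw new RangeError("break time must be between 0 and 10 seconds");
  }
  return `<break time="${Math.round(ms)}ms" />`;
};

const breakForStrength = (strength: BreakStrength): string =>
  `<break strength="${strength}" />`;

// Returns the named strength whose duration is closest to the requested pause.
const nearestStrength = (ms: number): BreakStrength => {
  let best: BreakStrength = "none";
  for (const name of Object.keys(STRENGTH_MS) as BreakStrength[]) {
    if (Math.abs(STRENGTH_MS[name] - ms) < Math.abs(STRENGTH_MS[best] - ms)) {
      best = name;
    }
  }
  return best;
};

const sentence =
  `First point.${breakForStrength(nearestStrength(600))}` +
  `Second point.${breakForMs(1500)}Done.`;
console.log(`<speak>${sentence}</speak>`);
```

Using a named strength keeps the markup readable, while an explicit time is the better fit when the pause must have an exact length.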
<emphasis>
The <emphasis> tag adds or removes emphasis from the text contained by the element. It modifies speech similarly to <prosody>, but without the need to set individual speech attributes. It accepts only one attribute: level.
<speak>
I already told you I <emphasis level="strong">really like</emphasis> that person.
</speak>
Attributes
level
Specifies emphasis level. Accepted values:
- reduced
- moderate
- strong
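Because level has only three accepted values, a union type can catch an unsupported level at compile time. The sketch below shows one way to do that; EmphasisLevel, isEmphasisLevel, and emphasis are hypothetical names, and the text passed in is assumed to be escaped already.

```typescript
// Hypothetical helpers for illustration; not part of any Speechify SDK.
const EMPHASIS_LEVELS = ["reduced", "moderate", "strong"] as const;
type EmphasisLevel = (typeof EMPHASIS_LEVELS)[number];

// Runtime guard for values that arrive as plain strings (for example, from a config file).
const isEmphasisLevel = (value: string): value is EmphasisLevel =>
  (EMPHASIS_LEVELS as readonly string[]).indexOf(value) !== -1;

// Builds an <emphasis> element; the text must already be escaped.
const emphasis = (escapedText: string, level: EmphasisLevel): string =>
  `<emphasis level="${level}">${escapedText}</emphasis>`;

const requested = "strong";
if (isEmphasisLevel(requested)) {
  console.log(`<speak>I ${emphasis("really like", requested)} that person.</speak>`);
}
```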
<sub>
The <sub> tag replaces the pronunciation of the contained text. It follows the W3C specification.
The required alias attribute accepts any text.
<speak>
For detailed information, please read the <sub alias="Frequently Asked Questions">FAQ</sub> section.
</speak>
Attributes
alias
Specifies a string to be spoken instead of the enclosed text.
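Since alias is an attribute value, it needs the same escaping as the text content. The sketch below builds <sub> elements and uses them to expand a small table of abbreviations. The helper names and the ABBREVIATIONS table are hypothetical and exist only for illustration.

```typescript
// Hypothetical helpers for illustration; not part of any Speechify SDK.
const escapeSsml = (text: string): string =>
  text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");

// Builds a <sub> element, escaping both the alias attribute and the enclosed text.
const sub = (text: string, alias: string): string => {
  if (alias.length === 0) throw new Error("alias is required");
  return `<sub alias="${escapeSsml(alias)}">${escapeSsml(text)}</sub>`;
};

// Illustrative table of abbreviations and the words to speak instead.
const ABBREVIATIONS: Record<string, string> = {
  FAQ: "Frequently Asked Questions",
  API: "Application Programming Interface",
};

// Splits on word boundaries, wraps known abbreviations, and escapes everything else.
const expandAbbreviations = (plainText: string): string =>
  plainText
    .split(/\b/)
    .map((token) =>
      Object.prototype.hasOwnProperty.call(ABBREVIATIONS, token)
        ? sub(token, ABBREVIATIONS[token])
        : escapeSsml(token)
    )
    .join("");

console.log(`<speak>${expandAbbreviations("Please read the FAQ section.")}</speak>`);
```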
<speechify:style>
The <speechify:style> tag lets you control the emotion and cadence (speed) of the voice. It must include the emotion attribute, the cadence attribute, or both.
<speak>
<speechify:style emotion="angry" cadence="fast">How many times can you ask me this?</speechify:style>
</speak>
Attributes
emotion
Changes the emotion of the voice. Supported values are:
- angry
- cheerful
- sad
- terrified
- relaxed
- fearful
- surprised
- calm
- assertive
- energetic
- warm
- direct
- bright
cadence
Experimental
Cadence is supported only by the experimental simba-turbo model.
Uses an AI model to adjust speech speed, providing a more natural-sounding result without guaranteeing a specific percentage change in speed. For consistent audio rate changes, <prosody rate=".."> is more suitable, though it may sound less natural.
It allows the following values:
- slow
- medium (default)
- fast
- A percentage, expressed as a number preceded by + (optional) or - and followed by % (e.g., +20%, -30%). The valid range is between -50% and +40%.
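The rules for <speechify:style> (at least one attribute, an emotion from the supported list, and a cadence within range) can be enforced with a small builder. This is a sketch, not an official client: the names are hypothetical, the values are copied from the lists above, and cadence percentages are assumed to be whole numbers.

```typescript
// Hypothetical helpers for illustration; not part of any Speechify SDK.
const EMOTIONS = [
  "angry", "cheerful", "sad", "terrified", "relaxed", "fearful", "surprised",
  "calm", "assertive", "energetic", "warm", "direct", "bright",
] as const;
type Emotion = (typeof EMOTIONS)[number];

interface StyleOptions {
  emotion?: Emotion;
  cadence?: string;
}

// Accepts slow, medium, fast, or a percentage between -50% and +40%.
// Assumption: only whole-number percentages are accepted.
const isValidCadence = (value: string): boolean => {
  if (["slow", "medium", "fast"].indexOf(value) !== -1) return true;
  const match = /^([+-]?\d+)%$/.exec(value);
  if (match === null) return false;
  const percent = Number(match[1]);
  return percent >= -50 && percent <= 40;
};

// Builds a <speechify:style> element; the text must already be escaped.
const style = (escapedText: string, options: StyleOptions): string => {
  if (options.emotion === undefined && options.cadence === undefined) {
    throw new Error("speechify:style needs an emotion, a cadence, or both");
  }
  if (options.cadence !== undefined && !isValidCadence(options.cadence)) {
    throw new RangeError(`Invalid cadence value: ${options.cadence}`);
  }
  let attributes = "";
  if (options.emotion !== undefined) attributes += ` emotion="${options.emotion}"`;
  if (options.cadence !== undefined) attributes += ` cadence="${options.cadence}"`;
  return `<speechify:style${attributes}>${escapedText}</speechify:style>`;
};

console.log(`<speak>${style("What a lovely day.", { emotion: "cheerful", cadence: "+10%" })}</speak>`);
```

Typing emotion as a union means an unsupported emotion fails at compile time, while cadence is checked at run time because its percentage form cannot be listed exhaustively.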