Mastering SSML with the Speechify API

Speech Synthesis Markup Language (SSML) is an XML-based markup language that gives you granular control over speech output. With SSML tags you can shape pitch, pacing, pauses, and emphasis so that synthesized audio sounds more natural and engaging. The Speechify API is in beta and currently supports a curated selection of SSML tags, with more planned. Every SSML document starts with the <speak> tag, which encloses the content to be synthesized:

<speak>Your content to be synthesized here</speak>

Escaping Characters: A Primer

Transforming your text into SSML requires escaping certain characters so that your markup is interpreted correctly:

& -> &amp;

> -> &gt;

< -> &lt;

" -> &quot;

' -> &apos;

The following function applies these replacements:

// Escape '&' first so the entities added by later replacements are not escaped again.
const escapeSSMLChars = (text: string): string =>
  text
    .replaceAll('&', '&amp;')
    .replaceAll('<', '&lt;')
    .replaceAll('>', '&gt;')
    .replaceAll('"', '&quot;')
    .replaceAll("'", '&apos;');

For example:

Some "text" with 5 < 6 & 4 > 8 in it -> <speak>Some &quot;text&quot; with 5 &lt; 6 &amp; 4 &gt; 8 in it</speak>
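To tie the escaping rules to the <speak> wrapper, here is a minimal sketch that escapes plain text and wraps it in a root element. The names escapeForSsml and toSsmlDocument are illustrative and not part of the Speechify API. Regex-based replace calls are used only so the snippet also runs on targets that lack replaceAll.

```typescript
// Escape the five XML special characters. '&' must go first so that
// entities produced by the later replacements are not escaped twice.
const escapeForSsml = (text: string): string =>
  text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");

// Hypothetical helper: wrap plain (unescaped) text in a <speak> root element.
const toSsmlDocument = (plainText: string): string =>
  `<speak>${escapeForSsml(plainText)}</speak>`;

console.log(toSsmlDocument('Some "text" with 5 < 6 & 4 > 8 in it'));
// prints: <speak>Some &quot;text&quot; with 5 &lt; 6 &amp; 4 &gt; 8 in it</speak>
```

Only escape text that comes from outside your markup. Running the function over a string that already contains SSML tags would escape the tags as well.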

<prosody>

The <prosody> tag controls the expressiveness of synthesized speech. It lets you adjust three attributes of the spoken text: the pitch of the voice, the rate of speech, and the volume. For example:

<speak>
    This is a normal speech pattern.
    <prosody pitch="high" rate="fast" volume="+20%">
        I'm speaking with a higher pitch, faster than usual, and louder!
    </prosody>
    Back to normal speech pattern.
</speak>

Attributes

pitch

Adjusts the pitch at which the speech is delivered. Valid values include:

  • x-low
  • low
  • medium (default)
  • high
  • x-high
  • A percentage: a number preceded by + (optional) or - and followed by % (e.g., +20%, -30%). The valid range is -83% to +100%, but the limits can be lower or higher when pitch is combined with rate.

<speak>
    <prosody pitch="high">Hello! I am a cheerful character.</prosody>
    <prosody pitch="-50%">And I am a more serious character.</prosody>
</speak>

rate

Alters the speed at which the speech is spoken. It allows the following values:

  • x-slow
  • slow
  • medium (default)
  • fast
  • x-fast
  • A percentage: a number preceded by + (optional) or - and followed by % (e.g., +20%, -30%). The valid range is -50% to +9900%.

<speak>
    This is spoken at a <prosody rate="slow">slower rate</prosody>, while this is <prosody rate="fast">much faster</prosody> or <prosody rate="500%">insanely fast.</prosody>
</speak>

volume

Controls the loudness of the speech. In addition to the enumerated levels, it supports decibel and percentage adjustments:

  • silent
  • x-soft
  • medium (default)
  • loud
  • x-loud
  • A decibel value: a number preceded by + or - and immediately followed by dB (e.g., -6dB)
  • A percentage: a number preceded by + (optional) or - and followed by % (e.g., +20%, -30%)

<speak>
    <prosody volume="-6dB">Sometimes</prosody> it can be useful to
    <prosody volume="loud">increase the volume for a specific speech.</prosody>
</speak>
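The value rules for pitch, rate, and volume can be checked on the client before a request is sent. The sketch below is one way to do that, under stated assumptions: the function names are hypothetical, decimal percentages are assumed to be allowed, and the pitch check uses the standalone -83% to +100% range even though the limits may shift when pitch is combined with rate. The API remains the authority on what is accepted.

```typescript
// Named levels copied from the attribute lists above.
const PITCH_LEVELS = ["x-low", "low", "medium", "high", "x-high"];
const RATE_LEVELS = ["x-slow", "slow", "medium", "fast", "x-fast"];
const VOLUME_LEVELS = ["silent", "x-soft", "medium", "loud", "x-loud"];

// Parse "+20%", "-30%" or "20%" into a number; return null for anything else.
// Assumption: the number may be an integer or a decimal.
function parsePercent(value: string): number | null {
  const match = /^([+-]?\d+(?:\.\d+)?)%$/.exec(value);
  return match === null ? null : Number(match[1]);
}

// True for a named level or for a percentage inside [min, max].
function isLevelOrPercentInRange(
  value: string,
  levels: string[],
  min: number,
  max: number
): boolean {
  if (levels.indexOf(value) !== -1) return true;
  const percent = parsePercent(value);
  return percent !== null && percent >= min && percent <= max;
}

const isValidPitch = (value: string): boolean =>
  isLevelOrPercentInRange(value, PITCH_LEVELS, -83, 100);

const isValidRate = (value: string): boolean =>
  isLevelOrPercentInRange(value, RATE_LEVELS, -50, 9900);

// Volume also accepts a signed decibel value such as "-6dB".
// No percentage range is documented for volume, so none is enforced here.
const isValidVolume = (value: string): boolean =>
  VOLUME_LEVELS.indexOf(value) !== -1 ||
  /^[+-]\d+(?:\.\d+)?dB$/.test(value) ||
  parsePercent(value) !== null;

console.log(isValidPitch("-50%"), isValidRate("500%"), isValidVolume("-6dB"));
```

A check like this only catches malformed values early. It does not replace the validation done by the API.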

<break>

The <break> tag controls pauses or other prosodic boundaries between words. It follows the W3C SSML specification. The length of the break is specified either as a strength value or as a time. Example usage:

<speak>
    Sometimes it can be useful to add a longer pause at the end of the sentence.
    <break strength="medium" /> 
    Or <break time="100ms" /> sometimes in the <break time="1s" /> middle.
</speak>

Attributes

strength

Specifies the strength of the pause, influencing its duration. Supported values include:

  • none: 0ms
  • x-weak: 250ms
  • weak: 500ms
  • medium: 750ms
  • strong: 1000ms
  • x-strong: 1250ms

time

Specifies the pause duration as an absolute time, which must be between 0 and 10 seconds. Supported units:

  • Milliseconds, specified with ms suffix
  • Seconds, specified with s suffix
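The two ways of sizing a pause can be sketched as follows. The helper names strengthToMs and breakForMs are hypothetical, and rounding fractional milliseconds is a choice made for this example rather than documented API behavior.

```typescript
// Pause length for each strength value, in milliseconds (from the list above).
const BREAK_STRENGTH_MS: Record<string, number> = {
  "none": 0,
  "x-weak": 250,
  "weak": 500,
  "medium": 750,
  "strong": 1000,
  "x-strong": 1250,
};

// Hypothetical helper: look up how long a strength-based break lasts.
function strengthToMs(strength: string): number {
  if (!Object.prototype.hasOwnProperty.call(BREAK_STRENGTH_MS, strength)) {
    throw new RangeError(`unknown break strength: ${strength}`);
  }
  return BREAK_STRENGTH_MS[strength];
}

// Hypothetical helper: build a time-based break, enforcing the 0 to 10 second limit.
// Assumption: fractional milliseconds are rounded to the nearest integer.
function breakForMs(ms: number): string {
  if (!isFinite(ms) || ms < 0 || ms > 10000) {
    throw new RangeError(`break time must be between 0 and 10000 ms, got ${ms}`);
  }
  return `<break time="${Math.round(ms)}ms" />`;
}

console.log(breakForMs(strengthToMs("medium"))); // prints: <break time="750ms" />
```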

<emphasis>

The <emphasis> tag adds emphasis to, or removes it from, the text it contains. It modifies speech similarly to <prosody>, but without the need to set individual speech attributes. It accepts only one attribute: level.

<speak>
    I already told you I <emphasis level="strong">really like</emphasis> that person.
</speak>

Attributes

level

Specifies emphasis level. Accepted values:

  • reduced
  • moderate
  • strong

<sub>

The <sub> tag replaces the pronunciation of the text it contains. It follows the W3C SSML specification.
The alias attribute is required and accepts any text.

<speak>
    For detailed information, please read the <sub alias="Frequently Asked Questions">FAQ</sub> section.
</speak>

Attributes

alias

Specifies a string to be spoken instead of the enclosed text.
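Because alias is an XML attribute, its value needs the same escaping as the element content. A minimal sketch, assuming the illustrative helper names escapeXmlText and subTag (neither is part of the Speechify API):

```typescript
// Escape the five XML special characters, '&' first.
const escapeXmlText = (text: string): string =>
  text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");

// Hypothetical helper: speak `alias` in place of the written text.
function subTag(written: string, alias: string): string {
  return `<sub alias="${escapeXmlText(alias)}">${escapeXmlText(written)}</sub>`;
}

console.log(subTag("FAQ", "Frequently Asked Questions"));
// prints: <sub alias="Frequently Asked Questions">FAQ</sub>
```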

<speechify:style>

The <speechify:style> tag lets you control the emotion and cadence (speed) of the voice. It must include the emotion attribute, the cadence attribute, or both.

<speak>
    <speechify:style emotion="angry" cadence="fast">How many times can you ask me this?</speechify:style>
</speak>

Attributes

emotion

Changes the emotion of the voice. Supported values are:

  • angry
  • cheerful
  • sad
  • terrified
  • relaxed
  • fearful
  • surprised
  • calm
  • assertive
  • energetic
  • warm
  • direct
  • bright

cadence

Experimental: Cadence is only supported by the experimental simba-turbo model.

Uses an AI model to adjust speech speed, providing a more natural-sounding result without guaranteeing a specific percentage change in speed. For consistent audio rate changes, <prosody rate=".."> is more suitable, though it may sound less natural.

It allows the following values:

  • slow
  • medium (default)
  • fast
  • A percentage: a number preceded by + (optional) or - and followed by % (e.g., +20%, -30%). The valid range is -50% to +40%.
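The attribute rules for <speechify:style> can be sketched as a small builder. The names isValidCadence and styleTag are hypothetical, decimal percentages are assumed to be allowed, and the text argument is expected to be escaped already.

```typescript
// Supported values copied from the lists above.
const EMOTIONS = [
  "angry", "cheerful", "sad", "terrified", "relaxed", "fearful", "surprised",
  "calm", "assertive", "energetic", "warm", "direct", "bright",
];
const CADENCE_LEVELS = ["slow", "medium", "fast"];

// True for a named cadence level or a percentage between -50% and +40%.
function isValidCadence(value: string): boolean {
  if (CADENCE_LEVELS.indexOf(value) !== -1) return true;
  const match = /^([+-]?\d+(?:\.\d+)?)%$/.exec(value);
  if (match === null) return false;
  const percent = Number(match[1]);
  return percent >= -50 && percent <= 40;
}

// Hypothetical helper: build the tag around text that is ALREADY escaped.
// At least one of emotion and cadence must be given.
function styleTag(
  escapedText: string,
  style: { emotion?: string; cadence?: string }
): string {
  const attrs: string[] = [];
  if (style.emotion !== undefined) {
    if (EMOTIONS.indexOf(style.emotion) === -1) {
      throw new RangeError(`unsupported emotion: ${style.emotion}`);
    }
    attrs.push(`emotion="${style.emotion}"`);
  }
  if (style.cadence !== undefined) {
    if (!isValidCadence(style.cadence)) {
      throw new RangeError(`unsupported cadence: ${style.cadence}`);
    }
    attrs.push(`cadence="${style.cadence}"`);
  }
  if (attrs.length === 0) {
    throw new Error("speechify:style needs emotion and/or cadence");
  }
  return `<speechify:style ${attrs.join(" ")}>${escapedText}</speechify:style>`;
}

console.log(styleTag("Please calm down.", { emotion: "calm", cadence: "-20%" }));
// prints: <speechify:style emotion="calm" cadence="-20%">Please calm down.</speechify:style>
```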