Speech Synthesis Markup Language (SSML)

Control speech synthesis with markup language

Speech Synthesis Markup Language (SSML) is an XML-based markup language that gives you granular control over speech output. With SSML, you can leverage XML tags to craft audio content that delivers a more natural and engaging listening experience.

Begin every SSML document with the foundational <speak> tag to enclose your synthesized speech content:

<speak>Your content to be synthesized here</speak>

Escaping Characters

Transforming text into SSML requires escaping certain characters to ensure correct interpretation:

Character    Escaped Form
&            &amp;
>            &gt;
<            &lt;
"            &quot;
'            &apos;

<!-- Original: Some "text" with 5 < 6 & 4 > 8 in it -->
<speak>Some &quot;text&quot; with 5 &lt; 6 &amp; 4 &gt; 8 in it</speak>
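
The escaping rules above can also be applied programmatically. The following is a minimal sketch in Python that relies on the standard library's xml.sax.saxutils.escape; the helper name text_to_ssml is hypothetical and is not part of any Speechify SDK.

```python
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; quotes must be mapped explicitly.
_QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def text_to_ssml(text: str) -> str:
    """Escape the five reserved characters and wrap the result in <speak>.

    Hypothetical helper for illustration only.
    """
    return "<speak>" + escape(text, _QUOTE_ENTITIES) + "</speak>"


print(text_to_ssml('Some "text" with 5 < 6 & 4 > 8 in it'))
```

Escaping should happen exactly once, on the raw text, before any SSML tags are added; escaping a string that already contains tags would turn the tags themselves into literal text.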

Supported SSML Tags

The prosody tag controls the expressiveness of synthesized speech by manipulating pitch, rate, and volume.

<speak>
  This is a normal speech pattern.
  <prosody pitch="high" rate="fast" volume="+20%">
    I'm speaking with a higher pitch, faster than usual, and louder!
  </prosody>
  Back to normal speech pattern.
</speak>

Parameters

pitch
string

Adjusts the pitch of speech delivery.

Values:

  • x-low, low, medium (default), high, x-high
  • Percentage adjustments: -83% to +100% (e.g., +20%, -30%)

rate
string

Alters speech speed.

Values:

  • x-slow, slow, medium (default), fast, x-fast
  • Percentage adjustments: -50% to +9900% (e.g., +20%, -30%)

volume
string

Controls speech loudness.

Values:

  • silent, x-soft, medium (default), loud, x-loud
  • Decibel adjustments: Number with dB suffix (e.g., -6dB)
  • Percentage adjustments (e.g., +20%, -30%)
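
The documented pitch and rate ranges can be checked before a request is sent. The sketch below is one possible client-side check in Python, not the service's own validation; the function names are hypothetical, and the percentage parsing (an optional sign, a number, then %) is an assumption about the accepted syntax.

```python
from xml.sax.saxutils import escape

PITCH_NAMES = {"x-low", "low", "medium", "high", "x-high"}
RATE_NAMES = {"x-slow", "slow", "medium", "fast", "x-fast"}


def _percent_in_range(value: str, low: float, high: float) -> bool:
    """True if value looks like '+20%' or '-30%' and lies within [low, high]."""
    if not value.endswith("%"):
        return False
    try:
        number = float(value[:-1])
    except ValueError:
        return False
    return low <= number <= high


def is_valid_pitch(value: str) -> bool:
    # Named levels, or a percentage between -83% and +100%.
    return value in PITCH_NAMES or _percent_in_range(value, -83, 100)


def is_valid_rate(value: str) -> bool:
    # Named levels, or a percentage between -50% and +9900%.
    return value in RATE_NAMES or _percent_in_range(value, -50, 9900)


def prosody(text: str, pitch: str = "medium", rate: str = "medium") -> str:
    """Wrap escaped text in a <prosody> tag after validating both attributes."""
    if not is_valid_pitch(pitch):
        raise ValueError(f"unsupported pitch: {pitch!r}")
    if not is_valid_rate(rate):
        raise ValueError(f"unsupported rate: {rate!r}")
    return f'<prosody pitch="{pitch}" rate="{rate}">{escape(text)}</prosody>'


print(prosody("Louder and faster!", pitch="high", rate="+20%"))
```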

The break tag inserts a pause between words, following the W3C SSML specification.

<speak>
  Sometimes it can be useful to add a longer pause at the end of the sentence.
  <break strength="medium" />
  Or <break time="100ms" /> sometimes in the <break time="1s" /> middle.
</speak>

Parameters

strength
string

Specifies pause strength.

Values:

  • none: 0ms
  • x-weak: 250ms
  • weak: 500ms
  • medium: 750ms
  • strong: 1000ms
  • x-strong: 1250ms

time
string

Specifies pause duration (0-10 seconds).

Values:

  • Milliseconds: ms suffix (e.g., 100ms)
  • Seconds: s suffix (e.g., 1s)
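
The strength table and the 0-10 second limit on time lend themselves to a small lookup and parser. This is an illustrative sketch in Python; STRENGTH_MS, time_to_ms, and break_tag are hypothetical names, and accepting fractional values such as 1.5s is an assumption.

```python
# Pause length in milliseconds for each strength value listed above.
STRENGTH_MS = {
    "none": 0,
    "x-weak": 250,
    "weak": 500,
    "medium": 750,
    "strong": 1000,
    "x-strong": 1250,
}


def time_to_ms(value: str) -> int:
    """Convert a time value such as '100ms' or '1s' to milliseconds.

    Raises ValueError if the suffix is missing or the pause is outside 0-10 s.
    """
    if value.endswith("ms"):  # check 'ms' first, since it also ends in 's'
        ms = float(value[:-2])
    elif value.endswith("s"):
        ms = float(value[:-1]) * 1000
    else:
        raise ValueError(f"missing ms or s suffix: {value!r}")
    if not 0 <= ms <= 10_000:
        raise ValueError(f"pause must be between 0 and 10 seconds: {value!r}")
    return round(ms)


def break_tag(time: str) -> str:
    """Return a <break> tag for a validated time value."""
    time_to_ms(time)  # validation only; the original string is kept as-is
    return f'<break time="{time}" />'


print(break_tag("100ms"), time_to_ms("1s"), STRENGTH_MS["medium"])
```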

The emphasis tag strengthens or reduces the emphasis placed on the enclosed text. It changes speech much as prosody does, but without requiring you to set individual attributes.

<speak>
  I already told you I <emphasis level="strong">really like</emphasis> that person.
</speak>

Parameters

level
string

Specifies emphasis level.

Values:

  • reduced
  • moderate
  • strong

The sub tag replaces the pronunciation of the enclosed text with the value of its alias attribute, following the W3C SSML specification.

<speak>
  For detailed information, please read the <sub alias="Frequently Asked Questions">FAQ</sub> section.
</speak>

Parameters

alias
string, required

Specifies the text to be spoken in place of the enclosed text.
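
When the same abbreviations recur across many texts, the sub tags can be generated from a lookup table. The sketch below is one way to do that in Python; add_aliases and the alias dictionary are hypothetical, and matching on whole words only is a design choice of this example.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def add_aliases(text: str, aliases: dict) -> str:
    """Wrap each known abbreviation in a <sub> tag and escape everything else."""
    if not aliases:
        return escape(text)
    # Longest keys first so that a longer abbreviation wins over its prefix.
    keys = sorted(aliases, key=len, reverse=True)
    pattern = r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b"
    # With one capturing group, re.split alternates plain text and matches.
    parts = re.split(pattern, text)
    out = []
    for index, part in enumerate(parts):
        if index % 2 == 1:  # odd positions hold the matched abbreviations
            out.append(f"<sub alias={quoteattr(aliases[part])}>{escape(part)}</sub>")
        else:
            out.append(escape(part))
    return "".join(out)


body = add_aliases(
    "Please read the FAQ section.",
    {"FAQ": "Frequently Asked Questions"},
)
print(f"<speak>{body}</speak>")
```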

The speechify:style tag controls the emotion and cadence (speed) of the voice.

<speak>
  <speechify:style emotion="angry" cadence="fast">How many times can you ask me this?</speechify:style>
</speak>

Parameters

emotion
string

Changes voice emotion.

Values:

  • angry, cheerful, sad, terrified, relaxed
  • fearful, surprised, calm, assertive, energetic
  • warm, direct, bright

cadence
string
Experimental: Only supported by simba-turbo model

Adjusts speech speed using AI for natural-sounding results.

Values:

  • slow, medium (default), fast
  • Percentage adjustments: -50% to +40% (e.g., +20%, -30%)

For consistent audio rate changes, <prosody rate=".."> is more suitable, though it may sound less natural.
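
A client can guard against unsupported emotion names and out-of-range cadence values before building the tag. The following Python sketch is illustrative only: the function names are hypothetical, and clamping an out-of-range cadence to the nearest limit (rather than rejecting it) is a choice made for this example, not documented service behavior.

```python
from typing import Optional
from xml.sax.saxutils import escape

# Emotion values listed in the parameter table above.
EMOTIONS = {
    "angry", "cheerful", "sad", "terrified", "relaxed",
    "fearful", "surprised", "calm", "assertive", "energetic",
    "warm", "direct", "bright",
}


def cadence_percent(change: int) -> str:
    """Clamp a percentage change to the documented -50..+40 range."""
    clamped = max(-50, min(40, change))
    return f"{clamped:+d}%"  # always signed, e.g. '+20%' or '-50%'


def styled(text: str, emotion: str, cadence_change: Optional[int] = None) -> str:
    """Wrap escaped text in a <speechify:style> tag."""
    if emotion not in EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion!r}")
    attrs = f'emotion="{emotion}"'
    if cadence_change is not None:  # omit cadence entirely when not requested
        attrs += f' cadence="{cadence_percent(cadence_change)}"'
    return f"<speechify:style {attrs}>{escape(text)}</speechify:style>"


print(styled("How many times can you ask me this?", "angry", 20))
```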

Examples

<speak>Welcome to Speechify's Text-to-Speech service.</speak>