
Speech Synthesis Markup Language (SSML)

Control speech synthesis with markup language

SSML is an XML-based markup language for controlling pitch, rate, pauses, emphasis, and emotion in synthesized speech. Wrap your content in a <speak> tag:

<speak>Your content to be synthesized here</speak>

Escaping Characters

When converting plain text to SSML, escape the following characters so they are read as literal text rather than markup:

Character    Escaped Form
&            &amp;
>            &gt;
<            &lt;
"            &quot;
'            &apos;

<!-- Original: Some "text" with 5 < 6 & 4 > 8 in it -->
<speak>Some &quot;text&quot; with 5 &lt; 6 &amp; 4 &gt; 8 in it</speak>
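
The escaping rules above can be applied programmatically. The following is a minimal Python sketch that uses the standard library's `xml.sax.saxutils.escape`; the helper name `to_ssml` is hypothetical and not part of any Speechify SDK.

```python
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; quotes are added explicitly
# so that all five characters in the table are covered.
_QUOTES = {'"': "&quot;", "'": "&apos;"}


def to_ssml(text: str) -> str:
    """Escape reserved characters and wrap the result in a <speak> tag.

    Hypothetical helper for illustration only.
    """
    return f"<speak>{escape(text, _QUOTES)}</speak>"


print(to_ssml('Some "text" with 5 < 6 & 4 > 8 in it'))
# <speak>Some &quot;text&quot; with 5 &lt; 6 &amp; 4 &gt; 8 in it</speak>
```

Escaping should be applied only to the raw text, before any SSML tags are added; running it over a string that already contains tags would escape the tags themselves.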

Supported SSML Tags

prosody

The prosody tag controls the expressiveness of synthesized speech by manipulating pitch, rate, and volume.

<speak>
  This is a normal speech pattern.
  <prosody pitch="high" rate="fast" volume="+20%">
    I'm speaking with a higher pitch, faster than usual, and louder!
  </prosody>
  Back to normal speech pattern.
</speak>

Parameters

pitch (string)

Adjusts the pitch of speech delivery.

Values:

  • x-low, low, medium (default), high, x-high
  • Percentage adjustments: -83% to +100% (e.g., +20%, -30%)

rate (string)

Alters speech speed.

Values:

  • x-slow, slow, medium (default), fast, x-fast
  • Percentage adjustments: -50% to +9900% (e.g., +20%, -30%)

volume (string)

Controls speech loudness.

Values:

  • silent, x-soft, medium (default), loud, x-loud
  • Decibel adjustments: Number with dB suffix (e.g., -6dB)
  • Percentage adjustments (e.g., +20%, -30%)
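
To catch out-of-range values before a request is sent, the pitch and rate limits listed above can be checked client-side. This is a rough Python sketch under stated assumptions: the `prosody` helper and its validation logic are hypothetical, the accepted percentage syntax (an optional sign followed by a number) is an assumption, and `volume` is left out for brevity.

```python
import re
from xml.sax.saxutils import escape

# Named levels and percentage ranges copied from the parameter lists above.
PITCH_LEVELS = {"x-low", "low", "medium", "high", "x-high"}
RATE_LEVELS = {"x-slow", "slow", "medium", "fast", "x-fast"}
_PERCENT = re.compile(r"([+-]?\d+(?:\.\d+)?)%")
_QUOTES = {'"': "&quot;", "'": "&apos;"}


def _validate(value: str, levels: set, low: float, high: float, label: str) -> str:
    """Accept a named level or a percentage within [low, high]."""
    if value in levels:
        return value
    match = _PERCENT.fullmatch(value)
    if match and low <= float(match.group(1)) <= high:
        return value
    raise ValueError(f"invalid {label}: {value!r}")


def prosody(text: str, pitch: str = "medium", rate: str = "medium") -> str:
    """Build a <prosody> element (hypothetical helper, not an SDK call)."""
    pitch = _validate(pitch, PITCH_LEVELS, -83, 100, "pitch")
    rate = _validate(rate, RATE_LEVELS, -50, 9900, "rate")
    body = escape(text, _QUOTES)
    return f'<prosody pitch="{pitch}" rate="{rate}">{body}</prosody>'


print(prosody("Speaking a little faster.", pitch="high", rate="+20%"))
```

A call such as `prosody("Hi", pitch="-90%")` raises `ValueError`, because -90% falls below the documented -83% lower bound for pitch.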

break

The break tag inserts a pause between words, following the W3C SSML specification.

<speak>
  Sometimes it can be useful to add a longer pause at the end of the sentence.
  <break strength="medium" />
  Or <break time="100ms" /> sometimes in the <break time="1s" /> middle.
</speak>

Parameters

strength (string)

Specifies pause strength.

Values:

  • none: 0ms
  • x-weak: 250ms
  • weak: 500ms
  • medium: 750ms
  • strong: 1000ms
  • x-strong: 1250ms

time (string)

Specifies pause duration (0-10 seconds).

Values:

  • Milliseconds: ms suffix (e.g., 100ms)
  • Seconds: s suffix (e.g., 1s)
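
The strength table and the 0-10 second limit on `time` lend themselves to a small client-side check. The Python sketch below is illustrative only: `break_tag` is a hypothetical helper, and requiring at least one attribute is a choice made for this example rather than a documented rule.

```python
import re
from typing import Optional

# Pause lengths in milliseconds for each strength, as listed above.
STRENGTH_MS = {
    "none": 0, "x-weak": 250, "weak": 500,
    "medium": 750, "strong": 1000, "x-strong": 1250,
}
_TIME = re.compile(r"(\d+(?:\.\d+)?)(ms|s)")


def break_tag(strength: Optional[str] = None, time: Optional[str] = None) -> str:
    """Build a <break /> element after validating its attributes."""
    attrs = []
    if strength is not None:
        if strength not in STRENGTH_MS:
            raise ValueError(f"unknown strength: {strength!r}")
        attrs.append(f'strength="{strength}"')
    if time is not None:
        match = _TIME.fullmatch(time)
        if not match:
            raise ValueError(f"time needs an ms or s suffix: {time!r}")
        # Normalize to milliseconds to compare against the 10 s limit.
        ms = float(match.group(1)) * (1000 if match.group(2) == "s" else 1)
        if ms > 10_000:
            raise ValueError(f"time exceeds 10 seconds: {time!r}")
        attrs.append(f'time="{time}"')
    if not attrs:
        raise ValueError("pass strength, time, or both")
    return "<break " + " ".join(attrs) + " />"


print(break_tag(time="100ms"))        # <break time="100ms" />
print(break_tag(strength="medium"))   # <break strength="medium" />
```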

emphasis

The emphasis tag adds or removes emphasis from text, modifying speech similarly to prosody but without setting individual attributes.

<speak>
  I already told you I <emphasis level="strong">really like</emphasis> that person.
</speak>

Parameters

level (string)

Specifies emphasis level.

Values:

  • reduced
  • moderate
  • strong

sub

The sub tag substitutes a spoken alias for the contained text, following the W3C SSML specification.

<speak>
  For detailed information, please read the <sub alias="Frequently Asked Questions">FAQ</sub> section.
</speak>

Parameters

alias (string, required)

Specifies text to be spoken instead of enclosed text.
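
When a text contains many abbreviations, the sub tags can be generated from a lookup table instead of written by hand. The following Python sketch assumes a hypothetical `apply_aliases` helper and a caller-supplied dictionary; it escapes the surrounding text and uses the standard library's `quoteattr` for the attribute value.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def sub(text: str, alias: str) -> str:
    """Build one <sub> element; quoteattr adds the quotes and escapes the value."""
    return f"<sub alias={quoteattr(alias)}>{escape(text)}</sub>"


def apply_aliases(text: str, aliases: dict) -> str:
    """Wrap whole-word matches of each key in a <sub> tag (hypothetical helper)."""
    if not aliases:
        return escape(text)
    # Longest keys first so that overlapping keys prefer the longer match.
    keys = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, keys)) + r")\b")
    parts, pos = [], 0
    for match in pattern.finditer(text):
        parts.append(escape(text[pos:match.start()]))  # text between matches
        parts.append(sub(match.group(0), aliases[match.group(0)]))
        pos = match.end()
    parts.append(escape(text[pos:]))
    return "".join(parts)


body = apply_aliases(
    "Please read the FAQ section.",
    {"FAQ": "Frequently Asked Questions"},
)
print(f"<speak>{body}</speak>")
```

Matching is case-sensitive and limited to whole words in this sketch, so "FAQs" or "faq" would be left unchanged.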

speechify:style

The speechify:style tag controls the emotion of the voice. See Emotion Control for the full list of 13 supported emotions and best practices.

<speak>
  <speechify:style emotion="cheerful">Great news! Your order shipped!</speechify:style>
</speak>

Parameters

emotion (string)

Sets the voice emotion. Values: angry, cheerful, sad, terrified, relaxed, fearful, surprised, calm, assertive, energetic, warm, direct, bright.
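
Because the emotion attribute accepts a fixed set of values, a typo can be caught before synthesis with a simple membership check. This is a minimal Python sketch; the `style` helper is hypothetical, and the set below is copied from the list above.

```python
from xml.sax.saxutils import escape

# The 13 emotion values listed in this section.
EMOTIONS = frozenset({
    "angry", "cheerful", "sad", "terrified", "relaxed", "fearful",
    "surprised", "calm", "assertive", "energetic", "warm", "direct", "bright",
})


def style(text: str, emotion: str) -> str:
    """Wrap text in a speechify:style tag after checking the emotion name."""
    if emotion not in EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion!r}")
    return f'<speechify:style emotion="{emotion}">{escape(text)}</speechify:style>'


print(f'<speak>{style("Great news! Your order shipped!", "cheerful")}</speak>')
```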

Examples

Basic SSML

<speak>Welcome to Speechify's Text-to-Speech service.</speak>