Speech marks
Learn how speech marks map text to audio timing for synchronization features.
Overview
Speech marks are returned with every synthesis request and provide a mapping between time and text. They inform the client when each word is spoken in the audio, enabling features like:
- Text highlighting during playback
- Precise audio seeking by text position
- Usage tracking and analytics
- Synchronization between text and audio
Data structure
Speech marks use the following TypeScript interfaces:
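The exact published types may differ in detail, but the following is a minimal sketch reconstructed from the fields referenced on this page (`value`, `start`, `end`, `start_time`, `end_time`, and the nested `chunks` array); the timing fields are assumed here to be in milliseconds:

```typescript
// Sketch of the speech-mark shapes discussed on this page. Field names come from
// the notes below; the millisecond unit for the timing fields is an assumption.
interface SpeechMarksChunk {
  /** Kind of chunk, e.g. a single word or the whole utterance. */
  type: string;
  /** The text this chunk covers, as it appears in the (SSML-escaped) input. */
  value: string;
  /** Character index in the input where this chunk starts. */
  start: number;
  /** Character index in the input where this chunk ends. */
  end: number;
  /** Time in the audio at which this chunk begins (assumed milliseconds). */
  start_time: number;
  /** Time in the audio at which this chunk ends (assumed milliseconds). */
  end_time: number;
}

/** Top-level chunk covering the whole synthesized text, with per-word children. */
interface NestedChunk extends SpeechMarksChunk {
  chunks: SpeechMarksChunk[];
}
```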
Important considerations
The following details are crucial for correctly implementing speech mark functionality in your application.
- SSML escaping: Values are returned based on the SSML, so any escaping of `&`, `<`, and `>` will be present in the `value`, `start`, and `end` fields. Consider using the string tracker library to assist with mapping.
- Index gaps: The `start` and `end` values of each word may have gaps. When looking for a word at a specific index, check for `start` being `>= yourIndex` rather than checking whether the index falls within both the `start` and `end` bounds (see the lookup sketch after this list).
- Timing gaps: Similarly, the `start_time` and `end_time` of each word may have gaps. Follow the same approach as with index gaps.
- Initial silence: The `start_time` of the first word is not necessarily `0` like the `NestedChunk`'s. There can be silence at the beginning of the sentence, so the first word starts partway through.
- Trailing silence: The `end_time` of the last word does not necessarily correspond to the end of the `NestedChunk`. There can be silence at the end that makes the `NestedChunk` longer.
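As a concrete illustration of the gap handling described above, here is a minimal sketch (the helper names `chunkAtTime` and `firstChunkStartingAtOrAfter` are hypothetical, not part of any SDK) that maps a playback time or a character index to a word chunk, using the interfaces sketched earlier:

```typescript
/** Return the word chunk that should be treated as active at playback time `timeMs`. */
function chunkAtTime(marks: NestedChunk, timeMs: number): SpeechMarksChunk | undefined {
  let active: SpeechMarksChunk | undefined;
  for (const chunk of marks.chunks) {
    if (chunk.start_time <= timeMs) {
      // Keep the most recent word whose start_time has passed, so timing gaps
      // and trailing silence stay attached to the word that precedes them.
      active = chunk;
    } else {
      break; // chunks are ordered by time, so we can stop early
    }
  }
  return active;
}

/** Return the first word whose `start` index is at or after the given character index. */
function firstChunkStartingAtOrAfter(marks: NestedChunk, index: number): SpeechMarksChunk | undefined {
  // Compare against `start >= index`, as recommended above, rather than requiring
  // the index to fall strictly inside a word's [start, end) range, which can miss
  // positions that land in an index gap between words.
  return marks.chunks.find((chunk) => chunk.start >= index);
}
```

Keeping the previous word active through a gap (as `chunkAtTime` does) tends to produce steadier highlighting during playback; adjust the tie-breaking to suit your use case.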
Example output
The following example demonstrates how speech marks represent a sentence with timing information for each word:
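The exact payload depends on the request, so the listing below uses invented values purely for illustration, expressed with the interfaces sketched above. Note the initial silence before the first word, the index and timing gaps between words, and the trailing silence after the last word:

```typescript
// Hypothetical speech marks for the sentence "Hello, wonderful world!".
const example: NestedChunk = {
  type: "sentence",
  value: "Hello, wonderful world!",
  start: 0,
  end: 23,
  start_time: 0,
  end_time: 1600, // the audio is longer than the last word due to trailing silence
  chunks: [
    // Initial silence: the first word starts at 120 ms, not 0.
    { type: "word", value: "Hello,", start: 0, end: 6, start_time: 120, end_time: 480 },
    // Index gap (6 -> 7) and timing gap (480 -> 520) between words.
    { type: "word", value: "wonderful", start: 7, end: 16, start_time: 520, end_time: 1020 },
    // Trailing silence: the last word ends at 1480 ms, before the chunk's 1600 ms.
    { type: "word", value: "world!", start: 17, end: 23, start_time: 1060, end_time: 1480 },
  ],
};
```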