Speech marks
Overview
Speech marks are returned with every synthesis request and provide a mapping between time and text. They inform the client when each word is spoken in the audio, enabling features like:
- Text highlighting during playback
- Precise audio seeking by text position
- Usage tracking and analytics
- Synchronization between text and audio
Data structure
Speech marks use the following TypeScript interfaces:
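As a rough sketch, the shapes look like this (reconstructed from the fields referenced in this section; the `type` field and its values are assumptions, and exact definitions in the SDK may differ):

```typescript
// Sketch of the shapes referenced below; actual SDK definitions may differ.
interface Chunk {
  type: string;       // granularity of the mark, e.g. 'word' (assumed)
  value: string;      // the text of this chunk as it appears in the SSML
  start: number;      // start index of `value` within the SSML input
  end: number;        // end index (exclusive) of `value` within the SSML input
  start_time: number; // offset in the audio where this chunk begins (milliseconds assumed)
  end_time: number;   // offset in the audio where this chunk ends (milliseconds assumed)
}

// A sentence-level mark that nests its word-level marks.
interface NestedChunk extends Chunk {
  chunks: Chunk[];
}
```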
Important considerations
The following details are crucial for correctly implementing speech mark functionality in your application.
- SSML escaping: Values are returned based on the SSML, so any escaping of `&`, `<`, and `>` will be present in the `value`, `start`, and `end` fields. Consider using the string tracker library to assist with mapping.
- Index gaps: The `start` and `end` values of each word may have gaps. When looking for a word at a specific index, check for `start` being `>= yourIndex` rather than checking if the index is within both the `start` and `end` bounds (see the lookup sketch after this list).
- Timing gaps: Similarly, the `start_time` and `end_time` of each word may have gaps. Follow the same approach as with index gaps.
- Initial silence: The `start_time` of the first word is not necessarily `0` like the `NestedChunk`'s. There can be silence at the beginning of the sentence that leads to the word starting partway through.
- Trailing silence: The `end_time` of the last word does not necessarily correspond with the end of the `NestedChunk`. There can be silence at the end that makes the `NestedChunk` longer.
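Putting the gap rules together, here is a minimal sketch of a gap-tolerant index lookup, assuming the `Chunk` shape sketched above and that `chunks` is sorted by `start` (`wordAtIndex` and the player call are hypothetical, not part of the API):

```typescript
// Find the word mark for a text index, tolerating inter-word gaps.
// Following the guidance above, we look for `start >= index` instead of
// testing `start <= index && index < end`, so an index that falls in a
// gap (e.g. whitespace between words) resolves to the word that follows it.
function wordAtIndex(chunks: Chunk[], index: number): Chunk | undefined {
  return chunks.find((chunk) => chunk.start >= index);
}

// Usage: seek playback to the word at text index 7 (player API is illustrative).
// const word = wordAtIndex(sentence.chunks, 7);
// if (word) audioPlayer.currentTime = word.start_time / 1000;
```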
Example output
The following example demonstrates how speech marks represent a sentence with timing information for each word:
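The sentence and timing values below are illustrative, invented to show the index gaps, timing gaps, and leading/trailing silence described above, using the `NestedChunk` shape sketched earlier:

```typescript
// Illustrative speech marks for "Hello, world!"; all values are invented.
const marks: NestedChunk = {
  type: 'sentence',
  value: 'Hello, world!',
  start: 0,
  end: 13,
  start_time: 0,
  end_time: 1100,      // includes trailing silence after the last word
  chunks: [
    {
      type: 'word',
      value: 'Hello,',
      start: 0,
      end: 6,
      start_time: 60,  // initial silence: the first word starts after 0
      end_time: 480,
    },
    {
      type: 'word',
      value: 'world!',
      start: 7,        // index gap: the previous word ended at 6
      end: 13,
      start_time: 530, // timing gap after the previous word's end_time
      end_time: 990,   // trailing silence follows until 1100
    },
  ],
};
```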