Speech marks are returned with every synthesis request and provide a mapping between time and text. They inform the client when each word is spoken in the audio, enabling features like:
Speech marks use the following TypeScript interfaces:
SSML escaping: Values are returned based on the SSML, so any escaping of &, < and > will be present in the value, start and end fields. Consider using the string tracker library to assist with mapping.
Index gaps: The start and end values of consecutive words are not always contiguous, so there may be gaps between them. When looking for the word at a specific index, check for start >= yourIndex rather than testing whether the index falls within both the start and end bounds.
Timing gaps: Similarly, start_time and end_time of each word may have gaps. Follow the same approach as with index gaps.
Initial silence: The start_time of the first word is not necessarily 0, even though the NestedChunk starts at 0. Silence at the beginning of the sentence can cause the first word to start partway through the audio.
Trailing silence: The end_time of the last word does not necessarily match the end of the NestedChunk. Silence at the end of the audio can make the NestedChunk longer than its last word.
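The gap-tolerant lookup described above can be sketched as follows. This is a minimal illustration, not the library's own helper: the `WordMark` interface is a hypothetical shape containing only the fields named in this section, the function names are invented for the example, and the words are assumed to be sorted in ascending order. It takes the `start >= yourIndex` check literally and returns the first word that satisfies it.

```typescript
// Hypothetical minimal shape: only the fields named in this section.
interface WordMark {
  value: string;
  start: number;      // character index where the word begins
  end: number;        // character index where the word ends
  start_time: number; // milliseconds
  end_time: number;   // milliseconds
}

// Binary search for the first word whose key is >= target.
// Assumes `words` is sorted ascending by `key`. Returns undefined
// when every word starts before the target.
function firstAtOrAfter(
  words: readonly WordMark[],
  target: number,
  key: (w: WordMark) => number,
): WordMark | undefined {
  let lo = 0;
  let hi = words.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (key(words[mid]) >= target) {
      hi = mid;
    } else {
      lo = mid + 1;
    }
  }
  return lo < words.length ? words[lo] : undefined;
}

// Index lookup: tolerant of gaps because it never tests the `end` bound.
function wordAtIndex(words: readonly WordMark[], index: number): WordMark | undefined {
  return firstAtOrAfter(words, index, (w) => w.start);
}

// Time lookup: the same approach applied to start_time.
function wordAtTime(words: readonly WordMark[], timeMs: number): WordMark | undefined {
  return firstAtOrAfter(words, timeMs, (w) => w.start_time);
}
```

Because neither function inspects `end` or `end_time`, an index or timestamp that lands in a gap still resolves to a word instead of returning nothing.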
For the input "Hello, welcome to Speechify", the response includes:
Note that the start_time of the first word (125ms) does not match the NestedChunk start (0ms): there is initial silence before speech begins.
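A client that needs to account for this silence could measure it directly. The sketch below is illustrative only: `TimedSpan`, `ChunkWithWords`, the `chunks` field name, and `silencePadding` are assumptions made for this example rather than definitions taken from the API, and all times are assumed to be in milliseconds.

```typescript
// Hypothetical shapes for illustration; the `chunks` field name is an assumption.
interface TimedSpan {
  start_time: number; // milliseconds
  end_time: number;   // milliseconds
}

interface ChunkWithWords extends TimedSpan {
  chunks: TimedSpan[]; // word-level marks, in spoken order
}

// Returns the silence before the first word and after the last word,
// or undefined when the chunk contains no words.
function silencePadding(
  chunk: ChunkWithWords,
): { leadingMs: number; trailingMs: number } | undefined {
  if (chunk.chunks.length === 0) {
    return undefined;
  }
  const first = chunk.chunks[0];
  const last = chunk.chunks[chunk.chunks.length - 1];
  return {
    // Clamp at 0 in case a word boundary coincides with or overshoots the chunk boundary.
    leadingMs: Math.max(0, first.start_time - chunk.start_time),
    trailingMs: Math.max(0, chunk.end_time - last.end_time),
  };
}
```

For a chunk that starts at 0ms and whose first word starts at 125ms, `leadingMs` is 125.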