Speech marks

Learn how speech marks map text to audio timing for synchronization features.

Overview

Speech marks are returned with every synthesis request and provide a mapping between time and text. They inform the client when each word is spoken in the audio, enabling features like:

  • Text highlighting during playback
  • Precise audio seeking by text position
  • Usage tracking and analytics
  • Synchronization between text and audio

Data structure

Speech marks use the following TypeScript interfaces:

```typescript
type Chunk = {
  start_time: number // Time in milliseconds when this chunk starts in the audio
  end_time: number // Time in milliseconds when this chunk ends in the audio
  start: number // Character index where this chunk starts in the original text
  end: number // Character index where this chunk ends in the original text
  value: string // The text content of this chunk
}

type NestedChunk = Chunk & {
  chunks: Chunk[] // Array of word-level chunks within this sentence/paragraph
}
```

Important considerations

The following details are crucial for correctly implementing speech mark functionality in your application.

  • SSML escaping: Values are computed from the SSML input, so any escaping of &, < and > is preserved in the value field, and the start and end indices refer to positions in the escaped string. Consider using the string tracker library to help map indices back to the original text.

  • Index gaps: There may be gaps between the end of one word and the start of the next (for example, the spaces between words). When looking up the word at a specific index, check for start being >= yourIndex rather than requiring the index to fall within both the start and end bounds of a chunk.

  • Timing gaps: Similarly, there may be gaps between one word's end_time and the next word's start_time. Handle these the same way as index gaps.

  • Initial silence: The first word's start_time is not necessarily 0, even though the NestedChunk's is. Silence at the beginning of the sentence can push the first word's start_time partway into the audio.

  • Trailing silence: The last word's end_time does not necessarily line up with the end of the NestedChunk. Silence at the end of the audio can make the NestedChunk extend past the last word.
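Taken together, these gap rules suggest looking words up by a single bound rather than by strict containment. A minimal sketch, assuming the Chunk type above (chunkAtTime and chunkAtIndex are illustrative names, not part of the API):

```typescript
type Chunk = {
  start_time: number
  end_time: number
  start: number
  end: number
  value: string
}

// Find the word being spoken at a given playback time (in ms).
// Because consecutive words may leave timing gaps, take the first
// chunk that has not yet finished rather than requiring the time
// to fall strictly inside [start_time, end_time].
function chunkAtTime(chunks: Chunk[], timeMs: number): Chunk | undefined {
  return chunks.find((chunk) => chunk.end_time > timeMs)
}

// Find the word covering a character index. Index gaps (such as the
// spaces between words) resolve to the following word.
function chunkAtIndex(chunks: Chunk[], index: number): Chunk | undefined {
  return chunks.find((chunk) => chunk.end > index)
}
```

A linear scan is fine for a sentence's worth of words; for long documents, a binary search over start_time keeps per-frame lookups cheap.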

Example output

The following example demonstrates how speech marks represent a sentence with timing information for each word:

```typescript
const chunk: NestedChunk = {
  start: 0,
  end: 79,
  start_time: 0,
  end_time: 4292.58,
  value: 'This is a sentence used for testing with some text on the end to make it longer',
  chunks: [
    { start: 0, end: 4, start_time: 125, end_time: 250, value: 'This' },
    { start: 5, end: 7, start_time: 250, end_time: 375, value: 'is' },
    { start: 8, end: 9, start_time: 375, end_time: 500, value: 'a' },
    { start: 10, end: 18, start_time: 500, end_time: 937, value: 'sentence' },
    { start: 19, end: 23, start_time: 937, end_time: 1200, value: 'used' },
    { start: 24, end: 27, start_time: 1200, end_time: 1375, value: 'for' },
    { start: 28, end: 35, start_time: 1375, end_time: 1775, value: 'testing' },
    { start: 36, end: 40, start_time: 1775, end_time: 1937, value: 'with' },
    { start: 41, end: 45, start_time: 1937, end_time: 2125, value: 'some' },
    { start: 46, end: 50, start_time: 2125, end_time: 2500, value: 'text' },
    { start: 51, end: 53, start_time: 2500, end_time: 2625, value: 'on' },
    { start: 54, end: 57, start_time: 2625, end_time: 2850, value: 'the' },
    { start: 58, end: 61, start_time: 2850, end_time: 3000, value: 'end' },
    { start: 62, end: 64, start_time: 3000, end_time: 3125, value: 'to' },
    { start: 65, end: 69, start_time: 3125, end_time: 3312, value: 'make' },
    { start: 70, end: 72, start_time: 3312, end_time: 3437, value: 'it' },
    { start: 73, end: 79, start_time: 3437, end_time: 4292.58, value: 'longer' },
  ],
}
```
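As one illustration of seeking by text position, the word-level chunks in the example above can map a clicked character index to an audio offset. A sketch using the first few words from the example (seekTimeFor is an illustrative name, not part of the API):

```typescript
type Chunk = {
  start_time: number
  end_time: number
  start: number
  end: number
  value: string
}

// The first few word-level chunks from the example output above.
const words: Chunk[] = [
  { start: 0, end: 4, start_time: 125, end_time: 250, value: 'This' },
  { start: 5, end: 7, start_time: 250, end_time: 375, value: 'is' },
  { start: 8, end: 9, start_time: 375, end_time: 500, value: 'a' },
  { start: 10, end: 18, start_time: 500, end_time: 937, value: 'sentence' },
]

// Map a character index (e.g. where the user clicked in the text)
// to a playback position in seconds. Index gaps resolve to the next
// word; clicks past the last word fall back to the start of the audio.
function seekTimeFor(index: number): number {
  const word = words.find((w) => w.end > index)
  return word ? word.start_time / 1000 : 0
}

// audio.currentTime = seekTimeFor(12) // jumps to 'sentence' at 0.5 s
```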