Speech marks

Learn how speech marks map text to audio timing for synchronization features.

Overview

Speech marks are returned with every synthesis request and provide a mapping between time and text. They inform the client when each word is spoken in the audio, enabling features like:

  • Text highlighting during playback
  • Precise audio seeking by text position
  • Usage tracking and analytics
  • Synchronization between text and audio

Data structure

Speech marks use the following TypeScript interfaces:

```typescript
type Chunk = {
  start_time: number // Time in milliseconds when this chunk starts in the audio
  end_time: number // Time in milliseconds when this chunk ends in the audio
  start: number // Character index where this chunk starts in the original text
  end: number // Character index where this chunk ends in the original text
  value: string // The text content of this chunk
}

type NestedChunk = Chunk & {
  chunks: Chunk[] // Array of word-level chunks within this sentence/paragraph
}
```

Important considerations

The following details are crucial for correctly implementing speech mark functionality in your application.

  • SSML escaping: Values are computed from the SSML input, so any escaping of &, < and > is preserved in the value field, and the start and end indices refer to positions in the escaped string. Consider using the string tracker library to help map indices back to the original text.

  • Index gaps: There may be gaps between the end of one word and the start of the next (for example, the spaces between words). When looking up the word at a specific index, check for start being >= yourIndex rather than requiring the index to fall within both the start and end bounds of a chunk.

  • Timing gaps: Similarly, there may be gaps between one word's end_time and the next word's start_time. Handle these the same way as index gaps.

  • Initial silence: The first word's start_time is not necessarily 0, even though the NestedChunk's is. Silence at the beginning of the sentence can push the first word's start_time partway into the audio.

  • Trailing silence: The last word's end_time does not necessarily line up with the end of the NestedChunk. Silence at the end of the audio can make the NestedChunk extend past the last word.
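Taken together, these gap rules suggest looking words up by a single bound rather than by strict containment. A minimal sketch, assuming the Chunk type above (chunkAtTime and chunkAtIndex are illustrative names, not part of the API):

```typescript
type Chunk = {
  start_time: number
  end_time: number
  start: number
  end: number
  value: string
}

// Find the word being spoken at a given playback time (in ms).
// Because consecutive words may leave timing gaps, take the first
// chunk that has not yet finished rather than requiring the time
// to fall strictly inside [start_time, end_time].
function chunkAtTime(chunks: Chunk[], timeMs: number): Chunk | undefined {
  return chunks.find((chunk) => chunk.end_time > timeMs)
}

// Find the word covering a character index. Index gaps (such as the
// spaces between words) resolve to the following word.
function chunkAtIndex(chunks: Chunk[], index: number): Chunk | undefined {
  return chunks.find((chunk) => chunk.end > index)
}
```

A linear scan is fine for a sentence's worth of words; for long documents, a binary search over start_time keeps per-frame lookups cheap.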

Example output

The following example demonstrates how speech marks represent a sentence with timing information for each word:

```typescript
const chunk: NestedChunk = {
  start: 0,
  end: 79,
  start_time: 0,
  end_time: 4292.58,
  value: 'This is a sentence used for testing with some text on the end to make it longer',
  chunks: [
    { start: 0, end: 4, start_time: 125, end_time: 250, value: 'This' },
    { start: 5, end: 7, start_time: 250, end_time: 375, value: 'is' },
    { start: 8, end: 9, start_time: 375, end_time: 500, value: 'a' },
    { start: 10, end: 18, start_time: 500, end_time: 937, value: 'sentence' },
    { start: 19, end: 23, start_time: 937, end_time: 1200, value: 'used' },
    { start: 24, end: 27, start_time: 1200, end_time: 1375, value: 'for' },
    { start: 28, end: 35, start_time: 1375, end_time: 1775, value: 'testing' },
    { start: 36, end: 40, start_time: 1775, end_time: 1937, value: 'with' },
    { start: 41, end: 45, start_time: 1937, end_time: 2125, value: 'some' },
    { start: 46, end: 50, start_time: 2125, end_time: 2500, value: 'text' },
    { start: 51, end: 53, start_time: 2500, end_time: 2625, value: 'on' },
    { start: 54, end: 57, start_time: 2625, end_time: 2850, value: 'the' },
    { start: 58, end: 61, start_time: 2850, end_time: 3000, value: 'end' },
    { start: 62, end: 64, start_time: 3000, end_time: 3125, value: 'to' },
    { start: 65, end: 69, start_time: 3125, end_time: 3312, value: 'make' },
    { start: 70, end: 72, start_time: 3312, end_time: 3437, value: 'it' },
    { start: 73, end: 79, start_time: 3437, end_time: 4292.58, value: 'longer' },
  ],
}
```
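As one illustration of seeking by text position, the word-level chunks in the example above can map a clicked character index to an audio offset. A sketch using the first few words from the example (seekTimeFor is an illustrative name, not part of the API):

```typescript
type Chunk = {
  start_time: number
  end_time: number
  start: number
  end: number
  value: string
}

// The first few word-level chunks from the example output above.
const words: Chunk[] = [
  { start: 0, end: 4, start_time: 125, end_time: 250, value: 'This' },
  { start: 5, end: 7, start_time: 250, end_time: 375, value: 'is' },
  { start: 8, end: 9, start_time: 375, end_time: 500, value: 'a' },
  { start: 10, end: 18, start_time: 500, end_time: 937, value: 'sentence' },
]

// Map a character index (e.g. where the user clicked in the text)
// to a playback position in seconds. Index gaps resolve to the next
// word; clicks past the last word fall back to the start of the audio.
function seekTimeFor(index: number): number {
  const word = words.find((w) => w.end > index)
  return word ? word.start_time / 1000 : 0
}

// audio.currentTime = seekTimeFor(12) // jumps to 'sentence' at 0.5 s
```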