Quickstart

If you're new to the Speechify AI API, this guide provides a short getting-started introduction along with links to the more detailed documentation.

Getting started with the Speechify AI API involves:

  • Registering your account
  • Determining your use-case and figuring out the proper authentication mechanism
  • Calling the API directly (using any HTTP request library or utility) or through our official SDKs

Register Your Account

  1. Go to https://console.sws.speechify.com/ and sign up with your email/password or a social auth service.
  2. Navigate to the API Keys section and make sure you have the default API key created for you. If not, create one yourself.
  3. Go to the interactive playground page and generate some audio. You can experiment with different voices and audio generation settings to get a sense of what the API is capable of.
  4. You can also make a digital clone of your own voice.

Now that you have your account and the API key, let's see how to talk to the API from your code.

Make Your First API Request

To start talking to our API from code, you will need:

  1. The API base URL: https://api.sws.speechify.com/
  2. Your API key (YOUR_API_KEY in the examples below)
  3. An HTTP client of your choice (e.g. curl for shell scripts, fetch for Node.js, etc.)

Check the Recipes section for a list of examples demonstrating various uses of the API.

If you're very new to HTTP APIs in general, start with the simplest recipe there, and make sure you're able to run it before proceeding with this guide.
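
If you'd like to see a complete request in one place first, here is a minimal sketch in TypeScript for Node.js 18+ (which has fetch built in). The endpoint path, request fields (input, voice_id), response field (audio_data), and the example voice ID are assumptions for illustration; double-check them against the OpenAPI documentation, and substitute a voice ID you picked in the playground.

import { writeFile } from "node:fs/promises";

// Minimal sketch: synthesize a short phrase and save it as an MP3 file.
// Assumes POST /v1/audio/speech returns JSON with Base-64-encoded audio in
// an `audio_data` field; verify the exact paths and field names in the
// OpenAPI documentation.
const response = await fetch("https://api.sws.speechify.com/v1/audio/speech", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.SPEECHIFY_API_KEY}`, // YOUR_API_KEY
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    input: "<speak>Hello from the Speechify AI API!</speak>",
    voice_id: "george", // any voice ID you saw in the playground
  }),
});

if (!response.ok) {
  throw new Error(`Request failed: ${response.status} ${response.statusText}`);
}

const { audio_data } = await response.json();
await writeFile("hello.mp3", Buffer.from(audio_data, "base64"));
console.log("Wrote hello.mp3");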

Determine Your Auth Use-Case and Mechanism

The Speechify AI API is a paid product with regulated capabilities: it makes it possible to reproduce people's voices and generate natural-sounding speech. For both of these reasons, access to the API is protected.

Every app that uses a protected API typically falls into one of two categories:

  • Public clients. Examples include an in-browser web app and a native mobile or desktop app.
  • Confidential clients. These apps are more diverse and may include a server-only web app without a public API, a voice agent app (that you can only reach via a phone line or a messenger call), an internal CLI app available only to members of your team, etc.

Depending on your app's use case, you need to use either Access Token-based authentication or API keys. This may be confusing at first, but the two auth mechanisms coexist to cover the two different client types.

The rules to select the auth mechanism are simple, though:

  • If your application falls into the Confidential client category, you can use the API Keys.
  • Otherwise (your app is a Public client), you must use the Access Tokens.

Please refer to our Authentication guide for a detailed explanation of the whats, whys, and hows.
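
To make the distinction concrete, here is a hedged sketch of the Access Token flow for a Public client: your own backend (a Confidential client) keeps the API key secret and mints short-lived tokens that the browser can use. The token endpoint path, grant type, and scope name below are assumptions; take the exact values from the Authentication guide.

// Hypothetical sketch: a Confidential backend mints a short-lived Access Token
// for a Public client (e.g. your web app running in the browser).
// The endpoint path, grant type, and scope are assumptions; see the
// Authentication guide for the real values.
export async function mintAccessToken(): Promise<{ access_token: string }> {
  const response = await fetch("https://api.sws.speechify.com/v1/auth/tokens", {
    method: "POST",
    headers: {
      // The API key never leaves your server; only the short-lived token
      // is handed to the browser.
      Authorization: `Bearer ${process.env.SPEECHIFY_API_KEY}`,
      "Content-Type": "application/x-www-form-urlencoded",
    },
    body: new URLSearchParams({ grant_type: "client_credentials", scope: "audio:speech" }),
  });
  if (!response.ok) throw new Error(`Token request failed: ${response.status}`);
  return response.json();
}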

Direct HTTP calls or SDK?

If you're a seasoned web developer and have worked with various HTTP APIs in the past, please refer to our full OpenAPI documentation.

Note: no matter which authentication mechanism you're using (see the previous section), you'll be passing the API Key or the Access Token in the same standard way, in the Authorization header of each request:

Authorization: Bearer YOUR_API_KEY_OR_ACCESS_TOKEN

Without a valid header, requests will be met with a 401 Unauthorized status, ensuring that your data and interactions remain protected.

Are you writing code for the browser or server-side Node.js? Then you can use our official JS/TS SDK. It is a lightweight, thin wrapper on top of the standard fetch that adds some quality-of-life improvements, such as:

  • automatically decoding the audio data from Base-64,
  • automatically refreshing and updating the Access Tokens.
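
A rough sketch of what SDK usage looks like follows; the package name, client constructor, and method names are assumptions meant only to show the shape of the calls, so check the SDK README for the exact, current API.

// Rough sketch of SDK usage; the package, constructor, and method names below
// are assumptions. Consult the SDK README for the exact API.
import { SpeechifyClient } from "@speechify/api";

const client = new SpeechifyClient({ token: process.env.SPEECHIFY_API_KEY! });

// Per the list above, the SDK decodes the Base-64 audio for you.
const { audioData } = await client.tts.audio.speech({
  input: "<speak>Hello from the SDK!</speak>",
  voiceId: "george",
});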

Crafting Your input with SSML

The input parameter of the audio generation endpoints (speech, stream) is a special one.

For the most trivial use cases, you can send it as plain text. This works, but it doesn't give you fine-grained control over how the speech is synthesized.

For anything beyond the trivial, we recommend wrapping your input in Speech Synthesis Markup Language (SSML). While it may look like an unnecessary complication, SSML offers you meticulous control over how your text is spoken.

SSML, an XML-based markup language, empowers you to enrich your audio content with nuances such as tone, emphasis, and emotional delivery, using tags like <prosody>, <break>, and <emphasis>.

The most basic SSML is just your text wrapped in the <speak> tag:

<speak>Your content to be synthesized here</speak>
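
As an illustration, a slightly richer input could combine the tags mentioned above; the attribute values here are illustrative, so check the SSML documentation for the exact set of accepted values:

<speak>
  Welcome to the <emphasis level="strong">Speechify</emphasis> AI API.
  <break time="500ms"/>
  The pause you just heard was added with the break tag.
</speak>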

For an in-depth exploration of how SSML can transform your content and to stay updated on future enhancements, visit our SSML documentation.

Example: changing the speed of voice

Depending on your specific use case, you may want the speech to go slower or faster than the default. This is a common request, and a great example of where SSML is worth the trouble. Please check the <prosody> tag documentation for how to adjust not only the speed (rate), but also the voice pitch and volume.
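
For example, to slow the speech down, you could wrap the text in a <prosody> tag; the rate value shown here is illustrative, and the full list of accepted values is in the <prosody> documentation:

<speak>
  <prosody rate="slow">This sentence is spoken noticeably slower than the default.</prosody>
</speak>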

Personal (Cloned) Voices

Not only does Speechify provide an extensive list of standard voices, both male and female, but it also lets you create a digitized version of any human voice, for example, your own.

Please note that this is an advanced feature, available only to paying customers.

You can start experimenting with cloned voices right from your browser. Upload or record a sample of your voice, and the new entry will appear in the voice selector.

You can, of course, also create a custom voice via an API call and use its voice ID for speech synthesis.
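
Here is a hedged sketch of what that call could look like. The endpoint path and multipart form field names (name, sample, consent) are assumptions, as is the shape of the consent payload; verify them against the Voices section of the API reference.

import { readFile } from "node:fs/promises";

// Hedged sketch: create a cloned voice from an audio sample, then reuse the
// returned voice ID for speech synthesis. The path and field names are
// assumptions; check the Voices section of the API reference.
const form = new FormData();
form.append("name", "My cloned voice");
form.append(
  "sample",
  new Blob([await readFile("voice-sample.wav")], { type: "audio/wav" }),
  "voice-sample.wav",
);
// Voice cloning is a regulated capability, so explicit consent is required;
// the exact shape of this field is defined in the API reference.
form.append("consent", JSON.stringify({ fullName: "Jane Doe", email: "jane@example.com" }));

const response = await fetch("https://api.sws.speechify.com/v1/voices", {
  method: "POST",
  headers: { Authorization: `Bearer ${process.env.SPEECHIFY_API_KEY}` },
  body: form, // fetch sets the multipart boundary automatically
});

const voice = await response.json();
console.log("New voice ID:", voice.id);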


Happy building with Speechify's Text-to-Speech API!