Language Support | Speechify API

Speechify text-to-speech models synthesize speech in multiple languages. The API handles both single-language text and mixed-language input.

Fully Supported Languages

The following languages are fully supported for speech synthesis:

Language	Code
English	en
French	fr-FR
German	de-DE
Spanish	es-ES
Portuguese (Brazil)	pt-BR
Portuguese (Portugal)	pt-PT

Beta Languages

The following languages are currently in beta (we’re actively improving them and welcome feedback):

Language	Code
Arabic	ar-AE
Danish	da-DK
Dutch	nl-NL
Estonian	et-EE
Finnish	fi-FI
Greek	el-GR
Hebrew	he-IL
Hindi	hi-IN
Italian	it-IT
Japanese	ja-JP
Norwegian	nb-NO
Polish	pl-PL
Russian	ru-RU
Swedish	sv-SE
Turkish	tr-TR
Ukrainian	uk-UA
Vietnamese	vi-VN

Coming Soon

We will soon support these additional languages:

Language	Code
Belarusian	be-BY
Bengali	bn-IN
Bulgarian	bg-BG
Cantonese	zh-HK
Catalan	ca-ES
Croatian	hr-HR
Czech	cs-CZ
Filipino	fil-PH
Georgian	ka-GE
Gujarati	gu-IN
Hungarian	hu-HU
Indonesian	id-ID
Korean	ko-KR
Malay	ms-MY
Mandarin	zh-CN
Marathi	mr-IN
Nepali	ne-NP
Persian	fa-IR
Romanian	ro-RO
Serbian	sr-RS
Slovak	sk-SK
Tamil	ta-IN
Telugu	te-IN
Thai	th-TH
Urdu	ur-PK

We’re actively working on expanding this list and will update this document as new languages are added to the platform.
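As a rough illustration, the fully supported and beta tables above can be turned into a small lookup that an application consults before sending text for synthesis. The helper name `support_tier` is hypothetical and not part of the Speechify API. The fallback from a regional code such as en-US to the bare `en` entry is an assumption, made because English is listed without a region.

```python
from typing import Optional

# Codes copied from the "Fully Supported Languages" and "Beta Languages" tables.
FULLY_SUPPORTED = {"en", "fr-FR", "de-DE", "es-ES", "pt-BR", "pt-PT"}
BETA = {
    "ar-AE", "da-DK", "nl-NL", "et-EE", "fi-FI", "el-GR", "he-IL", "hi-IN",
    "it-IT", "ja-JP", "nb-NO", "pl-PL", "ru-RU", "sv-SE", "tr-TR", "uk-UA",
    "vi-VN",
}


def support_tier(code: str) -> Optional[str]:
    """Return "full", "beta", or None for a language code (hypothetical helper)."""
    # Try the exact code first, then the bare language subtag.
    # Assumption: regional English codes such as "en-US" map to the "en" entry.
    for candidate in (code, code.split("-")[0]):
        if candidate in FULLY_SUPPORTED:
            return "full"
        if candidate in BETA:
            return "beta"
    return None


for code in ("fr-FR", "ja-JP", "en-US", "ko-KR"):
    print(code, support_tier(code))
```

Codes from the "Coming Soon" table are deliberately left out, so they return None until they move into one of the two supported tiers.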

Using the `language` Parameter

The speech synthesis endpoints (/v1/audio/speech and /v1/audio/stream) accept an optional language parameter. Its value should be a locale code in language-REGION form (e.g., en-US, fr-FR).

When to specify the language:

Known single language: If you know the input text is entirely in one language, set the language parameter. This results in better audio quality.
Unknown or mixed language: If you’re unsure of the input language or the text contains multiple languages, omit the language parameter. Speechify models will automatically detect and handle the language(s) in the input.
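The two cases above might look like this when building a request body. This is a minimal sketch, not the official client: the function name `build_speech_request`, the `input` and `voice_id` field names, and the `example-voice` value are assumptions for illustration. Only the optional `language` parameter is described in this document.

```python
import json
from typing import Optional


def build_speech_request(text: str, voice_id: str,
                         language: Optional[str] = None) -> dict:
    """Build a JSON body for /v1/audio/speech or /v1/audio/stream.

    The "input" and "voice_id" field names are assumed for illustration.
    """
    body = {"input": text, "voice_id": voice_id}
    # Include "language" only when the text is known to be in one language;
    # leaving it out lets the model detect the language(s) on its own.
    if language is not None:
        body["language"] = language
    return body


# Known single language: pass the locale code explicitly.
single = build_speech_request("Bonjour tout le monde.", "example-voice",
                              language="fr-FR")

# Unknown or mixed language: omit the parameter entirely.
mixed = build_speech_request("Hello! Bonjour ! Hallo!", "example-voice")

print(json.dumps(single, ensure_ascii=False))
print(json.dumps(mixed, ensure_ascii=False))
```

The key is omitted instead of being sent as null or an empty string, which matches the guidance to leave the parameter out when the language is unknown or mixed.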

Voice Cloning Languages

There are no language limitations for voice cloning. Speechify can produce high-quality cloned voices from short samples (approximately 1 minute of speech is recommended) and use the same voice to synthesize speech in any supported language.