Text-to-Speech Models
Learn about Speechify’s advanced text-to-speech models and their capabilities
Overview
Speechify’s advanced text-to-speech models are designed to meet specific user needs, from straightforward text reading to multilingual speech and control over emotional tone. Each model offers unique capabilities optimized for different use cases.
Simba English
Speechify’s Simba English text-to-speech model offers standard capabilities designed to deliver clear and natural voice output for reading texts. The model focuses on a consistent user experience and supports both fine-tuning and zero-shot voice cloning. The audio output of this model is distinctly different from that of the other models.
Key Features
- Produces clear and natural speech
- Maintains uniform quality across all outputs
- Zero-shot voice cloning: creates a voice clone from a short audio sample
- Fine-tuning: creates a voice clone from hours of the speaker’s audio, providing significantly better results than zero-shot voice cloning
Supported Languages
- English
Simba Multilingual
This model is currently experimental and may be subject to changes.
Simba Multilingual works with all supported languages and can handle multiple languages within a single sentence. The audio output of this model is distinctly different from that of the other models.
Key Features
- Supports multiple languages within a single sentence
- Zero-shot voice cloning: creates a voice clone from a short audio sample
- Fine-tuning: creates a voice clone from hours of the speaker’s audio, providing significantly better results than zero-shot voice cloning
Supported Languages
- English
- Spanish
- French
- And many more; check the list of supported languages for details.
Simba Turbo
Simba Turbo is a text-to-speech model that emphasizes faster processing speeds and the ability to control emotional tones in the voice output. It is tailored for users who require quick response times and want to adjust the emotional undertones to better match the context of the text being read. The audio output of this model is distinctly different from that of the other models.
Key Features
- Delivers faster processing to reduce wait times
- Enables control over emotional expressions to match the context of the text
- Allows adjustment of speech flow for dynamic presentations
- Zero-shot voice cloning: creates a voice clone from a short audio sample
- Fine-tuning: creates a voice clone from hours of the speaker’s audio, providing significantly better results than zero-shot voice cloning
Supported Languages
- English
FAQ
What is the difference between the models?
Each model is optimized for different use cases. Simba English focuses on clear English speech, Simba Multilingual supports multiple languages in a single sentence, and Simba Turbo emphasizes speed and emotional control.
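As a rough illustration of that comparison, the sketch below encodes the trade-offs as a small selection helper. The function name `pick_model` and its parameters are hypothetical and not part of any Speechify SDK; it returns the model names used on this page, not API identifiers. It also assumes, based only on the language lists above, that speed and emotional control are available for English text only.

```python
def pick_model(language: str = "English",
               mixed_language: bool = False,
               need_emotion_control: bool = False,
               need_low_latency: bool = False) -> str:
    """Return the name of the model on this page that fits the request.

    Hypothetical helper: the rules mirror the feature lists on this page,
    not an official selection algorithm.
    """
    wants_turbo = need_emotion_control or need_low_latency
    wants_multilingual = mixed_language or language != "English"

    if wants_multilingual:
        # Assumption: Simba Turbo lists English only, so its features
        # cannot be combined with non-English or mixed-language text.
        if wants_turbo:
            raise ValueError(
                "No model on this page lists both multilingual support "
                "and speed/emotion control"
            )
        return "Simba Multilingual"

    if wants_turbo:
        return "Simba Turbo"

    # Default: clear, consistent English speech.
    return "Simba English"


if __name__ == "__main__":
    print(pick_model())
    print(pick_model(language="Spanish"))
    print(pick_model(need_low_latency=True))
```

Keep in mind that Simba Multilingual is marked experimental, so a production system may want a fallback for when its behavior changes.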
Can I use multiple languages in a single request?
Yes, with the Simba Multilingual model, you can use multiple languages within a single sentence.
What is zero-shot voice cloning?
Zero-shot voice cloning creates a voice clone from a short audio sample of the speaker, without requiring extensive training data.
What is fine-tuning?
Fine-tuning creates a high-quality voice clone using hours of the speaker’s audio, providing significantly better results than zero-shot voice cloning.