Text-to-Speech Models
Learn about Speechify’s advanced text-to-speech models and their capabilities
Overview
Speechify’s advanced text-to-speech models are designed to meet specific user needs, from simple text reading to complex multilingual and emotional tone integration. Each model offers unique capabilities optimized for different use cases.
Simba English
Simba English is Speechify’s standard text-to-speech model, designed to deliver clear, natural voice output for reading English text. It focuses on a consistent user experience and supports both zero-shot voice cloning and fine-tuning. Its audio output is distinctly different from that of other models.
Key Features
- Produces clear and natural speech
- Maintains uniform quality across all outputs
- Zero-shot voice cloning: creates a voice clone from a short audio sample
- Fine-tuning: creates a voice clone from hours of the speaker’s audio, providing significantly better results than zero-shot voice cloning
Supported Languages
- English
Simba Multilingual
This model is currently experimental and may change.
Simba Multilingual works with all supported languages and can handle multiple languages within a single sentence. Its audio output is distinctly different from that of other models.
Key Features
- Supports multiple languages within a single sentence
- Zero-shot voice cloning: creates a voice clone from a short audio sample
- Fine-tuning: creates a voice clone from hours of the speaker’s audio, providing significantly better results than zero-shot voice cloning
Supported Languages
- English
- Spanish
- French
- And many more; see the list of supported languages for details.
FAQ
What is the difference between the models?
Each model is optimized for different use cases. Simba English focuses on clear English speech, while Simba Multilingual covers all supported languages and can mix several of them in a single sentence.
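As a rough illustration of that choice, the sketch below picks a model from the languages that appear in a text. The identifiers `simba-english` and `simba-multilingual` and the helper `choose_model` are assumptions made for this example, not confirmed API values; check the API reference for the exact strings.

```python
# Hypothetical model identifiers (assumed for illustration only).
SIMBA_ENGLISH = "simba-english"
SIMBA_MULTILINGUAL = "simba-multilingual"


def choose_model(language_codes):
    """Return a model identifier for the languages used in a text.

    `language_codes` is an iterable of language tags such as "en",
    "en-US", or "es". This helper is a sketch, not part of any SDK.
    """
    # Reduce tags like "en-US" or "EN" to their primary subtag, "en".
    primary = {
        code.strip().lower().split("-")[0]
        for code in language_codes
        if code.strip()
    }
    if not primary:
        raise ValueError("at least one language code is required")
    # English-only text fits Simba English; anything else, including
    # mixed-language text, needs Simba Multilingual.
    return SIMBA_ENGLISH if primary == {"en"} else SIMBA_MULTILINGUAL


print(choose_model(["en-US"]))
print(choose_model(["en", "es"]))
```

Defaulting English-only text to Simba English is one reasonable policy here, given that Simba Multilingual is marked experimental.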
Can I use multiple languages in a single request?
Yes. With the Simba Multilingual model, you can mix multiple languages in one request, even within a single sentence.
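A minimal sketch of what such a request body might look like is shown below. The field names (`input`, `voice_id`, `model`), the model identifier, and the voice ID are all assumptions for illustration; the actual request schema is defined in the API reference, and this example only builds the JSON without sending it.

```python
import json


def build_speech_request(text, voice_id, model="simba-multilingual"):
    """Build a JSON-serialisable body for a mixed-language request.

    All field names and the default model string are hypothetical.
    """
    if not text.strip():
        raise ValueError("text must not be empty")
    return {"input": text, "voice_id": voice_id, "model": model}


# One sentence that mixes English, Spanish, and French.
payload = build_speech_request(
    "I told her 'muchas gracias' and she answered 'de rien'.",
    voice_id="example-voice",  # placeholder, not a real voice ID
)

# ensure_ascii=False keeps accented characters readable in the body.
body = json.dumps(payload, ensure_ascii=False)
print(body)
```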
What is zero-shot voice cloning?
Zero-shot voice cloning creates a voice clone from a short audio sample of the speaker, without requiring extensive training data.
What is fine-tuning?
Fine-tuning creates a high-quality voice clone using hours of the speaker’s audio, providing significantly better results than zero-shot voice cloning.