Text-to-Speech Models

Learn about Speechify’s advanced text-to-speech models and their capabilities

Overview

Speechify’s text-to-speech models cover needs ranging from plain text reading to multilingual speech and control over emotional tone. Each model is optimized for a different use case.

Simba English

Simba English is Speechify’s standard text-to-speech model, designed to deliver clear, natural voice output for reading text. It focuses on a consistent user experience and supports both fine-tuning and zero-shot voice cloning. Its audio output is distinctly different from that of the other models.

Voice Clarity

Produces clear and natural speech

Consistency

Maintains uniform quality across all outputs

Zero-shot Voice Cloning

Creates a voice clone from a short audio sample

Fine-tuning

Creates a voice clone from hours of the speaker’s audio, providing significantly better results than zero-shot voice cloning

Supported Languages

  • English

Simba Multilingual

This model is currently experimental and subject to change.

Simba Multilingual works with all supported languages and can mix multiple languages within a single sentence. Its audio output is distinctly different from that of the other models.

Language Flexibility

Supports multiple languages within a single sentence

Zero-shot Voice Cloning

Creates a voice clone from a short audio sample

Fine-tuning

Creates a voice clone from hours of the speaker’s audio, providing significantly better results than zero-shot voice cloning

Simba Turbo

This model is deprecated.

Simba Turbo is a text-to-speech model that emphasizes faster processing and control over the emotional tone of the voice output. It is tailored for users who need quick response times and want to adjust the emotional undertone to match the context of the text being read. Its audio output is distinctly different from that of the other models.

Speed

Delivers faster processing to reduce wait times

Emotional Control

Enables control over emotional expressions to match the context of the text

Speech Cadence Control

Allows adjustment of speech flow for dynamic presentations

Zero-shot Voice Cloning

Creates a voice clone from a short audio sample

Fine-tuning

Creates a voice clone from hours of the speaker’s audio, providing significantly better results than zero-shot voice cloning

Supported Languages

  • English

FAQ

What is the difference between the models?

Each model is optimized for a different use case. Simba English focuses on clear English speech, Simba Multilingual supports multiple languages in a single sentence, and Simba Turbo emphasizes speed and emotional control.

Can I use multiple languages in a single sentence?

Yes. With the Simba Multilingual model, you can use multiple languages within a single sentence.

What is zero-shot voice cloning?

Zero-shot voice cloning creates a voice clone from a short audio sample of the speaker, without requiring extensive training data.

What is fine-tuning?

Fine-tuning creates a high-quality voice clone from hours of the speaker’s audio, providing significantly better results than zero-shot voice cloning.
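The model descriptions on this page can be condensed into a small selection helper. The following is a minimal sketch, not part of any Speechify SDK: the function name `choose_model` and the identifier strings (`simba-english`, `simba-multilingual`, `simba-turbo`) are assumptions made for illustration, so check the API reference for the exact values before using them in a request.

```python
import warnings
from typing import Sequence

# Assumed identifiers for illustration only; confirm them in the API reference.
SIMBA_ENGLISH = "simba-english"
SIMBA_MULTILINGUAL = "simba-multilingual"
SIMBA_TURBO = "simba-turbo"


def choose_model(languages: Sequence[str], emotional_control: bool = False) -> str:
    """Pick a model identifier from the capabilities described on this page.

    `languages` holds language codes such as "en" or "es" (hypothetical format).
    """
    if not languages:
        raise ValueError("at least one language code is required")

    english_only = all(lang.lower().startswith("en") for lang in languages)

    if not english_only:
        # Only Simba Multilingual is described as handling several languages.
        if emotional_control:
            raise ValueError(
                "emotional control is only described for the English-only Simba Turbo"
            )
        return SIMBA_MULTILINGUAL

    if emotional_control:
        # Simba Turbo is marked as deprecated on this page, so warn the caller.
        warnings.warn("Simba Turbo is deprecated", DeprecationWarning, stacklevel=2)
        return SIMBA_TURBO

    return SIMBA_ENGLISH


if __name__ == "__main__":
    print(choose_model(["en"]))        # prints simba-english
    print(choose_model(["en", "es"]))  # prints simba-multilingual
```

Raising an error for the combination of non-English text and emotional control reflects only what this page states. It is a conservative choice for the sketch, not a documented API behavior.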