Model Description: OuteTTS-0.2-500M is our improved successor to the v0.1 release. The model maintains the same approach of using audio prompts, with no architectural changes to the foundation model itself. Built on Qwen-2.5-0.5B, this version was trained on larger and more diverse datasets, resulting in significant improvements across all aspects of performance.
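The model is typically driven through the companion `outetts` Python package. The sketch below follows the usage pattern documented for the v0.2 release, but the class and method names shown here (`HFModelConfig_v1`, `InterfaceHF`, `create_speaker`) should be treated as assumptions and verified against the installed package version.

```python
# Sketch of audio-prompt (voice cloning) usage via the companion `outetts`
# package. Names follow the v0.2 model card as recalled here; treat them as
# assumptions and check against the current release.
import outetts

# Point the interface at the 500M checkpoint.
model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",
)
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)

# An "audio prompt": a short reference clip plus its transcript. The voice is
# conditioned purely through the prompt; no architectural changes are involved.
speaker = interface.create_speaker(
    audio_path="reference.wav",
    transcript="Transcript of the reference clip.",
)

output = interface.generate(
    text="Hello, this is a test of OuteTTS.",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker,
)
output.save("output.wav")
```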
Read more in our blog →
Abstract: We present OuteTTS, a novel approach to text-to-speech synthesis that leverages pure language modeling without the need for external adapters or complex architectures. Our 350M-parameter model demonstrates that high-quality speech synthesis is achievable through a straightforward approach using crafted prompts and audio tokens.
Introduction: Text-to-speech synthesis has traditionally relied on complex architectures and specialized models. With OuteTTS, we demonstrate that a relatively small language model can learn to generate high-quality speech through a simple yet effective approach. Our model, with just 350M parameters, showcases the potential of using language models directly for speech synthesis.
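To make the approach concrete, here is a minimal sketch of prompt-driven audio-token generation using the standard Hugging Face `transformers` API. The prompt layout, the `<|audio_start|>` marker, and the commented-out `decode_audio_tokens` helper are illustrative assumptions, not the model's actual interface; the real prompt format is defined by the model itself.

```python
# Minimal sketch of LM-based TTS: a decoder-only language model
# autoregressively continues a text prompt with discrete audio tokens,
# which a neural codec then turns back into a waveform.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "OuteAI/OuteTTS-0.2-500M"  # Qwen-2.5-0.5B-based checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Hypothetical prompt layout: plain text followed by a marker after which
# the model emits audio tokens (the real template is model-specific).
prompt = "hello world<|audio_start|>"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    generated = model.generate(
        **inputs,
        max_new_tokens=2048,       # audio tokens run long relative to text
        do_sample=True,
        temperature=0.1,           # low temperature keeps speech stable
        repetition_penalty=1.1,
    )

# Everything after the prompt is, in this sketch, a stream of codec tokens.
audio_token_ids = generated[0, inputs["input_ids"].shape[1]:]

# A neural codec would map these ids back to audio; decode_audio_tokens is
# a placeholder for that step, not a real function.
# waveform = decode_audio_tokens(audio_token_ids)
```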
Read more in our blog →
OuteTTS-0.2-500M: A successor to the v0.1 release, with significant improvements across all aspects of performance.
Teaching Language Models to Speak via Audio Tokens and Forced Alignment
The Lite Oute 2 Mamba2Attn 250M is our latest third-generation model, showcasing the Mamba2 architecture with attention layers.
Lite Oute 1: 300M and ultra-compact 65M parameter models, offering versatility and efficiency for various deployment scenarios.