KaniTTS2 — open-source 400M TTS model with voice cloning, runs in 3GB VRAM. Pretrain code included.


Hey everyone, we just open-sourced KaniTTS2, a text-to-speech model designed for real-time conversational use cases.

## Models

Multilingual (English, Spanish), and English-specific with local accents. Language support is actively expanding, with more languages coming in future updates.

## Specs

* 400M parameters (BF16)
* 22kHz sample rate
* Voice cloning
* ~0.2 RTF on RTX 5090
* 3GB GPU VRAM
* Pretrained on ~10k hours of speech
* Training took 6 hours on 8x H100s

## Full pretrain code: train your own TTS from scratch

This is the part we're most excited to share. We're releasing the complete pretraining framework so anyone can train a TTS model for their own language, accent, or domain.

## Links

* Pretrained model: https://huggingface.co/nineninesix/kani-tts-2-pt
* English model: https://huggingface.co/nineninesix/kani-tts-2-en
* Pretrain code: https://github.com/nineninesix-ai/kani-tts-2-pretrain
* HF Spaces: https://huggingface.co/spaces/nineninesix/kani-tts-2-pt, https://huggingface.co/spaces/nineninesix/kanitts-2-en
* License: Apache 2.0

Happy to answer any questions. Would love to see what people build with this, especially for underrepresented languages.
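For anyone unfamiliar with the RTF metric in the specs, here is a back-of-the-envelope sketch (my own arithmetic, not code from the repo) of what ~0.2 RTF and the BF16 parameter count imply in practice:

```python
# Back-of-the-envelope checks on the specs above (not from the KaniTTS2 repo).

def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of generated audio.
    Values below 1.0 mean faster-than-real-time synthesis."""
    return generation_seconds / audio_seconds

# ~0.2 RTF means roughly 10 s of speech is synthesized in ~2 s of wall time.
assert real_time_factor(2.0, 10.0) == 0.2

# BF16 weights: 400M params x 2 bytes/param ~= 0.8 GB of weight memory,
# so the quoted 3 GB VRAM leaves headroom for activations and the audio codec.
weight_gb = 400e6 * 2 / 1e9
print(f"weights: ~{weight_gb:.1f} GB")  # ~0.8 GB
```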
