Qwen3's most underrated feature: Voice embeddings

Tools 629 points 66 comments 3 weeks ago

Did you know that Qwen3 TTS uses voice embeddings for voice cloning? Your voice is turned into a 1024-dimensional vector (2048 for the 1.7B model), and from this vector alone the model can reproduce your custom voice.

The coolest part: since voices are just vectors, you can use math to modify them. You can average voices, swap gender, shift pitch, mix and match speakers, and even build an emotion space. It also enables semantic voice search!

The voice embedding model itself is just a tiny encoder with a few million parameters. I've ripped it out of the full TTS model so you can use the embedding model standalone. Check out my collection! :D I also have ONNX models for optimized web / front-end inference: https://huggingface.co/collections/marksverdhei/qwen3-voice-embedding

Voice embeddings can be used for inference in my vllm-omni fork until support lands upstream: https://github.com/heiervang-technologies/ht-vllm-omni
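
To make "math on voices" concrete, here's a minimal numpy sketch. The vectors below are random stand-ins for real embeddings (the actual encoder isn't loaded, and the sample counts and weights are made up for illustration), but the arithmetic is exactly what you'd do with real ones:

```python
import numpy as np

DIM = 1024  # embedding size for the base model (2048 for the 1.7B model)
rng = np.random.default_rng(0)

# Stand-ins for embeddings produced by the voice encoder.
voice_a = rng.normal(size=DIM)
voice_b = rng.normal(size=DIM)

# Mixing: a weighted average blends two speakers.
blend = 0.5 * voice_a + 0.5 * voice_b

# Attribute direction: subtract group means to get e.g. a "gender axis",
# then add it to any voice to shift that attribute.
male_voices = rng.normal(size=(10, DIM))    # embeddings of male speakers
female_voices = rng.normal(size=(10, DIM))  # embeddings of female speakers
gender_axis = male_voices.mean(axis=0) - female_voices.mean(axis=0)
shifted = voice_a + 0.8 * gender_axis  # voice_a, pushed along the axis

# Semantic voice search: rank a library of voices by cosine similarity
# to a query embedding and pick the closest match.
def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

library = rng.normal(size=(100, DIM))  # 100 stored voice embeddings
scores = library @ voice_a / (
    np.linalg.norm(library, axis=1) * np.linalg.norm(voice_a)
)
best_match = int(scores.argmax())
```

Feed `blend` or `shifted` back to the TTS model in place of a real embedding and you get the mixed or attribute-shifted voice.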
