mistralai/Voxtral-Mini-4B-Realtime-2602 · Hugging Face

Voxtral Mini 4B Realtime 2602 is a **multilingual, realtime speech-transcription model** and among the first open-source solutions to achieve accuracy comparable to offline systems with a delay of **<500ms**. It supports **13 languages** and outperforms existing open-source baselines across a range of tasks, making it well suited to applications like voice assistants and live subtitling. Built with a **natively streaming architecture** and a custom causal audio encoder, it allows configurable transcription delays (240ms to 2.4s), letting users balance **latency and accuracy** based on their needs. At a **480ms delay**, it matches the performance of leading offline open-source transcription models, as well as realtime APIs. As a **4B-parameter model**, it is optimized for **on-device deployment** and runs in realtime on devices with minimal hardware resources, with throughput exceeding 12.5 tokens/second.
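To put the delay settings in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes (this is not stated in the blurb) that 12.5 tokens/second is the rate the model has to sustain to keep pace with live audio, so one token would cover about 80ms of audio. The function names are made up for illustration and are not part of any Voxtral or Mistral API.

```python
# Assumption: 12.5 tokens/second is the rate needed to keep up with live audio,
# so one token corresponds to 1000 / 12.5 = 80 ms of audio.
TOKENS_PER_SECOND = 12.5


def ms_per_token(tokens_per_second: float = TOKENS_PER_SECOND) -> float:
    """Milliseconds of audio covered by one token at the given rate."""
    return 1000.0 / tokens_per_second


def delay_in_tokens(delay_ms: float, tokens_per_second: float = TOKENS_PER_SECOND) -> float:
    """Express a transcription delay as a number of tokens."""
    return delay_ms / ms_per_token(tokens_per_second)


# The minimum, benchmark, and maximum delays quoted in the description.
for delay_ms in (240, 480, 2400):
    print(f"{delay_ms} ms delay ~ {delay_in_tokens(delay_ms):g} tokens")
```

Under that assumption, the configurable range of 240ms to 2.4s is the model waiting for somewhere between a handful and a few dozen tokens' worth of audio before committing to text, which is where the latency/accuracy trade-off comes from.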

