Posted by u/Easy_Calligrapher790
Free ASIC Llama 3.1 8B inference at 16,000 tok/s - no, not a joke
Hello everyone,

A fast inference hardware startup, Taalas, has released a free chatbot interface and API endpoint running on their chip. They chose a small model intentionally as a proof of concept, and it worked out really well: it runs at 16k tps! I know this model is quite limited, but there is likely a group of users who will find it sufficient and benefit from the hyper-speed on offer. Anyway, they are of course moving on to bigger and better models, but are giving free access to their proof of concept to people who want it.

More info: https://taalas.com/the-path-to-ubiquitous-ai/
Chatbot demo: https://chatjimmy.ai/
Inference API service: https://taalas.com/api-request-form

It's worth trying out the chatbot even just for a bit; the speed is really something to experience. Cheers!

EDIT: It's worth noting that the chatbot demo actually undersells the speed on display. Anything over a few hundred tps is perceived as instantaneous, so the experience of 1k tps vs 16k tps should feel pretty similar. You are only seeing the bottom few percent of the speed on offer. A proper demo would be a token-intensive workload against their API. Now THAT would be something to see.
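The point in the EDIT can be sketched with rough arithmetic. The 16k and 1k tps figures come from the post; the token counts below are illustrative assumptions, not measurements:

```python
# Rough back-of-envelope: wall-clock time to stream N tokens at a given
# throughput. Shows why a short chat reply hides the speed difference
# while a token-intensive workload exposes it.

def generation_time(n_tokens: int, tok_per_s: float) -> float:
    """Seconds to emit n_tokens at a steady tok_per_s."""
    return n_tokens / tok_per_s

# A typical ~300-token chat reply (assumed length):
short_fast = generation_time(300, 16_000)  # 0.01875 s
short_slow = generation_time(300, 1_000)   # 0.3 s -- both feel instant

# A token-intensive batch job, e.g. 100k tokens (assumed workload):
long_fast = generation_time(100_000, 16_000)  # 6.25 s
long_slow = generation_time(100_000, 1_000)   # 100 s
```

For the short reply, both speeds finish well under human reaction time, so the chatbot demo can only show "instant" either way; the 16x gap only becomes visible on the long workload.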