Nemo 30B is insane. 1M+ token CTX on one 3090

Tools 386 points 105 comments 1 month ago

Been playing around with llama.cpp and some 30-80B parameter models with CPU offloading. I currently have one 3090 and 32 GB of RAM, and I'm very impressed by Nemo 30B: a 1M+ token context cache, running on one 3090 with the experts offloaded to CPU, and it still does 35 t/s, which is faster than I can read at least. Models are usually slow as fuck at this large a context window. Feed it a whole book or research paper and it's done summarizing in a few minutes. This really makes long context windows on local hardware possible. The only other contender I've tried is Seed OSS 36B, and it was about 20 t/s slower.
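For anyone wanting to try a similar setup, here's a rough sketch of a llama.cpp launch that keeps the dense layers on the GPU while offloading the MoE expert tensors to CPU. The model path, context size, and quantization choices are placeholders (not my exact command), and flag behavior can vary between llama.cpp builds:

```shell
# Hypothetical llama.cpp server launch for a MoE model on a single 24 GB GPU.
# -ngl 99 nominally offloads all layers to the GPU, while --n-cpu-moe keeps
# the MoE expert weights in system RAM. Quantizing the KV cache (q8_0) helps
# a very large context fit in VRAM. Paths and values below are placeholders.
llama-server \
  -m ./models/nemo-30b-q4_k_m.gguf \
  -c 1000000 \
  -ngl 99 \
  --n-cpu-moe 99 \
  --flash-attn \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

Since only the small shared/dense tensors and the KV cache live on the GPU, the 3090's VRAM goes mostly to context, which is why this trick makes huge context windows viable on one card.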
