Nemo 30B is insane. 1M+ token CTX on one 3090

Tools 386 points 105 comments 1 month ago

Been playing around with llama.cpp and some 30-80B parameter models with CPU offloading. I currently have one 3090 and 32 GB of RAM, and I'm very impressed by Nemo 30B: a 1M+ token context cache, running on one 3090 with the experts offloaded to CPU, and it still does 35 t/s, which is faster than I can read at least. Models are usually slow as fuck at this large a context window. Feed it a whole book or research paper and it's done summarizing in a few minutes. This really makes long context windows on local hardware possible. The only other contender I've tried is Seed OSS 36B, and it was about 20 t/s slower.
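For anyone wanting to try a similar setup, here's a rough sketch of a llama.cpp launch that keeps the dense layers on the GPU while offloading the MoE expert tensors to CPU. The model path, context size, and quantization choices are placeholders (not my exact command), and flag behavior can vary between llama.cpp builds:

```shell
# Hypothetical llama.cpp server launch for a MoE model on a single 24 GB GPU.
# -ngl 99 nominally offloads all layers to the GPU, while --n-cpu-moe keeps
# the MoE expert weights in system RAM. Quantizing the KV cache (q8_0) helps
# a very large context fit in VRAM. Paths and values below are placeholders.
llama-server \
  -m ./models/nemo-30b-q4_k_m.gguf \
  -c 1000000 \
  -ngl 99 \
  --n-cpu-moe 99 \
  --flash-attn \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

Since only the small shared/dense tensors and the KV cache live on the GPU, the 3090's VRAM goes mostly to context, which is why this trick makes huge context windows viable on one card.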
