Posted by u/bobaburger
Qwen3-Coder-Next on RTX 5060 Ti 16 GB - Some numbers
About 2 weeks ago, I posted about running GLM-4.7-Flash on 16 GB of VRAM here: www.reddit.com/r/LocalLLaMA/comments/1qlanzn/glm47flashreap_on_rtx_5060_ti_16_gb_200k_context/. Today, let's squeeze an even bigger model into the poor rig.

Hardware:

* AMD Ryzen 7 7700X
* 32 GB DDR5-6000 RAM
* RTX 5060 Ti 16 GB

Model: unsloth/Qwen3-Coder-Next-GGUF `Q3_K_M`

llama.cpp version: llama.cpp@b7940

The llama.cpp command:

```
llama-server -m ./Qwen3-Coder-Next-Q3_K_M.gguf -c 32768 -np 1 -t 8 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --jinja --fit on -fa 1
```

When I started, I didn't expect much, given that my best result for GLM-4.7-Flash was something like ~300 t/s pp and 14 t/s gen. Maybe I'd end up with a lot of OOMs and crashes. But, to my surprise, the card pulled it off well!

When llama.cpp is fully loaded, it takes **15.1 GB** of GPU memory and **30.2 GB** of RAM, so the rig is almost at its memory limit. During prompt processing, GPU usage was about **35%** and CPU usage about **15%**. During token generation, it was **45%** GPU and **25%-45%** CPU. So perhaps there is some room to squeeze in some tuning here.

Does it run? Yes, and it's quite fast for a 5060!

|Metric|Task 2 (Large Context)|Task 190 (Med Context)|Task 327 (Small Context)|
|:-|:-|:-|:-|
|Prompt Eval (Prefill)|154.08 t/s|225.14 t/s|118.98 t/s|
|Generation (Decode)|16.90 t/s|16.82 t/s|18.46 t/s|

The above run was with a 32k context size. Later on, I tried again with a 64k context size, and the speed did not change much.

Is it usable? I'd say yes. Not Opus 4.5 or Gemini Flash usable, but pretty close to my experience back when Claude Sonnet 3.7 or 4 was still a thing. One thing that sticks out: this model uses way fewer tool calls than Opus, so it feels fast. It seems to read a whole file all at once when needed, rather than grepping every 200 lines like the Claude brothers. One-shotting something seems to work pretty well, until it runs into bugs.
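If you want to poke the server directly instead of going through a coding agent, here's a minimal sketch that mirrors the sampling flags above. It assumes llama-server's default port 8080 and its OpenAI-compatible chat endpoint; as far as I know llama.cpp also accepts its extra sampling fields (`top_k`, `min_p`) on that endpoint, but treat that as an assumption:

```python
# Sketch: query a local llama-server with the same sampling settings
# as the llama-server flags above. Port 8080 is llama-server's default;
# adjust the URL if you launched it with --port.
import json
import urllib.request

def make_payload(prompt: str) -> dict:
    """Mirror the --temp/--top-p/--top-k/--min-p flags in the request body."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 40,    # llama.cpp-specific extension field (assumption)
        "min_p": 0.01,  # llama.cpp-specific extension field (assumption)
    }

def ask(prompt: str,
        url: str = "http://localhost:8080/v1/chat/completions") -> str:
    """POST the prompt and return the assistant's reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(make_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Handy for quick A/B tests of sampling settings without restarting the agent session.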
In my example, I asked the model to create a web-based chess game with a Python backend, connected via WebSocket. The model showed that it can debug a problem by jumping back and forth between frontend and backend code very well. When facing a problem, it will first hypothesize a cause, then work its way through the code to verify it. Then there will be a lot of "But wait" and "Hold on", followed by a tool call to read some files, and then a change of direction. Sometimes it works. Sometimes it just burns through the tokens and ends up hitting the context limit. Maybe that's because I was using Q3_K_M, and higher quants would do better here.

Some screenshots:

https://gist.github.com/user-attachments/assets/8d074a76-c441-42df-b146-0ae291af17df

https://gist.github.com/user-attachments/assets/3aa3a845-96cd-4b23-b6d9-1255036106db

You can see the Claude session logs and llama.cpp logs of the run here: https://gist.github.com/huytd/6b1e9f2271dd677346430c1b92893b57

**Update:** So, I managed to get some time to sit down and run some tests again. This time, I'm trying to find the sweet spot for `--n-cpu-moe`. This big *ss model has 512 experts per MoE layer. I'll start with `ncmoe = 16`:

```
% llama-bench -m ./Qwen3-Coder-Next-Q3_K_M.gguf -ngl 99 -ncmoe 16 -fa 1 -t 8 --mmap 0 --no-warmup
```

| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3next 80B.A3B Q3_K - Medium | 35.65 GiB | 79.67 B | CUDA | 99 | 1 | pp512 | 269.74 ± 57.76 |
| qwen3next 80B.A3B Q3_K - Medium | 35.65 GiB | 79.67 B | CUDA | 99 | 1 | tg128 | 5.51 ± 0.03 |

Definitely a no-go: the weights filled up the whole GPU and spilled over into shared GPU memory, which is extremely slow. Let's do 64 then.
```
% llama-bench -m ./Qwen3-Coder-Next-Q3_K_M.gguf -ngl 99 -ncmoe 64 -fa 1 -t 8 --no-warmup
ggml_cuda_init: found 1 CUDA devices:
```

| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3next 80B.A3B Q3_K - Medium | 35.65 GiB | 79.67 B | CUDA | 99 | 1 | pp512 | 21.23 ± 12.52 |
| qwen3next 80B.A3B Q3_K - Medium | 35.65 GiB | 79.67 B | CUDA | 99 | 1 | tg128 | 12.45 ± 0.79 |

What's happening here is that we get better tg speed, but pp dropped hard. The GPU was under-utilized; only half of the VRAM was filled.

Going back to `ncmoe = 32` seems to work: no more spilling over into the slow shared GPU memory, everything fits nicely across GPU memory and system memory.

```
% llama-bench -m ./Qwen3-Coder-Next-Q3_K_M.gguf -ngl 99 -ncmoe 32 -fa 1 -t 8 --mmap 0 --no-warmup
```

| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3next 80B.A3B Q3_K - Medium | 35.65 GiB | 79.67 B | CUDA | 99 | 1 | pp512 | 275.89 ± 65.48 |
| qwen3next 80B.A3B Q3_K - Medium | 35.65 GiB | 79.67 B | CUDA | 99 | 1 | tg128 | 20.21 ± 0.57 |

So 32 was a safe number; let's try something lower, like 28:

```
% llama-bench -m ./Qwen3-Coder-Next-Q3_K_M.gguf -ngl 99 -ncmoe 28 -fa 1 -t 8 --mmap 0 --no-warmup
```

| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3next 80B.A3B Q3_K - Medium | 35.65 GiB | 79.67 B | CUDA | 99 | 1 | pp512 | 253.92 ± 59.39 |
| qwen3next 80B.A3B Q3_K - Medium | 35.65 GiB | 79.67 B | CUDA | 99 | 1 | tg128 | 7.92 ± 0.13 |

Nope! It spilled over into the slow shared GPU memory again.
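A rough back-of-the-envelope for why 28 spills but a slightly higher value shouldn't. The layer count and split are my own assumptions, not from the logs: I'm assuming ~48 MoE layers, with expert weights making up most of the 35.65 GiB file, plus a couple of GiB of attention/embedding weights, KV cache, and CUDA overhead that always stay on the GPU:

```python
# Rough VRAM estimate for a given --n-cpu-moe value.
# All three constants are assumptions, not measured from the post.
N_LAYERS = 48          # assumed MoE layer count for Qwen3-Next-80B
EXPERT_GIB = 34.0      # assumed total expert weight size out of 35.65 GiB
OVERHEAD_GIB = 2.3     # assumed non-expert weights + KV cache + CUDA overhead

def vram_estimate(ncmoe: int) -> float:
    """GiB kept on the GPU when `ncmoe` layers' experts move to CPU RAM."""
    gpu_layers = N_LAYERS - ncmoe
    return gpu_layers / N_LAYERS * EXPERT_GIB + OVERHEAD_GIB

for ncmoe in (28, 30, 32):
    print(f"ncmoe={ncmoe}: ~{vram_estimate(ncmoe):.1f} GiB on GPU")
```

Under those assumptions, `ncmoe = 28` keeps 20 layers of experts on the GPU and lands just over 16 GiB (spill), while `ncmoe = 30` comes out around 15 GiB, close to the 15.1 GB the server actually reported.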
Let's bump it up to, like, 30:

```
% llama-bench -m ./Qwen3-Coder-Next-Q3_K_M.gguf -ngl 99 -ncmoe 30 -fa 1 -t 8 --mmap 0 --no-warmup
```

| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3next 80B.A3B Q3_K - Medium | 35.65 GiB | 79.67 B | CUDA | 99 | 1 | pp512 | 296.60 ± 73.63 |
| qwen3next 80B.A3B Q3_K - Medium | 35.65 GiB | 79.67 B | CUDA | 99 | 1 | tg128 | 20.15 ± 1.06 |

So I think this is the sweet spot for the RTX 5060 Ti on this Q3_K_M quant: pp at 296.60 t/s and tg at 20.15 t/s.
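If you want to automate this trial-and-error instead of eyeballing each run, a small sketch of a sweep helper. The `parse_tg` part just pulls the t/s column out of the llama-bench markdown table shown above; the `bench` wrapper assumes `llama-bench` is on your PATH and the GGUF is in the working directory:

```python
# Hypothetical sweep over --n-cpu-moe values, picking the best tg128 speed.
import subprocess

def parse_tg(output: str) -> float:
    """Extract the tg128 tokens/sec from llama-bench's markdown table."""
    for line in output.splitlines():
        if "tg128" in line:
            cells = [c.strip() for c in line.split("|") if c.strip()]
            return float(cells[-1].split("±")[0])  # "20.15 ± 1.06" -> 20.15
    raise ValueError("no tg128 row found")

def bench(ncmoe: int) -> float:
    """Run one llama-bench pass and return its tg128 t/s."""
    out = subprocess.run(
        ["llama-bench", "-m", "./Qwen3-Coder-Next-Q3_K_M.gguf",
         "-ngl", "99", "-ncmoe", str(ncmoe), "-fa", "1", "-t", "8",
         "--mmap", "0", "--no-warmup"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_tg(out)

if __name__ == "__main__":
    best = max(range(28, 36, 2), key=bench)
    print("best --n-cpu-moe:", best)
```

Each llama-bench pass takes a while on a rig this loaded, so a coarse step of 2 over a narrow range keeps the sweep tolerable.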