RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context: the --n-cpu-moe flag is the most important part


Spent an evening dialing in Qwen3.6-35B-A3B on consumer hardware. Fun side note: I had **Claude Opus 4.7 (just the $20 sub)** build the config, launch the servers in the background, run the benchmarks, read the VRAM splits from the llama.cpp logs, and iterate on the tuning. It basically did the whole thing autonomously; I just told it what hardware I have and what I wanted to run.

Sharing because the common `--cpu-moe` advice is leaving **a 54% speedup on the table** on 16GB GPUs.

# Hardware

* **GPU:** RTX 5070 Ti (16GB GDDR7, Blackwell)
* **CPU:** Ryzen 9800X3D (96MB L3 V-Cache)
* **RAM:** 32GB DDR5
* **Stack:** llama.cpp b8829 (CUDA 13.1, Windows x64)
* **Model:** `unsloth/Qwen3.6-35B-A3B-GGUF`, `UD-Q4_K_M` quant (22.1 GB)

# The finding: --cpu-moe vs --n-cpu-moe N

Everyone’s using `--cpu-moe`, which pushes ALL MoE experts to CPU. On a 16GB GPU with a 22GB MoE model that means **only \~1.9 GB of weights end up in VRAM** (about 3.5 GB in use in the benchmark below, presumably with KV cache and compute buffers on top), and the other \~12 GB sits idle.

`--n-cpu-moe N` keeps the experts of the first N layers on CPU and puts the rest on GPU. With `N=20` on a 40-layer model, half of the expert layers move to the GPU and the VRAM actually gets used.

# Benchmarks (300-token generation, Q4_K_M)

|Config|Gen t/s|Prompt t/s|VRAM used|
|:-|:-|:-|:-|
|`--cpu-moe` (baseline)|51.2|87.9|3.5 GB|
|`--n-cpu-moe 20`|**78.7**|**100.6**|12.7 GB|
|`--n-cpu-moe 20` + `-np 1` + 128K ctx|**79.3**|**135.8**|13.2 GB|

**+54% generation speed, +54% prompt speed** vs. naive `--cpu-moe` (last row vs. baseline). Jumping to 128K context is essentially free thanks to `-np 1` dropping recurrent-state memory.

# Startup command that works

    llama-server.exe ^
      -m "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" ^
      --n-cpu-moe 20 ^
      -ngl 99 ^
      -np 1 ^
      -fa on ^
      -ctk q8_0 -ctv q8_0 ^
      -c 131072 ^
      --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 ^
      --presence-penalty 0.0 --repeat-penalty 1.0 ^
      --reasoning-budget -1 ^
      --host 0.0.0.0 --port 8080

That’s Unsloth’s “Precise Coding” sampling preset. For general use: `--temp 1.0 --presence-penalty 1.5`.
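For anyone who wants to sanity-check the split before launching, here is a rough back-of-the-envelope sketch in Python. It only uses figures quoted above (22.1 GB file, \~1.9 GB of non-MoE weights, 40 layers). It assumes the expert weights are spread evenly across layers, and it ignores KV cache and compute buffers, which is what the `reserve_gb` parameter stands in for. The function names are made up for illustration and are not part of llama.cpp.

```python
import math

MODEL_GB = 22.1    # UD-Q4_K_M file size quoted in the post
NON_MOE_GB = 1.9   # weights that stay on the GPU even with --cpu-moe (from the post)
N_LAYERS = 40      # layer count used in the post

# Assumption: expert weights are split evenly across all layers.
EXPERT_GB_PER_LAYER = (MODEL_GB - NON_MOE_GB) / N_LAYERS


def gpu_weight_gb(n_cpu_moe: int) -> float:
    """Estimate GB of model weights resident in VRAM for a given --n-cpu-moe value."""
    if not 0 <= n_cpu_moe <= N_LAYERS:
        raise ValueError(f"n_cpu_moe must be between 0 and {N_LAYERS}")
    gpu_layers = N_LAYERS - n_cpu_moe
    return NON_MOE_GB + gpu_layers * EXPERT_GB_PER_LAYER


def smallest_n_cpu_moe(vram_gb: float, reserve_gb: float) -> int:
    """Smallest N whose weights fit in vram_gb after setting aside reserve_gb.

    reserve_gb is a user-chosen allowance for KV cache, compute buffers,
    the desktop, and safety headroom; this sketch does not model them.
    """
    budget = vram_gb - reserve_gb - NON_MOE_GB
    gpu_layers = math.floor(budget / EXPERT_GB_PER_LAYER)
    gpu_layers = max(0, min(N_LAYERS, gpu_layers))  # clamp to a valid layer count
    return N_LAYERS - gpu_layers


if __name__ == "__main__":
    for n in (40, 20):
        print(f"--n-cpu-moe {n}: ~{gpu_weight_gb(n):.1f} GB of weights on GPU")
    print("16 GB card, 3 GB reserve -> N =", smallest_n_cpu_moe(16.0, 3.0))
```

Treat the result as a starting point only. The weight estimate for `N=20` comes out a bit under the 12.7 GB measured above, which is expected since the measurement includes everything else living in VRAM, and the 3 GB reserve is just an illustrative guess.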
# Gotchas I hit (well, that Opus hit and fixed)

* `-np` **defaults to auto=4 slots.** That wastes memory on recurrent state (\~190 MB). Set `-np 1` for single-user setups (OpenCode etc.).
* `--fit-target` **doesn’t help here**: `-ngl 99` + `--n-cpu-moe N` already gives you deterministic control.
* `-ctk q8_0 -ctv q8_0` is nearly lossless and halves your KV cache vs fp16. 128K ctx only costs 1.36 GB VRAM.
* **Qwen3.6 is a hybrid architecture**: only 10 of the 40 layers are standard attention, the other 30 are Gated Delta Net (recurrent). That’s why KV memory is so small.

# How to tune N for your GPU

Each MoE layer on GPU costs \~530 MB VRAM. Non-MoE weights are \~1.9 GB fixed. For a 40-layer model:

|GPU VRAM|Recommended `N`|
|:-|:-|
|8 GB|stay with `--cpu-moe`|
|12 GB|`N=26`|
|16 GB|`N=20` (sweet spot)|
|24 GB|`N=8` (fits almost everything)|

Start conservative (high `N`), watch VRAM during a long-context generation, then step `N` down by 2-3 at a time until you have \~2 GB headroom left.

# TL;DR

Replace `--cpu-moe` with `--n-cpu-moe 20`, add `-np 1`, and you get **79 t/s + 128K context** on a 5070 Ti. The 9800X3D’s V-Cache carries the CPU side effortlessly.

And Claude Opus 4.7 on the $20 Pro sub is genuinely good enough now to run this kind of hardware-tuning loop end-to-end (launch servers in background, parse logs, iterate) without hand-holding. Kind of wild.

Happy to test other configs if anyone wants comparisons.

**EDIT: Thanks to some great comments, the setup got better. Updated findings:**

**1.** `--fit on --fit-ctx 128000 --fit-target 512` **beats manual** `--n-cpu-moe 20`

Shoutout to the commenter who recommended the “fit-triple”. It auto-probes VRAM, picks N for you (landed on N=19 here), and adapts if drivers steal VRAM. Slightly faster than my hand-tuned N=20 and zero brain power to maintain. **Caveat:** bare `--fit on` silently drops ctx to 4K, so always pair it with `--fit-ctx`.

**2. My original prefill numbers were way too low**

A commenter correctly flagged that \~135 t/s prefill is nonsense for a 5070 Ti. They were right: that was server-side timing including first-token latency. Re-ran with `llama-bench` (3 reps, same config):

|Test|t/s|
|:-|:-|
|pp512|1182|
|pp2048|1644|
|tg128|91.5|

So real prefill is **\~1.2–1.6k t/s**, not 135.

**Final “best command” for 16 GB VRAM + 32 GB RAM:**

    llama-server.exe ^
      -m "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" ^
      --fit on ^
      --fit-ctx 128000 ^
      --fit-target 512 ^
      -np 1 ^
      -fa on ^
      -ctk q8_0 ^
      -ctv q8_0 ^
      --temp 0.6 ^
      --top-p 0.95 ^
      --top-k 20 ^
      --min-p 0.0 ^
      --presence-penalty 0.0 ^
      --repeat-penalty 1.0 ^
      --reasoning-budget -1 ^
      --host 0.0.0.0 ^
      --port 8033

Keep the comments coming, every round makes this faster. :D

**EDIT 2: Another commenter’s tip got me one more layer on the GPU**

Dropping `--fit-target` from 512 to 256 squeezes **one extra MoE layer onto the GPU** (N=18 instead of 19). The commenter also suggested adding `--mlock` alongside `--no-mmap` to lock RAM pages against swap. Benched both changes vs. the previous EDIT’s config (fit-target 512 + no-mmap):

|Config|pp512|pp2048|tg128|
|:-|:-|:-|:-|
|fit-target 512 + no-mmap|2769|2729|91.5|
|**fit-target 256 + no-mmap + mlock**|**2743**|**2724**|**96.3**|

**+5% generation** (91.5 to 96.3 t/s), prefill unchanged. Costs nothing, just a smaller VRAM headroom and explicit RAM locking.

**Updated final command:**

    llama-server.exe ^
      -m "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" ^
      --fit on ^
      --fit-ctx 128000 ^
      --fit-target 256 ^
      -np 1 ^
      -fa on ^
      --no-mmap ^
      --mlock ^
      -ctk q8_0 ^
      -ctv q8_0 ^
      --temp 0.6 ^
      --top-p 0.95 ^
      --top-k 20 ^
      --min-p 0.0 ^
      --presence-penalty 0.0 ^
      --repeat-penalty 1.0 ^
      --reasoning-budget -1 ^
      --host 0.0.0.0 ^
      --port 8033

**EDIT 3: Two more community tips landed big wins**

**1.** `-ub 2048` **(ubatch size) = +60% prompt processing at 2K tokens**

Default `-ub` is 512. Bumping it to 2048 (and matching `-b 2048`) lets the GPU process more tokens in parallel per prefill step. Benched (5 reps each, all values in t/s):

|ubatch|pp512|pp2048|pp4096|tg128|
|:-|:-|:-|:-|:-|
|512 (default)|2739|2778|n/a|98.7|
|1024|2689|3689|n/a|100.5|
|**2048**|2771|**4453**|4417|98.4|
|4096|2736|4427|4866|100.4|

**2048 is the sweet spot**: 60% faster on 2K prompts (2778 to 4453 t/s), gen untouched. 4096 only helps on prompts longer than 2K (the compute buffer saturates otherwise) and eats more VRAM.

**2.** `--chat-template-kwargs "{\"preserve_thinking\": true}"` **for agentic workflows**

This is a Qwen3.6-specific chat template parameter. The default only keeps the latest user turn’s thinking; `preserve_thinking: true` carries the thinking traces from all historical messages forward. Turns out Qwen3.6 was specifically trained for this behavior. Benefits:

* Better decision consistency across tool-calling turns
* Fewer redundant re-reasonings, so lower token consumption in long agent sessions
* Better KV-cache reuse across turns

**Final final command:**

    llama-server.exe ^
      -m "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" ^
      --fit on ^
      --fit-ctx 128000 ^
      --fit-target 256 ^
      -np 1 ^
      -fa on ^
      --no-mmap ^
      --mlock ^
      -b 2048 ^
      -ub 2048 ^
      -ctk q8_0 ^
      -ctv q8_0 ^
      --temp 0.6 ^
      --top-p 0.95 ^
      --top-k 20 ^
      --min-p 0.0 ^
      --presence-penalty 0.0 ^
      --repeat-penalty 1.0 ^
      --reasoning-budget -1 ^
      --chat-template-kwargs "{\"preserve_thinking\": true}" ^
      --host 0.0.0.0 ^
      --port 8033

**Total benched throughput on 5070 Ti 16 GB + 9800X3D + 32 GB DDR5-6000:**

* **pp512 \~2771 t/s**
* **pp2048 \~4453 t/s**
* **pp4096 \~4417 t/s** (bump `-ub` to 4096 for +10% here if you do long prompts)
* **tg128 \~98 t/s**
* **Context: 128K**

This community keeps delivering. Thank you.
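If you would rather launch from a script than fight cmd's `^` continuations and the escaped JSON quotes, here is a minimal Python sketch that builds the same final command as an argument list. Every flag and value is copied from the command above; the function names are made up for illustration, and it assumes `llama-server.exe` is on your PATH and the GGUF sits in the working directory.

```python
import json
import subprocess


def build_llama_server_args(model_path: str, port: int = 8033) -> list[str]:
    """Return the final command from the post as an argv list (no shell quoting needed)."""
    # json.dumps produces the literal {"preserve_thinking": true} without manual escaping.
    chat_kwargs = json.dumps({"preserve_thinking": True})
    return [
        "llama-server.exe",
        "-m", model_path,
        "--fit", "on",
        "--fit-ctx", "128000",
        "--fit-target", "256",
        "-np", "1",
        "-fa", "on",
        "--no-mmap",
        "--mlock",
        "-b", "2048",
        "-ub", "2048",
        "-ctk", "q8_0",
        "-ctv", "q8_0",
        "--temp", "0.6",
        "--top-p", "0.95",
        "--top-k", "20",
        "--min-p", "0.0",
        "--presence-penalty", "0.0",
        "--repeat-penalty", "1.0",
        "--reasoning-budget", "-1",
        "--chat-template-kwargs", chat_kwargs,
        "--host", "0.0.0.0",
        "--port", str(port),
    ]


def launch(model_path: str) -> subprocess.Popen:
    """Start the server in the background and hand back the process handle."""
    return subprocess.Popen(build_llama_server_args(model_path))


if __name__ == "__main__":
    # Print the command instead of launching, so the sketch is safe to run anywhere.
    print(" ".join(build_llama_server_args("Qwen3.6-35B-A3B-UD-Q4_K_M.gguf")))
```

Passing a list to `subprocess.Popen` skips the shell entirely, so the JSON argument arrives intact and the same script works outside Windows if you swap the executable name.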
