Qwen3.6-35B becomes competitive with cloud models when paired with the right agent

A short follow-up to my previous post, where I showed that changing the scaffold around the same 9B Qwen model moved benchmark performance from 19.11% to 45.56%: https://www.reddit.com/r/LocalLLaMA/s/JMHuAGj1LV

After feedback from people here, I tried little-coder with Qwen3.6 35B. It now lands in the public Polyglot top 10 with a success rate of 78.7%, making it actually competitive with the best models out there for this benchmark!

At this point I’m increasingly convinced that part of the performance gap to cloud models is harness mismatch: we may have been testing local coding models inside scaffolds built for a different class of model.

Next up is Terminal Bench, then likely GAIA for research capabilities. Would love to hear your feedback here!

EDIT: after many requests, the pi.dev adaptation is up!

EDIT 2: Terminal Bench 1 (0.1.1) finished with a 40% success rate! Now running TB 2. Just sent the results via email. There is no model remotely as small as the 35B in that area. Exciting times.

EDIT 3: Terminal Bench 2.0 requires 5 runs per trial (which will take 40 more hours), but the first run finished with 30%!!! That’s with the 35B model.

Full write-up: https://open.substack.com/pub/itayinbarr/p/honey-i-shrunk-the-coding-agent

GitHub: https://github.com/itayinbarr/little-coder

Full benchmark results: https://github.com/itayinbarr/little-coder/blob/main/docs/benchmark-qwen3.6-35b-a3b.md

