Posted by u/CuriousPlatypus1881
GPT-5.2 xhigh, GLM-4.7, Kimi K2 Thinking, DeepSeek v3.2 on Fresh SWE-rebench (December 2025)
Hi all, I’m Anton from Nebius. We’ve updated the **SWE-rebench leaderboard** with our **December runs** on **48 fresh GitHub PR tasks** (PRs created in the previous month only). The setup is standard SWE-bench: models read real PR issues, edit code, run tests, and must make the full suite pass (rough sketch of the resolved check below).

A few observations from this release:

* **Claude Opus 4.5** leads this snapshot with a **63.3% resolved rate**.
* **GPT-5.2 (extra high effort)** follows closely at **61.5%**.
* **Gemini 3 Flash Preview** slightly outperforms **Gemini 3 Pro Preview** (60.0% vs 58.9%), despite being smaller and cheaper.
* **GLM-4.7** is currently the strongest open-source model on the leaderboard, ranking alongside closed models like GPT-5.1-codex.
* **GPT-OSS-120B** shows a large jump in performance when run in high-effort reasoning mode, highlighting the impact of inference-time scaling.

Looking forward to your thoughts and feedback.
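For anyone new to the benchmark, the "resolved" check conceptually boils down to: apply the model's patch to the repo at the PR's base commit, run the project's own test suite, and count the task as resolved only if everything passes. Here's a minimal Python sketch of that idea (placeholder paths and commands, not our actual harness, which also handles environment setup, dependency installation, and per-task test selection):

```python
import subprocess


def evaluate_task(repo_dir: str, model_patch: str, test_cmd: list[str]) -> bool:
    """Sketch of a SWE-bench-style resolved check: apply patch, run tests."""
    # Apply the model-generated patch to the checked-out repo (patch read from stdin).
    applied = subprocess.run(
        ["git", "apply", "-"],
        input=model_patch,
        text=True,
        cwd=repo_dir,
    )
    if applied.returncode != 0:
        return False  # patch did not apply cleanly -> not resolved

    # Run the full test suite; the task counts as resolved only if it passes.
    tests = subprocess.run(test_cmd, cwd=repo_dir)
    return tests.returncode == 0


# Hypothetical usage -- repo path, patch text, and test command are placeholders:
# resolved = evaluate_task("/tmp/some-repo", patch_text, ["pytest", "-q"])
```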
External link:
https://swe-rebench.com/?insight=dec_2025