GPT-5.2 xhigh, GLM-4.7, Kimi K2 Thinking, DeepSeek v3.2 on Fresh SWE-rebench (December 2025)


Hi all, I’m Anton from Nebius.

We’ve updated the **SWE-rebench leaderboard** with our **December runs** on **48 fresh GitHub PR tasks** (PRs created in the previous month only). The setup is standard SWE-bench: models read real PR issues, edit code, run tests, and must make the full suite pass.

A few observations from this release:

* **Claude Opus 4.5** leads this snapshot at **63.3% resolved rate**.
* **GPT-5.2 (extra high effort)** follows closely at **61.5%**.
* **Gemini 3 Flash Preview** slightly outperforms **Gemini 3 Pro Preview** (60.0% vs 58.9%), despite being smaller and cheaper.
* **GLM-4.7** is currently the strongest open-source model on the leaderboard, ranking alongside closed models like GPT-5.1-codex.
* **GPT-OSS-120B** shows a large jump in performance when run in high-effort reasoning mode, highlighting the impact of inference-time scaling.

Looking forward to your thoughts and feedback.
