SWE-rebench Jan 2026: GLM-5, MiniMax M2.5, Qwen3-Coder-Next, Opus 4.6, Codex Performance


Hi all, I’m Anton from Nebius. We’ve updated the **SWE-rebench leaderboard** with our **January runs** on **48 fresh GitHub PR tasks** (PRs created in the previous month only). The setup is standard SWE-bench: models read real PR issues, edit code, run tests, and must make the full suite pass.

Key observations:

* **Claude Code (Opus 4.6)** leads this snapshot at a **52.9% resolved rate** and also achieves the highest **pass@5 (70.8%)**.
* **Claude Opus 4.6** and **gpt-5.2-xhigh** follow very closely (both at 51.7%), making the top tier extremely tight.
* **gpt-5.2-medium (51.0%)** performs surprisingly close to the frontier configuration.
* Among open models, **Kimi K2 Thinking (43.8%)**, **GLM-5 (42.1%)**, and **Qwen3-Coder-Next (40.0%)** lead the pack.
* **MiniMax M2.5 (39.6%)** continues to show strong performance while remaining one of the cheapest options.
* Clear gap between Kimi variants: **K2 Thinking (43.8%)** vs. **K2.5 (37.9%)**.
* Newer smaller/flash variants (e.g., GLM-4.7 Flash, gpt-5-mini-medium) trade performance for efficiency, landing in the 25–31% range.

Looking forward to your thoughts and feedback.
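For anyone comparing the two metrics: resolved rate is the fraction of tasks solved in a single run, while pass@5 asks whether any of 5 attempts would solve the task. I don't know SWE-rebench's exact aggregation, but a common way to compute pass@k without bias is the combinatorial estimator from the HumanEval paper; here's a sketch with hypothetical per-task attempt counts:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n total attempts on a task,
    c of which succeeded, return the probability that at least one
    of k randomly chosen attempts succeeds."""
    if n - c < k:
        return 1.0  # too few failures to fill k slots -> guaranteed hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical outcomes: (attempts, successes) per task -- illustrative only,
# not actual SWE-rebench data.
tasks = [(10, 10), (10, 3), (10, 0), (10, 1)]

# Resolved rate ~ mean single-attempt success probability.
resolved = sum(c / n for n, c in tasks) / len(tasks)

# pass@5 averaged across tasks.
p5 = sum(pass_at_k(n, c, 5) for n, c in tasks) / len(tasks)

print(f"resolved rate: {resolved:.3f}, pass@5: {p5:.3f}")
```

Note that pass@5 can exceed the resolved rate substantially (as in the 70.8% vs. 52.9% figures above) whenever models succeed on different tasks across reruns rather than failing the same ones every time.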
