Code Agent

🏆 BeyondSWE Leaderboard ↗ ↖

22 March 2026

Track the performance of frontier code agents on BeyondSWE — a comprehensive benchmark evaluating coding agents beyond single bug fixing.

BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing? ↗ ↖

24 February 2026·6 mins

We introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes—resolution scope and knowledge scope—using 500 real-world instances across four distinct settings. Together we develop SearchSWE, a framework that integrates deep search with coding abilities to analyse deep research for coding.

Immersion in the GitHub Universe: Scaling Coding Agents to Mastery

10 February 2026·6 mins

We propose Scale-SWE, a sandboxed multi-agent system that constructs 100k real SWE data points — the largest open-source high-quality SWE dataset to date. By training Qwen3-30A3B-Instruct on distilled data, we achieved 64% on SWE-bench-Verified, surpassing industrial models of the same size.

↑