BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

Guoxin Chen*, Fanzhe Meng*, Jiale Zhao*
Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen,
Xin Zhao, Ruihua Song, Chang Liu, Cheng Chen, Kai Jia and Ji-Rong Wen
1Gaoling School of Artificial Intelligence, Renmin University of China   2Independent Researcher   3AweAI Team
*Equal Contributions   Corresponding authors
BeyondSWE Main Figure

Overview of BeyondSWE. Our benchmark extends evaluation along two dimensions — knowledge scope and resolution scope: CrossRepo and DomainFix expand knowledge scope by requiring external software resources and domain expertise respectively; DepMigrate and Doc2Repo expand resolution scope from localized patches to codebase-wide transformations.

Abstract

Current benchmarks for code agents primarily assess narrow, repository‑specific fixes, overlooking critical real‑world challenges such as cross‑repository reasoning, domain‑specialized problem solving, dependency‑driven migration, and full‑repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes—resolution scope and knowledge scope—using 500 real‑world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer‑like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.

Benchmark Tasks

Recently, OpenAI announced that it will retire SWE-bench-Verified as its evaluation standard. Analyzed along our two dimensions, Resolution Scope and Knowledge Scope, even the latest SWE-bench Pro averages only 107.4 modified lines per instance and remains limited to fixes internal to a single repository.

BeyondSWE Statistics

In real-world software development, engineers consult external links to locate cross-repository bugs, read documentation to apply domain knowledge when fixing them, handle project updates forced by breaking upgrades to foundational libraries like NumPy, and build complete repositories from scratch based on development documents. Current mainstream benchmarks cover none of these scenarios. Our benchmark, BeyondSWE, covers four task scenarios spanning the two dimensions: CrossRepo and DomainFix broaden knowledge scope, while DepMigrate and Doc2Repo broaden resolution scope.

Overview of SearchSWE Framework

To address the agent's insufficient internal knowledge, we propose the SearchSWE framework, which offers the following advantages:

  • Dual-context design
    • 🐳 Local context: Docker environment for repo exploration, command execution, and testing.
    • 🌎 Global context: access to external web information.
  • Two external tools
    • 🔎 Search tool: query web search engines for relevant resources.
    • 🌐 Browser tool: retrieve and summarize webpages given a URL and goal.
  • Autonomous reasoning: the agent decides when and how to use external information during problem solving.
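As a minimal sketch of the dual-tool design (the class and function names here are ours, not the framework's actual API, and the tool bodies are placeholders for real search/browser backends), the agent's autonomous tool choice can be modeled as dispatch over a small registry:

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ToolCall:
    name: str              # "search" or "browser"
    args: Dict[str, str]   # tool-specific arguments


def run_search(args: Dict[str, str]) -> str:
    # Placeholder: a real implementation would query a web search engine.
    return f"results for: {args['query']}"


def run_browser(args: Dict[str, str]) -> str:
    # Placeholder: a real implementation would fetch the URL and
    # summarize its content with respect to the stated goal.
    return f"summary of {args['url']} for goal: {args['goal']}"


TOOLS: Dict[str, Callable[[Dict[str, str]], str]] = {
    "search": run_search,
    "browser": run_browser,
}


def dispatch(call: ToolCall) -> str:
    # The agent decides at each step which tool (if any) to invoke;
    # unknown tool names are rejected rather than silently ignored.
    if call.name not in TOOLS:
        raise ValueError(f"unknown tool: {call.name}")
    return TOOLS[call.name](call.args)
```

In this sketch the local Docker context is reached through ordinary bash actions, while the two global-context tools above are the only gateways to the web, which is what makes the blocklist described below enforceable.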
BeyondSWE Agent Architecture

Overview of The SearchSWE Framework. Left: the agent solves coding tasks by iteratively accessing external resources (search, browser) and local context (Docker container), with a blocklist preventing cheating. Right: evaluation applies patches to a fresh container and runs P2P/F2P tests for verification.
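The verification step in the right panel reduces to a simple resolution rule (sketched below with hypothetical names; the actual harness first applies the patch inside a fresh container): an instance counts as resolved only if every fail-to-pass (F2P) test now passes and every pass-to-pass (P2P) test still passes, guarding against regressions.

```python
from typing import Dict


def is_resolved(f2p_results: Dict[str, bool],
                p2p_results: Dict[str, bool]) -> bool:
    # F2P tests failed before the patch; all must pass afterwards.
    # P2P tests passed before the patch; all must keep passing.
    return all(f2p_results.values()) and all(p2p_results.values())
```

A patch that fixes the target bug but breaks an unrelated P2P test is therefore scored as unresolved.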

To prevent the agent from cheating by directly searching for the target repository, SearchSWE filters all URLs that point to it and blocks any bash commands that attempt to access it directly. Our Docker environment also erases all commit records and metadata after the target commit.
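A minimal sketch of such a blocklist (the target repository, function names, and the fetcher heuristic are our illustrative assumptions, not SearchSWE's actual implementation) might check both tool URLs and bash commands against the target repository's identifier:

```python
# Hypothetical target repository for this instance.
TARGET_REPO = "github.com/numpy/numpy"


def is_blocked_url(url: str) -> bool:
    # Reject any URL that points at the target repository itself;
    # unrelated documentation sites remain reachable.
    return TARGET_REPO.lower() in url.lower()


def is_blocked_command(cmd: str) -> bool:
    # Reject bash commands that try to fetch the target repository
    # directly (e.g. git clone/fetch, curl, wget against it).
    fetchers = ("git", "curl", "wget")
    has_fetcher = any(tok in cmd.split() for tok in fetchers)
    return has_fetcher and TARGET_REPO.lower() in cmd.lower()
```

Under this rule the agent can still read, say, the library's hosted documentation, but cannot pull the target repository's upstream history to recover the ground-truth fix.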

Evaluation Results

We evaluate frontier models with SearchSWE and compare against the OpenHands baseline. Results highlight a key gap: search and coding abilities have improved independently, but their integration remains inconsistent in practice.

  • ✅ When search helps: it can provide actionable external context (e.g., documentation, migration guides), yielding gains on some model–task pairs (notably DomainFix and DepMigrate for strong general-purpose models).
  • ❌ When search hurts: benefits are not universal—irrelevant or mismatched retrieval can inject noise and reduce performance (some code-specialized models degrade under search).
  • 📊 Task-level patterns: CrossRepo tends to benefit broadly, while Doc2Repo often suffers because fragmented search evidence can disrupt coherent end-to-end design.
  • 😭 Why it’s hard: (1) the information landscape: the knowledge needed for code tasks is often embedded in raw artifacts; (2) version consistency: local environments often pin older versions than the ones documented online; (3) retrieval noise: irrelevant search results can actively harm code generation.
  • 🛎️ Implication: combining external information seeking with code reasoning is essential but does not emerge automatically—BeyondSWE is designed to measure and drive progress on this integration.

Main results on BeyondSWE. Multiple frontier models evaluated with OpenHands and SearchSWE. Green / Red values indicate SearchSWE’s gains/drops relative to OpenHands. Bold and underlined denote the best and second-best results within each framework.

BeyondSWE Evaluation Results

A detailed analysis of search tool usage reveals that searching more does not necessarily mean searching better.

BeyondSWE Search Tool Calls

Qualitative Analysis of Search-Code Disconnect

We investigated why search augmentation does not consistently translate into coding performance, even though search and coding are each individually mature capabilities, and identified specific failure modes where the integration of external knowledge breaks down. Three failure modes illustrate that integrating search with code requires both precise information retrieval and applying the retrieved information at the right time, in the right way.

Qualitative Failure Mode

Case Study

Analysis

Agent Trajectory

Citation

If you find our work helpful, please cite us:

@article{beyondswe2026,
  title={BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?},
  author={Guoxin Chen and Fanzhe Meng and Jiale Zhao and Minghao Li and Daixuan Cheng and Huatong Song and Jie Chen and Yuzhi Lin and Hui Chen and Xin Zhao and Ruihua Song and Chang Liu and Cheng Chen and Kai Jia and Ji-Rong Wen},
  year={2026}
}