Microsoft Research has introduced debug-gym, a novel environment designed to train AI coding tools in the complex art of debugging code.
As AI’s role in software development expands, debug-gym aims to address a critical bottleneck: while AI can generate code efficiently, debugging remains a major time sink for developers.
The proliferation of AI coding assistants is enhancing developer productivity. GitHub CEO Thomas Dohmke predicted in 2023 that “sooner than later, 80% of the code is going to be written by Copilot”.
This trend is evident across the industry, with both large corporations and startups increasingly relying on AI for code generation. Y Combinator’s Garry Tan highlighted this, noting that for a quarter of their latest startup batch, 95% of the code was penned by large language models (LLMs).
However, the reality of software development involves far more debugging than initial code writing.
“As maintainers of popular open-source repositories, this resonates with us,” stated the Microsoft Research team. They posed a compelling question: “But what if an AI tool could propose fixes for hundreds of open issues, and all we had to do was approve them before merging?”
Bridging the gap: Interactive debugging for AI
Debugging, as defined by the researchers, is an interactive and iterative process to fix code. Developers typically form hypotheses about crashes, gather evidence by stepping through code execution, examine variable values (often using tools like the Python debugger, pdb), and repeat this cycle until the issue is resolved.
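For readers less familiar with pdb, the minimal sketch below illustrates that hypothesise-inspect-repeat loop. The script, breakpoint placement, and bug are invented for illustration and are not taken from debug-gym.

```python
# buggy_script.py: a hypothetical example, not from debug-gym
def average(values):
    total = sum(values)
    return total / len(values)  # crashes when values is empty

if __name__ == "__main__":
    # Drop into the interactive debugger just before the failing call.
    import pdb; pdb.set_trace()
    print(average([]))
    # Typical commands at the (Pdb) prompt:
    #   s          step into average()
    #   p values   print the argument (here, an empty list)
    #   n          execute the next line
    #   c          continue until the ZeroDivisionError surfaces
```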
Debug-gym aims to equip AI agents with similar debugging capabilities. It asks: “to what degree can LLMs use interactive debugging tools such as pdb?”
The environment provides code-repairing AI agents with access to tools for active information-seeking, expanding their action and observation capabilities. Agents within debug-gym can set breakpoints, navigate code, inspect variable values, create test functions, and choose whether to investigate further or rewrite code based on their confidence level.
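To make that action-and-observation loop concrete, below is a rough sketch of how an agent might drive such an environment. The interface shown here (a `DebugEnv` class with `reset` and `step` methods exchanging JSON-like observations and text actions) is an assumption for illustration, not the actual debug-gym API.

```python
# Illustrative sketch only: DebugEnv and its methods are hypothetical stand-ins,
# not the real debug-gym interface.
import json


class DebugEnv:
    """Hypothetical wrapper around a sandboxed, debugger-enabled repository."""

    def reset(self) -> dict:
        # Would return a structured observation, e.g. the failing test output.
        return {"eval_output": "1 test failed: test_average", "done": False}

    def step(self, action: str) -> dict:
        # Would run a text action (a pdb command, a file view, a rewrite, ...)
        # inside the sandbox and return the resulting observation.
        return {"tool_output": f"ran: {action}", "done": True}


def run_agent(env: DebugEnv, llm, max_steps: int = 20) -> None:
    observation = env.reset()
    for _ in range(max_steps):
        # The model reads a structured (JSON) observation and replies with a
        # text action such as 'pdb b my_module.py:42' or 'pdb p some_variable'.
        action = llm(json.dumps(observation))
        observation = env.step(action)
        if observation.get("done"):  # e.g. tests pass and the fix is accepted
            break


# A trivial stand-in model is enough to run the loop end to end:
run_agent(DebugEnv(), llm=lambda prompt: "pdb p values")
```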
“We believe interactive debugging with proper tools can empower coding agents to tackle real-world software engineering tasks and is central to LLM-based agent research,” the Microsoft team explained.
Fixes proposed by these enhanced agents – following human approval – would be grounded in the specific codebase context, program execution details, and documentation, moving beyond mere guesswork based on training data.
Debug-gym is built with several key considerations:
- Repository-level handling: Agents can access and modify files within the entire code repository.
- Robustness and safety: Code execution occurs within sandboxed Docker containers, isolating the environment to prevent harmful actions while allowing thorough testing.
- Extensibility: The platform is designed for easy integration of new debugging tools.
- Text-based interaction: Observations are presented in structured text (like JSON), and actions use a simple text syntax, ensuring compatibility with modern LLMs.
Researchers can use debug-gym with custom repositories and evaluate agent performance using benchmarks like Aider (simple function generation), Mini-nightmare (short, buggy examples), and SWE-bench (real-world problems requiring deep codebase understanding).
Promising early results
Initial experiments involved a simple prompt-based agent using various LLMs (including Claude 3.7, OpenAI o1, and OpenAI o3-mini) equipped with debug tools like eval, view, pdb, rewrite, and listdir.
Even with these tools, solving complex issues like those in SWE-bench Lite remained challenging, with success rates rarely exceeding 50%. Even so, the performance uplift compared to agents without debugging tools was significant.
The success rate on SWE-bench Lite saw relative increases of 30% for Claude 3.7, 182% for OpenAI o1, and 160% for OpenAI o3-mini when debugging tools were available.
The researchers attribute the overall difficulty to the lack of sequential decision-making data (like debugging traces) in current LLM training datasets. However, the marked improvement validates the potential of this research direction.
Training AI code debug specialists
The Microsoft Research team believes fine-tuning LLMs specifically for interactive debugging is the next step. This necessitates creating specialised datasets, potentially recording agent interactions within the debugger as they gather information to solve problems.
Unlike standard reasoning tasks, interactive debugging involves a cycle of action, environmental feedback, and subsequent decision-making, requiring rich data capturing the entire problem-solving sequence.
The plan includes fine-tuning an “info-seeking model” dedicated to gathering the information needed to fix a bug, which would then provide relevant context to a primary code generation model. Smaller, efficient info-seeking models could feed larger generation models, akin to an advanced Retrieval Augmented Generation (RAG) system, potentially saving on AI inference costs.
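As a rough illustration of that two-stage idea, the sketch below pairs a cheap info-seeking step with a larger generation step. The function names, prompts, and canned outputs are placeholders for real model calls, not a design described by Microsoft.

```python
# Hypothetical two-model pipeline: a small info-seeking model gathers debugging
# context, a larger model writes the fix. Both functions are placeholders.

def call_small_model(prompt: str) -> str:
    """Stand-in for a small, inexpensive info-seeking model."""
    return "Relevant context: average() divides by len(values), which can be zero."


def call_large_model(prompt: str) -> str:
    """Stand-in for a larger code-generation model."""
    return "Proposed patch: return 0.0 (or raise ValueError) when values is empty."


def propose_fix(bug_report: str) -> str:
    # Stage 1: distil only the context needed to fix the bug,
    # playing a role similar to retrieval in a RAG pipeline.
    context = call_small_model(f"Gather debugging context for: {bug_report}")
    # Stage 2: generate the actual fix from that distilled context.
    return call_large_model(f"Bug: {bug_report}\nContext: {context}\nWrite a fix.")


print(propose_fix("ZeroDivisionError in average() when given an empty list"))
```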
By open-sourcing debug-gym, Microsoft Research invites the wider community to contribute to advancing interactive debugging agents and, more broadly, AI agents capable of actively seeking information from their environment.
