
Add blog post: The Coding Harness Behind GitHub Copilot in VS Code #9740

Draft
jukasper wants to merge 2 commits into main from blog/agent-harness-github-copilot-vscode

Conversation

@jukasper (Member) commented May 6, 2026

This PR adds a new blog post explaining the coding harness architecture behind GitHub Copilot's agent mode in VS Code.

The post covers:

  • What a coding harness is and why it matters
  • The model-harness interaction loop
  • How the harness provides tools and context to the model
  • Evaluation and benchmarking approach

@ntrogh (Collaborator) left a comment


@jukasper Great post - very well explained and good level of depth! Also like the addition of diagrams.

I left some feedback, but nothing major.

May 7, 2026 by [Julia Kasper](https://github.com/jukasper)

Every few months, a new model drops and the conversation resets. Which one is smartest? Which one is fastest? Which one should we ship? Those are useful questions, but for a product like Visual Studio Code they are incomplete. A model is only one part of the experience. What developers actually feel is the coding harness: the layer that assembles context, exposes tools, runs the agent loop, interprets tool calls, and turns a model's output into something useful inside the editor.

@ntrogh (Collaborator):

Add an intent statement to indicate to readers what this blog post is about.


![Diagram showing that an agent is made up of a model plus a harness. The harness includes the agent loop, tools, context management, and system prompt.](agent_model_harness.png)

## What We Mean by the Coding Harness
@ntrogh (Collaborator):

Suggested change
## What We Mean by the Coding Harness
## What is the coding harness?



That distinction matters because language models do not edit files, execute commands, or run tests by themselves. They produce text. The harness is the system that turns that text into action and feeds the results back so the model can decide what to do next.
@ntrogh (Collaborator):

Often readers ignore the above text and/or heading. Make sure to have each section stand by itself. "That distinction.." is not clear when you start to read from that point on.

@ntrogh (Collaborator):

Suggested change
That distinction matters because language models do not edit files, execute commands, or run tests by themselves. They produce text. The harness is the system that turns that text into action and feeds the results back so the model can decide what to do next.
Language models do not edit files, execute commands, or run tests by themselves. They can only produce text. The coding harness is the system that acts as a bridge between the code editor and the language model. It turns that text into action and feeds the results back so the model can decide what to do next.


![Diagram of the agent loop showing the cycle: build prompt, send to model, check response type, execute tools, record results, and loop back.](agentloop.png)

Each pass through this loop is called a round. A single user message might trigger dozens of rounds as the model reads and searches files, edits code, runs tests, reads the output, and iterates on failures.
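
To make the loop concrete, here is a minimal TypeScript sketch of one way such a round loop can be structured. The types and helpers (`LanguageModel`, `Tool`, `buildPrompt`) are invented for illustration and are not the actual Copilot implementation:

```typescript
// Minimal, illustrative round loop. The types and names here are invented
// for this sketch; they are not the actual Copilot or VS Code APIs.
interface ToolCall { name: string; args: unknown; }
interface ModelResponse { text: string; toolCalls: ToolCall[]; }
interface LanguageModel { complete(prompt: string): Promise<ModelResponse>; }
type Tool = (args: unknown) => Promise<string>;

async function runAgentLoop(
  model: LanguageModel,
  tools: Map<string, Tool>,
  buildPrompt: (toolResults: string[]) => string,
  maxRounds = 64,
): Promise<string> {
  const toolResults: string[] = [];

  for (let round = 0; round < maxRounds; round++) {
    // 1. Rebuild the prompt so it reflects the latest workspace state.
    const prompt = buildPrompt(toolResults);

    // 2. Send it to the model.
    const response = await model.complete(prompt);

    // 3. No tool calls means the model considers the task done.
    if (response.toolCalls.length === 0) {
      return response.text;
    }

    // 4. Execute each tool call and record the result for the next round.
    for (const call of response.toolCalls) {
      const tool = tools.get(call.name);
      const result = tool ? await tool(call.args) : `Unknown tool: ${call.name}`;
      toolResults.push(`${call.name} -> ${result}`);
    }
  }

  return "Stopped after reaching the round limit.";
}
```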
@ntrogh (Collaborator):

You might want to clarify the difference between a turn and a round.


The loop is not unbounded. The harness enforces a tool-call limit, checks for cancellation between rounds, and runs stop hooks: extension points that can inspect the model's state and either allow it to finish or push it to keep working ("you were about to stop, but the tests still fail").
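
As a sketch of the stop-hook idea, a hook gets a look at the agent's state and decides whether stopping is acceptable. The interface below is an assumption for illustration, not a real VS Code extension API:

```typescript
// Illustrative stop hook: decides whether the agent may finish.
// The shape of this interface is an assumption, not a real VS Code API.
interface StopHookContext {
  roundsCompleted: number;
  failingTests: string[];
}

interface StopDecision {
  allowStop: boolean;
  feedback?: string; // Sent back to the model if it should keep working.
}

type StopHook = (ctx: StopHookContext) => StopDecision;

const testsMustPass: StopHook = (ctx) => {
  if (ctx.failingTests.length > 0) {
    return {
      allowStop: false,
      feedback: `You were about to stop, but ${ctx.failingTests.length} test(s) still fail.`,
    };
  }
  return { allowStop: true };
};
```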

Within the loop, the prompt is rebuilt on every iteration. That means the model always sees the latest state of the workspace: if it edited a file three rounds ago, the current prompt reflects that edit. The harness also manages conversation summarization. When the accumulated history grows too large, it compresses earlier rounds into a summary so the model can keep working without hitting the context window ceiling.
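
A rough sketch of that summarization check, with the token estimate, threshold, and summarizer all standing in for the real thing:

```typescript
// Rough sketch: compress earlier rounds into a summary once the history
// would no longer fit the context window. The numbers and helpers below
// are illustrative assumptions.
const CONTEXT_WINDOW_TOKENS = 128_000;
const RESERVED_FOR_RESPONSE = 8_000;

function estimateTokens(text: string): number {
  // Crude heuristic: roughly 4 characters per token.
  return Math.ceil(text.length / 4);
}

async function maybeSummarize(
  history: string[],
  summarize: (rounds: string[]) => Promise<string>,
): Promise<string[]> {
  const used = history.reduce((sum, entry) => sum + estimateTokens(entry), 0);
  if (used <= CONTEXT_WINDOW_TOKENS - RESERVED_FOR_RESPONSE) {
    return history; // Plenty of room left; keep the full history.
  }
  // Keep the most recent rounds verbatim and fold the rest into a summary.
  const recent = history.slice(-5);
  const summary = await summarize(history.slice(0, -5));
  return [`Summary of earlier rounds: ${summary}`, ...recent];
}
```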
@ntrogh (Collaborator):

On first read, I was wondering why all of a sudden the harness came up here, and thought you meant the agent loop. Consider if we should include this in the harness section.


When a new model ships, it needs to fit into an existing harness. The system prompt, the tool definitions, the loop logic, the context assembly, all of it was built and tuned over many months of real-world use. The model gets better at filling in the blanks, but the harness defines what the blanks are.

This matters even more because GitHub Copilot spans model providers. GitHub Copilot in VS Code supports a growing model ecosystem. Developers can switch between models, use auto-selection, bring their own keys, or install provider extensions. The editor deals with a moving ecosystem, not a single stable API.
@ntrogh (Collaborator):

Suggested change
This matters even more because GitHub Copilot spans model providers. GitHub Copilot in VS Code supports a growing model ecosystem. Developers can switch between models, use auto-selection, bring their own keys, or install provider extensions. The editor deals with a moving ecosystem, not a single stable API.
This matters even more because GitHub Copilot spans multiple model providers. GitHub Copilot in VS Code supports a growing model ecosystem. Developers can switch between models, use auto-selection, bring their own keys, or install extra providers via extensions. This means that VS Code has to deal with a broad and continuously evolving ecosystem, not a single stable API.


Different models need different harness behavior. Claude models use `replace_string_in_file` for edits; GPT models use `apply_patch`. Gemini needs reminders to use tool-calling instead of narrating it, and breaks on orphaned tool calls in history. Some models support extended thinking and need reasoning-effort controls. Some work best with a concise system prompt; others need verbose, structured instructions to stay on track. The harness selects different system prompts per model - Claude Sonnet 4 gets a different prompt than Claude 4.5, which gets a different one than Opus.
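
One way to picture how such differences can be encoded is a per-model profile. The shape below is an illustrative assumption; only the edit-tool names come from the paragraph above:

```typescript
// Illustrative per-model profile. The configuration shape and model keys
// are assumptions for this sketch, not the real Copilot configuration.
interface ModelProfile {
  systemPrompt: string;                                // Which prompt variant to load.
  editTool: "replace_string_in_file" | "apply_patch";  // Preferred edit mechanism.
  reasoningEffort?: "low" | "medium" | "high";         // Only for models with extended thinking.
  extraInstructions?: string[];                        // Model-specific reminders.
}

const profiles: Record<string, ModelProfile> = {
  "claude-sonnet": {
    systemPrompt: "prompts/claude-sonnet.md",
    editTool: "replace_string_in_file",
    reasoningEffort: "medium",
  },
  "gpt": {
    systemPrompt: "prompts/gpt.md",
    editTool: "apply_patch",
  },
  "gemini": {
    systemPrompt: "prompts/gemini.md",
    editTool: "replace_string_in_file",
    extraInstructions: ["Call tools directly instead of describing the call in prose."],
  },
};
```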

These aren't abstract differences. They translate into per-model system prompts, per-model tool sets, and per-model conversation management. When a new model ships, we don't just flip a switch. We validate tool schemas, retune defaults, and re-run full agent sessions before anything ships. The harder question is how we know those changes actually made things better.
@ntrogh (Collaborator):

Suggested change
These aren't abstract differences. They translate into per-model system prompts, per-model tool sets, and per-model conversation management. When a new model ships, we don't just flip a switch. We validate tool schemas, retune defaults, and re-run full agent sessions before anything ships. The harder question is how we know those changes actually made things better.
All these per-model differences aren't trivial. They translate into per-model system prompts, per-model tool sets, and per-model conversation management. This means that when a new model ships, we don't just flip a switch but we need to validate its behavior. We validate tool schemas, retune defaults, and re-run full agent sessions before anything ships. The harder question is how we know those changes actually made things better.


## Evaluation keeps the harness honest

That's where evaluation comes in. Before a model ships in VS Code, we evaluate it from multiple angles. We run offline benchmarks, test it internally, and compare it against the models already available in the product. After launch, we keep measuring: A/B tests, aggregate usage signals, and weekly reporting help us understand how the model behaves in real developer workflows.
@ntrogh (Collaborator):

Suggested change
That's where evaluation comes in. Before a model ships in VS Code, we evaluate it from multiple angles. We run offline benchmarks, test it internally, and compare it against the models already available in the product. After launch, we keep measuring: A/B tests, aggregate usage signals, and weekly reporting help us understand how the model behaves in real developer workflows.
Just like you need to test a new feature before you ship it, models also need to be tested. That's where model evaluation comes in. Before a model ships in VS Code, we evaluate it from multiple angles. We run offline benchmarks, test it internally, and compare it against the models already available in the product. After the model is live, we keep measuring: A/B tests, aggregate usage signals, and weekly reporting help us understand how the model behaves in real developer workflows.


![Diagram showing an overview of the VS Code evaluation pipeline.](evaluations.png)
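
To give a feel for the offline part of that pipeline, the sketch below (entirely illustrative, not the actual evaluation code) compares a candidate model's pass rate against the baseline already in the product:

```typescript
// Entirely illustrative: gate a candidate model on benchmark pass rate
// relative to the baseline already shipping in the product.
interface EvalRun {
  task: string;
  passed: boolean;
}

function passRate(runs: EvalRun[]): number {
  if (runs.length === 0) return 0;
  return runs.filter((run) => run.passed).length / runs.length;
}

function candidateLooksHealthy(
  candidate: EvalRun[],
  baseline: EvalRun[],
  noiseMargin = 0.02, // Allow a small margin for run-to-run noise.
): boolean {
  return passRate(candidate) >= passRate(baseline) - noiseMargin;
}
```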

Public benchmarks are useful as shared reference points. We use them to compare against the broader model ecosystem and to catch obvious regressions. But at frontier levels, they are no longer enough on their own.
@ntrogh (Collaborator):

Suggested change
Public benchmarks are useful as shared reference points. We use them to compare against the broader model ecosystem and to catch obvious regressions. But at frontier levels, they are no longer enough on their own.
There are multiple public model benchmarks, which are useful as a shared reference point. We use these benchmarks to compare against the broader model ecosystem and to catch obvious regressions. But at frontier levels, they are no longer enough on their own.



Part of the issue is coverage. SWE-bench is valuable, but it is still centered on public bug-fixing tasks. Terminal-Bench is useful for measuring command-line competence, but many tasks look more like isolated terminal puzzles than the kinds of workflows developers actually bring to an editor. Real coding agents need to do more than patch a known bug or solve a shell challenge. They need to scaffold projects, migrate codebases, refactor across files, follow instructions, and handle terminals and browsers.
@ntrogh (Collaborator):

Suggested change
Part of the issue is coverage. SWE-bench is valuable, but it is still centered on public bug-fixing tasks. Terminal-Bench is useful for measuring command-line competence, but many tasks look more like isolated terminal puzzles than the kinds of workflows developers actually bring to an editor. Real coding agents need to do more than patch a known bug or solve a shell challenge. They need to scaffold projects, migrate codebases, refactor across files, follow instructions, and handle terminals and browsers.
One of the issues with the public benchmarks is coverage. SWE-bench is valuable, but it is still centered on public bug-fixing tasks. Terminal-Bench is useful for measuring command-line competence, but many tasks look more like isolated terminal puzzles than the kinds of workflows developers actually bring to an editor. Real-world coding agents need to do more than patch a known bug or solve a shell challenge. They need to scaffold projects, migrate codebases, refactor across files, follow instructions, and handle terminals and browsers.
