Opus 4.6 under Cursor vs. Claude Code — same weights, wildly different results. The secret is the agent harness: the software layer between you and the LLM that decides what the model sees, what it can do, and how it recovers from mistakes.
From your keyboard to the model's response, your message passes through six layers. Click any layer to see what it does.
Before the model sees your message, it reads a long system prompt that tells it who it is, how to behave, and what rules to follow. Claude Code's system prompt is 3000+ tokens, heavily structured with markdown headers and explicit do/don't rules. A generic harness might use 10 words: "You are a helpful coding assistant."
This is the single highest-leverage text in the entire system. A vague prompt produces vague behavior. A precise prompt produces precise behavior.
Claude Code loads your CLAUDE.md files (project rules like "never push to main"), auto-memory (cross-session learning), git status (current branch, recent commits), and project structure. All of this is injected into the system prompt before the model sees your message.
This is why your instruction "never push to main" works in Claude Code but not in Cursor — it's injected directly where the model reads it on every turn.
Tools are functions the model can call. The quality of tool descriptions matters enormously — they tell the model when to use each tool, not just what it does. Claude Code has ~12 specialized tools, each with a careful description. A generic harness might give the model just run_command and hope for the best.
Key design choice: Claude Code's Edit tool does surgical string replacement (specify only what changed). Some harnesses use whole-file rewrites, which waste tokens and introduce errors.
This is the heart of the harness. When you send a message: (1) the model decides what tool to call, (2) the harness executes it, (3) the result is fed back, (4) the model decides the next action. This loop continues until the model has enough information to give you a final answer.
A good harness lets the model loop 10-20 times autonomously — reading files, running tests, fixing errors, re-running. A limited harness might cut off after 2-3 turns.
Every model has a finite context window. After dozens of file reads and tool calls, the conversation history fills up. A naive harness crashes or silently drops old messages. A good harness compresses old turns — asking a fast model to summarize them — while keeping recent context intact.
Claude Code also orders context by priority: system prompt and user instructions first, then recent tool results, then older history. This ensures the model always sees the most important information.
Claude Code categorizes tools by risk: read-only tools (Read, Grep, Glob) run automatically, while modification tools (Edit, Bash, Write) prompt the user. When a user denies a tool call, the denial is fed back to the model as a result, so it can adjust its approach instead of getting stuck.
Similarly, when a tool throws an error (file not found, test failed), the error message is fed back. The model reads it and self-corrects on the next turn. A naive harness would just crash.
Watch what happens when you say "Fix the bug in utils.py" to Claude Code's harness. Every SSE event is shown.
Read("utils.py") 1 import re
2 from datetime import datetime
3
4 def parse_date(s: str):
5 for fmt in ["%Y-%m-%d", "%m/%d/%Y"]:
6 try:
7 return datetime.strptime(s, fmt)
8 except ValueError:
9 pass
10 return None
Edit("utils.py",
old: 'for fmt in ["%Y-%m-%d", "%m/%d/%Y"]:'
new: 'for fmt in ["%Y-%m-%dT%H:%M:%S", "%Y-%m-%d", "%m/%d/%Y"]:'
)Bash("python -m pytest tests/test_utils.py -v")
tests/test_utils.py::test_parse_iso PASSED
tests/test_utils.py::test_parse_us PASSED
tests/test_utils.py::test_parse_dash PASSED
3 passed in 0.12s
parse_date function was missing the ISO 8601 format (%Y-%m-%dT%H:%M:%S). Added it as the first format to try. All 3 tests pass.
What just happened: 4 turns, 3 tool calls. The model read the file before editing (system prompt rule), used a surgical 1-line edit (Edit tool design), ran tests because CLAUDE.md said to (context injection), and the harness checked permissions on the Edit call (permission layer). A naive harness would have skipped all of this — no file read, whole-file rewrite, no tests, no permission check.
The same Opus 4.6, asked "Add a logout button to the nav." Watch how the harness quality changes everything.
System prompt: "You are a helpful coding assistant."
Model's response:
"Here's a generic React logout button component you could add to your nav..."
Problems: Doesn't know your framework, your file structure, your design system, or where the nav component is. Just guesses.
Has read_file, edit_file (whole-file), run_command.
Model's behavior:
1. Glob **/*nav* → finds TopNav.tsx
2. Read TopNav.tsx (890 lines)
3. Rewrite entire file with button added
4. (no tests — doesn't know your conventions)
Problems: Whole-file rewrite risks deleting code. No tests run. No permission check. Styling might not match.
System prompt + CLAUDE.md (design system, conventions) + surgical Edit + permission layer.
Model's behavior:
1. Glob **/*Nav* → TopNav.tsx
2. Read TopNav.tsx
3. Edit: add 4-line button (matches design system)
4. Bash: npm run test
5. "Done. Added logout button with stone-900 styling."
Result: 4-line surgical edit. Matches your design system. Tests pass. Permission check on Edit.
Each of these layers independently improves the model's behavior. Together, they compound.
Some harnesses give the model an edit_file(path, new_content) tool — the model must output the entire file even if only one line changed. Claude Code's Edit tool takes old_string and new_string — the model only specifies the diff.
CLAUDE.mdClaude Code loads three layers of context and injects them into the system prompt:
When a user denies a tool call or a tool fails, a naive harness crashes or gets stuck. Claude Code feeds the denial/error back to the model as a tool result. The model reads it and adjusts.
Read · Glob · Grep · WebSearchEdit · Write · Bash
After 20+ file reads and tool calls, the conversation history exceeds the context window.
A naive harness either crashes (context_length_exceeded) or silently drops old messages — losing important context.
The same architecture that makes Claude Code smart is what makes SofaGenius smart. You already built one.
The punchline: SofaGenius wraps Claude with tools (W&B monitor, SQL analyst, HF scout), system prompts (per-agent instructions), context injection (pre-resolved W&B identity, session state, run aliases), and an agentic loop (base.py) — the exact same pattern that makes Claude Code outperform Cursor.
training.SYSTEM_PROMPT) with tool selection rules, behavior guidelines, and domain-specific instructions.
_build_wandb_context() pre-resolves your W&B entity + projects.
_build_launch_context() injects the last-launched run ID.
_build_alias_context() maps friendly names to run IDs.
CARD_TOOL_MAPPING maps tools to visual card types.
base.run_subagent() loops up to 10 turns. Emits SSE events (text, tool_call, tool_result, card, done) for real-time streaming.
_session_contexts) persists across messages.
_extract_event_context() captures launch/health card data for future reference.
_execute_tool() wraps every tool call in try/except. Errors are JSON-encoded and fed back to Claude as tool results.
Plus two layers Claude Code doesn't need:
_classify_intent() uses Haiku to route to 6 specialized subagents.
Claude Code doesn't need this — it's one agent for all coding tasks.
_inject_credentials() temporarily injects user's W&B/HF tokens into env vars during tool execution, then restores. Multi-user safety.
Improving the software around the model vs. improving the model itself — they're complementary, not alternatives.
The key confusion: When people say "training an agent," they might mean two very different things. Harness engineering improves the code that wraps the model (system prompts, tool descriptions, loop logic) — no gradients involved. MCP trace fine-tuning trains the model's actual weights to be better at tool calling — real training with GPUs and loss functions. Both make agents better, but through completely different mechanisms.
MCP (Model Context Protocol) is a standard format for tools to advertise their capabilities to LLMs. It lives inside the harness's tool layer — it's not the whole harness, and it's not model training. But it enables both paths:
MCP standardizes how tools describe themselves. This is useful for harness engineering (one format to support instead of per-tool custom code), model training (consistent format means more uniform training data), and interoperability (tools built for Claude Code work in any MCP-compatible harness).
A trace is a recording of a complete agent session: what the user asked, which tools the model called, what results came back, and whether the final outcome was good. Here's what a trace looks like:
weather_tool("Tokyo") ← correct tool selection{"temp": 15, "condition": "cloudy"}flight_search("SFO", "TYO") ← correct next step[{price: 850, airline: "JAL"}, ...]The training process uses these traces as (input, output) pairs:
Backend.AI built a small "sidecar" model that helps open-source LLMs (Llama, Gemma) comply with MCP format. The problem: these models were never trained on MCP tool schemas, so they format tool calls incorrectly. The sidecar sits beside the main LLM and "fixes" its output into proper MCP format.
They evaluated on MCP-Bench — a benchmark with 28 MCP servers (Google Maps, Wikipedia, Steam). Open-source models scored poorly because they were never trained on MCP tool-calling formats. The sidecar patches this gap at minimal cost, without modifying the main LLM.
Recent research analyzed 856 MCP tools and found that 97.1% have at least one quality defect ("smell") in their descriptions. Just like code smells, these aren't technically broken but degrade the model's ability to use the tool correctly.
The six quality components: Purpose (what does it do?), Guidelines (when should you use it?), Limitations (what can't it do?), Parameter Explanation (what do the inputs mean?), Length/Completeness (enough detail?), and Examples (show me a call). Enriching descriptions with these components improved task success by ~6% — a free improvement that requires no model training.
| Path A: Harness Engineering | Path B: MCP Trace Fine-Tuning | |
|---|---|---|
| What changes | Code (system prompt, tool descriptions, loop logic) | Model weights (neural network parameters) |
| Training method | Manual iteration + optional prompt optimization (DSPy) | SFT or GRPO with gradient descent |
| Uses GPUs | No | Yes |
| Benefits all users of the model | No — only users of your harness | Yes — the model itself improves |
| Benefits all models in your harness | Yes — swap in any model and it benefits | No — only the fine-tuned model |
| Current maturity | Production-ready | Early research / emerging |
| Example | Claude Code's system prompt, SofaGenius's orchestrator | Backend.AI's MCP Sidecar (GRPO on MCP-Bench) |
You've trained a better model on MCP traces. Now where does it plug into the system? Three options — each touching a different part of the architecture.
User Message │ ▼ ┌─────────────────────────────────────────────────────┐ │ ORCHESTRATOR orchestrator.py │ │ │ │ 1. Intent Classification ──▶ claude-haiku-4-5 │ ← line 414 │ "training" | "data" | "scout" | ... │ │ │ │ 2. General Responses ────▶ claude-sonnet-4-5 │ ← line 469 │ (no tools, direct answer) │ │ │ │ Context Injection: │ │ _build_wandb_context() │ │ _build_launch_context() │ │ _build_alias_context() │ │ │ └──────────────────────┬──────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────┐ │ SUBAGENT LOOP base.py │ │ │ │ 3. Tool-calling Agent ──▶ claude-sonnet-4-5 │ ← line 168 │ + tools (analyze_run_health, list_runs, ...) │ │ + agentic loop (up to 10 turns) │ │ + card emission via CARD_TOOL_MAPPING │ │ │ └─────────────────────────────────────────────────────┘
The harness code (context injection, alias resolution, card emission, permissions) stays completely unchanged in all three deployment options below. You're only swapping which model the harness calls — the architecture is the same.
| Option | File Changed | Impact | Training Data Needed |
|---|---|---|---|
| A: Replace subagent | base.py:168 |
High | Full agent traces (message → tools → result → answer) |
| B: Sidecar | base.py:224 (new fn) |
Medium | (wrong tool call, corrected tool call) pairs |
| C: Replace classifier | orchestrator.py:414 |
Low | (message, category) pairs |
This is why the harness vs. model distinction matters. The harness is the frame. The model is the engine. You can upgrade the engine without redesigning the frame — and you can improve the frame to make any engine perform better. The best systems do both.
The vocabulary of agent harnesses, explained in plain English.