Agent Harness — Why the Same Model Feels Different

ANATOMY

Six Layers of an Agent Harness

From your keyboard to the model's response, your message passes through six layers. Click any layer to see what it does.

📜

System Prompt — The DNA

Thousands of tokens that define personality, rules, and capabilities

Before the model sees your message, it reads a long system prompt that tells it who it is, how to behave, and what rules to follow. Claude Code's system prompt is 3000+ tokens, heavily structured with markdown headers and explicit do/don't rules. A generic harness might use 10 words: "You are a helpful coding assistant."

This is the single highest-leverage text in the entire system. A vague prompt produces vague behavior. A precise prompt produces precise behavior.

The system prompt is like a job description — a detailed one produces a focused employee, a vague one produces someone who guesses

📚

Context Injection — What the Model Knows About You

CLAUDE.md, auto-memory, git status, credentials, session state

Claude Code loads your CLAUDE.md files (project rules like "never push to main"), auto-memory (cross-session learning), git status (current branch, recent commits), and project structure. All of this is injected into the system prompt before the model sees your message.

This is why your instruction "never push to main" works in Claude Code but not in Cursor — it's injected directly where the model reads it on every turn.

Like a briefing packet given to a consultant before they start work — the more context, the less time wasted asking basic questions

🔧

Tool Registry — What the Model Can Do

Read, Edit, Glob, Grep, Bash, WebSearch, sub-agents

Tools are functions the model can call. The quality of tool descriptions matters enormously — they tell the model when to use each tool, not just what it does. Claude Code has ~12 specialized tools, each with a careful description. A generic harness might give the model just run_command and hope for the best.

Key design choice: Claude Code's Edit tool does surgical string replacement (specify only what changed). Some harnesses use whole-file rewrites, which waste tokens and introduce errors.

A surgeon with a scalpel vs. a surgeon with a chainsaw — both can "edit" tissue, but precision tools produce better outcomes

🔄

Agentic Loop — The Engine

Multi-turn cycle: model thinks → calls tools → sees results → repeats

This is the heart of the harness. When you send a message: (1) the model decides what tool to call, (2) the harness executes it, (3) the result is fed back, (4) the model decides the next action. This loop continues until the model has enough information to give you a final answer.

A good harness lets the model loop 10-20 times autonomously — reading files, running tests, fixing errors, re-running. A limited harness might cut off after 2-3 turns.

Like a chef tasting as they cook — taste, adjust, taste again. Cutting them off after one taste means undercooked food

🗃

Context Management — Long-Term Memory

Compression, summarization, priority ordering

Every model has a finite context window. After dozens of file reads and tool calls, the conversation history fills up. A naive harness crashes or silently drops old messages. A good harness compresses old turns — asking a fast model to summarize them — while keeping recent context intact.

Claude Code also orders context by priority: system prompt and user instructions first, then recent tool results, then older history. This ensures the model always sees the most important information.

Like meeting notes — you don't re-read every email from 3 months ago, you keep a summary and focus on this week's updates

🔒

Permissions & Error Recovery

Safe tools auto-run, dangerous tools need approval, errors feed back to the model

Claude Code categorizes tools by risk: read-only tools (Read, Grep, Glob) run automatically, while modification tools (Edit, Bash, Write) prompt the user. When a user denies a tool call, the denial is fed back to the model as a result, so it can adjust its approach instead of getting stuck.

Similarly, when a tool throws an error (file not found, test failed), the error message is fed back. The model reads it and self-corrects on the next turn. A naive harness would just crash.

Like a new hire who checks with their manager before making big decisions, and learns from mistakes instead of freezing

LIVE DEMO

Same Request, Full Harness

Watch what happens when you say "Fix the bug in utils.py" to Claude Code's harness. Every SSE event is shown.

👤

You

Fix the bug in utils.py — the parse_date function returns None for ISO formats.

↓ Harness prepares the API call

⚙

Harness (before API call)

Assembling context: system prompt (3200 tokens) + CLAUDE.md ("use pytest", "never push to main") + git status (branch: feature/date-fix) + 12 tool definitions. Sending to model...

↓ Turn 1: Model decides to read the file first

🧠

Claude (tool_use)

Read("utils.py")
I need to see the current implementation before editing.

📄

Tool Result (auto-allowed: Read is safe)

  1   import re

            2   from datetime import datetime

            3

            4   def parse_date(s: str):

            5      for fmt in ["%Y-%m-%d", "%m/%d/%Y"]:

            6        try:

            7          return datetime.strptime(s, fmt)

            8        except ValueError:

            9          pass

           10      return None

↓ Turn 2: Model sees the bug — ISO format missing — uses surgical Edit

🧠

Claude (tool_use)

Edit("utils.py",

            old: 'for fmt in ["%Y-%m-%d", "%m/%d/%Y"]:'

            new: 'for fmt in ["%Y-%m-%dT%H:%M:%S", "%Y-%m-%d", "%m/%d/%Y"]:'

          )

Only 1 line changed. File unchanged otherwise.

⚙

Harness (permission check)

Edit requires approval. → User approved.
Uniqueness check: old_string found exactly once. ✓ Executing.

↓ Turn 3: Model runs the tests (from CLAUDE.md: "run pytest before done")

🧠

Claude (tool_use)

Bash("python -m pytest tests/test_utils.py -v")

⚡

Tool Result

tests/test_utils.py::test_parse_iso PASSED

          tests/test_utils.py::test_parse_us PASSED

          tests/test_utils.py::test_parse_dash PASSED

          3 passed in 0.12s

↓ Turn 4: No more tool calls → loop ends → final response

🧠

Claude (final response)

Fixed. The parse_date function was missing the ISO 8601 format (%Y-%m-%dT%H:%M:%S). Added it as the first format to try. All 3 tests pass.

What just happened: 4 turns, 3 tool calls. The model read the file before editing (system prompt rule), used a surgical 1-line edit (Edit tool design), ran tests because CLAUDE.md said to (context injection), and the harness checked permissions on the Edit call (permission layer). A naive harness would have skipped all of this — no file read, whole-file rewrite, no tests, no permission check.

THREE LEVELS

Three Harness Levels, Same Model

The same Opus 4.6, asked "Add a logout button to the nav." Watch how the harness quality changes everything.

LEVEL 1

Naive Harness

No tools, no context, one shot

System prompt: "You are a helpful coding assistant."

Model's response:

"Here's a generic React logout button component you could add to your nav..."

Problems: Doesn't know your framework, your file structure, your design system, or where the nav component is. Just guesses.

Tool calls: 0 · Files read: 0 · Tests run: 0

LEVEL 2

Basic Agent

Tools + loop, but no context

Has read_file, edit_file (whole-file), run_command.

Model's behavior:

1. Glob **/*nav* → finds TopNav.tsx
2. Read TopNav.tsx (890 lines)
3. Rewrite entire file with button added
4. (no tests — doesn't know your conventions)

Problems: Whole-file rewrite risks deleting code. No tests run. No permission check. Styling might not match.

Tool calls: 3 · Tokens wasted on full rewrite: ~2000

LEVEL 3

Full Harness

Context + smart tools + permissions

System prompt + CLAUDE.md (design system, conventions) + surgical Edit + permission layer.

Model's behavior:

1. Glob **/*Nav* → TopNav.tsx
2. Read TopNav.tsx
3. Edit: add 4-line button (matches design system)
4. Bash: npm run test
5. "Done. Added logout button with stone-900 styling."

Result: 4-line surgical edit. Matches your design system. Tests pass. Permission check on Edit.

Tool calls: 4 · Lines changed: 4 · Tests: passing

DEEP DIVE

The Five Key Differences

Each of these layers independently improves the model's behavior. Together, they compound.

System Prompt: Generic vs. Model-Specific

The highest-leverage text in the entire system

Generic (Cursor-like)

"You are a helpful coding assistant"
Must work for GPT-4, Claude, Gemini
No model-specific phrasing
~10-50 tokens
No behavioral rules or examples

Model-Specific (Claude Code)

3000+ tokens of structured instructions
Co-evolved with Opus — tested for which phrasings it responds to best
Explicit do/don't rules with examples
Tool-specific guidance ("use Read, not cat")
Behavior calibration ("don't over-engineer")

        # Generic harness:

        system = "You are a helpful coding assistant."

        # Claude Code (simplified excerpt):

        system = """You are Claude Code, Anthropic's official CLI.

        # Tools

        - Use Read instead of cat/head/tail

        - Use Edit for surgical string replacement

        - Use Grep instead of grep or rg

        # Behavior

        - Read files before editing them

        - Don't over-engineer or add features beyond what was asked

        - If a tool call fails, try a different approach

        - If the user denies a tool, don't re-attempt it

        ..."""

Surgical Edit vs. Whole-File Rewrite

The tool design that saves thousands of tokens per edit

Some harnesses give the model an edit_file(path, new_content) tool — the model must output the entire file even if only one line changed. Claude Code's Edit tool takes old_string and new_string — the model only specifies the diff.

Whole-File Rewrite

Model outputs: 890 lines (entire file)
Lines actually changed: 1
Risk: could accidentally delete other code
No uniqueness check
Tokens wasted: ~2000

Surgical Replacement

Model outputs: 2 strings (old → new)
Lines actually changed: 1
Uniqueness check: old_string must appear exactly once
Precise diff visible to user
Tokens used: ~50

        # Whole-file rewrite (what some harnesses do):

        def edit_file(path: str, new_content: str):

            with open(path, "w") as f:

                f.write(new_content)  # entire file rewritten

        # Surgical replacement (what Claude Code does):

        def edit_file(path: str, old_string: str, new_string: str):

            content = open(path).read()

            assert content.count(old_string) == 1  # uniqueness check!

            content = content.replace(old_string, new_string, 1)

            with open(path, "w") as f:

                f.write(content)

Context Injection — `CLAUDE.md`

How your rules become part of the model's "memory" on every turn

Claude Code loads three layers of context and injects them into the system prompt:

🌐

Global Rules

~/.claude/CLAUDE.md
"Never push to main"

📂

Project Rules

./CLAUDE.md
"Use Tailwind, no emojis"

🧠

Auto-Memory

.claude/memory/
Cross-session learning

🛠

Git State

Branch, status, diffs
Recent commits

        # What Claude Code does before every API call:

        claude_md = load_file("~/.claude/CLAUDE.md")

        project_md = load_file("./CLAUDE.md")

        memory = load_file("~/.claude/memory/MEMORY.md")

        system_prompt += f"""

        <user-instructions>{claude_md}</user-instructions>

        <project-instructions>{project_md}</project-instructions>

        """

        # This is WHY "never push to main" works in Claude Code

        # but not in Cursor — the rule is injected into the

        # system prompt where the model reads it on EVERY turn.

Permission System & Error Recovery

The safety net that also makes the model smarter

When a user denies a tool call or a tool fails, a naive harness crashes or gets stuck. Claude Code feeds the denial/error back to the model as a tool result. The model reads it and adjusts.

Auto-Allowed (safe)

Read · Glob · Grep · WebSearch
Read-only operations. No confirmation needed.

Needs Approval (risky)

Edit · Write · Bash
Modifies files or runs commands. User must approve.

        # When user denies a tool call:

        tool_result = {

          "type": "tool_result",

          "content": "Error: User denied this action. Try a different approach.",

          "is_error": True

        }

        # Model sees this and adapts — maybe asks a question

        # or tries a less invasive alternative.

        # When a tool throws an exception:

        try:

            result = execute_tool(name, args)

        except Exception as e:

            result = f"Error: {e}"

        # Model reads the error and self-corrects on the next turn.

Context Compression — Why Long Sessions Don't Break

Old turns get summarized so the model never loses track

After 20+ file reads and tool calls, the conversation history exceeds the context window. A naive harness either crashes (context_length_exceeded) or silently drops old messages — losing important context.

📝

Messages Grow

Each tool call adds to history

⚠️

Near Limit

Approaching context window

📜

Compress

Summarize old turns via Haiku

✅

Continue

Recent context intact + summary

        # Naive harness — messages grow until crash:

        messages.append(response)

        messages.append(tool_result)

        # Eventually: ERROR context_length_exceeded

        # Full harness — old messages get compressed:

        if total_tokens(messages) > THRESHOLD:

            old = messages[:-KEEP_RECENT]

            summary = summarize_via_haiku(old)

            messages = [summary] + messages[-KEEP_RECENT:]

REAL-WORLD EXAMPLE

SofaGenius IS an Agent Harness

The same architecture that makes Claude Code smart is what makes SofaGenius smart. You already built one.

The punchline: SofaGenius wraps Claude with tools (W&B monitor, SQL analyst, HF scout), system prompts (per-agent instructions), context injection (pre-resolved W&B identity, session state, run aliases), and an agentic loop (base.py) — the exact same pattern that makes Claude Code outperform Cursor.

Same Question, Three Harness Levels in SofaGenius

User: "How's my training going?"

Naive (01)

"What platform are you using? What metrics are you tracking?"
Tool calls: 0. Useless.

Basic (02)

get_wandb_info → list_runs → analyze_health
Tool calls: 3. Works, but slow.

Full Harness (03)

analyze_run_health (with pre-resolved context)
Tool calls: 1. Instant + card rendered.

How SofaGenius Maps to the Six Layers

1. System Prompt Each subagent has its own prompt (training.SYSTEM_PROMPT) with tool selection rules, behavior guidelines, and domain-specific instructions.

2. Context Injection _build_wandb_context() pre-resolves your W&B entity + projects. _build_launch_context() injects the last-launched run ID. _build_alias_context() maps friendly names to run IDs.

3. Tool Registry 5 subagents × 3-8 tools each. Rich descriptions tell the model WHEN to use each tool. CARD_TOOL_MAPPING maps tools to visual card types.

4. Agentic Loop base.run_subagent() loops up to 10 turns. Emits SSE events (text, tool_call, tool_result, card, done) for real-time streaming.

5. Context Management Session state (_session_contexts) persists across messages. _extract_event_context() captures launch/health card data for future reference.

6. Error Recovery _execute_tool() wraps every tool call in try/except. Errors are JSON-encoded and fed back to Claude as tool results.

Plus two layers Claude Code doesn't need:

Intent Routing _classify_intent() uses Haiku to route to 6 specialized subagents. Claude Code doesn't need this — it's one agent for all coding tasks.

Credential Isolation _inject_credentials() temporarily injects user's W&B/HF tokens into env vars during tool execution, then restores. Multi-user safety.

MCP & TRAINING

Two Paths to a Better Agent

Improving the software around the model vs. improving the model itself — they're complementary, not alternatives.

The key confusion: When people say "training an agent," they might mean two very different things. Harness engineering improves the code that wraps the model (system prompts, tool descriptions, loop logic) — no gradients involved. MCP trace fine-tuning trains the model's actual weights to be better at tool calling — real training with GPUs and loss functions. Both make agents better, but through completely different mechanisms.

PATH A

Harness Engineering

Improve the software around the model (no gradients)

What changes: Code — system prompts, tool descriptions, loop logic, context injection
Method: Manual iteration + optional automated prompt optimization (DSPy)
Uses GPUs: No
Benefits all models: Yes — swap in any model and it benefits from the improved harness
Maturity: Production-ready. This is how Claude Code, SofaGenius, and most agent systems are improved today

This is everything we covered in the sections above — system prompt engineering, surgical Edit tool design, CLAUDE.md injection, permissions, context compression.

PATH B

MCP Trace Fine-Tuning

Train the model's weights to be better at tool calling

What changes: Model weights — the neural network parameters themselves
Method: SFT or GRPO with gradient descent on tool-calling traces
Uses GPUs: Yes
Benefits all harnesses: Yes — the model itself becomes better at tools, regardless of which harness runs it
Maturity: Early research / emerging. Backend.AI's MCP Sidecar is one of the first concrete examples

This is real training — collecting agent traces, creating datasets, running SFT or GRPO on GPUs to improve the model's tool-calling ability.

Where MCP Fits In the Picture

MCP is a protocol inside the harness's tool layer — it enables both paths

MCP (Model Context Protocol) is a standard format for tools to advertise their capabilities to LLMs. It lives inside the harness's tool layer — it's not the whole harness, and it's not model training. But it enables both paths:

        # MCP's place in the architecture:

        HARNESS

          ├── System Prompt

          ├── Agentic Loop

          ├── Context Injection

          ├── Permissions

          └── Tool Layer

              ├── Native tools (defined in your code)

              │   e.g., SofaGenius's analyze_run_health

              │

              └── MCP servers (standard protocol) ◄── THIS

                   e.g., GitHub MCP server, Slack MCP server

                   Tools advertised via JSON-RPC schemas

MCP standardizes how tools describe themselves. This is useful for harness engineering (one format to support instead of per-tool custom code), model training (consistent format means more uniform training data), and interoperability (tools built for Claude Code work in any MCP-compatible harness).

How MCP Trace Fine-Tuning Works

Collect real agent sessions, use them to train the model's tool-calling ability

A trace is a recording of a complete agent session: what the user asked, which tools the model called, what results came back, and whether the final outcome was good. Here's what a trace looks like:

User

"What's the weather in Tokyo and book me a flight there"

Model → Tool Call

weather_tool("Tokyo") ← correct tool selection

Tool Result

{"temp": 15, "condition": "cloudy"}

Model → Tool Call

flight_search("SFO", "TYO") ← correct next step

Tool Result

[{price: 850, airline: "JAL"}, ...]

Model → Final Response

"It's 15°C and cloudy in Tokyo. I found flights from $850 on JAL." ← knows when to stop

The training process uses these traces as (input, output) pairs:

📝

Collect Traces

Record real agent sessions with tool calls and results

🔢

Filter & Label

Keep successful traces; discard failures

🔧

Fine-Tune

SFT or GRPO on the traces — real GPU training

✅

Better Model

Model learns: which tool, what args, when to stop

Case Study: Backend.AI's MCP Sidecar Model

SFT failed, GRPO worked — a real example of MCP trace fine-tuning

Backend.AI built a small "sidecar" model that helps open-source LLMs (Llama, Gemma) comply with MCP format. The problem: these models were never trained on MCP tool schemas, so they format tool calls incorrectly. The sidecar sits beside the main LLM and "fixes" its output into proper MCP format.

Attempt 1: SFT — Failed

Used GPT-5 to correct imperfect JSON, then trained the sidecar via SFT to mimic the corrections.

Problem: Since modern LLMs already produce 90%+ valid JSON, the model learned to copy the input, not correct it. The token overlap was too high for SFT to find the signal.

Attempt 2: GRPO — Worked

Switched to reinforcement learning using GRPO (same algorithm SofaGenius uses for training). Used MCP-Bench metrics as reward signals.

Result: Significant improvements in JSON formatting, tool selection, schema adherence, and parameter validity. The model learned to fix, not just copy.

They evaluated on MCP-Bench — a benchmark with 28 MCP servers (Google Maps, Wikipedia, Steam). Open-source models scored poorly because they were never trained on MCP tool-calling formats. The sidecar patches this gap at minimal cost, without modifying the main LLM.

97% of MCP Tool Descriptions Have "Smells"

Fixing tool descriptions is a low-hanging fruit that improves both paths

Recent research analyzed 856 MCP tools and found that 97.1% have at least one quality defect ("smell") in their descriptions. Just like code smells, these aren't technically broken but degrade the model's ability to use the tool correctly.

56%

Unclear Purpose
Tool doesn't explain what it does

97%

At Least One Smell
Missing guidelines, limits, or examples

+6%

Success Improvement
After augmenting descriptions

The six quality components: Purpose (what does it do?), Guidelines (when should you use it?), Limitations (what can't it do?), Parameter Explanation (what do the inputs mean?), Length/Completeness (enough detail?), and Examples (show me a call). Enriching descriptions with these components improved task success by ~6% — a free improvement that requires no model training.

Side-by-Side: How the Two Paths Compare

	Path A: Harness Engineering	Path B: MCP Trace Fine-Tuning
What changes	Code (system prompt, tool descriptions, loop logic)	Model weights (neural network parameters)
Training method	Manual iteration + optional prompt optimization (DSPy)	SFT or GRPO with gradient descent
Uses GPUs	No	Yes
Benefits all users of the model	No — only users of your harness	Yes — the model itself improves
Benefits all models in your harness	Yes — swap in any model and it benefits	No — only the fine-tuned model
Current maturity	Production-ready	Early research / emerging
Example	Claude Code's system prompt, SofaGenius's orchestrator	Backend.AI's MCP Sidecar (GRPO on MCP-Bench)

The bottom line: These two paths are complementary. A better harness helps any model perform better (Path A). A fine-tuned model is better at tool calling regardless of which harness runs it (Path B). The best agent systems will use both — a well-engineered harness running a model that's been fine-tuned for tool use.

References

→ Backend.AI: MCP Sidecar Model — Training an MCP-compliant sidecar via GRPO
→ MCP Tool Descriptions Are Smelly — 97% of MCP tools have quality issues; augmentation improves success
→ MCP Specification (2025-11-25) — The official protocol spec
→ Langfuse: MCP Tracing — Observability for MCP tool calls in agent traces
→ Don't Just Fine-tune the Agent, Tune the Environment — Research on improving harness alongside model
→ DSPy (Stanford) — Automated prompt optimization framework

DEPLOYMENT

Where Does the Fine-Tuned Model Go?

You've trained a better model on MCP traces. Now where does it plug into the system? Three options — each touching a different part of the architecture.

Current Architecture: 3 Model Call Sites

Every box with a model name is a potential deployment target

User Message
     │
     ▼
┌─────────────────────────────────────────────────────┐
│  ORCHESTRATOR  orchestrator.py                       │
│                                                     │
│  1. Intent Classification ──▶ claude-haiku-4-5      │  ← line 414
│     "training" | "data" | "scout" | ...              │
│                                                     │
│  2. General Responses ────▶ claude-sonnet-4-5       │  ← line 469
│     (no tools, direct answer)                       │
│                                                     │
│  Context Injection:                                  │
│     _build_wandb_context()                         │
│     _build_launch_context()                        │
│     _build_alias_context()                         │
│                                                     │
└──────────────────────┬──────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────┐
│  SUBAGENT LOOP  base.py                              │
│                                                     │
│  3. Tool-calling Agent ──▶ claude-sonnet-4-5        │  ← line 168
│     + tools (analyze_run_health, list_runs, ...)    │
│     + agentic loop (up to 10 turns)                 │
│     + card emission via CARD_TOOL_MAPPING           │
│                                                     │
└─────────────────────────────────────────────────────┘

The harness code (context injection, alias resolution, card emission, permissions) stays completely unchanged in all three deployment options below. You're only swapping which model the harness calls — the architecture is the same.

Option A

Replace the Subagent Model

Highest impact — this is where all tool calling happens

The fine-tuned model replaces Sonnet in the subagent loop. This is the highest-value target because the subagent makes every tool-calling decision: which tool to pick, what arguments to pass, when to stop looping.

backend/agents/base.py:168-169

# Before

- model="claude-sonnet-4-5-20250929",

# After

+ model="ft:your-org/sofa-genius-tool-agent",

What the model learned from traces:

Call analyze_run_health directly when context has the run ID (skip list_runs)
Format entity_project as "entity/project" consistently
Stop after 1 tool call for simple health checks (not 2–3)
Never leak JSON into text responses

Impact

Very High

Option A: Replace Subagent Model

               BEFORE                                 AFTER
               ──────                                 ─────
  base.py:168                              base.py:168
  claude-sonnet-4-5                        fine-tuned-model
       │                                         │
       ▼                                         ▼
  ┌──────────┐                            ┌──────────────────┐
  │ Sonnet    │ → tool calls → tools      │ Fine-tuned        │ → tool calls → tools
  │ (generic) │ ← results   ←            │ (learned W&B      │ ← results   ←
  └──────────┘                            │  tool patterns)   │
                                          └──────────────────┘

Option B

Sidecar Model (Intercept + Correct)

When you can't fine-tune the primary model (e.g., it's a closed API)

The primary model stays Sonnet. A small fine-tuned model (e.g., Llama 8B) sits between Sonnet's tool-call output and the actual tool execution, correcting malformed calls. This is exactly what Backend.AI did with their MCP Sidecar.

backend/agents/base.py:224-225

# Current

- for tool_use in tool_uses:

- result = _execute_tool(tool_use["name"], tool_use["input"], tool_dispatch)

# With sidecar

+ for tool_use in tool_uses:

+ name, args = _sidecar_correct(tool_use["name"], tool_use["input"])

+ result = _execute_tool(name, args, tool_dispatch)

What the sidecar corrects:

Malformed JSON arguments (missing quotes, wrong types)
Wrong parameter names (project → entity_project)
Schema mismatches between what the model outputs and what MCP expects

Impact

Medium

Option B: Sidecar Model (Intercept & Correct)

┌───────────────────────────────────────────────────────────────┐
│  SUBAGENT LOOP  base.py                                     │
│                                                              │
│  Sonnet ──▶ tool_use block ──▶ SIDECAR ──▶ _execute_tool   │
│              (may be wrong)      (fixes it)    (correct)      │
│                                                              │
│  New code needed:                                            │
│    _sidecar_correct(name, args) → (name, args)               │
│    Uses fine-tuned Llama 8B to validate/fix                  │
│                                                              │
└───────────────────────────────────────────────────────────────┘

Option C

Replace the Intent Classifier

Lower impact, but useful if classification is the bottleneck

Fine-tune a small model to replace Haiku for intent routing. The current classifier uses a general-purpose model to categorize messages into 6 categories. A specialized model could be faster, cheaper, and more accurate for your specific domain.

backend/orchestrator.py:414-418

# Before

- model="claude-haiku-4-5-20251001",

# After

+ model="ft:your-org/sofa-intent-classifier",

Training data:

Simple (message, category) pairs from real user sessions
"check my training" → training
"find a dataset for sentiment analysis" → scout
"launch a GRPO run" → launch

Impact

Low

Comparison: Which Option Matters Most?

Option	File Changed	Impact	Training Data Needed
A: Replace subagent	`base.py:168`	High	Full agent traces (message → tools → result → answer)
B: Sidecar	`base.py:224` (new fn)	Medium	(wrong tool call, corrected tool call) pairs
C: Replace classifier	`orchestrator.py:414`	Low	(message, category) pairs

Key insight: Option A is where most value is. The subagent model makes all tool-calling decisions — which tool to pick, what arguments to pass, when to stop looping. That's exactly what MCP trace fine-tuning teaches. Options B and C address narrower problems (format correction and routing), while Option A improves the core reasoning.

What Stays the Same?

The harness architecture is unchanged — only the model inside changes

Unchanged (Harness)

• Context injection (W&B identity, session, aliases)
• Intent routing architecture
• Tool definitions and dispatch
• Card emission via CARD_TOOL_MAPPING
• SSE streaming to frontend
• Permission model and error recovery

Changed (Model)

• Which tool to call and when
• Argument formatting accuracy
• Number of loop iterations needed
• When to stop and give a final answer
• JSON formatting compliance
• Domain-specific reasoning

This is why the harness vs. model distinction matters. The harness is the frame. The model is the engine. You can upgrade the engine without redesigning the frame — and you can improve the frame to make any engine perform better. The best systems do both.

GLOSSARY

Key Terms

The vocabulary of agent harnesses, explained in plain English.

Core Concepts

Agent Harness

The complete software layer between you and the LLM. Includes system prompt, tool definitions, agentic loop, context management, and permission system. The same model inside different harnesses produces different behavior — this is why Opus feels better in Claude Code than in Cursor.

The operating system that a CPU runs inside — same chip, different OS, different experience

System Prompt

A block of text sent to the model before your message, on every turn. Defines personality, rules, capabilities, and behavioral constraints. Claude Code's is 3000+ tokens. A generic harness might use 10 words.

A job description that the employee re-reads every morning before starting work

Tool Definition

A JSON schema describing a function the model can call. Includes name, description (tells model WHEN to use it), and input parameters. The quality of descriptions directly affects model behavior — a vague description produces random tool usage.

A labeled toolbox — clear labels ("Phillips head screwdriver, use for small screws") vs. unlabeled drawers

Agentic Loop

The multi-turn cycle where the model calls a tool, sees the result, decides the next action, calls another tool, and repeats until it has enough information to give a final answer. The number of allowed turns determines how much autonomous work the model can do.

A chef tasting soup repeatedly — taste, adjust salt, taste again, add pepper, taste again, serve

Context & Memory

CLAUDE.md

A markdown file in your project root that contains persistent instructions for Claude Code. Rules written here are injected into the system prompt on every turn. Examples: "never push to main", "use pytest for tests", "follow our design system." Also supports a global version at ~/.claude/CLAUDE.md.

A team wiki that every new hire reads on day one — and re-reads every morning

Context Window

The maximum amount of text a model can "see" at once (input + output). When the conversation history exceeds this limit, something must be dropped or compressed. Claude's context window is ~200K tokens.

The size of your desk — you can only spread out so many papers before you have to file some away

Context Compression

When the conversation history nears the context window limit, old turns are summarized by a fast model (Haiku) and replaced with a compact summary. Recent turns are kept verbatim. This prevents crashes and information loss.

Meeting minutes — you don't re-read every word from last month's meetings, you keep a summary

Tool Design

Surgical Edit

An edit tool that takes (old_string, new_string) and replaces only the exact match. Enforces uniqueness (old_string must appear exactly once). This is how Claude Code's Edit tool works — dramatically fewer tokens and errors than whole-file rewrites.

Find-and-replace in a text editor, but with a safety check that prevents ambiguous replacements

SSE (Server-Sent Events)

A protocol for streaming data from server to browser. Agent harnesses use SSE to stream the model's response in real-time: text chunks, tool call notifications, card data, and completion signals. This creates the "typing" effect you see in Claude Code and SofaGenius.

A live sports ticker — updates appear as they happen, you don't have to refresh

Safety & Control

Permission Model

A system that categorizes tools by risk level. Read-only tools (Read, Grep) run automatically. Modification tools (Edit, Bash) require user approval. When the user denies a call, the denial is fed back to the model so it can adjust its approach.

A new employee who checks with their manager before making purchases over $100

Error Recovery

When a tool call fails (file not found, test failed, permission denied), the error is fed back to the model as a tool result. The model reads the error and self-corrects on the next turn. A naive harness would crash or get stuck.

A GPS that reroutes when you miss a turn, instead of saying "route calculation failed"

Intent Routing

Using a fast, cheap model (Haiku) to classify the user's intent and route to a specialized subagent. SofaGenius uses this: "check my training" → Training Agent, "find a dataset" → Scout Agent. This keeps each agent focused with a small, relevant tool set.

A hospital triage desk — quick assessment of your symptoms, then you're sent to the right specialist

MCP & Model Training

MCP (Model Context Protocol)

A standard protocol for tools to advertise their capabilities to LLMs via JSON-RPC schemas. MCP lives inside the harness's tool layer. It standardizes how tools describe themselves, enabling interop — a tool built for Claude Code works in any MCP-compatible harness.

USB-C for AI tools — one standard plug so any device works with any port

MCP Trace Fine-Tuning

Training a model's weights on recorded agent sessions (user request → tool calls → results → final answer). The model learns which tool to pick, how to format arguments, and when to stop. This is Path B — improving the model itself, not the harness code.

A new doctor studying case files from experienced doctors — learning diagnosis patterns from real patient interactions

SFT (Supervised Fine-Tuning)

Training a model to imitate correct examples by minimizing the difference between its output and the reference output. Simple but can fail for tool calling — the model may learn to copy surface patterns (90% token overlap) rather than learning the underlying correction logic.

Copying an expert's homework — you reproduce the answers but don't learn the reasoning

GRPO (Group Relative Policy Optimization)

A reinforcement learning method that scores multiple model outputs and pushes the model toward better ones. Unlike SFT, the model explores different strategies and is rewarded for correctness, not imitation. Backend.AI found GRPO succeeded where SFT failed for MCP compliance.

Learning to cook by trying variations and keeping the recipes that taste best, rather than memorizing one cookbook

Sidecar Model

A small, specialized model deployed alongside the primary LLM. It intercepts and corrects the primary model's tool calls to comply with MCP format. Useful when you can't fine-tune the primary model (e.g., it's a closed API like Claude or GPT).

A translator sitting next to a foreign diplomat — the diplomat speaks, the translator fixes the grammar for the audience

MCP-Bench

A benchmark suite with 28 real MCP servers (Google Maps, Wikipedia, Steam, etc.) for evaluating how well a model handles tool calling. Open-source models score poorly because they were never trained on MCP formats — this is the gap sidecar models aim to fill.

A driving test with 28 different road scenarios — measures whether you can actually drive, not just recite the manual

Tool Description Smells

Quality defects in MCP tool descriptions: missing parameter explanations, vague descriptions, no usage examples, inconsistent formatting. Research shows 97% of MCP tools have at least one smell. Augmenting descriptions with examples and constraints improves task success by ~6%.

Poorly labeled medicine bottles — "take as needed" vs. "take 2 tablets with food every 8 hours for pain"

Why the Same Model Feels Different

Six Layers of an Agent Harness

Same Request, Full Harness

Three Harness Levels, Same Model

Naive Harness

Basic Agent

Full Harness

The Five Key Differences

System Prompt: Generic vs. Model-Specific

Generic (Cursor-like)

Model-Specific (Claude Code)

Surgical Edit vs. Whole-File Rewrite

Context Injection — CLAUDE.md

Permission System & Error Recovery

Context Compression — Why Long Sessions Don't Break

SofaGenius IS an Agent Harness

Same Question, Three Harness Levels in SofaGenius

How SofaGenius Maps to the Six Layers

Two Paths to a Better Agent

Harness Engineering

MCP Trace Fine-Tuning

Where MCP Fits In the Picture

How MCP Trace Fine-Tuning Works

Case Study: Backend.AI's MCP Sidecar Model

97% of MCP Tool Descriptions Have "Smells"

Side-by-Side: How the Two Paths Compare

References

Where Does the Fine-Tuned Model Go?

Current Architecture: 3 Model Call Sites

Replace the Subagent Model

Sidecar Model (Intercept + Correct)

Replace the Intent Classifier

Comparison: Which Option Matters Most?

What Stays the Same?

Key Terms

Context Injection — `CLAUDE.md`