Agent Harness Engineering: The Missing Half of AI Agents

Why better agents come from better systems around models, not just better models

Jun 11, 2026

For the last few years, most conversations about AI agents have focused on the model.

Which model is smarter? Which one writes better code? Which one follows instructions better? Which one hallucinates less?

Those questions matter. But they miss the more important half of the system.

An agent is not just a model. An agent is a model inside a working environment. That environment decides what the model can see, what tools it can use, what rules it must follow, how it gets feedback, how it recovers from mistakes, and how it continues work across time.

That environment is the harness.

A simple way to say it is:

Agent = Model + Harness

The model generates the next step. The harness determines whether that step happens in the right context, under the right constraints, with the right verification.

This idea shows up across recent writing from Addy Osmani, Martin Fowler, LangChain, and Anthropic. Together, they point to a shift in how we should think about agents.

The next frontier is not only better models. It is better systems around models.

A Smart Model Is Not a Reliable Worker

A raw model has no durable memory. It does not naturally know your codebase conventions. It does not know which tests matter. It does not feel the pain of breaking production. It does not know what your team considers “good.”

It only has the current context and the next action.

A harness gives the model the missing parts of work:

Filesystem and Git for durable state
Tools for acting on the world
Sandboxes for safe execution
Rules and skills for project knowledge
Hooks for enforcement
Tests and linters for feedback
Browser automation for real user verification
Progress files for handoff across sessions
Subagents for specialized work

This is why the same model can feel weak in one product and strong in another. The difference is often not intelligence. It is the quality of the harness.

A decent model in a great harness can outperform a great model in a bad one.

The Harness Is a Control Loop

The most useful way to understand harness engineering is not as “adding tools.” It is as building a control system.

A good harness has two forces.

First, feedforward: guidance before the agent acts. This includes AGENTS.md, architecture docs, coding conventions, task plans, skills, and examples.

Second, feedback: sensors after the agent acts. This includes type checks, tests, linters, static analysis, browser checks, review agents, logs, and runtime metrics.

Feedforward reduces the chance of mistakes. Feedback catches mistakes early and gives the agent something actionable to repair.

If you only have feedforward, the agent gets rules but no reality check. If you only have feedback, the agent learns every constraint by crashing into it. The power comes from the loop:

Guide the agent, let it act, inspect the result, feed the signal back, improve the system.

That is the core of harness engineering.
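As a rough illustration, the loop can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: `run_control_loop`, `SensorResult`, and the `toy_agent` stand-in for a model call are all hypothetical names invented here, and a real harness would call an actual model and real checkers.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class SensorResult:
    name: str
    ok: bool
    message: str = ""


# Hypothetical signatures: an agent maps (task, guidance, feedback) to output
# text, and a sensor maps output text to a SensorResult.
Agent = Callable[[str, str, List[str]], str]
Sensor = Callable[[str], SensorResult]


def run_control_loop(
    task: str,
    guidance: str,
    agent: Agent,
    sensors: List[Sensor],
    max_attempts: int = 3,
) -> Tuple[Optional[str], List[List[str]]]:
    """Guide the agent, let it act, inspect the result, feed the signal back.

    Returns the accepted output (or None) plus the failure messages seen on
    each attempt, so recurring failures can later be turned into new rules.
    """
    feedback: List[str] = []
    history: List[List[str]] = []
    for _ in range(max_attempts):
        output = agent(task, guidance, feedback)          # act, with feedforward
        results = [sensor(output) for sensor in sensors]  # inspect
        feedback = [f"{r.name}: {r.message}" for r in results if not r.ok]
        history.append(feedback)
        if not feedback:
            return output, history
    return None, history


def has_docstring(output: str) -> SensorResult:
    ok = '"""' in output
    return SensorResult("docstring", ok, "" if ok else "function needs a docstring")


def toy_agent(task: str, guidance: str, feedback: List[str]) -> str:
    # Stand-in for a model call: it only adds a docstring after being told to.
    if any("docstring" in line for line in feedback):
        return 'def add(a, b):\n    """Return a + b."""\n    return a + b\n'
    return "def add(a, b):\n    return a + b\n"


result, history = run_control_loop(
    "write add()", "Follow AGENTS.md", toy_agent, [has_docstring]
)
```

The returned `history` is the part that feeds the last step of the loop: failure messages that keep recurring across runs are candidates for new guidance or new sensors.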

Every Failure Should Improve the Harness

The most practical insight is simple: when an agent fails, do not only blame the model.

Ask what the harness failed to provide.

If the agent edits files it should never touch, add a boundary rule and an enforcement hook.

If it claims a task is done without running tests, make tests part of the completion gate.

If it keeps misunderstanding the project structure, add a short project map.

If it repeatedly produces code that passes tests but violates architecture, add structural checks.

This creates a ratchet. Every real failure becomes a new rule, hook, test, tool, or sensor. The system gets better because mistakes are turned into infrastructure.
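For instance, the boundary rule from the first case could be enforced by a small pre-edit hook. The sketch below is only one possible shape: the protected patterns and the function names `blocked_edits` and `pre_edit_hook` are assumptions made for illustration, and it does not handle absolute paths or symlinks.

```python
import posixpath
from fnmatch import fnmatchcase
from typing import Iterable, List

# Hypothetical boundary rules; a real project would derive these from
# failures it has actually seen.
PROTECTED_PATTERNS = ["migrations/*", ".github/*", "*.lock"]


def blocked_edits(
    paths: Iterable[str], protected: Iterable[str] = PROTECTED_PATTERNS
) -> List[str]:
    """Return the proposed edit paths that match a protected pattern."""
    patterns = list(protected)
    blocked = []
    for raw in paths:
        # Normalize separators and collapse "./" and "../" segments first,
        # so "src/../migrations/x.py" cannot slip past the rule.
        path = posixpath.normpath(raw.replace("\\", "/"))
        if any(fnmatchcase(path, pattern) for pattern in patterns):
            blocked.append(path)
    return blocked


def pre_edit_hook(paths: Iterable[str]) -> None:
    """Raise before any write happens if the agent touches a protected file."""
    blocked = blocked_edits(paths)
    if blocked:
        raise PermissionError(
            f"edit blocked by boundary rule: {', '.join(blocked)}"
        )
```

The rule tells the agent what not to touch; the hook makes sure the rule holds even when the agent ignores it.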

But the word “real” matters. Do not add rules just because they sound wise. Every line in a good agent rulebook should pay rent. If it cannot be traced to a real failure or hard constraint, it may just be context noise.

A harness should feel like a pilot checklist, not a style guide.
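One way to keep a rulebook honest is to record, next to each rule, the failure or constraint that justifies it, and to flag rules that have neither. The sketch below assumes a hypothetical `Rule` record with `traced_to` and `hard_constraint` fields; the sample rules and the incident label are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    text: str
    traced_to: Optional[str] = None  # hypothetical: note on the failure behind the rule
    hard_constraint: bool = False    # e.g. a security or compliance requirement


def rules_not_paying_rent(rules: List[Rule]) -> List[Rule]:
    """Return rules that trace to neither a real failure nor a hard constraint."""
    return [r for r in rules if not r.traced_to and not r.hard_constraint]


rulebook = [
    Rule("Never edit files under migrations/", traced_to="agent rewrote an applied migration"),
    Rule("Do not log access tokens", hard_constraint=True),
    Rule("Prefer elegant code"),
]

candidates_for_removal = [r.text for r in rules_not_paying_rent(rulebook)]
```

Rules flagged this way are not necessarily wrong, but they are the first place to look when the rulebook starts crowding out more important context.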

More Tools Can Make the Agent Worse

It is tempting to solve every agent problem by adding more tools, more MCP servers, more rules, and more documentation.

But tools are not free.

Every tool description enters the model’s decision space. Every rule competes for attention. Every extra file increases the chance that the agent reads the wrong thing, misses the important thing, or gets steered by low-quality context.

A strong harness is selective.

Use deterministic checks when possible: types, tests, lint, dependency rules, structural analysis. They are fast, cheap, and reliable.

Use inferential checks when needed: AI code review, semantic analysis, LLM-as-judge. They are useful, but slower, more expensive, and less deterministic.

The practical rule is:

Use computers for what computers can check. Use models for what needs judgment.

That keeps the harness simple and trustworthy.
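A minimal sketch of that ordering, assuming the change under review is a string of Python source: cheap deterministic checks run first, and the inferential check is consulted only if they all pass. The names `review_change`, `syntax_check`, and `no_bare_except` are illustrative, and `judge` is a placeholder for whatever model-based review a team actually uses.

```python
import ast
from typing import Callable, List, Tuple

CheckResult = Tuple[bool, str]
Check = Callable[[str], CheckResult]


def syntax_check(source: str) -> CheckResult:
    """Deterministic: does the change parse at all?"""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return False, f"syntax error on line {exc.lineno}: {exc.msg}"
    return True, ""


def no_bare_except(source: str) -> CheckResult:
    """Deterministic: reject `except:` with no exception type.

    Assumes the source already parses, so it should run after syntax_check.
    """
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            return False, f"bare except on line {node.lineno}"
    return True, ""


def review_change(
    source: str, deterministic: List[Check], judge: Check
) -> Tuple[bool, str, bool]:
    """Return (ok, message, used_model). The judge runs only as a last step."""
    for check in deterministic:
        ok, message = check(source)
        if not ok:
            return False, message, False  # model never consulted
    ok, message = judge(source)
    return ok, message, True
```

With this ordering, a change that fails to parse never costs a model call, and the judge only sees code that has already cleared the mechanical bar.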

Long-Running Agents Need Handoffs, Not Just Longer Context

Long tasks reveal the limits of naive agent design.

Anthropic describes two common failure modes. First, an agent tries to complete too much at once, runs out of context, and leaves half-finished work behind. Second, a later session sees some progress and declares victory too early.

The fix is not only a bigger context window. It is a better handoff.

A long-running harness needs artifacts that survive the session:

A feature list that defines what “done” means
A progress file that records what changed
An init script that restores the environment
Git history that shows the sequence of work
Tests that verify each completed feature

The agent should not wake up and guess what happened. It should onboard like a new engineer joining a project midstream.
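As one possible shape for the feature list and progress file, the sketch below keeps both in a single JSON file that each session reads on start and updates on exit. The file layout (`features`, `verified`, `log`) and the helper names are assumptions for illustration, not a format taken from any particular tool.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def load_handoff(path: Path) -> dict:
    """Read the progress file a previous session left behind."""
    return json.loads(path.read_text(encoding="utf-8"))


def next_unfinished(state: dict) -> Optional[dict]:
    """Pick the first feature that has not been verified yet."""
    return next((f for f in state["features"] if not f["verified"]), None)


def record_session(
    path: Path, state: dict, feature_id: str, summary: str, verified: bool
) -> None:
    """Update the feature status and append a log entry for the next session."""
    for feature in state["features"]:
        if feature["id"] == feature_id:
            feature["verified"] = verified
    state["log"].append(
        {"feature": feature_id, "summary": summary, "verified": verified}
    )
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")


# Usage: two short sessions passing work forward through one file.
with tempfile.TemporaryDirectory() as tmp:
    progress = Path(tmp) / "progress.json"
    progress.write_text(json.dumps({
        "features": [
            {"id": "login", "verified": False},
            {"id": "logout", "verified": False},
        ],
        "log": [],
    }), encoding="utf-8")

    state = load_handoff(progress)   # session 1 onboards
    first = next_unfinished(state)
    record_session(progress, state, first["id"], "added login form and test", True)

    state = load_handoff(progress)   # session 2 onboards from disk, not from memory
    second = next_unfinished(state)
    remaining = [f["id"] for f in state["features"] if not f["verified"]]
```

Because "done" is defined by the `verified` flags rather than by the agent's impression of the code, a later session cannot declare victory while unverified features remain.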

This changes the goal. Long-running autonomy is not one giant session. It is a sequence of clean, incremental sessions that can safely pass work forward.

The key capability is not memory. It is handoff quality.

The Hardest Thing to Harness Is Behavior

Some things are relatively easy to regulate.

Maintainability can be checked with linting, type systems, complexity rules, duplication detection, and test coverage.

Architecture can be checked with dependency rules, module boundaries, and fitness functions.
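A dependency rule of this kind can be checked mechanically. The sketch below, which assumes a Python codebase and a hypothetical rule that some module must not import from an `infrastructure` package, uses the standard `ast` module to list violations; the package name and function name are illustrative.

```python
import ast
from typing import Iterable, List, Tuple


def find_forbidden_imports(
    source: str, forbidden_packages: Iterable[str]
) -> List[Tuple[int, str]]:
    """Return (line, module) pairs for imports of a forbidden package.

    Only absolute imports are inspected; relative imports such as
    `from . import x` are ignored in this sketch.
    """
    forbidden = list(forbidden_packages)
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            # Match the package itself or any of its submodules, but not
            # unrelated packages that merely share a prefix.
            if any(name == pkg or name.startswith(pkg + ".") for pkg in forbidden):
                violations.append((node.lineno, name))
    return sorted(violations)


domain_module = (
    "import os\n"
    "from infrastructure.db import Session\n"
    "import infrastructure\n"
    "from . import helpers\n"
    "import infrastructure_utils\n"
)

violations = find_forbidden_imports(domain_module, ["infrastructure"])
```

Run as part of the test suite or a pre-commit hook, a check like this turns an architecture guideline into a sensor the agent gets feedback from.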

Behavior is harder.

Does the app actually do what the user needs? Does the generated test reflect the real requirement? Does the UI flow work for a human? Did the agent solve the right problem, or only satisfy the visible prompt?

This is where human judgment still matters most.

A harness can reduce review toil. It can catch many errors before humans see them. It can make agents safer and more consistent. But it cannot fully define product intent, business tradeoffs, taste, or organizational context.

The point of harness engineering is not to remove humans. It is to move human attention to the places where it matters most.

Better Models Will Not Eliminate Harnesses

A common assumption is that harnesses are temporary. Once models get better, we will need fewer scaffolds.

That is partly true. Some harness components encode assumptions about what the model cannot do. When models improve, those components should be removed.

But the need for harnesses does not disappear. It moves.

When models could barely write code, the harness helped them edit files and run tests. When models became better coders, the harness shifted toward verification, planning, tool selection, and long-running state. As models become capable of larger tasks, harnesses will need to coordinate multiple agents, isolate worktrees, evaluate design quality, monitor drift, and manage complex feedback loops.

The ceiling moves. The failure modes move with it.

Harness engineering is not a patch for weak models. It is the discipline of designing the environment in which model intelligence becomes useful work.

The Practical Mental Model

When designing an agent harness, do not start with tools.

Start with failure.

Ask:

What does this agent often get wrong?
Was the failure caused by missing context before action?
Was it caused by missing feedback after action?
Can the failure be detected automatically?
Can the fix become a rule, test, hook, tool, or handoff artifact?

That gives you the basic loop:

Desired behavior → feedforward guidance → agent action → feedback sensor → self-correction → failure captured → harness updated
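The triage questions can be written down as a small decision function. The mapping below is one interpretation rather than a fixed taxonomy: the `Failure` fields and the suggested update labels are assumptions chosen to mirror the questions above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Failure:
    description: str
    missing_context: bool    # was guidance absent before the action?
    missing_feedback: bool   # was a sensor absent after the action?
    auto_detectable: bool    # can a script detect it without judgment?
    spans_sessions: bool = False  # did state get lost between sessions?


def suggest_harness_updates(failure: Failure) -> List[str]:
    """Map the answers to the triage questions onto candidate harness changes."""
    updates = []
    if failure.missing_context:
        updates.append("feedforward: add a rule or project map entry")
    if failure.missing_feedback and failure.auto_detectable:
        updates.append("feedback: add a test, lint rule, or hook")
    if failure.missing_feedback and not failure.auto_detectable:
        updates.append("feedback: add a review step that applies judgment")
    if failure.spans_sessions:
        updates.append("handoff: add or extend a handoff artifact")
    return updates


example = Failure(
    description="claimed the task was done without running tests",
    missing_context=False,
    missing_feedback=True,
    auto_detectable=True,
)
suggestions = suggest_harness_updates(example)
```

An empty result is itself a signal: if a failure maps to no missing context, no missing sensor, and no lost state, it may be a one-off rather than something the harness should absorb.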

This is the heart of agent harness engineering.

A prompt can remind an agent once. A harness can prevent a whole class of failures.

The Real Shift

The important shift is not from “bad models” to “good models.”

It is from treating agents as chatbots to treating them as workers inside engineered systems.

A model gives you intelligence. A harness gives that intelligence memory, tools, constraints, feedback, recovery, and accountability.

That is why harness engineering matters.

The future of agents will not be won only by whoever has the strongest model. It will be won by the teams that build the clearest environments for models to do reliable work.

Pinyu Labs
