Agent Harnesses — The Model Was Never the Bottleneck

I watched a talk by Tejas Kumar where he pulled off something that looks like a magic trick and is actually the whole point of agent engineering. He took one of the worst models you can still rent — gpt-3.5-turbo — gave it one fixed prompt and one job, and made it reliably do that job. The trick: he never improved the prompt, and never swapped the model. He improved everything around it.

That "everything around it" finally has a name. An AI harness is, in Tejas's framing, everything except the model weights — the tools the model can call, the context you feed it, the guardrails that bound it, the verify step that checks its work, the loop that ties it together, and the environment it runs in. The model is a rented brain. The harness is the body, the senses, and the supervisor. Let's build one, piece by piece.

Source: Tejas Kumar (IBM) · AI Engineer World's Fair
Repo: TejasQ/basically-ai-harness
Model: gpt-3.5-turbo-0613
Stack: TypeScript · OpenRouter · Playwright

What you'll learn

What an "AI harness" is — and why it, not the model, is usually the thing to fix
How to build one from a bare loop, in five small steps
Why guardrails and a verify step catch two completely different failures
A quick self-check to make sure it stuck

What a harness actually is

When an agent fails, the reflex is to reach for a bigger model or a cleverer prompt. The harness mindset says the opposite. Mitchell Hashimoto, who coined "harness engineering" in early 2026, put it as a rule: whenever an agent makes a mistake, you engineer the environment so it can't make that mistake again. The fix is rarely "try harder"; it's "what capability is missing." Those capabilities are the six parts below, each described by what it does in this repo.

Figure: Anatomy of a harness. The model (rented brain) sits in the middle, surrounded by everything except the model weights: Tools (the verbs), Context (what it sees), Guardrails (the limits), Loop (the engine), Verify (the proof), Environment (the world).

Tools — the verbs

The actions the model is allowed to take. Here it gets seven browser tools — browser_navigate, browser_get_stories, browser_click, browser_fill and friends — each bound to a live browser session the harness owns, not a global they reach into. Without tools, the model can only describe; with them, it can act.
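
One way to picture "bound to a session, not a global" is a factory that closes over whichever session it is handed. The sketch below is my own illustration, not the repo's code: the Session and Tool interfaces, createTools, and the in-memory fakeSession are hypothetical stand-ins, and only the tool name browser_navigate comes from the post.

```typescript
// Hypothetical shapes for illustration; the repo's real types may differ.
interface Session {
  url: string;
  navigate(url: string): void;
}

interface Tool {
  name: string;
  run(args: Record<string, string>): string;
}

// Every tool closes over the session passed in, so no tool reaches for a global browser.
function createTools(session: Session): Tool[] {
  return [
    {
      name: "browser_navigate",
      run: (args) => {
        const target = args["url"] ?? "about:blank";
        session.navigate(target);
        return `navigated to ${session.url}`;
      },
    },
  ];
}

// In-memory stand-in for a real browser session (an assumption for the demo).
function fakeSession(): Session {
  const session: Session = {
    url: "about:blank",
    navigate(url: string): void {
      session.url = url;
    },
  };
  return session;
}

// Two independent sessions: acting through A's tools must not touch B.
const sessionA = fakeSession();
const sessionB = fakeSession();
const navigateTool = createTools(sessionA).find((tool) => tool.name === "browser_navigate");
navigateTool?.run({ url: "https://news.ycombinator.com" });
```

Because the tools only know the session they were created with, two runs can share a process without stepping on each other, and a test can swap in a fake session like the one here.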

Context — what it sees each turn

The messages handed to the model. createContext(task) is a minimal two-message array: a short system prompt plus the task. As the run grows, trimContext() keeps the system prompt and original task pinned and drops the oldest tool messages, so the window never overflows.
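
The pin-and-drop behaviour described above can be sketched as follows. The function names createContext and trimContext come from the post, but the bodies, the Message shape, the limit parameter, and the placeholder system prompt are assumptions of mine, so treat this as one plausible reading rather than the repo's implementation.

```typescript
// Hypothetical message shape; real chat APIs carry more fields (tool call ids, names).
type Role = "system" | "user" | "assistant" | "tool";

interface Message {
  role: Role;
  content: string;
}

// Placeholder text; the post does not show the repo's actual system prompt.
const SYSTEM_PROMPT = "You are a browser agent.";

// Minimal two-message context: system prompt plus the task.
function createContext(task: string): Message[] {
  return [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: task },
  ];
}

// Keep the first two messages pinned and drop the oldest tool messages
// until the array fits within `limit` (or no tool messages remain).
function trimContext(messages: Message[], limit: number): Message[] {
  let excess = messages.length - limit;
  return messages.filter((message, index) => {
    const pinned = index < 2;
    if (!pinned && excess > 0 && message.role === "tool") {
      excess -= 1;
      return false; // filter walks front to back, so the oldest tool results go first
    }
    return true;
  });
}

const context = createContext("Upvote the top story on Hacker News.");
context.push(
  { role: "tool", content: "result 1" },
  { role: "assistant", content: "thinking" },
  { role: "tool", content: "result 2" },
  { role: "tool", content: "result 3" },
);
const trimmed = trimContext(context, 4);
```

One simplification to be aware of: chat APIs that support tool calling generally expect each tool result to follow the assistant message that requested it, so a production trimmer would likely need to drop those messages as a pair.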

Guardrails — the limits

Deterministic checks that fire before each model call. combineGuardrails(maxIterations(15), maxMessages(50)) runs each in order and stops on the first failure. No model judgment — just plain code saying "this has gone too far."
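
Here is a minimal sketch of how those three functions could fit together. The names combineGuardrails, maxIterations, and maxMessages are from the post; the LoopState and GuardrailResult shapes, and the choice to trip at "greater than or equal", are assumptions made for illustration.

```typescript
// Hypothetical state and result shapes; only the three function names come from the post.
interface LoopState {
  iteration: number;
  messageCount: number;
}

type GuardrailResult = { ok: true } | { ok: false; reason: string };
type Guardrail = (state: LoopState) => GuardrailResult;

const maxIterations = (limit: number): Guardrail => (state) =>
  state.iteration >= limit
    ? { ok: false, reason: `iteration limit ${limit} reached` }
    : { ok: true };

const maxMessages = (limit: number): Guardrail => (state) =>
  state.messageCount >= limit
    ? { ok: false, reason: `message limit ${limit} reached` }
    : { ok: true };

// Run each guardrail in order and return the first failure, if any.
function combineGuardrails(...guardrails: Guardrail[]): Guardrail {
  return (state) => {
    for (const guardrail of guardrails) {
      const result = guardrail(state);
      if (!result.ok) return result;
    }
    return { ok: true };
  };
}

const defaultGuardrails = combineGuardrails(maxIterations(15), maxMessages(50));
const healthy = defaultGuardrails({ iteration: 3, messageCount: 12 });
const runaway = defaultGuardrails({ iteration: 15, messageCount: 80 });
```

In this sketch the runaway state breaks both limits, but only the iteration reason is reported, which is what "stops on the first failure" means in practice.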

Loop — the engine

Call the model, run the tools it asked for, feed the results back, repeat — until it says it's done or a guardrail trips. Every iteration is recorded into a trace, which is exactly what makes verification possible.
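
To make the trace idea concrete, here is a small loop that records every tool call it executes. Everything in it is a hypothetical stand-in: the Model and ToolRunner types, the string-based messages, the "model" stop label, and the scripted model that replaces a real API call. Only the overall shape (call, run tools, feed back, repeat, record) is taken from the post.

```typescript
// Hypothetical shapes: a scripted "model" stands in for a real chat-completions call.
interface ToolCall {
  name: string;
  args: Record<string, string>;
}

interface ModelReply {
  done: boolean;
  toolCalls: ToolCall[];
}

interface TraceEntry {
  iteration: number;
  tool: string;
  args: Record<string, string>;
  result: string;
}

interface LoopResult {
  stoppedBy: "model" | "guardrail";
  trace: TraceEntry[];
}

type Model = (messages: string[]) => Promise<ModelReply>;
type ToolRunner = (call: ToolCall) => Promise<string>;

async function runLoop(
  model: Model,
  messages: string[],
  runTool: ToolRunner,
  iterationLimit: number,
): Promise<LoopResult> {
  const trace: TraceEntry[] = [];
  for (let iteration = 0; iteration < iterationLimit; iteration++) {
    const reply = await model(messages);
    if (reply.done) return { stoppedBy: "model", trace };
    for (const call of reply.toolCalls) {
      const result = await runTool(call);
      // Record what actually ran, independent of anything the model later claims.
      trace.push({ iteration, tool: call.name, args: call.args, result });
      messages.push(`${call.name} -> ${result}`); // feed the result back
    }
  }
  return { stoppedBy: "guardrail", trace };
}

// Scripted model: one tool call, then "done".
const script: ModelReply[] = [
  { done: false, toolCalls: [{ name: "browser_navigate", args: { url: "news.ycombinator.com" } }] },
  { done: true, toolCalls: [] },
];
let turn = 0;
const scriptedModel: Model = async () => script[turn++] ?? { done: true, toolCalls: [] };
const echoTool: ToolRunner = async (call) => `ok:${call.name}`;

const loopRun = runLoop(scriptedModel, ["task"], echoTool, 15);
```

The trace is written by the loop, not by the model, which is why a later verify step can treat it as ground truth.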

Verify — the proof

A function that reads the recorded trace — not the model's answer — to confirm the task actually happened. verifySuccessfulUpvote() looks for a real click on the upvote element that landed back on Hacker News. The agent can't lie about which tools it called.
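
A trace-based check along these lines might look like the sketch below. The function name verifySuccessfulUpvote and the VerifyResult fields are from the post; the TraceEntry shape, the target argument, the urlAfter field, and the exact matching rules are my assumptions, so the real check may well differ.

```typescript
// Hypothetical trace entry; urlAfter is an assumed field holding the page URL
// once the tool call finished.
interface TraceEntry {
  tool: string;
  args: Record<string, string>;
  urlAfter: string;
}

interface VerifyResult {
  passed: boolean;
  reason: string;
  fatal?: boolean;
}

const HN_ORIGIN = "https://news.ycombinator.com/";

function verifySuccessfulUpvote(trace: TraceEntry[]): VerifyResult {
  // Step 1: was an upvote element ever clicked?
  const clicks = trace.filter(
    (entry) => entry.tool === "browser_click" && (entry.args["target"] ?? "").startsWith("up_"),
  );
  if (clicks.length === 0) {
    return { passed: false, reason: "no click on an upvote element in the trace" };
  }
  // Step 2: did any such click land back on Hacker News rather than on a login page?
  const landed = clicks.find(
    (entry) => entry.urlAfter.startsWith(HN_ORIGIN) && !entry.urlAfter.includes("/login"),
  );
  if (landed === undefined) {
    return { passed: false, reason: "upvote click did not land back on Hacker News" };
  }
  return { passed: true, reason: `clicked ${landed.args["target"] ?? ""}` };
}

const honestTrace: TraceEntry[] = [
  { tool: "browser_click", args: { target: "up_43210123" }, urlAfter: "https://news.ycombinator.com/news" },
];
const hallucinatedTrace: TraceEntry[] = [
  { tool: "browser_navigate", args: { url: HN_ORIGIN }, urlAfter: HN_ORIGIN },
];
const honest = verifySuccessfulUpvote(honestTrace);
const hallucinated = verifySuccessfulUpvote(hallucinatedTrace);
```

Note that the function never sees the model's final message. A run that says "Done!" with the second trace fails the check exactly as it should.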

Environment — the world

Where the agent acts. A BrowserSession owns a real browser; the harness opens it, binds the tools to it, and always closes it — even on error. The environment knows nothing about the harness or the model. It's a clean seam.
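
The "always closes it" guarantee is a try/finally at heart. The sketch below assumes a minimal BrowserSession interface with open and close methods and a hypothetical withSession helper; the repo's class wraps a real browser and surely has more to it.

```typescript
// Hypothetical minimal interface; the repo's BrowserSession wraps a real browser.
interface BrowserSession {
  open(): Promise<void>;
  close(): Promise<void>;
}

// Open the session, hand it to `work`, and close it whether `work` resolves or throws.
async function withSession<T>(
  session: BrowserSession,
  work: (session: BrowserSession) => Promise<T>,
): Promise<T> {
  await session.open();
  try {
    return await work(session);
  } finally {
    await session.close();
  }
}

// Fake session that only records lifecycle events.
const lifecycle: string[] = [];
const fakeBrowser: BrowserSession = {
  open: async () => {
    lifecycle.push("open");
  },
  close: async () => {
    lifecycle.push("close");
  },
};

// The work function crashes; the error still propagates, but close() runs first.
const outcome = withSession(fakeBrowser, async () => {
  throw new Error("agent crashed mid-run");
}).then(
  () => "resolved",
  () => "rejected",
);
```

The session interface has no idea a model or a loop exists, which is the "clean seam" in code form.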

Two different things are both called a "harness"

One quick piece of vocabulary, because it trips everyone up. The word points at two unrelated things — keep them apart.

2021 · EleutherAI

Eval harness

Measures how good a model is. Same questions, many models, one scorer. A dataset of cases, each with an expected answer and a trap — the wrong answer most models reflexively give. Ask "capital of Australia?", expect Canberra, watch who falls for Sydney.

2026 · agentic engineering

Agent harness

Doesn't measure the model — it puts the model to work. Tools, a loop, an environment, guardrails, a verify step. You take a fixed model and make it reliably accomplish a real task in the world.

Same word, opposite jobs. One grades the model; the other employs it. This whole post is about the second kind.
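
For contrast, the first kind fits in a few lines. This is a toy scorer in the spirit of the description above, not the API of lm-evaluation-harness: EvalCase, ModelUnderTest, scoreModel, and the two stub models are all hypothetical.

```typescript
// Hypothetical eval case: a question, the expected answer, and the tempting wrong one.
interface EvalCase {
  question: string;
  expected: string;
  trap: string;
}

type ModelUnderTest = (question: string) => string;

interface Score {
  correct: number;
  trapped: number;
  total: number;
}

// Same questions, one scorer: count right answers and answers that fell for the trap.
function scoreModel(model: ModelUnderTest, cases: EvalCase[]): Score {
  const score: Score = { correct: 0, trapped: 0, total: cases.length };
  for (const testCase of cases) {
    const answer = model(testCase.question).trim().toLowerCase();
    if (answer === testCase.expected.toLowerCase()) score.correct += 1;
    else if (answer === testCase.trap.toLowerCase()) score.trapped += 1;
  }
  return score;
}

const evalCases: EvalCase[] = [
  { question: "Capital of Australia?", expected: "Canberra", trap: "Sydney" },
];
const carefulModel: ModelUnderTest = () => "Canberra";
const reflexiveModel: ModelUnderTest = () => "Sydney";
const carefulScore = scoreModel(carefulModel, evalCases);
const reflexiveScore = scoreModel(reflexiveModel, evalCases);
```

Nothing here acts on the world: no tools, no loop, no environment. It only grades.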

The experiment: one model, one prompt

Here's the setup that makes the idea undeniable. The repo is a sequence of git branches. The model is pinned to gpt-3.5-turbo-0613 the entire way — chosen because it's weak. The task never changes: go to Hacker News and upvote the highest-ranked story you haven't voted on. And the system prompt and task text are identical from the first branch to the last.

So every improvement below comes from the harness — never the prompt, never the model. The agent travels from "loops forever and lies about success" to "verified, recovers from a login wall, stops the moment it's done."

Figure: Branch 0 (it lies) to Branch 4 (verified). Same model · same prompt · five branches.

Build the harness, step by step

This is the heart of it. Walk through the five branches: each one adds a small, boring, deterministic capability, and the diff is tiny every time. Watch the agent get reliable without the prompt ever changing.

Branches: 0 Before · 1 Guardrails · 2 Harness · 3 Verify · 4 Recover

Step 1 of 5 · Branch 0

Before the harness

Problem: A raw while loop with nothing watching it. It can spin forever, and it can announce success it never achieved.

raw loop · no guardrails · no verify

while (true) {
  const res = await model.chat(messages, tools)
  // the only exit: the model itself says it's done
  if (res.finish_reason === "stop") return res
  messages.push(...await runToolCalls(res.tool_calls))
}

What you learned: With nothing around it, a weak model has two failure modes you can't even see: it never stops, or it stops and lies. Everything from here is about closing those two gaps.

Step 2 of 5 · Branch 0 → 1

Add context + guardrails

Fix: Bound the loop and keep the context from overflowing, without touching the prompt.

+69 lines · 5 files · prompt unchanged

// agent/4-guardrails.ts
export const defaultGuardrails = combineGuardrails(
  maxIterations(15),   // stop runaway loops
  maxMessages(50),     // stop runaway context
)
// agent/7-index.ts — the ONLY change vs branch 0:
await runLoop(MODEL, messages, tools, defaultGuardrails)

What you learned: One new argument turned an unbounded loop into a bounded one. The prompt and task are byte-for-byte identical to branch 0, so the gain came entirely from the environment.

Step 3 of 5 · Branch 1 → 2

Extract the harness

Fix: Wrap the whole lifecycle in one function so capabilities have a home to grow in.

+56 / −20 · 2 files · harness.ts: 53 lines

// agent/7-index.ts — 20 lines collapse to 2
const result = await runHarness(TASK, MODEL)
printHarnessResult(result)
// runHarness opens the browser, builds tools + context,
// runs the loop, and always closes the browser — even on error.

What you learned: Once the harness is a named thing, it's a thing you can extend. Every step after this is "add a capability to the harness," not "rewrite the script."

Step 4 of 5 · Branch 2 → 3

Add a verify step

Fix: Stop trusting the model's word. Read the trace and prove the upvote actually happened.

+120 / −3 · verify step · maxAttempts: 3

const result = await runHarness(TASK, MODEL, {
  verify: verifySuccessfulUpvote,   // reads the trace, not the prose
  maxAttempts: 3,                   // retry until it really happened
})
// VerifyResult = { passed: boolean, reason: string, fatal?: boolean }

What you learned: Guardrails could never catch a hallucinated "Done!" because the run looked perfectly well-behaved. Verify inspects what tools actually fired. This is the line between a demo and something you'd trust.
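
How verify and maxAttempts interact can be sketched as a small retry wrapper. The VerifyResult shape is the one quoted above; runWithVerify, the Outcome type, and the reading of fatal as "do not retry" are assumptions of mine, since the post does not show how the repo handles them.

```typescript
interface VerifyResult {
  passed: boolean;
  reason: string;
  fatal?: boolean;
}

// Hypothetical summary of a verified run.
interface Outcome {
  passed: boolean;
  attempts: number;
  reason: string;
}

// Run an attempt, verify its trace, and retry until it passes,
// a fatal verdict arrives, or maxAttempts is used up.
async function runWithVerify<Trace>(
  attempt: (attemptNumber: number) => Promise<Trace>,
  verify: (trace: Trace) => VerifyResult,
  maxAttempts: number,
): Promise<Outcome> {
  let reason = "no attempts were made";
  for (let n = 1; n <= maxAttempts; n++) {
    const verdict = verify(await attempt(n));
    reason = verdict.reason;
    if (verdict.passed) return { passed: true, attempts: n, reason };
    // Assumption: a fatal verdict means retrying cannot help, so stop early.
    if (verdict.fatal === true) return { passed: false, attempts: n, reason };
  }
  return { passed: false, attempts: maxAttempts, reason };
}

// Demo: the first attempt clicks nothing, the second one does.
const fakeAttempt = async (n: number): Promise<string[]> =>
  n === 1 ? [] : ["browser_click up_43210123"];
const fakeVerify = (trace: string[]): VerifyResult =>
  trace.length > 0
    ? { passed: true, reason: "upvote click found in trace" }
    : { passed: false, reason: "no upvote click in trace" };

const verifiedRun = runWithVerify(fakeAttempt, fakeVerify, 3);
```

The verdict depends only on the trace each attempt returns, so whatever the model says at the end of a failed attempt never reaches the caller as a success.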

Step 5 of 5 · Branch 3 → 4

Recover from failure

Fix: The agent kept getting bounced to a login wall. Teach the harness to clear it and to stop the instant it succeeds.

+88 / −5 · 4 files · login-handler.ts

// agent/6-harness.ts — success short-circuits the loop
const guardrails = combineGuardrails(
  stopAfterUpvote(() => upvotedStory),  // → stoppedBy: "success"
  defaultGuardrails,
)
// login-handler.ts fills the form, then injects:
// "You are now logged in. Finish the task."

What you learned: This is Hashimoto's rule as code: the agent hit a wall once; now the harness clears that wall every time. The fix wasn't a smarter model. It was a better environment.
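
Both halves of this step can be sketched in a few lines. The name stopAfterUpvote, the "success" label, and the injected sentence are from the post; the GuardrailResult shape, the Page interface, and recoverFromLoginWall are hypothetical, and the login check is a plain URL test chosen only to keep the example small.

```typescript
// Hypothetical result shape; only stopAfterUpvote and the "success" label come from the post.
type GuardrailResult =
  | { stop: false }
  | { stop: true; stoppedBy: "success" | "guardrail"; reason: string };
type Guardrail = () => GuardrailResult;

// Takes a getter rather than a value, so each check sees the latest state.
function stopAfterUpvote(getUpvotedStory: () => string | undefined): Guardrail {
  return () => {
    const story = getUpvotedStory();
    return story === undefined
      ? { stop: false }
      : { stop: true, stoppedBy: "success", reason: `upvoted story ${story}` };
  };
}

// Hypothetical page handle for the login-wall recovery.
interface Page {
  url: string;
  logIn(): void;
}

// If the agent was bounced to a login page, clear it and tell the model to carry on.
function recoverFromLoginWall(page: Page, messages: string[]): boolean {
  if (!page.url.includes("/login")) return false;
  page.logIn();
  messages.push("You are now logged in. Finish the task.");
  return true;
}

// Demo: the guardrail stays quiet until an upvote has been recorded.
let upvotedStory: string | undefined;
const successGuardrail = stopAfterUpvote(() => upvotedStory);
const beforeUpvote = successGuardrail();
upvotedStory = "43210123";
const afterUpvote = successGuardrail();
```

Passing a getter is the design choice worth noticing: the guardrail is built once, before the run starts, yet it still observes an upvote that happens many iterations later.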

Guardrails catch crashes. Verify catches lies.

The most useful thing I took away: guardrails and verify are not the same tool, and you need both. They fail-stop on completely different things.

runs before each model call

Guardrails → structural failures

Deterministic limits: too many iterations, too much context, an unrecoverable login wall. They stop the agent from spinning forever or drowning in its own history. The model gets no vote. Result: stoppedBy: "guardrail".

catches: the agent that never stops

runs after the loop

Verify → wrong answers

Inspects the trace for proof the task happened. It stops the agent from claiming a success it never achieved. The check reads ground truth, so the model can't talk past it. Result: passed, or a retry.

catches: the agent that lies

Without a verify step, a model that hallucinates "Done!" looks identical to one that did the work.

Because verify reads the trace instead of the answer, the whole run is auditable. Here's roughly what a successful attempt prints — the verdict is derived from the tool calls, not from anything the model said:

● run · openai/gpt-3.5-turbo-0613 · attempt 2/3
  → browser_navigate     news.ycombinator.com
  → browser_get_stories  [#1 "Show HN: …" id=43210123 voted=false]
  → browser_click        up_43210123 → now at /news
● stoppedBy: success
Verify: PASS — clicked up_43210123, landed back on /news

Your turn — spot the missing capability

One quick check to make sure it stuck. An agent finishes a run and reports: "Done! I upvoted the top story." But when you look, nothing was clicked — it hallucinated the whole thing. Which harness capability catches this?

Self-check

The agent claims success it never achieved. What stops it shipping that lie?

Options: More guardrails · A verify step · A bigger model · A better prompt

More guardrails: Not quite. Guardrails only see structural trouble, such as too many steps or too much context. A clean run that ends in a lie sails right past them.

A verify step: Exactly. Verify reads the trace and asks "did a real upvote click land?" The model can describe success all it likes. Verify checks what tools actually fired, so the lie fails the check and the harness retries.

A bigger model: Tempting, but no. This whole post is the counter-example: a weak model was made reliable with no upgrade at all. A bigger model can hallucinate "Done!" just as confidently. The missing capability is in the harness.

A better prompt: Nope. The prompt never changed across all five branches, and a perfect prompt still can't tell you, after the fact, whether the agent actually did the work. You need something that inspects the result.

What I took away

I came into this thinking model choice was most of the battle. The repo is a tidy proof that it usually isn't. A genuinely weak model, held fixed, became reliable through five small deterministic additions — none of them prompt-whispering. When my own agents misbehave now, my first question has changed from "is there a better model?" to "what capability is my harness missing?" — a context limit, a guardrail, a verify step, a recovery path. That reframe is the whole lesson, and it's a far more tractable problem than hoping the next model drop fixes everything.

The model was never the bottleneck. The harness was.

Resources & References

Start with the 20-minute talk this whole post is built on, then go deeper with the canonical guides from the teams actually shipping agents at scale.

Primary sources

Talk · AI Engineer World's Fair
Harnesses in AI: A Deep Dive, by Tejas Kumar (IBM). The talk and framing this post is built on. youtube.com/watch?v=C_GG5g38vLU

Code · the repo this post walks through
basically-ai-harness, the branch-by-branch build by Tejas Kumar. Clone it and step through the diffs yourself. github.com/TejasQ/basically-ai-harness

Learn the craft — from the teams building agents

Anthropic · Engineering
Building Effective Agents: the canonical guide to agent design, covering when a workflow beats an agent, the core patterns, and why simple, composable harnesses win over heavy frameworks. anthropic.com/engineering/building-effective-agents

Anthropic · Engineering
Effective Context Engineering for AI Agents: the "context" pillar in depth, with isolation, reduction, and retrieval to keep the window lean as a run grows. anthropic.com/engineering/effective-context-engineering-for-ai-agents

OpenAI · Guides & Resources
A Practical Guide to Building Agents: a 34-page field guide to agent design, orchestration, and guardrails, distilled from real customer deployments. openai.com, A Practical Guide to Building Agents (PDF)

Background & lineage

Mitchell Hashimoto · HashiCorp co-founder
My AI adoption journey: where "harness engineering" was coined. When an agent makes a mistake, engineer the environment so it can't make that mistake again. mitchellh.com/writing/my-ai-adoption-journey

EleutherAI · the other "harness"
lm-evaluation-harness: the 2021 eval-harness lineage the agent harness is so often confused with; the de-facto standard for measuring model quality. github.com/EleutherAI/lm-evaluation-harness
