Agent Harnesses — The Model Was Never the Bottleneck
I watched a talk by Tejas Kumar where he pulled off something that looks like a magic trick and is actually the whole point of agent engineering. He took one of the worst models you can still rent — gpt-3.5-turbo — gave it one fixed prompt and one job, and made it reliably do that job. The trick: he never improved the prompt, and never swapped the model. He improved everything around it.
That "everything around it" finally has a name. An AI harness is, in Tejas's framing, everything except the model weights — the tools the model can call, the context you feed it, the guardrails that bound it, the verify step that checks its work, the loop that ties it together, and the environment it runs in. The model is a rented brain. The harness is the body, the senses, and the supervisor. Let's build one, piece by piece.
- What an "AI harness" is — and why it, not the model, is usually the thing to fix
- How to build one from a bare loop, in five small steps
- Why guardrails and a verify step catch two completely different failures
- A quick self-check to make sure it stuck
What a harness actually is
When an agent fails, the reflex is to reach for a bigger model or a cleverer prompt. The harness mindset says the opposite. Mitchell Hashimoto, who coined "harness engineering" in early 2026, put it as a rule: whenever an agent makes a mistake, you engineer the environment so it can't make that mistake again. The fix is rarely "try harder"; it's "what capability is missing?" Those capabilities are the six parts below, each described as it appears in this repo.
Tools — the verbs
The actions the model is allowed to take. Here it gets seven browser tools — browser_navigate, browser_get_stories, browser_click, browser_fill and friends — each bound to a live browser session the harness owns, not a global they reach into. Without tools, the model can only describe; with them, it can act.
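To make "bound to a session, not a global" concrete, here is a minimal sketch. The `Session` and `Tool` shapes, `createTools`, and `makeSession` are hypothetical stand-ins invented for illustration, not the repo's actual API, and the calls are synchronous only to keep the example short (real browser calls would be awaited).

```typescript
// Hypothetical stand-in for a live browser session (not the repo's API).
interface Session {
  url: string;
  goto(url: string): void;
}

interface Tool {
  name: string;
  description: string;
  run(args: Record<string, string>): string;
}

// Each tool closes over the session it was created with,
// so two harness runs can never step on each other's browser.
function createTools(session: Session): Tool[] {
  return [
    {
      name: "browser_navigate",
      description: "Open a URL in the bound session",
      run: (args) => {
        session.goto(args.url);
        return `now at ${session.url}`;
      },
    },
  ];
}

function makeSession(): Session {
  return {
    url: "about:blank",
    goto(url: string) {
      this.url = url;
    },
  };
}

const sessionA = makeSession();
const sessionB = makeSession();
const toolsA = createTools(sessionA);

// Only sessionA moves; sessionB is untouched because no tool is bound to it.
toolsA[0].run({ url: "https://news.ycombinator.com" });
console.log(sessionA.url, sessionB.url);
```

Passing the session in, rather than importing a shared one, is also what lets the harness swap in a fake session for tests.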
Context — what it sees each turn
The messages handed to the model. createContext(task) is a minimal two-message array: a short system prompt plus the task. As the run grows, trimContext() keeps the system prompt and original task pinned and drops the oldest tool messages, so the window never overflows.
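A rough sketch of that pinning behaviour follows. The function names match the ones mentioned above, but the bodies, the `Message` shape, and the placeholder system prompt are assumptions. As a simplification, this version drops the oldest unpinned messages of any role rather than only tool messages.

```typescript
type Role = "system" | "user" | "assistant" | "tool";

interface Message {
  role: Role;
  content: string;
}

// Placeholder text; the real system prompt is not shown in this post.
const SYSTEM_PROMPT = "You are a browser agent.";

// The minimal two-message starting context: system prompt plus task.
function createContext(task: string): Message[] {
  return [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: task },
  ];
}

// Keep the first two messages pinned and drop the oldest of the rest
// until the array fits within maxMessages.
function trimContext(messages: Message[], maxMessages: number): Message[] {
  const pinned = messages.slice(0, 2);
  const rest = messages.slice(2);
  const room = Math.max(0, maxMessages - pinned.length);
  // slice(-0) would return everything, so handle "no room" explicitly.
  const kept = room === 0 ? [] : rest.slice(-room);
  return [...pinned, ...kept];
}

const ctx = createContext("Upvote the top story");
for (let i = 1; i <= 5; i++) {
  ctx.push({ role: "tool", content: `result ${i}` });
}
const trimmed = trimContext(ctx, 4);
console.log(trimmed.map((m) => m.content));
```

A production version would also need to keep each tool result next to the assistant message that requested it, since chat APIs generally reject orphaned tool messages.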
Guardrails — the limits
Deterministic checks that fire before each model call. combineGuardrails(maxIterations(15), maxMessages(50)) runs each in order and stops on the first failure. No model judgment — just plain code saying "this has gone too far."
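One way such a combinator could look is sketched below. The names `combineGuardrails`, `maxIterations`, and `maxMessages` come from the text above; the `RunState` and `GuardrailResult` shapes and all function bodies are assumptions for illustration.

```typescript
// Assumed shape of what a guardrail gets to inspect before each model call.
interface RunState {
  iteration: number;
  messageCount: number;
}

interface GuardrailResult {
  ok: boolean;
  reason?: string;
}

type Guardrail = (state: RunState) => GuardrailResult;

const maxIterations = (limit: number): Guardrail => (state) =>
  state.iteration >= limit
    ? { ok: false, reason: `hit maxIterations(${limit})` }
    : { ok: true };

const maxMessages = (limit: number): Guardrail => (state) =>
  state.messageCount >= limit
    ? { ok: false, reason: `hit maxMessages(${limit})` }
    : { ok: true };

// Run each guardrail in order and stop on the first failure.
function combineGuardrails(...guards: Guardrail[]): Guardrail {
  return (state) => {
    for (const guard of guards) {
      const result = guard(state);
      if (!result.ok) return result;
    }
    return { ok: true };
  };
}

const guardrails = combineGuardrails(maxIterations(15), maxMessages(50));
const healthy = guardrails({ iteration: 3, messageCount: 10 });
const runaway = guardrails({ iteration: 15, messageCount: 60 });
console.log(healthy, runaway);
```

Because every guardrail is a plain function of the run state, each one can be unit-tested without a model or a browser.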
Loop — the engine
Call the model, run the tools it asked for, feed the results back, repeat — until it says it's done or a guardrail trips. Every iteration is recorded into a trace, which is exactly what makes verification possible.
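The loop-plus-trace idea can be sketched with a scripted fake model. Everything here (`runLoop`'s signature, `TraceEntry`, the `Model` type) is a hypothetical simplification, and the loop is synchronous for brevity where real model and tool calls would be awaited.

```typescript
interface ToolCall {
  name: string;
  args: Record<string, string>;
}

interface ModelReply {
  text: string;
  toolCalls: ToolCall[];
}

interface TraceEntry {
  iteration: number;
  tool: string;
  args: Record<string, string>;
  result: string;
}

type Model = (history: string[]) => ModelReply;
type ToolFn = (args: Record<string, string>) => string;

interface LoopResult {
  stoppedBy: "model" | "guardrail";
  trace: TraceEntry[];
  answer: string;
}

function runLoop(model: Model, tools: Map<string, ToolFn>, maxIterations: number): LoopResult {
  const history: string[] = [];
  const trace: TraceEntry[] = [];
  for (let iteration = 0; iteration < maxIterations; iteration++) {
    const reply = model(history);
    // No tool calls means the model considers itself done.
    if (reply.toolCalls.length === 0) {
      return { stoppedBy: "model", trace, answer: reply.text };
    }
    for (const call of reply.toolCalls) {
      const tool = tools.get(call.name);
      const result = tool ? tool(call.args) : `unknown tool: ${call.name}`;
      // The trace records what actually ran, independent of what the model later claims.
      trace.push({ iteration, tool: call.name, args: call.args, result });
      history.push(`${call.name} -> ${result}`);
    }
  }
  // Iteration budget exhausted: a deterministic stop, not a model decision.
  return { stoppedBy: "guardrail", trace, answer: "" };
}

// Scripted stand-in for a model: one tool call, then "done".
const scriptedModel: Model = (history) =>
  history.length === 0
    ? { text: "", toolCalls: [{ name: "browser_navigate", args: { url: "https://news.ycombinator.com" } }] }
    : { text: "Done.", toolCalls: [] };

const loopTools = new Map<string, ToolFn>([
  ["browser_navigate", (args) => `now at ${args.url}`],
]);

const loopResult = runLoop(scriptedModel, loopTools, 15);
console.log(loopResult.stoppedBy, loopResult.trace.length);
```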
Verify — the proof
A function that reads the recorded trace — not the model's answer — to confirm the task actually happened. verifySuccessfulUpvote() looks for a real click on the upvote element that landed back on Hacker News. The agent can't lie about which tools it called.
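A trace-based check along these lines might look like the sketch below. It is not the repo's `verifySuccessfulUpvote()`; the function name `verifyUpvoteSketch`, the `TraceEntry` fields, and the `up_` prefix test are assumptions made for illustration. Only the `VerifyResult` shape is taken from the post.

```typescript
// Assumed trace shape: which tool ran, on what element, and where the browser ended up.
interface TraceEntry {
  tool: string;
  target: string;
  urlAfter: string;
}

interface VerifyResult {
  passed: boolean;
  reason: string;
  fatal?: boolean;
}

function verifyUpvoteSketch(trace: TraceEntry[]): VerifyResult {
  // Look only at recorded tool calls; the model's final message is never consulted.
  const click = trace.find(
    (entry) => entry.tool === "browser_click" && entry.target.startsWith("up_"),
  );
  if (!click) {
    return { passed: false, reason: "no click on an upvote element in the trace" };
  }
  if (!click.urlAfter.includes("news.ycombinator.com")) {
    return { passed: false, reason: `clicked ${click.target} but ended at ${click.urlAfter}` };
  }
  return { passed: true, reason: `clicked ${click.target}, landed back on Hacker News` };
}

const honestRun = verifyUpvoteSketch([
  { tool: "browser_navigate", target: "", urlAfter: "https://news.ycombinator.com/news" },
  { tool: "browser_click", target: "up_43210123", urlAfter: "https://news.ycombinator.com/news" },
]);

// A run where the model said "Done!" but never clicked anything.
const hallucinatedRun = verifyUpvoteSketch([
  { tool: "browser_navigate", target: "", urlAfter: "https://news.ycombinator.com/news" },
]);

console.log(honestRun.reason);
console.log(hallucinatedRun.reason);
```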
Environment — the world
Where the agent acts. A BrowserSession owns a real browser; the harness opens it, binds the tools to it, and always closes it — even on error. The environment knows nothing about the harness or the model. It's a clean seam.
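The "always closes it, even on error" guarantee is usually a `try`/`finally`. The sketch below assumes a hypothetical `withSession` helper and a fake session that only counts opens and closes. It is synchronous for brevity; a real browser session would be opened and closed with `await`.

```typescript
interface Session {
  close(): void;
}

let opened = 0;
let closed = 0;

// Fake environment: knows nothing about models or harnesses, only open and close.
function openFakeSession(): Session {
  opened += 1;
  return {
    close: () => {
      closed += 1;
    },
  };
}

// The harness owns the session's lifetime: open, do the work, always close.
function withSession<T>(open: () => Session, work: (session: Session) => T): T {
  const session = open();
  try {
    return work(session);
  } finally {
    session.close();
  }
}

const answer = withSession(openFakeSession, () => "done");

let caught = "";
try {
  withSession(openFakeSession, () => {
    throw new Error("login wall");
  });
} catch (error) {
  caught = error instanceof Error ? error.message : String(error);
}

console.log(answer, caught, opened, closed);
```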
Two different things are both called a "harness"
One quick piece of vocabulary, because it trips everyone up. The word points at two unrelated things — keep them apart.
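An eval harness is a test rig that grades a model. Each case pairs a question with an expected answer and a trap, the wrong answer most models reflexively give. Ask "capital of Australia?", expect Canberra, watch who falls for Sydney. An agent harness is the runtime that puts a model to work: the tools, context, guardrails, loop, verify step, and environment described above. Same word, opposite jobs. One grades the model; the other employs it. This whole post is about the second kind.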
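For contrast, the grading kind of harness fits in a few lines. This is a toy sketch with an invented `EvalCase` shape and substring matching, not how any particular eval framework scores answers.

```typescript
interface EvalCase {
  question: string;
  expected: string;
  trap: string;
}

type Grade = "correct" | "trap" | "other";

// Grade one answer: did the model give the expected answer, fall for the trap, or neither?
function grade(evalCase: EvalCase, answer: string): Grade {
  const normalized = answer.trim().toLowerCase();
  if (normalized.includes(evalCase.expected.toLowerCase())) return "correct";
  if (normalized.includes(evalCase.trap.toLowerCase())) return "trap";
  return "other";
}

const capitalCase: EvalCase = {
  question: "capital of Australia?",
  expected: "Canberra",
  trap: "Sydney",
};

console.log(grade(capitalCase, "The capital is Canberra."));
console.log(grade(capitalCase, "Sydney"));
```

Nothing here lets the model act; it only scores what the model said, which is exactly the difference from an agent harness.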
The experiment: one model, one prompt
Here's the setup that makes the idea undeniable. The repo is a sequence of git branches. The model is pinned to gpt-3.5-turbo-0613 the entire way — chosen because it's weak. The task never changes: go to Hacker News and upvote the highest-ranked story you haven't voted on. And the system prompt and task text are identical from the first branch to the last.
So every improvement below comes from the harness — never the prompt, never the model. The agent travels from "loops forever and lies about success" to "verified, recovers from a login wall, stops the moment it's done."
Build the harness, step by step
This is the heart of it. Walk through the five branches: each one adds a small, boring, deterministic capability, and the diff is tiny every time. Watch the agent get reliable without the prompt ever changing.
A while loop with nothing watching it. It can spin forever, and it can announce success it never achieved.

    while (true) {
      const res = await model.chat(messages, tools)
      // the only exit: the model itself says it's done
      if (res.finish_reason === "stop") return res
      messages.push(...await runToolCalls(res.tool_calls))
    }

    // agent/4-guardrails.ts
    export const defaultGuardrails = combineGuardrails(
      maxIterations(15), // stop runaway loops
      maxMessages(50), // stop runaway context
    )
    // agent/7-index.ts (the ONLY change vs branch 0)
    await runLoop(MODEL, messages, tools, defaultGuardrails)

    // agent/7-index.ts (20 lines collapse to 2)
    const result = await runHarness(TASK, MODEL)
    printHarnessResult(result)
    // runHarness opens the browser, builds tools + context,
    // runs the loop, and always closes the browser, even on error.

    const result = await runHarness(TASK, MODEL, {
      verify: verifySuccessfulUpvote, // reads the trace, not the prose
      maxAttempts: 3, // retry until it really happened
    })
    // VerifyResult = { passed: boolean, reason: string, fatal?: boolean }

    // agent/6-harness.ts (success short-circuits the loop)
    const guardrails = combineGuardrails(
      stopAfterUpvote(() => upvotedStory), // → stoppedBy: "success"
      defaultGuardrails,
    )
    // login-handler.ts fills the form, then injects:
    // "You are now logged in. Finish the task."

Guardrails catch crashes. Verify catches lies.
The most useful thing I took away: guardrails and verify are not the same tool, and you need both. They fail-stop on completely different things.
A tripped guardrail ends the run with stoppedBy: "guardrail". A verify check ends in passed, or a retry. Without a verify step, a model that hallucinates "Done!" looks identical to one that did the work.
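A verify-and-retry wrapper could be sketched as follows. `runWithVerify`, the `Attempt` shape, and the string-based trace are hypothetical; treating `fatal` as "stop retrying" is an assumption, since the post only shows that the field exists.

```typescript
interface VerifyResult {
  passed: boolean;
  reason: string;
  fatal?: boolean;
}

interface Attempt {
  stoppedBy: "model" | "guardrail" | "success";
  trace: string[];
}

interface HarnessResult {
  attempts: number;
  last: Attempt;
  verdict: VerifyResult;
}

// Run an attempt, verify its trace, and retry until it passes,
// fails fatally, or the attempt budget runs out.
function runWithVerify(
  attempt: (n: number) => Attempt,
  verify: (trace: string[]) => VerifyResult,
  maxAttempts: number,
): HarnessResult {
  if (maxAttempts < 1) throw new Error("maxAttempts must be at least 1");
  let n = 0;
  while (true) {
    n += 1;
    const last = attempt(n);
    const verdict = verify(last.trace);
    if (verdict.passed || verdict.fatal || n >= maxAttempts) {
      return { attempts: n, last, verdict };
    }
  }
}

// Attempt 1: the model stops and claims success, but the trace is empty.
// Attempt 2: the click really happened.
const scriptedAttempts: Attempt[] = [
  { stoppedBy: "model", trace: [] },
  { stoppedBy: "success", trace: ["browser_click up_43210123"] },
];

const verifyClick = (trace: string[]): VerifyResult =>
  trace.some((line) => line.startsWith("browser_click up_"))
    ? { passed: true, reason: "found an upvote click in the trace" }
    : { passed: false, reason: "no upvote click in the trace" };

const outcome = runWithVerify(
  (n) => scriptedAttempts[Math.min(n, scriptedAttempts.length) - 1],
  verifyClick,
  3,
);
console.log(outcome.attempts, outcome.verdict.reason);
```

In this scripted run no guardrail ever trips, yet the first attempt is still rejected, which is the point: a run can stop cleanly and still be wrong.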
Because verify reads the trace instead of the answer, the whole run is auditable. Here's roughly what a successful attempt prints — the verdict is derived from the tool calls, not from anything the model said:
    up_43210123 → now at /news
    ● stoppedBy: success
    Verify: PASS - clicked up_43210123, landed back on /news

Your turn: spot the missing capability
One quick check to make sure it stuck. An agent finishes a run and reports: "Done! I upvoted the top story." But when you look, nothing was clicked — it hallucinated the whole thing. Which harness capability catches this?
What I took away
I came into this thinking model choice was most of the battle. The repo is a tidy proof that it usually isn't. A genuinely weak model, held fixed, became reliable through five small deterministic additions — none of them prompt-whispering. When my own agents misbehave now, my first question has changed from "is there a better model?" to "what capability is my harness missing?" — a context limit, a guardrail, a verify step, a recovery path. That reframe is the whole lesson, and it's a far more tractable problem than hoping the next model drop fixes everything.
The model was never the bottleneck. The harness was.
Resources & References
Start with the 20-minute talk this whole post is built on, then go deeper with the canonical guides from the teams actually shipping agents at scale.
- Talk · AI Engineer World's Fair · Harnesses in AI: A Deep Dive. Tejas Kumar (IBM). The talk and framing this post is built on. youtube.com/watch?v=C_GG5g38vLU
- Code · the repo this post walks through · basically-ai-harness. The branch-by-branch build, by Tejas Kumar. Clone it and step through the diffs yourself. github.com/TejasQ/basically-ai-harness
- Anthropic · Engineering · Building Effective Agents. The canonical guide to agent design: when a workflow beats an agent, the core patterns, and why simple, composable harnesses win over heavy frameworks. anthropic.com/engineering/building-effective-agents
- Anthropic · Engineering · Effective Context Engineering for AI Agents. The "context" pillar in depth: isolation, reduction, and retrieval to keep the window lean as a run grows. anthropic.com/engineering/effective-context-engineering-for-ai-agents
- OpenAI · Guides & Resources · A Practical Guide to Building Agents. A 34-page field guide to agent design, orchestration, and guardrails, distilled from real customer deployments. openai.com, A Practical Guide to Building Agents (PDF)
- Mitchell Hashimoto · HashiCorp co-founder · My AI adoption journey. Where "harness engineering" was coined: when an agent makes a mistake, engineer the environment so it can't make that mistake again. mitchellh.com/writing/my-ai-adoption-journey
- EleutherAI · the other "harness" · lm-evaluation-harness. The 2021 eval-harness lineage the agent harness is so often confused with; the de-facto standard for measuring model quality. github.com/EleutherAI/lm-evaluation-harness

