AI & ML
June 13, 2026
11 min read

Agent Harnesses — The Model Was Never the Bottleneck

Tejas Kumar took one of the worst models you can still rent — gpt-3.5-turbo — gave it one fixed prompt and one task (upvote a story on Hacker News), and made it reliable without ever touching the prompt or swapping the model. He did it across five git branches by engineering everything around the model: tools, context, guardrails, a verify step, a login handler. This is a hands-on, build-it-with-me walk through that repo and the idea underneath it — an AI harness is everything except the model weights.

Author: Giridhar Chettiar
Contents
  1. What a harness actually is
  2. Two different things are both called a "harness"
  3. The experiment: one model, one prompt
  4. Build the harness, step by step
  5. Guardrails catch crashes. Verify catches lies.
  6. Your turn: spot the missing capability
  7. What I took away
  8. Resources & References


I watched a talk by Tejas Kumar where he pulled off something that looks like a magic trick and is actually the whole point of agent engineering. He took one of the worst models you can still rent — gpt-3.5-turbo — gave it one fixed prompt and one job, and made it reliably do that job. The trick: he never improved the prompt, and never swapped the model. He improved everything around it.

That "everything around it" finally has a name. An AI harness is, in Tejas's framing, everything except the model weights — the tools the model can call, the context you feed it, the guardrails that bound it, the verify step that checks its work, the loop that ties it together, and the environment it runs in. The model is a rented brain. The harness is the body, the senses, and the supervisor. Let's build one, piece by piece.

  • Source: Tejas Kumar (IBM) · AI Engineer World's Fair
  • Repo: TejasQ/basically-ai-harness
  • Model: gpt-3.5-turbo-0613
  • Stack: TypeScript · OpenRouter · Playwright

What you'll learn
  • What an "AI harness" is — and why it, not the model, is usually the thing to fix
  • How to build one from a bare loop, in five small steps
  • Why guardrails and a verify step catch two completely different failures
  • A quick self-check to make sure it stuck

What a harness actually is

When an agent fails, the reflex is to reach for a bigger model or a cleverer prompt. The harness mindset says the opposite. Mitchell Hashimoto, who coined "harness engineering" in early 2026, put it as a rule: whenever an agent makes a mistake, you engineer the environment so it can't make that mistake again. The fix is rarely "try harder"; it's "what capability is missing?" Those capabilities are the six parts below, each described as it appears in this repo.

Anatomy of a harness: the model (the rented brain) sits in the middle; the harness is everything except the model weights.

Tools — the verbs

The actions the model is allowed to take. Here it gets seven browser tools — browser_navigate, browser_get_stories, browser_click, browser_fill and friends — each bound to a live browser session the harness owns, not a global they reach into. Without tools, the model can only describe; with them, it can act.
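To make "bound to a session the harness owns" concrete, here is a minimal sketch. Only the tool names browser_navigate and browser_click come from the post; the Session shape, bindTools, and callTool are hypothetical names made up for illustration, and an in-memory object stands in for the real browser so the snippet runs on its own.

```typescript
// Hypothetical sketch: only the two tool names come from the post.
interface Session {
  url: string;
  clicked: string[];
}

interface Tool {
  name: string;
  description: string;
  run(args: Record<string, string>): string;
}

// Each tool closes over the session it was given; nothing reaches for a global.
function bindTools(session: Session): Tool[] {
  return [
    {
      name: "browser_navigate",
      description: "Go to a URL",
      run: (args) => {
        session.url = args.url ?? session.url;
        return `now at ${session.url}`;
      },
    },
    {
      name: "browser_click",
      description: "Click an element by id",
      run: (args) => {
        const id = args.id ?? "";
        session.clicked.push(id);
        return `clicked ${id}`;
      },
    },
  ];
}

// Dispatch a call by name; an unknown tool yields an error string, not a crash.
function callTool(tools: Tool[], name: string, args: Record<string, string>): string {
  const tool = tools.find((t) => t.name === name);
  return tool ? tool.run(args) : `error: unknown tool ${name}`;
}

const session: Session = { url: "about:blank", clicked: [] };
const tools = bindTools(session);
console.log(callTool(tools, "browser_navigate", { url: "https://news.ycombinator.com" }));
console.log(callTool(tools, "browser_click", { id: "up_43210123" }));
```

Because the tools only see the session passed to bindTools, two runs with two sessions cannot interfere with each other, which is the property the "not a global" remark is pointing at.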

Context — what it sees each turn

The messages handed to the model. createContext(task) is a minimal two-message array: a short system prompt plus the task. As the run grows, trimContext() keeps the system prompt and original task pinned and drops the oldest tool messages, so the window never overflows.
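A rough sketch of that pinning behaviour follows. The function names createContext and trimContext are from the post, but the Message shape, the placeholder system prompt, and the maxMessages parameter on trimContext are assumptions; the repo's real signatures may differ.

```typescript
type Role = "system" | "user" | "assistant" | "tool";

interface Message {
  role: Role;
  content: string;
}

// Placeholder text, not the repo's actual system prompt.
const SYSTEM_PROMPT = "You are a browser agent.";

// Minimal two-message context: system prompt plus the task.
function createContext(task: string): Message[] {
  return [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: task },
  ];
}

// Keep the first two messages pinned and drop the oldest tool messages
// until the array fits. The maxMessages parameter is an assumption.
function trimContext(messages: Message[], maxMessages: number): Message[] {
  const pinned = messages.slice(0, 2);
  const rest = messages.slice(2);
  while (pinned.length + rest.length > maxMessages) {
    const oldestTool = rest.findIndex((m) => m.role === "tool");
    if (oldestTool === -1) break; // nothing left that is safe to drop
    rest.splice(oldestTool, 1);
  }
  return [...pinned, ...rest];
}

const ctx = createContext("Upvote the top story");
ctx.push({ role: "assistant", content: "calling browser_navigate" });
ctx.push({ role: "tool", content: "now at /news" });
ctx.push({ role: "tool", content: "stories: ..." });

const trimmed = trimContext(ctx, 4);
console.log(trimmed.map((m) => m.role).join(", ")); // system, user, assistant, tool
```

One caveat if you adapt this: chat APIs with structured tool calls generally expect every tool call to keep its matching tool result, so a production trim would likely drop the pair together rather than the tool message alone.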

Guardrails — the limits

Deterministic checks that fire before each model call. combineGuardrails(maxIterations(15), maxMessages(50)) runs each in order and stops on the first failure. No model judgment — just plain code saying "this has gone too far."
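One plausible way to write those guardrails is as small functions over the loop's state. The names combineGuardrails, maxIterations, and maxMessages are from the post; the LoopState and GuardrailResult shapes, and the exact comparison used at each limit, are assumptions for the sake of a runnable sketch.

```typescript
// Assumed shapes; the repo's real types may differ.
interface LoopState {
  iteration: number;
  messageCount: number;
}

interface GuardrailResult {
  ok: boolean;
  reason?: string;
}

type Guardrail = (state: LoopState) => GuardrailResult;

// Trip once the loop has used up its iteration budget.
const maxIterations = (limit: number): Guardrail => (state) =>
  state.iteration >= limit
    ? { ok: false, reason: `max iterations (${limit}) reached` }
    : { ok: true };

// Trip once the context holds more messages than allowed.
const maxMessages = (limit: number): Guardrail => (state) =>
  state.messageCount > limit
    ? { ok: false, reason: `max messages (${limit}) exceeded` }
    : { ok: true };

// Run each guardrail in order and stop on the first failure.
function combineGuardrails(...guardrails: Guardrail[]): Guardrail {
  return (state) => {
    for (const check of guardrails) {
      const result = check(state);
      if (!result.ok) return result;
    }
    return { ok: true };
  };
}

const defaultGuardrails = combineGuardrails(maxIterations(15), maxMessages(50));

console.log(defaultGuardrails({ iteration: 3, messageCount: 10 }));
console.log(defaultGuardrails({ iteration: 15, messageCount: 10 }));
```

Nothing here consults the model, which is the point: the check is plain arithmetic on counters the harness already has.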

Loop — the engine

Call the model, run the tools it asked for, feed the results back, repeat — until it says it's done or a guardrail trips. Every iteration is recorded into a trace, which is exactly what makes verification possible.
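The loop described above can be sketched in a few lines. Everything in this snippet is illustrative: the Model and ToolRunner types, the TraceEntry shape, and the inlined iteration limit are assumptions, and a scripted function stands in for the real model call.

```typescript
interface ToolCall {
  name: string;
  args: Record<string, string>;
}

interface ModelReply {
  done: boolean;
  text: string;
  toolCalls: ToolCall[];
}

interface TraceEntry {
  iteration: number;
  tool: string;
  args: Record<string, string>;
  result: string;
}

interface LoopResult {
  stoppedBy: "model" | "guardrail";
  trace: TraceEntry[];
  answer: string;
}

type Model = (history: string[]) => Promise<ModelReply>;
type ToolRunner = (call: ToolCall) => Promise<string>;

async function runLoop(model: Model, runTool: ToolRunner, maxIterations: number): Promise<LoopResult> {
  const history: string[] = [];
  const trace: TraceEntry[] = [];
  for (let iteration = 0; ; iteration++) {
    // Guardrail check happens before the model is called.
    if (iteration >= maxIterations) return { stoppedBy: "guardrail", trace, answer: "" };
    const reply = await model(history);
    if (reply.done) return { stoppedBy: "model", trace, answer: reply.text };
    for (const call of reply.toolCalls) {
      const result = await runTool(call);
      trace.push({ iteration, tool: call.name, args: call.args, result }); // record what really ran
      history.push(`${call.name} -> ${result}`); // feed the result back
    }
  }
}

// Stand-in model: asks for one tool call, then claims it is done.
const scriptedModel: Model = async (history) =>
  history.length === 0
    ? { done: false, text: "", toolCalls: [{ name: "browser_navigate", args: { url: "news.ycombinator.com" } }] }
    : { done: true, text: "Done!", toolCalls: [] };

const echoTool: ToolRunner = async (call) => `ran ${call.name}`;

runLoop(scriptedModel, echoTool, 15).then((result) => {
  console.log(result.stoppedBy, result.trace.length);
});
```

The trace is written by the harness at the moment a tool runs, so it reflects what happened regardless of what the model later says in its answer.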

Verify — the proof

A function that reads the recorded trace — not the model's answer — to confirm the task actually happened. verifySuccessfulUpvote() looks for a real click on the upvote element that landed back on Hacker News. The agent can't lie about which tools it called.
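As an illustration, a trace-based check might look like the sketch below. The function name verifySuccessfulUpvote and the VerifyResult fields come from the post; the TraceEntry shape, the "up_" id prefix test, and the landing-URL test are assumptions about how such a check could be written, not the repo's code.

```typescript
// Assumed trace shape: which tool ran, with what arguments, and where the page ended up.
interface TraceEntry {
  tool: string;
  args: Record<string, string>;
  resultUrl: string;
}

interface VerifyResult {
  passed: boolean;
  reason: string;
  fatal?: boolean;
}

const HN_FRONT_PAGE = "https://news.ycombinator.com/news";

// Reads only the trace. The model's final answer is not an input at all.
function verifySuccessfulUpvote(trace: TraceEntry[]): VerifyResult {
  const click = trace.find(
    (entry) => entry.tool === "browser_click" && (entry.args.id ?? "").startsWith("up_"),
  );
  if (!click) {
    return { passed: false, reason: "no click on an upvote element in the trace" };
  }
  if (!click.resultUrl.startsWith(HN_FRONT_PAGE)) {
    return { passed: false, reason: `click landed on ${click.resultUrl}, not back on /news` };
  }
  return { passed: true, reason: `clicked ${click.args.id}, landed back on /news` };
}

// A run that only navigated, whatever the model claimed afterwards.
const lyingRun: TraceEntry[] = [
  { tool: "browser_navigate", args: { url: HN_FRONT_PAGE }, resultUrl: HN_FRONT_PAGE },
];

// A run where the upvote click really happened.
const honestRun: TraceEntry[] = [
  ...lyingRun,
  { tool: "browser_click", args: { id: "up_43210123" }, resultUrl: HN_FRONT_PAGE },
];

console.log(verifySuccessfulUpvote(lyingRun));
console.log(verifySuccessfulUpvote(honestRun));
```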

Environment — the world

Where the agent acts. A BrowserSession owns a real browser; the harness opens it, binds the tools to it, and always closes it — even on error. The environment knows nothing about the harness or the model. It's a clean seam.
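The "always closes it" guarantee is ordinary try/finally. Below is a minimal sketch with a stand-in BrowserSession class; the real one wraps a browser, and the open, close, and withSession names here are assumptions used to show the pattern.

```typescript
// Stand-in for the real session: it only tracks whether it is open.
class BrowserSession {
  isOpen = false;

  async open(): Promise<void> {
    this.isOpen = true;
  }

  async close(): Promise<void> {
    this.isOpen = false;
  }
}

// Open the session, hand it to the work function, and close it no matter what.
async function withSession<T>(
  session: BrowserSession,
  work: (s: BrowserSession) => Promise<T>,
): Promise<T> {
  await session.open();
  try {
    return await work(session);
  } finally {
    await session.close(); // runs on success and on error
  }
}

const demoSession = new BrowserSession();
withSession(demoSession, async () => {
  throw new Error("tool crashed");
}).catch((err: Error) => {
  console.log(`failed: ${err.message}, session open: ${demoSession.isOpen}`);
});
```

The session class has no reference to the harness or the model, which is what keeps the seam clean: you could swap in a different environment without touching the loop.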

Two different things are both called a "harness"

One quick piece of vocabulary, because it trips everyone up. The word points at two unrelated things — keep them apart.

2021 · EleutherAI
Eval harness
Measures how good a model is. Same questions, many models, one scorer. A dataset of cases, each with an expected answer and a trap — the wrong answer most models reflexively give. Ask "capital of Australia?", expect Canberra, watch who falls for Sydney.
2026 · agentic engineering
Agent harness
Doesn't measure the model — it puts the model to work. Tools, a loop, an environment, guardrails, a verify step. You take a fixed model and make it reliably accomplish a real task in the world.

Same word, opposite jobs. One grades the model; the other employs it. This whole post is about the second kind.
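For contrast, the eval-harness side of that comparison is small enough to sketch. This is an illustration of the idea only, not lm-evaluation-harness code: the EvalCase shape, scoreModel, and the two stand-in "models" are all hypothetical, and the single case is the one from the example above.

```typescript
interface EvalCase {
  question: string;
  expected: string;
  trap: string; // the wrong answer models tend to give by reflex
}

type Model = (question: string) => string;

interface Score {
  correct: number;
  trapped: number;
  total: number;
}

const cases: EvalCase[] = [
  { question: "Capital of Australia?", expected: "Canberra", trap: "Sydney" },
];

// Same questions, one scorer: count right answers and how often the trap was taken.
function scoreModel(model: Model, evalCases: EvalCase[]): Score {
  let correct = 0;
  let trapped = 0;
  for (const evalCase of evalCases) {
    const answer = model(evalCase.question).toLowerCase();
    if (answer.includes(evalCase.expected.toLowerCase())) correct++;
    else if (answer.includes(evalCase.trap.toLowerCase())) trapped++;
  }
  return { correct, trapped, total: evalCases.length };
}

// Stand-ins for real models.
const carefulModel: Model = () => "Canberra";
const reflexiveModel: Model = () => "Sydney";

console.log("careful:", scoreModel(carefulModel, cases));
console.log("reflexive:", scoreModel(reflexiveModel, cases));
```

Note what is missing compared with everything else in this post: no tools, no loop, no environment. The eval harness only asks and scores.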

The experiment: one model, one prompt

Here's the setup that makes the idea undeniable. The repo is a sequence of git branches. The model is pinned to gpt-3.5-turbo-0613 the entire way — chosen because it's weak. The task never changes: go to Hacker News and upvote the highest-ranked story you haven't voted on. And the system prompt and task text are identical from the first branch to the last.

So every improvement below comes from the harness — never the prompt, never the model. The agent travels from "loops forever and lies about success" to "verified, recovers from a login wall, stops the moment it's done."

Branch 0 (it lies) → Branch 4 (verified): same model, same prompt, five branches.

Build the harness, step by step

This is the heart of it. Walk through the five branches: each one adds a small, boring, deterministic capability, and the diff is tiny every time. Watch the agent get reliable without the prompt ever changing.

Step 1 of 5 · Branch 0
Before the harness
Problem: A raw while loop with nothing watching it. It can spin forever, and it can announce success it never achieved.
raw loop · no guardrails · no verify
while (true) {
  const res = await model.chat(messages, tools)
  // the only exit: the model itself says it's done
  if (res.finish_reason === "stop") return res
  messages.push(...await runToolCalls(res.tool_calls))
}
What you learned: With nothing around it, a weak model has two failure modes you can't even see: it never stops, or it stops and lies. Everything from here is about closing those two gaps.
Step 2 of 5 · Branch 0 → 1
Add context + guardrails
Fix: Bound the loop and keep the context from overflowing, without touching the prompt.
+69 lines · 5 files · prompt unchanged
// agent/4-guardrails.ts
export const defaultGuardrails = combineGuardrails(
  maxIterations(15),   // stop runaway loops
  maxMessages(50),     // stop runaway context
)
// agent/7-index.ts — the ONLY change vs branch 0:
await runLoop(MODEL, messages, tools, defaultGuardrails)
What you learned: One new argument turned an unbounded loop into a bounded one. The prompt and task are byte-for-byte identical to branch 0; the gain came entirely from the environment.
Step 3 of 5 · Branch 1 → 2
Extract the harness
Fix: Wrap the whole lifecycle in one function so capabilities have a home to grow in.
+56 / −20 · 2 files · harness.ts: 53 lines
// agent/7-index.ts — 20 lines collapse to 2
const result = await runHarness(TASK, MODEL)
printHarnessResult(result)
// runHarness opens the browser, builds tools + context,
// runs the loop, and always closes the browser — even on error.
What you learned: Once the harness is a named thing, it's a thing you can extend. Every step after this is "add a capability to the harness," not "rewrite the script."
Step 4 of 5 · Branch 2 → 3
Add a verify step
Fix: Stop trusting the model's word. Read the trace and prove the upvote actually happened.
+120 / −3 · verify step · maxAttempts: 3
const result = await runHarness(TASK, MODEL, {
  verify: verifySuccessfulUpvote,   // reads the trace, not the prose
  maxAttempts: 3,                   // retry until it really happened
})
// VerifyResult = { passed: boolean, reason: string, fatal?: boolean }
What you learned: Guardrails could never catch a hallucinated "Done!" because the run looked perfectly well-behaved. Verify inspects what tools actually fired. This is the line between a demo and something you'd trust.
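The retry behaviour behind maxAttempts can be sketched as an outer loop around the agent run. The VerifyResult fields are from the post; runWithVerify, the Attempt and HarnessResult shapes, and the choice to stop retrying when fatal is set are assumptions made for this illustration.

```typescript
interface VerifyResult {
  passed: boolean;
  reason: string;
  fatal?: boolean;
}

interface Attempt {
  trace: string[]; // simplified: one string per tool call
  answer: string;
}

interface HarnessResult {
  passed: boolean;
  attempts: number;
  reason: string;
}

async function runWithVerify(
  runAttempt: (attempt: number) => Promise<Attempt>,
  verify: (trace: string[]) => VerifyResult,
  maxAttempts: number,
): Promise<HarnessResult> {
  let reason = "no attempts made";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const { trace } = await runAttempt(attempt);
    const verdict = verify(trace); // the model's answer is deliberately ignored
    reason = verdict.reason;
    if (verdict.passed) return { passed: true, attempts: attempt, reason };
    if (verdict.fatal) return { passed: false, attempts: attempt, reason }; // assumed: fatal means do not retry
  }
  return { passed: false, attempts: maxAttempts, reason };
}

// Stand-in agent: claims success on attempt 1 without clicking, really clicks on attempt 2.
const flakyAgent = async (attempt: number): Promise<Attempt> =>
  attempt === 1
    ? { trace: ["browser_navigate news.ycombinator.com"], answer: "Done!" }
    : { trace: ["browser_navigate news.ycombinator.com", "browser_click up_43210123"], answer: "Done!" };

const sawUpvoteClick = (trace: string[]): VerifyResult =>
  trace.some((step) => step.startsWith("browser_click up_"))
    ? { passed: true, reason: "upvote click found in trace" }
    : { passed: false, reason: "no upvote click in trace" };

runWithVerify(flakyAgent, sawUpvoteClick, 3).then((result) => console.log(result));
```

Both attempts return the same answer text, "Done!", yet only the second passes, because the verdict is computed from the trace alone.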
Step 5 of 5 · Branch 3 → 4
Recover from failure
Fix: The agent kept getting bounced to a login wall. Teach the harness to clear it and to stop the instant it succeeds.
+88 / −5 · 4 files · login-handler.ts
// agent/6-harness.ts — success short-circuits the loop
const guardrails = combineGuardrails(
  stopAfterUpvote(() => upvotedStory),  // → stoppedBy: "success"
  defaultGuardrails,
)
// login-handler.ts fills the form, then injects:
// "You are now logged in. Finish the task."
What you learned: This is Hashimoto's rule as code. The agent hit a wall once; now the harness clears that wall every time. The fix wasn't a smarter model, it was a better environment.
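The success short-circuit can be sketched as a guardrail that reports why it stopped. The names stopAfterUpvote and combineGuardrails and the stoppedBy values come from the post; the Verdict type and the iteration-only guardrail signature are simplifying assumptions, and the login handler itself is left out.

```typescript
// Assumed verdict shape: either keep going, or stop with a cause.
type Verdict =
  | { stop: false }
  | { stop: true; stoppedBy: "success" | "guardrail"; reason: string };

type Guardrail = (iteration: number) => Verdict;

const maxIterations = (limit: number): Guardrail => (iteration) =>
  iteration >= limit
    ? { stop: true, stoppedBy: "guardrail", reason: `hit ${limit} iterations` }
    : { stop: false };

// Asks the harness, not the model, whether an upvote has been recorded.
const stopAfterUpvote = (getUpvoted: () => string | null): Guardrail => () => {
  const story = getUpvoted();
  return story
    ? { stop: true, stoppedBy: "success", reason: `upvoted ${story}` }
    : { stop: false };
};

// First guardrail that wants to stop wins, so success is listed first.
const combineGuardrails = (...guardrails: Guardrail[]): Guardrail => (iteration) => {
  for (const check of guardrails) {
    const verdict = check(iteration);
    if (verdict.stop) return verdict;
  }
  return { stop: false };
};

// In a real harness the click tool would set this when the upvote lands.
let upvotedStory: string | null = null;

const guardrails = combineGuardrails(
  stopAfterUpvote(() => upvotedStory),
  maxIterations(15),
);

console.log(guardrails(2)); // nothing upvoted yet, keep going
upvotedStory = "43210123";
console.log(guardrails(3)); // stops with stoppedBy: "success"
```

Passing a getter rather than a value matters: the guardrail is built once, but it has to read the latest state every time the loop consults it.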

Guardrails catch crashes. Verify catches lies.

The most useful thing I took away: guardrails and verify are not the same tool, and you need both. They fail-stop on completely different things.

runs before each model call
Guardrails → structural failures
Deterministic limits: too many iterations, too much context, an unrecoverable login wall. They stop the agent from spinning forever or drowning in its own history. The model gets no vote. Result: stoppedBy: "guardrail".
catches: the agent that never stops
runs after the loop
Verify → wrong answers
Inspects the trace for proof the task happened. It stops the agent from claiming a success it never achieved. The check reads ground truth, so the model can't talk past it. Result: passed, or a retry.
catches: the agent that lies

Without a verify step, a model that hallucinates "Done!" looks identical to one that did the work.

Because verify reads the trace instead of the answer, the whole run is auditable. Here's roughly what a successful attempt prints — the verdict is derived from the tool calls, not from anything the model said:

● run · openai/gpt-3.5-turbo-0613 · attempt 2/3
  → browser_navigate news.ycombinator.com
  → browser_get_stories [#1 "Show HN: …" id=43210123 voted=false]
  → browser_click up_43210123 → now at /news
● stoppedBy: success
Verify: PASS — clicked up_43210123, landed back on /news

Your turn — spot the missing capability

One quick check to make sure it stuck. An agent finishes a run and reports: "Done! I upvoted the top story." But when you look, nothing was clicked — it hallucinated the whole thing. Which harness capability catches this?

Self-check
The agent claims success it never achieved. What stops it shipping that lie?
  • Guardrails: Not quite. Guardrails only see structural trouble, such as too many steps or too much context. A clean run that ends in a lie sails right past them.
  • Verify: Exactly. Verify reads the trace and asks "did a real upvote click land?" The model can describe success all it likes; verify checks what tools actually fired, so the lie fails the check and the harness retries.
  • A bigger model: Tempting, but no. This whole post is the counter-example: a weak model was made reliable with no upgrade at all. A bigger model can hallucinate "Done!" just as confidently. The missing capability is in the harness.
  • A better prompt: Nope. The prompt never changed across all five branches, and a perfect prompt still can't tell you, after the fact, whether the agent actually did the work. You need something that inspects the result.

What I took away

I came into this thinking model choice was most of the battle. The repo is a tidy proof that it usually isn't. A genuinely weak model, held fixed, became reliable through five small deterministic additions — none of them prompt-whispering. When my own agents misbehave now, my first question has changed from "is there a better model?" to "what capability is my harness missing?" — a context limit, a guardrail, a verify step, a recovery path. That reframe is the whole lesson, and it's a far more tractable problem than hoping the next model drop fixes everything.

The model was never the bottleneck. The harness was.

Resources & References

Start with the 20-minute talk this whole post is built on, then go deeper with the canonical guides from the teams actually shipping agents at scale.

Primary sources
  • Talk · AI Engineer World's Fair: Harnesses in AI: A Deep Dive, by Tejas Kumar (IBM). The talk and framing this post is built on. youtube.com/watch?v=C_GG5g38vLU
  • Code · the repo this post walks through: basically-ai-harness, the branch-by-branch build by Tejas Kumar. Clone it and step through the diffs yourself. github.com/TejasQ/basically-ai-harness
Learn the craft, from the teams building agents
  • Anthropic · Engineering: Building Effective Agents. The canonical guide to agent design: when a workflow beats an agent, the core patterns, and why simple, composable harnesses win over heavy frameworks. anthropic.com/engineering/building-effective-agents
  • Anthropic · Engineering: Effective Context Engineering for AI Agents. The "context" pillar in depth: isolation, reduction, and retrieval to keep the window lean as a run grows. anthropic.com/engineering/effective-context-engineering-for-ai-agents
  • OpenAI · Guides & Resources: A Practical Guide to Building Agents. A 34-page field guide to agent design, orchestration, and guardrails, distilled from real customer deployments. openai.com, A Practical Guide to Building Agents (PDF)
Background & lineage
  • Mitchell Hashimoto · HashiCorp co-founder: My AI adoption journey. Where "harness engineering" was coined: when an agent makes a mistake, engineer the environment so it can't make that mistake again. mitchellh.com/writing/my-ai-adoption-journey
  • EleutherAI · the other "harness": lm-evaluation-harness. The 2021 eval-harness lineage the agent harness is so often confused with; the de-facto standard for measuring model quality. github.com/EleutherAI/lm-evaluation-harness

Tags

AI Agents
Agent Harness
Harness Engineering
Context Engineering
Guardrails
