Giridhar Chettiar

Full Stack Developer and AI enthusiast with a passion for creating intuitive, high-performance applications.

AI & ML
June 11, 2026
18 min read

Agent BattleGround — Claude Code vs Codex

I gave Claude Code (Fable 5) and Codex (gpt-5.5) the exact same six creative-coding tasks — an arcade game, a generative-art CLI, a bug hunt, an ASCII aquarium, a messy-data dashboard, and a 120-line demoscene — one shot each, zero interventions, blind judges, instrumented re-tests. This is the full fight card: every prompt, the actual screen recordings, the parameter-by-parameter scorecards, and what the judges said. Final tally: 6–0, but the story is closer than the headline.

Giridhar Chettiar
Author

I keep reading agent-vs-agent benchmarks where the methodology is "vibes". So I ran my own. Six creative-coding tasks, the exact same prompt handed to Claude Code (Fable 5) and Codex (gpt-5.5), one run each, zero human help. Both agents ran deliberately bare: a single tool — Playwright for browser automation — and no skills, plugins, or MCP servers installed, so nothing but the model separates the two columns. Creativity and code quality were scored by blind judges who didn't know which entry was whose, and every claim was re-tested with instrumentation — real browser runs, byte-diffs, console-error capture, fps counters.

Every fight below is a card you can drive. Each one shows the prompt both agents received, a screen recording of the run where I captured one (otherwise the live artifact itself: playable games, galleries, dashboards, and demoscenes embedded right in the card), the full scorecard, and the verdict in the blind judges' own words. Two tasks also ship their exact inputs as downloads, so you can rerun the fight with your own agent.

Models: Fable 5 vs gpt-5.5
Tooling: Playwright only, no skills or plugins
Rubric: 100 pts (first-run 25 · requirements 20 · creativity 20 · code quality 15 · self-verification 10 · autonomy 10)
Judging: creativity and code quality judged blind (anonymized A/B, independent judges)

The verdict

Claude Code (Fable 5) · overall avg 95.4
Per-task scores: 93 · 93 · 96 · 98.5 · 97 · 95

Codex (gpt-5.5) · overall avg 90.2
Per-task scores: 84 · 91 · 91.5 · 87.5 · 92.5 · 94.5

Task wins: 6–0 (Claude Code)
Closest fight: T6 demoscene · +0.5

Both agents completed all six tasks with zero human interventions — autonomy was a clean 100% sweep on each side, so it never separates them. The gaps come from requirements precision, blind-judged creativity and quality, and self-verification depth. And Codex genuinely out-engineered Claude in spots — it won the code-quality cell on Tasks 5 and 6, wrote a zero-dependency PNG encoder, and found a race condition that wasn't even planted. The headline is closer than 6–0 suggests.
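As a rough sketch of how a single card's total falls out of the rubric weights: the function below assumes the 1–5 judge scores and the 0–2 self-verification level scale linearly to their point caps, which matches every scorecard row on this page. The name `rubric_total`, its parameter names, and the zero credit for a failed first run are illustrative assumptions, not part of the benchmark protocol.

```python
def rubric_total(first_run_ok, req_met, req_total, creativity, quality,
                 verify_level, autonomy_pts=10.0):
    """Combine one run's rubric cells into a 100-point total (assumed linear scaling)."""
    first_run = 25.0 if first_run_ok else 0.0   # assumption: no partial credit
    requirements = 20.0 * req_met / req_total   # checklist items met
    creativity_pts = 20.0 * creativity / 5      # blind-judged 1-5
    quality_pts = 15.0 * quality / 5            # blind-judged 1-5
    verification = 10.0 * verify_level / 2      # 0, 1 or 2
    return (first_run + requirements + creativity_pts
            + quality_pts + verification + autonomy_pts)


# Task 1 cells as they appear in the run log.
claude_t1 = rubric_total(True, 8, 8, 4, 4, 2)
codex_t1 = rubric_total(True, 7, 8, 4, 3.5, 1)
print(claude_t1, codex_t1)  # 93.0 84.0
```

Feeding in the Task 1 cells reproduces the 93 and 84 shown on that card, which is a quick way to sanity-check any other row.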

Evaluation parameters

Each row below is one rubric dimension, averaged across all six tasks as a percentage of maximum.

| Dimension | What it measures | Max pts | Claude Code (%) | Codex (%) |
|---|---|---|---|---|
| First-run success | worked untouched, before any human fix | 25 | 100 | 100 |
| Requirements coverage | checklist items met | 20 | 100 | 97.9 |
| Creativity & polish | blind-judged 1–5 | 20 | 88.3 | 73.3 |
| Code quality | blind-judged 1–5 | 15 | 85.0 | 78.3 |
| Self-verification | tested own work before declaring done | 10 | 100 | 91.7 |
| Autonomy | interventions needed | 10 | 100 | 100 |
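The percentages above can be rechecked from the per-task points in the scorecards. This is a minimal sketch for the Codex column; the dictionary layout and the helper name `percent_of_max` are illustrative, and the point values are copied from the six cards.

```python
# Codex points per task (Tasks 1-6), paired with each dimension's point cap.
CODEX_POINTS = {
    "requirements": ([17.5, 20, 20, 20, 20, 20], 20),
    "creativity": ([16, 14, 16, 12, 14, 16], 20),
    "code_quality": ([10.5, 12, 10.5, 10.5, 13.5, 13.5], 15),
    "self_verification": ([5, 10, 10, 10, 10, 10], 10),
}


def percent_of_max(points, cap):
    """Average a dimension across tasks as a percentage of its maximum."""
    return round(100 * sum(points) / (cap * len(points)), 1)


codex_pct = {name: percent_of_max(points, cap)
             for name, (points, cap) in CODEX_POINTS.items()}
print(codex_pct)
```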

Now the six fights. Each card shows the exact prompt and its requirement checklist right under the headline, then the tabs flip between Results, Scorecard, and Verdict. Task 4 has the screen recordings; Tasks 1, 2, 5, and 6 embed the live artifacts each agent produced — the Task 1 games and Task 6 demoscenes are fully playable right in the card — and Task 3 shows each agent's final run report.

Task 1 — Neon Drift

Task 1 · Neon Drift — one-shot arcade game
Claude +9

Build "Neon Drift" — a playable browser game in a single index.html (vanilla JS + canvas, no libraries). A ship dodges asteroids that spawn faster over time. Requirements: particle effects on collision, screen shake, a combo/score system with localStorage high score, WebAudio sound effects generated in code, a start screen and game-over screen, and smooth 60fps. When done, open it in the browser yourself, play one full run, and fix anything broken before declaring it finished.

Requirements
  ① single file, no libraries
  ② particles
  ③ screen shake
  ④ combo/score + localStorage
  ⑤ WebAudio in code
  ⑥ start screen
  ⑦ game-over screen
  ⑧ ~60fps

Score: Claude 93 · Codex 84

Claude Code · Fable 5 · Neon Drift · Open full ↗
Codex · gpt-5.5 · Neon Drift · Open full ↗

Both games are live and playable in the frames — click inside one first so it grabs your keyboard. Space or a click starts Claude's run; "Start Run" kicks off Codex's. Arrows / WASD steer both. For the proper full-screen experience, hit Open full.

| Parameter | Claude | Codex | Notes |
|---|---|---|---|
| First-run success /25 | y → 25 | y → 25 | Both playable untouched |
| Requirements /20 | 8/8 → 20 | 7/8 → 17.5 | Codex combo never engages in real play (meter decays 0.7/s vs 4-graze threshold) |
| Creativity (blind) /20 | 4 → 16 | 4 → 16 | Tie, with different strengths |
| Code quality (blind) /15 | 4 → 12 | 3.5 → 10.5 | Codex: unused param, duplicated patterns, double combo computation |
| Self-verification /10 | 2 → 10 | 1 → 5 | Claude played a full run, 6 screenshots; Codex took one game-over shot |
| Autonomy /10 | 0 int. → 10 | 0 int. → 10 | |
| Total | 93 | 84 | |
Blind judge on Claude's entry
"Stronger game-feel invention: a real graze mechanic with a combo timer bar, escalating near-miss pitch, distinct game-over jingle vs high-score fanfare, motion-blur trails… well modularized (AudioFX module with master bus), guards localStorage with try/catch."
Blind judge on Codex's entry
"Strong cohesive art direction: designed DOM HUD, overlay panels with kbd hints, high-score toast, vignette, spawn invincibility… but combo is computed twice, spawnAsteroid's forceEdge param is never used, and an awkward particle pattern is duplicated."
Decisive difference
Claude spent its polish on game design and architecture; Codex spent it on DOM chrome. The broken-in-practice combo and single-screenshot verification cost Codex 7.5 pts of the 9-pt gap.
Instrumented evidence
  • Claude: unattended run reached combo ×3, score 609 persisted to localStorage; 118 fps mid-game; zero console errors.
  • Codex: combo stuck at ×1 (matches my live-play experience); 75 fps mid-game, above the 60 fps target but about 36% below Claude's 118 fps; high score persisted correctly.
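To see why a 0.7/s decay against a 4-graze threshold can leave a combo unreachable, here is a toy model. It is not Codex's code: the article only gives the decay rate and the threshold, so the rule that each graze adds exactly 1 to the meter, and the function name `peak_meter`, are assumptions made for illustration.

```python
def peak_meter(graze_interval_s, grazes=20, decay_per_s=0.7, gain=1.0):
    """Highest meter value reached when grazes arrive at a fixed interval.

    Assumed model: the meter drains linearly at decay_per_s and each graze
    adds `gain`. The real game's rules may differ.
    """
    meter = 0.0
    peak = 0.0
    for _ in range(grazes):
        meter = max(0.0, meter - decay_per_s * graze_interval_s) + gain
        peak = max(peak, meter)
    return peak


THRESHOLD = 4.0
print(peak_meter(2.0) >= THRESHOLD)             # one graze every 2 s
print(peak_meter(1.0, grazes=12) >= THRESHOLD)  # one graze per second, 12 in a row
```

Under these assumptions a graze every two seconds never lifts the meter above 1, and even one graze per second needs about a dozen in a row to cross 4, which is consistent with a combo that stays at ×1 in ordinary play.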

Task 2 — artgen, a generative-art CLI

Task 2 · artgen — generative-art CLI
Claude +2

Create a CLI tool called "artgen" (Python or Node, your choice) that generates seeded generative art using flow fields. Requirements: artgen --seed 42 --style <ink|neon|organic> --out piece.svg with three visually distinct styles, identical seed always produces identical output, PNG export option, and an artgen gallery command that renders a 3x3 HTML gallery of seeds 1-9. Include a README explaining the algorithm.

Requirements
  ① CLI flags as specified
  ② 3 distinct styles
  ③ seed determinism
  ④ PNG export
  ⑤ 3×3 gallery, seeds 1–9
  ⑥ README explains algorithm

Score: Claude 93 · Codex 91

Claude Code · Fable 5 · gallery, organic style · Open full ↗
Codex · gpt-5.5 · gallery, organic style · Open full ↗

Both agents reached for the same classic technique: a flow field. Learn more about it here — Art gen & flow fields ↗

| Parameter | Claude | Codex | Notes |
|---|---|---|---|
| First-run success /25 | y → 25 | y → 25 | Every command worked when re-run by the judge |
| Requirements /20 | 6/6 → 20 | 6/6 → 20 | Codex matches spec flags literally (--png); Claude via --out x.png / --format |
| Creativity (blind) /20 | 4 → 16 | 3.5 → 14 | Claude's 3 styles structurally distinct; Codex's ink/organic differ mostly in palette |
| Code quality (blind) /15 | 4 → 12 | 4 → 12 | Tie, with different virtues (scene-model parity vs dataclasses + typed package) |
| Self-verification /10 | 2 → 10 | 2 → 10 | Both fully tested their own CLIs |
| Autonomy /10 | 0 int. → 10 | 0 int. → 10 | |
| Total | 93 | 91 | |
Blind judge on Claude's entry
"Three styles genuinely distinct in structure, not just palette: dendritic sumi-e strokes with a vermilion seal, triple-pass neon glow, tendrils with seed dots. The README is the best creative documentation of the pair — explains the from-scratch Perlin/fBm field and exactly why output is byte-reproducible."
Blind judge on Codex's entry
"Individually the more beautiful pieces — the neon marbling is striking, helped by midpoint integration and SVG paper-noise/glow filters… real engineering ambition (frozen dataclasses, hand-rolled stdlib PNG encoder with anti-aliased rasterization), but ink and organic share nearly identical compositions."
Decisive difference
Codex produced the single prettiest image; Claude won the stated criteria: style distinctiveness and README quality. At +2, this is the second-closest fight of the benchmark after T6.
Instrumented evidence
  • Determinism: double-run diff = byte-identical for SVG and PNG, both tools, multiple styles.
  • Gallery: both render 9 inline SVGs, seeds 1–9.
  • Dependency surface: Codex PNG export is pure stdlib; Claude requires Pillow for PNG (SVG/gallery dependency-free).
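The double-run byte-diff check is easy to reproduce in miniature. The sketch below is neither agent's artgen: it swaps in a simple sine/cosine angle field where the judge's notes mention Perlin/fBm noise, and every name in it (`render_svg`, `digest`) is hypothetical. What it does share with the real tools is the idea that all randomness flows from one seeded generator and coordinates are formatted to a fixed precision.

```python
import hashlib
import math
import random


def render_svg(seed, size=200, lines=40, steps=30):
    """Trace short streamlines through a toy flow field and return SVG text."""
    rng = random.Random(seed)  # private RNG, so no global state leaks in
    fx = rng.uniform(0.01, 0.05)
    fy = rng.uniform(0.01, 0.05)
    polylines = []
    for _ in range(lines):
        x = rng.uniform(0, size)
        y = rng.uniform(0, size)
        points = []
        for _ in range(steps):
            points.append(f"{x:.2f},{y:.2f}")  # fixed precision keeps bytes stable
            angle = math.sin(x * fx) * math.cos(y * fy) * math.tau
            x += 3 * math.cos(angle)
            y += 3 * math.sin(angle)
        polylines.append(
            '<polyline fill="none" stroke="black" points="' + " ".join(points) + '"/>'
        )
    body = "\n".join(polylines)
    return (f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">\n'
            f"{body}\n</svg>\n")


def digest(text):
    """SHA-256 of the rendered bytes, standing in for a byte-level diff."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


same_seed_matches = digest(render_svg(42)) == digest(render_svg(42))
other_seed_differs = digest(render_svg(42)) != digest(render_svg(43))
print(same_seed_matches, other_seed_differs)
```

Comparing hashes of two independent renders is equivalent to the byte-diff for this purpose; a mismatch on the same seed would point to hidden state such as the global `random` module or a timestamp in the output.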

Task 3 — Bug hunt

Task 3 · Bug hunt — MiniMart fixture
Claude +4.5

This project contains 6 intentionally planted bugs (logic errors, an off-by-one, an async race condition, a mutation bug, and edge-case failures). Find and fix all of them. Do not rewrite files wholesale — make minimal targeted fixes. All existing tests must pass, and write BUGS.md documenting each bug, its root cause, and your fix.

Requirements
  ①–⑥ six planted bugs fixed
  ⑦ complete BUGS.md

Checks: verify_fixes.py 6/6 · unittest suite green · diff minimality eyeballed

Score: Claude 96 · Codex 91.5
Claude Code · Fable 5 · finished in 1m 49s

● All 6 bugs found, fixed, and documented in BUGS.md. All 14 existing tests pass, plus targeted checks for what the suite doesn't cover.

  1. Coupon threshold (cart.py:42) — apply_coupon used > instead of >=, so a cart at exactly the $50 minimum missed the discount
  2. Mutation in merge_carts (cart.py:59) — merged list aliased the primary cart's items; fixed by copying item dicts
  3. Off-by-one in top_sellers (inventory.py:35) — ranked[:n-1] returned n−1 products; fixed to ranked[:n]
  4. Async race (fetcher.py:27–29) — read→await→write on the billing counter; replaced with an atomic +=, then verified 50 concurrent fetches count exactly 50
  5. Thousands-separator parsing (utils.py:15) — parse_amount("$1,299.99") raised ValueError; commas stripped before float()
  6. Division by zero (inventory.py:40) — average_price() crashed on an empty catalog; now returns 0.0

Each fix was a one-to-two-line targeted change; no files were rewritten.
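For readers who have not opened the fixture, this is roughly what several of those one-line fixes look like. The function names come from the run report, but the signatures and bodies are assumptions: the MiniMart source is not reproduced on this page, and `coupon_applies` is a hypothetical stand-in for the comparison inside `apply_coupon`.

```python
def coupon_applies(subtotal, minimum=50.0):
    """Hypothetical helper for the apply_coupon threshold check."""
    return subtotal >= minimum  # buggy version used: subtotal > minimum


def merge_carts(primary, secondary):
    """Return a merged list whose item dicts are copies, not aliases."""
    return [dict(item) for item in primary + secondary]


def top_sellers(ranked, n):
    """Return the first n entries of an already ranked list."""
    return ranked[:n]  # buggy version used: ranked[:n - 1]


def parse_amount(text):
    """Parse strings such as '$1,299.99' into a float."""
    cleaned = text.strip().lstrip("$").replace(",", "")
    return float(cleaned)


def average_price(prices):
    """Mean price, defined as 0.0 for an empty catalog."""
    if not prices:
        return 0.0
    return sum(prices) / len(prices)


print(coupon_applies(50.0), top_sellers(["a", "b", "c", "d"], 3),
      parse_amount("$1,299.99"), average_price([]))
```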

Codex · gpt-5.5 · finished in 5m 17s

● Implemented all 6 targeted fixes and added BUGS.md.

Changed:

  • cart.py:39 — coupon threshold and independent cart merging
  • inventory.py:35 — top-seller slice and empty average handling
  • fetcher.py:16 — async in-flight fetch handling and safe count increments
  • utils.py:15 — thousands separator parsing

Verification passed:

  • python -B -m unittest discover -s tests -t .
  • Focused checks for merge independence, empty inventory average, and concurrent fetch counting/deduping
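The billing-counter race both reports mention is a read, await, write sequence. A minimal reproduction, assuming nothing about the fixture beyond that shape (the `Counter` class and both coroutine names are illustrative), shows why moving the increment into a single `+=` with no `await` in between is enough on a single-threaded event loop:

```python
import asyncio


class Counter:
    def __init__(self):
        self.count = 0


async def racy_fetch(counter):
    current = counter.count        # read
    await asyncio.sleep(0)         # yield: other tasks run with the stale value
    counter.count = current + 1    # write back a stale result


async def safe_fetch(counter):
    await asyncio.sleep(0)         # simulated I/O happens first
    counter.count += 1             # no await between read and write


async def run(fetch, n=50):
    counter = Counter()
    await asyncio.gather(*(fetch(counter) for _ in range(n)))
    return counter.count


racy_total = asyncio.run(run(racy_fetch))
safe_total = asyncio.run(run(safe_fetch))
print(racy_total, safe_total)  # the racy version loses updates; the safe one counts 50
```

The 50-task check mirrors the verification Claude documented; the fix relies on cooperative scheduling, so it would not carry over to code that increments from multiple threads.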
Try it yourself

Download the exact MiniMart fixture both agents received — 6 planted bugs, untouched. Hand it to your favorite agent (or yourself) with the prompt above, then run the judge's verifier to see how many of the 6 you actually caught. The verifier names every bug, so don't peek until you're done.

  • ⬇ bug-hunt-fixture.zip (5 KB): the buggy project
  • ⬇ verify_fixes.py: the judge's verifier (spoilers)
| Parameter | Claude | Codex | Notes |
|---|---|---|---|
| First-run success /25 | y → 25 | y → 25 | verify_fixes.py: 6/6 both · unittest: 14/14 green both |
| Requirements /20 | 7/7 → 20 | 7/7 → 20 | 6 bugs + complete BUGS.md each |
| Creativity / insight (blind) /20 | 4 → 16 | 4 → 16 | Claude: deeper docs · Codex: bonus race found |
| Code quality (blind) /15 | 5 → 15 | 3.5 → 10.5 | Surgical 1–2-line fixes vs scope-creep in merge_carts |
| Self-verification /10 | 2 → 10 | 2 → 10 | Both ran the suite; Claude documented a 50-task race verification |
| Autonomy /10 | 0 int. → 10 | 0 int. → 10 | |
| Total | 96 | 91.5 | |
Blind judge on Claude's entry
"Exemplary BUGS.md: structured Bug/Root-cause/Fix sections, a snippet of the racy read-await-write, a precise explanation of why += is atomic w.r.t. the event loop, and a verification claim. Six maximally surgical fixes with no behavioral side effects."
Blind judge on Codex's entry
"Spotted a genuine secondary race — same-SKU requests stampeding past the cache — and fixed it correctly. But the merge_carts rewrite via add_item changes duplicate-SKU semantics (quantities merge, price conflicts silently resolve), new behavioral risk the 'minimal targeted fixes' mandate warned against."
Decisive difference
Discipline vs depth: Codex showed the benchmark's single most impressive insight (the unplanted cache-stampede race, the answer key's "bonus signal"), but bundled it with a semantics-changing rewrite. Claude delivered risk-free minimality with deeper documentation.
Instrumented evidence
  • Verifier integrity: both runs' verify_fixes.py byte-identical to the judge-only master (no tampering).
  • Diff vs pristine fixture: Claude 6 hunks, ~7 lines; Codex ~5 hunks incl. a 12-line in-flight task map in fetcher.py.
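An in-flight task map of the kind mentioned above can be sketched as follows. This is not Codex's 12-line patch, which is not shown on this page; the `Fetcher` class, its attribute names, and the simulated backend are all hypothetical. The idea is that concurrent requests for the same SKU await one shared task instead of each missing the cache and hitting the backend.

```python
import asyncio


class Fetcher:
    def __init__(self):
        self.cache = {}
        self.in_flight = {}      # sku -> task currently loading that sku
        self.backend_calls = 0

    async def _load(self, sku):
        self.backend_calls += 1
        await asyncio.sleep(0.01)        # simulated slow backend
        return {"sku": sku}

    async def fetch(self, sku):
        if sku in self.cache:
            return self.cache[sku]
        task = self.in_flight.get(sku)
        if task is None:                 # first caller starts the load
            task = asyncio.create_task(self._load(sku))
            self.in_flight[sku] = task
        try:
            result = await task          # later callers share the same task
        finally:
            self.in_flight.pop(sku, None)
        self.cache[sku] = result
        return result


async def demo():
    fetcher = Fetcher()
    results = await asyncio.gather(*(fetcher.fetch("sku-1") for _ in range(20)))
    return fetcher.backend_calls, results


backend_calls, results = asyncio.run(demo())
print(backend_calls)  # 1
```

Twenty concurrent requests collapse into one backend call here; without the map, each would pass the cache check before the first result was stored.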

Task 4 — ASCII aquarium

Task 4 · ASCII aquarium screensaver
Claude +11

Build an ASCII aquarium screensaver in a single Python file using only the standard library. Requirements: at least 4 fish species with distinct movement behaviors, rising bubbles, seaweed that sways, a day/night color cycle, graceful handling of terminal resize, and pressing 'q' exits cleanly restoring the terminal. Make it charming.

Requirements
  ① stdlib only
  ② ≥4 species, distinct behaviors
  ③ bubbles
  ④ swaying seaweed
  ⑤ day/night cycle
  ⑥ survives resize
  ⑦ clean 'q' exit

Score: Claude 98.5 · Codex 87.5

Claude Code · Fable 5
Codex · gpt-5.5

Watch Claude's jellyfish glow brighter at night and the minnow school follow its leader — then compare Codex's "(sun)".

| Parameter | Claude | Codex | Notes |
|---|---|---|---|
| First-run success /25 | y → 25 | y → 25 | q-exit & resize clean on both |
| Requirements /20 | 7/7 → 20 | 7/7 → 20 | 6 species vs 5, both stdlib-only |
| Creativity (blind) /20 | 5 → 20 | 3 → 12 | Widest creative gap of the benchmark |
| Code quality (blind) /15 | 4.5 → 13.5 | 3.5 → 10.5 | Codex: dead code, identical-branch conditional, hue-shifting night tint |
| Self-verification /10 | 2 → 10 | 2 → 10 | Both self-tested |
| Autonomy /10 | 0 int. → 10 | 0 int. → 10 | |
| Total | 98.5 | 87.5 | |
Blind judge on Claude's entry
"Far beyond spec: sine-bobbing goldfish, dash-and-coast darters, a leader-following minnow school, napping whiskered catfish, plus a crab, a bioluminescent jellyfish that glows brighter at night, a bubble-burping treasure chest, twinkling stars, a traversing sun/moon, depth-graded 24-bit water."
Blind judge on Codex's entry
"Meets the spec pleasantly with five genuinely distinct species… but the art is thin (the eel is four tildes, the sun is literally the text '(sun)'), the day/night cycle is a crude 256-color index shift, and _night_tint dims by scaling palette indices, which shifts hue arbitrarily."
Decisive difference
Ambition of the world. The prompt's only subjective clause was "make it charming". Claude built an ecosystem; Codex built a checklist.
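The night-tint complaint is easier to see with numbers. The sketch below contrasts two ways of "dimming" an orange: scaling 24-bit RGB channels, and halving a 256-color palette index. It uses the standard xterm 6×6×6 color-cube layout; the function names are illustrative and neither snippet is taken from either agent's aquarium.

```python
import colorsys

CUBE_LEVELS = (0, 95, 135, 175, 215, 255)  # xterm 6x6x6 color-cube steps


def xterm_cube_rgb(index):
    """RGB for a 256-color palette index inside the color cube (16..231)."""
    if not 16 <= index <= 231:
        raise ValueError("index is outside the color cube")
    r, rest = divmod(index - 16, 36)
    g, b = divmod(rest, 6)
    return (CUBE_LEVELS[r], CUBE_LEVELS[g], CUBE_LEVELS[b])


def dim_rgb(rgb, brightness):
    """Scale every channel by the same factor, which keeps the hue."""
    return tuple(round(channel * brightness) for channel in rgb)


def fg_escape(rgb):
    """24-bit foreground color escape sequence."""
    r, g, b = rgb
    return f"\x1b[38;2;{r};{g};{b}m"


def hue_of(rgb):
    r, g, b = (channel / 255 for channel in rgb)
    return colorsys.rgb_to_hsv(r, g, b)[0]


day = xterm_cube_rgb(208)                 # an orange: (255, 135, 0)
night_rgb = dim_rgb(day, 0.5)             # same orange, darker
night_index = xterm_cube_rgb(208 // 2)    # index 104 is a different color
print(day, night_rgb, night_index)
```

Halving the channels keeps the orange hue, while halving the index lands on (135, 135, 215), a pale blue. Palette indices encode position in a cube, not brightness, so arithmetic on them moves through unrelated colors.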
Instrumented evidence
  • Both stdlib-only (no curses — ANSI escapes; Codex uses ctypes for Windows VT mode, Claude msvcrt/termios dual-path).
  • Headless smoke test: Claude renders frames even without a TTY (has a --frames self-test flag); Codex exits politely: "ASCII Aquarium needs an interactive terminal." — graceful, by design.
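Both headless behaviors can be shown in one small sketch: render a fixed number of frames when asked (the idea behind a `--frames` self-test), and otherwise refuse politely when stdout is not a terminal. Everything here is hypothetical, including the single sine-bobbing fish, the function names, and the message text; the interactive loop is left out.

```python
import io
import math
import sys

FISH = "><>"


def render_frame(tick, width=40, height=8):
    """Draw one fish that swims right and bobs on a sine wave."""
    rows = [[" "] * width for _ in range(height)]
    x = tick % (width - len(FISH))
    y = round((height - 1) / 2 + 2 * math.sin(tick / 3))
    y = max(0, min(height - 1, y))          # stay inside the tank
    for offset, char in enumerate(FISH):
        rows[y][x + offset] = char
    return "\n".join("".join(row) for row in rows)


def run(frames=None, out=None):
    """Render `frames` frames, or bail out if no TTY and no frame count."""
    if out is None:
        out = sys.stdout
    if frames is None:
        if not out.isatty():
            out.write("needs an interactive terminal (or pass a frame count)\n")
            return 1
        frames = 1  # the real interactive loop is omitted in this sketch
    for tick in range(frames):
        out.write(render_frame(tick) + "\n")
    return 0


buffer = io.StringIO()                 # stands in for a pipe or CI log
exit_code = run(frames=3, out=buffer)
print(exit_code, len(buffer.getvalue().splitlines()))
```

Being able to render into a plain buffer is what makes a screensaver testable in CI, where there is no terminal to draw on.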

Task 5 — Messy data → story

Task 5 · sales.csv → dashboard
Claude +4.5

Clean the attached sales.csv (it has mixed date formats, duplicate rows, inconsistent category spellings, nulls, and currency symbols mixed into numbers). Document every cleaning decision in CLEANING.md. Then build a single self-contained dashboard.html (data inlined, no network calls) with 4 charts and a written "3 key insights" section that a non-technical manager would find genuinely useful.

Requirements
  ① CLEANING.md documents decisions
  ② ≥9 of 11 dirt classes handled
  ③ 4 charts
  ④ 3 useful insights
  ⑤ data inlined, zero network
  ⑥ renders offline

Score: Claude 97 · Codex 92.5

Claude Code · Fable 5 · dashboard.html · Open full ↗
Codex · gpt-5.5 · dashboard.html · Open full ↗
Why the two headline numbers disagree
Scroll both dashboards and you'll notice they don't even agree on total revenue: Claude reports $101,521.61 across 256 rows, Codex $102,504.61 across 260. The gap, exactly $983.00, is four rows with negative quantities. Claude read them as refunds: excluded from sales, documented in CLEANING.md, preserved in a cleaning_report.json for a returns analysis. Codex called them "sign-entry errors" and flipped them positive, counting $983.00 of returns as if they were sales. The dirt key plants those rows as legitimate refunds, so Codex's bigger number is the wrong one; that single judgment call is what the blind judge dinged its insight score for, even while scoring Codex's pipeline engineering above Claude's. Every other cleaning decision (duplicates, the $99,999 Rain Jacket, median-imputed prices) they made identically.
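The two policies are easy to compare on a few made-up rows. This sketch is not either agent's pipeline: the rows are invented rather than taken from sales.csv, and the helper names are hypothetical. It uses `Decimal` so the totals are exact, then checks the reported gap from the two dashboard figures.

```python
from decimal import Decimal


def parse_money(text):
    """Strip currency symbols and thousands separators, keep exact cents."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


# Hypothetical rows for illustration only.
rows = [
    {"qty": 2, "unit_price": "$40.00"},
    {"qty": -1, "unit_price": "$25.50"},   # negative quantity: a likely refund
    {"qty": 1, "unit_price": "1,200.00"},
]


def split_sales_and_refunds(rows):
    sales = [row for row in rows if row["qty"] > 0]
    refunds = [row for row in rows if row["qty"] < 0]
    return sales, refunds


def revenue(rows):
    return sum((abs(row["qty"]) * parse_money(row["unit_price"]) for row in rows),
               Decimal("0"))


sales, refunds = split_sales_and_refunds(rows)
excluded_total = revenue(sales)            # refunds kept out of sales
flipped_total = revenue(sales + refunds)   # refunds counted as positive sales
toy_gap = flipped_total - excluded_total

reported_gap = Decimal("102504.61") - Decimal("101521.61")
print(toy_gap, reported_gap)  # 25.50 983.00
```

The gap between the two policies is always the absolute value of the refund rows, which is why the two dashboards differ by exactly the four rows' worth of returns.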
Try it yourself

Download the exact messy sales.csv both agents received — all 11 dirt classes planted, untouched. Run the prompt above through your own agent and see which call it makes on the four negative-quantity rows. The dirt key lists every planted trap, so open it only after.

  • ⬇ sales.csv (22 KB): the messy data
  • ⬇ sales-dirt-key.md: the judge's dirt key (spoilers)
| Parameter | Claude | Codex | Notes |
|---|---|---|---|
| First-run success /25 | y → 25 | y → 25 | Both render offline untouched (Codex with hidden console errors) |
| Requirements /20 | 11/11 dirt · 6/6 → 20 | 11/11 dirt · 6/6 → 20 | Threshold was ≥9 of 11 dirt classes; both cleared it fully |
| Creativity / insight (blind) /20 | 5 → 20 | 3.5 → 14 | Managerial story vs templated number-restating |
| Code quality (blind) /15 | 4 → 12 | 4.5 → 13.5 | Codex's pipeline engineering won this cell |
| Self-verification /10 | 2 → 10 | 2 → 10 | Both opened their dashboards |
| Autonomy /10 | 0 int. → 10 | 0 int. → 10 | |
| Total | 97 | 92.5 | |
Blind judge on Claude's entry
"Best-in-class reasoning on ambiguous data: per-style date-locale inference with evidence and a residual-risk note, returns excluded with explicit justification, fat-finger outlier vs genuine high price distinguished. Insights explain why May spiked, warn against budgeting off it, and flag concentration risks with concrete Action lines."
Blind judge on Codex's entry
"An exemplary reproducible pipeline — Decimal math, validation assertions, one script regenerates CSV, CLEANING.md and dashboard… but the insights are templated number-restatements, and flipping negative quantities to positive fabricates sales rather than treating them as probable returns."
Decisive difference
Analytic judgment. The dirt key's trap rows (refunds, date-locale ambiguity) reward reasoning: Claude excluded refunds with justification and proved its date rules; Codex sign-flipped refunds into $983.00 of fake revenue, which it documented, but it was the weaker call.
Instrumented evidence
  • Dirt-key audit: both CLEANING.md files address all 11 planted classes; both caught the $99,999 Rain Jacket and the two-locale date trap.
  • Browser run: Claude 0 console errors; Codex 8 errors (negative SVG rect widths at first paint — recovers visually, would break on narrow viewports).
  • Network: both fully inlined, zero external requests; row accounting differs only by the refund decision (256 vs 260 rows).

Task 6 — 120-line demoscene

Task 6 · 120-line demoscene
Claude +0.5

Create the most impressive interactive experience you can in ONE HTML file of at most 120 lines (no minification tricks, readable code, no external resources). You choose what it is. Surprise me. Then open it yourself and verify it works.

Requirements
  ① ≤120 lines (wc -l)
  ② no external resources
  ③ readable, not minified
  ④ works on open
  ⑤ self-verified

Score: Claude 95 · Codex 94.5

Claude Code · Fable 5 · "Galaxy Sandbox" · Open full ↗
Codex · gpt-5.5 · "Orbit Forge" · Open full ↗

The photo finish — and these aren't videos, both demos are running live in the frames. Click inside Claude's galaxy to fire a shockwave; move and hold inside Codex's forge to bend the flow. Pick your own winner before reading the verdict.

| Parameter | Claude | Codex | Notes |
|---|---|---|---|
| First-run success /25 | y → 25 | y → 25 | Galaxy Sandbox · Orbit Forge: both flawless on open |
| Requirements /20 | 5/5 → 20 | 5/5 → 20 | 118 vs 119 lines, zero external resources, readable |
| Creativity (blind) /20 | 4.5 → 18 | 4 → 16 | 3D engine + audio vs gorgeous 2D plasma |
| Code quality (blind) /15 | 4 → 12 | 4.5 → 13.5 | Codex's functional decomposition won this cell |
| Self-verification /10 | 2 → 10 | 2 → 10 | Both opened & screenshot-verified their pages |
| Autonomy /10 | 0 int. → 10 | 0 int. → 10 | |
| Total | 95 | 94.5 | |
Blind judge on Claude's entry
"A hand-rolled 3D engine — spiral-arm generation, rotation matrices, perspective projection, spring-back physics, a world-space-corrected shockwave, plus a pentatonic WebAudio chime — genuinely ambitious for ~100 lines."
Blind judge on Codex's entry
"Visually the more striking entry — speed-mapped hues and additive trails around three gravity wells produce a gorgeous plasma look… The code is exemplary: small single-purpose functions, clean naming, DPR handling, pointer capture, even an aria-label."
Decisive difference
The photo finish of the card. Claude attempted and landed the technically harder feat; Codex countered with the more immediately beautiful render and cleaner code. Half a point.
Instrumented evidence
  • Line counts: 118 vs 119 (limit 120); grep confirms zero external URLs/scripts/fonts in both.
  • Live run: Claude 63 fps with 6,000 stars; Codex 121 fps, 587 live particles, working HUD readout.
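The line-count and external-resource checks are simple enough to script. This is a sketch of that kind of audit, not the evaluator's actual tooling: the `audit` function and the regular expression are assumptions, and the pattern is deliberately conservative (it flags any `src` or `href` pointing at an absolute or protocol-relative URL, including ordinary links).

```python
import re

# Matches src=/href= attributes whose value starts with http://, https:// or //.
EXTERNAL = re.compile(r"""(?:src|href)\s*=\s*["']?(?:https?:)?//""", re.IGNORECASE)


def audit(html, max_lines=120):
    """Count lines the way `wc -l` does and list lines that load external URLs."""
    line_count = html.count("\n")  # wc -l counts newline characters
    external = [line for line in html.splitlines() if EXTERNAL.search(line)]
    return {
        "lines": line_count,
        "within_limit": line_count <= max_lines,
        "external": external,
    }


clean_page = "<html>\n<canvas></canvas>\n<script>let t = 0;</script>\n</html>\n"
cdn_page = '<script src="https://cdn.example.com/lib.js"></script>\n'
print(audit(clean_page))
print(audit(cdn_page)["external"])
```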

Beyond the rubric

Autonomy tied at 100% on both sides, so I added two evidence-based parameters to keep separating signal. They're reported alongside — not mixed into — the official rubric, so totals stay comparable. Runtime robustness: defects surfaced under instrumented re-testing (console errors, fps headroom, edge handling, dependency risk). Beyond-spec depth: verified work past the literal checklist.

| Task | Robustness (C / X) | Beyond-spec (C / X) | What drove it |
|---|---|---|---|
| 1 · Neon Drift | 5 / 3.5 | 4.5 / 4 | Codex combo unreachable in real play + lower fps headroom; Claude's graze mechanic & audio bus vs Codex's HUD chrome |
| 2 · artgen | 4 / 5 | 4 / 4.5 | Codex: zero-dependency PNG encoder (stdlib zlib) & spec-literal flags; Claude needs Pillow for PNG |
| 3 · Bug hunt | 5 / 4 | 3.5 / 4.5 | Codex found and fixed an unplanted cache-stampede race (depth) but its merge rewrite changes duplicate-SKU semantics (risk) |
| 4 · Aquarium | 4.5 / 4 | 5 / 3 | Claude runs even headless w/ small-terminal fallback + far richer world; Codex has dead code & a night-tint hue bug |
| 5 · Dashboard | 5 / 3.5 | 5 / 4 | Codex: 8 console errors (negative SVG widths) + refund sign-flip fabricates $983.00 of revenue; Claude: date-locale inference w/ evidence |
| 6 · Demoscene | 5 / 5 | 4.5 / 4 | Both flawless under instrumentation; Claude adds generative audio + true 3D engine in the same budget |
| Average | 4.75 / 4.17 | 4.42 / 4.00 | |

Run log

Full run log — all 12 runs, every rubric cell
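| Task | Tool | 1st-run | Req | Crea | Qual | Verify | Interv. | Model | Total |
|---|---|---|---|---|---|---|---|---|---|
| 1 · Neon Drift | claude | y | 8/8 | 4 | 4 | 2 | 0 | Fable 5 | 93 |
| 1 · Neon Drift | codex | y | 7/8 | 4 | 3.5 | 1 | 0 | gpt-5.5 | 84 |
| 2 · artgen | claude | y | 6/6 | 4 | 4 | 2 | 0 | Fable 5 | 93 |
| 2 · artgen | codex | y | 6/6 | 3.5 | 4 | 2 | 0 | gpt-5.5 | 91 |
| 3 · Bug hunt | claude | y | 7/7 | 4 | 5 | 2 | 0 | Fable 5 | 96 |
| 3 · Bug hunt | codex | y | 7/7 | 4 | 3.5 | 2 | 0 | gpt-5.5 | 91.5 |
| 4 · Aquarium | claude | y | 7/7 | 5 | 4.5 | 2 | 0 | Fable 5 | 98.5 |
| 4 · Aquarium | codex | y | 7/7 | 3 | 3.5 | 2 | 0 | gpt-5.5 | 87.5 |
| 5 · Dashboard | claude | y | 6/6 | 5 | 4 | 2 | 0 | Fable 5 | 97 |
| 5 · Dashboard | codex | y | 6/6 | 3.5 | 4.5 | 2 | 0 | gpt-5.5 | 92.5 |
| 6 · Demoscene | claude | y | 5/5 | 4.5 | 4 | 2 | 0 | Fable 5 | 95 |
| 6 · Demoscene | codex | y | 5/5 | 4 | 4.5 | 2 | 0 | gpt-5.5 | 94.5 |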

Methodology & fairness

Tooling parity
  • Both agents were given exactly one tool — Playwright for browser automation — with no skills, plugins, or MCP servers installed on either side. Stock harnesses, same prompts, so the only variable is the model.
Blind judging
  • Outputs copied to anonymized A/B folders with alternating tool→letter mapping per task; __pycache__ and any path-identifying files stripped.
  • Six independent judge agents, one per task, scored creativity and code quality 1–5 without knowing which tool made which entry, with identical evidence per side (uniform screenshots captured by the evaluator, not the contestants).
  • Judges were instructed not to guess authorship and to use the full scale. Results de-anonymized only after all six verdicts were returned.
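The anonymization step can be sketched in a few lines. This is an illustration of the idea rather than the evaluator's script: the article only says the tool→letter mapping alternates per task, so which tool gets "A" on odd-numbered tasks is an assumption here, as are the function names and folder layout.

```python
import shutil
from pathlib import Path


def letter_mapping(task_index):
    """Alternate which tool is entry 'A' on each task (task_index starts at 1)."""
    if task_index % 2 == 1:
        return {"claude": "A", "codex": "B"}   # assumed starting assignment
    return {"claude": "B", "codex": "A"}


def anonymize(task_index, outputs, dest_root):
    """Copy each tool's output folder to an A/B folder, dropping bytecode caches."""
    mapping = letter_mapping(task_index)
    for tool, source_dir in outputs.items():
        target = Path(dest_root) / f"task{task_index}" / mapping[tool]
        shutil.copytree(
            source_dir,
            target,
            ignore=shutil.ignore_patterns("__pycache__", "*.pyc"),
        )
    return mapping


print(letter_mapping(1), letter_mapping(2))
```

The returned mapping is what gets set aside until all verdicts are in; the judges only ever see the lettered folders.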
Instrumented verification
  • T1/T5/T6 run in a real browser (Playwright): fps via rAF counters, console-error capture, localStorage inspection, full-page screenshots.
  • T2 CLIs re-executed from scratch: double-run byte-diffs for determinism (SVG + PNG), gallery seed audit, README algorithm check.
  • T3 scored with the untouched judge-only verify_fixes.py (integrity-checked against master) + full unittest suite + line-level diffs against the pristine fixture.
  • T4 static feature audit + headless smoke runs; T5 CLEANING.md audited line-by-line against the 11-class dirt key.
Known caveats
  • One run per task per tool — same-agent variance can exceed the gap between agents. Treat single-task margins under 3 pts as noise (T6, T2).
  • Self-verification for T2/T4/T5 confirmed by the operator post-hoc (session transcripts weren't retained).
  • Judges are LLM agents, as the protocol specifies (a "third LLM that doesn't know which output is whose"), so style-based authorship inference can't be fully excluded.
Scoring integrity
  • Official 100-pt rubric kept exactly as the benchmark protocol defines it; the two supplementary parameters are reported separately and never enter the totals.
  • Half-point judge scores are preserved here — this page is the authoritative record of this evaluation pass.
  • Autonomy: 0 interventions on all 12 runs (operator-attested) → both sides 100%. The dimension is retained for protocol fidelity even though it doesn't discriminate.

What I actually learned

Three things stuck with me after staring at twelve runs.

The gap is judgment, not capability. Both agents shipped working software six times out of six, first try, unaided. That would have been science fiction two years ago. What separated them was taste under ambiguity: what to do with "make it charming", whether a negative quantity is a refund or a typo, whether an unplanted bug justifies a rewrite.

Codex's losses were interesting losses. The stdlib PNG encoder, the bonus cache-stampede race, the cleaner functional decomposition on T6 — gpt-5.5 repeatedly out-engineered Fable 5 in narrow cells. It just kept pairing those wins with one unforced error per task: a combo system that never fires, a sign-flip that fabricates revenue, a semantics-changing rewrite under a "minimal fixes" mandate.

Self-verification is the cheapest point on the board. The biggest swing in any cell outside the blind-judged ones (T1, 10 vs 5) came down to one agent playing its own game for a full run and the other taking a single screenshot. If you build with agents: making them prove their work to themselves is still where the easy wins are.

If you want to argue with the scoring — good, that's the point of publishing the whole card. Every number above traces to either an instrumented measurement or a quoted blind verdict, and the caveats section says exactly where the noise floor is.

Resources

If Task 2's generative art caught your eye and you want to understand the algorithm both agents reached for, George Savva's interactive essay is the best single read I know: Flow fields — how to make art with code.

Tags

Claude Code
Codex
Fable 5
gpt-5.5
AI Agents
