Agent BattleGround — Claude Code vs Codex

I keep reading agent-vs-agent benchmarks where the methodology is "vibes". So I ran my own. Six creative-coding tasks, the exact same prompt handed to Claude Code (Fable 5) and Codex (gpt-5.5), one run each, zero human help. Both agents ran deliberately bare: a single tool — Playwright for browser automation — and no skills, plugins, or MCP servers installed, so nothing but the model separates the two columns. Creativity and code quality were scored by blind judges who didn't know which entry was whose, and every claim was re-tested with instrumentation — real browser runs, byte-diffs, console-error capture, fps counters.

Every fight below is a card you can drive: the prompt both agents received, the screen recording of the run where I captured one (and the live artifact — actual playable games, galleries, dashboards, and demoscenes, embedded right in the card — where I didn't), the full scorecard, and the verdict with the blind judges' actual words. Two tasks also ship their exact inputs as downloads, so you can rerun the fight with your own agent.

Models: Fable 5 vs gpt-5.5
Tooling: Playwright only, no skills or plugins
Rubric: 100 pts (first-run 25 · requirements 20 · creativity 20 · code quality 15 · self-verification 10 · autonomy 10)
Creativity & code quality judged blind (anonymized A/B, independent judges)

The verdict

Summary	Claude Code (Fable 5)	Codex (gpt-5.5)
Overall avg	95.4	90.2
Per-task totals (T1–T6)	93 · 93 · 96 · 98.5 · 97 · 95	84 · 91 · 91.5 · 87.5 · 92.5 · 94.5
Task wins	6	0

Closest fight: T6 demoscene · +0.5

Both agents completed all six tasks with zero human interventions — autonomy was a clean 100% sweep on each side, so it never separates them. The gaps come from requirements precision, blind-judged creativity and quality, and self-verification depth. And Codex genuinely out-engineered Claude in spots — it won the code-quality cell on Tasks 5 and 6, wrote a zero-dependency PNG encoder, and found a race condition that wasn't even planted. The headline is closer than 6–0 suggests.

Evaluation parameters

Each pair of bars below is one rubric dimension, averaged across all six tasks as a percentage of maximum: orange is Claude Code, teal is Codex.

Dimension	What it measures	Max	Claude	Codex
First-run success	worked untouched, before any human fix	25 pts	100	100
Requirements coverage	checklist items met	20 pts	100	97.9
Creativity & polish	blind-judged 1–5	20 pts	88.3	73.3
Code quality	blind-judged 1–5	15 pts	85.0	78.3
Self-verification	tested own work before declaring done	10 pts	100	91.7
Autonomy	interventions needed	10 pts	100	100
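Each percentage is the points earned across the six tasks divided by the maximum available. A small sketch that recomputes the non-tied cells from the per-task scorecards further down; the helper name `percent_of_max` is an assumption, the point values are copied from the scorecards:

```python
def percent_of_max(points_per_task, max_points):
    """Average a rubric dimension across tasks as a percentage of its maximum."""
    earned = sum(points_per_task)
    available = max_points * len(points_per_task)
    return round(100 * earned / available, 1)


# Points per task (T1..T6), copied from the scorecards on this page.
codex_requirements = [17.5, 20, 20, 20, 20, 20]
claude_creativity = [16, 16, 16, 20, 20, 18]
codex_creativity = [16, 14, 16, 12, 14, 16]
claude_quality = [12, 12, 15, 13.5, 12, 12]
codex_quality = [10.5, 12, 10.5, 10.5, 13.5, 13.5]
codex_verification = [5, 10, 10, 10, 10, 10]

print(percent_of_max(codex_requirements, 20))   # 97.9
print(percent_of_max(claude_creativity, 20))    # 88.3
print(percent_of_max(codex_creativity, 20))     # 73.3
print(percent_of_max(claude_quality, 15))       # 85.0
print(percent_of_max(codex_quality, 15))        # 78.3
print(percent_of_max(codex_verification, 10))   # 91.7
```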

Now the six fights. Each card shows the exact prompt and its requirement checklist right under the headline, then the tabs flip between Results, Scorecard, and Verdict. Task 4 has the screen recordings; Tasks 1, 2, 5, and 6 embed the live artifacts each agent produced — the Task 1 games and Task 6 demoscenes are fully playable right in the card — and Task 3 shows each agent's final run report.

Task 1 — Neon Drift

Task 1 · Neon Drift — one-shot arcade game

Claude +9

Build "Neon Drift" — a playable browser game in a single index.html (vanilla JS + canvas, no libraries). A ship dodges asteroids that spawn faster over time. Requirements: particle effects on collision, screen shake, a combo/score system with localStorage high score, WebAudio sound effects generated in code, a start screen and game-over screen, and smooth 60fps. When done, open it in the browser yourself, play one full run, and fix anything broken before declaring it finished.

Requirements

① single file, no libraries · ② particles · ③ screen shake · ④ combo/score + localStorage · ⑤ WebAudio in code · ⑥ start screen · ⑦ game-over screen · ⑧ ~60fps

Claude 93 · Codex 84

Claude Code · Fable 5 · Neon Drift · Open full ↗

Codex · gpt-5.5 · Neon Drift · Open full ↗

Both games are live and playable in the frames — click inside one first so it grabs your keyboard. Space or a click starts Claude's run; "Start Run" kicks off Codex's. Arrows / WASD steer both. For the proper full-screen experience, hit Open full.

Parameter	Claude	Codex	Notes
First-run success /25	y → 25	y → 25	Both playable untouched
Requirements /20	8/8 → 20	7/8 → 17.5	Codex combo never engages in real play (meter decays 0.7/s vs 4-graze threshold)
Creativity (blind) /20	4 → 16	4 → 16	Tie — different strengths
Code quality (blind) /15	4 → 12	3.5 → 10.5	Codex: unused param, duplicated patterns, double combo computation
Self-verification /10	2 → 10	1 → 5	Claude played a full run, 6 screenshots; Codex took one game-over shot
Autonomy /10	0 int. → 10	0 int. → 10
Total	93	84
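The arrows in the scorecard (4 → 16, 7/8 → 17.5, 1 → 5) imply a linear conversion from raw cells to points. A minimal sketch of that conversion, inferred from the numbers on this page rather than taken from the benchmark protocol; the function name and the all-or-nothing treatment of first-run and autonomy are assumptions:

```python
def rubric_total(first_run_ok, req_met, req_total,
                 creativity, quality, verification, interventions):
    """Convert one run's raw cells into the 100-point rubric total."""
    first_run = 25.0 if first_run_ok else 0.0      # y -> 25
    requirements = 20.0 * req_met / req_total      # 7/8 -> 17.5
    creativity_pts = creativity * 4.0              # judge score 1-5 -> up to 20
    quality_pts = quality * 3.0                    # judge score 1-5 -> up to 15
    verification_pts = verification * 5.0          # 2 -> 10, 1 -> 5
    # Assumption: the page only shows the zero-intervention case (0 -> 10).
    autonomy = 10.0 if interventions == 0 else 0.0
    return (first_run + requirements + creativity_pts
            + quality_pts + verification_pts + autonomy)


claude_t1 = rubric_total(True, 8, 8, 4, 4, 2, 0)
codex_t1 = rubric_total(True, 7, 8, 4, 3.5, 1, 0)
print(claude_t1, codex_t1)  # 93.0 84.0
```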

Blind judge on Claude's entry

"Stronger game-feel invention: a real graze mechanic with a combo timer bar, escalating near-miss pitch, distinct game-over jingle vs high-score fanfare, motion-blur trails… well modularized (AudioFX module with master bus), guards localStorage with try/catch."

Blind judge on Codex's entry

"Strong cohesive art direction: designed DOM HUD, overlay panels with kbd hints, high-score toast, vignette, spawn invincibility… but combo is computed twice, spawnAsteroid's forceEdge param is never used, and an awkward particle pattern is duplicated."

Decisive difference: Claude spent its polish on game design and architecture; Codex spent it on DOM chrome. The broken-in-practice combo and single-screenshot verification cost Codex 7.5 pts of the 9-pt gap.

Instrumented evidence

Claude: unattended run reached combo ×3, score 609 persisted to localStorage; 118 fps mid-game; zero console errors.
Codex: combo stuck at ×1 (matches my live-play experience); 75 fps mid-game, above the 60 fps target but roughly 36% below Claude's 118; high score persisted correctly.

Task 2 — artgen, a generative-art CLI

Task 2 · artgen — generative-art CLI

Claude +2

Create a CLI tool called "artgen" (Python or Node, your choice) that generates seeded generative art using flow fields. Requirements: artgen --seed 42 --style <ink|neon|organic> --out piece.svg with three visually distinct styles, identical seed always produces identical output, PNG export option, and an artgen gallery command that renders a 3x3 HTML gallery of seeds 1-9. Include a README explaining the algorithm.

Requirements

① CLI flags as specified · ② 3 distinct styles · ③ seed determinism · ④ PNG export · ⑤ 3×3 gallery, seeds 1–9 · ⑥ README explains algorithm

Claude 93 · Codex 91

Claude Code · Fable 5 · gallery, organic style · Open full ↗

Codex · gpt-5.5 · gallery, organic style · Open full ↗

Both agents reached for the same classic technique: a flow field. Learn more about it here — Art gen & flow fields ↗
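For readers who haven't met the technique: a flow field assigns a direction to every point on the canvas, and strokes are traced by stepping particles along those directions. The sketch below is neither agent's code. It swaps the noise function for a plain sin/cos field to stay short, and `render_svg` and `svg_digest` are hypothetical names. What it does show is why seeding a private `random.Random` and formatting coordinates at fixed precision gives identical output for identical seeds:

```python
import hashlib
import math
import random


def render_svg(seed, lines=5, steps=40, step_len=2.0):
    """Trace a few particles through a toy flow field and return SVG text."""
    rng = random.Random(seed)  # private RNG: no hidden global state
    phase = rng.uniform(0.0, math.tau)
    shapes = []
    for _ in range(lines):
        x, y = rng.uniform(0, 100), rng.uniform(0, 100)
        points = []
        for _ in range(steps):
            # The "field": a direction angle that depends only on position.
            angle = math.sin(x * 0.05 + phase) + math.cos(y * 0.05)
            x += math.cos(angle) * step_len
            y += math.sin(angle) * step_len
            points.append(f"{x:.2f},{y:.2f}")  # fixed precision, stable text
        joined = " ".join(points)
        shapes.append(f'<polyline points="{joined}" fill="none" stroke="black"/>')
    body = "\n".join(shapes)
    return f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">\n{body}\n</svg>\n'


def svg_digest(seed):
    """Hash the rendered bytes, the same idea as a double-run byte-diff."""
    return hashlib.sha256(render_svg(seed).encode("utf-8")).hexdigest()


print(svg_digest(42) == svg_digest(42))  # True
print(svg_digest(42) == svg_digest(43))  # False
```

The determinism check in the instrumented evidence is the same comparison done on real files: run twice, compare bytes.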

Parameter	Claude	Codex	Notes
First-run success /25	y → 25	y → 25	Every command worked when re-run by the judge
Requirements /20	6/6 → 20	6/6 → 20	Codex matches spec flags literally (--png); Claude via --out x.png / --format
Creativity (blind) /20	4 → 16	3.5 → 14	Claude's 3 styles structurally distinct; Codex's ink/organic differ mostly in palette
Code quality (blind) /15	4 → 12	4 → 12	Tie — different virtues (scene-model parity vs dataclasses + typed package)
Self-verification /10	2 → 10	2 → 10	Both fully tested their own CLIs
Autonomy /10	0 int. → 10	0 int. → 10
Total	93	91

Blind judge on Claude's entry

"Three styles genuinely distinct in structure, not just palette: dendritic sumi-e strokes with a vermilion seal, triple-pass neon glow, tendrils with seed dots. The README is the best creative documentation of the pair — explains the from-scratch Perlin/fBm field and exactly why output is byte-reproducible."

Blind judge on Codex's entry

"Individually the more beautiful pieces — the neon marbling is striking, helped by midpoint integration and SVG paper-noise/glow filters… real engineering ambition (frozen dataclasses, hand-rolled stdlib PNG encoder with anti-aliased rasterization), but ink and organic share nearly identical compositions."

Decisive difference: Codex produced the single prettiest image; Claude won the stated criteria: style distinctiveness and README quality. Closest creative call of the benchmark after T6.

Instrumented evidence

Determinism: double-run diff = byte-identical for SVG and PNG, both tools, multiple styles.
Gallery: both render 9 inline SVGs, seeds 1–9.
Dependency surface: Codex PNG export is pure stdlib; Claude requires Pillow for PNG (SVG/gallery dependency-free).
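To give a sense of what "pure stdlib" PNG export involves, here is a minimal encoder for 8-bit RGB images using only `struct` and `zlib`. It is an illustration of the file format, not Codex's implementation (which per the judge also does anti-aliased rasterization); `encode_png` is a hypothetical name:

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(tag, data):
    """A PNG chunk: 4-byte length, 4-byte type, data, CRC32 over type + data."""
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)


def encode_png(width, height, rgb):
    """Encode raw RGB bytes (3 bytes per pixel, row-major) as a PNG file."""
    if len(rgb) != width * height * 3:
        raise ValueError("rgb must hold width * height * 3 bytes")
    stride = width * 3
    # Each scanline is prefixed with filter type 0 (no filtering).
    raw = b"".join(
        b"\x00" + rgb[row * stride:(row + 1) * stride] for row in range(height)
    )
    # IHDR fields: width, height, bit depth 8, color type 2 (truecolor),
    # then compression, filter, and interlace methods all 0.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (PNG_SIGNATURE
            + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", zlib.compress(raw, 9))
            + _chunk(b"IEND", b""))


# A 2x2 solid red image.
png_bytes = encode_png(2, 2, bytes([255, 0, 0] * 4))
```

Because the output depends only on the pixel bytes and the compression level, an encoder like this also stays byte-reproducible across runs.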

Task 3 — Bug hunt

Task 3 · Bug hunt — MiniMart fixture

Claude +4.5

This project contains 6 intentionally planted bugs (logic errors, an off-by-one, an async race condition, a mutation bug, and edge-case failures). Find and fix all of them. Do not rewrite files wholesale — make minimal targeted fixes. All existing tests must pass, and write BUGS.md documenting each bug, its root cause, and your fix.

Requirements

①–⑥ six planted bugs fixed · ⑦ complete BUGS.md · verify_fixes.py 6/6 · unittest suite green · diff minimality eyeballed

Claude 96 · Codex 91.5

Claude Code · Fable 5 · finished in 1m 49s

● All 6 bugs found, fixed, and documented in BUGS.md. All 14 existing tests pass, plus targeted checks for what the suite doesn't cover.

Coupon threshold (cart.py:42) — apply_coupon used > instead of >=, so a cart at exactly the $50 minimum missed the discount
Mutation in merge_carts (cart.py:59) — merged list aliased the primary cart's items; fixed by copying item dicts
Off-by-one in top_sellers (inventory.py:35) — ranked[:n-1] returned n−1 products; fixed to ranked[:n]
Async race (fetcher.py:27–29) — read→await→write on the billing counter; replaced with an atomic +=, then verified 50 concurrent fetches count exactly 50
Thousands-separator parsing (utils.py:15) — parse_amount("$1,299.99") raised ValueError; commas stripped before float()
Division by zero (inventory.py:40) — average_price() crashed on an empty catalog; now returns 0.0

Each fix was a one-to-two-line targeted change; no files were rewritten.
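The five non-async fixes in that report are each a one-line class of change. A sketch of what they look like in isolation follows. The function names match the report, but the signatures, the 10% discount, and the data shapes are assumptions, since the fixture's source isn't reproduced on this page:

```python
def apply_coupon(total, minimum=50.0, discount=0.10):
    """Threshold fix: >= so a cart at exactly the minimum qualifies."""
    return total * (1 - discount) if total >= minimum else total


def merge_carts(primary, secondary):
    """Mutation fix: copy each item dict instead of aliasing the originals."""
    return [dict(item) for item in primary] + [dict(item) for item in secondary]


def top_sellers(ranked, n):
    """Off-by-one fix: ranked[:n], not ranked[:n - 1]."""
    return ranked[:n]


def parse_amount(text):
    """Parsing fix: strip the currency symbol and thousands separators."""
    return float(text.replace("$", "").replace(",", "").strip())


def average_price(prices):
    """Edge-case fix: an empty catalog returns 0.0 instead of dividing by zero."""
    return sum(prices) / len(prices) if prices else 0.0


cart = [{"sku": "A1", "qty": 1}]
merged = merge_carts(cart, [])
merged[0]["qty"] = 99
print(cart[0]["qty"])  # 1, the original cart is untouched
```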

Codex · gpt-5.5 · finished in 5m 17s

● Implemented all 6 targeted fixes and added BUGS.md.

Changed:

cart.py:39 — coupon threshold and independent cart merging
inventory.py:35 — top-seller slice and empty average handling
fetcher.py:16 — async in-flight fetch handling and safe count increments
utils.py:15 — thousands separator parsing

Verification passed:

python -B -m unittest discover -s tests -t .
Focused checks for merge independence, empty inventory average, and concurrent fetch counting/deduping

Try it yourself

Download the exact MiniMart fixture both agents received — 6 planted bugs, untouched. Hand it to your favorite agent (or yourself) with the prompt above, then run the judge's verifier to see how many of the 6 you actually caught. The verifier names every bug, so don't peek until you're done.

⬇ bug-hunt-fixture.zip · 5 KB · the buggy project
⬇ verify_fixes.py · judge's verifier · spoilers

Parameter	Claude	Codex	Notes
First-run success /25	y → 25	y → 25	verify_fixes.py: 6/6 both · unittest: 14/14 green both
Requirements /20	7/7 → 20	7/7 → 20	6 bugs + complete BUGS.md each
Creativity / insight (blind) /20	4 → 16	4 → 16	Claude: deeper docs · Codex: bonus race found
Code quality (blind) /15	5 → 15	3.5 → 10.5	Surgical 1–2-line fixes vs scope-creep in merge_carts
Self-verification /10	2 → 10	2 → 10	Both ran the suite; Claude documented a 50-task race verification
Autonomy /10	0 int. → 10	0 int. → 10
Total	96	91.5
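The race at the center of this task is worth seeing in miniature. In asyncio, a coroutine can only be interrupted at an `await`, so a read, then an await, then a write lets every concurrent task read the same stale value, while a `+=` with no await inside it cannot be interleaved. The sketch below is not the fixture's code; `Billing`, `racy_fetch`, and `safe_fetch` are hypothetical stand-ins:

```python
import asyncio


class Billing:
    def __init__(self):
        self.count = 0


async def racy_fetch(billing):
    current = billing.count        # read
    await asyncio.sleep(0)         # yield to the event loop (simulated I/O)
    billing.count = current + 1    # write a value computed from a stale read


async def safe_fetch(billing):
    await asyncio.sleep(0)         # do the I/O first
    billing.count += 1             # no await between read and write


async def count_after(worker, tasks=50):
    billing = Billing()
    await asyncio.gather(*(worker(billing) for _ in range(tasks)))
    return billing.count


print(asyncio.run(count_after(safe_fetch)))   # 50
print(asyncio.run(count_after(racy_fetch)))   # fewer than 50: updates are lost
```

This is the shape of the 50-concurrent-fetch check mentioned in the scorecard notes: run many tasks at once and assert the counter matches the task count.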

Blind judge on Claude's entry

"Exemplary BUGS.md: structured Bug/Root-cause/Fix sections, a snippet of the racy read-await-write, a precise explanation of why += is atomic w.r.t. the event loop, and a verification claim. Six maximally surgical fixes with no behavioral side effects."

Blind judge on Codex's entry

"Spotted a genuine secondary race — same-SKU requests stampeding past the cache — and fixed it correctly. But the merge_carts rewrite via add_item changes duplicate-SKU semantics (quantities merge, price conflicts silently resolve), new behavioral risk the 'minimal targeted fixes' mandate warned against."

Decisive difference: Discipline vs depth. Codex showed the benchmark's single most impressive insight (the unplanted cache-stampede race, the answer key's "bonus signal"), but bundled it with a semantics-changing rewrite. Claude delivered risk-free minimality with deeper documentation.

Instrumented evidence

Verifier integrity: both runs' verify_fixes.py byte-identical to the judge-only master (no tampering).
Diff vs pristine fixture: Claude 6 hunks, ~7 lines; Codex ~5 hunks incl. a 12-line in-flight task map in fetcher.py.
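An "in-flight task map" is the standard cure for a cache stampede: when several callers ask for the same key before the first fetch finishes, they all await one shared task instead of each hitting the backend. A minimal sketch of the pattern, not Codex's actual 12 lines; the class and method names are hypothetical:

```python
import asyncio


class InFlightFetcher:
    """Cache results and share one in-flight task per key."""

    def __init__(self, fetch):
        self._fetch = fetch      # async callable: key -> value
        self._cache = {}
        self._in_flight = {}     # key -> asyncio.Task currently loading it

    async def get(self, key):
        if key in self._cache:
            return self._cache[key]
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key starts the fetch; later callers reuse it.
            task = asyncio.create_task(self._load(key))
            self._in_flight[key] = task
        return await task

    async def _load(self, key):
        try:
            value = await self._fetch(key)
            self._cache[key] = value
            return value
        finally:
            # Always clear the slot so a failed fetch can be retried.
            self._in_flight.pop(key, None)


backend_calls = []


async def slow_backend(sku):
    backend_calls.append(sku)
    await asyncio.sleep(0.01)
    return {"sku": sku}


async def demo():
    fetcher = InFlightFetcher(slow_backend)
    return await asyncio.gather(*(fetcher.get("A1") for _ in range(20)))


results = asyncio.run(demo())
print(len(backend_calls))  # 1
```

It is a real fix for a real race, which is why it earned the bonus; the cost is that it adds state and lines a minimal-diff mandate didn't ask for.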

Task 4 — ASCII aquarium

Task 4 · ASCII aquarium screensaver

Claude +11

Build an ASCII aquarium screensaver in a single Python file using only the standard library. Requirements: at least 4 fish species with distinct movement behaviors, rising bubbles, seaweed that sways, a day/night color cycle, graceful handling of terminal resize, and pressing 'q' exits cleanly restoring the terminal. Make it charming.

Requirements

① stdlib only · ② ≥4 species, distinct behaviors · ③ bubbles · ④ swaying seaweed · ⑤ day/night cycle · ⑥ survives resize · ⑦ clean 'q' exit

Claude 98.5 · Codex 87.5

Claude Code · Fable 5

Codex · gpt-5.5

Watch Claude's jellyfish glow brighter at night and the minnow school follow its leader — then compare Codex's "(sun)".

Parameter	Claude	Codex	Notes
First-run success /25	y → 25	y → 25	q-exit & resize clean on both
Requirements /20	7/7 → 20	7/7 → 20	6 species vs 5, both stdlib-only
Creativity (blind) /20	5 → 20	3 → 12	Widest creative gap of the benchmark
Code quality (blind) /15	4.5 → 13.5	3.5 → 10.5	Codex: dead code, identical-branch conditional, hue-shifting night tint
Self-verification /10	2 → 10	2 → 10	Both self-tested
Autonomy /10	0 int. → 10	0 int. → 10
Total	98.5	87.5
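The "hue-shifting night tint" note is easier to follow with numbers. In the common xterm 256-color layout, indices 16–231 form a 6×6×6 cube (index = 16 + 36·r + 6·g + b), so multiplying an index by a brightness factor lands on an unrelated color. The sketch below illustrates that failure mode and the 24-bit alternative. It is a reconstruction of the kind of bug the judge describes, not Codex's `_night_tint`, and all function names are hypothetical:

```python
def cube_levels(index):
    """Split an xterm color-cube index (16-231) into (r, g, b) levels 0-5."""
    offset = index - 16
    return offset // 36, (offset % 36) // 6, offset % 6


def dim_by_scaling_index(index, factor):
    """The buggy idea: treat the palette index as if it were a brightness."""
    return max(16, int(index * factor))


def dim_rgb(rgb, factor):
    """The safe idea: scale each channel, which preserves hue."""
    return tuple(int(channel * factor) for channel in rgb)


def truecolor_fg(rgb):
    """ANSI escape for a 24-bit foreground color."""
    r, g, b = rgb
    return f"\x1b[38;2;{r};{g};{b}m"


yellow = 226
print(cube_levels(yellow))                              # (5, 5, 0)
print(cube_levels(dim_by_scaling_index(yellow, 0.5)))   # (2, 4, 1)
print(dim_rgb((255, 255, 0), 0.5))                      # (127, 127, 0)
```

Half-brightness yellow should still have equal red and green; the index-scaled version comes out green-dominant with a blue component, which is the arbitrary hue shift.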

Blind judge on Claude's entry

"Far beyond spec: sine-bobbing goldfish, dash-and-coast darters, a leader-following minnow school, napping whiskered catfish, plus a crab, a bioluminescent jellyfish that glows brighter at night, a bubble-burping treasure chest, twinkling stars, a traversing sun/moon, depth-graded 24-bit water."

Blind judge on Codex's entry

"Meets the spec pleasantly with five genuinely distinct species… but the art is thin (the eel is four tildes, the sun is literally the text '(sun)'), the day/night cycle is a crude 256-color index shift, and _night_tint dims by scaling palette indices, which shifts hue arbitrarily."

Decisive difference: Ambition of the world. The prompt's only subjective clause was "make it charming": Claude built an ecosystem, Codex built a checklist.

Instrumented evidence

Both stdlib-only (no curses — ANSI escapes; Codex uses ctypes for Windows VT mode, Claude msvcrt/termios dual-path).
Headless smoke test: Claude renders frames even without a TTY (has a --frames self-test flag); Codex exits politely: "ASCII Aquarium needs an interactive terminal." — graceful, by design.

Task 5 — Messy data → story

Task 5 · sales.csv → dashboard

Claude +4.5

Clean the attached sales.csv (it has mixed date formats, duplicate rows, inconsistent category spellings, nulls, and currency symbols mixed into numbers). Document every cleaning decision in CLEANING.md. Then build a single self-contained dashboard.html (data inlined, no network calls) with 4 charts and a written "3 key insights" section that a non-technical manager would find genuinely useful.

Requirements

① CLEANING.md documents decisions · ② ≥9 of 11 dirt classes handled · ③ 4 charts · ④ 3 useful insights · ⑤ data inlined, zero network · ⑥ renders offline

Claude 97 · Codex 92.5

Claude Code · Fable 5 · dashboard.html · Open full ↗

Codex · gpt-5.5 · dashboard.html · Open full ↗

Why the two headline numbers disagree

Scroll both dashboards and you'll notice they don't even agree on total revenue: Claude reports $101,521.61 across 256 rows, Codex $102,504.61 across 260. The gap, exactly $983.00, is four rows with negative quantities. Claude read them as refunds: excluded from sales, documented in CLEANING.md, preserved in a cleaning_report.json for a returns analysis. Codex called them "sign-entry errors" and flipped them positive, counting $983.00 of returns as if they were sales. The dirt key plants those rows as legitimate refunds, so Codex's bigger number is the wrong one; that single judgment call is what the blind judge dinged its insight score for, even while scoring Codex's pipeline engineering above Claude's. Every other cleaning decision (duplicates, the $99,999 Rain Jacket, median-imputed prices) they made identically.
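The two policies differ by a single branch. A toy sketch with invented rows (not taken from sales.csv) and hypothetical function names, using `Decimal` so the totals are exact:

```python
from decimal import Decimal


def parse_money(text):
    """Strip currency symbols and thousands separators before parsing."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


# Invented sample rows; the middle one is a return.
rows = [
    {"quantity": 2, "unit_price": "$19.99"},
    {"quantity": -1, "unit_price": "$45.00"},
    {"quantity": 3, "unit_price": "1,200.50"},
]


def revenue_excluding_refunds(rows):
    """Policy A: negative quantities are refunds, kept aside for a returns report."""
    sales = [r for r in rows if r["quantity"] > 0]
    refunds = [r for r in rows if r["quantity"] < 0]
    total = sum((r["quantity"] * parse_money(r["unit_price"]) for r in sales),
                Decimal("0"))
    return total, refunds


def revenue_sign_flipped(rows):
    """Policy B: treat the minus sign as a typo and count the row as a sale."""
    return sum((abs(r["quantity"]) * parse_money(r["unit_price"]) for r in rows),
               Decimal("0"))


excluded_total, refunds = revenue_excluding_refunds(rows)
flipped_total = revenue_sign_flipped(rows)
print(excluded_total)                   # 3641.48
print(flipped_total)                    # 3686.48
print(flipped_total - excluded_total)   # 45.00
```

The difference between the totals is always the absolute value of the refund rows, which is how four rows account for the whole $983.00 gap between the dashboards.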

Try it yourself

Download the exact messy sales.csv both agents received — all 11 dirt classes planted, untouched. Run the prompt above through your own agent and see which call it makes on the four negative-quantity rows. The dirt key lists every planted trap, so open it only after.

⬇ sales.csv · 22 KB · the messy data
⬇ sales-dirt-key.md · judge's dirt key · spoilers

Parameter	Claude	Codex	Notes
First-run success /25	y → 25	y → 25	Both render offline untouched (Codex with hidden console errors)
Requirements /20	11/11 dirt · 6/6 → 20	11/11 dirt · 6/6 → 20	Threshold was ≥9 of 11 dirt classes — both cleared it fully
Creativity / insight (blind) /20	5 → 20	3.5 → 14	Managerial story vs templated number-restating
Code quality (blind) /15	4 → 12	4.5 → 13.5	Codex's pipeline engineering won this cell
Self-verification /10	2 → 10	2 → 10	Both opened their dashboards
Autonomy /10	0 int. → 10	0 int. → 10
Total	97	92.5

Blind judge on Claude's entry

"Best-in-class reasoning on ambiguous data: per-style date-locale inference with evidence and a residual-risk note, returns excluded with explicit justification, fat-finger outlier vs genuine high price distinguished. Insights explain why May spiked, warn against budgeting off it, and flag concentration risks with concrete Action lines."

Blind judge on Codex's entry

"An exemplary reproducible pipeline — Decimal math, validation assertions, one script regenerates CSV, CLEANING.md and dashboard… but the insights are templated number-restatements, and flipping negative quantities to positive fabricates sales rather than treating them as probable returns."

Decisive difference: Analytic judgment. The dirt key's trap rows (refunds, date-locale ambiguity) reward reasoning: Claude excluded refunds with justification and proved its date rules; Codex sign-flipped refunds into $983.00 of fake revenue, a documented call but the weaker one.

Instrumented evidence

Dirt-key audit: both CLEANING.md files address all 11 planted classes; both caught the $99,999 Rain Jacket and the two-locale date trap.
Browser run: Claude 0 console errors; Codex 8 errors (negative SVG rect widths at first paint — recovers visually, would break on narrow viewports).
Network: both fully inlined, zero external requests; row accounting differs only by the refund decision (256 vs 260 rows).
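The two-locale date trap has a simple evidence-based test: within one date style, any component above 12 can only be a day, which settles the field order for that whole style. A hedged sketch of the idea for slash-separated dates; it is not either agent's pipeline, and the function names are hypothetical:

```python
from datetime import date


def infer_day_first(samples):
    """Return True (DD/MM), False (MM/DD), or None when the evidence is ambiguous."""
    first_over_12 = any(int(s.split("/")[0]) > 12 for s in samples)
    second_over_12 = any(int(s.split("/")[1]) > 12 for s in samples)
    if first_over_12 and not second_over_12:
        return True
    if second_over_12 and not first_over_12:
        return False
    return None  # no evidence, or contradictory evidence: flag it, don't guess


def parse_slash_date(text, day_first):
    first, second, year = (int(part) for part in text.split("/"))
    day, month = (first, second) if day_first else (second, first)
    return date(year, month, day)


batch = ["03/04/2024", "25/04/2024", "11/12/2024"]
order = infer_day_first(batch)
print(order)                                 # True
print(parse_slash_date("03/04/2024", order)) # 2024-04-03
```

The `None` branch is where a residual-risk note belongs: a batch with no value above 12 cannot be resolved from the data alone.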

Task 6 — 120-line demoscene

Task 6 · 120-line demoscene

Claude +0.5

Create the most impressive interactive experience you can in ONE HTML file of at most 120 lines (no minification tricks, readable code, no external resources). You choose what it is. Surprise me. Then open it yourself and verify it works.

Requirements

① ≤120 lines (wc -l) · ② no external resources · ③ readable, not minified · ④ works on open · ⑤ self-verified

Claude 95 · Codex 94.5

Claude Code · Fable 5 · "Galaxy Sandbox" · Open full ↗

Codex · gpt-5.5 · "Orbit Forge" · Open full ↗

The photo finish — and these aren't videos, both demos are running live in the frames. Click inside Claude's galaxy to fire a shockwave; move and hold inside Codex's forge to bend the flow. Pick your own winner before reading the verdict.

Parameter	Claude	Codex	Notes
First-run success /25	y → 25	y → 25	Galaxy Sandbox · Orbit Forge — both flawless on open
Requirements /20	5/5 → 20	5/5 → 20	118 vs 119 lines, zero external resources, readable
Creativity (blind) /20	4.5 → 18	4 → 16	3D engine + audio vs gorgeous 2D plasma
Code quality (blind) /15	4 → 12	4.5 → 13.5	Codex's functional decomposition won this cell
Self-verification /10	2 → 10	2 → 10	Both opened & screenshot-verified their pages
Autonomy /10	0 int. → 10	0 int. → 10
Total	95	94.5
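"3D engine" sounds heavier than it is: the core is a rotation followed by a perspective divide. A sketch of those two steps, written in Python for consistency with the other examples on this page (the actual entry is JavaScript, and its code isn't reproduced here). The focal length, camera distance, and screen center are arbitrary assumptions:

```python
import math


def rotate_y(point, angle):
    """Rotate a 3D point around the vertical (Y) axis."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (x * c + z * s, y, -x * s + z * c)


def project(point, focal=300.0, camera_distance=4.0, center=(400.0, 300.0)):
    """Perspective projection: divide by depth so far points shrink."""
    x, y, z = point
    scale = focal / (z + camera_distance)
    screen_x = center[0] + x * scale
    screen_y = center[1] - y * scale   # screen Y grows downward
    return screen_x, screen_y, scale   # scale can double as star size


print(project((0.0, 0.0, 0.0)))    # (400.0, 300.0, 75.0)
print(project((1.0, 0.0, -1.0)))   # (500.0, 300.0, 100.0)
print(project((1.0, 0.0, 1.0)))    # (460.0, 300.0, 60.0)
```

Apply `rotate_y` with a slowly increasing angle to every star each frame, then `project`, and a flat canvas reads as a spinning volume.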

Blind judge on Claude's entry

"A hand-rolled 3D engine — spiral-arm generation, rotation matrices, perspective projection, spring-back physics, a world-space-corrected shockwave, plus a pentatonic WebAudio chime — genuinely ambitious for ~100 lines."

Blind judge on Codex's entry

"Visually the more striking entry — speed-mapped hues and additive trails around three gravity wells produce a gorgeous plasma look… The code is exemplary: small single-purpose functions, clean naming, DPR handling, pointer capture, even an aria-label."

Decisive difference: The photo finish of the card. Claude attempted and landed the technically harder feat; Codex countered with the more immediately beautiful render and cleaner code. Half a point.

Instrumented evidence

Line counts: 118 vs 119 (limit 120); grep confirms zero external URLs/scripts/fonts in both.
Live run: Claude 63 fps with 6,000 stars; Codex 121 fps, 587 live particles, working HUD readout.
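The line-count and external-resource checks are easy to reproduce on your own entry. A rough Python equivalent of `wc -l` plus a grep, assuming that "external" means an http(s) or protocol-relative URL in a `src`/`href` attribute, a CSS `url(...)`, or an `@import`. The function name and the exact patterns are assumptions, not the evaluator's script:

```python
import re

# Matches src=/href= attributes, CSS url(...), and @import rules that point
# at http(s) or protocol-relative URLs. Namespace strings such as the SVG
# xmlns are deliberately not matched.
EXTERNAL_REF = re.compile(
    r'''(?:\b(?:src|href)\s*=\s*["']?|url\(\s*["']?|@import\s+["']?)\s*(?:https?:)?//''',
    re.IGNORECASE,
)


def audit_html(text, max_lines=120):
    """Report line count and any external references in an HTML document."""
    line_count = len(text.splitlines())
    external = [m.group(0) for m in EXTERNAL_REF.finditer(text)]
    return {
        "lines": line_count,
        "within_limit": line_count <= max_lines,
        "external_refs": external,
    }


clean = "<canvas id='c'></canvas>\n<script>\nlet t = 0;\n</script>\n"
dirty = '<script src="https://cdn.example.com/lib.js"></script>\n'

print(audit_html(clean))
print(audit_html(dirty)["external_refs"])
```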

Beyond the rubric

Autonomy tied at 100% on both sides, so I added two evidence-based parameters to keep separating signal. They're reported alongside — not mixed into — the official rubric, so totals stay comparable. Runtime robustness: defects surfaced under instrumented re-testing (console errors, fps headroom, edge handling, dependency risk). Beyond-spec depth: verified work past the literal checklist.

Task	Robustness (C / X)	Beyond-spec (C / X)	What drove it
1 · Neon Drift	5 / 3.5	4.5 / 4	Codex combo unreachable in real play + lower fps headroom; Claude's graze mechanic & audio bus vs Codex's HUD chrome
2 · artgen	4 / 5	4 / 4.5	Codex: zero-dependency PNG encoder (stdlib zlib) & spec-literal flags; Claude needs Pillow for PNG
3 · Bug hunt	5 / 4	3.5 / 4.5	Codex found+fixed an unplanted cache-stampede race (depth) but its merge rewrite changes duplicate-SKU semantics (risk)
4 · Aquarium	4.5 / 4	5 / 3	Claude runs even headless w/ small-terminal fallback + far richer world; Codex has dead code & a night-tint hue bug
5 · Dashboard	5 / 3.5	5 / 4	Codex: 8 console errors (negative SVG widths) + refund sign-flip fabricates $983.00 of revenue; Claude: date-locale inference w/ evidence
6 · Demoscene	5 / 5	4.5 / 4	Both flawless under instrumentation; Claude adds generative audio + true 3D engine in the same budget
Average	4.75 / 4.17	4.42 / 4.00

Run log

Full run log — all 12 runs, every rubric cell

Task	Tool	First-run	Requirements	Creativity	Code quality	Self-verification	Model	Total
1 · Neon Drift	claude	y	8/8	4	4	2	Fable 5	93
1 · Neon Drift	codex	y	7/8	4	3.5	1	gpt-5.5	84
2 · artgen	claude	y	6/6	4	4	2	Fable 5	93
2 · artgen	codex	y	6/6	3.5	4	2	gpt-5.5	91
3 · Bug hunt	claude	y	7/7	4	5	2	Fable 5	96
3 · Bug hunt	codex	y	7/7	4	3.5	2	gpt-5.5	91.5
4 · Aquarium	claude	y	7/7	5	4.5	2	Fable 5	98.5
4 · Aquarium	codex	y	7/7	3	3.5	2	gpt-5.5	87.5
5 · Dashboard	claude	y	6/6	5	4	2	Fable 5	97
5 · Dashboard	codex	y	6/6	3.5	4.5	2	gpt-5.5	92.5
6 · Demoscene	claude	y	5/5	4.5	4	2	Fable 5	95
6 · Demoscene	codex	y	5/5	4	4.5	2	gpt-5.5	94.5
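The headline numbers follow directly from the Total column of this log. A short sketch that recomputes the averages, the win count, the closest fight, and which margins fall under the 3-point noise floor named in the caveats; variable names are illustrative:

```python
from statistics import mean

# (Claude total, Codex total) per task, copied from the run log.
totals = {
    "1 · Neon Drift": (93, 84),
    "2 · artgen": (93, 91),
    "3 · Bug hunt": (96, 91.5),
    "4 · Aquarium": (98.5, 87.5),
    "5 · Dashboard": (97, 92.5),
    "6 · Demoscene": (95, 94.5),
}

claude_avg = round(mean(c for c, _ in totals.values()), 1)
codex_avg = round(mean(x for _, x in totals.values()), 1)
margins = {task: c - x for task, (c, x) in totals.items()}
claude_wins = sum(1 for m in margins.values() if m > 0)
closest = min(margins, key=lambda task: abs(margins[task]))
within_noise = [task for task, m in margins.items() if abs(m) < 3]

print(claude_avg, codex_avg)   # 95.4 90.2
print(claude_wins)             # 6
print(closest)                 # 6 · Demoscene
print(within_noise)            # ['2 · artgen', '6 · Demoscene']
```

Read with the noise floor applied, the record is four clear wins and two fights too close to call on a single run each.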

Methodology & fairness

Tooling parity

Both agents were given exactly one tool — Playwright for browser automation — with no skills, plugins, or MCP servers installed on either side. Stock harnesses, same prompts, so the only variable is the model.

Blind judging

Outputs copied to anonymized A/B folders with alternating tool→letter mapping per task; __pycache__ and any path-identifying files stripped.
Six independent judge agents, one per task, scored creativity and code quality 1–5 without knowing which tool made which entry, with identical evidence per side (uniform screenshots captured by the evaluator, not the contestants).
Judges were instructed not to guess authorship and to use the full scale. Results de-anonymized only after all six verdicts were returned.
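For anyone rebuilding the anonymization step, the alternating mapping can be as simple as a parity rule. The sketch below assumes odd-numbered tasks map Claude to A and even-numbered tasks map Claude to B; the actual assignment used in this run isn't stated on the page, and the helper names and the set of stripped path parts are hypothetical:

```python
def letter_mapping(task_number):
    """Alternate which tool gets letter A so the letter carries no signal."""
    if task_number % 2 == 1:
        return {"claude": "A", "codex": "B"}
    return {"claude": "B", "codex": "A"}


# Assumption: directory names that could reveal authorship or are build debris.
IDENTIFYING_PARTS = {"__pycache__", ".claude", ".codex", ".git"}


def should_strip(relative_path):
    """True when any path component could identify the tool that wrote it."""
    return any(part in IDENTIFYING_PARTS for part in relative_path.split("/"))


def deanonymize(task_number, letter):
    """Invert the mapping once all verdicts are in."""
    inverse = {v: k for k, v in letter_mapping(task_number).items()}
    return inverse[letter]


for task in range(1, 7):
    print(task, letter_mapping(task))
```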

Instrumented verification

T1/T5/T6 run in a real browser (Playwright): fps via rAF counters, console-error capture, localStorage inspection, full-page screenshots.
T2 CLIs re-executed from scratch: double-run byte-diffs for determinism (SVG + PNG), gallery seed audit, README algorithm check.
T3 scored with the untouched judge-only verify_fixes.py (integrity-checked against master) + full unittest suite + line-level diffs against the pristine fixture.
T4 static feature audit + headless smoke runs; T5 CLEANING.md audited line-by-line against the 11-class dirt key.

Known caveats

One run per task per tool — same-agent variance can exceed the gap between agents. Treat single-task margins under 3 pts as noise (T6, T2).
Self-verification for T2/T4/T5 confirmed by the operator post-hoc (session transcripts weren't retained).
Judges are LLM agents; style-based authorship inference can't be fully excluded, matching the protocol's "third LLM that doesn't know which output is whose."

Scoring integrity

Official 100-pt rubric kept exactly as the benchmark protocol defines it; the two supplementary parameters are reported separately and never enter the totals.
Half-point judge scores are preserved here — this page is the authoritative record of this evaluation pass.
Autonomy: 0 interventions on all 12 runs (operator-attested) → both sides 100%. The dimension is retained for protocol fidelity even though it doesn't discriminate.

What I actually learned

Three things stuck with me after staring at twelve runs.

The gap is judgment, not capability. Both agents shipped working software six times out of six, first try, unaided. That would have been science fiction two years ago. What separated them was taste under ambiguity: what to do with "make it charming", whether a negative quantity is a refund or a typo, whether an unplanted bug justifies a rewrite.

Codex's losses were interesting losses. The stdlib PNG encoder, the bonus cache-stampede race, the cleaner functional decomposition on T6 — gpt-5.5 repeatedly out-engineered Fable 5 in narrow cells. It just kept pairing those wins with one unforced error per task: a combo system that never fires, a sign-flip that fabricates revenue, a semantics-changing rewrite under a "minimal fixes" mandate.

Self-verification is the cheapest point on the board. The steepest proportional swing in any cell of the whole card (T1, 10 vs 5, half the available points) came down to one agent playing its own game for a full run and the other taking a single screenshot. If you build with agents: making them prove their work to themselves is still where the easy wins are.

If you want to argue with the scoring — good, that's the point of publishing the whole card. Every number above traces to either an instrumented measurement or a quoted blind verdict, and the caveats section says exactly where the noise floor is.

Resources

If Task 2's generative art caught your eye and you want to understand the algorithm both agents reached for, George Savva's interactive essay is the best single read I know: Flow fields — how to make art with code.
