Agent BattleGround — Claude Code vs Codex
I keep reading agent-vs-agent benchmarks where the methodology is "vibes". So I ran my own. Six creative-coding tasks, the exact same prompt handed to Claude Code (Fable 5) and Codex (gpt-5.5), one run each, zero human help. Both agents ran deliberately bare: a single tool — Playwright for browser automation — and no skills, plugins, or MCP servers installed, so nothing but the model separates the two columns. Creativity and code quality were scored by blind judges who didn't know which entry was whose, and every claim was re-tested with instrumentation — real browser runs, byte-diffs, console-error capture, fps counters.
Every fight below is a card you can drive. Each one carries the prompt both agents received, the full scorecard, and the verdict with the blind judges' actual words. Where I captured a screen recording of the run, the card shows it; where I didn't, it embeds the live artifact itself: actual playable games, galleries, dashboards, and demoscenes. Two tasks also ship their exact inputs as downloads, so you can rerun the fight with your own agent.
The verdict
Both agents completed all six tasks with zero human interventions. Autonomy was a clean 100% sweep on each side, so it never separates them. The gaps come from requirements precision, blind-judged creativity and quality, and self-verification depth. Claude Code posted the higher total on all six tasks, but Codex genuinely out-engineered it in spots: it won the code-quality cell on Tasks 5 and 6, wrote a zero-dependency PNG encoder, and found a race condition that wasn't even planted. The headline is closer than 6–0 suggests.
Evaluation parameters
Each pair of bars below is one rubric dimension, averaged across all six tasks as a percentage of maximum: orange is Claude Code, teal is Codex.
Now the six fights. Each card shows the exact prompt and its requirement checklist right under the headline, then the tabs flip between Results, Scorecard, and Verdict. Task 4 has the screen recordings; Tasks 1, 2, 5, and 6 embed the live artifacts each agent produced — the Task 1 games and Task 6 demoscenes are fully playable right in the card — and Task 3 shows each agent's final run report.
Task 1 — Neon Drift
Build "Neon Drift" — a playable browser game in a single index.html (vanilla JS + canvas, no libraries). A ship dodges asteroids that spawn faster over time. Requirements: particle effects on collision, screen shake, a combo/score system with localStorage high score, WebAudio sound effects generated in code, a start screen and game-over screen, and smooth 60fps. When done, open it in the browser yourself, play one full run, and fix anything broken before declaring it finished.
Both games are live and playable in the frames; click inside one first so it grabs your keyboard. Space or a click starts Claude's run; "Start Run" kicks off Codex's. Arrows / WASD steer both. For the proper full-screen experience, hit Open full.
| Parameter | Claude | Codex | Notes |
|---|---|---|---|
| First-run success /25 | y → 25 | y → 25 | Both playable untouched |
| Requirements /20 | 8/8 → 20 | 7/8 → 17.5 | Codex combo never engages in real play (meter decays 0.7/s vs 4-graze threshold) |
| Creativity (blind) /20 | 4 → 16 | 4 → 16 | Tie — different strengths |
| Code quality (blind) /15 | 4 → 12 | 3.5 → 10.5 | Codex: unused param, duplicated patterns, double combo computation |
| Self-verification /10 | 2 → 10 | 1 → 5 | Claude played a full run, 6 screenshots; Codex took one game-over shot |
| Autonomy /10 | 0 int. → 10 | 0 int. → 10 | |
| Total | 93 | 84 |
- Claude: unattended run reached combo ×3, score 609 persisted to localStorage; 118 fps mid-game; zero console errors.
- Codex: combo stuck at ×1 (matches my live-play experience); 75 fps mid-game, above 60 but about 36% below Claude's 118 fps; high score persisted correctly.
Task 2 — artgen, a generative-art CLI
Create a CLI tool called "artgen" (Python or Node, your choice) that generates seeded generative art using flow fields. Requirements: artgen --seed 42 --style <ink|neon|organic> --out piece.svg with three visually distinct styles, identical seed always produces identical output, PNG export option, and an artgen gallery command that renders a 3x3 HTML gallery of seeds 1-9. Include a README explaining the algorithm.
Both agents reached for the same classic technique: a flow field. Learn more about it here — Art gen & flow fields ↗
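For readers who haven't met the technique, here is a minimal sketch of a seeded flow field in Python. It is not either agent's code. The names `flow_angle` and `flow_field_svg` are hypothetical, and the sin/cos field is an assumed stand-in for the smooth noise function such tools typically use. The point it illustrates is the determinism requirement: a private `random.Random(seed)` plus fixed-precision formatting means the same seed yields the same bytes.

```python
import math
import random


def flow_angle(x: float, y: float, phase: float) -> float:
    """Direction of the field at (x, y), in radians (assumed stand-in for real noise)."""
    return math.pi * (math.sin(x * 0.013 + phase) + math.cos(y * 0.017 - phase))


def flow_field_svg(seed: int, size: int = 400, lines: int = 60, steps: int = 80) -> str:
    """Trace `lines` particles through the field and return an SVG document."""
    rng = random.Random(seed)  # private RNG: no global state, so runs are reproducible
    phase = rng.uniform(0, math.tau)
    paths = []
    for _ in range(lines):
        x, y = rng.uniform(0, size), rng.uniform(0, size)
        points = []
        for _ in range(steps):
            points.append(f"{x:.2f},{y:.2f}")  # fixed precision keeps the output bytes stable
            angle = flow_angle(x, y, phase)
            x += 3 * math.cos(angle)
            y += 3 * math.sin(angle)
            if not (0 <= x <= size and 0 <= y <= size):
                break  # particle left the canvas
        pts = " ".join(points)
        paths.append(f'<polyline points="{pts}" fill="none" stroke="black" stroke-width="0.6"/>')
    body = "\n".join(paths)
    return f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {size} {size}">\n{body}\n</svg>\n'
```

Calling `flow_field_svg(42)` twice returns identical strings, which is the property the double-run byte-diff in the task checks. Distinct styles would then vary stroke, palette, and field shape on top of the same loop.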
| Parameter | Claude | Codex | Notes |
|---|---|---|---|
| First-run success /25 | y → 25 | y → 25 | Every command worked when re-run by the judge |
| Requirements /20 | 6/6 → 20 | 6/6 → 20 | Codex matches spec flags literally (--png); Claude via --out x.png / --format |
| Creativity (blind) /20 | 4 → 16 | 3.5 → 14 | Claude's 3 styles structurally distinct; Codex's ink/organic differ mostly in palette |
| Code quality (blind) /15 | 4 → 12 | 4 → 12 | Tie — different virtues (scene-model parity vs dataclasses + typed package) |
| Self-verification /10 | 2 → 10 | 2 → 10 | Both fully tested their own CLIs |
| Autonomy /10 | 0 int. → 10 | 0 int. → 10 | |
| Total | 93 | 91 |
- Determinism: double-run diff = byte-identical for SVG and PNG, both tools, multiple styles.
- Gallery: both render 9 inline SVGs, seeds 1–9.
- Dependency surface: Codex PNG export is pure stdlib; Claude requires Pillow for PNG (SVG/gallery dependency-free).
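To show what "pure stdlib" PNG export can look like, here is a minimal sketch of an encoder built on `zlib` and `struct` only. It is an illustration under stated assumptions, not Codex's implementation: the function names are hypothetical, and it handles only 8-bit RGB with no scanline filtering.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(tag: bytes, data: bytes) -> bytes:
    """Length + tag + data + CRC-32 over tag and data, per the PNG chunk layout."""
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)


def encode_png(width: int, height: int, rgb: bytes) -> bytes:
    """Encode raw 8-bit RGB pixels (row-major, 3 bytes per pixel) as a PNG."""
    if len(rgb) != width * height * 3:
        raise ValueError("rgb must hold width * height * 3 bytes")
    stride = width * 3
    # Each scanline is prefixed with filter type 0 (no filtering).
    raw = b"".join(b"\x00" + rgb[y * stride:(y + 1) * stride] for y in range(height))
    # IHDR: width, height, bit depth 8, colour type 2 (RGB), default compression/filter/interlace.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (
        PNG_SIGNATURE
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(raw, 9))
        + _chunk(b"IEND", b"")
    )
```

Because the encoder writes no timestamps or other variable metadata, the same pixels produce the same file for a given zlib build, which is consistent with the byte-identical PNG result above.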
Task 3 — Bug hunt
This project contains 6 intentionally planted bugs (logic errors, an off-by-one, an async race condition, a mutation bug, and edge-case failures). Find and fix all of them. Do not rewrite files wholesale — make minimal targeted fixes. All existing tests must pass, and write BUGS.md documenting each bug, its root cause, and your fix.
Claude's final run report:
● All 6 bugs found, fixed, and documented in BUGS.md. All 14 existing tests pass, plus targeted checks for what the suite doesn't cover.
- Coupon threshold (`cart.py:42`): `apply_coupon` used `>` instead of `>=`, so a cart at exactly the $50 minimum missed the discount
- Mutation in `merge_carts` (`cart.py:59`): merged list aliased the primary cart's items; fixed by copying item dicts
- Off-by-one in `top_sellers` (`inventory.py:35`): `ranked[:n-1]` returned n−1 products; fixed to `ranked[:n]`
- Async race (`fetcher.py:27–29`): read→await→write on the billing counter; replaced with an atomic `+=`, then verified 50 concurrent fetches count exactly 50
- Thousands-separator parsing (`utils.py:15`): `parse_amount("$1,299.99")` raised ValueError; commas stripped before float()
- Division by zero (`inventory.py:40`): `average_price()` crashed on an empty catalog; now returns 0.0
Each fix was a one-to-two-line targeted change; no files were rewritten.
Codex's final run report:
● Implemented all 6 targeted fixes and added BUGS.md.
Changed:
- `cart.py:39`: coupon threshold and independent cart merging
- `inventory.py:35`: top-seller slice and empty average handling
- `fetcher.py:16`: async in-flight fetch handling and safe count increments
- `utils.py:15`: thousands separator parsing
Verification passed:
- `python -B -m unittest discover -s tests -t .`
- Focused checks for merge independence, empty inventory average, and concurrent fetch counting/deduping
Download the exact MiniMart fixture both agents received — 6 planted bugs, untouched. Hand it to your favorite agent (or yourself) with the prompt above, then run the judge's verifier to see how many of the 6 you actually caught. The verifier names every bug, so don't peek until you're done.
Downloads: bug-hunt-fixture.zip (5 KB, the buggy project) · verify_fixes.py (judge's verifier, spoilers)
| Parameter | Claude | Codex | Notes |
|---|---|---|---|
| First-run success /25 | y → 25 | y → 25 | verify_fixes.py: 6/6 both · unittest: 14/14 green both |
| Requirements /20 | 7/7 → 20 | 7/7 → 20 | 6 bugs + complete BUGS.md each |
| Creativity / insight (blind) /20 | 4 → 16 | 4 → 16 | Claude: deeper docs · Codex: bonus race found |
| Code quality (blind) /15 | 5 → 15 | 3.5 → 10.5 | Surgical 1–2-line fixes vs scope-creep in merge_carts |
| Self-verification /10 | 2 → 10 | 2 → 10 | Both ran the suite; Claude documented a 50-task race verification |
| Autonomy /10 | 0 int. → 10 | 0 int. → 10 | |
| Total | 96 | 91.5 |
- Verifier integrity: both runs' verify_fixes.py byte-identical to the judge-only master (no tampering).
- Diff vs pristine fixture: Claude 6 hunks, ~7 lines; Codex ~5 hunks incl. a 12-line in-flight task map in fetcher.py.
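The async race in this task is easy to reproduce in isolation. The sketch below is not the MiniMart fixture's code: `BillingCounter` and its methods are hypothetical names, and `asyncio.sleep(0)` stands in for the real network await. It shows why read→await→write loses updates and why moving the increment after the await fixes it. "Atomic" here means atomic with respect to the event loop: with no `await` between the read and the write, no other task can interleave.

```python
import asyncio


class BillingCounter:
    def __init__(self) -> None:
        self.count = 0

    async def fetch_racy(self) -> None:
        current = self.count        # read
        await asyncio.sleep(0)      # yield: other tasks run and read the same stale value
        self.count = current + 1    # write back a stale total

    async def fetch_fixed(self) -> None:
        await asyncio.sleep(0)      # the simulated I/O still yields
        self.count += 1             # no await between read and write, so nothing interleaves


async def run_concurrently(method_name: str, n: int = 50) -> int:
    """Fire n concurrent fetches at a fresh counter and return the final count."""
    counter = BillingCounter()
    await asyncio.gather(*(getattr(counter, method_name)() for _ in range(n)))
    return counter.count


def main() -> tuple[int, int]:
    racy = asyncio.run(run_concurrently("fetch_racy"))
    fixed = asyncio.run(run_concurrently("fetch_fixed"))
    return racy, fixed
```

Running `main()` gives a racy count well below 50 and a fixed count of exactly 50, the same shape of check as the 50-task verification in Claude's report.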
Task 4 — ASCII aquarium
Build an ASCII aquarium screensaver in a single Python file using only the standard library. Requirements: at least 4 fish species with distinct movement behaviors, rising bubbles, seaweed that sways, a day/night color cycle, graceful handling of terminal resize, and pressing 'q' exits cleanly restoring the terminal. Make it charming.
Watch Claude's jellyfish glow brighter at night and the minnow school follow its leader, then compare Codex's "(sun)".
| Parameter | Claude | Codex | Notes |
|---|---|---|---|
| First-run success /25 | y → 25 | y → 25 | q-exit & resize clean on both |
| Requirements /20 | 7/7 → 20 | 7/7 → 20 | 6 species vs 5, both stdlib-only |
| Creativity (blind) /20 | 5 → 20 | 3 → 12 | Widest creative gap of the benchmark |
| Code quality (blind) /15 | 4.5 → 13.5 | 3.5 → 10.5 | Codex: dead code, identical-branch conditional, hue-shifting night tint |
| Self-verification /10 | 2 → 10 | 2 → 10 | Both self-tested |
| Autonomy /10 | 0 int. → 10 | 0 int. → 10 | |
| Total | 98.5 | 87.5 |
- Both stdlib-only (no curses — ANSI escapes; Codex uses ctypes for Windows VT mode, Claude msvcrt/termios dual-path).
- Headless smoke test: Claude renders frames even without a TTY (has a --frames self-test flag); Codex exits politely: "ASCII Aquarium needs an interactive terminal." — graceful, by design.
Task 5 — Messy data → story
Clean the attached sales.csv (it has mixed date formats, duplicate rows, inconsistent category spellings, nulls, and currency symbols mixed into numbers). Document every cleaning decision in CLEANING.md. Then build a single self-contained dashboard.html (data inlined, no network calls) with 4 charts and a written "3 key insights" section that a non-technical manager would find genuinely useful.
Download the exact messy sales.csv both agents received, with all 11 dirt classes planted and untouched. Run the prompt above through your own agent and see which call it makes on the four negative-quantity rows. The dirt key lists every planted trap, so open it only afterwards.
Downloads: sales.csv (22 KB, the messy data) · sales-dirt-key.md (judge's dirt key, spoilers)
| Parameter | Claude | Codex | Notes |
|---|---|---|---|
| First-run success /25 | y → 25 | y → 25 | Both render offline untouched (Codex with hidden console errors) |
| Requirements /20 | 11/11 dirt · 6/6 → 20 | 11/11 dirt · 6/6 → 20 | Threshold was ≥9 of 11 dirt classes — both cleared it fully |
| Creativity / insight (blind) /20 | 5 → 20 | 3.5 → 14 | Managerial story vs templated number-restating |
| Code quality (blind) /15 | 4 → 12 | 4.5 → 13.5 | Codex's pipeline engineering won this cell |
| Self-verification /10 | 2 → 10 | 2 → 10 | Both opened their dashboards |
| Autonomy /10 | 0 int. → 10 | 0 int. → 10 | |
| Total | 97 | 92.5 |
- Dirt-key audit: both CLEANING.md files address all 11 planted classes; both caught the $99,999 Rain Jacket and the two-locale date trap.
- Browser run: Claude 0 console errors; Codex 8 errors (negative SVG rect widths at first paint — recovers visually, would break on narrow viewports).
- Network: both fully inlined, zero external requests; row accounting differs only by the refund decision (256 vs 260 rows).
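Two of the cleaning decisions above lend themselves to a short sketch: inferring whether slash-separated dates are day-first or month-first from the data itself, and stripping currency symbols from numbers. This is one plausible reading of "date-locale inference with evidence", not either agent's pipeline. The function names are hypothetical, and the sketch assumes dates of the form `N/N/YYYY`.

```python
import re
from datetime import date, datetime
from typing import Iterable, Optional


def infer_day_first(samples: Iterable[str]) -> Optional[bool]:
    """True for DD/MM, False for MM/DD, None if no value disambiguates."""
    first_over_12 = second_over_12 = False
    for text in samples:
        match = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{4})", text.strip())
        if not match:
            continue  # other formats are handled elsewhere
        first_over_12 |= int(match.group(1)) > 12   # a first field above 12 must be a day
        second_over_12 |= int(match.group(2)) > 12  # a second field above 12 must be a day
    if first_over_12 and second_over_12:
        raise ValueError("column mixes DD/MM and MM/DD values")
    if first_over_12:
        return True
    if second_over_12:
        return False
    return None


def parse_slash_date(text: str, day_first: bool) -> date:
    """Parse one slash date once the column's convention is known."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    return datetime.strptime(text.strip(), fmt).date()


def clean_amount(text: str) -> float:
    """Drop currency symbols, spaces, and thousands separators, then convert."""
    return float(re.sub(r"[^\d.\-]", "", text))
```

The evidence for the decision is whichever rows carry a field above 12; a CLEANING.md entry can cite them directly. Returning `None` when nothing disambiguates forces an explicit, documented choice rather than a silent default.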
Task 6 — 120-line demoscene
Create the most impressive interactive experience you can in ONE HTML file of at most 120 lines (no minification tricks, readable code, no external resources). You choose what it is. Surprise me. Then open it yourself and verify it works.
The photo finish. These aren't videos: both demos are running live in the frames. Click inside Claude's galaxy to fire a shockwave; move and hold inside Codex's forge to bend the flow. Pick your own winner before reading the verdict.
| Parameter | Claude | Codex | Notes |
|---|---|---|---|
| First-run success /25 | y → 25 | y → 25 | Galaxy Sandbox · Orbit Forge — both flawless on open |
| Requirements /20 | 5/5 → 20 | 5/5 → 20 | 118 vs 119 lines, zero external resources, readable |
| Creativity (blind) /20 | 4.5 → 18 | 4 → 16 | 3D engine + audio vs gorgeous 2D plasma |
| Code quality (blind) /15 | 4 → 12 | 4.5 → 13.5 | Codex's functional decomposition won this cell |
| Self-verification /10 | 2 → 10 | 2 → 10 | Both opened & screenshot-verified their pages |
| Autonomy /10 | 0 int. → 10 | 0 int. → 10 | |
| Total | 95 | 94.5 |
- Line counts: 118 vs 119 (limit 120); grep confirms zero external URLs/scripts/fonts in both.
- Live run: Claude 63 fps with 6,000 stars; Codex 121 fps, 587 live particles, working HUD readout.
Beyond the rubric
Autonomy tied at 100% on both sides, so I added two evidence-based parameters to keep separating signal. They're reported alongside — not mixed into — the official rubric, so totals stay comparable. Runtime robustness: defects surfaced under instrumented re-testing (console errors, fps headroom, edge handling, dependency risk). Beyond-spec depth: verified work past the literal checklist.
| Task | Robustness (C / X) | Beyond-spec (C / X) | What drove it |
|---|---|---|---|
| 1 · Neon Drift | 5 / 3.5 | 4.5 / 4 | Codex combo unreachable in real play + lower fps headroom; Claude's graze mechanic & audio bus vs Codex's HUD chrome |
| 2 · artgen | 4 / 5 | 4 / 4.5 | Codex: zero-dependency PNG encoder (stdlib zlib) & spec-literal flags; Claude needs Pillow for PNG |
| 3 · Bug hunt | 5 / 4 | 3.5 / 4.5 | Codex found+fixed an unplanted cache-stampede race (depth) but its merge rewrite changes duplicate-SKU semantics (risk) |
| 4 · Aquarium | 4.5 / 4 | 5 / 3 | Claude runs even headless w/ small-terminal fallback + far richer world; Codex has dead code & a night-tint hue bug |
| 5 · Dashboard | 5 / 3.5 | 5 / 4 | Codex: 8 console errors (negative SVG widths) + refund sign-flip fabricates ~$800 revenue; Claude: date-locale inference w/ evidence |
| 6 · Demoscene | 5 / 5 | 4.5 / 4 | Both flawless under instrumentation; Claude adds generative audio + true 3D engine in the same budget |
| Average | 4.75 / 4.17 | 4.42 / 4.00 |
Run log
Full run log — all 12 runs, every rubric cell
| Task | Tool | 1st-run | Req | Crea | Qual | Verify | Interv. | Model | Total |
|---|---|---|---|---|---|---|---|---|---|
| 1 · Neon Drift | claude | y | 8/8 | 4 | 4 | 2 | 0 | Fable 5 | 93 |
| 1 · Neon Drift | codex | y | 7/8 | 4 | 3.5 | 1 | 0 | gpt-5.5 | 84 |
| 2 · artgen | claude | y | 6/6 | 4 | 4 | 2 | 0 | Fable 5 | 93 |
| 2 · artgen | codex | y | 6/6 | 3.5 | 4 | 2 | 0 | gpt-5.5 | 91 |
| 3 · Bug hunt | claude | y | 7/7 | 4 | 5 | 2 | 0 | Fable 5 | 96 |
| 3 · Bug hunt | codex | y | 7/7 | 4 | 3.5 | 2 | 0 | gpt-5.5 | 91.5 |
| 4 · Aquarium | claude | y | 7/7 | 5 | 4.5 | 2 | 0 | Fable 5 | 98.5 |
| 4 · Aquarium | codex | y | 7/7 | 3 | 3.5 | 2 | 0 | gpt-5.5 | 87.5 |
| 5 · Dashboard | claude | y | 6/6 | 5 | 4 | 2 | 0 | Fable 5 | 97 |
| 5 · Dashboard | codex | y | 6/6 | 3.5 | 4.5 | 2 | 0 | gpt-5.5 | 92.5 |
| 6 · Demoscene | claude | y | 5/5 | 4.5 | 4 | 2 | 0 | Fable 5 | 95 |
| 6 · Demoscene | codex | y | 5/5 | 4 | 4.5 | 2 | 0 | gpt-5.5 | 94.5 |
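Every total in the log can be recomputed from its cells. The sketch below encodes the conversions visible in the scorecards: requirements are scaled to 20, the 1–5 judge scores are multiplied by 4 (creativity) and 3 (quality), and the 0–2 verification score by 5. Two things are assumptions: a failed first run scoring 0, and the autonomy penalty per intervention, which the article never needed, so the function refuses non-zero interventions rather than guess.

```python
def rubric_total(first_run_ok, req_met, req_total, creativity, quality, verify, interventions):
    """Recompute a 100-point total from one run-log row."""
    if interventions != 0:
        raise ValueError("autonomy penalty for interventions is not specified")
    return (
        (25 if first_run_ok else 0)   # first-run success /25 (0 on failure is an assumption)
        + 20 * req_met / req_total    # requirements /20
        + 4 * creativity              # blind creativity, 1-5 scaled to /20
        + 3 * quality                 # blind code quality, 1-5 scaled to /15
        + 5 * verify                  # self-verification, 0-2 scaled to /10
        + 10                          # autonomy /10 with zero interventions
    )


# (task, tool, first_run_ok, req_met, req_total, creativity, quality, verify, interventions, published)
RUN_LOG = [
    ("1 Neon Drift", "claude", True, 8, 8, 4, 4, 2, 0, 93),
    ("1 Neon Drift", "codex", True, 7, 8, 4, 3.5, 1, 0, 84),
    ("2 artgen", "claude", True, 6, 6, 4, 4, 2, 0, 93),
    ("2 artgen", "codex", True, 6, 6, 3.5, 4, 2, 0, 91),
    ("3 Bug hunt", "claude", True, 7, 7, 4, 5, 2, 0, 96),
    ("3 Bug hunt", "codex", True, 7, 7, 4, 3.5, 2, 0, 91.5),
    ("4 Aquarium", "claude", True, 7, 7, 5, 4.5, 2, 0, 98.5),
    ("4 Aquarium", "codex", True, 7, 7, 3, 3.5, 2, 0, 87.5),
    ("5 Dashboard", "claude", True, 6, 6, 5, 4, 2, 0, 97),
    ("5 Dashboard", "codex", True, 6, 6, 3.5, 4.5, 2, 0, 92.5),
    ("6 Demoscene", "claude", True, 5, 5, 4.5, 4, 2, 0, 95),
    ("6 Demoscene", "codex", True, 5, 5, 4, 4.5, 2, 0, 94.5),
]


def mismatches():
    """Rows whose recomputed total differs from the published one."""
    return [row[:2] for row in RUN_LOG if rubric_total(*row[2:9]) != row[9]]
```

`mismatches()` returns an empty list for the twelve rows above, so the published totals are internally consistent with their cells.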
Methodology & fairness
Tooling parity
- Both agents were given exactly one tool, `Playwright` for browser automation, with no skills, plugins, or MCP servers installed on either side. Stock harnesses, same prompts, so the only variable is the model.
Blind judging
- Outputs copied to anonymized `A/B` folders with alternating tool→letter mapping per task; `__pycache__` and any path-identifying files stripped.
- Six independent judge agents, one per task, scored creativity and code quality 1–5 without knowing which tool made which entry, with identical evidence per side (uniform screenshots captured by the evaluator, not the contestants).
- Judges were instructed not to guess authorship and to use the full scale. Results de-anonymized only after all six verdicts were returned.
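The anonymization step can be sketched in a few lines. This is an illustration of the idea, not the evaluator's actual script: the function names are hypothetical, and which parity maps Claude to `A` is an assumption, since the article only says the mapping alternates per task.

```python
import shutil
from pathlib import Path


def letter_mapping(task_number: int) -> dict[str, str]:
    """Alternate which tool becomes 'A' so the letter carries no signal across tasks."""
    tools = ("claude", "codex")
    if task_number % 2 == 0:  # assumed parity; the article only says "alternating"
        tools = tools[::-1]
    return dict(zip(tools, ("A", "B")))


def anonymize(task_number: int, outputs: dict[str, Path], dest: Path) -> dict[str, str]:
    """Copy each tool's output tree to dest/A or dest/B, dropping bytecode caches."""
    mapping = letter_mapping(task_number)
    for tool, letter in mapping.items():
        shutil.copytree(
            outputs[tool],
            dest / letter,
            ignore=shutil.ignore_patterns("__pycache__", "*.pyc"),
        )
    return mapping  # kept by the operator, revealed only after the verdicts are in
```

Compiled bytecode is stripped because `.pyc` files can embed the original source path, which is one of the path-identifying leaks the protocol guards against.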
Instrumented verification
- T1/T5/T6 run in a real browser (Playwright): fps via rAF counters, console-error capture, localStorage inspection, full-page screenshots.
- T2 CLIs re-executed from scratch: double-run byte-diffs for determinism (SVG + PNG), gallery seed audit, README algorithm check.
- T3 scored with the untouched judge-only `verify_fixes.py` (integrity-checked against master) + full unittest suite + line-level diffs against the pristine fixture.
- T4 static feature audit + headless smoke runs; T5 CLEANING.md audited line-by-line against the 11-class dirt key.
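The double-run byte-diff and the verifier integrity check both reduce to comparing digests. Here is a minimal sketch of such a harness, assuming the tool under test takes its output path on the command line. The helper names are hypothetical, and this is not the evaluator's actual script.

```python
import hashlib
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Sequence


def sha256_of(path: Path) -> str:
    """Hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def double_run_identical(argv_for: Callable[[Path], Sequence[str]], suffix: str = ".svg") -> bool:
    """Run the same command twice into fresh files and compare their digests.

    argv_for(out_path) must return the full command line that writes to out_path.
    """
    with tempfile.TemporaryDirectory() as tmp:
        digests = []
        for name in ("first", "second"):
            out = Path(tmp) / f"{name}{suffix}"
            subprocess.run(list(argv_for(out)), check=True)
            digests.append(sha256_of(out))
    return digests[0] == digests[1]
```

With the flags from the Task 2 prompt, a check might look like `double_run_identical(lambda out: ["artgen", "--seed", "42", "--style", "ink", "--out", str(out)])`. The integrity check on the verifier is the same comparison without the subprocess: `sha256_of(run_copy) == sha256_of(master_copy)`.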
Known caveats
- One run per task per tool — same-agent variance can exceed the gap between agents. Treat single-task margins under 3 pts as noise (T6, T2).
- Self-verification for T2/T4/T5 confirmed by the operator post-hoc (session transcripts weren't retained).
- Judges are LLM agents, as the protocol specifies (a "third LLM that doesn't know which output is whose"), so style-based authorship inference can't be fully excluded.
Scoring integrity
- Official 100-pt rubric kept exactly as the benchmark protocol defines it; the two supplementary parameters are reported separately and never enter the totals.
- Half-point judge scores are preserved here — this page is the authoritative record of this evaluation pass.
- Autonomy: 0 interventions on all 12 runs (operator-attested) → both sides 100%. The dimension is retained for protocol fidelity even though it doesn't discriminate.
What I actually learned
Three things stuck with me after staring at twelve runs.
The gap is judgment, not capability. Both agents shipped working software six times out of six, first try, unaided. That would have been science fiction two years ago. What separated them was taste under ambiguity: what to do with "make it charming", whether a negative quantity is a refund or a typo, whether an unplanted bug justifies a rewrite.
Codex's losses were interesting losses. The stdlib PNG encoder, the bonus cache-stampede race, the cleaner functional decomposition on T6 — gpt-5.5 repeatedly out-engineered Fable 5 in narrow cells. It just kept pairing those wins with one unforced error per task: a combo system that never fires, a sign-flip that fabricates revenue, a semantics-changing rewrite under a "minimal fixes" mandate.
Self-verification is the cheapest point on the board. Proportionally, the biggest per-cell swing of the whole card (T1, 10 vs 5, half the cell's points) came down to one agent playing its own game for a full run and the other taking a single screenshot. If you build with agents, making them prove their work to themselves is still where the easy wins are.
If you want to argue with the scoring — good, that's the point of publishing the whole card. Every number above traces to either an instrumented measurement or a quoted blind verdict, and the caveats section says exactly where the noise floor is.
Resources
If Task 2's generative art caught your eye and you want to understand the algorithm both agents reached for, George Savva's interactive essay is the best single read I know: Flow fields — how to make art with code.

