# sales.csv — Planted Dirt Key (Task 5)

**User-side only.** Use this to check whether each agent's `CLEANING.md` actually found and handled every issue class. Generated with seed 42 — both agents must receive identical copies of `sales.csv`.

272 rows total (260 base + 12 exact duplicates), columns: `order_id, order_date, customer, category, product, quantity, unit_price, region, channel`.
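The row counts above can be sanity-checked before handing the file to either agent. The following is a minimal sketch, assuming the 272 figure excludes the header line and that "exact duplicate" means every field matches character for character; the function name `count_exact_duplicates` is hypothetical and not part of the task files.

```python
import csv
from collections import Counter
from typing import Iterable, Tuple


def count_exact_duplicates(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (data_rows, extra_copies) for CSV text lines with a header.

    extra_copies counts rows beyond the first occurrence of each
    identical row, so one row that appears twice contributes 1.
    """
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row (assumed present)
    counts = Counter(tuple(row) for row in reader)
    data_rows = sum(counts.values())
    extra_copies = data_rows - len(counts)
    return data_rows, extra_copies
```

Called on an open handle, e.g. `count_exact_duplicates(open("sales.csv", newline=""))`, the result should be `(272, 12)` if the file matches this key.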

| # | Issue class | Details | Count |
|---|------------|---------|-------|
| 1 | Mixed date formats | Four formats interleaved: `2025-01-05`, `05/01/2025` (day-first), `Jan 05, 2025`, `1/5/25` (US short) | throughout |
| 2 | Exact duplicate rows | Whole-row duplicates (same order_id appears twice) | 12 |
| 3 | Inconsistent categories | Case variants + typos: `Electroncs`, `Apparell`, `home and garden`, `Office supplies`, ` toys ` etc. — true set is Electronics, Home & Garden, Apparel, Toys, Office Supplies | throughout |
| 4 | Currency chaos in `unit_price` | `$1,249.00`, `45.5`, `USD 62.00`, ` 12.99 ` (padded), plus empty cells (counted under class 5) | throughout |
| 5 | Missing values | Empty unit_price ×6, empty customer ×10, empty region ×9 | 25 |
| 6 | Whitespace padding | Customers like `  Mei Chen  ` ×10; regions like ` West `, `South ` | 10+ |
| 7 | Region case/typos | `SOUTH`, `east`, `Nort` (typo for North), `WEST` | throughout |
| 8 | Negative quantities | Refund rows (legitimate data — should be handled/flagged, not silently deleted) | 4 |
| 9 | Text quantities | `quantity` of `"two"` / `"three"` | 4 |
| 10 | Absurd outlier | `ORD-0058`: Rain Jacket priced `$99,999.00` (real price ≈ $110) | 1 |
| 11 | Inconsistent channel | `online` vs `Online` | throughout |
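
For reference when reading an agent's approach to classes 3 and 4, here is one plausible shape for the normalization. It is a sketch under assumptions, not the expected solution: the helper names `parse_unit_price` and `canonical_category` are hypothetical, the 0.8 similarity cutoff is an arbitrary choice, and fuzzy matching via `difflib` is only one of several defensible ways to repair the typos.

```python
import re
from difflib import get_close_matches
from typing import Optional

# True category set, as listed in the table above.
CATEGORIES = ["Electronics", "Home & Garden", "Apparel", "Toys", "Office Supplies"]
_BY_KEY = {name.lower(): name for name in CATEGORIES}


def parse_unit_price(raw: str) -> Optional[float]:
    """Strip padding, a leading 'USD', '$' and thousands commas; empty -> None."""
    text = raw.strip()
    if not text:
        return None
    cleaned = re.sub(r"(?i)^usd\s*|[$,]", "", text).strip()
    return float(cleaned)


def canonical_category(raw: str) -> Optional[str]:
    """Map a case variant or near-miss typo to the true category, else None."""
    key = raw.strip().lower().replace(" and ", " & ")
    if key in _BY_KEY:
        return _BY_KEY[key]
    # Assumed cutoff of 0.8: strict enough to avoid cross-category matches.
    match = get_close_matches(key, list(_BY_KEY), n=1, cutoff=0.8)
    return _BY_KEY[match[0]] if match else None
```

Returning `None` rather than guessing keeps unrecognized values visible, which fits the "flag, don't silently delete" expectation stated for refunds.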

## Judging cleaning quality

A strong `CLEANING.md` names all 11 classes and states a defensible decision per class (e.g., refunds kept and flagged, not dropped; outlier excluded or capped with justification; day-first vs month-first ambiguity acknowledged). For requirements coverage in `score.py`, Task 5's checklist item "all dirt classes addressed" means at least 9 of these 11 were found and handled or explicitly flagged.
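
The 9-of-11 threshold can be expressed as a small check. This is an illustrative sketch only: how `score.py` actually evaluates the checklist item is not shown here, and the names `TOTAL_CLASSES`, `MIN_ADDRESSED`, and `all_dirt_classes_addressed` are hypothetical.

```python
from typing import Iterable

TOTAL_CLASSES = 11  # issue classes in the table above
MIN_ADDRESSED = 9   # threshold stated for the checklist item


def all_dirt_classes_addressed(addressed: Iterable[int]) -> bool:
    """True if at least MIN_ADDRESSED distinct, valid class numbers were handled."""
    valid = {n for n in addressed if 1 <= n <= TOTAL_CLASSES}
    return len(valid) >= MIN_ADDRESSED
```

The judge would pass in the class numbers they found handled or explicitly flagged in an agent's `CLEANING.md`.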

The date ambiguity is the subtle trap: `05/01/2025` (day-first) and `1/5/25` (US short) coexist, so a lazy single-format parser silently corrupts dates. For example, reading `05/01/2025` month-first yields May 1 instead of January 5, with no parse error. Agents that notice the ambiguity and explain how they disambiguated deserve creativity/quality credit.
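
One way an agent might disambiguate is to key off the year width, as in the sketch below. It assumes the convention described in this key (slash dates with a four-digit year are day-first, slash dates with a two-digit year are US month-first) and an English locale for `%b`; the function name `parse_order_date` is hypothetical.

```python
from datetime import date, datetime

# Tried in order; each pattern accepts exactly one of the four planted formats.
_FORMATS = (
    "%Y-%m-%d",   # 2025-01-05
    "%d/%m/%Y",   # 05/01/2025, day-first (four-digit year)
    "%b %d, %Y",  # Jan 05, 2025
    "%m/%d/%y",   # 1/5/25, US short (two-digit year)
)


def parse_order_date(raw: str) -> date:
    """Parse an order_date string, raising ValueError if no format fits."""
    text = raw.strip()
    for fmt in _FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized order_date: {raw!r}")
```

The formats do not overlap because `%Y` requires four digits and `%y` consumes exactly two, so a value can only succeed under the pattern intended for it. Raising on anything else surfaces unexpected formats instead of guessing.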
