# Phase 3: three-mode Cosmos benchmark · merged dataset

> **MERGED FROM TWO SEGMENTS** (see provenance below). Per protocol §12, the original run aborted due to a runner-side accounting bug. No rows were discarded, and the resumed run picked up at the next task in the shuffle order. This file is the union of the two segments.
>
> ⚠️ **DIRECTIONAL: only 12 to 13 kept rows per cell.** Tier 2 evidence at most per §9.1; not a Tier 1 web headline.

## Provenance

- Merged at: 2026-05-07T02:55:17Z
- Partial: `benchmarks/results/phase3-2026-05-07-d2-tasks1to5.csv` → 15 rows
- Resumed: `benchmarks/results/phase3-2026-05-07-draft-resume-from5.csv` → 27 rows
- Total: 42 rows after dedup
- Tasks locked at: 2026-05-07T00:47:53Z (d3181df245)

## Exclusions (primary metric per §7.1)

- Total rows: 42
- Kept for primary: **37** (88.1%)
- Excluded (error / both-judges-failed): 5
- Excluded (judge_disagreement, Δ > 1): 0
- Exclusion rate: **11.9%** ⚠ exceeds 10%; downgrade to directional per §9.1

## Summary by (brain × mode)

| Brain | Mode | n | Mean cost | Mean tokens | Mean score |
|---|---|---:|---:|---:|---:|
| `fresh` | `baseline` | 0 | n/a | n/a | n/a |
| `fresh` | `mcp_only` | 0 | n/a | n/a | n/a |
| `fresh` | `mcp_plus_rules` | 0 | n/a | n/a | n/a |
| `curated` | `baseline` | 12 | $0.2337 | 126156 | 1.3333 / 4 |
| `curated` | `mcp_only` | 12 | $0.1885 | 131222 | 1.3333 / 4 |
| `curated` | `mcp_plus_rules` | 13 | $0.3554 | 211048 | 2.2308 / 4 |

## Paired analysis: mcp_plus_rules − mcp_only (curated)

Here C = `mcp_plus_rules` and B = `mcp_only`.

- Paired n: **12**
- Mean Δ (C − B): **+0.750**
- SD: 1.712, SE: 0.494
- 95% CI (normal approx): [-0.219, +1.719]
- ⚠ CI crosses zero; Tier 2 directional only per §9.1

Per-task Δ (C − B):

| Task | C | B | Δ |
|---|---:|---:|---:|
| D1 | 0.0 | 0.0 | +0.0 |
| D2 | 0.0 | 0.0 | +0.0 |
| D3 | 3.0 | 4.0 | -1.0 |
| E2 | 4.0 | 4.0 | +0.0 |
| T-AL3 | 4.0 | 0.0 | +4.0 |
| T-AL4 | 0.0 | 0.0 | +0.0 |
| T-AL5 | 4.0 | 3.0 | +1.0 |
| T-AL6 | 3.0 | 1.0 | +2.0 |
| T-PL4 | 0.0 | 0.0 | +0.0 |
| T-PL5 | 4.0 | 0.0 | +4.0 |
| T-PL6 | 0.0 | 0.0 | +0.0 |
| T-PL8 | 3.0 | 4.0 | -1.0 |

---

Generated by `benchmarks/_merge_phase3.py`. The normal-approximation CI above, computed on the paired differences, is a quick sanity reading only. The protocol-mandated paired bootstrap (1000 resamples) is the canonical primary metric; compute it over the merged CSV before making any web-headline claim.
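
The figures in the paired analysis can be re-derived from the per-task deltas, and the same deltas give a rough preview of the paired bootstrap. The sketch below is illustrative only. It is not the code in `benchmarks/_merge_phase3.py`. The function names, the percentile method, and the fixed seed are assumptions, and the deltas are transcribed from the per-task table rather than read from the merged CSV, which remains the canonical input.

```python
import random
import statistics

# Per-task deltas (mcp_plus_rules minus mcp_only), transcribed by hand from
# the per-task table. Assumption: the canonical run reads the merged CSV.
DELTAS = [0.0, 0.0, -1.0, 0.0, 4.0, 0.0, 1.0, 2.0, 0.0, 4.0, 0.0, -1.0]


def normal_approx_ci(deltas, z=1.96):
    """Return mean, sample SD, SE, and a z-based 95% CI of paired differences."""
    n = len(deltas)
    mean = statistics.fmean(deltas)
    sd = statistics.stdev(deltas)  # sample SD, n - 1 denominator
    se = sd / n ** 0.5
    return mean, sd, se, (mean - z * se, mean + z * se)


def paired_bootstrap_ci(deltas, resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference.

    Tasks are resampled with replacement, so each draw keeps a task's two
    scores together. The percentile method and the seed are assumptions;
    the protocol may specify a different variant.
    """
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(
        statistics.fmean(rng.choices(deltas, k=n)) for _ in range(resamples)
    )
    lo_idx = round((alpha / 2) * resamples)
    hi_idx = round((1 - alpha / 2) * resamples) - 1
    return means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    mean, sd, se, (lo, hi) = normal_approx_ci(DELTAS)
    print(f"mean={mean:+.3f} sd={sd:.3f} se={se:.3f} ci=[{lo:+.3f}, {hi:+.3f}]")
    b_lo, b_hi = paired_bootstrap_ci(DELTAS)
    print(f"bootstrap 95% CI: [{b_lo:+.3f}, {b_hi:+.3f}]")
```

With only 12 pairs, a bootstrap interval depends on the seed and the resample count, so treat any single run as directional.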