# Phase 3 — three-mode Cosmos benchmark · merged dataset

> 🔁 **MERGED FROM TWO SEGMENTS** — see provenance below. Per protocol §12 the original run aborted due to a runner-side accounting bug, no rows were discarded, and the resumed run picked up at the next task in the shuffle order. This file is the union.

> ⚠️ **DIRECTIONAL: at most 13 kept rows per cell (n = 12, 12, 13).** Tier 2 evidence at most per §9.1; not a Tier 1 web headline.

## Provenance

- Merged at: 2026-05-07T02:55:17Z
- Partial: `benchmarks/results/phase3-2026-05-07-d2-tasks1to5.csv` → 15 rows
- Resumed: `benchmarks/results/phase3-2026-05-07-draft-resume-from5.csv` → 27 rows
- Total: 42 rows after dedup
- Tasks locked at: 2026-05-07T00:47:53Z (d3181df245)
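
The union-with-dedup step behind these counts could look roughly like the sketch below. It is not the contents of `benchmarks/_merge_phase3.py`: the key fields (`task`, `brain`, `mode`), the first-row-wins rule, and the toy rows are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, str]


def merge_segments(
    partial: Iterable[Row],
    resumed: Iterable[Row],
    key_fields: Tuple[str, ...] = ("task", "brain", "mode"),  # hypothetical key
) -> List[Row]:
    """Union two run segments, keeping the first row seen for each key."""
    merged: Dict[Tuple[str, ...], Row] = {}
    for row in list(partial) + list(resumed):
        key = tuple(row[field] for field in key_fields)
        merged.setdefault(key, row)  # the earlier segment wins on a duplicate key
    return list(merged.values())


# Toy rows for illustration only; not taken from the benchmark CSVs.
partial = [
    {"task": "X1", "brain": "curated", "mode": "baseline", "score": "3"},
    {"task": "X1", "brain": "curated", "mode": "mcp_only", "score": "2"},
]
resumed = [
    {"task": "X1", "brain": "curated", "mode": "mcp_only", "score": "2"},  # overlap
    {"task": "X2", "brain": "curated", "mode": "baseline", "score": "4"},
]

merged = merge_segments(partial, resumed)
print(len(merged))  # 4 input rows, 1 duplicate key
```

When the two segments do not overlap, as the 15 + 27 = 42 row counts above suggest, the dedup pass leaves the union unchanged.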

## Exclusions (primary metric per §7.1)

- Total rows: 42
- Kept for primary: **37** (88.1%)
- Excluded — error / both-judges-failed: 5
- Excluded — judge_disagreement (Δ > 1): 0
- Exclusion rate: **11.9%** ⚠ exceeds 10% — downgrade to directional per §9.1
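
The percentages above follow directly from the row counts. A minimal sketch of the arithmetic, assuming the §9.1 rule is a strict "greater than 10%" check (the function and parameter names are illustrative, not taken from the runner):

```python
def exclusion_summary(total_rows: int, excluded_rows: int, threshold: float = 0.10) -> dict:
    """Return kept count, kept/excluded percentages, and the downgrade flag."""
    kept = total_rows - excluded_rows
    rate = excluded_rows / total_rows
    return {
        "kept": kept,
        "kept_pct": round(100 * kept / total_rows, 1),
        "exclusion_pct": round(100 * rate, 1),
        "downgrade": rate > threshold,  # assumed strict comparison
    }


# 5 error rows + 0 judge-disagreement rows out of 42 total
summary = exclusion_summary(total_rows=42, excluded_rows=5 + 0)
print(summary)
```

With the counts reported here this yields 37 kept rows (88.1%), an 11.9% exclusion rate, and the downgrade flag set.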

## Summary by (brain × mode)

| Brain | Mode | n | Mean cost | Mean tokens | Mean score |
|---|---|---:|---:|---:|---:|
| `fresh` | `baseline` | 0 | — | — | — |
| `fresh` | `mcp_only` | 0 | — | — | — |
| `fresh` | `mcp_plus_rules` | 0 | — | — | — |
| `curated` | `baseline` | 12 | $0.2337 | 126156 | 1.3333 / 4 |
| `curated` | `mcp_only` | 12 | $0.1885 | 131222 | 1.3333 / 4 |
| `curated` | `mcp_plus_rules` | 13 | $0.3554 | 211048 | 2.2308 / 4 |

## Paired analysis: mcp_plus_rules − mcp_only (curated)

- Paired n: **12**
- Mean Δ (C − B): **+0.750**
- SD: 1.712, SE: 0.494
- 95% CI (normal approx): [-0.219, +1.719]
- ⚠ CI crosses zero — Tier 2 directional only per §9.1

Per-task Δ (C − B), where C = `mcp_plus_rules` and B = `mcp_only`:

| Task | C | B | Δ |
|---|---:|---:|---:|
| D1 | 0.0 | 0.0 | +0.0 |
| D2 | 0.0 | 0.0 | +0.0 |
| D3 | 3.0 | 4.0 | -1.0 |
| E2 | 4.0 | 4.0 | +0.0 |
| T-AL3 | 4.0 | 0.0 | +4.0 |
| T-AL4 | 0.0 | 0.0 | +0.0 |
| T-AL5 | 4.0 | 3.0 | +1.0 |
| T-AL6 | 3.0 | 1.0 | +2.0 |
| T-PL4 | 0.0 | 0.0 | +0.0 |
| T-PL5 | 4.0 | 0.0 | +4.0 |
| T-PL6 | 0.0 | 0.0 | +0.0 |
| T-PL8 | 3.0 | 4.0 | -1.0 |
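
The paired statistics can be re-derived from the per-task table above. The sketch below assumes the sample standard deviation (n − 1 denominator) and a z value of 1.96 for the normal approximation; variable names are illustrative.

```python
import math
import statistics

# (C, B) scores per task, copied from the per-task table
scores = {
    "D1": (0.0, 0.0), "D2": (0.0, 0.0), "D3": (3.0, 4.0), "E2": (4.0, 4.0),
    "T-AL3": (4.0, 0.0), "T-AL4": (0.0, 0.0), "T-AL5": (4.0, 3.0),
    "T-AL6": (3.0, 1.0), "T-PL4": (0.0, 0.0), "T-PL5": (4.0, 0.0),
    "T-PL6": (0.0, 0.0), "T-PL8": (3.0, 4.0),
}

deltas = [c - b for c, b in scores.values()]
n = len(deltas)
mean_delta = statistics.mean(deltas)
sd = statistics.stdev(deltas)   # sample SD, n - 1 denominator
se = sd / math.sqrt(n)          # standard error of the mean difference
z = 1.96                        # assumed two-sided 95% normal quantile
ci_low, ci_high = mean_delta - z * se, mean_delta + z * se

print(f"n={n} mean={mean_delta:+.3f} SD={sd:.3f} SE={se:.3f} "
      f"CI=[{ci_low:+.3f}, {ci_high:+.3f}]")
```

Under those assumptions the output matches the figures reported in this section: n = 12, mean +0.750, SD 1.712, SE 0.494, CI [-0.219, +1.719].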

---

Generated by `benchmarks/_merge_phase3.py`. The normal-approximation CI on the paired differences above is a quick sanity reading; the protocol-mandated paired bootstrap (1000 resamples) is the canonical primary metric and must be computed over the merged CSV before any web-headline claim.
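
As a rough illustration of what a paired bootstrap over these data might look like, the sketch below resamples the 12 per-task differences with replacement and takes a percentile interval. The percentile method, the fixed seed, and the function name are assumptions; the protocol may specify a different variant, and the canonical run should read the merged CSV rather than a hard-coded list.

```python
import random
import statistics
from typing import List, Tuple


def paired_bootstrap_ci(
    deltas: List[float],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,  # fixed only so the sketch is repeatable
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(deltas)
    # Each resample draws n differences with replacement and records their mean.
    means = sorted(
        statistics.fmean(rng.choices(deltas, k=n)) for _ in range(n_resamples)
    )
    lower_idx = max(0, round(alpha / 2 * n_resamples))
    upper_idx = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)
    return means[lower_idx], means[upper_idx]


# Per-task differences (C - B), copied from the per-task table
deltas = [0.0, 0.0, -1.0, 0.0, 4.0, 0.0, 1.0, 2.0, 0.0, 4.0, 0.0, -1.0]

ci_low, ci_high = paired_bootstrap_ci(deltas)
print(f"bootstrap 95% CI for mean delta: [{ci_low:+.3f}, {ci_high:+.3f}]")
```

Resampling whole task-level differences, rather than C and B scores independently, is what keeps the pairing intact.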