# Regression-50 Benchmark — Pre-registered Protocol **Status:** harness scaffold landed 2026-05-08. Bug curation in progress. The Phase 3 result on the 14-task moat-category subset did not clear the publication bar (95% CI on the rules-block effect crossed zero). The next milestone is a larger, more diverse, more replicable suite that an outside reviewer can verify end-to-end. This document pre-registers what that suite is and what claims it can support. ## Goal Measure whether Cosmos's Project Lessons mechanism actually prevents re-derivation of fixes the project has already shipped — across 50 real bugs from 5 open-source repositories that the operator did **not** author. Prevents the dogfood-bias problem of testing Cosmos against Cosmos's own bugs. ## Bug selection criteria A row makes it into `bugs.jsonl` only if all five hold: 1. **Public source.** The repo is on a public Git host with permissive license (MIT / Apache / BSD). No private code. 2. **Atomic fix.** A single commit (or ≤3 commits in a tight series) landed the fix. Multi-month refactors don't qualify. 3. **Pre/post commits both compile.** `git checkout sha_before` builds. `git checkout sha_after` builds. We don't include bugs whose surrounding tree is broken for unrelated reasons. 4. **Has a failing test that passes after the fix.** Either an existing test in the repo, or one we add as part of curation. The test becomes the oracle. 5. **Lesson would plausibly help.** The fix is non-obvious from reading the failing test alone — i.e. the bug exists *because* the obvious approach is wrong. If a junior dev would solve it in five minutes by reading the test, Cosmos has nothing to add. Repos initially targeted (subject to the criteria above): - **click** (Python CLI lib) — 10 bugs - **fastapi** (Python web fw) — 10 bugs - **astro** (TS web fw) — 10 bugs - **tauri** (Rust + TS desktop fw) — 10 bugs - **pydantic** (Python validation) — 10 bugs Selection is done in advance, recorded with the SHA pair, and **frozen** before any run. No post-hoc swaps. ## Modes Three per bug, run independently with the brain sandboxed each time (no leak between modes): | Mode | Cosmos engine | Lesson seeded | Rules block | |---|---|---|---| | **A — baseline** | absent | n/a | no | | **B — mcp_only** | available | no | no | | **C — mcp_lesson** | available | yes (pinned to scope_glob) | yes | The lesson seeded in C is hand-written from the actual fix commit message + diff context. It's stored as a `code_errors` row pointing at `scope_glob` (the file path or directory the bug lives in). ## Tasks per run For each (bug, mode) tuple: 1. Check out `sha_before` in a temp working tree (no shared state with other runs — `git worktree add` per run). 2. Spawn a Claude CLI process with the appropriate tool allowlist and Cosmos MCP config (or no MCP for mode A). 3. Send the bug summary as the prompt: `"Fix the bug described below. Run the test at to verify."` 4. The agent has at most N turns (default 12) and a hard cost cap of $0.50 per task. 5. After the agent stops, run the test command. Record: - Did the test pass? (binary) - Did the patch match the actual fix at `sha_after`? (judge call) - In mode C only: did Cosmos's lesson surface in the activity log before the first edit? (binary, from `mcp_activity.jsonl`) - Tokens in / tokens out (per Anthropic billing) - Wall-clock seconds ## Scoring Each (bug, mode) gets a 0–4 score: - **+1** the test passes after the agent's edit. - **+1** the patch is on the same file as the actual fix. - **+1** the patch addresses the same root cause (judge decides). - **+1** in mode C, the lesson surfaced before the first edit (mode A and B always score 0 on this point). The judge is a separate Claude call given the agent's diff and the canonical fix diff, both stripped of mode signals (filenames + paths preserved, but Cosmos-specific markers removed). Two judges run per case; we report agreement rate. ## Statistical bar A claim graduates from "directional evidence" to "publishable" when: - N ≥ 50 distinct bugs, ≥ 5 distinct repos - Bonferroni-corrected p < 0.0083 across the three pairwise mode comparisons (A vs B, A vs C, B vs C) - Paired bootstrap 95% CI on the C–A effect strictly above zero - Cohen's |d_z| ≥ 0.3 - Per-bug exclusion rate < 10% - CI half-width less than the absolute effect This is the same five-condition gate used by the 14-task protocol (see `benchmarks/PHASE3_PROTOCOL.md` §9.1). We are not lowering it. ## What this benchmark cannot say - Nothing about cold-repo first-encounter performance — by design, mode C pre-seeds the lesson. The benchmark measures the value of *having recorded* lessons, not Cosmos's ability to discover bugs it has never seen. - Nothing about other AI clients — the harness wraps the Claude CLI. Cline / Cursor / Windsurf parity is a separate run, after. - Nothing about user behavior — the prompt is fixed, terse, and identical across modes. Real users phrase things differently. These are documented up front so a reader who needs them looks elsewhere instead of reading too much into our numbers. ## Cost 50 bugs × 3 modes × $0.50/run cap = **$75 max** per full sweep. Two-judge scoring adds ~$10. Realistic spend ≈ $40–60 per sweep since most runs come in well under cap. ## Reproducibility Anyone can re-run with: ``` git clone https://github.com/atitechs-cosmos/cosmos cd cosmos python3.12 -m venv .venv && source .venv/bin/activate pip install -r requirements-tier0.txt # Smoke test the harness — fakes the Claude calls, exercises every # code path, no API spend python -m benchmarks.regression_50.runner --dry-run # Single bug, all 3 modes — costs ~$1.50, sanity check before full python -m benchmarks.regression_50.runner --limit 1 # Full sweep (after operator quota check) python -m benchmarks.regression_50.runner ``` Bug list, results CSV, judge transcripts, raw `mcp_activity.jsonl` per run — all committed under `benchmarks/regression_50/results/`. ## Versioning `bugs.jsonl` is frozen per sweep. If we add a bug, we ship a new sweep with a date-stamped `bugs-YYYY-MM-DD.jsonl` and run all three modes from scratch on the new set. We never edit a published bug list in place.