# Regression-50 Benchmark — Pre-registered Protocol

**Status:** harness scaffold landed 2026-05-08. Bug curation in progress.

The Phase 3 result on the 14-task moat-category subset did not clear
the publication bar (95% CI on the rules-block effect crossed zero).
The next milestone is a larger, more diverse, more replicable suite
that an outside reviewer can verify end-to-end. This document
pre-registers what that suite is and what claims it can support.

## Goal

Measure whether Cosmos's Project Lessons mechanism actually prevents
re-derivation of fixes the project has already shipped — across 50
real bugs from 5 open-source repositories that the operator did
**not** author. Prevents the dogfood-bias problem of testing Cosmos
against Cosmos's own bugs.

## Bug selection criteria

A row makes it into `bugs.jsonl` only if all five hold:

1. **Public source.** The repo is on a public Git host with permissive
   license (MIT / Apache / BSD). No private code.
2. **Atomic fix.** A single commit (or ≤3 commits in a tight series)
   landed the fix. Multi-month refactors don't qualify.
3. **Pre/post commits both compile.** `git checkout sha_before` builds.
   `git checkout sha_after` builds. We don't include bugs whose
   surrounding tree is broken for unrelated reasons.
4. **Has a failing test that passes after the fix.** Either an existing
   test in the repo, or one we add as part of curation. The test
   becomes the oracle.
5. **Lesson would plausibly help.** The fix is non-obvious from
   reading the failing test alone — i.e. the bug exists *because*
   the obvious approach is wrong. If a junior dev would solve it in
   five minutes by reading the test, Cosmos has nothing to add.

Repos initially targeted (subject to the criteria above):

- **click** (Python CLI lib) — 10 bugs
- **fastapi** (Python web fw) — 10 bugs
- **astro** (TS web fw) — 10 bugs
- **tauri** (Rust + TS desktop fw) — 10 bugs
- **pydantic** (Python validation) — 10 bugs

Selection is done in advance, recorded with the SHA pair, and
**frozen** before any run. No post-hoc swaps.

## Modes

Three per bug, run independently with the brain sandboxed each time
(no leak between modes):

| Mode | Cosmos engine | Lesson seeded | Rules block |
|---|---|---|---|
| **A — baseline** | absent | n/a | no |
| **B — mcp_only** | available | no | no |
| **C — mcp_lesson** | available | yes (pinned to scope_glob) | yes |

The lesson seeded in C is hand-written from the actual fix commit
message + diff context. It's stored as a `code_errors` row pointing
at `scope_glob` (the file path or directory the bug lives in).

## Tasks per run

For each (bug, mode) tuple:

1. Check out `sha_before` in a temp working tree (no shared state
   with other runs — `git worktree add` per run).
2. Spawn a Claude CLI process with the appropriate tool allowlist
   and Cosmos MCP config (or no MCP for mode A).
3. Send the bug summary as the prompt: `"Fix the bug described
   below. Run the test at <test_command> to verify."`
4. The agent has at most N turns (default 12) and a hard cost cap of
   $0.50 per task.
5. After the agent stops, run the test command. Record:
   - Did the test pass? (binary)
   - Did the patch match the actual fix at `sha_after`? (judge call)
   - In mode C only: did Cosmos's lesson surface in the activity log
     before the first edit? (binary, from `mcp_activity.jsonl`)
   - Tokens in / tokens out (per Anthropic billing)
   - Wall-clock seconds

## Scoring

Each (bug, mode) gets a 0–4 score:

- **+1** the test passes after the agent's edit.
- **+1** the patch is on the same file as the actual fix.
- **+1** the patch addresses the same root cause (judge decides).
- **+1** in mode C, the lesson surfaced before the first edit
  (mode A and B always score 0 on this point).

The judge is a separate Claude call given the agent's diff and the
canonical fix diff, both stripped of mode signals (filenames + paths
preserved, but Cosmos-specific markers removed). Two judges run per
case; we report agreement rate.

## Statistical bar

A claim graduates from "directional evidence" to "publishable" when:

- N ≥ 50 distinct bugs, ≥ 5 distinct repos
- Bonferroni-corrected p < 0.0083 across the three pairwise mode
  comparisons (A vs B, A vs C, B vs C)
- Paired bootstrap 95% CI on the C–A effect strictly above zero
- Cohen's |d_z| ≥ 0.3
- Per-bug exclusion rate < 10%
- CI half-width less than the absolute effect

This is the same five-condition gate used by the 14-task protocol
(see `benchmarks/PHASE3_PROTOCOL.md` §9.1). We are not lowering it.

## What this benchmark cannot say

- Nothing about cold-repo first-encounter performance — by design,
  mode C pre-seeds the lesson. The benchmark measures the value of
  *having recorded* lessons, not Cosmos's ability to discover bugs
  it has never seen.
- Nothing about other AI clients — the harness wraps the Claude
  CLI. Cline / Cursor / Windsurf parity is a separate run, after.
- Nothing about user behavior — the prompt is fixed, terse, and
  identical across modes. Real users phrase things differently.

These are documented up front so a reader who needs them looks
elsewhere instead of reading too much into our numbers.

## Cost

50 bugs × 3 modes × $0.50/run cap = **$75 max** per full sweep.
Two-judge scoring adds ~$10. Realistic spend ≈ $40–60 per sweep
since most runs come in well under cap.

## Reproducibility

Anyone can re-run with:

```
git clone https://github.com/atitechs-cosmos/cosmos
cd cosmos
python3.12 -m venv .venv && source .venv/bin/activate
pip install -r requirements-tier0.txt

# Smoke test the harness — fakes the Claude calls, exercises every
# code path, no API spend
python -m benchmarks.regression_50.runner --dry-run

# Single bug, all 3 modes — costs ~$1.50, sanity check before full
python -m benchmarks.regression_50.runner --limit 1

# Full sweep (after operator quota check)
python -m benchmarks.regression_50.runner
```

Bug list, results CSV, judge transcripts, raw `mcp_activity.jsonl`
per run — all committed under `benchmarks/regression_50/results/`.

## Versioning

`bugs.jsonl` is frozen per sweep. If we add a bug, we ship a new
sweep with a date-stamped `bugs-YYYY-MM-DD.jsonl` and run all three
modes from scratch on the new set. We never edit a published bug
list in place.