# Phase 3 Benchmark — Pre-Registered Protocol

**Status: DRAFT — pre-registration. Not yet executed.**
**Document timestamp:** 2026-05-07 (before any task pool extension or
brain curation has happened — those are scheduled steps below)

This document locks the experimental design **before** we add tasks,
curate the brain, or run any paid benchmark. The intent is auditability:
a reviewer should be able to read this file, then read the eventual
results, and confirm we didn't move the goalposts to make Cosmos look
better.

If we deviate from this protocol mid-run, the deviation gets a
timestamped section appended below and the affected results get a
"deviated from protocol" footnote — not a quiet rewrite.

---

## 1. Hypothesis

**Primary:** Loading Cosmos's MCP server alone (Mode B) does not improve
end-to-end coding-task quality vs. baseline (Mode A) by a practically
meaningful amount, but loading MCP **plus** the Cosmos rules block
(Mode C) does — because the activation surface is what makes AI agents
actually call recall tools.

**Secondary:** On a curated brain (pinned + scope_globed lessons), the
gap C − B widens vs. a fresh empty brain — i.e., Cosmos's value scales
with the lesson library, not just the engine.

**Negative outcome we will publish:** if C ≈ B, we will not claim that
the rules block carries weight. We'll publish the null result and
keep the rules-installer feature off the marketing path.

---

## 2. Engineering setup

Locked at protocol-write time:

| Component | Version |
|---|---|
| OS | macOS arm64 (Apple Silicon) |
| Python | 3.12 from the `.venv` venv |
| Claude CLI | `/Users/kabir/.local/bin/claude` (whatever ships at run start; recorded in CSV) |
| Model | `claude-opus-4-7[1m]` (Anthropic-side) |
| Sidecar API version | ≥ 3 (PreflightBanner gate enforces) |
| Cosmos build | recorded by `/api/v2/status` → `build_version` field |
| MCP config | `.mcp.json` at repo root with key `"cosmos"` |
| ripgrep | 15.x (Phase 1 only) |

Run host stays the same machine across all phases. Claude CLI auth
uses the operator's $100 plan; no separate API key.

---

## 3. Task pool — pre-registered before extension

### 3.1 Source

**Task mix is intentionally weighted to reflect Cosmos's product
claim** — "prevent re-solving known project bugs". This is not a
neutral developer-search benchmark; it's a moat test, and the
weighting is disclosed up front so reviewers don't have to reverse-
engineer it. `past_lesson` + `apply_lesson` together total 14/36
(39%) of the pool, while `symbol_lookup` + `concept_search` (the
categories most other code-search benchmarks lean on) get 12/36
(33%). If the same task pool were used to evaluate generic-purpose
code search the weighting would invalidate the result; for evaluating
project-memory recall it's the right shape.

`benchmarks/tasks.json` (schema v1) — currently 20 tasks across these
categories:

| Category | Current count | Target count |
|---|---:|---:|
| `symbol_lookup`     | 5 | 6 |
| `cross_ref`         | 3 | 5 |
| `concept_search`    | 4 | 6 |
| `past_lesson`       | 3 | 8 |
| `apply_lesson`      | 2 | 6 |
| `bulk_exploration`  | 3 | 5 |
| **Total** | **20** | **36** |

### 3.2 Extension rules

Adding the 16 new tasks must follow these rules. Any deviation gets
recorded in §12 with a timestamp.

1. **Author cannot have seen Phase 3 task results before adding the
   task.** Tasks are added in one sitting, then the file is committed,
   *then* runs begin.
2. **Each new task must declare `expected_file` and `expected_keyword`
   at write time.** No retroactive correctness criteria.
3. **`past_lesson` and `apply_lesson` tasks must reference real lesson
   IDs** that already exist in the brain at the moment the task is
   written — verified by the schema check before commit. Same with
   files referenced — the file must exist in the target repo at task
   write time.
4. **Repo distribution stays roughly proportional to category** so we
   don't accidentally weight one project's quirks. Recorded in CSV
   per row.
5. **No task can be removed or reworded** after it's added unless
   the run errored (timeout / API failure) — rewordings due to "the
   model misunderstood" are a finding, not a bug.

### 3.3 Lock

Once §3.1 hits the target counts and §3.2 is satisfied, the
`tasks.json` file gets a `locked_at` timestamp + git-commit hash. From
that point forward the file is read-only for the run duration. A
deviation from the locked set is flagged in the report.

**Locked at: 2026-05-07T00:47:53Z**
**Commit at lock time: `d3181df245`** (the commit that landed the
curation log; tasks file extended in the next commit which this
section is part of)
**Final distribution:** symbol_lookup=6 · cross_ref=5 · concept_search=6
· past_lesson=8 · apply_lesson=6 · bulk_exploration=5 — 36 total.

**Driver:** `benchmarks/_extend_tasks_2026_05_07.py` — runs a leakage-
guard sweep across every new prompt (forbidden terms list pulled from
the curation log §11 leakage flag for lesson 78c5b62a) before writing.
Sweep returned 0 hits.

Any further mutation must produce a NEW dated tasks file, e.g.
`tasks-2026-MM-DD.json`. Editing this file in place is treated as
protocol deviation and gets logged in §12.

---

## 4. Brain conditions — two protocols

### 4.1 Brain Fresh

Empty `code_errors` table. **Critically: this is created in a temp DB,
NOT by deleting the user's real brain.** The user explicitly required
this so a benchmark can never touch production data.

#### 4.1.a SQLite-safe clone (mandatory)

A naive `shutil.copy(brain.db, …)` is **forbidden** because Cosmos runs
in WAL mode — the most recent committed writes can live in the
companion `brain.db-wal` file until checkpoint, and a file copy of
`brain.db` alone would silently miss them.

The runner uses `sqlite3.Connection.backup()` (the SQLite Backup API)
to stream pages from the live DB into a fresh file. This is online-
safe: the user's normal sidecar on port 7824 keeps writing during the
clone and the snapshot stays coherent. Implementation lives in
`benchmarks/_brain_sandbox.py::clone_brain_to_sandbox()`.

After the page-stream, the runner:

1. Sets `PRAGMA journal_mode=WAL` on the clone (matches production)
2. If `fresh=True`: `DELETE FROM code_errors` in the clone only —
   other tables (memories_v2, code_index, code_fts, code_summary,
   …) survive so MCP code search still works against the same
   indexed code base
3. Initializes an empty `mcp_activity.jsonl` in the sandbox dir —
   never copies the real one, otherwise the activation-rate metric
   would mix benchmark calls with the user's earlier sessions
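The clone steps above can be sketched with the standard-library `sqlite3` module. This is a minimal illustration, not the contents of `_brain_sandbox.py`: the function signature, the `brain.db` file name inside the sandbox, and the error handling are assumptions.

```python
import sqlite3
from pathlib import Path


def clone_brain_to_sandbox(live_db: Path, sandbox_dir: Path, fresh: bool) -> Path:
    """Clone live_db into sandbox_dir via the SQLite Backup API (sketch)."""
    sandbox_dir.mkdir(parents=True, exist_ok=True)
    clone_path = sandbox_dir / "brain.db"  # assumed file name

    src = sqlite3.connect(str(live_db))
    dst = sqlite3.connect(str(clone_path))
    try:
        # Page-stream copy; unlike a file copy it also sees content
        # that is still sitting in the WAL file.
        src.backup(dst)
        # Step 1: match production journaling on the clone.
        dst.execute("PRAGMA journal_mode=WAL")
        if fresh:
            # Step 2: empty the lesson table in the clone only.
            dst.execute("DELETE FROM code_errors")
            dst.commit()
    finally:
        dst.close()
        src.close()

    # Step 3: start with an empty activity log; never copy the real one.
    (sandbox_dir / "mcp_activity.jsonl").write_text("")
    return clone_path
```

The source connection is only read from, so the live DB keeps its rows regardless of the `fresh` flag.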

#### 4.1.b Env-var contract — what the alternate sidecar sees

The runner sets the following before spawning the sandbox sidecar.
Every sandbox process MUST report these in `/api/v2/status`'s `runtime`
block; the runner asserts on the returned values before the first
benchmark task fires.

| Variable | Required | Effect |
|---|---|---|
| `COSMOS_BRAIN_DB`              | yes | Absolute path to the cloned DB |
| `COSMOS_DATA_DIR`              | yes | Directory containing the clone — `mcp_activity.jsonl` lives here |
| `COSMOS_ACTIVITY_LOG`          | optional | Override activity log path independent of DATA_DIR |
| `COSMOS_DISABLE_LESSON_MIRROR` | yes (=1) | Skip every `.cosmos/lessons.{md,json}` rebuild call |
| `COSMOS_BENCHMARK`             | yes (=1) | Belt-and-suspenders flag any future side-effect path can short-circuit on |
| `COSMOS_KEEP_SANDBOX`          | optional | Skip cleanup so a failed run can be inspected |

The `COSMOS_DISABLE_LESSON_MIRROR` flag is essential. The
project_registry that resolves `project_id → repo path` lives outside
the brain DB — even when the cloned DB is empty, calling
`code_remember_error` would walk the real registry, find the user's
real repo, and rewrite `.cosmos/lessons.{md,json}` in the working
directory. The flag short-circuits `_rebuild_for_project` entirely.
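As an illustration of the contract, the runner could build the sandbox environment roughly like this. The helper name `sandbox_env` and the value `"1"` for `COSMOS_KEEP_SANDBOX` are assumptions; the variable names come from the table above.

```python
import os
from pathlib import Path
from typing import Dict


def sandbox_env(clone_db: Path, data_dir: Path, keep_sandbox: bool = False) -> Dict[str, str]:
    """Return a copy of the current environment plus the sandbox overrides."""
    env = dict(os.environ)
    env.update({
        "COSMOS_BRAIN_DB": str(clone_db.resolve()),   # absolute path to the clone
        "COSMOS_DATA_DIR": str(data_dir.resolve()),   # holds mcp_activity.jsonl
        "COSMOS_DISABLE_LESSON_MIRROR": "1",          # no .cosmos/lessons rebuilds
        "COSMOS_BENCHMARK": "1",                      # belt-and-suspenders flag
    })
    if keep_sandbox:
        env["COSMOS_KEEP_SANDBOX"] = "1"              # assumed value
    return env
```

The returned dict would be passed as `env=` to `subprocess.Popen` when spawning the sandbox sidecar; the sidecar command itself is not shown because it is not specified here.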

#### 4.1.c Pre-flight assertion

Before the first task runs, the runner hits the sandbox sidecar's
`/api/v2/status` and asserts:

- `runtime.brain_db_path` matches the clone path it just wrote
- `runtime.lesson_mirror_enabled` is `false`
- `runtime.is_benchmark_sandbox` is `true`
- `runtime.env_overrides_set` includes the four required keys above
- The clone's `code_errors` row count matches expectation (0 for
  Fresh, current count for Curated)

If any assertion fails the run aborts with a deviation log entry. The
real DB is never touched.
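A sketch of the pre-flight check as a pure function over the parsed `/api/v2/status` response. The field names follow the list above; the function name, the assumption that `env_overrides_set` is a list of variable names, and the failure messages are illustrative.

```python
from typing import Any, Dict, List

REQUIRED_ENV_KEYS = {
    "COSMOS_BRAIN_DB",
    "COSMOS_DATA_DIR",
    "COSMOS_DISABLE_LESSON_MIRROR",
    "COSMOS_BENCHMARK",
}


def preflight_failures(status: Dict[str, Any], clone_path: str,
                       clone_row_count: int, expected_rows: int) -> List[str]:
    """Return human-readable failures; an empty list means the run may start."""
    runtime = status.get("runtime", {})
    failures = []
    if runtime.get("brain_db_path") != clone_path:
        failures.append("brain_db_path does not match the clone")
    if runtime.get("lesson_mirror_enabled") is not False:
        failures.append("lesson mirror is not disabled")
    if runtime.get("is_benchmark_sandbox") is not True:
        failures.append("sidecar does not report benchmark sandbox mode")
    missing = REQUIRED_ENV_KEYS - set(runtime.get("env_overrides_set", []))
    if missing:
        failures.append("missing env overrides: " + ", ".join(sorted(missing)))
    if clone_row_count != expected_rows:
        failures.append(f"code_errors rows: {clone_row_count} != {expected_rows}")
    return failures
```

Returning every failure at once, rather than raising on the first, gives the deviation log entry a complete picture of what was wrong.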

### 4.2 Brain Curated

Same as the user's live brain, but with a curated layer applied
**before any task is added or any task result is observed**. The
curation:

- Pins 3–5 lessons the operator selects (without seeing Phase 3 task
  results)
- Adds `scope_globs` to lessons whose files repeatedly appear in
  bug-magnet rankings (vite.config.ts, mcp_server.py, etc.)

**Curation timestamp is recorded** in the report. If `tasks.json` has
already been locked when curation begins, that's recorded too —
curation must precede task extension OR be done from existing tasks
only (i.e. operator must not see new task content before curating).

The Brain Curated sweep runs against the user's normal sidecar on
`127.0.0.1:7824` since curation is a normal user operation.

### 4.3 No third "Brain Adversarial" condition

We considered including an "all lessons disabled" condition to bound
the lower end. Rejected — too easy to misread as "Cosmos with
features turned off" rather than "synthetic worst case". Brain Fresh
already provides the floor.

### 4.4 Repo-view isolation — `.cosmos/` and instruction files excluded

Added per operator review: even with a sandboxed brain DB, the AI
agent in any mode can `Read('.cosmos/lessons.md')` against the live
repo path passed via `--add-dir` and pull the same recipe data the
MCP layer is supposed to gate. Without isolation, Mode A's "no
Cosmos" baseline would silently grep its way to the answer through
the file mirror — invalidating every between-mode comparison.

For the same reason the temp repo view also excludes the project's
own AI-instruction files, so the only between-mode difference is
the rules-block prefix in Mode C.

The runner copies each task's `repo_path` into a temp directory
(`benchmarks/_repo_view.py::prepare_clean_repos`) before the first
task runs, then rewrites `repo_path` in the in-memory task list to
point at the copy. The copy excludes:

| Excluded path | Why |
|---|---|
| `.cosmos/`         | The lesson mirror — Mode A could grep this otherwise |
| `CLAUDE.md`        | Cosmos guidance pre-installed; Mode B would pick it up for free |
| `AGENTS.md`        | Same — Codex-style guidance |
| `.cursor/`         | Cursor rules dir |
| `.clinerules`      | Cline rules file |
| `.windsurfrules`   | Windsurf rules file |
| `.git/`            | Avoid git blame / log shortcuts to attribution |
| `node_modules/`, `.venv/`, `target/`, `dist/`, `.next/` | Bloat, never read by AI in normal benchmark tasks |

Applied to **all three modes**, not just baseline. Mode B's "MCP
only" condition explicitly means "MCP tools available, no other
Cosmos guidance available" — so a CLAUDE.md that says "use cosmos
tools" would conflate the comparison.
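The filtered copy can be sketched with `shutil.copytree` and `shutil.ignore_patterns`. This is an illustration only: the single-repo helper below is hypothetical (the real entry point is `prepare_clean_repos`), and `ignore_patterns` matches names at every directory depth, which may be broader than the runner's actual behaviour.

```python
import shutil
from pathlib import Path

# Names taken from the exclusion table above.
EXCLUDED = (
    ".cosmos", "CLAUDE.md", "AGENTS.md", ".cursor", ".clinerules",
    ".windsurfrules", ".git", "node_modules", ".venv", "target",
    "dist", ".next",
)


def prepare_clean_repo(repo_path: Path, view_root: Path) -> Path:
    """Copy repo_path under view_root, skipping every excluded name."""
    dest = view_root / repo_path.name
    shutil.copytree(repo_path, dest, ignore=shutil.ignore_patterns(*EXCLUDED))
    return dest
```

The returned path is what the in-memory task list would point `repo_path` at; the original repo is only read.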

**Production interpretation note**: this means the benchmark measures
the marginal value of injecting the rules block at task time, vs.
relying on MCP tools to be discovered organically. In production,
users install the rules block into their persistent CLAUDE.md / 
.cursor/rules and get the benefit on every session. The benchmark
approximates this — the production gain may be larger because the
rules-block guidance accumulates across sessions, while the
benchmark restarts cold for each task.

---

## 5. Modes (already implemented in `phase3_modes.py`)

| Mode | MCP loaded? | Rules block prefixed? |
|---|---|---|
| `A · baseline` | no | no |
| `B · mcp_only` | yes | no |
| `C · mcp_plus_rules` | yes | yes |

Mode C prefixes the rules block onto the user prompt. This is not
bit-identical to a CLAUDE.md preamble that the AI client loads
itself, but it's the closest `claude -p` lets us get without
shelling out to a sandbox dir. Recorded as a known limitation.

---

## 6. Run protocol

### 6.1 Order

The (brain × mode × task) cells are ordered as follows:

1. Random shuffle of the task order with seed = `cosmos-phase3-2026-05`
   (a fixed string committed in the runner so anyone can reproduce
   the exact sequence).
2. All 3 modes for one task run back-to-back (paired design — see §10).
3. Cells are interleaved across brain conditions in alternating
   blocks of 5 tasks so any time-of-day model variation hits both
   conditions equally.
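One way to read the ordering rules above, as a sketch. The seed string and the block size come from this section; the function name, the brain labels, and the choice to flip which brain goes first in every other block are assumptions about how "alternating blocks" is realised.

```python
import random
from typing import List, Sequence, Tuple

SHUFFLE_SEED = "cosmos-phase3-2026-05"
MODES = ("baseline", "mcp_only", "mcp_plus_rules")
BRAINS = ("fresh", "curated")


def build_schedule(task_ids: Sequence[str], block: int = 5) -> List[Tuple[str, str, str]]:
    """Return (brain, task, mode) cells in a reproducible order."""
    order = list(task_ids)
    # A string seed gives the same sequence on every run and machine.
    random.Random(SHUFFLE_SEED).shuffle(order)

    schedule = []
    for block_index, start in enumerate(range(0, len(order), block)):
        chunk = order[start:start + block]
        # Alternate which brain condition goes first in each block.
        brains = BRAINS if block_index % 2 == 0 else BRAINS[::-1]
        for brain in brains:
            for task in chunk:
                # All three modes for one task run back-to-back.
                for mode in MODES:
                    schedule.append((brain, task, mode))
    return schedule
```

For a 36-task pool this yields 216 cells, matching the planned matrix size.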

### 6.2 Cost cap

Hard limits enforced in the runner:

- **Estimated cost cap: $15 USD** — runner sums Anthropic-reported
  `total_cost_usd` per call and stops entering new tasks if the
  running total + an estimated next-task cost would exceed $15.
- **Run-count cap: planned matrix size.** For the 36-task pool
  in §3.1 with 3 modes × 2 brains, that's 216 runs + 432 judge
  calls (two per answer, §7.1). If the runner is asked to do more it aborts.
- **Per-task wall cap: 180 s** — already in `run_claude()`. Tasks
  that hit this get marked `error: timeout` and excluded from
  primary metrics; they still appear in the per-task table.

If the cost cap fires we'll publish whatever subset completed, with
the cap reason annotated.
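The two budget limits can be sketched as a small accumulator. The `record_run(cost_usd, *, count_as_run=True)` signature and the `would_exceed_cap()` name match the runner as described in the §12 deviation log; the class name, the field names, and the use of a plain running mean as the next-task estimate are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CostCap:
    max_usd: float = 15.0
    max_runs: int = 216
    spent_usd: float = 0.0
    completed_runs: int = 0
    task_spend_usd: float = 0.0  # cost of counted task runs only

    def record_run(self, cost_usd: float, *, count_as_run: bool = True) -> None:
        """Add a call's cost; judge calls pass count_as_run=False."""
        self.spent_usd += cost_usd
        if count_as_run:
            self.completed_runs += 1
            self.task_spend_usd += cost_usd

    def estimated_next_cost(self) -> float:
        # Assumption: running mean of task-run costs so far.
        if self.completed_runs == 0:
            return 0.0
        return self.task_spend_usd / self.completed_runs

    def would_exceed_cap(self) -> Optional[str]:
        """Return a reason string if the next task must not start."""
        if self.completed_runs >= self.max_runs:
            return (f"run-count cap reached: {self.completed_runs} "
                    f"completed >= {self.max_runs} planned")
        if self.spent_usd + self.estimated_next_cost() > self.max_usd:
            return f"cost cap: ${self.spent_usd:.2f} spent of ${self.max_usd:.2f}"
        return None
```

Keeping judge costs in `spent_usd` but out of `completed_runs` lets the dollar cap see every call while the run-count cap counts task cells only.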

### 6.3 Incremental save

CSV writes after every task completes (all three modes) so a crashed
sidecar / expired Claude session doesn't lose hours of runs.

---

## 7. Judge protocol

### 7.1 Two-judge mean with blinding

Each task answer is graded by **two independent Claude calls** (judge
prompt template fixed in `phase2_interactive.py` lines 171-185 and
copied by `phase3_modes.py`). The mean of the two grades is the
recorded score; if the two judges disagree by > 1 grade, the task is
flagged `judge_disagreement=true` and excluded from the primary
metric (still shown in the per-task table).
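The scoring rule is small enough to state as code. A minimal sketch; the function name and the returned dict layout are illustrative.

```python
from typing import Dict, Union


def combine_judges(grade_a: float, grade_b: float) -> Dict[str, Union[float, bool]]:
    """Mean of two judge grades plus the disagreement flag."""
    return {
        "score": (grade_a + grade_b) / 2,
        # More than one full grade apart: excluded from the primary metric.
        "judge_disagreement": abs(grade_a - grade_b) > 1,
    }
```

Grades of 3 and 4 give a score of 3.5 and stay in the primary metric; grades of 1 and 3 give 2.0 but are flagged and excluded.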

**Excluded-row count is published as a primary number**, not buried
in an appendix — `report.md` headlines:

> "Primary metric computed over N kept rows. M rows excluded
> (judge_disagreement=K, timeout=L, run_error=…). M/(N+M) = X%; flag
> if > 10% so reviewers can decide if the run is interpretable."

If exclusions exceed 10% the result is downgraded to
"directional/internal" regardless of CI. Hiding the rows would let
us cherry-pick the easy half of the matrix while still claiming
publication-grade rigor; making them headline keeps the temptation
out of reach.

### 7.2 Mode stripping before judging

Before the answer is sent to the judge it gets passed through a
filter that removes:

- Any string starting with `mcp__cosmos__`
- Any standalone occurrence of Cosmos-specific tool names
  (`find_relevant_code`, `code_get_symbol`, `code_callers`,
  `code_callees`, `code_uses`, `code_skeleton`, `code_context_bundle`,
  `code_search`, `code_list_errors`, `code_remember_error`)
- The literal phrase "Cosmos MCP" / "the Cosmos brain"

These would otherwise leak which mode produced the answer to the judge.
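A sketch of such a filter using regular expressions. The tool names are the ones listed above; the function name, the case-insensitive phrase match, and the whitespace clean-up are assumptions rather than the runner's actual behaviour.

```python
import re

COSMOS_TOOLS = (
    "find_relevant_code", "code_get_symbol", "code_callers", "code_callees",
    "code_uses", "code_skeleton", "code_context_bundle", "code_search",
    "code_list_errors", "code_remember_error",
)

_PATTERNS = [
    # Any token that starts with the MCP prefix.
    re.compile(r"mcp__cosmos__\w*"),
    # Standalone tool names (word boundaries keep longer identifiers intact).
    re.compile(r"\b(?:" + "|".join(COSMOS_TOOLS) + r")\b"),
    # The two literal phrases.
    re.compile(r"\bthe Cosmos brain\b|\bCosmos MCP\b", re.IGNORECASE),
]


def strip_mode_markers(answer: str) -> str:
    """Remove strings that would reveal the mode to the judge."""
    for pattern in _PATTERNS:
        answer = pattern.sub("", answer)
    # Collapse the gaps left behind by removed tokens.
    return re.sub(r"[ \t]{2,}", " ", answer)
```

Because `_` counts as a word character, an identifier such as `my_code_search_helper` is left untouched.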

### 7.3 Judge model + version

Same Claude version as the task runner (`claude-opus-4-7[1m]`). Recorded
per-row in CSV.

---

## 8. Metrics

### 8.1 Primary

**Mean correctness score (0–4)** per (brain × mode) cell. Reported
with paired bootstrap 95% CI (see §10).

This is what we will or will not put on the website.

### 8.2 Secondary

Reported but never headline:

- **Mean cost (USD)** per cell — Anthropic-reported `total_cost_usd`
- **Mean wall (s)** per cell
- **Activation rate** — % of tasks where the AI called ≥ 1
  `mcp__cosmos__*` tool. Computed from the activity log emitted by
  the temp sidecar. For Mode A this is 0 by construction.
- **Lesson recall rate** — % of tasks where the AI called
  `find_relevant_code` or `code_list_errors` specifically. Mode A: 0.
- **Unnecessary call count** — for tasks with no relevant lesson,
  count how often the AI still invoked Cosmos tools. Negative metric:
  high values mean rules block over-fires.
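The first two rates can be computed from a per-task list of tool calls. The sketch below assumes the activity log has already been grouped into a `task_id -> [tool names]` mapping; that grouping step, the function name, and the output keys are hypothetical, since the log's on-disk schema is not specified here.

```python
from typing import Dict, List

PREFIX = "mcp__cosmos__"
LESSON_TOOLS = {"find_relevant_code", "code_list_errors"}


def secondary_rates(calls_by_task: Dict[str, List[str]]) -> Dict[str, float]:
    """Percent of tasks with any Cosmos call, and with a lesson-recall call."""
    n = len(calls_by_task)
    if n == 0:
        return {"activation_rate": 0.0, "lesson_recall_rate": 0.0}
    activated = sum(
        1 for calls in calls_by_task.values()
        if any(c.startswith(PREFIX) for c in calls)
    )
    recalled = sum(
        1 for calls in calls_by_task.values()
        if any(c.startswith(PREFIX) and c[len(PREFIX):] in LESSON_TOOLS
               for c in calls)
    )
    return {
        "activation_rate": 100.0 * activated / n,
        "lesson_recall_rate": 100.0 * recalled / n,
    }
```

Tasks with no log entries must still appear in the mapping with an empty list, otherwise the denominators shrink and both rates are overstated.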

### 8.3 Lesson-dropout sub-test (past_lesson tasks only)

For each task in the `past_lesson` category, **three additional runs**
are added to Mode C only:

1. Lesson `pinned=true` (most aggressive surface)
2. Lesson `pinned=false, disabled=false` (default)
3. Lesson `disabled=true` (Cosmos cannot see it)

If the score drops sharply at variant 3, the lesson did the work. If
all three score the same, the AI got it from base knowledge — moat
not real for this task. Reported per-task, not aggregated as headline.

---

## 9. Publication policy

### 9.1 What we will publish

Three publication tiers — gating gets stricter the more public the
claim:

#### Tier 1 — Web headline ("publishable")

Requires **all** of:
- Primary metric C − B paired bootstrap 95% CI **strictly above zero**
  in the Brain Curated condition
- p-value below the Bonferroni-corrected α of 0.0083 (see §10.3)
- Effect size `|d_z| > 0.3`
- Excluded rows < 10% of total
- CI width less than the absolute effect (rules out "barely
  significant + huge variance" claims)

Example claim that would qualify:

> "Across N benchmark tasks on the Cosmos repo, AI agents with the
> Cosmos rules block prefixed scored mean Δ correctness +X.XX vs.
> with Cosmos's MCP alone (95% CI [X, Y], paired bootstrap, n=N,
> Bonferroni-corrected p < 0.008, d_z = X.X)."

Plus the raw CSV + protocol + curation timestamps as a downloadable
artifact under `public/benchmark/phase3-2026-05-DD/`.
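The five Tier 1 requirements can be checked mechanically. A sketch, assuming the statistics have already been computed for the C − B comparison on the Curated brain; the function and key names are illustrative.

```python
from typing import Dict

BONFERRONI_ALPHA = 0.05 / 6  # six planned tests, see §10.3


def tier1_checks(mean_delta: float, ci_low: float, ci_high: float,
                 p_value: float, d_z: float,
                 excluded_rows: int, total_rows: int) -> Dict[str, bool]:
    """Evaluate each Tier 1 gate; 'passes' is True only if all hold."""
    checks = {
        "ci_strictly_above_zero": ci_low > 0,
        "bonferroni": p_value < BONFERRONI_ALPHA,
        "effect_size": abs(d_z) > 0.3,
        "exclusions_below_10pct": excluded_rows / total_rows < 0.10,
        "ci_narrower_than_effect": (ci_high - ci_low) < abs(mean_delta),
    }
    checks["passes"] = all(checks.values())
    return checks
```

Reporting each gate separately, instead of a single boolean, makes it visible which requirement a run failed.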

#### Tier 2 — "Directional / internal evidence"

When uncorrected p < 0.05 but Bonferroni gate misses, OR effect size
is below the 0.3 threshold, OR exclusions are 10–25%:

- Result lands in `benchmarks/results/` and the protocol log
- May be referenced internally and in technical docs
- **Cannot** appear on the marketing site as an effect-size claim
- Web /benchmark page may quote: *"Internal directional evidence
  suggests rules-block carries weight; n insufficient for
  Bonferroni-grade publication. Re-run with 50+ tasks planned."*

#### Tier 3 — Suppressed

CI crosses zero / exclusions > 25% / run errored mid-matrix:

- Raw data still committed under `benchmarks/results/` (audit trail)
- No public claim of any kind
- Protocol §12 deviation log records why the run is suppressed
- Triggers a re-design pass before the next attempt

### 9.2 What we will not publish

- Phase 1 cost numbers as a "Cosmos beats grep" headline. Phase 1
  becomes a sidebar regression-guard claim:
  > "Indexed lookup latency: 0.6–1.4 ms median across 63–1,448-file
  > repos. Index build excluded."
- Any aggregate that includes timed-out or judge-disagreement tasks
- Any number whose CI includes zero
- Phase 3.5 (real-activity replay) as headline — exploratory only

### 9.3 Null result handling

If the CI for C − B includes zero:

- We will not claim the rules block earns its weight on this run.
- We will publish the null result internally as "current evidence
  insufficient — re-run with curated brain + 50+ tasks needed".
- The rules-installer feature stays in product but loses its
  marketing claim.

---

## 10. Statistics

### 10.1 Paired bootstrap CI

For the primary metric (correctness score) we report:

- **Mean** of paired differences (per-task: `score_C − score_B`)
- **95% CI** via 1000 bootstrap resamples of the task list (resample
  with replacement, take 2.5th and 97.5th percentile)
- **Effect size**: Cohen's `d_z` for paired samples

Why paired: each task is run in all three modes back-to-back. Welch's
t-test (independent samples) overestimates variance in this setup —
pairing controls for task difficulty.
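A standard-library sketch of the paired bootstrap and `d_z`. The function name, the fixed RNG seed, and the simple sorted-index percentile are assumptions; the resample count and percentiles follow the text above.

```python
import random
import statistics
from typing import Dict, Sequence


def paired_bootstrap(score_c: Sequence[float], score_b: Sequence[float],
                     n_resamples: int = 1000, seed: int = 0) -> Dict[str, float]:
    """Mean paired difference, percentile 95% CI, and Cohen's d_z."""
    diffs = [c - b for c, b in zip(score_c, score_b)]
    n = len(diffs)
    rng = random.Random(seed)

    # Resample tasks (not rows) with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    ci_low = means[int(0.025 * n_resamples)]
    ci_high = means[int(0.975 * n_resamples) - 1]

    mean = statistics.fmean(diffs)
    sd = statistics.stdev(diffs) if n > 1 else 0.0
    # d_z = mean of differences / SD of differences; undefined when SD is 0.
    d_z = mean / sd if sd > 0 else float("nan")
    return {"mean": mean, "ci_low": ci_low, "ci_high": ci_high, "d_z": d_z}
```

Resampling the per-task differences keeps each task's C and B scores together, which is what makes the interval a paired one.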

### 10.2 Paired permutation test

For null-hypothesis significance: 10,000 permutations of the
within-task sign. Report exact p-value. Significance threshold: 0.05,
but published only if effect size is also non-trivial (`|d_z| > 0.3`).
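A sketch of the sign-flip test on the paired differences. The function name and seed are assumptions. When all `2**n` sign assignments fit within the permutation budget the sketch enumerates them, which gives an exact p-value; otherwise it falls back to random sign flips with the usual `+1` correction, which is a Monte Carlo estimate rather than an exact value.

```python
import itertools
import random
from typing import Sequence


def sign_flip_p_value(diffs: Sequence[float], n_permutations: int = 10_000,
                      seed: int = 0) -> float:
    """Two-sided paired permutation p-value for the mean difference."""
    n = len(diffs)
    # |sum| orders outcomes the same way as |mean| because n is fixed.
    observed = abs(sum(diffs))
    tol = 1e-12

    if 2 ** n <= n_permutations:
        # Small n: enumerate every sign assignment exactly.
        hits = sum(
            1 for signs in itertools.product((1, -1), repeat=n)
            if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - tol
        )
        return hits / 2 ** n

    rng = random.Random(seed)
    hits = 0
    for _ in range(n_permutations):
        stat = abs(sum(d if rng.random() < 0.5 else -d for d in diffs))
        if stat >= observed - tol:
            hits += 1
    return (hits + 1) / (n_permutations + 1)
```

With ten differences that are all +1, only the all-positive and all-negative assignments reach the observed statistic, so the exact p-value is 2/1024.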

### 10.3 Multiple comparisons — Bonferroni primary, uncorrected secondary

We have three planned comparisons (B-A, C-B, C-A) per brain condition,
across two brain conditions: 3 × 2 = six primary tests. The
lesson-dropout sub-test (§8.3) is reported per task and does not add
to the count. Bonferroni-corrected α: `0.05 / 6 ≈ 0.0083`.

**Bonferroni is the publication gate**. The web /benchmark page can
quote a comparison only if it clears Bonferroni AND §9.1 Tier 1's
other requirements (CI not crossing zero, effect size, exclusion %).

**Uncorrected p-values are reported as secondary** in the report —
useful for direction-of-effect intuition, but not citable. A Tier 2
"directional evidence" claim can reference uncorrected p < 0.05 as
long as the wording explicitly says "directional / internal" and not
"statistically significant".

This split exists because the user explicitly required it: "Use the
strict gate as the primary claim gate, and report the uncorrected
p-value as secondary/exploratory" (translated from Thai). Documented
here so reviewers see the gate isn't being slid mid-run.

---

## 11. Curation log (filled in as we go)

Curation events get written here in append-only fashion before runs
begin. Each entry has a timestamp + author + the lesson IDs touched.
This is what proves curation didn't see task results.

### 2026-05-07 — Phase 3 initial curation pass

- **Method:** REST `PATCH /api/v1/projects/errors/{id}` via the live
  sidecar (production DB). Driver script: `benchmarks/_apply_curation_2026_05_07.py`.
  Run ID `1ad3819c80e7` started at `2026-05-06T18:08:14Z`, finished
  one second later. Exit code 0.
- **Pre-curation invariants asserted before any mutation:**
  - `benchmarks/tasks.json` clean (no uncommitted modifications)
  - `tasks.json` entry count = 20 (the pre-Phase-3 state)
  - Sidecar `api_version` ≥ 3
  - Git HEAD recorded at `2881026b1f` for round-trip auditability
  - `tasks.json` last commit `03d811e797`
  - `no_tasks_added_before_curation: true`
- **Selection algorithm:** documented inline in
  `_apply_curation_2026_05_07.py` and serialised into the log under
  `selection_algorithm`. Deterministic, objective, weights pre-set.
- **Operator approval:** chat round on 2026-05-07, "approve all 5
  rows as-is" (translated from Thai), per-row review captured one
  chat round earlier.
- **Lessons mutated (5):**
  - `4cd762e0` — Vite optimizeDeps barrel pattern · pinned=False→True · scope_globs=[]→`["vite.config.ts","package.json"]`
  - `01933dc9` — Graph focus FPS / R3F freeze · pinned=False→True · scope_globs=[]→`["src/components/GraphView/**"]`
  - `77b13dd3` — FTS5 plain-string coverage gap · pinned=False→True · scope_globs=[]→`["core/code_indexer/**","core/api/mcp_server.py"]`
  - `78c5b62a` — MCP stdio idle-hang · pinned unchanged (False) · scope_globs=[]→`["core/api/mcp_server.py"]`
  - `ca204c2e` — Summarizer race + watcher self-trigger · pinned unchanged (False) · scope_globs=[]→`["core/code_indexer/project_summarizer.py","core/code_indexer/watcher.py"]`
- **Lessons NOT touched** (operator policy — leave sibling /
  duplicate / single-shot lessons alone unless doing duplicate
  cleanup as a separate, non-benchmark commit): `e9d5fe1b`,
  `28fd4c7c`, `25ef5d4a`, plus the seven lessons not in the top 5.
- **Side-effect:** REST PATCH triggered `_rebuild_for_project` for
  the AI-Bran project (sidecar runtime has `lesson_mirror_enabled:
  true`). The project's `.cosmos/lessons.{md,json}` files now
  reflect the pinned/scoped state. Mirror update committed as part
  of the same git commit.
- **Leakage flag:** lesson `78c5b62a` (MCP stdio hang) is at MEDIUM
  leakage risk. When `tasks.json` is extended, no task may use
  wording near the lesson's symptom verbatim. Will be cross-checked
  during task-pool review.
- **Full audit artifact:** `benchmarks/curation_log_2026-05-07.json`
  — committed prior to the next task-pool extension.

No further curation is permitted until either (a) the Phase 3 run
completes and the protocol allows a follow-up, or (b) a separate
non-benchmark cleanup commit lands with explicit "duplicate
cleanup" rationale.

---

## 12. Deviation log (filled in as we go)

Anything that deviates from this document gets a timestamped entry
here so the report can footnote it.

### 2026-05-07 — Single-task smoke test ($0.7367, 6 runs) revealed cost overrun · scope reduced to moat categories

**Smoke run:** `--limit 1` (task A1 = symbol_lookup) × 3 modes × 2
brains = 6 task calls + 12 judge calls. Wiring green; results
landed in `benchmarks/results/phase3-2026-05-07-draft.{csv,json,md}`.

**Findings:**

1. Engine alone (mcp_only) replicated the Phase 2 cost reduction —
   −46% on Brain Fresh, −39% on Brain Curated, both at the same
   4.0/4 score.

2. **Rules block on a non-lesson task is pure overhead.** Token
   count jumped 74,790 → 128,910 (+72%) in Mode C on Brain Fresh
   without quality improvement. Cost +89% vs baseline. This was
   foreseeable — rules tell the AI to use cosmos tools, but A1's
   query has no matching pinned/scoped lesson, so the verbose
   chain-of-thought is uncompensated.

3. **Linear extrapolation breaches $15 cap by 3.5×.** Single-task
   $0.74 × 72 cells projects to ~$53 for the full 36×3×2 matrix.

**Decision (operator-approved):** restrict the next paid run to
the categories the protocol's primary product claim actually tests
— past_lesson + apply_lesson only — and run Brain Curated first
on its own. If C > B in Curated holds across the moat categories,
THEN escalate to Brain Fresh subset to prove the gap is from the
lesson library, not from the rules-block wording in isolation.

**Subset matrix:**
- 8 past_lesson tasks (T-PL1..PL3 + T-PL4..PL8) + 6 apply_lesson
  tasks (T-AL1..AL2 + T-AL3..AL6) = 14 tasks
- 14 × 3 modes × 1 brain (Curated) = 42 task calls + 84 judge calls
- Projected cost: $5.60–$11.20 (well within $15 cap)

**Reporting consequence:** subset results land at Tier 2
"directional/internal evidence" per §9.1 — even if Bonferroni
passes at the planned α=0.0083 the matrix is too narrow to claim
a Tier 1 web-headline. To preserve the pre-registered primary
product claim, the operator wording for any subsequent web copy
will be:

> "On known-bug and lesson-application tasks, Project Lessons
> rules improved recall/correctness from X to Y."

— scoped to the moat categories that were actually tested. NO
global "Cosmos rules improve everything" language unless a later
run extends to symbol_lookup / cross_ref / concept_search /
bulk_exploration with positive results.

**Bonferroni denominator stays at 6** (the pre-registered planned
test count) even though the subset runs fewer tests. Lowering the
denominator mid-stream would look like moving goalposts. Better to
stay strict and accept Tier 2 framing.

**Two minor runner fixes shipped alongside the deviation:**
- `--categories` CLI flag added so the subset can be selected
  declaratively rather than by hand-editing tasks.json
- Judge-call costs are now also recorded in the cost cap (smoke
  run undercounted by ~$0.06 — small but worth being honest about)

### 2026-05-07 — D2 paid run aborted at task 5/14 due to run-count accounting bug · resumed from task 6

**What happened:** Started the D2 paid run after the categories+
judge-cost commit landed. Wired correctly: `--categories
past_lesson,apply_lesson --only-brain curated`. The run completed
5 tasks × 3 modes = 15 task calls + 30 judge calls = 45 cap entries.
Run-count cap was set to `expected_task_runs` = 14 × 3 × 1 = 42.
Before entry 46 (the next task's first mode), `would_exceed_cap()`
fired with reason `run-count cap reached: 45 completed ≥ 42 planned`.

**Root cause:** The judge-cost commit added
`cap.record_run(total_cost_usd)` inside `judge_call` to fix a
$0.06 cost-undercounting issue. But `record_run` was incrementing
BOTH `spent_usd` AND `completed_runs` — so judge calls polluted
the run-count cap that was supposed to count task cells only.
Dollar accounting was correct ($4.42 spent of $15 budget); the
abort was purely a counting bug.

**Resolution (this commit):**

1. `_cost_cap.py` `record_run(cost_usd, *, count_as_run=True)` —
   judges pass `count_as_run=False` so they accumulate cost only,
   no run-count, no EWMA pollution. Two new tests pin this.
2. `phase3_modes.py judge_call` passes `count_as_run=False` to
   the cap.
3. `phase3_modes.py` adds `--start-from N` CLI flag — skips the
   first N tasks in the deterministic shuffle order. Same
   `SHUFFLE_SEED` guarantees index alignment with the original
   run; verified via dry-run that `--start-from 5` produces
   exactly `['T-AL4', 'D3', 'T-PL6', 'T-PL8', 'T-AL3']` as the
   skipped set — matches the D2 first 5.
4. Output files for resumed runs use a `-resume-from{N}` suffix
   so the original partial isn't overwritten. The original D2
   partial was renamed to `phase3-2026-05-07-d2-tasks1to5.*`
   before the rename collision could occur.
5. `_save_markdown` emits a 🔁 RESUMED RUN banner pointing at this
   §12 entry and the merge script.

**No rows discarded.** All 5 tasks × 3 modes data from the original
D2 run survives in `phase3-2026-05-07-d2-tasks1to5.*`. The resumed
run produces the remaining 9 tasks; merge produces unified n=14
analysis.

**Reporting consequence:** unchanged. The combined dataset still
publishes as Tier 2 directional only per §9.1 — n=14 paired
bootstrap won't clear the Bonferroni gate at α=0.0083. The 5-task
preview already showed C-B=+0.4 mean with 95% CI ≈ [-1.4, +2.2]
(crosses zero). Adding 9 more tasks tightens the CI but is unlikely
to flip the publication tier.

### 2026-05-07 — D2 + Resume merged · final results landed · Tier 2 confirmed

Resume run completed normally (exit 0, $9.01 spent, no cap fired).
Combined with the D2 partial via `benchmarks/_merge_phase3.py`,
the unified dataset is `phase3-2026-05-07-merged.{csv,json,md}`.

**Final numbers (Curated brain, n=12 paired):**

| Mode | n | Mean cost | Mean score |
|---|---|---:|---:|
| baseline       | 12 | $0.2337 | 1.33 / 4 |
| mcp_only       | 12 | $0.1885 | 1.33 / 4 |
| mcp_plus_rules | 13 | $0.3554 | **2.23 / 4** |

(`mcp_plus_rules` n=13 because T-PL7 had A and B time out but C
completed — that single C-only row contributes to the mode mean
but is excluded from the paired analysis below.)

**Paired analysis (C − B), n=12:**
- Mean Δ: **+0.75**
- 95% CI (normal approx): [-0.22, +1.72] — **crosses zero by 0.22**
- Exclusion rate: 11.9% (5 of 42 rows: E1 ×3 timeouts, T-PL7 ×2)
  — exceeds the 10% headline threshold per §7.1

**Verdict:** Tier 2 / directional / internal evidence only. No
Tier 1 web-headline claim is earned by this run. The CI misses
the zero-cross gate by a hair (0.22) and the exclusion rate
crosses the 10% line — either failure alone forces Tier 2 per
§9.1; both forcing it is unambiguous.

**Per-task signal (where C diverges from B):**

| Task | A | B | C | Lesson type |
|---|---:|---:|---:|---|
| T-AL3   | 3 | 0 | 4 | PINNED + scope_glob (vite barrel) |
| T-PL5   | 0 | 0 | 4 | PINNED + scope_glob (Graph FPS) |
| T-AL5   | 3 | 3 | 4 | PINNED + scope_glob (Graph FPS, apply) |
| T-AL6   | 3 | 1 | 3 | non-curated (license JWT)  — C rescued B's regression |
| D3      | 0 | 4 | 3 | non-curated (JWT collision) — B-favourable framing |
| T-PL8   | 3 | 4 | 3 | non-curated (drag-drop)     — B-favourable framing |

Pattern: **C wins specifically on PINNED+scope_globbed lessons.**
On non-curated lessons, C either ties B or drops 1. The moat is
real but it lives in the curation surface (pin + scope_glob), not
in the rules block alone.

**Cost overhead:** rules block adds +89% mean cost over B for
+0.75 mean score — unfavourable on tasks that don't need recall
(symbol_lookup style answers). For the marketing-ready claim this
means: "rules block earns its weight when lessons are curated;
without curation it's a verbosity tax."

**Decision (operator-approved in chat round):**

Per the operator's editorial guidance: do not lead with these
numbers on the marketing page. Web positioning shifts to:
1. Narrative + example (the Graph FPS lesson story) on home
2. Phase 1 lookup latency (~1ms) as a small audit-friendly fact
3. /benchmark page restructured as "transparent lab notes" — what
   we know, what we're testing, what we will not claim — with the
   merged Phase 3 CSV linked for the curious
4. NO global "Cosmos rules improve every task" wording. Only the
   moat-restricted phrasing locked in §12 above.

Phase 3 paid-run experiment closed with this entry. Future runs
that extend the matrix (Brain Fresh subset, more tasks, multi-
brain comparison) require a fresh deviation entry + budget approval
+ pre-registration update — they don't slip into this document.

Total benchmark spend across D2 + Resume: **$13.43**, within
the operator's $5-15 budget envelope.

### 2026-05-07 — repo flipped to private (post-experiment); benchmark data unchanged

Operator decision after the merged Phase 3 results landed: both the
AI-Bran source repo and the atitechs-web marketing-site repo were
flipped from PUBLIC → PRIVATE on GitHub via
`gh repo edit --visibility private`. The product is now in
**private alpha · invitation only**; no public app downloads, no
public source distribution until the project graduates from alpha.

**Effect on this protocol document:** none on the data. The Phase 3
run completed before the privacy flip. All measurements, audit
trail, exclusion accounting, paired analysis, and the Tier 2
verdict stand unchanged. The CSVs / JSONs / MDs in
`benchmarks/results/` remain the canonical artifact.

**Effect on §9.2 publication policy citations:** the protocol's
public-link references (e.g. "raw CSV linked from the marketing
page") still hold for the file-mirror copy under
`atitechs-web/public/benchmark/phase3-2026-05-07-merged.csv`,
which is served from the live website without requiring source
access. Reviewers / future invited testers can audit the merged
dataset there. The original GitHub source-tree links inside the
website pages were rewritten to point at `/#waitlist` during the
public→private sweep so no broken-link state exists, but anyone
who needs the raw repo (the protocol script, the curation log
JSON, etc.) gets it via alpha invitation.

**Effect on benchmark reproducibility:** the protocol document and
all driver scripts are still in the repo — invited testers running
`benchmarks/phase3_modes.py` reproduce exactly the same matrix.
External reviewers without alpha access can read the merged CSV
and the Markdown report served from `atitechs.com/benchmark/...`
to verify the numerical findings without source access. Re-running
the experiment requires the source, which is alpha-only until
public launch.
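
For a reviewer working only from the merged CSV, a paired C − B recomputation might look like the sketch below. The column names (`task_id`, `mode`, `score`, `excluded`) and the sample rows are assumptions made for illustration; the real file's schema should be checked before adapting this.

```python
import csv
import io
from collections import defaultdict


def paired_deltas(csv_text, mode_a="B", mode_b="C"):
    """Map task id -> (mode_b score - mode_a score), skipping excluded rows."""
    scores = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumed convention: excluded rows carry the literal "true".
        if row["excluded"].strip().lower() == "true":
            continue
        scores[row["task_id"]][row["mode"]] = float(row["score"])
    # Keep only tasks that still have both modes after exclusions.
    return {
        task: by_mode[mode_b] - by_mode[mode_a]
        for task, by_mode in scores.items()
        if mode_a in by_mode and mode_b in by_mode
    }


# Synthetic rows for illustration only; not taken from the merged CSV.
SAMPLE = """task_id,mode,score,excluded
demo-1,B,2,false
demo-1,C,4,false
demo-2,B,3,false
demo-2,C,3,true
"""

print(paired_deltas(SAMPLE))
```

A task drops out of the paired set as soon as either of its rows is excluded, which is why `demo-2` does not appear in the result.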

**No deviation in the experimental procedure.** This entry exists
so any future reader of the protocol understands why public
GitHub links from earlier revisions of the website / docs no longer
resolve (the live pages now point at `/#waitlist` instead).
The data, methodology, exclusion gates, and Tier 2 framing are
final.

### 2026-05-07 — distribution path chosen: R2, source repo stays private (Option B)

After the privacy flip, three options surfaced for how end users will
download future builds (Apple cert + auto-updater both need somewhere
to download FROM):

  - A. Flip repos back to public + add a license — uses GitHub Releases
       directly, free, but contradicts the just-made decision to keep
       source private.
  - B. Host updates on Cloudflare R2 with a custom `download.atitechs.com`
       domain. Source stays private; distribution opens. ~$5/mo.
       2-3 days of CI plumbing.
  - C. Public binary-only mirror repo. Splits source from distribution
       across two repos. Operationally heavier than B without solving
       a problem B doesn't.

**Decision: Option B.** Matches the public-private invariant the
operator just set, costs rounding error per month, future-friendly
(can flip to public anytime later without ripping out the
distribution stack), and is the same pattern Cursor / Linear / Bear /
Notion use.

Apple Developer enrolment ($99/yr) is sequenced LAST in the launch
plan per the operator's instruction. Until that lands, alpha builds
remain ad-hoc-signed and require the `xattr -cr` workaround
documented at `/docs#install`. The R2 distribution and Tauri update
infrastructure ship FIRST so that, the moment the Apple cert lands,
the only remaining step is uncommenting one CI block.

**Scaffolding committed alongside this entry:**

- `docs/UPDATE_DELIVERY.md` — full architecture, CI flow,
  channel design (stable/beta), rollback policy, key management,
  domain setup notes, and an explicit next-step checklist split by
  blocking dependency.
- `.github/workflows/release.yml` — release pipeline. Build +
  Tauri update-sign + manifest + R2 upload all wired but inert
  until secrets are populated. Apple sign + notarise step is
  gated behind a `HAS_APPLE_CERT` env so the workflow stays
  green pre-enrolment.
- `src-tauri/tauri.conf.json` — `plugins.updater` section added
  with the production endpoint
  `https://download.atitechs.com/{{target}}/{{arch}}/{{current_version}}/manifest.json`
  and the public key generated at this commit's time.
- Tauri update keypair generated locally
  (`node_modules/.bin/tauri signer generate --ci`):
    - private: `~/.tauri/cosmos.key` (operator's machine; later
      copied as `TAURI_SIGNING_PRIVATE_KEY` GitHub secret)
    - public: committed in the `pubkey` field above.
  Key rotation policy + key-loss recovery posture documented in
  `UPDATE_DELIVERY.md`.
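
To make the endpoint template concrete, the sketch below expands its placeholders the way an update check would before requesting the manifest. The helper `expand_endpoint` is hypothetical, and the example values (`darwin`, `aarch64`, version `0.1.0`) are assumptions chosen to match the macOS arm64 host, not values recorded in this document.

```python
ENDPOINT_TEMPLATE = (
    "https://download.atitechs.com/"
    "{{target}}/{{arch}}/{{current_version}}/manifest.json"
)


def expand_endpoint(template, target, arch, current_version):
    """Substitute the updater URL placeholders with concrete values."""
    values = {
        "{{target}}": target,
        "{{arch}}": arch,
        "{{current_version}}": current_version,
    }
    for placeholder, value in values.items():
        template = template.replace(placeholder, value)
    return template


# Hypothetical client: macOS on Apple Silicon running version 0.1.0.
url = expand_endpoint(ENDPOINT_TEMPLATE, "darwin", "aarch64", "0.1.0")
print(url)
```

The resulting path layout (target, then arch, then current version) is what the R2 upload step has to mirror for update checks to find a manifest.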

**No web changes in this commit.** Website CTAs continue pointing
at `/#waitlist` because the private-alpha framing still holds —
download infrastructure exists in scaffolding form but hasn't
shipped a public artifact yet. The web flip from `/#waitlist` →
`https://download.atitechs.com/stable/latest/manifest.json` becomes
the final step of Phase 1.10 (public beta launch), not before.

**No procedural deviation in the benchmark** — the Phase 3 dataset
is unaffected by anything downstream of distribution choice. This
entry exists so future operators picking up the launch baton see
the design decision and the trail of who-decided-what.

Files updated alongside this entry:
- `benchmarks/phase3_modes.py` (CLI flag + cap accounting)
- `benchmarks/_cost_cap.py` (no change — accounting was always
  intended; runner just wasn't calling `record_run()` for judges)
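
The accounting gap described here (judge calls not passing through `record_run()`) can be illustrated with a minimal sketch. The class below is hypothetical and is not the contents of `benchmarks/_cost_cap.py`; only the name `record_run` comes from this document, and its signature, the cent-based units, and the example costs are all assumed.

```python
class CostCapExceeded(RuntimeError):
    """Raised when cumulative spend passes the configured cap."""


class CostCap:
    def __init__(self, cap_cents):
        self.cap_cents = cap_cents
        self.spent_cents = 0
        self.by_kind = {}

    def record_run(self, cost_cents, kind):
        """Add one call's cost to the running total, then enforce the cap."""
        self.spent_cents += cost_cents
        self.by_kind[kind] = self.by_kind.get(kind, 0) + cost_cents
        if self.spent_cents > self.cap_cents:
            raise CostCapExceeded(
                f"spent {self.spent_cents} of {self.cap_cents} cents"
            )


# Integer cents avoid float drift; 1500 cents mirrors the $15 cap.
cap = CostCap(cap_cents=1500)
cap.record_run(40, kind="task")   # hypothetical cost of one task run
cap.record_run(12, kind="judge")  # judge calls count toward the same cap
print(cap.spent_cents, cap.by_kind)
```

Routing both task runs and judge calls through the same method keeps the cap meaningful; if judges bypass it, the recorded total understates real spend.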

---

## 13. Before-run checklist

Operator runs through this before pulling the trigger on the paid run:

- [ ] PHASE3_PROTOCOL.md committed and reviewed
- [ ] tier_benchmark.py cost label fix committed
- [ ] PreflightBanner missing-api-version fix committed
- [ ] tasks.json extended to ≥36 entries, schema-validated
- [ ] tasks.json locked + commit hash recorded in §3.3
- [ ] Curated brain: lessons pinned + scope_globs added before lock
- [ ] Curation log §11 has timestamped entries for every change
- [ ] Cost cap = $15 hard-coded in runner
- [ ] Mode-stripping filter unit-tested on a synthetic answer
- [ ] Two-judge runner unit-tested on one task
- [ ] Live sidecar version ≥ 3 (PreflightBanner not red)
- [ ] Backup of `data/brain_v2/brain.db` made before any temp-DB clone
- [ ] $100 plan quota check — at least $20 headroom

If any item fails, the run is blocked.
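
The blocking rule could be enforced mechanically rather than by eye. Below is a minimal Python sketch that scans Markdown task-list lines and refuses to proceed while any box is unticked; the function names and the idea of parsing this file at run start are assumptions, not part of the existing runner.

```python
import re

# Matches Markdown task-list lines such as "- [ ] item" or "- [x] item".
CHECKBOX = re.compile(r"^\s*- \[( |x|X)\] (.+)$")


def unchecked_items(markdown_text):
    """Return the text of every checklist item that is still unticked."""
    items = []
    for line in markdown_text.splitlines():
        match = CHECKBOX.match(line)
        if match and match.group(1) == " ":
            items.append(match.group(2).strip())
    return items


def run_allowed(markdown_text):
    """The run is blocked if any checklist item is unticked."""
    return not unchecked_items(markdown_text)


# Two items borrowed from the checklist above, with made-up tick states.
SAMPLE = """- [x] Cost cap = $15 hard-coded in runner
- [ ] Two-judge runner unit-tested on one task
"""

print(unchecked_items(SAMPLE), run_allowed(SAMPLE))
```

Note that this sketch treats text with no checkboxes at all as allowed, so a real gate would probably also assert that the expected number of items was found.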

---

*End of pre-registration. Document hash will be recorded by git
when this file is committed alongside the (later) results.*