# Phase 3 Benchmark — Pre-Registered Protocol **Status: DRAFT — pre-registration. Not yet executed.** **Document timestamp:** 2026-05-07 (before any task pool extension or brain curation has happened — those are scheduled steps below) This document locks the experimental design **before** we add tasks, curate the brain, or run any paid benchmark. The intent is auditability: a reviewer should be able to read this file, then read the eventual results, and confirm we didn't move the goalposts to make Cosmos look better. If we deviate from this protocol mid-run, the deviation gets a timestamped section appended below and the affected results get a "deviated from protocol" footnote — not a quiet rewrite. --- ## 1. Hypothesis **Primary:** Loading Cosmos's MCP server alone (Mode B) does not improve end-to-end coding-task quality vs. baseline (Mode A) by a clinically meaningful amount, but loading MCP **plus** the Cosmos rules block (Mode C) does — because the activation surface is what makes AI agents actually call recall tools. **Secondary:** On a curated brain (pinned + scope_globed lessons), the gap C − B widens vs. a fresh empty brain — i.e., Cosmos's value scales with the lesson library, not just the engine. **Negative outcome we will publish:** if C ≈ B, we will not claim that the rules block carries weight. We'll publish the null result and keep the rules-installer feature off the marketing path. --- ## 2. Engineering setup Locked at protocol-write time: | Component | Version | |---|---| | OS | macOS arm64 (Apple Silicon) | | Python | 3.12 from the `.venv` venv | | Claude CLI | `/Users/kabir/.local/bin/claude` (whatever ships at run start; recorded in CSV) | | Model | `claude-opus-4-7[1m]` (Anthropic-side) | | Sidecar API version | ≥ 3 (PreflightBanner gate enforces) | | Cosmos build | recorded by `/api/v2/status` → `build_version` field | | MCP config | `.mcp.json` at repo root with key `"cosmos"` | | ripgrep | 15.x (Phase 1 only) | Run host stays the same machine across all phases. Claude CLI auth uses the operator's $100 plan; no separate API key. --- ## 3. Task pool — pre-registered before extension ### 3.1 Source **Task mix is intentionally weighted to reflect Cosmos's product claim** — "prevent re-solving known project bugs". This is not a neutral developer-search benchmark; it's a moat test, and the weighting is disclosed up front so reviewers don't have to reverse- engineer it. `past_lesson` + `apply_lesson` together total 14/36 (39%) of the pool, while `symbol_lookup` + `concept_search` (the categories most other code-search benchmarks lean on) get 12/36 (33%). If the same task pool were used to evaluate generic-purpose code search the weighting would invalidate the result; for evaluating project-memory recall it's the right shape. `benchmarks/tasks.json` (schema v1) — currently 20 tasks across these categories: | Category | Current count | Target count | |---|---:|---:| | `symbol_lookup` | 5 | 6 | | `cross_ref` | 3 | 5 | | `concept_search` | 4 | 6 | | `past_lesson` | 3 | 8 | | `apply_lesson` | 2 | 6 | | `bulk_exploration` | 3 | 5 | | **Total** | **20** | **36** | ### 3.2 Extension rules Adding the 16 new tasks must follow these rules. Any deviation gets recorded in §11 with a timestamp. 1. **Author cannot have seen Phase 3 task results before adding the task.** Tasks are added in one sitting, then the file is committed, *then* runs begin. 2. **Each new task must declare `expected_file` and `expected_keyword` at write time.** No retroactive correctness criteria. 3. **`past_lesson` and `apply_lesson` tasks must reference real lesson IDs** that already exist in the brain at the moment the task is written — verified by the schema check before commit. Same with files referenced — the file must exist in the target repo at task write time. 4. **Repo distribution stays roughly proportional to category** so we don't accidentally weight one project's quirks. Recorded in CSV per row. 5. **No task can be removed or reworded** after it's added unless the run errored (timeout / API failure) — rewordings due to "the model misunderstood" are a finding, not a bug. ### 3.3 Lock Once §3.1 hits the target counts and §3.2 is satisfied, the `tasks.json` file gets a `locked_at` timestamp + git-commit hash. From that point forward the file is read-only for the run duration. A deviation from the locked set is flagged in the report. **Locked at: 2026-05-07T00:47:53Z** **Commit at lock time: `d3181df245`** (the commit that landed the curation log; tasks file extended in the next commit which this section is part of) **Final distribution:** symbol_lookup=6 · cross_ref=5 · concept_search=6 · past_lesson=8 · apply_lesson=6 · bulk_exploration=5 — 36 total. **Driver:** `benchmarks/_extend_tasks_2026_05_07.py` — runs a leakage- guard sweep across every new prompt (forbidden terms list pulled from the curation log §11 leakage flag for lesson 78c5b62a) before writing. Sweep returned 0 hits. Any further mutation must produce a NEW dated tasks file, e.g. `tasks-2026-MM-DD.json`. Editing this file in place is treated as protocol deviation and gets logged in §12. --- ## 4. Brain conditions — two protocols ### 4.1 Brain Fresh Empty `code_errors` table. **Critically: this is created in a temp DB, NOT by deleting the user's real brain.** The user explicitly required this so a benchmark can never touch production data. #### 4.1.a SQLite-safe clone (mandatory) A naive `shutil.copy(brain.db, …)` is **forbidden** because Cosmos runs in WAL mode — the most recent committed writes can live in the companion `brain.db-wal` file until checkpoint, and a file copy of `brain.db` alone would silently miss them. The runner uses `sqlite3.Connection.backup()` (the SQLite Backup API) to stream pages from the live DB into a fresh file. This is online- safe: the user's normal sidecar on port 7824 keeps writing during the clone and the snapshot stays coherent. Implementation lives in `benchmarks/_brain_sandbox.py::clone_brain_to_sandbox()`. After the page-stream, the runner: 1. Sets `PRAGMA journal_mode=WAL` on the clone (matches production) 2. If `fresh=True`: `DELETE FROM code_errors` in the clone only — other tables (memories_v2, code_index, code_fts, code_summary, …) survive so MCP code search still works against the same indexed code base 3. Initializes an empty `mcp_activity.jsonl` in the sandbox dir — never copies the real one, otherwise the activation-rate metric would mix benchmark calls with the user's earlier sessions #### 4.1.b Env-var contract — what the alternate sidecar sees The runner sets the following before spawning the sandbox sidecar. Every sandbox process MUST report these in `/api/v2/status`'s `runtime` block; the runner asserts on the returned values before the first benchmark task fires. | Variable | Required | Effect | |---|---|---| | `COSMOS_BRAIN_DB` | yes | Absolute path to the cloned DB | | `COSMOS_DATA_DIR` | yes | Directory containing the clone — `mcp_activity.jsonl` lives here | | `COSMOS_ACTIVITY_LOG` | optional | Override activity log path independent of DATA_DIR | | `COSMOS_DISABLE_LESSON_MIRROR` | yes (=1) | Skip every `.cosmos/lessons.{md,json}` rebuild call | | `COSMOS_BENCHMARK` | yes (=1) | Belt-and-suspenders flag any future side-effect path can short-circuit on | | `COSMOS_KEEP_SANDBOX` | optional | Skip cleanup so a failed run can be inspected | The `COSMOS_DISABLE_LESSON_MIRROR` flag is essential. The project_registry that resolves `project_id → repo path` lives outside the brain DB — even when the cloned DB is empty, calling `code_remember_error` would walk the real registry, find the user's real repo, and rewrite `.cosmos/lessons.{md,json}` in the working directory. The flag short-circuits `_rebuild_for_project` entirely. #### 4.1.c Pre-flight assertion Before the first task runs, the runner hits the sandbox sidecar's `/api/v2/status` and asserts: - `runtime.brain_db_path` matches the clone path it just wrote - `runtime.lesson_mirror_enabled` is `false` - `runtime.is_benchmark_sandbox` is `true` - `runtime.env_overrides_set` includes the four required keys above - The clone's `code_errors` row count matches expectation (0 for Fresh, current count for Curated) If any assertion fails the run aborts with a deviation log entry. The real DB is never touched. ### 4.2 Brain Curated Same as the user's live brain, but with a curated layer applied **before any task is added or any task result is observed**. The curation: - Pins 3–5 lessons the operator selects (without seeing Phase 3 task results) - Adds `scope_globs` to lessons whose files repeatedly appear in bug-magnet rankings (vite.config.ts, mcp_server.py, etc.) **Curation timestamp is recorded** in the report. If `tasks.json` has already been locked when curation begins, that's recorded too — curation must precede task extension OR be done from existing tasks only (i.e. operator must not see new task content before curating). The Brain Curated sweep runs against the user's normal sidecar on `127.0.0.1:7824` since curation is a normal user operation. ### 4.3 No third "Brain Adversarial" condition We considered including a "all lessons disabled" condition to bound the lower end. Rejected — too easy to misread as "Cosmos with features turned off" rather than "synthetic worst case". Brain Fresh already provides the floor. ### 4.4 Repo-view isolation — `.cosmos/` and instruction files excluded Added per operator review: even with a sandboxed brain DB, the AI agent in any mode can `Read('.cosmos/lessons.md')` against the live repo path passed via `--add-dir` and pull the same recipe data the MCP layer is supposed to gate. Without isolation, Mode A's "no Cosmos" baseline would silently grep its way to the answer through the file mirror — invalidating every between-mode comparison. For the same reason the temp repo view also excludes the project's own AI-instruction files, so the only between-mode difference is the rules-block prefix in Mode C. The runner copies each task's `repo_path` into a temp directory (`benchmarks/_repo_view.py::prepare_clean_repos`) before the first task runs, then rewrites `repo_path` in the in-memory task list to point at the copy. The copy excludes: | Excluded path | Why | |---|---| | `.cosmos/` | The lesson mirror — Mode A could grep this otherwise | | `CLAUDE.md` | Cosmos guidance pre-installed; Mode B would pick it up for free | | `AGENTS.md` | Same — Codex-style guidance | | `.cursor/` | Cursor rules dir | | `.clinerules` | Cline rules file | | `.windsurfrules` | Windsurf rules file | | `.git/` | Avoid git blame / log shortcuts to attribution | | `node_modules/`, `.venv/`, `target/`, `dist/`, `.next/` | Bloat, never read by AI in normal benchmark tasks | Applied to **all three modes**, not just baseline. Mode B's "MCP only" condition explicitly means "MCP tools available, no other Cosmos guidance available" — so a CLAUDE.md that says "use cosmos tools" would conflate the comparison. **Production interpretation note**: this means the benchmark measures the marginal value of injecting the rules block at task time, vs. relying on MCP tools to be discovered organically. In production, users install the rules block into their persistent CLAUDE.md / .cursor/rules and get the benefit on every session. The benchmark approximates this — the production gain may be larger because the rules-block guidance accumulates across sessions, while the benchmark restarts cold for each task. --- ## 5. Modes (already implemented in `phase3_modes.py`) | Mode | MCP loaded? | Rules block prefixed? | |---|---|---| | `A · baseline` | no | no | | `B · mcp_only` | yes | no | | `C · mcp_plus_rules` | yes | yes | Mode C prefixes the rules block onto the user prompt. This is not bit-identical to a CLAUDE.md preamble that the AI client loads itself, but it's the closest `claude -p` lets us get without shelling out to a sandbox dir. Recorded as a known limitation. --- ## 6. Run protocol ### 6.1 Order For each (brain × mode × task) cell: 1. Random shuffle of the task order with seed = `cosmos-phase3-2026-05` (a fixed string committed in the runner so anyone can reproduce the exact sequence). 2. All 3 modes for one task run back-to-back (paired design — see §10). 3. Cells are interleaved across brain conditions in alternating blocks of 5 tasks so any time-of-day model variation hits both conditions equally. ### 6.2 Cost cap Hard limits enforced in the runner: - **Estimated cost cap: $15 USD** — runner sums Anthropic-reported `total_cost_usd` per call and stops entering new tasks if the running total + an estimated next-task cost would exceed $15. - **Run-count cap: planned matrix size** — for the 36-task pool in §3.1 with 3 modes × 2 brains, that's 216 runs + 216 judge calls. If the runner is asked to do more it aborts. - **Per-task wall cap: 180 s** — already in `run_claude()`. Tasks that hit this get marked `error: timeout` and excluded from primary metrics; they still appear in the per-task table. If the cost cap fires we'll publish whatever subset completed, with the cap reason annotated. ### 6.3 Incremental save CSV writes after every task pair completes so a crashed sidecar / expired Claude session doesn't lose hours of runs. --- ## 7. Judge protocol ### 7.1 Two-judge mean with blinding Each task answer is graded by **two independent Claude calls** (judge prompt template fixed in `phase2_interactive.py` lines 171-185 and copied by `phase3_modes.py`). The mean of the two grades is the recorded score; if the two judges disagree by > 1 grade, the task is flagged `judge_disagreement=true` and excluded from the primary metric (still shown in the per-task table). **Excluded-row count is published as a primary number**, not buried in an appendix — `report.md` headlines: > "Primary metric computed over N kept rows. M rows excluded > (judge_disagreement=K, timeout=L, run_error=…). K/N = X% — flag > if > 10% so reviewers can decide if the run is interpretable." If exclusions exceed 10% the result is downgraded to "directional/internal" regardless of CI. Hiding the rows would let us cherry-pick the easy half of the matrix while still claiming publication-grade rigor; making them headline keeps the temptation out of reach. ### 7.2 Mode stripping before judging Before the answer is sent to the judge it gets passed through a filter that removes: - Any string starting with `mcp__cosmos__` - Any standalone occurrence of Cosmos-specific tool names (`find_relevant_code`, `code_get_symbol`, `code_callers`, `code_callees`, `code_uses`, `code_skeleton`, `code_context_bundle`, `code_search`, `code_list_errors`, `code_remember_error`) - The literal phrase "Cosmos MCP" / "the Cosmos brain" These would otherwise leak which mode produced the answer to the judge. ### 7.3 Judge model + version Same Claude version as the task runner (`claude-opus-4-7[1m]`). Recorded per-row in CSV. --- ## 8. Metrics ### 8.1 Primary **Mean correctness score (0–4)** per (brain × mode) cell. Reported with paired bootstrap 95% CI (see §10). This is what we will or will not put on the website. ### 8.2 Secondary Reported but never headline: - **Mean cost (USD)** per cell — Anthropic-reported `total_cost_usd` - **Mean wall (s)** per cell - **Activation rate** — % of tasks where the AI called ≥ 1 `mcp__cosmos__*` tool. Computed from the activity log emitted by the temp sidecar. For Mode A this is 0 by construction. - **Lesson recall rate** — % of tasks where the AI called `find_relevant_code` or `code_list_errors` specifically. Mode A: 0. - **Unnecessary call count** — for tasks with no relevant lesson, count how often the AI still invoked Cosmos tools. Negative metric: high values mean rules block over-fires. ### 8.3 Lesson-dropout sub-test (past_lesson tasks only) For each task in the `past_lesson` category, **three additional runs** are added to Mode C only: 1. Lesson `pinned=true` (most aggressive surface) 2. Lesson `pinned=false, disabled=false` (default) 3. Lesson `disabled=true` (Cosmos cannot see it) If the score drops sharply at variant 3, the lesson did the work. If all three score the same, the AI got it from base knowledge — moat not real for this task. Reported per-task, not aggregated as headline. --- ## 9. Publication policy ### 9.1 What we will publish Three publication tiers — gating gets stricter the more public the claim: #### Tier 1 — Web headline ("publishable") Requires **all** of: - Primary metric C − B paired bootstrap 95% CI **strictly above zero** in the Brain Curated condition - Bonferroni-corrected p-value < 0.0083 (see §10.3) - Effect size `|d_z| > 0.3` - Excluded rows < 10% of total - CI width less than the absolute effect (rules out "barely significant + huge variance" claims) Example claim that would qualify: > "Across N benchmark tasks on the Cosmos repo, AI agents with the > Cosmos rules block prefixed scored mean Δ correctness +X.XX vs. > with Cosmos's MCP alone (95% CI [X, Y], paired bootstrap, n=N, > Bonferroni-corrected p < 0.008, d_z = X.X)." Plus the raw CSV + protocol + curation timestamps as a downloadable artifact under `public/benchmark/phase3-2026-05-DD/`. #### Tier 2 — "Directional / internal evidence" When uncorrected p < 0.05 but Bonferroni gate misses, OR effect size is below the 0.3 threshold, OR exclusions are 10–25%: - Result lands in `benchmarks/results/` and the protocol log - May be referenced internally and in technical docs - **Cannot** appear on the marketing site as an effect-size claim - Web /benchmark page may quote: *"Internal directional evidence suggests rules-block carries weight; n insufficient for Bonferroni-grade publication. Re-run with 50+ tasks planned."* #### Tier 3 — Suppressed CI crosses zero / exclusions > 25% / run errored mid-matrix: - Raw data still committed under `benchmarks/results/` (audit trail) - No public claim of any kind - Protocol §11 deviation log records why the run is suppressed - Triggers a re-design pass before the next attempt ### 9.2 What we will not publish - Phase 1 cost numbers as a "Cosmos beats grep" headline. Phase 1 becomes a sidebar regression-guard claim: > "Indexed lookup latency: 0.6–1.4 ms median across 63–1,448-file > repos. Index build excluded." - Any aggregate that includes timed-out or judge-disagreement tasks - Any number whose CI includes zero - Phase 3.5 (real-activity replay) as headline — exploratory only ### 9.3 Null result handling If the CI for C − B includes zero: - We will not claim the rules block earns its weight on this run. - We will publish the null result internally as "current evidence insufficient — re-run with curated brain + 50+ tasks needed". - The rules-installer feature stays in product but loses its marketing claim. --- ## 10. Statistics ### 10.1 Paired bootstrap CI For the primary metric (correctness score) we report: - **Mean** of paired differences (per-task: `score_C − score_B`) - **95% CI** via 1000 bootstrap resamples of the task list (resample with replacement, take 2.5th and 97.5th percentile) - **Effect size**: Cohen's `d_z` for paired samples Why paired: each task is run in all three modes back-to-back. Welch's t-test (independent samples) overestimates variance in this setup — pairing controls for task difficulty. ### 10.2 Paired permutation test For null-hypothesis significance: 10,000 permutations of the within-task sign. Report exact p-value. Significance threshold: 0.05, but published only if effect size is also non-trivial (`|d_z| > 0.3`). ### 10.3 Multiple comparisons — Bonferroni primary, uncorrected secondary We have three planned comparisons (B-A, C-B, C-A) per brain condition, across two brain conditions, plus the lesson-dropout sub-test — that's six primary tests. Bonferroni-corrected α: `0.05 / 6 ≈ 0.0083`. **Bonferroni is the publication gate**. The web /benchmark page can quote a comparison only if it clears Bonferroni AND §9.1 Tier 1's other requirements (CI not crossing zero, effect size, exclusion %). **Uncorrected p-values are reported as secondary** in the report — useful for direction-of-effect intuition, but not citable. A Tier 2 "directional evidence" claim can reference uncorrected p < 0.05 as long as the wording explicitly says "directional / internal" and not "statistically significant". This split exists because the user explicitly required it: "ใช้ strict เป็น primary claim gate ครับ แล้ว report uncorrected p-value เป็น secondary/exploratory ได้". Documented here so reviewers see the gate isn't being slid mid-run. --- ## 11. Curation log (filled in as we go) Curation events get written here in append-only fashion before runs begin. Each entry has a timestamp + author + the lesson IDs touched. This is what proves curation didn't see task results. ### 2026-05-07 — Phase 3 initial curation pass - **Method:** REST `PATCH /api/v1/projects/errors/{id}` via the live sidecar (production DB). Driver script: `benchmarks/_apply_curation_2026_05_07.py`. Run ID `1ad3819c80e7` started at `2026-05-06T18:08:14Z`, finished one second later. Exit code 0. - **Pre-curation invariants asserted before any mutation:** - `benchmarks/tasks.json` clean (no uncommitted modifications) - `tasks.json` entry count = 20 (the pre-Phase-3 state) - Sidecar `api_version` ≥ 3 - Git HEAD recorded at `2881026b1f` for round-trip auditability - `tasks.json` last commit `03d811e797` - `no_tasks_added_before_curation: true` - **Selection algorithm:** documented inline in `_apply_curation_2026_05_07.py` and serialised into the log under `selection_algorithm`. Deterministic, objective, weights pre-set. - **Operator approval:** chat round on 2026-05-07, "approve as-is ทั้ง 5 rows ครับ", per-row review captured one chat round earlier. - **Lessons mutated (5):** - `4cd762e0` — Vite optimizeDeps barrel pattern · pinned=False→True · scope_globs=[]→`["vite.config.ts","package.json"]` - `01933dc9` — Graph focus FPS / R3F freeze · pinned=False→True · scope_globs=[]→`["src/components/GraphView/**"]` - `77b13dd3` — FTS5 plain-string coverage gap · pinned=False→True · scope_globs=[]→`["core/code_indexer/**","core/api/mcp_server.py"]` - `78c5b62a` — MCP stdio idle-hang · pinned unchanged (False) · scope_globs=[]→`["core/api/mcp_server.py"]` - `ca204c2e` — Summarizer race + watcher self-trigger · pinned unchanged (False) · scope_globs=[]→`["core/code_indexer/project_summarizer.py","core/code_indexer/watcher.py"]` - **Lessons NOT touched** (operator policy — leave sibling / duplicate / single-shot lessons alone unless doing duplicate cleanup as a separate, non-benchmark commit): `e9d5fe1b`, `28fd4c7c`, `25ef5d4a`, plus the seven lessons not in the top 5. - **Side-effect:** REST PATCH triggered `_rebuild_for_project` for the AI-Bran project (sidecar runtime has `lesson_mirror_enabled: true`). The project's `.cosmos/lessons.{md,json}` files now reflect the pinned/scoped state. Mirror update committed as part of the same git commit. - **Leakage flag:** lesson `78c5b62a` (MCP stdio hang) is at MEDIUM leakage risk. When `tasks.json` is extended, no task may use wording near the lesson's symptom verbatim. Will be cross-checked during task-pool review. - **Full audit artifact:** `benchmarks/curation_log_2026-05-07.json` — committed prior to the next task-pool extension. No further curation is permitted until either (a) the Phase 3 run completes and the protocol allows a follow-up, or (b) a separate non-benchmark cleanup commit lands with explicit "duplicate cleanup" rationale. --- ## 12. Deviation log (filled in as we go) Anything that deviates from this document gets a timestamped entry here so the report can footnote it. ### 2026-05-07 — Single-task smoke test ($0.7367, 6 runs) revealed cost overrun · scope reduced to moat categories **Smoke run:** `--limit 1` (task A1 = symbol_lookup) × 3 modes × 2 brains = 6 task calls + 12 judge calls. Wiring green; results landed in `benchmarks/results/phase3-2026-05-07-draft.{csv,json,md}`. **Findings:** 1. Engine alone (mcp_only) replicated the Phase 2 cost reduction — −46% on Brain Fresh, −39% on Brain Curated, both at the same 4.0/4 score. 2. **Rules block on a non-lesson task is pure overhead.** Token count jumped 74,790 → 128,910 (+72%) in Mode C on Brain Fresh without quality improvement. Cost +89% vs baseline. This was foreseeable — rules tell the AI to use cosmos tools, but A1's query has no matching pinned/scoped lesson, so the verbose chain-of-thought is uncompensated. 3. **Linear extrapolation breaches $15 cap by 3.5×.** Single-task $0.74 × 72 cells projects to ~$53 for the full 36×3×2 matrix. **Decision (operator-approved):** restrict the next paid run to the categories the protocol's primary product claim actually tests — past_lesson + apply_lesson only — and run Brain Curated first on its own. If C > B in Curated holds across the moat categories, THEN escalate to Brain Fresh subset to prove the gap is from the lesson library, not from the rules-block wording in isolation. **Subset matrix:** - 8 past_lesson tasks (T-PL1..PL3 + T-PL4..PL8) + 6 apply_lesson tasks (T-AL1..AL2 + T-AL3..AL6) = 14 tasks - 14 × 3 modes × 1 brain (Curated) = 42 task calls + 84 judge calls - Projected cost: $5.60–$11.20 (well within $15 cap) **Reporting consequence:** subset results land at Tier 2 "directional/internal evidence" per §9.1 — even if Bonferroni passes at the planned α=0.0083 the matrix is too narrow to claim a Tier 1 web-headline. To preserve the pre-registered primary product claim, the operator wording for any subsequent web copy will be: > "On known-bug and lesson-application tasks, Project Lessons > rules improved recall/correctness from X to Y." — scoped to the moat categories that were actually tested. NO global "Cosmos rules improve everything" language unless a later run extends to symbol_lookup / cross_ref / concept_search / bulk_exploration with positive results. **Bonferroni denominator stays at 6** (the pre-registered planned test count) even though the subset runs fewer tests. Lowering the denominator mid-stream would look like moving goalposts. Better to stay strict and accept Tier 2 framing. **Two minor runner fixes shipped alongside the deviation:** - `--categories` CLI flag added so the subset can be selected declaratively rather than by hand-editing tasks.json - Judge-call costs are now also recorded in the cost cap (smoke run undercounted by ~$0.06 — small but worth being honest about) ### 2026-05-07 — D2 paid run aborted at task 5/14 due to run-count accounting bug · resumed from task 6 **What happened:** Started the D2 paid run after the categories+ judge-cost commit landed. Wired correctly: `--categories past_lesson,apply_lesson --only-brain curated`. The run completed 5 tasks × 3 modes = 15 task calls + 30 judge calls = 45 cap entries. Run-count cap was set to `expected_task_runs` = 14 × 3 × 1 = 42. At entry 43 (the next task's first mode), `would_exceed_cap()` fired with reason `run-count cap reached: 45 completed ≥ 42 planned`. **Root cause:** The judge-cost commit added `cap.record_run(total_cost_usd)` inside `judge_call` to fix a $0.06 cost-undercounting issue. But `record_run` was incrementing BOTH `spent_usd` AND `completed_runs` — so judge calls polluted the run-count cap that was supposed to count task cells only. $ accounting was correct ($4.42 spent of $15 budget); the abort was purely a counting bug. **Resolution (this commit):** 1. `_cost_cap.py` `record_run(cost_usd, *, count_as_run=True)` — judges pass `count_as_run=False` so they accumulate cost only, no run-count, no EWMA pollution. Two new tests pin this. 2. `phase3_modes.py judge_call` passes `count_as_run=False` to the cap. 3. `phase3_modes.py` adds `--start-from N` CLI flag — skips the first N tasks in the deterministic shuffle order. Same `SHUFFLE_SEED` guarantees index alignment with the original run; verified via dry-run that `--start-from 5` produces exactly `['T-AL4', 'D3', 'T-PL6', 'T-PL8', 'T-AL3']` as the skipped set — matches the D2 first 5. 4. Output files for resumed runs use a `-resume-from{N}` suffix so the original partial isn't overwritten. The original D2 partial was renamed to `phase3-2026-05-07-d2-tasks1to5.*` before the rename collision could occur. 5. `_save_markdown` emits a 🔁 RESUMED RUN banner pointing at this §12 entry and the merge script. **No rows discarded.** All 5 tasks × 3 modes data from the original D2 run survives in `phase3-2026-05-07-d2-tasks1to5.*`. The resumed run produces the remaining 9 tasks; merge produces unified n=14 analysis. **Reporting consequence:** unchanged. The combined dataset still publishes as Tier 2 directional only per §9.1 — n=14 paired bootstrap won't clear the Bonferroni gate at α=0.0083. The 5-task preview already showed C-B=+0.4 mean with 95% CI ≈ [-1.4, +2.2] (crosses zero). Adding 9 more tasks tightens the CI but unlikely flips the publication tier. ### 2026-05-07 — D2 + Resume merged · final results landed · Tier 2 confirmed Resume run completed normally (exit 0, $9.01 spent, no cap fired). Combined with the D2 partial via `benchmarks/_merge_phase3.py`, the unified dataset is `phase3-2026-05-07-merged.{csv,json,md}`. **Final numbers (Curated brain, n=12 paired):** | Mode | n | Mean cost | Mean score | |---|---|---:|---:| | baseline | 12 | $0.2337 | 1.33 / 4 | | mcp_only | 12 | $0.1885 | 1.33 / 4 | | mcp_plus_rules | 13 | $0.3554 | **2.23 / 4** | (`mcp_plus_rules` n=13 because T-PL7 had A and B time out but C completed — that single C-only row contributes to the mode mean but is excluded from the paired analysis below.) **Paired analysis (C − B), n=12:** - Mean Δ: **+0.75** - 95% CI (normal approx): [-0.22, +1.72] — **crosses zero by 0.22** - Exclusion rate: 11.9% (5 of 42 rows: E1 ×3 timeouts, T-PL7 ×2) — exceeds the 10% headline threshold per §7.1 **Verdict:** Tier 2 / directional / internal evidence only. No Tier 1 web-headline claim is earned by this run. The CI misses the zero-cross gate by a hair (0.22) and the exclusion rate crosses the 10% line — either failure alone forces Tier 2 per §9.1; both forcing it is unambiguous. **Per-task signal (where C diverges from B):** | Task | A | B | C | Lesson type | |---|---:|---:|---:|---| | T-AL3 | 3 | 0 | 4 | PINNED + scope_glob (vite barrel) | | T-PL5 | 0 | 0 | 4 | PINNED + scope_glob (Graph FPS) | | T-AL5 | 3 | 3 | 4 | PINNED + scope_glob (Graph FPS, apply) | | T-AL6 | 3 | 1 | 3 | non-curated (license JWT) — C rescued B's regression | | D3 | 0 | 4 | 3 | non-curated (JWT collision) — B-favourable framing | | T-PL8 | 3 | 4 | 3 | non-curated (drag-drop) — B-favourable framing | Pattern: **C wins specifically on PINNED+scope_globbed lessons.** On non-curated lessons, C either ties B or drops 1. The moat is real but it lives in the curation surface (pin + scope_glob), not in the rules block alone. **Cost overhead:** rules block adds +89% mean cost over B for +0.75 mean score — unfavourable on tasks that don't need recall (symbol_lookup style answers). For the marketing-ready claim this means: "rules block earns its weight when lessons are curated; without curation it's a verbosity tax." **Decision (operator-approved in chat round):** Per the operator's editorial guidance: do not lead with these numbers on the marketing page. Web positioning shifts to: 1. Narrative + example (the Graph FPS lesson story) on home 2. Phase 1 lookup latency (~1ms) as a small audit-friendly fact 3. /benchmark page restructured as "transparent lab notes" — what we know, what we're testing, what we will not claim — with the merged Phase 3 CSV linked for the curious 4. NO global "Cosmos rules improve every task" wording. Only the moat-restricted phrasing locked in §12 above. Phase 3 paid-run experiment closed with this entry. Future runs that extend the matrix (Brain Fresh subset, more tasks, multi- brain comparison) require a fresh deviation entry + budget approval + pre-registration update — they don't slip into this document. Total benchmark spend across D2 + Resume: **$13.43**, well within the operator's $5-15 budget envelope. ### 2026-05-07 — repo flipped to private (post-experiment); benchmark data unchanged Operator decision after the merged Phase 3 results landed: both the AI-Bran source repo and the atitechs-web marketing-site repo were flipped from PUBLIC → PRIVATE on GitHub via `gh repo edit --visibility private`. The product is now in **private alpha · invitation only**; no public app downloads, no public source distribution until the project graduates from alpha. **Effect on this protocol document:** none on the data. The Phase 3 run completed before the privacy flip. All measurements, audit trail, exclusion accounting, paired analysis, and the Tier 2 verdict stand unchanged. The CSVs / JSONs / MDs in `benchmarks/results/` remain the canonical artifact. **Effect on §9.2 publication policy citations:** the protocol's public-link references (e.g. "raw CSV linked from the marketing page") still hold for the file-mirror copy under `atitechs-web/public/benchmark/phase3-2026-05-07-merged.csv`, which is served from the live website without requiring source access. Reviewers / future invited testers can audit the merged dataset there. The original GitHub source-tree links inside the website pages were rewritten to point at `/#waitlist` during the public→private sweep so no broken-link state exists, but anyone who needs the raw repo (the protocol script, the curation log JSON, etc.) gets it via alpha invitation. **Effect on benchmark reproducibility:** the protocol document and all driver scripts are still in the repo — invited testers running `benchmarks/phase3_modes.py` reproduce exactly the same matrix. External reviewers without alpha access can read the merged CSV and the Markdown report served from `atitechs.com/benchmark/...` to verify the numerical findings without source access. Re-running the experiment requires the source, which is alpha-only until public launch. **No deviation in the experimental procedure** — this entry exists so any future reader of the protocol understands why public GitHub links rendered in the website / docs no longer resolve. The data, methodology, exclusion gates, and Tier 2 framing are final. ### 2026-05-07 — distribution path chosen: R2, source repo stays private (Option B) After the privacy flip, three options surfaced for how end users will download future builds (Apple cert + auto-updater both need somewhere to download FROM): - A. Flip repos back to public + add a license — uses GitHub Releases directly, free, but contradicts the just-made decision to keep source private. - B. Host updates on Cloudflare R2 with a custom `download.atitechs.com` domain. Source stays private; distribution opens. ~$5/mo. 2-3 days of CI plumbing. - C. Public binary-only mirror repo. Splits source from distribution across two repos. Operationally heavier than B without solving a problem B doesn't. **Decision: Option B.** Matches the public-private invariant the operator just set, costs rounding error per month, future-friendly (can flip to public anytime later without ripping out the distribution stack), and is the same pattern Cursor / Linear / Bear / Notion use. Apple Developer enrolment ($99/yr) is sequenced LAST in the launch plan per the operator's instruction. Until that lands, alpha builds remain ad-hoc-signed and require the `xattr -cr` workaround documented at `/docs#install`. The R2 distribution and Tauri update infrastructure ship FIRST so the moment Apple cert lands, the only remaining step is uncommenting one CI block. **Scaffolding committed alongside this entry:** - `docs/UPDATE_DELIVERY.md` — full architecture, CI flow, channel design (stable/beta), rollback policy, key management, domain setup notes, and an explicit next-step checklist split by blocking dependency. - `.github/workflows/release.yml` — release pipeline. Build + Tauri update-sign + manifest + R2 upload all wired but inert until secrets are populated. Apple sign + notarise step is gated behind a `HAS_APPLE_CERT` env so the workflow stays green pre-enrolment. - `src-tauri/tauri.conf.json` — `plugins.updater` section added with the production endpoint `https://download.atitechs.com/{{target}}/{{arch}}/{{current_version}}/manifest.json` and the public key generated at this commit's time. - Tauri update keypair generated locally (`node_modules/.bin/tauri signer generate --ci`): - private: `~/.tauri/cosmos.key` (operator's machine; later copied as `TAURI_SIGNING_PRIVATE_KEY` GitHub secret) - public: committed in the `pubkey` field above. Key rotation policy + key-loss recovery posture documented in `UPDATE_DELIVERY.md`. **No web changes in this commit.** Website CTAs continue pointing at `/#waitlist` because the private-alpha framing still holds — download infrastructure exists in scaffolding form but hasn't shipped a public artifact yet. The web flip from `/#waitlist` → `https://download.atitechs.com/stable/latest/manifest.json` becomes the final step of Phase 1.10 (public beta launch), not before. **No procedural deviation in the benchmark** — the Phase 3 dataset is unaffected by anything downstream of distribution choice. This entry exists so future operators picking up the launch baton see the design decision and the trail of who-decided-what. Files updated alongside this entry: - `benchmarks/phase3_modes.py` (CLI flag + cap accounting) - `benchmarks/_cost_cap.py` (no change — accounting was always intended; runner just wasn't calling `record_run()` for judges) --- ## 13. Before-run checklist Operator runs through this before pulling the trigger on the paid run: - [ ] PHASE3_PROTOCOL.md committed and reviewed - [ ] tier_benchmark.py cost label fix committed - [ ] PreflightBanner missing-api-version fix committed - [ ] tasks.json extended to ≥36 entries, schema-validated - [ ] tasks.json locked + commit hash recorded in §3.3 - [ ] Curated brain: lessons pinned + scope_globs added before lock - [ ] Curation log §11 has timestamped entries for every change - [ ] Cost cap = $15 hard-coded in runner - [ ] Mode-stripping filter unit-tested on a synthetic answer - [ ] Two-judge runner unit-tested on one task - [ ] Live sidecar version ≥ 3 (PreflightBanner not red) - [ ] Backup of `data/brain_v2/brain.db` made before any temp-DB clone - [ ] $100 plan quota check — at least $20 headroom If any item fails, the run is blocked. --- *End of pre-registration. Document hash will be recorded by git when this file is committed alongside the (later) results.*