// Lab notes · last updated 2026-05-27

Transparent, verifiable lab notes.

Most products lead with a benchmark headline. We don't, because Cosmos's value isn't a single number: it is a memory layer that compounds. That number can wait. What we publish today is the engineering proof, the work in progress, and the claims we explicitly refuse to make until the data clears the bar.

Note: statistical tables and technical terms such as confidence interval, effect size, and Bonferroni correction are kept in their standard English form for precise reference; the explanations and conclusions around them are written in plain language.

Every CSV referenced here is published under /benchmark/ on this site (raw, viewable in your browser, no login). The pre-registered protocol is PHASE3_PROTOCOL.md. The next milestone's harness is REGRESSION50_PROTOCOL.md.

Key verified numbers · 2026-05-20

Retrieval: 2.09× the BM25 baseline's MRR
Search latency: 0.03–3.7 ms across repos · flat with scale
Indexing: 50,000 files in ~5 min · 100% local index
Hardware: MacBook Air M4 · 16 GB · 1.4 GB peak
Verification: 50 pre-registered queries · indexer suite 18/18

why this matters (in plain language)

You can point Cosmos at a large, real codebase (50,000 files) and it's searchable in about the time it takes to make a coffee, all on a regular laptop with nothing sent to the cloud. Once it's indexed, finding anything stays effectively instant (a few milliseconds at most; 0.03–3.7 ms in our measurements) whether the project is tiny or huge. So your AI assistant gets accurate context fast, your code never leaves your machine, and search doesn't slow down as your project grows.

// 00 · code-aware retrieval · replicated 2026-05-12

Code-aware retrieval that walks the call graph, not the text.

The core retrieval claim: when your AI asks "who calls authenticate_user?" or "if I change the return type of add_task, what breaks?", Cosmos walks the real call graph instead of grep-matching symbol mentions. We measured this against BM25 over the same source dumped as markdown, the strongest "plain memory" baseline.

| Corpus | Cosmos MRR | BM25 MRR | Gap |
| --- | --- | --- | --- |
| fastapi · 19 AST + lesson questions | 0.930 | 0.446 | 2.09× |
| requests · 15 AST questions, independent corpus | 1.000 | 0.389 | 2.57× |

The second corpus (requests) was chosen, and its questions auto-mined, the same way as the first. The methodology was frozen before the second run started, and no question was hand-picked after seeing data. Both runs cleared the pre-registered gate (≥1.8× gap AND Cosmos MRR > 0.7), so the headline number is the minimum of the two: 2.09×.
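The gate and the "minimum of the two" rule can be written down in a few lines. This is a minimal sketch, not the harness itself: the function names are hypothetical, and the only inputs are the MRR values reported in the table above.

```python
from typing import Iterable, Optional


def mean_reciprocal_rank(ranks: Iterable[Optional[int]]) -> float:
    """MRR over 1-based ranks of the first correct hit; None (no hit) scores 0."""
    ranks = list(ranks)
    if not ranks:
        raise ValueError("need at least one query")
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)


def clears_gate(cosmos_mrr: float, bm25_mrr: float,
                min_gap: float = 1.8, min_mrr: float = 0.7) -> bool:
    """Pre-registered gate from the text: gap >= 1.8x AND Cosmos MRR > 0.7."""
    return cosmos_mrr / bm25_mrr >= min_gap and cosmos_mrr > min_mrr


# Reported MRR per corpus: (Cosmos, BM25).
reported = {"fastapi": (0.930, 0.446), "requests": (1.000, 0.389)}

gaps = {name: c / b for name, (c, b) in reported.items()}
all_clear = all(clears_gate(c, b) for c, b in reported.values())
headline_gap = min(gaps.values())  # the weaker of the two runs is the headline

print({k: round(v, 2) for k, v in gaps.items()}, all_clear, round(headline_gap, 2))
```

Taking the minimum means a strong second corpus can never inflate the headline; it can only confirm or lower it.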

Active surfacing: 90% session-intent coverage.

A companion test of the intent-aware preamble: given the user's first prompt as intent, does Cosmos surface enough context that the AI shows up ready? Measured on 10 plausible session-start scenarios per corpus:

| Corpus | Cosmos hit-rate | Text search baseline | Gap |
| --- | --- | --- | --- |
| fastapi · 10 scenarios | 0.900 | 0.500 | 1.80× |
| requests · 10 scenarios | 0.900 | 0.200 | 4.50× |

What we are NOT claiming.

"Cosmos beats Mem0 / Letta": those weren't tested (no API budget this round).
"Cosmos wins on natural-language paraphrase": at Tier 0 (no embeddings) we're roughly even with BM25 on that axis. The 2.09×/2.57× gap is specifically on AST-aware retrieval.
"Results extend to non-Python codebases": Track 1 is Python source only so far.

Harness, corpus, questions, and raw result JSON live in benchmarks/longitudinal/ inside the Cosmos repo. The source will be made publicly readable when the project opens up; if you're already in the private alpha, you have it. n = 30 AST + 20 intent = 50 total queries.

// 01 · engine performance

Indexed lookup is fast and local.

This section measures plumbing: how fast the engine returns indexed context once the question is asked. It does not measure whether the lessons that ride on top help your AI; that is the next section.

Cosmos pre-indexes your code via SQLite + FTS5. Once the index exists, an in-process symbol lookup runs at the speed SQLite gives you. The numbers below are the median latencies we measured locally, on Apple Silicon, with the index warm. Index build time is excluded: building the FTS5 index for a 1,448-file repo takes ~6–12 seconds on first run, and later launches reuse it.
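To make "an in-process lookup against an FTS5 index" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, columns, and file paths are assumptions for illustration, not Cosmos's actual schema, and it needs an SQLite build with FTS5 enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-process; nothing leaves the machine

# Hypothetical schema: one row per symbol. Requires SQLite compiled with FTS5.
conn.execute("CREATE VIRTUAL TABLE symbols USING fts5(name, path, signature)")
conn.executemany(
    "INSERT INTO symbols (name, path, signature) VALUES (?, ?, ?)",
    [
        ("authenticate_user", "auth/service.py", "def authenticate_user(token)"),
        ("add_task", "tasks/api.py", "def add_task(title, due)"),
        ("list_tasks", "tasks/api.py", "def list_tasks(user_id)"),
    ],
)


def lookup(term: str, limit: int = 5):
    """Return (name, path) rows matching `term`, best BM25 rank first."""
    rows = conn.execute(
        "SELECT name, path FROM symbols WHERE symbols MATCH ? "
        "ORDER BY bm25(symbols) LIMIT ?",
        (term, limit),
    )
    return rows.fetchall()


print(lookup("authenticate_user"))
```

A plain full-text index like this only finds where a symbol is mentioned; the call-graph walk described in section 00 sits on top of such a store and is not shown here.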

| Repo · file count | Median latency | Notes |
| --- | --- | --- |
| small open-source CLI · 63 files · 24K LOC | 0.6 ms | small repo |
| Cosmos itself · 167 files · 23K LOC | 3.7 ms | medium repo · the dogfood case |
| large code-agent repo · 1,448 files · 193K LOC | 1.4 ms | large repo · indexed lookup beats raw text scan |
| brain (operator's own corpus) · 177 memories · 5 categories | 0.9 ms | brain search · warm · 20-query × 5 trials · 2026-05-16 |

Method: 5 warm runs per query, median reported; file filter *.py *.ts *.tsx; excluded node_modules .venv .git dist. Slowest-case (p95) numbers are in the raw CSV below for the curious.
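The "5 warm runs, median reported" method is simple enough to sketch. The version below is an illustration rather than the published harness: the function names are hypothetical, the untimed priming call is an assumption about how "warm" is achieved, and a dictionary lookup stands in for the real search call.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, List


def warm_median_ms(run_query: Callable[[str], object], query: str,
                   warm_runs: int = 5) -> float:
    """Median wall-clock latency in milliseconds over `warm_runs` timed calls."""
    run_query(query)  # untimed priming call, so the timed runs hit a warm index
    samples: List[float] = []
    for _ in range(warm_runs):
        start = time.perf_counter()
        run_query(query)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def benchmark(run_query: Callable[[str], object],
              queries: Iterable[str]) -> Dict[str, float]:
    """Per-query median latency, keyed by query text."""
    return {q: warm_median_ms(run_query, q) for q in queries}


# Stand-in search function so the sketch runs on its own.
corpus = {"authenticate_user": "auth/service.py", "add_task": "tasks/api.py"}
results = benchmark(corpus.get, ["authenticate_user", "add_task"])
print(results)
```

Reporting the median keeps a single slow outlier from moving the number, which is why the p95 figures are published separately.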

phase 1 · CSV

Speed numbers per query, all 3 tiers ↗

Download CSV ↗

phase 1 · Markdown

Auto-generated report with the same tables ↗

View Report ↗

// 01.5 · indexing at scale · measured 2026-05-20

50,000 real files, indexed locally in ~5 minutes.

Section 01 measured warm-lookup latency and excluded the one-time index build. This section measures the build itself, on a real corpus: 49,718 TypeScript/JavaScript files (456 MB) drawn from DefinitelyTyped, including 44 type-definition files over 1 MB, the largest 8.7 MB. Everything runs on one machine; nothing leaves it. The headline that matters isn't a speed-up multiple. It's that search latency stays under 1 ms at 50,000 files (0.93 ms), just as it does at 63 (0.6 ms).

| Corpus · files | Index build | Symbols | Search p50 |
| --- | --- | --- | --- |
| cpython · 2,289 files · dense (~37 sym/file) | 130 s | 86,430 | 0.03 ms |
| DefinitelyTyped · 49,718 files · 456 MB · broad TS/JS | 311 s | 108,730 | 0.93 ms |

Method: full cold index (parse + symbol store + FTS5 + call graph), isolated SQLite DB, on a single MacBook Air M4 · 16 GB RAM, with the CPU verified not thermally throttled during the run. Peak memory at the 50k scale was 1.4 GB RSS, comfortably within a 16 GB laptop. Build time scales with symbol density, not just file count: cpython runs at ~17 files/s (dense), DefinitelyTyped at ~160 files/s (broad). Raw per-phase numbers (incl. the pre-fix run below) are in the JSONL linked at the end of this section.
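The throughput and scaling figures follow directly from the table. The short check below recomputes them from the reported file counts and build times; the 14-minute pre-fix figure is the one quoted in this section, and the variable names are only illustrative.

```python
# Reported cold-index runs (files, build time in seconds).
runs = {
    "cpython": {"files": 2_289, "build_s": 130},
    "DefinitelyTyped": {"files": 49_718, "build_s": 311},
}

throughput = {name: r["files"] / r["build_s"] for name, r in runs.items()}
file_ratio = runs["DefinitelyTyped"]["files"] / runs["cpython"]["files"]
time_ratio = runs["DefinitelyTyped"]["build_s"] / runs["cpython"]["build_s"]

pre_fix_s = 14 * 60  # first 50k run: 14 minutes
fix_speedup = pre_fix_s / runs["DefinitelyTyped"]["build_s"]

for name, fps in throughput.items():
    print(f"{name}: {fps:.1f} files/s")
print(f"{file_ratio:.1f}x files for {time_ratio:.1f}x time; fix speed-up {fix_speedup:.1f}x")
```

Note that the two corpora differ in language and symbol density, so the 22× files for ~2.4× time comparison describes these two runs rather than a general scaling law.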

// at a glance

Chart · 50k index, before → after fix (real code · DefinitelyTyped · seconds): 2.7× faster after the pre_tokenize fix.
Chart · Index time vs scale: sub-linear, 22× files for ~2.4× time.
Chart · Search latency vs scale: all under 4 ms · doesn't grow with file count. The 167 spike is Cosmos's own repo (the dogfood case), not a scale effect.

the bug we caught doing this

The first 50k run took 14 minutes, and two-thirds of it was a single phase we didn't expect. The cause wasn't FTS5 or SQLite. It was our own pre_tokenize step running a Thai word-segmenter (pythainlp) over entire multi-megabyte type-definition files, because a single Thai character in one comment flipped a "contains Thai" check. One 8 MB .d.ts was being dictionary-segmented end to end. We found it with the same harness linked below, added a size guard, and the index build dropped to ~5 minutes, with exactly the same symbol and link counts (108,730 / 120,770 before and after). We're publishing the slow number too, not just the fast one.
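A size guard of the kind described could look roughly like this. It is a sketch under stated assumptions, not the shipped fix: the threshold value, the whitespace fallback, and the injected `segment_thai` callable are all hypothetical (in the real step that callable would be the pythainlp segmenter).

```python
import re
from typing import Callable, List

THAI_CHAR = re.compile(r"[\u0E00-\u0E7F]")  # Thai Unicode block
MAX_SEGMENT_BYTES = 256 * 1024  # hypothetical threshold, not the shipped value


def pre_tokenize(text: str, segment_thai: Callable[[str], List[str]]) -> List[str]:
    """Run the expensive Thai segmenter only on small inputs that contain Thai."""
    too_big = len(text.encode("utf-8")) > MAX_SEGMENT_BYTES
    if too_big or not THAI_CHAR.search(text):
        return text.split()  # cheap whitespace split
    return segment_thai(text)


# Stand-in segmenter that records how often it is called.
calls: List[int] = []


def fake_segmenter(text: str) -> List[str]:
    calls.append(len(text))
    return list(text)


small_thai = "สวัสดี world"
huge_with_one_thai_char = "declare const x: number; " * 20_000 + "ก"

pre_tokenize(small_thai, fake_segmenter)               # segmenter runs
pre_tokenize(huge_with_one_thai_char, fake_segmenter)  # guard skips it
print(len(calls))
```

The trade-off is that a genuinely large Thai document would also skip segmentation under this guard, so a per-span check rather than a per-file check may be the better long-term design.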

What we are NOT claiming.

"5 minutes for any 50k repo": build time depends on symbol density and file size; a denser codebase takes longer per file. We report both ends.
"A speed-up multiple as a feature": the 14→5 min change was fixing our own regression, not a new optimization. It belongs in the changelog, not on a banner.

indexer scale · JSONL

Raw per-phase numbers · cpython + 50k pre/post-fix ↗

Download JSONL ↗

// 02 · directional evidence (does not clear publication bar)

Does the rules block actually move the needle?

We pre-registered a three-mode comparison: baseline (no Cosmos), mcp_only (Cosmos engine, no rules block in the prompt), and mcp_plus_rules (Cosmos engine + rules block prefixed). The 14-task moat-category subset (8 past_lesson + 6 apply_lesson) ran on a curated brain on 2026-05-07. The result does not clear the publication threshold; it is reported here as directional / internal evidence. N=14 is small. A 50-bug multi-repo regression suite is the next milestone (see section 03 below).

| Mode | Mean cost | Mean score · 0–4 |
| --- | --- | --- |
| baseline | $0.234 | 1.33 |
| mcp_only | $0.189 | 1.33 |
| mcp_plus_rules | $0.355 | 2.23 |

what this means in plain language

On these 14 tasks, the AI scored 2.23 out of 4 with the Cosmos rules block + curated lessons, versus 1.33 out of 4 at baseline, about 67% higher. The lift is real in the sample. But the 95% confidence interval on the gain runs from −0.22 to +1.72, which crosses zero by 0.22, meaning a slightly different set of 14 tasks could plausibly have produced no gain at all. We want the interval to fully clear zero before we put this on a marketing slide. Directional: yes. Headline-quotable: not yet.

honest read of the numbers

+ Mean Δ (C − B): +0.75 score. The rules block helps, on average, by about three-quarters of a grade out of 4. (B = mcp_only, C = mcp_plus_rules.)
⚠ 95% CI (paired, normal-approx): [−0.22, +1.72], which crosses zero by 0.22. It misses the strict zero-clearance gate the protocol pre-registered.
⚠ Exclusion rate: 11.9% (5 of 42 rows: 3 timeouts on E1, 2 on T-PL7), which exceeds the 10% threshold for headline-grade rigor.
⚠ Mean cost in C: +89% over B. The rules-block prefix adds a verbosity tax: the AI writes longer chains of thought when explicitly told to use cosmos tools.
→ Where C wins big: tasks targeting pinned + scope-globbed lessons (T-AL3 +4, T-PL5 +4, T-AL5 +1, T-AL6 +2 vs B). Pattern: curation matters; the rules block alone doesn't.
→ Where C drops: non-curated lessons with B-favourable framing (D3 −1, T-PL8 −1). When the AI was already going to call recall, the rules block adds noise.

The protocol calls this Tier 2: directional / internal evidence. The CI gate and the exclusion-rate gate fail independently; either one alone would block a Tier 1 headline, and we hit both. Per section 9 of the protocol, we will not publish this as a "Cosmos cuts cost / improves correctness by X%" claim. We will instead say what's actually true: the rules block earns its weight when paired with curated lessons; on tasks that don't need recall it is overhead.
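The two gates named above are mechanical checks, sketched below. This is an illustration only: the function names are hypothetical, the protocol's full five-condition bar is not reproduced (only the two gates discussed here), and the inputs are the summary figures reported in this section rather than the per-task rows.

```python
import math
import statistics
from typing import Dict, Sequence, Tuple


def paired_ci95(deltas: Sequence[float]) -> Tuple[float, float]:
    """95% CI for the mean paired difference, normal approximation."""
    mean = statistics.mean(deltas)
    sem = statistics.stdev(deltas) / math.sqrt(len(deltas))
    return mean - 1.96 * sem, mean + 1.96 * sem


def tier1_gates(ci: Tuple[float, float], excluded: int, total: int,
                max_exclusion: float = 0.10) -> Dict[str, bool]:
    """Zero-clearance gate on the CI and the exclusion-rate gate."""
    lower, _upper = ci
    return {
        "ci_clears_zero": lower > 0.0,
        "exclusion_ok": excluded / total <= max_exclusion,
    }


# Reported: CI [-0.22, +1.72], 5 of 42 rows excluded.
reported = tier1_gates((-0.22, 1.72), excluded=5, total=42)
print(reported, f"exclusion rate = {5 / 42:.1%}")
```

With N=14, a t-based interval would be somewhat wider than the normal approximation, so the zero-clearance gate would fail under that choice as well.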

Where this directly applies: the snapshot test + integration test that caught the two newest bugs in Cosmos's own audit cycle. Both are documented as Project Lessons → /lessons.

phase 3 merged · CSV

All 42 rows · per-task scores · cost · tokens · timeouts ↗

Download CSV ↗

phase 3 merged · Markdown

Auto-generated report with the paired analysis ↗

View Report ↗

protocol

Pre-registration · 12 deviation entries · audit trail ↗

View Protocol ↗

curation log

Which lessons were pinned + scope-globbed, and when ↗

View Curation Log ↗

// 02.5 · augmented pilot (N=8) · Tier 3 directional

We ran a tighter pilot. The headline isn't what we hoped.

We added 10 internal past_lesson tasks to the Phase 3 task set and re-ran the 3-mode comparison (no Cosmos / Cosmos MCP / Cosmos MCP + rules block). Planned N=18; got N=8 after the sweep hit a 5-hour subscription quota cap mid-run. We stopped at the playbook's N=8 floor and are publishing with full disclosure.

headline finding (honest)

Cosmos modes did not reliably reduce tokens or wall time vs baseline at N=8. The mean Δ is near zero (+1.4% tokens, sign tests not significant). What Cosmos actually does is redistribute work; it doesn't reduce it on average. The clearest edge is in turn count: Mode C uses fewer turns in 6 of 8 tasks.

Three task profiles tell the real story

// D3 · striking win

"JWT plain-JSON collision" → all 3 modes correct.

baseline: 62.9s · 16 turns · 370K tok
Mode C: 19.2s · 5 turns · 87K tok
→ 3.3× faster · 4.3× fewer tokens

// T-PL7 · baseline gives up

"Duplicate code_summary memories" → baseline times out at 180s.

baseline: FAIL · timeout
Mode C: succeeded · 116K tok
→ Cosmos finds what grep can't

// T-PL5 · striking loss

"Neural Map sluggish" → all 3 modes correct, but Mode C is slowest.

baseline: 40.8s · 11 turns · 215K tok
Mode C: 116.0s · 17 turns · 433K tok
→ rules block over-consults Cosmos

What this finding actually shapes

→ Marketing copy was always going to avoid token-savings claims (decided pre-pilot). Pilot data confirms that pivot was right: we'd have been claiming something this run doesn't support.
→ Regression-50 will measure task completion + correctness, not raw token cost. T-PL7 (baseline gives up) suggests "Cosmos finds answers baseline can't reach" is the real metric to chase.
→ The rules block needs a refinement before scaling: "if no lesson hits in one MCP call, fall back to grep". This prevents T-PL5-style token blowup on broad queries.
→ Sweep infrastructure for Regression-50 switches from the subscription path to the direct Anthropic API: predictable per-run cost, no rolling-quota cap. Budget ~$70 sweep + ~$30 judges.
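The proposed refinement ("if no lesson hits in one MCP call, fall back to grep") is a small control-flow rule. A minimal sketch follows, assuming two injected callables; `recall_lessons` and `grep_fallback` are hypothetical stand-ins, not Cosmos tool names, and the sample data is made up for the demo.

```python
from typing import Callable, Dict, List, Tuple


def gather_context(
    query: str,
    recall_lessons: Callable[[str], List[str]],
    grep_fallback: Callable[[str], List[str]],
) -> Tuple[str, List[str]]:
    """Make one lesson lookup; if it returns nothing, fall back to plain grep."""
    hits = recall_lessons(query)  # the single lesson call the rule allows
    if hits:
        return "lessons", hits
    return "grep", grep_fallback(query)


# Stand-in tools with made-up data, so the sketch runs on its own.
lessons: Dict[str, List[str]] = {"jwt": ["lesson: JWT plain-JSON collision"]}


def fake_recall(query: str) -> List[str]:
    return lessons.get(query, [])


def fake_grep(query: str) -> List[str]:
    return [f"src/app.py:12: {query}"]


print(gather_context("jwt", fake_recall, fake_grep))
print(gather_context("neural map", fake_recall, fake_grep))
```

Capping the lesson lookup at one call is what bounds the cost on broad queries: the worst case becomes one extra call on top of the baseline's own grep path.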

Methodology disclosures (4 things we got wrong or learned)

1. Quota-cascade stop, not protocol stop. We assumed a Claude.ai Max subscription meant $0 sweep cost. Max billing IS $0, but the quota is finite (5-hour rolling window). 9 tasks of heavy sweep activity exhausted the window; the remaining 9 tasks failed with exit 1. The same quota pool affected the operator's concurrent chat session.

2. Selection bias (intentional). All 10 augmented tasks come from the Cosmos repo's own code_errors table. This tests retrieval (does Cosmos surface what's in its index?), not generalisation (does Cosmos help on unfamiliar repos?). Regression-50, with external repos, covers generalisation.

3. No outcome-quality scoring. The sweep ran with --skip-judges because judges need a separate Anthropic API key. The 24 successful raw responses are saved; correctness/citation/hallucination dimensions can be scored offline later. All hard-metric findings here are about efficiency, not correctness.

4. Bilingual content. 5 of 10 augmented prompts contain Thai content (R3F perf, Tauri white-screen, etc.). Claude handles multilingual input fine; external judges may need calibration. Each row's language is preserved in the raw JSON for stratified analysis.

Full pilot writeup, raw JSON / CSV, and task definitions live in the Cosmos repo, currently private during alpha; full data will be public when the repo opens alongside the public download.

// 02.7 · token efficiency, on-device · measured 2026-05-27

"86% fewer tokens": true, but only against the worst baseline.

We ran the same questions entirely on a local model (Qwen3-8B, 4-bit, on a MacBook Air M4 / 16 GB, nothing sent to the cloud), measuring how many context tokens the model needs to answer and whether the answer stays correct (graded 0–4 by a blind claude CLI judge). An "86% fewer tokens" line is easy to print. So we tightened the baseline three times and report where the number holds, and where it shrinks.

Round 1: vs a naive dump.

Baseline = paste the whole file / a batch of notes into the prompt (an assistant with no retrieval). Cosmos = one tool call's compact result.

| Local model | dump tok | Cosmos tok | fewer |
| --- | --- | --- | --- |
| Gemma-4-e4b | 44,239 | 6,252 | −86% |
| Qwen3-8B | 39,761 | 5,592 | −86% |

Caveat: a dump is the worst-case baseline; nobody pastes a 47k-token file by hand. The 86% is the value of any retrieval, not something unique to Cosmos. Two different models land at the same 86% because it's a ratio, not a model quirk.

Round 2: when the file is bigger than the context window.

Real source files here are 37–47k tokens, larger than the 16k window. Pasting them overflows: the no-retrieval model can't answer at all. Cosmos retrieves just the relevant function.

| Question (answer in a 47k-tok file) | dump | Cosmos |
| --- | --- | --- |
| "How does the server stop a slow call blocking the event loop?" | overflow · 0/4 | 4/4 |

Read this as: here Cosmos isn't "cheaper". It's the only mode that can answer, because the content doesn't fit at all. This is the regime where retrieval is a requirement, not an optimisation.

Round 3: vs grep / BM25, the baseline you'd actually use.

The honest comparison: ripgrep over the code, SQLite FTS5 BM25 over the notes, which is what a competent dev/AI does without Cosmos. Same questions, same corpus, blind-judged.

| Axis | dump | grep / BM25 | Cosmos |
| --- | --- | --- | --- |
| accuracy · 0–4 | 0.0 | 1.8 | 2.2 |
| context tokens | overflow | 8,532 | 6,605 (−23%) |

// code · Cosmos ahead

Cosmos returns coherent functions the model can use (4/4); raw grep lines are too fragmented to answer from (0/4).

// notes · roughly tied

Finding a needle among 852 notes, Cosmos and BM25 trade wins: effectively even, as on the paraphrase axis in section 00.

The honest headline: the 86% is vs a naive dump only. Against grep/BM25, the baseline a real engineer uses, Cosmos is anywhere from roughly tied to ~23% leaner, depending on the task (on 2 of 5 tasks it used more). Cosmos comes out slightly ahead on accuracy overall (2.2 vs 1.8), driven by the code side.
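Both percentages are the same ratio applied to different baselines, which is why the choice of baseline drives the headline. The check below recomputes them from the token counts reported in Rounds 1 and 3; the function name is illustrative.

```python
def token_reduction(baseline_tokens: int, cosmos_tokens: int) -> float:
    """Fraction of context tokens saved relative to the given baseline."""
    return 1.0 - cosmos_tokens / baseline_tokens


# Round 1: naive dump baseline, per local model.
vs_dump = {
    "Gemma-4-e4b": token_reduction(44_239, 6_252),
    "Qwen3-8B": token_reduction(39_761, 5_592),
}
# Round 3: grep / BM25 baseline.
vs_grep_bm25 = token_reduction(8_532, 6_605)

for name, saved in vs_dump.items():
    print(f"{name} vs dump: {saved:.0%}")
print(f"vs grep/BM25: {vs_grep_bm25:.0%}")
```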

What we are NOT claiming.

"Cosmos uses X% fewer tokens" as a bare headline: the big number is vs a dump; vs grep it's modest and task-dependent.
"More accurate than search on note lookups": against BM25 it's roughly even, not a clear win.

What this round does support. สิ่งที่รอบนี้ยืนยันได้

✓ When content overflows the context window, retrieval is the only way the model answers — Cosmos does, a dump can't. เมื่อเนื้อหา overflow context window, retrieval เป็นทาง เดียว ที่ model ตอบได้ — Cosmos ทำได้ dump ทำไม่ได้
✓ For code, structured chunks beat raw grep lines for an LLM — and it's one call instead of a grep-then-read loop. สำหรับ code, chunk ที่มีโครงสร้างดีกว่าบรรทัด grep ดิบสำหรับ LLM — และเป็น call เดียว ไม่ต้องวน grep-แล้ว-read ซ้ำ

Method: 3-mode comparison on Qwen3-8B (4-bit, GGUF) + Gemma-4-e4b, MacBook Air M4 / 16 GB, fully local; accuracy graded 0–4 by a blind claude CLI judge against hand-written ground truth; code baseline = ripgrep, notes baseline = SQLite FTS5 BM25. Harness + raw per-task JSON live in the Cosmos repo, which is private during alpha.
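For readers who want a feel for the notes baseline, here is a minimal sketch of BM25 ranking with SQLite FTS5 from Python's standard library. It is not the harness code (that lives in the private repo). The table name `notes`, its columns, and the helper names are assumptions for illustration, and it requires a Python build whose SQLite has FTS5 compiled in.

```python
import sqlite3


def build_notes_index(notes):
    """Create an in-memory FTS5 index from (title, body) pairs.

    Hypothetical schema: the real benchmark's schema is not published here.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE notes USING fts5(title, body)")
    conn.executemany("INSERT INTO notes (title, body) VALUES (?, ?)", notes)
    return conn


def bm25_search(conn, query, limit=5):
    """Return (title, score) rows, best match first.

    FTS5's bm25() returns lower (more negative) values for better matches,
    so ascending order puts the most relevant note on top.
    """
    rows = conn.execute(
        "SELECT title, bm25(notes) FROM notes "
        "WHERE notes MATCH ? ORDER BY bm25(notes) LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = build_notes_index([
        ("stats", "paired bootstrap confidence interval for score deltas"),
        ("indexing", "worktree per run keeps the index isolated"),
        ("judging", "two blinded judges score each answer"),
    ])
    for title, score in bm25_search(conn, "bootstrap"):
        print(title, round(score, 3))
```

The code baseline (ripgrep) is a command-line tool and is not sketched here.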

// 03 · the next milestone (in progress)

Regression-50: replaying real bugs we didn't write.

Section 02's directional read came from 14 hand-crafted tasks on Cosmos's own brain: useful internally, not enough to publish. The next pass replays 50 real bugs from 5 open-source repositories the operator did not author (click, fastapi, astro, tauri, pydantic), with three modes per bug, judged by the same five-condition statistical bar section 02 used. The harness is shipped; bug curation is in progress.

Component | Status
Pre-registered methodology (selection criteria, scoring, stat bar) | shipped
Three-mode replay harness (worktree per run, brain sandbox, cost cap) | shipped · dry-run
Bug list curation | 5 / 50 patterns
Live Claude CLI smoke run (1 bug, 3 modes) | pending
Two-judge scoring wired to Anthropic | stub · _two_judge.py
Full sweep: 50 × 3 modes × 2 judges | not started

Why publish "0 / 50" instead of waiting until the data lands? Because the apparatus matters: an outside reader can clone the repo, run python -m benchmarks.regression_50.runner --dry-run, see the full code path execute end-to-end with no API spend, and verify the harness exists before any number does. Publishing the status grid is itself the commitment. Numbers replace the grid as runs land.

Source: benchmarks/regression_50/ in the repo. Pre-registration: benchmarks/regression_50/protocol.md. Same five-condition statistical bar as section 02; same paired-bootstrap CI method; same blinded two-judge scoring pattern.

// 04 · what we DO claim (and the data behind each)

Things we will say, with caveats attached.

✓ "Indexed lookup is fast enough that the engine isn't the bottleneck." 0.6–1.4 ms median across three repos (63 → 1,448 files). Source: section 01 above + phase 1 CSV.

✓ "The rules block + curated lessons help on lookup-heavy tasks." Tasks T-AL3 +4, T-PL5 +4, T-AL5 +1, T-AL6 +2 vs baseline. Pattern: bigger gains where the AI would otherwise re-derive a fix. Source: section 02 + phase 3 CSV.

✓ "Without curation, the rules block adds a verbosity tax with no recall gain." +89% mean cost in mode C, with task drops on D3 (−1) and T-PL8 (−1) where baseline would have called recall on its own. Source: section 02 honest-read bullets.

✓ "Curation matters more than the engine." The same engine + the same prompt scaffolding produces different outcomes depending on what's in code_errors at run time. Source: curation log + the per-task scores cited above.

// 05 · what we will not claim yet

Things you'll see other tools say. We won't, until the data clears.

✗ "AI is N× faster with Cosmos." Tool-level lookup is fast (section 01). End-to-end task speed depends on what the AI actually does, and that's not consistently faster across the matrix we tested.

✗ "Cosmos is always cheaper." On lookup-heavy tasks, often yes. On apply-style tasks where the rules block fires, the verbosity tax can make mode C cost more than baseline. We won't claim "cheaper" without conditions.

✗ "Project Lessons rules improve every task." The 14-task protocol data shows they help specifically on lessons you've curated (pinned + scope-globbed). We will not claim a global lift.

✗ "Benchmarked against [competitor]." We've benchmarked our own engine vs grep / ripgrep on indexed lookup. That isn't a fair comparison to other AI memory products and we won't dress it up as one.

The bar to graduate any of these into a published claim is in section 9.1 of the protocol: Bonferroni-corrected p < 0.0083 + paired bootstrap 95% CI strictly above zero + Cohen's |d_z| > 0.3 + exclusion rate < 10% + CI width less than the absolute effect. When a future run clears all five, the claim lands here. Until then, it stays in this list.
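The five conditions can be sketched as a single check over per-task paired score differences. This is an illustrative reading, not the protocol's implementation. It assumes a percentile bootstrap over the mean difference, takes the p-value as an input (the underlying test is not specified in this section), and interprets "CI width less than the absolute effect" as CI width < |mean difference|. All function names are hypothetical.

```python
import random
import statistics

ALPHA_BONFERRONI = 0.0083   # threshold quoted from section 9.1 of the protocol
MIN_EFFECT_DZ = 0.3
MAX_EXCLUSION_RATE = 0.10


def paired_bootstrap_ci(diffs, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of paired differences (assumed method)."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


def cohens_dz(diffs):
    """Cohen's d_z: mean paired difference over its sample standard deviation."""
    return statistics.fmean(diffs) / statistics.stdev(diffs)


def clears_bar(diffs, p_value, n_excluded, n_total):
    """Return (passed, per-condition results) for the five-condition bar."""
    lo, hi = paired_bootstrap_ci(diffs)
    mean_diff = statistics.fmean(diffs)
    checks = {
        "p_below_bonferroni": p_value < ALPHA_BONFERRONI,
        "ci_strictly_above_zero": lo > 0,
        "effect_size": abs(cohens_dz(diffs)) > MIN_EFFECT_DZ,
        "exclusion_rate": n_excluded / n_total < MAX_EXCLUSION_RATE,
        "ci_narrower_than_effect": (hi - lo) < abs(mean_diff),
    }
    return all(checks.values()), checks


if __name__ == "__main__":
    # Made-up differences (treatment minus baseline), for illustration only.
    example = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 1.1, 1.0, 1.2]
    passed, detail = clears_bar(example, p_value=0.001, n_excluded=1, n_total=20)
    print(passed, detail)
```

Returning the per-condition dictionary alongside the overall verdict makes it easy to see which of the five conditions blocked a claim.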

// 06 · reproduce

Run it yourself.

Engine performance (section 01) is free to re-run, with no API calls. The three-mode comparison (section 02) needs a working Claude CLI on your machine, but no separate API key if you have a Claude.ai subscription. A hard cost cap is wired into the runner; the default is $15.

# Source repo is private during alpha; download the .zip below or
# request access via the waitlist. CSVs + protocols above are
# already public, no clone needed for those.
cd cosmos
python3.12 -m venv .venv && source .venv/bin/activate
pip install -r requirements-tier0.txt

# section 01 — engine performance (free)
python -m benchmarks.tier_benchmark

# section 02 — three-mode comparison (uses your Claude.ai auth)
# Curated brain, moat-category subset:
python -m benchmarks.phase3_modes \
  --categories past_lesson,apply_lesson \
  --only-brain curated

# section 03 — Regression-50 harness, dry-run (free, exercises every code path)
python -m benchmarks.regression_50.runner --dry-run

# section 03 — Regression-50, single bug live (~$1.50, sanity-check before sweep)
python -m benchmarks.regression_50.runner --limit 1

The product is the lesson library you build.

Currently in private alpha. Local-first by design. The benchmark gets honest as the data does.
