tangle-network · drewstone · Jun 10, 2026 · Jun 10, 2026
diff --git a/docs/PLAIN.md b/docs/PLAIN.md
@@ -0,0 +1,76 @@
+# The system in plain language
+
+> The translation layer. Internal docs use the project's own vocabulary; THIS page says the
+> same things without it. If an explanation here contradicts a technical doc, the technical
+> doc wins — then fix this page. Audience: a colleague meeting the project cold.
+
+## Five sentences, no invented words
+
+1. We have tasks with **automatic pass/fail checks** — tests you can run, answer keys you
+   can verify mechanically.
+2. An AI attempts each task a fixed number of times under different **retry policies**:
+   "try 3 times, keep the best", "try, get feedback, try again", and so on.
+3. We compare policies **fairly**: identical tasks, identical attempt budgets, paired
+   statistics, judged on fresh tasks that no tuning step ever saw.
+4. The distinctive part: the AI also **writes new retry policies itself**, as short
+   programs, and they enter the same tournament under the same rules as human-written ones.
+5. Every dollar and second is metered, so "better" can also mean "**equally good but
+   cheaper**" — and that claim is statistically testable, not vibes.
+
+## The load-bearing core is six pieces
+
+Task-with-check · retry policy · the tournament runner · the AI policy-writer · the
+statistical promotion gate · crash-resume. Everything else is either a **fairness rule**
+(added because a specific run produced a wrong number without it) or an **experiment on
+the menu** (a configuration, not a machine part). Experiment configs are cheap; do not
+mistake a long menu for a complicated machine.
+
+## Translation table
+
+| Project term | Plain English | Standard concept? |
+|---|---|---|
+| Environment | a task domain: open it, act on it with tools, check the result | RL environment / gym |
+| shot | one attempt | — |
+| steering / `refine` | feedback injected between attempts | self-refinement |
+| the author / `authorStrategy` | the AI writes a new retry policy as a program | program synthesis |
+| evolution / generations | rounds of: write candidates → tournament → keep the champion | evolutionary search |
+| harness-verified scoring | never trust a policy's self-reported score; recompute it from the attempts the system actually ran | basic measurement hygiene |
+| selector ≠ judge (the firewall) | the feedback-giver never sees the answer key or the score | no reward leakage |
+| conserved budget pool | every policy gets exactly the same attempt budget; overspending is structurally impossible | compute-matched comparison |
+| holdout / fresh slice | final judging happens on tasks no tuning step ever touched | train/test split |
+| the gate / `promotionGate` | a seeded paired bootstrap must show the win is real before anything is declared better | standard inferential statistics |
+| non-inferiority mode | prove "not worse on quality AND significantly cheaper" | clinical-trials statistics |
+| band screen | drop questions every policy aces — they carry no information | item discrimination (psychometrics) |
+| reproducer certificate | a fresh AI re-builds the winner from a ~64-word description; if the rebuild can't match it, the win was memorization, not method | description-length / compression test (arXiv:2606.11045) |
+| κ compression / minimization | shorten the prompt; prove quality holds and cost drops | prompt compression (LLMLingua lineage); the every-Nth-character floor is delta debugging |
+| waterfall | a per-step timeline of the run: what each step cost in seconds, dollars, tokens | distributed tracing |
+| σ / α / γ / κ | the four independent on/off knobs: feedback, policy-writing, prompt optimization, prompt compression | factorial experimental design |
+
+## For a game theorist, in one paragraph
+
+A repeated tournament under mechanism-design constraints: entrants (retry policies) compete
+under a hard budget; new entrants are generated by an oracle that observes only past
+payoffs (never the scoring function); and the promotion rule is built to be
+non-manipulable — entrants cannot misreport scores, cannot observe the test set, cannot
+outspend rivals, and a declared winner must replicate from a compressed description of
+itself. The research question: which entry-generation and feedback mechanisms produce
+genuine improvements versus exploitation of the evaluation.
+
+## What it has measured (plain claims, each gated)
+
+- Feedback between attempts helps a lot on tasks with persistent state (+16.4pp) and
+  *hurts* on one-shot retrieval tasks: the sign of the effect depends on the domain.
+- Tuning the feedback-giver's instructions with a state-of-the-art prompt optimizer
+  changed nothing (an exact tie on held-out tasks).
+- Naively giving the AI a memory of its own past outputs made it *worse* (−11.6pp).
+- The AI's self-written policies reliably match the best human-written policy's quality
+  at roughly 2.5× lower cost (replicated three times); they have not yet beaten it on
+  quality on held-out tasks.
+- Compressing a verbose prompt to about a third of its length, combined with feedback, kept
+  quality and cut cost ~30% on a hard math benchmark; the "not worse AND cheaper" test promoted it.
+
+## The honest weaknesses
+
+Each claim so far rests mostly on one domain family (cross-domain replication is configuration,
+not new code); holdouts are small (12–16 tasks), so only effects ≳6pp are detectable; and the
+homegrown vocabulary is heavier than the machine it names, hence this page.
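
The translation table in PLAIN.md describes the gate as "a seeded paired bootstrap must show the win is real". A minimal sketch of that idea follows, assuming per-task scores for the champion and the challenger on the same holdout tasks. The function name `paired_bootstrap_gate`, its parameters, and the one-sided 5% threshold are hypothetical choices for illustration; the diff does not show the project's actual `promotionGate`.

```python
import random
from statistics import mean


def paired_bootstrap_gate(champion, challenger, n_resamples=2000, seed=0, alpha=0.05):
    """Hypothetical sketch: promote only if the resampled win stays above zero.

    champion / challenger: per-task scores on the SAME tasks, in the same order.
    """
    if len(champion) != len(challenger) or not champion:
        raise ValueError("need paired, non-empty score lists")

    # Pairing: work on per-task differences, so task difficulty cancels out.
    diffs = [c - b for b, c in zip(champion, challenger)]

    # Seeding: the same inputs and seed always give the same verdict.
    rng = random.Random(seed)
    n = len(diffs)
    resampled_means = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )

    # One-sided lower bound of the bootstrap distribution of the mean difference.
    lower_bound = resampled_means[int(alpha * n_resamples)]
    return {
        "mean_diff": mean(diffs),
        "lower_bound": lower_bound,
        "promote": lower_bound > 0.0,
    }
```

With holdouts of 12 to 16 tasks, as the weaknesses section notes, a bound like this only clears zero for fairly large effects. This is one way to see why small wins are not declared.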
diff --git a/docs/README.md b/docs/README.md
@@ -35,6 +35,7 @@ The package API and subsystems.
 | Doc | Role | Purpose |
 |---|---|---|
 | [../README.md](../README.md) | API entry point | Install, the loop API, self-improvement framing, exported subpaths. |
+| [PLAIN.md](./PLAIN.md) | the translation layer | The whole system in plain language — five sentences, the six-piece core, the project-term → plain-English table, the one-paragraph version for outside collaborators. Start HERE when introducing the project to anyone. |
 | [glossary.md](./glossary.md) | canonical vocabulary | One definition per term (iteration/round/rollout/attempt, driver/worker/executor, TopologyMove, budget/spend, Scope.act + the coordination MCP), grounded to `file:line`; drifted synonyms flagged. Read when a term is ambiguous. |
 | [execution-model.md](./execution-model.md) | the picture | The four diagrams: the unified `Executor` port (router/bridge/cli/sandbox/BYO) + two engines, driver vs worker, who gets which tools/MCPs, and the spawn mechanics. |
 | [concepts.md](./concepts.md) | mental model | The product-API layer cake (chat turns, tasks, runs) — the onramp before the loop/strategy docs. |
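
The README row for PLAIN.md points at the six-piece core. Three of the plain-English rows (task-with-check, conserved budget pool, harness-verified scoring) can be sketched together. This is an illustration under assumptions, not the package's API: `Harness`, `BudgetExhausted`, and `best_of_n` are hypothetical names. The harness owns both the check and the attempt counter, so a policy cannot overspend, and its self-reported score is never used.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class BudgetExhausted(Exception):
    """Raised when a policy asks for more attempts than its budget allows."""


@dataclass
class Harness:
    check: Callable[[str], bool]  # automatic pass/fail check for one task
    budget: int                   # attempt budget, identical for every policy
    log: List[bool] = field(default_factory=list)  # attempts actually run

    def attempt(self, answer: str) -> bool:
        # The only way to spend budget is through this method, so
        # overspending is impossible by construction.
        if len(self.log) >= self.budget:
            raise BudgetExhausted("attempt budget spent")
        passed = self.check(answer)
        self.log.append(passed)
        return passed

    def verified_score(self) -> float:
        # Recomputed from the attempt log, never from the policy's own claim.
        return 1.0 if any(self.log) else 0.0


def best_of_n(harness: Harness, propose: Callable[[int], str]) -> float:
    """Hypothetical 'try N times, keep the best' policy."""
    for i in range(harness.budget):
        if harness.attempt(propose(i)):
            break
    # Self-reported score, deliberately optimistic; the harness ignores it.
    return 1.0
```

For example, a policy that fails every attempt still returns `1.0` here, yet `harness.verified_score()` reports `0.0`. Only the latter would enter the tournament.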