diff --git a/docs/PLAIN.md b/docs/PLAIN.md
new file mode 100644
index 0000000..564cf42
--- /dev/null
+++ b/docs/PLAIN.md
@@ -0,0 +1,76 @@
+# The system in plain language
+
+> The translation layer. Internal docs use the project's own vocabulary; THIS page says the
+> same things without it. If an explanation here contradicts a technical doc, the technical
+> doc wins — then fix this page. Audience: a colleague meeting the project cold.
+
+## Five sentences, no invented words
+
+1. We have tasks with **automatic pass/fail checks** — tests you can run, answer keys you
+   can verify mechanically.
+2. An AI attempts each task a fixed number of times under different **retry policies**:
+   "try 3 times, keep the best", "try, get feedback, try again", and so on.
+3. We compare policies **fairly**: identical tasks, identical attempt budgets, paired
+   statistics, judged on fresh tasks that no tuning step ever saw.
+4. The distinctive part: the AI also **writes new retry policies itself**, as short
+   programs, and they enter the same tournament under the same rules as human-written ones.
+5. Every dollar and second is metered, so "better" can also mean "**equally good but
+   cheaper**" — and that claim is statistically testable, not vibes.
+
+## The load-bearing core is six pieces
+
+Task-with-check · retry policy · the tournament runner · the AI policy-writer · the
+statistical promotion gate · crash-resume. Everything else is either a **fairness rule**
+(added because a specific run produced a wrong number without it) or an **experiment on
+the menu** (a configuration, not a machine part). Experiment configs are cheap; do not
+mistake a long menu for a complicated machine.
+
+## Translation table
+
+| Project term | Plain English | Standard concept? |
+|---|---|---|
+| Environment | a task domain: open it, act on it with tools, check the result | RL environment / gym |
+| shot | one attempt | — |
+| steering / `refine` | feedback injected between attempts | self-refinement |
+| the author / `authorStrategy` | the AI writes a new retry policy as a program | program synthesis |
+| evolution / generations | rounds of: write candidates → tournament → keep the champion | evolutionary search |
+| harness-verified scoring | never trust a policy's self-reported score; recompute it from the attempts the system actually ran | basic measurement hygiene |
+| selector ≠ judge (the firewall) | the feedback-giver never sees the answer key or the score | no reward leakage |
+| conserved budget pool | every policy gets exactly the same attempt budget; overspending is structurally impossible | compute-matched comparison |
+| holdout / fresh slice | final judging happens on tasks no tuning step ever touched | train/test split |
+| the gate / `promotionGate` | a seeded paired bootstrap must show the win is real before anything is declared better | standard inferential statistics |
+| non-inferiority mode | prove "not worse on quality AND significantly cheaper" | clinical-trials statistics |
+| band screen | drop questions every policy aces — they carry no information | item discrimination (psychometrics) |
+| reproducer certificate | a fresh AI re-builds the winner from a ~64-word description; if the rebuild can't match it, the win was memorization, not method | description-length / compression test (arXiv:2606.11045) |
+| κ compression / minimization | shorten the prompt; prove quality holds and cost drops | prompt compression (LLMLingua lineage); the every-Nth-character floor is delta debugging |
+| waterfall | a per-step timeline of the run: what each step cost in seconds, dollars, tokens | distributed tracing |
+| σ / α / γ / κ | the four independent on/off knobs: feedback, policy-writing, prompt optimization, prompt compression | factorial experimental design |
+
+## For a game theorist, in one paragraph
+
+A repeated tournament under mechanism-design constraints: entrants (retry policies) compete
+under a hard budget; new entrants are generated by an oracle that observes only past
+payoffs (never the scoring function); and the promotion rule is built to be
+non-manipulable — entrants cannot misreport scores, cannot observe the test set, cannot
+outspend rivals, and a declared winner must replicate from a compressed description of
+itself. The research question: which entry-generation and feedback mechanisms produce
+genuine improvements versus exploitation of the evaluation.
+
+## What it has measured (plain claims, each gated)
+
+- Feedback-between-attempts helps a lot on tasks with persistent state (+16.4pp), and
+  *hurts* on one-shot retrieval tasks — the effect has a sign that depends on the domain.
+- Tuning the feedback-giver's instructions with a state-of-the-art prompt optimizer
+  changed nothing (an exact tie on held-out tasks).
+- Naively giving the AI a memory of its own past outputs made it *worse* (−11.6pp).
+- The AI's self-written policies reliably match the best human-written policy's quality
+  at roughly 2.5× lower cost (replicated three times); they have not yet beaten it on
+  quality on held-out tasks.
+- Compressing a verbose prompt to ~a third, combined with feedback, kept quality and cut
+  cost ~30% on a hard math benchmark — promoted by the "not worse AND cheaper" test.
+
+## The honest weaknesses
+
+Mostly one domain family per claim so far (cross-domain replication is configuration, not
+new code); small holdouts (12–16 tasks) mean only effects ≳6pp are detectable; and the
+homegrown vocabulary is heavier than the machine it names — hence this page.
diff --git a/docs/README.md b/docs/README.md
index b3cbbb9..8653780 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -35,6 +35,7 @@ The package API and subsystems.
 | Doc | Role | Purpose |
 |---|---|---|
 | [../README.md](../README.md) | API entry point | Install, the loop API, self-improvement framing, exported subpaths. |
+| [PLAIN.md](./PLAIN.md) | the translation layer | The whole system in plain language — five sentences, the six-piece core, the project-term → plain-English table, the one-paragraph version for outside collaborators. Start HERE when introducing the project to anyone. |
 | [glossary.md](./glossary.md) | canonical vocabulary | One definition per term (iteration/round/rollout/attempt, driver/worker/executor, TopologyMove, budget/spend, Scope.act + the coordination MCP), grounded to `file:line`; drifted synonyms flagged. Read when a term is ambiguous. |
 | [execution-model.md](./execution-model.md) | the picture | The four diagrams: the unified `Executor` port (router/bridge/cli/sandbox/BYO) + two engines, driver vs worker, who gets which tools/MCPs, and the spawn mechanics. |
 | [concepts.md](./concepts.md) | mental model | The product-API layer cake (chat turns, tasks, runs) — the onramp before the loop/strategy docs. |