51 commits
28d033c
HC: FST acceleration plan (tech stack + graceful-degradation hybrid)
johnml1135 Jun 25, 2026
8a6bdc9
HC FST advisor: static grammar linter flagging rules that block/infla…
johnml1135 Jun 25, 2026
437cef1
HC FST advisor: fix infix false-positive; label advisories by stratum
johnml1135 Jun 25, 2026
d07c2cd
HC FST advisor: classify each insertion escape clean (probe-able) vs …
johnml1135 Jun 25, 2026
21120ff
HC FST advisor: add regularity (Kaplan-Kay) axis, orthogonal to the w…
johnml1135 Jun 25, 2026
3d46e5f
HC FST plan: make census reference self-contained (advisor confirms it)
johnml1135 Jun 26, 2026
d3da544
HC FST: packed 32-bit morpheme-token output schema (MorphToken/MorphOp)
johnml1135 Jun 26, 2026
4f657ae
HC FST: Word -> token converter, proving the schema represents real a…
johnml1135 Jun 26, 2026
3a52b6e
HC FST spike: FST-backed analyzer for root+suffix, parity + completeness
johnml1135 Jun 26, 2026
7a6fb2f
HC FST: FstMorpher implements IMorphologicalAnalyzer (swappable at th…
johnml1135 Jun 26, 2026
d1bcbba
HC FST: FstMorpher.FromLanguage — drive the analyzer from a compiled …
johnml1135 Jun 26, 2026
ba2f9a8
HC FST: extend FstMorpher to prefixes — recognizes (prefix?) root (su…
johnml1135 Jun 26, 2026
3c1b855
HC FST plan: add section 9 — confirming FST closure (completeness cer…
johnml1135 Jun 26, 2026
f282de9
HC FST plan: add section 10 — eager/lazy partition knob (bounded size…
johnml1135 Jun 26, 2026
c642672
HC FST: shadow/verification harness — FST-vs-search analysis-set pari…
johnml1135 Jun 26, 2026
b9f6540
HC FST: bounded (2-stem) compounding in FstMorpher — head + non-head
johnml1135 Jun 26, 2026
d556fc8
HC FST: environment-conditioned allomorphy — one chain per affix allo…
johnml1135 Jun 26, 2026
90ab71f
HC FST: static feeding-closure pass (§9.5 stratal pre-filter)
johnml1135 Jun 26, 2026
7caa7c7
HC FST: Tier-2 hybrid runtime — FST fast path + search fallback (Phas…
johnml1135 Jun 26, 2026
2d169aa
HC FST: generator / reverse direction (Phase 4) — FstGenerator
johnml1135 Jun 26, 2026
69076b6
HC FST: [Explicit] Sena benchmark — census, FST/hybrid build, parse t…
johnml1135 Jun 26, 2026
b08e324
HC FST: template analyzer with build-time category gating + token-acc…
johnml1135 Jun 26, 2026
181c8ef
HC FST: prefix+suffix templates, category+stratum gating, NFA-sim wal…
johnml1135 Jun 26, 2026
8b3fa25
HC FST: handle realizational affixes + zero-morphemes in templates — …
johnml1135 Jun 26, 2026
6ef3493
HC FST: propose-and-verify analyzer (way B) + honest Sena finding
johnml1135 Jun 26, 2026
f1c93cd
HC FST: SoundHybridMorpher (propose + replay-verify + fallback) — sou…
johnml1135 Jun 26, 2026
bdf5430
HC FST: dense bit-packed analysis-state (MorphStateLayout) + internin…
johnml1135 Jun 26, 2026
f0bb1ff
HC FST: build-time per-slot category gate (faithful, no walk-order is…
johnml1135 Jun 26, 2026
4645b31
HC FST: bare-root obligatoriness guard — closes the mbale over-genera…
johnml1135 Jun 26, 2026
820d05b
HC FST: correct the oracle (unlimited unapp), verify-discard gates, §…
johnml1135 Jun 26, 2026
d5c106e
HC FST: derivation-suffix layer (closes under-gen) + round-trip finding
johnml1135 Jun 26, 2026
6aa35ae
HC FST: confirm verify root cause (two synthesis doors) + corpus numbers
johnml1135 Jun 26, 2026
41f4da4
HC FST: pinpoint lossless-verify build (cross-stratum) + status (soun…
johnml1135 Jun 26, 2026
788281f
HC FST plan: record that GenerateWords fails even at correct order; t…
johnml1135 Jun 26, 2026
7eae96b
HC FST plan: choose Route A (reuse HC) over Route B (duplicate constr…
johnml1135 Jun 26, 2026
6e11f1e
HC FST plan: confirm Route A feasibility (analysis sets SyntacticFeat…
johnml1135 Jun 26, 2026
7659814
HC FST: Route A verify — restricted re-analysis (sound, lossless, ~18x)
johnml1135 Jun 26, 2026
ccb42b8
HC FST benchmark: cache the oracle (6min -> 2.4min) + drop obsolete r…
johnml1135 Jun 26, 2026
be0d10a
HC FST plan: record Route A done (sound+lossless ~15x) + last gap = P…
johnml1135 Jun 26, 2026
65dfff3
HC FST: category-changing derivation — attach templates over derived …
johnml1135 Jun 26, 2026
4cad649
HC FST: land DerivDepth=2 sweet spot (194/200, ~13x); plan final status
johnml1135 Jun 26, 2026
c69f864
HC FST: sharpen parity signature to morpheme identity; validate resul…
johnml1135 Jun 26, 2026
6f122a8
HC FST: completeness certificate (§12) — grammar-level proof + prefix…
johnml1135 Jun 26, 2026
28a21e7
HC FST: fix unsound certificate + vacuous proof — empirical set-parit…
johnml1135 Jun 26, 2026
d8c10d5
HC FST: negative-examples soundness test (no false positives)
johnml1135 Jun 26, 2026
82e85fc
HC FST: template-less derivational path — closes cawo, predicate now …
johnml1135 Jun 26, 2026
ae5aaf3
HC FST: MVP cleanup — multithread the verify, per-word opt-out, cut d…
johnml1135 Jun 26, 2026
380659b
HC FST: add CI unit tests for the verify chain (the review's top gap)
johnml1135 Jun 26, 2026
2461461
HC FST docs: relocate to docs/, drop scaffolding/stale, add shipped-M…
johnml1135 Jun 26, 2026
a06cdf7
HC FST: two-path caching analyzer (fast FST + slow engine + persisted…
johnml1135 Jun 26, 2026
c90040c
HC FST: certified grammars skip the full search; tunable derivation d…
johnml1135 Jun 27, 2026
144 changes: 144 additions & 0 deletions docs/HERMITCRAB_FST_ADVISOR.md
@@ -0,0 +1,144 @@
# Grammar FST Advisor — plan

A grammar evolves; one new rule can quietly push it from the fast finite-state path into the
slow combinatorial search. This plan adds a **grammar advisor/linter** that, for any HermitCrab
`Language`, flags the rules that make parsing expensive or block FST compilation, and gives the
grammar engineer **actionable write-ups**: *why* a rule is costly, how to **constrain** it back
into fast territory, and an **alternative formulation** to try.

It is the front-end to the FST work (`HERMITCRAB_FST_PLAN.md`): the same per-rule classification
that decides the FST tier also drives the warnings.

## 1. What it does

Input: a compiled `Language`. Output: a `GrammarFstReport` — a list of per-rule advisories plus
an overall **tier verdict**. Each advisory has:
- **rule name + kind** (affix / phonological / compounding / template),
- **severity**: `Escape` (breaks FST → forces search), `Cost` (inflates the search fan-out), or
`Info`,
- **issue**: one sentence on what's expensive and why,
- **advice**: "constrain it like this" and/or "try this instead".
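
As a rough illustration of the report shape described above, here is a minimal standalone sketch in Python. It is a model only, not the HermitCrab C# API: the names `Severity`, `Advisory`, `Report`, and every field name are hypothetical stand-ins for `GrammarFstReport` and its advisories.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    ESCAPE = "Escape"  # breaks FST compilation and forces the search
    COST = "Cost"      # inflates the search fan-out
    INFO = "Info"      # worth noting, not a slowdown by itself


@dataclass(frozen=True)
class Advisory:
    rule_name: str
    rule_kind: str      # "affix", "phonological", "compounding" or "template"
    severity: Severity
    issue: str          # one sentence on what is expensive and why
    advice: str         # how to constrain the rule, or what to try instead


@dataclass
class Report:
    tier_verdict: str
    advisories: list[Advisory] = field(default_factory=list)

    def format(self) -> str:
        """Render a readable dump: the verdict first, then one entry per advisory."""
        lines = [f"Tier verdict: {self.tier_verdict}"]
        for a in self.advisories:
            lines.append(f"[{a.severity.value}] {a.rule_name} ({a.rule_kind}): {a.issue}")
            lines.append(f"    advice: {a.advice}")
        return "\n".join(lines)
```

Keeping the advisory immutable and the verdict separate from the list mirrors the split in the text: advisories are per-rule facts, while the verdict is a summary over all of them.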

## 2. The classifier (what flags what)

The signals below are detected from the object model (`AffixProcessRule.Allomorphs` → `Rhs` actions; `RewriteRule`
Lhs/Subrule environments; `MorphologicalOutputAction.PartName`; `Quantifier.Max/MinOccur`):

| Signal | Severity | Issue | Advice |
|---|---|---|---|
| **Reduplication** — a part copied ≥2× by `CopyFromInput` | **Escape** | copying an unbounded span isn't finite-state; forces search for any word it could apply to | "If the reduplicant is a fixed size (e.g. one CV syllable), bound the copied part's length → finite-state. If only a few forms reduplicate, list them as lexical entries. Else the grammar stays in the hybrid/search tier." |
| **Infixation / stem split** — ≥2 `CopyFromInput` of *different* parts | **Escape** (unless bounded) | the stem is split at a content-determined position | "If the infix position is fixed, encode it as a bounded split; a variable split blocks FST." |
| **Process modification** — `ModifyFromInput` present | **Info/verify** | FST-able only if the modification is local/bounded | "Local feature change in a fixed context = fine; non-local/agreement = blocks FST — try a bounded reformulation." |
| **Phonological rewrite rule** present | **Info/verify** | FST-able iff its environment is a bounded window | "Bound the left/right environment to the actual window (usually 1–2 segments); unbounded context blocks FST." |
| **Deletion rule** — Lhs longer than Rhs | **Cost** | analysis must guess where deleted segments were and re-insert them (× `DeletionReapplications`) | "Keep `DeletionReapplications` as low as the language needs; bounded deletion context is still FST-able." |
| **Unbounded environment** — a `Quantifier` with infinite `MaxOccur` in an environment | **Escape** | matches an arbitrary-length span | "Replace the `+`/`*` context with the fixed window the rule really needs." |
| **Many allomorphs** on one rule (> threshold) | **Cost** | each allomorph multiplies un-application branching | "Consolidate via environment conditioning where possible." |
| Compounding rule | **Info** | bounded by `MaxStemCount`, so finite | — |
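
Several rows of the table reduce to simple counts, which the following sketch applies literally. It is a simplified Python model under stated assumptions: an allomorph's `Rhs` is represented as `(kind, part_name)` pairs, a rewrite rule as segment counts plus a list of environment `MaxOccur` values, and `math.inf` stands in for an infinite `MaxOccur`. None of these representations come from the HermitCrab object model itself.

```python
import math
from collections import Counter

UNBOUNDED = math.inf  # hypothetical stand-in for an infinite Quantifier.MaxOccur


def classify_affix_rhs(actions):
    """Classify one allomorph's Rhs, given as (kind, part_name) pairs.

    kind is "copy" (CopyFromInput), "modify" (ModifyFromInput) or "insert".
    Returns a list of (signal, severity) pairs, following the table's rows.
    """
    copies = Counter(part for kind, part in actions if kind == "copy")
    signals = []
    if any(count >= 2 for count in copies.values()):
        # The same part is copied at least twice.
        signals.append(("Reduplication", "Escape"))
    if len(copies) >= 2:
        # At least two different parts are copied: the stem is split.
        signals.append(("Infixation / stem split", "Escape"))
    if any(kind == "modify" for kind, _ in actions):
        signals.append(("Process modification", "Info/verify"))
    return signals


def classify_rewrite(lhs_len, rhs_len, env_max_occurs):
    """Classify a rewrite rule from its Lhs/Rhs lengths and environment quantifiers."""
    signals = [("Phonological rewrite rule", "Info/verify")]
    if lhs_len > rhs_len:
        signals.append(("Deletion rule", "Cost"))
    if any(m == UNBOUNDED for m in env_max_occurs):
        signals.append(("Unbounded environment", "Escape"))
    return signals
```

For example, `classify_affix_rhs([("copy", "stem"), ("insert", "ed")])` returns an empty list (plain suffixation), while copying `"stem"` twice yields the reduplication escape.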

## 3. Tier verdict (static; corpus refines it)

- **0 Escape advisories** → **Tier 1 candidate** (fully FST-able); confirm with the FST compile
and the corpus parity check.
- **a few Escapes** → **Tier 2 candidate** (hybrid: escapes fall back to search) — run the corpus
fallback-rate measurement to confirm it's worth it vs. Tier 3.
- **pervasive Escapes** → **Tier 3** (search only).

The static report can't compute the corpus-weighted fallback rate, so it reports the tier
*candidate* + the escape list; the FST pipeline's corpus pass (`HERMITCRAB_FST_PLAN.md` §1)
confirms it.
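
The static verdict can be sketched as a small function over the escape count. The plan does not define where "a few" ends and "pervasive" begins, so the `pervasive_fraction` cutoff below (and its default of 0.5) is purely a hypothetical parameter for illustration, as are the function and argument names.

```python
def tier_candidate(escape_count, rule_count, pervasive_fraction=0.5):
    """Return the static tier candidate for a grammar.

    pervasive_fraction is an assumed cutoff: at or above this share of
    escaping rules, the grammar is treated as search-only (Tier 3).
    """
    if escape_count < 0 or escape_count > rule_count:
        raise ValueError("escape_count must be between 0 and rule_count")
    if escape_count == 0:
        return "Tier 1 candidate"
    if escape_count / rule_count >= pervasive_fraction:
        return "Tier 3"
    return "Tier 2 candidate"
```

The result is only a candidate: as noted above, the corpus-weighted fallback rate still has to confirm whether a Tier 2 candidate is worth it compared with Tier 3.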

## 4. The "one new rule blew up the grammar" workflow

Run the advisor before/after a grammar change (or in CI). A new `Escape` advisory that flips the
tier (e.g. Tier 1 → Tier 2) is the warning: it names the offending rule, says it moved the whole
grammar off the fast path, and gives the constrain/alternative write-up. Grammar engineers get
"this rule made parsing slow, here's how to keep it fast" at authoring time.
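
A before/after comparison of this kind could look like the sketch below. It assumes each advisor run has been reduced to a plain dict with a numeric `tier` (1 is fastest) and a set of escaping rule names under `escapes`; that shape and the function name are illustrative assumptions, not part of the planned API.

```python
def tier_flip_warnings(before, after):
    """Name the newly escaping rules when a grammar change worsens the tier.

    before/after: {"tier": int, "escapes": set of rule names}.
    Returns one message per new escape, or an empty list if the tier held.
    """
    if after["tier"] <= before["tier"]:
        return []
    new_escapes = sorted(after["escapes"] - before["escapes"])
    return [
        f"{rule}: new Escape moved the grammar from Tier {before['tier']} to Tier {after['tier']}"
        for rule in new_escapes
    ]
```

In CI, a non-empty result would be the signal to fail the check or post the constrain/alternative write-up for each named rule.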

## 5. Implementation

- `GrammarFstAdvisor.Analyze(Language) → GrammarFstReport` in the HermitCrab library (pure static
analysis of the object model; no parsing, no corpus needed).
- `GrammarFstReport.Format()` for a readable dump.
- Tests: a normal concatenative grammar → Tier 1, no escapes; add a reduplication rule → the
advisor flags it `Escape` with the reduplication write-up and downgrades the tier.
- Run on the real Sena grammar and report the advisories + tier.

## 6. Validate on Sena

The census already showed that Sena is concatenative, with no rewrite rules and no productive
reduplication, so expect **Tier 1, zero escapes**, possibly with a few `Cost`/`Info` notes
(allomorph counts, compounding). That result both validates the classifier (no false escapes)
and confirms Sena is the fast-path case.

## 7. Engine extension — the *regularity* axis (added, kept orthogonal to the warning)

The advisor answers one question — **"is this slow in today's engine?"** — and the user keeps
asking exactly that ("which rule blew up the grammar", "which cases are still slow"). The
extension adds a *second, independent* question — **"does an FST exist for this in principle?"**
(regular vs non-regular) — **without letting the answer soften the slow-today warning.**

Why the two must not be merged: the engine that turns "regular" into "fast" is the FST compiler,
and **it does not exist yet** (gated on the unbuilt spike, `HERMITCRAB_FST_PLAN.md` §7). So
"regular" today means *fast eventually, slow now*. If a vowel-harmony rule were reported as
`Cost / Tier-1-reachable`, a non-expert would read "fine", when in the only engine that ships it is
the worst case (harmony on a common segment ⇒ nearly every word on the slow path). The severity must
keep telling the truth about **today**.

So **severity is unchanged** — it means *escapes the finite-state fast path in today's engine*
(forces the combinatorial search). Harmony, infixation, and reduplication (bounded or not) all
stay `Escape`: all are slow now. We only *add* a `Regular` axis that says whether an FST could
reclaim it later, and we report it as a **separate reclaim-path line that never upgrades the
tier**.

The theory behind the new axis is **Kaplan & Kay (1994)**: a context-sensitive rewrite rule
`φ → ψ / λ _ ρ` with regular `φ, ψ, λ, ρ`, applied obligatorily/directionally (not recursively
into its own unbounded output), **denotes a regular relation — however long `λ`/`ρ` are.** HC's
`RewriteRule` is this form, and its `Rhs` is a *bounded segment specification*, not a copy (copy
lives only in morphological `CopyFromInput`). So:

- **Unbounded-environment rewrite (harmony/spread): `Regular = true`** — *iff* the rule's own
`Lhs`/`Rhs` are bounded (only the environment is unbounded). Reclaim later by **state-encoding**
the spreading feature (or two-level pre-image arcs). If the `Lhs`/`Rhs` themselves are unbounded
we cannot confirm regularity → `Regular = false` (conservative). Stays `Escape` (slow today).
- **Reduplication splits by boundedness of the copied part.** Look up the copied part's defining
`Lhs` pattern by name: a **length-bounded** reduplicant (fixed CV/CVC) is a finite copy →
`Regular = true` (reclaim by bounded fold). Copying an **unbounded** part (whole stem,
`Annotation(any).OneOrMore`) is the one genuinely non-regular operation (the copy language `{ww : w ∈ Σ*}` is not regular)
→ `Regular = false`. **If the part can't be resolved, default `Regular = false` (warn).** Stays
`Escape` either way.
- **Infixation** at a pattern-defined slot: `Regular = true` (the split is a regular pattern;
reclaim by bounded fold / the per-word probe). Stays `Escape`.
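
The three cases above amount to a small decision function for the `Regular` axis. The sketch below is a Python model with hypothetical names: `construct` labels, `lhs_rhs_bounded`, and `copied_max_len` are assumptions, with `None` meaning the copied part could not be resolved by name and `math.inf` meaning an unbounded part. Severity is deliberately not an input or an output, since it stays `Escape` in every case.

```python
import math
from typing import Optional


def regular_axis(construct: str,
                 lhs_rhs_bounded: bool = True,
                 copied_max_len: Optional[float] = None) -> Optional[bool]:
    """Return the Regular flag: True, False, or None when not applicable."""
    if construct == "unbounded_env_rewrite":
        # Regular only when the rule's own Lhs/Rhs are bounded.
        return lhs_rhs_bounded
    if construct == "reduplication":
        # Unresolved part (None) or unbounded copy: conservative False.
        return copied_max_len is not None and copied_max_len != math.inf
    if construct == "infixation":
        # A pattern-defined split is a regular pattern.
        return True
    return None
```

For instance, a fixed CV reduplicant (`copied_max_len=2`) gives `True`, whole-stem copying (`copied_max_len=math.inf`) gives `False`, and an unresolved part falls back to `False`.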

### The reclaim map (how a `Regular` case *would* be made fast — once the compiler exists)

| Construct | `Regular` | Slow today? | Reclaim path (needs the FST compiler) |
|---|---|---|---|
| Unbounded-environment rewrite (harmony/spread) | ✅ (bounded Lhs/Rhs) | **yes** | state-encode the spreading feature / two-level pre-image arcs |
| Bounded reduplication (fixed CV reduplicant) | ✅ | **yes** | bounded fold — emit the finite copy as arcs |
| Infixation (pattern-defined slot) | ✅ | **yes** | bounded fold / per-word strip-and-reparse probe |
| Deletion | ✅ | **yes** | inverse probe — re-insert candidate deleted segments, re-parse |
| Unbounded-copy reduplication | ❌ | **yes** | per-word probe only (when surface-invariant); else search |

`Regular` and `Probeable` (§5a) are both *paths forward*, never excuses: `Regular` = "an FST
could reclaim it (compiler pending)", `Probeable` = "a runtime strip-and-reparse is sound". The
severity and tier keep warning about today.

### Implementation of the extension

- Add `GrammarAdvisory.Regular` (`bool?`): true = an FST exists in principle (reclaim by
compiling), false = genuinely non-regular / unconfirmable, null = N/A. **Severity is not
changed by it.**
- Reduplication: resolve the copied part's `Lhs` pattern by name; bounded → `Regular=true`,
unbounded or unresolved → `Regular=false`. Severity stays `Escape`.
- Infixation: `Regular=true`; severity stays `Escape`; keep the per-word-probe advice.
- Unbounded-environment rewrite: `Regular = !(unbounded Lhs or Rhs)`; severity stays `Escape`;
advice = Kaplan–Kay + state-encoding, explicitly "regular in principle but slow in today's
engine".
- Report: count `RegularEscapeCount` vs `NonRegularEscapeCount`; emit a **reclaim-path line**
("N of M escapes are FST-reclaimable once the compiler exists; all M are slow in today's
engine"). **The tier verdict is unchanged** — no "Tier 1-reachable" upgrade.
- Tests: a non-expert sanity check — a grammar whose only complex rule is harmony must still
report a slow-path warning (escape present), with `Regular=true` only as the reclaim note.
Unbounded-copy reduplication ⇒ `Regular=false`; bounded reduplicant ⇒ `Regular=true`;
infixation ⇒ `Escape` + `Regular=true` (the committed infix test keeps its severity). Sena
unchanged (Tier 1).
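
The reclaim-path line can be sketched as a pure function over the `Regular` flags of the escape advisories. This is a Python model, not the C# implementation; the function name is hypothetical, and treating a `None` flag on an escape as non-reclaimable is an assumption made here for the count.

```python
def reclaim_path_line(escape_regular_flags):
    """Build the reclaim-path line from the Regular flag of each Escape advisory.

    escape_regular_flags: iterable of True / False / None, one per escape.
    Returns (regular_count, non_regular_count, line). The tier verdict is
    computed elsewhere and is not touched by this summary.
    """
    flags = list(escape_regular_flags)
    total = len(flags)
    regular_count = sum(1 for flag in flags if flag is True)
    non_regular_count = total - regular_count
    line = (
        f"{regular_count} of {total} escapes are FST-reclaimable once the compiler exists; "
        f"all {total} are slow in today's engine"
    )
    return regular_count, non_regular_count, line
```

Because the function only reads the flags and returns a summary, it cannot upgrade the tier, which matches the rule that the reclaim note never softens the slow-today warning.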