Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ See `docs/architecture.md` for system design. This file contains only rules and
- **Robust to any input.** A running device tolerates any sequence of UI actions or API calls: add, delete, replace, or reconfigure any module in any order, at any grid size, and it keeps running. Degraded or idle is acceptable; crashed is not. This robustness is a defining strongpoint of projectMM, and it's guarded by the test framework, not by hope: a discovered crash drives a new test that pins the fix (see the Hard Rule). Out of scope: power loss, malformed OTA, brown-out, and other physical/electrical faults the firmware can't intercept; this principle is about what the software accepts as input.
- **No reboot to apply a configuration change.** Every setting takes effect live, on the next render tick — change a pin map, a strand length, an output protocol, a mic pin or rate, anything, on a running device and it just works. There is no init-once-at-boot step, and no *config* change requires a restart, which sets projectMM apart from most LED-controller firmware (where a pin or protocol change means a reboot). Like robustness, this is a defining strongpoint, and it falls out of the architecture for free rather than being hand-built per module: any control whose change reshapes derived state routes through the generic `onBuildState()` rebuild sweep, so drivers, the audio peripheral, effects, layouts, modifiers and network I/O all inherit it. When adding a feature, don't reach for a reboot/restart to apply config; make the change live. Full mechanism + rationale: [architecture.md § Live reconfiguration](docs/architecture.md#live-reconfiguration-every-change-applies-without-a-reboot). The one exception is what you'd expect: a *firmware* OTA flash swaps the binary and needs the usual power cycle — that's not a configuration change, and (like power loss and brown-out) it's the same physical-fault boundary the robustness principle draws.
- **Domain-neutral core.** Separate core infrastructure from the light domain as much as practical. When mixing is necessary, use domain-neutral naming so the code stays open to future separation.
- **Present tense only.** Code, comments, and documentation describe the system as it is now. No changelogs, no roadmaps. History lives in git commits. Exceptions: `docs/backlog/` (forward-looking) and `docs/history/` (backward-looking).
- **Present tense only.** Code, comments, and documentation describe the system as it is now. No changelogs, no roadmaps. History lives in git commits. This bans not just future-tense ("will be", "planned") but **absence-narration**: phrases like "no longer", "anymore", "formerly", "used to", "X was removed", or "there's no longer a Y" describe a *change from a past state* a present-tense reader never saw — state what *is*, not what stopped being. (The test: "there is no MCLK pin" is a present-tense *property* — keep it; "there's no SET_DEVICE_MODEL RPC anymore" narrates a removal — cut it, just describe the path that exists.) Exceptions: `docs/backlog/` (forward-looking) and `docs/history/` (backward-looking) — and `decisions.md` lessons, which legitimately contrast before/after because the contrast *is* the lesson.

## Hard Rules

Expand Down
2 changes: 0 additions & 2 deletions docs/backlog/backlog.md
Original file line number Diff line number Diff line change
Expand Up @@ -292,8 +292,6 @@ Several `platform.h` APIs still use `(buf, len)` pairs where `std::span` would c

Device-model injection over Improv shipped as **"Improv = REST over serial"** (the `APPLY_OP` vendor RPC pushes the whole `deviceModels.json` entry over serial during install; the device runs the same apply-core the HTTP REST API does, on WiFi *and* eth-only firmware). That subsumed the earlier multi-step "board injection + Improv as a general data injector" plan — the general injector *is* APPLY_OP. What remains:

**Open follow-up: per-control validator hook on `ControlDescriptor`.** `SystemModule::setDeviceModel()` validates ASCII-printable (rejecting control bytes, embedded NUL); the HTTP `POST /api/control` write path uses the generic `applyControlValue()` in `Control.cpp` which has no per-control validator and writes the raw bytes through. Acceptable today (HTTP-write callers source values from `deviceModels.json` which the project controls), but the right fix is a per-control validator hook on `ControlDescriptor` so any control can declare an inline validation function pointer. Worth doing when the next control with non-trivial input constraints lands, or when the threat model grows (an integration accepts arbitrary external input and POSTs it through). Sketch: `ControlDescriptor` grows a `bool (*validate)(const void*, size_t)` slot defaulting to nullptr; `applyControlValue` calls it before writing and returns `ApplyResult::Malformed` on false; `addText` / `addPassword` get an optional validator argument. Touches ~5 sites; no protocol change.

**Open follow-up: closed-loop APPLY_OP pacing (read-back ack + retry).** The installer paces APPLY_OP frames open-loop (`sendApplyOpFrame` waits a fixed ~120 ms between ops) rather than reading the device's ack back, because a Web Serial duplex read while the writer lock is held is awkward. The delay covers the worst-case single-buffer consume window with headroom, and each op is idempotent (a lost op re-applies cleanly on a re-flash), so this is robust today. The closed-loop upgrade — read the RPC response, retry once on error `0x82` (buffer busy) — removes the fixed delay (faster install) and makes op-loss impossible rather than improbable. Worth doing if a real install is ever observed dropping an op, or when the config push grows large enough that the cumulative fixed delay is noticeable. Touches only `install-orchestrator.js`.

**Open follow-up: shared JS helpers across device-UI and web-installer.** `safeLocalGet` / `safeLocalSet` (3-line hostile-storage guards) are duplicated in `src/ui/install-picker.js` (device firmware, embedded as a C string via `embed_ui.cmake`) and `docs/install/devices.js` (web installer page, served from Pages). The two live in different build contexts so the shared extract isn't trivial — it'd need a new `src/ui/safe-storage.js` plus updates to: `embed_ui.cmake` (embed the new file), `ui_embedded.h` generator (new C array), HTTP server file routing (new path served), `release.yml` workflow staging, `preview_installer.py` staging. Five files for one 3-line helper is too much pre-merge. Worth doing when the next shared helper arrives — `relativeTime` and `formatBytes` are candidates. Two helpers earn the build-glue cost; one doesn't.
Expand Down
4 changes: 4 additions & 0 deletions docs/history/decisions.md
Original file line number Diff line number Diff line change
Expand Up @@ -697,3 +697,7 @@ The installer was reworked so a board catalog ([`boards.json`](../install/boards
**`deviceName` (identity) vs `deviceModel` (product) vs board (bare PCB) — one term was doing three jobs.** "Board" had been overloaded to mean the per-unit network identity, the hardware product/catalog key, AND the bare PCB. Untangling it: `deviceName` is the **per-unit identity** — one string that drives mDNS (`<deviceName>.local`), the SoftAP name, and the DHCP hostname, so the device shows up under one name everywhere; it's RFC-1123-coerced (`sanitizeHostname`) because it becomes a hostname. `deviceModel` is the **hardware product** (the `deviceModels.json` catalog key, e.g. "projectMM testbench S3") — display-form, spaces allowed, never a hostname. "Device" is the umbrella noun; "board" now means **only the bare PCB**. This drove the BoardModule→SystemModule fold (the identity is core unit state, not a separate module), the `board`→`deviceModel` rename across catalog/installer/Improv (SET_BOARD→SET_DEVICE_MODEL, byte 0xFE unchanged), and the eth pin-map clarification (driver = firmware, pin map = firmware-seeded but **deviceModel-authoritative** so an Olimex entry can override). Lesson: when one noun answers three different questions ("what do I call this unit on the network?", "what product is it?", "what's the bare board?"), that's a naming smell — split it into the qualified terms, pick one umbrella word, and make the split visible in every layer (control names, RPC symbols, catalog keys, docs) so the three concepts can't re-merge.

**"Improv = REST over serial" — one apply-core, two transports, and the testability that follows from extracting the hard part.** The deployed HTTPS installer couldn't configure a flashed device: a browser blocks an HTTPS page from POSTing to an `http://` device (mixed-content), and the `?deviceModel=` pull/handoff that replaced it only ran if the user opened that exact link. The fix reframed the problem — the installer already owns the USB serial port during provisioning, so push the config over it as the *same REST operations the HTTP API runs*: a new `APPLY_OP` (0xFC) Improv vendor RPC whose payload is `{"op":"add|set|clearChildren",…}`, the same JSON a `POST /api/modules`/`/api/control` body carries. On the device the op routes to **one transport-free apply-core** (`HttpServerModule::applyAddModule/applySetControl/applyClearChildren/applyOp`) the HTTP handlers also call, so a network REST call and a serial APPLY_OP execute identical code; the handlers became thin `switch(applyX())` → status-code mappers. This **deleted** the whole browser handoff (device-side catalog fetch, `?deviceModel=` decoration, the inject button) — a net subtraction — and works on Ethernet-only firmware once the Improv listener is decoupled from WiFi (the vendor RPCs compile in unconditionally; only `WIFI_SETTINGS`/`GET_WIFI_NETWORKS` stay `#ifndef MM_NO_WIFI`). Lesson 1: when a push is blocked by the *medium* (mixed-content on HTTPS), look for a medium you already control (the serial port mid-flash) instead of bolting on a fragile pull. Lesson 2 (the one with legs): the way to make it *provable* was to **extract the hard part into a pure core primitive** — the chunk reassembly + out-of-order/duplicate sequence guard moved from the ESP32-only handler into `src/core/ImprovOpReassembler.h` (header-only state machine, returns `Continue/Ready/Error`), and the JS frame builders into `docs/install/improv-frame.js` so `node:test` imports them without the orchestrator's browser deps. Both are *Complexity lives in core; domain modules stay simple* applied for testability: the device handler keeps only its serial I/O, the algorithm gets unit-tested on the desktop, and a format implemented three times (device C++, Python, JS) is pinned by **one shared golden vector** asserted in `test/python` + `test/js` — a contract test is the right answer to *forced* duplication no shared compilation target can remove. The reflex worth keeping: a hard mechanism buried in a platform `.cpp` that "can only be tested on hardware" is a smell — extract its pure core, and "rock solid proven" becomes a unit test instead of a bench session.

**A periodic re-broadcast to let late joiners "catch up" is a hack wearing a keepalive costume.** The 3D preview sends a coordinate table (positions) once, then per-frame colour. The original implementation re-sent the *whole table every ~1 second* "so a client that connected after the last rebuild catches it." It looked fine — a fresh page recovered within a second — so it shipped and sat there. But it's a workaround, not the mechanism: it rebuilt the full table from the layout **every tick-second forever**, on the hot path, whether or not anyone had connected and whether or not anything changed — and it papered over a missing request/response with polling. The correct construct is event-driven: send the table **when it actually changes** (`onBuildState` — grid/layout/LUT rebuild) and **when a client asks** (a new WS connection bumps `BinaryBroadcaster::clientGeneration()`, which `PreviewDriver::loop()` watches and re-sends on change). That's strictly *less* code than the timer and zero idle cost. How it sneaked past review: the workaround *worked* in casual testing and its cost was invisible until a later change made each rebuild heavier and the per-module tick was profiled. Lessons: (1) "re-send periodically so it eventually syncs" is the polling-instead-of-events smell — ask "what's the event that should trigger this?" and trigger on *that*; (2) a recurring rebuild on the hot path must justify itself every tick, so "every second, just in case" fails *Data over objects / fastest hot path* on sight; (3) this is *Continuous refactor, no hacks* — the fix isn't a scheduled cleanup, it's "the moment you see a keepalive timer standing in for a request, replace it." The guard is a test that advances the clock several seconds with no client change and asserts the table is **not** re-sent (the old timer would have).

**Don't hold a vendor library's async handle across your own event loop — it races the library's internal timers.** A UI refresh intermittently crashed the device (`assert failed: xQueueSemaphoreTake queue.c:1709 (( pxQueue ))` — a null FreeRTOS queue — inside the espressif mDNS component's `mdns_query_async_get_results`, plus an `Interrupt wdt timeout`). The mDNS *browse* (discovering peers for the "Your devices" list — distinct from mDNS *advertise*, which serves `<deviceName>.local` and was never the problem) used the async API: `mdns_query_async_new` returns a handle that `DevicesModule` held across ticks, polling it each `loop1s` with a 0 ms timeout. The trap: the mDNS component's **own task** owns that handle's queue and **frees it when the query's window (3 s) expires** — so a poll landing in the gap after expiry asserts on a freed queue. It was intermittent and grid-size-sensitive (a bigger grid lengthens the tick, widening the gap) and looked like "refresh crashes it" only because a refresh's activity coincided with the poll. **First fix attempt was wrong:** I assumed a *service-table mutation* (live rename re-registering `_http._tcp`) tore the handle down and added a cancel-before-mutate guard — it didn't fix it, because the freeing party is the component's expiry timer, not our code. **Real fix:** stop using the async-handle API entirely — replace the start/poll/stop trio with one synchronous `mdnsBrowse()` (`mdns_query_ptr`) that queries, delivers results, and frees everything in a single call, holding **no handle across ticks**, so the race window can't exist. The catch that synchronous introduced: `mdns_query_ptr` blocks the *full* timeout (it waits the whole window for late responders, no early return) and `loop1s` is charged to the tick — an 80 ms query tanked the tick. So **throttle**: browse one service type every ~8th tick with a ~60 ms timeout — one brief hiccup every ~8 s, invisible for discovery, FPS untouched in between. Lessons: (1) a library's async/iterator handle is only valid between *its* lifecycle events — if you can't see/where those fire (here, an internal expiry timer on another task), don't hold the handle across your loop; prefer a self-contained synchronous call that owns the whole lifecycle. (2) An *intermittent, load-dependent* crash whose backtrace sits in a vendor component is a **lifecycle race**, not a component bug — but find the *actual* concurrent actor before "fixing" (my first guess at the actor was wrong and the fix did nothing). (3) Trading async for synchronous trades a race for a blocking cost — budget it (throttle + bound the timeout) so the cure isn't a tick-killer. (4) Desktop stubs these mDNS calls to no-ops, so it's a hardware-only fix the unit suite can't reach; the reproduction (concurrent WS churn at a large grid → crash before, stable after, uptime climbing) is the proof, in the commit, not a desktop test.
Loading
Loading