tools/stress: orchestrator skeleton (CLI, sweep, runlog, abort) #3776
Merged
This was referenced May 28, 2026
Closed
nikw9944
approved these changes
May 29, 2026
ed95322 to 144c60b
ee9b822 to b7d980a
Adds tools/stress/device-orchestrator/, the device-stress orchestrator binary for the GRE Tunnel Capacity Study. The binary parses every flag from #3746's CLI list, dumps orchestrator-config.json on start, runs a provision-then-reverse-deprovision sweep against a live serviceability program, and emits the runlog row schema {run_id, user_index, user_pubkey, tunnel_id, event, t_ns, n_after_event} for each submit | confirm | activate | deprovision_* event.

Packages:
- pkg/reconcile: PlanFor() pure function (lifted from the part-1 SDK PR; now lives with the orchestrator as policy, not as an SDK primitive)
- pkg/runlog: append-only JSONL writer for orchestrator-runlog.json
- pkg/sweep: provision-then-deprovision loop driven by PlanFor; uses a Clock + Executor interface for testability; reverse-creation-order delete
- pkg/abort: sentinel-file poller that cancels a derived ctx between user iterations so an in-flight Create/Delete completes before exit
- pkg/agent: AgentRunner interface + noop impl; SSH runner lands in part 3 along with pre_commit_log / applied event emission
- pkg/exec: Live impl of sweep.Executor over serviceability.{Client, Executor}; picks deterministic per-user IPs from --client-ip-base
- cmd/device-orchestrator: flag parsing, config dump, signal + abort handling, sweep wiring

The agent runner is stubbed behind an interface so this PR can land end-to-end functionality (provision/deprovision + runlog + abort) without the SSH plumbing. The SSH runner and the corresponding pre_commit_log / applied row generation land in part 3 of #3746.

Part 2 of #3746. Closes #3771.
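The row schema above could be serialized roughly as follows. This is a minimal sketch, not the actual pkg/runlog code: only the JSON keys come from the commit message, while the Go field names and types, the `now` clock hook, and the `encodeRows` helper are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// Row carries the JSON keys named in the commit message. The Go field names
// and types are assumptions made for this sketch.
type Row struct {
	RunID       string `json:"run_id"`
	UserIndex   int    `json:"user_index"`
	UserPubkey  string `json:"user_pubkey"`
	TunnelID    uint16 `json:"tunnel_id"`
	Event       string `json:"event"`
	TNs         int64  `json:"t_ns"`
	NAfterEvent int    `json:"n_after_event"`
}

// Writer appends one JSON object per line (JSONL) to the underlying writer.
type Writer struct {
	enc *json.Encoder
	now func() int64 // hypothetical clock hook returning nanoseconds
}

func NewWriter(w io.Writer, now func() int64) *Writer {
	return &Writer{enc: json.NewEncoder(w), now: now}
}

// Append writes a single row, filling t_ns from the clock if it is zero.
func (w *Writer) Append(r Row) error {
	if r.TNs == 0 {
		r.TNs = w.now()
	}
	return w.enc.Encode(r) // Encode ends every value with a newline
}

// encodeRows is a hypothetical helper that renders rows to a JSONL string.
func encodeRows(rows []Row, now func() int64) (string, error) {
	var sb strings.Builder
	w := NewWriter(&sb, now)
	for _, r := range rows {
		if err := w.Append(r); err != nil {
			return "", err
		}
	}
	return sb.String(), nil
}

func main() {
	clock := func() int64 { return 42 } // fixed clock keeps the demo deterministic
	out, err := encodeRows([]Row{
		{RunID: "r1", UserPubkey: "pk0", TunnelID: 500, Event: "submit"},
		{RunID: "r1", UserPubkey: "pk0", TunnelID: 500, Event: "activate", NAfterEvent: 1},
	}, clock)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Injecting the clock rather than calling time.Now directly keeps the auto-filled t_ns testable, which matches the Clock interface the sweep package is described as using.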
- sweep: validate OwnerFilter is non-zero; move dependency defaults out of validate() into applyDefaults(); scope all sweep logs with run_id.
- sweep: run create calls and the whole deprovision phase under context.WithoutCancel so an abort never interrupts an in-flight chain op (which could orphan a user) and teardown always completes.
- sweep: skip the inter-batch hold when a batch created no users.
- sweep test: drive the abort case via real ctx cancellation instead of a faked executor error.
- exec: drop the dead fetchTunnelID error path (it always returned nil).
- agent: guard the no-op runner's Start with sync.Once to avoid a double close panic.
- cmd: validate required flags before writing orchestrator-config.json; sort missing-flag names for deterministic output; capture dumpJSON's Close error; rename the runlog to orchestrator-runlog.jsonl.
- runlog: trim obvious comments.
- CHANGELOG: condense the orchestrator entry.
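The context.WithoutCancel item above can be illustrated with a small sketch. It assumes a simplified loop in which cancellation is checked only between users; the `opFn` type and the `provision`, `deprovision`, and `runAbortDemo` functions are hypothetical names for this example and are not taken from pkg/sweep. It requires Go 1.21 or later for context.WithoutCancel.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// opFn stands in for a chain operation on one user (hypothetical signature).
type opFn func(ctx context.Context, idx int) error

// provision creates users until target is reached or ctx is cancelled.
// Cancellation is observed only between iterations; each create runs under
// context.WithoutCancel so an in-flight operation is never interrupted.
func provision(ctx context.Context, target int, create opFn) ([]int, error) {
	var created []int
	for i := 0; i < target; i++ {
		if err := ctx.Err(); err != nil {
			return created, err
		}
		if err := create(context.WithoutCancel(ctx), i); err != nil {
			return created, err
		}
		created = append(created, i)
	}
	return created, nil
}

// deprovision tears users down in reverse creation order and ignores
// cancellation entirely, so teardown always runs to completion.
func deprovision(ctx context.Context, created []int, del opFn) error {
	ctx = context.WithoutCancel(ctx)
	var errs []error
	for i := len(created) - 1; i >= 0; i-- {
		if err := del(ctx, created[i]); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}

// runAbortDemo cancels the parent context while user 1 is being created and
// reports which users were created, the teardown order, and the sweep error.
func runAbortDemo() (created, torndown []int, sweepErr error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	create := func(opCtx context.Context, idx int) error {
		if idx == 1 {
			cancel() // the abort arrives mid-operation
		}
		return opCtx.Err() // nil: opCtx is detached from cancellation
	}
	created, sweepErr = provision(ctx, 4, create)
	_ = deprovision(ctx, created, func(opCtx context.Context, idx int) error {
		torndown = append(torndown, idx)
		return opCtx.Err()
	})
	return created, torndown, sweepErr
}

func main() {
	created, torndown, err := runAbortDemo()
	fmt.Println(created, torndown, err) // prints "[0 1] [1 0] context canceled"
}
```

In this sketch the user whose create was in flight when the abort arrived is still recorded and later torn down, which is the orphan-avoidance property the commit message describes.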
b7d980a to c97a8d4
Summary
Adds the device-stress orchestrator skeleton at `tools/stress/device-orchestrator/`, for the GRE Tunnel Capacity Study. Stacked on top of #3774 (part 1, SDK user CRUD). Part 2 of #3746. Closes #3771.

- `cmd/device-orchestrator`: every flag from #3746's CLI list (`--target-user-count`, `--users-per-batch`, `--hold-seconds`, `--dut-pubkey`, `--dut-ssh-host`, `--dut-ssh-key`, `--rpc-url`, `--program-id`, `--keypair`, `--controller`, `--abort-file`, `--working-dir` plus `--client-ip-base`, `--tunnel-endpoint`, `--tenant-pubkey`, `--run-id`, `--log-level`, `--dry-run`). Dumps `orchestrator-config.json` on start.
- `pkg/reconcile`: `PlanFor(current, target, ownerFilter)` returns a deterministic `Plan{ToCreate, ToDelete}` delta. Lifted from the part-1 SDK PR per the discussion; it's orchestrator policy, not an SDK primitive.
- `pkg/sweep`: provision-then-reverse-deprovision loop driven by `PlanFor`; batches of `--users-per-batch` with `--hold-seconds` between batches; reverse-creation-order deprovision tracked by the sweep itself; emits `submit | confirm | activate | deprovision_*` runlog rows.
- `pkg/runlog`: append-only JSONL writer for `orchestrator-runlog.json` with the row schema `{run_id, user_index, user_pubkey, tunnel_id, event, t_ns, n_after_event}`.
- `pkg/abort`: ticker-based watcher of `--abort-file`; cancels a derived ctx so the sweep finishes the in-flight user before exiting non-zero, then still tears down what was created.
- `pkg/agent`: `Runner` interface (`Start(ctx) error; Events() <-chan Event`) with a no-op implementation. The SSH-backed runner and the `pre_commit_log` / `applied` row generation land in part 3.
- `pkg/exec`: `Live` impl of `sweep.Executor` wrapping `serviceability.{Client, Executor}`; picks deterministic per-user IPs (base + idx) and forwards `DevicePubkey` / `TenantPubkey` to `UserCreateArgs`.
- `Makefile` mirrors `tools/twamp/Makefile` (build, test, lint).

Testing Verification
- `pkg/sweep`: fake `Executor` + fake `Clock` + no-op `Agent` drive a 0→4 sweep in batches of 2. Asserts ordered `submit` / `confirm` / `activate` x4, reverse-order `deprovision_submit` / `deprovision_confirm` / `deprovision_activate` x4, `Hold` fires exactly once (between batches, not after reaching target), and `n_after_event` increments at `activate` / decrements at `deprovision_activate`.
- `pkg/sweep` abort case: failing the 3rd create still drives deprovision over the first two users so the orchestrator never leaks state on abort.
- `pkg/abort`: tempdir + touch the sentinel + assert the derived ctx cancels within 1s; empty-path watch is a no-op that still propagates parent cancellation.
- `pkg/runlog`: round-trip rows, auto-fill `t_ns`, reject writes after `Close`, `Open(path)` truncates existing content.
- `pkg/reconcile`: table-driven 0→N / N→0 / partial / foreign-only / mixed / negative / tie-break-by-pubkey.
- `make build` produces `bin/device-orchestrator`; `./bin/device-orchestrator --dry-run --target-user-count 4 --users-per-batch 2 --working-dir /tmp/orch` writes a valid `orchestrator-config.json` without contacting RPC.
- `make go-build go-lint go-test` all green.

Out of scope
- `Committing config session due to diffs detected: <diff>` and the commit-success line into `pre_commit_log` / `applied` events. Lands in part 3 of #3746.
- dz-localdevnet).
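
For reference, the "base + idx" per-user IP selection mentioned for `pkg/exec` in the Summary could look roughly like the sketch below. It assumes an IPv4 `--client-ip-base` and treats overflow as an error; `clientIP` and `clientIPString` are hypothetical names, not functions from the PR.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
	"net/netip"
)

// clientIP returns base + idx for an IPv4 base address. Restricting to IPv4
// and rejecting overflow are assumptions of this sketch.
func clientIP(base netip.Addr, idx uint32) (netip.Addr, error) {
	if !base.Is4() {
		return netip.Addr{}, fmt.Errorf("base %s is not an IPv4 address", base)
	}
	b := base.As4()
	n := binary.BigEndian.Uint32(b[:])
	if idx > math.MaxUint32-n {
		return netip.Addr{}, fmt.Errorf("base %s + %d overflows the IPv4 space", base, idx)
	}
	binary.BigEndian.PutUint32(b[:], n+idx)
	return netip.AddrFrom4(b), nil
}

// clientIPString parses the base address and formats the derived address.
func clientIPString(base string, idx uint32) (string, error) {
	addr, err := netip.ParseAddr(base)
	if err != nil {
		return "", err
	}
	ip, err := clientIP(addr, idx)
	if err != nil {
		return "", err
	}
	return ip.String(), nil
}

func main() {
	for _, idx := range []uint32{0, 1, 10} {
		ip, err := clientIPString("10.0.0.250", idx)
		if err != nil {
			panic(err)
		}
		fmt.Println(idx, ip) // the offset carries across octets, e.g. 10 gives 10.0.1.4
	}
}
```

Because the address depends only on the base and the user index, repeated runs with the same flags pick the same IP for the same user, which is what makes the selection deterministic.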