Add EdgeZero canary routing (rollout_pct flag + FNV-1a bucket assignment) (PR 19)#727
Add EdgeZero canary routing (rollout_pct flag + FNV-1a bucket assignment) (PR 19)#727prk-Jr wants to merge 17 commits into
Conversation
…outes_to_edgezero
When edgezero_enabled=true, reads edgezero_rollout_pct (0-100) from the trusted_server_config store. Routes each request to EdgeZero if fnv1a_bucket(client_ip) < rollout_pct, giving the ops team sticky per-user canary control without a deploy. Absent key defaults to 100 (full rollout, backward compatible with edgezero_enabled=true deployments that predate this PR). Invalid values and read errors default to 0 (all legacy, fail-safe).
Set to "0" so local dev and CI stay on the legacy path by default. Ops changes this in the production config store — no re-deploy required.
Documents the two config store keys, canary progression stages (1% → 10% → 50% → 100%), hold-point durations, pass/fail thresholds, and instant rollback procedure.
…ro-pr19-cutover-canary
ChristianPavilonis
left a comment
There was a problem hiding this comment.
Summary
One low-severity docs/ops note: the canary runbook references routing log lines that are emitted at debug level, so production verification may require a debug/log-level deployment or equivalent observability setup.
The absent-key branch logged at warn on every request when edgezero_enabled=true but edgezero_rollout_pct was unset, flooding logs at production QPS while traffic flowed fine. Demote to debug; the setup procedure and migration runbook remain the operational safety net. The absent-key default stays 100 (backward compat) so an existing edgezero_enabled=true deployment is not silently rolled back.
For rollout_pct 0 (full rollback) and 100 (full cutover) — which together cover most of the canary's lifetime — the routing decision is fixed, yet every request still allocated the client-IP routing key string and ran the FNV hash. Match on the rollout value and only compute the bucket for a partial rollout. Also rename canary_routes_to_edgezero to routes_to_edgezero: the predicate is just `bucket < rollout_pct` and applies at any rollout value, so "canary" read misleadingly at the call site.
The four documented branches (absent -> 100, valid -> parsed, invalid -> 0, read error -> 0) are safety-critical but only the parse step was indirectly covered. ConfigStoreHandle::new(Arc::new(...)) over the public ConfigStore trait is constructible in unit tests, so add an in-memory stub store and pin every branch, including the fail-safe defaults that matter during an incident.
The routing verification log lines are emitted at log::debug!, but the Fastly
logger is capped at Info in production, so verifying the canary via log tailing
needs a debug-level deployment or another observability signal (the runbook now
says so and points at the real-time stats split as the alternative).
Also reconcile the documented log format with the adapter: at the degenerate
rollout values (0 and 100) the bucket is no longer computed, so those lines read
`routing request through {legacy,EdgeZero} path (rollout_pct=N)` without a bucket.
ChristianPavilonis
left a comment
There was a problem hiding this comment.
Automated review:
Review Summary
Reviewed PR #727 against feature/edgezero-pr19-spin-adapter, focusing on the Fastly entry-point canary routing, config-store failure behavior, rollout docs, and CI/review context. I did not find any blocking correctness, security, data-loss, authorization, or severe compatibility issues. I left one P2 inline comment about strengthening coverage for the safety-critical entry-point dispatch matrix.
Findings
P0 / Blockers
None.
P1 / High
None.
P2 / Medium
1 inline comment submitted.
P3 / Low
None.
CI / Existing Reviews
All reported PR checks are passing. Existing inline feedback about absent-key logging, config-store branch tests, degenerate rollout hashing, naming, and debug-level log documentation appears to have been addressed in later commits. Local cargo test-fastly rollout -- --nocapture could not run in this environment because Viceroy failed to satisfy fastly_http_downstream::downstream_bot_analyzed; CI's Fastly job is green.
Pull the rollout-percentage routing matrix out of main() into a pure should_route_to_edgezero(rollout_pct, client_ip) helper so the safety-critical wiring from rollout state to the EdgeZero/legacy choice can be unit-tested directly, not just the parsing/hashing helpers it composes. main() keeps the edgezero_enabled gate and config reads as thin I/O glue. Add tests covering the full dispatch matrix: rollout_pct=0 (always legacy), rollout_pct=100 (always EdgeZero), and partial-rollout bucket hit/miss for both present and absent client IPs.
…er' into feature/edgezero-pr19-cutover-canary
The base branch refactored IpAddr out of main.rs and dropped the import. Merging base in lost the import that should_route_to_edgezero and its tests rely on, breaking the wasm32-wasip1 build. Re-add it.
ChristianPavilonis
left a comment
There was a problem hiding this comment.
Automated review:
Review Summary
Reviewed PR #727 against feature/edgezero-pr19-spin-adapter, focusing on the Fastly canary dispatch, rollout percentage parsing/defaults, config-store failure behavior, tests, local Fastly config, and the migration runbook. I did not find any blocking correctness, security, data-loss, authorization, or severe compatibility issues. I left one P2 inline comment about making the production canary observability path explicit before operators rely on the runbook.
Findings
P0 / Blockers
None.
P1 / High
None.
P2 / Medium
1 inline comment submitted.
P3 / Low
None.
CI / Existing Reviews
All reported PR checks are passing. Existing review feedback about absent-key log volume, degenerate rollout hashing, config-store branch coverage, naming, and debug-level log documentation appears addressed by later commits. Local cargo test-fastly rollout -- --nocapture compiled the changed Fastly test binary but could not execute under the installed Viceroy because it lacks fastly_http_downstream::downstream_bot_analyzed; CI's Fastly test job is green.
The runbook told operators to verify the canary via a Fastly real-time-stats traffic split and by log-searching the `routing request through EdgeZero path` lines, but neither signal is usable in production: the route decision is logged only at `debug!` (`should_route_to_edgezero`) and the Fastly logger is pinned to `Info` (`logging::init_logger`, no runtime override), so those lines never reach the production log endpoint, and no real-time-stats split or `x-edgezero-path` marker exists. State plainly that no production per-branch route signal exists yet, base canary verification on aggregate service metrics tracking the staged `rollout_pct` changes, and scope the debug route-decision log lines to local Viceroy validation. Remove the unsupported real-time-stats fallback and the unachievable "debug-level logging deployment" guidance.
ChristianPavilonis
left a comment
There was a problem hiding this comment.
Automated review:
Review Summary
Reviewed PR #727 against feature/edgezero-pr19-spin-adapter, focusing on the Fastly canary dispatch, rollout percentage parsing/defaults, config-store failure behavior, local Fastly config, tests, and the EdgeZero migration runbook. I did not find any blocking correctness, security, data-loss, authorization, or severe compatibility issues. I left one P2 inline comment about a runbook validation signal that is not actually emitted with the current logger configuration.
Findings
P0 / Blockers
None.
P1 / High
None.
P2 / Medium
1 inline comment submitted.
P3 / Low
None.
CI / Existing Reviews
All reported PR checks are passing. Existing review feedback about absent-key logging, degenerate rollout hashing, config-store branch coverage, dispatch-matrix tests, naming, and production route observability appears addressed in later commits. Local cargo test-fastly rollout -- --nocapture compiled the Fastly test binary but could not execute under the installed Viceroy 0.16.4 because it lacks fastly_http_downstream::downstream_bot_analyzed; CI's Fastly job is green.
| - **Auction win-rate:** downstream SSP reporting, compare same-day prior week | ||
| - **Timeout rate:** `504 / total_requests` | ||
|
|
||
| > For local pre-production validation under Viceroy (where `debug!` is visible), |
There was a problem hiding this comment.
Automated review: P2: The local Viceroy route-decision logs are still filtered out.
This says the debug! route-decision lines are visible under local Viceroy, but logging::init_logger() builds the Fastly logger with max_level(log::LevelFilter::Info). That level also applies to the echo_stdout(true) path, so the log::debug! calls in should_route_to_edgezero() will not be emitted locally unless the logger configuration changes.
Why it matters: the runbook's pre-production validation points operators at a branch-level signal they cannot actually observe, which can waste incident/cutover time and leave the canary validated only by aggregate metrics.
Suggested fix: either add a documented dev/test-only way to raise the Fastly logger to Debug for Viceroy validation, or remove/reword this local-log validation section to say these lines are currently code-only/test-only and not observable with the shipped logger configuration.
Summary
edgezero_rollout_pct(0–100) to thetrusted_server_configstore so EdgeZero traffic can be shifted incrementally without a deployclient_ipfor deterministic, sticky per-user bucket assignment (0–99); routing predicate isbucket < rollout_pctrollout_pct = 0is the immediate no-deploy rollbackChanges
crates/trusted-server-adapter-fastly/src/main.rsparse_rollout_pct,fnv1a_bucket,canary_routes_to_edgezero,read_rollout_pct; updatedmain()with canary dispatch; 11 unit tests (pinned FNV-1a vectors, distribution, boundary routing)fastly.tomledgezero_rollout_pct = "0"to local Viceroy config store with ops commentdocs/internal/EDGEZERO_MIGRATION.mdCloses
Closes #500
Test plan
cargo test-fastly— 65 + 956 + 21 + 3 tests greencargo test-axum— 31 tests greencargo clippy-fastly && cargo clippy-axum— no warningscargo fmt --all -- --check— cleancd crates/js/lib && npx vitest run— 291 tests greencd crates/js/lib && npm run format— cleancd docs && npm run format— cleancargo build --package trusted-server-adapter-fastly --release --target wasm32-wasip1fastly compute servewithedgezero_rollout_pct = "1"and"0"(rollback)Checklist
unwrap()in production code — useexpect("should ...")logmacros (notprintln!)