feat(node): graceful shutdown + Prometheus metrics endpoint#22
Open
Vasanthdev2004 wants to merge 3 commits into
Open
feat(node): graceful shutdown + Prometheus metrics endpoint#22Vasanthdev2004 wants to merge 3 commits into
Vasanthdev2004 wants to merge 3 commits into
Conversation
Replace the inline &str SQL array that re-runs on every node startup with a versioned migration system. Each migration is recorded in a new schema_migrations table and applied at most once per node, inside a single transaction. For nodes upgrading from a pre-migration-versioning build, the canonical `repos` table is used as a signal to mark v1 as already applied without re-running its ~140 statements — so existing operators see zero behavior change on first restart after this commit. Closes the gap called out in docs/OSS-READINESS-AUDIT.md:78. - Adds Db::migration_status() returning applied (version, name, applied_at) - Adds 6 unit tests validating the static migration catalogue (versions strictly increasing, names distinct, bodies non-empty, v1 name locked to initial_schema) - cargo fmt + clippy -D warnings + cargo test --workspace all clean
Address two robustness gaps in the v1 migration runner on the live network: - Remove the "repos table present => v1 complete" backfill short-circuit. It assumed every existing node already had the full current schema; a node that was behind would be marked v1-applied while still missing newer tables or columns. Every v1 statement is idempotent (CREATE TABLE/INDEX IF NOT EXISTS, ADD COLUMN IF NOT EXISTS), so legacy installs now simply run v1 once: existing objects are no-ops, missing ones get created. - Guard the whole runner with a Postgres advisory lock so two instances against the same database (blue/green or rolling deploy) can't race to apply the same migration and trip the schema_migrations primary key. The lock is released explicitly, and automatically if the connection drops. Also document that per-migration transactions preclude CREATE INDEX CONCURRENTLY in future migrations. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ab19465 to
a50c0b2
Compare
Two operational-safety changes that the live network needs:
1) Graceful shutdown. A process-wide tokio::sync::watch::Sender<bool>
in AppState is flipped on SIGINT (all platforms) or SIGTERM (Unix).
Every long-lived loop now selects on the shutdown arm:
* axum server (with_graceful_shutdown — drains in-flight)
* libp2p swarm (drops Swarm, closes connections)
* gossip task (exits between peer announces)
* sync worker (exits between batches)
* operator heartbeat loop
* rate-limit cleanup loop
* peer-count metrics poller
* the deferred DID-record publish task
New GITLAWB_SHUTDOWN_GRACE_SECS (default 30s) controls how long
axum waits for in-flight requests before returning.
2) Prometheus /metrics endpoint. New metrics module exposes:
* gitlawb_info{version, did} (constant gauge = 1)
* gitlawb_pushes_total{repo} (counter)
* gitlawb_fetches_total{repo} (counter)
* gitlawb_auth_successes_total{route} (counter, helper)
* gitlawb_auth_failures_total{route,reason} (counter, helper)
* gitlawb_sync_queue_processed_total{status} (counter)
* gitlawb_webhook_deliveries_total{result} (counter)
* gitlawb_pack_size_bytes (histogram)
* gitlawb_peers_connected (gauge)
Opt-in via GITLAWB_METRICS_ADDR (e.g. 127.0.0.1:9091). The endpoint
is on a separate listener so the public port stays unexposed;
bind to localhost or a private interface.
3) Metric instrumentation hooks in api/repos.rs (push/fetch),
webhooks.rs (delivery outcomes), and sync.rs (per-item status).
The auth helpers are defined but not yet wired — that's a
focused follow-up.
Tests:
* 3 new metrics::tests::* — encode returns valid prometheus text,
record_* helpers are no-ops before init(), encode() errors if
init() was never called.
* cargo fmt + clippy -D warnings + cargo test --workspace all clean.
Closes the metrics and graceful-shutdown gaps in
docs/OSS-READINESS-AUDIT.md:131 and docs/MAINTAINER-ROADMAP.md:35.
a50c0b2 to
d950e11
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Two operational-safety changes that the live network needs:
tokio::sync::watch::Sender<bool>inAppStateis flipped on SIGINT (all platforms) or SIGTERM (Unix). Every long-lived loop now selects on the shutdown arm so the node drains cleanly instead of being killed mid-write./metricsendpoint — opt-in viaGITLAWB_METRICS_ADDR, bound on a separate listener so the public port stays unexposed. Counters, histograms, and a gauge covering pushes, fetches, sync, webhooks, pack size, and connected peers.Why
Both gaps are explicitly called out in the maintainers' own docs:
docs/OSS-READINESS-AUDIT.md:131— "Add basic metrics for pushes, fetches, pack sizes, peer sync queue, failed auth, and webhook failures"docs/MAINTAINER-ROADMAP.md:35— same metrics items, plus "graceful shutdown + clearer startup logging"Today, every
Ctrl-C/killon a running node drops the libp2p swarm, the sync worker, the operator heartbeat, and the gossip task mid-tick. In-flight pushes can leave advisory locks half-released, webhooks are silently lost, and agossip"node left" event fires to every peer. On restart, the next node picks upsync_queuerows that were being processed when it died.The current
/api/v1/statsendpoint only exposes three counts (repos,agents,pushes). Operators have no visibility into pack sizes, auth failure rates, webhook delivery success, sync queue depth, or per-peer connectivity.Behavior change
For operators: none, unless they set
GITLAWB_METRICS_ADDRor rely on the newCtrl-Cbehavior.On shutdown (any platform):
Ctrl-C(orkill <pid>on Unix) flips the signal.GITLAWB_SHUTDOWN_GRACE_SECS(default 30s), then exits.Swarm, closing all QUIC connections cleanly.clean exitand returns 0.New config knobs:
GITLAWB_METRICS_ADDR(default""= disabled) — bind address for/metrics.GITLAWB_SHUTDOWN_GRACE_SECS(default30) — axum drain budget.Operator action required: none. Both knobs are off / at default.
What gets measured
gitlawb_info{version, did}metrics::initgitlawb_pushes_total{repo}git-receive-packgitlawb_fetches_total{repo}git-upload-packgitlawb_sync_queue_processed_total{status}sync_queueitem (done / failed)gitlawb_webhook_deliveries_total{result}gitlawb_pack_size_bytesgitlawb_peers_connectedgitlawb_auth_successes_total{route}gitlawb_auth_failures_total{route, reason}Safety properties
metrics::init()is safe to call more than once. Once the registry is built, subsequent calls are no-ops. TheOnceLockguard makes the registry truly one-per-process.record_*andset_*helpers treat an uninitialized registry as a silent no-op, so unit tests that don't go throughmain()don't need to set anything up.tokio::select!arm that calls into network I/O is cancel-safe —axum::serve,reqwest, and libp2p'sSwarmall handle future cancellation.tokio::time::timeout, so one hung peer can't block the loop or stall the shutdown./metricslistener also drains on shutdown, thenmain()aborts itsJoinHandle.Testing
cargo fmt --all -- --check— cleancargo clippy --workspace --all-targets -- -D warnings— cleancargo test --workspace— all tests pass, including 3 newmetrics::tests::*:encode_after_init_returns_prometheus_text— verifies the exposition body contains# HELP gitlawb_info,# TYPE gitlawb_pushes_total counter, and the incremented counter labelrecord_helpers_are_noops_before_init— confirms no panicencode_before_init_returns_error— graceful error, not a panicgitlawb-node --versionandgitlawb-node --helpboth include the new flags; binary builds in release profile.Risk
Low. The shutdown change is additive (every existing loop still works the same when no signal arrives). The metrics endpoint is opt-in (off by default) and on a separate listener.
The only behavioral change an existing operator will notice: pressing
Ctrl-Con a running node now drains gracefully instead of dropping everything. This is strictly an improvement.Follow-ups (not in this PR)
record_auth_success/record_auth_failureintoauth::require_signature(the helpers exist; touching the 10+ error-return sites inauth/mod.rsis best done in a focused PR).TraceLayerspans are already there — derive metrics from them).gitlawb_infoandgitlawb_peers_connectedingl status.🤖 Generated with Claude Code