feat(node): graceful shutdown + Prometheus metrics endpoint #22

Open
Vasanthdev2004 wants to merge 3 commits into main from feat/operational-safety-net

@Vasanthdev2004
Collaborator

What

Two operational-safety changes that the live network needs:

  1. Graceful shutdown — a process-wide tokio::sync::watch::Sender<bool> in AppState is flipped on SIGINT (all platforms) or SIGTERM (Unix). Every long-lived loop now selects on the shutdown arm so the node drains cleanly instead of being killed mid-write.
  2. Prometheus /metrics endpoint — opt-in via GITLAWB_METRICS_ADDR, bound on a separate listener so the public port stays unexposed. Counters, histograms, and a gauge covering pushes, fetches, sync, webhooks, pack size, and connected peers.
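
The shutdown-aware loop in item 1 can be illustrated without any async runtime. The sketch below is a std-only analogue, not the PR's code: the PR uses `tokio::sync::watch` and `tokio::select!`, while this version stands in an `mpsc` channel and `recv_timeout` for the "tick or shutdown" choice. The names `run_until_shutdown` and `demo_worker_shutdown` are hypothetical.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Runs `work` once per `tick` until a shutdown message arrives or the
/// sender is dropped. Returns how many work units completed.
/// Std-only stand-in for a `tokio::select!` over a timer and a watch receiver.
pub fn run_until_shutdown(
    shutdown: mpsc::Receiver<()>,
    tick: Duration,
    mut work: impl FnMut(),
) -> u64 {
    let mut completed = 0;
    loop {
        match shutdown.recv_timeout(tick) {
            // No signal within one tick: do one unit of work, then wait again.
            Err(RecvTimeoutError::Timeout) => {
                work();
                completed += 1;
            }
            // Signal received, or the sender is gone: exit between work units.
            Ok(()) | Err(RecvTimeoutError::Disconnected) => return completed,
        }
    }
}

/// Hypothetical demo: spawn the worker, let it tick, then signal shutdown.
pub fn demo_worker_shutdown() -> u64 {
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || run_until_shutdown(rx, Duration::from_millis(5), || {}));
    thread::sleep(Duration::from_millis(50));
    tx.send(()).expect("worker should still be listening");
    worker.join().expect("worker thread panicked")
}
```

The point of the shape is that the loop only checks for shutdown between work units, so a unit that has started always finishes before the worker exits.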

Why

Both gaps are explicitly called out in the maintainers' own docs:

  • docs/OSS-READINESS-AUDIT.md:131: "Add basic metrics for pushes, fetches, pack sizes, peer sync queue, failed auth, and webhook failures"
  • docs/MAINTAINER-ROADMAP.md:35 — same metrics items, plus "graceful shutdown + clearer startup logging"

Today, every Ctrl-C / kill on a running node drops the libp2p swarm, the sync worker, the operator heartbeat, and the gossip task mid-tick. In-flight pushes can leave advisory locks half-released, webhooks are silently lost, and a gossip "node left" event fires to every peer. On restart, the next node picks up sync_queue rows that were being processed when it died.

The current /api/v1/stats endpoint only exposes three counts (repos, agents, pushes). Operators have no visibility into pack sizes, auth failure rates, webhook delivery success, sync queue depth, or per-peer connectivity.

Behavior change

For operators: none, unless they set GITLAWB_METRICS_ADDR or rely on the new Ctrl-C behavior.

On shutdown (any platform):

  • Ctrl-C (or kill <pid> on Unix) flips the signal.
  • axum stops accepting new connections, drains in-flight requests up to GITLAWB_SHUTDOWN_GRACE_SECS (default 30s), then exits.
  • The libp2p task drops the Swarm, which closes all QUIC connections cleanly.
  • Gossip, sync, operator heartbeat, rate-limit cleanup, and peer-count poller exit between their work units.
  • The process logs a clean exit and returns exit code 0.
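
The "drain up to a grace budget, then exit" step can be sketched in isolation. This is only an illustration of the budget idea under assumed names (`drain_with_grace`, an `in_flight` counter); the commit message says the node itself relies on axum's `with_graceful_shutdown`, which this does not reproduce.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// Waits until `in_flight` reaches zero or `grace` elapses.
/// Returns true if everything drained in time, false if the budget ran out.
pub fn drain_with_grace(in_flight: &AtomicUsize, grace: Duration) -> bool {
    let deadline = Instant::now() + grace;
    loop {
        if in_flight.load(Ordering::Acquire) == 0 {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        // Short poll interval; a real server would be notified instead of polling.
        thread::sleep(Duration::from_millis(1));
    }
}
```

A `false` return corresponds to the case where the grace period expires with requests still in flight and the process exits anyway.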

New config knobs:

  • GITLAWB_METRICS_ADDR (default "" = disabled) — bind address for /metrics.
  • GITLAWB_SHUTDOWN_GRACE_SECS (default 30) — axum drain budget.

Operator action required: none. Both knobs are off / at default.
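
For reference, the two knobs could be interpreted roughly as follows. This is a sketch under assumptions: `OpsConfig` and `parse_ops_config` are hypothetical names, and rejecting a malformed value with an error is a guess, since the PR description does not say how the node handles invalid input.

```rust
use std::net::SocketAddr;
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub struct OpsConfig {
    /// `None` means the /metrics listener stays disabled.
    pub metrics_addr: Option<SocketAddr>,
    /// Drain budget applied on shutdown.
    pub shutdown_grace: Duration,
}

/// Builds the config from raw environment values.
/// Unset or empty values fall back to the documented defaults.
pub fn parse_ops_config(
    metrics_addr: Option<&str>,
    grace_secs: Option<&str>,
) -> Result<OpsConfig, String> {
    let metrics_addr = match metrics_addr.map(str::trim) {
        None | Some("") => None,
        Some(raw) => Some(
            raw.parse::<SocketAddr>()
                .map_err(|e| format!("GITLAWB_METRICS_ADDR: {e}"))?,
        ),
    };
    let secs = match grace_secs.map(str::trim) {
        None | Some("") => 30,
        Some(raw) => raw
            .parse::<u64>()
            .map_err(|e| format!("GITLAWB_SHUTDOWN_GRACE_SECS: {e}"))?,
    };
    Ok(OpsConfig {
        metrics_addr,
        shutdown_grace: Duration::from_secs(secs),
    })
}
```

In real use the arguments would come from something like `std::env::var("GITLAWB_METRICS_ADDR").ok().as_deref()`; taking plain `Option<&str>` keeps the function testable without touching the process environment.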

What gets measured

| Metric | Type | Where |
| --- | --- | --- |
| gitlawb_info{version, did} | gauge (constant 1) | metrics::init |
| gitlawb_pushes_total{repo} | counter | after successful git-receive-pack |
| gitlawb_fetches_total{repo} | counter | after successful git-upload-pack |
| gitlawb_sync_queue_processed_total{status} | counter | per sync_queue item (done / failed) |
| gitlawb_webhook_deliveries_total{result} | counter | per webhook attempt (ok / http_error / network_error) |
| gitlawb_pack_size_bytes | histogram | on every push + fetch (buckets 1 KB → 2 GB) |
| gitlawb_peers_connected | gauge | polled every 15s from the libp2p swarm |
| gitlawb_auth_successes_total{route} | counter | (helpers exist; wired in follow-up) |
| gitlawb_auth_failures_total{route, reason} | counter | (helpers exist; wired in follow-up) |
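
To make the histogram row concrete, here is a small sketch of how a pack-size observation lands in cumulative, Prometheus-style buckets. The bucket bounds are an assumption: the description only says "1 KB → 2 GB", so powers of two from 1 KiB to 2 GiB are used here for illustration, and `Histogram` and `pack_size_buckets` are hypothetical names rather than the PR's types.

```rust
/// Assumed bucket bounds: powers of two from 1 KiB up to 2 GiB inclusive.
pub fn pack_size_buckets() -> Vec<u64> {
    (10..=31).map(|exp| 1u64 << exp).collect()
}

/// Cumulative histogram: `counts[i]` is the number of observations less than
/// or equal to `bounds[i]`; `total` plays the role of the `+Inf` bucket.
pub struct Histogram {
    bounds: Vec<u64>,
    counts: Vec<u64>,
    total: u64,
    sum: u64,
}

impl Histogram {
    pub fn new(bounds: Vec<u64>) -> Self {
        let counts = vec![0; bounds.len()];
        Self { bounds, counts, total: 0, sum: 0 }
    }

    /// Records one observation in every bucket whose bound is >= `value`.
    pub fn observe(&mut self, value: u64) {
        for (bound, count) in self.bounds.iter().zip(self.counts.iter_mut()) {
            if value <= *bound {
                *count += 1;
            }
        }
        self.total += 1;
        self.sum += value;
    }

    /// Cumulative count for an exact bucket bound, if that bound exists.
    pub fn count_le(&self, bound: u64) -> Option<u64> {
        self.bounds.iter().position(|b| *b == bound).map(|i| self.counts[i])
    }

    pub fn total(&self) -> u64 {
        self.total
    }

    pub fn sum(&self) -> u64 {
        self.sum
    }
}
```

A pack larger than the top bound is counted only in the total, which is why the upper end of the bucket range matters for nodes that serve very large repositories.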

Safety properties

  • Idempotent init: metrics::init() is safe to call more than once. Once the registry is built, subsequent calls are no-ops. The OnceLock guard makes the registry truly one-per-process.
  • No-op helpers before init: record_* and set_* helpers treat an uninitialized registry as a silent no-op, so unit tests that don't go through main() don't need to set anything up.
  • Cancel-safe shutdown: every tokio::select! arm that calls into network I/O is cancel-safe — axum::serve, reqwest, and libp2p's Swarm all handle future cancellation.
  • Bounded bootstrap announce: the gossip task now wraps each peer announce in a 5s tokio::time::timeout, so one hung peer can't block the loop or stall the shutdown.
  • Bounded metrics endpoint: the /metrics listener also drains on shutdown, then main() aborts its JoinHandle.
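
The first two properties (idempotent init, no-op helpers before init) follow naturally from `std::sync::OnceLock`. The sketch below shows the pattern with a deliberately simplified registry: a single unlabeled push counter and a hand-written text body. `Metrics`, `Registry`, and the method names are assumptions for illustration, not the PR's `metrics` module.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::OnceLock;

struct Registry {
    version: String,
    pushes_total: AtomicU64,
}

pub struct Metrics {
    registry: OnceLock<Registry>,
}

impl Metrics {
    /// `const` so the value can live in a `static`, one per process.
    pub const fn new() -> Self {
        Self { registry: OnceLock::new() }
    }

    /// The first call builds the registry; later calls are no-ops.
    pub fn init(&self, version: &str) {
        self.registry.get_or_init(|| Registry {
            version: version.to_owned(),
            pushes_total: AtomicU64::new(0),
        });
    }

    /// Silent no-op while the registry is uninitialized.
    pub fn record_push(&self) {
        if let Some(reg) = self.registry.get() {
            reg.pushes_total.fetch_add(1, Ordering::Relaxed);
        }
    }

    /// Text exposition; returns an error (not a panic) before init.
    pub fn encode(&self) -> Result<String, &'static str> {
        let reg = self.registry.get().ok_or("metrics not initialized")?;
        Ok(format!(
            "# TYPE gitlawb_info gauge\ngitlawb_info{{version=\"{}\"}} 1\n\
             # TYPE gitlawb_pushes_total counter\ngitlawb_pushes_total {}\n",
            reg.version,
            reg.pushes_total.load(Ordering::Relaxed),
        ))
    }
}
```

Holding the `OnceLock` inside a struct rather than a bare global lets tests build a fresh instance, while production code can still declare `static METRICS: Metrics = Metrics::new();`.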

Testing

  • cargo fmt --all -- --check — clean
  • cargo clippy --workspace --all-targets -- -D warnings — clean
  • cargo test --workspace — all tests pass, including 3 new metrics::tests::*:
    • encode_after_init_returns_prometheus_text — verifies the exposition body contains # HELP gitlawb_info, # TYPE gitlawb_pushes_total counter, and the incremented counter label
    • record_helpers_are_noops_before_init — confirms no panic
    • encode_before_init_returns_error — graceful error, not a panic
  • Manual smoke: gitlawb-node --version and gitlawb-node --help both include the new flags; binary builds in release profile.

Risk

Low. The shutdown change is additive (every existing loop still works the same when no signal arrives). The metrics endpoint is opt-in (off by default) and on a separate listener.

The only behavioral change an existing operator will notice: pressing Ctrl-C on a running node now drains gracefully instead of dropping everything. This is strictly an improvement.

Follow-ups (not in this PR)

  • Wire record_auth_success / record_auth_failure into auth::require_signature (the helpers exist; touching the 10+ error-return sites in auth/mod.rs is best done in a focused PR).
  • Per-route latency histograms (the TraceLayer spans are already there — derive metrics from them).
  • Per-peer p2p counters (connected peers by direction, gossipsub mesh size over time).
  • IPFS / Pinata counters.
  • Surface gitlawb_info and gitlawb_peers_connected in gl status.

🤖 Generated with Claude Code

Vasanthdev2004 and others added 2 commits June 1, 2026 10:39
Replace the inline &str SQL array that re-runs on every node startup with a
versioned migration system. Each migration is recorded in a new
schema_migrations table and applied at most once per node, inside a single
transaction.

For nodes upgrading from a pre-migration-versioning build, the canonical
`repos` table is used as a signal to mark v1 as already applied without
re-running its ~140 statements — so existing operators see zero behavior
change on first restart after this commit.

Closes the gap called out in docs/OSS-READINESS-AUDIT.md:78.

- Adds Db::migration_status() returning applied (version, name, applied_at)
- Adds 6 unit tests validating the static migration catalogue
  (versions strictly increasing, names distinct, bodies non-empty, v1
  name locked to initial_schema)
- cargo fmt + clippy -D warnings + cargo test --workspace all clean

Address two robustness gaps in the v1 migration runner on the live network:

- Remove the "repos table present => v1 complete" backfill short-circuit.
  It assumed every existing node already had the full current schema; a node
  that was behind would be marked v1-applied while still missing newer tables
  or columns. Every v1 statement is idempotent (CREATE TABLE/INDEX IF NOT
  EXISTS, ADD COLUMN IF NOT EXISTS), so legacy installs now simply run v1 once:
  existing objects are no-ops, missing ones get created.

- Guard the whole runner with a Postgres advisory lock so two instances
  against the same database (blue/green or rolling deploy) can't race to apply
  the same migration and trip the schema_migrations primary key. The lock is
  released explicitly, and automatically if the connection drops.

Also document that per-migration transactions preclude CREATE INDEX
CONCURRENTLY in future migrations.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Vasanthdev2004 Vasanthdev2004 force-pushed the feat/operational-safety-net branch from ab19465 to a50c0b2 Compare June 1, 2026 05:20
Two operational-safety changes that the live network needs:

1) Graceful shutdown. A process-wide tokio::sync::watch::Sender<bool>
   in AppState is flipped on SIGINT (all platforms) or SIGTERM (Unix).
   Every long-lived loop now selects on the shutdown arm:
     * axum server (with_graceful_shutdown — drains in-flight)
     * libp2p swarm (drops Swarm, closes connections)
     * gossip task (exits between peer announces)
     * sync worker (exits between batches)
     * operator heartbeat loop
     * rate-limit cleanup loop
     * peer-count metrics poller
     * the deferred DID-record publish task

   New GITLAWB_SHUTDOWN_GRACE_SECS (default 30s) controls how long
   axum waits for in-flight requests before returning.

2) Prometheus /metrics endpoint. New metrics module exposes:
     * gitlawb_info{version, did}                 (constant gauge = 1)
     * gitlawb_pushes_total{repo}                 (counter)
     * gitlawb_fetches_total{repo}                (counter)
     * gitlawb_auth_successes_total{route}        (counter, helper)
     * gitlawb_auth_failures_total{route,reason}  (counter, helper)
     * gitlawb_sync_queue_processed_total{status} (counter)
     * gitlawb_webhook_deliveries_total{result}   (counter)
     * gitlawb_pack_size_bytes                    (histogram)
     * gitlawb_peers_connected                    (gauge)

   Opt-in via GITLAWB_METRICS_ADDR (e.g. 127.0.0.1:9091). The endpoint
   is on a separate listener so the public port stays unexposed;
   bind to localhost or a private interface.

3) Metric instrumentation hooks in api/repos.rs (push/fetch),
   webhooks.rs (delivery outcomes), and sync.rs (per-item status).
   The auth helpers are defined but not yet wired — that's a
   focused follow-up.

Tests:
  * 3 new metrics::tests::* — encode returns valid prometheus text,
    record_* helpers are no-ops before init(), encode() errors if
    init() was never called.
  * cargo fmt + clippy -D warnings + cargo test --workspace all clean.

Closes the metrics and graceful-shutdown gaps in
docs/OSS-READINESS-AUDIT.md:131 and docs/MAINTAINER-ROADMAP.md:35.
@Vasanthdev2004 Vasanthdev2004 force-pushed the feat/operational-safety-net branch from a50c0b2 to d950e11 Compare June 1, 2026 05:21