Skip to content

feat(server): observability improvements and startup/retry fixes #159

Open
anderslindho wants to merge 9 commits into
masterfrom
improve-logging-observability
Open

feat(server): observability improvements and startup/retry fixes #159
anderslindho wants to merge 9 commits into
masterfrom
improve-logging-observability

Conversation

@anderslindho
Copy link
Copy Markdown
Contributor

@anderslindho anderslindho commented May 8, 2026

Addresses several operational pain points found in production at ESS.

Fixes

  • push_always_retry defaulted to True (a regression), which could freeze all commit processing indefinitely when CF was unreachable. Default is now False.
  • cleanOnStart/cleanOnStop swept channels synchronously on the reactor thread, blocking IOC commits for the full duration of the sweep. Both now run in a background thread.
  • Warning messages for IOC connection problems were ambiguous and could fire twice for the same event. A numeric iocName (ephemeral source port) is now flagged explicitly.

Observability

  • Periodic status lines (configurable via statusInterval, default 60 s) log active/queued connections against maxActive and the CF processor's IOC and channel counts - the key indicators that were invisible during the production incident.
  • CF push duration is logged per attempt, making CF latency regressions visible in the log stream.
  • Optional Prometheus metrics endpoint (metricsPort, requires pip install recceiver[metrics]). Exposes connection state, CF processor state, and a push duration histogram. Degrades gracefully to no-ops if prometheus-client is not installed.

Comment thread server/recceiver/cfstore.py Outdated
Comment thread server/recceiver/cfstore.py Outdated
Comment thread server/demo.conf
Comment thread server/recceiver/cfstore.py Outdated
Comment thread server/pyproject.toml
Comment thread server/recceiver/metrics.py
Comment thread server/recceiver/cfstore.py Outdated
ioc_name = str(port)
_log.debug("IOC at %s:%d did not send IOCNAME; using port as iocName", host, port)
_log.debug("IOC at %s:%d has no iocName; using source port as iocName", host, port)
if ioc_name.isdigit() and 1024 <= int(ioc_name) <= 65535:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition seems odd to me. Does this ever occur outside of the conditional right above it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it can. The intent is to catch any iocName that looks like an ephemeral source port, whichever way it got there. But I guess you mean to drop the range comp?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes and no. On the one hand ports will always be number in that range unless something super weird is going on, but I guess my main question was mor about if it is ever the case that:

  • ioc_name is not None
  • ioc_name is a stringified integer

This would clearly be a client error, but does that happen?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file has since been deleted and the code moved, so this diff is outdated. To your question: yes, it can fire even when IOC name came from the IOC/reccaster - if someone misconfigured the reccaster to send a numeric string as IOCNAME. It's pathological but possible, so the check is intentionally placed outside the if not block.

So it prob does not add a huge amount of value, and I can simplify if you want.

Comment thread server/recceiver/application.py Outdated
Comment thread server/recceiver/cfstore.py Outdated
@anderslindho anderslindho force-pushed the improve-logging-observability branch 2 times, most recently from 4e677c1 to 9d4d3df Compare May 11, 2026 13:01
@anderslindho anderslindho marked this pull request as draft May 11, 2026 13:41
@anderslindho anderslindho marked this pull request as draft May 11, 2026 13:41
@anderslindho anderslindho force-pushed the improve-logging-observability branch from 9d4d3df to a55e811 Compare May 11, 2026 13:53
@anderslindho anderslindho marked this pull request as ready for review May 11, 2026 13:57
@anderslindho anderslindho marked this pull request as draft May 12, 2026 04:52
@anderslindho anderslindho force-pushed the improve-logging-observability branch from a55e811 to 5af2769 Compare May 12, 2026 05:49
Comment thread server/tests/unit/test_recast.py Fixed
…ssion

Defaulting to True changed the failure mode from "exhaust retries and
release the global lock" to "hold the lock indefinitely", which freezes
all commit processing when CF is unreachable. False restores sensible
behaviour: the commit drops and the IOC retries on its next reconnect.
…thread

clean_service() blocks the reactor for the full duration of the sweep,
holding the global lock and preventing all IOC commits. cleanOnStart
now schedules the sweep as a background thread after startup, so commits
are accepted immediately. cleanOnStop uses deferToThread so the reactor
stays live during shutdown while the lock is still held.
A numeric iocName that matches the source port range means the iocid
changes on every reconnect, silently accumulating stale channels in CF.
Log a warning when this is detected so misconfigured reccasters can be
found.

Distinguishes disconnect-before-upload from update-without-initial to
avoid the same warning firing twice for the same event.
There is currently no runtime visibility into whether maxActive is
throttling connections or how many channels the CF processor is
tracking. A LoopingCall in RecService logs active/queued connections
against the limit every 60 s; CFProcessor logs known_iocs and
tracked_channels on the same interval. Both are configurable via
statusInterval (0 disables).
Without timing, slow CF commits are invisible until they cause a
backlog. Per-attempt duration is now logged so latency regressions show
up in the log stream.

push_to_cf also checks processor.running on each iteration so a service
stop during a retry loop drains immediately rather than waiting up to
60 s per attempt.
Exposes connection state, CF processor state, and CF push
performance as Prometheus metrics on a configurable HTTP port
(metricsPort, disabled by default). Requires the optional
prometheus-client dependency (pip install recceiver[metrics]).

Gracefully degrades to no-ops if prometheus-client is not installed
so the dependency is truly optional.
…ushes

When an IOC upload exceeds trlimit and flush() splits it mid-stream, the
replacement transaction was created with empty client_infos. Subsequent
batches then fell back to the source port as iocName, writing the wrong
iocName property to CF for any channels not yet seen in that session.
Copy client_infos forward so all transactions in a session share the
same IOC-level environment variables.
The autosquash left both the old check (after update_ioc_infos) and
the new one (before it). The post-update check fires for every normal
disconnect because update_ioc_infos removes the IOC from self.iocs,
preventing channels from ever being marked inactive.
@anderslindho anderslindho force-pushed the improve-logging-observability branch from be0ecb9 to 5f6ec50 Compare May 12, 2026 09:28
@sonarqubecloud
Copy link
Copy Markdown

@anderslindho anderslindho marked this pull request as ready for review May 12, 2026 09:57
Comment thread server/demo.conf

# TCP port to expose Prometheus metrics on (0 or absent to disable).
# Requires prometheus_client: pip install recceiver[metrics]
#metricsPort = 0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing this config in recceiver_full.conf.

return

transaction, self.transaction = self.transaction, Transaction(self.ep, id(self))
self.transaction.client_infos = dict(transaction.client_infos)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this really work with an IOC?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants