feat(server): observability improvements and startup/retry fixes #159
feat(server): observability improvements and startup/retry fixes #159anderslindho wants to merge 9 commits into
Conversation
| ioc_name = str(port) | ||
| _log.debug("IOC at %s:%d did not send IOCNAME; using port as iocName", host, port) | ||
| _log.debug("IOC at %s:%d has no iocName; using source port as iocName", host, port) | ||
| if ioc_name.isdigit() and 1024 <= int(ioc_name) <= 65535: |
There was a problem hiding this comment.
This condition seems odd to me. Does this ever occur outside of the conditional right above it?
There was a problem hiding this comment.
Yes, it can. The intent is to catch any iocName that looks like an ephemeral source port, whichever way it got there. But I guess you mean to drop the range comp?
There was a problem hiding this comment.
Yes and no. On the one hand ports will always be number in that range unless something super weird is going on, but I guess my main question was mor about if it is ever the case that:
- ioc_name is not None
- ioc_name is a stringified integer
This would clearly be a client error, but does that happen?
There was a problem hiding this comment.
The file has since been deleted and the code moved, so this diff is outdated. To your question: yes, it can fire even when IOC name came from the IOC/reccaster - if someone misconfigured the reccaster to send a numeric string as IOCNAME. It's pathological but possible, so the check is intentionally placed outside the if not block.
So it prob does not add a huge amount of value, and I can simplify if you want.
4e677c1 to
9d4d3df
Compare
9d4d3df to
a55e811
Compare
a55e811 to
5af2769
Compare
…ssion Defaulting to True changed the failure mode from "exhaust retries and release the global lock" to "hold the lock indefinitely", which freezes all commit processing when CF is unreachable. False restores sensible behaviour: the commit drops and the IOC retries on its next reconnect.
…thread clean_service() blocks the reactor for the full duration of the sweep, holding the global lock and preventing all IOC commits. cleanOnStart now schedules the sweep as a background thread after startup, so commits are accepted immediately. cleanOnStop uses deferToThread so the reactor stays live during shutdown while the lock is still held.
A numeric iocName that matches the source port range means the iocid changes on every reconnect, silently accumulating stale channels in CF. Log a warning when this is detected so misconfigured reccasters can be found. Distinguishes disconnect-before-upload from update-without-initial to avoid the same warning firing twice for the same event.
There is currently no runtime visibility into whether maxActive is throttling connections or how many channels the CF processor is tracking. A LoopingCall in RecService logs active/queued connections against the limit every 60 s; CFProcessor logs known_iocs and tracked_channels on the same interval. Both are configurable via statusInterval (0 disables).
Without timing, slow CF commits are invisible until they cause a backlog. Per-attempt duration is now logged so latency regressions show up in the log stream. push_to_cf also checks processor.running on each iteration so a service stop during a retry loop drains immediately rather than waiting up to 60 s per attempt.
Exposes connection state, CF processor state, and CF push performance as Prometheus metrics on a configurable HTTP port (metricsPort, disabled by default). Requires the optional prometheus-client dependency (pip install recceiver[metrics]). Gracefully degrades to no-ops if prometheus-client is not installed so the dependency is truly optional.
…ushes When an IOC upload exceeds trlimit and flush() splits it mid-stream, the replacement transaction was created with empty client_infos. Subsequent batches then fell back to the source port as iocName, writing the wrong iocName property to CF for any channels not yet seen in that session. Copy client_infos forward so all transactions in a session share the same IOC-level environment variables.
The autosquash left both the old check (after update_ioc_infos) and the new one (before it). The post-update check fires for every normal disconnect because update_ioc_infos removes the IOC from self.iocs, preventing channels from ever being marked inactive.
be0ecb9 to
5f6ec50
Compare
|
|
|
||
| # TCP port to expose Prometheus metrics on (0 or absent to disable). | ||
| # Requires prometheus_client: pip install recceiver[metrics] | ||
| #metricsPort = 0 |
There was a problem hiding this comment.
Missing this config in recceiver_full.conf.
| return | ||
|
|
||
| transaction, self.transaction = self.transaction, Transaction(self.ep, id(self)) | ||
| self.transaction.client_infos = dict(transaction.client_infos) |
There was a problem hiding this comment.
Does this really work with an IOC?



Addresses several operational pain points found in production at ESS.
Fixes
push_always_retrydefaulted toTrue(a regression), which could freeze all commit processing indefinitely when CF was unreachable. Default is nowFalse.cleanOnStart/cleanOnStopswept channels synchronously on the reactor thread, blocking IOC commits for the full duration of the sweep. Both now run in a background thread.Observability
maxActiveand the CF processor's IOC and channel counts - the key indicators that were invisible during the production incident.pip install recceiver[metrics]). Exposes connection state, CF processor state, and a push duration histogram. Degrades gracefully to no-ops if prometheus-client is not installed.