Skip to content

fix(datafeed): retry ClientPayloadError and reset _running so the loo…#391

Open
Alex-Nalin wants to merge 1 commit into
finos:mainfrom
Alex-Nalin:fix/datafeed-payload-error-and-running-reset
Open

fix(datafeed): retry ClientPayloadError and reset _running so the loo…#391
Alex-Nalin wants to merge 1 commit into
finos:mainfrom
Alex-Nalin:fix/datafeed-payload-error-and-running-reset

Conversation

@Alex-Nalin

Copy link
Copy Markdown

…p can restart

A truncated datafeed read raises aiohttp.ClientPayloadError, which was not classified as a transient error, so read_datafeed_retry re-raised it and the datafeed loop crashed. The loop then could not be restarted because AbstractDatafeedLoop._run_loop left self._running = True on an abnormal exit (only stop() reset it), so DatafeedLoopV2.start() raised "The datafeed service V2 is already started" on every restart. The combined effect turned a transient network blip into a permanent, silent event-loss outage.

Changes:

  • strategy.is_client_timeout_error now treats ClientPayloadError as transient (also benefits datahose and auth-refresh paths).
  • AbstractDatafeedLoop._run_loop resets self._running = False in a finally, so the loop is restartable after any exit (stop, cancellation, or escaped exception). The concurrent-start guard in V1/V2 start() is unchanged.
  • Add regression tests for both behaviours.

Description

Closes #[ISSUE NUMBER]

Please put here the intent of your pull request.

Dependencies

List the other pull requests that should be merged before/along this one.

Checklist

  • Referenced an issue in the PR title or description
  • Filled properly the description and dependencies, if any
  • Unit tests updated or added
  • Docstrings added or updated
  • Updated the documentation in docsrc folder

…p can restart

A truncated datafeed read raises aiohttp.ClientPayloadError, which was not
classified as a transient error, so read_datafeed_retry re-raised it and the
datafeed loop crashed. The loop then could not be restarted because
AbstractDatafeedLoop._run_loop left self._running = True on an abnormal exit
(only stop() reset it), so DatafeedLoopV2.start() raised
"The datafeed service V2 is already started" on every restart. The combined
effect turned a transient network blip into a permanent, silent event-loss
outage.

Changes:
- strategy.is_client_timeout_error now treats ClientPayloadError as transient
  (also benefits datahose and auth-refresh paths).
- AbstractDatafeedLoop._run_loop resets self._running = False in a finally, so
  the loop is restartable after any exit (stop, cancellation, or escaped
  exception). The concurrent-start guard in V1/V2 start() is unchanged.
- Add regression tests for both behaviours.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@linux-foundation-easycla

Copy link
Copy Markdown

CLA Missing ID

  • ✅ login: Alex-Nalin / name: Alex Nalin (584a74c)
  • ❌ The email address for the commit (584a74c) is not linked to the GitHub account, preventing the EasyCLA check. Consult this Help Article and GitHub Help to resolve. (To view the commit's email address, add .patch at the end of this PR page's URL.) For further assistance with EasyCLA, please visit our EasyCLA portal and chat with our support bot.

One or more co-authors of this pull request were not found. You must specify co-authors in commit message trailer via:

Co-authored-by: name <email>

Supported Co-authored-by: formats include:

  1. Anything <id+login@users.noreply.github.com> - it will locate your GitHub user by id part.
  2. Anything <login@users.noreply.github.com> - it will locate your GitHub user by login part.
  3. Anything <public-email> - it will locate your GitHub user by public-email part. Note that this email must be made public on Github.
  4. Anything <other-email> - it will locate your GitHub user by other-email part but only if that email was used before for any other CLA as a main commit author.
  5. login <any-valid-email> - it will locate your GitHub user by login part, note that login part must be at least 3 characters long.

Alternatively, if the co-author should not be included, remove the Co-authored-by: line from the commit message.

Please update your commit message(s) by doing git commit --amend and then git push [--force] and then request re-running CLA check via commenting on this pull request:

/easycla

@Alex-Nalin

Copy link
Copy Markdown
Author

Problem
A truncated datafeed read raises aiohttp.ClientPayloadError. This is not classified as a transient error, so read_datafeed_retry re-raises it and the datafeed loop crashes. The loop then cannot be restarted: AbstractDatafeedLoop._run_loop sets self._running = True and only stop() resets it, so after an abnormal exit _running stays True and DatafeedLoopV2.start() raises RuntimeError: The datafeed service V2 is already started on every restart attempt.

Net effect: a transient network blip becomes a permanent, silent event-loss outage. Observed in production as restart #1 (ClientPayloadError) followed by 500+ × "already started".

Changes
core/retry/strategy.py — is_client_timeout_error now treats ClientPayloadError as transient (also benefits the datahose and auth-refresh paths that share this predicate).
core/service/datafeed/abstract_datafeed_loop.py — _run_loop resets self._running = False in a finally, so the loop is restartable after any exit (stop, cancellation, or an exception escaping retry). The concurrent-start guard in DatafeedLoopV1/V2.start() is unchanged.
Tests — strategy_test.py (ClientPayloadError is classified transient) and abstract_datafeed_loop_test.py (_running reset after a crash allows restart).
Verification
pytest tests/core/retry/strategy_test.py tests/core/service/datafeed/abstract_datafeed_loop_test.py → 43 passed (incl. 2 new).

Compatibility
Both changes are small and backward-compatible. No public API changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant