Skip to content

waterdata: get_stats_data preserve geometry across continuation pages#255

Merged
thodson-usgs merged 8 commits intoDOI-USGS:mainfrom
thodson-usgs:fix/stats-pagination
May 5, 2026
Merged

waterdata: get_stats_data preserve geometry across continuation pages#255
thodson-usgs merged 8 commits intoDOI-USGS:mainfrom
thodson-usgs:fix/stats-pagination

Conversation

@thodson-usgs
Copy link
Copy Markdown
Collaborator

@thodson-usgs thodson-usgs commented May 3, 2026

Summary

get_stats_data dropped geometry on continuation pages. Page 1 honored GEOPANDAS but pages 2..N hard-coded geopd=False. With geopandas installed, a multi-page stats response started as a GeoDataFrame and pages 2..N came back as plain DataFrames; pd.concat kept the GeoDataFrame type from page 1 but every row from page 2 onward had a null geometry — and the page-2 coordinates landed in a separate coordinates column.

Use geopd=GEOPANDAS on every page so all rows retain their geometry.

Live-API verification

With state_code=US:42, parameter_code=00060, page_size=1 against api.waterdata.usgs.gov/statistics/v0/observationNormals:

Page 1 (1890 rows, USGS-01513550) Page 2 (1890 rows, USGS-01514850)
Buggy (main) geometry preserved 0/1890 rows have geometry
Fixed (PR) geometry preserved 1890/1890 rows have geometry

Test plan

  • No new test — the fix is a one-line geopd=False -> geopd=GEOPANDAS change that's self-evident from the diff and live-API-verified above. The existing tests in tests/waterdata_utils_test.py still pass.

Related PRs

Other open PRs in this bug-review series that touch dataretrieval/waterdata/utils.py (different functions, no functional conflicts):

Dropped from this PR

Two earlier candidate fixes were dropped after live-API testing:

  • body.get("next") instead of body["next"] — the live API consistently includes "next" (with value null on terminal pages), so the original body["next"] doesn't actually KeyError.
  • Explicit non-200 raise inside continuation loops (and _error_body JSON-decode fallback) — real failure modes already raise inside the existing try/except at _handle_stats_nesting or body["next"], so the silent-append window was theoretical.

thodson-usgs and others added 2 commits May 3, 2026 15:23
…-200

Three correctness fixes to the two pagination loops in
dataretrieval.waterdata.utils.

* `get_stats_data` honored `GEOPANDAS` for the first page but
  hard-coded `geopd=False` on every continuation page. With geopandas
  installed, a multi-page stats response started as a `GeoDataFrame`
  and pages 2..N came back as plain `DataFrame`s; `pd.concat` then
  silently downgraded the result and the caller lost geometry / CRS.
  Use `geopd=GEOPANDAS` on every page.

* `get_stats_data` indexed `body["next"]` directly, raising `KeyError`
  on responses without that key (some terminal responses simply omit
  it). Switch to `body.get("next")`, which produces `None` and exits
  the loop cleanly.

* Both `get_stats_data`'s in-loop request and `_walk_pages`'s in-loop
  request returned the response without checking `status_code`. A 4xx
  or 5xx page whose body happened to JSON-decode could be appended as
  data, then pagination quietly stopped — the caller got a partial
  result with no warning. Add an explicit `if status_code != 200`
  raise inside each loop so the existing log-and-truncate handler
  fires deliberately rather than incidentally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both pagination loops now had four call sites repeating
``if resp.status_code != 200: raise RuntimeError(_error_body(resp))``.
Move that into a one-line helper alongside ``_error_body`` and call
it from every site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR tightens pagination behavior in dataretrieval.waterdata.utils, specifically for WaterData stats and OGC fetches, so paginated responses preserve expected structure and stop more intentionally on continuation-page failures.

Changes:

  • Refactors non-200 response checks into _raise_if_not_ok() and applies it to initial and continuation-page requests.
  • Updates get_stats_data() to tolerate missing "next" keys and to preserve GEOPANDAS handling across all pages.
  • Adds regression tests around missing "next" handling and non-200 continuation pages.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
dataretrieval/waterdata/utils.py Adjusts pagination/error handling in _walk_pages() and get_stats_data(), including stats continuation parsing.
tests/waterdata_utils_test.py Adds regression tests for pagination edge cases in _walk_pages() and get_stats_data().

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread dataretrieval/waterdata/utils.py Outdated
Comment on lines +342 to +345
def _raise_if_not_ok(resp: requests.Response) -> None:
"""Raise ``RuntimeError(_error_body(resp))`` for any non-200 response."""
if resp.status_code != 200:
raise RuntimeError(_error_body(resp))
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already fixed (in this PR's main commit dd5e00e): _error_body now wraps resp.json() in a try and falls back to f"{status}: {reason}. {snippet}" on JSONDecodeError/ValueError. The pagination except blocks call _error_body(resp) once, get a string back, and the request is logged-and-truncated cleanly — no second crash path.

body = resp.json()
all_dfs.append(_handle_stats_nesting(body, geopd=False))
next_token = body["next"]
all_dfs.append(_handle_stats_nesting(body, geopd=GEOPANDAS))
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already covered: test_get_stats_data_preserves_geometry_across_pages (added in this PR) explicitly mocks two pages with point geometries, runs them through get_stats_data with GEOPANDAS=True, and asserts the result is a gpd.GeoDataFrame with df.geometry.notna().all() across all 2 sites. Also verified against the live API: with state=US:42, parameter=00060, page_size=1, the buggy code leaves 0/1890 page-2 rows with geometry while the fix preserves 1890/1890.

Comment thread tests/waterdata_utils_test.py Outdated
Comment on lines +87 to +93
def test_walk_pages_raises_on_non_200_in_loop():
"""`_walk_pages` must surface a non-200 mid-loop, not silently truncate.

Regression: previously any non-200 page was appended (with whatever
body it had) and pagination quietly stopped because `_get_resp_data`
or `_next_req_url` raised inside the bare except. The user got a
partial result with no warning.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already addressed: the test was renamed to test_walk_pages_truncates_on_non_200_continuation and its docstring now describes the truncate-not-extend contract (page 1 is still returned; page 2 is dropped). The earlier 'raises' wording was an artifact of an earlier iteration.

thodson-usgs and others added 2 commits May 4, 2026 10:12
Per copilot review on PR DOI-USGS#255:

- _error_body: catch JSONDecodeError when a 4xx/5xx returns plain text
  or HTML. Previously, _raise_if_not_ok -> _error_body -> resp.json()
  raised JSONDecodeError on non-JSON bodies, defeating the in-loop
  status check.
- Tests:
  - Rename test_walk_pages_raises_on_non_200_in_loop to
    test_walk_pages_truncates_on_non_200_continuation; the assertion
    verifies log-and-truncate, not raise.
  - New test_get_stats_data_preserves_geometry_across_pages exercises
    the GEOPANDAS=True continuation path so a regression to geopd=False
    on page 2..N is caught.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	tests/waterdata_utils_test.py
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread dataretrieval/waterdata/utils.py Outdated
Comment on lines +335 to +339
try:
j_txt = resp.json()
except (json.JSONDecodeError, ValueError):
# Non-JSON 4xx/5xx bodies (HTML error pages, plain text) — fall back
# to the raw text so callers still get a useful message instead of
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1bfea67: added test_error_body_falls_back_to_text_for_non_json_responses that wires resp.json to raise JSONDecodeError and asserts the fallback message starts with "502: Bad Gateway" and includes the raw text snippet.

Comment thread tests/waterdata_utils_test.py Outdated
args={}, service="observationNormals", expand_percentiles=False, client=client
)
assert isinstance(df, pd.DataFrame)
assert len(df) >= 1
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1bfea67: the test now pins to len(df) == 1 (the single feature flattens to exactly one stats row) and asserts client.request.assert_not_called(), so a regression that kept fetching after a missing next-token would fail.

Comment thread tests/waterdata_utils_test.py Outdated
Comment on lines +205 to +209
df, _ = get_stats_data(
args={}, service="observationNormals", expand_percentiles=False, client=client
)
# Page 1 still surfaces; page 2 was caught by the in-loop status check.
assert isinstance(df, pd.DataFrame)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1bfea67: pinned to len(df) == 1 and client.request.call_count == 1, so a regression that silently appended page-2 data or kept looping after the 503 would fail.

Comment thread dataretrieval/waterdata/utils.py Outdated
Comment on lines +335 to +342
try:
j_txt = resp.json()
except (json.JSONDecodeError, ValueError):
# Non-JSON 4xx/5xx bodies (HTML error pages, plain text) — fall back
# to the raw text so callers still get a useful message instead of
# the helper crashing with JSONDecodeError.
snippet = (resp.text or "").strip()[:500]
return f"{status}: {resp.reason or 'Error'}. {snippet}"
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1bfea67: _error_body's docstring now describes the actual contract — for non-429/403 statuses it parses the JSON body and formats code / description, falling back to "<status>: <reason>. <first 500 chars>" only when the body isn't JSON.

- Update `_error_body` docstring to describe the new JSON / text-snippet
  fallback behavior (was: "returns the raw response text" — out of date
  since the helper now formats `code` + `description` from the JSON body
  and only falls back to a text snippet when JSON parsing fails).
- Add `test_error_body_falls_back_to_text_for_non_json_responses` to
  exercise the JSONDecodeError fallback branch (the existing pagination
  tests use unconstrained `resp.json` mocks, so the fallback was never
  reached).
- Tighten `test_get_stats_data_handles_missing_next_key`: pin to
  `len(df) == 1` and assert `client.request` was never called, so a
  regression that kept fetching after a missing next-token would be
  caught.
- Tighten `test_get_stats_data_truncates_on_non_200_continuation`: pin
  to exactly 1 row and assert `client.request.call_count == 1`, so a
  regression that silently appended page-2 data or duplicated page-1
  would still fail.

Live-API verification of the dropped-geometry bug: with
`state=US:42, parameter=00060, page_size=1`, page-2 contributes 1890
rows; the buggy main concat leaves 0/1890 of those rows with geometry
(all NaN) while the fixed concat preserves 1890/1890.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thodson-usgs thodson-usgs changed the title waterdata: tighten stats + OGC pagination (geometry, KeyError, non-200) waterdata: get_stats_data preserve geometry across continuation pages May 5, 2026
Live-API testing (state=US:42, parameter=00060, page_size=1 against
observationNormals) confirmed only the geometry-on-continuation bug:
the buggy path leaves 0/1890 page-2 rows with geometry; the fix
preserves 1890/1890.

The other two candidate fixes were defensive-only and could not be
reproduced against the current live API, so drop them and the
associated tests:
- `body.get("next")` -- the live endpoint always includes the `next`
  key (value `null` on terminal pages), so the original `body["next"]`
  never KeyErrors.
- Explicit non-200 raise inside the continuation loops (plus the
  `_error_body` JSON-decode fallback and the `_raise_if_not_ok` helper
  extraction) -- real failure modes already raise inside the existing
  `try/except` at `_handle_stats_nesting` or `body["next"]`. The
  silent-append window required a non-200 response whose body both
  JSON-decodes AND has a `features` key, which the API doesn't produce.

Net diff is now one line of production code (`geopd=False` ->
`geopd=GEOPANDAS`) plus one regression test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thodson-usgs
Copy link
Copy Markdown
Collaborator Author

Scope narrowed in 22c60de. After live-API testing on observationNormals and observationIntervals:

  • The geometry-on-continuation bug reproduces: state=US:42, parameter=00060, page_size=1 ends up with 0/1890 page-2 rows carrying geometry on main vs 1890/1890 with this PR. Kept — single-line fix + one regression test.
  • The "missing next key" defense and the "non-200 continuation silently appended" defense could not be reproduced against the live API: the endpoints always include next (with null on terminal pages), and real failure modes already raise inside the existing try/except at _handle_stats_nesting or body["next"]. Dropped to keep this PR focused on a real, reproducible bug.

The earlier Copilot inline threads were on lines that no longer exist in this PR's diff; they should auto-collapse on re-review. Title and description updated accordingly.

@thodson-usgs thodson-usgs marked this pull request as ready for review May 5, 2026 14:52
The fix is a one-line `geopd=False` -> `geopd=GEOPANDAS` change that's
self-evident from the diff and verified against the live API in the
PR description. Drop `test_get_stats_data_preserves_geometry_across_pages`
along with its helpers (`_stats_feature`, `_stats_body`,
`_stats_feature_with_geometry`) and the now-unused `GEOPANDAS`,
`get_stats_data`, `pytest`, and `gpd` imports.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thodson-usgs thodson-usgs merged commit c2feaa0 into DOI-USGS:main May 5, 2026
7 of 8 checks passed
@thodson-usgs thodson-usgs deleted the fix/stats-pagination branch May 5, 2026 14:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants