waterdata: get_stats_data preserve geometry across continuation pages#255
Conversation
…-200
Three correctness fixes to the two pagination loops in
dataretrieval.waterdata.utils.
* `get_stats_data` honored `GEOPANDAS` for the first page but
hard-coded `geopd=False` on every continuation page. With geopandas
installed, a multi-page stats response started as a `GeoDataFrame`
and pages 2..N came back as plain `DataFrame`s; `pd.concat` then
silently downgraded the result and the caller lost geometry / CRS.
Use `geopd=GEOPANDAS` on every page.
* `get_stats_data` indexed `body["next"]` directly, raising `KeyError`
on responses without that key (some terminal responses simply omit
it). Switch to `body.get("next")`, which produces `None` and exits
the loop cleanly.
* Both `get_stats_data`'s in-loop request and `_walk_pages`'s in-loop
request returned the response without checking `status_code`. A 4xx
or 5xx page whose body happened to JSON-decode could be appended as
data, then pagination quietly stopped — the caller got a partial
result with no warning. Add an explicit `if status_code != 200`
raise inside each loop so the existing log-and-truncate handler
fires deliberately rather than incidentally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both pagination loops now had four call sites repeating ``if resp.status_code != 200: raise RuntimeError(_error_body(resp))``. Move that into a one-line helper alongside ``_error_body`` and call it from every site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR tightens pagination behavior in dataretrieval.waterdata.utils, specifically for WaterData stats and OGC fetches, so paginated responses preserve expected structure and stop more intentionally on continuation-page failures.
Changes:
- Refactors non-200 response checks into
_raise_if_not_ok()and applies it to initial and continuation-page requests. - Updates
get_stats_data()to tolerate missing"next"keys and to preserveGEOPANDAShandling across all pages. - Adds regression tests around missing
"next"handling and non-200 continuation pages.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
dataretrieval/waterdata/utils.py |
Adjusts pagination/error handling in _walk_pages() and get_stats_data(), including stats continuation parsing. |
tests/waterdata_utils_test.py |
Adds regression tests for pagination edge cases in _walk_pages() and get_stats_data(). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def _raise_if_not_ok(resp: requests.Response) -> None: | ||
| """Raise ``RuntimeError(_error_body(resp))`` for any non-200 response.""" | ||
| if resp.status_code != 200: | ||
| raise RuntimeError(_error_body(resp)) |
There was a problem hiding this comment.
Already fixed (in this PR's main commit dd5e00e): _error_body now wraps resp.json() in a try and falls back to f"{status}: {reason}. {snippet}" on JSONDecodeError/ValueError. The pagination except blocks call _error_body(resp) once, get a string back, and the request is logged-and-truncated cleanly — no second crash path.
| body = resp.json() | ||
| all_dfs.append(_handle_stats_nesting(body, geopd=False)) | ||
| next_token = body["next"] | ||
| all_dfs.append(_handle_stats_nesting(body, geopd=GEOPANDAS)) |
There was a problem hiding this comment.
Already covered: test_get_stats_data_preserves_geometry_across_pages (added in this PR) explicitly mocks two pages with point geometries, runs them through get_stats_data with GEOPANDAS=True, and asserts the result is a gpd.GeoDataFrame with df.geometry.notna().all() across all 2 sites. Also verified against the live API: with state=US:42, parameter=00060, page_size=1, the buggy code leaves 0/1890 page-2 rows with geometry while the fix preserves 1890/1890.
| def test_walk_pages_raises_on_non_200_in_loop(): | ||
| """`_walk_pages` must surface a non-200 mid-loop, not silently truncate. | ||
|
|
||
| Regression: previously any non-200 page was appended (with whatever | ||
| body it had) and pagination quietly stopped because `_get_resp_data` | ||
| or `_next_req_url` raised inside the bare except. The user got a | ||
| partial result with no warning. |
There was a problem hiding this comment.
Already addressed: the test was renamed to test_walk_pages_truncates_on_non_200_continuation and its docstring now describes the truncate-not-extend contract (page 1 is still returned; page 2 is dropped). The earlier 'raises' wording was an artifact of an earlier iteration.
Per copilot review on PR DOI-USGS#255: - _error_body: catch JSONDecodeError when a 4xx/5xx returns plain text or HTML. Previously, _raise_if_not_ok -> _error_body -> resp.json() raised JSONDecodeError on non-JSON bodies, defeating the in-loop status check. - Tests: - Rename test_walk_pages_raises_on_non_200_in_loop to test_walk_pages_truncates_on_non_200_continuation; the assertion verifies log-and-truncate, not raise. - New test_get_stats_data_preserves_geometry_across_pages exercises the GEOPANDAS=True continuation path so a regression to geopd=False on page 2..N is caught. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts: # tests/waterdata_utils_test.py
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| try: | ||
| j_txt = resp.json() | ||
| except (json.JSONDecodeError, ValueError): | ||
| # Non-JSON 4xx/5xx bodies (HTML error pages, plain text) — fall back | ||
| # to the raw text so callers still get a useful message instead of |
There was a problem hiding this comment.
Fixed in 1bfea67: added test_error_body_falls_back_to_text_for_non_json_responses that wires resp.json to raise JSONDecodeError and asserts the fallback message starts with "502: Bad Gateway" and includes the raw text snippet.
| args={}, service="observationNormals", expand_percentiles=False, client=client | ||
| ) | ||
| assert isinstance(df, pd.DataFrame) | ||
| assert len(df) >= 1 |
There was a problem hiding this comment.
Fixed in 1bfea67: the test now pins to len(df) == 1 (the single feature flattens to exactly one stats row) and asserts client.request.assert_not_called(), so a regression that kept fetching after a missing next-token would fail.
| df, _ = get_stats_data( | ||
| args={}, service="observationNormals", expand_percentiles=False, client=client | ||
| ) | ||
| # Page 1 still surfaces; page 2 was caught by the in-loop status check. | ||
| assert isinstance(df, pd.DataFrame) |
There was a problem hiding this comment.
Fixed in 1bfea67: pinned to len(df) == 1 and client.request.call_count == 1, so a regression that silently appended page-2 data or kept looping after the 503 would fail.
| try: | ||
| j_txt = resp.json() | ||
| except (json.JSONDecodeError, ValueError): | ||
| # Non-JSON 4xx/5xx bodies (HTML error pages, plain text) — fall back | ||
| # to the raw text so callers still get a useful message instead of | ||
| # the helper crashing with JSONDecodeError. | ||
| snippet = (resp.text or "").strip()[:500] | ||
| return f"{status}: {resp.reason or 'Error'}. {snippet}" |
There was a problem hiding this comment.
Fixed in 1bfea67: _error_body's docstring now describes the actual contract — for non-429/403 statuses it parses the JSON body and formats code / description, falling back to "<status>: <reason>. <first 500 chars>" only when the body isn't JSON.
- Update `_error_body` docstring to describe the new JSON / text-snippet fallback behavior (was: "returns the raw response text" — out of date since the helper now formats `code` + `description` from the JSON body and only falls back to a text snippet when JSON parsing fails). - Add `test_error_body_falls_back_to_text_for_non_json_responses` to exercise the JSONDecodeError fallback branch (the existing pagination tests use unconstrained `resp.json` mocks, so the fallback was never reached). - Tighten `test_get_stats_data_handles_missing_next_key`: pin to `len(df) == 1` and assert `client.request` was never called, so a regression that kept fetching after a missing next-token would be caught. - Tighten `test_get_stats_data_truncates_on_non_200_continuation`: pin to exactly 1 row and assert `client.request.call_count == 1`, so a regression that silently appended page-2 data or duplicated page-1 would still fail. Live-API verification of the dropped-geometry bug: with `state=US:42, parameter=00060, page_size=1`, page-2 contributes 1890 rows; the buggy main concat leaves 0/1890 of those rows with geometry (all NaN) while the fixed concat preserves 1890/1890. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Live-API testing (state=US:42, parameter=00060, page_size=1 against
observationNormals) confirmed only the geometry-on-continuation bug:
the buggy path leaves 0/1890 page-2 rows with geometry; the fix
preserves 1890/1890.
The other two candidate fixes were defensive-only and could not be
reproduced against the current live API, so drop them and the
associated tests:
- `body.get("next")` -- the live endpoint always includes the `next`
key (value `null` on terminal pages), so the original `body["next"]`
never KeyErrors.
- Explicit non-200 raise inside the continuation loops (plus the
`_error_body` JSON-decode fallback and the `_raise_if_not_ok` helper
extraction) -- real failure modes already raise inside the existing
`try/except` at `_handle_stats_nesting` or `body["next"]`. The
silent-append window required a non-200 response whose body both
JSON-decodes AND has a `features` key, which the API doesn't produce.
Net diff is now one line of production code (`geopd=False` ->
`geopd=GEOPANDAS`) plus one regression test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Scope narrowed in
The earlier Copilot inline threads were on lines that no longer exist in this PR's diff; they should auto-collapse on re-review. Title and description updated accordingly. |
# Conflicts: # tests/waterdata_utils_test.py
The fix is a one-line `geopd=False` -> `geopd=GEOPANDAS` change that's self-evident from the diff and verified against the live API in the PR description. Drop `test_get_stats_data_preserves_geometry_across_pages` along with its helpers (`_stats_feature`, `_stats_body`, `_stats_feature_with_geometry`) and the now-unused `GEOPANDAS`, `get_stats_data`, `pytest`, and `gpd` imports. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
get_stats_datadropped geometry on continuation pages. Page 1 honoredGEOPANDASbut pages 2..N hard-codedgeopd=False. With geopandas installed, a multi-page stats response started as aGeoDataFrameand pages 2..N came back as plainDataFrames;pd.concatkept theGeoDataFrametype from page 1 but every row from page 2 onward had a null geometry — and the page-2 coordinates landed in a separatecoordinatescolumn.Use
geopd=GEOPANDASon every page so all rows retain their geometry.Live-API verification
With
state_code=US:42, parameter_code=00060, page_size=1againstapi.waterdata.usgs.gov/statistics/v0/observationNormals:Test plan
geopd=False->geopd=GEOPANDASchange that's self-evident from the diff and live-API-verified above. The existing tests intests/waterdata_utils_test.pystill pass.Related PRs
Other open PRs in this bug-review series that touch
dataretrieval/waterdata/utils.py(different functions, no functional conflicts):_arrange_colsstop mutating caller list._handle_stats_nestingtolerate missing drop columns.Dropped from this PR
Two earlier candidate fixes were dropped after live-API testing:
body.get("next")instead ofbody["next"]— the live API consistently includes"next"(with valuenullon terminal pages), so the originalbody["next"]doesn't actuallyKeyError._error_bodyJSON-decode fallback) — real failure modes already raise inside the existingtry/exceptat_handle_stats_nestingorbody["next"], so the silent-append window was theoretical.