Add waterdata.get_ratings for USGS stage-discharge ratings via STAC#269
Merged
thodson-usgs merged 8 commits intoDOI-USGS:mainfrom May 6, 2026
Merged
Add waterdata.get_ratings for USGS stage-discharge ratings via STAC#269thodson-usgs merged 8 commits intoDOI-USGS:mainfrom
thodson-usgs merged 8 commits intoDOI-USGS:mainfrom
Conversation
Collaborator
Author
|
Tracked in the v1.1.4 release-staging issue: #270. |
thodson-usgs
added a commit
to thodson-usgs/dataretrieval-python
that referenced
this pull request
May 6, 2026
Two non-functional follow-ups suggested during review of DOI-USGS#269: (1) Document the cross-module reach into nwis._read_rdb. Rating files use the same USGS RDB shape as NWIS responses, so the parser is already reusable as-is — no refactor of the legacy nwis module is needed. Added a comment at the import site explaining why the private import is intentional and what to watch for if _read_rdb ever moves. (2) Surface the RDB #-prefixed header block. Each parsed rating frame now carries provenance in df.attrs: - df.attrs["comment"]: the list of "#"-prefixed header lines (rating id, parameter, expansion type, last-shifted timestamp, warnings, etc.). - df.attrs["url"]: the asset URL it was fetched from. R's read_waterdata_ratings exposes the comment block via comment(df); pandas's standard `attrs` dict is the Python equivalent. Done in ratings.py only — does not touch nwis. A live spot-check against api.waterdata.usgs.gov on USGS-01104475 exsa shows the 31-line USGS header survives intact (gauge name, parameter code, rating expansion, etc.). One new unit test pins the behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thodson-usgs
added a commit
to thodson-usgs/dataretrieval-python
that referenced
this pull request
May 6, 2026
Six fixes from a multi-agent review pass on PR DOI-USGS#269: 1. Narrow the per-feature except handler from a bare Exception to (RequestException, ValueError, OSError). The previous catch-all would swallow programming bugs (KeyError on a malformed feature, AttributeError, ...) and silently drop rows. 2. Escape single quotes in CQL string literals via the standard doubling rule. Most monitoring-location IDs cannot contain a quote, but the function takes arbitrary strings — defending against malformed filters and potential injection regardless. New _quote_cql_str helper, applied to monitoring_location_id and file_type. New unit test pins the behaviour. 3. Promote file_type to a Literal["exsa", "base", "corr"] and derive _VALID_FILE_TYPES from it via typing.get_args, so the runtime guard and the type hint can never drift apart. 4. Rename the `datetime` parameter to `time` to match the convention used by every sibling waterdata getter (and to stop shadowing the stdlib `datetime` module). The parameter is still passed through as the STAC `datetime` query string under the hood; that's now documented explicitly. 5. Switch the multi-type local filter from a substring check on the asset URL to feature["properties"]["file_type"]. Substring matching on URLs would false-match if a host or path ever contained one of the literal type names; STAC features carry the typed property already. 6. Skip the on-disk RDB write when file_path is None. tempfile.mkdtemp leaks (no automatic cleanup), and df.attrs["url"] already records the source — so by default we now return only the parsed frame. Users who want a local copy can pass file_path=...; the contract for that path is unchanged. While here, dropped a "discrete analogue to the OGC waterdata getters" sentence from the module docstring (it was internal-architecture WHAT-narrating, not user-relevant guidance). All 11 ratings tests pass (one new test for the CQL quote-escaping). Live verification on USGS-01104475 confirms file_path=None still returns the parsed frame with df.attrs populated, multi-type via property filter returns the expected 4 tables, and the renamed `time=` parameter still drives the STAC datetime filter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wraps the new Water Data STAC catalog endpoint (api.waterdata.usgs.gov/stac/v0/search) for stage-discharge rating curves. Lives in its own module (waterdata/ratings.py) because the transport layer — STAC search + per-feature RDB download — differs from the OGC collections used by the rest of the package. The function: - Composes a CQL filter from monitoring_location_id and (single-value) file_type, mirroring R's logic. Multi-type requests fetch all matches and filter URLs client-side. - Optionally downloads each matching .rdb asset to a user-supplied file_path (default: a fresh tempfile.mkdtemp), and parses with the existing nwis._read_rdb helper. - Returns a dict[id -> DataFrame] when download_and_parse=True, or the raw list of STAC features when False (cheap "what's available?" inspection). Surfaces a clear ValueError for invalid file_type values and for ISO 8601 durations in `datetime` (the rating-curve service rejects them). Mirrors R's read_waterdata_ratings in DOI-USGS/dataRetrieval; example sites and idioms come straight from the R doc. Tests: 9 unit tests covering filter composition, the two error paths, and end-to-end search-and-download via requests_mock (single-site, download_and_parse=False, and multi-type URL filtering). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two non-functional follow-ups suggested during review of DOI-USGS#269: (1) Document the cross-module reach into nwis._read_rdb. Rating files use the same USGS RDB shape as NWIS responses, so the parser is already reusable as-is — no refactor of the legacy nwis module is needed. Added a comment at the import site explaining why the private import is intentional and what to watch for if _read_rdb ever moves. (2) Surface the RDB #-prefixed header block. Each parsed rating frame now carries provenance in df.attrs: - df.attrs["comment"]: the list of "#"-prefixed header lines (rating id, parameter, expansion type, last-shifted timestamp, warnings, etc.). - df.attrs["url"]: the asset URL it was fetched from. R's read_waterdata_ratings exposes the comment block via comment(df); pandas's standard `attrs` dict is the Python equivalent. Done in ratings.py only — does not touch nwis. A live spot-check against api.waterdata.usgs.gov on USGS-01104475 exsa shows the 31-line USGS header survives intact (gauge name, parameter code, rating expansion, etc.). One new unit test pins the behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Six fixes from a multi-agent review pass on PR DOI-USGS#269: 1. Narrow the per-feature except handler from a bare Exception to (RequestException, ValueError, OSError). The previous catch-all would swallow programming bugs (KeyError on a malformed feature, AttributeError, ...) and silently drop rows. 2. Escape single quotes in CQL string literals via the standard doubling rule. Most monitoring-location IDs cannot contain a quote, but the function takes arbitrary strings — defending against malformed filters and potential injection regardless. New _quote_cql_str helper, applied to monitoring_location_id and file_type. New unit test pins the behaviour. 3. Promote file_type to a Literal["exsa", "base", "corr"] and derive _VALID_FILE_TYPES from it via typing.get_args, so the runtime guard and the type hint can never drift apart. 4. Rename the `datetime` parameter to `time` to match the convention used by every sibling waterdata getter (and to stop shadowing the stdlib `datetime` module). The parameter is still passed through as the STAC `datetime` query string under the hood; that's now documented explicitly. 5. Switch the multi-type local filter from a substring check on the asset URL to feature["properties"]["file_type"]. Substring matching on URLs would false-match if a host or path ever contained one of the literal type names; STAC features carry the typed property already. 6. Skip the on-disk RDB write when file_path is None. tempfile.mkdtemp leaks (no automatic cleanup), and df.attrs["url"] already records the source — so by default we now return only the parsed frame. Users who want a local copy can pass file_path=...; the contract for that path is unchanged. While here, dropped a "discrete analogue to the OGC waterdata getters" sentence from the module docstring (it was internal-architecture WHAT-narrating, not user-relevant guidance). All 11 ratings tests pass (one new test for the CQL quote-escaping). Live verification on USGS-01104475 confirms file_path=None still returns the parsed frame with df.attrs populated, multi-type via property filter returns the expected 4 tables, and the renamed `time=` parameter still drives the STAC datetime filter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Eight small changes; net -16 lines. No behavior change other than the duration check now using the project's anchored regex instead of an ad-hoc substring search (small correctness improvement). 1. Reorder so the public `get_ratings` is at the top and helpers follow — matches the convention in `nearest.py`. 2. Reuse `_DURATION_RE` from `waterdata.utils` instead of the ad-hoc `"P" in str(v).upper()` check. The regex is anchored at start and case-correct; the substring check was both too loose (matched any string containing 'p'/'P') and case-wrong (ISO 8601 P/PT are case-sensitive — uppercasing the input would have masked lowercase-p false positives). 3. Extract `_as_list(x)` for the `[x] if isinstance(x, str) else list(x)` pattern, which appeared three times (file_type normalization, monitoring_location_id in `_build_filter`, time values in the duration check). 4. Drop redundant `date=False` on the `_format_api_dates` call (it's the default). 5. Collapse the `if time is not None: ... else: time_str = None` four-line block into a single ternary. 6. Pre-filter `features` by file_type before the download loop, so the loop body is just download-and-handle-errors. The "skip features whose file type wasn't requested" comment goes away because the list-comprehension above is self-documenting. 7. Drop two narrating comments whose WHY is already captured by the helpers' docstrings (`_build_filter` and `_extract_rdb_comment`). 8. Rename `test_get_ratings_multi_type_filters_urls_locally` → `..._filters_via_property` — the test name still claimed URL filtering after the property-based switch in the previous /simplify pass. All 11 ratings tests pass; live verification on api.waterdata.usgs.gov confirms file_path=None still attaches df.attrs, multi-type filtering returns the expected four tables, and the duration check still rejects "P7D" while accepting "2026-04-29". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two test-only fixes from a broader-ruff (--select ALL) pass. Both already-passed-CI changes are advisory under the project's narrower rule set (F, E, W, I, UP, B, Q, SIM, TID, C90, E501) but worth applying as cheap correctness/cleanliness wins. 1. RUF043: prefix the pytest.raises(match=...) pattern with `r` so the `.*` metacharacters are explicit. Functionally equivalent here (pytest re.searches the message), but the raw string makes the regex intent unambiguous. 2. PLC0415: remove a redundant inline `from dataretrieval.waterdata.ratings import _build_filter` inside test_build_filter_escapes_quotes — `_build_filter` is already imported at module top (left over from before the top-level import landed in an earlier commit on this branch). Other broader-ruff findings reviewed and intentionally skipped as either project-wide patterns or non-actionable: - S113 (`requests.get` without `timeout=`): no other call in the codebase uses one — applying here only would be inconsistent. - PLW1514 (`open(..., "w")` without `encoding=`): same project- wide situation. RDB content is ASCII. - PLR0913 / PLR0917 (8 args on get_ratings): mirrors R's read_waterdata_ratings signature; can't reduce without breaking parity. - FBT001 / FBT002 (boolean positional args download_and_parse / ssl_check): match the rest of waterdata's public surface; reordering would be a breaking API change. - FURB103 / PTH103 / PTH118 / PTH123 (pathlib over os.path): the rest of the codebase uses os.path; consistency wins. - EM101 / EM102 / TRY003 (raise ValueError(msg) with msg pre-assigned): every other `raise ValueError` in waterdata uses the inline f-string; consistency wins. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The USGS RDB tab-separated format is used both by NWIS web services
and by the new Water Data STAC catalog's rating-curve assets. Until
now, only nwis.py had a parser, and waterdata/ratings.py reached
across to import the private nwis._read_rdb. That smell is gone:
- New top-level module dataretrieval.rdb with two public helpers:
- read_rdb(text, dtypes=None): pure parser. Strips the comment
block, tab-parses the header row, skips the format-spec row,
and returns a DataFrame. Forwards optional caller-supplied
dtype hints to pandas.read_csv (unknown column names are
silently ignored, so dicts are safe to pass on any RDB).
Raises ValueError on HTML responses.
- extract_rdb_comment(text): returns the #-prefixed header
block as a list of lines (the equivalent of R's comment(df)).
- nwis._read_rdb is now a thin wrapper around rdb.read_rdb that
applies the NWIS-specific dtype hints (site_no, dec_long_va, ...)
and runs format_response() for datetime indexing. The private
symbol's contract is unchanged; existing 30 nwis tests pass.
Dropped the now-unused StringIO import.
- waterdata.ratings switches its import from nwis._read_rdb to
rdb.read_rdb / extract_rdb_comment, dropping the local
_extract_rdb_comment helper. The cross-module-private-import
comment block goes away — the dependency is now a clean
same-package import.
7 new unit tests pin the shared parser's behavior: basic shape
parsing, format-spec-row skipping, dtype passthrough, all-comments
empty-result, HTML rejection, and the comment extractor.
Net stat: +177 / -72.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aggregated quality-review fixes; no behavior change. Tests: 7 rdb +
11 ratings + 28 nwis pass (was 30 nwis; -2 are pure-duplicate
coverage now lived in rdb_test.py).
Source:
- dataretrieval/rdb.py: rename `rdb` parameter -> `text` on both
read_rdb and extract_rdb_comment ("rdb" is the format, not the
thing; reads better as `read_rdb(text=...)`). Drop "USGS-internal"
from the module docstring (the format is publicly served). Folded
the two-pass `[f.replace ... for f in fields if f.strip()]` into
one comprehension + a single empty-string filter. Replaced the
defensive `>=` on the all-comments guard with `==` (it's exact
after `next(..., len(lines))`). Spelled out in
extract_rdb_comment's docstring that the returned lines retain
their leading `#` and whitespace, matching R.
- dataretrieval/nwis.py: drop the `read_rdb as _read_rdb_text`
import alias — the local name is too close to the wrapper
`_read_rdb` and confused readers. Move `_NWIS_RDB_DTYPES` from
near the bottom of the file to the top alongside `_CRS`, where
module-level constants live. Drop the narrating "pandas silently
ignores unknown names..." comment (already in the read_rdb
docstring). Drop the `if df.empty: return df` guard in the
wrapper — `format_response()` already short-circuits cleanly on
an empty frame (no `dec_lat_va` -> skip GeoDataFrame branch; no
`datetime` -> early return).
Tests:
- tests/nwis_test.py: drop the two pure-duplicate cases
(`test_valid_rdb_returns_dataframe` and
`test_all_comments_returns_empty_dataframe`) whose coverage now
lives in tests/rdb_test.py. Keep and refocus the remaining
`test_no_sites_flows_through_format_response` to exercise the
wrapper-specific contract (now load-bearing after dropping the
empty-frame guard): an empty parser result must still flow
through format_response without raising. Issue-DOI-USGS#171 regression
intent preserved in the docstring.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator
Author
|
@ldecicco-USGS, I think this PR is straightforward, but I did refactor the rdb utilities out of the |
ldecicco-USGS
approved these changes
May 6, 2026
3 tasks
6 tasks
thodson-usgs
added a commit
to thodson-usgs/dataretrieval-python
that referenced
this pull request
May 7, 2026
…oval The module-level "use waterdata instead" warning has been firing on import for a while; this PR makes the migration guidance actionable by emitting a per-function DeprecationWarning that names the specific waterdata replacement the user should switch to. Once peaks (DOI-USGS#267) and ratings (DOI-USGS#269) land, every active nwis function has a waterdata replacement, so all nine of them are deprecated here: nwis.get_dv -> waterdata.get_daily() nwis.get_iv -> waterdata.get_continuous() nwis.get_info -> waterdata.get_monitoring_locations() nwis.what_sites -> waterdata.get_monitoring_locations() nwis.get_stats -> waterdata.get_stats_por() / waterdata.get_stats_date_range() nwis.get_discharge_peaks -> waterdata.get_peaks() nwis.get_ratings -> waterdata.get_ratings() nwis.get_record -> the appropriate waterdata.get_*() nwis.query_waterdata -> a high-level waterdata.get_*() helper nwis.query_waterservices -> a high-level waterdata.get_*() helper (get_qwdata, get_discharge_measurements, get_gwlevels, get_pmcodes, and get_water_use are already defunct and raise NameError.) Implementation follows the nadp deprecation template (DOI-USGS#243): a small _REPLACEMENTS dict + a _warn_deprecated(func_name) helper called as the first line of each public function. stacklevel=3 makes the warning point at the caller's code, not the helper's frame. 11 new parametrized tests pin the warning text — that the function name appears, the replacement helper appears, and the removal date appears — plus one end-to-end test that get_iv() actually fires its warning when called. Removal date is set to 2027-05-06, one full year out (vs. the six months used for nadp), since nwis is much more widely used and most users will need migration time. Maintainer can adjust if desired. This depends on DOI-USGS#267 (waterdata.get_peaks) and DOI-USGS#269 (waterdata.get_ratings) being merged: until then the deprecation messages for get_discharge_peaks and get_ratings point at functions users can't yet call. Hold this PR draft until those land. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thodson-usgs
added a commit
to thodson-usgs/dataretrieval-python
that referenced
this pull request
May 7, 2026
…oval The module-level "use waterdata instead" warning has been firing on import for a while; this PR makes the migration guidance actionable by emitting a per-function DeprecationWarning that names the specific waterdata replacement the user should switch to. Once peaks (DOI-USGS#267) and ratings (DOI-USGS#269) land, every active nwis function has a waterdata replacement, so all nine of them are deprecated here: nwis.get_dv -> waterdata.get_daily() nwis.get_iv -> waterdata.get_continuous() nwis.get_info -> waterdata.get_monitoring_locations() nwis.what_sites -> waterdata.get_monitoring_locations() nwis.get_stats -> waterdata.get_stats_por() / waterdata.get_stats_date_range() nwis.get_discharge_peaks -> waterdata.get_peaks() nwis.get_ratings -> waterdata.get_ratings() nwis.get_record -> the appropriate waterdata.get_*() nwis.query_waterdata -> a high-level waterdata.get_*() helper nwis.query_waterservices -> a high-level waterdata.get_*() helper (get_qwdata, get_discharge_measurements, get_gwlevels, get_pmcodes, and get_water_use are already defunct and raise NameError.) Implementation follows the nadp deprecation template (DOI-USGS#243): a small _REPLACEMENTS dict + a _warn_deprecated(func_name) helper called as the first line of each public function. stacklevel=3 makes the warning point at the caller's code, not the helper's frame. 11 new parametrized tests pin the warning text — that the function name appears, the replacement helper appears, and the removal date appears — plus one end-to-end test that get_iv() actually fires its warning when called. Removal date is set to 2027-05-06, one full year out (vs. the six months used for nadp), since nwis is much more widely used and most users will need migration time. Maintainer can adjust if desired. This depends on DOI-USGS#267 (waterdata.get_peaks) and DOI-USGS#269 (waterdata.get_ratings) being merged: until then the deprecation messages for get_discharge_peaks and get_ratings point at functions users can't yet call. Hold this PR draft until those land. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds a Python wrapper around the new USGS Water Data STAC catalog (announced at https://waterdata.usgs.gov/blog/wdfn-rating-curves/) for stage-discharge rating curves. Mirrors R's
read_waterdata_ratings.Why a new module
The rating-curve catalog is exposed via STAC (
api.waterdata.usgs.gov/stac/v0/search) rather than the OGC collections that back every otherwaterdatagetter. Two transport-layer differences from OGC:filterquery param. The composer mirrors R's logic: only pinfile_typeserver-side when a single value is requested; multi-type requests fetch all matches and filter URLs client-side..rdbasset onstac-files/. The function downloads each matching asset to a user-suppliedfile_path(default:tempfile.mkdtemp()) and parses it with the existingnwis._read_rdbhelper.Because of these differences, the implementation lives in
dataretrieval/waterdata/ratings.pyrather thanapi.py.Live API examples (verified)
Error paths surfaced upfront
The duration check matches a constraint of the upstream rating-curve service that R also enforces client-side.
Test plan
tests/waterdata_ratings_test.py:file_type, ISO 8601 duration indatetime).requests_mock(single-site happy path).download_and_parse=Falsereturns the raw feature list.file_typecorrectly omitsfile_typefrom the server filter and filters URLs locally.api.waterdata.usgs.gov/stac/v0/searchfor all three demos shown above.Related
Part of the v1.1.4 release-staging series; will be linked from a tracking issue.