Skip to content

Add waterdata.get_ratings for USGS stage-discharge ratings via STAC#269

Merged
thodson-usgs merged 8 commits intoDOI-USGS:mainfrom
thodson-usgs:add-ratings
May 6, 2026
Merged

Add waterdata.get_ratings for USGS stage-discharge ratings via STAC#269
thodson-usgs merged 8 commits intoDOI-USGS:mainfrom
thodson-usgs:add-ratings

Conversation

@thodson-usgs
Copy link
Copy Markdown
Collaborator

Adds a Python wrapper around the new USGS Water Data STAC catalog (announced at https://waterdata.usgs.gov/blog/wdfn-rating-curves/) for stage-discharge rating curves. Mirrors R's read_waterdata_ratings.

Why a new module

The rating-curve catalog is exposed via STAC (api.waterdata.usgs.gov/stac/v0/search) rather than the OGC collections that back every other waterdata getter. Two transport-layer differences from OGC:

  1. Filter syntax — STAC takes a CQL2 string in a filter query param. The composer mirrors R's logic: only pin file_type server-side when a single value is requested; multi-type requests fetch all matches and filter URLs client-side.
  2. Download path — STAC features point at a per-rating .rdb asset on stac-files/. The function downloads each matching asset to a user-supplied file_path (default: tempfile.mkdtemp()) and parses it with the existing nwis._read_rdb helper.

Because of these differences, the implementation lives in dataretrieval/waterdata/ratings.py rather than api.py.

Live API examples (verified)

from dataretrieval.waterdata import get_ratings

# Default: exsa ratings for two sites
ratings = get_ratings(
    monitoring_location_id=["USGS-01104475", "USGS-01104460"],
    file_type="exsa",
)
# {'USGS-01104475.exsa.rdb': DataFrame(2001, 4), 'USGS-01104460.exsa.rdb': DataFrame(...)}
ratings["USGS-01104475.exsa.rdb"].head()
#    INDEP  SHIFT  DEP  STOR
# 0   0.00   0.00  0.0   NaN
# 1   0.01  -0.01  0.0   NaN
# 2   0.02  -0.02  0.0   NaN

# Both exsa and corr for the same two sites — 4 tables returned
ratings = get_ratings(
    monitoring_location_id=["USGS-01104475", "USGS-01104460"],
    file_type=["exsa", "corr"],
)

# Bounding-box "what's available?" without downloading
features = get_ratings(
    bbox=[-95.0, 40.0, -92.0, 42.0],
    download_and_parse=False,
)
# 59 STAC feature dicts in this region

Error paths surfaced upfront

get_ratings(monitoring_location_id="USGS-01104475", file_type="bogus")
# ValueError: Invalid file_type ['bogus']. Valid options: ['exsa', 'base', 'corr'].

get_ratings(monitoring_location_id="USGS-01104475", datetime="P7D")
# ValueError: ISO 8601 durations (e.g. 'P7D') are not supported in `datetime`
#             for the rating-curve service. Provide a date or interval instead.

The duration check matches a constraint of the upstream rating-curve service that R also enforces client-side.

Test plan

  • 9 unit tests in tests/waterdata_ratings_test.py:
    • Filter composition (single site, multi site, no site, empty).
    • Two error paths (invalid file_type, ISO 8601 duration in datetime).
    • End-to-end search + download via requests_mock (single-site happy path).
    • download_and_parse=False returns the raw feature list.
    • Multi-file_type correctly omits file_type from the server filter and filters URLs locally.
  • Live verification against api.waterdata.usgs.gov/stac/v0/search for all three demos shown above.

Related

Part of the v1.1.4 release-staging series; will be linked from a tracking issue.

@thodson-usgs
Copy link
Copy Markdown
Collaborator Author

Tracked in the v1.1.4 release-staging issue: #270.

thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request May 6, 2026
Two non-functional follow-ups suggested during review of DOI-USGS#269:

(1) Document the cross-module reach into nwis._read_rdb. Rating files
    use the same USGS RDB shape as NWIS responses, so the parser is
    already reusable as-is — no refactor of the legacy nwis module is
    needed. Added a comment at the import site explaining why the
    private import is intentional and what to watch for if _read_rdb
    ever moves.

(2) Surface the RDB #-prefixed header block. Each parsed rating frame
    now carries provenance in df.attrs:
      - df.attrs["comment"]: the list of "#"-prefixed header lines
        (rating id, parameter, expansion type, last-shifted timestamp,
        warnings, etc.).
      - df.attrs["url"]: the asset URL it was fetched from.
    R's read_waterdata_ratings exposes the comment block via
    comment(df); pandas's standard `attrs` dict is the Python
    equivalent. Done in ratings.py only — does not touch nwis.

A live spot-check against api.waterdata.usgs.gov on USGS-01104475
exsa shows the 31-line USGS header survives intact (gauge name,
parameter code, rating expansion, etc.). One new unit test pins the
behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request May 6, 2026
Six fixes from a multi-agent review pass on PR DOI-USGS#269:

1. Narrow the per-feature except handler from a bare Exception to
   (RequestException, ValueError, OSError). The previous catch-all
   would swallow programming bugs (KeyError on a malformed feature,
   AttributeError, ...) and silently drop rows.

2. Escape single quotes in CQL string literals via the standard
   doubling rule. Most monitoring-location IDs cannot contain a
   quote, but the function takes arbitrary strings — defending
   against malformed filters and potential injection regardless.
   New _quote_cql_str helper, applied to monitoring_location_id and
   file_type. New unit test pins the behaviour.

3. Promote file_type to a Literal["exsa", "base", "corr"] and derive
   _VALID_FILE_TYPES from it via typing.get_args, so the runtime
   guard and the type hint can never drift apart.

4. Rename the `datetime` parameter to `time` to match the convention
   used by every sibling waterdata getter (and to stop shadowing the
   stdlib `datetime` module). The parameter is still passed through
   as the STAC `datetime` query string under the hood; that's now
   documented explicitly.

5. Switch the multi-type local filter from a substring check on the
   asset URL to feature["properties"]["file_type"]. Substring
   matching on URLs would false-match if a host or path ever
   contained one of the literal type names; STAC features carry the
   typed property already.

6. Skip the on-disk RDB write when file_path is None. tempfile.mkdtemp
   leaks (no automatic cleanup), and df.attrs["url"] already records
   the source — so by default we now return only the parsed frame.
   Users who want a local copy can pass file_path=...; the contract
   for that path is unchanged.

While here, dropped a "discrete analogue to the OGC waterdata getters"
sentence from the module docstring (it was internal-architecture
WHAT-narrating, not user-relevant guidance).

All 11 ratings tests pass (one new test for the CQL quote-escaping).
Live verification on USGS-01104475 confirms file_path=None still
returns the parsed frame with df.attrs populated, multi-type via
property filter returns the expected 4 tables, and the renamed
`time=` parameter still drives the STAC datetime filter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thodson-usgs thodson-usgs requested a review from Copilot May 6, 2026 15:42
thodson-usgs and others added 3 commits May 6, 2026 11:18
Wraps the new Water Data STAC catalog endpoint
(api.waterdata.usgs.gov/stac/v0/search) for stage-discharge rating
curves. Lives in its own module (waterdata/ratings.py) because the
transport layer — STAC search + per-feature RDB download — differs
from the OGC collections used by the rest of the package.

The function:
- Composes a CQL filter from monitoring_location_id and (single-value)
  file_type, mirroring R's logic. Multi-type requests fetch all
  matches and filter URLs client-side.
- Optionally downloads each matching .rdb asset to a user-supplied
  file_path (default: a fresh tempfile.mkdtemp), and parses with the
  existing nwis._read_rdb helper.
- Returns a dict[id -> DataFrame] when download_and_parse=True, or the
  raw list of STAC features when False (cheap "what's available?"
  inspection).

Surfaces a clear ValueError for invalid file_type values and for ISO
8601 durations in `datetime` (the rating-curve service rejects them).

Mirrors R's read_waterdata_ratings in DOI-USGS/dataRetrieval; example
sites and idioms come straight from the R doc.

Tests: 9 unit tests covering filter composition, the two error paths,
and end-to-end search-and-download via requests_mock (single-site,
download_and_parse=False, and multi-type URL filtering).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two non-functional follow-ups suggested during review of DOI-USGS#269:

(1) Document the cross-module reach into nwis._read_rdb. Rating files
    use the same USGS RDB shape as NWIS responses, so the parser is
    already reusable as-is — no refactor of the legacy nwis module is
    needed. Added a comment at the import site explaining why the
    private import is intentional and what to watch for if _read_rdb
    ever moves.

(2) Surface the RDB #-prefixed header block. Each parsed rating frame
    now carries provenance in df.attrs:
      - df.attrs["comment"]: the list of "#"-prefixed header lines
        (rating id, parameter, expansion type, last-shifted timestamp,
        warnings, etc.).
      - df.attrs["url"]: the asset URL it was fetched from.
    R's read_waterdata_ratings exposes the comment block via
    comment(df); pandas's standard `attrs` dict is the Python
    equivalent. Done in ratings.py only — does not touch nwis.

A live spot-check against api.waterdata.usgs.gov on USGS-01104475
exsa shows the 31-line USGS header survives intact (gauge name,
parameter code, rating expansion, etc.). One new unit test pins the
behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Six fixes from a multi-agent review pass on PR DOI-USGS#269:

1. Narrow the per-feature except handler from a bare Exception to
   (RequestException, ValueError, OSError). The previous catch-all
   would swallow programming bugs (KeyError on a malformed feature,
   AttributeError, ...) and silently drop rows.

2. Escape single quotes in CQL string literals via the standard
   doubling rule. Most monitoring-location IDs cannot contain a
   quote, but the function takes arbitrary strings — defending
   against malformed filters and potential injection regardless.
   New _quote_cql_str helper, applied to monitoring_location_id and
   file_type. New unit test pins the behaviour.

3. Promote file_type to a Literal["exsa", "base", "corr"] and derive
   _VALID_FILE_TYPES from it via typing.get_args, so the runtime
   guard and the type hint can never drift apart.

4. Rename the `datetime` parameter to `time` to match the convention
   used by every sibling waterdata getter (and to stop shadowing the
   stdlib `datetime` module). The parameter is still passed through
   as the STAC `datetime` query string under the hood; that's now
   documented explicitly.

5. Switch the multi-type local filter from a substring check on the
   asset URL to feature["properties"]["file_type"]. Substring
   matching on URLs would false-match if a host or path ever
   contained one of the literal type names; STAC features carry the
   typed property already.

6. Skip the on-disk RDB write when file_path is None. tempfile.mkdtemp
   leaks (no automatic cleanup), and df.attrs["url"] already records
   the source — so by default we now return only the parsed frame.
   Users who want a local copy can pass file_path=...; the contract
   for that path is unchanged.

While here, dropped a "discrete analogue to the OGC waterdata getters"
sentence from the module docstring (it was internal-architecture
WHAT-narrating, not user-relevant guidance).

All 11 ratings tests pass (one new test for the CQL quote-escaping).
Live verification on USGS-01104475 confirms file_path=None still
returns the parsed frame with df.attrs populated, multi-type via
property filter returns the expected 4 tables, and the renamed
`time=` parameter still drives the STAC datetime filter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thodson-usgs thodson-usgs review requested due to automatic review settings May 6, 2026 17:05
thodson-usgs and others added 4 commits May 6, 2026 12:56
Eight small changes; net -16 lines. No behavior change other than the
duration check now using the project's anchored regex instead of an
ad-hoc substring search (small correctness improvement).

1. Reorder so the public `get_ratings` is at the top and helpers
   follow — matches the convention in `nearest.py`.

2. Reuse `_DURATION_RE` from `waterdata.utils` instead of the ad-hoc
   `"P" in str(v).upper()` check. The regex is anchored at start and
   case-correct; the substring check was both too loose (matched any
   string containing 'p'/'P') and case-wrong (ISO 8601 P/PT are
   case-sensitive — uppercasing the input would have masked
   lowercase-p false positives).

3. Extract `_as_list(x)` for the `[x] if isinstance(x, str) else
   list(x)` pattern, which appeared three times (file_type
   normalization, monitoring_location_id in `_build_filter`, time
   values in the duration check).

4. Drop redundant `date=False` on the `_format_api_dates` call (it's
   the default).

5. Collapse the `if time is not None: ... else: time_str = None`
   four-line block into a single ternary.

6. Pre-filter `features` by file_type before the download loop, so
   the loop body is just download-and-handle-errors. The "skip
   features whose file type wasn't requested" comment goes away
   because the list-comprehension above is self-documenting.

7. Drop two narrating comments whose WHY is already captured by the
   helpers' docstrings (`_build_filter` and `_extract_rdb_comment`).

8. Rename `test_get_ratings_multi_type_filters_urls_locally` →
   `..._filters_via_property` — the test name still claimed URL
   filtering after the property-based switch in the previous
   /simplify pass.

All 11 ratings tests pass; live verification on
api.waterdata.usgs.gov confirms file_path=None still attaches
df.attrs, multi-type filtering returns the expected four tables, and
the duration check still rejects "P7D" while accepting "2026-04-29".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Two test-only fixes from a broader-ruff (--select ALL) pass. Both
already-passed-CI changes are advisory under the project's narrower
rule set (F, E, W, I, UP, B, Q, SIM, TID, C90, E501) but worth
applying as cheap correctness/cleanliness wins.

1. RUF043: prefix the pytest.raises(match=...) pattern with `r` so
   the `.*` metacharacters are explicit. Functionally equivalent
   here (pytest re.searches the message), but the raw string makes
   the regex intent unambiguous.

2. PLC0415: remove a redundant inline `from
   dataretrieval.waterdata.ratings import _build_filter` inside
   test_build_filter_escapes_quotes — `_build_filter` is already
   imported at module top (left over from before the top-level
   import landed in an earlier commit on this branch).

Other broader-ruff findings reviewed and intentionally skipped as
either project-wide patterns or non-actionable:

  - S113 (`requests.get` without `timeout=`): no other call in the
    codebase uses one — applying here only would be inconsistent.
  - PLW1514 (`open(..., "w")` without `encoding=`): same project-
    wide situation. RDB content is ASCII.
  - PLR0913 / PLR0917 (8 args on get_ratings): mirrors R's
    read_waterdata_ratings signature; can't reduce without breaking
    parity.
  - FBT001 / FBT002 (boolean positional args
    download_and_parse / ssl_check): match the rest of waterdata's
    public surface; reordering would be a breaking API change.
  - FURB103 / PTH103 / PTH118 / PTH123 (pathlib over os.path): the
    rest of the codebase uses os.path; consistency wins.
  - EM101 / EM102 / TRY003 (raise ValueError(msg) with msg
    pre-assigned): every other `raise ValueError` in waterdata uses
    the inline f-string; consistency wins.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The USGS RDB tab-separated format is used both by NWIS web services
and by the new Water Data STAC catalog's rating-curve assets. Until
now, only nwis.py had a parser, and waterdata/ratings.py reached
across to import the private nwis._read_rdb. That smell is gone:

- New top-level module dataretrieval.rdb with two public helpers:
  - read_rdb(text, dtypes=None): pure parser. Strips the comment
    block, tab-parses the header row, skips the format-spec row,
    and returns a DataFrame. Forwards optional caller-supplied
    dtype hints to pandas.read_csv (unknown column names are
    silently ignored, so dicts are safe to pass on any RDB).
    Raises ValueError on HTML responses.
  - extract_rdb_comment(text): returns the #-prefixed header
    block as a list of lines (the equivalent of R's comment(df)).

- nwis._read_rdb is now a thin wrapper around rdb.read_rdb that
  applies the NWIS-specific dtype hints (site_no, dec_long_va, ...)
  and runs format_response() for datetime indexing. The private
  symbol's contract is unchanged; existing 30 nwis tests pass.
  Dropped the now-unused StringIO import.

- waterdata.ratings switches its import from nwis._read_rdb to
  rdb.read_rdb / extract_rdb_comment, dropping the local
  _extract_rdb_comment helper. The cross-module-private-import
  comment block goes away — the dependency is now a clean
  same-package import.

7 new unit tests pin the shared parser's behavior: basic shape
parsing, format-spec-row skipping, dtype passthrough, all-comments
empty-result, HTML rejection, and the comment extractor.

Net stat: +177 / -72.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aggregated quality-review fixes; no behavior change. Tests: 7 rdb +
11 ratings + 28 nwis pass (was 30 nwis; -2 are pure-duplicate
coverage now lived in rdb_test.py).

Source:
- dataretrieval/rdb.py: rename `rdb` parameter -> `text` on both
  read_rdb and extract_rdb_comment ("rdb" is the format, not the
  thing; reads better as `read_rdb(text=...)`). Drop "USGS-internal"
  from the module docstring (the format is publicly served). Folded
  the two-pass `[f.replace ... for f in fields if f.strip()]` into
  one comprehension + a single empty-string filter. Replaced the
  defensive `>=` on the all-comments guard with `==` (it's exact
  after `next(..., len(lines))`). Spelled out in
  extract_rdb_comment's docstring that the returned lines retain
  their leading `#` and whitespace, matching R.

- dataretrieval/nwis.py: drop the `read_rdb as _read_rdb_text`
  import alias — the local name is too close to the wrapper
  `_read_rdb` and confused readers. Move `_NWIS_RDB_DTYPES` from
  near the bottom of the file to the top alongside `_CRS`, where
  module-level constants live. Drop the narrating "pandas silently
  ignores unknown names..." comment (already in the read_rdb
  docstring). Drop the `if df.empty: return df` guard in the
  wrapper — `format_response()` already short-circuits cleanly on
  an empty frame (no `dec_lat_va` -> skip GeoDataFrame branch; no
  `datetime` -> early return).

Tests:
- tests/nwis_test.py: drop the two pure-duplicate cases
  (`test_valid_rdb_returns_dataframe` and
  `test_all_comments_returns_empty_dataframe`) whose coverage now
  lives in tests/rdb_test.py. Keep and refocus the remaining
  `test_no_sites_flows_through_format_response` to exercise the
  wrapper-specific contract (now load-bearing after dropping the
  empty-frame guard): an empty parser result must still flow
  through format_response without raising. Issue-DOI-USGS#171 regression
  intent preserved in the docstring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thodson-usgs
Copy link
Copy Markdown
Collaborator Author

@ldecicco-USGS, I think this PR is straightforward, but I did refactor the rdb utilities out of the nwis module. Maybe overkill, but low risk. I'm fine to undo that if you prefer.

@thodson-usgs thodson-usgs requested a review from ldecicco-USGS May 6, 2026 19:01
@thodson-usgs thodson-usgs marked this pull request as ready for review May 6, 2026 23:42
@thodson-usgs thodson-usgs merged commit ba11d6f into DOI-USGS:main May 6, 2026
8 checks passed
thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request May 7, 2026
…oval

The module-level "use waterdata instead" warning has been firing on
import for a while; this PR makes the migration guidance actionable
by emitting a per-function DeprecationWarning that names the specific
waterdata replacement the user should switch to.

Once peaks (DOI-USGS#267) and ratings (DOI-USGS#269) land, every active nwis function
has a waterdata replacement, so all nine of them are deprecated here:

  nwis.get_dv               -> waterdata.get_daily()
  nwis.get_iv               -> waterdata.get_continuous()
  nwis.get_info             -> waterdata.get_monitoring_locations()
  nwis.what_sites           -> waterdata.get_monitoring_locations()
  nwis.get_stats            -> waterdata.get_stats_por() /
                              waterdata.get_stats_date_range()
  nwis.get_discharge_peaks  -> waterdata.get_peaks()
  nwis.get_ratings          -> waterdata.get_ratings()
  nwis.get_record           -> the appropriate waterdata.get_*()
  nwis.query_waterdata      -> a high-level waterdata.get_*() helper
  nwis.query_waterservices  -> a high-level waterdata.get_*() helper

(get_qwdata, get_discharge_measurements, get_gwlevels, get_pmcodes,
and get_water_use are already defunct and raise NameError.)

Implementation follows the nadp deprecation template (DOI-USGS#243): a small
_REPLACEMENTS dict + a _warn_deprecated(func_name) helper called as
the first line of each public function. stacklevel=3 makes the
warning point at the caller's code, not the helper's frame.

11 new parametrized tests pin the warning text — that the function
name appears, the replacement helper appears, and the removal date
appears — plus one end-to-end test that get_iv() actually fires
its warning when called.

Removal date is set to 2027-05-06, one full year out (vs. the six
months used for nadp), since nwis is much more widely used and most
users will need migration time. Maintainer can adjust if desired.

This depends on DOI-USGS#267 (waterdata.get_peaks) and DOI-USGS#269 (waterdata.get_ratings)
being merged: until then the deprecation messages for get_discharge_peaks
and get_ratings point at functions users can't yet call. Hold this PR
draft until those land.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request May 7, 2026
…oval

The module-level "use waterdata instead" warning has been firing on
import for a while; this PR makes the migration guidance actionable
by emitting a per-function DeprecationWarning that names the specific
waterdata replacement the user should switch to.

Once peaks (DOI-USGS#267) and ratings (DOI-USGS#269) land, every active nwis function
has a waterdata replacement, so all nine of them are deprecated here:

  nwis.get_dv               -> waterdata.get_daily()
  nwis.get_iv               -> waterdata.get_continuous()
  nwis.get_info             -> waterdata.get_monitoring_locations()
  nwis.what_sites           -> waterdata.get_monitoring_locations()
  nwis.get_stats            -> waterdata.get_stats_por() /
                              waterdata.get_stats_date_range()
  nwis.get_discharge_peaks  -> waterdata.get_peaks()
  nwis.get_ratings          -> waterdata.get_ratings()
  nwis.get_record           -> the appropriate waterdata.get_*()
  nwis.query_waterdata      -> a high-level waterdata.get_*() helper
  nwis.query_waterservices  -> a high-level waterdata.get_*() helper

(get_qwdata, get_discharge_measurements, get_gwlevels, get_pmcodes,
and get_water_use are already defunct and raise NameError.)

Implementation follows the nadp deprecation template (DOI-USGS#243): a small
_REPLACEMENTS dict + a _warn_deprecated(func_name) helper called as
the first line of each public function. stacklevel=3 makes the
warning point at the caller's code, not the helper's frame.

11 new parametrized tests pin the warning text — that the function
name appears, the replacement helper appears, and the removal date
appears — plus one end-to-end test that get_iv() actually fires
its warning when called.

Removal date is set to 2027-05-06, one full year out (vs. the six
months used for nadp), since nwis is much more widely used and most
users will need migration time. Maintainer can adjust if desired.

This depends on DOI-USGS#267 (waterdata.get_peaks) and DOI-USGS#269 (waterdata.get_ratings)
being merged: until then the deprecation messages for get_discharge_peaks
and get_ratings point at functions users can't yet call. Hold this PR
draft until those land.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants