Skip to content

feat: add cast_value and scale_offset codecs#3874

Open
d-v-b wants to merge 16 commits intozarr-developers:mainfrom
d-v-b:feat/scale-offset-cast-value
Open

feat: add cast_value and scale_offset codecs#3874
d-v-b wants to merge 16 commits intozarr-developers:mainfrom
d-v-b:feat/scale-offset-cast-value

Conversation

@d-v-b
Copy link
Copy Markdown
Contributor

@d-v-b d-v-b commented Apr 6, 2026

Defines two new codecs that together provide a v3-native replacement for the existing numcodecs.fixedscaleoffset codec.

The cast_value codec requires an optional dependency on the cast-value-rs package. The reason for bringing in rust here is because it gives us better average speed and much better memory usage than a numpy implementation. Happy to show this with benchmarks if needed.

fwiw, I would rather not define these codecs in zarr-python. In fact I already defined these codec classes in another package. But it's not possible to make zarr-python depend on an externally-defined codec without creating a circular dependency, so here we are. See discussion in the related issue.

related, and possibly closes #3867

edit: if depending on a new optional package for cast_value is too much, then we should revisit #3774. The conclusion there was that defining the cast_value codec in zarr-python itself was too much. And if the conclusion is that both defining the codec externally is too much, and defining it internally is too much, then we should do something like deprecate numcodecs.fixedscaleoffset and direct people to the python packages that implement its replacement.

d-v-b added 2 commits April 6, 2026 14:00
Defines two new codecs that together provide a v3-native replacement for the existing `numcodecs.fixedscaleoffset` codec.
The `cast_value` codec requires an optional dependency on the `cast-value-rs` package.
@d-v-b d-v-b requested a review from TomAugspurger April 6, 2026 12:20
@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Apr 6, 2026
@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Apr 6, 2026
@d-v-b
Copy link
Copy Markdown
Contributor Author

d-v-b commented Apr 6, 2026

cc @emmanuelmathot

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 6, 2026

Codecov Report

❌ Patch coverage is 82.79070% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.88%. Comparing base (0ea15fd) to head (2e8d644).

Files with missing lines Patch % Lines
src/zarr/codecs/cast_value.py 75.18% 34 Missing ⚠️
src/zarr/codecs/scale_offset.py 95.94% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3874      +/-   ##
==========================================
- Coverage   93.07%   92.88%   -0.20%     
==========================================
  Files          85       87       +2     
  Lines       11228    11443     +215     
==========================================
+ Hits        10451    10629     +178     
- Misses        777      814      +37     
Files with missing lines Coverage Δ
src/zarr/codecs/__init__.py 100.00% <100.00%> (ø)
src/zarr/codecs/scale_offset.py 95.94% <95.94%> (ø)
src/zarr/codecs/cast_value.py 75.18% <75.18%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@d-v-b d-v-b requested a review from a team April 7, 2026 13:38
@d-v-b d-v-b requested a review from maxrjones April 8, 2026 15:44
@d-v-b
Copy link
Copy Markdown
Contributor Author

d-v-b commented Apr 10, 2026

the coverage gaps are spurious because some of these routines only run when the optional cast-value-rs package is available. this is ready for review.

@vincentsarago
Copy link
Copy Markdown

👋 This is great.

With this codec we could reduce the size of our stored dataset 🙏

Copy link
Copy Markdown
Member

@maxrjones maxrjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks @d-v-b! A few comments:

protect against data corruption

I think you need to protect against integer overflow/underflow. Here's a couple tests that demonstrate corrupted data:

def test_encode_uint8_underflow_should_error():
    """uint8(0) - offset(1) wraps to 255, then * 2 = 254. Decode gives 128, not 0."""
    arr = zarr.create_array(
        store={},
        shape=(3,),
        dtype="uint8",
        chunks=(3,),
        filters=[ScaleOffset(offset=1, scale=2)],
        compressors=None,
        fill_value=1,
    )
    data = np.array([0, 1, 2], dtype="uint8")
    with pytest.raises((ValueError, OverflowError)):
        arr[:] = data


def test_encode_int8_overflow_should_error():
    """int8(50) * scale(3) = 150 wraps to -106. Decode gives -36, not 50."""
    arr = zarr.create_array(
        store={},
        shape=(1,),
        dtype="int8",
        chunks=(1,),
        filters=[ScaleOffset(offset=0, scale=3)],
        compressors=None,
        fill_value=0,
    )
    data = np.array([50], dtype="int8")
    with pytest.raises((ValueError, OverflowError)):
        arr[:] = data

user-guide documentation

I think it'd be worth having some documentation page for users coming from other data formats, since they could easily mix up the scale offset parameters. Here's a summary:

The main ones across scientific/geospatial data:

CF conventions (NetCDF): unpacked = packed * scale_factor + add_offset

  • Scale means "multiply when unpacking"

GDAL/GeoTIFF: value = raw * scale + offset

  • Same formula as CF. GDAL uses this for raster band transformations.

HDF-EOS5: value = scale_factor * (stored - offset)

  • Subtract offset before scaling — different from CF which adds offset after scaling. This is a common source of confusion between HDF-EOS and CF.

GRIB (WMO): value = reference_value + (raw * 2^binary_scale_factor) / 10^decimal_scale_factor

  • Two scale factors (binary and decimal) plus a reference value. Much more complex.

numcodecs FixedScaleOffset (zarr v2): encode: round((x - offset) * scale), decode: x / scale + offset

  • Same formula as the new zarr spec, plus explicit rounding on encode.

Zarr v3 scale_offset: encode: (x - offset) * scale, decode: x / scale + offset

  • Same as numcodecs but without built-in rounding (that's now delegated to cast_value).

ML quantization (e.g., TensorFlow Lite): real = (quantized - zero_point) * scale

  • Scale means "multiply when unpacking" (like CF). zero_point is the integer that maps to real zero.

The key axes of variation:

Scale means Order Rounding
CF / GDAL / ML quant multiply to unpack scale then offset implicit
HDF-EOS5 multiply to unpack offset then scale implicit
Zarr v3 / numcodecs multiply to pack offset then scale explicit (cast_value)
GRIB two scale factors its own thing N/A

The biggest gotcha is that CF's scale_factor=0.1 means the same thing as zarr's scale=10. Users migrating CF data need to invert the scale value.

sustainability

Will you add me or the zarr team as a maintainer on PyPI for cast-value-rs if this is merged?

offset = self._to_scalar(self.offset, chunk_spec.dtype)
scale = self._to_scalar(self.scale, chunk_spec.dtype)
if np.issubdtype(arr.dtype, np.integer):
result = (arr // scale) + offset # type: ignore[operator]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per the spec, should this use / and error if the result isn't representable in the data type?

Comment thread pyproject.toml
Comment on lines +195 to +203
[tool.hatch.envs.cast-value]
template = "test"
features = ["cast-value-rs"]

[[tool.hatch.envs.cast-value.matrix]]
python = ["3.12"]

[tool.hatch.envs.cast-value.scripts]
run = "pytest tests/test_codecs/test_cast_value.py {args:}"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should be tested in the CI

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

zarr v3 scale-offset codec, and should the codec API be defined upstream

3 participants