feat: add cast_value and scale_offset codecs #3874
d-v-b wants to merge 16 commits into zarr-developers:main
Conversation
Defines two new codecs that together provide a v3-native replacement for the existing `numcodecs.fixedscaleoffset` codec. The `cast_value` codec requires an optional dependency on the `cast-value-rs` package.
Codecov Report ❌

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #3874      +/-   ##
==========================================
- Coverage   93.07%   92.88%   -0.20%
==========================================
  Files          85       87       +2
  Lines       11228    11443     +215
==========================================
+ Hits        10451    10629     +178
- Misses        777      814      +37
```
The coverage gaps are spurious because some of these routines only run when the optional dependency is installed.
👋 This is great. With this codec we could reduce the size of our stored dataset 🙏
maxrjones left a comment:
Nice, thanks @d-v-b! A few comments:
protect against data corruption
I think you need to protect against integer overflow/underflow. Here are a couple of tests that demonstrate corrupted data:
```python
import numpy as np
import pytest

import zarr
from zarr.codecs import ScaleOffset  # import path assumed from this PR


def test_encode_uint8_underflow_should_error():
    """uint8(0) - offset(1) wraps to 255, then * 2 = 254. Decode gives 128, not 0."""
    arr = zarr.create_array(
        store={},
        shape=(3,),
        dtype="uint8",
        chunks=(3,),
        filters=[ScaleOffset(offset=1, scale=2)],
        compressors=None,
        fill_value=1,
    )
    data = np.array([0, 1, 2], dtype="uint8")
    with pytest.raises((ValueError, OverflowError)):
        arr[:] = data


def test_encode_int8_overflow_should_error():
    """int8(50) * scale(3) = 150 wraps to -106. Decode gives -36, not 50."""
    arr = zarr.create_array(
        store={},
        shape=(1,),
        dtype="int8",
        chunks=(1,),
        filters=[ScaleOffset(offset=0, scale=3)],
        compressors=None,
        fill_value=0,
    )
    data = np.array([50], dtype="int8")
    with pytest.raises((ValueError, OverflowError)):
        arr[:] = data
```
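One way to guard against this (a sketch under the assumption that encode computes `(x - offset) * scale`; `safe_scale_offset_encode` is a hypothetical helper, not the PR's code) is to do the arithmetic in a wider dtype and range-check before casting back:

```python
import numpy as np


def safe_scale_offset_encode(arr: np.ndarray, scale: int, offset: int) -> np.ndarray:
    """Encode (arr - offset) * scale, raising instead of wrapping.

    Hypothetical sketch: compute in int64, then verify the result fits
    in the original integer dtype before casting back.
    """
    wide = (arr.astype(np.int64) - offset) * scale
    info = np.iinfo(arr.dtype)
    if wide.min() < info.min or wide.max() > info.max:
        raise OverflowError(f"scale/offset encoding does not fit in {arr.dtype}")
    return wide.astype(arr.dtype)
```

With this guard, the uint8 underflow case above raises `OverflowError` instead of silently wrapping to 254.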
user-guide documentation
I think it'd be worth having a documentation page for users coming from other data formats, since they could easily mix up the scale/offset parameters. Here's a summary:
The main ones across scientific/geospatial data:

- CF conventions (NetCDF): `unpacked = packed * scale_factor + add_offset`
  - Scale means "multiply when unpacking".
- GDAL/GeoTIFF: `value = raw * scale + offset`
  - Same formula as CF. GDAL uses this for raster band transformations.
- HDF-EOS5: `value = scale_factor * (stored - offset)`
  - Subtract offset before scaling, unlike CF, which adds the offset after scaling. This is a common source of confusion between HDF-EOS and CF.
- GRIB (WMO): `value = reference_value + (raw * 2^binary_scale_factor) / 10^decimal_scale_factor`
  - Two scale factors (binary and decimal) plus a reference value. Much more complex.
- numcodecs FixedScaleOffset (zarr v2): encode: `round((x - offset) * scale)`, decode: `x / scale + offset`
  - Same formula as the new zarr spec, plus explicit rounding on encode.
- Zarr v3 scale_offset: encode: `(x - offset) * scale`, decode: `x / scale + offset`
  - Same as numcodecs but without built-in rounding (that's now delegated to `cast_value`).
- ML quantization (e.g., TensorFlow Lite): `real = (quantized - zero_point) * scale`
  - Scale means "multiply when unpacking" (like CF). `zero_point` is the integer that maps to real zero.
The key axes of variation:
| Convention | Scale means | Order | Rounding |
|---|---|---|---|
| CF / GDAL / ML quant | multiply to unpack | scale then offset | implicit |
| HDF-EOS5 | multiply to unpack | offset then scale | implicit |
| Zarr v3 / numcodecs | multiply to pack | offset then scale | explicit (`cast_value`) |
| GRIB | two scale factors | its own thing | N/A |
The biggest gotcha is that CF's `scale_factor=0.1` means the same thing as zarr's `scale=10`. Users migrating CF data need to invert the scale value.
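A quick numpy check makes the inversion concrete (`cf_unpack` and `zarr_decode` are illustrative helpers that restate the formulas above, not library APIs):

```python
import numpy as np

# CF convention: unpacked = packed * scale_factor + add_offset
def cf_unpack(packed, scale_factor, add_offset):
    return packed * scale_factor + add_offset

# Zarr v3 scale_offset decode: x / scale + offset
def zarr_decode(stored, scale, offset):
    return stored / scale + offset

stored = np.array([0, 5, 123], dtype="int16")
# CF scale_factor=0.1 plays the role of zarr scale=10 (the reciprocal)
cf = cf_unpack(stored, scale_factor=0.1, add_offset=2.0)
zv3 = zarr_decode(stored, scale=10, offset=2.0)
assert np.allclose(cf, zv3)
```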
sustainability
Will you add me or the zarr team as a maintainer on PyPI for cast-value-rs if this is merged?
```python
offset = self._to_scalar(self.offset, chunk_spec.dtype)
scale = self._to_scalar(self.scale, chunk_spec.dtype)
if np.issubdtype(arr.dtype, np.integer):
    result = (arr // scale) + offset  # type: ignore[operator]
```

per the spec, should this use `/` and error if the result isn't representable in the data type?
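For illustration, such a strict alternative might look like this (`strict_int_decode` is a hypothetical helper sketching the suggestion, not the codec's implementation):

```python
import numpy as np


def strict_int_decode(arr: np.ndarray, scale: int, offset: int) -> np.ndarray:
    """Decode with true division, raising if any quotient is not exactly
    representable as an integer (rather than silently flooring)."""
    result = arr / scale  # true division -> float
    if not np.all(result == np.trunc(result)):
        raise ValueError("arr / scale is not exactly representable in the integer dtype")
    return result.astype(arr.dtype) + offset
```

With `arr // scale`, an odd value divided by 2 is silently truncated; here it raises instead.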
```toml
[tool.hatch.envs.cast-value]
template = "test"
features = ["cast-value-rs"]

[[tool.hatch.envs.cast-value.matrix]]
python = ["3.12"]

[tool.hatch.envs.cast-value.scripts]
run = "pytest tests/test_codecs/test_cast_value.py {args:}"
```

these should be tested in the CI
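A workflow step along these lines could cover it (a sketch; the job name, action versions, and hatch invocation are assumptions, not taken from this PR):

```yaml
# fragment of a GitHub Actions workflow (hypothetical)
  test-cast-value:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install hatch
      # runs the [tool.hatch.envs.cast-value] script defined above
      - run: hatch run cast-value:run
```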
The reason for bringing in Rust here is that it gives us better average speed and much better memory usage than a numpy implementation. Happy to show this with benchmarks if needed.

fwiw, I would rather not define these codecs in zarr-python. In fact I already defined these codec classes in another package, but it's not possible to make zarr-python depend on an externally-defined codec without creating a circular dependency, so here we are. See discussion in the related issue.

Related, and possibly closes #3867.

edit: if depending on a new optional package for `cast_value` is too much, then we should revisit #3774. The conclusion there was that defining the `cast_value` codec in zarr-python itself was too much. And if the conclusion is that both defining the codec externally and defining it internally are too much, then we should do something like deprecate `numcodecs.fixedscaleoffset` and direct people to the python packages that implement its replacement.
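For anyone weighing that migration path, the v2 FixedScaleOffset formulas summarized earlier (encode: `round((x - offset) * scale)`, decode: `x / scale + offset`) round-trip as expected in plain numpy (`fso_encode`/`fso_decode` are illustrative names, not the codec's API):

```python
import numpy as np

# Sketch of the v2 numcodecs FixedScaleOffset formulas in plain numpy,
# for comparison only.
def fso_encode(x, scale, offset, astype="int16"):
    # encode: round((x - offset) * scale), cast to the packed dtype
    return np.round((x - offset) * scale).astype(astype)

def fso_decode(y, scale, offset):
    # decode: y / scale + offset
    return y / scale + offset

x = np.array([1000.1, 1000.3, 1000.5])
packed = fso_encode(x, scale=10, offset=1000)
assert np.allclose(fso_decode(packed, 10, 1000), x)
```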