New CTable dictionary spec, nested columns and richer Arrow/Parquet interoperability #634

FrancescAlted merged 40 commits into main from ctable-dict-spec on May 15, 2026

@FrancescAlted
Member

This PR extends CTable with richer Arrow/Parquet interoperability, nested-column support, dictionary-encoded string columns, faster table opening, and lower blosc2 import overhead.

Main additions:

  • Add first-class CTable dictionary/categorical string columns via blosc2.dictionary().
    • Stored as compact int32 codes plus a persisted global string dictionary.
    • Arrow dictionary columns are preserved by default during Parquet/Arrow import.
    • Added equality, isin(), null handling, Arrow export, persistence, and CLI roundtrip support.
    • --decode-dictionaries is available in the Parquet CLI to opt out and import as vlstring.
  • Add nested Arrow/Parquet struct support for CTable.
    • Top-level Arrow/Parquet structs are recursively flattened into dotted CTable leaf columns, e.g. trip.begin.lon.
    • Leaf columns remain independent compressed arrays for fast analytics.
    • Row reads reconstruct the original nested dict shape.
    • Dotted leaves work with getitem, attribute access, where(), select(), indexes, aggregates, and Parquet roundtrips.
    • Literal ., /, and \ in field names are escaped and stored safely.
  • Add support for separating nested list<struct<...>> Parquet layouts.
    • Especially useful for Awkward-style unnamed-root list datasets.
    • CTable.from_parquet(..., separate_nested_cols=True) is now the default.
    • Each list element becomes a CTable row, and struct leaves become regular nested columns.
    • max_rows limits flattened CTable rows in this mode.
  • Improve parquet_to_blosc2 CLI.
    • Adds import/export/roundtrip workflows.
    • Adds progress reporting with ETA.
    • Adds nested-column controls via --separate-nested-cols / --no-separate-nested-cols.
    • Defaults list imports to Arrow serialization for better performance on deeply nested data.
    • Adds dictionary preservation/decode controls.
    • Adds --max-rows, timestamp handling improvements, memory reporting, profiling, and roundtrip comparison helpers.
  • Improve CTable persistence/open performance.
    • Adds lazy column metadata/data loading for faster CTable.open().
    • Makes CTable.nrows lazy.
    • Avoids unnecessary store-extension validation on open.
    • Preserves nested schema metadata and empty-root Arrow names across save/load/open/export paths.
  • Improve base blosc2 import time.
    • Lazy-imports numba and requests.
    • Avoids eager NumPy callable scans that triggered expensive lazy NumPy submodule imports.
    • Local profiling showed import blosc2 dropping from roughly 225 ms (measured under importtime) to 82–86 ms.
  • Add documentation, plans, benchmarks, and tests.
    • New CTable docs for Parquet interoperability, nested fields, dictionary columns, null policy, indexing, and querying.
    • New tests for dictionary columns, nested access/storage, nested metadata, Parquet interop, getitem access, and BatchArray metadata behavior.
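
To make the dictionary-column layout described above concrete, here is a minimal pure-Python sketch of the general technique: integer codes plus one global string dictionary. It is not blosc2's implementation and does not call the blosc2 API. The helper names (`encode`, `decode`, `eq_mask`, `isin_mask`) and the use of -1 as the null code are assumptions made for illustration only; the PR does not state how nulls are represented.

```python
from array import array

NULL_CODE = -1  # assumption: sentinel for nulls, chosen only for this sketch


def encode(values):
    """Return (codes, dictionary) for a list of strings; None marks a null."""
    dictionary = []     # code -> string, in first-seen order
    index = {}          # string -> code
    codes = array("i")  # C int typecode (32-bit on common platforms)
    for v in values:
        if v is None:
            codes.append(NULL_CODE)
            continue
        code = index.get(v)
        if code is None:
            code = len(dictionary)
            index[v] = code
            dictionary.append(v)
        codes.append(code)
    return codes, dictionary


def decode(codes, dictionary):
    """Map codes back to strings, turning the null code into None."""
    return [None if c == NULL_CODE else dictionary[c] for c in codes]


def eq_mask(codes, dictionary, value):
    """Equality: look the string up once, then compare integer codes."""
    try:
        target = dictionary.index(value)
    except ValueError:
        return [False] * len(codes)  # value not in the dictionary: no row matches
    return [c == target for c in codes]


def isin_mask(codes, dictionary, values):
    """Membership: translate the wanted strings to codes, then test codes."""
    wanted_strings = set(values)
    wanted_codes = {i for i, s in enumerate(dictionary) if s in wanted_strings}
    return [c in wanted_codes for c in codes]


vendors = ["uber", "lyft", None, "uber", "taxi", "lyft"]
codes, dictionary = encode(vendors)
print(list(codes))   # [0, 1, -1, 0, 2, 1]
print(dictionary)    # ['uber', 'lyft', 'taxi']
print(eq_mask(codes, dictionary, "uber"))
print(isin_mask(codes, dictionary, ["lyft", "taxi"]))
```

The point of the layout is that filters such as equality and isin() only touch the small integer array after a single dictionary lookup, and nulls never match because the sentinel is not a valid dictionary code.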

…l mapping, and roundtrip support

 - Add dotted nested column support in CTable access and filters:
     - attribute namespaces (t.trip.begin.lon)
     - string where-expression rewriting for dotted operands
 - Store dotted leaf columns hierarchically under _cols/...:
     - a.b.c -> /_cols/a/b/c(.b2nd/.b2b)
 - Add schema metadata v2 for nested mappings:
     - logical/physical/storage path maps
     - root alias metadata support
 - Preserve unnamed Arrow root ("") through Parquet/Arrow import-export:
     - normalize internally, restore on export via metadata
 - Add logical->physical selector resolution across APIs:
     - __getitem__, __getattr__, select, Arrow export columns, index APIs
 - Support struct-prefix expansion in selectors (select(["trip"]))
 - Implement recursive Arrow struct flattening in from_arrow:
     - flatten to dotted physical leaves
 - Implement row reconstruction for flattened structs on row materialization
 - Add virtual struct-path column reads (t["props"][:]) from descendant leaves
     - include null-collapse behavior for fully-null structs
 - Keep top-level struct schema compatibility in columns_by_name and Arrow/Parquet export
 - Extend nested semantics for index lifecycle and sorting:
     - logical aliases in create/rebuild/drop/compact index paths
     - clear error for non-leaf sort keys resolving to multiple leaves
 - Consolidate nested tests into two modules:
     - tests/ctable/test_nested_access_storage.py
     - tests/ctable/test_nested_metadata_root.py
 - Add/adjust tests for nested access, storage paths, metadata v2, root alias,
 struct flatten/reconstruct, roundtrip, sort/index behavior, and compatibility
 - Add benchmark utility:
     - bench/ctable/bench_nested_parquet_roundtrip.py
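
The flattening, row reconstruction, struct-prefix expansion, and storage-path mapping listed above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the CTable code: the function names (`flatten`, `reconstruct`, `expand_selector`, `storage_path`) are hypothetical, rows are modeled as nested dicts, escaping of literal `.`, `/`, and `\` in field names is omitted, and the `.b2nd`/`.b2b` file suffixes are left off the storage path.

```python
def flatten(row, prefix=""):
    """Flatten nested dicts into {dotted_path: leaf_value}."""
    out = {}
    for key, value in row.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, path))  # recurse into the struct
        else:
            out[path] = value                 # leaf: one independent column
    return out


def reconstruct(flat):
    """Rebuild the original nested dict shape from dotted leaf paths."""
    row = {}
    for path, value in flat.items():
        *parents, leaf = path.split(".")
        node = row
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return row


def expand_selector(selector, leaves):
    """Resolve a leaf name or a struct prefix to the matching leaf paths."""
    if selector in leaves:
        return [selector]
    matches = [p for p in leaves if p.startswith(selector + ".")]
    if not matches:
        raise KeyError(selector)
    return matches


def storage_path(leaf):
    """Map a dotted leaf to a hierarchical path: a.b.c -> /_cols/a/b/c."""
    return "/_cols/" + "/".join(leaf.split("."))


row = {"id": 7, "trip": {"begin": {"lon": -87.6, "lat": 41.9}, "km": 3.2}}
flat = flatten(row)
print(flat)
# {'id': 7, 'trip.begin.lon': -87.6, 'trip.begin.lat': 41.9, 'trip.km': 3.2}
print(reconstruct(flat) == row)             # True
print(expand_selector("trip", list(flat)))
# ['trip.begin.lon', 'trip.begin.lat', 'trip.km']
print(storage_path("trip.begin.lon"))       # /_cols/trip/begin/lon
```

In this sketch a selector such as "trip" expands to several leaves, which is why a non-leaf name cannot serve as a single sort key without further disambiguation.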
@FrancescAlted FrancescAlted merged commit e78992f into main May 15, 2026
17 checks passed
@FrancescAlted FrancescAlted deleted the ctable-dict-spec branch May 15, 2026 04:35