fix: preserve non-ASCII (CJK) path segments in auto-generated project name#624

Open
mvanhorn wants to merge 1 commit into DeusData:main from mvanhorn:fix/571-cjk-project-name-slug

Conversation

@mvanhorn

Contributor

What does this PR do?

Fixes #571: cbm_project_name_from_path() derived an auto project slug by mapping
every byte outside [A-Za-z0-9._-] to - and then collapsing dashes. For a path
such as /Users/yunxin/Desktop/开发/后端, every UTF-8 byte of the CJK segments is
unsafe, so each segment collapsed to a single - that was then trimmed away,
yielding Users-yunxin-Desktop. Two different non-ASCII directories under the same
ASCII parent collapsed to the same name, so the identifying information was
silently lost and distinct repos collided. Users with non-Latin directory names
saw indexing create a DB under a truncated, unrecognizable name.
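
To make the collision concrete, here is a minimal sketch of the lossy mapping described above. It is not the project's code: `lossy_slug` and `is_safe_byte` are hypothetical names, and the sketch covers only the dash mapping, collapse, and trim, not any dot handling the real function performs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bytes kept verbatim: the [A-Za-z0-9._-] set named in the PR text. */
static int is_safe_byte(unsigned char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') || c == '.' || c == '_' || c == '-';
}

/* Hypothetical stand-in for the pre-fix behavior (illustration only).
 * Every unsafe byte becomes '-', runs of '-' collapse to one, and
 * leading/trailing '-' are trimmed. Output is truncated to cap - 1 bytes. */
static void lossy_slug(const char *path, char *out, size_t cap) {
    assert(path != NULL && out != NULL && cap > 0);
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)path;
         *p != '\0' && n + 1 < cap; p++) {
        char c = is_safe_byte(*p) ? (char)*p : '-';
        /* Skip a dash at the start or directly after another dash. */
        if (c == '-' && (n == 0 || out[n - 1] == '-')) {
            continue;
        }
        out[n++] = c;
    }
    while (n > 0 && out[n - 1] == '-') {
        n--; /* trim trailing dashes */
    }
    out[n] = '\0';
}
```

Under this sketch, both /Users/yunxin/Desktop/开发/后端 and /Users/yunxin/Desktop/文档 reduce to Users-yunxin-Desktop, because every UTF-8 byte of the CJK segments falls into the same dash run as the neighboring separators.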

This change percent-encodes non-ASCII bytes instead of discarding them, so
non-ASCII paths keep a distinct, recoverable name. It also keeps the slug
generator in agreement with cbm_validate_project_name() -- the same invariant
that motivated the #349 space fix (a name that indexes but resolve_store
later rejects reports project-not-found).
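
The invariant can be illustrated with a small validator sketch built from the rules listed under Changes. This is an assumption-laden stand-in, not cbm_validate_project_name() itself: the name `name_is_valid_sketch`, the whitelist approach, and the rejection of empty names are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the validator (illustration only).
 * Assumed rules: reject NULL/empty names, a leading '.', any "..",
 * and any byte outside [A-Za-z0-9._-] plus the newly accepted '%'. */
static bool name_is_valid_sketch(const char *name) {
    if (name == NULL || name[0] == '\0' || name[0] == '.') {
        return false;
    }
    if (strstr(name, "..") != NULL) {
        return false;
    }
    for (const unsigned char *p = (const unsigned char *)name; *p != '\0'; p++) {
        bool ok = (*p >= 'A' && *p <= 'Z') || (*p >= 'a' && *p <= 'z') ||
                  (*p >= '0' && *p <= '9') || *p == '.' || *p == '_' ||
                  *p == '-' || *p == '%';
        if (!ok) {
            return false; /* covers '/', '\\', spaces, and raw non-ASCII bytes */
        }
    }
    return true;
}
```

The point of the invariant is that any name the slug generator can emit must pass this check; otherwise indexing succeeds but a later lookup rejects the name.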

Changes:

  • src/pipeline/fqn.c -- cbm_project_name_from_path() now percent-encodes any
    byte >= 0x80 as uppercase %HH into a new, larger buffer, while ASCII
    separators / unsafe bytes still map to -. ASCII slugs are byte-identical to
    before (e.g. /home/u/my project -> home-u-my-project). The existing
    dash/dot collapse and leading/trailing trim are preserved. Because the slug is
    later used as <cache>/<name>.db, long slugs are bounded to 200 bytes with a
    deterministic FNV-1a hash suffix so distinct long paths still produce distinct
    names and stay within the OS filename-component limit. Includes overflow guards
    and correct memory management.
  • src/foundation/str_util.c -- cbm_validate_project_name() additionally
    accepts %. It remains filesystem-safe; the existing .., /, \, and
    leading-. rejections are unchanged.
  • tests/test_fqn.c -- new coverage for CJK percent-encoding, distinctness, the
    long-path bound, the %-accepting validator, and an unchanged-ASCII
    regression.
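
The slug rules above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code in src/pipeline/fqn.c: the names `slug_sketch`, `fnv1a32`, and `SLUG_MAX` are hypothetical, the hash is assumed to be 32-bit FNV-1a over the untruncated slug, the suffix format (a dash plus 8 hex digits) is assumed, and dot collapsing is left out for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SLUG_MAX 200 /* length bound stated in the PR */

/* ASCII bytes copied verbatim; '-' is handled separately so it collapses. */
static int is_plain_byte(unsigned char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') || c == '.' || c == '_';
}

/* 32-bit FNV-1a over a NUL-terminated string. */
static uint32_t fnv1a32(const char *s) {
    uint32_t h = 2166136261u; /* offset basis */
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u; /* FNV prime */
    }
    return h;
}

/* Hypothetical sketch. Returns a malloc'd slug (caller frees), or NULL. */
static char *slug_sketch(const char *path) {
    static const char hex[] = "0123456789ABCDEF";
    assert(path != NULL);
    size_t len = strlen(path);
    if (len > (SIZE_MAX - 1) / 3) {
        return NULL; /* overflow guard: worst case is 3 output bytes per input byte */
    }
    char *buf = malloc(len * 3 + 1);
    if (buf == NULL) {
        return NULL;
    }
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)path[i];
        if (c >= 0x80) { /* non-ASCII byte: uppercase %HH */
            buf[n++] = '%';
            buf[n++] = hex[c >> 4];
            buf[n++] = hex[c & 0x0F];
        } else if (is_plain_byte(c)) {
            buf[n++] = (char)c;
        } else if (n > 0 && buf[n - 1] != '-') {
            buf[n++] = '-'; /* separator or unsafe ASCII; collapses runs, trims leading */
        }
    }
    while (n > 0 && buf[n - 1] == '-') {
        n--; /* trim trailing dashes */
    }
    buf[n] = '\0';
    if (n > SLUG_MAX) {
        uint32_t h = fnv1a32(buf);  /* hash the full slug before cutting it */
        size_t keep = SLUG_MAX - 9; /* room for '-' plus 8 hex digits */
        if (buf[keep - 1] == '%') {
            keep -= 1; /* do not leave a dangling '%' */
        } else if (buf[keep - 2] == '%') {
            keep -= 2; /* do not leave a dangling "%H" */
        }
        snprintf(buf + keep, 10, "-%08lX", (unsigned long)h);
    }
    return buf;
}
```

Backing off to a whole %HH triple before appending the suffix keeps the truncated prefix decodable. Whether the real implementation hashes the slug or the original path, and how it formats the suffix, is not stated in the description.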

Example: /Users/yunxin/Desktop/开发/后端 now yields
Users-yunxin-Desktop-%E5%BC%80%E5%8F%91-%E5%90%8E%E7%AB%AF (distinct, valid)
instead of Users-yunxin-Desktop.
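
As a rough illustration of what "recoverable" can mean here, the %HH triples in such a slug decode back to the original UTF-8 bytes. The helper below is a hypothetical sketch (`percent_decode_inplace` is not part of this PR). It recovers only the non-ASCII bytes; separators and other unsafe ASCII bytes were all mapped to -, so the full original path cannot be reconstructed from the slug alone.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit (including '\0'). */
static int hex_val(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Hypothetical helper (illustration only): decodes %HH triples in place,
 * copies every other byte unchanged, and returns the decoded length.
 * A '%' not followed by two hex digits is copied literally. */
static size_t percent_decode_inplace(char *s) {
    assert(s != NULL);
    size_t r = 0;
    size_t w = 0;
    while (s[r] != '\0') {
        int hi = (s[r] == '%') ? hex_val(s[r + 1]) : -1;
        int lo = (hi >= 0) ? hex_val(s[r + 2]) : -1;
        if (lo >= 0) {
            s[w++] = (char)((hi << 4) | lo);
            r += 3;
        } else {
            s[w++] = s[r++];
        }
    }
    s[w] = '\0';
    return w;
}
```

Applied to Desktop-%E5%BC%80%E5%8F%91, this yields the bytes of Desktop-开发.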

Checklist

  • Every commit is signed off (git commit -s) -- required, CI rejects
    unsigned commits (DCO, see CONTRIBUTING.md)
  • Tests pass locally (make -f Makefile.cbm test) -- 5689 passed, 0 failed
  • Lint passes (make -f Makefile.cbm lint-ci) -- clang-format clean on
    touched files
  • New behavior is covered by a test (reproduce-first for bug fixes)

Fixes #571

… name

Percent-encode non-ASCII bytes in cbm_project_name_from_path so distinct
CJK paths keep recoverable, collision-free slugs; accept '%' in
cbm_validate_project_name. Bound the slug length with a hash suffix so
deep multibyte paths stay within the OS filename-component limit.

Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>


Development

Successfully merging this pull request may close these issues.

Project name strips non-ASCII (CJK) characters from path, resulting in truncated/unrecognizable names

1 participant