fix: ROUGE-1 eval returns 0 for non-English languages (ASCII-only tokenizer) #6136

Open
tcconnally wants to merge 3 commits into google:main from Perseus-Computing-LLC:fix/non-english-eval-rouge
Conversation

@tcconnally

Problem

When evaluating text in non-Latin scripts (Thai, Chinese, Japanese, Arabic, etc.), the v1 ROUGE-1 evaluator returns scores of 0.0 even when the response matches the expected output exactly.

Root cause: The rouge_score library's default tokenizer keeps only ASCII alphanumeric tokens (after lowercasing, every character outside [a-z0-9] is treated as a separator). Non-Latin characters therefore produce zero tokens → ROUGE-1 score of 0.0 regardless of correctness.
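
The failure mode can be reproduced with the standard library alone. The sketch below is a stand-in, not the rouge_score source: `ascii_only_tokenize` and `rouge1_f1` are hypothetical helpers that keep only ASCII letters and digits and compute a unigram-overlap F1, which is enough to show how an exact match can still score 0.0.

```python
import re
from collections import Counter


def ascii_only_tokenize(text: str) -> list[str]:
    """Hypothetical stand-in for an ASCII-only tokenizer (not the library's code)."""
    # Lowercase, then keep only runs of ASCII letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(reference: str, candidate: str, tokenize) -> float:
    """Unigram-overlap F1, the quantity ROUGE-1 reports as its F-measure."""
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    # Multiset intersection counts each shared unigram at most min(ref, cand) times.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0  # also covers the case where either side has no tokens
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


thai = "สวัสดี"
print(ascii_only_tokenize(thai))                   # []
print(rouge1_f1(thai, thai, ascii_only_tokenize))  # 0.0
print(rouge1_f1("hello world", "hello world", ascii_only_tokenize))  # 1.0
```

With zero tokens on both sides there is no overlap to count, so the score collapses to 0.0 even though reference and response are identical.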

Reproduction (from #3111)

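from google.adk.agents import Agent

agent = Agent(
    name="thai_greeter",  # placeholder name, not part of the original report
    model="gemini-2.5-flash",
    instruction='Reply with only the word "สวัสดี"',
)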
# Agent responds "สวัสดี" → ROUGE-1 score: 0.0 (should be 1.0)

Fix

Added _unicode_tokenize function that:

  1. Uses re.UNICODE flag for ASCII-majority text (preserves existing behavior)
  2. Splits on Unicode whitespace/punctuation for non-ASCII text
  3. Falls back to character-level tokens for scripts without word boundaries (Chinese, Japanese)
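
The three steps above could look roughly like the following. This is a minimal sketch, not the code in this PR: the function name `unicode_tokenize_sketch`, the "more than half of the non-space characters are ASCII" threshold, and the Han/Hiragana/Katakana ranges used to detect scripts without word boundaries are all assumptions made for illustration.

```python
import re
import unicodedata


def _is_ascii_majority(text: str) -> bool:
    # Assumption: "ASCII-majority" = more than half of non-space chars are ASCII.
    chars = [c for c in text if not c.isspace()]
    return bool(chars) and sum(c.isascii() for c in chars) > len(chars) / 2


def _is_cjk(ch: str) -> bool:
    # Assumption: Han ideographs plus Hiragana/Katakana stand in for
    # "scripts without word boundaries".
    return "\u4e00" <= ch <= "\u9fff" or "\u3040" <= ch <= "\u30ff"


def unicode_tokenize_sketch(text: str) -> list[str]:
    text = text.lower()
    # Step 1: ASCII-majority text keeps word-regex tokenization.
    if _is_ascii_majority(text):
        return re.findall(r"\w+", text, flags=re.UNICODE)
    # Step 2: turn Unicode punctuation (categories P*) into spaces, then
    # split on whitespace.
    cleaned = "".join(
        " " if unicodedata.category(c).startswith("P") else c for c in text
    )
    tokens: list[str] = []
    for chunk in cleaned.split():
        if any(_is_cjk(c) for c in chunk):
            # Step 3: character-level tokens for unsegmented scripts.
            tokens.extend(chunk)
        else:
            tokens.append(chunk)
    return tokens


print(unicode_tokenize_sketch("Hello, world!"))  # ['hello', 'world']
print(unicode_tokenize_sketch("สวัสดี"))          # ['สวัสดี']
print(unicode_tokenize_sketch("你好,世界"))       # ['你', '好', '世', '界']
```

Character-level tokens make ROUGE-1 a character-overlap measure for Chinese and Japanese, which is coarser than word-level scoring but avoids pulling in a segmentation dependency.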

Closes #3111

@tcconnally tcconnally force-pushed the fix/non-english-eval-rouge branch from e275a87 to 6dff0a2 Compare June 15, 2026 21:22
@rohityan rohityan self-assigned this Jun 15, 2026
@wyf7107 wyf7107 self-assigned this Jun 16, 2026
@rohityan rohityan added the eval [Component] This issue is related to evaluation label Jun 17, 2026
@rohityan rohityan removed their assignment Jun 17, 2026
@rohityan

Collaborator

Hi @tcconnally, thank you for your contribution! We appreciate you taking the time to submit this pull request. Please fix the formatting errors.

@rohityan rohityan added the needs review [Status] The PR/issue is awaiting review from the maintainer label Jun 17, 2026
The default RougeScorer tokenizer keeps only ASCII alphanumeric tokens
([a-z0-9] after lowercasing). For non-Latin scripts (Thai, Chinese,
Japanese, etc.), this returns zero tokens, causing ROUGE scores of 0.0
even when the response matches the expected output exactly.

Added _unicode_tokenize function that uses re.UNICODE flag and falls
back to character-level tokenization for non-ASCII scripts.

Closes google#3111
- Replace function _unicode_tokenize with _UnicodeTokenizer class
  implementing the tokenize() method expected by RougeScorer
- Move import re to module level
- Fix double-escaped regex patterns (\\w -> \w, remove unsupported \p{P})
- Add return type annotation for tokenize() to satisfy mypy strict mode
- Fix RougeScorer constructor indentation
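
For readers unfamiliar with the hook this commit relies on: `RougeScorer` accepts a `tokenizer` argument and calls its `tokenize()` method on both the target and the prediction. The sketch below shows that shape under assumptions: `UnicodeTokenizerSketch` is a hypothetical, simplified stand-in for `_UnicodeTokenizer` (the PR's actual class body is not reproduced here), and the scorer demo only runs if rouge_score is installed.

```python
import unicodedata


class UnicodeTokenizerSketch:
    """Hypothetical stand-in for the PR's _UnicodeTokenizer class."""

    def tokenize(self, text: str) -> list[str]:
        # Replace Unicode punctuation (categories P*) with spaces, then
        # split on whitespace so non-ASCII words survive as tokens.
        text = text.lower()
        cleaned = "".join(
            " " if unicodedata.category(ch).startswith("P") else ch
            for ch in text
        )
        return cleaned.split()


tokenizer = UnicodeTokenizerSketch()
print(tokenizer.tokenize("สวัสดี"))  # ['สวัสดี']

try:
    from rouge_score import rouge_scorer
except ImportError:
    rouge_scorer = None  # library not installed; skip the scorer demo

if rouge_scorer is not None:
    # Any object with a tokenize(text) method can be passed as tokenizer.
    scorer = rouge_scorer.RougeScorer(["rouge1"], tokenizer=tokenizer)
    print(scorer.score("สวัสดี", "สวัสดี")["rouge1"].fmeasure)
```

Wrapping the logic in a class rather than a bare function matches what the scorer expects, since it invokes `tokenizer.tokenize(...)` rather than calling the object directly.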
@tcconnally tcconnally force-pushed the fix/non-english-eval-rouge branch from 9beec74 to 98396a4 Compare June 17, 2026 18:40
@tcconnally

Author

Fixed the pre-commit formatting issue (pyink). Rebased on main.

Labels

eval [Component] This issue is related to evaluation needs review [Status] The PR/issue is awaiting review from the maintainer

Development

Successfully merging this pull request may close these issues.

Eval fails for non-English languages

3 participants