Benchmark dataset and CLI tool for evaluating English-Chinese translation quality in academic economics and mathematics.
Three things that work together:
- A gold-standard test dataset of English-Chinese translation pairs (terms, sentences, paragraphs)
- A CLI tool (`qebench`) for contributing translations, judging model outputs, and running benchmarks
- A results website (GitHub Pages) showing leaderboards, model Elo ratings, and coverage progress
## Quick Start

```bash
# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and install
git clone https://github.com/QuantEcon/benchmark.translate-zh-cn.git
cd benchmark.translate-zh-cn
uv sync

# Check dataset stats
uv run qebench stats

# Start translating (the fun part)
uv run qebench translate --random
```

## Commands

| Command | Description |
|---|---|
| `qebench stats` | Show dataset coverage, Elo rankings, leaderboard |
| `qebench translate` | Translate & Compare mode (can you beat the AI?) |
| `qebench judge` | Judge mode (rate anonymous translations, build Elo) |
| `qebench add` | Add new test entries to the dataset |
| `qebench run` | Run benchmark against LLM models |
| `qebench export` | Export results for the website |
## Dataset

Translation pairs at three granularities, sourced from QuantEcon lectures:
| Level | Target | Current | Description |
|---|---|---|---|
| Terms | 500+ | 314 | Single terms with standard translations |
| Sentences | 100+ | 80 | One-sentence definitions or statements |
| Paragraphs | 30+ | 17 | Multi-sentence explanations (may include math/code/directives) |
Sentences and paragraphs are seeded from aligned English/Chinese lecture pairs
using `scripts/seed_from_lectures.py`. See the Seed Script Guide.
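To make the dataset shape concrete, here is a hypothetical term-level entry. The field names are illustrative assumptions only, not the repository's actual schema:

```python
# Hypothetical entry shape for illustration; the real schema is defined
# by the dataset files in this repo and may differ.
term_entry = {
    "level": "term",                    # one of: term, sentence, paragraph
    "en": "value function",             # English source text
    "zh": "价值函数",                    # gold-standard zh-cn translation
    "source": "QuantEcon lecture (dynamic programming)",  # provenance
}
```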
## Prompt Templates

Four prompt templates in `prompts/` for LLM benchmarking:
| Template | Description |
|---|---|
| `default` | General-purpose translation prompt |
| `academic` | Formal academic register emphasis |
| `action-basic` | MyST Markdown-aware rules (directive/math/code preservation) |
| `action-new` | MyST rules + glossary injection from action-translation |
Use `action-basic` and `action-new` to benchmark prompts that mirror
action-translation's production translation rules. See the Glossary & Prompt Templates Tutorial.
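As a rough sketch of how a template might be applied, here is one way to load a template and insert the source text. The `.txt` extension and the `{source}` placeholder are assumptions; check `prompts/` for the actual format:

```python
from pathlib import Path

def build_prompt(template_name: str, source_text: str) -> str:
    """Load a template from prompts/ and insert the text to translate.

    Assumes plain-text templates with a {source} placeholder; the real
    file format used by qebench may differ.
    """
    template = (Path("prompts") / f"{template_name}.txt").read_text(encoding="utf-8")
    return template.format(source=source_text)

# e.g. benchmark the academic register prompt
prompt = build_prompt("academic", "The Bellman equation characterizes the value function.")
```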
## Automated MyST Checks

`qebench judge` includes automated MyST formatting fidelity scoring. These
checks run on each translation pair and are displayed in the reveal panel:

- Directive balance: open/close pairs match between source and translation
- Fence consistency: no mixed `$$` / `` ```{math} `` markers
- Code block preservation: code blocks unchanged
- Full-width punctuation: zh-cn uses `,。!?` not `,.!?`
- Directive spacing: space between CJK characters and MyST directives
See Architecture: Scoring Module for implementation details.
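For a feel of what these checks do, here is a minimal sketch of two of them (directive balance and full-width punctuation). The function names and regexes are illustrative assumptions; the actual implementation lives in the scoring module:

```python
import re

# Opening line of a MyST directive, e.g. ```{math} or ```{exercise}
DIRECTIVE_OPEN = re.compile(r"^```\{[\w:-]+\}", re.MULTILINE)

# ASCII punctuation immediately after a CJK character suggests a
# half-width slip (zh-cn prose should use ,。!?).
HALFWIDTH_AFTER_CJK = re.compile(r"[\u4e00-\u9fff][,.!?]")

def directive_balance_ok(source: str, translation: str) -> bool:
    """Same number of directive openers in source and translation."""
    return len(DIRECTIVE_OPEN.findall(source)) == len(DIRECTIVE_OPEN.findall(translation))

def fullwidth_punctuation_ok(translation: str) -> bool:
    """No ASCII ,.!? directly following CJK text.

    Crude heuristic: does not exclude punctuation inside code spans.
    """
    return HALFWIDTH_AFTER_CJK.search(translation) is None

src = "Consider the system:\n```{math}\nAx = b\n```\n"
bad = "考虑该系统:\n$$\nAx = b\n$$\n"
print(directive_balance_ok(src, bad))             # False: {math} rewritten as $$
print(fullwidth_punctuation_ok("考虑该系统,其中"))  # False: ASCII comma after CJK
```

Real fence-consistency and code-preservation checks would need to tokenize fenced blocks first; the sketch above skips that step.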
## Development

```bash
# Install with dev dependencies
uv sync --extra dev

# Run tests
uv run pytest

# Lint
uv run ruff check src/ tests/
```

## Related Projects

- action-translation: the GitHub Action this benchmark evaluates
- QuantEcon lectures: source material for the dataset
- REVIEW.md: design review and gap analysis of both projects
## License

MIT