Benchmark dataset and CLI tool for evaluating English-Chinese translation quality in academic economics and mathematics.
Three things that work together:
- A gold-standard test dataset of English-Chinese translation pairs (terms, sentences, paragraphs)
- A CLI tool (`qebench`) for contributing translations, judging model outputs, and running benchmarks
- A results website (GitHub Pages) showing leaderboards, model Elo ratings, and coverage progress
## Quick Start

```bash
# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and install
git clone https://github.com/QuantEcon/benchmark.translate-zh-cn.git
cd benchmark.translate-zh-cn
uv sync

# Check dataset stats
uv run qebench stats

# Start translating (the fun part)
uv run qebench translate --random
```

## Commands

| Command | Description |
|---|---|
| `qebench stats` | Show dataset coverage, Elo rankings, leaderboard |
| `qebench translate` | Translate & Compare mode (can you beat the AI?) |
| `qebench judge` | Judge mode (rate anonymous translations, build Elo) |
| `qebench add` | Add new test entries to the dataset |
| `qebench run` | Run benchmark against LLM models |
| `qebench export` | Export results for the website |
## Dataset

Translation pairs at three granularities, sourced from QuantEcon lectures:
| Level | Target | Current | Description |
|---|---|---|---|
| Terms | 500+ | 314 | Single terms with standard translations |
| Sentences | 100+ | 80 | One-sentence definitions or statements |
| Paragraphs | 30+ | 17 | Multi-sentence explanations (may include math/code/directives) |
Sentences and paragraphs are seeded from aligned English/Chinese lecture pairs
using `scripts/seed_from_lectures.py`. See the Seed Script Guide.
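To make the dataset shape concrete, here is a hypothetical term-level entry. The field names are illustrative assumptions only, not the repository's actual schema:

```python
# Hypothetical entry shape for illustration; the real schema is defined
# by the dataset files in this repo and may differ.
term_entry = {
    "level": "term",                    # one of: term, sentence, paragraph
    "en": "value function",             # English source text
    "zh": "价值函数",                    # gold-standard zh-cn translation
    "source": "QuantEcon lecture (dynamic programming)",  # provenance
}
```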
## Prompt Templates

Four prompt templates in `prompts/` for LLM benchmarking:
| Template | Description |
|---|---|
| `default` | General-purpose translation prompt |
| `academic` | Formal academic register emphasis |
| `action-basic` | MyST Markdown-aware rules (directive/math/code preservation) |
| `action-new` | MyST rules + glossary injection from action-translation |
Use `action-basic` and `action-new` to benchmark prompts that mirror
action-translation's production translation rules. See the Glossary & Prompt Templates Tutorial.
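As a rough sketch of how a template might be applied, here is one way to load a template and insert the source text. The `.txt` extension and the `{source}` placeholder are assumptions; check `prompts/` for the actual format:

```python
from pathlib import Path

def build_prompt(template_name: str, source_text: str) -> str:
    """Load a template from prompts/ and insert the text to translate.

    Assumes plain-text templates with a {source} placeholder; the real
    file format used by qebench may differ.
    """
    template = (Path("prompts") / f"{template_name}.txt").read_text(encoding="utf-8")
    return template.format(source=source_text)

# e.g. benchmark the academic register prompt
prompt = build_prompt("academic", "The Bellman equation characterizes the value function.")
```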
## Automated MyST Checks

`qebench judge` includes automated MyST formatting fidelity scoring. These
checks run on each translation pair and are displayed in the reveal panel:

- Directive balance: open/close pairs match between source and translation
- Fence consistency: no mixed `$$` / `` ```{math} `` markers
- Code block preservation: code blocks unchanged
- Full-width punctuation: zh-cn uses `,。!?` not `,.!?`
- Directive spacing: space between CJK characters and MyST directives
See Architecture: Scoring Module for implementation details.
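For a feel of what these checks do, here is a minimal sketch of two of them (directive balance and full-width punctuation). The function names and regexes are illustrative assumptions; the actual implementation lives in the scoring module:

```python
import re

# Opening line of a MyST directive, e.g. ```{math} or ```{exercise}
DIRECTIVE_OPEN = re.compile(r"^```\{[\w:-]+\}", re.MULTILINE)

# ASCII punctuation immediately after a CJK character suggests a
# half-width slip (zh-cn prose should use ,。!?).
HALFWIDTH_AFTER_CJK = re.compile(r"[\u4e00-\u9fff][,.!?]")

def directive_balance_ok(source: str, translation: str) -> bool:
    """Same number of directive openers in source and translation."""
    return len(DIRECTIVE_OPEN.findall(source)) == len(DIRECTIVE_OPEN.findall(translation))

def fullwidth_punctuation_ok(translation: str) -> bool:
    """No ASCII ,.!? directly following CJK text.

    Crude heuristic: does not exclude punctuation inside code spans.
    """
    return HALFWIDTH_AFTER_CJK.search(translation) is None

src = "Consider the system:\n```{math}\nAx = b\n```\n"
bad = "考虑该系统:\n$$\nAx = b\n$$\n"
print(directive_balance_ok(src, bad))             # False: {math} rewritten as $$
print(fullwidth_punctuation_ok("考虑该系统,其中"))  # False: ASCII comma after CJK
```

Real fence-consistency and code-preservation checks would need to tokenize fenced blocks first; the sketch above skips that step.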
## Development

```bash
# Install with dev dependencies
uv sync --extra dev

# Run tests
uv run pytest

# Lint
uv run ruff check src/ tests/
```

## Related Projects

- action-translation: the GitHub Action this benchmark evaluates
- QuantEcon lectures: source material for the dataset
- REVIEW.md: design review and gap analysis of both projects
## License

MIT