This tutorial walks you through a judge session: comparing anonymous translations side by side, scoring them, and building Elo ratings that show which translation approaches work best.
- You've completed Getting Started
- Model outputs exist in results/model-outputs/. If they don't, run `qebench run` first (see Running LLM Benchmarks)
Human judgments are the gold standard for translation quality. qebench judge
pairs two translations of the same source text — from LLMs or the human
reference — and asks you to rate each on accuracy and fluency, then pick a
winner. Your judgments update Elo ratings that rank the models over time.
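Each judgment feeds a pairwise rating update. Here is a minimal sketch of a standard Elo update, assuming a K-factor of 32 and ties counted as half a win; qebench's actual parameters may differ.

```python
# Minimal sketch of a standard Elo update. The K-factor of 32 and the
# tie-as-half-win convention are assumptions, not qebench's documented values.
def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two 1500-rated models: one win moves the winner to 1516, the loser to 1484.
print(elo_update(1500.0, 1500.0, 0.0))  # -> (1484.0, 1516.0)
```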
Always start with the latest data:
uv run qebench update
uv run qebench stats

Run a quick 5-round session:

uv run qebench judge -n 5

You'll see a session header:
╭──────── Judge Session ────────╮
│ Rounds: 5      Domain: all    │
│ Models: 2      User: alice    │
╰───────────────────────────────╯
Each round shows the English source in a panel:
╭───── Judge (Round 1) TERM ─────╮
│                                │
│ inflation                      │
│                                │
│ term-001 · economics · basic   │
╰────────────────────────────────╯
Read the source carefully before looking at the translations.
Two translations appear side by side:
╭── Translation A ──╮   ╭── Translation B ──╮
│ 通货膨胀          │   │ 通胀              │
╰───────────────────╯   ╰───────────────────╯
The labels A and B are randomized — you don't know which model produced which translation until the reveal.
After comparing both translations, pick the overall winner:
Which is better overall?
  A is better
❯ B is better
  Tie — equally good
  Neither — both are poor
If both are equally good, pick Tie. If both are poor and neither is acceptable, pick Neither. Don't overthink it — go with your first instinct after reading both.
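For reference, one plausible mapping from this menu onto the score used in the Elo sketch above; how Neither affects ratings isn't covered in this tutorial, so that entry is a guess.

```python
# Hypothetical mapping from the winner menu to score_a in elo_update().
SCORE_FOR_CHOICE = {
    "A is better": 1.0,
    "B is better": 0.0,
    "Tie": 0.5,
    "Neither": None,  # assumption: no rating update when both are poor
}
```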
If you picked A or B as the winner, you'll be asked to rate each translation on two dimensions. (For Tie and Neither, scoring is skipped.)
Rate Translation A:
Accuracy (1-10): ▸ 9
Fluency (1-10): ▸ 8
Then rate Translation B the same way:
Rate Translation B:
Accuracy (1-10): ▸ 7
Fluency (1-10): ▸ 9
Accuracy measures how faithfully the translation captures the source meaning; fluency measures how natural and readable the Chinese is.
After picking, the result panel shows who won, automated scores, and formatting checks:
╭──────────────── Result ────────────────╮
│                 A (claude)   B (human) │
│ Winner                       B wins!   │
│ Elo             1520         1480      │
│ Ref. overlap    85%          100%      │
│ Glossary        90%          100%      │
│ Punctuation     92%          98%       │
│ Directives      ✓            ✓         │
╰────────────────────────────────────────╯
- Elo — model skill rating (higher = better track record)
- Ref. overlap — character similarity to the reference translation
- Glossary — percentage of key terms correctly translated
- Punctuation — fullwidth punctuation compliance (,。! vs ,.!)
- Directives — whether MyST directive open/close pairs are balanced
The formatting scores are computed automatically — you don't need to check
MyST syntax yourself. Over time, you'll learn to associate certain models
with formatting problems, which helps action-translation improve its prompts.
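To make these checks concrete, here is a rough sketch of three of them. The punctuation sets, the MyST fence regex, and the use of difflib for character overlap are assumptions about a plausible implementation, not qebench's actual code.

```python
import re
from difflib import SequenceMatcher

# Assumed punctuation sets; qebench's actual lists may be broader.
HALFWIDTH = ",.!?;:"
FULLWIDTH = ",。!?;:"

def punctuation_compliance(text: str) -> float:
    """Fraction of sentence punctuation that is fullwidth."""
    full = sum(text.count(c) for c in FULLWIDTH)
    half = sum(text.count(c) for c in HALFWIDTH)
    return full / (full + half) if (full + half) else 1.0

def directives_balanced(text: str) -> bool:
    """MyST colon-fence lines must come in open/close pairs (assumed check)."""
    fences = re.findall(r"^:{3,}", text, flags=re.MULTILINE)
    return len(fences) % 2 == 0

def ref_overlap(candidate: str, reference: str) -> float:
    """Character-level similarity to the reference translation."""
    return SequenceMatcher(None, candidate, reference).ratio()

print(punctuation_compliance("通货膨胀是指物价上涨。"))  # -> 1.0
```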
After all rounds, you'll see:
╭─── Session Summary ───╮
│ Rounds completed: 5/5 │
│ XP earned: +25        │
│ Total XP: 75          │
╰───────────────────────╯
Each judgment earns 5 XP.
Push your results to GitHub:
uv run qebench submit

Your judgments are saved in results/judgments/{your-username}.jsonl and Elo
ratings are updated in results/elo.json.
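If you want to inspect what was recorded, each line of the judgments file is one JSON object. A minimal reader follows; the username is a placeholder, and the record fields are whatever qebench writes.

```python
import json
from pathlib import Path

# Placeholder username; substitute your own.
path = Path("results/judgments/alice.jsonl")

with path.open(encoding="utf-8") as f:
    judgments = [json.loads(line) for line in f if line.strip()]

print(f"{len(judgments)} judgments recorded")
```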
Focus your judgments on a specific domain:
uv run qebench judge -n 10 -d economics

This is useful when you have domain expertise — your ratings will be more precise for terms you know well.
The judge system pairs translations intelligently (a code sketch follows this list):
- 2+ models translated the same entry → two models are paired
- 1 model translated an entry → model is paired against the human reference
- 0 models → entry is skipped (nothing to compare)
- Identical pairs → automatically skipped (nothing to judge)
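Put together, the selection for one entry might look like this sketch; the function shape and names are illustrative, not qebench's internals.

```python
import random

def pick_pair(model_outputs: dict[str, str], reference: str):
    """Choose what to compare for one entry, per the rules above.

    model_outputs maps model name -> translation. Returns a blinded
    ((name, text), (name, text)) pair, or None to skip the entry.
    """
    candidates = list(model_outputs.items())
    if len(candidates) >= 2:
        pair = random.sample(candidates, 2)           # two models compete
    elif len(candidates) == 1:
        pair = [candidates[0], ("human", reference)]  # model vs. reference
    else:
        return None                                   # nothing to compare
    if pair[0][1] == pair[1][1]:
        return None                                   # identical pair: skip
    random.shuffle(pair)                              # randomize the A/B labels
    return pair[0], pair[1]
```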
You can exit a session early at any prompt by pressing Ctrl+C — completed rounds are saved.
- Translate more entries: See Your First Translation Session to collect more data
- Run more models: See Running LLM Benchmarks to generate model outputs
- Compare prompts: See Glossary & Prompt Templates to test action-translation prompts
- Seed more data: See Seeding from Lectures to extract sentence/paragraph pairs (developer guide)
- Check the leaderboard: `qebench stats` shows the XP leaderboard and dataset coverage